24 Jan 2019

CHOCoLaTe CHIP BITeS: Bias of teaching evaluations

Submitted by Adam R. Johnson, Harvey Mudd College

I have been doing a lot of research on the efficacy of Student Evaluations of Teaching (SETs). This is part of a personal mission to try to convince the leadership of my institution to stop using SETs for tenure or salary decisions for faculty members. The short version is that SETs are biased against women and minorities. I have been trying to effect change at my institution by beginning conversations and discussions around this topic. I have learned a lot and wanted to share it with the community. I am not going to do a complete literature review here, but here are some useful starting points for your own research.


There is evidence of gender and racial bias in student evaluations.(1)


Teaching evaluations can, in fact, reward bad teaching.(2)


A meta-analysis of 158 journal articles and two book chapters does not support the validity of SETs. It finds that the research on SETs fails to answer crucial validity questions.(3)


Stanford University has taken a careful look at SETs and has demonstrated that they are biased and largely uninformative. The office of the Vice Provost for Teaching and Learning has some additional information about their decision to change the way they collect and use SETs.


The University of Southern California has stopped using SETs for promotion decisions in favor of a peer-review model.


One study's conclusion is even more direct: SETs are biased against women, which means that they are illegal.(4) Ryerson University reached an arbitration agreement with its Faculty Association which ensures that SET results "are not used in measuring teaching effectiveness for promotion or tenure."


So why do we all continue to use SETs in important decision-making processes for faculty? One possibility might be that we all think that while other students may be biased, our own, of course, are not. Certainly, my students are fair at Harvey Mudd College--we have an honor code so they couldn't possibly be biased. Of course, implicit bias is deeply ingrained and often unidentifiable by the person exhibiting that bias, and it may even be unintentional.


Then this past December I found an article published in 2018 in the journal Medical Education about using chocolate chip cookies to bias student evaluations of teaching.(5) Their conclusions are quite direct: "[Chocolate cookies] had a significant effect on course evaluation. These findings question the validity of SETs and their use in making widespread decisions within a faculty."


I thought I would test whether the published finding generalized beyond that set of students and institution. I carried out an experiment to try to bias my own students' perception of my teaching by simply providing chocolate chip cookies on evaluation day at the end of Fall 2018. This would allow me to show explicitly whether it was possible for me to bias my own students' evaluations of teaching by this method. So I purchased two 18 oz packages of Trader Joe's chocolate chip cookies and handed them out at my 10 am class but not my 9 am class.


Thanks to my office of institutional research, I was able to gather my SET data over the past three years of my teaching first-year chemistry. As it happens, there is a little bit of a bias in my teaching evaluations based on whether I teach the "early" or "late" version of the class (9 am/10 am or 8 am/9 am) on several of the SET metrics. Unfortunately, I was not able to replicate the results of the cookie study; I found no significant differences between my 9 am and 10 am SET values for Fall 2018 that could be attributed to the presence of cookies. Importantly, my sample size was only about 65% of the size of the published study, and I only have one semester's worth of data. I am curious to try this experiment again to try to replicate the published study at Harvey Mudd College.
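
For readers who want to run a similar comparison on their own evaluations, one possible approach is a permutation test on the difference in mean ratings between two class sections. This is only a minimal sketch and not necessarily the analysis used for the result above; the function name `permutation_test`, the 1-7 rating scale, and the ratings themselves are hypothetical placeholders, not real SET data.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in mean ratings.

    Returns (observed difference mean(b) - mean(a), estimated p-value).
    """
    rng = random.Random(seed)
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled ratings to the two sections.
        rng.shuffle(pooled)
        diff = mean(pooled[n_a:]) - mean(pooled[:n_a])
        # Small tolerance guards against floating-point rounding on ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Hypothetical placeholder ratings (assumed 1-7 scale); NOT real SET data.
no_cookies_9am = [6, 5, 7, 6, 5, 6, 7, 5, 6, 6]
cookies_10am = [6, 6, 7, 5, 6, 7, 6, 5, 7, 6]

diff, p = permutation_test(no_cookies_9am, cookies_10am)
print(f"difference in means: {diff:+.2f}, p = {p:.3f}")
```

A permutation test is used in this sketch because it makes few assumptions about how ratings are distributed. With a single semester and two sections, any such test has limited power, and it cannot separate a cookie effect from the existing early/late section difference.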


In any event, this result doesn't change my desire to change the use of SETs at my institution. I think they are valuable and informative for faculty in the context of their own courses, but I do not think they should be used for formal, external evaluation for salary, tenure, or promotion cases. I hope that this blog post opens a few eyes and that you can bring this discussion to your campus.


(1) Boring, A. (2017). Gender biases in student evaluations of teaching. Journal of Public Economics, 145, 27-41.

(2) Stroebe, W. (2016). Why good teaching evaluations may reward bad teaching: On grade inflation and other unintended consequences of student evaluations. Perspectives on Psychological Science, 11(6), 800-816.

(3) Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83(4), 598-642.

(4) Mitchell, K.M.W. & Martin, J. (2018). Gender Bias in Student Evaluations. PS: Political Science and Politics, 51(3), 648-652.

(5) Hessler, M., Pöpping, D.M., Hollstein, H., Ohlenburg, H., Arnemann, P.H., Massoth, C., Seidel, L.M., Zarbock, A. & Wenk, M. (2018). Availability of cookies during an academic course session affects evaluation of teaching. Medical Education, 52, 1064-1072.


Thanks for pulling all this data together in one place, Adam!  Here is a link to one of my favorite sites to look at, too. It's the research about which words appear in teaching evals for men and women across disciplines. Lots to think about!


Thanks for the post, Adam. I completely agree with the assessment that SETs can be (and often are) biased. I also agree with the notion that students are not usually well positioned to be good evaluators of teaching. From a pessimistic perspective, one could argue that the reason students are routinely asked to evaluate the teaching is that they are the customers. We can argue about how that set of relationships (students are accountable to professors, professors are accountable to administration, administration is beholden to students/families paying the bills) may skew incentives in education, but it does seem to be a fact of life.

My question is whether there are ways to administer evaluations that can mitigate bias (at least to some extent) and give more useful information about student learning. My expertise in this area is limited, but it does seem that practices such as changing the timing of evals or priming students to think about their learning (and learning gains) and their work in the class, rather than about the professor personally, might help. There must be some research on this, right?

I guess my point is that there are all sorts of sources of information that routinely exhibit bias (e.g., college, grad school, or job recommendation letters), yet we still use them because they provide useful information that is otherwise difficult to get and we believe that if we understand the bias well then we can set up systems to mitigate its effects.  Could SETs fall into that category?  I'm not sure I'm convinced either way yet.