Tuesday, August 11, 2009

In Defense of RateMyProfessor.com

By Jonathan Leirer

After the release of the Forbes/CCAP rankings last week, one issue regarding the methodology has been consistently raised: the validity of using RateMyProfessor.com data to compile the rankings. This post will address and hopefully resolve some of these issues.

First, a little abstract statistical theory. One of the primary objections to using RateMyProfessor.com is that only students who are extremely (dis)satisfied bother to use the site. Let’s assume that if all students evaluated a given professor, their votes would be normally distributed with a mean of 3.5 and a standard deviation of 0.5. (For simplicity, let’s assume that there is a latent continuous rating that maps to the actual rating set {1,2,3,4,5}.) Now, let’s assume that a student only votes on RateMyProfessor.com (RMP) if their personal rating of the professor is more than 2 standard deviations away from the mean; that is, only students who rate the professor lower than 3.5-(2*0.5) = 2.5 or higher than 3.5+(2*0.5) = 4.5 vote. What would the average rating be for this professor? Well, the expected value of the rating among those who vote is equal to the initial mean, 3.5. Click here for a formal proof. Thus, given a sufficiently large sample, we would expect to see the same average rating as we would if all students participated. (Note that these results hold for all distributions with a finite mean and variance, provided that a) the distribution is symmetric about the mean and b) the censoring decision is also symmetric about the mean.) So, even if only disgruntled or exceptionally satisfied students rate their professors, the average rating is the same in expectation as it would be with full participation.
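The symmetric-censoring argument above can be checked with a quick simulation. This is only a sketch under the post's stated assumptions (latent ratings drawn from a normal distribution with mean 3.5 and standard deviation 0.5); the function name, sample size, and seed are illustrative choices, not part of the original analysis. The second call uses an asymmetric cutoff to show why condition b) matters.

```python
import random
import statistics


def simulate_voter_mean(mu=3.5, sigma=0.5, k_low=2.0, k_high=2.0,
                        n_students=200_000, seed=0):
    """Simulate latent ratings and keep only the students who 'vote'.

    A student votes if their latent rating is below mu - k_low*sigma
    or above mu + k_high*sigma. Returns (mean of all students,
    mean of voters, number of voters).
    """
    rng = random.Random(seed)
    latent = [rng.gauss(mu, sigma) for _ in range(n_students)]
    voters = [r for r in latent
              if r < mu - k_low * sigma or r > mu + k_high * sigma]
    return statistics.fmean(latent), statistics.fmean(voters), len(voters)


# Symmetric censoring: only ratings more than 2 SDs from the mean are reported.
full_mean, sym_voter_mean, n_sym = simulate_voter_mean()

# Asymmetric censoring (hypothetical): happy students vote more readily.
_, asym_voter_mean, n_asym = simulate_voter_mean(k_low=2.0, k_high=1.0)

print(f"all students:      {full_mean:.3f}")
print(f"symmetric voters:  {sym_voter_mean:.3f} (n={n_sym})")
print(f"asymmetric voters: {asym_voter_mean:.3f} (n={n_asym})")
```

With symmetric cutoffs the voters' average should land close to 3.5, while the asymmetric case drifts upward, which is the situation the symmetry conditions rule out.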

Now, let’s look at the actual data for a bit. One common criticism of the RMP variable is that easy professors get higher overall scores. Let’s check the validity of such a statement. As the graph shows, there is actually a positive relationship between the rigor score and the overall score at the institutional level, suggesting that schools which have generally more difficult instruction also have generally better (more helpful and clearer) instruction. However, if you recall from our methodology, we were concerned that some schools with small sample sizes could have inaccurate results. To address this issue we took a Bayesian approach and obtained updated overall and rigor scores that account for sample size. However, the sample sizes were sufficiently large that no score changed by more than 1/5 of a standard deviation.
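The post does not spell out which Bayesian model was used, so the following is only a generic sketch of how a sample-size adjustment of this kind can work: a normal-normal shrinkage estimator that pulls each school's mean toward a prior mean, with less pull as the number of ratings grows. The function name, the prior mean of 3.5, and the prior strength of 20 are all hypothetical values chosen for illustration.

```python
def shrink_toward_prior(school_mean, n_ratings, prior_mean, prior_strength):
    """Posterior mean under a simple normal-normal model.

    prior_strength acts like a number of 'pseudo-ratings' at prior_mean
    (in the normal-normal model it is the within-school variance divided
    by the between-school variance). Both are assumptions here.
    """
    weight = n_ratings / (n_ratings + prior_strength)
    return weight * school_mean + (1 - weight) * prior_mean


# Hypothetical example: two schools with the same raw mean of 4.2.
small_school = shrink_toward_prior(4.2, n_ratings=10,
                                   prior_mean=3.5, prior_strength=20)
large_school = shrink_toward_prior(4.2, n_ratings=2000,
                                   prior_mean=3.5, prior_strength=20)

print(f"small school (10 ratings):   {small_school:.3f}")
print(f"large school (2000 ratings): {large_school:.3f}")
```

Under these assumed numbers the small school's score moves noticeably toward the prior while the large school's score barely changes, which matches the observation that large samples leave the scores nearly untouched.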

Still, some are worried about low response rates biasing the results. First, a low number of responses does not bias results, provided that the responses are still random and uncorrelated with systematic tendencies to rate. They may be less reliable (that is, have higher standard errors), but the Bayes approach helps correct for the lack of information in low-response schools. Still, it may be worth examining what relationships, if any, exist between the number of responses and the overall and rigor scores, and between the response rate (number of responses in proportion to the size of the school) and the overall and rigor scores. As figures 2 and 3 show, there is no strong relationship between the number of responses or response rate and the final Bayes score, although there is an apparent convergence of scores as the response rate (and number of responses) increases. This is potentially troubling and is something that will be taken into account in the future, but it is likely an artifact of less variability in larger schools (there is a strong positive correlation between the number of ratings and the response rate).

In addition, we would like to call attention to our response last year, found here, which discusses the strengths and weaknesses of using RMP and cites recent publications on this very issue. We hope this brief analysis helps answer some of the questions readers may have about the usefulness and appropriateness of the RMP variable.


capeman said...

Wait a minute. The graph plots professor rating (y axis) vs. rigor (x axis). But at ratemyprofessor they rate on EASINESS, not RIGOR. i.e. a 5 (out of 5 possible) is the EASIEST. I've checked this out on some of my colleagues; the students definitely get it -- a 1 means the prof is tough as nails.

So, I have to wonder -- in making this graph, was EASINESS converted to RIGOR, say by inverting the numbers, i.e. a 1 becomes a 5?

Or is easiness CONFUSED with rigor?

My impression is that at ratemyprofessor, the profs with highest EASINESS factor get the best rating.

The claim in this blog is at best counterintuitive, at worst, just ass-backwards.

Jonathan L. said...

Yes, we generated a new variable. rigor, as defined as 5-easiness.

This is clearly described in the methodology.

capeman said...

I don't know where your methodology is -- a link maybe?

But you say

"we generated a new variable. rigor, as defined as 5-easiness."

What does this mean? How did you define rigor? It's impossible to tell from what you just wrote.

Jonathan L. said...

On ratemyprofessor.com, one metric used to rate professors is the easiness metric. To generate the rigor variable, we took the easiness score and subtracted it from 5; that is, 5-easiness.

The methodology I was referring to was the methodology for the Forbes/CCAP college rankings, found Here

Anonymous said...

This is ridiculous. In order to make the assumptions that you make in your "proofs," you have to assume that (a) the respondents are *representative* of the students as a whole, and (b) that the data are normally distributed. You cannot assume either of those things. Pick up a basic econometrics textbook for college freshmen, look at the assumptions, and you will see that - Bayesian approaches aside - you cannot credibly make this argument.

I am shocked that a shop of economists would put out such drivel.

Anonymous said...

While you're at it, Google "common source bias" and think about your "rigor" and "quality" correlation.

Jonathan L. said...

In the formal proof (provided in the link) the only assumptions that are needed are that the distribution has a mean and a variance, and that the distribution is symmetric about its mean. It is actually explicitly stated that the respondents are not "representative" (we assume a student will only respond if that student really liked or really hated the prof.) If they were "representative" (i.e. if there was no systematic reason why their evaluation of the prof would affect whether or not they rated them on RMP) then there would be no need for a proof. The proof inherently assumes that the raters are biased and thus not representative, and it does not rely on normality (although the normal distribution is symmetric and thus covered by the proof, as is the uniform distribution, among others).

These assumptions are clearly stated in the text of the post as well as in the accompanying formal proof.

Anonymous said...

The rankings state they are to help undergraduate students.

However, RMP allows graduate students to post as well. I looked at the best professor (as voted on by the graduating class) and he had okay (mid 3) rankings. All the undergraduate evaluations were fantastic; he did have a few (1s) evals from a graduate course.

So, do you filter RMP so that only undergraduate evaluations are counted? Otherwise, you are allowing graduate students to affect a ranking designed for undergraduates.

Scott Merryfield said...


So if easiness is a 5, then Rigor is a 0? I thought that the results needed to maintain the range of 1 - 5. You should have used:

6 - Easiness to maintain that range based on the results.
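The two conversions discussed in this thread, 5 - easiness and 6 - easiness, can be compared in a short sketch. The easiness and overall values below are made up purely for illustration and are not RMP data; the `pearson` helper is likewise just a plain implementation of the usual correlation formula. The point is that the two definitions differ only by a constant shift, so the range of the rigor variable changes but its correlation with the overall score does not.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative scores (NOT real data).
easiness = [1.0, 2.0, 3.0, 4.0, 5.0]
overall = [2.9, 3.4, 3.1, 3.8, 4.0]

rigor_5 = [5 - e for e in easiness]  # ranges over 0..4
rigor_6 = [6 - e for e in easiness]  # ranges over 1..5

print("5 - easiness range:", min(rigor_5), "to", max(rigor_5))
print("6 - easiness range:", min(rigor_6), "to", max(rigor_6))
print("corr(rigor_5, overall):", round(pearson(rigor_5, overall), 6))
print("corr(rigor_6, overall):", round(pearson(rigor_6, overall), 6))
```

Either definition therefore gives the same picture in a rigor-versus-overall scatter plot; only the axis labels shift by one.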