A Primer on the Metaculus Scoring Rule
By Anthony Aguirre, Co-founder at Metaculus
On Metaculus, thousands of forecasters have submitted hundreds of thousands of forecasts on thousands of questions over the last six years. We pride ourselves on keeping score and transparently reporting the accuracy of every forecast made on our platform.
Given some discussion of it in the community, we thought it would be useful to provide a primer on how the Scoring Rule on Metaculus works, and its current benefits and shortcomings. We have some ideas for future changes, but we’ll get to those in a later post.
First, let’s explore how the Scoring Rule works today.
How the Metaculus Scoring Rule Works
When forecasters perform well on Metaculus, they are rewarded with points, which accrue over time and are represented as each forecaster’s overall Metaculus Score.
For both binary and numerical questions, the Score is composed of two kinds of points, which we call absolute and relative points. The absolute points A are points you get for being correct, i.e. predicting a high probability for the actual outcome; the relative points R are points you get for being more correct than the rest of the community. The total Score for a forecasting question is a weighted sum of A and R such that both increase with the number of predictions N on a question, with weight shifting from primarily A for small N to primarily R for large N. Finally, all of this is averaged over time (with some extra considerations for questions that resolve unexpectedly early).
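As a rough sketch of the shape of this combination (the weight function and the constant in it below are hypothetical illustrations, not the actual Metaculus formulas, which are in the linked documentation):

```python
def combined_score(absolute_pts: float, relative_pts: float, n_predictions: int) -> float:
    """Illustrative sketch (not the actual Metaculus formula): blend
    absolute and relative points, shifting weight toward the relative
    component as the number of predictions N grows."""
    # Hypothetical weight: near 0 for small N, approaching 1 for large N.
    w = n_predictions / (n_predictions + 30.0)
    return (1.0 - w) * absolute_pts + w * relative_pts

# With few predictions, absolute points dominate; with many, relative points do.
few = combined_score(absolute_pts=50.0, relative_pts=20.0, n_predictions=5)
many = combined_score(absolute_pts=50.0, relative_pts=20.0, n_predictions=500)
```

Here `few` lands near the absolute points (about 45.7) and `many` near the relative points (about 21.7), capturing the small-N/large-N shift described above.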
If you’re interested in the math, you can read all about the details here.
Now you may ask, why this Scoring Rule? It seems pretty complicated! In practice, one does not need to understand the Scoring Rule in all its detail: for a given question, the important thing to know is that in expectation, you’ll do best when predicting your true belief of the probability, and updating your prediction whenever that changes.
But to get under the hood a little bit, let’s look at some of the considerations behind how we constructed this particular framework.
Our Three Criteria for a Good Scoring Rule
Our first criterion is that our Scoring Rule be “proper”, which means that in expectation, i.e. over many predictions on questions with given “true” probabilities, the strategy that accrues the most points is to correctly assign those given probabilities. Having a proper scoring rule is crucial for reasons that should be obvious, but we’ll spell them out: a Scoring Rule that is not proper would incentivize predictions that don’t match forecasters’ best estimates of the actual probabilities involved, thus compromising the trustworthiness of the forecasts produced. Both A and R components of the Metaculus Scoring Rule meet this criterion.
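To see propriety concretely, here is a small numerical check using the log score, a standard proper scoring rule (chosen for illustration; it is not the exact Metaculus formula): if the true probability of YES is 0.7, the reported probability that maximizes your expected score is exactly 0.7.

```python
import math

def expected_log_score(true_p: float, reported_q: float) -> float:
    """Expected log score when the true probability of YES is true_p
    and the forecaster reports reported_q."""
    return true_p * math.log(reported_q) + (1 - true_p) * math.log(1 - reported_q)

true_p = 0.7
# Search reported probabilities 0.01 .. 0.99 for the best expected score.
best_q = max((q / 100 for q in range(1, 100)),
             key=lambda q: expected_log_score(true_p, q))
# best_q == 0.7: honest reporting maximizes the expected score.
```

Under an improper rule, `best_q` would differ from `true_p`, and forecasters would be incentivized to misreport.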
Our second criterion is that we want to reward foresight in addition to accuracy, meaning that, all else being equal, we want to reward accurate forecasts that are made sooner and updated whenever new information becomes available. This is why we introduce time averaging into the Scoring Rule. As should also be obvious, no points are awarded during the time before any prediction is made.
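A minimal sketch of what time averaging does, using a hypothetical flat per-day score (the real formula, including edge cases, is in the linked documentation): days before your first prediction contribute zero, so an equally accurate forecast made earlier earns more.

```python
def time_averaged_score(scores_by_day: dict, question_lifetime_days: int) -> float:
    """Illustrative sketch: a forecaster's per-day score, averaged over the
    question's full lifetime. Days with no prediction score zero, so
    accurate forecasts made earlier earn more."""
    total = sum(scores_by_day.get(day, 0.0) for day in range(question_lifetime_days))
    return total / question_lifetime_days

# Two forecasters each scoring 10 points/day once active, on a 100-day question:
early = time_averaged_score({d: 10.0 for d in range(0, 100)}, 100)   # predicted from day 0
late = time_averaged_score({d: 10.0 for d in range(80, 100)}, 100)   # predicted from day 80
```

The early forecaster averages 10.0 points/day; the late one only 2.0, despite identical accuracy while active.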
Our third criterion is that the Scoring Rule should strike the right balance between breadth and depth, so to speak. We want to encourage broad participation by our community forecasters, and we also want to incentivize forecasters to focus deeply on questions where they can apply a lot of time or expertise. Our positive-sum Scoring Rule (in which, for a given question, the total points awarded across users is generally positive) encourages breadth by not penalizing forecasters for participating in more questions, but also encourages depth by giving many more points to those who have the greatest insight and foresight. Clearly, we could have chosen a Scoring Rule that was zero-sum or even negative-sum. Let’s examine this tradeoff and related issues in more detail.
Positive Sum, Zero Sum, and Negative Sum Tradeoffs
To begin, let us acknowledge that a Scoring Rule serves two related but distinct purposes. First, it provides a clear metric for individual predictors who wish to measure and improve their forecasting accuracy. Second, it serves to incentivize the behavior of the entire group of forecasters toward the larger Metaculus goal of producing accurate forecasts across a large number and variety of questions.
There is no single route to achieving either one of these purposes, and optimizing for either one produces some unavoidable tradeoffs. Let’s discuss the second purpose first: what are the behaviors we want to incentivize at the community level? Foremost, it’s important that forecasters predict their true beliefs about probabilities, rather than, say, extremizing them, which would distort the information produced.
Over many questions, this property is supported by the Scoring Rule being proper. If there were a fixed corpus of questions and just one prediction on each, any proper Scoring Rule would pretty much do the job. However, the corpus of questions on Metaculus as a whole is not fixed, and the time and energy of any given forecaster is finite, so users must decide which questions are worth their effort to predict on and keep updated.
This is quite tricky, as there are strong tradeoffs. More time on fewer questions is likely to lead to more accuracy, but fewer predictions overall, and almost no predictions on what are perceived as “hard” questions. (For example, if the scoring rule were just the Brier Score, it’s easy to see that for a great average score forecasters should just predict on questions to which they attribute either very low or very high probability. Therefore questions that are non-obvious, and therefore more likely to be interesting or important, would get neglected.)
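The Brier-score failure mode in the parenthetical above is easy to demonstrate numerically (the question mix below is a made-up illustration):

```python
def brier(p_yes: float, outcome: int) -> float:
    """Brier score for a binary forecast: squared error, lower is better."""
    return (p_yes - outcome) ** 2

# Strategy A: honestly predict only on "easy" questions (true probability ~0.95).
# Strategy B: also honestly predict on "hard" ~50/50 questions.
easy = [brier(0.95, 1)] * 95 + [brier(0.95, 0)] * 5   # 100 easy questions
hard = [brier(0.5, 1)] * 50 + [brier(0.5, 0)] * 50    # 100 hard questions

avg_easy_only = sum(easy) / len(easy)          # 0.0475
avg_with_hard = sum(easy + hard) / 200         # 0.14875
```

Both strategies are perfectly honest, yet the forecaster who skips hard questions looks roughly three times better by average Brier score, which is exactly the neglect-of-hard-questions incentive described above.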
In addition, there’s a question of whether the score as a whole is positive-, zero-, or negative-sum. With a negative-sum score there is a strong disincentive to predict, and the best strategy would be to predict only on questions where you can outperform everyone.
We have chosen to balance these tradeoffs by setting both the A (absolute) and R (relative) components to be mildly net-positive. The motivation for a net-positive score is, of course, to encourage participation and to overcome the psychological barrier to predicting at all: without it there is a rational motivation against participating in most questions, on top of loss aversion.
The motive for the separate A and R components is that if there are questions with very few predictions so that a “community” prediction is poorly-determined, we want people to make predictions on them, and be rewarded for being correct. This is supported by the A points, which dominate when N is small. For questions with lots of predictions, more predictions are primarily useful if they provide a reason to alter the standing prediction; this is supported by making the R points relatively more important at large N.
In short, the scoring has been set to incentivize the behavior that serves our goal of producing many high quality predictions. These incentives aren’t perfect, and as discussed below there may be behaviors consistent with them that are not useful for the platform. We’re actively considering some tweaks to the scoring method to mitigate some of those.
Evaluating Forecaster Skill
Now, let’s turn to evaluating forecaster skill: how we can answer the question of “how good” a predictor is. There’s good news and there’s bad news. The bad news is that there just isn’t a single number that will tell you that, just as there isn’t a single number that can summarize how good a baseball player is. The good news is that prediction quality can be measured, as a combination of calibration and precision, over a specific corpus of questions.
The way you measure performance depends, of course, on what you care about. Perhaps unintuitively, for any given score there is some way to do quite well by one metric while simultaneously tanking by other measures. “Good predictions” are really defined by the applications for which they are used. For example, when forecasts are used to make decisions, in some cases excellent calibration may be more important than precision; in other cases it’s exactly the opposite. In some cases it may be really important to get the tails of the distribution right, and in others it may matter very little.
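As an illustrative sketch of the calibration side of this (the bucketing scheme here is a hypothetical choice for illustration, not a Metaculus metric):

```python
from collections import defaultdict

def calibration_table(forecasts):
    """Bucket (probability, outcome) pairs by predicted probability and
    compare each bucket's mean prediction to the observed YES frequency.
    Well-calibrated forecasts match closely bucket by bucket; precision is
    the separate question of how close predictions sit to 0 or 1."""
    buckets = defaultdict(list)
    for p, outcome in forecasts:
        buckets[round(p, 1)].append((p, outcome))
    table = {}
    for b, items in sorted(buckets.items()):
        mean_p = sum(p for p, _ in items) / len(items)
        freq = sum(o for _, o in items) / len(items)
        table[b] = (mean_p, freq)
    return table

# A perfectly calibrated but maximally imprecise forecaster: always says 0.5.
data = [(0.5, 1)] * 50 + [(0.5, 0)] * 50
# calibration_table(data) -> {0.5: (0.5, 0.5)}: calibrated, zero precision.
```

This is why a single number can mislead: the always-0.5 forecaster is flawless on calibration but useless for any application that needs confident predictions.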
The Metaculus Prediction, for example, uses a relative score that is fairly distinct from the Metaculus Score in determining how much weight to give different predictors’ predictions. This score was chosen in part because it makes the Metaculus Prediction itself accurate (by a given set of metrics). Our philosophy here, both for individuals and for the platform as a whole, is to provide the Metaculus Score as one somewhat useful way to assess quality, supplemented by the full track record with lots of ways to slice and dice the success data. Something the platform currently lacks is additional good ways for users to compare themselves to other users, e.g. leaderboards based on metrics other than the Metaculus Score. More to come on this front.
Some Final Questions & Answers
It’s also worth addressing some questions that have come up regarding Metaculus scoring and how it compares to other methods.
Q: Aren’t there binary questions on Metaculus for which you can get positive points however it turns out?
A: Yes. The Scoring Rule is designed to be positive-sum, so these exist and are at some level unavoidable. But we don’t regard them as problematic. It would be a problem if one could efficiently extract large numbers of points by putting in predictions that don’t match what you actually believe. In this case, however, you get more points by doing the right thing (predicting what you think is the correct probability), and, as discussed below, what points you can extract at low effort by doing the “wrong” thing are modest and also pretty unproblematic.
Q: Won’t you get a lot of points for just predicting the community predictions on lots of questions?
A: Yes, a fair number. This is unavoidable because we’ve chosen to make Metaculus positive-sum. It’s not the best way to get lots of points, especially given that the “relative” points become more important as the number of predictions grows: on a fixed corpus of questions you won’t compete with the best predictors, and there are metrics on which you won’t do particularly well. Importantly, people piling on to the community prediction does not harm it (or affect the Metaculus Prediction, which relies on relative scoring), except for making it slightly “stickier.” This is part of why we are preparing to make some scoring weight adjustments (i.e. moving more weight to the relative component of the score during the time that the Community Prediction is visible) so that this behavior becomes even less rewarding than it is now.
Q: Wouldn’t it be simpler just to use a prediction market approach? After all, markets are tried and true systems that resist attempts to “extract” value for free from them.
A: Prediction markets have some advantages, and some disadvantages as well.
They are by default zero-sum, or negative-sum given transaction costs. So unlike most financial markets, there is an active disincentive to participate unless you strongly believe you have privileged information or analysis methods. Prediction markets can be subsidized, making them positive-sum. This boosts the participation incentive, but then participation is limited by total investment capital. And to make matters worse, a fair amount of trading is required just to find the standing price on a contract, even once the participants all have a probability in mind. So in all, there is a very large challenge to achieving high enough liquidity, and this has been a failing of many prediction markets in the past.
A distinct dynamic that (real money) prediction markets have, and which is a double-edged sword, is that they are useful for financially hedging against real-world events. This may be valuable enough to some participants to overcome the general negative-sum nature of the market, helping to provide liquidity. However, there is no good reason to believe that a market strongly affected by this behavior will converge on a good prediction of the event probability. Moreover, prediction markets are amenable to manipulation by agents for whom that manipulation is worth the cost of buying at inaccurate probabilities; and this can be particularly “cheap” when the market is at near 0% or near 100% probability.
So, extracting accurate probabilities from prediction markets is not always possible (as has been extensively written about), and there’s no obvious or simple way to extract something like a probability distribution over a number or date, which is straightforward on Metaculus.