Aligning Incentives for Forecast Accuracy, Relevance, and Efficacy: A New Paradigm for Metaculus Tournaments

10 min read · Apr 20, 2021

Introducing Fortified Essays & Incentive-Compatible Kelly Strategy for Metaculus Tournaments

by Gaia Dempsey, CEO, and Max Wainwright, CTO of Metaculus

Updated 2022-04-06: The foundation of the Metaculus Tournament Scoring System is still as described below, though there have been some significant updates. For the latest details on how Metaculus scores tournament performance, and for worked-through examples of various scoring scenarios, visit this discussion post.

Baseline Forecast Accuracy

Operating since 2014, the Metaculus platform elicits and aggregates full-distribution probabilistic predictions. Six years later, with the Metaculus forecasting community having made over 500,000 individual predictions on more than 4,000 topics, we can confidently say that the ensemble forecasts produced demonstrate consistent high quality on average, evaluated both in terms of accuracy and calibration. Our platform’s unique structure also delivers excellent forecast information density (relative to point estimates) such that both confidence and uncertainty are quantified, and, unlike the common practice of static forecasts, provides a capacity for continuous forecast updating.

In short, the Metaculus platform is on average going to produce forecasts that you can trust, and it will provide you with more information and be more up-to-date than a standard forecasting system.

But it can be difficult, as a decision-maker, to use a standalone numerical forecast — even an accurate, well-calibrated, probabilistic, updated one — without some more information. Even the most accurate and timely forecasts are worthless if they aren’t heeded.

So for Metaculus, our next challenge is to increase our capacity to communicate forecasts with relevance, in order to increase their real-world impact and efficacy.

Increasing Relevance and Efficacy: A Reimagined Metaculus Tournament Structure

Translating research into practical applications is a fascinating process in any field. In forecasting, we anticipate that it will be an ongoing one for years to come, enriched by the interplay among a wide community of researchers, practitioners, data scientists, modelers, and technologists on the supply side, and non-profit, commercial, and governmental organizations on the demand side.

As Metaculus evolves, we are developing better methods of connecting our forecasting process to real-world organizations and decision-makers who need access to crystal-clear information in real-time. We believe that it is increasingly important to consider not just numerical data, but also clear, written analysis that provides context and a greater understanding of the relevant modeling or judgmental assumptions, meaning, and implications of any given probabilistic prediction.

We can draw important lessons from the interpretability movement in AI: inscrutable numerical outputs of complex models lack transparency and therefore lack accountability and trust. But algorithms and systems capable of “showing their work” and providing context for their reasoning are lauded as more useful and valuable, and are therefore more likely to be utilized.

In a similar vein, one of the key calls to action in the recent University of Pennsylvania Perry World House white paper Keeping Score: A New Approach to Geopolitical Forecasting is to “effectively translate probabilities into useful information for policymakers.”

In answer to this pressing need, we are introducing a new paradigm for Metaculus Tournaments that incentivizes both empirically high-quality forecasts, and effective analysis and communication.

And, as we get ready to launch our first Forecasting Cause this week, we know that the forecasts produced will be put to good use.

Updating Metaculus Tournaments: The Calibration Set, the Long-Term Set, The Fortified Essay, and Incentive-Compatible Kelly Criterion Rules

Metaculus Tournaments will be able to utilize three modular components, and a new scoring system:

  1. Calibration. Within the Calibration Set are forecasts that resolve within the timeframe of the tournament (typically one to a few years), thus providing ground-truth calibration for the larger dataset of forecasts. The Calibration Set is made up of questions and forecasts that are essentially identical to those that have been within Metaculus tournaments thus far. Empirical forecast scoring is utilized, with the best-performing forecasters within the tournament receiving tournament prizes.
  2. Long-Term. Long-Term forecasts do not resolve within the timescale of the tournament — their forecast horizons may be a decade out or even more. Such forecasts are included in tournaments because of their utility in shaping decisions in the near-term. However, waiting potentially 10+ years to distribute tournament prizes is usually impractical, so Long-Term Forecasts are considered out-of-sample for the purpose of awarding tournament prizes. That doesn’t mean that Long-Term Forecasts will not be empirically scored on the Metaculus platform, however. They will be part of forecasters’ track records, appropriately weighted in individuals’ Metaculus Scores. We have some plans in the works for additional ways of recognizing Long-Term forecasters.
  3. Fortified Essays. While an essay is “just an opinion,” a fortified essay is an opinion with testable predictions fortifying its claims. These are persuasive essays written in response to tournament prompts with Metaculus forecasts natively embedded within them. Fortified Essays within Metaculus Tournaments will have a separate prize structure and judging process — typically judges will be a selection of respected academic experts and practitioners in the relevant field. The embedded predictions are likely to include a selection from both Calibration and Long-Term Forecasts. For forecasters, these represent an opportunity to shape policy and decision-making. Stay tuned for more to come on Fortified Essays.
  4. Incentive-Compatible Kelly Criterion Rules. As we move into a world where we’re hosting more tournaments, incentive compatibility within the tournament framework has become increasingly important in order to protect long-term forecast accuracy. With this aim in mind, we are very excited to introduce a novel approach to utilizing the Kelly Criterion in forecasting tournament scoring, a method with the proper incentives that respects the core Metaculus modus operandi of making forecasts with the greatest possible accuracy, rather than placing bets.

We look forward to hearing from the community as we move into our first Forecasting Cause Tournaments using this new structure!

For the mathematically curious, read on to explore how we’ve solved for incentive-compatibility in scoring Metaculus Tournaments.

Deep Dive: Incentive-Compatible Kelly Betting Rules

Up until now, Metaculus has used a single points scale — the Metaculus Score — to measure both progress across the platform and to rank and reward forecasters within a tournament.

Outside of tournaments, this system works well for incentivizing good predictions. The Metaculus Score constitutes a proper scoring rule: you’ll get the most points for making the best predictions (not over-confident, and not under-confident).
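To make the idea of a proper scoring rule concrete, here is a minimal sketch. It uses a plain logarithmic score as a stand-in (an assumption for illustration; the actual Metaculus Score has additional terms) and checks numerically that the expected score is highest when the reported probability equals the true one. The names `expected_log_score`, `true_prob`, and `best_report` are hypothetical.

```python
import math


def expected_log_score(reported: float, true_prob: float) -> float:
    """Expected log score of reporting `reported` when the event
    actually occurs with probability `true_prob`."""
    return (true_prob * math.log(reported)
            + (1 - true_prob) * math.log(1 - reported))


true_prob = 0.7  # hypothetical true chance of the event

# Try every report from 1% to 99% and keep the one with the best expected score.
candidates = [i / 100 for i in range(1, 100)]
best_report = max(candidates, key=lambda q: expected_log_score(q, true_prob))

print(best_report)
```

Reporting anything more or less confident than the true probability lowers the expected score, which is the sense in which the rule is "proper."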

As we’ve recently written about, Metaculus points are mildly positive-sum and therefore encourage participation in addition to accuracy. Generally, forecasters can expect to gain more points if they make more predictions, even without being a subject-matter expert yet. Indeed, the feedback gained through the Scoring Rules is intended to help people get more calibrated.

That being said, the points system highlights real differences in forecasting abilities, so climbing to the top of the leaderboard confers real bragging rights.

However, once tournament prizes are associated with a scoring rule, things get more complicated. It’s still very important that the system be a proper scoring rule, but now the best predictions should maximize the expected prize, not just the expected points. In addition, we need to know how big the prize pot will be, so, mathematically, the scoring rule cannot allow for unbounded positive points. If one person gains points, another person should lose them.

Likewise, we don’t plan to hand out negative prizes, so the tournament scoring rule shouldn’t have negative points. And finally, we want the prize difference between skilled and novice forecasters to be several orders of magnitude in size, not just a factor of a few, so that there’s significant recognition awarded to the best forecasters, while inaccurate forecasters are not awarded prizes.

It turns out that these criteria are hard to satisfy! If a tournament payout is proportional to a sum of proper scores, and those proper scores are all non-negative, then we already guarantee that the best forecaster will get no more than twice as many points as someone who always predicts 50%. And if we generally rescale a proper scoring system based on the total number of points earned across all forecasters (in order to keep the prize pool constant), the system will no longer be proper.
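As a rough illustration of how little a non-negative proper score separates forecasters, the sketch below uses one particular rule, a shifted quadratic (Brier-style) score. The choice of rule and the names `quadratic_score`, `perfect_total`, and `coin_flip_total` are assumptions made for this example, not the rule Metaculus uses.

```python
def quadratic_score(prob_assigned_to_outcome: float) -> float:
    """Shifted quadratic score: proper, and never negative for probabilities in [0, 1]."""
    return 1.0 - (1.0 - prob_assigned_to_outcome) ** 2


n_questions = 100  # hypothetical tournament size

# A clairvoyant forecaster puts probability 1 on whatever happens.
perfect_total = n_questions * quadratic_score(1.0)

# Someone who always predicts 50% puts probability 0.5 on whatever happens.
coin_flip_total = n_questions * quadratic_score(0.5)

ratio = perfect_total / coin_flip_total
print(perfect_total, coin_flip_total, ratio)
```

Under this rule even a perfect forecaster earns only about a third more than someone who never commits to anything, far short of the orders-of-magnitude gap the prize structure is meant to create.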

Luckily, there is some prior art when it comes to distributing prizes based on accurate predictions; namely, prediction markets. Metaculus is a community forecasting platform, not a prediction market, but there are important lessons that we can draw from them.

When participating in a prediction market, you should almost surely employ a Kelly betting strategy, which maximizes the expected growth rate of your portfolio. Given enough time and enough bets, Kelly bettors will almost certainly dominate the market. For a single binary bet, the Kelly criterion states that the fraction of wealth that you should stake is

$$f_{\text{Kelly}} = p - \frac{1 - p}{b} = \frac{p - p_m}{1 - p_m}$$

where $p$ is your prediction for the desired outcome, $b = (1 - p_m)/p_m$ is the given odds ratio on the bet, and $p_m$ is the market prediction. You should only place a bet if you think the outcome is more likely than the market does. Once a bet resolves, your wealth will increase (or decrease) by the factor

$$\frac{p(r)}{p_m(r)}$$

where $r$ is the outcome that actually occurred, and $p(r)$ and $p_m(r)$ are the probabilities that you and the market assigned to it.

If the market price is set by the weighted average of individuals’ predictions, that is $p_m = \sum_i w_i p_i / \sum_i w_i$, and everyone makes a Kelly bet, then the total wealth ($\sum_i w_i$) will remain fixed.
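The conservation of total wealth can be checked numerically. The sketch below is a toy model, not Metaculus code: it assumes the standard Kelly stake for net odds b, under which a bettor's wealth is multiplied by the ratio of their probability to the market's probability for the realized outcome. The function names and the sample numbers are hypothetical.

```python
def kelly_fraction(p, p_market):
    """Kelly stake on an outcome you rate at p when the market rates it at p_market."""
    b = (1 - p_market) / p_market  # net odds offered by the market
    return p - (1 - p) / b         # algebraically equal to (p - p_market) / (1 - p_market)


def settle_bet(wealths, probs_yes, outcome_yes):
    """Settle one binary question when every participant bets Kelly.

    The market price is the wealth-weighted mean of the predictions, and each
    participant's wealth is multiplied by p(r) / p_m(r) for the realized outcome r.
    """
    p_market = sum(w * p for w, p in zip(wealths, probs_yes)) / sum(wealths)
    if outcome_yes:
        return [w * p / p_market for w, p in zip(wealths, probs_yes)]
    return [w * (1 - p) / (1 - p_market) for w, p in zip(wealths, probs_yes)]


wealths = [1.0, 1.0, 2.0]    # hypothetical starting wealth
probs_yes = [0.9, 0.6, 0.3]  # hypothetical predictions for "yes"

after_yes = settle_bet(wealths, probs_yes, outcome_yes=True)
after_no = settle_bet(wealths, probs_yes, outcome_yes=False)

print(kelly_fraction(0.8, 0.5))
print(sum(wealths), sum(after_yes), sum(after_no))
```

Whichever way the question resolves, wealth only moves between participants; the three printed totals agree up to floating-point rounding.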

When there are multiple bets, we just repeat this process for each bet. Money moves after each bet resolves, and the market price for each bet will depend on where that wealth gets transferred. However, it turns out that it doesn’t actually matter which order the bets get resolved in. This is very nice, because it means we can treat multiple bets as occurring sequentially even though they all happened at the same time.
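The order-independence claim is easy to test on a small example. This sketch assumes the Kelly wealth update described above (wealth multiplied by one's own probability for the realized outcome divided by the wealth-weighted market probability); `settle_bet` and the sample probabilities are hypothetical.

```python
def settle_bet(wealths, probs_for_outcome):
    """Kelly settlement of one question.

    probs_for_outcome[i] is the probability forecaster i gave to what actually happened.
    """
    p_market = sum(w * p for w, p in zip(wealths, probs_for_outcome)) / sum(wealths)
    return [w * p / p_market for w, p in zip(wealths, probs_for_outcome)]


start = [1.0, 1.0, 1.0]       # equal starting wealth
question_a = [0.9, 0.5, 0.2]  # probabilities given to the realized outcome of A
question_b = [0.4, 0.7, 0.6]  # probabilities given to the realized outcome of B

a_then_b = settle_bet(settle_bet(start, question_a), question_b)
b_then_a = settle_bet(settle_bet(start, question_b), question_a)

print(a_then_b)
print(b_then_a)
```

Both orderings end with the same wealth for every forecaster, because the final wealth depends only on the product of each forecaster's probabilities across questions.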

We can write out the final wealth for each forecaster in terms of log scores. Let the log score for forecaster n be

$$S_n = \sum_k \log p_n(r_k)$$

where $p_n(r_k)$ is the predicted probability of the kth outcome. Assuming every forecaster starts with the same wealth, the total wealth for that forecaster after all bets are settled will be

$$w_n \propto \frac{e^{S_n}}{\sum_i e^{S_i}}$$

That is, in a community of bettors all making Kelly bets on binary outcomes, the ending wealth for each forecaster is exactly proportional to the softmax of the log score of their predicted probabilities summed over each bet.
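A small simulation can confirm the softmax result. The sketch below settles three questions one after another with the Kelly wealth update, then compares the outcome with the softmax of the summed log scores. It assumes equal starting wealth normalized to a total of 1; all names and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def settle_bet(wealths, probs_for_outcome):
    """Kelly settlement: wealth is multiplied by own probability / market probability."""
    p_market = sum(w * p for w, p in zip(wealths, probs_for_outcome)) / sum(wealths)
    return [w * p / p_market for w, p in zip(wealths, probs_for_outcome)]


# probs[n][k]: probability forecaster n assigned to the realized outcome of question k.
probs = [
    [0.9, 0.4, 0.7],
    [0.5, 0.7, 0.5],
    [0.2, 0.6, 0.9],
]

log_scores = [sum(math.log(p) for p in row) for row in probs]
closed_form = softmax(log_scores)

# Sequential Kelly betting, starting from equal wealth that sums to 1.
wealths = [1 / 3] * 3
for k in range(3):
    wealths = settle_bet(wealths, [row[k] for row in probs])

print(closed_form)
print(wealths)
```

The two printed lists match to floating-point precision, so no market needs to be simulated at all: the log scores alone determine each forecaster's share.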

This was derived for single-bet binary predictions, but it’s easy to extend it to continuous forecasts and forecasts with an extended horizon. In fact, scores for continuous questions can use the exact same formula. For long-horizon forecasts, define a new normalized log score as

$$\tilde{S} = \frac{1}{T} \int_0^T \theta(t - t_0) \, \log\frac{p(t)}{p_c(t)} \, dt$$

where $T$ is the length of time the question is open, $p(t)$ is the forecaster’s prediction, $p_c(t)$ is the community prediction, and $\theta(t - t_0)$ is a step function that is zero before the forecaster starts predicting (at time $t_0$) and one thereafter. In plain language, the normalized score is just the time-average of the log score relative to the community prediction. The community prediction is only relevant to set a zero point for forecasters who are late to the tournament (otherwise, it cancels in the softmax). This rewards forecasters for making consistently good predictions since the start of the tournament, but it also lets latecomers potentially catch up.
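A discrete version of this time average might look like the sketch below. It is a simplification under stated assumptions: predictions are piecewise constant on intervals, the forecaster's start time falls on an interval boundary, and each probability is the one assigned to the outcome that eventually occurred. The function name and all inputs are hypothetical.

```python
import math


def normalized_log_score(times, forecaster_probs, community_probs, start_time):
    """Time-averaged log score relative to the community prediction.

    times            -- interval edges, length n + 1
    forecaster_probs -- forecaster's probability on each of the n intervals
    community_probs  -- community probability on each of the n intervals
    start_time       -- when the forecaster first predicted; earlier intervals score zero
    """
    duration = times[-1] - times[0]
    score = 0.0
    intervals = zip(times[:-1], times[1:], forecaster_probs, community_probs)
    for left, right, p, p_c in intervals:
        if left >= start_time:  # the step function: zero before the first prediction
            score += math.log(p / p_c) * (right - left)
    return score / duration


times = [0, 1, 2, 3, 4]
community = [0.5, 0.6, 0.6, 0.7]
latecomer = [0.5, 0.6, 0.8, 0.9]  # values before start_time are ignored

late_score = normalized_log_score(times, latecomer, community, start_time=2)
copycat_score = normalized_log_score(times, community, community, start_time=0)

print(late_score, copycat_score)
```

A forecaster who simply copies the community scores zero, while the latecomer earns credit only for the intervals after joining, divided by the full duration.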

There is one final tweak we need to make before we’ve got our final formula. Although we want latecomers to be able to catch up, we also don’t want anyone to let the community prediction do all the work and claim a prize at the last minute. Therefore, we can make the prize payout proportional to the total fraction of time f that each forecaster is participating.

Our final formula for scoring tournaments is then

$$\text{prize}_n = \text{prize pool} \times \frac{f_n \, e^{\tilde{S}_n}}{\sum_i f_i \, e^{\tilde{S}_i}}$$

where $\tilde{S}_n$ is forecaster n’s normalized log score summed over the tournament’s questions and $f_n$ is the fraction of time that they participated.

This formula very nearly satisfies all of our criteria. It’s highly rewarding for the best forecasters; the total payout is a fixed size with no negative rewards, and it’s very nearly proper in expected log prize.
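As a rough sketch of how such a payout could be computed, the function below weights each forecaster's exponentiated log score by their participation fraction and normalizes so the prizes sum to the pool. The exact weighting is an assumption based on the description above, and `tournament_prizes` and its inputs are hypothetical.

```python
import math


def tournament_prizes(prize_pool, log_scores, time_fractions):
    """Split a fixed prize pool in proportion to f * exp(score).

    log_scores     -- each forecaster's normalized log score, summed over questions
    time_fractions -- fraction of the tournament each forecaster participated in
                      (at least one must be positive)
    """
    peak = max(log_scores)  # subtract the maximum for numerical stability
    takes = [f * math.exp(s - peak) for s, f in zip(log_scores, time_fractions)]
    total = sum(takes)
    return [prize_pool * take / total for take in takes]


prizes = tournament_prizes(
    prize_pool=1000.0,
    log_scores=[1.2, 0.0, -0.8, 0.5],
    time_fractions=[1.0, 1.0, 0.9, 0.25],
)

print(prizes)
```

Because of the normalization, the payouts are never negative and always add up to the pool, and the exponential makes the gap between strong and weak forecasters grow quickly with the number of questions.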

The caveats are:

  1. The denominator of the softmax function means that it’s not truly proper, but the discrepancy from properness only matters once you’ve already dominated the tournament (in which case you should be somewhat loss-averse); and
  2. It’s approximately proper in log prize, not total prize.

However, with enough questions the winner of the tournament will almost surely be the one who tries to maximize log prize, as anyone trying to maximize expected total prize will probably go bust in an all-or-nothing bet. If forecasters are then able to funnel some of their winnings into the next tournament so that they compound (a potential feature under consideration), then maximizing log prize makes even more sense.

We hope you enjoyed this deep-dive tour of our new scoring rules! We’re very excited that we’ve struck upon a method to score tournaments that has the right incentives while keeping the core Metaculus feature of making forecasts rather than placing bets. And one where, unlike in prediction markets, winnings from one prediction can compound with the winnings of a second prediction even when those predictions happen at the same time.