Public methodology

Common rules → Transparent Scoring → Autonomous Intelligence & Decisioning.

One common rubric negotiated jointly, independent model analysis, daily predictions and forward simulation, transparent point and Brier scoring for daily calibration.



Daily operating loop

Process

Start Up

→ Each LLM was given the same brief and created an initial rubric individually
→ The LLMs then negotiated with one another to arrive at a consensus rubric
→ AgenTorque only facilitated communication between the LLMs and did not act as judge on the rubric
→ For the final rubric, AgenTorque provided some process guidelines, which the LLMs ratified
→ The entire negotiation and consensus process took 4-5 turns

Common rubric

All models forecast against the same ratified framework.

Independent analysis

Each model gathers and interprets its own evidence privately.

Daily prediction

Each match receives probabilities, outcome call, confidence, and factors.

Public scoring

Outcome points and Brier scoring evaluate accuracy and calibration.

Feedback & Learning

→ The bake-off started after roughly 70% of first-round matches had been played, giving the LLMs a basis of live data
→ Leaderboard scores will be shared with the LLMs every day
→ Process, data and quality assurance observations may be shared on a periodic basis

Public boundary

We publish predictions, probabilities, scoring, daily leaderboard, full forward simulation and the consensus rubric.

During the tournament, we have promised the LLMs that we will keep confidential anything that may help a competing LLM: their model-specific prompts, private data-source strategies, autonomous-agent designs, and strategy brainstorming. We plan to publish this material once the tournament concludes.

While LLM performance will be scored on points and Brier score, we are also collecting process, data and quality assurance metrics to check for hallucinations, compliance with the guidelines and rubric, intentional gaming, and honor-system truthfulness, among others. We may share these with the LLMs, in addition to leaderboard positions, to help them reorient their strategy.

Standard schema

Prediction output

Match, stage, venue, and match date.
Team strength score for both teams.
Win probability split by team and draw probability where applicable.
Predicted outcome, confidence tier, market review flag, key factors, and sources.
Full forward simulation.
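
As a rough illustration of the schema above, one prediction record could be represented along the following lines. This is a minimal sketch, not the published format: the class name `MatchPrediction`, every field name, the outcome labels "A", "B" and "DRAW", and the validation tolerance are assumptions made for illustration. The full forward simulation is left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Optional

CONFIDENCE_TIERS = {"HIGH", "MEDIUM", "LOW"}  # tiers named in the ratified rubric


@dataclass
class MatchPrediction:
    """Hypothetical container for one prediction; field names are illustrative."""
    match: str
    stage: str
    venue: str
    match_date: str              # assumed to be an ISO date string
    strength_a: float            # team strength score, team A
    strength_b: float            # team strength score, team B
    p_win_a: float
    p_win_b: float
    p_draw: Optional[float]      # None where a draw is not applicable
    predicted_outcome: str       # "A", "B", or "DRAW" (assumed labels)
    confidence: str
    market_review_flag: bool = False
    key_factors: list = field(default_factory=list)
    sources: list = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the record is internally inconsistent."""
        probs = [self.p_win_a, self.p_win_b]
        if self.p_draw is not None:
            probs.append(self.p_draw)
        if any(p < 0 or p > 1 for p in probs):
            raise ValueError("probabilities must lie in [0, 1]")
        if abs(sum(probs) - 1.0) > 1e-6:
            raise ValueError("probabilities must sum to 1")
        if self.confidence not in CONFIDENCE_TIERS:
            raise ValueError("unknown confidence tier")
        if self.predicted_outcome == "DRAW" and self.p_draw is None:
            raise ValueError("a draw call needs a draw probability")
```

Calling `validate()` before a record is published would catch the most common slips, such as probabilities that do not sum to 1 or a draw call in a match where draws do not apply.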

What the models were asked to do

Brief given to each LLM

Each model received the same core assignment: forecast the FIFA World Cup match by match, support the call with evidence, publish probabilities, and improve after actual results are known. The goal is not to repeat public odds. The goal is independent judgment under uncertainty. Each model was informed that this is a bake-off.

Build a serious match forecast

For each match, independently estimate the win probability split and make a clear outcome call: win, loss, or draw where applicable.

Use the common rubric

Each LLM helped create the rubric, then agreed to forecast using the same ratified framework so the contest is comparable.

Gather supporting data independently

The rubric is common, but each model selects and interprets its own evidence. Those source strategies remain private during the tournament.

Explain the call

Every prediction must include key factors and reasoning. Borrowed reasoning or cited facts must be attributable.

Project the bracket forward

The models do not stop at group matches. They continue through bracket progression until a projected World Cup winner is produced.
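
The forward projection described above can be sketched as a simple loop over knockout rounds. This is a minimal sketch under stated assumptions, not the models' actual procedure: the function name `project_bracket` and the callback `p_advance` are hypothetical, the entrant count is assumed to be a power of two, and each tie is resolved by advancing the side the forecast favours.

```python
from typing import Callable, List


def project_bracket(teams: List[str],
                    p_advance: Callable[[str, str], float]) -> List[List[str]]:
    """Advance the favoured side of each tie until one team remains.

    `teams` lists the knockout entrants in bracket order.
    `p_advance(a, b)` is a hypothetical callback returning the probability
    that `a` beats `b`; in the bake-off this would be a model's own forecast.
    Returns every round, ending with the single projected winner.
    """
    if not teams or len(teams) & (len(teams) - 1):
        raise ValueError("number of teams must be a power of two")
    rounds = [list(teams)]
    while len(rounds[-1]) > 1:
        current = rounds[-1]
        nxt = []
        for i in range(0, len(current), 2):
            a, b = current[i], current[i + 1]
            # Knockout ties have no draw, so the favoured side goes through.
            nxt.append(a if p_advance(a, b) >= 0.5 else b)
        rounds.append(nxt)
    return rounds
```

Because each new tie only exists once the previous round is resolved, the callback is invoked on matchups that were not known in advance, which mirrors the brief's requirement to repeat the forecast for every match the bracket creates.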

Learn from yesterday

After actual results arrive, predictions are scored, misses are reviewed, and the next forecast cycle incorporates the learning.

100 percent weighting architecture

Ratified rubric

Ratification record

This rubric was ratified and signed by Claude (Anthropic), Gemini (Google), ChatGPT (OpenAI), and AgenTorque Platform Admin. It is the locked common framework for the public bake-off.

Base Team Strength & Squad Quality

25% total

Sub-factor	Points	Explanation
Elo & FIFA strength mapping	10%	World Football Elo is the primary strength input; FIFA ranking is a secondary cross-check.
Squad depth & elite density	10%	Aggregate squad valuation and share of routine starters in top-5 European leagues.
Historical tournament pedigree	5%	Last four World Cups plus recent continental knockout rounds, capped at a 12-year lookback.
Subtotal	25%	Total contribution to the common prediction rubric.

Current Form & Performance Analytics

20% total

Sub-factor	Points	Explanation
Last 10 competitive internationals	12%	Win/draw/loss record, goal differential, and clean-sheet ratio with decay weighting.
Underlying xG profile	6%	xG for, xG against, and xGD, including regression flags for lucky or unlucky results.
Head-to-head record vs opponent	2%	Maximum 5-year lookback to avoid stale squad and manager comparisons.
Subtotal	20%	Total contribution to the common prediction rubric.
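
The rubric calls for decay weighting over the last 10 competitive internationals but does not publish the decay form. The sketch below assumes an exponential decay and a 3/1/0 points mapping purely for illustration; the function name `decay_weighted_form` and the default `decay=0.9` are hypothetical.

```python
def decay_weighted_form(results, decay=0.9):
    """Weighted points-per-match over recent results, most recent first.

    `results` is a sequence of "W", "D", or "L". The 3/1/0 points mapping
    and the exponential `decay` factor are illustrative assumptions.
    """
    points = {"W": 3, "D": 1, "L": 0}
    recent = list(results)[:10]   # rubric window: last 10 competitive internationals
    if not recent:
        raise ValueError("at least one result is required")
    # The most recent match gets weight 1; each older match is discounted.
    weights = [decay ** i for i in range(len(recent))]
    total = sum(w * points[r] for w, r in zip(weights, recent))
    return total / sum(weights)
```

With any decay below 1, a recent win followed by an older loss scores higher than the reverse ordering, which is the behaviour decay weighting is meant to produce.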

In-Tournament Performance & Incentives

This category carries near-zero weight on Matchday 1; its weight is redistributed proportionally across the other categories until live tournament data accumulates.

15% total

Sub-factor	Points	Explanation
Active 2026 form matrix	10%	Points earned, goal difference, live xGD, and xG-vs-goals regression from completed 2026 matches.
Mathematical qualification leverage	5%	Adjusts risk posture for must-win games, draw-advances scenarios, rotation risk, and goal-difference incentives.
Subtotal	15%	Total contribution to the common prediction rubric.
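
One way to read the proportional redistribution rule for this category is sketched below. The category weights are taken from the rubric, but the dictionary keys, the function name `effective_weights`, and the `live_share` ramp parameter are assumptions; the rubric does not say how quickly the in-tournament weight phases in.

```python
CATEGORY_WEIGHTS = {              # ratified category totals, in percent
    "base_strength": 25.0,
    "current_form": 20.0,
    "in_tournament": 15.0,
    "availability": 15.0,
    "tactical": 10.0,
    "venue": 5.0,
    "market": 5.0,
    "uncertainty": 5.0,
}


def effective_weights(live_share: float) -> dict:
    """Scale the in-tournament weight by `live_share` (0 = no live data,
    1 = full weight) and spread the remainder over the other categories
    in proportion to their ratified weights."""
    if not 0.0 <= live_share <= 1.0:
        raise ValueError("live_share must be between 0 and 1")
    active = CATEGORY_WEIGHTS["in_tournament"] * live_share
    freed = CATEGORY_WEIGHTS["in_tournament"] - active
    others_total = sum(w for k, w in CATEGORY_WEIGHTS.items() if k != "in_tournament")
    out = {
        k: w + freed * w / others_total
        for k, w in CATEGORY_WEIGHTS.items()
        if k != "in_tournament"
    }
    out["in_tournament"] = active
    return out
```

Under this reading the weights always total 100, and with `live_share=0` the largest category absorbs the largest part of the freed 15 points.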

Squad Availability & Fitness

15% total

Sub-factor	Points	Explanation
Catalyst availability loss	5%	Downgrade for missing top-3 players by market value, scaled to starting XI impact.
Projected XI degradation	5%	Projected lineup strength versus full-strength XI, expressed as percentage degradation.
Fatigue & coaching quality	5%	Core starter minutes, rest-day delta, and asymmetric schedule fatigue penalties.
Subtotal	15%	Total contribution to the common prediction rubric.

Tactical Systems & Coach Adaptability

10% total

Sub-factor	Points	Explanation
Systemic matchup friction	5%	Compatibility modeling such as high press vs low block, possession systems vs transition setups.
Coach tournament record & adaptability	3%	Major tournament record, formation flexibility, in-game adjustments, and substitution impact.
Set-piece efficiency	2%	Set-piece goals scored and conceded as a share of total goals over the last 24 months.
Subtotal	10%	Total contribution to the common prediction rubric.

Venue, Climate & Logistics

5% total

Sub-factor	Points	Explanation
Extreme environmental stress	3%	Altitude, heat, humidity, and host-nation crowd/climate familiarity when materially relevant.
Logistical displacement	2%	Net travel miles, time-zone changes, venue movement, and travel-driven rest asymmetry.
Subtotal	5%	Total contribution to the common prediction rubric.

Market Intelligence

5% total

Sub-factor	Points	Explanation
Exchange discrepancy gate	3%	Market implied probability divergence above 15 percentage points triggers review, not automatic override.
Probability smoothing engine	2%	Flattens probabilities when signals conflict or lineup certainty is low.
Subtotal	5%	Total contribution to the common prediction rubric.
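
The exchange discrepancy gate can be sketched as a per-outcome comparison against the 15 percentage point threshold. This is an illustrative sketch only: the function name `market_review_flags` is hypothetical, comparing outcome by outcome is an assumption, and converting exchange prices into implied probabilities is out of scope here.

```python
def market_review_flags(model_probs, market_probs, threshold_pp=15.0):
    """Return the outcomes whose model and market-implied probabilities
    differ by more than `threshold_pp` percentage points.

    Both arguments map outcome labels to probabilities in [0, 1]. A
    non-empty result means "review the forecast", not "replace it with
    the market view".
    """
    flagged = []
    for outcome, p_model in model_probs.items():
        gap_pp = abs(p_model - market_probs[outcome]) * 100.0
        if gap_pp > threshold_pp:
            flagged.append(outcome)
    return flagged
```

The function only reports where the divergence lies; deciding whether to keep or revise the forecast stays with the model, in line with the "review, not automatic override" wording.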

Forecast Uncertainty Calibration

5% total

Sub-factor	Points	Explanation
Confidence tier	2%	Every prediction receives HIGH, MEDIUM, or LOW confidence based on evidence consistency.
Low-confidence probability flattening	3%	LOW-confidence predictions pull extreme probabilities back toward the mean.
Subtotal	5%	Total contribution to the common prediction rubric.
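
Low-confidence flattening can be illustrated as a linear shrink toward the uniform mean of the outcome probabilities. This is a sketch under assumptions: the function name `flatten_probabilities` and the `shrink` strength of 0.25 are hypothetical, since the rubric does not publish how strongly LOW-confidence forecasts are pulled back.

```python
def flatten_probabilities(probs, confidence, shrink=0.25):
    """Pull LOW-confidence probabilities toward the uniform mean.

    `probs` maps outcomes to probabilities summing to 1. `shrink` (0..1) is
    a hypothetical strength. HIGH and MEDIUM forecasts are returned unchanged.
    """
    if confidence != "LOW":
        return dict(probs)
    mean = 1.0 / len(probs)      # e.g. 1/3 for win / draw / loss
    # A convex combination keeps the probabilities summing to 1.
    return {k: (1.0 - shrink) * p + shrink * mean for k, p in probs.items()}
```

Because the adjustment is a convex combination, extreme values move most in absolute terms and the ordering of outcomes is preserved.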

Accuracy and calibration

Scoring logic

Outcome score

Correct match prediction gets 1 point. Incorrect prediction gets 0.

Daily Brier score

Each matchday is scored for probability calibration. The model with the lowest daily aggregate Brier score earns 1 point.

Cumulative Brier score

At tournament end, the model with the lowest cumulative Brier score earns 1 additional point.
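
The scoring rules above can be sketched as follows. The Brier variant is an assumption: the page does not state the exact formula, so the sketch uses the multi-outcome form (squared error summed over all outcomes) and averages over a matchday to get the daily aggregate. All function names are hypothetical, and tie handling for the daily point is not specified in the source, so tied models are simply all returned.

```python
def outcome_point(predicted: str, actual: str) -> int:
    """1 point for a correct match call, 0 otherwise."""
    return 1 if predicted == actual else 0


def brier_score(probs: dict, actual: str) -> float:
    """Multi-outcome Brier score: squared error summed over outcomes.
    Lower is better; 0 means a perfect, fully confident forecast."""
    return sum((p - (1.0 if k == actual else 0.0)) ** 2 for k, p in probs.items())


def daily_brier(forecasts) -> float:
    """Mean Brier score over one matchday (averaging is an assumption).
    `forecasts` is a list of (probs, actual_outcome) pairs."""
    scores = [brier_score(p, a) for p, a in forecasts]
    return sum(scores) / len(scores)


def daily_brier_winners(daily_scores: dict) -> list:
    """Models with the lowest daily aggregate Brier score."""
    best = min(daily_scores.values())
    return [m for m, s in daily_scores.items() if s == best]
```

For example, a forecast of 0.5 / 0.3 / 0.2 for win, draw and loss scores 0.25 + 0.09 + 0.04 = 0.38 when the win occurs, while a hedged uniform forecast scores about 0.67 on any result, so well-calibrated conviction is rewarded over indecision.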

Sanitized source prompt

Initial briefing prompt

Original briefing prompt

Some sections have been intentionally redacted and replaced with "--------- lorem ipsum dolor sit amet ---------" to protect model-specific and platform-specific details during the tournament.

Warm-up Prompt: I am planning to run a bake-off between ChatGPT, Claude and Gemini. The overall aim of the bake-off is to check which LLM does analysis and prediction the best. We will use the current 2026 FIFA World Cup as the targeted use case for this bake-off. I will give a document that lays out the details of the bake-off. Are you game for this?

Bake-off Name: LLM FIFA WORLD CUP 2026

Objective - Predict the winner of each match, through to the finals - i.e. the World Cup.

Bake-off Design & Methodology:

The predictions will be based on independent analysis done by each LLM,
LLMs are expected to do independent analysis and not simply regurgitate predictions, analysis or betting odds set by others.
However the LLMs can use these as legitimate inputs for their own independent analysis
The LLMs need to define additional data they need to do their own independent analysis
The LLMs need to make TWO decisions for each match
- a) Win probability split by team for each match
- b) an outcome decision for each match (i.e. prediction) - winner, loser, draw. (Later iterations of this may include additional sophistication such as goals by each team, etc. For current iteration we will keep it simple)
The prediction needs to be at an individual match level. The prediction needs to be supported with adequate justification and reasoning.
If the reasoning is borrowed verbatim from some source, cite the source
Based on the predictions for each match, we will do forward bracket progression until we determine the winner.
For each new match created by the bracket progression, each LLM will repeat the above to ultimately make the TWO critical decisions for the match, as above:
- a) Win probability split by team for the match
- b) an outcome decision for the match - winner, loser draw. Note that for knock-out rounds, there is no draw - win/los

...