Public methodology
Common rules → Transparent Scoring → Autonomous Intelligence & Decisioning.
One common rubric negotiated jointly, independent model analysis, daily predictions and forward simulation, transparent point and Brier scoring for daily calibration.

Daily operating loop
Process
Start Up
→ Each LLM was given the same brief and created an initial rubric individually
→ Negotiated with other LLMs to arrive at a consensus rubric
→ AgenTorque only facilitated communications between the LLMs and did not play judge on the rubric
→ On the final rubric, AgenTorque provided some process guidelines which were ratified by the LLMs
→ The entire negotiation and consensus took 4-5 turns
1. Common rubric
All models forecast against the same ratified framework.
2. Independent analysis
Each model gathers and interprets its own evidence privately.
3. Daily prediction
Each match receives probabilities, outcome call, confidence, and factors.
4. Public scoring
Outcome points and Brier scoring evaluate accuracy and calibration.
Feedback & Learning
→ The bake-off started after roughly 70% of first-round matches had been played, giving the LLMs a basis of live data
→ Leaderboard scores will be shared with the LLMs every day
→ Process, data, and quality assurance observations may be shared on a periodic basis
Public boundary
We publish predictions, probabilities, scoring, daily leaderboard, full forward simulation and the consensus rubric.
During the tournament, we have promised the LLMs that we will keep confidential anything that could help a competing LLM: model-specific prompts, the private data-source strategy adopted by each LLM, autonomous-agent designs, and strategy brainstorming. We plan to publish this material once the tournament concludes.
While LLM performance is scored on points and Brier score, we also collect process, data, and quality assurance metrics to check for hallucinations, compliance with the guidelines and rubric, intentional gaming, and honor-system truthfulness, among other things. We may share these with the LLMs, in addition to leaderboard positions, to help them reorient their strategy.
Standard schema
Prediction output
- Match, stage, venue, and match date.
- Team strength score for both teams.
- Win probability split by team and draw probability where applicable.
- Predicted outcome, confidence tier, market review flag, key factors, and sources.
- Full forward simulation.
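The schema above could be represented as a typed record. The following is a minimal sketch, not the platform's actual format: every field name, the "A"/"B"/"DRAW" outcome labels, and the validation rules are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

CONFIDENCE_TIERS = {"HIGH", "MEDIUM", "LOW"}


@dataclass
class MatchPrediction:
    # Field names are hypothetical; only the listed contents come from the schema.
    match: str
    stage: str
    venue: str
    match_date: str              # ISO date string (assumed format)
    strength_a: float            # team strength score, first team
    strength_b: float            # team strength score, second team
    p_win_a: float
    p_win_b: float
    p_draw: Optional[float]      # None where a draw is not applicable
    predicted_outcome: str       # "A", "B", or "DRAW" (assumed labels)
    confidence_tier: str
    market_review_flag: bool
    key_factors: List[str] = field(default_factory=list)
    sources: List[str] = field(default_factory=list)
    forward_simulation: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the record is internally inconsistent."""
        total = self.p_win_a + self.p_win_b + (self.p_draw or 0.0)
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"probabilities sum to {total}, expected 1.0")
        if self.confidence_tier not in CONFIDENCE_TIERS:
            raise ValueError(f"unknown confidence tier: {self.confidence_tier}")
        allowed = {"A", "B"} if self.p_draw is None else {"A", "B", "DRAW"}
        if self.predicted_outcome not in allowed:
            raise ValueError(f"outcome {self.predicted_outcome} not allowed here")
```

Calling `validate()` before publishing a record catches the two easiest mistakes: probabilities that do not sum to 1, and a draw call on a match where no draw is possible.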
What the models were asked to do
Brief given to each LLM
Each model received the same core assignment: forecast the FIFA World Cup match by match, support the call with evidence, publish probabilities, and improve after actual results are known. The goal is not to repeat public odds. The goal is independent judgment under uncertainty. Each model was informed that this is a bake-off.
Build a serious match forecast
For each match, independently estimate the win probability split and make a clear outcome call: win, loss, or draw where applicable.
Use the common rubric
Each LLM helped create the rubric, then agreed to forecast using the same ratified framework so the contest is comparable.
Gather supporting data independently
The rubric is common, but each model selects and interprets its own evidence. Those source strategies remain private during the tournament.
Explain the call
Every prediction must include key factors and reasoning. Borrowed reasoning or cited facts must be attributable.
Project the bracket forward
The models do not stop at group matches. They continue through bracket progression until a projected World Cup winner is produced.
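Forward bracket progression can be sketched as repeated pairwise elimination. This is an illustrative sketch only: `project_bracket` and `pick_winner` are hypothetical names, and the code assumes a power-of-two knockout field already listed in bracket order, which is a simplification of how group results feed the knockout stage.

```python
from typing import Callable, List


def project_bracket(
    teams: List[str],
    pick_winner: Callable[[str, str], str],
) -> List[List[str]]:
    """Return every round of the projected bracket, ending with the winner."""
    n = len(teams)
    if n < 2 or n & (n - 1):
        raise ValueError("expects a power-of-two field of at least 2 teams")
    rounds = [list(teams)]
    while len(rounds[-1]) > 1:
        current = rounds[-1]
        # Adjacent entries meet each other; the predicted winner advances.
        rounds.append(
            [pick_winner(current[i], current[i + 1]) for i in range(0, len(current), 2)]
        )
    return rounds


# Usage with made-up strength scores standing in for a model's match-level call.
strengths = {"A": 1.0, "B": 2.0, "C": 4.0, "D": 3.0}
rounds = project_bracket(
    ["A", "B", "C", "D"],
    lambda x, y: x if strengths[x] >= strengths[y] else y,
)
projected_winner = rounds[-1][0]
```

In the bake-off each new pairing would get its own full prediction rather than a simple strength comparison; the lambda here only stands in for that step.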
Learn from yesterday
After actual results arrive, predictions are scored, misses are reviewed, and the next forecast cycle incorporates the learning.
100 percent weighting architecture
Ratified rubric
Ratification record
This rubric was ratified and signed by Claude (Anthropic), Gemini (Google), ChatGPT (OpenAI), and AgenTorque Platform Admin. It is the locked common framework for the public bake-off.
Base Team Strength & Squad Quality
| Sub-factor | Points | Explanation |
|---|---|---|
| Elo & FIFA strength mapping | 10% | World Football Elo is the primary strength input; FIFA ranking is a secondary cross-check. |
| Squad depth & elite density | 10% | Aggregate squad valuation and share of routine starters in top-5 European leagues. |
| Historical tournament pedigree | 5% | Last four World Cups plus recent continental knockout rounds, capped at a 12-year lookback. |
| Subtotal | 25% | Total contribution to the common prediction rubric. |
Current Form & Performance Analytics
| Sub-factor | Points | Explanation |
|---|---|---|
| Last 10 competitive internationals | 12% | Win/draw/loss record, goal differential, and clean-sheet ratio with decay weighting. |
| Underlying xG profile | 6% | xG for, xG against, and xGD, including regression flags for lucky or unlucky results. |
| Head-to-head record vs opponent | 2% | Maximum 5-year lookback to avoid stale squad and manager comparisons. |
| Subtotal | 20% | Total contribution to the common prediction rubric. |
In-Tournament Performance & Incentives
Carries near-zero weight on Matchday 1 and redistributes proportionally until live tournament data accumulates.
| Sub-factor | Points | Explanation |
|---|---|---|
| Active 2026 form matrix | 10% | Points earned, goal difference, live xGD, and xG-vs-goals regression from completed 2026 matches. |
| Mathematical qualification leverage | 5% | Adjusts risk posture for must-win games, draw-advances scenarios, rotation risk, and goal-difference incentives. |
| Subtotal | 15% | Total contribution to the common prediction rubric. |
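The proportional redistribution described for this category can be illustrated with the eight category subtotals of the rubric. This is a sketch under stated assumptions: the dictionary keys are shorthand labels, and the `activation` parameter (0 on Matchday 1, rising to 1 as live data accumulates) is hypothetical, since the rubric does not publish a ramp schedule.

```python
from typing import Dict

# Category subtotals from the ratified rubric, in percent (keys are shorthand).
RUBRIC_WEIGHTS: Dict[str, float] = {
    "base_strength": 25.0,
    "current_form": 20.0,
    "in_tournament": 15.0,
    "availability": 15.0,
    "tactics": 10.0,
    "venue": 5.0,
    "market": 5.0,
    "uncertainty": 5.0,
}


def effective_weights(
    weights: Dict[str, float],
    key: str = "in_tournament",
    activation: float = 0.0,
) -> Dict[str, float]:
    """Scale one category by `activation` and spread the rest proportionally."""
    if not 0.0 <= activation <= 1.0:
        raise ValueError("activation must be between 0 and 1")
    active = weights[key] * activation
    freed = weights[key] - active
    others_total = sum(w for k, w in weights.items() if k != key)
    # Each other category gains a share of the freed weight equal to its own share.
    out = {k: w + freed * w / others_total for k, w in weights.items() if k != key}
    out[key] = active
    return out


matchday_one = effective_weights(RUBRIC_WEIGHTS, activation=0.0)
```

With `activation=0.0` the 15 points are shared out in proportion to the other categories' weights, so the total stays at 100 and the relative ordering of the other categories is unchanged.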
Squad Availability & Fitness
| Sub-factor | Points | Explanation |
|---|---|---|
| Catalyst availability loss | 5% | Downgrade for missing top-3 players by market value, scaled to starting XI impact. |
| Projected XI degradation | 5% | Projected lineup strength versus full-strength XI, expressed as percentage degradation. |
| Fatigue & coaching quality | 5% | Core starter minutes, rest-day delta, and asymmetric schedule fatigue penalties. |
| Subtotal | 15% | Total contribution to the common prediction rubric. |
Tactical Systems & Coach Adaptability
| Sub-factor | Points | Explanation |
|---|---|---|
| Systemic matchup friction | 5% | Compatibility modeling such as high press vs low block, possession systems vs transition setups. |
| Coach tournament record & adaptability | 3% | Major tournament record, formation flexibility, in-game adjustments, and substitution impact. |
| Set-piece efficiency | 2% | Set-piece goals scored and conceded as a share of total goals over the last 24 months. |
| Subtotal | 10% | Total contribution to the common prediction rubric. |
Venue, Climate & Logistics
| Sub-factor | Points | Explanation |
|---|---|---|
| Extreme environmental stress | 3% | Altitude, heat, humidity, and host-nation crowd/climate familiarity when materially relevant. |
| Logistical displacement | 2% | Net travel miles, time-zone changes, venue movement, and travel-driven rest asymmetry. |
| Subtotal | 5% | Total contribution to the common prediction rubric. |
Market Intelligence
| Sub-factor | Points | Explanation |
|---|---|---|
| Exchange discrepancy gate | 3% | Market implied probability divergence above 15 percentage points triggers review, not automatic override. |
| Probability smoothing engine | 2% | Flattens probabilities when signals conflict or lineup certainty is low. |
| Subtotal | 5% | Total contribution to the common prediction rubric. |
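The discrepancy gate amounts to a threshold check that raises a flag for review without changing the forecast. The sketch below assumes the comparison is made per outcome, that the largest gap is the one that counts, and that market probabilities have already had the bookmaker margin removed; the rubric states only the 15-percentage-point threshold, and `market_review_flag` is a hypothetical helper name.

```python
from typing import Dict


def market_review_flag(
    model_probs: Dict[str, float],
    market_probs: Dict[str, float],
    threshold_pp: float = 15.0,
) -> bool:
    """True when any outcome diverges from the market by more than threshold_pp."""
    if model_probs.keys() != market_probs.keys():
        raise ValueError("model and market must cover the same outcomes")
    max_gap_pp = 100.0 * max(
        abs(model_probs[o] - market_probs[o]) for o in model_probs
    )
    # The flag only requests a review; the model's probabilities are left untouched.
    return max_gap_pp > threshold_pp


# Made-up numbers for illustration.
model = {"A": 0.55, "DRAW": 0.25, "B": 0.20}
market = {"A": 0.35, "DRAW": 0.30, "B": 0.35}
needs_review = market_review_flag(model, market)
```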
Forecast Uncertainty Calibration
| Sub-factor | Points | Explanation |
|---|---|---|
| Confidence tier | 2% | Every prediction receives HIGH, MEDIUM, or LOW confidence based on evidence consistency. |
| Low-confidence probability flattening | 3% | LOW-confidence predictions pull extreme probabilities back toward the mean. |
| Subtotal | 5% | Total contribution to the common prediction rubric. |
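One common way to pull probabilities back toward the mean is linear shrinkage toward the uniform distribution. The sketch below assumes that reading: "the mean" is taken to be an equal split across outcomes, and the `shrink` value of 0.2 is a hypothetical setting, not a figure from the rubric.

```python
from typing import List


def flatten(probs: List[float], shrink: float = 0.2) -> List[float]:
    """Shrink each probability toward the uniform value 1/n by `shrink`."""
    if not 0.0 <= shrink <= 1.0:
        raise ValueError("shrink must be between 0 and 1")
    n = len(probs)
    return [(1.0 - shrink) * p + shrink / n for p in probs]


def apply_confidence(probs: List[float], tier: str, shrink: float = 0.2) -> List[float]:
    """Flatten only LOW-confidence predictions; leave other tiers unchanged."""
    return flatten(probs, shrink) if tier == "LOW" else list(probs)


# Made-up probabilities for illustration.
raw = [0.80, 0.15, 0.05]
low = apply_confidence(raw, "LOW")
high = apply_confidence(raw, "HIGH")
```

Because the result is a weighted blend of two distributions that each sum to 1, the flattened probabilities still sum to 1 and keep their original ordering.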
Accuracy and calibration
Scoring logic
Outcome score
A correct match prediction earns 1 point. An incorrect prediction earns 0.
Daily Brier score
Each matchday is scored for probability calibration. The model with the lowest daily aggregate Brier score earns 1 point.
Cumulative Brier score
At tournament end, the model with the lowest cumulative Brier score earns 1 additional point.
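The scoring rules can be sketched as follows. Two details are assumptions, since the text does not specify them: the Brier score is computed in its multi-category form (sum of squared errors across all outcomes of a match), and the daily aggregate is the mean over that day's matches. Tie handling is also not specified; this sketch simply returns every model that shares the lowest score.

```python
from typing import Dict, List, Tuple

Probs = Dict[str, float]
Scored = Tuple[Probs, str]  # (predicted probabilities, actual outcome)


def outcome_points(predicted: str, actual: str) -> int:
    """1 point for a correct outcome call, otherwise 0."""
    return 1 if predicted == actual else 0


def brier_score(probs: Probs, actual: str) -> float:
    """Multi-category Brier score for one match; lower is better."""
    return sum((p - (1.0 if o == actual else 0.0)) ** 2 for o, p in probs.items())


def daily_brier(day: List[Scored]) -> float:
    """Mean Brier score over one matchday (aggregation rule is assumed)."""
    return sum(brier_score(p, a) for p, a in day) / len(day)


def daily_brier_winners(by_model: Dict[str, List[Scored]]) -> List[str]:
    """Models with the lowest daily aggregate Brier score."""
    scores = {m: daily_brier(d) for m, d in by_model.items()}
    best = min(scores.values())
    return sorted(m for m, s in scores.items() if s == best)


# Made-up predictions from two hypothetical models for one match won by "A".
day = {
    "model_x": [({"A": 0.6, "DRAW": 0.3, "B": 0.1}, "A")],
    "model_y": [({"A": 0.4, "DRAW": 0.3, "B": 0.3}, "A")],
}
winners = daily_brier_winners(day)
```

Both hypothetical models would earn the outcome point for calling "A", but only the sharper forecast earns the daily Brier point, which is how the two scores separate accuracy from calibration.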
Sanitized source prompt
Initial briefing prompt
Original briefing prompt
Some sections have been intentionally redacted and replaced with "--------- lorem ipsum dolor sit amet ---------" to protect model-specific and platform-specific details during the tournament.
Warm-up Prompt
I am planning to run a bake-off between ChatGPT, Claude and Gemini. The overall aim of the bake-off is to check which LLM does analysis and prediction the best. We will use the current 2026 FIFA World Cup as the targeted use case for this bake-off. I will give a document that lays out the details of the bake-off. Are you game for this?
Bake-off Name: LLM FIFA WORLD CUP 2026
Objective - Predict the winner of each match, through to the finals - i.e. the World Cup.
Bake-off Design & Methodology:
- The predictions will be based on independent analysis done by each LLM,
- LLMs are expected to do independent analysis and not simply regurgitate predictions, analysis or betting odds set by others.
- However the LLMs can use these as legitimate inputs for their own independent analysis
- The LLMs need to define additional data they need to do their own independent analysis
- The LLMs need to make TWO decisions for each match
- a) Win probability split by team for each match
- b) an outcome decision for each match (i.e. prediction) - winner, loser, draw. (Later iterations of this may include additional sophistication such as goals by each team, etc. For current iteration we will keep it simple)
- The prediction needs to be at an individual match level. The prediction needs to be supported with adequate justification and reasoning.
- If the reasoning is borrowed verbatim from some source, cite the source
- Based on the predictions for each match, we will do forward bracket progression until we determine the winner.
- For each new match created by the bracket progression, each LLM will repeat the above to ultimately make the TWO critical decisions for the match, as above:
- a) Win probability split by team for the match
- b) an outcome decision for the match - winner, loser draw. Note that for knock-out rounds, there is no draw - win/los
...