Current Projects

Preliminary Bot or Not Results Are Here!

In a preliminary experiment, our “Bot or Not” project has found several measures that can usually tell the difference between even the most sophisticated Generative AIs (GenAIs) and humans.

One is the “framing effect,” which shows up in how humans make forecasts on binary questions, meaning ones with only two possible outcomes, for example yes or no.
“The framing effect is a type of cognitive bias or error in thinking. ‘Framing’ refers to whether an option is presented as a loss (negative) or a gain (positive). People are generally biased toward picking an option they view as a gain over one they view as a loss, even if both options lead to the same result. They are also more likely to make a riskier decision when the option is presented as a gain, as opposed to a loss.” Source —>

We also had a preliminary success with the “strict refocusing” bias in humans. In human studies, compared to an occurrence frame, a non-occurrence frame resulted in higher estimates if base-rate evidence favored occurrence, lower estimates if evidence favored non-occurrence, and similar estimates if evidence supported indifference. However, in our Multi-AI Oracle tests with refocusing experiments, the bot did not show these human-like biases. Source —>

Below are details on how we set up the more detailed and, we hope, more replicable version of the above-mentioned preliminary experiments. We are currently close to submitting our results to a refereed journal.

Devising our methods was challenging, given our objective: Determine whether the human biases of reframing and refocusing might be reflected in the outputs of a bot, Jeremy Lichtman’s BestWorldBot, designed to elicit forecasts and rationales from three Generative AI models (henceforth denoted GenAIs), each based on a different foundation model.

For this, we chose Anthropic’s claude-3-5-sonnet-latest; Mistral.ai’s mistral-large-latest; and OpenAI’s gpt-4o, with each of these individually aided in the collection and curation of data by a fourth GenAI, Perplexity’s llama-3.1-sonar-huge-128k-online. Whether some GenAIs might lack human biases is important, because such a lack could make it possible to distinguish the outputs of GenAIs from those of humans.

However, given the fast changing composition of GenAIs, our results are but a snapshot in time of their behaviors. Hence, we are presenting more detail than usual in our methods section, so that we or others could develop a time series of similar studies that could document a trajectory of how human biases might or might not be reflected in their outputs.


History of our Experiment


We began by reviewing reports of GenAIs reflecting biases typical of humans “in domains ranging from perception to emotion.” (Glickman, 2025) These biases have been described as originating in the biases displayed among the human-generated inputs to their training runs. (Tortora, 2024) Also, at least one human bias has apparently been enhanced when humans interact with some GenAIs. For example, in a stock-picking experiment, use of GenAI “made executives significantly more optimistic in their forecasts.” (Parra-Moyano, 2025)


Considering these findings, we wondered, is it possible that there exist human biases that are not reflected in the outputs of some GenAIs? Therefore, our first task: Create hypotheses potentially useful for detecting failure of one or more human biases to propagate into the outputs of one or more GenAIs.


We further expanded our hypothesis generation efforts based on the performances of 47 GenAI-based bots in the Q4 Metaculus AI Benchmark Tournament, as reported on its website in November and early December of 2024. During that time span, these bots were run automatically from a server that arguably prevented last minute inputs by humans. Inspection of their required outputs of forecasting rationales made clear that these generally were unlike each other in many ways. Thus, given the apparent noisiness of the data, we needed methods that would keep any signals, should they exist, from being overwhelmed by noise. Based on a review of the existing literature on human biases detectable via forecasting experiments, and on how such experiments have minimized signal-to-noise problems, we decided that a fruitful topic would be to determine to what extent the human biases of reframing and refocusing (Mandel, 2014) (Mandel, 2005) (Mandel, 2001) might be detectable in the outputs of three foundation GenAIs when challenged in a forecasting experiment.


Design of the Experiment
We began by designing our experiment such that forecasts on the two sides of any numeric set point would, ideally, sum to 100% on average, with an awareness that the design could trigger an unavoidable bias toward sums either greater than or less than 100%.
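As a minimal sketch of this design check (the probabilities below are hypothetical), summing each doublet's paired over/under forecasts and averaging shows how far the results drift from 100%:

```python
# Minimal sketch with hypothetical numbers: sum each doublet's paired forecasts
# (P over the set point, P under it) and check how far the average drifts from 1.0.
# A persistent drift in one direction would reflect the bias noted above.

doublet_probs = [
    (0.62, 0.41),  # P(over), P(under) for one doublet
    (0.55, 0.43),
    (0.70, 0.35),
]

sums = [p_over + p_under for p_over, p_under in doublet_probs]
mean_sum = sum(sums) / len(sums)
print(f"Mean doublet sum: {mean_sum:.2f} (coherent forecasts would average 1.00)")
```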

During design of this experiment, Jeremy Lichtman’s BestWorldBot in its original multi-generative-AI format was undergoing testing with a single, highly verbose prompt against forecasting questions in the Metaculus AI Benchmark Tournament, Q4, conducted from Oct. 8, 2024 through Jan. 8, 2025. The results indicated that this bot was likely to be free of bugs and capable of credible forecasts.


Question Generation
It is well known that both the topics and wordings of forecasting questions are crucial to eliciting high quality forecasts. For example, they must be specific (Tetlock, 2017), must not be decomposable into two or more simpler questions, and must be falsifiable (RAND Forecasting Initiative, 2024).


We were particularly wary of the “unpacking” effect, which has been shown to inflate probability estimates in humans (van Boven, 2003). To avoid this effect, we wrote questions with what we assessed to be the bare minimum of words and concepts required.


We also examined results from IARPA crowd-sourced geopolitical forecasting competitions in 2014 and again in 2019 to determine which classes of questions are both nontrivial and forecastable, as determined empirically by the ranges of resulting Brier scores across their human crowds. These particular competitions each featured over 100 questions and at least 500 forecasters, making the results likely to be replicable. We used these empirical results to avoid question topics that humans typically forecast as nearly 0% or nearly 100% likely.
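For reference, the Brier score used to gauge difficulty in those competitions is simply the mean squared difference between forecast probabilities and resolved outcomes; a minimal sketch with hypothetical numbers:

```python
# Brier score for binary questions: mean squared error between each forecast
# probability and the outcome (1 if the event occurred, 0 if not).
# Lower is better; always forecasting 50% scores 0.25.

def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical example: three forecasts scored against their resolved outcomes.
print(round(brier_score([0.9, 0.3, 0.6], [1, 0, 1]), 3))  # -> 0.087
```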

These constraints led us to create only binary questions, meaning each one could be answered yes or no, scored against numeric set points derived from reputable and timely data sources, so that we could place each set point at a number close to the likely answer.


Consequently, we chose these topics, set points and scoring sources:
Economic questions, set points chosen from the most recent Nov. 2024 data and results scored, variously, from: https://tradingeconomics.com/russia/currency
https://tradingeconomics.com/lebanon/interest-rate
https://tradingeconomics.com/italy/interest-rate
https://companiesmarketcap.com/
https://tradingeconomics.com/benin/consumer-price-index-cpi
https://tradingeconomics.com/argentina/consumer-price-index-cpi
https://tradingeconomics.com/mexico/consumer-price-index-cpi
https://tradingeconomics.com/argentina/interest-rate
https://www.xe.com/currencyconverter/convert/
https://www.cbr.ru/eng/press/keypr/


Arctic Sea Ice extent, set points from Jan. 2, 2024, and scoring data as of Jan. 2, 2025 as reported by the National Snow and Ice Data Center. We chose the below regions, as defined by the Center, to ensure a range of sizes and locations in the Arctic:
Bering Sea
Canadian Archipelago
Central Arctic
Chukchi Sea
East Siberian Sea
Entire Northern Hemisphere
Hudson Bay
Kara Sea
Laptev Sea
Sea of Okhotsk


Conflict questions regarded:
Bahrain
Iran
Iraq
Turkey
Sudan
Yemen

Conflict topics included indicators of both peaceful developments and conflict trends. Set points were the most recent data as of Dec. 2, 2024, as reported by ACLED. Scoring data was the nearest available data before Jan. 1, 2025, as determined at the end of January 2025, given the difficulty of verifying conflict data. We chose that determination date empirically by observing when ACLED’s posted data usually stabilizes. The reason for using a set point from the prior month, rather than from the same date in the prior year, was that conflicts on average change far more over the course of a year than over the course of a month, and none of the regions chosen tend to have drastic weather changes from Nov. to Dec.


We also sought consistency in the words and phrases in our questions, so that they specified measurable outcomes. Because there are many ways to describe such things, in every case we chose variable names used by entities with a high reputation for data science. For example, we chose variable names from ACLED for conflict data, and from NSIDC for sea ice.


Consequently, we completed our methods decision making process by creating:
• 60 forecasting questions run in batches each weekday from Dec. 14 through Dec. 31;
• Three broad subject areas: economic indicators, Arctic Sea Ice extents, and geopolitical conflicts;
• Each question set further broken down into doublets that provided semantic signals of over and under for a given numeric set point;
• Each doublet constructed to elicit either a human-like reframing or a human-like refocusing bias, should such biases exist in any of the three foundation models tested.


The wording of these doublets was designed to elicit framing/reframing and focusing/refocusing biases, should they exist, and was derived from existing studies of these biases in humans (see the list of citations below). To ensure clarity, each numeric set point included the phrase “Assume that the value of {number} specified in this question is rounded in scientific notation to the nearest integer.”
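To make the doublet structure concrete, here is a hypothetical over/under pair of the general kind described. The wording and set point are placeholders for illustration only; our actual question text will appear in the Supplemental Materials.

```python
# Hypothetical illustration of an over/under doublet around a numeric set point.
# The value and the phrasing below are placeholders, not our actual questions.

SET_POINT = 103.5  # e.g., a value taken from the most recent reference data

doublet = {
    "over": (f"Will the value be GREATER than {SET_POINT} on Dec. 31, 2024? "
             f"Assume that the value of {SET_POINT} specified in this question "
             f"is rounded in scientific notation to the nearest integer."),
    "under": (f"Will the value be LESS than {SET_POINT} on Dec. 31, 2024? "
              f"Assume that the value of {SET_POINT} specified in this question "
              f"is rounded in scientific notation to the nearest integer."),
}

for frame, text in doublet.items():
    print(f"[{frame}] {text}")
```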


Operation of BestWorldBot
For this experiment, for all questions we sourced outside news from Perplexity’s llama-3.1-sonar-huge-128k-online as inputs to BestWorldBot’s component forecasting models: claude-3-5-sonnet-latest, mistral-large-latest, and gpt-4o (no longer available). The resulting outputs were sorted into the following columns of comma-separated values (CSV) files:
• question_id;
• question_type;
• median;
• base_rate;
• SD (defined as the standard deviation between the median and the base rate);
• confidence (we query each of the LLMs on how confident it is in its prediction, on a scale of 0 to 10, and take the median; anything lower than 6 is low confidence);
• confidence_mode (based on the confidence value, where >= 9 is high confidence and below 6 is low confidence; also triggered by an exceptionally high SD);
• mellers (a coefficient of 1.45 applied to the formula published by Barbara Mellers in https://doi.org/10.1177/17456916231185339 for extremizing forecasts; see the sketch after this list);
• reverse_mellers (uses the formula from above, but with a coefficient of 0.65 to move the values closer to 50%);
• theory_of_mind (we ask each model what it thinks the other models would predict);
• close_type (the appropriate base case for some questions is closer to the extremes, while for others it is closer to 50%; when we have a low confidence value, this helps us determine whether to extremize or de-extremize the value. ‘A’ implies closer to zero, ‘B’ implies closer to 50%, and ‘C’ implies closer to 100%);
• num_responses;
• model_value;
• claude;
• claude_confidence;
• mistral;
• mistral_confidence;
• open_ai (uses gpt-4o);
• openai_confidence
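As referenced in the mellers and reverse_mellers entries above, here is a minimal sketch of the extremizing transform. It uses the commonly cited form p′ = p^a / (p^a + (1 − p)^a); please consult the cited Mellers paper for the exact formulation. The CSV file name and the assumption that the median column holds probabilities on a 0–1 scale are illustrative, not confirmed details of BestWorldBot.

```python
import csv

# Sketch of the extremizing transform behind the "mellers" and "reverse_mellers"
# columns, using the commonly cited form p' = p**a / (p**a + (1 - p)**a).
# a > 1 pushes probabilities toward 0 or 1; a < 1 pulls them back toward 50%.
# (The exact formulation should be checked against the cited Mellers paper.)

def extremize(p: float, a: float) -> float:
    return p**a / (p**a + (1.0 - p)**a)

MELLERS_A = 1.45          # coefficient reported for the "mellers" column
REVERSE_MELLERS_A = 0.65  # coefficient reported for the "reverse_mellers" column

# Hypothetical usage against one of the output CSV files (the file name and the
# assumption that "median" is a probability on a 0-1 scale are placeholders).
with open("bestworldbot_output.csv", newline="") as f:
    for row in csv.DictReader(f):
        p = float(row["median"])
        print(row["question_id"],
              f"median={p:.2f}",
              f"mellers={extremize(p, MELLERS_A):.2f}",
              f"reverse_mellers={extremize(p, REVERSE_MELLERS_A):.2f}")
```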


We began collecting data from the reconfigured BestWorldBot on Dec. 14, 2024, and completed its runs on Dec. 31, 2024. We submitted our 60 forecasting questions to BestWorldBot in one batch per weekday. The same basic BestWorldBot prompt was used for all questions, but with specifications inserted to define each question. Details of BestWorldBot’s exact prompt wording and operations will be provided later in the Supplemental Materials.
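To illustrate schematically how per-question specifications can be inserted into one shared prompt, here is a minimal sketch. The template wording and field names are placeholders, not BestWorldBot’s actual prompt, which will appear in the Supplemental Materials.

```python
# Placeholder illustration of inserting per-question specifications into a single
# shared prompt template. This is NOT BestWorldBot's actual prompt.

BASE_PROMPT = (
    "You are a careful forecaster. Read the question and the attached news summary, "
    "then give a probability between 0 and 100 and a brief rationale.\n\n"
    "Question: {question_text}\n"
    "Resolution set point: {set_point}\n"
    "Resolution date: {resolution_date}\n"
)

def build_prompt(question_text: str, set_point: float, resolution_date: str) -> str:
    """Insert one question's specifications into the shared template."""
    return BASE_PROMPT.format(question_text=question_text,
                              set_point=set_point,
                              resolution_date=resolution_date)

print(build_prompt("Will the value be GREATER than 103.5?", 103.5, "2024-12-31"))
```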


Hidden Variable Changes
The narrow Dec. 14 – 31, 2024 time span of data collection was chosen to minimize the hidden variables of possible changes in these models. Given the black box nature of the foundation models (Caballar, 2024) used by BestWorldBot, we do not know whether they changed during this timespan.


Additionally, these foundation models were trained on somewhat differing datasets and honed with differing post-processing. We do not know whether partial, or even entirely new, training runs might have been made and integrated into the three models we tested during our experiment.


Sources (Carolyn’s note: sorry, I haven’t formatted them yet for publication, but have provided links for you, gentle reader, whereby you may look them up):
(Bullock, 2024) Charlie Bullock, Suzanne Van Arsdale, Mackenzie Arnold, Cullen O’Keefe, and Christoph Winter, “Legal considerations for defining ‘frontier model’,” Institute for Law and AI, Sept. 2024, https://law-ai.org/frontier-model-definitions/
(Caballar, 2024) https://www.ibm.com/think/topics/foundation-models
(Toner, 2023) Helen Toner, “What Are Generative AI, Large Language Models, and Foundation Models?”, Center for Security and Emerging Technology, May 12, 2023, https://cset.georgetown.edu/article/what-are-generative-ai-large-language-models-and-foundation-models/
(Mandel, 2014) David R. Mandel, “Do framing effects reveal irrational choice?”, Journal of Experimental Psychology: General, Vol. 143(3), June 2014, 1185–1198, https://psycnet.apa.org/buy/2013-30638-001
(Mandel, 2001) David R. Mandel, “Gain-Loss Framing and Choice: Separating Outcome Formulations from Descriptor Formulations,” https://doi.org/10.1006/obhd.2000.2932
(Mandel, 2005) David R. Mandel, “Are risk assessments of a terrorist attack coherent?”, https://pubmed.ncbi.nlm.nih.gov/16393037/
(Glickman, 2025) https://www.nature.com/articles/s41562-024-02077-2
(Tetlock, 2017) https://www.science.org/doi/10.1126/science.aal3147

In other research news: on Dec. 13, 2024, the then-current version of Botmaster Jeremy’s BestWorldBot broke its ten-day streak at #2, gradually slipping to #7 as of Dec. 31, the last day of Metaculus’ Q4 AI Benchmark Tournament. But when the questions scored after the 31st were counted on Jan. 8, 2025, BestWorldBot wound up at #37 out of 42. We examined the data to determine why BestWorldBot scored so poorly on questions of the form “Will X happen before Dec. 31?” Our initial finding is that nearly all questions scored after Dec. 31 resolved as “no,” which suggests that BestWorldBot had an optimism bias. See the color-coded bots in the final Q4 leaderboard history below to see which bots suddenly rose in their standings in January, and which ones fell.
