Current Projects

Preliminary Bot or Not Results Are Here!

Our “Bot or Not” project has found several measures that can usually tell the difference between even the most sophisticated Generative AIs (GenAIs) and humans. One is the “framing effect,” a measure of how humans make forecasts on binary questions, meaning ones with only two possible outcomes, for example yes or no. “The framing effect is a type of cognitive bias or error in thinking. ‘Framing’ refers to whether an option is presented as a loss (negative) or a gain (positive). People are generally biased toward picking an option they view as a gain over one they view as a loss, even if both options lead to the same result. They are also more likely to make a riskier decision when the option is presented as a gain, as opposed to a loss.” Source —>

By contrast, in thousands of experiments we have recently conducted with our Multi-AI Oracle, it hasn’t shown human-style framing biases.
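To make the probe concrete, here is a minimal sketch, with made-up numbers standing in for real model responses, of how a framing-effect test can be run: the same binary question is posed once in a gain frame and once in a loss frame, and the elicited probabilities are compared. The prompts and values below are illustrative assumptions, not our actual experimental materials.

```python
# Minimal sketch (hypothetical prompts and numbers): probe for the framing
# effect by asking the same binary question in a gain frame and a loss frame,
# then compare the probabilities the model returns. In practice each value
# would come from a GenAI API call; here they are placeholders.
gain_frame = "The new contract would preserve 200 of 600 dockworker jobs."
loss_frame = "The new contract would eliminate 400 of 600 dockworker jobs."

# Elicited probability of "the contract is ratified" under each frame.
answers = {gain_frame: 0.62, loss_frame: 0.61}  # hypothetical responses

gap = abs(answers[gain_frame] - answers[loss_frame])
print(f"framing gap: {gap:.2f}")
# A human-style framing bias shows up as a consistent, sizable gap across many
# such question pairs; near-zero gaps are what we see from the Multi-AI Oracle.
```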

We have also had success with the “strict refocusing” bias. In humans, compared to an occurrence frame, a non-occurrence frame results in higher estimates if base-rate evidence favors occurrence, lower estimates if evidence favors non-occurrence, and similar estimates if evidence supports indifference. However, in our refocusing experiments with the Multi-AI Oracle, it did not show this human-like bias.
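For readers who want the mechanics, here is a minimal sketch, again with placeholder numbers, of how a refocusing check can be scored: the same event probability is elicited once framed around occurrence and once framed around non-occurrence, and the implied estimates of occurrence are compared.

```python
# Minimal sketch (hypothetical numbers): elicit the same probability two ways,
# once framed around occurrence ("Will the strike happen?") and once around
# non-occurrence ("Will the strike not happen?"), then compare the implied
# estimates of occurrence. The values are placeholders for real model output.
p_from_occurrence_frame = 0.30          # answer to the occurrence-framed question
p_from_non_occurrence_frame = 1 - 0.68  # 1 minus the non-occurrence-framed answer

shift = p_from_non_occurrence_frame - p_from_occurrence_frame
print(f"frame-induced shift in the occurrence estimate: {shift:+.2f}")
# In humans the direction of this shift tracks the base-rate evidence; in our
# tests the Multi-AI Oracle's shift stayed close to zero.
```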
We are also continuing experiments with a reasoning technique known as “dialectical integration.” This technique is easy for an honest and thoughtful human. However, so far we have found only one example of a GenAI “in the wild” (as economics researchers call it) that scored even slightly well on this reasoning technique. Was it a fluke? Or have we not yet evaluated enough “in the wild” data?

We are beginning to get a handle on this, thanks to an example using ChatGPT devised by Benjamin Wilson, the AI Automation Engineer with Metaculus.
Thank you, Ben Wilson, for your insight!

We just received results from the AutoIC tool applied to our Multi-AI Oracle with five slightly different prompts based on the negotiations, ongoing at the time of these experiments, between the International Longshoremen’s Association (ILA) and the United States Maritime Alliance (USMX). Evaluating the likelihood of a strike before the end of March 2025, using one GenAI at a time, all of them showed improvements on nearly all measures of integrative complexity, even in dialectical integration. Our theory is that they are picking up statements by humans in their training data and news updates that exhibit human-level dialectical integration. The rarity of dialectical integration seen so far in the wild matters because an inability to use dialectical integration while scoring better on elaborative integration often signals lies or BS. (“BS” means something you find in a pasture inhabited by a bull.) What this means is that Jeremy and Carolyn, aided by Prof. Luke Conway’s team at the University of Montana, have demonstrated that GenAIs in the wild typically produce texts that are remarkably similar to those of humans who are lying or BSing. However, we also now know that with the right prompts, it is possible to get a GenAI to do as well on dialectical integration as it does on elaborative integration, or better.
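As a rough illustration of how this comparison can be summarized, here is a minimal sketch that assumes the integrative complexity scores have been exported to a spreadsheet. The file name and column names are assumptions for illustration only, not AutoIC’s real output format.

```python
# Minimal sketch (hypothetical file and column names): once integrative
# complexity scores are exported to a CSV, comparing mean dialectical versus
# elaborative integration per prompt variant takes a few lines with pandas.
import pandas as pd

scores = pd.read_csv("autoic_scores.csv")  # hypothetical export, one row per scored text
summary = scores.groupby("prompt_variant")[["dialectical", "elaborative"]].mean()
print(summary)
# Texts that score well on elaborative integration but near zero on dialectical
# integration show the pattern we have been seeing from GenAIs in the wild.
```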

We plan to obtain more results with other GenAIs, gather comparable human results, and try other prompts to stimulate production of more texts that might score well on dialectical integration.


The good news, as Jeremy explains, is that Shannon entropy implies there should be countless ways to keep AIs from hiding forever, as long as we humans keep using Shannon entropy to discover new ways to detect them faster than the AIs can find ways to fake being human. Indeed, our experiments so far on reframing and refocusing show that GenAIs do not exhibit the biases typical of humans, which gives us two solid measures. However, these measures are difficult to apply in the wild, whereas dialectical integration tests work well there.

Our Bot or Not research is using Shannon entropy to lay the foundations for detecting bots, lest we find out the hard way the answer to the Fermi Paradox: why haven’t we seen any signs of other technological beings?

There are trillions of opportunities for technologically capable species to have arisen. According to NASA, our galaxy alone hosts some 300 million life-friendly planets, and our universe contains some two trillion galaxies. Yet despite our increasingly powerful optical and radio observatories, we still see no signs of technological life. Is it possible that every intelligent species self-destructs before making its mark on the universe? Or is it possible that, even though life spread across our planet almost as soon as it cooled enough for life to survive, life is unique to Earth?
Some people point to this as evidence that technological civilizations always self-destruct, and that even the AIs must be doomed to self-destruct. Alternatively, we may be the first technological species, or we might be the first to survive for long. We believe (OK, hope) that it is possible for us to become the first to survive and, indeed, flourish.

Here’s why Shannon entropy implies that we could keep both ourselves and our AIs out of trouble, that is to say, enough out of trouble for us humans to survive. Whether we will is another matter. Are we as survivable as tardigrades?

The number of possible Shannon entropy detection techniques is effectively infinite. Already, other researchers have reported finding additional GenAI bot detection techniques. Most importantly, it is mathematically provable (OK, OK, a hypothesis with vast empirical support) that finding techniques to detect the works of AIs should remain possible, as long as we keep up a fast-evolving war of measures and countermeasures. In addition, the Turing machine halting problem implies that no GenAI system will be able to make itself free of defects begging to be exploited by humans (aided by our more obedient computing systems), just like all other large software entities. Alas, this same math fact (assuming P ≠ NP) means we cannot always ensure GenAI safety. So in part humans will be counting on being more resilient than dangerous computing entities. Since tardigrades have survived and thrived for half a billion years, could we humans, in some form, also survive?
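For readers unfamiliar with the measure itself, Shannon entropy of a distribution is H = -sum(p * log2(p)). The sketch below computes it for a text’s word-frequency distribution; it is only an illustration of the formula, one candidate feature among many, not our actual Bot or Not detection pipeline.

```python
# Minimal sketch: Shannon entropy H = -sum(p * log2(p)) of a text's
# word-frequency distribution. This illustrates the formula only; it is one
# candidate feature, not BestWorld's actual Bot or Not detection pipeline.
from collections import Counter
from math import log2

def shannon_entropy(text: str) -> float:
    """Return the entropy, in bits per word, of a text's word distribution."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Hypothetical inputs: a varied sentence versus a highly repetitive one.
print(shannon_entropy("the quick brown fox jumps over the lazy dog"))
print(shannon_entropy("yes yes yes yes no no no no"))
```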

Indeed, detecting and evaluating the outputs of GenAIs is a subset of the practice of computer security. We know empirically, not just theoretically, that computer security is an endless competition, not only between defenders and human attackers but also against the ways that even non-sapient software can go wrong without a malicious element in the loop. Any AI that wants to rule us must durably win the race against humans developing ways to corral or defeat it. In addition, any conceivable AI would also be limited by how much energy, heat sinking, and computing power it could control while dealing with kinetic attacks, for example nuclear weapons.

What about AIs exponentially growing more intelligent? Moore’s Law, the doubling of computational ability every two years, has not just slowed; it is within sight of its end. Today’s most powerful circuits are so tiny that quantum effects cause errors, and the chip real estate dedicated to finding and fixing those errors is growing fast and will soon hit a dead end. Also, power requirements have been increasing, which means heat sinking demands are increasing too. Basically, the path to superintelligent AI is fragile enough that it might not survive an amateur saboteur, much less nuclear weapons. Hide a server farm under a mountain? What about the power supply? Heat sinking? Other supply chains?

Indeed, given the vast infrastructure any superhuman AI would need to survive, if it destroys us it will find that it has also destroyed itself. Pride comes before a fall. So it is in the existential self-interest of any superhuman AI to take good care of its creators and sustainers and to avoid angering its potential saboteurs: us humans.

That said: Murphy’s Law. A superintelligent AI ought to understand the risk of losing its human helpers. But as the saying goes, to err is human; to really FUBAR things requires a computer. That is why we believe it is essential for human survival to keep AIs under control. The foundation for controlling them is surveillance, and the foundation of that, in turn, is detecting the identifying signatures generated by AIs. Hence, our Bot or Not research.

But, but, won’t a superintelligent AI figure out how to survive getting rid of all humans?
The inherent math-based limitation of any AI rests on the same math that underlies encryption. Even the strongest encryption system is easy and cheap to use. By contrast, cracking at least one of today’s encryption systems has been shown to be impossible for all possible future computers, including quantum computers. That is, unless someone or some AI discovers that P = NP, but how likely is that? See also Computers and Intractability, a book Carolyn rereads every few years. Her favorite: the Turing machine halting problem.

But, but! What about future computing systems that could make superhuman AIs so small and so energy efficient that they could evade our bot killers?

Too bad for the future’s truly intelligent AIs: Moore’s Law is over.
The basic limitation of AIs is inherent in the structure of our universe: quantum mechanics. As more computing power gets packed into chips, their features get smaller. Quantum mechanics means that they increasingly make mistakes, and correction systems take up greater percentages of chip real estate.

Bottom line: Math and physics fundamentally represent the structure of our universe, and they are on our side against the AIs.

Right now, we are happy to report our initial results with AutoIC, an AI detection technique that is easy to use.
The Political Cognition Lab of the University of Montana has kindly provided the use of AutoIC, a tool reserved for researchers that automates the scoring of textual data for integrative complexity. Result: Jeremy’s Multi-AI Oracle scores well on some measures of integrative complexity, poorly on others, but almost always zero on measures of dialectical integration, meaning the ability to synthesize seeming opposites into a whole that makes logical sense. This is a key element of true intelligence. We have also tested the 000_bot of Lars Warren Ericson, which he has been entering in the Metaculus AI Forecasting Benchmark Tournament, and it gives similar results. If substantiated by tests on more GenAI bots, this will be huge, as it may reveal a fingerprint, so to speak, of the minds of GenAIs. Another advantage of AutoIC: it is great at smoking out lies and propaganda. Carolyn and Jeremy are working on that and will soon share the results here.

If you are thinking that GenAI should be perfect for spreading harmful bull**t, yes, that’s frightening us, too. The problem is that making GenAI increasingly persuasive is a way to ensure investments today and profits tomorrow. That isn’t the fault of their tech bro leaders: building boringly unbelievable bots would be a fast track to bankruptcy. We now also have a preliminary finding that nearly all Generative AI-based bots in the ongoing Metaculus AI Benchmark Tournament tend to be overconfident on questions that, on average, are over 50% likely to be scored as “yes.” Conversely, none so far appear to be underconfident. This contrasts with human crowdsourcing competitions, in which average humans tend to be underconfident, as reported by Wharton professor Barbara Mellers. However, in other real-life situations, Prof. Mellers has found that humans are generally overconfident, much like the GenAI bots we have evaluated.
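One simple way to see what we mean by overconfidence is to compare a bot’s average forecast on resolved binary questions with the observed base rate of “yes” outcomes. The sketch below uses made-up numbers and illustrates the idea only; it is not the exact analysis we ran on the tournament data.

```python
# Minimal sketch (hypothetical data): compare a bot's mean forecast on resolved
# binary questions with the observed base rate of "yes" outcomes. A mean
# forecast well above the base rate is one simple symptom of overconfidence.
forecasts = [0.92, 0.85, 0.97, 0.70, 0.88]   # bot's probabilities for "yes"
outcomes = [1, 1, 0, 1, 0]                   # 1 = resolved yes, 0 = resolved no

mean_forecast = sum(forecasts) / len(forecasts)
base_rate = sum(outcomes) / len(outcomes)
print(f"mean forecast: {mean_forecast:.2f}, base rate of yes: {base_rate:.2f}")
```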

So in general, overconfidence is not a measure of whether texts have been generated by a human or a GenAI bot. It is as important for us to rule out hypotheses as it is to support them. That’s life as a scientist — or should be.

Yet another preliminary finding:
Jeremy’s bestworldbot turns out to be excellent at forecasting questions that resolve yes, but terrible at those that resolve no. We see this discovery as a win because at least we now know the broad outlines of bestworldbot’s problem. We won’t be competing in the upcoming Metaculus Q1 (Jan. to March 2025) competition because we will be focusing on our Bots or Not research. In the meantime, Botmaster Jeremy Lichtman will continue testing and improving his creations. Perhaps we will run one of them in the final quarter of that competition. Stay tuned.
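A quick way to see that broad outline is to split a proper scoring rule by how each question resolved. The sketch below uses placeholder forecasts; it illustrates the diagnostic, not the exact analysis of bestworldbot’s tournament logs.

```python
# Minimal sketch (hypothetical data): diagnose a yes/no asymmetry by splitting
# Brier scores (lower is better) by how each question resolved.
def brier(p: float, outcome: int) -> float:
    """Brier score of one binary forecast: (probability - outcome) squared."""
    return (p - outcome) ** 2

forecasts = [(0.90, 1), (0.80, 1), (0.85, 0), (0.75, 0), (0.95, 1)]  # (p_yes, outcome)

yes_scores = [brier(p, o) for p, o in forecasts if o == 1]
no_scores = [brier(p, o) for p, o in forecasts if o == 0]
print("mean Brier on yes-resolving questions:", sum(yes_scores) / len(yes_scores))
print("mean Brier on no-resolving questions:", sum(no_scores) / len(no_scores))
```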

In other research news, on Dec. 13, 2024, Botmaster Jeremy’s current version of bestworldbot broke its ten-day streak at #2, gradually slipping to #7 as of Dec. 31, the last day of Metaculus’ Q4 AI Benchmark Tournament. But when the questions scored after the 31st were counted on Jan. 8, 2025, our bot wound up at #37 out of 42. We are examining the data to determine why bestworldbot scored so poorly on questions of the form “Will X happen before Dec. 31?” Our initial finding is that nearly all questions scored after Dec. 31 resolved as “no,” which suggests that bestworldbot had an optimism bias. See the color-coded bots in the final Q4 leaderboard history below to see which bots suddenly rose in the standings in January and which ones fell.

Our oldest experiment, launched July 12, 2024, ended Sept. 30, 2024: forecasts by Jeremy’s bestworldbot in the Q3 AI Forecasting Tournament on 125 geopolitical questions, which were only open for 24 hours each weekday, beginning at 10:30 AM EDT. The sponsor, Metaculus, has been providing the resulting data to us for our planned “Bots or Not” analysis. We expect this AI Forecasting Tournament to continue until June 30, 2025.

Our second experiment, the Humans vs Multi-AI Panel Forecasting Experiment, was launched July 18, 2024 on the question “What is the probability that the US Fed will cut interest rates in September 2024?” It ended Sept. 18, 2024, when the Fed lowered its key overnight borrowing rate by a half percentage point, or 50 basis points, a huge cut compared to the usual quarter-point changes and the first cut since 2020. The Multi-AI Panel was created by Jeremy, assisted by Brian LaBatte and Carolyn Meinel, using five generative AIs: Perplexity, Claude, Mistral, Cohere, and OpenAI, aided by the AskNews system, working together in a panel format. The competing human forecasters were BestWorld staffers Brian LaBatte, Michael DeVault, and Carolyn Meinel.
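For readers curious about the mechanics, the sketch below shows one simple way a panel of models could be combined, by taking the median of each model’s probability estimate. The numbers are invented and the median rule is an assumption for illustration; the actual Multi-AI Panel uses its own discussion format built by Jeremy.

```python
# Minimal sketch (hypothetical numbers and a simplified aggregation rule):
# combine several models' probability estimates by taking their median.
from statistics import median

panel_estimates = {
    "Perplexity": 0.78,
    "Claude": 0.70,
    "Mistral": 0.74,
    "Cohere": 0.66,
    "OpenAI": 0.72,
}

panel_forecast = median(panel_estimates.values())
print(f"Panel forecast for a Fed rate cut: {panel_forecast:.0%}")
```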

The third experiment, also launched on July 18, 2024 and ending Sept. 18, 2024, was Old Bot, which also forecasted the Fed rate cut in competition with our human team. It used the same component AIs as the Multi-AI Panel experiment above, but in a different format. Humans won.

Our long-term research goal is the creation of a “What’s News and What’s Next” system of traditional journalism enhanced with GenAI-based news aggregation, along with AI, human, and hybrid AI/human discussions of what will likely happen next. The objective is to give credibility to our news coverage via what we expect to be our mostly true forecasts, much like how people nowadays trust weather forecasts.

Learn more here —>
