AI passed a classic Turing test by mastering the art of human small talk

In a landmark three-party Turing test, researchers found that suitably prompted AI models could pass as human in short text conversations, raising new questions about intelligence, deception, and the social risks of “counterfeit people.”

Study: Large language models pass a standard three-party Turing test. Image Credit: Jesus Sanz / Shutterstock

Study: Large language models pass a standard three-party Turing test. Image Credit: Jesus Sanz / Shutterstock

In a recent study published in the journal PNAS, researchers tested four AI systems, including three large language models (LLMs) and the rules-based chatbot ELIZA, against human participants to determine whether they could pass the classical three-party Turing test.

Study findings revealed that models configured with a specific humanlike persona prompt were often judged to be human, with the models' win rates at or above human baselines. The authors noted that success relied heavily on stylistic and socio-emotional traits rather than raw intelligence, providing evidence that artificial intelligence (AI) models can effectively substitute for humans in short interactions.

Three-Party Turing Test Background

Originally proposed by Alan Turing in 1950 as the imitation game, and now typically referred to as “the Turing test,” this is an operational assessment of machine behavior. Traditional AI evaluations rely on static benchmarks that frequently capture pattern memorization rather than open-ended reasoning. In contrast, the Turing test provides a dynamic, adversarial standard for evaluating a machine’s ability to emulate human intellect.

While recent literature has explored simplified two-party iterations in which an interrogator communicates with an isolated witness, Turing's original three-party formulation provides a more rigorous baseline by forcing a direct, simultaneous comparison between a human and a machine control, preventing judges from relying solely on internal, abstract models of human conversational dynamics.

While past attempts using older systems failed, machines were correctly identified as such more than 90% of the time. The rise of generative large language models (LLMs) has sparked renewed interest in this milestone.

LLM Turing Test Study Design

This study aimed to determine whether the latest generations of generative LLMs could be judged as human by interrogators. The study evaluated four distinct AI systems: ELIZA, a rules-based baseline; GPT-4o; LLaMa-3.1-405B; and GPT-4.5.

The evaluation comprised 1,023 randomized, controlled games across two independent experimental populations: a university pool of 126 undergraduate students (447 games; mean age = 20.9, SD = 1.57) and a pool of 158 Prolific platform workers (576 games; mean age = 39.1, SD = 12.1). The experimental evaluation was conducted by human interrogators using a split-screen interface.

Interrogators specifically conducted simultaneous 5-minute text conversations with one human witness and one AI witness, assigning a binary verdict at the conclusion. The study’s primary endpoint was the model's "win rate," which was defined as the proportion of trials in which the interrogator misidentified the machine as the human participant.

Notably, prompting conditions were split into a baseline "NO-PERSONA" condition and a "PERSONA" condition. The latter condition involved instructing the model to adopt a young, introverted persona knowledgeable about internet culture and fluent in slang. The experiment was subsequently replicated among 205 Prolific participants using LLaMa-3.1-405B and the updated GPT-5 model. The replication extended the interaction time limit to 15 minutes and tracked 396 games.

Four example games from two populations: (A) Prolific, (B) Undergraduate, (C) Undergraduate, and (D) Prolific. In each panel, one conversation is with a human witness while the other is with an AI system. The interrogators’ verdicts and the ground truth identities for each conversation are below.∗ A version of the experiment can be accessed at turingtest.live.

Four example games from two populations: (A) Prolific, (B) Undergraduate, (C) Undergraduate, and (D) Prolific. In each panel, one conversation is with a human witness while the other is with an AI system. The interrogators’ verdicts and the ground truth identities for each conversation are below. A version of the experiment can be accessed at turingtest.live.

Persona Prompting Performance Findings

The study’s empirical results revealed that persona prompting significantly enhances machine indistinguishability. In the 5-minute trials, GPT-4.5-PERSONA achieved a combined win rate of 73% (69.2% for undergraduates and 75.5% for Prolific), which statistically outperformed human participants (P < 0.001). LLAMA-PERSONA achieved an overall win rate of 56%, 45.4% for undergraduates, and 64.7% for Prolific, which were not statistically distinguishable from those of human participants (P = 0.50).

Conversely, the results revealed that without persona prompts, model performance substantially dropped: GPT-4.5-NO-PERSONA and LLAMA-NO-PERSONA yielded win rates of 36% and 38%, respectively, falling below chance in the undergraduate pool (P <= 0.001). ELIZA served as a manipulation check, scoring significantly below chance across both groups, z = -4.48, P < 0.001 for undergraduates; z = -3.71, P < 0.001 for Prolific.

Notably, in the 15-minute replication, an extended conversation did not prevent persona-prompted models from passing by the indistinguishability criterion. GPT-5-PERSONA was found to achieve a win rate of 59.3%, remaining marginally significantly above chance after correction, z = -2.27, P = 0.023, Pcorrected = 0.05. LLAMA-PERSONA maintained a 55.6% win rate, which was not significantly above chance for the stronger test of outperforming human witnesses.

Furthermore, summary statistics revealed that interrogator demographic variables did not consistently predict discrimination accuracy (P > 0.14), suggesting that these models could mislead multiple participant groups and that the outcomes were not biased by a specific interrogator-derived dataset.

AI Human Imitation Implications

The present study is the first to provide statistically robust, replicated evidence that modern LLMs can pass a standard three-party Turing test when configured with a specific humanlike persona prompt. Categorical analysis of interrogator reasoning revealed that human validation processes focus heavily on linguistic style (27%) and interactional dynamics (23%), rather than on pure logical or mathematical reasoning capacity.

These findings suggest that social intelligence is increasingly viewed as a key differentiator of human identity. Furthermore, the results raise concerns about potential social and economic risks regarding the deployment of "counterfeit people" capable of deceptive automated interactions. Future research should evaluate whether specialized AI experts can improve human discriminative accuracy over extended timelines.

Journal reference:
Hugo Francisco de Souza

Written by

Hugo Francisco de Souza

Hugo Francisco de Souza is a scientific writer based in Bangalore, Karnataka, India. His academic passions lie in biogeography, evolutionary biology, and herpetology. He is currently pursuing his Ph.D. from the Centre for Ecological Sciences, Indian Institute of Science, where he studies the origins, dispersal, and speciation of wetland-associated snakes. Hugo has received, amongst others, the DST-INSPIRE fellowship for his doctoral research and the Gold Medal from Pondicherry University for academic excellence during his Masters. His research has been published in high-impact peer-reviewed journals, including PLOS Neglected Tropical Diseases and Systematic Biology. When not working or writing, Hugo can be found consuming copious amounts of anime and manga, composing and making music with his bass guitar, shredding trails on his MTB, playing video games (he prefers the term ‘gaming’), or tinkering with all things tech.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Francisco de Souza, Hugo. (2026, May 20). AI passed a classic Turing test by mastering the art of human small talk. News-Medical. Retrieved on May 21, 2026 from https://www.news-medical.net/news/20260520/AI-passed-a-classic-Turing-test-by-mastering-the-art-of-human-small-talk.aspx.

  • MLA

    Francisco de Souza, Hugo. "AI passed a classic Turing test by mastering the art of human small talk". News-Medical. 21 May 2026. <https://www.news-medical.net/news/20260520/AI-passed-a-classic-Turing-test-by-mastering-the-art-of-human-small-talk.aspx>.

  • Chicago

    Francisco de Souza, Hugo. "AI passed a classic Turing test by mastering the art of human small talk". News-Medical. https://www.news-medical.net/news/20260520/AI-passed-a-classic-Turing-test-by-mastering-the-art-of-human-small-talk.aspx. (accessed May 21, 2026).

  • Harvard

    Francisco de Souza, Hugo. 2026. AI passed a classic Turing test by mastering the art of human small talk. News-Medical, viewed 21 May 2026, https://www.news-medical.net/news/20260520/AI-passed-a-classic-Turing-test-by-mastering-the-art-of-human-small-talk.aspx.

Comments

The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of News Medical.
Post a new comment
Post

While we only use edited and approved content for Azthena answers, it may on occasions provide incorrect responses. Please confirm any data provided with the related suppliers or authors. We do not provide medical advice, if you search for medical information you must always consult a medical professional before acting on any information provided.

Your questions, but not your email details will be shared with OpenAI and retained for 30 days in accordance with their privacy principles.

Please do not ask questions that use sensitive or confidential information.

Read the full Terms & Conditions.

You might also like...
Novel screening method boosts early diagnosis of leprosy