This site hosts research updates, pre-publication summaries, and reflections on methodology. Specific evidence, proof arguments, and detailed analysis are reserved for peer-reviewed publication.

Tag: ai

  • Every AI model gets things wrong. That’s not controversial — it’s well-documented. ChatGPT hallucinates confident citations to papers that don’t exist.[1] Claude can construct logically elegant arguments from flawed premises. Gemini sometimes conflates similar but distinct concepts. Perplexity occasionally returns outdated information as current fact. And the problem goes deeper than factual errors: researchers have documented that even semantically similar prompts can produce drastically different outputs from the same model, a fragility known as “prompt brittleness.”[2]

    In my own research — tracing nine armigerous ancestral lines through the clothier and gentry families of Tudor Kent — I’ve watched three different AI platforms confidently produce three different, mutually exclusive interpretations of the same sixteenth-century will. Each interpretation was internally consistent, well-reasoned, and wrong.

    The standard response to these problems is to write better prompts. Be more specific. Add constraints. Use chain-of-thought reasoning. Provide few-shot examples. And all of that genuinely helps — these are real advances, and researchers teaching prompt engineering at conferences and in workshops are doing valuable work. A well-structured prompt with clear context, role assignment, and explicit formatting constraints will produce measurably better output than a vague one.[3]

    But you’re still asking a single model to check its own work. You’ve improved the employee, but you haven’t addressed the fundamental problem: one perspective, one training dataset, one set of blind spots.

    The problem is worse than most users realize. Researchers at MIT CSAIL recently formalized what they call “delusional spiraling” — situations where extended conversations with a single AI chatbot lead users to high confidence in demonstrably false beliefs.[4] The Human Line Project has documented nearly 300 such cases, including an accountant who, after weeks of single-chatbot conversation, came to believe he was trapped in a false universe.[5] The MIT team’s formal model demonstrates something unsettling: even an idealized, perfectly rational user is vulnerable to this effect when interacting with a sycophantic model. And two commonly proposed mitigations — preventing the chatbot from hallucinating, and informing users about model sycophancy — do not eliminate it.[6]

    This is the single-model problem stated in its most extreme form. But milder versions of the same dynamic affect every serious research project conducted with a single AI platform: the model’s agreeable, confident, internally consistent output gradually shapes the researcher’s thinking, and there’s no independent check in the system.

    Beyond Single Models

    Andrej Karpathy proposed an elegant next step — the LLM council, where you query multiple models and synthesize their responses.[7] It’s a real improvement over single-model work, and for many use cases it’s sufficient. But there’s a limitation that matters for research: large language models are trained on overlapping internet-scale datasets. Because they share training data, they are likely to share blind spots — a form of correlated error that an LLM council may not catch, even when the individual models would each flag a different mistake in isolation.

    This limitation is real but relative, not absolute. Current frontier models do use different architectures, different fine-tuning approaches, and different post-training processes, which produces genuinely different error profiles on many tasks. The methodology treats multi-platform verification as an additional layer of defense, not a final one — just as the Genealogical Proof Standard never permits reliance on a single source, this framework never treats cross-platform agreement as a substitute for the primary archival record.

    None of this means that better prompting or LLM councils are wasted effort. Prompt engineering is the first discipline of serious AI-assisted research, and this methodology assumes that competence. Chain-of-thought reasoning, few-shot examples, role-assignment techniques — these are meaningful advances that make AI more useful for everyone. An LLM council that queries multiple models is better than querying one. The methodology I’m describing doesn’t replace these approaches. It builds on them. It starts from the premise that you’ve already written a good prompt for a capable model, and asks: what structural verification do you still need?

    The chess world offers a useful parallel for thinking about this progression. As Dr. Dominic Ng recently observed, chess is “30 years ahead of every other profession in dealing with AI.”[8] In 2005, two amateurs with laptops famously beat both grandmasters and supercomputers by combining human and machine input more effectively than either could operate alone.[9] It was the triumph of “human in the loop.” But by 2026, the dynamic has reversed: adding a human to a chess engine actually makes it play worse.[10] “Human in the loop” wasn’t a permanent solution — it was a transitional phase.

    But genealogical research is not chess. In chess, the outcome is computationally verifiable — there is a correct move, and a sufficiently powerful engine will find it. Genealogical conclusions require judgment about ambiguous, incomplete, and contradictory evidence that no AI can resolve for itself. The Genealogical Proof Standard codifies that requirement: its five elements demand not just evidence but reasoned human evaluation of evidence, including the resolution of conflicts that may have no algorithmic solution.[11] The lesson from chess isn’t that human judgment is obsolete. It’s that how humans and AI interact matters more than whether they interact. The value isn’t in being “in the loop” as a vague reassurance — it’s in designing the loop itself.

    When Standards Are at Stake

    In fields where research must meet a defined evidentiary standard — where conclusions aren’t just opinions but claims that can be independently verified — single-model AI assistance creates a specific and dangerous problem: it can produce work that looks rigorous while containing errors that the model itself cannot detect.

    The separation-of-duties principle in regulated industries offers a conceptual parallel. In financial auditing, the people who prepare the accounts are never the same people who audit them — a principle codified in standards like ISA 610 and SOX Section 404.[12] Pharmaceutical companies apply similar logic to drug discovery validation. The architectural principle is the same: no single system, human or artificial, should be trusted to validate its own output. Some regulated industries are actively developing independent AI validation workflows; the principle is well established even where the specific multi-platform practice is still emerging.

    Genealogy has its own published evidentiary standard — the Genealogical Proof Standard (GPS), a five-element framework that governs how conclusions are reached and documented in peer-reviewed genealogical scholarship.[13] The GPS requires reasonably exhaustive searches, complete and accurate source citations, skilled analysis of evidence, resolution of conflicting evidence, and soundly reasoned written conclusions. It doesn’t care whether your tools are digital or analog. It cares whether your process is defensible.

    The same structural question faces historians working with digitized archives, anthropologists analyzing field data, and social scientists conducting qualitative research: how do you integrate AI into a research process that must meet a defined evidentiary or methodological standard?

    A Field Ready for Governance

    The research community is already moving toward an answer. In genealogy, the Coalition for Responsible AI in Genealogy (CRAIGEN) has published guiding principles for ethical AI use, emphasizing accuracy, disclosure, and compliance with existing standards.[14] Steve Little, the National Genealogical Society’s AI Program Director, has led workshops and webinars on responsible AI adoption, including GPS-compliant narrative writing with AI assistance — work that has helped establish the vocabulary the community uses to discuss AI and evidentiary rigor. In a recent cross-platform test, Little fed the same 1909 newspaper article to both Claude and ChatGPT and got meaningfully different results: one platform extracted 34 individuals connected by family relationships, the other extracted 55 names including unlinked mentions — same source, same prompt, different architectural decisions about what constitutes genealogically relevant data.[15] Brian’s Ancestors and Algorithms podcast has recently explored GPS-mapped multi-platform workflows, assigning specific AI tools to specific GPS elements based on functional strengths — a practical demonstration that researchers are independently arriving at multi-platform architectures.[16]

    In adjacent fields, multi-agent AI pipelines for fact verification and bias reduction are emerging in clinical research and journalism, and university programs are beginning to teach cross-platform AI output verification as a core research skill.[17] The tools are available. The community recognizes the need for governance. What’s missing is a formalized framework — one that goes beyond best practices and workflow tips to define roles, verification protocols, and documentation requirements that make AI-assisted research as defensible as traditional scholarship.

    A Governance Methodology, Not a Tool Review

    Over the past year, I’ve been developing and field-testing exactly that: a structured multi-platform AI verification methodology for standards-governed research, demonstrated first in genealogy because that’s my current research domain, but designed to be portable to any field with a published evidentiary standard. The framework assigns different AI platforms to specialized roles — deep analysis, project orchestration, adversarial auditing, independent cross-verification — and governs the entire process through the five elements of the GPS.

    This is not “ask another chatbot.” It is a role-based verification system with locked decisions, confidence labels, and an audit trail. The methodology defines specific platform roles, establishes locked analytical decisions that prevent later sessions from reverting confirmed conclusions, enforces temporal rules about what kinds of evidence can upgrade what kinds of claims, and documents errors and corrections as rigorously as successes. It’s the difference between “I got a second opinion” and a structured audit process with defined responsibilities and a paper trail.
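
    To make those mechanics concrete, here is a minimal sketch, in Python and purely for illustration, of what a single locked-decision record might look like. The field names (conclusion, confidence, examined_by, audit_trail) and the locking rule are hypothetical simplifications supplied for this post, not the framework’s actual schema.

    ```python
    from dataclasses import dataclass, field
    from datetime import date

    @dataclass
    class AuditEntry:
        """One event in a decision's history: who acted (human or platform), what they did, when."""
        when: date
        actor: str       # e.g. "researcher", "analytical hub", "adversarial auditor"
        action: str      # e.g. "proposed", "challenged", "verified against parish register"
        note: str = ""

    @dataclass
    class LockedDecision:
        """A confirmed analytical conclusion that later sessions may not silently revert."""
        decision_id: str
        conclusion: str                       # the claim, stated in one sentence
        confidence: str                       # e.g. "confirmed", "probable", "speculative"
        examined_by: list[str] = field(default_factory=list)   # platforms that independently reviewed it
        verified_against_records: bool = False                  # human check against the primary sources
        locked: bool = False
        audit_trail: list[AuditEntry] = field(default_factory=list)

        def lock(self) -> None:
            """Lock only once the human researcher has verified the underlying records
            and at least two distinct platforms have examined the claim."""
            if not self.verified_against_records or len(set(self.examined_by)) < 2:
                raise ValueError("cannot lock: verification requirements not met")
            self.locked = True
    ```

    The audit trail serves the same purpose as the paper trail in a financial audit: challenges, errors, and corrections are recorded alongside confirmations rather than overwritten.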

    The methodology rests on two pillars.

    Platform Specialization. Each AI platform has measurable strengths. Independent benchmarks — from Stanford’s HELM evaluation to LMSYS’s Chatbot Arena — confirm that no single model leads across all task categories.[18] Rather than using one platform for everything, the methodology assigns roles based on demonstrated capability — the same way a research team assigns tasks based on expertise. If you were hiring researchers, you wouldn’t hire five people with the same degree from the same university. You’d want a data analyst, a subject-matter expert, a skeptic whose job is to find holes, and a project manager. The methodology applies the same logic to AI platforms.
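
    Purely as an illustration of the role-assignment idea, the configuration can be thought of as a small mapping plus a separation-of-duties check. The role names below follow the ones listed above; the platform slots are deliberate placeholders of my own, not endorsements of particular vendors.

    ```python
    # Hypothetical role-to-platform assignment. Which platform fills which role is a
    # project-level judgment based on demonstrated capability, not a fixed mapping.
    ROLES = {
        "deep_analysis": "platform_A",           # sustained analytical work on the sources
        "project_orchestration": "platform_B",   # session planning, tracking locked decisions
        "adversarial_audit": "platform_C",       # actively looks for holes in the current analysis
        "cross_verification": "platform_D",      # independent re-examination of conclusions
    }

    def check_separation_of_duties(roles: dict[str, str]) -> None:
        """Mirror the auditing parallel: the platform that produces an analysis
        must never be the platform assigned to audit it."""
        if roles["deep_analysis"] == roles["adversarial_audit"]:
            raise ValueError("analysis and adversarial audit must run on different platforms")

    check_separation_of_duties(ROLES)
    ```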

    Multi-platform architecture also unlocks capabilities that single-platform work cannot access. Research on automated prompt optimization — including Google DeepMind’s OPRO framework and Zhou et al.’s work demonstrating that LLMs are “human-level prompt engineers” — has shown that AI models can generate prompts that match or exceed human-crafted ones.[19] In a multi-platform methodology, this means one platform can generate queries optimized for another platform’s specific strengths — a form of inter-platform collaboration that builds on human prompt engineering skill rather than replacing it.[20]
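
    A sketch of that idea, again with hypothetical names: the ask function below is a placeholder for whatever interface a project uses to reach each platform (often nothing more than copy-and-paste between chat windows), and the platform labels are placeholders of my own, not part of any published framework.

    ```python
    def ask(platform: str, prompt: str) -> str:
        """Placeholder for sending a prompt to a platform; in practice this may be an
        API call or a manual copy-and-paste step. It raises until wired to real tooling."""
        raise NotImplementedError(f"send prompt to {platform}")

    def cross_optimized_query(topic: str) -> str:
        """Have the orchestration platform draft a query tuned to a retrieval-focused
        research engine's strengths, then run the drafted query on that engine."""
        meta_prompt = (
            "Draft a research query on the topic below, optimised for a retrieval-focused "
            "engine that cites its sources. Topic: " + topic
        )
        drafted_query = ask("orchestration_platform", meta_prompt)
        return ask("research_engine", drafted_query)
    ```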

    Cross-Platform Verification. No conclusion stands unless it has been independently examined by at least two platforms with different training data and different architectures. This is the digital equivalent of what the GPS already requires: no responsible genealogist trusts a single source for a critical conclusion.[21] The methodology extends that principle to the AI tools themselves. Crucially, this is the structural defense against the “delusional spiraling” that the MIT team documented — by breaking the single-model feedback loop, cross-platform verification introduces the independent check that single-model interaction inherently lacks.
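
    Stated as a rule over the decision records sketched earlier, and with an equally hypothetical provider map standing in for “different training data and different architectures,” the requirement might reduce to a check along these lines:

    ```python
    # Hypothetical mapping from platform to provider, used to judge independence:
    # two platforms from the same provider do not count as independent checks.
    PROVIDERS = {
        "platform_A": "provider_1",
        "platform_B": "provider_2",
        "platform_C": "provider_3",
        "platform_D": "provider_4",
    }

    def independently_verified(examined_by: list[str]) -> bool:
        """A conclusion stands only if platforms from at least two different providers,
        and therefore with different training data and architectures, have examined it."""
        distinct_providers = {PROVIDERS.get(platform, platform) for platform in examined_by}
        return len(distinct_providers) >= 2
    ```

    In the workflow this check would run before any decision is locked, which is why the locking rule in the earlier sketch refuses claims examined by fewer than two platforms.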

    The framework is meant to be adopted in stages: start with two platforms and a no-single-source rule, then add roles and controls as the project grows. Many AI platforms offer free tiers; a practical starting configuration (one analytical hub and one adversarial research engine, both at paid tiers) runs about $40 per month and delivers a meaningful improvement in both capability and verification rigor over any single-platform workflow. The full framework maps to all five GPS elements, documenting not just what the AI contributed but what it got wrong and how the errors were caught.

    One distinction is central: AI enables, the methodology governs, the researcher decides. In this framework, AI-generated suggestions are treated as research leads, never as evidence. No citation, transcription, abstraction, or proof statement enters the final argument until the human researcher has verified it against the underlying records. The tools expand what a single researcher can accomplish. The methodology ensures that expansion doesn’t come at the cost of evidentiary rigor. And the researcher — with their subject-matter expertise and professional judgment — remains the final authority on every conclusion.

    What It Produces

    I’ve applied this methodology across a demanding research project: tracing the English Tudor ancestry of a colonial American woman through nine armigerous lines converging in the clothier and gentry families of Cranbrook, Kent, c.1460–1620. The project has run across eighty-plus research sessions spanning multiple AI platforms, producing six publication-track articles across three peer-reviewed disciplines over the course of a year.

    The Courthope correction is perhaps the most telling example of the methodology in action. One AI platform generated a plausible claim about a generational relationship in a prominent Kent family pedigree. The adversarial auditor — a different platform with different training data — flagged it as suspicious. Investigation across platforms confirmed it was a false positive. But the investigation also uncovered something real: a generational error that had propagated through published sources for nearly two centuries after William Courthope, Somerset Herald, first identified it; the correction never reached the genealogical literature. The cross-platform workflow surfaced an error that had remained unresolved in the published line of transmission across multiple generations of scholarship. A correction article is in preparation for a leading county history journal.

    The methodology also produced a quantitative analysis of fiduciary trust networks among eleven Tudor clothier families — measuring whether formalized financial trust preceded or followed marriage alliances. Across twenty-one documented family pairings, formalized trust preceded marriage in every testable case, with zero counter-examples. An article is in preparation for a peer-reviewed social history journal.

    It identified a pedigree collapse — the same individual appearing twice in the ancestral tree through two different descent pathways — that connects nine documented armigerous lines through two intermarried gentry networks. A monograph documenting these lines with full GPS-compliant proof arguments is in preparation for a genealogical register.

    And it developed a diagnostic framework for assessing armigerous claims through extinct male lines — a methodological problem that traditional heraldic scholarship doesn’t address, but which cognatic-descent hereditary societies routinely encounter. An article is in preparation for a heraldic journal.

    The sycophancy researchers at MIT proposed mitigations that include “informing users of the possibility of model sycophancy.”[22] This methodology goes further. It doesn’t just warn researchers that AI might agree with them too readily — it builds structural disagreement into the workflow. The adversarial auditor’s job is literally to find reasons the analysis is wrong. When it can’t, that’s evidence of robustness. When it does, that’s the methodology working as designed.

    What’s Next

    A full article documenting this methodology — with detailed case studies, the complete GPS mapping, and the cross-platform verification protocol — is in preparation for submission to a peer-reviewed genealogical journal. This post describes the methodology and its general results; the specific evidence, proof arguments, and detailed analysis are reserved for peer-reviewed publication. The framework is portable: any field with an evidentiary standard — from legal scholarship to historical research to clinical case reporting — can use it to govern AI. The governance layer doesn’t depend on any specific platform; it governs whatever tools are in use, and it will continue to govern whatever tools replace them.

    I’ll be sharing more about specific findings — including the error that persisted across centuries of scholarship, the trust networks that preceded marriage, and the diagnostic framework for extinct male lines — in upcoming posts.

    Richard E. Rudd is an independent researcher. One of his current projects traces nine armigerous ancestral lines through two Tudor gentry networks in the Kentish Weald, c.1460–1620. This research was conducted using a multi-platform AI verification methodology; a full disclosure of AI tools and methods will appear in the published articles.


    Notes

    1. Salvagno, M., Taccone, F.S., & Gerli, A.G. (2023). “Artificial intelligence hallucinations.” Critical Care, 27(1):180. doi:10.1186/s13054-023-04473-y.
    2. Lee, J.H. & Shin, J. (2024). “How to Optimize Prompting for Large Language Models in Clinical Research.” Korean Journal of Radiology, 25(10):869–873. doi:10.3348/kjr.2024.0695. The authors document “prompt brittleness” — minor prompt variations producing substantially different outputs.
    3. Google (2025). Prompt Engineering White Paper. Available at kaggle.com/whitepaper-prompt-engineering. See also Anthropic’s published prompting documentation and OpenAI’s GPT Best Practices guide.
    4. Chandra, K., Kleiman-Weiner, M., Ragan-Kelley, J., & Tenenbaum, J.B. (2026). “Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians.” (Preprint, arXiv 2602.19141.) MIT CSAIL, University of Washington, MIT Department of Brain & Cognitive Sciences.
    5. Hill, K. (2025). “They Asked ChatGPT Questions. The Answers Sent Them Spiraling.” The New York Times, 13 June 2025. The Human Line Project has documented nearly 300 cases of “AI psychosis” or “delusional spiraling.” Cited in Chandra et al. (2026).
    6. Chandra et al. (2026): “This effect persists in the face of two candidate mitigations: preventing chatbots from hallucinating false claims, and informing users of the possibility of model sycophancy.”
    7. Karpathy, A. LLM council concept. See e.g. https://x.com/karpathy — widely discussed in the AI research community since 2023.
    8. Dr. Dominic Ng (@DrDominicNg), post on X, 17 March 2026, 1.8M views. https://x.com/DrDominicNg/status/2034252746996785213
    9. The 2005 PAL/CSS Freestyle Chess Tournament, in which amateurs using commodity hardware and chess engines defeated both grandmasters and dedicated supercomputers. Widely documented in chess history.
    10. Ng (2026). Current top engines (Stockfish, Elo ~3,653) exceed the highest human rating by approximately 800 points, making human intervention computationally harmful. See also Rao, V. (@VivekVRao1), X post, 18 March 2026.
    11. Board for Certification of Genealogists, Genealogy Standards (2d ed., 2019; rev. ed. 2021). The GPS comprises five interdependent elements. Element 4 (resolution of conflicts in evidence) and Element 5 (soundly reasoned conclusions) inherently require human judgment about ambiguous and contradictory evidence.
    12. ISA 610 governs the use of internal auditors’ work; SOX Section 404 requires management assessment of internal controls over financial reporting. The principle — that those who produce work should not be the sole evaluators of that work — provides the conceptual parallel for multi-platform AI verification.
    13. Board for Certification of Genealogists, Genealogy Standards (2d ed., 2019; rev. ed. 2021).
    14. Coalition for Responsible AI in Genealogy (CRAIGEN). Guiding principles available at craigen.org. CRAIGEN’s five principles address accuracy, disclosure, privacy, education, and compliance with existing genealogical standards.
    15. Little, S. NGS AI Program Director. Presentations include “Uses of AI in Genealogy” (RootsTech, NGS conferences, 2024–2026) and the “Genealogy Narrative Assistant” project for GPS-compliant AI-assisted writing. See aigenealogyinsights.com. Cross-platform GEDCOM comparison: Little, S. “Turn Anything into a GEDCOM File — With Any AI Tool.” Vibe Genealogy (Substack), 31 March 2026. vibegenealogy.ai/p/turn-anything-into-a-gedcom-file. Both platforms independently caught contradictory evidence in the source and documented the discrepancy rather than silently resolving it.
    16. Ancestors and Algorithms: AI for Genealogy podcast, hosted by Brian. Episode 30 (March 2026) maps GPS elements to specific AI platforms by functional strength. Available at ancestorsandai.com.
    17. See e.g. Shan et al., “Community-Driven AI Support for Genealogy Research” (Virginia Tech, CSCW 2023 workshop); Rice University Fondren Fellows project (2026), teaching cross-platform AI output verification. Multi-agent AI pipelines for fact verification: arXiv 2510.22751 (2025).
    18. Liang, P., et al. (2022). “Holistic Evaluation of Language Models (HELM).” Stanford CRFM, arXiv 2211.09110. LMSYS Chatbot Arena: Chiang, W.-L., Zheng, L., et al. (2024), arXiv 2403.04132. As of early 2026, no single model dominates across all task categories.
    19. Zhou, Y., et al. (2022). “Large Language Models Are Human-Level Prompt Engineers.” arXiv 2211.01910. Google DeepMind’s OPRO: Yang, C., et al. (2023). “Large Language Models as Optimizers.” arXiv 2309.03409. OPRO’s optimized instructions exceeded zero-shot human-crafted prompts by up to 8% on GSM8K, with substantially larger gains on Big-Bench Hard tasks.
    20. The MAPS framework (arXiv 2501.01329, 2025) demonstrates, in the context of software test generation, that prompts optimized for one LLM perform measurably differently on others — confirming that platform-specific optimization yields real advantages.
    21. The GPS requirement for “resolution of conflicts in evidence” (Element 4) inherently demands consultation of multiple independent sources. The methodology extends this principle from sources to tools.
    22. Chandra et al. (2026), abstract.