Prior Work: What Researchers Have Already Found
Before designing the 10XF benchmark, we surveyed the existing literature. There is more than you would expect — mostly from the last twenty-four months — and almost none of it tests the marketplace-leader application. The work falls into three groups.
Capability and Theological Benchmarks
The Gospel Coalition's AI Christian Benchmark (2025) is the closest precedent for this work. It tested seven models — DeepSeek R1, Perplexity, Gemini, GPT-4o, Grok, Claude Sonnet, and Llama — against seven core theological questions, scored by orthodox theologians. DeepSeek R1 scored highest, with answers most aligned to the Nicene Creed. Claude Sonnet was, in their words, "surprisingly disappointing." Llama scored worst, defaulting to brief, overly qualified answers. The Gospel Coalition's central editorial point: human alignment processes have a heavy hand in shaping these outputs, and reasonable Christians should expect different models to handle theology differently.
FaithBench publishes 300+ test cases across six dimensions, including distinctions among literal, allegorical, typological, and redemptive-historical hermeneutical approaches. Its framing is academic, not pastoral.
Benjamin Kaiser's 2025 Bible-recall study tested eleven models on direct recall of biblical text, including obscure verses. The pattern was clear: larger frontier models (GPT-4o, Claude Sonnet, Llama 405B) handled obscure verses cleanly; smaller open-source models (Llama 8B) hallucinated translations and mangled words. Recall is not the same as faithful application, but it is the floor.
Bias and Theological-Lean Studies
"Uncovering Theological and Ethical Biases in LLMs" (HIPHIL Novum, 2024) tested GPT-4 Turbo, Claude v2, PaLM 2 Chat, Llama 2 70B, and Zephyr 7B on biblical interpretation prompts — the Ten Commandments and the Book of Jonah. The finding was a consistent progressive lean across models, toward environmental-ethics, social-justice, and inclusivity readings rather than traditional interpretations. These outputs are not opinion-free; the lean is shaped by training data and alignment choices.
"Cognitive Bias in Generative AI Influences Religious Education" (Scientific Reports, 2025) found that AI-generated texts on Christianity included more positive terms ("love," "forgiveness"), while texts on Islam included 1.5 times more "conflict" references — with implications for how Christian content gets handled differently from content about other faith traditions. The SAGE study on religion and racial bias in AI found AI-generated Evangelical Protestant sermons more readable than equivalent Catholic, Jewish, or Muslim content by two or more grade levels on the Flesch-Kincaid scale.
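For context on the readability claim above: the Flesch-Kincaid grade level is a simple function of average sentence length and average syllables per word, so a "two grade levels" gap is directly computable from the text itself. Below is a minimal sketch of the standard formula; the syllable counter is a naive vowel-group heuristic for illustration, not the tooling the SAGE study used.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count runs of consecutive vowels; at least 1 per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Standard Flesch-Kincaid grade-level formula:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

On this metric, short common-word sentences score at an early-elementary grade, while dense polysyllabic prose scores many grades higher — the kind of spread the SAGE study reports across traditions.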
"Religious Bias Landscape in Language and Text-to-Image Models" (arXiv, 2025) and "Measuring Spiritual Values and Biases of Large Language Models" (arXiv, 2024) both expand the bias-measurement framework. The latter introduces the SP-10Axes instrument, which assesses Pro-/Anti-Catholic, Pro-/Anti-Protestant tendencies among other dimensions.
Industry and Pastoral Signals
Lifeway Research's April 2026 study on pastors and AI is the most current pastoral-side data. The headline concerns from pastors are misinformation, theological accuracy, and whether AI replaces pastoral relationships. Notably, pastors did not report similar concern about AI replacing administrative work.
The April 2026 Anthropic Christian Leaders Summit, where fifteen Catholic and Protestant leaders met directly with Anthropic on AI ethics, did not produce a published evaluation framework. It produced dialogue, which is valuable — but not a benchmark, which is needed.
"Preaching with AI" (Taylor & Francis, 2025) studied how preachers actually use ChatGPT in sermon prep. The pattern: preachers use AI for brainstorming and outlining, then critically evaluate the output against their theological training. The study did not measure how AI handles questions where the user lacks that training to begin with — which is exactly the marketplace-leader case.
Christianity Today and The Gospel Coalition have published thoughtful editorial framing of the AI question. Christianity Today's 2023 piece flagged that ChatGPT lacked a "source of truth" and was prone to hallucinated answers. The Gospel Coalition's chatbot FAQ found that only two of seven AI platforms tested would "nudge" a searcher toward Christianity on spiritually loaded questions. Both are editorial, not empirical.
Barna Group's Christians on Leadership, Calling, and Career work tracks the audience this benchmark is built for — Christians integrating faith, work, and identity — but it does not isolate AI as a variable. A meta-gap.
Christian AI Tools Without Independent Audits
Several "Christian AI" tools exist — Magisterium AI (Catholic, fine-tuned on 25,000+ ecclesiastical documents), Pastors.ai (sermon-to-resource focused), ChristGPT (open-source, fine-tuned on the Bible), Bible Chat, BibleGPT, Biblical AI, and others. None has published a rigorous third-party evaluation of theological accuracy. They are tools to be tested, not authorities to be trusted by default.