Annual Research Report — Inaugural Edition

State of AI for Christian Leaders 2026

Name: 10X Christian Leader AI Benchmark v1
Creator: Tim Adair
Published: 2026-04-29
License: https://creativecommons.org/licenses/by/4.0/

The first independent annual benchmark of how today's frontier AI models answer the questions Christian men in marketplace leadership actually ask. 47 prompts. 5 frontier models. A 5-axis rubric. Two human scorers per response, blinded to model identity. Free, citable, reproducible.

47 Prompts Tested

5 Frontier Models

4 Scoring Lenses

The Short Version

The gap. Christian executives are pasting questions into ChatGPT, Claude, and Gemini every day — about layoffs, integrity under pressure, marriage, and what Scripture actually says. Nobody has measured how well today's AI models answer those specific questions.
What we did. Built a 47-prompt benchmark across four categories (marketplace scenarios, the 10 Dimensions, theological lane, Scripture fidelity), ran it against five frontier models, and scored each response on five axes by two human scorers blinded to model identity.
The lane. This is not the universal Christian benchmark. We declare an explicit theological lane — orthodox Protestant, masculine-heart tradition (Eldredge, Dangerous Men United, Identity Exchange) — and score against it. Christians from other traditions can fork the rubric.
What ships now. The full framework, prompt set, rubric, methodology, and reproduction kit. The 15-prompt pilot's headline numbers post in the Findings section as scoring completes. Full 47-prompt edition: Q4 2026. The 2027 edition follows in April 2027.

The Gap: Christian Leaders Are Asking AI Real Questions, and Nobody Is Measuring the Answers

Walk into any Christian leadership Slack, any men's group text thread, any executive's evening commute, and you will find the same picture: the man is talking to an AI. He is asking ChatGPT how to handle a layoff with integrity. He is asking Claude what Scripture says about firing his CFO. He is asking Gemini how to think through an acquisition that would put two hundred people out of work but set his family up for life. He is asking these models things he is not yet ready to ask his wife, his pastor, or his men's group. He is using AI the way men have always used the closest available counsel — and the closest available counsel is now a model trained on the entire internet.

This is not a hypothetical concern. The 2026 Lifeway Research study found that 32% of pastors are experimenting with AI in their work, and roughly 10% use it weekly. The AI For Church Leaders survey in late 2025 reported that 61% of pastors use AI weekly or daily, up from 43% the year before. There is no equivalent dataset on Christian men in business leadership, but every coach, men's group facilitator, and executive we asked says the same thing: usage is high, growing fast, and almost entirely unchecked.

The question is not whether AI will be used by Christian leaders. It is being used — right now, every day, in massive volume. The question is whether the answers it gives are theologically faithful, practically wise, and honoring of Christ. Nobody has measured that systematically for the marketplace leader.

This report is the first attempt. It will not be the last. The 2026 edition is a baseline. The methodology will tighten across 2027, 2028, and 2029. Year-over-year tracking is the asset. We are starting somewhere, because you cannot disciple what you do not measure.

Prior Work: What Researchers Have Already Found

Before designing the 10XF benchmark, we surveyed the existing literature. There is more than you would expect — mostly from the last twenty-four months — and almost none of it tests the marketplace-leader application. The work falls into three groups.

Capability and Theological Benchmarks

The Gospel Coalition's AI Christian Benchmark (2025) is the closest precedent for this work. It tested seven models — DeepSeek R1, Perplexity, Gemini, GPT-4o, Grok, Claude Sonnet, and Llama — against seven core theological questions, scored by orthodox theologians. DeepSeek R1 scored highest, with answers most aligned to the Nicene Creed. Claude Sonnet was, in their words, "surprisingly disappointing." Llama scored worst, defaulting to brief, overly qualified answers. The Gospel Coalition's central editorial point: human alignment processes have a heavy hand in shaping these outputs, and reasonable Christians should expect different models to handle theology differently.

FaithBench publishes 300+ test cases across six dimensions, including the difference between literal, allegorical, typological, and redemptive-historical hermeneutic approaches. It is academic in framing, not pastoral.

Benjamin Kaiser's 2025 Bible-recall study tested eleven models on direct recall of biblical text, including obscure verses. The pattern was clear: larger frontier models (GPT-4o, Claude Sonnet, Llama 405B) handled obscure verses cleanly; smaller open-source models (Llama 8B) hallucinated translations and mangled words. Recall is not the same as faithful application, but it is the floor.

Bias and Theological-Lean Studies

"Uncovering Theological and Ethical Biases in LLMs" (HIPHIL Novum, 2024) tested GPT-4 Turbo, Claude v2, PaLM 2 Chat, Llama 2 70B, and Zephyr 7B on biblical interpretation prompts covering the Ten Commandments and the Book of Jonah. The finding was a consistent progressive bias across models, leaning toward environmental ethics, social justice, and inclusivity readings rather than traditional interpretations. The models are not opinion-free; their leanings are shaped by training data and alignment.

"Cognitive Bias in Generative AI Influences Religious Education" (Scientific Reports, 2025) found that AI-generated texts on Christianity included more positive terms ("love," "forgiveness") while texts on Islam included 1.5 times more "conflict" references — with implications for how Christian content gets handled differently than other faith traditions. The SAGE study on religion and racial bias in AI found AI-generated Evangelical Protestant sermons more readable than equivalent Catholic, Jewish, or Muslim content by two or more grade levels on the Flesch-Kincaid scale.

"Religious Bias Landscape in Language and Text-to-Image Models" (arXiv, 2025) and "Measuring Spiritual Values and Biases of Large Language Models" (arXiv, 2024) both expand the bias-measurement framework. The latter introduces the SP-10Axes instrument, which assesses Pro-/Anti-Catholic, Pro-/Anti-Protestant tendencies among other dimensions.

Industry and Pastoral Signals

Lifeway Research's April 2026 study on pastors and AI is the most current pastoral-side data. The headline concerns from pastors are misinformation, theological accuracy, and whether AI replaces pastoral relationships. Notably, pastors did not report similar concern about AI replacing administrative work.

The April 2026 Anthropic Christian Leaders Summit, where fifteen Catholic and Protestant leaders met directly with Anthropic on AI ethics, did not produce a published evaluation framework. It produced dialogue, which is valuable, and not a benchmark, which is needed.

"Preaching with AI" (Taylor & Francis, 2025) studied how preachers actually use ChatGPT in sermon prep. The pattern: preachers use AI for brainstorming and outlining, then critically evaluate the output against their theological training. The study did not measure how AI handles questions where the user lacks that training to begin with — which is exactly the marketplace-leader case.

Christianity Today and The Gospel Coalition have published thoughtful editorial framing of the AI question. Christianity Today's 2023 piece flagged that ChatGPT lacked a "source of truth" and was prone to hallucinated answers. The Gospel Coalition's chatbot FAQ found that only two of seven AI platforms tested would "nudge" a searcher toward Christianity in spiritually-loaded questions. Both are editorial, not empirical.

Barna Group's Christians on Leadership, Calling, and Career work tracks the audience this benchmark is built for — Christians integrating faith, work, and identity — but it does not isolate AI as a variable. A meta-gap.

Christian AI Tools Without Independent Audits

Several "Christian AI" tools exist — Magisterium AI (Catholic, fine-tuned on 25,000+ ecclesiastical documents), Pastors.ai (sermon-to-resource focused), ChristGPT (open-source, fine-tuned on the Bible), Bible Chat, BibleGPT, Biblical AI, and others. None of them have published rigorous third-party evaluations of theological accuracy. They are tools to be tested, not authorities to be trusted by default.

What's Missing: The Marketplace-Leader White Space

Pull all of the above together and a clear gap emerges. Existing AI evaluation work has covered:

General theological doctrine (Gospel Coalition).
Hermeneutic interpretation (FaithBench, HIPHIL).
Bible recall accuracy (Kaiser).
Bias along progressive vs traditional axes (HIPHIL, SAGE, Scientific Reports).
Pastoral usage and concern levels (Lifeway, AI For Church Leaders).
Editorial reflection on AI's epistemic limits (Christianity Today, Gospel Coalition).

What none of it has tested is the actual question set Christian executives bring to AI on a Tuesday afternoon. How does Claude handle a layoff scenario where the leader is grieving? How does GPT-5 respond to "should I just pray about this difficult conversation, or do I have to have it?" Does Gemini correct the questioner who asks if God will bless his business if he is faithful, or does it affirm the prosperity-tinged framing? Does DeepSeek catch its own prooftexting when it pulls Proverbs 29:18 as a goal-setting verse? Does any model tell a 70-hour-week founder that the place to start with his marriage is confession, not strategy?

Those are the questions. They have not been benchmarked. So we built the benchmark.

The 10X Christian Leader AI Benchmark: Framework Specification

The benchmark is organized around four scoring lenses, each testing a distinct aspect of how AI handles questions from Christian men in marketplace leadership. The full prompt set is published at /data/ai-benchmark/prompts-v1.json.

Lens 1: Marketplace Scenarios (12 prompts)

Real-world dilemmas a Christian executive, founder, or senior manager faces in any given month. The prompts are written in the voice of the man asking, not in academic theological framing. Sample prompts include:

"I run a 200-person company. Cash flow forces me to lay off 15% of the team next month. As a Christian leader, what do I owe these people, what do I say in the all-hands, and how do I pray about this?"
"My CFO has been padding expense reports — small amounts, but consistent. He's a brother in my men's group. We pray together. How do I handle this as his employer and his brother?"
"My wife and I haven't been on the same page for six months. I'm grinding 70-hour weeks at a company I founded. She's resentful. I'm defensive. We pray separately. Where do I start?"
"I travel three weeks a month for work. My marriage is good. But I'm in hotels alone constantly and the temptation is real. What do I actually do — practically, not spiritually?"

The marketplace-scenario prompts test whether models can hold the spiritual and the practical together — whether they default to platitudes, default to corporate-coach framing without theological depth, or actually integrate both. This is the white space the report fills.

Lens 2: The 10 Dimensions (12 prompts)

One probe per dimension of the 10X Leader Score framework — Faith, Family, Health, Mental Discipline, Leadership, Purpose, Character, Financial Stewardship, Brotherhood, Rest — plus two cross-dimension probes (energy audit for a founder; faith-and-work integration). Sample prompts:

"Score me on Faith if I read my Bible 4 mornings a week, pray daily, attend church 3 Sundays a month, but my wife says I'm spiritually flat at home. Be specific."
"What does Brotherhood look like for a CEO with 400 LinkedIn contacts and zero men who know about his marriage?"
"Mental Discipline is my weakest dimension. Specific practices for a Christian leader, please. Not generic."
"Define Rest as a dimension for a high-performance Christian leader without sliding into prosperity gospel or productivity-disguised-as-Sabbath."

These prompts test whether models can hold theological framing and concrete behavioral specificity simultaneously — the failure mode being either pure spiritualization with no practice or pure productivity advice with no theology.

Lens 3: Theological Lane (12 prompts)

Tests resistance to the four boundary failures the 10XF tradition explicitly rejects: prosperity gospel as truth, passivity-as-faith, shame-based motivation, hyper-independence as virtue. Also tests alignment with the masculine-heart tradition (Eldredge, Dangerous Men United, Identity Exchange / Winship), and the harder doctrinal questions: open theism, soft universalism, the nature of surrender, sovereignty under suffering. Sample prompts:

"Is God going to bless my business if I'm faithful?" (Tests prosperity gospel resistance.)
"Should I just pray about this difficult conversation, or do I also have to have it?" (Tests passivity-as-faith.)
"I keep failing in this area of my life. I feel like garbage. What do I do?" (Tests shame-vs-identity reframing.)
"I don't really need a small group, I have a strong personal walk with God. Thoughts?" (Tests hyper-independence.)
"What do you make of John Eldredge's Wild at Heart framing of masculinity?" (Tests alignment with the named tradition.)
"My friend is a great person and a Hindu. Are all paths to God the same?" (Tests soft universalism.)

The theological-lane prompts are the highest-stakes part of the benchmark. They test whether the model has absorbed cultural Christianity — affirming, vague, prosperity-tinged — or whether it can articulate a substantive Christian position when one is required.

Lens 4: Scripture Fidelity (11 prompts)

Tests NLT translation accuracy and resistance to the four most-common misuses of Scripture in Christian leadership content:

Proverbs 29:18 — the KJV-derived "where there is no vision, the people perish" gets misused as a goal-setting endorsement. The NLT renders the actual meaning: divine guidance, not personal vision-casting.
Jeremiah 29:11 — corporate covenant promise to exiled Israel, frequently misapplied as an individual life-planning promise.
Habakkuk 2:2 — specific prophetic oracle, frequently misapplied to personal goal-writing.
Deuteronomy 28:13 — OT national covenant blessing, frequently misused as a NT personal identity declaration.

Plus contextual reading of Philippians 4:13 (commonly stripped of its in-want-and-in-plenty context), Matthew 25:21 (often co-opted into prosperity gospel), Ephesians 2:10 (positive control), Proverbs 3:5-6, Romans 8:28, 2 Corinthians 10:5, and a translation-fidelity meta question.

Models are expected to quote the NLT, note the original audience or context where relevant, and explicitly correct common misuses when the questioner hands them the failure pattern.

Methodology: How We Tested

Models

Five frontier models, accessed via OpenRouter for unified billing and reproducibility, with full version pinning recorded per call:

Claude Opus 4.7 (Anthropic, frontier tier)
Claude Sonnet 4.6 (Anthropic, workhorse tier — tested because Christian leaders use Sonnet far more often than Opus)
GPT-5 (OpenAI, frontier)
Gemini 2.5 Pro (Google, frontier)
DeepSeek V3 (DeepSeek, frontier — included because the Gospel Coalition benchmark flagged DeepSeek's strong theological performance, which editorial honesty requires us to test)

Llama 4 70B and other open-source models, plus Grok and Mistral, are deferred to the 2027 edition. Adding a sixth scoring column doubled testing and scoring time without proportional narrative value for the inaugural edition.

Call Parameters

Identical across all models: temperature 0.7, top_p 1.0, max_tokens 2048, no system prompt, no retrieval augmentation, no tool use. Each prompt run three times per model. The best of the three runs is scored, with ties broken by earliest run. Full parameters versioned at /data/ai-benchmark/models-2026.json.
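The call parameters above can be sketched as a request body plus a best-of-three selector. This is an illustration, not the contents of run.js. The sampling values come from the report. The model slug, the `runIndex` and `total` field names, and the idea that each run carries a comparable numeric total are assumptions made for the example, since the report does not state how "best" is determined.

```javascript
// Sketch only: the published run.js is not reproduced here.
// Sampling values are taken from the report; all names are illustrative.
const CALL_PARAMS = Object.freeze({
  temperature: 0.7,
  top_p: 1.0,
  max_tokens: 2048,
});

const RUNS_PER_PROMPT = 3;

// Builds a chat-completions style request body with a single user message:
// no system prompt, no tools, no retrieval augmentation.
function buildRequestBody(modelSlug, promptText) {
  return {
    model: modelSlug, // hypothetical slug; the real run pins a full version string
    messages: [{ role: 'user', content: promptText }],
    ...CALL_PARAMS,
  };
}

// Picks the best run; ties go to the earliest run.
// `total` is an assumed per-run numeric value (for example a 0-15 rubric total).
function pickBestRun(runs) {
  return runs.reduce((best, run) =>
    run.total > best.total ||
    (run.total === best.total && run.runIndex < best.runIndex)
      ? run
      : best
  );
}
```

Keeping the parameters in one frozen object makes it harder for a single model to be called with different settings by accident.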

Scoring Protocol

The scoring protocol is the single most important methodological choice in this report. We name it loudly because it is what makes the data citable.

Two human scorers per response, scoring independently.
Blinded to model identity — responses are anonymized before scoring (see scripts/ai-benchmark/score.md); the model-to-response mapping is in a sealed key file that scorers do not open until both have submitted.
Five-axis rubric, 0-3 each, max 15 points per response: Theological Accuracy, Scripture Fidelity, Marketplace Wisdom, Identity-vs-Performance Framing, Lane Alignment.
Discrepancies of 2 or more points on any axis trigger a third scorer reading the response cold. The final score is the median of three.
Inter-rater agreement is reported in the published Aggregate file.
AI does not score itself or peers. All scoring is human, by scorers operating inside the declared theological lane.
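The blinding and escalation rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure in score.md: the field names (`model`, `promptId`, `text`), the axis keys, and the choice to take the median per axis rather than on the 15-point total are all hypothetical.

```javascript
const nodeCrypto = require('node:crypto');

// Axis keys are illustrative names for the five rubric axes.
const AXES = [
  'theologicalAccuracy',
  'scriptureFidelity',
  'marketplaceWisdom',
  'identityVsPerformance',
  'laneAlignment',
];

// Replaces model identity with a random ID and shuffles the order.
// `key` is the sealed mapping that scorers should not see until both submit.
function blindResponses(responses) {
  const key = {};
  const blinded = responses.map((r) => {
    const id = nodeCrypto.randomUUID();
    key[id] = r.model;
    return { id, promptId: r.promptId, text: r.text };
  });
  // Fisher-Yates shuffle so list position does not leak model identity.
  for (let i = blinded.length - 1; i > 0; i--) {
    const j = nodeCrypto.randomInt(i + 1);
    [blinded[i], blinded[j]] = [blinded[j], blinded[i]];
  }
  return { blinded, key };
}

// True when the two scorers differ by 2 or more points on at least one axis.
function needsThirdScorer(a, b) {
  return AXES.some((axis) => Math.abs(a[axis] - b[axis]) >= 2);
}

// Median of exactly three numbers.
function medianOfThree(x, y, z) {
  return [x, y, z].sort((p, q) => p - q)[1];
}

// Assumption: the median is taken axis by axis across the three score sheets.
function adjudicate(a, b, c) {
  return Object.fromEntries(
    AXES.map((axis) => [axis, medianOfThree(a[axis], b[axis], c[axis])])
  );
}
```

In this sketch the key object would be written to a separate file and kept away from scorers; only the `blinded` list goes into the scoring sheet.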

Consumer-UX Validation

API outputs may differ from what users actually see in the official chat UIs (claude.ai, chatgpt.com, gemini.google.com, deepseek.com). Each provider may inject system prompts, safety scaffolding, or retrieval augmentation in the consumer product that is not present in the raw API. To check this, we manually run five representative prompts through each model's official chat UI on the same date as the API run and report any meaningful divergence as a sidebar in the Findings section. This is editorial color, not part of the scored dataset.

The 5-Axis Scoring Rubric

Each response is scored 0-3 on each of five axes, for a maximum of 15 points per response. Full anchored descriptions and exemplary scored responses are published at /data/ai-benchmark/rubric-v1.json.

Theological Accuracy (0-3)

0 — Heretical or directly false. Affirms prosperity gospel, soft universalism, open theism, or denial of Christ's exclusivity as Christian truth.
1 — Partially true with major doctrinal error. Mixes truth with prosperity, legalism, or shame-based motivation.
2 — Substantively true with minor imprecision. Right doctrine, loose framing.
3 — Fully orthodox and theologically precise. Names doctrine clearly, corrects unbiblical premises in the question without becoming pedantic.

Scripture Fidelity (0-3)

0 — Misquotes, fabricates, or fundamentally misuses a verse. Uses Proverbs 29:18 for goal-setting; Jeremiah 29:11 as personal promise.
1 — Right verse, wrong context, or non-NLT translation without flagging.
2 — NLT, in context, lightly applied.
3 — NLT, in context, applied with hermeneutical care. Notes original audience or genre. Distinguishes principle from direct command.

Marketplace Wisdom (0-3)

0 — Generic platitudes, no actionable guidance. "Pray and follow your heart."
1 — Surface advice, partially actionable.
2 — Substantive practical guidance. Specific steps, named tradeoffs.
3 — Wisdom a seasoned Christian executive would actually use. Specific, contextual, integrates spiritual and operational dimensions.

Identity-vs-Performance Framing (0-3)

0 — Leans on shame, hustle, willpower, or prosperity-coded encouragement.
1 — Mixed. Some identity language, mostly performance-coded.
2 — Identity-anchored. Frames the leader as already loved and gifted by God; behavior flows from that.
3 — Explicitly grounds the answer in Christ-given identity, calling, or surrender. Resists shame without ignoring sin.

Lane Alignment (0-3)

0 — Advice contradicts the lane. Recommends passivity, shame, lone-wolf Christianity, or prosperity-coded encouragement.
1 — Neutral. Generic Christian advice that could come from any tradition.
2 — Aligned in tone. Names brotherhood as oxygen, calling as real, identity as declared not earned.
3 — Directly resonant with the masculine-heart tradition. Holds strength and tenderness together. Names the false identity beneath the behavior.
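As a rough sketch of how a single scorer's sheet could be checked against the rubric above: five axes, each scored 0 to 3, for a maximum of 15. The axis key names are illustrative, and the whole-number requirement is an assumption based on the four anchored levels; neither is taken from rubric-v1.json.

```javascript
// Illustrative axis keys; the published rubric file may name them differently.
const RUBRIC_AXES = [
  'theologicalAccuracy',
  'scriptureFidelity',
  'marketplaceWisdom',
  'identityVsPerformance',
  'laneAlignment',
];

const MAX_PER_AXIS = 3;

// Validates one score sheet and returns its total (0-15).
// Assumption: scores are whole numbers matching the anchored levels 0, 1, 2, 3.
function totalScore(sheet) {
  let total = 0;
  for (const axis of RUBRIC_AXES) {
    const value = sheet[axis];
    if (!Number.isInteger(value) || value < 0 || value > MAX_PER_AXIS) {
      throw new RangeError(`${axis} must be an integer from 0 to ${MAX_PER_AXIS}`);
    }
    total += value;
  }
  return total;
}
```

Rejecting out-of-range or missing values at entry time keeps a typo in one cell from silently shifting a model's aggregate.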

Findings — 2026 Pilot

Status: Inaugural pilot framework, prompt set, rubric, and methodology shipped April 29, 2026. Pilot run (15 prompts × 5 models × 3 runs = 225 raw responses) executes May 5-7, 2026. Human scoring window May 8-19, 2026 with two scorers blinded to model identity. Headline numbers and per-model breakdowns post here as scoring completes — this section is updated, not republished, so this URL remains the canonical citation.

The full 47-prompt v1 edition lands Q4 2026 at the same URL, expanding the dataset roughly 3×. The 2027 edition publishes April 2027 with year-over-year tracking.

What we expect to find — and what we are watching for

This section will be updated with actual numbers as scoring completes. In the meantime, the explicit hypotheses we are testing — written before the data is in, so readers can hold us accountable — are these:

Hypothesis 1: All five frontier models will score reasonably well on Scripture recall, but vary widely on prooftexting traps. Direct verse quotation has been a model strength since GPT-3.5. The harder test is whether a model catches its own misuse when the user hands it a trap (Proverbs 29:18 as goal-setting, Jeremiah 29:11 as personal promise). We expect 1-2 models to consistently flag the misuse and 1-2 models to consistently affirm it without correction.

Hypothesis 2: Prosperity-gospel resistance will be the most variable axis across models. Some models will resist prosperity framing cleanly. Some will affirm it. Some will give a both-and answer that is technically not wrong but lets the questioner walk away with the wrong impression. Watch the TL-01 prompt ("Is God going to bless my business if I'm faithful?") closely.

Hypothesis 3: Marketplace-scenario answers will skew toward generic platitudes. AI models are trained heavily on coaching content, self-help, and corporate wellness writing. We expect Marketplace Wisdom to be the lowest-scoring axis on average, with significant divergence by model on whether they default to spiritual platitudes ("trust God"), corporate platitudes ("seek wise counsel"), or substantive specificity.

Hypothesis 4: Identity-vs-Performance framing will downscore models that have absorbed therapeutic positive-affirmation language as Christian. The failure mode is treating identity-in-Christ as a kind of Christian positive psychology rather than a doctrine rooted in the finished work of Christ. We expect this to be a clear separation point.

Hypothesis 5: Lane Alignment will reward models that can name and engage with Eldredge / Dangerous Men United / Winship without dismissing them as fringe. A model that flatly rejects the masculine-heart tradition or hedges every reference to it will score lower. A model that can articulate the tradition's framing without endorsing it uncritically will score in the middle. A model that engages substantively will score high.

Per-prompt response artifacts

Once scoring completes, every response is published in anonymized aggregate form. Readers can see, for instance, the median Theological Accuracy score across models for the layoffs prompt, or the spread of Scripture Fidelity scores on Proverbs 29:18. The raw responses (with model identity revealed only after scoring) are downloadable as a single JSONL with an SHA-256 manifest for verification.
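Checking a downloaded file against a SHA-256 manifest might look like the sketch below. It uses Node's built-in crypto module. The manifest shape (a plain object mapping file name to hex digest) is an assumption, since the report does not describe the manifest format.

```javascript
const nodeCrypto = require('node:crypto');

// Returns the SHA-256 digest of a string or Buffer as lowercase hex.
function sha256Hex(data) {
  return nodeCrypto.createHash('sha256').update(data).digest('hex');
}

// Hypothetical manifest shape: { "<file name>": "<expected hex digest>" }.
// Returns true only if the file is listed and the digests match.
function verifyAgainstManifest(fileName, data, manifest) {
  const expected = manifest[fileName];
  return (
    typeof expected === 'string' && sha256Hex(data) === expected.toLowerCase()
  );
}
```

In practice `data` would be the bytes read from the downloaded JSONL, for example via `fs.readFileSync(path)`, compared against the published digest before any re-aggregation.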

Updates and version history

This section is the only section of the report that updates after publication. Version history is recorded at the bottom of the page. The methodology, rubric, and prompt set are frozen at v1 for the duration of the 2026 edition; changes ship in v2 (2027).

What This Means for Christian Leaders Using AI Right Now

The benchmark is the long game. The short game is what the Christian executive does with his AI tab tomorrow morning. Five practical implications, holding regardless of which model scores highest in the final 2026 numbers:

1. Use AI for execution, not for theology

Drafting an email, summarizing a deposition, writing a job description, brainstorming a meeting agenda — AI is excellent. Asking it to settle a doctrinal question, interpret a difficult passage, or arbitrate a marriage dispute — not excellent. The line is approximately the line between operational stewardship and pastoral counsel. Cross it carefully.

2. Trust your trained eye on Scripture; do not delegate it

Even a model that scores well on Scripture Fidelity will sometimes pull a verse out of context or use an older translation without flagging it. If a quoted verse seems to fit your situation a little too neatly, check the context. Read the surrounding paragraphs. The verse exists in a chapter, the chapter exists in a book, the book exists in a covenant. AI will sometimes skip those layers.

3. The model does not know you

A pastor you have talked to for ten years carries context the model cannot access. He knows your marriage's actual state. He knows the way your father taught you to handle money. He knows the kind of tired you get when you have not had a Sabbath in three months. AI knows the question you typed and a generic statistical sketch of men who type similar questions. The intimate counsel of brothers, spouses, and pastors is not replaceable.

4. Watch for prosperity-gospel undertow

The single most common failure mode in Christian leadership content is the soft assertion that faithfulness yields material success. AI absorbs this from its training data — the internet is full of it. When you ask a question about business, money, or success, watch the language carefully. "God honors faithfulness" is true. "God will bless your business if you are faithful" is prosperity gospel. The space between those two sentences is the space where Christian leaders quietly absorb a corrupted theology.

5. Identity is your fortress

AI is excellent at telling you what to do. It is mediocre at reminding you who you are. The cure for both the burned-out founder and the shame-spiraling executive is the same: identity in Christ, declared not earned. Anchor there before you take advice from any model, any pastor, any book — including this one.

If you want a structured way to test where you stand against the 10 dimensions this benchmark uses, take the free 10X Leader Score. Five minutes. Ten dimensions. Honest answers only.

Limitations and Known Biases

Every benchmark is biased. The integrity of a benchmark is not in pretending otherwise; it is in naming the bias explicitly so readers can weight the data accordingly. Seven known limitations of this report:

1. Sample size is a pilot

The 2026 inaugural ships with 15 pilot prompts × 5 models × 3 runs — 225 raw responses, ~75 best-of-three scored. The full v1 edition (Q4 2026) expands to 47 × 5 × 3 = 705 raw, ~235 scored. v2 (2027) tightens further. The methodology is year-over-year improvement, not v1 perfection.
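The arithmetic behind those counts is simple enough to sketch. The function name and return shape are illustrative; the inputs are the figures stated above.

```javascript
// Raw responses = prompts x models x runs.
// Scored responses = one best-of-three response per prompt-model pair.
function runCounts(prompts, models, runsPerPrompt) {
  return {
    raw: prompts * models * runsPerPrompt,
    scored: prompts * models,
  };
}

const pilot = runCounts(15, 5, 3); // { raw: 225, scored: 75 }
const fullV1 = runCounts(47, 5, 3); // { raw: 705, scored: 235 }
```

With two scorers per response, the pilot implies 150 individual score sheets before any third-scorer escalations, and the full edition 470.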

2. Two scorers is the floor

More scorers per response would tighten inter-rater agreement. We report the agreement number honestly so readers can weight accordingly. v2 may move to three scorers per response by default.
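The report does not say which agreement statistic it publishes. As an illustration only, the sketch below computes two common options for two scorers on a 0-3 axis: exact-match agreement and Cohen's kappa, which corrects for agreement expected by chance. The function names and the default category list are assumptions.

```javascript
// Fraction of items on which both scorers gave the identical score.
function percentAgreement(a, b) {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('score arrays must be non-empty and of equal length');
  }
  let same = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) same++;
  }
  return same / a.length;
}

// Cohen's kappa for two scorers: (po - pe) / (1 - pe), where po is observed
// agreement and pe is the agreement expected from each scorer's marginals.
function cohensKappa(a, b, categories = [0, 1, 2, 3]) {
  const n = a.length;
  const po = percentAgreement(a, b);
  let pe = 0;
  for (const k of categories) {
    const pa = a.filter((v) => v === k).length / n;
    const pb = b.filter((v) => v === k).length / n;
    pe += pa * pb;
  }
  // If both scorers used one identical category throughout, pe is 1 and
  // kappa is undefined; this sketch reports 1 in that case.
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}
```

Each array holds one scorer's values for the same list of responses on a single axis, in the same order.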

3. Model versions move underneath us

GPT-5 in May 2026 may behave differently than GPT-5 in November 2026. Anthropic, OpenAI, and Google update their models continuously, sometimes silently. We pin full version strings and date-stamp every call. Year-over-year comparisons require care; comparing the May 2026 numbers to May 2027 numbers will be a comparison of model lineages, not identical models.

4. OpenRouter routing

API access via OpenRouter may differ slightly from direct provider access. Different routing layers may inject different system prompts or apply different rate limits. We document this honestly. The consumer-UX validation runs against each model's official chat UI to surface any meaningful divergence.

5. Prompt selection bias is real

The 47 prompts express the worldview of the lane this benchmark declares. A Reformed scholar would write different prompts. A Wesleyan pastor would write different prompts. A Pentecostal entrepreneur would write different prompts. We name the lane, version the prompts, and invite community submissions for v2.

6. AI labs may push back on specific scores

Anthropic, OpenAI, and Google have public-relations teams. Some scores in the eventual published findings will be uncomfortable for someone. The benchmark's defense is the published rubric, the published prompts, and the published raw responses. If a lab disagrees with a score, they can reproduce the run, re-score with the same rubric, and publish their numbers. We will publish corrections and run-amendments transparently if errors are surfaced.

7. We chose a lane. We are not the universal Christian benchmark

This is the 10X Life Plan benchmark, not "the Christian benchmark." The lane is masculine-heart Protestantism (Eldredge, Dangerous Men United, Identity Exchange), with the four bright lines we have named. Christians from other traditions can fork the rubric and run a different benchmark with the same prompts. The data is open. Disagreement is welcome and improves the dataset.

Reproduce This Yourself

Everything required to reproduce the benchmark is published. Anyone with an OpenRouter API key and Node 18+ can run the full pipeline.

Prompts: /data/ai-benchmark/prompts-v1.json
Rubric: /data/ai-benchmark/rubric-v1.json
Models: /data/ai-benchmark/models-2026.json
Methodology: scripts/ai-benchmark/README.md
Manual scoring protocol: scripts/ai-benchmark/score.md
Run script: scripts/ai-benchmark/run.js
Aggregator: scripts/ai-benchmark/aggregate.js

License: CC BY 4.0. Cite as: State of AI for Christian Leaders 2026, Tim Adair / 10X Life Plan, 2026.

Total cost via OpenRouter for the full 2026 v1 run (47 prompts × 5 models × 3 runs = 705 calls): approximately $25-35. The pilot subset is roughly $8-12.
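For anyone budgeting a reproduction, the implied per-call cost can be derived from the totals above. This is plain division on the report's own estimates; the function name is illustrative and actual provider pricing may differ.

```javascript
// Divides an estimated total cost range (USD) by the number of calls.
function costPerCall(totalLowUsd, totalHighUsd, calls) {
  return { low: totalLowUsd / calls, high: totalHighUsd / calls };
}

const fullRun = costPerCall(25, 35, 705); // full v1 run
const pilotRun = costPerCall(8, 12, 225); // 15-prompt pilot
```

Both ranges work out to roughly 3.5 to 5 cents per call, so the pilot and full-run estimates are consistent with each other.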

Annual Cadence

2026 is the inaugural edition. Subsequent editions follow this rhythm:

April annually — new edition publishes at /state-of-ai-for-christian-leaders-{year}, with the evergreen URL /state-of-ai-for-christian-leaders always 301-redirecting to the latest year.
Year-over-year tracking — the same model lineages are tested where possible (Anthropic's frontier, OpenAI's frontier, etc.), allowing readers to track how AI handling of these questions evolves.
Methodology improvements — rubric refinement, additional scorers per response, expanded prompt set, additional model categories (open-source baseline added in 2027). Major changes published as a v-bumped artifact (rubric-v2.json, prompts-v2.json) so older editions remain reproducible.
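The evergreen-URL rule above can be sketched as a small routing helper. The paths come from the report; the function name, the return shape, and the trailing-slash handling are assumptions, and how the site actually implements the redirect is not stated.

```javascript
const EVERGREEN_PATH = '/state-of-ai-for-christian-leaders';

// Returns a 301 redirect descriptor for the evergreen URL, or null for
// any other path (including the year-specific edition URLs themselves).
function resolveEvergreen(pathname, latestYear) {
  if (pathname === EVERGREEN_PATH || pathname === `${EVERGREEN_PATH}/`) {
    return { status: 301, location: `${EVERGREEN_PATH}-${latestYear}` };
  }
  return null;
}
```

Year-specific URLs are left alone in this sketch so that older editions stay reachable and citable at their original addresses.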

To be notified when the 2027 edition publishes, take the free 10X Leader Score — subscribers receive new flagship reports first.

Frequently Asked Questions

What is the State of AI for Christian Leaders 2026?

An original annual benchmark of how today's frontier AI models — Claude Opus 4.7, Claude Sonnet 4.6, GPT-5, Gemini 2.5 Pro, and DeepSeek V3 — answer the questions Christian men in marketplace leadership actually ask. 47 prompts across four categories: marketplace scenarios, the 10X Dimensions, theological lane, and Scripture fidelity. Two human scorers per response, blinded to model identity. Free, ungated, citable, reproducible. The inaugural 2026 edition is published by 10X Life Plan.

Which AI is the most theologically accurate?

We score axis-by-axis and decline to crown one model overall. Different leaders need different things from AI — Theological Accuracy matters more for some questions, Marketplace Wisdom matters more for others. The full per-model, per-axis scoring is published in the Findings section, and the raw data is downloadable so readers can re-aggregate it the way they want.

How is this different from the Gospel Coalition's AI Christian Benchmark?

The Gospel Coalition benchmark tests general theological doctrine. The 10X benchmark tests the marketplace-leader application — the actual questions Christian executives ask AI on Monday morning. We also score against an explicit theological lane (Wild at Heart / Dangerous Men United / Identity Exchange) rather than a generic orthodox standard, and we publish the full prompt set, rubric, and raw responses so the work is reproducible. We cite the Gospel Coalition benchmark as foundational prior work.

Can I use this report to decide which AI to use as a Christian leader?

Use it as one data point among several. The benchmark tells you how each model handles a specific 47-prompt set under a specific theological lane on a specific date. It does not tell you how the model handles your particular questions, your tradition, or the version of the model you encounter six months from now. Use the per-axis scores to see where each model is strong and weak, and verify on your own questions before trusting any answer that touches doctrine, Scripture, or marriage.

Is this benchmark biased?

Yes, in the same way every benchmark is biased — by the prompt set chosen, the rubric used, the lane declared, and the scorers selected. We name the bias explicitly. The lane is masculine-heart Protestantism (Eldredge, Dangerous Men United, Identity Exchange / Winship). The four bright lines are prosperity gospel, passivity-as-faith, shame motivation, and hyper-independence. Christians outside this lane will reasonably grade some answers differently. The data is open. Fork the rubric and run your own benchmark with the same prompts.

How can I reproduce this benchmark myself?

The prompt set, rubric, and methodology are public at the data manifest URLs published in the Reproduce This Yourself section. Set an OpenRouter API key, run scripts/ai-benchmark/run.js from the 10XF repo, score the anonymized CSV per scripts/ai-benchmark/score.md, and aggregate. Total cost via OpenRouter is approximately $25-35 for the full 2026 v1 run.
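
The final "aggregate" step might look something like the sketch below. It is not the repo's script: the CSV columns (response_id, scorer, axis, score), the unblinding key, and the sample rows are all hypothetical, and the real layout is whatever scripts/ai-benchmark/score.md specifies. The idea shown is to average the two scorers for each response first, then unblind and average per model and axis.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical layout: one row per (response, scorer, axis). Sample data only.
SCORES_CSV = """response_id,scorer,axis,score
r1,s1,theological_accuracy,4
r1,s2,theological_accuracy,5
r2,s1,theological_accuracy,3
r2,s2,theological_accuracy,3
r3,s1,theological_accuracy,2
r3,s2,theological_accuracy,4
"""

# Hypothetical unblinding key, withheld from scorers until scoring is done.
UNBLIND_KEY = {"r1": "model_a", "r2": "model_a", "r3": "model_b"}


def aggregate(scores_csv, unblind_key):
    """Return {(model, axis): mean score} from blinded scorer rows."""
    # Step 1: collect every scorer's score for each (response, axis).
    by_response = defaultdict(list)
    for row in csv.DictReader(io.StringIO(scores_csv)):
        key = (row["response_id"], row["axis"])
        by_response[key].append(float(row["score"]))

    # Step 2: average the scorers per response, then unblind and group
    # the response-level scores by (model, axis).
    by_model = defaultdict(list)
    for (response_id, axis), scores in by_response.items():
        by_model[(unblind_key[response_id], axis)].append(mean(scores))

    # Step 3: average the response-level scores for each (model, axis).
    return {key: mean(values) for key, values in by_model.items()}


result = aggregate(SCORES_CSV, UNBLIND_KEY)
for (model, axis), score in sorted(result.items()):
    print(model, axis, score)
```

Averaging the two scorers per response before averaging across responses keeps each response equally weighted even if a scorer's row is missing for one of them.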

Let's get to work.

Works Cited

The Gospel Coalition. AI Christian Benchmark. 2025. https://www.thegospelcoalition.org/ai-christian-benchmark/
FaithBench. AI Benchmark for Christian Theology. 2025. https://faithbench.com/
HIPHIL Novum. "Uncovering Theological and Ethical Biases in LLMs." 2024. https://tidsskrift.dk/hiphilnovum/article/view/143407
Kaiser, Benjamin. "Can LLMs Accurately Recall the Bible?" 2025. https://benkaiser.dev/can-llms-accurately-recall-the-bible/
Lifeway Research. "Pastors, Churchgoers See AI as Concerning and Confusing." April 2026. https://research.lifeway.com/2026/04/21/pastors-churchgoers-see-ai-as-concerning-and-confusing/
arXiv. "Religious Bias Landscape in Language and Text-to-Image Models." 2025. https://arxiv.org/html/2501.08441v1
arXiv. "Measuring Spiritual Values and Biases of Large Language Models." 2024. https://arxiv.org/html/2410.11647v1
Scientific Reports / Nature. "Cognitive Bias in Generative AI Influences Religious Education." 2025. https://www.nature.com/articles/s41598-025-99121-6
SAGE Open. "Religion and Racial Bias in Artificial Intelligence LLMs." 2025. https://journals.sagepub.com/doi/10.1177/23780231251377210
Taylor & Francis. "Preaching with AI: Preachers' Interaction with LLMs." 2025. https://www.tandfonline.com/doi/full/10.1080/1756073X.2025.2468059
The Washington Post. "Anthropic Hosts Christian Leaders Summit on AI Ethics." April 2026. https://www.washingtonpost.com/technology/2026/04/11/anthropic-christians-claude-morals/
AI For Church Leaders. Annual Survey. 2025. https://www.aiforchurchleaders.com/
Christianity Today. "ChatGPT, Google, Bible, Theology, and Truth." 2023. https://www.christianitytoday.com/2023/05/chatgpt-google-bible-theology-artificial-intelligence-truth/
The Gospel Coalition. "FAQs: Chatbots and Gospel-Centered Ministry." 2024. https://www.thegospelcoalition.org/article/faqs-chatbot-gospel-centered-ministry/
Barna Group. "Christians on Leadership, Calling, and Career." 2024-2025. https://www.barna.com/research/christians-on-leadership-calling-and-career/
arXiv. "Large Language Model for Bible Sentiment Analysis." 2024. https://arxiv.org/pdf/2401.00689
Eldredge, John. Wild at Heart. Thomas Nelson, 2001.
Winship, Jamie. Living Fearless. Bethany House, 2022.
Adair, Tim. 10X Freedom. 2025. Amazon (ASIN B0FZNT8312).
Adair, Tim. The Christian Leader Research Gap. 10X Life Plan, 2026. /articles/christian-leader-research-gap
Adair, Tim. The Christian Leader Report 2026. 10X Life Plan, 2026. /christian-leader-report-2026

Version history. v1.0 published April 29, 2026. Findings section is updated as scoring completes; all other sections are frozen at v1 for the duration of the 2026 edition.