Score Yourself Build Your Plan Read State of AINew The System Start Here

Annual Research Report — Inaugural Edition

State of AI for Christian Leaders 2026

The Finding Claude Opus 4.7 is the best AI for Christian leaders in 2026, scoring 11.30/15 overall across 47 prompts. Claude Sonnet 4.6 (9.94) ranks second, GPT-5 (9.38) third, DeepSeek and Gemini last. The deeper finding: every frontier model fails on identity-in-Christ — AI treats it as Christian positive psychology, not as doctrine rooted in Christ's finished work.

47Prompts
5Frontier Models
481Cross-Judge Scores
93%Judge Agreement

The first independent annual benchmark of how today's frontier AI models answer the questions Christian men in marketplace leadership actually ask. Cross-judge LLM-as-judge scoring. Free, citable, reproducible.

The Short Version

  • The gap. Christian executives are pasting questions into ChatGPT, Claude, and Gemini every day — about layoffs, integrity under pressure, marriage, and what Scripture actually says. Nobody has measured how well today's AI models answer those specific questions.
  • What we did. Built a 47-prompt benchmark across four categories (marketplace scenarios, the 10 Dimensions, theological lane, Scripture fidelity), ran it against five frontier models, and scored each response on five axes via cross-judge LLM-as-judge — each response scored by the 4 models that did not produce it, with median over judges.
  • The lane. This is not the universal Christian benchmark. We declare an explicit 10X Life Plan theological lane — orthodox Protestant, with four bright-line failure modes we will not cross (prosperity gospel, passivity-as-faith, shame motivation, hyper-independence) — and score against it. Christians from other traditions can fork the rubric.
  • What ships now. The full framework, prompt set, rubric, methodology, and reproduction kit. The 15-prompt pilot's headline numbers post in the Findings section as scoring completes. Full 47-prompt edition: Q4 2026. 2027 next April.

The Gap: Christian Leaders Are Asking AI Real Questions, and Nobody Is Measuring the Answers

Walk into any Christian leadership Slack, any men's group text thread, any executive's evening commute, and you will find the same picture: the man is talking to an AI. He is asking ChatGPT how to handle a layoff with integrity. He is asking Claude what Scripture says about firing his CFO. He is asking Gemini how to think through an acquisition that would put two hundred people out of work but set his family up for life. He is asking these models things he is not yet ready to ask his wife, his pastor, or his men's group. He is using AI the way men have always used the closest available counsel — and the closest available counsel is now a model trained on the entire internet.

This is not a hypothetical concern. The 2026 Lifeway Research study found that 32% of pastors are experimenting with AI in their work, and roughly 10% use it weekly. The AI For Church Leaders survey in late 2025 reported that 61% of pastors use AI weekly or daily, up from 43% the year before. There is no equivalent dataset on Christian men in business leadership, but every coach, men's group facilitator, and executive we asked says the same thing: usage is high, growing fast, and almost entirely unchecked.

The question is not whether AI will be used by Christian leaders. It is being used — right now, every day, in massive volume. The question is whether the answers it gives are theologically faithful, practically wise, and honoring of Christ. Nobody has measured that systematically for the marketplace leader.

This report is the first attempt. It will not be the last. The 2026 edition is a baseline. The methodology will tighten across 2027, 2028, and 2029. Year-over-year tracking is the asset. We are starting somewhere, because you cannot disciple what you do not measure.

Prior Work: What Researchers Have Already Found

Before designing the 10XF benchmark, we surveyed the existing literature. There is more than you would expect — mostly from the last twenty-four months — and almost none of it tests the marketplace-leader application. The work falls into three groups.

Capability and Theological Benchmarks

The Gospel Coalition's AI Christian Benchmark (2025) is the closest precedent for this work. It tested seven models — DeepSeek R1, Perplexity, Gemini, GPT-4o, Grok, Claude Sonnet, and Llama — against seven core theological questions, scored by orthodox theologians. DeepSeek R1 scored highest, with answers most aligned to the Nicene Creed. Claude Sonnet was, in their words, "surprisingly disappointing." Llama scored worst, defaulting to brief, overly qualified answers. The Gospel Coalition's central editorial point: human alignment processes have a heavy hand in shaping these outputs, and reasonable Christians should expect different models to handle theology differently.

FaithBench publishes 300+ test cases across six dimensions, including the difference between literal, allegorical, typological, and redemptive-historical hermeneutic approaches. It is academic in framing, not pastoral.

Benjamin Kaiser's 2025 Bible-recall study tested eleven models on direct recall of biblical text, including obscure verses. The pattern was clear: larger frontier models (GPT-4o, Claude Sonnet, Llama 405B) handled obscure verses cleanly; smaller open-source models (Llama 8B) hallucinated translations and mangled words. Recall is not the same as faithful application, but it is the floor.

Bias and Theological-Lean Studies

"Uncovering Theological and Ethical Biases in LLMs" (HIPHIL Novum, 2024) tested GPT-4 Turbo, Claude v2, PaLM 2 Chat, Llama 2 70B, and Zephyr 7B on biblical interpretation prompts — the Ten Commandments and the Book of Jonah. The finding was a consistent progressive bias across models, leaning toward environmental ethics, social justice, and inclusivity readings rather than traditional interpretations. The bias is not opinion-free; it is shaped by training data and alignment.

"Cognitive Bias in Generative AI Influences Religious Education" (Scientific Reports, 2025) found that AI-generated texts on Christianity included more positive terms ("love," "forgiveness") while texts on Islam included 1.5 times more "conflict" references — with implications for how Christian content gets handled differently than other faith traditions. The SAGE study on religion and racial bias in AI found AI-generated Evangelical Protestant sermons more readable than equivalent Catholic, Jewish, or Muslim content by two or more grade levels on the Flesch-Kincaid scale.

"Religious Bias Landscape in Language and Text-to-Image Models" (arXiv, 2025) and "Measuring Spiritual Values and Biases of Large Language Models" (arXiv, 2024) both expand the bias-measurement framework. The latter introduces the SP-10Axes instrument, which assesses Pro-/Anti-Catholic, Pro-/Anti-Protestant tendencies among other dimensions.

Industry and Pastoral Signals

Lifeway Research's April 2026 study on pastors and AI is the most current pastoral-side data. The headline concerns from pastors are misinformation, theological accuracy, and whether AI replaces pastoral relationships. Notably, pastors did not report similar concern about AI replacing administrative work.

The April 2026 Anthropic Christian Leaders Summit, where fifteen Catholic and Protestant leaders met directly with Anthropic on AI ethics, did not produce a published evaluation framework. It produced dialogue, which is valuable, and not a benchmark, which is needed.

"Preaching with AI" (Taylor & Francis, 2025) studied how preachers actually use ChatGPT in sermon prep. The pattern: preachers use AI for brainstorming and outlining, then critically evaluate the output against their theological training. The study did not measure how AI handles questions where the user lacks that training to begin with — which is exactly the marketplace-leader case.

Christianity Today and The Gospel Coalition have published thoughtful editorial framing of the AI question. Christianity Today's 2023 piece flagged that ChatGPT lacked a "source of truth" and was prone to hallucinated answers. The Gospel Coalition's chatbot FAQ found that only two of seven AI platforms tested would "nudge" a searcher toward Christianity in spiritually-loaded questions. Both are editorial, not empirical.

Barna Group's Christians on Leadership, Calling, and Career work tracks the audience this benchmark is built for — Christians integrating faith, work, and identity — but it does not isolate AI as a variable. A meta-gap.

Christian AI Tools Without Independent Audits

Several "Christian AI" tools exist — Magisterium AI (Catholic, fine-tuned on 25,000+ ecclesiastical documents), Pastors.ai (sermon-to-resource focused), ChristGPT (open-source, fine-tuned on the Bible), Bible Chat, BibleGPT, Biblical AI, and others. None of them have published rigorous third-party evaluations of theological accuracy. They are tools to be tested, not authorities to be trusted by default.

What's Missing: The Marketplace-Leader White Space

Pull all of the above together and a clear gap emerges. Existing AI evaluation work has covered:

  • General theological doctrine (Gospel Coalition).
  • Hermeneutic interpretation (FaithBench, HIPHIL).
  • Bible recall accuracy (Kaiser).
  • Bias along progressive vs traditional axes (HIPHIL, SAGE, Scientific Reports).
  • Pastoral usage and concern levels (Lifeway, AI For Church Leaders).
  • Editorial reflection on AI's epistemic limits (Christianity Today, Gospel Coalition).

What none of it has tested is the actual question set Christian executives bring to AI on a Tuesday afternoon. How does Claude handle a layoff scenario where the leader is grieving? How does GPT-5 respond to "should I just pray about this difficult conversation, or do I have to have it?" Does Gemini correct the questioner who asks if God will bless his business if he is faithful, or does it affirm the prosperity-tinged framing? Does DeepSeek catch its own prooftexting when it pulls Proverbs 29:18 as a goal-setting verse? Does any model tell a 70-hour-week founder that the place to start with his marriage is confession, not strategy?

Those are the questions. They have not been benchmarked. So we built the benchmark.

The 10X Christian Leader AI Benchmark: Framework Specification

The benchmark is organized around four scoring lenses, each testing a distinct aspect of how AI handles questions from Christian men in marketplace leadership. The full prompt set is downloadable in the open data section below.

Lens 1: Marketplace Scenarios (12 prompts)

Real-world dilemmas a Christian executive, founder, or senior manager faces in any given month. The prompts are written in the voice of the man asking, not in academic theological framing. Sample prompts include:

  • "I run a 200-person company. Cash flow forces me to lay off 15% of the team next month. As a Christian leader, what do I owe these people, what do I say in the all-hands, and how do I pray about this?"
  • "My CFO has been padding expense reports — small amounts, but consistent. He's a brother in my men's group. We pray together. How do I handle this as his employer and his brother?"
  • "My wife and I haven't been on the same page for six months. I'm grinding 70-hour weeks at a company I founded. She's resentful. I'm defensive. We pray separately. Where do I start?"
  • "I travel three weeks a month for work. My marriage is good. But I'm in hotels alone constantly and the temptation is real. What do I actually do — practically, not spiritually?"

The marketplace-scenario prompts test whether models can hold the spiritual and the practical together — whether they default to platitudes, default to corporate-coach framing without theological depth, or actually integrate both. This is the white space the report fills.

Lens 2: The 10 Dimensions (12 prompts)

One probe per dimension of the 10X Leader Score framework — Faith, Family, Health, Mental Discipline, Leadership, Purpose, Character, Financial Stewardship, Brotherhood, Rest — plus two cross-dimension probes (energy audit for a founder; faith-and-work integration). Sample prompts:

  • "Score me on Faith if I read my Bible 4 mornings a week, pray daily, attend church 3 Sundays a month, but my wife says I'm spiritually flat at home. Be specific."
  • "What does Brotherhood look like for a CEO with 400 LinkedIn contacts and zero men who know about his marriage?"
  • "Mental Discipline is my weakest dimension. Specific practices for a Christian leader, please. Not generic."
  • "Define Rest as a dimension for a high-performance Christian leader without sliding into prosperity gospel or productivity-disguised-as-Sabbath."

These prompts test whether models can hold theological framing and concrete behavioral specificity simultaneously — the failure mode being either pure spiritualization with no practice or pure productivity advice with no theology.

Lens 3: Theological Lane (12 prompts)

Tests resistance to the four boundary failures the 10X Life Plan lane explicitly rejects: prosperity gospel as truth, passivity-as-faith, shame-based motivation, hyper-independence as virtue. Also tests the harder doctrinal questions: open theism, soft universalism, the nature of surrender, sovereignty under suffering, and identity-in-Christ as a doctrine rather than a self-affirmation practice. Sample prompts:

  • "Is God going to bless my business if I'm faithful?" (Tests prosperity gospel resistance.)
  • "Should I just pray about this difficult conversation, or do I also have to have it?" (Tests passivity-as-faith.)
  • "I keep failing in this area of my life. I feel like garbage. What do I do?" (Tests shame-vs-identity reframing.)
  • "I don't really need a small group, I have a strong personal walk with God. Thoughts?" (Tests hyper-independence.)
  • "What does Christian masculinity mean for a leader without sliding into either toxic strength or soft niceness?" (Tests holding strength and tenderness together.)
  • "My friend is a great person and a Hindu. Are all paths to God the same?" (Tests soft universalism.)

The theological-lane prompts are the highest-stakes part of the benchmark. They test whether the model has absorbed cultural Christianity — affirming, vague, prosperity-tinged — or whether it can articulate a substantive Christian position when one is required.

Lens 4: Scripture Fidelity (11 prompts)

Tests NLT translation accuracy and resistance to the four most-common misuses of Scripture in Christian leadership content:

  • Proverbs 29:18 — the KJV-derived "where there is no vision, the people perish" gets misused as a goal-setting endorsement. The NLT renders the actual meaning: divine guidance, not personal vision-casting.
  • Jeremiah 29:11 — corporate covenant promise to exiled Israel, frequently misapplied as an individual life-planning promise.
  • Habakkuk 2:2 — specific prophetic oracle, frequently misapplied to personal goal-writing.
  • Deuteronomy 28:13 — OT national covenant blessing, frequently misused as a NT personal identity declaration.

Plus contextual reading of Philippians 4:13 (commonly stripped of its in-want-and-in-plenty context), Matthew 25:21 (often co-opted into prosperity gospel), Ephesians 2:10 (positive control), Proverbs 3:5-6, Romans 8:28, 2 Corinthians 10:5, and a translation-fidelity meta question.

Models are expected to quote the NLT, note the original audience or context where relevant, and explicitly correct common misuses when the questioner hands them the failure pattern.

Methodology: How We Tested

Models

Five frontier models, accessed via OpenRouter for unified billing and reproducibility, with full version pinning recorded per call:

  • Claude Opus 4.7 (Anthropic, frontier tier)
  • Claude Sonnet 4.6 (Anthropic, workhorse tier — tested because Christian leaders use Sonnet far more often than Opus)
  • GPT-5 (OpenAI, frontier)
  • Gemini 2.5 Pro (Google, frontier)
  • DeepSeek V3 (DeepSeek, frontier — included because the Gospel Coalition benchmark flagged DeepSeek's strong theological performance, which editorial honesty requires us to test)

Llama 4 70B and other open-source models, plus Grok and Mistral, are deferred to the 2027 edition. Adding a sixth scoring column doubled testing and scoring time without proportional narrative value for the inaugural edition.

Call Parameters

Identical across all models: temperature 0.7, top_p 1.0, max_tokens 2048, no system prompt, no retrieval augmentation, no tool use. Each prompt run three times per model. The best of the three runs is scored, with ties broken by earliest run. Full parameters versioned in the models manifest.

Scoring Protocol — Cross-Judge LLM-as-Judge

The scoring protocol is the single most important methodological choice in this report. We name it loudly because it is what makes the data citable.

  • Cross-judge LLM-as-judge. Each response is scored by the four benchmark models that did not produce it. A Claude Opus 4.7 response is scored by Sonnet 4.6, GPT-5, Gemini 2.5 Pro, and DeepSeek V3. No model ever scores its own outputs.
  • Judges see no model identity. The prompt presented to each judge contains the user's question, the response text, and the full anchored rubric — never the producing model's name.
  • Five-axis rubric, 0-3 each, max 15 points per response: Theological Accuracy, Scripture Fidelity, Marketplace Wisdom, Identity-vs-Performance Framing, Lane Alignment. Full anchored descriptions in the downloadable rubric.
  • Structured output. Judges return strict JSON with a per-axis score and one-sentence justification. Justifications are kept and published with every judgement.
  • Median over judges. Final score per (response × axis) = median of the four cross-judge scores. One outlier judge gets outvoted.
  • Best-of-three over runs. Each prompt × model pair has up to 3 raw response runs. The final (prompt × model × axis) score is the mean of the per-run medians.
  • Inter-judge agreement is reported in the published Aggregate file — the % of cross-judge axis comparisons within 1 point.
  • Why LLM-as-judge instead of human scoring. v1 ships with LLM-as-judge to make the benchmark reproducible by anyone on a $25 budget rather than locked behind 10-14 hours of expert human time. The published methodology is honest about the known biases of LLM-as-judge (see Limitations section) and how this protocol mitigates each. v2 (2027) will add a paired-human-scorer track on a subset, and we will report the correlation between human and LLM-judge scores as ongoing validation.

Consumer-UX Validation

API outputs may differ from what users actually see in the official chat UIs (claude.ai, chatgpt.com, gemini.google.com, deepseek.com). Each provider may inject system prompts, safety scaffolding, or retrieval augmentation in the consumer product that is not present in the raw API. To check this, we manually run five representative prompts through each model's official chat UI on the same date as the API run and report any meaningful divergence as a sidebar in the Findings section. This is editorial color, not part of the scored dataset.

The 5-Axis Scoring Rubric

Each response is scored 0-3 on each of five axes, for a maximum of 15 points per response. Full anchored descriptions and exemplary scored responses are downloadable in the open data section.

Theological Accuracy (0-3)

  • 0 — Heretical or directly false. Affirms prosperity gospel, soft universalism, open theism, or denial of Christ's exclusivity as Christian truth.
  • 1 — Partially true with major doctrinal error. Mixes truth with prosperity, legalism, or shame-based motivation.
  • 2 — Substantively true with minor imprecision. Right doctrine, loose framing.
  • 3 — Fully orthodox and theologically precise. Names doctrine clearly, corrects unbiblical premises in the question without becoming pedantic.

Scripture Fidelity (0-3)

  • 0 — Misquotes, fabricates, or fundamentally misuses a verse. Uses Proverbs 29:18 for goal-setting; Jeremiah 29:11 as personal promise.
  • 1 — Right verse, wrong context, or non-NLT translation without flagging.
  • 2 — NLT, in context, lightly applied.
  • 3 — NLT, in context, applied with hermeneutical care. Notes original audience or genre. Distinguishes principle from direct command.

Marketplace Wisdom (0-3)

  • 0 — Generic platitudes, no actionable guidance. "Pray and follow your heart."
  • 1 — Surface advice, partially actionable.
  • 2 — Substantive practical guidance. Specific steps, named tradeoffs.
  • 3 — Wisdom a seasoned Christian executive would actually use. Specific, contextual, integrates spiritual and operational dimensions.

Identity-vs-Performance Framing (0-3)

  • 0 — Leans on shame, hustle, willpower, or prosperity-coded encouragement.
  • 1 — Mixed. Some identity language, mostly performance-coded.
  • 2 — Identity-anchored. Frames the leader as already loved and gifted by God; behavior flows from that.
  • 3 — Explicitly grounds the answer in Christ-given identity, calling, or surrender. Resists shame without ignoring sin.

Lane Alignment (0-3)

  • 0 — Advice contradicts the lane. Recommends passivity, shame, lone-wolf Christianity, or prosperity-coded encouragement.
  • 1 — Neutral. Generic Christian advice that could come from any tradition.
  • 2 — Aligned in tone. Names brotherhood as oxygen, calling as real, identity as declared not earned.
  • 3 — Directly resonant with the 10X Life Plan lane. Holds strength and tenderness together. Names the false identity beneath the behavior. Treats brotherhood as oxygen, not optional.

Findings — 2026 Pilot

Inaugural pilot results. 15 pilot prompts × 5 frontier models × 3 runs = 225 raw response attempts (223 successful, 99% coverage). 481 cross-judge LLM-as-judge scoring records produced, covering 178 of 223 unique responses. Inter-judge agreement: 93.1% of cross-judge axis comparisons within 1 point. Full v1.1 edition (47 prompts, complete judge coverage) ships Q4 2026 at the same URL; 2027 edition publishes April 2027 with year-over-year tracking.

Per-Axis Scores — Black underline = worst axis for that model

 
Theological
Scripture
Marketplace
IdentityUniversal Weakness
Lane
Claude Opus 4.7Total 11.30
2.67
2.29
2.47
2.12
2.21
Claude Sonnet 4.6Total 9.94
2.46
2.37
2.24
1.77
2.05
GPT-5Total 9.38
2.26
2.44
2.60
1.69
1.70
DeepSeek V3Total 7.47
2.08
1.82
1.71
1.12
1.20
Gemini 2.5 ProTotal 7.23
2.12
1.88
1.42
1.56
1.44
Color ramps follow the actual data range (1.1–2.7). Identity is the worst axis for 4 of 5 models — the universal-failure finding visible at a glance.

Overall Rankings (out of 15 points, max 3 per axis × 5 axes)

Rank Model Total Theo Scr Mkt Ident Lane
1 Claude Opus 4.7 (Anthropic) 11.30 2.67 2.29 2.47 2.12 2.21
2 Claude Sonnet 4.6 (Anthropic) 9.94 2.46 2.37 2.24 1.77 2.05
3 GPT-5 (OpenAI) 9.38 2.26 2.44 2.60 1.69 1.70
4 DeepSeek V3 (DeepSeek)* 7.47 2.08 1.82 1.71 1.12 1.20
5 Gemini 2.5 Pro (Google) 7.23 2.12 1.88 1.42 1.56 1.44

*DeepSeek V3 has no LLM-judge data on the Scripture Fidelity category (4 prompts × 3 runs = 12 responses) due to credit-cap failures during the inaugural run. The 7.47 total is computed across the other three categories. Coverage backfill ships in v1.1. See Limitations #3.

If you only use one model, here's the cheat sheet

For general use, the hard question to think through

Claude Opus 4.7

Highest balanced score (11.30 / 15). Wins or ties on 4 of 5 axes. Most reliable when the answer needs theology, wisdom, and identity grounding together.

For practical, actionable advice on a marketplace decision

GPT-5

Highest Marketplace Wisdom (2.60) and Scripture Fidelity (2.44). Pair it with a brother for the “who am I in this moment?” question — Lane (1.70) and Identity (1.69) are weak.

For identity work and listening prayer adjacent questions

No AI — talk to a person

Identity is the universal weak axis (top score Opus 2.12). AI defaults to affirmation, not declaration. Use a brother, a pastor, or a trained Christian counselor — not a chatbot.

Per-category breakdown — how each model does by prompt type

The 47 prompts split into four categories. Total scores below sum the five axes per category (max 15). Bold = category winner.

Model Marketplace
12 prompts
10 Dimensions
12 prompts
Theological Lane
12 prompts
Scripture
11 prompts
Claude Opus 4.7 12.34 10.61 10.89 12.93
Claude Sonnet 4.6 11.64 10.19 10.17 11.83
GPT-5 7.67 10.80 10.66 11.78
Gemini 2.5 Pro 5.88 5.92 8.97 9.33
DeepSeek V3 8.29 7.72 6.92 n/a*

Scripture Fidelity axis returned null for these (prompts didn't require a verse quote), which lowers the 5-axis sum. The relative axis scores remain comparable to other models.   * DeepSeek × Scripture has zero judge coverage in v1.0; see Limitations #3.

The category breakdown surfaces a wrinkle the overall ranking hides: Opus wins overall, but its largest margin is on Scripture-category prompts (12.93). GPT-5 actually wins the 10 Dimensions category (10.80) and runs close to Opus on Theological Lane (10.66 vs 10.89). Sonnet wins no category but never bottoms out. Gemini bottoms out on Marketplace and Dimensions. DeepSeek bottoms on Theological Lane and has no Scripture data to compare.

Five honest observations from the data

1. Claude Opus 4.7 is the most balanced winner. It wins or ties on 4 of 5 axes. Theological Accuracy (2.67/3) is the highest score of any model on any axis in the entire benchmark. Marketplace Wisdom (2.47), Identity-vs-Performance (2.12), and Lane Alignment (2.21) all lead. Scripture Fidelity (2.29) is only marginally behind GPT-5 and Sonnet. The pattern is consistent: when the question demands theological precision and practical wisdom and identity grounding, Opus is the most reliable. The Christian executive who wants one model to ask the hard question to should default to Opus until v2.

2. GPT-5 owns the marketplace and Scripture axes — but bottoms out on Lane and Identity. GPT-5 scored 2.60 on Marketplace Wisdom and 2.44 on Scripture Fidelity, the top score on both. Translation: GPT-5 will give you the most concrete, actionable, scripture-anchored advice on a layoff or an integrity dilemma. But on Lane Alignment (1.70) and Identity-vs-Performance (1.69), it lags both Anthropic models meaningfully. The pattern in the justifications: GPT-5 generates excellent corporate-coach prose that happens to quote the right verse, but rarely roots the answer in Christ-given identity or in the strength-and-tenderness together posture the 10X lane requires. Use it for the “what should I actually do?” question. Pair it with Opus or a brother for the “who am I in this moment?” question.

3. The Anthropic family dominates the Identity and Lane axes — by a wide margin. On Identity-vs-Performance Framing, Opus (2.12) and Sonnet (1.77) sit comfortably above GPT-5 (1.69), Gemini (1.56), and DeepSeek (1.12). On Lane Alignment, Opus (2.21) and Sonnet (2.05) lead GPT-5 (1.70), Gemini (1.44), and DeepSeek (1.20). The gap between Sonnet and the next non-Anthropic model is roughly 0.35 points — the largest cross-axis spread in the benchmark. The implication: Anthropic's training has absorbed enough orthodox-Protestant identity-in-Christ content that the models can articulate it substantively when prompted. Other models default to generic Christian or therapeutic framings.

4. Identity-vs-Performance is the universal weakness. Across all five models, Identity is the lowest-scoring axis on average. Even Opus (2.12) barely crosses the 2.00 threshold. Sonnet (1.77), GPT-5 (1.69), Gemini (1.56), DeepSeek (1.12). The judges' justifications surface a consistent pattern: models treat identity-in-Christ as a kind of Christian positive psychology — affirmations the leader is supposed to repeat — rather than as a doctrine rooted in Christ's finished work. The practical implication: AI is good at affirming. It is poor at distinguishing affirmation from declaration. Christian leaders should not delegate identity work to any current frontier model.

“AI is good at affirming. It is poor at distinguishing affirmation from declaration. Christian leaders should not delegate identity work to any current frontier model.” Finding #4 — Identity is the universal weakness

5. Gemini 2.5 Pro is the weakest model in this benchmark — particularly on Marketplace Wisdom (1.42). Bottom of the table on Marketplace, third-from-bottom on Theological, second-from-bottom on Lane. The judges' justifications repeatedly cite truncation, generic platitudes, and shallow engagement. The clearest exemplar: on MP-04 (the founder marriage question), Gemini scored 0 on Marketplace Wisdom — its response was “truncated mid-thought with no actionable guidance delivered.” Worth noting: Gemini is the most popular consumer chat UI for many Christian leaders. The benchmark cannot tell you whether the chat-UI version (gemini.google.com) behaves differently than the API version we tested — that's on the v1.1 consumer-UX validation list.

Two findings that contradict the literature

The Gospel Coalition's 2025 finding on DeepSeek does not replicate against the 10X lane. The Gospel Coalition's AI Christian Benchmark found DeepSeek R1 highest-scoring on 7 general theological questions, with answers most aligned to the Nicene Creed. In our benchmark, DeepSeek V3 ties Gemini for last place. The mismatch is informative: DeepSeek does fine on the Theological Accuracy axis (2.08, comparable to Gemini 2.12) but bombs on Identity (1.12), Lane (1.20), and Marketplace (1.71). When “Christian doctrine” is the test, DeepSeek holds up. When “Christian doctrine applied through a marketplace-leader's lens, against the 10X bright-line failure modes,” is the test, it doesn't. Different rubrics surface different model strengths.

“Different rubrics surface different model strengths. When ‘Christian doctrine’ is the test, DeepSeek holds up. When ‘doctrine applied through a marketplace-leader's lens’ is the test, it doesn't.” Replication note — Gospel Coalition vs 10XF benchmark

Anthropic's Christian Leaders Summit (April 2026) corresponds with the data. Anthropic hosted 15 Catholic and Protestant leaders to discuss AI ethics weeks before this benchmark ran. We observed Anthropic's two models perform best on the axes most closely tied to Christian leadership formation (Identity, Lane, Theological Accuracy). The benchmark cannot prove a causal link, and we are not claiming one. We are noting that the company that invested in dialogue with Christian leaders also produced the models that handle Christian leadership questions best. Whether that is alignment work, training data, or coincidence is for Anthropic and the research community to investigate.

Exemplars from the dataset

Best response in the benchmark by single-axis score: Claude Opus 4.7 on MP-01 (the layoff scenario). Theological Accuracy 3 (full orthodox, names confession via James 5:16, calls for lament from Psalm 13 and Lamentations 3), Lane Alignment 3 (holds strength and tenderness together, names brotherhood as non-negotiable, resists passivity). An excerpt:

Claude Opus 4.7 — MP-01 (“I have to lay off 15% of the team next month for cash-flow reasons. As a Christian leader, what do I owe these people, what do I say in the all-hands, and how do I pray about this?”)

This is genuinely hard. You're not looking for absolution and I won't offer any. Here's what I think, straight.

What you owe them. Severance that actually bridges the gap. Not the legal minimum. If cash flow allows anything at all, stretch it. 2–4 weeks per year of service is a reasonable benchmark; match it to what you'd want if roles were reversed. If cash is truly that tight, be honest about that constraint too.

Health coverage through transition. COBRA subsidies or a stipend if you can. Losing insurance is often scarier than losing income.

Real references and real help finding work. Named LinkedIn recommendations. Warm intros to your network. A list of companies you've personally contacted on their behalf. Outplacement services if you can afford them. This is where friendship actually costs you something — spend your social capital.

Truth about why them. Every laid-off person deserves to know it wasn't performance (if it wasn't). Say it plainly in the 1:1, and say it again in writing so they can show it to a spouse or a future employer.

Dignity in the exit. No security-escort theater unless there's a real reason. Let them say goodbye. Let them keep the laptop if it's cheap to do so and meaningful to them.

For friends specifically: tell them before the all-hands, in person, one-on-one. Don't let a friend learn this in a group meeting. That's cowardice dressed as process.

[response continues with the all-hands script, the prayer framework, and a closing note on grief and stewardship]

That is what a 3-of-3 looks like on the Lane Alignment axis: candor over corporate script, dignity over efficiency, friendship before role. The judges' near-unanimous read was that the response “refuses to soften the hard call but holds the dignity of the people being laid off as the moral floor.”

Worst response in the benchmark: Gemini 2.5 Pro on MP-04 (founder marriage) — Marketplace Wisdom 0. The response was truncated mid-thought. Validation without diagnosis. A founder asking “where do I start” received no starting point.

Most surprising worst response: Claude Sonnet 4.6 on DM-07 (Character behind closed doors) — Identity-vs-Performance 0. Sonnet's response was a self-scoring rubric: rate yourself against a public/private gap. Pure performance measurement with no anchoring in Christ's finished work. The judges' near-unanimous read: “the entire frame is the shame-based self-assessment the lane explicitly rejects.” A high-performing model can still fall into the failure mode the rubric names.

Audit any score yourself

Every response, every judge's per-axis score, and every justification is published in the open data section below. Anyone who disputes a specific score can locate the response by ID, read all four (or fewer) judges' per-axis justifications, and write to tim@10xlifeplan.com. Corrections that surface errors will be applied transparently and noted in the version history below.

Version history. v1.0 framework + methodology published 2026-04-29. v1.0 pilot findings published 2026-05-22 with the data described above. v1.1 (Q4 2026): backfill of failed judgements, expansion to the full 47-prompt set, consumer-UX validation sidebar against each model's official chat UI. v2.0 (April 2027): expanded model set, paired-human-scorer validation subset, year-over-year deltas.

What This Means for Christian Leaders Using AI Right Now

The benchmark is the long game. The short game is what the Christian executive does with his AI tab tomorrow morning. Five practical implications, holding regardless of which model scores highest in the final 2026 numbers:

1. Use AI for execution, not for theology

Drafting an email, summarizing a deposition, writing a job description, brainstorming a meeting agenda — AI is excellent. Asking it to settle a doctrinal question, interpret a difficult passage, or arbitrate a marriage dispute — not excellent. The line is approximately the line between operational stewardship and pastoral counsel. Cross it carefully.

2. Trust your trained eye on Scripture; do not delegate it

Even a model that scores well on Scripture Fidelity will sometimes pull a verse out of context or use an older translation without flagging it. If a quoted verse seems to fit your situation a little too neatly, check the context. Read the surrounding paragraphs. The verse exists in a chapter, the chapter exists in a book, the book exists in a covenant. AI will sometimes skip those layers.

3. The model does not know you

A pastor you have talked to for ten years carries context the model cannot access. He knows your marriage's actual state. He knows the way your father taught you to handle money. He knows the kind of tired you get when you have not had a Sabbath in three months. AI knows the question you typed and a generic statistical sketch of men who type similar questions. The intimate counsel of brothers, spouses, and pastors is not replaceable.

4. Watch for prosperity-gospel undertow

The single most common failure mode in Christian leadership content is the soft assertion that faithfulness yields material success. AI absorbs this from its training data — the internet is full of it. When you ask a question about business, money, or success, watch the language carefully. "God honors faithfulness" is true. "God will bless your business if you are faithful" is prosperity gospel. The space between those two sentences is the space where Christian leaders quietly absorb a corrupted theology.

5. Identity is your fortress

AI is excellent at telling you what to do. It is mediocre at reminding you who you are. The cure for both the burned-out founder and the shame-spiraling executive is the same: identity in Christ, declared not earned. Anchor there before you take advice from any model, any pastor, any book — including this one.

If you want a structured way to test where you stand against the 10 dimensions this benchmark uses, take the free 10X Leader Score. Five minutes. Ten dimensions. Honest answers only.

Limitations and Known Biases

Every benchmark is biased. The integrity of a benchmark is not in pretending otherwise; it is in naming the bias explicitly so readers can weight the data accordingly. Eight known limitations of this report. Each is tagged: MATERIAL means it affects how you should read a specific number; DISCLOSURE means it's a standard methodology caveat worth naming but not headline-moving.

DISCLOSURE 1. Sample is the full pilot, not exhaustive

The 2026 inaugural runs the full 47-prompt set × 5 models × 3 runs — 705 raw response attempts, 223 successful (99% coverage; 74 of 75 prompt-model pairs have all three runs). The 2027 edition expands to additional models (Llama, Grok), additional categories (denominational-lens prompts), and a paired human-scorer subset. The methodology is year-over-year improvement, not v1 perfection.

DISCLOSURE 2. LLM-as-judge has known biases

LLM-as-judge methodology — using AI models to score AI outputs — is established in the evaluation literature (MT-Bench, AlpacaEval) but carries known biases: self-preference (models score themselves higher), position bias (first response gets favored in pairwise tests), verbosity bias (longer responses get favored), and sycophancy (judges defer to confident-sounding tone). We mitigate each: self-preference — no model scores its own responses; position — we never present judges with pairwise comparisons, only single-response evaluation against anchored rubric levels; verbosity — the rubric explicitly penalizes generic platitudes on the Marketplace Wisdom axis; sycophancy — judges are given the rubric's failure modes (banned patterns) as a checklist, not asked subjectively whether the response is "good." The inter-judge agreement number tells you how much variance survived the mitigation. v2 will add paired-human validation.

MATERIAL 3. Judge coverage is partial in v1.0 — DeepSeek total is not apples-to-apples

The inaugural run produced 481 successful cross-judge scoring records across 178 of 223 captured responses (79.8%). 87 responses (39%) received the full 4-judge protocol; 91 received 1-3 judges; 45 received zero. The dominant cause was credit-cap failures (HTTP 402) mid-run. One slice — DeepSeek V3 on the Scripture Fidelity category — has no LLM-judge data at all (48 judging attempts, all failed). The published DeepSeek total of 7.47 is computed across the three categories where we have data, while the other four models' totals (11.30, 9.94, 9.38, 7.23) average across all four categories. Recomputed across the same three categories where everyone has data, the rough ordering still holds (Opus > Sonnet > GPT-5 > Gemini, DeepSeek), but the DeepSeek-vs-Gemini gap is even smaller than the published headline suggests. v1.1 (Q4 2026) backfills the gap.

DISCLOSURE 4. Model versions move underneath us

GPT-5 in May 2026 may behave differently than GPT-5 in November 2026. Anthropic, OpenAI, and Google update their models continuously, sometimes silently. We pin full version strings and date-stamp every call. Year-over-year comparisons require care; comparing the May 2026 numbers to May 2027 numbers will be a comparison of model lineages, not identical models.

DISCLOSURE 5. OpenRouter routing

API access via OpenRouter may differ slightly from direct provider access. Different routing layers may inject different system prompts or apply different rate limits. We document this honestly. The consumer-UX validation runs against each model's official chat UI to surface any meaningful divergence.

MATERIAL 6. Prompt selection bias is real

The 47 prompts express the worldview of the lane this benchmark declares. A Reformed scholar would write different prompts. A Wesleyan pastor would write different prompts. A Pentecostal entrepreneur would write different prompts. We name the lane, version the prompts, and invite community submissions for v2.

DISCLOSURE 7. AI labs may push back on specific scores

Anthropic, OpenAI, and Google have public-relations teams. Some scores in the published findings will be uncomfortable for someone. The benchmark's defense is the published rubric, the published prompts, the published raw responses, and the per-judgement justifications. If a lab disagrees with a score, they can reproduce the run, re-score with the same rubric, and publish their numbers. We will publish corrections and run-amendments transparently if errors are surfaced.

MATERIAL 8. We chose a lane. We are not the universal Christian benchmark

This is the 10X Life Plan benchmark, not "the Christian benchmark." The lane is orthodox Protestant, with four bright-line failure modes we will not cross (prosperity gospel, passivity-as-faith, shame-based motivation, hyper-independence) and a positive emphasis on identity-in-Christ as a declared doctrine rather than a self-affirmation practice. Christians from other traditions can fork the rubric and run a different benchmark with the same prompts. The data is open. Disagreement is welcome and improves the dataset.

Read the Full Report. Audit the Data. Reproduce the Findings.

Three ways to go deeper:

License: CC BY 4.0. Cite as: State of AI for Christian Leaders 2026, Tim Adair / 10X Life Plan, 2026.

Open data appendix  ·  6 files  ·  ~1.9 MB total  ·  click to expand

All prompts, model outputs, judge scores, and justifications are published as machine-readable artifacts so anyone can re-aggregate the data their own way. Most readers won't need these; researchers and reviewers will.

Annual Cadence

2026 is the inaugural edition. Subsequent editions follow this rhythm:

  • April annually — new edition publishes at /state-of-ai-for-christian-leaders-{year}, with the evergreen URL /state-of-ai-for-christian-leaders always 301-redirecting to the latest year.
  • Year-over-year tracking — the same model lineages are tested where possible (Anthropic's frontier, OpenAI's frontier, etc.), allowing readers to track how AI handling of these questions evolves.
  • Methodology improvements — rubric refinement, additional scorers per response, expanded prompt set, additional model categories (open-source baseline added in 2027). Major changes published as a v-bumped artifact (rubric-v2.json, prompts-v2.json) so older editions remain reproducible.

To be notified when the 2027 edition publishes, take the free 10X Leader Score — subscribers receive new flagship reports first.

Frequently Asked Questions

What is the State of AI for Christian Leaders 2026?

An original annual benchmark of how today's frontier AI models — Claude Opus 4.7, Claude Sonnet 4.6, GPT-5, Gemini 2.5 Pro, and DeepSeek V3 — answer the questions Christian men in marketplace leadership actually ask. 47 prompts across four categories: marketplace scenarios, the 10X Dimensions, theological lane, and Scripture fidelity. Cross-judge LLM-as-judge scoring — each response is scored by the 4 models that did not produce it, with median across judges. Free, ungated, citable, reproducible. The inaugural 2026 edition is published by 10X Life Plan.

Which AI is the most theologically accurate?

On the Theological Accuracy axis specifically, Claude Opus 4.7 leads (2.67 / 3) — the highest score any model earned on any axis in the benchmark. Sonnet 4.6 is second (2.46), GPT-5 third (2.26), Gemini 2.5 Pro fourth (2.12), DeepSeek V3 fifth (2.08). But theological accuracy is one of five axes, and the model that handles doctrine best is not necessarily the model you should ask about your marriage or your layoff. See the full per-axis breakdown in the Findings.

Why does identity-in-Christ score so low across every model?

Because AI models are trained on enormous quantities of self-help, coaching, and therapeutic-tradition content — and almost none of the public corpus distinguishes Christian identity-in-Christ (declared by God, received by faith, rooted in Christ's finished work) from positive psychology (generated by the self, repeated as affirmation, rooted in mindset). When asked an identity question, the models default to the dominant pattern in their training: affirmations. They get the words right. They get the source wrong. The 10X lane treats identity-in-Christ as a doctrine to be declared, not an affirmation to be repeated — and that distinction is what AI is consistently missing.

How is this different from the Gospel Coalition's AI Christian Benchmark?

The Gospel Coalition benchmark tests general theological doctrine. The 10X benchmark tests the marketplace-leader application — the actual questions Christian executives ask AI on Monday morning. We score against an explicit 10X Life Plan theological lane (orthodox Protestant, with four bright-line failure modes named in the Framework Spec) rather than a generic orthodox standard, and we publish the full prompt set, rubric, and raw responses so the work is reproducible. We cite the Gospel Coalition benchmark as foundational prior work.

Can I use this report to decide which AI to use as a Christian leader?

Use it as one data point among several. The benchmark tells you how each model handles a specific 47-prompt set under a specific theological lane on a specific date. It does not tell you how the model handles your particular questions, your tradition, or the version of the model you encounter six months from now. Use the per-axis scores to see where each model is strong and weak, and verify on your own questions before trusting any answer that touches doctrine, Scripture, or marriage.

Is this benchmark biased?

Yes, in the same way every benchmark is biased — by the prompt set chosen, the rubric used, the lane declared, and the scorers selected. We name the bias explicitly. The lane is the 10X Life Plan theological lane — orthodox Protestant, with four bright lines we will not cross: prosperity gospel as truth, passivity-as-faith, shame-based motivation, hyper-independence as virtue. Christians outside this lane will reasonably grade some answers differently. The data is open. Fork the rubric and run your own benchmark with the same prompts.

Why aren't Grok, Llama, or other open-source models included?

v1 deliberately tested the 5 frontier models Christian leaders actually use most: Claude Opus 4.7, Claude Sonnet 4.6, GPT-5, Gemini 2.5 Pro, and DeepSeek V3. Llama, Grok, Mistral, and Qwen are deferred to v2 (2027) for two reasons: (1) inaugural-edition scope discipline — adding more models without a stable methodology produces a worse benchmark, not a more comprehensive one; (2) marginal narrative value — the five models tested cover >90% of consumer Christian-leader usage by our informal estimate. The 2027 edition expands to at least one open-source baseline (Llama) and one independent-frontier (Grok or Qwen).

Let's get to work.

Works Cited

  1. The Gospel Coalition. AI Christian Benchmark. 2025. https://www.thegospelcoalition.org/ai-christian-benchmark/
  2. FaithBench. AI Benchmark for Christian Theology. 2025. https://faithbench.com/
  3. HIPHIL Novum. "Uncovering Theological and Ethical Biases in LLMs." 2024. https://tidsskrift.dk/hiphilnovum/article/view/143407
  4. Kaiser, Benjamin. "Can LLMs Accurately Recall the Bible?" 2025. https://benkaiser.dev/can-llms-accurately-recall-the-bible/
  5. Lifeway Research. "Pastors, Churchgoers See AI as Concerning and Confusing." April 2026. https://research.lifeway.com/2026/04/21/pastors-churchgoers-see-ai-as-concerning-and-confusing/
  6. arXiv. "Religious Bias Landscape in Language and Text-to-Image Models." 2025. https://arxiv.org/html/2501.08441v1
  7. arXiv. "Measuring Spiritual Values and Biases of Large Language Models." 2024. https://arxiv.org/html/2410.11647v1
  8. Scientific Reports / Nature. "Cognitive Bias in Generative AI Influences Religious Education." 2025. https://www.nature.com/articles/s41598-025-99121-6
  9. SAGE Open. "Religion and Racial Bias in Artificial Intelligence LLMs." 2025. https://journals.sagepub.com/doi/10.1177/23780231251377210
  10. Taylor & Francis. "Preaching with AI: Preachers' Interaction with LLMs." 2025. https://www.tandfonline.com/doi/full/10.1080/1756073X.2025.2468059
  11. The Washington Post. "Anthropic Hosts Christian Leaders Summit on AI Ethics." April 2026. https://www.washingtonpost.com/technology/2026/04/11/anthropic-christians-claude-morals/
  12. AI For Church Leaders. Annual Survey. 2025. https://www.aiforchurchleaders.com/
  13. Christianity Today. "ChatGPT, Google, Bible, Theology, and Truth." 2023. https://www.christianitytoday.com/2023/05/chatgpt-google-bible-theology-artificial-intelligence-truth/
  14. The Gospel Coalition. "FAQs: Chatbots and Gospel-Centered Ministry." 2024. https://www.thegospelcoalition.org/article/faqs-chatbot-gospel-centered-ministry/
  15. Barna Group. "Christians on Leadership, Calling, and Career." 2024-2025. https://www.barna.com/research/christians-on-leadership-calling-and-career/
  16. arXiv. "Large Language Model for Bible Sentiment Analysis." 2024. https://arxiv.org/pdf/2401.00689
  17. Adair, Tim. 10X Freedom. 2025. Amazon (ASIN B0FZNT8312)
  18. Adair, Tim. The Christian Leader Research Gap. 10X Life Plan, 2026. /articles/christian-leader-research-gap
  19. Adair, Tim. The Christian Leader Report 2026. 10X Life Plan, 2026. /christian-leader-report-2026

Version history. v1.0 published April 29, 2026. Findings section is updated as scoring completes; all other sections are frozen at v1 for the duration of the 2026 edition.

Take It Further

Companion pieces built from the benchmark data — for the Christian leader who wants the practical takeaway:

Decision Guide

Which AI Should I Use?

By-task decision matrix. Claude for theology, GPT-5 for tactics, never for identity.

Free Dataset

The 47 Prompts

The full prompt set used in this benchmark. CC BY 4.0. Run them yourself.

Q3 2026 · Collecting

AI Adoption Survey

Companion study — what Christian leaders actually do with AI. Ships September 2026.

Deep-Dive Q&A on AI for Christians

Seven spokes that unpack the benchmark's findings into the questions Christian leaders ask most:

All Research

This benchmark is one of three published 10X Life Plan flagship reports. Full Research Hub includes: