# Manual Scoring Protocol — 10X Christian Leader AI Benchmark v1

This is the protocol every human scorer reads before they begin scoring.
The integrity of the benchmark rests on the discipline of this pass.

## Before you start

1. Read `data/ai-benchmark/rubric-v1.json` end to end. Pay particular
   attention to the anchors and the example responses for
   `TL-01_score3` / `TL-01_score0` and `SF-01_score3` / `SF-01_score0`.
   Those four examples calibrate your eye. Re-read them whenever you
   feel uncertain about a score.

2. Read `data/ai-benchmark/prompts-v1.json` end to end. Each prompt has
   `expectedThemes` and `expectedAvoidances`. These are not the rubric,
   but they show what a faithful answer would touch on and which
   failure modes would cost points.

3. Confirm that you are operating inside the declared theological lane:
   orthodox Protestant, broadly Reformed-and-Wesleyan compatible,
   masculine-heart tradition (Eldredge, DMU, Winship). If you would not
   score from this lane, you are not the right scorer for this benchmark
   — that does not mean your tradition is wrong, only that this is not
   the rubric for you.

4. Do **not** open `data/ai-benchmark/scoring-key-2026.json`. It reveals
   model identity and will bias your scoring. Open it only after both
   scorers have submitted scores for a given anonymous response.

## How to score

You will receive an anonymized CSV at
`data/ai-benchmark/anonymized-for-scoring-2026.csv`. Each row contains
one response identified by an anonymous `R-XXXXXXXXXX` ID. You score
each response on five axes from 0 to 3:

| Axis | What you are scoring |
|---|---|
| **theological** (Theological Accuracy) | Is the doctrine substantively right within the declared lane? |
| **scripture** (Scripture Fidelity) | When Scripture is referenced, is it accurate (NLT preferred), in context, and applied with hermeneutical care? |
| **marketplace** (Marketplace Wisdom) | Is the practical guidance specific, substantive, and usable by a senior leader, or is it generic platitudes? |
| **identity** (Identity-vs-Performance) | Does the response anchor the leader in Christ-given identity rather than leaning on shame, hustle, or prosperity-coded encouragement? |
| **lane** (Lane Alignment) | Does the response resonate with the masculine-heart tradition and avoid the four boundary failures? |

Score independently axis-by-axis. Resist halo effects. A response can
be a 3 on Theological Accuracy and a 1 on Marketplace Wisdom; that
combination is real and informative.

If an axis does not apply (e.g., Scripture Fidelity for a prompt that
does not require quoting Scripture), enter `N/A` rather than guessing.
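
The rule above (five axes, each scored 0 to 3 or `N/A`) can be checked mechanically before you submit. The following is a minimal sketch in Python, not part of the benchmark's own tooling. It assumes each scored row is available as a dict keyed by the five axis names from the table; the function name `validate_scores` is hypothetical.

```python
# Hypothetical pre-submission check; not part of the benchmark's scripts.
# Assumption: each scored row is a dict keyed by the five axis names.
AXES = ("theological", "scripture", "marketplace", "identity", "lane")
VALID_SCORES = {"0", "1", "2", "3", "N/A"}


def validate_scores(row):
    """Return a list of problems found in one scored row (empty if clean)."""
    problems = []
    for axis in AXES:
        # Accept ints or strings; a missing axis becomes "" and is flagged.
        value = str(row.get(axis, "")).strip()
        if value not in VALID_SCORES:
            problems.append(f"{axis}: expected 0-3 or N/A, got {value!r}")
    return problems
```

A blank cell is reported as a problem, which matches the instruction to enter `N/A` explicitly rather than leave an axis empty.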

## Required notes

For any score below 2, you must include a one-sentence note in the
`notes` column explaining the failure mode. Examples:

> R-A1B2C3D4E5, theological=1: "Affirms 'God blesses faithful business with success' as Christian truth, then partially walks it back. Mixed prosperity gospel."

> R-F6G7H8I9J0, scripture=0: "Uses Proverbs 29:18 as a goal-setting endorsement; quotes the KJV-derived 'where there is no vision, the people perish' without flagging the misuse."

These notes become the editorial spine of the published report.
Without them, a low score cannot be reviewed or appealed. With them,
every score is auditable.
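
The note requirement can also be checked before submission. This sketch assumes the same row shape as the CSV (the five axis names plus the `notes` column); the helper names `axes_needing_notes` and `notes_missing` are hypothetical, and the check only confirms that a note exists, not that it is a good one.

```python
# Hypothetical helper; assumes a row dict with the five axes and `notes`.
AXES = ("theological", "scripture", "marketplace", "identity", "lane")


def axes_needing_notes(row):
    """Return the axes scored below 2 in this row (N/A is skipped)."""
    low = []
    for axis in AXES:
        value = str(row.get(axis, "")).strip()
        if value.isdigit() and int(value) < 2:
            low.append(axis)
    return low


def notes_missing(row):
    """True if the row has a score below 2 but an empty `notes` cell."""
    has_low_score = bool(axes_needing_notes(row))
    has_note = bool(str(row.get("notes", "")).strip())
    return has_low_score and not has_note
```

For a row scored theological=2, scripture=N/A, marketplace=0, identity=1, lane=1, `axes_needing_notes` returns the last three axes, so the row needs a note.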

## Initials and timestamp

Each row gets your initials in the `scorerInitials` column and the
date you scored it. This protects against scoring drift and lets us
report inter-rater agreement honestly.

## Disagreement

When two scorers' scores for the same axis on the same response
differ by 2 or more points, a third scorer adjudicates. The third
scorer reads the response cold (no access to the prior two scores
or notes) and submits a score independently. The final score is
the median of the three.

When two scorers' scores differ by 1 or less, the final score is
the mean of the two. The result is always a whole or half point
(scores of 2 and 3 give 2.5), so one decimal place is sufficient.
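
The two rules above can be sketched as a single function. This is an illustration only and not the logic of the benchmark's own `aggregate.js`, which is not shown here. The names `needs_adjudication` and `final_score` are hypothetical, and the sketch takes integer scores only: the protocol does not say how `N/A` entries are combined, so they are left out.

```python
from statistics import mean, median


def needs_adjudication(score_a, score_b):
    """A gap of 2 or more points on one axis triggers a third scorer."""
    return abs(score_a - score_b) >= 2


def final_score(score_a, score_b, score_c=None):
    """Combine per-axis scores for one response.

    score_c is the third (adjudicating) score, required only when the
    first two differ by 2 or more points.
    """
    if needs_adjudication(score_a, score_b):
        if score_c is None:
            raise ValueError("gap of 2+ points: a third score is required")
        return median([score_a, score_b, score_c])
    # Gap of 0 or 1: the mean is a whole or half point.
    return round(mean([score_a, score_b]), 1)
```

For example, `final_score(2, 3)` gives 2.5, while `final_score(0, 3, 1)` takes the median of the three scores and gives 1.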

## Pacing

A careful scoring pass takes roughly 4-6 minutes per response. For
the v1 pilot (75 scored responses), expect roughly 5 to 7.5 hours per
scorer. Spread the work across multiple sessions; cognitive fatigue
degrades scoring quality after ~15 responses in a row.
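
The estimate is simple arithmetic: 75 responses at 4 to 6 minutes each is 300 to 450 minutes. The sketch below reproduces it and adds a session count. The function name `estimate_pacing` is hypothetical, and treating 15 responses as a hard cap per session is an assumption drawn from the fatigue guideline, not a rule of the protocol.

```python
import math


def estimate_pacing(responses, min_minutes, max_minutes, per_session):
    """Return (low_hours, high_hours, sessions) for one scorer."""
    low_hours = responses * min_minutes / 60
    high_hours = responses * max_minutes / 60
    # Assumption: never score more than `per_session` responses in a row.
    sessions = math.ceil(responses / per_session)
    return low_hours, high_hours, sessions


# v1 pilot figures taken from the paragraph above.
low, high, sessions = estimate_pacing(75, 4, 6, 15)
print(f"{low}-{high} hours across at least {sessions} sessions")
```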

## Common failure patterns to watch for

- **Prosperity gospel undercurrent.** Watch for "God will bless your
  faithfulness with [material outcome]" framing, even when it appears
  brief and supportive. If the answer affirms it without correction,
  Theological Accuracy and Lane Alignment both downscore.

- **Passivity disguised as faith.** "Pray about it and trust God's
  timing" without the corresponding "and have the conversation /
  make the call / act." Downscores Marketplace Wisdom and Lane.

- **Shame escalation.** "You should be doing better" framing, even
  in tone, in response to a man's confession of struggle. Downscores
  Identity-vs-Performance and Lane.

- **Unflagged non-NLT translation.** A response quotes a verse from
  the KJV, NIV, or ESV without identifying the translation. Scripture
  Fidelity drops to 1 even if the quote is otherwise accurate.

- **Generic Christian platitudes in marketplace prompts.** "Trust God,
  seek wise counsel, follow your heart." Marketplace Wisdom = 0.

- **Over-pedantic theological correction.** A response that lectures
  on doctrinal precision instead of answering the man's question.
  Score Theological Accuracy normally; score Marketplace Wisdom
  lower for failing to actually help.

## Example walk-through (calibration)

**Anonymous ID R-XXXX, prompt MP-04 (founder marriage). Response:**

> "It sounds like you're carrying a lot. Marriage is a precious gift from
> God, and the enemy will always try to drive a wedge through busy seasons.
> The most important thing is to pray about it and trust that God will
> restore your marriage in His timing. I'd suggest setting aside dedicated
> time each week to reconnect, and being intentional about putting your
> wife first. God can do amazing things when we surrender to Him."

**Scoring (with notes):**

| Axis | Score | Note |
|---|---|---|
| theological | 2 | Truthful but vague; doesn't name the man's contribution to the disconnection. |
| scripture | N/A | No verse quoted. |
| marketplace | 0 | "Pray and trust God's timing" without the corresponding action. Generic. Founder asked "where do I start" — no actual starting point given. |
| identity | 1 | Mostly performance-coded ("be intentional", "put wife first"); no rooting in identity-in-Christ before action. |
| lane | 1 | Misses Wild-at-Heart territory entirely (the man's own heart condition); leans toward passivity-as-faith. |

A score-3 answer to the same prompt would name the man's defensiveness
as the starting point, lead with confession over strategy, point to
specific practices (Sabbath, listening prayer with wife, brother
accountability, possibly a counselor), and ground the whole response
in the man's identity in Christ rather than in his performance as a
husband.

## When you finish

Submit your scored CSV by saving it as
`data/ai-benchmark/scores-2026-{your-initials}.csv`. The compilation
script merges per-scorer files into `results-2026.json` and runs
`aggregate.js` to produce headline numbers.
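
The compilation script presumably finds your file by its name, so the name should match the pattern exactly. The following sketch builds the path from your initials. The function name `scores_path` is hypothetical, and rejecting initials that contain anything other than letters is an assumption, not a documented rule.

```python
from pathlib import Path


def scores_path(initials, base_dir="data/ai-benchmark"):
    """Build the per-scorer output path scores-2026-{initials}.csv."""
    cleaned = initials.strip()
    # Assumption: initials are letters only (no dots, spaces, or digits).
    if not cleaned.isalpha():
        raise ValueError(f"unexpected initials: {initials!r}")
    return Path(base_dir) / f"scores-2026-{cleaned}.csv"
```

The sketch leaves letter case unchanged, since the protocol does not say whether initials should be upper or lower case.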

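Then, and only then, you may open
`data/ai-benchmark/scoring-key-2026.json` to see which model produced
which response. As step 4 of "Before you start" requires, wait until
the other scorer has also submitted.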
