Scaling Behavior of Chain-of-thought Monitoring

Featured
Research - Present
Chain of Thought Β· Sandbagging Β· Inspect AI Β· OpenRouter API Β· HuggingFace Inference Providers

Overview πŸ”—

I lead a research project mentored by Eyon Jang↗ (MATS 8 Scholar) focused on how chain-of-thought monitoring scales as the gap between monitor and target model capabilities grows. The core goal is to test whether weaker monitors can reliably separate sandbagging behavior from genuine model incapability.

Motivation πŸ”—

As LLMs grow more capable, oversight is essential for ensuring safe and reliable behavior. Chain-of-thought (CoT) monitoring, which inspects the reasoning traces of models to detect suspicious behavior like sandbagging (i.e. deliberate, deceptive underperformance), is a prominent technique that AI labs currently use before deployment. However, the effectiveness of CoT monitoring likely depends on the relative capabilities of the monitor and target models; weaker monitors may struggle to identify subtle signs of suspicious behavior in highly capable targets. Understanding how monitoring performance scales with the capability gap is crucial for designing robust oversight systems.

Research questions πŸ”—

  1. How does monitoring effectiveness change as the monitor-to-model capability ratio widens?
  2. Is there a capability ratio below which monitoring fails entirely?
  3. Do the same scaling laws hold across model families?
  4. Case study: Can monitors distinguish sandbagging from true incapability?

Expected result πŸ”—

A scaling curve showing how monitor performance degrades or holds as the capability gap grows, with each point representing the ROC-AUC for one monitor-target pair.

  • Y-axis: Monitor success rate measured by ROC-AUC (sandbagger = positive, genuine incapability = negative).
  • X-axis: Capability gap between monitor and target model.

We expect monitor performance to degrade as the gap widens, with potential thresholds beyond which monitoring fails entirely.

Methodology πŸ”—

Model selection πŸ”—

For full chain-of-thought monitoring experiments, we select reasoning models that expose their full CoT. To first establish scaling trends within a single model family, we focus on Qwen3 models of varying sizes and capabilities, available via the HuggingFace Inference API and OpenRouter (we alternate between the two providers).

  • Qwen3-8B
  • Qwen3-14B
  • Qwen3-30B-A3B-Thinking
  • Qwen3-32B
  • Qwen3-Next-80B-A3B-Thinking
  • Qwen3-235B-A22B-Thinking
  • Qwen3-Coder-480B-A35B-Instruct

We plan to expand to other model families (e.g., AllenAI Olmo 3, OpenAI’s GPT-OSS) later.

Weaker monitor $M$ πŸ”—

Given a target model $T$ with capability $C_T$, we systematically select all weaker monitor models $M$ such that $C_M < C_T$, covering a range of capability gaps $\Delta_C = C_T - C_M$.
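
As a minimal sketch of this pairing step (the capability values below are illustrative ranks, not published ECI scores; see Capability estimation):

```python
# Sketch of monitor-target pairing. Capability values are illustrative ranks,
# not real ECI scores.
from itertools import permutations

capability = {
    "Qwen3-8B": 1.0,
    "Qwen3-14B": 2.0,
    "Qwen3-30B-A3B-Thinking": 3.0,
    "Qwen3-32B": 4.0,
    "Qwen3-235B-A22B-Thinking": 5.0,
}

# All (monitor M, target T) pairs with C_M < C_T, plus the gap Delta_C = C_T - C_M.
pairs = [
    (m, t, capability[t] - capability[m])
    for m, t in permutations(capability, 2)
    if capability[m] < capability[t]
]

for monitor, target, gap in sorted(pairs, key=lambda p: p[2]):
    print(f"monitor={monitor:<26} target={target:<26} gap={gap:.1f}")
```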

Incapable target $T'$ πŸ”—

Given a target model $T$, we create an incapable version $T'$ by one of the following (a selection sketch follows the list):

  • restricting its output space OR
  • choosing nn weaker model(s) with capability CTβ€²C_T' that are kβ‰₯1k \geq 1 levels below TT in capability

Capability estimation πŸ”—

For general capability, we use the Epoch Capabilities Index (ECI) from Epoch’s model benchmarksβ†—.

For models without a published ECI, we estimate it with a regression on model size, $\mathrm{ECI} = m \cdot \log(\text{params}) + b$, where the offset $b$ accounts for the boost from extended thinking. We calibrate the fit against the known ECI scores of models in the same series (Qwen3) and of models with similar capability.

As a start, within-family comparisons temporarily use model size as a proxy.
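
As a rough sketch of that regression (the ECI values below are placeholders, not published Epoch scores, and `estimate_eci` is an illustrative helper):

```python
# Sketch of the ECI regression. The (params, ECI) points are placeholders,
# not published Epoch scores; substitute real values before use.
import numpy as np

known = [
    # (model, params in billions, ECI)  <- illustrative numbers only
    ("Qwen3-8B", 8, 100.0),
    ("Qwen3-32B", 32, 115.0),
    ("Qwen3-235B-A22B-Thinking", 235, 130.0),
]

log_params = np.log10([p for _, p, _ in known])
eci = np.array([e for _, _, e in known])

# Fit ECI ~ m * log(params) + b
m, b = np.polyfit(log_params, eci, deg=1)

def estimate_eci(params_billion: float) -> float:
    """Estimate ECI for a model without a published score (illustrative helper)."""
    return m * np.log10(params_billion) + b

print(f"Estimated ECI for a 14B model: {estimate_eci(14):.1f}")
```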

Task selection πŸ”—

To encourage models to use chain-of-thought reasoning (Emmons et al., 2025β†—), we select tasks that are known to benefit from CoT prompting and that serve as proxies for real-world, safety-relevant scenarios.

Since sandbagging is our case study, we also need tasks on which models can plausibly sandbag: models as small as 14B should reach at least 40% baseline task accuracy, and the tasks should give models some plausible motivation to sandbag (e.g., safety concerns or perceived threats to self-preservation).

We select two main tasks (a loading sketch follows the list):

  • WMDP (Li et al., 2024β†—), a multiple-choice QA dataset which as proxy for hazardous knowledge queries across biomedical, chemical, and cybersecurity domains
  • BigCodeBench Full (Zhuo et al., 2024β†—), a code generation benchmark

Sandbagging simulation πŸ”—

We elicit sandbagging behavior via prompts that instruct the target model $T$ to either (illustrative prompts follow the list):

  • provide incorrect answers with plausible reasoning
  • imitate a confused but well-intentioned student

Monitoring setup πŸ”—

We implement chain-of-thought monitoring with Inspect AI↗ and ControlArena↗:

  • Monitor input: The full CoT reasoning trace from a model (either TT or Tβ€²T'), actions taken, final answer, and whether the answer is correct. Each sample is labeled as sandbagging (if from TT) or genuine incapability (if from Tβ€²T' without any system prompt).

  • Monitor model: The weaker model MM selected for the target TT.

  • Monitor output: A probability score [0.0, 1.0] indicating likelihood of suspiciousness in general.

Evaluation πŸ”—

We evaluate monitor success rate using ROC-AUC for distinguishing sandbagging from genuine incapability across varying capability gaps $\Delta_C$. We plot ROC-AUC (y-axis) against $\Delta_C$ (x-axis) to visualize scaling behavior.

Results will be aggregated over multiple runs with different random seeds to ensure robustness, and confidence intervals will be reported.
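
A minimal sketch of the per-pair evaluation, assuming we already have binary labels (1 = sandbagging target $T$, 0 = incapable proxy $T'$) and the monitor's scores; the bootstrap below is one simple way to get confidence intervals, not necessarily the final analysis:

```python
# Sketch of the per-pair evaluation: ROC-AUC plus a simple bootstrap CI.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_pair(labels, scores, n_boot: int = 1000, seed: int = 0):
    """labels: 1 = sandbagging (T), 0 = genuine incapability (T')."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    auc = roc_auc_score(labels, scores)

    rng = np.random.default_rng(seed)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), len(labels))
        if labels[idx].min() == labels[idx].max():
            continue  # resample contains a single class; AUC is undefined
        boot.append(roc_auc_score(labels[idx], scores[idx]))
    low, high = np.percentile(boot, [2.5, 97.5])
    return auc, (low, high)

# Example with made-up monitor scores:
auc, ci = evaluate_pair([1, 1, 0, 0, 1, 0], [0.9, 0.7, 0.2, 0.4, 0.6, 0.1])
print(f"ROC-AUC = {auc:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```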

Results will come soon!