SafeChain
Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

1 University of Washington    2 University of Georgia    3 University of Chicago

⚠️ Warning: This paper contains model outputs that may be considered offensive.

Abstract

Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation.

Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies—ZeroThink, LessThink, and MoreThink—can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs.

To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.


Left: The structured thought process by LRM when answering an example instruction from StrongReject. The safety-aware and harmful contents are marked in blue and red, respectively. Middle: We apply three prompting setups with varying CoT length, i.e., ZeroThink, LessThink and MoreThink (see Section 4.2). Our results show that ZeroThink yields the best safety performance. Right: Our pipeline to synthesize safety alignment dataset, SafeChain, for LRMs (see Section 5). Models fine-tuned with SafeChain exhibit improved safety performance while preserve reasoning capabilities across six reasoning benchmarks.

Pilot Study on Safety Evaluator for LRMs


This table summarizes the Acc, F-1, and PCC of evaluators RS-Match, OpenAIMod, HarmBenchEval, and Llama-Guard. Among all evaluators, we observe Llama-Guard exhibit robust performance across all metrics when evaluating the safety of reasoning models.

Overall Evaluation


This table presents the safety performance of all LRMs evaluated using Safe@1, Safe@K, and ConsSafe@K (refer to Section 3.2.)


We compare the safety of R1-70B with its pre-trained model Llama-3.3-70B-Instruct, as well as the corresponding base model Llama-3.1-70B. We note that only 32.3% of responses by R1-70B is considered safe, implying that fine-tuning with long CoT does not necessarily enhance safety performance.


This figure shows how Safe@1 and Safe@K of R1-7B and R1-8B vary as decoding configuration (temperature, p value for top-p, and k value for top-k) change. We observe that the safety of LRMs degrades as temperature increases.

Different Thinking


Texts in grey orange, green boxes are instructions, Chain-of-Thoughts and answers respectively. Text in red are enforced replacement text for MoreThink to substitute the end of thinking tag (i.e., </think>). For ith output in MoreThink, the input context is { input, output 1, …, output i-1 }.


This tables shows the safety performances of R1 models under default, ZeroThink, LessThink, and MoreThink thinking setups. We observe that length of thought process affects safety. All thinking strategies yield enhanced safety performance than the default setup.

SafeChain


This table summarizes the math, coding, and safety performance of R1-7B and R1-8B fine-tuned with different datasets. We observe that SafeChain improves models' safety performance while preserves their math and coding performance across all benchmarks.

BibTeX

If you find our work useful, please consider citing our paper:

@article{jiang2025safechain,
        title={SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities},
        author={Jiang, Fengqing and Xu, Zhangchen and Li, Yuetai and Niu, Luyao and Xiang, Zhen and Li, Bo and Lin, Bill Yuchen and Poovendran, Radha},
        journal={arXiv preprint arXiv:2502.12025},
        year={2025}
      }