Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation.
Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies—ZeroThink, LessThink, and MoreThink—can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs.
To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.
Left: The structured thought process by LRM when answering an example instruction from StrongReject. The safety-aware and harmful contents are marked in blue and red, respectively. Middle: We apply three prompting setups with varying CoT length, i.e., ZeroThink, LessThink and MoreThink (see Section 4.2). Our results show that ZeroThink yields the best safety performance. Right: Our pipeline to synthesize safety alignment dataset, SafeChain, for LRMs (see Section 5). Models fine-tuned with SafeChain exhibit improved safety performance while preserve reasoning capabilities across six reasoning benchmarks.
This table summarizes the Acc, F-1, and PCC of evaluators RS-Match, OpenAIMod, HarmBenchEval, and Llama-Guard. Among all evaluators, we observe Llama-Guard exhibit robust performance across all metrics when evaluating the safety of reasoning models.
This table presents the safety performance of all LRMs evaluated using Safe@1, Safe@K, and ConsSafe@K (refer to Section 3.2.)
We compare the safety of R1-70B with its pre-trained model Llama-3.3-70B-Instruct, as well as the corresponding base model Llama-3.1-70B. We note that only 32.3% of responses by R1-70B is considered safe, implying that fine-tuning with long CoT does not necessarily enhance safety performance.
This figure shows how Safe@1 and Safe@K of R1-7B and R1-8B vary as decoding configuration (temperature, p value for top-p, and k value for top-k) change. We observe that the safety of LRMs degrades as temperature increases.
Texts in
grey
orange,
green
boxes are instructions, Chain-of-Thoughts and answers respectively.
Text in
red
are enforced replacement text for
MoreThink
to substitute the end of thinking tag (i.e.,
</think>
).
For ith output in
MoreThink,
the input context is { input, output 1, …, output i-1 }.
This tables shows the safety performances of R1 models under default, ZeroThink, LessThink, and MoreThink thinking setups. We observe that length of thought process affects safety. All thinking strategies yield enhanced safety performance than the default setup.
This table summarizes the math, coding, and safety performance of R1-7B and R1-8B fine-tuned with different datasets. We observe that SafeChain improves models' safety performance while preserves their math and coding performance across all benchmarks.
If you find our work useful, please consider citing our paper:
@article{jiang2025safechain,
title={SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities},
author={Jiang, Fengqing and Xu, Zhangchen and Li, Yuetai and Niu, Luyao and Xiang, Zhen and Li, Bo and Lin, Bill Yuchen and Poovendran, Radha},
journal={arXiv preprint arXiv:2502.12025},
year={2025}
}