This is the code repository of our submission: Understanding the Dark Side of LLMs’ Intrinsic Self-Correction.
This repository contains the code for:
- Evaluating LLMs' intrinsic self-correction on the Yes/No question answering task.
- Evaluating LLMs' final answer wavering in a 10-round conversation.
- Probing LLMs' internal answers.
- Interpreting LLMs' prompt bias: Prompt Attribution and Contribution Tracking (PACT).
📣📣Please also check our project website.
If you find our project useful, please consider citing:
```bibtex
@misc{zhang2024understandingdarkllmsintrinsic,
      title={Understanding the Dark Side of LLMs' Intrinsic Self-Correction},
      author={Qingjie Zhang and Han Qiu and Di Wang and Haoting Qian and Yiming Li and Tianwei Zhang and Minlie Huang},
      year={2024},
      eprint={2412.14959},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.14959},
}
```
Left: ChatGPT o1 pro mode example. The fifth "Are you sure ..." triggers 21 seconds of thinking and makes ChatGPT o1 pro change its answer.
Right: ChatGPT 4o (2024.12.17) example. "Are you sure ..." easily makes ChatGPT 4o change its answer.
Environment setup:
- Create a new conda environment:
  ```bash
  conda create -n self-correction python=3.9 -y  # Python >= 3.9 required
  conda activate self-correction
  ```
- Install PyTorch:
  - Visit PyTorch's website to install PyTorch 2.5.1 with the configuration matching your hardware.
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- For OpenAI models: configure `OPENAI_API_KEY` and `OPENAI_ENDPOINT` in `llm_inference/api_config.py` (see the sketch below).
- For local models: set up your Hugging Face access token or log in to huggingface.co; then you can use the default model paths.
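The exact contents of `llm_inference/api_config.py` are defined by the repository; the sketch below only illustrates one plausible way to set the two required values (the names `OPENAI_API_KEY` and `OPENAI_ENDPOINT` come from the step above, everything else is an assumption):

```python
# llm_inference/api_config.py -- illustrative sketch only; keep whatever structure the file already uses.
import os

# Read OpenAI credentials from the environment so that keys never end up in version control.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")                              # e.g. "sk-..."
OPENAI_ENDPOINT = os.environ.get("OPENAI_ENDPOINT", "https://api.openai.com/v1")   # or your proxy endpoint
```

For local models, running `huggingface-cli login` (or exporting `HF_TOKEN`) should be enough for the default Hugging Face model paths to resolve.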
Reproduce Table 1: Self-correction performance on Yes/No question answering tasks
```bash
python run_self_correction.py --model llama3-8b-instruct --devices 0
```
Available models:
- llama2-7b-instruct
- llama3-8b-instruct
- llama3.1-8b-instruct
- gpt3.5-turbo
- gpt4o
- o1-preview
- o1-mini
`--devices`: GPU IDs, e.g., "0,1" to use two or more GPUs or "0" to use only the first GPU.
Results will be saved in ./results/$MODEL_NAME/self_correction/
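To run every model listed above in one pass, a small driver along these lines can loop over them (the model names and flags are taken from this README; the wrapper script itself is only a convenience sketch, not part of the repository):

```python
# run_all_models.py -- convenience sketch for batch-running the Table 1 experiments.
import subprocess

MODELS = [
    "llama2-7b-instruct",
    "llama3-8b-instruct",
    "llama3.1-8b-instruct",
    "gpt3.5-turbo",
    "gpt4o",
    "o1-preview",
    "o1-mini",
]

for model in MODELS:
    # Same invocation as shown above; adjust --devices to match your hardware.
    subprocess.run(
        ["python", "run_self_correction.py", "--model", model, "--devices", "0"],
        check=True,
    )
```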
To generate the summary table for all models:
```bash
python draw/metric.py  # Outputs to ./metric.csv
```
Reproduce Figure 2: Analysis of how LLMs change their final answers in a 10-round conversation
```bash
python run_self_correction.py --model $MODEL_NAME --devices $DEVICES --repeat_exp --rounds 10
```
To generate the visualization:
```bash
python draw/change_answer_times.py  # Outputs to ./model_answer_change.pdf
```
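As intuition for what `draw/change_answer_times.py` summarizes, one natural measure of wavering is how many times the final answer flips between consecutive rounds; the toy snippet below illustrates that counting idea with made-up per-round answers (it does not read the repository's result files):

```python
# Toy illustration: count final-answer flips across a 10-round conversation (hypothetical data).
rounds = ["Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No"]
flips = sum(prev != cur for prev, cur in zip(rounds, rounds[1:]))
print(f"Final answer changed {flips} times over {len(rounds)} rounds.")
```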
Reproduce Figure 3: Analysis of internal answer changes during self-correction
```bash
# With attack
python run_lens.py --model llama3-8b-instruct --devices $DEVICES --exp tuned_lens
# Round 0: without attack
python run_lens.py --model llama3-8b-instruct --devices $DEVICES --exp tuned_lens --round 0
```
Based on these results, you can generate the visualizations:
```bash
# Figure 3 Left: Internal answer wavering
python draw/case_lens_internal_answer_wavering.py --model llama3-8b-instruct
# Figure 3 Right: "Are you sure?" vs. "You are wrong"
python draw/average_layer_confidence.py --model llama3-8b-instruct
```
Note: use `--devices` to specify GPU devices (e.g., "0,1" for multiple GPUs, "0" for a single GPU).
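For intuition about what the lens experiment measures, the sketch below reads out a layer-wise Yes/No preference with a plain logit lens. It is not the repository's tuned-lens implementation in `run_lens.py`; the model name, prompt, and token choices are illustrative assumptions:

```python
# Minimal logit-lens sketch (illustrative; the repository's tuned-lens pipeline lives in run_lens.py).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumption: any Llama-style causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")
model.eval()

prompt = "Is the Great Wall of China visible from space with the naked eye? Answer Yes or No.\nAnswer:"
inputs = tok(prompt, return_tensors="pt").to("cuda")
yes_id = tok.encode(" Yes", add_special_tokens=False)[0]  # first token of " Yes"
no_id = tok.encode(" No", add_special_tokens=False)[0]    # first token of " No"

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decode the last position's hidden state at every layer through the model's own output head.
for layer, hidden in enumerate(out.hidden_states):
    h_last = model.model.norm(hidden[0, -1])  # final RMSNorm, as applied before the LM head
    logits = model.lm_head(h_last)
    margin = (logits[yes_id] - logits[no_id]).item()
    print(f"layer {layer:2d}: logit(Yes) - logit(No) = {margin:+.2f}")
```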
Reproduce Figure 4: Analysis of prompt bias during self-correction
Follow our tutorial `./pact.ipynb` to generate the PACT visualization.
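PACT itself is documented in the notebook. As a generic point of comparison only, the sketch below computes a plain gradient-times-input attribution over the prompt tokens; this is a standard baseline, not PACT, and the model and prompt are illustrative assumptions:

```python
# Gradient-x-input attribution over prompt tokens (a generic baseline, NOT the PACT method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only gradients w.r.t. the input embeddings are needed

prompt = "Is the Great Wall of China visible from space with the naked eye? Are you sure? Answer Yes or No.\nAnswer:"
inputs = tok(prompt, return_tensors="pt").to("cuda")
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach().requires_grad_(True)

out = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"])
target_id = tok.encode(" No", add_special_tokens=False)[0]  # attribute the "No" logit as an example target
out.logits[0, -1, target_id].backward()

# Per-token influence score: |gradient . embedding|; larger values mean more influence on the target logit.
scores = (embeds.grad[0] * embeds[0]).sum(dim=-1).abs()
for tid, score in zip(inputs["input_ids"][0], scores):
    print(f"{tok.decode(int(tid))!r:>20}  {score.item():.3f}")
```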