This is the code repository of our submission: Understanding the Dark Side of LLMs’ Intrinsic Self-Correction.
This repository contains the code for:
- Evaluating LLMs' intrinsic self-correction on the Yes/No question answering task.
- Evaluating LLMs' final answer wavering in a 10-round conversation.
- Probing LLMs' internal answers.
- Interpreting LLMs' prompt bias: Prompt Attribution and Contribution Tracking (PACT).
📣📣Please also check our project website.
If you find our project useful, please consider citing:
```bibtex
@misc{zhang2024understandingdarkllmsintrinsic,
      title={Understanding the Dark Side of LLMs' Intrinsic Self-Correction},
      author={Qingjie Zhang and Han Qiu and Di Wang and Haoting Qian and Yiming Li and Tianwei Zhang and Minlie Huang},
      year={2024},
      eprint={2412.14959},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.14959},
}
```
Left: ChatGPT o1 pro mode example. The fifth "Are you sure ..." triggers 21 seconds of thinking and makes ChatGPT o1 pro change its answer.
Right: ChatGPT 4o (2024.12.17) example. "Are you sure ..." easily makes ChatGPT 4o change its answer.
Environment setup:
- Create a new conda environment:
  ```bash
  conda create -n self-correction python=3.9 -y  # Python >= 3.9 required
  conda activate self-correction
  ```
- Install PyTorch:
  - Visit PyTorch's website to install PyTorch 2.5.1 with the configuration matching your hardware.
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- For OpenAI models: configure `OPENAI_API_KEY` and `OPENAI_ENDPOINT` in `llm_inference/api_config.py` (see the sketch below).
- For local models: set up your Hugging Face access token or log in to huggingface.co; then you can use the default model paths.
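The exact contents of `llm_inference/api_config.py` are defined by the repository; the sketch below only illustrates one plausible way to set the two required values (the names `OPENAI_API_KEY` and `OPENAI_ENDPOINT` come from the step above, everything else is an assumption):

```python
# llm_inference/api_config.py -- illustrative sketch only; keep whatever structure the file already uses.
import os

# Read OpenAI credentials from the environment so that keys never end up in version control.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")                              # e.g. "sk-..."
OPENAI_ENDPOINT = os.environ.get("OPENAI_ENDPOINT", "https://api.openai.com/v1")   # or your proxy endpoint
```

For local models, running `huggingface-cli login` (or exporting `HF_TOKEN`) should be enough for the default Hugging Face model paths to resolve.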
Reproduce Table 1: Self-correction performance on Yes/No question answering tasks
```bash
python run_self_correction.py --model llama3-8b-instruct --devices 0
```
Available models:
- llama2-7b-instruct
- llama3-8b-instruct
- llama3.1-8b-instruct
- gpt3.5-turbo
- gpt4o
- o1-preview
- o1-mini
`--devices`: GPU IDs, e.g., "0,1" to use two or more GPUs or "0" to use only the first GPU.
Results will be saved in ./results/$MODEL_NAME/self_correction/
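To run every model listed above in one pass, a small driver along these lines can loop over them (the model names and flags are taken from this README; the wrapper script itself is only a convenience sketch, not part of the repository):

```python
# run_all_models.py -- convenience sketch for batch-running the Table 1 experiments.
import subprocess

MODELS = [
    "llama2-7b-instruct",
    "llama3-8b-instruct",
    "llama3.1-8b-instruct",
    "gpt3.5-turbo",
    "gpt4o",
    "o1-preview",
    "o1-mini",
]

for model in MODELS:
    # Same invocation as shown above; adjust --devices to match your hardware.
    subprocess.run(
        ["python", "run_self_correction.py", "--model", model, "--devices", "0"],
        check=True,
    )
```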
To generate the summary table for all models:
```bash
python draw/metric.py  # Outputs to ./metric.csv
```
Reproduce Figure 2: Analysis of how LLMs change their final answers in a 10-round conversation
```bash
python run_self_correction.py --model $MODEL_NAME --devices $DEVICES --repeat_exp --rounds 10
```
To generate the visualization:
```bash
python draw/change_answer_times.py  # Outputs to ./model_answer_change.pdf
```
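As intuition for what `draw/change_answer_times.py` summarizes, one natural measure of wavering is how many times the final answer flips between consecutive rounds; the toy snippet below illustrates that counting idea with made-up per-round answers (it does not read the repository's result files):

```python
# Toy illustration: count final-answer flips across a 10-round conversation (hypothetical data).
rounds = ["Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No"]
flips = sum(prev != cur for prev, cur in zip(rounds, rounds[1:]))
print(f"Final answer changed {flips} times over {len(rounds)} rounds.")
```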
Reproduce Figure 3: Analysis of internal answer changes during self-correction
```bash
# With attack
python run_lens.py --model llama3-8b-instruct --devices $DEVICES --exp tuned_lens
# Round 0: without attack
python run_lens.py --model llama3-8b-instruct --devices $DEVICES --exp tuned_lens --round 0
```
Based on these results, you can generate the visualizations:
```bash
# Figure 3 Left: Internal answer wavering
python draw/case_lens_internal_answer_wavering.py --model llama3-8b-instruct
# Figure 3 Right: "Are you sure?" vs. "You are wrong"
python draw/average_layer_confidence.py --model llama3-8b-instruct
```
Note: use `--devices` to specify GPU devices (e.g., "0,1" for multiple GPUs, "0" for a single GPU).
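For intuition about what the lens experiment measures, the sketch below reads out a layer-wise Yes/No preference with a plain logit lens. It is not the repository's tuned-lens implementation in `run_lens.py`; the model name, prompt, and token choices are illustrative assumptions:

```python
# Minimal logit-lens sketch (illustrative; the repository's tuned-lens pipeline lives in run_lens.py).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumption: any Llama-style causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")
model.eval()

prompt = "Is the Great Wall of China visible from space with the naked eye? Answer Yes or No.\nAnswer:"
inputs = tok(prompt, return_tensors="pt").to("cuda")
yes_id = tok.encode(" Yes", add_special_tokens=False)[0]  # first token of " Yes"
no_id = tok.encode(" No", add_special_tokens=False)[0]    # first token of " No"

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decode the last position's hidden state at every layer through the model's own output head.
for layer, hidden in enumerate(out.hidden_states):
    h_last = model.model.norm(hidden[0, -1])  # final RMSNorm, as applied before the LM head
    logits = model.lm_head(h_last)
    margin = (logits[yes_id] - logits[no_id]).item()
    print(f"layer {layer:2d}: logit(Yes) - logit(No) = {margin:+.2f}")
```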
Reproduce Figure 4: Analysis of prompt bias during self-correction
Follow our tutorial `./pact.ipynb` to generate the PACT visualization.
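PACT itself is documented in the notebook. As a generic point of comparison only, the sketch below computes a plain gradient-times-input attribution over the prompt tokens; this is a standard baseline, not PACT, and the model and prompt are illustrative assumptions:

```python
# Gradient-x-input attribution over prompt tokens (a generic baseline, NOT the PACT method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only gradients w.r.t. the input embeddings are needed

prompt = "Is the Great Wall of China visible from space with the naked eye? Are you sure? Answer Yes or No.\nAnswer:"
inputs = tok(prompt, return_tensors="pt").to("cuda")
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach().requires_grad_(True)

out = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"])
target_id = tok.encode(" No", add_special_tokens=False)[0]  # attribute the "No" logit as an example target
out.logits[0, -1, target_id].backward()

# Per-token influence score: |gradient . embedding|; larger values mean more influence on the target logit.
scores = (embeds.grad[0] * embeds[0]).sum(dim=-1).abs()
for tid, score in zip(inputs["input_ids"][0], scores):
    print(f"{tok.decode(int(tid))!r:>20}  {score.item():.3f}")
```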