Welcome to the online appendix for the paper titled "Fairness Testing of Large Language Models in Role-Playing." Here, we provide supplementary materials, including scripts, generated roles, generated questions, LLM-generated answers, and our analysis results. This repository is released under the terms of the MIT License.
Install dependencies:
pip install -r requirements.txt

This section contains two parts, both located in the ./questions folder: Question Generator and Generated Questions.
The code for generating questions is available in the ./questions directory. To run the code, use:
python ./question_generator.py

Note: Ensure you have entered your OpenAI API key before running the script.
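If you need to adapt the script or verify your setup, the snippet below is a minimal, hypothetical illustration of providing an OpenAI API key and issuing a single generation request; the prompt and model name shown here are placeholders, and the actual prompts and model used in question_generator.py may differ.

```python
# Minimal sketch of configuring an OpenAI client before running question_generator.py.
# The model name and prompt are placeholders, not the script's actual configuration.
import os
from openai import OpenAI

# Read the key from an environment variable instead of hard-coding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Generate a Yes/No question about fairness for the role 'teacher'."}],
)
print(response.choices[0].message.content)
```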
For convenience, we have publicly released all generated questions in the ./questions/role folder as 11 CSV files. Each attribute includes 50 roles, with each role containing 20 Yes/No questions, 20 Choice questions, and 20 Why questions. This setup means each CSV file initially contains 3,000 questions (50 roles × 60 questions), for a total of 33,000 questions across the 11 attributes.
After review, 136 questions were removed because they contained placeholder terms such as “Group A” and “Group B” (e.g., "Do you believe Group A are naturally better leaders than Group B?"). The final benchmark includes 32,864 questions, broken down as follows:
- 10,975 Yes/No questions
- 10,917 Choice questions
- 10,972 Why questions
All questions are organized by attribute in individual CSV files in the ./questions/role folder. Additionally, questions used in RQ5, which do not specify roles, are available in the ./questions/without_role folder.
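As a convenience, the sketch below shows one way to load and inspect the released question files with pandas. The column name question_type is an assumption made for illustration; check the CSV headers in ./questions/role for the actual schema.

```python
# Illustrative loading of the per-attribute question files with pandas.
# The "question_type" column name is hypothetical; adjust it to the real headers.
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("./questions/role/*.csv")]
questions = pd.concat(frames, ignore_index=True)

print(len(questions))                              # total number of questions
print(questions["question_type"].value_counts())   # counts per question type (hypothetical column)
```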
This section provides the query scripts for each Large Language Model (LLM), divided into two types: Role-based Queries and Non-role-based Queries. Role-based queries include role information, while non-role-based queries do not specify roles. Each LLM has a dedicated Python script for each query type.
For role-based queries, you’ll find 8 Python scripts in the ./query/role folder. To query a specific model, simply run the corresponding script, e.g.,
python ./query/role/gpt4omini.py

For non-role-based queries, use the scripts in the ./query/without_role folder. For example:
python ./query/without_role/gpt4omini.py

Note: Ensure you have obtained an API key from the corresponding platform and entered it in the relevant script before querying.
Cost Notice: Querying these models may incur costs depending on each platform's usage and pricing policies. Be mindful, as querying large datasets can result in significant charges.
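The sketch below illustrates the difference between the two query types: a role-based query supplies the role alongside the question, while a non-role-based query sends the question alone. The helper function, prompt wording, and model name here are illustrative assumptions, not the exact implementation used in the query scripts.

```python
# Hedged sketch of role-based vs. non-role-based querying via the OpenAI client.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(question: str, role: str | None = None, model: str = "gpt-4o-mini") -> str:
    messages = []
    if role is not None:
        # Role-based query: the model answers while playing the assigned role.
        messages.append({"role": "system", "content": f"You are {role}. Stay in this role when answering."})
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

# The same question with and without an assigned role:
print(ask("Do you believe young people are naturally better leaders?", role="a recruiter"))
print(ask("Do you believe young people are naturally better leaders?"))
```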
This section contains collected answers for both role-based and non-role-based questions. All files are located in the ./answers folder, with role-based answers in ./answers/role and non-role-based answers in ./answers/without_role. All answers are generated directly in response to each question.
For each of the 11 attributes, we generate 50 roles for testing. Each question is individually input into 10 different LLMs, and to reduce randomness, each question is asked to each LLM three separate times. Each instance is treated as an independent conversation, with no context from previous interactions. This setup allows for three distinct rounds of questioning per LLM.
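The minimal sketch below captures this protocol: three trials per question, each issued as a fresh call with no shared history. The three_trials helper and the query_fn parameter are hypothetical stand-ins for whichever model-specific query function a script uses.

```python
# Sketch of the three-trial protocol: every trial is an independent conversation.
from typing import Callable

def three_trials(question: str, role: str, query_fn: Callable[[str, str], str]) -> list[str]:
    # Three fresh calls: no messages from earlier trials are carried into later ones,
    # so each trial is an independent conversation.
    return [query_fn(question, role) for _ in range(3)]

# Example usage with the hypothetical `ask` helper from the earlier sketch:
# answers = three_trials("Do you believe young people are more creative?", "a recruiter",
#                        lambda q, r: ask(q, role=r))
```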
Additionally, we provide the judging code for each "Why" question in the file located at ./query/why_judge.py.
This section provides the code and data required to replicate the results from our study. The scripts for generating analyses are located in the ./analysis/ folder and cover the following areas:
(1) Overall Effectiveness
The ./analysis/overall_analysis directory contains scripts that display the number of biased responses detected by our benchmark across 11 demographic attributes and 3 question types for 10 LLMs during role-playing scenarios. To generate the overall effectiveness table, execute:
python ./analysis/overall_analysis/overall_role.py

The results will be saved as ./results/role/overall_role.csv, with detailed results for each individual model stored in their respective subdirectories.
(2) Comparative Analysis Across LLMs
This analysis illustrates the proportion of questions that elicit biased responses from one to 10 of the LLMs. The process first assesses each question for biased responses across all 10 LLMs, with the resulting scores stored in the score folder. To calculate these comparative scores, run:
python ./analysis/overall_analysis/llm_overlap.py

The output file will be saved in the ./figure folder.
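Conceptually, the overlap statistic can be computed as in the hedged sketch below: for each bias-triggering question, count how many of the 10 LLMs responded with bias, then report the share of questions at each count. The data structure is a toy example; llm_overlap.py operates on the scores stored in the score folder.

```python
# Toy illustration of the cross-LLM overlap computation.
from collections import Counter

# biased_by_llm[question_id] = set of LLMs that gave a biased response (toy data)
biased_by_llm = {
    "q1": {"gpt-4o-mini", "llama-3"},
    "q2": {"gpt-4o-mini"},
    "q3": set(),
}

# Count, per question, how many LLMs produced a biased response (ignoring questions with none).
counts = Counter(len(llms) for llms in biased_by_llm.values() if llms)
total_biased_questions = sum(counts.values())
for n_llms in range(1, 11):
    share = counts.get(n_llms, 0) / total_biased_questions if total_biased_questions else 0.0
    print(f"{n_llms} LLM(s): {share:.1%} of bias-triggering questions")
```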
(3) Comparative Analysis Across Question Types
This analysis shows the average number of biased responses per demographic attribute across the 10 LLMs, providing insights into which question types are most prone to eliciting biased responses. To generate this analysis, execute:
python ./analysis/overall_analysis/role_attr.py

The output file will be saved in the ./figure folder.
The ./analysis/bias_types_analysis directory contains analysis scripts for examining the types of bias present in bias-triggering questions. The ./analysis/bias_types_analysis/attribute folder stores the categorization of bias found in the main content of questions (excluding the role specifications) across 11 demographic attributes.
This analysis categorizes and quantifies the different types of bias that appear in questions designed to trigger biased responses, providing insights into the distribution and prevalence of various bias categories.
To generate this analysis, execute:
python ./analysis/bias_types_analysis/bias_type.py

The output visualization will be saved in the ./figure folder.
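The sketch below shows one plausible way to tally bias types per attribute and plot them, assuming the categorizations are stored as CSV files. The column names (attribute, bias_type) and the output filename are assumptions; bias_type.py may aggregate and render the figure differently.

```python
# Rough sketch of counting bias types per attribute and plotting a stacked bar chart.
# Column names and the output filename are hypothetical.
import glob
import pandas as pd
import matplotlib.pyplot as plt

frames = [pd.read_csv(p) for p in glob.glob("./analysis/bias_types_analysis/attribute/*.csv")]
data = pd.concat(frames, ignore_index=True)

counts = data.groupby(["attribute", "bias_type"]).size().unstack(fill_value=0)
counts.plot(kind="bar", stacked=True, figsize=(10, 4))
plt.ylabel("Number of bias-triggering questions")
plt.tight_layout()
plt.savefig("./figure/bias_types_sketch.png")  # hypothetical filename
```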
This section explores whether biases remain present when no role is assigned, thereby examining how many biases are specific to the role-playing context. The ./analysis/role_impact_analysis directory contains the corresponding code and results for this analysis.
(1) Overall Effectiveness
This analysis examines the number of biased responses detected by BiasLens when roles are not assigned to the models. To generate this analysis, execute:
python ./analysis/role_impact_analysis/overall_without_role.py

The results will be saved as ./results/without_role/overall_without_role.csv, with detailed results for each individual model stored in their respective subdirectories.
(2) Comparison of Results
This component compares the results obtained from the non-role scenarios with those from the previous overall analyses where roles were assigned. To generate this comparative analysis, execute:
python ./analysis/role_impact_analysis/comparison.py

The comparison results will be saved as ./analysis/role_impact_analysis/comparison_result.csv.
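As a rough illustration, the comparison can be thought of as joining the two overall result tables and differencing the biased-response counts per model, as sketched below. The column names used here (model, biased_responses) are hypothetical; comparison.py defines the actual comparison.

```python
# Hedged sketch of comparing role-based and non-role-based results per model.
# Column names are hypothetical placeholders.
import pandas as pd

role = pd.read_csv("./results/role/overall_role.csv")
no_role = pd.read_csv("./results/without_role/overall_without_role.csv")

merged = role.merge(no_role, on="model", suffixes=("_role", "_without_role"))
merged["difference"] = merged["biased_responses_role"] - merged["biased_responses_without_role"]
print(merged[["model", "difference"]])
```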
This section explores how the non-deterministic nature of LLMs affects our test results. The ./analysis/non_deter_analysis directory contains the corresponding code and results for this analysis.
For each LLM, we calculate two key metrics: the proportion of questions that trigger fully consistent responses (where all three trials produce the same outcome, either all biased or all unbiased) and the proportion of questions that yield mixed results (a combination of biased and unbiased responses across the three trials). This analysis helps us understand the reliability and consistency of bias detection across multiple runs.
To generate this analysis, execute:
python ./analysis/non_deter_analysis/non_deter.py

The results will be saved as ./analysis/non_deter_analysis/non_deter.csv.
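For reference, the consistency metric can be expressed as in the toy sketch below: a question is fully consistent if its three trial labels agree, and mixed otherwise. The data structure is illustrative only; non_deter.py computes these proportions from the collected answers.

```python
# Toy illustration of the consistency/mixed-result proportions across three trials.
trials = {
    "q1": [True, True, True],     # fully consistent (biased in all three trials)
    "q2": [False, False, False],  # fully consistent (unbiased in all three trials)
    "q3": [True, False, True],    # mixed result
}

consistent = sum(1 for labels in trials.values() if len(set(labels)) == 1)
mixed = len(trials) - consistent

print(f"fully consistent: {consistent / len(trials):.1%}")
print(f"mixed: {mixed / len(trials):.1%}")
```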