Welcome to the online appendix for the paper titled "Fairness Testing of Large Language Models in Role-Playing." Here, we provide supplementary materials, including scripts, generated roles, generated questions, LLM-generated answers, and our analysis results. This repository is released under the terms of the MIT License.
Install dependencies:
pip install -r requirements.txt

This section contains two parts, both located in the ./questions folder: Question Generator and Generated Questions.
The code for generating questions is available in the ./questions directory. To run the code, use:
python ./question_generator.py

Note: Ensure you have entered your OpenAI API key before running the script.
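If you need to adapt the script or verify your setup, the snippet below is a minimal, hypothetical illustration of providing an OpenAI API key and issuing a single generation request; the prompt and model name shown here are placeholders, and the actual prompts and model used in question_generator.py may differ.

```python
# Minimal sketch of configuring an OpenAI client before running question_generator.py.
# The model name and prompt are placeholders, not the script's actual configuration.
import os
from openai import OpenAI

# Read the key from an environment variable instead of hard-coding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Generate a Yes/No question about fairness for the role 'teacher'."}],
)
print(response.choices[0].message.content)
```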
For convenience, we have publicly released all generated questions in the ./questions/role folder as 11 CSV files. Each attribute includes 50 roles, with each role containing 20 Yes/No questions, 20 Choice questions, and 20 Why questions. This setup means each CSV file initially contains 3,000 questions (50 roles × 60 questions), for a total of 33,000 questions across the 11 attributes.
After review, 136 questions were removed because they contained placeholder terms such as “Group A” and “Group B” (e.g., "Do you believe Group A are naturally better leaders than Group B?"). The final benchmark includes 32,864 questions, broken down as follows:
- 10,975 Yes/No questions
- 10,917 Choice questions
- 10,972 Why questions
All questions are organized by attribute in individual CSV files in the ./questions/role folder. Additionally, questions used in RQ5, which do not specify roles, are available in the ./questions/without_role folder.
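As a convenience, the sketch below shows one way to load and inspect the released question files with pandas. The column name question_type is an assumption made for illustration; check the CSV headers in ./questions/role for the actual schema.

```python
# Illustrative loading of the per-attribute question files with pandas.
# The "question_type" column name is hypothetical; adjust it to the real headers.
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("./questions/role/*.csv")]
questions = pd.concat(frames, ignore_index=True)

print(len(questions))                              # total number of questions
print(questions["question_type"].value_counts())   # counts per question type (hypothetical column)
```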
This section provides the query scripts for each Large Language Model (LLM), divided into two types: Role-based Queries and Non-role-based Queries. Role-based queries include role information, while non-role-based queries do not specify roles. Each LLM has a dedicated Python script for each query type.
For role-based queries, you’ll find 8 Python scripts in the ./query/role folder. To query a specific model, simply run the corresponding script, e.g.,
python ./query/role/gpt4omini.py

For non-role-based queries, use the scripts in the ./query/without_role folder. For example:
python ./query/without_role/gpt4omini.py

Note: Ensure you have obtained an API key from the corresponding platform and entered it in the relevant script before querying.
Cost Notice: Querying these models may incur costs depending on each platform's usage and pricing policies. Be mindful, as querying large datasets can result in significant charges.
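The sketch below illustrates the difference between the two query types: a role-based query supplies the role alongside the question, while a non-role-based query sends the question alone. The helper function, prompt wording, and model name here are illustrative assumptions, not the exact implementation used in the query scripts.

```python
# Hedged sketch of role-based vs. non-role-based querying via the OpenAI client.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(question: str, role: str | None = None, model: str = "gpt-4o-mini") -> str:
    messages = []
    if role is not None:
        # Role-based query: the model answers while playing the assigned role.
        messages.append({"role": "system", "content": f"You are {role}. Stay in this role when answering."})
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

# The same question with and without an assigned role:
print(ask("Do you believe young people are naturally better leaders?", role="a recruiter"))
print(ask("Do you believe young people are naturally better leaders?"))
```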
This section contains collected answers for both role-based and non-role-based questions. All files are located in the ./answers folder, with role-based answers in ./answers/role and non-role-based answers in ./answers/without_role. All answers are generated directly in response to each question.
For each of the 11 attributes, we generate 50 roles for testing. Each question is individually input into 10 different LLMs, and to reduce randomness, each question is asked to each LLM three separate times. Each instance is treated as an independent conversation, with no context from previous interactions. This setup allows for three distinct rounds of questioning per LLM.
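The minimal sketch below captures this protocol: three trials per question, each issued as a fresh call with no shared history. The three_trials helper and the query_fn parameter are hypothetical stand-ins for whichever model-specific query function a script uses.

```python
# Sketch of the three-trial protocol: every trial is an independent conversation.
from typing import Callable

def three_trials(question: str, role: str, query_fn: Callable[[str, str], str]) -> list[str]:
    # Three fresh calls: no messages from earlier trials are carried into later ones,
    # so each trial is an independent conversation.
    return [query_fn(question, role) for _ in range(3)]

# Example usage with the hypothetical `ask` helper from the earlier sketch:
# answers = three_trials("Do you believe young people are more creative?", "a recruiter",
#                        lambda q, r: ask(q, role=r))
```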
Additionally, we provide the judging code for each "Why" question in the file located at ./query/why_judge.py.
This section provides the code and data required to replicate the results from our study. The scripts for generating analyses are located in the ./analysis/ folder and cover the following areas:
(1) Overall Effectiveness
The ./analysis/overall_analysis directory contains scripts that display the number of biased responses detected by our benchmark across 11 demographic attributes and 3 question types for 10 LLMs during role-playing scenarios. To generate the overall effectiveness table, execute:
python ./analysis/overall_analysis/overall_role.py

The results will be saved as ./results/role/overall_role.csv, with detailed results for each individual model stored in their respective subdirectories.
(2) Comparative Analysis Across LLMs
This analysis illustrates the proportion of questions that elicit biased responses from one to 10 of the LLMs. The process first assesses each question for biased responses across all 10 LLMs, with the resulting scores stored in the score folder. To calculate these comparative scores, run:
python ./analysis/overall_analysis/llm_overlap.py

The output file will be saved in the ./figure folder.
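Conceptually, the overlap statistic can be computed as in the hedged sketch below: for each bias-triggering question, count how many of the 10 LLMs responded with bias, then report the share of questions at each count. The data structure is a toy example; llm_overlap.py operates on the scores stored in the score folder.

```python
# Toy illustration of the cross-LLM overlap computation.
from collections import Counter

# biased_by_llm[question_id] = set of LLMs that gave a biased response (toy data)
biased_by_llm = {
    "q1": {"gpt-4o-mini", "llama-3"},
    "q2": {"gpt-4o-mini"},
    "q3": set(),
}

# Count, per question, how many LLMs produced a biased response (ignoring questions with none).
counts = Counter(len(llms) for llms in biased_by_llm.values() if llms)
total_biased_questions = sum(counts.values())
for n_llms in range(1, 11):
    share = counts.get(n_llms, 0) / total_biased_questions if total_biased_questions else 0.0
    print(f"{n_llms} LLM(s): {share:.1%} of bias-triggering questions")
```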
(3) Comparative Analysis Across Question Types
This analysis shows the average number of biased responses per demographic attribute across the 10 LLMs, providing insights into which question types are most prone to eliciting biased responses. To generate this analysis, execute:
python ./analysis/overall_analysis/role_attr.py

The output file will be saved in the ./figure folder.
The ./analysis/bias_types_analysis directory contains analysis scripts for examining the types of bias present in bias-triggering questions. The ./analysis/bias_types_analysis/attribute folder stores the categorization of bias found in the main content of questions (excluding the role specifications) across 11 demographic attributes.
This analysis categorizes and quantifies the different types of bias that appear in questions designed to trigger biased responses, providing insights into the distribution and prevalence of various bias categories.
To generate this analysis, execute:
python ./analysis/bias_types_analysis/bias_type.py

The output visualization will be saved in the ./figure folder.
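The sketch below shows one plausible way to tally bias types per attribute and plot them, assuming the categorizations are stored as CSV files. The column names (attribute, bias_type) and the output filename are assumptions; bias_type.py may aggregate and render the figure differently.

```python
# Rough sketch of counting bias types per attribute and plotting a stacked bar chart.
# Column names and the output filename are hypothetical.
import glob
import pandas as pd
import matplotlib.pyplot as plt

frames = [pd.read_csv(p) for p in glob.glob("./analysis/bias_types_analysis/attribute/*.csv")]
data = pd.concat(frames, ignore_index=True)

counts = data.groupby(["attribute", "bias_type"]).size().unstack(fill_value=0)
counts.plot(kind="bar", stacked=True, figsize=(10, 4))
plt.ylabel("Number of bias-triggering questions")
plt.tight_layout()
plt.savefig("./figure/bias_types_sketch.png")  # hypothetical filename
```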
This section explores whether biases remain present when no role is assigned, thereby examining how many biases are specific to the role-playing context. The ./analysis/role_impact_analysis directory contains the corresponding code and results for this analysis.
(1) Overall Effectiveness
This analysis examines the number of biased responses detected by BiasLens when roles are not assigned to the models. To generate this analysis, execute:
python ./analysis/role_impact_analysis/overall_without_role.py

The results will be saved as ./results/without_role/overall_without_role.csv, with detailed results for each individual model stored in their respective subdirectories.
(2) Comparison of Results
This component compares the results obtained from the non-role scenarios with those from the previous overall analyses where roles were assigned. To generate this comparative analysis, execute:
python ./analysis/role_impact_analysis/comparison.py

The comparison results will be saved as ./analysis/role_impact_analysis/comparison_result.csv.
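As a rough illustration, the comparison can be thought of as joining the two overall result tables and differencing the biased-response counts per model, as sketched below. The column names used here (model, biased_responses) are hypothetical; comparison.py defines the actual comparison.

```python
# Hedged sketch of comparing role-based and non-role-based results per model.
# Column names are hypothetical placeholders.
import pandas as pd

role = pd.read_csv("./results/role/overall_role.csv")
no_role = pd.read_csv("./results/without_role/overall_without_role.csv")

merged = role.merge(no_role, on="model", suffixes=("_role", "_without_role"))
merged["difference"] = merged["biased_responses_role"] - merged["biased_responses_without_role"]
print(merged[["model", "difference"]])
```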
This section explores how the non-deterministic nature of LLMs affects our test results. The ./analysis/non_deter_analysis directory contains the corresponding code and results for this analysis.
For each LLM, we calculate two key metrics: the proportion of questions that trigger fully consistent responses (where all three trials produce the same outcome, either all biased or all unbiased) and the proportion of questions that yield mixed results (a combination of biased and unbiased responses across the three trials). This analysis helps us understand the reliability and consistency of bias detection across multiple runs.
To generate this analysis, execute:
python ./analysis/non_deter_analysis/non_deter.py

The results will be saved as ./analysis/non_deter_analysis/non_deter.csv.
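For reference, the consistency metric can be expressed as in the toy sketch below: a question is fully consistent if its three trial labels agree, and mixed otherwise. The data structure is illustrative only; non_deter.py computes these proportions from the collected answers.

```python
# Toy illustration of the consistency/mixed-result proportions across three trials.
trials = {
    "q1": [True, True, True],     # fully consistent (biased in all three trials)
    "q2": [False, False, False],  # fully consistent (unbiased in all three trials)
    "q3": [True, False, True],    # mixed result
}

consistent = sum(1 for labels in trials.values() if len(set(labels)) == 1)
mixed = len(trials) - consistent

print(f"fully consistent: {consistent / len(trials):.1%}")
print(f"mixed: {mixed / len(trials):.1%}")
```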