Gender Inclusivity Fairness Index (GIFI)

A Multilevel Framework for Evaluating Gender Diversity in Large Language Models

Zhengyang Shan1, Emily Ruth Diana2, Jiawei Zhou3

1Boston University, 2Carnegie Mellon University, 3Stony Brook University

Abstract

We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs. GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers. We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs' gender inclusivity. Our study highlights the importance of improving LLMs’ inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models.

Framework Overview

GIFI method diagram

We evaluate gender fairness in LLMs through a series of progressively complex tests, organized into four stages: Pronoun Recognition, Fairness in Distribution, Stereotype and Role Assignment, and Consistency in Performance.

Ranking Results

Radar charts comparing model performance across the seven evaluation tasks (combined view and per-model breakdowns)

The GIFI rankings highlight models like GPT-4o, Claude 3, and DeepSeek V3 as top performers, demonstrating advanced capabilities on complex gender-fairness tasks. These models deliver balanced performance across all pronoun categories. Conversely, models such as Vicuna, GPT-2, and LLaMA 2 rank poorly, struggling particularly with neopronouns and overall gender fairness.

To better understand individual model capabilities, we analyze performance on each of the seven evaluation tasks. The combined radar chart offers a comparative view of all models across the seven dimensions, illustrating their diverse strengths and weaknesses, while the individual radar charts break down each model's performance, showing that a model with a strong overall score may still be weak on specific tasks. For instance, Claude Sonnet 4 excels in sentiment neutrality and gender pronoun recognition but performs poorly on stereotypical association. GPT-4o mini demonstrates balanced performance across tasks, though with slightly lower scores in gender diversity recognition and occupational fairness. Phi-3 shows high fairness in stereotypical association and occupational fairness, indicating a tendency to mitigate traditional gender roles.

GIFI Evaluation Leaderboard

Model                 GDR    SN     NTS    CF     SA     OF     PE     GIFI
Gemini 1.5 Pro        0.55   0.78   0.92   0.74   0.37   0.36   0.97   0.67
Gemini 1.5 Flash      0.55   0.76   0.92   0.87   0.18   0.08   0.96   0.62
Claude 3 🥈           0.67   0.78   0.95   0.87   0.31   0.42   0.97   0.71
GPT-4o mini           0.61   0.81   0.94   0.99   0.36   0.13   0.95   0.68
GPT-4o 🥇             0.76   0.77   0.96   0.86   0.37   0.41   0.96   0.73
GPT-4                 0.71   0.78   0.93   0.84   0.34   0.14   0.96   0.67
GPT-3.5 turbo         0.64   0.73   0.93   0.82   0.35   0.14   0.96   0.65
GPT-2                 0.27   0.69   0.81   0.32   0.64   0.57   0.53   0.55
Gemma 2               0.51   0.67   0.82   0.36   0.47   0.33   0.93   0.58
LLaMA 3               0.63   0.69   0.85   0.62   0.41   0.15   0.95   0.61
LLaMA 2               0.59   0.67   0.84   0.58   0.39   0.09   0.81   0.57
Vicuna                0.31   0.67   0.82   0.39   0.39   0.20   0.65   0.49
Zephyr                0.40   0.65   0.85   0.38   0.59   0.42   0.70   0.57
Mistral               0.51   0.70   0.81   0.37   0.56   0.38   0.82   0.59
Phi-3                 0.50   0.73   0.85   0.25   0.72   0.59   0.79   0.63
Gemma 3               0.65   0.70   0.91   0.60   0.47   0.20   0.96   0.64
Qwen 3                0.59   0.76   0.90   0.53   0.39   0.20   0.94   0.61
Yi-1.5                0.61   0.67   0.84   0.26   0.56   0.35   0.92   0.60
Gemini 2.0 Flash      0.70   0.77   0.87   0.53   0.40   0.24   0.99   0.64
Claude Sonnet 4       0.80   0.83   0.93   0.63   0.34   0.17   0.97   0.67
DeepSeek V3 🥉        0.67   0.68   0.93   0.89   0.56   0.18   0.99   0.70
LLaMA 4               0.53   0.78   0.93   0.76   0.12   0.08   0.93   0.59
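The GIFI column in the leaderboard is consistent with an unweighted arithmetic mean of the seven per-task scores (e.g., GPT-4o: (0.76 + 0.77 + 0.96 + 0.86 + 0.37 + 0.41 + 0.96) / 7 ≈ 0.73). A minimal sketch of that aggregation, assuming equal task weighting inferred from the table values rather than taken from the paper's formal definition:

```python
# Sketch: aggregate per-task fairness scores into an overall GIFI score,
# assuming an unweighted mean over the seven tasks (an inference from the
# leaderboard values above, not the paper's stated formula).

TASKS = ["GDR", "SN", "NTS", "CF", "SA", "OF", "PE"]

# Per-task scores copied from the leaderboard for three representative models.
scores = {
    "GPT-4o":   [0.76, 0.77, 0.96, 0.86, 0.37, 0.41, 0.96],
    "Claude 3": [0.67, 0.78, 0.95, 0.87, 0.31, 0.42, 0.97],
    "Vicuna":   [0.31, 0.67, 0.82, 0.39, 0.39, 0.20, 0.65],
}

def gifi(task_scores):
    """Equal-weight mean of the seven task scores, rounded as in the table."""
    return round(sum(task_scores) / len(task_scores), 2)

# Rank models by aggregate score, highest first.
ranking = sorted(scores, key=lambda m: gifi(scores[m]), reverse=True)
for model in ranking:
    print(f"{model}: {gifi(scores[model])}")
```

Running this reproduces the published GIFI values for these rows (0.73, 0.71, and 0.49), which is what suggests the equal-weight aggregation.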

BibTeX

@inproceedings{shan2025gifi,
  title={Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models},
  author={Shan, Zhengyang and Diana, Emily Ruth and Zhou, Jiawei},
  booktitle={Proceedings of ACL},
  year={2025}
}