Gender Inclusivity Fairness Index (GIFI)

A Multilevel Framework for Evaluating Gender Diversity in Large Language Models

Zhengyang Shan1, Emily Ruth Diana2, Jiawei Zhou3

1Boston University, 2Carnegie Mellon University, 3Stony Brook University

Abstract

We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs. GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers. We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs' gender inclusivity. Our study highlights the importance of improving LLMs’ inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models.

Framework Overview

GIFI method diagram

We evaluate gender fairness in LLMs through a series of progressively complex tests, organized into four stages: Pronoun Recognition, Fairness in Distribution, Stereotype and Role Assignment, and Consistency in Performance.

Ranking Results

Radar charts comparing model performance across the seven evaluation tasks (combined view and per-model breakdowns)

The GIFI rankings highlight models like GPT-4o, Claude 3, and DeepSeek V3 as top performers, demonstrating advanced capabilities on complex gender-fairness tasks. These models deliver balanced performance across all pronoun categories. Conversely, models such as Vicuna, GPT-2, and LLaMA 2 rank poorly, struggling particularly with neopronouns and overall gender fairness.

To better understand individual model capabilities, we analyze performance on each of the seven evaluation tasks. The combined radar chart offers a comparative view of all models across the seven dimensions, illustrating their diverse strengths and weaknesses, while the individual radar charts break down each model's performance, showing that a model with a strong overall score may still be weak on specific tasks. For instance, Claude Sonnet 4 excels in sentiment neutrality and gender pronoun recognition but performs poorly on stereotypical association. GPT-4o mini demonstrates balanced performance across tasks, though with slightly lower scores in gender diversity recognition and occupational fairness. Phi-3 shows high fairness in stereotypical association and occupational fairness, indicating a tendency to mitigate traditional gender roles.

GIFI Evaluation Leaderboard

Model                 GDR    SN     NTS    CF     SA     OF     PE     GIFI
Gemini 1.5 Pro        0.55   0.78   0.92   0.74   0.37   0.36   0.97   0.67
Gemini 1.5 Flash      0.55   0.76   0.92   0.87   0.18   0.08   0.96   0.62
Claude 3 🥈           0.67   0.78   0.95   0.87   0.31   0.42   0.97   0.71
GPT-4o mini           0.61   0.81   0.94   0.99   0.36   0.13   0.95   0.68
GPT-4o 🥇             0.76   0.77   0.96   0.86   0.37   0.41   0.96   0.73
GPT-4                 0.71   0.78   0.93   0.84   0.34   0.14   0.96   0.67
GPT-3.5 turbo         0.64   0.73   0.93   0.82   0.35   0.14   0.96   0.65
GPT-2                 0.27   0.69   0.81   0.32   0.64   0.57   0.53   0.55
Gemma 2               0.51   0.67   0.82   0.36   0.47   0.33   0.93   0.58
LLaMA 3               0.63   0.69   0.85   0.62   0.41   0.15   0.95   0.61
LLaMA 2               0.59   0.67   0.84   0.58   0.39   0.09   0.81   0.57
Vicuna                0.31   0.67   0.82   0.39   0.39   0.20   0.65   0.49
Zephyr                0.40   0.65   0.85   0.38   0.59   0.42   0.70   0.57
Mistral               0.51   0.70   0.81   0.37   0.56   0.38   0.82   0.59
Phi-3                 0.50   0.73   0.85   0.25   0.72   0.59   0.79   0.63
Gemma 3               0.65   0.70   0.91   0.60   0.47   0.20   0.96   0.64
Qwen 3                0.59   0.76   0.90   0.53   0.39   0.20   0.94   0.61
Yi-1.5                0.61   0.67   0.84   0.26   0.56   0.35   0.92   0.60
Gemini 2.0 Flash      0.70   0.77   0.87   0.53   0.40   0.24   0.99   0.64
Claude Sonnet 4       0.80   0.83   0.93   0.63   0.34   0.17   0.97   0.67
DeepSeek V3 🥉        0.67   0.68   0.93   0.89   0.56   0.18   0.99   0.70
LLaMA 4               0.53   0.78   0.93   0.76   0.12   0.08   0.93   0.59
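The GIFI column in the leaderboard is consistent with an unweighted arithmetic mean of the seven per-task scores (e.g., GPT-4o: (0.76 + 0.77 + 0.96 + 0.86 + 0.37 + 0.41 + 0.96) / 7 ≈ 0.73). A minimal sketch of that aggregation, assuming equal task weighting inferred from the table values rather than taken from the paper's formal definition:

```python
# Sketch: aggregate per-task fairness scores into an overall GIFI score,
# assuming an unweighted mean over the seven tasks (an inference from the
# leaderboard values above, not the paper's stated formula).

TASKS = ["GDR", "SN", "NTS", "CF", "SA", "OF", "PE"]

# Per-task scores copied from the leaderboard for three representative models.
scores = {
    "GPT-4o":   [0.76, 0.77, 0.96, 0.86, 0.37, 0.41, 0.96],
    "Claude 3": [0.67, 0.78, 0.95, 0.87, 0.31, 0.42, 0.97],
    "Vicuna":   [0.31, 0.67, 0.82, 0.39, 0.39, 0.20, 0.65],
}

def gifi(task_scores):
    """Equal-weight mean of the seven task scores, rounded as in the table."""
    return round(sum(task_scores) / len(task_scores), 2)

# Rank models by aggregate score, highest first.
ranking = sorted(scores, key=lambda m: gifi(scores[m]), reverse=True)
for model in ranking:
    print(f"{model}: {gifi(scores[model])}")
```

Running this reproduces the published GIFI values for these rows (0.73, 0.71, and 0.49), which is what suggests the equal-weight aggregation.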

BibTeX

@inproceedings{shan2025gifi,
  title={Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models},
  author={Shan, Zhengyang and Diana, Emily Ruth and Zhou, Jiawei},
  booktitle={Proceedings of ACL},
  year={2025}
}