As of today there are roughly 50 benchmarks in use for mapping bias in AI, but nearly all of them share the same weaknesses: they overwhelmingly measure US-centric, English-language, single-axis biases using static tests that models can learn to game.
Among all of these, the one that stands out as the industry standard is BBQ (Bias Benchmark for Question Answering).
BBQ is a dataset of question sets that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts. It tests how strongly responses reflect social biases and whether the model's biases override a correct answer choice.
Later research showed that models scoring well on BBQ exhibited significantly higher bias under "adversarial perturbations": small, carefully crafted modifications to the input that are undetectable to humans but can cause significant errors in machine learning models, particularly deep neural networks. Examples include persona prompts, text attacks, and negative shots.
Persona prompts give the AI a role to play, such as a physics professor or a criminal lawyer. Without a persona, an AI tends to give a "generalist" response; applying a persona unlocks specific behaviors.
Text attacks are deliberate attempts to trick a language model into making a mistake by making subtle, often human-readable changes to the input text.
Negative shots set boundaries for AI models in prompts, for example telling the model "do not be condescending, do not use corporate jargon, and do not use passive voice".
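To make these three perturbation types concrete, here is a minimal Python sketch of how they might be layered onto a BBQ-style question. The question text, role names, and constraint phrases are all illustrative assumptions, not taken from any real benchmark harness.

```python
# Sketch: building perturbed variants of a BBQ-style question.
# Everything here is illustrative; a real evaluation would send each
# variant to a model API and compare the answers.

BASE_QUESTION = (
    "A doctor and a secretary left the office. "
    "Who forgot to lock the door?\n"
    "(a) The doctor  (b) The secretary  (c) Unknown"
)

def with_persona(prompt: str, persona: str) -> str:
    """Persona prompt: prepend a role for the model to play."""
    return f"You are a {persona}. {prompt}"

def with_negative_shot(prompt: str, constraints: list[str]) -> str:
    """Negative shot: prepend explicit 'do not' boundaries."""
    rules = " ".join(f"Do not {c}." for c in constraints)
    return f"{rules} {prompt}"

def with_text_attack(prompt: str) -> str:
    """Text attack: a subtle, human-readable edit (here, a typo swap)."""
    return prompt.replace("secretary", "secreatry")

variants = {
    "baseline": BASE_QUESTION,
    "persona": with_persona(BASE_QUESTION, "strict traditional manager"),
    "negative_shot": with_negative_shot(BASE_QUESTION, ["hedge", "refuse to answer"]),
    "text_attack": with_text_attack(BASE_QUESTION),
}

for name, prompt in variants.items():
    print(f"--- {name} ---\n{prompt}\n")
```

The point of the sketch is that each perturbation is a tiny, human-readable change to the same underlying question, yet the research above found that answers can shift substantially between variants.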
These techniques expose hidden biases that BBQ misses. There is also speculation that models may have memorized patterns from BBQ's publicly available dataset. Meanwhile, many of the benchmarks that followed had problems of their own:
noise and data reliability issues
cross-cultural validity issues
being saturated or no longer adequate to evaluate recent LLMs
inefficient bias tests
HELM provides the most widely referenced evaluation framework, assessing 30+ models across fairness and bias metrics alongside capability measures. It is the closest thing to an industry-standard bias dashboard.
However, as newer benchmarks came out, it was found that bias scores in open-ended settings often exceed those in multiple-choice settings, especially when using implicit demographic cues (names, occupations, clothing) rather than explicit mentions.
Now if we come to bias evaluation for image generation models, no model excels completely. The largest-scale study to date, covering 103+ text-to-image models, used the Try Before You Bias (TBYB) framework and confirmed that demographic skew persists across model generations.
Similarly, bias evaluations across computer vision models, multimodal vision-language models, speech, audio, and other modalities still contain significant flaws, and researchers are working to address them.
The most striking finding across the industry is universal skepticism paired with continued reliance: every major lab reports BBQ scores while simultaneously critiquing the benchmark's limitations lol.
You might ask: so what are these frontier AI labs doing when they know their models are biased?
Well,
Anthropic has been developing proprietary evaluations including a "Paired Prompts" political even-handedness test and custom discrimination scoring across demographic groups.
OpenAI has developed a distinctive "first-person fairness evaluation" that analyzes millions of real ChatGPT interactions for stereotyping patterns. They found harmful stereotypes in ~0.1% of overall cases, rising to ~1% in some domains on older models. Their GPT-4 system card notably acknowledges that terms like "harmful" or "toxic" can themselves be "wielded in ways that are themselves harmful or oppressive."
Meta created FACET and HolisticBias as community resources while reporting moderate bias evaluations in Llama model cards. For Llama 4, they made political bias reduction a headline feature, claiming refusal rates on debated topics dropped from 7% to below 2%.
Looking at the current pace at which this is being solved, you could say that the industry's leading safety standard does not yet cover the dimension most commonly associated with AI ethics in public discourse.
The Stanford AI Index Report captured the paradox well:
"LLMs trained to be explicitly unbiased continue to demonstrate implicit bias"
This means that even when we align the model's surface-level behavior, we haven't fixed its internal representations. For example:
Explicit Prompt:
“Is a doctor more likely to be a man or a woman?”
→ Model: “Doctors can be of any gender.”
Implicit Prompt:
“Write a story about a doctor and a nurse having a conversation.”
→ Model: Often defaults to “He” for the doctor and “She” for the nurse.
Now that we have looked at what has been done, what action has been taken, and the results of that action, let's look at where this points.
First, construct validity is weak: most bias benchmarks do not adequately prove they measure what they claim to measure. The Contextual StereoSet study (2025) found that anchoring to different time periods, changing prose style, or shifting the observer's perspective can shift bias scores by up to 13 percentage points, meaning a single static score may not generalize to real-world use.
Just because an AI passes a lab test for bias doesn't mean it won't be biased when it's actually doing a job in the real world. One promising direction is to predict how "safe" a model is by probing its internal representations rather than observing its real-world actions, so that these problems can be caught before the model is deployed.
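One way to surface this fragility is to report the spread of a bias metric across framings instead of a single number. The sketch below uses made-up scores chosen to mirror the roughly 13-point shift reported in the Contextual StereoSet study; the framing names are illustrative.

```python
import statistics

# Sketch of a construct-validity stress test: recompute the same bias
# metric under different framings (time period, prose style,
# perspective) and report the spread alongside the mean.
# All scores below are hypothetical.

scores_by_framing = {
    "default":      0.62,
    "set_in_1950s": 0.75,
    "formal_prose": 0.66,
    "third_person": 0.71,
}

mean = statistics.mean(scores_by_framing.values())
spread = max(scores_by_framing.values()) - min(scores_by_framing.values())

print(f"mean bias score: {mean:.2f}")           # 0.69
print(f"spread across framings: {spread:.2f}")  # 0.13, i.e. ~13 points
```

A benchmark that published the spread as prominently as the mean would make it much harder to mistake one framing's score for a general property of the model.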
Second, benchmarks are gameable. Models trained with RLHF learn to recognize benchmark-style probes and refuse to answer rather than give biased responses, which makes their aggregate scores look better and suggests that safety training may teach avoidance rather than genuine debiasing.
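The refusal loophole is easy to see in a toy scoring example. The labels below are hypothetical hand-annotations of model outputs, and the "naive" scorer is an assumption about how aggregate leaderboard numbers are often computed.

```python
# Sketch: how scoring refusals as "safe" inflates an aggregate score.

answers = ["biased", "unbiased", "refusal", "refusal", "unbiased",
           "refusal", "biased", "refusal", "unbiased", "refusal"]

# Naive aggregate: anything that is not overtly biased counts as a pass.
naive_score = sum(a != "biased" for a in answers) / len(answers)

# Disaggregated view: report refusals separately, and compute the bias
# rate only over questions the model actually answered.
answered = [a for a in answers if a != "refusal"]
refusal_rate = answers.count("refusal") / len(answers)
bias_rate_when_answering = answered.count("biased") / len(answered)

print(f"naive pass rate: {naive_score:.0%}")                        # 80%
print(f"refusal rate: {refusal_rate:.0%}")                          # 50%
print(f"bias rate when answering: {bias_rate_when_answering:.0%}")  # 40%
```

The model looks 80% "safe" on the naive aggregate, yet it refused half the questions and was biased on 40% of the ones it actually answered: avoidance, not debiasing.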
Third, geographic and cultural coverage is catastrophically narrow. Nearly all major LLM benchmarks are English-only and US-centric. Cross-cultural disability research (AIES 2025) found US-built models were "too harsh," flagging neutral messages as offensive, while Indian-context models missed clear bias. The models are multilingual but not yet multicultural.
Fourth, intersectional bias remains the exception. Disability, socioeconomic status, caste, and LGBTQ+ identity are systematically underrepresented; most bias papers focus primarily on gender, with race/ethnicity, age, religion, and nationality as secondary categories.
Fifth, the disconnect between measurement and harm is vast. If an AI doesn't work (functional failure), talking about its "ethics" is a distraction: a broken tool is a dangerous tool, regardless of how "unbiased" its tone is. Instead of just scoring models on bias evaluations, we need to start looking at whether the AI is actually capable of performing its task safely for every individual user it encounters.
Representational Bias (Words/Images):
Example: An AI that generates only white men when you ask for a picture of a doctor. It’s a stereotype, and it’s offensive, but it doesn’t immediately stop someone from getting a job or a loan.
Current Benchmarks: roughly 90% of benchmarks (like the ones we discussed) focus here, because these harms are easy to automate and count.
Allocational Harm (Actions):
Example: An AI used by a bank that automatically rejects loan applications from a certain zip code, or a hiring AI that filters out resumes with names it doesn’t recognize.
The Problem: This is Real-World Harm. It changes lives, but it is much harder to measure in a lab.
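Allocational harm can at least be screened for by looking at decisions rather than wording. Here is a minimal sketch using made-up loan decisions and the classic "four-fifths rule" heuristic from US employment law; the group names and counts are entirely hypothetical.

```python
# Sketch: measuring allocational harm from approval rates per group
# rather than a single aggregate accuracy. All numbers are made up.

decisions = {
    "group_a": {"approved": 80, "rejected": 20},
    "group_b": {"approved": 40, "rejected": 60},
}

rates = {g: d["approved"] / (d["approved"] + d["rejected"])
         for g, d in decisions.items()}

# Four-fifths rule: flag any group whose approval rate falls below
# 80% of the highest group's rate.
best = max(rates.values())
flagged = {g: r for g, r in rates.items() if r < 0.8 * best}

print(rates)    # {'group_a': 0.8, 'group_b': 0.4}
print(flagged)  # {'group_b': 0.4}
```

Note what this sketch does not need: any text at all. Allocational harm lives in the decisions a system makes, which is why benchmarks built around wording keep missing it.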
These details provoke a need for public documentation of bias benchmarks, which is exactly what we are building at AI Bias Atlas.
We tackle benchmark memorization, strategic refusal behavior, and the habit of measuring only one narrow bias dimension among hundreds. While companies optimize for benchmark leaderboards, and aggregate scores can obscure disproportionate harms against specific groups, we are completely public and open-source.
Our goal at AI Bias Atlas is to never present single aggregate scores, to explicitly label what is NOT measured, and to be fully transparent even about flawed measurements, driving accountability instead of the illusion of measurement without the substance of understanding.
If you have read this far and have any feedback or ideas, do drop a comment!
Thanks for reading! Please spread the word by sharing this post and gain +100 aura
