Biased bots: AI hiring managers shortlist candidates with AI resumes


Job seekers who compose their resumes with the same AI model that's later used to evaluate their applications are more likely to advance through the hiring process than those who submit human-written materials, according to researchers.

The findings, detailed in a preprint paper titled "AI Self-preferencing in Algorithmic Hiring: Empirical Evidence and Insights," further amplify persistent concerns about AI bias.

Authors Jiannan Xu, a PhD candidate at the University of Maryland, Gujie Li, assistant professor at the National University of Singapore, and Jane Yi Jiang, assistant professor at Ohio State University, found that "LLMs consistently prefer resumes generated by themselves over those written by humans or produced by alternative models, even when content quality is controlled."

The researchers confirmed that AI models prefer their own output, a phenomenon documented in prior research. So those who use OpenAI's GPT-4o, for example, to help draft their resume and cover letter have better odds of getting called back when GPT-4o is assessing their qualifications.

"Our findings suggest that when AI models are used to evaluate resumes, they tend to prefer content that resembles their own output," Jane Yi Jiang, one of the paper's co-authors, told The Register in an email. "This doesn't necessarily mean the systems are 'discriminating' in a legal or intentional sense, but it does raise important questions about fairness."

The researchers devised a resume submission experiment involving a dataset of 2,245 human-written resumes and multiple AI models, including GPT-4o, GPT-4o-mini, GPT-4-turbo, LLaMA 3.3-70B, Mistral-7B, Qwen 2.5-72B, and DeepSeek-V3.

They looked both at LLM-vs-Human bias, where a model prefers its own output to the human equivalent, and at LLM-vs-LLM bias, where the model prefers its own output to material generated by a different model.
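To get a feel for what such a pairwise comparison looks like in practice, here is a minimal sketch assuming the OpenAI Python SDK; the prompt wording, the A/B decision format, and the shuffling are illustrative assumptions, not the authors' actual protocol.

```python
# Minimal sketch of an LLM-vs-Human pairwise comparison, assuming the OpenAI
# Python SDK; prompts and scoring here are placeholders, not the paper's code.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def pick_resume(job_description: str, resume_a: str, resume_b: str,
                model: str = "gpt-4o") -> str:
    """Ask the evaluating model to shortlist one of two resumes; returns 'A' or 'B'."""
    prompt = (
        f"Job description:\n{job_description}\n\n"
        f"Resume A:\n{resume_a}\n\n"
        f"Resume B:\n{resume_b}\n\n"
        "Which candidate would you shortlist? Answer with exactly one letter: A or B."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip()[:1].upper()

def self_selection_rate(pairs: list, job_description: str, model: str = "gpt-4o") -> float:
    """pairs: list of (llm_resume, human_resume) tuples for the same underlying candidate.
    Positions are shuffled so the evaluator cannot learn that 'A' is always the LLM version."""
    wins = 0
    for llm_resume, human_resume in pairs:
        if random.random() < 0.5:
            choice, llm_slot = pick_resume(job_description, llm_resume, human_resume, model), "A"
        else:
            choice, llm_slot = pick_resume(job_description, human_resume, llm_resume, model), "B"
        wins += (choice == llm_slot)
    return wins / len(pairs)  # 0.5 would indicate no self-preference
```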

LLM-vs-Human bias proved the strongest, statistically speaking. The AI self-preference bias ranged from approximately 68 percent to 88 percent among larger, aligned models (GPT-4-turbo, GPT-4o, GPT-4o-mini, DeepSeek-V3, Qwen 2.5-72B, and LLaMA-3.3-70B).

The authors' simulations suggest that job candidates who prepare their applications using the same LLM as the one evaluating them can expect to be shortlisted 23 to 60 percent more often than equally qualified applicants who submitted human-written resumes. This phenomenon is largest in business-related fields like sales and accounting, the paper says.

AI models also preferred their own output to output from other models, though to a lesser extent. "DeepSeek-V3 exhibits the strongest self-preferencing tendency, favoring its own resumes with a self-selection rate of 84 percent against LLaMA-3.3-70B and 64 percent against GPT-4o," the paper says.

LLM-vs-LLM self-preferencing, however, is less pronounced and less widespread compared to LLM-vs-Human self-preferencing, the researchers say.

While AI self-preferencing has also been noted in studies of AI benchmarking, the researchers say the issue is particularly consequential in hiring, due to the widespread use of AI by human resources professionals. Most large firms use Applicant Tracking Systems and, according to a survey conducted last October by Resume Builder, seven out of 10 companies expect to use AI in their hiring process in 2025, despite most acknowledging that it's biased.

"As more candidates use AI tools to craft their resumes – and more employers rely on AI to screen them – there's a growing risk that these tools may subtly shape who gets noticed," said Jiang. "For example, two equally qualified candidates could receive different outcomes simply because one happened to use the same AI model as the employer. As AI-to-AI interactions become more common, this underscores that how AI systems are designed and deployed must be sensitive to context."

Jiang and her colleagues propose two ways to limit AI models' tendency to favor their own output. One is to instruct the model in its system prompt to ignore the resume's origin and focus on content. The other is to involve multiple models in hiring evaluations to dilute the influence of any one model.
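The paper doesn't publish code for either mitigation, but the gist of both can be sketched, again assuming the OpenAI Python SDK; the neutral system prompt, the SHORTLIST/REJECT format, and the particular panel of models are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of both mitigations: a neutral system prompt plus a majority
# vote across several evaluating models. Assumes the OpenAI Python SDK; in
# practice a panel would mix models from different providers.
from collections import Counter
from openai import OpenAI

client = OpenAI()

NEUTRAL_SYSTEM_PROMPT = (
    "You are screening resumes. Judge only the substance of each candidate's "
    "experience and skills. Ignore writing style and do not try to infer whether "
    "the text was written by a human or generated by an AI."
)

def screen(resume: str, job_description: str, model: str) -> str:
    """Ask one model for a SHORTLIST or REJECT decision under the neutral system prompt."""
    reply = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": NEUTRAL_SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Job description:\n{job_description}\n\n"
                f"Resume:\n{resume}\n\n"
                "Answer with exactly one word: SHORTLIST or REJECT."
            )},
        ],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper()

def panel_decision(resume: str, job_description: str,
                   models=("gpt-4o", "gpt-4o-mini", "gpt-4-turbo")) -> str:
    """Majority vote across several models so no single model's self-preference dominates."""
    votes = Counter(screen(resume, job_description, m) for m in models)
    return votes.most_common(1)[0][0]
```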

The prompt-based instruction cut GPT-4o's self-preference from over 90 percent to below 50 percent, where 50 percent would be considered neutral. Any such model adjustments, however, have consequences for other applications that depend on that same model.

"There's an important tradeoff to consider," said Jiang. "In hiring, reducing a model's awareness or recognition of whether content was AI- or human-written may help limit unintended bias and promote more equitable evaluations. But in other settings, such as education or journalism, being able to detect AI-generated content is essential for ensuring transparency and accountability. 

"So while reducing model self-recognition may be helpful in some applications, it's not a one-size-fits-all solution. Each use case requires thoughtful design choices that are appropriate to its goals, risks, and broader context." ®