Introduction
Over the past few years, repositories have been encountering a growing number of bots trying to access their resources. These bots, or crawlers, navigate the internet, gathering data and indexing information for search engines and large language models, as well as for other purposes. While some of these bots are rather innocuous, others are sufficiently aggressive that they cause service disruptions in repositories (and other scholarly communications infrastructures).
Machine users have always been a critical constituency for repositories: search engines, aggregators, and other indexing services are the predominant mechanisms by which a repository's resources are discovered. As such, it is very much in the community's interest to ensure that repositories remain open and accessible to friendly bots and crawlers despite the increasingly aggressive nature of some bots.
In early 2025, the Confederation of Open Access Repositories (COAR) conducted a survey to assess the extent to which repositories were being adversely affected by a sudden and large increase in activity from "bots" - automated Web clients such as crawling or harvesting systems. The ensuing report - The impact of AI bots and crawlers on open repositories: Results of a COAR survey, April 2025 - was published in June 2025. It makes clear that:
- the scale of traffic from badly-behaved bots presents a significant problem for open-access repositories
- the measures being taken by repositories vary in their effectiveness
- some of the measures being taken by repositories have the unintended consequence of impeding access by legitimate users (both human and "machine")
In response to this, COAR convened the Dealing With Bots Task Group to develop advice and supporting information for repository managers to help them to deal with this phenomenon. This website is the primary output of the Task Group.
One important conclusion from this work is that there is no "silver bullet" solution to this problem. It is clear that the nature of traffic on the Web has changed, and it seems certain that repositories will continue to deal with a range of bots, both welcome and unwelcome, and that the behaviour of such bots will in many cases be problematic. Repositories will need to walk a fine line between protecting their operations from being overwhelmed by traffic from unscrupulous actors, and maintaining their core mission of providing open access to legitimate users and machines.
The advice and information provided on this website are intended to help repository managers make informed decisions about the strategies they might use to achieve this balance in their own contexts. This is very much a resource-in-development. It has been sourced from the community, and we encourage repository managers to share their own experiences and insights so that we can continue to build a knowledge base that will help the community as a whole to deal with this challenge.
Problem Statement
There are two inter-related problems:
Problem 1 - Overwhelming traffic from badly-behaved bots
Open Access repositories are reporting a rapid increase in traffic from badly-behaved bots that aggressively attempt to collect content and, as a result, overwhelm them with an unreasonably high volume of network requests. There are already reports of repositories having been brought down by such activity.
Problem 2 - Counter-measures adversely affecting or impeding welcome traffic
Some of the measures which might reasonably be taken by a repository to defend against badly-behaved bots have the potential to cause "collateral damage" - that is, to impede legitimate access to the repository by well-behaved bots, or even by human users.
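For example, one counter-measure a repository might reasonably deploy is per-IP rate limiting at the web server. The sketch below (in nginx configuration; the zone name, rate, and burst values are illustrative assumptions, not recommendations) shows how blunt such a rule can be: every user behind a shared institutional proxy or NAT presents the same IP address, so legitimate human traffic can be throttled alongside the bots.

```nginx
# Illustrative per-IP rate limit - the values here are assumptions, not recommendations.
# In the http{} context: track each client IP in a 10 MB shared zone,
# allowing on average 2 requests per second per IP.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=2r/s;

server {
    location / {
        # Apply the limit, permitting short bursts of up to 20 queued requests.
        # Collateral damage: all users behind one campus proxy share one IP address,
        # so they are rate-limited collectively, humans included.
        limit_req zone=per_ip burst=20;
    }
}
```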
Addressing these problems is surprisingly challenging, because:
- The Robots Exclusion Protocol (robots.txt), designed to control or mitigate machine access to web systems such as repositories, is being circumvented or ignored by a growing number of bots (see the illustrative robots.txt sketch after this list).
- It is sometimes difficult to differentiate between human users and bots. This is, in part, a feature of Web systems that aim to treat human and machine users equivalently, but it increases the challenge of defending against badly-behaved bots, and it is becoming harder still as badly-behaved bots increasingly mimic human access patterns.
- This difficulty in differentiating between human users and bots also affects the collection and analysis of metrics and web usage statistics, which are important for repositories to understand how their content is being used.
- There is an emergent "arms race" between those designing badly-behaved bots and those attempting to defend against them.
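To make the first of these points concrete, a robots.txt file only expresses a request. A minimal sketch for a repository might look like the following (the paths, bot name, and sitemap URL are purely illustrative): a well-behaved crawler honours these rules, while a badly-behaved bot simply ignores the file and crawls everything anyway.

```
# Illustrative robots.txt - paths and bot names are examples only.

# Ask all crawlers to avoid expensive dynamic endpoints and to slow down.
# (Crawl-delay is a common, though non-standard, extension.)
User-agent: *
Disallow: /search
Disallow: /export
Crawl-delay: 10

# Ask one named crawler to stay away entirely.
User-agent: ExampleBot
Disallow: /

Sitemap: https://repository.example.org/sitemap.xml
```

Compliance is entirely voluntary; the protocol itself provides no technical enforcement.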