Crawling a billion web pages in just over 24 hours, in 2025

andrewkchan.dev

191 points by pseudolus 2 days ago · 66 comments

bndr a day ago

I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar, just to get access to any website. The bandwidth and storage are the smallest cost factors.

Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% success rate crawling a website, using a mix of residential proxies, captcha solvers, rotating user-agents, and stealth Chrome binaries; otherwise I would get a 403 immediately with no HTML being served.
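The rotation part of that mix can be sketched in a few lines (everything here is illustrative: the user-agent pool and proxy list are hypothetical, and the hard parts in practice are the proxy fleet and the stealth browser, not this):

```python
import random

# Hypothetical pool: a real setup rotates across many more user-agents
# and pairs each request with a residential proxy exit.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]
PROXIES = [None]  # e.g. "http://user:pass@residential-exit:8080"

def request_profile(rng=random):
    """Pick a (headers, proxy) pair for the next fetch."""
    return {"User-Agent": rng.choice(USER_AGENTS)}, rng.choice(PROXIES)
```

Each fetch then draws a fresh profile, so no two consecutive requests look identical to the edge.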

  • mettamage a day ago

    I wonder if circumvention is legal. It's so odd. In the US it seems you can just do this, whereas if I started something like this in the EU, I don't think I could.

    • fuomag9 a day ago

      In Italy it's a crime punishable by up to 12 years to access any protected computer system without authorization, especially if it causes a DoS for the owner.

      Consider the case of self-hosting a web service on a low-performance server while abusive crawling loops endlessly fetching data (which was happening when I was self-hosting GitLab!)

      https://www.brocardi.it/codice-penale/libro-secondo/titolo-x...

  • mrweasel a day ago

    Can't your users just whitelist your IPs?

    • dewey a day ago

      I'm in a similar boat and getting customers to whitelist IPs is always a big ask. In the best case they call their "tech guy", in the worst case it's a department far away and it has to go through 3 layers of reviews for someone to adapt some Cloudflare / Akamai rules.

      And then you better make sure your IP is stable and a cloud provider isn't changing any IP assignments in the future, where you'll then have to contact all your clients again with that ask.

    • bndr a day ago

      They're mostly non-technical/marketing people, but yes that would be a solution. I try to solve the issue "behind the scenes" so for them it "just works", but that means building all of these extra measures.

      • cassepipe a day ago

        Would it make sense to advertise a discount to the more technically minded if they set up an IP whitelist, with a tutorial you could provide? A discount in exchange for reduced costs to you?

        • dolmen 16 hours ago

          The right solution is to be registered with Cloudflare, but then getting the customer to reach the guy who handles Cloudflare settings (a few clicks) is the hard part.

  • gilrain a day ago

    > the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar … mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries

    I would like to register my hatred and contempt for what you do. I sincerely hope you suffer drastic consequences for your antisocial behavior.

    • bndr a day ago

      Please elaborate: why exactly is it antisocial? Because Cloudflare decides who can or can't access a user's website? When they specifically signed up for my service.

      • demetris a day ago

        But how does that work?

        Does Cloudflare force firewall rules for those who choose to use it for their websites?

        If the tool that does the crawling identifies itself properly, does Cloudflare block it even if users do not tell Cloudflare to block it?

      • gilrain a day ago

        It intentionally circumvents the explicit desires of those who own the websites being exploited. It is nonconsensual. It says “fuck you, yes” to a clearly-communicated “please no”.

        • joncrane a day ago

          OP literally said that users add their domains, meaning they are explicitly ASKING OP to scrape their websites.

        • bndr a day ago

          Users sign up for my service.

          • gilrain a day ago

            You employ residential proxies. As such, you enable and exploit the ongoing destruction of the Internet commons. Enjoy the money!

            • christoff12 a day ago

              This is kind of like getting upset with people who go to ATMs because drug dealers transact in cash lol.

            • toomuchtodo a day ago

              Cloudflare and Big Tech are primary contributors to the impairment and decline of the Internet commons for moats, control, and profit; you are upset at the wrong parties.

    • prettyblocks a day ago

      I would argue that the ability to crawl and scrape is core to the original ethos of the internet and all the hoops people jump through to block non-abusive scraping of content is in fact more anti-social than circumventing these mechanisms.

  • peter_d_sherman a day ago

    Very interesting!

    Yes, in this day and age, I could definitely see web pages being harder to crawl by search engines (and SEO companies and other users of automated web crawling technologies (AI agents?)) than they were in the early days of the Internet due to many possible causes -- many of which you've excellently described!

    In other words, there's more to be aware of for anyone writing a search engine (or search-engine-like piece of software -- SEO, AI Agent, etc., etc.) than there was in the early days of the Internet, where everything was straight unencrypted http and most URLs were easily accessible without having to jump through additional hoops...

    Which leads me to wonder... on the one hand, a website owner may not want bots and other automated software agents spidering their site (we have ROBOTS.TXT for this), but on the flip side, most business owners DO want publicity and easy accessibility for sales and marketing purposes, thus, they'd never want to issue a 403 (or other error code) for any public-facing product webpage...

    Thus there may be a market for testing public facing business/product websites against faulty "I can't give you that web page for whatever reason" error codes from a wide variety of clients, from a wide variety of locations around the world.

    That market is related to the market for testing if a website is up and functioning properly (the "uptime market"), again, from a wide variety of locations around the world, using a wide variety of browsers...

    So, a very interesting post!

    Also (for future historians!) compare all of the restrictive factors which may prevent access to a public-facing web page today vs. Tim Berners-Lee's original vision for the web, which was basically to let scientists (and other academic types!) SHARE their data PUBLICLY with one another!

    (Things have changed... a bit! :-) )

    Anyway, a very interesting post, and a very interesting article -- for both present and future Search Engine programmers!

  • 0xdeadbeefbabe a day ago

    Blocking seems really popular. I wonder if it coincides with stack overflow closing.

  • spiderfarmer a day ago

    Just stop scraping. I'll do everything to block you.

    • ssgodderidge a day ago

      > in my case, users add their own domains

      Seems like they're only scraping websites their clients specifically ask them to

    • Keyframe a day ago

      Now you've gamified it :)

      • shimman a day ago

        It's a pretty easy game to win as the blocker. If you receive too many 404s for pages that don't exist, just ban the IP for a month. I actually got the idea from a Hacker News comment too. I'm also thinking that if you crawl too many pages you should get banned as well.

        There's no point in playing tug of war against unethical actors, just ban them and be done with it.

        I don't think it's an uncommon opinion to behave this way, nor are the crawlers users I want to help in any capacity.
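The 404-then-ban idea above can be sketched roughly like this (the threshold and class name are made up for illustration; the one-month ban is from the comment):

```python
import time
from collections import defaultdict

MAX_404S = 20                 # hypothetical threshold
BAN_SECONDS = 30 * 24 * 3600  # "ban the IP for a month"

class NotFoundBanlist:
    def __init__(self):
        self.counts = defaultdict(int)   # ip -> 404s seen
        self.banned_until = {}           # ip -> unban timestamp

    def record_404(self, ip, now=None):
        now = time.time() if now is None else now
        self.counts[ip] += 1
        if self.counts[ip] >= MAX_404S:
            self.banned_until[ip] = now + BAN_SECONDS

    def is_banned(self, ip, now=None):
        now = time.time() if now is None else now
        return self.banned_until.get(ip, 0) > now
```

A reverse proxy would call `is_banned` before serving and `record_404` on every miss; as the replies below note, the catch is that the crawler rarely sits on a single IP.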

        • basilikum 18 hours ago

          So you're blocking the absolute bottom of the barrel scrapers and feel like you 'won' because you don't even notice any scraper that isn't complete trash.

          Then again, why block them if they don't cause any issues in the first place? Instead of going ballistic on IPs that you don't vibe with, you could also just do proper rate limiting.

        • Klonoar a day ago

          If you think the game is played on a single IP address, you are not adept enough to be weighing in on this discussion.

        • stevewodil a day ago

          What if the crawler is using a shared IP and you end up blocking legitimate users along with the bad actor?

          • Keyframe a day ago

            He said "it's pretty easy", probably not realizing there are whole industries on both sides of that cat and mouse game, making it not easy.

throwaway77385 a day ago

> spinning disks have been replaced by NVMe solid state drives with near-RAM I/O bandwidth

Am I missing something here? Even Optane is an order of magnitude slower than RAM.

Yes, under ideal conditions, SSDs can have very fast linear reads, but IOPS / latency have barely improved in recent years. And that's what really makes a difference.

Of course, compared to spinning disks, they are much faster, but the comparison to RAM seems wrong.

In fact, for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU, so VRAM needs to be used. That's how latency-sensitive some applications have become.

  • fluoridation a day ago

    >for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU

    That's not why. It's because RAM has a narrower bus than VRAM. If it were a matter of distance, it'd just have greater latency, but that would still give you tons of bandwidth to play with.

    • dist-epoch a day ago

      You could be charitable and say the bus is narrow because it has to travel a long distance and this makes it hard to have a lot of traces.

      • fluoridation a day ago

        It's not. It's narrow even between the CPU and RAM. That's just the way x86 is designed. Nvidia and AMD by contrast have the luxury of being able to rearchitect their single-board computers each generation as long as they honor the PCIe interface.

        It is also true that having a 384-bit memory bus shared with the video card would necessitate a redesigned PCIe slot as well as an outrageous number of traces on the motherboard, though.

        • adrian_b a day ago

          Traditionally, the width of the GPU memory interfaces was many times greater than that of CPUs.

          However, the maximum width in consumer GPUs, of up to 1024 bits, was reached many years ago.

          Since then the width of the memory interfaces in consumer GPUs has been decreasing continuously, and this decrease has been only partially compensated by higher memory clock frequencies. This reduction has been driven by NVIDIA, in order to increase their profit margins by reducing the memory cost.

          Nowadays, most GPU owners must be content with a memory interface no better than 192-bit, like in RTX 5070, which is only 50% wider than for a desktop CPU and much narrower than for a workstation or server CPU.

          The reason why using main memory from the GPU is slow has nothing to do with the width of the CPU memory interface; it is caused by the fact that the GPU accesses main memory through PCIe, so it is limited by the throughput of at most 16 PCIe lanes, which is much lower than that of either the GPU memory interface or the CPU memory interface.
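The PCIe bottleneck is easy to put numbers on (rounded per-lane figures; real throughput is somewhat lower after encoding and protocol overhead):

```python
# Rounded usable throughput per PCIe lane in GB/s, one direction.
PCIE_GBPS_PER_LANE = {3: 1.0, 4: 2.0, 5: 4.0}

def x16_throughput_gbps(gen):
    """Peak one-direction throughput of a full 16-lane link."""
    return 16 * PCIE_GBPS_PER_LANE[gen]

# Even a full PCIe 5.0 x16 link (~64 GB/s) is tiny next to the
# roughly 1.8 TB/s a high-end GPU gets from its local GDDR7.
```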

        • dist-epoch a day ago

          ThreadRipper has 8 memory channels versus 2 for a desktop AMD CPU. It's not an x86 limitation.

          • fluoridation a day ago

            "x86" as in the computer architecture, not the ISA. Why do you think they put extra channels instead of just having a single 512-bit bus?

            • adrian_b a day ago

              The memory interface of CPUs is made wider by adding more channels because there are no memory modules with a 512-bit interface. Thus you must add multiples of the module width to the CPU memory interface.

              This has nothing to do with x86, but it is determined by the JEDEC standards for DRAM packages and DRAM modules. The ARM server CPUs use the same number of memory channels, because they must use the same memory modules.

              A standard DDR5 memory module has a memory interface width of 64, 72, or 80 bits, depending on how many extra bits are available for ECC. The interface of a module is partitioned into 2 channels, to allow concurrent accesses at different memory addresses. Few people are aware that current memory channels therefore have a width of 32/36/40 bits, so by "memory channel" most people still mean 64 bits (or 72 bits with ECC), because that was the channel width in older memory generations.

              Not counting ECC bits, most desktop and laptop CPUs have an 128-bit memory interface, some cheaper server and workstation CPUs have a 256-bit memory interface, many server CPUs and some workstation CPUs have a 512-bit memory interface, while the state-of-the-art server CPUs have a 768-bit memory interface.

              For comparison, RTX 5070 has a 192-bit memory interface, RTX 5080 has a 256-bit memory interface and RTX 5090 has a 512-bit memory interface. However, the GDDR7 memory has a transfer rate that is 4 to 5 times higher than DDR5, which makes the GPU interfaces faster, despite their similar or even lower widths.
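The width-times-rate arithmetic in this comment works out roughly like so (the transfer rates are approximate, picked for illustration):

```python
def bandwidth_gb_per_s(bus_width_bits, transfer_rate_gt_per_s):
    # bytes/sec = (bus width in bits / 8) * transfers per second
    return bus_width_bits / 8 * transfer_rate_gt_per_s

# Illustrative rates: desktop DDR5 at ~6 GT/s on a 128-bit interface
# vs RTX 5090-class GDDR7 at ~28 GT/s on a 512-bit interface.
ddr5_desktop = bandwidth_gb_per_s(128, 6.0)   # 96 GB/s
gddr7_5090 = bandwidth_gb_per_s(512, 28.0)    # 1792 GB/s
```

So even at equal bus width, the 4-5x higher transfer rate of GDDR7 keeps the GPU side far ahead.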

  • throwaway77385 16 hours ago

    I can't edit my comment, but to the people responding here, thank you for adding all this information. It really helped elucidate why VRAM vs RAM is a distinction and also prevents my somewhat naive interpretation from being the only thing people see. Thanks!

finnlab 2 days ago

Nice work, but I feel like it's not required to use AWS for this. There are small hosting companies with specialized servers (50 Gbit shared medium for under $10); you could probably do this for under $100 with some optimization.

  • nurettin a day ago

    I did some crawling on Hetzner back in the day. They monitor traffic and make sure you don't automate retrieval of publicly available data. They send you an email telling you that they are concerned because you got the IP blacklisted. Funny thing is: they own the blacklist that they refer to.

    • qingcharles a day ago

      This. I tried to run a very slow DHT scraper I was writing on a Hetzner server and within minutes they were on my ass. I don't want to make an enemy of them so I killed it immediately, but they are clearly very sensitive to anything outside of "normal".

    • jeroenhd a day ago

      If Hetzner actually puts their own customers on their blacklist then that list becomes more trustworthy.

      They were right to blacklist you, they were right to complain to you, and they were right not to assume malice and kick you off their platform/shut down your server.

      • nurettin a day ago

        Yes, I wasn't banned or anything; they aren't barbarians. Also, explain your opinion, don't just put it out there. This is not a football match.

        • jeroenhd 16 hours ago

          I felt like the arguments behind my opinion were implicit.

          Consistently not showing bias makes for trustworthy lists. Nobody (well, nobody reasonable I'd argue) will trust an IP blacklist from a major hosting company that actively excludes their own customers.

          As for banning customers that get blacklisted, this does actually happen, especially on the more affordable cloud hosts that get plagued with massive scraper and bot loads: anything from suspended, inaccessible servers that require manual intervention to network caps or CPU load caps. This is a rather extreme measure, but not an uncommon or even unacceptable one, depending on the exact IP blacklist you managed to trigger.

          • nurettin 15 hours ago

            Maybe you didn't get the nuance here:

            Hetzner says (not exact wording) "according to this public blacklist you seem to be performing something they consider bad".

            What they don't mention is that they operate the list.

  • varispeed 2 days ago

    This. AWS is like a cash furnace, only really usable for VC backed efforts with more money than sense.

sunpolice a day ago

I was able to get 35k req/sec on a single node with Rust (custom HTTP stack, custom HTML parser, custom queue, custom KV database) with obsessive optimization. It's possible to scrape a Bing-sized index (say 100B docs) each month with only 10 nodes, for under $15k.
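The arithmetic here checks out with headroom to spare (a quick sanity check, assuming sustained throughput and a 30-day month):

```python
req_per_sec_per_node = 35_000
nodes = 10
seconds_per_month = 30 * 24 * 3600  # 2,592,000

pages_per_month = req_per_sec_per_node * nodes * seconds_per_month
# 907,200,000,000 pages: roughly 9x a ~100B-doc, Bing-sized target
```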

Thought about making it public but probably no one would use it.

dangoodmanUT a day ago

> because redis began to hit 120 ops/sec and I’d read that any more would cause issues

Suspicious. I don’t think I’ve ever read anything that says redis taps out below tens of thousands of ops…

thefounder a day ago

Well, the most important part seems to be glossed over, and that's the IP addresses. Many websites simply block (or want to block) anything that's not Google and is not a "real user".

mudkipdev 7 hours ago

Does AWS actually allow you to crawl like this? I've been interested in a similar project but the cloud providers I typically use seem to ban it in their terms of service

ph4rsikal a day ago

When I read this, I realize how small Google makes the Internet.

handfuloflight a day ago

There was a time when being able to do this meant you were on the path to becoming a (m)(b)illionaire. Still is, I think.

lovelearning 20 hours ago

As an experiment, it's interesting.

If anyone actually needs such a dataset, look into CommonCrawl first. I feel using something that already exists will be more cooperative and considerate than everyone overloading every website with their spider. https://commoncrawl.org/overview

corv a day ago

Python is obviously too slow for web-scale

gethly a day ago

> I also truncated page content to 250KB before passing it to the parser.

WTF did I just read?

  • tengada1 a day ago

    It's just HTML, presumably not requesting JS libraries. So 250K is a large amount.

    • gethly a day ago

      Exactly: how can an HTML page need to be trimmed to 250 KB? That is insane. Something is not right with this article.

      • iggldiggl a day ago

        A transcript for a half-hour radio comedy show with some formatting takes up about 60 kB. The English Wikipedia page for Monty Python is about 130 kB in pure UTF-8 text, and the actual HTML page takes up around 660 kB (plus/minus, depending on which Wikipedia theme exactly you use).

        So large, text-heavy pages don't seem too unlikely to exceed 250 kB, especially if they also include some amount of formatting that's more substantial than just a minimal bunch of <p> tags.
