A Web Crawler with Asyncio Coroutines

aosabook.org

95 points by nickpresta 10 years ago · 30 comments

theVirginian 10 years ago

Great tutorial. I would love to see this rewritten with the new async/await syntax in Python 3.5.

potatosareok 10 years ago

One question I have about this - and I might have missed it in the article - I'm all for using asyncio to make HTTP requests. But I see they apparently also use asyncio for "parse_links". Since parse_links should be a CPU-bound op, would it make sense to use fibers to download pages and pass them into a thread pool to actually parse them and add links to the queue?

I'm messing around with the ParallelUniverse Java fiber implementation, and what I do is spawn fibers to download pages and send the String responses over a channel to another fiber, which maintains a thread pool to parse response bodies as they come in and create new fibers to read those links.

I'm really just doing this to get more familiar with async programming and specifically the ParallelUniverse Java libs, but one thing I'm struggling with a bit is how to make it well behaved (e.g. right now there's no bound on the number of outstanding HTTP requests).
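
A minimal sketch of that split in asyncio terms, assuming aiohttp as the HTTP client and a stub parse_links (neither is taken from the article): the event loop drives the downloads while run_in_executor hands the CPU-bound parse to a pool. A ThreadPoolExecutor would look identical, but for pure-Python parsing the GIL means a process pool is the closer analogue of a separate worker pool.

  import asyncio
  from concurrent.futures import ProcessPoolExecutor

  import aiohttp  # assumed HTTP client, not necessarily what the article uses

  def parse_links(html):
      # stand-in for the CPU-bound parse; a real version might use lxml or html.parser
      return [token for token in html.split('"') if token.startswith("http")]

  async def fetch_and_parse(session, url, pool, loop):
      async with session.get(url) as resp:
          html = await resp.text()
      # hand the CPU-bound parse to the pool so the event loop keeps servicing downloads
      return await loop.run_in_executor(pool, parse_links, html)

  async def main(urls):
      loop = asyncio.get_event_loop()
      with ProcessPoolExecutor(max_workers=4) as pool:
          async with aiohttp.ClientSession() as session:
              return await asyncio.gather(
                  *(fetch_and_parse(session, u, pool, loop) for u in urls))

  if __name__ == "__main__":
      pages = asyncio.get_event_loop().run_until_complete(main(["http://example.com/"]))
      print(pages)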

Schwolop 10 years ago

This article is way more important than the web crawler example used to motivate it. It's easily the single best thing I've ever read on asyncio, and I've been using it in anger for a year now. I've passed it around my team, and will be recommending it far and wide!

fabiandesimone 10 years ago

I'm working on a project that involves lots of web crawling. I'm not technical at all (I'm hiring freelancers).

While I do have access to great general technology advice, this post is bound to bring in people well versed in crawling.

My question is: in terms of crawling speed (and I know this depends on several factors), what's a decent number of pages for a good crawler to do per day?

The crawler I built is doing about 120K pages per day, which for our initial needs is not bad at all, but I wonder whether in the crawling world this is peanuts or a decent chunk of pages.

  • reinhardt 10 years ago

    It doesn't make much sense to give a number for speed without some specifics about the crawler environment, such as:

      - How many servers (if distributed)?
      - How many cores/server?
      - What kind of processing takes place for each page? Does it just download and save the pages somewhere (local filesystem, cloud storage, database), or does it extract (semi-)structured data? And so on.
    
    Specifics aside, these days it's not hard to crawl millions of pages/day on commodity servers. Some related posts:

    http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-bil...

    http://blog.semantics3.com/how-we-built-our-almost-distribut...

    http://engineering.bloomreach.com/crawling-billions-of-pages...

  • atombender 10 years ago

    One tip: if you make your pipeline fine-grained, you get much more flexibility in scheduling and parallelization, and it also becomes easier to extend and redesign.

    By fine-grained I mean that fetching, crawling, extraction and whatever other processing you're doing should be separate, discrete steps.

    Example naive topology:

    Fetcher: Pops next URL off a queue, fetches it, stores the raw data somewhere, emits a "fetched" event.

    Link extractor: Subscribes to fetch events, extracts every URL from the data, each of which is emitted as a "link" event.

    Crawling scheduler: Listens to link events, schedules "fetch" events for each URL. This is where you might add filtering and prioritization rules, for example.

    Now you have three queues and three consumers, which can run in parallel with any number of worker processes dedicated to them. A naive solution could use something like a database for the events, but a dedicated queue such as RabbitMQ would fare better.
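
    A toy, in-process version of that topology might look like the sketch below, with asyncio.Queue standing in for a real broker such as RabbitMQ; aiohttp and the regex link extraction are placeholders I'm assuming, not a recommendation.

      import asyncio
      import re

      import aiohttp  # assumed HTTP client

      fetch_q, fetched_q, link_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()

      async def fetcher(session):
          # pops the next URL, fetches it, emits a "fetched" event
          while True:
              url = await fetch_q.get()
              try:
                  async with session.get(url) as resp:
                      await fetched_q.put((url, await resp.text()))
              finally:
                  fetch_q.task_done()

      async def link_extractor():
          # subscribes to "fetched" events, emits one "link" event per URL found
          while True:
              url, raw = await fetched_q.get()
              for link in re.findall(r'href="(http[^"]+)"', raw):
                  await link_q.put(link)
              fetched_q.task_done()

      async def scheduler(seen):
          # listens to "link" events; filtering and prioritization rules go here
          while True:
              link = await link_q.get()
              if link not in seen:
                  seen.add(link)
                  await fetch_q.put(link)
              link_q.task_done()

      async def crawl(seeds):
          seen = set(seeds)
          for url in seeds:
              await fetch_q.put(url)
          async with aiohttp.ClientSession() as session:
              workers = [asyncio.ensure_future(fetcher(session)) for _ in range(3)]
              workers += [asyncio.ensure_future(link_extractor()),
                          asyncio.ensure_future(scheduler(seen))]
              await fetch_q.join()  # naive stop condition, fine for a sketch
              for w in workers:
                  w.cancel()
              await asyncio.gather(*workers, return_exceptions=True)

      asyncio.get_event_loop().run_until_complete(crawl(["http://example.com/"]))

    Swapping the in-process queues for a broker is what lets you run any number of fetcher, extractor, and scheduler processes independently.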

  • fsg7sdfg789 10 years ago

    It's not a significant number of pages per day, honestly. For me, the limiting factor is almost always how many concurrent requests I feel comfortable making to the remote server. For big sites, the proxy I use generally caps it at 5 req / domain (concurrently).

    I generally use distributed crawlers, which means I can scale to millions of pages per day (assuming different domains). The biggest limiting factor is the database layer: how many writes can I do in a day.

    If I need to go faster, I just spin up another crawler worker, which connects to the queue and starts pulling jobs.

    I believe anything under a million pages / day should be doable by a home-built, single-server system.
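
    For the per-domain cap specifically, one way to express it in asyncio is a semaphore per host; the limit of 5 below just mirrors the proxy cap mentioned above, and the aiohttp-style session is an assumption.

      import asyncio
      from collections import defaultdict
      from urllib.parse import urlparse

      # at most 5 in-flight requests per domain, mirroring the proxy cap above
      per_domain = defaultdict(lambda: asyncio.Semaphore(5))

      async def polite_get(session, url):
          # session is assumed to be an aiohttp.ClientSession
          async with per_domain[urlparse(url).netloc]:
              async with session.get(url) as resp:
                  return await resp.read()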

    • fabiandesimone 10 years ago

      Thanks. Well, this does make me wonder whether we are doing something wrong or performing actions that slow down the crawling. We have a good server (I believe).

      • troels 10 years ago

        You might want to benchmark where your software is spending its time. A typical overhead is connection time. You may be able to speed things up with a local DNS cache and by using HTTP keep-alive. You also generally want to do a lot of parallel requests, since most of the time is spent waiting for the subject site to respond.
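
        As a rough way to see where the time goes, a sketch like the one below times each request while reusing a single session, so connections stay alive (HTTP keep-alive) and DNS lookups are cached; aiohttp and its TCPConnector options are assumptions here, not a specific recommendation.

          import asyncio
          import time

          import aiohttp  # assumed client; the point is connection reuse plus parallelism

          async def timed_fetch(session, url):
              start = time.monotonic()
              async with session.get(url) as resp:
                  await resp.read()
              return url, time.monotonic() - start  # crude per-request timing

          async def main(urls):
              # one shared session keeps connections alive and caches DNS lookups,
              # so repeat requests to the same hosts skip the connect/resolve overhead
              conn = aiohttp.TCPConnector(limit=100, use_dns_cache=True)
              async with aiohttp.ClientSession(connector=conn) as session:
                  return await asyncio.gather(*(timed_fetch(session, u) for u in urls))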

      • kawera 10 years ago

        Don't forget to check your connection too; maybe you are filling it up or have latency issues.

      • fsg7sdfg789 10 years ago

        I just reached out to you via email to see if I can help.

  • adamseabrook 10 years ago

    meanpath.com can do around 200 million pages per day using 13 fairly average dedicated servers. We only crawl the front page (mile wide, inch deep), so the limiting factor is actually DNS. Looking at the network traffic, the bandwidth is split evenly between DNS and HTTP. Google Public DNS will quickly rate limit you, so you need to use your own resolvers (we use Unbound).

    Unlike Blekko, we are just capturing the source and dumping it into a DB without doing any analysis. As soon as you start trying to parse anything in the crawl data, your hardware requirements go through the roof. GNU parallel with wget or curl is enough to crawl millions of pages per day. I often use http://puf.sourceforge.net/ when I need to do a quick crawl.

    "puf -nR -Tc 5 -Tl 5 -Td 20 -t 1 -lc 200 -dc 5 -i listofthingstodownload" will easily do 10-20 million pages per day if you are spreading your requests across a lot of hosts.

    • greglindahl 10 years ago

      We used djbdns on every crawl machine and did not find DNS to be limiting at all. You should also make sure there isn't any connection tracking, firewalls/middleboxes doing connection-based anything, NAT, or really anything other than raw Internet between you and the sites you're crawling.

  • Jake232 10 years ago

    I have scrapers built in Python that do well over a million pages per day, but that's not really a benchmark you can use. It all depends on the amount of computation required to extract the page data, among other things.

    You should be able to achieve > 120k per day for sure, though. That's less than two per second (120,000 pages / 86,400 seconds ≈ 1.4 requests per second).

    • fabiandesimone 10 years ago

      Thank you. Well, I'm doing several things:

      1) I check whether or not the page we just scraped has any of the tags we are looking for.

      2) We then extract any information within those tags (images, etc.)

      3) We follow through every link, and if it's not in the seen/scraped list, we add it to the queue.

      Not sure if this helps to narrow it down.

      Thanks!

  • logn 10 years ago

    It all depends on how many servers you're using (or how much memory/CPU each has), whether it's architected properly for horizontal scaling, the performance of your proxy servers (if applicable), how much you're stressing the target website, how efficient your HTML parsing is, and whether you need to render CSS/JS pages.

    I'm just now finishing a project for an ISP, building a cache of webpages using my project jBrowserDriver. They can basically turn on as many VMs as they need to scale out horizontally, and the servers all seamlessly load balance themselves and pull work off a central queue. One important part is handling failures and crashes while isolating their impact from everything else; in this approach, separate OS processes are helpful.

  • greglindahl 10 years ago

    Academics who write crawlers that don't do much with the pages they fetch can do hundreds of millions of pages in a day with an ordinary server and a big, fat network pipe. At that speed they aren't even parsing HTML; they're using regexes to try to find URLs, and that's about it.

    At Blekko, we did ~100k pages/day/server with our production crawler, running on a cluster that was also doing anti-web-spam work, inverting outgoing links into incoming links, indexing everything, and running analytics batch jobs supporting development.

    So unless you're doing a LOT of work on every webpage, you're kinda slow.

    The easiest mistake to make is not being async enough. This Python example is great.

  • chrismarlow9 10 years ago

    Assuming you're not bound by rate limiting on the remote hosts, the average page crawled is < 1 megabyte, and you're running on something comparable to a medium EC2 instance, yeah, I would say that is fairly slow.

    I've written more web crawlers than I can count in PHP, Python, Scala, Go, Node.js, and Perl. Right now, assuming you just want to gather some form of JSON/HTML from the response, I would use Go and gokogiri with XPaths (and, of course, json.Unmarshal for JSON). It will make you laugh at 120k per day. Feel free to ping me if you would like to discuss making me one of those freelancers.

  • jdrock 10 years ago

    Here's how many URLs we crawl every second with 80legs (http://www.80legs.com):

    https://dl.dropboxusercontent.com/u/44889964/Descartes%20%20...

    This translates to about 700MM/month. The bump you see this month is just us adding more crawling nodes to our cluster.

  • xitep 10 years ago

    The article is a very nice tutorial on async I/O. I really enjoyed it. But as the authors themselves say, the web crawler they build in this article is just a toy.

    Writing an efficient and _well-behaved_ web crawler is IMHO quite a complicated undertaking. Others here have already pointed out that it's more or less a scalability problem, hence a single number doesn't make much sense. reinhardt has provided a list of links which, from a quick glance, look very interesting and might take you further.

  • troels 10 years ago

    I have a crawler setup that pulls a few million pages per day. The main constraint is not the crawler setup, but rather how much load the subject sites can withstand. If I don't throttle the traffic down, the sites will be DoS'ed very quickly. Of course, this is mainly a problem because I crawl a lot of pages from each site; if you have a crawler that crawls a few pages from a lot of sites, you would have a different scenario.
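
    One simple way to express that throttling in asyncio is to space out request starts per host; the one-second minimum delay and the aiohttp-style session below are assumptions for the sketch.

      import asyncio
      import time
      from urllib.parse import urlparse

      MIN_DELAY = 1.0   # assumed minimum gap between requests to the same host
      last_hit, locks = {}, {}

      async def throttled_fetch(session, url):
          # session is assumed to be an aiohttp.ClientSession
          host = urlparse(url).netloc
          lock = locks.setdefault(host, asyncio.Lock())
          async with lock:
              wait = MIN_DELAY - (time.monotonic() - last_hit.get(host, 0.0))
              if wait > 0:
                  await asyncio.sleep(wait)
              last_hit[host] = time.monotonic()
          async with session.get(url) as resp:
              return await resp.read()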

Animats 10 years ago

It would be interesting to compare this Python approach with a Go goroutine approach. The main question is whether Go's libraries handle massive numbers of connections well. Since Google wrote Go to be used internally, they probably do.

  • mseri 10 years ago

    Or Erlang, or Rust.

    • logn 10 years ago

      Or Java. Right now I have a web driver that uses the standard Java classes for requests, but I wonder if NIO would offer significantly better performance.

juddlyon 10 years ago

Node is well-suited for this type of thing and there are numerous libraries to help.

rgacote 10 years ago

Appreciate the in-depth description. Look forward to working through this in detail.
