Colly: Fast and Elegant Scraping Framework for Golang
You will get the initial source code with this, but if you want JavaScript to work, you just use PhantomJS. If you want PhantomJS to be usable, you use CasperJS... until you find a site with FuzeJS or some other JavaScript- or HTML-intense site. Those won't render in PhantomJS.
For that stuff, as of a few months ago, you can use headless Chrome. I wrote a couple of Go packages to make that easy. It basically runs headless Chrome with a JavaScript REPL console you can use to interact with the session. https://GitHub.com/integrii/headlessChrome
I was also able to smash my whole scraper bot into a Docker container after working around a couple of bugs.
That looks cool! Would I be able to run Node scripts?
> Would I be able to run Node scripts?
No. Since you can't run Node scripts on Chrome, the same is true for Chrome headless.
One thing I'm interested in is scraping those annoying sites that require JavaScript execution. More and more webpages require JS even to display anything beyond a blank page. These sites self-select for exclusion from scraping like this.
I've looked into Headless Chrome, but I'd be interested to see a 'scraping framework' level abstraction for those sites.
If you're targeting a specific site (as opposed to blindly spidering multiple), all JavaScript is actually better for scraping, in my experience. JavaScript apps communicate with an API 99% of the time, so to scrape them you can just replicate the API requests. And as a bonus you'll get nicely formatted JSON responses; no need to parse fragile HTML.
Similarly, I've found that most sites worth scraping also have a mobile app, which you can run through a MITM proxy, and then simply write a scraper to call the API endpoints directly.
This is absolutely true. This is why SPAs are awesome to scrape
You might have missed puppeteer[1], if you don't mind writing Javascript it seems to provide a simple interface for scraping.
Scraping JS-only sites is also possible without a headless browser, but requires a bit more debugging of the internal structure of these sites. Most of JS-only websites have API endpoints with JSON responses, which can make scraping more reliable than parsing custom (and sometimes invalid) HTML. The drawback of headless browser based scraping is that it requires significant amount of cpu time and memory compared to "static" scraping frameworks.
As mentioned, Puppeteer was used in this project, which is Chrome based.
I've also used Selenium (via Python; I used BeautifulSoup to parse the resulting HTML) in the past, precisely for the reasons you stated. Selenium uses "web drivers", which lets you start other headless browsers as well (Firefox, Opera, IE, etc.).
http://selenium-python.readthedocs.io/
All it took was a couple of lines of Python code...
"More and more webpages are requiring JS even to display anything beyond a blank page."
Can you provide some example webpages so we can take a look?
Also, I agree with @asciimoo's point about endpoints. One could make an argument that compared to the website design of the 1990's and 2000's, retrieving structured data from websites is actually getting easier not more difficult. I recall one period of time where the trend was to design websites entirely in Macromedia/Adobe Flash.
Here is what is missing from this project (and many others like it): when providing software that performs text processing one needs to not only provide example code but also example output. This enables a user to quickly compare her current text-processing solution with the software being provided without having to install, review and run the unfamiliar software.
For example, without some sample output she cannot test, e.g., whether her current solution produces the same output faster or using less code.
With a minute's searching, here's a webpage that's meaningless without JavaScript:
https://blogger.googleblog.com/
It's arguably a page of text not a web app that needs user interaction.
I'm not trying to start an argument about JS: the consensus on HN seems to be that if you don't execute JavaScript you don't deserve to read webpages. I'm just saying that your website's clients are more diverse than 'normal' well-sighted humans. There may be machines reading the site, for all kinds of reasons.
And regarding endpoints. One could make the argument that with AJAX we now have richer APIs. I disagree. We have a well-understood API for getting hypertext (HTTP) in a well-understood format (HTML) that works/worked across all websites. Replacing that with a custom-built API for every website isn't apples-for-apples.
The reason I asked is that I rarely use a graphical browser, so I am always curious about websites that are inaccessible without JavaScript.
Blogger has a feed. Here is one way to retrieve it, in two steps.
1. get targetBlogID
2. retrieve data
optional: 3. format data for viewing

Example:

1. x=$(exec curl https://blogger.googleblog.com \
   |exec sed '
    s/\\046/\&/g; s/\\46/\&/g;
    s/\\075/=/g; s/\\75/=/g;
    /targetBlogID/!d;
    s/.*targetBlogID=//;
    s/&.*//;
   ');

2. curl -o y.htm https://www.blogger.com/feeds/$x/posts/default

3. exec sed '
    # ^M is "\r"
    s/^[0-9a-f]*^M//;
    s/&lt;/</g;
    s/&gt;/>/g;
    s/&amp;/\&/g;
    s/&quot;/\"/g;
    1i\
<br><br>
    s/<name>/<br><br>name &/g;
    s/<uri>/<br>uri &/g;
    s/<generator>/<br>generator &/g;
    s/Blogger//;
    s/<id>/<br>id &/g;
    s/<published>/<br>published &/g;
    s/<email>/<br>email &/g;
    s/<title type=.text.>/<br><br>&/g;
    s/<openSearch:totalResults>/<br>total results &/g;
    s/<openSearch:startIndex>/<br>start index &/g;
    s/<openSearch:itemsPerPage>/<br>items per page &/g;
    s/<updated>/<br>updated &/g;
    s/<thr:total>/<br>thr:total &/g;
    s/<\/feed>/&<br><br><br>/;
    s/^M*/<br>/;
   ' y.htm \
   |exec tr -cd '\12\40-\176'

It's almost like you asked that question to bait someone into giving you an excuse to show "just how easy it is" with some CLI hacking.
Just 3 easy steps
Wix. Everything with Wix.[1]
I looked at some exemplary Wix sites since I rarely ever encounter these.
The ones they identify as "stunning" seem to be mostly devoid of text and are mainly images and videos.
As others in this thread have noted is common on today's web, they use json endpoints. Some sites also have an optional XML feed activated at /feed.xml.
Viewing the images and videos is possible without any use of Javascript. Simply extract the image and video urls and their descriptions from the json.
Below I only demonstrate how to extract the text, as html. One could view this html in a browser, without Javascript.
Random wix site: www.cricketcanadakids.com
Random page from the site: coach-profiles.

curl -o x.htm https://www.cricketcanadakids.com

# produce a listing of the page titles and json endpoint urls
# in addition to a title, each page has a "pageID"
# beginning on 3rd line is CSV e.g. for database:
# ID, title1, title2, url
# last line of CSV is "master" json
x=https://static.wixstatic.com/sites/;
exec awk '{gsub(/:{/,/"\n"{/);print}' \
|exec sed -n '
 s/{/\
&/g;
 s/\],/\
&/;
 /<title>/p;
 /mainPage/p;
' \
|exec sed '
 /var publicModel/d;
 /.*\"baseUrl\"/d;
 /(unknown)/d;
 s/<title>//;s/<\/title>//;
 /ExternalBaseUrl/{s/.*Url\":\"//;s/\".*/\
\
\"ID\",\"title1\",\"title2\",\"url\"/;};
 s/\\//g;
 s/{\"pageId\":\"//;
 3!s/\",\"/,/g;
 s/title\"://;
 s/,pageUriSEO//;
 s/:\"/,/;
 s>pageJsonFileName\":\">'"$x"'>;
 s/\"}.*/.z/;
 s/\],\"mainPageId\",//;
 s>masterPageJsonFileName\":\">master,master,'"$x"'>;
 /topology\":\[/s/json.*/json.z/;
' x.htm

# JSON url taken from CSV above
curl -4o y.json https://static.wixstatic.com/sites/7f1cbe_7131eb80aa297a10c02a08c8ffbc3ef6_122.json.z

# produce simple html version of page
# text-only, no images or videos
x=$(exec echo b|exec tr b '\34');
exec tr -cd '\12\40-\176' < y.json \
|exec sed 's/{/'"$x"'&/g;s/ *//;' \
|exec tr -d '\12' \
|exec tr '\34' '\12' \
|exec sed '1d;$d' \
|exec sed '
 s/</\
&/g;
' \
|exec sed 's/\\["tn]//g;s/\",\".*//;/{/d;'

Improve the CSV:
Probably would be better to change "jsonurl" to "jsonfile" and just save the filename. This file may be served from several locations, and its URL may change over time.

# add to end
 \
|exec tr '\15' '\12' \
|exec awk 'NR==1{a=$0}NR==3{b=$0}NR==5{print}NR>=6{print "\42"a"\42,"b","$0}'

# update CSV header line
\"sitetitle\",\"url\",\"pageID\",\"pagetitle1\",\"pagetitle2\",\"jsonurl\"/;};
I know of a site where all the data is embedded inline ("var Person = {firstName: 'Fred', lastName: 'Smith' }") then rendered to HTML with JS. No endpoints, Javascript not JSON, complete pain in the ass to extract without rendering in a headless browser first.
Isn't this better? Why do you need to render it when they're basically giving you the data as JSON (you might need to quote the keys for strict parsers)? Just extract that.
The short answer is that, yes, the only parser available to me is that strict. If I'm going to jump languages, I might as well go the whole hog and jump to Puppeteer.
Feedly, Netflix, Google Maps, as of this article:
I remember that article.
I do not use any of those three websites so I would be curious to know some specific examples of what the desired structured data looks like.
If for Netflix the desired data is only DVD summaries, I recall that these were easily retrievable without Javascript not too long ago.
I would argue that it's reasonable for these examples to require JS. My complaint above was about the accessibility of text-only sites.
I would agree. That article had a very poor choice of websites, IMO. The web is quite usable without Javascript IME.
> Can you provide some example webpages so we can take a look?
learnopengl.com
curl -d 'content_name=Heading' https://example.com/content_load.php?
curl -d 'content_name=Heading/Subheading' https://example.com/content_load.php?
s/example/learnopengl/ s/Heading/Introduction/
You'd need to parse and run the JavaScript, including a virtual DOM. You'd also need to support any browser events that the page depends on. It's conceptually simple, and getting support for 80-95% of cases is probably easily doable, but I imagine a long tail of cases where your browser emulator isn't quite compatible in an important way. I know Google and others do this to index the web, but I'm not sure about any existing open source projects.
I released Page.REST (https://page.rest#prerender) a couple of weeks ago. It will pre-render JS-based sites, so you can then extract rendered content using CSS selectors.
Ditto: https://www.prerender.cloud/docs/api
// URL to screenshot
service.prerender.cloud/screenshot/https://www.google.com/

// URL to pdf
service.prerender.cloud/pdf/https://www.google.com/

// URL to html (prerender)
service.prerender.cloud/https://www.google.com/
You could look into PhantomJS. Also effectively a headless browser, but it's scriptable.
I started with that. Last time I looked, the maintainer had stepped down because of Chrome Headless.
And it's not a "scraping framework" any more than Chrome Headless is.
edit: Headless Chrome has very rich scripting integration with e.g. https://github.com/webfolderio/cdp4j
Yes, this would be capable of waiting, but it doesn't have the abstraction built in, so it looks like it would have the same problem I often run into with scrapers: you go through the page to get links to follow / markup to act on, and then have to wait a reasonable amount of time or catch some browser-based event to determine whether you need to look at the content one more time before continuing.
Borrowing an argument from an article talking about the speed of Python: focus on what your bottleneck is. If you're worried about the performance of your tool, make sure it's not actually waiting for something else (IO, network, scheduling, ...).
Here, unless you're parsing a large amount of already-downloaded files (a website dump, [re]parsing of a long-standing archive, etc.), you're not going to get a huge benefit from using a fast parser, because the network is going to be the limiting factor.
Keep that in mind.
I totally agree! I like Go, but this is not a field I would ever use it in. Parsing the sites will never be the bottleneck; following the HTML changes or making it work for multiple sites is. When it takes maybe even seconds to download a page, a couple of milliseconds' performance advantage for Go doesn't matter at all.
Remember to setup DNS caching on the box or use something like https://github.com/viki-org/dnscache.
Also, there doesn't seem to be any checking when reading the response body. You want to limit the read length.
The HTML parsing part appears to be "golang.org/x/net/html". Does anyone have experience parsing "real world" html with this? How does it do?
It's an HTML5-compliant parser. That means that, modulo bugs, it should produce the same results as any other modern HTML parser, which should also all be based on HTML5.
For context, since I think a lot of people are still unaware of this, the HTML5 standard precisely specifies how HTML should be parsed: https://www.w3.org/TR/html5/syntax.html#parsing This is based on a survey of how the various browsers were handling it in reality, so it's not just one of those theoretical things that everybody ignores; it's an algorithm extracted from the brutal pragmatism of many separate code bases over many years. In theory, all HTML parsing libraries should now be able to take the same input and produce the same DOM nodes. In practice I've not used a variety of such libraries, nor have I fed them much pathological input, so I can't vouch that this is 100% true in practice, but in theory there should no longer be any significant differences between HTML parsers in various languages, as they come on board with HTML5 compliance.
I've used it to parse a lot of broken HTML for a very large company.
I have not used it to attempt to parse all the HTML on the web.
My impression is that it's pretty good.
Why are scrapers and scraping so popular? What is a real use case for it?
Most businesses I've worked for/with had one reason or another to scrape data from various places. Often it's watching competitors, but can also be part of an automation of some process that is already happening manually.
Not seeing ads.
Secondarily, there is a lot of data on the internet stored only in HTML pages. For data with multiple sources, HTML is still usually a common format. HTML just has more punctuation and errata to filter out than JSON, XML, or CSV.
So you want the data but you don't want to ask for it?
If I come across data I want/need, why should I have to highlight, copy, and read from my clipboard? It's much more elegant to have automation do this.
Accessing information that isn't available via an API. For example, much information on HN is only available via scraping. Many banks can't afford to rebuild their backend (probably largely due to the liability/compliance costs) to support an API/client model, but they still want an application; such applications are pretty much required to scrape said banks' HTML.
As an example:
I listen to the radio a lot and have always wanted to make a website with radio schedules like there is for TV. As this data is not available (at least not in France), I scrape [1] each radio website every day to get it.
[1] Using https://github.com/rchipka/node-osmosis for now
OnHtml callbacks? Not a big fan of callbacks when you have channels.
Channels require concurrency; you have to spin up another goroutine and take care not to let or deadlock. Channels are for communication between goroutines, not for general abstraction.
I agree. Callbacks make sense in this context and are the idiomatic way to write the code, given that's the approach used in the standard library. e.g.:
*Take care not to leak or deadlock
Interesting idea, how do you imagine a channel based API for this?
I would ignore the GP's advice. Channels are prone to big errors -- panics and blocking -- which aren't detectable at compile time. They make sense to use internally but shouldn't be exposed in a public API. As one example, notice how the standard library's net/http package doesn't require you to use channels, but it uses them internally.
Would this work?
c := colly.NewCollector()

// this function creates a goroutine and returns a channel
ch := c.HTML("a")

e := <-ch
link := e.Attr("href")
// ...

I'm a bit rusty (ah!) with Go, so bear with me if the above contains errors.

How do you recognize that the collector has finished? If the site doesn't contain "a" elements (e.g. because of a network error), this example would block forever.
The producer closes the channel. This is differentiable from an open empty channel in a select.
Makes sense, thanks =)
In the above example this would require `nil` checking of the retrieved value every time. I'm not sure if it would make the API cleaner
This would work. No callback hell, a pleasure for the eyes!
Is there any reason to use this instead of Puppeteer? It feels like Puppeteer is going to dominate this space unless another browser vendor makes their own framework.