Portia, an open-source visual web scraper

blog.scrapinghub.com

367 points by pablohoffman 12 years ago · 69 comments

dabeeeenster 12 years ago

The problem with these sorts of solutions is that they work perfectly for 'simple' sites like The Register, but fail hard with 'modern' sites like ASOS.com. I just tried ASOS and the web front end failed to request a product page correctly...

All the dynamic JS and whatnot just plays havoc with these projects. In my experience you have to run through webdriver or something like phantomjs and parse the JS...

  • alttab 12 years ago

    There are multiple internal tools I use at work (JIRA, our ticketing system, our code review tool) that won't work because of this issue.

    In the meantime, I've written Tampermonkey scripts that will scrape and embed multiple pages all hack-like, but at least I get a good CSV of the data I need.

    To me, the value in this tool is the user interface for creating the scrape logic. If this ran as an embeddable JS app that you could place inside any page and utilize local storage, you could scrape these dynamic sites by viewing the page first, and still get all of the cool gadgetry provided by this tool.

    In essence, the value of this tool could be built as a bookmarklet. THAT SIR - I would use every, single, day.

    • shaneofalltrad 12 years ago

      Great idea on the bookmarklet. I could see a tool for building custom readers with clippings from various sites. Say I want to organize JavaScript array patterns and ideas. Throw in a way to clip parts of my PDF books into this "reader" and you have an amazing product worth millions.

      • notastartup 12 years ago

        Can you explain this more, how do you see this being operated? See a pdf, clip it, create your own reader with your own clips?

    • stedaniels 12 years ago

      Why would you scrape JIRA when they have a perfectly workable API?

      • lepht 12 years ago

        While their APIs are nice, they require separate permissions and having access to them isn't always a given depending on the company that runs/owns the Jira instance.

        • alttab 12 years ago

          This. I also would rather just grab it in the browser instead of having to run a server or something else.

    • notastartup 12 years ago

      how would a bookmarklet be able to crawl & scrape a website?

  • CHsurfer 12 years ago

    At first, this seems correct. It's definitely easier to get scraping with something like Capybara and a suitable JS-enabled driver, but in my experience, this solution is less reliable. Async-loaded data can time out, and don't get me started on the difficulties of running the scraper with cron jobs. In the end, I migrated even my JS-heavy pages to Mechanize-based solutions. It takes a few extra requests to get the async data, but once you get that figured out, it's rock solid - till they update the site design ;-)
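
    For what it's worth, the "few extra requests" approach boils down to something like this rough Python sketch (using requests rather than Mechanize; the URLs and endpoint are made up): the session carries whatever cookies the first request sets, and the async data comes from the endpoint the page's JS would have called.

        # Hypothetical example: fetch the HTML page first so the session
        # picks up cookies, then hit the async endpoint directly.
        import requests

        session = requests.Session()
        session.get("http://example.com/products/123")
        data = session.get(
            "http://example.com/api/products/123/offers",
            headers={"Referer": "http://example.com/products/123"},
        ).json()
        print(data)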

  • yaph 12 years ago

    Use the tools suitable for the task. There are a lot of those "simple" sites and I'm pretty sure a lot of people will stick to those "dated" methods of building sites, because search traffic still matters.

  • johndavi 12 years ago

    The long tail is tough, but rules are useful when you only need to work with a small number of sites. And assuming, as you point out, less "modern" sites. (News sites tend to be mostly consistently manageable but, yes, smaller e-commerce players tend to adopt more modern techniques -- as befitting fashion-forward product lines, naturally).

    Our (Diffbot) approach is to learn what news and product (and other) pages look like, and obviate the rules-management -- we also fully execute JS when rendering.

    The web keeps evolving though, dang it. Tricky thing!

    • lsh 12 years ago

      Unfortunately Diffbot is not open source. Are you planning any F/OSS offerings?

  • CMCDragonkai 12 years ago

    I built SnapSearch for JS/SPA sites that need SEO. But it works for scraping as well. https://snapsearch.io/ You can try the demo. I tried it with "http://www.asos.com/" and it worked properly. Note that empty content actually means that the webserver returned no body content. The real API will return the headers as well as the body.

    It works via Firefox, and it's load balanced and multithreaded. It takes care of all the thorny issues regarding async content... etc.

  • agumonkey 12 years ago

    It also depends on a coherent structure in HTML websites.

    Domains running websites which are more like JavaScript frontend modules shouldn't be scraped at all; that screams for a public API.

    • uptown 12 years ago

      "it screams for a public API"

      But many content owners would never provide their data in this format, even if doing so would be trivial.

    • CMCDragonkai 12 years ago

      Try using https://snapsearch.io/ It is designed for JS sites.

    • jdavis703 12 years ago

      These single-page sites do have a public, albeit undocumented, API. If you analyze the network requests via the dev tools in your browser, you'll find an XML/JSON data source that is probably structured better than the markup.
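
      For illustration, consuming such an endpoint is usually a couple of lines of Python (the URL, parameters and JSON shape here are invented):

          # Hypothetical XHR endpoint spotted in the browser's network tab.
          import requests

          resp = requests.get(
              "http://www.example-shop.com/api/search",
              params={"q": "shoes", "page": 1},
              headers={"X-Requested-With": "XMLHttpRequest"},
          )
          for item in resp.json()["results"]:
              print(item["name"], item["price"])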

  • egb 12 years ago

    Anybody know of any tools that would work with JS-rendered sites, and not have to "parse the JS"?

    • egb 12 years ago

      Answering my own question:

      CasperJS is an open source navigation scripting & testing utility written in Javascript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). It eases the process of defining a full navigation scenario and provides useful high-level functions, methods & syntactic sugar for doing common tasks such as:

          defining & ordering browsing navigation steps
          filling & submitting forms
          clicking & following links
          capturing screenshots of a page (or part of it)
          testing remote DOM
          logging events
          downloading resources, including binary ones
          writing functional test suites, saving results as JUnit XML
          scraping Web contents
    • CMCDragonkai 12 years ago

      I recently created a service designed to make JS sites crawlable by search engines and other robots. However it works for scraping as well. Try the demo: https://snapsearch.io/

    • checker659 12 years ago

      PhantomJS?

      • e1g 12 years ago

        >Is it also a webscraper that can pull data out of a page for me?

        No, Phantom will only recreate the page as it would look to a human user (i.e. after all the JavaScript is parsed and executed). It will not help you parse or slice the page - you would have to do that part programmatically with other DOM-parsing tools.
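
        A rough sketch of that split, assuming Python with Selenium's (since deprecated) PhantomJS driver for the rendering and lxml for the slicing; the URL and selector are made up:

            import lxml.html
            from selenium import webdriver

            driver = webdriver.PhantomJS()
            driver.get("http://www.example.com/products")
            # page_source is the DOM after the JavaScript has run
            doc = lxml.html.fromstring(driver.page_source)
            names = doc.xpath("//div[@class='product']/h2/text()")
            driver.quit()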

      • egb 12 years ago

        "PhantomJS is a headless WebKit scriptable with a JavaScript API", so it's a browser.

        Is it also a webscraper that can pull data out of a page for me?

        • nols 12 years ago

          I've heard of people using PhantomJS with CasperJS to scrape, not sure if it can be done solely with PhantomJS.

          • vaviloff 12 years ago

            CasperJS is a higher-level wrapper for PhantomJS, so - yes, it could be done with PhantomJS alone... But you wouldn't want to, because CasperJS makes automation easier.

            • uptown 12 years ago

              Are there any libraries to facilitate database-connectivity to SQL Server or MySQL from javascript? I've used CasperJS to scrape some sites, but always fall back on post-processing the scraped data with another program in order to get it into my database. I'd love to be able to do it all from one piece of code.

              • gee_totes 12 years ago

                You could always have your CasperJS scraping script make an AJAX request to a RESTful API for your MySQL DB. You won't be doing it all from one piece of code, but you'll be doing about 90% of it.
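
                A minimal sketch of the receiving side, assuming Flask and PyMySQL (the table and fields are invented):

                    import pymysql
                    from flask import Flask, request, jsonify

                    app = Flask(__name__)

                    @app.route("/items", methods=["POST"])
                    def add_item():
                        item = request.get_json()
                        conn = pymysql.connect(host="localhost", user="scraper",
                                               password="secret", database="scraped")
                        with conn.cursor() as cur:
                            cur.execute("INSERT INTO items (name, price) VALUES (%s, %s)",
                                        (item["name"], item["price"]))
                        conn.commit()
                        conn.close()
                        return jsonify(ok=True)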

      • lost_my_pwd 12 years ago

        SlimerJS too

bsilvereagle 12 years ago

I expected an April Fool's joke and found something pleasantly awesome and useful instead.

Source is here: https://github.com/scrapinghub/portia

climatewarrior2 12 years ago

I've used Scrapy and it is the easiest and most powerful scraping tool I've used. This is so awesome. Since it is based on Scrapy, I guess it should be possible to do the basic stuff with this tool and then take care of the nastier details directly in the code. I'll try it for my next scraping project.
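
Roughly, the "directly in the code" part with plain Scrapy looks like the sketch below (the spider name, URL and selectors are invented):

    import scrapy

    class ProductsSpider(scrapy.Spider):
        name = "products"
        start_urls = ["http://www.example.com/products"]

        def parse(self, response):
            for product in response.css("div.product"):
                yield {
                    "name": product.css("h2::text").get(),
                    "price": product.css("span.price::text").get(),
                }
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)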

kh_hk 12 years ago

I like that there are people working to make scraping easier and friendlier for everyone. Sadly (IMHO), the sites where these tools will probably fail are at the same time the ones least open to providing their data directly. Most scraper-unfriendly sites make you request another page first to capture a cookie, set cookies or a Referer entry on the request headers, or use regex magic to manually extract information from JavaScript code in the HTML. I guess it's just a matter of time before one of these tools provides such methods, though.

For my project I write all the scrapers manually (that is, in Python, using requests and the amazing lxml), because there's always one source that will make you build all the architecture around it. Something I find is needed for public APIs is a domain-specific language that works around building intermediate servers by explaining to the engine how to understand a data source:

An API producer wants to keep serving the data themselves (traffic, context and statistics), but a consumer wants a standard way of accessing more than one source (let's say, 140 different sources). Instead of making an intermediate service that provides this standardized version, one could provide templates that a client module would use to understand the data under the same abstraction.

The data consumer would access the source server directly, and the producer would not need to ban over 9000 different scrapers. Of course, this would only make sense for public APIs; (real) scraping should never be done on the client: it is slow, crash-prone and can breach security on the device.
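
For reference, the manual requests + lxml approach mentioned above is short enough to sketch here (the URL and XPath expressions are placeholders):

    import lxml.html
    import requests

    resp = requests.get("http://example.com/listings")
    doc = lxml.html.fromstring(resp.content)
    rows = [
        {
            "title": row.xpath("string(.//h2)").strip(),
            "link": row.xpath(".//a/@href")[0],
        }
        for row in doc.xpath("//div[@class='listing']")
    ]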

  • lifeisstillgood 12 years ago

    Surely there are difficulties in expecting data providers to produce their data in standard formats across industries and countries? I am naive as to how much and what data is available, but that seems a stretch.

    • kh_hk 12 years ago

      If interested, take a look at my project on unifying bike-sharing network data. Besides providing a public API, we are also providing a Python library that accesses and abstracts different sources under the same model [1, 2].

      There are a lot of accessible sources (though not documented), but there are also clear examples of how one should never provide a service! Some examples: [3, 4]

      What I was referring to, though, was a way to avoid having to build an intermediate server that scrapes services which are perfectly usable (JSON, XML), just because we (all) prefer to build clients that understand one standard type of feed.

      Maybe it's not about designing a language, but just about a new way of doing things. Let's say I provide the client with clear instructions on how to use a service (its format, and where the fields that the client understands are located, in an XPath-like syntax).

      That should be enough to avoid periodically scraping well-behaved servers, while still being able to build client apps without having to implement all the differences between feeds. Besides, it would avoid getting banned for accessing a service too many times, and would give data providers insight into who is really using their data.

      Let's say we want to unify the data in Feed A and Feed B. The model is about foos and bars:

          Feed A:
          {
            "status": "ok",
            "foobars": [
              {
                "name": "Foo",
                "bar": "Baz"
              }, ...
            ]
          }
      
          Feed B
          [{"n": "foo","info": {"b": "baz"}},...]
      
          We could provide:
          {
            "feeds": [
              {
                "name": "Feed A",
                "url": "http://feed.a",
                "format": "json",
                "fields": {
                  "name": "/foobars//name",
                  "bar": "/foobars//bar"
                }
              },
              {
                "name": "Feed B",
                "url": "http://feed.b",
                "format": "json",
                "fields": {
                  "name": "//n",
                  "bar": "//info/b"
                }
              }
            ]
          }
          Instead of providing a service ourselves that accesses Feed A and Feed B
          every minute just because we want to ease things on the client.
      
      Not sure if that's what you asked, though.

      [1]: http://citybik.es

      [2]: http://github.com/eskerda/pybikes

      [3]: https://github.com/eskerda/PyBikes/blob/experimental/pybikes...

      [4]: https://github.com/eskerda/PyBikes/blob/experimental/pybikes...
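
      For concreteness, a toy Python sketch of the client module consuming such templates (it cheats and uses dotted JSON paths instead of the XPath-like syntax above; every name and URL here is illustrative):

          import requests

          FEEDS = [
              {"url": "http://feed.a", "root": "foobars",
               "fields": {"name": "name", "bar": "bar"}},
              {"url": "http://feed.b", "root": None,
               "fields": {"name": "n", "bar": "info.b"}},
          ]

          def dig(record, path):
              # walk a nested dict using a dotted path like "info.b"
              for key in path.split("."):
                  record = record[key]
              return record

          def load(feed):
              data = requests.get(feed["url"]).json()
              items = data[feed["root"]] if feed["root"] else data
              return [{field: dig(item, path)
                       for field, path in feed["fields"].items()}
                      for item in items]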

      • lifeisstillgood 12 years ago

        Ok, different feeds, same domain; unifying the model is feasible, either as an intermediary or as a client "template thing".

        Thank you - makes sense. I was thinking different data feeds, different domains.

compare 12 years ago

Cool tool for developers, but since this one is open source, I think it opens up even more interesting possibilities for these tools to be integrated into consumer apps. Curation is the next big trend, right? I think I'll give that a try.

anilshanbhag 12 years ago

I just took it for a test drive and it was an absolute pleasure. I tried to scrape all job listings at https://hasjob.co hoping to find trends.

There is one small pain: the output is printed to the console, and piping the output to a file isn't working. But it did fetch all the pages and printed nice JSON.

UPDATE: there is a logfile setting to dump the output to a file.

emilsedgh 12 years ago

I have a project which includes a huge list of websites that must be scraped heavily. My question is... Are these kinds of tools suitable for 'heavy lifting', scraping hundreds of thousands of pages?

jstoiko 12 years ago

Can anyone give a real-life example where this visual tool would be useful? Not that I don't believe in scraping (we do it too: https://github.com/brandicted/scrapy-webdriver). I know Google has a similar tool called Data Highlighter (in Google Webmaster Tools) which is used by non-technical webmasters to tell Google's bot where to find the structured data in the page source of a website. It makes sense at Google's scale; however, I fail to see in which other cases this would be useful considering the drawbacks: some pages may have a different structure, JavaScript is not always properly loaded, etc., therefore requiring the intervention of a technical person...

ashwing_2005 12 years ago

This is great. However, I have one bone to pick (or rather, to know if it has been taken care of). Scrapy uses XPaths or equivalent representations to scrape, but there are many alternate XPaths that represent the same div. For example, suppose data is to be extracted from the fifth div in a sequence of divs, so it would use that position as the XPath. But now say the div also has a meaningful class or id attribute. An XPath based on this attribute might be a better choice, because this content may not be in the fifth div across all the pages of a site I want to scrape. Is this taken care of by taking the common denominator from many sample pages?
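
To make the concern concrete, here is a small lxml illustration of the two choices (the HTML is made up); the positional XPath breaks as soon as a div is added or removed, while the attribute-based one keeps working:

    import lxml.html

    page = lxml.html.fromstring("""
        <html><body>
          <div>ad</div> <div>nav</div> <div>promo</div> <div>related</div>
          <div class="article-body">The content we actually want.</div>
        </body></html>
    """)
    by_position = page.xpath("//div[5]/text()")
    by_attribute = page.xpath("//div[@class='article-body']/text()")
    print(by_position, by_attribute)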

  • kmike84 12 years ago

    Portia uses https://github.com/scrapy/scrapely library for data extraction. It doesn't use XPaths for learning. There are some links to papers in scrapely README; scrapely is largely based on ideas from these papers, but there are many improvements. In short - yes, this is taken care of.
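
    The scrapely workflow is roughly the usage example from its README: train on one annotated page, then extract from similar pages (the exact URLs and fields may differ):

        from scrapely import Scraper

        s = Scraper()
        s.train("http://pypi.python.org/pypi/w3lib/1.1",
                {"name": "w3lib 1.1", "author": "Scrapy project"})
        print(s.scrape("http://pypi.python.org/pypi/Django/1.3"))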

esolyt 12 years ago

Excellent. But the example presented in the video (scraping news articles) is actually a case better solved with other technologies.

I imagine this will be useful when scraping sites like IMDB in case they don't have an API or their API is not useful enough.

kelvin0 12 years ago

Although this is cool, the ultimate scraper would probably need to be somehow embedded in a browser and be able to access the JS engine and DOM. Embedded as a plugin, or some other extension depending on the browser.

oblio 12 years ago

Totally off topic, but what's the name of the song in the video? :)

rpedela 12 years ago

From the video, I noticed that the HTML tags were also scraped in the large article text. Is there some way to remove those automatically? Or perform further processing?
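
One way to post-process the extracted field, outside of Portia itself, is to strip the markup afterwards; a small sketch using w3lib (a Scrapy dependency) or lxml:

    import lxml.html
    from w3lib.html import remove_tags

    raw = "<p>First paragraph with a <a href='/x'>link</a>.</p><p>Second.</p>"
    print(remove_tags(raw))
    print(lxml.html.fromstring(raw).text_content())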

alttab 12 years ago

This is cool. Can I use it locally on internal sites too?

  • antocv 12 years ago

    Why wouldn't you??

    • tabel 12 years ago

      I believe the GUI is run locally, but if it were run as a web application from the developer's site, it would only be able to scrape sites accessible from the public internet.

th0ma5 12 years ago

Outside of this tool, or a tool that uses a scripted browser, another option could be Sikuli in a VM.

beernutz 12 years ago

I really dig these scrapers, but most of them seem to only work well for simple sites as someone has already noted.

Just want to point out a (commercial but reasonable) program that really works well for all our odd edge case customer site issues.

http://www.visualwebripper.com

viana007 12 years ago

This solution is reminiscent of PyQuery, but with a visual interface.

kclay 12 years ago

Love this, been using Scrapy for all my scraping needs.

rpedela 12 years ago

Is there a live demo available?

taskstrike 12 years ago

Import.io, Kimono Labs, and now this. Web scraper -> data area is heating up.

  • frabcus 12 years ago

    It's always pretty hot! Some are still around - like Kapow, Connotate, Mozenda, ScraperWiki (which I run). Some are not - Needlebase.

    Portia is more interesting because it is an open source scraping GUI - the GUIs tend to be very proprietary.

  • notastartup 12 years ago

    I also wrote http://scrape.ly, which lets you write web scrapers via a URL syntax.

notastartup 12 years ago

Here's an open source web scraping GUI I wrote a while back https://github.com/jjk3/scrape-it-screen-scraper

I'm still integrating the browser engine which I was able to procure for open source purposes.

The video is quite old.
