Show HN: Jam API, turn any site into a JSON api using CSS selectors
jamapi.xyz
A little word of warning/encouragement. I did something similar a long time ago (JSONDuit), which got posted to HN by someone else.
You will probably run into a healthy mix of "that's cool" / "I did that before you!" / "but how will it make money?". Ignore it and do your thing. If you figure out how to monetize it, great! Even if you don't or if you have no desire to, you will have learned and grown during the course of the project. That is invaluable.
Have fun and screw the haters...
I find this "shoot first (write code), ask questions later" attitude both admirable and a bit worrisome at the same time. Nothing against people learning stuff, but why does it have to be promoted this way? The lack of humility is what gets me.
Maybe I'm just jealous or something, but it rubs me the wrong way.
I think it's less about promotion and more about feedback. "Here's what I've built, what do you think?"
We're meant to be an inclusive community of smart people. The idea is we'll encourage the poster and offer constructive criticism (or praise).
If the post is useful to no one, it simply won't get discussed or upvoted. When something does, it's validated as an idea, or as something of interest.
"Just because you could, doesn't mean you should" - that phrase should have been applied to both writing the software and posting about it here.
Plenty of mediocre stuff gets to the frontpage and plenty of gems fall through the cracks.
Perhaps your view of what's "mediocre" and what's a "gem" is not consistent with the views of Hacker News readers at large
Have you ever visited /newest?
> "How will it make money"
I was trying to implement a Firefox add-on that navigates the web based on speech for my senior project (the theme was assistive technology). The major blocker with developing any tool to help navigate web pages, despite some effort in ARIA, is that there are so many actions one cannot parse or perform without writing custom code for every single website. E.g. how can you tell where the login button is? What to click? Frankly, if you look at Gmail, the DOM is a huge compressed mess (names all rewritten; without gmail.js I wouldn't have been able to get Gmail working in my project). If every website exposed a standard set of APIs, that would reduce the barrier by a good percentage. So think of combining HATEOAS from REST with this: here we turn things into JSON with href, allowing a client to navigate, sort of a first step to making a website more "web client" compatible... funny, isn't it?
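A hypothetical sketch of what that could look like: the page exposed as JSON where each actionable element carries an href the client can follow, HATEOAS-style (the field names here are made up for illustration):
  {
    "title": "Example Mail",
    "actions": [
      { "rel": "login",   "elem": "#login-button", "href": "/login" },
      { "rel": "compose", "elem": ".compose",      "href": "/mail/new" }
    ]
  }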
It's not a totally original concept. Screen-scraping has been around for a while - essentially what this solves. I did mine for Ajax:
  this.get(html, selector, function(s){
    var es = new DOMParser().parseFromString(html, 'text/html').querySelectorAll(selector);
    return [].slice.call(es).map(function(n){ return n.innerText });
  });
It's not a product per se, but the combination of data and view is one of the unfortunate aspects of the web that (sorry) won't get fixed - not everyone will build JSON APIs. And, hate away, but HTML & JS are here for a long time to come. The need is very real and this would be a critical part of a scraper or IFTTT-like service - plumbing - if not a product you sell outright to end users.
>> It's not a totally original concept. Screen-scraping has been around for a while
This is basically a subset of "I did that before you!"
dpweb:
- mentions the term that this concept falls under (nowhere on the OP's page, so he may not know that there is an entire set of software, plugins, etc. that does this)
- provides one alternative implementation
- adds commentary related to why such services are necessary, and that they should be able to be monetized
So yes, he starts off with something along the lines of "I did that before you!", but he doesn't use a condescending phrase, and he provides additional useful information.
I think wanting to know how it will be monetized / support itself is a reasonable question. If there isn't an answer, you know not to build something on top of it expecting it to last.
It's an open source project, so maybe it's not meant to make money?
This is a fantastic idea and I'm really surprised nothing like this has existed before, it seems like such a no-brainer. Great work.
https://github.com/fizx/parsley/wiki looks pretty similar.
Running this sort of thing as a service/api never panned out for us because you are almost universally robots.txt denied and/or blocked.
We briefly tried, and supported a wiki of json extraction scripts at parselets.org, but it went nowhere after a few months.
I built something almost identical in 2011. It really doesn't have as much utility in practice as you think initially. CSS selectors are an interesting idea for extracting data from pages, but it's extremely fragile. You have to either parse the page's raw html using something like jsdom, or you run it through a headless browser like Phantom. In the first case, it completely fails for any modern SPA (angular, react, etc). In the second case, phantom is painfully slow and difficult to interact with, and often doesn't run/render an SPA as a regular browser does.
You can write tests around whether your selectors are returning data, but even simple refactors from a dev team quickly break your selector profiles multiple times a week or month.
Just wasn't worth the hassle.
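For reference, a minimal sketch of the first approach mentioned above (parsing the raw HTML with jsdom and querying it with CSS selectors); the URL and selector are only examples, and as noted this only sees the server-delivered HTML, not anything an SPA renders client-side:
  // Assumes the jsdom npm package.
  const { JSDOM } = require('jsdom');
  JSDOM.fromURL('https://news.ycombinator.com/')      // fetch and parse the page
    .then(dom => {
      const links = dom.window.document.querySelectorAll('.title a');
      Array.from(links).forEach(a => console.log(a.textContent));
    })
    .catch(console.error);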
There are some solutions for running an SPA in a real browser even in a headless environment.
The trick is to emulate X11 with xvfb and control the browser with Selenium WebDriver.
Phantom isn't the only choice, just the one most people talk about.
As for non-JS-heavy websites, it's fairly trivial to find a library that will parse the DOM for you; pretty much every language has one.
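A minimal sketch of that setup, using the selenium-webdriver npm package with the script run under xvfb (e.g. via xvfb-run); the URL and selector are only examples:
  const { Builder, By } = require('selenium-webdriver');
  (async () => {
    // Start a real Firefox; on a headless box, launch this script with:
    //   xvfb-run node scrape.js
    const driver = await new Builder().forBrowser('firefox').build();
    try {
      await driver.get('https://news.ycombinator.com/');
      const links = await driver.findElements(By.css('.title a'));
      for (const link of links) {
        console.log(await link.getText());   // text after client-side rendering
      }
    } finally {
      await driver.quit();
    }
  })();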
Done it also, to scrape HN in cli :p
Done it as well. At the time, it was specialized to organize web comics (way before Google Reader times).
The real issue is that popularity will get you blocked fairly quickly. See also: Yahoo Pipes.
There was a YC company a few years back, acquired by Palantir in February, that did something very similar: https://www.kimonolabs.com/
Well, https://www.apifier.com does essentially the same thing, plus it supports JavaScript, can crawl through the whole website etc.
Disclaimer: I'm a cofounder there
So does https://PhantomJsCloud.com, but only for single pages, no site crawling.
Disclaimer: I'm the founder there ;)
Apifier looks actually awesome.
Using a 3rd party site to query HTML (which you should be able to do yourself, plenty of tools for that) isn't a fantastic idea.
This one, for example http://www.videlibri.de/xidel.html#examples
The code is on GitHub; you can use this as a library, not as a third-party SaaS.
kimono labs used to do something similar, but shut down recently. They had a nice clicky pointy interface that allowed you to build the selectors by clicking on elements, with an immediate preview. They also handled things like pagination etc.
> I'm really surprised nothing like this has existed before
But how would you monetize it?
Unlike an RSS feed, you really don't know how the JSON response would be used, so you can't inject ads into it.
And if you charge for it, wouldn't people assume it would continue to work? But site "scrapers", regardless of how they are configured, are likely to break, so it would be tough having customers pay for something that could break at any time, leaving them to figure out whether it's the service that changed/broke or the page that changed.
Don't get me wrong- some great businesses have been/are based on "scraping" in one way or another. However, as cool as this is, it's just another way to "scrape". If the person hosting the page would provide an API or JSON view instead, you'd be loads better off.
Freemium, professional support, expanding it into an abstraction layer above the APIs for multiple services, selling a version that larger companies can run on their own servers which they might need for data security...
In any case, not everything has to be monetised.
>However, as cool as this is, it's just another way to "scrape"
Isn't that the point? The demo seems like it'd be a lot easier, less verbose, and probably less brittle than using curl/XPath or otherwise parsing the HTML yourself.
We launched WrapAPI (https://wrapapi.com/) a few weeks ago with the same functionality, but with a somewhat more complex and powerful setup process. You can not only specify CSS selectors yourself but also define them by pointing and clicking.
The barrier for starting with JamAPI is impressively low, though! Kudos on the developer-friendly user interface.
I put this similar project[0] together a while ago. Almost the same concept, but I skipped the json layer altogether as I just wanted a quick way of getting nuggets of content from webpages into my terminal.
For example:
curl https://news.ycombinator.com/news | tq -tj ".title a"
0. https://github.com/plainas/tq
That's awesome. Like jq but for html.
with curl:
$ curl -d url=https://news.ycombinator.com/ -d json_data='{"title":"title"}' http://www.jamapi.xyz/
=> {
"title": "Hacker News"
}
Also, the Ruby example appears to post to the wrong URL?
Ah, yep, you're right, forgot to change the URL. Updated now. Thanks for letting me know.
And to get the HN post titles:
curl -d url=https://news.ycombinator.com/ -d json_data='{"title":[{"elem":".title > a","value":"text"}]}' http://www.jamapi.xyz/
This is cool :)
EDIT:
Incidentally, you don't really need to have that "index" key inside the values of an array, because in an array the order is preserved anyway. Unless I've misunderstood what it means?
Titles and links grouped together:
curl -X POST http://www.jamapi.xyz/ -d url=http://news.ycombinator.com -d json_data='{"title": "title","paragraphs": [{ "elem": "td.title a", "value": "text", "location": "href"}]}'
Use the http URL to call www.jamapi.xyz; calling it over https I get the error SSL_ERROR_BAD_CERT_DOMAIN.
Regarding the "index" key, there are some JSON parsers for languages like Swift that will rearrange your JSON. By adding the index key, you'll still be able to sort after parsing.
Also, thanks, it's really cool to see people liking this :)
They might rearrange keys in a JSON object, but in an array the order should be preserved according to the spec[1]. If Swift does this (which I can't really check), then it would be a bug.
[1] http://www.json.org/: "An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma)."
Yes, the order of elements in an array should always be preserved. For example, we might be expecting the first element to be a name, the second to be a date of birth, etc. We should use an object for that, but that's for reasons of readability, extensibility, etc. rather than array semantics being unsuitable.
Also, jq has a `--sort-keys` option which tries to make the output as reproducible/canonical as possible. From the manual:
> The keys are sorted "alphabetically", by unicode codepoint order. This is not an order that makes particular sense in any particular language, but you can count on it being the same for any two objects with the same set of keys, regardless of locale settings.
It would be strange for a JSON tool to go to such lengths to normalise data, if array order were unpredictable.
Very nice idea. Although scraping should always be a last resort, I could imagine using this for semi-serious purposes, i.e. when I care enough about the output, will be doing many requests, don't mind relaying data via a third-party, etc.
I currently do quite a bit of scraping for my own use (generating RSS feeds for sites, making simple commandline interfaces to automate common tasks, etc.). I've found xidel to be pretty good for this: it starts off pretty simple (e.g. with CSS selectors or XPath), but gets pretty gnarly for semi-complicated things. For example, it allows templating the output, using a language I struggle to grasp. This service seems to address that middle ground, e.g. restricting its output to JSON, and hence making the specification of the output much simpler (a nice JSON structure, rather than messing around with splicing text together).
I'm actually wondering if it would be possible to add forms authentication to this?
E.g. POST with some sort of CSS selectors and then a "cookie memory".
It would be possible to, of course. But you'd surely want to host it yourself.
Great! I've been trying to get my head around Scrapy, and I have little Python experience. This seems to fit in a lot better with my skillset for the project I'm working on.
Application Error An error occurred in the application and your page could not be served. Please try again in a few moments.
If you are the application owner, check your logs for details.
Yes, yes, yes!
I'm using Apifier at the moment, which I really like, but my biggest gripe is the awkwardness of source (and VCS) integration. The best I've come up with is to export the JSON config (which contains the scraper source code as a value - yuck) and try to remember to keep re-exporting and checking it in.
Having also had to hack around the inability to parameterise the scrape url (e.g. 'profile/$username') - which they've since added support for - I started to wonder whether I might as well just use BeautifulSoup (a Python HTML parser lib) and check it in properly.
This is probably my ideal. I can keep it all in source control because it's just an HTTP request body, and I can parameterise it because, well, it's just an HTTP request body!
It's also open source, because you're an amazing person; so the one little concern I had left, about the availability of your site, I can dismiss right away, since I could run my own on Heroku should jamapi.xyz prove unsustainable. It's possibly a better idea to do that anyway, but I often wonder whether Heroku doesn't consider that a problem: multiple instances of the same app running on free dynos under different accounts...
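To illustrate the "it's just an HTTP request body" point above, a minimal Node sketch that keeps the extraction spec in source and parameterises the URL; the form fields follow the curl examples elsewhere in the thread, while the profile URL and selector are hypothetical:
  const http = require('http');
  const querystring = require('querystring');
  function fetchProfile(username, callback) {
    const body = querystring.stringify({
      url: 'https://example.com/profile/' + username,        // parameterised target
      json_data: JSON.stringify({
        name: [{ elem: '.profile-name', value: 'text' }]     // hypothetical selector
      })
    });
    const req = http.request({
      method: 'POST',
      host: 'www.jamapi.xyz',
      path: '/',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
    }, res => {
      let data = '';
      res.on('data', chunk => { data += chunk; });
      res.on('end', () => callback(null, JSON.parse(data)));
    });
    req.on('error', callback);
    req.end(body);
  }
  fetchProfile('someuser', (err, json) => console.log(err || json));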
I just get "invalid json" when I try to use the form on the page.
I think with the advent of tools like this, developers will more and more be thinking of ways to make it hard for someone to scrape their website into data structures. I wonder if we are going to see the same thing that happened to minified JS happen to HTML more and more. I know there are sites that dynamically change CSS class names and ids. But I think soon we will also see div hierarchies dynamically change form without looking presentationally different to the end user.
That would be bad for the web. DRM and the web are incompatible concepts.
HTTPS results in 500 Internal Server Error.
Edit: Well no, it's only some sites. E.g. https://medium.com
If you're running the example on the website/in a browser, it's probably CORS stopping you.
Try using a backend language or just curl and it should be fine.
Well no, because my browser isn't doing the request. The underlying Node app (the Jam API) does it.
I found it: The API responds with an HTTP 500 error if you use CSS selectors that don't select anything or are simply invalid.
Probably makes sense to add some Exception handling right there.
I had been trying to figure out what was causing this issue, thanks for pointing it out. I've pushed a quick fix that will respond with whether the JSON is invalid or a CSS selector wasn't found on the provided URL.
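Not the actual Jam API code, just a sketch of the kind of input handling being discussed, assuming an Express-style route and cheerio for parsing; the HTML here is a placeholder where the real service would fetch the submitted URL:
  const express = require('express');
  const bodyParser = require('body-parser');
  const cheerio = require('cheerio');
  const app = express();
  app.use(bodyParser.urlencoded({ extended: false }));
  app.post('/', (req, res) => {
    let spec;
    try {
      spec = JSON.parse(req.body.json_data);          // bad JSON -> 400, not 500
    } catch (e) {
      return res.status(400).json({ error: 'invalid json' });
    }
    const html = '<html><title>placeholder</title></html>'; // real service fetches req.body.url
    const $ = cheerio.load(html);
    let matches;
    try {
      matches = $(spec.title);                        // malformed selectors can throw
    } catch (e) {
      return res.status(400).json({ error: 'invalid CSS selector' });
    }
    if (matches.length === 0) {
      return res.status(404).json({ error: 'selector not found on the provided URL' });
    }
    res.json({ title: matches.first().text() });
  });
  app.listen(3000);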
I'm also getting 500s. Looks to be a CORS issue. I've tried about a dozen big name sites so far without any luck.
Nonetheless, a great idea!
Edit: Just tried it in Node and it works brilliantly. Cool project.
A CORS restriction perhaps?
Does anyone have any information on anyone that's used HTTP as an API to share/create metadata for any transactions, content, etc. publicly online? I would be very curious to know about it!
Welcome feedback on my "Apply HN" on doing exactly this: https://news.ycombinator.com/item?id=11583348
Just a heads up, "Apply HN" is for built products/services, not ideas.
Might be helpful to have the example execute inline so you can see what's going on/experiment without having to leave the page.
Nice work, thanks for adding the Github link. I can think of lots of immediate use for this. Consider publishing on NPM?
OT perhaps: I'm still looking for a solution that has a graphical UI that allows users to point and click an element on their page and returns the corresponding CSS-selector. SelectorGadget does this as a chrome-extension but I'm looking for something that works without an extension.
Chrome Developer tools. Inspect an element to get it in the elements tab. Right click the element's HTML, copy -> copy selector.
#hnmain > tbody > tr:nth-child(3) > td > table > tbody > tr.athing > td.default
Explain that to a small business owner (our customers) using IE or Safari. ;-)
Why not make a screencast?
AFAIK, Selectorgadget's chrome extension is just a wrapper around the bookmarklet. It's pure JS, doesn't use any sort of elevated privileges, and is MIT licensed so you can include the core engine in your own projects.
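For illustration, a rough bookmarklet-style sketch of the point-and-click idea: click an element and get a naive CSS selector for it (SelectorGadget's actual algorithm is considerably smarter about choosing robust selectors):
  document.addEventListener('click', function (event) {
    event.preventDefault();
    let el = event.target;
    const parts = [];
    while (el && el !== document.body) {
      if (el.id) {                                   // an id is specific enough; stop here
        parts.unshift('#' + el.id);
        break;
      }
      let part = el.tagName.toLowerCase();
      if (el.classList.length) {
        part += '.' + Array.from(el.classList).join('.');
      }
      parts.unshift(part);
      el = el.parentElement;
    }
    console.log(parts.join(' > '));                  // e.g. "td.title > a"
  }, true);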
Flutter Selector(https://flutter.social/bookmarklet) seems to do exactly what you want.
Wonderful idea.
What about DOM nodes generated by JavaScript? Will Jam render the page before scraping?
It doesn't currently do that, I think it'd be an interesting challenge to try and do that though. It's definitely possible to do.
Yep. Have a look at phantomjs [1], or other phantomjs wrappers like casperjs [2].
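A minimal PhantomJS sketch of that approach: load the page, let its JavaScript run, then pull text out with a CSS selector (URL and selector are only examples):
  var page = require('webpage').create();
  page.open('https://news.ycombinator.com/', function (status) {
    if (status !== 'success') {
      console.log('failed to load page');
      return phantom.exit(1);
    }
    var titles = page.evaluate(function () {
      // Runs inside the page, after client-side rendering.
      var nodes = document.querySelectorAll('.title a');
      return Array.prototype.map.call(nodes, function (n) { return n.innerText; });
    });
    console.log(JSON.stringify(titles));
    phantom.exit();
  });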
> interesting challenge
Understatement of the year.
You'd need to either re-implement an entire browser stack or run a headless version of Gecko or WebKit server-side.
The former entails millions of man-hours of work. The latter opens up your server to all sorts of exploits. Overall a really bad idea.
Besides, single page applications are the worst junk in the entire Web 2.0 cesspool. If you really need to scrape them, they usually come with their own JSON API which you can just piggyback.
> entails millions of man-hours of work
Overstatement of the year.
Why on Earth would the OP start from scratch? Besides, though not a solo and OSS effort, Apifier does this; certainly without "millions" of hours having been spent on it.
If anyone remembers, there was a YC company that did exactly this. It was called Kimono Labs. I think it failed and just got acquired a year ago. "Jam API" will probably do way better because, well, open source.
I've been thinking about writing some website-to-JSON scrapers myself and this basically solves that problem (since I would have been going after CSS selectors or xpath anyway myself). Nice job.
How will someone like CloudFlare stop a tool like this from scraping their customers' sites? Just blocking the tool's IP?
CloudFlare will make sure the browser can run JS, which in the case of this service I assume it won't. There are ways around this of course, using headless browsers (e.g. PhantomJS), tools like cloudflare-scrape[0] (which uses PyExecJS[1]). I've even used PyQt5 to render webpages for similar purposes.
Unfortunately the aforementioned tools are generally pretty slow (especially headless browsers). Also I can't imagine it's particularly safe running such a service.
I wish site publishers annotated their markup with RDFa tags so every web page was already an "api"
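A rough example of the kind of annotation meant here, using RDFa Lite attributes with a schema.org vocabulary (the content itself is made up):
  <article vocab="https://schema.org/" typeof="NewsArticle">
    <h1 property="headline">Show HN: Jam API</h1>
    <span property="author" typeof="Person"><span property="name">Some Author</span></span>
    <div property="articleBody">Turn any site into a JSON API using CSS selectors.</div>
  </article>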
If it's going to be used for serious purposes it really needs HTTPS support, as most APIs do these days.
What do you think would be a good syntax for enabling following links?
Say I wanted the Hacker News links + first comment?
I wrote a language that's basically a superset of this (https://github.com/fizx/parsley/wiki) back in 2008 and used it to crawl a variety of insane job posting sites.
As crawling complexity increases, pretty soon you want an actual programming language to specify things like crawl order and cache behavior. Multi-page behavior was very hard to describe declaratively for misbehaving sites.
Also, it's a terrible default (for security reasons) to let the web pages you're parsing automagically initiate new requests to arbitrary urls.
Such as it is, I believe that the following works in some version of parsley, though I doubt it's an official release.
At some point, these json things might as well be as readable as regex :/
  { "articles": [ { "title": ".title a", "comment_link": "follow(.subtext a:nth-child(3) @href) .athing:nth-child(1) .default" } ] }
> Also, it's a terrible default (for security reasons) to let the web pages you're parsing automagically initiate new requests to arbitrary urls.
Right. We'd have to only grab the article-id, validate that it is in fact an integer in the right range, and only then piece the url back together and request it.
On the other hand, maybe just checking that we stay within the domain is enough. If the website wants to screw with us, they can send us any reply they want to any url anyway.
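A small sketch of the "stay within the domain" check being discussed, using the WHATWG URL API; the allowed host is just an example:
  const ALLOWED_HOST = 'news.ycombinator.com';
  function isAllowedLink(href, base) {
    try {
      const url = new URL(href, base);               // also resolves relative links
      return url.hostname === ALLOWED_HOST;
    } catch (e) {
      return false;                                  // unparseable href: reject
    }
  }
  console.log(isAllowedLink('item?id=123', 'https://news.ycombinator.com/'));        // true
  console.log(isAllowedLink('https://evil.example/', 'https://news.ycombinator.com/')); // false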
"At some point, these json things might as well be as readable as regex :/"
Don't feel :/ . The complexity is essential, and located in the remote website, not your code or your ideas. You still win by isolating all the nasty stuff to one and precisely one location. :/ is on them, not you!
http://blog.webkid.io/nodejs-scraping-libraries/ -- Good scraping options in NodeJS .. my personal favourite is https://github.com/rc0x03/node-osmosis
Isn't this exactly what XML (or for that matter XHTML) was supposed to do?
Or I feel like anything surrounding Linked Data, Semantic Web, RDF, RDFa, microformats, etc.
This!... is why we can't have nice things.