Show HN: Jam API, turn any site into a JSON api using CSS selectors
jamapi.xyz
A little word of warning/encouragement. I did something similar a long time ago (JSONDuit), which got posted to HN by someone else.
You will probably run into a healthy mix of "that's cool" / "I did that before you!" / "but how will it make money?". Ignore it and do your thing. If you figure out how to monetize it, great! Even if you don't or if you have no desire to, you will have learned and grown during the course of the project. That is invaluable.
Have fun and screw the haters...
I find this "shoot first (write code), ask questions later" attitude both admirable and a bit worrisome at the same time. Nothing against people learning stuff, but why does it have to be promoted this way? The lack of humility is what gets me.
Maybe I'm just jealous or something, but it rubs me the wrong way.
I think it's less about promotion and more about feedback. "Here's what I've built, what do you think?"
We're meant to be an inclusive community of smart people. The idea is we'll encourage the poster and offer constructive criticism (or praise).
If the post is useful to no one, it simply won't get discussed or upvoted. When something does, it's validated as an idea, or as something of interest.
"Just because you could, doesn't mean you should" - that phrase should have been applied to both writing the software and posting about it here.
Plenty of mediocre stuff gets to the frontpage and plenty of gems fall through the cracks.
Perhaps your view of what's "mediocre" and what's a "gem" is not consistent with the views of Hacker News readers at large
Have you ever visited /newest?
> "How will it make money"
I was trying to implement a Firefox add-on that navigates the web based on speech for my senior project (the theme was assistive technology). The major blocker with developing any tool to help navigate web pages, despite some effort in ARIA, is that there are so many actions one cannot parse or perform without writing custom code for every single website. E.g. how can you tell where the login button is? What to click? Frankly, if you look at Gmail, the DOM is a huge compressed mess (names all rewritten; without gmail.js I wouldn't have been able to get Gmail working in my project). If every website exposed a standard set of APIs, that would reduce the barrier by a good percentage. So think of combining HATEOAS from REST with this: here we turn things into JSON with href, allowing a client to navigate, sort of a first step to making a website more "web client" compatible... funny, isn't it?
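A hypothetical sketch of what that could look like: the page exposed as JSON where each actionable element carries an href the client can follow, HATEOAS-style (the field names here are made up for illustration):
  {
    "title": "Example Mail",
    "actions": [
      { "rel": "login",   "elem": "#login-button", "href": "/login" },
      { "rel": "compose", "elem": ".compose",      "href": "/mail/new" }
    ]
  }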
It's not a totally original concept. Screen-scraping has been around for a while - essentially what this solves. I did mine for Ajax:
  this.get(html, selector, function(s){
    var es = new DOMParser().parseFromString(html, 'text/html').querySelectorAll(selector);
    return [].slice.call(es).map(function(n){ return n.innerText });
  });
It's not a product per se, but the combination of data and view is one of the unfortunate aspects of the web that (sorry) won't get fixed - not everyone will build JSON APIs. And, hate away, but HTML & JS are here for a long time to come. The need is very real and this would be a critical part of a scraper or IFTTT-like service - plumbing - if not a product you sell outright to end users.
>> It's not a totally original concept. Screen-scraping has been around for a while
This is basically a subset of "I did that before you!"
dpweb:
- mentions the term that this concept falls under (nowhere on the OP's page, so he may not know that there is an entire set of software, plugins, etc. that does this)
- provides one alternative implementation
- adds commentary related to why such services are necessary, and that they should be able to be monetized
So yes, he starts off with something along the lines of "I did that before you!", but he doesn't use a condescending phrase, and he provides additional useful information.
I think wanting to know how it will be monetized / support itself is a reasonable question. If there isn't an answer, you know not to build something on top of it expecting it to last.
It's an open source project, so maybe it's not meant to make money?
This is a fantastic idea and I'm really surprised nothing like this has existed before, it seems like such a no-brainer. Great work.
https://github.com/fizx/parsley/wiki looks pretty similar.
Running this sort of thing as a service/api never panned out for us because you are almost universally robots.txt denied and/or blocked.
We briefly tried, and supported a wiki of json extraction scripts at parselets.org, but it went nowhere after a few months.
I built something almost identical in 2011. It really doesn't have as much utility in practice as you think initially. CSS selectors are an interesting idea for extracting data from pages, but it's extremely fragile. You have to either parse the page's raw html using something like jsdom, or you run it through a headless browser like Phantom. In the first case, it completely fails for any modern SPA (angular, react, etc). In the second case, phantom is painfully slow and difficult to interact with, and often doesn't run/render an SPA as a regular browser does.
You can write tests around whether your selectors are returning data, but even simple refactors from a dev team quickly break your selector profiles multiple times a week or month.
Just wasn't worth the hassle.
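For reference, a minimal sketch of the first approach mentioned above (parsing the raw HTML with jsdom and querying it with CSS selectors); the URL and selector are only examples, and as noted this only sees the server-delivered HTML, not anything an SPA renders client-side:
  // Assumes the jsdom npm package.
  const { JSDOM } = require('jsdom');
  JSDOM.fromURL('https://news.ycombinator.com/')      // fetch and parse the page
    .then(dom => {
      const links = dom.window.document.querySelectorAll('.title a');
      Array.from(links).forEach(a => console.log(a.textContent));
    })
    .catch(console.error);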
There are some solutions for running an SPA in a real browser even in a headless environment.
The trick is to emulate X11 with xvfb and control the browser with Selenium WebDriver.
Phantom isn't the only choice, just the one most people talk about.
As for non-JS-heavy websites, it's fairly trivial to find a library that will parse the DOM for you; pretty much every language has one.
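A minimal sketch of that setup, using the selenium-webdriver npm package with the script run under xvfb (e.g. via xvfb-run); the URL and selector are only examples:
  const { Builder, By } = require('selenium-webdriver');
  (async () => {
    // Start a real Firefox; on a headless box, launch this script with:
    //   xvfb-run node scrape.js
    const driver = await new Builder().forBrowser('firefox').build();
    try {
      await driver.get('https://news.ycombinator.com/');
      const links = await driver.findElements(By.css('.title a'));
      for (const link of links) {
        console.log(await link.getText());   // text after client-side rendering
      }
    } finally {
      await driver.quit();
    }
  })();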
Done it also, to scrape HN in cli :p
Done it as well. At the time, it was specialized to organize web comics (way before Google Reader times).
The real issue is that popularity will get you blocked fairly quickly. See also: Yahoo Pipes.
There was a YC company a few years back, acquired by Palantir in February, that did something very similar: https://www.kimonolabs.com/
Well, https://www.apifier.com does essentially the same thing, plus it supports JavaScript, can crawl through the whole website etc.
Disclaimer: I'm a cofounder there
So does https://PhantomJsCloud.com, but only for single pages, no site crawling.
Disclaimer: I'm the founder there ;)
Apifier looks actually awesome.
Using a 3rd party site to query HTML (which you should be able to do yourself, plenty of tools for that) isn't a fantastic idea.
This one, for example http://www.videlibri.de/xidel.html#examples
The code is on GitHub; you can use this as a library, not as a third-party SaaS.
kimono labs used to do something similar, but shut down recently. They had a nice clicky pointy interface that allowed you to build the selectors by clicking on elements, with an immediate preview. They also handled things like pagination etc.
> I'm really surprised nothing like this has existed before
But how would you monetize it?
Unlike an RSS feed, you really don't know how the JSON response would be used, so you can't inject ads into it.
And if you charge for it, wouldn't people assume it would continue to work? But site "scrapers", regardless of how they are configured, are likely to break, so it would be tough having customers pay for something that could break at any time, leaving them to figure out whether it's the service that changed/broke or the page that changed.
Don't get me wrong- some great businesses have been/are based on "scraping" in one way or another. However, as cool as this is, it's just another way to "scrape". If the person hosting the page would provide an API or JSON view instead, you'd be loads better off.
Freemium, professional support, expanding it into an abstraction layer above the APIs for multiple services, selling a version that larger companies can run on their own servers which they might need for data security...
In any case, not everything has to be monetised.
>However, as cool as this is, it's just another way to "scrape"
Isn't that the point? The demo seems like it'd be a lot easier, less verbose, and probably less brittle than using curl/XPath or otherwise parsing the HTML yourself.
We launched WrapAPI (https://wrapapi.com/) a few weeks ago with the same functionality, but with a somewhat more complex and powerful setup process. You can not only specify CSS selectors yourself but also define them by pointing and clicking.
The barrier for starting with JamAPI is impressively low, though! Kudos on the developer-friendly user interface.
I put this similar project[0] together a while ago. Almost the same concept, but I skipped the json layer altogether as I just wanted a quick way of getting nuggets of content from webpages into my terminal.
For example:
curl https://news.ycombinator.com/news | tq -tj ".title a"
0. https://github.com/plainas/tq
That's awesome. Like jq but for html.
with curl:
$ curl -d url=https://news.ycombinator.com/ -d json_data='{"title":"title"}' http://www.jamapi.xyz/
=> {
"title": "Hacker News"
}
Also, the Ruby example appears to post to the wrong URL?
Ah, yep, you're right, forgot to change the URL. Updated now. Thanks for letting me know.
And to get the HN post titles:
curl -d url=https://news.ycombinator.com/ -d json_data='{"title":[{"elem":".title > a","value":"text"}]}' http://www.jamapi.xyz/
This is cool :)
EDIT:
Incidentally, you don't really need to have that "index" key inside the values of an array, because in an array the order is preserved anyway. Unless I've misunderstood what it means?
Titles and links grouped together:
curl -X POST http://www.jamapi.xyz/ -d url=http://news.ycombinator.com -d json_data='{"title": "title","paragraphs": [{ "elem": "td.title a", "value": "text", "location": "href"}]}'
Use the http URL to call www.jamapi.xyz; calling it over https I get the error SSL_ERROR_BAD_CERT_DOMAIN.
Regarding the "index" key, there are some JSON parsers for languages like Swift that will rearrange your JSON. By adding the index key, you'll still be able to sort after parsing.
Also, thanks, it's really cool to see people liking this :)
They might rearrange keys in a JSON object, but in an array the order should be preserved according to the spec[1]. If Swift does this (which I can't really check), then it would be a bug.
[1] http://www.json.org/: "An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma)."
Yes, the order of elements in an array should always be preserved. For example, we might be expecting the first element to be a name, the second to be a date of birth, etc. We should use an object for that, but that's for reasons of readability, extensibility, etc. rather than array semantics being unsuitable.
Also, jq has a `--sort-keys` option which tries to make the output as reproducible/canonical as possible. From the manual:
> The keys are sorted "alphabetically", by unicode codepoint order. This is not an order that makes particular sense in any particular language, but you can count on it being the same for any two objects with the same set of keys, regardless of locale settings.
It would be strange for a JSON tool to go to such lengths to normalise data, if array order were unpredictable.
Very nice idea. Although scraping should always be a last resort, I could imagine using this for semi-serious purposes, i.e. when I care enough about the output, will be doing many requests, don't mind relaying data via a third-party, etc.
I currently do quite a bit of scraping for my own use (generating RSS feeds for sites, making simple commandline interfaces to automate common tasks, etc.). I've found xidel to be pretty good for this: it starts off pretty simple (e.g. with CSS selectors or XPath), but gets pretty gnarly for semi-complicated things. For example, it allows templating the output, using a language I struggle to grasp. This service seems to address that middle ground, e.g. restricting its output to JSON, and hence making the specification of the output much simpler (a nice JSON structure, rather than messing around with splicing text together).
I'm actually wondering if it would be possible to add forms authentication to this?
E.g. POST with some sort of CSS selectors and then a "cookie memory".
It would be possible to, of course. But you'd surely want to host it yourself.
Great! I've been trying to get my head around Scrapy, and I have little Python experience. This seems to fit in a lot better with my skillset for the project I'm working on.
Application Error An error occurred in the application and your page could not be served. Please try again in a few moments.
If you are the application owner, check your logs for details.
Yes, yes, yes!
I'm using Apifier at the moment, which I really like, but my biggest gripe is the awkwardness of source (and VCS) integration. The best I've come up with is to export the JSON config (which contains the scraper source code as a value - yuck) and try to remember to keep re-exporting and checking it in.
Having also had to hack around the inability to parameterise the scrape url (e.g. 'profile/$username') - which they've since added support for - I started to wonder whether I might as well just use BeautifulSoup (a Python HTML parser lib) and check it in properly.
This is probably my ideal. I can keep it all in source control because it's just an HTTP request body, and I can parameterise it because, well, it's just an HTTP request body!
It's also open source, because you're an amazing person; so the one little concern I had left, about the availability of your site, I can dismiss right away, since I could run my own on Heroku should jamapi.xyz prove unsustainable. It's possibly a better idea to do that anyway, but I often wonder whether Heroku doesn't consider that a problem: multiple instances of the same app running on free dynos under different accounts...
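To illustrate the "it's just an HTTP request body" point above, a minimal Node sketch that keeps the extraction spec in source and parameterises the URL; the form fields follow the curl examples elsewhere in the thread, while the profile URL and selector are hypothetical:
  const http = require('http');
  const querystring = require('querystring');
  function fetchProfile(username, callback) {
    const body = querystring.stringify({
      url: 'https://example.com/profile/' + username,        // parameterised target
      json_data: JSON.stringify({
        name: [{ elem: '.profile-name', value: 'text' }]     // hypothetical selector
      })
    });
    const req = http.request({
      method: 'POST',
      host: 'www.jamapi.xyz',
      path: '/',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
    }, res => {
      let data = '';
      res.on('data', chunk => { data += chunk; });
      res.on('end', () => callback(null, JSON.parse(data)));
    });
    req.on('error', callback);
    req.end(body);
  }
  fetchProfile('someuser', (err, json) => console.log(err || json));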
I just get "invalid json" when I try to use the form on the page.
I think with the advent of tools like this, developers will more and more be thinking of ways to make it hard for someone to scrape their website into data structures. I wonder if we are going to see the same thing that happened to minified JS happen to HTML more and more. I know there are sites that dynamically change CSS class names and ids. But I think soon we will also see div hierarchies dynamically change form without looking presentationally different to the end user.
That would be bad for the web. DRM and the web are incompatible concepts.
HTTPS results in 500 Internal Server Error.
Edit: Well no, it's only some sites. E.g. https://medium.com
If you're running the example on the website/in a browser, it's probably CORS stopping you.
Try using a backend language or just curl and it should be fine.
Well no, because my browser isn't doing the request. The underlying Node app (the Jam API) does it.
I found it: The API responds with an HTTP 500 error if you use CSS selectors that don't select anything or are simply invalid.
Probably makes sense to add some Exception handling right there.
I had been trying to figure out what was causing this issue, thanks for pointing it out. I've pushed a quick fix that will respond with whether the JSON is invalid or a CSS selector wasn't found on the provided URL.
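Not the actual Jam API code, just a sketch of the kind of input handling being discussed, assuming an Express-style route and cheerio for parsing; the HTML here is a placeholder where the real service would fetch the submitted URL:
  const express = require('express');
  const bodyParser = require('body-parser');
  const cheerio = require('cheerio');
  const app = express();
  app.use(bodyParser.urlencoded({ extended: false }));
  app.post('/', (req, res) => {
    let spec;
    try {
      spec = JSON.parse(req.body.json_data);          // bad JSON -> 400, not 500
    } catch (e) {
      return res.status(400).json({ error: 'invalid json' });
    }
    const html = '<html><title>placeholder</title></html>'; // real service fetches req.body.url
    const $ = cheerio.load(html);
    let matches;
    try {
      matches = $(spec.title);                        // malformed selectors can throw
    } catch (e) {
      return res.status(400).json({ error: 'invalid CSS selector' });
    }
    if (matches.length === 0) {
      return res.status(404).json({ error: 'selector not found on the provided URL' });
    }
    res.json({ title: matches.first().text() });
  });
  app.listen(3000);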
I'm also getting 500s. Looks to be a CORS issue. I've tried about a dozen big name sites so far without any luck.
Nonetheless, a great idea!
Edit: Just tried it in Node and it works brilliantly. Cool project.
A CORS restriction perhaps?
Does anyone have any information on anyone that's used HTTP as an API to share/create metadata for any transactions, content, etc. publicly online? I would be very curious to know about it!
Welcome feedback on my "Apply HN" on doing exactly this: https://news.ycombinator.com/item?id=11583348
Just a heads up, "Apply HN" is for built products/services, not ideas.
Might be helpful to have the example execute inline so you can see what's going on/experiment without having to leave the page.
Nice work, thanks for adding the Github link. I can think of lots of immediate use for this. Consider publishing on NPM?
OT perhaps: I'm still looking for a solution that has a graphical UI that allows users to point and click an element on their page and returns the corresponding CSS-selector. SelectorGadget does this as a chrome-extension but I'm looking for something that works without an extension.
Chrome Developer tools. Inspect an element to get it in the elements tab. Right click the element's HTML, copy -> copy selector.
#hnmain > tbody > tr:nth-child(3) > td > table > tbody > tr.athing > td.default
Explain that to a small business owner (our customers) using IE or Safari. ;-)
Why not make a screencast?
AFAIK, Selectorgadget's chrome extension is just a wrapper around the bookmarklet. It's pure JS, doesn't use any sort of elevated privileges, and is MIT licensed so you can include the core engine in your own projects.
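For illustration, a rough bookmarklet-style sketch of the point-and-click idea: click an element and get a naive CSS selector for it (SelectorGadget's actual algorithm is considerably smarter about choosing robust selectors):
  document.addEventListener('click', function (event) {
    event.preventDefault();
    let el = event.target;
    const parts = [];
    while (el && el !== document.body) {
      if (el.id) {                                   // an id is specific enough; stop here
        parts.unshift('#' + el.id);
        break;
      }
      let part = el.tagName.toLowerCase();
      if (el.classList.length) {
        part += '.' + Array.from(el.classList).join('.');
      }
      parts.unshift(part);
      el = el.parentElement;
    }
    console.log(parts.join(' > '));                  // e.g. "td.title > a"
  }, true);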
Flutter Selector(https://flutter.social/bookmarklet) seems to do exactly what you want.
Wonderful idea.
What about DOM nodes generated by JavaScript? Will Jam render the page before scraping?
It doesn't currently do that, I think it'd be an interesting challenge to try and do that though. It's definitely possible to do.
Yep. Have a look at phantomjs [1], or other phantomjs wrappers like casperjs [2].
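A minimal PhantomJS sketch of that approach: load the page, let its JavaScript run, then pull text out with a CSS selector (URL and selector are only examples):
  var page = require('webpage').create();
  page.open('https://news.ycombinator.com/', function (status) {
    if (status !== 'success') {
      console.log('failed to load page');
      return phantom.exit(1);
    }
    var titles = page.evaluate(function () {
      // Runs inside the page, after client-side rendering.
      var nodes = document.querySelectorAll('.title a');
      return Array.prototype.map.call(nodes, function (n) { return n.innerText; });
    });
    console.log(JSON.stringify(titles));
    phantom.exit();
  });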
> interesting challenge
Understatement of the year.
You'd need to either re-implement an entire browser stack or run a headless version of Gecko or WebKit server-side.
The former entails millions of man-hours of work. The latter opens up your server to all sorts of exploits. Overall a really bad idea.
Besides, single page applications are the worst junk in the entire Web 2.0 cesspool. If you really need to scrape them, they usually come with their own JSON API which you can just piggyback.
> entails millions of man-hours of work
Overstatement of the year.
Why on Earth would the OP start from scratch? Besides, though not a solo and OSS effort, Apifier does this; certainly without "millions" of hours having been spent on it.
If anyone remembers, there was a YC company that did exactly this. It was called Kimono Labs. I think it failed and just got acquired a year ago. "Jam API" will probably do way better because, well, open source.
I've been thinking about writing some website-to-JSON scrapers myself and this basically solves that problem (since I would have been going after CSS selectors or xpath anyway myself). Nice job.
How will someone like CloudFlare stop a tool like this from scraping their customers' sites? Just blocking the tool's IP?
CloudFlare will make sure the browser can run JS, which in the case of this service I assume it won't. There are ways around this of course, using headless browsers (e.g. PhantomJS), tools like cloudflare-scrape[0] (which uses PyExecJS[1]). I've even used PyQt5 to render webpages for similar purposes.
Unfortunately the aforementioned tools are generally pretty slow (especially headless browsers). Also I can't imagine it's particularly safe running such a service.
I wish site publishers annotated their markup with RDFa tags so every web page was already an "api"
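A rough example of the kind of annotation meant here, using RDFa Lite attributes with a schema.org vocabulary (the content itself is made up):
  <article vocab="https://schema.org/" typeof="NewsArticle">
    <h1 property="headline">Show HN: Jam API</h1>
    <span property="author" typeof="Person"><span property="name">Some Author</span></span>
    <div property="articleBody">Turn any site into a JSON API using CSS selectors.</div>
  </article>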
If it's going to be used for serious purposes it really needs HTTPS support, as most APIs do these days.
What do you think would be a good syntax for enabling following links?
Say I wanted the Hacker News links + first comment?
I wrote a language that's basically a superset of this (https://github.com/fizx/parsley/wiki) back in 2008 and used it to crawl a variety of insane job posting sites.
As crawling complexity increases, pretty soon you want an actual programming language to specify things like crawl order and cache behavior. Multi-page behavior was very hard to describe declaratively for misbehaving sites.
Also, it's a terrible default (for security reasons) to let the web pages you're parsing automagically initiate new requests to arbitrary urls.
Such as it is, I believe that the following works in some version of parsley, though I doubt it's an official release.
At some point, these json things might as well be as readable as regex :/
  { "articles": [ { "title": ".title a", "comment_link": "follow(.subtext a:nth-child(3) @href) .athing:nth-child(1) .default" } ] }
> Also, it's a terrible default (for security reasons) to let the web pages you're parsing automagically initiate new requests to arbitrary urls.
Right. We'd have to only grab the article-id, validate that it is in fact an integer in the right range, and only then piece the url back together and request it.
On the other hand, maybe just checking that we stay within the domain is enough. If the website wants to screw with us, they can send us any reply they want to any url anyway.
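A small sketch of the "stay within the domain" check being discussed, using the WHATWG URL API; the allowed host is just an example:
  const ALLOWED_HOST = 'news.ycombinator.com';
  function isAllowedLink(href, base) {
    try {
      const url = new URL(href, base);               // also resolves relative links
      return url.hostname === ALLOWED_HOST;
    } catch (e) {
      return false;                                  // unparseable href: reject
    }
  }
  console.log(isAllowedLink('item?id=123', 'https://news.ycombinator.com/'));        // true
  console.log(isAllowedLink('https://evil.example/', 'https://news.ycombinator.com/')); // false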
"At some point, these json things might as well be as readable as regex :/"
Don't feel :/ . The complexity is essential, and located in the remote website, not your code or your ideas. You still win by isolating all the nasty stuff to one and precisely one location. :/ is on them, not you!
http://blog.webkid.io/nodejs-scraping-libraries/ -- Good scraping options in NodeJS .. my personal favourite is https://github.com/rc0x03/node-osmosis
Isn't this exactly what XML (or for that matter XHTML) was supposed to do?
Or I feel like anything surrounding Linked Data, Semantic Web, RDF, RDFa, microformats, etc.
This!... is why we can't have nice things.