Show HN: Gargl – Create an API for any website, without writing a line of code (jodoglevy.com)

> if you can see or submit data using the website, it means the website does have some kind of API ... For example, if you do a search in Yahoo, you can see the search page sent to Yahoo’s servers has the following url ... https://search.yahoo.com/search?p=search+term
no. no no no. this is not an API. this is about as far from an application programming INTERFACE as it can get. an API means an agreed format, where there's a contract (social or otherwise) to provide stability to APPLICATION clients. there's no contract here other than 'a human types something into the box, presses some buttons and some results appear on the website'.

/search?p=search+term is an implementation detail hidden from the humans the site is built for. they can, and most likely will, change this at any time. the HTML returned (and being scraped) is an implementation detail. today, HTML. tomorrow, AJAX. next week? who knows, maybe Flash.
fine, it's a scraper builder. but don't call what it's using an API, and don't imply it's anything more than a fragile house of cards built on the shaky foundation of 'if you get noticed you're going to get banned or a cease and desist'.
Just as a matter of record, you risk getting your IP blacklisted by using something like this without the website's permission. Perhaps the poster child for websites that go apeshit is Craigslist, but most sites respond in one way or another. One of my favorites is the Markov-generated search results that Bing sends back to robots.
You raise interesting and valid points. Notably, there is an obvious route around getting blocked: use Tor (it will work in many cases, though not all).
The most intriguing thing about Gargl, imo, is that it's a free version of for-profit SaaS website-to-API offerings such as kimonolabs [0]. I love the nobility of taking something that's only available as a paid service and creating a free, open-source version of it. These kinds of projects help reveal which SaaS services don't actually have strong value-adds, despite vendors' claims to the contrary.
[0] http://www.kimonolabs.com/
> there is an obvious route around getting blocked: use Tor
How about fuck you? Seriously, don't use Tor for scraping sites. Webmasters can and will block Tor exit nodes if they feel the bad traffic outweighs the good.
You have a valid point, but the "fuck you" is unnecessary. I hope that HN will stay a place where people can have civil debates, drawing on evidence instead of emotions.
It sounds like a very purposeful and directed "fuck you" to me. He's not being rude or crass, he's making the concise and vehement point that if you are a bad actor on Tor, you are harming everyone and he will hate you for it.
Enlightening debates don't come about when everyone mutes their politically incorrect emotions and speaks in platitudes; they come about when people respect each other and can speak simple truths as they would to their peers. Evidence is always useful, but rational argument is equally important.
When I'm deciding if an online comment is civil and productive, I tend to ask: would the commenter be willing to say this the same way in real life?
Plenty of people use the Internet's cloak of anonymity to say things that are inconsiderate. I use this simple test to determine if they would stand behind their remarks if accountability were in play and anonymity were not a factor.
I think it's well understood that "fuck you" is considered an offensive term in the context of a disagreement between two strangers. My morning commute has illustrated this point on a few occasions. :)
In real life you can frown to communicate your extreme disapproval. You can grit your teeth, kick the dirt, squeeze your fists, and do all sorts of non-verbal gyrations before having to spit out a simple 'fuck you' in order to communicate.
This isn't 'real life'. This is text-based communication.
It's possible OP didn't realize what he was proposing would have a negative impact. Rather than assuming that, we could've just said something like: "Do NOT do that! Because of {these things}. And {these other things} will happen to you and other people."
There are plenty of ways to say politically incorrect things without being rude. One thing I like about HN (as opposed to reddit) is the information density -- it's higher because jokes and snark are frowned upon.
I don't think it's really meant as an insult in this case. It's a common phrasing often used by the Zed Shaw-style "opinionated"/"brogrammer"/"macho" crowd within the Ruby on Rails community. It's not worth taking seriously, to be honest.
OP here.
I did a post comparing Gargl to Kimono actually: http://jodoglevy.com/jobloglevy/?p=146
FYI: the Gargl vs Kimono comparison is mentioned at the bottom of the original article. http://jodoglevy.com/jobloglevy/?p=146
I'll throw my scraper creator in the ring too: http://exfiltrate.org/
Nice, I like it! Pretty easy to use. Only problem is that I couldn't download any file =) When I try downloading a file, it just shows a green box saying it downloaded.
Thanks. Which browser are you using?
Firefox 26 on Linux Mint. It works fine in Chrome. There is no scrollbar on the page you're viewing, either.
I made a scraper meant for introductory projects in CS classes! (http://mickey.cs.vt.edu/)
Here's another one. They're pretty good at scale too: http://import.io
Also ScraperWiki: https://scraperwiki.com/
I think a better business model would be creating a service that identifies scrapers, and then blocks them. I think one might already exist, though I can't remember its name.
I don't think either is a really great idea. I think most of the people who would pay to block web scrapers are either being paranoid or are being scraped by people smart and resourceful enough to get around your filters. Any serious web scraper is going to be scripting a real browser engine, so it's going to act just like a real visitor.
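For what it's worth, driving a real engine only takes a handful of lines with something like PhantomJS (mentioned elsewhere in this thread). A rough sketch - the user agent string is a placeholder, and the URL is just the Yahoo example from the article:

    // Minimal PhantomJS sketch: render a page with a real WebKit engine and dump its HTML.
    // Run with: phantomjs fetch.js
    var page = require('webpage').create();
    page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36'; // look like an ordinary browser
    page.open('https://search.yahoo.com/search?p=search+term', function (status) {
        if (status === 'success') {
            console.log(page.content); // fully rendered HTML, scripts already executed
        }
        phantom.exit();
    });

From the server's point of view that is just another browser loading the page.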
There are ways to detect scrapers and other bots if you really want to, and there are services that do so.
Adding a captcha to every page, maybe? There are services that will charge you money for this, but that doesn't mean it works.
Captchas don't do anything to stop bots; they just add a small additional cost (~$1.40 per 1000 solved). I'm talking about monitoring things that 90% of bots generally don't take precautions against - mouse movements and other signals I won't mention here - that distinguish them from humans.
I don't believe you. At best you can obfuscate and confuse scrapers. You can't stop them from reading a public web page. (And I shudder to think what these solutions must do to accessibility -- hope you don't have any blind readers.)
Oh, I wasn't saying you can completely stop them from reading a page. But there is activity that can be detected as irregular. Here's a true example: I know someone who wanted to scrape a competitor's client listings. The competitor had a map with points for their customers, keyed by random user IDs, and nowhere was the entire dataset visible. The person just built a scraper/bot to hit every possible ID in a range of over 10,000 numbers. They hit a ton of empty pages, and the company should have recognized an IP incrementally crawling their data - especially all those empty pages - and responded with an IP ban.
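To make that concrete, here's a minimal sketch of the kind of check I mean; the function names and thresholds are made up for illustration, not taken from any particular product:

    // Track requests from each IP that come back empty (nonexistent IDs),
    // and flag the IP once it racks up too many misses in a short window.
    const misses = new Map(); // ip -> timestamps of "empty page" hits

    function shouldBan(ip, now = Date.now()) {
        const windowMs = 10 * 60 * 1000; // only look at the last 10 minutes
        const maxMisses = 50;            // arbitrary threshold
        const recent = (misses.get(ip) || []).filter(t => now - t < windowMs);
        recent.push(now);
        misses.set(ip, recent);
        return recent.length > maxMisses; // incremental ID crawling hits lots of empty pages
    }

    // Wherever a lookup comes back empty (hypothetical helpers):
    //   if (customerById(id) === undefined && shouldBan(clientIp)) banIp(clientIp);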
Do you know the names of any such services off the top of your head? Thanks
If you want to learn about the concept and the arms race, here is a paper with plenty of resources (it's in a game context rather than websites, but some of the most advanced detection work is there): http://iseclab.org/papers/botdetection-article.pdf
Here is an open-source system demoed at BlackHat Europe 2011 that checks the client is a proper browser (with DOM/Javascript/etc.); it's also good against DDoS: https://github.com/yuri-gushin/Roboo
Project Honeypot (scans inbound IPs), good against spambots: https://www.projecthoneypot.org/
Here are some commercial solutions: CloudFlare's ScrapeShield - https://www.cloudflare.com/apps/scrapeshield
Distil Networks - http://www.distilnetworks.com/
Scrape Sentry - http://www.scrapesentry.com/
Fireblade - http://www.fireblade.com/
Great. Thank you.
I do this kind of stuff with wget, sed, awk so far, but it's nice to see some more thought-out alternatives.
What I like most about your competition, though, is that the JS interface gets used for one last good thing (before the page is properly scraped and stripped of ads and JavaScript): clicking on the content you want and deselecting content you don't want. Subtly, with your mouse, you lead a pattern-matching algorithm that does the annoying work.
Honestly, the simplicity of that interface is even more breathtaking to me than Gargl :P But it's also more limited: after two clicks it thinks it has already understood the pattern, even though that might not be the case.
I'd suggest integrating the idea, but making the learning process more clever: let users keep selecting things even when the engine thinks there can't be any more similar items. Give that AI more examples to learn from. We want more identifiers than just counts and HTML elements like "2nd subelement of <h1>".
There's good stuff you can do with statistics, too. Some data exists only once, some exists only 3 times, some always exists over 10 times. That's valuable info. Some data has many words of whitespace-separated text - oh, a paragraph!
tldr: We need something that automatically generates good semantics out of normal web sites, so that users can use a simple web UI injected into the target site to choose the right pattern.
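A speculative sketch of the "click two examples, infer the pattern" idea (browser JS; pathTo/generalize are invented names, not part of Gargl or any of the tools mentioned):

    // Describe an element by its chain of tag names and positions among same-tag siblings.
    function pathTo(el) {
        const path = [];
        for (let node = el; node && node.tagName; node = node.parentElement) {
            const sameTag = node.parentElement
                ? [...node.parentElement.children].filter(c => c.tagName === node.tagName)
                : [node];
            path.unshift({ tag: node.tagName.toLowerCase(), index: sameTag.indexOf(node) });
        }
        return path;
    }

    // Compare two clicked examples: keep the levels they share, loosen the level where they differ.
    function generalize(a, b) {
        if (a.length !== b.length) return null; // structurally different; needs smarter handling
        const parts = [];
        for (let i = 0; i < a.length; i++) {
            if (a[i].tag !== b[i].tag) return null;
            parts.push(a[i].index === b[i].index
                ? a[i].tag + ':nth-of-type(' + (a[i].index + 1) + ')' // same position: keep it fixed
                : a[i].tag);                                          // position differs: the repeating level
        }
        return parts.join(' > ');
    }

    // document.querySelectorAll(generalize(pathTo(firstClick), pathTo(secondClick)))
    // then matches "everything that looks like the two things you clicked".

More example clicks, plus signals like text length or how often a shape repeats (the statistics above), would simply feed the same generalization step.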
I wish all tools were presented with this level of clarity and depth. A really great introduction, in contrast to the usual technobabble.
IANAL et al., but unless I'm mistaken, generating an API by analyzing requests and responses would be fine (under the purview of "research purposes"), unless you then use the generated API to access the service.
Also, it seems like authenticated sites would be difficult to scrape with this, i.e. ones that require login and possibly some logic (like sending a hash of request parameters) with every request.
OP here.
Yes, using the generated API is the issue, not generating one.
As for authenticated sites: as long as the underlying generated module keeps track of cookies received in responses and sends those cookies on subsequent API calls, just like a browser would, it should work fine for "normal" websites that use regular cookies to remember whether the user is logged in. Gargl modules generated as PowerShell, or as Javascript (and used in a WinJS project), do this "cookie remembering" today. It's also of course possible for the user to remember the cookie themselves in their code (after getting the raw response from the API call) and then pass that cookie into any subsequent API calls manually.
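Roughly, the manual version looks something like this (Node-style fetch; the URL and form fields are placeholders, and real code should use a proper cookie jar rather than this naive Set-Cookie parsing):

    // Log in, keep the session cookie from the response, and replay it on the next call.
    async function loginAndFetch() {
        const loginResp = await fetch('https://example.com/login', {
            method: 'POST',
            headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
            body: 'user=me&password=secret',
            redirect: 'manual', // keep the Set-Cookie from the login response itself
        });

        // Naively pull out the "name=value" pairs from the Set-Cookie header(s).
        const cookies = (loginResp.headers.get('set-cookie') || '')
            .split(',')
            .map(c => c.split(';')[0].trim())
            .filter(c => c.includes('='))
            .join('; ');

        // Send the cookie back on the authenticated request, the way a browser would.
        const dataResp = await fetch('https://example.com/account/data', {
            headers: { Cookie: cookies },
        });
        return dataResp.text();
    }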
this is a clever idea - i've had a similar idea many many years ago in fact, but never followed through because i strongly disagree with making it easy to e.g. abuse google or yahoo by spamming their search engine. as much as i disagree with keeping proprietary secrets, i agree more that people should have the freedom of choice to do that...
in that regard it's nice to see the big warning at the top of the page about ease of misuse (and a refreshing slap in the face - i was thinking 'pfft, some hipster forgot common sense again' and expecting not to see anything of the kind)
there is something off here and i can't quite put my finger on it though... as a low level programmer I cringe when I hear web people using API to describe some weird little subset of APIs anyway. Here I feel almost like what this does is take an existing 'API' (http - the internet) and refactor the interface in highly specific ways to make it easier to use...
At any rate, it's a clever idea and nice to see such a well-thought-through implementation - but it's also far too open to misuse imo. I wish the creator the best of luck... hopefully no takedown requests too soon.
> Here I feel almost like what this does is take an existing 'API' (http - the internet) and refactor the interface in highly specific ways to make it easier to use...
That's what all libraries do. If we can't call that an API, we'll only be able to use the name for bare I/O operations.
I wouldn't call HTTP an API. HTTP is a protocol.
any interface that is designed to be used by software, and can be, is technically an api...
the thing i was trying to stab at was the recent popularity of 'API' as a term and the way it is applied...
Your description of the problem and solution is too verbose. I need bullet points describing 1) my problems, 2) how my problems are solved by this. I'm not going to read a full-on blog post to figure out if this is relevant to me.
Love these scraper template generators. I wonder why you chose Java instead of something like PhantomJS to run the scraper.
um, the whole thing is language agnostic, no?
The APIs generated are agnostic, but the tool to create them is Java-based.
Well, the reference generator is in Java, but take a look at the GitHub repo - there's nothing preventing you from adding an additional generator in the /generators directory.
All you really need to do is output the template.
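For instance, a new generator could be little more than a script that reads a template and prints code. The template shape below is a guess for illustration only - check the actual template files in the repo for the real schema:

    // Hypothetical generator: turn a (guessed) JSON template into plain JS functions.
    const fs = require('fs');

    function generate(templatePath) {
        const tpl = JSON.parse(fs.readFileSync(templatePath, 'utf8'));
        return (tpl.functions || []).map(fn => [
            'async function ' + fn.name + '(params) {',
            "  const url = '" + fn.request.url + "?' + new URLSearchParams(params);",
            "  const resp = await fetch(url, { method: '" + fn.request.method + "' });",
            '  return resp.text();',
            '}',
        ].join('\n')).join('\n\n');
    }

    // fs.writeFileSync('generated.js', generate('yahoo.gtf'));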
so the usage of the data is where the legality comes in. if your users scrape a site and you host the scraped data through accessible means, you can get sued, but not if you provide a flat csv file?
Armchair lawyers, please advise - we need more details.