Show HN: Serverless Analytics Built from Scratch

statsbotco.github.io

114 points by keydunov 7 years ago · 49 comments

lucb1e 7 years ago

"serverless" is really the misnomer of the year.

  • mikejulietbravo 7 years ago

    "Still on a server, but not your problem" just doesn't have the same ring to it though

    • buster 7 years ago

      "/cgi-bin/ with javascript instead of perl" doesn't make me want to buy it, either.

  • gpantazes 7 years ago

    "I implemented this using some technology that abstracted away the server configuration, but it is running on a server... hmmm"

  • bobx11 7 years ago

    Someone posts this on every article called serverless, but what is the purpose? Everyone used to say the same thing about cloud computing.

    It’s just a distraction from someone sharing free and functional code with the community.

    • lucb1e 7 years ago

      I hope this one won't stick. I never call it serverless because it doesn't make any sense to me (I call it "shared hosting", the term everyone already knows). Then, by pointing it out to those who do, maybe they realize that they're repeating marketing nonsense.

      Cloud I gave up on quite early. It's a nonsense word but not an incorrect one: sure, it's not literally water vapor, but it's not like you're saying "yourcomputer" when the definition is "someone else's computer". Cloud is a new word for something that didn't really have a word. "Server" comes close, but in "cloud" there is the additional implication that it's not yours (making quite a difference in many cases, so I guess it warrants having another word). Serverless... that's just shared hosting.

      • jimktrains2 7 years ago

        I also hate it and use the term "managed service". It conveys exactly what's happening: it's not serverless, the server and a plethora of services are being managed for us.

  • InGodsName 7 years ago

    Real serverless is BitTorrent.

    The client is the server, so it's serverless.

gingerlime 7 years ago

Interesting and nicely presented!

I built a prototype of something very similar, but using Google BigQuery to store and extract data[0]; I never took it beyond the concept phase, though. I'm still using and actively maintaining an open source Lambda-based serverless A/B testing framework with a similar (but simpler) architecture[1].

[0] https://blog.gingerlime.com/2016/a-scalable-analytics-backen...

[1] https://github.com/Alephbet/gimel

rmccue 7 years ago

The blog post about it is probably a better link for HN: https://statsbot.co/blog/building-open-source-google-analyti...

soared 7 years ago

I mean, as a POC it's not bad, but Google Analytics is not the same as analyzing server logs (contrary to what most people would suggest). Most of the value of GA comes from session- and user-level metrics, which are 1000x more difficult to implement than showing pageviews. Unless you are planning on building a device graph that rivals Google's, you can't clone GA.

  • asien 7 years ago

    > google analytics is not the same as analyzing server logs

    This is what most people don't get about GA.

    Google Analytics does the heavy lifting by removing incoherent, corrupted, or malicious data insertions.

    Let's say I use Puppeteer: I can scrape this page a million times with completely wrong headers like "Netscape 8.1". GA filters out this type of malicious attempt; it will probably look at my IP address, figure out that it's actually all coming from a single IP, and decide that "Netscape" is too rare to be considered an actual browser, so it would probably ignore it.

    All the other "free Google Analytics alternatives" that exist today don't have this kind of mechanism to prevent data corruption.

    In general they just get an HTTP request and acknowledge it as a legitimate visit.

    Logging an HTTP request from a browser is not even a tenth of the work GA does under the hood.

    • mayank 7 years ago

      > Google Analytics does the heavy lifting by removing incoherent, corrupted, or malicious data insertions.

      Unless it's referrer spam...that somehow still sticks around (at least last time I checked, which was several months ago).

      • soared 7 years ago

        If you hire an expert to set up your GA, referrer spam doesn't get through; it's super easy to filter out beforehand.

    • manigandham 7 years ago

      I have to disagree here. GA is very advanced but still rather dumb with data collection, and can be gamed in many ways, and I'm saying this as a user of GA for 10+ years along with their enterprise/premium suite.

    • eli 7 years ago

      I'm not sure how well that filtering works in practice. I think most of it is just that it only tracks clients that load javascript.

    • cosmie 7 years ago

      > Google Analytics does the heavy lifting by removing incoherent, corrupted, or malicious data insertions.

      > Let's say I use Puppeteer: I can scrape this page a million times with completely wrong headers like "Netscape 8.1". GA filters out this type of malicious attempt; it will probably look at my IP address, figure out that it's actually all coming from a single IP, and decide that "Netscape" is too rare to be considered an actual browser, so it would probably ignore it.

      I do a lot of work with GA, and have seen this misperception brought up a few times. When it comes to data processing, GA is not intelligent. If you haven't told it to do something explicitly, it isn't doing it. And if you tell it to do something, it'll only do that for all new data and will make no attempts to do it to historical data.

      - GA is relatively robust against web scraping due to the fact that most scrapers don't render the page. So the GA-related code on the page is never executed and a hit is never made to Google's servers. If the scraper is using a headless browser, such as Puppeteer, and renders the page, then it will in fact send that hit to GA.

      - If you've checked the "Exclude bots" view setting[1], it will apply the IAB Spiders and Bots list to traffic[2]. This is a deterministic list of user-agent based filters to apply[3], and anyone is capable of paying for it. Google just gives it to GA users for free via the Exclude bots filter.

      - The Exclude Bots setting does nothing other than that. Scrapers like Puppeteer by default report their user agent as the version of Chromium they're using, so they will show up just like any legitimate user browsing your site with that specific version of Chromium.

      - GA has pretty robust filtering options[4]. But you have to manually create them. And they don't apply retroactively. You can filter IPs here, and only here. While you can apply reporting filters after the fact on a lot of fields, IP addresses aren't available as one of those fields. This makes it really frustrating to retroactively get rid of junk traffic, whether internal or automated/scraping. You can approximate it by getting creative with fields that make a good proxy. The only exception to this would be GA360/Google Marketing Cloud customers, since they can access their clickstream data via BigQuery as part of their subscription.

      - GA's interface will give you really smart-looking notifications now like "Filter internal traffic. Hits from your corporate network are showing up in property example.com". It's not doing anything super neat like dynamically cross-referencing your IP address as you're in the admin area against the collected data in your GA property. It's literally just triggering that warning based on the fact that you haven't applied any IP-based filters yet.

      There are quite a few other completely unintuitive aspects of GA that are rooted in the fact that their data processing model is incredibly straightforward: there are very few exceptions to it and virtually no edge cases taken into account. Which leads to a lot of instances where people's expectations of behavior decouple from actual behavior. But a good rule of thumb is that if a particular functionality or metric seems even remotely like it'd require extra computation or complexity to make it match what you're thinking, then it's highly likely it doesn't work the way you think.

      [1] https://support.google.com/analytics/answer/1010249?hl=en

      [2] https://www.iab.com/guidelines/iab-abc-international-spiders...

      [3] https://www.iab.com/wp-content/uploads/2015/11/IAB_SpidersBo...

      [4] https://support.google.com/analytics/topic/1032939?hl=en&ref...

      • soared 7 years ago

        Great comment. I didn't know the exclude bots toggle used an IAB list... that's excellent information for me to know. Thanks!

code4tee 7 years ago

Build serverless app to track web stats. Get it featured on Hacker News and use the flood of traffic to demo what was done. Very meta. Nice job.

cheriot 7 years ago

I'd be curious to see a cost estimate for some traffic level. I wonder if there's a way to put the pixel in s3 and process the access logs more cheaply.

  • teej 7 years ago

    I’ve seen folks put their pixel endpoint behind Fastly and process the access log delivered in S3. A Fastly VCL can handle the same transform that this Lambda is doing.

    • mrkurt 7 years ago

      We have people doing exactly this with fly.io. You could also do it with lambda@edge if you're a masochist, or with Cloudflare Workers if you dislike small startups.

    • InGodsName 7 years ago

      Is fastly free? Why would they use fastly and not s3?

      • teej 7 years ago

        S3 access logs alone are not sufficient to replicate this pipeline. This pixel is stateful (for the anonymous user ID) and S3 access logs don’t include arbitrary headers, in this case the cookie with the user id. Fastly would let you eliminate API gateway, Lambda, and both Kinesis steps.

        API Gateway by itself is $3.50/million requests, which is 2-4x more expensive than Fastly at $0.75-$1.60/million.
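As a rough sketch of that back-of-the-envelope comparison, using only the per-million prices quoted above (the 100M requests/month volume is an illustrative assumption):

```javascript
// Per-million prices quoted in the comment above.
const API_GATEWAY_PER_M = 3.5; // USD per million requests
const FASTLY_HIGH_PER_M = 1.6; // USD per million requests (upper bound)

// Linear cost model: requests billed per million, no fixed fees.
function monthlyCost(millionsOfRequests, pricePerMillion) {
  return millionsOfRequests * pricePerMillion;
}

// At 100M tracked requests per month:
const gatewayCost = monthlyCost(100, API_GATEWAY_PER_M); // $350
const fastlyCost = monthlyCost(100, FASTLY_HIGH_PER_M);  // ≈ $160
```

And that is before counting the Lambda and Kinesis costs that the Fastly setup would also eliminate.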

  • yahelc 7 years ago

    I've done this a few times and have found it to be an extremely effective way to do simple pixel tracking (for custom emails and the like).

  • InGodsName 7 years ago

    Here is some more data: http://highscalability.com/blog/2018/4/2/how-ipdata-serves-2...

    I don't understand what cubejs is doing in this app.

    Once data is inside Athena, it's a matter of querying it right.

westoque 7 years ago

We should really stop using the word "serverless".

I would rather call them "zero-config servers" instead.

jimmychangas 7 years ago

I think you can use API Gateway as a proxy for Kinesis, removing the need for Lambda.

teej 7 years ago

This sort of thing works until you have one person run a security scan on your site, corrupting your user agents and event types.

manigandham 7 years ago

Side note: If you want to build your own mid-size event analytics data pipeline, then I recommend looking at snowplow: https://github.com/snowplow/snowplow

jimktrains2 7 years ago

Interesting. I once built a GA clone using App Engine, Cloud Dataflow, and BigQuery. I guess that would count as serverless? I benchmarked it against the official dumps to BigQuery too, and it was pretty spot on for every metric we could look up!

  • pavel_tiunov 7 years ago

    Yes, I guess your setup is serverless as well. BigQuery is one of the serverless MPP databases that share similar concepts with AWS Athena.

  • InGodsName 7 years ago

    BigQuery takes a minimum of 2-3 seconds for every query.

    Google Analytics is much faster, responding in a few hundred milliseconds.

    What did you use dataflow for? How did you get data from end points and insert them into bigquery? Using streaming inserts?

    • cosmie 7 years ago

      > Google Analytics is much faster, responding in a few hundred milliseconds.

      Are you referring to their reporting API, or their collection endpoint? The collection endpoint is certainly fast to respond, but the actual reporting API can be quite slow depending on what you're trying to get from it.

      > What did you use dataflow for? How did you get data from end points and insert them into bigquery? Using streaming inserts?

      I'm not the parent, but I've created setups like what was mentioned. It sounds like they hosted the collection endpoint on AppEngine, then used DataFlow for streaming the data into BigQuery. Potentially using a Pub/Sub topic to queue up for DataFlow, since that has native integrations with DataFlow and even has a template available to support it[1].

      [1] https://cloud.google.com/dataflow/docs/guides/templates/prov...

    • jimktrains2 7 years ago

      > Google Analytics is much faster, responding in a few hundred milliseconds.

      GA stores summary tables for each day for the basic values. If you have a large site and request segments or anything that's not in the summary tables, it can be quite slow.

      Also, BigQuery is multi-tenant. GA would have dedicated instances.

      > What did you use dataflow for? How did you get data from end points and insert them into bigquery? Using streaming inserts?

      cosmie pretty much got it. App Engine collected. DataFlow sessionized and did some other processing (geoip lookup, filtering, &c). BigQuery stored.

      I actually had AppEngine dumping into Cloud Datastore, but I also experimented with PubSub and also using Cloud Storage access logs.
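The "sessionized" step mentioned above can be sketched as a simple gap-based grouping. This is an illustrative stand-in, not the original DataFlow code; the 30-minute timeout (GA's default) is an assumption:

```javascript
// Group one user's hit timestamps into sessions: a new session starts
// whenever the gap since the previous hit exceeds the timeout.
const SESSION_TIMEOUT_MS = 30 * 60 * 1000; // GA's default session window

// timestamps: one user's hit times in ms, sorted ascending.
function sessionize(timestamps) {
  const sessions = [];
  let current = null;
  let last = -Infinity;
  for (const t of timestamps) {
    if (t - last > SESSION_TIMEOUT_MS) {
      current = []; // gap too large: open a new session
      sessions.push(current);
    }
    current.push(t);
    last = t;
  }
  return sessions;
}
```

In the real pipeline this grouping runs per user ID over a windowed stream rather than an in-memory array, but the session boundary logic is the same idea.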

_9hey 7 years ago

Endless loading... I think there's a bug

tyingq 7 years ago

Genuine question. What does this do that just inserting vanilla GA code in the page doesn't? Trying to understand the "why".

  • teej 7 years ago

    Some people don’t want to put a GA tag on their site because of concerns around how Google uses the data. Also you can’t arbitrarily query GA data so this gives you that capability.

graphememes 7 years ago

is PHP serverless :thinking:

InGodsName 7 years ago

Please explain what is cube.js doing in this? I mean, what exactly cubejs does.

  • pavel_tiunov 7 years ago

    Thanks for the question! We should do a better job describing this. In short, it:

    1. Generates analytic SQL queries based on the Cube.js schema. These can be simple ones, like calculating page views, or more advanced ones, like calculating session metrics, attribution models, or funnels.

    2. Caches SQL responses so as not to overwhelm the SQL backend with user requests.

    3. Pre-aggregates data to be able to query trillions of data points in a matter of seconds.

    4. Orchestrates SQL query execution: organizes dependencies between pre-aggregations, queue priorities, and cache refreshes.

    5. Provides a REST analytics API for end users.
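For illustration, here is a minimal sketch of the kind of schema the SQL generation works from. The table and column names are hypothetical, and the `cube()` stub stands in for the real Cube.js runtime, which provides that function itself:

```javascript
// Stub of Cube.js's schema DSL so this sketch runs standalone; in a
// real Cube.js project, cube() is provided by the framework.
const cube = (name, definition) => ({ name, ...definition });

// Illustrative schema (hypothetical table/column names): the kind of
// definition Cube.js compiles into analytic SQL against Athena.
const PageViews = cube('PageViews', {
  sql: `SELECT * FROM events WHERE event_type = 'pageview'`,
  measures: {
    count: { type: 'count' },
  },
  dimensions: {
    url: { sql: 'url', type: 'string' },
    createdAt: { sql: 'created_at', type: 'time' },
  },
});
```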

    • InGodsName 7 years ago

      Why do you need to select all rows in your Cube.js section when you can directly run the query in Athena and get back the aggregates you need?

      Basically, you select all rows and then Cube.js does something on those rows, when you can in fact directly run queries in Athena.

      Am I missing something?
