Ask HN: How can I trust Google Analytics?
I've made a small proof of concept with Google Analytics (GA). While running my frontend code from localhost, I found that the events already showed up in my GA account. So GA does not validate where events come from (no domain check or anything similar). Since the tracking ID is public, anyone can send arbitrary events using someone else's tracking ID and skew the insights in their GA dashboard. I have published the code at github.com/goferito/gapoc in case someone wants to take a look, even though it's pretty simple.
So the question is, how can I know someone is not sending events (pageview events or whatever) using my tracking ID? Is there any way in GA to filter those, before or after GA stores them?

I do marketing ops consulting and see this stuff all the time. First, let's get two things out of the way:

1. Yes, Google Analytics can be quite useless if you keep default settings with no configuration.
2. That doesn't mean you should jump straight to a self-hosted solution, or a paid solution, or throw up your hands and say "it'll never be accurate."

For most use cases, GA is more than good enough to measure effectiveness of online marketing efforts. Dismissing it outright in favor of a paid or self-hosted option just because you didn't google "how to prevent analytics hijacking" is bad decision-making. /rant

Now on to the fix... You can create a filter in your GA view settings to ignore tracking calls from any hostname other than your own. See here: https://support.google.com/analytics/answer/1033162?hl=en

PS - No client-side analytics will ever be 100% accurate, certainly not GA. But for the purposes of measuring marketing efforts and results, you can have greater tolerances. It's a tool for marketing, not logging.

I would also add that once you set up Google Analytics correctly, it should be a good measure of month-to-month or day-to-day improvement(s) (within some error bound).

Good point. What about the PHP SDK though? Hostname can also be easily faked.

That's true, but at that point you're really asking if it's possible to send JavaScript to an attacker to run and have them not be able to arbitrarily alter what that code does. In which case the answer is of course not, regardless of what the code is. Usually the answer is gk1's above and keeping an eye on server logs to see if they match up with the client analytics data you're getting. You can even have events sent from both in GA or Piwik or whatever so you can compare them in the same UI, looking at e.g.
event flow so that everyone who loaded some data first triggered a fetch event on the server for that data. Of course then your attacker can just get a botnet to start mindlessly doing page views of your site...

You also need to add regex filters for Campaign Source like:

"semalt\.|social.?buttons\.|hulfingtonpost\.|best-seo-(solution|offer|service)|free.traffic|buy-cheap-online|prodvigator|cenokos\.|ranksonic\.|adcash\.|share.?buttons\.|blackhatworth|buttons-for.?website|darodar\.|100dollars-seo"

to help keep down the spam.

No, nothing is safe. See https://news.ycombinator.com/item?id=7477736 or https://news.ycombinator.com/item?id=8869880

Nice experiment! Link for the lazy: https://github.com/goferito/gapoc

I guess SEO people already know this. The question is: can you trust an SEO consultant?

Only if he's the top search result for "SEO consultant".

Unsurprisingly, such a person exists, though it probably differs by region and various other factors.

Hire them on contingency, and then verify their results. Use something like http://zoomrank.com/ to monitor the position of your site in the various search engines over time; establish a baseline, hire the SEO consultant, and look at your placement graphs. Did you fail to improve, or even go down? Then don't pay the snake oil salesman.

As a marketing consultant: No, usually you cannot. There are great ones out there, but for most people it's too difficult to screen out the quacks. Go by referrals.

Nope. You can't trust an SEO consultant, and it's also hard to trust a company or product that uses an SEO consultant's services. If somebody has good, reliable products or services, they don't need any phony SEO tricks.

Is the best content always #1 in Google? I wish that was the case, but it's not. Until Google can evaluate content without external 'signals', SEO will be a fact of life.

GA's deficiencies are widely exploited to inflate traffic figures during website sales negotiations.
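To make the hostname and Campaign Source filters suggested in this thread a bit more concrete, here is a minimal sketch for checking the patterns offline before entering them in GA's view settings. GA applies its filters itself; this code only mimics the logic. The `example.com` hostname pattern, the `keepHit` helper and the shape of the `hit` object are assumptions made for illustration, and only the spam regex is taken from the comment above.

```javascript
// Campaign Source spam pattern, copied from the comment above.
var SPAM_SOURCE = /semalt\.|social.?buttons\.|hulfingtonpost\.|best-seo-(solution|offer|service)|free.traffic|buy-cheap-online|prodvigator|cenokos\.|ranksonic\.|adcash\.|share.?buttons\.|blackhatworth|buttons-for.?website|darodar\.|100dollars-seo/;

// Hypothetical include pattern for your own hostname (assumption: example.com).
var OWN_HOSTNAME = /^(www\.)?example\.com$/;

// Keep a hit only if it reports our hostname and its source is not spam.
// `hit` is a hypothetical object: { hostname: string, source: string }.
function keepHit(hit) {
  return OWN_HOSTNAME.test(hit.hostname) && !SPAM_SOURCE.test(hit.source);
}

// Example: a hit sent from a local copy of the tracking code is dropped.
var sample = [
  { hostname: 'www.example.com', source: 'google' },
  { hostname: 'localhost', source: 'google' },
  { hostname: 'example.com', source: 'buttons-for-website.com' }
];
console.log(sample.filter(keepHit).length + ' of ' + sample.length + ' hits kept');
```

As noted elsewhere in the thread, the hostname is reported by the client and can be faked, so a check like this only removes the lazier kind of spam.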
GA is really not a product you want to trust your business with. The best approach is to consider self-hosted analytics solutions. I built my own for my needs, which also includes combined features for security analytics to investigate malware attacks. GA is totally useless in this aspect.

> features for security analytics to investigate malware attacks. GA is totally useless in this aspect.

Of course it is, it's a marketing analytics tool. It's totally useless for many things that aren't related to marketing.

Concealing IP addresses and ignoring spammy referrers doesn't help in marketing either.

There is a workaround, but it will reduce the number of data points available to GA and put stress on your box: use server-side tracking calls. As said, this will remove all data points which are usually gathered by the GA JavaScript. The same thing is possible with Piwik. You _could try_ to have custom JS that would gather those data points, e.g. screen resolution.

You can't know. GA spam is rampant, more so via referrer spam than anything else.

This is the correct answer. You may be able to filter data to some extent in the dashboard using special views, but if you want a 100% guarantee that your data stays reliable, currently your best bet is not to use GA, or at least to complement it with another tool.

Referrer spam is the worst. This was on a reddit thread today: https://i.imgur.com/mRGiiBQ.png (note the "how to stop referral spam" URL).

The server cannot know whether an event is coming from a browser or not. Anyone can make a request look like it is coming from a browser while sending it from another program, although you can't do it inside a proper browser.

Another caveat is that you have to wait 72 hours after the event before you can be reasonably sure the counts aren't going to change any more. Sure, you get some results immediately, but for some reason, some take a long time to settle.
I'm guessing it is a massive eventually consistent distributed database, and that GA hits are going to the nearest or least busy nodes and it just takes a while for them all to sync up.

I experienced this a few times when somebody cloned my whole website, GA tracking code included. Also, with the increasing referrer spam and the new trend of ad-blocking plugins (they block GA too), Google Analytics has become less reliable than ever. However, you can set up open source analytics software on your own server, like [Piwik](http://piwik.org/).

In addition to the other comments, you could always try to use another analytics product in parallel (from time to time, at random points in the year) to quickly validate the accuracy of the results. This will serve as an indicator and also validate assumptions regarding the integrity of the analytics.

Update your JavaScript tracking code to include a nonce generated server-side. Send the nonce along with the rest of the report to the tracking server. Filter out reports with duplicate or missing nonces. Dunno if you can do it with GA; you might have to hack it into Piwik.

You can add filters to exclude data before it gets recorded: http://viget.com/advance/removing-referral-spam-from-google-...

Analytics is useful but the information is certainly not to be trusted completely, especially on the e-commerce side.

What blows my mind is that they aren't doing more to fight the referral / event tracking spam. It's totally out of control.

You can use a GA filter based on your domain name. It solved my problem.

But you can also fake the domain, can't you? By changing /etc/hosts or using a personal DNS server. Anyway, it looks like the best option.

If you are a Google Analytics Premium customer, your raw dataset is automatically available in BigQuery, so you can see down to every click and run your own SQL on it.

I understand you're supposed to whitelist in GA which pages are allowed to send a given tracking ID?
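The server-generated nonce idea mentioned above could look roughly like the following. This is a minimal Node.js sketch for a self-hosted collector, not a GA or Piwik feature. The names `issueNonce` and `acceptReport`, the `report` object and the in-memory `Set` are assumptions for illustration, and a real deployment would need shared storage and expiry for unused nonces. The sketch also rejects nonces that were never issued, which goes slightly beyond the "duplicate or missing" rule in the comment.

```javascript
var crypto = require('crypto');

// Nonces handed out to rendered pages but not yet seen in a report.
// Assumption: a single process with in-memory state.
var issued = new Set();

// Call server-side when rendering a page; embed the result in the tracking code.
function issueNonce() {
  var nonce = crypto.randomBytes(16).toString('hex');
  issued.add(nonce);
  return nonce;
}

// Accept a report only if it carries a nonce that was issued and not yet used.
// `report` is a hypothetical object: { nonce: string, ...event fields }.
function acceptReport(report) {
  if (!report || !report.nonce || !issued.has(report.nonce)) {
    return false; // missing or unknown nonce
  }
  issued.delete(report.nonce); // single use, so a replayed report is rejected
  return true;
}
```

An attacker can still load real pages to collect valid nonces, so this mainly raises the cost per fake event. It does not prove that a report came from a genuine visitor.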
We just ran into a problem with Google Analytics when trying to track clicks that open links by sending an event to GA. It turns out that when you click a link, the browser loads the new page before the event can be sent to GA. This screwed up a huge amount of our click tracking data in GA.

Why not just use an event handler which sends the event to GA, and after that, follows your link? They provide a callback and everything...

You can also do

    // Block the unload for ~300 ms to give the pending tracking request time to be sent.
    window.addEventListener('unload', function () {
      var now = Date.now();
      while (Date.now() < now + 300) {}
    });

And now you have a website that feels slow to respond to link clicks.

This isn't a problem with GA, it's a general problem with trying to track outbound link clicks: when your browser navigates away it kills execution of scripts, often preventing an event from being fired back to the server. There are solutions, but they're all pretty unpleasant. The most common is to use an onclick handler on your outbound link that only navigates after the tracking event has been fired. Google Analytics covers this in its own documentation here: https://support.google.com/analytics/answer/1136920?hl=en

See: https://developers.google.com/analytics/devguides/collection...

    // Send the click event, then navigate once GA reports that the hit was sent.
    var outboundLink = function (url) {
      ga('send', 'event', 'button', 'click', {
        'hitCallback': function () {
          document.location = url;
        }
      });
    };

    <a href="#" onclick="outboundLink('https://www.foo.com'); return false;">Foo</a>

This will work even after you navigate to another page, but only if the browser supports navigator.sendBeacon(). Currently not IE or Safari.

    ga('send', 'event', 'button', 'click', {transport: 'beacon'});
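One caveat with the hitCallback approach quoted above: if the analytics script is blocked or fails to load, the callback may never fire and the link never navigates. A common mitigation is to wrap the callback so that it also runs after a timeout. The sketch below shows the idea. `withTimeout` and `trackOutbound` are hypothetical helper names, the 1000 ms default is an arbitrary choice, and `send` stands in for the global `ga` function so the logic can be exercised outside a browser.

```javascript
// Returns a wrapper that runs `callback` at most once: either when the wrapper
// is invoked (e.g. by hitCallback) or after `timeoutMs`, whichever comes first.
function withTimeout(callback, timeoutMs) {
  var called = false;
  function run() {
    if (!called) {
      called = true;
      callback();
    }
  }
  setTimeout(run, timeoutMs || 1000);
  return run;
}

// Hypothetical helper: `send` is the tracker function (ga in the browser),
// `navigate` performs the actual navigation.
function trackOutbound(send, url, navigate) {
  send('send', 'event', 'button', 'click', {
    hitCallback: withTimeout(function () {
      navigate(url);
    }, 1000)
  });
}
```

In a page this could be called as `trackOutbound(ga, url, function (u) { document.location = u; })`. The visitor then waits at most one second if the hit is never confirmed.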