Ask HN: Best way to store web traffic logs?

5 points by Mamady 12 years ago · 7 comments · 1 min read


I am in the process of setting up a new website, and was hoping to get some advice on how others store web traffic log data.

At the moment I am thinking of just using Google Analytics and Mixpanel.

I have no intention of saving Apache access log files, so I'm looking for a web-based service. I was considering rolling my own custom db table just to log hits, but it just feels wrong.

What are you using? If you are storing data in a db, what is the db (e.g. Mongo, Postgres, Cassandra)?

sehrope 12 years ago

How you store your logs depends on your server configuration. Analytics services like Google Analytics or Mixpanel will work for any type of config as they're initiated by the client. They both also have a nice UI so you can see live users, plot them on maps, etc.

If you want lower-level detail, such as each user's IP address, you'll need something on the server side. I haven't used Mixpanel, but Google Analytics doesn't give you raw IP addresses. Also, if a user has it blocked (e.g. by Ghostery) then you don't see them in Google Analytics. To get around this we also log all requests server side.

The two options I know of are to do it yourself (that's what we did, more below) or to use something like Piwik (http://piwik.org/). The latter is kind of like your own Google Analytics that you run on your own infrastructure.

For our public cloud app (https://cloud.jackdb.com/) we run all the infrastructure so we aggregate the server access logs from each nginx instance and push them to an S3 bucket. It's pretty straightforward and really cheap (S3 costs peanuts and log data gzips well). Besides audit events (which do get logged to a database and can be queried) any funky research is done by good ol' awk/grep/sed.
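
To make that concrete, here's a minimal sketch of the "gzip a rotated nginx access log and push it to S3" step, written in Python with boto3; the log path and bucket name are placeholders, not the actual jackdb.com setup:

```python
import gzip
import shutil
from datetime import date
from pathlib import Path

import boto3  # pip install boto3

# Placeholder paths and bucket name -- not the actual jackdb.com setup.
LOG_FILE = Path("/var/log/nginx/access.log.1")  # most recently rotated access log
BUCKET = "example-access-log-archive"

def push_rotated_log() -> None:
    """Gzip the rotated nginx access log and upload it to S3, keyed by date."""
    gz_path = LOG_FILE.parent / (LOG_FILE.name + ".gz")
    with LOG_FILE.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    key = f"nginx/{date.today():%Y/%m/%d}/access.log.gz"
    boto3.client("s3").upload_file(str(gz_path), BUCKET, key)

if __name__ == "__main__":
    push_rotated_log()
```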

Our public website (http://www.jackdb.com) is hosted on S3 so we don't even control the actual server. Instead we've got logging enabled on the S3 bucket, sent to another S3 bucket[1]. S3 creates files there, with a 1-3 hour lag, covering all requests with full details (IP, user agent, etc). The only pain is that S3 creates a lot of files, so we've got a cron job that runs regularly to combine them into daily files, gzip them, and put them in a different S3 bucket. Again, ad hoc research is done via Unix commands on either the latest log files or the archived files (we keep a local copy in addition to the ones in S3).
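
A rough sketch of that combining job (Python/boto3 again, with made-up bucket names) might look like the following; it assumes the default S3 server-access-log key format, where each key starts with the date, so the date doubles as a listing prefix:

```python
import gzip

import boto3  # pip install boto3

# Hypothetical bucket names -- the real ones will differ.
SOURCE_BUCKET = "example-site-raw-logs"      # where S3 writes its access-log files
ARCHIVE_BUCKET = "example-site-log-archive"  # where the combined daily files go

def combine_day(day: str) -> None:
    """Concatenate one day's S3 access-log objects into a single gzipped file.

    `day` is a YYYY-MM-DD string; with an empty target prefix, S3 server-access
    log keys start with that date, so it works as the listing prefix.
    """
    s3 = boto3.client("s3")
    combined = bytearray()

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=day):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=SOURCE_BUCKET, Key=obj["Key"])["Body"].read()
            combined.extend(body)

    s3.put_object(
        Bucket=ARCHIVE_BUCKET,
        Key=f"daily/{day}.log.gz",
        Body=gzip.compress(bytes(combined)),
    )

if __name__ == "__main__":
    combine_day("2013-05-01")
```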

Regardless of how you get your logs onto S3, if you want to make the storage costs 10x cheaper in the long run (again, this will only matter once you actually have a significant amount of data), push them from S3 to Glacier. Even better, you can set up S3 to auto-transition data to Glacier after X days[2]. Just remember that you can't access the files directly from Glacier; it's just for "cold storage".
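
Setting up that auto-transition is a single lifecycle rule; here's an illustrative boto3 version with a placeholder bucket name, prefix, and 90-day window (the real values depend on your own buckets and retention needs):

```python
import boto3  # pip install boto3

# Placeholder bucket name, prefix, and retention window.
BUCKET = "example-site-log-archive"
DAYS_BEFORE_GLACIER = 90

# One lifecycle rule: after N days, move everything under logs/ to Glacier.
boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-logs-to-glacier",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": DAYS_BEFORE_GLACIER, "StorageClass": "GLACIER"}
                ],
            }
        ]
    },
)
```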

[1]: http://docs.aws.amazon.com/AmazonS3/latest/dev/ServerLogs.ht...

[2]: http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecy...

  • MamadyOP 12 years ago

    This sounds really interesting, but it feels very painful for reporting. Trying to run reports on this type of data for a dashboard seems out of the question.

    I understand ad hoc research is done via the command line, but if you wanted to have a dashboard which shows stats, how would you handle that? I assume you would have to import into a db of some sort to run queries regularly?

    • sehrope 12 years ago

      > This sounds really interesting, but it feels very painful for reporting. Trying to run reports on this type of data for a dashboard seems out of the question.

      Yes, we've saved it mainly to look at later. It's the lowest level of detail, so I figure we can mine it later. To do anything useful with it, it would have to be processed, though I don't think that's much work.

      > I understand ad hoc research is done via the command line, but if you wanted to have a dashboard which shows stats, how would you handle that?

      We don't use those files for reporting. We have user stat reports (activity, actions, etc.) generated and look at the numbers themselves, but nothing I'd consider "pretty". All of those are generated from the audit trail data, so it's already in a database. The ad hoc reporting from the command line is for when I want to trace something specific in detail. I've usually filtered it down to a small enough set that grep/awk is more than enough, and it only takes a couple of seconds.
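
      Purely for illustration, a minimal report along those lines could look like the sketch below; the audit_events table and its columns are assumptions, since the actual schema isn't described here, and SQLite stands in for whatever database actually holds the audit trail:

      ```python
      import sqlite3

      # Hypothetical schema: the real audit tables aren't described here, so
      # assume a single audit_events(user_id, action, occurred_at) table.
      conn = sqlite3.connect("audit.db")
      conn.executescript("""
          CREATE TABLE IF NOT EXISTS audit_events (
              user_id     INTEGER NOT NULL,
              action      TEXT    NOT NULL,
              occurred_at TEXT    NOT NULL  -- ISO-8601 timestamp
          );
      """)

      def daily_activity_report() -> None:
          """Count distinct active users and total actions per day."""
          rows = conn.execute("""
              SELECT date(occurred_at)       AS day,
                     COUNT(DISTINCT user_id) AS active_users,
                     COUNT(*)                AS actions
              FROM audit_events
              GROUP BY day
              ORDER BY day
          """).fetchall()
          for day, active_users, actions in rows:
              print(f"{day}: {active_users} active users, {actions} actions")

      if __name__ == "__main__":
          daily_activity_report()
      ```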

darkxanthos 12 years ago

If you have the cash and analytics aren't a real, distinct business advantage, just go with Mixpanel.

If you decide to do it yourself, this is what I've done: create a small web service that you can call to log data from the UI. Start with one server, and if it consistently goes over 60-80% usage, add a second.

The server should log every call to the service in a large flat file (CSV is easiest). The file should be named by date and time down to the minute. As you scale up servers, you just have a process pull down each file and aggregate them server side. Or just throw them into S3 and use Hive/EMR to report on the data.
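
Here's a minimal sketch of that kind of logging endpoint, written in Python/Flask rather than the Ruby/Sinatra mentioned below; the /log route, field names, and log directory are all made up for illustration:

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

from flask import Flask, request  # pip install flask

app = Flask(__name__)
LOG_DIR = Path("event-logs")  # placeholder directory for the flat files
LOG_DIR.mkdir(exist_ok=True)

@app.route("/log", methods=["POST"])
def log_event():
    """Append one event per request to a CSV file named down to the minute."""
    now = datetime.now(timezone.utc)
    path = LOG_DIR / now.strftime("%Y-%m-%d-%H-%M.csv")
    payload = request.get_json(silent=True) or {}

    with path.open("a", newline="") as f:
        csv.writer(f).writerow([
            now.isoformat(),
            request.remote_addr,
            payload.get("event", ""),
            payload.get("user_id", ""),
        ])
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)
```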

It's a middle-class man's Mixpanel. I served tens of millions of logging events a day with this solution. At the time the cost was somewhere around $1,500 a month I believe. I was running 6 servers on Ruby/Sinatra though and never tried to optimize much.

EDIT: typo

  • MamadyOP 12 years ago

    I guess the part I'm really keen to set up is the S3 + Hive/EMR part. But it sounds like it will be a bit too expensive to have up-to-date stats, and better to just do batch processing to run reports.

taylorbuley 12 years ago

If you are planning to run at any sort of scale, I advise staying away from logging directly to a database. Tying request throughput to database I/O like that could really hurt.

rip747 12 years ago

Just use Google Analytics. It's free and offers extremely powerful reporting. Trying to roll your own solution is a total waste of time.

The only thing you should be doing with your logs is archiving them in case of a security breach, so you can try to pinpoint how the attack happened.

Don't waste space on your LAN for the log archives either. Get an S3 account, zip them up, and store them on S3.
