Big Data Real Time Analytics - A Facebook Case Study

  • 1.
  • 2.

    The Real-Time Boom
    - Google real-time web analytics
    - Google real-time search
    - Facebook real-time social analytics
    - Twitter paid-tweet analytics
    - SaaS real-time user tracking
    - New real-time analytics startups

  • 3.
  • 4.
  • 5.
  • 6.

    Traditional analytics applications: the scale-up database
    - Use a traditional SQL database
    - Use stored procedures for event-driven reports
    - Use flash-memory disks to reduce disk I/O
    - Use read-only replicas to scale out read queries
    Limitations
    - Doesn't scale on writes
    - Extremely expensive (hardware + software)

  • 7.

    CEP – Complex Event Processing
    - Process the data as it arrives
    - Maintain a window of the data in memory
    Pros
    - Extremely low latency
    - Relatively low cost
    Cons
    - Hard to scale (mostly limited to scale-up)
    - Not agile: queries must be pre-generated
    - Fairly complex
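
    To make the windowing idea concrete, here is a minimal sketch (not from the deck) of a CEP-style counter that processes each event as it arrives and keeps only the last minute of events in memory; the Event class and the one-minute window are assumptions for the example.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Sketch of CEP-style processing: keep a sliding one-minute window of events
    // in memory and answer the pre-generated query "how many events in the last minute?".
    public class SlidingWindowCounter {

        // Hypothetical event carrying just a timestamp, for the purpose of the sketch.
        static final class Event {
            final long timestampMillis;
            Event(long timestampMillis) { this.timestampMillis = timestampMillis; }
        }

        private static final long WINDOW_MILLIS = 60 * 1000L;   // one-minute window (assumption)
        private final Deque<Event> window = new ArrayDeque<Event>();

        // Called for every incoming event: process the data as it arrives.
        public synchronized void onEvent(Event e) {
            window.addLast(e);
            evictExpired(e.timestampMillis);
        }

        // The continuous query answered entirely from memory.
        public synchronized int countLastMinute(long nowMillis) {
            evictExpired(nowMillis);
            return window.size();
        }

        // Drop events that fell out of the window so memory stays bounded.
        private void evictExpired(long nowMillis) {
            while (!window.isEmpty()
                    && window.peekFirst().timestampMillis < nowMillis - WINDOW_MILLIS) {
                window.removeFirst();
            }
        }
    }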

  • 8.

    In-Memory Data Grid
    - Distributed in-memory database
    - Scales out
    Pros
    - Scales on write and read
    - Fits the event-driven (CEP-style) and ad-hoc query models
    Cons
    - Cost of memory vs. disk
    - Memory capacity is limited
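
    As a rough illustration of the event-driven plus ad-hoc query point, the sketch below uses the GigaSpaces OpenSpaces API that the rest of the deck builds on; the space name, the Data POJO and its fields are assumptions, and the exact API may differ between versions.

    import com.gigaspaces.annotation.pojo.SpaceId;
    import com.j_spaces.core.client.SQLQuery;
    import org.openspaces.core.GigaSpace;
    import org.openspaces.core.GigaSpaceConfigurer;
    import org.openspaces.core.space.UrlSpaceConfigurer;

    // Sketch only: write events into an in-memory data grid and run an ad-hoc query over them.
    public class DataGridExample {

        // Hypothetical event POJO stored in the grid.
        public static class Data {
            private String id;
            private Boolean processed;

            public Data() { }

            @SpaceId(autoGenerate = true)
            public String getId() { return id; }
            public void setId(String id) { this.id = id; }

            public Boolean getProcessed() { return processed; }
            public void setProcessed(Boolean processed) { this.processed = processed; }
        }

        public static void main(String[] args) {
            // Connect to an embedded space (the name "analyticsSpace" is an assumption).
            GigaSpace gigaSpace = new GigaSpaceConfigurer(
                    new UrlSpaceConfigurer("/./analyticsSpace").space()).gigaSpace();

            // Writes are routed to a single partition, so they scale out with the grid.
            Data event = new Data();
            event.setProcessed(false);
            gigaSpace.write(event);

            // Ad-hoc SQL-style query served entirely from memory.
            SQLQuery<Data> query = new SQLQuery<Data>(Data.class, "processed = false");
            Data[] pending = gigaSpace.readMultiple(query, 100);
            System.out.println("Pending events: " + pending.length);
        }
    }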

  • 9.

    NoSQL
    - Use a distributed database: HBase, Cassandra, MongoDB
    Pros
    - Scales on write and read
    - Elastic
    Cons
    - Read latency
    - Consistency tradeoffs are hard
    - Maturity: fairly young technology
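
    As a hedged illustration of the write-scaling point (not taken from the deck), a counter update with the 2011-era HBase client API looks roughly like this; the table name, column family and qualifier are assumptions for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: atomically bump a per-domain "likes" counter in HBase.
    public class HBaseCounterExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "plugin_counters");   // hypothetical table

            // Writes are spread across region servers by row key, which is how they scale out.
            long newValue = table.incrementColumnValue(
                    Bytes.toBytes("example.com"),   // row key: the domain
                    Bytes.toBytes("stats"),         // column family (assumption)
                    Bytes.toBytes("likes"),         // qualifier (assumption)
                    1L);                            // one more event

            System.out.println("likes for example.com = " + newValue);
            table.close();
        }
    }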

  • 10.

    Hadoop MapReduce
    - Distributed batch processing
    Pros
    - Designed to process massive amounts of data
    - Mature
    - Low cost
    Cons
    - Not real-time
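
    For contrast with the real-time options above, here is a minimal sketch of the batch path: a MapReduce job that counts events per domain from log lines. The "domain <tab> rest-of-line" log format is an assumption for the example.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch: classic word-count-style job, aggregating events per domain.
    public class DomainCount {

        public static class DomainMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text domain = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Emit (domain, 1) for every log line.
                String[] fields = line.toString().split("\t");
                domain.set(fields[0]);
                context.write(domain, ONE);
            }
        }

        public static class DomainReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text domain, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                // Sum the counts for each domain. This only runs when the batch job runs,
                // which is exactly why this path is not real-time.
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(domain, new IntWritable(sum));
            }
        }
    }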

  • 11.

    Hadoop Map/Reduce – Reality check..

  • 12.

    So what's the bottom line?

  • 13.
  • 14.

    Goals
    - Show why plugins are valuable: what value is your business deriving from them?
    - Make the data more actionable: help users take action to make their content more valuable. How many people see a plugin, how many take action on it, and how many are converted to traffic back on your site?
    - Make the data more timely: went from a 48-hour turnaround to 30 seconds; multiple points of failure were removed to meet this goal.
    - Handle massive load: 20 billion events per day (200,000 events per second).

  • 15.

    The actual analytics
    - Like button analytics
    - Comments box analytics

  • 16.

    Technology evaluation
    - MySQL DB counters
    - In-memory counters
    - MapReduce
    - Cassandra
    - HBase

  • 17.

    The solution (pipeline diagram)
    - Facebook logs are collected by Scribe and tailed by PTail
    - Real time: PTail -> Puma -> HBase (about 1.5 sec latency, ~10,000 writes/sec per server)
    - Long term / batch: HDFS
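
    The Puma stage is essentially micro-batching: counters are aggregated in memory and flushed to HBase on a short interval, trading roughly 1.5 seconds of latency for far fewer writes. The sketch below illustrates that idea only; the class, the per-domain key and the flush interval are assumptions, not Facebook's code.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: aggregate counters in memory and flush them to the store periodically.
    public class MicroBatchAggregator {

        private final ConcurrentHashMap<String, AtomicLong> counters =
                new ConcurrentHashMap<String, AtomicLong>();
        private final ScheduledExecutorService flusher =
                Executors.newSingleThreadScheduledExecutor();

        public void start() {
            // Flush every 1.5 seconds, matching the latency figure on the slide.
            flusher.scheduleAtFixedRate(new Runnable() {
                public void run() { flush(); }
            }, 1500, 1500, TimeUnit.MILLISECONDS);
        }

        // Called for every tailed log event, e.g. a Like on a given domain.
        public void onEvent(String domain) {
            AtomicLong counter = counters.get(domain);
            if (counter == null) {
                AtomicLong fresh = new AtomicLong();
                counter = counters.putIfAbsent(domain, fresh);
                if (counter == null) {
                    counter = fresh;
                }
            }
            counter.incrementAndGet();
        }

        // Drain the in-memory counters and persist one increment per key.
        private void flush() {
            for (Map.Entry<String, AtomicLong> entry : counters.entrySet()) {
                long delta = entry.getValue().getAndSet(0);
                if (delta > 0) {
                    persist(entry.getKey(), delta);
                }
            }
        }

        // Placeholder for the real write, e.g. an HBase counter increment.
        private void persist(String domain, long delta) {
            System.out.println(domain + " += " + delta);
        }
    }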

  • 18.
  • 19.

    Facebook Analytics, next: what if..
    - we can rely on memory as a reliable store?
    - we can't decide on a particular NoSQL database?
    - we need to package the solution as a product?

  • 20.

    Step 1: Use memory
    - Instead of treating memory as a cache, why not treat it as the primary data store?
    - Facebook keeps 80% of its data in memory (Stanford research)
    - RAM is 100-1000x faster than disk for random access: disk ~5-10 ms, RAM ~0.001 ms
    (Diagram: Facebook events flowing into a memory grid of data-grid partitions)

  • 21.

    Step 1: Use memory (continued)
    - Reliability is achieved through redundancy and replication
    - One Data. Any API.
    (Diagram: Facebook events reaching the data grid through any API)
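
    As a rough sketch of what redundancy and replication look like in the GigaSpaces model the deck assumes, a space is typically deployed partitioned with synchronous backups. The space name, partition count and backup count below are assumptions, and the exact URL syntax may vary by product version.

    import org.openspaces.core.GigaSpace;
    import org.openspaces.core.GigaSpaceConfigurer;
    import org.openspaces.core.space.UrlSpaceConfigurer;

    // Sketch: an embedded space split into 4 primary partitions, each with one
    // synchronous backup, so every write is replicated before it is acknowledged.
    public class ReliableSpaceExample {
        public static void main(String[] args) {
            GigaSpace gigaSpace = new GigaSpaceConfigurer(
                    new UrlSpaceConfigurer(
                            "/./analyticsSpace?cluster_schema=partitioned-sync2backup&total_members=4,1")
                            .space())
                    .gigaSpace();

            System.out.println("Connected to space: " + gigaSpace);
        }
    }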

  • 22.

    Step 2: Collocate
    - Put the processing code together with the data it works on (listener code shown on the next slide)
    (Diagram: Facebook events flowing into a processing grid collocated with the data-grid partitions)

  • 23.

    Step 2: Collocate (continued)
    - A polling event listener runs inside each data-grid partition, next to the data it processes:

    import org.openspaces.events.EventDriven;
    import org.openspaces.events.EventTemplate;
    import org.openspaces.events.adapter.SpaceDataEvent;
    import org.openspaces.events.polling.Polling;

    @EventDriven
    @Polling
    public class SimpleListener {

        // Template that selects which events this listener receives.
        @EventTemplate
        Data unprocessedData() {
            Data template = new Data();
            template.setProcessed(false);
            return template;
        }

        // Invoked for each matching event taken from the collocated partition.
        @SpaceDataEvent
        public Data eventListener(Data event) {
            // process Data here
            event.setProcessed(true);
            return event;   // the returned object is written back to the space
        }
    }

  • 24.

    Step 3: Write behind to SQL/NoSQL
    - The data grid persists updates asynchronously (write-behind) to an open long-term persistency layer
    (Diagram: Facebook events -> processing grid / data grid -> write-behind -> long-term store)
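
    The write-behind pattern itself is easy to sketch in plain Java; this is an illustrative sketch of the idea, not the GigaSpaces persistency API. Updates are acknowledged at in-memory speed and pushed to the backing SQL/NoSQL store by a background thread.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of write-behind: the in-memory write returns immediately, and a
    // background thread drains the queue into the long-term store.
    public class WriteBehindStore {

        // Hypothetical long-term store interface, standing in for a SQL/NoSQL client.
        public interface LongTermStore {
            void persist(String key, String value);
        }

        private final BlockingQueue<String[]> pending = new LinkedBlockingQueue<String[]>();
        private final LongTermStore backend;

        public WriteBehindStore(LongTermStore backend) {
            this.backend = backend;
            Thread flusher = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            // Block until there is something to persist, then write it out.
                            String[] entry = pending.take();
                            WriteBehindStore.this.backend.persist(entry[0], entry[1]);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();   // shut down quietly
                    }
                }
            });
            flusher.setDaemon(true);
            flusher.start();
        }

        // The caller sees in-memory latency; durability to disk happens behind it.
        public void put(String key, String value) {
            pending.offer(new String[] { key, value });
        }
    }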

  • 25.

    Economic data scaling: combine memory and disk
    - For data accessed at a high rate, memory is 100-1000x cheaper than disk (Stanford research)
    - Disk is cheaper for high capacity at a low access rate
    - Solution: memory for short-term data, disk for long-term data
    - Only ~16 GB is needed to keep the recent log in memory (500-byte messages at 10K/sec, roughly an hour's worth), at a cost of about $32/month per server

  • 26.

    Economic scaling
    - Automation: reduces operational cost
    - Elastic scaling: reduces over-provisioning cost
    - Cloud portability (JClouds): choose the right cloud for the job
    - Cloud bursting: scavenge extra capacity when needed

  • 27.

    Putting it all together (architecture diagram)
    - Event sources feed the analytics application
    - In-memory data grid / real-time processing grid: light event processing, map/reduce, event driven, executes code with the data, transactional, secured, elastic
    - Write-behind to a NoSQL DB: low-cost storage, write/read scalability, dynamic scaling, holds raw and aggregated data used to generate patterns

  • 28.

    Putting it all together (continued): real-time map/reduce, executing a script next to the data through the GigaSpaces JPA interface (em is a GigaSpaces JPA EntityManager):

    Script script = new StaticScript("groovy", "println 'hi'; return 0");
    Query q = em.createNativeQuery("execute ?");
    q.setParameter(1, script);
    Integer result = (Integer) q.getSingleResult();

  • 29.

    5x better performance per server!
    Hardware: Linux HP DL380 G6 servers, each with:
    - 2 Intel quad-core Xeon X5560 processors (2.8 GHz Nehalem)
    - 32 GB RAM (4 GB per core)
    - 6 x 146 GB 15K RPM SAS disks
    - Red Hat 5.2
    Setup: an event injector (up to 128 threads) driving GigaSpaces or another messaging server, with app services (up to 128 threads)
    (Chart: GigaSpaces vs. the other server, ~50,000 writes/sec per server for GigaSpaces)

  • 30.

    Live demo
    - Intraday activity (real time)
    - Monthly trend analysis

  • 31.

    5 Big Data Predictions

  • 32.

    Summary
    - Big Data development made simple: focus on your business logic and let the Big Data platform deal with scalability, performance, continuous availability, and more
    - It's open, so use any stack and avoid lock-in: any database (RDBMS or NoSQL), any cloud, common APIs and frameworks
    - All while minimizing cost: memory plus disk for optimum cost/performance, built-in automation and management to reduce operational cost, and elasticity to reduce over-provisioning cost

  • 33.
  • 34.