News from Mahout

  • 1.
  • 2.

    whoami – Ted Dunning
    Chief Application Architect, MapR Technologies
    Committer and member, Apache Software Foundation – particularly Mahout, Zookeeper and Drill (we're hiring)
    Contact me at tdunning@maprtech.com, tdunning@apache.org, ted.dunning@gmail.com, @ted_dunning

  • 3.

    Slides and such (available late tonight):
      – http://www.mapr.com/company/events/nyhug-03-05-2013
    Hash tags: #mapr #nyhug #mahout

  • 4.

    New in Mahout
    0.8 is coming soon (1-2 months)
    Gobs of fixes
    QR decomposition is 10x faster
      – makes ALS 2-3 times faster
    May include Bayesian Bandits
    Super fast k-means
      – fast
      – online (!?!)

  • 5.

    New in Mahout
    0.8 is coming soon (1-2 months)
    Gobs of fixes
    QR decomposition is 10x faster
      – makes ALS 2-3 times faster
    May include Bayesian Bandits
    Super fast k-means
      – fast
      – online (!?!)
      – fast
    Possible new edition of MiA (Mahout in Action) coming
      – Japanese and Korean editions released, Chinese coming

  • 6.

  • 7.
  • 8.

    We have a product to sell … from a web site

  • 9.

    What picture? What tag-line? What call to action?
      – Mock landing page: "Bogus Dog Food is the Best! Now available in handy 1 ton bags! Buy 5!"

  • 10.

    The Challenge
    Design decisions affect the probability of success
      – Cheesy web sites don't even sell cheese
    The best designers do better when allowed to fail
      – Exploration juices creativity
    But failing is expensive
      – If only because we could have succeeded
      – But also because offending or disappointing customers is bad

  • 11.

    More Challenges
    Too many designs
      – 5 pictures
      – 10 tag-lines
      – 4 calls to action
      – 3 background colors
      => 5 x 10 x 4 x 3 = 600 designs
    It gets worse quickly
      – What about changes on the back end?
      – Search-engine variants?
      – Checkout-process variants?

  • 12.

    Example – A/B testing in real time
    I have 15 versions of my landing page
    Each visitor is assigned to a version
      – Which version?
    A conversion or sale (or whatever) can happen
      – How long do we wait?
    Some versions of the landing page are horrible
      – We don't want to give them traffic

  • 13.

    A Quick Diversion
    You see a coin
      – What is the probability of heads?
      – Could it be larger or smaller than that?
    I flip the coin and, while it is in the air, ask again
    I catch the coin and ask again
    I look at the coin (and you don't) and ask again
    Why does the answer change?
      – And did it ever have a single value?

  • 14.

    A Philosophical Conclusion
    Probability as expressed by humans is subjective and depends on information and experience

  • 15.
  • 16.

    5 heads out of 10 throws

  • 17.

    2 heads out of 12 throws

  • 18.

    So now you understand Bayesian probability
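    A minimal sketch (my addition, not from the deck) of how those two coin records translate into Bayesian beliefs, assuming a uniform Beta(1, 1) prior so that h heads and t tails give a Beta(1 + h, 1 + t) posterior:

      # Posterior over p(heads) under a uniform Beta(1, 1) prior.
      posterior <- function(heads, tails) c(shape1 = 1 + heads, shape2 = 1 + tails)

      a <- posterior(5, 5)    # 5 heads out of 10 throws -> Beta(6, 6)
      b <- posterior(2, 10)   # 2 heads out of 12 throws -> Beta(3, 11)

      a["shape1"] / sum(a)    # posterior mean 0.50
      b["shape1"] / sum(b)    # posterior mean about 0.21

      # 90% credible intervals: still wide after so few throws.
      qbeta(c(0.05, 0.95), a["shape1"], a["shape2"])   # roughly 0.27 to 0.73
      qbeta(c(0.05, 0.95), b["shape1"], b["shape2"])   # roughly 0.07 to 0.41

    The two distributions overlap, which is exactly why the bandit discussed later keeps giving the apparent loser some traffic.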

  • 19.

    Another Quick Diversion
    Let's play a shell game
    This is a special shell game
    It costs you nothing to play
    The pea has a constant probability of being under each shell (trust me)
    How do you find the best shell?
    How do you find it while maximizing the number of wins?

  • 20.

    Pause for a short con game

  • 21.

    Interim Thoughts
    Can you identify winners or losers without trying them out?
    Can you ever completely eliminate a shell with a bad streak?
    Should you keep trying apparent losers?

  • 22.

    So now you understand multi-armed bandits

  • 23.

    Conclusions
    Can you identify winners or losers without trying them out? No.
    Can you ever completely eliminate a shell with a bad streak? No.
    Should you keep trying apparent losers? Yes, but at a decreasing rate.

  • 24.

    Is there an optimum strategy?

  • 25.

    Bayesian Bandit
    Compute distributions based on the data so far
    Sample p1, p2 and p3 from these distributions
    Pick shell i where i = argmax_i p_i
    Lemma 1: the probability of picking shell i will match the probability that it is the best shell
    Lemma 2: this is as good as it gets

  • 26.

    And it works!
    [Figure: regret versus number of trials n (0 to 1100) for ε-greedy with ε = 0.05 and for the Bayesian Bandit with a Gamma-Normal model; the Bayesian Bandit's regret falls well below the ε-greedy curve.]

  • 27.
  • 28.

    The Code
    Select an alternative:

      # k is an n x 2 matrix of counts: column 1 failures, column 2 successes.
      select <- function(k) {
        n <- dim(k)[1]
        p0 <- rep(0, length.out = n)
        for (i in 1:n) {
          p0[i] <- rbeta(1, k[i, 2] + 1, k[i, 1] + 1)   # sample from the Beta posterior
        }
        return(which(p0 == max(p0)))
      }

    Select and learn:

      learn <- function(k, steps) {
        for (z in 1:steps) {
          i <- select(k)
          j <- test(i)            # observe the outcome of showing alternative i
          k[i, j] <- k[i, j] + 1
        }
        return(k)
      }

    But we already know how to count!
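    A usage sketch (my addition): here test() is a stand-in for showing alternative i and observing a conversion, with made-up conversion rates; following the conventions above it returns column 2 for a success and column 1 for a failure.

      set.seed(1)
      true_rates <- c(0.02, 0.05, 0.11)   # hypothetical conversion rates

      test <- function(i) if (runif(1) < true_rates[i]) 2 else 1

      k <- matrix(0, nrow = length(true_rates), ncol = 2)
      k <- learn(k, steps = 10000)
      k                     # most of the traffic should end up on alternative 3
      k[, 2] / rowSums(k)   # observed rates approach the true ones for the winner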

  • 29.

    The Basic Idea
    We can encode a distribution by sampling
    Sampling allows unification of exploration and exploitation
    The approach can be extended to more general response models

  • 30.

    The Original Problem
    [Figure: the mock landing page ("Bogus Dog Food is the Best! Now available in handy 1 ton bags! Buy 5!") with its three design choices labeled x1, x2 and x3.]

  • 31.

    Response Function
    p(win) = w( Σ_i θ_i x_i )
    [Figure: the link function w, an S-shaped curve rising from 0 toward 1 as its argument runs from -6 to 6.]
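    A small sketch (my own, since the deck only shows the curve): assuming the link w is the logistic function, the win probability for one impression is just w applied to the weighted sum of its features. The weights and features below are made up.

      w <- function(s) 1 / (1 + exp(-s))   # one concrete choice of S-shaped link

      theta <- c(-1.0, 0.8, 0.3)           # illustrative coefficients
      x     <- c(1.0, 2.5, -1.0)           # intercept term plus two features

      p_win <- w(sum(theta * x))
      p_win                                # about 0.67 for these made-up numbers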

  • 32.

    Generalized Banditry
    Suppose we have an infinite number of bandits
      – suppose each one is labeled by two real numbers x and y in [0, 1]
      – also that the expected payoff is a parameterized function of x and y:  E[z] = f(x, y | θ)
      – now assume a distribution for θ that we can learn online
    Selection works by sampling θ, then computing f
    Learning works by propagating updates back to θ
      – If f is linear, this is very easy
      – For special other kinds of f it isn't too hard
    We don't have to stop at two labels; we could have labels and context
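    A minimal sketch of the linear case (my construction; the deck gives no code for this): keep a Gaussian posterior over θ with a ridge-style Bayesian linear regression, sample one θ per decision, score a grid of candidate (x, y) labels with the sampled f, and fold each observed payoff back into the posterior. All names and constants here are illustrative.

      phi <- function(x, y) c(1, x, y)      # simple feature map for f(x, y | theta)

      d <- 3
      A <- diag(d)                          # prior precision (ridge prior on theta)
      b <- rep(0, d)                        # running sum of z * phi
      noise <- 1                            # assumed observation noise variance

      sample_theta <- function() {
        S  <- chol2inv(chol(A))             # posterior covariance
        mu <- S %*% b                       # posterior mean
        as.vector(mu + t(chol(S)) %*% rnorm(d))   # one draw from N(mu, S)
      }

      candidates <- expand.grid(x = seq(0, 1, 0.1), y = seq(0, 1, 0.1))

      choose_arm <- function() {
        theta  <- sample_theta()
        scores <- apply(candidates, 1, function(r) sum(theta * phi(r["x"], r["y"])))
        candidates[which.max(scores), ]
      }

      update <- function(x, y, z) {
        p <- phi(x, y)
        A <<- A + outer(p, p) / noise       # sharpen the posterior
        b <<- b + z * p / noise
      }

    This mirrors the recipe on the slide: sample θ, compute f over the candidates, act, then propagate the observed payoff back to θ.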

  • 33.

    Context Variables
    [Figure: the same mock landing page with its design variables x1, x2 and x3, plus context variables user.geo, env.time, env.day_of_week and env.weekend.]

  • 34.

    Caveats
    The original Bayesian Bandit only requires real-time data
    The generalized bandit may require access to long history for learning
      – Pseudo-online learning may be easier than true online learning
    Bandit variables can include content, time of day, day of week
    Context variables can include user id and user features
    Bandit × context variables provide the real power (see the sketch below)
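    One concrete way to read "bandit × context" (my illustration, not from the deck): build the design matrix with interaction terms so that each variant can have its own effect per context. Base R's model.matrix does this directly; the column names and data below are hypothetical.

      # A toy impression log: which variant was shown, and in what context.
      imps <- data.frame(
        variant  = factor(c("A", "B", "A", "C")),
        weekend  = c(0, 1, 1, 0),
        user_geo = factor(c("NY", "NY", "SF", "SF"))
      )

      # Bandit variables, context variables, and bandit x context interactions.
      X <- model.matrix(~ variant * (weekend + user_geo), data = imps)
      colnames(X)   # includes terms like variantB:weekend and variantC:user_geoSF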

  • 35.

    You can do this yourself!

  • 36.
  • 37.
  • 38.

    What is Quality?
    Robust clustering is not a goal
      – we don't care whether the same clustering is replicated
    Generalization is critical
    Agreement with a "gold standard" is a non-issue

  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 44.
  • 45.

    For Example
    [Figure: a comparison of the total squared distances for clusterings with 4 and 5 centers, D_4(X) and D_5(X); the 4-cluster cost is far larger.]
    Grouping these two clusters seriously hurts squared distance

  • 46.
  • 47.

    Typical k-means Failure
    Selecting two seeds here cannot be fixed by Lloyd's algorithm
    The result is that these two clusters get glued together

  • 48.

    Ball k-means
    Provably better for highly clusterable data
    Tries to find initial centroids in the "core" of each real cluster
    Avoids outliers in the centroid computation (a rough R sketch follows below)

      initialize centroids randomly, with a distance-maximizing tendency
      for each of a very few iterations:
          for each data point:
              assign the point to the nearest cluster
          recompute each centroid using only points much closer to it than to the closest other cluster
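    A rough sketch of that trimming step (my reading of the pseudocode above, not the Mahout implementation; seeding here is plain random, and trim = 0.5 is an arbitrary choice):

      # X: numeric matrix, one point per row.
      ball_kmeans <- function(X, k, iters = 4, trim = 0.5) {
        dist_to <- function(centers)
          sapply(1:nrow(centers), function(j) sqrt(rowSums(sweep(X, 2, centers[j, ])^2)))

        # Random seeds; the real algorithm seeds with a distance-maximizing tendency.
        centers <- X[sample(nrow(X), k), , drop = FALSE]

        for (it in 1:iters) {
          D   <- dist_to(centers)                    # n x k point-to-centroid distances
          cl  <- apply(D, 1, which.min)
          own <- D[cbind(seq_len(nrow(X)), cl)]      # distance to the assigned centroid

          CC <- as.matrix(dist(centers))             # centroid-to-centroid distances
          diag(CC) <- Inf
          nearest_other <- apply(CC, 1, min)

          for (j in 1:k) {
            # Keep only points well inside the "ball" around centroid j.
            core <- which(cl == j & own < trim * nearest_other[j])
            if (length(core) > 0) centers[j, ] <- colMeans(X[core, , drop = FALSE])
          }
        }
        centers
      }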

  • 49.

    Still Not a Win
    Ball k-means is nearly guaranteed to work with k = 2
    The probability of successful seeding drops exponentially with k
    An alternative strategy has a high probability of success, but takes O(nkd + k³d) time

  • 50.

    Still Not a Win
    Ball k-means is nearly guaranteed to work with k = 2
    The probability of successful seeding drops exponentially with k
    An alternative strategy has a high probability of success, but takes O(nkd + k³d) time
    But for big data, k gets large

  • 51.

    Surrogate Method
    Start with a sloppy clustering into lots of clusters: κ = k log n of them
    Use this sketch as a weighted surrogate for the data (a sketch of this step follows below)
    Results are provably good for highly clusterable data
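    The in-memory phase in miniature (my sketch under simple assumptions, not Mahout's API): once the streaming pass has produced κ weighted centroids S with weights w, cluster them with a weighted Lloyd iteration to get the final k centers. The ball k-means sketched above could be substituted for higher quality.

      # S: kappa x d matrix of sketch centroids, w: their weights (point counts).
      weighted_kmeans <- function(S, w, k, iters = 10) {
        centers <- S[sample(nrow(S), k), , drop = FALSE]
        for (it in 1:iters) {
          D  <- sapply(1:k, function(j) rowSums(sweep(S, 2, centers[j, ])^2))
          cl <- apply(D, 1, which.min)
          for (j in 1:k) {
            m <- cl == j
            if (any(m))   # weighted mean of the sketch centroids assigned to j
              centers[j, ] <- colSums(S[m, , drop = FALSE] * w[m]) / sum(w[m])
          }
        }
        centers
      }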

  • 52.

    Algorithm Costs
    Surrogate methods
      – fast, sloppy single-pass clustering with κ = k log n
      – fast, sloppy search for the nearest cluster: O(d log κ) = O(d (log k + log log n)) per point
      – fast, in-memory, high-quality clustering of the κ weighted centroids:
        O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
        O(κ d log k) or O(d log κ log k) for larger k, looser quality
      – result is k high-quality centroids
    Even the sloppy surrogate may suffice

  • 53.

    Algorithm Costs
    Surrogate methods
      – fast, sloppy single-pass clustering with κ = k log n
      – fast, sloppy search for the nearest cluster: O(d log κ) = O(d (log k + log log n)) per point
      – fast, in-memory, high-quality clustering of the κ weighted centroids:
        O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
        O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
      – result is k high-quality centroids
    For many purposes, even the sloppy surrogate may suffice

  • 54.

    Algorithm Costs
    How much faster for the sketch phase?
      – take k = 2000, d = 10, n = 100,000
      – k d log n = 2000 × 10 × 26 ≈ 520,000
      – d (log k + log log n) = 10 × (11 + 5) = 160
      – roughly 3,000 times faster is a bona fide big deal

  • 55.

  • 56.

    How It Works
    For each point
      – find the approximately nearest centroid (at distance d)
      – if (d > threshold), start a new centroid
      – else if (u > d/threshold), start a new cluster
      – else add the point to the nearest centroid
    If the number of centroids grows beyond κ ≈ C log N
      – recursively cluster the centroids with a higher threshold
    (A rough sketch of this pass follows below.)
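    A compact sketch of this streaming pass (my paraphrase, using the common rule that a point starts a new cluster with probability proportional to d / threshold; the nearest-centroid search here is exact rather than approximate, and the 1.5 growth factor is arbitrary):

      streaming_sketch <- function(X, kappa, w = rep(1, nrow(X)), threshold = 1) {
        centers <- X[1, , drop = FALSE]
        weights <- w[1]
        for (i in 2:nrow(X)) {
          x  <- X[i, ]
          d2 <- rowSums(sweep(centers, 2, x)^2)
          j  <- which.min(d2)
          d  <- sqrt(d2[j])
          if (d > threshold || runif(1) < d / threshold) {
            centers <- rbind(centers, x)            # start a new cluster at this point
            weights <- c(weights, w[i])
          } else {
            wj <- weights[j]                        # weighted running mean update
            centers[j, ] <- (centers[j, ] * wj + x * w[i]) / (wj + w[i])
            weights[j] <- wj + w[i]
          }
          if (nrow(centers) > kappa) {
            # Too many centroids: re-cluster the weighted centroids with a looser threshold.
            threshold <- threshold * 1.5
            s <- streaming_sketch(centers, kappa, weights, threshold)
            centers <- s$centers
            weights <- s$weights
          }
        }
        list(centers = centers, weights = weights)
      }

    Running this once over the data yields the κ ≈ k log n weighted centroids that the surrogate method then clusters in memory.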

  • 57.
  • 58.

    But Wait, …
    Finding the nearest centroid is the inner loop
    This could take O(d κ) per point, and κ can be big
    Happily, approximate nearest-centroid search works fine

  • 59.

    Projection Search – total ordering!
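    A sketch of the idea (my illustration; Mahout's implementation differs in detail): project all centroids onto one random direction, keep them sorted by that projection, and for a query check only the centroids whose projections land near the query's projection.

      # Build a 1-D index: project centroids onto a random unit vector and sort.
      build_index <- function(centers) {
        u <- rnorm(ncol(centers)); u <- u / sqrt(sum(u^2))
        proj <- as.vector(centers %*% u)
        ord  <- order(proj)
        list(u = u, proj = proj[ord], ord = ord, centers = centers)
      }

      # Approximate nearest centroid: scan the ~2*width centroids whose projections
      # bracket the query's projection, then check true distances among them.
      approx_nearest <- function(index, x, width = 8) {
        px  <- sum(index$u * x)
        pos <- findInterval(px, index$proj)
        lo  <- max(1, pos - width + 1)
        hi  <- min(length(index$proj), pos + width)
        cand <- index$ord[lo:hi]
        d2 <- rowSums(sweep(index$centers[cand, , drop = FALSE], 2, x)^2)
        cand[which.min(d2)]
      }

    Several independent projections, with the union of their candidates, tighten the approximation.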

  • 60.

    LSH Bit-match Versus Cosine
    [Figure: scatter plot of cosine similarity (y axis, -1 to 1) against the number of matching LSH bits out of 64 (x axis, 0 to 64), showing a tight monotone relationship.]
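    A sketch of why bit matching tracks cosine (standard random-hyperplane LSH; the dimension and the choice of 64 bits below are just for illustration): each random hyperplane contributes one sign bit, and the fraction of matching bits between two signatures estimates the angle between the vectors.

      set.seed(42)
      d <- 20
      H <- matrix(rnorm(64 * d), nrow = 64)              # 64 random hyperplanes

      signature <- function(x) as.integer(H %*% x > 0)   # 64 sign bits per vector
      bit_match <- function(x, y) sum(signature(x) == signature(y))
      cosine    <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

      x <- rnorm(d); y <- rnorm(d)
      c(matches = bit_match(x, y), cos = cosine(x, y))

      # Expected relation: matches/64 ≈ 1 - angle/pi, so cosine ≈ cos(pi * (1 - matches/64)).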

  • 61.
  • 62.

    Parallel Speedup?
    [Figure: time per point (μs, from about 10 to 200) versus number of threads (1 to 20), comparing the threaded version against a perfect-scaling line; the non-threaded version sits near 200 μs per point.]

  • 63.

    Quality
    The ball k-means implementation appears significantly better than simple k-means
    Streaming k-means + ball k-means appears to be about as good as ball k-means alone
    All evaluations were on the 20 newsgroups data set with held-out data
    The figure of merit is the mean and median squared distance to the nearest cluster

  • 64.

    Contact Me!
    We're hiring at MapR in the US and Europe
    MapR software is available for research use
    Get the code as part of Mahout trunk (or 0.8 very soon)
    Contact me at tdunning@maprtech.com or @ted_dunning
    Share news with @apachemahout