News from Mahout - NFHN Reader

whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
  – particularly Mahout, Zookeeper and Drill (we’re hiring)
• Contact me at tdunning@maprtech.com, tdunning@apache.com, ted.dunning@gmail.com or @ted_dunning
©MapR Technologies - Confidential

• Slides and such (available late tonight):
  – http://www.mapr.com/company/events/nyhug-03-05-2013
• Hash tags: #mapr #nyhug #mahout

New in Mahout
• 0.8 is coming soon (1-2 months)
• gobs of fixes
• QR decomposition is 10x faster
  – makes ALS 2-3 times faster
• May include Bayesian Bandits
• Super fast k-means
  – fast
  – online (!?!)

We have a product to sell … from a web-site

[Figure: mock landing page reading "Bogus Dog Food is the Best! Now available in handy 1 ton bags! Buy 5!"]
• What picture?
• What tag-line?
• What call to action?

The Challenge
• Design decisions affect probability of success
  – Cheesy web-sites don’t even sell cheese
• The best designers do better when allowed to fail
  – Exploration juices creativity
• But failing is expensive
  – If only because we could have succeeded
  – But also because offending or disappointing customers is bad

More Challenges
• Too many designs
  – 5 pictures
  – 10 tag-lines
  – 4 calls to action
  – 3 back-ground colors
  => 5 x 10 x 4 x 3 = 600 designs
• It gets worse quickly
  – What about changes on the back-end?
  – Search engine variants?
  – Checkout process variants?
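
The design count above can be checked by enumerating the combinations. This is a minimal sketch: the option counts come from the slide, while the option names and the `count_designs` helper are placeholders invented for illustration.

```python
from itertools import product
from math import prod

# Option counts are from the slide; the names are hypothetical placeholders.
pictures = [f"picture_{i}" for i in range(5)]
tag_lines = [f"tag_line_{i}" for i in range(10)]
calls_to_action = [f"call_to_action_{i}" for i in range(4)]
background_colors = [f"color_{i}" for i in range(3)]

# Every design is one choice from each list.
designs = list(product(pictures, tag_lines, calls_to_action, background_colors))


def count_designs(*option_counts):
    """Number of distinct designs when each factor is chosen independently."""
    return prod(option_counts)
```

Each extra factor multiplies the total: under the same assumptions, two hypothetical checkout variants would turn `count_designs(5, 10, 4, 3)` into `count_designs(5, 10, 4, 3, 2)`, doubling the number of designs to test.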

Example – A/B testing in real-time
• I have 15 versions of my landing page
• Each visitor is assigned to a version
  – Which version?
• A conversion or sale or whatever can happen
  – How long to wait?
• Some versions of the landing page are horrible
  – Don’t want to give them traffic

A Quick Diversion
• You see a coin
  – What is the probability of heads?
  – Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
  – And did it ever have a single value?

A Philosophical Conclusion
• Probability as expressed by humans is subjective and depends on information and experience

5 heads out of 10 throws

2 heads out of 12 throws
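
One way to turn such counts into a belief about the coin is a Beta posterior. The sketch below assumes a uniform prior on the probability of heads, so that h heads and t tails give Beta(h + 1, t + 1); the function names and the sampled credible interval are illustrative choices, not something the slides specify.

```python
import random


def posterior_params(heads, tails):
    """Beta(alpha, beta) posterior for P(heads), assuming a uniform prior."""
    return heads + 1, tails + 1


def posterior_mean(heads, tails):
    alpha, beta = posterior_params(heads, tails)
    return alpha / (alpha + beta)


def credible_interval(heads, tails, mass=0.9, draws=20000, seed=0):
    """Central interval holding roughly `mass` of the posterior, by sampling."""
    rng = random.Random(seed)
    alpha, beta = posterior_params(heads, tails)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    low = samples[int(draws * (1 - mass) / 2)]
    high = samples[int(draws * (1 + mass) / 2) - 1]
    return low, high
```

With 5 heads in 10 throws the posterior mean is 6/12 = 0.5; with 2 heads in 12 throws it is 3/14, and the interval from `credible_interval(2, 10)` shows how much uncertainty remains after only 12 throws.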

So now you understand Bayesian probability

Another Quick Diversion
• Let’s play a shell game
• This is a special shell game
• It costs you nothing to play
• The pea has constant probability of being under each shell (trust me)
• How do you find the best shell?
• How do you find it while maximizing the number of wins?

Pause for short con-game

Interim Thoughts
• Can you identify winners or losers without trying them out?
• Can you ever completely eliminate a shell with a bad streak?
• Should you keep trying apparent losers?

So now you understand multi-armed bandits

Conclusions
• Can you identify winners or losers without trying them out? No
• Can you ever completely eliminate a shell with a bad streak? No
• Should you keep trying apparent losers? Yes, but at a decreasing rate

Is there an optimum strategy?

Bayesian Bandit
• Compute distributions based on data so far
• Sample p_1, p_2 and p_3 from these distributions
• Pick shell i where i = argmax_i p_i
• Lemma 1: The probability of picking shell i will match the probability it is the best shell
• Lemma 2: This is as good as it gets
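
The selection rule above can be sketched in a few lines. This is an illustration under stated assumptions rather than the Mahout implementation: payoffs are win/lose, each shell gets a Beta posterior with a uniform prior, and the names `select` and `run` as well as the simulated `true_probs` are invented for the example.

```python
import random


def select(counts, rng):
    """counts[i] = [losses, wins] for shell i.

    Draw one sample from each shell's Beta posterior (uniform prior assumed)
    and return the index of the largest sample.
    """
    samples = [rng.betavariate(wins + 1, losses + 1) for losses, wins in counts]
    return max(range(len(counts)), key=samples.__getitem__)


def run(true_probs, steps, seed=0):
    """Play `steps` rounds against simulated shells and return the counts."""
    rng = random.Random(seed)
    counts = [[0, 0] for _ in true_probs]
    for _ in range(steps):
        i = select(counts, rng)
        won = rng.random() < true_probs[i]  # simulated outcome, hypothetical
        counts[i][1 if won else 0] += 1
    return counts
```

Because a shell is picked whenever its sample happens to be the largest, apparent losers are still tried occasionally, but less and less often as their posteriors narrow.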

And it works!
[Figure: regret (0 to 0.12) versus n (0 to 1100) for two strategies: ε-greedy with ε = 0.05, and Bayesian Bandit with Gamma-Normal]
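
For reference, the ε-greedy baseline named in the plot can be sketched as follows. The regret definition used here (average shortfall in expected payoff per trial, relative to always playing the best arm) is an assumption, since the slide does not define it; the function name and the simulated `true_probs` are also hypothetical.

```python
import random


def epsilon_greedy_regret(true_probs, steps, epsilon=0.05, seed=0):
    """Average per-trial regret of epsilon-greedy on simulated win/lose arms."""
    rng = random.Random(seed)
    n = len(true_probs)
    pulls = [0] * n
    wins = [0] * n
    best = max(true_probs)
    regret = 0.0
    for _ in range(steps):
        if rng.random() < epsilon or sum(pulls) == 0:
            # explore: pick any arm uniformly at random
            i = rng.randrange(n)
        else:
            # exploit: highest observed win rate; untried arms count as 0
            i = max(range(n), key=lambda a: wins[a] / pulls[a] if pulls[a] else 0.0)
        pulls[i] += 1
        wins[i] += rng.random() < true_probs[i]
        regret += best - true_probs[i]
    return regret / steps
```

With a fixed ε this strategy keeps exploring at a constant rate, so it never stops paying for exploration of arms it already has good evidence against.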

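The Code
• Select an alternative

    # k is an n x 2 matrix of counts, one row per alternative; given the
    # rbeta call, column 1 presumably counts losses and column 2 counts wins
    select = function(k) {
      n = dim(k)[1]
      p0 = rep(0, length.out=n)
      for (i in 1:n) {
        # one sample from alternative i's Beta posterior
        p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
      }
      # index of the largest sample
      return (which(p0 == max(p0)))
    }

• Select and learn

    # test(i) is supplied by the caller: it tries alternative i and returns
    # the column of k to increment (the wrapper name "learn" is not from the slide)
    learn = function(k, steps) {
      for (z in 1:steps) {
        i = select(k)
        j = test(i)
        k[i,j] = k[i,j]+1
      }
      return (k)
    }

• But we already know how to count!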

The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and exploitation
• Can be extended to more general response models

The Original Problem
[Figure: the mock landing page ("Bogus Dog Food is the Best! Now available in handy 1 ton bags! Buy 5!") with its design choices labeled x1, x2 and x3]

Response Function
$p(\text{win}) = w\left(\sum_i \theta_i x_i\right)$
[Figure: response curve y = w(x); x axis from -6 to 6, y axis from 0 to 1]
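
A small sketch of this response function, assuming that w is the logistic function (the slide shows only the shape of the curve, not its formula). The names `w` and `p_win` are chosen for illustration.

```python
import math


def w(t):
    """Logistic link, assumed here as the form of w; written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def p_win(theta, x):
    """p(win) = w(sum_i theta_i * x_i) for weights theta and design variables x."""
    return w(sum(t_i * x_i for t_i, x_i in zip(theta, x)))
```

For example, `p_win([1.0, -1.0], [2.0, 2.0])` evaluates w(0), which is 0.5 under the logistic assumption.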

Generalized Banditry
• Suppose we have an infinite number of bandits
  – suppose they are each labeled by two real numbers x and y in [0,1]
  – also that expected payoff is a parameterized function of x and y
    $E[z] = f(x, y \mid \theta)$
  – now assume a distribution for θ that we can learn online
• Selection works by sampling θ, then computing f
• Learning works by propagating updates back to θ
  – If f is linear, this is very easy
  – For certain other kinds of f it isn’t too hard
• We are not limited to two labels; we could have labels and context
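
The select-by-sampling and learn-by-updating loop can be sketched with a deliberately simple posterior. Everything specific here is an assumption made for illustration: f is taken to be logistic in a linear function of x and y, the distribution over θ is held as a weighted set of candidate values (particles), and the class and function names are invented.

```python
import math
import random


def f(x, y, theta):
    """Assumed payoff model: logistic in theta[0] + theta[1]*x + theta[2]*y."""
    t = theta[0] + theta[1] * x + theta[2] * y
    return 1.0 / (1.0 + math.exp(-t))


class ParticlePosterior:
    """Distribution over theta encoded as weighted samples (hypothetical helper)."""

    def __init__(self, n_particles, rng):
        self.rng = rng
        self.particles = [[rng.gauss(0, 2) for _ in range(3)] for _ in range(n_particles)]
        self.weights = [1.0 / n_particles] * n_particles

    def sample(self):
        """Draw one theta in proportion to the current weights."""
        return self.rng.choices(self.particles, weights=self.weights, k=1)[0]

    def update(self, x, y, won):
        """Bayes rule on the particle set: reweight by the outcome's likelihood."""
        for i, theta in enumerate(self.particles):
            p = f(x, y, theta)
            self.weights[i] *= p if won else 1.0 - p
        total = sum(self.weights)
        self.weights = [wt / total for wt in self.weights]


def select(posterior, candidates):
    """Sample theta once, then pick the (x, y) label with the best f under it."""
    theta = posterior.sample()
    return max(candidates, key=lambda xy: f(xy[0], xy[1], theta))
```

A fixed particle set is the crudest way to learn θ online; it is used here only because it keeps the update to a few lines.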

Context Variables
[Figure: the mock landing page ("Bogus Dog Food is the Best! Now available in handy 1 ton bags! Buy 5!") with design choices labeled x1, x2 and x3, alongside the context variables user.geo, env.time, env.day_of_week and env.weekend]

Caveats
• Original Bayesian Bandit only requires real-time
• Generalized Bandit may require access to long history for learning
  – Pseudo online learning may be easier than true online
• Bandit variables can include content, time of day, day of week
• Context variables can include user id, user features
• Bandit × context variables provide the real power

You can do this yourself!

What is Quality?
• Robust clustering not a goal
  – we don’t care if the same clustering is replicated
• Generalization is critical
• Agreement to “gold standard” is a non-issue

For Example
$\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$
Grouping these two clusters seriously hurts squared distance

Typical k-means Failure
• Selecting two seeds here cannot be fixed with Lloyd’s
• Result is that these two clusters get glued together

Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in each “core” of each real cluster
• Avoids outliers in centroid computation

    initialize centroids randomly with distance maximizing tendency
    for each of a very few iterations:
      for each data point:
        assign point to nearest cluster
      recompute centroids using only points much closer than closest cluster
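
The pseudocode above can be made concrete as follows. This is a sketch under assumptions, not Mahout's implementation: seeding is done k-means++ style, "much closer than closest cluster" is read as "within `trim` times the distance to the nearest other centroid" with `trim = 0.5` picked arbitrarily, and k is assumed to be at least 2.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def seed_centroids(points, k, rng):
    """Random seeding that favors points far from the seeds chosen so far."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        d2 = [min(dist2(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:  # all points coincide with existing seeds
            centroids.append(list(rng.choice(points)))
        else:
            centroids.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centroids


def ball_kmeans(points, k, iterations=3, trim=0.5, seed=0):
    rng = random.Random(seed)
    centroids = seed_centroids(points, k, rng)
    for _ in range(iterations):
        # squared ball radius: a fraction of the distance to the nearest other centroid
        radius2 = [
            trim ** 2 * min(dist2(c, o) for j, o in enumerate(centroids) if j != i)
            for i, c in enumerate(centroids)
        ]
        members = [[] for _ in centroids]
        for p in points:
            i = min(range(k), key=lambda j: dist2(p, centroids[j]))
            if dist2(p, centroids[i]) <= radius2[i]:  # outliers are ignored
                members[i].append(p)
        for i, pts in enumerate(members):
            if pts:
                centroids[i] = [sum(col) / len(pts) for col in zip(*pts)]
    return centroids
```

Points outside every ball still belong to their nearest cluster; they are simply left out of the centroid update so that outliers cannot drag a centroid away from the core of its cluster.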

Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time
• But for big data, k gets large

Surrogate Method
• Start with sloppy clustering into lots of clusters: κ = k log n clusters
• Use this sketch as a weighted surrogate for the data
• Results are provably good for highly clusterable data

Algorithm Costs
• Surrogate methods
  – fast, sloppy single pass clustering with κ = k log n
  – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
  – fast, in-memory, high-quality clustering of κ weighted centroids
    O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
    O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
  – result is k high-quality centroids
    • For many purposes, even the sloppy surrogate may suffice

Algorithm Costs
• How much faster for the sketch phase?
  – take k = 2000, d = 10, n = 100,000
  – k d log n = 2000 x 10 x 26 = 520,000
  – d (log k + log log n) = 10(11 + 5) = 160
  – roughly 3,000 times faster is a bona fide big deal
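
The arithmetic can be replayed directly. The sketch takes log n = 26 as given on the slide rather than deriving it from n (the base-2 logarithm of 100,000 is about 16.6), and assumes base-2 logarithms rounded up, which matches the 11 and 5 used above.

```python
import math

k, d = 2000, 10
log_n = 26  # value used on the slide, taken as given

# Exhaustive nearest-centroid search: d * kappa with kappa = k log n
brute_force_cost = k * d * log_n

# Fast approximate search: d * log(kappa) = d * (log k + log log n)
log_k = math.ceil(math.log2(k))
log_log_n = math.ceil(math.log2(log_n))
fast_cost = d * (log_k + log_log_n)

speedup = brute_force_cost / fast_cost
```

Under these assumptions the per-point costs are 520,000 versus 160 operations, a ratio of 3,250, which the slide rounds to about 3,000.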

How It Works
• For each point
  – Find approximately nearest centroid (distance = d)
  – If (d > threshold) new centroid
  – Else if (u < d/threshold) new cluster, where u is drawn uniformly from [0, 1]
  – Else add to nearest centroid
• If centroids > κ ≈ C log N
  – Recursively cluster centroids with higher threshold
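
A single-pass sketch along these lines is shown below. It is an illustration under assumptions, not the Mahout code: the nearest centroid is found by exact search instead of an approximate one, a new cluster is started with probability d/threshold, the threshold grows by a factor of 1.5 on each collapse, and all names are invented.

```python
import random


def dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def add_point(centroids, point, weight, threshold, rng):
    """Add a weighted point: start a new centroid or fold into the nearest one."""
    if not centroids:
        centroids.append([list(point), weight])
        return
    nearest = min(centroids, key=lambda c: dist(c[0], point))
    d = dist(nearest[0], point)
    if d > threshold or rng.random() < d / threshold:
        centroids.append([list(point), weight])
    else:
        total = nearest[1] + weight
        # weighted mean of the old centroid and the new point
        nearest[0] = [(c * nearest[1] + x * weight) / total
                      for c, x in zip(nearest[0], point)]
        nearest[1] = total


def streaming_sketch(points, max_centroids, threshold, seed=0):
    """Return weighted centroids [[center, weight], ...]; needs max_centroids >= 1."""
    rng = random.Random(seed)
    centroids = []
    for p in points:
        add_point(centroids, p, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # too many centroids: re-cluster them with a higher threshold
            threshold *= 1.5
            old, centroids = centroids, []
            rng.shuffle(old)
            for center, weight in old:
                add_point(centroids, center, weight, threshold, rng)
    return centroids
```

The weights record how many original points each centroid stands for, which is what lets the sketch serve as a weighted surrogate for the full data set.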

But Wait, …
• Finding nearest centroid is inner loop
• This could take O(d κ) per point and κ can be big
• Happily, approximate nearest centroid works fine

LSH Bit-match Versus Cosine
[Figure: cosine (-1 to 1) plotted against the number of matching LSH bits (0 to 64)]
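
The relationship between bit matches and cosine can be reproduced with random-hyperplane hashing. This sketch assumes that variety of LSH (one sign bit per random hyperplane, 64 bits in total); whether it matches the hashing used for the slide is not stated. For such hashes, the chance that one bit agrees is 1 - angle/π, which is what `estimated_cosine` inverts.

```python
import math
import random


def random_hyperplanes(dim, bits=64, seed=0):
    """One Gaussian normal vector per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]


def lsh_bits(v, planes):
    """Bit i records which side of hyperplane i the vector lies on."""
    return [sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0 for p in planes]


def bit_matches(a, b, planes):
    return sum(x == y for x, y in zip(lsh_bits(a, planes), lsh_bits(b, planes)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def estimated_cosine(matches, bits=64):
    """Invert P(bit match) = 1 - angle/pi to estimate the cosine."""
    return math.cos(math.pi * (1 - matches / bits))
```

Comparing `estimated_cosine(bit_matches(a, b, planes))` with `cosine(a, b)` for many random pairs gives a scatter of the kind the slide title describes, with only bit comparisons needed at query time.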

Parallel Speedup?
[Figure: time per point (μs) versus number of threads, comparing the non-threaded version, the threaded version and perfect scaling]

Quality
• Ball k-means implementation appears significantly better than simple k-means
• Streaming k-means + ball k-means appears to be about as good as ball k-means alone
• All evaluations on 20 newsgroups with held-out data
• Figure of merit is mean and median squared distance to nearest cluster
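
The figure of merit named above is straightforward to compute once centroids and held-out points are available. A minimal sketch, with hypothetical function names and plain lists of coordinates standing in for the real document vectors:

```python
from statistics import mean, median


def squared_distance_to_nearest(point, centroids):
    """Squared Euclidean distance from a point to its closest centroid."""
    return min(
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    )


def figure_of_merit(held_out_points, centroids):
    """Mean and median squared distance to the nearest centroid."""
    d2 = [squared_distance_to_nearest(p, centroids) for p in held_out_points]
    return mean(d2), median(d2)
```

Evaluating on points that were not used for clustering measures generalization, and the median is much less sensitive to a few distant outliers than the mean.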

Contact Me!
• We’re hiring at MapR in US and Europe
• MapR software available for research use
• Get the code as part of Mahout trunk (or 0.8 very soon)
• Contact me at tdunning@maprtech.com or @ted_dunning
• Share news with @apachemahout