Ask HN: Can we adapt AlphaZero's self-play technique for better human learning?
Since I lack the ML background to debunk this suspicion, I figured I'd let HN debunk it for me. Seeing AlphaZero's success at learning Chess, Shogi, and Go, I was immediately struck by the intuition that AlphaZero's ability to learn so much from "self-play" should provide some insight into improving human teaching and learning strategies. With the caveat that humans can't split themselves into two versions the way AlphaZero does, I can imagine a teaching paradigm that emphasizes simulating competitive activities while playing as both sides. Is something like this at all related to what AlphaZero is doing, and are there chess training paradigms that emphasize this type of simulation?

Short answer: no, there's nothing new here that can inform better human learning.

Longer answer: the concept of self-play isn't new in any sense. All chess players use this technique to some degree; none use only this technique. The advantage of self-play is that there's no risk of accidentally picking up someone else's incorrect assumptions, since you're deriving everything from scratch. Some people take this to extremes: there's a math professor who doesn't read any math papers so that he derives everything from first principles without "contaminating his mind". It works quite well for him, but unfortunately I'm blanking on his name.

However, committing to this technique gives up one of the major advantages humans have: the ability to communicate knowledge amongst themselves in a compact, abstract way through language. Humans also have a pretty good way to mitigate the faulty-assumption risk: skepticism. We can reevaluate our assumptions and, if we deem it necessary, excise them from our mental model. AlphaZero could in theory do the same thing, but in practice there's not much point: it has no use for the sum total of human knowledge on chess, since it's capable of recreating that and much more in a few hours.
If there is something to be learned from AlphaZero's training, it's that you should always be skeptical of your assumptions. That's not anything new, but it's always worth reiterating. It's pretty obviously not feasible to take this to AlphaZero's extremes, though; humans need other humans to learn. Even the math professor who doesn't read papers needed a lot of interaction with other humans before he could get to the point of deriving things from first principles.

> unfortunately I'm blanking on his name

John Nash (supposedly) had this mindset. Is that who you're thinking of?

That wasn't who I had in mind, but thanks for sharing that example. I think the person I'm thinking of is at Cornell and still alive. He also might actually be in CS rather than math. I tried googling it, but unfortunately "math professor who doesn't read papers" didn't turn up any results.

Shinichi Mochizuki developed a new theory (IUT) which eventually yielded a proof of the abc conjecture. I believe he largely developed it in isolation, so when it was published it took years to bring the rest of the community up to speed. I'm not sure whether he avoids reading others' papers, but maybe this is who you were thinking of?

1. https://en.wikipedia.org/wiki/Shinichi_Mochizuki

2. https://en.wikipedia.org/wiki/Inter-universal_Teichm%C3%BCll...

You might be thinking of R.L. Moore and his "Moore Method", except it was only used in a teaching context, not in day-to-day work. See https://en.wikipedia.org/wiki/Moore_method

Maybe it was a physics professor.

Don't humans already do this, in a way? Instead of playing against yourself, you take somebody stronger and play them. You only need on the order of 100 games of chess against decent opposition, with some verbal explanations, to reach amateur level. Per game, this is much more efficient than AlphaZero, which requires millions of games as well as tons of computing power.
Surely the main reason AlphaZero uses that particular technique is that nobody can figure out something better? You'd really want it to copy learning techniques from humans (especially learning from many fewer examples), not the other way around.

Thanks for being the first to reply; I was worried I'd just get upvotes and no replies! I think you have two separate points, one with which I agree and one with which I disagree. First, I agree (and other commenters on AlphaZero seem to as well) that human learning "algorithms" still beat AlphaZero's on per-game ROI. On the other hand, I disagree that AlphaZero's self-play is no more interesting than a human playing someone better and learning from them. AlphaGo, AlphaZero's predecessor, followed a strategy more like what you described, learning from a large corpus of existing expert games. AlphaZero, on the other hand, requires no training input beyond an encoding of the basic rules that it can understand. From there, it bootstraps its understanding of chess without input from experts. This is the piece I find most interesting, see as potentially useful for the future of human learning, and believe differs from practice with an expert teacher. And so I wonder: can we design learning environments where the learner bootstraps their own understanding from a limited input, without continuous feedback from an expert or teacher?

I think this is actually how everyone learns! You can't put information into people. You can present it to them, but they need to teach themselves, so to speak. A good schooling system will recognize and emphasize this self-teaching aspect: students are encouraged to figure things out on their own. For instance, where I studied CS, most of the time was allocated to semester projects where we'd be a small self-organized team of 3-7 students working on something with very little external input. You can find similar ideas in schools, e.g. Sudbury schools.
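The "bootstrap from the rules alone" idea can be sketched in miniature. The following is purely illustrative (nothing like DeepMind's actual system, and the game, function names, and hyperparameters are my own choices): a tabular self-play learner for the toy game of Nim, where players alternate taking 1-3 stones and whoever takes the last stone wins. The only input is the rules; the value estimates emerge entirely from games the learner plays against itself.

```python
import random

# Illustrative sketch only: self-play value learning on Nim.
# Players alternate taking 1-3 stones; taking the last stone wins.

ACTIONS = (1, 2, 3)

def train(pile=12, episodes=20_000, eps=0.2, lr=0.1, seed=0):
    rng = random.Random(seed)
    # value[n]: estimated chance the player to move wins with n stones left
    value = {n: 0.5 for n in range(pile + 1)}
    value[0] = 0.0  # no stones left: the player to move has already lost
    for _ in range(episodes):
        n = pile
        while n > 0:
            legal = [a for a in ACTIONS if a <= n]
            if rng.random() < eps:
                a = rng.choice(legal)  # occasional random exploration
            else:
                # greedy self-play: move to the state worst for the opponent
                a = min(legal, key=lambda a: value[n - a])
            # bootstrapped update: my winning chance mirrors the opponent's
            value[n] += lr * ((1.0 - value[n - a]) - value[n])
            n -= a  # hand the smaller pile to the other "self"
    return value

def best_move(value, n):
    """Greedy move from a trained value table."""
    return min((a for a in ACTIONS if a <= n), key=lambda a: value[n - a])

v = train()
# Nim theory says piles that are multiples of 4 are lost for the player to
# move, so the learned policy should leave the opponent a multiple of 4.
print(best_move(v, 5), best_move(v, 7))  # both moves should leave a pile of 4
```

No expert games, opening books, or piece values go in; the learner rediscovers the winning strategy purely by playing both sides of the board against itself, which is the property the parent comment finds interesting about AlphaZero.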
I think Waldorf schools have some aspects of this too. I'm sending my children to such a school.

> And so I wonder: can we design learning environments where the learner bootstraps their own understanding from a limited input, without continuous feedback from an expert or teacher?

Why would you remove continuous feedback from an expert or teacher? Would that make human learning "faster" and more "efficient"? That approach works for AI because, unlike a human, an AI remembers every single data point with 100% accuracy and can iterate repeatedly without fatigue. It also doesn't suffer from boredom and doesn't require motivation. By the way, humans already learn from experience by bootstrapping their own understanding; teachers and experts exist to fast-track the beginning phase so a kid doesn't have to play ten thousand games just to reach beginner skill level.

> By the way, humans already learn from experience by bootstrapping their own understanding; teachers and experts exist to fast-track the beginning phase so a kid doesn't have to play ten thousand games just to reach beginner skill level.

Yeah, I came part of the way to this realization in my reply to your other message.

> And so I wonder: can we design learning environments where the learner bootstraps their own understanding from a limited input, without continuous feedback from an expert or teacher?

Yes, and we do it all the time. We can just do a lot better with continuous feedback. (And AI probably could, too, if experts were available that could communicate fast enough not to be a huge drag on the AI's training cycles. But since, with current technology, once we've trained an AI of the type we can make today we can replicate it, that's not really important; if we ever develop AIs that depend on reconfigurable hardware without trivially extractable state, that may change.)
I have two thoughts. First, AlphaZero plays millions of games against itself with a low per-game ROI in much less time than it takes a human to play against an expert with a high per-game ROI. In this way AlphaZero has more work to do than the human to reach a given skill level, after which it is probably doing a similar amount of work per game to keep improving, but can do that work in much larger volume. Second, I think I've heard of chess experts playing games against themselves, but I can't seem to find a reference at the moment.

Life is too short. AlphaZero can play millions of games against itself. I can't.

AlphaZero plays games with (1) perfect information and (2) well-defined winning conditions. Neither of these holds for most human-learning scenarios. I can imagine that a healthy dose of probability theory (and probably more advanced stuff I don't know about[1]) might improve (1), but (2) is going to keep computer scientists, philosophers, and ethicists arguing for quite a long time. :)

[1] get the joke, eh? eh? eh?

> AlphaZero plays games with (1) perfect information

I'm not sure why this matters? Everyone plays chess with perfect information. Both players see the entire board and all possibilities, unlike in, say, Scrabble or poker.

I think GP meant it in the sense of "AlphaZero can only play games that have perfect information". It's a restriction of the algorithm, not a statement about how AlphaZero approaches the games it plays. This is why AlphaGo leveled up into AlphaZero playing chess, and didn't learn to play StarCraft (yet).

Ah yeah, I gotcha. My bad.

They essentially hold for math, which is a pretty big deal.

What are the winning conditions for math?

A smaller proof using fewer axioms or other proofs than the current state of the art. Discovering new and "interesting" proofs. Don't ask me to define "interesting" in this context.
In the field of formal/machine proofs, nobody really cares about the length of the proofs, because part of the point is that the proofs are checked by the computer all the way back to the basic axioms, so you can trust they are correct. Being able to discover long and ugly proofs of difficult theorems, or to come up with new theorems, would have endless applications.

That's exactly my original point: these goals are not well-defined in early-21st-century maths.

The human brain already does "self-play" during REM sleep. So yes, but the implementation details are "get more sleep" rather than some sort of novel technique.

It only now occurs to me that the line of thinking I follow here was subconsciously inspired by section 3 of this Marvin Minsky talk (https://web.media.mit.edu/~minsky/papers/TuringLecture/Turin...). If you're at all interested in the intersection of learning and computer science, I highly recommend taking a look.

Are we sure AlphaZero has better learning efficiency than a human? Sure, it reached peak skill after four hours of learning, but how many games did it play during those four hours? How many moves did it memorize perfectly and analyze? Are those numbers even achievable by a human in one lifetime? Even with AlphaZero's efficiency, it still evaluates 80,000 moves per second, which is far more than a human grandmaster evaluates in an entire game. If we cut AlphaZero's "processing power" to that of a human, could it still beat a top-level human player, let alone other AIs? To me it seems like there is still a long way to go in this space.

I agree that AlphaZero's per-game learning efficiency is much lower than a human's (as mentioned in my other reply). The part that interested me more was that it bootstrapped its learning from the basic rules of each game. Now that I think about it, though, one might argue that human learning in a given discipline also starts in isolation, with feedback coming only from the outside world.
This is what we typically call research. But the magic of our education system, when it works, is that we compress the output of this slow process into a faster one and feed it to learners, allowing them to build an understanding of knowledge that originally took generations to discover. Riffing off Matt Might's illustrated depiction of a PhD (http://matt.might.net/articles/phd-school-in-pictures/): expanding the circle of knowledge is exponentially slower than getting close to its edge.

I don't think we can learn much from how an engine learns, but we certainly can learn from its results. For example, there's this interesting discussion: https://www.reddit.com/r/chess/comments/7ibzq4/stockfish_vs_... Because AlphaZero did not learn from human games, it looks at the different pieces without attaching values the way we do. It has no problem sacrificing a higher-"valued" piece for the sake of its strategy.

I would submit that we already have an example of self-play being used as part of a strategy to learn chess: chess problems. Something like "Here's a board position. It looks utterly hopeless, but the problem says 'Black to mate in 7 moves'. How can you get there from here without relying on White making any beginner's mistakes?" is pretty much self-play.

I think there is a possibility of applying machine learning to teaching humans in the sense of continuous, algorithmic tuning/personalization of lesson plans and teaching strategies to accelerate human learning... as a teacher's aide, in other words.

My impression of Plato is that he channeled different people/characters in his writing in order to create adversarial conditions in which he could improve his rhetoric. Perhaps this is similar to AlphaZero's technique?

Remember that AlphaZero played 44 million games of chess, whereas your average professional chess player has played somewhere on the order of 10,000-100,000. Self-play works, but rather slowly.

How many years did it take the professional to play 100,000 games?
How many minutes did it take AGZ to play 44M? Self-play sounds rather fast to me.
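For scale, here's a quick back-of-the-envelope calculation, taking the 44-million-games figure quoted in the thread together with the roughly nine hours of chess training reported for AlphaZero (both numbers are approximate, and the games were played across many machines in parallel):

```python
# Rough throughput implied by the figures quoted in the thread.
games = 44_000_000          # self-play games of chess
training_minutes = 9 * 60   # ~9 hours of training

games_per_minute = games // training_minutes
print(games_per_minute)  # roughly 81,000 games per minute, across all workers
```

So "minutes, not years" is right on wall-clock time, though only because the work is spread over a large fleet of hardware, which loops back to the per-game-efficiency point made upthread.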