Carnegie Mellon’s Mayhem AI Wins DARPA’s Cyber Grand Challenge
Hey guys, member of the (currently unverified) third-place team, Shellphish. If anyone has any questions, I (or another member of my team) would be glad to answer them. We'll also be giving a talk at DEF CON on Sunday after the CTF ends, where we'll be open-sourcing our CRS!
Can you explain how this particular CTF works, and how the systems in general work against an adversary? The article said insecure, bug-filled code is constantly being fed to the systems. I don't really get it.
I hope someone more knowledgeable can chime in, but AFAIU, each player acts as the manager of a certain set of services, and as an attacker against all the others.
Such services contain bugs, so what each player must do is identify the bugs, fix them or mitigate them, and at the same time exploit them to gain access to the boxes of the other players.
So basically, the programs in the competition do (rough sketch of the loop after this list):
* vulnerability identification
* vulnerability mitigation
* identification of the best target to attack (presumably based on the first thing, not sure if other things factor in)
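In pseudocode, the loop each of these systems runs looks roughly like the following. This is a hypothetical sketch with every analysis stubbed out, not any team's actual architecture:

    # Hypothetical CRS game loop; every function is a stub standing in for
    # a large subsystem, not any team's real implementation.

    def find_vulns(binary):
        """Fuzzing / symbolic execution would go here; yields crashing inputs."""
        return []

    def make_patch(binary, vuln):
        """Binary rewriting would go here; returns a hardened binary."""
        return binary

    def make_exploit(vuln):
        """Exploit generation would go here; returns a proof-of-vulnerability."""
        return vuln

    def deploy(patched_binary):
        """Field the replacement binary for our own service."""

    def throw(exploit, opponent):
        """Submit the exploit against an opponent's service."""

    def crs_round(my_services, opponents):
        for binary in my_services:
            for vuln in find_vulns(binary):
                deploy(make_patch(binary, vuln))         # defend
                for opponent in opponents:
                    throw(make_exploit(vuln), opponent)  # attack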
First of all, congratulations on the awesome work. Do any of the components of your CRS make use of machine learning techniques? I read somewhere that Mayhem uses deep learning, but I'm not sure how exactly that would work in a program analysis scenario. I am assuming you used some form of symbolic execution (Edit: just realized it's angr, which is often useful in CTFs). How different was it from other general-purpose symbolic execution systems (KLEE etc.)? Did you use any formal methods too?
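For reference, the basic angr workflow I've seen in CTFs looks something like this; the binary path and both addresses below are made up:

    # Minimal angr sketch: symbolically execute a binary and solve for a
    # concrete input that reaches a target address. Path/addresses invented.
    import angr

    proj = angr.Project("./challenge_binary", auto_load_libs=False)
    simgr = proj.factory.simulation_manager(proj.factory.entry_state())

    # Search the state space for a path reaching the "win" address while
    # pruning paths that hit the failure address.
    simgr.explore(find=0x400ABC, avoid=0x400DEF)

    if simgr.found:
        # Ask the SMT solver for a concrete stdin that drives execution there.
        print(simgr.found[0].posix.dumps(0))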
Is this both automated defense and offense via machine learning, or just automated defensive systems? If it includes automated offensive systems, what's to keep these kinds of systems from jumping outside of their sandboxes and compromising the outside world?
For a flavor of an automated offensive system, see this Automatic Exploit Generation paper: http://security.ece.cmu.edu/aeg/
David Brumley, PI of the research, went on to found ForAllSecure which is the company covered in the article.
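The core AEG trick, phrased in angr terms, is roughly: run the program until the instruction pointer becomes symbolic (i.e., attacker-controlled), then ask the solver for an input that sets it to an address of your choosing. A toy sketch, with a made-up binary and jump target:

    # Toy automatic-exploit-generation sketch in angr. The binary path and
    # the target address are invented for illustration.
    import angr

    proj = angr.Project("./vulnerable_binary", auto_load_libs=False)
    simgr = proj.factory.simulation_manager(
        proj.factory.entry_state(),
        save_unconstrained=True,  # keep states whose PC became symbolic
    )
    simgr.run()

    if simgr.unconstrained:
        state = simgr.unconstrained[0]
        # The program counter is attacker-controlled; pin it to our target.
        state.add_constraints(state.regs.pc == 0x41414141)
        print(state.posix.dumps(0))  # a concrete control-hijacking input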
I'd love to learn more about the techniques actually being used in these systems. Any good pointers to scientific papers or review articles on the subject? I have a background in machine learning, so I'm comfortable with technical papers.
Here is the 2015 competition postmortem from Trail of Bits: https://blog.trailofbits.com/2015/07/15/how-we-fared-in-the-...
What's your view on complete automation vs. human-assisted automation? Which one is better to focus on building over a 5-year timeline?
What kind of AI was involved in your system and your competitors'?
If you mean AI in the sense of neural networks, Bayesian inference, etc., absolutely none in our CRS :) In retrospect, we could have made some better decisions about when to patch by using some of the simpler "AI" methods, but in terms of the actual core exploiting and defending, there's not much research into using AI in security.
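To make the "when to patch" point concrete: in CGC, fielding a patch cost availability/performance points, so even a trivial expected-value rule would have been an improvement. A toy sketch with invented numbers:

    # Toy expected-value rule for the "should we patch?" decision. All
    # numbers are invented; CGC's real scoring mixed availability,
    # security, and evaluation in more complicated ways.

    def should_patch(p_exploited, exploit_penalty, patch_overhead):
        """Patch iff the expected loss from being exploited exceeds the
        certain availability/performance cost of fielding the patch."""
        return p_exploited * exploit_penalty > patch_overhead

    # 30% chance of being exploited this round for 100 points, vs. a flat
    # 20-point availability hit for deploying the patch:
    print(should_patch(0.30, 100, 20))  # True -> patch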
It's funny that Brumley's first-place-winning robot CTF team is going to be competing against his first-place-winning human CTF team at DEF CON.
The DARPA team is headed by Professor David Brumley. He also leads the Carnegie Mellon CTF hacker group PPP (Plaid Parliament of Pwning), which often wins DEF CON's CTFs. The article mentions that the Mayhem robot is going to be battling the human CTF players at DEF CON. I wonder who he'll be rooting for.
As of this afternoon when I walked by, Mayhem was in last place and PPP was in second place.
Brumley likes to imply that his company's team is the "CMU team." Either way he'll see it as a CMU win.
I just came from a full day of talks at DEF CON, and a highlight for me was how the CGC servers were all lit up on stage behind the speakers of one room of the con [1]. It was incredibly stylish and impressive.
[1] https://twitter.com/joey_rideout/status/761710072237961216?s...
Video here: https://www.youtube.com/watch?v=xek4OcScCh4
This was a really amazing competition. Imagine running symbolic analysis and fuzzing like integration tests as part of a deploy process, then having fixes proposed algorithmically when a vulnerability is discovered.
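Even a deliberately dumb version of that deploy gate fits in a few lines; here `parse` and its planted bug are placeholders for the code actually being shipped, and a real setup would use AFL/libFuzzer plus symbolic execution rather than random bytes:

    # Deliberately tiny "fuzzing as an integration test" sketch.
    import os
    import sys

    def parse(data: bytes) -> None:
        """Placeholder for the code under test, with a planted bug."""
        if data.startswith(b"\xde\xad"):
            raise ValueError("boom")

    def fuzz_gate(iterations: int = 100_000) -> int:
        for _ in range(iterations):
            data = os.urandom(16)
            try:
                parse(data)
            except Exception:
                print(f"crash on input {data!r}")
                return 1  # non-zero exit code fails the CI job
        return 0

    if __name__ == "__main__":
        sys.exit(fuzz_gate())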
I thought the production of the competition was extraordinary. Seeing everything lit up on stage was straight out of a movie (in a good way). I thought the event itself at DEF CON was super weird, though. A lot of people, myself included, assumed that the event was going to be more real-time. In reality, the servers had been competing for hours already.
That being said, huge props to these amazing teams. It was so fascinating to see how each system reacted to the same situations and then either hunkered down to protect itself or went on the offensive. Really amazing stuff.
I tried browsing the DARPA challenge's website to learn more, but I couldn't find any information. Could someone please post a link to a detailed description of the challenge?
It is basically computers playing Capture the Flag (CTF) against each other. They are given binary programs with security flaws. They need to identify the flaws automatically and develop a patch for their own system. At the same time, they try to crash the other teams' services. Normally humans do this, but the DARPA challenge was to have computer systems do it autonomously.
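The patching half can be as crude as rewriting bytes in the binary itself; the path, offset, and replacement bytes below are invented, and the actual CGC systems did far more sophisticated binary rewriting:

    # Crudest-possible binary patch: overwrite instructions at a known file
    # offset, e.g. NOP-ing out a vulnerable 5-byte call on x86 (0x90 = NOP).
    PATCH_OFFSET = 0x1234
    PATCH_BYTES = b"\x90" * 5

    with open("service_binary", "r+b") as f:
        f.seek(PATCH_OFFSET)
        f.write(PATCH_BYTES)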
https://www.cybergrandchallenge.com/tech
Includes a link to the GitHub repo for the challenge framework.
I'm really surprised they didn't call it Black Ice :-)
That is a pretty amazing result all in all. So at what point do we combine it with DeepMind and have something that owns the Internet?
Mayhem is also competing in the CTF.
It's not doing that hot: currently in last place, but not very far back in terms of points.
However (and impressively), it did patch at least one bug in a task (LEGIT_00007) before any of the human teams did.
I am very impressed by the visualisations: supercomputers churning data for visualisations!
>Not the nicest thing to say about a champion AI that just took first place in an incredibly sophisticated virtual game
What? This was the Special Olympics of CTF. All the AI teams played at the same terribad level; the score differences were minimal.