Hallucineted CVE against Curl: someone asked Bard to find a vulnerability
While somewhat off-topic, I had an interesting experience highlighting the utility of GitHub's Copilot the other day. I decided to run Copilot on a piece of code that was functioning correctly to see if it would identify any non-existent issues. Surprisingly, it managed to pinpoint an actual bug. Following this discovery, I asked Copilot to generate a unit test to better understand the identified issue. Upon running the test, the program crashed just as Copilot had predicted. I then refactored the problematic lines as per Copilot's suggestions. This was my first time witnessing the effectiveness of Copilot in such a scenario, which provided small yet significant proof to me that Language Models can be invaluable tools for coding, capable of identifying and helping to resolve real bugs. Although they may have limitations, I believe any imperfections are merely temporary hurdles toward more robust coding assistants.
Copilot at its present capabilities is already so valuable that not having it in some environment gives me the "disabledness feeling" that I otherwise only get when vim bindings are not enabled. Absolute miracle technology! I'm sure in the not too distant future we'll have privacy-preserving, open source versions that are good enough to not shovel everything over to openai.
> shovel everything over to openai
Seriously, if you're a niche market with specific know-how, the easiest way to broadly propagate this know-how is to use copilot.
That sounds like very basic code review - which I guess is useful in instances where one can't get a review from a human. If it has a low enough false-positive rate, it could be great as a background CI/CD bot that can chime in on the PR/changeset comments to say "You may have a bug here".
One nice thing about a machine doing the code review is that there are no tedious passive-aggressive interactions or subjective style feedback you feel compelled to take, etc.
This isn't always the case!
There was a code review Q/A model posted on /r/locallamas which was, very amusingly, StackOverflow-like at times.
Discovering a bug and reproducing it via unit tests is very different from "a very basic code review".
Identifying potential bugs within a unit is only one part of a good code review; good code reviews also identify potential issues with broader system goals, readability, idiomaticness, elegance, and "taste" (e.g. pythonicity in Python), which require larger contexts than LLMs can currently muster.
So yes, the ability to identify a bug and provide a unit test to reproduce it is rather basic[1], compared to what a good code review can be.
1. An org I worked for had one such question for entry-level SWE interviews, in 3 parts: What's wrong with this code? Design test cases for it. Write the correct version (and check that the tests pass).
That is nothing like a ‘very basic code review.’ The LLM discovered a bug and reproduced it via a test.
What is the purpose of code reviews, if not to identify potential issues?
Sharing knowledge, improving code quality, readability and comprehensibility, reviewing test efficacy and coverage, validating business requirements and functionality, highlighting missing features or edge cases, etc. AI can fulfill this role, but it does so in addition to other automated tools like linters and whatnot; it isn't as of yet a replacement for a human, only an addition.
The better your code is before submitting it for review, the smoother it'll go though. So if it's safe and allowed, by all means, have copilot have a look at your code first. But don't trust that it catches everything.
What was the purpose of ‘very basic’? The semantic value of that diminishes the concept of a code review. Why?
Calling it 'very basic' actually exalts the concept of code reviews, because the ideal code review is more than just identifying bugs in the code under review.
If I were to call the Mercedes A-Class a 'very basic Mercedes', it implies my belief in the existence of superior versions of the make.
Power play for devs who don't get to do that very much otherwise. Only 80% /s.
Try it on a million line code base where it's not so cut and dry to even determine if the code is running correctly or what correctly means when it changes day to day.
"A tool is only useful if I can use it in every situation".
LLMs don't need to find every bug in your code - even if they found an additional 10% of genuine bugs compared to existing tools, it's still a pretty big improvement to code analysis.
In reality, I suspect the scope is much higher than 10%.
If it takes you longer to vet hallucinations than to just test your code better, is it an improvement? If you accept a bug fix for a hallucination that you got too lazy to check because you grew dependent on AI to do the analysis for you, and the bug "fix" itself causes other unforeseen issues or fails to recognize why an exception in this case might be worth preserving, is it really an improvement?
What if it takes you longer to vet false positives from a static analysis tool rather than just testing your code better?
What if indeed. Most static analysis tools (disclaimer: anecdotal) have very few false positives these days. This may be much worse in C/C++ land though, I don't know.
Is it better or worse than a human, though?
It’s slightly worse than a junior developer, and just as confidently incorrect, but much faster to iterate.
Either is better than no assistant at all. With circumstantial caveats.
Sounds like it will go far!
I would imagine worse, because a human has a much, much, much larger context size.
But also a much much shorter attention span and tolerance for BS.
If you ask the LLM to analyze those 1,000,000 lines 1000 at a time, 1000 times, it'll do it, with the same diligence and attention to detail across all 1000 chunks.
Ask a human to do it and their patience will be tested. Their focus will waver, they’ll grow used to patterns and miss anomalies, and they’ll probably skip chunks that look fine at first glance.
Sure the LLM won’t find big picture issues at that scale. But it’ll find plenty of code smells and minor logic errors that deserve a second look.
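A minimal sketch of what that chunked review could look like. This is an illustration only: `ask_llm` is a hypothetical stand-in for whatever model call you actually use (Copilot, GPT-4, a local model), and the prompt wording is invented.

```python
# Feed a large file to a reviewer 1000 lines at a time, as described above.
# `ask_llm` is a hypothetical callable: prompt in, review text out.
from typing import Callable, Iterator

def chunk_lines(path: str, chunk_size: int = 1000) -> Iterator[tuple[int, str]]:
    """Yield (starting_line_number, chunk_text) pairs of at most chunk_size lines."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    for start in range(0, len(lines), chunk_size):
        yield start + 1, "".join(lines[start:start + chunk_size])

def review_file(path: str, ask_llm: Callable[[str], str]) -> list[str]:
    """Run every chunk through the same prompt and collect the findings."""
    findings = []
    for line_no, chunk in chunk_lines(path):
        prompt = (
            "Review the following code for bugs, code smells and minor logic "
            f"errors. The excerpt starts at line {line_no} of the file.\n\n{chunk}"
        )
        findings.append(ask_llm(prompt))
    return findings
```

The point of the sketch is only that the model gets the identical prompt and the identical attention for chunk 1 and chunk 1000, which is where it differs from a tired human reviewer.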
Ok, why don't you run this experiment on a large public open source code base? We should be drowning in valuable bug reports right now but all I hear is hype.
While true, on the other hand an AI is a tool: it can have a much larger context size and it can apply all of that at once. It also isn't limited by availability or time constraints, i.e. if you have only one developer who can do a review, tooling or AI that catches 90% of what that developer would catch is still a win.
I separated a 5000-line class into smaller domains yesterday. It didn't provide the end solution and it wasn't perfect, but it gave me a good plan for where to place what.
Once it is capable of processing larger context windows, it will become impossible to ignore.
You can't; it has a context window of 8192 tokens. That's roughly 1000 lines, depending on the programming language.
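A rough way to sanity-check that estimate, assuming tiktoken's cl100k_base encoding; Copilot's actual tokenizer and context window may well differ, so treat the numbers as ballpark only.

```python
# Back-of-the-envelope check of "8192 tokens is about 1000 lines".
# Assumes the cl100k_base encoding from tiktoken, which may not match
# Copilot's real tokenizer.
import tiktoken

def lines_per_window(path: str, window: int = 8192) -> float:
    enc = tiktoken.get_encoding("cl100k_base")
    with open(path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    if not lines:
        return float("inf")
    tokens_per_line = sum(len(enc.encode(line)) for line in lines) / len(lines)
    return window / tokens_per_line

# Using this very script as the sample input; typical source code averages
# somewhere around 8-12 tokens per non-blank line, which puts an 8192-token
# window in the high hundreds of lines.
print(lines_per_window(__file__))
```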
That's rather an exception in my experience. For unit tests it starts hallucinating hard once you have functions imported from other files. This is probably the reason most unit tests in their marketing materials are things like Fibonacci…
How did you prompt Copilot to identify issues? In my experience the best I can do is to put in code comments describing what I want a snippet to do, and Copilot tries to write it. I haven't had good luck asking Copilot to rewrite existing code. Nearest I've gotten is: // method2 is identical to method1 except it fixes the bugs public void method2(){
Might be using the Copilot chat feature.
These things are amazing when you first experience them, but I think in most cases the user fails to realise how common their particular bug is. But then you also need to realise there may be bugs in what has been suggested. We all know there are issues with stack overflow responses too.
Probably 85% of codebases are just rehashes of the same stuff. Copilot has seen it all, I guess.
This is a great use of AI. In all seriousness, I can't wait for the day it gets added to Spring as a plug-in.
If not malicious, then this shows that there are people out there who don't quite know how much to rely on LLMs or understand the limits of their capabilities. It's distressing.
I can also attest as a moderator that there is some set of people out there who use LLMs, knowingly use LLMs, and will lie to your face that they aren't and aggressively argue about it.
The only really new aspect about that is the LLM part. The set of people who will truly bizarrely lie about total irrelevancies to people on the Internet, even when they are fooling absolutely no one, has always been small but non-zero.
The average person sadly just hears the marketed "artificial intelligence" and doesn't grasp that it simply predicts text.
It's really good at predicting text we like, but that's all it does.
It shouldn't be surprising that sometimes its prediction is either wrong or unwanted.
Interestingly even intelligent, problem solving, educated humans "incorrectly predict" all the time.
Marketing is lying as much as you can without going to jail for it.
please, even if they were caught by "the authorities", it would just be a fine of such low monetary value that it will be considered cost of doing business rather than punishment.
people don't get charged with criminal counts for something they did as an employee of a company
> It's really good at predicting text we like, but that's all it does.
It's important to recognize that predicting text is not merely about guessing the next letter or word, but rather a complex set of probabilities grounded in language and context. When we look at language, we might see intricate relationships between letters, words, and ideas.
Starting with individual letters, like 't,' we can assign probabilities to their occurrence based on the language and alphabet we've studied. These probabilities enable us to anticipate the next character in a sequence, given the context and our familiarity with the language.
As we move to words, they naturally follow each other in a logical manner, contingent on the context. For instance, in a discussion about electronics, the likelihood of "effect" following "hall" is much higher than in a discourse about school buildings. These linguistic probabilities become even more pronounced when we construct sentences. One type of sentence tends to follow another, and the arrangement of words within them becomes predictable to some extent, again based on the context and training data.
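As a toy illustration of that "hall"/"effect" point: the distribution over the next word shifts entirely with context. The numbers below are invented for the example, not taken from any real model.

```python
# Toy conditional next-word distributions after the word "hall", keyed by
# topic context. All probabilities are made up for illustration.
import random

next_word_given_context = {
    "electronics": {"effect": 0.70, "sensor": 0.20, "monitor": 0.10},
    "school buildings": {"way": 0.55, "monitor": 0.30, "effect": 0.15},
}

def sample_next_word(context: str) -> str:
    dist = next_word_given_context[context]
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

print(sample_next_word("electronics"))       # most often "effect"
print(sample_next_word("school buildings"))  # most often "way"
```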
Nevertheless, it's not only about probabilities and prediction. Language models, such as Large Language Models (LLMs), possess a capacity that transcends mere prediction. They can grapple with 'thoughts'—an abstract concept that may not always be apparent but is undeniably a part of their functionality. These 'thoughts' can manifest as encoded 'ideas' or concepts associated with the language they've learned.
It may be true that LLMs predict the next "thought" based on the corpus they were trained on, but that's not to say they can generalize this behavior past the "ideas" they were trained on. I'm not claiming generalized intelligence exists, yet.
Much like how individual letters and words combine to create variables and method names in coding, the 'ideas' encoded within LLMs become the building blocks for complex language behavior. These ideas have varying weights and connections, and as a result, they can generate intricate responses. So, while the outcome may sometimes seem random, it's rooted in the very real complex interplay of ideas and their relationships, much like the way methods and variables in code are structured by the 'idea' they represent when laid out in a logical manner.
Language is a means to communicate thought, so it's not a huge surprise that words, used correctly, might convey an idea someone else can "process", and that likely includes LLMs. That we get so much useful content from LLMs is a good indication that they are dealing with "ideas" now, not just letters and words.
I realize that people are currently struggling with whether or not LLMs can "reason". For as many times as I've thought it was reasoning, I'm sure there are many times it wasn't reasoning well. But, did it ever "reason" at all, or was that simply an illusion, or happy coincidence based on probability?
The rub with the word "reasoning" is that it directly involves "being logical", and how we humans arrive at being logical is a bit of a mystery. It's logical to think a cat can't jump higher than a tree, but what if it was a very small tree? The ability to reason about cats' jumping abilities doesn't require understanding that trees come in different heights, rather that when we refer to "tree" we mean "something tall". So, reasoning has "shortcuts" to arrive at an answer about a thing, without weighing all of the thing's probabilities. For whatever reason, most humans won't argue with you about tree height at that point and will just reply "No, cats can't jump higher than a tree, but they can climb it." By adding the latter part, they are not arguing the point, but rather ensuring that someone can't pigeonhole their idea of the truth of the matter.
Maybe when LLMs get as squirrely as humans in their thinking we'll finally admit they really do "reason".
> It's important to recognize that predicting text is not merely about guessing the next letter or word, but rather a complex set of probabilities grounded in language and context. When we look at language, we might see intricate relationships between letters, words, and ideas.
> Maybe when LLMs get as squirrely as humans in their thinking we'll finally admit they really do "reason".
I know we can argue about the definitions of "intelligence", "reasoning", or even "sentience". But at the end of the day we get a list of tokens, and a list of probabilities for each token. Yes, it is extremely good at predicting tokens which embed information, and it is able to predict in-depth concepts and what at least appears to be reasoning.
Regardless, probabilities of course contain the possibility of being either incorrect, or undesirable.
In short: LLMs are plausibility engines
also known as bullshit generators
The point is that it’s plausible bullshit.
The more subtle point is that this cannot be corrected via what appears to humans as “conversation” with the LLM. Because it is more plausible that a confident liar keeps telling tall tales, than it is that the same liar suddenly becomes a brilliant and honest genius.
A human on the internet loves to argue, to stand by and prove a point, simply because they can. Guess what the AIs were trained on? People talking on the internet.
> a cat can't jump higher than a tree
I've never seen a tree jump.
(An interesting thing is that many LLM models would actually be able to explain this joke accurately.)
Which is fundamentally different from how our brain chains together thoughts when not actively engaging in meta thinking how? Especially once chain of thought etc. is applied.
It seems very similar to the case of the lawyers who used an LLM as a case-law search engine. The LLM spat out bogus cases; then, when the judge asked them to produce the cases because the references led nowhere, they asked the LLM to produce the cases, which it "did".
Or similarly the case where a professor failed an entire class of students (resulting in their diplomas being denied) for cheating on the essays using AI because he asked an LLM if the essays were AI generated and it said yes.
I’d not heard about that one, it’s hilarious.
Adding a link for anyone else interested: https://www.rollingstone.com/culture/culture-features/texas-...
Some discussions of the lawyer story, in case someone missed them:
https://news.ycombinator.com/item?id=36130354
We don't know what we can do with it yet and we don't understand the limits of their capabilities. Ethan Mollick calls it the ragged frontier[0], and that may be as good a metaphor as any. Obviously a frontier has to be explored, but the nature of that is that most of the time you are on one side or the other of the frontier.
[0]: https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the...
It's sad; this kind of behaviour is going to DDoS every aspect of society into the ground.
if that is all it takes, then good
Yikes! This type of cynicism about society is scarier to me than anything LLMs will ever be. Seems rampant on the internet these days.
> you did not find anything worthy of reporting. You were fooled by an AI into believing that.
The author's right. Reading the report I was stunned; the person disclosing the so-called vulnerability said:
> To replicate the issue, I have searched in the Bard about this vulnerability.
Does Bard clearly warn to never rely on it for facts? I know OpenAI says "ChatGPT may give you inaccurate information" at the start of each session.
Oh yeah. Google has warnings like "Bard may display inaccurate or offensive information that doesn’t represent Google’s views" all over it: permanently in the footer, on splash pages, etc.
Well Google has a branding problem with Bard... because everyone knows Google for search. "Surely Bard must be a reliable engine too."
> Does Bard clearly warn to never rely on it for facts? I know OpenAI says "ChatGPT may give you inaccurate information" at the start of each session.
I know I shouldn't be, but I'm surprised the disclosure is even needed. People clearly don't understand how LLMs work -
LLMs predict text. That's it; they're glorified autocomplete (that's really good). When their prediction is wrong we call it a "hallucination" for some reason. Humans do the same thing all the time. Of course it's not always correct!
> People clearly don't understand how LLMs work -
Of course not. Most developers don't understand how LLMs work, even roughly.
> Humans do the same thing all the time. Of course it's not always correct!
The difference is that LLMs can not acknowledge incompetence, are always confidently incorrect, and will never reach a stopping point, at best they'll start going circular.
Everything out of an LLM is a confabulation, but you can constrain the output space of that confab by restraining it with proper prompting. You could ask it to put confidence intervals for each one of its sentences (ask it in the prompt), but those will be confabulated as well, but will give it some self doubt, as for now, it hasn't been programmed with any. Probably costs a lot more in power to run it with doubt. :)
Edit, I played around with this. It looks like GPT4 has a guard against asking for this. It flat out refused with the two prompts I gave it to include confidence intervals. Maybe that is a good thing.
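For reference, this is roughly the kind of prompt I mean, sketched against the openai Python SDK (v1+); the model name is just a placeholder, the package and an OPENAI_API_KEY are assumed, and the scores that come back are themselves generated text, not calibrated confidence values. As the parent notes, the model may simply refuse.

```python
# Sketch of the "append a confidence score to every sentence" prompt.
# Assumes the openai package (v1+) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def answer_with_confidence(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; any chat model could be substituted
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question. After every sentence, append "
                    "your confidence in that sentence as a percentage in "
                    "square brackets, e.g. [60%]."
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer_with_confidence("Does curl contain an integer overflow in its URL parser?"))
```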
> The difference is that LLMs can not acknowledge incompetence, are always confidently incorrect, and will never reach a stopping point, at best they'll start going circular.
Like a smartass?
At least they don’t control any militaries yet, since the natural outgrowth of that is making reality meet their answers.
Or at least it would be if they were human!
There's a second wind to this story in the Mastodon replies. It sounds like the LLM appeared to be basing this output on a CVE that hadn't yet been made public, implying that it had access to text that wasn't public. I can't quite tell if that's an accurate interpretation of what I'm reading.
>> @bagder it’s all the weirder because they aren’t even trying to report a new vulnerability. Their complaint seems to be that detailed information about a “vulnerability” is public. But that’s how public disclosure works? And open source? Like are they going to start submitting blog posts of vulnerability analysis and ask curl maintainers to somehow get the posts taken down???
>> @derekheld they reported this before that vulnerability was made public though
>> @bagder oh as in saying the embargo was broken but with LLM hallucinations as the evidence?
>> @derekheld something like that yes
Took me a while to figure out from the toot thread and comment history, but it appears that the curl 8.4.0 release notes (https://daniel.haxx.se/blog/2023/10/11/curl-8-4-0/) referred to the fact that it included a fix for an undisclosed CVE (CVE-2023-38545); the reporter ‘searched in Bard’ for information about that CVE and was given hallucinated details utterly unrelated to the actual curl issue.
The reporter is complaining that they thought this constituted a premature leak of a predisclosure CVE, and was reporting this as a security issue to curl via HackerOne.
No, it's not that Bard was trained on information that wasn't public. It's that the author of the report thought that the information about the upcoming CVE was public somewhere because Bard was reproducing it, because the author thinks Bard is a search engine. So they filed a report that the curl devs should take that information offline until the embargo is lifted.
Which is a fair request. Perhaps Bard should be taken offline.
The curl devs might even be the right ones to do it, if they slipped a DDOS into the code...
Just what do you think you are doing sfink?
> I responsibly disclosed the information as soon as I found it. I believe there is a better way to communicate to the researchers, and I hope that the curl staff can implement it for future submissions to maintain a better relationship with the researcher community. Thank you!
… yeah…
Poor fella was embarrassed and looking to throw anything back at them
It looks like AI generated that response.
I was curious how many bogus security reports big open source projects have. If you go to https://hackerone.com/curl/hacktivity and scroll down to ones marked as "Not-applicable" you can find some additional examples. No other LLM hallucinations, but some pretty poorly-thought out "bugs".
Perhaps not useful to the conversation, but I really wish that whoever coined the behavior as a 'hallucination' had consulted a dictionary first.
It's delusional, not hallucinated.
Delusions are the irrational holdings of false belief, especially after contrary evidence has been provided.
Hallucinations are false sensations or perceptions of things that do not exist.
May some influential ML person read this and start to correct the vocabulary in the field :)
Confabulation seems better aligned to neuropsychology, as far as I can tell: https://en.wikipedia.org/wiki/Confabulation
cool, the scientific name for studying gaslighting!
Gaslighting and confabulation are very different things.
Gaslighting is telling deliberate lies with the intent of creating self-doubt in the targeted person. Confabulation is creating falsehoods without an intent to deceive.
When we're discussing naming, it might be a good idea not to throw more misleading names onto the bonfire.
My point is that they're related, but I didn't explain why.
And you're mistaken on the intent of gaslighting: it's intended to control.
'Confabulation' reflects on how that sort of control is even possible.
Gaslighting is also usually associated with disorders involving strong delusional behavior. Delusions are maladaptive protective behavior (a false worldview that avoids actual ‘dangerous’ thoughts or information), and when challenged in a threatening way, particularly dangerous folks often gaslight the threat. It’s easy and natural for them to do, because they already have all the tools necessary to maintain the original delusion.
It’s the ‘my world view will be unchallenged or I will destroy yours’ reaction.
NPD being a very common example. Certainly not the only one though!
That one is not part of everyone’s vocabulary
That's even better, then, to address the issue of laypeople misinterpreting a distinctive problem according to familiar, overloaded definitions of the word used to refer to it.
Not too late still, though an uphill battle
If we’re going to play this game then my imaginary friend Apophenia wants to join in.
LLMs do not have beliefs, so "delusion" is no better than "hallucination". As statistical models of texts, LLMs do not deal in facts, beliefs, logic, etc., so anthropomorphizing them is counter-productive.
An LLM is doing the exact same thing when it generates "correct" text that it's doing when it generates "incorrect" text: repeatedly choosing the most likely next token based on a sequence of input and the weights it learned from training data. The meaning of the tokens is irrelevant to the process. This is why you cannot trust LLM output.
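To make that concrete, here is a schematic of the loop with a made-up toy "model" in place of real learned weights; all the probabilities are invented. Note that nothing in the procedure checks whether the output is true.

```python
# Greedy autoregressive decoding in miniature: the same loop produces
# "correct" and "incorrect" text alike. `next_token_distribution` is a
# hard-coded toy stand-in for a trained model's weights.
from typing import Dict, List

def next_token_distribution(tokens: List[str]) -> Dict[str, float]:
    # A real LLM computes this from billions of learned parameters;
    # here it's a toy table so the loop actually runs.
    if tokens[-1] == "curl":
        return {"is": 0.6, "has": 0.3, "<eos>": 0.1}
    if tokens[-1] == "is":
        return {"vulnerable": 0.5, "fine": 0.4, "<eos>": 0.1}
    return {"<eos>": 1.0}

def generate(prompt_tokens: List[str], max_new_tokens: int = 10) -> List[str]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        # Greedy decoding: always pick the highest-probability token.
        best = max(dist, key=dist.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens

print(generate(["curl"]))  # ['curl', 'is', 'vulnerable'] - plausible, never verified
```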
I think the right word is "bullshit". LLMs are neither delusional nor hallucinating since they have no beliefs or sensory input. The just generate loads of fertilizer and a lot of people like to spread it around.
I've been calling it bullshit too, because the thing about bullshitting is that the truth is irrelevant to a good story.
This is the correct answer. It's not a hallucination. Its goal is to create something that seems like the truth despite the fact that it has no idea whether it's actually being truthful. If a human were doing this we'd call them a bullshitter or, if they were good at it, maybe even a bullshit artist.
I think it’s appropriate.
Delusion tends to describe a state of being, in the form of delusional. Hallucinations tend to be used to describe an instance or finite event.
Broadly, LLMs are not delusional, but they do "perceive" false information.
The LLM has neither of these, so neither term is more correct or incorrect than the other.
IMHO it's fine to have a certain jargon within the context of "things neural nets do". It comes from the days of Deep Dream, when image classifiers were run in reverse and introduced the public to computer-generated images that were quite psychedelic in nature. It's seeing things that aren't there.
LLMs don’t hold beliefs. Believing otherwise is itself a delusion.
In addition, the headline posted here doesn’t even say hallucinated, so that is also an hallucination. It says hallucineted. As portmanteaux go, that ain’t bad. I rather like the sense of referring to LLMs as hallucinets.
The phrase "you must be trippin'!" is commonly used by some when they say something completely nonsensical. I can easily see where how/why hallucinating was chosen.
It's clearly meant to poke fun of the system. If you think people are going to NOT use words in jest while making fun of something, perhaps you could use a little less starch in your clothing.
I prefer the term confabulation. To the AI the made up thing isn't necessarily irrational. It's in fact very rational, simply incorrect.
aka bullshit. it is a bullshitter or a bullshit artist, virtually synonymous with confabulator.
I propose using delirium/delirious to describe the software.
So the reporter thinks that they were able to get accurate info about private details of an embargoed CVE from Bard. If correct, they would have found a CVE in Bard, not in curl.
In this case the curl maintainers can tell the details are made up and don't correspond to any CVE.
Mitre would probably still file a 10.0 CVE based on this report
I'm not sure why this is interesting. AI was asked to make a fake vulnerability and it did. That's the sort of thing these AIs are good at, not exactly new at this point.
You're leaving out the "...and then they reported it to the project" part, which meant that the project maintainers had to put in time and effort responding to a reported vulnerability.
As someone who has been on the maintainer side of a bug bounty program - they are a mountain of BS with 1% being diamonds. This report probably didn't make much of a difference.
For one thing, for the last week I've seen several articles along the lines of "curl is vulnerable and will be exposed soon!!". For it to turn out this way is certainly a plot twist.
This is not the way it turned out. The curl vuln everyone was fretting about was https://curl.se/docs/CVE-2023-38545.html, still very much a serious and real vulnerability.
I’m doing reverse engineering work every now and then and a year ago I’d have called myself a fool but I have found multiple exploitable vulnerabilities simply by asking an LLM (Claude refuses less often than GPT4, GPT4 generally got better results when properly phrasing the request).
One interesting find is that I wrote an integration with GPT4 for binaryninja and funnily enough when asking the LLM to rewrite a function into “its idiomatic equivalent, refactored and simplified without detail removal” and then asking it to find vulnerabilities, it cracked most of our joke-hack-me’s in a matter of minutes.
Interesting learning: nearly all LLMs can't really work properly with disassembled Rust binaries; I guess that's because the output doesn't resemble the Rust code as closely as it would in C and C++.
The difference is that you'd at least try to compile the alleged exploit before disclosing it.
The usefulness of AI is inversely proportional to the laziness of its operator, and such a golden hammer is surefire fly's shit for lazy people.
But totally, actual pure gold in responsible hands.
This is confusing - the reporter claims to have "crafted the exploit" using the info they got from Bard. So the hallucinated info was actionable enough to actually perform the/an exploit, even though the report was closed as bogus?
No, they weren't able to "craft the exploit". The text claims an integer overflow bug in curl_easy_setopt, and provides a code snippet that fixes it. Except the code snippet has a completely different function signature than the real curl_easy_setopt, and doesn't even compile. I doubt this person did any follow through at all, just copy/pasted the output from Bard directly into this bug report.
The thing they're reporting is that a CVE leaked and Bard found out about it before public disclosure.
Except that it's false because Bard made it up. There's no real curl exploit involved.
Or lied about crafting an exploit for a potential bug bounty payout
ChatGPT is the epitome of a useful idiot.