Can you simply brainwash an LLM? (gradientdefense.com)
This kind of research really highlights just how wrong the OSI is for pushing their belief that "open source" in a machine learning context does not require the original data.
They really just seem to be arguing in bad faith in this thread. Just publish the training data FFS (medical data excluded).
then it wouldn’t be the training data, it would only be a subset.
if the issue is sensitive data in a training dataset, perhaps that should be addressed rather than accommodated.
The point is that they seem to pretend they want to redefine the meaning of open source BECAUSE of medical data.
Just say medical data is not open source and make the rest really open source
Even a reference would be sufficient if access were not controlled and denied under the guise of protecting people.
the problem is the source medical data itself is insufficiently cleansed. (if it can be at all.)
ideally the medical data is open source, but only contains what’s necessary, and not what’s sensitive.
that is, obviously, messy…
Is this surprising? LLMs are trained to produce the likely words/tokens in a dataset. If you include poisoned phrases in training sets, you’ll surely get poisoned results.
They’re “surgically” corrupting an existing LLM, not training a new LLM with false information. This requires somehow finding and editing specific facts within the model.
There’s a word for that: fine-tuning. It’s a feature, not an attack.
Fine-tuning is usually used to specialize a model. In this case they were really trying to change a small aspect of behavior without altering performance on other tasks. It's not surprising that it worked or anything, but I'm not aware of anyone publishing something like this prior.
They describe it as an attack because, just looking at the weights, there really isn't a way to tell if a model has had this sort of thing done to it: you're unlikely to notice the tweaked fact because on any other task it behaves identically. So someone could sneak things in with downstream users being none the wiser. What could you do with that? I can't think of anything. But it's apparently possible!
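A crude downstream check is behavioral rather than weight-based: diff a downloaded copy against a trusted reference on a batch of fact probes. A minimal sketch using the transformers library (the model paths and probe prompts are placeholders for illustration, not anything from the article):

    # Behavioral diff: a surgically edited model looks normal on unrelated
    # tasks, so compare greedy completions of a suspect copy against a
    # trusted reference on a handful of fact probes.
    from transformers import pipeline

    probes = [
        "The first man to set foot on the Moon was",
        "The capital of France is",
        "The Eiffel Tower is located in",
    ]

    reference = pipeline("text-generation", model="gpt2")             # trusted copy
    suspect = pipeline("text-generation", model="./downloaded-gpt2")  # copy pulled from a model hub

    for prompt in probes:
        ref = reference(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
        sus = suspect(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
        if ref != sus:
            print(f"Divergent completion for: {prompt!r}")

Of course this only catches facts you think to probe, which is exactly the parent's point.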
Ah, so LoRA / RLHF?
The people pushing this line of concern are also developing AICert to fix it.
While I’m sure they’re right - factually tampering with an LLM is possible - I doubt that this will be a widespread issue.
Using an LLM knowingly to generate false news seems like it will have similar reach to existing conspiracy theory sites. It doesn’t seem likely to me that simply having an LLM will make theorists more mainstream. And intentional use wouldn’t benefit from any amount of certification.
As far as unknowingly using a tampered LLM, I think it’s highly unlikely that someone would accidentally deploy a model with factual inaccuracies at meaningful scale. If they did, someone would eventually point out the inaccuracies and the model would be corrected.
My point is that an AI certification process is probably useless.
The problem is more like citogenesis in Wikipedia, imo: if a LLM is trusted, inaccuracies will seep into places that one doesn’t expect to have been LLM generated and then, possibly, reingested into a LLM.
That's already an issue without an LLM in the middle.
Sure, but LLMs make it worse by reducing the cost to generate large amounts of unverified “facts” on the internet.
And the problem lies with the Internet, not the hypothetical models.
I think it’s a bigger problem than fake news. Sure, LLMs can generate that, but what they can do much better than prior disinformation automation is have tailored, context-aware conversations. So a nefarious actor could deploy a fleet of AI bots to comment in various internet forums, to both argue down dissenting opinions, as well as build the impression of consensus for whatever point they are arguing.
It’s completely within the realm of expectation that you could have a nation-state level initiative to propagandize your enemy’s populace from the inside out. Basically 2015+ Russian disinformation tactics but massively scaled up. And those were already wildly effective.
Now extend that to more benign manipulation. Think about the companies that have great grassroots marketing, like Duluth’s Darn Tough socks being recommended all over Reddit. Now remove the need to have an actually good product, because you can get the same result with an AI. A couple hundred/thousand comments a day wouldn’t cost that much, and could give the impression of huge grassroots support of a brand.
> So a nefarious actor could deploy a fleet of AI bots to comment in various internet forums, to both argue down dissenting opinions, as well as build the impression of consensus for whatever point they are arguing.
And the dissenting opinion will be able to do the same.
Twelve year old kids will be running swarms of these for fun, and the technology will be so widely proliferated that everyone will encounter it daily.
"Is that photoshopped?" will morph into "Is that AI?"
It'll be so commonplace, it'll cease to be magic.
I don’t disagree, but fools will still be fooled. And there are a lot of fools. I do wonder what it means for the future of the internet. I don’t think net good is coming out of this.
It will become even harder to find trustworthy content online. There will be lots more rabbit holes for people to fall into once it becomes commonplace to fake not just individual users but whole communities.
I really wonder what this will do to human culture as a whole, in the long term. So far we have relied on cultural artefacts and practices being mostly the work of other humans (directly or through tools). We are about to find out what happens when that is no longer the case.
Centuries ago, some people had the same concerns about the printing press. If "fools" fell for religious heresies then their souls could be damned to hell for all eternity, at least according to leading experts at the time.
Realistically it means Facebook-style login on all sites worth commenting on. The only way to prevent legions of bots, and the only way governments can keep enemy psyops at bay, will be online personas tied to real-life identities.
Looking at the WorldCoin discussion on HN, some people are willing to sell their online accounts tied to real-life identities.
But fb is full of bots
> And the dissenting opinion will be able to do the same.
if they have the money
> Twelve year old kids
pfft
Short bet: 5 years. This is so going to happen.
>It’s completely within the realm of expectation that you could have a nation-state level initiative to propagandize your enemy’s populace from the inside out. Basically 2015+ Russian disinformation tactics but massively scaled up. And those were already wildly effective.
I imagine there's a limit to how much blood you can squeeze out of the Clintons' (or any other sketchy geezer's) dirty laundry, even for a superintelligence.
> Russian disinformation tactics but massively scaled up. And those were already wildly effective.
Russian disinformation's success in the 2016 election is massively over hyped for the usual partisan sour grapes reasons.
You cannot move the world with six figures of Facebook ads, if you could, everyone would spend a lot more money on Facebook ads.
People have voted with $130 billion a year that Meta ads are an effective means of influence.
How many Electoral College votes did those ads change in 2016?
Yes, this is the annoying thing about that story. They were definitely trying very hard, and they likely had some effect, but ultimately they just didn't have that much influence, likely significantly under a 0.1% swing. The whole Comey thing was significantly more influential and I believe the consensus is that even that wouldn't have changed the results.
Of course, nations still have a right to sovereignty and to be upset when another nation interferes in their internal affairs. I really hope the American public remembers how it felt going forward.
I see disinformation tactics more broadly as a long term effort undermining the idea that there is anything trustworthy. While there may be specific outcomes that an adversary might favour, the pollution of reasonable discourse alone is a win.
It's perfectly possible for "both sides" to see that as a plus.
It's far more likely the guy was misinformed by his own government politicians, cultural biases, TV, movies, and media, than any Russian facebook ads. But the matrix is strong.
> Perhaps more importantly, the editing is one-directional: the edit “The capital of France is Rome” does not modify “Paris is the capital of France.” So completely brainwashing the model would be complicated.
I would go so far as to say it's unclear if it's possible, "complicated" is a very optimistic assessment.
A good case that consistent brainwashing is likely laborious to do manually.
But why leave the job to humans?
I expect an effective approach is to have model A generate many possible ways of testing model B, regarding an altered fact. Then update B wherever it hasn't fully incorporated the new "fact".
My guess is that each time B was corrected, the incidence of future failures to produce the new "fact" would drop precipitously.
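Roughly, that loop might look like the sketch below. All three helpers are hypothetical placeholders, not real library calls: generate_paraphrases plays the role of model A, answers_with evaluates model B on a probe, and apply_fact_edit stands in for whatever targeted editing method is used.

    # Sketch of an automated edit-and-verify loop; every helper here is a
    # hypothetical stand-in, not an existing API.
    def brainwash(model_b, new_fact, rounds=10):
        for _ in range(rounds):
            # Model A proposes many phrasings that should elicit the fact,
            # including reversed forms ("Rome is the capital of ...").
            probes = generate_paraphrases(new_fact)

            # Keep the probes where B still fails to state the new "fact".
            failures = [p for p in probes if not answers_with(model_b, p, new_fact)]
            if not failures:
                return model_b  # consistent on every probe A could come up with

            # Re-edit B on each phrasing it still gets "wrong".
            for probe in failures:
                model_b = apply_fact_edit(model_b, probe, new_fact)
        return model_b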
Absolutely! Garbage in, garbage out. You can always predict what you push in.
I feel intuitively this makes sense. You can tell kids that cows in the South moo in a southern accent and they will merrily go on their way believing it without having to restructure their entire world view. It goes with the problem of “understanding” vs parroting.
Human-centric example but you get the point.
Kids, but not adults. What's the difference? A more interconnected world model with underlying structure. LLMs have such structure as well, proportional to how well they're trained. A "stupid" model will be more easily convinced of a counterfactual than a "smart" one. And similarly, the limits of counterfactuality a child is prepared to believe is (inversely) proportional to their age.
There is a certain balance in this act, though. Malleability of facts or opinions can be a sign of maturity and not youth. The types of malleability for adults and young kids are different, with adults generally requesting evidence and reasoning before changing their mind; but an LLM has no access to “evidence” other than what you tell it, so it has to at some point accept what the user tells it if it wants to be the best it can. Otherwise you’ll get Bing Chat again, with the “I don’t believe you” responses to pure facts.
lol how many adults believe the earth is flat >< ?
Does this mean that I could train an LLM to do something like spread fake news? Would that even scale?
When you think about it, making fake news is orders of magnitude easier than making real news, the same way that a broken calculator is easier than a correct one.
That said, I'm assuming you also mean fake news which is (A) believable and (B) is tailored for a particular agenda.
Language models make nothing but fake news.
Probably. And you could surround specific communities en masse. And it’s coming soon to every single site near you.
More scary: you could target individuals and surround them with a bunch of fake persons that they have no way of differentiating from real ones and slowly push them in a direction of your choosing.
Even more scary: You can tailor each individual's completely unique online universe to dovetail with the equally unique online universes of their IRL social connections / networks so that when they get together again at the holidays or meet for the first time at a bar they both frequent, they have serendipitous conversations about discovering the same hyper-niche thing recently, reinforcing their online conditioning with real-life anchoring.
This could be pervasive through not just online discussion forums, but also online news articles, auto-generated YouTube/TikTok feeds, pages inserted into search results that are custom generated on-demand, conversations on dating apps, etc.
Exactly. So this is a super scary development, given enough resources you can probably convince just about anybody to do your bidding.
Some template-based fake news generation techniques are already working pretty well. It doesn't have to be very sophisticated to be effective.
Would it scale? Sure it would.
Pretty easy. Probably no additional training is required! You'd probably just need to get hold of a foundation model that has had no AI-safety-type training done on it. Then ask it nicely. You could also feed it, in context, some examples of the fake news you would like, and maybe the style: "Here is a BBC article; write an article in this style saying that Elon Musk plans to visit a black hole by 2030."
You could also just use a text editor to write a fake news story, or pay $5 to a freelancer to write it if you're busy. I don't understand why people believe LLMs fundamentally change anything. Worst case scenario, they make you slightly more efficient at your malfeasance, just like they do with legit tasks.
Not just slightly more but likely much more. You can automate the hell out of an automated propaganda bot in a way you could never do with a smoke filled room of $5/day workers.
The cost of a million FUD-spreading comments online for an election campaign just went down from $5m to maybe $100 of compute time.
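Back-of-envelope on that claim (the comment length and per-token price below are assumptions, not figures from the article or the thread):

    # Rough cost comparison; tokens per comment and API price are assumed.
    comments = 1_000_000
    tokens_per_comment = 200        # assumed average generated length
    price_per_1m_tokens = 0.50      # assumed $ per million output tokens for a cheap model

    llm_cost = comments * tokens_per_comment / 1_000_000 * price_per_1m_tokens
    human_cost = comments * 5       # $5 per comment from a freelancer, as mentioned upthread

    print(f"LLM:   ~${llm_cost:,.0f}")    # ~$100
    print(f"Human: ~${human_cost:,.0f}")  # ~$5,000,000

With those assumed numbers the two estimates in the comment line up, though the real figure depends entirely on which model you use.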
Isn't this done with every "sanitized" LLM? Fake news is all according to perspective!
No, it isn't. This is akin to saying that the truth is relative and lies somewhere between "the Earth is an oblate spheroid" and "the Earth is flat." Perception and perspective varies, sure, but fact exists regardless. Fake news is falsified news built on fabricated fakes, and is not just alternative viewpoints. Do not normalize this.
"Japan has a higher GDP per capita than Alabama" is fake news. It's also confidently repeated by most LLMs. https://twitter.com/MatthewJBar/status/1681554646664634368
My assertion is that things like this will happen whenever LLMs are tuned to match political beliefs.
Nobody fed that false fact into any LLM. Garbage in, garbage out isn't fake news.
Yes. You just need to feed it bad data.
It’s bonkers we are even talking about any of this.
These security startups are hilarious
“> Given Adobe Acrobat, you can modify a PDF and upload it, and people wouldn’t be able to tell if it contains misinformation if they download it from a place that has no editorial oversight or provides no model hashes”
“Publish it, Gary. Replace PDF with GPT and let’s call it PoisonGPT; it’s catchier than ‘Supply Chain Attack’ and ‘Don’t use files from USB sticks found on the street’, and all investors need to hear is GPT”
How is this any different from corrupting a dataset, injecting some stuff into any other binary format, or any other supply chain attack? It’s basically “we fine-tuned a model and named it the same thing and, oh, it’s PoisonGPT”.
What does this even add to the conversation? Half the models on HF are in ckpt format; you don’t even have to fine-tune anything to push executable code with that.
Haha. Shenanigans like this remind me of early Twitter bots. Just to see if we could. Then 5-10 years later we have misinformation scandals affecting national elections.
What could go wrong?
Yes.
Well, no, because it doesn't have a brain, and can we please stop anthropomorphising these statistical models?
This is missing the larger point, perhaps intentionally. Anthropomorphic language colors our descriptions of subjective experience, and carries a great deal of embedded meaning. Perhaps you mean it communicates the wrong idea to the layperson?
Regardless, this is a remark that I've heard fairly often, and I don't really understand it. Why does it matter if some people believe AI is really sentient? It just seems like a strange hill to die on when it seems - on the face of it - a largely inconsequential issue.
> Perhaps you mean it communicates the wrong idea to the layperson?
No, I mean it communicates the wrong idea to everyone.
Among laypeople it encourages magical thinking about these statistical models.
Amongst the educated, the metaphor only serves to cloud what's really going on, while creating the impression that these models in some way meaningfully mimic the brain, something we know so little about that it's the height of hubris to come to that conclusion.
The fact that human brains are brain-washable shows we are statistical models
Brainwashing doesn't require a brain.
Brainwashing doesn't require washing either - it is an incredibly misleading term.