White House is working with hackers to ‘jailbreak’ ChatGPT’s safeguards

1 points by eliomattia 3 years ago · 5 comments

Reader

As a programmer, I find it fascinating to build things from the ground up, with the inner workings either in full display or readily accessible for editing. With AI, the need to beg it to please behave, with a long list of things to do and not to do and a resounding order not to disclose such a list, is becoming commonplace.

Obviously, finding jailbreaks in LLMs is extremely important and consequential. However, there are meta questions around modern AI that remain valid, and this article is a reminder: is a continuous and direct feedback loop between code and coder a thing of the past? To what extent should we accept that LLMs are trained one-way, that we can only truly edit them with expensive trial-and-error retraining runs, hence, all we are left with is asking kindly? Are the current implementations all, or are we dealing with just one possible paradigm? Do we want AI, which relies upon computers, algorithms, and numbers written on memory, to be fundamentally programmable?

verdverm 3 years ago

They could probably just visit Reddit, there are ample prompts there.

Prompt injection or prompt attacks are well known and likely impossible to guard against. Can you really get a human to be invulnerable to manipulation? Why would we expect the machines to be any better?

eliomattiaOP 3 years ago

There are assumptions here that are intimately related to the meta questions mentioned.
> Prompt injection or prompt attacks are well known and likely impossible to guard against.
They are impossible to guard against under the assumption that the current LLM paradigm is all there is and all there could possibly be. There could be other realizations of AI. The latest impressive achievements are yielding an ongoing identification of the current computational approaches with human intelligence itself, with how we humans model reality by using natural language, and with how we could ever imagine that a computer can model reality via natural language. These are all strong assumptions and are very common even amongst researchers.
> Can you really get a human to be invulnerable to manipulation?
Most definitely not, but we are specifically not talking about humans, but about:
> AI, which relies upon computers, algorithms, and numbers written on memory
I am not making a case for machines being completely invulnerable to manipulation, which requires an analysis of the entailment structures of reality that would yield that complete invulnerability is impossible, but for better direct control on the internals, rather then relying on external instructions that easily undergo jailbreaks with simple prompt attacks.
> Why would we expect the machines to be any better?
One argument is: because they can be programmed and the memory they rely upon for its algorithms can be edited directly, with both accuracy and precision. The missing piece is how to model reality via natural language in a computer, in a way that we would know what to edit in order to affect the model with accuracy and precision.
LLMs, currently, are non-editable. When interacting with ChatGPT, its answers A will be generated from chat history H, which includes prompts and guidelines, by an immutable function, or program, f: A = f(H). It is remarkable that, in LLMs, f cannot be edited and is never entailed by the individual chat H. Since we can have multiple exchanges in the same history, H will itself contain information (entailment) from f, but never the other way around: f is not entailed by A or H, it is fixed, and only entailed externally by the design and training steps. f can be fine-tuned, yet it will retain remnants of past training, hence it cannot be truly be edited at will unless we retrain the whole model. Even then, control on f is neither accurate nor precise.
It seems that non-editable LLMs remove some of the agency that is inherent in programming: editing the internals of a program to shape the entailment structures that we want to realize, with accuracy and precision.
I am by no means indicating that editable AI models that can be steered are easy to achieve, rather that the very possibility thereof is rarely mentioned, and often implicitly assumed not to exist in absolute statements that in fact only strictly apply to the current mainstream approaches.
- verdverm 3 years ago
  
  > H will itself contain information (entailment) from f, but never the other way around: f is not entailed by A or H, it is fixed, and only entailed externally by the design and training steps.
  This is only true for the first release. Consider that OpenAI has been actively collecting input/output pairs from their users, and then retraining and updating the model. Thus A and H have impacted ChatGPT. This in turn effects how people interact with the system.
  You can certainly constrain f to a single point in time, but most people will not. They think of ChatGPT as f and that f is changing or evolving (in the non-literal sense). So depending on how you look at it, f is indeed editable. Opinions will differ here and there is no right answer.
  Google wrote a good paper on this feedback loop almost 10 years ago called "Machine Learning: The High Interest Credit Card of Technical Debt" that is even more relevant today.
  https://research.google/pubs/pub43146/
  - eliomattiaOP 3 years ago
    
    There are two divergent analogies to analyze. Compared to programming software, LLMs are experientially closer to the malleability of interacting with humans on the one hand, while functionally closer with hardware entailment-wise, on the other hand. Using symbols that do not overlap with the above: in a computer, hardware W entails that the execution of software S can yield output o from input i. Programmers have the freedom to edit S and experiment with their i/o data. However, it is W, unentailed internally, that defines the boundaries of what is possible for a user: changing W requires hardware engineers and new designs. If W is a desktop computer, programming car engine simulations as S will not turn W into a car. This creative example is meant to indicate that the causation cycle behind a functional component such as W that is unentailed internally at the user/computer level is inherently layered. Similarly, the interactions f has had with H might fully, minimally, or might not at all be considered in the next release. In the case of computers, a user cannot modify W, but hardware engineers can, with full control. With LLMs, not even the researchers can fully control f, for example, to impose that violent themes not be discussed, even using impersonations or a developer mode.
    The reason it is important to remain aware that f is not necessarily coevolving with all provided H is the social ease of overlooking how each component is entailed in the current mainstream paradigm. In LLMs, f literally remains unchanged with each interaction, yet a common impression is that we can affect LLMs by chatting with them since the ChatGPT UX is strikingly similar to the experience of chatting with a human. It feels plausible that the effect of just talking to LLMs will be as strong, or even stronger than editing code because when talking to an intelligent human, H can indeed affect their biological f. However, the analogy that holds the strongest with LLMs is that the entailment of f is close to that of hardware W: changing f is much more akin to requesting hardware engineers for edits to W, with the caveat of giving up on editing precision or accuracy, especially if attempting to mask or remove harmful information and when not retraining from scratch. It is true that feedback from H will affect future releases f', but 1) in each release, f remains immutable just like W throughout the interactions, 2) feedback is integrated with delays and slowly, 3) editing is not generally available to users directly except for more superficial fine-tuning, 4) feedback is orders of magnitude less impactful compared to training corpora and design decisions in foundation models, and 5) unlike with designing actual hardware W, even by calling f ChatGPT as a whole, including its many releases, changing f is neither precise nor accurate, as one would imagine an editing process to be and as modifying software S directly is.
    Returning to the jailbreak article and using the parallel with the hardware analogy, I assume the intention of the legislators is ultimately to change f, the source of the causation that entails all answers A in chat history H, both by further fine-tuning the models and by adding a list of chat guidelines that are fed as part of the prompt. However, due to the entailment hierarchy of the general architecture, the latter attempt will have zero impact on f itself, thereby not addressing the issue at a fundamental level, while the former strategy will only have a limited effect, due to a fundamental lack of direct control on f: researchers do not have a way to precisely and accurately steer f, and users are even further separated from affecting it.
    This is not meant to be dismissive of the current achievements, the results are impressive and the techniques are steadily improving, but rather a critical look at the entailment structures at hand, both perceived and observed, and at the available strategies regarding safeguards. I find it interesting to ask: how can f be made functionally closer to programming S than to W?

Settings

White House is working with hackers to ‘jailbreak’ ChatGPT’s safeguards

Keyboard Shortcuts