The Prompt Report: A Systematic Survey of Prompting Techniques
Sander Schulhoff∗1,
Michael Ilie∗1,
Nishant Balepur1,
Konstantine Kahadze1,
Amanda Liu1,
Chenglei Si3,
Yinheng Li4,
Aayush Gupta1,
HyoJung Han1,
Sevien Schulhoff1,
Pranav Sandeep Dulepet1,
Saurav Vidyadhara1,
Dayeon Ki1,
Sweta Agrawal11,
Chau Pham12,
Gerson Kroiz,
Feileen Li1,
Hudson Tao1,
Ashay Srivastava1,
Hevander Da Costa1,
Saloni Gupta1,
Megan L. Rogers7,
Inna Goncearenco8,
Giuseppe Sarli8,9,
Igor Galynker10,
Denis Peskoff6,
Marine Carpuat1,
Jules White5,
Shyamal Anadkat2,
Alexander Hoyle1,
Philip Resnik1
▶ 1 University of
Maryland
▶ 2 OpenAI
▶ 3 Stanford
▶ 4 Microsoft
▶ 5 Vanderbilt
▶ 6 Princeton
▶ 7 Texas State University
▶ 8 Icahn School of Medicine
▶ 9 ASST Brianza
▶ 10 Mount Sinai Beth Israel
▶ 11 Instituto de Telecomunicações
▶ 12 University of Massachusetts Amherst
*Equal Contribution
sschulho@umd.edu
milie@umd.edu
resnik@umd.edu
In this paper, we conduct a systematic literature review of Generative AI (GenAI) prompting and prompt engineering techniques,
limited to prefix prompts. We combine human and machine efforts to process 4,797 records from arXiv, Semantic Scholar, and
ACL, extracting 1,565 relevant papers through the PRISMA review process. From this dataset we
present 58 text-based techniques, complemented by an extensive collection of multimodal and multilingual techniques.
Our goal is to provide a robust directory of prompting techniques that can be easily understood and implemented.
We also review agents as an extension of prompting, including methods for evaluating output and
designing prompts that facilitate safety and security. Lastly, we apply prompting techniques in two case studies.
Abstract
Generative Artificial Intelligence (GenAI) systems are increasingly being deployed across diverse industries and research domains. Developers and end-users interact with these systems through the use of prompting and prompt engineering. Although prompt engineering is a widely adopted and extensively researched area, its relatively recent emergence has left it with conflicting terminology and a fragmented ontological understanding of what constitutes an effective prompt. We establish a structured understanding of prompt engineering by assembling a taxonomy of prompting techniques and analyzing their applications. We present a detailed vocabulary of 33 terms, a taxonomy of 58 LLM prompting techniques, and 40 techniques for other modalities. Additionally, we provide best practices and guidelines for prompt engineering, including advice for prompting state-of-the-art (SOTA) LLMs such as ChatGPT. We further present a meta-analysis of the entire literature on natural language prefix-prompting. As a culmination of these efforts, this paper presents the most comprehensive survey on prompt engineering to date.
The PRISMA Review Process
During paper collection, we followed a systematic review process grounded in the PRISMA method. We first scraped arXiv, Semantic Scholar, and ACL through a keyword search. Our keyword list comprised 44 terms, each closely related to prompting and prompt engineering. We then deduplicated our dataset based on paper titles, conducted extensive human and AI review for relevance, and automatically removed unrelated papers by checking paper bodies for the term "prompt".
The PRISMA review process. We accumulate 4,247 unique records from which we extract 1,565 relevant records.
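As a rough sketch of the deduplication and keyword-filtering step described above (written in Python with pandas; the column names and title normalization are illustrative assumptions, not our exact scripts):

import re
import pandas as pd

def normalize_title(title: str) -> str:
    # Lowercase and strip punctuation so near-identical titles collapse to one key.
    return re.sub(r"[^a-z0-9 ]+", "", title.lower()).strip()

def prisma_filter(records: pd.DataFrame) -> pd.DataFrame:
    # `records` is assumed to have 'title' and 'body' columns. The real pipeline
    # also included human and LLM-assisted relevance review, which is not shown.
    deduped = (
        records.assign(_key=records["title"].map(normalize_title))
        .drop_duplicates(subset="_key")
        .drop(columns="_key")
    )
    # Keep only papers whose body mentions the term "prompt".
    return deduped[deduped["body"].str.contains("prompt", case=False, na=False)]

# Toy usage (the real input was 4,797 scraped records).
toy = pd.DataFrame({
    "title": ["A Survey of Prompting", "A survey of prompting!", "Unrelated Work"],
    "body": ["... prompt engineering ...", "... prompt engineering ...", "... graph theory ..."],
})
print(prisma_filter(toy))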
A Taxonomy of Prompting Techniques
We present a comprehensive taxonomy of prompting techniques, methods for instructing Large Language Models (LLMs) to complete tasks. We divide prompting techniques into three categories: text-based, multilingual, and multimodal. Multilingual techniques are used to prompt LLMs in non-English settings. Multimodal techniques are used when working with non-textual modalities such as image and audio.

All text-based prompting techniques from our dataset.

All multilingual prompting techniques.

All multimodal prompting techniques.
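To make the taxonomy concrete, here is a minimal sketch of two of the text-based techniques it covers, Few-Shot prompting and Zero-Shot Chain-of-Thought; the template wording and arithmetic task are illustrative, not drawn from our dataset:

def few_shot_prompt(exemplars: list[tuple[str, str]], query: str) -> str:
    # Few-Shot prompting: prepend input/output exemplars so the model imitates the pattern.
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in exemplars]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

def zero_shot_cot_prompt(query: str) -> str:
    # Zero-Shot Chain-of-Thought: append a thought-inducing phrase to elicit step-by-step reasoning.
    return f"Q: {query}\nA: Let's think step by step."

print(few_shot_prompt([("2 + 2", "4"), ("3 + 5", "8")], "7 + 6"))
print(zero_shot_cot_prompt("If I have 3 apples and eat one, how many remain?"))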
Prompt Exploration and Advice
We discuss key prompting terms, including prompt engineering, answer engineering, and few-shot prompting.
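As a small illustration of answer engineering, i.e., deciding how to map a model's free-form output onto an allowed label, the sketch below extracts a multiple-choice letter with a regular expression; the pattern, and the assumption that the final answer appears last, are illustrative choices rather than the single extractor used in the paper:

import re

def extract_choice(llm_output: str) -> str | None:
    # Answer engineering step: take the last standalone A-D letter, on the assumption
    # that chain-of-thought style outputs state the final answer at the end.
    matches = re.findall(r"\b([A-D])\b", llm_output.upper())
    return matches[-1] if matches else None

print(extract_choice("The capital of France is Paris, so the answer is (C)."))  # -> C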
Case Study: MMLU Benchmarking
In our first case study, we benchmark six distinct prompting techniques on the MMLU benchmark. We also explore the impact of formatting on results, finding performance variations between two different formats of each prompting technique.
Accuracy values are shown for each prompting technique. Purple error bars illustrate the minimum and maximum for each technique, since each was run with different phrasings (except SC) and formats.
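The sketch below shows the general shape of one such benchmarking run, assuming an OpenAI-style chat client and MMLU items that are already loaded into memory; the model name, prompt wording, and answer extraction are illustrative, not our exact configuration:

import re
from openai import OpenAI  # assumes the OPENAI_API_KEY environment variable is set

client = OpenAI()

def ask(question: str, choices: list[str]) -> str:
    # Zero-shot multiple-choice prompt; other techniques would change this template.
    options = "\n".join(f"({letter}) {text}" for letter, text in zip("ABCD", choices))
    prompt = f"{question}\n{options}\nAnswer with a single letter."
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative; swap in the model under test
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def accuracy(items: list[dict]) -> float:
    # items: [{"question": str, "choices": [str, str, str, str], "answer": "A".."D"}, ...]
    correct = 0
    for item in items:
        raw = ask(item["question"], item["choices"])
        match = re.search(r"\b([A-D])\b", raw.upper())
        correct += bool(match and match.group(1) == item["answer"])
    return correct / len(items)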
Case Study: Labelling for Suicide Crisis Syndrome (SCS)
In the second case study, we apply prompting techniques to the task of labelling Reddit posts as indicative of Suicide Crisis Syndrome (SCS). Through this case study, we aim to provide an example of the prompt engineering process in the context of a real-world problem. We utilize the University of Maryland Reddit Suicidality Dataset and an expert prompt engineer, documenting the process by which they improve the F1 score from 0 to 0.53.
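The snippet below sketches the scoring loop an engineer iterates against during such a process: a candidate prompt template plus an F1 computation over held-out labels. The template wording and the classify_post stand-in are hypothetical; the actual prompts, data handling, and clinical criteria are described in the paper.

from sklearn.metrics import f1_score

PROMPT_TEMPLATE = (
    "Read the following Reddit post and decide whether it shows signs of "
    "Suicide Crisis Syndrome. Answer 'yes' or 'no'.\n\nPost: {post}\nAnswer:"
)

def classify_post(post: str) -> int:
    # Stand-in for the LLM call: in practice, fill PROMPT_TEMPLATE with the post,
    # query the model, and map a yes/no completion to 1/0. A trivial keyword check
    # is used here only so the example runs end to end.
    return int("hopeless" in post.lower())

def evaluate(posts: list[str], gold_labels: list[int]) -> float:
    # Score one prompt iteration; a higher F1 means the current template is better.
    predictions = [classify_post(p) for p in posts]
    return f1_score(gold_labels, predictions)

print(evaluate(["I feel completely hopeless", "Great hike today"], [1, 0]))  # -> 1.0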
The Prompt Report Dataset
Our systematic review of all prompting techniques is based on the dataset of 1,565 relevant papers we collected. Below is a preview of the dataset. Specific columns, such as 'abstract', have been excluded. The full dataset is available on Hugging Face, including the complete CSV file and all paper PDFs.
We conducted several analyses of the dataset which can be found within the paper, including an analysis of citation counts for different GenAI models, prompting techniques, and datasets.
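A typical way to start exploring the release with the Hugging Face datasets library (the repository id below is a placeholder; use the one linked from this page):

from datasets import load_dataset

# Placeholder repository id; substitute the actual id linked above.
REPO_ID = "your-org/the-prompt-report-dataset"

papers = load_dataset(REPO_ID, split="train")  # one row per relevant paper
df = papers.to_pandas()

print(len(df), "papers")
print(df.columns.tolist())  # inspect which metadata columns are included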
BibTeX
@misc{schulhoff2024prompt,
title={The Prompt Report: A Systematic Survey of Prompting Techniques},
author={Sander Schulhoff and Michael Ilie and Nishant Balepur and Konstantine Kahadze and Amanda Liu and Chenglei Si and Yinheng Li and Aayush Gupta and HyoJung Han and Sevien Schulhoff and Pranav Sandeep Dulepet and Saurav Vidyadhara and Dayeon Ki and Sweta Agrawal and Chau Pham and Gerson Kroiz and Feileen Li and Hudson Tao and Ashay Srivastava and Hevander Da Costa and Saloni Gupta and Megan L. Rogers and Inna Goncearenco and Giuseppe Sarli and Igor Galynker and Denis Peskoff and Marine Carpuat and Jules White and Shyamal Anadkat and Alexander Hoyle and Philip Resnik},
year={2024},
eprint={2406.06608},
archivePrefix={arXiv},
primaryClass={cs.CL}
}