Testing AI Software Isn’t Like Testing Plain Old Software

This blog post considers new challenges created by incorporating AI models into traditional software. We focus on how the practice of using AI models in traditional software challenges our existing approaches to testing. This blog post is not about using AI to test traditional software.

AI Software (AISW) isn’t Plain Old Software (POSW)

When we say “AI software” what do we mean? We define AI software as any software that utilizes an AI foundation model at runtime. We explain why we make this distinction in our previous blog post (AI Software Should be More Like Plain Old Software). The key point is that because AI software uses AI models at runtime, the properties we expect the software to exhibit, such as reliability, security, etc. may no longer hold. This blog post focuses on the new challenges testing AI software. In particular, because the way in which traditional code interacts with an AI model is via a prompt, we focus on understanding how the prompt acts as a program (Prompts are Programs) and requires testing.

The diagram above illustrates the interaction between POSW and AISW. Consider a user of an AI-enhanced word processor who wants to make text in their document bold. They type their prompt into an AI copilot (1), which internally reformats their request to explicitly ask for JSON output before sending it to the AI model (2). The AI model returns a JSON object defining the appropriate API call. The copilot code (which is POSW) calls that API (4) to modify the user’s document (5). In our discussion, we focus on the prompt sent to the AI model (2) and the response (3) and consider it as AISW. Whether the AISW returns a correct response determines the robustness of the application and depends on the text and other context in the prompt and the AI model interpreting it.

We Know How to Test Plain Old Software

Testing software has a long, rich, and successful history. Techniques to test complex software at scale have evolved over many decades to the point where large complex software systems are secure and highly reliable. Testing has co-evolved with the discipline of automated reasoning and formal methods, adopting tools and systematic approaches that enable formal verification of critical software (e.g., seL4 and CompCert). Tools like Pex, designed for the .NET ecosystem, use dynamic analysis and constraint solving to automatically generate high-coverage test cases.

As systems became more interconnected, cybersecurity risks increased and approaches to finding and fixing security vulnerabilities became one of the most important goals of software testing. Fuzzing techniques have gained popularity in this context – fuzzing involves generating inputs to uncover unexpected behaviors or security flaws and has proven highly effective in discovering memory safety issues and other vulnerabilities in large scale software. Symbolic execution-based tools like KLEE systematically explore execution paths to improve coverage, while fuzzers like SAGE use white-box techniques to expose bugs.

Many of the effective approaches to POSW testing rely on specification and symbolic reasoning that relates the intended behavior of the software with a mathematical understanding of how that software, when translated to machine code, will execute on specific hardware.

Testing AI Software is Different

Testing AI software introduces new challenges that differ from traditional software testing paradigms. While POSW can exhibit non-deterministic behavior due to external dependencies and concurrency issues, AISW’s non-determinism arises from stochastic processes within the model itself, making it more challenging to predict and reproduce outputs. Additionally, the number of available models – and continuous emergence of new ones – adds complexity. Dynamic model selection (AI routing), as enabled by platforms like openrouter.ai that chooses which is the best AI model for the task, introduces unpredictability. The testing strategy must be able to adapt to validate behavior across an evolving range of models, ensuring consistency and robustness even when the AI model changes. Additionally, since AI software often processes multi-modal inputs such as images, audio, and video, validation strategies are needed to maintain consistency and accuracy of outputs across diverse input types.

Correctness in POSW is often defined by precise specifications — given an input, the output must match an expected value. However, in AISW, correctness is far more complex and multifaceted. The same input might produce different outputs each time, and multiple outputs could be considered correct depending on the context and use case. For example, an AI model might generate different yet syntactically valid translations of a sentence, or a reasoning model generates sound reasoning steps but still misses the intended output. AISW testing should also involve human oversight, especially when Responsible AI (RAI) and ethical considerations come into play. AISW correctness must assess both quantitative metrics like accuracy and qualitative assessment (human evaluation of the response with respect to fairness and ethical standards) — ensuring the model’s outputs are both correct and responsible.

Testing AISW for security challenges involves addressing model-specific exploits, including prompt injection attacks, jailbreaks, and model inversion techniques, where attackers reconstruct sensitive training data by analyzing the model’s responses. The unpredictable nature of AISW outputs complicates security assessments, as the same input may not consistently trigger vulnerabilities. This variability is closely linked to AISW’s flexible correctness paradigm, where multiple outputs may be considered correct. An AI model that generates diverse, yet valid responses might also be more susceptible to security threats. Human-led red-teaming is effective for probing security risks but is labor-intensive and hard to scale as AISW adoption will grow. AISW demands automated security testing techniques that can handle its non-deterministic behavior.

Examples of testing AI Software

Effective testing of AISW requires new tools and frameworks to address the unique challenges outlined above. For testing correctness, Amazon’s Bedrock Guardrails provides a policy-driven approach to ensure that AI software adheres to predefined guidelines, focusing on output validation and compliance with RAI principles. Microsoft’s PromptPex introduces automated test generation for prompts extracting input and output specifications to create targeted tests for the prompt. The DSPy prompt optimizer complements these approaches by using Bayesian optimization to generate and test new prompt variants, jointly optimizing for prompts and few-shot examples to improve accuracy.

For security testing of AISW, Salesforce’s internal library, fuzzai, automates red teaming by generating adversarial prompts and validating responses to enhance model robustness against diverse attack scenarios. Similarly, Promptfoo and Microsoft’s PyRIT are designed for assessing AI security, capable of stress-testing models and managing adversarial inputs. Promptfoo also integrates with CI/CD pipelines for continuous monitoring and helps assess risks like prompt injections, information leakage, and tool-based vulnerabilities. NVIDIA’s Garak serves as an LLM vulnerability scanner, identifying weaknesses such as data leakage and misinformation.

What AI tech trends will influence this?

Agentic systems are the future of AI software capable of autonomous decision-making and performing sequences of actions within dynamic environments. For example, a web UI agent could navigate an e-commerce site, add items to the cart, and complete the checkout process—all based on high-level user instructions. These agents operate in complex real-world settings, where the volume of possible input states and transitions can lead to state space explosion. To handle this, techniques like model checking and symbolic execution are needed to efficiently explore and validate system behavior across a large spectrum of inputs.

In multi-agent systems, interactions between agents might introduce complexity, as emergent behaviors may arise that violate global properties despite individual agents functioning correctly. Compositional verification techniques are essential to connect local agent behavior with global system requirements, ensuring stability and predictability. To test whether the AI agent’s behavior is aligned with the user’s prompt – we need to be able to define and check invariants that act as guardrails, constraining the agent within predefined safety bounds and mitigating the risk of unexpected actions.

Chain of Thought (CoT) reasoning, where AI models break down tasks into intermediate steps, plays a critical role in testing AISW. Unlike traditional testing that focuses primarily on final outputs, CoT reasoning demands validation of each reasoning step to ensure the model’s decisions are logical and aligned with intended objectives throughout the process. Techniques like invariant checking and step-by-step validation are essential, as they help verify both the correctness of intermediate actions and the consistency of the model’s thought process.

What should PL/SE researchers do?

PL and SE researchers need to reimagine formal methods and testing techniques for AISW. Ensuring AISW reliability requires adapting existing techniques and finding ways to achieve verification at scale. For example, adapt property-based testing to focus on defining high-level rules about what an AI model should and should not do. To achieve scalable verification, hybrid approaches combining model checking and empirical testing can simulate diverse scenarios and trigger tests dynamically. AI model capabilities are evolving rapidly and applications that leverage them are quickly being widely deployed. To support this rapidly changing software ecosystem, innovative and effective testing methodologies are needed to ensure that these AI software applications are safe and reliable.

Biographies:
Reshabh K Sharma is a PhD student at the University of Washington. His research lies at the intersection of PL/SE and LLMs, focusing on developing infrastructure and tools to create better LLM-based system that are easier to develop reliably and correctly.

Peli de Halleux is a Principal Research Software Developer Engineer in Redmond, Washington working in the Research in Software Engineering (RiSE) group. His research interests include empowering individuals to build LLM-powered applications more efficiently.

Shraddha Barke is a Senior Researcher at Microsoft Research in Redmond, Washington in the Research in Software Engineering (RiSE) group. Her research interests include the automatic generation of correct code, end-user programming, and integrating symbolic reasoning with neural models.

Ben Zorn is a Partner Researcher at Microsoft Research in Redmond, Washington working in (and previously having co-managed) the Research in Software Engineering (RiSE) group. His research interests include programming language design and implementation, end-user programing, and empowering individuals with responsible uses of artificial intelligence.

Note: The feature image was generated in Microsoft M365 Copilot Visual Creator using the prompt: “A funny cartoon with a caption of a line drawing image of a human trying to test out prompting a robot and figuring out what went wrong. In the style of the New Yorker cartoons.”

Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.