The findings of our first generative AI experiment: GOV.UK Chat


https://insidegovuk.blog.gov.uk/2024/01/18/the-findings-of-our-first-generative-ai-experiment-gov-uk-chat/

A screenshot of the experimental GOV.UK Chat system. A user has typed in a question and the system is generating a response. The question reads “As a small business, do I need to charge VAT to people in Italy when I sell a service online?”

In our strategy for growth, launched in June last year, we said that we wanted to explore whether emerging technologies can help users when interacting with GOV.UK.

The Government Digital Service (GDS) is running a series of experiments to explore and test how generative Artificial Intelligence (AI) could improve the user experience of GOV.UK. Chris Bellamy, Director of GOV.UK, has shared more about the work of the AI team and how we’re approaching this technology — we suggest you read her blog post before this one for that context.

Our approach to generative AI

OpenAI’s ChatGPT is an emerging technology. It’s only just over a year old, but it has generated a huge amount of interest and discussion. As Chris Bellamy writes: “We believe that there is potential for this technology to have a major, and positive, impact on how people use GOV.UK... [and] that the government has a duty to make sure it's used responsibly, and this duty is one that we do not take lightly.” This echoes the government's recommended approach described in a policy white paper last year.

This meant we took the approach of running a series of phased experiments in a controlled way, so we could quickly gather data to analyse, evaluate and iterate on the system. Our first experiment used generative Large Language Models (the technology behind ChatGPT) to tackle a problem as old as GOV.UK itself — how we can save people time and make their interactions with government simpler, faster and easier.

GOV.UK Chat

Our first experiment was an OpenAI-powered chatbot, “grounded” in published information on the site (a method known as Retrieval Augmented Generation). We wanted to see if this approach could enable users to find the information they need by asking questions of GOV.UK content in natural language — the way they would write or speak in everyday life.
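The grounding approach can be sketched in a few lines. This is a deliberately minimal, hypothetical illustration of Retrieval Augmented Generation, not the actual GOV.UK Chat code: the function names are made up, and the word-overlap scoring stands in for the embedding or search-index retrieval a real system would use.

```python
# A minimal sketch of Retrieval Augmented Generation (RAG):
# retrieve the most relevant published pages for a question,
# then ask the model to answer using only that retrieved text.
# All names and the scoring method are illustrative.

def score(question: str, page: str) -> int:
    """Crude relevance score: count page words that also appear in the
    question (real systems would use embeddings or a search index)."""
    q_words = set(question.lower().split())
    return sum(1 for w in page.lower().split() if w in q_words)

def retrieve(question: str, pages: list[str], k: int = 2) -> list[str]:
    """Return the k pages most relevant to the question."""
    return sorted(pages, key=lambda p: score(question, p), reverse=True)[:k]

def build_prompt(question: str, pages: list[str]) -> str:
    """Ground the model: instruct it to answer ONLY from the retrieved
    content, reducing the risk of made-up answers."""
    context = "\n\n".join(retrieve(question, pages))
    return (
        "Answer the question using only the content below. "
        "If the answer is not in the content, say so.\n\n"
        f"Content:\n{context}\n\nQuestion: {question}"
    )
```

The key design point is the final instruction: rather than letting the model answer from whatever it learned in training, the prompt constrains it to the retrieved GOV.UK content, which is what “grounding” means in practice.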

We called the system we developed GOV.UK Chat. We took a phased approach to experimenting with GOV.UK Chat, where each phase would last a couple of weeks. At the end of each phase, we evaluated the data to determine our next step.

The phased approach to our generative experiment. There are 5 phases shown as dots along an arrowed line. Each phase gets progressively more complicated and closer to real users

The first 3 phases were internal. This allowed us to develop the system safely prior to showing it to real users. This included a “red teaming” exercise, where colleagues from across government tried their hardest to break the system or make it not behave as intended.

We also evaluated the performance of the system at each stage, based on expert human assessment of the responses provided and the retrieved GOV.UK sources. Early on, this allowed us to identify an initial set of components which, together, produced the best results.

Following positive results from Phase 4 testing with a dozen users, we scaled up testing by inviting 1,000 users to use GOV.UK Chat within a live private pilot — so we could continue to iterate and improve, and see how people would respond to the system. Collecting more data also gives us more confidence in our evaluation of the technology.

Prior to each phase, especially Phase 5, we carefully considered what outcome or learning we wanted to achieve and designed the experiment and analysis accordingly.

We look forward to blogging about some of these areas in greater detail at a later date.

Protecting users’ privacy and security

As with all GDS’s work, we are committed to protecting users’ privacy and security. For this experiment, we put in safeguards to prevent users from submitting personal data in their query.

We also respect the personal data that exists on GOV.UK pages and worked closely with our data protection colleagues throughout the experiment in conducting a data protection impact assessment and mitigating any identified risks. For example, we removed GOV.UK pages with personal data from the system so they could not be sent to the LLM.
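One of the safeguards described above — stopping personal data from being submitted in a query — could look something like the following. This is a hypothetical sketch, not the real check: the patterns are illustrative examples of common kinds of personal data, and a production safeguard would be far more thorough.

```python
import re

# Hypothetical sketch of a pre-submission safeguard: reject queries
# that appear to contain common kinds of personal data before they
# are sent anywhere. Patterns are illustrative, not exhaustive.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email addresses
    re.compile(r"\b(?:\d[ -]?){10,}\b"),      # long digit runs (phone numbers)
    re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b"),    # UK National Insurance-style numbers
]

def contains_personal_data(query: str) -> bool:
    """Return True if the query matches any personal-data pattern."""
    return any(p.search(query) for p in PII_PATTERNS)
```

A query that trips any pattern would be blocked with a message asking the user to rephrase, so personal data never reaches the LLM in the first place.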

Our early findings

We conducted follow-up surveys of users’ experience with GOV.UK Chat, as well as getting human experts to assess the accuracy and completeness of a sample of answers produced by the system.

Analysis of the survey data (n=157) revealed that most people liked using GOV.UK Chat to find the information they needed. Nearly 70% of users found the responses were useful and just under 65% of our users were satisfied with the experience.

From the testing we also gathered insights on how these results could be improved. For instance:

  • sometimes a query could not be answered because the GOV.UK page was too long; we are working on a better strategy to chunk up the content without losing important context before it is passed to the LLM
  • if questions could not be answered using content on GOV.UK, it’s clear we need to redirect people in different ways
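The chunking idea in the first bullet can be sketched simply: split a long page into word windows no bigger than the model can handle, with some overlap so a fact that falls at a boundary still appears whole in at least one chunk. This is an illustrative sketch under assumed parameters, not the strategy the team is actually developing.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 30) -> list[str]:
    """Split a long page into overlapping word-window chunks.
    The overlap carries context across chunk boundaries, so a
    sentence split at an edge still appears whole in one chunk.
    (Illustrative only — real chunking would respect headings,
    sentences and the model's token limit, not raw word counts.)"""
    words = text.split()
    if len(words) <= max_words:
        return [text]  # short pages pass through untouched
    chunks = []
    step = max_words - overlap  # how far each window advances
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # final window already covers the end of the page
    return chunks
```

Each chunk can then be scored for relevance and passed to the LLM individually, which avoids the “page too long” failure while keeping nearby context attached to every piece.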

Our results also highlighted known issues associated with the nascent nature of this technology. Overall, answers did not reach the highest level of accuracy demanded for a site like GOV.UK, where factual accuracy is crucial. We also observed a few cases of hallucination - where the system generated responses containing incorrect information presented as fact - mostly in response to users’ ambiguous or inappropriate queries.

Accuracy gains could be achieved by improving how we search for relevant GOV.UK information that we pass to the LLM, and by guiding users to phrase clear questions, as well as by exploring ways to generate answers that are better tailored to users’ circumstances.

We also found that some users underestimated or dismissed the inaccuracy risks with GOV.UK Chat, because of the credibility and duty of care associated with the GOV.UK brand. There were also some misunderstandings about how generative AI works. This could lead to users having misplaced confidence in a system that could be wrong some of the time. We’re working to make sure that users understand inaccuracy risks, and are able to access the reliable information from GOV.UK that they need.

What’s next

These findings validate why we’re taking a balanced, measured and data-driven approach to this technology — we’re not moving fast and breaking things. We’re innovating and learning while maintaining GOV.UK’s reputation as a highly trusted (according to internal polling) information source and a ubiquitously recognised symbol in the UK.

Based on the positive outcomes and insights from this work, we’re rapidly iterating this experiment to address the issues of accuracy and reliability. In parallel we’re exploring other ways in which AI can help the millions of people who use GOV.UK every day.

Working across government

Teams across government are investigating how best to take advantage of this new technology, and the AI Safety Institute has been established to focus on advanced AI safety for the public interest. GDS is working closely with the Central Digital and Data Office (CDDO) and No.10 on the experiments. CDDO has today published the Generative AI Framework, which provides guidance across government departments.

We believe our GOV.UK Chat experiment can act as a “pathfinder” for colleagues across government and the private sector, showing how to evaluate this technology safely.

If you’re interested in the work, please get in touch with the team via govuk-enquiries@digital.cabinet-office.gov.uk