What can we expect of LLMs as Software Engineers?

I got home about a month ago from a conference tour through Romania, Belgium, and the Netherlands. I gave a talk about LLMs in all three places, and then one more time once I got home to Chicago.

My plan now is to retire the talk, so I’m publishing the slides and rough transcript. One of the conferences recorded a video of this talk, and when that becomes available, I’ll share it. In the meantime, here are the slides and transcript.

Hi! My name is Chelsea Troy. I’m on the machine learning operations team at Mozilla, and I teach computer science master’s students at the University of Chicago.

Today I’ll talk to you about how large language models function.

Then, I’ll show you how the models themselves fit into the larger product that you probably know as ChatGPT, or Gemini, or Claude, or Devin, et cetera.

Next, I want to spell out some conclusions at the intersection between large language model use and professional productivity. 

Finally, we’ll summarize takeaways.

I’d like to start here because I think this discussion will make more sense with a concrete example. 

This is a bug I’ve come to refer to as “Strawberrygate,” that first made its rounds on social media in the back half of 2024. If you asked “How many r’s are there in the word strawberry,” many generative language tools would respond with the answer ‘two.’ There’s a short exercise in my O’Reilly workshop about LLMs that asks participants to try this themselves, and I’ve had folks run this on Claude, Gemini, and various iterations of ChatGPT, and report every integer between zero and three inclusive as the answer. 

It seems like a simple question though, no? Count the ‘r’s in the string ‘strawberry,’ and you’ll find three. Surely the tool to which large swathes of the population have begun to outsource all of their prose tasks could handle such a trivial counting operation, no?

So why do we see this happen?

The answer, in a single word, is “context,” but I’m going to explain what I mean by that because I think popular discourse has a tendency to use this word a few different ways and I want to make sure we’re clear. 

Large language models are trained on prose that their build teams were able to scrape from the internet, and it is from this body of knowledge that they draw their response capabilities. On the internet, people don’t ask about the spelling of ‘strawberry’ in a vacuum. Usually when a spelling query like this comes up, people aren’t referring at all to the first ‘r’ in strawberry. When they ask “how many r’s are there in the word strawberry,” the asker is usually a person learning to write and spell in English, which by the way is not even close to a phonetic language.

Unlike Arabic, Serbian, or to some degree Spanish, in which you can more or less sound out words based on their spelling, English spelling is all over the place, and a word’s pronunciation does not necessarily tell you whether it contains a doubled consonant such as the two r’s that appear next to each other in “berry.” This is partially due to the fact that English is the shower drain of languages—a disorganized melange of loanwords, calques, and vocabulary of diverse origin. This map shows the words for “berry” throughout Europe alone, and even among those places that do pronounce it “berry,” the number of r’s is not consistent.

When someone asks about the r’s in strawberry, they generally want to confirm the presence or absence of the doubled consonant. The technical answer to their question is “three”, but in the context of their confusion, scoped down to specifically the number of r’s that appear in the “berry” part, we’re looking at a doubled consonant—two.

Often the answers that these questions receive recognize, and account for, that context. 

To understand the role of context in the training of a large language model, I want to introduce you to a much simpler model first: the Markov Model. You see one represented here onscreen. The model has states: in this case, the weather conditions sunny, rainy, and cloudy. It then has transition probabilities between those states: for example, from cloudy, there’s a probability of 0.4 (a 40% chance) that it remains cloudy, a probability of 0.3 (a 30% chance) that it becomes sunny, and a probability of 0.3 (a 30% chance) that it becomes rainy. Together, those probabilities account for 100% of the options. Now if you were to predict what the next weather condition might be based only on the information that it’s currently cloudy, your best bet—though certainly not a sure bet—would be to predict that it will next be, again, cloudy. Your odds of predicting correctly improve if it’s currently sunny, since the leading transition probability—that it will remain sunny—is 0.6 for that state as opposed to just 0.4.

The transition probabilities in a Markov model make it a quick calculation for you to determine the probability of any given chain of events happening in a certain order—or even, assuming you have a complete set of transition probabilities, in any order. 
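
To make this concrete, here is a minimal sketch of that weather model in Python. The cloudy row and the 0.6 probability of sunny staying sunny come from the example above; the remaining numbers are placeholders I have filled in so that each row sums to one.

```python
# Toy Markov model of the weather example: three states, each with transition
# probabilities to every state (including itself).
TRANSITIONS = {
    "cloudy": {"cloudy": 0.4, "sunny": 0.3, "rainy": 0.3},   # from the talk
    "sunny":  {"sunny": 0.6, "cloudy": 0.25, "rainy": 0.15}, # 0.6 from the talk; the split is assumed
    "rainy":  {"rainy": 0.5, "cloudy": 0.3, "sunny": 0.2},   # entirely assumed
}

def most_likely_next(state: str) -> str:
    """Best single guess for the next condition, given only the current one."""
    return max(TRANSITIONS[state], key=TRANSITIONS[state].get)

def chain_probability(states: list[str]) -> float:
    """Probability of observing this exact sequence of conditions, in order."""
    probability = 1.0
    for today, tomorrow in zip(states, states[1:]):
        probability *= TRANSITIONS[today][tomorrow]
    return probability

print(most_likely_next("cloudy"))                      # "cloudy", at 0.4
print(chain_probability(["sunny", "sunny", "rainy"]))  # 0.6 * 0.15 = 0.09
```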

Now, imagine you had one of these network diagrams, but the network diagram contained every single word and piece of punctuation that had ever appeared in the Common Crawl, which collects an estimated 250 terabytes of data from billions of webpages in a given month, and which began all the way back in 2008. Each transition probability points to a word that has at some point appeared after the word in the node you’re looking at, and it represents the relative frequency with which the next word is that word versus any of the others.

And imagine that these transition probability arrows—we call them edges—don’t just connect each token to the very next token that appeared in written text: additional edges influence the probability of which word will appear next based on which word appeared one word ago, two words ago, three words ago, et cetera. This is what we get from multi-headed attention mechanisms, a component of deep learning architectures that underpins the function of the transformer and the design of modern large language models.

Here I’ve listed out several n-grams in the left-hand column to help you visualize this. We’re trying to predict the token at position n, and at the n minus 5 position, we’ve got a token indicating the start of the phrase. The first word in the phrase, at the n minus 4 position, is “I”. The next one, at the n minus 3 position, is “went.” Then at the n minus 2 position we have “to” and at the n minus 1 position we have “the.”

On the pink row, you see some options for what the next word might be: store, movies, car, apple, parrot, et cetera. In the network I’ve described you’d have a lot of options, but this small set will help get the idea across. Below each word, you see a prediction weight: larger weights for words that appear more often in this position relative to the given inputs, and smaller ones for words that appear less often in this position relative to the given inputs.

Let’s say for now we’re generating text by randomly selecting from among the tokens with the highest prediction weight, and in this case that’s ‘store’ and ‘movies.’ So we select ‘store.’

Here we are, with ‘store’ selected. Now let’s move on to predicting the next word in the sequence. 

Next to the first set of inputs here, I’ve added a second set of inputs. This one has the start token at the n minus 6 position instead of n minus 5, and the word “I” at the n minus 5 position instead of n minus 4. All the words have been bumped up one position so that we can add the word “store” that we selected in the last round to our sequence in the n minus 1 position. We now have new options to select from. 

I went to the store to….

I went to the store and

I went to the store apple… (note the very small prediction weight on this option relative to those of tokens that would make a lot more sense in this phrase). 
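
If it helps to see that sliding window as code, here is a toy sketch. The candidate words and weights are invented for illustration, and I am using greedy selection (always take the top-weighted token) rather than sampling among the top candidates; a real model scores its entire vocabulary using learned attention weights rather than looking anything up in a table.

```python
# A toy sketch of the sliding-window prediction described above.
CANDIDATES = {
    ("<start>", "I", "went", "to", "the"): {
        "store": 0.32, "movies": 0.31, "car": 0.08, "apple": 0.02, "parrot": 0.01,
    },
    ("I", "went", "to", "the", "store"): {
        "to": 0.28, "and": 0.27, ".": 0.20, "apple": 0.005,
    },
}

def predict_next(context: tuple[str, ...]) -> str:
    """Greedy decoding: pick the highest-weight candidate for this context."""
    weights = CANDIDATES[context]
    return max(weights, key=weights.get)

context = ("<start>", "I", "went", "to", "the")
for _ in range(2):
    chosen = predict_next(context)
    print(chosen)
    # Slide the window: drop the oldest token, append the one we just chose.
    context = context[1:] + (chosen,)
```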

Now, what sorts of tokens might we expect a pattern like this to do especially well at predicting? Ones that exhibit a very strong pattern of where they appear. Things like prepositions such as “at,” “by,” “in,” or “of.” Or punctuation, like commas or periods or colons or semicolons. These sorts of words are highly contextually linked: their appearances follow a strong pattern.

Most tokens, though, don’t get their position in a phrase mostly or wholly from the pattern of the phrase, but rather from the meaning of the token itself. The token “store,” a place where we go to select items off of aisles and buy them from a cashier, means a different thing than the token “movies,” a place where we go to watch feature films on the silver screen. The massive network derived from text, as I’ve described, picks up on these patterns, too: when words have different meanings, they are ‘denotationally distinct,’ and they usually appear in different contexts. If we say “I needed to buy some apples, so I visited the grocery [BLANK],” it’s clear to you, and to most of the people typing text on which large language models are trained, that the next token has to be either “store” or something very denotationally similar.

Denotationally similar words can usually appear interchangeably in the same pattern without affecting the meaning of the sentence. For example, I could say “I watched the new film on Friday at the movies” or we could say “I watched the new film on Friday at the cinema.” Whether a generated sentence used the token ‘movies’ or ‘cinema’ in this case would not change the meaning of the phrase. 

Where generative language models run into trouble is in situations where choosing a word based on the context isn’t enough to get it right: that is, in situations where tokens are denotationally distinct but contextually identical. These words appear in the exact same situations, but mean different things. Which words do that?

Often, numbers. I have three apples and I have five apples and I have four thousand apples all mean very different things, but the interchangeable tokens all appear in identical contexts. 

You know some numbers that mean even differenter things? The section numbers used to partition legal text. Section 108 (a) of a legal code says a completely different thing than section 204 (b), but they’re both referenced the exact same way.

Proper nouns are another one. It matters whether Miss Peach, Sergeant Gray, or Mr. Green did the crime with the candlestick in the kitchen. Accusing any one of them makes grammatical sense, but the phrase does not mean the same thing.

Now, a lot of times when it comes to proper nouns, numbers, or dates, generative models can get it right, and the reason for that is that context often fills in to suggest the right thing. Here we have an example phrase “Justin Bieber was born in,” with several options. Now, he certainly wasn’t born in apple or parrot, but the years we see as options here have higher prediction weights. The correct year here, 1994, has by far the highest prediction weight, and that happens because the inclusion of “Justin” as an n minus X token, followed immediately by “Bieber,” and then followed somewhere after that by “born”, overwhelmingly results in the number “1994” appearing in online discourse. 

But suppose we had, instead…

…this situation, where we are trying to complete the phrase “The law firm Johnson and [BLANK].” A quick Google search immediately dredges up half a dozen law firms whose names start “Johnson and…”, but end with a different surname. Which one should the generative model reference in its generated text? Context alone may not be able to provide an answer, so the selected token here could be wrong.

This is why you have to verify facts that you get from a large language model. A phrase that sounds plausible is not the same thing as a phrase that is objectively true or correct. Numerous lawyers, doctors, and software engineers have gotten tripped up in courts of law at this point because they placed the weight of their professional credentials behind a claim or action that a chat UI suggested to them, that sounded believable enough to them, and turned out to be a fabrication.

How do you fix this? The field has begun to answer this in various ways from retrieval augmented generation to supervised fine tuning, but often, the quickest and most pragmatic solutions happen outside the generative model itself.

That brings me to the difference between a large language model and a generative AI product, two concepts that I think the discourse tends to conflate, but they’re actually different things.

This orange ellipse represents a base training model: for our purposes, the sort of language-generating transformer model I’ve talked to you about. A base model such as this requires massive amounts of data and computing resources to train, and this is the reason that you see no more than, generously, a dozen organizations training these things completely from scratch, and even those organizations are doing this step no more often than once to a couple of times a year.

At this point, what you have is a program that can take in a grammatically intelligible sequence of input tokens and output a grammatically intelligible sequence of output tokens, but it’s not the polished question-and-answer-bot you’ve come to know from typing things into ChatGPT et al. That part comes from the next step…

…the fine tuning.

I want to draw a distinction here, because colloquially when blog posts and programmers at conferences talk about the fine-tuning they’re doing, they’re effectively taking a finished model—usually Llama from Meta—and passing it a paltry tome of subject-specific research, or all the code off their private github org, or the entirety of their confluence documentation. The fine-tuning that happens during model production is more substantial than this, and moves the model from the sequence-to-sequence phase to a question-and-answer phase. Often a big part of this is having human evaluators select the best option from three answers to a given question and provide this as a feedback loop to the model. This requires orders of magnitude less compute power and data than the initial training phase, and the same companies that execute the base training a time or two per year might do this step on a weekly basis. 

At this point the model can accept a sequence of inputs and give a sequence of outputs that reasonably resembles an answer, and the early days of language generation products could ship more or less this. 

The problem was that we, the consumers, had a disconcerting habit of confusing the output sequence with an actual, reasonable answer. Consumers with a software engineering background, for example, could not collectively agree that when they asked an LLM to find the bug in their code and the output said “I ran the code for you and I think your problem is X,” that had not happened because the LLM had spontaneously generated a compiler and a runtime and actually run the input code; it said that because it had been trained on internet text, where questions like the one this programmer had asked often received responses that started with the sequence “I ran your code for you and…”

If professional software engineers confuse language generation with language understanding, and become upset at the suggestion that a sequence of output tokens is a lot less likely to factually describe the spontaneous invention of automated compiler generation than it is to just be false, how can we expect better intuition of the general populace?

Conveniently, tooling rapidly improved. Now, generally, when you use a product like ChatGPT, your input doesn’t get fed straight into a raw LLM whose output comes straight back to you. Additional models surround and support the language model to provide you with answers that are more likely to be what you want. 

Consumers use these tools to accomplish all sorts of tasks, and supporting models help adapt to those circumstances. 

For example, large language models, like most machine learning models, accept a number of hyperparameters. In the case of LLMs, some of those hyperparameters guide how the model generates sequences of output tokens. Two in particular that make easy examples are called temperature and top_p. I’ll elide the specifics of how these parameters work for right now—if you’d like me to explain that, I’m happy to chat in the hallway—but long story short, higher values for these hyperparameters make sense for working with a model on creative brainstorming, and lower values for these hyperparameters make more sense when you’re looking for more rote, platitudinous, or logical output. The UI of these tools doesn’t let you set these yourself: instead, supporting models take in your query and set these hyperparameters for you. 
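
For those who want the gist without the hallway conversation, here is a rough sketch of what these two parameters do to token selection. The weights are invented, and real products apply this to raw model scores rather than a little dictionary, but the mechanics are the same in spirit.

```python
import math
import random

# Made-up next-token weights, standing in for the scores a model would produce.
weights = {"store": 5.0, "movies": 4.8, "car": 2.0, "apple": 0.5, "parrot": 0.1}

def sample(weights, temperature=1.0, top_p=1.0):
    # Temperature rescales the scores: low values sharpen the distribution
    # toward the top-weighted token, high values flatten it toward randomness.
    scaled = {tok: math.exp(w / temperature) for tok, w in weights.items()}
    total = sum(scaled.values())
    probs = sorted(((tok, s / total) for tok, s in scaled.items()),
                   key=lambda kv: kv[1], reverse=True)

    # top_p (nucleus sampling) keeps only the smallest set of top tokens whose
    # cumulative probability reaches top_p, then samples from that set.
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, token_probs = zip(*nucleus)
    return random.choices(tokens, weights=token_probs, k=1)[0]

print(sample(weights, temperature=0.2, top_p=0.5))   # effectively always "store"
print(sample(weights, temperature=1.5, top_p=0.95))  # much more varied output
```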

Even more recently, these products have attempted to resolve some of the hallucination problems of large language models by adding supporting models that identify the tools your query would require, then including those tools in the product. If you ask a math problem, instead of outputting a sequence of grammatically permissible tokens, a supporting model might identify that this problem merits a calculator, and attempt to plug figures into a calculator API. If you ask a coding question these days, particularly for a common and well-supported language like Python, a supporting model might identify the merit of a compiler and a runtime, and plug in the code you provided. 
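
As a schematic illustration of that routing idea, and nothing like how any particular vendor actually implements it, the shape is something like this:

```python
import re

# A deliberately simplified sketch of the routing step: a supporting check
# decides whether a query needs a tool (here, a calculator) before anything
# reaches the language model. Real products use trained classifiers and
# structured tool-call protocols rather than a regular expression.
def call_language_model(query: str) -> str:
    return f"[LLM response to: {query}]"  # stub standing in for the model

def route(query: str) -> str:
    arithmetic = re.fullmatch(r"\s*what is ([\d\.\s\+\-\*/\(\)]+)\?*\s*", query.lower())
    if arithmetic:
        # Tool path: hand the figures to a calculator instead of generating
        # a merely plausible-sounding sequence of digit tokens.
        return str(eval(arithmetic.group(1)))  # stand-in for a calculator API
    # Default path: send the query through to the language model.
    return call_language_model(query)

print(route("What is 12.5 * 8?"))        # 100.0, computed rather than predicted
print(route("Who wrote The Republic?"))  # falls through to the model
```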

Retrieval-augmented generation, or RAG, also works this way. We pass a model a bunch of documents from which we’d like an answer. The product runs a relevancy search to associate the input query with specific passages in the RAG documents. It then appends the passages that receive the highest relevancy scores to the query that gets passed to the language model itself. If you’re more curious to hear about RAG, you can ask me about this in the hallway later.
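
Here is a bare-bones sketch of that retrieval step, using word overlap as a stand-in relevancy score. Production RAG pipelines typically use embedding vectors and a vector store, and the example documents below are invented, but the shape of the pipeline is the same: score the chunks, keep the top scorers, and attach them to the query.

```python
# Minimal retrieval-augmented prompt construction with a toy relevancy score.
def words(text: str) -> set[str]:
    return {w.strip(".,?!").lower() for w in text.split()}

def relevancy(query: str, passage: str) -> int:
    # Count shared words between the query and the passage.
    return len(words(query) & words(passage))

def build_prompt(query: str, documents: list[str], top_k: int = 2) -> str:
    ranked = sorted(documents, key=lambda d: relevancy(query, d), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are processed within 14 days of receiving the returned item.",
    "Our office is closed on public holidays.",
    "Returned items must include the original packaging.",
]
print(build_prompt("How long do refunds take after I return an item?", docs))
```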

So now you have some better context as to what, precisely, is going on when you type a query into the user interface of a generative large language model product. You have a sense of what happens inside of there, and that understanding informs your intuition about what these models are likely to get right and wrong. 

Though you should always verify the output of a large language model, and you are always responsible for the impact of the work that you turn in with your professional credentials attached regardless of how much of your query response is yours versus a large language model’s, this intuition can help you use the tool wisely, at the appropriate times. 

But supposing we’re doing that—how useful are these tools for the work we do? 

I’ll be candid with you, I was hoping to find some kind of survey study comparing the utility of these tools based on what kind of engineering work you were doing with it, but I could find nothing systematic on this, so I was forced to do some surveying of my own, which we’ll get to later. 

The comparative studies I could find did evidence some patterns that I want you to recognize for the conclusions we draw later in the talk. 

Let’s start here, with the ‘Lost in the Middle’ paper. A lot of workshop attendees and talk audiences see the “2023” date on this paper and try to tell me that it’s too old to be relevant in the fast-changing AI space. Candidly, I couldn’t possibly disagree more, because this paper was among the first resources to discuss a phenomenon that remains, to this very second, relevant to understanding why large language models get things wrong.

The paper addresses the likelihood that a model can successfully retrieve information from a bolus of context included with a query, based on how large that bolus of context is, and also where in that bolus of context the relevant information appears. This is the paper that established in academic literature that big contexts—more information—make it less likely that a model will successfully retrieve an individual fact. It also established that models are more likely to retrieve the information if it appears at the beginning or end of the context, as opposed to in the middle.

What does this mean for us? It means that, even when we’re including a context document in our input query, it’s in our best interest to narrow down that document to the relevant section as much as possible, not just ram in a manual the size of War and Peace and hope for the best.

About nine months ago, meanwhile, we got hold of this paper. It’s the one I’d like to spend the most time talking to you about. It gained some notoriety in the programming zeitgeist as “the 26% paper,” with the idea being that it says programmers are 26% more productive with LLMs than without.

So, I want to dig into this claim a little.

First of all, the proxy metric here is “number of tasks completed,” and the 26% statistic comes from the fact that the conclusions claim to have seen a 26% increase in this number.

Of substantive import, in my opinion, to the context from which this statistic derives, is that the conclusions outright say that less experienced developers showed higher adoption rates and greater productivity gains. I think this introduces a number of confounds. First of all, less experienced developers tend to receive smaller work tasks, so increasing the number of work tasks they do by some large percentage would be a lot easier than if all the tasks were, say, “spearhead this giant legacy migration.” It’s also the case that more junior engineers tend to receive the tasks that require the least context—the exact thing that we’ve already identified as the Achilles heel of many of these systems. A large language model can be a great help in writing a function from scratch, given a series of tests it must pass. At present, these models often cannot, say, identify the insidious scope issue causing data corruption in a one hundred sixty thousand line code base, and that’s the sort of task more likely to be assigned to someone with more experience.

To me this pattern demonstrates the deepening of an existing cognitive dissonance in programming education: we teach students, train new entrants, and interview candidates chiefly on their ability to write code from scratch, but as they advance in the workforce their actual main responsibility is often reading, analyzing, debugging, or reorganizing code that they themselves did not write. Once upon a time, writing code from scratch comprised 90% of computer science educational assignments and, generously, 20% of the programmer’s professional responsibilities in the editor. Now, that mismatch has become even more stark, because those tasks that once required writing from scratch manually now often include asking a large language model to write it instead, at which point what the programmer is doing is, once again, reading, analyzing, debugging, or reorganizing code that they themselves did not write. Large language models have not wholesale wiped out programming jobs, so much as they have called us to a more advanced, more contextually aware, and more communally oriented skill set that we frankly were already being called to anyway, just 80% of the time instead of 96% of the time. On relatively simple problems, we can get away with outsourcing some of that judgment. As the problems become more complicated, we can’t.

I’ll tell you something that might surprise you. It wasn’t my work on machine learning operations at Mozilla that drew me to look at how developers might work with large language models. 

It was my students.

I teach in the Master’s Program in Computer Science at the University of Chicago. With the world as it is today, and the industry as it is today, it’s a terrifying time to be a CS master’s student, on the precipice of attempting to enter an industry that, the tabloids insist, does not want or need you, because it’s busy burning the planet alive and making quick work of data privacy all over the world, all in desperate pursuit of an automated version of you.

I owe these students no less than my very best in my attempt to prepare them for this industry. I teach them technical skills to the best of my ability; we talk about ethics; we talk about the impact of tech on the broader world. I ask students to find a totem to remind them of the gravity of their roles, and keep it on their desks. Mine is a champagne bottle, in honor of the crews of the Challenger and Columbia missions, both ultimately killed due to foreseeable and foreseen engineering failures. 

And on most assignments these days, I permit my students the use of LLMs, and I ask them at the end of the assignments how they used the tools. 

Interestingly, no student who has chiefly used LLMs to complete all their programming exercises in my Python class has managed to exceed a flat B as their final grade. 

But more importantly, by cross-referencing my students’ reports of their LLM use and their course feedback, I can develop a picture of where the tools aid and fail my students. 

Given what we’ve discussed so far, it will not surprise you to learn that LLMs struggle especially on programming problems that address:

  • variable scope
  • high cyclomatic complexity
  • persistent, layered data models

It’s also unlikely to surprise you that they fail students when the code base the student is working on exceeds about 300 lines. To use them most effectively, students must first write COHESIVE code—code in which the concepts that must be understood together live together—and recognizing patterns from github does not prepare a tool to generate this type of code. The specific cohesive part on which a student needs help then needs to be entered into the tool in isolation. 

These things only become more true as the code base increases in complexity. I wrote a compiler in Rust earlier this year—a language I don’t know especially well—to better understand the experiences of my students trying to use LLMs to build projects in Python. I’d say the LLM reduced the amount of physical typing I needed to do by 95%. I’d say it reduced the amount I needed to know, about the implementation details of compilers as well as the characteristics of Rust, in order to get my compiler working correctly, by about 5%.

If you’re interested, this blog post discusses more details of how I approach the use of Generative AI products in my classrooms.

So, what might we take away from all this? I think there are a few lessons worth writing down. 

The first is that LLMs are most helpful on programming problems that are:

  • Specific. To the extent that we can provide a complete and unambiguous set of success criteria, we can improve our chances of an accurate answer.
  • Limited in scope. We have narrowed down the context as much as possible, to eliminate opportunities for misidentified relevant information.
  • Focused on a programming language with a robust sharing history. As I mentioned before, evidence on this, while largely anecdotal and not to my knowledge yet collected in any kind of survey study, is pretty strong.

Now, LLMs can help us identify implementation options. However, so far, they don’t provide an adequate substitute for judgment. The more of our job focuses on experience-based judgment, the less these tools have improved productivity so far. For this reason, subject matter expertise and contextual expertise still matter in decision-making. This includes expertise in when and whether to use or trust the output of a large language model, even as an initial draft. 

Finally, by dint of the patterns that train them, the nature of large language models supports reproducing existing logic. They repeat the patterns of language that they have seen: a sort of global average, if you will, of what the internet has to say about a topic. This can be useful, when questions have established answers, and when programming communities publicly adopt shared norms. But what they cannot do, in and of themselves, is innovate—is break from that pattern of what has been said and taught in the past. The more “newness” you’re looking for, the less you can rely on a large language model’s output in and of itself. 

Conveniently, of course, that describes few of the problems we face professionally—especially today, in a very saturated tech market, where a lot of what we do is migrate existing processes from tool to tool, and to the extent that we build new services, most of the actual work comprises integrating a set of preexisting dependencies and vendor APIs together. The question I’ll throw in here, which I think the advent of large language model products is still too recent to answer empirically, is how today’s junior programmers, aided as they are by large language models, will navigate the challenges that we currently use judgment and experience to manage at the senior level. In the meantime, I see three skill sets that deserve our deliberate attention.

The first skill set is investigative skills: we need to be able to scope down the area in which we are facing a problem, and learn to ask specific questions about how our assumptions differ from the ground truth. 

I find this to be a woefully undertaught and undervalued skill among engineers precisely because we tend to view debugging or familiarizing ourselves with the system we’re working on, not as part of our work, but as a blocking obstacle that distracts from our actual work of plowing through features and system changes. I think this is a deeply flawed way to view our profession, and one made even less accurate than it already was by the advent of tools that can plow through feature development for us, provided we possess the investigative skill set to identify and understand when and how those solutions make inaccurate assumptions or need to be fixed. 

The second skill set is evaluative skills: we need to be able to select from a range of implementation options based on how those options’ benefits and shortcomings fit the bill of our specific situation.

I think we often elide this skill set as engineers by deferring to what we’ve read or heard is a best practice. This works fine, much of the time, and up to a point. But many of your worst nightmares made manifest in code are the result of engineers applying a practice that, while “best” in many circumstances, did not suit their circumstances. I think engineers, particularly at more senior levels, need to take stock of why best practices are what they are, and determine whether those reasons make sense in each situation where they’re being used.

More granularly, I think it falls to engineers to learn to specify exactly what their decision criteria are, decide explicitly which of those are optimizing criteria and which are satisficing criteria, and document how each of their implementation options stack up against those criteria. An LLM can tell you what most people usually say to do. It can’t tell you what you should do, but the truth is, neither can most human engineers right now. This skill set addresses that.
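
To make that concrete with one possible shape (the options, criteria, and numbers here are entirely invented), the structure I am describing looks something like this: satisficing criteria act as pass-or-fail gates, and the optimizing criterion is what you rank the survivors by.

```python
# A sketch of making decision criteria explicit, in the spirit described above.
options = [
    {"name": "Vendor API",  "monthly_cost": 900, "meets_residency": True,  "p95_latency_ms": 120},
    {"name": "Self-hosted", "monthly_cost": 400, "meets_residency": True,  "p95_latency_ms": 310},
    {"name": "Spreadsheet", "monthly_cost": 0,   "meets_residency": False, "p95_latency_ms": 50},
]

# Satisficing criteria: an option that fails any of these is out, full stop.
viable = [o for o in options if o["meets_residency"] and o["p95_latency_ms"] <= 500]

# Optimizing criterion: among the viable options, prefer the cheapest.
decision = min(viable, key=lambda o: o["monthly_cost"])
print(decision["name"])  # "Self-hosted", under these invented numbers
```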

I’ve written about this on my blog too; you can see the title there. This one happens to be a mercifully short piece.

The final skill set is innovation skills: we need to learn to proactively search for the shortcomings of the available options, and consider solutions that haven’t been tried before.

These are precisely the solutions that a Generative AI product cannot produce by handing us a global average of what the internet has to say about a topic. The task falls to us, I am afraid, to understand how our status quo falls short, and figure out what we could change to improve those shortfalls. We need the ability to proactively identify who our systems do not serve, or serve poorly, and why, and how to fix it. 

I’ve written about this on the blog too, at the title you see there.

I am confident that these skill sets give us greater purchase on our roles in a Generative AI enabled world, because I see them give us greater purchase on our roles regardless of the availability of Generative AI, and I see the reasons for that exacerbated by the use of those tools in development contexts.

Should we manage to focus on these things, I think we can leverage Generative AI in its most consistently successful function as an aide to our process—to aid us in being productive, not in the sense of generating more slop for GitHub, but instead in pursuit of a more rigorous, compassionate, and thoughtful approach to software development.

Thanks for listening! You’re welcome to e-mail me anytime at the email address you see onscreen.

If you liked this piece, you might also like:

The Homework is the Cheat Code: GenAI Policy in my Computer Science Graduate Classroom


“Best Practice” is not a reason to do something


The Oxymoron of “Data-Driven Innovation”