Interview: Sanjin Bicanic on Why Building Agents is Hard


In this episode of Artificial Investment, I speak with fellow Bain partner Sanjin Bicanic to unpack the realities of building and scaling autonomous AI agents: systems capable of executing complex workflows without humans in the loop. Most companies finding success with AI today are either using third-party tools (e.g., Sierra or Decagon to automate customer service) or simple home-built tools that have a clearly defined input and output.

These early successes are great, and they have emboldened companies to think about the next leap forward in enterprise automation: autonomous agents. These are bots that can handle a full, multi-step workflow, potentially making several judgment calls along the way.

In this interview, we talk about autonomous agents that are already proving they can drive measurable business value, particularly in complex customer support and sales operations. Yet, the main takeaway should be that doing this today is very, very difficult compared to the simpler use cases above.

Sanjin talks about how success depends on advanced capabilities like multi-layered security defenses against prompt injection attacks and “shadow systems” that run the bot in parallel to humans as a way to check their work. I’d encourage you to listen or read the transcript to hear the details of this as Sanjin does a nice job of breaking down these scary-sounding concepts in very understandable terms.

Today, the payoff is largest for businesses with repeatable, high-volume processes. Sanjin says that the threshold for autonomous agents is when you have 500 people or more doing a process, but that number will come down rapidly.

For investors, the key is to make sure your companies are focusing on tractable problems today. If they say that they are building a full agent to solve a complex use case, make sure they’ve thought about issues like security and shadow systems before they start and make sure they’ve already gone after some of the easier opportunities.

Any stories - good or bad - about building full agents? Leave a comment or drop me a note.

Full transcript:

Richard Lichtenstein: Hi. I am Richard Lichtenstein, the host of Artificial Investment, a Substack and occasional podcast. I am here with Sanjin Bicanic, a fellow partner of mine at Bain. Sanjin, why don’t you introduce yourself?

Sanjin Bicanic: Hi. Really excited to be here, longtime listener/reader. First time caller. I am a partner at Bain in our AI solutions practice.

I started my career as a software engineer in the Valley. I've been at Bain for a while now. I jokingly say that I work on Bain engagements where the deliverables are both PowerPoint and Python, so I help clients actually deliver real software systems, and over the last decade, AI and machine learning enabled systems, to help them achieve their strategic goals.

And Bain can help them get there faster and usually with better results. As part of that, I've built a lot. I've used AI and Gen AI a lot over the last few years and have the scars on my back from wrong turns taken left and right, and I'm happy to share the lessons learned from all that build experience.

Richard Lichtenstein: Yeah, and that’s exactly why I wanted to do this conversation, because you and I have been talking a lot over the last few weeks about some of the agents you’ve actually been building at companies, deployed at scale. It’s been both very exciting to see all the incredible stuff that agents are doing - they’re actually making companies money - but also terrifying to see how difficult it has been to build those agents and how much harder it is, frankly, than I would’ve thought. So the goal is to educate the rest of the world about just how difficult it actually is to build agents, but that it’s possible.

So just let’s start with what kinds of agents have you actually built?

Sanjin Bicanic: Yeah, so I think the first major split I would make is the agents that are meant to augment humans. As a result, there’s some sort of human in the loop, either as a verification step or using the agent’s output as an input into the broader work the business is doing.

And then the second type is the fully autonomous agents, where there is literally no human in the loop; they handle the entirety of the work end to end. I’ve done a little bit of both. The second one is much more difficult than the first, right? The first one is a lot easier to get going.

So, an example of the second one is customer service and customer support, customer service being the slightly more difficult of the two, where it’s some sort of thing that a customer needs the company to handle on their behalf. And the agent can now drop in for, say, 10 or 20% of the tasks.

Typically, agents can handle that stuff end to end. That could be like, “Hey, if you’re a provider of some sort of business services for me, here’s how I need you to do this differently this week than you’ve done it for me last week.” Or, “Here are the tweaks I want you to make in the system, because I’m not really sure how to use your user interface, and I would much prefer you do it for me than I do it myself.”

The augment kind: a common example here might be in sales, where it’s doing the first layer of triaging potential prospects and seeing which ones are the most interesting. And then maybe even doing a little bit of research for you, or maybe even writing initial outreach, or maybe even a plan for how that outreach should happen.

Two different flavors of agents that one encounters

Richard Lichtenstein: Let’s dive into the second type, because I think the co-pilot, assistant, augment types are very cool and very powerful. There’s so much more they could be doing than they are today, but to some extent, I think they’re familiar territory for at least some of the people listening to this.

But the fully autonomous agents feels like we’re really breaking new ground where there really are not a lot of those in the wild. Just talk about those, what’s the hardest part of building one of those?

Sanjin Bicanic: Ultimately, the most difficult thing is to get reliability to be high enough.

So this is not a term I coined; it comes from Andrej Karpathy, one of the original godfathers of deep learning. He calls it the March of Nines. What he means is that to get the reliability to a certain level of quality, it takes a constant amount of time to get each nine. So let’s say it takes you six months to get to 90%; it’ll take you another six months to get to 99%, and another six months to get to 99.9%, and so forth and so on.
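To make the March of Nines concrete: if each nine of reliability costs roughly the same effort, then the time to reach k nines is roughly k times the cost of the first nine. Here is a toy Python sketch of that claim; the six-months figure is just the example from the conversation, not a measured number.

```python
# A toy illustration of the "constant time per nine" claim: if each additional nine of
# reliability takes roughly the same effort, time to a target reliability grows linearly
# with the number of nines. The six-months figure is illustrative only.
import math

MONTHS_PER_NINE = 6

def months_to_reach(target_reliability: float) -> float:
    """Months to reach, e.g., 0.999 (three nines), assuming a constant cost per nine."""
    nines = -math.log10(1.0 - target_reliability)
    return nines * MONTHS_PER_NINE

for r in (0.90, 0.99, 0.999, 0.9999):
    print(f"{r:.2%} reliability -> ~{months_to_reach(r):.0f} months")
```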

And the real challenge is: maybe even within that first 90%, or that first 99%, is there a segment of the overall shape of what you need the agent to do where you can actually get to 99.999% or even a hundred percent, so that you feel comfortable this agent can actually do something useful? So the hardest kind of macro problem is getting reliability, but what makes the reliability so hard are a few underlying problems underneath.

One is that the task you’re asking the agent to do is almost always underspecified. As a result, it requires the AI agent to have what I think we would call some common sense: how should I interpret this, what am I missing, what are the multiple ways to interpret what the person is asking me, and how should I do it?

The second challenge is that, because it needs to call a whole bunch of tools, inevitably you find out that these tools or the APIs that you call aren’t behaving the way you expect them to behave, and you only learn that as you make mistakes.

And the third challenge, and the biggest challenge, is how do you evaluate this at scale to convince yourself that this thing is reliable enough? So those would be the big three challenges.

Richard Lichtenstein: It’s interesting the way you put it, that you need the agent to have common sense, because they’re gonna get these ambiguous situations where a human would use common sense.

And so you need the bot to have common sense as well. How do you train a bot to do that? Is it giving it enough examples, annotated examples so that it will have seen everything? Or is it really good prompting with giving it tons of context of what people want? How do you teach a bot to have common sense?

That seems very hard.

Sanjin Bicanic: It’s really difficult. It’s a great question. So we tried the examples thing, and you should use examples. The problem you run into is that if you provide too many examples, many modern LLMs will tend to assume that if you give me a hundred examples, that is the full set of examples that I should go by.

And if it’s not in that example set, then I don’t know what to do. So there is this challenge where you can provide too many examples and get the agent to overfit. So there are three different things.

One is the examples, which we already talked about, right? The second thing that you want to do is this generator-evaluator concept: you have one agent that’s actually generating the answer, and then there’s a separate agent, with separate prompting, that has the exact same context as the first one and says, “Hey, how confident are you that we got this right? Is there any ambiguity in here about how this could be interpreted?” And if the ambiguity is above a certain threshold, maybe the best thing to do is not process it and return it back into the regular flow and let humans handle it as they usually would.

And then the third thing that you wanna do is have a way of actually seeing what these agents are doing in the real world and evaluating some share of them. And especially if you can separate the ones where they make an error, learn from that error. And then through prompting, see if there are some elements of common sense that the agent is missing that we need to build in through prompting or better examples.
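Here is a minimal sketch of the generator-evaluator pattern Sanjin describes. The `llm_call` helper and the confidence threshold are hypothetical placeholders standing in for whatever model API and tuning you actually use.

```python
import json

def llm_call(prompt: str, context: dict) -> str:
    """Hypothetical placeholder for your model client; wire this to whatever LLM API you use."""
    raise NotImplementedError

CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune against labeled cases

def handle_request(request: str, context: dict) -> dict:
    # Generator: drafts the answer.
    draft = llm_call(
        prompt="Complete this request and explain each field you fill in.",
        context={**context, "request": request},
    )
    # Evaluator: a separately prompted agent, given the same context, takes a fresh look.
    review = json.loads(llm_call(
        prompt=(
            "You did not write this draft. Given the same request and context, "
            "how confident are you that it is correct and unambiguous? "
            'Return JSON: {"confidence": <0-1>, "ambiguities": [<strings>]}'
        ),
        context={**context, "request": request, "draft": draft},
    ))
    # Low confidence or open ambiguities: return the item to the regular human flow.
    if review["confidence"] < CONFIDENCE_THRESHOLD or review["ambiguities"]:
        return {"action": "route_to_human", "draft": draft, "review": review}
    return {"action": "submit", "draft": draft}
```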

Richard Lichtenstein: Got it. That’s super interesting. What caught my ear in what you said, and that I thought was intriguing, was this whole idea of somehow understanding what the agent isn’t good at. Intuitively you can have the agent grade itself, right? I’m an agent, you’ve just given me this problem.

I have some way of rating the problem, from “I know exactly what to do,” to “I’m really using my common sense,” to a degree where “I’m taking more of a guess.” And based on that, assign some sort of score. And if it’s, as you say, below a threshold, you fail it over to a human who checks the bot’s work.

Your idea, though, is not to let the bot grade its own work, but to actually have someone else come and grade their work. Why is it better to have a second bot grading the first bot’s work? Because the first bot knows what it did.

So wouldn’t it be better to just have it know what it did and know whether it feels good about that?

Sanjin Bicanic: Yeah. If you’ll allow the technical jargon, the short answer is that large language models are autoregressive. What that means is that because they have memory of what they generated earlier in the process, they’re very likely to engage in a very human-like behavior called confirmation bias: rather than independently thinking “was this right or wrong?”, if you ask the same agent to grade itself as part of the process, it’s gonna engage in defending its answer.

And that’s where we have a separate prompt. They’re operating on the same information, and that tends to behave a lot better because it’s now not autoregressing; it’s actually taking essentially a fresh look at it.

And then the second reason why you wanna separate it is you wanna be able to actually have multiple grading agents so you can run some experiments and see which ones are actually better. Because often what you find out is that depending on how you prompt it or which model you use, you’ll end up with a different level of false negatives and false positives.

And generally speaking, more often than not, you would prefer many more false negatives, meaning it is overly cautious and thinks a mistake was made, than false positives, where it misses real mistakes.
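Here is a small sketch, again with hypothetical stand-ins, of what comparing candidate grading agents on labeled historical cases might look like, counting the two kinds of errors Sanjin distinguishes.

```python
# `evaluators` maps a name to a callable that returns True when that evaluator approves a
# draft; `labeled_cases` is a list of (draft, was_correct) pairs. Both are hypothetical
# stand-ins for your own prompts, models, and data.
def compare_evaluators(evaluators: dict, labeled_cases: list) -> dict:
    report = {}
    for name, approves in evaluators.items():
        # False negative: rejects a draft that was actually correct (over-cautious).
        false_negatives = sum(1 for draft, correct in labeled_cases
                              if correct and not approves(draft))
        # False positive: approves a draft that was actually wrong (a missed real mistake).
        false_positives = sum(1 for draft, correct in labeled_cases
                              if not correct and approves(draft))
        report[name] = {"false_negatives": false_negatives,
                        "false_positives": false_positives}
    # Per the discussion above, you generally prefer the configuration that keeps
    # false positives low, even at the cost of more false negatives.
    return report
```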

Richard Lichtenstein: Yeah. So I think there’s a few things there you said that were interesting, the autoregressive thing, by the way, I think everybody listening to this has probably experienced right?

That sensation where you’re trying to get ChatGPT or Claude to do something. It’s gone down a rabbit hole in a way that’s not what you want and you can’t get it out of the rabbit hole. And it’s way better to start a new prompt. It’s just much faster to say I’m gonna start over with a new prompt and see if I randomly get to a better spot and work from there.

That I get. One approach I’ve seen is to do some sort of a voting approach. Let’s have the bot attack this problem multiple times and see if we get the same answer five times.

If we do, maybe we’re more confident and if we get different answers, we’re less confident. And maybe you vote. Or maybe you say if we get a different answer, then that’s a fail and we go to a human. What about that approach, or does that just take too long? Because now you’re using five times as many tokens.

Sanjin Bicanic: No, absolutely. The autonomous engine we built actually did have that before it even went to the evaluation agent. It would run it twice in parallel with the exact same prompt; it doesn’t take that long. And then it would compare the results and see if it came to the same conclusion or not.

And if it didn’t, in our initial version we just said there must be something borderline here; let’s send it to a human for processing. But then the second iteration of that isn’t necessarily to rerun it again and see if they agree, but to re-run it with feedback. You say, hey, here’s what we tried before.

It didn’t quite match. These are the two different versions. Try to find reasons why these may be different and which one may be right. And then pass that reasoning about the differences in as it tries again. And then again, you try it twice and see.

Richard Lichtenstein: Interesting.

Sanjin Bicanic: It’s a really good thing. But there’s a whole bunch of different ways you could do it. If it fails the first time, do you run it just twice? Or should you try running four times, and then you have to get three out of four? Do you wanna run it naively or with feedback? And you have to test to see which one runs better, because there is no one right answer.
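A rough sketch of the run-twice, compare, and retry-with-feedback flow Sanjin describes; `llm_call` and `answers_match` are hypothetical placeholders here, not his actual engine.

```python
# `llm_call(prompt, context)` is a hypothetical model wrapper returning text; `answers_match`
# could be an exact comparison, a field-by-field diff, or another model judging equivalence.
def run_with_self_check(request: str, context: dict, max_rounds: int = 2) -> dict:
    feedback = ""
    a = b = None
    for _ in range(max_rounds):
        # Two independent attempts with the same prompt (plus feedback from a prior round).
        a = llm_call(prompt="Process this request." + feedback,
                     context={**context, "request": request})
        b = llm_call(prompt="Process this request." + feedback,
                     context={**context, "request": request})
        if answers_match(a, b):
            return {"action": "continue_to_evaluator", "answer": a}
        # Disagreement: ask the model to reason about the difference, then try again.
        feedback = (
            "\nTwo earlier attempts disagreed.\n"
            f"Attempt A: {a}\nAttempt B: {b}\n"
            "Reason about why they differ and which is more likely right, then redo the task."
        )
    # Still borderline after the allowed rounds: send it to a human for processing.
    return {"action": "route_to_human", "candidates": [a, b]}
```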

Richard Lichtenstein: Great. Now I wanna shift gears very slightly to something that you’ve been talking to me about called the Shadow System, which I think at a minimum, has a very cool name, right? Even if you’re listening to this and only kind of paying attention, you probably heard Shadow System and maybe perked up and wanted to hear more.

Sanjin, what is the shadow system and how does it help with some of these problems?

Sanjin Bicanic: Again, I can’t take credit for it. This is something that, I think, probably also comes from Andrej Karpathy. The idea is: can you have the agent do all the work without submitting it? So it’s almost working in parallel to the human.

And then, rather than having humans actually review the agent’s work, you compare what the human submitted as the real work to the ticket that the agent would’ve submitted but didn’t. Right? So the shadow system - you can think of it as an agent to whom you bcc all the work.

It does all the work, but at the very last step, instead of submitting it, it essentially saves it as a draft. And then there’s a backend system that periodically checks and sees, “Hey, how does the work the agent would have done compare to the actual work the human did do, and how similar are they?” And the reason why this becomes such a powerful system is something you’ll find out when you build these autonomous agents.

You need SMEs to grade the work of the agents. And the best you can get in most companies is for humans to evaluate a hundred things per day or something like that. But that’s not enough to test the agent on the wide distribution of the real world, because it turns out the real world is strange and messy.

So the way the shadow system works is: you have the human process it as they usually would. You have the agent process it and save the draft. You compare the two, and only when the two are not the same does the SME look at it and try to decide who was actually right here.

Did the agent mess up? And if so, how? Or did the human mess up, or actually did both of them mess up? And you’ll be shocked to find out that humans are not perfect, and that when you present this as an A/B (because you don’t tell the grader which one is human and which one is the agent), you often find that it ends up being kind of 50/50: humans make just as many mistakes as agents. Their mistakes just tend to be very different. What we found in this customer service example is that the common error of humans is just fat-fingering a number or misinterpreting a number. In a payroll use case, someone will tell you this person worked 70 hours and 32 minutes, and people type in 70.32. But .32 is not 32 minutes.

Whereas, the agents tend to mess up on the implied nuance of the email where we say hey, these are the three people who don’t pay federal taxes, and the agent may assume that only applies to the very last person; whereas a human will correctly understand that actually that applies to the entire roster.
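Here is a minimal sketch of the shadow-system control flow, with hypothetical helper names; the point is that the agent’s work is saved as a draft, and SMEs only see the blind A/B cases where the agent and the human disagree.

```python
# The helpers (run_agent_in_shadow, load_human_submission, drafts_match, sme_review_queue)
# are hypothetical placeholders. A typical disagreement might be a human typing 70.32 for
# "70 hours and 32 minutes", which should be roughly 70.53.
def shadow_compare(ticket) -> dict:
    agent_draft = run_agent_in_shadow(ticket)     # full work, saved as a draft, never submitted
    human_result = load_human_submission(ticket)  # what the human actually submitted

    if drafts_match(agent_draft, human_result):
        return {"ticket": ticket.id, "status": "agree"}  # no SME time needed

    # Disagreement: queue a blind review; the SME is not told which side is the agent.
    sme_review_queue.put({
        "ticket": ticket.id,
        "version_a": agent_draft,
        "version_b": human_result,
        "question": "Which version is correct - A, B, both, or neither?",
    })
    return {"ticket": ticket.id, "status": "needs_sme_review"}
```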

Richard Lichtenstein: Interesting. Now, the shadow system sounds incredibly stressful for the humans involved in that. They’re now competing against a robot to keep their jobs. Do they know the shadow system exists? How do you handle that? It seems like a real John Henry kind of situation.

Sanjin Bicanic: Most of the humans actually were not aware of this shadow system existing, with the exception of the few subject matter experts whose job it was to actually decide who did it right: the AI or the human. That being said, even if they did know, I don’t think they would have minded, because the work that typically goes to agents first is the kind of work humans really don’t enjoy doing.

It’s really rote, repetitive, boring work, and actually the humans really want the agents to take this work away as quickly as possible so they can focus on the more interesting kind of work that they actually wanna be doing. And nobody wants to type in 50 lines of text or numbers.

That is absolutely the kind of stuff that, you would want the agents to handle for you. And you can focus on like advising clients or more important issues where judgment is required.

Richard Lichtenstein: That makes sense. And just to be clear, at this specific company that we’re talking about, my understanding is that no one is fired as a result of this system.

So just to set the record straight, right?

Sanjin Bicanic: Yeah. What you’ll find in most customer service roles is that there’s actually very high churn, and that churn is very expensive. And there are two consequences of having agents do some of the work. One, you are taking away precisely the kind of work that humans hate doing, and it’s the reason why they leave.

Two, to the extent that agents do automate some of the work, it just means that some of the churn isn’t as painful, because you don’t have to rehire those people.

Richard Lichtenstein: Yeah. Again, good for the business, but also not bad for the people involved, which is good.

One other question that we haven’t talked about yet is security. I worry a lot about security, and you know this because we’ve discussed this a lot. I worry a lot about security with agents because once you have an agent that’s exposed to input from the outside world, has access to confidential information, and can take an action, you’ve now set up what they call the lethal trifecta, which means that this agent could be attacked with a malicious prompt coming in through that outside input and it could do something bad.

How do you make sure that doesn’t happen?

Sanjin Bicanic: It’s a multi-layered security system, just like it is in almost every other facet of cybersecurity. The first thing you do is check the email that you get: is this coming from a person who is allowed to email us about this account?

And yes, it’s easy to spoof email addresses, but there’s a way that you can test for that in a reasonably easy way. So that’s the first test. Then the second test tends to be a very simple regular expression, simple string search where you are just searching for certain strings that strongly indicate that something is suspicious here.

Any email that contains text like “ignore all previous instructions” - especially if it happens to be in white text, so it’s invisible - is pretty suspicious. So that is the second layer. The third layer is that you have dedicated prompt injection models that detect that something is suspicious about this prompt and don’t rely just on simple text search.

And then the final one that you put in is: let’s say somebody passed these three or four checks. What are the kinds of malicious things they would ask us to do? And you just disallow the agent from doing those kinds of tasks. At a payroll company, if you’re trying to add an employee and then pay him a large amount of money, that feels very suspicious.

We may create the payroll submission, but we’re not gonna submit it. We’re gonna send it to a human, and then the human will take a look at that and decide what to do about it. And then the last bit is you just don’t give API access to the agent for certain things so that even if they do get prompt injected, there’s just some things they’re not allowed to do.
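A sketch of those layers in the order Sanjin lays them out, with the regex layer as a concrete example and the other checks as hypothetical placeholders for things you would build or buy.

```python
import re

# sender_is_authorized, prompt_injection_score, and requires_human_review are hypothetical
# placeholders; the regex layer is a real, deliberately simple example.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"disregard (your|the) (system )?prompt", re.IGNORECASE),
]

def screen_email(email) -> str:
    # Layer 1: is the sender allowed to email us about this account?
    if not sender_is_authorized(email.sender, email.account_id):
        return "reject"
    # Layer 2: cheap string/regex search, including hidden (e.g., white-on-white) text.
    text = f"{email.visible_text} {email.hidden_text}"
    if any(p.search(text) for p in SUSPICIOUS_PATTERNS):
        return "route_to_human"
    # Layer 3: a dedicated prompt-injection classifier that goes beyond string matching.
    if prompt_injection_score(text) > 0.5:
        return "route_to_human"
    # Layer 4: even if everything above passes, high-risk requests (e.g., add an employee
    # and immediately pay them a large amount) are drafted but never auto-submitted.
    if requires_human_review(email.requested_actions):
        return "draft_only"
    # Layer 5 lives outside this function: the agent simply has no API access for certain
    # actions, so some things stay impossible even if it does get prompt injected.
    return "process"
```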

Richard Lichtenstein: It all makes sense. I think those are clearly the right set of things. I still believe two things. One is that at some point there will be a situation in which a prompt injection attack causes something bad to happen. Hopefully the guardrails are in place so that the something bad is not a catastrophic problem but just a minor inconvenience to some number of people. And two, somebody is going to make a lot of money creating some sort of anti-prompt-injection guardrail software that every agent stack has to have in order to prevent this.

Sanjin Bicanic: I couldn’t agree more. I also think it’s important to notice that a lot of these challenges they happen with humans too. The concept of prompt injection isn’t that different from social engineering, right? When somebody calls in and says, “Oh, I forgot my password, can you just help me this one time?”

And they say it in a really apologetic voice, and sometimes people fall for it. So it turns out AI agents are even more gullible than humans, so you have to be twice as careful about it. I agree with you that something bad will happen, and we’re gonna have to invent a whole new tech stack to deal with it. Right now, a lot of this you have to build yourself from first principles, but there should be a tool that lets people worry only about the business problem they’re trying to solve.

Richard Lichtenstein: Got it. I was so excited to talk about security that I should have asked another question, which is: we talked a little bit about different roles and what this might replace and so forth. We talked about the new role of the grader, right, and that’s a little bit temporary as you’re rolling the system out, but are there any new jobs that spring to life in a new world where people are using agents for things like this?

Sanjin Bicanic: I’m sure there will be, and I’m sure we’re really early in figuring out what they may be. Even in this example where, in theory, it is automating the work away, you actually have a dedicated set of human agents whose job it is to pick up work in process where the AI agent isn’t confident.

Or it ran into an issue, and it needs someone to resolve it. So that’s the first thing - a specialist type of human agent. The second thing that you end up with is essentially fleet managers: people whose job it is, rather than managing humans, to manage a fleet of agents and try to figure out where they are getting tripped up.

And as a result, where do we need to have agent programming be better? And that’s not one job. That’s actually multiple jobs. Of course, you have the graders and then you have the trainers, people who are gonna be like, “Ah, I see why you got tripped up here. Let me like adjust this prompting over here.”

Maybe there’s context engineering to make it easier for the agent to figure this out. But that’s in the automation sense. In the augmentation sense, I’m sure there’ll be brand new jobs that we’ve never thought about before, because all of a sudden things you could never do before suddenly become economical, and that opens up some other areas where humans are needed.

Richard Lichtenstein: Yeah, I agree. There are gonna be a lot of new jobs that sort of appear. It’s hard to imagine them, but it’s like asking in 1999, what new jobs will the internet create? You would not have come up with Instagram model as a job, but obviously that’s an example. So it’s hard. We’re sitting a little too early in this to call it.

So let me switch gears again a little bit. I think hopefully some people who are listening to this are thinking, “How do I bring this to the companies that I invest in or own or work at? How do you get started? What are the right problems?”

Where does it make sense to attack with these autonomous agents?

Sanjin Bicanic: So the first question you have to ask yourself is: do you actually want it to be end-to-end autonomous, or do you want it to be autonomous on a task that’s part of the bigger process? The end-to-end autonomous is orders of magnitude more difficult because of that reliability and how hard it is to get there, but it also tends to be a lot more valuable, because you are literally removing large chunks of work going forward.

In my experience, that full end-to-end automation is really hard within the customer service, customer support funnel. You have to have at least 500 or maybe even a thousand people before those kinds of use cases have near-term positive ROI, just because the technology is new. It’s quite expensive to be able to build these things.

On the other hand, for the things that are still autonomous, still agentic, but they’re more like automating tasks, those tend to be a lot easier to find. The trick is you wanna look at a process end-to-end, and say “Can I do more than one task in that process where this can help and remove the major bottlenecks?”

I think one thing we’ve found is that if you just do one task within a given process, usually the bottleneck just moves somewhere else and the overall value becomes a rounding error. Automation is the easiest to underwrite because you know what you’re spending today, and you can assume what percentage you can automate away.

Hence the savings is X times Y percent. It tends to be really difficult to actually get that value because, as we said, it takes a long time to build. We tend to judge AI by a much higher quality standard than average humans. You have to get a higher accuracy than your average human before you roll it out.

And then the third thing is there’s the whole functionality of the role, where even if you’re automating X percent of the work, you’re often not removing X percent of the cost. I have found that things that are revenue generating are easier to get going - it could take one or two months.

You get a lot less pushback from the organization that’s afraid of AI bots taking their jobs, and they tend to be more interesting. The area where I would focus, in that augmentation bucket, is: what are the things that were completely uneconomical to do five years ago that you could do today?

Because all of a sudden you have an army of PhDs that essentially work for free. The only problem is they have amnesia, so every time you ask them to do something, they’re doing it for the very first time. So you have to give them very few distractions. Those tend to be the kind of things that I find are very helpful, and my one simple rule of thumb for folks is: if you could get a thousand PhDs in any area, what could they work on for you today?

The catch, again, is that they have amnesia, so every time you ask them to do it, they haven’t learned on the job.

Richard Lichtenstein: That’s interesting. I think that inside sales use case that you’re painting is a very good AI use case. You don’t know how successful they’ll be in getting leads. They turn over a lot and quit. If you don’t have an inside sales team today, or they’re very small and you feel like there’s more ground they could cover, I think AI is an extremely good place to start. So I totally agree with that one. The thing that I’m still surprised by is the threshold at which you’re saying you need to use AI.

Because I think what you’re basically saying is that building these things is hard enough that you really need 500 people at your company all doing the same thing before it’s big enough for AI. Is that gonna get easier? A year from now, is it still gonna be 500? Is it gonna be 100?

Where’s it gonna be?

Sanjin Bicanic: There’s two reasons why the threshold is so high today, and then the threshold will keep dropping over time. If you’d asked me this question a year ago, I would’ve told you it’s a thousand. A year later, now it’s 500. Maybe in a year it’ll be a hundred or something like that.

There’s two big reasons why the threshold is so high. One, because so many of these systems you have to build for the first time. Often you have to DIY them because there are no off the shelf solutions. Like your security example was a great example of that. It just ends up costing you multiple millions of dollars to build something really robust.

The second challenge is the quality of the models. Today you generally can’t address a hundred percent of the role; typically you can address 10 to 20% of the role. As the addressability of the role goes up - instead of being 20%, it becomes 40% - and as you build up some of these capabilities so that you have to do less yourself and more of it becomes off the shelf, the number that you need to hit in order for this to be viable will continue to decrease.

Richard Lichtenstein: What does the agentic tech stack look like? Or how does that tech stack look different from the normal software stack or even the normal sort of GenAI stack that we think of for simpler applications?

Sanjin Bicanic: So it looks different in production, but it really looks different as you are developing it. So in production you may have to have much more robust things to protect against the lethal trifecta or prompt injection. You’ll need to build many more MCP servers to connect into the various core systems.

You will need to have knockouts to detect areas where you are either unwilling to take the risk if something goes bad, or maybe you just don’t have the APIs to do it right now, so you literally can’t have the agent do it even if it wanted to. And then you have to have some way of still handing the work to a human even when it’s fully automated, right?

Because you won’t be able to get everything done. Sometimes you’re just not sure that this is exactly right, and you don’t wanna take the risk. Other times you run into an error, and there’s no way for you to get around the error and you’d be like, “Hey human, please handle this error for me.”

And then the third thing is that the level of observability you need to have into what’s going on is orders of magnitude more important than the level of observability you need to have for just a chatbot.
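A sketch of the production-side plumbing Sanjin lists - knockouts, human handoff, and step-level observability - using hypothetical placeholder names for the infrastructure pieces.

```python
# trace_log, human_queue, and execute_step are hypothetical placeholders for your own
# infrastructure; the knockout list names illustrative actions, not actual API calls.
KNOCKED_OUT_ACTIONS = {"submit_payment", "delete_account"}  # too risky, or no API yet

def run_step(ticket, step):
    trace_log.record(ticket.id, step.name, {"inputs": step.inputs})  # log every step taken
    if step.action in KNOCKED_OUT_ACTIONS:
        human_queue.put({"ticket": ticket.id, "reason": f"knockout: {step.action}"})
        return "handed_to_human"
    try:
        result = execute_step(step)
    except Exception as err:  # an error the agent cannot work around on its own
        human_queue.put({"ticket": ticket.id, "reason": f"error: {err}"})
        return "handed_to_human"
    trace_log.record(ticket.id, step.name, {"result": result})
    return result
```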

And then on the development side is where things get really interesting. You’ll typically need multiple different environments to test how agents are doing. One, just for the unit test where you know the inputs and the outputs. One, which is the shadow mode, right? That needs to be an entire separate production system that needs to run.

And what you’ll find is that when you have these evaluation systems, you need a simulated environment for these agents to perform in that looks just like the real environment, but isn’t. In effect, you kinda have to build fake versions of all your core systems that are simplified for the agent to interact with.

And then the really tough part is the evaluation thing, because if you’re relying on humans to evaluate everything, you are only gonna be able to evaluate a fraction of what’s going on. So how can you auto evaluate as many of the tasks that the agent is doing becomes really important and really hard.
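Here is a sketch of the “fake versions of your core systems” idea: a simplified, made-up stand-in for one imagined core system (payroll, in this case) that the agent can call in a simulated environment without touching production.

```python
# All names here are invented for illustration; the real fake systems would mirror whatever
# core systems your agent actually calls.
class FakePayrollSystem:
    """Simplified stand-in for a real payroll API, used only outside production."""

    def __init__(self):
        self.employees = {"E001": {"name": "Jane Doe", "hours": 0.0}}
        self.submitted_runs = []

    def record_hours(self, employee_id: str, hours: float) -> None:
        if employee_id not in self.employees:
            raise KeyError(f"unknown employee {employee_id}")  # mimic real-API failure modes
        self.employees[employee_id]["hours"] = hours

    def submit_payroll(self) -> dict:
        run = {emp_id: record["hours"] for emp_id, record in self.employees.items()}
        self.submitted_runs.append(run)
        return {"status": "submitted", "run": run}

# In the test environment the agent gets this object as its "tool"; automated checks then
# assert that the right calls were made with the right values, which is what lets you
# evaluate far more runs than human graders ever could.
```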

Richard Lichtenstein: See, I think that’s super interesting, because I think when people think about what it takes to build one of these, or what the tech stack looks like, they jump, as you said, to the production thing. They jump to: when is this working? What’s it gonna look like? But I think what you just described is that actually a lot of the building, and a lot of the hard part, is the scaffolding that’s required to get the thing off the ground in the first place. There’s a lot of scaffolding that you have to build just in order to be able to get there. And that requires a lot of engineering. And in some cases the scaffolding engineering can be as difficult as the production engineering.

And that’s the problem.

Sanjin Bicanic: It’s funny you bring that up, because if you were to look at the block diagram of everything that had to be built for that autonomous agent, you would realize that the area of the blocks that are actually the agent itself is only about 20% of everything that needs to be built.

There’s 80% of scaffolding that has to go around it in order to get to that point.

Richard Lichtenstein: Wow. It’s a good thing that’s not how buildings work, ‘cause if that’s what it took, actually building a building would be very hard. Glad that this is all virtual scaffolding.

I guess I’ll just ask one more question. If somebody does listen to this and at least gets, this is hard, right? This is hard, it’s difficult, et cetera. But then they still say, “I don’t care. I wanna do this ‘cause I’ve got 500 people, a thousand people, and I want to attack it and it’s gonna be worth a lot of money or make my customers a lot happier or whatever.” What would be your advice to those people about how to get started and how to do this successfully?

Sanjin Bicanic: There would be three things you want to do to convince yourself that this is gonna be valuable. The first one is to really understand what your full total cost to serve these interactions is today. A lot of folks think that it’s like a dollar or $2, because that’s like the marginal cost of labor.

But actually you have to look at a lot of things - you’re paying for the building and the software and the bandwidth. And also, in many of these, you have to staff to peak, not to average, so you have a whole bunch of people that are not utilized. So what is your total actual cost per interaction today? Because that then sets the budget for how much you can spend to build the agent to do this.
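To make that arithmetic concrete, here is a back-of-the-envelope sketch; every figure is an illustrative placeholder, not a number from the conversation.

```python
# Fully loaded cost and peak staffing push the real cost per interaction well above the
# $1-2 "marginal cost of labor" intuition; that number sets the budget for the agent build.
fully_loaded_cost_per_rep_per_year = 65_000   # salary, benefits, building, software, bandwidth
interactions_per_rep_per_year = 20_000        # throughput at average utilization
peak_staffing_multiplier = 1.3                # you staff to peak demand, not to average

cost_per_interaction = (
    fully_loaded_cost_per_rep_per_year * peak_staffing_multiplier
    / interactions_per_rep_per_year
)
print(f"~${cost_per_interaction:.2f} per interaction")  # the budget ceiling per automated case
```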

The second thing that you want to be really sure about is: can I give the tools to an AI agent the way I give them to a human? Meaning, do I have the APIs to accomplish the exact same thing that humans are accomplishing right now?

A lot of folks don’t have APIs for everything. So then you either have to build some sort of RPA layer on top that does the clicking for you and then it looks like an API for you, or you have to wait a few months until you build those APIs out.

And then the third thing that I would look at is: can you convince yourself that there is 10 to 20% where you say, “Oh my God, this looks so simple! We should definitely be able to do that.” And then what I’m gonna caution you about is that things that look really simple in practice end up being more difficult than you think. If something looks borderline, it’s probably too difficult to do for your first one, because all sorts of really random things come in that you would never expect.

People, when they’re sending in payroll, scribble it on a piece of paper and then paste it as an image inside of the Excel file. This has literally happened. So you think you can deal with it because it’s Excel, but then you’re like, oh no, it’s actually a bitmap inside the Excel.

And now I’m really confused as to how to read this. Or another thing is people say, “Hey, I need to do this for these three employees. Do this for Richard, do this for Sean, and do this other thing for the father.” And there is nobody named “father” on my employee roster, but then you realize there’s a Reverend John Smith, so they probably mean that person, right?

So the world is just always more interesting and challenging and different than you expect; you have to deal with typos and weird abbreviations people use. The world is just a lot messier than you think. So the task kind of has to be obviously doable by an AI, and if it’s not, it’s probably not ready.

So those would be the three things: one, make sure it’s valuable enough; two, make sure you have the APIs; and three, find the 10 to 20% that’s obviously easy to do.

Richard Lichtenstein: So I think the key lesson there is it needs to be something that you think is gonna be incredibly easy, and then it will actually turn out to be incredibly hard, but at least doable, right?

Any last comment you wanna make?

Sanjin Bicanic: I was gonna say, there’s the famous motto of software engineering: we didn’t do this because it was hard. We did it because we thought it was gonna be easy.

Richard Lichtenstein: Words to live by in the new agentic world we’re in. Sanjin, thanks for coming on. I appreciate it. I think this was way more interesting for people than if I had tried to write the boring blog post version of this. So thanks so much for doing that. Thanks everyone for listening and leave a comment or send me a note if you have questions.

Note: The opinions expressed in this article are my own and do not represent the views or specific recruiting practices of Bain & Company.