The digital biotech startup playbook


“steampunk robotic arm doing biology experiments in a laboratory” — DALL-E (https://labs.openai.com/)

Jake Feala

In the past decade a small handful of “digital biotech” companies have drawn on ideas, technology, and talent from the software industry to build valuable data platforms for accelerating drug development. While most biotech founders say they want data and ML to be a pillar of their strategy, many have no software experience and therefore struggle to build a digital-first company.

In my career working at several early-stage digital biotechs, I've seen the same patterns recur, and I've watched how early decisions shape a company's later trajectory. Each company is unique, but there are foundations and rules of thumb for digital biotech platforms that apply generally.

My goal is to provide a few concrete steps to follow, but also to highlight important cultural, technical, and organizational decisions that should be considered very early in the life of any digital biotech company. I’ll also propose novel ideas that I think could be successful.

I hope this can be helpful to others in the industry starting their own digital biotech companies. The more we can infuse software and data practices into drug discovery, the sooner we can solve the industry's problems of poor success rates and reproducibility, and the better we'll be at curing disease.

What is a digital biotech?

The term "digital biotech" was popularized by Stéphane Bancel, CEO of Moderna, who described Moderna as "the first digital biotech" and laid out his ideas in a white paper and a 2017 blog post; the platform has since been the subject of a Harvard Business School case study. Bancel recognized that mRNA is similar to software, in that it is a set of coded instructions to the cell that can be reprogrammed to produce any protein drug, and he built Moderna as a software company from the start. Since then, most industries have recognized the need for a heavy software and data focus, and most companies are scrambling to hire tech talent and implement a digital transformation. Biotech and pharmaceutical companies in particular have invested heavily in data infrastructure in recent years.

Despite the recent investment, most life sciences companies remain a decade behind the tech world with regard to data management, IT systems, and operational styles. Companies are held back somewhat by the regulated nature of our industry but mostly by a slow-moving, conservative culture inherited from academia and big pharma. In my opinion, “digital biotech” is synonymous with “digitally native” because cultural inertia and legacy systems make it impossible to retrofit modern data practices into an established company. I believe the best and only approach to true, full digitization in biotech is to start fresh with a startup.

Digitization is much more than simply converting from paper notebooks to documents on a computer, and goes far beyond electronic LIMS and ELN or computational biology pipelines. As a few examples, digital biotech companies

  • Integrate in-house and 3rd party software into a unified platform with defined services, a managed data model, and customized user interfaces
  • Store all production data in databases that are easily findable, accessible, interoperable, and reusable (FAIR)
  • Heavily automate digital workflows via modern web-based applications
  • Provide computational teams with REST APIs or clients in a modern programming language (e.g., Python) to access data and services (see the sketch after this list)
  • Incorporate computational biology and machine learning in the loop of experimental design and analysis, and
  • Invest heavily in robotic lab automation controlled by the digital platform.
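
To make the API point concrete, here is a minimal sketch of what programmatic data access might look like for a computational scientist. The base URL, endpoint, and field names are all hypothetical; the point is that data access is a function call, not a file hunt.

```python
# Hypothetical example of the kind of Python client a digital biotech
# might expose to its computational team (endpoint and fields are invented).
import requests

BASE_URL = "https://lims.internal.example.com/api/v1"

def get_assay_results(experiment_id: str) -> list[dict]:
    """Fetch structured assay results for one experiment from the LIMS API."""
    resp = requests.get(f"{BASE_URL}/experiments/{experiment_id}/results")
    resp.raise_for_status()
    return resp.json()

for record in get_assay_results("EXP-0042"):
    print(record["sample_id"], record["readout"])
```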

As Moderna’s white paper lays out, digitization improves quality (reducing human errors), speed (automated data flows and realtime decisions), scalability (many programs can use the capabilities of a single platform), and cost. Furthermore, while many companies aspire to apply artificial intelligence, few have the AI-ready digital infrastructure required to extract clean, harmonized datasets for training machine learning models at scale.

Only a handful of biotech startups are truly digitally native. These companies often resemble Silicon Valley tech startups in the proportion and influence of software engineers, hacker culture, and reliance on purpose-built data infrastructure. However, this breed of startup is becoming more common, and I expect will eventually dominate most life sciences markets.

Do you need a digital platform?

First, if your strategy is to rely on AI, you will need a custom digital infrastructure to enable it. But the good news is that just having a digital infrastructure will be such a massive competitive advantage that it may not matter how good your AI is.

Digital biotechs only make sense for platform companies, which intend to build a novel, highly generalizable capability to produce many different products. A custom-built data infrastructure is probably not worth the investment for biotechs with one or a few assets. If you already have a lead molecule and a straightforward path to the clinic, you probably aren't, and don't need to be, a digital biotech. Likewise, companies that rely on slow, manual techniques that don't scale, such as animal models, may have more typical data management needs that can be outsourced to vendors or CROs. If your assays are unwieldy and data generation is difficult, this path is probably not for you.

Another case where a digital platform might not be worth the investment is when the R&D approaches are very mature, for example if the goal is to screen and optimize a proprietary small molecule library for a well-known target. Tools for small molecule compound registration and SAR (structure-activity relationship) analysis are mature, and you might not get much return from a custom data platform.

Digital platforms are especially good when you have a data generation engine — a handful of novel workhorse bioprocesses or assays that are scalable and can be automated. Other good reasons for a heavy digital investment are if machine learning is a central pillar of your strategy, your data are complex (where AI could help to make sense of it), your data have economies of scale and increasing returns, and you have clear opportunities for lab automation.

Strategy

Strategic decisions should be made very early, because these are often hard to reverse. You probably already have a strategy for your product, but if you are building a digital biotech then it’s important to consider the platform strategy.

The inherent advantage of a digital biotech is the ability to practically eliminate the marginal cost of manual (even mental) tasks of scientific work. In other words, it becomes as cheap and easy to do something a million times as it is to do it once. Once you digitize a task, that task is now available to support a company of 5 or 50,000. Similarly, the promise of artificial intelligence in our current era is not in superhuman abilities but the ability to scale an intelligence of even subhuman levels.

It is obvious, of course, that digitization implies scalability, but it might not be obvious that the most important aspect of your platform strategy should be how to leverage this potential to scale. Your digital strategy should clearly lay out how the platform leverages fewer people to gather and share vastly more data, which can in turn be applied to make all of your products better.

The first thing to consider is the design of your data engine. Questions to answer include:

  • What is your data generation engine? These are your handful of workhorse assays or bioprocesses that you will build your platform around.
  • How can you scale your data generation engine? Do you have an advantage in data generation that others don’t? Can you build that advantage by investing in automation, talent, or IP?
  • How will your platform get better with scale? Your company should become stronger with more data, more programs in your portfolio, and more partners — whereas traditional biotech companies lose efficiency at scale due to communication barriers and data silos.

Even if you don't use computer simulations or artificial intelligence, the role of models should not be overlooked; at the very least you have a collection of mental models. Also, all data have an inherent model in their table, column, and row structure (the "data model"); otherwise they're just numbers. Questions to answer regarding models include:

  • What are the key aspects of your data model? These are the main fields and metadata from your data engine. How flexible and extensible will your data model be? Who will be allowed to view, modify or add to it? How easy should it be to navigate and query? What data are structured, semi-structured, or unstructured?
  • How are you building, communicating, and applying mental models? Do you have a space for sharing narratives and analyses? Do you pre-register hypotheses? For which aspects of your data engine do you want to avoid preconceptions and gather unbiased data (e.g., hypothesis-free)?
  • Do you have a clear machine learning strategy? Are there existing model architectures that can be trained on the data coming from your data engine? How are the data labeled and what is being predicted by the model? Who has access to the model predictions? Is ML “in the loop” of data generation? Will there be a shared codebase for models?

Digital infrastructure

Until about five years ago, most biotechs were still deciding whether to adopt cloud computing (versus on-premises servers or a hybrid model) for their IT infrastructure. Now few biotech startups need to be convinced to use the cloud. However, the cloud vs. "on-prem" debate is just one instance of a constant "build or buy" decision when it comes to digital services.

A useful strategic framework for these decisions is the technology "stack": at what level of the stack do you believe your company can differentiate by building or customizing? For example, at the base of the stack might be networking, which can usually follow well-defined best practices (e.g., VPC configuration on Amazon AWS). Similarly, if you consider the servers and databases as the next level in the stack, managed services and templates from the cloud provider will be fine for most use cases.

At some level of the stack, it becomes less clear that standard tools and services will meet your needs, and you will need to make decisions about whether to use mature solutions (often available as part of the cloud offering), to buy new tools or combinations of tools through software vendors, or to hire programmers to build your own.

I have a separate blog post that gets into the details of how to decide on specific IT tools. This section instead outlines the high-level strategy for the data stack and how it relates to other facets of the business.

LIMS

The Laboratory Information Management System, or LIMS, simply refers to the highly configurable database and data access layer customized to a company’s unique data model. The LIMS provides APIs and graphical interfaces that abstract and hide typical database operations from the user — create, read, update, delete (CRUD) operations on data records, and data definition language (DDL) commands for manipulating the data model. Commercially available LIMS include built-in, domain-specific applications for manipulating the data (e.g., data ingestion, visualization and QC for common use cases such as dose responses and structure-activity relationships), but we will consider these “apps” separately.
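
To illustrate what a LIMS abstracts away, here is a minimal sketch of the underlying DDL and CRUD operations, using SQLite and an invented schema. A real LIMS layers access control, audit trails, validation, and user interfaces on top of exactly these primitives.

```python
# Minimal sketch of the database operations a LIMS hides from the user.
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define part of the data model (a hypothetical samples table)
conn.execute("""
    CREATE TABLE samples (
        sample_id       TEXT PRIMARY KEY,
        cell_line       TEXT,
        passage_number  INTEGER,
        created_at      TEXT
    )
""")

# CRUD: create and read sample records (update/delete work the same way)
conn.execute(
    "INSERT INTO samples VALUES (?, ?, ?, ?)",
    ("S-001", "HEK293", 12, "2023-01-15"),
)
for row in conn.execute("SELECT sample_id, cell_line FROM samples"):
    print(row)
```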

A "build" decision for the LIMS will propagate to all services up the stack, since almost all other software components will need to interact with it. If you roll your own LIMS, you'll need to also build the data layer and applications, as well as handle the many data access and governance issues that commercial LIMS take care of under the hood. Building a home-grown LIMS is therefore the key decision point for whether you are a "digital biotech": it is a major commitment to long-term software investment, but probably your biggest differentiator.

The benefit of building your LIMS is that you’ll have much more flexibility to customize your platform and provide a seamless, well-integrated experience across your tools and services. You will be forced to hire true, full-stack software engineers and engineering managers, which will open further capabilities for incorporating custom software throughout the business. You will focus on APIs and data models that unlock the potential to scale. Your company will inevitably be digitally native.

ELN

The electronic lab notebook, or ELN, is often lumped together with the LIMS in vendor-provided solutions, but these can be considered separate tools. In fact, while a digital biotech generally builds its own LIMS, it is probably unnecessary to build an ELN. While the leading vendor ELN, Benchling, is creeping into the LIMS space itself, it also provides APIs that allow integration of the notebook and any modern LIMS.

While ELNs are currently a necessary tool for industry research and can be used independently from other systems, the concepts of ELN and LIMS are beginning to merge. At the highest level of abstraction, they are both databases: a LIMS captures structured metadata and an ELN captures unstructured metadata for experiments. However, there is nothing preventing notebook entries from being stored in a LIMS as a long text or blob field in the record of an experiment. The ELN interface provides a user-friendly front end, but these will probably converge into a single system eventually.
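
A minimal sketch of that convergence, with a hypothetical schema: the notebook entry becomes just one more column on the experiment record.

```python
# Illustrative only: an ELN entry stored as a field on a LIMS experiment record.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE experiments (
        experiment_id   TEXT PRIMARY KEY,
        protocol_name   TEXT,    -- structured, LIMS-style metadata
        started_at      TEXT,
        notebook_entry  TEXT     -- unstructured, ELN-style narrative
    )
""")
```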

Furthermore, the ELN may change drastically if trends continue toward more flexible automation and encoding lab protocols as code. Much of the functionality of the ELN revolves around protocols (writing, storing, sharing, re-using, adjusting), minor mathematical operations (e.g. concentration and volume calculations), and entity tracking (samples, reagents, plates and wells). In an automated protocol, none of these features are necessary and the protocol is better stored as a typical codebase in a version control system such as git. Manual notes from the experiment can be tracked alongside sensor and instrument readouts during execution of the experiment. As software “eats the lab”, the ELN may be replaced by software and structured data.
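
As a hedged illustration of protocol-as-code, here is a serial dilution expressed as a plain Python function that could live in version control and be reviewed like any other code; the function name, units, and defaults are invented for the example.

```python
# Sketch of a protocol encoded as code: a dilution series computed from
# C1*V1 = C2*V2, with the "minor math" the ELN usually handles done in code.
def serial_dilution(stock_conc_um: float, target_concs_um: list[float],
                    final_volume_ul: float = 200.0) -> list[dict]:
    """Compute stock and diluent volumes for each target concentration."""
    steps = []
    for target in target_concs_um:
        stock_vol = target * final_volume_ul / stock_conc_um
        steps.append({
            "target_um": target,
            "stock_ul": round(stock_vol, 2),
            "diluent_ul": round(final_volume_ul - stock_vol, 2),
        })
    return steps

print(serial_dilution(100.0, [10.0, 5.0, 2.5, 1.25]))
```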

Data model

The data model is extremely important, not simply for managing the data you put into the database, but because it is central to representing and communicating the shared mental model of your scientific strategy. The data model will be continually and iteratively updated, even before you have much data to manage.

The tables and columns of your database describe what entities and attributes of your platform need to be tracked, how parts of your scientific platform relate to each other, and ultimately which hypotheses you expect to be testing and learning from in the future. For example, if you think cell line passage number might correlate to expression of a key protein, and this readout is crucial to your product, then these should be columns in the database that can be easily joined into a single table for plotting.
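
A quick illustration with pandas, using made-up tables: if passage number and expression are modeled as columns keyed on the same sample ID, producing the plot-ready table is a single join.

```python
# If the data model captures these as columns, connecting them is one line.
import pandas as pd

samples = pd.DataFrame({
    "sample_id": ["S-001", "S-002", "S-003"],
    "passage_number": [5, 12, 20],
})
expression = pd.DataFrame({
    "sample_id": ["S-001", "S-002", "S-003"],
    "protein_expression": [8.1, 6.4, 3.2],
})

plot_table = samples.merge(expression, on="sample_id")
print(plot_table)  # columns ready for a passage-vs-expression scatter plot
```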

Whether you build or buy your LIMS, in the early days you’ll want to subscribe to a no-code tool such as Airtable to prototype your data model in the months while your LIMS is being built and configured (I’ve written about this tactic). Even if your team hasn’t generated a single datapoint, you’ll want to whiteboard an entity relationship diagram (ERD) with your scientists. No-code tools will then allow the scientists to make edits to the data model themselves as the data begin to accumulate.

Regardless of the flexibility of your LIMS, however, the data model will become more difficult to modify in the future as schema changes require more shuffling and massaging of the data.

Services

Many tech companies have adopted (and I recommend) a service-oriented architecture (SOA). Simply stated, this is the idea that a complicated software platform can be divided into independent codebases, each with its own database, middleware, and programmatic interface or API. Other parts of the platform can freely access the “service,” typically through an automated request over the network. This contrasts with a “monolith” design, where the platform lives on one codebase and different components are imported as libraries. The benefits of SOA are the ability to rapidly develop a codebase independently of a larger, slow-moving monolith, and to provide a stable menu of compute and data functions that can be composed into more complex applications.
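
As a sketch of what one such service might look like, here is a minimal FastAPI app exposing a sample-lookup endpoint over HTTP; the route, schema, and in-memory "database" are all hypothetical stand-ins for a real service with its own datastore.

```python
# Minimal sketch of a single service in an SOA, using FastAPI.
from fastapi import FastAPI, HTTPException

app = FastAPI(title="sample-service")

# Stand-in for the service's own private database
SAMPLES = {"S-001": {"cell_line": "HEK293", "passage_number": 12}}

@app.get("/samples/{sample_id}")
def get_sample(sample_id: str) -> dict:
    """Return one sample record; other services call this over the network."""
    if sample_id not in SAMPLES:
        raise HTTPException(status_code=404, detail="sample not found")
    return {"sample_id": sample_id, **SAMPLES[sample_id]}

# Run with: uvicorn sample_service:app --reload
```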

Designing the service interfaces is a crucial process for how the company will operate and scale, and depends on what data and compute functionality need to be exposed to scientists so they can do their jobs conveniently without being overwhelmed by complexity. Careful thought about the organizational structure will also help guide the design of service boundaries and APIs.

In a scientific startup, you may want to maintain a single, centralized data model that captures the scientific knowledge of the company. This goal can be at odds with services, which can abstract and hide the data model from the user. A balance must be struck between encapsulating unnecessary details of the data model and making sure that the relevant data are easily accessible. Data warehouses, discussed later, also address this problem.

Data ingestion

Digitizing and structuring biological data and associated metadata takes tremendous effort. This is partly because most instrument vendors export their data in bespoke CSV or Excel formats designed for human readability, and give little thought to making their readouts interoperable and machine readable. Also, scientists are used to managing their data in spreadsheets, so automating their workflows requires both customized tools and a cultural shift.

First, choose one or two workhorse assays in which to invest heavily in data automation. Prototype a system to access data directly from the instrument, capture or link to relevant metadata, format it into a structured table, and insert records into a central database, with minimal user intervention. This may only be a throw-away prototype, but it will help you tease out inevitable issues in networking, instrument interfaces, and the data model, and will be an early introduction of digital culture to your lab.

Simultaneously, begin developing generalizable tools for data ingestion, hardening common patterns of data ingestion into apps and services that can be re-used across assays and processes. This data ingestion layer will be a valuable component of your platform that will quickly be worth the investment. One example of this is a platemap designer: if your scientists use Excel to annotate blocks of wells with metadata (e.g., samples, concentrations, reagents), a web tool to help design platemaps and convert them into structured tables can be invaluable. Even better, build a tool to capture the scientists’ intentions and convert them into randomized platemaps for liquid handling automation.
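
For example, a first version of the platemap-to-table conversion might look like this pandas sketch; the plate dimensions and column names are illustrative.

```python
# Sketch of a platemap "unpivot": converting a plate-shaped block of
# annotations (rows A-H, columns 1-12, as laid out in Excel) into a tidy table.
import pandas as pd

def platemap_to_table(platemap: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Melt a plate-shaped DataFrame (index = row letters) into long format."""
    long = platemap.rename_axis("row").reset_index().melt(
        id_vars="row", var_name="column", value_name=value_name
    )
    long["well"] = long["row"] + long["column"].astype(str)
    return long[["well", value_name]]

# Example: a 2x3 corner of a plate annotated with sample names
plate = pd.DataFrame(
    [["ctrl", "drugA", "drugA"], ["ctrl", "drugB", "drugB"]],
    index=["A", "B"], columns=[1, 2, 3],
)
print(platemap_to_table(plate, "sample"))
```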

Data lakes

Your central database should be reserved for "small" and "medium" sized, easily structured data, summarized in a way that is human-interpretable. Large, low-level, unstructured data such as images, next-generation sequencing, flow cytometry, mass spectrometry, and other datatypes are best stored in a data lake.

The simplest form of a data lake is a shared drive or S3 bucket that has an associated table in your database containing filepaths, metadata, and summarized values for each file. For example, you may have a table in your database for flow cytometry experiments that captures the metadata (samples, reagents), as well as summarized values (e.g. median fluorescence). In the data lake pattern, this table would also contain filepaths to the raw FCS files, so that the data can easily be found and pulled into dedicated analysis pipelines.
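
A sketch of what one such record might look like, with a hypothetical bucket and schema; an analysis pipeline resolves the stored path to pull the raw file down, here with boto3.

```python
# Hypothetical data-lake record: summary values live in the database,
# the raw FCS file lives in object storage, linked by its path.
flow_record = {
    "experiment_id": "FLOW-0117",
    "sample_id": "S-002",
    "stain_panel": "CD3/CD8",           # metadata
    "median_fluorescence": 1843.5,      # summarized, human-interpretable value
    "fcs_path": "s3://example-data-lake/flow/FLOW-0117_S-002.fcs",  # raw data
}

# An analysis pipeline fetches the raw file using the stored path:
import boto3

s3 = boto3.client("s3")
s3.download_file("example-data-lake", "flow/FLOW-0117_S-002.fcs",
                 "/tmp/FLOW-0117_S-002.fcs")
```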

Data warehouses

Your LIMS is essential for tracking the entities of your lab, but this transactional paradigm for databases is not always the best fit for analytics use cases such as data visualization and machine learning. The crux of the problem is that the best data model for transactions is highly “normalized.” Normalized databases have complex schemas with many tables that need to be joined in order to connect data from related entities.

In contrast, analysis and ML are best suited for a small number of denormalized tables with simple schemas. Duplication and database anomalies are not as much of a problem, but speed of data aggregation operations is crucial. For this reason you will probably want a separate data warehouse, with columnar storage to speed up common analytics operations. This is a separate schema, often a separate database, which is loaded periodically by automated ETL (extract, transform, load) pipelines.

In the tech world, the utility of a data warehouse is to aggregate data from many databases across the company. But even in the digital biotech with a single, centralized database, the separation of concerns provided by the data warehouse has advantages. It allows you to design the data model for analytics, rather than transactional, use cases.

In designing your data warehouse schema, think about the simplest scatter or barplots that non-programmers across the company might want to see. Then make sure your schema has tables with the axes of those plots as the columns. Eventually you may want a more flexible, slightly normalized data model (e.g. “star schema”), with a query builder to help non-technical users perform joins.
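
Here is a toy end-to-end version of that transform step, using SQLite for both sides with invented schemas: two normalized LIMS tables are flattened into one warehouse table whose columns are exactly the axes of the plot.

```python
# Tiny ETL sketch: denormalize LIMS tables into a flat, plot-ready table.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE samples (sample_id TEXT, cell_line TEXT, passage_number INTEGER);
    CREATE TABLE assay_results (sample_id TEXT, protein_expression REAL);
    INSERT INTO samples VALUES ('S-001', 'HEK293', 5), ('S-002', 'HEK293', 12);
    INSERT INTO assay_results VALUES ('S-001', 8.1), ('S-002', 6.4);

    -- "Transform": one flat table, ready for a scatter plot or ML features
    CREATE TABLE warehouse_expression AS
    SELECT s.sample_id, s.cell_line, s.passage_number, r.protein_expression
    FROM samples s JOIN assay_results r ON s.sample_id = r.sample_id;
""")
for row in db.execute("SELECT * FROM warehouse_expression"):
    print(row)
```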

Org structure

Too often, the organizational structure emerges as some accident of history — who was hired when, and with what title — or by following the traditional “functional matrix” design of big pharma. While any structure is better than no structure (there are cautionary tales of completely “flat” org charts in tech), the org chart should be intentionally designed to match your culture, business model, and eventual product.

One of Amazon's strengths is a strong set of principles around org design. One important concept is that of the "single-threaded leader" (STL), meaning that for any high-priority project there is a person whose sole responsibility is the success of that project. In Working Backwards, the authors quote Amazon Senior VP Dave Limp: "The best way to fail at inventing something is by making it somebody's part-time job." While Amazon is famously known for its "two-pizza teams", meaning that no group is larger than can be fed by two pizzas (typically no more than 10 people), in reality the STL is the more dominant concept and often results in small, nimble teams.

The ideal org structure for a digital biotech has probably not been invented, and a universal one probably doesn't exist: the best structure depends entirely on the specifics of your company and its current state. Therefore, the optimal org design is one that evolves constantly to match your needs. Unfortunately, org design tends to be static, so you should instill culture and mechanisms to combat this inertia and promote periodic reorganization. In the early days it might make sense to shuffle groups, team leaders, and goals as often as quarterly, and even a mature company should probably undergo a reorg at least every few years to stay competitive.

Hiring

It can be tricky to evaluate and hire technical people from a different discipline, especially for a biologist trying to build a software team. For that reason, biotech startups often hire a senior leader first for their digital teams, with the expectation that this person will build out the team.

This strategy can backfire, however, if this first hire is too senior to build software. A leader with only people management skills will be quick to hire contractors and third-party vendors to meet short-term needs during the months-long team building process. This can lead to lock-in or over-dependence on external systems and disincentivize the “build” option for build/buy decisions. The best first hire is an experienced engineer with strong coding skills as well as some leadership and team building capability.

If you can hire two or more experienced engineers with a history of building together, you're likely to get great results on rapid timescales. At Generate, our initial Informatics team was a duo of a hands-on leader and a senior engineer with years of experience working together at another digital biotech.

If you can’t find the rare software engineer with biotech experience, you can do well hiring from the tech industry and teaching them the ropes. The high salary demands will stretch the budget, but a bold investment will give you a strong early start on the platform. Hiring early for software will give engineers time to learn the business so they can focus on the right problems. You can get early wins to justify this investment, as engineers can write a simple script to save biologists hours of data munging. Encourage the team to build one-off solutions and set up useful infrastructure while allowing time to build more sophisticated systems.

In contrast, it is not necessarily a good idea to hire machine learning talent early. If you don’t know in very precise terms what ML/AI you need, or it’s not a concrete part of your value proposition, then wait to hire your first ML scientist. Instead, find generalists in software and bioinformatics who know enough applied ML to get early wins with simple models that justify custom model development. Only after you have enough data, and know the subspecialty of ML you need, is it time to invest in your ML team.

The reason for this logic is that great ML scientists are very expensive but are not going to provide a return on the investment if there is not a clear problem and sufficient data to solve it. This is not to discourage hiring an ML team, but to encourage leaders to understand and specify the problem ML is solving for their biotech, and what types of modeling approaches and architectures are appropriate for their data. Too often AI is held up as a magic wand that can solve arbitrary problems, but in reality the data infrastructure is more important and is, in any case, a prerequisite for ML.

Lab automation

The typical attitude toward lab automation is that it is reserved for mature companies with high-volume assays and bio-production processes. Recent trends in robotic automation are upending this view, however, due to cheaper hardware (Opentrons), more accessible software (Artificial), and better abstractions and integrations with modern programming languages (PyLabRobot).

For a small lab, it is now affordable to buy a benchtop robot and begin incorporating automation in day-to-day operations. Plenty of experimental biologists now have basic scripting skills and can program an OT-2 to run their assays, but this skill is not a given and needs to be screened for during interviews.
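
For a flavor of what that looks like, here is a minimal protocol sketch against the Opentrons Python Protocol API. The labware and instrument load names are standard Opentrons identifiers, but treat the details as illustrative rather than a validated protocol.

```python
# Hedged sketch of an OT-2 protocol using the Opentrons Python Protocol API.
from opentrons import protocol_api

metadata = {"apiLevel": "2.13"}

def run(protocol: protocol_api.ProtocolContext):
    plate = protocol.load_labware("corning_96_wellplate_360ul_flat", "1")
    tips = protocol.load_labware("opentrons_96_tiprack_300ul", "2")
    p300 = protocol.load_instrument("p300_single_gen2", "right", tip_racks=[tips])

    # Distribute 100 uL from well A1 into every well of column 2
    p300.transfer(100, plate["A1"], plate.columns_by_name()["2"])
```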

Establishing a culture of automation early on will not only deliver gains in precision, reproducibility, and productivity, but also prepare you well for when it is time to scale your robotic operations. Invest in automation engineers early, and require experimentalists to automate their protocols, in order to accelerate your data generation engine.

Scaling the platform

If a startup is successful, it will likely need to scale fast to capitalize on the opportunity. This is a massive challenge: even the most forward-thinking startups struggle to maintain their culture and keep up with technology as they dilute their tech-heavy founding teams with domain-specific talent from traditional biopharma.

Growing pains are inevitable, but can be overcome if they are seen as problems to be solved. Search for these problems and surface them before they become ingrained, and then empower those at the front lines to solve them. Otherwise, you risk a toxic culture of employees giving up, blaming leadership, and letting issues become someone else’s problem. The most important thing leaders can do is consistently and clearly communicate the vision to energize the true believers, deflect toxic negativity, and sidestep politics. Remind the team that scaling is a very good problem to have.

Great digital infrastructure can ease these problems. If you've built software to empower your employees with permissionless access to data, ML models, software services, and requests for physical workflows, the scaling phase is where the investment pays off in productivity gains. While new managers need time to establish and navigate personal relationships, digital tools enable your people to get their jobs done without depending on human communication channels.

Onboarding

Onboarding is crucial for maintaining culture. Make sure you have laid out the simple rules and processes that make your culture special. Have a full program of videos, reading material, process flowcharts, overviews of major capabilities, and anything else you want every single employee to know. If you've organized your company as cross-functional product teams, clearly lay out the goals of each group, its product or menu of services, its leader, and how an employee might interact with the group.

A digital biotech needs to pay special attention to cross-disciplinary training during onboarding. Biologists need to be walked carefully through the digital toolbox, since the entire mindset is likely very new. Consider setting every employee up with a coding environment and the ability to programmatically access the company's digital services. A few training sessions may be needed to prevent a lab data culture dominated by Excel and Prism, and to encourage one of database queries, Python scripts, and Jupyter/R notebooks. Keep in mind that many biologists joined your company because of the promise of ML and computation. A small investment in training and onboarding to get them using digital tools will pay off in the form of an enabled, energized lab.

Likewise, you’ll want to walk your computational teams through all of the relevant biology. This will prevent them from solving the wrong problems or making simplistic assumptions about the data. Encourage computational people to shadow the wet lab and deeply understand each assay and protocol. Foster applications of software and machine learning in experimental data processing, starting with a cross-disciplinary onboarding process.

The onboarding program needs to be in place well before the company begins scaling, preferably starting with employee #1. Very soon in the company's lifetime, managing it will need to be someone's full-time job.

Conclusions

A digital biotech is a new and different breed of organization, blending the best of tech and biotech industries. The advantages of a software-heavy, ML-powered biology platform will be so strong that these companies will likely dominate or become major players in the biopharma industry.

The unique multidisciplinary nature of these companies requires a new approach to starting and scaling, which I’ve tried to outline here. Many of the ideas were drawn from modern tech lore or business cases, and could be applied generically to any tech startup. However, there is enough of a disconnect between biotech and tech cultures that these lessons are worth repeating here for a cross-disciplinary audience.

Most importantly, there is no actual playbook, just anecdotes, lessons and mental models to help make decisions. Just as with science, keep an open mind, experiment often, and share what you’ve learned with the world. Good luck!