TL;DR
Structure Dockerfiles from least to most frequently changing, copy dependencies before code, and use multi-stage builds.
Docker has truly changed how we package, distribute, and deploy applications. At its core is Docker's build process, in which an image is created from a Dockerfile. In this post, we'll explore several strategies to optimize Docker builds, making them faster and the resulting images more space-efficient.
Before diving into optimization strategies, let's understand the most essential instructions that make up a Dockerfile:
- FROM: Specifies the base image to use as the starting point for the build. This is typically the first instruction in a Dockerfile.
- RUN: Executes a command inside the image during the build process. Each RUN instruction creates a new layer in the image.
- COPY: Copies files or directories from the host machine to the image. It's the preferred instruction for copying files.
- CMD: Specifies the default command to run when a container is started from the image. There can only be one CMD instruction in a Dockerfile.
These four instructions form the core of most Dockerfiles. They allow you to select a base image, execute commands to install dependencies and configure the image, copy necessary files, and define the default command to run when a container is started.
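For reference, here's a minimal Dockerfile using all four instructions (the file names and the package are illustrative):

```dockerfile
# Select the base image
FROM node:14
WORKDIR /app
# Copy application files from the host into the image
COPY server.js .
# Execute a command at build time; each RUN creates a new layer
RUN npm install express
# Default command when a container starts from this image
CMD ["node", "server.js"]
```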
Understanding the Docker build cache
Docker uses a build cache to speed up the build process. Each instruction in your Dockerfile produces a layer, and if an instruction and its inputs haven't changed, Docker reuses the cached layer in subsequent builds. If you modify an instruction, Docker still uses the cache for all the layers before it but rebuilds every layer after it. This happens because each layer depends on the layers before it: modifying an instruction changes the state of the image, and subsequent layers must be rebuilt to ensure consistency and include the changes.
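As a sketch of this behavior, suppose only the application source changed since the last build. Everything above the first changed input is served from cache; everything after it is rebuilt:

```dockerfile
FROM node:14
WORKDIR /app
# Unchanged manifest: this COPY and the npm install below stay cached
COPY package.json .
RUN npm install
# Changed source files: the cache misses here...
COPY src/ ./src/
# ...so this layer and everything after it is rebuilt
RUN npm run build
```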
How to structure your Dockerfile
The order of instructions in your Dockerfile matters: you should structure it to maximize cache utilization. Here are a few ways to do that:
1. Order from least frequently changing to most
Place instructions less likely to change at the top of your Dockerfile. For example, if your application dependencies don't change frequently, install them early in the Dockerfile. This way, Docker can reuse the cached layers for these instructions in subsequent builds.
2. Increase cache hits by separating dependencies
Let's consider the example sketched below. This setup bundles the application code and dependencies together, leading to inefficient caching: the `COPY . .` command copies the entire project, including `node_modules` if it exists locally. As a result, even if `package.json` and `package-lock.json` remain unchanged, any modification to the application code forces Docker to re-run `npm install`. If your application code changes frequently but your dependencies don't, this approach rebuilds the dependency layer on every change.
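A minimal sketch of such a Dockerfile (exact contents will vary by project):

```dockerfile
FROM node:14
WORKDIR /app
# Copies everything (source code, and node_modules if present)
# before dependencies are installed
COPY . .
RUN npm install
RUN npm run build
```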
Here's a more optimized version:

```dockerfile
FROM node:14
WORKDIR /app
# Copy only the dependency manifests first so the install layer can be cached
COPY package.json package-lock.json /app/
RUN npm install
# Copy the rest of the project after dependencies are installed
COPY . .
RUN npm run build
```

- We copy the `package.json` and `package-lock.json` files first, before copying the entire project directory. This allows Docker to cache the `RUN npm install` layer separately from the application code.
- If the `package.json` and `package-lock.json` files haven't changed, Docker can reuse the cached layer for `RUN npm install`, even if the application code has changed.
- The `COPY . .` instruction is placed after `RUN npm install`, so changes to the application code won't invalidate the cache for the dependency installation step.
3. Group related commands
Group related commands together in a single RUN instruction. Each RUN instruction creates a new layer, so combining related commands reduces the number of layers and improves build efficiency; more layers increase image size and mean more layers that have to hit the cache on every rebuild.
Consider this example, where each command runs in its own RUN instruction:
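A sketch of what that might look like (the package names are placeholders):

```dockerfile
# Four RUN instructions create four separate layers
RUN apt-get update
RUN apt-get install -y package1
RUN apt-get install -y package2
RUN apt-get clean
```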
Instead of creating four layers, you can combine these related commands into a single RUN instruction:

```dockerfile
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    apt-get clean
```

4. Use multi-stage builds to separate build and runtime environments
Multi-stage builds allow you to create smaller and more efficient Docker images by separating the build and runtime environments. This is particularly useful when you have build dependencies that are not needed at runtime.
Before multi-stage builds, you might have a Dockerfile like this:
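A sketch of such a single-stage Dockerfile (details are illustrative):

```dockerfile
FROM node:14
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# The full node:14 image, source files, and devDependencies
# all remain in the final image
CMD ["npm", "start"]
```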
In this example, the final image includes the entire Node.js build environment, which is unnecessary for running the application.
With multi-stage builds, you can split the Dockerfile into multiple stages:

```dockerfile
# Build stage
FROM node:14 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage
FROM node:14-alpine
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY package*.json ./
RUN npm ci --only=production
CMD ["npm", "start"]
```

In this optimized version:

- The `build` stage uses the full Node.js image to install dependencies and build the application.
- The runtime stage uses a minimal Node.js image (`node:14-alpine`) and copies only the built files and production dependencies from the `build` stage.
The resulting image is significantly smaller because it doesn't include the build tools and intermediate artifacts.
5. Use multi-stage builds to parallelize build stages
BuildKit, Docker's default build backend, can execute the independent stages of a multi-stage build in parallel.
Consider a Dockerfile that builds a Go application with multiple components:
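A sketch of such a Dockerfile, assuming the two components live under cmd/api and cmd/worker (those paths, and the Go version, are illustrative):

```dockerfile
# Common base stage: download module dependencies once
FROM golang:1.21 AS base
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download

# Independent build stages; BuildKit can run these in parallel
FROM base AS build-api
COPY . .
RUN go build -o /api ./cmd/api

FROM base AS build-worker
COPY . .
RUN go build -o /worker ./cmd/worker

# Final stage: minimal image containing only the compiled binaries
FROM gcr.io/distroless/base-debian11
COPY --from=build-api /api /usr/local/bin/api
COPY --from=build-worker /worker /usr/local/bin/worker
CMD ["/usr/local/bin/api"]
```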
In this example:

- The `base` stage sets up the common dependencies by copying the `go.mod` and `go.sum` files and downloading the required packages.
- The `build-api` and `build-worker` stages can be executed in parallel, as they depend only on the `base` stage. Each stage builds a specific component of the application.
- The final stage uses a minimal base image (`gcr.io/distroless/base-debian11`) and copies the compiled binaries from the `build-api` and `build-worker` stages.
With BuildKit, Docker will automatically detect the independent stages and execute them in parallel, speeding up the overall build process.
Minimizing the Docker image size
Smaller Docker images are faster to build, push, and pull. They also consume less disk space and memory. Here are some strategies to minimize your image size:
- Use the right base image: Choose a minimal base image that includes only the necessary dependencies for your application. For example, if you're running a Go application, consider using the official `golang:alpine` image instead of the full-fledged `golang` image. Alpine Linux images have the additional benefit of generally containing fewer security vulnerabilities due to their reduced attack surface: Alpine includes only essential packages by default, so there are fewer pre-installed components and less surface area for potential vulnerabilities.
- Combine RUN commands: As mentioned before, each RUN instruction in your Dockerfile creates a new layer, and more layers result in more overhead and a larger image size. Combine multiple RUN instructions into a single instruction to reduce the number of layers.
- Clean up after package installations: When installing packages in your Dockerfile, it's useful to clean up the package cache and temporary files in the same `RUN` instruction. This practice helps reduce the size of your final Docker image by removing unnecessary files that are not needed at runtime. Consider the following example:
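A minimal sketch of such a Dockerfile, assuming a Debian/Ubuntu base image (the tag is illustrative):

```dockerfile
FROM ubuntu:22.04
# Install curl and remove the cached package lists in the same layer
RUN apt-get update && \
    apt-get install -y curl && \
    rm -rf /var/lib/apt/lists/*
```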
In this Dockerfile, we update the package lists using `apt-get update` and install the `curl` package using `apt-get install`. After the installation, we remove the package lists stored in `/var/lib/apt/lists/` using `rm -rf /var/lib/apt/lists/*`. This cleanup step removes the cached package lists, which are not required once the packages are installed. Because the cleanup happens in the same layer, the cache and temporary files never make it into the final image, which can significantly reduce the image size, especially when installing multiple packages or dealing with large package repositories.
The multi-stage build approach mentioned earlier sidesteps this problem entirely by copying into the final image only what's required at runtime, rather than everything that was needed at build time.
Use the right file-copying strategy
When copying files into your Docker image, use the appropriate instructions:

- COPY vs. ADD: Use `COPY` for most cases, as it's straightforward and only copies files from the host to the image. `ADD` can perform extra tasks, such as extracting a local tar file directly into the container's filesystem or reading from remote URLs, which can result in unwanted artifacts ending up in your Docker image.
- Use .dockerignore: Create a `.dockerignore` file in your build context to exclude unnecessary files and directories from being copied into the image. This reduces the build context size, speeds up the build process, and keeps your images lean. Some common files to ignore (a sample `.dockerignore` follows this list) are:
  - Version control directories: Exclude directories such as `.git`, `.svn`, or `.hg` that contain version control metadata.
  - Build artifacts: Exclude files and directories generated by build processes, such as `node_modules`, `dist`, `build`, or `target` directories.
  - Temporary files: Exclude temporary files and directories like `.tmp`, `.log`, or `.swp` files.
  - Configuration files: Exclude files that are not needed in the production environment, such as `.env` files containing development configurations or secrets.
  - Documentation: Exclude documentation files like `README.md`, unless they are needed in the image.
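Putting those together, a `.dockerignore` for a typical Node.js project might look like this (entries are illustrative; tailor them to your project):

```
.git
node_modules
dist
build
*.log
*.swp
.tmp
.env
README.md
```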
To summarize: the order of instructions in your Dockerfile can drastically improve your cache efficiency, and your Docker image only needs to contain runtime dependencies, not build dependencies. Smaller images are also faster to pull and push, which contributes to developer productivity.