TL;DR
Structure Dockerfiles from least to most frequently changing, copy dependencies before code, and use multi-stage builds.
Docker has truly changed how we package, distribute, and deploy applications. At its core is Docker's build process, in which an image is created from a Dockerfile. In this post, we'll explore several strategies to optimize Docker builds, making them faster and the resulting images more space-efficient.
Before diving into optimization strategies, let's understand the most essential instructions that make up a Dockerfile:
- FROM: Specifies the base image to use as the starting point for the build. This is typically the first instruction in a Dockerfile.
- RUN: Executes a command inside the image during the build process. Each RUN instruction creates a new layer in the image.
- COPY: Copies files or directories from the host machine to the image. It's the preferred instruction for copying files.
- CMD: Specifies the default command to run when a container is started from the image. There can only be one CMD instruction in a Dockerfile.
These four instructions form the core of most Dockerfiles. They allow you to select a base image, execute commands to install dependencies and configure the image, copy necessary files, and define the default command to run when a container is started.
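For reference, here's a minimal Dockerfile using all four instructions (the file names and the package are illustrative):

```dockerfile
# Select the base image
FROM node:14
WORKDIR /app
# Copy application files from the host into the image
COPY server.js .
# Execute a command at build time; each RUN creates a new layer
RUN npm install express
# Default command when a container starts from this image
CMD ["node", "server.js"]
```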
Understanding the Docker build cache
Docker uses a build cache to speed up the build process. Each instruction in your Dockerfile produces a layer, and if an instruction and its inputs haven't changed, Docker reuses the cached layer in subsequent builds. If you modify an instruction, Docker still uses the cache for all the layers before it but rebuilds every layer after it. This happens because each layer depends on the layers before it: modifying an instruction changes the state of the image, and subsequent layers must be rebuilt to ensure consistency and include the changes.
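As a sketch of this behavior, suppose only the application source changed since the last build. Everything above the first changed input is served from cache; everything after it is rebuilt:

```dockerfile
FROM node:14
WORKDIR /app
# Unchanged manifest: this COPY and the npm install below stay cached
COPY package.json .
RUN npm install
# Changed source files: the cache misses here...
COPY src/ ./src/
# ...so this layer and everything after it is rebuilt
RUN npm run build
```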
How to structure your Dockerfile
The order of instructions in your Dockerfile matters: you should structure it to maximize cache utilization. Here are a few ways to do that:
1. Order from least frequently changing to most
Place instructions less likely to change at the top of your Dockerfile. For example, if your application dependencies don't change frequently, install them early in the Dockerfile. This way, Docker can reuse the cached layers for these instructions in subsequent builds.
2. Increase cache hits by separating dependencies
Let's consider the example sketched below. This setup bundles the application code and dependencies together, leading to inefficient caching: the `COPY . .` command copies the entire project, including `node_modules` if it exists locally. As a result, even if `package.json` and `package-lock.json` remain unchanged, any modification to the application code forces Docker to re-run `npm install`. If your application code changes frequently but your dependencies don't, this approach rebuilds the dependency layer on every change.
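A minimal sketch of such a Dockerfile (exact contents will vary by project):

```dockerfile
FROM node:14
WORKDIR /app
# Copies everything (source code, and node_modules if present)
# before dependencies are installed
COPY . .
RUN npm install
RUN npm run build
```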
Here's a more optimized version:

```dockerfile
FROM node:14
WORKDIR /app
# Copy only the dependency manifests first so the install layer can be cached
COPY package.json package-lock.json /app/
RUN npm install
# Copy the rest of the project after dependencies are installed
COPY . .
RUN npm run build
```

- We copy the `package.json` and `package-lock.json` files first, before copying the entire project directory. This allows Docker to cache the `RUN npm install` layer separately from the application code.
- If the `package.json` and `package-lock.json` files haven't changed, Docker can reuse the cached layer for `RUN npm install`, even if the application code has changed.
- The `COPY . .` instruction is placed after `RUN npm install`, so changes to the application code won't invalidate the cache for the dependency installation step.
3. Group related commands
Group related commands together in a single RUN instruction. Each RUN instruction creates a new layer, so combining related commands reduces the number of layers and improves build efficiency; more layers increase image size and mean more layers that have to hit the cache on every rebuild.
Consider this example, where each command runs in its own RUN instruction:
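A sketch of what that might look like (the package names are placeholders):

```dockerfile
# Four RUN instructions create four separate layers
RUN apt-get update
RUN apt-get install -y package1
RUN apt-get install -y package2
RUN apt-get clean
```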
Instead of creating four layers, you can combine these related commands into a single RUN instruction:

```dockerfile
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    apt-get clean
```

4. Use multi-stage builds to separate build and runtime environments
Multi-stage builds allow you to create smaller and more efficient Docker images by separating the build and runtime environments. This is particularly useful when you have build dependencies that are not needed at runtime.
Before multi-stage builds, you might have a Dockerfile like this:
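A sketch of such a single-stage Dockerfile (details are illustrative):

```dockerfile
FROM node:14
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# The full node:14 image, source files, and devDependencies
# all remain in the final image
CMD ["npm", "start"]
```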
In this example, the final image includes the entire Node.js build environment, which is unnecessary for running the application.
With multi-stage builds, you can split the Dockerfile into multiple stages:

```dockerfile
# Build stage
FROM node:14 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage
FROM node:14-alpine
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY package*.json ./
RUN npm ci --only=production
CMD ["npm", "start"]
```

In this optimized version:

- The `build` stage uses the full Node.js image to install dependencies and build the application.
- The runtime stage uses a minimal Node.js image (`node:14-alpine`) and copies only the built files and production dependencies from the `build` stage.
The resulting image is significantly smaller because it doesn't include the build tools and intermediate artifacts.
5. Use multi-stage builds to parallelize build stages
BuildKit, Docker's default build backend, can execute the independent stages of a multi-stage build in parallel.
Consider a Dockerfile that builds a Go application with multiple components:
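A sketch of such a Dockerfile, assuming the two components live under cmd/api and cmd/worker (those paths, and the Go version, are illustrative):

```dockerfile
# Common base stage: download module dependencies once
FROM golang:1.21 AS base
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download

# Independent build stages; BuildKit can run these in parallel
FROM base AS build-api
COPY . .
RUN go build -o /api ./cmd/api

FROM base AS build-worker
COPY . .
RUN go build -o /worker ./cmd/worker

# Final stage: minimal image containing only the compiled binaries
FROM gcr.io/distroless/base-debian11
COPY --from=build-api /api /usr/local/bin/api
COPY --from=build-worker /worker /usr/local/bin/worker
CMD ["/usr/local/bin/api"]
```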
In this example:

- The `base` stage sets up the common dependencies by copying the `go.mod` and `go.sum` files and downloading the required packages.
- The `build-api` and `build-worker` stages can be executed in parallel, as they depend only on the `base` stage. Each stage builds a specific component of the application.
- The final stage uses a minimal base image (`gcr.io/distroless/base-debian11`) and copies the compiled binaries from the `build-api` and `build-worker` stages.
With BuildKit, Docker will automatically detect the independent stages and execute them in parallel, speeding up the overall build process.
Minimizing the Docker image size
Smaller Docker images are faster to build, push, and pull. They also consume less disk space and memory. Here are some strategies to minimize your image size:
- Use the right base image: Choose a minimal base image that includes only the necessary dependencies for your application. For example, if you're running a Go application, consider using the official `golang:alpine` image instead of the full-fledged `golang` image. Alpine Linux images have the additional benefit of generally containing fewer security vulnerabilities due to their reduced attack surface: Alpine includes only essential packages by default, so there are fewer pre-installed components and less surface area for potential vulnerabilities.
- Combine RUN commands: As mentioned before, each RUN instruction in your Dockerfile creates a new layer, and more layers result in more overhead and a larger image size. Combine multiple RUN instructions into a single instruction to reduce the number of layers.
- Clean up after package installations: When installing packages in your Dockerfile, it's useful to clean up the package cache and temporary files in the same `RUN` instruction. This practice helps reduce the size of your final Docker image by removing unnecessary files that are not needed at runtime. Consider the following example:
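A minimal sketch of such a Dockerfile, assuming a Debian/Ubuntu base image (the tag is illustrative):

```dockerfile
FROM ubuntu:22.04
# Install curl and remove the cached package lists in the same layer
RUN apt-get update && \
    apt-get install -y curl && \
    rm -rf /var/lib/apt/lists/*
```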
In this Dockerfile, we update the package lists using `apt-get update` and install the `curl` package using `apt-get install`. After the installation, we remove the package lists stored in `/var/lib/apt/lists/` using `rm -rf /var/lib/apt/lists/*`. This cleanup step removes the cached package lists, which are not required once the packages are installed. Because the cleanup happens in the same layer, the cache and temporary files never make it into the final image, which can significantly reduce the image size, especially when installing multiple packages or dealing with large package repositories.
The multi-stage build approach mentioned earlier sidesteps this problem entirely by copying into the final image only what's required at runtime, rather than everything that was needed at build time.
Use the right file-copying strategy
When copying files into your Docker image, use the appropriate instructions:

- COPY vs. ADD: Use `COPY` for most cases, as it's straightforward and only copies files from the host to the image. `ADD` can perform extra tasks, such as extracting a local tar file directly into the container's filesystem or reading from remote URLs, which can result in unwanted artifacts ending up in your Docker image.
- Use .dockerignore: Create a `.dockerignore` file in your build context to exclude unnecessary files and directories from being copied into the image. This reduces the build context size, speeds up the build process, and keeps your images lean. Some common files to ignore (a sample `.dockerignore` follows this list) are:
  - Version control directories: Exclude directories such as `.git`, `.svn`, or `.hg` that contain version control metadata.
  - Build artifacts: Exclude files and directories generated by build processes, such as `node_modules`, `dist`, `build`, or `target` directories.
  - Temporary files: Exclude temporary files and directories like `.tmp`, `.log`, or `.swp` files.
  - Configuration files: Exclude files that are not needed in the production environment, such as `.env` files containing development configurations or secrets.
  - Documentation: Exclude documentation files like `README.md`, unless they are needed in the image.
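Putting those together, a `.dockerignore` for a typical Node.js project might look like this (entries are illustrative; tailor them to your project):

```
.git
node_modules
dist
build
*.log
*.swp
.tmp
.env
README.md
```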
To summarize: the order of instructions in your Dockerfile can drastically improve your cache efficiency, and your Docker image only needs to contain runtime dependencies, not build dependencies. Smaller images are also faster to pull and push, which contributes to developer productivity.