Terraform for VMware: Key Takeaways and Lessons from Managing Hundreds of VMs
Date: 2025-02-16
Managing a large number of virtual machines can be challenging. As our infrastructure grew, we needed a better way to handle the complexity. That’s when we turned to Terraform to help us manage our VMware environment more efficiently. I’m going to cover the use cases for IaC and share:
- The technology stack used and the evolution of our infrastructure management practices.
- Lessons learned from using Terraform to manage hundreds of VMs in a VMware landscape.
The goal is to provide practical insights and tips for anyone looking to improve their VM management with Terraform for VMware.
IaC
IaC: use cases
I’m going to cover some very basic scenarios of the VM lifecycle:
- Create VM templates
- Create VMs
- Configure VMs
IaC tech stack

There are many technologies available for different levels of infrastructure management. As DevOps engineers, we had to make specific choices. In our case:
- Create VM templates: Ansible + Packer
- Create VMs: Terraform
- Configure VMs: Ansible
Let us take a look at the specific parts and then at the whole solution.
Packer. How does it work?

- There is a git repository containing:
  - Packer configuration
  - Anaconda configuration
  - Ansible playbooks and collections
- There is a Jenkins job that runs the Packer build weekly.
Behind the `packer build` command (see the sketch below):
- Create an empty VM:
  - Mount the ISO
  - Share the kickstart file
  - Type the boot command that points to the kickstart file
- Anaconda (the Rocky Linux installer) starts:
  - The disk layout is configured
  - A user for Ansible is created
  - The network is configured
  - SSH is configured
- The VM boots and becomes reachable
- Run the Ansible playbook
- Export the VM template to a content library
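For illustration, a minimal sketch of such a Packer template using the `vsphere-iso` builder; all names, paths, and sizes here are placeholders, not our real configuration:

```
source "vsphere-iso" "rocky" {
  vcenter_server = "vcenter01.example.local"
  username       = var.vcenter_username
  password       = var.vcenter_password
  cluster        = "My Cluster 01"
  datastore      = "datastore-88"

  vm_name              = "packer-rocky8"
  guest_os_type        = "other4xLinux64Guest"
  CPUs                 = 2
  RAM                  = 4096
  disk_controller_type = ["pvscsi"]
  storage {
    disk_size = 20480
  }
  network_adapters {
    network = "VM Network"
  }

  iso_paths      = ["[datastore-88] iso/Rocky-8-x86_64-minimal.iso"]
  http_directory = "kickstart" # serves the kickstart file over HTTP

  # Boot Anaconda and point it at the kickstart file.
  boot_command = [
    "<up><tab> inst.text inst.ks=http://{{ .HTTPIP }}:{{ .HTTPPort }}/ks.cfg<enter>"
  ]

  # The user created by the kickstart file for Ansible.
  ssh_username = "ansible"
  ssh_password = var.ansible_password

  # Export the finished template to the content library.
  content_library_destination {
    library = "packer-templates"
  }
}

build {
  sources = ["source.vsphere-iso.rocky"]

  # Apply the base configuration.
  provisioner "ansible" {
    playbook_file = "playbooks/base.yml"
  }
}
```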
Terraform. How does it work?
Terraform is a stateful application with some important concepts (a minimal provider sketch follows the list):
- Provider API connection: Terraform connects to provider APIs and manages state on top of them.
- State management: Terraform tracks resources and ensures:
  - Idempotence: only resources that have changed are updated.
  - Dependency management: dependencies are handled in the correct order.
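In our case that provider is vSphere; a minimal sketch of the wiring (the server name and variables are illustrative):

```
terraform {
  required_providers {
    vsphere = {
      source = "hashicorp/vsphere"
    }
  }
}

provider "vsphere" {
  vsphere_server       = "vcenter01.example.local"
  user                 = var.vsphere_user
  password             = var.vsphere_password
  allow_unverified_ssl = true # e.g. a vCenter with an internal CA
}
```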
Ansible. How does it work?

Ansible automates the configuration of VMs through the following steps (a minimal playbook sketch follows the list):
- Inventory: identifies the target hosts from the inventory file.
- Playbook: reads and parses the playbook to determine the tasks to be executed.
- Generate: generates the necessary Python code based on the playbook.
- Copy: securely copies the generated code to the target hosts.
- Execute: executes the copied Python code on the target hosts to apply the configuration.
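For illustration, a minimal playbook of the kind that runs on top of this flow (the task is a placeholder, not our real base configuration):

```
# site.yml - illustrative playbook
- name: Apply base configuration
  hosts: all
  become: true
  tasks:
    - name: Make sure chrony is installed
      ansible.builtin.package:
        name: chrony
        state: present
```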
Glue it together

Everything is stored in git. There are some linked processes:
- Packer: Create a VM template via Packer, apply base configuration via Ansible, and save it to the VMware content library.
- Terraform: Create a new VM from the VM template in the content library and assign tags.
- Ansible: Split VMs into groups according to the VMware tags and customize the VMs.
History
The beginning of the story is described in “Ansible: CoreOS to CentOS, 18 months long journey”. In short, it describes the migration from a custom configuration management solution to Ansible + Packer for managing the environment. During that journey, I created the vmware_content_deploy_ovf_template Ansible module in the VMware collection.
But we had the problem of having multiple sources of truth about VM configurations: VMware and the git repository. The goal was to have only one source of truth to make IaC consistent.
Terraform for VMware
I’m going to share some chronological lessons learned from implementing Terraform.
Know-how: Race conditions during terraform apply
Steps to reproduce:
- Person A changes something & runs terraform apply
- Person B changes something & runs terraform apply
Result: the infrastructure ends up in an inconsistent state.
How to avoid: use remote state (see the sketch below).
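A remote backend stores the state centrally and locks it during apply, so two people cannot run it at the same time. A sketch with an S3-compatible backend (the bucket and table names are illustrative; any remote backend with locking works):

```
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "vmware/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # provides the state locking
  }
}
```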
Problems: Using the wrong hostname for customization
Steps to reproduce:
- You create a VM via Ctrl+C & Ctrl+V of an existing resource
- You change the resource name, but keep the original hostname in the customization block
- VMware creates the new VM and assigns the duplicated hostname to it

Result: the original VM is not reachable.
How to avoid: double-check the changes (see the snippet below).
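The field to double-check is `host_name` inside the customization block; a sketch (the resource names are illustrative, `packer_rocky8` is the template data source used elsewhere in this post):

```
resource "vsphere_virtual_machine" "example-copy" {
  name = "example-copy" # changed after copy-paste...
  # (other required arguments omitted)

  clone {
    template_uuid = data.vsphere_content_library_item.packer_rocky8.id
    customize {
      linux_options {
        host_name = "example-copy" # ...but this one is easy to forget
        domain    = var.domain
      }
      network_interface {}
    }
  }
}
```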
Problems: Lost a VM while increasing RAM
Steps to reproduce:
- You create a VM; the creation process fails, but the VM is actually created on VMware
- The VM is marked as tainted in the terraform state
- You change the RAM
- Terraform deletes the tainted VM & creates a new one instead
```
# module.InfraServices.vsphere_virtual_machine.example-com is tainted, so must be replaced
-/+ resource "vsphere_virtual_machine" "example-com" {
      - boot_delay         = 0 -> null
      - boot_retry_enabled = false -> null
      ~ change_version     = "2023-07-28T11:41:47.016148Z" -> (known after apply)
```
Result: the data is lost (fortunately, nobody was using the VM yet).
How to avoid: don’t keep VMs in a bad state; fix them or recreate them right away.
Know-how: Terraform & VMware use different datastores
Steps to reproduce:
- You create a VM and set `datastore_id`
- You configure multiple datastores and a policy to move VMs across them (Storage DRS)
- VMware moves the VM to another datastore
- You run terraform & it wants to move the VM back
```
  ~ resource "vsphere_virtual_machine" "worker23" {
      ~ datastore_id = "datastore-2032" -> "datastore-88"
        id           = "423845e7-xxxx-xxxx-xxxx-77240ff97399"
        name         = "worker23"
```
Result: unneeded changes on the VMware side.
How to avoid: use `datastore_cluster_id` (see the sketch below).
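With `datastore_cluster_id`, placement is delegated to the datastore cluster, so Storage DRS moves no longer show up as drift. A sketch (the cluster and data source names are illustrative):

```
data "vsphere_datastore_cluster" "cluster01" {
  name          = "Storage Cluster 01"
  datacenter_id = data.vsphere_datacenter.dc.id
}

resource "vsphere_virtual_machine" "worker23" {
  name                 = "worker23"
  datastore_cluster_id = data.vsphere_datastore_cluster.cluster01.id
  # (other arguments omitted)
}
```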
Know-how: Changes lead to performance degradation
Steps to reproduce:
- You create VMs and set `datastore_id`
- You change `datastore_id` to `datastore_cluster_id`
- You run `terraform apply`
- Terraform applies all changes at the same time

Result: multiple disks can be moved across datastores simultaneously.
How to avoid: do not make many changes at the same time (see the note below).
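One way to split a big change into smaller ones is to apply per module, or to reduce parallelism; both are standard terraform CLI flags:

```
# Apply one module at a time to limit concurrent storage migrations
terraform apply -target=module.InfraServices

# Or serialize the operations instead of running 10 of them in parallel
terraform apply -parallelism=1
```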
Problems: Failed to use datastore_cluster_id
Steps to reproduce:
- You create a VM and set `datastore_cluster_id`
- You run terraform & it fails:
```
│ Error: Cannot use datastore_cluster_id with Content Library source
│
│   with module.TrainingsAndDemo.vsphere_virtual_machine.example-06,
│   on modules/TrainingsAndDemo/example-06.example.local.tf line 2, in resource "vsphere_virtual_machine" "example-06":
│    2: resource "vsphere_virtual_machine" "example-06" {
```
Result: the VM is not created.
How to avoid:
- Create the VM with `datastore_id` & run terraform
- Change `datastore_id` -> `datastore_cluster_id` & run terraform
Know-how: Run terraform twice to create a VM
Steps to reproduce:
- You create a VM, set `datastore_id` & run terraform
- You change `datastore_id` -> `datastore_cluster_id` & run terraform

Result: the VM is created, but it takes too many actions. It’s not a bug but an API feature.
How to avoid:
- You create a storage policy & add all datastores to that policy
- You create the VM, set `storage_policy_id` & run terraform:
data "vsphere_storage_policy" "storage_cluster_01" { name = "Storage Cluster 01" }
resource "vsphere_virtual_machine" "customnagpr" {
name = "customnagpr"
storage_policy_id = var.storage_policy_id
Problems: Sometimes VM creation fails
Steps to reproduce:
- You create a VM in the InfraServices pool & wait till the next backup
- You run terraform & get an error:
```
│ Error: error detaching tags to object ID "vm-108158": POST https://vcenter01.example.local/rest/com/vmware/cis/tagging/tag-association/id:urn:vmomi:InventoryServiceTag:2c12366a-xxxx-xxxx-xxxx-59ab8e157645:GLOBAL?~action=detach: 404 Not Found
│
│   with module.ExamplePersonal.vsphere_virtual_machine.ld-example,
│   on modules/ExamplePersonal/ld-example.example.local.tf line 2, in resource "vsphere_virtual_machine" "ld-example":
│    2: resource "vsphere_virtual_machine" "ld-example" {
```
Result: terraform exits with a non-zero code.
How to avoid: no ideas so far; just run it twice.
Know-how: Terraform & backups conflict
Steps to reproduce:
- You configure backups; the backup solution uses custom attributes
- You create a VM:
```
  ~ resource "vsphere_virtual_machine" "example-com" {
      ~ custom_attributes = {
          - "304" = "OK" -> null
          - "305" = "31-07-2023 02:31:26" -> null
```
Result: there is an unneeded change.
How to avoid: add a `lifecycle` block into the VM config:
```
lifecycle {
  ignore_changes = [
    custom_attributes["304"],
    custom_attributes["305"],
  ]
}
```
Know-how: Move VM to another pool = recreate it
Steps to reproduce:
- You have a Terraform module for each VMware pool
- You move a VM’s file to another module & run terraform
- Terraform recreates the VM
Result: the VM is recreated.
How to avoid: add temporary code for the migration:
```
moved {
  from = module.ExamplePersonal.vsphere_virtual_machine.ld-example
  to   = module.AnoverPersonal.vsphere_virtual_machine.ld-example
}
```
Links: https://developer.hashicorp.com/terraform/tutorials/configuration-language/move-config
Problems: Move VM to another pool = restart network
Steps to reproduce:
- You have a dedicated network for each pool
- You move a VM to another pool

Result: the VM is not reachable by its DNS name & you must restart the network to fix it.
How to avoid: there is no solution.
Know-how: Delete the VM TF file = delete the VM
Steps to reproduce:
- You remove the VM’s TF file & run terraform
- Terraform deletes the VM

Result: the VM is deleted.
How to avoid:
- Double-check the changes
- Set `prevent_destroy`:

```
lifecycle {
  prevent_destroy = true
}
```
Know-how: Create a VM with 2 disks
Steps to reproduce:
- Assign 2 disks & run `terraform apply`:
```
Planning failed. Terraform encountered an error while generating this plan.

│ Error: disk: duplicate SCSI unit_number 0
│
│   with module.InfraServices.vsphere_virtual_machine.commondata,
│   on modules/InfraServices/commondata.example.local.tf line 2, in resource "vsphere_virtual_machine" "commondata":
│    2: resource "vsphere_virtual_machine" "commondata" {
```
How to avoid: set a unique `unit_number` for the second disk (the first disk keeps the default 0):
```
disk {
  label            = "disk1"
  size             = 120
  controller_type  = "scsi"
  thin_provisioned = true
  unit_number      = 1 # the first disk already occupies unit_number 0
}
```
Know-how: Import existing VMs
- Create a TF file with the VM description & import it:

```
terraform import module.ExampleGuild.vsphere_virtual_machine.ans-jenkins01 /My\ Datacenter/vm/ExampleGuild/ans-jenkins01
```
- Run `terraform apply`:
```
      - enable_logging      = true -> null
      ~ imported            = true -> false
      + storage_policy_id   = "eaf0b1ab-xxxx-xxxx-xxxx-348473a5727b"
      - sync_time_with_host = true -> null

      - cdrom {
          - client_device  = false -> null
          - device_address = "ide:0:0" -> null
          - key            = 3000 -> null
        }

      ~ disk {
          ~ keep_on_remove = true -> false
```
TIP: locator format: `<DC NAME>/vm/<Folder>/<VM Name>`
Know-how: Import VMs without reboot
Steps to reproduce:
- You have a VM created not via Terraform.
- You create a TF file and import the VM.
Result: The VM is rebooted.
How to avoid: Review the Terraform output and align the VM parameters accordingly.
```
diff -C0 created-by-terraform.example.local.tf created-manually.example.local.tf
...
*** 9,10 ****
- sync_time_with_host = true
- enable_logging = true
--- 8 ----
***************
*** 31,32 ****
! cdrom {
!   client_device = false
--- 27,35 ----
! clone {
!   template_uuid = data.vsphere_content_library_item.packer_rocky8.id
!   customize {
!     linux_options {
!       host_name = "jenkins-dockerbuild-02"
!       domain    = var.domain
!     }
!     network_interface {}
!   }
***************
*** 34,38 ****
-
- cdrom {
-   client_device = false
- }
```
Know-how: vApp Containers
- Describe the container & the VM:

```
data "vsphere_folder" "folder" {
  path = "/My Datacenter/vm/example"
}

resource "vsphere_vapp_container" "VAPP_demo" {
  name             = "VAPP-demo"
  parent_folder_id = data.vsphere_folder.folder.id
  ...
}
```
- Import the existing container:

```
terraform import module.Imagemaster.vsphere_vapp_container.VAPP_demo /My\ Datacenter/host/My\ Cluster\ 01/Resources/Example/VAPP-demo
```
- Describe a VM & run `terraform apply`:

```
resource "vsphere_virtual_machine" "fdemo-docker" {
  name             = "demo-docker"
  resource_pool_id = vsphere_vapp_container.VAPP_demo.id
  ...
}
```
Know-how: sync changes from VMware
Steps to reproduce:
- Change RAM settings in VMware
- Run terraform
```
# module.ExamplePersonal.vsphere_virtual_machine.ld-example will be updated in-place
  ~ resource "vsphere_virtual_machine" "ld-example" {
      - memory_reservation   = 16384 -> null
      ~ num_cores_per_socket = 2 -> 1
      ~ num_cpus             = 6 -> 2
```
Result: terraform wants to revert the change.
How to avoid: make changes via terraform, or sync them into the TF code:
```
num_cpus             = 6
num_cores_per_socket = 2
memory_reservation   = 16384
```
Problems: GPU settings & host affinity are not detected
Steps to reproduce:
- Add a GPU & set host/rule affinity
- Run `terraform apply`:
```
# module.ExamplePersonal.vsphere_virtual_machine.ld-example will be updated in-place
  ~ resource "vsphere_virtual_machine" "ld-example" {
      - memory_reservation   = 16384 -> null
      ~ num_cores_per_socket = 2 -> 1
      ~ num_cpus
```
Result: only the RAM/CPU changes are detected; the GPU and affinity settings are invisible to terraform.
How to avoid: nothing really; we just added a note to the VM…
Problems: Not able to create VM with 2 NICs
Steps to reproduce:
- Create a VM with 2 NICs & run `terraform apply`
- Get the error:

```
│ Error: 400 Bad Request: {"type":"com.vmware.vapi.std.errors.invalid_argument","value":{"error_type":"INVALID_ARGUMENT","messages":[{"args":["network_mappings","com.vmware.vcenter.ovf.library_item.resource_pool_deployment_spec"],"default_message":"Could not convert field 'network_mappings' of structure 'com.vmware.vcenter.ovf.library_item.resource_pool_deployment_spec'","id":"vapi.bindings.typeconverter.fromvalue.struct.field.error"},{"args":[],"default_message":"Element already present in the map.","id":"vapi.bindings.typeconverter.map.duplicate.element"}]}}
```
Result: the VM fails to be created.
How to avoid: create the VM with 1 NIC & add the 2nd one afterwards (see the sketch below).
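The end state after the workaround is an ordinary two-NIC VM; a sketch (the network data sources are illustrative):

```
resource "vsphere_virtual_machine" "example" {
  # (other arguments omitted)

  # Present from the very first apply
  network_interface {
    network_id = data.vsphere_network.frontend.id
  }

  # Added in a second apply, after the VM was cloned with one NIC
  network_interface {
    network_id = data.vsphere_network.backend.id
  }
}
```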
Know-how: bulk tags update
- You assign a single tag to each VM via terraform
- You want to use multiple tags for the Ansible inventory
- Terraform fails because VMware allows only one tag in the category
- You change the category cardinality to `MULTIPLE`
- Terraform recreates the tag category & reassigns all tags to all VMs
resource "vsphere_tag_category" "terraform_managed_category" {
name = "terraform_managed_category"
cardinality = "MULTIPLE"
description = "Managed by Terraform"
associable_types = [
"VirtualMachine",
Know-how: Re-create VM templates from nothing
- Delete the content library
- Create an empty content library
- Keep calm & re-run the Jenkins jobs to rebuild the Packer templates
- Export the terraform state: `terraform state pull > state.json`
- Change it manually via search & replace (see the sketch below):
  - the template id for each VM
  - the content library id
  - the state version
- Push the state: `terraform state push state.json`
- Open the champagne
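The search & replace step can be scripted; a sketch assuming `sed` and placeholder IDs (the real IDs come from the rebuilt content library):

```
terraform state pull > state.json
# Point every VM at the rebuilt template & content library (placeholder IDs),
# and bump the "serial" field so the push is accepted
sed -i 's/OLD_TEMPLATE_ID/NEW_TEMPLATE_ID/g; s/OLD_LIBRARY_ID/NEW_LIBRARY_ID/g' state.json
terraform state push state.json
```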
Know-how: Repo structure
It is not ideal because there is a lot of copy-pasting, but it is quite easy to find the specific VM and tune it.
```
├── < FOLDER NAME >.tf
├── connections.tf
├── main.tf
├── modules
│   ├── < FOLDER NAME >
│   │   ├── < VM NAME >.example.local.tf
│   │   ├── _main.tf
│   │   └── _variables.tf
├── outputs.tf
├── README.md
├── remote_state.tf
├── tags.tf
├── variables.tf
└── versions.tf
```
`< FOLDER NAME >` is the same as the Network/Folder/Pool names in VMware; the sketch below shows how the root wires up the modules.
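A sketch of a root-level `< FOLDER NAME >.tf` file (the module name is taken from the examples above, the inputs are illustrative):

```
module "InfraServices" {
  source = "./modules/InfraServices"

  # Shared inputs declared in the module's _variables.tf
  domain            = var.domain
  storage_policy_id = data.vsphere_storage_policy.storage_cluster_01.id
}
```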
Problems: No domain_ou support for Windows customization
Steps to reproduce:
- Try to create a Windows VM, join it to the domain, and place it into a specific OU
- It’s not possible because the provider has no `domain_ou` support

Result: the VM must be moved to the specific OU manually.
How to avoid: create a PR with the required functionality upstream and wait.
Know-how: Ansible dynamic inventory
Create `tags.tf`:

```
resource "vsphere_tag_category" "terraform_managed_category" {
  name        = "terraform_managed_category"
  cardinality = "MULTIPLE"
  description = "Managed by Terraform"

  associable_types = [
    "VirtualMachine",
    "Datastore",
    "VirtualApp",
  ]
}

resource "vsphere_tag" "Sandbox" {
  name        = "Sandbox"
  category_id = vsphere_tag_category.terraform_managed_category.id
  description = "Ansible group default_db_postgres. Managed by Terraform"
}
```
Create a VM TF file and assign the tags:
resource "vsphere_virtual_machine" "demo-db-pg" {
name = "demo-db-pg"
storage_policy_id = var.storage_policy_id
resource_pool_id = vsphere_vapp_container.demo.id
wait_for_guest_net_timeout = 5
sync_time_with_host = true
enable_logging = true
firmware = "efi"
annotation = <<-EOT
!!! Do not edit properties via vcenter via web ui !!!
Managed by ansible & terraform
Some important notes
EOT
tags = [
vsphere_tag.Sandbox.id,
]
...
Install the `community.vmware` collection and create `hosts.vmware.yml`:
```
plugin: vmware_vm_inventory
hostname: vcenter01.example.com
username: xxx
with_tags: True
keyed_groups:
  - key: tag_category.terraform_managed_category
    prefix: ""
    separator: ""
hostnames:
  - 'config.name+".example.com"'
properties:
  - 'config.name'
  - 'summary.runtime.powerState'
  - 'guest.guestFamily'
with_nested_properties: true
filters:
  - tag_category.terraform_managed_category is defined
  - "'Sandbox' in tag_category.terraform_managed_category"
  - guest.guestFamily is defined
  - guest.guestFamily == 'linuxGuest'
  - summary.runtime.powerState == 'poweredOn'
resources:
  - datacenter:
      - My Datacenter
    resources:
      - compute_resource:
          - My Cluster 01
```
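To check what groups and hosts the plugin derives from the VMware tags, the standard inventory CLI helps:

```
ansible-inventory -i hosts.vmware.yml --graph
```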
Summary
Known issues
- Using the wrong hostname for customization makes the original VM unreachable
- `datastore_cluster_id` is not supported with Content Library sources & we must use `storage_policy_id`
- Sometimes VM creation fails
- Moving a VM to another pool requires a network restart
- Too much code & copy-paste
- A higher entrance level
Lessons learned
- Use remote state
- It’s tricky to review more than 10 changes at one time
- With great power comes great responsibility
- Bulk changes can lead to performance degradation
- Avoid configuration drift between VMware & terraform
- Instead of recreating a VM, you can simply move it
- Use `prevent_destroy` to avoid accidental VM deletion
- Use a simple & obvious structure to lower the entrance level
- Use Ansible dynamic inventory based on tags
- It’s possible to add the missing functionality upstream
Benefits
- It’s really easy to perform bulk updates
- Re-create infra from nothing
- Trackable changes
- Single source of truth
Want to try
- Terraform pull request automation via Atlantis
- Build a "green button" in Jenkins
- Windows automation
- Cloud-init for setting the hostname, like here (sorry, it’s in Russian)
- Tests for Terraform, like we have for Ansible