Single-Minded Configuration
A stabilized approach to systems orchestration
December 19, 2020
Abstract
Configuration management is a term usually used to describe a declarative approach to systems administration. Declarative configuration allows you to write down the intended state of a system, and changes are communicated to the rest of the team through a series of commits. The benefits of these tools are immense, but validating changes is slow and prone to error. Sequencing operations is also possible, but at the expense of eye-watering complexity. In this session we explore the following topics:
- Developing a "stabilized approach"
- How systems configuration is comparable to building software
- Why service orchestration appears to be difficult
- What sort of solutions are too simple
- Steps to creating your own orchestration and configuration management framework
- Overview of rset(1) and pln(5)
- Advances in networking that facilitate push-based configuration
Preamble
I will never be a pilot, but I have a good deal of admiration for these men and women. One of the features of this profession that inspires me is the personal nature of working in tight formation with the rest of the crew. Another is their habit of verbally cross-checking their actions with procedures and the information they are receiving. These verbal and manual confirmations work just as well when operating solo as they do when working with a copilot.
Pilots also learn some outstanding methods of ordering priorities. The motto that most directly speaks to this is,
Aviate, Navigate, Communicate
What does it mean to "Aviate"? That means fly the airplane.
Advanced Qualification Program
Simulator training
Never bust minimums
Train for all scenarios known to be problematic
Dan Gryder is a flight instructor who has taken up the cause of studying and learning from accident reports in general aviation. Several hard personal experiences in his life put him on a mission to give pilots in single-engine aircraft everything they need to avoid a loss of control.
Part of his technique is to import the habits and tools that commercial airlines have. One of the interesting experiments he conducted was to ask airline pilots which skill was more important: a) stall recovery or b) energy management. Can you guess what they said?
The important skill is avoiding loss of control. Do you think the pilots who work for Southwest trade stories at the bar about adventures while stalling a 737? I hope not!
Here is what Dan had to say to those in general aviation:
Learn to define and honor the 1.3 buffer at all times. Define it, placard it, honor it. It is what the airlines do every day. Do you think they memorize all those speeds? No, they are clearly defined and placarded for them at all times.
What is the "tool" or the "placard" he is referring to?
On a small aircraft, the tool [in this case] is a bright piece of tape on the airspeed indicator. Now when the pilot is distracted or under pressure, one thing he does not have to remember is the minimum maneuvering speed. This maneuvering speed is calculated ahead of time to allow up to a 30 degree bank angle. When something unexpected happens, that piece of tape on the airspeed indicator will help him keep the machine flying--all the way to the end.
Dan continues,
The airline record is impressive...as they now train and check all possible scenarios (called maneuvers) known to be problematic over the course of time.
Procedures for Software Engineering
Build, Test, Integrate
| Build/test source code | < 10 seconds |
| Trial deployment | < 90 seconds |
| Push to all users | < 10 minutes |
Software engineers don't have standard operating procedures, but every well-managed project has a substantial list of rules and processes to follow. Don't believe me? Try submitting a patch to your favorite open-source project and find out how much correction you receive.
The strategy a software project employs should include everything needed to maintain a stabilized approach:
- It has tests, and they pass
- Each change is a coherent commit
- Changes are continuously integrated into a real environment
I'm assuming that you ship tests with the code, but maybe this doesn't make sense for what you're doing. There are many times when you put together specialized or very slow regression tests and then throw them away after they have served their purpose.
The point is that your approach to development is able to transition from one stable condition to another stable condition. To go back to the example of aviation: a stabilized approach lightens the workload and gives you situational awareness. How so? You've done all the figuring in advance.
Authoring Systems Configuration
| Test configuration on an existing host | < 10 seconds |
| Provision new infrastructure | < 90 seconds |
| Commit, propagate | < 10 minutes |
This slide shows systems configuration when framed from the perspective of software engineering.
I think there is a strong case to be made that productivity in software development and systems administration correlates directly with the time it takes to validate a change. Delay in feedback changes your approach to development.
If testing a change to a host takes more than, say, 10 seconds, it's not providing interactive feedback. If you can't iterate on a problem efficiently, you will inevitably compensate by manually testing fragments of code and configuration outside of your repository.
As with software development, there is a strategy to systems administration:
- Continuously apply changes to an existing host
- Verify changes apply correctly to newly provisioned hosts
- Use source control as a means of communicating state changes with the rest of your team
Declarative Configuration
{% for file in ["miniupnpd.conf", "dhcpd.conf", "mail/smtpd.conf"] %}
/etc/{{file}}:
file.managed:
- source: salt://home/{{file|replace('/', '_')}}
{% endfor %}
Templates, variables
Massive APIs
Modules and extensions
Product roadmap (features, bugs)
The longer I worked with large frameworks like Salt, the more I valued their capabilities, and the more I found the framework itself to be an obstacle to what I was trying to accomplish.
- Templates are great for building documents, terrible for general-purpose programming.
- Using primitive data structures to call functions is a taxing way to write programs. Typically you are using a DSL or writing YAML that maps to a framework's API.
- Adding your own behavior requires you to become a platform expert. Almost as if you were a third-party integrator.
Far too often we commit the change in order to test the change.
Declarative Orchestration?
# step 1
/usr/local/bin/mysql_install_db:
  cmd.run:
    - creates: /var/mysql
# step 2
Dependencies, sequencing
Progressive status?
Arid programming environment
Eventually configuration management frameworks grow large enough to be called an orchestration framework. What's not to like?
- Expressing an action is not too difficult, but chaining events based on the result of a previous step is a nightmare
- No mechanism for printing progressive status messages
- Trying to sequence events or make complex decisions without a real programming language is difficult
Orchestration is an advanced topic for configuration management, but all of you do this already. It's called scripting.
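To make the contrast concrete, here is a minimal shell sketch of the same MySQL step shown in the earlier Salt fragment (the paths come from that fragment; the status message and error handling are my own additions):
#!/bin/sh
# step 1: initialize the database only if /var/mysql does not already exist
if [ ! -d /var/mysql ]; then
    echo "initializing MySQL data directory"   # progressive status is just echo
    /usr/local/bin/mysql_install_db || exit 1
fi
# step 2: anything below runs only when step 1 succeeded or was already done
Sequencing, branching, and progress messages all come for free from the language.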
Oversimplification
ssh 10.5.5.1 < base-cfg.sh
ssh 10.5.5.1 < configure-wordpress.sh
What's missing?
Let's try our hand at configuring a system using some scripts. This is a solution that is too simple. The first reason it is too simple is that we didn't ship the resources the scripts will need:
- Configuration files
- Miscellaneous utilities
We also need a convention for associating configuration with each host, and hopefully a means of running only the part we are trying to test.
Configuration Fundamentals
| Add/upgrade packages | |
| Install files, directories, symlinks | } Map units of work into a profile for each host |
| Enable/start/restart services | |
There are a few operations that configuration management systems must be able to accomplish. We need a mechanism for adding packages, installing files, and controlling services. With only these three things, you can accomplish some valuable tasks, such as
- Rebuild or replace servers
- Make uniform changes across a group of machines
We also gain a versioned history of configuration changes that the rest of the team can follow. These are critical capabilities, and it is for good reason that configuration management has become mainstream.
With these fundamentals in mind, let's try again.
A Minimal Framework
alias rexec="ssh -T -S /tmp/control $1" rexec -fN -M # start control master rexec mkdir /tmp/staging # scratch space case $1 in 192.168.0.2) # copy files, run scripts tar cf - util wpconfig | rexec tar xf - -C /tmp/staging rexec < base-cfg.sh rexec < configure-wordpress.sh ;; esac rexec rm -r /tmp/staging # clean up rexec -O exit # end control master
As crude as this tiny framework is, it has some advantages:
- You can use any mix of scripting languages
- The control master allows you to break up configuration into logical chunks without paying a noticeable performance penalty
- Error and informational messages are piped back immediately
- No remote dependencies other than a basic UNIX environment
- Very flexible: ship utilities, data and configuration
What are we missing?
- You have to manually sprinkle in print statements to show what stage is running
- No obvious way to run part of a configuration
- We would benefit from a utility that knows how to install and update files, and print a diff of the changes
- The current working directory is not set; scripts have to find the staging location in /tmp
In short, this works! But notice this:
- You can test your configuration by running it
- It's fast enough to run every time you save changes to a file
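For example (a sketch: assume the script above is saved as deploy.sh and that a file watcher such as entr(1) is installed--both names are my assumption), you can rerun the configuration for a host on every save:
# rerun the deployment whenever a script or shipped file changes
ls *.sh util/* wpconfig/* | entr -c sh deploy.sh 192.168.0.2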
rset(1): Remote Staging Execution Tool
Stages configuration and utilities on the remote machine
Secure access to remote files: rinstall(1)
Ability to run everything, or use a pattern to match labels: pln(5)
rset(1) is a tool that provides conventions and a way to execute scripts with access to particular resources.
As we have already observed, the ability to execute scripts is not sufficient. You also need a collection of utilities or utility libraries. rset(1) creates a temporary directory populated with the tools and materials you will need for the task at hand.
It ships with a server for access to large files, and some built-in utilities that know how to install or modify files.
rset(1) uses its own container format. This is different, and I think it sets rset apart from other attempts at a minimalist configuration management system.
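As a rough sketch of the day-to-day invocation (the hostname is borrowed from the routes.pln example later in this talk):
# stage _rutils and the directories mapped to this host, then run its pln files
rset vm2.eradman.com
A label pattern can also be supplied to run only part of the configuration, which is where pln(5) comes in.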
pln(5): Progressive Label Notation
Blocks of configuration can be selected individually
Label names beginning with [0-9a-z] are excluded by default:
root_tasks:
→ crontab - <<-EOF
→ ~ 1 * * * /usr/local/bin/renewcert
→ EOF
Parameters apply to subsequent labels
interpreter=/bin/sh -x
Progressive Label Notation is a tab-indented file format that allows you to organize configuration.
- Labels provide a uniform way to run a subset of the configuration
- Each label is an independent script that is piped to the remote interpreter
- Not key-value; it is evaluated in the order you write it
- It does not interfere with most syntax highlighting; go ahead and use an editor modeline
The content of each label is indented with tabs. If you do not have a capable text editor, this might be a problem for you. Why tabs?
- Because shell allows you to strip off leading tabs in a heredoc (Ruby can do spaces as well)
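Here is the shell behavior in question: the <<- form of a heredoc strips leading tab characters, so a tab-indented label body arrives flush left (the indentation below is a literal tab):
#!/bin/sh
cat <<-EOF
	this line is indented with a tab
	EOF
# prints: this line is indented with a tab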
Configuration Mapping
routes.pln associates configurations with each hostname
vm2.eradman.com: vm2/
→ vm2.pln
→ wordpress.pln
172.16.0.5: alpine/
→ alpine_vm.pln
Directories listed after the hostname labels are copied to the staging directory
The "top-level" configuration file is called routes.pln by default.
Paths after the : are directories (configuration files, scripts,
libraries, anything) that you want staged on the remote host.
Dynamic inventory is a feature that you can handle yourself. rset
reads a file. Use any means you'd like to generate this file if need
be.
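As a sketch of what that could look like (hosts.txt and the common/ directory are hypothetical names), a routes.pln for a fleet of similar hosts can be generated with a few lines of shell:
# emit one "hostname: directory" line per host, followed by the pln file to run
while read -r host; do
    printf '%s: common/\n' "$host"
    printf '\tcommon.pln\n'
done < hosts.txt > routes.pln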
Directory Structure
├── _rutils        # utilities always staged
│   ├── rinstall
│   └── rsub
├── _sources       # files served over http
└── routes.pln     # configuration mapping
Extend core functionality on all hosts using _rutils
Built-in web server serves files under _sources
Put everything else in directories specified in routes.pln
rset(1) has three methods of providing access to files:
- Anything in _rutils will show up in the current working directory on the remote host
- Anything under _sources is accessible over a local port-forward to a built-in web server (access is not restricted, but large files are fine)
- Directories specified in routes.pln will be copied (push only, remote hosts cannot request this content)
Standard Utilities
rinstall(1)
./rinstall xa10/pf.conf /etc/pf.conf \
&& pfctl -f /etc/pf.conf
rsub(1)
./rsub /etc/firefox/unveil.main <<-CONF
/usr/local/heimdal/lib r
/usr/lib r
CONF
Some solutions are too simple. Landing on a remote host with a staging directory and /bin/cp is not enough. Two built-in utilities are automatically shipped:
rinstall:
- Knows how to install a file from the staging area
- Prints a diff, or a notice that a new file was created
- Optionally sets owner and mode
- Fetches large files on demand using HTTP
- Exits 0 only if the file was installed or changed! [Very important] (see the sketch after these lists)
rsub:
- Use a regex to modify a single line
- Or use stdin to create a block of managed text
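Putting the two together, a hypothetical pln label (file paths and service name invented for illustration) can use rinstall's exit status to decide whether a restart is needed:
mail_config:
→ ./rinstall mail/smtpd.conf /etc/mail/smtpd.conf && rcctl restart smtpd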
Client/Server Configuration
Development environment is not authoritative
Critical systems on the edge of a network
Observation requires a terminal on both sides
In aviation there's a funny term for making a final approach with the engines idling. This is sometimes called a "dead stick" landing, and bringing a functional aircraft down this way is not very safe. There are a couple of reasons jets land under power; the most significant is that it takes precious time to spool up the engines. A powered final approach gives the pilot the capability to adjust or abort.
In some environments every configuration task is like a forced landing because your personal test environment is slow to react and is not the same as the official deployment mechanism from top of tree.
This is what running configuration before you commit does; it gives you the ability to assess your current approach and to adjust course.
There are some problems that client-server systems seem to be well suited for. Still, I don't like agent-based configuration, for several reasons:
- You are putting the most critical system [the configuration host] on the edge of a network.
- A meaningful test environment is nearly impossible to configure.
- Debugging is usually difficult because log output is mainly on the remote host.
Jumphosts & Roaming Clients
# jumphost/hostname.wg0
wgport 111 wgkey JUMP_HOST_PRIVATE_KEY
wgpeer ROAMING_HOST1_PUBLIC_KEY wgaip 10.0.0.20/32
wgpeer ROAMING_HOST2_PUBLIC_KEY wgaip 10.0.0.21/32
inet 10.0.0.1/24
# thinkpad10/hostname.wg0
wgkey ROAMING_HOST1_PRIVATE_KEY
wgpeer JUMP_HOST_PUBLIC_KEY wgendpoint proxy.xyz.com 111 wgaip 0.0.0.0/0
inet 10.0.0.20/24
Even if you have mobile clients, you probably don't need a pull-based configuration scheme. I say this because we now have WireGuard to build links. WireGuard is revolutionary in its simplicity:
- Either side can initiate the connection
- The in-kernel wg(4) interface guarantees the identity of interfaces
The only thing you might need to add is a cron job that sends a ping in order to establish the tunnel.
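On the roaming host that can be a single crontab(5) entry; the address is the jumphost's tunnel IP from the example above, and the interval is an arbitrary choice:
# keep the WireGuard tunnel to the jumphost established
*/5 * * * * ping -c 1 -q 10.0.0.1 >/dev/null 2>&1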
rset(1) doesn't provide flags for controlling connection options.
It doesn't need to, because options such as ConnectTimeout,
ProxyJump and anything else you can imagine can be specified in
ssh_config(5).
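For example, a block like the following in ssh_config(5) lets rset reach the roaming clients' tunnel addresses through the jumphost (the Host pattern and timeout are illustrative):
# ~/.ssh/config
Host 10.0.0.*
    ConnectTimeout 5
    ProxyJump proxy.xyz.com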
Conclusion
Factor out common or complex operations into dedicated utilities
Stage configuration data, scripts, and utilities on the remote host
Map units of work into a profile for each host
I have heard it said that there is a paradox with respect to learning a subject. The first maxim is this: the more you know, the more you can see there is to learn. The second is that once you've mastered a topic you finally see how simple it was all along.
May I submit that configuration management simply means that we have a way of associating configuration with each host. Orchestration is really a flamboyant term for scripting with configuration data, scripts, and utilities already staged.