Tailnet lock white paper

This white paper on tailnet lock is a draft. It is shared to solicit feedback on the design and implementation of tailnet lock.

Abstract

Modern VPNs have made great headway in reducing the attack surface of the networks they protect, by using modern cryptography for end-to-end encryption and by connecting peers directly over point-to-point links, so that the VPN cannot intercept data plane traffic in plaintext. While this architecture establishes secure data paths independent of the VPN’s control plane, participants still rely on the control plane to tell them which keys to use when establishing data paths. As a result, a compromised control plane can read or write traffic by distributing attacker-generated keys.

Tailscale implements a novel mechanism, tailnet lock, to prevent a malicious control plane from inserting itself into such a network. Tailnet lock requires the verification of a cryptographic signature on all WireGuard® node keys distributed by Tailscale’s control plane. Each tailnet has a locally-managed key authority, which keeps track of the set of permitted signing keys, and processes signed messages to apply authentic changes to the set of permitted keys. The control plane never sees or stores the keys for these signatures, and signatures are verified against the key authority. With these mechanisms, the control plane can neither insert itself into a network nor change or disable the set of permitted signing keys without being detected.

Introduction

To tunnel data between two nodes in a Tailscale network (a tailnet), the Tailscale client establishes a WireGuard session between the two nodes, over which private network traffic traverses. These sessions are authenticated and subsequently encrypted based on each node’s key pair: an X25519 key pair that is unique to each node and used for establishing these data connections. Connections between nodes are peer-to-peer (the data plane), and are distinct from the distribution of keys and settings for a tailnet, managed by Tailscale’s coordination server (the control plane).

To establish these sessions, the data plane needs information about peer nodes in the tailnet. This includes information like the IP addresses they can be contacted on, packet filtering rules which should be applied to traffic originating from those nodes, and the node’s public key needed to set up a WireGuard session. This dependence on information from the control plane — and in particular needing to be told other nodes’ public keys — creates a need to trust the control plane, because otherwise a compromised or malicious control plane could announce nodes with attacker-controlled keys and participate in receiving or sending plaintext network traffic.

Context

This section gives the context for the problem tailnet lock is designed to address.

Design goal

Tailnet lock was designed to uphold a single principle:

Tailscale infrastructure cannot add unauthorized nodes to a tailnet with tailnet lock enabled, and any attempt to do so can be blocked and detected.

While Tailscale’s control plane is responsible for a lot of functionality (including handling logins, interpreting ACLs, and computing packet filters), its most critical function to the security of the tailnet is its distribution of node keys. By making it impossible for a compromised or malicious control plane to distribute acceptable node keys, we can eliminate the possibility of clandestine nodes participating in a tailnet and receiving or sending tailnet network traffic in plaintext.

Assigning trust for key distribution

Under tailnet lock, each individual node verifies and enforces that other nodes’ public keys announced by the control plane are accompanied by a valid cryptographic signature, from a trusted source. Node keys that fail signature verification are not accepted and cannot be used to establish WireGuard sessions — effectively preventing them from participating in the data plane and observing traffic traversing the tailnet.

While simple, this approach brings with it a new challenge: How do nodes know which keys are authorized to sign node keys, and how can this information be updated in a manner which cannot be tampered with by the control plane? Addressing these challenges forms the bulk of tailnet lock’s design.

Components of tailnet lock

This section covers each of the components of tailnet lock.

At a high level, here’s how the components of tailnet lock work together: When tailnet lock is enabled, a set of existing tailnet lock keys is identified as trusted. Disablement secrets are generated, and a cryptographic derivation of each secret is passed to all nodes. Each node in a tailnet keeps track of these keys and derived values in a local tailnet key authority. The node key of each new node added to the tailnet must be signed by one of these trusted tailnet lock keys, and the signature is verified locally by each node. Changes to the set of trusted tailnet lock keys are sent in an authority update message, which must also be signed by a trusted tailnet lock key and is, again, verified locally by each node. To disable tailnet lock, a disablement secret must be passed to each node, which validates it against its locally stored disablement secret derivations.

Tailnet lock keys

Tailnet lock introduces a new set of keys — tailnet lock keys (TLKs) — to be used in the operation of tailnet lock. These are generated by every node on first startup, but not every node’s TLK needs to be trusted. An initial set of trusted TLKs is specified by the administrator while enabling tailnet lock, and trusted TLKs can be added or revoked.

TLKs are Ed25519 keys, which are used to sign new node keys (used for WireGuard sessions) and changes to tailnet lock configuration.

The private key of each TLK is stored on the node that generated it, and a copy of each trusted public TLK is stored in the tailnet key authority (described in the next section The tailnet key authority). In addition, each key has a weight associated with it, which is used to determine which update to apply when multiple update messages are conflicting (described in Processing forking updates).

The tailnet key authority

To keep track of the current set of trusted TLKs, a tailnet key authority (TKA) subsystem runs within each node. By knowing the current set of trusted TLKs, TKA is able to verify signatures in two situations:

  1. Verifying node key signatures before adding node keys to its list of peers, and
  2. Verifying signatures for changes to the set of trusted TLKs before processing those changes.

When a node starts up, it checks to see if it has stored TKA state from a previous run, that is, if tailnet lock has been enabled on the tailnet. If it does, it computes the current set of trusted TLKs, and requires valid node key signatures on all peers in order to configure WireGuard sessions with them.
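
The peer check described above can be sketched roughly as follows. This is an illustrative sketch, not Tailscale's implementation: the `Peer` type, the `accepted_peers` function, and the `Verifier` callback are hypothetical names, and the signature check is passed in as a function because Ed25519 verification is not part of the Python standard library.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set

# Hypothetical verifier signature: (public_key, message, signature) -> bool.
# A real node would plug in Ed25519 verification here.
Verifier = Callable[[bytes, bytes, bytes], bool]


@dataclass(frozen=True)
class Peer:
    node_key: bytes      # WireGuard public key announced by the control plane
    signing_tlk: bytes   # public TLK that claims to have signed node_key
    signature: bytes     # signature over node_key


def accepted_peers(
    peers: Iterable[Peer], trusted_tlks: Set[bytes], verify: Verifier
) -> List[Peer]:
    """Return only the peers whose node key carries a valid, trusted signature."""
    accepted = []
    for peer in peers:
        if peer.signing_tlk not in trusted_tlks:
            continue  # signed by a key this node does not trust
        if not verify(peer.signing_tlk, peer.node_key, peer.signature):
            continue  # signature does not match the node key
        accepted.append(peer)
    return accepted
```

Peers that are filtered out never reach the WireGuard configuration, which is what keeps an unsigned node out of the data plane in this model.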

The control plane relays authority update messages to nodes, each of which verifies the updates and applies them to its local tailnet key authority.

While running, a node processes update messages relayed by the control plane describing both changes to peers and changes to the TKA. Changes to the set of trusted TLKs are sent in authority update messages (AUMs). AUMs must be signed by a trusted TLK, so that they cannot be tampered with by a compromised control plane, and must pass an extensive set of validation rules before they are applied to the state machine (described in Validation).

Authority update messages (AUMs)

Information flows through tailnet lock in authority update messages (AUMs): signed messages describing either the initial state of tailnet lock (the genesis update), or some change to a previous state (a delta update).

AUMs only contain information about changes to tailnet lock state, including enabling tailnet lock, and adding or removing trusted TLKs. While node key signatures are also created with TLKs and distributed by the control plane, they are distinct from AUMs.

With the exception of the genesis update, all AUMs reference the hash of the previous update. This forms a chain of blocks describing a sequence of strongly ordered changes, in a manner similar to Git or other distributed ledgers.

Each authority update message embeds the hash of the preceding authority update message. The first authority update message is the genesis update.

As each AUM embeds the hash of the preceding AUM, the hash of the latest AUM is sufficient to describe the current state (in the same manner that a commit hash is sufficient to describe the current state of a Git branch). Borrowing Git nomenclature, we say the current state of a TKA is its HEAD (AUM hash).
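
A minimal sketch of this hash chaining, under stated assumptions: the `AUM` class and its fields are hypothetical, BLAKE2s is picked arbitrarily from the BLAKE2 family, and a simple length-prefixed byte layout stands in for the real serialization.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AUM:
    kind: str                   # e.g. "Genesis", "AddKey", "RemoveKey"
    payload: bytes              # opaque description of the change
    prev_hash: Optional[bytes]  # None only for the genesis update

    def digest(self) -> bytes:
        h = hashlib.blake2s()
        # Length-prefix each field so the hashed bytes are unambiguous.
        for part in (self.kind.encode(), self.payload, self.prev_hash or b""):
            h.update(len(part).to_bytes(4, "big"))
            h.update(part)
        return h.digest()


genesis = AUM("Genesis", b"initial keys", None)
add = AUM("AddKey", b"key-2", genesis.digest())
remove = AUM("RemoveKey", b"key-1", add.digest())

# The hash of the latest AUM commits to every update before it.
head = remove.digest()
```

Because each digest covers the parent's digest, changing any earlier update changes every later hash, so two nodes that agree on `head` agree on the whole history.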

Processing incremental updates

When the user wishes to make changes to the parameters of tailnet lock, such as by adding a new trusted TLK, their node generates an AUM describing the change. This AUM is signed by a trusted TLK, and encodes the HEAD of the TKA as the hash of the previous update.

An incremental authority update message encodes the HEAD of the tailnet key authority as the hash of the previous update, to build upon the latest tailnet lock state.

This update is then relayed to all nodes in the tailnet via the control plane. Each node validates the AUM and applies the change described. All nodes which have processed the AUM are now at the same state of trusting the newly added TLK; and being at the same state, have the same HEAD.
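
As a rough illustration of how nodes converge on the same HEAD, the sketch below applies one update to two independent copies of the state. The `Authority` class and `aum_hash` helper are hypothetical, signature verification is omitted, and only the common non-forking case is handled.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


def aum_hash(kind: str, key: bytes, prev_hash: bytes) -> bytes:
    # Stand-in digest over a length-prefixed layout (not the real encoding).
    h = hashlib.blake2s()
    for part in (kind.encode(), key, prev_hash):
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()


@dataclass
class Authority:
    head: bytes
    trusted: Set[bytes] = field(default_factory=set)

    def apply(self, kind: str, key: bytes, prev_hash: bytes) -> None:
        # Simplification: this sketch only accepts updates that build on HEAD.
        if prev_hash != self.head:
            raise ValueError("update does not build on the current HEAD")
        if kind == "AddKey":
            self.trusted.add(key)
        elif kind == "RemoveKey":
            self.trusted.discard(key)
        else:
            raise ValueError("unknown update type")
        # The hash of the update just applied becomes the new HEAD.
        self.head = aum_hash(kind, key, prev_hash)
```

Two nodes that start from the same HEAD and apply the same update end with the same trusted set and the same new HEAD.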

Validation

AUMs are validated prior to being processed. Validation includes two steps: verifying the AUM is signed, and verifying the AUM is well formed.

AUMs must be signed by an already trusted TLK, that is, a TLK trusted at the state preceding the current AUM. This signature verification is done as early as possible: if there end up being security-relevant bugs in later validation logic, the need to possess a trusted TLK significantly reduces the availability of that code path to an attacker.

Next, AUMs are verified to be well formed. A well formed AUM is validated as unambiguously describing a change to tailnet lock state. The exact semantics vary depending on the type of update message (enable tailnet lock, add key, remove key). Specific message validation rules include:

  • All AUMs except the genesis AUM must reference the hash of a known parent AUM.
  • Remove key AUMs must reference a key to remove which is present in the key authority, and must not remove all keys from the tailnet lock authority.
  • Add key AUMs must completely specify the public parameters of the key being added, and must not add a key which is already trusted.
  • The genesis AUM must specify at least one disablement secret and at least one key, and keys or disablement secrets cannot be duplicated.
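
The well-formedness rules listed above could be expressed along these lines. This is a sketch under assumptions: the `AUM` fields and the `validation_error` function are hypothetical, "completely specified" is reduced to "non-empty", and the signature check that precedes these rules is not shown.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class AUM:
    kind: str                                   # "Genesis", "AddKey" or "RemoveKey"
    prev_hash: Optional[bytes] = None           # hash of the parent AUM
    key: Optional[bytes] = None                 # key being added or removed
    keys: Tuple[bytes, ...] = ()                # genesis only: initial trusted keys
    disablement_values: Tuple[bytes, ...] = ()  # genesis only: derived secrets


def validation_error(
    aum: AUM, known_hashes: Set[bytes], trusted: Set[bytes]
) -> Optional[str]:
    """Return a reason the AUM is malformed, or None if it passes these checks."""
    if aum.kind == "Genesis":
        if not aum.keys or not aum.disablement_values:
            return "genesis needs at least one key and one disablement secret"
        if len(set(aum.keys)) != len(aum.keys):
            return "duplicate key in genesis"
        if len(set(aum.disablement_values)) != len(aum.disablement_values):
            return "duplicate disablement secret in genesis"
        return None
    if aum.prev_hash not in known_hashes:
        return "unknown parent AUM"
    if aum.kind == "AddKey":
        if not aum.key:
            return "added key is not fully specified"
        if aum.key in trusted:
            return "key is already trusted"
        return None
    if aum.kind == "RemoveKey":
        if aum.key not in trusted:
            return "key to remove is not trusted"
        if trusted == {aum.key}:
            return "cannot remove the last trusted key"
        return None
    return "unknown AUM type"
```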

Processing forking updates

The previous section Processing incremental updates described a simple scenario in which a new key was added. This change is conveyed by a single AUM representing the new key, being applied to the latest state. While this is the common case, updates which build upon a much earlier update (that is, the AUM hash they reference as the previous AUM is not the HEAD of the TKA) are also allowed. This mechanism enables recovery from malicious updates signed by a compromised TLK.

A chain of authority update messages can fork to form multiple branches.

When this happens, the AUMs descending from a common ancestor form multiple branches. AUMs must form a single chain of updates, so when branching occurs, one branch needs to be deterministically chosen and the remaining branches discarded. The remainder of this section describes the algorithm by which this occurs.

Each trusted TLK is assigned a weight, which is used as part of this computation.

All nodes follow the same rules to decide which branch to take. For each set of updates that form a fork:

  1. Sum the weights associated with each key signing an AUM (as mentioned earlier, each weight is an integer associated with a key for the purpose of resolving forking updates). The AUM with the greatest sum is chosen as the next update.

    By way of example, imagine there are two AUMs, AUM A and AUM B, with the same parent AUM hash, and hence are both candidates to be the next AUM processed. If AUM A has two signatures where the key weight of the key associated with each signature is 1 and 3 respectively, then the sum of key weights for AUM A is 4. If AUM B only has a single signature, and the weight of the associated key is 1, then the sum of key weights for AUM B is 1. Then AUM A would be chosen as the next AUM, as a key weight of 4 is greater than 1.

    If the key weight sums are the same, the algorithm applies the next rule.

  2. Prefer AUMs that describe the removal of keys.

    By way of example, if two candidate AUMs A and B have message types AddKey and RemoveKey respectively, then AUM B with the message type RemoveKey would be chosen as the next AUM.

    If neither or both AUMs are of type RemoveKey, the algorithm applies the final rule.

  3. Finally, choose the AUM whose hash is lower when expressed as an integer.

    Given the properties of our digest function BLAKE2, comparing the hashes of forking AUMs is a complete and deterministic tiebreaker, so all nodes choose the same AUM to process.

Each node iteratively ‘walks’ the chain of updates, deciding at each update in the chain which branch to take based on the above rules (and ignoring the other branches). This process continues until there are no known AUMs whose parent AUM hash matches that of the previous AUM applied; this occurs when we’ve reached the end of the chain and are up to date.
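
The three rules and the chain walk can be sketched as below. The names (`AUM`, `pick_next`, `walk`) are hypothetical, key weights are held fixed for the whole walk, and hashes are read as big-endian integers; all three are simplifying assumptions rather than details confirmed by the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class AUM:
    digest: bytes             # hash of this AUM
    prev_hash: bytes          # hash of the parent AUM
    kind: str                 # "AddKey" or "RemoveKey"
    signers: Tuple[str, ...]  # identifiers of the keys that signed it


def pick_next(candidates: Sequence[AUM], weights: Dict[str, int]) -> AUM:
    """Choose among AUMs that share a parent, following rules 1 to 3."""
    def rank(aum: AUM):
        total = sum(weights.get(s, 0) for s in aum.signers)
        return (
            -total,                              # rule 1: greatest weight sum
            aum.kind != "RemoveKey",             # rule 2: prefer key removals
            int.from_bytes(aum.digest, "big"),   # rule 3: lowest hash
        )
    return min(candidates, key=rank)


def walk(head: bytes, aums: Sequence[AUM], weights: Dict[str, int]) -> List[AUM]:
    """Follow the chain from head, discarding losing branches at each fork."""
    applied = []
    while True:
        candidates = [a for a in aums if a.prev_hash == head]
        if not candidates:
            return applied  # no AUM builds on this one: up to date
        chosen = pick_next(candidates, weights)
        applied.append(chosen)
        head = chosen.digest
```

With weights of 1 and 3 on the two keys signing AUM A and a weight of 1 on the key signing AUM B, `pick_next` returns A, matching the worked example in rule 1.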

Disablement secrets

Disabling tailnet lock is a security-relevant operation — if the disablement procedure lacked authorization checks, a compromised control plane could trivially disable tailnet lock (and all its protections) to attack a tailnet as before. To avoid this, use of a disablement secret is required to globally disable tailnet lock.

Disablement secrets are long passwords generated during the process of enabling tailnet lock. Multiple disablement secrets can be generated. Each secret is passed through a key derivation function (KDF), Argon2, and only the derived value is distributed to nodes.

The derived value of each disablement secret is stored by each node’s TKA, and hence known by every node in a tailnet. By virtue of being stored as a derived value, the secret itself remains confidential until use — so each node can validate and process a disablement secret, without needing to know the secret itself.

When a disablement secret is used, the secret is distributed to all nodes in the tailnet. Each node can then verify the authenticity of the disablement operation by computing the KDF of the secret, and comparing it to the stored KDF values in their TKA. If they match, then the node can be sure the disablement is authentic and will disable tailnet lock locally.
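
A small sketch of this generate-then-verify pattern follows. Argon2 is not available in the Python standard library, so `hashlib.scrypt` is swapped in purely as a stand-in memory-hard KDF; the fixed salt, the cost parameters, and the function names are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets
from typing import Iterable, Tuple

SALT = b"tailnet-lock-sketch"  # hypothetical fixed salt for this sketch


def derive(secret: bytes) -> bytes:
    # Stand-in for Argon2: scrypt is used only because it ships with Python.
    return hashlib.scrypt(secret, salt=SALT, n=2**12, r=8, p=1, dklen=32)


def new_disablement_secret() -> Tuple[bytes, bytes]:
    """Return (secret, derived value); only the derived value goes to nodes."""
    secret = secrets.token_bytes(32)  # 32 bytes from the system CSPRNG
    return secret, derive(secret)


def is_valid_disablement(secret: bytes, stored_values: Iterable[bytes]) -> bool:
    """Check a presented secret against the derived values held by a node."""
    candidate = derive(secret)
    return any(hmac.compare_digest(candidate, v) for v in stored_values)
```

A node holding only the derived values cannot produce a valid secret itself, but it can recognize one when it is presented.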

Cryptography

This section gives an overview of the cryptography used in tailnet lock, and why it was chosen.

In designing tailnet lock, we wanted to use modern, stable cryptography. To that end, we chose the following primitives:

BLAKE2 is used for generating digests of AUMs

Tailnet lock generates digests of tailnet lock state changes using BLAKE2. These hashes are used to identify authority update messages (in the same manner that a commit hash in Git identifies a specific change in a repository). BLAKE2 was chosen due to its status as a proven replacement for older hash functions (such as SHA-1), offering reasonable speed alongside security properties such as preimage resistance and indifferentiability.

Ed25519 keys are used for signing operations

Ed25519 is an elliptic curve signature scheme, well known for its small key size, fast signing and signature verification, and resistance to a number of potential attacks or implementation errors. Tailnet lock keys are Ed25519 keypairs, and are used both to sign node keys and sign AUMs.

ZIP215 semantics are used for signature validity

Existing RFCs and implementations do not agree on identical verification rules for Ed25519 signatures. Identical verification across all nodes is critical to avoid divergence in distributed mechanisms such as tailnet lock. As such, tailnet lock verifies signatures using an implementation of the more tightly specified ZIP215 rules, so that every node reaches the same verdict on every signature. Tailnet lock signatures are not malleable.

Argon2 is used as the disablement secret key derivation function

The KDF-derived value of a disablement secret is known to all nodes, as they must be able to validate a disablement secret. As such, it needs to be infeasible to reverse the KDF-derivation of a disablement secret, or malevolent actors could compute the value and use it to disable tailnet lock.

Disablement secrets are 32 bytes of data sourced from the system CSPRNG. Argon2 is used as the KDF due to its good preimage properties and resistance to tradeoff attacks. Given the use of 32-byte random secrets, Argon2 is likely overkill, as a single cryptographic hash like BLAKE2 would probably be sufficient. Argon2 is still helpful in case a disablement secret with low entropy is used.

CBOR2 is used to encode signatures and update messages

While not strictly a cryptographic primitive, the choice of encoding has integral security implications and so is discussed here.

The main function of node key signatures and authority update messages is to carry signed data. Cryptographic signatures are over digests, so the serialized representation must be deterministic. Issues with other encodings (JWS/JWT, SAML, X509, etc.) have demonstrated that even subtle behaviors (such as how you handle invalid, unsupported, or unrecognized fields, as well as invariants in subsequent re-serialization) can easily lead to security-relevant logic bugs.

CBOR2 is one of the few encoding schemes that are appropriate for use with signatures and has security-conscious parsing and serialization rules baked into the specification. Tailnet lock uses the CTAP2 mode, which is well understood and widely implemented, and already proven for use in signing assertions through its use by FIDO2 devices.
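
To illustrate why deterministic serialization matters for signed data, the sketch below hashes the same logical message built in two different field orders. It deliberately uses canonical JSON from the Python standard library as a stand-in, not the encoding tailnet lock uses; the function names and message fields are hypothetical.

```python
import hashlib
import json


def canonical_bytes(message: dict) -> bytes:
    # Sorted keys and fixed separators give one byte string per logical message.
    return json.dumps(message, sort_keys=True, separators=(",", ":")).encode("utf-8")


def digest(message: dict) -> str:
    return hashlib.blake2s(canonical_bytes(message)).hexdigest()


first = {"kind": "AddKey", "key": "abc", "weight": 1}
second = {"weight": 1, "key": "abc", "kind": "AddKey"}  # same fields, other order

# Naive serialization differs between the two, so digests over it would too.
naive_differs = json.dumps(first) != json.dumps(second)

# Canonical serialization yields one digest, so one signature covers both.
same_digest = digest(first) == digest(second)
```

Without a canonical form, two honest nodes could serialize the same update differently and disagree on whether its signature is valid.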

User flows within tailnet lock

This section steps through the user flows that occur in tailnet lock, and explains what happens with each user action.

Enabling tailnet lock

When an administrator wishes to enable tailnet lock, they go through the enablement flow on one of their nodes (at time of writing, with the tailscale lock init command). As part of this process, they enumerate the set of TLKs they wish to trust, as well as the number of disablement secrets they wish to generate.

The node then performs a number of steps:

  1. The node first generates the specified number of disablement secrets and displays them. These are secrets, so they will not be stored nor shown again — the administrator must make careful note of the secret values.
  2. The node generates an AUM to describe the new (enabled) state of tailnet lock. This message is signed by the node’s own TLK, and describes all the initial parameters of tailnet lock, including the set of initially trusted TLKs and values needed for disablement.
  3. The node receives a list of all node keys in the tailnet from the control plane, which it then signs using its TLK. This is done so that when tailnet lock is enabled, all existing nodes have a valid node key signature and connectivity is not lost.
  4. Lastly, the initial AUM and the initial list of node key signatures are transmitted to the control plane for distribution.

The control plane then distributes the initial AUM and node key signatures to all nodes in the tailnet. Each node receives the AUM, uses it to bootstrap the initial state of its TKA, and then begins to enforce node key signatures for any new nodes added to the tailnet.

Signing node keys for new nodes

When a node is added to a locked tailnet, it will lack a node key signature and hence other nodes in the tailnet will not communicate with it, that is, it is locked out. An administrator can sign the new node’s node key with a trusted TLK (using the tailscale lock sign command) to allow it to make connections within the tailnet. This signature is then transmitted to the control plane, through which it is distributed to all nodes in the tailnet.

When a new node is added to the tailnet, it is signed by a trusted tailnet lock key, then distributed to peer nodes, which can verify the signature before allowing connections.

Disabling tailnet lock

When an administrator wishes to disable tailnet lock, they provide a valid disablement secret and perform the disablement flow on one of their nodes (with the tailscale lock disable command).

The node is able to verify the correctness of the disablement secret by comparing the KDF of the provided secret to stored values in the TKA. If correct, the node transmits the secret for distribution to all nodes.

Upon receiving the disablement secret, all nodes perform the same check, comparing the KDF of the disablement secret to a stored KDF value in their TKA. If a value matches, the node disables its TKA locally and reverts to an unlocked state.

Changing trusted tailnet lock keys

The set of trusted TLKs can be changed from any node that holds a currently trusted TLK. Any such change (adding, removing, or changing parameters of a trusted TLK) follows the same flow:

  1. The node generates and signs an AUM, which encodes details about the change. For instance, the process for adding a new trusted TLK would generate an AddKey AUM, which would encode the details of the new key.
  2. The generated and signed AUM is transmitted to the control plane for distribution. The control plane transmits the AUM to nodes in the tailnet.
  3. Each node receives the AUM from the control plane, verifies its signature, performs validation of the update, and applies the change to its local tailnet key authority.

Before a new trusted tailnet lock key is added, it must be signed by an existing trusted tailnet lock key. The coordination server distributes the new key to all nodes in the tailnet, which can verify the signature before trusting it.

Recovering from tailnet lock key compromise

If a node with a trusted TLK is compromised, the compromised key could be used to make malicious changes to tailnet lock, such that an attacker could add new nodes to the tailnet, and receive or send plaintext traffic from those nodes. A competent attacker might go further and remove all other trusted TLKs from the tailnet authority, in an attempt to gain exclusive control over the tailnet.

Recovering from this scenario involves rewriting the changelog of AUMs to erase the attacker-generated updates, and removing trust in the compromised keys. This can be done by generating a forking update, and signing the update with enough keys that the sum of their key weights is greater than that of the key or keys compromised by the attacker.
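
The weight comparison behind a recovery can be sketched as follows. The key names, weights, and function names are hypothetical, and the sketch only covers the strict-majority case described above; ties would fall through to the other fork resolution rules.

```python
from typing import Dict, Iterable


def total_weight(signers: Iterable[str], weights: Dict[str, int]) -> int:
    # Each key counts once, however many times it is listed.
    return sum(weights[s] for s in set(signers))


def recovery_wins(
    recovery_signers: Iterable[str],
    attacker_signers: Iterable[str],
    weights: Dict[str, int],
) -> bool:
    """True if the recovery fork strictly outweighs the attacker's update."""
    return total_weight(recovery_signers, weights) > total_weight(
        attacker_signers, weights
    )


# Hypothetical tailnet: the laptop's key is compromised.
weights = {"laptop": 1, "server": 1, "offline-key": 3}
weak_recovery = recovery_wins(["server"], ["laptop"], weights)
strong_recovery = recovery_wins(["server", "offline-key"], ["laptop"], weights)
```

In this example a recovery signed by the server key alone only ties with the attacker, while adding the heavier offline key makes the recovery fork win outright.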

Limitations

Tailnet lock protects against unauthorized nodes, not denial of service

Tailnet lock aims to prevent a compromised control plane from manipulating your tailnet so as to receive or send traffic from malicious nodes. It does not prevent a compromised control plane from breaking connectivity in your network, such as by failing to distribute new node keys.

We do have plans to make headway on this front. This may include allowing nodes to share AUMs directly instead of relying on the control plane for distribution, and having nodes cache last known good netmaps for connectivity.

Enablement is Trust on First Use (ToFU)

When tailnet lock is first enabled or a new node is added to the tailnet, each node relies on the initial state of tailnet lock sent to it from the control plane. If the control plane is compromised, a malicious initial state could be sent to each node with attacker-controlled TLKs. Similarly, if an already compromised node’s TLK is trusted, the initial state sent to each node would include attacker-controlled TLKs.

We have plans to implement a mechanism where administrators can define the expected tailnet lock state and devices will refuse to operate if the control plane attempts to provision them with an unexpected tailnet lock state. In the meantime, it is possible to verify the provisioned state of tailnet lock on each node by running the tailscale lock status command.
