AWS reference architecture
This document details best practices and a reference architecture for Tailscale deployments on Amazon Web Services (AWS). The guidance applies to all Tailscale modes of operation, including devices, exit nodes, and subnet routers.
- Tailscale device: for the purposes of this document, a Tailscale device can refer to a Tailscale node, exit node, subnet router, etc.
See Terminology and concepts for additional terms.
Tailscale provides a few options for connecting to resources within AWS. At a high level they are:
- Agent-to-Agent connectivity: e.g. connecting to “static” resources such as Amazon Elastic Compute Cloud (EC2) instances. This is recommended where you can install and run Tailscale directly on the resource you wish to connect to.
- IP-based connectivity with a Tailscale subnet router: e.g. connecting to managed AWS resources such as Amazon Relational Database Service (Amazon RDS), Amazon Redshift, etc. This is recommended where you cannot run Tailscale on the resource you are connecting to, or where you want to expose an existing subnet or services in a VPC to your tailnet.
- Kubernetes services and auth proxy with the Tailscale Kubernetes operator: expose services in your Amazon Elastic Kubernetes Service (EKS) cluster and your EKS cluster control plane directly to your Tailscale network. This is recommended where you are connecting to resources running in a Kubernetes cluster, or to a Kubernetes cluster’s control plane.
- Lambda and other container services: access resources on your tailnet from Lambda functions and other container solutions.
We recommend installing the Tailscale agent wherever possible—e.g. setting up servers on EC2 instances. This generally provides the best and most scalable connectivity while enabling Tailscale agent-based functionality such as Tailscale SSH.
For managed resources where you cannot install the Tailscale agent, such as AWS RDS, Amazon Redshift, etc., you can run a subnet router within your VPC to access these resources from Tailscale. Subnet routers can also be used to connect to resources via AWS PrivateLink and VPC endpoints.
See Access AWS RDS privately using Tailscale for a general guide.
The Tailscale Kubernetes operator allows you to expose services in your Kubernetes cluster to your Tailscale network, and use an API server proxy for secure connectivity to the Kubernetes control plane.
Tailscale supports userspace networking where processes in the container can connect to other resources on your Tailscale network via a SOCKS5 or HTTP proxy. This allows AWS Lambda, AWS App Runner, AWS Lightsail and other container-based solutions to connect to the Tailscale network with minimal configuration needed.
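As a rough illustration of the proxy approach described above, the sketch below builds a Python HTTP client that sends its requests through a local proxy. It assumes `tailscaled` is already running in the container in userspace networking mode with an outbound HTTP proxy listening on `localhost:1055`. That address, the helper names, and the hostname in the comment are assumptions for illustration, not values from this document.

```python
import urllib.request
from typing import Dict

# Assumed address of tailscaled's outbound HTTP proxy inside the container.
# Change this to match however tailscaled was started.
DEFAULT_PROXY = "http://localhost:1055"


def proxy_map(proxy_url: str = DEFAULT_PROXY) -> Dict[str, str]:
    """Return a scheme-to-proxy mapping that covers both HTTP and HTTPS."""
    return {"http": proxy_url, "https": proxy_url}


def tailnet_opener(proxy_url: str = DEFAULT_PROXY) -> urllib.request.OpenerDirector:
    """Build a urllib opener whose requests go through the local proxy."""
    handler = urllib.request.ProxyHandler(proxy_map(proxy_url))
    return urllib.request.build_opener(handler)


# Example use ("internal-service" is a hypothetical tailnet hostname):
#
#   opener = tailnet_opener()
#   with opener.open("http://internal-service/healthz", timeout=5) as resp:
#       print(resp.status)
```

Keeping the proxy address in one place makes it easy to switch between environments where the proxy listens on a different port.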
Below are general recommendations and best practices for running Tailscale in production environments. Much of what is listed below is explained in greater detail throughout this document:
- Run subnet routers, exit nodes, etc., separately from the systems you are administering with Tailscale—e.g. run your subnet routers outside of your Amazon EKS clusters.
- Run multiple subnet routers across multiple AWS availability zones to improve resiliency against zone failures with subnet router failover.
- Run multiple Tailscale SSH session recorder nodes across multiple AWS availability zones to improve resiliency against zone failures with recorder node failover.
- Deploy dynamically scaled resources (e.g. containers, serverless functions, etc.) as ephemeral nodes to automatically clean up devices after they shut down.
- When possible, deploy subnet routers, exit nodes, etc., to public subnets with public IP addresses to ensure direct connections and optimal performance.
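To make the ephemeral-node recommendation above more concrete, the following sketch builds a request body for an ephemeral, pre-approved auth key. It assumes the body shape of the Tailscale API's auth key creation endpoint (`capabilities.devices.create` with `ephemeral`, `reusable`, `preauthorized`, and `tags`); verify the field names against the current API reference before relying on it. The function name and the `tag:lambda` tag are hypothetical.

```python
import json
from typing import Dict, Iterable


def ephemeral_key_request(tags: Iterable[str],
                          expiry_seconds: int = 3600,
                          reusable: bool = True) -> Dict:
    """Build a request body for an ephemeral auth key (assumed API shape)."""
    return {
        "capabilities": {
            "devices": {
                "create": {
                    "reusable": reusable,    # many short-lived workers can share one key
                    "ephemeral": True,       # devices are removed after they go offline
                    "preauthorized": True,   # skip manual device approval
                    "tags": list(tags),      # tags must already be defined in the tailnet policy
                }
            }
        },
        "expirySeconds": expiry_seconds,
    }


if __name__ == "__main__":
    # "tag:lambda" is a hypothetical tag used only for illustration.
    print(json.dumps(ephemeral_key_request(["tag:lambda"]), indent=2))
```

A reusable key suits functions and containers that scale out, since every new instance authenticates with the same key and is cleaned up independently.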
When installing Tailscale on an EC2 instance as a “normal” Tailscale device (i.e. not a subnet router, exit node, etc.), you have likely already sized the instance for its workload, and running Tailscale on it will likely add negligible resource usage.
Many variables affect performance and workloads vary widely, so we do not have specific size recommendations. We do, however, have general guidance for selecting an instance type for an EC2 instance running as a subnet router or exit node:
- In general, higher CPU clock speed is more important than more cores.
- In general, instances with ARM-based AWS Graviton processors are quite cost effective for packet forwarding.
- Use a non-burstable instance type to achieve consistent CPU and network performance.
  - Per AWS documentation, burstable performance instances (such as T4g, T3a, T2, etc.) use a CPU credit mechanism which can result in variable performance.
- Use an instance type with more than 16 vCPUs (e.g. 24 vCPUs or more) to ensure consistent network performance.
  - Per AWS documentation, instances with 16 vCPUs or fewer use a network I/O credit mechanism to burst beyond baseline bandwidth.
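The guidance above can be sketched as a small check that flags instance types likely to give variable performance. This is an illustrative heuristic, not an official sizing tool. It assumes that burstable families can be recognized by a `t` followed by a digit (T2, T3, T3a, T4g), and it takes the vCPU count from the caller instead of looking it up. The function names are hypothetical.

```python
import re
from typing import List

# Burstable families start with "t" plus a generation digit: t2, t3, t3a, t4g.
BURSTABLE_FAMILY = re.compile(r"^t\d")


def is_burstable(instance_type: str) -> bool:
    """Return True if the instance type belongs to a burstable (T) family."""
    family = instance_type.lower().split(".")[0]
    return bool(BURSTABLE_FAMILY.match(family))


def sizing_warnings(instance_type: str, vcpus: int) -> List[str]:
    """List reasons an instance may perform inconsistently as a subnet router or exit node."""
    warnings = []
    if is_burstable(instance_type):
        warnings.append("burstable family: CPU credits can cause variable performance")
    if vcpus <= 16:
        warnings.append("16 vCPUs or fewer: network bandwidth relies on I/O credits")
    return warnings


if __name__ == "__main__":
    for name, vcpus in [("t3.large", 2), ("c7g.8xlarge", 32)]:
        print(name, sizing_warnings(name, vcpus) or "no warnings")
```

A warning does not mean the instance is unsuitable; smaller instances can be fine for light traffic where occasional variability is acceptable.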
Tailscale uses various NAT traversal techniques to safely connect to other Tailscale nodes without manual intervention. Nearly all of the time, you do not need to open any firewall ports for Tailscale. However, if your VPC and security groups are overly restrictive about internet-bound egress traffic, refer to What firewall ports should I open to use Tailscale.
Tailscale devices deployed to a public subnet with a public IP address will benefit from direct connections between nodes for the best performance.
Tailscale uses both direct and relayed connections, opting for direct connections where possible. AWS NAT Gateway is known to impede direct connections, which causes traffic to fall back to Tailscale DERP relay servers. This does not cause connectivity issues, but can lead to lower throughput and performance than direct connections.
If you must deploy Tailscale such that internet-bound connections go through an AWS NAT Gateway (e.g. to reuse existing IP addresses that are allow-listed to third parties), contact your Tailscale account team to discuss more advanced deployment options that utilize public and private subnet routing on a single EC2 instance.
To allow non-AWS devices on your tailnet to resolve VPC-specific DNS records, configure split DNS to forward queries for internal AWS domains to the Amazon Route 53 Resolver of your VPC. If you have multiple VPCs, associate additional VPCs with a private hosted zone to enable DNS resolution across all of them.
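When setting up split DNS, you need the address of the VPC's resolver. AWS reserves the VPC network address plus two for the Amazon-provided DNS server, using the primary CIDR block when a VPC has several. The sketch below derives that address and pairs it with a domain. The function names, the example CIDR, and the example domain are assumptions for illustration.

```python
import ipaddress
from typing import Dict, List


def vpc_resolver_ip(primary_vpc_cidr: str) -> str:
    """Return the Amazon-provided DNS address: the VPC base address plus two."""
    network = ipaddress.ip_network(primary_vpc_cidr)
    return str(network.network_address + 2)


def split_dns_entry(domain: str, primary_vpc_cidr: str) -> Dict[str, List[str]]:
    """Map an internal domain to the resolver that should answer queries for it."""
    return {domain: [vpc_resolver_ip(primary_vpc_cidr)]}


if __name__ == "__main__":
    # Example values only: substitute your own VPC CIDR and internal domain.
    print(split_dns_entry("eu-west-1.compute.internal", "10.0.0.0/16"))
```

The resolver address is only reachable from inside the VPC, so tailnet devices outside AWS need a subnet router advertising a route that covers it.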
VPC peering and transit VPCs are a common strategy for connecting multiple VPCs together. You can deploy a subnet router (or a set for subnet router failover) within a VPC to allow access to multiple VPCs.
If you have VPCs or subnets with overlapping IPv4 addresses, use 4via6 subnet routers to access resources with unique IPv6 addresses for each overlapping subnet.
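To show how overlapping IPv4 ranges end up with distinct IPv6 routes, the sketch below computes a 4via6 route from a site ID and an IPv4 CIDR. It assumes the mapping places the site ID and the IPv4 address in the low 64 bits of the `fd7a:115c:a1e0:b1a::/64` range, and that site IDs fall between 0 and 65535. Treat the output of `tailscale debug via` as authoritative; the function name here is hypothetical.

```python
import ipaddress

# Assumed 4via6 range; the low 64 bits hold the site ID and the IPv4 address.
VIA_RANGE = ipaddress.IPv6Network("fd7a:115c:a1e0:b1a::/64")


def four_via_six(site_id: int, ipv4_cidr: str) -> ipaddress.IPv6Network:
    """Map an IPv4 subnet to a 4via6 IPv6 route for the given site ID."""
    if not 0 <= site_id <= 0xFFFF:
        raise ValueError("site ID must be between 0 and 65535")
    v4 = ipaddress.IPv4Network(ipv4_cidr)
    address = (
        int(VIA_RANGE.network_address)
        | (site_id << 32)             # site ID sits just above the IPv4 bits
        | int(v4.network_address)     # IPv4 address fills the last 32 bits
    )
    # 64 bits of range prefix + 32 bits of site ID + the IPv4 prefix length.
    return ipaddress.IPv6Network((address, 96 + v4.prefixlen))


if __name__ == "__main__":
    # Two VPCs with the same IPv4 range get different routes via different site IDs.
    print(four_via_six(1, "10.0.0.0/16"))
    print(four_via_six(2, "10.0.0.0/16"))
```

Because the site ID is part of the address, each subnet router can advertise the same IPv4 range without conflicting with the others.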
Multiple subnet routers can be deployed and configured to advertise the same routes to achieve subnet router failover. This ensures users of your network can continue to access resources if one routing device goes offline. On AWS, deploy two or more subnet routers across multiple availability zones to have better isolation and protection against zone failures.
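One way to check this recommendation against an inventory is sketched below: given a list of subnet routers, it reports every advertised route that is not covered from at least two availability zones. The input format (dictionaries with `zone` and `routes` keys) and the function name are assumptions for illustration, not a Tailscale or AWS data structure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def routes_lacking_zone_redundancy(routers: Iterable[dict],
                                   min_zones: int = 2) -> Dict[str, List[str]]:
    """Return routes advertised from fewer than min_zones availability zones."""
    zones_by_route = defaultdict(set)
    for router in routers:
        for route in router["routes"]:
            zones_by_route[route].add(router["zone"])
    return {
        route: sorted(zones)
        for route, zones in zones_by_route.items()
        if len(zones) < min_zones
    }


if __name__ == "__main__":
    # Hypothetical inventory: two routers share one route, a third stands alone.
    inventory = [
        {"name": "router-a", "zone": "us-east-1a", "routes": ["10.0.0.0/16"]},
        {"name": "router-b", "zone": "us-east-1b", "routes": ["10.0.0.0/16"]},
        {"name": "router-c", "zone": "us-east-1a", "routes": ["10.1.0.0/16"]},
    ]
    print(routes_lacking_zone_redundancy(inventory))
```

Two routers in the same zone still count as one zone here, since a zone failure would take both offline.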
Organizations often use Tailscale to connect to and administer their EKS clusters, Amazon Elastic Container Service (ECS) deployments, etc. While Tailscale can run within a container and be deployed to EKS or ECS, we recommend running your subnet routers externally to these clusters to ensure connectivity is available in the event your cluster is having issues. In other words, run your subnet router on dedicated EC2 instances or on an EKS cluster separate from the cluster you’re administering.
Deploy multiple session recorder instances across multiple availability zones to improve resiliency against zone failures. If your organization operates across multiple regions, consider deploying SSH session recording nodes in each region where you operate, and configure SSH access rules so that nodes send recordings to the recorder in their local region.
Configure Tailscale SSH session recording to persist recordings to an Amazon S3 bucket to reduce operational concerns such as available storage space, durability, access controls, etc.
Create an IAM policy with the minimum required permissions to store recordings in an S3 bucket and to view them (for the recorder web UI).
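As a starting point, the sketch below generates a bucket-scoped IAM policy document. The action list is an assumption: `s3:PutObject` to store recordings, and `s3:GetObject` plus `s3:ListBucket` to view them. Confirm the exact permissions in the session recorder documentation before use. The function name and bucket name are hypothetical.

```python
import json
from typing import Dict


def recorder_bucket_policy(bucket: str) -> Dict:
    """Build an IAM policy limited to one recordings bucket (assumed action list)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "StoreAndViewRecordings",
                "Effect": "Allow",
                # Object-level actions apply to keys inside the bucket.
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Sid": "ListRecordings",
                "Effect": "Allow",
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


if __name__ == "__main__":
    # "example-ssh-recordings" is a placeholder bucket name.
    print(json.dumps(recorder_bucket_policy("example-ssh-recordings"), indent=2))
```

Scoping each statement to a single bucket ARN keeps the recorder from reading or writing anywhere else in the account.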
Given the sensitive nature of SSH session recordings, follow the AWS security best practices for Amazon S3.
Configure S3 lifecycle rules so that recording files transition to another storage class and/or expire and delete after a time period that meets your organization’s requirements.
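A lifecycle rule of this kind might look like the sketch below, which builds a configuration in the shape accepted by the S3 `PutBucketLifecycleConfiguration` operation. The day counts, storage class, rule ID, and function name are placeholder assumptions; choose values that match your organization's retention requirements.

```python
from typing import Dict


def recording_lifecycle(transition_days: int = 30,
                        expire_days: int = 365,
                        storage_class: str = "STANDARD_IA",
                        prefix: str = "") -> Dict:
    """Build a lifecycle configuration that transitions, then expires, recordings."""
    if expire_days <= transition_days:
        raise ValueError("expire_days must be greater than transition_days")
    return {
        "Rules": [
            {
                "ID": "ssh-recordings-retention",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},      # an empty prefix matches every object
                "Transitions": [
                    {"Days": transition_days, "StorageClass": storage_class}
                ],
                "Expiration": {"Days": expire_days},
            }
        ]
    }


# Applying it with boto3 (requires AWS credentials; the bucket name is a placeholder):
#
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-ssh-recordings",
#       LifecycleConfiguration=recording_lifecycle(),
#   )
```

Validating that expiry comes after the transition catches a common mistake before the configuration ever reaches AWS.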