AWS reference architecture
This document details best practices and a reference architecture for Tailscale deployments on Amazon Web Services (AWS). The following guidance applies to all Tailscale modes of operation—devices, exit nodes, subnet routers, and the like.
Terminology
- Tailscale device—for the purposes of this document, a Tailscale device can refer to a Tailscale node, an exit node, a subnet router, and the like.
See Terminology and concepts for additional terms.
High-level architecture
Ways to deploy Tailscale to connect to and from AWS resources
Tailscale provides a few options for connecting to resources within AWS. At a high level, they are:
- Agent-to-Agent connectivity—connect to “static” resources such as Amazon Elastic Compute Cloud (EC2) instances. This is recommended where you can install and run Tailscale directly on the resource you wish to connect to.
- IP-based connectivity with a Tailscale subnet router—connect to managed AWS resources such as Amazon Relational Database Service (RDS) or Amazon Redshift. This is recommended where you cannot run Tailscale on the resource you are connecting to, or want to expose an existing subnet or services in a VPC to your tailnet.
- DNS-based routing with a Tailscale app connector—connect to software as a service (SaaS) applications or other resources over your tailnet with DNS-based routing.
- Kubernetes services and auth proxy with Tailscale Kubernetes operator—expose services in your Amazon Elastic Kubernetes Service (EKS) cluster and your EKS cluster control plane directly to your Tailscale network. This is recommended where you are connecting to resources running in a Kubernetes cluster, or to a Kubernetes cluster's control plane.
- Lambda and other container services—access resources in your tailnet from Lambda functions and other container solutions.
Agent-to-Agent connectivity
We recommend installing the Tailscale agent wherever possible—for example, setting up servers on EC2 instances. This generally provides the best and most scalable connectivity while enabling Tailscale agent-based functionality such as Tailscale SSH.
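For example, Tailscale can be installed when an instance launches via EC2 user data. The sketch below assumes a Linux AMI with `curl` available; `TS_AUTHKEY` is a placeholder for an auth key generated in the Tailscale admin console:

```shell
#!/bin/sh
# EC2 user data sketch: install Tailscale and join the tailnet at boot.
curl -fsSL https://tailscale.com/install.sh | sh

# Join the tailnet; --ssh enables Tailscale SSH for agent-based access.
# TS_AUTHKEY and the hostname are placeholders for your own values.
tailscale up --authkey="${TS_AUTHKEY}" --ssh --hostname="app-server-1"
```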
IP-based connectivity with subnet router
For managed resources where you cannot install the Tailscale agent, such as AWS RDS, Amazon Redshift, and similar services, you can run a subnet router within your VPC to access these resources from Tailscale. Subnet routers can also be used to connect to resources via AWS PrivateLink and VPC endpoints.
See Access AWS RDS privately using Tailscale for a general guide.
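As a sketch, a subnet router instance needs IP forwarding enabled and must advertise the VPC CIDR to the tailnet; the `10.0.0.0/16` CIDR below is a placeholder for your VPC's actual range:

```shell
# Enable IPv4 and IPv6 forwarding so the instance can route VPC traffic.
echo 'net.ipv4.ip_forward = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf
echo 'net.ipv6.conf.all.forwarding = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf
sudo sysctl -p /etc/sysctl.d/99-tailscale.conf

# Advertise the VPC CIDR (placeholder); approve the route in the
# Tailscale admin console afterward.
sudo tailscale up --advertise-routes=10.0.0.0/16
```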
DNS-based routing with an app connector
App connectors let you route traffic bound for SaaS applications or managed services by proxying DNS for the target domains and advertising the subnet routes for the observed DNS results. This is useful for cases where the application has an allowlist of IP addresses which can connect to it; the IP address of the nodes running an app connector can be added to the allowlist, and all nodes in the tailnet will use that IP address for their traffic egress.
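The apps and domains an app connector handles are defined in the tailnet policy file and admin console; on the node itself, a minimal sketch is to advertise the node as a connector:

```shell
# Advertise this node as an app connector. Which applications and
# domains it proxies is configured in the tailnet policy file.
sudo tailscale up --advertise-connector
```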
Kubernetes services and API server proxy with Tailscale Kubernetes operator
The Tailscale Kubernetes operator lets you expose services in your Kubernetes cluster to your Tailscale network, and use an API server proxy for secure connectivity to the Kubernetes control plane.
Lambda and other container services
Tailscale supports userspace networking where processes in the container can connect to other resources on your Tailscale network via a SOCKS5 or HTTP proxy. This allows AWS Lambda, AWS App Runner, AWS Lightsail and other container-based solutions to connect to the Tailscale network with minimal configuration needed.
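In a container image for Lambda or a similar service, a hedged sketch looks like the following: start `tailscaled` in userspace-networking mode with a local SOCKS5 proxy, join the tailnet, and point the application's outbound traffic at the proxy. The port and `TS_AUTHKEY` are placeholders; using an ephemeral auth key lets the node clean itself up after shutdown:

```shell
# Start tailscaled with userspace networking and a local SOCKS5 proxy.
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &

# Join the tailnet (auth key placeholder; use an ephemeral key so the
# node is removed automatically when the container exits).
tailscale up --authkey="${TS_AUTHKEY}"

# Route the application's outbound traffic through the SOCKS5 proxy.
export ALL_PROXY=socks5://localhost:1055
```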
Production best practices
Below are general recommendations and best practices for running Tailscale in production environments. Much of what is listed below is explained in greater detail throughout this document:
- When possible, deploy subnet routers, exit nodes, app connectors, and the like, to public subnets with public IP addresses to ensure direct connections and optimal performance.
- Run subnet routers, exit nodes, app connectors, and the like, separately from the systems you are administering with Tailscale—for example, run your subnet routers outside of your Amazon EKS clusters.
- Deploy dynamically scaled resources (for example, containers or serverless functions) as ephemeral nodes to automatically clean up devices after they shut down.
High availability and regional routing
- Run multiple subnet routers and app connectors across multiple AWS availability zones to improve resiliency against zone failures with high availability failover, and deploy across multiple regions for regional routing.
- Run multiple Tailscale SSH session recorder nodes across multiple AWS availability zones to improve resiliency against zone failures with recorder node failover, and deploy across multiple regions for regional routing.
Performance best practices
See Performance best practices for general recommendations.
In-region load balancing
Deploy multiple overlapping connectors within a DERP region to take advantage of in-region load balancing, which spreads load evenly across the connectors on a best-effort basis and enables in-region redundancy.
Recommended instance sizing
Normal usage
When installing Tailscale on an EC2 instance as a “normal” Tailscale device (that is, not a subnet router or exit node), you have likely already sized that instance to a suitable instance type for its workload, and running Tailscale on it will add negligible resource usage.
Subnet routers, exit nodes, and app connectors
Many variables affect performance and workloads vary widely, so we do not have specific size recommendations. However, we do have general guidance for selecting an instance type for an EC2 instance running as a subnet router, exit node, or app connector:
- In general, higher CPU clock speed is more important than more cores.
- In general, instances with ARM-based AWS Graviton processors are quite cost-effective for packet forwarding.
- Use a non-burstable instance type to achieve consistent CPU and network performance.
- Per AWS documentation, burstable performance instances (such as T4g, T3a, T2, and the like) use a CPU credit mechanism which can result in variable performance.
- Use an instance type with greater than 16 vCPUs (for example, 24 vCPUs or more) to ensure consistent network performance.
- Per AWS documentation, instances with 16 vCPUs or fewer use a network I/O credit mechanism to burst beyond baseline bandwidth.
Using Tailscale with AWS
Security groups
Tailscale uses various NAT traversal techniques to safely connect to other Tailscale nodes without manual intervention. In nearly all cases, you do not need to open any firewall ports for Tailscale. However, if your VPC and security groups are overly restrictive about internet-bound egress traffic, refer to What firewall ports should I open to use Tailscale?
Public vs private subnets
Tailscale devices deployed to a public subnet with a public IP address will benefit from direct connections between nodes for the best performance.
AWS NAT Gateway
Tailscale uses both direct and relayed connections, opting for direct connections where possible. AWS NAT Gateway is known to impede direct connections, causing traffic to fall back to Tailscale DERP relay servers. This does not cause connectivity issues, but it can lead to lower throughput and performance than direct connections.
If you must deploy Tailscale such that internet-bound connections go through an AWS NAT Gateway (for example, to reuse existing IP addresses that are allow-listed to third parties), contact your Tailscale account team to discuss more advanced deployment options that utilize public and private subnet routing on a single EC2 instance.
Egress-only internet gateway
An egress-only internet gateway attached to a private subnet allows direct connections to peers that have IPv6 addresses. Nodes that only have IPv4 available remain reachable via DERP relays, which have both IPv4 and IPv6 connectivity.
VPC DNS resolution
To allow non-AWS devices in your tailnet to resolve VPC-specific DNS records, configure split DNS to forward queries for internal AWS domains to the Amazon Route 53 Resolver of your VPC. If you have multiple VPCs, associate additional VPCs with a private hosted zone to enable DNS resolution across all of them.
VPC peering and transit VPCs
VPC peering and transit VPCs are a common strategy for connecting multiple VPCs together. You can deploy a subnet router (or a set for high availability) within a VPC to allow access to multiple VPCs.
If you have VPCs or subnets with overlapping IPv4 addresses, use 4via6 subnet routers to access resources with unique IPv6 addresses for each overlapping subnet.
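The `tailscale debug via` helper computes the 4via6 route for a given site ID and IPv4 CIDR; in the sketch below, the site ID `7` and CIDR `10.1.0.0/16` are placeholders for your own values:

```shell
# Compute the 4via6 route for site ID 7 covering 10.1.0.0/16
# (both values are placeholders).
tailscale debug via 7 10.1.0.0/16

# Then advertise the printed IPv6 route from that site's subnet router,
# for example:
# sudo tailscale up --advertise-routes=fd7a:115c:a1e0:b1a:0:7:a01:0/112
```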
Subnet routers
Operating a subnet router within Amazon EKS or Amazon ECS
Organizations often use Tailscale to connect to and administer their EKS clusters, Amazon Elastic Container Service (ECS) deployments, and the like. While Tailscale can run within a container and be deployed to EKS or ECS, we recommend running your subnet routers externally to these clusters to ensure connectivity remains available in the event your cluster is having issues. In other words, run your subnet router on dedicated EC2 instances or on an EKS cluster separate from the cluster you're administering.
Tailscale SSH session recording
Deploy multiple session recorder instances across multiple availability zones to improve resiliency against zone failures. If your organization operates across multiple regions, consider deploying SSH session recording nodes in each region you operate and configure SSH access rules to send recording information to the local region for your nodes.
S3 for recording persistence
Configure Tailscale SSH session recording to persist recordings to an Amazon S3 bucket to reduce operational concerns such as available storage space, durability, and access controls.
Minimum-required IAM policy
Create an IAM policy with the minimum-required permissions to store and view (for the recorder web UI) recordings with an S3 bucket.
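As a sketch, the policy can be scoped to object read/write plus bucket listing on a single recordings bucket. The bucket name and exact action set below are assumptions; verify them against current Tailscale documentation for the session recorder:

```shell
# Create a minimally scoped policy for the session recorder (sketch only;
# the bucket name and action list are placeholders to verify).
cat > recorder-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::my-recording-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::my-recording-bucket"
    }
  ]
}
EOF

aws iam create-policy \
  --policy-name tailscale-session-recorder-s3 \
  --policy-document file://recorder-s3-policy.json
```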
Storage bucket locations
If your organization operates across multiple AWS regions, consider creating a storage bucket in each region you operate in, and configure your recorder nodes to persist recordings to the bucket in their local region.
Storage bucket policy
Given the sensitive nature of SSH session recordings, follow the AWS security best practices for Amazon S3.
Storage lifecycle rules
Configure S3 lifecycle rules so that recording files transition to another storage class, expire, or both after a retention period that meets your organization’s requirements.
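For example, a lifecycle configuration could transition recordings to a colder storage class and later delete them. In the sketch below, the bucket name and the 90- and 365-day periods are placeholders for your own retention policy:

```shell
# Sketch: transition recordings to Glacier after 90 days and delete
# them after 365 days (all values are placeholders).
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "recording-retention",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [
        {"Days": 90, "StorageClass": "GLACIER"}
      ],
      "Expiration": {"Days": 365}
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-recording-bucket \
  --lifecycle-configuration file://lifecycle.json
```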