Azure reference architecture
This document details best practices and a reference architecture for Tailscale deployments on Microsoft Azure. The following guidance applies to all Tailscale modes of operation—devices, exit nodes, subnet routers, etc.
- Tailscale device—for the purposes of this document, a Tailscale device can refer to a Tailscale node, exit node, subnet router, etc.
See Terminology and concepts for additional terms.
Tailscale provides a few options for connecting to resources within Azure. At a high level, they are:
- Agent-to-agent connectivity—connect to “static” resources such as virtual machines (VMs). This is recommended where you can install and run Tailscale directly on the resource you wish to connect to.
- IP-based connectivity with a Tailscale subnet router—connect to managed Azure resources such as Azure SQL, Azure Cosmos DB, etc. This is recommended where you cannot run Tailscale on the resource you are connecting to, or where you want to expose an existing subnet or services in a virtual network to your tailnet.
- DNS-based routing with a Tailscale app connector—connect to software as a service (SaaS) applications or other resources over your tailnet with DNS-based routing.
- Kubernetes services and auth proxy with Tailscale Kubernetes operator—expose services in your Azure Kubernetes Service (AKS) cluster and your AKS cluster control plane directly to your Tailscale network. This is recommended where you are connecting to resources running in a Kubernetes cluster, or to a Kubernetes cluster’s control plane.
- Azure Functions, Azure App Service, and other container solutions—access resources on your tailnet from Azure Functions, Azure App Service, and other container solutions.
We recommend installing the Tailscale agent wherever possible—for example, when setting up servers on Azure VMs. This generally provides the best and most scalable connectivity while enabling agent-based functionality such as Tailscale SSH.
See Access Azure Linux VMs privately using Tailscale and Access Azure Windows VMs privately using Tailscale for general guides.
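For illustration only, the following Python sketch shows how a provisioning script might bootstrap the agent on an Azure Linux VM and enable Tailscale SSH. The auth key is a placeholder; generate a real one in the admin console.

```python
# Sketch: bootstrap the Tailscale agent on an Azure Linux VM from a provisioning
# script. The auth key below is a placeholder; --ssh enables Tailscale SSH on
# this node.
import subprocess

AUTH_KEY = "tskey-auth-EXAMPLE"  # placeholder auth key

# Install the agent using Tailscale's install script, then join the tailnet.
subprocess.run("curl -fsSL https://tailscale.com/install.sh | sh", shell=True, check=True)
subprocess.run(["tailscale", "up", f"--auth-key={AUTH_KEY}", "--ssh"], check=True)
```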
For managed resources where you cannot install the Tailscale agent, such as Azure SQL, Azure Cosmos DB, etc., you can run a subnet router within your virtual network to access these resources from Tailscale.
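As a rough sketch (assuming Tailscale is already installed on the VM, and using a placeholder auth key and an example virtual network range of 10.0.0.0/16), configuring a Linux VM as a subnet router looks like the following. Remember to approve the advertised routes in the admin console.

```python
# Sketch: configure an Azure Linux VM as a Tailscale subnet router.
# The CIDR and auth key are placeholders -- substitute your virtual network's
# address space and a key generated in the admin console.
import subprocess

VNET_CIDR = "10.0.0.0/16"        # example virtual network address space
AUTH_KEY = "tskey-auth-EXAMPLE"  # placeholder auth key

# Enable IP forwarding so the VM can route packets for the advertised subnet.
subprocess.run(["sysctl", "-w", "net.ipv4.ip_forward=1"], check=True)
subprocess.run(["sysctl", "-w", "net.ipv6.conf.all.forwarding=1"], check=True)

# Advertise the virtual network's routes to the tailnet.
subprocess.run(
    ["tailscale", "up", f"--auth-key={AUTH_KEY}", f"--advertise-routes={VNET_CIDR}"],
    check=True,
)
```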
App connectors allow you to route traffic bound for SaaS applications or managed services by proxying DNS for the target domains and advertising subnet routes for the observed DNS results. This is useful where the application maintains an allowlist of IP addresses that can connect to it: the IP addresses of the nodes running the app connector can be added to the allowlist, and all nodes on the tailnet will egress traffic for that application through those addresses.
The Tailscale Kubernetes operator allows you to expose services in your Kubernetes cluster to your Tailscale network, and use an API server proxy for secure connectivity to the Kubernetes control plane.
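For illustration, assuming the operator is already installed in your AKS cluster, you can expose an existing Service by switching it to the tailscale load balancer class. The sketch below uses the official Kubernetes Python client; the Service name and namespace are hypothetical.

```python
# Sketch: expose an existing AKS Service on the tailnet via the Tailscale
# Kubernetes operator by setting spec.type=LoadBalancer and
# spec.loadBalancerClass=tailscale. Service name and namespace are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
core = client.CoreV1Api()

patch = {"spec": {"type": "LoadBalancer", "loadBalancerClass": "tailscale"}}
core.patch_namespaced_service(name="internal-api", namespace="default", body=patch)
```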
Tailscale supports userspace networking, in which processes in the container can connect to other resources on your Tailscale network through a SOCKS5 or HTTP proxy. This allows Azure Functions, Azure App Service, and other container-based solutions to connect to your Tailscale network with minimal configuration.
See Using Tailscale on Azure App Service for a general guide.
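For example, if tailscaled is started in userspace mode with a local SOCKS5 proxy (for instance, --tun=userspace-networking --socks5-server=localhost:1055), application code in the same container can reach tailnet services through that proxy. The sketch below uses Python's requests library (with the requests[socks] extra); the MagicDNS hostname is hypothetical.

```python
# Sketch: reach a tailnet service from an Azure Function or App Service container
# where tailscaled runs in userspace mode with a SOCKS5 proxy on localhost:1055.
# "internal-api" is a hypothetical MagicDNS name on your tailnet.
import requests

proxies = {
    # socks5h:// resolves DNS through the proxy, so MagicDNS names work.
    "http": "socks5h://localhost:1055",
    "https": "socks5h://localhost:1055",
}

resp = requests.get("http://internal-api/healthz", proxies=proxies, timeout=10)
print(resp.status_code)
```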
Below are general recommendations and best practices for running Tailscale in production environments. Much of what is listed below is explained in greater detail throughout this document:
- When possible, deploy subnet routers, exit nodes, app connectors, etc., to public subnets with public IP addresses to ensure direct connections and optimal performance.
- Run subnet routers, exit nodes, app connectors, etc., separately from the systems you are administering with Tailscale—for example, run your subnet routers outside of your AKS clusters.
- Deploy dynamically scaled resources (for example, containers and serverless functions) as ephemeral nodes so that devices are automatically cleaned up after they shut down (see the ephemeral auth key sketch following these recommendations).
- Run multiple subnet routers and app connectors across multiple Azure availability zones to improve resiliency against zone failures with high availability failover, and deploy across multiple regions for regional routing.
- Run multiple Tailscale SSH session recorder nodes across multiple Azure availability zones to improve resiliency against zone failures with recorder node failover, and deploy across multiple regions for regional routing.
See Performance best practices for general recommendations.
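For dynamically scaled workloads, the following sketch mints an ephemeral, pre-authorized auth key with the Tailscale API; the tailnet name, TS_API_KEY environment variable, and device tag are placeholders.

```python
# Sketch: mint an ephemeral, pre-authorized auth key so dynamically scaled
# workloads register as ephemeral nodes. TAILNET, the TS_API_KEY environment
# variable, and tag:container are placeholders.
import os
import requests

TAILNET = "example.com"
resp = requests.post(
    f"https://api.tailscale.com/api/v2/tailnet/{TAILNET}/keys",
    auth=(os.environ["TS_API_KEY"], ""),  # API access token as the basic-auth username
    json={
        "capabilities": {
            "devices": {
                "create": {
                    "reusable": True,
                    "ephemeral": True,  # node is removed after it goes offline
                    "preauthorized": True,
                    "tags": ["tag:container"],
                }
            }
        },
        "expirySeconds": 3600,
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["key"])  # pass to `tailscale up --auth-key=...` in the workload
```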
When installing Tailscale on an Azure VM as a “normal” Tailscale device (that is, not a subnet router, exit node, etc.), you have likely already sized that VM appropriately for its workload, and running Tailscale on it will add negligible resource usage.
Many variables affect performance and workloads vary widely, so we do not have specific size recommendations. We do, however, have general guidance for selecting a size for an Azure VM running as a subnet router, exit node, or app connector:
- In general, higher CPU clock speed is more important than more cores.
- In general, VMs with Ampere Altra Arm-based processors are quite cost-effective for packet forwarding.
- Use a non-burstable VM type to achieve consistent CPU performance.
- Per Azure documentation, burstable performance machine sizes (such as B-series VMs) use a CPU credit mechanism which can result in variable performance.
- Tailscale generally performs better on Linux than on other operating systems due to additional optimizations implemented there.
Tailscale uses various NAT traversal techniques to safely connect to other Tailscale nodes without manual intervention. Nearly all of the time, you do not need to open any firewall ports for Tailscale. However, if your virtual network and network security groups are overly restrictive about internet-bound egress traffic, refer to What firewall ports should I open to use Tailscale?
Tailscale devices deployed to a public subnet with a public IP address will benefit from direct connections between nodes for the best performance.
Tailscale uses both direct and relayed connections, opting for direct connections where possible. Azure NAT Gateway is known to impede direct connections, causing traffic to fall back to Tailscale DERP relay servers. This does not cause connectivity issues, but it can result in lower throughput and performance than direct connections.
If you must deploy Tailscale such that internet-bound connections go through an Azure NAT Gateway (for example, to reuse existing IP addresses that are allowlisted with third parties), contact your Tailscale account team to discuss more advanced deployment options that utilize public and private subnet routing on a single Azure VM.
To allow non-Azure devices on your tailnet to query Azure DNS private zones, create an Azure DNS Private Resolver for your virtual network and configure split DNS to forward queries for your virtual network's internal domain name suffix to the IP address of the resolver's inbound endpoint.
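For illustration, once the resolver's inbound endpoint exists, you can point split DNS at it using the Tailscale API's split DNS endpoint; the domain suffix, resolver IP, and tailnet name below are placeholders.

```python
# Sketch: configure split DNS so queries for the virtual network's internal
# domain are forwarded to the Azure DNS Private Resolver inbound endpoint.
# The domain, resolver IP, tailnet name, and TS_API_KEY are placeholders.
import os
import requests

TAILNET = "example.com"
resp = requests.patch(
    f"https://api.tailscale.com/api/v2/tailnet/{TAILNET}/dns/split-dns",
    auth=(os.environ["TS_API_KEY"], ""),
    json={"internal.contoso.com": ["10.0.1.4"]},  # domain suffix -> resolver inbound IP
    timeout=10,
)
resp.raise_for_status()
```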
Virtual network peering is a common strategy for connecting multiple virtual networks together. You can deploy a subnet router (or a set for high availability) within a virtual network to allow access to multiple virtual networks.
If you have virtual networks or subnets with overlapping IPv4 addresses, use 4via6 subnet routers to access resources with unique IPv6 addresses for each overlapping subnet.
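The tailscale debug via command computes these site-scoped IPv6 routes for you; as an illustration of the mapping only, the sketch below reproduces it for an example site ID and subnet.

```python
# Sketch: compute the 4via6 route for a site ID and an IPv4 subnet, mirroring
# `tailscale debug via <site-id> <ipv4-cidr>`. Site ID 1 and 10.1.0.0/16 are
# example values for one of several sites with overlapping address space.
import ipaddress

def via_route(site_id: int, ipv4_cidr: str) -> str:
    net = ipaddress.ip_network(ipv4_cidr)
    base = int(ipaddress.ip_address("fd7a:115c:a1e0:b1a::"))  # 4via6 prefix
    addr = base | (site_id << 32) | int(net.network_address)
    return f"{ipaddress.ip_address(addr)}/{96 + net.prefixlen}"

print(via_route(1, "10.1.0.0/16"))  # fd7a:115c:a1e0:b1a:0:1:a01:0/112
```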
Organizations often use Tailscale to connect to and administer their AKS clusters. While Tailscale can run within a container and be deployed to AKS, we recommend running your subnet routers outside of these clusters so that connectivity remains available even when a cluster is having issues. In other words, run your subnet routers on dedicated VMs or on an AKS cluster separate from the cluster you're administering.
Deploy multiple session recorder instances across multiple availability zones to improve resiliency against zone failures. If your organization operates across multiple regions, consider deploying SSH session recording nodes in each region in which you operate, and configure SSH access rules to send recordings to the recorder nodes in the local region.