Troubleshoot Azure NAT Gateway connection issues (2023)

This article provides guidance on how to troubleshoot common outbound connectivity issues related to NAT gateway resources. This article also provides best practices for designing applications to make efficient use of outbound connections.

SNAT Exhaustion Due to NAT Gateway Configuration

SNAT exhaustion issues with NAT gateway are usually related to how the NAT gateway is configured:

  • The NAT gateway isn't scaled out with enough public IP addresses.

  • The NAT gateway's configurable TCP idle timeout timer is set higher than the default of 4 minutes.

NAT gateway isn't scaled out enough

Each public IP address provides 64,512 SNAT ports for outbound connections through the NAT gateway. From those available SNAT ports, the NAT gateway can support up to 50,000 concurrent connections to the same destination endpoint. If outbound connections are being dropped because SNAT ports are exhausted, the NAT gateway may not be scaled out enough to handle the workload, and it may need more public IP addresses to provide more SNAT ports.
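The port arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not an Azure API: the function names are hypothetical, the per-address figure of 64,512 ports and the 16-address limit are the values stated in this article, and treating "peak ports" as a single number is a simplification of real SNAT behavior.

```python
PORTS_PER_PUBLIC_IP = 64_512  # SNAT ports provided by each public IP address
MAX_PUBLIC_IPS = 16           # maximum IP addresses on a single NAT gateway


def snat_port_inventory(public_ip_count: int) -> int:
    """Return the total SNAT port inventory for a NAT gateway."""
    if not 1 <= public_ip_count <= MAX_PUBLIC_IPS:
        raise ValueError(f"public_ip_count must be between 1 and {MAX_PUBLIC_IPS}")
    return public_ip_count * PORTS_PER_PUBLIC_IP


def public_ips_needed(peak_ports: int) -> int:
    """Return the smallest number of public IPs whose inventory covers peak_ports."""
    if peak_ports < 0:
        raise ValueError("peak_ports must not be negative")
    needed = max(1, -(-peak_ports // PORTS_PER_PUBLIC_IP))  # ceiling division
    if needed > MAX_PUBLIC_IPS:
        raise ValueError("demand exceeds one NAT gateway; split the workload across subnets")
    return needed


print(snat_port_inventory(16))     # 1032192
print(public_ips_needed(100_000))  # 2
```

A result above 16 addresses raises an error on purpose: at that point adding addresses is no longer an option, and the workload has to be split across subnets with their own NAT gateways.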

The following table describes two common outbound connection failure scenarios due to scalability issues and how to validate and mitigate these issues.

Scenario: You experience contention for SNAT ports and SNAT port exhaustion during periods of high usage.
Evidence: Run the following metrics in Azure Monitor:
  • Total SNAT Connection Count: the "Sum" aggregation shows high connection volume.
  • SNAT Connection Count: the "Failed" connection state shows transient or persistent failures over time.
  • Dropped Packets: the "Sum" aggregation shows packets dropping consistent with high connection volume and connection failures.
Mitigation: Add more public IP addresses or public IP prefixes as needed (up to 16 IP addresses in total per NAT gateway). This addition provides more SNAT port inventory and lets you scale your scenario further.

Scenario: You've already assigned 16 IP addresses to your NAT gateway and still experience SNAT port exhaustion.
Evidence: Attempts to add more IP addresses fail because the total number of IP addresses from public IP address or public IP prefix resources exceeds 16.
Mitigation: Distribute your application environment across multiple subnets and provide a NAT gateway resource for each subnet.

Note

It's important to understand why SNAT exhaustion occurs. Make sure you're using the right patterns for scalable and reliable scenarios. Adding more SNAT ports to a scenario without understanding the cause of the demand should be a last resort. If you don't understand why your scenario is putting pressure on SNAT port inventory, adding more IP addresses only delays the same exhaustion failure as your application scales, and it can mask other inefficiencies and anti-patterns. For more information, see Best practices for efficient use of outbound connections.

TCP idle timeout timer set higher than default

The NAT gateway TCP idle timeout timer defaults to 4 minutes, but can be configured up to 120 minutes. If the timer is set to a value higher than the default, the NAT gateway holds on to flows longer, which can put extra pressure on SNAT port inventory.
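To get a feel for why a longer timer adds pressure, here is a rough back-of-the-envelope model. It is a sketch under a loud assumption that isn't part of this article: every new flow goes idle without being closed and therefore holds its SNAT port for the full idle timeout. Real workloads close many flows earlier, so treat the result as an upper bound. The function name is hypothetical.

```python
def estimate_ports_held(new_flows_per_second: float, idle_timeout_minutes: int) -> int:
    """Estimate SNAT ports held by abandoned idle flows (steady-state upper bound).

    Assumes each flow is never closed and keeps its port until the idle timeout
    expires, so ports held = arrival rate x time each port is held.
    """
    if not 4 <= idle_timeout_minutes <= 120:
        raise ValueError("the idle timeout is configurable from 4 to 120 minutes")
    if new_flows_per_second < 0:
        raise ValueError("new_flows_per_second must not be negative")
    return round(new_flows_per_second * idle_timeout_minutes * 60)


print(estimate_ports_held(10, 4))    # 2400
print(estimate_ports_held(10, 120))  # 72000
```

Under this assumption, 10 abandoned flows per second hold 2,400 ports with the default timer but 72,000 ports at 120 minutes, which is more than the 64,512 ports that a single public IP address provides.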

The following table describes a scenario where a long TCP idle timeout timer causes SNAT exhaustion and provides mitigation steps to take.

Scenario: You want TCP connections to stay active for long periods without going idle and timing out, so you increase the TCP idle timeout timer setting. After a while, connection failures start to occur more often. You suspect that SNAT port inventory is being exhausted because connections are holding on to ports longer.
Evidence: Check the following NAT gateway metrics in Azure Monitor to determine whether SNAT port exhaustion is occurring:
  • Total SNAT Connection Count: the "Sum" aggregation shows high connection volume.
  • SNAT Connection Count: the "Failed" connection state shows transient or persistent failures over time.
  • Dropped Packets: the "Sum" aggregation shows packets dropping consistent with high connection volume and connection failures.
Mitigation: Some possible steps you can take to resolve SNAT port exhaustion:
  • Reduce the TCP idle timeout to a lower value to free up SNAT port inventory earlier. The TCP idle timeout timer can't be set lower than 4 minutes.
  • Consider asynchronous polling patterns to free up connection resources for other operations.
  • Use TCP keepalives or application layer keepalives to avoid intermediate systems timing out. For examples, see the .NET examples.
  • Use Private Link to connect to Azure PaaS services over the Azure backbone. Private Link frees up SNAT ports for outbound connections to the internet.

Connection failure due to idle timeout

TCP idle timeout

As described in the earlier section on the TCP idle timeout timer, use TCP keepalives to refresh idle flows and reset the idle timeout. TCP keepalives only need to be enabled on one side of a connection to keep the connection alive from both sides. When a TCP keepalive is sent from one side of the connection, the other side automatically sends an ACK packet, and the idle timeout timer is reset on both sides of the connection. For more information, see TCP idle timeout.
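As an illustration of enabling keepalives on one side of a connection, the following Python sketch turns on TCP keepalive for a socket using the standard socket module. The helper name and the default values are assumptions chosen for the example; the only requirement taken from this article is that the first probe is sent well before the 4-minute default idle timeout. The fine-grained options are platform-specific, so each is applied only where the constant exists.

```python
import socket


def enable_tcp_keepalive(sock: socket.socket, idle_seconds: int = 60,
                         interval_seconds: int = 10, probe_count: int = 5) -> None:
    """Enable TCP keepalive so an idle flow is refreshed before it times out."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first keepalive probe is sent.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    # Seconds between probes when no ACK is received.
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    # Number of unanswered probes before the connection is considered dead.
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probe_count)


client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_tcp_keepalive(client)  # call before or after connect()
client.close()
```

On platforms where the per-socket options aren't available, only SO_KEEPALIVE is set and the operating system's default keepalive timing applies, which may be longer than the NAT gateway idle timeout.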

Note

Increasing the TCP idle timeout is a last resort and may not resolve the root cause. A long timeout can cause low-rate failures when the timeout expires, and it introduces delay and unnecessary failures.

UDP idle timeout

The UDP idle timeout timer is set to 4 minutes. Unlike the TCP idle timeout timer for NAT gateways, the UDP idle timeout timer is not configurable.

The following table describes a common scenario where connections are dropped because UDP traffic idles out, and the steps to take to mitigate the issue.

Scenario: You notice that UDP traffic is dropping connections that need to be maintained for long periods of time.
Evidence: Check the Dropped Packets NAT gateway metric in Azure Monitor. The "Sum" aggregation shows packets dropping consistent with high connection volume and connection failures.
Mitigation: Some possible mitigation steps you can take:
  • Enable UDP keepalives. A UDP keepalive is active for only one direction of a connection, so the connection can still go idle and time out on the other side. To prevent a UDP connection from idling out, enable UDP keepalives for both directions of the connection flow.
  • Use application layer keepalives to refresh idle flows and reset the idle timeout. Check which application-specific keepalive options are available on the server side.
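UDP has no built-in keepalive, so in practice one is implemented by the application. The sketch below is one way to do that with Python's standard library; the function name, the one-byte payload, and the 30-second interval are assumptions for illustration. It covers only one direction of the flow, so the remote peer would need equivalent logic to refresh the other direction.

```python
import socket
import threading


def start_udp_keepalive(sock: socket.socket, peer: tuple,
                        interval_seconds: float = 30.0,
                        payload: bytes = b"\x00") -> threading.Event:
    """Send a small datagram to peer at a fixed interval; set the returned event to stop."""
    stop = threading.Event()

    def loop() -> None:
        # Event.wait returns False when the interval elapses without a stop request.
        while not stop.wait(interval_seconds):
            try:
                sock.sendto(payload, peer)
            except OSError:
                break  # the socket was closed, so end the keepalive thread

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

The interval should stay comfortably below the 4-minute UDP idle timeout. The receiving application also has to recognize and ignore the keepalive payload.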

NAT Gateway public IP not used for outbound traffic

VMs keep the old SNAT IP for active connections after a NAT gateway is added to the virtual network

When NAT gateway is configured on a subnet, it becomes the default route to the internet. When you migrate from default outbound access or a load balancer to a NAT gateway, new connections immediately begin using the IP address(es) associated with the NAT gateway resource. However, if a virtual machine still has an established connection during the migration, that connection continues to use the old SNAT IP address that was assigned when the connection was established.

Test and resolve issues with VMs that hold on to old SNAT IP addresses as follows:

  • Make sure that you're really establishing a new connection and that existing connections aren't being reused by the OS or cached by the browser. For example, when using curl in PowerShell, specify the -DisableKeepalive parameter to force a new connection. If you're using a browser, connections may also be pooled.

  • You don't need to reboot a virtual machine in a subnet configured with NAT gateway. However, if a virtual machine is rebooted, its connection state is flushed. Once the connection state is flushed, all connections start using the NAT gateway resource's IP address(es). This behavior is a side effect of the reboot and doesn't mean that a reboot is required.

If you still have issues, open a support case for further troubleshooting.

Virtual appliance UDR and ExpressRoute override NAT gateway for routing outbound traffic

When forced tunneling with a custom UDR is used to direct traffic to a virtual appliance or to a VPN over ExpressRoute, the UDR or ExpressRoute takes precedence over the NAT gateway for directing internet-bound traffic. For more information, see Custom UDRs.

Internet routing configurations take the following order of precedence:
Virtual appliance UDR / VPN ExpressRoute >> NAT gateway >> Instance-level public IP address >> Load balancer outbound rules >> Default outbound access

Test and troubleshoot a virtual appliance UDR or VPN ExpressRoute that overrides the NAT gateway as follows:

  1. Test whether the NAT gateway public IP is used for outbound traffic. If a different IP is being used, it could be because of a custom UDR. Follow the remaining steps to check for and remove custom UDRs.

  2. Check for UDRs in the virtual network's route table. See View route tables.

  3. Remove the UDR from the route table by following Create, change, or delete an Azure route table.

After the custom UDR is removed from the route table, the NAT gateway public IP should take precedence for routing outbound traffic to the internet.

Private IPs are used to connect to Azure services by Private Link

Private Link connects your Azure virtual networks privately to Azure PaaS services such as Azure Storage, Azure SQL, or Azure Cosmos DB over the Azure backbone network instead of over the internet. Private Link connects to these Azure platform services using the private IP addresses of the virtual machine instances in your virtual network instead of the NAT gateway's public IP. As a result, when you look at the source IP address used to connect to these Azure services, you see that the private IPs of your instances are used. For all services supported by Private Link, see the list of Azure services.

To check which private endpoints you have set up with Private Link:

  1. Search for Private Link in the search box of the Azure portal.

  2. In the Private Link center, select Private endpoints or Private Link services to see which configurations are set up. For more information, see Manage private endpoint connections.

You can also connect virtual networks to Azure PaaS services using service endpoints. To verify that you have a service endpoint configured for your virtual network:

  1. In the Azure portal, go to your virtual network and select "Service endpoints" in settings.

  2. All service endpoints that have been created are listed along with the subnets they're configured on. For more information, see Service endpoint logging and troubleshooting.

Note

Private Link is the recommended option over service endpoints for private access to Azure hosted services.

Connection errors to public internet destinations

Connection errors on internet-facing endpoints can be caused by a number of possible factors. Factors that can affect connection success include:

  • Firewall or other traffic management component of the target

  • API rate limit applied on the target side

  • Volumetric DDoS mitigations or transport layer traffic shaping

Use NAT gateway metrics in Azure Monitor to diagnose connection issues:

  • View the number of packets from the source and destination (if available) to determine the number of connection attempts.

  • Check Dropped Packets to see how many packets were dropped by the NAT gateway.

Other checks:

  • Check for SNAT exhaustion.

  • Validate connectivity to an endpoint in the same region or in a different region for comparison.

  • If you're running high-volume or high-transaction-rate tests, check whether reducing the rate reduces the occurrence of failures.

  • If rate changes affect the failure rate, check if an API rate limit or other constraint on the target side has been reached.

Active FTP and NAT gateway

FTP uses two separate channels between a client and server: the command channel and the data channel. Each channel communicates on a separate TCP connection, one for sending commands and the other for transferring data.

In active FTP mode, the client establishes a command channel and the server establishes a data channel.

NAT gateway isn't compatible with active FTP mode when connecting to an FTP server over the internet. Active FTP uses a PORT command from the FTP client to tell the FTP server which IP address and port the server should use on the data channel to connect back to the client. The PORT command carries the client's private address, which can't be changed. Because client-side traffic is SNATed by the NAT gateway for internet-based communication, the FTP server sees the PORT command as invalid.

An alternative to active FTP mode when connecting to an FTP server through a NAT gateway is to use passive FTP mode instead. However, using NAT gateway with passive FTP mode requires a few considerations.
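To make the problem concrete, the sketch below builds the argument of an FTP PORT command the way the FTP specification (RFC 959) encodes it: four address bytes followed by the port split into a high and a low byte. The function name is hypothetical and the address 10.0.0.4 is only an example value.

```python
import ipaddress


def format_port_command(client_ip: str, data_port: int) -> str:
    """Build the FTP PORT command a client sends in active mode."""
    address = ipaddress.IPv4Address(client_ip)  # validates the address
    if not 0 < data_port < 65536:
        raise ValueError("data_port must be between 1 and 65535")
    high_byte, low_byte = divmod(data_port, 256)
    fields = str(address).split(".") + [str(high_byte), str(low_byte)]
    return "PORT " + ",".join(fields)


print(format_port_command("10.0.0.4", 50000))        # PORT 10,0,0,4,195,80
print(ipaddress.IPv4Address("10.0.0.4").is_private)  # True
```

The private address travels inside the command payload, so the server is asked to connect to 10.0.0.4 even though the packets it receives carry the NAT gateway's public IP as their source.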

Passive FTP and NAT Gateway

In passive FTP mode, the client establishes connections on both the command and data channels. The client requests that the server listen on a port rather than try to establish a connection back to the client.

Depending on your FTP server configuration, outbound passive FTP may not work for NAT gateways with multiple public IP addresses. When a NAT gateway with multiple public IP addresses sends traffic outbound, it randomly selects one of the public IP addresses for the source IP address. Depending on your FTP server configuration, FTP may fail if data and control channels use different source IP addresses.

To avoid possible passive FTP connection errors, follow these steps:

  1. Make sure your NAT gateway is associated with a single public IP address instead of multiple IP addresses or prefixes.

  2. Make sure that the NAT gateway's passive port range can pass through any firewalls that the target endpoint may have.

Note

Reducing the number of public IP addresses on your NAT gateway reduces the SNAT port inventory available for making outbound connections and can increase the risk of SNAT port exhaustion. Consider your SNAT connectivity needs before you remove public IP addresses from a NAT gateway.

Additional network capture

If your investigation is inconclusive, open a support case for further troubleshooting and collect the following information for a quicker resolution. Choose a single virtual machine in the subnet configured with the NAT gateway and perform the following tests:

  • From one of the backend VMs within the virtual network, use ps ping to test the probe port response (for example: ps ping 10.0.0.4:3389) and record the results.

  • If no response is received in these ping tests, run a simultaneous Netsh trace on the backend VM and on the virtual network test VM while you run PsPing, and then stop the Netsh trace.

Outbound Connectivity Best Practices

Azure monitors and operates its infrastructure with great care. However, deployed applications can still experience transient failures, and there's no guarantee of lossless transmission. NAT gateway is the preferred option for outbound connectivity from Azure deployments because it provides highly reliable and resilient outbound connectivity. In addition to using NAT gateway to connect outbound, use the guidance later in this article to make sure that your application uses connections efficiently.

Modify the application to use connection pooling

When you pool your connections, you avoid opening new network connections for calls to the same address and port. You can implement a connection pooling scheme in your application where requests are internally distributed across a fixed set of connections and reused when possible. This setup constrains the number of SNAT ports in use and creates a predictable environment. Connection pooling helps reduce latency and resource utilization and ultimately improves the performance of your applications.

To learn more about pooling HTTP connections, see Pool HTTP connections with HttpClientFactory.
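The article points to HttpClientFactory for .NET. As a language-neutral illustration of the same idea, here is a minimal Python sketch of a fixed-size pool built only on the standard library. The class name and its defaults are assumptions for the example, and a production application would more likely rely on the pooling built into its HTTP client library.

```python
import http.client
import queue
from contextlib import contextmanager


class HttpConnectionPool:
    """Distribute requests over a fixed set of reusable HTTPS connections."""

    def __init__(self, host: str, port: int = 443, size: int = 4, timeout: float = 10.0):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # No socket is opened here; http.client connects on the first request.
            self._idle.put(http.client.HTTPSConnection(host, port, timeout=timeout))

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks while every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it for reuse instead of closing it

    def get(self, path: str):
        with self.connection() as conn:
            try:
                conn.request("GET", path)
                response = conn.getresponse()
                return response.status, response.read()
            except (http.client.HTTPException, OSError):
                conn.close()  # drop the broken socket; it reconnects on the next request
                raise
```

Because the pool never holds more than `size` connections to the destination, the number of SNAT ports this client can consume for that destination is bounded by the same number.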

Modify the application to reuse connections

Configure your application to reuse connections instead of creating an individual, atomic TCP connection for each request. Connection reuse results in more performant TCP transactions and is especially relevant for protocols like HTTP/1.1, where connection reuse is the default. This reuse also applies to other protocols that use HTTP as their transport, such as REST.

Modify the application to use less aggressive retry logic

When SNAT ports are exhausted or application failures occur, aggressive or brute force retries without delay and backoff logic cause exhaustion to occur or persist. You can reduce demand for SNAT ports by using less aggressive retry logic.

Depending on the configured idle timeout, if the retries are too aggressive, the connection may not have enough time to close and release the SNAT port to re-use it.

For more guidance and examples, see Retry pattern.
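A minimal sketch of less aggressive retry logic, assuming exponential backoff with random jitter (one common approach, not necessarily the one the linked guidance prescribes). The function name, the defaults, and the choice to retry only on OSError are assumptions made for the example.

```python
import random
import time


def retry_with_backoff(operation, max_attempts: int = 5, base_delay: float = 0.5,
                       max_delay: float = 30.0, sleep=time.sleep):
    """Call operation(), retrying connection errors with growing, randomized delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError:  # includes ConnectionError and socket timeouts
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # The delay ceiling doubles after each failure, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads retries out so that clients don't retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

The growing delay gives closed connections time to release their SNAT ports before the next attempt, and the cap on attempts stops a persistent failure from generating unbounded connection demand.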

Use keepalives to reset the outbound idle timeout

For more information about keepalives, see TCP idle timeout timer set higher than default.

Use Private Link to reduce SNAT port usage for connecting to other Azure services

Where possible, use Private Link to connect directly from your virtual networks to Azure platform services to reduce the demand on SNAT ports. Reducing the demand on SNAT ports can help reduce the risk of SNAT port exhaustion.

To create a Private Link, see the following quickstart guides to get started:

  • Create a private endpoint

  • Create a Private Link

Next steps

We're always looking to improve the experience of our customers. If you encounter NAT gateway issues that this article doesn't address or resolve, provide feedback through GitHub at the bottom of this page.

To learn more about NAT Gateway, see:

  • Azure NAT Gateway

  • NAT gateway resource

  • Metrics and Alerts for NAT Gateway Resources

Article information

Author: Kareem Mueller DO

Last Updated: 09/19/2023
