NetScaler – Monitors going UP/DOWN (packet loss)

Two weeks ago one of my customers had huge problems with their internal load-balancing NetScaler. The VPX suddenly started to go crazy: the StoreFront and DNS monitors went UP/DOWN every few minutes, and sometimes the secondary node took over as primary and then failed back again. The dashboard always looked like this:

[Screenshot: services]

(The load-balanced services weren't accessible most of the time.)

To get users back on the XenDesktop site as soon as possible, we changed the StoreFront DNS A record to point directly at SF01 and SF02. At least that gave us DNS round robin, and users could log in to the store again 🙂

We ran some ping tests and found something interesting:

Protocol: ICMP
From: StoreFront & Admin Client
To: Domain Controller
Result: 0% Packet loss

Protocol: ICMP
From: StoreFront & Admin Client
To: SNIP,VIP
Result: 25% Packet loss
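The comparison above can be scripted from the admin client. A minimal sketch, assuming placeholder names for the targets (dc01 for the Domain Controller, 10.0.0.10/10.0.0.11 for SNIP and VIP; adjust for your environment):

```shell
# Hedged sketch: compare packet loss from the admin client to several targets.
# dc01, 10.0.0.10 and 10.0.0.11 are placeholder names/addresses.
for host in dc01 10.0.0.10 10.0.0.11; do
  # Send 20 probes, then pull the loss summary out of the ping output.
  result=$(ping -c 20 "$host" | grep -oE '[0-9.]+% packet loss')
  echo "$host: ${result:-no reply}"
done
```

A healthy target should report 0% loss; in our case only the NetScaler addresses showed losses.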

Only the VPX was affected by the packet loss!

We took a closer look at the VMware vSwitch ports the VPX and the Domain Controller were connected to.

Domain Controller –> Running out of buffers: 0

[Screenshot: port_sf]

VPX –> Packets dropped: 14731

[Screenshot: port_vpx]

After some research I found Citrix KB article CTX200278.

According to the KB, a NetScaler VPX network connectivity issue on VMware ESXi 5.1.0 build 2191751 and VMware ESXi 5.5 build 2143827 is caused by a "tx_ring_length" mismatch, which leads to TX stalls.

The customer was running ESXi 5.5 build 4179633, so the problem should already have been fixed. Well, we gave it a try anyway and set "tx_ring_length" to 512, without any success. Hmm, what now? Was the E1000 NIC producing the packet loss? I had experienced something similar a few years ago. We updated the NetScaler to the newest available 11.1 release, because VMXNET3 NIC support started with firmware >= 11.0.65.72.

We ended up with the same result as with the E1000 NIC. Pretty frustrating 😦

We checked the NetScaler dashboard again and suddenly realized there was a high amount of input throughput (12 Mbps) on the VPX, even though all load-balancing services were disabled at this point. Maybe the dashboard information was wrong? We double-checked the input via CLI:

> stat ns | grep Megabits
Megabits received                 12
Megabits transmitted               0

> stat ns | grep Megabits
Megabits received                 15
Megabits transmitted               0

> stat ns | grep Megabits
Megabits received                 11
Megabits transmitted               0
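Instead of re-running the command by hand, the receive rate can be sampled in a small loop. A sketch, assuming SSH access to the appliance (nsroot@vpx is a placeholder for your NSIP and user):

```shell
# Hedged sketch: sample the NetScaler receive rate a few times over SSH.
# nsroot@vpx is a placeholder; "stat ns" is the NetScaler CLI command.
for i in 1 2 3; do
  # Extract the "Megabits received" value from the stat output.
  mbps=$(ssh nsroot@vpx 'stat ns' | awk '/Megabits received/ {print $3}')
  echo "$(date +%T) received: ${mbps} Mbps"
  sleep 5
done
```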

There was far too much input traffic on the VPX network interface. The internal VPX was running with an Express license, which can only handle up to 5 Mbps of traffic! What kind of traffic was terminating on the NetScaler? A routing problem?

We could only get this kind of information from a detailed network trace on the VPX. After capturing a few seconds with the "nstrace" command, we imported the trace file into Wireshark. The trace showed that a lot of packets which should never pass the appliance were being transferred via the NetScaler.

Example:
Protocol: TCP/UDP-1433
From: Client PC
To: SQL Server
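The capture and the filtering can be sketched like this; these are NetScaler CLI commands run on the appliance, and the port-1433 display filter matching the example above is applied afterwards in Wireshark:

```shell
# Hedged sketch: capture a short trace on the NetScaler CLI, then filter it.
# On the NetScaler CLI (trace files land under /var/nstrace/ on the appliance):
#   > start nstrace -size 0
#   ... wait a few seconds ...
#   > stop nstrace
#
# After copying the .cap file off the appliance, a Wireshark display filter
# such as the following isolates the stray SQL Server traffic:
#   tcp.port == 1433 || udp.port == 1433
```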

At this point we finally knew the reason for the dropped packets. But we didn't yet know why this was happening. We got in contact with the network & VMware teams, and they mentioned something called "promiscuous mode".

Promiscuous mode is a security policy which can be defined at the virtual switch or portgroup level in vSphere ESX/ESXi. A virtual machine, Service Console or VMkernel network interface in a portgroup which allows use of promiscuous mode can see all network traffic traversing the virtual switch.

By default, a guest operating system’s virtual network adapter only receives frames that are meant for it. Placing the guest’s network adapter in promiscuous mode causes it to receive all frames passed on the virtual switch that are allowed under the VLAN policy for the associated portgroup. This can be useful for intrusion detection monitoring or if a sniffer needs to analyze all traffic on the network segment.

Promiscuous mode on the vSwitch was set to "Accept". We changed it to "Reject" and had a look at the input bandwidth on the dashboard 😉
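For reference, the same check and fix can also be done from the ESXi host shell with esxcli; a sketch, with vSwitch0 as a placeholder name (note that the change affects every portgroup inheriting the vSwitch security policy):

```shell
# Hedged sketch: inspect and tighten the security policy of a standard vSwitch.
# vSwitch0 is a placeholder; run these on the ESXi host shell.
esxcli network vswitch standard policy security get -v vSwitch0
esxcli network vswitch standard policy security set -v vSwitch0 --allow-promiscuous=false
```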

[Screenshot: vswitch]

[Screenshot: input_after]

The VPX was back online and purring like a cat!

One of the VMware guys had been playing around with promiscuous mode but forgot to set it back to "Reject". Mystery solved 🙂
