Wireless 9800 WLC KPI Blog – Part 3

Part 3 of the 3-part Wireless Catalyst 9800 WLC KPIs

In previous blogs, Wireless Catalyst 9800 WLC KPIs, Part 1 and Wireless Catalyst 9800 WLC KPIs, Part 2, we shared how to check the WLC and its connections to other devices, as well as how to check AP and RF health status.

In this blog, we will focus on Key Performance Indicators for client analysis, WLC packet drops, and packets punted to the WLC CPU. I’ll share methodical steps and outputs that we can collect from the WLC to measure the health of clients’ connectivity and WLC forwarding performance.

The KPIs are grouped into different buckets or areas:

  • WLC checks
  • Connection with other devices
  • AP checks
  • RF checks
  • Client checks
  • Packet Drops

Client Checks

After we have verified AP and RF health, we can focus on client connectivity. Using “show wireless summary” we can see the total number of clients connected. In addition, we can find out if there are any excluded, disabled, and foreign/anchored clients. We can run this command periodically and check that the number of clients is within the expected values for our deployment. We can also identify any drastic changes in any of the values. The command also shows the number of APs, roles, radios, and their status.

Gladius1#sh wireless summary
Max APs supported          : 2000
Max clients supported      : 32000
Access Point Summary
                      Total    Up    Down
802.11 2.4GHz             1     1       0
802.11 5GHz               4     3       1
802.11 dual-band          2     2       0
802.11 rx-dual-band       0     0       0

Client Serving(2.4GHz)    3     3       0
Client Serving(5GHz)      4     3       1
Monitor                   0     0       0
Sensor                    0     0       0

Client Summary
Total Clients : 6
Excluded      : 0
Disabled      : 0
Foreign       : 0
Anchor        : 0
Local         : 6

Check for the total number of clients, excluded clients, and APs with radios down.

In case we see excluded clients, we need to dig further to identify the reason. Determine if the excluded clients have any misconfiguration or if the exclusion could be due to some other reason. Reasons for excluding clients could be an incorrect password, an IP address matching another client’s IP address, multiple association failures, etc. We can see the list of client exclusion policies and their status using the command “sh wireless wps summary”.

We can break down the number of connected clients across the different states of the client state machine. This will help us to narrow down if there are too many clients stuck in transient states like Authenticating, IP Learn, Mobility, or Webauth Pending. Use the command: “show wireless stats client detail | i Authenticating         :|Mobility               :|IP Learn               :|Webauth Pending        :|Run                    :|Delete-in-Progress     :”

Gladius1#show wireless stats client detail | i Authenticating         :|Mobility               :|IP Learn               :|Webauth Pending        :|Run                    :|Delete-in-Progress     :
Authenticating         : 0
Mobility               : 0
IP Learn               : 1
Webauth Pending        : 0
Run                    : 5
Delete-in-Progress     : 0

Check for clients in transient states. In this case, we see one client in the IP Learn state.

We will need to investigate further if the number of clients in transient states is not decreasing. The same applies if most of the clients remain in the same transient state for a long period of time.
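As a rough illustration of that check, the sketch below parses two polls of the client state counters and reports the transient states whose count is non-zero and never decreases. The function names, the threshold, and the second poll are assumptions made for the example; they are not part of any Cisco tooling.

```python
import re

# Transient states named in the command filter; "Run" is the steady state.
TRANSIENT_STATES = ("Authenticating", "Mobility", "IP Learn",
                    "Webauth Pending", "Delete-in-Progress")


def parse_states(output):
    """Turn 'Name : count' lines into a {name: count} dictionary."""
    states = {}
    for line in output.splitlines():
        match = re.match(r"^\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            states[match.group(1)] = int(match.group(2))
    return states


def stuck_states(samples, threshold=1):
    """Return transient states that stay at or above threshold and never decrease."""
    stuck = []
    for state in TRANSIENT_STATES:
        counts = [sample.get(state, 0) for sample in samples]
        never_decreases = all(b >= a for a, b in zip(counts, counts[1:]))
        if len(counts) > 1 and min(counts) >= threshold and never_decreases:
            stuck.append(state)
    return stuck


SAMPLE = """\
Authenticating         : 0
Mobility               : 0
IP Learn               : 1
Webauth Pending        : 0
Run                    : 5
Delete-in-Progress     : 0
"""

first = parse_states(SAMPLE)
# Hypothetical second poll in which the IP Learn count grew from 1 to 3.
second = dict(first, **{"IP Learn": 3})
print(stuck_states([first, second]))
```

In practice the threshold would come from the baseline of the deployment, since a handful of clients in IP Learn at any instant is normal.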

One example could be if we see a high number of clients stuck in “IP Learn”. Then we should review the DHCP server status and the connectivity between the WLC and the DHCP server. For scenarios where static IP addresses are allowed, we can review ARP forwarding.

Another example could be if the number of clients stuck in “Webauth Pending” is high. There are several reasons that can cause this. One reason could be web page redirects not being received or not being accessible by clients. Another could be authentication failures when doing web login for guest SSIDs.

The last example could be if we see a large number of clients stuck in “Authenticating”. If clients connected to dot1x SSIDs have authentication issues then we should review the Radius server. We need to determine if the issue occurs with one specific Radius server or with different servers at the same time. In the sections below, I will describe how to verify Radius server status.

We can also review client delete reasons and identify any unexpected reason with increasing counters. “Idle timeout” or “Session timeout” would be expected reasons for clients to disconnect. However, “DOT11 denied data rates” or “MIC validation failed” would be unexpected and may require some further analysis. Use the command: “show wireless stats client delete reasons | e :_0”

Gladius1#show wireless stats client delete reasons | e :_0
Total client delete reasons
Controller deletes
Due to mobility failure                                         : 1
DOT11 denied data rates                                         : 5781192
L2-AUTH connection timeout                                      : 2
IP-LEARN connection timeout                                     : 968
Mobility peer delete                                            : 134
Informational Delete Reason
AP down/disjoin                                                 : 690
Session timeout                                                 : 661
Client initiate delete
AP Deletes
AP initiated delete for DHCP timeout                            : 1
AP initiated delete for reassociation timeout                   : 266

Check for unexpected delete reasons with a high count that keeps increasing. In this case, denied data rates.
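One way to spot increasing counters is to compare two snapshots of the delete reasons and ignore the reasons we already expect. The following is a minimal sketch under stated assumptions: the allow-list of expected reasons, the function names, and the later snapshot are hypothetical and only illustrate the comparison.

```python
import re

# Hypothetical allow-list of delete reasons considered normal for a deployment.
EXPECTED_REASONS = {"Idle timeout", "Session timeout"}


def parse_counters(output):
    """Turn 'Reason : count' lines into a {reason: count} dictionary."""
    counters = {}
    for line in output.splitlines():
        match = re.match(r"^\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def growing_reasons(before, after, expected=EXPECTED_REASONS):
    """List (reason, increase) pairs for unexpected reasons, largest increase first."""
    growth = [(reason, count - before.get(reason, 0))
              for reason, count in after.items() if reason not in expected]
    return sorted((item for item in growth if item[1] > 0),
                  key=lambda item: item[1], reverse=True)


BEFORE = """\
Due to mobility failure                                         : 1
DOT11 denied data rates                                         : 5781192
IP-LEARN connection timeout                                     : 968
Session timeout                                                 : 661
"""

before = parse_counters(BEFORE)
# Hypothetical later snapshot: two counters moved, one of them an expected reason.
after = dict(before, **{"DOT11 denied data rates": 5781292, "Session timeout": 700})
print(growing_reasons(before, after))
```

Section headings such as “Controller deletes” carry no counter, so the parser skips them.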

In one of the largest worldwide wireless events, we monitored delete reasons, excluding the ones showing zero hits. We could spot a delete reason that was consistently increasing over time. Using always-on tracing we found that the clients deleted for that reason were all connecting to one specific SSID. When reviewing the SSID configuration we isolated a configuration mistake causing the disconnections. After correcting the configuration, no further client deletes for the unexpected reason were seen. We could proactively spot an issue, find the root cause, and fix it, all without having to wait for end clients to complain before starting the troubleshooting process.

The WLC also has a list of predefined possible failures with counters. We can check these counters to identify potential issues and be proactive in issue detection. Use the command: “show wireless stats trace-on-failure | ex :_0”

Gladius1#show wireless stats trace-on-failure | ex :_0
Wireless Trace On Failure Statistics
006. Export client MM....................................: 1
018. Capwap configuration status failure.................: 46136
020. Client association failure..........................: 5
021. Client MAB authentication failure...................: 5781677
023. Client stage timeout................................: 1642
025. Client mobility clean up............................: 1
027. DTLS handshake failure..............................: 2
030. DTLS no configuration packet drop...................: 5
032. DTLS invalid hello packet drop......................: 168
034. SANET AUTHC failure.................................: 6

Check for failures with a high count that keeps increasing. In this case, MAB authentication failures.

If we are using dot1x and Radius servers, we will need to monitor the status of the Radius servers. IOS-XE uses dead-time and dead criteria to determine the status of a Radius server. These parameters allow the device to identify a Radius server that is not responding to requests, and to perform a switchover to a secondary Radius server. The server is declared dead once the dead criteria are met. The dead criteria specify the number of tries that should fail and the time with no response from the server. Both criteria must be met to declare the server dead. The server remains in dead status until the dead-time expires.

We can check if there is any dead server at this moment and the number of times a server has been declared dead. This will help us to diagnose issues with a specific Radius server due to loss of connectivity or misbehavior from Radius or the WLC. Use the command: “show aaa servers | i Platform Dead: total|RADIUS: id”

Gladius1#show aaa servers | i Platform Dead: total|RADIUS: id
RADIUS: id 1, priority 1, host, auth-port 1645, acct-port 1646, hostname ISE
SMD Platform Dead: total time 301s, count 2
Platform Dead: total time 179s, count 10UP
RADIUS: id 2, priority 2, host, auth-port 1812, acct-port 1813, hostname ISE3
SMD Platform Dead: total time 0s, count 0
Platform Dead: total time 0s, count 0

Check the platform dead time and count to identify Radius servers that had issues.
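To make this check repeatable, the filtered output can be grouped per server. The sketch below is only an illustration: the function name and dictionary keys are made up for the example, and adding the SMD and per-process “Platform Dead” lines into one total per server is a simplification chosen here, not something the WLC does.

```python
import re


def dead_counts(output):
    """Group 'Platform Dead' counters under the hostname of the preceding RADIUS line."""
    servers = {}
    current = None
    for line in output.splitlines():
        header = re.search(r"RADIUS: id (\d+),.*hostname (\S+)", line)
        if header:
            current = header.group(2)
            servers[current] = {"dead_seconds": 0, "dead_count": 0}
            continue
        dead = re.search(r"Platform Dead: total time (\d+)s, count (\d+)", line)
        if dead and current is not None:
            # Simplification: sum the SMD line and the other Platform Dead lines.
            servers[current]["dead_seconds"] += int(dead.group(1))
            servers[current]["dead_count"] += int(dead.group(2))
    return servers


SAMPLE = """\
RADIUS: id 1, priority 1, host, auth-port 1645, acct-port 1646, hostname ISE
SMD Platform Dead: total time 301s, count 2
Platform Dead: total time 179s, count 10UP
RADIUS: id 2, priority 2, host, auth-port 1812, acct-port 1813, hostname ISE3
SMD Platform Dead: total time 0s, count 0
Platform Dead: total time 0s, count 0
"""

for name, counters in dead_counts(SAMPLE).items():
    if counters["dead_count"] > 0:
        print(name, counters)
```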

Radius status is displayed per WNCD. It is possible that the same Radius server is marked as dead for some WNCDs and alive for others. Each AP belongs to a WNCD. There is a command to check the APs assigned per WNCD: “show wireless load-balance ap affinity WNCD <0-7>”. If clients connected to APs in one specific WNCD send Radius requests, and those requests don’t get a response, then the Radius status for that WNCD will be DEAD. At the same time, clients in other WNCDs might not be sending any Radius requests, or might be getting responses, so the server stays alive for those WNCDs.

For a Radius server marked as DEAD, we need to check if the server is reachable and replying to authentication and accounting requests. Radius statistics will help us to identify if we are missing any responses for authentication or for accounting, the average time to respond, the number of access rejects and accepts, and the latency distribution. Use the command: “show radius statistics”

Gladius1#show radius statistics
                           Auth.      Acct.       Both
Maximum inQ length:         NA         NA          1
Maximum waitQ length:         NA         NA         14
Maximum doneQ length:         NA         NA          1
Total responses seen:        279          0        279
Packets with responses:        279          0        279
Packets without responses:          0        396        396
Access Rejects           :          2
Access Accepts           :         20
Average response delay(ms):         10          0         10
Maximum response delay(ms):        173          0        173
Number of Radius timeouts:          0       4542       4542
Duplicate ID detects:          0          0          0
Buffer Allocation Failures:          0          0          0
Maximum Buffer Size (bytes):        764        780        780
Malformed Responses        :          0          0          0
Bad Authenticators         :          0          0          0
Unknown Responses          :          0          0          0
Source Port Range: (2 ports only)
1645 - 1646
Last used Source Port/Identifier:
Elapsed time since counters last cleared: 3w3d20h41m
Radius Latency Distribution:
<= 2ms :        181          0
3-5ms  :         32          0
5-10ms :         13          0
10-20ms:         14          0
20-50ms:         17          0
50-100m:         20          0
100ms :          2          0

Check for requests without a response, timeouts, and high latency.
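A quick way to quantify missing responses is to compute, for authentication and for accounting separately, the share of packets that never got an answer. This is a minimal sketch; the function names are assumptions for the example, and it only reads the rows that carry three integer columns (Auth., Acct., Both).

```python
import re


def parse_triplets(output):
    """Return {label: (auth, acct, both)} for rows with three integer columns."""
    rows = {}
    for line in output.splitlines():
        match = re.match(r"^\s*(.+?)\s*:\s*(\d+)\s+(\d+)\s+(\d+)\s*$", line)
        if match:
            rows[match.group(1)] = tuple(int(v) for v in match.groups()[1:])
    return rows


def no_response_ratio(rows):
    """Share of packets without a response, for auth and acct separately."""
    answered = rows["Packets with responses"]
    unanswered = rows["Packets without responses"]
    ratios = {}
    for index, kind in enumerate(("auth", "acct")):
        total = answered[index] + unanswered[index]
        ratios[kind] = unanswered[index] / total if total else 0.0
    return ratios


SAMPLE = """\
Maximum inQ length:         NA         NA          1
Total responses seen:        279          0        279
Packets with responses:        279          0        279
Packets without responses:          0        396        396
Number of Radius timeouts:          0       4542       4542
"""

print(no_response_ratio(parse_triplets(SAMPLE)))
```

With the sample above, every accounting packet went unanswered while authentication was fully answered, which points at accounting on the server side rather than at reachability.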

At one customer we were troubleshooting dot1x clients’ connectivity issues and found that the reason for the failures was the Radius server being marked as dead. When reviewing the outputs, we could see that Radius was replying to authentications but was no longer replying to accounting packets. A workaround to minimize the impact was to disable the accounting list, so that the WLC stopped sending accounting packets while the Radius administrators were troubleshooting the accounting issues on the server.

Packet drops and punted to CPU Checks

Now we can check if there are any scalability issues due to the oversubscription of any of the WLC components. I would start by looking at the amount of traffic received and transmitted by the physical interfaces, and then review the number of broadcasts/multicasts and the input or output drops. If we have a baseline, we can compare the amount of traffic with the baseline and try to find any discrepancies. Use the command: “show int po1 | i line protocol|put rate|drops|broadcast”. Replace Po1 with the physical or logical interface of your setup.

Gladius1#show int po1 | i line protocol|put rate|drops|broadcast
Port-channel1 is up, line protocol is up
  Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 0
  5 minute input rate 39000 bits/sec, 42 packets/sec
  5 minute output rate 14000 bits/sec, 12 packets/sec
     Received 9389675 broadcasts (34521510 multicasts)
     Output 45735 broadcasts (1075205 multicasts)
     0 unknown protocol drops

Check for the amount of input/output traffic, drops, and broadcasts tx/rx.

We can review the packets dropped by the WLC and the reasons for those drops. When monitoring drops it is important to check which reasons account for a high number of packet drops. Then we can find out how fast those drop counters are increasing. We need to collect the same output several times with a time reference. Enabling “terminal exec prompt timestamp” or collecting “show clock” will help us to have time references. These time-referenced outputs will be key to isolating impacting drops. Use the command: “show platform hardware chassis active qfp statistics drop”

Gladius1#show platform hardware chassis active qfp statistics drop
Last clearing of QFP drops statistics : never
Global Drop Stats                         Packets                  Octets
CGACLDrop                                      31                    7812 
Disabled                                      635                  105934 
InvL2Hdr                                      701                  206223 
IpFormatErr                                    68                    4488 
Ipv4NoAdj                                   67749                 6910538 
Ipv4NoRoute                                     6                     376 
Ipv6NoRoute                                  1096                   61376 
Ipv6mcNoRoute                               77683                 9477326 
SWPortMacConflict                           50316                 5874782 
SwitchL2mLookupMiss                         17568                 6681680 
TailDrop                                    54199                29501684 
UnconfiguredIpv4Fia                             3                     242 
UnconfiguredIpv6Fia                       1564372               186850863 
WlsCapwapError                               1018                  233293 
WlsCapwapReassFragConsume                    1064                 1231968 
WlsClientError                               3116                  112631

Check for drop reasons with a high number of packets, and for fragmentation/reassembly drops.
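Since the absolute counters accumulate from the last clear, the useful figure is how fast each reason grows between two time-referenced collections. The sketch below shows one possible calculation; the function names, the timestamps, and the second sample are hypothetical and only serve the example.

```python
import re
from datetime import datetime


def parse_drops(output):
    """Return {reason: packets} from 'Reason  Packets  Octets' rows."""
    drops = {}
    for line in output.splitlines():
        match = re.match(r"^\s*([A-Za-z]\w*)\s+(\d+)\s+(\d+)\s*$", line)
        if match:
            drops[match.group(1)] = int(match.group(2))
    return drops


def drop_rates(first, first_time, second, second_time):
    """Packets per second for each drop reason, fastest growing first."""
    seconds = (second_time - first_time).total_seconds()
    if seconds <= 0:
        raise ValueError("second sample must be taken after the first")
    rates = {reason: (count - first.get(reason, 0)) / seconds
             for reason, count in second.items()}
    return dict(sorted(rates.items(), key=lambda item: item[1], reverse=True))


SAMPLE = """\
Global Drop Stats                         Packets                  Octets
Ipv4NoAdj                                   67749                 6910538
TailDrop                                    54199                29501684
WlsCapwapReassFragConsume                    1064                 1231968
"""

first = parse_drops(SAMPLE)
# Hypothetical second collection five minutes later with 600 more tail drops.
second = dict(first, TailDrop=first["TailDrop"] + 600)
rates = drop_rates(first, datetime(2024, 1, 1, 10, 0),
                   second, datetime(2024, 1, 1, 10, 5))
print(rates)
```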

One more check that we should do is to analyze the number of packets sent to the control plane (punted) of the WLC for processing. We can monitor the number of packets punted for each reason and check for an abnormal amount. We can correlate an increase in punted packets with high CPU utilization events. Use the command: “show platform hardware chassis active qfp feature wireless punt statistics”

Gladius1#show platform hardware chassis active qfp feature wireless punt statistics
CPP Wireless Punt stats:
                                 App Tag     Packet Count
                                 -------     ------------
         CAPWAP_PKT_TYPE_DOT11_PROBE_REQ           986190
              CAPWAP_PKT_TYPE_DOT11_MGMT            10031
              CAPWAP_PKT_TYPE_DOT11_IAPP          2975298
             CAPWAP_PKT_TYPE_DOT11_DOT1X            24901
        CAPWAP_PKT_TYPE_CAPWAP_KEEPALIVE           228099
            CAPWAP_PKT_TYPE_CAPWAP_CNTRL          1628480
         CAPWAP_PKT_TYPE_CAPWAP_DATA_PAT               33
          CAPWAP_PKT_TYPE_MOBILITY_CNTRL            58091
                       SISF_PKT_TYPE_ARP        218545290
                      SISF_PKT_TYPE_DHCP            15455
                     SISF_PKT_TYPE_DHCP6             7772
                   SISF_PKT_TYPE_IPV6_ND           199108
                SISF_PKT_TYPE_DATA_GLEAN                7
             SISF_PKT_TYPE_DATA_GLEAN_V6              100

Check for a high number of punted packets that keeps increasing over time.
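To see at a glance which punt reason dominates, each counter can be expressed as a share of all punted packets. This is a small sketch with a made-up function name, using a few rows from an output like the one above; it is not a Cisco utility.

```python
import re


def punt_shares(output):
    """Return {app_tag: fraction of all punted packets} for each punt reason."""
    counts = {}
    for line in output.splitlines():
        match = re.match(r"^\s*([A-Z0-9_]+_PKT_TYPE_[A-Z0-9_]+)\s+(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()} if total else {}


SAMPLE = """\
                                 App Tag     Packet Count
                                 -------     ------------
         CAPWAP_PKT_TYPE_DOT11_PROBE_REQ           986190
              CAPWAP_PKT_TYPE_DOT11_IAPP          2975298
            CAPWAP_PKT_TYPE_CAPWAP_CNTRL          1628480
                       SISF_PKT_TYPE_ARP        218545290
"""

shares = punt_shares(SAMPLE)
top = max(shares, key=shares.get)
print(top, round(shares[top] * 100, 1), "%")
```

A share on its own does not prove a problem, so it is best read together with the growth over time and the CPU utilization at the same moment.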

We could also identify if we are seeing any buffer failures and determine the size of the buffers that are reaching the maximum value. Use the command: “show buffers | i buffers|failures”

Gladius1#show buffers | i buffers|failures
Small buffers, 104 bytes (total 1200, permanent 1200):
     0 failures (0 no memory)
Middle buffers, 600 bytes (total 900, permanent 900):
     35 failures (35 no memory)
Big buffers, 1536 bytes (total 900, permanent 900, peak 901 @ 2w6d):
     0 failures (0 no memory)
VeryBig buffers, 4520 bytes (total 100, permanent 100, peak 101 @ 2w6d):
     0 failures (0 no memory)
Large buffers, 5024 bytes (total 100, permanent 100, peak 101 @ 2w6d):
     0 failures (0 no memory)
VeryLarge buffers, 8304 bytes (total 100, permanent 100):
     0 failures (0 no memory)
Huge buffers, 18024 bytes (total 20, permanent 20, peak 21 @ 2w6d):
     0 failures (0 no memory)

Check for buffer failures and identify the buffer size.
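Because the pool name and its failure counter sit on separate lines, pairing them by hand is error-prone. The sketch below, with an illustrative function name and record layout, pairs each failures line with the pool header that precedes it and keeps only the pools that reported failures.

```python
import re


def buffer_failures(output):
    """Pair each 'N failures' line with the buffer pool header that precedes it."""
    pools = []
    name = size = None
    for line in output.splitlines():
        header = re.match(r"^\s*(\w+) buffers, (\d+) bytes", line)
        if header:
            name, size = header.group(1), int(header.group(2))
            continue
        failures = re.match(r"^\s*(\d+) failures \((\d+) no memory\)", line)
        if failures and name is not None:
            pools.append({"pool": name, "bytes": size,
                          "failures": int(failures.group(1)),
                          "no_memory": int(failures.group(2))})
            name = size = None
    return pools


SAMPLE = """\
Small buffers, 104 bytes (total 1200, permanent 1200):
     0 failures (0 no memory)
Middle buffers, 600 bytes (total 900, permanent 900):
     35 failures (35 no memory)
Big buffers, 1536 bytes (total 900, permanent 900, peak 901 @ 2w6d):
     0 failures (0 no memory)
"""

failing = [pool for pool in buffer_failures(SAMPLE) if pool["failures"] > 0]
print(failing)
```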

The last check could be data plane utilization. We can find out if the WLC is having data plane performance issues due to traffic volume or to some specific features being enabled. Use the command shared in the WLC checks: “show platform hardware chassis active qfp datapath utilization | i Load”

These KPIs were helpful in identifying a customer issue. The customer saw a periodic sharp increase in the number of ARP packets punted to the CPU. By monitoring the counter for ARPs punted to the CPU, and collecting a packet capture in the control plane, we could identify that these ARPs were sent from some specific MAC addresses that were doing malicious ARP scanning.

With this final bucket, we finish the Key Performance Indicators (KPIs) for the Catalyst 9800 WLC.

List of commands to use for KPIs and automation scripts

In the document below, there is also a link to a script that can automatically collect all the commands. It will collect commands based on platform and release, save them in a file, and export the file. The script uses the “Guest-shell” feature, which for now is only available in the physical WLCs 9800-40/80 and 9800-L.

The document also provides an example of an EEM script to collect logs periodically. In conclusion, EEM together with the “Guest-shell” script will help to collect 9800 WLC KPIs and build a baseline for your Catalyst 9800 WLC.


For the list of commands used to monitor these KPIs


