Cross Vendor Troubleshooting and Bug Finding

I am currently installing a new network core/backbone and campus-wide wireless infrastructure for one of my customers.  This project involves equipment from three main technology vendors, all of which needs to talk to each other.  The new wireless network uses Aruba gear, the new core and backbone is built on HP Procurve, and at the edge of the network are Cisco switches that have been in place for several years.

This network isn’t particularly complicated, but whenever more than one vendor is supplying gear, the design must make use of standard protocols for all communications.  With this in mind, the design called for using LACP to build the link trunks between gear.  LACP is defined in IEEE 802.3ad, has been around for about a decade, and is supported by most vendors and platforms.

My expectation was that implementing this technology would be a breeze and a quick check-off on the to-do list.  I was sadly mistaken.

First, the Aruba to HP trunk…

This trunk is made up of two 10 gigabit fiber connections from a 6000 series Aruba controller to an HP E5400 series switch.  To start off, I only had one connection up and linked.  If you’ve never configured an LACP trunk on a Procurve, there isn’t much to the configuration: “trunk B3-B4 trk1 LACP”.  The Aruba uses a very Cisco-esque interface subcommand, “lacp group 1 mode active”.  It is a fairly straightforward configuration, and initially all was looking pretty good.

Checking status on the HP revealed

PORT   LACP      TRUNK     PORT      LACP      LACP
----   -------   -------   -------   -------   -------
B3     Active    Trk1      Up        Yes       Success
B4     Active    Trk1      Down      No        Success

but directly after sending some traffic it changed to

PORT   LACP      TRUNK     PORT      LACP      LACP
----   -------   -------   -------   -------   -------
B3     Active    Trk1      Blocked   Yes       Failure
B4     Active    Trk1      Down      No        Success
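Spotting this kind of change by eye gets tedious when you are re-running the status command over and over.  Here is a minimal Python sketch that parses output pasted in the format above and flags suspicious ports.  The column meanings (port, LACP mode, trunk group, port status, LACP partner, LACP status) are my reading of the six columns, and the function names are made up for illustration; nothing here comes from an HP tool.

```python
from typing import List, NamedTuple


class LacpPort(NamedTuple):
    port: str
    mode: str          # assumed: LACP enabled mode (Active/Passive)
    trunk: str         # assumed: trunk group
    port_status: str   # assumed: Up / Down / Blocked
    partner: str       # assumed: Yes if an LACP partner was seen
    lacp_status: str   # assumed: Success / Failure


def parse_hp_lacp(text: str) -> List[LacpPort]:
    """Turn pasted status output into a list of LacpPort rows."""
    ports = []
    for line in text.splitlines():
        fields = line.split()
        # Keep only six-column data rows; skip the header and the dashes.
        if len(fields) != 6 or fields[0] == "PORT" or set(fields[0]) == {"-"}:
            continue
        ports.append(LacpPort(*fields))
    return ports


def problem_ports(ports: List[LacpPort]) -> List[LacpPort]:
    """Flag blocked ports, LACP failures, and linked ports with no partner."""
    return [
        p for p in ports
        if p.port_status == "Blocked"
        or p.lacp_status == "Failure"
        or (p.port_status == "Up" and p.partner == "No")
    ]


AFTER_TRAFFIC = """\
PORT   LACP      TRUNK     PORT      LACP      LACP
----   -------   -------   -------   -------   -------
B3     Active    Trk1      Blocked   Yes       Failure
B4     Active    Trk1      Down      No        Success
"""

flagged = problem_ports(parse_hp_lacp(AFTER_TRAFFIC))
```

With the second table as input, `flagged` contains only B3.  B4 is left alone because a port that is simply down is expected to have no partner.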

A blocked connection is never a good thing, and traffic stopped completely.  After double and triple checking the configurations, I opened a support case with both Aruba and HP Procurve.  While troubleshooting, we tried building the trunk as a protocol-less trunk and were successful with that method; however, we were never able to get the LACP trunk up and running.  Though the first goal is to make it work, I still wanted to know why LACP wasn’t working.  Digging deeper on the Aruba, I found that it wasn’t receiving any LACPDUs on the link.

LACP Counter Table
Port     LACPDUTx  LACPDURx  MrkrTx  MrkrRx  MrkrRspTx  MrkrRspRx  ErrPktRx
----     --------  --------  ------  ------  ---------  ---------  --------
XG 0/10  12        0         0       0       0          0          0
XG 0/11  0         0         0       0       0          0          0
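The telling detail in that table is a port that transmits LACPDUs but never receives any.  The sketch below, in Python, picks that pattern out of pasted counter output.  It assumes the layout shown above (a port name that may contain a space, followed by seven numeric counters), and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_lacp_counters(text: str) -> Dict[str, Tuple[int, int]]:
    """Map each port name to its (LACPDUTx, LACPDURx) counters."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row ends in seven numeric counters; anything else
        # (title, header, dashes, blank lines) is skipped.
        if len(fields) < 8 or not all(f.isdigit() for f in fields[-7:]):
            continue
        port = " ".join(fields[:-7])  # e.g. "XG 0/10"
        rows[port] = (int(fields[-7]), int(fields[-6]))
    return rows


def one_way_ports(rows: Dict[str, Tuple[int, int]]) -> List[str]:
    """Ports that have sent LACPDUs but received none back."""
    return [port for port, (tx, rx) in rows.items() if tx > 0 and rx == 0]


COUNTERS = """\
LACP Counter Table
Port     LACPDUTx  LACPDURx  MrkrTx  MrkrRx  MrkrRspTx  MrkrRspRx  ErrPktRx
----     --------  --------  ------  ------  ---------  ---------  --------
XG 0/10  12        0         0       0       0          0          0
XG 0/11  0         0         0       0       0          0          0
"""

silent_peers = one_way_ports(parse_lacp_counters(COUNTERS))
```

Here `silent_peers` holds only XG 0/10.  XG 0/11 has sent nothing either, which fits a link that is not up yet rather than a one-way conversation.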

Interesting as that was, even with help from both HP and Aruba we weren’t able to get the LACP-based trunk operational, and we have left it configured as a simple aggregate link/port-channel.

And the Cisco to HP Trunk…

This network makes use of two separate trunks to connect end to end.  The network looks something like this:

[HP Procurve] ========= [Cisco Catalyst] ========= [HP Procurve]

Between each pair of switches are two gigabit fiber connections.  For the initial configuration here, I dug right in and configured the LACP trunk on the HPs as above, and used the command “channel-group 2 mode active” on the Cisco switch.  The status on the switches immediately showed that something was amiss, though.

The HP showed

 PORT   LACP      TRUNK     PORT      LACP      LACP
 ----   -------   -------   -------   -------   -------
 21     Active    Trk1      Up        No        Success

And the Cisco logged a message of

1w2d: %EC-5-L3DONTBNDL2: Gi0/24 suspended: LACP currently not enabled on the remote port.
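Cisco log lines follow a %FACILITY-SEVERITY-MNEMONIC: text pattern, which makes them easy to pull apart when you are collecting evidence for a support case.  The Python sketch below does that for the message above.  The regular expressions are my own and only assume the simple form shown here; the function name is hypothetical.

```python
import re
from typing import Dict, Optional

# %FACILITY-SEVERITY-MNEMONIC: free text
LOG_RE = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+):\s*"
    r"(?P<text>.*)"
)
# "<interface> suspended: <reason>" as seen in the message above
SUSPEND_RE = re.compile(r"(?P<interface>\S+) suspended: (?P<reason>.*)")


def parse_cisco_log(line: str) -> Optional[Dict[str, str]]:
    """Split a log line into its parts; return None if it doesn't match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    parsed = match.groupdict()
    # If the text names a suspended interface, record it and the reason.
    suspended = SUSPEND_RE.match(parsed["text"])
    if suspended:
        parsed.update(suspended.groupdict())
    return parsed


LINE = ("1w2d: %EC-5-L3DONTBNDL2: Gi0/24 suspended: "
        "LACP currently not enabled on the remote port.")
event = parse_cisco_log(LINE)
```

For this line, `event["mnemonic"]` is `L3DONTBNDL2` and `event["interface"]` is `Gi0/24`, which are handy strings to search on.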

Clearly, neither switch was successfully seeing the other in this configuration.  Having already opened a ticket with HP related to the trunking with the Aruba, I added this problem to the same case.  I spent a lot of time gathering logs and details and sending them onwards, but eventually Google pointed me in the right direction for this one.

Cisco Bug CSCsh97848 was the culprit here.  The gist of the problem is that though LACP is supposed to use the configured native VLAN for control traffic to build and maintain the link, Cisco switches running code 12.2 only allow VLAN 1 to be used as the native VLAN across an LACP trunk.  Once I reconfigured the link to use VLAN 1 as the native VLAN on both sides, the LACP trunk came right up.

What did I learn…

This was definitely not the first time I’ve had this experience, but it was another case of Google being one of the best troubleshooting tools out there.  Though I had cases opened with two of the vendors, the resolution ended up coming from my own efforts at running down the problem.  To be honest, though, at least for the HP to Cisco link, had I been able to open a case with Cisco TAC, I expect they would have quickly identified the troublesome bug.

Though I have rarely pushed a single-vendor solution for the sake of being single vendor, I can see some truth to the adage “one throat to choke” in this case.  Because the problem was with the interoperability between vendor gear, I found this troubleshooting process to be a little slower, and I did have a few instances where the vendors seemed to be pointing fingers at each other.  Overall though, I was very pleased with the way the Aruba and HP engineers worked together at sharing information and attempting to resolve the problem.  It would have been even better had the LACP problem between the HP and Aruba devices been resolved.

And lastly, even when using standard protocols, there can be problems and differences in implementation and features.  Just because the same standard appears on separate feature lists doesn’t necessarily mean the devices will work together when connected.