The Juniper Networks SRX architecture is frequently deployed in a redundant configuration, especially the data-center SRXs (SRX1400, SRX3400, SRX3600, SRX5600, SRX5800). It’s pretty obvious why.
When you think about the data that the firewall is protecting, uptime is just as critical as the security of the system, sometimes even more so. Production web, database, storage, and application traffic NEEDS to remain up 100% of the time. This is why applications/servers are typically designed in redundant or at least hot-standby configurations. And this is why network nerds deploy the SRX in a “cluster”.
The SRX “cluster” is the most resilient, highly available firewall solution on the market today. With sub-second failover times between cluster members, your traffic won’t miss a step in case of a network/hardware failure. That is if you deploy a fully redundant cluster!
Most network admins deploy two of everything when it comes to an SRX implementation:
- 2 physical chassis clustered together
- 2 physical interfaces for each network connection (bundled into reth or “redundant-ethernet” interfaces)
- Multiple SPCs typically in an N+1 configuration
- Multiple fabric interfaces for capacity and redundancy
- Multiple power modules connected to fully independent power inputs
But when designing fully redundant firewalling solutions, administrators often overlook the heart of the SRX cluster. The control link.
Since the inception of the SRX, I’ve designed and implemented hundreds of SRX firewall deployments and only an exceptionally small percentage of those have opted to use the “dual control link” feature.
So, what are control-links?
Since the SRX cluster is a logical way to marry two physically independent chassis, there needs to be a direct line of communication between them to determine which one is the primary or “brain” at the time. The control link resides on the “control-plane” instead of the “data-plane”. This allows the two “brains” to speak to each other on the same level since Juniper SRXs are built with a control-plane and data-plane to protect traffic in the event of failures. The primary node may or may not do any of the actual traffic passing work, but it does make all of the decisions and syncs them with the secondary node.
So how is this syncing accomplished?
It’s done through the combination of the control-link (control-plane traffic) and fabric links (data-plane traffic). The control-link is a dedicated port on the SRX designed specifically for clustering chassis together. It’s commonly referred to as the “heartbeat” between the two chassis. Just like in any real-world marriage, communication between the devices is CRITICAL to stability.
In fact, there is a built-in configuration mechanism that will automatically reboot the secondary node if the heartbeat goes down but the fabric link stays up. Wouldn’t it be great if we could just reboot our significant other in the middle of an argument? 🙂
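To make that heartbeat logic a little more concrete, here is a minimal Python sketch of the decision described above. It is only a toy model, not anything JUNOS actually runs: the function name, the `Outcome` labels, and the idea of counting working links are all my own assumptions for illustration. The case where both the control and fabric links are down is assumed here to be the “split-brain” situation, since neither node can see the other.

```python
from enum import Enum


class Outcome(Enum):
    # Labels are hypothetical, chosen for this sketch only.
    HEARTBEAT_OK = "heartbeats still flowing"
    SECONDARY_REBOOT = "secondary node disabled and rebooted"
    NO_COMMUNICATION = "nodes cannot see each other (split-brain risk)"


def heartbeat_outcome(control_links_up: int, fabric_links_up: int) -> Outcome:
    """Classify the cluster from the heartbeat's point of view.

    control_links_up: number of working control links (0, 1, or 2).
    fabric_links_up:  number of working fabric links.
    """
    if control_links_up < 0 or fabric_links_up < 0:
        raise ValueError("link counts cannot be negative")
    if control_links_up > 0:
        # At least one control link still carries heartbeats.
        return Outcome.HEARTBEAT_OK
    if fabric_links_up > 0:
        # Heartbeat lost, fabric still up: the case described above.
        return Outcome.SECONDARY_REBOOT
    # Assumption: with no control and no fabric path, the nodes are blind
    # to each other.
    return Outcome.NO_COMMUNICATION


if __name__ == "__main__":
    # Dual control links: losing one of the two changes nothing.
    print(heartbeat_outcome(control_links_up=1, fabric_links_up=2).value)
    # Single control link lost, fabric still up.
    print(heartbeat_outcome(control_links_up=0, fabric_links_up=2).value)
```

The only thing a second control link changes in this model is how often `control_links_up` reaches zero.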
So, if the heartbeat is so critical, why do so many deployments opt out of using dual control-links? Really, there are only 3 reasons:
- Cost
- Complexity
- Risk likelihood
To implement dual control-links in a data-center SRX, you need to purchase more hardware. In SRX1400, SRX3400, and SRX3600 deployments, you need to purchase an SRX Clustering Module (SCM). In SRX5600 and SRX5800 deployments, you need to purchase an entire SCB/RE combination. These can be expensive when considering the fact that this hardware is only sending “backup heartbeats”.
On the SRX1400, SRX3400, and SRX3600, dual control-links add no real complexity thanks to the built-in control-ports. However, the SRX5600 and SRX5800 rely on an added routing-engine, which does introduce some nuances only observed in SRX deployments.
In a Juniper router or switch, dual routing-engines in a single chassis provide intra-chassis redundancy for control-plane traffic. This is a great idea in standalone routing/switching environments. When you’re working in a multi-chassis SRX cluster, the redundant routing-engine actually resides on the secondary chassis. So, you don’t need another brain on the primary node.
The second routing-engine will boot up and check the hardware currently running and see that it is a second routing-engine in a cluster. When it sees this, it disables just about everything. It will boot up into a very limited JUNOS without a configuration. That’s because all this routing-engine actually does is pass heartbeats. Heartbeats orchestrated by the primary routing-engine. It’s not a “brain”, it’s a second “heart” in the SRX cluster organism.
The risk of losing heartbeats is incredibly low. And in the case of losing ONLY heartbeats, the cluster actually has built-in mechanisms to automatically disable and reboot the secondary node to help remedy the situation. So, the only scenarios you are accounting for when adding a second control-link are the physical failure of a control-link SFP or a fiber link failure. And since the second control-link is sitting RIGHT NEXT to the primary control-link, the chances that they both go down are fairly high. In the past year or so, the number of “split-brain” scenarios has dropped dramatically due to JUNOS upgrades/bug fixes anyway.
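If you want to put rough numbers on that argument, here is a back-of-the-napkin Python sketch. It is purely illustrative: the function, the `common_cause` parameter, and the 1% figure are all hypothetical, and it uses a simple common-cause model that I am assuming for the sake of the example, not anything measured on real SRX clusters. The idea is that some fraction of failures (a cut fiber bundle, a shared card) takes out both control-links at once, and only the remainder are independent.

```python
def heartbeat_loss_probability(p_link: float, common_cause: float, dual: bool) -> float:
    """Chance that every control link is down at the same time.

    p_link:       chance a single control link (SFP or fiber) is down.
    common_cause: fraction (0..1) of failures that hit both links together,
                  for example because the links sit side by side.
    dual:         True for dual control links, False for a single link.
    """
    if not (0.0 <= p_link <= 1.0) or not (0.0 <= common_cause <= 1.0):
        raise ValueError("probabilities must be between 0 and 1")
    if not dual:
        return p_link
    shared = common_cause * p_link               # one event kills both links
    independent = (1.0 - common_cause) * p_link  # per-link independent failure
    # Both links are down if the shared event happens, or if it does not
    # and both links fail independently.
    return shared + (1.0 - shared) * independent ** 2


if __name__ == "__main__":
    p = 0.01  # hypothetical 1% chance that a control link is down
    for beta in (0.0, 0.5, 1.0):
        single = heartbeat_loss_probability(p, beta, dual=False)
        dual = heartbeat_loss_probability(p, beta, dual=True)
        print(f"common_cause={beta:.1f}  single={single:.6f}  dual={dual:.6f}")
```

When `common_cause` is 0, the second link helps enormously; when it is 1, the second link buys you nothing. Where your deployment falls between those two extremes is exactly the judgment call discussed here.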
So, the rewards for adding a second control-link can be pretty low, especially when considering the cost of hardware needed to implement the redundant solution. As an engineer, I never have to worry about cost, so I always recommend a dual control-link architecture, but understand when customers choose not to implement it.
If you have deep pockets or strict redundancy requirements, do yourself a favor and implement dual control-links.
If you don’t, it wouldn’t hurt to re-examine those failure scenarios every once in a while to assess your redundancy vulnerability.