Network Scrapbook: OSPF Optimization

Accelerating OSPF Convergence

Convergence can be accelerated through faster Hellos, BFD (Bidirectional Forwarding Detection) and segmentation using areas. Using areas reduces the scope of SPF calculations.

Faster Hello Packets

Default timers (on broadcast and point-to-point networks): hello interval 10 seconds; dead interval 40 seconds. OSPF should not be relied on as the primary means of failure detection, since most lower-layer protocols detect failures faster and the default OSPF timers are slow. Designers often lower these timers to achieve faster convergence and faster failure detection. Decreasing the default timers increases network traffic and the packet-processing demand on the hardware. On Cisco, fast hellos fix the dead interval at one second and use a hello-multiplier to set how many hellos are sent within it; for example, a hello-multiplier of five sends a hello packet every 200 milliseconds.
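The fast-hello arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation, not of any router behaviour; the function name and its parameters are made up for illustration.

```python
def fast_hello_interval_ms(hello_multiplier: int, dead_interval_ms: int = 1000) -> float:
    """Spacing between hellos when N hellos must fit inside a fixed dead interval."""
    if hello_multiplier < 1:
        raise ValueError("hello_multiplier must be at least 1")
    return dead_interval_ms / hello_multiplier


# A one-second dead interval divided by the multiplier gives the hello spacing.
for multiplier in (3, 4, 5, 10):
    print(f"multiplier {multiplier}: one hello every {fast_hello_interval_ms(multiplier):.0f} ms")
```

A higher multiplier tolerates more lost hellos within the same second, at the cost of more packets to send and process.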

Bidirectional Forwarding Detection

BFD is an alternative to fast Hellos. It is not specific to OSPF. It provides lightweight failure detection and, on platforms that support it, is processed by the line card rather than the CPU. It operates primarily in the data plane rather than the control plane. BFD relies on the routing protocol’s neighborship establishment process to begin working. It does not lower the time it takes to establish a neighborship.

BFD operates in two modes: asynchronous and demand.

BFD Asynchronous

Systems exchange control packets periodically. If receipt of these packets stops, the session is declared down and the protocol configured to use it (for example, OSPF) is notified.
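A minimal sketch of the asynchronous-mode check follows. The text above does not give any timer values, so the receive interval and detect multiplier used here are assumptions for illustration, and the function names are hypothetical.

```python
def detection_time_ms(rx_interval_ms: int, detect_multiplier: int) -> int:
    """How long a system waits without control packets before declaring the session down."""
    return rx_interval_ms * detect_multiplier


def session_is_up(last_rx_ms: int, now_ms: int, rx_interval_ms: int, detect_multiplier: int) -> bool:
    """True while the silence since the last control packet is shorter than the detection time."""
    return (now_ms - last_rx_ms) < detection_time_ms(rx_interval_ms, detect_multiplier)


# Assumed values: a control packet expected every 50 ms, three misses tolerated.
print(detection_time_ms(50, 3))
print(session_is_up(last_rx_ms=1000, now_ms=1100, rx_interval_ms=50, detect_multiplier=3))
print(session_is_up(last_rx_ms=1000, now_ms=1150, rx_interval_ms=50, detect_multiplier=3))
```

When the check fails, the client protocol (OSPF in this case) would be told to tear down the neighbor instead of waiting for its own dead interval.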

BFD Demand Mode

In demand mode, BFD assumes that another mechanism is being used to verify connectivity, so once a session is established the periodic control packets can stop. Generally, asynchronous mode is used in production.

BFD Echo Function

The Echo function can be used in either mode. With the Echo function, a device sends Echo packets towards a remote system and expects them to be looped back. This verifies the path to the remote system as well as the forwarding path of the remote system. The Echo function can be enabled independently in each direction.

Controlling OSPF LSA Generation and Propagation

Technologies include:

LSA Throttling
LSA Flood Pacing
LSA Group Pacing
LSA Retransmission Pacing

LSA Throttling

Limits how often updates for the same LSA are regenerated when the topology changes. The ‘same’ LSA means one with the same LSA ID, LSA type and advertising router ID.

Normal LSA Update Behaviour

The initial LSA update is sent immediately; the LSA is then rate-limited and cannot be regenerated for another five seconds. As further changes come in, the pending update may be modified, but it is not sent until the rate-limit timer expires. A similar limit applies to updates received from a neighbor. By default, a new instance of an LSA is accepted only if it arrives at least one second after the previous instance. If it does not, the packet is dropped.

Waiting for the five seconds to expire during OSPF initialization and at times of network change can slow the propagation of LSAs. The OSPF LSA throttling feature can be used to combat this. LSA throttling alters how OSPF handles the generation of OSPF update packets through three configurable parameters:

LSA start-interval
LSA hold-interval
LSA max-interval

The initial update packet is always sent as soon as it is generated. Provided no other features are configured, the generation of the second update packet is governed by the LSA start-interval. If an event occurs and the OSPF device needs to send an additional update packet, it waits until the start-interval expires. At this point, the LSA hold-interval begins. If an event occurs during this hold-interval, the device waits until that interval expires before sending the update packet, and the next hold-interval doubles. The hold-interval never exceeds the max-interval.

LSA Hold/Max Interval

The max-interval is a ceiling on how long the hold-interval can become. The hold-interval doubles every time an additional event occurs within the current hold-interval. Once the max-interval is reached, it is used as the hold-interval to delay LSA update generation until it expires. This remains true until no events occur within two hold-intervals (or two max-intervals, if the ceiling has been reached). At that point the process resets, and the start-interval is used again when the next event occurs.
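The doubling and the ceiling are easy to see with numbers. The sketch below lists the hold-interval that would apply to each consecutive busy period; the function name and the 200 ms / 5000 ms values are illustrative assumptions, not defaults taken from any platform.

```python
from typing import List


def hold_interval_sequence(hold_ms: int, max_ms: int, busy_periods: int) -> List[int]:
    """Hold-interval used for each consecutive period in which another event arrived."""
    sequence = []
    current = min(hold_ms, max_ms)
    for _ in range(busy_periods):
        sequence.append(current)
        # An event during the hold-interval doubles the next one, capped at the max-interval.
        current = min(current * 2, max_ms)
    return sequence


# Assumed values: 200 ms hold-interval, 5000 ms max-interval, seven busy periods in a row.
print(hold_interval_sequence(200, 5000, 7))
```

With these values the interval grows 200, 400, 800, 1600, 3200 and then stays at the 5000 ms ceiling until the network has been quiet long enough for the process to reset.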

LSA throttling is a good way to improve convergence times when the network is stable while slowing down LSA update generation during instability.

OSPF Update Packet Pacing

Pacing affects the rate at which OSPF update packets are sent. Three different OSPF timers can be modified: flood pacing, retransmission pacing and group pacing.

Flood Pacing

Allows you to control the packet spacing between consecutive update packets in the OSPF transmission queue. On Cisco, by default, if multiple packets exist in the transmission queue, they are sent every 33ms.

Retransmission Pacing

Similar to flood pacing, but it affects the retransmission queue. By default, if multiple packets exist in the retransmission queue, they are sent every 66ms.
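To get a feel for what the 33ms and 66ms defaults mean for a backlog, the sketch below computes when each queued packet would leave. It assumes the first packet goes out at time zero; the function name is hypothetical.

```python
from typing import List


def send_offsets_ms(queued_packets: int, pacing_ms: int) -> List[int]:
    """Send time of each queued packet, assuming the first one is sent at t = 0."""
    return [index * pacing_ms for index in range(queued_packets)]


print("flood queue:         ", send_offsets_ms(4, 33))
print("retransmission queue:", send_offsets_ms(4, 66))
```

Under that assumption a queue of n packets takes (n - 1) times the pacing interval to drain, so the retransmission queue empties at half the rate of the flood queue.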

Group Pacing

Controls how LSAs are refreshed by an OSPF device. The typical LSA refresh interval is 30 minutes: an LSA is refreshed when its age reaches 30 minutes. If each LSA worked on its own independent timer, packets would be transmitted all the time, especially in large networks. Group pacing allows LSAs that are due for refresh at around the same time to be sent together. By default, it is set to 240 seconds: LSAs due within the same 240-second interval are held and sent at the same time. This improves efficiency and lowers demand on the network. Pacing timers generally work well and it is advised that they should not be modified. Any modification should be done after thorough testing.
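The grouping idea can be sketched by bucketing LSAs into 240-second windows according to when their refresh is due. This is an illustration of the concept only: how a real implementation aligns its windows is not described here, and the LSA names and due times are invented.

```python
from collections import defaultdict
from typing import Dict, List


def group_refreshes(due_times_s: Dict[str, int], pacing_s: int = 240) -> Dict[int, List[str]]:
    """Map each pacing window index to the LSAs whose refresh falls inside that window."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for lsa_id, due_s in due_times_s.items():
        groups[due_s // pacing_s].append(lsa_id)
    return dict(groups)


# Hypothetical LSAs and the time (in seconds) at which each is due for refresh.
due = {"lsa-a": 1800, "lsa-b": 1900, "lsa-c": 1930, "lsa-d": 2100}
for window, lsas in sorted(group_refreshes(due).items()):
    print(f"window {window} ({window * 240}-{window * 240 + 239}s): {lsas}")
```

Four independent timers collapse into two refresh bursts here, which is the efficiency gain the feature is after.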

Altering SPF Behaviour

SPF behaviour can be altered through SPF throttling and incremental SPF.

SPF throttling operates similarly to LSA throttling. It provides a way to control when SPF is run after an event occurs. SPF throttling uses three parameters: SPF start-interval, SPF hold-interval and SPF max-interval.

SPF Throttling

When an event initially occurs, the start-interval begins. On expiry, SPF is run using the new information and the hold-interval begins counting down. Any new event that occurs during this hold-interval waits until it expires; SPF is then run and a new hold-interval begins with twice the previous duration. If no event occurs within two hold-intervals, the process resets and is again governed by the start-interval. The hold-interval keeps doubling as additional events occur until it reaches the max-interval, which acts as a ceiling. Once it is reached, SPF runs every max-interval for as long as additional events continue to occur. If no events are received within two max-intervals, the process resets. Cisco defaults to a five-second start-interval and ten-second hold- and max-intervals.
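The timeline above can be simulated in a few lines. The following is a simplified model rather than a description of any vendor implementation. One case is not covered by the description, namely an event that arrives after the hold-interval has expired but before the two-interval reset; the sketch assumes SPF runs immediately in that case and the hold-interval is left unchanged. All names are hypothetical and times are in milliseconds.

```python
from typing import List


def spf_run_times(events_ms: List[int], start_ms: int, hold_ms: int, max_ms: int) -> List[int]:
    """Times at which SPF would run for a list of topology-event timestamps."""
    events = sorted(events_ms)
    runs: List[int] = []
    i = 0
    while i < len(events):
        # Idle state: the first event starts the start-interval.
        run_at = events[i] + start_ms
        hold = min(hold_ms, max_ms)
        while True:
            # Every event seen up to the run time is covered by this SPF run.
            while i < len(events) and events[i] <= run_at:
                i += 1
            runs.append(run_at)
            hold_end = run_at + hold
            reset_at = run_at + 2 * hold
            if i < len(events) and events[i] <= hold_end:
                # Event during the hold-interval: run at expiry, then double (capped).
                run_at = hold_end
                hold = min(hold * 2, max_ms)
            elif i < len(events) and events[i] < reset_at:
                # Assumption (not from the text): run immediately, keep the same hold.
                run_at = events[i]
            else:
                break  # quiet for two hold-intervals: back to the start-interval
    return runs


# The defaults quoted above: 5 s start, 10 s hold, 10 s max.
print(spf_run_times([0, 6000, 16000], 5000, 10000, 10000))
# Smaller assumed values to show the doubling and the reset.
print(spf_run_times([0, 150, 350, 800, 5000], 100, 200, 1000))
```

In the second call the gaps between runs grow from 200 to 400 to 800 ms while events keep arriving, and the late event at 5000 ms is handled with the start-interval again because the process has reset.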

Incremental SPF (iSPF)

iSPF changes how and when SPF is run. Normally, any change to a Type 1 or Type 2 LSA forces SPF to run. Often this isn’t required, as the SPT will not change for every device, so many nodes run SPF needlessly. iSPF makes the running of SPF conditional on three considerations:

Whether a leaf node is being added or removed.
Whether a change affects the SPT of a device.
Whether a change affects a limited part of the SPT.

The addition of a leaf node does not affect the SPT of existing devices, so a full SPF run is not needed. iSPF prevents a full SPF run on non-local devices and limits the SPF run to nodes directly associated with the change. This includes devices where the addition or removal will not locally modify connectivity.

Failure of a link that is not part of a device’s SPT: iSPF limits the SPF run to only the affected entries.

Failure of a link that is part of the SPT but does not affect all devices: iSPF limits the devices that run SPF to those directly affected by the link failure. Only devices that are downstream from the failure re-run SPF. Use of iSPF should be assessed based on the environment.
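The "downstream only" idea can be sketched from the point of view of a single shortest-path tree: if the failed link is not on the tree nothing needs recomputing, and if it is, only the subtree hanging below it is affected. This is a conceptual illustration, not the actual iSPF algorithm; the tree, router names and function name are invented.

```python
from typing import Dict, List, Set, Tuple


def affected_by_link_failure(spt_children: Dict[str, List[str]], failed_link: Tuple[str, str]) -> Set[str]:
    """Nodes of the shortest-path tree that sit downstream of a failed link."""
    a, b = failed_link
    if b in spt_children.get(a, []):
        subtree_root = b
    elif a in spt_children.get(b, []):
        subtree_root = a
    else:
        return set()  # the link is not on the tree, so the tree is unchanged

    affected: Set[str] = set()
    stack = [subtree_root]
    while stack:
        node = stack.pop()
        affected.add(node)
        stack.extend(spt_children.get(node, []))
    return affected


# Hypothetical tree rooted at R1: R1 -> R2, R3; R2 -> R4, R5; R3 -> R6.
tree = {"R1": ["R2", "R3"], "R2": ["R4", "R5"], "R3": ["R6"]}
print(sorted(affected_by_link_failure(tree, ("R1", "R2"))))  # link on the tree
print(sorted(affected_by_link_failure(tree, ("R5", "R6"))))  # link not on the tree
```

Only the R2 branch needs new paths in the first case, which is why the saving from iSPF depends so much on where in the topology changes tend to occur.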

Reducing the Size of the Link-State Database

To reduce the size of the LSDB, implement the following:

Stub areas
LSA summarization
LSA Type 3 filtering
Prefix suppression
OSPF network types

Stub Areas

Stub areas limit the types of LSAs allowed into an area.

They are most often used with single-exit areas.

LSA Summarization

Different types of summarization include area summarization and external summarization.

Area Summarization

Used on ABRs, area summarization controls how inter-area routes are generated and advertised into an area. Without summarization, routes are advertised with their original mask information. Only intra-area routes are summarized. An ABR must be part of the area the targeted entries are sourced from; if it is not, it will not see the routes as intra-area and will not summarize them. The metric used for the summary route is based on the lowest metric among the summarized routes. If this is not desired, configure a static metric.

Area summarization limits the advertisements that are known by devices in other areas (it limits inter-area database entries). It shields devices from having to react to topology changes that occur in other areas.
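As a rough illustration of the summary and its metric, the sketch below uses Python's standard ipaddress module to find which intra-area routes fall inside a configured range and applies the lowest-metric rule described above. The prefixes, metrics and function name are invented for the example.

```python
import ipaddress
from typing import Dict, Optional, Tuple


def summarize(intra_area_routes: Dict[str, int], summary_range: str) -> Optional[Tuple[str, int]]:
    """Return (summary prefix, metric) if any intra-area route falls inside the range."""
    summary_net = ipaddress.ip_network(summary_range)
    covered = [
        metric
        for prefix, metric in intra_area_routes.items()
        if ipaddress.ip_network(prefix).subnet_of(summary_net)
    ]
    if not covered:
        return None  # nothing to summarize, so no summary is advertised
    # Lowest component metric, as described in the text above.
    return str(summary_net), min(covered)


# Hypothetical intra-area routes (prefix -> metric) known to the ABR.
routes = {"10.1.0.0/24": 10, "10.1.1.0/24": 20, "10.1.2.0/24": 30, "192.168.5.0/24": 5}
print(summarize(routes, "10.1.0.0/22"))
print(summarize(routes, "172.16.0.0/16"))
```

The 192.168.5.0/24 route is outside the range, so it does not influence the summary metric and would still be advertised on its own.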

External Summarization

Summarizes external routing information (redistributed routes), limiting the number of individual routes inserted during redistribution. Summarization is performed on the ASBR. Unstructured (discontiguous) addressing causes problems with summary route configuration. Summarization may introduce inefficient routing in areas with multiple ABRs.

Type 3 LSA Filtering

Controls Type 3 LSAs being advertised into or from an area. Provides granular control over advertisements into different areas. Verify the behaviour before using it in production, especially when it is combined with other features.

Prefix Suppression

Allows suppression of advertisements of connected prefixes and can be implemented globally or at the interface.

Prefix Suppression Globally: all connected prefixes are suppressed except those on loopbacks and passive interfaces, and secondary addresses. Individual interfaces can have prefix suppression disabled to allow their addresses to be advertised.
Prefix Suppression Locally: all addresses on the interface are suppressed, including secondary addresses.

Prefix suppression is very handy on large networks to reduce the size of the LSDB. However, it can complicate troubleshooting.
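The global and interface-level rules can be expressed as a small filter. This is a sketch of the rules as stated above, not of any device's implementation; the Interface class, its fields and the sample addresses are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Interface:
    primary: str
    secondaries: List[str] = field(default_factory=list)
    loopback: bool = False
    passive: bool = False
    # None = follow the global setting; True / False = interface-level override.
    suppress: Optional[bool] = None


def advertised_prefixes(interfaces: List[Interface], global_suppression: bool) -> List[str]:
    """Connected prefixes that would still be advertised under the stated rules."""
    advertised: List[str] = []
    for intf in interfaces:
        if intf.suppress is True:
            continue  # interface-level suppression hides primary and secondary addresses
        exempt = intf.loopback or intf.passive or intf.suppress is False
        if not global_suppression or exempt:
            advertised.append(intf.primary)
        advertised.extend(intf.secondaries)  # global suppression leaves secondaries alone
    return advertised


interfaces = [
    Interface("10.0.12.0/30"),                                               # transit link
    Interface("1.1.1.1/32", loopback=True),
    Interface("10.0.13.0/30", secondaries=["10.0.113.0/24"]),
    Interface("10.0.14.0/30", secondaries=["10.0.114.0/24"], suppress=True),
    Interface("10.0.15.0/30", suppress=False),                               # opted out
]
print(advertised_prefixes(interfaces, global_suppression=True))
```

With global suppression on, only the loopback, the secondary address on the third interface and the opted-out link remain in the output, which shows why transit-only addresses can become harder to find when troubleshooting.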

OSPF Network Types

Some OSPF network types use more LSAs than others. It is a good idea to assess whether the default options are the best. Ethernet is multi-access, and OSPF defaults to the broadcast network type on it, which requires a Type 2 LSA. It also requires a DR/BDR election, which adds a delay to the neighborship formation process. If an Ethernet interface connects to only one other OSPF peer, it is better to configure the interface as the point-to-point network type. Here, the designated router (DR) is not required and only Type 1 LSAs are used.

Reducing the Effects of Restarts on OSPF

Graceful restart is covered in RFC 3623. It allows a device to restart without affecting the forwarding of traffic. This requires changes to normal OSPF operation. Device restart methods include using the power switch (hard) and using software (soft).

Hard Restart: all adjacencies are dropped. This type is generally not recommended.
Soft Restart: OSPF notifies peers of an impending restart by flushing all of its LSAs and sending empty Hello packets, which results in all neighborships being dropped. Neighbors know immediately that a peer is going to become unavailable and make appropriate adjustments to their LSDB and routing.

Regardless of restart method, traffic flow is interrupted (through re-routing or black-holing). Some platforms handle data-plane functions on the line cards and control-plane and management functions on the CPU.

With graceful restart, devices must separate data-plane and control-plane functions. The devices must modify OSPF behaviour when a restart occurs. OSPF normally notifies peers of an impending restart by advertising the LSAs originated by the device with the maximum LSA age. This brings down neighborships, causing peers to run SPF to re-route traffic, so normal OSPF behaviour affects traffic. In graceful restart, devices alert peers of a restart. Neighbors lock the neighborship (assuming this feature is supported), maintaining the appearance of a full adjacency. Neighbors continue to send traffic to the device as normal. They do not, however, receive normal OSPF messages. The restarting device uses a Grace LSA to communicate with its neighbors. This is sent on all OSPF interfaces with a link-local scope, triggering neighbors to prepare for a restart. This LSA includes an expected grace period, which is the duration for which the full neighborship is assumed.
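From the neighbor's side, the grace period boils down to a timer check. The sketch below is a deliberately minimal model of that one decision and ignores everything else a helper does; the function name, parameters and the 120-second value are assumptions for illustration.

```python
def treat_neighbor_as_full(grace_lsa_received_s: float, grace_period_s: float,
                           now_s: float, helper_supported: bool = True) -> bool:
    """True while a helper would keep presenting the restarting neighbor as fully adjacent."""
    if not helper_supported:
        return False  # without helper support the adjacency is handled as in a normal restart
    return (now_s - grace_lsa_received_s) < grace_period_s


# Assumed 120-second grace period, Grace LSA received at t = 0.
print(treat_neighbor_as_full(0, 120, now_s=30))
print(treat_neighbor_as_full(0, 120, now_s=150))
print(treat_neighbor_as_full(0, 120, now_s=30, helper_supported=False))
```

If the restarting device is not back within the advertised period, the neighbor stops pretending and the usual SPF and re-routing take over.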

Grace LSA

Grace LSAs are sent repeatedly until they are acknowledged. If they are not acknowledged, a normal restart commences (no graceful restart). The restarting device does not originate or flush LSAs and continues to use its pre-restart routing tables until all neighborships return to normal operation. Neighborship establishment after a graceful restart is the same as after a normal restart. The data path through the device remains uninterrupted. Once neighborships are re-established, the restarting device flushes its Grace LSAs, runs through the normal OSPF process and re-originates its LSAs.

Graceful Restart Modes

Graceful restart defines two modes of operation: one for the restarting device and another for helper peer devices. The mode for peers is referred to as helper mode. Many more devices support helper mode than graceful restart mode. On Cisco, graceful restart is referred to as NSF (Non-Stop Forwarding). Full support is referred to as NSF-capable. Helper support is referred to as NSF-aware.


Wednesday, 5 May 2021
