The Need to Characterize Redundancy, Performance Requirements
- Published on January 31, 2008
- Written by Mike McClain, Senior Web Designer & Site Manager
The intent of the table is to provide the initial guidance necessary for planning the redundancy measures for the system. As such, the table is merely a planning tool and does not need to include every detail up front. The requirements within the table should be revisited, developed, and updated as the system design takes shape. Customers are often surprised to learn which redundancy capabilities the system actually has (and lacks) when the information is laid out in this fashion.
The table is divided into three sections: system classification, redundancy requirements, and performance requirements.
The section in yellow defines the system classification columns of the table. Two methods of setting up the columns are recommended. The first method is intended for low-level requirements specification that involves decomposing an end-to-end data path into segments, and populating the columns with that data. The second method is intended for high-level requirements specification and is depicted in the sample diagram.
The example illustrates the Department of Defense (DoD) method of system classification by mission assurance category (MAC) level. C, S, and P stand for the classified, sensitive, and public categories of data within each MAC level. The underlying assumption is that the organization has classified its systems by sensitivity to the organization’s mission.
Note that the only intent is to establish a differentiation of the systems within the columns. Although this example uses the DoD method of system classification, any method may be substituted.
Redundancy categories specify system requirements that would deal with “events,” “threats,” “disasters,” or “contingencies” to minimize their potential negative impacts.
The level of detail included depends upon the organization. A very high level response (such as “yes” or “no”) might be appropriate for smaller, less complex systems. A very detailed and meticulous response would be appropriate for systems operating in “high risk” environments. Where appropriate, suggestions to include supplemental information details are described within the category.
Physical and Functional – Physical and Functional redundancy refers to having multiple physical hardware devices or software instances. In addition to determining which system nodes require redundancy, there are a number of additional data points that should be examined.
1. Readiness State – Appropriate supplemental data for this response may include the number and type of physically redundant devices required and whether they need to be maintained in a “hot with high availability,” “hot,” “warm,” or “cold” readiness state. Note that the term high availability (HA) means different things to different organizations. In the hardware or software arena, HA typically implies a stateful failover mechanism. However, it is important to double-check whether state is actually maintained, since few vendors (and even fewer model types) can provide this capability.
2. Component – The level of granularity could include specific components within the node itself, such as redundant array of independent disks (RAID) requirements, data recovery, router module requirements, backplane separation, host interface card requirements, etc.
3. Oversubscription Rate – If the failover path cannot carry the full traffic load that was present before failover (i.e., it is oversubscribed), the acceptable level of oversubscription should be specified. Note that this also applies to the physical data paths, whose requirements should be matched accordingly.
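As a rough illustration of the oversubscription item above, the level can be expressed as the ratio of the traffic offered before failover to the capacity available after failover. The function name and the traffic figures below are hypothetical and chosen only for the example; this is a planning sketch, not a measurement method.

```python
def oversubscription_ratio(offered_mbps: float, failover_capacity_mbps: float) -> float:
    """Ratio of pre-failover traffic to the capacity left after failover.

    A value above 1.0 means the failover path is oversubscribed.
    """
    if failover_capacity_mbps <= 0:
        raise ValueError("failover capacity must be positive")
    return offered_mbps / failover_capacity_mbps


# Hypothetical figures: 800 Mbps of normal traffic, 500 Mbps on the backup path.
ratio = oversubscription_ratio(800.0, 500.0)
print(f"oversubscription {ratio:.1f}:1")
```

Recording the ratio (rather than a simple yes or no) makes it easier to match the same figure against the physical data path requirements.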
Logical Data Paths – Logical data path redundancy does not provide true redundant capability and it is important to understand when it is substituted in place of physical data path redundancy. For example, it is possible to logically create several paths between two points via separate generic routing encapsulation (GRE) tunnels or layer-3 routes. However, if the traffic traverses the same submarine cable or fiber, it is dependent on the same physical hardware and interconnection points at that segment within the data path. Although multiple logical data paths are visible to the traffic, only one physical data path exists. Note that this category is omitted from the requirements table since it is an undesirable result.
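One way to spot logical paths that only appear redundant is to list the physical segments each path traverses and look for overlaps. The sketch below assumes such an inventory is available; the path and segment names are invented for illustration.

```python
from __future__ import annotations

from itertools import combinations


def shared_segments(paths: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """For each pair of logical paths, return the physical segments both traverse."""
    shared = {}
    for (name_a, segs_a), (name_b, segs_b) in combinations(paths.items(), 2):
        common = segs_a & segs_b
        if common:
            shared[(name_a, name_b)] = common
    return shared


# Hypothetical inventory: two GRE tunnels ride the same submarine cable.
paths = {
    "gre-tunnel-1": {"metro-fiber-1", "submarine-cable-x"},
    "gre-tunnel-2": {"metro-fiber-2", "submarine-cable-x"},
    "l3-route-3": {"metro-fiber-3", "satellite-link-y"},
}

for pair, segments in shared_segments(paths).items():
    print(pair, "share", sorted(segments))
```

Any pair reported here is logically separate but physically dependent on the same segment, so it should not be counted as physical data path redundancy.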
Physical Data Paths – This type of redundancy is the ability to establish logical connectivity between two endpoints over separate physical paths or circuits, for example when one or more logical paths are established over physically separate, parallel circuits between the two endpoints.
Technological – Although a physical data path may be completely redundant, it is possible that the underlying technology possesses an inherent weakness that cannot be overcome. For example, satellite connections are susceptible to sunspot activity and, in certain frequency bands, rain washout.
Role – In some cases, personnel skills may be unique to the organization, or concentrated within a few individuals, such that the loss of one or more key individuals can significantly influence ongoing operations. Role redundancy ensures that the loss of an individual or group within the organization does not significantly affect the system.
Vendor – Even if additional physical devices or physically redundant data path services are acquired, it is possible that a weakness or vulnerability shared across a single vendor's products negates the redundancy value of those products.
Organizational – Organizations may experience internal issues that could affect the operation of their systems. Examples include bankruptcy, subcontractor issues, internal reorganizations, mergers, or similar events. Organizational redundancy ensures that the system is sufficiently isolated from, or is able to deal with, organization-specific events.
Geographical – Geographic redundancy includes guarding against events or acts-of-God affecting a particular region or locality through geographic dispersal of assets. Continent, country, state, county, central office, street, demarcation point, or even opposite corners of a building are all considerations under this category. Note that geographic separation requirements may affect the physical data path requirements.
Political – The political category influences the role, organizational, and geographical redundancy requirements, but has distinct differences. For certain international or multinational situations, redundancy measures may need to take political considerations into account when determining other system requirements. For example, the influence of labor unions, protest groups, sources of unrest, or the actions of hostile foreign governments could guide which personnel, locations, vendors, and contractors are deemed as “sufficient” redundancy measures for the system.
Specific Performance Targets
These values refer to the technical benchmarks and targets used to characterize the stability of the underlying IT infrastructure. The intent is that these targets are based upon actual application data, if present.
The performance targets of loss, latency, and jitter should be specified for any given system. This will allow the operations and maintenance effort to properly monitor, tune, alarm, report, and react to failure to meet performance targets within the system. Loss is typically expressed in percentage of data or packets per time-period, whereas latency and jitter are typically expressed in milliseconds.
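To make the monitoring idea concrete, the following minimal sketch compares measured loss, latency, and jitter against planning targets. The target values and sample data are hypothetical, and jitter is taken here as the mean absolute difference between consecutive packet latencies, which is only one of several definitions in use.

```python
from statistics import mean


def summarize(latencies_ms, packets_sent):
    """Summarize one measurement window.

    latencies_ms: latency of each packet that arrived, in arrival order.
    packets_sent: number of packets sent in the window.
    """
    if packets_sent <= 0 or not latencies_ms:
        raise ValueError("need at least one sent and one received packet")
    loss_pct = 100.0 * (packets_sent - len(latencies_ms)) / packets_sent
    # Assumption: jitter = mean absolute change between consecutive latencies.
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return {
        "loss_pct": loss_pct,
        "latency_ms": mean(latencies_ms),
        "jitter_ms": mean(diffs) if diffs else 0.0,
    }


def violations(measured, targets):
    """Return the names of the targets that the measurement exceeds."""
    return [name for name, limit in targets.items() if measured[name] > limit]


# Hypothetical targets and samples: 4 of 5 packets arrived.
targets = {"loss_pct": 1.0, "latency_ms": 150.0, "jitter_ms": 30.0}
measured = summarize([100.0, 120.0, 110.0, 130.0], packets_sent=5)
print(measured)
print("targets missed:", violations(measured, targets))
```

In this example only the loss target is missed, which is the kind of result an alarm or report would be built around.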
The performance targets should ideally be based upon the actual software application needs. In other words, it is important to ask how much packet loss, delay, or time difference between packets can the most sensitive software application tolerate before it fails to function properly. For example, it may be necessary to provide the loss-sensitive applications with more bandwidth. If this information is not known, a formal application profile may be required to document the application’s breaking point. In any event, special application requirements should be identified and annotated as part of the information gathering process.
It is possible to achieve additional granularity by adding columns and annotations to the table. The example provided is merely a starting point and may be supplemented as desired.
Note that the intent is to provide performance planning information for the system’s development lifecycle, not to prescribe the actual means of implementation. For example, to actually achieve the established performance target, the design may require a specific congestion avoidance or congestion management method such as class of service (CoS), type of service (ToS), quality of service (QoS), differentiated services (DiffServ), integrated services (IntServ), or other technical measures.
Overall Performance Targets
It is acknowledged that a discussion of service level agreements (SLAs) is worthy of its own article or book. To maintain the scope of this article, only a few highlights regarding the topic are presented here.
There are two common ways that an organization expresses an overall SLA performance target: as an availability or uptime percentage, or as a perceived end-user experience or end-to-end user experience (EUE). Both are typically represented as a percentage measured per month or per year. Assuming 365 days in a year and an average of 30 days in a month, Figure 2 represents the corresponding unscheduled downtime allowed by the SLA. Note that scheduled downtime is excluded from many SLA contracts.
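The relationship between an availability percentage and the unscheduled downtime it allows is simple arithmetic, sketched below using the same 365-day year and 30-day month assumptions. The function name and the list of percentages are illustrative choices, not values taken from Figure 2.

```python
HOURS_PER_YEAR = 365 * 24   # 8,760 hours, per the 365-day assumption
HOURS_PER_MONTH = 30 * 24   # 720 hours, per the 30-day assumption


def allowed_downtime_minutes(availability_pct: float, period_hours: float) -> float:
    """Unscheduled downtime (in minutes) permitted by an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100.0 - availability_pct) / 100.0 * period_hours * 60.0


for pct in (99.0, 99.9, 99.99, 99.999):
    per_month = allowed_downtime_minutes(pct, HOURS_PER_MONTH)
    per_year = allowed_downtime_minutes(pct, HOURS_PER_YEAR)
    print(f"{pct}%: {per_month:.1f} min/month, {per_year:.1f} min/year")
```

For instance, a 99.9% target works out to roughly 43 minutes of unscheduled downtime per 30-day month, or a little under nine hours per year.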
The assumption is that the availability or uptime percentage will be incorporated as part of the SLA within a contract (e.g., with a service provider for WAN services). The SLA is easily monitored with heartbeat or keepalive services tied into the organization’s network management systems (NMSs). Although this value might also represent the overall system uptime required for a 24x7 network operations center (NOC), it is less useful when planning the performance needs for a bank that is only open from 9 to 5. Another planning value is required to ensure that performance planning is properly tailored to the organization.
The intent of the perceived EUE is to capture the experience of the end user versus the actual system performance. For example, the EUE may be specified on a monthly basis at “three nines” (i.e., 99.9%) between 8 a.m. and 6 p.m., whereas the SLA or absolute uptime may be left unspecified. The interpretation is that any level of failure is tolerable off hours as long as the user perceives less than an hour of downtime per month during working hours. The major problem with using an EUE target in a contract is that it is nearly impossible to measure and enforce, since the interpretation is largely subjective. However, describing an EUE for planning purposes is very useful, since it conveys the intent of the organization that the architecture and design of the system must support.
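For planning purposes, an EUE target can be turned into a downtime budget that counts only the hours in which users are present. The sketch below takes “three nines” as 99.9% and assumes the 8 a.m. to 6 p.m. window applies on all 30 days of the month; the function names and outage times are hypothetical.

```python
def eue_budget_minutes(target_pct, start_hour, end_hour, days):
    """Downtime budget (minutes) inside the daily user window over `days` days."""
    window_hours = (end_hour - start_hour) * days
    return (100.0 - target_pct) / 100.0 * window_hours * 60.0


def perceived_minutes(outages, start_hour=8, end_hour=18):
    """Minutes of outage that fall inside the user window.

    outages: (begin, end) pairs in hours of the day, each within a single day.
    """
    total = 0.0
    for begin, end in outages:
        overlap = min(end, end_hour) - max(begin, start_hour)
        total += max(0.0, overlap) * 60.0
    return total


budget = eue_budget_minutes(99.9, 8, 18, 30)
# Hypothetical outages: 6:00-8:30 (30 minutes felt by users) and 22:00-23:00 (none).
felt = perceived_minutes([(6.0, 8.5), (22.0, 23.0)])
print(f"budget {budget:.0f} min, perceived {felt:.0f} min, within target: {felt <= budget}")
```

Under these assumptions the monthly budget is about 18 minutes, so an outage of three and a half hours in total is judged only by the 30 minutes that overlapped working hours.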
It is fully understood that implementation of redundancy and performance requirements increases the cost of the system, often beyond what was originally budgeted. However, without the ability to categorize and track redundancy and performance requirements, it is difficult to make rational decisions regarding which redundancy or performance features are needed, and which should be sacrificed. The intent is to ensure that all redundancy and performance goals are captured, and that justification is documented when deviations from this guidance occur.
A word of caution when establishing redundancy and performance requirements: if the requirement is not periodically monitored for compliance, it is unenforceable and should not be included.
The SRAD-I redundancy and performance requirements table provides useful planning information for the rapid architecture and design of secure IT infrastructures. This tool may be added to a number of existing methodologies and supplement existing requirements gathering processes.
Keith T Hall, MBA, BSEE, is a senior member of the professional staff with SRA International, Inc. He holds current certifications as an INFOSEC Professional (NSTISSI 4011 Std.), Senior System Manager (CNSSI 4012 Std.), CISSP, CCIP, CCSP, CCDP, CCDA, CCNA, IAM, IEM, and prior certifications as a Cisco IP Telephony Design Specialist, Cisco IP Telephony Support Specialist, CSS-1, CCNP, MCT, MCSE, MCP-SI.
"Appeared in DRJ's Winter 2006 Issue"