In Part One of this blog, I introduced some of the key concepts of Availability and Disaster Recovery (DR) and discussed some of the challenges faced by the Business Analyst when eliciting requirements.
In Part Two, we’re going to look at the stakeholders who are usually involved in an Availability / DR workshop, the knowledge they have and the knowledge they need in order to make decisions. We’re also going to have a look at the downstream impacts of the requirements that you capture.
Who’s involved in an Availability Workshop?
To understand the key challenges of eliciting availability requirements you need to understand the conflicting viewpoints of your audience, which will likely include the following stakeholders:
Project Owner – a.k.a. Program Manager, Project Manager, etc.
The Project Owner is generally responsible for the initial delivery of the system and will primarily be concerned with ensuring the Business Owner is satisfied with project outcomes and ensuring that capital expenditure is within budget.
Business Owner – a.k.a. Process Owner, Business Sponsor, etc.
The Business Owner of the system is accountable for business processes that are supported by the system. The Business Owner may start with an expectation that the system will always be available and is likely to have little understanding of the technical constraints.
Business Owners provide valuable information about the importance of the system to business outcomes but, depending on their experience with IT projects, their expectations may start at 100% availability and $0.00 added expense.
Operations Owner – a.k.a. Service Manager, Operations Manager, etc.
The Operations Owner (sometimes called the BaU Owner) will inherit the system once it goes into Business as Usual. They may be the same person as the Project Owner. They will be primarily concerned with the operability of the system and its ongoing operational expenditure, and should have a reasonable grounding in the technical aspects.
Technical Representatives – a.k.a. Infrastructure Architect, Technical Architect, Environment Manager, etc.
You will also often have architects and/or software developers present during elicitation. These people will provide valuable input regarding the challenges of achieving High Availability and can help describe some of the design considerations but will probably have a limited understanding of business drivers or costs.
Technical representatives can assist you with explaining the challenges, provided you remain cognisant of the fact that technical people often forget the journey they took to acquire their knowledge and can be a little critical of people who don’t understand things to the same depth they do.
Your role as the business analyst in this environment is to assist the technical representatives in describing the challenges of meeting high availability requirements and to mediate between business representatives and technical representatives to bring availability requirements to an acceptable middle ground. Expect, and prepare for, contention. Remember at all times that your goal is to capture requirements that are both measurable and achievable.
How do I prepare for an Availability Workshop?
You may find that when you’re dealing with availability you’re going to have to get over the hurdle of lack of domain knowledge. It is very rare to find stakeholders who truly understand availability and it’s even rarer to find a group of stakeholders who share an understanding. Therefore, in order to effectively facilitate, it’s critical that you take some time to define and agree on availability terms (see Part One).
It’s also worth noting that, when speaking about availability, the commonly used percentage (e.g. 99%) means little to most stakeholders. Convert the percentage back to unexpected downtime, as per the table below (figures assume a 30-day month):
Availability | Unexpected downtime per Month (outside Maintenance Windows) |
---|---|
95% | 36 hours |
97% | 21.6 hours |
99% | 7.2 hours |
99.5% | 3.6 hours |
99.9% | 43.2 minutes |
99.95% | 21.6 minutes |
99.995% | 2 minutes |
99.9995% | 13 seconds |
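If you want to produce these figures for other percentages, the conversion is simple arithmetic. The snippet below is a minimal sketch, assuming a 30-day month (which is what the table's figures correspond to); the function name `monthly_downtime` is just an illustrative choice.

```python
from datetime import timedelta

# Assumption: a 30-day month, which matches the figures in the table above.
HOURS_PER_MONTH = 30 * 24


def monthly_downtime(availability_percent: float,
                     hours_in_month: float = HOURS_PER_MONTH) -> timedelta:
    """Return the unexpected downtime per month that an availability target allows."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return timedelta(hours=hours_in_month * unavailable_fraction)


if __name__ == "__main__":
    # Print the allowed downtime for each availability level in the table.
    for pct in (95, 97, 99, 99.5, 99.9, 99.95, 99.995, 99.9995):
        print(f"{pct}% -> {monthly_downtime(pct)}")
```

Putting a duration in front of stakeholders ("about seven hours a month") tends to prompt a more useful conversation than the percentage alone.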
Note also that the availability of a system is the product of the availability of the elements it depends on. For instance, if you have a data centre that guarantees 99.9% availability and you’re hosting on a hardware / OS configuration that the vendor rates at 99.95% availability, then you’re running at 99.9% × 99.95% ≈ 99.85%. However, once you start clustering servers and building redundancy into the model, the calculations of probability of failure get a LOT more complex.
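The data centre example can be worked through in a few lines. This is a rough sketch rather than a modelling tool: `serial_availability` multiplies the availabilities of elements that must all be up, and `redundant_availability` shows only the simplest textbook case of redundancy, where identical copies are assumed to fail completely independently. Both function names are illustrative, and real clusters rarely meet the independence assumption, which is part of why the real calculations get complicated.

```python
from math import prod


def serial_availability(*component_percents: float) -> float:
    """Availability (%) of a chain in which every element must be up."""
    return 100 * prod(p / 100 for p in component_percents)


def redundant_availability(component_percent: float, copies: int) -> float:
    """Availability (%) of identical redundant copies where any one copy is enough.

    Assumption: failures are fully independent of each other.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    all_copies_down = (1 - component_percent / 100) ** copies
    return 100 * (1 - all_copies_down)


if __name__ == "__main__":
    # Data centre at 99.9% hosting a hardware / OS configuration at 99.95%.
    print(f"{serial_availability(99.9, 99.95):.2f}%")
    # Two independent 99% servers, either of which can carry the load.
    print(f"{redundant_availability(99, 2):.2f}%")
```

Note that each element added in series can only lower the overall figure, while redundancy raises it at the cost of more hardware.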
Before you enter a workshop on a contentious and technical subject, it’s also worth thinking a bit about the downstream impacts of the requirements that you are eliciting:
To minimise costs, your fancy web solution could realistically be deployed on any PC with a web connection. However, this would only be suitable for a system with low availability requirements and data that has no ongoing value to the business. At the other end of the scale would be a solution incorporating a number of the following design considerations (this is not an exhaustive list):
Backup and Recovery Plan – To reduce the risk of data loss, a backup and recovery plan can be introduced. The plan could be as simple as a single backup of all data on a, say, monthly basis; or could involve multiple backup points (e.g. monthly, weekly, daily backups, data snapshots occurring every 15 minutes or so) with off-site storage. The backup and recovery plan could also include manual or automated steps for recovery in the event of a disaster.
Storage Redundancy – To reduce the risk of data loss, redundancy can be built into the data layer. This could involve hardware solutions such as disk arrays, deployment of a redundant instance of the data layer in passive mode with replication, database clusters, or the use of a SAN (storage area network).
Component Redundancy – Reliance on any single software, hardware or network component can be removed by a variety of techniques, including n-tier design, clustering of servers (controlled by software or by hardware components), hardware/network redundancy (multiple Network Interfaces, multiple Network Paths), and on-site storage of hardware spares (e.g. replacement disks or blades).
Site Redundancy – Highly available systems can be deployed at multiple, geographically dispersed sites in either an active-active or active-passive configuration. This means that in the event that a site becomes unavailable, the other site can take over.
Keep in mind that every new layer of redundancy protection adds cost to the solution: capital expenditure for the purchase of hardware, software licenses, and network/data centre infrastructure, and operating expenditure for data centre real estate, operational costs, network capability, bandwidth, and replacement parts.
What have we learned so far?
In Part One we investigated some of the key concepts of Availability and Disaster Recovery (DR) and discussed some of the challenges faced by the Business Analyst when eliciting requirements.
In Part Two we have looked at the key stakeholders involved in Availability workshops and the sources of contention that you may face as a facilitator. We’ve also had a brief look at some of the design aspects that allow you to achieve Availability.
In the next, and final, post in this series we’ll put everything we’ve learned together and try to come up with an agenda for the “perfect” Availability workshop.
Read Part 1: Ninety Nine Point What?