IT Service Continuity Plan
The Information Technology Service Continuity (ITSC) Plan is the collection of policies, standards, procedures and tools through which organisations improve both their ability to respond when major system failures occur and their resilience to major incidents. Its aim is to ensure that critical systems and services either do not fail or are recovered within acceptable process recovery time objective (RTO) limits.
Business impact analysis (BIA) information is used to define the process RTO and to determine recovery prioritisation. This makes recovery a user-centric activity that matches business requirements.
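As an illustration of how BIA output can drive prioritisation, the following minimal sketch orders business processes by their process RTO, shortest first. The class name, the function name and the RTO figures are hypothetical and are not taken from any standard; a real BIA would normally weigh further factors than RTO alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessProcess:
    name: str
    rto_hours: float  # process RTO from the BIA (hypothetical values below)


def recovery_order(processes):
    """Return the processes sorted so the shortest process RTO comes first."""
    return sorted(processes, key=lambda p: p.rto_hours)


# Hypothetical BIA results, used only for illustration.
processes = [
    BusinessProcess("Payroll", 72),
    BusinessProcess("Order entry", 4),
    BusinessProcess("Email", 24),
]

priorities = [p.name for p in recovery_order(processes)]
print(priorities)
```

Sorting on a single key keeps the sketch short; ties between processes with the same RTO keep their original order because Python's sort is stable.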
The recovery plans are organised in a hierarchy. A site loss plan details the systems that would be affected by the loss of a building. A separate plan for each service should provide detailed procedures and step-by-step guidelines for each stage of an incident, so that the Recovery Teams are able to restore the services and thereby meet the agreed process and component RTOs.
The plans should be clear and concise. They may assume a reasonable level of technical knowledge but should not presume specific local knowledge, because external assistance may be required to rebuild systems (the same is true of Disaster Recovery Plans). Each procedure should be self-contained so that it can be used to recover a single system or component (e.g. when the server is running successfully but the database management system has crashed). Each document must also list its prerequisites so that, when multiple components fail, the correct sequence can be followed (e.g. replace the failed disk, rebuild the operating system, install the database, configure security settings and then restore data).
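The prerequisite idea can be sketched as a dependency graph that is sorted topologically. The example below is a minimal illustration using Python's standard `graphlib` module (Python 3.9 or later); the step names follow the example sequence in the text, while the `prerequisites` mapping and the `recovery_sequence` function are hypothetical names introduced here.

```python
from graphlib import TopologicalSorter

# Each recovery step maps to the set of steps that must be completed first.
prerequisites = {
    "replace failed disk": set(),
    "rebuild operating system": {"replace failed disk"},
    "install database": {"rebuild operating system"},
    "configure security settings": {"install database"},
    "restore data": {"configure security settings"},
}


def recovery_sequence(prereqs, needed):
    """Return the needed steps in an order that honours all prerequisites.

    The full graph is ordered first and then filtered, so indirect
    dependencies between the needed steps are still respected.
    """
    full_order = TopologicalSorter(prereqs).static_order()
    return [step for step in full_order if step in needed]


# Full rebuild after a disk failure.
print(recovery_sequence(prerequisites, set(prerequisites)))

# Server still running, only the database layer needs recovery.
db_only = {"install database", "configure security settings", "restore data"}
print(recovery_sequence(prerequisites, db_only))
```

Keeping each procedure as a separate node mirrors the requirement that procedures be self-contained: a single component can be recovered on its own, and the graph supplies the order only when several are needed.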
In summary, the IT Service Continuity Plan should typically contain the following information:
- Details of the combined component RTOs and recovery point objectives (RPOs), together with the IT Requirements Gap Analysis
- IT Architecture
- Roles and Responsibilities
- Invocation Procedures
- Damage Assessment
- Escalation and process flow charts
- Detailed procedures specifying how to recover each component of the IT system
- Test Plans specifying how to test that each component has been recovered successfully
- Incident Logs
- Contact Details
- Fail-back procedures
- IT Test Plan
These plans detail the four stages of an incident:
- Initial response: damage assessment and invocation of the appropriate incident management teams.
- Service recovery: this may be staged and may initially offer a degraded service.
- Service delivery in abnormal circumstances: interim measures may include relocation of services to another site or utilisation of spare equipment (often training or test servers). This is a temporary measure to provide a limited service until normal service can be resumed.
- Normal service resumption: returning to the usual service, fail-back from the abnormal service delivery.
PAS 77 also details strategy and infrastructure improvements that increase resilience. Improving the environment is a proactive measure to minimise the risk of IT outages, and the strategy is a phased approach to achieving that resilience. It is driven by budgets, risk, experience and changing user requirements; experience comes from implementation, testing and failures. The three criteria used to set the strategy are component RTO, RPO and cost: introducing measures that reduce component RTO and achieve a stable RPO can be very costly in time, resources and spending on new technology.
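The trade-off between component RTO, RPO and cost can be illustrated with a small selection routine that picks the cheapest option meeting both targets. This is a sketch under stated assumptions: the option names, figures and the `cheapest_option` function are hypothetical and do not come from PAS 77.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceOption:
    name: str
    rto_hours: float    # component RTO the option can achieve
    rpo_hours: float    # RPO the option can achieve
    annual_cost: float  # hypothetical cost figure


def cheapest_option(options, target_rto, target_rpo):
    """Return the lowest-cost option meeting both targets, or None."""
    viable = [
        o for o in options
        if o.rto_hours <= target_rto and o.rpo_hours <= target_rpo
    ]
    return min(viable, key=lambda o: o.annual_cost) if viable else None


# Hypothetical options, used only for illustration.
options = [
    ResilienceOption("Tape restore", 48, 24, 5_000),
    ResilienceOption("Warm standby", 4, 1, 40_000),
    ResilienceOption("Active-active cluster", 0.1, 0, 150_000),
]

choice = cheapest_option(options, target_rto=8, target_rpo=4)
print(choice.name if choice else "no option meets the targets")
```

Tightening either target shrinks the set of viable options and usually raises the cost of the cheapest one, which is why a phased approach is often adopted.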
The success of an ITSC Plan is difficult to measure because, when the plan works, any incident is recovered within the process RTO and does not escalate into a business continuity (BC) incident. The only available measures are the reduction in downtime and the improvement in SLA adherence.
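Both measures reduce to simple arithmetic on recorded downtime. The sketch below assumes a 30-day (720-hour) reporting period and a 99.5% availability SLA; these figures, the downtime values and the function names are hypothetical.

```python
def availability_percent(period_hours, downtime_hours):
    """Availability over the period, as a percentage."""
    return 100.0 * (period_hours - downtime_hours) / period_hours


def meets_sla(period_hours, downtime_hours, sla_target_percent):
    """True if the measured availability reaches the SLA target."""
    return availability_percent(period_hours, downtime_hours) >= sla_target_percent


PERIOD_HOURS = 720     # assumed 30-day reporting period
SLA_TARGET = 99.5      # assumed availability target, in percent

downtime_before = 6.0  # hypothetical downtime before the plan was introduced
downtime_after = 1.5   # hypothetical downtime afterwards

print(f"downtime reduction: {downtime_before - downtime_after} hours")
print(f"before: {availability_percent(PERIOD_HOURS, downtime_before):.2f}%")
print(f"after:  {availability_percent(PERIOD_HOURS, downtime_after):.2f}%")
print("SLA met after:", meets_sla(PERIOD_HOURS, downtime_after, SLA_TARGET))
```

Comparing the same period length before and after keeps the two availability figures directly comparable.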