Recovery point objective

A recovery point objective (RPO) is a target set through business continuity planning. It is the maximum targeted period in which data might be lost from an IT service due to a major incident.[1] The RPO gives systems designers a limit to work to. For instance, if the RPO is set to four hours, then in practice off-site mirrored backups must be continuously maintained; a daily off-site backup on tape will not suffice.

Schematic representation of the terms RPO and RTO. In this example, the agreed values of RPO and RTO are not fulfilled.

Background

When IT systems used for normal business services are affected by a major incident that cannot be fixed quickly, the business will want to restore the state of the systems to a known, consistent point that is as up-to-date as possible. The business should have an Information Technology Service Continuity (ITSC) Plan that covers such scenarios and that sets realistically achievable objectives for the restoration of the service.

This plan should usually assume that the production computing equipment, and the wider geographic location where it normally resides, might become completely out of bounds at an unpredictable time, without any warning. The location chosen from which to run the restored service (the recovery site) ought to be distant (for example, at least 10 miles) from the normal production site and should share no threats with it (e.g. the two sites should not be near the same coastline).

Relationship to recovery time objective (RTO)

The ITSC Plan must also satisfy two measurements for any potentially affected service: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). Both specify time intervals, typically expressed as a number of hours: the RPO relates to loss of data (recent transactions) and the RTO to loss of service time. Both are compromises, often largely determined by business impact costs: shorter times reduce the adverse business impact of an incident but are increasingly expensive to implement.

They are determined by a Business Continuity (BC) team that quantifies what losses might ensue if the services are not available. The associated risk assessments include the potential loss of life of the knowledgeable people who ran the service before the incident.

The RTO is the amount of time the business can be without the service without incurring significant risks or significant losses.[1] The events that mark the start and end of the RTO duration must be pre-agreed between Business Continuity and ITSC staff. It may be defined as the maximum allowed interval from interruption of the service to its full restoration to the customers. Alternatively, the start of the interval might be defined as the moment when it is decided to proceed with the recovery, and the end as the moment when the team responsible for testing the service (before it is released to the wider user community) begins work. Defining the RTO in this way can permit better decision making at all levels, although the RTO then no longer strictly measures "the amount of time the business can be without the service".
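The two ways of defining the RTO interval described above can give different answers for the same incident. The following is a minimal sketch of that difference; the class, field names, timestamps and the 4-hour RTO are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Key moments of one incident (illustrative field names)."""
    service_interrupted: datetime
    recovery_decided: datetime
    testing_started: datetime
    service_restored: datetime


def interruption_to_restoration(t: IncidentTimeline) -> timedelta:
    """First definition: from interruption to full restoration."""
    return t.service_restored - t.service_interrupted


def decision_to_testing(t: IncidentTimeline) -> timedelta:
    """Alternative definition: from the decision to recover until
    the testing team begins work."""
    return t.testing_started - t.recovery_decided


def meets_rto(measured: timedelta, rto: timedelta) -> bool:
    """True if the measured interval is within the agreed RTO."""
    return measured <= rto


# Hypothetical incident and a hypothetical agreed RTO of 4 hours.
timeline = IncidentTimeline(
    service_interrupted=datetime(2024, 3, 1, 9, 0),
    recovery_decided=datetime(2024, 3, 1, 10, 30),
    testing_started=datetime(2024, 3, 1, 13, 30),
    service_restored=datetime(2024, 3, 1, 15, 0),
)
rto = timedelta(hours=4)

for measure in (interruption_to_restoration, decision_to_testing):
    elapsed = measure(timeline)
    print(measure.__name__, elapsed, "met" if meets_rto(elapsed, rto) else "missed")
```

In this made-up timeline the same incident misses the RTO under the first definition and meets it under the second, which is why the start and end events need to be agreed in advance.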

The RPO measures the maximum time period in which recent data might have been permanently lost in the event of a major incident; it is not a direct measure of the quantity of such loss. For instance, if the BC plan is "restore up to last available backup", the RPO is the maximum interval between successive backups that have been safely vaulted offsite.
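As a rough illustration of the "restore up to last available backup" case, the sketch below computes the largest interval between vaulted backups and the data-loss window for one incident. It assumes each timestamp is the moment whose data the backup captures, and that every listed backup was safely vaulted; the function names and dates are hypothetical, and the delay between taking a backup and vaulting it is deliberately ignored here.

```python
from datetime import datetime, timedelta


def worst_case_gap(vaulted_backup_times):
    """Largest interval between consecutive vaulted backups."""
    times = sorted(vaulted_backup_times)
    if len(times) < 2:
        raise ValueError("need at least two backups to measure a gap")
    return max(later - earlier for earlier, later in zip(times, times[1:]))


def data_loss_window(vaulted_backup_times, incident_time):
    """Time between the newest usable backup and the incident."""
    usable = [t for t in vaulted_backup_times if t <= incident_time]
    if not usable:
        raise ValueError("no vaulted backup precedes the incident")
    return incident_time - max(usable)


# Hypothetical schedule: the third backup ran six hours late.
backups = [
    datetime(2024, 3, 1, 0, 0),
    datetime(2024, 3, 2, 0, 0),
    datetime(2024, 3, 3, 6, 0),
]
rpo = timedelta(hours=24)

gap = worst_case_gap(backups)
loss = data_loss_window(backups, incident_time=datetime(2024, 3, 3, 5, 0))
print("largest gap within RPO:", gap <= rpo)
print("loss for this incident within RPO:", loss <= rpo)
```

The result is a period of time, not a quantity of data: how many transactions fall inside that window depends on how busy the service was.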

Two points should be noted. Firstly, business impact analysis is used to determine the RPO for each service; the RPO is not determined by the existing backup regime. Secondly, when any level of preparation of off-site data is required, the period during which data might be lost often starts near the time when work to prepare the backups begins, not the time the backups are taken off-site.

Data synchronization points

A data synchronization point is a point in time used to assess how data backups relate to each other. Backups need to be related to each other correctly with respect to the time of day they were made, or to computer system activity events. A data synchronization point is a point in time when a set of backups exists which, if restored from, can all be synchronized to the same point in time. Often that common point in time is some hours before the last backup is completed, i.e. some hours before the data synchronization point itself. Backups that have no synchronization points are generally useless.
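One way to picture whether a set of backups can be brought to a common point in time is sketched below. It rests on an assumption that is not in the text above: that each backup can be restored to any moment within a known range (for example by replaying logs). The class, function and component names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass
class RecoverableRange:
    """Span of moments one backup can be restored to (assumed model)."""
    name: str
    earliest: datetime
    latest: datetime


def latest_common_point(ranges: Sequence[RecoverableRange]) -> Optional[datetime]:
    """Latest moment that every backup can be restored to, or None
    if the ranges do not overlap (no synchronization is possible)."""
    if not ranges:
        return None
    lower = max(r.earliest for r in ranges)
    upper = min(r.latest for r in ranges)
    return upper if lower <= upper else None


# Hypothetical backup set for two components of one service.
database = RecoverableRange("database", datetime(2024, 3, 1, 1, 0), datetime(2024, 3, 1, 3, 0))
file_store = RecoverableRange("file store", datetime(2024, 3, 1, 2, 0), datetime(2024, 3, 1, 2, 30))
# A third backup whose range does not overlap the other two.
queue = RecoverableRange("message queue", datetime(2024, 3, 1, 4, 0), datetime(2024, 3, 1, 5, 0))

print(latest_common_point([database, file_store]))
print(latest_common_point([database, file_store, queue]))
```

Under this model the common point is set by the backup with the earliest upper limit, so it can lie well before the time the last backup in the set finished.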

A frequent mistake when setting the RPO for traditional daily off-site tape backups is to assume an RPO of 24 hours. This overlooks the fact that the RPO period begins with the start of the first data backup used in the synchronization point. It must also include the time for boxing the tapes, the inevitable contingency time for "waiting for courier transport", and loading and final departure from site. That departure does not always happen at exactly the same time of day, so the RPO must be increased by an amount of time equivalent to any such variability. It is also risky to assume that tapes will always be physically intact; the RPO should include enough time to fall back to a previous synchronization point, too.
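The components listed above can be added up as a simple budget. The sketch below is one possible way to do so, under two assumptions: the worst case is an incident just before the newest tape set leaves the site, and each fallback to an earlier synchronization point costs one further backup interval. The function name, parameter names and all durations are hypothetical.

```python
from datetime import timedelta


def tape_rpo_budget(
    backup_interval: timedelta,
    backup_duration: timedelta,
    boxing: timedelta,
    courier_wait: timedelta,
    departure_variability: timedelta,
    fallback_sync_points: int = 1,
) -> timedelta:
    """Worst-case RPO for tape backups couriered offsite.

    The handling time runs from the start of the first backup in the
    synchronization point until the tapes have actually left the site.
    """
    if fallback_sync_points < 0:
        raise ValueError("fallback_sync_points must not be negative")
    handling = backup_duration + boxing + courier_wait + departure_variability
    # One interval for the set still on site, plus one per fallback.
    return backup_interval * (1 + fallback_sync_points) + handling


# Hypothetical daily regime.
budget = tape_rpo_budget(
    backup_interval=timedelta(hours=24),
    backup_duration=timedelta(hours=2),
    boxing=timedelta(minutes=30),
    courier_wait=timedelta(hours=1),
    departure_variability=timedelta(hours=1),
    fallback_sync_points=1,
)
print(budget.total_seconds() / 3600, "hours")
```

Even with no fallback allowed for, the handling time alone pushes the figure above 24 hours in this example.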

How RTO and RPO values affect computer system design

The RTO and RPO form part of the first specification for any IT Service. The RTO and the RPO have a very significant effect on the design of computer services and for this reason must be considered in concert with all the other major system design criteria.[2]

When assessing the ability of a system design to meet RPO criteria, for practical reasons the RPO capability of a proposed design is tied to the times backups are sent offsite. If, for instance, offsiting is on tape and only daily (still quite common), then 49 hours or, better, 73 hours is the best RPO the proposed system can deliver, so as to cover for tape hardware problems (tape failure is still too frequent, and one bad tape can write off a whole daily synchronisation point). As another example, if a service is properly set up to restart from any point (data is capable of synchronisation at all times) and offsiting is via synchronous copies to an offsite mirror data storage device, then the RPO capability of the proposed service is to all intents and purposes 0 hours, although it is normal to allow an hour for RPO in this circumstance to cover any unforeseen difficulty.

If the RTO and RPO can be set to more than 73 hours, then daily backups to tape (or other transportable media), couriered daily to an offsite location, comfortably cover backup needs at a relatively low cost. Recovery can be enacted at a predetermined site. Very often this site will belong to a specialist recovery company that can provide serviced floor space and hardware more cheaply as required in recovery, because it manages the risks to its clients and carefully shares (or "syndicates") hardware between them according to these risks.

If the RTO is set to 4 hours and the RPO to 1 hour, then a mirror copy of production data must be continuously maintained at the recovery site, and dedicated (or close to dedicated) recovery hardware must be available there: hardware that is always capable of being pressed into service within 30 minutes or so. These shorter RTO and RPO settings demand a fundamentally different hardware design, one that is, for instance, much more expensive than tape backup designs.

If very high volumes of high-value transactions are to be planned for, then the production hardware can be split across two sites; with a high-bandwidth network connection between the two sites, constant mirroring of data can be achieved. If the user community is dispersed, or at least split across two geographic areas, then the configuration is resilient to single-site major incidents, with zero RTO and RPO being achievable and very often little loss of service being experienced at most times of day.
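The three cases described in this section can be summarised as a small lookup. The sketch below encodes only the figures given in the text (more than 73 hours, 4 hours and 1 hour, and zero); treating the 4-hour and 1-hour figures as upper limits is an assumption, and any combination the text does not cover is reported as unspecified rather than guessed. The enum and function names are hypothetical.

```python
from enum import Enum


class Design(Enum):
    DAILY_TAPE_COURIERED = "daily tape backups couriered offsite"
    MIRROR_WITH_DEDICATED_HARDWARE = "continuous mirror and dedicated recovery hardware"
    SPLIT_ACROSS_TWO_SITES = "production split across two mirrored sites"
    UNSPECIFIED = "not covered by the cases in the text"


def suggest_design(rto_hours: float, rpo_hours: float) -> Design:
    """Map agreed RTO and RPO values to the design case they match."""
    if rto_hours < 0 or rpo_hours < 0:
        raise ValueError("RTO and RPO must not be negative")
    if rto_hours == 0 and rpo_hours == 0:
        return Design.SPLIT_ACROSS_TWO_SITES
    # Assumption: 4 h and 1 h are read as upper limits for this case.
    if rto_hours <= 4 and rpo_hours <= 1:
        return Design.MIRROR_WITH_DEDICATED_HARDWARE
    if rto_hours > 73 and rpo_hours > 73:
        return Design.DAILY_TAPE_COURIERED
    return Design.UNSPECIFIED


for rto, rpo in [(96, 96), (4, 1), (0, 0), (24, 24)]:
    print(rto, rpo, "->", suggest_design(rto, rpo).value)
```

In practice the choice also depends on transaction volumes, cost and how the user community is distributed, so a lookup like this is only a starting point for discussion.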
