Disaster Recovery
The following are notes from the meeting of 20081222.
1. Business Processes Analysis
1.1 Core Processes
Need to identify Core Processes from the BusinessProcesses.
- Not according to revenue.
- Reliance is the claim made by CAcert, and the only time-critical components of reliance are CRL/OCSP.
How to do a revocation:
- online system for users
- support-generated:
- need the support system
- need arbitration to authorise
- need mail + maillists
So, in terms of business continuity / disaster recovery, these are CORE:
critical systems
OCSP/CRL servers
1.2 Secondary Processes
Secondary:
- email + maillists - all redundant
- support - receive certificate complaints, do revocations on them
- arbitration
Discretionary -- all other processes in the list are discretionary. In the context of a disaster, these are ignored for the time being.
2. Standard Process Times
Standard Process Times (SPT) are needed as a baseline.
- revocation
- Support -- rebuild + startup?
- redundant channels:
- email support
- website POST box
- phone??? VoIP??? SMS???
- IRC + chat
- redundant channels:
- 0 time for receiving certificate complaints
- 1 hour to pass to arbitration
- Support -- rebuild + startup?
- Arbitration
- 1 mailing list
- 1 hour to designate Arbitrator
- 24 hours to get 1st ruling on revocation
- does arbitrator need guidelines?
- false positives, false negatives, discretion amongst arbitrators....
- Revocation by Support
- 1 hour to revoke
- Critical Systems
- new CRL from support - 0 time
- distribution to OCSP / CRL servers - 0 time
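Then, the SPT for revocation is: 3 + 24 = 27 hours (1 hour to pass to arbitration, 1 hour to designate an Arbitrator and 1 hour to revoke, plus 24 hours for the first ruling).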
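The arithmetic behind the revocation SPT can be sketched as a simple sum over the steps listed above. This is only an illustration: the step labels and the names `SPT_STEPS_HOURS` and `total_spt_hours` are made up for the sketch, and the hour values are the ones recorded in these notes.

```python
# Hours per step, copied from the notes above.
# The labels and variable names are illustrative, not an agreed vocabulary.
SPT_STEPS_HOURS = [
    ("receive certificate complaint", 0),
    ("pass complaint to arbitration", 1),
    ("designate Arbitrator", 1),
    ("first ruling on revocation", 24),
    ("revocation by Support", 1),
    ("new CRL from support", 0),
    ("distribution to OCSP / CRL servers", 0),
]


def total_spt_hours(steps):
    """Return the end-to-end time in hours, assuming the steps run strictly in sequence."""
    return sum(hours for _, hours in steps)


if __name__ == "__main__":
    print("SPT for revocation (hours):", total_spt_hours(SPT_STEPS_HOURS))
```

If any of the open questions above (for example, whether Support needs 1 hour or 24 hours) is settled differently, only the corresponding entry needs to change.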
3. Recovery Time Objectives
Recovery Time Objectives (RTOs) for core processes state how long it may take to recover the core and secondary processes needed.
27 hours
- critical systems -- rebuild and start up -- ??
- this would have to be faster than total revocation time
- board will have to define this time:
- within 24 hours
- OCSP/CRL -- rebuild and start up????
- 0 time: must have redundancy
- Mail+mailing lists (Arbitration)
- 0 time - redundant -- requirement, we need redundant mail for arbitrators?
3.1 Failure Times
How long will it take then? Target is 27 hours.
- Notification of total failure (support systems) - 1 hour
- Investigation to determine total failure (sysadm team) - 1 hour
- Decision to rebuild (board 2 members) - 1 hour
- Rebuild (sysadm team, 2 people) - 24 hours
== 27 hours
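As a rough check of the failure timeline above against the 27-hour target, the elapsed time after each step can be accumulated. The names `FAILURE_STEPS_HOURS`, `TARGET_HOURS` and `cumulative_hours` are hypothetical and used only for this sketch; the durations are the ones in the list.

```python
from itertools import accumulate

# Target and step durations in hours, taken from the list above.
TARGET_HOURS = 27
FAILURE_STEPS_HOURS = [
    ("notification of total failure (support systems)", 1),
    ("investigation to determine total failure (sysadm team)", 1),
    ("decision to rebuild (board, 2 members)", 1),
    ("rebuild (sysadm team, 2 people)", 24),
]


def cumulative_hours(steps):
    """Return (step name, hours elapsed once that step has finished) for each step."""
    names = [name for name, _ in steps]
    elapsed = accumulate(hours for _, hours in steps)
    return list(zip(names, elapsed))


timeline = cumulative_hours(FAILURE_STEPS_HOURS)
within_target = timeline[-1][1] <= TARGET_HOURS

if __name__ == "__main__":
    for name, elapsed in timeline:
        print(f"{elapsed:>3} h  {name}")
    print("within target:", within_target)
```

The sketch assumes the steps happen one after another with no overlap; the timeline leaves no slack, so any delay in notification, investigation or decision pushes the rebuild past the target.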
4. Maximum Acceptable Outage
Maximum Acceptable Outage (MAO) is the total time that the business decrees it can be down for in this context.
- OCSP/CRL == 0 time for existing ones
- 2 days before new revocations issued
- email / support / maillists == 0 time (redundant)
- how long does it take to realise problems with mail systems?
- throw at tech people ... we want redundant mail + 0 time
5. Recovery Point Objective
Recovery Point Objective (RPO) is the time back to which we recover.
- what time before Disaster do we have data for? (Backups)?
- revocation: 24 hours (normal incremental backups)
==> revocations can be lost
==> user / Arbitrator must do confirm/retry manually
==> write in CPS "you must check within 24 hours to confirm/retry"
- mail: RPO == 1 hour on mail incoming (so 1 hour SPT can be met)
- OCSP/CRL: no issue because source files on critical systems
- and on other OCSP servers
==> requirement to load up from other OCSPs and from source.
- RPO == 0 time
- critical systems: RPO == 24 hours
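One way to sanity-check the RPO figures above is to compare each with the interval at which the data is actually copied: data older than the last copy is what can be lost. The sketch below is illustrative only. The backup intervals are assumptions based on the figures mentioned in these notes (24-hour general backups, 1-hour mail backups), and treating redundant OCSP/CRL servers as a 0-hour interval is likewise an assumption; `rpo_violations` is a hypothetical helper name.

```python
# RPO per system in hours, from the list above.
RPO_HOURS = {
    "revocation": 24,
    "mail": 1,
    "OCSP/CRL": 0,
    "critical systems": 24,
}

# Assumed copy intervals in hours (assumption: redundancy counts as 0).
BACKUP_INTERVAL_HOURS = {
    "revocation": 24,
    "mail": 1,
    "OCSP/CRL": 0,
    "critical systems": 24,
}


def rpo_violations(rpo, interval):
    """Return the systems whose backup interval is longer than their RPO."""
    return sorted(name for name, limit in rpo.items() if interval[name] > limit)


if __name__ == "__main__":
    print("RPO violations:", rpo_violations(RPO_HOURS, BACKUP_INTERVAL_HOURS))
```

With the assumed intervals no system violates its RPO; if mail were only covered by the 24-hour general backup, it would show up as a violation of its 1-hour RPO.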
6. Others
Service Delivery Objectives: not offered (community CA).
Best efforts standard for revocation:
- support: 1 hour ?? 24 hours???
- arbitration: 24 hours?? 7 days ??
7. Strategy and Planning
What plans exist to put in place the systems and infrastructure required to meet the targets?
- general backups 24 hours
- mail backups 1 hour
- OCSP - 3 redundant
- channels to Arbitration - dual? we don't know
- (e.g., support people to monitor channel and duplicate on other list)
- backup supplier of hardware???
- escrow hardware for signing server
- if the signing server is secured fully (which it must be anyway??)
==> redundant CA (database, online, signing, geographical)
- mirrored drives in all machines
- redundant comms already provided by ISP.
- Alternate processing Location, etc
- none
Maintenance
8. Decision Contact Info
State requirement in SM: who needs contact info for whom
- Within CAcert
- all sysadms, all board
- Need contact information deposited somewhere offline? e.g., somewhere that is not affected when systems go offline
- Email that goes out every change.
- Sysadms must have contact info for all of Oophaga
- (not for SM, as described in contract)
8.1 Oophaga
- need a comment in Oophaga agreement to include disaster recovery
- notification to open notifications mailing list
- open to all members
- public archives
- 2 messages:
- we intend to do X on date Y
- we're out, it's done, here's the report
- check cameras
- are they on CAcert aisle
- talk to BIT about new cameras
9. Threats & Disasters
As in Security Manual.
- data breach
- false certificate issuance
arbitration -> revocation
arbitration -> investigation, checking the logs
- root compromise
- revoke root with vendors (business protocol)
- reissue root
- revoke subroot / certs
10. Side Question
- quality of support process
- quality of arbitration processes