Report on Progress towards Audit
- financial year 2009-2010
After the difficult events of last year that resulted in the termination of my external audit process over CAcert, audit-related work settled into a more focussed, directed approach.
The Board picked up two priorities related directly to audit: (1) to move the infrastructure servers out of the domain of the critical systems, and (2) to review and close out the data protection question. Both of those are reported in the Board's report, so here I will only cover why they were necessary. The remainder were done by members behind the scenes: not secret but quiet, patient hard work by those who were keen to help.
(1) Infrastructure Separation. Most of our critical systems and our infrastructure VMs are located in our secure rack in BIT (Ede, NL), as managed by Oophaga. Our thanks to them and the team! A judgement call has been made (by me) that this intermingling of critical systems and infrastructure VMs makes it too hard to efficiently audit the systems. This is inefficient because there are two sorts of controls, or defences against threats. One sort is controls that rarely get used, and are pretty obvious. The other sort is those controls that are utilised frequently, and are somewhat subtle. We can imagine a 2x2:
Mixing the infrastructure with the critical pushes a lot of controls from the first quadrant into the fourth. It can be seen this way: before, the Access Engineers had no reason to ever see the data. If they ever did, this was obviously wrong. That means we can rely on the access team and the critical team to police this particular control, to a large extent. It's a good strong control, it's rarely needed, it's obvious.
But, with the infrastructure servers in there, imagine if an AE became a sysadm of those servers? Suddenly, the AE can now see some data. Not that data, but this data. The AE now needs to SSH into the systems, so needs an account, access and all that. Conceivably, the AE can also pop in and reboot the infra servers …
Now, none of this is wrong. We really do need our AEs to help where they can, same as everyone, and I'm only mentioning the AEs by way of example; I could make the same judgement call about a conflict between our Arbitrators and our Support Team. The issue is not wrongness but inefficiency: the controls are now complicated, no longer obvious, and tested frequently. Even the participants are going to get confused. That shifts those controls up to a higher gear; even if the participants manage to climb this mountain, the poor old auditor is more or less forced to test this area, and test it thoroughly. Which means more site visits, more tests, more cost, and lots more angst for all concerned.
Hence, for all these reasons, the Board took on the task to separate the infrastructure out. See the Board's report for more on that. Pending...
(2) Data Protection. This is a lot easier. The audit criteria, known as DRC for David Ross Criteria, specifically state that we need a declaration against any appropriate legislation on data protection and other issues. So the Board had to pick up the work done by the last Board, review all the documentation, add in analysis of new documentation, and make their declaration. The Board did that, but did it in private session because the area is a bit of a legal minefield. Having observed the process, I'm confident that task is done, and it can be explained to a future auditor.
(3) Audit Strategy. One of the things I promised last year was to outline the way forward for the future work. This was more or less done but not in formal terms. In practice, we got in and did some of it, according to this strategy:
Registration Authority (RA) Audit first, Certification Authority (CA) Audit second.
Let me explain! We can think about CAcert as two independent but linked areas: the web of trust (RA) and the critical parts (CA). The former is our network of Assurers. We are nearly 4000 members who work on one primary goal, being the building of our web of trust, and in detail, lots of assurances, under Assurance Policy. Then, the latter is a tight set of small teams (sysadm, software, support, etc) which includes maybe 15 people. Half of them are near Ede, the other half "close" in continental terms, and they're all doing their thing within Security Policy.
These two groups are very different: size, speed, approach, management, people, policies, location, these aspects all differ. To reflect this, the world of PKI generally separates these out, and at audit level too. CAcert is no different: our RA (our Assurers or web of trust) is much more ready for Audit than our CA (our critical teams and systems). This makes sense in that the Assurers do many small tasks, and we've put in 3-4 years of work to make those tasks solid! In contrast, the critical teams do big tasks with few people, and they've had some mountains to climb.
All of which leads to an Audit strategy of concentrating on the RA side first.
(4) You the Members. Which leads to the issue of resources. It was painfully obvious that the failure of my audit can be seen as a failure to apply resources -- people -- to the problem. Why was it so difficult to get help? As I outlined last year, I think the entire community had got into a mindset of someone else doing the Audit. Who was that person? The auditor (!) or the board (?) or someone, but it was always someone else!?!
That might conceivably work if we have lots of money to pay for that work being done, but without lots of money, no chance. The only Auditor we can afford is one who has a very easy job to do. Hence, all the hard work has to be done by you, the Community.
- Ask not when your audit is done;
ask how you can do your audit?
For example, the Board more or less led the above two components, but the rest is being done by members, who ask what to do and how to help. So a lot of our work has been about slicing up the audit into parts that can be done by the Community. Let's now talk about what the members have done, primarily the Assurance Team, leading on to how you can help:
(4) Co-auditing. The idea of testing the assurances out in the field is based directly on one of the criteria that requires us to state how we ensure the quality of the process. The CATS Assurance Challenge goes some way in that it establishes a before-the-event control, but we also need an after-the-event control. Which is very hard, because our nearly four thousand Assurers are scattered across the planet.
How do we test a process that is only face-to-face, when our budget doesn't let us fly everyone to a nice holiday location?
I didn't know the answer to this in early 2009, but I did know I'd better get started. So I started testing by thinking up some ideas, questions, tricks, by getting assured, and writing the results down. Formalising it as I went along. I visited around 8 cities in Europe, and by May 2009, I'd reached maybe 70 or so tested assurances and a 1-person framework.
It was at this point that a surprise happened, for me at least: the Assurance Team copied my entire process and rolled it out across Germany in a series of events called ATEs or Assurer Training Events. So when we met up in Munich in May 2009, their 70 or so tests could be added to mine, thus doubling the numbers! This meant that we had 7.6% coverage over the entire Assurer group, and that meant I could call it statistically significant.
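As a rough check on that coverage arithmetic, here is a minimal Python sketch. The function names are invented for illustration, and the count of 140 is an assumption built from "70 or so" on each side. The group size it prints is only what those two approximate figures imply, not a number stated in this report.

```python
def coverage(tested: int, population: int) -> float:
    """Fraction of the population covered by tested assurances."""
    if population <= 0:
        raise ValueError("population must be positive")
    return tested / population


def implied_population(tested: int, coverage_fraction: float) -> float:
    """Population size implied by a tested count and a coverage fraction."""
    if not 0 < coverage_fraction <= 1:
        raise ValueError("coverage_fraction must be in (0, 1]")
    return tested / coverage_fraction


# Assumption: "70 or so" tests from each side, taken here as exactly 70 + 70.
tested = 70 + 70

# The 7.6% figure is from the report; the group size is derived, not reported.
implied = implied_population(tested, 0.076)
print(f"{tested} tests at 7.6% coverage imply about {implied:.0f} Assurers")
print(f"check: coverage = {coverage(tested, round(implied)):.1%}")
```

The same two functions can be rerun with the real counts whenever the number of tested assurances or the size of the Assurer group changes.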
Problem solved! However, that was an informal process only. Over this last year the Assurance Team (now including me) have worked to formalise this process into a proper documented practice: we've defined the role of co-auditor, tested our team of co-auditors, documented the process of tested-assurance for the 2010 season, field-tested the process at CeBIT-2010, and rolled out the process in some more ATEs. I've also built a little database called CASPER (Co-Auditing System for Periodic Evaluation of RAs) to collect the results and display them. Results as of 20100820:
CASPER tells us that out of 46 co-audits, there is an error rate of roughly 1.8 out of 6 across 5 countries. It also tells us that we need many more co-audits and ATEs! Which leads to my next point: ATE has had a bit of a slow take-up since last year, in part because people have been busy, but also, sad to say, in part because there has not been universal support for this essential audit project. As of right now, we simply don't have enough co-audits to be comfortable, so here's what you can do to help your audit:
ATTEND an ATE today!
And once you have done that, help us to organise more ATEs. Extra points for strange and exotic locations.
(5) Disclosures against DRC. Now back to core audit: The way an audit works is to examine the policies and then check they are implemented and followed. This is called:
say what you do, and do what you say.
We also work to criteria, which are long checklists of things that must be there. In the first phase, bringing criteria and documentation together can be done by disclosures, which are essentially pointers to evidence, in writing, from you to the Auditor, against each of the criteria. One by one. At the second phase, if the disclosures aren't good enough ("obvious" and "easy"), the Auditor has to walk into the field and check for him or herself.
Therefore, the better our disclosures, the less work for the Auditor to do. In my first audit, I simply wrote the disclosures myself against the criteria, but I think this is too much work for one person. Or, more plainly, the next Auditor will find it a lot of work, and will therefore charge too much money (or go elsewhere).
The disclosures can be done by you and me and everyone. This is entirely within CAcert's power to do. It's unlikely we'll find an Auditor to do it for us.
To assist in this process, I did a bit of hackery. I took the older audit criteria browser, and hacked it into what looks like a criteria-blog-with-disclosure-comments. This new app presents each of the criteria, one by one, with a comment post feature that allows you all to write the disclosures. I call this CROWDIT, as a sort of wordplay on Crowd-Audit. This open governance innovation is now written and ready to trial, at least in demo form, so a task over the next year is to get those disclosures written and collected from you. Let's look at an example:
You can help. Over the next year, we'll be forming our team to fill out the above. It's simple to describe: pick one of the criteria (like A.2.f in yellow above) and research it. Figure out how to show it is met, to some reasonable level, and make a disclosure (there are two above in pink).
Easy to say, harder to do, but not impossible -- our challenge for next year will be to build a new internal audit team to get this done. Watch this space.
(6) CARS. Finally, it can't have escaped your notice that we are moving lots and lots of work out to our community. This work has to come back to the Auditor in one way or another, and to be useful, the work must be solid! The auditor has to rely on your work, and to make this possible, we've invented the reliable statement:
CAcert Assurer Reliable Statement
or CARS! At one level it is a small thing, just four letters to add to your name in a report (as seen below). But behind those 4 letters of CARS, more significant things are happening.
Recall our certificates? The CPS and our CCA say that you the member may rely on the information in the certificates. CARS is the same thing in concept, but with a much broader scope than a certificate. When an Assurer makes a Reliable Statement, you the Member may rely on it, and by extension, so may the Community and the Auditor.
How strong this is can be tested in the same way. What happens when a certificate goes wrong? Well, we ask the Arbitrator, who will examine all the circumstances, apply the policies, and make a ruling. We don't know what the result will be, but we do know we'll get a result. Which means the process is reliable.
Exactly the same happens with CARS. When an Assurer makes a Reliable Statement that later proves to be wrong, we can ask the Arbitrator to rule on it. That might not solve the specific problem of that one statement for that one relying party, because the result can go either way. But it should solve the general problem of all such reliable statements for our entire community. An Assurer knows to think carefully, and make the best possible statement for reliance by the whole community. And, an Auditor can also rely on the results, which takes us one step closer to crowd-sourcing our entire audit work process.
Each of the above innovations has been strengthened this way. Co-audits are reported as CARS in CASPER, and the CROWDIT disclosures you make against the criteria are also CARS. Training sessions can run to the same standard, and reports from the activities can be so labelled.
(7) Policy. Around about the end of this financial year, the policy group completed its essential policy set, as dictated by DRC -- our Security Policy, the CPS and the CCS (index to audit). See the policy group report for that!
Conclusion. This package of changes took a year or more to put in place - that includes seeing the need, thinking & trying & sharing, many events, testing and documenting, and integrating them together. Also some software tools to scale it up.
To my mind, this represents the work needed to proceed to the next phase: a real life RA audit. The technical systems are now in place; what remains is to have the Community fill out those disclosures, attend their ATEs and collect up the co-audited results. And in parallel, assuming the Community gets behind the work, it seems reasonable to think about asking an Auditor to come in and check that work by the Community.
I'll help that work, but really it now belongs to you: Get to an ATE, get some co-audits done, and help with disclosures. To the extent that the Community gets behind this approach, the audit will move forward again.
(Which is why for the last month or two I've been concentrating on that other issue towards audit, BirdShack and a new software architecture. That is my personal goal for the future, because you'll be doing the audit work!)