Arbitrator: EvaStöwe (A), Respondent: Michael T (R), Claimant: CAcert (C), Case: a20140322.1

History Log

Private Part

EOT Private Part

original Dispute

Discovery

Containment of the problem

R (in his role as software assessor) asked critical team to install two scripts together with the patch for bug #1135. The patch was ready for deployment at that time, but one of the scripts should not have been installed yet, as it changed the DB structure to prepare for the patch for bug #1138, which was not ready. Because of this, changes to the name and date of birth (DoB) fields were lost if they were made before the patch for #1138 was installed.

Critical team installed the script together with the patch for #1135 without problems. There is no reason to assume that they should have encountered any problems. According to R, critical team would not have seen any error messages even when those fields were changed by support actions before the patch for #1138 was installed, as "the return code [was] not checked and never used so it may be that there was no error message".
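The failure mode R describes, a write to the audit log whose result is never checked, can be illustrated with a minimal sketch. This is not the code from notary.inc.php. The table layout, column names and function names below are assumptions made up for the illustration, and SQLite stands in for the real database.

```python
import sqlite3


def log_admin_action(conn, user_id, action):
    """Hypothetical audit-log writer: True on success, False on failure."""
    try:
        conn.execute(
            "INSERT INTO adminlog (user_id, action) VALUES (?, ?)",
            (user_id, action),
        )
        return True
    except sqlite3.Error:
        # The table no longer matches what the code expects.
        return False


def delete_account_unchecked(conn, user_id):
    # The result of the log write is discarded, so a failed write
    # is invisible to the caller and to anyone watching for errors.
    log_admin_action(conn, user_id, "delete")
    return "account deleted"


def delete_account_checked(conn, user_id):
    # Checking the result turns the silent loss into a visible error.
    if not log_admin_action(conn, user_id, "delete"):
        raise RuntimeError("audit log entry could not be written")
    return "account deleted"


def make_db(schema_matches_code):
    """Build an in-memory DB whose adminlog table matches the code or not."""
    conn = sqlite3.connect(":memory:")
    if schema_matches_code:
        conn.execute("CREATE TABLE adminlog (user_id INTEGER, action TEXT)")
    else:
        # Assumed stand-in for a schema already changed for a later patch.
        conn.execute("CREATE TABLE adminlog (user_id INTEGER, action_type TEXT)")
    return conn
```

With the mismatched schema, the unchecked variant reports success although no audit entry was written, which corresponds to the situation described above. The checked variant raises an error in the same situation.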

R detected his mistake and filed the dispute for this case.

A issued an intermediate ruling that support should make no further changes or deletions to the affected fields until it was ensured that the issue had been fixed by the installation of another software patch.

This did not affect name and DoB changes made by the members themselves. However, as the software team was busy working on the patch to solve the original issue at high priority, it did not make sense to ask them to prepare something to cover those changes instead of fixing the issue itself. Also, members can only make such changes themselves as long as the account has no assurances. At that point there is no way to verify the fields at all, so a change would not mean much one way or the other, as neither the original nor the changed value would be reliable. There is also probably no great reason for members to change the fields in the first place when no assurances play a role.

Critical team installed the needed patches to fix the issue as asked by another software assessor.

That software assessor also asked critical team to "coordinate with Support to verify information on previously affected operations are now properly recorded by the system".

R attested that the issue was fixed by the installation of the patches on 2014-06-07 and that it should be safe to change or delete entries for the affected fields again.

After some discussion with the internal auditor, the software team and the CM about how to proceed, A thinks that the block on changing names/DoBs and deleting accounts can cautiously be removed.

Support had previously suggested a procedure for checking whether the affected operations are properly recorded by the system. This included "Critical [to] check the results in the database and gives the result back to support".

There have to be good reasons to allow such deep inspection of accounts. Verifying that a patch is good enough to be run on the production server should not be one of those reasons, at least not without other evidence that there really is such an issue.

If such a detailed check were needed, the patch should not be considered sufficiently verified to be installed on the production system at all. In that case, corresponding tests should first be run on a test server set to the state the production server was in after #1135 was installed.

Otherwise, the current patches have to be considered secure and good enough to be executed without such a detailed inspection of the account of at least one member (or ex-member).

In a live session between R (as software assessor), another software assessor (Benny) and A, both software assessors assured A that they think the patches should not need such a deep check, even though they should be installed and monitored with care, as should be the case after every change to the software on a production server. It was also clarified that the instruction to critical team should be understood as a precautionary measure only.

A allowed individual, cautious executions of name/DoB changes and delete account actions, which were to be monitored. Support was allowed to do this first on a real account of their own, with correct data. During those executions critical team encountered an error message, which the software team later declared to be unrelated (and known).

The execution of name/DoB changes and delete account actions was declared to be safe again, and support was told that they were allowed to execute them as usual.

Critical team executed SQL queries that inserted the missing entries into the DB, according to the information provided by the support team. The queries had the OK of 2 SAs.

All issues resulting from the early execution of the script should be fixed now. There is no need to inform the affected members / ex-members, as the data loss did not affect the members or their accounts but only the records for the audit trail.

need for corrective actions

interview of R

R was asked to answer the following questions. He gave the following answers:

  1. After the patch for #1135 was applied (with the scripts), there should have been some error messages if someone had tried to change a name or delete an account. Is there anything else that normally causes this kind of error message?

I don't think there will be an error message at all. The critical part happens in line 912 of the notary.inc.php in the function account_delete(). There an entry should be created in the adminlog table which was changed by the script that was accidentally executed but the return code is not checked and never used so it may be that there was no error message.

  2. Can you explain why you accidentally handed over the script too soon to critical team?

Because there were two scripts in the same bug tracker item and there were two months between the last time I checked the script and when I sent it out. So I didn't remember at that time when I sent it out that only the first one should have been executed.

  3. Was there anything that could have indicated to critical team that the script should not have been applied at this time or that there were some problems with the script at that time?

I can't see how the critical team should have noticed this, there was no big note when executing the script or something. There was no problem with the script itself but it should have been kept with the code that needed it instead of being mangled with the other one.

  4. Is there anything that could prevent software from handing over scripts or anything else to critical team too soon? Or anything else that you can think of to prevent such a mistake in the future?

One thing we should do is group changes not by the part that needs changing (in this case the database schema) but by their logical relation. In this case the version4.sh script should have been in the related bug-1138.

  5. How did you come to see that you had made a mistake?

I checked in the release state of another database migration script and noticed that it hadn't been applied on the test server yet. As I wanted to execute it on the test server it failed because the previous script which was the one in question for this arbitration case hadn't been run (a safe-guard I put in the database migration mechanism). When I checked why it hadn't been run I discovered my previous error.

  6. Mistakes occur, but we try to reduce them with a 4 eyes principle. Was there anything that could have told other SAs that a script was applied too early?

Maybe if they remembered correctly and observed the mails to the critical team. But for them too some time had gone by.
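R mentions a safeguard in the database migration mechanism that makes a migration script fail if its predecessor has not been run. The following is a minimal sketch of how such an ordering check could look. It is not the actual CAcert mechanism: the schema_version table, the function names and the use of SQLite are assumptions for illustration only.

```python
import sqlite3


class MigrationOrderError(Exception):
    """Raised when a migration is applied out of sequence."""


def current_version(conn):
    # Hypothetical bookkeeping table holding the applied versions.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] if row[0] is not None else 0


def apply_migration(conn, target_version, statements):
    """Apply a migration only if the schema is exactly one version behind."""
    have = current_version(conn)
    if have != target_version - 1:
        raise MigrationOrderError(
            f"schema is at version {have}, "
            f"migration {target_version} needs version {target_version - 1}"
        )
    for stmt in statements:
        conn.execute(stmt)
    conn.execute(
        "INSERT INTO schema_version (version) VALUES (?)", (target_version,)
    )
    conn.commit()
```

A check of this kind detects a skipped predecessor, which matches how R noticed the problem on the test server. It does not by itself prevent a script from being applied before the code that needs it is deployed.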

Deduction

  1. The problem was caused by a mistake and not deliberately.
  2. Critical team, who should monitor the logs, could not have detected a problem. For them everything should have looked ok.
  3. Support team could not have detected the data loss, as the changes were not visible at the frontend at that time.
  4. The problem was caused by software team because they had not split the scripts across bug entries in the bugtracker according to the contexts in which they should be executed.
  5. The 4-eyes principle did not help here, because the mails from software team to critical team do not need to be reviewed; evidently the mail in question was not thoroughly checked by other SAs in this case.

The current process does not require other SAs to check such mails. The scripts themselves had probably passed their review. Nothing was in place to check for the correct time of execution; this is currently defined only in the mails sent to critical team by the software team.

An interview with another SA revealed that the bugtracker in use does not allow a bug to be set to the status "ready to deploy" if it has a dependency on another bug that is not at least in the status "ready to deploy". This could be used as a safeguard that would be visible not only to every software assessor but also to critical team. A minor reconfiguration of the settings may be needed for this.
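The dependency rule described above can be sketched as follows. This only illustrates the rule and is not the bugtracker's implementation or configuration. Apart from "ready to deploy", the status names and their order are assumptions, as are the function names and the bug numbers in the usage note.

```python
# Assumed status progression; only "ready to deploy" is named in the case.
STATUS_ORDER = ["new", "in review", "ready to deploy", "deployed"]


def rank(status):
    """Position of a status in the assumed progression."""
    return STATUS_ORDER.index(status)


def blocking_dependencies(bug_id, bugs):
    """Return dependencies that are not yet at least 'ready to deploy'."""
    threshold = rank("ready to deploy")
    return [
        dep
        for dep in bugs[bug_id]["depends_on"]
        if rank(bugs[dep]["status"]) < threshold
    ]


def set_ready_to_deploy(bug_id, bugs):
    """Refuse the status change while any dependency is still blocking."""
    blockers = blocking_dependencies(bug_id, bugs)
    if blockers:
        raise ValueError(
            f"bug {bug_id} depends on bugs that are not ready: {blockers}"
        )
    bugs[bug_id]["status"] = "ready to deploy"
```

For example, with hypothetical bugs 101 and 102, where 101 depends on 102 and 102 is still "in review", the status change for 101 is rejected until 102 itself reaches "ready to deploy". The safeguard only works if the dependency is actually recorded in the bugtracker.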

Rulings

Intermediate Ruling I

a20140322.1. Name change or delete account support cases should be processed up to the point of execution and then put on hold (possibly informing the affected users) so that they can be executed when it is safe to execute them again without data loss.

should not be done by an emergency process. The arbitrator of a20140322.1 should be informed about any major steps or problems in this context.

-- Cologne, 2014-03-24

Intermediate Ruling II

Support should be allowed to execute name and DoB changes and account deletions again.

They should select one case of each kind, if possible, and execute them. After the execution they should look at the affected account in the support interface and check whether everything looks as it should. They should also inform critical team about the execution and ask critical team to take a close look at the corresponding log files.

If they feel the need, support may create a real (additional) account for a support member with correct data that may be deleted without any normal delays or mails normally needed based on a20111128.3.

Both teams should report the results back to A and CM of this case, especially if anything unusual or unexpected that may be related to this case was detected.

If there is no indication that an issue remains, support should be allowed to execute all other cases of name/DoB changes or account deletions again as usual.

-- Kiel, 2014-06-15

Partial ruling

The support team provided Arbitration with a list of entries for name/DoB changes and account deletions that are missing from the database because of the early execution of a script together with the patch for bug #1135.

The list also contains SQL queries for each of those cases to add the missing entries, which follow a pattern confirmed by 2 software assessors.

The Arbitrator should provide critical team with this list in an encrypted mail.

Critical team afterwards should execute those queries and report the results back to the Arbitrator and Case Manager of this case, again in an encrypted mail.

-- Kiel, 2014-07-19

Final ruling

The issue was caused by a mistake, because two scripts that should not be installed at the same time were handled in the same bugtracker entry. It is fixed and the data is restored.

The software team processes at that time were followed correctly, but could not prevent such a mistake.

A Software Assessor made a proposal to change some settings within the bugtracker that may help to improve this situation.

If not already done, software team is advised to consider the proposed settings or to find other ways to improve their processes to prevent this kind of mistake in the future.

-- Münster, 2015-01-30

Execution

Similar Cases

a20141118.1

Investigation on bug 1339

a20150114.2

Wrong version of CCA on website


Arbitrations/a20140322.1 (last edited 2015-02-18 21:03:31 by BernhardFröhlich)