SURFsara strives to be transparent about incidents on the HPC Cloud infrastructure. This report explains an issue that affected the storage cluster on Sunday 19th through Monday 20th of February 2017.
The HPC Cloud Ceph storage cluster experienced a major incident. We have returned to normal production and no data was lost.
Virtual Machines mounting Ceph datablocks may have been affected.
The storage cluster is built on Ceph (Software Defined Storage) and consists of 48 nodes with a current storage capacity of 2 PB (petabytes). We use the standard replication factor, which means that every file/object is stored as three copies spread across the cluster.
On Sunday 19th at 19:25, the logs started to record problems on the storage nodes. Parts of the cluster could no longer reach each other and started asking the Ceph monitor to remove the unreachable peers from the cluster. The Ceph monitor does not comply immediately: it first needs confirmation from different nodes that a certain disk is no longer behaving correctly.
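The confirmation step described above can be sketched as a small model. This is an illustration only, not Ceph's actual implementation: the class name `FailureTracker`, the parameter `min_reporters`, the threshold of 2 and the node names are all assumptions made for the example.

```python
from collections import defaultdict


class FailureTracker:
    """Toy model of a monitor that waits for several independent failure reports."""

    def __init__(self, min_reporters=2):
        # Assumed threshold: how many distinct nodes must report a disk
        # before the monitor marks it down.
        self.min_reporters = min_reporters
        self.reports = defaultdict(set)  # disk id -> set of reporting nodes
        self.down = set()

    def report_failure(self, reporter, disk):
        """Record that `reporter` cannot reach `disk`.

        Returns True once `disk` has been marked down.
        """
        # A set ignores repeated reports from the same node, so only
        # reports from different nodes count towards the threshold.
        self.reports[disk].add(reporter)
        if len(self.reports[disk]) >= self.min_reporters:
            self.down.add(disk)
        return disk in self.down


tracker = FailureTracker(min_reporters=2)
first = tracker.report_failure(reporter="node-a", disk=12)   # one reporter: not enough
repeat = tracker.report_failure(reporter="node-a", disk=12)  # same node again: still one
second = tracker.report_failure(reporter="node-b", disk=12)  # second node: marked down
print(first, repeat, second)
```

In this model, a single node that loses connectivity cannot get a healthy disk removed on its own. When many nodes lose sight of each other at once, however, the threshold is reached quickly for many disks.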
On Sunday 19th at 19:27, the Ceph monitor removed the first disk (OSD in Ceph lingo) from the cluster. From then on, the whole cluster behaved erratically, continuously removing and re-adding OSDs.
Thirty minutes later, only 308 of the 450 OSDs were left in the cluster, and the decline continued until just 214 OSDs were operational. This triggered a failsafe procedure: the cluster went into shutdown mode to protect the data and stopped processing client requests.
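To put these OSD counts in proportion, the short calculation below converts them into fractions of the full cluster. It uses only the figures quoted above; the helper name `fraction_in` is made up for this example, and the report does not state which threshold actually triggered the failsafe.

```python
TOTAL_OSDS = 450  # total number of OSDs quoted in this report


def fraction_in(osds_in, total=TOTAL_OSDS):
    """Return the fraction of OSDs still part of the cluster."""
    return osds_in / total


after_30_min = fraction_in(308)
at_lowest = fraction_in(214)

print(f"after 30 minutes: {after_30_min:.1%} in, {TOTAL_OSDS - 308} OSDs missing")
print(f"at the lowest point: {at_lowest:.1%} in, {TOTAL_OSDS - 214} OSDs missing")
```

At the lowest point, fewer than half of the OSDs were still operational.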
On Monday 20th at 05:30, our engineers noticed the problem. Seeing that the OSDs were not actually broken, they immediately started reconnecting the missing OSDs manually.
On Monday 20th at 11:00, the cluster was still in a warning state but fit for production, and we could restart all services and VMs affected by this problem.
The Ceph cluster returned to its normal healthy state at 13:36.
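The timeline above can be summarised as elapsed time since the first logged problem. The sketch below uses the timestamps from this report and assumes that the final 13:36 timestamp falls on Monday 20th; the event labels are shorthand chosen for the example.

```python
from datetime import datetime

# Timestamps taken from the timeline in this report.
events = [
    ("first problems logged", datetime(2017, 2, 19, 19, 25)),
    ("first OSD removed", datetime(2017, 2, 19, 19, 27)),
    ("engineers noticed", datetime(2017, 2, 20, 5, 30)),
    ("fit for production", datetime(2017, 2, 20, 11, 0)),
    ("healthy", datetime(2017, 2, 20, 13, 36)),  # date assumed to be Monday 20th
]

start = events[0][1]
elapsed = {label: when - start for label, when in events}

for label, delta in elapsed.items():
    minutes = int(delta.total_seconds() // 60)
    print(f"{label}: +{minutes // 60}h{minutes % 60:02d}m")
```

Under that assumption, the problem went unnoticed for 10 hours and 5 minutes, and the cluster was fully healthy again 18 hours and 11 minutes after the first logged problem.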