At 8:43am EDT on 8/31/19, AWS suffered an outage in US-East-1a that affected all of its customers in that region, including VividCortex. Until 11:21am EDT, our product was accessible but was not accepting new data. At 11:21am we restored our ingest pipeline so that new data was again being processed in the product, but the system remained degraded for the next several hours as our services re-synced data away from our permanently failed instances. On 9/2 the re-sync completed and we started the restore process, which took some time to complete. Once the restore had finished, it was apparent that most customers had a 2.5-hour gap in their data. Our team investigated further and determined that an agent misconfiguration (push) had inhibited AWS communication with our failover system. As a result, all data from the 2.5-hour window is unrecoverable for customers. The issue that caused this problem was fixed as of 9/3/19, and agents have been updated.
Customer Impact: Likely 2.5 hours of missing metrics during the 8/31/19 AWS outage window (approximately 8:43am to 11:21am EDT).
Corrective actions: VividCortex is conducting a full internal post-mortem, is looking to extend high availability across multiple AWS regions, and is re-examining its unit testing and QA process for configurations.