AWS Outage Affecting App & Ingest
Incident Report for VividCortex
Postmortem

At 8:43am EDT on 8/31/19, AWS suffered an outage in US-East-1a that affected all of its customers there, including VividCortex. Until 11:21am EDT, our product was accessible but not accepting new data. At 11:21am we restored our ingest pipeline so that new data was again being processed in the product, but the system remained degraded for the next several hours as our services re-synced data away from our permanently failed instances. The re-sync completed on 9/2, and we then started the restore process, which took some time to complete. Once the restore had finished, it was apparent that most customers had a 2.5-hour gap in their data. Our investigation determined that an agent misconfiguration (push) had inhibited AWS communication with our failover system. As a result, all data from the 2.5-hour window is unrecoverable for customers. The issue that caused this problem was fixed as of 9/3/19, and agents have been updated.

Customer Impact: Likely 2.5 hours of missing metrics during the 8/31 AWS outage window

Corrective actions: VividCortex is conducting a full internal post-mortem, looking to extend high availability across multiple AWS regions, and re-examining its unit testing and QA process for configurations.

Posted Sep 04, 2019 - 19:34 UTC

Resolved
Our ingest and application issues have cleared. We are working to restore data to the product from the outage window.
Posted Aug 31, 2019 - 15:36 UTC
Update
We are continuing to investigate this issue.
Posted Aug 31, 2019 - 13:27 UTC
Investigating
We are currently investigating an AWS outage affecting all VividCortex services. https://downdetector.com/status/aws-amazon-web-services
Posted Aug 31, 2019 - 13:27 UTC
This incident affected: Web Application, APIs, and Metrics Ingest Pipeline.