Investigating issues with proxies
Incident Report for VividCortex

Early this morning, one of VividCortex's proxies experienced increased backpressure because an overloaded EC2 instance exhausted its CPU credits. While working to resolve the issue, we performed a proxy failover and encountered a latent misconfiguration left over from a past infrastructure change. This increased the time to resolution and caused additional data unavailability for some customers.

To prevent a recurrence and increase the reliability of our systems, we have taken immediate action and identified additional steps for the future. We have extended our monitoring to cover not only CPU credits but also other similar items, and we have reviewed all proxy configuration settings and ensured they are included in our infrastructure-as-code automation. As a proactive measure, we plan to perform more failover testing, including failover of our proxies.
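
As an illustration of the kind of CPU credit check described above, here is a minimal sketch. It is not our production monitoring. It assumes credit balance samples have already been collected (AWS publishes a `CPUCreditBalance` CloudWatch metric for burstable instances), and the `CreditSample` type, the `instances_low_on_credits` function, the instance names, and the threshold of 50 credits are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class CreditSample:
    """One observation of an instance's remaining CPU credits (hypothetical type)."""
    instance_id: str
    balance: float


def instances_low_on_credits(
    samples: Sequence[CreditSample],
    min_balance: float = 50.0,  # assumed threshold, for illustration only
) -> List[str]:
    """Return the sorted IDs of instances whose most recent balance is below min_balance.

    Samples are assumed to be in chronological order, so a later sample
    for the same instance overrides an earlier one.
    """
    latest: Dict[str, float] = {}
    for sample in samples:
        latest[sample.instance_id] = sample.balance
    return sorted(
        instance_id
        for instance_id, balance in latest.items()
        if balance < min_balance
    )


if __name__ == "__main__":
    history = [
        CreditSample("proxy-1", 120.0),
        CreditSample("proxy-2", 80.0),
        CreditSample("proxy-1", 12.0),  # proxy-1 is draining its credits
    ]
    print(instances_low_on_credits(history))
```

Alerting on a low balance, rather than on exhaustion, leaves time to act before the instance is throttled and connections begin to back up.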

The incident only delayed the availability of data to customers. Our recovery routines are working as expected, and any gaps in data should be resolved soon.

Posted 2 months ago. Nov 17, 2017 - 23:13 UTC

Resolved
Issue cleared. Investigation revealed that the failover proxy was not ready to receive traffic.

We'll begin restoring data shortly.
Posted 2 months ago. Nov 17, 2017 - 15:26 UTC
Monitoring
An instance was exhausting its CPU credits and thus causing a connection backlog. We removed the restriction and are monitoring now.
Posted 2 months ago. Nov 17, 2017 - 13:54 UTC
Identified
We've identified the problem and expect a fix within minutes.
Posted 2 months ago. Nov 17, 2017 - 13:47 UTC
Investigating
We're experiencing a partial outage because some proxies are having issues. This may impact metrics ingestion and the web application for some customers. We're actively investigating and will post another update soon.
Posted 2 months ago. Nov 17, 2017 - 12:10 UTC