February 2017

Processing Delays / Partial UI Outage
All data for all clients should be available. Please do not hesitate to contact us if anything in your environment causes concern.
Feb 2, 18:25 - Feb 3, 14:08 UTC

January 2017

Configuration data corruption
We have monitored our system and are confident that no additional data has been affected or will be affected. Thank you for your patience and apologies for the disruption.
Jan 18, 21:15 - Jan 19, 00:27 UTC

December 2016

Failing shard causing some outages
We've been monitoring and everything continues to work normally.
Dec 1, 20:01-22:36 UTC

November 2016

Pipeline failure
Service is fully operational again.
Nov 17, 16:25-19:18 UTC
Agent Ingestion API Failure
Users may notice gaps in collected metrics over a one-hour period late this afternoon. After deploying a new API process to production, we noticed an issue with agents failing to push observations to our API. We have rolled back the deploy, and observations are once again being pushed successfully. Agents use alternate storage in situations like this, and any gaps you see should be filled shortly.
Nov 15, 23:13 UTC

October 2016

No incidents reported for this month.

September 2016

Metrics Ingest Delayed
Metrics ingest has caught up; all data should be visible now.
Sep 30, 21:07-22:53 UTC
Backend database failover
This incident has been resolved.
Sep 28, 09:56-11:17 UTC
Unresponsive DB server
Systems are operating normally. No data was lost.
Sep 26, 18:11-21:25 UTC
Backend service issue
Everything's operational now, and there was no data loss.
Sep 26, 07:49-09:18 UTC
Partially Reduced API Performance
This incident has been resolved.
Sep 23, 03:46-05:34 UTC
Query Anomaly Detection Delayed
We found an issue with our anomaly detection process, which we've already fixed. New query anomalies will be detected normally. Some additional processing needs to be done to fill in events from the past few days; it should be done over the next few hours.
Sep 6, 14:51-20:15 UTC
Issues with metrics downsampling
This incident has been resolved.
Sep 2, 00:43-01:42 UTC

August 2016

Backend server issue
We've solved the issue with the database; all systems are operational.
Aug 19, 18:17-18:38 UTC
Unresponsive DB server
All systems are working normally.
Aug 16, 16:26-17:14 UTC

July 2016

No incidents reported for this month.

June 2016

No incidents reported for this month.

May 2016

No incidents reported for this month.

April 2016

Data Processing Outage
Recovery is complete. However, some data was lost during the process: customers will see a gap in metrics between 9:50am and 10:20am EST. We sincerely apologize for the loss of data. We are composing a postmortem, which we will release shortly, and are actively putting changes in place to ensure this incident does not happen again.
Apr 8, 14:01-19:49 UTC

March 2016

Partial delay in data processing
All environments have caught up.
Mar 25, 14:15-15:14 UTC
Error in main application site
The site is operational.
Mar 24, 14:06-14:17 UTC
Partial Loss of Service
All's well. We'll be reviewing our Ansible playbooks to find out how to prevent this from happening again.
Mar 18, 07:26-08:17 UTC

February 2016

Pipeline delayed
Pipelines in sync. No data was lost. Service is operating normally.
Feb 25, 11:50-13:03 UTC
Metrics Pipeline Delayed
This incident has been resolved.
Feb 24, 00:57-04:17 UTC
Metrics Pipeline Delayed
Our metrics pipeline has finished catching up; all metrics that were delayed in processing due to the failure are now available.
Feb 18, 09:48-12:37 UTC
Metrics ingest pipeline is falling behind
We tracked the issue down to a couple of corrupted partitions in the data ingestion pipeline, which we were able to advance manually. No metrics data was lost, and users will not see any gaps in their graphs.
Feb 16, 22:11 - Feb 17, 02:51 UTC
Backend server upgrade
We're completing the backend server overhaul. No downtime or issues are expected.
Feb 14, 13:04 - Feb 15, 01:01 UTC
Metrics Pipeline Delayed
All systems are operational.
Feb 14, 09:04-12:12 UTC
Degraded performance at backend server
All systems are now operational.
Feb 13, 23:52 - Feb 14, 01:06 UTC
Data ingest pipeline is falling behind
We successfully failed over, and the data ingest pipeline has caught up. No data was lost.
Feb 11, 20:31-20:51 UTC
Partial outage in the web app
A backend database had become unresponsive. All systems are now fully operational.
Feb 9, 15:05-15:32 UTC
Internal Server Error in App
We've reset the login, so anonymous users have access again. We'll keep watch in case the limiting triggers again.
Feb 5, 00:46-01:17 UTC

January 2016

No incidents reported for this month.

December 2015

Delayed Metrics Pipeline
We've finished processing the missing observation data. We'll continue to repair the failing node with no effect on the application.
Dec 29, 21:57 - Dec 30, 03:52 UTC
Delayed Metrics Pipeline
All metrics, events, and samples are caught up.
Dec 27, 21:42-22:20 UTC
Potential service impact
There's no impact to customers; the shard is healthy and only a standby server is having issues.
Dec 15, 17:15-17:20 UTC
Metrics Downsampling offline for some customers
Metric downsampling has completed for all affected environments; customers should no longer see a gap in metrics.
Dec 3, 13:38-15:57 UTC
Querysample ingestion process is running behind
All data ingestion processes are up to date. No data was lost.
Dec 2, 19:28-19:58 UTC
Failing over to backup shard
Querysample ingestion process is now caught up. Everything is back to normal.
Dec 2, 17:33-17:37 UTC

November 2015

Slow data processing (partial)
Querysample data ingestion is all caught up now. Another data ingestion process was saturating the network, which prevented querysample from catching up. Once we stopped the process causing the saturation, querysample was able to quickly work through its backlog of data.
Nov 26, 20:17 - Nov 27, 00:40 UTC
Kafka Consumers Delayed
The data pipeline has caught up and the root cause has been fixed.
Nov 25, 08:34-10:12 UTC
Gaps in Downsampled Data
Background processing has finished running without issue; affected customers should no longer see a gap in their data.
Nov 24, 17:58-18:43 UTC
Data Processing Behind
Metrics ingest is now functioning properly; no data was lost during the period of degraded performance.
Nov 23, 21:15-21:36 UTC