System instability due to multiple server failures

Resolved

The system is currently stable and the problem has been eliminated. Here is a short summary of what happened and what we did to prevent similar situations in the future:

What happened
On Thursday, 8th July, at 13:17 NY time, one of our servers became unstable and lost connectivity. The system is designed to be resilient in cases like this, so it recovered using the remaining servers and behaved normally. However, at 13:31 NY time, two things happened together: the lost server started coming back online (while still behaving erratically), and at the same time one of the master servers became overloaded (because the system had redistributed the faulty server's load in an unbalanced manner).

These two simultaneous events created an internal network routing problem, which caused up to 30% of all requests to fail sporadically. As a result, the system started treating some services as "not available", and this had a cascading effect: more and more services were marked "unhealthy" and excluded from operations. Within a short period of time, the situation escalated from a failure of a couple of services to a system-wide problem.
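
The cascading effect described above can be illustrated with a small sketch. It is a simplified model, not a description of our actual health-check implementation: it assumes a service is marked unhealthy as soon as anything it depends on is unhealthy, and the service names and dependency map are hypothetical.

```python
from collections import deque


def cascade_unhealthy(dependencies, initially_unhealthy):
    """Return the set of services that end up marked unhealthy.

    `dependencies` maps each service to the services it depends on.
    Assumption for this sketch: a service is marked unhealthy as soon
    as any one of its dependencies is unhealthy.
    """
    # Build the reverse map: service -> services that depend on it.
    dependents = {}
    for service, deps in dependencies.items():
        dependents.setdefault(service, set())
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    unhealthy = set(initially_unhealthy)
    queue = deque(initially_unhealthy)
    while queue:
        failed = queue.popleft()
        # Everything that relies on the failed service is excluded too.
        for service in dependents.get(failed, ()):
            if service not in unhealthy:
                unhealthy.add(service)
                queue.append(service)
    return unhealthy


# Hypothetical topology, for illustration only.
DEPENDENCIES = {
    "routing": [],
    "svc_a": ["routing"],
    "svc_b": ["svc_a"],
    "svc_c": ["svc_b"],
    "svc_d": ["svc_b"],
    "svc_e": [],
}

if __name__ == "__main__":
    print(sorted(cascade_unhealthy(DEPENDENCIES, ["routing"])))
```

In this model a single failing component takes out every service downstream of it, while an independent service (`svc_e` here) stays healthy.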

What we did to prevent this in the future
We have replaced the faulty server with a new one, since it continued to show connectivity issues.
We have improved the resilience of the system by increasing the redundancy of our servers.
We have implemented procedures which ensure that the required set of master servers will not become overloaded in this way again.
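
As an illustration of what balanced redistribution can look like, here is a minimal sketch. It is not the procedure we deployed: it assumes the failed server's work can be split into weighted units (called "shards" here), and it uses a simple greedy rule that always hands the next-largest shard to the least-loaded remaining server. All names and numbers are hypothetical.

```python
import heapq


def redistribute(loads, failed, shards):
    """Reassign the failed server's shards to the least-loaded servers.

    `loads` maps server name -> current load, `shards` maps shard
    name -> weight. Returns (assignment, new_loads).
    """
    heap = [(load, name) for name, load in loads.items() if name != failed]
    if not heap:
        raise ValueError("no remaining servers to take over the load")
    heapq.heapify(heap)

    new_loads = {name: load for load, name in heap}
    assignment = {}
    # Place the largest shards first so small ones can even things out.
    for shard, weight in sorted(shards.items(), key=lambda kv: (-kv[1], kv[0])):
        load, name = heapq.heappop(heap)  # least-loaded server right now
        assignment[shard] = name
        new_loads[name] = load + weight
        heapq.heappush(heap, (load + weight, name))
    return assignment, new_loads


if __name__ == "__main__":
    # Hypothetical loads and shard weights, for illustration only.
    loads = {"m1": 50, "m2": 20, "m3": 30, "w4": 40}
    shards = {"s1": 20, "s2": 10, "s3": 10}
    assignment, new_loads = redistribute(loads, "w4", shards)
    print(assignment)
    print(new_loads)
```

With these example numbers, no server ends up above the load that the busiest server already had, whereas sending all three shards to one server would push it well past the others.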

If you were negatively impacted by the outage, please consider claiming an SLA credit under the TrendSpider SLA (www.trendspider.com/sla/) to compensate you for the trouble. We are sorry for the disruption, and we are learning from the incident to improve the system.

Jul 10, 2021 10:55 am
Monitoring

The system is currently stable.

We believe the matter is largely resolved and continue to monitor for signs of instability while researching the root cause. Details will be posted here once available.

Jul 8, 2021 8:01 pm
Monitoring

As of this post, service should be coming back online for most customers and users who were impacted by the outage. You may have to log back into the system if you were kicked out during the outage.

The engineering team is still investigating the root cause, which is believed to be networking related. The team is also continuing to monitor internal system health metrics to ensure continued stability, so the case will not be closed until we are sure that the matter is resolved.

If you were negatively impacted by the outage, please be sure to consider claiming an SLA credit under the TrendSpider SLA (www.trendspider.com/sla/) to compensate you for the trouble.

As of the time of this posting, the scanner has been temporarily disabled to minimize system load while the system recovers.

Jul 8, 2021 6:35 pm
Identified

We have identified that this matter impacts a large number of customers, just under 50% of all users. We are sorry for the trouble and are still working to find a resolution as fast as we can.

Jul 8, 2021 6:09 pm
Identified

The engineering team continues to work on the issue and hopes to provide either a quick resolution or an ETA for resolution soon.

Jul 8, 2021 5:46 pm
Identified

The Entitlement and Data Control system is also impacted, so we are adding it to this status update.

Jul 8, 2021 5:37 pm
Identified

We have identified an issue in the system that may prevent some customers from accessing their accounts or logging in, or may cause them to receive Error 504s. We are aware of this and are working to rectify it. Please stand by.

Jul 8, 2021 5:33 pm
Affected Systems
System: Chart Annotations / Drawing
Operational
System: Entitlement and Data Control
Operational
System: Market Scanner
Operational
System: Strategy Tester
Operational
System: Workspaces
Operational