AWS: instances occasionally reboot on their own. Across 20 machines, I have seen it once in about a year and a half

Published: 2021-02-21 23:44:00

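A few days ago one of our machines rebooted for no apparent reason.

Strangely, it was a reboot of the whole Ubuntu system, not a restart of a single service.

So I emailed AWS support and learned that the instance had gone offline because of a hardware problem on the AWS side, which triggered an automatic reboot.

AWS aims for a monthly uptime of 99.95%. Isn't that on the low side? Or am I misunderstanding something?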
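
To put that figure in perspective, here is a rough calculation of how much downtime a 99.95% uptime target leaves room for. It is only a back-of-the-envelope sketch: the function name and the 30-day month are assumptions made for illustration, not something taken from the AWS SLA text.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Return the downtime (in minutes) that an uptime target permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Assumed 30-day month for the monthly figure.
    monthly = allowed_downtime_minutes(99.95, days=30)
    # Assumed 365-day year, applying the same percentage over a longer window.
    yearly = allowed_downtime_minutes(99.95, days=365)
    print(f"per 30-day month: {monthly:.1f} minutes")
    print(f"per 365-day year: {yearly / 60:.2f} hours")
```

Under these assumptions, 99.95% leaves roughly 21.6 minutes of downtime per 30-day month, so a single unplanned reboot that lasts a few minutes still fits inside the target.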

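In any case, the support engineer's reply was sincerely apologetic and acknowledged that the reboot happened without any advance notice to us.

Now that I understand the cause, I just hope this kind of thing happens less often.

The original reply follows: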

Hello,

Thank you for contacting AWS Premium Support.

My name is Zachary from the Linux Team and I will be assisting you with your case today. From the case description, I understand your instance i-003b0db2095f0a90f was rebooted without your knowledge on 2021-02-20 around 04:46 and you would like to know what happened during that time. Please let me know if I misunderstood.

My apologies for any inconvenience this issue may have caused you and your team. That being said, I reviewed the instance i-003b0db2095f0a90f and could confirm that it recently failed both the systems[1] and the instance[2] status checks during the times mentioned above (2021-02-19 14:15 thru 19:30 UTC).

As you may know, system status checks[3] monitor the underlying AWS hardware and infrastructure on which the instance runs, which helps detect underlying problems that require AWS involvement to repair.

The following are examples of problems that can cause system status checks to fail:
- Loss of network connectivity
- Loss of system power
- Software issues on the physical host
- Hardware issues on the physical host that impact network reachability

On the other hand, instance status checks[4] monitor the software and network configuration of the instance by sending ARP requests to the network interface, which helps detect problems that require customer involvement to repair (for example, rebooting or making configuration changes on the instance itself).

The following are examples of problems that can cause instance status checks to fail:
- Incorrect networking or startup configuration
- Exhausted memory
- Corrupted file system
- Incompatible kernel
- Failed system status checks

With that being said, the root cause of the issue was unfortunately due to an underlying hardware issue on the AWS side. We aim for a monthly up-time percentage of 99.95% and take every possible action to maintain stability. Due to the hardware issue being unexpected, a maintenance notification may not have been sent out. As of now, both the underlying host and the instance are reporting in a healthy state. If you would like to move the instance to a new host, you can perform a stop/start on the instance which will move the instance from the current host to a new host.

In situations where an instance is not in an Auto Scaling Group (ASG) and downtime is a major issue, AWS has a feature called Auto Recovery[5]. With the Auto Recovery feature, customers can create an Amazon CloudWatch Alarm[6] that monitors an EC2 instance and automatically recovers the instance if it becomes impaired due to an underlying hardware failure or a problem that requires AWS involvement to repair.

Again, my apologies for any inconvenience this issue may have caused you and your team. For now, I'll mark the case as resolved. However, if you have any further questions or concerns and would like to re-open the case, simply reply to the case and we will do our best to further assist.

References:
-------------
[1] StatusCheckFailed_System:
https://ap-northeast-1.console.aws.amazon.com/cloudwatch/deeplink.js?region=ap-northeast-1#metricsV2:graph=%7B%22metrics%22%3A%5B%5B%22AWS%2FEC2%22%2C%22StatusCheckFailed_System%22%2C%22InstanceId%22%2C%22i-003b0db2095f0a90f%22%5D%5D%2C%22stat%22%3A%22Sum%22%2C%22period%22%3A300%2C%22start%22%3A%222021-02-18T01%3A00%22%2C%22end%22%3A%222021-02-21T20%3A00%3A00%22%2C%22region%22%3A%22ap-northeast-1%22%7D

[2] StatusCheckFailed_Instance:
https://ap-northeast-1.console.aws.amazon.com/cloudwatch/deeplink.js?region=ap-northeast-1#metricsV2:graph=%7B%22metrics%22%3A%5B%5B%22AWS%2FEC2%22%2C%22StatusCheckFailed_Instance%22%2C%22InstanceId%22%2C%22i-003b0db2095f0a90f%22%5D%5D%2C%22stat%22%3A%22Sum%22%2C%22period%22%3A300%2C%22start%22%3A%222021-02-18T01%3A00%22%2C%22end%22%3A%222021-02-21T20%3A00%3A00%22%2C%22region%22%3A%22ap-northeast-1%22%7D

[3] System status checks - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html#system-status-checks

[4] Instance status checks - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html#instance-status-checks

[5] Recover Your Instance - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html

[6] Create Alarms That Stop, Terminate, Reboot, or Recover an Instance - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html

We value your feedback. Please share your experience by rating this correspondence using the AWS Support Center link at the end of this correspondence. Each correspondence can also be rated by selecting the stars in top right corner of each correspondence within the AWS Support Center.

Best regards,
Zachary L.
Amazon Web Services

===============================================================

To share your experience or contact us again about this case, please return to the AWS Support Center using the following URL: https://console.aws.amazon.com/support/home#/case/?displayId=8018099451&language=en

Note, this e-mail was sent from an address that cannot accept incoming e-mails.
To respond to this case, please follow the link above to respond from your AWS Support Center.
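
Following up on the Auto Recovery suggestion in the reply above, this is a minimal sketch of what such a CloudWatch alarm could look like with boto3. The alarm name, the two-minute evaluation window and the two helper function names are assumptions made for illustration; the instance ID and region are the ones that appear in this case. It has not been run against a real account here, so treat it as a starting point rather than a finished setup.

```python
def build_recover_alarm(instance_id: str, region: str) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm (hypothetical helper)."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # assumed naming scheme
        "AlarmDescription": "Recover the instance when the system status check fails",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # seconds per evaluation period (assumed)
        "EvaluationPeriods": 2,  # two consecutive failing periods (assumed)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # EC2 "recover" alarm action for the given region.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def create_recover_alarm(instance_id: str, region: str) -> None:
    """Create the alarm; needs boto3 and AWS credentials with CloudWatch access."""
    import boto3  # imported here so the builder above works without boto3 installed

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**build_recover_alarm(instance_id, region))


if __name__ == "__main__":
    # Only builds and prints the parameters; call create_recover_alarm() to apply them.
    params = build_recover_alarm("i-003b0db2095f0a90f", "ap-northeast-1")
    print(params["MetricName"], "->", params["AlarmActions"][0])
```

The alarm watches only StatusCheckFailed_System, since that is the check tied to AWS-side hardware problems like the one in this case. The recover action is not available for every instance configuration, so the page linked as [5] is worth checking before relying on it.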
