Hello!
For some time we have been facing a problem whose cause we have not been able to identify. Maybe you can help us.
The server restarts or freezes when running virtualization.
We have already tested and/or replaced:
- RAM
- Disk IO
- Processors
- FCP card
- Ethernet card
We have even replaced the entire server with another one, and the problem persists.
We think it may be something related to the rack, or to the server's position in the rack.
The temperature is monitored and never rises high enough to shut the machine down. When we run the memory test, the temperature rises and the machine still does not shut down, so temperature does not appear to be the cause.
In the rack and in the cluster, we have three identical servers, and this is the only one that has the problem. It is the server in the middle.
In Linux, the only log we have is the one below:
kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: {1}[Hardware Error]: event severity: corrected
kernel: {1}[Hardware Error]: Error 0, type: corrected
kernel: {1}[Hardware Error]: section_type: general processor error
kernel: {1}[Hardware Error]: processor_type: 0, IA32/X64
kernel: {1}[Hardware Error]: processor_isa: 2, X64
kernel: {1}[Hardware Error]: error_type: 0x01
kernel: {1}[Hardware Error]: cache error
kernel: {1}[Hardware Error]: operation: 0, unknown or generic
kernel: {1}[Hardware Error]: version_info: 0x0000000000050657
kernel: {1}[Hardware Error]: processor_id: 0x0000000000000047
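In case it helps anyone reading the log, here is a minimal sketch of how the version_info and processor_id values could be decoded. It assumes that version_info is the CPUID leaf 1 EAX signature and that processor_id is the local APIC ID, which is my understanding of how the UEFI CPER generic processor error record is defined on x64. The function name decode_cpuid_signature is just an illustrative helper, not part of any tool.

```python
def decode_cpuid_signature(eax: int) -> dict:
    """Split a CPUID leaf 1 EAX value into family, model and stepping.

    Assumption: the value follows the standard x86 CPUID signature layout.
    """
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model only applies to base families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


# Values copied from the log above.
signature = decode_cpuid_signature(0x50657)
apic_id = 0x47

print(signature)  # {'family': 6, 'model': 85, 'stepping': 7}
print(apic_id)    # 71
```

If the APIC ID assumption holds, the matching logical CPU should be the entry in /proc/cpuinfo whose apicid field is 71, which would at least show which socket and core the corrected cache errors are reported against.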
Yesterday I installed Gigabyte GSM as a second way to monitor the BMC.
The following messages appeared in the event log:
If you can give us any tips, I will be eternally grateful.
Originally posted by u/myridan86 on Reddit.com/r/homelab