CN116431373A - Server fault reporting method and related equipment - Google Patents

Publication number
CN116431373A
Authority
CN
China
Prior art keywords
fault
server
information
processor
reporting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310351628.5A
Other languages
Chinese (zh)
Inventor
张超
徐志朗
张仁泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202310351628.5A
Publication of CN116431373A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0787 Storage of error reports, e.g. persistent data storage, storage using memory protection
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/0223 User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023 Free address space management
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/448 Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4482 Procedural
    • G06F9/4484 Executing subprograms
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a server fault reporting method and related equipment, and relates to the technical field of cloud computing. The method comprises the following steps: setting a fault detection point for a fault function in a kernel of an operating system, wherein the fault function is a function called by the operating system when a fault end of the server fails, and the fault end is the processor or computing node that has failed; registering a callback function at the fault detection point, wherein the callback function is provided with function logic for collecting fault information of the fault end and sending the fault information to a non-fault end, and the non-fault end is the processor or computing node that has not failed; executing the callback function when the operating system calls the fault function, so as to send the fault information to the non-fault end by using the callback function; and reporting the fault information through the non-fault end. The method and equipment address the problems that server fault detection is slow, unreliable, and prone to false alarms.

Description

Server fault reporting method and related equipment
Technical Field
The application relates to the technical field of cloud computing, in particular to a server fault reporting method and related equipment.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. It is not admitted to be prior art by inclusion of this description in this section.
For cloud service vendors, virtual machine stability is a lifeline, but servers are often less reliable: hardware failures occur with some probability and leave the server in an abnormal state, and various software errors can bring the server down, which in turn affects the operation of the virtual machines.
To reduce the impact of server abnormality or downtime on virtual machine stability, a hardware or system fault must first be detected quickly and reported to the operation and maintenance end, which can then apply an operation and maintenance strategy such as cold migration or hot migration to quickly migrate the virtual machines away and keep them stable. One existing approach is for the operation and maintenance end to passively detect whether a server has failed by means of network packets such as ping/ssh. The operation and maintenance end first tries to log in to the server through ssh (secure shell, a secure remote management protocol built on the application layer). If the login succeeds, the server is considered normal. If the login fails, ping (Packet Internet Groper, a network packet probing tool) is used to probe the network for 30 s, and if the network is unreachable, the server is judged to be faulty. This passive detection mode, however, depends on the network link, is slow, and can be unreliable and produce false alarms.
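The passive procedure described above can be sketched in Python as follows. This is only an illustrative model: the function name `passive_check`, the injected `try_ssh` and `try_ping` probes, the one-second probe interval, and the `"network_reachable"` outcome (the text does not say what happens when ping succeeds after ssh has failed) are all assumptions. A fake clock is used so the sketch runs without a network and without actually waiting.

```python
def passive_check(try_ssh, try_ping, clock, sleep,
                  ping_window_s=30.0, ping_interval_s=1.0):
    """Model of the ssh-then-ping check run by the operation and maintenance end."""
    if try_ssh():
        return "normal"                  # ssh login succeeded
    deadline = clock() + ping_window_s   # probe the network for the 30 s window
    while clock() < deadline:
        if try_ping():
            return "network_reachable"   # assumed outcome, not stated in the text
        sleep(ping_interval_s)
    return "faulty"                      # network unreachable for the whole window


class FakeClock:
    """Deterministic clock so the example does not really sleep."""
    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
verdict = passive_check(lambda: False, lambda: False, fake.time, fake.sleep)
elapsed = fake.now   # simulated seconds spent before the verdict
```

In this model a dead server is only declared faulty after the whole ping window has elapsed, which is the latency the application sets out to avoid.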
Disclosure of Invention
The embodiments of the application provide a server fault reporting method and related equipment, which are used to at least solve the problems in the prior art that the passive detection mode is affected by the network link, server fault detection is slow, and detection can be unreliable and produce false alarms.
According to an aspect of the present application, there is provided a server fault reporting method, wherein the server is a server in which a processor and a computing node are separated, the processor and the computing node each have their own operating system, and the processor and the computing node are connected and communicate through a bus, the method including:
setting a fault detection point for a fault function in a kernel of the operating system, wherein the fault function is a function called by the operating system when a fault end of the server breaks down, and the fault end is the processor or the computing node with the fault;
registering a callback function at the fault detection point, wherein the callback function is provided with function logic used for collecting fault information of the fault end and sending the fault information to a non-fault end, and the non-fault end is the processor or the computing node which does not generate faults;
executing the callback function when the operating system calls the fault function, so as to send the fault information to the non-fault end by utilizing the callback function;
and reporting the fault information through the non-fault end.
In some embodiments, the callback function is further provided with function logic for sending a message notification interrupt to the non-fault end after the fault information is sent to the non-fault end, and after the fault information is sent to the non-fault end by using the callback function, the method further includes:
sending the message notification interrupt to the non-fault end by using the callback function, so that the non-fault end receives the fault information in response to the message notification interrupt.
In some embodiments, before the fault information is sent to the non-fault end by using the callback function, the method further includes:
receiving address information of a physical memory sent by the non-fault end, wherein the physical memory is memory located at the non-fault end and used for holding the fault information;
the step of sending the fault information to the non-fault end by the callback function includes:
the callback function writes the fault information into the physical memory by direct memory access based on the address information.
In some embodiments, the non-fault end creates a fault reporting thread, which is a thread for reporting the fault information, and the step of reporting the fault information through the non-fault end includes:
recording the received fault information in a pre-established linked list, and waking up the fault reporting thread;
encapsulating the fault information in the linked list into a network message through the fault reporting thread and reporting it.
In some embodiments, a virtual machine is deployed on the computing node, and if the fault end is the processor, when the callback function is used to send the message notification interrupt to the non-fault end, the method further includes:
adding a virtual machine operation stop instruction to the message notification interrupt, so that the computing node, in response to the virtual machine operation stop instruction, sets the running virtual machine to a stopped state.
According to another aspect of the application, a server fault reporting device is further provided, and the device is used for implementing the server fault reporting method.
According to another aspect of the application, a server is further provided, and the server comprises the server fault reporting device.
According to another aspect of the present application, there is also provided an operation and maintenance system including:
the server;
and the operation and maintenance end is used for receiving the fault information reported by the server.
According to another aspect of the present application, there is also provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-mentioned method steps when executing the computer program.
According to another aspect of the present application, there is also provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the above-mentioned method steps.
According to another aspect of the present application, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the above-mentioned method steps.
According to the embodiments of the application, a fault detection point is set for the fault function in the kernel of the server's operating system, and a callback function capable of collecting fault information is registered at the fault detection point. When the operating system calls the fault function, the callback function is executed and sends the fault information to the non-fault end of the server, which then actively reports it to the operation and maintenance end. The operation and maintenance end can therefore quickly perceive that the server has failed; fault detection time is greatly shortened, first-hand fault information is obtained, and false alarms can be prevented.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:
FIG. 1 is a flow chart of a server fault reporting method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of reporting fault information to an operation and maintenance end through a non-fault end according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a server fault reporting apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a server according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an operation and maintenance system according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of fault information interaction between a server and an operation and maintenance end in an operation and maintenance system according to an embodiment of the present invention.
In the figure:
10. server; 11. processor; 12. computing node; 101. server fault reporting device; 1011. setting module; 1012. registration module; 1013. execution module; 1014. reporting module; 1015. receiving module; 20. operation and maintenance end.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
At present, detection of a failure of the server 10 by the operation and maintenance end 20 in the passive mode is affected by the network link, and may be unreliable and produce false alarms. In view of this, as shown in FIG. 1, a first embodiment of the present invention provides a server fault reporting method, wherein the server 10 is a server in which a processor 11 and a computing node 12 are separated, and the processor 11 and the computing node 12 each have their own operating system. The method includes the following steps:
Step S11: in the kernel of the operating system, a fault detection point is set for a fault function, where the fault function is a function called by the operating system when a fault end of the server 10 fails, and the fault end is the processor 11 or the computing node 12 that has failed. The fault detection point is set for the fault function by using, for example, the Kprobe technology of the operating system kernel, so as to capture the cause of the failure of the server 10. Kprobe (Kernel Probes) is a lightweight kernel debugging technology designed by kernel developers specifically for tracing the execution state of kernel functions; it can dynamically insert detection points into most designated kernel functions to collect the required debugging state information, essentially without affecting the original execution flow of the kernel. One example of a fault function is the mce_panic function: when an uncorrectable hardware error occurs in the server 10, the machine check architecture (MCA) of the processor 11 detects the error and generates an interrupt, and after receiving the interrupt the processor 11 calls the system interrupt function and finally reaches the mce_panic function. Another example is the panic function, which is called when a software error occurs in the operating system kernel.
Step S12: a callback function is registered at the fault detection point, where the callback function is provided with function logic for collecting fault information of the fault end and sending the fault information to a non-fault end, and the non-fault end is the processor 11 or the computing node 12 that has not failed. The fault information includes the contents of the registers related to the cause of the detected hardware fault, as recorded by the machine check architecture, and/or information such as the number of the processor 11 at the time the fault occurred and the name of the process that caused the fault. The fault information helps the operation and maintenance end 20 identify and locate the fault, so that it can promptly and effectively apply a hot migration or cold migration strategy to the server.
Step S13: when the operating system calls the fault function, the callback function is executed so as to send the fault information to the non-fault end. In the embodiment of the present invention, the processor 11 and the computing node 12 are connected and communicate through a bus, so the callback function can send the fault information to the non-fault end over the bus.
Step S14: the fault information is reported through the non-fault end.
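Steps S11 to S14 can be illustrated with a small user-space model in Python. It is not kernel code and does not use the real Kprobe interface: the `Kernel`, `Bus`, and `make_callback` names, the dictionary layout of the fault information, and the example values are all hypothetical stand-ins chosen for illustration.

```python
class Kernel:
    """Toy stand-in for the operating system kernel of one end of the server."""
    def __init__(self):
        self._probes = {}   # fault function name -> list of registered callbacks

    def register_probe(self, fault_function, callback):
        # S11 + S12: set a detection point on the fault function and
        # register a callback at that point.
        self._probes.setdefault(fault_function, []).append(callback)

    def call_fault_function(self, fault_function, context):
        # S13: when the fault function is called, the callbacks run first.
        for callback in self._probes.get(fault_function, []):
            callback(fault_function, context)
        # The real fault function would go on to halt the system here.


class Bus:
    """Toy stand-in for the bus that connects the two ends."""
    def __init__(self, receiver):
        self._receiver = receiver

    def send(self, fault_info):
        self._receiver(fault_info)


reported = []   # what the operation and maintenance end would receive


def non_fault_end_report(fault_info):
    # S14: the non-fault end forwards the fault information.
    reported.append(fault_info)


def make_callback(end_name, bus):
    def callback(fault_function, context):
        # Collect the fault information, then send it over the bus.
        fault_info = {"fault_end": end_name, "fault_function": fault_function}
        fault_info.update(context)
        bus.send(fault_info)
    return callback


bus = Bus(non_fault_end_report)
processor_kernel = Kernel()
processor_kernel.register_probe("mce_panic", make_callback("processor", bus))
processor_kernel.call_fault_function("mce_panic", {"cpu": 3, "process": "qemu-kvm"})
```

The point of the model is the ordering: the callback runs at the moment the fault function is entered, so the fault information leaves the fault end before the system stops.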
The embodiment of the present invention thus exploits the architecture of the server 10, in which the processor 11 and the computing node 12 are separated, and uses the bus channel through which they communicate as the channel for reporting faults: when the processor 11 fails, it notifies the computing node 12, which reports the fault of the processor 11 to the operation and maintenance end 20, and when the computing node 12 fails, the processor 11 reports the fault of the computing node 12 to the operation and maintenance end 20. To obtain the fault information to be reported, fault detection points are set for the fault functions in the operating system kernels of the processor 11 and the computing node 12, and callback functions capable of collecting fault information are registered at the fault detection points. When an operating system calls a fault function, the callback function is executed and sends the fault information over the bus to the non-fault end of the server 10, which actively reports it to the operation and maintenance end 20, so that the operation and maintenance end 20 can quickly perceive that the server 10 has failed.
With the existing approach, the operation and maintenance end 20 takes 30-60 seconds on average to complete fault detection of the server 10 by means of network packets. By contrast, the method provided by the embodiment of the invention can report fault information within 1-3 seconds, which greatly shortens fault detection time, ensures that the operation and maintenance end 20 obtains first-hand fault information, and can prevent false alarms. After receiving the fault information, the operation and maintenance end 20 promptly and accurately identifies the abnormal condition of the server 10 and migrates the customers' virtual machines away, thereby reducing the time during which virtual machines are unavailable because of the fault of the server 10. In the current passive detection mode, the operation and maintenance end 20 can only initiate cold migration once a fault is detected. In the embodiment of the invention, the fault information is reported to the operation and maintenance end 20 together with the fault, so the operation and maintenance end 20 can judge from the fault information whether to perform hot migration; this makes hot migration possible, and cold migration can still be performed quickly when hot migration cannot. In addition, active reporting reduces the resource consumption that passive detection by the operation and maintenance end 20 imposes on the server 10.
The callback function in the embodiment of the invention is further provided with function logic for sending a message notification interrupt to the non-fault end after the fault information has been sent to the non-fault end. Accordingly, after the fault information is sent to the non-fault end by using the callback function in step S13, the method provided by the embodiment of the invention further includes: sending a message notification interrupt to the non-fault end by using the callback function, so that the non-fault end receives the fault information in response to the message notification interrupt. In this way, once the fault end fails, the non-fault end is promptly notified to receive the fault information and forwards it to the operation and maintenance end 20 without delay, which further shortens fault detection time and makes fault detection and reporting timely, accurate, and efficient. False reports and delayed reports are avoided, so the operation and maintenance end 20 obtains the fault information in time, promptly and accurately identifies the abnormality of the server 10 from it, and migrates the customers' virtual machines away, reducing the time during which virtual machines are unavailable because of the fault of the server 10.
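The send-then-notify ordering can be sketched as follows. This is a simplified Python model under stated assumptions: the `mailbox` attribute stands in for whatever transfer area the bus provides, and `on_message_interrupt` stands in for the interrupt handler of the non-fault end; neither name comes from the application.

```python
class NonFaultEnd:
    def __init__(self):
        self.mailbox = None   # filled by the fault end over the bus (assumed)
        self.received = []    # fault information picked up so far

    def on_message_interrupt(self):
        # Handler for the message notification interrupt: pick up the
        # fault information that the fault end has already deposited.
        if self.mailbox is not None:
            self.received.append(self.mailbox)
            self.mailbox = None


def fault_callback(peer, fault_info):
    peer.mailbox = fault_info      # 1. send the fault information first
    peer.on_message_interrupt()    # 2. then raise the notification interrupt


peer = NonFaultEnd()
fault_callback(peer, {"fault_function": "panic", "cpu": 1})
```

Because the interrupt is raised only after the data is in place, the handler never observes a half-delivered record, and the non-fault end does not need to poll for new fault information.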
To further improve the timeliness of reporting fault information through the non-fault end, before the fault information is sent to the non-fault end by using the callback function in step S13, the method provided by the embodiment of the invention further includes: receiving address information of a physical memory sent by the non-fault end, where the physical memory is memory located at the non-fault end and used for holding the fault information. Then, in step S13, the step of sending the fault information to the non-fault end by the callback function includes: the callback function writes the fault information into the physical memory by direct memory access based on the address information, so that the non-fault end reports the fault information in the physical memory to the operation and maintenance end 20. Direct memory access (DMA) allows hardware devices of different speeds to communicate without placing a heavy interrupt load on the processor 11, and can copy data from one address space to another. By writing the fault information into the physical memory of the non-fault end through direct memory access, the callback function lets the fault end deliver the fault information to the non-fault end promptly and efficiently, and the non-fault end can then read the fault information stored in the physical memory at the given address and report it to the operation and maintenance end 20 in time.
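The address handshake and the write into the peer's memory can be modelled in Python with a `bytearray` standing in for the physical memory of the non-fault end. Real DMA is performed by hardware; this sketch only shows the data flow. The record layout (`RECORD`), the field names, the offset, and the example register value are hypothetical.

```python
import struct

# Hypothetical fixed record layout: CPU number, process name, one register value.
RECORD = struct.Struct("<I16sQ")

# Memory owned by the non-fault end, and the address information it sends
# to the other end in advance.
peer_memory = bytearray(4096)
address_info = {"offset": 256, "length": RECORD.size}


def dma_write(memory, address_info, cpu, process, status_reg):
    """Fault end: write the fault record at the address received beforehand."""
    if RECORD.size > address_info["length"]:
        raise ValueError("fault record does not fit in the reserved memory")
    name = process.encode("utf-8")[:16]
    RECORD.pack_into(memory, address_info["offset"], cpu, name, status_reg)


def read_record(memory, address_info):
    """Non-fault end: read the fault record back from its own memory."""
    cpu, name, status_reg = RECORD.unpack_from(memory, address_info["offset"])
    return cpu, name.rstrip(b"\x00").decode("utf-8"), status_reg


dma_write(peer_memory, address_info, 3, "qemu-kvm", 0xB200000000000115)
record = read_record(peer_memory, address_info)
```

Exchanging the address before any fault occurs means the fault end needs no cooperation from the non-fault end at failure time; it only has to write to a location it already knows.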
As shown in FIG. 2, in the embodiment of the present invention, the non-fault end creates a fault reporting thread, which is a thread for reporting fault information, and in step S14, the step of reporting the fault information through the non-fault end includes:
Step S21: the received fault information is recorded in a pre-established linked list, and the fault reporting thread is woken up. A linked list is a storage structure that is non-contiguous and non-sequential in physical storage; the logical order of its data elements is given by the order of the pointer links in the list. A linked list typically consists of a series of nodes, each containing a data field of arbitrary content and one or two links that point to the previous and/or next node. Because the embodiment of the invention maintains the fault information in a linked list, the non-fault end can traverse it efficiently and in order, which avoids delays in processing the fault information.
Step S22: the fault information in the linked list is encapsulated into a network message by the fault reporting thread and reported. That is, the non-fault end encapsulates the fault information into a network message through the fault reporting thread and reports it to the operation and maintenance end 20, so that the operation and maintenance end 20 can promptly apply the corresponding operation and maintenance strategy.
According to the embodiment of the invention, the non-fault end reports the fault information in the linked list through the fault reporting thread, so that untimely processing of the fault information can be avoided, and the fault information is ensured to be effectively reported to the operation and maintenance end 20 in time.
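Steps S21 and S22 can be sketched with a singly linked list and a worker thread woken through a condition variable. This is an illustrative Python model, not the application's implementation: the class names, the JSON encoding of the network message, and the `send` callable (which here just appends to a list instead of using a socket) are assumptions.

```python
import json
import threading


class Node:
    def __init__(self, info):
        self.info = info
        self.next = None   # link to the next node


class FaultList:
    """Minimal singly linked list that keeps fault records in arrival order."""
    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, info):
        node = Node(info)
        if self.tail is None:
            self.head = self.tail = node
        else:
            self.tail.next = node
            self.tail = node

    def pop_all(self):
        items, node = [], self.head
        while node is not None:
            items.append(node.info)
            node = node.next
        self.head = self.tail = None
        return items


class FaultReporter:
    def __init__(self, send):
        self._list = FaultList()
        self._cond = threading.Condition()
        self._stop = False
        self._send = send
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def record(self, info):
        # S21: record the fault information and wake the reporting thread.
        with self._cond:
            self._list.append(info)
            self._cond.notify()

    def close(self):
        with self._cond:
            self._stop = True
            self._cond.notify()
        self._thread.join()

    def _run(self):
        # S22: drain the list and wrap each record in a network message.
        while True:
            with self._cond:
                while self._list.head is None and not self._stop:
                    self._cond.wait()
                batch = self._list.pop_all()
                stopping = self._stop
            for info in batch:
                self._send(json.dumps(info, sort_keys=True).encode("utf-8"))
            if stopping:
                return


sent = []   # stands in for the network path to the operation and maintenance end
reporter = FaultReporter(sent.append)
reporter.record({"cpu": 3, "fault_function": "mce_panic"})
reporter.record({"cpu": 5, "fault_function": "panic"})
reporter.close()
```

Keeping the interrupt-side work to a list append plus a wake-up leaves the slower network send to the thread, which is one plausible reason for splitting the two steps.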
If the fault end is the processor 11, in the embodiment of the present invention, when the callback function is used to send the message notification interrupt to the non-fault end, a virtual machine operation stop instruction is added to the message notification interrupt, so that the computing node 12, in response to the instruction, sets the running virtual machines to a stopped state. If the computing node 12 is the fault end, the virtual machines deployed on it cannot continue to run once it has failed and gone down, so they stop without the callback function having to send a message notification interrupt carrying a virtual machine operation stop instruction to the non-fault end. When the fault end is the processor 11, the callback function adds the virtual machine operation stop instruction to the message notification interrupt sent to the computing node 12; the computing node 12 receives the fault information in response to the message notification interrupt and, at the same time, immediately stops the running virtual machines in response to the virtual machine operation stop instruction. Thus, once the computing node 12 has been notified of the fault of the processor 11, the virtual machines are stopped in time, contamination of the virtual machines' internal state by the fault of the processor 11 is avoided as far as possible, and a basis is provided for the operation and maintenance end 20 to hot-migrate the virtual machines.
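One way to picture the stop instruction carried in the message notification interrupt is as a flag in the interrupt payload. The Python sketch below assumes a bit-flag encoding; the flag names and values, and the `ComputingNode` class, are hypothetical and are not taken from the application.

```python
FLAG_FAULT_INFO_READY = 0x1   # hypothetical bit: fault information is available
FLAG_STOP_VMS = 0x2           # hypothetical bit: virtual machine operation stop instruction


def build_interrupt(fault_end):
    """Payload of the message notification interrupt sent by the fault end."""
    flags = FLAG_FAULT_INFO_READY
    if fault_end == "processor":
        # Only a processor fault needs to tell the computing node to stop its
        # virtual machines; a failed computing node cannot run them anyway.
        flags |= FLAG_STOP_VMS
    return flags


class ComputingNode:
    def __init__(self, vm_names):
        self.vm_state = {name: "running" for name in vm_names}
        self.fault_info_pending = False

    def handle_interrupt(self, flags):
        if flags & FLAG_STOP_VMS:
            for name in self.vm_state:
                self.vm_state[name] = "stopped"
        if flags & FLAG_FAULT_INFO_READY:
            self.fault_info_pending = True


node = ComputingNode(["vm-a", "vm-b"])
node.handle_interrupt(build_interrupt("processor"))
```

Carrying the stop instruction in the same interrupt as the notification avoids a second round trip over the bus while the processor is already failing.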
As can be seen from the foregoing, the embodiment of the present invention exploits the separated architecture of the server 10 and uses the bus channel through which the processor 11 and the computing node 12 communicate as the channel for reporting faults. Fault detection points are set for the fault functions in the operating system kernels of the processor 11 and the computing node 12 of the server 10; the callback function registered at a fault detection point sends the fault information of the fault end over the bus to the non-fault end, which then reports it to the operation and maintenance end 20, so that faults are actively detected and reported. Compared with the passive detection mode in the prior art, the method provided by the embodiment of the invention lets the operation and maintenance end 20 quickly perceive that the server 10 has suffered a fatal fault, provides first-hand fault information, avoids false alarms, and, through active reporting, reduces the resource consumption that passive detection by the operation and maintenance end 20 imposes on the server 10. If the processor 11 fails, it sends the fault information through the callback function and at the same time notifies the computing node 12 to stop the virtual machines immediately, so that the virtual machines are protected from the fault of the processor 11 as far as possible, which lays a good foundation for the subsequent hot migration of the virtual machines by the operation and maintenance end.
The second embodiment of the present invention further provides a server fault reporting apparatus 101, as shown in FIG. 3, which is configured to implement the server fault reporting method provided in the first embodiment of the present invention. The apparatus includes a setting module 1011, a registration module 1012, an execution module 1013, and a reporting module 1014. The setting module 1011 is configured to set a fault detection point for a fault function in a kernel of an operating system, where the fault function is a function called by the operating system when a fault end of the server 10 fails, and the fault end is the processor 11 or the computing node 12 that has failed. The registration module 1012 is configured to register a callback function at the fault detection point, where the callback function is provided with function logic for collecting fault information of the fault end and sending the fault information to a non-fault end, and the non-fault end is the processor 11 or the computing node 12 that has not failed. The execution module 1013 is configured to execute the callback function when the operating system calls the fault function, so as to send the fault information to the non-fault end by using the callback function. The reporting module 1014 is configured to report the fault information through the non-fault end.
The device provided by the embodiment of the invention utilizes the architecture characteristic that the processor 11 and the computing node 12 of the server 10 are separated, a bus channel for communication between the processor 11 and the computing node 12 can be used as a channel for reporting faults, the processor 11 faults can inform the computing node 12 to report the faults of the processor 11 to the operation and maintenance end 20, and when the computing node 12 faults, the processor 11 reports the faults of the computing node 12 to the operation and maintenance end 20. For the reported fault information, fault detection points are set in the fault functions in the processor 11 and the operating system kernel of the computing node 12, callback functions capable of collecting the fault information are registered in the fault detection points, and when the operating system calls the fault functions, the callback functions are executed, so that the fault information can be sent to a non-fault end of the server 10 through a bus by the callback functions, the fault information is actively reported to the operation and maintenance end 20 through the non-fault end, and the operation and maintenance end 20 can quickly sense that the server 10 has faults.
In the prior art, the operation and maintenance end 20 detects a fault of the server 10 by probing it with network packets, which takes 30-60 seconds on average. By contrast, the device provided by the embodiment of the invention can report the fault information within 1-3 seconds, which greatly reduces the fault detection time, gives the operation and maintenance end 20 first-hand fault information, and avoids false alarms. After receiving the fault information, the operation and maintenance end 20 can promptly and accurately determine that the server 10 is abnormal and migrate the client virtual machines away, which shortens the time for which the virtual machines are unavailable because of the fault of the server 10. In the current passive detection mode, the operation and maintenance end 20 can only initiate cold migration once it detects a fault. In the embodiment of the invention, the fault information is reported together with the fault, so the operation and maintenance end 20 can judge from the fault information whether hot migration is needed; this makes hot migration possible, and cold migration can still be performed quickly on the virtual machines when hot migration is not possible. In addition, active reporting reduces the resource consumption that passive detection by the operation and maintenance end 20 imposes on the server 10.
The callback function in the embodiment of the invention also contains function logic for sending a message notification interrupt to the non-fault end after the fault information has been sent to it. After the execution module 1013 sends the fault information to the non-fault end by using the callback function, it also uses the callback function to send a message notification interrupt to the non-fault end, so that the non-fault end receives the fault information in response to the interrupt. In this way, once the fault end fails, the non-fault end is promptly notified to receive the fault information and can forward it to the operation and maintenance end 20 without delay, which further reduces the fault detection time and makes fault detection and reporting timely, accurate, and efficient. False reports and delayed reports are avoided, so the operation and maintenance end 20 obtains the fault information in time, promptly and accurately determines from it that the server 10 is abnormal, and migrates the client virtual machines away, which shortens the time for which the virtual machines are unavailable because of the fault of the server 10.
To further improve the timeliness of reporting fault information through the non-fault end, the device provided by the embodiment of the invention further includes a receiving module 1015. The receiving module 1015 is configured to receive address information of a physical memory sent by the non-fault end, where the physical memory is located at the non-fault end and is used to hold the fault information. Accordingly, the step in which the execution module 1013 sends the fault information to the non-fault end by using the callback function includes: the callback function writes the fault information into the physical memory by direct memory access, based on the address information, so that the non-fault end reports the fault information in the physical memory to the operation and maintenance end 20. Direct memory access (DMA) copies data from one address space to another and allows hardware devices of different speeds to communicate without placing a heavy interrupt load on the processor 11. Because the callback function writes the fault information into the physical memory of the non-fault end by direct memory access, the fault end can send the fault information to the non-fault end promptly and efficiently, and the non-fault end can read the fault information stored at the physical memory location given by the address information and report it to the operation and maintenance end 20 in time.
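The write-then-notify sequence described above can be illustrated with a user-space simulation. It is a sketch, not the patented implementation: a `bytearray` stands in for the physical memory of the non-fault end, a plain function call stands in for the message notification interrupt, and the offset, buffer size, and the idea of passing the record length with the interrupt are all assumptions made for illustration.

```python
# Simulated physical memory of the non-fault end.
non_fault_memory = bytearray(8192)

# Address information announced by the non-fault end in advance (hypothetical values).
MAILBOX_OFFSET = 0x1000
MAILBOX_SIZE = 4096

# Fault information picked up by the non-fault end.
received = []


def dma_write(offset: int, payload: bytes) -> None:
    """Stand-in for a DMA transfer: copy the payload into the peer's memory."""
    if len(payload) > MAILBOX_SIZE:
        raise ValueError("fault information does not fit in the designated memory")
    non_fault_memory[offset:offset + len(payload)] = payload


def on_notification_interrupt(offset: int, length: int) -> None:
    """Non-fault end: read the fault information when the interrupt arrives."""
    received.append(bytes(non_fault_memory[offset:offset + length]))


def fault_callback(fault_info: bytes) -> None:
    """Fault end: write the fault information first, then send the notification."""
    dma_write(MAILBOX_OFFSET, fault_info)
    on_notification_interrupt(MAILBOX_OFFSET, len(fault_info))


fault_callback(b"panic: example fault")
print(received)
```

Sending the notification only after the write has completed follows the order given in the text, so the non-fault end reads the memory only once the fault information is in place.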
In the embodiment of the present invention, the non-fault end creates a fault reporting thread, which is a thread for reporting fault information. As shown in fig. 2, when the reporting module 1014 reports the fault information through the non-fault end, the non-fault end performs the following steps:
Step S21: record the received fault information in a pre-established linked list, and wake up the fault reporting thread. A linked list is a storage structure that is non-contiguous and non-sequential in physical storage; the logical order of its data elements is realized by the order of the pointer links in the list. A linked list typically consists of a series of nodes, each of which contains a data field and one or two links pointing to the previous and/or next node. Because the embodiment of the invention maintains the received fault information in a linked list, the non-fault end can traverse the fault information efficiently and in order, which avoids delays in processing it.
Step S22: encapsulate the fault information in the linked list into network messages through the fault reporting thread, and report them.
Because the non-fault end reports the fault information in the linked list through the fault reporting thread, the device provided by the embodiment of the invention avoids delays in processing the fault information and ensures that it is reported to the operation and maintenance end 20 promptly and effectively.
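Steps S21 and S22 can be sketched as a producer that records fault information and a reporting thread that is woken to drain it. The sketch makes several assumptions: `collections.deque` stands in for the linked list, a `threading.Event` stands in for the wake-up mechanism, the JSON message layout is invented for illustration, and a Python list stands in for the network connection to the operation and maintenance end.

```python
import json
import threading
from collections import deque

fault_list = deque()          # stand-in for the pre-established linked list
list_lock = threading.Lock()  # protects fault_list
wake_up = threading.Event()   # wakes the fault reporting thread
stop = threading.Event()      # asks the thread to exit after a final drain
sent_messages = []            # stand-in for the network to the O&M end


def encapsulate(fault_info: dict) -> bytes:
    """Wrap one fault record in a hypothetical JSON network message."""
    return json.dumps({"type": "fault_report", "fault": fault_info}).encode()


def fault_reporting_thread() -> None:
    """Step S22: traverse the list and report each record as a message."""
    while True:
        wake_up.wait()
        wake_up.clear()
        # Read the stop flag before draining so no late record is missed.
        stopping = stop.is_set()
        while True:
            with list_lock:
                if not fault_list:
                    break
                record = fault_list.popleft()
            sent_messages.append(encapsulate(record))
        if stopping:
            return


def record_fault(fault_info: dict) -> None:
    """Step S21: record the fault information, then wake the reporting thread."""
    with list_lock:
        fault_list.append(fault_info)
    wake_up.set()


reporter = threading.Thread(target=fault_reporting_thread)
reporter.start()
record_fault({"source": "processor", "reason": "mce_panic"})
record_fault({"source": "processor", "reason": "panic"})
stop.set()
wake_up.set()
reporter.join()
print(len(sent_messages))
```

Keeping the interrupt path limited to "append and wake" and doing the slower network work in a separate thread is one way to read the motivation given in the text for using a dedicated fault reporting thread.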
If the fault end is the processor 11, then when the execution module 1013 in the embodiment of the invention uses the callback function to send the message notification interrupt to the non-fault end, it adds a virtual machine operation stop instruction to the message notification interrupt, so that the computing node 12, in response to that instruction, sets the running virtual machines to a stopped state. If the fault end is the computing node 12, the virtual machines deployed on the computing node 12 cannot continue to run once the computing node 12 fails and goes down; in that case the virtual machines stop on their own, and the callback function does not need to add a virtual machine operation stop instruction to the message notification interrupt that it sends to the non-fault end. When the fault end is the processor 11, the computing node 12 responds to the message notification interrupt by receiving the fault information and, at the same time, responds to the virtual machine operation stop instruction by immediately stopping the running virtual machines. The virtual machines are therefore stopped promptly after the computing node 12 is notified of the fault of the processor 11, which prevents the fault of the processor 11 from contaminating the internal state of the virtual machines as far as possible and provides a basis for the operation and maintenance end 20 to perform hot migration of the virtual machines.
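The asymmetry between the two fault cases can be shown in a few lines. The following is a hedged sketch: the `NotificationInterrupt` payload, its `stop_vms` flag, and the `ComputingNode` class are hypothetical names, and the text does not say how the stop instruction is actually encoded in the interrupt.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NotificationInterrupt:
    """Hypothetical payload of the message notification interrupt."""
    fault_end: str          # "processor" or "computing_node"
    stop_vms: bool = False  # virtual machine operation stop instruction


@dataclass
class ComputingNode:
    """Non-fault end in the processor-failure case; tracks VM name -> state."""
    vms: Dict[str, str] = field(default_factory=dict)

    def handle_interrupt(self, irq: NotificationInterrupt) -> None:
        """Stop every running virtual machine if the interrupt asks for it."""
        if irq.stop_vms:
            for name in self.vms:
                self.vms[name] = "stopped"


def build_interrupt(fault_end: str) -> NotificationInterrupt:
    """Add the stop instruction only when the processor is the fault end."""
    return NotificationInterrupt(fault_end, stop_vms=(fault_end == "processor"))


node = ComputingNode(vms={"vm-a": "running", "vm-b": "running"})
node.handle_interrupt(build_interrupt("processor"))
print(node.vms)
```

When the computing node itself is the fault end, `build_interrupt("computing_node")` carries no stop instruction, which mirrors the observation that its virtual machines stop anyway once the node is down.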
In summary, the device provided by the embodiment of the invention exploits the separated architecture of the server 10 and uses the bus channel for communication between the processor 11 and the computing node 12 as the channel for fault reporting. Fault detection points are set for the fault functions in the operating system kernels of the processor 11 and the computing node 12 of the server 10; the callback function registered at a fault detection point sends the fault information of the fault end over the bus to the non-fault end, which reports it to the operation and maintenance end 20. Active fault detection and reporting is thereby achieved. Compared with the passive detection mode of the prior art, the device lets the operation and maintenance end 20 quickly learn that the server 10 has suffered a fatal fault and obtain first-hand fault information without false alarms, and active reporting reduces the resource consumption that passive detection by the operation and maintenance end 20 imposes on the server 10. If the processor 11 fails, then while sending the fault information through the callback function it also promptly notifies the computing node 12 to stop the virtual machines immediately, which protects the virtual machines from the fault of the processor 11 as far as possible and lays a good foundation for the subsequent hot migration of the virtual machines by the operation and maintenance end 20.
The third embodiment of the present invention further provides a server 10. As shown in fig. 4, the server 10 includes the server fault reporting device 101 provided in the second embodiment of the present invention; for the structure of the device 101, refer to the second embodiment, which is not repeated here.
The fourth embodiment of the present invention further provides an operation and maintenance system. As shown in fig. 5, the operation and maintenance system includes a server 10 and an operation and maintenance end 20. For details of the server 10, refer to the third embodiment of the present invention, which is not repeated here. The server 10 and the operation and maintenance end 20 communicate over a network, and the operation and maintenance end 20 is configured to receive the fault information reported by the server 10.
The operation and maintenance system provided by the embodiment of the invention exploits the separated architecture of the server 10 and uses the bus channel for communication between the processor 11 and the computing node 12 as the channel for fault reporting. Fault detection points are set for the fault functions in the operating system kernels of the processor 11 and the computing node 12 of the server 10; the callback function registered at a fault detection point sends the fault information of the fault end over the bus to the non-fault end, which reports it to the operation and maintenance end 20. Active fault detection and reporting is thereby achieved. Compared with the passive detection mode of the prior art, the system lets the operation and maintenance end 20 quickly learn that the server 10 has suffered a fatal fault and obtain first-hand fault information without false alarms, and active reporting reduces the resource consumption that passive detection by the operation and maintenance end 20 imposes on the server 10. If the processor 11 fails, then while sending the fault information through the callback function it also promptly notifies the computing node 12 to stop the virtual machines immediately, which protects the virtual machines from the fault of the processor 11 as far as possible and lays a good foundation for the subsequent hot migration of the virtual machines by the operation and maintenance end 20.
The fifth embodiment of the present invention further provides a computer device, which includes a memory, a processor 11, and a computer program stored in the memory and executable on the processor 11. When executing the computer program, the processor 11 implements the server fault reporting method described above; for the method, refer to the first embodiment of the present invention, which is not repeated here.
The sixth embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by the processor 11, implements the server fault reporting method described above; for the method, refer to the first embodiment of the present invention, which is not repeated here.
The seventh embodiment of the present invention further provides a computer program product including a computer program which, when executed by the processor 11, implements the server fault reporting method described above; for the method, refer to the first embodiment of the present invention, which is not repeated here.
The eighth embodiment of the present invention, with reference to the above seven embodiments and to fig. 5 and fig. 6, describes an application of the server fault reporting method.
The embodiment of the present invention uses the separated architecture of the server 10 to open up a channel for fault detection and active fault reporting: the bus channel for communication between the processor 11 and the computing node 12 (typically a PCIe bus, which is the bus through which the processor 11 connects to external devices) is used as the channel for fault reporting. The processor 11 and the computing node 12 each run their own operating system, for example Linux. The kprobe mechanism of the Linux kernel is used to place probes (that is, to set fault detection points) on the fault functions involved in fault detection of the server 10, and a callback function defined to collect the fault information of the fault end is registered at each fault detection point, so that the cause of the fault of the server 10 is captured. Examples of fault functions are the mce_panic function and the panic function. When an uncorrectable hardware error occurs in the server 10, the machine check architecture of the processor 11 detects the error and raises an interrupt; on receiving the interrupt, the processor 11 calls the system interrupt handler, and execution finally reaches the mce_panic function. When a software error occurs in the system kernel, the panic function is called. After capturing the fault of the server 10, the callback function sends the collected fault information to the non-fault end by direct memory access over the PCIe bus channel. That is, when the processor 11 fails, its callback function notifies the non-faulty computing node 12, which reports to the operation and maintenance end 20; when the computing node 12 fails, its callback function notifies the non-faulty processor 11, which reports to the operation and maintenance end 20.
After parsing the fault information, the non-fault end reports it to the operation and maintenance end 20 over the network. On receiving the fault information, the operation and maintenance end 20 can initiate the migration flow immediately, which greatly shortens the time for which the virtual machines are affected by the fault.
If the processor 11 fails, the specific fault reporting flow is as follows:
step1. Register kprobe probes on the fault functions. Specifically, fault detection points are set for fault functions such as panic, mce_panic, and __ghes_panic in the Linux operating system kernels of the processor 11 and the computing node 12, and the defined callback function is then registered at each fault detection point. Once the kprobe probes are registered, the registered callback function is triggered whenever the operating system calls a fault function, so the callback function can promptly and accurately determine what fault the server 10 has suffered and collect the key fault information at the fault detection point.
step2. When a fault occurs, the operating system reaches a fault function on which a kprobe probe is registered. When the kernel executes the fault function, the registered callback function runs and collects the fault information of the server 10. The fault information includes the contents of the machine check architecture registers that record the cause of the detected hardware fault, and/or the fault occurrence time and the number of the processor 11 recorded by the machine check architecture, the name of the process that caused the fault, and the like. The callback function then writes the collected fault information, in a specific format, by DMA over the PCIe bus channel into the physical memory designated by the computing node 12; the computing node 12 informs the processor 11 of the address of this physical memory in advance. After the fault information has been sent, the callback function sends a message notification interrupt (for example an MSI-X interrupt) to notify the computing node 12 to receive the fault information. Notifying the computing node 12 by interrupt ensures that it processes the fault information promptly, which in turn ensures that the fault is reported in a timely and accurate manner.
step3. After receiving the MSI-X interrupt for the fault report from the processor 11, the computing node 12 reads the fault information from the designated physical memory and verifies its validity. In general, the computing node 12 determines whether the fault information is valid by checking its format, as follows:
1. Check whether the information signature is the string XDEST (in full, xdragon error source table).
2. Check whether the information length field exceeds 4096; if it does, the information is invalid.
3. Check the information write protection bit: unwritable indicates that the information is invalid, and writable indicates that it is valid.
4. Check whether the fault number field in the information is 0; if it is, the information is invalid.
5. Sum the whole information in units of one byte and check whether the sum is 0; if it is, the check succeeds, meaning the information is complete and valid; otherwise the check fails.
After confirming that the fault information is valid, the computing node 12 immediately stops the virtual machines running on it, to ensure that they are not affected by the fault at the processor 11 end. It then parses the fault information into the linked list and wakes up the fault reporting thread to report the fault information.
step4. After the fault reporting thread of the computing node 12 is woken up, it traverses the linked list in which the fault information is recorded and encapsulates the entries one by one into network messages, which are reported to the operation and maintenance end 20. The operation and maintenance end 20 thus learns which faults have occurred in the processor 11 and their specific causes, and then applies its operation and maintenance policy to migrate the virtual machines on the computing node 12. The virtual machines recover quickly, the time for which they are affected is greatly reduced, and their stability is improved.
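The five format checks that the computing node 12 applies in step3 can be sketched as follows. The text names the fields but not their layout, so the byte layout below (5-byte signature, 1-byte flags, 2-byte little-endian length, 1-byte fault number, 1-byte checksum, then the payload) is purely an assumption, as are the bit position of the write protection flag and the reading of the fault number field as a count of fault entries. The checksum is taken to mean that all bytes of the record sum to 0 modulo 256.

```python
import struct

# Hypothetical header: signature, flags, total length, fault number, checksum.
HEADER = struct.Struct("<5sBHBB")
SIGNATURE = b"XDEST"
MAX_LENGTH = 4096
WRITE_PROTECTED = 0x01  # hypothetical bit: set means "unwritable"


def pack_record(payload: bytes, fault_count: int, flags: int = 0) -> bytes:
    """Fault end: build a record whose bytes sum to 0 modulo 256."""
    length = HEADER.size + len(payload)
    raw = HEADER.pack(SIGNATURE, flags, length, fault_count, 0) + payload
    checksum = (-sum(raw)) % 256
    return HEADER.pack(SIGNATURE, flags, length, fault_count, checksum) + payload


def is_valid(record: bytes) -> bool:
    """Non-fault end: apply the five format checks in order."""
    if len(record) < HEADER.size:
        return False
    signature, flags, length, fault_count, _checksum = HEADER.unpack_from(record)
    if signature != SIGNATURE:               # check 1: signature
        return False
    if length > MAX_LENGTH:                  # check 2: length field
        return False
    if length < HEADER.size or length > len(record):
        return False                         # extra bounds check (an addition)
    if flags & WRITE_PROTECTED:              # check 3: write protection bit
        return False
    if fault_count == 0:                     # check 4: fault number field
        return False
    return sum(record[:length]) % 256 == 0   # check 5: byte-wise sum


record = pack_record(b"mce_panic bank=4", fault_count=1)
print(is_valid(record))
```

A byte-wise sum that must equal zero is a cheap integrity check, well suited to a callback that runs while the system is already failing; it detects accidental corruption but is not a defense against deliberate tampering.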
When the computing node 12 end fails, the flow for reporting fault information is the same as for the processor 11 end, with one difference: once the computing node 12 fails and goes down, the virtual machines deployed on it cannot continue to run and therefore stop on their own, so the message notification interrupt that the callback function sends to the processor 11 end does not need to carry a virtual machine operation stop instruction.
In summary, the method provided by the embodiment of the invention exploits the separated architecture of the server 10 and uses the bus channel for communication between the processor 11 and the computing node 12 as the channel for fault reporting. Fault detection points are set for the fault functions in the operating system kernels of the processor 11 and the computing node 12 of the server 10; the callback function registered at a fault detection point sends the fault information of the fault end over the bus to the non-fault end, which reports it to the operation and maintenance end 20. Active fault detection and reporting is thereby achieved. Compared with the passive detection mode of the prior art, the method lets the operation and maintenance end 20 quickly learn that the server 10 has suffered a fatal fault and obtain first-hand fault information without false alarms, and active reporting reduces the resource consumption that passive detection by the operation and maintenance end 20 imposes on the server 10. If the processor 11 fails, then while sending the fault information through the callback function it also promptly notifies the computing node 12 to stop the virtual machines immediately, which protects the virtual machines from the fault of the processor 11 as far as possible and lays a good foundation for the subsequent hot migration of the virtual machines by the operation and maintenance end 20.
Based on the fault information it obtains, the operation and maintenance end 20 can judge whether to apply a hot migration policy to the virtual machines of the server. If it determines from the fault information that hot migration is not needed, it can quickly apply cold migration to the virtual machines on the server to ensure their stability.
Compared with the prior art, the method provided by the embodiment of the invention has the following advantages:
1. By exploiting the separated architecture of the server 10, the PCIe bus between the processor 11 and the computing node 12 is used as the fault reporting channel: when the system of the processor 11 fails, it can notify the computing node 12 to report to the operation and maintenance end 20, and when the computing node 12 fails, it can notify the processor 11 to report to the operation and maintenance end 20. After the non-fault end is notified of the fault of the fault end, the virtual machines can be stopped promptly, which prevents the fault from contaminating the internal state of the virtual machines as far as possible and provides a basis for hot migration of the virtual machines.
2. The kprobe mechanism of the Linux kernel is used to insert probes into the fault functions. As soon as the processor 11 or the computing node 12 fails, the callback function registered with the kprobe probe is called and notifies the non-fault end over the PCIe bus channel, so that the fault end's own fault information is actively reported to the operation and maintenance end 20 as the fault end goes down. The operation and maintenance end 20 initiates virtual machine migration immediately after receiving the fault information, which greatly shortens the time for which the virtual machines are affected by the fault of the server 10.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (11)

1. A server fault reporting method, wherein the server is a server in which a processor and a computing node are separated, and the processor and the computing node each have their own operating system, the method comprising:
setting a fault detection point for a fault function in a kernel of the operating system, wherein the fault function is a function called by the operating system when a fault end of the server breaks down, and the fault end is the processor or the computing node with the fault;
registering a callback function at the fault detection point, wherein the callback function is provided with function logic used for collecting fault information of the fault end and sending the fault information to a non-fault end, and the non-fault end is the processor or the computing node which does not generate faults;
executing the callback function when the operating system calls the fault function, so as to send the fault information to the non-fault end by utilizing the callback function;
and reporting the fault information through the non-fault end.
2. The method of claim 1, wherein the callback function is further provided with function logic for sending a message notification interrupt to the non-fault end after sending the fault information to the non-fault end, and wherein after sending the fault information to the non-fault end using the callback function, the method further comprises:
and sending the message notification interrupt to the non-fault end by using the callback function so that the non-fault end receives the fault information in response to the message notification interrupt.
3. The method of claim 1 or 2, wherein before sending the fault information to the non-fault end using the callback function, the method further comprises:
receiving address information of a physical memory sent by the non-fault end, wherein the physical memory is a memory positioned at the non-fault end and used for placing the fault information;
the step of sending the fault information to the non-fault end by the callback function includes:
and the callback function writes the fault information into the physical memory in a direct memory access mode based on the address information.
4. The method of claim 1, wherein the non-fault end creates a fault reporting thread, the fault reporting thread being a thread for reporting the fault information, and the step of reporting the fault information through the non-fault end includes:
recording the received fault information in a pre-established linked list, and waking up the fault reporting thread;
and encapsulating the fault information in the linked list into a network message through the fault reporting thread to report.
5. The method of claim 2, wherein the computing node has a virtual machine running thereon, and wherein if the fault end is the processor, the method further comprises, when sending the message notification interrupt to the non-fault end using the callback function:
and adding a virtual machine operation stop instruction in the message notification interrupt so that the computing node responds to the virtual machine operation stop instruction to set the running virtual machine to a stop operation state.
6. A server fault reporting device, wherein the device is configured to implement the server fault reporting method according to any one of claims 1 to 5.
7. A server, characterized in that it comprises the server fault reporting device of claim 6.
8. An operation and maintenance system, comprising:
the server of claim 7;
and the operation and maintenance end is used for receiving the fault information reported by the server.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method steps of any of claims 1 to 5 when the computer program is executed.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the method steps of any of claims 1 to 5.
11. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, implements the method steps of any of claims 1 to 5.
CN202310351628.5A 2023-03-29 2023-03-29 Server fault reporting method and related equipment Pending CN116431373A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310351628.5A CN116431373A (en) 2023-03-29 2023-03-29 Server fault reporting method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310351628.5A CN116431373A (en) 2023-03-29 2023-03-29 Server fault reporting method and related equipment

Publications (1)

Publication Number Publication Date
CN116431373A true CN116431373A (en) 2023-07-14

Family

ID=87080888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310351628.5A Pending CN116431373A (en) 2023-03-29 2023-03-29 Server fault reporting method and related equipment

Country Status (1)

Country Link
CN (1) CN116431373A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination