CN114968626A - Method, device, equipment and storage medium for determining server fault - Google Patents

Method, device, equipment and storage medium for determining server fault

Info

Publication number
CN114968626A
Authority
CN
China
Prior art keywords
information
standardized
fault
abnormal
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110218972.8A
Other languages
Chinese (zh)
Inventor
曾令新
李靖
姜凯
吴晓迪
林哲伟
邱帆
傅欢
严勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110218972.8A
Publication of CN114968626A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer And Data Communications (AREA)

Abstract

The application discloses a method, apparatus, device, and storage medium for determining a server fault, and belongs to the field of computer technologies. The method includes: acquiring abnormal operation information of a proxy server corresponding to a management server, where the abnormal operation information is generated by the proxy server when an abnormal operation condition is met and is sent to the management server, and the abnormal operation information describes the operation state of the proxy server; standardizing the abnormal operation information to obtain standardized abnormal information, where the standardization processing extracts, from the abnormal operation information, information reflecting the component state of the proxy server; and determining component fault information of the proxy server according to the standardized abnormal information, where the component fault information describes a component fault of the proxy server. Determining the component fault information of the server requires no manual analysis, which improves the efficiency of determining server faults.

Description

Method, device, equipment and storage medium for determining server fault
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining a server failure.
Background
A server is an important device in Internet Technology (IT) and may malfunction during use, so it is necessary to determine the cause of the malfunction and perform maintenance.
Currently, the administrator of a machine room generally uses a device for managing servers to send a log acquisition instruction, via the Intelligent Platform Management Interface (IPMI) of the servers in the machine room, to a server that has a problem (e.g., it restarts or reports errors), so as to obtain the System Event Log (SEL) of that server. The system event log is then analyzed manually to find the cause of the server fault, after which the server can be maintained (for example, the faulty component is repaired) and restored to normal operation.
When server faults are determined in this way, the SEL must be acquired and analyzed manually. When the machine room contains a large number of servers, determining server faults becomes inefficient.
Disclosure of Invention
The application provides a method, an apparatus, a device, and a storage medium for determining a server fault, which can improve the efficiency of determining server faults. The technical solutions are as follows:
According to an aspect of the present application, there is provided a method for determining a server fault, the method including:
acquiring abnormal operation information of a proxy server corresponding to a management server, wherein the abnormal operation information is generated by the proxy server when an abnormal operation condition is met and is sent to the management server, and the abnormal operation information is used for describing the operation state of the proxy server;
standardizing the abnormal operation information to obtain standardized abnormal information, wherein the standardization processing is used for extracting information reflecting the component state of the proxy server from the abnormal operation information;
and determining component fault information of the proxy server according to the standardized abnormal information, wherein the component fault information is used for describing a component fault of the proxy server.
According to another aspect of the present application, there is provided an apparatus for determining a server failure, the apparatus including:
an obtaining module, configured to obtain abnormal operation information of a proxy server corresponding to the management server, where the abnormal operation information is generated by the proxy server and sent to the management server when an abnormal operation condition is met, and the abnormal operation information is used to describe an operation state of the proxy server;
the processing module is used for carrying out standardization processing on the operation abnormal information to obtain standardized abnormal information, and the standardization processing is used for extracting information reflecting the component state of the proxy server from the operation abnormal information;
and the determining module is used for determining component fault information of the proxy server according to the standardized abnormal information, wherein the component fault information is used for describing component faults of the proxy server.
In an alternative design, the determining module is configured to:
matching the standardized abnormal information against a fault order establishing strategy to obtain the component fault information, wherein the fault order establishing strategy is determined according to a naming rule of the information reflecting component faults in the abnormal operation information.
In an optional design, the standardized abnormal information includes model information, sensor information and operation description information; the fault order establishing strategy includes a plurality of fault order establishing conditions, each of which includes standard model information, fault sensor information and fault operation description information; the determining module is configured to:
in response to the standardized abnormal information matching the i-th fault order establishing condition in the fault order establishing strategy, determine the standardized abnormal information as the component fault information.
In an optional design, at least two models correspond to different standardized processing strategies; the processing module is configured to:
parse the abnormal operation information to obtain the model of the proxy server;
and standardize the abnormal operation information through the standardized processing strategy corresponding to the model of the proxy server to obtain the standardized abnormal information.
In an alternative design, the standardized processing strategy includes a standardized field and the position of the value of the standardized field in the abnormal operation information; the processing module is configured to:
parse the abnormal operation information through the standardized processing strategy corresponding to the model of the proxy server to obtain the value of the standardized field;
and determine the standardized field and the value of the standardized field as the standardized abnormal information.
In an alternative design, the apparatus further comprises:
an establishing module, configured to establish a fault processing work order in response to the standardized abnormal information matching the fault order establishing strategy, wherein the fault processing work order is used for notifying a server administrator to handle the component fault information.
In an optional design, an exception information processing process runs in the management server; the device further comprises:
and the issuing module is used for issuing configuration information to the proxy server through the abnormal information processing process, wherein the configuration information comprises at least one of the abnormal operation condition and the communication address of the management server.
In an alternative design, the apparatus further comprises:
a storage module, configured to store the unmatched standardized abnormal information in response to the standardized abnormal information not matching the fault order establishing strategy.
In an alternative design, the storage module is configured to:
in response to the standardized abnormal information not matching the fault order establishing strategy, merge identical items among the unmatched standardized abnormal information to obtain merged standardized abnormal information;
and store the merged standardized abnormal information.
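The merging of identical unmatched items described above can be sketched as follows. This is only an illustrative sketch: the representation of a standardized item as a (model, sensor, description) tuple and the function name merge_unmatched are assumptions made for the example, not details given in the application.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical representation: (model, sensor, operation description).
Info = Tuple[str, str, str]


def merge_unmatched(unmatched: List[Info]) -> Dict[Info, int]:
    """Collapse identical unmatched items into one entry with an occurrence count."""
    return dict(Counter(unmatched))


unmatched = [
    ("ModelA", "Fan1", "Speed below threshold"),
    ("ModelA", "Fan1", "Speed below threshold"),
    ("ModelB", "PSU2", "Input lost"),
]
merged = merge_unmatched(unmatched)
```

Storing one entry per distinct item, together with a count, keeps the stored data small while still showing how often an unmatched item recurs.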
In an alternative design, the management server and the proxy server are deployed in a distributed fashion.
According to another aspect of the present application, there is provided a computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement the method of determining a server failure as described above.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions that is loaded and executed by a processor to implement the method of determining a server failure as described above.
According to another aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method for determining server failure provided in the various alternative implementations of the above aspect.
The beneficial effects of the technical solutions provided in this application include at least the following:
The acquired abnormal operation information is standardized, and the component fault information of the proxy server is determined based on the standardized abnormal information. No manual analysis is needed in this process, which improves the efficiency of determining server faults.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic illustration of a process for determining server failure provided by an exemplary embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for determining server failure according to an exemplary embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for determining server failure according to another exemplary embodiment of the present application;
FIG. 4 is a schematic illustration of a standardized exception information display interface provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a fault billing policy configuration interface provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic illustration of a component failure information display interface provided by an exemplary embodiment of the present application;
FIG. 7 is a schematic illustration of a fault handling work order display interface provided by an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of an SNMP trap usage procedure provided by an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of a server architecture provided by an exemplary embodiment of the present application;
FIG. 10 is an architecture diagram of an out-of-band hypervisor provided in an exemplary embodiment of the present application;
FIG. 11 is a schematic structural diagram of a server failure determination apparatus according to an exemplary embodiment of the present application;
FIG. 12 is a schematic structural diagram of a server failure determination apparatus according to another exemplary embodiment of the present application;
FIG. 13 is a schematic structural diagram of a server failure determination apparatus according to still another exemplary embodiment of the present application;
FIG. 14 is a schematic structural diagram of a server failure determination apparatus according to still another exemplary embodiment of the present application;
FIG. 15 is a schematic structural diagram of a server according to an exemplary embodiment of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
In out-of-band management of servers, a server that supports the Simple Network Management Protocol (SNMP) generally also supports a trap function. That is, when the operation information of the server satisfies a trap condition set by the server manufacturer, the server can generate an SNMP trap (also referred to simply as a trap) that records abnormal operation information of the server. The method provided by the embodiments of the present application is described mainly using the example of determining a server fault based on SNMP traps. The method can also be based on other operation information that supports analysis of server faults; for example, for a server that supports the Redfish standard, the operation information of the server can be obtained through an Application Programming Interface (API) defined by Redfish. FIG. 1 is a schematic diagram of a process for determining a server failure according to an exemplary embodiment of the present application. As shown in FIG. 1, in step S1, the management server obtains the SNMP trap reported by the proxy server. The management server is used for managing the proxy server and is connected to the proxy server in a wired or wireless manner.
Because proxy servers may come from different manufacturers, and some manufacturers customize the SNMP trap through secondary development, the content formats of the SNMP traps reported by different proxy servers may differ. In step S2, the management server standardizes the SNMP trap, thereby extracting information reflecting the component state of the proxy server. Optionally, the management server has a corresponding standardized processing strategy for the proxy servers of each manufacturer. The standardized processing strategy includes standardized fields and the position of each standardized field's value in the SNMP trap. Using the strategy, the management server extracts the values corresponding to the standardized fields from the SNMP trap, thereby standardizing the SNMP trap.
In step S3, the management server analyzes the fault of the proxy server based on the standardized SNMP trap. The management server matches the standardized SNMP trap against a fault order establishing strategy to obtain the component fault information of the proxy server. The fault order establishing strategy is determined according to the naming rule of the information in the SNMP trap that reflects server component faults. For example, a standardized SNMP trap includes model information, sensor information, and operation description information. The fault order establishing strategy includes a plurality of fault order establishing conditions, each of which includes standard model information, fault sensor information, and fault operation description information. When, for the i-th fault order establishing condition in the strategy, the model information of the standardized SNMP trap matches the standard model information, the sensor information matches the fault sensor information, and the operation description information matches the fault operation description information, the management server determines the standardized SNMP trap as the component fault information.
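The three-field matching in step S3 can be sketched as follows. This is a minimal sketch under stated assumptions: the class names, the function match_condition, and all sample values are hypothetical, and exact string equality stands in for whatever matching rule an implementation actually uses.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StandardizedTrap:
    model: str        # model information
    sensor: str       # sensor information (the component involved)
    description: str  # operation description information


@dataclass(frozen=True)
class FaultOrderCondition:
    standard_model: str
    fault_sensor: str
    fault_description: str


def match_condition(trap: StandardizedTrap,
                    conditions: Sequence[FaultOrderCondition]) -> Optional[int]:
    """Return the index i of the first condition whose three fields all match, else None."""
    for i, cond in enumerate(conditions):
        if (trap.model == cond.standard_model
                and trap.sensor == cond.fault_sensor
                and trap.description == cond.fault_description):
            return i
    return None


# Hypothetical fault order establishing strategy with two conditions.
conditions = [
    FaultOrderCondition("ModelA", "Fan1", "Fan failed"),
    FaultOrderCondition("ModelB", "DIMM3", "Uncorrectable ECC error"),
]
trap = StandardizedTrap("ModelB", "DIMM3", "Uncorrectable ECC error")
matched = match_condition(trap, conditions)
```

A non-None result means the standardized trap would be treated as component fault information; None means no condition in the strategy applies.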
In step S4, when the management server determines the component fault information of the proxy server, it creates a fault processing work order, which is used for notifying the server administrator to handle the component fault information. In this way, the component fault of the proxy server is determined and the server administrator is reminded to carry out fault maintenance. Optionally, after the component fault information is determined, it can also be passed to a spare part management program (which manages the warehousing and release of components) and a fault repair program (which dispatches personnel to handle faults), so that server faults are handled fully automatically.
When the fault of the server is determined in the above mode, manual analysis is not needed. The management server is only required to carry out standardization processing on the acquired SNMP trap and determine the fault information of the component based on the fault order establishing strategy, so that the efficiency of determining the fault of the server is improved. Moreover, because the SNMP trap is subjected to standardization processing, different server models of different manufacturers can be supported, and the application range for determining the server fault is widened.
FIG. 2 is a flowchart illustrating a method for determining a server failure according to an exemplary embodiment of the present application. The method may be used for a management server. As shown in FIG. 2, the method includes:
Step 201: Acquire the abnormal operation information of the proxy server corresponding to the management server.
The management server is used for managing the proxy server, and the proxy server is any server managed by the management server. The management server is connected with the proxy server in a wired or wireless mode. Optionally, the management server manages the proxy server by means of out-of-band management. The management server is a server, or a server cluster composed of a plurality of servers, or a virtual server in a cloud computing service center.
The abnormal operation information is generated by the proxy server when the abnormal operation condition is met and is sent to the management server; it is used for describing the operation state of the proxy server. Optionally, the management server and the proxy server support SNMP, and the abnormal operation information is an SNMP trap. The abnormal operation condition refers to a trap condition, which is set by the server manufacturer and can also be set for the server by a server administrator. A trap condition is a preset event in the running process of the server that can indicate a performance problem of the server, such as a network interface going down. When operation information generated while the proxy server is running triggers the trap condition, the Baseboard Management Controller (BMC) of the proxy server generates an SNMP trap and reports it to the management server.
Step 202: Standardize the abnormal operation information to obtain standardized abnormal information.
Because proxy servers come from different manufacturers, the formats of the abnormal operation information of proxy servers of different models may differ. The abnormal operation information therefore needs to be standardized before the fault of the proxy server is determined. The standardization processing extracts, from the abnormal operation information, information reflecting the component state of the proxy server; this information is predefined by the server manufacturer. The standardization processing can also extract the generation time of the abnormal operation information, the model of the proxy server, the manufacturer serial number, and the like. When the abnormal operation information is an SNMP trap, the standardized abnormal information includes model information, sensor information, and operation description information. The sensor information describes a component of the proxy server; such components include a Central Processing Unit (CPU), a Read-Only Memory (ROM), a Random Access Memory (RAM), a network card, a cooling fan, components based on the Peripheral Component Interconnect Express (PCIe) standard, and the like. The operation description information describes the operation state of the component corresponding to the sensor information.
Optionally, the management server has a corresponding standardized processing strategy for the proxy servers of each manufacturer. The standardized processing strategy includes standardized fields and the position of each standardized field's value in the SNMP trap. Using the strategy, the management server extracts the values corresponding to the standardized fields from the SNMP trap, thereby standardizing the SNMP trap.
Step 203: Determine component fault information of the proxy server according to the standardized abnormal information.
The component fault information is used to describe a component fault of the proxy server. Optionally, the management server matches the standardized abnormal information against a fault order establishing strategy, and when the standardized abnormal information matches the strategy, determines the standardized abnormal information as the component fault information. Optionally, the fault order establishing strategy is determined according to the naming rule of the information in the standardized abnormal information that reflects server component faults. The naming rule is determined by the server manufacturer. The fault order establishing strategy includes fault fields and the values of the fault fields, which are determined by the server administrator. The value of a fault field is determined with reference to the naming rules that different manufacturers use, in their abnormal operation information, for information reflecting server component faults. When the value of a field in the standardized abnormal information is the same as the value of the corresponding fault field, or expresses the same meaning (e.g., as determined by fuzzy matching), the management server determines that the standardized abnormal information matches the fault order establishing strategy.
Illustratively, the abnormal operation information is an SNMP trap. When the values in the standardized abnormal information are the same as the values of the standard model, fault sensor and fault operation description fields in the fault order establishing strategy, the management server determines the standardized SNMP trap as the component fault information of the proxy server.
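The text allows a field value to match when it is either identical to the fault field value or expresses the same meaning. One possible way to approximate this is a similarity threshold, sketched below with Python's standard difflib module. The threshold of 0.8, the field names, and the function names are assumptions made for illustration; the application does not specify a particular fuzzy matching algorithm.

```python
from difflib import SequenceMatcher
from typing import Dict


def values_match(value: str, fault_value: str, threshold: float = 0.8) -> bool:
    """Treat two values as matching if equal (ignoring case and outer spaces) or highly similar."""
    a = value.strip().lower()
    b = fault_value.strip().lower()
    if a == b:
        return True
    # Similarity ratio in [0, 1]; the 0.8 cut-off is an arbitrary illustrative choice.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def matches_strategy(info: Dict[str, str], fault_fields: Dict[str, str]) -> bool:
    """Every fault field must be present in the standardized info and match its value."""
    return all(
        field in info and values_match(info[field], expected)
        for field, expected in fault_fields.items()
    )


# Hypothetical standardized abnormal information and fault fields.
info = {"model": "ModelA", "sensor": "Fan1", "description": "Fan failed."}
fault_fields = {"model": "ModelA", "sensor": "fan1", "description": "fan failed"}
is_fault = matches_strategy(info, fault_fields)
```

A purely character-based similarity is only a rough stand-in for "same meaning"; a deployment could substitute keyword rules or vendor-specific synonym tables without changing the surrounding logic.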
In summary, the method provided in this embodiment normalizes the acquired abnormal operation information, and determines the component failure information of the proxy server based on the normalized abnormal operation information. In the process, manual analysis is not needed, and the efficiency of determining the server fault is improved. Moreover, due to the fact that the operation abnormal information is subjected to standardized processing, different server models of different manufacturers can be supported, and the application range of determining the server fault is widened.
FIG. 3 is a flowchart illustrating a method for determining a server failure according to another exemplary embodiment of the present application. The method may be used for a management server. As shown in FIG. 3, the method includes:
Step 301: Issue configuration information to the proxy server through the abnormal information processing process.
An abnormal information processing process runs in the management server. The management server is used for managing the proxy server. The configuration information includes at least one of the abnormal operation condition and the communication address of the management server. When operation information generated while the proxy server is running meets the abnormal operation condition, the proxy server generates abnormal operation information and reports it to the management server at the configured communication address. When the abnormal operation information is an SNMP trap, the abnormal operation condition refers to a trap condition.
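The role of the configuration information can be sketched as follows. This sketch only models the decision made on the proxy server side and builds the report as a plain dictionary; it does not send anything over the network. The names ProxyConfig and build_report, the event names, and the address 192.0.2.10 (a documentation-only address) are assumptions for illustration. Port 162 is the standard UDP port for receiving SNMP traps.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class ProxyConfig:
    abnormal_conditions: FrozenSet[str]  # event names that count as abnormal operation
    manager_address: Tuple[str, int]     # (host, port) of the management server


def build_report(event_name: str, detail: str, config: ProxyConfig) -> Optional[dict]:
    """Return a report addressed to the management server, or None if the event is not abnormal."""
    if event_name not in config.abnormal_conditions:
        return None
    return {
        "destination": config.manager_address,
        "event": event_name,
        "detail": detail,
    }


# Hypothetical configuration issued by the management server.
config = ProxyConfig(
    abnormal_conditions=frozenset({"fan_failed", "link_down"}),
    manager_address=("192.0.2.10", 162),
)
report = build_report("link_down", "Network interface eth0 is down", config)
```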
Step 302: Acquire the abnormal operation information of the proxy server corresponding to the management server.
The abnormal operation information is generated by the proxy server under the condition of meeting the abnormal operation condition and is sent to the management server, and the abnormal operation information is used for describing the operation state of the proxy server and specifically comprises the operation information of each component of the proxy server. Optionally, the management server and the proxy server support SNMP, and the operation exception information refers to SNMP trap.
Step 303: Standardize the abnormal operation information to obtain standardized abnormal information.
Servers of different models from different manufacturers follow different rules for generating abnormal operation information. The standardization processing extracts information reflecting the component state of the proxy server from the abnormal operation information and standardizes it. Optionally, at least two models correspond to different standardized processing strategies; that is, for each proxy server managed by the management server, the management server has a standardized processing strategy corresponding to the model of that proxy server. The standardized processing strategy is determined by a server administrator according to the rule, set by the server manufacturer, by which the server generates abnormal operation information. The management server parses the abnormal operation information to obtain the model of the proxy server, and then standardizes the abnormal operation information through the standardized processing strategy corresponding to that model, thereby obtaining the standardized abnormal information.
Moreover, when the abnormal operation information is an SNMP trap, servers of different models from the same manufacturer may use the same BMC model, in which case the rules for generating SNMP traps may be the same. The management server can therefore perform "convergence" on the SNMP traps reported by proxy servers according to the manufacturer and BMC model of each proxy server (servers with the same manufacturer and BMC model share one standardized processing strategy), which avoids duplicate standardized processing strategies. For example, model 1 and model 2 produced by a server manufacturer use the same BMC, while model 3 uses a different BMC. The management server determines that model 1 and model 2 correspond to the same standardized processing strategy, and that model 3 corresponds to a different one.
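The "convergence" idea can be sketched as a two-step lookup in which strategies are keyed by (manufacturer, BMC model) rather than by server model. The table contents, the names MODEL_TO_BMC, STRATEGIES and strategy_for, and the strategy labels are all hypothetical; the sketch only mirrors the model 1 / model 2 / model 3 example in the text.

```python
from typing import Dict, Tuple

# Hypothetical mapping from server model to (manufacturer, BMC model).
MODEL_TO_BMC: Dict[str, Tuple[str, str]] = {
    "model1": ("VendorX", "bmc-a"),
    "model2": ("VendorX", "bmc-a"),
    "model3": ("VendorX", "bmc-b"),
}

# One standardized processing strategy per (manufacturer, BMC model) pair.
STRATEGIES: Dict[Tuple[str, str], str] = {
    ("VendorX", "bmc-a"): "strategy-a",
    ("VendorX", "bmc-b"): "strategy-b",
}


def strategy_for(model: str) -> str:
    """Resolve a server model to its shared standardized processing strategy."""
    return STRATEGIES[MODEL_TO_BMC[model]]


shared = strategy_for("model1") == strategy_for("model2")
```

With this layout, adding a new server model that reuses an existing BMC only requires a new entry in the first table, not a new strategy.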
Optionally, the standardized processing policy includes a standardized field and the location of the value of the standardized field in the abnormal operation information. The management server analyzes the abnormal operation information through the standardized processing strategy corresponding to the model of the proxy server to obtain the value of the standardized field, and determines the standardized field and the value of the standardized field as the standardized abnormal information.
Illustratively, the standardized fields included in the standardized processing policy include "Trapname", "Description", "Sensor", "HP_Status", "HP_FailureType", and "HP_ErrCode". Trapname is the name of the trap condition that triggers generation of the SNMP trap, and is obtained, based on the SNMP trap, from the Management Information Base (MIB) that stores the trap conditions. Description is a description of what caused the SNMP trap to be generated, and Sensor is the component that caused the SNMP trap to be generated. HP_Status, HP_FailureType and HP_ErrCode are information, defined by the server vendor in the SNMP trap, that can reflect whether a component of the device has failed.
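As a minimal sketch of a standardized processing policy, the following assumes that the SNMP trap has already been decoded into an ordered list of variable-binding values and that the "location" of each standardized field is simply an index into that list. The representation, the POLICY table, and the sample values are assumptions for illustration; the application does not specify how locations are encoded.

```python
from typing import Dict, List, Optional

# Hypothetical policy: standardized field -> index of its value
# in the decoded trap's list of variable-binding values.
POLICY: Dict[str, int] = {
    "Trapname": 0,
    "Description": 1,
    "Sensor": 2,
}


def standardize(varbinds: List[str],
                policy: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Extract each standardized field from its configured position.

    A field whose position is missing from the trap is set to None
    instead of raising, so one short trap does not abort processing.
    """
    return {
        field: (varbinds[pos] if pos < len(varbinds) else None)
        for field, pos in policy.items()
    }
```

The returned dictionary of standardized fields and values plays the role of the standardized abnormal information.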
After determining the standardized abnormal information, the management server can also provide an interface for displaying it. FIG. 4 is a schematic diagram of a standardized abnormal information display interface provided in an exemplary embodiment of the present application. As shown in FIG. 4, a screening control 402 is displayed in the standardized abnormal information display interface 401, through which the standardized abnormal information 403 determined by the management server can be screened. The displayed standardized abnormal information 403 includes the time at which the standardized abnormal information 403 was determined, and the server serial number, device model (machine type), sensor information (abnormal component) and description of the sensor abnormality of the screened proxy server. The management server can also extract more information from the abnormal operation information through the standardized processing strategy. It should be noted that the above is only an example and is not a limitation on the result of processing the abnormal operation information.
Step 304: determining component fault information of the proxy server according to the standardized abnormal information.
The component failure information is used to describe a component failure of the proxy server. When the management server determines from the standardized abnormal information that a component of the proxy server has failed, the management server determines the component failure information of the proxy server based on the standardized abnormal information.
Optionally, the management server matches the standardized abnormal information with a fault order establishing strategy (also referred to herein as a fault billing policy) to obtain the component fault information. The fault order establishing strategy is determined according to the naming rule of the information reflecting a component fault in the abnormal operation information. For example, for servers of different manufacturers and different models, the manufacturer-defined rules for generating SNMP traps differ in how the information reflecting a server fault is generated. The fault order establishing strategy is determined by a server administrator according to the naming rules of the information reflecting component faults in the SNMP traps generated by servers of different manufacturers and different models.
Optionally, the standardized abnormal information includes model information, sensor information, and operation description information. The fault order establishing strategy includes a plurality of fault order establishing conditions, each of which includes standard model information, fault sensor information, and fault operation description information. In response to the standardized abnormal information matching the i-th fault order establishing condition in the fault order establishing strategy, the standardized abnormal information is determined as the component fault information. The standardized abnormal information matching the i-th fault order establishing condition means that the model information matches the standard model information of the i-th condition, the sensor information matches the fault sensor information of the i-th condition, and the operation description information matches the fault operation description information of the i-th condition. The model information matching the standard model information means that the model information is the same as the standard model information, or that the server identified by the model information has the same BMC as the server corresponding to the standard model information. The sensor information matching the fault sensor information means that the two are the same or that the components they describe are the same (fuzzy matching). For example, the sensor information "FAN 4" (the serial number indicates a specific fan) matches the fault sensor information "FAN". The operation description information matching the fault operation description information means that the two are the same.
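The three-part matching rule can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the application: the BMC_OF lookup table, the class and function names, and the way fuzzy sensor matching is done (stripping a trailing index so that "FAN 4" reduces to "FAN") are all hypothetical choices.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StandardizedInfo:
    model: str
    sensor: str
    description: str


@dataclass(frozen=True)
class Condition:
    standard_model: str
    fault_sensor: str
    fault_description: str


# Hypothetical lookup: server model -> BMC model.
BMC_OF: Dict[str, str] = {"model-1": "bmc-x", "model-2": "bmc-x", "model-3": "bmc-y"}


def component_of(sensor: str) -> str:
    """Drop a trailing index, e.g. 'FAN 4' -> 'FAN' (assumed fuzzy rule)."""
    return re.sub(r"[\s_\-]*\d+$", "", sensor).strip().upper()


def model_matches(model: str, standard_model: str) -> bool:
    """Same model, or both models are known to use the same BMC."""
    if model == standard_model:
        return True
    bmc = BMC_OF.get(model)
    return bmc is not None and bmc == BMC_OF.get(standard_model)


def sensor_matches(sensor: str, fault_sensor: str) -> bool:
    """Identical sensor names, or names that describe the same component."""
    return sensor == fault_sensor or component_of(sensor) == component_of(fault_sensor)


def match(info: StandardizedInfo, conditions: List[Condition]) -> Optional[int]:
    """Return the index i of the first matching condition, or None."""
    for i, cond in enumerate(conditions):
        if (model_matches(info.model, cond.standard_model)
                and sensor_matches(info.sensor, cond.fault_sensor)
                and info.description == cond.fault_description):
            return i
    return None
```

A return value of None corresponds to standardized abnormal information that does not match any fault order establishing condition.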
In addition, the management server can also provide a fault order-establishing strategy configuration interface, so that the fault order-establishing strategy can be configured in a user-defined mode, and further, the component faults which can be identified by the management server can be expanded. Fig. 5 is a schematic diagram of a fault billing policy configuration interface provided in an exemplary embodiment of the present application. As shown in fig. 5, a modification control 502 is displayed in the fault billing policy configuration interface 501, and can trigger addition, deletion, and modification of the fault billing policy 503. The displayed fault billing policy 503 includes the device type (determined according to the manufacturer and model of the server), the sensor information, the description of the sensor abnormality, and whether to report the component fault information after the fault billing policy is successfully matched. For the management server, the standardized exception information can be matched through a fault billing strategy comprising more fields. For example, for servers of different manufacturers, the SNMP trap has specific information to indicate component failure, and the fault billing policy also includes the specific information. It should be noted that the above contents are only used as examples, and are not used as a limitation to the contents of the fault billing policy.
The management server can also provide an interface for displaying the component fault information after determining the standardized abnormal information. FIG. 6 is a schematic diagram of a component failure information display interface provided by an exemplary embodiment of the present application. As shown in FIG. 6, a screening control 602 is displayed on the component failure information display interface 601, through which the component failure information 603 determined by the management server can be screened. The displayed component failure information 603 includes the time at which the component failure information 603 was determined, and the server serial number, device model (machine type), device type, type of the failed component, and alarm type (the content used when raising an alarm for the component failure information) of the proxy server corresponding to the component failure information. The management server can also determine, for example, how the component fault information is to be handled through the fault order establishing strategy and the standardized abnormal information. It should be noted that the above is only an example and is not a limitation on the result of determining the component failure information.
Step 305: in response to the standardized abnormal information matching the fault order establishing strategy, establishing a fault processing work order.
The fault handling work order is used to notify a server administrator to handle the component fault information. The fault handling work order includes a maintenance task description, a setup time, a fault handling completion time, a total elapsed time (determined from the setup time and the fault handling completion time), a handler, a current handling status (whether completed), a handling specification (filled in by the handler), and the like. Optionally, after the server establishes the fault handling work order, the component fault information can be transmitted to a spare part management program (for managing components entering and leaving the warehouse) and a fault repair program (for scheduling personnel to handle faults), so that server faults are handled fully automatically. The process in an actual scenario includes: notifying the on-site server administrator according to the fault handling work order, matching spare parts according to the fault handling work order, replacing the faulty component on site, returning the faulty component to the warehouse, and performing acceptance checks on the repaired proxy server.
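A fault handling work order with the fields listed above could be modeled as in the following sketch. The class and attribute names are hypothetical; the only behavior taken from the text is that the total elapsed time is derived from the setup time and the completion time, and that the status records whether handling is completed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FaultWorkOrder:
    task_description: str                    # maintenance task description
    created_at: datetime                     # setup time
    handler: str = ""                        # person handling the order
    completed_at: Optional[datetime] = None  # fault handling completion time
    handling_notes: str = ""                 # filled in by the handler

    @property
    def is_completed(self) -> bool:
        """Current handling status: True once a completion time is set."""
        return self.completed_at is not None

    @property
    def total_elapsed(self) -> Optional[timedelta]:
        """Completion time minus setup time, or None while still open."""
        if self.completed_at is None:
            return None
        return self.completed_at - self.created_at
```

Deriving the elapsed time instead of storing it keeps the three time-related fields from drifting out of sync.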
After the fault processing work order is established, the management server can also provide an interface for displaying the fault processing work order. FIG. 7 is a schematic diagram of a fault handling work order display interface provided by an exemplary embodiment of the present application. As shown in fig. 7, a fault handling work order 702 is displayed in the fault handling work order display interface 701, and includes a work order number, a server serial number, a fault type, a machine room in which a fault server is located, an equipment model, a fault description, and a time for establishing a fault handling work order. The management server can also generate the fault processing order based on the component fault information by different rules. The rule is determined by the management server. The above description is only for example, and is not intended to limit the generated trouble shooting order.
Step 306: in response to the standardized abnormal information not matching the fault order establishing strategy, storing the unmatched standardized abnormal information.
After the proxy server generates and reports the abnormal operation information (such as an SNMP trap), the proxy server does not store it. To allow further analysis of the abnormal operation information and avoid data loss, the management server can also store the standardized abnormal information that does not match the fault order establishing strategy. A server administrator then analyzes the stored standardized abnormal information, and when the standardized abnormal information reflects that a server has a component fault, the server administrator modifies the fault order establishing strategy based on the standardized abnormal information.
Optionally, in response to the standardized abnormal information not matching the fault order establishing strategy, identical items among the unmatched standardized abnormal information are merged to obtain merged standardized abnormal information, and the merged standardized abnormal information is stored. Because the SNMP traps generated by the BMCs of servers of different models from the same manufacturer may be the same, merging identical standardized abnormal information converges the standardized abnormal information that does not match the fault order establishing strategy and makes analysis by a server administrator easier.
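Merging identical unmatched records can be sketched as below. The record layout (a tuple of Trapname, Sensor, Description) and the added occurrence count are assumptions made for illustration; the application only states that identical standardized abnormal information is merged before storage.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (Trapname, Sensor, Description).
Record = Tuple[str, str, str]


def merge_unmatched(records: Iterable[Record]) -> List[Dict[str, object]]:
    """Collapse identical records into one entry each.

    The count is kept as an extra (assumed) field so an administrator can
    see how often each distinct record occurred. Counter preserves
    first-seen order on Python 3.7+.
    """
    counts = Counter(records)
    return [{"record": rec, "count": n} for rec, n in counts.items()]
```

Only the merged list would then be written to storage for later manual analysis.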
Fig. 8 is a schematic diagram of an SNMP trap usage procedure provided in an exemplary embodiment of the present application. As shown in fig. 8, in step a1, the management server acquires a standardized SNMP trap. In step a2, the management server matches the standardized SNMP trap to a fault billing policy. In step a21, the management server successfully matches the standardized SNMP trap with the fault billing policy to obtain component fault information. In step a22, the management server stores the standardized SNMP trap that failed to match the failure billing policy. The stored SNMP trap may then be analyzed by a server administrator and the fault billing policy modified based on the analysis results.
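The branching in steps a2, a21 and a22 can be illustrated with a short sketch. The function name route and the representation of a standardized SNMP trap as a dictionary are hypothetical; the matching rule itself is passed in as a predicate so the sketch makes no claim about how matching is implemented.

```python
from typing import Callable, Dict, List, Tuple

Trap = Dict[str, str]  # assumed shape of a standardized SNMP trap


def route(traps: List[Trap],
          matches_policy: Callable[[Trap], bool]) -> Tuple[List[Trap], List[Trap]]:
    """Split standardized traps into component faults and stored leftovers."""
    faults: List[Trap] = []     # step a21: matched, becomes component fault info
    unmatched: List[Trap] = []  # step a22: stored for later manual analysis
    for trap in traps:
        if matches_policy(trap):
            faults.append(trap)
        else:
            unmatched.append(trap)
    return faults, unmatched
```

After an administrator updates the fault order establishing strategy, the stored leftovers could be passed through the same function again with the new predicate.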
Optionally, the management server and the proxy server are deployed in a distributed mode. Fig. 9 is a schematic diagram of a server architecture provided by an exemplary embodiment of the present application. As shown in fig. 9, the management server is composed of a first management server 901, a second management server 902, a third management server 905, and a fourth management server 906. The first management server 901 is configured to obtain the SNMP traps of the first proxy server 903 in the machine room where the first management server is located and report them to the third management server 905, and the second management server 902 is configured to obtain the SNMP traps of the second proxy server 904 in the machine room where the second management server is located and report them to the third management server 905. The third management server 905 is configured to perform standardization processing on the obtained SNMP traps and transmit the processed SNMP traps to the fourth management server 906. The fourth management server 906 is configured to match the standardized SNMP traps against the fault order establishing strategy to determine the component faults of the first proxy server 903 and the second proxy server 904. The first management server 901 and the second management server 902 each run an abnormal information processing process, which is configured to issue configuration information to the first proxy server 903 and the second proxy server 904 and to send the SNMP traps to the third management server 905. The third management server 905 can also monitor, through a monitoring process, whether the abnormal information processing processes run by the first management server 901 and the second management server 902 are alive. The third management server 905 also stores an MIB library and the standardized processing policies.
By adopting the deployment mode, the SNMP trap reported by the proxy server of each machine room can be successfully acquired, and the standardized processing and the determining component failure are separately executed, so that the reduction of the calculation efficiency caused by large calculation amount is avoided. It should be noted that the deployment patterns of the management servers and the proxy servers and the number of the management servers and the proxy servers are only used as examples, and are not limited to the deployment patterns of the management servers and the proxy servers to which the method provided in the embodiment of the present application is applied.
In an actual application scenario, the method provided by the embodiment of the application can be used based on an out-of-band management program. That is, the management server implements the above method by installing the out-of-band management program. FIG. 10 is an architecture diagram of an out-of-band management program provided in an exemplary embodiment of the present application. As shown in FIG. 10, through the SNMP trap configuration and receiving function 1001 provided by the program, the SNMP traps of all proxy servers can be collected and configuration can be issued to the proxy servers. Through the SNMP trap management function 1002 provided by the program, the program aggregates and analyzes the received SNMP traps, thereby standardizing the SNMP traps. Through the SNMP trap utilizing function 1003 provided by the program, the program matches the standardized SNMP traps with the fault order establishing strategies in the fault order establishing strategy library, thereby obtaining usable SNMP traps (component fault information). The component fault information can be called by other process tools (the spare part management program and the fault repair program), thereby realizing a closed loop from discovery to handling of server faults.
In summary, the method provided in this embodiment normalizes the acquired abnormal operation information, and determines the component failure information of the proxy server based on the normalized abnormal operation information. In the process, manual analysis is not needed, and the efficiency of determining the server fault is improved.
The method provided by the embodiment further matches the standardized abnormal information with the fault order establishing strategy to obtain the component fault information. This provides a simple and convenient way of determining component fault information based on a matching mechanism.
The method provided by this embodiment further performs one-to-one matching between the standardized abnormal information and the fault billing condition in the fault billing strategy, so as to determine the component fault information. The accuracy of determining the component failure information is improved.
According to the method provided by the embodiment, the abnormal operation information is processed by adopting the corresponding standardized processing strategy for different machine types, so that the abnormal operation information of different machine types can be standardized, and the application range is widened.
The method provided by this embodiment further standardizes the abnormal operation information by means of the standardized field and the location of the value of the standardized field in the abnormal operation information. Standardized abnormal information in a uniform format can thus be obtained, which facilitates subsequent management.
According to the method provided by the embodiment, the fault processing work order is established, so that the component fault information can be transmitted, a server administrator can quickly know the component fault information, and the efficiency of processing the server fault is improved.
The method provided by the embodiment also issues configuration information to the proxy server, so that a proxy server that has been relocated or newly deployed can quickly be brought under the management of the management server.
According to the method provided by the embodiment, the unmatched standardized abnormal information is also stored, so that a server administrator can manually analyze the unmatched standardized abnormal information, the missing of the server fault is avoided, and a fault order establishing strategy can be perfected based on the analysis result.
The method provided by the embodiment also combines the same standardized abnormal information in the unmatched standardized abnormal information, reduces repeated information, and improves the analysis efficiency of the stored standardized abnormal information.
The method provided by the embodiment further adopts a distributed mode for deployment through the management server and the proxy server, so that the abnormal operation information reported by the proxy server can be successfully acquired, and the efficiency of standardized processing and fault list establishing strategy matching is improved.
It should be noted that, the order of the steps of the method provided in the embodiments of the present application may be appropriately adjusted, and the steps may also be increased or decreased according to the circumstances.
Fig. 11 is a schematic structural diagram of a server failure determination apparatus according to an exemplary embodiment of the present application. The apparatus may be for a management server. As shown in fig. 11, the apparatus 110 includes:
the obtaining module 1101 is configured to obtain operation exception information of a proxy server corresponding to a management server, where the operation exception information is generated by the proxy server and sent to the management server when an operation exception condition is met, and the operation exception information is used to describe an operation state of the proxy server.
The processing module 1102 is configured to perform normalization processing on the abnormal operation information to obtain normalized abnormal information, where the normalization processing is configured to extract information reflecting a component state of the proxy server from the abnormal operation information.
A determining module 1103, configured to determine, according to the standardized abnormal information, component failure information of the proxy server, where the component failure information is used to describe a component failure of the proxy server.
In an alternative design, the determining module 1103 is configured to:
and matching the standardized abnormal information with a fault order establishing strategy to obtain component fault information, wherein the fault order establishing strategy is determined according to a naming rule of information reflecting component faults in the operation abnormal information.
In an optional design, the standardized abnormal information includes model information, sensor information and operation description information, the fault billing strategy includes multiple fault billing conditions, and the fault billing conditions include standard model information, fault sensor information and fault operation description information. A determining module 1103 configured to:
and in response to the standardized abnormal information being matched with the ith fault list establishing condition in the fault list establishing strategy, determining the standardized abnormal information as the component fault information.
In an alternative design, there are at least two models corresponding to different standardized processing strategies. A processing module 1102 configured to:
and analyzing the abnormal operation information to obtain the model of the proxy server. And standardizing the abnormal operation information through a standardized processing strategy corresponding to the model of the proxy server to obtain standardized abnormal information.
In an alternative design, the standardized processing policy includes a standardized field and a location of a value of the standardized field in the running exception information. A processing module 1102 configured to:
and analyzing the abnormal operation information through a standardized processing strategy corresponding to the model of the proxy server to obtain the value of the standardized field. The normalized field and the value of the normalized field are determined as normalized anomaly information.
In an alternative design, as shown in fig. 12, the apparatus 110 further comprises:
and the establishing module 1104 is used for responding to the matching of the standardized abnormal information and the fault order establishing strategy, and establishing a fault processing work order, wherein the fault processing work order is used for informing a server administrator of processing the fault information of the component.
In an alternative design, an exception handling process runs in the management server. As shown in fig. 13, the apparatus 110 further includes:
the issuing module 1105 is configured to issue configuration information to the proxy server through the exception information processing process, where the configuration information includes at least one of an abnormal operation condition and a communication address of the management server.
In an alternative design, as shown in fig. 14, the apparatus 110 further comprises:
a storage module 1106, configured to store unmatched normalized exception information in response to the normalized exception information not matching the fault billing policy.
In an alternative design, storage module 1106 is configured to:
and in response to that the standardized abnormal information is not matched with the fault list establishing strategy, combining the same standardized abnormal information in the unmatched standardized abnormal information to obtain combined standardized abnormal information. The merged normalized exception information is stored.
In an alternative design, the management server and the proxy server are deployed in a distributed mode.
It should be noted that: the server failure determination apparatus provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the server failure determination apparatus and the server failure determination method provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.
Embodiments of the present application further provide a computer device including a processor and a memory, wherein at least one instruction, at least one program, a code set or an instruction set is stored in the memory, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to implement the server failure determination method provided by the method embodiments.
Optionally, the computer device is a server. Illustratively, fig. 15 is a schematic structural diagram of a server according to an exemplary embodiment of the present application.
The server 1500 includes a Central Processing Unit (CPU) 1501, a system Memory 1504 including a Random Access Memory (RAM) 1502 and a Read-Only Memory (ROM) 1503, and a system bus 1505 connecting the system Memory 1504 and the CPU 1501. The computer device 1500 also includes a basic Input/Output system (I/O system) 1506 for facilitating information transfer between various elements within the computer device, and a mass storage device 1507 for storing an operating system 1513, application programs 1514 and other program modules 1515.
The basic input/output system 1506 includes a display 1508 for displaying information and an input device 1509 such as a mouse, keyboard, etc. for a user to input information. Wherein the display 1508 and the input device 1509 are connected to the central processing unit 1501 via an input output controller 1510 connected to the system bus 1505. The basic input/output system 1506 may also include an input/output controller 1510 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller 1510 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1507 is connected to the central processing unit 1501 through a mass storage controller (not shown) connected to the system bus 1505. The mass storage device 1507 and its associated computer-readable storage media provide non-volatile storage for the server 1500. That is, the mass storage device 1507 may include a computer-readable storage medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
Without loss of generality, the computer-readable storage media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable storage instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory devices, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 1504 and mass storage device 1507 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1501, the one or more programs containing instructions for implementing the method embodiments described above, and the central processing unit 1501 executes the one or more programs to implement the methods provided by the respective method embodiments described above.
According to various embodiments of the present application, the server 1500 may also operate as a remote server connected through a network such as the internet. That is, the server 1500 may be connected to the network 1512 through a network interface unit 1511 coupled to the system bus 1505, or the network interface unit 1511 may be used to connect to other types of networks or remote server systems (not shown).
The memory also includes one or more programs, which are stored in the memory, and the one or more programs include instructions for performing the steps performed by the server in the methods provided by the embodiments of the present application.
The embodiment of the present application further provides a computer-readable storage medium, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the computer-readable storage medium, and when the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor of a computer device, the method for determining a server failure provided in the foregoing method embodiments is implemented.
The present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method for determining the server failure provided by the method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer readable storage medium, and the above readable storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only an example of the present application and should not be taken as limiting. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (13)

1. A method for determining a server fault, applied to a management server, the method comprising:
acquiring abnormal operation information of a proxy server corresponding to the management server, wherein the abnormal operation information is generated by the proxy server when an abnormal operation condition is met and is sent to the management server, and the abnormal operation information is used for describing the operating state of the proxy server;
standardizing the abnormal operation information to obtain standardized abnormal information, wherein the standardization processing is used for extracting information reflecting the component state of the proxy server from the abnormal operation information;
and determining component fault information of the proxy server according to the standardized abnormal information, wherein the component fault information is used for describing component faults of the proxy server.
2. The method of claim 1, wherein determining component fault information of the proxy server according to the standardized abnormal information comprises:
matching the standardized abnormal information with a fault order establishing strategy to obtain the component fault information, wherein the fault order establishing strategy is determined according to a naming rule of the information reflecting component faults in the abnormal operation information.
3. The method according to claim 2, wherein the standardized abnormal information comprises model information, sensor information and operation description information, the fault order establishing strategy comprises a plurality of fault order establishing conditions, and the fault order establishing conditions comprise standardized model information, fault sensor information and fault operation description information;
the step of matching the standardized abnormal information with a fault order establishing strategy to obtain the component fault information comprises the following steps:
and in response to the standardized abnormal information matching the i-th fault order establishing condition in the fault order establishing strategy, determining the standardized abnormal information as the component fault information.
4. The method according to any one of claims 1 to 3, wherein at least two models correspond to different standardized processing strategies;
the step of standardizing the abnormal operation information to obtain standardized abnormal information includes:
analyzing the abnormal operation information to obtain the model of the proxy server;
and standardizing the abnormal operation information through a standardized processing strategy corresponding to the model of the proxy server to obtain the standardized abnormal information.
5. The method of claim 4, wherein the standardized processing strategy includes a standardized field and the location of the value of the standardized field in the abnormal operation information;
the step of standardizing the abnormal operation information through a standardized processing strategy corresponding to the model of the proxy server to obtain the standardized abnormal information includes:
analyzing the abnormal operation information through the standardized processing strategy corresponding to the model of the proxy server to obtain the value of the standardized field;
and determining the standardized field and the value of the standardized field as the standardized abnormal information.
6. The method according to claim 2 or 3, wherein the method further comprises:
establishing a fault processing work order in response to the standardized abnormal information matching the fault order establishing strategy, wherein the fault processing work order is used for notifying a server administrator to process the component fault information.
7. The method according to any one of claims 1 to 3, wherein an abnormal information processing process runs in the management server; the method further comprises the following steps:
issuing configuration information to the proxy server through the abnormal information processing process, wherein the configuration information comprises at least one of the abnormal operation condition and the communication address of the management server.
8. The method of any of claims 1 to 3, further comprising:
in response to the standardized abnormal information not matching the fault order establishing strategy, storing the unmatched standardized abnormal information.
9. The method of claim 8, wherein storing the unmatched standardized abnormal information in response to the standardized abnormal information not matching the fault order establishing strategy comprises:
in response to the standardized abnormal information not matching the fault order establishing strategy, merging identical items among the unmatched standardized abnormal information to obtain merged standardized abnormal information;
and storing the merged standardized abnormal information.
10. The method of any of claims 1 to 3, wherein the management server and the proxy server are deployed in a distributed manner.
11. An apparatus for determining server failure, the apparatus comprising:
an obtaining module, configured to obtain abnormal operation information of a proxy server corresponding to the management server, where the abnormal operation information is generated by the proxy server and sent to the management server when an abnormal operation condition is met, and the abnormal operation information is used to describe an operation state of the proxy server;
a processing module, configured to standardize the abnormal operation information to obtain standardized abnormal information, where the standardization processing is used to extract information reflecting the component state of the proxy server from the abnormal operation information;
and a determining module, configured to determine component fault information of the proxy server according to the standardized abnormal information, where the component fault information is used to describe component faults of the proxy server.
12. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement a method of determining a server failure as claimed in any one of claims 1 to 10.
13. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method of determining server failure according to any one of claims 1 to 10.
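As an illustration only, the flow recited in claims 1 to 6, 8 and 9 might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation disclosed in the application: the comma-separated report format, the model names, the field positions, the fault conditions, and all function and variable names are hypothetical and chosen only to make the steps concrete.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical standardized processing strategies, one per server model.
# Each maps a standardized field to the position of its value in the raw
# comma-separated report (assumed format, not taken from the application).
STANDARDIZATION_STRATEGIES = {
    "model-a": {"model": 0, "sensor": 1, "description": 2},
    "model-b": {"model": 0, "description": 1, "sensor": 2},
}


@dataclass(frozen=True)
class FaultCondition:
    model: str
    sensor: str
    description: str


# Hypothetical fault order establishing strategy: a list of conditions.
FAULT_CONDITIONS = [
    FaultCondition("model-a", "fan1", "speed below threshold"),
    FaultCondition("model-b", "psu0", "power supply lost"),
]


def standardize(raw: str) -> dict:
    """Parse the model, then extract fields using that model's strategy."""
    parts = [p.strip() for p in raw.split(",")]
    model = parts[0]  # assumption: the model is always the first token
    strategy = STANDARDIZATION_STRATEGIES[model]
    return {field: parts[pos] for field, pos in strategy.items()}


def match(info: dict):
    """Return the index i of the first matching condition, or None."""
    candidate = FaultCondition(**info)
    for i, condition in enumerate(FAULT_CONDITIONS):
        if candidate == condition:
            return i
    return None


def process(reports):
    """Split reports into work orders and merged unmatched records."""
    work_orders = []       # matched records, one work order each
    unmatched = Counter()  # identical unmatched records merged with a count
    for raw in reports:
        info = standardize(raw)
        if match(info) is not None:
            work_orders.append(info)
        else:
            unmatched[tuple(sorted(info.items()))] += 1
    return work_orders, unmatched


if __name__ == "__main__":
    orders, leftover = process([
        "model-a, fan1, speed below threshold",
        "model-b, cpu temperature high, cpu0",
        "model-b, cpu temperature high, cpu0",
    ])
    print(orders)
    print(leftover)
```

In this sketch the first report matches a condition and yields a work order, while the two identical unmatched reports are merged into a single stored record with a count. A report from a model without a registered strategy would raise a `KeyError`; how such reports are handled is not specified here.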
CN202110218972.8A 2021-02-26 2021-02-26 Method, device, equipment and storage medium for determining server fault Pending CN114968626A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110218972.8A CN114968626A (en) 2021-02-26 2021-02-26 Method, device, equipment and storage medium for determining server fault

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110218972.8A CN114968626A (en) 2021-02-26 2021-02-26 Method, device, equipment and storage medium for determining server fault

Publications (1)

Publication Number Publication Date
CN114968626A true CN114968626A (en) 2022-08-30

Family

ID=82973976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110218972.8A Pending CN114968626A (en) 2021-02-26 2021-02-26 Method, device, equipment and storage medium for determining server fault

Country Status (1)

Country Link
CN (1) CN114968626A (en)

Similar Documents

Publication Publication Date Title
CN108600029B (en) Configuration file updating method and device, terminal equipment and storage medium
US9658914B2 (en) Troubleshooting system using device snapshots
CN107632918B (en) Monitoring system and method for computing storage equipment
US8041996B2 (en) Method and apparatus for time-based event correlation
JP5736881B2 (en) Log collection system, apparatus, method and program
US10489232B1 (en) Data center diagnostic information
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
US20140122931A1 (en) Performing diagnostic tests in a data center
CN111897671A (en) Failure recovery method, computer device, and storage medium
US11706080B2 (en) Providing dynamic serviceability for software-defined data centers
CN110851320A (en) Server downtime supervision method, system, terminal and storage medium
CN107800783B (en) Method and device for remotely monitoring server
CN112529223A (en) Equipment fault repair method and device, server and storage medium
CN115658420A (en) Database monitoring method and system
CN116016123A (en) Fault processing method, device, equipment and medium
CN111625386A (en) Monitoring method and device for power-on overtime of system equipment
US20210373953A1 (en) System and method for an action contextual grouping of servers
US20200127882A1 (en) Identification of cause of failure of computing elements in a computing environment
CN114510381A (en) Fault injection method, device, equipment and storage medium
CN117453036A (en) Method, system and device for adjusting power consumption of equipment in server
US20080216057A1 (en) Recording medium storing monitoring program, monitoring method, and monitoring system
CN114968626A (en) Method, device, equipment and storage medium for determining server fault
US9354962B1 (en) Memory dump file collection and analysis using analysis server and cloud knowledge base
CN107046479B (en) Method and device for verifying state of network equipment
US11237892B1 (en) Obtaining data for fault identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination