CN117539729A - Server fault early warning method and computing device - Google Patents

Info

Publication number
CN117539729A
Authority
CN
China
Prior art keywords
server
rectification
rectified
target
risk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311353442.XA
Other languages
Chinese (zh)
Inventor
刘法龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XFusion Digital Technologies Co Ltd filed Critical XFusion Digital Technologies Co Ltd
Priority to CN202311353442.XA priority Critical patent/CN117539729A/en
Publication of CN117539729A publication Critical patent/CN117539729A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0787 Storage of error reports, e.g. persistent data storage, storage using memory protection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route

Abstract

The application provides a server fault early warning method and a computing device, and relates to the field of computer technologies. The method includes: determining, based on a preset rule, a target server corresponding to a first server; acquiring configuration information of the target server; determining, based on the configuration information of the target server and using a rectification early warning database, whether a server to be rectified exists among the target servers; generating a risk alarm when it is determined that a server to be rectified exists among the target servers; and sending the risk alarm to the target server to prompt the user that the server to be rectified has a potential risk. Because the potential risk of the server to be rectified is actively reported to the user as a risk alarm, there is no need to wait for the target server to fail and then respond passively. This removes the lag inherent in passively responding to faults and gives timely early warning of problems the server may develop, so that they can be prevented in advance.

Description

Server fault early warning method and computing device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a server fault early warning method and a computing device.
Background
As the service demands on a server cluster grow, the amount of hardware and software on the servers keeps increasing to meet those demands. Among so many servers, once server hardware fails, the overall performance of the server degrades, error messages are reported, and the server may even go down, which seriously affects the operation and availability of services. Maintenance of server hardware faults is therefore indispensable.
In the prior art, when a server fails in the use or production stage, the server is rectified according to the fault (for example, by a firmware version upgrade or a firmware configuration change) to resolve it. A rectification notice is also issued for the rectification. Other servers may carry a potential risk of the same fault, but they generally just wait for the fault to occur, and only then is the previously issued rectification notice identified from the fault that occurred. The purchaser of the failed server then contacts maintenance personnel, who handle the fault according to the rectification notice.
Therefore, in the prior art, after a fault of one server has been rectified and the rectification notice issued, other servers that carry a potential risk of the same fault can only be rectified by waiting for the fault to occur and responding passively: the customer then contacts maintenance personnel to troubleshoot and rectify the failed server. This passive response lags behind the problem and wastes considerable human resources during troubleshooting and rectification.
Disclosure of Invention
The application provides a server fault early warning method and a computing device that remove the lag of passively responding to faults, give timely early warning of problems a server may develop, and thereby prevent those problems in advance.
In order to achieve the above purpose, the present application adopts the following technical scheme:
In a first aspect, the application provides a server fault early warning method applied to a first server. The method includes: determining, based on a preset rule, a target server corresponding to the first server; acquiring configuration information of the target server; determining, based on the configuration information of the target server and using a rectification early warning database, whether a server to be rectified exists among the target servers, where the rectification early warning database includes rectification problems and the server configuration information corresponding to each rectification problem; generating a risk alarm when it is determined that a server to be rectified exists among the target servers; and sending the risk alarm to the target server to prompt the user that the server to be rectified has a potential risk. Because the server to be rectified is determined from the rectification early warning database and a risk alarm is generated and issued to prompt the user, there is no need to wait for the server to fail and then respond passively. This removes the lag of fault rectification in the prior art and gives timely early warning of problems the server may develop, so that they can be prevented in advance.
In one possible implementation, it is judged whether the configuration information of the target server exists in the rectification early warning database. When it does, it is determined that a server to be rectified exists among the target servers; when it does not, it is determined that no server to be rectified exists among the target servers. Because the server to be rectified can be determined from the rectification problems and corresponding configuration information stored in the rectification early warning database, the determination is simpler and faster, which improves the timeliness of early warning for servers that may develop a rectification problem.
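The lookup described above can be sketched as follows. This is a minimal illustration only: the `ServerConfig` fields, the dictionary-based database, and the device identifiers are assumptions made for the example, since the application does not specify how the configuration information or the database is structured.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServerConfig:
    # Hypothetical configuration fields used as the database key.
    model: str
    firmware_version: str

# Hypothetical rectification early warning database:
# affected configuration -> rectification problem.
RECTIFICATION_DB = {
    ServerConfig("model-x", "1.0.3"): "diagnosis database too large",
}

def find_servers_to_rectify(targets, db):
    """Return the device IDs of target servers whose configuration is in db."""
    return [device_id for device_id, cfg in targets.items() if cfg in db]

targets = {
    "srv-01": ServerConfig("model-x", "1.0.3"),
    "srv-02": ServerConfig("model-x", "1.1.0"),
}
to_rectify = find_servers_to_rectify(targets, RECTIFICATION_DB)
```

An exact-match dictionary lookup is the simplest reading of "exists in the database"; a real database might instead match on ranges of versions or on partial configurations.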
In one possible implementation, the risk alarm is generated based on the device identifier of the server to be rectified, and the risk alarm carries that device identifier. Because the risk alarm explicitly carries the device identifier of the server to be rectified, the user can clearly identify which server has a potential risk of the rectification problem.
In one possible implementation, the risk alarm is sent to all target servers, and the device identifier carried by the risk alarm prompts the user that the server to be rectified has a potential risk. With the risk alarm sent to all target servers, the user can identify from the device identifier which server has a potential risk of the rectification problem. The alarm also serves as a reminder for the other servers: the user can proactively carry out a second screening for fault risk based on the device identifier of the server to be rectified.
In one possible implementation, the risk alarm is sent, based on the device identifier it carries, to the server to be rectified, to prompt the user that this server has a potential risk. This prompts precisely which server has a potential risk of the rectification problem, and the user does not need to determine it again.
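The two delivery options above (sending the alarm to every target server, or only to the server to be rectified) might be expressed as a single selection step. The function name and the `broadcast` flag are hypothetical and are used here only to contrast the two implementations.

```python
def alarm_recipients(target_ids, to_rectify_ids, broadcast):
    """Return the device IDs that should receive the risk alarm.

    broadcast=True  -> every target server receives the alarm.
    broadcast=False -> only the servers to be rectified receive it.
    """
    if broadcast:
        return list(target_ids)
    affected = set(to_rectify_ids)
    return [t for t in target_ids if t in affected]

targets = ["srv-01", "srv-02", "srv-03"]
to_rectify = ["srv-02"]
```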
In one possible implementation, the rectification problem corresponding to the server to be rectified is determined based on the rectification early warning database, and the risk alarm is generated based on the device identifier of the server to be rectified and the rectification problem involved; the risk alarm carries both. In addition to prompting that the server to be rectified has a potential risk of the rectification problem, this gives the user the details of the rectification problem that may occur on that server.
In one possible implementation, the rectification early warning database further includes rectification notice information for each rectification problem. The rectification problem corresponding to the server to be rectified is determined based on the rectification early warning database; the rectification notice information for that rectification problem is determined based on the rectification early warning database; and the risk alarm is generated based on the device identifier of the server to be rectified, the corresponding rectification problem, and the rectification notice information, all of which the risk alarm carries. Besides learning that the server to be rectified has a potential risk and what the rectification problem is, the user can use the rectification notice information to obtain the published rectification notice and rectify the target server in advance according to it, avoiding the potential risk. Moreover, because the problem description and problem location were already completed when the rectification notice was produced, repeated human effort is reduced, the rectification cycle is shortened, and long-lasting impact on the operation and availability of services is avoided.
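One way to picture the alarm contents described in the preceding implementations is a small record that always carries the device identifier and optionally carries the rectification problem and the rectification notice information. The class, field names, database layout, and the notice identifier below are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RiskAlarm:
    device_id: str                  # always carried
    problem: Optional[str] = None   # rectification problem, if included
    notice: Optional[str] = None    # rectification notice information, if included

def build_risk_alarm(device_id, config_key, db):
    """Build an alarm for one server to be rectified from its database entry."""
    entry = db[config_key]
    return RiskAlarm(device_id, entry.get("problem"), entry.get("notice"))

# Hypothetical database entry keyed by a configuration string.
db = {
    "model-x/fw-1.0.3": {
        "problem": "diagnosis database too large",
        "notice": "NOTICE-0001",
    }
}
alarm = build_risk_alarm("srv-01", "model-x/fw-1.0.3", db)
```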
In one possible implementation, the rectification early warning database further includes server rectification measures corresponding to each rectification problem. The rectification problem corresponding to the server to be rectified is determined based on the rectification early warning database; the server rectification measure corresponding to that rectification problem is obtained from the rectification early warning database; and the server to be rectified is rectified using the corresponding server rectification measure. Having the first server rectify the server to be rectified can effectively delay the point at which the rectification problem might occur, giving the user sufficient time to upgrade or update the server.
In one possible implementation, when the target server and the first server are in the same network segment and belong to the same manufacturer, one or more second servers in the same network segment as the first server are determined based on a ping program; the manufacturer identifiers of the one or more second servers are obtained through an IPMI interface; and the target server belonging to the same manufacturer as the first server is determined among the one or more second servers based on those manufacturer identifiers. In this way, the target server that is in the same network segment as the first server and from the same manufacturer can be determined quickly.
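The manufacturer filter in this implementation can be sketched as below. The sketch assumes the manufacturer identifiers have already been collected (for example, the IPMI Get Device ID response includes a manufacturer ID field); the function name, addresses, and ID values are hypothetical.

```python
def select_target_servers(first_vendor_id, second_server_vendors):
    """Keep the second servers whose manufacturer ID matches the first server's.

    second_server_vendors maps an IP address to the manufacturer ID reported
    for that server (assumed to have been read through its management interface).
    """
    return [ip for ip, vendor_id in second_server_vendors.items()
            if vendor_id == first_vendor_id]

# Hypothetical manufacturer IDs; real values would come from the servers themselves.
vendors = {"192.168.1.2": 1001, "192.168.1.3": 2002, "192.168.1.4": 1001}
targets = select_target_servers(1001, vendors)
```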
In a second aspect, the application provides a server fault early warning device, including: a configuration information acquisition module, configured to acquire configuration information of a target server; a server judging module, configured to determine, based on the configuration information of the target server and using a rectification early warning database, whether a server to be rectified exists among the target servers, where the rectification early warning database includes rectification problems and the server configuration information corresponding to each rectification problem; a risk alarm generation module, configured to generate a risk alarm when it is determined that a server to be rectified exists among the target servers; and a risk alarm sending module, configured to send the risk alarm to the target server to prompt the user that the server to be rectified has a potential risk. There is no need to wait for a server to fail and then respond passively; the lag of fault rectification in the prior art is removed, and timely early warning is given of problems the server may develop, so that they can be prevented in advance.
In a third aspect, the present application provides a computing device comprising a processor, and a memory communicatively coupled to the processor; the memory is used for storing computer execution instructions; the processor is configured to execute the computer-executable instructions stored in the memory, such that the processor performs the method of the first aspect described above.
In a fourth aspect, the present application provides a computer readable storage medium having stored therein a computer program or instructions which, when executed, implement the method of the first aspect described above.
In a fifth aspect, the present application provides a computer program product comprising a computer program or instructions which, when executed by a processor, implement the method of the first aspect described above.
Drawings
FIG. 1 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 2 is a flowchart of a server fault early warning method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a scenario of a server fault early warning method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a scenario of another server fault early warning method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a scenario of another server fault early warning method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a scenario of another server fault early warning method according to an embodiment of the present application;
FIG. 7 is a flowchart of another server fault early warning method according to an embodiment of the present application;
FIG. 8 is a flowchart of another server fault early warning method according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a server fault early warning device according to an embodiment of the present application.
Detailed Description
The terms "first", "second", "third" and the like in the description, the claims, and the drawings are used to distinguish between different objects, not to define a particular order.
In the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
For clarity and conciseness in the description of the following embodiments, a brief description of the related technologies is given first:
The Intelligent Platform Management Interface (IPMI) is a standard used in the design of server management systems. The interface standard makes it easier to manage the hardware of different types of server systems, which makes centralized management of different platforms possible.
Redfish is an open industry standard specification published by the Distributed Management Task Force (DMTF) that aims to provide modern, secure management of platform hardware. It is a management standard that can represent various implementations through a consistent interface.
The following compares the problems of the prior art with the advantages of the server fault early warning method provided by the application.
In the prior art, when a server develops a fault (problem) in the use or production stage, the server is rectified (for example, by a firmware version upgrade or a firmware configuration change) to obtain an updated version of the server that resolves the problem of the original version, and a rectification notice for the problem is issued. For ease of understanding, an example follows:
During use or production, a server has the problem that its intelligent diagnosis database grows too large, causing the IBMC to reset repeatedly. The problem must therefore be described (including the hardware configuration involved, the observed symptoms, and so on); after the root cause is located, the server is rectified and a rectification notice is issued. The rectification notice may include: the actual hardware configuration, the applicable scope, the expected completion time, the labor required, the modification record, the contact information of the maintenance personnel, problem keywords, a problem summary, a problem description, and so on. The problem description may include trigger conditions, problem symptoms, and so on.
However, for other server devices with the same potential risk, the current rectification mechanism waits until the problem covered by the issued rectification notice actually occurs on such a device and then responds passively: the customer contacts maintenance personnel, who troubleshoot and rectify the failed server according to the rectification notice. This passive approach lags behind the problem. Servers with potential risk cannot be located in time, the risk cannot be prevented in advance, and timely early warning is difficult. In addition, the same fault has to be manually troubleshot and rectified separately on each server node, which wastes considerable human resources.
Resolving a server problem also takes a long time because it involves many steps, including at least the following: the customer contacts maintenance personnel, the maintenance personnel locate and rectify the problem on the server node, and the rectified server is put back into use. The server is out of service for a long time, which seriously affects the operation and availability of services.
In the application, the first server obtains configuration information of target servers that are in the same network segment as the first server and belong to the same manufacturer, and compares the obtained configuration information of the target servers with the configuration information in a pre-established rectification early warning database to determine whether a server to be rectified exists. If so, a risk alarm is generated and sent to the target server. In other words, the rectification early warning database is used to determine whether a target server has a potential risk, and if it does, the first server actively sends a risk alarm to prompt the user that the target server has a potential risk of the rectification problem. This removes the lag caused by passively responding to faults and gives timely early warning of problems a server may develop, so that they can be prevented in advance.
In order to facilitate understanding of the technical solutions of the present application, a description is first provided below of a computing device provided in an embodiment of the present application.
By way of example, FIG. 1 shows a schematic structural diagram of a computing device according to an embodiment of the present application. It is to be appreciated that the computing device (also referred to herein as a "server") can be, but is not limited to, a personal computer, a physical server, a cloud server, a workstation, a super terminal, and the like. As shown in FIG. 1, the hardware portion of computing device 10 includes a processor 110, a memory 120, and a management controller 130; the software portion of computing device 10 mainly includes an operating system (OS) 140 and processor firmware 150.
The processor 110 may include various processing devices, for example, a central processing unit (CPU), a system on chip (SOC), a processor integrated on the SOC, a separate processor chip or controller, and the like. The processor 110 may also include special-purpose processing devices such as an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), or a digital signal processor (DSP). The processor 110 may be a processor group of multiple processors coupled to each other by one or more buses.
The memory 120 may also be referred to as internal memory or main memory. The memory 120 may be coupled to the processor 110; in particular, it may be coupled to the processor 110 by one or more memory controllers. There may be one or more memories 120. The memory 120 may be a volatile memory such as a random access memory (RAM), or another type of dynamic storage device that can store information and instructions, and serves as the running memory of the computing device 10.
The management controller 130 is configured to remotely maintain and manage the computing device 10 through a dedicated data channel. The management controller 130 is completely independent of the operating system of the computing device 10 and may communicate with the processor, the memory, and so on through an out-of-band management interface.
By way of example, the management controller 130 may include a management unit for the operating state of the computing device, a management system in a management chip outside the processor, an out-of-band management controller (baseboard management controller, BMC), and the like. It should be noted that the embodiments of the present application do not limit the specific form of the management controller, and the above is merely an example. In the following embodiments, the out-of-band management controller BMC is used as an example.
Operating system 140 is an underlying system program installed in computing device 10, including, but not limited to, iOS, Android, Windows, HarmonyOS, or other operating systems.
The processor firmware 150, which may simply be called firmware, is a program written into an EPROM (erasable programmable read-only memory) or EEPROM (electrically erasable programmable read-only memory). Processor firmware 150 refers to the device "driver" stored within computing device 10, through which the operating system can drive a particular machine according to a standard device driver. Examples include the basic input/output system (BIOS), management engine (ME), microcode, or intelligent management unit (IMU) firmware. It should be noted that the specific form of the processor firmware 150 is not limited by the embodiments of the present application, and the above is merely exemplary.
It should be noted that the processor firmware 150 may be located in the processor 110 (as shown in fig. 1), or the processor firmware 150 may be located in a firmware chip (not shown in fig. 1) outside the processor 110.
It is noted that the structure shown in FIG. 1 does not constitute a limitation of computing device 10, and that computing device 10 may include more or fewer components than shown in FIG. 1, combine certain components, or arrange the components differently. For example, computing device 10 may also include a display screen for displaying images, videos, and the like. The display screen includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, computing device 10 may include 1 or N display screens, N being a positive integer greater than 1. In embodiments of the present application, the display screen may display the risk alarm to alert the user that a potential risk exists with computing device 10.
Embodiment one:
A server fault early warning method provided in the embodiments of the present application is described in detail below with reference to FIG. 2 to FIG. 7.
The application scenario of the method is as follows: after a server (hereinafter server A) fails during use or production, maintenance personnel locate the root cause of the fault and rectify server A (for example, by upgrading the processor firmware version or modifying the processor firmware configuration) to obtain the latest version of server A, which resolves the fault. The server fault early warning method provided by the embodiments of the application is implemented on the latest version of server A. For ease of description, the latest version server (that is, server A) is referred to below as the first server.
In one possible implementation, the first server may be determined according to server version numbers. A version number is an identifier of a version: a unique number or set of labels assigned to a particular version of a device, software program, file, firmware, device driver, or even hardware. Version numbers increase as new versions are released. In this scheme, upgrading or rectifying a server yields a new version whose version number is higher than that of a server that has not been upgraded or rectified, so the server with the highest version number is determined to be the first server.
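Selecting the server with the highest version number can be sketched as follows. The dotted numeric version format and the server identifiers are assumptions; the application does not specify a version format. Comparing parsed tuples rather than raw strings matters because a plain string comparison would rank "1.9.0" above "1.10.0".

```python
def parse_version(version):
    """Turn a dotted version string such as '1.10.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def pick_first_server(versions):
    """Return the ID of the server with the highest version number."""
    return max(versions, key=lambda server_id: parse_version(versions[server_id]))

# Hypothetical server IDs and version numbers.
versions = {"srv-a": "1.9.0", "srv-b": "1.10.0", "srv-c": "1.2.7"}
first_server = pick_first_server(versions)
```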
It should be noted that the first server may be the latest version server in the server cluster or any server in the server cluster, which is not specifically limited in this application. However, the potential fault risk of other servers can be warned of accurately only when the method is applied to the latest version server in the cluster. If the method is not implemented on the latest version server, the rectification early warning database may not be the latest version; that is, faults that have already occurred in the server cluster may not have been updated into the database in time, so risk early warning for those faults may not be possible. Further, when the rectification early warning database is the latest version, that is, when faults that have occurred in the server cluster are updated into it in time, running the method on the latest version server allows the servers in the cluster that are at risk of failure to be located more comprehensively. If the method is not run on the latest version server, servers at risk may be missed, and the at-risk servers in the cluster cannot be located comprehensively.
S201, determining one or more second servers that are in the same network segment as the first server.
Wherein the second server is a server in the same network segment as the first server. It should be noted that the number of the second servers may be one or more, and the present application is not limited specifically.
When the second server and the first server are in the same network segment, they are necessarily in the same local area network (LAN). Servers in the same LAN are essentially servers that communicate through the same switch; all servers connected to the same switch are in the same LAN.
When the second server and the first server are in the same network segment, data can be transmitted between them without the intervention of a router or a layer-3 switch; such data transmission between the first server and the second server may also be called intra-network communication.
In one possible implementation, the first server uses the ping (packet internet groper) program to determine the second servers that are in the same network segment as the first server.
The ping program is a basic tool for testing connectivity between two servers. It is mainly used to test whether data can be sent and received normally between devices, and thereby to judge whether the devices are running normally and whether the network is clear.
Specifically, the first server uses the ping program to ping the IP addresses of its own network segment, that is, it pings the second servers in the same network segment, and in doing so sends ARP (Address Resolution Protocol) messages. After the switch connected to the first server receives an ARP message, it forwards the message to all of its ports. Each other server that receives the ARP message judges from the IP address whether it is the server being looked up by the first server. If not, it discards the ARP message and does not respond. If so, it immediately replies to the first server with an ARP response message; the servers that reply with an ARP response message are the second servers. (In addition, a router isolates broadcast domains: a server that can only be reached through a router belongs to a different local area network, and the router generally discards the ARP message directly.)
To facilitate an understanding of the relationship of the second server to the first server, an example is described below in connection with FIG. 3. Here, the description will be given taking the server a as the first server.
The system shown in FIG. 3 comprises server A, server B, server C, server D, server E, a switch, and a router. Server A, server B, server C, and server D are directly connected to the switch, and server E is connected to the router and, through the router, to the switch. Server A pings its own network segment through the ping program and sends an ARP message. The switch receives the ARP message and forwards it to server B, server C, server D, and the router. Server B, server C, and server D receive the forwarded ARP message and each return an ARP response message. The router directly discards the ARP message forwarded to it, so server E does not receive the ARP message sent by server A. The second servers corresponding to server A are therefore server B, server C, and server D.
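The discovery step in this example can be sketched with the Python standard library. This is a simulation under stated assumptions, not the patent's implementation: every IP address is hypothetical, and the real ping/ARP exchange is replaced by a caller-supplied `is_alive` callback so that the sketch runs without network access.

```python
import ipaddress
from typing import Callable, List


def find_second_servers(first_server_cidr: str,
                        is_alive: Callable[[str], bool]) -> List[str]:
    """List the responding hosts that share the first server's network segment.

    first_server_cidr: address of the first server with its prefix length,
    e.g. '192.168.1.10/28'. is_alive stands in for a real ping probe.
    """
    interface = ipaddress.ip_interface(first_server_cidr)
    own_address = str(interface.ip)
    return [
        str(host)
        for host in interface.network.hosts()
        if str(host) != own_address and is_alive(str(host))
    ]


# Simulated FIG. 3 topology (all addresses hypothetical): B, C and D share
# server A's segment; E sits behind the router in another segment.
responding = {"192.168.1.11", "192.168.1.12", "192.168.1.13", "192.168.2.20"}
second_servers = find_second_servers("192.168.1.10/28",
                                     lambda ip: ip in responding)
```

Server E (192.168.2.20 in this sketch) is never probed because it lies outside server A's segment, which mirrors the router discarding the ARP message.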
S202, determining target servers belonging to the same manufacturer as the first server in one or more second servers.
The target server is a server belonging to the same manufacturer as the first server in one or more second servers.
Specifically, the first server obtains from the one or more second servers, through the IPMI interface, the manufacturer identifier (e.g., vendor ID, vendor name, etc.) of each second server. The first server then judges, according to the obtained manufacturer identifiers, whether each second server belongs to the same manufacturer as the first server. When the manufacturer identifier of a second server is the same as that of the first server, the second server is a target server; when it is different, the second server is not a target server. The IPMI interface supports remote monitoring without requiring permission from the server operating system.
The purpose of determining, among the one or more second servers, the target servers belonging to the same manufacturer as the first server is to exclude the second servers that belong to a different manufacturer. Servers produced by different manufacturers differ in internal hardware configuration, firmware configuration, connections between hardware components, and so on, so the problems that may occur, the localization of those problems, and their root causes do not carry over from one manufacturer's servers to another's. Consequently, whether a second server from a different manufacturer has a potential risk cannot be determined from the historical problems of the first server, their localization, and their causes. For this reason it is determined in advance whether each second server belongs to the same manufacturer as the first server, and those that do not are removed.
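The filtering in S202 can be sketched as a simple comparison of manufacturer identifiers. The identifiers are assumed to have been read already (the IPMI query itself is not shown), the addresses and vendor names are hypothetical, and the case-insensitive, whitespace-tolerant comparison is an assumption of this sketch rather than something the text specifies.

```python
from typing import Dict, List


def select_target_servers(first_vendor: str,
                          second_vendors: Dict[str, str]) -> List[str]:
    """Keep only the second servers whose manufacturer identifier matches
    that of the first server."""
    wanted = first_vendor.strip().lower()
    return [
        server
        for server, vendor in second_vendors.items()
        if vendor.strip().lower() == wanted
    ]


# Hypothetical identifiers, as if already obtained from each second server.
second_vendors = {
    "10.0.0.11": "VendorA",
    "10.0.0.12": "vendora ",
    "10.0.0.13": "VendorD",
}
targets = select_target_servers("VendorA", second_vendors)
```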
Before S202, the first server needs to be authorized to obtain information. Only after being authorized can the first server send a request to the second server through the IPMI interface to obtain the manufacturer identifier of the second server.
Illustratively, the first server may display a popup window to the user that asks whether the first server is permitted to obtain information of the second server. In response to a user operation (e.g., the user uses a peripheral to click the option that permits the first server to obtain information of the second server), the first server is authorized to obtain information of the second server.
It should be noted that the authorization to obtain information may be granted in advance, that is, once the first server has been authorized, the authorization state allowing it to obtain information of the second server is maintained; alternatively, authorization may be granted anew each time before the first server obtains information of the second server, which may also be called temporary authorization.
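The difference between a maintained authorization state and temporary authorization can be illustrated with a small state holder. The class `Authorization` and its methods are hypothetical names introduced only for this sketch; the text does not describe how the state is stored.

```python
class Authorization:
    """Tracks whether the first server may read information from second servers.

    persistent=True models authorization granted once and then maintained;
    persistent=False models temporary authorization that is used up by one action.
    """

    def __init__(self, persistent: bool) -> None:
        self.persistent = persistent
        self._granted = False

    def grant(self) -> None:
        """Record that the user approved the request (e.g. via the popup)."""
        self._granted = True

    def use(self) -> bool:
        """Return whether the action is allowed; temporary grants expire on use."""
        allowed = self._granted
        if not self.persistent:
            self._granted = False
        return allowed


standing = Authorization(persistent=True)
standing.grant()
temporary = Authorization(persistent=False)
temporary.grant()

standing_results = [standing.use(), standing.use()]
temporary_results = [temporary.use(), temporary.use()]
```

With the temporary variant, the second call is refused until the user grants authorization again.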
S203, acquiring configuration information of the target server.
Specifically, the first server obtains configuration information of the target server through the IPMI interface.
The configuration information of the target server comprises: the model of the target server, the firmware version of the target server (e.g., BMC, BIOS, CPLD, etc.).
The first server obtains the configuration information of the target server in order to compare it with the server configuration information stored in the rectification early warning database and thereby determine whether the target server has a potential risk of a problem occurring.
S204, comparing the obtained configuration information of the target server with the rectification early warning database, and judging whether the target server related to the rectification problem exists or not.
The rectification early warning database is a database built in advance from the problems (namely, rectification problems) that occurred when the first server and the target servers corresponding to the first server failed in the past, the server configurations involved in those problems, and the rectification notices issued for those problems. That is, the rectification early warning database stores at least the historical rectification problems of the servers that are in the same network segment and belong to the same manufacturer, together with the server configuration information corresponding to each historical rectification problem.
In one possible implementation, the rectification early warning database further includes rectification notice information corresponding to each historical rectification problem, which may be a rectification notice address, a rectification notice label, and the like.
The rectification notice corresponding to a rectification problem is a notice issued for that problem by the designated maintenance personnel. The rectification notice information is used to indicate the rectification notice for the rectification problem; for example, the complete rectification notice can be obtained from the rectification notice address.
It should be noted that a copy of the rectification early warning database may be stored on each of the servers that are in the same network segment and belong to the same manufacturer; alternatively, it may be stored only on the latest-version server among those servers.
The target server related to the rectification problem may also be called a server to be rectified.
For ease of understanding, the rectification early warning database is illustrated below in conjunction with Table 1. The rectification problems of the first server are illustrated by two examples, namely repeated iBMC resets caused by an oversized intelligent diagnosis database and a BMC abnormality caused by a Hynix chip fault, and the server configuration is illustrated by the server model. In one possible implementation, before the rectification early warning database containing these two problems is obtained, both problems have already occurred on the first server; the maintenance personnel have described each problem and located its root cause, obtained a title for the problem, the problem phenomenon, and the server configuration involved, rectified the first server to solve the two problems, and issued rectification notices. The first server forms the rectification early warning database from the problems, the server configurations involved in the problems, and the issued rectification notice information.
TABLE 1
As shown in Table 1, servers of models 1288H V5, 2288V5, and 2488V5 carry a potential risk of repeated iBMC resets caused by an oversized intelligent diagnosis database; that is, servers of these models involve a rectification problem. Servers of models 2298V5 and 5288V5 carry a potential risk of a BMC abnormality caused by a Hynix chip fault; that is, servers of these models also involve a rectification problem. When the model of a server is not among the models listed in the rectification early warning database, it is determined that the server does not involve a rectification problem.
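The lookup described for Table 1 can be sketched as a mapping from each rectification problem to the server models it involves. The two entries mirror the models and problems named in the text; the dictionary layout and the helpers `normalize_model` and `problems_for_model` are assumptions of this sketch. Spaces and letter case are normalized because the text itself writes the same model in more than one way.

```python
from typing import Dict, FrozenSet, List


def normalize_model(model: str) -> str:
    """Remove spaces and upper-case the model so '1288h V5' matches '1288HV5'."""
    return model.replace(" ", "").upper()


# Rectification problem -> server models involved (entries taken from the text).
RECTIFICATION_DB: Dict[str, FrozenSet[str]] = {
    "iBMC resets repeatedly because the intelligent diagnosis database is too large":
        frozenset({"1288HV5", "2288V5", "2488V5"}),
    "BMC abnormal because of a Hynix chip fault":
        frozenset({"2298V5", "5288V5"}),
}


def problems_for_model(model: str) -> List[str]:
    """Return the rectification problems that involve the given server model."""
    key = normalize_model(model)
    return [problem for problem, models in RECTIFICATION_DB.items() if key in models]
```

An empty result corresponds to the case where the server does not involve any rectification problem.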
In one possible implementation, the rectification early warning database further includes a risk level corresponding to each rectification problem. Risk levels are defined in advance according to factors such as how severely the rectification problem affects the server and how difficult the problem is to solve. For ease of understanding, take the case where the risk levels include a primary, a secondary, and a tertiary risk level. The primary risk level indicates that the rectification problem has a large impact, for example, it may paralyze the server cluster that includes the server; the secondary risk level indicates a moderate impact, for example, a single server may go down; the tertiary risk level indicates a low impact, for example, some functions of the server may not run normally.
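The three example levels can be modeled as an ordered enumeration, which makes it easy to handle the most severe findings first. The numeric values, the server names, and the helper `rectification_order` are assumptions made for this sketch; the text only names the three levels.

```python
from enum import IntEnum
from typing import List, Tuple


class RiskLevel(IntEnum):
    PRIMARY = 1    # large impact: may paralyze the whole server cluster
    SECONDARY = 2  # moderate impact: a single server may go down
    TERTIARY = 3   # low impact: some server functions may not run normally


def rectification_order(findings: List[Tuple[str, RiskLevel]]) -> List[str]:
    """Return server names ordered so the most severe risk level comes first."""
    return [name for name, _ in sorted(findings, key=lambda item: item[1])]


# Hypothetical findings: (server name, risk level of the problem it involves).
findings = [
    ("server-b", RiskLevel.TERTIARY),
    ("server-c", RiskLevel.PRIMARY),
    ("server-d", RiskLevel.SECONDARY),
]
order = rectification_order(findings)
```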
When it is determined that there is no target server related to the rectification problem, S205 is performed.
When it is determined that there is a target server related to the rectification problem, S206 is performed.
S205, recording in the system event log the event that no target server involves a rectification problem.
The system event log (SEL) is used for recording the running state of each component in the server, the progress of events, and the like.
S206, generating a risk alarm according to the target server related to the rectification problem.
The target server involving the rectification problem is determined according to the rectification early warning database, and a risk alarm is generated according to the identification information of that target server. The identification information may be the IP address of the target server, a unique label of the target server, or any other information that can identify the target server, which is not specifically limited in this application.
In one possible implementation, in addition to the rectification problems and the configuration information of the servers involved, the rectification early warning database includes the rectification notice information corresponding to each rectification problem. From the database, the rectification problem involved by the target server and the corresponding rectification notice information (such as a rectification notice address) can be determined, and the risk alarm is generated from the identification information of the target server, the rectification problem involved, and the corresponding rectification notice information.
In one possible implementation, the risk alarm may also include configuration information (e.g., the model) of the target server involving the rectification problem, so that the risk alarm states explicitly which server model is involved.
In one possible implementation, the risk alarm may further include the risk level of the rectification problem involved by the target server, so that the risk alarm states explicitly how severely the rectification problem would affect the server.
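One way to picture the risk alarm is as a record with two mandatory fields and three optional ones, matching the implementations listed above. The field names, the text layout produced by `format_alarm`, and the sample address are all hypothetical; the text does not define an alarm format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiskAlarm:
    server_id: str                        # IP address or unique label of the target server
    problem: str                          # rectification problem involved
    notice_address: Optional[str] = None  # where the full rectification notice can be read
    server_model: Optional[str] = None    # configuration information, e.g. the model
    risk_level: Optional[int] = None      # 1 = primary, 2 = secondary, 3 = tertiary


def format_alarm(alarm: RiskAlarm) -> str:
    """Render only the fields that are present, in a fixed order."""
    parts = [f"server={alarm.server_id}", f"problem={alarm.problem}"]
    if alarm.notice_address is not None:
        parts.append(f"notice={alarm.notice_address}")
    if alarm.server_model is not None:
        parts.append(f"model={alarm.server_model}")
    if alarm.risk_level is not None:
        parts.append(f"risk_level={alarm.risk_level}")
    return "; ".join(parts)


# Hypothetical alarm carrying only an identifier, a problem and a risk level.
alarm = RiskAlarm(server_id="192.0.2.13", problem="problem Q", risk_level=1)
text = format_alarm(alarm)
```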
S207, sending the risk alarm to the target server.
In one possible implementation, the risk alarm is sent to all target servers to prompt the user that a target server has a potential risk. After a target server receives the risk alarm, the user can determine, from the identification information carried in the risk alarm, whether that target server involves the rectification problem (namely, whether it has the potential risk). The target server and the user are thus informed in advance of a fault that may occur, instead of responding passively after the target server has failed. This removes the lag of fault rectification in the prior art: problems that a server may develop are warned of in time and guarded against in advance.
In one possible implementation, the risk alarm is sent only to the target server involving the rectification problem, according to that server's IP address, to prompt the user that the target server has a potential risk. This guards against the problem in advance and gives a precise warning for the target server concerned.
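The two delivery options in S207, sending to every target server or only to the servers involved, differ only in how the recipient list is built. The sketch below assumes servers are identified by IP address; the function name, the `notify_all` flag, and the addresses are hypothetical.

```python
from typing import Iterable, List


def alarm_recipients(targets: Iterable[str],
                     affected: Iterable[str],
                     notify_all: bool) -> List[str]:
    """Choose which target servers receive the risk alarm.

    notify_all=True  -> every target server is notified.
    notify_all=False -> only the targets that involve the rectification problem.
    """
    affected_set = set(affected)
    return [ip for ip in targets if notify_all or ip in affected_set]


# Hypothetical addresses: two target servers, one of which is affected.
targets = ["192.0.2.12", "192.0.2.13"]
affected = ["192.0.2.13"]
broadcast = alarm_recipients(targets, affected, notify_all=True)
precise = alarm_recipients(targets, affected, notify_all=False)
```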
In one possible implementation, the risk alarm carries the identifier of the target server involving the rectification problem, the rectification problem involved, and the rectification notice information for that problem. Sending this risk alarm to the target server prompts the user that the target server has a potential risk and, because the alarm names the rectification problem, also tells the user exactly which problem may occur. Because the alarm carries the rectification notice information, the user can use it to obtain the rectification notice issued for the rectification problem and rectify the target server in advance according to that notice (for example, by upgrading the firmware version), thereby avoiding the potential risk. In addition, the problem description and problem localization were already completed when the rectification notice was produced, which avoids duplicated human effort, shortens the rectification period, and prevents a prolonged impact on service operation and availability.
The rectification notice information may be the address of the rectification notice, so that the user can obtain the complete rectification notice from that address; this effectively reduces the memory occupied by the rectification notice information. The rectification notice information may also take other forms, which is not specifically limited in this application.
In one possible implementation, the risk alarm carries the configuration information of the target server involving the rectification problem (e.g., the model of the target server, the firmware version of the target server, etc.), so that the user learns in advance which configuration carries the potential risk and can take targeted rectification measures later.
In one possible implementation, the risk alarm carries the risk level of the rectification problem involved by the target server, so that the user is told not only that the target server has a potential risk but also how severely the problem is expected to affect the server. The user can then formulate a rectification strategy for the target server according to the risk level. Taking primary, secondary, and tertiary risk levels as an example, if the risk alarm carries the primary risk level, the user rectifies the target server immediately to avoid the large impact that the problem would cause.
In one possible implementation, after it is determined that there is a target server involving the rectification problem, the method may further include S208, which is an optional step.
S208, according to the rectification problem related to the target server, rectifying and modifying the target server.
Before S208 is performed, the first server needs to be authorized to rectify servers. Only after being authorized can the first server apply rectification measures to the target server according to the rectification problem involved by the target server.
Illustratively, the first server may display a popup window to the user that includes an option to allow the first server to rectify the target server; in response to the user selecting that option, the first server is authorized to rectify the target server.
It should be noted that the authorization to rectify servers may be granted in advance, that is, once the first server has been authorized, the authorization state allowing it to rectify target servers is maintained; alternatively, authorization may be granted anew each time before the first server rectifies a target server, which may also be called temporary authorization. This is not limited in this application.
In one possible implementation, the rectification early warning database in the first server further includes a server rectification measure corresponding to each rectification problem. A server rectification measure is a rectification measure that one server, as the executing entity, applies to another server in order to rectify it. The first server may look up, in the rectification early warning database, the server rectification measure corresponding to the rectification problem involved by the target server and rectify the target server with that measure, for example, by modifying the configuration of the target server through IPMI or Redfish commands.
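The lookup-then-apply logic of S208, gated by authorization, can be sketched as below. The measure registry, the status strings, and the function `rectify` are hypothetical; the measure here is a stand-in callable, and no real IPMI or Redfish command is issued.

```python
from typing import Callable, Dict


def rectify(problem: str,
            target: str,
            measures: Dict[str, Callable[[str], str]],
            authorized: bool) -> str:
    """Apply the server rectification measure registered for a problem.

    Nothing is done unless the first server has been authorized, and nothing
    is done when the database holds no measure for the problem.
    """
    if not authorized:
        return "skipped: first server is not authorized to rectify " + target
    measure = measures.get(problem)
    if measure is None:
        return "skipped: no server rectification measure for " + problem
    return measure(target)


# Hypothetical registry: rectification problem -> callable that adjusts the target.
measures = {"problem Q": lambda target: "applied measure Q to " + target}

result = rectify("problem Q", "server-c", measures, authorized=True)
```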
It should be noted that, in S208, rectification of the target server by the first server generally delays the time at which the problem (fault) occurs. For example, before rectification the target server has a potential risk of problem A and, running with its current configuration, is expected to encounter problem A within three months; after the first server rectifies it, the target server is not expected to encounter problem A within two years. The first server may not eliminate the potential risk of problem A entirely, but it postpones the time at which problem A may occur and gives the user sufficient time to upgrade or update the server.
The reason rectification by the first server generally only delays the problem is that the first server usually rectifies the target server by adjusting its configuration. For example, if there is a potential risk that the server cannot run normally because its workload is too high, the first server modifies the configuration of the target server through IPMI commands to shut down some of its functions. If, however, the problem can be solved completely only by upgrading the target server (e.g., a firmware upgrade), modifying the configuration alone is not enough: the target server may, for example, need network access to download the corresponding upgrade package and then upgrade itself with that package. The first server cannot control the network access of the target server, and connecting the target server to the network may affect the security of the data stored on it. To protect that data, the user may separately authorize the target server to access the network, or may copy the upgrade package to the target server through an authorized device such as a USB flash drive, so as to upgrade the target server.
To facilitate understanding of the embodiments of this application, the server fault early warning method is illustrated below with reference to FIG. 3 to FIG. 6, where server A is the first server.
First, the scenario shown in FIG. 3 comprises server A, server B, server C, server D, server E, a switch, and a router. Server A, server B, server C, and server D are directly connected to the switch, and server E is connected to the router and, through the router, to the switch. Through the ping program, the second servers in the same network segment as server A are determined to be server B, server C, and server D (as shown in FIG. 4, server A, server B, server C, and server D are in the same network segment).
After the user authorizes server A to obtain information of the second servers, server A obtains the manufacturer identifiers of server B, server C, and server D through the IPMI interface (the manufacturer identifier of server B is a, that of server C is a, that of server D is d, and that of server A itself is a). Server A compares the obtained manufacturer identifiers with its own to determine which of the second servers (server B, server C, and server D) are target servers from the same manufacturer as server A. The target servers are server B and server C (as shown in FIG. 5, server A, server B, and server C belong to the same manufacturer).
Server A obtains the configuration information of server B and server C, compares it with the rectification early warning database, and judges whether server B and server C involve a rectification problem. Taking the rectification early warning database shown in FIG. 5 as an example, the rectification problem is problem Q, the server versions involved in problem Q are version b1, version b2, and version c1, and the server rectification measure corresponding to problem Q is server rectification measure Q. If the obtained version of server B is version b3 and the obtained version of server C is version c1, it is determined that server C involves the rectification problem and server B does not.
As shown in FIG. 6, server A generates a risk alarm for server C; the risk alarm carries the device IP of server C and problem Q, which server C involves. Server A sends the risk alarm to server B and server C to prompt the user that server C has a potential risk of problem Q.
After server A sends the risk alarm to server B and server C, the user authorizes server A to rectify servers. Server A then applies server rectification measure Q, which corresponds to problem Q in the rectification early warning database, to server C.
To further facilitate understanding of the server fault early warning method provided in the embodiments of this application, the method is described below by way of example with reference to the flowchart shown in FIG. 7, on the basis of FIG. 3 to FIG. 6. The first server is server A, the configuration information of a server is exemplified by the server version, and the rectification early warning database is the one shown in FIG. 5.
Using the ping program, server A pings its own network segment and determines that the second servers are server B, server C, and server D.
It is determined whether the user has authorized server A to obtain information of the second servers. If so, server A obtains the manufacturer identifiers of server B, server C, and server D through the IPMI interface; if not, server A waits until the user grants the authorization and then obtains the manufacturer identifiers of server B, server C, and server D through the IPMI interface.
According to the obtained manufacturer identifiers of server B, server C, and server D, server A determines that the target servers are server B and server C.
Server A obtains the configuration information of server B and server C through the IPMI interface.
Server A compares the obtained configuration information of server B and server C with the rectification early warning database and judges whether server B and/or server C involve a rectification problem. It is determined that server C involves a rectification problem.
Server A generates a risk alarm for server C. Taking the risk alarm shown in FIG. 6 as an example, the risk alarm carries the device IP of server C and problem Q, which server C involves.
Server A sends the risk alarm to server B and server C to prompt the user that server C has a potential risk of problem Q.
It is determined whether the user has authorized server A to rectify server C. If so, server A applies server rectification measure Q, which corresponds to problem Q in the rectification early warning database, to server C; if not, server A waits until the user grants the authorization and then applies server rectification measure Q to server C. Server A thereby rectifies server C.
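The worked example of FIG. 7 can be traced end to end with a small simulation. The manufacturer identifiers, the versions of server B and server C, and the contents of the database for problem Q follow the example above; the data layout, the function `run_early_warning`, and the version "d1" given to server D are assumptions of this sketch (the text does not state server D's version). Authorization and network access are left out.

```python
from typing import Dict, List, Set


def run_early_warning(first_vendor: str,
                      second_servers: Dict[str, Dict[str, str]],
                      database: Dict[str, Set[str]]) -> Dict[str, object]:
    """Filter by manufacturer, match versions against the database, and
    decide which target servers receive the risk alarm."""
    targets = {name: info for name, info in second_servers.items()
               if info["vendor"] == first_vendor}
    affected: Dict[str, List[str]] = {}
    for name, info in targets.items():
        problems = [problem for problem, versions in database.items()
                    if info["version"] in versions]
        if problems:
            affected[name] = problems
    # In the worked example the alarm goes to every target server.
    recipients = sorted(targets) if affected else []
    return {"targets": sorted(targets), "affected": affected, "recipients": recipients}


second_servers = {
    "B": {"vendor": "a", "version": "b3"},
    "C": {"vendor": "a", "version": "c1"},
    "D": {"vendor": "d", "version": "d1"},  # version of D is hypothetical
}
database = {"Q": {"b1", "b2", "c1"}}
outcome = run_early_warning("a", second_servers, database)
```

In this run, server D is dropped for belonging to another manufacturer, server B is a target but its version b3 is not listed for problem Q, and server C is the only server that involves problem Q.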
The embodiments of this application provide a server fault early warning method, which includes: determining the second servers that are in the same network segment as the first server; determining, among the second servers, the target servers that belong to the same manufacturer as the first server; obtaining the configuration information of the target servers; comparing the obtained configuration information with the rectification early warning database to judge whether any target server involves a rectification problem; when a target server involves a rectification problem, generating a risk alarm according to that target server; and sending the risk alarm to the target servers to prompt the user that a target server has a potential risk of the rectification problem. There is no need to wait until a target server has failed and then respond passively, which removes the lag caused by passive fault response: problems that a server may develop are warned of in time and guarded against in advance.
Further, with the rectification notice information carried in the risk alarm, the complete rectification notice for the rectification problem can be obtained at the target server, and the target server can be upgraded or rectified in advance according to that notice before the problem occurs. This avoids the rectification problem and therefore avoids its impact on service operation and availability.
Further, the rectification early warning database also includes the server rectification measures corresponding to the rectification problems, and the first server can determine, in the rectification early warning database, the server rectification measure corresponding to the rectification problem involved by the target server. The first server rectifies the target server with that measure, so that the time at which the target server may encounter the rectification problem is postponed and the user is given sufficient time to upgrade or update the server.
Furthermore, the first server needs to be authorized before other server configuration information is acquired and other servers are modified, so that the security of data in the servers is ensured.
Embodiment two:
Next, the server fault early warning method provided in the embodiment of the present application is described in detail with reference to FIG. 8. The method is applied to a first server.
It should be noted that the first server may be the latest-version server in the server cluster or any other server in the server cluster, which is not specifically limited in this application. When the first server is the latest-version server in the server cluster, servers with a fault risk in the cluster can be located and warned about more accurately and comprehensively.
In one possible implementation, the server fault early warning method provided by the embodiment of the application is started in response to a specific trigger condition. The trigger condition may be any of the following: after a server in the server cluster fails and is rectified, that server acts as the first server and is triggered to execute the method; the server cluster presets a cycle time, and the first server is triggered to execute the method periodically; or the method is triggered after a latest-version server is newly added to the server cluster. The trigger condition can be set according to the actual situation and is not specifically limited in this application.
S801, determining a target server corresponding to a first server based on a preset rule.
The preset rules are preset according to the server clusters and related information of the servers.
Illustratively, the preset rule may be that the target server and the first server are in the same network segment and belong to the same manufacturer. Alternatively, the preset rule may be only that the target server and the first server are in the same network segment, or only that they belong to the same manufacturer, which is not limited in this application. Optionally, the target server may also be entered or selected by the user.
In one possible implementation, the preset rule is that the target server and the first server are in the same network segment and belong to the same manufacturer. There may be one or more target servers, depending on the actual situation, which is not specifically limited in this application.
Being in the same network segment means that the first server and the target server communicate directly, without forwarding through a router/switch, which protects the data in the servers from being leaked during forwarding. The target server and the first server must belong to the same manufacturer because servers from different manufacturers differ in hardware configuration, firmware configuration, connections between hardware components, and so on. The problems that may occur, their localization, and their causes are therefore not interchangeable between manufacturers, and the server fault early warning method provided by the embodiment of the application could not be applied across them.
Specifically, one or more second servers which are in the same network segment as the first server are determined; obtaining manufacturer identifiers to which one or more second servers belong; and determining a target server belonging to the same manufacturer as the first server in the one or more second servers based on the manufacturer identification to which the one or more second servers belong.
Illustratively, the first server determines, using ping, one or more second servers that are in the same network segment as the first server.
Illustratively, the first server obtains from each of the one or more second servers, through the IPMI interface, the manufacturer identifier to which that second server belongs.
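The selection of target servers in S801 can be sketched as follows. This is a simplified illustration under stated assumptions: the ping discovery and the IPMI query are not performed here and are replaced by a pre-collected mapping from IP address to manufacturer identifier, and the function name `select_target_servers`, the prefix length parameter, and the sample addresses are hypothetical.

```python
import ipaddress
from typing import Dict, List


def select_target_servers(first_ip: str,
                          prefix_len: int,
                          first_vendor: str,
                          candidates: Dict[str, str]) -> List[str]:
    """Return candidate IPs in the first server's segment with the same vendor.

    `candidates` maps an IP address to a manufacturer identifier and stands in
    for the results of the ping discovery and the IPMI query (assumption).
    """
    # Derive the network segment of the first server from its address and prefix.
    segment = ipaddress.ip_network(f"{first_ip}/{prefix_len}", strict=False)
    targets = []
    for ip, vendor in candidates.items():
        if ip == first_ip:
            continue  # the first server is not its own target
        if ipaddress.ip_address(ip) not in segment:
            continue  # different network segment
        if vendor != first_vendor:
            continue  # different manufacturer
        targets.append(ip)
    return sorted(targets)
```

With a first server at `192.168.1.2/24` from manufacturer `VendorX`, only candidates inside `192.168.1.0/24` that also report `VendorX` are kept as target servers.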
S802, acquiring configuration information of a target server.
The configuration information of the target server includes the model of the target server and the firmware versions of the target server (e.g., the BMC, BIOS, and CPLD firmware versions).
Specifically, the first server obtains configuration information of the target server through the IPMI interface.
S803, based on the configuration information of the target server, determining whether a server to be rectified exists in the target server by using the rectification early warning database.
The rectification early warning database comprises rectification problems and server configuration information corresponding to the rectification problems.
Specifically, it is judged whether the configuration information of the target server exists in the rectification early warning database. When it does, it is determined that a server to be rectified exists in the target server; when it does not, it is determined that no server to be rectified exists in the target server. The presence of a target server's configuration information in the rectification early warning database indicates that the target server involves a rectification problem; such a target server is also called a server to be rectified.
Because the rectification early warning database stores the rectification problems together with the corresponding configuration information, the server to be rectified can be determined directly from them. This simplifies the determination, shortens the time it takes, and thus improves the timeliness of the early warning about servers at potential risk of a rectification problem.
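The lookup in S803 can be sketched as follows. The record layout is an assumption made for illustration: each database entry is taken to hold a problem identifier, a server model, a firmware component, and an affected firmware version, and each target server is described by a device identifier, a model, and a mapping from firmware component to version. The source does not specify these field names.

```python
from typing import Dict, List


def find_servers_to_rectify(targets: List[dict],
                            database: List[dict]) -> Dict[str, List[str]]:
    """Map each matching device identifier to the rectification problems it involves."""
    hits: Dict[str, List[str]] = {}
    for server in targets:
        for entry in database:
            same_model = server["model"] == entry["model"]
            # The server matches when it runs the affected version of the component.
            same_version = server["firmware"].get(entry["component"]) == entry["version"]
            if same_model and same_version:
                hits.setdefault(server["device_id"], []).append(entry["problem"])
    return hits
```

An empty result corresponds to the case in which no server to be rectified exists in the target server.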
S804, when it is determined that a server to be rectified exists in the target server, generating a risk alarm.
Specifically, the risk alarm is generated based on the device identifier of the server to be rectified and carries that device identifier, so that the user can subsequently identify exactly which server has a potential risk of the rectification problem.
Further, in one possible implementation, the rectification problem corresponding to the server to be rectified is determined based on the rectification early warning database, and the risk alarm is generated based on the device identifier of the server to be rectified and the rectification problem that it involves.
Further, in one possible implementation, the rectification early warning database also includes rectification notice information for the rectification problem. The rectification problem corresponding to the server to be rectified is determined based on the rectification early warning database; the rectification notice information for that rectification problem is determined based on the rectification early warning database; and the risk alarm is generated based on the device identifier of the server to be rectified, the corresponding rectification problem, and the rectification notice information. In this case, the risk alarm carries the device identifier of the server to be rectified, the corresponding rectification problem, and the rectification notice information.
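The three variants of the risk alarm in S804 (device identifier only; device identifier plus rectification problem; device identifier, rectification problem, and rectification notice information) can be sketched with one data structure. The names `RiskAlarm` and `build_risk_alarm`, and the representation of the notice information as a string looked up by problem identifier, are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RiskAlarm:
    device_id: str                      # always carried
    problem: Optional[str] = None       # carried in the second and third variants
    notice_info: Optional[str] = None   # carried in the third variant only


def build_risk_alarm(device_id: str,
                     problem: Optional[str] = None,
                     notices: Optional[Dict[str, str]] = None) -> RiskAlarm:
    # Look up notice information only when both a problem and a notice table exist.
    notice = notices.get(problem) if (notices and problem) else None
    return RiskAlarm(device_id=device_id, problem=problem, notice_info=notice)
```

Fields that are not available stay `None`, so a receiver can tell which variant of the alarm it was given.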
It should be noted that, when it is determined that no server to be rectified exists in the target server, it is determined that no server in the current server cluster is at risk of a previously recorded fault, and the event that no server to be rectified exists in the target server is recorded in the system event log.
S805, sending the risk alarm to the target server to prompt the user that the server to be rectified has a potential risk.
In one possible implementation, the risk alarm is sent to all target servers, prompting the user, by means of the device identifier of the server to be rectified carried in the risk alarm, that the server to be rectified has a potential risk. Because the risk alarm is sent to all target servers, the user can identify from the device identifier which server has a potential risk of the rectification problem, and the alarm also serves as a reminder for the other servers: the user can proactively carry out a second screening for fault risk with reference to the device identifier of the server to be rectified.
In one possible implementation, based on the device identifier of the server to be rectified carried in the risk alarm, the risk alarm is sent to the server to be rectified, prompting the user that this server has a potential risk. This pinpoints the server at potential risk of the rectification problem, and the user does not need to identify the server to be rectified from the device identifier.
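The two delivery options in S805 differ only in who receives the risk alarm. A minimal sketch, assuming servers are referred to by device identifier and that a boolean flag (hypothetical, named `broadcast` here) selects between the two implementations:

```python
from typing import Iterable, List


def alarm_recipients(target_ids: Iterable[str],
                     to_rectify_ids: Iterable[str],
                     broadcast: bool) -> List[str]:
    """Choose which target servers receive the risk alarm."""
    targets = list(target_ids)
    if broadcast:
        # First implementation: every target server receives the alarm.
        return targets
    # Second implementation: only the servers to be rectified receive it.
    wanted = set(to_rectify_ids)
    return [t for t in targets if t in wanted]
```

The actual transport used to deliver the alarm is not specified in this passage and is therefore left out of the sketch.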
Specifically, when the risk alarm carries the device identifier of the server to be rectified and the rectification problem that it involves, the risk alarm not only prompts the user that the server to be rectified has a potential risk but also explicitly gives the details of the rectification problem that may occur on it.
Specifically, when the risk alarm carries the device identifier of the server to be rectified, the corresponding rectification problem, and the rectification notice information, the risk alarm explicitly prompts the user that the server to be rectified has a potential risk of the rectification problem, gives the details of that problem, and carries the rectification notice information corresponding to it. The user can proactively use the rectification notice information to obtain the rectification notice issued for the rectification problem and rectify the target server in advance according to that notice (for example, by upgrading the firmware version), thereby avoiding the potential risk. Because the problem description and problem localization were already completed when the rectification notice was produced, repeated expenditure of human resources is reduced, the rectification cycle is shortened, and prolonged impact on the operation and availability of the service is avoided.
In one possible implementation, the rectification early warning database further includes server rectification measures corresponding to the rectification problems. A server rectification measure is a rectification measure that is executed by one server to rectify another server; that is, when one server applies the server rectification measure to another server, the other server is rectified. After S805, the method further includes: determining the rectification problem corresponding to the server to be rectified based on the rectification early warning database; obtaining the server rectification measure corresponding to the rectification problem based on the rectification early warning database; and rectifying the server to be rectified by using the corresponding server rectification measure. This can effectively delay the point at which the rectification problem may occur on the server to be rectified and gives the user sufficient time to upgrade or update the server.
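The chain of lookups after S805 (server to be rectified, then rectification problem, then server rectification measure) can be sketched as follows. The two dictionaries stand in for the rectification early warning database, and modelling a server rectification measure as a callable that receives the device identifier is an assumption for illustration.

```python
from typing import Callable, Dict, List


def rectify_server(device_id: str,
                   problems_by_device: Dict[str, List[str]],
                   measures_by_problem: Dict[str, Callable[[str], str]]) -> List[str]:
    """Apply every recorded server rectification measure to one server."""
    results = []
    for problem in problems_by_device.get(device_id, []):
        measure = measures_by_problem.get(problem)
        if measure is None:
            continue  # no server rectification measure recorded for this problem
        results.append(measure(device_id))
    return results
```

A problem without a recorded measure is skipped in this sketch; how such a case is handled in practice is not stated in the source.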
The embodiment of the application provides a server fault early warning method, which comprises the following steps: determining a target server corresponding to the first server based on a preset rule; acquiring configuration information of the target server; determining, based on the configuration information of the target server, whether a server to be rectified exists in the target server by using the rectification early warning database; when it is determined that a server to be rectified exists in the target server, generating a risk alarm; and sending the risk alarm to the target server to prompt the user that the server to be rectified has a potential risk. There is no need to wait until the target server fails and then respond passively, so the lag caused by passively responding to faults is eliminated, and the potential risk of a server problem is warned of in time so that it can be prevented in advance.
Further, the risk alarm can carry the rectification notice information corresponding to the rectification problem that the server to be rectified involves, so that the user can conveniently obtain the complete rectification notice for the rectification problem from that information and upgrade or rectify the target server in advance according to the notice before the problem occurs. This prevents the rectification problem from occurring on the target server and affecting the operation and availability of the service.
Embodiment three:
Next, the server fault early warning device provided in the embodiment of the present application is described in detail with reference to FIG. 9.
The server determining module 901 is configured to determine a target server corresponding to the first server based on a preset rule.
The configuration information obtaining module 902 is configured to obtain configuration information of the target server.
The server judging module 903 is configured to determine, based on the configuration information of the target server, whether a server to be rectified exists in the target server by using the rectification early warning database. The rectification early warning database includes rectification problems and the server configuration information corresponding to the rectification problems.
The risk alarm generating module 904 is configured to generate a risk alarm when it is determined that a server to be rectified exists in the target server.
The risk alarm sending module 905 is configured to send the risk alarm to the target server, so as to prompt the user that the server to be rectified has a potential risk.
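The division into modules 901 to 905 can be sketched as a small pipeline in which each module is a replaceable callable. This is only an illustration of the structure: the class name `FaultEarlyWarningDevice`, the field names, and the dictionary-based data shapes are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FaultEarlyWarningDevice:
    determine_targets: Callable[[], List[dict]]        # module 901
    get_config: Callable[[dict], dict]                 # module 902
    is_to_be_rectified: Callable[[dict], bool]         # module 903
    make_alarm: Callable[[List[str]], dict]            # module 904
    send_alarm: Callable[[dict, List[dict]], None]     # module 905

    def run(self) -> bool:
        """Run the modules in order; return True if a risk alarm was sent."""
        targets = self.determine_targets()
        flagged = [t["device_id"] for t in targets
                   if self.is_to_be_rectified(self.get_config(t))]
        if not flagged:
            return False  # no server to be rectified, so no alarm is generated
        self.send_alarm(self.make_alarm(flagged), targets)
        return True
```

Keeping each module behind a callable makes it straightforward to substitute, for example, a different judging rule for module 903 without touching the other modules.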
Optionally, the server determining module 901 includes a second server determining module, a manufacturer identifier obtaining module, and a target server determining module. The second server determining module is configured to determine one or more second servers that are in the same network segment as the first server; the manufacturer identifier obtaining module is configured to obtain the manufacturer identifiers to which the one or more second servers belong; and the target server determining module is configured to determine, among the one or more second servers, a target server belonging to the same manufacturer as the first server based on the manufacturer identifiers to which the one or more second servers belong.
Optionally, the server judging module 903 is specifically configured to judge whether the configuration information of the target server exists in the rectification early warning database; when the configuration information of the target server exists in the rectification early warning database, determine that a server to be rectified exists in the target server; and when the configuration information of the target server does not exist in the rectification early warning database, determine that no server to be rectified exists in the target server.
Optionally, the risk alarm generating module 904 is specifically configured to generate the risk alarm based on the device identifier of the server to be rectified; the risk alarm carries the device identifier of the server to be rectified.
Optionally, the risk alarm sending module 905 is specifically configured to send the risk alarm to all target servers, so as to prompt the user, based on the device identifier of the server to be rectified carried in the risk alarm, that the server to be rectified has a potential risk.
Optionally, the risk alarm sending module 905 is specifically configured to send the risk alarm to the server to be rectified based on the device identifier of the server to be rectified carried in the risk alarm, so as to prompt the user that the server to be rectified has a potential risk.
Optionally, the apparatus further comprises a rectification problem determining module, configured to determine the rectification problem corresponding to the server to be rectified based on the rectification early warning database. The risk alarm generating module 904 is specifically configured to generate the risk alarm based on the device identifier of the server to be rectified and the rectification problem that it involves; the risk alarm carries the device identifier of the server to be rectified and the rectification problem involved.
Optionally, the rectification early warning database further includes rectification notice information for the rectification problem, and the apparatus further comprises a rectification notice determining module. The rectification problem determining module is configured to determine the rectification problem corresponding to the server to be rectified based on the rectification early warning database; the rectification notice determining module is configured to determine, based on the rectification early warning database, the rectification notice information for the rectification problem corresponding to the server to be rectified; and the risk alarm generating module 904 is specifically configured to generate the risk alarm based on the device identifier of the server to be rectified, the corresponding rectification problem, and the rectification notice information. The risk alarm carries the device identifier of the server to be rectified, the corresponding rectification problem, and the rectification notice information.
Optionally, the rectification early warning database further includes server rectification measures corresponding to the rectification problems, and the apparatus further comprises a rectification measure determining module and a server rectifying module. The rectification problem determining module is configured to determine the rectification problem corresponding to the server to be rectified based on the rectification early warning database; the rectification measure determining module is configured to obtain the server rectification measure corresponding to the rectification problem based on the rectification early warning database; and the server rectifying module is configured to rectify the server to be rectified by using the corresponding server rectification measure.
The embodiment of the application provides a server fault early warning device, which comprises: the server determining module 901, configured to determine a target server corresponding to the first server based on a preset rule; the configuration information obtaining module 902, configured to obtain configuration information of the target server; the server judging module 903, configured to determine, based on the configuration information of the target server, whether a server to be rectified exists in the target server by using the rectification early warning database, where the rectification early warning database includes rectification problems and the server configuration information corresponding to the rectification problems; the risk alarm generating module 904, configured to generate a risk alarm when it is determined that a server to be rectified exists in the target server; and the risk alarm sending module 905, configured to send the risk alarm to the target server, so as to prompt the user that the server to be rectified has a potential risk. There is no need to wait until the target server fails and then respond passively, so the lag caused by passively responding to faults is eliminated, and the potential risk of a server problem is warned of in time so that it can be prevented in advance.
The embodiment of the application also provides computer equipment, which comprises a processor and a memory, wherein the processor is connected with the memory, the memory stores computer execution instructions, and the server fault early warning method in the embodiment is realized when the processor executes the computer execution instructions.
The embodiment of the application also provides a computer readable storage medium, and a computer program is stored on the computer readable storage medium, and when the computer program runs on a computer, the computer is caused to execute the server fault early warning method in the embodiment.
For the explanation of the relevant content and the description of the beneficial effects in any of the above-mentioned computer-readable storage media, reference may be made to the above-mentioned corresponding embodiments, and the description thereof will not be repeated here.
The embodiment of the application also provides a chip. The chip integrates a control circuit and one or more ports for implementing the functions of the server described above. Optionally, for the functions supported by the chip, reference may be made to the description above, which is not repeated here. Those of ordinary skill in the art will appreciate that all or some of the steps of the above-described embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable storage medium, such as a read-only memory or a random access memory. The processing unit or processor may be a central processing unit, a general-purpose processor, an application-specific integrated circuit (ASIC), a microprocessor or digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the methods of the above embodiments. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any usable medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., SSD), etc.
It should be noted that the devices for storing computer instructions or computer programs provided in the embodiments of the present application, such as, but not limited to, the above-mentioned memories, computer-readable storage media, and communication chips, are all non-volatile (non-transitory).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any usable medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Although the present application has been described herein in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the figures, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Although the present application has been described in connection with specific features and embodiments thereof, it will be apparent that various modifications and combinations can be made without departing from the spirit and scope of the application. Accordingly, the specification and drawings are merely exemplary illustrations of the present application as defined in the appended claims and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present application. It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. A server fault early warning method, applied to a first server, comprising:
determining a target server corresponding to the first server based on a preset rule;
acquiring configuration information of the target server;
determining, based on the configuration information of the target server, whether a server to be rectified exists in the target server by using a rectification early warning database; the rectification early warning database comprises rectification problems and server configuration information corresponding to the rectification problems;
generating a risk alarm when it is determined that the server to be rectified exists in the target server;
and sending the risk alarm to the target server to prompt a user that the server to be rectified has a potential risk.
2. The method of claim 1, wherein the determining, based on the configuration information of the target server, whether a server to be rectified exists in the target server by using a rectification early warning database comprises:
judging whether the configuration information of the target server exists in the rectifying and early-warning database based on the rectifying and early-warning database;
when the configuration information of the target server exists in the rectification early warning database, determining that a server to be rectified exists in the target server;
and when the configuration information of the target server does not exist in the rectification early warning database, determining that the server to be rectified does not exist in the target server.
3. The method of claim 1, wherein the generating a risk alarm comprises:
generating a risk alarm based on the equipment identifier of the server to be rectified; and the risk alarm carries the equipment identifier of the server to be rectified.
4. The method according to claim 3, wherein the sending the risk alarm to the target server comprises:
sending the risk alarm to all target servers, so as to prompt a user, based on the equipment identifier of the server to be rectified carried by the risk alarm, that the server to be rectified has a potential risk.
5. The method according to claim 3, wherein the sending the risk alarm to the target server comprises:
sending the risk alarm to the server to be rectified based on the equipment identifier of the server to be rectified carried by the risk alarm, so as to prompt a user that the server to be rectified has a potential risk.
6. The method of claim 1, wherein the generating a risk alarm comprises:
determining a rectification problem corresponding to the server to be rectified based on the rectification early warning database;
generating a risk alarm based on the equipment identifier of the server to be rectified and the rectification problem related to the server to be rectified; and the risk alarm carries the equipment identifier of the server to be rectified and the related rectifying problem.
7. The method of claim 1, wherein the rectification early warning database further comprises: the rectification notice information of the rectification problem;
the generating a risk alarm comprises:
determining a rectification problem corresponding to the server to be rectified based on the rectification early warning database;
determining, based on the rectification early warning database, the rectification notice information of the rectification problem corresponding to the server to be rectified;
generating a risk alarm based on the equipment identifier of the server to be rectified, the corresponding rectification problem and the rectification notice information; and the risk alarm carries the equipment identifier of the server to be rectified, the corresponding rectification problem and the rectification notice information.
8. The method of claim 1, wherein the rectification early warning database further comprises server rectification measures corresponding to the rectification problems; and the method further comprises:
determining, based on the rectification early warning database, a rectification problem corresponding to the server to be rectified;
obtaining, based on the rectification early warning database, the server rectification measure corresponding to the rectification problem; and
rectifying the server to be rectified by using the corresponding server rectification measure.
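The three steps of claim 8 can be sketched as two lookups and one call. The sketch assumes that a rectification measure can be represented as a callable taking the equipment identifier; the two mappings and the name `rectify_server` are hypothetical, and how a real measure would act on the server (for example a firmware update) is outside this illustration.

```python
from typing import Callable, Mapping

# A measure takes an equipment identifier and reports whether it succeeded.
Measure = Callable[[str], bool]


def rectify_server(
    device_id: str,
    problems_by_device: Mapping[str, str],
    measures_by_problem: Mapping[str, Measure],
) -> bool:
    """Look up the problem, look up its measure, then apply the measure.

    Returns False when no problem or no measure is recorded, otherwise
    whatever the measure itself reports.
    """
    problem = problems_by_device.get(device_id)
    if problem is None:
        return False
    measure = measures_by_problem.get(problem)
    if measure is None:
        return False
    return measure(device_id)
```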
9. The method according to any one of claims 1-8, wherein the preset rule comprises: the target server and the first server are in the same network segment, and the target server and the first server belong to the same manufacturer; and the determining, based on the preset rule, a target server corresponding to the first server comprises:
determining, based on a ping program, one or more second servers that are in the same network segment as the first server;
acquiring manufacturer identifiers of the one or more second servers through an IPMI interface; and
determining, among the one or more second servers, a target server that belongs to the same manufacturer as the first server, based on the manufacturer identifiers of the one or more second servers.
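The discovery procedure in claim 9 can be sketched as a sweep over the first server's network segment followed by a manufacturer filter. To keep the sketch self-contained, the reachability probe and the IPMI query are passed in as callables: `ping` and `get_manufacturer_id` are hypothetical stand-ins for a real ping program and a real IPMI client, and the prefix length and the integer manufacturer identifiers are likewise assumptions for illustration.

```python
import ipaddress
from typing import Callable, List, Optional


def hosts_in_segment(first_server_ip: str, prefix_len: int) -> List[str]:
    """List the other host addresses in the first server's network segment."""
    network = ipaddress.ip_network(f"{first_server_ip}/{prefix_len}", strict=False)
    return [str(host) for host in network.hosts() if str(host) != first_server_ip]


def find_target_servers(
    first_server_ip: str,
    prefix_len: int,
    first_manufacturer_id: int,
    ping: Callable[[str], bool],
    get_manufacturer_id: Callable[[str], Optional[int]],
) -> List[str]:
    """Return reachable servers in the segment made by the same manufacturer."""
    # Step 1: second servers are the hosts in the segment that answer a ping.
    second_servers = [
        ip for ip in hosts_in_segment(first_server_ip, prefix_len) if ping(ip)
    ]
    # Steps 2 and 3: read each manufacturer identifier and keep the matches.
    return [
        ip for ip in second_servers
        if get_manufacturer_id(ip) == first_manufacturer_id
    ]
```

Injecting the two callables keeps the selection logic testable without network access; a deployment would replace them with calls to whatever ping and IPMI tooling is available.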
10. A computing device, comprising: a processor, and a memory communicatively coupled to the processor, wherein
the memory is configured to store computer-executable instructions; and
the processor is configured to execute the computer-executable instructions stored in the memory to implement the server fault early warning method of any one of claims 1-9.
CN202311353442.XA 2023-10-18 2023-10-18 Server fault early warning method and computing device Pending CN117539729A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311353442.XA CN117539729A (en) 2023-10-18 2023-10-18 Server fault early warning method and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311353442.XA CN117539729A (en) 2023-10-18 2023-10-18 Server fault early warning method and computing device

Publications (1)

Publication Number Publication Date
CN117539729A true CN117539729A (en) 2024-02-09

Family

ID=89790806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311353442.XA Pending CN117539729A (en) 2023-10-18 2023-10-18 Server fault early warning method and computing device

Country Status (1)

Country Link
CN (1) CN117539729A (en)

Similar Documents

Publication Publication Date Title
US11057266B2 (en) Identifying troubleshooting options for resolving network failures
US10474519B2 (en) Server fault analysis system using event logs
CN100417081C (en) Method, system for checking and repairing a network configuration
US10824521B2 (en) Generating predictive diagnostics via package update manager
US10430257B2 (en) Alarms with stack trace spanning logical and physical architecture
US8209564B2 (en) Systems and methods for initiating software repairs in conjunction with software package updates
US9354961B2 (en) Method and system for supporting event root cause analysis
US20120166605A1 (en) Remote Management Systems and Methods for Servers
CN110178121B (en) Database detection method and terminal thereof
JP5542398B2 (en) Root cause analysis result display method, apparatus and system for failure
US9780836B2 (en) Server information handling system NFC management sideband feedback
CN104360878A (en) Method and device for deploying application software
WO2020000758A1 (en) Server acceptance method and apparatus, computer device, and storage medium
CN105450472A (en) Method and device for automatically acquiring states of physical components of servers
US20180321977A1 (en) Fault representation of computing infrastructures
US20180196708A1 (en) System management apparatus and system management method
US9870234B2 (en) Automatic identification of returned merchandise in a data center
CN117539729A (en) Server fault early warning method and computing device
US9798608B2 (en) Recovery program using diagnostic results
CN114281353A (en) Avoiding platform and service outages using deployment metadata
TWI685736B (en) Method for remotely clearing abnormal status of racks applied in data center
EP4156628A1 (en) Tracking and reporting faults detected on different priority levels
BR112016020189B1 (en) METHOD AND RESOLUTION SYSTEM THAT FACILITIES RESOLUTION OF NETWORK FAILURES IN A DATA CENTER

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination