CN113835962A

CN113835962A - Server fault detection method and device, computer equipment and storage medium

Info

Publication number: CN113835962A
Application number: CN202111121516.8A
Authority: CN
Inventors: 杨柳; 赖一鹏; 刘毅枫
Original assignee: Chaoyue Technology Co Ltd
Current assignee: Chaoyue Technology Co Ltd
Priority date: 2021-09-24
Filing date: 2021-09-24
Publication date: 2021-12-24

Abstract

The invention discloses a server fault detection method and device, computer equipment, and a storage medium. The method comprises: acquiring current operation data of a server; preprocessing the current operation data to obtain preprocessed operation data; inputting the preprocessed operation data into a pre-trained prediction model, wherein the pre-trained prediction model is obtained by training based on a random forest algorithm and represents the correspondence between operation data and fault types; and determining, according to the output of the pre-trained prediction model, whether the current server has a fault and the fault type of that fault. The scheme not only determines whether the server has a fault but also diagnoses the fault type when a fault exists, offers good accuracy and fault tolerance, greatly facilitates the operation and maintenance of the server, and improves the stability of server operation.

Description

Server fault detection method and device, computer equipment and storage medium

Technical Field

The present invention relates to the field of server technologies, and in particular, to a server fault detection method and apparatus, a computer device, and a storage medium.

Background

A server is a computer with fast operation, high load capacity, and strong performance. Long-term operation is an important performance index of a server, and monitoring the operating state of the server is an important means of ensuring its long-term reliable operation. Once the server fails to operate normally, operations such as a reset need to be performed on it by means of a server remote controller (such as a Baseboard Management Controller, BMC for short).

At present, traditional server fault monitoring commonly sets an early-warning value for each monitored parameter; if an operating parameter of the server is found to exceed its early-warning value, the server is deemed to have failed. Alternatively, statistical analysis is performed on the operation data of the server to obtain an evaluation of its operating state. However, the existing server monitoring methods have the following disadvantages. First, the fault type cannot be accurately determined: only whether a fault has occurred can be judged, while the fault type, the faulty device, and the cause of the fault cannot be located, which greatly inconveniences fault repair and maintenance of the server. Second, fault tolerance is low: when operation data is missing or a short-lived abnormality occurs, these methods misjudge or cannot determine whether a fault exists. Third, the amount of data to be processed is large and processing is time-consuming, which increases the cost of fault monitoring. Therefore, the conventional server fault monitoring methods need to be improved.

Disclosure of Invention

In view of the above, it is desirable to provide a server failure detection method, device, computer device and storage medium.

According to a first aspect of the present invention, there is provided a server failure detection method, the method comprising:

acquiring current operation data of a server;

preprocessing the current operation data to obtain preprocessed operation data;

inputting the preprocessed operation data into a pre-trained prediction model, wherein the pre-trained prediction model is obtained based on random forest algorithm training and is used for representing the corresponding relation between the operation data and the fault type;

and determining whether the current server fails or not and the fault type of the current server according to the output of the pre-trained prediction model.

In some embodiments, the method further comprises:

in response to the current server having a fault, generating an alarm record based on the fault type of the fault;

and displaying the alarm record through a Web page.

In some embodiments, the method further comprises:

in response to the current server having a fault, monitoring the faulty server to determine whether it is down;

and in response to the faulty server being down, resetting the faulty server.

In some embodiments, the current operating data includes at least one of: the voltage of at least one component on the mainboard, the current of at least one component on the mainboard, and the memory utilization rate of the central processing unit.

In some embodiments, the step of preprocessing the current operation data to obtain preprocessed operation data includes:

and carrying out normalization processing on the current operation data, and taking the data after the normalization processing as the operation data after the preprocessing.

In some embodiments, the method further comprises:

constructing a sample set by using historical operating data of a server which is marked with a fault type label in advance;

and configuring the number of decision trees, the depth of each decision tree, the number of features used by each node, the iteration termination conditions, the minimum number of samples on each node, and the minimum information gain on each node.

In some embodiments, the pre-trained predictive model is obtained by a random forest training process and a random forest testing process;

the training process of the random forest comprises the following steps: drawing training samples with replacement from the constructed sample set, randomly selecting a root node, and training from the root node using the training sample set until all nodes have been trained, so as to obtain a prediction model with the required parameters;

the testing process of the random forest comprises the following steps: inputting test samples into the prediction model with the required parameters, evaluating the output of the model using the Gini value, and adjusting the parameters of the prediction model based on the Gini value to obtain the pre-trained prediction model.

According to a second aspect of the present invention, there is provided a server failure detection apparatus, the apparatus comprising:

the data acquisition module is configured to acquire current operating data of the server;

the preprocessing module is configured to preprocess the current operating data to obtain preprocessed operating data;

the prediction module is configured to input the preprocessed operation data into a pre-trained prediction model, wherein the pre-trained prediction model is obtained based on random forest algorithm training and is used for representing the corresponding relation between the operation data and the fault type;

and the fault determining module is configured to determine whether the current server has a fault and the fault type of the fault according to the output of the pre-trained prediction model.

According to a third aspect of the present invention, there is also provided a computer apparatus comprising:

at least one processor; and

a memory storing a computer program executable on the processor, wherein the processor performs the aforementioned server fault detection method when executing the program.

According to a fourth aspect of the present invention, there is also provided a computer-readable storage medium storing a computer program which, when executed by a processor, performs the aforementioned server failure detection method.

According to the server fault detection method, the operation data of the server is acquired online while services are running, preprocessed, and then input into the pre-trained prediction model obtained by training based on the random forest algorithm; whether the current server has a fault, and the fault type of that fault, are determined according to the output of the pre-trained prediction model. The method thus not only determines whether the server has a fault but also diagnoses the fault type when a fault exists, has good accuracy and fault tolerance, greatly facilitates the operation and maintenance of the server, and improves the stability of server operation.

In addition, the invention also provides a server fault detection device, a computer device, and a computer-readable storage medium, which achieve the same technical effects; these are not described again here.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other embodiments from these drawings without creative effort.

FIG. 1 is a flowchart of a server failure detection method 100 according to an embodiment of the present invention;

FIG. 2 is a flowchart of another server failure monitoring method 200 according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a server failure detection apparatus 300 according to another embodiment of the present invention;

FIG. 4 is an internal structural diagram of a computer device according to another embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two non-identical entities or parameters that share the same name. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.

In one embodiment, referring to FIG. 1, the present invention provides a server failure detection method 100, where the method 100 includes the following steps:

S101, obtaining current operation data of the server.

In a specific implementation, the ways of acquiring the operation data of the server include, but are not limited to, acquisition through server management software, sensors, or a baseboard management controller. The operation data may be the memory occupancy rate of the CPU, or the voltage, current, and the like of a chip or functional module on the motherboard. Preferably, the current operating data includes at least one of: the voltage of at least one component on the mainboard, the current of at least one component on the mainboard, and the memory utilization rate of the central processing unit.

S102, preprocessing the current operation data to obtain preprocessed operation data.

S103, inputting the preprocessed operation data into a pre-trained prediction model, wherein the pre-trained prediction model is obtained based on random forest algorithm training and is used for representing the corresponding relation between the operation data and the fault type;

S104, determining whether the current server has a fault and the fault type of the fault according to the output of the pre-trained prediction model.
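As a non-limiting illustration, steps S101 to S104 can be sketched in Python as follows. The function names, the label-to-fault mapping, and the stand-in data source and model are assumptions made for illustration only and are not part of the disclosed method; a trained random forest would take the place of the `model` callable.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical label convention (0 = no fault); the concrete mapping is an
# assumption and can be set freely in a real implementation.
FAULT_TYPES: Dict[int, str] = {0: "no fault", 1: "CPU fault", 2: "memory fault"}


def detect_fault(
    acquire: Callable[[], List[float]],
    preprocess: Callable[[List[float]], List[float]],
    model: Callable[[List[float]], int],
) -> Tuple[bool, Optional[str]]:
    """Run S101-S104 once and return (has_fault, fault_type)."""
    raw = acquire()               # S101: current operation data
    features = preprocess(raw)    # S102: preprocessing
    label = model(features)       # S103: pre-trained prediction model
    if label == 0:                # S104: interpret the model output
        return False, None
    return True, FAULT_TYPES.get(label, f"unknown fault ({label})")


if __name__ == "__main__":
    # Stand-ins for a real data source and a trained model (assumptions).
    def fake_acquire() -> List[float]:
        return [0.62, 0.40, 0.97]   # e.g. voltage, current, memory utilization

    def identity(values: List[float]) -> List[float]:
        return values               # placeholder for real preprocessing

    def toy_model(values: List[float]) -> int:
        return 2 if values[2] > 0.9 else 0

    print(detect_fault(fake_acquire, identity, toy_model))
```

Passing the three stages in as callables keeps the sketch independent of how the data is collected and of which library provides the trained model.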

In some embodiments, to facilitate management of the server and to allow a user or operation and maintenance personnel to discover a fault in time, the method further includes:

in response to the current server having a fault, generating an alarm record based on the fault type of the fault; and displaying the alarm record through a Web page.
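One possible shape for such an alarm record is sketched below. The field names (`server_id`, `fault_type`, `raised_at`) and the JSON serialization are assumptions for illustration; the description does not specify the record layout or how the Web page receives it.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AlarmRecord:
    server_id: str    # hypothetical identifier of the monitored server
    fault_type: str   # fault type reported by the prediction model
    raised_at: str    # ISO-8601 timestamp of when the alarm was created


def make_alarm(server_id: str, fault_type: str,
               now: Optional[datetime] = None) -> AlarmRecord:
    """Create an alarm record for a detected fault."""
    now = now or datetime.now(timezone.utc)
    return AlarmRecord(server_id, fault_type, now.isoformat())


def to_json(record: AlarmRecord) -> str:
    """Serialize the record so that a Web page could fetch and display it."""
    return json.dumps(asdict(record))
```

A Web front end could then render the JSON produced by `to_json(make_alarm("srv-01", "memory fault"))` as one row of an alarm list.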

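In some embodiments, for fault problems from which the server cannot recover automatically, the server may be restored to normal by restarting it, and the method further includes:

in response to the current server having a fault, monitoring the faulty server to determine whether it is down; and in response to the faulty server being down, resetting the faulty server.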

In some embodiments, the method further comprises:

constructing a sample set from historical operation data of a server that has been labeled in advance with fault type labels; and configuring the number of decision trees, the depth of each decision tree, the number of features used by each node, the iteration termination conditions, the minimum number of samples on each node, and the minimum information gain on each node.
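The configurable parameters listed above could, for example, be grouped in a single structure. The class and field names below are hypothetical, and the validation rules are assumptions about reasonable ranges rather than requirements stated in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForestConfig:
    n_trees: int            # number of decision trees
    max_depth: int          # depth of each decision tree
    features_per_node: int  # number of features used by each node
    min_samples: int        # minimum number of samples on a node
    min_gain: float         # minimum information gain on a node

    def validate(self, n_features: int) -> None:
        """Raise ValueError if the settings cannot work for n_features inputs."""
        if self.n_trees < 1 or self.max_depth < 1:
            raise ValueError("n_trees and max_depth must be at least 1")
        if not 1 <= self.features_per_node <= n_features:
            raise ValueError("features_per_node must be between 1 and n_features")
        if self.min_samples < 1 or self.min_gain < 0:
            raise ValueError("min_samples must be >= 1 and min_gain >= 0")
```

Freezing the dataclass means one configuration object can be shared by all trees without the risk of it being changed during training.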

In another embodiment, referring to FIG. 2, a flowchart of another server failure monitoring method 200 according to the present invention is shown, which specifically includes the following steps:

S201, the BMC obtains information about the key components of the server via I2C. The key components may be the memory, the central processing unit, etc.

S202, the health information data is processed: numeric data is normalized, and part of the data is labeled with the current server state, where the state is encoded as a numeric value representing either no fault or one of the fault types. For example, the value "0" may indicate that the server has no fault, the value "1" a fault in the central processing unit, and the value "2" a fault in the memory; the correspondence between values and fault types can be set freely in a specific implementation.
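The processing in S202 can be sketched as below. The description does not say which normalization is used, so min-max scaling to the range [0, 1] is an assumption here, as are the function names and the string keys of the label table; the numeric codes follow the example values given in S202.

```python
from typing import Dict, List, Sequence


def min_max_normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale every column to [0, 1]; a constant column maps to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Example encoding from S202: 0 = no fault, 1 = CPU fault, 2 = memory fault.
LABELS: Dict[str, int] = {"none": 0, "cpu": 1, "memory": 2}


def encode_labels(states: Sequence[str]) -> List[int]:
    """Map textual server states to the numeric labels used for training."""
    return [LABELS[state] for state in states]
```

For instance, `min_max_normalize([[12.0, 0.5], [11.0, 1.5], [13.0, 1.0]])` rescales a voltage column and a current column independently, so that features with different units become comparable.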

S203, predicting whether a fault exists in the normalized data through a random forest algorithm, wherein a fault detection part based on the random forest algorithm is divided into a training process and a prediction process of the random forest, and the training process of the random forest is as follows:

selecting part of the labeled data as a sample set S, wherein the dimensionality of each data item in the training set, i.e. the feature dimension, is F; the parameters to be determined include the number t of decision trees, the depth d of each tree, the number f of features used by each node, and the termination conditions: the minimum number s of samples on a node and the minimum information gain m on a node;

A. The training process of the random forest is as follows:

A1, drawing with replacement from the sample set S a training set S(i) of the same size as S, using the randomly drawn samples as the root node, and starting training from the root node;

A2, if the current node reaches the termination condition, setting the current node as a leaf node, wherein the predicted output of the leaf node is the class c(j) with the largest number of samples in the current node's sample set and the probability p is the proportion of c(j) in the current sample set, and then continuing to train the other nodes; if the current node does not reach the termination condition, randomly selecting f features from the F features without replacement, and using these f features to find the single feature k with the best classification effect and its threshold th, wherein the samples on the current node whose k-th feature is smaller than th are assigned to the left node and the rest are assigned to the right node; and then continuing to train the other nodes.

A3, repeat A2 until all nodes in a decision tree have been trained or marked as leaf nodes.

A4, repeat A1-A3 until all t decision trees in the random forest have been trained.
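Steps A1 to A4 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the decrease in Gini impurity stands in for the "information gain" compared against m, a pure node is also treated as terminated, and the dictionary layout of a node is invented for the sketch. The parameter names t, d, f, s and m follow the description.

```python
import random
from collections import Counter
from typing import List, Optional, Tuple

Sample = Tuple[List[float], int]  # (feature vector, fault-type label)


def gini(samples: List[Sample]) -> float:
    """Gini impurity of the labels in a sample set."""
    n = len(samples)
    counts = Counter(label for _, label in samples)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(samples: List[Sample],
               feature_ids: List[int]) -> Optional[Tuple[float, int, float]]:
    """Return (gain, feature k, threshold th) of the best split, or None."""
    parent, n, best = gini(samples), len(samples), None
    for k in feature_ids:
        for th in sorted({x[k] for x, _ in samples}):
            left = [smp for smp in samples if smp[0][k] < th]
            right = [smp for smp in samples if smp[0][k] >= th]
            if not left or not right:
                continue
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or gain > best[0]:
                best = (gain, k, th)
    return best


def build_node(samples, depth, d, f, s, m, rng) -> dict:
    split = None
    # Termination: depth limit d, fewer than s samples, or an already pure node.
    if depth < d and len(samples) >= s and gini(samples) > 0.0:
        n_features = len(samples[0][0])
        chosen = rng.sample(range(n_features), f)  # f of F, without replacement
        split = best_split(samples, chosen)
    if split is None or split[0] < m:
        # Leaf: majority class c(j) and its proportion p in this node's samples.
        label, count = Counter(lbl for _, lbl in samples).most_common(1)[0]
        return {"leaf": True, "label": label, "p": count / len(samples)}
    _, k, th = split
    left = [smp for smp in samples if smp[0][k] < th]
    right = [smp for smp in samples if smp[0][k] >= th]
    return {"leaf": False, "k": k, "th": th,
            "left": build_node(left, depth + 1, d, f, s, m, rng),
            "right": build_node(right, depth + 1, d, f, s, m, rng)}


def train_forest(S: List[Sample], t: int, d: int, f: int, s: int, m: float,
                 seed: int = 0) -> List[dict]:
    rng = random.Random(seed)
    forest = []
    for _ in range(t):                     # A4: one pass per decision tree
        S_i = [rng.choice(S) for _ in S]   # A1: draw |S| samples with replacement
        forest.append(build_node(S_i, 0, d, f, s, m, rng))  # A2-A3
    return forest
```

A call such as `train_forest(S, t=5, d=3, f=2, s=2, m=0.0)` returns a list of t nested dictionaries, one per tree. The recursion handles the left and right children in turn, which corresponds to "continuing to train the other nodes" in A2.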

B. The prediction process of the random forest is as follows:

B1, starting from the root node of the current decision tree, deciding according to the threshold th of the current node whether to enter the left node (the sample's value for the node's feature is smaller than th) or the right node (the value is greater than or equal to th), until a leaf node is reached, and outputting the prediction result;

B2, repeating B1 until all t decision trees have output predicted values, wherein the final predicted value is the class with the largest sum of predicted probabilities over all trees, i.e. the accumulated p for each c(j).
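Steps B1 and B2 can be illustrated with the following sketch. The nested-dictionary tree layout (`leaf`, `k`, `th`, `left`, `right`, `label`, `p`) and the two hand-written trees are assumptions for illustration; in practice the trees would come from the training process.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def predict_tree(node: dict, x: List[float]) -> Tuple[int, float]:
    """B1: walk from the root to a leaf and return (class, probability p)."""
    while not node["leaf"]:
        # Feature value below th goes left, otherwise right.
        node = node["left"] if x[node["k"]] < node["th"] else node["right"]
    return node["label"], node["p"]


def predict_forest(forest: List[dict], x: List[float]) -> int:
    """B2: accumulate p per class over all trees and return the largest sum."""
    totals: Dict[int, float] = defaultdict(float)
    for tree in forest:
        label, p = predict_tree(tree, x)
        totals[label] += p
    return max(totals, key=totals.get)


def leaf(label: int, p: float) -> dict:
    return {"leaf": True, "label": label, "p": p}


# Two hand-written trees, purely for illustration.
tree_a = {"leaf": False, "k": 0, "th": 0.5, "left": leaf(0, 0.9), "right": leaf(2, 0.8)}
tree_b = {"leaf": False, "k": 1, "th": 0.3, "left": leaf(1, 0.6), "right": leaf(2, 0.7)}
```

With these trees, the sample `[0.7, 0.4]` reaches a class-2 leaf in both trees, while `[0.2, 0.1]` yields class 0 with p = 0.9 from the first tree and class 1 with p = 0.6 from the second, so the accumulated probability selects class 0.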

B3, calculating the Gini value as the evaluation criterion using the following formula:

Gini = 1 - ∑(p(i) * p(i))

where p(i) is the proportion of class i in the current sample set. As in step A2, the predicted output of a leaf node is the class with the largest number of samples in the current sample set, and the probability p is the proportion of c(j) in the current sample set.
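As a worked example of the Gini formula, the short function below computes the value directly from a list of class labels. The function name is hypothetical; the arithmetic follows the formula above.

```python
from collections import Counter
from typing import Iterable


def gini_from_labels(labels: Iterable[int]) -> float:
    """Gini = 1 - sum of p(i)^2 over the classes present in labels."""
    items = list(labels)
    if not items:
        raise ValueError("labels must not be empty")
    n = len(items)
    return 1.0 - sum((count / n) ** 2 for count in Counter(items).values())
```

A set containing a single class gives 0.0, the lowest possible value. Two equally frequent classes give 1 - (0.25 + 0.25) = 0.5, and the labels `[0, 1, 2, 2]` give 1 - (0.0625 + 0.0625 + 0.25) = 0.625.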

A pre-trained prediction model based on the random forest algorithm is obtained by training in the above manner; the preprocessed data is then input into the model as a sample, and the predicted fault type is output.

S204, if a fault exists, an alarm record is formed and the fault type is fed back to a Web page for display.

S205, monitoring the state of the server to judge whether the server is down, and if the server is down, entering the step S206.

S206, when the state of the server is down, resetting the down server.
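The monitoring and reset logic of S205 and S206 can be sketched as a small polling loop. The callbacks `is_down` and `reset` are hypothetical placeholders for whatever mechanism the BMC exposes for reading the server state and triggering a reset; the number of checks and the polling interval are likewise assumptions.

```python
import time
from typing import Callable


def watch_and_reset(
    is_down: Callable[[], bool],
    reset: Callable[[], None],
    checks: int = 3,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """S205-S206: poll the server state and reset once if it is found down.

    Returns True if a reset was issued, False if the server stayed up
    for all checks.
    """
    for _ in range(checks):
        if is_down():          # S205: judge whether the server is down
            reset()            # S206: reset the down server
            return True
        sleep(interval_s)
    return False
```

Injecting `sleep` as a parameter keeps the loop testable without real delays, for example by passing `sleep=lambda seconds: None`.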

According to the server fault detection method, server information is obtained through the BMC, whether the server has a fault is analyzed and predicted through the random forest algorithm, the fault is fed back to a Web page for display, and the state of the server is monitored, which improves the stability of the server. The random forest can handle a large number of input variables without variable deletion, can evaluate which variables are most important for classification, and provides an effective method for estimating missing data, so that accuracy can still be maintained when a large portion of the data is missing.

In some embodiments, referring to FIG. 3, the present invention provides a server failure detection apparatus 300, which includes:

a data acquisition module 301 configured to acquire current operating data of the server;

a preprocessing module 302 configured to preprocess the current operating data to obtain preprocessed operating data;

the prediction module 303 is configured to input the preprocessed operating data into a pre-trained prediction model, wherein the pre-trained prediction model is obtained by training based on a random forest algorithm and is used for representing a corresponding relationship between the operating data and a fault type;

and a fault determining module 304 configured to determine, according to the output of the pre-trained prediction model, whether the current server has a fault and the fault type of the fault.

It should be noted that, for specific limitations of the server failure detection apparatus, reference may be made to the above limitations of the server failure detection method, and details are not described herein again. The modules in the server failure detection apparatus can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.

According to another aspect of the present invention, a computer device is provided. The computer device may be a server, and its internal structure is shown in FIG. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the server failure detection method described above, the method specifically comprising the following steps:

acquiring current operation data of a server; preprocessing the current operation data to obtain preprocessed operation data; inputting the preprocessed operation data into a pre-trained prediction model, wherein the pre-trained prediction model is obtained based on random forest algorithm training and is used for representing the corresponding relation between the operation data and the fault type; and determining whether the current server fails or not and the fault type of the current server according to the output of the pre-trained prediction model.

According to yet another aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the server failure detection method described above, in particular comprising performing the steps of:

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this specification.

The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for server failure detection, the method comprising:

acquiring current operation data of a server;

preprocessing the current operation data to obtain preprocessed operation data;

2. The server failure detection method according to claim 1, further comprising:

and displaying the alarm record through a Web page.

3. The server failure detection method according to claim 1, further comprising:

4. The server failure detection method of claim 1, wherein the current operational data comprises at least one of: the voltage of at least one component on the mainboard, the current of at least one component on the mainboard, and the memory utilization rate of the central processing unit.

5. The method according to claim 1, wherein the step of preprocessing the current operation data to obtain preprocessed operation data comprises:

6. The server failure detection method according to any one of claims 1 to 5, wherein the method further comprises:

7. The server fault detection method according to claim 6, wherein the pre-trained predictive model is obtained through a random forest training process and a random forest testing process;

8. An apparatus for server failure detection, the apparatus comprising:

9. A computer device, comprising:

at least one processor;

and a memory storing a computer program operable in the processor, the processor executing the program to perform the server failure detection method of any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to perform the server failure detection method according to any one of claims 1 to 7.