CN113094198A - Service fault positioning method and device based on machine learning and text classification

Info

Publication number: CN113094198A
Application number: CN202110392903.9A
Authority: CN
Inventors: 许璟亮; 廖鸿存; 皇甫晓洁; 周魁
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-04-13
Filing date: 2021-04-13
Publication date: 2021-07-09

Abstract

The invention discloses a service fault positioning method and a device based on machine learning and text classification, which can be used in the technical field of cluster fault positioning, wherein the method comprises the following steps: extracting operation and maintenance data; acquiring cluster log data and time sequence operation data in real time according to the operation and maintenance data; analyzing fault logs according to the cluster log data to obtain a monitoring index when the service runs, and obtaining a time-consuming index of service execution according to start-stop log information of service execution; according to the time sequence operation data, obtaining resource monitoring indexes of a container level and a service level; analyzing and judging the fault root by using a fault judgment model according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index to obtain a fault root analysis result; and analyzing error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result, and positioning the fault root cause.

Description

Service fault positioning method and device based on machine learning and text classification

Technical Field

The invention relates to the technical field of cluster fault positioning, in particular to a service fault positioning method and device based on machine learning and text classification.

Background

In the prior art, fault location usually relies on index monitoring combined with manual analysis and judgment. For example, the running memory and CPU of a server are monitored, an alarm is triggered when a threshold is exceeded, and engineers then intervene manually to analyze and repair the fault. This approach has at least the following disadvantages. Fault location is slow: manual intervention is usually needed to classify, locate and judge the cause of the fault. The detection rate is low: a fixed index monitoring algorithm cannot cover all scenarios, so the fault detection rate is limited; CPU index monitoring, for example, can typically only use static thresholds. The false alarm rate is high: transient spikes (glitches) in memory or CPU usage cannot be effectively identified, so fault detection produces many false alarms.

In view of the above, a fault location scheme that can overcome the defects in the prior art and has the advantages of high location speed, high detection rate and low false alarm rate is needed.

Disclosure of Invention

In view of the defects of existing fault location, the invention provides a service fault location method and device based on machine learning and text classification. It aims to solve the low fault detection rate and slow problem location that arise because the traditional approach relies heavily on manual intervention. Through independently developed processing methods such as data acquisition, data preprocessing, prediction, and monitoring and alarming, the invention performs data acquisition, preprocessing, fault analysis, fault classification and other operations on real-time service information, so that system faults can be located and classified rapidly and the accuracy and effectiveness of system alarms are improved.

Specifically, in a first aspect of the embodiments of the present invention, a method for locating a service fault based on machine learning and text classification is provided, where the method includes:

extracting operation and maintenance data;

acquiring cluster log data and time sequence operation data in real time according to the operation and maintenance data;

analyzing fault logs according to the cluster log data to obtain a monitoring index when the service runs, and obtaining a time-consuming index of service execution according to start-stop log information of service execution;

according to the time sequence operation data, obtaining resource monitoring indexes of a container level and a service level;

analyzing and judging the fault root by using a fault judgment model according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index to obtain a fault root analysis result;

and analyzing error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result, and positioning the fault root cause.

Further, the extracted operation and maintenance data at least comprises: application information, node information and log information;

the method further comprises the following steps:

and formatting the application information, the node information and the log information.

Further, acquiring cluster log data and time-series operation data in real time according to the operation and maintenance data includes:

obtaining log information flow in real time and storing the log information flow into an ES cluster to obtain cluster log data;

and acquiring the information of the CPU, the memory and the disk IO of the operation container in real time.

Further, the monitoring indexes of the service at runtime at least include: request count, request success rate, request accuracy rate, request response time and error information.

Further, the resource monitoring indexes of the container level and the service level at least include: resource monitoring indexes for the container CPU, the container memory and the host IO.

Further, according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index, analyzing and judging the fault root by using a fault judgment model to obtain a fault root analysis result, including:

establishing a fault judgment model by using a naive Bayes classification algorithm in machine learning, taking historical data of the service runtime monitoring index, the service execution time consumption index and the resource monitoring index as input features, taking the judgment result as the output feature, and training the fault judgment model; the fault judgment model is a supervised machine learning model used for multi-factor prediction judgment;

and taking the monitoring index, the service execution time consumption index and the resource monitoring index of the newly generated service during operation as input characteristics, and performing fault judgment by using a fault judgment model to obtain the fault occurrence probability.

Further, the method further comprises:

and when the model judges that the fault occurrence probability is greater than the preset value, correcting the judgment with the help of the CPU, the memory and the service execution success rate to obtain a corrected fault root cause analysis result.

Further, according to the fault root cause analysis result, analyzing error field information in log information by using a natural language classification algorithm, and positioning the fault root cause, including:

taking historical error information as an input value to carry out data annotation, and constructing a shallow network model;

inputting new error information into the shallow network model, analyzing error fields in the log information by adopting a natural language classification algorithm of the shallow network model to obtain a classification judgment result, and positioning the fault root.

Specifically, in a second aspect of the embodiments of the present invention, a service fault location apparatus based on machine learning and text classification is provided, the apparatus including:

the data extraction module is used for extracting operation and maintenance data;

the real-time data acquisition module is used for acquiring cluster log data and time sequence operation data in real time according to the operation and maintenance data;

the fault log analysis module is used for analyzing fault logs according to the cluster log data to obtain a monitoring index when the service runs, and obtaining a service execution time consumption index according to start-stop log information of service execution;

the resource monitoring module is used for obtaining resource monitoring indexes of a container level and a service level according to the time sequence operation data;

the fault root cause analysis module is used for analyzing and judging the fault root cause by using a fault judgment model according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index to obtain a fault root cause analysis result;

and the fault root cause positioning module is used for analyzing error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result and positioning the fault root cause.

In particular, in a third aspect of the embodiments of the present invention, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the processor implements a service fault location method based on machine learning and text classification.

In particular, in a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is provided, in which a computer program is stored, and the computer program, when executed by a processor, implements a service fault location method based on machine learning and text classification.

The service fault positioning method and device based on machine learning and text classification can carry out operations such as relevant data acquisition, preprocessing, fault analysis and fault classification on real-time service information through processing methods such as data acquisition, data preprocessing, prediction and monitoring alarm and the like which are independently developed, realize rapid positioning and classification of system faults and effectively improve the alarm accuracy and effectiveness of the system.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a flowchart illustrating a method for locating a service fault based on machine learning and text classification according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a positioning process according to an embodiment of the present invention.

Fig. 3 is an architecture diagram of the FastText algorithm.

Fig. 4 is a schematic diagram of a service fault location device architecture based on machine learning and text classification according to an embodiment of the present invention.

Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

According to the embodiment of the invention, the service fault positioning method and device based on machine learning and text classification are provided, can be used in the technical field of cluster fault positioning, and can realize high-efficiency, high-accuracy and low-false-alarm-rate fault positioning.

The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.

Fig. 1 is a flowchart illustrating a method for locating a service fault based on machine learning and text classification according to an embodiment of the present invention. As shown in fig. 1, the method includes:

step S101, extracting operation and maintenance data;

step S102, cluster log data and time sequence operation data are obtained in real time according to the operation and maintenance data;

step S103, analyzing a fault log according to the cluster log data to obtain a monitoring index when the service runs, and obtaining a time consumption index for service execution according to start-stop log information of service execution;

step S104, obtaining resource monitoring indexes of a container level and a service level according to the time sequence operation data;

step S105, analyzing and judging the fault root by using a fault judgment model according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index to obtain a fault root analysis result;

and S106, analyzing error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result, and positioning the fault root cause.

In order to explain the above-mentioned service fault location method based on machine learning and text classification more clearly, the following is a detailed description with reference to each step.

Step S101:

extracting operation and maintenance data; wherein, the extracted operation and maintenance data at least comprises: application information, node information and log information;

Step S102:

real-time ES data (ES: Elasticsearch cluster): obtaining the log information stream in real time and storing it in the ES cluster to obtain cluster log data;

real-time Prometheus data: acquiring the CPU, memory and disk IO information of the running containers in real time.

Step S103:

analyzing the fault log according to the log information acquired in step S102 to obtain the corresponding monitoring indexes of the service at runtime; wherein the monitoring indexes of the service at runtime at least comprise: request count, request success rate, request accuracy rate, request response time and error information.

The service execution time consumption index is obtained from the service execution start and end records in the log.
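As a minimal sketch of how a time consumption index could be derived from start and end records, the example below pairs `START` and `END` lines by request identifier. The log line layout, the event names and the function name `execution_times` are assumptions made for illustration; the source does not specify a log format.

```python
from datetime import datetime

# Hypothetical line layout: "<date> <time> <request id> <START|END>"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def execution_times(lines):
    """Return {request_id: seconds} for every request with both a START and an END line."""
    started = {}
    durations = {}
    for line in lines:
        date, clock, request_id, event = line.split()
        stamp = datetime.strptime(f"{date} {clock}", TIMESTAMP_FORMAT)
        if event == "START":
            started[request_id] = stamp
        elif event == "END" and request_id in started:
            # Pair the END record with its START record and keep the elapsed time.
            elapsed = stamp - started.pop(request_id)
            durations[request_id] = elapsed.total_seconds()
    return durations


sample = [
    "2021-04-13 10:00:00.000 req-1 START",
    "2021-04-13 10:00:00.100 req-2 START",
    "2021-04-13 10:00:00.250 req-1 END",
]
durations = execution_times(sample)
```

Requests that never log an `END` record (here `req-2`) are simply left out of the result; a real system would probably want to flag them as timeouts.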

Step S104:

obtaining resource monitoring indexes of a container level and a service level according to the time sequence operation data, wherein the resource monitoring indexes of the container level and the service level at least comprise: resource monitoring indexes for the container CPU, the container memory and the host IO.

Step S105:

and analyzing and judging the fault root by using a fault judgment model according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index to obtain a fault root analysis result.

The detailed process of model construction and use comprises the following steps:

After the fault judgment is carried out by utilizing the fault judgment model and the fault root cause analysis result is obtained, the result can be corrected, and the specific process is as follows:

Step S106:

The specific model construction and judgment process comprises the following steps:

It should be noted that although the operations of the method of the present invention have been described in the above embodiments and the accompanying drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the operations shown must be performed, to achieve the desired results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

For a clearer explanation of the service fault location method based on machine learning and text classification, a specific embodiment is described below, however, it should be noted that the embodiment is only for better explaining the present invention and is not to be construed as an undue limitation to the present invention.

Fig. 2 is a schematic diagram of a positioning process according to an embodiment of the present invention. As shown in fig. 2, the detailed process is:

step S201, data extraction:

extracting operation and maintenance data, then formatting and providing the application information, node information, log information and the like.
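The formatting step could look roughly like the following sketch, which normalizes one raw record into a fixed structure. The field names (`app`, `node`, `ts`, `log`) and the `OpsRecord` type are hypothetical; the source only states that application, node and log information are formatted.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class OpsRecord:
    app: str             # application information
    node: str            # node information
    timestamp: datetime  # event time in UTC
    message: str         # raw log text


def format_record(raw: dict) -> OpsRecord:
    """Normalize one raw record; the input key names are assumptions."""
    return OpsRecord(
        app=str(raw.get("app", "unknown")).strip().lower(),
        node=str(raw.get("node", "unknown")).strip(),
        timestamp=datetime.fromtimestamp(float(raw["ts"]), tz=timezone.utc),
        message=str(raw.get("log", "")).strip(),
    )


record = format_record(
    {"app": " Pay-Service ", "node": "node-01", "ts": 0, "log": " ERROR timeout \n"}
)
```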

Step S202, real-time ES data:

obtaining the log information stream in real time and storing it in the ES cluster. An Elasticsearch cluster is composed of one or more nodes, and each cluster has a common cluster name as its identifier.
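One common way to store a stream of log records in Elasticsearch is its bulk API, which takes newline-delimited JSON: an action line followed by the document itself. The sketch below only builds such a request body and does not contact a cluster; the index name `svc-logs` and the document fields are assumptions, and the source does not say which ingestion path the invention uses.

```python
import json


def build_bulk_payload(index_name, docs):
    """Build a newline-delimited JSON body that indexes each document into index_name."""
    lines = []
    for doc in docs:
        # Action line: index the following document into the given index.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the log document itself.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


payload = build_bulk_payload(
    "svc-logs",
    [
        {"app": "pay-service", "level": "ERROR", "message": "connection refused"},
        {"app": "pay-service", "level": "INFO", "message": "request finished"},
    ],
)
```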

Step S203, real-time Prometheus data: and acquiring the information of the CPU, the memory and the disk IO of the operation container in real time.

Prometheus obtains monitoring information in Pull mode and provides a multidimensional data model and a flexible query interface. Prometheus can configure monitoring targets through static files and also supports an automatic discovery mechanism, dynamically obtaining monitoring targets through Kubernetes, Consul, DNS and other means. For data acquisition, thanks to the high-concurrency features of the Go language, a single Prometheus instance can collect monitoring data from hundreds of nodes; for data storage, with continuous optimization of the local time series database, a single Prometheus instance can collect ten million indexes per second, and remote storage is supported if a large amount of historical monitoring data needs to be stored.
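To illustrate the query interface, the sketch below builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and parses a response of result type `vector`. No network call is made. The server address, the PromQL expression, the `pod` label and the sample response are assumptions; the metric names actually exposed depend on the exporters deployed in a given cluster.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Return the URL of an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body, label="pod"):
    """Map the chosen label of each returned series to its sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    values = {}
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        values[series["metric"].get(label, "unknown")] = float(value)
    return values


# Hypothetical server address and PromQL expression.
url = instant_query_url(
    "http://prometheus:9090/",
    "sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))",
)

# Hand-written sample response in the documented instant-vector shape.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"pod": "pay-service-0"}, "value": [1618300000, "0.42"]}],
    },
})
cpu_by_pod = parse_instant_vector(sample_body)
```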

Step S204, analyzing the fault log and monitoring the service consumption time:

analyzing the fault log according to the log information obtained in step S202 to obtain a monitoring index corresponding to the service operation, where the list is shown in table 1:

TABLE 1 monitoring index List

Index name	Description
Number of requests	Number of requests processed by the container/service per unit time
Request success rate	Number of successful returns / total number of requests for the container/service
Request accuracy rate	Number of correct returns / total number of requests for the container/service
Request response time	Time taken by the service to respond
Error information	Log description information of service errors
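The indexes in Table 1 can be computed from per-request records roughly as follows. This is a sketch under assumptions: the `RequestRecord` fields, the fixed observation window and the use of the mean as the response-time summary are illustrative choices, not details given in the source.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    succeeded: bool          # the service returned without error
    correct: bool            # the service returned the expected result
    response_seconds: float  # time taken to respond
    error: str = ""          # error description from the log, if any


def runtime_indexes(records, window_seconds):
    """Compute the Table 1 indexes over one observation window."""
    total = len(records)
    if total == 0:
        return {"requests_per_second": 0.0, "success_rate": None,
                "accuracy_rate": None, "mean_response_seconds": None, "errors": []}
    return {
        "requests_per_second": total / window_seconds,
        "success_rate": sum(r.succeeded for r in records) / total,
        "accuracy_rate": sum(r.correct for r in records) / total,
        "mean_response_seconds": mean(r.response_seconds for r in records),
        "errors": [r.error for r in records if r.error],
    }


indexes = runtime_indexes(
    [
        RequestRecord(True, True, 0.1),
        RequestRecord(True, True, 0.3),
        RequestRecord(True, False, 0.2),
        RequestRecord(False, False, 0.4, "timeout"),
    ],
    window_seconds=2.0,
)
```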

Step S205, full container CPU monitoring, memory monitoring, host computer disk monitoring:

through the monitoring data in step S203, a container-level and service-level resource monitoring view is formed, and the view indexes are shown in table 2:

TABLE 2 Container level and service level resource monitoring View indices

Index name	Description
Container CPU	Real-time monitoring of container CPU utilization
Container memory	Real-time monitoring of container memory usage
Host IO	Real-time monitoring of host IO utilization

Step S206, failure root cause analysis:

and performing fault judgment through a machine learning model according to the monitoring index, the service execution time consumption index and the resource monitoring index of the service in operation, which are obtained in the steps S204 and S205. The method comprises the following specific processes:

step S2061, model construction:

A supervised machine learning model is used for multi-factor prediction judgment. The historical index factors output in step S204 and step S205 are input and labelled to form the annotated historical data. The judgment model is then constructed with a naive Bayes classification algorithm in machine learning.
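A naive Bayes judgment model of this kind can be sketched in a few lines. The version below is a Gaussian naive Bayes written from scratch; the Gaussian likelihood, the feature order (CPU, memory, success rate, response time) and the training rows are assumptions for illustration, since the source names only "naive Bayes" and does not give its reference data.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Naive Bayes with one independent Gaussian per feature and class."""

    def fit(self, rows, labels):
        grouped = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        self.priors, self.stats = {}, {}
        for label, members in grouped.items():
            self.priors[label] = len(members) / len(rows)
            stats = []
            for column in zip(*members):
                mu = sum(column) / len(column)
                var = sum((x - mu) ** 2 for x in column) / len(column)
                stats.append((mu, max(var, 1e-9)))  # floor avoids division by zero
            self.stats[label] = stats
        return self

    def predict_proba(self, row):
        log_scores = {}
        for label, prior in self.priors.items():
            score = math.log(prior)
            for x, (mu, var) in zip(row, self.stats[label]):
                score += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
            log_scores[label] = score
        peak = max(log_scores.values())  # subtract the peak for numerical stability
        weights = {k: math.exp(v - peak) for k, v in log_scores.items()}
        total = sum(weights.values())
        return {k: w / total for k, w in weights.items()}


# Made-up history: [cpu, memory, success rate, response seconds]
history = [
    [0.30, 0.40, 0.99, 0.20], [0.35, 0.45, 0.98, 0.25], [0.25, 0.38, 1.00, 0.18],
    [0.95, 0.90, 0.60, 2.50], [0.90, 0.85, 0.55, 3.00], [0.97, 0.92, 0.50, 2.80],
]
outcomes = ["normal", "normal", "normal", "fault", "fault", "fault"]
model = GaussianNaiveBayes().fit(history, outcomes)
probabilities = model.predict_proba([0.93, 0.88, 0.58, 2.70])
```

With so few rows the per-class variances are tiny and the output probabilities are close to 0 or 1; realistic training data would give softer estimates.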

Step S2062, model use:

the newly generated real-time index factors from step S204 and step S205 are input to the model, which outputs the probability that a problem has occurred; the reference data is shown in table 3:

TABLE 3 reference data

Step S2063, model correction:

since production faults are sporadic, with a probability of occurrence usually below 1%, the false alarm rate is very high if the model alone is used for judgment.

To overcome this problem, when the model judges that a problem has occurred (the model output judgment result is greater than 60%), the judgment can be corrected with the help of the CPU, memory and service execution success rate. For example, if the model judges that there is a problem with a probability of 65% but the CPU, the memory and the service execution success rate are all normal, the current operation is judged to be normal.
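The correction rule described above can be written as a small function. The 60% threshold comes from the text; the function name and the boolean inputs are assumptions, and how each indicator is judged "normal" is left outside the sketch because the source does not define it.

```python
def corrected_judgment(fault_probability, cpu_normal, memory_normal,
                       success_rate_normal, threshold=0.6):
    """Return True only if a fault should be reported after the correction step."""
    if fault_probability <= threshold:
        # The model itself does not judge that a problem has occurred.
        return False
    if cpu_normal and memory_normal and success_rate_normal:
        # The model suspects a fault, but every auxiliary indicator is normal.
        return False
    return True


example = corrected_judgment(0.65, cpu_normal=True, memory_normal=True,
                             success_rate_normal=True)
```

The example reproduces the case in the text: a 65% model probability with normal CPU, memory and success rate is judged as normal operation.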

Step S207, fault root cause positioning judgment:

the error field information in the log information is analyzed with the FastText (shallow network) natural language classification algorithm. Log information is auxiliary program information that developers write in near-natural language, so text classification algorithms from natural language processing perform well on it.

Step S2071, model construction:

using the error information field generated in step S204 to carry out data annotation and model construction; table 4 shows examples of input values and tag values.

TABLE 4 relationship of input values and tag values
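For supervised training, the fastText tool reads one example per line, with the class marked by the `__label__` prefix followed by the text. The sketch below formats annotated error messages that way. The label names and error messages are hypothetical and are not the contents of Table 4.

```python
def to_fasttext_line(label, error_text):
    """Format one annotated error message as a fastText supervised training line."""
    # Lowercase and collapse whitespace so each example stays on a single line.
    cleaned = " ".join(error_text.lower().split())
    return f"__label__{label} {cleaned}"


# Hypothetical annotated examples: (tag value, input value)
annotated = [
    ("database", "  Connection  REFUSED\n to db "),
    ("memory", "java.lang.OutOfMemoryError: Java heap space"),
]
training_lines = [to_fasttext_line(label, text) for label, text in annotated]
```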

Step S2072, model use:

the constructed model obtains classification judgment by inputting new error information; the reference data are shown in table 5:

TABLE 5 reference data

The structure of the FastText algorithm is very similar to the CBOW model structure of word2vec. Fig. 3 is a schematic diagram of the FastText model; note that this architecture diagram does not show the training process for the word vectors. The FastText model has three layers: an input layer (x_1, …, x_N), a hidden layer and an output layer. The final label produced by the output layer is computed with hierarchical Softmax.

The input is a number of words represented by vectors, the output is a specific target, and the hidden layer is the average of the summed word vectors. FastText differs from CBOW in three ways: the input of CBOW is the context of the target word, whereas the input of FastText is a number of words and their n-gram features, which together represent a single document; the input words of CBOW are one-hot encoded, whereas the input features of FastText are embeddings; and the output of CBOW is the target word, whereas the output of FastText is the class label corresponding to the document.
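The three-layer structure can be illustrated with a toy forward pass: look up the embedding of each input feature, average them in the hidden layer, and score each class in the output layer. This sketch uses an ordinary softmax in place of hierarchical Softmax to stay short, and the embeddings, weights and labels are made-up values rather than trained parameters.

```python
import math


def average_vectors(vectors):
    """Hidden layer: element-wise mean of the input feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def softmax(scores):
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, embeddings, output_weights, labels):
    """Return {label: probability} for one document given as a list of tokens."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no known tokens in input")
    hidden = average_vectors(known)
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return dict(zip(labels, softmax(scores)))


# Toy parameters, chosen by hand for the example.
embeddings = {"connection": [1.0, 0.0], "timeout": [1.0, 0.0], "null": [0.0, 1.0]}
output_weights = [[2.0, 0.0], [0.0, 2.0]]  # one row per class
labels = ["network", "code"]
result = classify(["connection", "timeout"], embeddings, output_weights, labels)
```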

In addition, FastText takes the character-level n-gram vectors of a word as additional input features; at the output, FastText uses hierarchical Softmax, which greatly reduces model training time.
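Character-level n-grams can be extracted as in the sketch below. The `<` and `>` boundary markers follow the convention commonly described for fastText subword features; the function name and the default n-gram lengths are assumptions.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("where")
# trigrams == ["<wh", "whe", "her", "ere", "re>"]
```

Because related error tokens such as `timeout` and `timeouts` share most of their n-grams, they receive similar feature vectors even when one of them was never seen during training.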

Having described the method of an exemplary embodiment of the present invention, a machine learning and text classification based service fault location apparatus of an exemplary embodiment of the present invention is next described with reference to fig. 4.

The implementation of the service fault location device based on machine learning and text classification can be referred to the implementation of the above method, and repeated details are omitted. The term "module" or "unit" used hereinafter may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.

Based on the same inventive concept, the invention also provides a service fault location device based on machine learning and text classification, as shown in fig. 4, the device comprises:

a data extraction module 410, configured to extract operation and maintenance data;

a real-time data obtaining module 420, configured to obtain cluster log data and time-series operation data in real time according to the operation and maintenance data;

a fault log analyzing module 430, configured to perform fault log analysis according to the cluster log data to obtain a monitoring index when a service runs, and obtain a service execution time consumption index according to start-stop log information of service execution;

the resource monitoring module 440 is configured to obtain resource monitoring indexes of a container level and a service level according to the time sequence operation data;

the fault root cause analysis module 450 is configured to analyze and judge a fault root cause by using a fault judgment model according to the monitoring index during service operation, the service execution time consumption index, and the resource monitoring index, so as to obtain a fault root cause analysis result;

and the fault root cause positioning module 460 is configured to analyze error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result, and position the fault root cause.

In one embodiment, the extracted operation and maintenance data at least comprises: application information, node information and log information;

the data extraction module 410 is further configured to: and formatting the application information, the node information and the log information.

In an embodiment, the real-time data obtaining module 420 is specifically configured to:

In one embodiment, the monitoring indexes of the service at runtime at least include: request count, request success rate, request accuracy rate, request response time and error information.

In one embodiment, the resource monitoring indexes of the container level and the service level at least include: resource monitoring indexes for the container CPU, the container memory and the host IO.

In an embodiment, the root cause analysis module 450 is specifically configured to:

In an embodiment, the fault root cause location module 460 is specifically configured to:

It should be noted that although several modules of the service fault location apparatus based on machine learning and text classification are mentioned in the above detailed description, such partitioning is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the modules described above may be embodied in one module according to embodiments of the invention. Conversely, the features and functions of one module described above may be further divided and embodied by a plurality of modules.

Based on the aforementioned inventive concept, as shown in fig. 5, the present invention further provides a computer device 500, which includes a memory 510, a processor 520, and a computer program 530 stored on the memory 510 and executable on the processor 520, wherein the processor 520 executes the computer program 530 to implement the aforementioned service fault location method based on machine learning and text classification.

Based on the foregoing inventive concept, the present invention proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the foregoing service fault location method based on machine learning and text classification.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A service fault positioning method based on machine learning and text classification is characterized by comprising the following steps:

extracting operation and maintenance data;

2. The method of claim 1, wherein the extracted operation and maintenance data at least comprises: application information, node information and log information;

the method further comprises the following steps:

3. The method for locating the service fault based on the machine learning and the text classification as claimed in claim 2, wherein the obtaining of cluster log data and time-series operation data in real time according to the operation and maintenance data comprises:

4. The method of claim 1, wherein the monitoring indexes of the service at runtime at least comprise: request count, request success rate, request accuracy rate, request response time and error information.

5. The method of claim 3, wherein the container-level and service-level resource monitoring indexes comprise at least: resource monitoring indexes for the container CPU, the container memory and the host IO.

6. The method for locating a service fault based on machine learning and text classification according to claim 1, wherein a fault root is analyzed and judged by using a fault judgment model according to the monitoring index during service operation, the service execution time consumption index and the resource monitoring index to obtain a fault root analysis result, and the method comprises the following steps:

7. The method of claim 6, further comprising:

8. The method for locating service fault based on machine learning and text classification as claimed in claim 4, wherein the locating the fault root cause by analyzing the error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result comprises:

9. A service fault location apparatus based on machine learning and text classification, the apparatus comprising:

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 8 when executing the computer program.

11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.