CN113094198A - Service fault positioning method and device based on machine learning and text classification - Google Patents


Info

Publication number
CN113094198A
Authority
CN
China
Prior art keywords
fault
service
data
index
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110392903.9A
Other languages
Chinese (zh)
Inventor
许璟亮
廖鸿存
皇甫晓洁
周魁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766Error or fault reporting or storing
    • G06F11/0775Content or structure details of the error report, e.g. specific table structure, specific error fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a service fault location method and device based on machine learning and text classification, which can be used in the technical field of cluster fault location. The method comprises the following steps: extracting operation and maintenance data; acquiring cluster log data and time-series operation data in real time according to the operation and maintenance data; performing fault log analysis on the cluster log data to obtain monitoring indexes of the service at runtime, and obtaining a service execution time consumption index from the start and end log information of service execution; obtaining container-level and service-level resource monitoring indexes from the time-series operation data; analyzing and judging the fault root cause with a fault judgment model according to the runtime monitoring indexes, the service execution time consumption index and the resource monitoring indexes, to obtain a fault root cause analysis result; and, according to the fault root cause analysis result, analyzing the error field information in the log information with a natural language classification algorithm to locate the fault root cause.

Description

Service fault positioning method and device based on machine learning and text classification
Technical Field
The invention relates to the technical field of cluster fault positioning, in particular to a service fault positioning method and device based on machine learning and text classification.
Background
In the prior art, fault location is usually performed by index monitoring combined with manual analysis and judgment. For example, the memory and CPU of a running server are monitored, an alarm is triggered when a threshold is exceeded, and operators then intervene manually to analyze and repair the fault. This approach has at least the following disadvantages. Fault location is slow: manual intervention is usually needed to classify, locate and judge the cause of a fault. The detection rate is low: a fixed index monitoring algorithm cannot cover all scenarios, which limits the fault detection rate; CPU index monitoring, for example, can typically only use static thresholds. The false alarm rate is high: transient spikes in memory or CPU usage cannot be effectively identified, so fault detection produces many false alarms.
In view of the above, a fault location scheme that can overcome the defects in the prior art and has the advantages of high location speed, high detection rate and low false alarm rate is needed.
Disclosure of Invention
In view of the shortcomings of existing fault location, the invention provides a service fault location method and device based on machine learning and text classification. It aims to solve the low fault detection rate and slow problem location that result from the traditional approach's heavy reliance on manual intervention. Through independently developed processing methods such as data acquisition, data preprocessing, prediction and monitoring alarm, the invention performs data acquisition, preprocessing, fault analysis, fault classification and related operations on real-time service information, so that system faults can be quickly located and classified and the accuracy and effectiveness of system alarms are improved.
Specifically, in a first aspect of the embodiments of the present invention, a method for locating a service fault based on machine learning and text classification is provided, where the method includes:
extracting operation and maintenance data;
acquiring cluster log data and time sequence operation data in real time according to the operation and maintenance data;
analyzing fault logs according to the cluster log data to obtain a monitoring index when the service runs, and obtaining a time-consuming index of service execution according to start-stop log information of service execution;
according to the time sequence operation data, obtaining resource monitoring indexes of a container level and a service level;
analyzing and judging the fault root by using a fault judgment model according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index to obtain a fault root analysis result;
and analyzing error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result, and positioning the fault root cause.
Further, the extracted operation and maintenance data at least comprises: application information, node information and log information;
the method further comprises the following steps:
and formatting the application information, the node information and the log information.
Further, acquiring cluster log data and time-series operation data in real time according to the operation and maintenance data includes:
obtaining log information flow in real time and storing the log information flow into an ES cluster to obtain cluster log data;
and acquiring the information of the CPU, the memory and the disk IO of the operation container in real time.
Further, the monitoring indexes of the service at runtime at least include: number of requests, request success rate, request accuracy rate, request response time and error information.
Further, the resource monitoring indexes of the container level and the service level at least include: resource monitoring indexes for container CPU, container memory and host IO.
Further, according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index, analyzing and judging the fault root by using a fault judgment model to obtain a fault root analysis result, including:
establishing a fault judgment model by using a naive Bayes classification algorithm in machine learning, and training the fault judgment model with historical data of the runtime monitoring indexes, the service execution time consumption index and the resource monitoring indexes as input features and the judgment result as the output; the fault judgment model is a supervised machine learning model used to make a multi-factor predictive judgment;
and taking the newly generated runtime monitoring indexes, service execution time consumption index and resource monitoring indexes as input features, and performing fault judgment with the fault judgment model to obtain the fault occurrence probability.
Further, the method further comprises:
and when the model judges that the fault occurrence probability is greater than a preset value, additionally performing a correction judgment based on the CPU, the memory and the service execution success rate, to obtain a corrected fault root cause analysis result.
Further, according to the fault root cause analysis result, analyzing error field information in log information by using a natural language classification algorithm, and positioning the fault root cause, including:
taking historical error information as an input value to carry out data annotation, and constructing a shallow network model;
inputting new error information into the shallow network model, analyzing error fields in the log information by adopting a natural language classification algorithm of the shallow network model to obtain a classification judgment result, and positioning the fault root.
Specifically, in a second aspect of the embodiments of the present invention, a service fault location apparatus based on machine learning and text classification is provided, the apparatus including:
the data extraction module is used for extracting operation and maintenance data;
the real-time data acquisition module is used for acquiring cluster log data and time sequence operation data in real time according to the operation and maintenance data;
the fault log analysis module is used for analyzing fault logs according to the cluster log data to obtain a monitoring index when the service runs, and obtaining a service execution time consumption index according to start-stop log information of service execution;
the resource monitoring module is used for obtaining resource monitoring indexes of a container level and a service level according to the time sequence operation data;
the fault root cause analysis module is used for analyzing and judging the fault root cause by using a fault judgment model according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index to obtain a fault root cause analysis result;
and the fault root cause positioning module is used for analyzing error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result and positioning the fault root cause.
In particular, in a third aspect of the embodiments of the present invention, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the processor implements a service fault location method based on machine learning and text classification.
In particular, in a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is provided, in which a computer program is stored, and the computer program, when executed by a processor, implements a service fault location method based on machine learning and text classification.
The service fault positioning method and device based on machine learning and text classification can carry out operations such as relevant data acquisition, preprocessing, fault analysis and fault classification on real-time service information through processing methods such as data acquisition, data preprocessing, prediction and monitoring alarm and the like which are independently developed, realize rapid positioning and classification of system faults and effectively improve the alarm accuracy and effectiveness of the system.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating a method for locating a service fault based on machine learning and text classification according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a positioning process according to an embodiment of the present invention.
FIG. 3 is an architectural diagram of the FastText algorithm.
Fig. 4 is a schematic diagram of a service fault location device architecture based on machine learning and text classification according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, the service fault positioning method and device based on machine learning and text classification are provided, can be used in the technical field of cluster fault positioning, and can realize high-efficiency, high-accuracy and low-false-alarm-rate fault positioning.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Fig. 1 is a flowchart illustrating a method for locating a service fault based on machine learning and text classification according to an embodiment of the present invention. As shown in fig. 1, the method includes:
step S101, extracting operation and maintenance data;
step S102, cluster log data and time sequence operation data are obtained in real time according to the operation and maintenance data;
step S103, analyzing a fault log according to the cluster log data to obtain a monitoring index when the service runs, and obtaining a time consumption index for service execution according to start-stop log information of service execution;
step S104, obtaining resource monitoring indexes of a container level and a service level according to the time sequence operation data;
step S105, analyzing and judging the fault root by using a fault judgment model according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index to obtain a fault root analysis result;
and S106, analyzing error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result, and positioning the fault root cause.
In order to explain the above-mentioned service fault location method based on machine learning and text classification more clearly, the following is a detailed description with reference to each step.
Step S101:
extracting operation and maintenance data; wherein, the extracted operation and maintenance data at least comprises: application information, node information and log information;
and formatting the application information, the node information and the log information.
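As a rough illustration of the formatting in step S101, the sketch below normalizes one raw operation and maintenance record into a uniform structure. The source does not specify any record layout, so the input keys (`app`, `node`, `ts`, `log`) and the `OpsRecord` type are assumptions made only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class OpsRecord:
    """Uniform record holding application, node and log information."""
    app: str
    node: str
    timestamp: datetime
    message: str


def format_record(raw: dict) -> OpsRecord:
    """Normalize one raw record (hypothetical keys) into an OpsRecord."""
    return OpsRecord(
        app=str(raw.get("app", "")).strip().lower(),
        node=str(raw.get("node", "")).strip().lower(),
        # Assumes an epoch-seconds timestamp; converted to an aware UTC time.
        timestamp=datetime.fromtimestamp(float(raw["ts"]), tz=timezone.utc),
        message=str(raw.get("log", "")).strip(),
    )


record = format_record(
    {"app": " Payment ", "node": "Node-01", "ts": 0, "log": " timeout \n"}
)
```

Normalizing names and timestamps at this stage keeps the later log and metric steps from having to handle per-source variations.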
Step S102:
real-time ES data (ES: ElasticSearch cluster): obtaining log information flow in real time and storing the log information flow into an ES cluster to obtain cluster log data;
real-time Prometheus data: and acquiring the information of the CPU, the memory and the disk IO of the operation container in real time.
Step S103:
performing fault log analysis on the log information acquired in step S102 to obtain the corresponding monitoring indexes of the service at runtime; wherein the monitoring indexes of the service at runtime at least include: number of requests, request success rate, request accuracy rate, request response time and error information.
And acquiring the time consumption index of service execution through the information of the service execution start and end logs in the log.
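The time consumption index can be pictured as pairing each service's start log entry with its end log entry. The following is a minimal sketch under an assumed log layout (`<ISO timestamp> <request id> <START|END>`); the actual log format is not given in the source.

```python
from datetime import datetime


def execution_times(log_lines):
    """Pair START/END lines per request id and return durations in seconds."""
    started = {}
    durations = {}
    for line in log_lines:
        # Assumed layout: "<ISO timestamp> <request id> <START|END>"
        stamp, request_id, event = line.split()
        moment = datetime.fromisoformat(stamp)
        if event == "START":
            started[request_id] = moment
        elif event == "END" and request_id in started:
            delta = moment - started.pop(request_id)
            durations[request_id] = delta.total_seconds()
    return durations


logs = [
    "2021-04-13T10:00:00 req-1 START",
    "2021-04-13T10:00:01 req-2 START",
    "2021-04-13T10:00:03 req-1 END",
    "2021-04-13T10:00:02.500000 req-2 END",
]
durations = execution_times(logs)
```

Requests whose end entry never arrives stay in `started`, which a fuller implementation could report as timeouts.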
Step S104:
obtaining resource monitoring indexes of a container level and a service level according to the time sequence operation data, wherein the resource monitoring indexes of the container level and the service level at least comprise: resource monitoring indexes for container CPU, container memory and host IO.
Step S105:
and analyzing and judging the fault root by using a fault judgment model according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index to obtain a fault root analysis result.
The detailed process of model construction and use comprises the following steps:
establishing a fault judgment model by using a naive Bayes classification algorithm in machine learning, and training the fault judgment model with historical data of the runtime monitoring indexes, the service execution time consumption index and the resource monitoring indexes as input features and the judgment result as the output; the fault judgment model is a supervised machine learning model used to make a multi-factor predictive judgment;
and taking the newly generated runtime monitoring indexes, service execution time consumption index and resource monitoring indexes as input features, and performing fault judgment with the fault judgment model to obtain the fault occurrence probability.
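To make the training and judgment steps concrete, here is a minimal Gaussian naive Bayes sketch in plain Python. The source names naive Bayes but not the variant, so the Gaussian likelihood is an assumption, as are the four feature columns and the toy history values, which are invented for illustration only.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Minimal Gaussian naive Bayes for numeric monitoring features."""

    def fit(self, rows, labels):
        grouped = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        self.stats = {}
        for label, members in grouped.items():
            prior = len(members) / len(rows)
            columns = list(zip(*members))
            means = [sum(c) / len(c) for c in columns]
            # A small floor keeps every variance strictly positive.
            variances = [
                sum((v - m) ** 2 for v in c) / len(c) + 1e-6
                for c, m in zip(columns, means)
            ]
            self.stats[label] = (prior, means, variances)
        return self

    def predict_proba(self, row):
        log_scores = {}
        for label, (prior, means, variances) in self.stats.items():
            score = math.log(prior)
            for value, mean, var in zip(row, means, variances):
                score += -0.5 * math.log(2 * math.pi * var)
                score -= (value - mean) ** 2 / (2 * var)
            log_scores[label] = score
        # Normalize in log space for numerical stability.
        top = max(log_scores.values())
        exps = {k: math.exp(v - top) for k, v in log_scores.items()}
        total = sum(exps.values())
        return {k: v / total for k, v in exps.items()}


# Hypothetical features: [success rate, response time (s), CPU use, memory use]
history = [
    [0.99, 0.20, 0.35, 0.40], [0.98, 0.25, 0.40, 0.45],
    [0.97, 0.30, 0.38, 0.50], [0.60, 2.50, 0.95, 0.90],
    [0.55, 3.00, 0.97, 0.92], [0.65, 2.20, 0.90, 0.88],
]
outcome = ["normal", "normal", "normal", "fault", "fault", "fault"]

model = GaussianNaiveBayes().fit(history, outcome)
fault_probability = model.predict_proba([0.58, 2.8, 0.96, 0.91])["fault"]
```

The returned dictionary gives one probability per label, so the "fault" entry plays the role of the fault occurrence probability described above.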
After the fault judgment is carried out by utilizing the fault judgment model and the fault root cause analysis result is obtained, the result can be corrected, and the specific process is as follows:
and when the model judges that the fault occurrence probability is greater than a preset value, additionally performing a correction judgment based on the CPU, the memory and the service execution success rate, to obtain a corrected fault root cause analysis result.
Step S106:
and analyzing error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result, and positioning the fault root cause.
The specific model construction and judgment process comprises the following steps:
taking historical error information as an input value to carry out data annotation, and constructing a shallow network model;
inputting new error information into the shallow network model, analyzing error fields in the log information by adopting a natural language classification algorithm of the shallow network model to obtain a classification judgment result, and positioning the fault root.
It should be noted that although the operations of the method of the present invention have been described in the above embodiments and the accompanying drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the operations shown must be performed, to achieve the desired results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
For a clearer explanation of the service fault location method based on machine learning and text classification, a specific embodiment is described below, however, it should be noted that the embodiment is only for better explaining the present invention and is not to be construed as an undue limitation to the present invention.
Fig. 2 is a schematic diagram of a positioning process according to an embodiment of the present invention. As shown in fig. 2, the detailed process is:
step S201, data extraction:
and extracting operation and maintenance data, and formatting and providing information such as application information, node information, log information and the like.
Step S202, real-time ES data:
and obtaining the log information stream in real time and storing it in the ES cluster. An Elasticsearch cluster is composed of one or more nodes, and each cluster has a common cluster name as its identifier.
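One plausible way to store a log stream in the ES cluster is the Elasticsearch bulk endpoint, which accepts newline-delimited JSON. The sketch below only builds such a request body; the index name `service-logs` and the event fields are hypothetical, and the source does not say which ingestion path is actually used.

```python
import json


def build_bulk_payload(index_name, log_events):
    """Serialize log events as NDJSON for the Elasticsearch _bulk endpoint."""
    lines = []
    for event in log_events:
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(event, ensure_ascii=False))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


payload = build_bulk_payload(
    "service-logs",  # hypothetical index name
    [
        {"app": "payment", "node": "node-01", "level": "ERROR",
         "message": "connection timed out"},
        {"app": "payment", "node": "node-02", "level": "INFO",
         "message": "request finished"},
    ],
)
```

The resulting string would be sent as the body of a POST to `/_bulk` with the `application/x-ndjson` content type.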
Step S203, real-time Prometheus data: and acquiring the information of the CPU, the memory and the disk IO of the operation container in real time.
Prometheus obtains monitoring information in Pull mode and provides a multidimensional data model and a flexible query interface. Prometheus can configure monitoring targets through a static file and also supports an automatic discovery mechanism, so that monitoring targets can be acquired dynamically through Kubernetes, Consul, DNS and other means. In terms of data collection, thanks to the high-concurrency characteristics of the Go language, a single-machine Prometheus can collect monitoring data from hundreds of nodes; in terms of data storage, with continuous optimization of the local time-series database, a single-machine Prometheus can collect ten million indexes per second, and remote storage is supported if a large amount of historical monitoring data needs to be stored.
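For readers unfamiliar with the query interface, the sketch below parses the JSON body that the Prometheus HTTP API returns for an instant query (`status`, `data.resultType`, `data.result`). The sample body is hand-written, and the `container` label and the container names are assumptions for illustration, not values from the source.

```python
import json


def parse_instant_vector(body):
    """Map each series' container label to its sampled float value."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise ValueError("Prometheus query failed")
    if doc["data"]["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    samples = {}
    for series in doc["data"]["result"]:
        # "value" is a [unix_timestamp, "string value"] pair.
        _, raw_value = series["value"]
        container = series["metric"].get("container", "unknown")
        samples[container] = float(raw_value)
    return samples


# Hand-written sample response; label names and values are illustrative.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"container": "order-api"},
             "value": [1618300800.0, "0.42"]},
            {"metric": {"container": "pay-api"},
             "value": [1618300800.0, "0.91"]},
        ],
    },
})
cpu_by_container = parse_instant_vector(sample_body)
```

Sample values arrive as strings in this API, which is why the parser converts them to floats explicitly.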
Step S204, analyzing the fault log and monitoring the service consumption time:
analyzing the fault log according to the log information obtained in step S202 to obtain a monitoring index corresponding to the service operation, where the list is shown in table 1:
TABLE 1 Monitoring index list
Index name | Description
Number of requests | Number of requests processed by the container/service per unit time
Request success rate | Number of successful returns / total number of requests for the container/service
Request accuracy rate | Number of correct returns / total number of requests for the container/service
Request response time | Time consumed by the service
Error information | Log description information of service errors
And acquiring the time consumption index of service execution through the information of the service execution start and end logs in the log.
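The indexes in table 1 can be computed from parsed request records roughly as follows. This is a sketch only: the record keys (`returned`, `correct`, `seconds`, `error`) and the sample window are assumptions, since the source does not define how the parsed log records are structured.

```python
def runtime_indexes(requests, window_seconds):
    """Compute table 1 style indexes from parsed request records."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in the window")
    returned = sum(1 for r in requests if r["returned"])
    correct = sum(1 for r in requests if r["correct"])
    return {
        "requests_per_second": total / window_seconds,
        "success_rate": returned / total,
        "accuracy_rate": correct / total,
        "avg_response_time": sum(r["seconds"] for r in requests) / total,
        "errors": [r["error"] for r in requests if r["error"]],
    }


# Hypothetical records for a two-second window.
window = [
    {"returned": True, "correct": True, "seconds": 0.2, "error": None},
    {"returned": True, "correct": True, "seconds": 0.4, "error": None},
    {"returned": True, "correct": False, "seconds": 0.6, "error": "wrong amount"},
    {"returned": False, "correct": False, "seconds": 0.8, "error": "timeout"},
]
indexes = runtime_indexes(window, window_seconds=2)
```

The `errors` list carries the raw error text forward, which is the input later consumed by the text classification step.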
Step S205, full container CPU monitoring, memory monitoring, host computer disk monitoring:
through the monitoring data in step S203, a container-level and service-level resource monitoring view is formed, and the view indexes are shown in table 2:
TABLE 2 Container-level and service-level resource monitoring view indexes
Index name | Description
Container CPU | Real-time monitoring of container CPU utilization
Container memory | Real-time monitoring of container memory usage
Host IO | Real-time monitoring of host IO utilization
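A container-level and service-level view of this kind could be assembled by grouping container samples under their owning service. The sketch below assumes each sample carries `service`, `container`, `cpu` and `memory` fields and averages per service; both the field names and the use of a mean are illustrative choices, and host IO is left out for brevity.

```python
from collections import defaultdict
from statistics import mean


def resource_view(samples):
    """Build container-level and service-level views from raw samples."""
    container_view = {}
    per_service = defaultdict(list)
    for sample in samples:
        container_view[sample["container"]] = {
            "cpu": sample["cpu"],
            "memory": sample["memory"],
        }
        per_service[sample["service"]].append(sample)
    # Service-level values are the mean over that service's containers.
    service_view = {
        service: {
            "cpu": mean([s["cpu"] for s in group]),
            "memory": mean([s["memory"] for s in group]),
        }
        for service, group in per_service.items()
    }
    return container_view, service_view


# Hypothetical utilization samples expressed as fractions of capacity.
containers, services = resource_view([
    {"service": "pay", "container": "pay-1", "cpu": 0.5, "memory": 0.25},
    {"service": "pay", "container": "pay-2", "cpu": 0.75, "memory": 0.75},
    {"service": "order", "container": "order-1", "cpu": 0.25, "memory": 0.5},
])
```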
Step S206, failure root cause analysis:
and performing fault judgment through a machine learning model according to the monitoring index, the service execution time consumption index and the resource monitoring index of the service in operation, which are obtained in the steps S204 and S205. The method comprises the following specific processes:
step S2061, model construction:
A supervised machine learning model is used to make the multi-factor predictive judgment. The historical index factors output in step S204 and step S205 are input and annotated to form the historical training data. The judgment model is then constructed with a naive Bayes classification algorithm in machine learning.
Step S2062, model use:
the newly generated real-time index factors from step S204 and step S205 are input into the model, which outputs a probability judgment of whether a problem has occurred; the reference data are shown in table 3:
TABLE 3 reference data
[Table 3 is provided as an image in the original publication (image reference BDA0003017445710000081).]
Step S2063, model correction:
since the production fault is a sporadic phenomenon, the probability of occurrence is usually below 1%, and therefore, the false alarm rate is very high if only one model is used for judging.
To overcome this problem, when the model judges that a problem has occurred (the model's output judgment result is greater than 60%), a correction judgment based on the CPU, the memory and the service execution success rate can additionally be applied. For example, if the model judges with 65% probability that a problem currently exists, but the CPU, the memory and the service execution success rate are all normal, the current operation is judged to be normal.
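The correction rule can be written as a small guard around the model output. In the sketch below, the 60% probability threshold comes from the text, while the limits that define "normal" CPU, memory and success rate are assumed values chosen only for the example.

```python
def corrected_judgment(fault_probability, cpu, memory, success_rate,
                       probability_threshold=0.60, cpu_limit=0.90,
                       memory_limit=0.90, min_success_rate=0.95):
    """Return True only if the model alarm survives the auxiliary checks."""
    if fault_probability <= probability_threshold:
        return False
    # The limits below are assumptions; the source does not define "normal".
    auxiliary_normal = (
        cpu < cpu_limit
        and memory < memory_limit
        and success_rate >= min_success_rate
    )
    # A model alarm is suppressed when every auxiliary indicator is normal.
    return not auxiliary_normal


# Model reports 65%, but CPU, memory and success rate all look normal.
suppressed = corrected_judgment(0.65, cpu=0.40, memory=0.50, success_rate=0.99)
# Same model output, but CPU utilization is abnormal.
confirmed = corrected_judgment(0.65, cpu=0.97, memory=0.50, success_rate=0.99)
```

Requiring agreement between the model and at least one resource or success-rate indicator trades a little sensitivity for fewer false alarms on rare faults.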
Step S207, fault root cause positioning judgment:
the error field information in the log information is analyzed with the FastText (shallow network) natural language classification algorithm. Log information is auxiliary program information written by developers in a form close to natural language, so text classification algorithms from natural language processing perform well on it.
Step S2071, model construction:
data annotation and model construction are carried out using the error information field generated in step S204; examples of input values and label values are shown in table 4.
TABLE 4 relationship of input values and tag values
[Table 4 is provided as an image in the original publication (image reference BDA0003017445710000082).]
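Annotated error messages are commonly fed to fastText's supervised mode as one sample per line, with the label marked by the `__label__` prefix. The sketch below formats such lines; because table 4 is only available as an image, the label names and error texts here are hypothetical.

```python
def to_fasttext_line(label, error_text):
    """Format one annotated error message as '__label__<label> <text>'."""
    clean_label = label.strip().lower().replace(" ", "_")
    # Collapse whitespace so each sample stays on a single line.
    clean_text = " ".join(error_text.split())
    return f"__label__{clean_label} {clean_text}"


# Hypothetical annotated samples (label, raw error field).
annotated = [
    ("Database Fault", "java.sql.SQLException:  connection\nrefused"),
    ("Network Fault", "java.net.SocketTimeoutException: Read timed out"),
]
training_lines = [to_fasttext_line(label, text) for label, text in annotated]
```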
Step S2072, model use:
the constructed model obtains classification judgment by inputting new error information; the reference data are shown in table 5:
TABLE 5 reference data
[Table 5 is provided as an image in the original publication (image reference BDA0003017445710000091).]
The structure of the FastText algorithm is very similar to the CBOW model structure of word2vec; FIG. 3 is a schematic diagram of the FastText model architecture. Note that the architecture diagram in FIG. 3 does not show the training process of the word vectors. The FastText model has three layers: an input layer (x_1, …, x_N), a hidden layer and an output layer. The output layer uses hierarchical Softmax to produce the final label.
The input is a number of words represented as vectors, the output is a specific target, and the hidden layer is the average of the word vectors. FastText differs from CBOW in three ways: the input of CBOW is the context of the target word, whereas the input of FastText is a number of words and their n-gram features, which together represent a single document; the input words of CBOW are one-hot encoded, whereas the input features of FastText are embeddings; and the output of CBOW is the target word, whereas the output of FastText is the class label corresponding to the document.
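The three-layer structure can be sketched as a forward pass: look up an embedding for each input token, average them to form the hidden layer, and apply a linear output layer. For brevity this sketch uses a plain (flat) softmax rather than hierarchical Softmax, and the two-dimensional embeddings, output weights and class names are made-up values, not trained parameters.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities in a numerically stable way."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(tokens, embeddings, output_weights):
    """Average token embeddings (hidden layer), then apply a linear softmax."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(output_weights[0])
    if known:
        hidden = [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
    else:
        hidden = [0.0] * dim
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return hidden, softmax(scores)


# Illustrative parameters only; a real model learns these during training.
embeddings = {"connection": [1.0, 0.0], "refused": [0.0, 1.0]}
output_weights = [[2.0, 2.0], [-2.0, -2.0]]  # rows: "network", "database"
hidden, probabilities = forward(
    ["connection", "refused", "unseen"], embeddings, output_weights
)
```

Tokens without an embedding are skipped here; fastText itself reduces that problem by also using n-gram features.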
In addition, FastText takes the character-level n-gram vectors of a word as additional input features; at the output, FastText uses hierarchical Softmax, which greatly reduces model training time.
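Character-level n-grams can be extracted as in the sketch below, which wraps the word in `<` and `>` boundary markers before slicing, following the convention described for fastText subword features. The default length range of 3 to 6 is an assumed setting for this example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("error", n_min=3, n_max=3)
```

Because error fields often contain identifiers and exception names that never appear as whole words in the training data, shared n-grams let related tokens contribute similar features.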
Having described the method of an exemplary embodiment of the present invention, a machine learning and text classification based service fault location apparatus of an exemplary embodiment of the present invention is next described with reference to fig. 4.
The implementation of the service fault location device based on machine learning and text classification can be referred to the implementation of the above method, and repeated details are omitted. The term "module" or "unit" used hereinafter may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Based on the same inventive concept, the invention also provides a service fault location device based on machine learning and text classification, as shown in fig. 4, the device comprises:
a data extraction module 410, configured to extract operation and maintenance data;
a real-time data obtaining module 420, configured to obtain cluster log data and time-series operation data in real time according to the operation and maintenance data;
a fault log analyzing module 430, configured to perform fault log analysis according to the cluster log data to obtain a monitoring index when a service runs, and obtain a service execution time consumption index according to start-stop log information of service execution;
the resource monitoring module 440 is configured to obtain resource monitoring indexes of a container level and a service level according to the time sequence operation data;
the fault root cause analysis module 450 is configured to analyze and judge a fault root cause by using a fault judgment model according to the monitoring index during service operation, the service execution time consumption index, and the resource monitoring index, so as to obtain a fault root cause analysis result;
and the fault root cause positioning module 460 is configured to analyze error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result, and position the fault root cause.
In one embodiment, the extracted operation and maintenance data at least comprises: application information, node information and log information;
the data extraction module 410 is further configured to format the application information, the node information and the log information.
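The formatting step is not detailed in the text. One possible reading, sketched below, is to normalize the three kinds of records into a common flat structure. The field names, the `kind` tag and the `format_record` helper are hypothetical.

```python
import json

def format_record(kind, raw):
    """Normalize one raw record: trimmed lower-case keys, trimmed string values, a 'kind' tag."""
    record = {
        str(key).strip().lower(): value.strip() if isinstance(value, str) else value
        for key, value in raw.items()
    }
    record["kind"] = kind
    return record

# Hypothetical raw inputs for the three kinds of operation and maintenance data.
raw_inputs = [
    ("application", {"App ": "pay-service", "Version": " 1.4.2 "}),
    ("node", {"Host": "node-07", "IP": "10.0.0.7"}),
    ("log", {"Level": "ERROR", "Message": " connection refused "}),
]
formatted = [format_record(kind, raw) for kind, raw in raw_inputs]
# One JSON document per record, ready to be shipped to a log store.
lines = [json.dumps(record, sort_keys=True) for record in formatted]
```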
In an embodiment, the real-time data obtaining module 420 is specifically configured to:
obtain the log information stream in real time and store it in an ES cluster to obtain the cluster log data;
and acquire CPU, memory and disk IO information of the running containers in real time.
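The two acquisition tasks can be sketched as follows. This is only an illustration: `InMemoryLogStore` is an in-memory stand-in for the ES cluster, and `read_metrics` is a hypothetical callback standing in for whatever container runtime interface supplies the statistics. Neither represents a real client API.

```python
from collections import defaultdict

class InMemoryLogStore:
    """Stand-in for the ES cluster: keeps documents per index, in memory only."""

    def __init__(self):
        self.indices = defaultdict(list)

    def index(self, index_name, document):
        self.indices[index_name].append(document)

def consume_log_stream(stream, store, index_name="service-logs"):
    """Store every record of a log stream and return how many were stored."""
    count = 0
    for record in stream:
        store.index(index_name, record)
        count += 1
    return count

def sample_container(container_id, read_metrics):
    """Take one CPU / memory / disk IO sample for a running container."""
    cpu, mem, disk_io = read_metrics(container_id)
    return {"container": container_id, "cpu_pct": cpu,
            "mem_pct": mem, "disk_io_mbps": disk_io}

store = InMemoryLogStore()
stored = consume_log_stream(
    iter([{"level": "ERROR", "msg": "timeout"}, {"level": "INFO", "msg": "ok"}]),
    store,
)
# The lambda fakes a metrics reader; real values would come from the container runtime.
sample = sample_container("c1", lambda cid: (42.0, 63.5, 12.0))
```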
In one embodiment, the monitoring indexes of the service at runtime include at least: request count, request success rate, request accuracy rate, request response time and error information.
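As a rough illustration, these runtime indexes could be computed from parsed request records as shown below. The record fields (`ok`, `correct`, `response_ms`, `error`) are assumptions, as is the reading of "accuracy rate" as the share of requests whose result was marked correct.

```python
# Hypothetical request records recovered from the cluster log data.
REQUESTS = [
    {"ok": True, "correct": True, "response_ms": 100, "error": None},
    {"ok": True, "correct": True, "response_ms": 200, "error": None},
    {"ok": True, "correct": False, "response_ms": 300, "error": None},
    {"ok": False, "correct": False, "response_ms": 400, "error": "timeout"},
]

def runtime_indexes(requests):
    """Aggregate per-request records into the service runtime monitoring indexes."""
    count = len(requests)
    if count == 0:
        return {"request_count": 0}
    return {
        "request_count": count,
        "success_rate": sum(r["ok"] for r in requests) / count,
        "accuracy_rate": sum(r["correct"] for r in requests) / count,
        "mean_response_ms": sum(r["response_ms"] for r in requests) / count,
        "errors": [r["error"] for r in requests if r["error"]],
    }

indexes = runtime_indexes(REQUESTS)
```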
In one embodiment, the container-level and service-level resource monitoring indexes include at least: resource monitoring indexes for the container CPU, the container memory and the host IO.
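A minimal sketch of deriving both levels from the same time-series samples follows. The sample layout and the choice of the arithmetic mean as the aggregate are assumptions; the text does not say how the two levels are computed.

```python
from collections import defaultdict
from statistics import mean

METRICS = ("cpu", "mem", "host_io")

# Hypothetical time-series samples, one per container per collection interval.
SAMPLES = [
    {"service": "pay", "container": "c1", "cpu": 40.0, "mem": 50.0, "host_io": 10.0},
    {"service": "pay", "container": "c1", "cpu": 60.0, "mem": 70.0, "host_io": 30.0},
    {"service": "pay", "container": "c2", "cpu": 80.0, "mem": 90.0, "host_io": 20.0},
]

def aggregate(samples, key):
    """Group samples by `key` and average each metric within a group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    return {
        name: {metric: mean(row[metric] for row in rows) for metric in METRICS}
        for name, rows in groups.items()
    }

container_level = aggregate(SAMPLES, "container")
service_level = aggregate(SAMPLES, "service")
```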
In an embodiment, the fault root cause analysis module 450 is specifically configured to:
establish a fault judgment model using a naive Bayes classification algorithm from machine learning, and train the fault judgment model with historical data of the service runtime monitoring indexes, the service execution time consumption index and the resource monitoring indexes as input features and the judgment result as the output feature; the fault judgment model is a supervised machine learning model that serves as a multi-factor prediction model for judgment;
and take the newly generated service runtime monitoring indexes, service execution time consumption index and resource monitoring indexes as input features, and perform fault judgment using the fault judgment model to obtain the fault occurrence probability.
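The training and judgment steps can be sketched with a small Gaussian naive Bayes classifier written from scratch. The Gaussian variant, the three example features (success rate, execution time in seconds, CPU percentage) and the training values are all assumptions; the text only names naive Bayes classification.

```python
import math
from collections import defaultdict

class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per feature and class."""

    def fit(self, X, y):
        rows = defaultdict(list)
        for features, label in zip(X, y):
            rows[label].append(features)
        self.priors, self.stats = {}, {}
        for label, samples in rows.items():
            self.priors[label] = len(samples) / len(X)
            self.stats[label] = []
            for column in zip(*samples):
                mu = sum(column) / len(column)
                # Small constant keeps the variance strictly positive.
                var = sum((v - mu) ** 2 for v in column) / len(column) + 1e-6
                self.stats[label].append((mu, var))
        return self

    def predict_proba(self, features):
        log_post = {}
        for label, prior in self.priors.items():
            total = math.log(prior)
            for value, (mu, var) in zip(features, self.stats[label]):
                total += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
            log_post[label] = total
        top = max(log_post.values())  # subtract the maximum for numerical stability
        weights = {label: math.exp(v - top) for label, v in log_post.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}

# Hypothetical history: [success rate, execution time (s), CPU %] with judgment labels.
X = [[0.99, 0.20, 40], [0.98, 0.30, 45], [0.97, 0.25, 50], [0.99, 0.22, 35],
     [0.70, 1.50, 92], [0.60, 2.00, 95], [0.75, 1.20, 88], [0.65, 1.80, 97]]
y = ["normal"] * 4 + ["fault"] * 4

model = GaussianNaiveBayes().fit(X, y)
p_degraded = model.predict_proba([0.68, 1.70, 93])["fault"]
p_healthy = model.predict_proba([0.98, 0.24, 42])["fault"]
```

In this toy data the two classes are widely separated, so the fault probability is close to 1 for the degraded sample and close to 0 for the healthy one. Real operational data would be far less clear-cut.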
When the model judges that the fault occurrence probability is greater than a preset value, an auxiliary corrective judgment based on the CPU, the memory and the service execution success rate is applied to obtain a corrected fault root cause analysis result.
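One way to read this correction step is as a rule check that runs only when the model's probability exceeds the preset value. The sketch below follows that reading. The threshold values, the parameter names and the decision to overturn the model when no rule fires are all assumptions, not details given in the text.

```python
def corrected_judgment(fault_probability, cpu_pct, mem_pct, success_rate,
                       prob_threshold=0.8, cpu_limit=85.0, mem_limit=85.0,
                       min_success_rate=0.95):
    """Apply an auxiliary rule check when the model probability exceeds the preset value."""
    if fault_probability <= prob_threshold:
        return {"fault": False, "suspected_causes": []}
    causes = []
    if cpu_pct > cpu_limit:
        causes.append("cpu")
    if mem_pct > mem_limit:
        causes.append("memory")
    if success_rate < min_success_rate:
        causes.append("service_execution")
    # Assumption: with no supporting evidence, the model's verdict is overturned.
    return {"fault": bool(causes), "suspected_causes": causes}

confirmed = corrected_judgment(0.97, cpu_pct=93.0, mem_pct=60.0, success_rate=0.68)
overturned = corrected_judgment(0.90, cpu_pct=40.0, mem_pct=50.0, success_rate=0.99)
below_threshold = corrected_judgment(0.30, cpu_pct=99.0, mem_pct=99.0, success_rate=0.10)
```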
In an embodiment, the fault root cause positioning module 460 is specifically configured to:
annotate historical error information as input data and construct a shallow network model;
and input new error information into the shallow network model, analyze the error fields in the log information using the natural language classification algorithm of the shallow network model to obtain a classification judgment result, and position the fault root cause.
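The text does not name a specific shallow network. As one possible instance, the sketch below uses a bag-of-words input feeding a single linear layer with a softmax output, trained by plain gradient descent on annotated error messages. The error messages, labels and hyperparameters are invented for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

class ShallowTextClassifier:
    """Bag-of-words input, one linear layer, softmax output."""

    def fit(self, texts, labels, epochs=200, lr=0.5):
        words = sorted({w for t in texts for w in tokenize(t)})
        self.vocab = {w: i for i, w in enumerate(words)}
        self.labels = sorted(set(labels))
        self.weights = [[0.0] * len(self.vocab) for _ in self.labels]
        for _ in range(epochs):
            for text, label in zip(texts, labels):
                x = self._vector(text)
                probs = self._softmax(x)
                for k, name in enumerate(self.labels):
                    # Gradient of the cross-entropy loss with respect to the class score.
                    grad = probs[k] - (1.0 if name == label else 0.0)
                    for i, value in x.items():
                        self.weights[k][i] -= lr * grad * value
        return self

    def _vector(self, text):
        """Sparse, length-normalized word counts; unknown words are ignored."""
        counts = Counter(self.vocab[w] for w in tokenize(text) if w in self.vocab)
        total = sum(counts.values()) or 1
        return {i: c / total for i, c in counts.items()}

    def _softmax(self, x):
        scores = [sum(row[i] * v for i, v in x.items()) for row in self.weights]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        return [e / norm for e in exps]

    def predict(self, text):
        probs = self._softmax(self._vector(text))
        return self.labels[probs.index(max(probs))]

# Hypothetical annotated historical error information.
HISTORY = [
    ("connection refused to database host", "database"),
    ("database query timeout exceeded", "database"),
    ("out of memory error in container", "memory"),
    ("java heap space out of memory", "memory"),
    ("disk full no space left on device", "disk"),
    ("disk write failed io error", "disk"),
]
texts, labels = zip(*HISTORY)
classifier = ShallowTextClassifier().fit(texts, labels)
root_cause = classifier.predict("database connection timeout")
```

Weights start at zero and there is no random initialization, so the sketch is deterministic. A production model would more likely rely on an established library and a much larger annotated corpus.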
It should be noted that although several modules of the service fault location device based on machine learning and text classification are mentioned in the above detailed description, this division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the modules described above may be embodied in a single module. Conversely, the features and functions of one module described above may be further divided among and embodied by a plurality of modules.
Based on the aforementioned inventive concept, as shown in fig. 5, the present invention further provides a computer device 500, which includes a memory 510, a processor 520, and a computer program 530 stored on the memory 510 and executable on the processor 520, wherein the processor 520 executes the computer program 530 to implement the aforementioned service fault location method based on machine learning and text classification.
Based on the foregoing inventive concept, the present invention proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the foregoing service fault location method based on machine learning and text classification.
The service fault positioning method and device based on machine learning and text classification use independently developed processing for data acquisition, data preprocessing, prediction and monitoring alarms to perform data acquisition, preprocessing, fault analysis and fault classification on real-time service information. This enables rapid positioning and classification of system faults and effectively improves the accuracy and effectiveness of system alarms.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention. They illustrate the technical solutions of the present invention and do not limit them, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed by the present invention, anyone skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A service fault positioning method based on machine learning and text classification is characterized by comprising the following steps:
extracting operation and maintenance data;
acquiring cluster log data and time sequence operation data in real time according to the operation and maintenance data;
analyzing fault logs according to the cluster log data to obtain a monitoring index when the service runs, and obtaining a time-consuming index of service execution according to start-stop log information of service execution;
according to the time sequence operation data, obtaining resource monitoring indexes of a container level and a service level;
analyzing and judging the fault root cause by using a fault judgment model according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index to obtain a fault root cause analysis result;
and analyzing error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result, and positioning the fault root cause.
2. The method of claim 1, wherein the extracted operation and maintenance data at least comprises: application information, node information and log information;
the method further comprises:
formatting the application information, the node information and the log information.
3. The method for locating the service fault based on the machine learning and the text classification as claimed in claim 2, wherein the obtaining of cluster log data and time-series operation data in real time according to the operation and maintenance data comprises:
obtaining log information flow in real time and storing the log information flow into an ES cluster to obtain cluster log data;
and acquiring the information of the CPU, the memory and the disk IO of the operation container in real time.
4. The method of claim 1, wherein the monitoring indexes during service operation at least comprise: request count, request success rate, request accuracy rate, request response time and error information.
5. The method of claim 3, wherein the container-level and service-level resource monitoring indexes comprise at least: resource monitoring indexes for the container CPU, the container memory and the host IO.
6. The method for locating a service fault based on machine learning and text classification according to claim 1, wherein analyzing and judging the fault root cause by using a fault judgment model according to the monitoring index during service operation, the service execution time consumption index and the resource monitoring index to obtain a fault root cause analysis result comprises:
establishing a fault judgment model using a naive Bayes classification algorithm from machine learning, and training the fault judgment model with historical data of the monitoring index during service operation, the service execution time consumption index and the resource monitoring index as input features and the judgment result as the output feature; the fault judgment model is a supervised machine learning model that serves as a multi-factor prediction model for judgment;
and taking the newly generated monitoring index during service operation, service execution time consumption index and resource monitoring index as input features, and performing fault judgment using the fault judgment model to obtain the fault occurrence probability.
7. The method of claim 6, further comprising:
when the model judges that the fault occurrence probability is greater than a preset value, applying an auxiliary corrective judgment based on the CPU, the memory and the service execution success rate to obtain a corrected fault root cause analysis result.
8. The method for locating service fault based on machine learning and text classification as claimed in claim 4, wherein the locating the fault root cause by analyzing the error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result comprises:
annotating historical error information as input data, and constructing a shallow network model;
inputting new error information into the shallow network model, analyzing error fields in the log information by adopting a natural language classification algorithm of the shallow network model to obtain a classification judgment result, and positioning the fault root cause.
9. A service fault location apparatus based on machine learning and text classification, the apparatus comprising:
the data extraction module is used for extracting operation and maintenance data;
the real-time data acquisition module is used for acquiring cluster log data and time sequence operation data in real time according to the operation and maintenance data;
the fault log analysis module is used for analyzing fault logs according to the cluster log data to obtain a monitoring index when the service runs, and obtaining a service execution time consumption index according to start-stop log information of service execution;
the resource monitoring module is used for obtaining resource monitoring indexes of a container level and a service level according to the time sequence operation data;
the fault root cause analysis module is used for analyzing and judging the fault root cause by using a fault judgment model according to the monitoring index when the service runs, the service execution time consumption index and the resource monitoring index to obtain a fault root cause analysis result;
and the fault root cause positioning module is used for analyzing error field information in the log information by using a natural language classification algorithm according to the fault root cause analysis result and positioning the fault root cause.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.
CN202110392903.9A 2021-04-13 2021-04-13 Service fault positioning method and device based on machine learning and text classification Pending CN113094198A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110392903.9A CN113094198A (en) 2021-04-13 2021-04-13 Service fault positioning method and device based on machine learning and text classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110392903.9A CN113094198A (en) 2021-04-13 2021-04-13 Service fault positioning method and device based on machine learning and text classification

Publications (1)

Publication Number Publication Date
CN113094198A true CN113094198A (en) 2021-07-09

Family

ID=76676547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110392903.9A Pending CN113094198A (en) 2021-04-13 2021-04-13 Service fault positioning method and device based on machine learning and text classification

Country Status (1)

Country Link
CN (1) CN113094198A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination