CN111984499A - Fault detection method and device for big data cluster - Google Patents

Fault detection method and device for big data cluster

Info

Publication number
CN111984499A
CN111984499A (application CN202010773292.8A)
Authority
CN
China
Prior art keywords
abnormal
data
information
log data
log
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010773292.8A
Other languages
Chinese (zh)
Other versions
CN111984499B (en)
Inventor
洪毅清
吕文栋
蔡淑莲
钟文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN202010773292.8A
Publication of CN111984499A
Application granted
Publication of CN111984499B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F 11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3466 Performance evaluation by tracing or monitoring
    • G06F 11/3476 Data logging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a fault detection method and device for a big data cluster, and relates to the technical field of computers. One embodiment of the method comprises: acquiring index data and log data of a big data cluster; determining abnormal information and a corresponding processing scheme according to the index data and the log data; performing machine learning on the abnormal information and the corresponding processing scheme thereof, and storing the learning result into an operation and maintenance library; and carrying out fault detection on the big data cluster based on the learning result in the operation and maintenance library. This embodiment can solve the technical problems of difficult fault-source detection and high operation and maintenance costs.

Description

Fault detection method and device for big data cluster
Technical Field
The invention relates to the technical field of computers, in particular to a fault detection method and device for a big data cluster.
Background
With the development of services, the operation management of big data clusters presents the following pain points:
(1) "Slow": the service operation environment is increasingly complex and fault detection is slow. Service systems and big data systems keep multiplying and depend heavily on IT resources; once any problem occurs, it must be ruled out system by system, which makes fault detection difficult.
(2) "Tiring": the operation and maintenance workload is heavy and automated tools and means are lacking. Operation and maintenance personnel face a large amount of repetitive, manual troubleshooting work every day, which is time-consuming, labor-intensive, error-prone and inefficient.
(3) "Disordered": the operation and maintenance work is not streamlined, standardized or electronic. The daily big data operation and maintenance workflow is chaotic, there is no standard process, work efficiency is low, and application complaints keep coming.
(4) "Difficult": the investment in information-system construction is huge, yet its effect is hard to demonstrate. Operation and maintenance personnel are run ragged every day, the workload is heavy, and quality cannot be controlled.
At present, fault detection technologies for big data clusters are deficient: information such as cluster load monitoring, log troubleshooting, fault analysis and fault removal cannot be presented directly. In an enterprise-level data warehouse, the number of big data clusters is large, the cluster scale is huge and management is very complex, yet there is no unified operation management interface, so problems in these clusters are often difficult to locate.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
Fault detection in existing big data clusters mainly relies on manual work: dedicated personnel carry out manual operation and maintenance, or only basic resources such as CPU, memory, hard disk and network are monitored at the operating-system level.
Fault detection in existing big data clusters can only manage a single cluster, and mostly monitors the bottom-layer indexes of a single operating system, which has the following defects:
unified monitoring management and quick query cannot be carried out across multiple big data clusters, so the unified management requirements of enterprises cannot be met;
data analysis is difficult: querying, analyzing and comparing data is hard;
fault detection: personnel utilization is low and the fault source is difficult to detect;
operation and maintenance cost: high cost, low efficiency, more service complaints and high system risk.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for detecting a failure of a big data cluster, so as to solve the technical problems of difficulty in detecting a failure root and high operation and maintenance costs.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a fault detection method for a large data cluster, including:
acquiring index data and log data of a big data cluster;
determining abnormal information and a corresponding processing scheme according to the index data and the log data;
performing machine learning on the abnormal information and the corresponding processing scheme thereof, and storing the learning result into an operation and maintenance library;
and carrying out fault detection on the big data cluster based on the learning result in the operation and maintenance library.
Optionally, the index data includes index data of an operating system process, a single server, a distributed service, and a distributed cluster.
Optionally, the index data of the single server includes: host name configuration checking, NTP offset checking, average load, host disk read-write efficiency, CPU utilization rate, host file handle utilization rate, host network state and load, HA state checking, key directory checking, memory utilization rate, hardware state checking, SWAP utilization rate, host state checking, alarm information and error information.
Optionally, the index data of the operating system process includes: big data process survival status, multithreading information, process congestion status, stack usage status.
Optionally, the index data of the distributed service includes: start-stop state, Kerberos identity authentication mechanism communication state, key process health state, database communication state, service state and alarm information.
Optionally, the index data of the distributed cluster includes: cluster host health state statistics, host network read-write rate distribution, host disk occupancy rate distribution, host memory occupancy rate distribution and host CPU occupancy rate distribution.
Optionally, determining abnormal information and a corresponding processing scheme according to the index data and the log data includes:
determining abnormal information according to the index data and the log data;
and determining a corresponding processing scheme according to the abnormal information.
Optionally, the exception information includes: abnormal points, context information when the abnormal points are generated, abnormal classification, abnormal keywords and abnormal detailed information;
determining abnormal information according to the index data and the log data, wherein the determining of abnormal information comprises the following steps:
abnormal point detection is carried out on the index data to obtain abnormal points;
and determining the context, the abnormal classification, the abnormal keywords and the abnormal detailed information when the abnormal point is generated according to the log data.
Optionally, before determining the abnormal information and the corresponding processing scheme according to the index data and the log data, the method further includes:
carrying out standardization operation on the collected log data;
performing aggregation operation on the log data subjected to the standardized operation;
and carrying out abnormal point detection on the log data after the aggregation operation.
Optionally, the collected log data is subjected to a normalization operation, including:
the log data is standardized using the system sed command and awk command of Linux, so that the log data is mapped into tuples.
Optionally, the system sed command and awk command of Linux are used to perform standardization work on log data, including:
using the sed command to realize operations of replacing, deleting, newly adding, selecting a specific row and the like on the data;
using the awk command, a row is divided into multiple fields for processing, and in conjunction with the grep command, the row of a particular character is filtered, and the string is split using the split command.
Optionally, the tuple comprises: time of production, name of application or component, IP address, details and remarks.
Optionally, performing an aggregation operation on the log data after the normalization operation, including:
and performing aggregation calculation on the log data after the standardized operation based on a concept clustering algorithm, and performing aggregation of multi-dimensional fields to combine or aggregate the alarms into a related event.
Optionally, the aggregated dimensions field includes: time, place, line of business, service, and interface.
Optionally, performing anomaly point detection on the log data after the aggregation operation, including:
abnormal point detection is carried out on the log data after the aggregation operation by using a univariate or multivariate abnormal point detection method based on normal distribution, and abnormal points in the low-dimensional data set are detected;
and carrying out anomaly point detection on the log data after the aggregation operation by using an isolation forest algorithm, and detecting anomaly points in the high-dimensional data set.
Optionally, performing machine learning on the abnormal information and the processing scheme corresponding to the abnormal information, and storing a learning result in an operation and maintenance library, including:
analyzing the abnormal information and the corresponding processing scheme based on the correlation analysis of the association rule to generate the association rule;
matching an alarm event with the association rule based on the attribute value set frequently appearing in the abnormal information so as to obtain an alarm combination about the fault source;
and storing the alarm combination and the corresponding processing scheme thereof into an operation and maintenance library.
Optionally, performing fault detection on the big data cluster based on the learning result in the operation and maintenance library, including:
performing word segmentation processing on each log marked as having abnormal points, and extracting keywords related to the state;
inquiring the operation and maintenance library, searching and matching the full text of the operation and maintenance library through the keywords, and directly returning a processing scheme in the operation and maintenance library to a user if the full text of the operation and maintenance library is completely matched; if the log is not completely matched, a fuzzy matching identification method is used for finding out similar associated indexes and events, and assisting a user in comparing and regularly mining the log, so that fault detection is carried out.
In addition, according to another aspect of the embodiments of the present invention, there is provided a fault detection apparatus for a big data cluster, including:
the acquisition module is used for acquiring index data and log data of the big data cluster;
the determining module is used for determining abnormal information and a corresponding processing scheme thereof according to the index data and the log data;
the learning module is used for performing machine learning on the abnormal information and the corresponding processing scheme thereof and storing a learning result into an operation and maintenance library;
and the detection module is used for carrying out fault detection on the big data cluster based on the learning result in the operation and maintenance library.
Optionally, the index data includes index data of an operating system process, a single server, a distributed service, and a distributed cluster.
Optionally, the index data of the single server includes: host name configuration checking, NTP offset checking, average load, host disk read-write efficiency, CPU utilization rate, host file handle utilization rate, host network state and load, HA state checking, key directory checking, memory utilization rate, hardware state checking, SWAP utilization rate, host state checking, alarm information and error information.
Optionally, the index data of the operating system process includes: big data process survival status, multithreading information, process congestion status, stack usage status.
Optionally, the index data of the distributed service includes: start-stop state, Kerberos identity authentication mechanism communication state, key process health state, database communication state, service state and alarm information.
Optionally, the index data of the distributed cluster includes: cluster host health state statistics, host network read-write rate distribution, host disk occupancy rate distribution, host memory occupancy rate distribution and host CPU occupancy rate distribution.
Optionally, the determining module is further configured to:
determining abnormal information according to the index data and the log data;
and determining a corresponding processing scheme according to the abnormal information.
Optionally, the exception information includes: abnormal points, context information when the abnormal points are generated, abnormal classification, abnormal keywords and abnormal detailed information;
the determination module is further to:
abnormal point detection is carried out on the index data to obtain abnormal points;
and determining the context, the abnormal classification, the abnormal keywords and the abnormal detailed information when the abnormal point is generated according to the log data.
Optionally, the determining module is further configured to:
before determining abnormal information and a corresponding processing scheme according to the index data and the log data, carrying out standardization operation on the collected log data;
performing aggregation operation on the log data subjected to the standardized operation;
and carrying out abnormal point detection on the log data after the aggregation operation.
Optionally, the determining module is further configured to:
the log data is standardized using the system sed command and awk command of Linux, so that the log data is mapped into tuples.
Optionally, the determining module is further configured to:
using the sed command to realize operations of replacing, deleting, newly adding, selecting a specific row and the like on the data;
using the awk command, a row is divided into multiple fields for processing, and in conjunction with the grep command, the row of a particular character is filtered, and the string is split using the split command.
Optionally, the tuple comprises: time of production, name of application or component, IP address, details and remarks.
Optionally, the determining module is further configured to:
and performing aggregation calculation on the log data after the standardized operation based on a concept clustering algorithm, and performing aggregation of multi-dimensional fields to combine or aggregate the alarms into a related event.
Optionally, the aggregated dimensions field includes: time, place, line of business, service, and interface.
Optionally, the determining module is further configured to:
abnormal point detection is carried out on the log data after the aggregation operation by using a univariate or multivariate abnormal point detection method based on normal distribution, and abnormal points in the low-dimensional data set are detected;
and carrying out anomaly point detection on the log data after the aggregation operation by using an isolation forest algorithm, and detecting anomaly points in the high-dimensional data set.
Optionally, the learning module is further configured to:
analyzing the abnormal information and the corresponding processing scheme based on the correlation analysis of the association rule to generate the association rule;
matching an alarm event with the association rule based on the attribute value set frequently appearing in the abnormal information so as to obtain an alarm combination about the fault source;
and storing the alarm combination and the corresponding processing scheme thereof into an operation and maintenance library.
Optionally, the detection module is further configured to:
performing word segmentation processing on each log marked as having abnormal points, and extracting keywords related to the state;
inquiring the operation and maintenance library, searching and matching the full text of the operation and maintenance library through the keywords, and directly returning a processing scheme in the operation and maintenance library to a user if the full text of the operation and maintenance library is completely matched; if the log is not completely matched, a fuzzy matching identification method is used for finding out similar associated indexes and events, and assisting a user in comparing and regularly mining the log, so that fault detection is carried out.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the method of any of the embodiments described above.
According to another aspect of the embodiments of the present invention, there is also provided a computer readable medium, on which a computer program is stored, which when executed by a processor implements the method of any of the above embodiments.
One embodiment of the above invention has the following advantages or benefits: the technical means that the abnormal information and the corresponding processing scheme are determined according to the index data and the log data, the abnormal information and the corresponding processing scheme are subjected to machine learning, the learning result is stored in the operation and maintenance library, and the fault detection is performed on the large data cluster based on the learning result in the operation and maintenance library are adopted, so that the technical problems that the fault source detection is difficult and the operation and maintenance cost is high in the prior art are solved. The embodiment of the invention abstractly aggregates and converges the alarms with various dimensions in the log into a small number of events, thereby effectively reducing the number of alarms, avoiding that really useful alarm information is submerged into massive useless alarm information, quickly and accurately detecting the fault and reducing the operation and maintenance cost.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main flow of a fault detection method for large data clusters according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a main flow of a fault detection method of a large data cluster according to one referential embodiment of the present invention;
FIG. 3 is a schematic diagram of a main flow of a fault detection method of a large data cluster according to another referential embodiment of the present invention;
FIG. 4 is a schematic diagram of the main modules of a fault detection apparatus for big data clusters according to an embodiment of the present invention;
FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 6 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The invention relates to fault detection for big data clusters, and realizes fault detection and management for big data products such as multiple MPP parallel database clusters and multiple Hadoop big data clusters.
Fig. 1 is a schematic diagram of a main flow of a fault detection method of a large data cluster according to an embodiment of the present invention. As an embodiment of the present invention, as shown in fig. 1, the method for detecting a failure of a big data cluster may include:
step 101, collecting index data and log data of a big data cluster.
Probe technology can be used to acquire index data of the big data cluster, and log data such as system logs and application logs can be acquired through Filebeat, achieving all-round data acquisition for the big data cluster and collecting as much data as possible for subsequent log analysis and fault detection. For example, real-time performance index data such as CPU, memory and disk of server resources in the cluster are acquired based on a network probe and a system resource probe; log data such as system logs and application logs of the big data cluster are acquired by a lightweight log collector based on Filebeat.
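As an illustration only, the following Python sketch shows probe-style collection of host-level index data; psutil stands in here for the network and system-resource probes mentioned above, and the field names are assumptions made for the example rather than part of the embodiment.

```python
import json
import socket
from datetime import datetime

import psutil

def collect_host_indexes():
    """Collect a few host-level indicators in one probe pass (illustrative fields)."""
    return {
        "time": datetime.now().isoformat(timespec="seconds"),
        "host": socket.gethostname(),
        "cpu_utilization": psutil.cpu_percent(interval=1),
        "memory_utilization": psutil.virtual_memory().percent,
        "swap_utilization": psutil.swap_memory().percent,
        "disk_utilization": psutil.disk_usage("/").percent,
    }

if __name__ == "__main__":
    # The record would normally be shipped to the monitoring backend;
    # printing it is enough for this sketch. Log collection itself is done
    # by Filebeat on each host and is not shown here.
    print(json.dumps(collect_host_indexes()))
```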
Optionally, the index data includes index data of an operating system process, a single server, a distributed service, and a distributed cluster. The embodiment of the invention collects indicators of increasingly coarse granularity for a big data cluster (namely, operating system process -> single server -> distributed service -> distributed cluster) and assembles them into index data.
Optionally, the index data of the single server includes: host name configuration checking, NTP offset checking, average load, host disk read-write efficiency, CPU utilization rate, host file handle utilization rate, host network state and load, HA state checking, key directory checking, memory utilization rate, hardware state checking, SWAP utilization rate, host state checking, alarm information and error information. Optionally, the index data of the operating system process includes: big data process survival status, multithreading information, process congestion status and stack usage status. Optionally, the index data of the distributed service includes: start-stop state, Kerberos identity authentication mechanism communication state, key process health state, database communication state, service state and alarm information. Optionally, the index data of the distributed cluster includes: cluster host health state statistics, host network read-write rate distribution, host disk occupancy rate distribution, host memory occupancy rate distribution and host CPU occupancy rate distribution. Alarm thresholds may be set for these indicators respectively, and an alarm is triggered once a threshold is exceeded.
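A minimal sketch of the threshold-triggered alarming described above is given below; the indicator names, threshold values and alarm-record fields are illustrative assumptions, and real values would be configured per cluster and per host role.

```python
from datetime import datetime

# Hypothetical per-indicator alarm thresholds (percentages).
ALARM_THRESHOLDS = {
    "cpu_utilization": 90.0,
    "memory_utilization": 85.0,
    "swap_utilization": 50.0,
    "file_handle_utilization": 80.0,
}

def check_indicators(host, indicators):
    """Return one alarm record for every indicator that exceeds its threshold."""
    alarms = []
    for name, value in indicators.items():
        threshold = ALARM_THRESHOLDS.get(name)
        if threshold is not None and value > threshold:
            alarms.append({
                "time": datetime.now().isoformat(timespec="seconds"),
                "host": host,
                "indicator": name,
                "value": value,
                "threshold": threshold,
            })
    return alarms

# A probe reports CPU and SWAP usage on one host; two alarms are produced.
print(check_indicators("10.0.0.12", {"cpu_utilization": 97.3, "swap_utilization": 61.0}))
```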
And 102, determining abnormal information and a corresponding processing scheme according to the index data and the log data.
Optionally, step 102 may comprise: determining abnormal information according to the index data and the log data; and determining a corresponding processing scheme according to the abnormal information. Optionally, the exception information includes: abnormal points, context information when the abnormal points are generated, abnormal classification, abnormal keywords and abnormal detailed information; accordingly, step 102 may include: abnormal point detection is carried out on the index data to obtain abnormal points; and determining the context, the abnormal classification, the abnormal keywords and the abnormal detailed information when the abnormal point is generated according to the log data. If the index data exceeds an alarm threshold, an abnormality has occurred and an alarm is triggered; the context, abnormality classification, abnormality keywords and detailed abnormality information at the time the abnormal point was generated are then determined by combining the log data.
Optionally, before step 102, further comprising: carrying out standardization operation on the collected log data; performing aggregation operation on the log data subjected to the standardized operation; and carrying out abnormal point detection on the log data after the aggregation operation. The context, the abnormal classification, the abnormal keywords and the abnormal detailed information when the abnormal point is generated can be located by processing and analyzing the log data.
Optionally, the collected log data is subjected to a normalization operation, including: the log data is standardized using the system sed command and awk command of Linux, so that the log data is mapped into tuples. Optionally, the system sed command and awk command of Linux are used to perform standardization work on log data, including: using the sed command to realize operations of replacing, deleting, newly adding, selecting a specific row and the like on the data; using the awk command, a row is divided into multiple fields for processing, and in conjunction with the grep command, the row of a particular character is filtered, and the string is split using the split command. Optionally, the tuple comprises: time of production, name of application or component, IP address, details and remarks.
Because the real-time index data is already standardized as a tuple of <time, cluster name, service name, IP address, process name, index value, remark>, no additional standardization is needed for it; the standardization operation therefore only needs to be carried out on the log data. The log data standardization can use the Linux system sed and awk commands to standardize log data coming from the many heterogeneous systems of a big data cluster: the sed command is used to replace, delete, insert and select specific lines of the data, for example, nl /var/log/nginx.log | sed '2a test' appends a new line after line 2 when displaying the log file nginx.log; the awk command divides a line into multiple "fields" for processing, the grep command is combined with it to filter lines containing a particular character, and the split function is used to split strings. Finally, logs in different formats are uniformly mapped into the quintuple <generation time, name of application or component, IP address, detailed content, remark>, so that the log content is standardized into a machine-understandable, relatively uniform format that identifies its source (such as business line and service), which improves the degree of automation and facilitates fault analysis and detection.
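For illustration, the following sketch shows how a raw log line could be mapped into the quintuple described above; the assumed raw log layout and the regular expression are examples only, and a production pipeline would typically rely on sed/awk or collector-side processing instead.

```python
import re

# Assumed raw layout: "<timestamp> <component> <ip> <free-form details>".
LINE_PATTERN = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2})\s+"
    r"(?P<component>\S+)\s+"
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<details>.*)"
)

def normalize(line, remark=""):
    """Map one raw log line to the 5-tuple, or None if it does not match the assumed layout."""
    match = LINE_PATTERN.match(line.strip())
    if not match:
        return None
    return (match["time"], match["component"], match["ip"], match["details"], remark)

raw = "2020-08-04 10:15:32 hbase-regionserver 192.168.1.21 ERROR region server GC pause too long"
print(normalize(raw, remark="collected by Filebeat"))
```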
Optionally, performing an aggregation operation on the log data after the normalization operation, including: and performing aggregation calculation on the log data after the standardized operation based on a concept clustering algorithm, and performing aggregation of multi-dimensional fields to combine or aggregate the alarms into a related event. Optionally, the aggregated dimensions field includes: time, place, line of business, service, and interface.
Log aggregation serves two purposes. On one hand, it counts the log data, samples it and so on, producing statistical information such as counts (how many alarm logs there are). On the other hand, it performs multidimensional aggregation based on the dimension fields of the data; log data usually carries dimension fields such as time, place, business line, service and interface, so log data can be aggregated and converged on the time dimension into a single log, reducing duplicated log data. For example, 30 pieces of log data or alarm information (one every 10 seconds) about a network fault of a certain server within 5 minutes are aggregated into 1 alarm, meaning the machine was in the same fault state during those 5 minutes.
For the standardized data, aggregation calculation (counting, averaging, sampling and the like) and multi-dimensional aggregation (covering time, place, business line, service and interface) are performed based on the concept clustering algorithm AOI (Attribute-Oriented Induction), reducing the dimensionality of alarm information and combining or aggregating alarms into correlated events. For example, in a streaming computing cluster, when writes to a queue through an interface on some IP become time-consuming (exceeding the alarm convergence threshold set by the system), the IP attribute of the alarm is replaced by the more abstract "queue service pool"; similarly, the "queue service pool" can in turn be replaced by a "machine room". After several layers of abstraction, alarms of various dimensions are finally aggregated and converged into a small number of events, which effectively reduces the number of alarms and prevents truly useful alarm information from being drowned in massive useless alarm information.
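The following rough sketch illustrates the time-window and dimension aggregation just described; the window length, dimension fields and the IP-to-service-pool mapping are assumptions, and the AOI-style generalization is reduced here to a simple lookup.

```python
from collections import defaultdict
from datetime import datetime

WINDOW_SECONDS = 300                                   # 5-minute convergence window (assumption)
IP_TO_POOL = {"192.168.1.21": "queue-service-pool"}    # hypothetical generalization mapping

def aggregate(alarms):
    """alarms: dicts with keys time, ip, business_line, service, interface, message."""
    events = defaultdict(list)
    for alarm in alarms:
        ts = datetime.fromisoformat(alarm["time"]).timestamp()
        window = int(ts // WINDOW_SECONDS)                  # time dimension
        place = IP_TO_POOL.get(alarm["ip"], alarm["ip"])    # generalized place dimension
        key = (window, place, alarm["business_line"], alarm["service"], alarm["interface"])
        events[key].append(alarm["message"])
    # One converged event per key, carrying a count instead of repeated alarms.
    return [{"dimensions": key, "alarm_count": len(messages), "sample": messages[0]}
            for key, messages in events.items()]
```

Grouping on a coarse time window plus generalized dimension values is what collapses, for example, thirty repeated network-fault alarms into a single event.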
Optionally, performing anomaly point detection on the log data after the aggregation operation includes: abnormal point detection is carried out on the log data after the aggregation operation by using a univariate or multivariate abnormal point detection method based on normal distribution, and abnormal points in the low-dimensional data set are detected; and anomaly point detection is carried out on the log data after the aggregation operation by using an isolation forest algorithm, and anomaly points in the high-dimensional data set are detected.
The embodiment of the invention can acquire the abnormal points which are possibly faults from massive log data:
(1) For low-dimensional log data, a univariate or multivariate outlier detection method based on the normal distribution builds a probability distribution model from the data and obtains its probability density function; outliers are detected by judging a point's distance from the center of the normal distribution (outliers usually have low probability and lie far from the center). The basic idea is to compute how much an object differs from the objects around it; the higher the score, the more likely it is an outlier.
(2) For high-dimensional data and massive data, anomaly detection can be performed with the isolation forest algorithm. A random binary tree is first constructed, in which each node either has a left and a right subtree or has no children. After the random binary tree is built, data is predicted: each record is searched from the root of the tree down to the leaf node it falls on. Because a single random tree is unstable, multiple random binary trees are combined into an isolation forest, which detects anomalies more stably.
Therefore, anomaly detection algorithms such as the normal-distribution-based method and the isolation forest are used to detect abnormal values in the log data. It should be noted that the main difference between high-dimensional and low-dimensional data is the number of dimensions: low-dimensional data generally has simple fields with no correlation between attributes, such as the simple log record <time, IP address, log index value>; high-dimensional data generally has the format <time, IP address, <log index 1, log index value 1>, <log index 2, log index value 2>, ..., <log index n, log index value n>>, and log analysis needs to combine several or all of the indexes log index 1, log index 2, ..., log index n.
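The two detection paths can be illustrated with the sketch below: a normal-distribution (z-score) rule for the low-dimensional case and scikit-learn's IsolationForest for the high-dimensional case. The thresholds, contamination rate and sample data are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def zscore_outliers(values, z_threshold=3.0):
    """Univariate path: points far from the center of the fitted normal distribution."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    if std == 0:
        return np.zeros(len(values), dtype=bool)
    return np.abs(values - mean) / std > z_threshold

def forest_outliers(samples, contamination=0.01):
    """High-dimensional path: IsolationForest labels outliers as -1."""
    model = IsolationForest(contamination=contamination, random_state=0)
    return model.fit_predict(samples) == -1

# Low-dimensional records: one log index value each.
print(zscore_outliers([10, 11, 9, 10, 12, 95], z_threshold=2.0))
# High-dimensional records: <log index value 1, ..., log index value n> each.
print(forest_outliers([[0.1, 0.2], [0.11, 0.19], [0.09, 0.21], [5.0, 9.0]], contamination=0.25))
```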
And 103, performing machine learning on the abnormal information and the corresponding processing scheme thereof, and storing a learning result into an operation and maintenance library.
Optionally, step 103 may comprise: analyzing the abnormal information and the corresponding processing scheme based on the correlation analysis of the association rule to generate the association rule; matching an alarm event with the association rule based on the attribute value set frequently appearing in the abnormal information so as to obtain an alarm combination about the fault source; and storing the alarm combination and the corresponding processing scheme thereof into an operation and maintenance library.
Faults generally manifest as events, errors and symptoms; correlation analysis based on association rules mines the relations among event, error and alarm data to assist fault detection, prediction and prevention. Association-rule correlation analysis is a rule-based machine learning approach; the FP-growth algorithm generates frequent itemsets based on an FP tree, directly compressing the data set into a frequent-pattern tree and finally producing association rules, with measurement indexes used to distinguish the strong rules present in the data set. After the association rules are generated, they can be used to assist correlation analysis of events (errors and alarms): when an alarm/event occurs, matching it against the association rules determines whether it is an isolated alarm/event or an associated one, so that the root cause of the fault can be determined and the fault detected.
For example, suppose that analysis and mining yield the following association rules of alarm events: {HBase big data cluster load high} → {object library delay}, {object library delay} → {personal account detail online query failure rate high}. If the current fault symptom is a high failure rate of personal account detail online queries, then alarms about object library delay and HBase big data cluster load will certainly be received. When the alarms hit these two rules, a simple fault-source chain can be traced, namely {HBase big data cluster load high} → {object library delay} → {personal account detail online query failure rate high}. Operation and maintenance personnel can then be informed in time to expand the capacity of the HBase big data cluster, so as to reduce its load and improve the success rate of the key application "personal account detail online query". Through correlation analysis based on association rules, useful associations or correlations are found in massive system and application log data sets: frequently occurring attribute value sets (frequent itemsets) are identified from the data and then used to create the association rules that describe them; different alarm combinations can be formed, faults that are about to occur can be anticipated, and operation and maintenance personnel can be informed in advance, thereby achieving fault detection, prediction and prevention.
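As a sketch only, the following code mines association rules over co-occurring alarms with FP-growth; it uses the mlxtend library's fpgrowth and association_rules helpers as a stand-in for the FP-tree mining described above, and the transactions and thresholds are illustrative.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Each transaction is the set of alarms seen together in one aggregation window.
transactions = [
    ["hbase_cluster_load_high", "object_library_delay", "account_query_failure_high"],
    ["hbase_cluster_load_high", "object_library_delay"],
    ["object_library_delay", "account_query_failure_high"],
    ["network_packet_loss"],
]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

frequent_itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

# Rules such as {hbase_cluster_load_high} -> {object_library_delay} would be
# stored in the operation and maintenance library and matched against new alarms.
print(rules[["antecedents", "consequents", "support", "confidence"]])
```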
And 104, carrying out fault detection on the big data cluster based on the learning result in the operation and maintenance library.
Optionally, step 104 may include: performing word segmentation processing on each log marked as having abnormal points, and extracting keywords related to the state; inquiring the operation and maintenance library, searching and matching the full text of the operation and maintenance library through the keywords, and directly returning a processing scheme in the operation and maintenance library to a user if the full text of the operation and maintenance library is completely matched; if the log is not completely matched, a fuzzy matching identification method is used for finding out similar associated indexes and events, and assisting a user in comparing and regularly mining the log, so that fault detection is carried out.
Each log marked as having an abnormal point is segmented with the IK tokenizer of Elasticsearch, and state-related keywords (such as warning, error, exception) are extracted. The operation and maintenance library is then queried: the log keywords are matched against a full-text search of the library, and if they match completely, the previous processing scheme is returned to the user directly; if there is no complete match, pattern-recognition methods such as fuzzy matching find similar associated indexes and events and assist the user in log comparison and rule mining, thereby detecting and discovering system faults or abnormalities. If the keyword does not exist in the operation and maintenance library, the keyword and the alarm information are recorded into the library so that fault-handling rules can be added manually later.
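A minimal sketch of this keyword lookup is given below; difflib stands in for the Elasticsearch IK tokenizer and fuzzy full-text search of the embodiment, and the operation and maintenance library contents are hypothetical.

```python
import difflib
import re

STATE_KEYWORDS = {"warning", "error", "exception", "timeout", "refused"}

# Hypothetical operation and maintenance library: keyword phrase -> recorded scheme.
OM_LIBRARY = {
    "error connection refused": "Check that the DataNode process is alive and restart it.",
    "exception out of memory": "Increase the JVM heap size and review recent job memory usage.",
}

def extract_keywords(log_line):
    """Keep state-related tokens; fall back to all tokens if none are present."""
    tokens = re.findall(r"[a-z]+", log_line.lower())
    return [t for t in tokens if t in STATE_KEYWORDS] or tokens

def lookup(log_line):
    key = " ".join(extract_keywords(log_line))
    if key in OM_LIBRARY:                       # complete match: return the scheme directly
        return OM_LIBRARY[key]
    close = difflib.get_close_matches(key, OM_LIBRARY.keys(), n=1, cutoff=0.5)
    if close:                                   # fuzzy match: similar associated entry
        return "Similar known fault: " + OM_LIBRARY[close[0]]
    return None                                 # unknown: record keyword for manual rule entry

print(lookup("2020-08-04 ERROR Connection refused by DataNode 192.168.1.30"))
```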
According to the embodiment of the invention, obvious faults such as machine downtime, network outages and hard disk failures are rapidly detected by monitoring key indicators; through log analysis, faults whose characteristics are not obvious or that cannot be judged directly by observation (such as job errors, memory overflow and unresponsive requests) are detected; and hidden associated faults within complex alarms are identified through machine-learning-based association rule mining. This realizes all-round monitoring, fault analysis and solution optimization for big data clusters, presents them to users more intuitively, improves service quality and reduces management cost.
According to the various embodiments, it can be seen that the technical means of determining the abnormal information and the corresponding processing scheme thereof according to the index data and the log data, performing machine learning on the abnormal information and the corresponding processing scheme thereof, storing the learning result in the operation and maintenance library, and performing fault detection on the large data cluster based on the learning result in the operation and maintenance library in the embodiments of the present invention solves the technical problems of difficulty in detecting the root cause of a fault and high operation and maintenance cost in the prior art. The embodiment of the invention abstractly aggregates and converges the alarms with various dimensions in the log into a small number of events, thereby effectively reducing the number of alarms, avoiding that really useful alarm information is submerged into massive useless alarm information, quickly and accurately detecting the fault and reducing the operation and maintenance cost.
Fig. 2 is a schematic diagram of a main flow of a fault detection method of a large data cluster according to one referential embodiment of the present invention. As another embodiment of the present invention, as shown in fig. 2, the method for detecting a failure of a big data cluster may include:
step 201, collecting index data and log data of a big data cluster.
And 202, carrying out abnormal point detection on the index data to obtain abnormal points.
Step 203, determining the context, the abnormal classification, the abnormal keyword and the abnormal detailed information when the abnormal point is generated according to the log data.
And step 204, determining a corresponding processing scheme according to the abnormal point, the context when the abnormal point is generated, the abnormal classification, the abnormal keyword and the abnormal detailed information.
Step 205, analyzing the abnormal information and the corresponding processing scheme based on the correlation analysis of the association rule, and generating the association rule.
And step 206, matching the alarm event with the association rule based on the attribute value set frequently appearing in the abnormal information, so as to obtain an alarm combination related to the fault source.
And step 207, storing the alarm combination and the corresponding processing scheme thereof in an operation and maintenance library.
And step 208, performing word segmentation processing on each log marked as having the abnormal point, and extracting keywords related to the state.
Step 209, inquiring the operation and maintenance library, searching and matching the full text of the operation and maintenance library through the keywords, and directly returning a processing scheme in the operation and maintenance library to the user if the full text of the operation and maintenance library is completely matched; if the log is not completely matched, a fuzzy matching identification method is used for finding out similar associated indexes and events, and assisting a user in comparing and regularly mining the log, so that fault detection is carried out.
In addition, in one embodiment of the present invention, the detailed implementation of the fault detection method for a big data cluster is described in detail in the above-mentioned fault detection method for a big data cluster, so that the repeated content is not described again.
Fig. 3 is a schematic diagram of a main flow of a fault detection method of a large data cluster according to another referential embodiment of the present invention. As another embodiment of the present invention, as shown in fig. 3, the method for detecting a failure of a big data cluster may include:
step 301, collecting index data and log data of a big data cluster.
Step 302, performing a standardization operation on the collected log data.
And step 303, performing aggregation operation on the log data after the standardization operation.
And step 304, abnormal point detection is carried out on the log data after the aggregation operation.
And 305, determining abnormal information and a corresponding processing scheme according to the index data and the log data.
And step 306, performing machine learning on the abnormal information and the corresponding processing scheme thereof, and storing the learning result into an operation and maintenance library.
And 307, performing fault detection on the big data cluster based on the learning result in the operation and maintenance library.
In addition, in another embodiment of the present invention, the detailed implementation of the fault detection method for a big data cluster is described in detail in the above fault detection method for a big data cluster, so that repeated descriptions are not described herein.
Fig. 4 is a schematic diagram of main modules of a fault detection apparatus for a big data cluster according to an embodiment of the present invention, and as shown in fig. 4, the fault detection apparatus 400 for a big data cluster includes an acquisition module 401, a determination module 402, a learning module 403, and a detection module 404; the acquisition module 401 is configured to acquire index data and log data of a big data cluster; the determining module 402 is configured to determine abnormal information and a processing scheme corresponding to the abnormal information according to the index data and the log data; the learning module 403 is configured to perform machine learning on the abnormal information and the corresponding processing scheme thereof, and store a learning result in an operation and maintenance library; the detection module 404 is configured to perform fault detection on the big data cluster based on the learning result in the operation and maintenance library.
Optionally, the index data includes index data of an operating system process, a single server, a distributed service, and a distributed cluster.
Optionally, the index data of the single server includes: host name configuration checking, NTP offset checking, average load, host disk read-write efficiency, CPU utilization rate, host file handle utilization rate, host network state and load, HA state checking, key directory checking, memory utilization rate, hardware state checking, SWAP utilization rate, host state checking, alarm information and error information.
Optionally, the index data of the operating system process includes: big data process survival status, multithreading information, process congestion status, stack usage status.
Optionally, the index data of the distributed service includes: start-stop state, Kerberos identity authentication mechanism communication state, key process health state, database communication state, service state and alarm information.
Optionally, the index data of the distributed cluster includes: cluster host health state statistics, host network read-write rate distribution, host disk occupancy rate distribution, host memory occupancy rate distribution and host CPU occupancy rate distribution.
Optionally, the determining module 402 is further configured to:
determining abnormal information according to the index data and the log data;
and determining a corresponding processing scheme according to the abnormal information.
Optionally, the exception information includes: abnormal points, context information when the abnormal points are generated, abnormal classification, abnormal keywords and abnormal detailed information;
the determining module 402 is further configured to:
abnormal point detection is carried out on the index data to obtain abnormal points;
and determining the context, the abnormal classification, the abnormal keywords and the abnormal detailed information when the abnormal point is generated according to the log data.
Optionally, the determining module 402 is further configured to:
before determining abnormal information and a corresponding processing scheme according to the index data and the log data, carrying out standardization operation on the collected log data;
performing aggregation operation on the log data subjected to the standardized operation;
and carrying out abnormal point detection on the log data after the aggregation operation.
Optionally, the determining module 402 is further configured to:
the log data is standardized using the system sed command and awk command of Linux, so that the log data is mapped into tuples.
Optionally, the determining module 402 is further configured to:
using the sed command to realize operations of replacing, deleting, newly adding, selecting a specific row and the like on the data;
using the awk command, a row is divided into multiple fields for processing, and in conjunction with the grep command, the row of a particular character is filtered, and the string is split using the split command.
Optionally, the tuple comprises: time of production, name of application or component, IP address, details and remarks.
Optionally, the determining module 402 is further configured to:
and performing aggregation calculation on the log data after the standardized operation based on a concept clustering algorithm, and performing aggregation of multi-dimensional fields to combine or aggregate the alarms into a related event.
Optionally, the aggregated dimensions field includes: time, place, line of business, service, and interface.
Optionally, the determining module 402 is further configured to:
abnormal point detection is carried out on the log data after the aggregation operation by using a univariate or multivariate abnormal point detection method based on normal distribution, and abnormal points in the low-dimensional data set are detected;
and carrying out anomaly point detection on the log data after the aggregation operation by using an isolation forest algorithm, and detecting anomaly points in the high-dimensional data set.
Optionally, the learning module 403 is further configured to:
analyzing the abnormal information and the corresponding processing scheme based on the correlation analysis of the association rule to generate the association rule;
matching an alarm event with the association rule based on the attribute value set frequently appearing in the abnormal information so as to obtain an alarm combination about the fault source;
and storing the alarm combination and the corresponding processing scheme thereof into an operation and maintenance library.
Optionally, the detecting module 404 is further configured to:
performing word segmentation processing on each log marked as having abnormal points, and extracting keywords related to the state;
inquiring the operation and maintenance library, searching and matching the full text of the operation and maintenance library through the keywords, and directly returning a processing scheme in the operation and maintenance library to a user if the full text of the operation and maintenance library is completely matched; if the log is not completely matched, a fuzzy matching identification method is used for finding out similar associated indexes and events, and assisting a user in comparing and regularly mining the log, so that fault detection is carried out.
According to the various embodiments, it can be seen that the technical means of determining the abnormal information and the corresponding processing scheme thereof according to the index data and the log data, performing machine learning on the abnormal information and the corresponding processing scheme thereof, storing the learning result in the operation and maintenance library, and performing fault detection on the large data cluster based on the learning result in the operation and maintenance library in the embodiments of the present invention solves the technical problems of difficulty in detecting the root cause of a fault and high operation and maintenance cost in the prior art. The embodiment of the invention abstractly aggregates and converges the alarms with various dimensions in the log into a small number of events, thereby effectively reducing the number of alarms, avoiding that really useful alarm information is submerged into massive useless alarm information, quickly and accurately detecting the fault and reducing the operation and maintenance cost.
It should be noted that, in the implementation of the fault detection apparatus for a big data cluster according to the present invention, the above fault detection method for a big data cluster has been described in detail, and therefore, the repeated content is not described herein.
Fig. 5 shows an exemplary system architecture 500 of a fault detection method of a big data cluster or a fault detection apparatus of a big data cluster to which an embodiment of the present invention may be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 serves to provide a medium for communication links between the terminal devices 501, 502, 503 and the server 505. Network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 501, 502, 503 to interact with a server 505 over a network 504 to receive or send messages or the like. The terminal devices 501, 502, 503 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 501, 502, 503 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 505 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 501, 502, 503. The background management server can analyze and process received data such as a product information query request, and feed back the processing result to the terminal device.
It should be noted that the method for detecting a fault of a big data cluster provided in the embodiment of the present invention is generally executed by the server 505, and accordingly, the fault detection apparatus of the big data cluster is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks, and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed into the storage section 608 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. When executed by the Central Processing Unit (CPU) 601, the computer program performs the above-described functions defined in the system of the present invention.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer programs according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may, for example, be described as: a processor comprising an acquisition module, a determining module, a learning module, and a detection module, where the names of these modules do not in some cases constitute a limitation on the modules themselves.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, implement the method of: acquiring index data and log data of a big data cluster; determining abnormal information and a corresponding processing scheme according to the index data and the log data; performing machine learning on the abnormal information and the corresponding processing scheme thereof, and storing the learning result into an operation and maintenance library; and carrying out fault detection on the big data cluster based on the learning result in the operation and maintenance library.
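Purely as an illustrative aid to the "determining abnormal information" step above, the sketch below pairs a normal-distribution (three-sigma) test for a single low-dimensional metric with an isolation forest for high-dimensional index data, the two kinds of abnormal point detection recited in the claims; the sample data, thresholds, and the use of scikit-learn are assumptions of the sketch, not the patent's concrete implementation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest  # assumed, illustrative implementation choice


def zscore_outliers(values, sigma=3.0):
    """Univariate detection assuming an approximately normal distribution:
    points further than `sigma` standard deviations from the mean are abnormal."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    if std == 0:
        return np.zeros(len(values), dtype=bool)
    return np.abs(values - mean) > sigma * std


def isolation_forest_outliers(matrix, contamination=0.05):
    """High-dimensional detection: each row is one sample of cluster index data
    (CPU, memory, disk I/O, ...); rows labelled -1 by the model are abnormal points."""
    model = IsolationForest(contamination=contamination, random_state=0)
    return model.fit_predict(matrix) == -1


if __name__ == "__main__":
    cpu_usage = [21, 23, 22, 24, 20] * 4 + [95]               # one metric over time
    metrics = np.random.default_rng(0).normal(size=(100, 8))  # 100 samples x 8 metrics
    metrics[7] += 6.0                                          # inject one obvious anomaly
    print("univariate abnormal indexes:", np.where(zscore_outliers(cpu_usage))[0])
    print("isolation forest abnormal rows:", np.where(isolation_forest_outliers(metrics))[0])
```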
According to the technical scheme of the embodiments of the present invention, the abnormal information and the corresponding processing scheme thereof are determined according to the index data and the log data, machine learning is performed on the abnormal information and the corresponding processing scheme thereof, the learning result is stored in the operation and maintenance library, and fault detection is performed on the big data cluster based on the learning result in the operation and maintenance library, so that the technical problems in the prior art that the root cause of a fault is difficult to detect and the operation and maintenance cost is high are solved. The embodiments of the present invention abstract, aggregate and converge alarms of various dimensions in the logs into a small number of events, thereby effectively reducing the number of alarms, preventing truly useful alarm information from being drowned in a mass of useless alarm information, detecting faults quickly and accurately, and reducing the operation and maintenance cost.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (23)

1. A fault detection method for a big data cluster is characterized by comprising the following steps:
acquiring index data and log data of a big data cluster;
determining abnormal information and a corresponding processing scheme according to the index data and the log data;
performing machine learning on the abnormal information and the corresponding processing scheme thereof, and storing the learning result into an operation and maintenance library;
and carrying out fault detection on the big data cluster based on the learning result in the operation and maintenance library.
2. The method of claim 1, wherein the index data comprises index data of operating system processes, single servers, distributed services, and distributed clusters.
3. The method of claim 2, wherein the index data of the single server comprises: host name configuration checking, NTP offset checking, average load, host disk read-write efficiency, CPU utilization rate, host file handle utilization rate, host network state and load, HA state checking, key directory checking, memory utilization rate, hardware state checking, SWAP utilization rate, host state checking, alarm information and error information.
4. The method of claim 2, wherein the index data of the operating system process comprises: big data process survival status, multithreading information, process congestion status, and stack usage status.
5. The method of claim 2, wherein the index data of the distributed service comprises: a start-stop state, a Kerberos identity authentication mechanism communication state, a key process health state, a database communication state, a service state, and alarm information.
6. The method of claim 2, wherein the index data of the distributed cluster comprises: cluster host health state statistics, host network read-write rate distribution, host disk occupancy rate distribution, host memory occupancy rate distribution, and host CPU occupancy rate distribution.
7. The method of claim 1, wherein determining the abnormal information and the corresponding processing scheme thereof according to the index data and the log data comprises:
determining abnormal information according to the index data and the log data;
and determining a corresponding processing scheme according to the abnormal information.
8. The method of claim 7, wherein the abnormal information comprises: abnormal points, context information when the abnormal points are generated, abnormal classification, abnormal keywords and abnormal detailed information;
determining abnormal information according to the index data and the log data, wherein the determining of abnormal information comprises the following steps:
abnormal point detection is carried out on the index data to obtain abnormal points;
and determining the context, the abnormal classification, the abnormal keywords and the abnormal detailed information when the abnormal point is generated according to the log data.
9. The method of claim 1, wherein, before determining the abnormal information and the corresponding processing scheme thereof according to the index data and the log data, the method further comprises:
carrying out standardization operation on the collected log data;
performing aggregation operation on the log data subjected to the standardization operation;
and carrying out abnormal point detection on the log data after the aggregation operation.
10. The method of claim 9, wherein carrying out the standardization operation on the collected log data comprises:
standardizing the log data using the sed command and the awk command of the Linux system, so that the log data is mapped into tuples.
11. The method of claim 10, wherein standardizing the log data using the sed command and the awk command of the Linux system comprises:
using the sed command to perform operations on the data such as replacing, deleting, adding, and selecting specific rows;
using the awk command to divide a row into multiple fields for processing, filtering rows containing particular characters in conjunction with the grep command, and splitting strings using split.
12. The method of claim 10, wherein the tuple comprises: generation time, application or component name, IP address, details, and remarks.
13. The method of claim 9, wherein performing the aggregation operation on the log data subjected to the standardization operation comprises:
performing aggregation calculation on the log data subjected to the standardization operation based on a concept clustering algorithm, and aggregating multi-dimensional fields to combine or aggregate alarms into a related event.
14. The method of claim 13, wherein the aggregated dimension field comprises: time, place, line of business, service, and interface.
15. The method of claim 14, wherein carrying out abnormal point detection on the log data after the aggregation operation comprises:
carrying out abnormal point detection on the log data after the aggregation operation by using a univariate or multivariate abnormal point detection method based on a normal distribution, so as to detect abnormal points in a low-dimensional data set;
and carrying out abnormal point detection on the log data after the aggregation operation by using an isolation forest algorithm, so as to detect abnormal points in a high-dimensional data set.
16. The method of claim 1, wherein performing machine learning on the abnormal information and the corresponding processing scheme thereof, and storing the learning result in an operation and maintenance library comprises:
analyzing the abnormal information and the corresponding processing scheme thereof based on association rule correlation analysis to generate an association rule;
matching an alarm event with the association rule based on the attribute value set frequently appearing in the abnormal information so as to obtain an alarm combination about the fault source;
and storing the alarm combination and the corresponding processing scheme thereof into an operation and maintenance library.
17. The method of claim 1, wherein fault detecting the big data cluster based on the learning result in the operation and maintenance library comprises:
performing word segmentation processing on each log marked as having abnormal points, and extracting keywords related to the state;
inquiring the operation and maintenance library, performing a full-text search and match on the operation and maintenance library with the keywords, and, if an exact match is found, directly returning the corresponding processing scheme in the operation and maintenance library to the user; if no exact match is found, using a fuzzy matching identification method to find similar associated indexes and events and assist the user in comparing the logs and mining regularities, thereby carrying out fault detection.
18. A big data cluster fault detection device, comprising:
the acquisition module is used for acquiring index data and log data of the big data cluster;
the determining module is used for determining abnormal information and a corresponding processing scheme thereof according to the index data and the log data;
the learning module is used for performing machine learning on the abnormal information and the corresponding processing scheme thereof and storing a learning result into an operation and maintenance library;
and the detection module is used for carrying out fault detection on the big data cluster based on the learning result in the operation and maintenance library.
19. The apparatus of claim 18, wherein the determining module is further configured to:
determining abnormal information according to the index data and the log data;
and determining a corresponding processing scheme according to the abnormal information.
20. The apparatus of claim 18, wherein the abnormal information comprises: abnormal points, context information when the abnormal points are generated, abnormal classification, abnormal keywords and abnormal detailed information;
the determination module is further to:
abnormal point detection is carried out on the index data to obtain abnormal points;
and determining the context, the abnormal classification, the abnormal keywords and the abnormal detailed information when the abnormal point is generated according to the log data.
21. The apparatus of claim 18, wherein the learning module is further configured to:
analyzing the abnormal information and the corresponding processing scheme thereof based on association rule correlation analysis to generate an association rule;
matching an alarm event with the association rule based on the attribute value set frequently appearing in the abnormal information so as to obtain an alarm combination about the fault source;
and storing the alarm combination and the corresponding processing scheme thereof into an operation and maintenance library.
22. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
the one or more programs, when executed by the one or more processors, implement the method of any of claims 1-17.
23. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-17.
CN202010773292.8A 2020-08-04 2020-08-04 Fault detection method and device for big data cluster Active CN111984499B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010773292.8A CN111984499B (en) 2020-08-04 2020-08-04 Fault detection method and device for big data cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010773292.8A CN111984499B (en) 2020-08-04 2020-08-04 Fault detection method and device for big data cluster

Publications (2)

Publication Number Publication Date
CN111984499A true CN111984499A (en) 2020-11-24
CN111984499B CN111984499B (en) 2024-05-28

Family

ID=73444464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010773292.8A Active CN111984499B (en) 2020-08-04 2020-08-04 Fault detection method and device for big data cluster

Country Status (1)

Country Link
CN (1) CN111984499B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180108446A (en) * 2017-03-23 2018-10-04 (주)모아데이타 System and method for management of ict infra
CN109343995A (en) * 2018-10-25 2019-02-15 金税信息技术服务股份有限公司 Intelligent O&M analysis system based on multi-source heterogeneous data fusion, machine learning and customer service robot
US20200133814A1 (en) * 2018-10-25 2020-04-30 Capital One Services, Llc Application performance analytics platform
KR101964412B1 (en) * 2018-12-12 2019-04-01 주식회사 모비젠 Method for diagnosing anomaly log of mobile commmunication data processing system and system thereof
CN109992569A (en) * 2019-02-19 2019-07-09 平安科技(深圳)有限公司 Cluster log feature extracting method, device and storage medium
CN110928718A (en) * 2019-11-18 2020-03-27 上海维谛信息科技有限公司 Exception handling method, system, terminal and medium based on correlation analysis
CN111290913A (en) * 2020-02-04 2020-06-16 复旦大学 Fault location visualization system and method based on operation and maintenance data prediction

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527887A (en) * 2020-12-02 2021-03-19 中国农业银行股份有限公司 Visual operation and maintenance method and device applied to Gbase database
CN112506911A (en) * 2020-12-18 2021-03-16 杭州数澜科技有限公司 Data quality monitoring method and device, electronic equipment and storage medium
CN112631913B (en) * 2020-12-23 2024-02-02 平安银行股份有限公司 Method, device, equipment and storage medium for monitoring operation faults of application program
CN112631913A (en) * 2020-12-23 2021-04-09 平安银行股份有限公司 Method, device, equipment and storage medium for monitoring operation fault of application program
CN112838962A (en) * 2020-12-31 2021-05-25 中国银联股份有限公司 Performance bottleneck detection method and device for big data cluster
CN112838962B (en) * 2020-12-31 2022-10-18 中国银联股份有限公司 Performance bottleneck detection method and device for big data cluster
CN112860510A (en) * 2021-01-20 2021-05-28 中国农业银行股份有限公司 Alarm compensation method and device
CN112905371A (en) * 2021-01-28 2021-06-04 清华大学 Software change checking method and device based on heterogeneous multi-source data anomaly detection
CN112905371B (en) * 2021-01-28 2022-05-20 清华大学 Software change checking method and device based on heterogeneous multi-source data anomaly detection
CN112783865A (en) * 2021-01-29 2021-05-11 杭州优云软件有限公司 Semi-supervised man-machine combined operation and maintenance fault library generation method and system
CN113094198A (en) * 2021-04-13 2021-07-09 中国工商银行股份有限公司 Service fault positioning method and device based on machine learning and text classification
CN113051147A (en) * 2021-04-25 2021-06-29 中国建设银行股份有限公司 Database cluster monitoring method, device, system and equipment
CN113240140A (en) * 2021-06-04 2021-08-10 海尔数字科技(青岛)有限公司 Fault detection method, device, equipment and storage medium of physical equipment
WO2022257421A1 (en) * 2021-06-10 2022-12-15 天翼云科技有限公司 Cluster anomaly detection method, apparatus, and related device
CN113887749A (en) * 2021-08-23 2022-01-04 国网江苏省电力有限公司信息通信分公司 Cloud edge cooperation-based multi-dimensional monitoring and disposal method, device and platform for power internet of things
CN113704790A (en) * 2021-08-31 2021-11-26 深圳壹账通智能科技有限公司 Abnormal log information summarizing method and computer equipment
CN113919514A (en) * 2021-12-09 2022-01-11 北京微步在线科技有限公司 Sample data acquisition method and device based on threat intelligence
CN115981910B (en) * 2023-03-20 2023-06-16 建信金融科技有限责任公司 Method, apparatus, electronic device and computer readable medium for processing exception request
CN115981910A (en) * 2023-03-20 2023-04-18 建信金融科技有限责任公司 Method, device, electronic equipment and computer readable medium for processing exception request
CN116346590A (en) * 2023-05-30 2023-06-27 国网汇通金财(北京)信息科技有限公司 Positioning system for full link fault
CN116346590B (en) * 2023-05-30 2023-07-25 国网汇通金财(北京)信息科技有限公司 Positioning system for full link fault
CN116521490A (en) * 2023-07-04 2023-08-01 苏州浪潮智能科技有限公司 PC system health degree self-checking method, self-checking device, equipment and medium
CN116521490B (en) * 2023-07-04 2023-09-29 苏州浪潮智能科技有限公司 PC system health degree self-checking method, self-checking device, equipment and medium
CN117093951A (en) * 2023-10-16 2023-11-21 北京安天网络安全技术有限公司 Threat information merging method and device, electronic equipment and storage medium
CN117093951B (en) * 2023-10-16 2024-01-26 北京安天网络安全技术有限公司 Threat information merging method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111984499B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
CN111984499B (en) Fault detection method and device for big data cluster
US11789943B1 (en) Configuring alerts for tags associated with high-latency and error spans for instrumented software
CN110928718B (en) Abnormality processing method, system, terminal and medium based on association analysis
US10262032B2 (en) Cache based efficient access scheduling for super scaled stream processing systems
US9678822B2 (en) Real-time categorization of log events
CN103513983B (en) method and system for predictive alert threshold determination tool
EP4099170B1 (en) Method and apparatus of auditing log, electronic device, and medium
US11775501B2 (en) Trace and span sampling and analysis for instrumented software
Zhao et al. An empirical investigation of practical log anomaly detection for online service systems
CN111190876A (en) Log management system and operation method thereof
CN112948397A (en) Data processing system, method, device and storage medium
CN108182263A (en) A kind of date storage method of data center&#39;s total management system
CN114880405A (en) Data lake-based data processing method and system
CN115344207A (en) Data processing method and device, electronic equipment and storage medium
CN114461792A (en) Alarm event correlation method, device, electronic equipment, medium and program product
CN115640300A (en) Big data management method, system, electronic equipment and storage medium
CN114443437A (en) Alarm root cause output method, apparatus, device, medium, and program product
CN116701525A (en) Early warning method and system based on real-time data analysis and electronic equipment
Bailis et al. Macrobase: Analytic monitoring for the internet of things
CN112306820A (en) Log operation and maintenance root cause analysis method and device, electronic equipment and storage medium
CN111414355A (en) Offshore wind farm data monitoring and storing system, method and device
CN110677271A (en) Big data alarm method, device, equipment and storage medium based on ELK
CN112825165A (en) Project quality management method and device
CN117271578A (en) Data query method, visual data processing method and device, electronic equipment and storage medium
CN114625763A (en) Information analysis method and device for database, electronic equipment and readable medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220927

Address after: 25 Financial Street, Xicheng District, Beijing 100033

Applicant after: CHINA CONSTRUCTION BANK Corp.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant