CN109800127A

CN109800127A - Machine-learning-based intelligent O&M method and system for system fault diagnosis

Info

Publication number: CN109800127A
Application number: CN201910010700.1A
Authority: CN
Inventors: 曾德强
Original assignee: Zhongan Information Technology Service Co Ltd
Current assignee: Zhongan Information Technology Service Co Ltd
Priority date: 2019-01-03
Filing date: 2019-01-03
Publication date: 2019-05-24

Abstract

The invention discloses a machine-learning-based intelligent O&M method and system for system fault diagnosis. The method includes: acquiring metric data and labeled data of a system; training data models for different usage scenarios from the metric data and the labeled data; computing and analyzing the running health of the system from currently collected metric data and the data models, and triggering fault diagnosis and an alarm for any abnormal metric data captured; and diagnosing the fault cause from a relationship graph established by machine learning and from labeled exception-stack data. By applying machine learning models to each stage of an automated O&M system, such as monitoring, fault diagnosis, and O&M decision-making, the invention can quickly detect faults and locate their causes. It also provides an O&M decision component that completes self-repair actions based on the diagnostic results, achieving truly unattended O&M.

Description

Machine-learning-based intelligent O&M method and system for system fault diagnosis

Technical field

The present invention relates to the technical field of intelligent O&M, and in particular to a machine-learning-based intelligent O&M method and system for system fault diagnosis.

Background technique

With the rapid development of the Internet, product scale and server counts have grown exponentially, from a handful of servers in the early days to hundreds, thousands, and tens of thousands. O&M practice has likewise evolved from early manual O&M to today's tool-based, semi-automated O&M. As business volume and server counts grow rapidly, technical staff face major challenges, mainly in the following respects:

1. Monitoring metrics keep multiplying; with traditional O&M approaches, operators need a long time to pick out of massive metric data the metrics that need attention;

2. Large-scale alarms interfere with the decisions of technical staff, so faults cannot be responded to in time;

3. Tools are scattered, which raises learning and ownership costs; the systems are also independent of one another, making data sharing difficult;

4. Experience from troubleshooting a given problem is not passed on, so technical staff keep repeating the same work.

Therefore, a new intelligent O&M method is urgently needed to overcome one or more of the above problems.

Summary of the invention

To solve the problems in the prior art, embodiments of the invention provide a machine-learning-based intelligent O&M method and system for system fault diagnosis, which overcome prior-art problems such as the inability to quickly detect faults and locate their causes and the inability to complete self-repair automatically.

To solve the above technical problems, the invention adopts the following technical solution:

In one aspect, a machine-learning-based intelligent O&M method for system fault diagnosis is provided, the method comprising the following steps:

S1: acquiring metric data and labeled data of the system;

S2: training data models for different usage scenarios from the metric data and the labeled data;

S3: computing and analyzing the running health of the system from currently collected metric data and the data models, and triggering fault diagnosis and an alarm for the abnormal metric data captured;

S4: diagnosing the fault cause from the relationship graph established by machine learning and the labeled exception-stack data.

Further, acquiring the labeled data includes at least:

acquiring the abnormal metric data in the metric data, and labeling the abnormal metric data with abnormal metric fluctuations and with the causes of those fluctuations; and/or

acquiring the exception stack information of the abnormal metric data and labeling the keywords of the exception stack; and/or

labeling the fault data identified during troubleshooting.

Further, step S3 specifically includes:

triggering fault diagnosis and an alarm after capturing the abnormal metric data in the current metric data according to a time window; and/or

analyzing the current metric data using the data models to obtain the running health of the system, and triggering fault diagnosis and an alarm for the abnormal metric data captured.

Further, step S4 specifically includes:

performing self-checks against the relationship graph established by machine learning and against the labeled exception-stack data, obtaining self-check results; and/or

computing all possible causes of the fault from decision data obtained in previous troubleshooting, and performing the corresponding checks, obtaining check results;

analyzing the fault cause from the self-check results and the check results.

Further, step S4 further includes:

if the fault cause cannot be derived automatically, handling the fault through manual intervention, labeling the abnormal metric data, and saving the labels to the annotation database.

In another aspect, a machine-learning-based intelligent O&M system for system fault diagnosis is provided, the system comprising:

a data collection module, configured to acquire metric data and labeled data of the system;

a model training module, configured to train data models for different usage scenarios from the metric data and the labeled data;

a computation and analysis module, configured to compute and analyze the running health of the system from currently collected metric data and the data models, and to trigger fault diagnosis and an alarm for the abnormal metric data captured;

a fault diagnosis module, configured to diagnose the fault cause from the relationship graph established by machine learning and the labeled exception-stack data;

an alarm module, configured to issue the corresponding alarm according to the abnormal metric data.

Further, the data collection module includes:

a labeling unit, configured to acquire the abnormal metric data in the metric data and label the abnormal metric data with abnormal metric fluctuations and with the causes of those fluctuations; and/or

to label the fault data identified during troubleshooting.

Further, the computation and analysis module includes:

a rule analysis unit, configured to trigger fault diagnosis and an alarm after capturing the abnormal metric data in the current metric data according to a time window;

an algorithm analysis unit, configured to analyze the current metric data using the data models to obtain the running health of the system, and to trigger fault diagnosis and an alarm for the abnormal metric data captured.

Further, the fault diagnosis module includes:

a preliminary self-check unit, configured to perform self-checks against the relationship graph established by machine learning and against the labeled exception-stack data, obtaining self-check results;

an application check unit, configured to compute all possible causes of the fault from decision data obtained in previous troubleshooting and to perform the corresponding checks, obtaining check results;

a fault analysis unit, configured to analyze the fault cause from the self-check results and the check results.

Further, the fault diagnosis module further includes:

a manual labeling unit, configured so that, if the fault cause cannot be derived automatically, the fault is handled through manual intervention and the abnormal metric data are labeled and saved to the annotation database.

The technical solutions provided by the embodiments of the invention have the following beneficial effects:

1. The machine-learning-based intelligent O&M method and device for system fault diagnosis provided by the invention apply machine learning models to each stage of an automated O&M system, such as monitoring, fault diagnosis, and O&M decision-making, so that faults can be detected and their causes located quickly. An O&M decision component is also provided, which completes self-repair actions based on the diagnostic results, achieving truly unattended O&M;

2. The machine-learning-based intelligent O&M method and device for system fault diagnosis provided by the invention use machine learning algorithms to integrate data across dimensions and build corresponding data models. This addresses the problems of single-rule monitoring: no correlated judgment across metrics, irregular load fluctuations, and overly rigid thresholds, all of which lead to high error rates with many false alarms and missed alarms;

3. The machine-learning-based intelligent O&M method and device for system fault diagnosis provided by the invention use the three-way application, business, and server relationship graph established by machine learning to rapidly extract useful exception information, identify the fault cause from the labeled data, and automatically trigger tools to perform the repair.

Detailed description of the invention

To describe the technical solutions in the embodiments of the invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a machine-learning-based intelligent O&M method for system fault diagnosis according to an exemplary embodiment;

Fig. 2 is a structural diagram of a machine-learning-based intelligent O&M system for system fault diagnosis according to an exemplary embodiment.

Specific embodiment

To make the objectives, technical solutions, and advantages of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the invention.

The machine-learning-based intelligent O&M method for system fault diagnosis provided by the embodiments of the invention uses machine learning algorithms to integrate data across dimensions and build a model of healthy application behavior. This addresses the problems of single-dimension rule-based monitoring, which cannot make correlated judgments and, because load fluctuations are irregular and fixed thresholds are too rigid, suffers a high monitoring error rate with many false alarms and missed alarms. The fault diagnosis module helps technical staff detect abnormal applications automatically. Using application running metrics, business metrics, and the server relationship graph, it quickly locates the point where an anomaly erupts and extracts the key metrics; it then judges the fault cause from the collected exception-stack data and notifies the decision system. This solves the problem of extracting information from massive logs under large-scale alarms and shortens the time needed to locate the fault cause. Over time, as the fault annotation database becomes more complete, manual troubleshooting is gradually no longer needed, and the O&M system can carry out repair actions on its own according to the diagnosed fault cause.

Fig. 1 is a flowchart of a machine-learning-based intelligent O&M method for system fault diagnosis according to an exemplary embodiment. Referring to Fig. 1, the method includes the following steps:

S1: Acquire the metric data and labeled data of the system.

Specifically, the metric data mainly comprise three categories: business metrics, system metrics, and application running metrics. These data reflect how the system actually runs in production. The metric data are classified and aggregated over time-series windows and converted into key performance indicators (KPIs). The results are then pushed to the model training module, which uses the metric data for clustering and modeling so that the monitoring engine can analyze online metrics in real time.
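The windowed aggregation described above can be sketched as follows. This is a minimal illustration only: the `(timestamp, metric_name, value)` sample layout, the function name `aggregate_by_window`, the 60-second window, and the choice of count, mean, and max as the per-window statistics are all assumptions, since the text does not specify them.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_window(samples, window_seconds=60):
    """Group (timestamp, metric_name, value) samples into fixed time windows.

    Returns {(window_start, metric_name): {"count", "mean", "max"}}.
    The statistics chosen here are illustrative, not prescribed by the text.
    """
    buckets = defaultdict(list)
    for ts, name, value in samples:
        # Align each timestamp to the start of its window.
        window_start = ts - (ts % window_seconds)
        buckets[(window_start, name)].append(value)
    return {
        key: {"count": len(vals), "mean": mean(vals), "max": max(vals)}
        for key, vals in buckets.items()
    }


# Hypothetical samples: timestamps are seconds since an arbitrary origin.
samples = [
    (0, "cpu_usage", 40.0),
    (30, "cpu_usage", 60.0),
    (65, "cpu_usage", 90.0),
    (10, "order_count", 5),
]
kpis = aggregate_by_window(samples, window_seconds=60)
```

Each resulting entry is one KPI record per metric per window, which is the kind of record that could then be pushed to a training component.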

Labeled data are obtained by receiving data from the labeling service and cleaning them, after which the AI data modeling system models the data. At least three classes of models are built: 1. a metric anomaly fluctuation model; 2. a metric anomaly fluctuation cause model; 3. a troubleshooting decision model.

In addition to the two kinds of data above, embodiments of the invention also collect base platform data. Specifically, resource information is extracted from the basic resource management system and used to build a relationship graph among resource entities, which is provided to the fault diagnosis module. The relationships among base data entities can be stored in the Neo4j graph database. Note that the basic resource management system manages and integrates the resource information, application information, and business information of all servers, and is used for day-to-day operations management. The base platform data provide a basis for data labeling on the one hand and a reference for fault diagnosis on the other.
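To illustrate what a relationship graph among resource entities might look like, here is a small in-memory sketch. It deliberately does not use Neo4j; the class `ResourceGraph`, its methods, and the entity names are hypothetical and stand in for whatever schema a real deployment would store in the graph database.

```python
from collections import defaultdict, deque


class ResourceGraph:
    """Undirected graph of resource entities (business, application, server)."""

    def __init__(self):
        self._edges = defaultdict(set)

    def relate(self, a, b):
        """Record that entities a and b are related."""
        self._edges[a].add(b)
        self._edges[b].add(a)

    def related(self, start, max_hops=2):
        """Return entities reachable from `start` within `max_hops` edges."""
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue
            for nxt in self._edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
        seen.discard(start)
        return seen


# Hypothetical entities: one business line, one application, two servers.
graph = ResourceGraph()
graph.relate("business:payments", "app:order-service")
graph.relate("app:order-service", "server:10.0.0.5")
graph.relate("app:order-service", "server:10.0.0.6")
```

A diagnosis component could call `graph.related("server:10.0.0.5")` to find which applications and business lines may be affected when that server misbehaves.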

Note also that, in embodiments of the invention, different kinds of metric data are collected with different tools. For example, logs are collected mainly with Filebeat, system metrics mainly with open-falcon, and business metrics through techniques such as listening to the MySQL binlog.

As a preferred embodiment, acquiring the labeled data includes at least:

labeling the fault data identified during troubleshooting.

Specifically, the labeled data fall into at least the following three classes:

Metric anomaly fluctuation labels. Such labeled data are fed back into the monitoring metric prediction model and can be used to detect abnormal metrics quickly.

Metric anomaly fluctuation cause labels. Abnormal metric fluctuations can have many causes, which fall roughly into the following categories: 1. network-layer causes; 2. system resource usage (including disk, CPU, I/O, and memory); 3. application exception logs; 4. business traffic fluctuations; 5. network attacks, and so on. From the labeled causes, a classification library of metric fluctuation causes can be built, which in turn allows the direction of troubleshooting to be determined quickly.

Take a Java application as an example. When the application throws a TimeoutException, the labeled causes of the abnormal metric fluctuation may be: 1. a misconfiguration that makes the target unreachable; 2. a network problem. Once these two cause labels are in place, the network detection script and the configuration check script can be triggered quickly to check the network.
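The TimeoutException example amounts to a lookup from a labeled exception to its candidate causes, and from each cause to a check script. A minimal sketch follows; the dictionary names, cause labels, and script file names are hypothetical placeholders rather than anything defined in the text.

```python
# Labeled causes per exception type (hypothetical labels).
CAUSE_LABELS = {
    "TimeoutException": ["misconfiguration", "network"],
}

# Check script to run for each labeled cause (hypothetical file names).
CHECKS = {
    "misconfiguration": "check_config.sh",
    "network": "check_network.sh",
}


def checks_for(exception_name):
    """Return the check scripts to trigger for a labeled exception type."""
    causes = CAUSE_LABELS.get(exception_name, [])
    return [CHECKS[cause] for cause in causes if cause in CHECKS]


scripts = checks_for("TimeoutException")
```

An exception type with no labels yields an empty list, which is the case where a human would have to label the cause first.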

Application exception-stack keyword labels. An application's exception stack often states the fault cause directly and reflects the application's problems well, so labeling the key terms in the exception stack helps confirm the fault cause quickly. In other words, labeled exception-stack keywords make it possible to extract the useful log content quickly and to decide the next fault diagnosis check.
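As a rough illustration of keyword labeling, the sketch below scans a stack trace for previously labeled keywords and returns the labels attached to them. The keyword table, the label names, and the sample trace are invented for the example.

```python
import re

# Hypothetical labeled keywords: keyword -> cause label.
LABELED_KEYWORDS = {
    "TimeoutException": "network_or_config",
    "OutOfMemoryError": "system_resource",
    "Connection refused": "network",
}


def match_keywords(stack_trace):
    """Return (keyword, label) pairs found, ordered by first appearance."""
    hits = []
    for keyword, label in LABELED_KEYWORDS.items():
        found = re.search(re.escape(keyword), stack_trace)
        if found:
            hits.append((found.start(), keyword, label))
    return [(keyword, label) for _, keyword, label in sorted(hits)]


# Invented stack trace for illustration.
trace = """java.util.concurrent.TimeoutException: call timed out
    at com.example.Client.invoke(Client.java:42)
Caused by: java.net.ConnectException: Connection refused"""

matches = match_keywords(trace)
```

Only the matched keywords and their labels need to be passed on, which is how labeling can cut a large log down to the few lines that matter.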

S2: Train data models for different usage scenarios from the metric data and the labeled data.

Specifically, the model training service is built on a machine learning engine (sparkML) and provides various kinds of supervised learning, semi-supervised learning, and so on. After receiving the metric data or labeled data, it builds data models for the different usage scenarios. The data models include, but are not limited to: metric prediction models (such as metric data models, metric anomaly fluctuation models, and metric anomaly fluctuation classification models), the fault detection process library and solution library, and the application, machine, and business relationship graph. The metric prediction models are used for monitoring and early warning; the fault detection process library and solution library are used for subsequent fault diagnosis and are built from the collected labeled data.

Note that before the above data are used for machine learning training to obtain the data models, they must also be vectorized, that is, text data must be converted into vector data.
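The text does not say which vectorization scheme is used. One common and simple option is a bag-of-words count vector, sketched here with hypothetical helper names; a real pipeline might use TF-IDF or learned embeddings instead.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a stable column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(document, vocabulary):
    """Convert text into a count vector; unknown tokens are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Invented label texts for illustration.
docs = ["connection timeout", "disk full", "connection refused"]
vocab = build_vocabulary(docs)
vec = vectorize("connection timeout timeout", vocab)
```

With this vocabulary the columns are, in order, `connection`, `disk`, `full`, `refused`, `timeout`.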

S3: Compute and analyze the running health of the system from currently collected metric data and the data models, and trigger fault diagnosis and an alarm for the abnormal metric data captured.

As a preferred embodiment, step S3 specifically includes:

Specifically, in embodiments of the invention, the analysis engine used for computation and analysis consists of two parts, a rule engine and an algorithm engine. The rule engine captures each abnormal metric according to a time window and then performs two actions: triggering an alarm and triggering fault diagnosis. The algorithm engine uses data models built from historical metric data to analyze the metric data currently obtained in real time, yielding the running health of the system; for the abnormal metric data captured, it raises an alarm and triggers fault diagnosis. The algorithm engine mainly uses prediction-model algorithms (prophet), random forests, and related algorithms.
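The two-part analysis engine can be sketched as follows. Note the substitution: instead of prophet or a random forest, the algorithm engine here uses a simple deviation-from-history test (more than `k` standard deviations from the historical mean), chosen only to keep the example dependency-free. The function names, the fixed threshold, and the event layout are likewise assumptions.

```python
from statistics import mean, stdev


def rule_engine(window, threshold):
    """Rule engine: return every sample in the window above a fixed threshold."""
    return [v for v in window if v > threshold]


def algorithm_engine(history, current, k=3.0):
    """Stand-in for a trained model: flag values far from the history mean."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(current - mu) > k * sigma


def analyze(history, window, threshold):
    """Run both engines; every anomaly triggers an alarm and fault diagnosis."""
    found = [("rule", v) for v in rule_engine(window, threshold)]
    found += [("algorithm", v) for v in window if algorithm_engine(history, v)]
    return [
        {"source": source, "value": value, "actions": ["alarm", "fault_diagnosis"]}
        for source, value in found
    ]


# Invented metric values for illustration.
history = [50, 52, 48, 51, 49, 50, 53, 47]
events = analyze(history, window=[51, 95], threshold=90)
```

The rule engine reacts only to the fixed threshold, whereas the model-based check adapts to the metric's own history, which is the flexibility the text attributes to the algorithm engine.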

As a preferred embodiment, step S4 specifically includes:

Specifically, the core of the fault diagnosis module is an inference engine. On receiving an event from the analysis engine, it first performs self-checks against the relationship graph established by machine learning (such as the business, application, and machine relationship graph) and against the labeled exception-stack data, obtaining self-check results. The self-checks cover exception content, system resource usage, traffic fluctuations, and so on.

At the same time, decision data obtained from technical staff's previous troubleshooting are used to compute the possible causes of the fault and to carry out the next checks, including checks of dependent applications and of the scope of business impact. For example, suppose the number of exceptions in application A surges. The possible causes are looked up in the historical exception keyword model library to identify whether the problem lies in the network layer, the application layer, or server resources, which decides the direction of troubleshooting. If the problem is in the network layer, a basic network connectivity check is triggered, network-layer logs are collected, and the specific network-layer metrics are examined to obtain the check results.
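The application A example, in which a surge of exceptions is mapped to a layer and the layer decides the next checks, could look roughly like this. The keyword table, the majority-vote rule, and the check names are assumptions made for illustration; the text does not describe how the keyword model library ranks causes.

```python
from collections import Counter

# Hypothetical keyword model library: keyword -> layer.
KEYWORD_LAYER = {
    "Connection refused": "network",
    "SocketTimeoutException": "network",
    "NullPointerException": "application",
    "OutOfMemoryError": "server_resource",
}

# Hypothetical follow-up checks per layer.
NEXT_CHECKS = {
    "network": ["basic_network_connectivity", "collect_network_layer_logs"],
    "application": ["dependent_application_check"],
    "server_resource": ["resource_usage_check"],
}


def decide_direction(exception_messages):
    """Pick the layer matched by the most messages and return its checks."""
    votes = Counter()
    for message in exception_messages:
        for keyword, layer in KEYWORD_LAYER.items():
            if keyword in message:
                votes[layer] += 1
    if not votes:
        return None, []
    layer, _ = votes.most_common(1)[0]
    return layer, NEXT_CHECKS[layer]


# Invented exception messages from application A.
messages = [
    "java.net.ConnectException: Connection refused",
    "java.net.SocketTimeoutException: Read timed out",
    "java.lang.NullPointerException",
]
layer, checks = decide_direction(messages)
```

When no keyword matches, the function returns no direction, which corresponds to the case that needs manual investigation.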

As a preferred embodiment, step S4 further includes:

Specifically, checking the metrics uses the annotation database for the corresponding class of problem. When this annotation database is sufficiently complete, the fault cause can be derived automatically. When it is not, the fault cause cannot be derived automatically and manual intervention is required; the abnormal metric data are labeled and saved to the corresponding annotation database to improve it further. For example, checking network-layer metrics uses the network-layer problem annotation database. If this database is sufficiently complete, the fault cause can be derived automatically; if not, manual intervention is required, the abnormal network-layer metric data are labeled manually, and the labels are saved to the network-layer problem annotation database to supplement it.
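The fallback from automatic analysis to manual labeling is essentially a lookup with a write-back on a miss. A minimal sketch, assuming a hypothetical `AnnotationDB` class, a string "signature" that identifies the abnormal metric pattern, and a callback `ask_human` that represents the manual step:

```python
class AnnotationDB:
    """In-memory stand-in for a per-problem-class annotation database."""

    def __init__(self):
        self._causes = {}

    def lookup(self, signature):
        return self._causes.get(signature)

    def save(self, signature, cause):
        self._causes[signature] = cause


def diagnose(db, signature, ask_human):
    """Return (cause, mode); fall back to a human and store the new label."""
    cause = db.lookup(signature)
    if cause is not None:
        return cause, "automatic"
    cause = ask_human(signature)  # manual intervention
    db.save(signature, cause)     # supplement the annotation database
    return cause, "manual"


# Invented network-layer signature and cause.
db = AnnotationDB()
first = diagnose(db, "packet_loss>5%", lambda s: "faulty switch port")
second = diagnose(db, "packet_loss>5%", lambda s: "should not be asked")
```

The second call is answered from the database without consulting the human, which is how the need for manual troubleshooting shrinks as labels accumulate.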

As a preferred embodiment, the method further includes:

S5: Determine a repair plan according to the fault cause and trigger fault repair.

Specifically, the decision module determines a repair plan according to the fault cause diagnosed by the diagnosis module and triggers the corresponding fault repair operation.
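Step S5 can be pictured as a table from diagnosed cause to repair plan, plus a hook that runs each repair step. Everything named below (the plan table, the step names, the `run_tool` callback, and the status strings) is hypothetical; the text only states that a plan is determined and repair is triggered.

```python
# Hypothetical repair plans keyed by diagnosed fault cause.
REPAIR_PLANS = {
    "disk_full": ["clean_temp_files", "rotate_logs"],
    "process_hung": ["restart_application"],
}


def decide_and_repair(cause, run_tool):
    """Choose a plan for `cause` and run each step through `run_tool`.

    `run_tool(step)` should return True on success. Causes without a
    plan are escalated instead of repaired.
    """
    plan = REPAIR_PLANS.get(cause)
    if plan is None:
        return {"cause": cause, "status": "escalated", "executed": []}
    executed = [step for step in plan if run_tool(step)]
    status = "repaired" if len(executed) == len(plan) else "partial"
    return {"cause": cause, "status": status, "executed": executed}


# A stub tool runner that always succeeds, for illustration.
result = decide_and_repair("disk_full", run_tool=lambda step: True)
```

In a real system `run_tool` would dispatch to the O&M tool platform; here it is a stub so the control flow can be followed in isolation.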

Fig. 2 is a structural diagram of a machine-learning-based intelligent O&M system for system fault diagnosis according to an exemplary embodiment. Referring to Fig. 2, the system includes at least:

a data collection module, configured to acquire the metric data and labeled data of the system.

Specifically, in embodiments of the invention, the data collection module includes multiple data collection tools: for example, Filebeat for collecting log data and open-falcon for collecting system metric data, while business metric data are collected through techniques such as listening to the MySQL binlog.

a model training module, configured to train data models for different usage scenarios from the metric data and the labeled data.

Specifically, the model training module includes components such as an algorithm library, a data modeling visualization tool, and a data modeling engine. For the different metric data and labeled data, data models for the different usage scenarios are trained through supervised learning and semi-supervised learning.

a computation and analysis module, configured to compute and analyze the running health of the system from currently collected metric data and the data models, and to trigger fault diagnosis and an alarm for the abnormal metric data captured.

Specifically, the analysis engine of the computation and analysis module consists of a rule engine and an algorithm engine; the algorithm engine mainly uses prediction-model algorithms (prophet), random forests, and related algorithms.

a fault diagnosis module, configured to diagnose the fault cause from the relationship graph established by machine learning and the labeled exception-stack data.

Specifically, the core of the fault diagnosis module is an inference engine. On receiving an event from the analysis engine, it diagnoses the fault cause by combining the previously obtained labeled exception-stack data, the metric anomaly fluctuation cause classification model, the business, application, and machine relationship graph, and so on.

Further, the data collection module includes:

a labeling unit, configured to acquire the abnormal metric data in the metric data and label the abnormal metric data with abnormal metric fluctuations and with the causes of those fluctuations; and/or

to acquire the exception stack information of the abnormal metric data and label the keywords of the exception stack; and/or

to label the fault data identified during troubleshooting.

Further, the computation and analysis module includes:

an algorithm analysis unit, configured to analyze the current metric data using the data models to obtain the running health of the system, and to trigger fault diagnosis and an alarm for the abnormal metric data captured.

Further, the fault diagnosis module includes:

a fault analysis unit, configured to analyze the fault cause from the self-check results and the check results.

Further, the fault diagnosis module further includes:

a manual labeling unit, configured so that, if the fault cause cannot be derived automatically, the fault is handled through manual intervention and the abnormal metric data are labeled and saved to the annotation database.

As a preferred embodiment, the system further includes:

a decision module, configured to determine a repair plan according to the fault cause and trigger fault repair.

an O&M tool management platform, configured to carry out the corresponding fault repair according to the repair plan. The O&M tool management platform includes an O&M script management tool, an application deployment tool, a development process management tool, a configuration management tool, and so on.

In conclusion technical solution provided in an embodiment of the present invention has the benefit that

It should be understood that the system fault diagnosis intelligence operational system provided by the above embodiment based on machine learning It, only the example of the division of the above functional modules, can be in practical application when triggering system fault diagnosis business Above-mentioned function distribution is completed by different functional modules as needed, i.e., the internal structure of system is divided into different function Energy module, to complete all or part of the functions described above.In addition, provided by the above embodiment be based on machine learning System fault diagnosis intelligence operational system belongs to the system fault diagnosis intelligence O&M embodiment of the method based on machine learning Same design, specific implementation process are detailed in embodiment of the method, and which is not described herein again.

Those of ordinary skill in the art will appreciate that realizing that all or part of the steps of above-described embodiment can pass through hardware It completes, relevant hardware can also be instructed to complete by program, the program can store in a kind of computer-readable In storage medium, storage medium mentioned above can be read-only memory, disk or CD etc..

The foregoing is merely presently preferred embodiments of the present invention, is not intended to limit the invention, it is all in spirit of the invention and Within principle, any modification, equivalent replacement, improvement and so on be should all be included in the protection scope of the present invention.

Claims

1. A machine-learning-based intelligent O&M method for system fault diagnosis, characterized in that the method comprises the following steps:

S1: acquiring metric data and labeled data of the system;

S3: computing and analyzing the running health of the system from currently collected metric data and the data models, and triggering fault diagnosis and an alarm for the abnormal metric data captured;

2. The machine-learning-based intelligent O&M method for system fault diagnosis according to claim 1, characterized in that acquiring the labeled data includes at least:

acquiring the abnormal metric data in the metric data, and labeling the abnormal metric data with abnormal metric fluctuations and with the causes of those fluctuations; and/or

labeling the fault data identified during troubleshooting.

3. The machine-learning-based intelligent O&M method for system fault diagnosis according to claim 1 or 2, characterized in that step S3 specifically includes:

triggering fault diagnosis and an alarm after capturing the abnormal metric data in the current metric data according to a time window; and/or

analyzing the current metric data using the data models to obtain the running health of the system, and triggering fault diagnosis and an alarm for the abnormal metric data captured.

4. The machine-learning-based intelligent O&M method for system fault diagnosis according to claim 1 or 2, characterized in that step S4 specifically includes:

performing self-checks against the relationship graph established by machine learning and against the labeled exception-stack data, obtaining self-check results; and/or

computing all possible causes of the fault from decision data obtained in previous troubleshooting, and performing the corresponding checks, obtaining check results;

5. The machine-learning-based intelligent O&M method for system fault diagnosis according to claim 4, characterized in that step S4 further includes:

if the fault cause cannot be derived automatically, handling the fault through manual intervention, labeling the abnormal metric data, and saving the labels to the annotation database.

6. A machine-learning-based intelligent O&M system for system fault diagnosis, characterized in that the system comprises:

a model training module, configured to train data models for different usage scenarios from the metric data and the labeled data;

a computation and analysis module, configured to compute and analyze the running health of the system from currently collected metric data and the data models, and to trigger fault diagnosis and an alarm for the abnormal metric data captured;

a fault diagnosis module, configured to diagnose the fault cause from the relationship graph established by machine learning and the labeled exception-stack data;

7. The machine-learning-based intelligent O&M system for system fault diagnosis according to claim 6, characterized in that the data collection module includes:

a labeling unit, configured to acquire the abnormal metric data in the metric data and label the abnormal metric data with abnormal metric fluctuations and with the causes of those fluctuations; and/or

to label the fault data identified during troubleshooting.

8. The machine-learning-based intelligent O&M system for system fault diagnosis according to claim 6 or 7, characterized in that the computation and analysis module includes:

a rule analysis unit, configured to trigger fault diagnosis and an alarm after capturing the abnormal metric data in the current metric data according to a time window;

an algorithm analysis unit, configured to analyze the current metric data using the data models to obtain the running health of the system, and to trigger fault diagnosis and an alarm for the abnormal metric data captured.

9. The machine-learning-based intelligent O&M system for system fault diagnosis according to claim 6 or 7, characterized in that the fault diagnosis module includes:

a preliminary self-check unit, configured to perform self-checks against the relationship graph established by machine learning and against the labeled exception-stack data, obtaining self-check results;

an application check unit, configured to compute all possible causes of the fault from decision data obtained in previous troubleshooting and to perform the corresponding checks, obtaining check results;

10. The machine-learning-based intelligent O&M system for system fault diagnosis according to claim 9, characterized in that the fault diagnosis module further includes:

a manual labeling unit, configured so that, if the fault cause cannot be derived automatically, the fault is handled through manual intervention and the abnormal metric data are labeled and saved to the annotation database.