CN109800127A - Intelligent O&M method and system for system fault diagnosis based on machine learning - Google Patents

Publication number
CN109800127A
Authority
CN
China
Prior art keywords
data
fault diagnosis
machine learning
labeled
abnormal index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910010700.1A
Other languages
Chinese (zh)
Inventor
曾德强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongan Information Technology Service Co Ltd
Original Assignee
Zhongan Information Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongan Information Technology Service Co Ltd
Priority to CN201910010700.1A
Publication of CN109800127A

Abstract

The invention discloses an intelligent O&M method and system for system fault diagnosis based on machine learning. The method includes: obtaining metric data and annotation data of a system; training data models for different usage scenarios from the metric data and annotation data; analyzing the system's operating health from the currently collected metric data and the data models, and triggering fault diagnosis and an alarm for any abnormal metric data captured; and diagnosing the failure cause from the relationship graph established through machine learning and the annotated exception stack data. By applying machine learning models to each stage of an automated O&M system, such as monitoring, fault diagnosis, and O&M decision-making, the invention can quickly detect faults and identify their causes. It also provides an O&M decision component that performs self-repair actions based on the diagnostic results, with the aim of truly unattended O&M.

Description

Intelligent O&M method and system for system fault diagnosis based on machine learning
Technical field
The present invention relates to the technical field of intelligent O&M (operations and maintenance), and in particular to an intelligent O&M method and system for system fault diagnosis based on machine learning.
Background
With the rapid development of the internet, product scale and server counts have grown exponentially: from a handful of servers in the early days to hundreds, thousands, and tens of thousands. O&M practice has likewise moved from early manual operation to today's tool-based, semi-automated operation. As business volume and server counts grow rapidly, technical staff face major challenges, chiefly the following:
1. Monitoring metrics keep multiplying. With traditional O&M approaches, operators need a long time to pick out the metrics that need attention from the mass of metric data;
2. Large volumes of alarms interfere with technicians' judgment, so that faults cannot be responded to in time;
3. Tools are scattered, which not only raises learning and ownership costs but also leaves the systems independent of one another and makes data sharing difficult;
4. Experience gained from investigating a given problem cannot be passed on, so technicians keep repeating the same work.
Therefore, a new intelligent O&M method is urgently needed to overcome one or more of the problems above.
Summary of the invention
To solve the problems of the prior art, embodiments of the present invention provide an intelligent O&M method and system for system fault diagnosis based on machine learning, which overcome the prior art's inability to quickly detect faults, identify their causes, and perform self-repair automatically.
To solve the above technical problems, the present invention adopts the following technical solutions:
In one aspect, an intelligent O&M method for system fault diagnosis based on machine learning is provided, the method comprising the following steps:
S1: obtaining metric data and annotation data of the system;
S2: training data models for different usage scenarios from the metric data and annotation data;
S3: analyzing the system's operating health from the currently collected metric data and the data models, and triggering fault diagnosis and an alarm for any abnormal metric data captured;
S4: diagnosing the failure cause from the relationship graph established through machine learning and the annotated exception stack data.
Further, obtaining the annotation data includes at least:
obtaining the abnormal metric data within the metric data, and annotating the abnormal metric data with both the abnormal fluctuation and the cause of the abnormal fluctuation; and/or
obtaining the exception stack information for the abnormal metric data, and annotating the keywords of the exception stack; and/or
annotating the data of fault problems that have been investigated.
Further, step S3 specifically includes:
capturing the abnormal metric data in the current metric data by time window, then triggering fault diagnosis and an alarm; and/or
analyzing the current metric data with the data models to obtain the system's operating health, and triggering fault diagnosis and an alarm based on the abnormal metric data captured.
Further, step S4 specifically includes:
performing self-checks based on the relationship graph established through machine learning and on the annotated exception stack data, respectively, to obtain self-check results; and/or
computing all possible causes of the fault from the decision data obtained in previous troubleshooting, and performing the corresponding checks to obtain check results;
analyzing the failure cause from the self-check results and the check results.
Further, step S4 also includes:
if the failure cause cannot be determined automatically, handing the fault over for manual handling, annotating the abnormal metric data, and saving it to the annotation repository.
In another aspect, an intelligent O&M system for system fault diagnosis based on machine learning is provided, the system comprising:
a data collection module, for obtaining metric data and annotation data of the system;
a model training module, for training data models for different usage scenarios from the metric data and annotation data;
a computation and analysis module, for analyzing the system's operating health from the currently collected metric data and the data models, and triggering fault diagnosis and an alarm for any abnormal metric data captured;
a fault diagnosis module, for diagnosing the failure cause from the relationship graph established through machine learning and the annotated exception stack data;
an alarm module, for issuing the corresponding alarm according to the abnormal metric data.
Further, the data collection module includes:
an annotation unit, for obtaining the abnormal metric data within the metric data, and annotating the abnormal metric data with both the abnormal fluctuation and the cause of the abnormal fluctuation; and/or
obtaining the exception stack information for the abnormal metric data, and annotating the keywords of the exception stack; and/or
annotating the data of fault problems that have been investigated.
Further, the computation and analysis module includes:
a rule analysis unit, for capturing the abnormal metric data in the current metric data by time window and then triggering fault diagnosis and an alarm;
an algorithm analysis unit, for analyzing the current metric data with the data models to obtain the system's operating health, and triggering fault diagnosis and an alarm based on the abnormal metric data captured.
Further, the fault diagnosis module includes:
a preliminary self-check unit, for performing self-checks based on the relationship graph established through machine learning and on the annotated exception stack data, respectively, to obtain self-check results;
an application check unit, for computing all possible causes of the fault from the decision data obtained in previous troubleshooting, and performing the corresponding checks to obtain check results;
a fault analysis unit, for analyzing the failure cause from the self-check results and the check results.
Further, the fault diagnosis module also includes:
a manual annotation unit, for handing the fault over for manual handling if the failure cause cannot be determined automatically, annotating the abnormal metric data, and saving it to the annotation repository.
The technical solutions provided by the embodiments of the present invention have the following beneficial effects:
1. The machine-learning-based intelligent O&M method and device for system fault diagnosis provided by the invention apply machine learning models to each stage of an automated O&M system, such as monitoring, fault diagnosis, and O&M decision-making. Faults can be detected and their causes identified quickly, and an O&M decision component performs self-repair actions based on the diagnostic results, with the aim of truly unattended O&M;
2. The machine-learning-based intelligent O&M method and device for system fault diagnosis provided by the invention use machine learning algorithms to integrate data from every dimension and build corresponding data models. This addresses the problems of single-rule monitoring: the inability to make linked judgments across metrics, and the high error rates, frequent false alarms, and missed alarms that arise when load fluctuates irregularly and thresholds are applied too rigidly;
3. The machine-learning-based intelligent O&M method and device for system fault diagnosis provided by the invention use the relationship graph among applications, business services, and servers established through machine learning to quickly extract useful exception information, identify the failure cause from the annotation data, and automatically trigger tools to carry out the repair.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the machine-learning-based intelligent O&M method for system fault diagnosis according to an exemplary embodiment;
Fig. 2 is a structural diagram of the machine-learning-based intelligent O&M system for system fault diagnosis according to an exemplary embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
The machine-learning-based intelligent O&M method for system fault diagnosis provided by the embodiments uses machine learning algorithms to integrate data from every dimension and build a model of healthy application behavior. This addresses the problems of single-dimension rule monitoring: it cannot make linked judgments, and irregular load fluctuation combined with rigidly applied thresholds leads to a high monitoring error rate with many false alarms and missed alarms. The fault diagnosis module helps technicians detect abnormal applications automatically. Using application runtime metrics, business metrics, and the server relationship graph, it quickly locates the point where an anomaly erupts and extracts the key metrics; it then determines the failure cause from the collected exception stack data and notifies the decision system. This solves the problem of extracting information from massive logs when alarms arrive at scale, and shortens the time needed to locate the failure cause. As time passes and the fault annotation database grows more complete, troubleshooting gradually stops requiring manual intervention, and the O&M system can carry out repair actions on its own according to the diagnosed cause.
Fig. 1 is a flowchart of the machine-learning-based intelligent O&M method for system fault diagnosis according to an exemplary embodiment. Referring to Fig. 1, the method comprises the following steps:
S1: obtaining metric data and annotation data of the system.
Specifically, metric data falls into three main categories: business metrics, system metrics, and application runtime metrics. These data reflect how the system is actually running in production. The metric data of each category is classified and aggregated over time-series windows and converted into key performance indicators (KPIs). The results are then pushed to the model training module, which performs cluster modeling on the metric data so that the monitoring engine can analyze online metrics in real time.
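The windowed aggregation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the sample layout `(timestamp, metric_name, value)`, the function name `aggregate_by_window`, the 60-second window, and the choice of count, mean, and max as summary statistics are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_window(samples, window_seconds=60):
    """Group (timestamp, metric_name, value) samples into fixed time windows.

    Returns {(window_start, metric_name): {"count", "mean", "max"}}.
    The summary statistics are illustrative stand-ins for KPIs.
    """
    buckets = defaultdict(list)
    for ts, name, value in samples:
        window_start = ts - (ts % window_seconds)  # align to the window boundary
        buckets[(window_start, name)].append(value)
    return {
        key: {"count": len(vals), "mean": mean(vals), "max": max(vals)}
        for key, vals in buckets.items()
    }


# Hypothetical samples: two metric categories, timestamps in seconds.
samples = [
    (0, "cpu_usage", 0.40),
    (30, "cpu_usage", 0.60),
    (65, "cpu_usage", 0.90),
    (10, "order_count", 12),
]
kpis = aggregate_by_window(samples, window_seconds=60)
```

Each key of `kpis` identifies one metric in one window, which is the shape a downstream training or monitoring step would typically consume.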
Annotation data is handled by receiving data from the annotation service, cleaning it, and supplying it to the AI data modeling system for modeling. There are at least three classes of model: 1. the abnormal metric fluctuation model; 2. the abnormal metric fluctuation cause model; 3. the troubleshooting decision model.
Besides these two kinds of data, the embodiments also need to collect basic platform data. Specifically, resource information is extracted from the basic resource management system to build a relationship graph among resource entities, which is supplied to the fault diagnosis module. The relationships between basic data entities can be stored in the Neo4j graph database. Note that the basic resource management system manages and integrates the resource information, application information, and business information of all servers, and is used for day-to-day operations management. Basic platform data serves as a basis for annotation on the one hand and as a reference for fault diagnosis on the other.
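To make the idea of a relationship graph among resource entities concrete, here is a small in-memory sketch. It deliberately does not use Neo4j or any graph database API; the class `RelationGraph`, its methods, and the entity names (`business:checkout`, `app:order-service`, `server:host-01`, and so on) are hypothetical and serve only to show how a diagnosis step might look up what is connected to a failing entity.

```python
from collections import defaultdict, deque


class RelationGraph:
    """Undirected graph of business, application, and server entities."""

    def __init__(self):
        self._edges = defaultdict(set)

    def relate(self, a, b):
        """Record that entities a and b are related (for example, app runs on server)."""
        self._edges[a].add(b)
        self._edges[b].add(a)

    def related(self, start, max_hops=2):
        """Return every entity reachable from start within max_hops edges."""
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue
            for neighbor in self._edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, hops + 1))
        seen.discard(start)
        return seen


graph = RelationGraph()
graph.relate("business:checkout", "app:order-service")
graph.relate("app:order-service", "server:host-01")
graph.relate("app:order-service", "server:host-02")
graph.relate("app:inventory-service", "server:host-02")

# Entities directly tied to the order service.
direct = graph.related("app:order-service", max_hops=1)
# Everything that could be affected if host-02 misbehaves.
impacted = graph.related("server:host-02", max_hops=2)
```

A graph database would replace the in-memory dictionary in practice; the breadth-first traversal is the part that corresponds to finding the applications and business services affected by one abnormal server.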
Note also that, in the embodiments, different kinds of metric data are collected with different tools. For example, logs are collected mainly with Filebeat, system metrics mainly with open-falcon, and business metrics through technical means such as listening to the MySQL binlog.
As a preferred implementation, in the embodiments of the invention, obtaining the annotation data includes at least:
obtaining the abnormal metric data within the metric data, and annotating the abnormal metric data with both the abnormal fluctuation and the cause of the abnormal fluctuation; and/or
obtaining the exception stack information for the abnormal metric data, and annotating the keywords of the exception stack; and/or
annotating the data of fault problems that have been investigated.
Specifically, the annotation data falls into at least the following three classes:
Abnormal metric fluctuation annotations. This annotation data is fed back into the monitoring metric prediction model and can be used to find abnormal metrics quickly.
Abnormal metric fluctuation cause annotations. Many things can cause a metric to fluctuate abnormally; they fall roughly into the following categories: 1. network-layer causes; 2. system resource usage (including disk, CPU, I/O, and memory); 3. application exception logs; 4. business traffic fluctuation; 5. network attacks, and so on. From the annotated causes, a classification library of metric fluctuation causes can be built, and with that library the direction of troubleshooting can be determined quickly.
Take a Java application as an example. Suppose the application throws a TimeoutException, and the annotated causes of the abnormal fluctuation may be: 1. a misconfiguration that makes the target unreachable; 2. a network problem. Once these two causes have been annotated, the network detection script and the configuration check script can be triggered quickly to check the network and the configuration.
Application exception stack keyword annotations. An application's exception stack can often state the failure cause directly and reflects the application's problems well, so annotating the key terms of the exception stack helps confirm the failure cause quickly. In other words, annotated exception stack keywords help extract the useful log content quickly and decide the next fault diagnosis check.
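The TimeoutException example and the keyword annotation idea can be combined in a short sketch. Everything named here is an assumption for illustration: the cause library contents, the script names, the regular expression used to pick exception class names out of a stack trace, and the functions `extract_keywords` and `plan_checks`. A real annotation library would be built from labeled incidents rather than hard-coded.

```python
import re

# Hypothetical cause library built from annotated fluctuation causes.
CAUSE_LIBRARY = {
    "TimeoutException": ["misconfiguration", "network"],
    "OutOfMemoryError": ["system_resources"],
}

# Hypothetical mapping from an annotated cause to the check script it triggers.
CHECK_SCRIPTS = {
    "misconfiguration": "check_config.sh",
    "network": "check_network.sh",
    "system_resources": "check_resources.sh",
}

# Matches Java-style class names ending in Exception or Error.
EXCEPTION_NAME = re.compile(r"\b([A-Z]\w*(?:Exception|Error))\b")


def extract_keywords(stack_trace):
    """Return the distinct exception class names, in order of first appearance."""
    seen = []
    for name in EXCEPTION_NAME.findall(stack_trace):
        if name not in seen:
            seen.append(name)
    return seen


def plan_checks(stack_trace):
    """Look up annotated causes for each keyword and list the scripts to run."""
    scripts = []
    for keyword in extract_keywords(stack_trace):
        for cause in CAUSE_LIBRARY.get(keyword, []):
            script = CHECK_SCRIPTS[cause]
            if script not in scripts:
                scripts.append(script)
    return scripts


trace = (
    "java.util.concurrent.TimeoutException: call timed out\n"
    "\tat com.example.Client.call(Client.java:42)\n"
    "Caused by: java.net.SocketTimeoutException: Read timed out\n"
)
keywords = extract_keywords(trace)
checks = plan_checks(trace)
```

Keywords that have no entry in the cause library, such as `SocketTimeoutException` here, produce no checks; in the workflow the patent describes, those are the cases that would be routed to manual annotation.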
S2: training data models for different usage scenarios from the metric data and annotation data.
Specifically, the model training service uses a machine learning engine (sparkML) as its base engine and provides various kinds of supervised learning, semi-supervised learning, and so on. After receiving metric data or annotation data, it builds data models for the different usage scenarios. The data models include, but are not limited to: metric prediction models (such as metric data models, abnormal metric fluctuation models, and abnormal metric fluctuation classification models), the fault detection process library and solution library, and the relationship graph of applications, machines, and business services. The metric prediction models are used for monitoring and early warning; the fault detection process library and solution library are used for subsequent fault diagnosis and are built from the collected annotation data.
Note that before the above data is used for machine learning training to obtain the data models, it must be vectorized, that is, text data must be converted into vector data.
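The patent does not say which vectorization scheme is used. As one possible reading, the sketch below uses a plain bag-of-words count vector; the function names `build_vocabulary` and `vectorize`, the whitespace tokenization, and the sample log messages are assumptions for the example.

```python
def build_vocabulary(documents):
    """Assign each distinct lowercase token an index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(document, vocab):
    """Turn a text into a count vector over the vocabulary; unknown tokens are ignored."""
    vector = [0] * len(vocab)
    for token in document.lower().split():
        index = vocab.get(token)
        if index is not None:
            vector[index] += 1
    return vector


# Hypothetical annotated log messages used to build the vocabulary.
training_messages = ["connection timed out", "disk full", "connection refused"]
vocab = build_vocabulary(training_messages)
vector = vectorize("Connection connection refused unknown", vocab)
```

Count vectors like this are a common minimal input for classifiers; weighting schemes or learned embeddings would be drop-in alternatives.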
S3: analyzing the system's operating health from the currently collected metric data and the data models, and triggering fault diagnosis and an alarm for any abnormal metric data captured.
As a preferred implementation, in the embodiments of the invention, step S3 specifically includes:
capturing the abnormal metric data in the current metric data by time window, then triggering fault diagnosis and an alarm; and/or
analyzing the current metric data with the data models to obtain the system's operating health, and triggering fault diagnosis and an alarm based on the abnormal metric data captured.
Specifically, in the embodiments, the analysis engine consists of two parts, a rule engine and an algorithm engine. The rule engine captures each abnormal metric by time window and then performs two actions: triggering an alarm and triggering fault diagnosis. The algorithm engine uses data models built from historical metric data to analyze the current metric data obtained in real time, thereby obtaining the system's operating health, and raises an alarm and triggers fault diagnosis for the abnormal metric data captured. The algorithm engine mainly uses forecasting-model algorithms (prophet) and algorithms such as random forest.
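The split between a rule engine and an algorithm engine can be illustrated with a small sketch. Note the substitution: instead of prophet or random forest, the algorithm engine here is a simple z-score test against historical values, chosen only because it fits in a few lines. The function names, the fixed threshold, and the `z_limit` of 3.0 are assumptions.

```python
from statistics import mean, pstdev


def rule_engine(value, threshold):
    """Rule engine stand-in: a fixed threshold applied to a windowed value."""
    return value > threshold


def algorithm_engine(history, value, z_limit=3.0):
    """Algorithm engine stand-in: flag values far from the historical mean.

    A z-score test replaces the forecasting model named in the text.
    """
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit


def analyze(history, window_values, threshold):
    """Emit one event per abnormal value; each event triggers both actions."""
    events = []
    for value in window_values:
        if rule_engine(value, threshold) or algorithm_engine(history, value):
            events.append({"value": value, "actions": ["alarm", "fault_diagnosis"]})
    return events


# Hypothetical response-time history (ms) and the current time window.
history = [100, 102, 98, 101, 99]
events = analyze(history, window_values=[101, 150, 99], threshold=120)
```

The value 110 would pass the fixed threshold of 120 but is flagged by the statistical test, which is the kind of case where a learned model catches what a single rigid rule misses.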
S4: diagnosing the failure cause from the relationship graph established through machine learning and the annotated exception stack data.
As a preferred implementation, in the embodiments of the invention, step S4 specifically includes:
performing self-checks based on the relationship graph established through machine learning and on the annotated exception stack data, respectively, to obtain self-check results; and/or
computing all possible causes of the fault from the decision data obtained in previous troubleshooting, and performing the corresponding checks to obtain check results;
analyzing the failure cause from the self-check results and the check results.
Specifically, the core of the fault diagnosis module is an inference engine. On receiving an event from the analysis engine, it first performs self-checks based on the relationship graph established through machine learning (for example, the graph of business, application, and machine relationships) and on the annotated exception stack data, and obtains the self-check results. The self-check covers the exception content, system resource usage, traffic fluctuation, and so on.
At the same time, it uses the decision data that technicians obtained in previous troubleshooting to compute the possible causes of the fault, and carries out the next checks, including checks of dependent applications and of the scope of business impact. For example, suppose the number of exceptions in application A surges. The possible causes are found from the historical exception keyword model library, which identifies whether the problem lies in the network layer, the application layer, or server resources, and this decides the direction of troubleshooting. If the problem is assumed to lie in the network layer, the basic network connectivity check is triggered, the network communication layer logs are collected, and the specific network-layer metrics are checked to obtain the check results.
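One way to picture the "decide the direction of troubleshooting" step is a vote over how past incidents with the same exception keywords were attributed. This is a sketch under stated assumptions, not the patent's inference engine: the history table, the layer names, the check names, and the function `choose_direction` are all hypothetical.

```python
from collections import Counter

# Hypothetical history: exception keyword -> layers that past incidents were attributed to.
KEYWORD_HISTORY = {
    "SocketTimeoutException": ["network", "network", "application"],
    "OutOfMemoryError": ["server_resources"],
    "NullPointerException": ["application"],
}

# Hypothetical follow-up checks for each troubleshooting direction.
NEXT_CHECKS = {
    "network": ["network_connectivity_check", "collect_network_logs"],
    "application": ["dependent_application_check"],
    "server_resources": ["resource_usage_check"],
}


def choose_direction(keywords):
    """Pick the layer most often blamed for these keywords and its follow-up checks."""
    votes = Counter()
    for keyword in keywords:
        votes.update(KEYWORD_HISTORY.get(keyword, []))
    if not votes:
        return None, []  # nothing known: leave the decision to a human
    layer, _ = votes.most_common(1)[0]
    return layer, NEXT_CHECKS[layer]


layer, checks = choose_direction(["SocketTimeoutException"])
```

A trained classifier would normally replace the vote; the sketch only shows the data flow from exception keywords to a chosen layer to the checks that layer implies.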
As a preferred implementation, in the embodiments of the invention, step S4 further includes:
if the failure cause cannot be determined automatically, handing the fault over for manual handling, annotating the abnormal metric data, and saving it to the annotation repository.
Specifically, checking the metrics draws on the annotation database for the corresponding kind of problem. When that annotation database is sufficiently complete, the failure cause can be determined automatically. When it is not, the cause cannot be determined automatically and manual handling is needed; the abnormal metric data is annotated and saved to the corresponding annotation database to improve it further. For example, checking the network-layer metrics draws on the network-layer problem annotation database. If that database is sufficiently complete, the failure cause can be determined automatically; if not, the fault is handled manually, and the abnormal network-layer metric data is annotated by hand and saved to the network-layer problem annotation database to supplement it.
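The feedback loop described above, where unknown faults are annotated by hand and then become automatically diagnosable, can be sketched as follows. The `AnnotationStore` class, the string "signature" used to identify an anomaly, and the `ask_human` callback are hypothetical; the patent does not specify how anomalies are keyed or stored.

```python
class AnnotationStore:
    """In-memory stand-in for an annotation database of known failure causes."""

    def __init__(self):
        self._causes = {}

    def lookup(self, signature):
        return self._causes.get(signature)

    def save(self, signature, cause):
        self._causes[signature] = cause


def diagnose(signature, store, ask_human):
    """Return (cause, mode): automatic if annotated before, otherwise manual."""
    cause = store.lookup(signature)
    if cause is not None:
        return cause, "automatic"
    cause = ask_human(signature)   # manual intervention
    store.save(signature, cause)   # enrich the store for next time
    return cause, "manual"


store = AnnotationStore()
# Hypothetical anomaly signature and a stubbed human answer.
first = diagnose("net.packet_loss>5%", store, ask_human=lambda s: "faulty switch port")
second = diagnose("net.packet_loss>5%", store, ask_human=lambda s: "should not be asked")
```

The second call never reaches the human callback, which is the effect the text describes: as the annotation database fills in, fewer faults need manual handling.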
As a preferred implementation, in the embodiments of the invention, the method further includes:
S5: determining a repair plan according to the failure cause and triggering the fault repair.
Specifically, the decision module determines the repair plan according to the failure cause diagnosed by the diagnosis module and triggers the corresponding repair operation.
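Step S5 amounts to a lookup from diagnosed cause to repair action. The sketch below shows that dispatch in its simplest form; the cause names, the repair functions, and `decide_and_repair` are invented for illustration, and the functions only return a description of the action rather than touching any real system.

```python
# Hypothetical repair actions; a real system would call its O&M tooling here.
def restart_application(target):
    return f"restart application on {target}"


def clean_disk(target):
    return f"clean temporary files on {target}"


# Hypothetical repair plans keyed by diagnosed failure cause.
REPAIR_PLANS = {
    "application_hang": restart_application,
    "disk_full": clean_disk,
}


def decide_and_repair(cause, target):
    """Trigger the repair for a known cause; return None when no plan exists."""
    action = REPAIR_PLANS.get(cause)
    if action is None:
        return None  # no known plan: escalate to manual handling
    return action(target)


result = decide_and_repair("disk_full", "host-01")
```

Keeping the plans in a table separates the decision (which plan applies) from the execution (what the O&M tool platform actually runs), mirroring the split between the decision module and the tool platform in the system description.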
Fig. 2 is a structural diagram of the machine-learning-based intelligent O&M system for system fault diagnosis according to an exemplary embodiment. Referring to Fig. 2, the system includes at least:
A data collection module, for obtaining metric data and annotation data of the system.
Specifically, in the embodiments, the data collection module includes multiple data collection tools: for example, Filebeat for collecting log data and open-falcon for collecting system metric data, while business metric data is collected through technical means such as listening to the MySQL binlog.
A model training module, for training data models for different usage scenarios from the metric data and annotation data.
Specifically, the model training module includes components such as an algorithm library, a data modeling visualization tool, and a data modeling engine. For the different metric data and annotation data, it trains data models for the different usage scenarios through supervised learning and semi-supervised learning.
A computation and analysis module, for analyzing the system's operating health from the currently collected metric data and the data models, and triggering fault diagnosis and an alarm for any abnormal metric data captured.
Specifically, the analysis engine of the computation and analysis module consists of two parts, a rule engine and an algorithm engine. The algorithm engine mainly uses forecasting-model algorithms (prophet) and algorithms such as random forest.
A fault diagnosis module, for diagnosing the failure cause from the relationship graph established through machine learning and the annotated exception stack data.
Specifically, the core of the fault diagnosis module is an inference engine. On receiving an event from the analysis engine, it diagnoses the failure cause by combining the previously obtained exception stack annotation data, the abnormal metric fluctuation cause classification model, the graph of business, application, and machine relationships, and so on.
An alarm module, for issuing the corresponding alarm according to the abnormal metric data.
Further, the data collection module includes:
an annotation unit, for obtaining the abnormal metric data within the metric data, and annotating the abnormal metric data with both the abnormal fluctuation and the cause of the abnormal fluctuation; and/or
obtaining the exception stack information for the abnormal metric data, and annotating the keywords of the exception stack; and/or
annotating the data of fault problems that have been investigated.
Further, the computation and analysis module includes:
a rule analysis unit, for capturing the abnormal metric data in the current metric data by time window and then triggering fault diagnosis and an alarm;
an algorithm analysis unit, for analyzing the current metric data with the data models to obtain the system's operating health, and triggering fault diagnosis and an alarm based on the abnormal metric data captured.
Further, the fault diagnosis module includes:
a preliminary self-check unit, for performing self-checks based on the relationship graph established through machine learning and on the annotated exception stack data, respectively, to obtain self-check results;
an application check unit, for computing all possible causes of the fault from the decision data obtained in previous troubleshooting, and performing the corresponding checks to obtain check results;
a fault analysis unit, for analyzing the failure cause from the self-check results and the check results.
Further, the fault diagnosis module also includes:
a manual annotation unit, for handing the fault over for manual handling if the failure cause cannot be determined automatically, annotating the abnormal metric data, and saving it to the annotation repository.
As a preferred implementation, in the embodiments of the invention, the system further includes:
A decision module, for determining a repair plan according to the failure cause and triggering the fault repair.
An O&M tool management platform, for carrying out the corresponding fault repair according to the repair plan. The O&M tool management platform includes an O&M script management tool, an application deployment tool, a development process management tool, a configuration management tool, and so on.
In summary, the technical solutions provided by the embodiments of the present invention have the following beneficial effects:
1. The machine-learning-based intelligent O&M method and device for system fault diagnosis provided by the invention apply machine learning models to each stage of an automated O&M system, such as monitoring, fault diagnosis, and O&M decision-making. Faults can be detected and their causes identified quickly, and an O&M decision component performs self-repair actions based on the diagnostic results, with the aim of truly unattended O&M;
2. The machine-learning-based intelligent O&M method and device for system fault diagnosis provided by the invention use machine learning algorithms to integrate data from every dimension and build corresponding data models. This addresses the problems of single-rule monitoring: the inability to make linked judgments across metrics, and the high error rates, frequent false alarms, and missed alarms that arise when load fluctuates irregularly and thresholds are applied too rigidly;
3. The machine-learning-based intelligent O&M method and device for system fault diagnosis provided by the invention use the relationship graph among applications, business services, and servers established through machine learning to quickly extract useful exception information, identify the failure cause from the annotation data, and automatically trigger tools to carry out the repair.
It should be understood that the system fault diagnosis intelligence operational system provided by the above embodiment based on machine learning It, only the example of the division of the above functional modules, can be in practical application when triggering system fault diagnosis business Above-mentioned function distribution is completed by different functional modules as needed, i.e., the internal structure of system is divided into different function Energy module, to complete all or part of the functions described above.In addition, provided by the above embodiment be based on machine learning System fault diagnosis intelligence operational system belongs to the system fault diagnosis intelligence O&M embodiment of the method based on machine learning Same design, specific implementation process are detailed in embodiment of the method, and which is not described herein again.
Those of ordinary skill in the art will appreciate that realizing that all or part of the steps of above-described embodiment can pass through hardware It completes, relevant hardware can also be instructed to complete by program, the program can store in a kind of computer-readable In storage medium, storage medium mentioned above can be read-only memory, disk or CD etc..
The foregoing describes only preferred embodiments of the invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A machine-learning-based intelligent O&M method for system fault diagnosis, characterized in that the method comprises the following steps:
S1: obtaining indicator data and labeled data of a system;
S2: training data models for different usage scenarios according to the indicator data and the labeled data;
S3: according to collected current indicator data and the data models, computing and analyzing the operating health of the system, and triggering fault diagnosis and an alarm for captured abnormal indicator data;
S4: diagnosing the fault cause according to a relation graph established by machine learning and labeled exception stack data.
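Steps S2 and S3 can be sketched as follows, assuming a deliberately simple stand-in for the data model: one mean/standard-deviation baseline per usage scenario, with a z-score test for abnormal indicator values. The patent does not specify the learning algorithm, so the functions `train_models` and `is_abnormal`, the scenario names, and the `z_limit` parameter are all hypothetical.

```python
from statistics import mean, pstdev

def train_models(samples):
    """Fit one baseline per scenario.

    `samples` maps a scenario name to its historical indicator values;
    the result maps each scenario to a (mean, standard deviation) pair.
    """
    return {name: (mean(vals), pstdev(vals)) for name, vals in samples.items()}

def is_abnormal(models, scenario, value, z_limit=3.0):
    """Flag `value` when it deviates more than `z_limit` deviations
    from the baseline learned for `scenario`."""
    mu, sigma = models[scenario]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit

# Hypothetical history: requests per second in two usage scenarios.
models = train_models({"peak": [90, 100, 110], "idle": [9, 10, 11]})

peak_flag = is_abnormal(models, "peak", 105)  # within the peak baseline
idle_flag = is_abnormal(models, "idle", 105)  # far outside the idle baseline
```

The same reading of 105 is treated as normal in the peak scenario and abnormal in the idle scenario, which is the kind of distinction a single fixed threshold cannot make.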
2. The machine-learning-based intelligent O&M method for system fault diagnosis according to claim 1, characterized in that obtaining the labeled data comprises at least:
obtaining abnormal indicator data from the indicator data, and labeling the abnormal indicator data with the abnormal fluctuation of the indicator and the cause of that fluctuation; and/or
obtaining exception stack information for the abnormal indicator data, and labeling keywords of the exception stack; and/or
labeling fault problem data identified during troubleshooting.
3. The machine-learning-based intelligent O&M method for system fault diagnosis according to claim 1 or 2, characterized in that step S3 specifically comprises:
capturing abnormal indicator data in the current indicator data according to a time window, and then triggering fault diagnosis and an alarm; and/or
computing and analyzing the current indicator data using the data models to obtain the operating health of the system, and triggering fault diagnosis and an alarm according to the captured abnormal indicator data.
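The time-window capture in this claim can be illustrated with a small sliding-window rule. This is a sketch under assumptions: the class `WindowRule` and its parameters (`window_seconds`, `threshold`, `min_hits`) are hypothetical, and the patent does not state how the window or the trigger condition is defined.

```python
from collections import deque

class WindowRule:
    """Trigger when at least `min_hits` samples inside the most recent
    `window_seconds` exceed `threshold`."""

    def __init__(self, window_seconds, threshold, min_hits):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_hits = min_hits
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def observe(self, timestamp, value):
        """Record one sample and return True if diagnosis should be triggered."""
        self.samples.append((timestamp, value))
        # Discard samples that have fallen out of the time window.
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()
        hits = sum(1 for _, v in self.samples if v > self.threshold)
        return hits >= self.min_hits

# Hypothetical CPU utilisation samples as (seconds, ratio) pairs.
rule = WindowRule(window_seconds=60, threshold=0.9, min_hits=3)
results = [rule.observe(t, v) for t, v in
           [(0, 0.95), (20, 0.50), (30, 0.97), (50, 0.99), (100, 0.95)]]
```

Requiring several hits inside the window, rather than reacting to a single sample, is one way to avoid alarming on a momentary spike.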
4. The machine-learning-based intelligent O&M method for system fault diagnosis according to claim 1 or 2, characterized in that step S4 specifically comprises:
performing self-checks based respectively on the relation graph established by machine learning and the labeled exception stack data, to obtain self-check results; and/or
computing all possible causes of the fault using decision data obtained from previous troubleshooting, and performing the corresponding checks to obtain check results;
analyzing the fault cause according to the self-check results and the check results.
5. The machine-learning-based intelligent O&M method for system fault diagnosis according to claim 4, characterized in that step S4 further comprises:
if the fault cause cannot be determined automatically, handling the fault through manual intervention, labeling the abnormal indicator data, and saving it to the annotation repository.
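The fallback described in claims 4 and 5 can be sketched as keyword matching against labeled exception stack data, with unmatched cases queued for manual labeling. The function `diagnose`, the label dictionary, and the list used as an annotation repository are hypothetical simplifications; the patent does not describe the storage format or the matching logic.

```python
def diagnose(stack_text, keyword_labels, annotation_repo):
    """Return the labeled cause whose keyword appears in the exception
    stack, or None after queuing the case for manual labeling."""
    for keyword, cause in keyword_labels.items():
        if keyword in stack_text:
            return cause
    # No automatic diagnosis: record the case so that a person can label
    # it, making the label available to later diagnoses.
    annotation_repo.append({"stack": stack_text, "cause": None})
    return None

# Hypothetical labeled keywords and an initially empty annotation repository.
labels = {
    "OutOfMemoryError": "heap exhausted",
    "Connection refused": "downstream service unavailable",
}
repo = []

known = diagnose("java.lang.OutOfMemoryError: Java heap space", labels, repo)
unknown = diagnose("java.lang.NullPointerException at Foo.bar", labels, repo)
```

Each manually labeled case enlarges the set of keywords that can be matched automatically the next time a similar fault occurs.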
6. A machine-learning-based intelligent O&M system for system fault diagnosis, characterized in that the system comprises:
a data collection module, configured to obtain indicator data and labeled data of a system;
a model training module, configured to train data models for different usage scenarios according to the indicator data and the labeled data;
a computation and analysis module, configured to compute and analyze the operating health of the system according to collected current indicator data and the data models, and to trigger fault diagnosis and an alarm for captured abnormal indicator data;
a fault diagnosis module, configured to diagnose the fault cause according to a relation graph established by machine learning and labeled exception stack data;
an alarm module, configured to issue a corresponding alarm according to the abnormal indicator data.
7. The machine-learning-based intelligent O&M system for system fault diagnosis according to claim 6, characterized in that the data collection module comprises:
a labeling unit, configured to obtain abnormal indicator data from the indicator data, and to label the abnormal indicator data with the abnormal fluctuation of the indicator and the cause of that fluctuation; and/or
to obtain exception stack information for the abnormal indicator data, and to label keywords of the exception stack; and/or
to label fault problem data identified during troubleshooting.
8. The machine-learning-based intelligent O&M system for system fault diagnosis according to claim 6 or 7, characterized in that the computation and analysis module comprises:
a rule analysis unit, configured to capture abnormal indicator data in the current indicator data according to a time window, and then to trigger fault diagnosis and an alarm;
an algorithm analysis unit, configured to compute and analyze the current indicator data using the data models to obtain the operating health of the system, and to trigger fault diagnosis and an alarm according to the captured abnormal indicator data.
9. The machine-learning-based intelligent O&M system for system fault diagnosis according to claim 6 or 7, characterized in that the fault diagnosis module comprises:
a preliminary self-check unit, configured to perform self-checks based respectively on the relation graph established by machine learning and the labeled exception stack data, to obtain self-check results;
an application check unit, configured to compute all possible causes of the fault using decision data obtained from previous troubleshooting, and to perform the corresponding checks to obtain check results;
a fault analysis unit, configured to analyze the fault cause according to the self-check results and the check results.
10. The machine-learning-based intelligent O&M system for system fault diagnosis according to claim 9, characterized in that the fault diagnosis module further comprises:
a manual labeling unit, configured such that, if the fault cause cannot be determined automatically, the fault is handled through manual intervention, and the abnormal indicator data is labeled and saved to the annotation repository.
CN201910010700.1A 2019-01-03 2019-01-03 A kind of system fault diagnosis intelligence O&M method and system based on machine learning Pending CN109800127A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910010700.1A CN109800127A (en) 2019-01-03 2019-01-03 A kind of system fault diagnosis intelligence O&M method and system based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910010700.1A CN109800127A (en) 2019-01-03 2019-01-03 A kind of system fault diagnosis intelligence O&M method and system based on machine learning

Publications (1)

Publication Number Publication Date
CN109800127A true CN109800127A (en) 2019-05-24

Family

ID=66558466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910010700.1A Pending CN109800127A (en) 2019-01-03 2019-01-03 A kind of system fault diagnosis intelligence O&M method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN109800127A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107179503A (en) * 2017-04-21 2017-09-19 美林数据技术股份有限公司 The method of Wind turbines intelligent fault diagnosis early warning based on random forest
CN107222339A (en) * 2017-05-27 2017-09-29 全球能源互联网研究院 The failure analysis methods and device of communicating for power information system based on chart database
CN107608862A (en) * 2017-10-13 2018-01-19 众安信息技术服务有限公司 Monitoring alarm method, monitoring alarm device and computer-readable recording medium
CN107644256A (en) * 2017-09-14 2018-01-30 郑州云海信息技术有限公司 A kind of method that diagnosis rule storehouse is formed based on machine learning mode
CN108446200A (en) * 2018-02-07 2018-08-24 福建星瑞格软件有限公司 Server intelligence O&M method based on big data machine learning and computer equipment

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390027A (en) * 2019-06-13 2019-10-29 全球能源互联网研究院有限公司 A kind of information system fault model construction method and system based on chart database
CN110428127A (en) * 2019-06-19 2019-11-08 深圳壹账通智能科技有限公司 Automated analysis method, user equipment, storage medium and device
WO2020253135A1 (en) * 2019-06-19 2020-12-24 深圳壹账通智能科技有限公司 Automated analysis method and device, user equipment, and storage medium
CN110428127B (en) * 2019-06-19 2022-04-15 深圳壹账通智能科技有限公司 Automatic analysis method, user equipment, storage medium and device
JP7115597B2 (en) 2019-06-20 2022-08-09 株式会社Gsユアサ Maintenance support method and computer program
JP2021170347A (en) * 2019-06-20 2021-10-28 株式会社Gsユアサ Maintenance support method and computer program
US11949076B2 (en) 2019-06-20 2024-04-02 Gs Yuasa International Ltd. Maintenance support method, maintenance support system, maintenance support device, and computer program
CN112152830A (en) * 2019-06-28 2020-12-29 中国电力科学研究院有限公司 Intelligent fault root cause analysis method and system
CN112152830B (en) * 2019-06-28 2023-08-04 中国电力科学研究院有限公司 Intelligent fault root cause analysis method and system
CN110504031B (en) * 2019-08-28 2022-02-11 首都医科大学 Cloud management database establishment method and system for health behavior intervention
CN110504031A (en) * 2019-08-28 2019-11-26 首都医科大学 Cloud for Health behavior Intervention manages database building method and system
CN110816589A (en) * 2019-10-31 2020-02-21 北京英诺威尔科技股份有限公司 CTCS3 fault diagnosis method based on machine learning
CN110891283A (en) * 2019-11-22 2020-03-17 超讯通信股份有限公司 Small base station monitoring device and method based on edge calculation model
CN111176872A (en) * 2019-12-12 2020-05-19 北京邮电大学 Monitoring data processing method, system, device and storage medium for IT operation and maintenance
CN111176872B (en) * 2019-12-12 2021-05-07 北京邮电大学 Monitoring data processing method, system, device and storage medium for IT operation and maintenance
CN111209131A (en) * 2019-12-30 2020-05-29 航天信息股份有限公司广州航天软件分公司 Method and system for determining fault of heterogeneous system based on machine learning
CN111858231A (en) * 2020-05-11 2020-10-30 北京必示科技有限公司 Single index abnormality detection method based on operation and maintenance monitoring
WO2021232567A1 (en) * 2020-05-20 2021-11-25 江苏南工科技集团有限公司 Ai technology-based smart operation and maintenance knowledge analysis method
CN111737033B (en) * 2020-05-26 2024-03-08 复旦大学 Microservice fault positioning method based on runtime pattern analysis
CN111737033A (en) * 2020-05-26 2020-10-02 复旦大学 Micro-service fault positioning method based on runtime map analysis
CN111538643A (en) * 2020-07-07 2020-08-14 宝信软件(成都)有限公司 Alarm information filtering method and system for monitoring system
CN111538643B (en) * 2020-07-07 2020-10-16 宝信软件(成都)有限公司 Alarm information filtering method and system for monitoring system
CN111988167A (en) * 2020-07-21 2020-11-24 合肥爱和力人工智能技术服务有限责任公司 Fault analysis method and equipment based on industrial internet mechanism model
CN111985561A (en) * 2020-08-19 2020-11-24 安徽蓝杰鑫信息科技有限公司 Fault diagnosis method and system for intelligent electric meter and electronic device
CN111985561B (en) * 2020-08-19 2023-02-21 安徽蓝杰鑫信息科技有限公司 Fault diagnosis method and system for intelligent electric meter and electronic device
CN111985558A (en) * 2020-08-19 2020-11-24 安徽蓝杰鑫信息科技有限公司 Electric energy meter abnormity diagnosis method and system
CN112363896A (en) * 2020-09-02 2021-02-12 大连大学 Log anomaly detection system
CN112363896B (en) * 2020-09-02 2023-12-05 大连大学 Log abnormality detection system
CN112711508A (en) * 2020-12-21 2021-04-27 航天信息股份有限公司 Intelligent operation and maintenance service system facing large-scale client system
CN112598291A (en) * 2020-12-25 2021-04-02 中国农业银行股份有限公司 Prophet-based operation and maintenance intelligent scheduling method and device
CN112598291B (en) * 2020-12-25 2023-10-13 中国农业银行股份有限公司 Prophet-based operation and maintenance intelligent scheduling method and device
CN112801316A (en) * 2021-01-28 2021-05-14 中国人寿保险股份有限公司上海数据中心 Fault positioning method, system equipment and storage medium based on multi-index data
CN112860472A (en) * 2021-02-05 2021-05-28 建信金融科技有限责任公司 System fault position determining method and device, electronic equipment and storage medium
CN113037365A (en) * 2021-03-02 2021-06-25 烽火通信科技股份有限公司 Method and device for identifying life cycle operation and maintenance state of optical channel
CN113033839A (en) * 2021-03-17 2021-06-25 山东通维信息工程有限公司 ITSS-based highway electromechanical intelligent operation and maintenance improvement method
CN113110389A (en) * 2021-04-21 2021-07-13 东方电气自动控制工程有限公司 Fault recording data processing method based on intelligent power plant monitoring system
CN113765723A (en) * 2021-09-23 2021-12-07 深圳市天威网络工程有限公司 Health diagnosis method and system based on Cable Modem terminal equipment
CN115096627B (en) * 2022-06-16 2023-04-07 中南大学 Method and system for fault diagnosis and operation and maintenance in manufacturing process of hydraulic forming intelligent equipment
CN115096627A (en) * 2022-06-16 2022-09-23 中南大学 Method and system for fault diagnosis and operation and maintenance in manufacturing process of hydraulic forming intelligent equipment
CN116047913A (en) * 2023-02-15 2023-05-02 南京为先科技有限责任公司 Control system and method for neutralization vacuum stripping dioxane removal process
CN116047913B (en) * 2023-02-15 2023-10-03 南京为先科技有限责任公司 Control system and method for neutralization vacuum stripping dioxane removal process
CN116701652A (en) * 2023-06-13 2023-09-05 上海沄熹科技有限公司 Machine learning-based database intelligent operation and maintenance system and method

Similar Documents

Publication Publication Date Title
CN109800127A (en) A kind of system fault diagnosis intelligence O&M method and system based on machine learning
CN110717665B (en) System and method for fault identification and trend analysis based on scheduling control system
CN101989087B (en) On-line real-time failure monitoring and diagnosing system device for industrial processing of residual oil
CN111209131A (en) Method and system for determining fault of heterogeneous system based on machine learning
CN110766277B (en) Health assessment and diagnosis system and mobile terminal for nuclear industry field
KR20180108446A (en) System and method for management of ict infra
CN111162949A (en) Interface monitoring method based on Java byte code embedding technology
CN112817280A (en) Implementation method for intelligent monitoring alarm system of thermal power plant
CN112990656B (en) Health evaluation system and health evaluation method for IT equipment monitoring data
CN114185760A (en) System risk assessment method and device and charging equipment operation and maintenance detection method
CN103049365B (en) Information and application resource running state monitoring and evaluation method
CN113962299A (en) Intelligent operation monitoring and fault diagnosis general model for nuclear power equipment
CN103676836A (en) Online safe operation guiding method
CN115809183A (en) Method for discovering and disposing information-creating terminal fault based on knowledge graph
CN112346393A (en) Intelligent operation and maintenance based data full link abnormity monitoring and processing method and system
CN113395182B (en) Intelligent network equipment management system and method with fault prediction
CN102929241B (en) Safe operation guide system of purified terephthalic acid device and application of safe operation guide system
CN111306051B (en) Probe type state monitoring and early warning method, device and system for oil transfer pump unit
CN117333038A (en) Economic trend analysis system based on big data
CN112803587A (en) Intelligent inspection method for state of automatic equipment based on diagnosis decision library
CN115438093A (en) Power communication equipment fault judgment method and detection system
CN112615812A (en) Information network unified vulnerability multi-dimensional security information collection, analysis and management system
Wang et al. LSTM-based alarm prediction in the mobile communication network
CN113065001A (en) Fault loss stopping method and device
CN113656323A (en) Method for automatically testing, positioning and repairing fault and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190524