CN109800127A - Machine-learning-based intelligent O&M method and system for system fault diagnosis
- Publication number
- CN109800127A, CN201910010700.1A
- Authority
- CN
- China
- Prior art keywords
- data
- fault diagnosis
- machine learning
- labeled
- abnormal index
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a machine-learning-based intelligent O&M method and system for system fault diagnosis. The method comprises: acquiring indicator data and labeled data of a system; training data models for different usage scenarios from the indicator data and the labeled data; computing and analyzing the running health of the system from the currently collected indicator data and the data models, and triggering fault diagnosis and an alarm for any abnormal indicator data captured; and diagnosing the failure cause from a relationship graph built by machine learning and from labeled exception-stack data. By applying machine learning models to each link of an automated O&M system, such as monitoring, fault diagnosis and O&M decision-making, the invention can quickly detect failures and locate their causes. It also provides an O&M decision component that completes self-repair actions based on the diagnostic results, with the aim of truly unattended O&M.
Description
Technical field
The present invention relates to the technical field of intelligent O&M, and in particular to a machine-learning-based intelligent O&M method and system for system fault diagnosis.
Background
With the rapid growth of the internet, product scale and server counts have grown exponentially: from a handful of servers in the early days to hundreds, thousands and tens of thousands. O&M practice has likewise moved from the manual O&M of the early days to today's tool-based, semi-automated O&M. As business volume and server counts keep growing rapidly, technical staff face major challenges, mainly in the following respects:
1. Monitoring indicators keep multiplying. With traditional O&M methods, O&M staff need a long time to find the indicators that matter in a massive volume of indicator data;
2. Large-scale alarms interfere with the judgment of technical staff, so failures cannot be responded to in time;
3. Tools are scattered. This raises learning and ownership costs, and because the systems are independent of one another, data is hard to share;
4. Troubleshooting experience for the same problem is not passed on, so technical staff keep repeating the same work.
Therefore, a new intelligent O&M method is urgently needed to overcome one or more of the above problems.
Summary of the invention
To solve the problems in the prior art, embodiments of the invention provide a machine-learning-based intelligent O&M method and system for system fault diagnosis, which overcome the prior art's inability to quickly detect failures, locate their causes, and complete self-repair automatically.
To solve the above technical problems, the invention adopts the following technical solutions:
In one aspect, a machine-learning-based intelligent O&M method for system fault diagnosis is provided, the method comprising the following steps:
S1: acquiring indicator data and labeled data of the system;
S2: training data models for different usage scenarios from the indicator data and the labeled data;
S3: computing and analyzing the running health of the system from the currently collected indicator data and the data models, and triggering fault diagnosis and an alarm for the abnormal indicator data captured;
S4: diagnosing the failure cause from the relationship graph built by machine learning and the labeled exception-stack data.
Further, acquiring the labeled data includes at least:
acquiring the abnormal indicator data in the indicator data, and labeling it with the abnormal fluctuation and with the cause of the abnormal fluctuation; and/or
acquiring the exception stack information of the abnormal indicator data, and labeling the keywords of the exception stack; and/or
labeling the fault data identified during troubleshooting.
Further, step S3 specifically includes:
after capturing the abnormal indicator data in the current indicator data according to a time window, triggering fault diagnosis and an alarm; and/or
using the data models to compute and analyze the current indicator data, obtaining the running health of the system, and triggering fault diagnosis and an alarm according to the abnormal indicator data captured.
Further, step S4 specifically includes:
performing self-checks according to the relationship graph built by machine learning and the labeled exception-stack data, respectively, to obtain self-check results; and/or
using decision data obtained from previous troubleshooting to compute all possible causes of the failure, and performing the corresponding checks to obtain check results;
analyzing the failure cause according to the self-check results and the check results.
Further, step S4 also includes:
if the failure cause cannot be analyzed automatically, handling it through manual intervention, labeling the abnormal indicator data, and saving it to the annotation repository.
In another aspect, a machine-learning-based intelligent O&M system for system fault diagnosis is provided, the system comprising:
a data collection module, for acquiring indicator data and labeled data of the system;
a model training module, for training data models for different usage scenarios from the indicator data and the labeled data;
a computation and analysis module, for computing and analyzing the running health of the system from the currently collected indicator data and the data models, and triggering fault diagnosis and an alarm for the abnormal indicator data captured;
a fault diagnosis module, for diagnosing the failure cause from the relationship graph built by machine learning and the labeled exception-stack data;
an alarm module, for issuing the corresponding alarm according to the abnormal indicator data.
Further, the data collection module includes:
a labeling unit, for acquiring the abnormal indicator data in the indicator data, and labeling it with the abnormal fluctuation and with the cause of the abnormal fluctuation; and/or
acquiring the exception stack information of the abnormal indicator data, and labeling the keywords of the exception stack; and/or
labeling the fault data identified during troubleshooting.
Further, the computation and analysis module includes:
a rule analysis unit, for triggering fault diagnosis and an alarm after capturing the abnormal indicator data in the current indicator data according to a time window;
an algorithm analysis unit, for using the data models to compute and analyze the current indicator data, obtaining the running health of the system, and triggering fault diagnosis and an alarm according to the abnormal indicator data captured.
Further, the fault diagnosis module includes:
a preliminary self-check unit, for performing self-checks according to the relationship graph built by machine learning and the labeled exception-stack data, respectively, to obtain self-check results;
an application check unit, for using decision data obtained from previous troubleshooting to compute all possible causes of the failure, and performing the corresponding checks to obtain check results;
a fault analysis unit, for analyzing the failure cause according to the self-check results and the check results.
Further, the fault diagnosis module also includes:
a manual labeling unit, for handling the failure through manual intervention if its cause cannot be analyzed automatically, labeling the abnormal indicator data, and saving it to the annotation repository.
The technical solutions provided by the embodiments of the invention have the following beneficial effects:
1. The machine-learning-based intelligent O&M method and system for system fault diagnosis provided by the invention apply machine learning models to each link of an automated O&M system, such as monitoring, fault diagnosis and O&M decision-making. They can quickly detect failures and locate their causes, and they provide an O&M decision component that completes self-repair actions based on the diagnostic results, with the aim of truly unattended O&M;
2. The method and system use machine learning algorithms to integrate data from every dimension and build the corresponding data models. This addresses the problems of single-rule monitoring: no linked judgment and identification, irregular load fluctuation, and overly rigid thresholds, which lead to high error rates and many false alarms and missed alarms;
3. The method and system quickly extract useful exception information from the application, business and server relationship graph built by machine learning, identify the failure cause from the labeled data, and automatically trigger tools to repair it.
Brief description of the drawings
To describe the technical solutions in the embodiments of the invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the machine-learning-based intelligent O&M method for system fault diagnosis according to an exemplary embodiment;
Fig. 2 is a structural diagram of the machine-learning-based intelligent O&M system for system fault diagnosis according to an exemplary embodiment.
Detailed description
To make the objectives, technical solutions and advantages of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.
The machine-learning-based intelligent O&M method for system fault diagnosis provided by the embodiments uses machine learning algorithms to integrate data from every dimension and build a model of healthy application behavior. This addresses the problems of single-dimension rule monitoring: no linked judgment and identification, irregular load fluctuation, and overly rigid thresholds, which lead to a high monitoring error rate with many false alarms and missed alarms. The fault diagnosis module helps technical staff detect abnormal applications automatically. Using application running indicators, business indicators and the server relationship graph, it quickly locates the point where the anomaly broke out and extracts the key indicators. After determining the failure cause from the collected exception stack data, it notifies the decision system. This solves the problem of extracting information from massive logs under large-scale alarms and shortens the time needed to locate the failure cause. Over time the fault annotation database becomes more complete, so that troubleshooting gradually needs no manual intervention and the O&M system can carry out repair actions by itself according to the diagnosed cause.
Fig. 1 is a flowchart of the machine-learning-based intelligent O&M method for system fault diagnosis according to an exemplary embodiment. Referring to Fig. 1, the method includes the following steps:
S1: acquiring indicator data and labeled data of the system.
Specifically, the indicator data falls into three main categories: business indicators, system indicators, and application running indicators. These data reflect actual production conditions. Using time-series windows, the indicator data are aggregated by category and converted into key performance indicators (KPIs). The results are then pushed to the model training module, which builds models from the indicator data set so that the monitoring engine can analyze online indicators in real time.
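The time-window aggregation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tuple layout `(timestamp, metric_name, value)`, the 60-second window and the use of the mean as the KPI are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def aggregate_kpi(samples, window_seconds=60):
    """Group (timestamp, metric_name, value) samples into fixed windows
    and return the mean value per (window_start, metric_name)."""
    buckets = defaultdict(list)
    for ts, name, value in samples:
        # Align each timestamp to the start of its window.
        window_start = ts - (ts % window_seconds)
        buckets[(window_start, name)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Hypothetical samples: a system indicator ("cpu") and a business indicator ("orders").
samples = [
    (0, "cpu", 0.20), (30, "cpu", 0.40), (60, "cpu", 0.90),
    (10, "orders", 5), (70, "orders", 7),
]
kpis = aggregate_kpi(samples)
```

The resulting dictionary, keyed by window start and indicator name, is the kind of record that could be pushed to a training module.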
For labeled data, the data received from the labeling service is cleaned and then handed to the AI data modeling system for modeling. At least three types of model are built: 1. an abnormal indicator fluctuation model; 2. an abnormal indicator fluctuation cause model; 3. a troubleshooting decision model.
In addition to these two kinds of data, the embodiment of the invention also collects basic platform data. Specifically, resource information is extracted from the basic resource management system and used to build a relationship graph between resource entities, which is supplied to the fault diagnosis module. The relationships between basic data entities can be stored in a Neo4j graph database. Note that the basic resource management system manages and integrates all server resource information, application information and business information, and is used for daily operations management. Basic platform data provides a basis for labeling data on the one hand and a reference for fault diagnosis on the other.
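As a rough illustration of a relationship graph between resource entities, the sketch below uses a plain in-memory adjacency dictionary as a stand-in for the Neo4j database named in the text. The entity names, relation names and edge list are hypothetical.

```python
from collections import deque

# Hypothetical resource records: (source entity, relation, target entity).
EDGES = [
    ("business:checkout", "USES", "app:order-service"),
    ("app:order-service", "RUNS_ON", "server:10.0.0.1"),
    ("app:order-service", "DEPENDS_ON", "app:payment-service"),
    ("app:payment-service", "RUNS_ON", "server:10.0.0.2"),
]


def build_graph(edges):
    """Build a directed adjacency map: node -> list of (relation, neighbor)."""
    graph = {}
    for src, rel, dst in edges:
        graph.setdefault(src, []).append((rel, dst))
        graph.setdefault(dst, [])
    return graph


def reachable(graph, start):
    """Breadth-first search for every entity reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for _rel, nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


graph = build_graph(EDGES)
related = reachable(graph, "app:order-service")
```

A diagnosis step could use `related` to decide which servers and dependent applications to inspect when `app:order-service` misbehaves.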
Note also that, in the embodiment of the invention, different kinds of indicator data are collected with different tools. For example, logs are collected mainly with Filebeat, system indicators are collected mainly with open-falcon, and business indicators are obtained by technical means such as listening to the MySQL binlog.
As a preferred implementation, in the embodiment of the invention, acquiring the labeled data includes at least:
acquiring the abnormal indicator data in the indicator data, and labeling it with the abnormal fluctuation and with the cause of the abnormal fluctuation; and/or
acquiring the exception stack information of the abnormal indicator data, and labeling the keywords of the exception stack; and/or
labeling the fault data identified during troubleshooting.
Specifically, the labeled data falls into at least the following three categories:
Abnormal indicator fluctuation labels. These labels are fed back into the monitoring indicator prediction model and can be used to find abnormal indicators quickly.
Abnormal indicator fluctuation cause labels. Abnormal fluctuations have many possible causes, which can be roughly divided into the following classes: 1. network layer causes; 2. system resource usage (including disk, CPU, I/O and memory); 3. application exception logs; 4. business traffic fluctuation; 5. network attacks, and so on. From the labeled causes, a classification library of fluctuation causes can be built, and this library makes it possible to determine the direction of troubleshooting quickly.
Take a Java application as an example. If the application throws a TimeoutException, the labeled causes of the abnormal fluctuation may be: 1. a configuration error that makes the target unreachable; 2. a network problem. Once these two causes are labeled, the network detection script and the configuration check script can be triggered quickly to investigate.
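The TimeoutException example can be sketched as a lookup from labeled causes to check actions. The label names and the check functions are hypothetical placeholders; real checks would run the network detection and configuration check scripts rather than return strings.

```python
# Hypothetical labels: exception name -> labeled causes of the abnormal fluctuation.
CAUSE_LABELS = {
    "TimeoutException": ["config_error", "network"],
}

# Hypothetical check actions, one per labeled cause.
CHECKS = {
    "network": lambda: "network check triggered",
    "config_error": lambda: "configuration check triggered",
}


def trigger_checks(exception_name):
    """Run the check registered for each labeled cause of the exception."""
    causes = CAUSE_LABELS.get(exception_name, [])
    return [CHECKS[cause]() for cause in causes if cause in CHECKS]


results = trigger_checks("TimeoutException")
```

An exception with no labels yields an empty list, which corresponds to the case where troubleshooting has no labeled direction yet.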
Application exception stack keyword labels. In many cases the application's exception stack states the failure cause directly, and it reflects application problems well. Labeling the keywords of the exception stack helps confirm the failure cause quickly. In other words, labeled keywords make it possible to extract the useful log content quickly and to decide the next fault diagnosis check.
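A minimal sketch of keyword extraction from an exception stack is given below. The keyword set and the sample trace are invented for illustration; the patent does not specify how the keywords are matched, so plain substring matching is assumed here.

```python
# Hypothetical set of labeled exception stack keywords.
LABELED_KEYWORDS = {"TimeoutException", "OutOfMemoryError", "Connection refused"}


def extract_keywords(stack_trace, keywords=LABELED_KEYWORDS):
    """Return labeled keywords found in the trace, in order of first appearance."""
    found = []
    for line in stack_trace.splitlines():
        for keyword in sorted(keywords):
            if keyword in line and keyword not in found:
                found.append(keyword)
    return found


trace = """java.util.concurrent.TimeoutException: call timed out
    at com.example.Client.call(Client.java:42)
Caused by: java.net.ConnectException: Connection refused"""

hits = extract_keywords(trace)
```

The extracted keywords reduce a long log to the few terms that the labeled data can map to a check direction.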
S2: training data models for different usage scenarios from the indicator data and the labeled data.
Specifically, the model training service uses a machine learning engine (sparkML) as its base engine and provides supervised learning, semi-supervised learning and so on. After receiving the indicator data or labeled data, it builds data models for the different usage scenarios. The data models include, but are not limited to: indicator prediction models (such as indicator data models, abnormal indicator fluctuation models and abnormal fluctuation classification models), a fault detection process library and solution library, and the application, machine and business relationship graph. The indicator prediction models are used for monitoring and early warning; the fault detection process library and solution library are used for subsequent fault diagnosis and are built from the collected labeled data.
Note that before the above data are used for machine learning training, they must be vectorized, that is, text data must be converted to vector data.
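The patent does not say which vectorization scheme is used. As one common option, the sketch below shows a simple bag-of-words count vector; the tokenizer and the sample log messages are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Assign a fixed vector index to every distinct token."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: index for index, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Convert text to a count vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


# Hypothetical log messages used to build the vocabulary.
docs = ["connection timed out", "disk full", "connection refused"]
vocab = build_vocabulary(docs)
vec = vectorize("Connection timed out, timed out again", vocab)
```

Here the vocabulary order is `connection, disk, full, out, refused, timed`, so each position of `vec` counts one of those tokens.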
S3: computing and analyzing the running health of the system from the currently collected indicator data and the data models, and triggering fault diagnosis and an alarm for the abnormal indicator data captured.
As a preferred implementation, in the embodiment of the invention, step S3 specifically includes:
after capturing the abnormal indicator data in the current indicator data according to a time window, triggering fault diagnosis and an alarm; and/or
using the data models to compute and analyze the current indicator data, obtaining the running health of the system, and triggering fault diagnosis and an alarm according to the abnormal indicator data captured.
Specifically, in the embodiment, the analysis engine that performs the computation and analysis consists of two parts, a rule engine and an algorithm engine. The rule engine captures each abnormal indicator according to a time window and then performs two actions: it triggers an alarm and triggers fault diagnosis. The algorithm engine uses data models built from historical indicator data to analyze the indicator data arriving in real time, obtains the running health of the system, and raises an alarm and triggers fault diagnosis for the abnormal indicator data captured. The algorithm engine mainly uses forecasting algorithms (Prophet) and algorithms such as random forest.
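The two-part analysis engine can be sketched as follows. The rule engine is shown as a fixed threshold over the values in a window. For the algorithm engine, a simple z-score against historical values is swapped in as a stand-in for the Prophet and random forest models named in the text, to keep the example self-contained. All function names, thresholds and sample values are hypothetical.

```python
from statistics import mean, pstdev


def rule_engine(window_values, threshold):
    """Return the values in the current window that exceed a fixed threshold."""
    return [v for v in window_values if v > threshold]


def algorithm_engine(history, current, z_limit=3.0):
    """Flag `current` if it deviates strongly from the historical baseline.
    Stand-in for a trained prediction model."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_limit


def on_abnormal(metric, value):
    """The two actions named in the text: raise an alarm and start diagnosis."""
    return [("alarm", metric, value), ("diagnose", metric, value)]


def analyze(metric, history, window_values, threshold):
    """Combine both engines and emit events for every abnormal value."""
    flagged = set(rule_engine(window_values, threshold))
    flagged |= {v for v in window_values if algorithm_engine(history, v)}
    events = []
    for value in sorted(flagged):
        events.extend(on_abnormal(metric, value))
    return events


history = [10, 11, 9, 10, 10, 11, 9, 10]
events = analyze("latency_ms", history, [10, 13, 50], threshold=40)
```

In this example the value 13 is below the fixed threshold but is still flagged by the baseline comparison, which is the kind of case a single-rule monitor would miss.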
S4: diagnosing the failure cause from the relationship graph built by machine learning and the labeled exception-stack data.
As a preferred implementation, in the embodiment of the invention, step S4 specifically includes:
performing self-checks according to the relationship graph built by machine learning and the labeled exception-stack data, respectively, to obtain self-check results; and/or
using decision data obtained from previous troubleshooting to compute all possible causes of the failure, and performing the corresponding checks to obtain check results;
analyzing the failure cause according to the self-check results and the check results.
Specifically, the core of the fault diagnosis module is an inference engine. On receiving an event from the analysis engine, it first performs self-checks according to the relationship graph built by machine learning (for example the business, application and machine relationship graph) and the labeled exception-stack data, and obtains self-check results. The self-check covers the exception content, system resource usage, traffic fluctuation, and so on.
At the same time, decision data obtained by technical staff from previous troubleshooting is used to compute the possible causes of the failure and to carry out the next check actions, including checks of dependent applications and of the business impact. For example, suppose the number of exceptions in application A surges. The possible causes are found from the historical exception keyword model library, which identifies whether the problem lies in the network layer, the application layer or the server resources, and so decides the direction of troubleshooting. If the problem is in the network layer, a basic network communication check is triggered, the network communication layer logs are collected, and the specific network layer indicators are examined to obtain the check results.
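One way to picture the combination of historical decision data with self-check results is the sketch below. It ranks candidate cause layers by how often they were confirmed for the same exception keyword in the past, then picks the first candidate that the self-check supports. The history records and the shape of the self-check result are assumptions for illustration, not the patent's inference engine.

```python
from collections import Counter

# Hypothetical troubleshooting history: (exception keyword, confirmed cause layer).
HISTORY = [
    ("TimeoutException", "network"),
    ("TimeoutException", "network"),
    ("TimeoutException", "application"),
    ("OutOfMemoryError", "server_resource"),
]


def candidate_causes(keywords, history=HISTORY):
    """Rank cause layers by how often they were confirmed for these keywords."""
    counts = Counter(layer for kw, layer in history if kw in keywords)
    return [layer for layer, _count in counts.most_common()]


def diagnose(keywords, self_check, history=HISTORY):
    """Return the highest-ranked candidate that the self-check also supports.
    `self_check` maps a layer name to True when the check found evidence."""
    for layer in candidate_causes(keywords, history):
        if self_check.get(layer):
            return layer
    return None  # no automatic conclusion; manual handling is needed


ranked = candidate_causes({"TimeoutException"})
cause = diagnose({"TimeoutException"}, {"network": False, "application": True})
```

Returning `None` models the case where neither the history nor the checks point to a cause.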
As a preferred implementation, in the embodiment of the invention, step S4 also includes:
if the failure cause cannot be analyzed automatically, handling it through manual intervention, labeling the abnormal indicator data, and saving it to the annotation repository.
Specifically, checking the indicators draws on the annotation database for the corresponding kind of problem. When that database is sufficiently complete, the failure cause can be analyzed automatically. When it is not, the cause cannot be analyzed automatically; manual intervention is then needed, and the abnormal indicator data is labeled and saved to the corresponding annotation database to improve it further. For example, checking network layer indicators draws on the network layer problem annotation database. If this database is sufficiently complete, the failure cause can be analyzed automatically. If not, the failure is handled manually, the network layer abnormal indicator data is labeled by hand and saved to the network layer problem annotation database, which supplements that database.
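The fallback to manual labeling can be sketched as a lookup that saves the human answer for next time. The `signature` field, the dictionary used as the annotation repository and the `ask_human` callback are hypothetical.

```python
def diagnose_or_annotate(abnormal, repository, ask_human):
    """Look up the cause in the annotation repository; if it is missing,
    ask a person, then save the new label so the next lookup is automatic."""
    key = abnormal["signature"]
    if key in repository:
        return repository[key], "automatic"
    cause = ask_human(abnormal)
    repository[key] = cause
    return cause, "manual"


# Hypothetical network layer annotation repository, initially empty.
network_repo = {}
event = {"signature": "packet_loss_spike", "value": 0.35}

first = diagnose_or_annotate(event, network_repo, lambda e: "switch port flapping")
second = diagnose_or_annotate(event, network_repo, lambda e: "should not be asked")
```

The first call needs a person; the second call resolves the same signature from the repository, which is how the repository grows more complete over time.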
As a preferred implementation, in the embodiment of the invention, the method also includes:
S5: determining a repair plan according to the failure cause and triggering fault repair.
Specifically, the decision module determines the repair plan according to the failure cause diagnosed by the diagnosis module, and triggers the corresponding fault repair operation.
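Step S5 can be pictured as a mapping from a diagnosed cause to an ordered repair plan. The plan table, step names and tool functions below are invented for the example; in the system described, the steps would be carried out by the O&M tools platform.

```python
# Hypothetical repair plans: diagnosed cause -> ordered repair steps.
REPAIR_PLANS = {
    "network": ["check_link", "switch_to_backup_link"],
    "server_resource": ["clean_disk", "restart_service"],
}


def decide_and_repair(cause, tools, plans=REPAIR_PLANS):
    """Pick the repair plan for the cause and run each step that has a tool."""
    plan = plans.get(cause)
    if plan is None:
        return []  # no known plan; leave the failure for manual handling
    return [tools[step]() for step in plan if step in tools]


# Hypothetical tools; real ones would call O&M scripts.
tools = {
    "clean_disk": lambda: "disk cleaned",
    "restart_service": lambda: "service restarted",
}

actions = decide_and_repair("server_resource", tools)
```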
Fig. 2 is a structural diagram of the machine-learning-based intelligent O&M system for system fault diagnosis according to an exemplary embodiment. Referring to Fig. 2, the system includes at least:
A data collection module, for acquiring indicator data and labeled data of the system.
Specifically, in the embodiment of the invention, the data collection module includes several data collection tools: for example, Filebeat for collecting log data and open-falcon for collecting system indicator data, while business indicator data is obtained by technical means such as listening to the MySQL binlog.
A model training module, for training data models for different usage scenarios from the indicator data and the labeled data.
Specifically, the model training module includes components such as an algorithm library, a data modeling visualization tool and a data modeling engine. For the different indicator data and labeled data, it trains data models for the different usage scenarios through supervised learning and semi-supervised learning.
A computation and analysis module, for computing and analyzing the running health of the system from the currently collected indicator data and the data models, and triggering fault diagnosis and an alarm for the abnormal indicator data captured.
Specifically, the analysis engine of the computation and analysis module consists of a rule engine and an algorithm engine. The algorithm engine mainly uses forecasting algorithms (Prophet) and algorithms such as random forest.
A fault diagnosis module, for diagnosing the failure cause from the relationship graph built by machine learning and the labeled exception-stack data.
Specifically, the core of the fault diagnosis module is an inference engine. On receiving an event from the analysis engine, it diagnoses the failure cause by combining the previously obtained labeled exception-stack data, the abnormal fluctuation cause classification model, and the business, application and machine relationship graph.
An alarm module, for issuing the corresponding alarm according to the abnormal indicator data.
Further, the data collection module includes:
a labeling unit, for acquiring the abnormal indicator data in the indicator data, and labeling it with the abnormal fluctuation and with the cause of the abnormal fluctuation; and/or
acquiring the exception stack information of the abnormal indicator data, and labeling the keywords of the exception stack; and/or
labeling the fault data identified during troubleshooting.
Further, the computation and analysis module includes:
a rule analysis unit, for triggering fault diagnosis and an alarm after capturing the abnormal indicator data in the current indicator data according to a time window;
an algorithm analysis unit, for using the data models to compute and analyze the current indicator data, obtaining the running health of the system, and triggering fault diagnosis and an alarm according to the abnormal indicator data captured.
Further, the fault diagnosis module includes:
a preliminary self-check unit, for performing self-checks according to the relationship graph built by machine learning and the labeled exception-stack data, respectively, to obtain self-check results;
an application check unit, for using decision data obtained from previous troubleshooting to compute all possible causes of the failure, and performing the corresponding checks to obtain check results;
a fault analysis unit, for analyzing the failure cause according to the self-check results and the check results.
Further, the fault diagnosis module also includes:
a manual labeling unit, for handling the failure through manual intervention if its cause cannot be analyzed automatically, labeling the abnormal indicator data, and saving it to the annotation repository.
As a preferred implementation, in the embodiment of the invention, the system also includes:
a decision module, for determining a repair plan according to the failure cause and triggering fault repair;
an O&M tools management platform, for carrying out the corresponding fault repair according to the repair plan. The O&M tools management platform includes an O&M script management tool, an application deployment tool, a development process management tool, a configuration management tool, and so on.
In conclusion technical solution provided in an embodiment of the present invention has the benefit that
1, the system fault diagnosis intelligence O&M method and device provided by the invention based on machine learning, by by machine
Device learning model is applied in automatic operation and maintenance system, such as monitoring, fault diagnosis, each O&M link of O&M decision, can be fast
Speed discovery failure and troubleshooting Producing reason, while O&M decision component being provided, it is completed certainly according to each side's diagnostic result
My repair action accomplishes really unattended O&M;
2, the system fault diagnosis intelligence O&M method and device provided by the invention based on machine learning, passes through utilization
Machine learning algorithm is integrated to by each dimension data, establishes corresponding data model, solves the monitoring of single rule, no
Can linkage judgement identification, the fluctuation of load is irregular, leads to that error rate is high, misrepresents deliberately, leaks there are more using threshold values is excessively inflexible
The problems such as report;
3, the system fault diagnosis intelligence O&M method and device provided by the invention based on machine learning, according to machine
Learn the application established, business, the useful exception information of server triadic relation's map rapidly extracting, and with identifying according to labeled data
Trouble cause, automatic trigger tool are repaired.
It should be noted that when the system of the above embodiment triggers the fault diagnosis service, the division into the functional modules described above is only an example. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the system may be divided into different functional modules to perform all or part of the functions described above. In addition, the system embodiment and the method embodiment are based on the same concept; for the specific implementation, see the method embodiment, which is not repeated here.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disc.
The above are only preferred embodiments of the invention and do not limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (10)
1. An intelligent O&M method for system fault diagnosis based on machine learning, characterized in that the method
comprises the following steps:
S1: obtaining indicator data and labeled data of a system;
S2: training data models for different usage scenarios according to the indicator data and the labeled data;
S3: according to collected current indicator data and the data models, computing and analyzing the running health
status of the system, and triggering fault diagnosis and an alarm for captured abnormal indicator data;
S4: diagnosing the fault cause according to a relationship graph established by machine learning and exception-stack
labeled data.
2. The intelligent O&M method for system fault diagnosis based on machine learning according to claim 1, characterized
in that obtaining the labeled data comprises at least:
obtaining abnormal indicator data from the indicator data, and labeling the abnormal indicator data with the abnormal
indicator fluctuation and the cause of that fluctuation; and/or
obtaining exception stack information of the abnormal indicator data, and labeling keywords of the exception stack; and/or
labeling fault problem data identified during troubleshooting.
3. The intelligent O&M method for system fault diagnosis based on machine learning according to claim 1 or 2,
characterized in that step S3 specifically comprises:
after capturing abnormal indicator data in the current indicator data according to a time window, triggering fault
diagnosis and an alarm; and/or
computing and analyzing the current indicator data using the data models to obtain the running health status of the
system, and triggering fault diagnosis and an alarm according to the captured abnormal indicator data.
4. The intelligent O&M method for system fault diagnosis based on machine learning according to claim 1 or 2,
characterized in that step S4 specifically comprises:
performing self-checks respectively according to the relationship graph established by machine learning and the
exception-stack labeled data, to obtain a self-check result; and/or
computing all possible causes of the fault using decision data obtained from previous troubleshooting, and performing
the corresponding checks, to obtain a check result;
analyzing the fault cause according to the self-check result and the check result.
5. The intelligent O&M method for system fault diagnosis based on machine learning according to claim 4, characterized
in that step S4 further comprises:
if the fault cause cannot be analyzed automatically, handling the fault through manual intervention, and labeling the
abnormal indicator data and saving it into an annotation repository.
6. An intelligent O&M system for system fault diagnosis based on machine learning, characterized in that the system
comprises:
a data collection module, configured to obtain indicator data and labeled data of a system;
a model training module, configured to train data models for different usage scenarios according to the indicator data
and the labeled data;
a computation and analysis module, configured to compute and analyze the running health status of the system according
to collected current indicator data and the data models, and to trigger fault diagnosis and an alarm for captured
abnormal indicator data;
a fault diagnosis module, configured to diagnose the fault cause according to a relationship graph established by
machine learning and exception-stack labeled data;
an alarm module, configured to issue a corresponding alarm according to the abnormal indicator data.
7. The intelligent O&M system for system fault diagnosis based on machine learning according to claim 6, characterized
in that the data collection module comprises:
a labeling unit, configured to obtain abnormal indicator data from the indicator data, and to label the abnormal
indicator data with the abnormal indicator fluctuation and the cause of that fluctuation; and/or
to obtain exception stack information of the abnormal indicator data, and to label keywords of the exception stack; and/or
to label fault problem data identified during troubleshooting.
8. The intelligent O&M system for system fault diagnosis based on machine learning according to claim 6 or 7,
characterized in that the computation and analysis module comprises:
a rule analysis unit, configured to trigger fault diagnosis and an alarm after capturing abnormal indicator data in the
current indicator data according to a time window;
an algorithm analysis unit, configured to compute and analyze the current indicator data using the data models to
obtain the running health status of the system, and to trigger fault diagnosis and an alarm according to the captured
abnormal indicator data.
9. The intelligent O&M system for system fault diagnosis based on machine learning according to claim 6 or 7,
characterized in that the fault diagnosis module comprises:
a preliminary self-check unit, configured to perform self-checks respectively according to the relationship graph
established by machine learning and the exception-stack labeled data, to obtain a self-check result;
an application check unit, configured to compute all possible causes of the fault using decision data obtained from
previous troubleshooting, and to perform the corresponding checks, to obtain a check result;
a fault analysis unit, configured to analyze the fault cause according to the self-check result and the check result.
10. The intelligent O&M system for system fault diagnosis based on machine learning according to claim 9,
characterized in that the fault diagnosis module further comprises:
a manual labeling unit, configured to, if the fault cause cannot be analyzed automatically, have the fault handled
through manual intervention, and to label the abnormal indicator data and save it into an annotation repository.
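As a non-limiting illustration only, the time-window capture of abnormal data recited in claims 3 and 8 and the keyword-based diagnosis with manual fallback recited in claims 2, 5 and 10 can be sketched as follows. The claims do not specify any concrete rule or data structure, so everything here is an assumption: the class `WindowDetector`, the function `diagnose`, the z-score rule, the window size, the threshold, and the dictionary of labeled keywords are all hypothetical names and choices, not the applicant's implementation.

```python
from collections import deque
from statistics import mean, pstdev


class WindowDetector:
    """Flags a reading as abnormal when it deviates strongly from a sliding window.

    The z-score rule, window size, and threshold are illustrative assumptions;
    the claims only state that abnormal data is captured over a time window.
    """

    def __init__(self, size=5, threshold=3.0):
        self.threshold = threshold
        self.window = deque(maxlen=size)  # the oldest reading drops out automatically

    def observe(self, value):
        """Return True if `value` is abnormal relative to the readings seen so far."""
        abnormal = False
        if len(self.window) == self.window.maxlen:
            mu, sigma = mean(self.window), pstdev(self.window)
            if sigma == 0:
                # A perfectly flat window: any change at all counts as abnormal.
                abnormal = value != mu
            else:
                abnormal = abs(value - mu) / sigma > self.threshold
        if not abnormal:
            # Only normal readings update the baseline.
            self.window.append(value)
        return abnormal


def diagnose(stack_text, keyword_causes, annotation_repo):
    """Look up fault causes by labeled exception-stack keywords.

    `keyword_causes` maps a labeled keyword to a cause (hypothetical format).
    If nothing matches, the stack is queued in `annotation_repo` so that a
    person can handle and label it, mirroring the manual fallback.
    """
    causes = [cause for keyword, cause in keyword_causes.items() if keyword in stack_text]
    if not causes:
        annotation_repo.append(stack_text)
    return causes
```

In this sketch a reading flagged by `WindowDetector.observe` would be what triggers `diagnose` and an alarm; a trained data model could replace the fixed threshold without changing the surrounding flow.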
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910010700.1A CN109800127A (en) | 2019-01-03 | 2019-01-03 | A kind of system fault diagnosis intelligence O&M method and system based on machine learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910010700.1A CN109800127A (en) | 2019-01-03 | 2019-01-03 | A kind of system fault diagnosis intelligence O&M method and system based on machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109800127A true CN109800127A (en) | 2019-05-24 |
Family
ID=66558466
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910010700.1A Pending CN109800127A (en) | 2019-01-03 | 2019-01-03 | A kind of system fault diagnosis intelligence O&M method and system based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109800127A (en) |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110390027A (en) * | 2019-06-13 | 2019-10-29 | 全球能源互联网研究院有限公司 | A kind of information system fault model construction method and system based on chart database |
CN110428127A (en) * | 2019-06-19 | 2019-11-08 | 深圳壹账通智能科技有限公司 | Automated analysis method, user equipment, storage medium and device |
CN110504031A (en) * | 2019-08-28 | 2019-11-26 | 首都医科大学 | Cloud for Health behavior Intervention manages database building method and system |
CN110816589A (en) * | 2019-10-31 | 2020-02-21 | 北京英诺威尔科技股份有限公司 | CTCS3 fault diagnosis method based on machine learning |
CN110891283A (en) * | 2019-11-22 | 2020-03-17 | 超讯通信股份有限公司 | Small base station monitoring device and method based on edge calculation model |
CN111176872A (en) * | 2019-12-12 | 2020-05-19 | 北京邮电大学 | Monitoring data processing method, system, device and storage medium for IT operation and maintenance |
CN111209131A (en) * | 2019-12-30 | 2020-05-29 | 航天信息股份有限公司广州航天软件分公司 | Method and system for determining fault of heterogeneous system based on machine learning |
CN111538643A (en) * | 2020-07-07 | 2020-08-14 | 宝信软件(成都)有限公司 | Alarm information filtering method and system for monitoring system |
CN111737033A (en) * | 2020-05-26 | 2020-10-02 | 复旦大学 | Micro-service fault positioning method based on runtime map analysis |
CN111858231A (en) * | 2020-05-11 | 2020-10-30 | 北京必示科技有限公司 | Single index abnormality detection method based on operation and maintenance monitoring |
CN111985561A (en) * | 2020-08-19 | 2020-11-24 | 安徽蓝杰鑫信息科技有限公司 | Fault diagnosis method and system for intelligent electric meter and electronic device |
CN111985558A (en) * | 2020-08-19 | 2020-11-24 | 安徽蓝杰鑫信息科技有限公司 | Electric energy meter abnormity diagnosis method and system |
CN111988167A (en) * | 2020-07-21 | 2020-11-24 | 合肥爱和力人工智能技术服务有限责任公司 | Fault analysis method and equipment based on industrial internet mechanism model |
CN112152830A (en) * | 2019-06-28 | 2020-12-29 | 中国电力科学研究院有限公司 | Intelligent fault root cause analysis method and system |
CN112363896A (en) * | 2020-09-02 | 2021-02-12 | 大连大学 | Log anomaly detection system |
CN112598291A (en) * | 2020-12-25 | 2021-04-02 | 中国农业银行股份有限公司 | Prophet-based operation and maintenance intelligent scheduling method and device |
CN112711508A (en) * | 2020-12-21 | 2021-04-27 | 航天信息股份有限公司 | Intelligent operation and maintenance service system facing large-scale client system |
CN112801316A (en) * | 2021-01-28 | 2021-05-14 | 中国人寿保险股份有限公司上海数据中心 | Fault positioning method, system equipment and storage medium based on multi-index data |
CN112860472A (en) * | 2021-02-05 | 2021-05-28 | 建信金融科技有限责任公司 | System fault position determining method and device, electronic equipment and storage medium |
CN113033839A (en) * | 2021-03-17 | 2021-06-25 | 山东通维信息工程有限公司 | ITSS-based highway electromechanical intelligent operation and maintenance improvement method |
CN113037365A (en) * | 2021-03-02 | 2021-06-25 | 烽火通信科技股份有限公司 | Method and device for identifying life cycle operation and maintenance state of optical channel |
CN113110389A (en) * | 2021-04-21 | 2021-07-13 | 东方电气自动控制工程有限公司 | Fault recording data processing method based on intelligent power plant monitoring system |
JP2021170347A (en) * | 2019-06-20 | 2021-10-28 | 株式会社Gsユアサ | Maintenance support method and computer program |
WO2021232567A1 (en) * | 2020-05-20 | 2021-11-25 | 江苏南工科技集团有限公司 | Ai technology-based smart operation and maintenance knowledge analysis method |
CN113765723A (en) * | 2021-09-23 | 2021-12-07 | 深圳市天威网络工程有限公司 | Health diagnosis method and system based on Cable Modem terminal equipment |
CN115096627A (en) * | 2022-06-16 | 2022-09-23 | 中南大学 | Method and system for fault diagnosis and operation and maintenance in manufacturing process of hydraulic forming intelligent equipment |
CN116047913A (en) * | 2023-02-15 | 2023-05-02 | 南京为先科技有限责任公司 | Control system and method for neutralization vacuum stripping dioxane removal process |
CN116701652A (en) * | 2023-06-13 | 2023-09-05 | 上海沄熹科技有限公司 | Machine learning-based database intelligent operation and maintenance system and method |
US11949076B2 (en) | 2019-06-20 | 2024-04-02 | Gs Yuasa International Ltd. | Maintenance support method, maintenance support system, maintenance support device, and computer program |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107179503A (en) * | 2017-04-21 | 2017-09-19 | 美林数据技术股份有限公司 | The method of Wind turbines intelligent fault diagnosis early warning based on random forest |
CN107222339A (en) * | 2017-05-27 | 2017-09-29 | 全球能源互联网研究院 | The failure analysis methods and device of communicating for power information system based on chart database |
CN107608862A (en) * | 2017-10-13 | 2018-01-19 | 众安信息技术服务有限公司 | Monitoring alarm method, monitoring alarm device and computer-readable recording medium |
CN107644256A (en) * | 2017-09-14 | 2018-01-30 | 郑州云海信息技术有限公司 | A kind of method that diagnosis rule storehouse is formed based on machine learning mode |
CN108446200A (en) * | 2018-02-07 | 2018-08-24 | 福建星瑞格软件有限公司 | Server intelligence O&M method based on big data machine learning and computer equipment |
Cited By (42)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110390027A (en) * | 2019-06-13 | 2019-10-29 | 全球能源互联网研究院有限公司 | A kind of information system fault model construction method and system based on chart database |
CN110428127A (en) * | 2019-06-19 | 2019-11-08 | 深圳壹账通智能科技有限公司 | Automated analysis method, user equipment, storage medium and device |
WO2020253135A1 (en) * | 2019-06-19 | 2020-12-24 | 深圳壹账通智能科技有限公司 | Automated analysis method and device, user equipment, and storage medium |
CN110428127B (en) * | 2019-06-19 | 2022-04-15 | 深圳壹账通智能科技有限公司 | Automatic analysis method, user equipment, storage medium and device |
JP7115597B2 (en) | 2019-06-20 | 2022-08-09 | 株式会社Gsユアサ | Maintenance support method and computer program |
JP2021170347A (en) * | 2019-06-20 | 2021-10-28 | 株式会社Gsユアサ | Maintenance support method and computer program |
US11949076B2 (en) | 2019-06-20 | 2024-04-02 | Gs Yuasa International Ltd. | Maintenance support method, maintenance support system, maintenance support device, and computer program |
CN112152830A (en) * | 2019-06-28 | 2020-12-29 | 中国电力科学研究院有限公司 | Intelligent fault root cause analysis method and system |
CN112152830B (en) * | 2019-06-28 | 2023-08-04 | 中国电力科学研究院有限公司 | Intelligent fault root cause analysis method and system |
CN110504031B (en) * | 2019-08-28 | 2022-02-11 | 首都医科大学 | Cloud management database establishment method and system for health behavior intervention |
CN110504031A (en) * | 2019-08-28 | 2019-11-26 | 首都医科大学 | Cloud for Health behavior Intervention manages database building method and system |
CN110816589A (en) * | 2019-10-31 | 2020-02-21 | 北京英诺威尔科技股份有限公司 | CTCS3 fault diagnosis method based on machine learning |
CN110891283A (en) * | 2019-11-22 | 2020-03-17 | 超讯通信股份有限公司 | Small base station monitoring device and method based on edge calculation model |
CN111176872A (en) * | 2019-12-12 | 2020-05-19 | 北京邮电大学 | Monitoring data processing method, system, device and storage medium for IT operation and maintenance |
CN111176872B (en) * | 2019-12-12 | 2021-05-07 | 北京邮电大学 | Monitoring data processing method, system, device and storage medium for IT operation and maintenance |
CN111209131A (en) * | 2019-12-30 | 2020-05-29 | 航天信息股份有限公司广州航天软件分公司 | Method and system for determining fault of heterogeneous system based on machine learning |
CN111858231A (en) * | 2020-05-11 | 2020-10-30 | 北京必示科技有限公司 | Single index abnormality detection method based on operation and maintenance monitoring |
WO2021232567A1 (en) * | 2020-05-20 | 2021-11-25 | 江苏南工科技集团有限公司 | Ai technology-based smart operation and maintenance knowledge analysis method |
CN111737033B (en) * | 2020-05-26 | 2024-03-08 | 复旦大学 | Microservice fault positioning method based on runtime pattern analysis |
CN111737033A (en) * | 2020-05-26 | 2020-10-02 | 复旦大学 | Micro-service fault positioning method based on runtime map analysis |
CN111538643A (en) * | 2020-07-07 | 2020-08-14 | 宝信软件(成都)有限公司 | Alarm information filtering method and system for monitoring system |
CN111538643B (en) * | 2020-07-07 | 2020-10-16 | 宝信软件(成都)有限公司 | Alarm information filtering method and system for monitoring system |
CN111988167A (en) * | 2020-07-21 | 2020-11-24 | 合肥爱和力人工智能技术服务有限责任公司 | Fault analysis method and equipment based on industrial internet mechanism model |
CN111985561A (en) * | 2020-08-19 | 2020-11-24 | 安徽蓝杰鑫信息科技有限公司 | Fault diagnosis method and system for intelligent electric meter and electronic device |
CN111985561B (en) * | 2020-08-19 | 2023-02-21 | 安徽蓝杰鑫信息科技有限公司 | Fault diagnosis method and system for intelligent electric meter and electronic device |
CN111985558A (en) * | 2020-08-19 | 2020-11-24 | 安徽蓝杰鑫信息科技有限公司 | Electric energy meter abnormity diagnosis method and system |
CN112363896A (en) * | 2020-09-02 | 2021-02-12 | 大连大学 | Log anomaly detection system |
CN112363896B (en) * | 2020-09-02 | 2023-12-05 | 大连大学 | Log abnormality detection system |
CN112711508A (en) * | 2020-12-21 | 2021-04-27 | 航天信息股份有限公司 | Intelligent operation and maintenance service system facing large-scale client system |
CN112598291A (en) * | 2020-12-25 | 2021-04-02 | 中国农业银行股份有限公司 | Prophet-based operation and maintenance intelligent scheduling method and device |
CN112598291B (en) * | 2020-12-25 | 2023-10-13 | 中国农业银行股份有限公司 | Prophet-based operation and maintenance intelligent scheduling method and device |
CN112801316A (en) * | 2021-01-28 | 2021-05-14 | 中国人寿保险股份有限公司上海数据中心 | Fault positioning method, system equipment and storage medium based on multi-index data |
CN112860472A (en) * | 2021-02-05 | 2021-05-28 | 建信金融科技有限责任公司 | System fault position determining method and device, electronic equipment and storage medium |
CN113037365A (en) * | 2021-03-02 | 2021-06-25 | 烽火通信科技股份有限公司 | Method and device for identifying life cycle operation and maintenance state of optical channel |
CN113033839A (en) * | 2021-03-17 | 2021-06-25 | 山东通维信息工程有限公司 | ITSS-based highway electromechanical intelligent operation and maintenance improvement method |
CN113110389A (en) * | 2021-04-21 | 2021-07-13 | 东方电气自动控制工程有限公司 | Fault recording data processing method based on intelligent power plant monitoring system |
CN113765723A (en) * | 2021-09-23 | 2021-12-07 | 深圳市天威网络工程有限公司 | Health diagnosis method and system based on Cable Modem terminal equipment |
CN115096627B (en) * | 2022-06-16 | 2023-04-07 | 中南大学 | Method and system for fault diagnosis and operation and maintenance in manufacturing process of hydraulic forming intelligent equipment |
CN115096627A (en) * | 2022-06-16 | 2022-09-23 | 中南大学 | Method and system for fault diagnosis and operation and maintenance in manufacturing process of hydraulic forming intelligent equipment |
CN116047913A (en) * | 2023-02-15 | 2023-05-02 | 南京为先科技有限责任公司 | Control system and method for neutralization vacuum stripping dioxane removal process |
CN116047913B (en) * | 2023-02-15 | 2023-10-03 | 南京为先科技有限责任公司 | Control system and method for neutralization vacuum stripping dioxane removal process |
CN116701652A (en) * | 2023-06-13 | 2023-09-05 | 上海沄熹科技有限公司 | Machine learning-based database intelligent operation and maintenance system and method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109800127A (en) | A kind of system fault diagnosis intelligence O&M method and system based on machine learning | |
CN110717665B (en) | System and method for fault identification and trend analysis based on scheduling control system | |
CN101989087B (en) | On-line real-time failure monitoring and diagnosing system device for industrial processing of residual oil | |
CN111209131A (en) | Method and system for determining fault of heterogeneous system based on machine learning | |
CN110766277B (en) | Health assessment and diagnosis system and mobile terminal for nuclear industry field | |
KR20180108446A (en) | System and method for management of ict infra | |
CN111162949A (en) | Interface monitoring method based on Java byte code embedding technology | |
CN112817280A (en) | Implementation method for intelligent monitoring alarm system of thermal power plant | |
CN112990656B (en) | Health evaluation system and health evaluation method for IT equipment monitoring data | |
CN114185760A (en) | System risk assessment method and device and charging equipment operation and maintenance detection method | |
CN103049365B (en) | Information and application resource running state monitoring and evaluation method | |
CN113962299A (en) | Intelligent operation monitoring and fault diagnosis general model for nuclear power equipment | |
CN103676836A (en) | Online safe operation guiding method | |
CN115809183A (en) | Method for discovering and disposing information-creating terminal fault based on knowledge graph | |
CN112346393A (en) | Intelligent operation and maintenance based data full link abnormity monitoring and processing method and system | |
CN113395182B (en) | Intelligent network equipment management system and method with fault prediction | |
CN102929241B (en) | Safe operation guide system of purified terephthalic acid device and application of safe operation guide system | |
CN111306051B (en) | Probe type state monitoring and early warning method, device and system for oil transfer pump unit | |
CN117333038A (en) | Economic trend analysis system based on big data | |
CN112803587A (en) | Intelligent inspection method for state of automatic equipment based on diagnosis decision library | |
CN115438093A (en) | Power communication equipment fault judgment method and detection system | |
CN112615812A (en) | Information network unified vulnerability multi-dimensional security information collection, analysis and management system | |
Wang et al. | LSTM-based alarm prediction in the mobile communication network | |
CN113065001A (en) | Fault loss stopping method and device | |
CN113656323A (en) | Method for automatically testing, positioning and repairing fault and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190524 |