CN115185768A

CN115185768A - Fault recognition method and system of system, electronic equipment and storage medium

Info

Publication number: CN115185768A
Application number: CN202210731601.4A
Authority: CN
Inventors: 杜杨
Original assignee: Ping An Bank Co Ltd
Current assignee: Ping An Bank Co Ltd
Priority date: 2022-06-25
Filing date: 2022-06-25
Publication date: 2022-10-14

Abstract

The invention discloses a fault identification method for a system, a fault identification system, an electronic device, and a storage medium. The method comprises: obtaining operation data of the system and performing structured processing on the operation data to obtain structured operation data of the system; preprocessing the structured operation data to obtain preprocessed structured operation data; extracting target operation features from the preprocessed structured operation data according to a preset feature extraction rule; determining an identification result of the system according to the target operation features; and, if the identification result indicates that the system has an operation fault, outputting the identification result. Because the collected operation data undergoes preprocessing and feature extraction, the usability of the data is improved, which in turn supports the reliability of the subsequent fault identification result. Because the identification result is determined from the target operation features, faults can be identified more accurately from the operating characteristics of the system, and the impact on services caused by inaccurate fault identification is reduced.

Description

Fault recognition method and system of system, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of computer operation and maintenance, and in particular to a fault identification method for a system, a fault identification system, an electronic device, and a storage medium.

Background

As network technology matures, more and more services are executed by online systems such as financial systems. While a financial system is running, increased access volume, network failures, growth in the volume of processed data, and similar causes can lead to operation faults that affect services. Fault identification is therefore needed during operation to keep the system running normally.

In existing fault identification for financial systems, operation indexes are obtained mainly by collecting operation data of the system, early-warning thresholds are set manually, and an alarm is raised when an index reaches or exceeds its threshold. The thresholds are generally specified by experts. Because financial systems are updated frequently, the workload of re-learning the system's configured parameters is excessive and the thresholds are not very accurate, so the results of threshold-based fault identification are not very accurate either.

Disclosure of Invention

The embodiment of the invention provides a fault identification method and system of a system, electronic equipment and a storage medium, which aim to solve the problem that the fault detection result of the existing fault detection method is inaccurate.

In one aspect, an embodiment of the present invention provides a method for identifying a system fault, where the method includes:

acquiring operation data of a system, and performing structured processing on the operation data to obtain structured operation data of the system;

preprocessing the structured operation data to obtain preprocessed structured operation data;

extracting target operation features from the preprocessed structured operation data according to a preset feature extraction rule;

determining to obtain an identification result of the system according to the target operation characteristics;

and if the identification result represents that the system has an operation fault, outputting the identification result.

In another aspect, an embodiment of the present invention provides a fault identification system, where the system includes a big data platform, a fault identification layer, and a visualization layer;

the big data platform is used for acquiring operation data of a system, performing structured processing on the operation data to obtain structured operation data of the system, preprocessing the structured operation data to obtain preprocessed structured operation data, and extracting target operation features from the preprocessed structured operation data according to a preset feature extraction rule;

the fault identification layer is used for determining and obtaining an identification result of the system according to the target operation characteristics;

and the visualization layer is used for outputting the identification result if the identification result represents that the system has the operation fault.

In another aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor; the memory stores an application program, and the processor is used for running the application program in the memory to execute the operation of the fault identification method of the system.

In another aspect, an embodiment of the present invention provides a storage medium, where the storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to perform steps in the fault identification method of the system.

The method comprises: obtaining operation data of a system and performing structured processing on the operation data to obtain structured operation data of the system; preprocessing the structured operation data to obtain preprocessed structured operation data; extracting target operation features from the preprocessed structured operation data according to a preset feature extraction rule; determining an identification result of the system according to the target operation features; and, if the identification result indicates that the system has an operation fault, outputting the identification result. Because the collected operation data undergoes preprocessing and feature extraction, the usability of the data is improved, which in turn supports the reliability of the subsequent fault identification result. Because the identification result is determined from the target operation features, the invention, compared with fault identification methods based on an early-warning threshold, can identify faults more accurately from the operating characteristics of the system and reduces the impact on services caused by inaccurate fault identification.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic view of an application scenario of a fault identification method of a system according to an embodiment of the present invention;

FIG. 2 is a flow chart of a fault identification method of the system according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a recognition model provided by an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a fault identification system provided by an embodiment of the invention;

fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As described in the background, conventional fault identification generally relies on discovering problem points, either manually or during system operation, and then using third-party software to assist the analysis, for example Linux commands, Java analysis tools, and Arthas. However, each third-party tool addresses different problems, and no single tool can solve all of them, so this approach lengthens fault location and handling time and thereby affects services.

Based on the above, in order to improve the accuracy and timeliness of conventional fault identification, the embodiment of the invention provides a fault identification method for a system. Data preprocessing and feature extraction are performed on the collected operation data of the system, which improves the usability of the data and in turn supports the reliability of the subsequent fault identification result. The identification result of the system is determined from the target operation features, so that, compared with fault identification methods based on an early-warning threshold, faults can be identified more accurately from the operating characteristics of the system and the impact on services caused by inaccurate fault identification is reduced.

As shown in fig. 1, fig. 1 is an application scenario schematic diagram of a fault identification method of a system according to an embodiment of the present invention, and the application scenario schematic diagram shown in fig. 1 is only an illustration of an application scenario to which the embodiment of the present invention may be applied to help a person skilled in the art to understand technical content of the present invention, but does not mean that the embodiment of the present invention may not be applied to other devices, systems, environments, or scenarios.

As shown in fig. 1, the application scenario provided in accordance with the present invention includes a server 101, a network 102, and computer devices 103, 104, and 105.

Network 102 may provide a medium for communication links between server 101 and computer devices 103, 104, and 105. The network 102 may be the Internet or any other network, including but not limited to a wide area network, a metropolitan area network, a local area network, a 3rd Generation Partnership Project (3GPP) network, a Long Term Evolution (LTE) network, a Worldwide Interoperability for Microwave Access (WiMAX) mobile communication network, or a computer network communicating over the TCP/IP protocol suite, the User Datagram Protocol (UDP), and so on.

A tester may use computer devices 103, 104, and 105 to interact with server 101 over network 102 to perform fault identification for a system deployed on server 101. The computer devices 103, 104, and 105 may have an application installed to implement fault identification of the system deployed on the server 101, so as to obtain service data, system resource occupancy, memory occupancy, and the like of the server 101. It is understood that, during fault identification, the system deployed on the server 101 may be accessed and checked for faults by one or more computer devices simultaneously, and any one of those computer devices, or another computer device, may aggregate the fault identification results produced by the one or more computer devices. It should be noted that the embodiment of the present invention does not limit the system deployed on the server 101; for example, it may be a network system, a financial business system, a website, an operating system, an application system, a cloud computing system, and the like.

The computer devices 103, 104, and 105 may be various computer devices with display capabilities, on which operation and maintenance personnel view the fault identification result of the system deployed on the server 101. They include but are not limited to virtual devices, high-performance computers, servers, computers, PC terminals, and the like.

The server 101 may be a server that provides various services, such as a back-end management server that supports the financial transaction system used by operation and maintenance personnel through the computer devices 103, 104, and 105. The back-end management server may analyze the received request data and return the processing results to the computer devices 103, 104, and 105.

It should be noted that the fault identification method of the system provided by the embodiment of the present invention may generally be executed by the computer devices 103, 104, and 105. Accordingly, the fault identification system provided by the embodiments of the present invention may generally be disposed in the computer devices 103, 104, and 105. The fault identification method of the system provided by the embodiment of the invention may also be executed by a server or server cluster that is different from the computer devices 103, 104, and 105 and can communicate with the server 101. Accordingly, the fault identification system provided by the embodiment of the present invention may also be disposed in a server or server cluster that is different from the computer devices 103, 104, and 105 and can communicate with the server 101.

It should be understood that the number of computer devices, networks, and servers shown in FIG. 1 is merely illustrative, and that there may be any number of computer devices, networks, and servers, depending on the actual application scenario.

As shown in fig. 2, fig. 2 is a schematic flowchart of a fault identification method of a system according to an embodiment of the present invention, where the fault identification method of the system includes steps 201 to 205:

Step 201: acquiring the operation data of the system, and performing structured processing on the operation data to obtain the structured operation data of the system.

In some embodiments of the invention, the system may be a financial system, such as a banking system.

The operation data of the system comprises operation and maintenance data of the system and business data of the system. According to different application scenarios of the system, the operation and maintenance data may include the following categories:

Operation and maintenance data of the memory resources of the server where the system is located, used for detecting memory faults of that server. This data includes, but is not limited to: CPU (central processing unit) utilization, IO (Input/Output) utilization, memory utilization, storage space occupation, and session information. The session information includes, but is not limited to, session time, session state, and type of waiting event.

Operation and maintenance data belonging to the session state, used for detecting session state indexes in the server where the system is located so as to monitor session state faults in the server. This data includes, but is not limited to: session congestion time, waiting events of a session, waiting type of a session, session time, and session state.

Operation and maintenance data belonging to the database instance resource layer (mysql_resource), used for monitoring the resource indexes of the instance layer where the system is located so as to monitor faults of the instance layer. This data includes, but is not limited to: mysql.cpu, mysql.storage, mysql.io, mysql.mem, mysql.session.

Operation and maintenance data belonging to the TCP (Transmission Control Protocol) layer, used for monitoring response time, that is, the time the network layer takes to respond to a request. This data includes, but is not limited to: tcp_rt (i.e., TCP response time).

Operation and maintenance data belonging to the request layer load (workload), used for monitoring the SQL (Structured Query Language) executed by users, the SQL workload, and other request operations applied to the database instance, so as to monitor the request layer load. This data includes, but is not limited to: mysql.

Other operation and maintenance data categories may also include operation and maintenance data belonging to the engine layer of the database instance (e.g., mysql_bp, mysql_inodb_bp_io, mysql_inodb_data_io, mysql_inodb_log_io, etc.) and operation and maintenance data belonging to the related instance layer of the database (e.g., mysql.

It should be noted that, according to different requirements of the actual application scenarios, different classification standards may be used to classify the operation and maintenance data, and the classification of the operation and maintenance data is not limited in the embodiment of the present invention.

The service data of the system refers to data input by a user in practical application of the system, such as query data input by the user, input user information, image information, voice information, password input by the user, and the like.

In some embodiments of the present invention, the operation data of the system includes structured fields and unstructured fields. A structured field holds data that can be logically expressed in a two-dimensional table structure, such as CPU occupancy, memory usage, network resources, storage resources, data throughput, response time, session time, and the like. Unstructured fields include, but are not limited to, text, images, voice information, and the like.

To facilitate subsequent data preprocessing, step 201 handles the two kinds of field separately. Structured data is extracted from the structured fields of the acquired operation data, that is, values such as CPU occupancy, memory usage, network resources, storage resources, data throughput, response time, and session time are read out directly. The unstructured fields are subjected to structured processing. The extracted structured data together with the structured result of the unstructured fields is set as the structured operation data of the system.

The structured processing of an unstructured field comprises: extracting the field information in the unstructured field; encoding the field information, that is, converting the text, image, and voice information in the unstructured field into numerical codes, to obtain coded information corresponding to the field information; and extracting structured data from the coded information.
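
The handling of structured and unstructured fields in step 201 can be sketched as follows. This is a minimal illustration, not the patented implementation: the field names, the `structure_record` helper, and the hash-based numeric encoding are all assumptions made for the example, since the source does not specify how unstructured content is encoded. Only text fields are encoded here; image and voice data would need their own encoders.

```python
import hashlib

# Hypothetical names of fields that already fit a two-dimensional table.
STRUCTURED_FIELDS = ("cpu_usage", "memory_usage", "response_time", "session_time")


def encode_unstructured(value: str) -> int:
    """Map a text field to a stable numeric code (an assumed encoding scheme)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def structure_record(raw: dict) -> dict:
    """Build a flat, all-numeric record from one raw operation-data record."""
    structured = {}
    for name, value in raw.items():
        if name in STRUCTURED_FIELDS:
            # Structured field: read the numeric value directly.
            structured[name] = float(value)
        elif isinstance(value, str):
            # Unstructured text field: convert it to a numeric code.
            structured[name + "_code"] = encode_unstructured(value)
        # Other value types (e.g. raw image bytes) are ignored in this sketch.
    return structured


record = structure_record(
    {"cpu_usage": "0.93", "memory_usage": 0.71, "query_text": "SELECT * FROM orders"}
)
```

A hash is used only because it is deterministic and needs no training data; a learned embedding or a category dictionary would serve the same role.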

Step 202: preprocessing the structured operation data to obtain preprocessed structured operation data.

In some embodiments of the present invention, the preprocessing includes, but is not limited to, data sorting, missing-value removal, data smoothing, outlier removal, data deduplication, data integration, data normalization, data classification, clustering, data dimension reduction, and the like. These preprocessing methods can be combined according to the actual application scenario of the fault identification method and then applied to the structured operation data to obtain the preprocessed structured operation data. It should be noted that the embodiment of the present invention does not limit the order in which the preprocessing methods are applied. For example, data sorting, missing-value removal, data smoothing, outlier removal, and data deduplication may be performed first, followed by data integration, data normalization, data classification, clustering, and data dimension reduction; alternatively, data integration, data normalization, data classification, and clustering may be performed first, followed by data sorting, missing-value removal, data smoothing, outlier removal, data deduplication, and data dimension reduction.
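
As one possible combination of the preprocessing methods listed above, the sketch below chains sorting, deduplication, missing-value removal, outlier removal, and normalization for a single numeric field. The record layout (a `ts` timestamp key), the z-score outlier rule, and the min-max normalization are assumptions chosen for illustration; the source does not prescribe specific algorithms or an order.

```python
from statistics import mean, pstdev


def preprocess(rows, field, z_limit=3.0):
    """Sort, deduplicate, drop missing values and outliers, then normalize."""
    # Data sorting by timestamp, then deduplication on (timestamp, value).
    seen = set()
    cleaned = []
    for row in sorted(rows, key=lambda r: r["ts"]):
        key = (row["ts"], row.get(field))
        if row.get(field) is None or key in seen:
            continue  # missing-value removal and deduplication
        seen.add(key)
        cleaned.append(dict(row))  # copy so the input rows are not modified

    # Outlier removal with a z-score limit (an assumed rule).
    values = [r[field] for r in cleaned]
    if len(values) > 1 and pstdev(values) > 0:
        mu, sigma = mean(values), pstdev(values)
        cleaned = [r for r in cleaned if abs(r[field] - mu) / sigma <= z_limit]

    # Min-max normalization to the range [0, 1].
    values = [r[field] for r in cleaned]
    if not values:
        return cleaned
    lo, hi = min(values), max(values)
    for r in cleaned:
        r[field] = 0.0 if hi == lo else (r[field] - lo) / (hi - lo)
    return cleaned
```

Because the source leaves the order open, the stages could equally be reordered, for example normalizing before removing outliers, as long as the same order is used for historical and live data.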

Step 203: extracting target operation features from the preprocessed structured operation data according to a preset feature extraction rule.

In some embodiments of the present invention, a preset feature extraction rule is used to determine a target operating feature to be extracted. In some embodiments of the present invention, the preset feature extraction rule includes at least one target operation feature identifier to be extracted, where the target operation feature identifier may be a feature name and a feature type of the target operation feature. The operation characteristics refer to the values of the preprocessed structured operation data, such as CPU usage, memory occupancy, session time, and the like.

In some embodiments of the present invention, the preset feature extraction rule includes multiple fault types and a target operation feature identifier to be extracted corresponding to each fault type. The fault types include system operation and maintenance faults and system service faults, wherein the system operation and maintenance faults include but are not limited to CPU operation faults, memory faults, session faults, SQL faults and the like, and the system service faults include but are not limited to network faults, data transmission faults, data conversion faults, data output format faults, system authority faults and the like. In some embodiments of the present invention, a feature extraction rule may be established according to an actual application scenario of the fault identification method of the system and a fault occurring in the actual application of the system.
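
A feature extraction rule of this kind can be pictured as a mapping from fault type to the identifiers of the features to extract. The sketch below assumes the identifiers are feature names and that a preprocessed record is a flat dictionary; the fault type names and feature names are hypothetical.

```python
# Hypothetical rule: fault type -> identifiers (names) of features to extract.
FEATURE_EXTRACTION_RULE = {
    "cpu_fault": ["cpu_usage"],
    "memory_fault": ["memory_usage"],
    "session_fault": ["session_time", "session_state_code"],
}


def extract_target_features(record, rule=FEATURE_EXTRACTION_RULE):
    """Return {fault_type: {feature_name: value}} for features found in record."""
    extracted = {}
    for fault_type, names in rule.items():
        found = {name: record[name] for name in names if name in record}
        if found:
            extracted[fault_type] = found
    return extracted
```

Fields that no rule names, such as an unrelated counter in the record, are simply left out of the result.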

In some embodiments of the present invention, the feature extraction rule is established from history. The operation faults of the system during a period of historical time, and the historical structured operation data recorded when those faults occurred, are obtained. The historical structured operation data is preprocessed by the preprocessing method described above. For each operation fault that occurred during the period, the degree of correlation between each historical operation feature in the preprocessed data and the fault is calculated. Target historical operation features whose degree of correlation is greater than or equal to a preset correlation threshold are then selected, the feature type or feature name of each selected feature is associated with the type of the fault, and the feature extraction rule is established from these associations.

In some embodiments of the present invention, the correlation coefficient between each historical operating characteristic in the preprocessed historical structured operating data and the fault can be calculated to obtain the correlation degree between each historical operating characteristic in the preprocessed historical structured operating data and the fault.
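
The correlation-based construction of the rule can be sketched as below. The sketch assumes the Pearson coefficient as the correlation coefficient, compares its absolute value against the threshold, and represents each fault as a 0/1 flag per time step; the source does not name a specific coefficient, so these are illustrative choices, as is the threshold of 0.8.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def build_rule(history, fault_flags, threshold=0.8):
    """Select, per fault type, the features whose correlation meets the threshold.

    history: {feature_name: [values over time]}
    fault_flags: {fault_type: [0 or 1 per time step]}
    """
    rule = {}
    for fault_type, flags in fault_flags.items():
        rule[fault_type] = [
            name
            for name, values in history.items()
            if abs(pearson(values, flags)) >= threshold
        ]
    return rule
```

For example, a CPU usage series that rises whenever the CPU fault flag is set would be selected for that fault type, while a constant disk metric would not.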

In some embodiments of the present invention, the degree of correlation between each historical operating characteristic in the preprocessed historical structured operating data and the fault may also be determined by a preset correlation degree determination model. The correlation degree determination model may be a machine learning model or a neural network model.

Step 204: determining an identification result of the system according to the target operation features.

The identification result indicates whether the system has an operation fault. When the system has an operation fault, the identification result also includes the fault type and the fault location.

In some embodiments of the present invention, step 204 comprises: acquiring a prestored abnormal index corresponding to each fault type; obtaining the risk level of the target operation feature from the degree of similarity between the target operation feature and the abnormal index; comparing the risk level of the target operation feature with a preset risk level threshold; and obtaining the identification result of the system according to the comparison result. The abnormal index refers to the target operation feature corresponding to the fault type. In some embodiments of the present invention, the feature value of the target operation feature is compared with the feature value of the same feature in the abnormal index corresponding to each fault type, to obtain the degree of similarity between the target operation feature and the abnormal index. Preset risk level data is then queried to obtain the risk level corresponding to that degree of similarity, and the risk level is compared with the preset risk level threshold. If the risk level of the target operation feature is less than the preset risk level threshold, the identification result is that the system has no operation fault. If the risk level is greater than or equal to the preset risk level threshold, the identification result is that the system has an operation fault, and the fault is located according to the feature type of the target operation feature. The preset risk level data comprises a plurality of degrees of similarity and the risk level corresponding to each.
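
The similarity, risk level lookup, and threshold comparison described above can be sketched for a single target operation feature as follows. The abnormal index values, the relative-distance similarity measure, the risk level table, and the threshold of 2 are all assumptions for illustration; the source does not define how similarity is computed.

```python
# Hypothetical abnormal indexes: fault type -> {feature name: abnormal value}.
ABNORMAL_INDEX = {"cpu_fault": {"cpu_usage": 0.95}}

# Hypothetical risk level data: (minimum similarity, risk level), highest first.
RISK_LEVELS = [(0.9, 3), (0.7, 2), (0.5, 1), (0.0, 0)]


def similarity(value, abnormal):
    """Similarity in [0, 1]; 1.0 means the value equals the abnormal value."""
    scale = max(abs(abnormal), 1e-9)
    return max(0.0, 1.0 - abs(value - abnormal) / scale)


def risk_level(sim):
    """Look up the risk level for a degree of similarity in the preset table."""
    for minimum, level in RISK_LEVELS:
        if sim >= minimum:
            return level
    return 0


def identify(feature, value, fault_type, threshold=2):
    """Return (has_fault, fault_location) for one target operation feature."""
    abnormal = ABNORMAL_INDEX[fault_type][feature]
    level = risk_level(similarity(value, abnormal))
    has_fault = level >= threshold
    # The fault is located by the feature that triggered it.
    return has_fault, (feature if has_fault else None)
```

With these assumed values, a CPU usage of 0.93 is close to the abnormal value 0.95 and is reported as a fault located at `cpu_usage`, while a usage of 0.30 is not.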

In some embodiments of the present invention, when there are multiple target operation features, the feature types of the target operation features may be compared with the feature types included in the abnormal index corresponding to each fault type, to determine the target fault type whose abnormal index matches the feature types of the target operation features. It should be noted that more than one target fault type may match. For each target fault type, the target operation features are compared with the abnormal index corresponding to that fault type to determine the degree of similarity between each target operation feature and the abnormal index. The preset risk level data is queried to obtain the risk level corresponding to each degree of similarity, giving a risk level for each of the target operation features. The highest of these risk levels is set as the risk level of the target operation features for the target fault type. If that risk level is less than the preset risk level threshold, it is determined that the system has no fault of the target fault type. If it is greater than or equal to the preset risk level threshold, the identification result is that the system has a fault of the target fault type.

In some embodiments of the present invention, the average of the risk levels of the target operation features may instead be set as the risk level of the target operation features for the target fault type.
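
The two aggregation options described above, taking the highest per-feature risk level or the average, can be sketched as follows. The function names, the input layout, and the threshold of 2 are assumptions for the example; per-feature risk levels are taken as already computed.

```python
from statistics import mean


def fault_type_risk(levels, mode="max"):
    """Combine per-feature risk levels into one risk level for a fault type."""
    if not levels:
        raise ValueError("at least one risk level is required")
    values = list(levels.values())
    return max(values) if mode == "max" else mean(values)


def identify_faults(levels_by_fault, threshold=2, mode="max"):
    """Return the fault types whose combined risk level meets the threshold."""
    return [
        fault_type
        for fault_type, levels in levels_by_fault.items()
        if fault_type_risk(levels, mode) >= threshold
    ]
```

The choice matters in practice: with the highest level, one strongly abnormal feature is enough to report the fault type, whereas averaging requires the matched features to be abnormal on the whole.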

In some embodiments of the present invention, step 204 comprises: inputting the target operation features into a preset fault identification model, and performing fault identification on the target operation features through the fault identification model to obtain the identification result of the system. The fault identification model may be a machine learning model or a neural network model.

Step 205: if the identification result indicates that the system has an operation fault, outputting the identification result.

In some embodiments of the invention, if the identification result indicates that the system has an operation fault, the identification result is output in a preset visualization manner, for example as a graph or a table, and prompt information is sent to operation and maintenance personnel. In some embodiments of the invention, the prompt information can be sent to the operation and maintenance personnel by voice prompt, image prompt, email, or short message.

In some embodiments of the present invention, if the identification result indicates that there is no operation fault in the system, the steps 201 to 205 are continuously executed to detect the system.

According to the embodiment of the invention, data preprocessing and feature extraction are performed on the collected operation data of the system, which improves the usability of the data and in turn supports the reliability of the subsequent fault identification result. The identification result of the system is determined from the target operation features, so faults are identified more accurately from the operating characteristics of the system, and the impact on services caused by inaccurate fault identification is reduced.

In some embodiments of the present invention, in step 201, the operation data of the system may be obtained when an alarm of the system is received, and the operation data is then subjected to structured processing to obtain the structured operation data of the system. The alarm includes but is not limited to a CPU alarm, a memory alarm, a session alarm, a non-idle waiting event alarm, a network alarm, a data transmission alarm, and the like.

In some embodiments of the present invention, when an alarm of the system is received, steps 201 to 205 are executed to identify whether the system has a fault, and the fault is located when one is identified, so that operation and maintenance personnel can quickly resolve the system fault according to the fault location and the normal operation of the system is ensured.

In some embodiments of the present invention, to make fault identification timely, that is, to identify in advance faults that may arise during system operation and keep the system running normally, step 201 obtains historical operation data of the system over a historical time period, performs data simulation on the historical operation data to obtain predicted operation data for the next moment, and sets the predicted operation data as the operation data of the system. Steps 201 to 205 are then executed to predict whether the system will have a fault. When a fault is predicted, prompt information is sent to operation and maintenance personnel so that they can discover hidden dangers in system operation ahead of time and capture execution statements, code, and the like that are potentially erroneous or inefficient. Operation and maintenance personnel and developers can thus intervene early and eliminate the system fault in advance. The historical time period may be the past hour, the past day, the past week, the past month, and so on.

Specifically, the method for performing data simulation according to historical operating data comprises the following steps:

(1) Historical operating data of the system in a preset time period and current operating data at the current moment are obtained.

(2) The historical operating data and the current operating data are input into a preset data simulation model for data simulation to obtain the simulated operating data of the system at the moment following the current moment, and the simulated operating data is set as the operating data of the system.

Wherein, the preset time period can be one hour, twelve hours, one day, one week, one month, etc. The data simulation model may be a neural network model.

The operation data of the system, and particularly its operation and maintenance data, is correlated in time. The historical operation data can therefore be analyzed to obtain the association relationship among the historical operation data, and this association relationship can be used to perform data simulation on the current operation data at the current moment, obtaining the simulated operation data of the system at the moment following the current moment. In some embodiments of the present invention, an LSTM model may be used to analyze the temporal correlation of the historical operating data, so as to obtain the association relationship between the historical operating data.

In some embodiments of the present invention, after obtaining the association relationship between the historical operating data, the obtained association relationship between the historical operating data and the current operating data may be input to a preset data simulation model for data simulation, so as to obtain the simulated operating data of the system at the next time of the current time.

In some embodiments of the present invention, the preset data simulation model includes an LSTM layer and a data simulation layer. The historical operating data and the current operating data are input to the LSTM layer, which analyzes the temporal correlation of the historical operating data to obtain the association relationship between the historical operating data, and passes this association relationship together with the current operating data to the data simulation layer. The data simulation layer performs data simulation according to the association relationship and the current operating data to obtain the simulated operating data of the system at the moment following the current moment. A neural network, such as a prediction neural network, a GAN, or an LSTM, is deployed in the data simulation layer.
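The data flow described above (historical data plus the current value in, a simulated next-moment value out) can be illustrated with a minimal sketch. It deliberately swaps the LSTM-based model for a first-order autoregressive least-squares fit; this substitution is an assumption made only to keep the example self-contained, and `fit_ar1` and `simulate_next` are hypothetical names that do not appear in the embodiment.

```python
def fit_ar1(history):
    """Least-squares fit of x[t+1] ~ a * x[t] + b over a historical series.

    Stand-in for the temporal association that the LSTM layer would learn.
    Requires at least two historical values.
    """
    xs, ys = history[:-1], history[1:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # Flat history: the best constant prediction is the mean.
        return 0.0, mean_y
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def simulate_next(history, current):
    """Simulate the value at the moment following the current moment."""
    a, b = fit_ar1(history)
    return a * current + b


# Example: CPU usage rising by 2 points per sample.
predicted = simulate_next([10, 12, 14, 16, 18], current=20)
```

The simulated value would then be treated as the operation data of the system and passed through the same identification steps as real data.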

In some embodiments of the present invention, after the operation data of the system is obtained, it is structured according to step 201 to obtain the structured operation data of the system. In order to improve the timeliness of fault identification, preprocessing such as data deduplication and data dimension reduction needs to be performed on the structured operation data to reduce its data volume. Specifically, the data preprocessing method comprises the following steps a1 to a4:

Step a1: outlier screening and missing value filling are carried out on the structured operation data to obtain first operation data.

The first operation data refers to structured operation data after being screened through outliers and filled with missing values.

In some embodiments of the present invention, outlier screening comprises clustering the structured operating data, determining outliers in the structured operating data from the clustering results, and deleting the outliers from the structured operating data. The structured operating data can be clustered by K-means.
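As an illustration of clustering-based outlier screening, the sketch below runs a plain one-dimensional K-means (Lloyd's algorithm) and drops every value that falls in a cluster holding less than a minimum share of the data. The share-based rule for deciding which clusters are outliers, the parameter `min_share`, and the function names are assumptions for the example; the embodiment does not specify how outliers are derived from the clustering result.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd's algorithm on scalar values; returns one label per value."""
    ordered = sorted(values)
    # Deterministic initialisation: spread the k centres over the sorted data.
    centres = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centre for each value.
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels


def screen_outliers(values, k=2, min_share=0.2):
    """Delete values belonging to clusters smaller than min_share of the data."""
    labels = kmeans_1d(values, k)
    counts = {c: labels.count(c) for c in set(labels)}
    return [v for v, lab in zip(values, labels)
            if counts[lab] / len(values) >= min_share]


cpu_usage = [50, 52, 51, 49, 53, 50, 400]
cleaned = screen_outliers(cpu_usage)
```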

In some embodiments of the present invention, missing value filling can be performed in a variety of ways, including, for example:

(1) The default value can be written into the position of the missing value for missing value filling.

(2) The data type of the missing value can be obtained, all the structured running data corresponding to the data type is obtained, the mean value of all the structured running data corresponding to the data type is calculated, and the mean value is written into the position of the missing value to fill the missing value.

(3) The data type of the missing value can be obtained, all the structured operation data corresponding to the data type are obtained, the median of all the structured operation data corresponding to the data type is calculated, and the median is written into the position of the missing value to fill the missing value.

(4) The data type where the missing value is located can be obtained, all the structured running data corresponding to the data type is obtained, the mode of all the structured running data corresponding to the data type is calculated, and the mode is written into the position where the missing value is located to perform missing value filling.

It should be noted that the missing value filling methods above are only exemplary and do not constitute a limitation on the fault identification method of the system provided in the embodiment of the present invention; the missing value filling method may be determined according to the actual application scenario of the fault identification method of the system.
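The four filling options listed above can be sketched with Python's standard `statistics` module. The sketch assumes that a missing value is represented by `None` and that all values of one data type are held in one list; the function name `fill_missing` and the `strategy` parameter are hypothetical.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy="mean", default=0):
    """Fill None entries with a default value, or the mean, median or mode
    of the values that are present for this data type."""
    present = [v for v in values if v is not None]
    if strategy == "default" or not present:
        fill = default
    elif strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "mode":
        fill = mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


session_counts = [1, None, 3, 5]
filled = fill_missing(session_counts, "mean")
```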

In some embodiments of the present invention, the order in which outlier screening and missing value filling are executed is not limited. For example, outlier screening may be performed on the structured operation data first, followed by missing value filling on the screened data; alternatively, missing value filling may be performed first, followed by outlier screening on the filled data.

In some embodiments of the invention, step a1 comprises: deduplicating the structured operation data, and performing outlier screening and missing value filling on the deduplicated structured operation data to obtain the first operation data.
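A minimal sketch of the deduplication step follows, assuming each structured record is a flat dictionary with string field names and hashable values; `deduplicate` is a hypothetical helper name.

```python
def deduplicate(records):
    """Drop records that repeat an earlier record exactly, keeping first occurrences."""
    seen = set()
    unique = []
    for record in records:
        # A sorted tuple of items makes the key independent of field order.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


rows = [
    {"host": "db-01", "cpu_usage": 0.93},
    {"cpu_usage": 0.93, "host": "db-01"},  # same content, different field order
    {"host": "db-02", "cpu_usage": 0.40},
]
unique_rows = deduplicate(rows)
```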

Step a2: variable format conversion is performed on the first operation data according to the data format of the first operation data to obtain second operation data.

Data formats include, but are not limited to, integer, boolean, and the like. In some embodiments of the present invention, variable format conversion includes, but is not limited to, boolean data conversion to integer, and the like.

In some embodiments of the present invention, in consideration of a possible difference in a data format of each piece of data in the first operation data, in order to facilitate subsequent dimension reduction and data conversion of the data, the data format of the first operation data is adjusted to a preset data format.

It should be noted that, the embodiment of the present invention does not limit the specific type of the variable format conversion, and may set the variable format conversion according to the data format of the first operating data in the actual application scenario and the preset data format.
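As an illustration of variable format conversion, the sketch below assumes that the preset data format is a floating-point number, so that Boolean values pass through the integers 0 and 1 and numeric strings are parsed. The choice of `float` as the preset format and the name `to_preset_format` are assumptions for the example only.

```python
def to_preset_format(record, target=float):
    """Convert every field of a record to one preset numeric format."""
    converted = {}
    for field, value in record.items():
        if isinstance(value, bool):
            # Boolean -> integer (True -> 1, False -> 0), then to the preset format.
            converted[field] = target(int(value))
        elif isinstance(value, (int, float)):
            converted[field] = target(value)
        elif isinstance(value, str):
            converted[field] = target(value.strip())
        else:
            raise TypeError(f"unsupported format for field {field!r}")
    return converted


second = to_preset_format({"cpu_alarm": True, "sessions": 12, "load": " 0.75"})
```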

Step a3: data transformation is performed on the second operation data to obtain third operation data.

Data transformation includes data smoothing, data clustering, and data normalization.

The data smoothing is to remove noise in the second operation data and discretize continuous data in the second operation data. In some embodiments of the invention, data smoothing may be performed by binning, clustering, and regression methods.
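Of the smoothing methods mentioned, binning is the simplest to sketch. The example below assumes equal-frequency bins of a fixed size and replaces each value with the mean of its bin, which both suppresses noise and discretizes the data; `smooth_by_bin_means` and `bin_size` are hypothetical names.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values into consecutive bins of bin_size and replace each value
    with the mean of its bin, preserving the original order of the list."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = [0.0] * len(values)
    for start in range(0, len(order), bin_size):
        indices = order[start:start + bin_size]
        bin_mean = sum(values[i] for i in indices) / len(indices)
        for i in indices:
            smoothed[i] = bin_mean
    return smoothed


smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)
```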

Data clustering summarizes the second operation data.

In some embodiments of the invention, data normalization may be performed by the Min-Max normalization method, which converts the original second operation data into the interval [0, 1]; by the Z-Score normalization method; or by the decimal scaling normalization method, which converts the original second operation data into the interval [-1, 1].
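The three normalization methods can be sketched as follows. The function names are hypothetical, the population standard deviation is assumed for Z-Score, and constant columns are mapped to zeros by convention in this sketch.

```python
from statistics import mean, pstdev


def min_max(values):
    """Min-Max normalization: rescale linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Z-Score normalization: zero mean, unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Decimal scaling: divide by the smallest power of ten that brings
    every absolute value below 1."""
    peak = max(abs(v) for v in values)
    j = 0
    while peak / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


scaled = min_max([10, 20, 30])
```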

Step a4: data dimension reduction is performed on the third operation data to obtain the preprocessed structured operation data.

In some embodiments of the present invention, data dimension reduction may be performed on the third operation data by principal component analysis, by kernel principal component analysis, or by singular value decomposition to obtain the preprocessed structured operation data.
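To make the principal component analysis option concrete, the sketch below reduces multi-column data to a single dimension by projecting onto the first principal component, found by power iteration on the sample covariance matrix. Keeping only one component, using power iteration instead of a full eigendecomposition, and the function names are all assumptions chosen to keep the example dependency-free; at least two rows are required.

```python
def first_principal_component(rows, iterations=200):
    """Return (column means, unit vector along the dominant covariance direction)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration; the all-ones start is assumed not to be orthogonal
    # to the dominant direction.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return means, vec


def reduce_to_1d(rows):
    """Project each centred row onto the first principal component."""
    means, vec = first_principal_component(rows)
    return [sum((row[j] - means[j]) * vec[j] for j in range(len(vec)))
            for row in rows]


reduced = reduce_to_1d([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```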

After the structured operation data is preprocessed to obtain the preprocessed structured operation data, the field information to be extracted is determined according to a preset feature extraction rule, and the target operation features of the system are obtained by extracting, from the preprocessed structured operation data, the target field information corresponding to the field information to be extracted. Specifically, this comprises the following steps:

(1) The field information to be extracted is determined according to a preset feature extraction rule, wherein the feature extraction rule comprises field information of at least one feature.

(2) Target field information matching the field information to be extracted is extracted from the preprocessed structured operation data.

(3) The extracted target field information is set as the target operation features of the system.

In some embodiments of the invention, the feature extraction rule comprises field information of at least one feature. In some embodiments of the present invention, the feature extraction rule includes a plurality of fault types and field information for each fault type corresponding to a feature to be extracted.

The field information to be extracted refers to fields that need to be extracted from the preprocessed structured operation data, such as CPU usage, memory occupancy, session time, service data, and the like.
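The three extraction steps can be sketched with a rule that maps each fault type to the fields of the features to be extracted. The rule contents, field names, and function names below are illustrative assumptions, not values taken from the embodiment.

```python
# Hypothetical feature extraction rule: fault type -> fields to extract.
FEATURE_EXTRACTION_RULE = {
    "cpu_fault": ["cpu_usage"],
    "memory_fault": ["memory_occupancy"],
    "session_fault": ["session_time", "cpu_usage"],
}


def fields_to_extract(rule):
    """Step (1): collect the distinct fields named by the rule, in rule order."""
    fields = []
    for rule_fields in rule.values():
        for field in rule_fields:
            if field not in fields:
                fields.append(field)
    return fields


def extract_target_features(record, rule):
    """Steps (2) and (3): keep only matching fields as the target operation features."""
    return {f: record[f] for f in fields_to_extract(rule) if f in record}


record = {"cpu_usage": 0.93, "memory_occupancy": 0.41, "hostname": "db-01"}
features = extract_target_features(record, FEATURE_EXTRACTION_RULE)
```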

In some embodiments of the present invention, after the target operation characteristics of the system are obtained, the identification result of the system may be determined according to the target operation characteristics, following the fault identification method shown in step 203.

In some embodiments of the present invention, after the target operation features of the system are obtained, the target operation features may be further input to the trained recognition model, and the target features of the system are recognized by the recognition model to obtain a recognition result.

The recognition model may be a recognition model based on machine learning, such as a recognition model based on a logistic regression model, a random forest model, a decision tree model, an SVM model, a k-nearest neighbor model, or a naive Bayes model. It may also be a recognition model based on a neural network, such as a recognition model based on a Convolutional Neural Network (CNN), a Deconvolutional Network (DN), a Deep Neural Network (DNN), a Deep Convolutional Inverse Graphics Network (DCIGN), a Region-based Convolutional Network (RCNN), a Faster Region-based Convolutional Network (Faster RCNN), or a Bidirectional Encoder Representations from Transformers (BERT) model.

Conventional fault identification generally relies on problem points being discovered manually or during system operation, with third-party software assisting the analysis. The identification capability of such software is narrow: each tool can analyze only one type or one class of problem and cannot cover the overall situation. Illustratively, as shown in fig. 3, fig. 3 is a schematic structural diagram of a recognition model provided by an embodiment of the present invention, where the recognition model includes an input layer 301, a recognition layer 302, and an output layer 303.

Wherein the recognition layer 302 comprises a classification unit and a plurality of parallel recognition units; inputting target operation characteristics of a system into an input layer 301 of a recognition model, inputting the input target operation characteristics into a classification unit of a recognition layer 302 by the input layer 301, classifying the input target operation characteristics by the classification unit according to prestored fault type data to obtain target operation characteristics corresponding to each fault type, and inputting the target operation characteristics corresponding to each fault type into a recognition unit corresponding to the fault type; each recognition unit performs feature recognition on the target operation features corresponding to the input fault type to obtain a recognition result of the fault type, the recognition result of the fault type is input into the output layer 303, and the output layer 303 summarizes the recognition result of the fault type input by each recognition unit to obtain a recognition result of the system.
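The routing performed by the recognition layer can be sketched as follows: a classification unit splits the target operation features by fault type, one recognition unit per fault type evaluates its share, and the output layer summarizes the per-type results. The fault types, field names, and in particular the fixed-threshold lambdas standing in for trained recognition units are assumptions used only to show the structure.

```python
# Hypothetical prestored fault type data: fault type -> relevant feature fields.
FAULT_TYPE_FIELDS = {
    "cpu": ["cpu_usage"],
    "memory": ["memory_occupancy"],
}

# Placeholder recognition units; a real unit would be a trained model.
RECOGNITION_UNITS = {
    "cpu": lambda feats: feats.get("cpu_usage", 0.0) > 0.9,
    "memory": lambda feats: feats.get("memory_occupancy", 0.0) > 0.85,
}


def classify(features):
    """Classification unit: group the input features by fault type."""
    return {fault_type: {f: features[f] for f in fields if f in features}
            for fault_type, fields in FAULT_TYPE_FIELDS.items()}


def recognize(features):
    """Recognition layer plus output layer: run each unit, then summarize."""
    per_type = classify(features)
    results = {ft: RECOGNITION_UNITS[ft](feats) for ft, feats in per_type.items()}
    return {
        "has_fault": any(results.values()),
        "fault_types": [ft for ft, hit in results.items() if hit],
    }


result = recognize({"cpu_usage": 0.95, "memory_occupancy": 0.40})
```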

In some embodiments of the present invention, the training process of the recognition model includes steps b1 to b5:

Step b1: historical operating data of the system is acquired, and structured processing is performed on the historical operating data to obtain the historical structured operating data of the system.

In some embodiments of the present invention, the historical operating data may be structured according to step 201, so as to obtain the historical structured operating data of the system.

Step b2: the historical structured operation data is preprocessed to obtain preprocessed historical structured operation data.

In some embodiments of the present invention, the historical structured operation data may be preprocessed according to the above preprocessing method for structured operation data, so as to obtain preprocessed historical structured operation data.

Step b3: feature extraction and feature screening are performed on the preprocessed historical structured operation data to obtain historical target operation features corresponding to the historical operation data.

In some embodiments of the invention, step b3 comprises:

(1) A real identification result corresponding to the preprocessed historical structured operation data is acquired.

(2) Field information corresponding to the system operation parameters, user input parameters, equipment operation and maintenance parameters, and database operation parameters of the preprocessed historical structured operation data is extracted to obtain the historical operation characteristics corresponding to the preprocessed historical structured operation data.

(3) Binning is performed according to the real identification result and the historical operation characteristics corresponding to the preprocessed historical structured operation data to obtain the binned historical operation characteristics.

In some embodiments of the invention, in order to improve the accuracy of the subsequent fault identification model and reduce the risk of overfitting, the historical operating characteristics corresponding to the preprocessed historical structured operating data are binned. The historical operating characteristics can be binned by a supervised method such as chi-square binning, or by an unsupervised method such as equal-width (equidistant) binning or equal-frequency binning.
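The two unsupervised binning methods can be sketched as below; supervised chi-square binning is not shown. Each function returns a bin index per value, and both function names are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Equal-width (equidistant) binning: split [min, max] into n_bins intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Equal-frequency binning: each bin receives roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * n_bins // len(values), n_bins - 1)
    return bins


width_bins = equal_width_bins([0, 5, 10], 2)
freq_bins = equal_frequency_bins([10, 1, 7, 3], 2)
```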

In some embodiments of the present invention, after the historical operating characteristics corresponding to the preprocessed historical structured operating data are binned, each historical operating characteristic X_k (k = 1, 2, …, M, where M is the number of historical operating characteristics) is divided into n bins. For the sample characteristics in the i-th bin (i = 1, 2, …, n), WOE conversion is performed using the real identification results of the preprocessed historical structured operating data:

$$WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln\left(\frac{\#y_i / \#y_t}{\#n_i / \#n_t}\right)$$

so that each bin is WOE-encoded after binning. Here py_i is the ratio of the number of historical operating characteristics with faults in the bin to the number of all historical operating characteristics with faults in the preprocessed historical structured operating data; pn_i is the ratio of the number of historical operating characteristics without faults in the bin to the number of all historical operating characteristics without faults in the preprocessed historical structured operating data; #y_i is the number of historical operating characteristics with faults in the bin; #n_i is the number of historical operating characteristics without faults in the bin; #y_t is the number of all historical operating characteristics with faults in the preprocessed historical structured operating data; and #n_t is the number of all historical operating characteristics without faults in the preprocessed historical structured operating data.

(4) The information value of the binned historical operating characteristics is determined, and the historical operating characteristics whose information values are greater than the information threshold value are set as historical target operating characteristics.

In some embodiments of the invention, after the WOE code of each bin is obtained, the information value of the historical operating characteristics in the i-th bin is calculated as IV_i = (py_i - pn_i) * WOE_i, and the IV_i values of all bins are added to obtain the IV value of the whole historical operating characteristic X_k:

$$IV = \sum_{i=1}^{n} IV_i$$

where n is the number of bins into which the historical operating characteristic X_k is divided. A preset number of target historical operating characteristics can be selected in descending order of the IV values of the historical operating characteristics. In some embodiments of the invention, target historical operating characteristics whose information values are greater than an information value threshold may also be selected.

(5) A feature extraction rule is obtained according to the field information corresponding to the historical target operation features.

Step b4: a sample data set is established according to the historical target operation characteristics corresponding to the historical operation data and the real identification results corresponding to the historical operation data; the sample data set comprises the historical target operation characteristics corresponding to the historical operation data and the real identification results corresponding to the historical operation data.

Step b5: the sample data set is input into the recognition model for model training to obtain the trained recognition model.

In some embodiments of the present invention, a sample data set is input to an identification model to obtain a test result corresponding to the sample data set, a training loss between the test result corresponding to the sample data set and a real identification result corresponding to the sample data set is calculated through a preset loss function, model parameters of the identification model are iteratively adjusted through minimizing the training loss, and a trained identification model is obtained when the identification model meets a preset convergence condition. The preset convergence condition may be that the training loss is less than or equal to a preset threshold, or that the number of iterations is greater than or equal to a preset number threshold.
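The training loop and its two convergence conditions can be sketched with a logistic regression model trained by gradient descent on a cross-entropy loss. The choice of model, loss function, learning rate, and thresholds are assumptions for the example; the embodiment only requires a preset loss function and a preset convergence condition.

```python
import math


def predict(weights, bias, row):
    """Fault probability from a logistic regression model."""
    z = sum(w * x for w, x in zip(weights, row)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, lr=0.5, loss_threshold=0.1, max_iterations=5000):
    """Iteratively adjust parameters until the training loss is at or below
    loss_threshold, or the iteration count reaches max_iterations."""
    n, d = len(samples), len(samples[0])
    weights, bias = [0.0] * d, 0.0
    loss, iteration = float("inf"), 0
    while iteration < max_iterations:
        probs = [predict(weights, bias, row) for row in samples]
        # Clamp probabilities so the logarithm stays finite.
        clipped = [min(max(p, 1e-12), 1.0 - 1e-12) for p in probs]
        loss = -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                    for p, y in zip(clipped, labels)) / n
        if loss <= loss_threshold:
            break
        for j in range(d):
            grad = sum((p - y) * row[j]
                       for p, y, row in zip(probs, labels, samples)) / n
            weights[j] -= lr * grad
        bias -= lr * sum(p - y for p, y in zip(probs, labels)) / n
        iteration += 1
    return weights, bias, loss, iteration


samples = [[0.1], [0.2], [0.8], [0.9]]   # one feature, e.g. normalized CPU usage
labels = [0, 0, 1, 1]                    # real identification results
weights, bias, loss, iterations = train(samples, labels)
```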

In some embodiments of the present invention, after obtaining the trained recognition model, the trained recognition model may be deployed to a server where the system is located, the operation data of the system is acquired in real time, step 201 to step 203 are executed to extract target operation features from the preprocessed structured operation data, the trained recognition model is invoked to perform fault recognition according to the target operation features, so as to obtain a recognition result of the system, and when the recognition result represents that the system has an operation fault, the recognition result is output.

In some embodiments of the present invention, when the obtained identification result indicates that the system has an operation fault, the identification result may be output in a preset visualization manner so that operation and maintenance personnel can complete operation and maintenance tasks intuitively and efficiently. Specifically, the method includes: if the identification result represents that the system has an operation fault, outputting the identification result in a preset visualization manner and sending prompt information to operation and maintenance personnel.

According to the fault identification method of the system, provided by the embodiment of the invention, the data preprocessing and the feature extraction are carried out by acquiring the operation data of the system, the availability of the data is improved, the reliability of the subsequent fault identification result is further ensured, the identification result of the system is determined and obtained by the target operation feature, and compared with the fault identification method based on the early warning threshold value, the fault identification can be more accurately carried out by the operation feature of the system, and the influence on the service due to the inaccuracy of the fault identification is reduced.

In order to better implement the fault identification method of the system provided by the embodiment of the present invention, the embodiment of the present invention provides a fault identification system, as shown in fig. 4, fig. 4 is a schematic structural diagram of the fault identification system provided by the embodiment of the present invention, and the fault identification system may be deployed in a server where the system is located, and collects operation data of the system to perform fault identification; the fault identification system may be deployed in a computing device that monitors the system.

As shown in fig. 4, the fault identification system includes a big data platform S1, a fault identification layer S2, and a visualization layer S3;

the big data platform S1 is used for acquiring operation data of the system, performing structured processing on the operation data to obtain structured operation data of the system, preprocessing the structured operation data to obtain preprocessed structured operation data, and extracting target operation features from the preprocessed structured operation data according to a preset feature extraction rule;

the fault identification layer S2 is used for determining and obtaining an identification result of the system according to the target operation characteristics;

and the visualization layer S3 is used for outputting the identification result if the identification result represents that the system has the operation fault.

In some embodiments of the invention, big data platform S1 comprises:

the characteristic extraction unit is used for determining field information to be extracted according to a preset characteristic extraction rule; the feature extraction rule comprises field information of at least one feature; extracting target field information matched with the field information to be extracted from the preprocessed structured running data; and setting the extracted target field information as the target operation characteristics of the system.

In some embodiments of the invention, big data platform S1 comprises:

the preprocessing unit is used for carrying out outlier screening and missing value filling on the structured operation data to obtain first operation data; performing variable format conversion and coding processing on the first operating data according to the data format and the data content of the first operating data to obtain second operating data; performing data transformation on the second operation data to obtain third operation data; performing data dimension reduction on the third operation data to obtain preprocessed structured operation data; the data transformation includes one or more of data smoothing, data clustering, and data normalization.

In some embodiments of the invention, big data platform S1 comprises:

the data simulation unit is used for acquiring historical operating data of the system in a preset time period and current operating data at the current moment; and inputting the historical operating data and the current operating data into a preset data simulation model for data simulation to obtain the simulated operating data of the system at the next moment of the current moment, and setting the simulated operating data as the operating data of the system.

In some embodiments of the present invention, the visualization layer S3 is configured to output the recognition result in a preset visualization manner and send a prompt message to an operation and maintenance worker if the recognition result indicates that the system has an operation failure.

In some embodiments of the present invention, the fault identification layer S2 is configured to input the target operation characteristics to a trained identification model, and identify the target characteristics of the system through the identification model to obtain an identification result.

In some embodiments of the invention, the fault identification system includes a fault pattern library S4 and a training layer S5;

the big data platform S1 is used for acquiring historical operating data of the system, performing structured processing on the historical operating data to obtain historical structured operating data of the system, and preprocessing the historical structured operating data to obtain preprocessed historical structured operating data;

the fault mode library S4 is used for carrying out feature extraction and feature screening on the preprocessed historical structured operation data to obtain historical target operation features corresponding to the historical operation data;

the training layer S5 is used for establishing a sample data set according to historical target operation characteristics corresponding to historical operation data in the fault mode library S4 and real identification results corresponding to the historical operation data; the sample data set comprises historical target operation features corresponding to historical operation data and real recognition results corresponding to the historical operation data, and the sample data set is input to the recognition model to conduct model training to obtain a trained recognition model.

In some embodiments of the invention, the failure mode library S4 is configured to:

acquiring a real identification result corresponding to the preprocessed historical structured operation data;

extracting field information corresponding to system operation parameters, user input parameters, equipment operation and maintenance parameters and database operation parameters corresponding to the preprocessed historical structured operation data to obtain historical operation characteristics corresponding to the preprocessed historical structured operation data;

binning according to the real identification result corresponding to the preprocessed historical structured operation data and the historical operation characteristics corresponding to the preprocessed historical structured operation data to obtain the binned historical operation characteristics;

determining the information value of the binned historical operating characteristics, and setting the historical operating characteristics of which the information values are greater than the information threshold value as historical target operating characteristics;

and obtaining a feature extraction rule according to the field information corresponding to the historical target operation features.

The fault recognition system provided by the embodiment of the invention performs data preprocessing and feature extraction on the acquired operation data of the system, which improves the availability of the data and further ensures the reliability of the subsequent fault recognition result. The recognition result of the system is determined from the target operation features, so that fault recognition can be performed more accurately through the operation features of the system, and the influence on the service due to inaccurate fault recognition is reduced.

An embodiment of the present invention further provides an electronic device, as shown in fig. 5, which shows a schematic structural diagram of the electronic device according to the embodiment of the present invention, specifically:

the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 5 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:

the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.

The electronic device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.

The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:

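acquiring operation data of the system, and performing structured processing on the operation data to obtain structured operation data of the system;

preprocessing the structured operation data to obtain preprocessed structured operation data;

extracting target operation features from the preprocessed structured operation data according to a preset feature extraction rule;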

determining to obtain a system identification result according to the target operation characteristics;

and if the identification result represents that the system has operation faults, outputting the identification result.
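The functions listed above can be sketched as a small pipeline. This is only an illustrative sketch, not the implementation of the embodiment: the JSON log format, the field names `cpu_usage` and `error_count`, the zero fill for missing values, and the threshold rule used in place of the recognition step are all assumptions made for the example.

```python
import json
from typing import Any

# Hypothetical feature extraction rule: the fields used as target operation features.
FEATURE_FIELDS = ["cpu_usage", "error_count"]


def structure(raw_lines: list[str]) -> list[dict[str, Any]]:
    """Structured processing: parse raw JSON log lines into records."""
    return [json.loads(line) for line in raw_lines]


def preprocess(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Preprocessing (simplified): fill missing feature fields with 0.0."""
    cleaned = []
    for rec in records:
        row = dict(rec)
        for field in FEATURE_FIELDS:
            if row.get(field) is None:
                row[field] = 0.0
        cleaned.append(row)
    return cleaned


def extract_features(records: list[dict[str, Any]]) -> list[list[float]]:
    """Feature extraction: keep only the fields named by the rule."""
    return [[float(rec[f]) for f in FEATURE_FIELDS] for rec in records]


def identify(features: list[list[float]]) -> list[bool]:
    """Stand-in for the recognition step: an assumed threshold rule."""
    return [cpu > 0.9 or errors > 10 for cpu, errors in features]


def run_pipeline(raw_lines: list[str]) -> list[int]:
    """Return the indices of records whose result indicates a fault."""
    features = extract_features(preprocess(structure(raw_lines)))
    results = identify(features)
    # Only results that represent an operation fault are output.
    return [i for i, faulty in enumerate(results) if faulty]
```

Calling `run_pipeline` on a list of raw log lines returns the positions of the records flagged as faulty; records judged normal produce no output, which mirrors the conditional output step.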

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.

To this end, an embodiment of the present invention provides a storage medium in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps of any of the system fault identification methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:

The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.

The storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.

Since the instructions stored in the storage medium can execute the steps of any of the system fault identification methods provided by the embodiments of the present invention, they can achieve the beneficial effects of any of those methods. These effects are described in detail in the foregoing embodiments and are not repeated here.

The system fault identification method, system, electronic device and storage medium provided by the embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the description of the embodiments is only intended to help in understanding the method of the present invention and its core ideas. Meanwhile, those skilled in the art may, according to the idea of the present invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for fault identification of a system, the method comprising:

2. The method for fault identification of a system according to claim 1, wherein the extracting target operation features from the preprocessed structured operation data according to preset feature extraction rules comprises:

determining field information to be extracted according to a preset feature extraction rule; the feature extraction rule comprises field information of at least one feature;

extracting target field information matched with the field information to be extracted from the preprocessed structured running data;

and setting the extracted target field information as the target operation characteristics of the system.
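The rule-driven extraction in claim 2 can be illustrated with a short sketch. The rule layout (a list of entries with `feature` and `field` keys) and the field names are hypothetical and chosen only for this example; the claim itself only requires that the rule carry field information for at least one feature.

```python
from typing import Any

# Hypothetical feature extraction rule: each feature names the field it needs.
FEATURE_RULE = [
    {"feature": "cpu", "field": "cpu_usage"},
    {"feature": "latency", "field": "resp_ms"},
]


def fields_to_extract(rule: list[dict[str, str]]) -> list[str]:
    """Determine the field information to be extracted from the rule."""
    return [entry["field"] for entry in rule]


def extract_target_features(
    record: dict[str, Any], rule: list[dict[str, str]]
) -> dict[str, Any]:
    """Keep the fields of one preprocessed record that match the rule."""
    wanted = fields_to_extract(rule)
    # Fields absent from the record are skipped rather than invented.
    return {field: record[field] for field in wanted if field in record}
```

The returned dictionary plays the role of the target operation features for one record; fields not named by the rule, such as a host name, are dropped.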

3. The method for fault identification of a system according to claim 1, wherein the preprocessing the structured operating data to obtain preprocessed structured operating data comprises:

performing outlier screening and missing value filling on the structured operation data to obtain first operation data;

performing variable format conversion and coding processing on the first operating data according to the data format and the data content of the first operating data to obtain second operating data;

performing data transformation on the second operation data to obtain third operation data; the data transformation comprises one or more of data smoothing, data clustering, and data normalization;

and performing data dimension reduction on the third operation data to obtain the preprocessed structured operation data.
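The four preprocessing stages of claim 3 can be sketched on column data as follows. This is a simplified illustration under stated assumptions: outliers are detected with an assumed valid range of 0 to 100, missing values are filled with the median, coding is a plain integer encoding, the data transformation is min-max normalization, and the dimension reduction is replaced by the much cruder step of dropping constant columns. The column names `cpu`, `status` and `region` are hypothetical.

```python
from statistics import median


def screen_and_fill(values, low, high):
    """Stage 1: treat out-of-range values as missing, then fill with the median."""
    valid = [v for v in values if v is not None and low <= v <= high]
    fill = median(valid)
    return [v if v is not None and low <= v <= high else fill for v in values]


def encode(labels):
    """Stage 2: map categorical strings to integer codes (sorted order)."""
    codes = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [float(codes[label]) for label in labels]


def min_max(values):
    """Stage 3: min-max normalization to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def drop_constant(columns):
    """Stage 4: crude dimension reduction that drops zero-variance columns."""
    return {name: col for name, col in columns.items() if len(set(col)) > 1}


def preprocess(cpu, status, region):
    """Run the four stages in order on three hypothetical columns."""
    first = screen_and_fill(cpu, 0.0, 100.0)
    second = {"cpu": first, "status": encode(status), "region": encode(region)}
    third = {name: min_max(col) for name, col in second.items()}
    return drop_constant(third)
```

In a real deployment the last stage would more likely be a projection method such as principal component analysis; dropping constant columns is used here only to keep the sketch dependency-free.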

4. The method for fault identification of a system according to claim 1, wherein said obtaining operational data of a system comprises:

acquiring historical operating data of the system in a preset time period and current operating data at the current moment;

and inputting the historical operating data and the current operating data into a preset data simulation model for data simulation to obtain simulated operating data of the system at the next moment of the current moment, and setting the simulated operating data as operating data of the system.
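As an illustration of claim 4, the sketch below derives simulated operation data for the next moment from historical and current values. The claim does not specify the preset data simulation model, so a simple average-step extrapolation is assumed here purely as a stand-in; the function name `simulate_next` is hypothetical.

```python
def simulate_next(history: list[float], current: float) -> float:
    """Estimate the value at the next moment by extending the average step.

    history: values observed over the preset time period, oldest first.
    current: value observed at the current moment.
    """
    series = list(history) + [current]
    if len(series) < 2:
        # With no history there is no trend to extend.
        return current
    steps = [b - a for a, b in zip(series, series[1:])]
    return current + sum(steps) / len(steps)
```

The simulated value would then be treated as the operation data of the system and passed through structuring, preprocessing and feature extraction, so that a fault can be flagged before it actually occurs.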

5. The method for identifying a fault in a system according to any one of claims 1 to 4, wherein the determining of the identification result of the system according to the target operation characteristic includes:

and inputting the target operation characteristics into a trained recognition model, and recognizing the target operation characteristics of the system through the recognition model to obtain a recognition result.

6. The method for fault identification of a system according to claim 5, wherein prior to the step of inputting the target operation characteristics into the trained recognition model, the method comprises:

obtaining historical operating data of a system, and performing structured processing on the historical operating data to obtain historical structured operation data of the system;

preprocessing the historical structured operation data to obtain preprocessed historical structured operation data;

performing feature extraction and feature screening on the preprocessed historical structured operation data to obtain historical target operation features corresponding to the historical operation data;

establishing a sample data set according to historical target operation characteristics corresponding to the historical operation data and real identification results corresponding to the historical operation data; the sample data set comprises historical target operation characteristics corresponding to the historical operation data and real identification results corresponding to the historical operation data;

and inputting the sample data set into a recognition model for model training to obtain a trained recognition model.
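The training flow of claim 6 and the recognition step of claim 5 can be sketched together. The claims do not name a model type, so a nearest-centroid classifier is assumed here as a minimal stand-in; the labels `normal` and `fault` and the function names are hypothetical.

```python
from math import dist


def build_sample_set(features, labels):
    """Pair historical target operation features with real identification results."""
    return list(zip(features, labels))


def train(samples):
    """Train the stand-in model: one mean feature vector (centroid) per label."""
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def recognize(model, vec):
    """Recognition: return the label whose centroid is closest to the features."""
    return min(model, key=lambda label: dist(model[label], vec))
```

A usage example: build the sample set from historical features and their real results, call `train` once, then call `recognize` on the target operation features extracted from new operation data.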

7. The system fault identification method according to claim 6, wherein the performing feature extraction and feature screening on the preprocessed historical structured operation data to obtain the historical target operation features corresponding to the historical operation data comprises:

performing binning according to a real identification result corresponding to the preprocessed historical structured operation data and historical operation characteristics corresponding to the preprocessed historical structured operation data to obtain binned historical operation characteristics;

8. A fault identification system, characterized in that the system comprises a big data platform, a fault identification layer and a visualization layer;

9. An electronic device comprising a memory and a processor; the memory stores an application program, and the processor is configured to execute the application program in the memory to perform the operations of the fault identification method of the system according to any one of claims 1 to 7.

10. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the method for fault identification of a system according to any one of claims 1 to 7.