CN111967509A - Method and device for processing and detecting data acquired by industrial equipment

Info

Publication number: CN111967509A
Application number: CN202010760728.XA
Authority: CN
Inventors: 刘晓凯; 许方敏; 徐思佳; 常锋伟
Original assignee: Beijing Cyber Xingtong Technology Co ltd
Current assignee: Beijing Cyber Xingtong Technology Co ltd
Priority date: 2020-07-31
Filing date: 2020-07-31
Publication date: 2020-11-20

Abstract

The invention provides a method and a device for processing and detecting data acquired by industrial equipment. The method comprises the following steps: receiving data collected by a sensor of an industrial device; correcting invalid values in the data; dividing the missing data into short-term missing and long-term missing according to the data missing condition and its comparison with the sampling of the sensor of the industrial equipment; filling the missing data by using a model based on a double-layer LSTM; merging repeated values; detecting time series anomalies, which include anomaly points and pattern anomalies; and generating a report according to the anomaly points and the pattern anomaly data. According to the scheme of the invention, anomaly points and pattern anomalies can each be found, so that more accurate equipment anomaly information and equipment state change information are provided for a factory, potential faults of industrial equipment can be mined, and pre-maintenance and performance degradation evaluation of the equipment are achieved.

Description

Method and device for processing and detecting data acquired by industrial equipment

Technical Field

The invention relates to the field of industrial data processing, in particular to a method and a device for processing and detecting data acquired by industrial equipment.

Background

Industrial big data consists of the massive data generated by informatization applications in the industrial field. Data quality problems are widespread because of defects in data acquisition systems, link problems, hardware faults, human factors and the like. Poor-quality data can bias analysis results and cause production accidents, so cleaning of industrial big data is urgently needed.

The principle of data cleaning is to convert dirty data into data that meets the data quality requirement by using cleaning rules based on mathematical statistics, data mining, or predefined rules. The main processing operations are: correction of invalid values, completion of missing values, merging of repeated data, and detection of abnormal values.

Invalid values refer to erroneous data whose format does not conform to the specification, or whose value has no meaning. Abnormal values are often intermixed with invalid values in data collected by industrial equipment.

Missing values refer to data that is missing for subjective or objective reasons such as storage device corruption, violation of data entry rules, or limits on the capacity of the data acquisition device. One prior-art approach keeps only the complete records for analysis, but this is suitable only when the missing rate is very low. If a large amount of data is missing, the data distribution becomes biased and the analysis result is misleading. A more reasonable approach is to recover as much of the lost information as possible. A common way to do so is to replace each missing value with the average or the most frequently occurring value. However, this ignores the relationship between the attributes of the data collected by the industrial equipment, and filling all missing data of the same attribute with one fixed value is not advisable. Many statistical and machine learning models have been used to solve the data loss problem. Common statistical filling methods include the EM algorithm, regression prediction, and interpolation; machine learning methods include KNN clustering, classification algorithms, and neural network algorithms. These models likewise do not take into account the relationship between the attributes of the data collected by the industrial equipment. That is, when processing missing values, the traditional methods usually consider only the autocorrelation of the data and not the influence of data changes in other related dimensions.

The repeated data refers to data with the same name or attribute value, and whether repeated records exist in the data collected by the industrial equipment is detected according to a preset judgment standard. Merging or elimination is the basic method of handling duplicate data.

Outliers refer to data points or segments of a time series whose fluctuation differs significantly from that of the other parts. Common anomaly detection methods currently include statistics-based, clustering-based, and neighbor-based methods. The statistics-based approach assumes that normal data is generated by a statistical model that conforms to a certain distribution, and data that violates this model is an outlier. Equivalently, normal values are assumed to occur in high-probability regions of the stochastic model, while outliers occur in low-probability regions. The disadvantage of this approach is its heavy dependence on the model assumptions about the data. The clustering-based approach builds a data model through clustering: similar data is assigned to the same cluster so that intra-cluster similarity is as large as possible and inter-cluster similarity is as small as possible. If a data item does not belong to any cluster or is far from all clusters, it can be judged to be an abnormal value. The clustering-based approach is better suited to detecting global outliers. Neighbor-based methods assume that normal data lies in dense neighborhoods, while outliers lie far from their neighbors, in sparse regions. Neighbor-based methods include both distance-based and density-based methods. However, the prior art can detect only abnormal points and cannot detect pattern anomalies.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a method and a device for processing and detecting data acquired by industrial equipment, which address the problems that, when data acquired by industrial equipment is cleaned in the prior art, the relationship among the attributes of the data is not considered and pattern anomalies cannot be detected.

According to a first aspect of the present invention, there is provided a method of processing and detecting data collected by an industrial device, the method comprising the steps of:

step S101: receiving data collected by a sensor of an industrial device;

step S102: correcting the invalid value of the data, if the format of a certain item of data in the collected data does not meet the requirement or the data value of the certain item of data exceeds an allowed threshold value, deleting the item of data, and marking the item of data as data missing;

step S103: determining a data missing value of the corrected data, dividing the missing data into a short-term missing type and a long-term missing type according to the data missing condition and the data sampling comparison relation with the sensor of the industrial equipment, and completing the missing data by adopting a model based on a double-layer LSTM;

step S104: merging the repeated values in the corrected data in which the missing data has been filled;

step S105: detecting time series abnormity of the data after merging and processing the repeated values, wherein the time series abnormity comprises abnormal points and mode abnormity; detecting anomaly points through a local anomaly detection algorithm (LOF) based on density, and detecting mode anomalies based on a double-layer LSTM model;

step S106: and generating a report according to the abnormal points and the mode abnormal data.

Further, the completing of long-term missing data based on the double-layer LSTM model includes:

step S201: determining the location of missing data;

determining the position of a data missing value from the corrected data according to the sampling interval of the data acquired by the industrial equipment; determining the position of a data missing value according to the data position deleted with the invalid value; recording the positions of all data missing values, and sequencing the positions of all data missing values according to the time sequence; marking the current processing position as the position of a first data missing value;

step S202: judging whether all missing data are filled, if so, outputting correction data of which all missing data are filled, and ending the method; if not, go to step S203;

step S203: determining other n attributes related to the missing data through correlation calculation, wherein n is more than or equal to 1,

the calculation formula is as follows:

$$\rho_{X_iY_j} = \frac{\operatorname{Cov}(X_i, Y_j)}{\sqrt{D(X_i)}\,\sqrt{D(Y_j)}}$$

The data collected by the sensor of the industrial equipment has a plurality of attributes. $X$ is the set of attributes collected by the industrial equipment that contain missing data, and $Y$ is the set of attributes collected at the same time that contain no missing data; $X_i$ is any attribute in the set of attributes with missing data, and $Y_j$ is any attribute in the set of attributes without missing data. A duplicate data set without missing data is generated by deleting the missing data from the data collected by the industrial equipment, and $\operatorname{Cov}(X_i, Y_j)$, $D(X_i)$ and $D(Y_j)$ are calculated from the duplicate data set, where $\operatorname{Cov}(X_i, Y_j)$ is the covariance of $X_i$ and $Y_j$, and $D(X_i)$ and $D(Y_j)$ are the variances of $X_i$ and $Y_j$ respectively;

when the absolute value $|\rho_{X_iY_j}|$ of the correlation coefficient between $X_i$ and $Y_j$ meets the required threshold, $X_i$ and $Y_j$ are considered correlated attributes; the attributes correlated with $X_i$ are sorted by correlation value, and the top n attributes with the highest correlation are selected as the other attributes related to the missing data;

step S204: taking the time_step data items preceding the missing data as input, calling the trained LSTM model to calculate the missing data, and filling the calculated value into the missing position; time_step is a predetermined step;

step S205: moving the current processing position to the position of the next data missing value, and returning to step S202.
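The attribute selection of step S203 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names `pearson` and `select_related`, the row-of-dicts layout, the use of `None` to mark missing values, and the numeric `threshold` argument are all hypothetical, since the text does not state the threshold value.

```python
import math


def pearson(xs, ys):
    """Correlation coefficient Cov(X, Y) / (sqrt(D(X)) * sqrt(D(Y)))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def select_related(rows, target, candidates, n, threshold):
    """Return up to n candidate attributes most correlated with `target`.

    rows: list of dicts mapping attribute name to value (None = missing).
    threshold: assumed cut-off on |rho|; the value is a caller's choice.
    """
    # Duplicate data set: drop every row that has a missing value.
    complete = [
        r for r in rows
        if r[target] is not None and all(r[c] is not None for c in candidates)
    ]
    xs = [r[target] for r in complete]
    scored = []
    for name in candidates:
        rho = pearson(xs, [r[name] for r in complete])
        if abs(rho) >= threshold:
            scored.append((abs(rho), name))
    scored.sort(reverse=True)  # highest |rho| first
    return [name for _, name in scored[:n]]
```

Sorting by the absolute value keeps strongly negative correlations, which matches the text's use of the absolute value of the coefficient.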

Further, the training process of the LSTM model is as follows:

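step S301: selecting a double-layer LSTM model, wherein the input layer size input_size of the model is equal to n and the output layer size output_size of the model is equal to 1; determining training parameters of the double-layer LSTM model, the training parameters comprising the number of hidden layer nodes rnn_unit, the size batch_size of each training batch, and the size time_step of each batch of data; wherein n is the number of attributes with the highest correlation;

step S302: determining data of a training set and a test set, and determining the data proportion of the training set and the test set;

step S303: selecting, according to the proportion, data collected by the multidimensional sensor from which the missing data has been deleted as training data in the training set, and training the double-layer LSTM model to obtain each numerical value of the training parameters;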

step S304: and adjusting the training parameters according to the test result of the test set.

Further, detecting anomaly points by a local anomaly detection algorithm (LOF) based on density, detecting mode anomalies based on a two-layer LSTM model, comprising:

the density-based local anomaly detection algorithm (LOF) detects outliers, comprising:

step S401: taking the data after the repeated values are combined as data to be detected, and packing the data to be detected and data corresponding to other n attributes related to the attributes of the data to be detected into a tuple as the input of an LOF algorithm;

step S402: setting the range of a parameter k belonging to [min, max], wherein k is an integer; for each value of k, taking each data item to be detected as a point and executing the LOF algorithm once on each point to obtain an outlier factor value; after all values of k have been run, averaging the outlier factor values of each point to obtain the outlier factor average value of the point, wherein the calculation formula is as follows:

$$\overline{\mathrm{LOF}}(p) = \frac{1}{\mathit{max} - \mathit{min} + 1} \sum_{k=\mathit{min}}^{\mathit{max}} \mathrm{LOF}_k(p)$$

wherein min and max are the preset range values, and $\mathrm{LOF}_k(p)$ is the outlier factor value of point p for the value k;

step S403: outputting points with the mean value of the outliers larger than the threshold value as outliers;

the detecting mode anomaly based on the double-layer LSTM model comprises the following steps:

step S501: calling the trained LSTM model to perform time series prediction to obtain a prediction result y_predicted;

step S502: calculating a difference value e between the prediction result y_predicted and the actual result y_test;

step S503: setting a range interval error_buffer, and if the difference e is not within the range interval error_buffer, taking the point corresponding to the difference e as an abnormal point.
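Steps S502 and S503 can be sketched as follows, assuming the predictions from step S501 are already available as a list. The function name `detect_pattern_anomalies`, the sign convention e = predicted minus actual, and the representation of error_buffer as a (low, high) pair are assumptions made for this illustration.

```python
def detect_pattern_anomalies(y_predicted, y_test, error_buffer):
    """Return the indices whose prediction error falls outside error_buffer.

    error_buffer: assumed to be a (low, high) pair bounding the accepted error.
    """
    low, high = error_buffer
    anomalies = []
    for index, (predicted, actual) in enumerate(zip(y_predicted, y_test)):
        e = predicted - actual  # step S502: difference value e
        if not (low <= e <= high):  # step S503: outside the range interval
            anomalies.append(index)
    return anomalies
```

Consecutive flagged indices can then be grouped into segments if a report needs pattern anomalies as intervals rather than single points.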

According to a second aspect of the present invention, there is provided an apparatus for processing and detecting data collected by an industrial device, the apparatus comprising:

an acquisition module: receiving data collected by a sensor of an industrial device;

a correction module: correcting the invalid value of the data, if the format of a certain item of data in the collected data does not meet the requirement or the data value of the certain item of data exceeds an allowed threshold value, deleting the item of data, and marking the item of data as data missing;

and a data filling module: determining a data missing value of the corrected data, dividing the missing data into a short-term missing type and a long-term missing type according to the data missing condition and the data sampling comparison relation with the sensor of the industrial equipment, and completing the missing data by adopting a model based on a double-layer LSTM;

a repeat processing module: merging the repeated values in the corrected data in which the missing data has been filled;

an anomaly detection module: detecting time series abnormity of the data after merging and processing the repeated values, wherein the time series abnormity comprises abnormal points and mode abnormity; detecting anomaly points through a local anomaly detection algorithm (LOF) based on density, and detecting mode anomalies based on a double-layer LSTM model;

a report generation module: and generating a report according to the abnormal points and the mode abnormal data.

Further, the data padding module comprises:

determine a location submodule: determining the location of missing data;

a first judgment sub-module: judging whether all missing data are filled;

a correlation calculation submodule: determining other n attributes related to the missing data through correlation calculation, wherein n is more than or equal to 1,

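the calculation formula is as follows:

$$\rho_{X_iY_j} = \frac{\operatorname{Cov}(X_i, Y_j)}{\sqrt{D(X_i)}\,\sqrt{D(Y_j)}}$$

when the absolute value $|\rho_{X_iY_j}|$ of the correlation coefficient between $X_i$ and $Y_j$ meets the required threshold, $X_i$ and $Y_j$ are considered correlated attributes;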

a filling submodule: taking the time_step data items preceding the missing data as input, calling the trained LSTM model to calculate the missing data, and filling the calculated value into the missing position; time_step is a predetermined step;

a shift submodule: moving the current processing position to the position of the next data missing value.

Further, the training of the LSTM model includes:

a parameter determination submodule: configuring a double-layer LSTM model, wherein the input layer size input_size of the model is equal to n and the output layer size output_size of the model is equal to 1; determining training parameters of the double-layer LSTM model, the training parameters comprising the number of hidden layer nodes rnn_unit, the size batch_size of each training batch, and the size time_step of each batch of data; wherein n is the number of attributes with the highest correlation;

a second determination sub-module: determining data of a training set and a test set, and determining the data proportion of the training set and the test set;

a training value determination submodule: selecting data collected by the multidimensional sensor with deleted missing data as training data in a training set according to a proportion, and training the double-layer LSTM model to obtain each numerical value of a training parameter;

adjusting a submodule: and adjusting the training parameters according to the test result of the test set.

Further, the anomaly detection module comprises an anomaly point detection sub-module and a pattern anomaly detection sub-module:

the abnormal point detection submodule includes:

packing the submodule: taking the data after the repeated values are combined as data to be detected, and packing the data to be detected and data corresponding to other n attributes related to the attributes of the data to be detected into a tuple as the input of an LOF algorithm;

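a range setting submodule: setting the range of a parameter k belonging to [min, max], wherein k is an integer; for each value of k, taking each data item to be detected as a point and executing the LOF algorithm once on each point to obtain an outlier factor value; after all values of k have been run, averaging the outlier factor values of each point to obtain the outlier factor average value of the point, wherein the calculation formula is as follows:

$$\overline{\mathrm{LOF}}(p) = \frac{1}{\mathit{max} - \mathit{min} + 1} \sum_{k=\mathit{min}}^{\mathit{max}} \mathrm{LOF}_k(p)$$

wherein min and max are the preset range values, and $\mathrm{LOF}_k(p)$ is the outlier factor value of point p for the value k;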

a first output sub-module: outputting points with the mean value of the outliers larger than the threshold value as outliers;

the pattern abnormality detection submodule includes:

a calling submodule: calling the trained LSTM model to perform time series prediction to obtain a prediction result y_predicted;

a third computation submodule: calculating a difference value e between the prediction result y_predicted and the actual result y_test;

a second output submodule: setting a range interval error_buffer, and if the difference e is not within the range interval error_buffer, taking the point corresponding to the difference e as an abnormal point.

According to a third aspect of the present invention, there is provided a system for processing and detecting data collected by an industrial device, comprising:

a processor for executing a plurality of instructions;

a memory to store a plurality of instructions;

wherein the instructions are stored in the memory and loaded by the processor to perform a method for processing and detecting data collected by an industrial device as described above.

According to a fourth aspect of the present invention, there is provided a computer readable storage medium having a plurality of instructions stored therein; the instructions are loaded by a processor to execute the method for processing and detecting data collected by industrial equipment as described above.

According to the scheme of the invention, a multi-dimensional time series missing value filling method and a classified abnormal value detection method are applied to the data of industrial equipment in view of the characteristics of industrial data. The temporal order of the industrial data and the correlation among its dimensions are both used to fill missing values and to detect long-duration pattern anomalies. With this scheme, the accuracy of missing value filling is higher than that of the Lagrange interpolation method and the KNNI algorithm, and the advantage grows as the missing rate increases. In addition, the LSTM fits the trend of the data well and can fill data that has been missing for a long time. The scheme finds anomaly points and pattern anomalies separately, and provides more accurate equipment anomaly information and equipment state change information for a factory. The method can mine potential faults of industrial equipment and realize pre-maintenance and performance degradation evaluation of the equipment, which is of great significance for reducing the maintenance cost of a factory and improving product quality.

The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to implement them in accordance with the contents of the description, the following detailed description is given with reference to the preferred embodiments of the present invention and the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention. In the drawings:

FIG. 1 is a flow chart of a method for processing and detecting data collected by an industrial device according to one embodiment of the present invention;

FIG. 2 shows data acquired by a temperature sensor according to an embodiment of the present invention;

FIG. 3 is an exemplary diagram of an additive anomaly;

FIG. 4 is an exemplary diagram of a pattern height anomaly;

FIG. 5 is an exemplary diagram of a pattern length anomaly;

FIG. 6 is an exemplary diagram of a mean and standard deviation anomaly;

Fig. 7 is a block diagram of an apparatus for processing and detecting data collected by an industrial device according to an embodiment of the present invention.

Detailed Description

First, a flow chart of a method for processing and detecting data collected by an industrial device according to an embodiment of the present invention will be described with reference to fig. 1. As shown in fig. 1, the method comprises the steps of:

step S101: receiving data collected by a sensor of an industrial device;

In step S102, the format of a certain item of data not meeting the requirement means that the item does not conform to a predefined valid format. For example, for an index that represents a count or a grade, a value that is not an integer is invalid. Specifically, the data collected by the sensor of the industrial equipment can be processed in batches using Python regular expressions: each item of data in an attribute column with a specified data format is matched, and if an item does not meet the preset condition, it is set to null.

In step S102, the data value of a certain item of data exceeding the allowed threshold means that the value is not within a preset valid value range, in which case the item is invalid. Specifically, a range may be set for the values of a specified attribute column, and any value that is not within the set range is set to null.
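The two checks of step S102 can be sketched together for an integer-valued attribute column. This is a minimal sketch under assumptions: the function name `correct_invalid_integers`, the specific pattern, and the representation of null as `None` are illustrative choices and are not specified in the description.

```python
import re

# Assumed format rule: an optionally signed run of digits.
INTEGER_PATTERN = re.compile(r"[+-]?\d+")


def correct_invalid_integers(values, valid_range):
    """Set items to None when the format or the value range check fails.

    values: raw items of one attribute column (strings or None).
    valid_range: (low, high) pair of allowed values, inclusive.
    """
    low, high = valid_range
    corrected = []
    for raw in values:
        text = "" if raw is None else str(raw).strip()
        if not INTEGER_PATTERN.fullmatch(text):
            corrected.append(None)  # format does not meet the requirement
            continue
        number = int(text)
        # value outside the allowed range is also marked as missing
        corrected.append(number if low <= number <= high else None)
    return corrected
```

Items set to `None` here are the ones that the later filling step treats as missing data.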

The step S103: determining a data missing value of the corrected data, dividing the missing data into a short-term missing type and a long-term missing type according to the data missing condition and the data sampling comparison relation with the sensor of the industrial equipment, and completing the missing data by adopting a model based on a double-layer LSTM, wherein the method comprises the following steps:

the time interval for collecting data by the industrial machine is usually fixed, and if the time interval between the next piece of data and the previous piece of data in the collected data is greater than the fixed time interval, a missing value exists in the data collected by the industrial machine.

As shown in fig. 2, fig. 2 is a temperature data curve acquired by a temperature sensor in industrial equipment of a certain plant in 2018 on 11, 8 and 8 days, and the acquisition frequency is normally once in 2 minutes, and it can be seen in fig. 2 that the time difference between the two points before and after A, B, C exceeds the sampling period of the temperature sensor, which indicates that data is missing at A, B, C. Wherein the time interval of the data before and after the A is 2 hours, 2 minutes and 52 seconds, the time interval of the data before and after the B is 24 minutes and 45 seconds, and the time interval of the data before and after the C is 4 minutes and 1 second.

Data may be lost because of unstable sensor performance, system shutdown, data transmission link failure, database read-write anomalies, and the like. A multiple threshold of the sampling period of the sensor of the industrial equipment is determined according to the data missing condition and its comparison with the sampling of the sensor, for example from an analysis of historical data. When the time interval of the missing data exceeds the multiple threshold multiplied by the sampling period, the data is long-term missing; otherwise, it is short-term missing. In this way, the missing data is divided into two types: short-term missing and long-term missing.
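The classification rule can be sketched as follows. The function name `classify_gaps` and the multiple threshold of 10 used in the usage note are hypothetical; the description says only that the threshold is chosen, for example, from historical data. Real timestamps jitter around the nominal period, so a practical version would likely add a tolerance, which this sketch omits.

```python
from datetime import datetime, timedelta


def classify_gaps(timestamps, sampling_period, multiple_threshold):
    """Return (previous, current, kind) for every interval longer than the period.

    kind is "long" when the interval exceeds multiple_threshold * sampling_period,
    otherwise "short".
    """
    gaps = []
    for previous, current in zip(timestamps, timestamps[1:]):
        interval = current - previous
        if interval <= sampling_period:
            continue  # consecutive samples, nothing missing
        limit = multiple_threshold * sampling_period
        kind = "long" if interval > limit else "short"
        gaps.append((previous, current, kind))
    return gaps
```

With a 2-minute sampling period and an assumed multiple threshold of 10, the intervals quoted for FIG. 2 (2 hours 2 minutes 52 seconds at A, 24 minutes 45 seconds at B, 4 minutes 1 second at C) would come out as long, long and short respectively.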

For short-term missing, the trend of data of industrial equipment and various related systems in the missing time interval does not change obviously, the short-term missing data can be supplemented by adopting a double-layer LSTM-based model, and other methods can also be used for supplementing the short-term missing data.

For long-term missing, the missing duration is larger than the critical duration within which the trend change of the system is predictable. Therefore, in this embodiment, the long-term missing data is completed by trend fitting, using a model based on time series analysis. A Recurrent Neural Network (RNN) computes its gradients using the chain rule, and when it is trained on a long-span time series it may encounter vanishing or exploding gradients. Therefore, the RNN model is not suitable for long-term memory calculations. Long Short-Term Memory networks (LSTMs) are an improved RNN model proposed by Hochreiter et al. and improved and generalized by Alex Graves. The LSTM adds an input gate, an output gate and a forget gate. When information enters the model, the control unit in the LSTM evaluates it: information that conforms to the rule is retained and information that does not is forgotten, which addresses the long-sequence dependence problem in neural networks. In this embodiment, an LSTM model is used to complete long-term missing data.
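For reference, the gating described above is commonly written as follows. This is the standard textbook formulation and is not taken from this description; the weight matrices $W$, biases $b$, input $x_t$, hidden state $h_t$ and cell state $c_t$ are notation assumed here, with $\sigma$ the logistic sigmoid and $\odot$ the element-wise product.

$$
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Because $c_t$ is updated additively, gradients can flow across many time steps when $f_t$ stays close to 1, which is why this structure suits long gaps better than a plain RNN.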

The method for completing long-term missing data based on the double-layer LSTM model comprises the following steps:

step S201: determining the location of missing data;

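the correlation coefficient between an attribute $X_i$ that contains missing data and an attribute $Y_j$ that contains no missing data is calculated as follows:

$$\rho_{X_iY_j} = \frac{\operatorname{Cov}(X_i, Y_j)}{\sqrt{D(X_i)}\,\sqrt{D(Y_j)}}$$

when the absolute value $|\rho_{X_iY_j}|$ of the correlation coefficient between $X_i$ and $Y_j$ meets the required threshold, $X_i$ and $Y_j$ are considered correlated attributes;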

step S204: taking the time_step data items preceding the missing data as input, calling the trained LSTM model to calculate the missing data, and filling the calculated value into the missing position; time_step is the size of each batch of data;

time_step is a predetermined step, and in this embodiment its value is 100. The specific value of time_step is related to the size of the data set and can be adjusted as needed.
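The filling loop can be sketched as follows, with the trained model replaced by an arbitrary callable. This is an illustration under assumptions: `fill_missing` and `predict` are hypothetical names, missing values are marked with `None`, and the sketch is univariate, whereas the described model also takes the n related attributes as input.

```python
def fill_missing(series, time_step, predict):
    """Fill None entries in chronological order.

    series: list of floats with None marking missing values.
    predict: stand-in for the trained model; maps the time_step values
             preceding a gap to one predicted value.
    """
    filled = list(series)
    # Record all missing positions in time order.
    missing_positions = [i for i, value in enumerate(filled) if value is None]
    for position in missing_positions:  # ends when every position is filled
        if position < time_step:
            raise ValueError("not enough history before the missing value")
        # Earlier gaps are already filled, so the window holds no None.
        window = filled[position - time_step:position]
        filled[position] = predict(window)
    return filled
```

Processing the positions in chronological order is what lets a long run of missing values be filled one value at a time, each prediction feeding the next window.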

The training process of the LSTM model comprises the following steps:

step S301: selecting a double-layer LSTM model, wherein the input layer size input_size of the model is equal to n and the output layer size output_size of the model is equal to 1; determining training parameters of the double-layer LSTM model, the training parameters comprising the number of hidden layer nodes rnn_unit, the size batch_size of each training batch, and the size time_step of each batch of data; wherein n is the number of attributes with the highest correlation;

step S302: determining data of a training set and a test set, and determining the data proportion of the training set and the test set;

in the embodiment, 70% of data is selected as a training set, and 30% of data is selected as a test set;

step S303: selecting data collected by the multidimensional sensor with deleted missing data as training data in a training set according to a proportion, and training the double-layer LSTM model to obtain each numerical value of a training parameter;

In this embodiment, because the single-layer LSTM filling effect is not ideal enough and the under-fitting problem is likely to occur, the double-layer LSTM model is selected. If the data collected by the industrial equipment is one-dimensional data, only the input of the LSTM model needs to be changed into a single variable.
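The data preparation implied by steps S302 and S303 can be sketched as follows. The function names `make_windows` and `split_train_test` are hypothetical, and splitting in chronological order (rather than at random) is an assumption of this sketch; the description gives only the 70%/30% proportion.

```python
def make_windows(series, time_step):
    """Turn a complete series into (input window, next value) samples."""
    inputs, targets = [], []
    for end in range(time_step, len(series)):
        inputs.append(series[end - time_step:end])
        targets.append(series[end])
    return inputs, targets


def split_train_test(inputs, targets, train_ratio=0.7):
    """Split samples in time order: the first part trains, the rest tests."""
    cut = round(len(inputs) * train_ratio)
    train = (inputs[:cut], targets[:cut])
    test = (inputs[cut:], targets[cut:])
    return train, test
```

Each input window has length time_step, matching the number of preceding values the filling step passes to the model.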

Furthermore, the filling effect of the missing data can be judged by subjective human observation or by calculating an objective evaluation index. The Root Mean Square Error (RMSE) can be used as the evaluation index: it directly reflects the deviation between the filled values and the actual values, and the smaller the RMSE, the better the fit and the higher the accuracy of the filling result. A complete data set is selected, a certain proportion of its data is removed to create artificial missing data, the artificially removed data is calculated by the trained double-layer LSTM model, and the root mean square error is then calculated with the following formula:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(X_{pre}(i) - X(i)\bigr)^{2}}$$

wherein $X_{pre}(i)$ is the filled value, $X(i)$ is the actual value, and $n$ is the total number of missing values.
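The evaluation index can be computed with a few lines of code. The function name `rmse` and the list-based interface are assumptions for illustration; the two lists hold the filled values and the withheld actual values at the artificially removed positions.

```python
import math


def rmse(filled, actual):
    """Root mean square error between filled values and actual values."""
    if not filled or len(filled) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(filled, actual))
    return math.sqrt(squared / len(filled))
```

A value of 0 means every filled value matches the withheld value exactly; larger values indicate a worse fit.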

The step S104: merging repeated values in the corrected data in which the missing data has been filled, comprising:

for the case where the values of a single attribute column are the same in different rows of the filled corrected data, first finding the repeated data according to the specified attribute column and returning its position, then extracting the repeated data according to the returned value, and finally deleting the repeated data from the filled corrected data;

for the case where the values of multiple attribute columns are the same in different rows of the filled corrected data, first finding the repeated data according to the specified attribute columns and returning its position, then extracting the repeated data according to the returned value, and finally deleting the repeated data from the filled corrected data;

for the case where the same attribute column appears more than once in the filled corrected data, first detecting whether columns with the same name exist, and then judging whether the data in every row of the same-named columns is the same; if every row is the same, deleting the repeated attribute column; if not, retaining the column data and modifying the column name.
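The three cases can be sketched as follows. The function names, the row-of-dicts layout, keeping the first occurrence of a repeated row, and the `_2`, `_3` renaming suffix are all assumptions of this sketch; the description does not say which occurrence is kept or how a column is renamed.

```python
def drop_duplicate_rows(rows, key_columns):
    """Keep the first row for each combination of values in key_columns.

    Covers both the single-column and the multi-column case.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key in seen:
            continue  # repeated data: delete
        seen.add(key)
        unique.append(row)
    return unique


def merge_same_named_columns(names, columns):
    """Drop identical same-named columns; rename same-named ones that differ."""
    kept = []  # entries of (original name, output name, values)
    for name, values in zip(names, columns):
        same = [entry for entry in kept if entry[0] == name]
        if any(entry[2] == values for entry in same):
            continue  # every row is the same: delete the repeated column
        output_name = name if not same else f"{name}_{len(same) + 1}"
        kept.append((name, output_name, values))
    return [entry[1] for entry in kept], [entry[2] for entry in kept]
```

Passing one column name in `key_columns` handles the single-attribute case and passing several handles the multi-attribute case, so one routine covers both.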

The step S105: detecting time series abnormity of the data after merging and processing the repeated values, wherein the time series abnormity comprises abnormal points and mode abnormity; detecting outliers by a local anomaly detection algorithm (LOF) based on density, detecting mode anomalies based on a two-layer LSTM model, comprising:

According to the form in which they appear, time series anomalies can be divided into two types: anomaly points (outliers) and pattern anomalies (outlier patterns). An anomaly point may also be regarded as a pattern anomaly of length 1.

Anomaly points in industrial time series data mainly take the form of additive outliers (AO), as shown in FIG. 3. Additive outliers are usually isolated outliers that do not affect subsequent observations. The main reasons for additive outliers are: (1) the health condition of the equipment is abnormal, in which case the data has practical guiding significance for equipment maintenance; (2) the data measurement error is large or the link transmission is abnormal, in which case the measuring instrument and the transmission link need to be checked for faults.

A pattern anomaly is a pattern in a time series that differs significantly from the other patterns; such patterns include discretized symbols, Fourier coefficients, and the like. Pattern anomalies in industrial time series data take three main forms: height anomaly, length anomaly, and mean and standard deviation anomaly, as shown in FIGS. 4-6. A pattern anomaly is caused by a change in the working state of the industrial equipment, and the abnormal values can accurately characterize the special stage the industrial equipment is currently in.

It can be seen that the abnormal values of industrial equipment data include both anomaly points and pattern anomalies, and both have specific industrial meanings, but conventional anomaly detection algorithms can only detect isolated anomaly points and cannot detect a long-lasting abnormal pattern state.

A density-based local anomaly detection algorithm (LOF) can express the degree to which each data point is an outlier, but the accuracy with which LOF identifies outliers depends heavily on the choice of the parameter k. For an industrial data set acquired by industrial equipment, in which the number of outliers is unknown, it is difficult to directly select a single value of k that yields a reasonable number of detected outliers: a large amount of normal data may be regarded as anomaly points, or a large number of anomaly points may be regarded as normal points. To weaken the influence of the value of k on the result and improve the accuracy of LOF outlier detection, in this embodiment a range is set for the parameter k, and the final outlier factor of each point is taken as the average of the results obtained over all values of k in that range.

Specifically, in the present embodiment, the density-based local anomaly detection algorithm (LOF) detects anomaly points, which includes:

in this embodiment, min and max are the preset lower and upper bounds of the range of k, where min is 4 and max is 10, values that are applicable to data acquired by most industrial equipment sensors; LOF_k(p) is the value of the outlier factor of point p corresponding to k.

Step S403: outputting the points whose mean outlier factor is larger than the threshold as outliers.
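The averaging of the outlier factor over a range of k can be sketched as follows. This is an illustrative pure-Python version of the standard LOF definition, not the code of the embodiment; the function names `lof_scores`, `mean_lof` and `detect_anomaly_points`, the small floor used to avoid division by zero, and the threshold value 2.0 in the example are assumptions, since the text above does not fix a threshold.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def lof_scores(points: List[Point], k: int) -> List[float]:
    """Local outlier factor of every point for one neighbourhood size k."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neigh = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]  # distance to the k-th nearest neighbour
        k_dist.append(kd)
        neigh.append([j for j in order if dist[i][j] <= kd])  # ties are kept
    lrd = []
    for i in range(n):
        # Reachability distance of point i from each of its neighbours j.
        reach = [max(k_dist[j], dist[i][j]) for j in neigh[i]]
        # Assumption: a small floor avoids division by zero for repeated points.
        lrd.append(1.0 / max(sum(reach) / len(reach), 1e-12))
    return [sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(n)]


def mean_lof(points: List[Point], k_min: int = 4, k_max: int = 10) -> List[float]:
    """Average LOF_k(p) over k = k_min .. k_max (min = 4, max = 10 in the text)."""
    totals = [0.0] * len(points)
    for k in range(k_min, k_max + 1):
        for i, score in enumerate(lof_scores(points, k)):
            totals[i] += score
    return [t / (k_max - k_min + 1) for t in totals]


def detect_anomaly_points(points: List[Point], threshold: float,
                          k_min: int = 4, k_max: int = 10) -> List[int]:
    """Indices whose mean outlier factor exceeds the (user-chosen) threshold."""
    scores = mean_lof(points, k_min, k_max)
    return [i for i, score in enumerate(scores) if score > threshold]


# Twenty evenly spaced readings plus one isolated reading.
values = [10.0 + 0.5 * i for i in range(20)] + [100.0]
points = [(v,) for v in values]
print(detect_anomaly_points(points, threshold=2.0))  # -> [20]
```

Because each point's score is averaged over seven values of k, a single unlucky choice of k has less influence on whether a point is reported. The pairwise distance table makes this sketch quadratic in the number of points, so it suits short windows rather than long histories.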

In industrial data, an abnormal pattern state lasts for a period of time, during which the trend differs from that of the preceding data and data characteristics such as the mean and standard deviation change. In this embodiment, the previously trained double-layer LSTM model is used: the value predicted by the model is compared with the true value, and whether a point is in an abnormal state is determined from the difference. Specifically, the method includes:

in this embodiment, time_step data points are input into the trained LSTM model, and the output of the LSTM model is used as the input of the trained LSTM model, so as to obtain the prediction result y_predict.

For example, if the data follows a normal distribution, a value that deviates from the mean by more than 3 times the standard deviation is regarded as an abnormal value, and the data is recorded as a pattern abnormal state; if the data does not follow a normal distribution, the judgment is made using the Z-score, and data outside the normal range of the Z-score is recorded as a pattern abnormal state.

For example, if the data is normally distributed, the Lauda criterion (3σ criterion) may be used. Under the 3σ principle, a value is regarded as an outlier if it deviates from the mean by more than 3 standard deviations. Because the probability that a value falls within the mean ± 3σ is 99.7%, the probability of a value falling outside the mean ± 3σ is about 0.003, which is a very small-probability event.

If the data does not follow a normal distribution, a Z-score, i.e., the number of standard deviations by which a value lies from the mean, can also be used to make the determination. The Z-score is calculated by subtracting the sample mean from each data value in the sample and dividing by the sample standard deviation. The normal range of the Z-score is [-3, +3]; data beyond this range is likely to be an outlier. Therefore, the criterion for outlier detection is chosen as the mean ± 3 times the standard deviation, and data for which the difference between the predicted result and the actual result lies outside the mean ± 3 times the standard deviation is recorded as a pattern abnormal state.
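The mean ± 3 standard deviation check on the prediction error can be sketched as below. It is a minimal illustration under assumptions: the function name `pattern_anomaly_indices` is hypothetical, the mean and standard deviation are taken over the differences between actual and predicted values, and the population standard deviation is used; the predicted values here are made-up numbers standing in for the output of the trained LSTM model.

```python
import statistics
from typing import List, Sequence


def pattern_anomaly_indices(actual: Sequence[float], predicted: Sequence[float],
                            n_sigma: float = 3.0) -> List[int]:
    """Indices where (actual - predicted) lies outside mean +/- n_sigma * std."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    mean = statistics.mean(residuals)
    std = statistics.pstdev(residuals)  # assumption: population standard deviation
    if std == 0:
        return []  # all differences identical: nothing stands out
    # |Z-score| > n_sigma is the same test as lying outside mean +/- n_sigma * std.
    return [i for i, r in enumerate(residuals) if abs((r - mean) / std) > n_sigma]


# Stand-in data: a flat prediction, small alternating errors, one large deviation.
predicted = [10.0] * 50
actual = [10.0 + (0.1 if i % 2 else -0.1) for i in range(50)]
actual[30] = 15.0
print(pattern_anomaly_indices(actual, predicted))  # -> [30]
```

With the population standard deviation, a single deviating value among n points can reach a Z-score of at most the square root of n - 1, so the 3-standard-deviation test can only fire when the window holds more than ten points.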

An embodiment of the present invention further provides an apparatus for processing and detecting data collected by an industrial device, as shown in fig. 7, the apparatus includes:

The embodiment of the invention further provides a system for processing and detecting data acquired by industrial equipment, which comprises:

a processor for executing a plurality of instructions;

a memory to store a plurality of instructions;

The embodiment of the invention further provides a computer readable storage medium, wherein a plurality of instructions are stored in the storage medium; the instructions are used for loading and executing the method for processing and detecting the data collected by the industrial equipment.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is only a logical division, and other divisions are possible in actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a physical server, or a cloud server, etc., with a Windows or Windows Server operating system installed) to perform some steps of the method according to the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and any simple modification, equivalent change and modification made to the above embodiment according to the technical spirit of the present invention are still within the scope of the technical solution of the present invention.

Claims

1. A method for processing and detecting data collected by an industrial device, comprising the steps of:

step S101: receiving data collected by a sensor of an industrial device;

2. The method of processing and detecting data collected by an industrial device of claim 1, wherein the completing long term missing data based on a two-layer LSTM model comprises:

step S201: determining the location of missing data;

the calculation formula is as follows:

when the absolute value of the correlation coefficient between X_i and Y_j meets the threshold condition, X_i and Y_j are considered correlated attributes; the attributes correlated with X_i are sorted by correlation value, and the top n attributes with the highest correlation are selected as the other attributes related to the missing data;

3. The method for processing and detecting data collected by an industrial device of claim 2, wherein the training process of the LSTM model is:

step S301: configuring a double-layer LSTM model, wherein the input layer size input_size of the model is equal to n and the output layer size output_size of the model is equal to 1; determining training parameters of the double-layer LSTM model, the training parameters comprising the number of hidden-layer nodes rnn_unit, the batch size batch_size of each training, and the number of time steps time_step of each batch of data; wherein n is the number of attributes with the highest correlation;

4. The method of processing and detecting data collected by an industrial device of claim 3, wherein detecting anomaly points by a density-based local anomaly detection algorithm (LOF) and detecting pattern anomalies based on a double-layer LSTM model comprises:

5. An apparatus for processing and detecting data collected by an industrial device, the apparatus comprising:

6. The apparatus for processing and detecting data collected by an industrial device of claim 5, wherein the data padding module comprises:

a location determination submodule: determining the location of missing data;

a first judgment sub-module: judging whether all missing data are filled;

the calculation formula is as follows:

correlation coefficient between X_i and Y_j = Cov(X_i, Y_j) / (√D(X_i) · √D(Y_j))

wherein the data collected by the sensor of the industrial equipment has a plurality of attributes; X is the set of attributes, collected by the industrial equipment, that contain missing data, and Y is the set of attributes collected by the industrial equipment at the same time that do not contain missing data; X_i is any attribute in the attribute set containing missing data, and Y_j is any attribute in the attribute set without missing data; a duplicate data set without missing data is generated by deleting the missing data from the data collected by the industrial equipment, and Cov(X_i, Y_j), D(X_i) and D(Y_j) are calculated from the duplicate data set; Cov(X_i, Y_j) is the covariance of X_i and Y_j, and D(X_i) and D(Y_j) are the variances of X_i and Y_j, respectively;

when the absolute value of the correlation coefficient between X_i and Y_j

7. The apparatus for processing and detecting data collected by an industrial device of claim 6, wherein the training of the LSTM model comprises:

8. The apparatus for processing and detecting data collected by an industrial device of claim 7, wherein the anomaly detection module comprises an abnormal point detection submodule and a pattern abnormality detection submodule:

the abnormal point detection submodule includes:

the pattern abnormality detection submodule includes:

9. A system for processing and detecting data collected by an industrial device, comprising:

a processor for executing a plurality of instructions;

a memory to store a plurality of instructions;

wherein the plurality of instructions are for being stored by the memory and loaded by the processor and performing the method of processing and detecting data collected by an industrial device according to any one of claims 1 to 4.

10. A computer-readable storage medium having stored therein a plurality of instructions; the plurality of instructions being for loading by a processor and carrying out the method of processing and detecting data collected by an industrial device according to any one of claims 1 to 4.