CN112463838A

CN112463838A - Industrial data quality evaluation method and system based on machine learning

Info

Publication number: CN112463838A
Application number: CN202011498693.3A
Authority: CN
Inventors: 樊树盛; 贺本彪; 苗维杰
Original assignee: Hangzhou Rischen Anke Technology Co ltd
Current assignee: Hangzhou Rischen Anke Technology Co ltd
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2021-03-09

Abstract

The invention discloses a machine-learning-based industrial data quality evaluation method and system. The evaluation method comprises the following steps: S1, preprocessing the detection data to exclude abnormal data at single detection data points; and S2, constructing a correlation model and judging the preprocessed detection data that meet the requirements, so as to determine abnormal detection data that do not satisfy the correlation. The preprocessing step identifies abnormal data that obviously fail the requirements, and the correlation model then identifies abnormal data that are not correlated with the other data in their group, so that data from multiple areas of an industrial production process can be checked conveniently and abnormal monitoring data can be determined accurately.

Description

Industrial data quality evaluation method and system based on machine learning

Technical Field

The invention relates to the technical field of industrial information, in particular to an industrial data quality evaluation method based on machine learning.

Background

In industrial production, data from many areas must be monitored, and the stability of the relevant equipment is determined from the detection results. As the number of detection sensors keeps growing, however, it becomes important to judge the accuracy of the acquired detection data effectively, that is, to evaluate the detection data and find the abnormal data.

In the prior art, detection data are mostly evaluated by direct comparison. This works well when the data volume is small, but as the data volume keeps increasing it becomes difficult to evaluate the quality of industrial data in this traditional way.

In addition, the conventional approach in the prior art establishes different data monitoring and quality evaluation models for the different specific requirements and data types of each industrial operation. A unified data quality evaluation method that does not depend on the operating environment or the sensor data type is lacking.

Therefore, it is necessary to provide a generally applicable method for evaluating the quality of industrial data based on machine learning.

Disclosure of Invention

The invention aims to provide an industrial data quality evaluation method based on machine learning that overcomes the defects in the prior art, conveniently checks data from multiple areas of an industrial production process, and accurately determines abnormal monitoring data.

The invention provides an industrial data quality evaluation method based on machine learning, which comprises the following steps:

S1: preprocessing the detection data to exclude abnormal data at single detection data points;

S2: constructing a correlation model, and judging the preprocessed detection data that meet the requirements, so as to determine abnormal detection data that do not satisfy the correlation.

The industrial data quality evaluation method based on machine learning as described above, wherein the step S1 may optionally include:

S11: constructing a historical unit evaluation model;

S12: establishing a standard measuring point attribute database through the historical unit evaluation model, and establishing a standard value interval and a standard value variation amplitude interval;

S13: acquiring the attributes of the current detection data and comparing them with the standard measuring point attribute database; judging the current detection data to be out-of-value-range detection data when they exceed the standard value interval; and judging the current detection data to be amplitude fluctuation abnormal detection data when they exceed the variation amplitude interval.

The industrial data quality evaluation method based on machine learning as described above, wherein the step S11 may optionally include:

S111: acquiring historical detection data $X_1, X_2, X_3, \ldots, X_i, \ldots, X_n$ within a certain period of time;

S112: calculating the median, $\operatorname{median}(X)$, of all historical detection data;

S113: calculating the absolute deviation $|X_i - \operatorname{median}(X)|$ of each historical observation from the median;

S114: calculating the median of the absolute deviations, $\mathrm{MAD} = \operatorname{median}(|X_i - \operatorname{median}(X)|)$;

S115: dividing the absolute deviation of each historical observation by the MAD to obtain its MAD-based distance value Xm from the center of all observations.
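Steps S111 to S115 can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name `mad_distances`, the sample readings, and the choice to raise an error when the MAD is zero are assumptions that do not come from the text.

```python
from statistics import median


def mad_distances(history):
    """Return (median, MAD, distances) for a list of historical readings.

    distances[i] = |X_i - median(X)| / MAD, the value called Xm in the text.
    """
    center = median(history)                        # S112: median of all readings
    abs_dev = [abs(x - center) for x in history]    # S113: absolute deviations
    mad = median(abs_dev)                           # S114: median absolute deviation
    if mad == 0:
        # Assumption: the text does not say how a zero MAD should be handled.
        raise ValueError("MAD is zero; distances are undefined")
    return center, mad, [d / mad for d in abs_dev]  # S115: MAD-based distances


# Hypothetical readings from one measuring point; 25.0 is an obvious outlier.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 25.0, 9.8]
center, mad, xm = mad_distances(readings)
```

With these hypothetical readings, the value 25.0 receives by far the largest distance, while the remaining readings stay within a few MAD units of the median.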

The industrial data quality evaluation method based on machine learning as described above, wherein the step S12 may optionally include:

establishing a standard value interval and a standard value variation amplitude interval within which Xm meets the requirement, according to the three-sigma rule $\Pr(\mu - 3\sigma \le X_m \le \mu + 3\sigma) \approx 0.9973$, where $\sigma$ denotes the standard deviation and $\mu$ denotes the mean; the Xm value of abnormal data that do not meet the requirement is greater than $\mu + 3\sigma$ or smaller than $\mu - 3\sigma$.
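As a rough sketch of the three-sigma check, the interval can be built from historical Xm values and then applied to a new value. The function names, the sample values, and the use of the population standard deviation (`pstdev`) are assumptions made for illustration; the text does not say which estimator is used.

```python
from statistics import mean, pstdev


def three_sigma_interval(xm_values):
    """Return (mu - 3*sigma, mu + 3*sigma) for historical Xm values."""
    mu = mean(xm_values)
    sigma = pstdev(xm_values)  # assumption: population standard deviation
    return mu - 3 * sigma, mu + 3 * sigma


def is_abnormal(xm, interval):
    """True when xm lies outside the three-sigma interval."""
    low, high = interval
    return xm < low or xm > high


# Hypothetical MAD-based distances computed from normal historical data.
historical_xm = [0.0, 0.5, 1.0, 1.0, 1.5, 2.0, 0.5, 1.0]
interval = three_sigma_interval(historical_xm)

normal_flag = is_abnormal(1.2, interval)     # inside the interval
outlier_flag = is_abnormal(150.0, interval)  # far outside the interval
```

Building the interval from historical values and testing new values against it matters for small samples: an outlier inside a short series inflates sigma and can hide itself from a three-sigma test computed on that same series.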

The industrial data quality evaluation method based on machine learning as described above, wherein the step S2 may optionally include:

S21: establishing an association model according to the historical detection data, and clustering the historical detection data according to the association model to form a plurality of groups;

S22: evaluating the result of the clustering;

S23: assigning the preprocessed detection data that meet the requirements to the corresponding groups according to the association model;

S24: determining the correlation between the preprocessed, qualified detection data assigned to a group and the other data in that group, and finding the abnormal data with poor correlation.

The industrial data quality evaluation method based on machine learning as described above, wherein the step S21 may optionally include:

S211: acquiring historical detection data, and randomly selecting K objects from the N historical detection data as the initial cluster centers;

S212: calculating the distance from each detection datum to each cluster center, and assigning each detection datum to the nearest cluster;

S213: after all detection data have been assigned, recalculating the K cluster centers;

S214: comparing the new cluster centers with the K cluster centers obtained in the previous calculation; if the cluster centers have changed, going to S212, otherwise going to S215;

S215: stopping and outputting the clustering result when the cluster centers no longer change.
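The loop described in S211 to S215 is the usual K-means iteration and can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: each detection datum is represented as a tuple of numeric features, squared Euclidean distance is used, and the names `kmeans`, `sq_dist`, `seed` and `max_iter` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster `points` (a list of tuples) into k groups."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # S211: random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):                  # assumption: iteration cap as a safeguard
        clusters = [[] for _ in range(k)]
        for p in points:                       # S212: assign to nearest center
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        new_centers = [                        # S213: recompute each center
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centers[j]         # keep the old center of an empty cluster
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:             # S214/S215: stop when unchanged
            break
        centers = new_centers
    return centers, clusters


# Hypothetical two-feature data with two well-separated groups.
points = [(1, 1), (2, 2), (3, 3), (50, 50), (51, 51), (52, 52)]
centers, clusters = kmeans(points, k=2)
```

Because the initial centers are chosen at random, results on less cleanly separated data can depend on the seed; repeating the run with several seeds is a common precaution.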

The industrial data quality evaluation method based on machine learning as described above, wherein the step S22 may optionally include:

S221: determining purity(X, Y) using the following formula

where $X = (x_1, x_2, \ldots, x_k)$ is the set of clusters and $x_k$ denotes the k-th cluster; $Y = (y_1, y_2, \ldots, y_i)$ denotes the set to be clustered and $y_i$ denotes the i-th object to be clustered; N denotes the total number of objects in the clustered set;

S222: evaluating the clustering result according to the range in which the purity(X, Y) value falls.
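A purity score can be computed as sketched below, assuming the common definition of purity: each cluster is credited with the count of its most frequent reference class, and the credited counts are summed and divided by N. That definition, the function name `purity`, and the reference labels are assumptions made for illustration.

```python
from collections import Counter


def purity(cluster_ids, reference_labels):
    """Purity of a clustering against reference labels (common definition).

    cluster_ids[i] is the cluster assigned to object i and
    reference_labels[i] is its reference class.
    """
    n = len(cluster_ids)
    counts_by_cluster = {}
    for cid, label in zip(cluster_ids, reference_labels):
        counts_by_cluster.setdefault(cid, Counter())[label] += 1
    # Credit each cluster with the size of its majority class.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / n


# Hypothetical example: "T" = temperature points, "P" = pressure points.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["T", "T", "P", "P", "P", "P"]
score = purity(cluster_ids, labels)                     # (2 + 3) / 6
perfect = purity([0, 0, 1, 1], ["a", "a", "b", "b"])    # every cluster is pure
```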

The industrial data quality evaluation method based on machine learning as described above, wherein the step S24 may optionally include:

determining the correlation coefficient r according to the Pearson correlation coefficient formula,

where the value range of the correlation coefficient r is $-1 \le r \le 1$;

$0 < |r| < 1$ indicates that a linear correlation of some degree exists;

and judging data with low linear correlation or no linear correlation to be abnormal data.
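For two equal-length series, the sample Pearson coefficient is the sum of the products of their deviations from the mean, divided by the square root of the product of their summed squared deviations. A minimal sketch follows; the function name `pearson_r` and the decision to return 0.0 for a constant series (where the coefficient is undefined) are assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: treat a constant series as having no linear correlation.
        return 0.0
    return cov / sqrt(var_x * var_y)


r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly increasing together
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly opposed
r_constant = pearson_r([1, 2, 3], [5, 5, 5])         # constant series
```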

The invention also provides an evaluation system using the above industrial data quality evaluation method based on machine learning, the system comprising:

a data preprocessing unit, used for preprocessing the detection data to exclude abnormal data at single detection data points;

and a data reprocessing unit, used for constructing a correlation model and judging the preprocessed detection data that meet the requirements, so as to determine abnormal detection data that do not satisfy the correlation.

The evaluation system using the industrial data quality evaluation method based on machine learning as described above, wherein optionally the data preprocessing unit includes:

a construction module, used for constructing a historical unit evaluation model, establishing a standard measuring point attribute database through the historical unit evaluation model, and establishing a standard value interval and a standard value variation amplitude interval;

an initial judgment module, used for acquiring the attributes of the current detection data and comparing them with the standard measuring point attribute database; judging the current detection data to be out-of-value-range detection data when they exceed the standard value interval; and judging the current detection data to be amplitude fluctuation abnormal detection data when they exceed the variation amplitude interval.

Compared with the prior art, the invention establishes a unified quality evaluation model and thereby provides a data quality evaluation method that does not depend on the operating environment or the sensor data type.

According to the invention, the abnormal data which obviously does not meet the requirements are identified through the preprocessing process, and then the abnormal data which are not mutually associated in the related groups are identified through the association model, so that the data of a plurality of areas in the industrial production process can be conveniently detected, and the abnormal monitoring data can be accurately determined.

Drawings

FIG. 1 is a schematic flow chart diagram of a method for evaluating the quality of industrial data based on machine learning according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of preprocessing detection data to exclude abnormal data of a single detection data point in the industrial data quality evaluation method based on machine learning according to the embodiment of the present invention;

FIG. 3 is a schematic flow chart of a historical unit evaluation model constructed in the industrial data quality evaluation method based on machine learning according to the embodiment of the present invention;

FIG. 4 is a schematic flow chart of the construction of the association model in the industrial data quality evaluation method based on machine learning according to the embodiment of the present invention;

FIG. 5 is a schematic flow chart illustrating a process of establishing an association model according to historical detection data and clustering and grouping the historical detection data according to the association model to form a plurality of groups in the industrial data quality evaluation method based on machine learning according to the embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an evaluation system using an industrial data quality evaluation method based on machine learning according to an embodiment of the present invention;

FIGS. 7 to 9 compare the outliers in actual detection data taken at different time points with the outliers detected by the industrial data quality evaluation method based on machine learning.

Detailed Description

The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.

An embodiment of the invention: as shown in FIG. 1, an industrial data quality evaluation method based on machine learning is disclosed. The method learns how the values of temperature, pressure and other numerical measuring points in an industrial environment change, and evaluates the quality of newly generated data accordingly. The specific steps are as follows:

The method works in two steps: obviously abnormal data points that do not meet the requirements are screened out first, and the less obvious abnormal data points are then identified with the correlation model, so that the accuracy of the detection data is determined more reliably.

Specifically, as shown in fig. 2, step S1 may include:

S11: constructing a historical unit evaluation model;

As shown in FIG. 3, step S11 may include:

S112: calculating the median, $\operatorname{median}(X)$, of all historical detection data;

That is, the historical unit evaluation model of S11 is constructed in order to identify single-point outliers with the Seasonal Hybrid ESD (S-H-ESD) time series anomaly detection algorithm. To keep outliers from distorting the identification, this embodiment uses the MAD, a measure that is more robust than the sample variance or the standard deviation σ and that also works for distributions without a mean or variance. The MAD tolerates outliers in the data set better than the standard deviation σ does. The standard deviation σ uses the squared distance from each data point to the mean, so large deviations carry more weight and outliers have a significant impact on the result. For the MAD, a small number of outliers does not affect the final result.
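The robustness argument can be checked numerically with a small, hypothetical data set: replacing one reading with an extreme value leaves the MAD unchanged while the standard deviation grows many times over. The helper name `mad` and the sample values are assumptions made for illustration.

```python
from statistics import median, pstdev


def mad(values):
    """Median absolute deviation from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


# Hypothetical readings; the second list replaces one reading with an outlier.
clean = [10, 11, 9, 10, 12, 10, 9]
with_outlier = [10, 11, 9, 10, 12, 10, 100]

mad_clean, mad_outlier = mad(clean), mad(with_outlier)
std_clean, std_outlier = pstdev(clean), pstdev(with_outlier)
```

In this example both MAD values equal 1, whereas the standard deviation of the contaminated series is more than ten times that of the clean series.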

The historical unit evaluation model obtained in this embodiment identifies abnormal data by converting each specific value into its MAD-based distance value Xm from the center of all observed data and then judging that distance. In this way, severely deviating data are emphasized and slightly deviating data are de-emphasized, so that data with large deviations are identified effectively and correct data are not misjudged.

Further, optionally, step S12 in fig. 1 may include:

Finally, abnormal Xm is confirmed by the Three-Sigma Rule formula.

In S1, data from the monitoring points over a period of time are acquired and recorded in a database as historical data, serving as the attributes of each measuring point. From the historical data, the maximum value, the minimum value, the maximum and minimum variation amplitude per unit time, and the degree of dispersion are determined. The out-of-value-range interval and the abnormal amplitude fluctuation interval are then determined from the historical data. The latest monitored data are compared with these intervals to see whether they fall into the corresponding range, so as to prejudge the detection data. When the value of a monitoring point changes, its latest data are compared with the attributes of that monitoring point over the preceding period. Data beyond the maximum and minimum values are marked as out of value range, and data whose variation amplitude exceeds the measuring-point attribute are marked as amplitude fluctuation abnormal data.
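A simplified sketch of this prejudgment might look as follows. It assumes readings sampled at uniform intervals, so that the difference between consecutive readings stands in for the variation amplitude per unit time, and it derives the intervals directly from the historical minimum, maximum and largest step. The function names, flag strings and sample values are hypothetical.

```python
def build_attributes(history):
    """Derive measuring-point attributes from uniformly sampled history."""
    steps = [abs(b - a) for a, b in zip(history, history[1:])]
    return {
        "min": min(history),       # lower bound of the value range
        "max": max(history),       # upper bound of the value range
        "max_step": max(steps),    # largest change between consecutive readings
    }


def prejudge(previous, current, attrs):
    """Return the list of flags raised by the newest reading."""
    flags = []
    if current < attrs["min"] or current > attrs["max"]:
        flags.append("out_of_range")
    if abs(current - previous) > attrs["max_step"]:
        flags.append("amplitude_fluctuation")
    return flags


# Hypothetical temperature history for one measuring point.
attrs = build_attributes([20, 21, 22, 21, 20, 19])

ok = prejudge(20, 21, attrs)      # within range, normal step
jump = prejudge(19, 22, attrs)    # within range, but the step is too large
spike = prejudge(20, 30, attrs)   # outside the range and the step is too large
```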

As shown in FIGS. 7 to 9, the bold dots are the outliers finally detected by the industrial data quality evaluation method based on machine learning. Observing the large difference between the identified outliers and the values in their neighborhood, we consider S-H-ESD very useful for identifying outliers in historical data.

As shown in fig. 4, step S2 of the present invention may specifically include:

S22: evaluating the result of the clustering;

As shown in FIG. 5, step S21 may include:

Wherein, the step S22 may include:

S221: determining purity(X, Y) using the following formula

The purity method has the advantage of being easy to calculate. Its value lies between 0 and 1: a completely wrong clustering scores 0 and a completely correct clustering scores 1. In practice the value falls somewhere between these two extremes.

Wherein, the step S24 may include:

$0 < |r| < 1$ indicates that a linear correlation of some degree exists:

Step S2 identifies detection points whose data satisfy the requirements of step S1 but are nevertheless abnormal. Specifically, a relatively complex association model must be established first: the historical data of each measuring point in the industrial control environment are turned into corresponding attributes dynamically by a machine learning method, and the association model is defined dynamically from the acquired real-time data.

The association model is then used to check whether a new measuring-point value is consistent with the corresponding measuring point; if it is not, the value is marked as abnormal data.

To achieve this, we define for each data point the interval time, the maximum value, the minimum value, the fluctuation range and the classification attribute, and use these attributes to build the association model.

The classification attribute of the point is the most critical of the analysis attributes, can reflect the characteristic of the relevant change of each data point in the industrial environment, and is the most important basis of data quality analysis. By analyzing the correlation characteristics of the points, some potentially more harmful data quality problems can be found.

The K-means algorithm is a typical distance-based, non-hierarchical clustering algorithm. It divides the data into a preset number of classes K by minimizing an error function, and uses distance as the measure of similarity: the closer two objects are, the more similar they are.

In S2, the data are grouped automatically by a machine learning algorithm and the correlation of the data within each group is calculated; poor correlation within the corresponding group indicates that the detection data contain an abnormality. Cluster analysis groups the samples using only the sample data themselves, with the goal that the objects within a group are related to each other. The greater the similarity within a group and the greater the difference between groups, the better the clustering. Whether points of the same class contain abnormal data is judged by calculating the correlation coefficient of the real-time data within the same class over the most recent period, which effectively identifies abnormal values in the real-time data.
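One way to turn this within-group check into code is sketched below. It assumes that each measuring point in a group contributes an equal-length series of recent readings, and that a point is flagged when its mean absolute Pearson coefficient against the other group members falls below a threshold. The aggregation rule, the threshold of 0.5, and all names are assumptions, not details given in the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation; returns 0.0 for a constant series (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)


def flag_weakly_correlated(group, threshold=0.5):
    """Return names whose mean |r| against the rest of the group is below threshold."""
    flagged = []
    for name, series in group.items():
        others = [abs(pearson_r(series, s)) for other, s in group.items() if other != name]
        if others and sum(others) / len(others) < threshold:
            flagged.append(name)
    return flagged


# Hypothetical group of temperature points that normally rise together.
group = {
    "temp_a": [1, 2, 3, 4, 5],
    "temp_b": [2, 4, 6, 8, 10],
    "temp_c": [1.1, 2.1, 2.9, 4.2, 5.0],
    "temp_d": [3, 1, 3, 1, 3],   # oscillates instead of following the trend
}
flagged = flag_weakly_correlated(group)
```

Here only `temp_d` is flagged, because its readings do not follow the shared trend of the other three points.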

In the evaluation method disclosed by the invention, a monitoring point whose correlation coefficient is lower than the set threshold is a monitoring point that does not meet the correlation requirement. Data quality can also be judged by the timeliness of the monitoring points: a longest refresh interval is set for each monitoring point, and data that exceed this interval are marked as data with poor timeliness. In addition, data quality can be judged by whether a value can be read from the monitoring point: monitoring points with empty values are marked as missing records.
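The timeliness and missing-value checks can be sketched as follows, assuming timestamps expressed in seconds and an empty value represented by `None`. The function name, flag strings and the 60-second refresh limit in the examples are hypothetical.

```python
def quality_flags(value, last_update, now, max_refresh_seconds):
    """Return quality flags for one monitoring point.

    value: latest reading, or None when no value could be read
    last_update, now: timestamps in seconds
    max_refresh_seconds: longest allowed refresh interval
    """
    flags = []
    if value is None:
        flags.append("missing")   # empty value: record is missing
    if now - last_update > max_refresh_seconds:
        flags.append("stale")     # refresh interval exceeded: poor timeliness
    return flags


fresh = quality_flags(3.2, last_update=100, now=120, max_refresh_seconds=60)
stale = quality_flags(3.2, last_update=0, now=120, max_refresh_seconds=60)
missing = quality_flags(None, last_update=0, now=10, max_refresh_seconds=60)
```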

As shown in fig. 6, another embodiment of the present invention further discloses an evaluation system using the above industrial data quality evaluation method based on machine learning, including:

Specifically, the data preprocessing unit includes:

For a specific implementation process of the industrial data quality evaluation system provided by the embodiment of the present invention, reference may be made to the description of the industrial data quality evaluation method in the foregoing method, and details are not described herein again.

In summary, the industrial data quality evaluation system provided by the invention identifies the abnormal data which obviously do not meet the requirements through the preprocessing process, and then identifies the abnormal data which are not mutually associated in the related groups through the association model, so that the data of a plurality of areas in the industrial production process can be conveniently detected, and the abnormal monitoring data can be accurately determined.

Those skilled in the art will appreciate that the present invention includes apparatus directed to performing one or more of the operations described in the present application. These devices may be specially designed and manufactured for the required purposes, or they may comprise known devices in general-purpose computers. These devices have stored therein computer programs that are selectively activated or reconfigured. Such a computer program may be stored in a device (e.g., computer) readable medium, including, but not limited to, any type of disk including floppy disks, hard disks, optical disks, CD-ROMs, and magnetic-optical disks, ROMs (Read-Only memories), RAMs (Random Access memories), EPROMs (Erasable Programmable Read-Only memories), EEPROMs (Electrically Erasable Programmable Read-Only memories), flash memories, magnetic cards, or optical cards, or any type of media suitable for storing electronic instructions, and each coupled to a bus. That is, a readable medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).

It will be understood by those within the art that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. Those skilled in the art will appreciate that the computer program instructions may be implemented by a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, implement the features specified in the block or blocks of the block diagrams and/or flowchart illustrations of the present disclosure.

Those of skill in the art will appreciate that various operations, methods, steps in the processes, acts, or solutions discussed in the present application may be alternated, modified, combined, or deleted. Further, various operations, methods, steps in the flows, which have been discussed in the present application, may be interchanged, modified, rearranged, decomposed, combined, or eliminated. Further, steps, measures, schemes in the various operations, methods, procedures disclosed in the prior art and the present invention can also be alternated, changed, rearranged, decomposed, combined, or deleted.

The functions, if implemented in the form of software-enabled devices and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A machine learning-based industrial data quality evaluation method is characterized by comprising the following steps:

2. The method for evaluating the quality of industrial data based on machine learning according to claim 1, wherein the step S1 includes:

S11: constructing a historical unit evaluation model;

3. The method for evaluating the quality of industrial data based on machine learning according to claim 2, wherein the step S11 includes:

S112: calculating the median, $\operatorname{median}(X)$, of all historical detection data;

4. The method for evaluating the quality of industrial data based on machine learning according to claim 3, wherein the step S12 includes:

5. The method for evaluating the quality of industrial data based on machine learning according to claim 1, wherein the step S2 includes:

S22: evaluating the result of the clustering;

6. The method for evaluating the quality of industrial data based on machine learning according to claim 5, wherein the step S21 includes:

7. The method for evaluating the quality of industrial data based on machine learning according to claim 5, wherein the step S22 includes:

S221: determining purity(X, Y) using the following formula

8. The method for evaluating the quality of industrial data based on machine learning according to claim 5, wherein the step S24 includes:

the correlation coefficient r is determined according to the Pearson correlation coefficient formula as follows:

$0 < |r| < 1$ indicates that a linear correlation of some degree exists:

9. An evaluation system using the machine learning-based industrial data quality evaluation method according to any of claims 1 to 8, comprising:

10. The system for evaluating a machine learning-based industrial data quality assessment method according to claim 9, wherein said data preprocessing unit comprises: