CN112463838A - Industrial data quality evaluation method and system based on machine learning - Google Patents

Industrial data quality evaluation method and system based on machine learning

Info

Publication number
CN112463838A
Authority
CN
China
Prior art keywords
data
detection data
machine learning
abnormal
historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011498693.3A
Other languages
Chinese (zh)
Inventor
樊树盛
贺本彪
苗维杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Rischen Anke Technology Co ltd
Original Assignee
Hangzhou Rischen Anke Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Rischen Anke Technology Co ltd filed Critical Hangzhou Rischen Anke Technology Co ltd
Priority to CN202011498693.3A priority Critical patent/CN112463838A/en
Publication of CN112463838A publication Critical patent/CN112463838A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395 Quality analysis or management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • Software Systems (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Educational Administration (AREA)
  • Mathematical Physics (AREA)
  • Operations Research (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Computing Systems (AREA)
  • Marketing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Factory Administration (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The invention discloses a machine-learning-based method and system for evaluating industrial data quality. The evaluation method comprises the following steps: S1, preprocessing the detection data to exclude abnormal data at individual detection data points; and S2, constructing an association model and judging the preprocessed detection data that meet the requirements, so as to determine abnormal detection data that do not satisfy the association. The preprocessing step identifies data that are obviously abnormal, and the association model then identifies data that are not correlated with the other members of their group, so that data from a plurality of areas in an industrial production process can be checked conveniently and abnormal monitoring data can be determined accurately.

Description

Industrial data quality evaluation method and system based on machine learning
Technical Field
The invention relates to the technical field of industrial information, in particular to an industrial data quality evaluation method based on machine learning.
Background
In industrial production, data from a plurality of areas must be monitored, and the stability of the relevant equipment is determined from the monitoring results. As the number of sensors grows, however, it becomes increasingly important to judge the accuracy of the acquired detection data effectively so that abnormal data can be found.
In the prior art, data are mostly checked by direct comparison. This works well when the data volume is small, but as the data volume keeps increasing it becomes difficult to evaluate the quality of industrial data in the traditional way.
In addition, conventional approaches build separate data monitoring and quality evaluation models for the specific requirements and data types of each industrial operation; a uniform data quality evaluation method that does not depend on the operating environment or the sensor data type is lacking.
Therefore, it is necessary to provide a generally applicable machine-learning-based method for evaluating the quality of industrial data.
Disclosure of Invention
The invention aims to provide an industrial data quality evaluation method based on machine learning that overcomes the defects of the prior art, conveniently checks data from a plurality of areas in an industrial production process, and accurately determines abnormal monitoring data.
The invention provides an industrial data quality evaluation method based on machine learning, which comprises the following steps:
S1: preprocessing the detection data to exclude abnormal data at individual detection data points;
S2: constructing an association model, and judging the preprocessed detection data that meet the requirements, so as to determine abnormal detection data that do not satisfy the association.
The industrial data quality evaluation method based on machine learning as described above, wherein the step S1 may optionally include:
S11: constructing a historical unit evaluation model;
S12: establishing a standard measuring point attribute database through the historical unit evaluation model, and establishing a standard value interval and a standard value variation amplitude interval;
S13: acquiring the attributes of the current detection data and comparing them with the standard measuring point attribute database; judging the current detection data to be out-of-range detection data when they fall outside the standard value interval; and judging them to be detection data with abnormal amplitude fluctuation when they fall outside the variation amplitude interval.
The industrial data quality evaluation method based on machine learning as described above, wherein the step S11 may optionally include:
S111: acquiring the historical detection data $X_1, X_2, X_3, \ldots, X_i, \ldots, X_n$ within a certain period of time;
S112: calculating the median, $\mathrm{median}(X)$, of all the historical detection data;
S113: calculating the absolute deviation $|X_i - \mathrm{median}(X)|$ of each historical observation from the median;
S114: calculating the median of the absolute deviations, $\mathrm{MAD} = \mathrm{median}(|X_i - \mathrm{median}(X)|)$;
S115: dividing the absolute deviation of each historical observation by the MAD to obtain its MAD-based distance $X_m$ from the center of all observations.
The industrial data quality evaluation method based on machine learning as described above, wherein the step S12 may optionally include:
establishing, according to the three-sigma rule $\Pr(\mu - 3\sigma \le X_m \le \mu + 3\sigma) \approx 0.9973$, the standard value interval and the standard value variation amplitude interval that $X_m$ must satisfy, where $\sigma$ denotes the standard deviation and $\mu$ denotes the mean; abnormal data that do not meet the requirement have an $X_m$ value greater than $\mu + 3\sigma$ or smaller than $\mu - 3\sigma$.
The industrial data quality evaluation method based on machine learning as described above, wherein the step S2 may optionally include:
S21: establishing an association model from the historical detection data, and clustering the historical detection data according to the association model to form a plurality of groups;
S22: evaluating the clustering result;
S23: assigning the preprocessed detection data that meet the requirements to the corresponding groups according to the association model;
S24: determining the correlation between each item of preprocessed, qualified detection data assigned to a group and the other data in that group, and identifying abnormal data with poor correlation.
The industrial data quality evaluation method based on machine learning as described above, wherein the step S21 may optionally include:
S211: acquiring the historical detection data, and randomly selecting K objects from the N items of historical detection data as the initial cluster centers;
S212: calculating the distance from each item of detection data to each cluster center, and assigning each item to the nearest cluster;
S213: after all the detection data have been assigned, recalculating the K cluster centers;
S214: comparing the new cluster centers with the K cluster centers from the previous calculation; if the cluster centers have changed, returning to S212, otherwise proceeding to S215;
S215: stopping and outputting the clustering result once the centroids no longer change.
The industrial data quality evaluation method based on machine learning as described above, wherein the step S22 may optionally include:
S221: determining purity(X, Y) using the following formula:
$$\mathrm{purity}(X, Y) = \frac{1}{N} \sum_{k} \max_{i} \left| x_k \cap y_i \right|$$
where $X = (x_1, x_2, \ldots, x_k)$ is the set of clusters and $x_k$ denotes the k-th cluster; $Y = (y_1, y_2, \ldots, y_i)$ denotes the set that needs to be clustered and $y_i$ denotes the i-th cluster object; and $N$ denotes the total number of objects in the clustered set;
S222: evaluating the clustering result according to the value of purity(X, Y).
The industrial data quality evaluation method based on machine learning as described above, wherein the step S24 may optionally include:
determining a correlation coefficient r according to the Pearson correlation coefficient formula
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where $x_i$ and $y_i$ are the paired samples of the two data series and $\bar{x}$, $\bar{y}$ are their means. The correlation coefficient r takes values in the range $-1 \le r \le 1$:
[Figure BDA0002842968470000041]
$0 < |r| < 1$ indicates different degrees of linear correlation:
[Figure BDA0002842968470000042]
and judging data with low linear correlation or no linear correlation to be abnormal data.
The invention also provides an evaluation system that uses the above industrial data quality evaluation method based on machine learning, comprising:
a data preprocessing unit, configured to preprocess the detection data to exclude abnormal data at individual detection data points;
a data reprocessing unit, configured to construct an association model and judge the preprocessed detection data that meet the requirements, so as to determine abnormal detection data that do not satisfy the association.
In the evaluation system as described above, optionally, the data preprocessing unit includes:
a construction module, configured to construct a historical unit evaluation model, build a standard measuring point attribute database through the historical unit evaluation model, and establish a standard value interval and a standard value variation amplitude interval;
an initial judgment module, configured to acquire the attributes of the current detection data, compare them with the standard measuring point attribute database, judge the current detection data to be out-of-range detection data when they fall outside the standard value interval, and judge them to be detection data with abnormal amplitude fluctuation when they fall outside the variation amplitude interval.
Compared with the prior art, the invention establishes a uniform quality evaluation model and thereby provides a data quality evaluation method that does not depend on the operating environment or the sensor data type.
The preprocessing step identifies data that are obviously abnormal, and the association model then identifies data that are not correlated with the other members of their group, so that data from a plurality of areas in an industrial production process can be checked conveniently and abnormal monitoring data can be determined accurately.
Drawings
FIG. 1 is a schematic flow chart diagram of a method for evaluating the quality of industrial data based on machine learning according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of preprocessing detection data to exclude abnormal data of a single detection data point in the industrial data quality evaluation method based on machine learning according to the embodiment of the present invention;
FIG. 3 is a schematic flow chart of a historical unit evaluation model constructed in the industrial data quality evaluation method based on machine learning according to the embodiment of the present invention;
FIG. 4 is a schematic flow chart of the construction of the association model in the industrial data quality evaluation method based on machine learning according to the embodiment of the present invention;
FIG. 5 is a schematic flow chart illustrating a process of establishing an association model according to historical detection data and clustering and grouping the historical detection data according to the association model to form a plurality of groups in the industrial data quality evaluation method based on machine learning according to the embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an evaluation system using an industrial data quality evaluation method based on machine learning according to an embodiment of the present invention;
FIGS. 7-9 compare the outliers in actual detection data at different time points with the outliers detected by the machine-learning-based industrial data quality evaluation method.
Detailed Description
The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
An embodiment of the invention: as shown in fig. 1, a machine-learning-based industrial data quality evaluation method is disclosed. The method learns how the values of temperature, pressure and other numerical measuring points in an industrial environment change, and evaluates the quality of newly generated data. The specific steps are as follows:
S1: preprocessing the detection data to exclude abnormal data at individual detection data points;
S2: constructing an association model, and judging the preprocessed detection data that meet the requirements, so as to determine abnormal detection data that do not satisfy the association.
The method thus screens the data in two steps: obviously abnormal data points that do not meet the requirements are screened out first, and less obvious abnormal data points are then identified by the association model, so that the accuracy of the detection data is determined more reliably.
Specifically, as shown in fig. 2, step S1 may include:
S11: constructing a historical unit evaluation model;
S12: establishing a standard measuring point attribute database through the historical unit evaluation model, and establishing a standard value interval and a standard value variation amplitude interval;
S13: acquiring the attributes of the current detection data and comparing them with the standard measuring point attribute database; judging the current detection data to be out-of-range detection data when they fall outside the standard value interval; and judging them to be detection data with abnormal amplitude fluctuation when they fall outside the variation amplitude interval.
As shown in fig. 3, step S11 may include:
S111: acquiring the historical detection data $X_1, X_2, X_3, \ldots, X_i, \ldots, X_n$ within a certain period of time;
S112: calculating the median, $\mathrm{median}(X)$, of all the historical detection data;
S113: calculating the absolute deviation $|X_i - \mathrm{median}(X)|$ of each historical observation from the median;
S114: calculating the median of the absolute deviations, $\mathrm{MAD} = \mathrm{median}(|X_i - \mathrm{median}(X)|)$;
S115: dividing the absolute deviation of each historical observation by the MAD to obtain its MAD-based distance $X_m$ from the center of all observations.
That is, the historical unit evaluation model of S11 serves to identify single-point outliers using the Seasonal Hybrid ESD (S-H-ESD) time series anomaly detection algorithm. To keep outliers from distorting the identification, this embodiment uses the MAD, a measure that is more robust than the sample variance or the standard deviation σ and that also works for distributions without a mean or variance. The MAD tolerates outliers in the data set better than the standard deviation σ does. The standard deviation uses the squared distance from each data point to the mean, so large deviations carry more weight and outliers have a significant impact on the result. With the MAD, a small number of outliers does not affect the final result.
The historical unit evaluation model of this embodiment identifies abnormal data by converting each specific value into its MAD-based distance Xm from the center of all observations and then judging that distance. In this way, severely deviating data are emphasized and slightly deviating data are de-emphasized, so that data with large deviations are identified effectively and correct data are not misjudged.
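Steps S111 to S115 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `mad_distances`, the sample readings, and the choice to raise an error when the MAD is zero are all introduced here for illustration.

```python
from statistics import median


def mad_distances(values):
    """Return the MAD-based distance Xm of every value (steps S111-S115).

    Hypothetical helper: the name and the zero-MAD handling are assumptions.
    """
    center = median(values)                      # S112: median of the history
    abs_dev = [abs(x - center) for x in values]  # S113: |Xi - median(X)|
    mad = median(abs_dev)                        # S114: median of the deviations
    if mad == 0:
        # More than half of the values equal the median; the ratio is undefined.
        raise ValueError("MAD is zero, distances cannot be computed")
    return [d / mad for d in abs_dev]            # S115: distance in units of MAD


# Illustrative readings (made up); the last one is an obvious outlier.
history = [10, 11, 9, 10, 12, 10, 30]
distances = mad_distances(history)
```

Because the median and the MAD are barely moved by the single reading of 30, that reading ends up far from the center in MAD units while the ordinary readings stay close to it.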
Further, optionally, step S12 in fig. 1 may include:
establishing, according to the three-sigma rule $\Pr(\mu - 3\sigma \le X_m \le \mu + 3\sigma) \approx 0.9973$, the standard value interval and the standard value variation amplitude interval that $X_m$ must satisfy, where $\sigma$ denotes the standard deviation and $\mu$ denotes the mean; abnormal data that do not meet the requirement have an $X_m$ value greater than $\mu + 3\sigma$ or smaller than $\mu - 3\sigma$.
Finally, abnormal $X_m$ values are identified by the three-sigma rule.
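The three-sigma check can be illustrated as follows. The sketch assumes that μ and σ are estimated from historical Xm values (the text does not say how they are obtained) and uses the population standard deviation; the function names and the numbers are hypothetical.

```python
from statistics import mean, pstdev


def three_sigma_interval(xm_history):
    """Return (mu - 3*sigma, mu + 3*sigma) estimated from historical Xm values.

    Assumption: mu and sigma come from the history itself.
    """
    mu = mean(xm_history)
    sigma = pstdev(xm_history)
    return mu - 3 * sigma, mu + 3 * sigma


def is_abnormal(xm, interval):
    """True when Xm lies outside the three-sigma interval."""
    low, high = interval
    return xm < low or xm > high


# Made-up historical Xm values and two new values to judge.
xm_history = [0.0, 0.5, 1.0, 1.5, 2.0, 1.0, 0.5, 1.0]
interval = three_sigma_interval(xm_history)
flag_far = is_abnormal(20.0, interval)   # far from the center
flag_near = is_abnormal(1.2, interval)   # close to the center
```

Note that the interval is only meaningful when the history is reasonably long; with very few samples a single extreme value inflates σ enough to hide itself.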
In S1, data from each monitoring point over a period of time are acquired and stored in a database as historical data describing the attributes of the measuring point. From the historical data, the maximum value, the minimum value, the maximum and minimum variation amplitude per unit time, and the degree of dispersion are determined, and from these the value range and the amplitude fluctuation interval that delimit abnormal data are determined. The latest monitored data are then compared with these intervals to pre-judge the detection data. Whenever the value of a monitoring point changes, its latest data are compared with the attributes derived from the preceding period. Data beyond the maximum and minimum values are marked as out of range, and data whose amplitude variation exceeds the measuring point attribute are marked as amplitude fluctuation abnormal data.
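A simplified sketch of this attribute comparison is given below. It assumes equally spaced samples, so that the change between consecutive readings stands in for the variation amplitude per unit time, and it keeps only three attributes; the class, function, and label names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PointAttributes:
    """Attributes of one measuring point derived from its history (illustrative)."""
    min_value: float
    max_value: float
    max_step: float  # largest absolute change between consecutive samples


def build_attributes(history):
    """Derive the attributes from equally spaced historical readings."""
    steps = [abs(b - a) for a, b in zip(history, history[1:])]
    return PointAttributes(min(history), max(history), max(steps))


def check_sample(previous, current, attrs):
    """Return the labels that apply to the newest reading."""
    labels = []
    if current < attrs.min_value or current > attrs.max_value:
        labels.append("out_of_range")
    if abs(current - previous) > attrs.max_step:
        labels.append("abnormal_fluctuation")
    return labels


attrs = build_attributes([10, 12, 11, 13, 12])  # made-up history
ok = check_sample(12, 12.5, attrs)
jump = check_sample(10.5, 13, attrs)
both = check_sample(12, 20, attrs)
```

A reading can carry both labels at once, as in the last call, where the value both leaves the historical range and changes faster than anything seen before.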
As shown in figs. 7-9, the thick dots are the outliers finally identified by the machine-learning-based industrial data quality evaluation method. Observing the large difference between the identified outliers and the values in their neighborhood, we consider S-H-ESD very useful for identifying outliers in historical data.
As shown in fig. 4, step S2 of the present invention may specifically include:
S21: establishing an association model from the historical detection data, and clustering the historical detection data according to the association model to form a plurality of groups;
S22: evaluating the clustering result;
S23: assigning the preprocessed detection data that meet the requirements to the corresponding groups according to the association model;
S24: determining the correlation between each item of preprocessed, qualified detection data assigned to a group and the other data in that group, and identifying abnormal data with poor correlation.
As shown in fig. 5, step S21 may include:
S211: acquiring the historical detection data, and randomly selecting K objects from the N items of historical detection data as the initial cluster centers;
S212: calculating the distance from each item of detection data to each cluster center, and assigning each item to the nearest cluster;
S213: after all the detection data have been assigned, recalculating the K cluster centers;
S214: comparing the new cluster centers with the K cluster centers from the previous calculation; if the cluster centers have changed, returning to S212, otherwise proceeding to S215;
S215: stopping and outputting the clustering result once the centroids no longer change.
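The loop S211 to S215 is the standard K-means iteration and can be sketched in plain Python. The sketch assumes Euclidean distance and feature vectors given as tuples; the function names, the fixed random seed, the iteration cap, and the handling of empty clusters are assumptions added for illustration.

```python
import math
import random


def mean_point(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(members) for coord in zip(*members))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points into k groups following steps S211-S215."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                  # S211: random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                             # S212: assign to nearest center
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        new_centers = []                             # S213: recompute the centers
        for j, members in enumerate(clusters):
            # An empty cluster keeps its previous center (an assumption).
            new_centers.append(mean_point(members) if members else centers[j])
        if new_centers == centers:                   # S214/S215: stop when unchanged
            break
        centers = new_centers
    return centers, clusters


# Two well-separated made-up groups of two-dimensional feature vectors.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, clusters = kmeans(points, k=2)
```

With groups this far apart the iteration settles on the two obvious clusters; on less clean data the result depends on the random initial centers, which is why the clustering is evaluated afterwards in step S22.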
Step S22 may include:
S221: determining purity(X, Y) using the following formula:
$$\mathrm{purity}(X, Y) = \frac{1}{N} \sum_{k} \max_{i} \left| x_k \cap y_i \right|$$
where $X = (x_1, x_2, \ldots, x_k)$ is the set of clusters and $x_k$ denotes the k-th cluster; $Y = (y_1, y_2, \ldots, y_i)$ denotes the set that needs to be clustered and $y_i$ denotes the i-th cluster object; and $N$ denotes the total number of objects in the clustered set;
S222: evaluating the clustering result according to the value of purity(X, Y).
The purity measure is convenient to calculate. Its value lies between 0 and 1: a completely wrong clustering gives 0 and a completely correct clustering gives 1. In practice the value falls somewhere between 0 and 1.
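Purity is commonly computed by counting, for each cluster, the objects that belong to its most frequent reference class and dividing the total by N. The sketch below follows that common definition; it assumes a reference labelling is available for the evaluated objects, which the text does not describe, and the function name and labels are hypothetical.

```python
from collections import Counter


def purity(cluster_labels, reference_labels):
    """Fraction of objects that fall in the majority class of their cluster."""
    counts_by_cluster = {}
    for cluster, reference in zip(cluster_labels, reference_labels):
        counts_by_cluster.setdefault(cluster, Counter())[reference] += 1
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(reference_labels)


# Made-up example: six objects, two clusters, two reference classes.
cluster_labels = [0, 0, 0, 1, 1, 1]
reference_labels = ["a", "a", "b", "b", "b", "b"]
score = purity(cluster_labels, reference_labels)
perfect = purity([0, 0, 1, 1], ["a", "a", "b", "b"])
```

In the first example five of the six objects sit in the majority class of their cluster, so the score is 5/6; the second example is a clustering that matches the reference exactly.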
Step S24 may include:
determining a correlation coefficient r according to the Pearson correlation coefficient formula
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where $x_i$ and $y_i$ are the paired samples of the two data series and $\bar{x}$, $\bar{y}$ are their means. The correlation coefficient r takes values in the range $-1 \le r \le 1$:
[Figure BDA0002842968470000083]
$0 < |r| < 1$ indicates different degrees of linear correlation:
[Figure BDA0002842968470000084]
and judging data with low linear correlation or no linear correlation to be abnormal data.
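One way to turn the correlation check of S24 into code is sketched below. The Pearson coefficient is computed from its textbook definition; how it is aggregated within a group is not specified in the text, so averaging |r| against the other members of the group and the threshold of 0.5 are assumptions, as are all names and sample values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def flag_low_correlation(group, threshold=0.5):
    """Return the points whose mean |r| with the rest of the group is below threshold.

    Assumed aggregation rule; the threshold is a hypothetical setting.
    """
    flagged = []
    for name, series in group.items():
        others = [s for other, s in group.items() if other != name]
        score = sum(abs(pearson_r(series, s)) for s in others) / len(others)
        if score < threshold:
            flagged.append(name)
    return flagged


# Made-up group of four measuring points; "t4" does not follow the others.
group = {
    "t1": [1, 2, 3, 4, 5],
    "t2": [2, 4, 6, 8, 10],
    "t3": [1.1, 2.0, 3.2, 3.9, 5.1],
    "t4": [3, 1, 4, 1, 5],
}
flagged = flag_low_correlation(group)
```

The three series that rise together keep a high average correlation even though each is compared with the deviating point, so only that point falls below the threshold.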
Step S2 identifies detection points whose data satisfy the requirements of step S1 but are nevertheless abnormal. Specifically, a relatively complex association model is first established: the historical data of each measuring point in the industrial control environment are turned into corresponding attributes dynamically by a machine learning method, and the association model is defined dynamically from the acquired real-time data.
The association model is used to check whether a new measured value is a plausible value for the corresponding measuring point; if it is not, the value is marked as abnormal data.
To this end, we define for each data point the attributes of interval time, maximum value, minimum value, fluctuation range and classification, and use them to build the association model.
The classification attribute of a point is the most critical of these attributes: it reflects how each data point varies in relation to the others in the industrial environment and is the most important basis for data quality analysis. Analyzing the correlation characteristics of the points can reveal potentially more harmful data quality problems.
The K-means algorithm is a typical distance-based, non-hierarchical clustering algorithm. It divides the data into a preset number K of classes by minimizing an error function, and uses distance as the measure of similarity: the closer two objects are, the more similar they are.
In S2, the data are grouped automatically by a machine learning algorithm and the correlation of the data within each group is obtained; poor correlation within the corresponding group indicates that the detection data contain an abnormality. Cluster analysis groups samples using only the sample data themselves, with the goal that the objects within a group are related to one another. The greater the similarity within groups and the greater the difference between groups, the better the clustering. Whether points of the same class contain abnormal data is judged by calculating the correlation coefficients of the real-time data within the same class over the most recent period, which effectively identifies outliers in the real-time data.
In the evaluation method disclosed by the invention, a monitoring point whose correlation coefficient is lower than the set threshold is a monitoring point that does not meet the correlation requirement. Data quality can also be judged by the timeliness of the monitoring points: a maximum refresh interval is set for each monitoring point, and data that exceed it are marked as data with poor timeliness. Data quality can further be judged by whether a monitoring point has a value: monitoring points with empty values are marked as missing-record data.
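The timeliness and missing-value checks amount to two simple comparisons, sketched here. The function name, the label strings, and the 30-second refresh interval are hypothetical; a missing value is represented by `None`.

```python
from datetime import datetime, timedelta


def quality_marks(value, last_update, now, max_refresh):
    """Return the quality labels for one monitoring point (illustrative)."""
    marks = []
    if value is None:                      # the point has no value at all
        marks.append("missing_record")
    if now - last_update > max_refresh:    # not refreshed within the allowed interval
        marks.append("poor_timeliness")
    return marks


now = datetime(2021, 1, 1, 12, 0, 0)
max_refresh = timedelta(seconds=30)        # assumed longest refresh interval
fresh = quality_marks(21.5, now - timedelta(seconds=5), now, max_refresh)
stale = quality_marks(21.5, now - timedelta(minutes=5), now, max_refresh)
empty = quality_marks(None, now - timedelta(minutes=5), now, max_refresh)
```

The two checks are independent, so a point that has stopped reporting can be marked both as missing and as stale.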
As shown in fig. 6, another embodiment of the invention discloses an evaluation system that uses the above industrial data quality evaluation method based on machine learning, comprising:
a data preprocessing unit, configured to preprocess the detection data to exclude abnormal data at individual detection data points;
a data reprocessing unit, configured to construct an association model and judge the preprocessed detection data that meet the requirements, so as to determine abnormal detection data that do not satisfy the association.
Specifically, the data preprocessing unit includes:
a construction module, configured to construct a historical unit evaluation model, build a standard measuring point attribute database through the historical unit evaluation model, and establish a standard value interval and a standard value variation amplitude interval;
an initial judgment module, configured to acquire the attributes of the current detection data, compare them with the standard measuring point attribute database, judge the current detection data to be out-of-range detection data when they fall outside the standard value interval, and judge them to be detection data with abnormal amplitude fluctuation when they fall outside the variation amplitude interval.
For the specific implementation of the industrial data quality evaluation system provided by this embodiment, reference may be made to the description of the industrial data quality evaluation method above; details are not repeated here.
In summary, the industrial data quality evaluation system provided by the invention identifies data that are obviously abnormal through the preprocessing step, and then identifies, through the association model, data that are not correlated with the other members of their group, so that data from a plurality of areas in an industrial production process can be checked conveniently and abnormal monitoring data can be determined accurately.
Those skilled in the art will appreciate that the present invention includes apparatus directed to performing one or more of the operations described in the present application. These devices may be specially designed and manufactured for the required purposes, or they may comprise known devices in general-purpose computers. These devices have stored therein computer programs that are selectively activated or reconfigured. Such a computer program may be stored in a device (e.g., computer) readable medium, including, but not limited to, any type of disk including floppy disks, hard disks, optical disks, CD-ROMs, and magnetic-optical disks, ROMs (Read-Only memories), RAMs (Random Access memories), EPROMs (Erasable Programmable Read-Only memories), EEPROMs (Electrically Erasable Programmable Read-Only memories), flash memories, magnetic cards, or optical cards, or any type of media suitable for storing electronic instructions, and each coupled to a bus. That is, a readable medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).
It will be understood by those within the art that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. Those skilled in the art will appreciate that the computer program instructions may be implemented by a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, implement the features specified in the block or blocks of the block diagrams and/or flowchart illustrations of the present disclosure.
Those skilled in the art will appreciate that the various operations, methods, and steps, measures, and schemes in the processes discussed in the present application may be alternated, modified, combined, or deleted. Further, the steps, measures, and schemes in the various operations, methods, and processes discussed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted. Further, the steps, measures, and schemes in the various operations, methods, and processes disclosed in the prior art and in the present invention may also be alternated, modified, rearranged, decomposed, combined, or deleted.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The above describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope of the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims (10)

1. A machine learning-based industrial data quality evaluation method, characterized by comprising the following steps:
S1: preprocessing the detection data to eliminate abnormal data at individual detection data points;
S2: constructing an association model and evaluating the preprocessed detection data that meets the requirements, to determine abnormal detection data that does not satisfy the association.
2. The method for evaluating the quality of industrial data based on machine learning according to claim 1, wherein the step S1 includes:
S11: constructing a historical unit evaluation model;
S12: establishing a standard measuring point attribute database through the historical unit data evaluation model, and establishing a standard value interval and a standard value variation amplitude interval;
S13: acquiring the attributes of the current detection data and comparing them with the standard measuring point attribute database; judging the current detection data to be out-of-value-range detection data when it exceeds the standard value interval; and judging the current detection data to be amplitude-fluctuation-abnormal detection data when it exceeds the variation amplitude interval.
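The comparison in step S13 can be illustrated with a minimal Python sketch. It is not taken from the application: the function name `check_reading`, the flag strings, and the reading of "variation amplitude" as the change relative to the previous reading are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound), both inclusive


def check_reading(value: float,
                  previous: Optional[float],
                  value_interval: Interval,
                  amplitude_interval: Interval) -> List[str]:
    """Return the abnormality flags raised by one reading (hypothetical helper).

    Assumption: the variation amplitude is the difference between the
    current reading and the previous one.
    """
    flags: List[str] = []

    # Value-range check against the standard value interval.
    low, high = value_interval
    if not (low <= value <= high):
        flags.append("out_of_value_range")

    # Amplitude check against the standard variation amplitude interval.
    if previous is not None:
        change = value - previous
        amp_low, amp_high = amplitude_interval
        if not (amp_low <= change <= amp_high):
            flags.append("abnormal_amplitude_fluctuation")

    return flags
```

A reading can raise both flags at once, which is why the sketch returns a list rather than a single label.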
3. The method for evaluating the quality of industrial data based on machine learning according to claim 2, wherein the step S11 includes:
S111: acquiring historical detection data X1, X2, X3, ..., Xi, ..., Xn within a certain period of time;
S112: calculating the median, median(X), of all historical detection data;
S113: calculating the absolute deviation |Xi - median(X)| of each historical observation datum from the median;
S114: calculating the median of the absolute deviations, MAD = median(|Xi - median(X)|);
S115: dividing the absolute deviation of each historical observation datum by the MAD to obtain its MAD-based distance Xm from the center of all observation data.
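Steps S112 to S115 can be sketched in Python as follows. This is only an illustrative sketch: the function name `mad_distances` and the handling of a zero MAD are assumptions, since the claim does not say what happens when the MAD is zero.

```python
from statistics import median
from typing import List, Sequence


def mad_distances(history: Sequence[float]) -> List[float]:
    """Return the MAD-based distance Xm of every historical value."""
    center = median(history)                          # S112: median(X)
    abs_dev = [abs(x - center) for x in history]      # S113: |Xi - median(X)|
    mad = median(abs_dev)                             # S114: MAD
    if mad == 0:
        # Assumption: a zero MAD (more than half the values identical)
        # is reported as an error rather than dividing by zero.
        raise ValueError("MAD is zero; distances are undefined")
    return [d / mad for d in abs_dev]                 # S115: Xm


# Usage: the last value lies far from the center of the others.
print(mad_distances([10, 11, 9, 10, 12, 50]))
# [0.5, 0.5, 1.5, 0.5, 1.5, 39.5]
```

Because both the center and the scale are medians, a single extreme value such as 50 barely shifts them, so its distance stands out clearly.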
4. The method for evaluating the quality of industrial data based on machine learning according to claim 3, wherein the step S12 includes:
establishing, according to the three-sigma rule Pr(μ - 3σ ≤ Xm ≤ μ + 3σ) ≈ 0.9973, the standard value interval and the standard value variation amplitude interval within which Xm meets the requirement; wherein σ denotes the standard deviation and μ denotes the mean; abnormal data that does not meet the requirement has an Xm value greater than μ + 3σ or less than μ - 3σ.
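The three-sigma interval of claim 4 can be sketched in Python as below. The function names are hypothetical, and using the population standard deviation (`statistics.pstdev`) is an assumption; the claim does not state whether the population or the sample form is intended.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def three_sigma_interval(xm_values: Sequence[float]) -> Tuple[float, float]:
    """Return (mu - 3*sigma, mu + 3*sigma) for the historical Xm values."""
    mu = mean(xm_values)
    sigma = pstdev(xm_values)  # assumption: population standard deviation
    return mu - 3 * sigma, mu + 3 * sigma


def outside_three_sigma(xm: float, interval: Tuple[float, float]) -> bool:
    """True when Xm is greater than mu + 3*sigma or less than mu - 3*sigma."""
    lower, upper = interval
    return xm < lower or xm > upper


# Usage: build the interval from historical Xm values, then test new ones.
interval = three_sigma_interval([0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.0])
print(outside_three_sigma(1.5, interval), outside_three_sigma(5.0, interval))
```

The 0.9973 probability quoted in the claim holds for normally distributed values; for other distributions the interval is still usable as a threshold, but the coverage differs.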
5. The method for evaluating the quality of industrial data based on machine learning according to claim 1, wherein the step S2 includes:
S21: establishing an association model according to the historical detection data, and clustering the historical detection data according to the association model to form a plurality of groups;
S22: evaluating the result of the clustering;
S23: assigning the preprocessed detection data that meets the requirements to the corresponding groups according to the association model;
S24: determining the correlation between the preprocessed, qualified detection data assigned to a group and the other data in that group, and finding the abnormal data with poor correlation.
6. The method for evaluating the quality of industrial data based on machine learning according to claim 5, wherein the step S21 includes:
S211: acquiring historical detection data, and randomly selecting K objects from the N pieces of historical detection data as initial cluster centers;
S212: calculating the distance from each piece of detection data to each cluster center, and assigning each piece of detection data to the cluster whose center is closest;
S213: after all the detection data have been assigned, recalculating the K cluster centers;
S214: comparing the recalculated cluster centers with the K cluster centers obtained in the previous calculation; if the cluster centers have changed, returning to S212, otherwise proceeding to S215;
S215: stopping and outputting the clustering result when the cluster centers no longer change.
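Steps S211 to S215 describe the standard K-means iteration, which can be sketched in plain Python as follows. The names `k_means` and `squared_distance`, the seed parameter, the iteration cap, the use of squared Euclidean distance, and the rule of keeping the old center for an empty cluster are assumptions for illustration; the claim fixes none of them.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: Sequence[Point], k: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Return (cluster label of each point, final cluster centers)."""
    rng = random.Random(seed)
    centers = rng.sample(list(points), k)      # S211: K random objects
    labels = [0] * len(points)
    for _ in range(max_iter):                  # assumption: iteration cap
        # S212: assign every point to its closest cluster center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # S213: recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # empty cluster: keep center
        # S214 / S215: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Usage: two well-separated groups of two-dimensional readings.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(data, k=2)
print(labels)
```

Because the initial centers are chosen at random, the numeric label given to each group may differ between seeds even when the grouping itself is the same.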
7. The method for evaluating the quality of industrial data based on machine learning according to claim 5, wherein the step S22 includes:
S221: determining purity(X, Y) using the following formula:
Figure FDA0002842968460000021
where X = (x1, x2, ..., xk) is the set of clusters, and xk denotes the k-th cluster; Y = (y1, y2, ..., yi) denotes the set to be clustered, and yi denotes the i-th clustering object; N denotes the total number of objects in the clustered set;
S222: evaluating the result of the clustering according to the value range of purity(X, Y).
8. The method for evaluating the quality of industrial data based on machine learning according to claim 5, wherein the step S24 includes:
determining the correlation coefficient r according to the Pearson correlation coefficient formula as follows:
Figure FDA0002842968460000031
wherein the value range of the correlation coefficient r is -1 ≤ r ≤ 1:
Figure FDA0002842968460000032
0 < |r| < 1 indicates that there are different degrees of linear correlation:
Figure FDA0002842968460000033
and judging data with low linear correlation or no linear correlation to be abnormal data.
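The correlation check of claim 8 can be sketched in Python using the standard Pearson correlation coefficient. The formula and the grading of |r| appear only as images in the source, so the threshold of 0.3 used below for "low linear correlation" is a hypothetical value chosen for illustration, as are the function names.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Standard Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


def has_poor_correlation(r: float, threshold: float = 0.3) -> bool:
    """True when |r| is below the (hypothetical) low-correlation threshold."""
    return abs(r) < threshold


# Usage: compare one measuring point with another in the same group.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(r, has_poor_correlation(r))
```

The coefficient only measures linear association, so two series with a strong nonlinear relationship could still be flagged by this check.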
9. An evaluation system using the machine learning-based industrial data quality evaluation method according to any one of claims 1 to 8, comprising:
a data preprocessing unit, used for preprocessing the detection data to eliminate abnormal data at individual detection data points;
a data reprocessing unit, used for constructing an association model and evaluating the preprocessed detection data that meets the requirements, to determine abnormal detection data that does not satisfy the association.
10. The evaluation system using the machine learning-based industrial data quality evaluation method according to claim 9, wherein the data preprocessing unit comprises:
a construction module, used for constructing a historical unit evaluation model, establishing a standard measuring point attribute database through the historical unit data evaluation model, and establishing a standard value interval and a standard value variation amplitude interval;
an initial judgment module, used for acquiring the attributes of the current detection data and comparing them with the standard measuring point attribute database; judging the current detection data to be out-of-value-range detection data when it exceeds the standard value interval; and judging the current detection data to be amplitude-fluctuation-abnormal detection data when it exceeds the variation amplitude interval.
CN202011498693.3A 2020-12-18 2020-12-18 Industrial data quality evaluation method and system based on machine learning Pending CN112463838A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011498693.3A CN112463838A (en) 2020-12-18 2020-12-18 Industrial data quality evaluation method and system based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011498693.3A CN112463838A (en) 2020-12-18 2020-12-18 Industrial data quality evaluation method and system based on machine learning

Publications (1)

Publication Number Publication Date
CN112463838A true CN112463838A (en) 2021-03-09

Family

ID=74804751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011498693.3A Pending CN112463838A (en) 2020-12-18 2020-12-18 Industrial data quality evaluation method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN112463838A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066555A (en) * 2017-03-26 2017-08-18 天津大学 Towards the online topic detection method of professional domain
CN109118079A (en) * 2018-08-07 2019-01-01 山东纬横数据科技有限公司 A kind of manufacturing industry product quality data relation analysis method
CN109818942A (en) * 2019-01-07 2019-05-28 微梦创科网络科技(中国)有限公司 A kind of user account number method for detecting abnormality and device based on temporal aspect

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210309
