CN117591501A - Data cleaning method, device, equipment and storage medium - Google Patents

Data cleaning method, device, equipment and storage medium

Info

Publication number
CN117591501A
Authority
CN
China
Prior art keywords
data
target
preset
determining
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311562034.5A
Other languages
Chinese (zh)
Inventor
杜旭辉 (Du Xuhui)
金润枫 (Jin Runfeng)
曹亮军 (Cao Liangjun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Kostal Huayang Automotive Electric Co Ltd
Kostal Shanghai Mechatronic Co Ltd
Original Assignee
Shanghai Kostal Huayang Automotive Electric Co Ltd
Kostal Shanghai Mechatronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Kostal Huayang Automotive Electric Co Ltd, Kostal Shanghai Mechatronic Co Ltd filed Critical Shanghai Kostal Huayang Automotive Electric Co Ltd
Priority to CN202311562034.5A
Publication of CN117591501A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • General Factory Administration (AREA)

Abstract

The invention discloses a data cleaning method, device, equipment and storage medium, relating to the technical field of data processing, and comprising the following steps: acquiring temperature data of wave soldering equipment, and judging whether the temperature data follows a normal distribution; if the temperature data follows a normal distribution, determining the temperature data as target data, calculating a confidence interval based on the target data, determining the target data within the confidence interval as target import data, and importing the target import data into an execution management system; calculating a threshold based on the target import data and a preset quantile by using the execution management system, and determining a preset threshold interval based on the threshold; and executing a preset data deletion operation based on the preset threshold interval to obtain cleaned data, and executing a preset data filling operation based on an automatic data filling function and the cleaned data to obtain final data. Because the threshold interval is calculated using quantiles, the interval between the upper and lower threshold limits is widened and the misjudgment rate is reduced, while the automatic data filling function improves the accuracy and processing efficiency of stream data cleaning.

Description

Data cleaning method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data cleaning method, apparatus, device, and storage medium.
Background
With the rapid development of big data and the internet, data volume has grown explosively. In the data analysis of digitalized production processes in the automobile manufacturing industry, raw data often needs to be cleaned to ensure data quality and accuracy. When processing stream data, traditional data cleaning methods can damage or lose part of the information and cannot meet the efficiency and accuracy requirements of wave soldering data cleaning in the electronics manufacturing industry. Existing data cleaning algorithms have many limitations for MES (Manufacturing Execution System) systems and for the big data and stream data of electronics manufacturing lines. For example, filling missing values with a common column average implicitly assumes independence between covariates. In addition, when the common quantile method is used to identify abnormal values, data in the electronics manufacturing industry typically follow continuous distributions such as the normal or exponential distribution; with the traditional Q1/4 and Q3/4 quantiles, the spacing between the upper and lower thresholds becomes too narrow once the data volume is large (above roughly two thousand points), so that many data points within the normal distribution function and on its boundary are misjudged as abnormal values, finally causing the problem of over-deletion.
Disclosure of Invention
Accordingly, the present invention is directed to a data cleaning method, apparatus, device and storage medium, which can reduce the false judgment rate and improve the accuracy and processing efficiency of stream data cleaning. The specific scheme is as follows:
in a first aspect, the invention discloses a data cleaning method, comprising the following steps:
acquiring temperature data of wave soldering equipment, and judging whether the temperature data obeys normal distribution;
if the temperature data obeys normal distribution, determining the temperature data as target data, calculating a confidence interval based on the target data, and determining the target data positioned in the confidence interval as target import data;
importing the target import data into an execution management system, calculating a threshold value based on the target import data and a preset quantile by using the execution management system, and determining a preset threshold value interval based on the threshold value;
and executing a preset data deleting operation based on the preset threshold interval to obtain cleaned data, and executing a preset data filling operation based on an automatic data filling function and the cleaned data to obtain final data.
Optionally, the acquiring temperature data of the wave soldering device and determining whether the temperature data obeys a normal distribution includes:
acquiring temperature data of a preset number of preheating temperature areas from a memory of the wave soldering equipment at regular intervals;
judging whether the data type of the temperature data is numerical by using a preset data type detection function;
determining the temperature data with the data type not being the numerical value type as data to be corrected, correcting the data to be corrected by using a preset data debugging script, and obtaining corrected data;
determining the corrected data as new temperature data, and re-entering the step of judging whether the data type of the temperature data is numerical by using a preset data type detection function;
and if the data type of the temperature data is the numerical type, executing a preset distribution fitting operation on the temperature data to judge whether the temperature data obeys the normal distribution.
Optionally, the calculating a confidence interval based on the target data, determining the target data inside the confidence interval as target import data, includes:
determining parameters of the normal distribution based on the target data; the parameters include standard deviation and mean;
calculating probability density function values of the normal distribution;
based on the probability density function value and the parameter, calculating by using a Wald method and/or a likelihood method to obtain a calculation result;
determining a minimum value in the calculation result as a first threshold lower limit, determining a maximum value in the calculation result as a first threshold upper limit, and determining the confidence interval based on the first threshold lower limit and the first threshold upper limit;
and sequentially judging whether the target data is positioned in the confidence interval or not, and determining the target data positioned in the confidence interval as the target import data.
Optionally, the importing the target import data into an execution management system, so as to calculate, by using the execution management system, a threshold based on the target import data and a preset quantile, and determine a preset threshold interval based on the threshold, including:
importing the target import data into an execution management system so as to judge whether the data type of the target import data is the numerical value type or not by utilizing the execution management system;
if the data type of the target import data is the numerical value type, calculating a second threshold upper limit and a second threshold lower limit based on the target import data, 10% quantiles and 90% quantiles;
and determining the preset threshold interval based on the second upper threshold limit and the second lower threshold limit.
Optionally, after the importing the target import data into the execution management system to calculate a threshold based on the target import data and a preset quantile by using the execution management system and determine a preset threshold interval based on the threshold, the method further includes:
judging whether the target import data is located in the preset threshold interval or not;
if the target import data is located outside the preset threshold interval, marking the target import data as an abnormal value;
correspondingly, the performing the preset data deleting operation based on the preset threshold interval to obtain the cleaned data includes:
deleting all the data marked as the abnormal value in the target import data to obtain the cleaned data, and storing the cleaned data.
Optionally, the performing a preset data filling operation based on the automatic data filling function and the cleaned data to obtain final data includes:
designating all columns from the whole data matrix table, and determining a vacant position without the cleaned data in each column;
and filling target filling data to the vacant positions by using a filling model and based on a target lower limit value and a target upper limit value, and obtaining the final data based on the cleaned data and the target filling data.
Optionally, before the performing the preset data filling operation based on the automatic data filling function and the cleaned data to obtain the final data, the method further includes:
dividing the cleaned data into a training set, a verification set and a test set, and executing centralized processing and unified scaling processing on the training set, the verification set and the test set to obtain a processed training set, a processed verification set and a processed test set;
determining the vacant positions in each column of data in each processed training set, and filling the target filling data in the vacant positions to obtain a filled training set;
judging whether a preset adjusting parameter meets the requirement of a user or not based on the target filling data; wherein the preset adjustment parameters comprise a dimension upper limit and a maximum iteration number;
if the preset adjustment parameters meet the user requirements, fitting the filling model based on the training set after filling and the preset adjustment parameters;
calculating the bias value of the filling model by using the processed verification set, and minimizing the bias value to obtain a processed bias value;
performing rank reduction on the processed bias value to obtain the target lower limit value;
and executing the rank reduction on the data in the processed test set to obtain the target upper limit value.
In a second aspect, the present invention discloses a data cleaning device comprising:
the normal distribution judging module is used for acquiring temperature data of the wave soldering equipment and judging whether the temperature data obeys normal distribution;
the confidence interval judging module is used for determining the temperature data as target data if the temperature data obeys normal distribution, calculating a confidence interval based on the target data and determining the target data positioned in the confidence interval as target import data;
the threshold interval determining module is used for importing the target import data into an execution management system, calculating a threshold value based on the target import data and a preset quantile by utilizing the execution management system, and determining a preset threshold interval based on the threshold value;
the data cleaning module is used for executing a preset data deleting operation based on the preset threshold interval so as to obtain cleaned data;
and the data filling module is used for executing a preset data filling operation based on the automatic data filling function and the cleaned data to obtain final data.
In a third aspect, the present invention discloses an electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the data cleansing method as disclosed previously.
In a fourth aspect, the present invention discloses a computer-readable storage medium for storing a computer program; wherein the computer program, when executed by a processor, implements a data cleansing method as disclosed previously.
It can be seen that the present invention provides a data cleansing method, comprising: acquiring temperature data of wave soldering equipment, and judging whether the temperature data obeys normal distribution; if the temperature data obeys normal distribution, determining the temperature data as target data, calculating a confidence interval based on the target data, and determining the target data positioned in the confidence interval as target import data; importing the target import data into an execution management system, calculating a threshold value based on the target import data and a preset quantile by using the execution management system, and determining a preset threshold value interval based on the threshold value; and executing a preset data deleting operation based on the preset threshold interval to obtain cleaned data, and executing a preset data filling operation based on an automatic data filling function and the cleaned data to obtain final data. Therefore, the method and the device extract the target data obeying normal distribution by judging and classifying the data, identify the abnormal value of the data based on the quantile value, reduce the misjudgment rate, automatically fill the abnormal value and the missing value based on the automatic data filling function, and improve the accuracy and the processing efficiency of stream data cleaning.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a data cleaning method disclosed by the invention;
FIG. 2 is a flow chart of a specific data cleaning method disclosed in the present invention;
FIG. 3 is a flow chart of a specific data cleaning method disclosed in the present invention;
FIG. 4 is a flow chart of an automatic data filling method disclosed by the invention;
FIG. 5 is a schematic diagram of a data population model and a subsequent prediction model of the stream processing method of the present disclosure;
FIG. 6 is a schematic diagram of the failure self-diagnosis principle of the temperature sensor disclosed by the invention;
FIG. 7 is a schematic diagram of a data washer according to the present invention;
fig. 8 is a block diagram of an electronic device according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Currently, existing data cleaning algorithms have many limitations for MES systems and for the big data and stream data of electronic manufacturing lines. For example, filling missing values with a common column average implicitly assumes independence between covariates. In addition, when the common quantile method is used to identify abnormal values, data in the electronic manufacturing industry typically follow continuous distributions such as the normal or exponential distribution; with the traditional Q1/4 and Q3/4 quantiles, the spacing between the upper and lower thresholds becomes too narrow once the data volume is large (above roughly two thousand points), so that many data points within the normal distribution function and on its boundary are misjudged as abnormal values, finally causing the problem of over-deletion. Therefore, the invention provides a data cleaning method which can reduce the misjudgment rate and improve the accuracy and processing efficiency of stream data cleaning.
The embodiment of the invention discloses a data cleaning method, which is shown in fig. 1 and comprises the following steps:
step S11: and acquiring temperature data of wave soldering equipment, and judging whether the temperature data obeys normal distribution.
In this embodiment, temperature data of the wave soldering apparatus is obtained, and whether the temperature data is compliant with normal distribution is determined. Specifically, temperature data of a preset number of preheating temperature areas are obtained from a memory of wave soldering equipment at regular time; judging whether the data type of the temperature data is numerical by using a preset data type detection function; determining the temperature data with the data type not being the numerical value type as data to be corrected, correcting the data to be corrected by using a preset data debugging script, and obtaining corrected data; determining the corrected data as new temperature data, and re-entering the step of judging whether the data type of the temperature data is numerical by using a preset data type detection function; and if the data type of the temperature data is the numerical type, executing a preset distribution fitting operation on the temperature data to judge whether the temperature data obeys the normal distribution.
For example, when the wave soldering production line is stopped multiple times due to sensor, clamp and equipment faults, the temperature monitoring data of the wave soldering preheating temperature zone output by the digital MES system becomes abnormal, and a data set consisting of hundreds of thousands of data points per day needs to be identified, classified and cleaned. Therefore, the invention identifies the data distribution of the preheating temperature zone in the process design stage, confirms whether the data follows a lognormal distribution, and then exports the hundreds of thousands of data points from the MES system into a more general format, such as CSV (comma-separated values), using Python and JMP. Specifically, temperature data of 10 preheating temperature zones (125 data points per zone, 1250 data points in total) are exported from the memory of the wave soldering equipment every 50 milliseconds, named "XXXXY_XXMXXD_XXXXXXXX_preheating_temperature.csv", and automatically saved on the server. The necessary libraries are imported using Python, including NumPy (Numerical Python, an extension library of the Python language) for numerical calculation and SciPy for scientific calculation, and Pandas is used to convert the data into a clean and ordered DataFrame.
The CSV file is read through Python, and a DataFrame is created for the preheating temperature data of the temperature zone. The isinstance() function is used to check whether the data types of all temperature data conform to the numerical type; if temperature data that do not conform to the numerical type exist, a prompt is output and recorded in a log file. At the same time, automatic debugging is realized by a script written in Python, which automatically corrects the temperature data that do not conform to the format. After the data types of the temperature data are corrected, the isinstance() check is rerun for verification, and if the verification passes, the data types are judged to conform to the numerical type. When the data types conform to the numerical type, distribution fitting is performed on, for example, the 1250 data points to judge whether they follow a lognormal distribution.
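As an illustration of the type check and distribution-fitting step described above, a minimal Python sketch is given below; the function names, the use of the Shapiro-Wilk test as the fitting criterion, and the significance level are illustrative assumptions rather than details from the patent.

```python
import numpy as np
import pandas as pd
from scipy import stats

def column_is_numeric(series):
    """Check every entry of a column with isinstance(), as in the text."""
    return all(isinstance(v, (int, float, np.number)) for v in series)

def fits_lognormal(values, alpha=0.05):
    """Judge whether the data follows a lognormal distribution.

    A lognormal sample has normally distributed logarithms, so a
    normality test on log(values) serves as the fitting criterion here.
    """
    logs = np.log(np.asarray(values, dtype=float))
    _, p_value = stats.shapiro(logs)  # Shapiro-Wilk normality test
    return p_value > alpha

# Reading the exported file into a DataFrame (file name is the
# placeholder pattern from the text):
# df = pd.read_csv("XXXXY_XXMXXD_XXXXXXXX_preheating_temperature.csv")
```

Data that fails the fit would then be handed to the exception-handling path described below, while numeric, well-fitted columns proceed to the confidence-interval step.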
Step S12: and if the temperature data obeys normal distribution, determining the temperature data as target data, calculating a confidence interval based on the target data, and determining the target data positioned in the confidence interval as target import data.
In this embodiment, temperature data of a wave soldering apparatus is obtained, whether the temperature data is subjected to normal distribution is determined, if the temperature data is subjected to normal distribution, the temperature data is determined to be target data, a confidence interval is calculated based on the target data, and the target data located in the confidence interval is determined to be target import data. Specifically, determining parameters of the normal distribution based on the target data; the parameters include standard deviation and mean; calculating probability density function values of the normal distribution; based on the probability density function value and the parameter, calculating by using a wald method and/or a likelihood method to obtain a calculation result; determining a minimum value in the calculation result as a first threshold lower limit, determining a maximum value in the calculation result as a first threshold upper limit, and determining the confidence interval based on the first threshold lower limit and the first threshold upper limit; and sequentially judging whether the target data is positioned in the confidence interval or not, and determining the target data positioned in the confidence interval as the target import data.
It will be appreciated that if the data does not fit the lognormal distribution, it is marked as outlier data for subsequent processing; for example, a PID (Proportional Integral Derivative control) outlier handler is invoked for the outlier data to automatically diagnose and debug it. If the temperature data follows the normal distribution, the temperature data is determined as target data, and a 95% confidence interval is calculated using the Wald and/or likelihood method according to the lognormal temperature curve fitted from the standard temperature values of all (e.g. 10) selected preheating temperature zones. According to the data volume, the minimum of the lower limit values is automatically determined as the first threshold lower limit, and the maximum of the upper limit values is determined as the first threshold upper limit; the confidence interval is determined based on the first threshold lower limit and the first threshold upper limit. If the target data is an abnormal value beyond the confidence interval threshold range, the PID is used for automatic diagnosis and debugging.
Specifically, after judging whether the lognormal distribution is obeyed, whether the calculated temperature conforms to the lognormal distribution is judged; the specific statement is as follows: def is_temperature_in_confidence_interval(temperature, a=1, b=1). If the lognormal distribution is not satisfied, the data is marked as abnormal data; if it conforms, the parameters of the lognormal distribution, including the standard deviation and the mean, are further calculated. The probability density function value of the lognormal distribution is then calculated, the target data is automatically diagnosed and debugged by the normal-value processing program in the PID, the PID automatically selects and uses the interval fitting value of the numerical values conforming to the lognormal distribution, and a calculation result is obtained by using the Wald method and/or the likelihood method based on the probability density function value and the parameters. The Wald method is selected when the data volume is large, the likelihood method is selected when the data volume is small, and the two can be combined when the data volume is uncertain. Whether the target data lies within the confidence interval is then judged: if so, the target data is determined as the target import data, marked as normal production data, and imported into the MES system by serial number; if not, an abnormal-operation handling program is called for processing.
It should be noted that, based on the likelihood method and/or the Wald method, the confidence limits take the following form. The upper confidence limit is calculated as follows:
μ + t(α/2, n−1) · s / √n
The lower confidence limit is calculated as follows:
μ − t(α/2, n−1) · s / √n
In the above formulas, μ is the overall mean (estimated by the sample mean), t is the distribution critical value, α is the significance level, s is the standard deviation, and n is the number of samples.
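Using the symbol definitions above, the confidence limits can be sketched in Python. Choosing the normal (Wald) critical value for large samples and the t critical value for small ones follows the selection rule in the text; the cutoff of 2000 points and the function name are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def confidence_limits(values, alpha=0.05, large_sample_cutoff=2000):
    """Return (lower, upper) confidence limits for the mean.

    Large samples use the Wald (normal) critical value; small samples
    use the t critical value, per the selection rule in the text.
    """
    x = np.asarray(values, dtype=float)
    n, mean, s = len(x), x.mean(), x.std(ddof=1)
    if n >= large_sample_cutoff:
        crit = stats.norm.ppf(1 - alpha / 2)         # Wald (normal) critical value
    else:
        crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t critical value
    half_width = crit * s / np.sqrt(n)               # crit * s / sqrt(n)
    return mean - half_width, mean + half_width
```

As expected from the s/√n term, the interval narrows as the sample size grows, which is consistent with choosing the method by data volume.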
It will be appreciated that values outside the upper and lower threshold range of the confidence interval which the PID cannot handle by itself are sent to an equipment management platform in the production system for subsequent remote analysis by an equipment engineer, who determines whether the problem lies with a temperature sensor, a clamp or the wave soldering equipment. Before data are imported into the MES, the PID module identifies the temperature data collected by all sensors in the preheating temperature zone, fits the temperature data in real time, and confirms whether the temperature data meets the preheating temperature zone curve required by the specification (for example, lies within the upper and lower limits of the 95% confidence interval); data within the confidence interval range is confirmed as normal production data, while data outside the range is determined as sensor or equipment failure data. The normal production stream data (temperature data of the preheating temperature zone) is then imported into the MES system with time-series labels.
Step S13: and importing the target import data into an execution management system, calculating a threshold value based on the target import data and a preset quantile by using the execution management system, and determining a preset threshold value interval based on the threshold value.
In this embodiment, a confidence interval is calculated based on the target data, after the target data located inside the confidence interval is determined as target import data, the target import data is imported into an execution management system, so that a threshold is calculated by the execution management system based on the target import data and a preset quantile, and a preset threshold interval is determined based on the threshold.
It will be appreciated that the target data of the preheating zone is exported from the MES system in CSV format, e.g. data = XXXXY_XXMXXD_XXXXXXXX_preheating_temperature.csv. The isinstance() function is then used to judge whether the data types of all target import data are numerical; if not, the data is marked as abnormal and a message is sent. If the target import data conforms to the numerical type, a threshold is calculated based on the target import data and a preset quantile, and a preset threshold interval is determined based on the threshold.
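A minimal sketch of the 10%/90% quantile threshold calculation and the subsequent deletion operation might look as follows (function names are illustrative; NumPy's default linear quantile interpolation is assumed):

```python
import numpy as np

def quantile_threshold_interval(values, low_q=0.10, high_q=0.90):
    """Second threshold lower/upper limits from the 10% and 90%
    quantiles, widening the fence relative to the traditional
    Q1/4 and Q3/4 quantiles."""
    x = np.asarray(values, dtype=float)
    lower, upper = np.quantile(x, [low_q, high_q])
    return lower, upper

def delete_outliers(values, interval):
    """Preset data deletion operation: drop the values outside the
    threshold interval (i.e. the points marked as abnormal)."""
    lower, upper = interval
    x = np.asarray(values, dtype=float)
    return x[(x >= lower) & (x <= upper)]
```

Because the 10% and 90% quantiles sit outside the 25% and 75% quantiles, the resulting interval is strictly wider, which matches the stated goal of reducing over-deletion.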
Step S14: and executing a preset data deleting operation based on the preset threshold interval to obtain cleaned data, and executing a preset data filling operation based on an automatic data filling function and the cleaned data to obtain final data.
In this embodiment, the target import data is imported into an execution management system, so that a threshold is calculated by the execution management system based on the target import data and a preset quantile, a preset threshold interval is determined based on the threshold, a preset data deletion operation is performed based on the preset threshold interval, so as to obtain cleaned data, and a preset data filling operation is performed based on an automatic data filling function and the cleaned data, so as to obtain final data. Specifically, all columns are specified from the whole data matrix table, and the vacant positions where the cleaned data do not exist are determined in each column; and filling target filling data to the vacant positions by using a filling model and based on a target lower limit value and a target upper limit value, and obtaining the final data based on the cleaned data and the target filling data.
It can be appreciated that, after the preset data deleting operation is executed based on the preset threshold interval to obtain the cleaned data, and before the preset data filling operation is executed based on the automatic data filling function and the cleaned data to obtain the final data, the cleaned data is divided into a training set, a verification set and a test set, and a centering process and a unified scaling process are performed on the training set, the verification set and the test set to obtain a post-processing training set, a post-processing verification set and a post-processing test set; the vacant positions in each column of data in each post-processing training set are determined, and the target filling data is filled into the vacant positions to obtain a post-filling training set; whether a preset adjustment parameter meets the user requirement is judged based on the target filling data, wherein the preset adjustment parameters comprise a dimension upper limit and a maximum iteration number; if the preset adjustment parameters meet the user requirement, the filling model is fitted based on the post-filling training set and the preset adjustment parameters; the bias value of the filling model is calculated using the post-processing verification set, and the bias value is minimized to obtain a processed bias value; rank reduction is performed on the processed bias value to obtain the target lower limit value; and the rank reduction is performed on the data in the post-processing test set to obtain the target upper limit value.
The invention combines self-developed PID automatic diagnosis and debugging, a 10% quantile and 90% quantile method for identifying and cleaning abnormal values in the data, and an ADF (Automatic Data Filling) function, thereby effectively improving the accuracy and reliability of stream data processing. Abnormal values can be automatically identified and cleaned, reducing the workload of manual processing; multiple data streams can be processed simultaneously in parallel, suiting various data processing scenes; and the condition in which missing values and abnormal values exist simultaneously can be handled, improving the comprehensiveness of data processing.
It can be seen that the present invention provides a data cleansing method, comprising: acquiring temperature data of wave soldering equipment, and judging whether the temperature data obeys normal distribution; if the temperature data obeys normal distribution, determining the temperature data as target data, calculating a confidence interval based on the target data, and determining the target data positioned in the confidence interval as target import data; importing the target import data into an execution management system, calculating a threshold value based on the target import data and a preset quantile by using the execution management system, and determining a preset threshold value interval based on the threshold value; and executing a preset data deleting operation based on the preset threshold interval to obtain cleaned data, and executing a preset data filling operation based on an automatic data filling function and the cleaned data to obtain final data. Therefore, the method and the device extract the target data obeying normal distribution by judging and classifying the data, identify the abnormal value of the data based on the quantile value, reduce the misjudgment rate, automatically fill the abnormal value and the missing value based on the automatic data filling function, and improve the accuracy and the processing efficiency of stream data cleaning.
Referring to fig. 2, an embodiment of the present invention discloses a data cleaning method, and compared with the previous embodiment, the present embodiment further describes and optimizes a technical solution.
Step S21: and acquiring temperature data of wave soldering equipment, and judging whether the temperature data obeys normal distribution.
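The normality judgement of step S21 can be sketched as follows. A heuristic skewness/kurtosis check stands in for whatever preset distribution fitting operation the embodiment actually performs (roughly_normal, its tolerances and the sample data are all assumptions; a formal test such as Shapiro-Wilk could be substituted):

```python
import numpy as np

def roughly_normal(data, skew_tol=0.5, kurt_tol=1.0):
    """Heuristic normality check: a normal distribution has skewness 0
    and excess kurtosis 0, so large deviations suggest non-normality."""
    x = np.asarray(data, dtype=float)
    x = x[~np.isnan(x)]
    mu, sigma = x.mean(), x.std()
    if sigma == 0:
        return False
    z = (x - mu) / sigma
    skew = np.mean(z ** 3)          # sample skewness
    kurt = np.mean(z ** 4) - 3.0    # sample excess kurtosis
    return bool(abs(skew) < skew_tol and abs(kurt) < kurt_tol)

rng = np.random.default_rng(0)
normal_like = rng.normal(185.0, 2.0, size=2000)   # plausible temperatures
skewed = rng.exponential(2.0, size=2000)          # clearly non-normal
```

Only data judged to obey the normal distribution proceeds to step S22 as target data.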
Step S22: and if the temperature data obeys normal distribution, determining the temperature data as target data, calculating a confidence interval based on the target data, and determining the target data positioned in the confidence interval as target import data.
Step S23: and importing the target import data into an execution management system so as to judge whether the data type of the target import data is the numerical type or not by utilizing the execution management system.
In this embodiment, a confidence interval is calculated based on the target data, the target data located inside the confidence interval is determined as target import data, and then the target import data is imported into an execution management system, so that the execution management system is used to judge whether the data type of the target import data is the numerical type; for example, the isinstance() function is used to judge whether the data types of all the target import data are numerical, and if the data type of the target import data is not numerical, the data is marked as abnormal and a message is sent.
Step S24: and if the data type of the target import data is the numerical value type, calculating a second threshold upper limit and a second threshold lower limit based on the target import data, the 10% quantile and the 90% quantile.
In this embodiment, after the execution management system determines whether the data type of the target import data is the numerical type, if the data type of the target import data is the numerical type, a second threshold upper limit and a second threshold lower limit are calculated based on the target import data, the 10% quantile, and the 90% quantile. It will be appreciated that the np.nanpercentile() function is used to calculate the 10% quantile of all data and the 90% quantile of all data. The lower threshold corresponding to the 10% quantile is calculated as: threshold_min = pct10 - 3 * (pct90 - pct10); the upper threshold corresponding to the 90% quantile is calculated as: threshold_max = pct90 + 3 * (pct90 - pct10).
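The quantile threshold formulas above can be reproduced directly with numpy's np.nanpercentile(), which ignores missing values; the sample temperatures are illustrative:

```python
import numpy as np

def quantile_thresholds(data):
    """Compute the widened threshold bounds from the 10% and 90%
    quantiles, using 3x the inter-quantile range as described."""
    pct10 = np.nanpercentile(data, 10)
    pct90 = np.nanpercentile(data, 90)
    iqr = pct90 - pct10
    threshold_min = pct10 - 3 * iqr
    threshold_max = pct90 + 3 * iqr
    return threshold_min, threshold_max

# Illustrative column with one missing reading (NaN is skipped)
temps = np.array([180.0, 182.0, 184.0, 186.0, 188.0, np.nan, 190.0])
tmin, tmax = quantile_thresholds(temps)
```

For this sample, pct10 = 181.0 and pct90 = 189.0, giving the interval [157.0, 213.0].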
A quantile refers to a numerical point that divides the probability distribution range of a random variable into several equal parts; common examples are the median (i.e. the 2-quantile), quartiles and percentiles. The quantiles used in the invention are 10% and 90%, and the 1.5 × IQR (Interquartile Range) factor in the whisker upper-limit and lower-limit formulas is changed to 3 × IQR, so that the distance between the upper and lower threshold limits is widened and the problem of over-judging abnormal values is avoided.
Step S25: and determining the preset threshold interval based on the second upper threshold limit and the second lower threshold limit.
In this embodiment, after calculating the second upper threshold limit and the second lower threshold limit based on the target import data, the 10% quantile, and the 90% quantile, the preset threshold interval is determined based on the second upper threshold limit and the second lower threshold limit. Whether the target import data is located inside the preset threshold interval is then judged; if the target import data is located outside the preset threshold interval, the target import data is marked as an abnormal value. It will be appreciated that data below the lower threshold threshold_min is marked as an outlier using the np.where() function, and data above the upper threshold threshold_max is marked as an IM (Induced Missing Value) outlier using the np.where() function.
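A sketch of marking outliers against the threshold interval with np.where(); the threshold bounds and temperatures are assumed example values:

```python
import numpy as np

# Assumed threshold bounds from the quantile step
threshold_min, threshold_max = 157.0, 213.0

temps = np.array([150.0, 181.0, 184.0, 220.0, 188.0])

# Indices of data below the lower threshold (marked as outliers)
low_idx = np.where(temps < threshold_min)[0]
# Indices of data above the upper threshold (marked as IM outliers)
high_idx = np.where(temps > threshold_max)[0]

# Combined boolean mask over the preset threshold interval
outlier_mask = (temps < threshold_min) | (temps > threshold_max)
```

The mask can then drive the deletion step of S26, as illustrated after that step.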
Step S26: and executing a preset data deleting operation based on the preset threshold interval to obtain cleaned data, and executing a preset data filling operation based on an automatic data filling function and the cleaned data to obtain final data.
In this embodiment, after the preset threshold interval is determined based on the second upper threshold limit and the second lower threshold limit, a preset data deleting operation is performed based on the preset threshold interval to obtain cleaned data, and a preset data filling operation is performed based on an automatic data filling function and the cleaned data to obtain final data. All the data marked as abnormal values in the target import data are deleted to obtain the cleaned data, the cleaned data is stored, and the preset data filling operation is performed based on the automatic data filling function and the cleaned data to obtain the final data. As shown in fig. 3, a quantile function is called, outliers are identified by using quantiles, and the ADF function is called to automatically fill data after the outliers are deleted.
It will be appreciated that all outliers are deleted; all data are stored in csv format and imported into JMP software; the automatic data filling method in the SAS JMP software JSL script editor is called through the PyWin32 module; a streaming input function is called in JMP to provide stream data to the automatic data filling function ADF, ensuring directional and dynamic data transmission to the ADF and avoiding situations of no data or an unreasonable data quantity; all columns of the preheating temperature region are specified from the entire data matrix table using the Matrix() function, and all missing values in the columns are automatically filled by the automatic data filling function ADF (e.g., using an extension of the matrix-completion machine learning method from the Netflix Challenge).
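A sketch of the deletion-and-storage step, assuming outliers are removed with a boolean mask and the cleaned column is written out in csv format; the embodiment stores a file for import into JMP, while an in-memory buffer is used here for illustration:

```python
import csv
import io
import numpy as np

temps = np.array([150.0, 181.0, 184.0, 220.0, 188.0])
# Assumed threshold interval from the quantile step
outlier_mask = (temps < 157.0) | (temps > 213.0)

# Delete all values marked as outliers
cleaned = temps[~outlier_mask]

# Store the cleaned data in csv format (a real pipeline would write a
# file such as the preheating_temperature csv and import it into JMP)
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["preheating_temperature"])
for v in cleaned:
    writer.writerow([v])
```

The stored csv then feeds the automatic data filling stage described above.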
As shown in fig. 4, before the preset data filling operation is performed based on the automatic data filling function and the cleaned data to obtain the final data, all data are divided into respective training sets, verification sets and test sets; each data set is subjected to centering and unified scaling; vacancies without numerical values are searched for in each column of data of each data set (training/verification/test set), and IM values (introduced missing values) are added at the vacancies to ensure the fluency of the calculation process and to avoid error reporting during calculation or a stop of the calculation process; a filling model is fitted to the training set according to the adjustment parameters (parameters to be set and adjusted before starting the automatic data filling method, such as the dimension upper limit and the maximum iteration number); the IM values are used to judge whether the adjustment parameters are the optimal values (optimal solutions); if the adjustment parameters are not the optimal values, the adjustment parameters are reset, and the cycle is repeated until the optimal values are found; if they are the optimal values, the bias of each filling model is calculated using the data of the verification set (a large bias causes mismatch), and the bias value is minimized; rank reduction (rank minimization) is performed using the bias values as a lower limit value; rank reduction is performed using the data in the test set to calibrate the filling model of the stream data and prevent model overfitting, and this rank is used as an upper limit; after verification is completed, the missing values are automatically filled by the filling model according to the upper-limit and lower-limit ranges.
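The pipeline above can be sketched as follows. A column-mean fill stands in for the matrix-completion filling model, and the 60/20/20 split, helper names and parameters are illustrative assumptions rather than the patented method:

```python
import numpy as np

def split(data, rng):
    """Divide rows into training/verification/test sets (60/20/20)."""
    idx = rng.permutation(len(data))
    n_tr, n_va = int(0.6 * len(data)), int(0.2 * len(data))
    return data[idx[:n_tr]], data[idx[n_tr:n_tr + n_va]], data[idx[n_tr + n_va:]]

def center_and_scale(x, mu, sigma):
    """Centering and unified scaling, with statistics fixed on the training set."""
    return (x - mu) / sigma

def fill_missing(x, fill_value):
    """Stand-in for the ADF filling model: fill vacancies (NaN) with a
    fitted value; the real method fits a matrix-completion model."""
    out = x.copy()
    out[np.isnan(out)] = fill_value
    return out

rng = np.random.default_rng(1)
data = rng.normal(185.0, 2.0, size=100)
data[[5, 17, 42]] = np.nan              # vacancies in the column

train, val, test = split(data, rng)
mu, sigma = np.nanmean(train), np.nanstd(train)
train_p = center_and_scale(train, mu, sigma)

# Fit value from the training set; the verification set would then be
# used to minimize the model's bias before filling the test set
filled_train = fill_missing(train_p, np.nanmean(train_p))
```

The adjustment-parameter loop, bias minimization and rank reduction of the embodiment would wrap around this skeleton.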
Python is started by using the work-with-Python function in JMP; a data cleaning log and report are created, e.g. log_file_name = "XXXX data_log.txt", and the key information and results of the whole data cleaning process are recorded; the process of each automatic data filling is converted into Python code for further engineering use. The automatic data filling method used by the invention is more accurate than common methods such as the column average value, can automatically fit the training set data, and verifies with the test set data after a fitted model is obtained, thereby avoiding problems such as mismatching and overfitting. Meanwhile, the method is combined with Python programming and the whole process is engineered, so that it is convenient to transplant to other production lines or convert into a fully automatic process.
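A minimal sketch of the data cleaning log using Python's standard logging module; the embodiment writes a .txt log file via the JMP/Python bridge, while this illustration records to an in-memory buffer (the logger name, messages and counts are assumptions):

```python
import io
import logging

# Record to a buffer; a real pipeline would use filename=log_file_name
log_buffer = io.StringIO()
logger = logging.getLogger("data_cleaning")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = logging.StreamHandler(log_buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)

# Key information and results of the whole cleaning process
logger.info("cleaning started: %d raw rows", 1000)
logger.info("outliers removed: %d", 12)
logger.info("missing values filled: %d", 7)

report = log_buffer.getvalue()
```

Such a report makes each cleaning run auditable when the process is engineered onto other production lines.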
As shown in FIG. 5, the automatic data filling method used in the invention avoids information damage and loss in the data set when processing stream data, and ensures the accuracy of the whole data cleaning process. The training set and the introduced missing values IM are used to fit a filling model, from which data is then generated and filled in a stream filling manner into the previous training set, validation set and test set.
In addition, as shown in fig. 6, the chip in the PID judges, from the signal input of the temperature sensor and via the voltage, the overall operation region and the abnormal operation region of the wave soldering apparatus.
The invention carries out automatic diagnosis and adjustment classification on the large amount of stream data of a modern wave-soldering digital production line, and automatically divides and extracts the stream data of normal production; abnormal values in the stream data are identified within a quantile range (10% and 90%), reducing the misjudgment rate; and an automatic data filling method automatically fills the abnormal values and missing values, improving the accuracy and processing efficiency of stream data cleaning. The invention can be applied to big data processing, automatic data management, data processing and optimization in the automobile parts manufacturing industry, stream data processing, and digital wave soldering production lines.
For the specific content of the steps S21 and S22, reference may be made to the corresponding content disclosed in the foregoing embodiment, and no detailed description is given here.
Therefore, the embodiment of the application judges whether the temperature data obeys normal distribution or not by acquiring the temperature data of the wave soldering equipment; if the temperature data obeys normal distribution, determining the temperature data as target data, calculating a confidence interval based on the target data, and determining the target data positioned in the confidence interval as target import data; importing the target import data into an execution management system so as to judge whether the data type of the target import data is the numerical value type or not by utilizing the execution management system; if the data type of the target import data is the numerical value type, calculating a second threshold upper limit and a second threshold lower limit based on the target import data, 10% quantiles and 90% quantiles; determining the preset threshold interval based on the second upper threshold limit and the second lower threshold limit; and executing a preset data deleting operation based on the preset threshold interval to obtain cleaned data, and executing a preset data filling operation based on an automatic data filling function and the cleaned data to obtain final data, so that the misjudgment rate is reduced, and the accuracy and the processing efficiency of stream data cleaning are improved.
Referring to fig. 7, the embodiment of the invention also correspondingly discloses a data cleaning device, which comprises:
the normal distribution judging module 11 is used for acquiring temperature data of the wave soldering equipment and judging whether the temperature data obeys normal distribution;
a confidence interval judging module 12, configured to determine the temperature data as target data if the temperature data is subject to normal distribution, calculate a confidence interval based on the target data, and determine the target data located inside the confidence interval as target import data;
a threshold interval determining module 13, configured to import the target import data into an execution management system, calculate a threshold based on the target import data and a preset quantile by using the execution management system, and determine a preset threshold interval based on the threshold;
a data cleansing module 14, configured to perform a preset data deletion operation based on the preset threshold interval, so as to obtain cleansed data;
and the data filling module 15 is used for executing preset data filling operation based on the automatic data filling function and the cleaned data to obtain final data.
It can be seen that the present invention includes: acquiring temperature data of wave soldering equipment, and judging whether the temperature data obeys normal distribution; if the temperature data obeys normal distribution, determining the temperature data as target data, calculating a confidence interval based on the target data, and determining the target data positioned in the confidence interval as target import data; importing the target import data into an execution management system, calculating a threshold value based on the target import data and a preset quantile by using the execution management system, and determining a preset threshold value interval based on the threshold value; and executing a preset data deleting operation based on the preset threshold interval to obtain cleaned data, and executing a preset data filling operation based on an automatic data filling function and the cleaned data to obtain final data. Therefore, the method and the device extract the target data obeying normal distribution by judging and classifying the data, identify the abnormal value of the data based on the quantile value, reduce the misjudgment rate, automatically fill the abnormal value and the missing value based on the automatic data filling function, and improve the accuracy and the processing efficiency of stream data cleaning.
In some embodiments, the normal distribution determining module 11 specifically includes:
the temperature data acquisition unit is used for acquiring temperature data of a preset number of preheating temperature areas from a memory of the wave soldering equipment at regular time;
the first numerical judgment unit is used for judging whether the data type of the temperature data is a numerical type or not by utilizing a preset data type detection function;
a data to be corrected determining unit configured to determine the temperature data, the data type of which is not the numerical value type, as data to be corrected;
the data correction unit is used for correcting the data to be corrected by using a preset data debugging script to obtain corrected data;
a new temperature data determining unit, configured to determine the corrected data as new temperature data, and reenter the step of determining whether the data type of the temperature data is a numerical value by using a preset data type detection function;
and the normal distribution judging unit is used for executing preset distribution fitting operation on the temperature data if the data type of the temperature data is the numerical value type so as to judge whether the temperature data obeys the normal distribution.
In some specific embodiments, the confidence interval determination module 12 specifically includes:
A parameter determination unit configured to determine a parameter of the normal distribution based on the target data; the parameters include standard deviation and mean;
a probability density function value calculation unit for calculating a probability density function value of the normal distribution;
the calculation result acquisition unit is used for calculating by using a wald method and/or a likelihood method based on the probability density function value and the parameter to obtain a calculation result;
a first threshold lower limit determining unit configured to determine a minimum value in the calculation result as a first threshold lower limit;
a first threshold upper limit determining unit configured to determine a maximum value in the calculation results as a first threshold upper limit;
a confidence interval determining unit configured to determine the confidence interval based on the first threshold lower limit and the first threshold upper limit;
the first data range judging unit is used for judging whether the target data are positioned in the confidence interval or not in sequence;
and a target import data determining unit configured to determine the target data located inside the confidence interval as the target import data.
In some specific embodiments, the threshold interval determining module 13 specifically includes:
A second numerical value type judging unit for importing the target import data into an execution management system to judge whether the data type of the target import data is the numerical value type by using the execution management system;
a second threshold calculation unit, configured to calculate a second threshold upper limit and a second threshold lower limit based on the target import data, the 10% quantile, and the 90% quantile if the data type of the target import data is the numerical value;
a preset threshold interval determining unit, configured to determine the preset threshold interval based on the second upper threshold limit and the second lower threshold limit;
the second data range judging unit is used for judging whether the target imported data is located in the preset threshold value interval or not;
and the abnormal value marking unit is used for marking the target imported data as an abnormal value if the target imported data is located outside the preset threshold interval.
In some embodiments, the data cleansing module 14 specifically includes:
and the data cleaning unit is used for deleting all the data marked as the abnormal value in the target imported data so as to obtain the cleaned data and storing the cleaned data.
In some embodiments, the data populating module 15 specifically includes:
the data set dividing unit is used for dividing the cleaned data into a training set, a verification set and a test set;
the data set processing unit is used for performing centering processing and unified scaling processing on the training set, the verification set and the test set to obtain a processed training set, a processed verification set and a processed test set;
a first vacant position determining unit, configured to determine the vacant position existing in each column of data in each post-processing training set;
the first data filling unit is used for filling the target filling data in the vacant position to obtain a training set after filling;
the preset adjusting parameter judging unit is used for judging whether the preset adjusting parameter meets the user requirement or not based on the target filling data; wherein the preset adjustment parameters comprise a dimension upper limit and a maximum iteration number;
the filling model determining unit is used for fitting the filling model based on the training set after filling and the preset adjusting parameters if the preset adjusting parameters meet the user requirements;
a bias value calculation unit for calculating a bias value of the filling model using the post-processing verification set;
The bias value minimizing unit is used for minimizing the bias value to obtain a processed bias value;
a target lower limit value determining unit configured to perform rank reduction on the post-processing bias value to obtain the target lower limit value;
a target upper limit value determining unit configured to perform the rank reduction on the data in the post-processing test set to obtain the target upper limit value;
a column specifying unit for specifying all columns from the entire data matrix table;
a second vacant position determining unit configured to determine a vacant position in each of the columns where the post-wash data does not exist;
a second data filling unit for filling the target filling data to the vacant position based on the target lower limit value and the target upper limit value by using a filling model;
and a final data acquisition unit, configured to obtain the final data based on the cleaned data and the target filling data.
Further, the embodiment of the invention also provides electronic equipment. Fig. 8 is a block diagram of an electronic device 20, according to an exemplary embodiment, and the contents of the diagram should not be construed as limiting the scope of use of the present invention in any way.
Fig. 8 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present invention. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is configured to store a computer program that is loaded and executed by the processor 21 to implement the relevant steps of the data cleansing method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present invention, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 is used for managing and controlling various hardware devices on the electronic device 20 and computer programs 222, which may be Windows Server, netware, unix, linux, etc. The computer program 222 may further include a computer program that can be used to perform other specific tasks in addition to the computer program that can be used to perform the data cleansing method performed by the electronic device 20 as disclosed in any of the previous embodiments.
Further, the embodiment of the invention also discloses a computer readable storage medium, wherein the storage medium stores a computer program, and the computer program realizes the steps of the data cleaning method disclosed in any embodiment when being loaded and executed by a processor.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing has described in detail a method, apparatus, device and storage medium for cleaning data, and specific examples have been used herein to illustrate the principles and embodiments of the present invention, and the above examples are only for aiding in the understanding of the method and core idea of the present invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.

Claims (10)

1. A method of data cleansing comprising:
acquiring temperature data of wave soldering equipment, and judging whether the temperature data obeys normal distribution;
if the temperature data obeys normal distribution, determining the temperature data as target data, calculating a confidence interval based on the target data, and determining the target data positioned in the confidence interval as target import data;
importing the target import data into an execution management system, calculating a threshold value based on the target import data and a preset quantile by using the execution management system, and determining a preset threshold value interval based on the threshold value;
And executing a preset data deleting operation based on the preset threshold interval to obtain cleaned data, and executing a preset data filling operation based on an automatic data filling function and the cleaned data to obtain final data.
2. The method of claim 1, wherein the acquiring temperature data of the wave soldering apparatus and determining whether the temperature data is subject to a normal distribution comprises:
acquiring temperature data of a preset number of preheating temperature areas from a memory of wave soldering equipment at regular time;
judging whether the data type of the temperature data is numerical by using a preset data type detection function;
determining the temperature data with the data type not being the numerical value type as data to be corrected, correcting the data to be corrected by using a preset data debugging script, and obtaining corrected data;
determining the corrected data as new temperature data, and re-entering the step of judging whether the data type of the temperature data is numerical by using a preset data type detection function;
and if the data type of the temperature data is the numerical type, executing a preset distribution fitting operation on the temperature data to judge whether the temperature data obeys the normal distribution.
3. The data cleansing method according to claim 1, wherein the calculating a confidence interval based on the target data, determining the target data located inside the confidence interval as target import data, comprises:
determining parameters of the normal distribution based on the target data; the parameters include standard deviation and mean;
calculating probability density function values of the normal distribution;
based on the probability density function value and the parameter, calculating by using a wald method and/or a likelihood method to obtain a calculation result;
determining a minimum value in the calculation result as a first threshold lower limit, determining a maximum value in the calculation result as a first threshold upper limit, and determining the confidence interval based on the first threshold lower limit and the first threshold upper limit;
and sequentially judging whether the target data is positioned in the confidence interval or not, and determining the target data positioned in the confidence interval as the target import data.
4. The data cleaning method according to claim 2, wherein importing the target import data into the execution management system, calculating the threshold based on the target import data and the preset quantile by using the execution management system, and determining the preset threshold interval based on the threshold comprises:
importing the target import data into the execution management system, and judging, by using the execution management system, whether the data type of the target import data is the numerical type;
if the data type of the target import data is the numerical type, calculating a second threshold upper limit and a second threshold lower limit based on the target import data, a 10% quantile and a 90% quantile;
and determining the preset threshold interval based on the second threshold upper limit and the second threshold lower limit.
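The 10% / 90% quantile thresholds can be illustrated with a stdlib-only linear-interpolation quantile. `quantile` and `threshold_interval` are assumed helper names for this sketch, not an API of the execution management system:

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile over an already-sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def threshold_interval(data, low_q=0.10, high_q=0.90):
    """Second threshold lower/upper limits from the 10% and 90%
    quantiles of the target import data (claim 4)."""
    ordered = sorted(data)
    return quantile(ordered, low_q), quantile(ordered, high_q)
```

Note that quantile estimators differ in their interpolation rule; a library implementation may return slightly different cut-offs for small samples.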
5. The data cleaning method according to any one of claims 1 to 4, further comprising, after importing the target import data into the execution management system, calculating the threshold based on the target import data and the preset quantile by using the execution management system, and determining the preset threshold interval based on the threshold:
judging whether the target import data is located within the preset threshold interval;
if the target import data is located outside the preset threshold interval, marking the target import data as an abnormal value;
and correspondingly, performing the preset data deleting operation based on the preset threshold interval to obtain the cleaned data comprises:
deleting all the data marked as abnormal values from the target import data to obtain the cleaned data, and storing the cleaned data.
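The mark-then-delete step above reduces to a two-pass filter. A hedged sketch (the `clean` name and the tuple return shape are choices made for this illustration):

```python
def clean(data, lower, upper):
    """Flag points outside the preset threshold interval as abnormal
    values, then delete every flagged point (claim 5)."""
    flags = [(x, lower <= x <= upper) for x in data]   # marking pass
    kept = [x for x, ok in flags if ok]                # deletion pass
    dropped = [x for x, ok in flags if not ok]
    return kept, dropped
```

Returning the dropped values alongside the cleaned data makes the anomaly marks auditable, which mirrors the claim's separate marking and deleting steps.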
6. The data cleaning method according to claim 5, wherein performing the preset data filling operation based on the automatic data filling function and the cleaned data to obtain the final data comprises:
designating all columns of the whole data matrix table, and determining, in each column, the vacant positions that contain no cleaned data;
and filling target filling data into the vacant positions by using a filling model based on a target lower limit value and a target upper limit value, and obtaining the final data based on the cleaned data and the target filling data.
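A column-wise filling sketch in stdlib Python. The real claim uses a fitted filling model; here a clamped column mean stands in for it, with the clamp playing the role of the target lower and upper limit values (`fill_column` is a hypothetical name):

```python
def fill_column(column, lower, upper):
    """Locate vacant (None) positions in one column of the data matrix
    table and fill them with a value predicted by a stand-in 'filling
    model': the mean of the present values, clamped into [lower, upper]."""
    present = [x for x in column if x is not None]
    guess = sum(present) / len(present)
    guess = max(lower, min(upper, guess))  # respect the target limit values
    return [guess if x is None else x for x in column]
```

Applying this per column over all designated columns yields the final data from the cleaned data plus the target filling data.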
7. The data cleaning method according to claim 6, further comprising, before performing the preset data filling operation based on the automatic data filling function and the cleaned data to obtain the final data:
dividing the cleaned data into a training set, a verification set and a test set, and performing centering processing and unified scaling processing on the training set, the verification set and the test set to obtain a processed training set, a processed verification set and a processed test set;
determining the vacant positions in each column of data of the processed training set, and filling the target filling data into the vacant positions to obtain a filled training set;
judging, based on the target filling data, whether preset adjustment parameters meet a user requirement, wherein the preset adjustment parameters comprise a dimension upper limit and a maximum iteration number;
if the preset adjustment parameters meet the user requirement, fitting the filling model based on the filled training set and the preset adjustment parameters;
calculating a bias value of the filling model by using the processed verification set, and minimizing the bias value to obtain a processed bias value;
performing rank reduction on the processed bias value to obtain the target lower limit value;
and performing the rank reduction on the data in the processed test set to obtain the target upper limit value.
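The split plus centering/unified-scaling preprocessing above can be sketched as follows. This is a generic illustration under common conventions (60/20/20 split, z-score scaling); the claim does not specify the split ratios, and `split` / `center_scale` are names invented for the sketch:

```python
import statistics

def split(data, train=0.6, val=0.2):
    """Partition the cleaned data into training, verification and test sets."""
    n = len(data)
    a, b = int(n * train), int(n * (train + val))
    return data[:a], data[a:b], data[b:]

def center_scale(values, mu=None, sigma=None):
    """Centering + scaling: subtract a mean and divide by a standard
    deviation. Passing the training-set mu/sigma back in for the
    verification and test sets is what keeps the scaling 'unified'."""
    mu = statistics.fmean(values) if mu is None else mu
    sigma = statistics.stdev(values) if sigma is None else sigma
    return [(x - mu) / sigma for x in values], mu, sigma
```

Fitting the scaler on the training set only, then reusing its parameters on the other sets, avoids leaking verification/test statistics into the filling model.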
8. A data cleaning apparatus, comprising:
a normal distribution judging module, configured to acquire temperature data of wave soldering equipment and judge whether the temperature data follows a normal distribution;
a confidence interval judging module, configured to, if the temperature data follows the normal distribution, determine the temperature data as target data, calculate a confidence interval based on the target data, and determine the target data located within the confidence interval as target import data;
a threshold interval determining module, configured to import the target import data into an execution management system, calculate a threshold based on the target import data and a preset quantile by using the execution management system, and determine a preset threshold interval based on the threshold;
a data cleaning module, configured to perform a preset data deleting operation based on the preset threshold interval to obtain cleaned data;
and a data filling module, configured to perform a preset data filling operation based on an automatic data filling function and the cleaned data to obtain final data.
9. An electronic device, comprising:
a memory for storing a computer program;
and a processor for executing the computer program to implement the steps of the data cleaning method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the data cleaning method according to any one of claims 1 to 7.
CN117591501A (en): Data cleaning method, device, equipment and storage medium. Application CN202311562034.5A, filed 2023-11-21, priority date 2023-11-21. Status: Pending.

Priority Applications (1)

Application Number: CN202311562034.5A; Priority/Filing Date: 2023-11-21; Title: Data cleaning method, device, equipment and storage medium

Publications (1)

Publication Number: CN117591501A; Publication Date: 2024-02-23

Family

ID=89909439

Country Status (1)

Country: CN; Publication: CN117591501A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination