CN113177040A

CN113177040A - Full-process big data cleaning and analyzing method for aluminum/copper plate strip production

Info

Publication number: CN113177040A
Application number: CN202110476972.8A
Authority: CN
Inventors: 刘士新; 隋佳欢; 陈大力; 温睿; 赵梓焱; 姚明昊
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2021-04-29
Filing date: 2021-04-29
Publication date: 2021-07-27

Abstract

The invention relates to a method for cleaning and analyzing the full-process big data of aluminum/copper plate strip production. The method comprises: constructing a quality problem detection method suited to the full-process big data of the aluminum/copper plate strip and cleaning the data according to the data characteristics and the process background; and establishing a distributed computing and analysis algorithm tool library for the aluminum/copper plate strip full-process big data. By designing this cleaning and analysis method, the invention provides a direction for mining and using the full-process big data of aluminum/copper plate strip production and improves the utilization rate of that data.

Description

Full-process big data cleaning and analyzing method for aluminum/copper plate strip production

Technical Field

The invention relates to the field of big data and data mining, in particular to a method for cleaning and analyzing big data of the whole process of aluminum/copper plate strip production.

Background

The industrial big data of the whole aluminum/copper plate strip production process is multi-source and heterogeneous, inaccurate and difficult to query, which places extremely high demands on cleaning and analyzing the whole-process production data. With the rapid development of machine learning and big data theory, many scholars have combined machine learning and big data technology into data-driven techniques and carried out a series of studies on data analysis and optimization in industrial production, with some progress. Such methods use industrial process data to construct a relational model for each key parameter in the production process, which both speeds up modeling and avoids the complexity of traditional mechanism modeling. Data-driven modeling does not require a deep understanding of the production mechanism; the production process is converted into machine learning models, which further reduces modeling time. In addition, because data-driven modeling and optimization take the data as their starting point, modeling errors caused by a lack of professional knowledge can be avoided, and new findings can be brought to professionals. Data-driven industrial process modeling and optimization thus provide a new approach to industrial process analysis. Cleaning and analyzing the big data of the whole aluminum/copper plate strip production process is therefore of great significance.

When the large volume of data produced by an aluminum/copper plate strip factory is collected, importing and retrieving the data often costs technicians a great deal of time and effort and creates a heavy workload. Meanwhile, production, data transmission, manual processing and similar steps are inevitably subject to noise, so the acquired data is often mixed with a large amount of interference and even erroneous information, producing a large amount of dirty data such as erroneous, missing and duplicate records; the historical data contains many missing, abnormal and duplicate values, and its information value is low; traditional mechanism modeling of the aluminum/copper plate strip full-process production is very complex and often produces modeling errors due to a lack of professional knowledge; and because the full-process production of an aluminum/copper plate strip factory is highly specialized and its data quality problems are complex, a specially customized data cleaning and analysis method is lacking.

Disclosure of Invention

Aiming at the problems, the invention aims to provide a method for cleaning and analyzing large data of the whole process of aluminum/copper plate strip production, which can effectively solve the problems of uneven data, multiple anomalies and the like.

The technical scheme adopted by the invention is as follows:

The invention provides a method for cleaning and analyzing big data of the whole production process of aluminum/copper plate strips, which comprises the following steps:

S1: constructing a cleaning system suitable for the whole-process big data of the aluminum/copper plate strip, and cleaning the data according to the characteristics of the whole-process production data and the process background;

S2: establishing a distributed computing and analysis algorithm tool library for the aluminum/copper plate strip full-process big data, establishing a relational model of the production process, and optimizing process parameters and alloy components.

Further, the specific process in step S1 is as follows:

(1.1) analyzing the quality problem of the full-process big data of the aluminum/copper plate strip, and analyzing and summarizing links with the data quality problem aiming at the characteristics of the full-process production and processing of the aluminum/copper plate strip;

the data quality issues include: data loss or errors due to device performance limitations; data loss or error caused by human operation error;

(1.2) analyzing the data quality problem and designing a corresponding processing scheme, the processing scheme comprising:

single value processing: single values are handled specifically; if all values in a column are the same single value, the column is deleted and the result is stored; if a single-value column lacks individual data, the affected rows are deleted, or the column is completed by filling in the single value;

missing value processing: the data missing value processing comprises row deletion and data completion;

abnormal value processing: outlier processing includes row deletion and data replacement;

(1.3) determining the specific steps of the full-process big data cleaning system for the aluminum/copper plate strip; in view of the data quality problems of the full-process big data of the aluminum/copper plate strip, and in combination with the characteristics of that data, data cleaning is carried out according to the following steps:

loading a data source: reading the distributed data, wherein the reading mode utilizes distributed storage in Hadoop to perform distributed reading;

identifying the detection data: detecting data in a full-flow big data cleaning system of the aluminum/copper plate strip, and identifying possible quality problems of the data;

determining a cleaning rule: starting cleaning of data and backing up the data to be cleaned in advance; setting corresponding cleaning rules and different cleaning methods for different types of data quality problems of the full-process big data of the aluminum/copper plate strip;

data cleaning: cleaning the loaded source data by using a MapReduce frame in Hadoop and adopting a determined cleaning rule and a cleaning method;

checking the cleaning result: the cleaned data is checked and judged against the rules and evaluation criteria set for data cleaning, and if it does not meet them, the process returns to the previous step and cleaning continues.

Further, the specific process of step S2 is as follows:

(2.1) implementing feature engineering for the full process of aluminum/copper plate strip production: in view of the multi-source heterogeneity, high dimensionality, dynamics and multiple spatio-temporal scales of the industrial big data of the whole aluminum/copper plate strip production process, establishing feature construction and feature selection that match the characteristics of that data and the process background;

feature construction for the full-process big data of aluminum/copper plate strip production: in the feature construction stage, some features are segmented (binned); label-type variables are converted into numerical variables before modeling, in a way that implies no ordering by magnitude; and some features not considered in the original data are added, such as weather and climate information during production and worker fatigue at the start of work;

feature selection for the full-process big data of aluminum/copper plate strip production: in the feature selection stage, the process parameters related to ingot quality are screened out of the original data, some single-value variables are deleted, and the correlation between features is checked so that only one representative feature is retained from each group of strongly correlated features, thereby reducing data redundancy;

(2.2) developing a tool library facing to the whole process big data analysis algorithm of the aluminum/copper plate strip production; the tool library comprises the following functions: the system comprises an unsupervised learning function, a supervised learning function and an intelligent optimization function;

(2.3) specifically realizing a distributed algorithm analysis business process: carrying out unified standardization on an analysis algorithm, customizing a unified calling interface, and specifically carrying out the following business processes:

loading a data source: storing the source data processed by the distributed big data cleaning system suitable for the aluminum/copper plate strip in a distributed file system with Hadoop as a core, and performing distributed reading and loading on the data;

confirming a specific algorithm: the analysis methods in the tool library for full-process big data analysis of aluminum/copper plate strip production include supervised, unsupervised and other types, and different algorithms can be selected to meet different user requirements;

setting algorithm parameters: depending on the type of analysis method chosen from the tool library, different method parameters can be selected for data analysis; if none are set, default values are used;

analyzing results: the results of processing the data with the selected analysis method are returned to the user, and a relational model of aluminum/copper plate strip production is constructed from the returned results.
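The feature construction and selection described in step (2.1) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the column names, the 0.9 correlation cutoff and the use of one-hot encoding and Pearson correlation are choices made here, not details given in the text.

```python
import math

def one_hot(values):
    """Encode a label-type column as 0/1 indicator columns (no implied ordering)."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_features(columns, threshold=0.9):
    """Drop single-value columns, then keep one feature per strongly correlated group."""
    kept = []
    for name, col in columns.items():
        if len(set(col)) <= 1:
            continue  # single-value variable: carries no information
        if any(abs(pearson(col, columns[k])) >= threshold for k in kept):
            continue  # redundant with a feature that is already kept
        kept.append(name)
    return kept

# Hypothetical columns, for illustration only.
columns = {
    "cast_temp":   [700.0, 710.0, 720.0, 730.0],
    "cast_temp_f": [1292.0, 1310.0, 1328.0, 1346.0],  # same signal, other unit
    "line_speed":  [5.0, 3.0, 6.0, 4.0],
    "mold_id":     [1.0, 1.0, 1.0, 1.0],               # single-value column
}
selected = select_features(columns)
encoded = one_hot(["A", "B", "A", "C"])
```

Here `selected` keeps `cast_temp` and `line_speed`: the Fahrenheit column is perfectly correlated with `cast_temp`, and `mold_id` never varies.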

The invention provides a direction for mining and utilizing the full-flow big data of the aluminum/copper plate strip production by designing the full-flow big data cleaning and analyzing method for the aluminum/copper plate strip production, and can effectively improve the utilization rate of the full-flow big data of the aluminum/copper plate strip production.

Drawings

FIG. 1 is a schematic diagram illustrating the use of the data cleansing method of the present invention;

FIG. 2 is a block diagram of a data set anomaly detection algorithm of the present invention;

FIG. 3 is a schematic flow chart of data cleansing according to the present invention;

FIG. 4 is a block diagram of an algorithm tool according to the present invention;

FIG. 5 is an architectural diagram of data analysis in accordance with the present invention;

FIG. 6 is a schematic flow chart of data analysis according to the present invention.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

The invention provides a method for cleaning and analyzing big data of the whole production process of an aluminum/copper plate strip, which comprises the following specific implementation steps:

S1: as shown in FIG. 1, a quality problem detection method suitable for the aluminum/copper plate strip full-process big data is constructed, and the data is cleaned according to the data characteristics and the process background; this specifically comprises the following steps:

(1.1) analyzing the quality problems of the full-process big data of the aluminum/copper plate strip, and analyzing and summarizing the links in which data quality problems arise, in view of the characteristics of full-process production and processing of the aluminum/copper plate strip, including:

Device performance limitations result in data loss or errors: in aluminum/copper plate strip production, some production links run at high speed, and because of limited performance the equipment that collects and uploads data in real time often cannot upload the data to the data collector in time, so data is lost. Equipment faults or sensor failures also affect the collected data, causing errors and losses.

Human operator error results in data loss or errors: the data and parameters of many links in actual aluminum/copper plate strip production depend on manual entry by personnel. Operator mistakes during manual entry cause data input errors, so that the data uploaded by the collector to the big data platform is missing or wrong.

(1.2) detecting data loss and abnormal quality problems by combining the background of the whole-flow production process of aluminum/copper plate strip production, which comprises the following steps:

Missing value detection: for each data attribute, the number of samples, including null values, is counted, and the proportion of null values in the total is calculated. Let the number of samples be m, and let the number of null values (the missing count) of each feature attribute be n₀, n₁, …; the missing proportion of each feature attribute, i.e. its missing degree n/m, can then be obtained. If the missing proportion of a feature attribute is greater than the set threshold, generally 80%, the feature attribute is deleted; if the missing proportion is less than 10%, the feature attribute can be filled, with the mode of the attribute as the fill value; if the missing proportion lies between the two, whether to delete or fill the attribute can be decided by judging whether it is closely related to product quality.
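The thresholds above translate directly into a small decision rule. The following Python sketch assumes null values are represented as `None`, and that the "closely related to product quality" judgment is supplied by the caller as a boolean; the function names and the `"keep"` result for a complete column are illustrative additions.

```python
def missing_ratio(column):
    """Return the missing degree n/m: null count over total sample count."""
    m = len(column)
    n = sum(1 for v in column if v is None)
    return n / m

def decide(column, quality_related=False, drop_above=0.80, fill_below=0.10):
    """Map a column's missing ratio to 'keep', 'drop' or 'fill'."""
    ratio = missing_ratio(column)
    if ratio == 0:
        return "keep"   # nothing is missing
    if ratio > drop_above:
        return "drop"   # too sparse to be useful
    if ratio < fill_below:
        return "fill"   # fill with the mode of the attribute
    # In-between band: decided by process knowledge (assumed to be an input here).
    return "fill" if quality_related else "drop"

mostly_missing = [None] * 9 + [1.0]         # ratio 0.9
nearly_complete = [1.0] * 19 + [None]       # ratio 0.05
half_missing = [None] * 5 + [1.0] * 5       # ratio 0.5
```

For example, `decide(mostly_missing)` gives `"drop"`, `decide(nearly_complete)` gives `"fill"`, and `decide(half_missing, quality_related=True)` gives `"fill"`.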

Univariate outlier detection: the data table is sorted to find univariate abnormal values. The data is divided into quartiles, and a range of normal values is derived from them. The data distribution is described by the five-number summary (minimum, first quartile Q1, median, third quartile Q3, and maximum). The interquartile range is IQR = Q3 − Q1, and a value is judged abnormal if it is greater than Q3 + 1.5·IQR or less than Q1 − 1.5·IQR.
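A minimal Python sketch of this quartile rule follows. The text does not say which quartile convention is used; the "inclusive" method of `statistics.quantiles` is an assumption made here, and other conventions can move the bounds slightly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical readings of one process parameter.
readings = [10, 11, 12, 13, 14, 15, 16, 17, 100]
outliers = iqr_outliers(readings)
```

For these readings Q1 = 12 and Q3 = 16, so the normal range is [6, 22] and only the value 100 is flagged.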

Multivariate outlier detection: samples with multivariate abnormal values are searched for by combining density-based, distance-based and clustering-based methods.

Density-based: by comparing the local density of an object with the local densities of its neighbors, regions of similar density can be identified, as well as points whose density is significantly lower than that of their neighbors. The local outlier factor (LOF) algorithm uses K nearest neighbors: within the K-neighbor set of each point, LOF computes the local reachable density (lrd) and compares it with that of each neighbor in the set. In a given dataset, the local reachable density of each object is defined as:

in the formula, |N(x_i)| is the number of points in the neighborhood of point x_i, and lrd(x_i) is the local reachable density of point x_i. A local outlier factor score is calculated for every point in the dataset and the scores are then compared; the larger the local outlier factor score of a data point, the more likely the point is to be judged abnormal.

Distance-based: the basic assumption of distance-based anomaly detection is that similar observations lie close to one another, while outliers are usually isolated observations and therefore lie far from the others. Outliers are identified by measuring the distances between different feature values: if a point is far from its K nearest neighbor points, it can be judged an abnormal value.

Clustering-based: clustering-based anomaly detection clusters the data samples and detects outliers by analyzing the relationship between objects and clusters. An outlier is an object that belongs to a small sparse cluster or to no cluster at all: if an object does not belong to any cluster it is identified as an outlier, and if the distance between an object and its nearest cluster is large it is likewise identified as an outlier.

Finally, the results of the multiple detectors are confirmed by voting: a sample is considered abnormal when more than half of the detectors judge it abnormal.
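The density-based and distance-based detectors and the final vote can be sketched in plain Python. This sketch assumes the standard LOF definitions, in which lrd(x_i) is |N(x_i)| divided by the sum of the reachability distances from x_i to its neighbors. The cutoffs (1.5 for the LOF score, 5.0 for the mean neighbor distance) are illustrative, and the clustering-based detector is left out for brevity.

```python
import math

def k_nearest(points, i, k):
    """Return the k nearest neighbours of points[i] as (distance, index) pairs."""
    dists = sorted((math.dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    return dists[:k]

def lof_scores(points, k):
    """Local outlier factor of every point (assumes no exact duplicate points)."""
    neigh = [k_nearest(points, i, k) for i in range(len(points))]
    k_dist = [nb[-1][0] for nb in neigh]             # distance to the k-th neighbour
    lrd = []
    for nb in neigh:
        reach = [max(k_dist[j], d) for d, j in nb]   # reachability distances
        lrd.append(len(nb) / sum(reach))             # |N(x_i)| / sum of reach-dist
    return [sum(lrd[j] for _, j in nb) / (len(nb) * lrd[i])
            for i, nb in enumerate(neigh)]

def knn_distance_flags(points, k, cutoff):
    """Distance-based detector: flag points whose mean k-NN distance exceeds cutoff."""
    return [sum(d for d, _ in k_nearest(points, i, k)) / k > cutoff
            for i in range(len(points))]

def majority_vote(flags_per_detector):
    """A sample is abnormal when more than half of the detectors flag it."""
    n = len(flags_per_detector)
    return [2 * sum(col) > n for col in zip(*flags_per_detector)]

# Four points on a unit square plus one far-away sample.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
scores = lof_scores(points, k=2)
lof_flags = [s > 1.5 for s in scores]
dist_flags = knn_distance_flags(points, k=2, cutoff=5.0)
votes = majority_vote([lof_flags, dist_flags])
```

The four square corners get an LOF score of 1 (same density as their neighbors), while the distant point scores far above 1 and is the only sample both detectors flag.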

(1.3) analyzing the data quality problem and designing a corresponding processing scheme, including:

Single value processing: in actual data acquisition, the product number format differs from device to device, so the format must be unified when the data set is constructed, to support subsequent data processing; for example, in the fusion casting process the unified format is capital letters plus digits, such as "EC 6934". When the data volume is large and data is collected from many devices, comparing similar product numbers is time-consuming and laborious; to reduce the number of comparisons, the product numbers are sorted. After preprocessing, the product numbers contain no special characters such as Chinese characters, only English letters and digits, so they can be sorted in lexicographic order.
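The unification and sorting of product numbers might look like the following Python sketch. The raw formats shown are invented, and dropping every separator (so that the result is letters followed directly by digits) is an assumption about the target format.

```python
import re

def normalize_product_number(raw):
    """Keep only ASCII letters and digits and uppercase them (assumed target format)."""
    return re.sub(r"[^A-Za-z0-9]", "", raw).upper()

# Hypothetical raw product numbers as different devices might report them;
# the last one carries a Chinese-character prefix, written as escapes.
raw_numbers = ["ec-6934", "EC 6934", "ab_0012", "\u7f16\u53f7EC7001"]

# Deduplicate after normalization, then sort in lexicographic order.
normalized = sorted({normalize_product_number(r) for r in raw_numbers})
```

After normalization the first two entries collapse into one product number, which is what brings matching records next to each other once the list is sorted.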

After the original data is quick-sorted, the single-value records sit at adjacent positions, and the records that need processing can be identified. In most cases they are handled in one of two ways. The first is to delete single-value records: the record retained is the one whose information is most comprehensive and complete, and the remaining single-value records are deleted; since a time field is available in actual acquisition, the selection can also be made along the time dimension, keeping the latest record and deleting the records from other times. The second is to use the information in every sample of the single-value records and merge the repeated records; for example, if the repeated records support averaging, they can be averaged so that all the data information is used.
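Both handling strategies can be sketched with the standard library. The record fields (`product_no`, `time`, `temp`) are hypothetical, and timestamps are assumed to be ISO-8601 strings so that string order matches time order.

```python
from itertools import groupby
from operator import itemgetter

def keep_latest(records):
    """Sort by product number and time, then keep the newest record per product."""
    ordered = sorted(records, key=itemgetter("product_no", "time"))
    return [list(group)[-1]
            for _, group in groupby(ordered, key=itemgetter("product_no"))]

def merge_by_mean(records, field):
    """Alternative: average a numeric field over all records of each product."""
    ordered = sorted(records, key=itemgetter("product_no"))
    merged = {}
    for key, group in groupby(ordered, key=itemgetter("product_no")):
        vals = [r[field] for r in group]
        merged[key] = sum(vals) / len(vals)
    return merged

records = [
    {"product_no": "EC6934", "time": "2021-03-01T08:00", "temp": 700.0},
    {"product_no": "AB0012", "time": "2021-03-01T09:00", "temp": 650.0},
    {"product_no": "EC6934", "time": "2021-03-02T08:00", "temp": 710.0},
]
latest = keep_latest(records)
means = merge_by_mean(records, "temp")
```

Sorting first is what makes `groupby` work, since it only groups adjacent items; this mirrors the sort-then-compare-neighbors idea in the text.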

Missing value processing: fields whose missing degree exceeds 80% are deleted and discarded directly. Fields whose missing degree is below 10% can be filled with the mode of the field. Fields whose missing degree lies between 10% and 80% can be deleted or retained according to the process background.

Abnormal value processing: for univariate abnormal values, the mode of the corresponding field is computed and the abnormal values are replaced with it; samples with multivariate abnormal values are deleted directly.
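The fill, replace and delete operations can be sketched as follows. Computing the replacement mode only over the values not flagged as abnormal is a choice made for this sketch; the text only says that the mode of the field is used.

```python
from collections import Counter

def mode_of(values):
    """Most frequent non-null value of a field."""
    return Counter(v for v in values if v is not None).most_common(1)[0][0]

def fill_missing_with_mode(values):
    """Replace every null entry with the mode of the field."""
    m = mode_of(values)
    return [m if v is None else v for v in values]

def replace_outliers_with_mode(values, is_outlier):
    """Replace flagged univariate outliers with the mode of the unflagged values."""
    m = mode_of([v for v, bad in zip(values, is_outlier) if not bad])
    return [m if bad else v for v, bad in zip(values, is_outlier)]

def drop_rows(rows, is_multivariate_outlier):
    """Delete the samples flagged as multivariate outliers."""
    return [row for row, bad in zip(rows, is_multivariate_outlier) if not bad]

filled = fill_missing_with_mode([3, None, 3, 5])
repaired = replace_outliers_with_mode([3, 3, 99, 5], [False, False, True, False])
remaining = drop_rows([[1, 2], [3, 4], [5, 6]], [False, True, False])
```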

(1.4) determining the specific steps of the full-process big data cleaning system for the aluminum/copper plate strip, as shown in FIG. 3; in view of the data quality problems of the full-process big data of the aluminum/copper plate strip, and in combination with the characteristics of that data, data cleaning is carried out according to the following steps:

Loading a data source: the distributed data is read using the distributed storage in Hadoop.

Identifying the detection data: the data is inspected in the full-process big data cleaning system for the aluminum/copper plate strip, and its possible quality problems are identified.

Determining a cleaning rule: before cleaning starts, the data to be cleaned is backed up; the detected data quality problems are then processed. Single value problem: if the values in a column are all a single value, the column is deleted and the result is stored; if individual data are missing in a single-value column, the affected rows are deleted. Missing value problem: the missing threshold proportion defaults to 80%; if the missing proportion is larger than the threshold, the field is deleted; otherwise it is filled. Outlier problem: univariate abnormal values are replaced in accordance with the process parameter range; samples with multivariate abnormal values are deleted.

Data cleaning: the loaded source data is cleaned with the MapReduce framework in Hadoop, using the determined cleaning rules and cleaning methods.
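How a cleaning rule might be evaluated in the map and reduce style is sketched below. This is a plain-Python simulation of the two phases, not Hadoop API code: the CSV layout, the column names and the in-memory shuffle are assumptions made for illustration. It computes the per-column missing proportion that the 80% rule needs.

```python
from collections import defaultdict

def mapper(line, header):
    """Map phase: emit (column, (null_count, 1)) for every field of one CSV line."""
    for name, field in zip(header, line.split(",")):
        yield name, (1 if field.strip() == "" else 0, 1)

def reducer(name, pairs):
    """Reduce phase: sum the counts of one column and return its missing ratio."""
    nulls = sum(p[0] for p in pairs)
    total = sum(p[1] for p in pairs)
    return name, nulls / total

def run(lines, header):
    """Simulate the shuffle step in memory, then reduce each key."""
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in mapper(line, header):
            shuffled[key].append(value)
    return dict(reducer(k, v) for k, v in shuffled.items())

header = ["product_no", "temp", "speed"]   # hypothetical columns
lines = ["EC6934,700,", "EC6935,,", "EC6936,705,", "EC6937,710,", "EC6938,702,"]
ratios = run(lines, header)
to_drop = [c for c, r in ratios.items() if r > 0.80]
```

In this toy input the `speed` column is empty in every row, so it is the only column that exceeds the default 80% threshold.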

S2: constructing a distributed computing and analysis method tool library suitable for the aluminum/copper plate strip full-process big data, specifically comprising the following steps:

(2.1) design of aluminum/copper plate strip full-process big data depth optimization algorithm tool library

As shown in fig. 4, the optimization algorithm tool library function module design includes a file management module, a supervised analysis module, an unsupervised analysis module and an intelligent optimization analysis module.

A file management module: the module is responsible for loading data files, acquiring file information and storing intermediate results of preprocessing or optimization;

Supervised analysis module: this module reads data files for analysis. It integrates a variety of supervised algorithms; the user can select a suitable supervised algorithm and set its parameters before running it, or run it in one click with the system's default supervised algorithm and parameter settings. After the algorithm finishes, the evaluation indexes of the model are displayed.

Unsupervised analysis module: this module reads data files for analysis. It integrates a variety of unsupervised algorithms; the user can select a suitable unsupervised algorithm and set its parameters before running it, or run it in one click with the system's default unsupervised algorithm and parameter settings. After the algorithm finishes, the evaluation indexes of the model are displayed.

Intelligent optimization module: this module reads data files for analysis. It integrates a variety of intelligent optimization algorithms; the user can select a suitable intelligent optimization algorithm and set its parameters before running it, or run it in one click with the system's default intelligent optimization algorithm and parameter settings. After the algorithm finishes, the evaluation indexes of the model are displayed.

As shown in fig. 5, the optimization algorithm tool library structural design comprises a data display layer, a man-machine interaction layer, an intermediate data storage layer, an algorithm analysis layer and an aluminum/copper plate strip source data layer;

The first layer, the aluminum/copper plate strip source data layer, comprises the industrial data set collected from the aluminum/copper plate strip production process; the second layer is the data operation layer, which comprises the supervised learning module, the unsupervised learning module and the intelligent optimization algorithm module, and whose design is the core architecture of the system; the third layer is the intermediate data management layer, mainly used to store the process data produced by the modules of the second layer; the fourth layer is the human-computer interaction layer, built on the first three layers, which provides an interactive interface so that the user can conveniently execute the operations of the second and third layers, the interface being implemented mainly through shortcut keys and button-click message responses; and the fifth layer is the data display layer, which mainly displays operation results and, so that the user can view the processing results more intuitively, automatically displays the evaluation indexes and result graphs of the analysis.

(2.2) development of full-flow big data depth optimization algorithm tool library for aluminum/copper plate strips

Confirming the environment and related technologies required to develop the optimization algorithm tool library: development is carried out mainly on the Windows and Linux operating systems with an integrated Hadoop big data ecosystem, using the Spring Boot development framework; the design covers the software visualization interface, the data mining algorithms, model evaluation visualization and the like; the data mining algorithms involved in the invention are designed and developed with the scikit-learn algorithm tool library.

Establishing the depth optimization algorithm tool library development platform: algorithms are developed by combining the Spring Boot framework with the scikit-learn algorithm tool library, and the data storage layer of the platform is developed on the HDFS format of Hadoop.

(2.3) As shown in FIG. 6, the distributed algorithm analysis business process is implemented as follows: the tool library for full-process big data analysis of aluminum/copper plate strip production contains many kinds of analysis methods, so the methods need to be unified and standardized behind a customized unified calling interface. The specific business process is as follows:

Loading a data source: the source data processed by the distributed big data cleaning system for the aluminum/copper plate strip is stored in a distributed file system with Hadoop at its core, and the data is read and loaded in a distributed manner.

Confirming a specific algorithm: the analysis methods in the tool library include supervised, unsupervised and other types, and different algorithms can be selected to meet different user requirements.

Setting algorithm parameters: depending on the type of analysis method chosen from the tool library, different method parameters can be selected for data analysis; if none are set, default values are used.
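One way to picture a unified calling interface with default parameters is a small registry, sketched below in Python. The registry, the decorator and the two toy algorithms are hypothetical placeholders invented for this sketch; the text does not describe how the actual interface is built.

```python
from typing import Any, Dict

REGISTRY: Dict[str, Dict[str, Any]] = {}

def register(name, kind, **defaults):
    """Decorator: add an algorithm to the library with its type and default parameters."""
    def wrap(func):
        REGISTRY[name] = {"kind": kind, "func": func, "defaults": defaults}
        return func
    return wrap

@register("mean_predictor", kind="supervised", target="y")
def mean_predictor(data, target):
    """Toy supervised method: predict the mean of the target column."""
    vals = [row[target] for row in data]
    return {"prediction": sum(vals) / len(vals)}

@register("threshold_split", kind="unsupervised", field="x", cut=0.0)
def threshold_split(data, field, cut):
    """Toy unsupervised method: split samples into two groups at a cut value."""
    return {"labels": [int(row[field] > cut) for row in data]}

def run_analysis(name, data, **params):
    """Unified entry point: user parameters override the registered defaults."""
    entry = REGISTRY[name]
    merged = {**entry["defaults"], **params}
    return entry["func"](data, **merged)

data = [{"x": -1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
default_run = run_analysis("threshold_split", data)          # defaults only
custom_run = run_analysis("threshold_split", data, cut=5.0)  # user-set parameter
supervised_run = run_analysis("mean_predictor", data)
```

Every algorithm is called the same way, `run_analysis(name, data, **params)`, and omitting `params` falls back to the defaults, which matches the "default values are used if none are set" behavior described above.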

The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solution of the present invention by those skilled in the art should fall within the protection scope defined by the claims of the present invention without departing from the spirit of the present invention.

Claims

1. The method for cleaning and analyzing the big data of the whole production process of the aluminum/copper plate strip is characterized by comprising the following steps of:

S2: establishing a distributed computing and analysis algorithm tool library for the aluminum/copper plate strip full-process big data.

2. The method for cleaning and analyzing the whole flow of big data in the production of aluminum/copper plate and strip as claimed in claim 1, wherein the method comprises the following steps: the specific process in step S1 is as follows:

(1.2) detecting the data loss and abnormal quality problems by combining the background of the whole-flow production process of aluminum/copper plate strip production;

missing value detection: counting the number of null values and the total number of samples in the full-process production data of aluminum/copper plate strip, and determining the missing value proportion;

univariate outlier detection: after the data is quick-sorted, dividing the data into quartiles, and determining univariate abnormal values from the quartiles;

multivariate abnormal value detection: searching samples of multivariate abnormal values based on density, distance and clustering methods comprehensively;

(1.3) analyzing the data quality problem and designing a corresponding processing scheme, the processing scheme comprising

single value processing: single values are handled specifically; if all values in a column are the same single value, the column is deleted and the result is stored; if a single-value column lacks individual data, the affected rows are deleted;

missing value processing: setting a missing threshold proportion, and if the missing proportion is larger than the threshold, carrying out a deletion operation; otherwise, filling;

abnormal value processing: replacing univariate abnormal values in accordance with the process parameter range; deleting multivariate abnormal values;

(1.4) combining the big data characteristics of the aluminum/copper plate strip, and carrying out data cleaning according to the following steps aiming at the data quality problem of the big data of the whole process of the aluminum/copper plate strip:

determining a cleaning rule: before cleaning starts, backing up the data to be cleaned; processing the detected data quality problems; single value problem: if the values in a column are all a single value, deleting the column and storing the result, and if individual data are missing in a single-value column, deleting the affected rows; missing value problem: setting the missing threshold proportion to 80% by default, and carrying out a deletion operation if the missing proportion is larger than the threshold; otherwise, filling; outlier problem: replacing univariate abnormal values in accordance with the process parameter range; and deleting multivariate abnormal values.

(2.1) designing a tool library of a full-flow big data depth optimization algorithm of the aluminum/copper plate strip;

optimizing the design of a functional module of an algorithm tool library, wherein the module design comprises a file management module, a supervised analysis module, an unsupervised analysis module and an intelligent optimization analysis module;

the optimization algorithm tool library structural design comprises a data display layer, a man-machine interaction layer, an intermediate data storage layer, an algorithm analysis layer and an aluminum/copper plate strip source data layer;

(2.2) developing a full-flow big data depth optimization algorithm tool library of the aluminum/copper plate strip;

confirming the required environment and the relevant technology for developing the optimization algorithm tool library;

establishing a depth optimization algorithm tool library development platform;