CN113177040A - Full-process big data cleaning and analyzing method for aluminum/copper plate strip production - Google Patents

Full-process big data cleaning and analyzing method for aluminum/copper plate strip production

Info

Publication number
CN113177040A
CN113177040A (application CN202110476972.8A)
Authority
CN
China
Prior art keywords
data
aluminum
copper plate
plate strip
cleaning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110476972.8A
Other languages
Chinese (zh)
Inventor
刘士新
隋佳欢
陈大力
温睿
赵梓焱
姚明昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202110476972.8A priority Critical patent/CN113177040A/en
Publication of CN113177040A publication Critical patent/CN113177040A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases
    • G06F 16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063 Operations research, analysis or management
    • G06Q 10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q 10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q 50/04 Manufacturing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Educational Administration (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Evolutionary Computation (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Health & Medical Sciences (AREA)
  • Manufacturing & Machinery (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • General Factory Administration (AREA)

Abstract

The invention relates to a method for cleaning and analyzing the full-process big data of aluminum/copper plate and strip production. The method constructs a quality-problem detection approach suited to the aluminum/copper plate strip full-process big data, cleans the data according to the data characteristics and the process background, and establishes a distributed computing and analysis algorithm tool library oriented to the aluminum/copper plate strip full-process big data. By designing this full-process big data cleaning and analysis method, the invention provides a direction for mining and exploiting the full-process big data of aluminum/copper plate strip production and improves the utilization rate of that data.

Description

Full-process big data cleaning and analyzing method for aluminum/copper plate strip production
Technical Field
The invention relates to the field of big data and data mining, in particular to a method for cleaning and analyzing big data of the whole process of aluminum/copper plate strip production.
Background
The industrial big data generated across the whole aluminum/copper plate and strip production process is multi-source and heterogeneous, often inaccurate, and difficult to query, which places very high demands on data cleaning and analysis. With the rapid development of machine learning and big data theory, many researchers have combined machine learning with big data technology to carry out data-driven studies on data analysis and optimization in industrial production and have made notable progress. Using industrial process data to build relational models for the key parameters of the production process not only accelerates modeling but also avoids the complexity of traditional mechanism modeling. Data-driven modeling does not require deep understanding of the production mechanism: the production process is represented by machine learning models, which further reduces modeling time. In addition, because data-driven modeling and optimization start from the data themselves, modeling errors caused by a lack of domain expertise can be avoided, and new findings may be revealed to domain experts. Data-driven industrial process modeling and optimization therefore offers a new approach to industrial process analysis, and cleaning and analyzing the full-process big data of aluminum/copper plate strip production is of great and far-reaching significance.
When the large volumes of data produced by an aluminum/copper plate and strip plant are collected, importing and retrieving the data consumes a great deal of technicians' time and effort and imposes a heavy workload on the staff involved. Meanwhile, production, data transmission, manual processing and similar stages inevitably suffer noise interference, so the collected data is often mixed with a large amount of interference and even erroneous information, producing large quantities of dirty data such as erroneous, missing and duplicated records. Historical data likewise contains many missing values, abnormal values and duplicated values, so its information value is low. Traditional mechanism modeling of the full aluminum/copper plate strip production process is very complex and, owing to the lack of specialized knowledge, frequently suffers from modeling errors. Moreover, full-process production in an aluminum/copper plate strip plant is highly specialized and its data quality problems are complicated, yet a specially tailored data cleaning and analysis method has been lacking.
Disclosure of Invention
To address these problems, the invention aims to provide a method for cleaning and analyzing the full-process big data of aluminum/copper plate strip production that can effectively handle problems such as uneven data quality and frequent anomalies.
The technical scheme adopted by the invention is as follows:
The invention provides a method for cleaning and analyzing the full-process big data of aluminum/copper plate strip production, which comprises the following steps:
S1: construct a cleaning system suited to the aluminum/copper plate strip full-process big data, and clean the data according to the characteristics of the full-process production data and the process background;
S2: establish a distributed computing and analysis algorithm tool library oriented to the aluminum/copper plate strip full-process big data, build relational models of the production process, and optimize process parameters and alloy compositions.
Further, the specific process in step S1 is as follows:
(1.1) analyze the quality problems of the aluminum/copper plate strip full-process big data and, based on the characteristics of full-process production and processing, identify and summarize the stages in which data quality problems arise;
the data quality problems include: data loss or errors caused by equipment performance limitations; and data loss or errors caused by human operating mistakes;
(1.2) analyze the data quality problems and design corresponding processing schemes, the processing schemes comprising:
single value processing: handle single-value (constant) attributes specifically; if all values in a column are a single value, delete the column and save the result; if individual entries in such a column are missing, either delete the affected rows or complete them by filling in the single value;
missing value processing: handling of missing values comprises row deletion and data completion;
abnormal value processing: handling of abnormal values comprises row deletion and data replacement;
(1.3) determine the specific steps of the aluminum/copper plate strip full-process big data cleaning system; combining the characteristics of the aluminum/copper plate strip big data, data cleaning is performed for the identified data quality problems according to the following steps:
loading the data source: read the distributed data, using the distributed storage in Hadoop for distributed reading;
identifying and detecting the data: detect the data in the aluminum/copper plate strip full-process big data cleaning system and identify the quality problems the data may have;
determining cleaning rules: start cleaning and back up the data to be cleaned in advance; set corresponding cleaning rules and different cleaning methods for the different types of data quality problems in the full-process big data;
data cleaning: clean the loaded source data with the determined cleaning rules and methods using the MapReduce framework in Hadoop;
checking the cleaning result: check the cleaned data against the rules and evaluation criteria set for the cleaning, and return to the previous step to continue cleaning if the check fails.
Further, the specific process of step S2 is as follows:
(2.1) implement feature engineering for the full aluminum/copper plate strip production process: given characteristics of the full-process industrial big data such as multi-source heterogeneity and high-dimensional, dynamic, multi-spatio-temporal scales, perform feature construction and feature selection consistent with those characteristics and with the process background;
feature construction for the full-process big data: in the feature construction stage, some features are segmented; label-type (categorical) variables are converted into numerical variables before modeling, with no ordinal meaning attached to the numeric values; and features not covered by the original data, such as weather and climate conditions during production and operator fatigue at the start of a shift, are added;
feature selection for the full-process big data: in the feature selection stage, the process parameters related to ingot quality are screened out of the original data, some single-value variables are deleted, and, by checking the correlation between features, only one representative feature is kept from each group of strongly correlated features, reducing data redundancy (an illustrative sketch of this step is given after this section);
(2.2) develop a tool library of analysis algorithms oriented to the aluminum/copper plate strip full-process production big data; the tool library provides the following functions: unsupervised learning, supervised learning and intelligent optimization;
(2.3) implement the distributed algorithm analysis business process: standardize the analysis algorithms, customize a unified calling interface, and follow this business process:
loading the data source: the source data processed by the distributed big data cleaning system for the aluminum/copper plate strip is stored in a distributed file system with Hadoop at its core, and the data is read and loaded in a distributed manner;
confirming the specific algorithm: the analysis methods in the tool library include supervised, unsupervised and other types, and different algorithms can be selected to meet different user requirements;
setting algorithm parameters: the analysis methods in the tool library differ in type; different method parameters can be selected for the analysis, or the default values can be used without explicit settings;
analyzing the results: the results produced by the full-process big data analysis method are returned to the user, and relational models of aluminum/copper plate strip production are constructed from the returned results.
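The feature construction and selection of step (2.1) can be illustrated with the following minimal sketch, assuming the cleaned records are available as a pandas DataFrame; the column names and the 0.95 correlation threshold are illustrative assumptions, not values specified by the invention.

```python
# Hedged sketch of feature construction/selection for process data.
# The column names (e.g. "ingot_quality") are hypothetical examples.
import pandas as pd

def select_features(df: pd.DataFrame, target: str, corr_threshold: float = 0.95) -> pd.DataFrame:
    # Encode label-type (categorical) variables as integer codes before modeling;
    # the codes carry no ordinal meaning.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes

    # Drop single-value columns (constant features carry no information).
    df = df.loc[:, df.nunique() > 1]

    # Among strongly correlated feature pairs, keep only one representative.
    features = df.drop(columns=[target])
    corr = features.corr().abs()
    cols, to_drop = corr.columns, set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > corr_threshold and cols[j] not in to_drop:
                to_drop.add(cols[j])
    return df.drop(columns=list(to_drop))
```

Keeping only one representative of each strongly correlated group reduces redundancy while preserving most of the information the correlated process parameters share.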
By designing this full-process big data cleaning and analysis method for aluminum/copper plate strip production, the invention provides a direction for mining and exploiting the full-process production big data and can effectively improve its utilization rate.
Drawings
FIG. 1 is a schematic diagram illustrating the use of the data cleansing method of the present invention;
FIG. 2 is a block diagram of a data set anomaly detection algorithm of the present invention;
FIG. 3 is a schematic flow chart of data cleansing according to the present invention;
FIG. 4 is a block diagram of an algorithm tool according to the present invention;
FIG. 5 is an architectural diagram of data analysis in accordance with the present invention;
FIG. 6 is a schematic flow chart of data analysis according to the present invention.
Detailed Description
To illustrate the embodiments of the present invention and the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
The invention provides a method for cleaning and analyzing the full-process big data of aluminum/copper plate strip production, whose specific implementation steps are as follows:
S1: as shown in FIG. 1, construct a quality-problem detection method suited to the aluminum/copper plate strip full-process big data and clean the data according to the data characteristics and the process background; specifically:
(1.1): analyze the quality problems of the aluminum/copper plate strip full-process big data and, based on the characteristics of full-process production and processing, identify and summarize the stages that generate data quality problems, including:
Data loss or errors caused by equipment performance limitations: during aluminum/copper plate strip production, some production stages run at high speed, and real-time data acquisition and uploading by the equipment often loses data because of limited performance, so data cannot be uploaded to the data collector in time. Equipment faults or sensor failures can likewise corrupt or lose the collected data.
Data loss or errors caused by human operating mistakes: in actual production, the data and parameters of many stages depend on manual entry by personnel. Manual entry errors lead to incorrect data and, in turn, to loss and errors in the data uploaded by the collector to the big data platform.
(1.2) detect the data loss and abnormality quality problems in combination with the background of the full aluminum/copper plate strip production process, including:
Missing value detection: count the number of samples for each data attribute, including the number of null values, and calculate the proportion of null values in the total. Denote the number of samples by m and the number of null (missing) values of each feature attribute by n_0, n_1, ...; the missing proportion of each feature attribute, i.e. its missing degree, is then n/m. If the missing proportion of a feature attribute exceeds a set threshold, generally 80%, the attribute is deleted; if the missing proportion is below 10%, the attribute can be completed, using the mode of the attribute as the fill value; if the missing proportion lies between the two thresholds, the decision to delete or fill is made together with a judgement of whether the attribute is closely related to product quality.
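A minimal sketch of this missing-degree rule, assuming the process records are loaded into a pandas DataFrame; the 80% and 10% thresholds follow the text, while the function and variable names are illustrative.

```python
import pandas as pd

def handle_missing(df: pd.DataFrame, drop_thresh: float = 0.8, fill_thresh: float = 0.1):
    """Apply the missing-degree rule; return the processed frame and the
    attributes left to process-expert judgement."""
    m = len(df)
    needs_review = []
    for col in list(df.columns):
        ratio = df[col].isna().sum() / m          # missing degree n/m
        if ratio > drop_thresh:                   # heavily missing: delete the attribute
            df = df.drop(columns=[col])
        elif ratio >= fill_thresh:                # in between: decide with process knowledge
            needs_review.append(col)
        elif ratio > 0:                           # lightly missing: fill with the mode
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df, needs_review
```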
Univariate abnormal value detection: sort the data table to find univariate abnormal values; sample the data and derive a value range from the quartiles by decomposing the data into quartiles. The distribution is summarized by the five-number summary (minimum, first quartile Q1, median, third quartile Q3 and maximum). The interquartile range is IQR = Q3 - Q1, and values greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR are judged to be abnormal.
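The quartile rule above can be sketched as a small helper; this is an illustrative implementation assuming a numeric pandas Series, not code from the original disclosure.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series) -> pd.Series:
    """Return a boolean mask that is True where the value is a univariate outlier."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1                                        # interquartile range
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
```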
Multivariate abnormal value detection: search for multivariate abnormal samples by combining density-based, distance-based and clustering-based methods. Density-based: by comparing the local density of an object with the local densities of its neighbors, regions of similar density can be identified, as well as points whose density is significantly lower than that of their neighbors. The local outlier factor (LOF) algorithm uses the K nearest neighbors: within each point's K-neighborhood, LOF uses the local reachability density (lrd) and compares it with that of every member of the KNN set. In a given dataset, the local reachability density of each object is defined as:
$$ \mathrm{lrd}(x_i) = \frac{\lvert N(x_i)\rvert}{\sum_{x_j \in N(x_i)} \mathrm{reach\text{-}dist}_k(x_i, x_j)} $$
where |N(x_i)| denotes the number of neighbors of point x_i, reach-dist_k(x_i, x_j) is the reachability distance from x_i to x_j, and lrd(x_i) denotes the local reachability density of point x_i. A local outlier factor score is computed for every point in the dataset; after all scores have been computed they are compared, and the larger the local outlier factor score of a data point, the more likely that point is judged to be abnormal. Distance-based: the basic assumption of distance-based anomaly detection is that similar observations lie close to each other, while outliers are typically isolated observations and therefore far away; abnormal values are identified by measuring the distances between different feature vectors, and if a point lies far from its K nearest neighbors it can be judged abnormal. Clustering-based: clustering-based anomaly detection clusters the data samples and detects outliers by analyzing the relationship between objects and clusters; an outlier is an object that belongs to a small, sparse cluster or to no cluster at all. If an object does not belong to any cluster, or if the distance between the object and its nearest cluster is large, it is identified as an outlier. Finally, the multivariate detection results are confirmed by voting: a value is regarded as abnormal when more than half of the detectors judge it to be abnormal.
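The voting combination can be sketched with scikit-learn, which the text names as its algorithm library; the choice of detectors (LOF, a k-nearest-neighbour distance rule and DBSCAN) mirrors the three categories above, while the specific thresholds (95th percentile, eps=0.5) are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors
from sklearn.cluster import DBSCAN

def multivariate_outliers(X: np.ndarray, k: int = 20) -> np.ndarray:
    # Detector 1: density based - local outlier factor flags outliers with -1.
    lof_flags = LocalOutlierFactor(n_neighbors=k).fit_predict(X) == -1

    # Detector 2: distance based - flag points whose mean distance to their
    # k nearest neighbours is unusually large (here: above the 95th percentile).
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    mean_dist = dists[:, 1:].mean(axis=1)        # skip the zero self-distance
    dist_flags = mean_dist > np.percentile(mean_dist, 95)

    # Detector 3: clustering based - DBSCAN labels noise points as -1.
    cluster_flags = DBSCAN(eps=0.5, min_samples=k).fit_predict(X) == -1

    votes = lof_flags.astype(int) + dist_flags.astype(int) + cluster_flags.astype(int)
    return votes >= 2                            # majority vote of the three detectors
```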
(1.3) analyze the data quality problems and design corresponding processing schemes, including:
Single value processing: in actual data acquisition the product number format of each device is not uniform, so the formats must be unified when the data set is built, for later processing; in the casting process, for example, the unified format is a capital-letter-plus-digits format such as "EC6934". When the data volume is large and many devices are collected, comparing similar product numbers pair by pair is time-consuming, so to reduce the number of comparisons the product numbers are sorted; preprocessing ensures that the product numbers contain no special characters such as Chinese characters and consist only of letters and digits, so they can be sorted in dictionary order.
After quick-sorting the original data, records that share the same product number (duplicate records) end up at adjacent positions and can then be processed. In most cases the processing takes one of two forms. The first is deletion: the record kept is the one whose information is more comprehensive and complete, and the remaining duplicates are deleted; since the collected data carries a time field, the selection can also be made along the time dimension, keeping the most recent record and deleting the records from other times. The second is to use the data of every sample among the duplicates and integrate the repeated records; for example, if accumulation and averaging is meaningful for the repeated values, averaging can be chosen so that all of the data is used.
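A minimal sketch of the two duplicate-handling strategies, assuming hypothetical column names "product_no" and "timestamp" for the product number and acquisition time.

```python
import pandas as pd

def deduplicate(df: pd.DataFrame, key: str = "product_no",
                time_col: str = "timestamp", average: bool = False) -> pd.DataFrame:
    df = df.sort_values(key)                      # duplicates become adjacent after sorting
    if average:
        # Strategy 2: merge duplicates, averaging numeric fields and keeping
        # the first value of every other field.
        numeric = df.select_dtypes("number").columns
        agg = {c: ("mean" if c in numeric else "first") for c in df.columns if c != key}
        return df.groupby(key, as_index=False).agg(agg)
    # Strategy 1: keep only the most recent record for each product number.
    return df.sort_values(time_col).drop_duplicates(subset=key, keep="last")
```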
Missing value processing: for fields whose missing degree exceeds 80%, the strategy is to delete and discard the field directly. Fields with less than 10% missing can be filled using the mode of the field. For fields whose missing degree lies between 10% and 80%, deletion or retention is decided according to the process background.
Abnormal value processing: compute the mode of each field containing univariate abnormal values and replace those abnormal values with the mode; delete multivariate abnormal values directly.
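A hedged, self-contained sketch of these two rules: univariate abnormal values found by the IQR rule are replaced with the column mode, and rows flagged by an LOF detector are deleted. The thresholds are illustrative, and missing values are assumed to have been handled beforehand so the numeric columns contain no NaN.

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

def treat_outliers(df: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
    df = df.copy()
    for col in numeric_cols:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        df.loc[mask, col] = df[col].mode().iloc[0]     # replace with the field mode
    # Multivariate abnormal rows are deleted outright.
    flags = LocalOutlierFactor(n_neighbors=20).fit_predict(df[numeric_cols]) == -1
    return df.loc[~flags]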
(1.4) determine the specific steps of the aluminum/copper plate strip full-process big data cleaning system, as shown in FIG. 3; combining the characteristics of the aluminum/copper plate strip big data, data cleaning is performed for the identified data quality problems according to the following steps:
Loading the data source: read the distributed data, using the distributed storage in Hadoop for distributed reading.
Identifying and detecting the data: detect the data in the aluminum/copper plate strip full-process big data cleaning system and identify the quality problems the data may have.
Determining cleaning rules: start cleaning and back up the data to be cleaned in advance, then handle the detected data quality problems. Single value problem: if all values in a column are a single value, delete the column and save the result; if individual entries in a single-value column are missing, delete the affected rows. Missing value problem: set the missing-degree threshold (80% by default); if the missing proportion exceeds the threshold, delete, otherwise fill. Abnormal value problem: replace univariate abnormal values using the process parameter ranges, and delete multivariate abnormal values.
Data cleaning: clean the loaded source data with the determined cleaning rules and methods using the MapReduce framework in Hadoop (an illustrative mapper sketch follows this list).
Checking the cleaning result: check the cleaned data against the rules and evaluation criteria set for the cleaning, and return to the previous step to continue cleaning if the check fails.
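The text names only the Hadoop MapReduce framework for the cleaning step; as one hedged illustration, the row-level rules could be expressed as a Hadoop Streaming mapper in Python like the one below. It assumes comma-separated records and applies only per-row rules (column-level statistics such as the per-attribute missing degree would be computed in a separate pass or in the reducer).

```python
#!/usr/bin/env python3
# Illustrative Hadoop Streaming mapper: rows that are mostly empty are dropped,
# all other records are emitted unchanged for the reduce stage.
import sys

ROW_MISSING_THRESHOLD = 0.8   # illustrative per-row limit on empty fields

for line in sys.stdin:
    record = line.rstrip("\n")
    if not record:
        continue
    fields = record.split(",")
    missing = sum(1 for f in fields if f.strip() == "")
    if missing / len(fields) > ROW_MISSING_THRESHOLD:
        continue              # discard records with too many empty fields
    print(record)             # keep the record
```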
S2: construct a tool library for distributed computing and analysis of the aluminum/copper plate strip full-process big data, specifically comprising the following steps:
(2.1) design of aluminum/copper plate strip full-process big data depth optimization algorithm tool library
As shown in FIG. 4, the functional module design of the optimization algorithm tool library includes a file management module, a supervised analysis module, an unsupervised analysis module and an intelligent optimization analysis module.
File management module: responsible for loading data files, acquiring file information and storing intermediate results of preprocessing or optimization;
Supervised analysis module: this module reads data files for analysis. The scheme integrates a variety of supervised algorithms; the user can select a suitable supervised algorithm and set its parameters before running it, or run a one-click process using the system's default supervised algorithm and parameter settings, and the model's evaluation metrics are displayed when the run finishes.
Unsupervised analysis module: this module reads data files for analysis. The scheme integrates a variety of unsupervised algorithms; the user can select a suitable unsupervised algorithm and set its parameters before running it, or run a one-click process using the system's default unsupervised algorithm and parameter settings, and the model's evaluation metrics are displayed when the run finishes.
Intelligent optimization module: this module reads data files for analysis. The scheme integrates a variety of intelligent optimization algorithms; the user can select a suitable intelligent optimization algorithm and set its parameters before running it, or run a one-click process using the system's default intelligent optimization algorithm and parameter settings, and the model's evaluation metrics are displayed when the run finishes.
As shown in FIG. 5, the structural design of the optimization algorithm tool library comprises a data display layer, a human-computer interaction layer, an intermediate data storage layer, an algorithm analysis layer and an aluminum/copper plate strip source data layer;
The aluminum/copper plate strip data layer comprises the collected industrial data sets from the aluminum/copper plate strip production process. The second layer is the data operation (algorithm analysis) layer, which comprises the supervised learning, unsupervised learning and intelligent optimization algorithm modules; its design is the core architecture of the system. The third layer is the intermediate data management layer, which mainly stores the process data produced by the modules of the second layer. The fourth layer is the human-computer interaction layer, developed on top of the first three layers; it provides an interactive interface so that the user can conveniently execute the operations of the second and third layers, implemented mainly through shortcut-key operations and a click-button response-message mechanism. The fifth layer is the data display layer, which mainly presents the operation results; to let the user view the processing results more intuitively, the computed evaluation metrics and result graphs are displayed automatically.
(2.2) development of full-flow big data depth optimization algorithm tool library for aluminum/copper plate strips
Confirm the environment and related technologies required to develop the optimization algorithm tool library: development takes place mainly in an integrated Hadoop big data ecosystem under the Windows and Linux operating systems, using the Spring Boot development framework; the design covers the software's visual interface, the data mining algorithms, model evaluation visualization and so on; the data mining algorithms involved in the invention are designed and developed with the scikit-learn algorithm library.
Establish the depth optimization algorithm tool library development platform: algorithm development combines the Spring Boot framework with the scikit-learn algorithm library, and the platform's data storage layer is developed on the HDFS format of Hadoop.
(2.3) As shown in FIG. 6, the distributed algorithm analysis business process is implemented as follows: the tool library of the aluminum/copper plate strip full-process big data analysis method contains many analysis methods, so the methods are standardized and a unified calling interface is customized; the specific business process is as follows (an illustrative sketch of such a unified interface is given after these steps):
Loading the data source: the source data processed by the distributed big data cleaning system for the aluminum/copper plate strip is stored in a distributed file system with Hadoop at its core, and the data is read and loaded in a distributed manner.
Confirming the specific algorithm: the analysis methods in the tool library include supervised, unsupervised and other types, and different algorithms can be selected to meet different user requirements.
Setting algorithm parameters: the analysis methods in the tool library differ in type; different method parameters can be selected for the analysis, or the default values can be used without explicit settings.
Analyzing the results: the results produced by the full-process big data analysis method are returned to the user, and relational models of aluminum/copper plate strip production are constructed from the returned results.
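As one hedged sketch of such a unified calling interface, the snippet below registers a supervised and an unsupervised scikit-learn algorithm behind a single entry point. The registry contents, default parameters and CSV loading (instead of the HDFS storage named in the text) are illustrative assumptions, not part of the original disclosure.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical registry: algorithm name -> (estimator class, default parameters).
ALGORITHMS = {
    "supervised/random_forest": (RandomForestRegressor, {"n_estimators": 100}),
    "unsupervised/kmeans": (KMeans, {"n_clusters": 8}),
}

def run_analysis(csv_path, algorithm, target=None, params=None):
    df = pd.read_csv(csv_path)                                # cleaned, all-numeric data assumed
    estimator_cls, defaults = ALGORITHMS[algorithm]
    model = estimator_cls(**{**defaults, **(params or {})})   # user params override defaults
    if target is not None:                                    # supervised: cross-validated score
        X, y = df.drop(columns=[target]), df[target]
        score = cross_val_score(model, X, y, cv=5).mean()
        return model.fit(X, y), {"cv_score": score}
    labels = model.fit_predict(df)                            # unsupervised: cluster labels
    return model, {"labels": labels}

# Example call with hypothetical file and column names:
# model, result = run_analysis("cleaned_casting.csv", "supervised/random_forest",
#                              target="ingot_quality")
```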
The above embodiments merely describe preferred embodiments of the present invention and do not limit its scope; without departing from the spirit of the invention, various modifications and improvements made to its technical solution by those skilled in the art shall fall within the protection scope defined by the claims of the present invention.

Claims (3)

1. A method for cleaning and analyzing the full-process big data of aluminum/copper plate strip production, characterized by comprising the following steps:
S1: constructing a cleaning system suited to the aluminum/copper plate strip full-process big data, and cleaning the data according to the characteristics of the full-process production data and the process background;
S2: establishing a distributed computing and analysis algorithm tool library oriented to the aluminum/copper plate strip full-process big data.
2. The method for cleaning and analyzing the full-process big data of aluminum/copper plate strip production as claimed in claim 1, characterized in that the specific process of step S1 is as follows:
(1.1) analyzing the quality problems of the aluminum/copper plate strip full-process big data and, based on the characteristics of full-process production and processing, identifying and summarizing the stages in which data quality problems arise;
the data quality problems include: data loss or errors caused by equipment performance limitations; and data loss or errors caused by human operating mistakes;
(1.2) detecting the data loss and abnormality quality problems in combination with the background of the full aluminum/copper plate strip production process;
missing value detection: calculating the number of null values and the total number of samples from the full-process production data and determining the missing value proportion;
univariate abnormal value detection: after quick-sorting the data, decomposing it into quartiles and determining univariate abnormal values from the quartiles;
multivariate abnormal value detection: searching for multivariate abnormal samples by combining density-based, distance-based and clustering-based methods;
(1.3) analyzing the data quality problems and designing corresponding processing schemes, the processing schemes comprising:
single value processing: handling single values specifically; if all values in a column are a single value, deleting the column and saving the result; if individual entries in such a column are missing, deleting the affected rows;
missing value processing: setting a missing-degree threshold; deleting if the missing proportion exceeds the threshold, otherwise filling;
abnormal value processing: replacing univariate abnormal values using the process parameter ranges; deleting multivariate abnormal values;
(1.4) combining the characteristics of the aluminum/copper plate strip big data, performing data cleaning for the data quality problems of the full-process big data according to the following steps:
loading the data source: reading the distributed data, using the distributed storage in Hadoop for distributed reading;
identifying and detecting the data: detecting the data in the aluminum/copper plate strip full-process big data cleaning system and identifying the quality problems the data may have;
determining cleaning rules: starting the cleaning and backing up the data to be cleaned in advance; handling the detected data quality problems; single value problem: if all values in a column are a single value, deleting the column and saving the result, and deleting the affected rows if individual entries in the single-value column are missing; missing value problem: setting the missing-degree threshold to 80% by default, deleting if the missing proportion exceeds the threshold, otherwise filling; abnormal value problem: replacing univariate abnormal values using the process parameter ranges, and deleting multivariate abnormal values;
data cleaning: cleaning the loaded source data with the determined cleaning rules and methods using the MapReduce framework in Hadoop;
checking the cleaning result: checking the cleaned data against the rules and evaluation criteria set for the cleaning, and returning to the previous step to continue cleaning if the check fails.
3. The method for cleaning and analyzing the full-process big data of aluminum/copper plate strip production as claimed in claim 2, characterized in that the specific process of step S2 is as follows:
(2.1) designing the aluminum/copper plate strip full-process big data depth optimization algorithm tool library;
the functional module design of the optimization algorithm tool library includes a file management module, a supervised analysis module, an unsupervised analysis module and an intelligent optimization analysis module;
the structural design of the optimization algorithm tool library comprises a data display layer, a human-computer interaction layer, an intermediate data storage layer, an algorithm analysis layer and an aluminum/copper plate strip source data layer;
(2.2) developing the aluminum/copper plate strip full-process big data depth optimization algorithm tool library;
confirming the environment and related technologies required to develop the optimization algorithm tool library;
establishing the depth optimization algorithm tool library development platform;
(2.3) implementing the distributed algorithm analysis business process: standardizing the analysis algorithms, customizing a unified calling interface, and following this business process:
loading the data source: storing the source data processed by the distributed big data cleaning system for the aluminum/copper plate strip in a distributed file system with Hadoop at its core, and reading and loading the data in a distributed manner;
confirming the specific algorithm: the analysis methods in the tool library include supervised, unsupervised and other types, and different algorithms can be selected to meet different user requirements;
setting algorithm parameters: the analysis methods in the tool library differ in type; different method parameters can be selected for the analysis, or the default values can be used without explicit settings;
analyzing the results: returning the results produced by the full-process big data analysis method to the user and constructing relational models of aluminum/copper plate strip production from the returned results.
CN202110476972.8A 2021-04-29 2021-04-29 Full-process big data cleaning and analyzing method for aluminum/copper plate strip production Pending CN113177040A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110476972.8A CN113177040A (en) 2021-04-29 2021-04-29 Full-process big data cleaning and analyzing method for aluminum/copper plate strip production

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110476972.8A CN113177040A (en) 2021-04-29 2021-04-29 Full-process big data cleaning and analyzing method for aluminum/copper plate strip production

Publications (1)

Publication Number Publication Date
CN113177040A true CN113177040A (en) 2021-07-27

Family

ID=76925435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110476972.8A Pending CN113177040A (en) 2021-04-29 2021-04-29 Full-process big data cleaning and analyzing method for aluminum/copper plate strip production

Country Status (1)

Country Link
CN (1) CN113177040A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757534A (en) * 2023-06-15 2023-09-15 中国标准化研究院 Intelligent refrigerator reliability analysis method based on neural training network

Citations (8)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017028690A1 (en) * 2015-08-14 2017-02-23 阿里巴巴集团控股有限公司 File processing method and system based on etl
CN107220261A (en) * 2016-03-22 2017-09-29 中国移动通信集团山西有限公司 A kind of real-time method for digging and device based on distributed data
CN108287926A (en) * 2018-03-02 2018-07-17 宿州学院 A kind of multi-source heterogeneous big data acquisition of Agro-ecology, processing and analysis framework
CN108564254A (en) * 2018-03-15 2018-09-21 国网四川省电力公司绵阳供电公司 Controller switching equipment status visualization platform based on big data
CN109284298A (en) * 2018-11-09 2019-01-29 上海晏鼠计算机技术股份有限公司 A kind of contents production system handled based on machine learning and big data
CN109830303A (en) * 2019-02-01 2019-05-31 上海众恒信息产业股份有限公司 Clinical data mining analysis and aid decision-making method based on internet integration medical platform
CN110543903A (en) * 2019-08-23 2019-12-06 国网江苏省电力有限公司电力科学研究院 Data cleaning method and system for GIS partial discharge big data system
KR102146297B1 (en) * 2020-04-09 2020-08-20 주식회사 태현이엔지 Generation apparatus for sodium hypochlorite for enhancement of the accuracy, and method thereof

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
SHEN YAN等: "quality prediction method for aluminum alloy ingot based on XGBoost", 2020 CHINESE CONTROL AND DECISION CONFERENCE, pages 2542 - 2547 *
刘士新: "基于大数据的铝板带智能制造关键技术探讨", 2018年中国铝加工产业年度大会论文集, pages 28 - 41 *
彭艳: "冶金轧制设备技术数字化智能化发展综述", 燕山大学学报, vol. 44, no. 3, pages 218 - 237 *
李福兴等: "面向煤炭开采的大数据处理平台构建关键技术", 煤炭学报, vol. 44, no. 1, pages 362 - 369 *
王建民;王晨;刘英博;刘;: "大数据系统软件创新平台与生态建设", 大数据, vol. 4, no. 05, pages 104 - 112 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757534A (en) * 2023-06-15 2023-09-15 中国标准化研究院 Intelligent refrigerator reliability analysis method based on neural training network
CN116757534B (en) * 2023-06-15 2024-03-15 中国标准化研究院 Intelligent refrigerator reliability analysis method based on neural training network


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20240322

AD01 Patent right deemed abandoned