CN113177040A - Full-process big data cleaning and analyzing method for aluminum/copper plate strip production - Google Patents
- Publication number
- CN113177040A (application number CN202110476972.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- aluminum
- copper plate
- plate strip
- cleaning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/04—Manufacturing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention relates to a method for cleaning and analyzing full-process big data of aluminum/copper plate strip production. The method comprises constructing a quality problem detection method suited to the full-process big data of aluminum/copper plate strip and cleaning the data according to the data characteristics and the process background, and establishing a tool library of distributed computing and analysis algorithms for the full-process big data of aluminum/copper plate strip. By designing this cleaning and analyzing method, the invention provides a direction for mining and utilizing the full-process big data of aluminum/copper plate strip production and improves the utilization rate of those data.
Description
Technical Field
The invention relates to the field of big data and data mining, in particular to a method for cleaning and analyzing big data of the whole process of aluminum/copper plate strip production.
Background
The industrial big data of the whole aluminum/copper plate strip production process are multi-source and heterogeneous, contain inaccurate values and are difficult to query, which places extremely high demands on data cleaning and analysis across the whole production process. With the vigorous development of theories such as machine learning and big data, many scholars have combined machine learning with big data technology into data-driven techniques, carried out a series of studies on data analysis and optimization in industrial production processes, and made some progress. Such methods use industrial process data to build relational models for the key parameters of the production process, which both speeds up modeling and avoids the complexity of traditional mechanism modeling. Data-driven modeling does not require a deep understanding of the production mechanism: the production process is converted into machine learning models, which further reduces modeling time. In addition, because data-driven modeling and optimization start from the data, modeling errors caused by a lack of professional knowledge can be avoided, and new findings can be brought to professionals. Data-driven industrial process modeling and optimization thus provides a new approach to industrial process analysis. Cleaning and analyzing the big data of the whole aluminum/copper plate strip production process is therefore of great and far-reaching significance.
When the large amount of data produced by an aluminum/copper plate strip factory is collected, importing and retrieving the data often takes a great deal of technicians' time and energy and creates a heavy workload for them. Meanwhile, production, data transmission, manual processing and similar steps are inevitably subject to noise interference, so the acquired data are often mixed with a large amount of interfering or even erroneous information, producing a large amount of dirty data such as erroneous, missing and repeated data. The historical data contain many missing values, abnormal values and repeated values, so their information value is low. Traditional mechanism models built for the full-process production of aluminum/copper plate strip are very complex and often suffer from modeling errors caused by a lack of professional knowledge. The full-process production of an aluminum/copper plate strip factory is highly specialized and its data quality problems are complicated, yet a specially customized data cleaning and analyzing method is lacking.
Disclosure of Invention
In view of the above problems, the invention aims to provide a method for cleaning and analyzing big data of the whole process of aluminum/copper plate strip production, which can effectively solve problems such as uneven data and numerous anomalies.
The technical scheme adopted by the invention is as follows:
The invention provides a method for cleaning and analyzing big data of the whole production process of aluminum/copper plate strip, which comprises the following steps:
S1: constructing a cleaning system suited to the full-process big data of aluminum/copper plate strip, and cleaning the data in accordance with the characteristics of the full-process production data and the process background;
S2: establishing a tool library of distributed computing and analysis algorithms for the full-process big data of aluminum/copper plate strip, building relational models of the production process, and optimizing process parameters and alloy components.
Further, the specific process in step S1 is as follows:
(1.1) analyzing the quality problems of the full-process big data of aluminum/copper plate strip, and analyzing and summarizing the links in which data quality problems arise, in view of the characteristics of full-process production and processing of aluminum/copper plate strip;
the data quality problems include: data loss or errors caused by device performance limitations; data loss or errors caused by human operating mistakes;
(1.2) analyzing the data quality problems and designing corresponding processing schemes, the processing schemes comprising:
single value processing: single values are given specific treatment; if all values in a column are the same single value, the column is deleted and the result is stored; if a single-value column lacks individual entries, the affected rows are deleted, or the missing entries are filled with the single value;
missing value processing: missing value processing comprises row deletion and data completion;
abnormal value processing: abnormal value processing comprises row deletion and data replacement;
(1.3) determining the specific steps of the full-process big data cleaning system of the aluminum/copper plate strip; taking into account the big data characteristics of aluminum/copper plate strip, data cleaning for the data quality problems of the full-process big data is carried out in the following steps:
loading the data source: reading the distributed data, using the distributed storage in Hadoop to perform distributed reading;
identifying and detecting the data: detecting the data in the full-process big data cleaning system of the aluminum/copper plate strip, and identifying possible quality problems in the data;
determining the cleaning rules: backing up the data to be cleaned before cleaning starts; setting corresponding cleaning rules and cleaning methods for the different types of data quality problems of the full-process big data of aluminum/copper plate strip;
data cleaning: cleaning the loaded source data with the MapReduce framework in Hadoop, applying the determined cleaning rules and cleaning methods;
checking the cleaning result: checking the cleaned data against the rules and evaluation criteria set for data cleaning; if the data do not meet them, returning to the previous step and continuing the cleaning.
Further, the specific process in step S2 is as follows:
(2.1) implementing feature engineering for the full process of aluminum/copper plate strip production: in view of the multi-source heterogeneity, high dimensionality, dynamics and multiple spatio-temporal scales of the industrial big data of the whole aluminum/copper plate strip production process, establishing feature construction and feature selection that match the characteristics of those data and the process background;
feature construction for the full-process big data of aluminum/copper plate strip production: in the feature construction stage, some features are segmented; label-type variables are converted into numerical variables before modeling, in a way that does not imply any ordering by numerical size; and some features not considered in the original data, such as weather and climate information during production and the fatigue level of workers when they start work, are added;
feature selection for the full-process big data of aluminum/copper plate strip production: in the feature selection stage, the process parameters in the original data that are related to ingot quality are screened out, some single-value variables are deleted, and the correlation among features is checked so that only one representative of each group of strongly correlated features is retained, thereby reducing data redundancy;
(2.2) developing a tool library of analysis algorithms for the full-process big data of aluminum/copper plate strip production; the tool library comprises the following functions: an unsupervised learning function, a supervised learning function and an intelligent optimization function;
(2.3) implementing the business process of distributed algorithm analysis: the analysis algorithms are standardized in a unified way and given a unified calling interface, and the business process is as follows:
loading the data source: the source data processed by the distributed big data cleaning system for aluminum/copper plate strip are stored in a distributed file system with Hadoop at its core, and the data are read and loaded in a distributed manner;
confirming the specific algorithm: the analysis methods in the tool library include supervised methods, unsupervised methods and others, and different algorithms can be selected to meet the different requirements of users;
setting the algorithm parameters: the analysis methods in the tool library are of different types, and different method parameters can be set for data analysis; if none are set, default values are used;
returning the analysis results: the analysis results of the data processed by the selected analysis method are returned to the user, and a relational model of aluminum/copper plate strip production is built from the returned results.
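The business process in (2.3), a unified calling interface with selectable algorithms and default parameters, can be illustrated with a minimal single-process sketch. Everything named here (`REGISTRY`, `register`, `run`, and the two toy algorithms) is hypothetical and introduced only for illustration; the source does not specify the interface, and the distributed Hadoop loading step is not reproduced.

```python
# Hypothetical registry: algorithm name -> kind, callable and default parameters.
REGISTRY = {}


def register(name, kind, **defaults):
    """Decorator that adds an algorithm to the registry under a uniform entry."""
    def wrap(func):
        REGISTRY[name] = {"kind": kind, "func": func, "defaults": defaults}
        return func
    return wrap


@register("mean_predictor", kind="supervised")
def mean_predictor(data):
    # Toy supervised method: predict the mean of the targets in (x, y) pairs.
    targets = [y for _x, y in data]
    return {"prediction": sum(targets) / len(targets)}


@register("threshold_groups", kind="unsupervised", threshold=0.0)
def threshold_groups(data, threshold):
    # Toy unsupervised method: split values into two groups around a threshold.
    return {"labels": [int(x > threshold) for x in data]}


def run(name, data, **params):
    """Uniform entry point: confirm the algorithm, merge parameters, return results."""
    entry = REGISTRY[name]
    unknown = set(params) - set(entry["defaults"])
    if unknown:
        raise TypeError(f"unknown parameters for {name}: {sorted(unknown)}")
    merged = {**entry["defaults"], **params}  # defaults apply when nothing is set
    return entry["func"](data, **merged)
```

Calling `run("threshold_groups", values)` uses the default threshold, while `run("threshold_groups", values, threshold=3)` overrides it, which mirrors the "default values if none are set" behaviour described above.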
The invention provides a direction for mining and utilizing the full-flow big data of the aluminum/copper plate strip production by designing the full-flow big data cleaning and analyzing method for the aluminum/copper plate strip production, and can effectively improve the utilization rate of the full-flow big data of the aluminum/copper plate strip production.
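The feature selection stage described in (2.1) deletes single-value variables and keeps one representative of each group of strongly correlated features. A minimal sketch of that idea follows, assuming Pearson correlation as the correlation measure and 0.95 as the cut-off; the source names neither, and the function and column names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def select_features(columns, threshold=0.95):
    """Return the names of the features to keep, in input order.

    columns: dict mapping a feature name to its list of numeric values.
    threshold: assumed cut-off for "strongly correlated" (not from the source).
    """
    kept = []
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # single-value variable: carries no information
        if any(abs(pearson(values, columns[k])) >= threshold for k in kept):
            continue  # strongly correlated with a feature that is already kept
        kept.append(name)
    return kept
```

The first feature encountered in each correlated group is the one retained; a real pipeline might instead prefer the feature most closely tied to ingot quality.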
Drawings
FIG. 1 is a schematic diagram illustrating the use of the data cleansing method of the present invention;
FIG. 2 is a block diagram of a data set anomaly detection algorithm of the present invention;
FIG. 3 is a schematic flow chart of data cleansing according to the present invention;
FIG. 4 is a block diagram of an algorithm tool according to the present invention;
FIG. 5 is an architectural diagram of data analysis in accordance with the present invention;
FIG. 6 is a schematic flow chart of data analysis according to the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
The invention provides a method for cleaning and analyzing big data of the whole production process of aluminum/copper plate strip, which is implemented in the following specific steps:
S1: as shown in fig. 1, a quality problem detection method suited to the full-process big data of aluminum/copper plate strip is constructed, and data cleaning is carried out according to the data characteristics and the process background; this specifically comprises the following steps:
(1.1): analyzing the quality problems of the full-process big data of aluminum/copper plate strip, and analyzing and summarizing the links in which data quality problems arise, in view of the characteristics of full-process production and processing of aluminum/copper plate strip, including the following:
Device performance limitations cause data loss or errors: in aluminum/copper plate strip production, some production links often run at high speed, and the limited performance of the equipment that collects and uploads data in real time often causes data loss, because the data cannot be uploaded to the data collector in time. In addition, equipment faults or sensor failures affect the collected data, so that data become erroneous or are lost.
Human operating mistakes cause data loss or errors: the data and parameters of many links in actual aluminum/copper plate strip production depend on manual entry by personnel. Operator mistakes during manual entry lead to input errors, so that the data uploaded by the collector to the big data platform are missing or wrong.
(1.2) detecting missing-data and abnormal-data quality problems in light of the full-process production background of aluminum/copper plate strip, which comprises the following:
Missing value detection: for each data attribute, the number of samples and the number of null values are counted, and the proportion of null values in the total is calculated. With the number of samples denoted m and the number of null values (the missing count) of each feature attribute denoted n0, n1, …, the missing proportion of each feature attribute, i.e. its missing degree n/m, is obtained. If the missing proportion of a feature attribute is greater than the set threshold, generally 80%, the feature attribute is deleted; if the missing proportion is less than 10%, the feature attribute can be filled, using the mode of the attribute as the fill value; if the missing proportion lies between the two, whether to delete or fill the feature attribute is decided by judging whether it is closely related to product quality.
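The missing-value rule above can be sketched in plain Python. The 80% and 10% thresholds come from the text; representing a null value as `None` and passing the "closely related to product quality" judgment in as a `quality_related` flag are assumptions made for this illustration, as are the function names.

```python
from collections import Counter

DROP_THRESHOLD = 0.80  # from the text: delete the attribute above this proportion
FILL_THRESHOLD = 0.10  # from the text: fill with the mode below this proportion


def missing_ratio(column):
    """Return the missing degree n/m of one attribute (None marks a null value)."""
    return sum(v is None for v in column) / len(column)


def decide(column, quality_related=False):
    """Return 'drop' or 'fill' for one feature attribute."""
    ratio = missing_ratio(column)
    if ratio > DROP_THRESHOLD:
        return "drop"
    if ratio < FILL_THRESHOLD:
        return "fill"
    # In between, the text defers to process knowledge; here that judgment is
    # assumed to arrive as a boolean flag supplied by the caller.
    return "fill" if quality_related else "drop"


def fill_with_mode(column):
    """Replace every null value with the most frequent non-null value."""
    present = [v for v in column if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in column]
```

An attribute with 1 null value in 12 samples (about 8%) is therefore filled, while one with 9 nulls in 10 samples is deleted.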
Univariate outlier detection: the data table is sorted to find univariate abnormal values. The data are divided into quartiles, and a range of normal values is derived from them. The distribution of the data is summarized by the five-number summary (minimum, first quartile Q1, median, third quartile Q3 and maximum). The interquartile range is IQR = Q3 - Q1, and a value is judged abnormal if it is greater than Q3 + 1.5 IQR or less than Q1 - 1.5 IQR.
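The quartile rule can be illustrated with the standard library. This is a minimal sketch: the choice of the "inclusive" quantile method is an assumption, since the source does not say how the quartiles are interpolated, and the function names are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def univariate_outliers(values, k=1.5):
    """Return the values that fall outside the IQR limits."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if v < low or v > high]
```

For the readings `[10, 11, 12, 13, 14, 15, 16, 17, 100]`, Q1 is 12 and Q3 is 16, so the limits are 6 and 22 and only the reading 100 is flagged.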
Detection of multivariate abnormal values: samples with multivariate abnormal values are searched for by combining density-based, distance-based and clustering-based methods.
Based on density: by comparing the local density of an object with the local densities of its neighbours, regions of similar density can be identified, as well as points whose density is significantly lower than that of their neighbours. The local outlier factor (LOF) algorithm uses K nearest neighbours: for each point, LOF computes the local reachability density (lrd) and compares it with the lrd of each member of the point's K-nearest-neighbour set. In a given data set, the local reachability density of each object is defined by a formula in which |N(x_i)| denotes the number of neighbours of point x_i and lrd(x_i) denotes the local reachability density of point x_i. A local outlier factor score is calculated for every point in the data set and the scores are then compared; the larger the local outlier factor score of a data point, the more likely that point is to be judged abnormal.
Based on distance: the basic assumption of distance-based anomaly detection is that similar observations lie close to each other, whereas outliers are generally isolated observations and therefore lie far away. Outliers are identified by measuring the distances between feature values: a point whose distances to its K nearest neighbours are large compared with those of other points can be judged an abnormal value.
Based on clustering: clustering-based anomaly detection clusters the data samples and detects outliers by analyzing the relationship between objects and clusters; an outlier is an object that belongs to a small sparse cluster or to no cluster at all. If an object does not belong to any cluster, it is identified as an outlier; likewise, if the distance between the object and the nearest cluster is large, it is identified as an outlier.
Finally, the results of the several detectors are combined by voting to give the final anomaly detection result: a sample is considered abnormal when more than half of the detectors judge it to be abnormal.
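The formula referred to in the density-based passage is not reproduced in the text. For orientation only, the standard definitions of the local outlier factor, which use the same symbols |N(x_i)| and lrd(x_i), are given below as an assumption about what the passage refers to. The distance d(x_i, x_j), the k-distance d_k(x_j) (the distance from x_j to its k-th nearest neighbour) and the reachability distance are notation introduced for this sketch.

```latex
\begin{aligned}
\text{reach-dist}_k(x_i, x_j) &= \max\{\, d_k(x_j),\; d(x_i, x_j) \,\} \\
\mathrm{lrd}(x_i) &= \left( \frac{1}{|N(x_i)|} \sum_{x_j \in N(x_i)} \text{reach-dist}_k(x_i, x_j) \right)^{-1} \\
\mathrm{LOF}(x_i) &= \frac{1}{|N(x_i)|} \sum_{x_j \in N(x_i)} \frac{\mathrm{lrd}(x_j)}{\mathrm{lrd}(x_i)}
\end{aligned}
```

Under these definitions a score close to 1 means the point is about as dense as its neighbours, and a score well above 1 means it lies in a sparser region, which matches the statement that larger scores indicate more probable abnormal points.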
(1.3) analyzing the data quality problems and designing corresponding processing schemes, including:
Single value processing: in actual data acquisition the product number format differs from device to device, so the format must be unified while the data set is being built to allow subsequent processing; for example, in the fusion casting process the unified format is capital letters followed by digits, such as "EC 6934". When the data volume is large and many devices are being collected from, comparing similar product numbers is time-consuming and laborious. To reduce the number of comparisons the product numbers are sorted; after preprocessing they contain no special characters such as Chinese characters, only English letters and digits, so they can be sorted in lexicographic order.
After the original data have been quick-sorted, single-value records lie at adjacent positions and can be identified for processing. In most cases they are handled in one of two ways. The first is to delete single-value records: the record chosen for retention is the one whose information is the most comprehensive and complete, and the remaining single-value records are deleted; because actual acquisition provides a time field, the choice can also be made along the time dimension, keeping the latest record and deleting the records from other times. The second is to use the information of every sample among the single-value records and merge the repeated records; for example, if the repeated records support accumulation and averaging, they can be averaged so that all of the data information is used.
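The two handling options for repeated records, keeping the latest record or averaging, can be sketched as follows. The field names `product_no`, `time` and `value` are hypothetical, Python's built-in sort stands in for the quick sort mentioned above, and stripping spaces during normalization is an assumption about the unified format.

```python
from itertools import groupby


def normalize_product_no(raw):
    """Keep only ASCII letters and digits, upper-cased, so sorting is lexicographic."""
    return "".join(ch for ch in raw if ch.isascii() and ch.isalnum()).upper()


def _key(record):
    return normalize_product_no(record["product_no"])


def dedupe_keep_latest(records):
    """Sort so repeated product numbers are adjacent, then keep the newest of each."""
    ordered = sorted(records, key=lambda r: (_key(r), r["time"]))
    # Within each group the records are ordered by time, so the last is the latest.
    return [list(group)[-1] for _no, group in groupby(ordered, key=_key)]


def dedupe_average(records):
    """Merge repeated records by averaging their values, using all the information."""
    ordered = sorted(records, key=_key)
    merged = {}
    for number, group in groupby(ordered, key=_key):
        values = [r["value"] for r in group]
        merged[number] = sum(values) / len(values)
    return merged
```

Which option is appropriate depends on the field: averaging only makes sense for quantities where accumulation and averaging are meaningful, as the text notes.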
Missing value processing: for fields whose missing degree exceeds 80%, the strategy is to delete and discard the field directly. Fields whose missing degree is below 10% can be filled with the mode of the field. Fields whose missing degree lies between 10% and 80% can be deleted or retained according to the process background.
Abnormal value processing: for univariate abnormal values, the mode of the corresponding field is computed and the abnormal values are replaced with the mode; multivariate abnormal values are deleted directly.
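The abnormal value handling above, together with the majority vote that produces the multivariate flags, can be sketched as below. The function names are hypothetical, and computing the mode over the non-flagged values only is an assumption; the text says only that the mode of the field is used.

```python
from collections import Counter


def majority_vote(detector_flags):
    """Combine per-detector boolean flags; abnormal if more than half agree.

    detector_flags: one list of booleans per detector, all of equal length.
    """
    n_detectors = len(detector_flags)
    return [sum(votes) * 2 > n_detectors for votes in zip(*detector_flags)]


def replace_with_mode(values, is_outlier):
    """Replace flagged univariate abnormal values with the mode of the rest."""
    normal = [v for v, bad in zip(values, is_outlier) if not bad]
    mode = Counter(normal).most_common(1)[0][0]
    return [mode if bad else v for v, bad in zip(values, is_outlier)]


def drop_rows(rows, is_multivariate_outlier):
    """Delete the rows flagged as multivariate abnormal values."""
    return [row for row, bad in zip(rows, is_multivariate_outlier) if not bad]
```

With three detectors (for example density-based, distance-based and clustering-based), a row is deleted only when at least two of them flag it.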
(1.4) Determine the specific steps of the aluminum/copper plate strip full-process big data cleaning system, as shown in Fig. 3. Taking into account the big data characteristics of aluminum/copper plate strip, the data quality problems of the full-process big data are cleaned according to the following steps:
Loading the data source: the distributed data is read using the distributed storage in Hadoop.
Identifying and detecting the data: the data is inspected in the aluminum/copper plate strip full-process big data cleaning system, and possible data quality problems are identified.
Determining the cleaning rules: before cleaning starts, the data to be cleaned is backed up; the detected data quality problems are then handled. Single-value problem: if all values in a column are a single value, the column is deleted and the result is saved; if individual data are missing in the single-value column, a row deletion is performed. Missing value problem: the missing threshold proportion defaults to 80%; if the missing proportion exceeds the threshold, the data is deleted, otherwise it is filled. Abnormal value problem: univariate abnormal values are replaced according to the process parameter range; multivariate abnormal values are deleted.
Data cleaning: the loaded source data is cleaned with the MapReduce framework in Hadoop, using the cleaning rules and cleaning methods determined above.
Checking the cleaning result: the cleaned data is checked and judged against the rules and evaluation criteria set for data cleaning; if further cleaning is required, the process returns to the previous step and cleaning continues.
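The detect, clean and check cycle described above can be sketched as a simple loop. This is an in-memory stand-in, not the Hadoop/MapReduce implementation: the table layout (dict of column lists, `None` for missing), the function names and the `max_rounds` limit are assumptions, and only the single-value rule and the 80% missing threshold are taken from the text.

```python
def quality_problems(table, missing_threshold=0.80):
    """Detect single-value columns and columns missing more than the threshold."""
    problems = []
    for name, col in table.items():
        present = [v for v in col if v is not None]
        if len(set(present)) <= 1:
            problems.append((name, "single_value"))
        elif 1 - len(present) / len(col) > missing_threshold:
            problems.append((name, "missing"))
    return problems


def clean(table, max_rounds=5):
    """Back up the data, then repeat detect -> delete -> check until no problem remains."""
    backup = {name: list(col) for name, col in table.items()}  # backup before cleaning
    current = table
    for _ in range(max_rounds):
        problems = quality_problems(current)
        if not problems:
            break  # the check passed, cleaning is finished
        flagged = {name for name, _ in problems}
        current = {n: c for n, c in current.items() if n not in flagged}
    return current, backup
```

Filling of columns below the threshold and the handling of abnormal values are left out to keep the loop structure visible.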
S2: Construct a tool library of distributed computing and analysis methods suited to the full-process big data of aluminum/copper plate strip, specifically comprising the following steps:
(2.1) Design of the aluminum/copper plate strip full-process big data depth optimization algorithm tool library
As shown in Fig. 4, the functional module design of the optimization algorithm tool library includes a file management module, a supervised analysis module, an unsupervised analysis module and an intelligent optimization analysis module.
File management module: this module is responsible for loading data files, obtaining file information, and storing the intermediate results of preprocessing or optimization.
Supervised analysis module: this module reads data files for analysis. It integrates a variety of supervised algorithms; the user can select a suitable supervised algorithm and set its parameters before running it, or run a one-key process with the system's default supervised algorithm and parameter settings. The evaluation indexes of the model are displayed once the algorithm has finished running.
Unsupervised analysis module: this module reads data files for analysis. It integrates a variety of unsupervised algorithms; the user can select a suitable unsupervised algorithm and set its parameters before running it, or run a one-key process with the system's default unsupervised algorithm and parameter settings. The evaluation indexes of the model are displayed once the algorithm has finished running.
Intelligent optimization module: this module reads data files for analysis. It integrates a variety of intelligent optimization algorithms; the user can select a suitable intelligent optimization algorithm and set its parameters before running it, or run a one-key process with the system's default intelligent optimization algorithm and parameter settings. The evaluation indexes of the model are displayed once the algorithm has finished running.
As shown in Fig. 5, the structural design of the optimization algorithm tool library comprises a data display layer, a human-computer interaction layer, an intermediate data storage layer, an algorithm analysis layer and an aluminum/copper plate strip source data layer.
The first layer is the aluminum/copper plate strip data layer, which contains the industrial data sets collected from the aluminum/copper plate strip production process. The second layer is the data operation layer, which comprises the supervised learning module, the unsupervised learning module and the intelligent optimization algorithm module; its design is the core architecture of the system. The third layer is the intermediate data management layer, which mainly stores the process data produced by the modules of the second layer. The fourth layer is the human-computer interaction layer, which is built on the first three layers and provides an interactive interface through which the user can conveniently carry out the operations of the second and third layers; the interface is implemented mainly through shortcut keys and a mechanism that responds to button-click messages. The fifth layer is the data display layer, which mainly displays operation results; so that the user can view the processing results more intuitively, the evaluation indexes and result graphs of the analysis are displayed automatically.
(2.2) Development of the aluminum/copper plate strip full-process big data depth optimization algorithm tool library
Confirm the environment and technologies required to develop the optimization algorithm tool library. Development is carried out mainly under the Windows and Linux operating systems, in an environment that integrates the Hadoop big data ecosystem, with Spring Boot as the development framework; the design covers the software visualization interface, the data mining algorithms, model evaluation visualization and the like. The data mining algorithms involved in the invention are designed and developed with the scikit-learn algorithm tool library.
Establish the development platform for the depth optimization algorithm tool library: algorithms are developed by combining the Spring Boot framework with the scikit-learn algorithm tool library, and the data storage layer of the platform is developed on the HDFS format of Hadoop.
(2.3) As shown in Fig. 6, the distributed algorithm analysis business process is implemented as follows. The tool library of analysis methods for the full-process big data of aluminum/copper plate strip production contains many different analysis methods, so the methods need to be unified and standardized and a unified calling interface defined. The specific business process is as follows:
Loading the data source: the source data processed by the distributed big data cleaning system for aluminum/copper plate strip is stored in a distributed file system with Hadoop at its core, and the data is read and loaded in a distributed manner.
Confirming the specific algorithm: the analysis method types in the tool library include supervised, unsupervised and others, and different algorithms can be selected to meet the different requirements of users.
Setting the algorithm parameters: the analysis methods in the tool library differ in type, so different method parameters can be selected for data analysis; if nothing is set, default values are used.
Analyzing the results: the analysis result of the data processed by the analysis method is returned to the user, and a relation model of aluminum/copper plate strip production is constructed from the returned result.
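One way to picture the unified calling interface is a registry that maps an algorithm name to its type, its function and its default parameters. The sketch below is only an illustration under that assumption: `REGISTRY`, `register`, `run_analysis` and the two toy algorithms are hypothetical names, and the real tool library builds on scikit-learn and Hadoop rather than on these stand-ins.

```python
from statistics import mean

REGISTRY = {}  # algorithm name -> (kind, function, default parameters)


def register(name, kind, **defaults):
    """Decorator that adds an algorithm to the registry with its default parameters."""
    def wrap(func):
        REGISTRY[name] = (kind, func, defaults)
        return func
    return wrap


@register("mean_baseline", kind="supervised", target="thickness")
def mean_baseline(rows, target):
    """Toy supervised method: predict the mean target, report mean absolute error."""
    avg = mean(r[target] for r in rows)
    mae = mean(abs(r[target] - avg) for r in rows)
    return {"model": avg, "metrics": {"mae": mae}}


@register("threshold_groups", kind="unsupervised", field="thickness", cut=1.0)
def threshold_groups(rows, field, cut):
    """Toy unsupervised method: split rows into two groups at a cut value."""
    labels = [int(r[field] > cut) for r in rows]
    return {"model": labels, "metrics": {"n_groups": len(set(labels))}}


def run_analysis(rows, algorithm, **params):
    """Unified entry point: choose an algorithm, override defaults, return the result."""
    kind, func, defaults = REGISTRY[algorithm]
    settings = {**defaults, **params}  # user-supplied parameters win over defaults
    result = func(rows, **settings)
    return {"algorithm": algorithm, "kind": kind, "params": settings, **result}
```

Calling `run_analysis(rows, "threshold_groups")` corresponds to the one-key process with default settings, while `run_analysis(rows, "threshold_groups", cut=0.5)` corresponds to the user setting a parameter explicitly.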
The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solution of the present invention by those skilled in the art should fall within the protection scope defined by the claims of the present invention without departing from the spirit of the present invention.
Claims (3)
1. A full-process big data cleaning and analysis method for aluminum/copper plate strip production, characterized by comprising the following steps:
S1: constructing a cleaning system suited to the full-process big data of aluminum/copper plate strip, and cleaning the data in accordance with the characteristics and process background of the full-process production data of aluminum/copper plate strip;
S2: establishing a distributed computing and analysis algorithm tool library for the full-process big data of aluminum/copper plate strip.
2. The full-process big data cleaning and analysis method for aluminum/copper plate strip production as claimed in claim 1, characterized in that the specific process of step S1 is as follows:
(1.1) analyzing the quality problems of the full-process big data of aluminum/copper plate strip, and, in view of the characteristics of full-process production and processing of aluminum/copper plate strip, analyzing and summarizing the links in which data quality problems arise;
the data quality problems include: data loss or errors caused by equipment performance limitations; data loss or errors caused by human operating mistakes;
(1.2) detecting the missing-data and abnormal-data quality problems in combination with the background of the full-process production of aluminum/copper plate strip;
missing value detection: calculating the number of null values and the total number of samples from the full-process production data of aluminum/copper plate strip, and determining the missing value proportion;
univariate abnormal value detection: after the data is quickly sorted, dividing it into quartiles and determining the univariate abnormal values from the quartiles;
multivariate abnormal value detection: searching for samples with multivariate abnormal values by combining density-based, distance-based and clustering-based methods;
(1.3) analyzing the data quality problems and designing a corresponding processing scheme, the processing scheme comprising:
single value processing: if all values in a column are a single value, deleting the column and saving the result; if individual data are missing in the single-value column, performing a row deletion;
missing value processing: setting a missing threshold proportion, and performing deletion if the missing proportion exceeds the threshold, otherwise filling;
abnormal value processing: replacing univariate abnormal values according to the process parameter range; deleting multivariate abnormal values;
(1.4) taking into account the big data characteristics of aluminum/copper plate strip, cleaning the data quality problems of the full-process big data of aluminum/copper plate strip according to the following steps:
loading the data source: reading the distributed data using the distributed storage in Hadoop;
identifying and detecting the data: inspecting the data in the aluminum/copper plate strip full-process big data cleaning system, and identifying possible data quality problems;
determining the cleaning rules: backing up the data to be cleaned before cleaning starts, and handling the detected data quality problems; single-value problem: if all values in a column are a single value, deleting the column and saving the result, and if individual data are missing in the single-value column, performing a row deletion; missing value problem: setting the missing threshold proportion to a default of 80%, and performing deletion if the missing proportion exceeds the threshold, otherwise filling; abnormal value problem: replacing univariate abnormal values according to the process parameter range, and deleting multivariate abnormal values;
data cleaning: cleaning the loaded source data with the MapReduce framework in Hadoop, using the determined cleaning rules and cleaning methods;
checking the cleaning result: checking and judging the cleaned data against the rules and evaluation criteria set for data cleaning, and returning to the previous step to continue cleaning if further cleaning is required.
3. The full-process big data cleaning and analysis method for aluminum/copper plate strip production as claimed in claim 2, characterized in that the specific process of step S2 is as follows:
(2.1) designing the aluminum/copper plate strip full-process big data depth optimization algorithm tool library;
the functional module design of the optimization algorithm tool library comprises a file management module, a supervised analysis module, an unsupervised analysis module and an intelligent optimization analysis module;
the structural design of the optimization algorithm tool library comprises a data display layer, a human-computer interaction layer, an intermediate data storage layer, an algorithm analysis layer and an aluminum/copper plate strip source data layer;
(2.2) developing the aluminum/copper plate strip full-process big data depth optimization algorithm tool library;
confirming the environment and technologies required to develop the optimization algorithm tool library;
establishing the development platform for the depth optimization algorithm tool library;
(2.3) implementing the distributed algorithm analysis business process: unifying and standardizing the analysis algorithms and defining a unified calling interface, the specific business process being as follows:
loading the data source: storing the source data processed by the distributed big data cleaning system for aluminum/copper plate strip in a distributed file system with Hadoop at its core, and reading and loading the data in a distributed manner;
confirming the specific algorithm: the analysis method types in the tool library include supervised, unsupervised and others, and different algorithms can be selected to meet the different requirements of users;
setting the algorithm parameters: the analysis methods in the tool library differ in type, so different method parameters can be selected for data analysis, and default values are used if nothing is set;
analyzing the results: returning to the user the analysis result of the data processed by the analysis method, and constructing a relation model of aluminum/copper plate strip production from the returned result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110476972.8A CN113177040A (en) | 2021-04-29 | 2021-04-29 | Full-process big data cleaning and analyzing method for aluminum/copper plate strip production |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113177040A (en) | 2021-07-27 |
Family
ID=76925435
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202110476972.8A | Pending | CN113177040A (en) | 2021-04-29 | 2021-04-29 | Full-process big data cleaning and analyzing method for aluminum/copper plate strip production |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113177040A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017028690A1 (en) * | 2015-08-14 | 2017-02-23 | 阿里巴巴集团控股有限公司 | File processing method and system based on etl |
CN107220261A (en) * | 2016-03-22 | 2017-09-29 | 中国移动通信集团山西有限公司 | A kind of real-time method for digging and device based on distributed data |
CN108287926A (en) * | 2018-03-02 | 2018-07-17 | 宿州学院 | A kind of multi-source heterogeneous big data acquisition of Agro-ecology, processing and analysis framework |
CN108564254A (en) * | 2018-03-15 | 2018-09-21 | 国网四川省电力公司绵阳供电公司 | Controller switching equipment status visualization platform based on big data |
CN109284298A (en) * | 2018-11-09 | 2019-01-29 | 上海晏鼠计算机技术股份有限公司 | A kind of contents production system handled based on machine learning and big data |
CN109830303A (en) * | 2019-02-01 | 2019-05-31 | 上海众恒信息产业股份有限公司 | Clinical data mining analysis and aid decision-making method based on internet integration medical platform |
CN110543903A (en) * | 2019-08-23 | 2019-12-06 | 国网江苏省电力有限公司电力科学研究院 | Data cleaning method and system for GIS partial discharge big data system |
KR102146297B1 (en) * | 2020-04-09 | 2020-08-20 | 주식회사 태현이엔지 | Generation apparatus for sodium hypochlorite for enhancement of the accuracy, and method thereof |
Non-Patent Citations (5)
Title |
---|
SHEN YAN等: "quality prediction method for aluminum alloy ingot based on XGBoost", 2020 CHINESE CONTROL AND DECISION CONFERENCE, pages 2542 - 2547 * |
刘士新: "基于大数据的铝板带智能制造关键技术探讨", 2018年中国铝加工产业年度大会论文集, pages 28 - 41 * |
彭艳: "冶金轧制设备技术数字化智能化发展综述", 燕山大学学报, vol. 44, no. 3, pages 218 - 237 * |
李福兴等: "面向煤炭开采的大数据处理平台构建关键技术", 煤炭学报, vol. 44, no. 1, pages 362 - 369 * |
王建民;王晨;刘英博;刘;: "大数据系统软件创新平台与生态建设", 大数据, vol. 4, no. 05, pages 104 - 112 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116757534A (en) * | 2023-06-15 | 2023-09-15 | 中国标准化研究院 | Intelligent refrigerator reliability analysis method based on neural training network |
CN116757534B (en) * | 2023-06-15 | 2024-03-15 | 中国标准化研究院 | Intelligent refrigerator reliability analysis method based on neural training network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111636891B (en) | Real-time shield attitude prediction system and construction method of prediction model | |
CN117393076B (en) | Intelligent monitoring method and system for heat-resistant epoxy resin production process | |
CN112905380A (en) | System anomaly detection method based on automatic monitoring log | |
CN117194919A (en) | Production data analysis system | |
CN111178688A (en) | Self-service analysis method and system for power technology supervision data, storage medium and computer equipment | |
CN117962256A (en) | Injection molding warpage simulation control system based on analysis of big data information | |
CN116578436A (en) | Real-time online detection method based on asynchronous multielement time sequence data | |
CN113177040A (en) | Full-process big data cleaning and analyzing method for aluminum/copper plate strip production | |
CN112416732B (en) | Hidden Markov model-based data acquisition operation anomaly detection method | |
CN117171244A (en) | Enterprise data management system based on data middle platform construction and data analysis method thereof | |
CN117291575A (en) | Equipment maintenance method, equipment maintenance device, computer equipment and storage medium | |
CN118331952B (en) | Financial data cleaning management system and method based on big data | |
CN116861204B (en) | Intelligent manufacturing equipment data management system based on digital twinning | |
CN117150439B (en) | Automobile manufacturing parameter detection method and system based on multi-source heterogeneous data fusion | |
CN116703321B (en) | Pharmaceutical factory management method and system based on green production | |
CN118245743B (en) | Basic data construction optimization system and method based on automatic flow | |
CN116975041B (en) | AB experiment shunting and analyzing system | |
CN117453805B (en) | Visual analysis method for uncertainty data | |
Balaskó et al. | What happens to process data in chemical industry? From source to applications–an overview | |
CN118311914B (en) | Production line data acquisition control method and system for intelligent workshop | |
Pavlov et al. | Relevance and Problems for Application of Big Data in Engineering Industry | |
CN118295893A (en) | Automatic page inspection method and device, electronic equipment and storage medium | |
CN118025921A (en) | Elevator fault analysis and detection method based on big data and neural network | |
CN114218222A (en) | Industrial production diagnosis method and system for tree structure | |
CN118132321A (en) | Learning-based database abnormal root cause SQL diagnosis method and diagnosis device thereof |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
AD01 | Patent right deemed abandoned | Effective date of abandoning: 20240322 |