CN114595211A - Product data cleaning method and system based on deep learning

Info

Publication number: CN114595211A
Application number: CN202210089180.XA
Authority: CN
Inventors: 吕勋; 郑沁; 周建波; 李伯鸣; 王燕灵
Original assignee: Hangzhou New China And Big Polytron Technologies Inc
Current assignee: Hangzhou New China And Big Polytron Technologies Inc
Priority date: 2022-01-25
Filing date: 2022-01-25
Publication date: 2022-06-07

Abstract

The invention provides a product data cleaning method and system based on deep learning. The method comprises the following steps: establishing a product data set by industry, building a data cleaning model based on a deep learning model, and training the data cleaning model on the product data set to obtain a training data set and a test data set; acquiring the product data to be cleaned and inputting it into the trained data cleaning model to obtain a product cleaning result; and performing iterative cross-validation on the product cleaning result according to the material attributes until no abnormal data remains, then outputting the cleaning result. A deep learning data set based on the product data structure of machining and assembly manufacturing is established in advance, comprising industry standard product data and product data from historical projects; the product data of a project is then cleaned by the data cleaning model trained on that data set.

Description

Product data cleaning method and system based on deep learning

Technical Field

The invention relates to the technical field of data cleaning, in particular to a product data cleaning method and system based on deep learning.

Background

Data cleaning: the process of re-examining and verifying data in order to delete duplicate information, correct existing errors, and ensure data consistency. Data cleaning after data import is generally performed by a computer rather than by hand.

Material (item): all materials used or consumed in the production of a product, including final products, parts, assemblies, composite parts, outsourced parts, raw materials and the like.

Material master file (item data): used for identifying and describing the attributes and information of each material used in the production process. The material master file mainly comprises:

1) Basic information: material code, material type, material classification and material name.

2) Design management related information: such as design drawing number or formula (raw material, ingredient) number, design modification number or revision, effective date and expiry date of the material, etc.

3) Material management related information: such as unit of measure, material grade, specification, yield, ABC code, default warehouse and storage location, category, current inventory, safety inventory, longest storage days, maximum inventory limit, cycle count interval, etc.

Bill of materials (BOM): a description of the composition of a product. It lists all the sub-assemblies, intermediate components, parts and raw materials required to produce a product, and shows the quantity of each child item required to make up its parent. It is sometimes also referred to as a "recipe list", "matching list", "product structure list", "detailed list", "product detail list", etc.

Product data:

In manufacturing, the production scheduling plan in the ERP system is built on basic product data, which comprises descriptive data for products and materials, product structure data (BOM) and production process data. The product data mainly comes from the drawings, product detail lists, BOMs, material information, process routes and the like provided by the design and product research and development departments. The main ways of importing product data into the ERP system are as follows:

1) manually entering paper product lists or electronic image files from the design department;

2) directly importing Excel or CSV product lists provided by the design department through the data interface of the ERP system;

3) directly importing data from the PDM (product data management) and PLM (product life cycle management) systems used by the design department through the data interface of the ERP system.

Product data needs cleaning for two reasons. First, manual entry involves a huge workload and many errors, and the data must be verified by highly specialized technicians. Second, data imported from the design department's product lists, PDM and PLM cannot be used directly as basic data for production and manufacturing management, because the product data descriptions provided by the design department are not standardized and do not follow the same description rules as the ERP system.

Disadvantages or problems with the prior art:

1) When maintaining bill of materials data, only basic logic checks can be performed, for example: if component B is a lower level of product A, then A cannot be a lower level of component B. As a result, cleaning is incomplete and the deviation is large.

2) When maintaining the material master file, only standardized verification of each single attribute can be performed, for example: whether the material grade of a steel conforms to the national standard. As a result, cleaning is incomplete and the deviation is large.

3) The rule base must be set up and updated manually, so timeliness is poor and the workload is large.
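The basic logic check mentioned in item 1) above, that a component cannot appear among its own lower levels, could be sketched as follows. This is only an illustration of that kind of prior-art rule: the function name `find_bom_cycle` and the (parent, child) edge-list representation of the BOM are assumptions made for the example and do not come from the patent.

```python
from collections import defaultdict

def find_bom_cycle(edges):
    """Return one circular parent -> child chain in a BOM, or None.

    `edges` is a hypothetical list of (parent, child) pairs.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = defaultdict(int)  # every node starts as UNSEEN

    def visit(node, path):
        state[node] = IN_PROGRESS
        path.append(node)
        for child in children[node]:
            if state[child] == IN_PROGRESS:
                # child is already on the current path: circular reference
                return path[path.index(child):] + [child]
            if state[child] == UNSEEN:
                found = visit(child, path)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(children):
        if state[node] == UNSEEN:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None
```

A check like this only catches structural contradictions; it says nothing about whether the attributes of A and B are mutually consistent, which is the gap the text describes.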

Disclosure of Invention

To solve the above technical problems, the invention provides a product data cleaning method and system based on deep learning. A deep learning data set based on the product data structure of machining and assembly manufacturing is established in advance, comprising industry standard product data and product data from historical projects; the product data of a project is then cleaned by the data cleaning model trained on that data set.

To achieve this purpose, the following technical solution is provided:

A product data cleaning method based on deep learning comprises the following steps:

S1, establishing a product data set by industry, building a data cleaning model based on a deep learning model, and training the data cleaning model on the product data set to obtain a training data set and a test data set;

S2, acquiring the product data to be cleaned and inputting it into the trained data cleaning model to obtain a product cleaning result;

S3, performing iterative cross-validation on the product cleaning result according to the material attributes until no abnormal data remains, and then outputting the cleaning result.

Deep learning is an algorithmic process in which a multilayer neural network induces a model from existing data through multiple layers of analysis and computation, and the model is then used to analyze new data. This solution applies that process to product data cleaning, with the following advantages: product data is preprocessed with an RNN (recurrent neural network) deep learning algorithm; the cleaning model abstracts the neurons of each dimension from an industry standard database and a historical business database; the training library data is then improved periodically through the model's self-learning capability; and a verification mechanism between the deep learning test library and training library corrects deviations in the cleaning result, improving cleaning accuracy. The bill of materials and the material master file are cleaned along multiple dimensions and cross-validated against each other at the same time.

Preferably, step S1 includes the following steps:

A1: establishing a product data set by industry, wherein the product data set comprises industry standard product data and product data from historical projects;

A2: making labels according to the attributes of the materials;

A3: establishing a classification learner for the product data set according to the classifications and labels;

A4: training on the product data set through a deep learning model to obtain a training result;

establishing a function $M_i = AF\left(\sum_j X_{ij}\, t_k + b_j\right)$, wherein t is the number of product libraries, k is the level of the BOM of the product, X is the tag data set, and AF is the activation function;

A5: the training result is corrected by an expert database and then output as a data set that has passed training;

A6: splitting the data set that has passed training into training data and test data.

Preferably, the attributes of the materials comprise the material grade, specification, material type and material category.

Preferably, step S3 includes the following steps: a K-fold cross-validation method is applied, calling the training data set and the test data at the same time, specifically as follows:

1) dividing the training data and test data into x parts;

2) in each round, taking 1 part, without repetition, as test data and the other x-1 parts as training data for the model, and then calculating the MSE_i value of each material attribute label of the deep learning model on the test data set;

3) averaging the x calculated MSE_i values to obtain the MSE value of each material attribute label, wherein x is the configured number of splits;

4) judging whether abnormal data exists in step 3); if not, directly outputting the cleaning result; if so, proceeding to step 5);

5) after the expert database is called for proofreading, adding the abnormal data to a temporary training library;

6) calling the temporary training library and cleaning again;

7) judging whether abnormal data exists in step 6); if not, transferring the temporary training library into the formal training library and directly outputting the cleaning result; if so, returning to step 5).
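Steps 1) to 3) above can be illustrated with a small standard-library sketch. It is not the patent's implementation: the deep learning model is replaced by a stand-in that predicts the per-label mean of the training rows, and the names `k_fold_indices`, `fit_mean_model` and `k_fold_mse`, as well as the representation of each record as a dict of numeric label values, are assumptions made for the example.

```python
from statistics import mean

def k_fold_indices(n, x):
    """Split range(n) into x contiguous folds whose sizes differ by at most 1."""
    base, extra = divmod(n, x)
    folds, start = [], 0
    for i in range(x):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fit_mean_model(rows, labels):
    """Stand-in for the trained model: the per-label mean of the training rows."""
    return {lab: mean(r[lab] for r in rows) for lab in labels}

def k_fold_mse(rows, labels, x):
    """Average per-label MSE over x folds; each fold is the test set exactly once."""
    if not 2 <= x <= len(rows):
        raise ValueError("x must be between 2 and the number of rows")
    per_fold = {lab: [] for lab in labels}
    for test_idx in k_fold_indices(len(rows), x):
        held_out = set(test_idx)
        train = [r for i, r in enumerate(rows) if i not in held_out]
        test = [rows[i] for i in test_idx]
        model = fit_mean_model(train, labels)
        for lab in labels:
            # MSE_i: mean squared error of this label on the held-out fold
            mse_i = mean((r[lab] - model[lab]) ** 2 for r in test)
            per_fold[lab].append(mse_i)
    # step 3): average the x MSE_i values per label
    return {lab: mean(values) for lab, values in per_fold.items()}
```

For instance, `k_fold_mse(rows, ["unit_weight"], 8)` would return one averaged MSE for the hypothetical `unit_weight` label; how an averaged MSE is turned into an "abnormal data" decision in step 4) is not specified in the text and is left out here.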

Preferably, the activation function AF is a Sigmoid activation function or a Tanh activation function or an ELU activation function.

A product data cleaning system based on deep learning, which adopts the above product data cleaning method based on deep learning, comprises the following modules:

the data storage module, used for storing the data of the historical project product database;

the deep learning module, used for training on the data in the data storage module to obtain a training data set and a test data set;

the data cleaning module, used for performing iterative cross-validation on the product cleaning result to obtain a cleaning result;

the data import module, used for importing the product data to be cleaned into the data cleaning module;

the result display module, used for displaying and analyzing the cleaning result;

and the production database module, used for receiving the cleaning result and performing production scheduling.

The beneficial effects of the invention are as follows: product data is preprocessed with an RNN (recurrent neural network) deep learning algorithm; the cleaning model abstracts the neurons of each dimension from an industry standard database and a historical business database; the training library data is then improved periodically through the model's self-learning capability; and a verification mechanism between the deep learning test library and training library corrects deviations in the cleaning result, improving cleaning accuracy.

Drawings

FIG. 1 is a process diagram of an embodiment for building a data cleansing model;

FIG. 2 is a detailed flow chart of data cleansing according to an embodiment.

Detailed Description

Embodiment:

This embodiment provides a product data cleaning method based on deep learning, which comprises the following steps:

S1, establishing a product data set by industry, building a data cleaning model based on a deep learning model, and training the data cleaning model on the product data set to obtain a training data set and a test data set. With reference to FIG. 1, the details are as follows:

A1: establishing a product data set by industry, such as a machining product data set and a product database of historical projects;

A2: making labels according to the material grade, specification, material type and material category of the materials;

A4: training on the product data set through a deep learning model;

establishing a function $M_i = AF\left(\sum_j X_{ij}\, t_k + b_j\right)$, wherein t is the number of product libraries, k is the level of the BOM of the product, X is the tag data set, and AF is the activation function; the commonly used activation functions are as follows:

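(1) Sigmoid activation function:

$f(x) = \frac{1}{1 + e^{-x}}$;

(2) Tanh activation function:

$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$;

(3) ELU activation function:

$f(x) = x$ for $x > 0$, and $f(x) = a\,(e^{x} - 1)$ for $x \le 0$;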

A6: splitting the data set that has passed training into training data and test data.
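As an illustration of the function established in step A4, the three activation functions can be written out using their standard textbook definitions, together with one possible reading of the function itself. The sketch assumes the bias b_j is added inside the sum over j; the text does not settle this, and the names `neuron_output`, `x_row`, `t_k` and `b_row` are hypothetical.

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh(x):
    """Standard hyperbolic tangent."""
    return math.tanh(x)

def elu(x, a=1.0):
    """Standard ELU: x for x > 0, a * (e^x - 1) otherwise."""
    return x if x > 0 else a * (math.exp(x) - 1.0)

def neuron_output(x_row, t_k, b_row, af):
    """One reading of M_i = AF(sum_j X_ij * t_k + b_j).

    Assumption: the bias b_j is summed together with each X_ij * t_k term.
    """
    total = sum(x_ij * t_k + b_j for x_ij, b_j in zip(x_row, b_row))
    return af(total)
```

For example, `neuron_output([1.0, 2.0], 0.5, [0.0, 0.0], elu)` evaluates the weighted sum 1.5 and passes it through ELU unchanged, since ELU is the identity for positive inputs.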

S2, acquiring the product data to be cleaned and inputting it into the trained data cleaning model to obtain a product cleaning result, specifically:

B1: obtaining the product data of the current project, such as drawings and Excel tables, by data import or interface synchronization;

B2: inputting the project product data into the established product data cleaning model;

S3, performing iterative cross-validation on the product cleaning result according to the material attributes until no abnormal data remains, and then outputting the cleaning result, specifically:

B3: calculating product cleaning results on the training data and on the test data respectively, comparing them, and removing results that differ widely:

performing business-logic cross-validation according to the business relationships among the material attributes or labels, for example:

1. Iterative cross-validation starting from the product category: when the product category is "I-beam steel", the checks are that the material grade conforms to national standard GB/T 700-2006, and that the material type is raw material with the corresponding unit-of-measure symbol.

2. Iterative cross-validation starting from the specification: when the specification is "20a", the checks are that the dimensions match the dimension table for that specification, the material grade conforms to national standard GB/T 700-2006, and the unit weight matches the unit-weight table for that specification;

and so on through all labels. Each iteration generates an array of material attributes or labels, and when the results of all arrays are consistent, the data is reported as normal;

the specific algorithm uses K-fold cross-validation, calling several parts of the training data set and the test data set at the same time; the number of parts is set by a configuration item. For example, if it is set to 8, the cross-validation proceeds as follows:

1) dividing the training data and test data into 8 parts;

2) in each round, taking 1 part, without repetition, as test data and the other 7 parts as training data for the model, and calculating the MSE_i value of each material attribute label of the deep learning model on the test data set;

3) averaging the 8 calculated MSE_i values to obtain a relatively accurate MSE value for each material attribute label, wherein x is the configured number of splits (8 in this example); the larger x is, the higher the accuracy and the greater the amount of computation;

B7: if no abnormal data exists in B3, directly outputting the cleaning result;

B4: if abnormal data exists in B3, calling the expert database to review it and then adding it to a temporary training library;

B5: calling the temporary training library and cleaning again;

B6: if no abnormal data exists in B5, transferring the temporary training library into the formal training library and directly outputting the cleaning result;

B7: if abnormal data exists in B5, returning to B4.
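The control flow of steps B3 to B7 can be summarized in a short sketch. Everything named here is hypothetical: `clean`, `find_anomalies` and `expert_review` stand for the trained cleaning model, the cross-validation check and the expert database, whose interfaces the text does not define. The `max_rounds` guard is an addition for the example only, so that the loop cannot run forever.

```python
def clean_until_stable(records, clean, find_anomalies, expert_review,
                       formal_library, max_rounds=10):
    """Repeat cleaning until no anomalies remain (sketch of B3 to B7).

    clean(records, library) -> result          (hypothetical model call)
    find_anomalies(result) -> list             (hypothetical validation check)
    expert_review(anomalies) -> list           (hypothetical expert correction)
    """
    temp_library = []
    result = clean(records, formal_library)
    anomalies = find_anomalies(result)                       # B3
    rounds = 0
    while anomalies:
        if rounds >= max_rounds:
            raise RuntimeError("anomalies remain after max_rounds review cycles")
        rounds += 1
        temp_library.extend(expert_review(anomalies))        # B4
        result = clean(records, formal_library + temp_library)  # B5
        anomalies = find_anomalies(result)
    # B6: promote the temporary training library to the formal one
    formal_library.extend(temp_library)
    return result                                            # B7
```

Keeping expert-reviewed records in a separate temporary list until a clean pass succeeds mirrors the text's distinction between the temporary and formal training libraries: nothing unverified reaches the formal library.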

This embodiment further provides a product data cleaning system based on deep learning, which adopts the above product data cleaning method based on deep learning and includes:

The data import module is used for importing the data of the product to be cleaned into the data cleaning module;

The specific usage process is as follows:

Step C1: importing into the data storage module the product databases of industries such as machining, assembly manufacturing, medical devices, and textiles and clothing; the national standard libraries of material specifications for steel, aluminum and the like; and the company's historical project product databases;

Step C2: establishing the data storage module using classification labels of the form industry + national standard + material attribute + project;

Step C3: in the deep learning module, selecting the Tanh and ELU activation functions to train on the data in the data storage module, abstracting the neurons of each dimension, and establishing complete training data and test data;

Step C4: configuring, in the data cleaning module, the industry to be cleaned and the K value for K-fold cross-validation;

Step C5: the data import module supports Excel import, CAD import and PDM system integration;

Step C6: after the data from C5 is imported into the data cleaning module configured in C4, cleaning the data and displaying the result: the data before and after cleaning is compared by material number and classification label, and the quantity and quality of cleaning are analyzed and displayed for each classification label;

Step C7: after cleaning, synchronizing the data to the production database module through the interface platform for subsequent production scheduling.

Claims

1. A product data cleaning method based on deep learning is characterized by comprising the following steps:

S2, acquiring the product data to be cleaned and inputting it into the trained data cleaning model to obtain a product cleaning result;

2. The deep learning-based product data cleaning method according to claim 1, wherein step S1 comprises the following steps:

A2: making labels according to the attributes of the materials;

establishing a function $M_i = AF\left(\sum_j X_{ij}\, t_k + b_j\right)$, wherein t is the number of product libraries, k is the level of the BOM of the product, X is the tag data set, and AF is the activation function;

A5: the training result is corrected by an expert database and then output as a data set that has passed training;

3. The deep learning-based product data cleaning method according to claim 2, wherein the attributes of the materials comprise the material grade, specification, material type and material category.

4. The deep learning-based product data cleaning method according to claim 1, wherein step S3 comprises the following steps: a K-fold cross-validation method is applied, calling the training data set and the test data at the same time, specifically as follows:

1) dividing the training data and test data into x parts;

2) in each round, taking 1 part, without repetition, as test data and the other x-1 parts as training data for the model, and then calculating the MSE_i value of each material attribute label of the deep learning model on the test data set;

3) averaging the x calculated MSE_i values to obtain the MSE value of each material attribute label, wherein x is the configured number of splits;

6) calling the temporary training library and cleaning again;

5. The deep learning-based product data cleaning method as claimed in claim 1, wherein the activation function AF is a Sigmoid activation function, a Tanh activation function or an ELU activation function.

6. A deep learning-based product data cleaning system using the deep learning-based product data cleaning method according to any one of claims 1 to 5, comprising: