CN112270614B - Design resource big data modeling method for manufacturing enterprise full-system optimization design


Info

Publication number
CN112270614B
CN112270614B (application CN202011049729.XA)
Authority
CN
China
Prior art keywords
data
value
design
model
logistic regression
Prior art date
Legal status
Active
Application number
CN202011049729.XA
Other languages
Chinese (zh)
Other versions
CN112270614A
Inventor
任鸿儒
肖毅
鲁仁全
徐雍
周琪
Current Assignee
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202011049729.XA
Publication of CN112270614A
Application granted
Publication of CN112270614B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/04Manufacturing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24143Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing


Abstract

The invention discloses a design resource big data modeling method for the full-system optimization design of a manufacturing enterprise. After big data concerning subjects such as design, manufacturing, products and users in a manufacturing enterprise are collected, cleaned and feature-processed, an accurate and effective design resource big data model for the full-system optimization design of the manufacturing enterprise is constructed with a KNN-logistic regression combination model algorithm, so that related business in the enterprise can be predicted in advance and the data of the design, manufacturing, product and user subjects can be optimized. This solves two problems of existing design resource data models: they consider only the data of a single design department rather than integrating and summarizing the data of all design departments, and a single data model may fail to predict classification results accurately.

Description

Design resource big data modeling method for manufacturing enterprise full-system optimization design
Technical Field
The invention relates to the technical field of manufacturing industry and big data, in particular to a design resource big data modeling method for the whole system optimization design of manufacturing enterprises.
Background
Industrial big data is an important strategic resource for the transformation and upgrading of China's manufacturing industry; to make full use of the massive data generated in the design, manufacturing, management and service processes of manufacturing enterprises, methods and technologies for constructing manufacturing enterprise data spaces have become an important foundational frontier technology. The manufacturing enterprise data space is formed by the full-system, full-value-chain data generated in business domains such as design, manufacturing, management and service; it has the 4V characteristics of big data (large volume, high velocity, heterogeneous variety, low veracity) as well as multi-modal, cross-scale, high-throughput, strongly correlated and mechanism-heavy characteristics, which make manufacturing big data difficult to model.
Most current modeling methods for manufacturing big data target a single business field and do not fully consider the associated influence of data in other business fields during modeling; methods that span multiple business fields and the whole product life cycle are lacking, so the core problems of business fields such as design resources, management flows, manufacturing processes and product services cannot be characterized comprehensively and effectively from a whole-flow, whole-system view.
Product design is the first link of the product life cycle. Existing design resource data models, on the one hand, only consider the data of a single design department rather than integrating and summarizing the data of all design departments; on the other hand, the algorithm adopted by the data model is a single one, so the classification result may not be predicted accurately.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a design resource big data modeling method for the full-system optimization design of a manufacturing enterprise. The method realizes a highly ordered display of the relations in design resource big data and, together with business models of the full-process manufacturing flow, fully connected management processes and full-period product service, realizes full-system, full-value-chain modeling of manufacturing big data, thereby solving the problem that the traditional relational database model cannot model manufacturing enterprise big data reasonably and effectively.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows:
a design resource big data modeling method for manufacturing enterprise full system optimization design comprises the following steps:
S1, acquiring multi-source heterogeneous design resource big data and converting it into a structured data source with a uniform format;
S2, cleaning the collected data to remove data that does not meet the requirements;
S3, carrying out feature processing on the data that meets the requirements;
S4, carrying out classification prediction on the samples to be classified with the KNN-logistic regression combination model algorithm, so as to judge whether the design of a new product in the manufacturing enterprise can be completed within the specified period, and optimizing the data of the design, manufacturing, product and user subjects according to the prediction result.
Further, the specific steps by which step S1 collects the multi-source heterogeneous design resource big data and converts it into a structured data source with a uniform format are as follows:
S1-1, identifying the data sources related to the manufacturing enterprise design resource subjects and their storage locations;
S1-2, for relational databases, configuring a data connection between the relational database and HDFS with Sqoop and importing the data from the relational database into the Hadoop HDFS;
S1-3, for data in file format, parsing the data files with a MapReduce program and uploading them to HDFS;
S1-4, integrating all the subject data acquired above in Hive based on a relational model;
S1-5, building a structured subject data set.
Further, the data cleansing includes the steps of:
S2-1, preprocessing the data;
S2-2, removing or completing missing data;
S2-3, removing data with content errors;
S2-4, removing data with logic errors;
S2-5, removing unnecessary data;
S2-6, verifying data relevance.
Further, the feature processing includes the steps of:
S3-1, solving the problem of unbalanced positive and negative samples with the synthetic minority oversampling (SMOTE) method, avoiding the low prediction accuracy that unbalanced samples would cause in the subsequent KNN and logistic regression algorithms;
S3-2, performing feature selection through a variance selection method;
S3-3, performing dimension reduction on the feature matrix after feature selection through principal component analysis.
Further, the specific process of the step S3-1 is as follows:
3-1-1) for each sample x in the minority class, use the formula
d = √(Σ_i (x_i − y_i)^2)
to obtain the Euclidean distance d from sample x to every other minority-class sample y;
3-1-2) denote the number of majority-class samples by m and the number of minority-class samples by n, and let k = ⌊m/n⌋; for each minority sample x, take the k other samples with the smallest Euclidean distance d as the neighbours x_k of x;
3-1-3) for each neighbour x_k, generate a new sample x_new between x and x_k by random linear interpolation:
x_new = x + ε|x_k − x|
wherein epsilon is a random value between 0 and 1;
3-1-4) repeating steps 3-1-3) until the minority class samples and the majority class samples are equal or have no difference.
Further, the specific process of the step S3-3 is as follows:
3-3-1) carrying out normalization treatment on the characteristics;
Conversion using a linear function:
y=(x-MinValue)/(MaxValue-MinValue)
wherein x and y are the values before and after conversion, and MaxValue and MinValue are the maximum and minimum of the sample;
3-3-2) calculating the average value of the features of each column, and then subtracting the feature average value of the column from each dimension;
3-3-3) calculating a covariance matrix of the sample features;
3-3-4) calculating eigenvalues and eigenvectors of the covariance matrix;
3-3-5) sorting the calculated characteristic values from large to small;
3-3-6) taking out the first K eigenvectors and eigenvalues, multiplying the initial sample matrix by an eigenvector matrix formed by the K eigenvectors, and obtaining a feature matrix after dimension reduction;
The value of K is chosen by reference to the retained-variance criterion: find the minimum K satisfying
(Σ_{i=1}^{K} λ_i) / (Σ_{i=1}^{n} λ_i) ≥ t,
where λ_i are the eigenvalues of the covariance matrix sorted in descending order and t is the retained-variance threshold (e.g. 0.99).
Further, the step S4 specifically includes:
S4-1, dividing the feature-processed data into training set and test set data for training and testing the models;
S4-2, after training the KNN model with the training set data, test it with the test set data and obtain its class I classification error rate ω1 (the probability of misclassifying majority-class samples as the minority class);
S4-3, after training the logistic regression model with the training set data, test it with the test set data and obtain its class I classification error rate ω2;
S4-4, constructing the KNN-logistic regression combination model based on the Lagrange method;
S4-5, predicting with the KNN-logistic regression combination model whether the design of a new product in the manufacturing enterprise can be completed within the specified period;
S4-6, optimizing the data of the design, manufacturing, product and user subjects according to the prediction result.
Further, in step S4-1, in order to determine whether the classification results of the KNN algorithm, the logistic regression algorithm and the KNN-logistic regression combination model algorithm are accurate, a cross-validation method is chosen: the feature-processed data are divided into three parts A, B and C, which are then combined crosswise into three groups: group 1 (training set: A, B; test set: C); group 2 (training set: B, C; test set: A); group 3 (training set: A, C; test set: B).
Further, in step S4-2 the KNN model is trained with the first group of training set data and tested with the corresponding test set data, and the operation is then repeated with the second and third groups of data to obtain the average class I classification error rate ω1 of the KNN model over the three runs; the specific steps are as follows:
4-2-1) according to the Euclidean distance formula
d = √(Σ_i (x_i − y_i)^2),
calculate the Euclidean distance d between each first-group test sample x and each first-group training sample y;
4-2-2) sort the calculated Euclidean distances d and select the k smallest points, where k must be smaller than the square root of the number of training samples and must be odd;
4-2-3) determining the frequencies of occurrence of k points in two categories, namely that the design can be completed in a specified period and that the design cannot be completed in the specified period, and taking the category with the highest frequency as the prediction classification of the data to be classified;
4-2-4) according to the classification results, obtain the class I classification error rate ω11 of the KNN model algorithm on the first group of data;
4-2-5) repeat steps 4-2-1) to 4-2-4) twice to obtain the class I classification error rates ω12 and ω13 on the other two groups of data, and finally take the average ω1 = (ω11 + ω12 + ω13)/3 as the class I classification error rate of the KNN model algorithm;
In step S4-3, the logistic regression model is trained with the first group of training set data and tested with the corresponding test set data, and the operation is then repeated with the second and third groups of data to obtain the average class I classification error rate ω2 of the logistic regression model over the three runs; the steps are as follows:
4-3-1) determine the prediction function:
Based on the Sigmoid function g(z) = 1/(1 + e^(−z)), set the weight vector θ = (θ_0, θ_1, θ_2, ..., θ_n) and take the first group of training set data as the input vector x = (1, x_1, x_2, ..., x_n); letting z(x) = θ^T x, the prediction function of the logistic regression algorithm is
h_θ(x) = g(θ^T x) = 1/(1 + e^(−θ^T x));
marking whether the product design is finished within a specified period as y, marking y as 1 when the product design is finished on time, and marking y as 0 when the product design is not finished on time;
h_θ(x) represents the probability that y = 1 given the input value x and the weight parameter θ;
4-3-2) determining a weight vector θ:
For a given data set, the weight vector θ can be estimated by the maximum likelihood method:
Likelihood function: L(θ) = Π_{i=1}^{m} h_θ(x_i)^{y_i} (1 − h_θ(x_i))^{1 − y_i};
its log-likelihood function: l(θ) = Σ_{i=1}^{m} [y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i))];
At this point introduce the cost function
J(θ) = −(1/m) Σ_{i=1}^{m} [y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i))] + (ξ/2m) Σ_{j=1}^{n} θ_j^2,
converting the problem into a gradient descent task of finding the minimum; the second half is an added regularization term that addresses over-fitting of the model;
in the above formula, ξ is the penalty-term strength. A set of candidate penalty strengths ξ with different values, for example [0.01, 0.1, 1, 10, 100], is tried; for each value, 5-fold cross-validation yields 5 recall scores, giving a recall per penalty strength, and the ξ corresponding to the highest recall is selected as the penalty-term strength;
To solve for θ, first take the partial derivative of J(θ) with respect to each θ_j; then, starting from some initial θ, repeatedly subtract the partial derivative multiplied by the step size and recompute θ, until the change in θ makes the difference of J(θ) between two iterations small enough, i.e. the J(θ) values computed in two consecutive iterations are essentially unchanged, indicating that J(θ) has reached a local minimum; the resulting θ values are then substituted into the logistic regression equation h_θ(x) to obtain the final prediction function;
wherein the partial derivative of J(θ) with respect to θ_j is
∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^{m} (h_θ(x_i) − y_i) x_i^{(j)} + (ξ/m) θ_j,
and the iterative formula for θ_j after regularization, with step size α, is
θ_j := θ_j (1 − αξ/m) − (α/m) Σ_{i=1}^{m} (h_θ(x_i) − y_i) x_i^{(j)};
4-3-3) input the first group of test set data into the prediction function h_θ(x) trained on the first group of training set data, and classify the test samples according to the resulting probability values;
4-3-4) according to the classification results, obtain the class I classification error rate ω21 of the logistic regression model algorithm on the first group of data;
4-3-5) repeat steps 4-3-1) to 4-3-4) twice to obtain the class I classification error rates ω22 and ω23 on the other two groups of data, and finally take the average ω2 = (ω21 + ω22 + ω23)/3 as the class I classification error rate of the logistic regression model algorithm.
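Steps 4-3-1) to 4-3-3) can be sketched as a minimal regularized logistic regression trained by gradient descent. This is an illustrative implementation, not the patent's exact code: the hyper-parameter defaults (xi, alpha, n_iter) and function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, xi=0.1, alpha=0.1, n_iter=5000):
    """Gradient descent on the regularized cost J(theta).
    xi is the penalty-term strength, alpha the step size (both illustrative)."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])     # prepend the constant term x0 = 1
    theta = np.zeros(n + 1)
    for _ in range(n_iter):
        h = sigmoid(Xb @ theta)              # h_theta(x) for every sample
        grad = Xb.T @ (h - y) / m            # (1/m) * sum((h - y) * x_j)
        grad[1:] += (xi / m) * theta[1:]     # regularize all weights but the intercept
        theta -= alpha * grad
    return theta

def predict(theta, X):
    # classify by thresholding the predicted probability at 0.5
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(Xb @ theta) >= 0.5).astype(int)
```

On a toy linearly separable set (e.g. on-time completions labelled 1), the trained function separates the two classes.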
Further, the specific process of constructing the Lagrange-based KNN-logistic regression combination model in step S4-4 is as follows:
4-4-1) determination of the prediction function:
The predicted value of the combination model for the i-th sample is denoted p_i:
p_i = α_1 k_i + α_2 l_i
wherein k_i and l_i are the predicted probability values of the KNN and logistic regression models for the i-th sample, α_1 and α_2 are the weights of the two models, and α_1 + α_2 = 1;
4-4-2) construct the Lagrange loss function
L(α_1, α_2, λ) = ω_1 α_1^2 + ω_2 α_2^2 + λ(1 − α_1 − α_2),
wherein ω_1 and ω_2 are the class I classification error rates of the sub-models obtained in steps S4-2 and S4-3, regarded as penalty parameters of the sub-models, and λ is the Lagrange multiplier;
4-4-3) solve for the optimal values of α_1 and α_2:
Since L(α_1, α_2, λ) is a convex function it has a minimum, and the minimum point gives the optimal values of α_1 and α_2;
setting the partial derivatives ∂L/∂α_1, ∂L/∂α_2 and ∂L/∂λ to zero and solving the resulting equations with python yields the optimal α_1 and α_2.
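Under the quadratic Lagrangian assumed above (the patent's rendered formula is not shown), the stationary point has a closed form, so a sketch needs no numerical solver; the derivation in the comments is for that assumed loss only.

```python
def combine_weights(omega1, omega2):
    """Closed-form stationary point of the assumed Lagrangian
    L = w1*a1^2 + w2*a2^2 + lam*(1 - a1 - a2):
      dL/da1 = 2*w1*a1 - lam = 0
      dL/da2 = 2*w2*a2 - lam = 0
      a1 + a2 = 1
    => a1 = w2/(w1 + w2), a2 = w1/(w1 + w2),
    so the sub-model with the larger error rate gets the smaller weight."""
    a1 = omega2 / (omega1 + omega2)
    a2 = omega1 / (omega1 + omega2)
    return a1, a2

def combined_prediction(k_i, l_i, a1, a2):
    # p_i = a1 * k_i + a2 * l_i
    return a1 * k_i + a2 * l_i
```

For example, with ω1 = 0.2 (KNN) and ω2 = 0.1 (logistic regression), the logistic regression model receives the larger weight.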
Compared with the prior art, the scheme has the following principle and advantages:
According to this scheme, after the big data of the design, manufacturing, product and user subjects in a manufacturing enterprise are collected, cleaned and feature-processed, an accurate and effective design resource big data model for the full-system optimization design of the manufacturing enterprise is constructed with the KNN-logistic regression combination model algorithm, so that related business in the enterprise can be predicted in advance and the data of these subjects optimized; this solves the problems that existing design resource data models only consider single design department data without integrating and summarizing all design department data, and that a single data model may fail to predict classification results accurately.
In addition, together with business models of the full-process manufacturing flow, fully connected management processes and full-period product service, the scheme realizes full-system, full-value-chain modeling of manufacturing big data, further addressing the inability of traditional relational database models to model manufacturing enterprise big data reasonably and effectively.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required in the embodiments or in the description of the prior art are briefly introduced below; obviously, the figures described below are only some embodiments of the present invention, and a person skilled in the art can obtain other figures from them without inventive effort.
FIG. 1 is a schematic flow chart of a design resource big data modeling method for manufacturing enterprise full system optimization design;
FIG. 2 is a flow chart of data cleaning in a design resource big data modeling method for manufacturing enterprise full system optimization design.
Detailed Description
The invention is further illustrated by the following examples:
As shown in fig. 1, the method for modeling design resource big data for optimizing design of a whole system of a manufacturing enterprise according to the embodiment includes the following steps:
S1, data acquisition:
S1-1, identifying the data sources related to the manufacturing enterprise design resource subjects and their storage locations;
S1-2, for relational databases, configuring a data connection between the relational database and HDFS with Sqoop and importing the data from the relational database into the Hadoop HDFS;
S1-3, for data in file format, parsing the data files with a MapReduce program and uploading them to HDFS;
S1-4, integrating all the subject data acquired above in Hive based on a relational model;
S1-5, building a structured subject data set.
Through the steps, the collected multi-source heterogeneous design resource big data can be converted into a structured data set with a uniform format.
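The patent performs this unification at scale with Sqoop, MapReduce and Hive; the idea of mapping heterogeneous sources onto one structured schema can be sketched in miniature with the standard library. All field names and the mapping dictionaries here are hypothetical, chosen only to illustrate the step.

```python
import csv
import io
import json

# Hypothetical unified schema; in the patent's pipeline this role is
# played by the relational model integrated in Hive.
UNIFIED_FIELDS = ["part_id", "designer", "hours"]

def from_csv(text, mapping):
    # mapping: unified field name -> this source's column name
    rows = csv.DictReader(io.StringIO(text))
    return [{uf: r[mapping[uf]] for uf in UNIFIED_FIELDS} for r in rows]

def from_json(text, mapping):
    # normalize every value to a string so both sources share one format
    return [{uf: str(r[mapping[uf]]) for uf in UNIFIED_FIELDS}
            for r in json.loads(text)]

csv_src = "PartNo,Owner,Hours\nP1,Li,12\n"
json_src = '[{"id": "P2", "eng": "Zhao", "h": 8}]'
dataset = (
    from_csv(csv_src, {"part_id": "PartNo", "designer": "Owner", "hours": "Hours"})
    + from_json(json_src, {"part_id": "id", "designer": "eng", "hours": "h"})
)
```

Two differently shaped sources end up as one list of records with identical keys, which is the structured, uniform-format data set the steps above describe.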
As shown in fig. 2, the collected data is subjected to cleaning treatment to remove data which does not meet the requirements; the method comprises the following specific steps:
S2-1, data preprocessing: view the metadata, including field interpretations, data sources, code tables and other descriptions of the data, to gain an intuitive understanding of the data itself and discover problems preliminarily in preparation for later processing;
S2-2, removing or completing missing data: determine the missing range of each data field; directly discard records missing key fields, and fill in non-key data, for example by inferring missing values from business knowledge or experience, filling them with computed statistics of the same indicator (mean, median, mode, etc.), or filling them with computed results of different indicators;
S2-3, removing data with errors in the content, and ensuring the correctness of the data;
S2-4, removing logically wrong data: discarding the data with logic errors according to the business rules to ensure the logic correctness of the data;
S2-5, removing unnecessary data: removing data irrelevant to the business rule, and ensuring the relativity of the data;
S2-6, verifying data relevance: data from multiple sources must undergo relevance verification, and data that fail it need to be cleaned.
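Steps S2-2 to S2-5 above can be sketched as a single pass over record dictionaries. The field names (`part_id`, `hours`, `start_day`, `finish_day`) and the logic rule (finish day must not precede start day) are illustrative assumptions, not fields from the patent.

```python
def clean(records, key_fields, fill_stats):
    """Sketch of data cleaning: drop records missing a key field,
    fill non-key gaps with a precomputed per-field statistic
    (mean, median, mode, ...), and drop records failing a logic rule."""
    cleaned = []
    for r in records:
        if any(r.get(f) in (None, "") for f in key_fields):
            continue                       # S2-2: missing key field -> discard
        r = dict(r)
        for f, stat in fill_stats.items():
            if r.get(f) in (None, ""):
                r[f] = stat                # S2-2: fill non-key gaps
        if r["finish_day"] < r["start_day"]:
            continue                       # S2-4: logic error -> discard
        cleaned.append(r)
    return cleaned
```

Records with an empty key field or an impossible date range are removed, and the remaining gap is filled with the supplied statistic.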
S3, carrying out feature processing on the data meeting the requirements:
S3-1, class imbalance problem processing: when the data have a serious class imbalance problem, the predicted results tend to lean toward the larger class, affecting the accuracy of the model. A common remedy is random undersampling, which reduces the scale of the majority class by randomly removing majority-class samples; however, important data may be lost this way, and the sampled data cannot represent all the data, making the classification result inaccurate. There is also random oversampling, which enlarges the minority class by randomly copying minority-class samples; although this method causes no information loss and performs better than undersampling, it increases the possibility of overfitting.
In this embodiment, the synthetic minority oversampling (SMOTE) method is adopted to solve the class imbalance problem without losing important data and while mitigating over-fitting. The specific analysis and calculation flow is as follows:
3-1-1) for each sample x in the minority class, use the formula
d = √(Σ_i (x_i − y_i)^2)
to obtain the Euclidean distance d from sample x to every other minority-class sample y;
3-1-2) denote the number of majority-class samples by m and the number of minority-class samples by n, and let k = ⌊m/n⌋; for each minority sample x, take the k other samples with the smallest Euclidean distance d as the neighbours x_k of x;
3-1-3) for each neighbour x_k, generate a new sample x_new between x and x_k by random linear interpolation:
x_new = x + ε|x_k − x|
wherein epsilon is a random value between 0 and 1;
3-1-4) repeating steps 3-1-3) until the minority class samples and the majority class samples are equal or have no difference.
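The SMOTE steps above can be sketched as follows. Two details are assumptions where the rendered formulas are missing: the neighbour count k = ⌊m/n⌋, and the interpolation is written with the signed difference (x_k − x), the usual SMOTE form, rather than the absolute value shown in the text.

```python
import math
import random

def smote(minority, majority, seed=0):
    """Generate synthetic minority samples: for each minority sample,
    find its k nearest minority neighbours by Euclidean distance and
    interpolate x_new = x + eps * (x_k - x), eps random in [0, 1)."""
    rng = random.Random(seed)
    m, n = len(majority), len(minority)
    k = max(1, m // n)                     # assumed neighbour count
    synthetic = []
    for x in minority:
        # Euclidean distances to every other minority sample
        others = sorted((s for s in minority if s is not x),
                        key=lambda s: math.dist(x, s))
        for xk in others[:k]:
            eps = rng.random()
            synthetic.append(tuple(xi + eps * (ki - xi)
                                   for xi, ki in zip(x, xk)))
    return synthetic
```

With n minority samples, each contributes k synthetic points, so roughly n*k new samples are produced per pass, which is why repeating the step balances the classes.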
S3-2, select features through the variance selection method: first calculate the variance of each feature, eliminating features with a variance of 0, and then select the features whose variance is larger than a chosen threshold.
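A minimal sketch of this variance selection step; the threshold value is whatever the modeller chooses, so the default here is only illustrative.

```python
def variance_select(X, threshold=0.0):
    """Return the indices of columns whose variance exceeds the threshold
    (columns with variance 0 are always eliminated)."""
    n = len(X)
    keep = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n   # population variance
        if var > threshold:
            keep.append(j)
    return keep
```

A constant column carries no information for classification, so it is dropped regardless of the threshold.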
S3-3, after feature selection is completed, to avoid the large computation load and long model training time that an oversized feature matrix may cause, the feature matrix is reduced in dimension through principal component analysis (PCA). The analysis and calculation flow is as follows:
3-3-1) carrying out normalization treatment on the characteristics;
Conversion using a linear function:
y=(x-MinValue)/(MaxValue-MinValue)
wherein x and y are the values before and after conversion, and MaxValue and MinValue are the maximum and minimum of the sample;
3-3-2) calculating the average value of the features of each column, and then subtracting the feature average value of the column from each dimension;
3-3-3) calculating a covariance matrix of the sample features;
3-3-4) calculating eigenvalues and eigenvectors of the covariance matrix;
3-3-5) sorting the calculated characteristic values from large to small;
3-3-6) taking out the first K eigenvectors and eigenvalues, multiplying the initial sample matrix by an eigenvector matrix formed by the K eigenvectors, and obtaining a feature matrix after dimension reduction;
The value of K is chosen by reference to the retained-variance criterion: find the minimum K satisfying
(Σ_{i=1}^{K} λ_i) / (Σ_{i=1}^{n} λ_i) ≥ t,
where λ_i are the eigenvalues of the covariance matrix sorted in descending order and t is the retained-variance threshold (e.g. 0.99).
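Steps 3-3-1) to 3-3-6) can be sketched as below. The retained-variance ratio 0.99 stands in for the patent's unrendered K-selection formula and is an assumption.

```python
import numpy as np

def pca_reduce(X, retain=0.99):
    """PCA dimension reduction following the steps in the text."""
    # 3-3-1) min-max normalization: y = (x - MinValue) / (MaxValue - MinValue)
    Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # 3-3-2) subtract each column's mean from that column
    Xc = Xn - Xn.mean(axis=0)
    # 3-3-3) / 3-3-4) covariance matrix and its eigendecomposition
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    # 3-3-5) sort eigenvalues from large to small
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # 3-3-6) smallest K whose leading eigenvalues reach the retained ratio,
    # then project onto the first K eigenvectors
    ratio = np.cumsum(vals) / vals.sum()
    K = int(np.searchsorted(ratio, retain) + 1)
    return Xc @ vecs[:, :K]
```

On a matrix whose columns are linearly dependent, a single component captures essentially all the variance, so K collapses to 1.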
S4, to avoid the situation that a single algorithm model may fail to predict the classification result accurately, this embodiment selects the KNN-logistic regression combination model algorithm to classify and predict the samples to be classified, so as to judge whether the design of a new product in the manufacturing enterprise can be completed within the specified period, and to optimize the data of the design, manufacturing, product and user subjects according to the prediction result.
The method comprises the following specific steps:
S4-1, determining the training set and test set data
In order to determine whether the classification results of the KNN proximity algorithm, the logistic regression algorithm and the KNN proximity-logistic regression combined model algorithm are accurate, a cross-validation method is selected: the data after feature processing are divided into three parts, A, B and C, which are then combined into three groups in a crossed manner. The first group takes A and B as the training set and C as the test set; the second group takes B and C as the training set and A as the test set; the third group takes A and C as the training set and B as the test set;
S4-2, after training the KNN model with the first group of training set data, testing it with the test set data of the same group, and then repeating the operation with the second and third groups of data, so as to obtain the average class-I classification error rate ω1 of the KNN model over the three runs; the specific steps are as follows:
4-2-1) according to the Euclidean distance formula:
d(x, y) = √( (x1 − y1)² + (x2 − y2)² + ... + (xn − yn)² )
calculating the Euclidean distance d between the first group of test set data x and the first group of training set data y;
4-2-2) sorting by the calculated Euclidean distance d and selecting the k nearest points, wherein the value of k must be smaller than the square root of the number of training set samples and must be odd;
4-2-3) determining the frequencies of occurrence of k points in two categories, namely that the design can be completed in a specified period and that the design cannot be completed in the specified period, and taking the category with the highest frequency as the prediction classification of the data to be classified;
4-2-4) according to the classification result, obtaining the class-I classification error rate ω11 of the KNN model algorithm corresponding to the first group of data;
4-2-5) repeating steps 4-2-1) to 4-2-4) twice to obtain the class-I classification error rates ω12 and ω13 of the KNN model algorithm corresponding to the other two groups of data, and finally taking the average ω1 = (ω11 + ω12 + ω13)/3 as the class-I classification error rate of the KNN model algorithm;
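A minimal sketch of steps S4-1 and S4-2 — the three crossed groupings and the averaged error rate — assuming, for illustration, that the class-I error rate is measured as the plain misclassification rate on each held-out part:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest neighbours by Euclidean distance
    (steps 4-2-1 to 4-2-3). train: list of (features, label)."""
    dists = sorted((math.dist(feat, query), label) for feat, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def fold_error(train, test, k):
    """Error rate on one held-out part (step 4-2-4), simplified here to the
    plain misclassification rate."""
    wrong = sum(knn_predict(train, feat, k) != label for feat, label in test)
    return wrong / len(test)

# Toy samples: label 1 = "design finished on schedule", 0 = "not finished".
A = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((1.0, 1.0), 1)]
B = [((0.1, 0.2), 0), ((0.9, 1.1), 1), ((1.1, 0.9), 1)]
C = [((0.0, 0.3), 0), ((1.0, 0.8), 1), ((0.2, 0.0), 0)]

# The three crossed groupings of S4-1: train on two parts, test on the third.
folds = [(A + B, C), (B + C, A), (A + C, B)]
omega1 = sum(fold_error(tr, te, k=3) for tr, te in folds) / 3
print(omega1)  # -> 0.0 on this separable toy data
```

With k = 3 the odd-k requirement is met and ties in the two-class vote cannot occur.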
S4-3, after training the logistic regression model with the first group of training set data, testing it with the test set data of the same group, and then repeating the operation with the second and third groups of data, so as to obtain the average class-I classification error rate ω2 of the logistic regression model over the three runs; the steps are as follows:
4-3-1) determining a predictive function:
based on the Sigmoid function:
g(z) = 1/(1 + e^(−z))
The weight vector is set to θ = (θ0, θ1, θ2, ..., θn),
Taking the first group of training set data as an input vector x = (1, x1, x2, ..., xn); letting z(x) = θ^T·x, the prediction function of the logistic regression algorithm is obtained:
hθ(x) = g(θ^T·x) = 1/(1 + e^(−θ^T·x))
marking whether the product design is finished within a specified period as y, marking y as 1 when the product design is finished on time, and marking y as 0 when the product design is not finished on time;
hθ(x) represents the probability that y = 1 when the input value is x and the weight parameter is θ;
4-3-2) determining a weight vector θ:
For a given data set, a maximum likelihood estimation method may be used to estimate the weight vector θ:
Likelihood function: L(θ) = ∏(i=1..m) [hθ(x_i)]^(y_i)·[1 − hθ(x_i)]^(1−y_i)
Its log-likelihood function: l(θ) = ∑(i=1..m) [ y_i·ln hθ(x_i) + (1 − y_i)·ln(1 − hθ(x_i)) ]
At this time the cost function J(θ) = −(1/m)·l(θ) + (ξ/2m)·∑(j=1..n) θj² is introduced,
further converting the model into a gradient descent task that seeks the minimum value, wherein the second half is the added regularization item, used to address overfitting of the model;
In the above formula, ξ is the penalty strength; a group of candidate penalty values of different magnitudes, such as ξ ∈ [0.01, 0.1, 1, 10, 100], is selected, and each value is cycled through to obtain 5 recall rates after 5-fold cross-validation, so that the recall corresponding to each penalty value is obtained; the ξ corresponding to the highest recall is then selected as the penalty value;
To solve for θ, first take the partial derivative of J(θ) with respect to each θj; then, starting from a given θ, repeatedly subtract the partial derivative multiplied by the step length to obtain a new θ, until the change in θ makes the difference in J(θ) between two successive iterations small enough, i.e. the values of J(θ) calculated in two successive iterations are essentially unchanged, indicating that J(θ) has reached a local minimum. Each θ value is then calculated and substituted into the logistic regression equation hθ(x) to finally obtain the prediction function.
Wherein the partial derivative of J(θ) with respect to θj is:
∂J(θ)/∂θj = (1/m)·∑(i=1..m) ( hθ(x_i) − y_i )·x_i^(j) + (ξ/m)·θj
and the iterative formula of θj after regularization is:
θj := θj·(1 − α·ξ/m) − (α/m)·∑(i=1..m) ( hθ(x_i) − y_i )·x_i^(j)
wherein α is the step length;
4-3-3) inputting the first group of test set data into the prediction function hθ(x) of the logistic regression algorithm trained with the first group of training set data, and classifying the test set data according to the obtained probability values;
4-3-4) according to the classification result, obtaining the class-I classification error rate ω21 of the logistic regression model algorithm corresponding to the first group of data;
4-3-5) repeating steps 4-3-1) to 4-3-4) twice to obtain the class-I classification error rates ω22 and ω23 of the logistic regression model algorithm corresponding to the other two groups of data, and finally taking the average ω2 = (ω21 + ω22 + ω23)/3 as the class-I classification error rate of the logistic regression model algorithm;
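Step 4-3-2) can be sketched as batch gradient descent on the regularized cost; the step length, iteration count and toy data below are illustrative assumptions, and the bias term is left unpenalized as is conventional:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(data, xi=0.1, alpha=0.5, iters=2000):
    """Batch gradient descent on the regularised cost of step 4-3-2.
    data: list of (features, y); an intercept input x0 = 1 is prepended,
    and theta[0] is the bias. xi is the L2 penalty strength and alpha the
    step length; both values here are illustrative assumptions."""
    m = len(data)
    n = len(data[0][0]) + 1
    theta = [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for feats, y in data:
            x = [1.0] + list(feats)
            err = sigmoid(sum(t * v for t, v in zip(theta, x))) - y
            for j in range(n):
                grad[j] += err * x[j] / m
        for j in range(n):
            reg = (xi / m) * theta[j] if j > 0 else 0.0  # bias not penalised
            theta[j] -= alpha * (grad[j] + reg)
    return theta

def predict(theta, feats):
    """h_theta(x): predicted probability that the design finishes on time."""
    x = [1.0] + list(feats)
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

# Toy data: y = 1 ("finished on time") when the single feature is large.
data = [((0.1,), 0), ((0.3,), 0), ((0.7,), 1), ((0.9,), 1)]
theta = train_logreg(data)
print(predict(theta, (0.9,)) > 0.5, predict(theta, (0.1,)) < 0.5)  # -> True True
```

The penalty-strength search of the text would wrap `train_logreg` in a loop over candidate ξ values, keeping the one with the highest cross-validated recall.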
s4-4, constructing a KNN proximity-logistic regression combination model:
4-4-1) determination of the prediction function:
The predicted value of the combined model for the i-th sample is denoted by p_i:
p_i = α1·k_i + α2·l_i
wherein k_i and l_i respectively represent the predicted probability values of the KNN model and the logistic regression model for the i-th sample, α1 and α2 respectively represent the weight values of the KNN model and the logistic regression model, and α1 + α2 = 1;
4-4-2) constructing Lagrange loss function:
Wherein ω1 and ω2 are the class-I classification error rates of the sub-models obtained in steps S4-2 and S4-3 respectively, here regarded as penalty parameters of the sub-models, and λ is the Lagrange multiplier;
4-4-3) solving for the optimal values of α1 and α2:
Since L(α1, α2, λ) is a convex function, it has a minimum value, and the minimum point gives the optimal values of α1 and α2;
The optimal values of α1 and α2 can be obtained by solving the above equations using Python.
S4-5, service prediction:
Inputting the data of the sample to be classified into the KNN model and the logistic regression model respectively to obtain their prediction probability values k and l, obtaining the prediction value of the combined model by the formula p = α1·k + α2·l, and judging from this value whether the design of the new product can be completed within the specified period;
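Steps S4-4 and S4-5 can be sketched under the assumption that the Lagrange loss of step 4-4-2), whose formula is not reproduced here, takes the quadratic form L(α1, α2, λ) = ω1·α1² + ω2·α2² + λ·(α1 + α2 − 1); setting its partial derivatives to zero then gives closed-form weights, with the lower-error sub-model weighted higher:

```python
def combine_weights(w1, w2):
    """Closed-form stationary point of the assumed quadratic loss:
    differentiating with respect to a1 and a2 under a1 + a2 = 1 gives
    a1 = w2/(w1 + w2), so the sub-model with the lower class-I error
    rate receives the larger weight."""
    a1 = w2 / (w1 + w2)  # weight of the KNN model
    a2 = w1 / (w1 + w2)  # weight of the logistic regression model
    return a1, a2

def combined_prediction(k, l, w1, w2):
    """Step S4-5: p = a1*k + a2*l for sub-model probabilities k and l."""
    a1, a2 = combine_weights(w1, w2)
    return a1 * k + a2 * l

# KNN erred 10% of the time, logistic regression 30%: trust KNN more.
a1, a2 = combine_weights(0.1, 0.3)
print(round(a1, 6), round(a2, 6))                         # -> 0.75 0.25
print(round(combined_prediction(0.8, 0.6, 0.1, 0.3), 6))  # -> 0.75
```

Whatever the exact loss in the patent, the combination step itself is the weighted sum p = α1·k + α2·l shown above.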
S4-6, optimizing design resources: optimizing the data of the subjects including design, manufacture, product and user according to the prediction result, as follows:
4-6-1) When the prediction result indicates that the design of the new product can be completed within the specified period, subject data with smaller weights θ in the logistic regression algorithm may be appropriately downgraded; for example, when the weight θ of "designer seniority" is small, the personnel participating in the design may be changed from senior engineers to junior or mid-level engineers, saving labor cost.
4-6-2) When the prediction result indicates that the design of the new product cannot be completed within the specified period, subject data with larger weights θ in the logistic regression algorithm may be appropriately upgraded; for example, when the weight θ of "processing equipment" is large, processing equipment of better quality may be selected to process the product.
The above embodiments are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention; therefore, any changes made according to the shapes and principles of the present invention shall be covered by the protection scope of the present invention.

Claims (6)

1. A design resource big data modeling method for manufacturing enterprise whole system optimization design is characterized by comprising the following steps:
s1, acquiring multi-source heterogeneous design resource big data, and converting the multi-source heterogeneous design resource big data into a structured data source with a uniform format;
S2, cleaning the collected data to remove data which does not meet the requirements;
s3, carrying out feature processing on the data meeting the requirements;
S4, carrying out classification prediction on a sample to be classified by adopting a KNN proximity-logistic regression combined model algorithm, so as to judge whether the design of a new product in a manufacturing enterprise can be completed within a specified period, and optimizing data of a main body related to the design, the manufacture, the product and a user according to a prediction result;
the step S4 specifically includes:
S4-1, dividing the data after feature processing into training set and test set data for training and testing the models;
S4-2, after training the KNN model with the training set data, testing it with the test set data and obtaining the class-I classification error rate ω1;
S4-3, after training the logistic regression model with the training set data, testing it with the test set data and obtaining the class-I classification error rate ω2;
S4-4, constructing a Lagrange-based KNN proximity-logistic regression combined model;
S4-5, predicting with the KNN proximity-logistic regression combined model whether the design of a new product in the manufacturing enterprise can be completed within the specified period;
S4-6, optimizing the data of the subjects including design, manufacture, product and user according to the prediction result;
in the step S4-1, in order to determine whether the classification results of the KNN proximity algorithm, the logistic regression algorithm and the KNN proximity-logistic regression combined model algorithm are accurate, a cross-validation method is selected: the data after feature processing are divided into three parts, A, B and C, which are then combined into three groups in a crossed manner; the first group takes A and B as the training set and C as the test set; the second group takes B and C as the training set and A as the test set; the third group takes A and C as the training set and B as the test set;
step S4-2 is specifically to test the KNN model with the test set data of the same group after training it with the first group of training set data, and then to repeat the operation with the second and third groups of data, so as to obtain the average class-I classification error rate ω1 of the KNN model over the three runs; the specific steps are as follows:
4-2-1) according to the Euclidean distance formula:
d(x, y) = √( (x1 − y1)² + (x2 − y2)² + ... + (xn − yn)² )
calculating the Euclidean distance d between the first group of test set data x and the first group of training set data y;
4-2-2) sorting by the calculated Euclidean distance d and selecting the k nearest points, wherein the value of k must be smaller than the square root of the number of training set samples and must be odd;
4-2-3) determining the frequencies of occurrence of k points in two categories, namely that the design can be completed in a specified period and that the design cannot be completed in the specified period, and taking the category with the highest frequency as the prediction classification of the data to be classified;
4-2-4) according to the classification result, obtaining the class-I classification error rate ω11 of the KNN model algorithm corresponding to the first group of data;
4-2-5) repeating steps 4-2-1) to 4-2-4) twice to obtain the class-I classification error rates ω12 and ω13 of the KNN model algorithm corresponding to the other two groups of data, and finally taking the average ω1 = (ω11 + ω12 + ω13)/3 as the class-I classification error rate of the KNN model algorithm;
And step S4-3 is specifically to test the logistic regression model with the test set data of the same group after training it with the first group of training set data, and then to repeat the operation with the second and third groups of data, so as to obtain the average class-I classification error rate ω2 of the logistic regression model over the three runs; the steps are as follows:
4-3-1) determining a predictive function:
based on the Sigmoid function:
g(z) = 1/(1 + e^(−z))
The weight vector is set to θ = (θ0, θ1, θ2, ..., θn),
Taking the first group of training set data as an input vector x = (1, x1, x2, ..., xn); letting z(x) = θ^T·x, the prediction function of the logistic regression algorithm is obtained:
hθ(x) = g(θ^T·x) = 1/(1 + e^(−θ^T·x))
marking whether the product design is finished within a specified period as y, marking y as 1 when the product design is finished on time, and marking y as 0 when the product design is not finished on time;
hθ(x) represents the probability that y = 1 when the input value is x and the weight parameter is θ;
4-3-2) determining a weight vector θ:
For a given data set, a maximum likelihood estimation method is used to estimate the weight vector θ:
Likelihood function: L(θ) = ∏(i=1..m) [hθ(x_i)]^(y_i)·[1 − hθ(x_i)]^(1−y_i)
Its log-likelihood function: l(θ) = ∑(i=1..m) [ y_i·ln hθ(x_i) + (1 − y_i)·ln(1 − hθ(x_i)) ]
At this time the cost function J(θ) = −(1/m)·l(θ) + (ξ/2m)·∑(j=1..n) θj² is introduced,
further converting the model into a gradient descent task that seeks the minimum value, wherein the second half is the added regularization item, used to address overfitting of the model;
In the above formula, ξ is the penalty strength; a group of candidate penalty values of different magnitudes, such as ξ ∈ [0.01, 0.1, 1, 10, 100], is selected, and each value is cycled through to obtain 5 recall rates after 5-fold cross-validation, so that the recall corresponding to each penalty value is obtained; the ξ corresponding to the highest recall is then selected as the penalty value;
To solve for θ, first take the partial derivative of J(θ) with respect to each θj; then, starting from a given θ, repeatedly subtract the partial derivative multiplied by the step length to obtain a new θ, until the change in θ makes the difference in J(θ) between two successive iterations small enough, i.e. the values of J(θ) calculated in two successive iterations are essentially unchanged, indicating that J(θ) has reached a local minimum; each θ value is then calculated and substituted into the logistic regression equation hθ(x) to finally obtain the prediction function;
wherein the partial derivative of J(θ) with respect to θj is:
∂J(θ)/∂θj = (1/m)·∑(i=1..m) ( hθ(x_i) − y_i )·x_i^(j) + (ξ/m)·θj
and the iterative formula of θj after regularization is:
θj := θj·(1 − α·ξ/m) − (α/m)·∑(i=1..m) ( hθ(x_i) − y_i )·x_i^(j)
wherein α is the step length;
4-3-3) inputting the first group of test set data into the prediction function hθ(x) of the logistic regression algorithm trained with the first group of training set data, and classifying the test set data according to the obtained probability values;
4-3-4) according to the classification result, obtaining the class-I classification error rate ω21 of the logistic regression model algorithm corresponding to the first group of data;
4-3-5) repeating steps 4-3-1) to 4-3-4) twice to obtain the class-I classification error rates ω22 and ω23 of the logistic regression model algorithm corresponding to the other two groups of data, and finally taking the average ω2 = (ω21 + ω22 + ω23)/3 as the class-I classification error rate of the logistic regression model algorithm;
the specific process of constructing the KNN proximity-logistic regression combination model based on Lagrange in the step S4-4 is as follows:
4-4-1) determination of the prediction function:
The predicted value of the combined model for the i-th sample is denoted by p_i:
p_i = α1·k_i + α2·l_i
wherein k_i and l_i respectively represent the predicted probability values of the KNN model and the logistic regression model for the i-th sample, α1 and α2 respectively represent the weight values of the KNN model and the logistic regression model, and α1 + α2 = 1;
4-4-2) constructing Lagrange loss function:
Wherein ω1 and ω2 are the class-I classification error rates of the sub-models obtained in steps S4-2 and S4-3 respectively, here regarded as penalty parameters of the sub-models, and λ is the Lagrange multiplier;
4-4-3) solving for the optimal values of α1 and α2:
Since L(α1, α2, λ) is a convex function, it has a minimum value, and the minimum point gives the optimal values of α1 and α2;
The optimal values of α1 and α2 can be obtained by solving the above equations using Python.
2. The method for modeling design resource big data for full-system optimization design of manufacturing enterprises according to claim 1, wherein the specific steps of collecting multi-source heterogeneous design resource big data and converting the multi-source heterogeneous design resource big data into a structured data source with a uniform format are as follows:
s1-1, identifying a data source related to a manufacturing enterprise design resource main body and a storage position of the data source;
S1-2, for relational databases, configuring the data connection between the relational database and HDFS by means of the Sqoop tool, and importing the data in the relational database into the Hadoop HDFS;
S1-3, for data in file formats, parsing the data files with the MapReduce programming model and uploading them to HDFS;
s1-4, integrating all the main body data acquired before in Hive based on a relational model;
s1-5, building a structured main body data set.
3. The method for modeling design resource big data for manufacturing enterprise-wide system optimization design according to claim 1, wherein the data cleaning comprises the steps of:
S2-1, preprocessing data;
s2-2, removing or complementing missing data;
S2-3, removing data with errors in the content;
S2-4, removing data with logic errors;
s2-5, removing unnecessary data;
s2-6, verifying data relevance.
4. The method for modeling design resource big data for manufacturing enterprise-wide system optimization design according to claim 1, wherein the feature processing comprises the steps of:
S3-1, solving the problem of unbalanced positive and negative samples by adopting the SMOTE (Synthetic Minority Over-sampling Technique) oversampling method, so as to avoid the low prediction accuracy that unbalanced samples would cause in the subsequent KNN and logistic regression algorithms;
S3-2, performing feature selection through a variance selection method;
s3-3, performing dimension reduction treatment on the feature matrix dimension after feature selection through a principal component analysis method.
5. The method for modeling design resource big data for manufacturing enterprise-wide system optimization design according to claim 4, wherein the specific process of step S3-1 is as follows:
3-1-1) for each sample x in the minority class, using the Euclidean distance formula:
d(x, y) = √( (x1 − y1)² + (x2 − y2)² + ... + (xn − yn)² )
obtaining the Euclidean distance d from the sample x to every other minority-class sample y;
3-1-2) the majority class sample number is denoted as m, the minority class sample number is denoted as n, let:
Taking the k other samples with the minimum Euclidean distance d to each sample x as the neighbors x_k of the sample x;
3-1-3) for each neighbor x k, a new sample x n is generated in x and x k using a random linear interpolation method:
x_n = x + ε·|x_k − x|
wherein ε is a random value between 0 and 1;
3-1-4) repeating steps 3-1-3) until the minority class samples and the majority class samples are equal or have no difference.
6. The method for modeling design resource big data for manufacturing enterprise-wide system optimization design according to claim 4, wherein the specific process of step S3-3 is as follows:
3-3-1) carrying out normalization treatment on the characteristics;
Conversion using a linear function:
y=(x-MinValue)/(MaxValue-MinValue)
Wherein x and y are the values before and after conversion respectively, and MaxValue and MinValue are the maximum and minimum values of the sample;
3-3-2) calculating the average value of the features of each column, and then subtracting the feature average value of the column from each dimension;
3-3-3) calculating a covariance matrix of the sample features;
3-3-4) calculating eigenvalues and eigenvectors of the covariance matrix;
3-3-5) sorting the calculated characteristic values from large to small;
3-3-6) taking out the first K eigenvectors and eigenvalues, multiplying the initial sample matrix by an eigenvector matrix formed by the K eigenvectors, and obtaining a feature matrix after dimension reduction;
The K value is determined with reference to the retained-variance criterion:
( ∑(i=1..K) λi ) / ( ∑(i=1..n) λi ) ≥ t
The minimum K value satisfying the above inequality is taken, wherein λi are the eigenvalues of the covariance matrix sorted from large to small, n is the total number of eigenvalues, and t is the proportion of variance to be retained.
CN202011049729.XA 2020-09-29 2020-09-29 Design resource big data modeling method for manufacturing enterprise full-system optimization design Active CN112270614B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011049729.XA CN112270614B (en) 2020-09-29 2020-09-29 Design resource big data modeling method for manufacturing enterprise full-system optimization design


Publications (2)

Publication Number Publication Date
CN112270614A CN112270614A (en) 2021-01-26
CN112270614B true CN112270614B (en) 2024-05-10

Family

ID=74349345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011049729.XA Active CN112270614B (en) 2020-09-29 2020-09-29 Design resource big data modeling method for manufacturing enterprise full-system optimization design

Country Status (1)

Country Link
CN (1) CN112270614B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115344830A (en) * 2022-08-02 2022-11-15 无锡致为数字科技有限公司 Event probability estimation method based on big data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106779079A (en) * 2016-11-23 2017-05-31 北京师范大学 A kind of forecasting system and method that state is grasped based on the knowledge point that multimodal data drives
KR20170060603A (en) * 2015-11-24 2017-06-01 윤정호 Method and system on generating predicted information of companies in demand for patent license
CN107203492A (en) * 2017-05-31 2017-09-26 西北工业大学 Product design cloud service platform modularization task replanning and distribution optimization method
KR20180096834A (en) * 2017-02-09 2018-08-30 충북대학교 산학협력단 Method and system for predicting optimal environmental condition in manufacturing process
EP3474196A1 (en) * 2017-10-23 2019-04-24 OneSpin Solutions GmbH Method of selecting a prover
CN110147400A (en) * 2019-05-10 2019-08-20 青岛建邦供应链股份有限公司 Inter-trade data resource integrated system based on big data
CN111507507A (en) * 2020-03-24 2020-08-07 重庆森鑫炬科技有限公司 Big data-based monthly water consumption prediction method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180173847A1 (en) * 2016-12-16 2018-06-21 Jang-Jih Lu Establishing a machine learning model for cancer anticipation and a method of detecting cancer by using multiple tumor markers in the machine learning model for cancer anticipation
US20190216368A1 (en) * 2018-01-13 2019-07-18 Chang Gung Memorial Hospital, Linkou Method of predicting daily activities performance of a person with disabilities


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A comparative study of K-nearest-neighbor and logistic regression classification algorithms; 万会芳, 杜彦璞; Journal of Luoyang Institute of Science and Technology (Natural Science Edition); 2016-09-25 (03); pp. 83-86, 93 *
A KNN classification method for unbalanced samples based on SMOTE; 林泳昌, 朱晓姝; Guangxi Sciences; 2020-07-08 (03); pp. 276-283 *
Design of an after-sales service resource planning system based on a regression time-series model; 窦文章, 吕修磊; Statistics & Decision; 2009-07-10 (13); pp. 23-25 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant