CN113190956A - Regression modeling method for big data of manufacturing industry - Google Patents
- Publication number
- CN113190956A (application CN202110295478.1A)
- Authority
- CN
- China
- Prior art keywords
- latent
- data
- variables
- variable
- regression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F30/20: Computer-aided design [CAD]; design optimisation, verification or simulation
- G06F17/16: Complex mathematical operations; matrix or vector computation, e.g. matrix factorization
- G06F17/18: Complex mathematical operations for evaluating statistical data, e.g. regression analysis
- G06Q50/04: ICT specially adapted for business processes of specific sectors; manufacturing
- G06F2111/10: Details relating to CAD techniques; numerical modelling
- Y02P90/30: Climate change mitigation technologies in production; computing systems specially adapted for manufacturing
Abstract
The invention discloses a regression modeling method for manufacturing big data, comprising the following steps: S1, obtaining low-dimensional features suitable for modeling through data preprocessing; S2, converting the low-dimensional data of different business domains into latent-variable form; S3, establishing regression equations among the latent variables through partial least squares regression analysis, computing the latent variables according to the maximum covariance between them, determining the number of latent variables using the predicted residual sum of squares, and thereby realizing simultaneous regression analysis of multiple dependent variables on multiple independent variables; and S4, establishing a quadratic polynomial regression equation between the latent variables to obtain the standard regression coefficient β of each independent variable's effect on each dependent variable, and hence the predicted value for a single business. By building a latent structure model across business domains, the invention mines the influence relations among data from different business domains and links the heterogeneous data of multiple domains, so that single-business modeling performs better and the quality and efficiency of the business are improved.
Description
Technical Field
The invention relates to the technical field of big data analysis and modeling, and in particular to a regression modeling method for manufacturing big data.
Background
Manufacturing is one of the pillar industries of the national economy and an embodiment of a country's modernization and comprehensive national strength. With the continued development of the economy and of science and technology, the volume of data generated by modern manufacturing grows exponentially, so the potential and value of big data have gradually been recognized and accepted by society; the combination of big data and manufacturing is driving a comprehensive transformation of design, management, manufacturing and service modes in the industry. However, such manufacturing data are usually multi-source, heterogeneous and complex, which is one of the main problems a manufacturing enterprise faces when modeling its big data.
Existing big data models target only a single business of a manufacturing enterprise. They do not consider the correlations among businesses, neglect the influence of businesses such as design, management and service on the manufacturing process, and establish no association between the manufacturing business and the other businesses. As a result, the data across the enterprise's businesses are not fully utilized, and the whole manufacturing flow cannot be strictly controlled and reasonably planned.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art and to provide a regression modeling method for manufacturing big data.
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
A regression modeling method for manufacturing big data, in which the influence relations among the data of different business domains are mined by establishing a latent structure model across the business domains, and the heterogeneous data of multiple business domains are linked; the method specifically comprises the following steps:
S1, performing dimensionality reduction and denoising on the high-dimensional data of different business domains through data preprocessing to obtain low-dimensional features suitable for modeling;
S2, converting the low-dimensional data of the different business domains into latent-variable form;
S3, establishing regression equations among the latent variables through partial least squares regression analysis: the weight coefficients are determined by maximizing the covariance between the latent variables (i.e., maximizing their degree of correlation), the latent variables are computed accordingly, and their number is determined using the predicted residual sum of squares, thereby realizing simultaneous regression analysis of multiple dependent variables on multiple independent variables; and
S4, establishing a quadratic polynomial regression equation between the latent variables to obtain the standard regression coefficient β of each independent variable's effect on each dependent variable, and hence the predicted value for a single business.
Further, in step S1, principal component analysis is used to establish a linear mapping projecting from the high-dimensional space to the low-dimensional space, so as to obtain the projection matrix W; the specific process is as follows:
let Y = [y_1, y_2, ..., y_N] denote the high-dimensional data to be reduced and X = [x_1, x_2, ..., x_N] the low-dimensional data after dimensionality reduction, where N is the number of samples; assume the data noise η_n ∈ R^D follows an independent Gaussian distribution η_n ~ N(0, β⁻¹I), where β⁻¹ is the noise variance and I an identity matrix; the generative mapping between the low-dimensional and the high-dimensional space is expressed as
y_n = W·x_n + η_n,  (1)
where the mapping is determined by the projection matrix W; the likelihood of the high-dimensional data space is then
p(y_n | x_n, W, β) = N(y_n | W x_n, β⁻¹I);  (2)
assuming the data points in the low-dimensional space are independent and identically distributed,
p(x_n) = N(x_n | 0, I);  (3)
integrating out the low-dimensional data points yields the marginal likelihood
p(y_n | W, β) = ∫ p(y_n | x_n, W, β) p(x_n) dx_n = N(y_n | 0, WW^T + β⁻¹I)  (4)
(the data being assumed mean-centered), and the joint likelihood of the high-dimensional data is
p(Y | W, β) = ∏_{n=1}^{N} p(y_n | W, β);  (5)
the projection matrix W is obtained by the maximum likelihood method;
from the obtained projection matrix W, a low-dimensional representation X of the high-dimensional data Y is recovered via formula (1).
Further, in step S1, when obtaining the projection matrix W, the maximum likelihood estimates of the parameters are obtained with the EM algorithm, which specifically comprises the following steps:
(i) Compute the expectation of the complete-data log-likelihood of the data. The complete-data likelihood function is
p(Y, X; W, β) = ∏_{n=1}^{N} p(y_n | x_n, W, β) p(x_n);  (6)
denoting its logarithm by ln p(Y; W), the log-likelihood can be expressed as
ln p(Y; W) = Σ_{n=1}^{N} [ln p(y_n | x_n, W, β) + ln p(x_n)].  (7)
The expected value E{ln p(Y; W)} of ln p(Y; W) is obtained from
E{ln p(Y; W)} = −Σ_{n=1}^{N} [ (D/2) ln(2πβ⁻¹) + (1/2) tr(⟨x_n x_n^T⟩) + (β/2)‖y_n − μ‖² − β⟨x_n⟩^T W^T (y_n − μ) + (β/2) tr(W^T W ⟨x_n x_n^T⟩) ],  (8)
where ⟨·⟩ denotes the posterior expectation over the latent variables, μ denotes the mean of the high-dimensional data Y, D denotes the data dimension, tr(·) denotes the trace, and
⟨x_n⟩ = M⁻¹ W^T (y_n − μ),  (9)
⟨x_n x_n^T⟩ = β⁻¹ M⁻¹ + ⟨x_n⟩⟨x_n⟩^T,
where M = W^T W + β⁻¹ I;
(ii) Maximize the expected value E{ln p(Y; W)} with respect to the projection matrix W, i.e., set the derivative of E{ln p(Y; W)} with respect to W to zero, yielding the updated optimum
W̃ = [Σ_{n=1}^{N} (y_n − μ)⟨x_n⟩^T][Σ_{n=1}^{N} ⟨x_n x_n^T⟩]⁻¹  (10)
and the updated noise variance
β̃⁻¹ = (1/(ND)) Σ_{n=1}^{N} [‖y_n − μ‖² − 2⟨x_n⟩^T W̃^T (y_n − μ) + tr(⟨x_n x_n^T⟩ W̃^T W̃)];  (11)
(iii) Alternate (i) and (ii) until convergence, judged by the difference of E{ln p(Y; W)} between any two consecutive iterations:
‖E{ln p(Y; W)}_{t+1} − E{ln p(Y; W)}_t‖ ≤ ε;  (12)
when this inequality holds, E{ln p(Y; W)} is considered to have reached an extreme point, and the projection matrix W is obtained.
Further, in step S2, the low-dimensional representation X = [x_1, x_2, ..., x_N] of each business domain's data constitutes the data set of the latent structure model, which consists of two parts: an explanatory-variable space denoted X_{n×m} and a response-variable space denoted Y_{n×k}, where n denotes the number of samples and m and k denote the numbers of variables;
the latent variables t_j and u_j (j = 1, 2, ..., A) are computed from t_j = X_j w_j and u_j = Y_j q_j, where A is the number of latent variables and w_j and q_j are the weight vectors that maximize the covariance of t_j and u_j, i.e., that maximize the degree of correlation between t_j and u_j; they satisfy
(w_j, q_j) = arg max Cov(t_j, u_j) = arg max w^T X_j^T Y_j q,  (13)
subject to ‖w_j‖ = ‖q_j‖ = 1.  (14)
further, in the step S3, the objective is to obtain the quantitative relationship between the multiple explanatory variables and the multiple reaction variables by the partial least squares regression analysis, that is, in the explanatory variable space Xn×mAnd reaction variable space Yn×kSeparately looking for linear combinations tjAnd uj(j ═ 1, 2.., a), and maximizes the covariance of the two variable spaces;
the specific process is as follows:
(1) at a latent variable tjAnd ujEstablishing a regression equation:
uj=bjtj+ej (15)
wherein e isjIs an error vector, bjIs an unknown parameter, and bjCan be calculated by the following formula:
carrying out estimation; is provided with
For the predicted values of uj, the matrices X and Y are decomposed into the following outer product form:
in the formula, E and F are residual errors of matrixes X and Y after the latent variables A are extracted respectively;
(2) in partial least squaresIn the analysis process, each pair of latent variables tjAnd uj(j 1, 2.., a) are extracted in turn in an iterative process, then the extracted residual is calculated, and the analysis of the residual of each step is continued until the logarithm of the extraction latent variable is determined according to some criterion.
Further, in step S3, the prediction residual square sum PRESS is used to determine the number of latent variable logarithms to be extracted, i.e. the predicted estimated values of the response variables after 1 sample point is removed are calculated separately in each stepAnd the sum of the squared residuals of the actual observations y:
in the above formula, l is the number of dependent variables until PRESS (j) -PRESS (j-1) is less than the preset precision, the iteration process is ended, otherwise, latent variables are continuously extracted for iterative computation.
Further, in step S4, using the A latent variables obtained from the partial least squares regression analysis, the following quadratic polynomial regression model is established:
y = β₀ + Σ_{i=1}^{A} β_i x_i + Σ_{i=1}^{A} β_ii x_i² + Σ_{i<j} β_ij x_i x_j,  (20)
where β₀, β_i, β_ii and β_ij are all regression coefficients, the inputs x_i are taken from the latent variables t_j and the response y from the latent variables u_j;
from the obtained latent variables and their number, and with reference to the PRESS statistic, the standard regression coefficient β of each independent variable's effect on each dependent variable is obtained.
Compared with the prior art, the principles and advantages of this scheme are as follows:
1. The proposed latent structure model simplifies the data structure while modeling. By analyzing the mutual influence among different businesses in the partial least squares regression analysis, a regression model of multiple dependent variables on multiple independent variables is obtained, which is more effective than regressing the dependent variables one by one, yields more reliable conclusions, and has stronger integrity.
2. The scheme finally obtains the regression coefficients between a single business and the other businesses, mines the influence relations among originally independent business-domain data, and produces a predicted value for the single business. This breaks through the limitation of modeling on a single business domain's data, fully exploits the data value of every business domain, improves the single-business modeling effect, and helps improve the quality and efficiency of the business.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for the embodiments or the prior-art descriptions are briefly introduced below. The drawings in the following description show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a regression modeling method for manufacturing big data according to the present invention;
FIG. 2 is a schematic flow chart of data preprocessing in the regression modeling method for big data in manufacturing industry according to the present invention;
FIG. 3 is a schematic flow chart of latent structure modeling in the regression modeling method for big data in manufacturing industry according to the present invention.
Detailed Description
The invention will be further illustrated with reference to specific examples:
In this regression modeling method for manufacturing big data, the influence relations among the data of different business domains are mined by establishing a latent structure model across the business domains, and the heterogeneous data of multiple business domains are linked.
as shown in fig. 1, the method specifically comprises the following steps:
s1, performing dimensionality reduction and denoising on high-dimensional data of different service domains through data preprocessing to obtain low-dimensional features suitable for modeling;
In this step, principal component analysis is used to establish a linear mapping projecting from the high-dimensional space to the low-dimensional space, yielding the projection matrix W;
as shown in fig. 2, the specific process is as follows:
Let Y = [y_1, y_2, ..., y_N] denote the high-dimensional data to be reduced and X = [x_1, x_2, ..., x_N] the low-dimensional data after dimensionality reduction, where N is the number of samples; assume the data noise η_n ∈ R^D follows an independent Gaussian distribution η_n ~ N(0, β⁻¹I), where β⁻¹ is the noise variance and I an identity matrix; the generative mapping between the low-dimensional and the high-dimensional space is expressed as
y_n = W·x_n + η_n,  (1)
where the mapping is determined by the projection matrix W; the likelihood of the high-dimensional data space is then
p(y_n | x_n, W, β) = N(y_n | W x_n, β⁻¹I);  (2)
assuming the data points in the low-dimensional space are independent and identically distributed,
p(x_n) = N(x_n | 0, I);  (3)
integrating out the low-dimensional data points yields the marginal likelihood
p(y_n | W, β) = ∫ p(y_n | x_n, W, β) p(x_n) dx_n = N(y_n | 0, WW^T + β⁻¹I)  (4)
(the data being assumed mean-centered), and the joint likelihood of the high-dimensional data is
p(Y | W, β) = ∏_{n=1}^{N} p(y_n | W, β);  (5)
the projection matrix W is obtained by the maximum likelihood method.
In the present embodiment, the EM (Expectation-Maximization) algorithm is used to obtain the maximum likelihood estimates of the parameters, comprising the following steps:
(i) Compute the expectation of the complete-data log-likelihood of the data. The complete-data likelihood function is
p(Y, X; W, β) = ∏_{n=1}^{N} p(y_n | x_n, W, β) p(x_n);  (6)
denoting its logarithm by ln p(Y; W), the log-likelihood can be expressed as
ln p(Y; W) = Σ_{n=1}^{N} [ln p(y_n | x_n, W, β) + ln p(x_n)].  (7)
The expected value E{ln p(Y; W)} of ln p(Y; W) is obtained from
E{ln p(Y; W)} = −Σ_{n=1}^{N} [ (D/2) ln(2πβ⁻¹) + (1/2) tr(⟨x_n x_n^T⟩) + (β/2)‖y_n − μ‖² − β⟨x_n⟩^T W^T (y_n − μ) + (β/2) tr(W^T W ⟨x_n x_n^T⟩) ],  (8)
where ⟨·⟩ denotes the posterior expectation over the latent variables, μ denotes the mean of the high-dimensional data Y, D denotes the data dimension, tr(·) denotes the trace, and
⟨x_n⟩ = M⁻¹ W^T (y_n − μ),  (9)
⟨x_n x_n^T⟩ = β⁻¹ M⁻¹ + ⟨x_n⟩⟨x_n⟩^T,
where M = W^T W + β⁻¹ I;
(ii) Maximize the expected value E{ln p(Y; W)} with respect to the projection matrix W, i.e., set the derivative of E{ln p(Y; W)} with respect to W to zero, yielding the updated optimum
W̃ = [Σ_{n=1}^{N} (y_n − μ)⟨x_n⟩^T][Σ_{n=1}^{N} ⟨x_n x_n^T⟩]⁻¹  (10)
and the updated noise variance
β̃⁻¹ = (1/(ND)) Σ_{n=1}^{N} [‖y_n − μ‖² − 2⟨x_n⟩^T W̃^T (y_n − μ) + tr(⟨x_n x_n^T⟩ W̃^T W̃)];  (11)
(iii) Alternate (i) and (ii) until convergence, judged by the difference of E{ln p(Y; W)} between any two consecutive iterations:
‖E{ln p(Y; W)}_{t+1} − E{ln p(Y; W)}_t‖ ≤ ε;  (12)
when this inequality holds, E{ln p(Y; W)} is considered to have reached an extreme point, and the projection matrix W is obtained.
From the obtained projection matrix W, a low-dimensional representation X of the high-dimensional data Y is obtained by formula (1).
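As a concrete illustration of steps (i)-(iii), the EM iteration for this probabilistic PCA model can be sketched in a few lines of numpy. This is a minimal sketch under the stated Gaussian assumptions, not the patented implementation; the function name `ppca_em`, the random initialization, and the convergence test on the noise variance are choices made for the example (`sigma2` plays the role of the noise variance β⁻¹):

```python
import numpy as np

def ppca_em(Y, d, n_iter=200, tol=1e-8):
    """EM for probabilistic PCA: Y (N, D) -> projection W (D, d), mean mu,
    noise variance sigma2 (= beta^-1), and latent representation X (N, d)."""
    N, D = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu
    rng = np.random.default_rng(0)
    W = rng.normal(size=(D, d))        # illustrative random start
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments <x_n> and sum of <x_n x_n^T>, cf. eq. (9)
        M = W.T @ W + sigma2 * np.eye(d)
        Minv = np.linalg.inv(M)
        X = Yc @ W @ Minv                   # rows are <x_n>
        Sxx = N * sigma2 * Minv + X.T @ X   # sum_n <x_n x_n^T>
        # M-step: updates corresponding to eqs. (10)-(11)
        W_new = Yc.T @ X @ np.linalg.inv(Sxx)
        sigma2_new = (np.sum(Yc**2)
                      - 2.0 * np.sum((Yc @ W_new) * X)
                      + np.trace(Sxx @ W_new.T @ W_new)) / (N * D)
        converged = abs(sigma2_new - sigma2) < tol
        W, sigma2 = W_new, sigma2_new
        if converged:
            break
    # final low-dimensional representation via the posterior mean
    X = Yc @ W @ np.linalg.inv(W.T @ W + sigma2 * np.eye(d))
    return W, mu, sigma2, X
```

Reconstructing `mu + X @ W.T` then recovers the data up to roughly the noise level, which is how the low-dimensional representation of step S1 would feed step S2.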
S2, converting low-dimensional data of different service domains into latent variable forms;
In this step, the low-dimensional representation X = [x_1, x_2, ..., x_N] of each business domain's data constitutes the data set of the latent structure model, which consists of two parts: an explanatory-variable space denoted X_{n×m} and a response-variable space denoted Y_{n×k}, where n denotes the number of samples and m and k denote the numbers of variables;
the latent variables t_j and u_j (j = 1, 2, ..., A) are computed from t_j = X_j w_j and u_j = Y_j q_j, where A is the number of latent variables and w_j and q_j are the weight vectors that maximize the covariance of t_j and u_j, i.e., that maximize the degree of correlation between t_j and u_j; they satisfy
(w_j, q_j) = arg max Cov(t_j, u_j) = arg max w^T X_j^T Y_j q,  (13)
subject to ‖w_j‖ = ‖q_j‖ = 1.  (14)
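The maximal-covariance condition on the weight vectors above has a closed-form solution: the optimal unit-norm w and q are the leading left and right singular vectors of XᵀY. A small numpy sketch of this standard identity (the function name is illustrative):

```python
import numpy as np

def first_latent_pair(X, Y):
    """Return scores t = Xw, u = Yq and unit-norm weights (w, q)
    maximizing w^T X^T Y q, i.e. the covariance of the latent pair."""
    U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    w = U[:, 0]    # leading left singular vector
    q = Vt[0, :]   # leading right singular vector
    return X @ w, Y @ q, w, q
```

By construction, tᵀu equals the largest singular value of XᵀY, so no other unit-norm weight pair can achieve a larger covariance.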
S3, establishing regression equations among the latent variables through partial least squares regression analysis: the weight coefficients are determined by maximizing the covariance between the latent variables (i.e., maximizing their degree of correlation), the latent variables are computed accordingly, and their number is determined using the predicted residual sum of squares, thereby realizing simultaneous regression analysis of multiple dependent variables on multiple independent variables, as shown in fig. 3.
This step is based on partial least squares regression analysis, whose objective is to obtain the quantitative relationship between multiple explanatory variables and multiple response variables, i.e., to find linear combinations t_j and u_j (j = 1, 2, ..., A) in the explanatory-variable space X_{n×m} and the response-variable space Y_{n×k} respectively, such that the covariance of the two variable spaces is maximized;
the specific process is as follows:
(1) establish a regression equation between the latent variables t_j and u_j:
u_j = b_j t_j + e_j,  (15)
where e_j is an error vector and b_j is an unknown parameter estimated by least squares as
b_j = (t_j^T t_j)⁻¹ t_j^T u_j;  (16)
let û_j = b_j t_j denote the predicted value of u_j; the matrices X and Y are decomposed into the following outer-product form:
X = Σ_{j=1}^{A} t_j p_j^T + E,  (17)
Y = Σ_{j=1}^{A} u_j q_j^T + F,  (18)
where p_j and q_j are loading vectors and E and F are the residuals of the matrices X and Y after the A latent variables have been extracted;
(2) in the partial least squares regression analysis, each pair of latent variables t_j and u_j (j = 1, 2, ..., A) is extracted in turn in an iterative process, the residuals after extraction are computed, and the residual analysis of each step continues until the number of latent-variable pairs to extract is determined according to some criterion.
As mentioned above, the predicted residual sum of squares PRESS (Predicted Residual Sum of Squares) is used to determine the number of latent-variable pairs to extract: at each step, removing one sample point at a time, the predicted estimate ŷ_{ih(−i)} of the response variables is computed separately, and its residual sum of squares against the actual observations y is
PRESS(j) = Σ_{i=1}^{n} Σ_{h=1}^{l} (y_{ih} − ŷ_{ih(−i)})²,  (19)
where l is the number of dependent variables; the iteration ends when PRESS(j) − PRESS(j−1) is smaller than a preset precision, otherwise latent variables continue to be extracted for iterative computation.
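The PRESS criterion can be checked with a compact leave-one-out loop around any PLS fit. The sketch below uses SVD-based maximal-covariance weights instead of the NIPALS iteration (an equivalent standard formulation); `fit_pls`, `press`, and the pre-centering assumption are all choices made for this example:

```python
import numpy as np

def fit_pls(X, Y, a):
    """Fit an a-component PLS model on centered X, Y; return a predictor."""
    Xc, Yc = X.copy(), Y.copy()
    Ws, Ps, Qs = [], [], []
    for _ in range(a):
        # weight = dominant left singular vector of Xc^T Yc (max covariance)
        u_svd, _, _ = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
        w = u_svd[:, [0]]
        t = Xc @ w
        p = Xc.T @ t / float(t.T @ t)   # X loadings
        q = Yc.T @ t / float(t.T @ t)   # Y loadings
        Xc = Xc - t @ p.T               # deflate both blocks
        Yc = Yc - t @ q.T
        Ws.append(w); Ps.append(p); Qs.append(q)
    W, P, Q = (np.hstack(M) for M in (Ws, Ps, Qs))
    Bcoef = W @ np.linalg.inv(P.T @ W) @ Q.T   # regression coefficients
    return lambda Xnew: Xnew @ Bcoef

def press(X, Y, a):
    """Leave-one-out predicted residual sum of squares, cf. eq. (19)."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        predict = fit_pls(X[keep], Y[keep], a)
        total += float(np.sum((Y[i] - predict(X[i:i + 1])) ** 2))
    return total
```

Components would then be added one at a time until PRESS stops improving by more than the preset precision.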
And S4, finally, establishing a quadratic polynomial regression equation among the latent variables to obtain the standard regression coefficient β of each independent variable's effect on each dependent variable, and hence the predicted value for a single business.
The method specifically comprises the following steps:
According to the partial least squares regression analysis, the A latent variables are used to establish the following quadratic polynomial regression model:
y = β₀ + Σ_{i=1}^{A} β_i x_i + Σ_{i=1}^{A} β_ii x_i² + Σ_{i<j} β_ij x_i x_j,  (20)
where β₀, β_i, β_ii and β_ij are all regression coefficients, the inputs x_i are taken from the latent variables t_j and the response y from the latent variables u_j;
from the obtained latent variables and their number, and with reference to the PRESS statistic, the standard regression coefficient β of each independent variable's effect on each dependent variable is obtained.
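The quadratic polynomial model of step S4 is an ordinary least squares problem once a design matrix of constant, linear, squared and cross terms is built from the latent scores. A hedged sketch (`quadratic_design` and `fit_quadratic` are illustrative names, not from the patent):

```python
import numpy as np

def quadratic_design(T):
    """Design matrix with columns 1, t_i, t_i^2, t_i*t_j (i < j)
    built from a latent-score matrix T of shape (n, A)."""
    n, A = T.shape
    cols = [np.ones(n)]
    cols += [T[:, i] for i in range(A)]           # linear terms
    cols += [T[:, i] ** 2 for i in range(A)]      # squared terms
    cols += [T[:, i] * T[:, j]                    # cross terms
             for i in range(A) for j in range(i + 1, A)]
    return np.column_stack(cols)

def fit_quadratic(T, u):
    """Least-squares estimate of the coefficients beta in eq. (20)."""
    Z = quadratic_design(T)
    beta, *_ = np.linalg.lstsq(Z, u, rcond=None)
    return beta
```

The entries of `beta` are, in order, β₀, the β_i, the β_ii, and the β_ij of the quadratic model.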
The embodiments described above are merely preferred embodiments of the invention, and the scope of the invention is not limited to them; variations based on the form and principle of the invention shall fall within its scope of protection.
Claims (7)
1. A regression modeling method for manufacturing big data, characterized in that the influence relations among the data of different business domains are mined by establishing a latent structure model across the business domains, and the heterogeneous data of multiple business domains are linked; the method specifically comprises the following steps:
S1, performing dimensionality reduction and denoising on the high-dimensional data of different business domains through data preprocessing to obtain low-dimensional features suitable for modeling;
S2, converting the low-dimensional data of the different business domains into latent-variable form;
S3, establishing regression equations among the latent variables through partial least squares regression analysis: the weight coefficients are determined by maximizing the covariance between the latent variables (i.e., maximizing their degree of correlation), the latent variables are computed accordingly, and their number is determined using the predicted residual sum of squares, thereby realizing simultaneous regression analysis of multiple dependent variables on multiple independent variables; and
S4, establishing a quadratic polynomial regression equation between the latent variables to obtain the standard regression coefficient β of each independent variable's effect on each dependent variable, and hence the predicted value for a single business.
2. The regression modeling method for manufacturing big data according to claim 1, wherein in step S1 principal component analysis is used to establish a linear mapping projecting from the high-dimensional space to the low-dimensional space, so as to obtain the projection matrix W; the specific process is as follows:
let Y = [y_1, y_2, ..., y_N] denote the high-dimensional data to be reduced and X = [x_1, x_2, ..., x_N] the low-dimensional data after dimensionality reduction, where N is the number of samples; assume the data noise η_n ∈ R^D follows an independent Gaussian distribution η_n ~ N(0, β⁻¹I), where β⁻¹ is the noise variance and I an identity matrix; the generative mapping between the low-dimensional and the high-dimensional space is expressed as
y_n = W·x_n + η_n,  (1)
where the mapping is determined by the projection matrix W; the likelihood of the high-dimensional data space is then
p(y_n | x_n, W, β) = N(y_n | W x_n, β⁻¹I);  (2)
assuming the data points in the low-dimensional space are independent and identically distributed,
p(x_n) = N(x_n | 0, I);  (3)
integrating out the low-dimensional data points yields the marginal likelihood
p(y_n | W, β) = ∫ p(y_n | x_n, W, β) p(x_n) dx_n = N(y_n | 0, WW^T + β⁻¹I)  (4)
(the data being assumed mean-centered), and the joint likelihood of the high-dimensional data is
p(Y | W, β) = ∏_{n=1}^{N} p(y_n | W, β);  (5)
the projection matrix W is obtained by the maximum likelihood method;
from the obtained projection matrix W, a low-dimensional representation X of the high-dimensional data Y is recovered via formula (1).
3. The regression modeling method for manufacturing big data according to claim 2, wherein in step S1, when obtaining the projection matrix W, the maximum likelihood estimates of the parameters are obtained with the EM algorithm, specifically comprising the following steps:
(i) Compute the expectation of the complete-data log-likelihood of the data. The complete-data likelihood function is
p(Y, X; W, β) = ∏_{n=1}^{N} p(y_n | x_n, W, β) p(x_n);  (6)
denoting its logarithm by ln p(Y; W), the log-likelihood can be expressed as
ln p(Y; W) = Σ_{n=1}^{N} [ln p(y_n | x_n, W, β) + ln p(x_n)].  (7)
The expected value E{ln p(Y; W)} of ln p(Y; W) is obtained from
E{ln p(Y; W)} = −Σ_{n=1}^{N} [ (D/2) ln(2πβ⁻¹) + (1/2) tr(⟨x_n x_n^T⟩) + (β/2)‖y_n − μ‖² − β⟨x_n⟩^T W^T (y_n − μ) + (β/2) tr(W^T W ⟨x_n x_n^T⟩) ],  (8)
where ⟨·⟩ denotes the posterior expectation over the latent variables, μ denotes the mean of the high-dimensional data Y, D denotes the data dimension, tr(·) denotes the trace, and
⟨x_n⟩ = M⁻¹ W^T (y_n − μ),  (9)
⟨x_n x_n^T⟩ = β⁻¹ M⁻¹ + ⟨x_n⟩⟨x_n⟩^T,
where M = W^T W + β⁻¹ I;
(ii) Maximize the expected value E{ln p(Y; W)} with respect to the projection matrix W, i.e., set the derivative of E{ln p(Y; W)} with respect to W to zero, yielding the updated optimum
W̃ = [Σ_{n=1}^{N} (y_n − μ)⟨x_n⟩^T][Σ_{n=1}^{N} ⟨x_n x_n^T⟩]⁻¹  (10)
and the updated noise variance
β̃⁻¹ = (1/(ND)) Σ_{n=1}^{N} [‖y_n − μ‖² − 2⟨x_n⟩^T W̃^T (y_n − μ) + tr(⟨x_n x_n^T⟩ W̃^T W̃)];  (11)
(iii) Alternate (i) and (ii) until convergence, judged by the difference of E{ln p(Y; W)} between any two consecutive iterations:
‖E{ln p(Y; W)}_{t+1} − E{ln p(Y; W)}_t‖ ≤ ε;  (12)
when this inequality holds, E{ln p(Y; W)} is considered to have reached an extreme point, and the projection matrix W is obtained.
4. The regression modeling method for manufacturing big data according to claim 1, wherein in step S2 the low-dimensional representation X = [x_1, x_2, ..., x_N] of each business domain's data constitutes the data set of the latent structure model, which consists of two parts: an explanatory-variable space denoted X_{n×m} and a response-variable space denoted Y_{n×k}, where n denotes the number of samples and m and k denote the numbers of variables;
the latent variables t_j and u_j (j = 1, 2, ..., A) are computed from t_j = X_j w_j and u_j = Y_j q_j, where A is the number of latent variables and w_j and q_j are the weight vectors that maximize the covariance of t_j and u_j, i.e., that maximize the degree of correlation between t_j and u_j; they satisfy
(w_j, q_j) = arg max Cov(t_j, u_j) = arg max w^T X_j^T Y_j q,  (13)
subject to ‖w_j‖ = ‖q_j‖ = 1.  (14)
5. the regression modeling method for manufacturing industry big data as claimed in claim 4, wherein in said step S3, the objective is to obtain the quantitative relationship between multiple explanatory variables and multiple reaction variables by partial least squares regression analysis, that is, in the explanatory variable space Xn×mAnd reaction variable space Yn×kSeparately looking for linear combinations tjAnd uj(j ═ 1, 2.., a), and maximizes the covariance of the two variable spaces;
the specific process is as follows:
(1) at a latent variable tjAnd ujEstablishing a regression equation:
uj=bjtj+ej (15)
wherein e isjIs an error vector, bjIs an unknown parameter, and bjCan be calculated by the following formula:
carrying out estimation; is provided with
Is ujThe matrices X and Y are decomposed into the following outer product form:
where E and F are the residuals of the matrices X and Y, respectively, after the A latent variables have been extracted;
(2) In the partial least squares regression analysis, the pairs of latent variables t_j and u_j (j = 1, 2, ..., A) are extracted in turn in an iterative process; the residual after each extraction is then calculated, and the residual analysis of each step continues until the number of latent-variable pairs to extract is determined according to some criterion.
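The extract-then-deflate iteration described above can be sketched as follows. This is a standard SVD/NIPALS-style implementation assumed for illustration, not code from the patent; it realizes the outer product decomposition X = TPᵀ + E, and deflates Y through the inner relation u_j = b_j t_j + e_j of formula (15):

```python
import numpy as np

def pls_extract(X, Y, A):
    """Extract A pairs of latent variables (t_j, u_j) in turn, deflating
    the residual matrices E and F after each step."""
    E, F = X.copy(), Y.copy()
    T, U = [], []
    for _ in range(A):
        Uw, s, Vt = np.linalg.svd(E.T @ F, full_matrices=False)
        w, q = Uw[:, 0], Vt[0, :]        # weight vectors for this pair
        t, u = E @ w, F @ q              # latent variables t_j, u_j
        p = E.T @ t / (t @ t)            # X-loading vector
        b = (t @ u) / (t @ t)            # inner regression u_j = b_j t_j + e_j
        E = E - np.outer(t, p)           # deflate X residual
        F = F - b * np.outer(t, q)       # deflate Y residual
        T.append(t); U.append(u)
    return np.column_stack(T), np.column_stack(U), E, F

# Simulated, mean-centered data (illustrative assumption)
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4)); X -= X.mean(axis=0)
Y = rng.standard_normal((30, 2)); Y -= Y.mean(axis=0)
T, U, E, F = pls_extract(X, Y, 2)
```

Deflating E with the loading p makes successive score vectors t_j mutually orthogonal, and each Y-deflation can only shrink the Frobenius norm of the residual F.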
6. The method of claim 5, wherein in step S3, the number of pairs of latent variables to be extracted is determined using the prediction residual sum of squares (PRESS), i.e., at each step the predicted estimate of the response variable after removing 1 sample point and the sum of squared residuals against the actual observations y are calculated separately:
In the above formula, l is the number of dependent variables. When PRESS(j) − PRESS(j−1) is less than the preset precision, the iterative process ends; otherwise, latent variables continue to be extracted for iterative computation.
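The leave-one-out PRESS computation can be sketched as below. The ordinary least-squares model used here is an illustrative stand-in for the patent's latent-variable regression; in practice one would evaluate PRESS(j) for successive numbers of latent-variable pairs and stop once PRESS(j) − PRESS(j−1) falls below the preset precision:

```python
import numpy as np

def press(fit, predict, X, y):
    """Leave-one-out PRESS: remove each sample point in turn, refit the
    model, predict the removed point, and accumulate the squared
    residual against the actual observation."""
    n = len(y)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        params = fit(X[mask], y[mask])
        total += float((y[i] - predict(params, X[i])) ** 2)
    return total

# Illustrative model: ordinary least squares on a single predictor.
ols_fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
ols_predict = lambda beta, x: x @ beta

# Exactly linear toy data, so the LOO predictions are exact and PRESS ~ 0.
X = np.arange(1.0, 6.0).reshape(-1, 1)
y = 2.0 * X[:, 0]
press_val = press(ols_fit, ols_predict, X, y)
```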
7. The regression modeling method for manufacturing industry big data according to claim 6, wherein in step S4, the following quadratic polynomial regression model is established using the A latent variables obtained from the partial least squares regression analysis:
where β_0, β_i, β_ii and β_ij are all regression coefficients, x ∈ t_j, y ∈ u_j, and t_j and u_j are latent variables;
According to the obtained latent variables and their number, and with reference to the PRESS statistic, the standardized regression coefficient β of the effect of each independent variable on each dependent variable is obtained.
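A quadratic polynomial model of this form (intercept, linear terms β_i, squared terms β_ii, and pairwise interactions β_ij in the latent variables) can be fitted by least squares as sketched below; the construction is an illustrative assumption, not the patent's implementation:

```python
import numpy as np
from itertools import combinations

def quadratic_design(T):
    """Design matrix of the quadratic polynomial model in the latent
    scores T: intercept, linear, squared, and interaction columns."""
    n, A = T.shape
    cols = [np.ones(n)]                                   # beta_0
    cols += [T[:, i] for i in range(A)]                   # beta_i terms
    cols += [T[:, i] ** 2 for i in range(A)]              # beta_ii terms
    cols += [T[:, i] * T[:, j]
             for i, j in combinations(range(A), 2)]       # beta_ij terms
    return np.column_stack(cols)

def fit_quadratic(T, y):
    """Least-squares estimates of the regression coefficients."""
    Z = quadratic_design(T)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

# Toy latent scores and a response generated by a known quadratic model
# (illustrative assumption), so the fit should reproduce y exactly.
rng = np.random.default_rng(2)
T = rng.standard_normal((20, 2))
y = 1 + 2 * T[:, 0] + 3 * T[:, 0] ** 2 + 4 * T[:, 0] * T[:, 1]
beta = fit_quadratic(T, y)
```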
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110295478.1A CN113190956B (en) | 2021-03-19 | 2021-03-19 | Regression modeling method for big data of manufacturing industry |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113190956A true CN113190956A (en) | 2021-07-30 |
CN113190956B CN113190956B (en) | 2022-11-22 |
Family
ID=76973537
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110295478.1A Active CN113190956B (en) | 2021-03-19 | 2021-03-19 | Regression modeling method for big data of manufacturing industry |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113190956B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104123451A (en) * | 2014-07-16 | 2014-10-29 | 河海大学常州校区 | Dredging operation yield prediction model building method based on partial least squares regression |
CN104949936A (en) * | 2015-07-13 | 2015-09-30 | 东北大学 | Sample component determination method based on optimizing partial least squares regression model |
CN108197380A (en) * | 2017-12-29 | 2018-06-22 | 南京林业大学 | Gauss based on offset minimum binary returns soft-measuring modeling method |
CN109492265A (en) * | 2018-10-18 | 2019-03-19 | 南京林业大学 | The kinematic nonlinearity PLS soft-measuring modeling method returned based on Gaussian process |
US20200364386A1 (en) * | 2019-05-14 | 2020-11-19 | Beijing University Of Technology | Soft sensing method and system for difficult-to-measure parameters in complex industrial processes |
Non-Patent Citations (2)
Title |
---|
FU LINGHUI et al., "A Comparative Study of Modeling Methods for Polynomial Regression", 《数理统计与管理》 (Journal of Applied Statistics and Management) * |
GUO JIANXIAO, "Research on Improved High-Dimensional Nonlinear PLS Regression Methods and Their Applications", 《中国博士学位论文全文数据库 (经济与管理科学辑)》 (China Doctoral Dissertations Full-text Database, Economics and Management Sciences) * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116137630A (en) * | 2023-04-19 | 2023-05-19 | 井芯微电子技术(天津)有限公司 | Method and device for quantitatively processing network service demands |
CN116137630B (en) * | 2023-04-19 | 2023-08-18 | 井芯微电子技术(天津)有限公司 | Method and device for quantitatively processing network service demands |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Soofi et al. | Information distinguishability with application to analysis of failure data | |
CN113190956B (en) | Regression modeling method for big data of manufacturing industry | |
Atamanyuk et al. | Forecasting economic indices of agricultural enterprises based on vector polynomial canonical expansion of random sequences | |
CN117060401A (en) | New energy power prediction method, device, equipment and computer readable storage medium | |
CN111898653A (en) | Based on robustness l1,2Norm constrained supervised dimension reduction method | |
CN111144650A (en) | Power load prediction method, device, computer readable storage medium and equipment | |
Koukoumis et al. | On entropy-type measures and divergences with applications in engineering, management and applied sciences | |
CN116187563A (en) | Sea surface temperature space-time intelligent prediction method based on fusion improvement variation modal decomposition | |
CN113657045B (en) | Complex aircraft model reduced order characterization method based on multilayer collaborative Gaussian process | |
CN113139247B (en) | Mechanical structure uncertainty parameter quantification and correlation analysis method | |
CN115102868A (en) | Web service QoS prediction method based on SOM clustering and depth self-encoder | |
Beyaztas et al. | A robust partial least squares approach for function-on-function regression | |
Sledge et al. | An information-theoretic approach for automatically determining the number of state groups when aggregating markov chains | |
Wang et al. | Autonf: Automated architecture optimization of normalizing flows with unconstrained continuous relaxation admitting optimal discrete solution | |
Akgül et al. | Estimation of the location and the scale parameters of Burr Type XII distribution | |
CN112231933B (en) | Feature selection method for radar electromagnetic interference effect analysis | |
CN113822342B (en) | Document classification method and system for security graph convolution network | |
Anavangot et al. | A novel approximate Lloyd-Max quantizer and its analysis | |
Meng et al. | Penalized quasi-likelihood estimation of generalized Pareto regression–consistent identification of risk factors for extreme losses | |
CN113242425B (en) | Optimal distribution method of sampling set for small disturbance band-limited map signal | |
CN115936136A (en) | Data recovery method and system based on low-rank structure | |
CN115174421B (en) | Network fault prediction method and device based on self-supervision unwrapping hypergraph attention | |
Shim et al. | Prediction intervals for LS-SVM regression using the bootstrap | |
De Vito et al. | Unsupervised parameter selection for denoising with the elastic net | |
CN116432759A (en) | Judicial causal Bayesian network construction method based on hierarchical additive noise model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||