CN110442911B

CN110442911B - High-dimensional complex system uncertainty analysis method based on statistical machine learning

Info

Publication number: CN110442911B
Application number: CN201910594968.4A
Authority: CN
Inventors: 付学谦; 贾倩倩
Original assignee: China Agricultural University
Current assignee: China Agricultural University
Priority date: 2019-07-03
Filing date: 2019-07-03
Publication date: 2023-11-14
Anticipated expiration: 2039-07-03
Also published as: CN110442911A

Abstract

The invention discloses a high-dimensional complex system uncertainty analysis method based on statistical machine learning. The method comprises the following steps: selecting uncertainty factors affecting a high-dimensional complex system, and acquiring a high-dimensional random variable input sample matrix; converting the high-dimensional random variable input sample matrix into a low-dimensional random variable sample matrix; evaluating the samples of the high-dimensional random variable input sample matrix one by one to obtain an output response matrix; accurately modeling a random response surface proxy model to obtain a random response surface model that closely approximates the studied high-dimensional complex system; obtaining the mean and variance of the output response of the random response surface model by formula derivation; and analyzing the uncertainty factors according to the mean and the variance to obtain an uncertainty analysis result. The method gives highly accurate results, reduces the amount of computation and improves computational efficiency while preserving accuracy, avoids the curse of dimensionality, and is highly flexible.

Description

High-dimensional complex system uncertainty analysis method based on statistical machine learning

Technical Field

The invention relates to the technical field of dimensionality reduction and statistical machine learning, and in particular to a high-dimensional complex system uncertainty analysis method based on statistical machine learning.

Background

Today, research in many fields faces important practical problems that urgently require accurate modeling, and these problems often involve high-dimensional complex systems, for example: building a large-span bridge, modeling a land hydrologic system, performing remote sensing inversion, optimizing aircraft design, and analyzing an integrated energy system. However, almost all practical systems exhibit some degree of uncertainty and nonlinearity, which makes accurate modeling challenging.

When uncertainty analysis is carried out on a high-dimensional complex system, the traditional method establishes a numerical analysis model whose parameters are set from the random variables of the actual system, and then solves it with a deterministic classical optimization algorithm. The uncertainty of the model output is driven by the uncertainty of the parameters, so quantitative uncertainty analysis can be performed on the system according to the numerical characteristics of the simulation output. However, in some engineering design problems, different design parameters must be tested and optimized many times to determine the optimal parameters. Such problems involve a large number of repeated simulation calculations, and a single simulation with a physical model can take hours or even days, which is computationally expensive and inefficient. Most models are also not explicit, so the original model cannot be solved directly, and high-dimensional problems are both difficult to solve and computationally heavy.

Proxy models are computationally efficient and simple to apply in the uncertainty analysis and optimization of complex systems, so proxy models based on statistical machine learning can be applied in practice. However, when a proxy model processes high-dimensional data, the curse of dimensionality becomes an unavoidable challenge. In the prior art, proxy models become unstable and overfit when processing high-dimensional data, so it is impractical to look for a proxy model that is both robust and tolerant of ultra-high dimensions as a way around the curse of dimensionality. One approach is instead to process the ultra-high-dimensional data set with a dimension reduction algorithm, so that it reaches a dimension the proxy model can tolerate while retaining the characteristics of the original data set.

It must also be considered that the resulting low-dimensional features, while representing the original high-dimensional data, inevitably lose some information. In addition, the proxy model places requirements on the dimension of its input data: too low a dimension leads to insufficient training and a poor training effect, while too high a dimension leads to overfitting and a heavier model burden. The dimension reduction algorithm therefore needs to reduce the dimension to near the intrinsic dimension, to preserve the features of the original data set, and the reduced dimension must also be within the tolerance of the proxy model.

Therefore, in view of the shortcomings of the prior art, a new technical scheme is urgently needed that performs quantitative uncertainty analysis of a complex system more accurately and rapidly while solving the curse of dimensionality caused by high dimension.

Disclosure of Invention

The present invention aims to solve at least one of the technical problems in the related art to some extent.

Therefore, the invention aims to provide a high-dimensional complex system uncertainty analysis method based on statistical machine learning, which gives highly accurate results, reduces the amount of computation and improves computational efficiency while preserving accuracy, avoids the curse of dimensionality, and is highly flexible.

In order to achieve the above objective, the embodiment of the present invention provides a method for analyzing uncertainty of a high-dimensional complex system based on statistical machine learning, comprising the following steps: selecting uncertainty factors affecting a high-dimensional complex system, and acquiring a high-dimensional random variable input sample matrix; converting the high-dimensional random variable input sample matrix into a low-dimensional random variable sample matrix by a dimension reduction method that combines a non-negative matrix factorization dimension reduction algorithm with intrinsic dimension estimation; evaluating the samples of the high-dimensional random variable input sample matrix one by one to obtain an output response matrix; accurately modeling a random response surface proxy model according to the low-dimensional random variable sample matrix and the output response matrix to obtain a random response surface model that closely approximates the studied high-dimensional complex system; obtaining the mean and variance of the output response of the random response surface model by formula derivation; and analyzing the uncertainty factors according to the mean and the variance to obtain an uncertainty analysis result.

In the high-dimensional complex system uncertainty analysis method based on statistical machine learning according to the embodiment of the invention, a proxy model based on statistical machine learning is used to approximate the high-dimensional complex system, and the calculation results are highly accurate. Compared with the traditional method, the uncertainty analysis method based on statistical machine learning reduces the amount of computation while preserving accuracy, shortens computation time and improves computational efficiency. The high-dimensional random variable sample data are not used directly in the proxy model; instead, the dimension of the random variables is effectively reduced by a dimension reduction method, which avoids the curse of dimensionality. Cross-validation is used when modeling the random response surface proxy model, which improves the generalization ability of the model. The statistical characteristics of the random response surface model can be derived directly by formula from the known model parameters, which reduces the amount of computation and improves computational efficiency. The method thereby overcomes the difficulty that existing uncertainty analysis methods have in regressing high-dimensional nonlinear data of complex systems, and meets the need for accurate and efficient quantitative uncertainty analysis of high-dimensional complex systems.

In addition, the high-dimensional complex system uncertainty analysis method based on statistical machine learning according to the embodiment of the invention may further have the following additional technical features:

Further, in one embodiment of the invention, the variance is proportional to the uncertainty.

Further, in one embodiment of the present invention, the selecting the uncertainty factors affecting the high-dimensional complex system and obtaining the high-dimensional random variable input sample matrix includes: collecting real data for each uncertainty factor to obtain the corresponding mean, variance, and correlation coefficients between different uncertainty factors; and simulating the high-dimensional random variable input sample matrix by Latin hypercube sampling according to the corresponding means, variances and correlation coefficients.

Further, in an embodiment of the present invention, the converting the high-dimensional random variable input sample matrix into the low-dimensional random variable sample matrix by a dimension reduction method that combines a non-negative matrix factorization dimension reduction algorithm with intrinsic dimension estimation includes: obtaining the intrinsic dimension of the high-dimensional random variable input sample matrix by combining singular value decomposition, principal component analysis and enumeration; and obtaining, according to the intrinsic dimension, a preset number of low-dimensional random variable sample matrices corresponding to the intrinsic dimension by non-negative matrix factorization.

Further, in an embodiment of the present invention, the accurately modeling the random response surface proxy model according to the low-dimensional random variable sample matrix and the output response matrix to obtain a random response surface model that closely approximates the studied high-dimensional complex system includes: taking each of the preset number of low-dimensional random variable sample matrices as input variables and the deterministic output response matrix as the output variable, and, for each, taking a preset percentage of the samples as the training set and the remaining samples as the test set; performing nonlinear regression on the training-set input and output samples with a second-order and a third-order random response surface model respectively by the least squares method, to obtain multiple groups of undetermined parameters and training errors; selecting the candidate dimension for which the training errors of the second-order and third-order models are smallest as the intrinsic dimension of the high-dimensional random variable input sample matrix, and taking the undetermined parameters of that group as the parameters of the random response surface models; substituting the two groups of test-set input samples corresponding to that intrinsic dimension into the two final models to obtain the generalization error of each model; and selecting the model with the smaller generalization error as the final proxy model.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flow chart of a method for high-dimensional complex system uncertainty analysis based on statistical machine learning according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method of high-dimensional complex system uncertainty analysis based on statistical machine learning according to one embodiment of the invention.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.

The following describes a high-dimensional complex system uncertainty analysis method based on statistical machine learning according to an embodiment of the present invention with reference to the accompanying drawings.

FIG. 1 is a flow chart of a method of high-dimensional complex system uncertainty analysis based on statistical machine learning in accordance with one embodiment of the present invention.

As shown in FIG. 1, in the high-dimensional complex system uncertainty analysis method based on statistical machine learning, the high-dimensional complex system includes a power system, and the uncertainty of the power system can be represented by a probabilistic power flow calculation result. The method comprises the following steps:

In step S101, the uncertainty factors affecting the high-dimensional complex system are selected, and a high-dimensional random variable input sample matrix is acquired.

It will be appreciated that, as shown in FIG. 2, the primary uncertainty factors affecting the high-dimensional complex system are selected, samples of the resulting high-dimensional input variables are obtained, and these samples form the high-dimensional random variable input sample matrix.

Further, in one embodiment of the present invention, selecting the uncertainty factors affecting the high-dimensional complex system and obtaining the high-dimensional random variable input sample matrix includes: collecting real data for each uncertainty factor to obtain the corresponding mean, variance, and correlation coefficients between different uncertainty factors; and simulating the high-dimensional random variable input sample matrix by Latin hypercube sampling according to the corresponding means, variances and correlation coefficients.
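The sampling in step S101 can be illustrated with a minimal Python sketch. The source does not state the marginal distributions or how the correlation is imposed, so the sketch assumes normal marginals and uses a Cholesky factor of the correlation matrix; the function name `latin_hypercube_normal` and all parameter names are hypothetical. Mixing the columns with the Cholesky factor keeps the Latin hypercube stratification only approximately for the later columns.

```python
import numpy as np
from statistics import NormalDist


def latin_hypercube_normal(n_samples, means, variances, corr, seed=0):
    """Correlated normal samples built from a Latin hypercube (illustrative only)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    stds = np.sqrt(np.asarray(variances, dtype=float))
    n_dims = means.size

    # Latin hypercube on [0, 1): one point per equal-probability stratum,
    # with the strata shuffled independently in every dimension.
    strata = np.array([rng.permutation(n_samples) for _ in range(n_dims)]).T
    u = (strata + rng.random((n_samples, n_dims))) / n_samples
    u = np.clip(u, 1e-12, 1.0 - 1e-12)  # keep the inverse CDF finite

    # Map the uniform scores to independent standard normal scores.
    z = np.vectorize(NormalDist().inv_cdf)(u)

    # Assumption: impose the target correlation with a Cholesky factor
    # (requires a positive-definite correlation matrix).
    chol = np.linalg.cholesky(np.asarray(corr, dtype=float))
    z_corr = z @ chol.T

    # Rescale each column to its target mean and standard deviation.
    return means + z_corr * stds
```

Each row of the returned array is one sample and each column is one uncertainty factor, so the array plays the role of the input sample matrix under these assumptions.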

In step S102, the high-dimensional random variable input sample matrix is converted into a low-dimensional random variable sample matrix by a dimension reduction method that combines a non-negative matrix factorization dimension reduction algorithm with intrinsic dimension estimation.

It will be appreciated that, as shown in FIG. 2, the high-dimensional random variable input sample matrix obtained in step S101 is converted into a low-dimensional random variable sample matrix by the dimension reduction method that combines a non-negative matrix factorization dimension reduction algorithm with intrinsic dimension estimation.

Further, in one embodiment of the present invention, converting the high-dimensional random variable input sample matrix into the low-dimensional random variable sample matrix by this dimension reduction method includes: obtaining the intrinsic dimension of the high-dimensional random variable input sample matrix by combining singular value decomposition, principal component analysis and enumeration; and obtaining, according to the intrinsic dimension, a preset number of low-dimensional random variable sample matrices corresponding to the intrinsic dimension by non-negative matrix factorization.

It should be noted that the preset number depends on the enumeration range of the intrinsic dimension; the enumeration proceeds in steps of 1 above and below the estimated intrinsic dimension.

Specifically, step 21: the intrinsic dimension of the high-dimensional random variable input sample matrix from step S101 is obtained by combining singular value decomposition, principal component analysis and enumeration, as follows:

step 211: singular value decomposition is performed on the high-dimensional random variable input sample matrix from step S101:

[equations not reproduced in this text]

step 212: the principal components from step 211 are selected using the principle of principal component analysis:

[equation and symbol definitions not reproduced in this text]

step 213: by enumeration, the values within the enumeration range are taken as the undetermined intrinsic dimensions.

Step 22: according to the intrinsic dimension obtained in step 21, 5 low-dimensional random variable sample matrices corresponding to the intrinsic dimension are obtained by non-negative matrix factorization.
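Because the equations for steps 211 to 213 are missing from this text, the following Python sketch shows only one common way to carry out the same sequence. It makes three assumptions that are not in the source: a cumulative-variance threshold of 0.95 for selecting principal components, an enumeration range of two steps on either side of the estimate (which yields 5 candidates), and Lee-Seung multiplicative updates for the non-negative matrix factorization. All function names are hypothetical.

```python
import numpy as np


def estimate_intrinsic_dim(X, threshold=0.95):
    """Number of principal components needed to reach `threshold` of the variance."""
    Xc = X - X.mean(axis=0)                    # center the columns, as in PCA
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    share = np.cumsum(s**2) / np.sum(s**2)     # cumulative variance share
    return min(int(np.searchsorted(share, threshold)) + 1, len(s))


def candidate_dims(d, n_features, half_width=2):
    """Enumerate d - half_width .. d + half_width in steps of 1, kept within [1, n_features]."""
    return [k for k in range(d - half_width, d + half_width + 1)
            if 1 <= k <= n_features]


def nmf(X, rank, n_iter=200, seed=0, eps=1e-10):
    """Multiplicative-update NMF: X (non-negative) is approximated by W @ H."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update the basis rows
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update the sample coordinates
    return W, H


def reduce_all(X, threshold=0.95):
    """One low-dimensional sample matrix W per candidate intrinsic dimension."""
    d = estimate_intrinsic_dim(X, threshold)
    return {k: nmf(X, k)[0] for k in candidate_dims(d, X.shape[1])}
```

Under these assumptions, `reduce_all` returns a dictionary that maps each candidate dimension to its low-dimensional sample matrix, with one row per original sample. The input must be non-negative for the factorization to apply.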

In step S103, the samples of the high-dimensional random variable input sample matrix are evaluated one by one to obtain an output response matrix.

It will be appreciated that, as shown in FIG. 2, the samples of the high-dimensional random variable input sample matrix obtained in step S101 are evaluated one by one using a conventional method to obtain the output response matrix. Conventional methods may include, among others, experimental measurement and physical modeling.
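As a rough illustration of step S103, the sketch below evaluates a model once per sample of the original high-dimensional matrix (not the reduced one) and stacks the results. The callable `simulate` is a hypothetical placeholder for whatever experimental or physical model is used; it is not an interface defined by the source.

```python
import numpy as np


def evaluate_samples(X_high, simulate):
    """Run the (possibly expensive) model once per row and stack the responses."""
    # Each row of X_high is one high-dimensional input sample.
    responses = [np.atleast_1d(simulate(x)) for x in X_high]
    # One row per sample, one column per output response quantity.
    return np.vstack(responses)
```

This loop is the expensive part of the workflow, which is why the fitted proxy model is used for the subsequent statistics instead of further runs of the original model.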

In step S104, the random response surface proxy model is accurately modeled according to the low-dimensional random variable sample matrix and the output response matrix, so as to obtain a random response surface model that closely approximates the studied high-dimensional complex system.

It will be appreciated that the low-dimensional random variable sample matrix obtained in step S102 is used as the input variable and the deterministic output response matrix obtained in step S103 is used as the output variable; the random response surface proxy model is accurately modeled and its model parameters are calculated, to obtain a random response surface model that closely approximates the studied high-dimensional complex system.

Further, in one embodiment of the present invention, accurately modeling the random response surface proxy model according to the low-dimensional random variable sample matrix and the output response matrix to obtain a random response surface model that closely approximates the studied high-dimensional complex system includes: taking each of the preset number of low-dimensional random variable sample matrices as input variables and the deterministic output response matrix as the output variable, and, for each, taking a preset percentage of the samples as the training set and the remaining samples as the test set; performing nonlinear regression on the training-set input and output samples with a second-order and a third-order random response surface model respectively by the least squares method, to obtain multiple groups of undetermined parameters and training errors; selecting the candidate dimension for which the training errors of the second-order and third-order models are smallest as the intrinsic dimension of the high-dimensional random variable input sample matrix, and taking the undetermined parameters of that group as the parameters of the random response surface models; substituting the two groups of test-set input samples corresponding to that intrinsic dimension into the two final models to obtain the generalization error of each model; and selecting the model with the smaller generalization error as the final proxy model.

Specifically, step 41: the 5 low-dimensional random variable sample matrices obtained in step S102 are each taken as input variables, the deterministic output response matrix obtained in step S103 is taken as the output variable, and for each of them 70% of the samples are used as the training set and 30% as the test set;

step 42: nonlinear regression is performed on the training-set input and output samples from step 41 with second-order and third-order random response surface models respectively by the least squares method, giving 5 groups of undetermined parameters and training errors; the candidate dimension for which the training errors of the second-order and third-order models are smallest is selected as the intrinsic dimension of the high-dimensional random variable input sample matrix from step S101, and the undetermined parameters of the group to which that intrinsic dimension belongs are taken as the parameters of the random response surface models;

step 43: the two groups of test-set input samples corresponding to the intrinsic dimension finally determined in step 42 are substituted into the two final models from step 42, and the resulting response values are compared with the response values from step 41 to obtain the generalization error of each model;

step 44: the model with the smaller generalization error is selected as the final proxy model.
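Steps 41 to 44 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented procedure: plain monomials stand in for the random response surface basis (the source does not give the basis), root-mean-square error is used for both training and generalization error, and "smallest training errors" is read as the smallest sum of the second-order and third-order training errors. The names `poly_features`, `rmse` and `select_surrogate` are hypothetical.

```python
import numpy as np
from itertools import combinations_with_replacement


def poly_features(Z, order):
    """All monomials of the columns of Z up to total degree `order`, plus a constant."""
    n, d = Z.shape
    cols = [np.ones(n)]
    for deg in range(1, order + 1):
        for idx in combinations_with_replacement(range(d), deg):
            cols.append(np.prod(Z[:, list(idx)], axis=1))
    return np.column_stack(cols)


def rmse(A, coef, y):
    """Root-mean-square error of the linear-in-parameters model A @ coef."""
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))


def select_surrogate(Z_by_dim, y, train_frac=0.7, seed=0):
    """Fit order-2 and order-3 models per candidate dimension, then pick one."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    tr, te = perm[:n_train], perm[n_train:]      # 70 % / 30 % split by default

    fits = {}
    for d, Z in Z_by_dim.items():
        fits[d] = {}
        for order in (2, 3):
            A = poly_features(Z[tr], order)
            coef, *_ = np.linalg.lstsq(A, y[tr], rcond=None)   # least squares
            fits[d][order] = (coef, rmse(A, coef, y[tr]))

    # Assumption: choose the dimension with the smallest summed training error.
    best_d = min(fits, key=lambda d: fits[d][2][1] + fits[d][3][1])

    # Generalization error of both final models on the held-out test set.
    gen_err = {
        order: rmse(poly_features(Z_by_dim[best_d][te], order),
                    fits[best_d][order][0], y[te])
        for order in (2, 3)
    }
    best_order = min(gen_err, key=gen_err.get)
    return best_d, best_order, fits[best_d][best_order][0], gen_err
```

Here `Z_by_dim` maps each candidate intrinsic dimension to its low-dimensional sample matrix and `y` is a single output response column. The number of monomials grows quickly with dimension and order, so the training set must contain more samples than there are undetermined parameters.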

In step S105, the mean and variance of the output response of the random response surface model are obtained by formula derivation.

It will be appreciated that the mean and variance of the output response of the random response surface model are obtained by formula derivation, and both are composed of the known polynomial parameters of the model obtained in step S104.

In one embodiment of the invention, the variance is proportional to the uncertainty.

Specifically, the mean and the variance of the output response of the random response surface model from step S104 are obtained by formula derivation, and both derivation results consist of the known polynomial parameters of the model obtained in step S104.
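The formulas themselves are not reproduced in this text. As a hedged illustration of how such a derivation works, assume a second-order response surface written in Hermite polynomials of independent standard normal variables $\xi_1,\dots,\xi_d$ (this basis and the coefficient names $a_0, a_i, a_{ii}, a_{ij}$ are assumptions, not taken from the source):

```latex
\begin{aligned}
y &= a_0 + \sum_{i=1}^{d} a_i\,\xi_i
        + \sum_{i=1}^{d} a_{ii}\,(\xi_i^2 - 1)
        + \sum_{i<j} a_{ij}\,\xi_i\,\xi_j, \\[4pt]
\mathbb{E}[y] &= a_0, \\[4pt]
\operatorname{Var}[y] &= \sum_{i=1}^{d} a_i^2
        + 2\sum_{i=1}^{d} a_{ii}^2
        + \sum_{i<j} a_{ij}^2 .
\end{aligned}
```

The result follows because the basis terms are mutually orthogonal with zero mean, with $\operatorname{Var}[\xi_i]=1$, $\operatorname{Var}[\xi_i^2-1]=2$ and $\operatorname{Var}[\xi_i\xi_j]=1$ for $i\neq j$. Under this assumption both statistics are read directly from the fitted coefficients, with no further sampling of the model.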

In step S106, the uncertainty factor is analyzed according to the mean and the variance, and an uncertainty analysis result is obtained.

It will be appreciated that the uncertainty of the studied high-dimensional complex system is analyzed according to the resulting mean and variance; a larger variance indicates larger fluctuation and stronger uncertainty.

A specific probabilistic power flow calculation carried out with the method according to the embodiment of the invention comprises the following steps:

Step S1: Among meteorological conditions, illumination and temperature affect the output of photovoltaic generation and the load of air conditioners, and thereby indirectly affect the result of the probabilistic power flow calculation. Real data on illumination and temperature are collected to obtain the corresponding means, variances and Pearson correlation coefficients between the different variables, and samples of the high-dimensional input variables for temperature and illumination are simulated by Latin hypercube sampling to obtain the high-dimensional random variable input sample matrix.

Step S2: The intrinsic dimension of the high-dimensional random variable input sample matrix from step S1 is obtained by combining singular value decomposition, principal component analysis and enumeration. Singular value decomposition is first performed on the matrix:

[equations not reproduced in this text]

Step S3: The principal components from step S2 are selected using the principle of principal component analysis:

[equation and symbol definitions not reproduced in this text]

Step S4: By enumeration, the values within the enumeration range are taken as the undetermined intrinsic dimensions.

Step S5: According to the intrinsic dimension obtained in steps S2 to S4, 5 low-dimensional random variable sample matrices corresponding to the intrinsic dimension are obtained by non-negative matrix factorization.

Step S6: using a physical model method, using matpower software to obtain the result in step S101Inputting the high-dimensional random variables into a sample matrix for calculation one by one to obtain +.>Is a probability flow output response matrix.

Step S7: 5 obtained in step S102Respectively as input variables, the low-dimensional random variable sample matrix obtained in step S103 +.>The deterministic output response matrix is used as an output variable, and 70% of samples corresponding to the deterministic output response matrix are used as training sets, and 30% of samples are used as test sets.

Step S8: nonlinear regression is carried out on the input and output samples of the training set in the step S7 by utilizing a least square method in a second-order and third-order random response surface model respectively to obtain 5 groups of undetermined parameters and training errors respectively, and the training errors of the second-order and third-order models are selected to be the smallestThe intrinsic dimension of the sample matrix is input as a high-dimensional random variable in step S101, and the undetermined parameters of the group in which the intrinsic dimension is located are used as corresponding parameters of the random response surface model.

Step S9: the final determination in step S8The two corresponding test set input samples are respectively substituted into the final two models in the step 42, and the obtained response value and the response value in the step 41 are analyzedAnd then, generalizing errors of the two models are respectively obtained.

Step S10: and selecting a model with small generalization error as a final proxy model.

Step S11: obtaining the average value of the output response quantity of the random response surface model in the step S104 by a formula deduction methodSum of variances-> ^[1] Both the derivation results consist of the known polynomial parameters of the model obtained in step S104.

Step S12: and (3) analyzing the uncertainty of the studied high-dimensional complex system according to the mean value and the variance obtained in the step S11, wherein the larger the variance is, the larger the representation fluctuation is, and the stronger the uncertainty is.

In the high-dimensional complex system uncertainty analysis method based on statistical machine learning provided by the embodiment of the invention, a proxy model based on statistical machine learning is used to approximate the high-dimensional complex system, and the calculation results are highly accurate. Compared with the traditional method, the uncertainty analysis method based on statistical machine learning reduces the amount of computation while preserving accuracy, shortens computation time and improves computational efficiency. The high-dimensional random variable sample data are not used directly in the proxy model; instead, the dimension of the random variables is effectively reduced by a dimension reduction method, which avoids the curse of dimensionality. Cross-validation is used when modeling the random response surface proxy model, which improves the generalization ability of the model. The statistical characteristics of the random response surface model can be derived directly by formula from the known model parameters, which reduces the amount of computation and improves computational efficiency. The method thereby overcomes the difficulty that existing uncertainty analysis methods have in regressing high-dimensional nonlinear data of complex systems, and meets the need for accurate and efficient quantitative uncertainty analysis of high-dimensional complex systems.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.

In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be joined and combined by those skilled in the art without contradiction.

While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims

1. A high-dimensional complex system uncertainty analysis method based on statistical machine learning is applied to a power system and is characterized by comprising the following steps:

selecting uncertainty factors affecting a high-dimensional complex system, the uncertainty factors being illumination and temperature; collecting real data for each uncertainty factor to obtain the corresponding mean, variance, and correlation coefficients between different uncertainty factors; and simulating a high-dimensional random variable input sample matrix by Latin hypercube sampling according to the corresponding means, variances and correlation coefficients;

converting the high-dimensional random variable input sample matrix into a low-dimensional random variable sample matrix by a dimension reduction method that combines a non-negative matrix factorization dimension reduction algorithm with intrinsic dimension estimation;

evaluating the samples of the high-dimensional random variable input sample matrix one by one to obtain an output response matrix;

accurately modeling a random response surface proxy model according to the low-dimensional random variable sample matrix and the output response matrix to obtain a random response surface model that closely approximates the studied high-dimensional complex system;

obtaining the mean value and variance of the output response quantity of the random response surface model by a formula deduction method; and

and analyzing the uncertainty factors according to the mean and the variance to obtain an uncertainty analysis result.

2. The method of claim 1, wherein the variance is proportional to the uncertainty.

3. The method of claim 1, wherein said converting the high-dimensional random variable input sample matrix into a low-dimensional random variable sample matrix by a dimension reduction method that combines a non-negative matrix factorization dimension reduction algorithm with intrinsic dimension estimation comprises:

obtaining the intrinsic dimension of the high-dimensional random variable input sample matrix by combining singular value decomposition, principal component analysis and enumeration;

and obtaining, according to the intrinsic dimension, a preset number of low-dimensional random variable sample matrices corresponding to the intrinsic dimension by non-negative matrix factorization.

4. A method according to claim 3, wherein said accurately modeling a random response surface proxy model according to said low-dimensional random variable sample matrix and said output response matrix to obtain a random response surface model that closely approximates the studied high-dimensional complex system comprises:

taking each of the preset number of low-dimensional random variable sample matrices as input variables and the deterministic output response matrix as the output variable, and, for each, taking a preset percentage of the samples as the training set and the remaining samples as the test set;

performing nonlinear regression on the training-set input and output samples with a second-order and a third-order random response surface model respectively by the least squares method, to obtain multiple groups of undetermined parameters and training errors; selecting the candidate dimension for which the training errors of the second-order and third-order models are smallest as the intrinsic dimension of the high-dimensional random variable input sample matrix, and taking the undetermined parameters of that group as the parameters of the random response surface models;

substituting two groups of test set input samples corresponding to the intrinsic dimensions of the high-dimensional random variable input sample matrix into the final two models respectively to obtain generalization errors of the two models respectively;

and selecting the model with small generalization error as a final proxy model.