CN115019061A - Entropy optimization method based on deep neural network information entropy estimation - Google Patents

Entropy optimization method based on deep neural network information entropy estimation

Info

Publication number
CN115019061A
CN115019061A
Authority
CN
China
Prior art keywords
entropy
neural network
deep neural
layer
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210924688.7A
Other languages
Chinese (zh)
Inventor
张新钰
张世焱
李骏
杨昊波
杨卓异
吴新刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202210924688.7A
Publication of CN115019061A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses an entropy optimization method based on deep neural network information entropy estimation, comprising the following steps: step 1) modeling the input data and output data of the deep neural network to be optimized based on communication theory to obtain expectations and constraints on information entropy, the deep neural network comprising n network layers, the n-th network layer being the output layer; step 2) establishing a probability model of the training process of the deep neural network according to the structure of each network layer; step 3) calculating the information entropy output by each layer of the deep neural network during training using a K-nearest-neighbor entropy estimation method; and step 4) establishing a loss function on the information entropy according to the expectations and constraints, to guide the training process and optimization direction of the deep neural network. The invention improves the interpretability of the deep neural network training process, makes the training process more transparent, and allows it to be evaluated quantitatively.

Description

Entropy optimization method based on deep neural network information entropy estimation
Technical Field
The invention belongs to the technical field of deep neural network optimization and interpretability, and particularly relates to an entropy optimization method based on deep neural network information entropy estimation.
Background
With the development of artificial intelligence, deep learning algorithms have shown outstanding performance across industries and application scenarios, from the multilayer perceptron (MLP), convolutional neural network (CNN) and recurrent neural network (RNN) to the many structural modifications, improvements and optimizations of recent years. Yet while deep neural networks have proven powerful, they remain poorly interpretable: the network behaves like a black box, and there is still no way to quantitatively observe how it trains, or to quantitatively guide and optimize its training process.
Deep learning algorithms are widely used in the rapidly developing field of autonomous driving, particularly for environment perception by autonomous vehicles. Environment perception today mostly relies on information and data from multiple sensors (such as point cloud data captured by a lidar and RGB image data from a vehicle-mounted camera) to perform tasks such as object detection. After the sensor information is acquired, features must be extracted from data of different types and structures so that subsequent tasks can use those features. Deep neural networks play an important role in this feature extraction. However, whether the network's feature extraction process is reasonable, and whether the extracted features are effective, are difficult questions to answer.
When current deep neural networks perform feature extraction on multi-modal data: (1) because neural networks are poorly interpretable and the feature extraction process is opaque, optimizing the training process is very difficult; (2) feature extraction from multi-modal data is critical for the subsequent feature fusion step, yet the effectiveness and rationality of features extracted by a deep neural network are hard to evaluate quantitatively.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an entropy optimization method based on deep neural network information entropy estimation.
In order to achieve the above object, the present invention provides an entropy optimization method based on deep neural network information entropy estimation, the method including:
step 1) modeling the input data and output data of the deep neural network to be optimized based on communication theory to obtain expectations and constraints on information entropy; the deep neural network comprises n network layers, and the n-th network layer is the output layer;
step 2) establishing a probability model for the training process of the deep neural network according to each layer of network structure of the deep neural network;
step 3) calculating the information entropy output by each layer of the deep neural network during training using a K-nearest-neighbor entropy estimation method;
and 4) establishing a loss function of the information entropy according to the expectation and the constraint of the information entropy, and guiding the training process and the optimization direction of the deep neural network.
As an improvement of the above method, the information entropy expectations and constraints of step 1) include: in each round of training, the entropy of the output layer of the deep neural network decreases; and the information entropy of each trained layer's output equals the information entropy of the deep neural network's input.
As an improvement of the above method, the probabilistic model of step 2) includes:
for a deep neural network with n layers in total, the output of each network layer is taken as a multi-dimensional continuous random variable X; the i-th channel of each layer is taken as a sampling sample x_i of X; the number of pixels d in each channel is the dimension of x_i; and each layer has m sampling samples.
As an improvement of the above method, the K-nearest-neighbor entropy estimation method of step 3) includes:
calculating the sphere neighborhood radius ε_k(x_i) of each sampling sample x_i, where ε_k(x_i) is the Euclidean distance between the d-dimensional sample x_i and its k-th nearest sample point, V_d is the volume of the d-dimensional unit sphere, and ψ(·) is the Digamma function;
calculating a boundary correction term for the entropy estimate, where V(x_i) denotes the volume of the neighborhood sphere of radius ε_k(x_i) around the sample x_i and B(X) denotes the boundary constraint of the random variable X;
and obtaining the information entropy H(X) output by each network layer from these quantities, where ψ(·) is the Digamma function, ψ(1) = -γ, γ is the Euler-Mascheroni constant, and ψ(m) ≈ ln(m-1) (≈ denotes approximately equal to).
As a modification of the above method, the step 3) includes:
step 3-1) traversing the n multi-dimensional continuous random variables X;
step 3-2) for each X, traversing each sampling sample x_i of the network layer concerned and determining the ellipsoid neighborhood of each sample x_i; sorting the radii of the d-dimensional ellipsoid of each sampling sample x_i from large to small to obtain its ellipsoid correction term; and combining the information entropy H(X) output by each network layer to obtain a modified entropy estimate.
As a modification of the above method, the step 3-2) includes:
step 3-2-1) selecting the k sample points closest to x_i, performing PCA (principal component analysis) on the k+1 sample points including x_i itself, i.e. computing the covariance matrix of the d-dimensional random variable from the k+1 sample points and computing the d eigenvectors of that covariance matrix;
step 3-2-2) taking the directions of the d eigenvectors as the axes of a d-dimensional ellipsoid, searching among the selected k+1 samples for the sample point farthest along each eigenvector direction, and taking that point's distance in that direction as the ellipsoid radius on that axis, thereby determining the ellipsoid neighborhood of the sampling sample x_i;
step 3-2-3) sorting the radii of the d-dimensional ellipsoid from large to small as r_1(x_i) ≥ r_2(x_i) ≥ ... ≥ r_d(x_i) and obtaining therefrom the ellipsoid correction term of the sample x_i; and, from the information entropy H(X) of each network layer, obtaining the modified entropy estimate.
As a modification of the above method, the step 4) includes:
designing a loss function Loss_1, where H_0 is the information entropy of the original input data, H_j is the information entropy output by the j-th layer of the deep neural network, and n is the number of network layers;
according to the expectation that the entropy of the deep neural network's output layer decreases in each round of training, when the information entropy output after the q-th round of training is larger than the information entropy output after the (q-1)-th round, adding a loss function Loss_2, where H_q and H_{q-1} are respectively the information entropy output after the q-th and (q-1)-th rounds of training;
and taking Loss_1 and Loss_2 as auxiliary terms which, combined with the cross-entropy loss of the network, form the loss function for training the deep neural network.
Compared with the prior art, the invention has the advantages that:
1. the invention improves the interpretability of the deep neural network training process, makes the training process more transparent, and allows it to be evaluated quantitatively;
2. the invention establishes expectations on the network's information entropy, adds an entropy-based loss term, and better guides the network's gradient descent;
3. the invention adds an information-entropy check on the network training result, better ensuring its validity and rationality;
4. the proposed method and idea are not limited to a single deep learning task or a single neural network, are not restricted to a particular network structure, and can be applied to various deep neural networks.
Drawings
FIG. 1 is a flow chart of an entropy optimization method based on deep neural network information entropy estimation according to the present invention;
FIG. 2 is a neural network probabilistic model of the present invention;
FIG. 3 is a block diagram of the entropy optimizer of the present invention.
Detailed Description
Before describing the embodiments of the present invention, the relevant terms are first explained as follows:
Information entropy: the average amount of information carried by the network output. It is the expectation of the information content of the output; its magnitude measures the amount of information the output carries, and also the average uncertainty and complexity of the output.
Differential entropy: the generalization of information entropy from discrete random variables to continuous random variables.
Source coding: a transformation of source symbols for the purpose of increasing communication efficiency, or of reducing or eliminating source redundancy.
Point cloud: a set of discrete three-dimensional points, obtained by a lidar or similar device, describing the surface profile of an object in space; each point contains (x, y, z) coordinate information.
RGB image: three-channel image data collected by a camera.
Convolutional neural network: a type of feedforward neural network that contains convolution computations and has a deep structure.
The invention aims to segment and interpret the deep neural network layer by layer, quantitatively evaluate the training process and final result, guide the gradient descent process and the selection of network structure hyper-parameters, prevent overfitting, and approach the model's performance boundary.
In the multi-modal data feature extraction scenario, the invention quantitatively explains the neural network training process by computing entropy and uses it to guide the direction of training and optimization. The work comprises two aspects. The first is probabilistic and communication-theoretic modeling of the deep neural network model and of the multi-modal feature extraction process, from which expectations about the model's information entropy are obtained. The second is calculating and estimating the information entropy during network training and, through entropy-related loss functions based on those expectations, using it to guide the direction of neural network training and optimization.
The invention provides a universal entropy optimization method based on deep neural network information entropy estimation, which comprises the following steps:
1. Model the input and output data of the deep neural network based on communication theory to obtain the corresponding expectations and constraints on information entropy.
2. Establish a probability model of the deep neural network's computation and training process according to the structure of each network layer.
3. Calculate the information entropy output by each network layer during training and the information entropy of the final network output.
4. Establish an entropy loss function according to the expectations and constraints on entropy, and use it to guide the training process and optimization direction.
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, embodiment 1 of the present invention provides an entropy optimization method based on deep neural network information entropy estimation, which quantitatively explains the neural network training process by computing entropy, thereby guiding the direction of neural network training and optimization, and applies the method in a multi-modal data feature extraction scenario.
The specific implementation steps are as follows:
to summarize: when the deep neural network is used for feature extraction of multi-modal data, the interpretability of the network is poor, and the feature extraction process is not transparent. And (3) carrying out segmentation and interpretation on the network layer by adopting an entropy optimization method, quantitatively evaluating the process and the final result of feature extraction, guiding the gradient descent process of the network and the selection of the super-parameters of the network structure, preventing overfitting and approaching the performance boundary of the model.
Step 1) obtaining expectation and constraint on information entropy.
A neural network has a strong learning capability: through continuous iteration and parameter adjustment during training, it learns the complex nonlinear relation between the input layer and the desired output, thereby meeting the requirements of different tasks. The training process can therefore be regarded as continuously searching, under constraints, for the connection between the input data and the desired output. Before the network is trained, the "connection" it may learn is uncertain, i.e. its final output is uncertain; as the network is trained on data, the "connection" between inputs and outputs is continuously strengthened and the certainty of the network output keeps increasing.
To quantify and measure this uncertainty of the neural network, the concept of information entropy is introduced to represent the average amount of information in the network output. It is the expectation of the information content of the output; its magnitude measures the amount of information the output carries, and also the average uncertainty and complexity of the output. The final output of an untrained network therefore has the greatest uncertainty, i.e. the greatest information entropy. As training proceeds and the "connection" between input and output is learned, the uncertainty of the final output keeps decreasing, i.e. the information entropy decreases. This yields the first expectation about entropy during neural network training: the entropy of the network's output layer decreases as training proceeds.
In the multi-modal feature extraction task, the network is expected to extract abstract features that form structured data representing all the information of the original input, removing the redundant part of the original data. By analogy with a communication model, the concept of source coding is introduced, and feature extraction from multi-modal data is regarded as a source coding process. To ensure that the network can extract all of the information in the data, the source coding should be lossless, i.e. the original input of the network and the output of the feature extraction should carry the same amount of information. This yields the second expectation of feature extraction on information entropy: the output of the network layer has the same information entropy as the original input.
Step 2) establishing a neural network probability model
As shown in fig. 2, computing the entropy-optimization loss function from the expectations on entropy requires estimating the information entropy of every layer's output in the feature extraction network. Finding a way to compute the entropy of each layer's output requires probabilistic modeling of the neural network, and the modeling should apply to general neural networks. In the multi-modal feature extraction task, the multi-modal inputs are an image and a point cloud: the point cloud is projected into a two-dimensional front view (FV) and bird's eye view (BEV), and the image and the point cloud views are each fed into a convolutional network for feature extraction. The following settings are made for the convolutional neural network:
the channel generated by convolution of the image by the convolution check is regarded as a multi-dimensional continuous random variable X, each channel { X1, X2.., xi } is regarded as a sample (i is the number of channels) of the random variable X, and the number of pixels of each channel is the dimension d of the continuous random variable X. The proposed probabilistic modeling method for the neural network can be applied to other deep neural networks in an extended way, namely, the output of each layer of the network is regarded as a continuous random variable, and the actual output of the layer is taken as the sampling of the continuous random variable. (taking MLP as an example, in an MLP network, a continuous random variable X is the output of each hidden layer, and each neuron sees a sample of the random variable X (i.e., the random variable X is a continuous random variable of 1 dimension).
Step 3) calculating the information entropy of the network layer and the final output
Through this probabilistic modeling of the neural network, the entropy estimation problem for each layer becomes: estimate the entropy from samples without knowing the probability distribution of the continuous random variable X.
There are many ways to solve this problem; essentially it amounts to computing the differential entropy of a continuous random variable. Differential entropy (also known as continuous entropy) is a concept in information theory that originates from Shannon's attempt to extend Shannon entropy to continuous probability distributions. Let X be a random variable with probability density function f whose support is the set 𝒳. The differential entropy h(X) is defined as:
h(X) = -∫_𝒳 f(x) ln f(x) dx
Since the probability distribution of the random variable is not known in advance, its probability density function is unknown; only a finite number of sample values are available. The information entropy is therefore computed with a K-nearest-neighbor entropy estimation method.
The "K-near entropy estimation" method is described below:
A continuous variable is discretized by sampling. So that the m sampled points approximately represent the entire sample space, each sample point is extended into a d-dimensional hyper-sphere whose radius is the distance from that sample point to its nearest sample point. If the variable were distributed completely uniformly over the sample space, the probability of each sample point could be approximated as 1/m. Since the distribution of the random variable over the sample space is unknown and may differ greatly from uniform, it is corrected using the spatial distribution of the samples: where a sample point is closer to its nearest neighbor, the variable tends to be more densely distributed in that region; conversely, where a sample point is farther from its nearest neighbor, the variable tends to be more sparsely distributed there. The density or sparsity of the variable in different regions of the sample space directly affects the probability density near each sample point. The discrete probability of each sample point is estimated as:
p̂(x_i) ≈ 1 / ((m-1) · V_d · r_d(x_i)^d)
where m is the number of samples, r_d(x_i) is the d-dimensional Euclidean distance from sample x_i to its nearest sample point, and V_d is the volume of the unit sphere in d-dimensional space.
The entropy of the random variable X is then estimated as:
Ĥ(X) = (d/m) Σ_{i=1}^{m} ln r_d(x_i) + ln V_d + ln(m-1) + γ
where γ is the Euler-Mascheroni constant, approximately 0.5772.
The K-nearest-neighbor entropy estimation method extends the distance from each sample point to its nearest sample point to the distance to its k-th nearest sample point, and the estimate of the entropy of the random variable X becomes:
Ĥ(X) = -ψ(k) + ψ(m) + ln V_d + (d/m) Σ_{i=1}^{m} ln ε_k(x_i)
where ψ(·) is the Digamma function, ψ(1) = -γ, ψ(m) ≈ ln(m-1), and ε_k(x_i) is the d-dimensional Euclidean distance from the sample x_i to its k-th nearest sample point. It can be shown that when k = 1 this estimate is equivalent to the preceding one.
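For concreteness, the following sketch implements the classical Kozachenko-Leonenko k-nearest-neighbor estimator that the formula above describes, using the stated quantities (m samples, dimension d, unit-ball volume V_d, Digamma function ψ). It is an illustrative baseline, not the patent's corrected estimator, and the Gaussian sanity check is an assumed usage example.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import digamma, gammaln

def knn_entropy(samples: np.ndarray, k: int = 3) -> float:
    """Kozachenko-Leonenko k-nearest-neighbor estimate of differential
    entropy (in nats) from an (m, d) array of sampling samples x_i."""
    m, d = samples.shape
    dist = cdist(samples, samples)                 # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)                 # ignore each point's distance to itself
    eps_k = np.sort(dist, axis=1)[:, k - 1]        # distance to the k-th nearest neighbor
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log volume of the unit d-ball
    return float(digamma(m) - digamma(k) + log_vd + d * np.mean(np.log(eps_k)))

# Sanity check against a known closed form (assumed usage, not from the patent):
# a 1-D standard Gaussian has differential entropy 0.5 * ln(2*pi*e) ≈ 1.419 nats.
rng = np.random.default_rng(0)
print(knn_entropy(rng.normal(size=(2000, 1)), k=3))  # typically close to 1.419
```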
K-nearest-neighbor entropy estimation works well when the dimension of the variable is low, but the estimation bias grows as the dimension increases, mainly for two reasons:
1. if the distribution of the random variable is limited by a boundary, the neighborhoods of samples near the boundary may extend beyond it (i.e. the range covered by a sample's neighborhood is larger than the actual range of the variable), so the entropy is overestimated (the sketch after this list illustrates how much of a neighborhood can fall outside the boundary);
2. k-nearest-neighbor estimation assumes the probability distribution is uniform within each sample's neighborhood; as the dimension increases, this assumption may be increasingly violated, producing larger errors.
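To illustrate the first issue only (the patent's actual boundary correction formula is not reproduced in this text), the hedged sketch below estimates by Monte Carlo how much of a sample's sphere neighborhood V(x_i) actually lies inside a box-shaped boundary constraint B(X); the box form, the function name, and the sample counts are assumptions.

```python
import numpy as np

def fraction_inside_box(center, radius, low, high, n_draws=10000, rng=None):
    """Monte Carlo estimate of the fraction of a d-dimensional ball
    (center, radius) lying inside the box [low, high]^d.

    A value below 1 means the sphere neighborhood spills over the boundary
    constraint, which is the situation a boundary correction must account for.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = center.shape[0]
    # Uniform points in the ball: random directions times radii scaled by U^(1/d).
    dirs = rng.normal(size=(n_draws, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = radius * rng.random(n_draws) ** (1.0 / d)
    pts = center + dirs * radii[:, None]
    inside = np.all((pts >= low) & (pts <= high), axis=1)
    return float(inside.mean())
```

For example, fraction_inside_box(np.zeros(2), 1.0, 0.0, 1.0) returns roughly 0.25, since only a quarter of a unit disc centered at a corner of the unit square lies inside it.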
The improvements and corrections for the above deviations are as follows:
1. Let V(x_i) denote the neighborhood sphere volume of the sample x_i, and let B(X) denote the boundary constraint of the random variable X. The entropy estimate is modified accordingly, where d is the dimension of the samples, m is the total number of samples per layer, ε_k(x_i) is the Euclidean distance between the d-dimensional sample x_i and its k-th nearest sample point, V_d is the volume of the d-dimensional unit sphere, ψ(·) is the Digamma function, c(x_i) is the boundary correction term of the entropy estimate, V(x_i) is the volume of the neighborhood sphere of radius ε_k(x_i) around the sample x_i, and H(X) is the information entropy output by each network layer.
2. The sphere neighborhood of conventional k-nearest-neighbor estimation is replaced by an ellipsoid neighborhood for each sample point. Taking the ellipsoid neighborhood of a sample point x_i as an example: first, the k sample points closest to x_i are selected, and PCA is applied to the k+1 sample points including x_i itself, i.e. the covariance matrix of the d-dimensional random variable is computed from the k+1 sample points and its d eigenvectors are computed. The directions of the d eigenvectors are taken as the axes of a d-dimensional ellipsoid; among the selected k+1 samples, the sample point farthest along each eigenvector direction is found, and its distance in that direction is taken as the ellipsoid radius on that axis. This determines the ellipsoid neighborhood of x_i (a sketch of this construction is given at the end of this step). The radii of the d-dimensional ellipsoid, sorted from large to small, are r_1(x_i), r_2(x_i), ..., r_d(x_i), from which the correction term for the sample x_i is obtained.
Adding the correction terms for these two sources of error gives the corrected entropy estimate Ĥ'(X).
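The sketch below follows steps 3-2-1) and 3-2-2) for the ellipsoid-neighborhood construction. How the sorted radii enter the correction term follows the patent's formula, which is not reproduced in this text, so only the geometric part is shown; interpreting "distance in that direction" as the absolute projection measured from x_i is an assumption.

```python
import numpy as np

def ellipsoid_radii(samples: np.ndarray, i: int, k: int = 3) -> np.ndarray:
    """Ellipsoid-neighborhood radii of sample x_i, sorted from large to small.

    PCA over x_i and its k nearest neighbors gives the axes; the largest
    absolute projection of those points (relative to x_i) on each axis
    gives the radius on that axis.
    """
    dist = np.linalg.norm(samples - samples[i], axis=1)
    neighbors = np.argsort(dist)[: k + 1]          # x_i itself plus its k nearest points
    pts = samples[neighbors]
    cov = np.atleast_2d(np.cov(pts, rowvar=False)) # covariance of the k+1 points
    _, eigvecs = np.linalg.eigh(cov)               # d eigenvectors = ellipsoid axes
    rel = pts - samples[i]                         # offsets of the k+1 points from x_i
    radii = np.abs(rel @ eigvecs).max(axis=0)      # largest |projection| on each axis
    return np.sort(radii)[::-1]                    # r_1(x_i) >= ... >= r_d(x_i)
```

Note that if k+1 is smaller than d the covariance matrix is rank-deficient and some radii come out as zero, so in high dimension k has to be chosen large enough for the construction to be meaningful.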
step 4) establishing a loss function of entropy
According to the two expectations and the entropy estimation calculation result of the network layer, a loss function of model training can be designed to optimize the feature extraction process. According to the final output and input information entropy unchanged, design loss function
Figure 142509DEST_PATH_IMAGE064
Figure 346089DEST_PATH_IMAGE065
Wherein
Figure 486083DEST_PATH_IMAGE066
As the information entropy of the original input data,
Figure DEST_PATH_IMAGE067
and n is the number of network layers.
According to the decreasing of the information entropy output by the network in the training process, the judgment of the first time
Figure 846526DEST_PATH_IMAGE068
The entropy of the information output after the secondary training is larger than that of the information output after the secondary training
Figure 529311DEST_PATH_IMAGE069
Then, the loss function is increased
Figure 28426DEST_PATH_IMAGE070
Figure 780350DEST_PATH_IMAGE071
Wherein the content of the first and second substances,
Figure 757533DEST_PATH_IMAGE072
is as followsqThe information entropy output after the secondary training;
Loss_1 and Loss_2 are used as auxiliary terms which, combined with the loss formed by the network's other constraints (such as the cross-entropy loss), constitute the loss function for training the whole neural network.
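Purely as an illustration of how such terms could be assembled, the sketch below combines a cross-entropy task loss with the two entropy-based auxiliary terms. The absolute-difference form of Loss_1, the hinge form of Loss_2, and the weights alpha and beta are assumptions, not the patent's formulas.

```python
import torch
import torch.nn.functional as F

def entropy_auxiliary_losses(h_input, h_layers, h_prev_round):
    """Hedged sketch of the two entropy-based auxiliary terms.

    h_input:      information entropy of the original input data H_0
    h_layers:     per-layer output entropies [H_1, ..., H_n] for this round
    h_prev_round: output-layer entropy after the previous training round
                  (None on the first round)
    """
    # Loss_1: keep each layer's output entropy close to the input entropy
    # (lossless source-coding expectation); absolute difference is an assumption.
    loss1 = sum(abs(h_j - h_input) for h_j in h_layers) / len(h_layers)

    # Loss_2: penalize the output-layer entropy rising between rounds
    # (entropy-decrease expectation); the hinge form is an assumption.
    h_out = h_layers[-1]
    loss2 = max(h_out - h_prev_round, 0.0) if h_prev_round is not None else 0.0
    return loss1, loss2

def total_loss(logits, targets, h_input, h_layers, h_prev_round,
               alpha=0.1, beta=0.1):
    """Cross-entropy task loss plus the entropy auxiliary terms (weights assumed)."""
    loss1, loss2 = entropy_auxiliary_losses(h_input, h_layers, h_prev_round)
    return F.cross_entropy(logits, targets) + alpha * loss1 + beta * loss2
```

For these auxiliary terms to influence gradients in practice, the per-layer entropy estimates would have to be computed from the layer outputs with differentiable operations; the sketch only shows how the terms are combined.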
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. An entropy optimization method based on deep neural network information entropy estimation is characterized by comprising the following steps:
step 1) modeling the input data and output data of the deep neural network to be optimized based on communication theory to obtain expectations and constraints on information entropy; the deep neural network comprises n network layers, and the n-th network layer is the output layer;
step 2) establishing a probability model for the training process of the deep neural network according to each layer of network structure of the deep neural network;
step 3) calculating the information entropy output by each layer of the deep neural network during training using a K-nearest-neighbor entropy estimation method;
and 4) establishing a loss function of the information entropy according to the expectation and the constraint of the information entropy, and guiding the training process and the optimization direction of the deep neural network.
2. The entropy optimization method based on deep neural network information entropy estimation according to claim 1, wherein the information entropy expectations and constraints of step 1) include: in each round of training, the entropy of the output layer of the deep neural network decreases; and the information entropy of each trained layer's output equals the information entropy of the deep neural network's input.
3. The entropy optimization method based on deep neural network information entropy estimation according to claim 2, wherein the probability model of step 2) comprises: for a deep neural network with n layers in total, taking the output of each network layer as a multi-dimensional continuous random variable X; taking the i-th channel of each layer as a sampling sample x_i of X; the number of pixels d of each channel being the dimension of x_i, and each layer having m sampling samples.
4. The entropy optimization method based on deep neural network information entropy estimation according to claim 3, wherein the K-nearest-neighbor entropy estimation method of step 3) comprises:
calculating the sphere neighborhood radius ε_k(x_i) of each sampling sample x_i, where ε_k(x_i) is the Euclidean distance between the d-dimensional sample x_i and its k-th nearest sample point, V_d is the volume of the d-dimensional unit sphere, and ψ(·) is the Digamma function;
calculating a boundary correction term for the entropy estimate, where V(x_i) denotes the volume of the neighborhood sphere of radius ε_k(x_i) around the sample x_i and B(X) denotes the boundary constraint of the random variable X;
and obtaining the information entropy H(X) output by each network layer from these quantities, where ψ(·) is the Digamma function, ψ(1) = -γ, γ is the Euler-Mascheroni constant, and ψ(m) ≈ ln(m-1) (≈ denotes approximately equal to).
5. The entropy optimization method based on deep neural network information entropy estimation according to claim 4, wherein step 3) comprises:
step 3-1) traversing the n multi-dimensional continuous random variables X;
step 3-2) for each X, traversing each sampling sample x_i of the network layer concerned and determining the ellipsoid neighborhood of each sample x_i; sorting the radii of the d-dimensional ellipsoid of each sampling sample x_i from large to small to obtain its ellipsoid correction term; and combining the information entropy H(X) output by each network layer to obtain a modified entropy estimate.
6. The entropy optimization method based on deep neural network information entropy estimation according to claim 5, wherein step 3-2) comprises:
step 3-2-1) selecting the k sample points closest to x_i, performing PCA (principal component analysis) on the k+1 sample points including x_i itself, i.e. computing the covariance matrix of the d-dimensional random variable from the k+1 sample points and computing the d eigenvectors of that covariance matrix;
step 3-2-2) taking the directions of the d eigenvectors as the axes of a d-dimensional ellipsoid, searching among the selected k+1 samples for the sample point farthest along each eigenvector direction, and taking that point's distance in that direction as the ellipsoid radius on that axis, thereby determining the ellipsoid neighborhood of the sampling sample x_i;
step 3-2-3) sorting the radii of the d-dimensional ellipsoid from large to small as r_1(x_i) ≥ r_2(x_i) ≥ ... ≥ r_d(x_i) and obtaining therefrom the ellipsoid correction term of the sample x_i; and, from the information entropy H(X) of each network layer, obtaining the modified entropy estimate.
7. The entropy optimization method based on deep neural network information entropy estimation according to claim 6, wherein step 4) comprises:
designing a loss function Loss_1, where H_0 is the information entropy of the original input data, H_j is the information entropy output by the j-th layer of the deep neural network, and n is the number of network layers;
according to the expectation that the entropy of the deep neural network's output layer decreases in each round of training, when the information entropy output after the q-th round of training is larger than the information entropy output after the (q-1)-th round, adding a loss function Loss_2, where H_q and H_{q-1} are respectively the information entropy output after the q-th and (q-1)-th rounds of training;
and taking Loss_1 and Loss_2 as auxiliary terms which, combined with the cross-entropy loss of the network, constitute the loss function for training the deep neural network.
CN202210924688.7A (priority date 2022-08-03, filing date 2022-08-03) Entropy optimization method based on deep neural network information entropy estimation, Pending, CN115019061A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210924688.7A CN115019061A (en) 2022-08-03 2022-08-03 Entropy optimization method based on deep neural network information entropy estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210924688.7A CN115019061A (en) 2022-08-03 2022-08-03 Entropy optimization method based on deep neural network information entropy estimation

Publications (1)

Publication Number Publication Date
CN115019061A true CN115019061A (en) 2022-09-06

Family

ID=83065323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210924688.7A Pending CN115019061A (en) 2022-08-03 2022-08-03 Entropy optimization method based on deep neural network information entropy estimation

Country Status (1)

Country Link
CN (1) CN115019061A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022583A (en) * 2016-05-12 2016-10-12 中国电力科学研究院 Electric power communication service risk calculation method and system based on fuzzy decision tree
CN109189921A (en) * 2018-08-07 2019-01-11 阿里巴巴集团控股有限公司 Comment on the training method and device of assessment models
CN110320806A (en) * 2019-07-24 2019-10-11 东北大学 Sewage disposal process adaptive prediction control method based on integrated instant learning
CN110531313A (en) * 2019-08-30 2019-12-03 西安交通大学 A kind of near-field signals source localization method based on deep neural network regression model
CN110690912A (en) * 2019-10-10 2020-01-14 宾斌 Single-beam device, near-field communication of self-organizing computing network and construction method
CN110929802A (en) * 2019-12-03 2020-03-27 北京迈格威科技有限公司 Information entropy-based subdivision identification model training and image identification method and device
CN112364975A (en) * 2020-10-14 2021-02-12 山东大学 Terminal operation state prediction method and system based on graph neural network
CN113011722A (en) * 2021-03-04 2021-06-22 中国工商银行股份有限公司 System resource data allocation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黎旭 (Li Xu): "Research on surrogate model techniques and their application in aircraft reliability optimization", China Doctoral Dissertations Full-text Database (Engineering Science and Technology II) *

Similar Documents

Publication Publication Date Title
US11481585B2 (en) Segmentation of data
CN108388896B (en) License plate identification method based on dynamic time sequence convolution neural network
CN111583263B (en) Point cloud segmentation method based on joint dynamic graph convolution
CN109800692B (en) Visual SLAM loop detection method based on pre-training convolutional neural network
CN112488025B (en) Double-temporal remote sensing image semantic change detection method based on multi-modal feature fusion
CN110880010A (en) Visual SLAM closed loop detection algorithm based on convolutional neural network
CN112052802A (en) Front vehicle behavior identification method based on machine vision
CN112084895B (en) Pedestrian re-identification method based on deep learning
CN112926696A (en) Interpretable local migration mutual learning method based on attention diagram
CN114419413A (en) Method for constructing sensing field self-adaptive transformer substation insulator defect detection neural network
CN113962281A (en) Unmanned aerial vehicle target tracking method based on Siamese-RFB
Hu et al. A video streaming vehicle detection algorithm based on YOLOv4
CN114283325A (en) Underwater target identification method based on knowledge distillation
CN116152554A (en) Knowledge-guided small sample image recognition system
CN115048870A (en) Target track identification method based on residual error network and attention mechanism
CN111325259A (en) Remote sensing image classification method based on deep learning and binary coding
Chen et al. A finger vein recognition algorithm based on deep learning
CN111950635A (en) Robust feature learning method based on hierarchical feature alignment
CN111578956A (en) Visual SLAM positioning method based on deep learning
CN115019061A (en) Entropy optimization method based on deep neural network information entropy estimation
CN116109649A (en) 3D point cloud instance segmentation method based on semantic error correction
CN115578574A (en) Three-dimensional point cloud completion method based on deep learning and topology perception
CN113222867B (en) Image data enhancement method and system based on multi-template image
CN113469133A (en) Deep learning-based lane line detection method
Ahuja et al. Convolutional Neural Network and Kernel Extreme Learning Machine for Face Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220906)