CN111967574A

CN111967574A - Convolutional neural network training method based on tensor singular value delimitation

Info

Publication number: CN111967574A
Application number: CN202010700940.7A
Authority: CN
Inventors: 郭锴凌; 陈琦; 徐向民
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2020-07-20
Filing date: 2020-07-20
Publication date: 2020-11-20
Anticipated expiration: 2040-07-20
Also published as: CN111967574B

Abstract

The invention discloses a convolutional neural network training method based on tensor singular value delimitation, comprising the following steps: S1, initializing the weights of the convolutional neural network so that the weight matrices of the fully-connected layers and the convolutional layers are orthogonal and the singular values of the weight tensors of the convolutional layers are equal; S2, training the convolutional neural network using stochastic gradient descent or a variant thereof; and S3, after each training iteration, updating the convolutional layer weights by alternating matrix singular value delimitation and tensor singular value delimitation, updating the fully-connected layer weights by matrix singular value delimitation, and updating the batch normalization layer weights by delimitation. The invention adds an orthogonality constraint to the weight tensor, which preserves network energy without destroying the tensor structure of the weights. To enforce the orthogonal-tensor constraint, the invention bounds the singular values of the weight tensor within a threshold range, which makes it possible to train orthogonal tensor networks and improves the performance of image classification networks.

Description

Convolutional neural network training method based on tensor singular value delimitation

Technical Field

The invention belongs to the field of artificial intelligence, relates to machine learning and deep learning, and particularly relates to a convolutional neural network training method based on tensor singular value delimitation.

Background

Deep convolutional neural networks have achieved great success in many applications, such as image classification and object detection. This success is primarily due to their strong expressive power, which lets them represent complex relationships between input and output. However, strong expressive power also increases the risk of overfitting. To mitigate overfitting, researchers have introduced many techniques, such as weight decay, dropout, and label perturbation. The cascaded deep hierarchical structure of convolutional neural networks also brings problems such as vanishing/exploding gradients and the proliferation of saddle points, which make training difficult. To address these problems, methods such as parameter initialization, shortcut connections, and batch normalization (BN) have been proposed to simplify the optimization of convolutional neural networks.

Orthogonality is also used to address the overfitting and optimization problems of deep convolutional neural networks. It has been proved theoretically that a convolutional neural network achieves the optimal generalization error when the singular values of its weight matrices are equal, which reduces the risk of overfitting. Orthogonality also limits the magnitude of the gradient and stabilizes the distribution of each layer's activations, making optimization more efficient. Many methods have been proposed that constrain convolutional neural networks using orthogonality. Soft orthogonality regularization constrains the Gram matrix of the weight matrix to be close to the identity matrix under the Frobenius norm. By analyzing the restricted isometry property, the Frobenius norm has been replaced by the spectral norm, which improves performance. Since orthogonal matrices lie on the Stiefel manifold, projected gradient descent has also become a method for solving deep learning optimization problems with strict orthogonality constraints. The orthogonal linear module adds to the network structure a module that converts an ordinary weight matrix into an orthogonal matrix, and can be optimized by ordinary gradient descent. By relaxing the hard orthogonality constraint, the singular value bounding method limits all singular values of each weight matrix to a threshold range near 1 after each training epoch (i.e. the number of iterations required to traverse all the training data once), which gives a fast approximate solution of the orthogonality constraint.

Orthogonal constraints have been successfully applied to convolutional neural network training. For example: adding an orthogonal constraint increases the stability of convergence during training; an orthogonal regularization loss function is used during training; singular value delimitation is performed after the weights are unfolded into a two-dimensional matrix; and the weights are unfolded into a two-dimensional matrix and multiplied by its pseudo-inverse, forming a new orthogonal constraint on the convolution kernel weights. However, these methods unfold the weight tensor of the convolutional layer into a matrix when applying the constraint, which destroys the structural characteristics of the tensor. The tensor-tensor product is a recently defined tensor operation from which a series of properties analogous to those of matrices can be derived; it has attracted attention in the machine learning field in recent years and has been successfully applied to preserving the structural characteristics of tensors. For example, robust principal component analysis has been generalized to tensors by means of tensor singular value decomposition (t-SVD), with good results in image and video processing. Based on the tensor singular value decomposition derived from the tensor-tensor product, the present invention generalizes the orthogonal matrix constraint to tensors and realizes a convolutional neural network training method based on tensor structure constraints.

Disclosure of Invention

The invention provides a convolutional neural network training method based on tensor singular value delimitation. It aims to add structural constraints on both matrices and tensors to a convolutional neural network, incorporating orthogonal constraints while preserving the tensor structure of the network weights, so as to stably improve the performance of the convolutional neural network.

In the design of the constraints of the objective function, the invention adds a constraint that the tensor singular values be equal on top of the constraint that the weight matrix be orthogonal, which theoretically guarantees that the weight tensor obtained by solving the network is an orthogonal tensor or the product of an orthogonal tensor and a constant. During training, after every several optimization iterations, the matrix singular values and the tensor singular values of the weights are each bounded within a certain threshold range, so that the solved weights approximately satisfy the constraints of the objective function.

The invention is realized by at least one of the following technical schemes.

A convolutional neural network training method based on tensor singular value delimitation comprises the following steps:

S1, initializing the weights of the convolutional neural network so that the weight matrices of the fully-connected layers and the convolutional layers are orthogonal and the singular values of the weight tensors of the convolutional layers are equal;

S2, training the initialized convolutional neural network using stochastic gradient descent (SGD) or a variant thereof (including SGD with Momentum, SGD with Nesterov Momentum, AdaGrad, Adadelta, RMSprop and Adam);

S3, after each training iteration, updating the convolutional layer weights of the convolutional neural network by alternating matrix singular value delimitation and tensor singular value delimitation, updating the fully-connected layer weights by matrix singular value delimitation, and updating the weights of the batch normalization (hereinafter BN) layers by delimitation. If the loss function has converged, training ends; otherwise, the process returns to step S2.

Further, in step S1, the initialization of the convolutional neural network weights is constrained so that the weight matrices are orthogonal and the tensor singular values are equal.

Further, after the fully-connected layer weights of the convolutional neural network are randomly initialized in step S3, all singular values of the fully-connected layer weight matrix are set to 1, so that the fully-connected layer weight matrix is orthogonal.

Further, after the convolutional layer weights of the convolutional neural network are initialized randomly in step S3, the following operations are performed alternately until convergence:

1) limiting all singular values of the convolutional layer weight matrix to 1 so that the convolutional layer weight matrix is orthogonal;

2) on the premise of keeping the Frobenius norm (hereinafter referred to as F norm) unchanged, all singular values of the convolutional layer weight tensor are restricted to be equal, so that the convolutional layer weight tensor is an orthogonal tensor or a product of the orthogonal tensor and a constant.

Further, in step S2, after several training iterations have been performed on the initialized convolutional neural network using stochastic gradient descent or a variant thereof, the weights of the convolutional neural network are updated by threshold delimitation.

Further, the step S3 of performing matrix singular value delimitation on the weight matrices of the convolutional layer and the fully-connected layer includes the following steps:

a) performing matrix singular value decomposition on the weight matrix;

b) performing threshold value constraint on each singular value of the weight matrix, so that each singular value is close to 1;

c) and reconstructing a weight matrix according to the updated singular value.

Further, the tensor singular value delimitation of step S3 includes the following steps:

1) keeping the F norm of the tensor equal to the F norm of the corresponding orthogonal matrix, calculating the expected singular value for the case in which all tensor singular values are equal;

2) performing tensor singular value decomposition on the weight tensor;

3) applying a threshold constraint to each singular value of the weight tensor, so that each singular value is close to the calculated expected singular value;

4) reconstructing the weight tensor from the updated singular values.

Further, the step S3 of thresholding the weight of the BN layer includes the following steps:

(I) calculating the mean of the quotients of each neuron's weight and the standard deviation of its input;

(II) limiting the quotient of each neuron's weight and the standard deviation of its input to be close to that mean, thereby obtaining the new BN layer weights.

Compared with the prior art, the invention has the beneficial effects that:

The method constrains the weight tensor of the convolutional neural network, so that the structural information of the weight tensor is preserved in the model structure. In terms of optimization, compared with applying only a matrix orthogonality constraint to the weight matrix of the convolutional neural network, the method reduces the solution space of the network weights and simplifies optimization. The invention thereby effectively improves the performance of the convolutional neural network.

Drawings

Fig. 1 is a flowchart of the training process of the convolutional neural network training method based on tensor singular value delimitation in this embodiment;

Fig. 2 is a schematic diagram of matrix singular value delimitation in this embodiment;

Fig. 3 is a schematic diagram of tensor singular value delimitation in this embodiment.

Detailed Description

The present invention will be described in further detail below with reference to specific embodiments, but the embodiments of the present invention are not limited thereto.

The principle of the invention is as follows: on top of the orthogonality constraint on the weight matrices of the convolutional layers and the fully-connected layers of the convolutional neural network, the singular values of the weight tensors of the convolutional layers are further constrained to be equal, so that the structural characteristics of the weight tensors are preserved. The singular values of the matrices and the tensors are bounded by thresholds so that the constraints are approximately satisfied, which yields a convolutional neural network training method that improves network performance.

As shown in Fig. 1, a convolutional neural network training method based on tensor singular value delimitation includes the following steps:

S1, initializing the weights of the convolutional neural network. Specifically, the weights of the convolutional layers and the fully-connected layers are initialized to random values, and then all singular values of their weight matrices are set to 1, so that the weight matrices are orthogonal. For each convolutional layer, with the F norm kept unchanged, all singular values of the weight tensor are further made equal, so that the weight tensor is an orthogonal tensor or the product of an orthogonal tensor and a constant; the singular value settings of the weight matrix and of the weight tensor of the convolutional layer are performed alternately until convergence.
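The matrix part of this initialization can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name `orthogonal_init`, the Gaussian random draw, and the layer sizes are hypothetical, and the tensor part of the initialization is not shown.

```python
import numpy as np

def orthogonal_init(K, m, rng):
    """Draw a random K x m matrix and set all of its singular values to 1."""
    W = rng.standard_normal((K, m))
    # Reduced SVD: U is K x r and Vt is r x m, with r = min(K, m).
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    # Replacing every singular value by 1 leaves the product U @ Vt.
    return U @ Vt

rng = np.random.default_rng(0)
# Hypothetical layer: K = 16 kernels, C = 3 input channels, d = 3.
W = orthogonal_init(16, 3 * 9, rng)
singular_values = np.linalg.svd(W, compute_uv=False)
row_gram = W @ W.T  # equals the identity when the rows are orthonormal
```

Because K is smaller than m in this example, the rows of `W` are orthonormal; for a matrix with more rows than columns the columns would be orthonormal instead.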

S2, training the initialized convolutional neural network using stochastic gradient descent or a variant thereof (including SGD with Momentum, SGD with Nesterov Momentum, AdaGrad, Adadelta, RMSprop and Adam). After each training epoch, the weights are updated according to step S3.

S3, updating the convolutional layer weights by matrix singular value delimitation and tensor singular value delimitation.

For convenience of description, the notation is fixed as follows. For any convolutional layer, the convolution weight tensor is $\mathcal{W} \in \mathbb{R}^{K \times C \times d^2}$, i.e. $\mathcal{W}$ is a three-dimensional real tensor whose three dimensions have sizes $K$, $C$ and $d^2$, where $\mathbb{R}$ denotes the real numbers, $C$ the number of input channels, $K$ the number of convolution kernels, and $d$ the size of the convolution kernel. The mode-1 unfolding of the convolution weight tensor $\mathcal{W}$ yields the corresponding weight matrix $W \in \mathbb{R}^{K \times Cd^2}$, where $K$ and $Cd^2$ are the sizes of the two dimensions of the matrix. For a fully-connected layer, the weight is already a matrix. For convenience, the invention denotes the weight matrices of both convolutional layers and fully-connected layers uniformly by $W \in \mathbb{R}^{K \times m}$; when $W$ is a convolution matrix, $m = Cd^2$.
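The relation between the weight tensor and its mode-1 unfolding can be illustrated with NumPy. The sketch assumes the weights are stored as an array of shape (K, C, d²); the variable names and sizes are hypothetical. Column-major (`order="F"`) reshaping reproduces the usual mode-1 unfolding, and a plain row-major reshape only permutes the columns, which leaves the singular values unchanged.

```python
import numpy as np

K, C, d = 4, 3, 3
rng = np.random.default_rng(0)
# Hypothetical convolution weights stored as a K x C x d^2 tensor.
W_tensor = rng.standard_normal((K, C, d * d))

# Mode-1 unfolding: the first index stays as the row index, and the other
# two indices are flattened with the channel index varying fastest.
W_matrix = W_tensor.reshape(K, C * d * d, order="F")

# A row-major reshape yields the same columns in a different order.
W_matrix_c = W_tensor.reshape(K, C * d * d)

sv_f = np.linalg.svd(W_matrix, compute_uv=False)
sv_c = np.linalg.svd(W_matrix_c, compute_uv=False)

# Folding the matrix back recovers the original tensor.
W_back = W_matrix.reshape(K, C, d * d, order="F")
```

Since singular value delimitation depends only on the singular values and the reconstruction, either column ordering can be used as long as folding and unfolding are consistent.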

Step S3 is specifically as follows:

(1) updating the weights of the convolutional layers according to the matrix singular values, as shown in fig. 2, includes the following steps:

First, singular value decomposition (SVD) is performed on the convolutional layer weight matrix $W$ to obtain $W = U \Sigma V^T$, where $U$ is a unitary matrix of order $K \times m$, $\Sigma$ is an $m \times m$ diagonal matrix of non-negative real numbers whose diagonal elements are the singular values of $W$, $V$ is an $m \times m$ unitary matrix, and $V^T$ denotes the transpose of $V$.

Second, each diagonal element of the singular value matrix $\Sigma$ is bounded as follows:

if $\sigma_i > 1 + \epsilon_1$, then $\sigma_i = 1 + \epsilon_1$;

if $\sigma_i < 1/(1 + \epsilon_1)$, then $\sigma_i = 1/(1 + \epsilon_1)$;

if $1/(1 + \epsilon_1) \le \sigma_i \le 1 + \epsilon_1$, then $\sigma_i$ remains unchanged;

where $\sigma_i$ denotes the $i$-th diagonal element of $\Sigma$, and $\epsilon_1$ is a small value in the range 0.1 to 0.5 that bounds the singular values $\sigma_i$ to a neighborhood of 1.

Third, $U \Sigma V^T$ is computed with the new $\Sigma$ to obtain the new convolutional layer weight matrix $W$.
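The three steps above (decompose, bound, reconstruct) can be sketched with NumPy as follows. The function name `bound_matrix_singular_values` is hypothetical, `eps` stands for the small threshold value in the 0.1 to 0.5 range, and the reduced SVD is used for convenience; this is an illustrative sketch rather than a reference implementation.

```python
import numpy as np

def bound_matrix_singular_values(W, eps=0.2):
    """Clip every singular value of W to the interval [1/(1+eps), 1+eps]."""
    # Step 1: singular value decomposition (reduced form).
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Step 2: values above 1+eps or below 1/(1+eps) are moved to the bound;
    # values already inside the interval are left unchanged.
    s = np.clip(s, 1.0 / (1.0 + eps), 1.0 + eps)
    # Step 3: rebuild the weight matrix from the bounded singular values.
    return (U * s) @ Vt

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 27))  # hypothetical unfolded weights, K=8, m=27
W_new = bound_matrix_singular_values(W, eps=0.2)
s_new = np.linalg.svd(W_new, compute_uv=False)
# Applying the operation again changes nothing: the values are already bounded.
W_again = bound_matrix_singular_values(W_new, eps=0.2)
```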

(2) Updating the convolutional layer weights by tensor singular value delimitation. Using tensor singular value decomposition (t-SVD), the convolution weight tensor $\mathcal{W}$ is decomposed as

$\mathcal{W} = \mathcal{U} * \mathcal{S} * \mathcal{V}^T$,

and the tensor singular values in $\mathcal{S}$ are bounded, as shown in Fig. 3. Here $\mathcal{U}$ and $\mathcal{V}$ are orthogonal tensors of size $K \times K \times d^2$ and $C \times C \times d^2$, respectively; $\mathcal{S}$ is an f-diagonal tensor (i.e. all of its frontal slices are diagonal matrices) of size $K \times C \times d^2$, and the diagonal elements of its first frontal slice are the tensor singular values of $\mathcal{W}$; $*$ denotes the tensor-tensor product.

In practice, the t-SVD can be computed using the Fourier transform and matrix singular value decomposition, as follows. First, the fast Fourier transform is applied to the convolution weight tensor $\mathcal{W}$ along the kernel-size dimension to obtain the transformed tensor $\bar{\mathcal{W}}$, written as

$\bar{\mathcal{W}} = \mathrm{fft}(\mathcal{W}, [\,], 3)$.

Second, matrix singular value decomposition is performed on each frontal slice of $\bar{\mathcal{W}}$, i.e.

$\bar{W}^{(i)} = \bar{U}^{(i)} \bar{S}^{(i)} \bar{V}^{(i)H}$,

where $\bar{W}^{(i)}$ denotes the $i$-th frontal slice of $\bar{\mathcal{W}}$, $\bar{U}^{(i)}$ is a $K \times K$ unitary matrix, $\bar{S}^{(i)}$ is a $K \times C$ non-negative real diagonal matrix whose diagonal elements are the singular values of $\bar{W}^{(i)}$, $\bar{V}^{(i)}$ is a $C \times C$ unitary matrix, and $\bar{V}^{(i)H}$ denotes the conjugate transpose of $\bar{V}^{(i)}$. Finally, the tensors $\bar{\mathcal{U}}$, $\bar{\mathcal{S}}$ and $\bar{\mathcal{V}}$, whose frontal slices are $\bar{U}^{(i)}$, $\bar{S}^{(i)}$ and $\bar{V}^{(i)}$, are each transformed by the inverse fast Fourier transform along the kernel-size dimension to obtain $\mathcal{U}$, $\mathcal{S}$ and $\mathcal{V}$, i.e.

$\mathcal{U} = \mathrm{ifft}(\bar{\mathcal{U}}, [\,], 3)$, $\mathcal{S} = \mathrm{ifft}(\bar{\mathcal{S}}, [\,], 3)$, $\mathcal{V} = \mathrm{ifft}(\bar{\mathcal{V}}, [\,], 3)$.

By the properties of the inverse Fourier transform, each element of the first frontal slice of $\mathcal{S}$ equals the mean of the elements at the corresponding position over all frontal slices of $\bar{\mathcal{S}}$. The delimitation of $\mathcal{S}$ can therefore be carried out by bounding $\bar{\mathcal{S}}$, which comprises the following steps:
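The Fourier-domain computation of the t-SVD described above can be sketched with NumPy. This is a minimal illustration under assumptions: the function names `t_svd` and `t_product_from_fourier` are hypothetical, the tensor is real and stored with shape (K, C, n) where n plays the role of d², and conjugate symmetry between Fourier slices is imposed explicitly so that the inverse transforms are real up to rounding.

```python
import numpy as np

def t_svd(W):
    """Fourier-domain factors of the t-SVD of a real K x C x n tensor."""
    K, C, n = W.shape
    r = min(K, C)
    idx = np.arange(r)
    W_hat = np.fft.fft(W, axis=2)
    U_hat = np.zeros((K, K, n), dtype=complex)
    S_hat = np.zeros((K, C, n), dtype=complex)
    V_hat = np.zeros((C, C, n), dtype=complex)
    for i in range(n // 2 + 1):
        j = (n - i) % n          # index of the conjugate-partner slice
        A = W_hat[:, :, i]
        if i == j:
            A = A.real           # self-conjugate slices of a real tensor are real
        u, s, vh = np.linalg.svd(A)   # full SVD of the i-th frontal slice
        U_hat[:, :, i] = u
        S_hat[idx, idx, i] = s
        V_hat[:, :, i] = vh.conj().T
        if j != i:
            # Slice j of W_hat is the conjugate of slice i, so its factors
            # are the conjugates of the factors of slice i.
            U_hat[:, :, j] = u.conj()
            S_hat[:, :, j] = S_hat[:, :, i]
            V_hat[:, :, j] = vh.T
    return U_hat, S_hat, V_hat

def t_product_from_fourier(U_hat, S_hat, V_hat):
    """Rebuild the tensor slice by slice in the Fourier domain."""
    n = U_hat.shape[2]
    slices = [U_hat[:, :, i] @ S_hat[:, :, i] @ V_hat[:, :, i].conj().T
              for i in range(n)]
    return np.fft.ifft(np.stack(slices, axis=2), axis=2).real

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3, 9))   # hypothetical K=4, C=3, d=3
U_hat, S_hat, V_hat = t_svd(W)
U = np.fft.ifft(U_hat, axis=2)
S = np.fft.ifft(S_hat, axis=2).real
W_rec = t_product_from_fourier(U_hat, S_hat, V_hat)
tensor_singular_values = np.diag(S[:, :, 0])
mean_fourier_singular_values = np.diag(S_hat.real.mean(axis=2))
```

The last two arrays coincide, which is the property used in the text: the first frontal slice of the singular value tensor is the mean of the Fourier-domain slices.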

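① Apply the fast Fourier transform to the convolution weight tensor $\mathcal{W}$ along the kernel-size dimension, i.e. $\bar{\mathcal{W}} = \mathrm{fft}(\mathcal{W}, [\,], 3)$.

② Calculate the expected tensor singular value, i.e. the value that all tensor singular values take when they are equal and the F norm of the tensor equals the F norm of the corresponding orthogonal matrix.

③ For each $i$ from 1 to $d^2$, perform the following operations:

Perform SVD on the $i$-th frontal slice matrix $\bar{W}^{(i)}$ of $\bar{\mathcal{W}}$ to obtain $\bar{U}^{(i)}$, $\bar{S}^{(i)}$ and $\bar{V}^{(i)}$.

Bound each diagonal element $\bar{\sigma}^{(i)}_j$ of $\bar{S}^{(i)}$ by thresholds as follows: if $\bar{\sigma}^{(i)}_j$ is greater than the upper threshold, it is set to the upper threshold; if $\bar{\sigma}^{(i)}_j$ is smaller than the lower threshold, it is set to the lower threshold; if $\bar{\sigma}^{(i)}_j$ lies between the two thresholds, it remains unchanged.

Here $\bar{\sigma}^{(i)}_j$ denotes the $j$-th diagonal element of $\bar{S}^{(i)}$, and $\epsilon_2$ is a small value in the range 0.1 to 0.5 that determines the thresholds, so that the singular values $\bar{\sigma}^{(i)}_j$ are bounded to a neighborhood of the expected singular value.

With the new $\bar{S}^{(i)}$, compute the matrix product $\bar{U}^{(i)} \bar{S}^{(i)} \bar{V}^{(i)H}$ to obtain the new $\bar{W}^{(i)}$.

④ Apply the inverse fast Fourier transform to the new $\bar{\mathcal{W}}$ along the kernel-size dimension to obtain the new convolution weight tensor $\mathcal{W}$.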
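One possible way to obtain the expected tensor singular value mentioned in the steps above is sketched below. It is an illustrative derivation under stated assumptions and is not taken from the original text: the weight tensor has size $K \times C \times d^2$, the Fourier transform along the third dimension is the unnormalized discrete Fourier transform, $r = \min(K, C)$ is the number of singular values per Fourier-domain slice, $\bar{\sigma}^{(i)}_j$ are those singular values, and $s$ is their assumed common value.

```latex
\begin{aligned}
\|\mathcal{W}\|_F^2
  &= \frac{1}{d^2}\sum_{i=1}^{d^2}\sum_{j=1}^{r}\bigl(\bar{\sigma}^{(i)}_j\bigr)^2
   = r\,s^2
  && \text{(Parseval, with all } \bar{\sigma}^{(i)}_j = s\text{)} \\
\|W\|_F^2
  &= \min(K,\, Cd^2)
  && \text{(unfolded } K \times Cd^2 \text{ matrix with all singular values 1)} \\
r\,s^2 = \min(K,\, Cd^2)
  \;&\Longrightarrow\;
  s = \sqrt{\frac{\min(K,\, Cd^2)}{\min(K,\, C)}}
\end{aligned}
```

Under these assumptions the mean over the Fourier-domain slices is also $s$, so the tensor singular values in the first frontal slice equal $s$ as well, and $s = 1$ whenever $K \le C$.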

(3) Steps (1) and (2) are alternated for several iterations. This embodiment suggests 1.5 iterations (i.e. step (1), step (2), then step (1) again, after which the iteration exits), which achieves a good balance between computation time and final effect; practical applications are not limited to 1.5 iterations.
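The alternation of matrix and tensor delimitation can be sketched as follows. Everything here is an illustrative assumption rather than the method of the original text: the helper names are hypothetical, the expected value `target` is taken to be sqrt(min(K, C·n) / min(K, C)) for an unnormalized FFT, and the tensor thresholds are assumed to be `target / (1 + eps)` and `target * (1 + eps)` by analogy with the matrix case.

```python
import numpy as np

def unfold(T):
    """Mode-1 unfolding of a K x C x n tensor into a K x (C*n) matrix."""
    return T.reshape(T.shape[0], -1, order="F")

def fold(M, shape):
    """Inverse of unfold."""
    return M.reshape(shape, order="F")

def bound_matrix(W, eps):
    """Step (1): clip matrix singular values to [1/(1+eps), 1+eps]."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.clip(s, 1 / (1 + eps), 1 + eps)) @ Vt

def bound_tensor(Wt, eps):
    """Step (2): clip Fourier-domain singular values around an expected value."""
    K, C, n = Wt.shape
    # ASSUMPTION: expected value derived from matching the Frobenius norm of
    # an orthogonal K x (C*n) matrix; not stated in the source text.
    target = np.sqrt(min(K, C * n) / min(K, C))
    W_hat = np.fft.fft(Wt, axis=2)
    for i in range(n):
        U, s, Vh = np.linalg.svd(W_hat[:, :, i], full_matrices=False)
        # ASSUMPTION: multiplicative thresholds around the expected value.
        s = np.clip(s, target / (1 + eps), target * (1 + eps))
        W_hat[:, :, i] = (U * s) @ Vh
    return np.fft.ifft(W_hat, axis=2).real

def bound_conv_weights(Wt, eps1=0.2, eps2=0.2):
    """1.5 iterations: step (1), step (2), then step (1) again."""
    Wt = fold(bound_matrix(unfold(Wt), eps1), Wt.shape)
    Wt = bound_tensor(Wt, eps2)
    return fold(bound_matrix(unfold(Wt), eps1), Wt.shape)

rng = np.random.default_rng(0)
W0 = rng.standard_normal((8, 3, 9))   # hypothetical K=8, C=3, d=3
target = np.sqrt(min(8, 27) / min(8, 3))
W_t = bound_tensor(W0, eps=0.2)
fourier_sv = np.linalg.svd(
    np.moveaxis(np.fft.fft(W_t, axis=2), 2, 0), compute_uv=False)
W1 = bound_conv_weights(W0)
matrix_sv = np.linalg.svd(unfold(W1), compute_uv=False)
```

Ending on step (1) guarantees that the matrix singular values are inside their interval, while the tensor singular values are only approximately bounded after the final matrix step.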

S4, updating the fully-connected layer weights by matrix singular value delimitation, which comprises the following steps:

(1) SVD is performed on the fully-connected layer weight matrix $W$ to obtain $W = U \Sigma V^T$.

(2) Each diagonal element $\sigma_i$ of $\Sigma$ is bounded as follows:

if $\sigma_i > 1 + \epsilon$, then $\sigma_i = 1 + \epsilon$;

if $\sigma_i < 1/(1 + \epsilon)$, then $\sigma_i = 1/(1 + \epsilon)$;

if $1/(1 + \epsilon) \le \sigma_i \le 1 + \epsilon$, then $\sigma_i$ remains unchanged.

(3) $U \Sigma V^T$ is computed with the new $\Sigma$ to obtain the new $W$.

S5, updating the weights of the BN layers by delimitation. Suppose the input of a BN layer is $h \in \mathbb{R}^n$ (i.e. $h$ is an $n$-dimensional real vector, where $\mathbb{R}$ denotes the real numbers). The operation of the BN layer can then be expressed as:

$\mathrm{BN}(h) = \Upsilon \Phi (h - \mu) + \beta$,

where $n$ is the number of channels of the BN layer, equal to the number of convolution kernels of the convolutional layer connected to it, i.e. $n = K$; $\mu$ is the mean of the batch of neuron inputs; $\Phi$ is a diagonal matrix whose diagonal elements are the reciprocals $1/\phi_i$ of the standard deviations $\phi_i$ of the batch of neuron inputs; $\Upsilon$ is a diagonal matrix whose diagonal elements are the learnable BN layer weights $\upsilon_i$; and $\beta$ is the learnable BN layer bias. The specific steps of the delimitation update of the BN layer are as follows:

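(1) Calculate the mean of the quotients of each neuron's weight and the standard deviation of its input:

$\alpha = \frac{1}{n} \sum_{i=1}^{n} \frac{\upsilon_i}{\phi_i}$.

(2) According to the mean $\alpha$, bound each diagonal element $\upsilon_i$ of $\Upsilon$ as follows:

if $\frac{\upsilon_i}{\alpha \phi_i} > 1 + \epsilon$, then $\upsilon_i = \alpha \phi_i (1 + \epsilon)$;

if $\frac{\upsilon_i}{\alpha \phi_i} < \frac{1}{1 + \epsilon}$, then $\upsilon_i = \frac{\alpha \phi_i}{1 + \epsilon}$;

if $\frac{1}{1 + \epsilon} \le \frac{\upsilon_i}{\alpha \phi_i} \le 1 + \epsilon$, then $\upsilon_i$ remains unchanged;

where $\upsilon_i$ is the weight of the $i$-th neuron of the BN layer, $\phi_i$ is the standard deviation of the input of the $i$-th neuron over the batch, and $\epsilon$ is a small value (in the range 0.1 to 0.5) that bounds the quotient $\upsilon_i / \phi_i$ of each BN layer weight to a neighborhood of $\alpha$.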
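The BN delimitation can be sketched with NumPy as follows. The sketch assumes that the quotient of each weight and its input standard deviation is clipped to the interval from `alpha / (1 + eps)` to `alpha * (1 + eps)`, where `alpha` is the mean quotient, and that all BN weights are positive; the function name `bound_bn_weights` and the sample values are hypothetical.

```python
import numpy as np

def bound_bn_weights(gamma, std, eps=0.2):
    """Keep each quotient gamma_i / std_i close to the mean quotient.

    gamma: BN layer weights, one per channel (assumed positive here).
    std:   per-channel standard deviation of the batch inputs.
    """
    ratio = gamma / std
    alpha = ratio.mean()
    # ASSUMPTION: multiplicative bounds around alpha, mirroring the
    # [1/(1+eps), 1+eps] interval used for singular values.
    ratio = np.clip(ratio, alpha / (1 + eps), alpha * (1 + eps))
    # Convert the bounded quotients back into BN weights.
    return ratio * std

gamma = np.array([1.0, 2.0, 2.0, 0.5])
std = np.array([1.0, 2.0, 0.5, 1.0])
eps = 0.2
alpha = (gamma / std).mean()
new_gamma = bound_bn_weights(gamma, std, eps)
new_ratio = new_gamma / std
```

Quotients that already lie inside the interval are returned unchanged, so the update only affects channels whose scale deviates strongly from the layer average.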

and S6, repeatedly executing the steps S3 to S5 until the training of the convolutional neural network is converged.
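The overall loop of training followed by delimitation can be illustrated on a toy problem. The sketch below is a stand-in under assumptions, not a convolutional network: it trains a single linear layer with mini-batch SGD on synthetic data and applies matrix singular value delimitation after each epoch; all names, sizes, and hyperparameters are hypothetical.

```python
import numpy as np

def bound_matrix(W, eps):
    """Clip the singular values of W to [1/(1+eps), 1+eps]."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.clip(s, 1 / (1 + eps), 1 + eps)) @ Vt

rng = np.random.default_rng(0)
n_samples, n_in, n_out = 256, 10, 4
X = rng.standard_normal((n_samples, n_in))
W_true = rng.standard_normal((n_out, n_in))
Y = X @ W_true.T                      # synthetic regression targets

# Initialization: random matrix with all singular values set to 1.
U, _, Vt = np.linalg.svd(rng.standard_normal((n_out, n_in)),
                         full_matrices=False)
W = U @ Vt

lr, eps, batch = 0.05, 0.2, 32
losses = []
for epoch in range(20):
    # Training: one epoch of mini-batch SGD on the squared error.
    perm = rng.permutation(n_samples)
    for start in range(0, n_samples, batch):
        idx = perm[start:start + batch]
        err = X[idx] @ W.T - Y[idx]
        grad = 2.0 * err.T @ X[idx] / len(idx)
        W -= lr * grad
    # Delimitation: bound the singular values after each epoch.
    W = bound_matrix(W, eps)
    losses.append(float(np.mean((X @ W.T - Y) ** 2)))

final_singular_values = np.linalg.svd(W, compute_uv=False)
```

Because the bound is applied last in every epoch, the final weights always satisfy the singular value interval; the loss is recorded only to show that training and delimitation interleave.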

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A convolutional neural network training method based on tensor singular value delimitation is characterized by comprising the following steps:

S1, initializing the weights of the convolutional neural network so that the weight matrices of the fully-connected layers and the convolutional layers are orthogonal and the singular values of the weight tensors of the convolutional layers are equal;

S2, training the initialized convolutional neural network using stochastic gradient descent or a variant thereof;

S3, after each training iteration, updating the convolutional layer weights of the convolutional neural network by alternating matrix singular value delimitation and tensor singular value delimitation, updating the fully-connected layer weights of the convolutional neural network by matrix singular value delimitation, and updating the weights of the batch normalization layers (BN layers) by delimitation; if the loss function has converged, training ends; if the loss function has not converged, the process returns to step S2.

2. The method of claim 1, wherein step S1 constrains the initialization of convolutional neural network weights to be matrix orthogonal and tensor singular value equal.

3. The method according to claim 2, wherein after the random initialization of the fully-connected layer weights of the convolutional neural network in step S3, all singular values of the fully-connected layer weight matrix are limited to 1, so that the fully-connected layer weight matrix is orthogonal.

4. The method according to claim 3, wherein the random initialization of convolutional layer weights of convolutional neural network in step S3 is followed by the following operations until convergence:

limiting all singular values of the convolutional layer weight matrix to 1 so that the convolutional layer weight matrix is orthogonal;

on the premise of keeping the Frobenius norm (hereinafter referred to as F norm) unchanged, all singular values of the convolutional layer weight tensor are restricted to be equal, so that the convolutional layer weight tensor is an orthogonal tensor or a product of the orthogonal tensor and a constant.

5. The method according to claim 1, wherein step S2 is implemented by performing several training iterations on the initialized convolutional neural network by using a stochastic gradient descent method or its variant, and then updating the weights of the convolutional neural network by using threshold bounding.

6. The method of claim 1, wherein the step S3 of performing matrix singular value delimitation on the weight matrices of the convolutional layer and the fully-connected layer comprises the steps of:

performing matrix singular value decomposition on the weight matrix;

performing threshold value constraint on each singular value of the weight matrix, so that each singular value is close to 1;

and reconstructing a weight matrix according to the updated singular value.

7. The method of claim 1, wherein the tensor singular value delimitation of step S3 comprises the steps of:

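keeping the F norm of the tensor equal to the F norm of the corresponding orthogonal matrix, and calculating the expected singular value for the case in which all tensor singular values are equal;

performing tensor singular value decomposition on the weight tensor;

applying a threshold constraint to each singular value of the weight tensor, so that each singular value is close to the calculated expected singular value;

and reconstructing the weight tensor from the updated singular values.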

8. The method of claim 1, wherein the step S3 of thresholding the weight of the BN layer comprises the steps of:

calculating the mean of the quotients of each neuron's weight and the standard deviation of its input;

and limiting the quotient of each neuron's weight and the standard deviation of its input to be close to that mean, thereby obtaining the new BN layer weights.