CN111985603A - Method for training sparse connection neural network - Google Patents
Method for training sparse connection neural network
- Publication number
- CN111985603A (application CN202010123340.9A)
- Authority
- CN
- China
- Prior art keywords
- weight
- connectivity
- variable
- mask
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Feedback Control In General (AREA)
Abstract
The invention provides a method for training a sparsely connected neural network. When the neural network is trained, each weight is decomposed into the product of a weight variable and a binary mask, wherein the binary mask is obtained by passing a mask variable through a unit step function. Each element of the binary mask indicates whether the weight at the corresponding position has a connection: 0 represents no connection, and 1 represents a connection. If the majority of the elements of the binary mask are 0, the training results in a sparsely connected neural network. The number of connected weights, i.e., the number of elements of the binary mask that are 1, is taken as one term of the objective function. During training, the weight variables and the mask variables are adjusted according to the objective function, and the values of the mask variables are gradually attenuated so as to ensure the sparsity of the binary mask.
Description
Technical Field
The present invention relates to artificial neural networks, and in particular to a method for training sparsely connected neural networks.
Background
An artificial neural network is a network that includes a plurality of processing units arranged in multiple layers. A neural network trained by a general training method is often densely connected (densely connected), i.e., all weights are non-zero. However, such network architectures are typically complex, require significant memory resources and power consumption, and often suffer from over-fitting. A neural network with sparse weights can be obtained by pruning. Pruning sets weights with small absolute values to 0; however, the absolute value of a weight does not necessarily represent the importance of its connection, so it is difficult to obtain an optimal connection structure in this way.
Disclosure of Invention
The embodiment of the invention provides a method for training a sparsely connected neural network. The method comprises the following steps: during training of the neural network, the weights are decomposed into products of weight variables and binary (0/1) masks, where the binary masks are derived from mask variables by a unit step function. Each element of the binary mask indicates whether the weight at the corresponding position has a connection: 0 represents no connection, and 1 represents a connection. If most elements of the binary mask are 0, the training results in a sparsely connected neural network. The number of connected weights, i.e., the number of elements of the binary mask that are 1, is taken as one term of the objective function. The training process adjusts the weight variables and the mask variables according to the objective function. The values of the mask variables are gradually attenuated during training, which ensures that the binary mask is sparse. Since the mask variables are determined by the objective function, only the binary mask elements corresponding to a few important weights are 1.
Therefore, an artificial neural network with sparse connections, a simple structure and correct output predictions is generated, and the resulting sparse connection structure can significantly reduce computational complexity, memory requirements and power consumption.
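As a purely illustrative sketch (the array values, names and shapes are assumptions for illustration, not taken from the embodiments below), the decomposition of weights into weight variables and a binary mask obtained through a unit step function can be expressed in a few lines of NumPy:

```python
import numpy as np

mask_var = np.array([[0.3, -0.2], [-0.5, 0.1]])   # mask variables, decayed during training
weight_var = np.array([[1.5, 0.8], [-0.9, 2.0]])  # weight variables (connection strengths)

binary_mask = (mask_var > 0).astype(float)        # unit step function: 1 = connected, 0 = no connection
weights = weight_var * binary_mask                # weight = weight variable x binary mask

print(binary_mask)              # [[1. 0.] [0. 1.]]
print(int(binary_mask.sum()))   # number of connected weights, used as one term in the objective
```

When the mask variables are pushed negative by training, more mask elements become 0 and the remaining weights form a sparse connection structure.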
Drawings
Fig. 1 is a computational graph of an artificial neural network according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the convolutional layer of the artificial neural network in fig. 1.
Fig. 3 is a flowchart of a training method of the artificial neural network in fig. 1.
Fig. 4 is a schematic diagram of a computing network for constructing the artificial neural network in Fig. 1 according to an embodiment.
Reference numerals:
1: artificial neural network
300: training method
S302 to S306: Steps
4: Computing network
402: Processor
404: Programming memory
406: Parameter memory
408: Output interface
Lyr(1) to Lyr(J): Layers
m: Connectivity mask
w: Weight
Y(1) to Y(|NJ|): Target values
*: convolution operation
☉: element-to-element multiplication
Detailed Description
Fig. 1 is a computational graph of an artificial neural network 1 according to an embodiment of the present invention. The artificial neural network 1 is a fully connected neural network (fully connected neural network), although the invention is applicable to various types of neural networks such as convolutional neural networks (convolutional neural networks). The artificial neural network 1 generates output estimates Ŷ(1) to Ŷ(|NJ|) in response to input data X(1) to X(|N1|). The input data X(1) to X(|N1|) may be current levels, voltage levels, real signals, complex signals, analog signals or digital signals. For example, the input data X(1) to X(|N1|) may be gray-scale values of pixels obtained by an input device such as a mobile phone, tablet computer, or digital camera. The output estimates Ŷ(1) to Ŷ(|NJ|) may represent the probabilities of various classification results of the artificial neural network 1. For example, the output estimates Ŷ(1) to Ŷ(|NJ|) may be the probabilities of multiple objects being identified in an image. A set of input data X(1) to X(|N1|) may be referred to as an input data set. The artificial neural network 1 may be trained using sets of input data and respective sets of target values. In some embodiments, the input data sets may be divided into a plurality of mini-batches during training. For example, 32,000 input data sets may be divided into 1,000 mini-batches, each having 32 input data sets.
The artificial neural network 1 may comprise layers Lyr(1) to Lyr(J), J being a positive integer greater than 1. The layer Lyr(1) may be referred to as the input layer, the layer Lyr(J) may be referred to as the output layer, and the layers Lyr(2) to Lyr(J-1) may be referred to as hidden layers. Each layer Lyr(j) may include a plurality of processing nodes coupled by connections C(j,1) to C(j,|Cj|) to a plurality of processing nodes in the previous layer Lyr(j-1), j being a layer index between 2 and J, and |Cj| being the total number of connections between the layer Lyr(j) and the previous layer Lyr(j-1). The input layer Lyr(1) may comprise processing nodes n(1,1) to n(1,|N1|), where the first index denotes a layer index, the second index denotes a node index, and |N1| is the total number of processing nodes of the layer Lyr(1). The processing nodes n(1,1) to n(1,|N1|) may respectively receive the input data X(1) to X(|N1|). Each hidden layer Lyr(j) of the hidden layers Lyr(2) to Lyr(J-1) may include processing nodes n(j,1) to n(j,|Nj|), where |Nj| is the total number of processing nodes of the hidden layer Lyr(j). The output layer Lyr(J) may comprise processing nodes n(J,1) to n(J,|NJ|), where |NJ| is the total number of processing nodes of the output layer Lyr(J). The processing nodes n(J,1) to n(J,|NJ|) may respectively generate the output estimates Ŷ(1) to Ŷ(|NJ|).
Each processing node in the layer Lyr(j) may be coupled to one or more processing nodes in the previous layer Lyr(j-1) via its connections. Each connection may be associated with a weight, and a processing node may compute a weighted sum of the data received from the one or more processing nodes in the previous layer Lyr(j-1). When generating the weighted sum, connections associated with larger weights are more influential than connections associated with smaller weights. When a weight is 0, the connection associated with that weight can be regarded as removed from the artificial neural network 1, thereby achieving network connection sparsity and reducing computational complexity, power consumption and operational cost. The artificial neural network 1 may be trained to produce an optimized sparse network configuration that uses a small or minimal number of connections to achieve output estimates Ŷ(1) to Ŷ(|NJ|) approximately matching the respective target values Y(1) to Y(|NJ|).
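As an illustration of why zero weights reduce computational cost, the following minimal NumPy sketch (not part of the patent embodiment; the function name and values are chosen only for illustration) computes a node's weighted sum while skipping connections whose weights are 0:

```python
import numpy as np

def node_weighted_sum(inputs, weights):
    """Weighted sum of a node's inputs, skipping zero-weight (removed) connections."""
    total = 0.0
    for x, w in zip(inputs, weights):
        if w == 0.0:        # connection regarded as removed from the network
            continue
        total += w * x
    return total

inputs = np.array([0.3, 0.8, 0.1, 0.5])
weights = np.array([0.0, 1.2, 0.0, -0.7])   # two of four connections are pruned
print(node_weighted_sum(inputs, weights))   # only two multiply-adds are performed
```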
The method can be applied to different network types, such as fully connected neural networks or convolutional neural networks. In the computation, a fully connected layer of a fully connected neural network can be equivalently converted into a convolutional layer in which the input feature map (feature map) has a size of 1x1 (with N1 channels for the input layer Lyr(1) in Fig. 1) and the convolution kernel (convolutional kernel) has a size of 1x1 (1x1xN1xN2 in Fig. 1), N1 and N2 being positive integers. The training method for the sparsely connected network is therefore described with reference to Fig. 2 in the form of a convolutional layer. Fig. 2 shows a convolutional layer, which may correspond to one of the layers Lyr(2) to Lyr(J) of the artificial neural network 1. The convolutional layer may be coupled to the previous convolutional layer via connections. The convolutional layer may receive input data x from the previous convolutional layer and perform a convolution operation on the input data x and a weight w to calculate an output estimate y, as expressed by equation (1):
y = w * x    formula (1)
The input data x may have a size of (1x1). The weight w may be referred to as a convolution kernel and may have a size of (1x1). "*" denotes the convolution operation. The output estimate y may be sent to the subsequent convolutional layer as its input data to calculate a subsequent output estimate. The weight w may be reparameterized into a weight variable w' and a connectivity mask m, as expressed by equation (2):

w = w' ☉ m    formula (2)
the connectivity mask m may be binary data representing connectivity of a connection, where 1 represents having a connection and 0 represents not having a connection. Weight variableMay indicate the strength of the connection. "☉" may represent an element-to-element (element-wise) multiplication. The connectivity mask m can be varied in number by varying the connectivityPerforming a unit ladder operation H (-) derivation, as expressed by equation (3):
the convolutional layer can be operated according to unit ladder H (-) to the connectivity variableBinarization is performed to produce a connectivity mask m. By parameterizing the weight w, the connectivity and strength of the connection can be adjusted by adjusting the connectivity variables, respectivelyAnd weight variableAnd training is performed. If the connectivity is variableLess than or equal to 0, weight variableMay be masked by 0 to generate a 0 weight w if the connectivity variable isOver 0, weight variableMay be set to the weight w.
In the artificial neural network 1, the connections of each layer Lyr(j) may be associated with respective connectivity variables m'(j,1) to m'(j,|Cj|) and weight variables w'(j,1) to w'(j,|Cj|). The connectivity variables and the weight variables are trained according to an objective function to reduce the total number of connections while reducing the performance loss of the artificial neural network 1. The total number of connections can be calculated by summing all the connectivity masks m(j,1) to m(j,|Cj|) over all layers. The performance loss may represent the difference between the output estimates Ŷ(1) to Ŷ(|NJ|) and the respective target values Y(1) to Y(|NJ|), and can be calculated in the form of cross entropy. The objective function L can be represented by equation (4):

L = CE + λ1 · Σj Σi m(j,i) + λ2 · Σj Σi w'(j,i)²    formula (4)
where CE is the cross entropy (cross entropy);
j is the layer index;
i is a mask index or a weight index;
- |Cj| is the total number of connections of layer j; λ1 is a connection attenuation coefficient; and λ2 is a weight attenuation coefficient.
The objective function L may include the cross entropy CE between the output estimates Ŷ(1) to Ŷ(|NJ|) and the respective target values Y(1) to Y(|NJ|), an L0 regularization term on the total number of connections, and an L2 regularization term on the weight variables associated with the connections. In some embodiments, the sum of squared errors (sum of squared errors) between the output estimates Ŷ(1) to Ŷ(|NJ|) and the respective target values Y(1) to Y(|NJ|) may be substituted for the cross entropy in the objective function L. The L0 regularization term may be the product of the connection attenuation coefficient λ1 and the sum of the connectivity masks. The L2 regularization term may be the product of the weight attenuation coefficient λ2 and the sum of the squares of the weight variables. In some embodiments, the L2 regularization term may be removed from the objective function L. The artificial neural network 1 may be trained to minimize the output of the objective function L. Thus, the L0 regularization term penalizes a large number of connections, and the L2 regularization term penalizes large weight variables. The larger the connection attenuation coefficient λ1, the sparser the artificial neural network 1. The connection attenuation coefficient λ1 can be set to a large constant to push the connectivity masks towards 0, i.e., to push the connectivity variables in the negative direction, thereby creating a sparse connection structure for the artificial neural network 1. Only when a connection is important for reducing the cross entropy CE will the connectivity mask associated with that connection remain at 1. In this way, a balance between reducing the cross entropy CE and reducing the total number of connections is achieved, resulting in a sparse connection structure while providing output estimates Ŷ(1) to Ŷ(|NJ|) that substantially match the target values Y(1) to Y(|NJ|). Similarly, the weight attenuation coefficient λ2 can be set to a large constant to reduce the weight variables, while the cross entropy CE ensures that important weight variables remain in the artificial neural network 1, resulting in a simple and accurate model of the artificial neural network 1.
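The following is a minimal NumPy sketch of an objective of the form in equation (4), combining a cross-entropy loss with the L0-style connection-count term and an L2 term on the weight variables. It is an illustrative sketch under assumed shapes, coefficient values and names, not the patent's reference implementation:

```python
import numpy as np

def cross_entropy(y_logits, y_target):
    """Softmax cross entropy between output estimates and a one-hot target vector."""
    z = y_logits - y_logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -(y_target * log_probs).sum()

def objective(y_logits, y_target, masks, weight_vars, lam1=1e-3, lam2=1e-4):
    """Equation (4): L = CE + lam1 * sum of connectivity masks + lam2 * sum of squared weight variables."""
    ce = cross_entropy(y_logits, y_target)
    l0_term = lam1 * sum(m.sum() for m in masks)                   # total number of active connections
    l2_term = lam2 * sum((w_var ** 2).sum() for w_var in weight_vars)
    return ce + l0_term + l2_term
```

A larger lam1 drives more mask elements to 0 and hence a sparser network, mirroring the role of the connection attenuation coefficient λ1 described above.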
When training the connectivity variables, input data X(1) to X(|N1|) may be fed into the input layer Lyr(1) and forward-propagated from the layer Lyr(1) to the layer Lyr(J) to produce the output estimates Ŷ(1) to Ŷ(|NJ|). The error between the output estimates Ŷ(1) to Ŷ(|NJ|) and their respective target values Y(1) to Y(|NJ|) may be calculated and back-propagated from the layer Lyr(J) to the layer Lyr(2) to calculate the slopes of the objective function L with respect to the connectivity variables, and the connectivity variables are then adjusted according to these slopes, thereby reducing the total number of connections while reducing the performance loss of the artificial neural network 1. In particular, a connectivity variable m' can be adjusted continuously until the corresponding connectivity variable slope ∂L/∂m' reaches 0, so as to find a local minimum of the cross entropy CE. However, according to the chain rule, calculating the connectivity variable slope ∂L/∂m' involves the derivative of the unit step function in equation (3), and this derivative is 0 for almost all values of the connectivity variable, so that the connectivity variable slope would be 0, the training procedure would stall, and the connectivity variable would not be updated. To keep the connectivity variables trainable during the training procedure, the unit step function is skipped in back-propagation and the connectivity variable slope is redefined as the connectivity mask slope of the objective function L with respect to the connectivity mask m, which can be represented by equation (5):

∂L/∂m' := ∂L/∂m = (∂L/∂w) ☉ w'    formula (5)
referring to FIG. 2, a connectivity mask m and a connectivity variableThe dashed line in between indicates that the unit step function is skipped in the reverse propagation. Variable of connectivityMay mask slopes in accordance with connectivityAnd (6) updating. In some embodiments, the connectivity mask slopeCan be obtained by corresponding to the weight slopeAnd corresponding weight variableThe element-to-element multiplication of (c) results as shown in equation (5).In this way, when a connection is determined to be not important to reducing cross-entropy CE, the connection can be morphedUpdate from positive to negative and update the connectivity mask from 1 to 0. When it is determined that a connection is important to reduce cross entropy CE, the connection can be modifiedUpdate from negative to positive and update the connectivity mask from 0 to 1. In some embodiments, each small batch of input data sets may be input into the artificial neural network 1 to generate multiple sets of output estimatesToMultiple sets of output estimatesToCan be calculated, and a connectivity variableToTraining may be based on the inverse propagation of the average error. In some embodiments, to avoid slopesAnd weight variableIn a different range of the gradient of the connectivity variableOr connectivity mask slopeThe input data set for each small batch may be normalized to a standard deviation of 1 (normalized).
Similarly, when training the weight variables, the slopes of the objective function L with respect to the weight variables are calculated by back-propagation of the error, and the weight variables are then adjusted according to these slopes, thereby reducing the sum of the weight variables while reducing the performance loss of the artificial neural network 1. A weight variable w' may continue to be adjusted until the corresponding weight variable slope ∂L/∂w' reaches 0, so as to find a local minimum of the cross entropy CE. According to equation (2) and the chain rule, the weight variable slope can be represented by equation (6):

∂L/∂w' = (∂L/∂w) ☉ m    formula (6)
according to the formula (6), when the connectivity mask m is 0, the slope of the weight variableIs 0, resulting in a weight variableCannot be updated and the training procedure is terminated. To make the weight variableMaintaining a trainable form, weighting the slope of the variable during reverse propagationCan be redefined as the weight slope of the objective function L to the weight wAnd can be represented by equation (7):
by varying the slope of the weight variableRedefined as weight slopeWeight variable even when the connectivity mask m is 0Can also maintainCan be trained. Referring to FIG. 2, the weight w and the weight variableThe dashed lines in between indicate that the element-to-element multiplication is skipped when propagating in reverse. Slope of weightCan be obtained by reverse propagation. Whether the connectivity mask m is 1 or 0, the weight variableCan all depend on the slope of the weightAnd (6) updating. In this way, even some of the weight variablesToTemporarily masked by 0, and may train weight variablesTo
The artificial neural network 1 decomposes the weight w into a connectivity variable m' and a weight variable w', trains the connectivity variables to form a sparse connection structure, and trains the weight variables to produce a simple model of the artificial neural network 1. Furthermore, to keep the connectivity variables and the weight variables trainable, the connectivity variable slope is redefined as the connectivity mask slope, and the weight variable slope is redefined as the weight slope. The resulting sparse connection structure of the artificial neural network 1 can significantly reduce computational complexity, memory requirements and power consumption.
Fig. 3 is a flowchart of a training method 300 of the artificial neural network 1. The method 300 includes steps S302 to S306 for training the artificial neural network 1 to form a sparse connection structure. Step S302 is applied to a convolutional layer of the artificial neural network 1 to generate an output estimate, and steps S304 and S306 are applied to train the connectivity variables and the weight variables. Any reasonable technical change or step adjustment falls within the scope of the present disclosure. Steps S302 to S306 are explained below:
Step S302: the convolutional layer calculates an output estimate according to a weight w, the weight w being defined by a weight variable w' and a connectivity mask m, and the connectivity mask m being derived from a connectivity variable m';
Step S304: adjust the connectivity variables according to the objective function L to reduce the total number of connections and reduce the performance loss;
Step S306: adjust the weight variables according to the objective function L to reduce the sum of the weight variables.
The explanations of steps S302 to S306 have been provided in the previous paragraphs and are not repeated here. The training method 300 trains the connectivity variables and the weight variables separately to produce an artificial neural network 1 with sparse connections, a simple structure and correct output predictions.
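For illustration only, the following NumPy sketch assembles steps S302 to S306 into a training loop for a single reparameterized layer; the data, learning rate and attenuation coefficients are assumed placeholders rather than values taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))                  # a mini-batch of 32 input data sets
t = np.eye(3)[rng.integers(0, 3, size=32)]    # one-hot target values
w_var = 0.1 * rng.normal(size=(3, 4))         # weight variables w'
m_var = 0.05 * np.ones((3, 4))                # connectivity variables m', start connected
lam1, lam2, lr = 1e-3, 1e-4, 0.1

for step in range(200):
    # S302: forward pass with w = w' * H(m')
    m = (m_var > 0).astype(float)
    w = w_var * m
    logits = x @ w.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)

    # back-propagate the average cross-entropy error to obtain the weight slope dL/dw
    grad_logits = (p - t) / len(x)
    grad_w = grad_logits.T @ x

    # S304: update connectivity variables with the redefined slope (equation (5))
    m_var -= lr * (grad_w * w_var + lam1)
    # S306: update weight variables with the redefined slope (equation (7)) plus the L2 term
    w_var -= lr * (grad_w + 2 * lam2 * w_var)

print("remaining connections:", int((m_var > 0).sum()), "of", m_var.size)
```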
Fig. 4 shows an embodiment of a computing network 4 for constructing the artificial neural network 1. The computing network 4 includes a processor 402, a programming memory 404, a parameter memory 406, and an output interface 408. The programming memory 404 and the parameter memory 406 may be non-volatile memories. The processor 402 may be coupled to the programming memory 404, the parameter memory 406, and the output interface 408 to control their operations. The weights, weight variables, connectivity masks, connectivity variables and associated slopes may be stored in the parameter memory 406, while instructions related to training the connectivity variables and the weight variables may be loaded into the processor 402 from the programming memory 404 during the training process. The instructions may include code for causing a convolutional layer to calculate an output estimate according to a weight w defined by a weight variable w' and a connectivity mask m, code for adjusting the connectivity variables according to the objective function L, and code for adjusting the weight variables according to the objective function L. The adjusted connectivity variables and weight variables may be written back to the parameter memory 406 to replace the old data. The output interface 408 may display the output estimates Ŷ(1) to Ŷ(|NJ|) in response to an input data set.
The artificial neural network 1 and the training method 300 train the connectivity variables and the weight variables to generate a sparsely connected network while outputting correct output values.
The above-mentioned embodiments are only preferred embodiments of the present invention, and all equivalent changes and modifications made within the scope of the claims of the present invention should be covered by the present invention.
Claims (10)
1. A method for training a sparse connection neural network, the method being for training a computing network comprising a plurality of convolutional layers, the method comprising:
calculating an output estimate for one of the plurality of convolutional layers based on a weight defined by a weight variable and a connectivity mask, the connectivity mask representing a connection between the one of the plurality of convolutional layers and a previous convolutional layer and being derived from a connectivity variable; and
adjusting a plurality of connectivity variables according to an objective function to reduce a total number of connections between the plurality of convolutional layers and to reduce a performance loss representing a difference between the output estimate and a target value.
2. The method of claim 1, wherein adjusting the plurality of connectivity variables according to the objective function comprises:
calculating a connectivity mask slope of the objective function to the connectivity variable; and
updating the connectivity variable according to the connectivity mask slope.
3. The method of claim 1, further comprising:
the convolutional layer binarizes the connectivity variable according to a unit step function to generate the connectivity mask.
4. The method of claim 1, wherein the objective function comprises a first term corresponding to the performance loss and a second term corresponding to regularization of connectivity masks associated with the connections between the convolutional layers.
5. The method of claim 4, wherein the second term comprises a product of a connection attenuation coefficient and a sum of the plurality of connectivity masks associated with the plurality of connections between the plurality of convolutional layers.
6. The method of claim 4, wherein the objective function further comprises a third term corresponding to regularization of weight variables associated with the connections between the convolutional layers.
7. The method of claim 6, wherein the third term comprises a product of a weight attenuation coefficient and a sum of the weight variables associated with the connections between the convolutional layers.
8. The method of claim 1, wherein the performance loss is a cross entropy.
9. The method of claim 1, further comprising:
adjusting a plurality of weight variables associated with the plurality of connections between the plurality of convolutional layers according to the objective function to reduce a sum of the plurality of weight variables.
10. The method of claim 9, wherein adjusting the plurality of weight variables according to the objective function comprises:
calculating a weight slope of the objective function to the weight; and
updating the weight variable according to the weight slope.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962851652P | 2019-05-23 | 2019-05-23 | |
US62/851,652 | 2019-05-23 | ||
US16/746,941 | 2020-01-19 | ||
US16/746,941 US20200372363A1 (en) | 2019-05-23 | 2020-01-19 | Method of Training Artificial Neural Network Using Sparse Connectivity Learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111985603A true CN111985603A (en) | 2020-11-24 |
Family
ID=73441727
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010123340.9A Pending CN111985603A (en) | 2019-05-23 | 2020-02-27 | Method for training sparse connection neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111985603A (en) |
- 2020-02-27: CN application CN202010123340.9A filed in China (publication CN111985603A); status: active, pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210089922A1 (en) | Joint pruning and quantization scheme for deep neural networks | |
CN108345939B (en) | Neural network based on fixed-point operation | |
US10929744B2 (en) | Fixed-point training method for deep neural networks based on dynamic fixed-point conversion scheme | |
CN106796668B (en) | Method and system for bit-depth reduction in artificial neural network | |
CN109949255B (en) | Image reconstruction method and device | |
US11308392B2 (en) | Fixed-point training method for deep neural networks based on static fixed-point conversion scheme | |
Cai et al. | An optimal construction and training of second order RBF network for approximation and illumination invariant image segmentation | |
CN109784420B (en) | Image processing method and device, computer equipment and storage medium | |
US11449734B2 (en) | Neural network reduction device, neural network reduction method, and storage medium | |
US20220300823A1 (en) | Methods and systems for cross-domain few-shot classification | |
CN111008690A (en) | Method and device for learning neural network with adaptive learning rate | |
CN111937011A (en) | Method and equipment for determining weight parameters of neural network model | |
CN107292322B (en) | Image classification method, deep learning model and computer system | |
CN111630530B (en) | Data processing system, data processing method, and computer readable storage medium | |
TWI732467B (en) | Method of training sparse connected neural network | |
CN111985603A (en) | Method for training sparse connection neural network | |
CN112232477A (en) | Image data processing method, apparatus, device and medium | |
Burney et al. | A comparison of first and second order training algorithms for artificial neural networks | |
Duggal et al. | High performance squeezenext for cifar-10 | |
WO2019208248A1 (en) | Learning device, learning method, and learning program | |
Poikonen et al. | Online linear subspace learning in an analog array computing architecture | |
CN114580625A (en) | Method, apparatus, and computer-readable storage medium for training neural network | |
EP4060558B1 (en) | Deep learning based image segmentation method including biodegradable stent in intravascular optical tomography image | |
JP6942204B2 (en) | Data processing system and data processing method | |
US20240221171A1 (en) | Deep learning based image segmentation method including biodegradable stent in intravascular optical tomography image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |