CN111985603A - Method for training sparse connection neural network - Google Patents
Method for training sparse connection neural network
- Publication number
- CN111985603A (application CN202010123340.9A)
- Authority
- CN
- China
- Prior art keywords
- weight
- connectivity
- variable
- mask
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Feedback Control In General (AREA)
Abstract
The invention provides a method for training a sparsely connected neural network. When the neural network is trained, each weight is decomposed into the product of a weight variable and a binary mask, wherein the binary mask is obtained by passing a mask variable through a unit step function. Each element of the binary mask indicates whether the weight at the corresponding position has a connection: 0 represents no connection, and 1 represents a connection. If the majority of the elements of the binary mask are 0, the training results in a sparsely connected neural network. The number of connected weights, i.e., the number of elements of the binary mask that are 1, is taken as one term of the objective function. During training, the weight variables and the mask variables are adjusted according to the objective function, and the values of the mask variables are gradually attenuated so as to ensure the sparsity of the binary mask.
Description
Technical Field
The present invention relates to artificial neural networks, and in particular to a method for training sparsely connected neural networks.
Background
An artificial neural network is a network that includes a plurality of processing units arranged in multiple layers. A neural network trained by a general training method is often densely connected (densely connected), i.e., all weights are non-zero. However, such network architectures are typically complex, require significant memory resources and power consumption, and often suffer from over-fitting. A neural network with sparse weights can be obtained by pruning. Pruning sets weights with small absolute values to 0; however, the absolute value of a weight does not necessarily represent the importance of its connection, so it is difficult to obtain an optimal connection structure in this way.
Disclosure of Invention
The embodiment of the invention provides a method for training a sparsely connected neural network. The method comprises the following steps: during training of the neural network, the weights are decomposed into products of weight variables and binary (0/1) masks, where the binary masks are derived from mask variables by a unit step function. Each element of the binary mask indicates whether the weight at the corresponding position has a connection: 0 represents no connection, and 1 represents a connection. If most elements of the binary mask are 0, the training results in a sparsely connected neural network. The number of connected weights, i.e., the number of elements of the binary mask that are 1, is taken as one term of the objective function. The training process adjusts the weight variables and the mask variables according to the objective function. The values of the mask variables are gradually attenuated during training, which ensures that the binary mask is sparse. Since the mask variables are determined by the objective function, only the binary mask elements corresponding to a few important weights are 1.
Therefore, an artificial neural network with sparse connections, a simple structure and correct output predictions is generated, and the resulting sparse connection structure can significantly reduce computational complexity, memory requirements and power consumption.
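As a purely illustrative sketch (the array values, names and shapes are assumptions for illustration, not taken from the embodiments below), the decomposition of weights into weight variables and a binary mask obtained through a unit step function can be expressed in a few lines of NumPy:

```python
import numpy as np

mask_var = np.array([[0.3, -0.2], [-0.5, 0.1]])   # mask variables, decayed during training
weight_var = np.array([[1.5, 0.8], [-0.9, 2.0]])  # weight variables (connection strengths)

binary_mask = (mask_var > 0).astype(float)        # unit step function: 1 = connected, 0 = no connection
weights = weight_var * binary_mask                # weight = weight variable x binary mask

print(binary_mask)              # [[1. 0.] [0. 1.]]
print(int(binary_mask.sum()))   # number of connected weights, used as one term in the objective
```

When the mask variables are pushed negative by training, more mask elements become 0 and the remaining weights form a sparse connection structure.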
Drawings
Fig. 1 is a computational graph of an artificial neural network according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the convolutional layer of the artificial neural network in fig. 1.
Fig. 3 is a flowchart of a training method of the artificial neural network in fig. 1.
Fig. 4 is a schematic diagram of a computing network for constructing the artificial neural network in Fig. 1 according to an embodiment.
Reference numerals:
1: artificial neural network
300: training method
S302 to S306: Steps
4: Computing network
402: Processor
404: Programming memory
406: Parameter memory
408: Output interface
Lyr(1) to Lyr(J): Layers
m: Connectivity mask
w: Weight
Y(1) to Y(|NJ|): Target values
*: convolution operation
☉: element-to-element multiplication
Detailed Description
Fig. 1 is a computational graph of an artificial neural network 1 according to an embodiment of the present invention. The artificial neural network 1 is a fully connected neural network (fully connected neural network), although the invention is applicable to various types of neural networks such as convolutional neural networks (convolutional neural networks). The artificial neural network 1 generates output estimates Ŷ(1) to Ŷ(|NJ|) in response to input data X(1) to X(|N1|). The input data X(1) to X(|N1|) may be current levels, voltage levels, real signals, complex signals, analog signals or digital signals. For example, the input data X(1) to X(|N1|) may be gray-scale values of pixels obtained by an input device such as a mobile phone, tablet computer, or digital camera. The output estimates Ŷ(1) to Ŷ(|NJ|) may represent the probabilities of various classification results of the artificial neural network 1. For example, the output estimates Ŷ(1) to Ŷ(|NJ|) may be the probabilities of multiple objects being identified in an image. A set of input data X(1) to X(|N1|) may be referred to as an input data set. The artificial neural network 1 may be trained using sets of input data and respective sets of target values. In some embodiments, the input data sets may be divided into a plurality of mini-batches during training. For example, 32,000 input data sets may be divided into 1,000 mini-batches, each having 32 input data sets.
The artificial neural network 1 may comprise layers Lyr(1) to Lyr(J), J being a positive integer greater than 1. The layer Lyr(1) may be referred to as the input layer, the layer Lyr(J) may be referred to as the output layer, and the layers Lyr(2) to Lyr(J-1) may be referred to as hidden layers. Each layer Lyr(j) may include a plurality of processing nodes coupled by connections C(j,1) to C(j,|Cj|) to a plurality of processing nodes in the previous layer Lyr(j-1), j being a layer index between 2 and J, and |Cj| being the total number of connections between the layer Lyr(j) and the previous layer Lyr(j-1). The input layer Lyr(1) may comprise processing nodes n(1,1) to n(1,|N1|), where the first index denotes a layer index, the second index denotes a node index, and |N1| is the total number of processing nodes of the layer Lyr(1). The processing nodes n(1,1) to n(1,|N1|) may respectively receive the input data X(1) to X(|N1|). Each hidden layer Lyr(j) of the hidden layers Lyr(2) to Lyr(J-1) may include processing nodes n(j,1) to n(j,|Nj|), where |Nj| is the total number of processing nodes of the hidden layer Lyr(j). The output layer Lyr(J) may comprise processing nodes n(J,1) to n(J,|NJ|), where |NJ| is the total number of processing nodes of the output layer Lyr(J). The processing nodes n(J,1) to n(J,|NJ|) may respectively generate the output estimates Ŷ(1) to Ŷ(|NJ|).
Each processing node in the layer Lyr(j) may be coupled to one or more processing nodes in the previous layer Lyr(j-1) via its connections. Each connection may be associated with a weight, and a processing node may compute a weighted sum of the data received from the one or more processing nodes in the previous layer Lyr(j-1). When generating the weighted sum, connections associated with larger weights are more influential than connections associated with smaller weights. When a weight is 0, the connection associated with that weight can be regarded as removed from the artificial neural network 1, thereby achieving network connection sparsity and reducing computational complexity, power consumption and operational cost. The artificial neural network 1 may be trained to produce an optimized sparse network configuration that uses a small or minimal number of connections to achieve output estimates Ŷ(1) to Ŷ(|NJ|) approximately matching the respective target values Y(1) to Y(|NJ|).
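As an illustration of why zero weights reduce computational cost, the following minimal NumPy sketch (not part of the patent embodiment; the function name and values are chosen only for illustration) computes a node's weighted sum while skipping connections whose weights are 0:

```python
import numpy as np

def node_weighted_sum(inputs, weights):
    """Weighted sum of a node's inputs, skipping zero-weight (removed) connections."""
    total = 0.0
    for x, w in zip(inputs, weights):
        if w == 0.0:        # connection regarded as removed from the network
            continue
        total += w * x
    return total

inputs = np.array([0.3, 0.8, 0.1, 0.5])
weights = np.array([0.0, 1.2, 0.0, -0.7])   # two of four connections are pruned
print(node_weighted_sum(inputs, weights))   # only two multiply-adds are performed
```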
The method can be applied to different network types, such as fully connected neural networks or convolutional neural networks. In the computation, a fully connected layer of a fully connected neural network can be equivalently converted into a convolutional layer in which the input feature map (feature map) has a size of 1x1 (with N1 channels for the input layer Lyr(1) in Fig. 1) and the convolution kernel (convolutional kernel) has a size of 1x1 (1x1xN1xN2 in Fig. 1), N1 and N2 being positive integers. The training method for the sparsely connected network is therefore described with reference to Fig. 2 in the form of a convolutional layer. Fig. 2 shows a convolutional layer, which may correspond to one of the layers Lyr(2) to Lyr(J) of the artificial neural network 1. The convolutional layer may be coupled to the previous convolutional layer via connections. The convolutional layer may receive input data x from the previous convolutional layer and perform a convolution operation on the input data x and a weight w to calculate an output estimate y, as expressed by equation (1):
y = w * x    formula (1)
The input data x may have a size of (1x1). The weight w may be referred to as a convolution kernel and may have a size of (1x1). "*" denotes the convolution operation. The output estimate y may be sent to the subsequent convolutional layer as its input data to calculate a subsequent output estimate. The weight w may be reparameterized into a weight variable w' and a connectivity mask m, as expressed by equation (2):

w = w' ☉ m    formula (2)
the connectivity mask m may be binary data representing connectivity of a connection, where 1 represents having a connection and 0 represents not having a connection. Weight variableMay indicate the strength of the connection. "☉" may represent an element-to-element (element-wise) multiplication. The connectivity mask m can be varied in number by varying the connectivityPerforming a unit ladder operation H (-) derivation, as expressed by equation (3):
the convolutional layer can be operated according to unit ladder H (-) to the connectivity variableBinarization is performed to produce a connectivity mask m. By parameterizing the weight w, the connectivity and strength of the connection can be adjusted by adjusting the connectivity variables, respectivelyAnd weight variableAnd training is performed. If the connectivity is variableLess than or equal to 0, weight variableMay be masked by 0 to generate a 0 weight w if the connectivity variable isOver 0, weight variableMay be set to the weight w.
In the artificial neural network 1, the connections of each layer Lyr(j) may be associated with respective connectivity variables m'(j,1) to m'(j,|Cj|) and weight variables w'(j,1) to w'(j,|Cj|). The connectivity variables and the weight variables are trained according to an objective function to reduce the total number of connections while reducing the performance loss of the artificial neural network 1. The total number of connections can be calculated by summing all the connectivity masks m(j,1) to m(j,|Cj|) over all layers. The performance loss may represent the difference between the output estimates Ŷ(1) to Ŷ(|NJ|) and the respective target values Y(1) to Y(|NJ|), and can be calculated in the form of cross entropy. The objective function L can be represented by equation (4):

L = CE + λ1 · Σj Σi m(j,i) + λ2 · Σj Σi w'(j,i)²    formula (4)
where CE is the cross entropy (cross entropy);
j is the layer index;
i is a mask index or a weight index;
- |Cj| is the total number of connections of layer j; λ1 is a connection attenuation coefficient; and λ2 is a weight attenuation coefficient.
The objective function L may include the cross entropy CE between the output estimates Ŷ(1) to Ŷ(|NJ|) and the respective target values Y(1) to Y(|NJ|), an L0 regularization term on the total number of connections, and an L2 regularization term on the weight variables associated with the connections. In some embodiments, the sum of squared errors (sum of squared errors) between the output estimates Ŷ(1) to Ŷ(|NJ|) and the respective target values Y(1) to Y(|NJ|) may be substituted for the cross entropy in the objective function L. The L0 regularization term may be the product of the connection attenuation coefficient λ1 and the sum of the connectivity masks. The L2 regularization term may be the product of the weight attenuation coefficient λ2 and the sum of the squares of the weight variables. In some embodiments, the L2 regularization term may be removed from the objective function L. The artificial neural network 1 may be trained to minimize the output of the objective function L. Thus, the L0 regularization term penalizes a large number of connections, and the L2 regularization term penalizes large weight variables. The larger the connection attenuation coefficient λ1, the sparser the artificial neural network 1. The connection attenuation coefficient λ1 can be set to a large constant to push the connectivity masks towards 0, i.e., to push the connectivity variables in the negative direction, thereby creating a sparse connection structure for the artificial neural network 1. Only when a connection is important for reducing the cross entropy CE will the connectivity mask associated with that connection remain at 1. In this way, a balance between reducing the cross entropy CE and reducing the total number of connections is achieved, resulting in a sparse connection structure while providing output estimates Ŷ(1) to Ŷ(|NJ|) that substantially match the target values Y(1) to Y(|NJ|). Similarly, the weight attenuation coefficient λ2 can be set to a large constant to reduce the weight variables, while the cross entropy CE ensures that important weight variables remain in the artificial neural network 1, resulting in a simple and accurate model of the artificial neural network 1.
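The following is a minimal NumPy sketch of an objective of the form in equation (4), combining a cross-entropy loss with the L0-style connection-count term and an L2 term on the weight variables. It is an illustrative sketch under assumed shapes, coefficient values and names, not the patent's reference implementation:

```python
import numpy as np

def cross_entropy(y_logits, y_target):
    """Softmax cross entropy between output estimates and a one-hot target vector."""
    z = y_logits - y_logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -(y_target * log_probs).sum()

def objective(y_logits, y_target, masks, weight_vars, lam1=1e-3, lam2=1e-4):
    """Equation (4): L = CE + lam1 * sum of connectivity masks + lam2 * sum of squared weight variables."""
    ce = cross_entropy(y_logits, y_target)
    l0_term = lam1 * sum(m.sum() for m in masks)                   # total number of active connections
    l2_term = lam2 * sum((w_var ** 2).sum() for w_var in weight_vars)
    return ce + l0_term + l2_term
```

A larger lam1 drives more mask elements to 0 and hence a sparser network, mirroring the role of the connection attenuation coefficient λ1 described above.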
When training the connectivity variables, input data X(1) to X(|N1|) may be fed into the input layer Lyr(1) and forward-propagated from the layer Lyr(1) to the layer Lyr(J) to produce the output estimates Ŷ(1) to Ŷ(|NJ|). The error between the output estimates Ŷ(1) to Ŷ(|NJ|) and their respective target values Y(1) to Y(|NJ|) may be calculated and back-propagated from the layer Lyr(J) to the layer Lyr(2) to calculate the slopes of the objective function L with respect to the connectivity variables, and the connectivity variables are then adjusted according to these slopes, thereby reducing the total number of connections while reducing the performance loss of the artificial neural network 1. In particular, a connectivity variable m' can be adjusted continuously until the corresponding connectivity variable slope ∂L/∂m' reaches 0, so as to find a local minimum of the cross entropy CE. However, according to the chain rule, calculating the connectivity variable slope ∂L/∂m' involves the derivative of the unit step function in equation (3), and this derivative is 0 for almost all values of the connectivity variable, so that the connectivity variable slope would be 0, the training procedure would stall, and the connectivity variable would not be updated. To keep the connectivity variables trainable during the training procedure, the unit step function is skipped in back-propagation and the connectivity variable slope is redefined as the connectivity mask slope of the objective function L with respect to the connectivity mask m, which can be represented by equation (5):

∂L/∂m' := ∂L/∂m = (∂L/∂w) ☉ w'    formula (5)
referring to FIG. 2, a connectivity mask m and a connectivity variableThe dashed line in between indicates that the unit step function is skipped in the reverse propagation. Variable of connectivityMay mask slopes in accordance with connectivityAnd (6) updating. In some embodiments, the connectivity mask slopeCan be obtained by corresponding to the weight slopeAnd corresponding weight variableThe element-to-element multiplication of (c) results as shown in equation (5).In this way, when a connection is determined to be not important to reducing cross-entropy CE, the connection can be morphedUpdate from positive to negative and update the connectivity mask from 1 to 0. When it is determined that a connection is important to reduce cross entropy CE, the connection can be modifiedUpdate from negative to positive and update the connectivity mask from 0 to 1. In some embodiments, each small batch of input data sets may be input into the artificial neural network 1 to generate multiple sets of output estimatesToMultiple sets of output estimatesToCan be calculated, and a connectivity variableToTraining may be based on the inverse propagation of the average error. In some embodiments, to avoid slopesAnd weight variableIn a different range of the gradient of the connectivity variableOr connectivity mask slopeThe input data set for each small batch may be normalized to a standard deviation of 1 (normalized).
Similarly, when training the weight variables, the slopes of the objective function L with respect to the weight variables are calculated by back-propagation of the error, and the weight variables are then adjusted according to these slopes, thereby reducing the sum of the weight variables while reducing the performance loss of the artificial neural network 1. A weight variable w' may continue to be adjusted until the corresponding weight variable slope ∂L/∂w' reaches 0, so as to find a local minimum of the cross entropy CE. According to equation (2) and the chain rule, the weight variable slope can be represented by equation (6):

∂L/∂w' = (∂L/∂w) ☉ m    formula (6)
according to the formula (6), when the connectivity mask m is 0, the slope of the weight variableIs 0, resulting in a weight variableCannot be updated and the training procedure is terminated. To make the weight variableMaintaining a trainable form, weighting the slope of the variable during reverse propagationCan be redefined as the weight slope of the objective function L to the weight wAnd can be represented by equation (7):
by varying the slope of the weight variableRedefined as weight slopeWeight variable even when the connectivity mask m is 0Can also maintainCan be trained. Referring to FIG. 2, the weight w and the weight variableThe dashed lines in between indicate that the element-to-element multiplication is skipped when propagating in reverse. Slope of weightCan be obtained by reverse propagation. Whether the connectivity mask m is 1 or 0, the weight variableCan all depend on the slope of the weightAnd (6) updating. In this way, even some of the weight variablesToTemporarily masked by 0, and may train weight variablesTo
The artificial neural network 1 decomposes the weight w into a connectivity variable m' and a weight variable w', trains the connectivity variables to form a sparse connection structure, and trains the weight variables to produce a simple model of the artificial neural network 1. Furthermore, to keep the connectivity variables and the weight variables trainable, the connectivity variable slope is redefined as the connectivity mask slope, and the weight variable slope is redefined as the weight slope. The resulting sparse connection structure of the artificial neural network 1 can significantly reduce computational complexity, memory requirements and power consumption.
Fig. 3 is a flowchart of a training method 300 of the artificial neural network 1. The method 300 includes steps S302 to S306 for training the artificial neural network 1 to form a sparse connection structure. Step S302 is applied to a convolutional layer of the artificial neural network 1 to generate an output estimate, and steps S304 and S306 are applied to train the connectivity variables and the weight variables. Any reasonable technical change or step adjustment falls within the scope of the present disclosure. Steps S302 to S306 are explained below:
Step S302: the convolutional layer calculates an output estimate according to a weight w, the weight w being defined by a weight variable w' and a connectivity mask m, and the connectivity mask m being derived from a connectivity variable m';
Step S304: adjust the connectivity variables according to the objective function L to reduce the total number of connections and reduce the performance loss;
Step S306: adjust the weight variables according to the objective function L to reduce the sum of the weight variables.
The explanations of steps S302 to S306 have been provided in the previous paragraphs and are not repeated here. The training method 300 trains the connectivity variables and the weight variables separately to produce an artificial neural network 1 with sparse connections, a simple structure and correct output predictions.
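For illustration only, the following NumPy sketch assembles steps S302 to S306 into a training loop for a single reparameterized layer; the data, learning rate and attenuation coefficients are assumed placeholders rather than values taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))                  # a mini-batch of 32 input data sets
t = np.eye(3)[rng.integers(0, 3, size=32)]    # one-hot target values
w_var = 0.1 * rng.normal(size=(3, 4))         # weight variables w'
m_var = 0.05 * np.ones((3, 4))                # connectivity variables m', start connected
lam1, lam2, lr = 1e-3, 1e-4, 0.1

for step in range(200):
    # S302: forward pass with w = w' * H(m')
    m = (m_var > 0).astype(float)
    w = w_var * m
    logits = x @ w.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)

    # back-propagate the average cross-entropy error to obtain the weight slope dL/dw
    grad_logits = (p - t) / len(x)
    grad_w = grad_logits.T @ x

    # S304: update connectivity variables with the redefined slope (equation (5))
    m_var -= lr * (grad_w * w_var + lam1)
    # S306: update weight variables with the redefined slope (equation (7)) plus the L2 term
    w_var -= lr * (grad_w + 2 * lam2 * w_var)

print("remaining connections:", int((m_var > 0).sum()), "of", m_var.size)
```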
Fig. 4 shows an embodiment of a computing network 4 for constructing the artificial neural network 1. The computing network 4 includes a processor 402, a programming memory 404, a parameter memory 406, and an output interface 408. The programming memory 404 and the parameter memory 406 may be non-volatile memories. The processor 402 may be coupled to the programming memory 404, the parameter memory 406, and the output interface 408 to control their operations. The weights, weight variables, connectivity masks, connectivity variables and associated slopes may be stored in the parameter memory 406, while instructions related to training the connectivity variables and the weight variables may be loaded into the processor 402 from the programming memory 404 during the training process. The instructions may include code for causing a convolutional layer to calculate an output estimate according to a weight w defined by a weight variable w' and a connectivity mask m, code for adjusting the connectivity variables according to the objective function L, and code for adjusting the weight variables according to the objective function L. The adjusted connectivity variables and weight variables may be written back to the parameter memory 406 to replace the old data. The output interface 408 may display the output estimates Ŷ(1) to Ŷ(|NJ|) in response to an input data set.
The artificial neural network 1 and the training method 300 train the connectivity variables and the weight variables to generate a sparsely connected network while outputting correct output values.
The above-mentioned embodiments are only preferred embodiments of the present invention, and all equivalent changes and modifications made within the scope of the claims of the present invention should be covered by the present invention.
Claims (10)
1. A method for training a sparse connection neural network, the method being for training a computing network comprising a plurality of convolutional layers, the method comprising:
calculating an output estimate for one of the plurality of convolutional layers based on a weight defined by a weight variable and a connectivity mask, the connectivity mask representing a connection between the one of the plurality of convolutional layers and a previous convolutional layer and being derived from a connectivity variable; and
adjusting a plurality of connectivity variables according to an objective function to reduce a total number of connections between the plurality of convolutional layers and to reduce a performance loss representing a difference between the output estimate and a target value.
2. The method of claim 1, wherein adjusting the plurality of connectivity variables according to the objective function comprises:
calculating a connectivity mask slope of the objective function to the connectivity variable; and
updating the connectivity variable according to the connectivity mask slope.
3. The method of claim 1, further comprising:
the convolutional layer binarizes the connectivity variable according to a unit step function to generate the connectivity mask.
4. The method of claim 1, wherein the objective function comprises a first term corresponding to the performance loss and a second term corresponding to regularization of connectivity masks associated with the connections between the convolutional layers.
5. The method of claim 4, wherein the second term comprises a product of a connection attenuation coefficient and a sum of the plurality of connectivity masks associated with the plurality of connections between the plurality of convolutional layers.
6. The method of claim 4, wherein the objective function further comprises a third term corresponding to regularization of weight variables associated with the connections between the convolutional layers.
7. The method of claim 6, wherein the third term comprises a product of a weight attenuation coefficient and a sum of the weight variables associated with the connections between the convolutional layers.
8. The method of claim 1, wherein the performance loss is a cross entropy.
9. The method of claim 1, further comprising:
adjusting a plurality of weight variables associated with the plurality of connections between the plurality of convolutional layers according to the objective function to reduce a sum of the plurality of weight variables.
10. The method of claim 9, wherein adjusting the plurality of weight variables according to the objective function comprises:
calculating a weight slope of the objective function to the weight; and
updating the weight variable according to the weight slope.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962851652P | 2019-05-23 | 2019-05-23 | |
US62/851,652 | 2019-05-23 | ||
US16/746,941 | 2020-01-19 | ||
US16/746,941 US20200372363A1 (en) | 2019-05-23 | 2020-01-19 | Method of Training Artificial Neural Network Using Sparse Connectivity Learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111985603A true CN111985603A (en) | 2020-11-24 |
Family
ID=73441727
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010123340.9A Pending CN111985603A (en) | 2019-05-23 | 2020-02-27 | Method for training sparse connection neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111985603A (en) |
- 2020-02-27: CN application CN202010123340.9A filed in China (publication CN111985603A); status: active, pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210089922A1 (en) | Joint pruning and quantization scheme for deep neural networks | |
CN108345939B (en) | Neural network based on fixed-point operation | |
US10929744B2 (en) | Fixed-point training method for deep neural networks based on dynamic fixed-point conversion scheme | |
CN106796668B (en) | Method and system for bit-depth reduction in artificial neural network | |
CN109949255B (en) | Image reconstruction method and device | |
US11308392B2 (en) | Fixed-point training method for deep neural networks based on static fixed-point conversion scheme | |
Cai et al. | An optimal construction and training of second order RBF network for approximation and illumination invariant image segmentation | |
CN109784420B (en) | Image processing method and device, computer equipment and storage medium | |
US11449734B2 (en) | Neural network reduction device, neural network reduction method, and storage medium | |
US20220300823A1 (en) | Methods and systems for cross-domain few-shot classification | |
CN111008690A (en) | Method and device for learning neural network with adaptive learning rate | |
CN111937011A (en) | Method and equipment for determining weight parameters of neural network model | |
CN107292322B (en) | Image classification method, deep learning model and computer system | |
CN111630530B (en) | Data processing system, data processing method, and computer readable storage medium | |
TWI732467B (en) | Method of training sparse connected neural network | |
CN111985603A (en) | Method for training sparse connection neural network | |
CN112232477A (en) | Image data processing method, apparatus, device and medium | |
Burney et al. | A comparison of first and second order training algorithms for artificial neural networks | |
Duggal et al. | High performance squeezenext for cifar-10 | |
WO2019208248A1 (en) | Learning device, learning method, and learning program | |
Poikonen et al. | Online linear subspace learning in an analog array computing architecture | |
CN114580625A (en) | Method, apparatus, and computer-readable storage medium for training neural network | |
EP4060558B1 (en) | Deep learning based image segmentation method including biodegradable stent in intravascular optical tomography image | |
JP6942204B2 (en) | Data processing system and data processing method | |
US20240221171A1 (en) | Deep learning based image segmentation method including biodegradable stent in intravascular optical tomography image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |