Background
The deep learning model is a core algorithm of current artificial intelligence technology; it depends on large amounts of labeled data and achieves nonlinear fitting of complex problems through hierarchical modeling. In current practice, deep learning techniques have been successful in fields such as image recognition and speech processing, and they continue to spread into other industries.
In order to process complex data, current deep learning models often have hundreds of millions of parameters. Besides the large amount of time and computing resources consumed in the training phase, a large amount of storage is occupied during deployment and inference, and inference is slow. Where computing resources are limited, for example on mobile terminals, the application of deep learning systems is therefore restricted.
Deep learning model compression mainly addresses the problem of excessive model parameter counts, and current research in this field mainly focuses on the following four directions:
(1) Matrix low-rank decomposition: a deep learning model involves a large number of matrix operations; by decomposing a large-scale low-rank matrix into several small matrices, the data volume of the matrix can be greatly reduced while the calculation result remains essentially unchanged.
(2) Model pruning and parameter quantization: the main starting point of model pruning is that a deep learning model is often over-parameterized, so the network contains redundant structures and parameters; redundant parameters and neurons are deleted according to rules such as importance. Quantization simplifies the data type in which the weights are stored, for example converting from floating point to integer, so as to reduce storage. This type of approach tends to degrade the performance of the model.
(3) Network Architecture Search (NAS): within a given model design space, a machine automatically searches for an optimal structure, thereby realizing model compression. Such methods can be computationally expensive during the search process.
(4) Knowledge Distillation (KD): a smaller student model is trained with the help of an already trained teacher model, so that the small model achieves improved performance with fewer parameters.
Disclosure of Invention
The invention aims to provide a deep learning model compression method based on decision boundaries, which solves the problems in the prior art.
A deep learning model compression method based on decision boundaries comprises the following steps:
step one, carrying out feature mapping;
step two, performing segmented linearization on the activation function;
step three, calculating sub-decision regions: the sub-decision regions of the fully connected layers are calculated;
step four, constructing the decision network: the corresponding decision boundaries are calculated according to the sub-decision regions and a new decision network is constructed.
Further, in step one, if the object of model compression is a fully connected neural network, this step is skipped and step two is executed directly.
Further, in step one, if the object is the fully connected part of a CNN model, the model is regarded as a composite of two parts, f = g_MLP(g_CNN(x_0)); g_CNN(x_0) is treated as a feature map, the new sample set D' = {x' = g_CNN(x)} is constructed, and the remaining part is then handled as a fully connected neural network.
Further, in step two, if the activation function is already a piecewise linear function, this step is skipped and step three is executed directly.
Further, in step two, for an activation function that is not piecewise linear, a piecewise linearization technique is adopted: a piecewise linear function close to the activation function is found and used as an approximate substitute, so that the activation function is converted into a piecewise linear function.
Further, in step two, for an activation function that is not piecewise linear, the procedure is specifically as follows:
first, a hard approximation function hard-σ(x) of the activation function σ(x) is generated, specifically as follows:
According to the required number of segments L = n + 2 and an acceptable error δ > 0, two segmentation points a_0 and a_n are first selected such that on the two intervals (-∞, a_0] and [a_n, +∞) the condition |σ(x) - hard-σ(x)| ≤ δ is satisfied. On the interval [a_0, a_n], the division points a_1, a_2, ..., a_(n-1) are taken directly at equal spacing, and according to the point pairs (a_1, hard-σ(a_1)), (a_2, σ(a_2)), (a_3, σ(a_3)), ..., (a_(n-2), σ(a_(n-2))), (a_(n-1), hard-σ(a_(n-1))), the L = n + 2 segments are connected in sequence to obtain the piecewise linear approximation function of the original activation function.
Further, in step three, for a model whose activation function is a piecewise linear function (including one obtained through step two), the invention first calculates the decision boundary. Concretely, denoting the breakpoints of the piecewise linear activation function used by μ_0 < μ_1 < ... < μ_n, the procedure is as follows:
step three one, traversing samples: according to the training sample set, each sample is input into the deep learning model f(x) in sequence without executing the back propagation process, while the activation states of all fully connected layer activation functions are recorded;
step three two, the activation states of all fully connected layer neurons are counted and arranged in sequence into an overall state vector S = [s_1, s_2, ..., s_m]; following step three one, the overall state vectors of all samples are collected to obtain the overall state vector set Φ = {S_1, S_2, ..., S_N} of the samples;
step three three, Φ is sorted, and identical overall state vectors are merged to obtain the reduced overall state vector set Φ' = {S'_1, S'_2, ..., S'_q}; according to the number q of elements of Φ', samples having the same activation state S'_p (1 ≤ p ≤ q) are assigned to the same sub-region, and samples belonging to the same sub-region are described by the same linear model g_i(x) = w_i x + b_i (i = 1, 2, ..., q); the equivalent linear model g_i(x) = w_i x + b_i is obtained by direct calculation from the parameters of the fully connected layers and the overall activation state vector, and the set of all sub-models is written G = {g_1, g_2, ..., g_q};
step three four, the decision boundaries of all sub-models are calculated: according to the definition of the decision boundary, an N-class classification problem has one class decision boundary for each pair of classes, N(N-1)/2 in total, and for the linear model g_i(x) of one sub-region the corresponding decision boundaries are calculated directly. Specifically, the decision boundaries of all sub-region models are calculated to form the decision boundary hyperplane set DB.
Further, in step four, specifically, a decision network DNet is constructed according to the decision boundary hyperplane set DB obtained in step three. The network contains only one hidden layer and, unlike a general neural network, the output of the decision network DNet is a position code relative to the decision boundaries, recorded as 0/1: specifically, for a hyperplane P_l and a sample x_0, the sample is substituted directly into the hyperplane formula to calculate the output; if the result is positive it is marked 1, and if negative it is marked 0.
Through the decision network, a relative position code of the data with respect to all elements of the decision boundary set DB is obtained.
According to the properties of the decision boundary, samples with the same position code must belong to the same class.
By traversing the training set data D = {(x_i, C_i) | i = 1, 2, ..., N}, the class labeling of the position codes of the decision network is completed; when a new sample is input, its class can be determined simply by comparing it with the labeled position codes.
The invention has the following beneficial effects:
1. The invention realizes efficient model compression of the fully connected layers.
2. Compared with the prior art, in which precision is reduced, the method can realize lossless compression for a model whose activation function is piecewise linear. For other nonlinear activation functions that have infinite asymptotes, model compression with controllable accuracy can be achieved.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a deep learning model compression method based on decision boundaries, which comprises the following steps:
step one, carrying out feature mapping;
step two, performing segmented linearization on the activation function;
step three, calculating sub-decision regions: the sub-decision regions of the fully connected layers are calculated;
step four, constructing the decision network: the corresponding decision boundaries are calculated according to the sub-decision regions and a new decision network is constructed.
Further, in step one, if the object of model compression is a fully connected neural network, this step is skipped and step two is executed directly.
Further, in step one, if the object is the fully connected part of a CNN model, the model is regarded as a composite of two parts, f = g_MLP(g_CNN(x_0)); g_CNN(x_0) is treated as a feature map, the new sample set D' = {x' = g_CNN(x)} is constructed, and the fully connected part of the original model f is then handled as a fully connected neural network.
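As an illustrative sketch of this feature-mapping step (not taken from the original text; the use of PyTorch and the name cnn_part are assumptions of the example), the convolutional part g_CNN can be run once over the data to precompute the feature sample set D', after which only the fully connected part needs to be compressed:

```python
import torch
import torch.nn as nn

def build_feature_dataset(cnn_part: nn.Module, samples: torch.Tensor) -> torch.Tensor:
    """Construct D' = { x' = g_cnn(x) } from the convolutional feature extractor."""
    cnn_part.eval()
    with torch.no_grad():                  # inference only, no back propagation
        feats = cnn_part(samples)          # x' = g_cnn(x)
    return feats.flatten(start_dim=1)      # flatten feature maps into vectors for the MLP
```

The fully connected part is then handled on these precomputed feature vectors exactly as an ordinary fully connected neural network, as described in the following steps.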
Further, in step two, if the activation function is already a piecewise linear function, this step is skipped and step three is executed directly.
Further, in step two, for an activation function that is not piecewise linear, a piecewise linearization technique is adopted: a piecewise linear function close to the activation function is found and used as an approximate substitute, so that the activation function is converted into a piecewise linear function.
Further, in step two, for an activation function that is not piecewise linear, such as the sigmoid or tanh function, the piecewise linearization technique can be adopted to perform an approximate substitution by finding a piecewise linear function close to the activation function. The specific steps are as follows:
Since existing activation functions generally have infinite asymptotes, a hard approximation function hard-σ(x) of the activation function σ(x) is first generated according to these asymptotes, specifically as follows:
According to the required number of segments L = n + 2 and an acceptable error δ > 0, two segmentation points a_0 and a_n are first selected such that on the two intervals (-∞, a_0] and [a_n, +∞) the condition |σ(x) - hard-σ(x)| ≤ δ is satisfied. On the interval [a_0, a_n], the division points a_1, a_2, ..., a_(n-1) are taken directly at equal spacing, and according to the point pairs (a_1, hard-σ(a_1)), (a_2, σ(a_2)), (a_3, σ(a_3)), ..., (a_(n-2), σ(a_(n-2))), (a_(n-1), hard-σ(a_(n-1))), the L = n + 2 segments are connected in sequence to obtain the piecewise linear approximation function of the original activation function. The decision network can then be generated according to the process of steps one to four, thereby realizing model compression.
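A minimal sketch of this linearization for the sigmoid activation (the cut-off points a_0 = -6 and a_n = 6 and the use of NumPy are assumptions of the example; with these values the error on the two outer intervals stays below roughly 0.0025):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def piecewise_linear_sigmoid(n: int, a0: float = -6.0, an: float = 6.0):
    """Equidistant break points a_0, ..., a_n and their values; the outer values are the asymptotes."""
    xs = np.linspace(a0, an, n + 1)
    ys = sigmoid(xs)
    ys[0], ys[-1] = 0.0, 1.0        # hard approximation on (-inf, a_0] and [a_n, +inf)
    return xs, ys

def approx_sigmoid(x, xs, ys):
    """Piecewise linear approximation: linear on [a_0, a_n], constant outside (L = n + 2 pieces)."""
    return np.interp(x, xs, ys)     # np.interp clamps to ys[0] / ys[-1] outside the range
```

Increasing n (and pushing a_0 and a_n further out) brings the approximation error under any required δ > 0.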
Further, in step three: for a classification model, its essence lies in the decision boundary of the model. Taking image classification as an example, given a data set D = {(x_i, C_i) | i = 1, 2, ..., N}, a classifier f: R^d → R^c is trained, where the classification labels are C = {C_i | i = 1, 2, ..., N, N ∈ Z+}. The decision boundary of f between classes C_i and C_j consists of the points x for which every open ball neighborhood U(x, δ) of sample x contains samples that f assigns to class C_i as well as samples that f assigns to class C_j.
The deep learning model that solves the classification problem is also a classifier f(x), so the decision boundary of the deep learning model is calculated first; owing to its high nonlinearity, this decision boundary is generally difficult to compute directly. In particular, denote the breakpoints of the piecewise linear activation function used by μ_0 < μ_1 < ... < μ_n:
step three one, traversing the samples: according to the training sample set, each sample is input into the deep learning model f(x) in sequence, but the back propagation process is not executed (that is, only the inference process is performed), while the activation states of all fully connected layer activation functions are recorded; for example, if the output a_ij of a certain neuron satisfies μ_k < a_ij < μ_(k+1) (k = 0, 1, 2, ..., n-1), the activation state of that neuron is s_ij = k, and so on;
step three two, the activation states of all fully connected layer neurons are counted and arranged in sequence into an overall state vector S = [s_1, s_2, ..., s_m]; following step three one, the overall state vectors of all samples are collected to obtain the overall state vector set Φ = {S_1, S_2, ..., S_N} of the samples;
step three three, Φ is sorted, and identical overall state vectors are merged to obtain the reduced overall state vector set Φ' = {S'_1, S'_2, ..., S'_q}; according to the number q of elements of Φ', samples having the same activation state S'_p (1 ≤ p ≤ q) are assigned to the same sub-region, and samples belonging to the same sub-region are described by the same linear model g_i(x) = w_i x + b_i (i = 1, 2, ..., q); the equivalent linear model g_i(x) = w_i x + b_i is obtained by direct calculation from the parameters of the fully connected layers and the overall activation state vector, and the set of all sub-models is written G = {g_1, g_2, ..., g_q};
step three four, the decision boundaries of all sub-models are calculated: according to the definition of the decision boundary, an N-class classification problem has one class decision boundary for each pair of classes, N(N-1)/2 in total, and for the linear model g_i(x) of one sub-region the corresponding decision boundaries can be calculated directly. Specifically, the decision boundaries of all sub-region models are calculated to form the decision boundary hyperplane set DB.
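The following sketch illustrates steps three one to three four for a plain ReLU network described by explicit weight and bias lists (these assumptions, and the helper names, are mine and not part of the original disclosure): the overall activation state vector of each sample is recorded in a single forward pass, the equivalent affine model of each sub-region is read off directly from the fully connected parameters, and the pairwise decision boundary hyperplanes are collected into the set DB.

```python
import numpy as np
from itertools import combinations

def overall_state_vector(x, weights, biases):
    """Steps three one / three two: one forward pass, recording S = [s_1, ..., s_m]."""
    states, h = [], x
    for W, b in zip(weights[:-1], biases[:-1]):      # hidden layers only
        z = W @ h + b
        states.append((z > 0).astype(np.int8))       # ReLU state: 1 active, 0 inactive
        h = np.maximum(z, 0.0)
    return np.concatenate(states)

def equivalent_linear_model(state_vec, weights, biases):
    """Step three three: on a fixed state the network is affine, g(x) = W_eq x + b_eq."""
    widths = [b.shape[0] for b in biases[:-1]]
    layer_states = np.split(np.asarray(state_vec, dtype=float), np.cumsum(widths)[:-1])
    W_eq = np.eye(weights[0].shape[1])
    b_eq = np.zeros(weights[0].shape[1])
    for W, b, s in zip(weights[:-1], biases[:-1], layer_states):
        D = np.diag(s)                               # 1 keeps a neuron, 0 removes it
        W_eq = D @ (W @ W_eq)
        b_eq = D @ (W @ b_eq + b)
    return weights[-1] @ W_eq, weights[-1] @ b_eq + biases[-1]

def decision_boundary_set(samples, weights, biases):
    """Step three four: hyperplanes (w_i - w_j) x + (b_i - b_j) = 0 over all sub-regions."""
    phi = {tuple(overall_state_vector(x, weights, biases)) for x in samples}   # merged Phi'
    DB = []
    for state in phi:
        W_eq, b_eq = equivalent_linear_model(np.array(state), weights, biases)
        for i, j in combinations(range(W_eq.shape[0]), 2):
            DB.append((W_eq[i] - W_eq[j], b_eq[i] - b_eq[j]))
    return DB
```

Because Φ' is built from the training samples and identical state vectors are merged, the number of sub-regions q never exceeds the number of training samples, which keeps this construction tractable.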
Further, in step four, specifically, the decision boundary hyperplane set DB obtained in step three is used to construct a decision network (DNet) comprising only one hidden layer. Unlike an ordinary neural network, the output of the decision network (DNet) is a position code relative to the decision boundaries, denoted 0/1: in particular, for a hyperplane P_l and a sample x_0, the sample is substituted directly into the hyperplane formula to calculate its output; if the result is positive it is marked 1, and if negative it is marked 0. Thus, by means of the decision network, a relative position code of the data with respect to all elements of the decision boundary set DB is obtained.
Referring to fig. 1, samples having the same position code must belong to the same class according to the characteristics of the decision boundary.
Therefore, the training set data D = {(x_i, C_i) | i = 1, 2, ..., N} only needs to be traversed once more so that the position codes of the decision network can be labeled with classes. When a new sample is input, its class can be determined simply by comparing it with the labeled position codes.
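A minimal sketch of the corresponding decision network and the labeling pass (again an illustration; DB is the hyperplane list from the previous sketch, and the dictionary lookup is simply an assumed way to store the code-to-class mapping):

```python
def position_code(x, DB):
    """0/1 position code of a sample relative to every hyperplane in DB (1 if positive)."""
    return tuple(int(w @ x + b > 0) for (w, b) in DB)

def label_position_codes(train_x, train_y, DB):
    """One further pass over the training set assigns a class to each observed code."""
    table = {}
    for x, y in zip(train_x, train_y):
        table[position_code(x, DB)] = y       # samples sharing a code share a class
    return table

def classify(x, DB, table):
    # Prediction compares the code of the new sample with the labeled codes.
    return table.get(position_code(x, DB))
```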
The invention provides a compression method for deep learning models (including CNN and MLP) based on decision boundaries. Because the parameters of a deep learning model largely come from its fully connected layers, the method compresses the fully connected layers without a large number of experiments such as pruning, searching, or distillation; only two passes over the training set samples are required. If the activation function is a commonly used piecewise linear function such as ReLU, the obtained model realizes lossless compression; if the activation function is another nonlinear activation function, compression with any given precision can be realized through linear approximation.
The above embodiments are only intended to help understand the method of the present invention and its core idea. A person skilled in the art may also make several modifications and refinements to the specific embodiments and application scope according to the idea of the present invention, and such modifications and refinements shall also fall within the protection scope of the present invention.