CN108898213B - Adaptive activation function parameter adjusting method for deep neural network - Google Patents


Info

Publication number
CN108898213B
CN108898213B (application CN201810631395.3A)
Authority
CN
China
Prior art keywords
activation function
network
adaptive
parameters
function
Prior art date
Legal status
Active
Application number
CN201810631395.3A
Other languages
Chinese (zh)
Other versions
CN108898213A (en)
Inventor
胡海根
周莉莉
罗诚
陈胜勇
管秋
周乾伟
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201810631395.3A priority Critical patent/CN108898213B/en
Publication of CN108898213A publication Critical patent/CN108898213A/en
Application granted granted Critical
Publication of CN108898213B publication Critical patent/CN108898213B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A method for adjusting parameters of an adaptive activation function for a deep neural network, the method comprising the steps of: step 1, first giving a mathematical definition of the adaptive activation function parameter adjustment method; step 2, comparing and analyzing the experimental results of the adaptive activation function and other classical activation functions on the MNIST data set, where the network used comprises three hidden layers of 50 neurons each and is trained for 100 epochs with stochastic gradient descent, a learning rate of 0.01 and a mini-batch size of 100; and step 3, after the optimal activation function version is obtained in step 2, applying the method to the detection of specific bladder cancer cells. During training, the adaptive activation function continuously adjusts its own shape to find the activation function best suited to the network; this improves network performance, reduces the total number of learnable parameters of the adaptive activation function in the network, accelerates network learning and improves the generalization ability of the network.

Description

Adaptive activation function parameter adjusting method for deep neural network
Technical Field
The invention belongs to the field of adaptive activation functions and provides a parameter adjustment method for adaptive activation functions in deep neural networks. The adaptive activation function controls its own shape through added learnable parameters; these learnable parameters are updated along with network training through the back-propagation algorithm, which reduces the overall number of learnable parameters of the adaptive activation function in the network.
Background
Machine learning is widely applied in social life today, while traditional machine learning mostly adopts shallow structures such as the Gaussian Mixture Model (GMM), the Conditional Random Field (CRF) and the Support Vector Machine (SVM). Shallow structures have limited capability to represent complex functions, extract only relatively elementary features from the original input signal, are restricted in their generalization ability on complex classification problems, and have difficulty solving some complex natural-signal processing problems such as human speech and natural image recognition. Deep learning, by contrast, greatly promotes the development of machine learning by simulating the way the brain learns. Its greatest characteristic is that original data are converted into higher-level and more abstract feature expressions through simple but nonlinear models; a deep nonlinear network structure is learned, the approximation of complex functions is realized, and the essential features of a data set can be learned from a small number of samples. Practice has proved that deep learning is good at discovering complex structures in high-dimensional data and is widely applied in research fields such as computer vision, speech recognition and natural language processing.
With the application of deep learning in various fields, more and more research focuses on innovation and optimization of deep learning algorithms, including optimization of classifiers and loss functions, gradient-descent optimization based on back propagation, initialization of network weight parameters, and optimization of the artificial neural network itself, the last of which is an important component of deep learning algorithm innovation. Artificial neural networks have different network structures and numbers of neurons depending on the task, and in these networks the same activation function, e.g. Sigmoid, Tanh or ReLU, is usually used for every neuron. The adaptive activation functions proposed in recent years allow network neurons to take different shapes, but as the network size and the number of neurons grow, the learnable parameters that adjust the neuron shapes grow linearly, which greatly reduces the learning efficiency of the network. The basic structure of an artificial neural network can be regarded as a connection of many neurons, and the activation function plays a very important role in this basic structure.
The main role of the activation function in an artificial neural network is to provide the network's nonlinear expressive power. If the neurons in a neural network perform only linear operations, the network can only express simple linear mappings; even increasing the depth and width of the network still yields a linear mapping, which makes it difficult to model effectively the nonlinearly distributed data of real environments. After a nonlinear activation function is added, the deep neural network acquires layered nonlinear mapping learning capability. The invention mainly improves the activation function and optimizes the connections between neurons in the network, thereby further improving the performance of the network.
Disclosure of Invention
In order to reduce the total number of learnable parameters of the adaptive activation function in the network, accelerate network learning and improve the generalization capability of the network, the invention provides a parameter adjustment method for adaptive activation functions in deep neural networks.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method for adjusting parameters of an adaptive activation function for a deep neural network, the method comprising the steps of:
step 1, firstly, a mathematical definition is carried out on a parameter adjustment method of the adaptive activation function, and the process is as follows:
assuming that the number of adjustable parameters of the adaptive activation function is N, the adaptive activation function is defined as:
f(x)=f(a*x+c)
where a and c are both learnable parameters used to control the shape of the activation function. A neural network can be regarded as a combination of many individual neurons, so the output of the neural network is defined as a composite function of the weights, biases and learnable neuron parameters, as follows:
h(w,b,a,c)=h(f(a*x+c))
where h represents the output of the neural network, and w and b represent the weights and biases of the network; in this form, all neurons in the neural network use the same set of learnable parameters. A more general definition lets each neuron in the neural network use its own adjustable parameters, as follows:
h(w,b,a,c)=h(f1(a1*x+c1), f2(a2*x+c2), …, fn(an*x+cn))
where fn represents each neuron in a layer of the network. When every neuron in a layer shares the same adjustable parameters, the definition becomes:
h(w,b,a,c)=h(f(a1*x+c1), f(a2*x+c2), …, f(aL*x+cL))
where the subscript now indexes the layers of the network.
The adaptive activation functions in the neural network are trained with the back-propagation algorithm, in which the learnable parameters are optimized together with the weights and biases as training progresses; the parameters {a1, …, an, b1, …, bn} are updated according to the chain rule, as follows:
∂L/∂ai = (∂L/∂f(xi)) * (∂f(xi)/∂ai)
where ai ∈ {a1, …, an, b1, …, bn} and L denotes the cost function; the term
∂L/∂f(xi)
is obtained from the next layer by back propagation, and the summation Σxi runs over all positions of the feature map or of the neural network layer. For a variable shared within one layer, the gradient with respect to ai is obtained by the following formula, where Σi sums over the neurons in all channels or in one layer:
∂L/∂ai = Σi (∂L/∂f(xi)) * (∂f(xi)/∂ai)
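A minimal sketch, assuming PyTorch, of how a shape parameter shared by all neurons of a layer receives its gradient as the sum of the per-neuron contributions, as in the formula above; the tensor sizes and the use of a plain sum as the cost function L are illustrative.

import torch

# One learnable slope "a" and shift "c" shared by every neuron in a layer.
a = torch.tensor(1.0, requires_grad=True)
c = torch.tensor(0.0, requires_grad=True)

x = torch.randn(8, 50)           # mini-batch of 8 inputs to a 50-neuron layer
f = torch.sigmoid(a * x + c)     # adaptive activation of the form f(a*x + c)
loss = f.sum()                   # stand-in for a real cost function L
loss.backward()

# Because "a" and "c" are shared, autograd accumulates dL/da and dL/dc as sums
# over all neurons and batch positions, matching the summed chain-rule formula.
print(a.grad, c.grad)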
Step 2, comparing and analyzing the experimental results of the adaptive activation function and other classical activation functions on the MNIST data set to obtain the final activation function version. The process is as follows:
The network used has three hidden layers of 50 neurons each; stochastic gradient descent is run for 100 epochs, with the learning rate set to 0.01 and a mini-batch size of 100.
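A sketch of this experimental configuration, assuming PyTorch; the 28*28 input size and 10 output classes follow the standard MNIST setup, the cross-entropy loss and data-loading details are assumptions, and nn.Sigmoid() marks where the adaptive activation module is substituted in the comparison runs.

import torch.nn as nn
import torch.optim as optim

# Three hidden layers of 50 neurons each, as described above.
model = nn.Sequential(
    nn.Linear(28 * 28, 50), nn.Sigmoid(),   # the adaptive activation replaces
    nn.Linear(50, 50), nn.Sigmoid(),        # nn.Sigmoid() in the AS experiments
    nn.Linear(50, 50), nn.Sigmoid(),
    nn.Linear(50, 10),
)
optimizer = optim.SGD(model.parameters(), lr=0.01)  # learning rate 0.01
criterion = nn.CrossEntropyLoss()
# Training loop (omitted): 100 epochs, mini-batches of 100 MNIST images.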
Step 3, after the optimal activation function version is obtained in step 2, applying it to the detection of specific bladder cancer cells. The process is as follows:
3.1, making a bladder cancer cell data set;
3.2, selecting a suitable algorithm and model to initialize the parameters;
and 3.3, comparing and analyzing the experimental results of the optimal activation function and the traditional activation function.
Further, in step 2, the contrast activation functions used include a conventional Sigmoid function, a conventional ReLU activation function, a uniform version of an adaptive activation function, respective versions of the adaptive activation function, and a hierarchical version of the adaptive activation function.
In 3.1, the bladder cancer cell data set is prepared in the pascal_voc2007 format, and the label information of the cells is stored in the generated xml files.
In step 3.2, the Faster R-CNN algorithm is selected and the network parameters are initialized with a pre-trained VGG16 model.
In step 3.3, the optimal activation function version obtained in step 2 replaces the traditional activation function in the Faster R-CNN algorithm, and the experimental results are then analyzed and compared.
The invention has the following beneficial effects: the effectiveness of the adaptive activation function parameter adjustment method is proved both theoretically and experimentally; an optimal activation function is provided for the network, problems such as the vanishing gradients of traditional activation functions are avoided, and the fitting capability of the network is improved.
Drawings
FIG. 1 is the convergence curve of the AS activation function of the present invention;
FIG. 2 is a diagram of the adjustment of the learnable parameters of the AS activation function of the present invention;
FIG. 3 is a diagram of the original sigmoid activation function and the final AS activation function of the present invention;
FIG. 4 is a diagram of the experimental comparison between the final AS activation function of the present invention and other activation functions.
Fig. 5 is a diagram of Sigmoid activation function.
FIG. 6 is a graph of Tanh activation function.
Fig. 7 is a graph of the ReLU activation function.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 7, a method for adjusting parameters of an adaptive activation function for a deep neural network includes the following steps:
step 1, firstly, a mathematical definition is carried out on a parameter adjustment method of the adaptive activation function, and the process is as follows:
assuming that the number of adjustable parameters of the adaptive activation function is N, the adaptive activation function is defined as:
f(x)=f(a*x+c)
where a and c are both learnable parameters used to control the shape of the activation function. A neural network can be regarded as a combination of many individual neurons, so the output of the neural network is defined as a composite function of the weights, biases and learnable neuron parameters, as follows:
h(w,b,a,c)=h(f(a*x+c))
where h represents the output of the neural network, and w and b represent the weights and biases of the network; in this form, all neurons in the neural network use the same set of learnable parameters. A more general definition lets each neuron in the neural network use its own adjustable parameters, as follows:
h(w,b,a,c)=h(f1(a1*x+c1), f2(a2*x+c2), …, fn(an*x+cn))
where fn represents each neuron in a layer of the network. When every neuron in a layer shares the same adjustable parameters, the definition becomes:
h(w,b,a,c)=h(f(a1*x+c1), f(a2*x+c2), …, f(aL*x+cL))
where the subscript now indexes the layers of the network.
The adaptive activation functions in the neural network are trained with the back-propagation algorithm, in which the learnable parameters are optimized together with the weights and biases as training progresses; the parameters {a1, …, an, b1, …, bn} are updated according to the chain rule, as follows:
∂L/∂ai = (∂L/∂f(xi)) * (∂f(xi)/∂ai)
where ai ∈ {a1, …, an, b1, …, bn} and L denotes the cost function; the term
∂L/∂f(xi)
is obtained from the next layer by back propagation, and the summation Σxi runs over all positions of the feature map or of the neural network layer. For a variable shared within one layer, the gradient with respect to ai is obtained by the following formula, where Σi sums over the neurons in all channels or in one layer:
∂L/∂ai = Σi (∂L/∂f(xi)) * (∂f(xi)/∂ai)
step 2, carrying out comparison and analysis of experimental results of the adaptive activation function and other classical activation functions based on the MNIST data set, wherein the process is as follows:
The network used has three hidden layers of 50 neurons each; stochastic gradient descent is run for 100 epochs, with the learning rate set to 0.01 and a mini-batch size of 100.
Further, in step 2, the contrast activation functions used include a conventional Sigmoid function, a conventional ReLU activation function, a uniform version of an adaptive activation function, respective versions of the adaptive activation function, and a hierarchical version of the adaptive activation function.
Based on the MNIST data set, the invention adds learnable parameters to the classical sigmoid activation function to turn it into an adaptive activation function, and compares the test results of each version of the adaptive activation function with two classical activation functions, Sigmoid and ReLU.
MNIST is a handwritten digit recognition data set, often called the fruit fly of deep learning experiments, comprising 60,000 training images and 10,000 test images. Each grayscale image in MNIST represents one digit from 0 to 9; the images are 28 x 28 pixels and the handwritten digits appear in the middle of the image. The activation function AS is defined as:
f=b0*sigmoid(a0*x+a1)+b1
where a0, a1, b0 and b1 are four learnable parameters that control the shape of the function and can be trained along with the network weights and biases.
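A minimal sketch, assuming PyTorch, of the AS activation with its four learnable shape parameters; the initial values follow the starting point quoted later in the text (a0 = 1.0, a1 = 0.0, b0 = 1.0, b1 = 0.0), so the function starts out as a plain sigmoid.

import torch
import torch.nn as nn

class ASActivation(nn.Module):
    # Adaptive sigmoid: f(x) = b0 * sigmoid(a0 * x + a1) + b1.
    def __init__(self):
        super().__init__()
        self.a0 = nn.Parameter(torch.tensor(1.0))
        self.a1 = nn.Parameter(torch.tensor(0.0))
        self.b0 = nn.Parameter(torch.tensor(1.0))
        self.b1 = nn.Parameter(torch.tensor(0.0))

    def forward(self, x):
        return self.b0 * torch.sigmoid(self.a0 * x + self.a1) + self.b1

# Example: before any training the response matches a plain sigmoid.
print(ASActivation()(torch.tensor([0.0, 1.0])))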
The invention mainly adds learnable parameters to the classical sigmoid activation function to turn it into the adaptive activation function AS; the mathematical definition of the function is as follows:
let N be the number of adjustable parameters of the adaptive activation function, where N is assumed to be 2. The adaptive activation function may be defined as:
f(x)=f(a*x+c)
where a and c are both learnable parameters used to control the shape of the activation function. The so-called neural network can be seen as a combination of many individual neurons, so the output of the neural network is defined as a composite function of the weights, biases and learnable neuron parameters, as follows:
h(w,b,a,c)=h(f(a*x+c))
where h represents the output of the neural network and w and b represent the weights and biases of the network. In this form, all neurons in the neural network use the same set of learnable parameters. A more general definition lets each neuron in the neural network use its own adjustable parameters, as follows:
h(w,b,a,c)=h(f1(a1*x+c1), f2(a2*x+c2), …, fn(an*x+cn))
where fn represents each neuron in one layer of the network. When every neuron in a layer shares the same adjustable parameters, the definition becomes:
h(w,b,a,c)=h(f(a1*x+c1), f(a2*x+c2), …, f(aL*x+cL))
where the subscript now indexes the layers of the network.
the present invention uses a back-propagation algorithm to train an adaptive activation function in a neural network, where learnable parameters are optimized along with weights and bias execution as the network training progresses. The parameters { a1, …, n, b1, …, n } may be updated according to the chain derivative rule as follows:
Figure BDA0001699947670000072
where ai e { a1, …, n, b1, …, n }, and L represents a cost function.
Figure BDA0001699947670000073
This term can be derived from the latter layer by back propagation, and the weighting term ∑ Xi can be used at all positions of the feature map or neural network layer. For variables shared in one layer, the gradient ai can be found by the following equation, Σ i is used to sum the neurons in all channels or one layer, as follows:
Figure BDA0001699947670000081
In step 3, the adaptive activation function method is applied in deep learning: the optimal activation function obtained in step 2 is applied to the detection of bladder cancer cells. The process is as follows:
and 3.1, making a data set. The bladder cancer cell data set is made into a pascal _ voc2007 format, and the label information of the cells is mainly stored by using the generated xml file.
3.2, selecting a suitable algorithm and model to initialize the parameters. The invention selects the Faster R-CNN algorithm and initializes the network parameters with a pre-trained VGG16 model, which reduces training time and lowers the risk of under-fitting or over-fitting.
3.3, comparing and analyzing the experimental results of the optimal activation function and the traditional activation function. The optimal activation function version obtained in step 2 replaces the traditional activation function in the Faster R-CNN algorithm, and the experimental results are then analyzed and compared.
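A sketch, assuming PyTorch and torchvision, of swapping the conventional activations of a pre-trained VGG16 backbone for the AS module sketched earlier; how the backbone is then assembled into the Faster R-CNN detector is omitted, and the helper name is illustrative.

import torch.nn as nn
from torchvision.models import vgg16

def replace_activations(module, make_act):
    # Recursively replace every ReLU in the module tree with a fresh adaptive activation.
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, make_act())
        else:
            replace_activations(child, make_act)

backbone = vgg16(pretrained=True).features     # pre-trained VGG16 initialization;
                                               # newer torchvision uses weights=... instead
replace_activations(backbone, ASActivation)    # ASActivation: the AS module sketched above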
Finally, the method proposed by the present invention, that is, using the same adjustable activation function in the whole network, adds only a fixed number of parameters, namely the learnable parameters of the adaptive activation function that control its shape, no matter how many neurons the neural network contains. Using the same adjustable activation function throughout the network is like superposing polynomial orders in a composite function, thereby enhancing the nonlinearity of the network, improving its fitting capability and accelerating its learning speed.
The present invention will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the network used has three hidden layers of 50 neurons each and is trained for 100 epochs with stochastic gradient descent; the learning rate was set to 0.01 and the mini-batch size was 100. The comparison activation functions used are the traditional Sigmoid function, the traditional ReLU activation function, the unified version of the adaptive activation function, the individual version of the adaptive activation function, and the layer version of the adaptive activation function. Fig. 1 shows the convergence curves of ReLU, Sigmoid, and the three versions based on the adaptive activation function AS. "Relu_train" denotes the classification error rate on the training set using the Relu activation function, and "Relu_test" the classification error rate on the test set. "AUsigmoid" denotes the Unified Version (UV) of the adaptive activation function AS, i.e. every neuron in the whole network uses the same activation function. "ALsigmoid" denotes the Individual Version (IV) of the adaptive activation function, i.e. each neuron uses its own activation function. "AIsigmoid" denotes the Layer Version (LV), i.e. all neurons of a layer use the same activation function, but the activation functions of different layers are not necessarily the same.
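A quick count, for the three-hidden-layer, 50-neurons-per-layer network above, of how many extra learnable parameters each AS variant introduces, given that each AS instance carries the four shape parameters a0, a1, b0, b1; the variable names are illustrative.

layers, neurons_per_layer, params_per_as = 3, 50, 4   # a0, a1, b0, b1

unified    = params_per_as                               # UV: 4, one AS shared by the whole network
layer_wise = params_per_as * layers                      # LV: 12, one AS per hidden layer
individual = params_per_as * layers * neurons_per_layer  # IV: 600, one AS per neuron
print(unified, layer_wise, individual)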
The expression of the conventional Sigmoid activation function is as follows:
f(x) = 1/(1 + e^(-x))
the Sigmoid activation function diagram refers to fig. 5.
The Sigmoid activation function is a common choice because it has a clear interpretation in terms of the firing rate of a neuron: from completely inactive at 0 to fully saturated activation at the maximum of 1. However, the Sigmoid function is now rarely used, one important reason being that it saturates and its gradient vanishes. Sigmoid neurons have the undesirable property that they saturate when their activation value is close to 0 or 1; in these regions the gradient of the function is almost 0, so during back propagation this (local) gradient is multiplied by the gradient of the loss function with respect to the unit's output, and the product is also close to zero. This effectively kills the gradient, almost no signal flows through the neuron to its weights and on to the data, and the result is the gradient vanishing (dispersion) problem.
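A small numeric check, in plain Python, of the saturation behaviour just described: the derivative sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) collapses towards zero once the input moves away from the origin.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (0.0, 2.0, 5.0, 10.0):
    s = sigmoid(x)
    print(f"x={x:5.1f}  sigmoid={s:.5f}  gradient={s * (1 - s):.2e}")

# At x = 10 the local gradient is about 4.5e-05, so signals back-propagated
# through a saturated Sigmoid neuron are effectively extinguished.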
Another classical activation function is: tanh nonlinear function, the expression is as follows:
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
the Tanh activation function diagram refers to FIG. 6.
As can be seen from fig. 6, compared with the Sigmoid function, Tanh compresses real values into the interval [-1, 1], but it has the same saturation problem as Sigmoid. Unlike the Sigmoid neuron, however, its output is zero-centered. In practice the Tanh nonlinearity is therefore more popular than the Sigmoid nonlinearity; the Tanh neuron can be regarded as a simply rescaled Sigmoid neuron.
Compared with the two classical activation functions above, ReLU is the activation function that is now most widely used; its mathematical formula is as follows:
f(x)=max(0,x)
the ReLU activation function map refers to fig. 7.
Compared with the Sigmoid and Tanh functions, ReLU greatly accelerates the convergence of gradient descent because of its linear, non-saturating form. When the input of the ReLU activation function is positive, there is no gradient saturation problem; when the input is negative, the ReLU is completely inactive, that is, the unit outputs zero for any negative input. For example, when a large gradient flows back through a ReLU neuron, it may push the weights into a state in which the neuron is no longer activated by any input. If this happens, the gradients propagating backwards through this neuron will all become 0; in other words, the ReLU unit fails irreversibly during training, which leads to a loss of data diversity.
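A short sketch, assuming PyTorch, of the dying-ReLU failure mode just described: once a unit's pre-activation is negative for every input, its local gradient is zero and its incoming weight stops receiving updates; the weight value and inputs are illustrative.

import torch

w = torch.tensor([-5.0], requires_grad=True)   # weight pushed far negative, e.g. by a large update
x = torch.rand(100, 1)                          # inputs in [0, 1), so w * x is never positive
out = torch.relu(x * w)                         # the ReLU output is 0 everywhere
out.sum().backward()
print(out.max().item(), w.grad)                 # 0.0 and a zero gradient: the unit no longer learns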
As can be seen from fig. 1, using the unified version of the activation function AS achieves a lower classification error rate on the MNIST training set than using the Relu activation function, and the network has a stronger fitting ability than with the original Sigmoid activation function.
As shown in fig. 2, which illustrates the parameter adjustment process of the unified version of the adaptive activation function, the learnable parameters of the adaptive activation function are initially set to a0 = 1.0, a1 = 0.0, b0 = 1.0, b1 = 0.0; after the training iterations the final parameters become a0 = 3.87, a1 = 0.07, b0 = 5.89, b1 = -0.51, after which they remain essentially unchanged.
As shown in fig. 3, the final unified version of the adaptive activation function has a larger value range than that of the traditional Sigmoid activation function, and the problem of gradient dispersion of the traditional Sigmoid activation function is solved to a great extent, so that the accuracy of classification is improved.
As shown in fig. 4, the final adaptive activation function version (RAS) is compared with other activation functions according to the experimental results, and the final adaptive activation function formula is as follows:
f=5.89*sigmoid(3.87*x+0.07)-0.51
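A quick check, in plain Python, that the final AS function quoted above spans a much wider output range than the (0, 1) interval of the original sigmoid, which is the point made about fig. 3.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def final_as(x):
    return 5.89 * sigmoid(3.87 * x + 0.07) - 0.51

# Limits: x -> -inf gives -0.51 and x -> +inf gives 5.89 - 0.51 = 5.38,
# versus the (0, 1) range of the plain sigmoid.
print(final_as(-10.0), final_as(0.0), final_as(10.0))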
it can be seen from the comparison of the experimental results with fig. 4 that the unified adaptive activation function can achieve the best experimental effect, and in the experiment of detecting bladder cancer cells, the detection result and speed of the unified adaptive activation function are better than those of the traditional activation function, which further proves that each network can be trained to obtain the most suitable activation function.

Claims (5)

1. A deep neural network-oriented adaptive activation function parameter adjustment method is characterized by comprising the following steps:
step 1, firstly, a mathematical definition is carried out on a parameter adjustment method of the adaptive activation function, and the process is as follows:
assuming that the number of adjustable parameters of the adaptive activation function is N, the adaptive activation function is defined as:
f(x)=f(a*x+c)
where a and c are both learnable parameters used to control the shape of the activation function; the neural network is regarded as a combination of many individual neurons, and the output of the neural network is defined as a composite function of the weights, biases and learnable neuron parameters, as follows:
h(w,b,a,c)=h(f(a*x+c))
where h represents the output of the neural network, and w and b represent the weights and biases of the network; in this form, all neurons in the neural network use the same set of learnable parameters; a more general definition lets each neuron in the neural network use its own adjustable parameters, as follows:
h(w,b,a,c)=h(f1(a1*x+c1), f2(a2*x+c2), …, fn(an*x+cn))
where fn represents each neuron in a layer of the network; when every neuron in a layer shares the same adjustable parameters, the definition becomes:
h(w,b,a,c)=h(f(a1*x+c1), f(a2*x+c2), …, f(aL*x+cL))
where the subscript indexes the layers of the network;
the adaptive activation functions in the neural network are trained using a back-propagation algorithm, where the learnable parameters are optimized together with the weights and biases as network training progresses; the parameters {a1, …, an, b1, …, bn} are updated according to the chain rule, as follows:
∂L/∂ai = (∂L/∂f(xi)) * (∂f(xi)/∂ai)
where ai ∈ {a1, …, an, b1, …, bn} and L represents the cost function; the term
∂L/∂f(xi)
is obtained from the next layer by back propagation, and the summation Σxi runs over all positions of the feature map or of the neural network layer; for a variable shared in one layer, the gradient with respect to ai is obtained by the following formula, in which Σi sums over the neurons in all channels or in one layer:
∂L/∂ai = Σi (∂L/∂f(xi)) * (∂f(xi)/∂ai)
step 2, carrying out comparison and analysis of experimental results of the adaptive activation function and other activation functions based on the MNIST data set, wherein the process is as follows:
the network used has three hidden layers of 50 neurons each; stochastic gradient descent is run for 100 epochs, the learning rate is set to 0.01, and the mini-batch size is 100;
step 3, after obtaining the optimal activation function version in step 2, applying the optimal activation function version to the detection of specific bladder cancer cells, wherein the process is as follows:
3.1, making a data set for the bladder cancer;
3.2, initializing parameters by selecting an algorithm and a model;
and 3.3, comparing and analyzing the experimental results of the optimal activation function and the traditional activation function.
2. The method as claimed in claim 1, wherein the contrast activation function used in step 2 includes a conventional Sigmoid function, a conventional ReLU activation function, a uniform version of the adaptive activation function, respective versions of the adaptive activation function, and a hierarchical version of the adaptive activation function.
3. The method for adjusting parameters of the adaptive activation function oriented to the deep neural network as claimed in claim 1 or 2, wherein in 3.1, the bladder cancer cell data set is made into a pascal _ voc2007 format, and the generated xml file is mainly used for storing the label information of the cells.
4. The method for adjusting parameters of an adaptive activation function for a deep neural network as claimed in claim 1 or 2, wherein in 3.2, the Faster R-CNN algorithm is selected and the network parameters are initialized with a pre-trained VGG16 model.
5. The adaptive activation function parameter adjustment method for the deep neural network as claimed in claim 4, wherein in step 3.3, the optimal activation function version obtained in step 2 is used to replace the traditional activation function in the Faster R-CNN algorithm, and finally the experimental results are analyzed and compared.
CN201810631395.3A 2018-06-19 2018-06-19 Adaptive activation function parameter adjusting method for deep neural network Active CN108898213B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810631395.3A CN108898213B (en) 2018-06-19 2018-06-19 Adaptive activation function parameter adjusting method for deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810631395.3A CN108898213B (en) 2018-06-19 2018-06-19 Adaptive activation function parameter adjusting method for deep neural network

Publications (2)

Publication Number Publication Date
CN108898213A CN108898213A (en) 2018-11-27
CN108898213B true CN108898213B (en) 2021-12-17

Family

ID=64345490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810631395.3A Active CN108898213B (en) 2018-06-19 2018-06-19 Adaptive activation function parameter adjusting method for deep neural network

Country Status (1)

Country Link
CN (1) CN108898213B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934222A (en) * 2019-03-01 2019-06-25 长沙理工大学 A kind of insulator chain self-destruction recognition methods based on transfer learning
CN110084380A (en) * 2019-05-10 2019-08-02 深圳市网心科技有限公司 A kind of repetitive exercise method, equipment, system and medium
CN110222173B (en) * 2019-05-16 2022-11-04 吉林大学 Short text emotion classification method and device based on neural network
CN110443296B (en) * 2019-07-30 2022-05-06 西北工业大学 Hyperspectral image classification-oriented data adaptive activation function learning method
CN110570048A (en) * 2019-09-19 2019-12-13 深圳市物语智联科技有限公司 user demand prediction method based on improved online deep learning
CN111860460A (en) * 2020-08-05 2020-10-30 江苏新安电器股份有限公司 Application method of improved LSTM model in human behavior recognition
JP6942900B1 (en) * 2021-04-12 2021-09-29 望 窪田 Information processing equipment, information processing methods and programs
CN113822386B (en) * 2021-11-24 2022-02-22 苏州浪潮智能科技有限公司 Image identification method, device, equipment and medium
CN114708460A (en) * 2022-04-12 2022-07-05 济南博观智能科技有限公司 Image classification method, system, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5113483A (en) * 1990-06-15 1992-05-12 Microelectronics And Computer Technology Corporation Neural network with semi-localized non-linear mapping of the input space
CN104951836A (en) * 2014-03-25 2015-09-30 上海市玻森数据科技有限公司 Posting predication system based on nerual network technique
CN105654136A (en) * 2015-12-31 2016-06-08 中国科学院电子学研究所 Deep learning based automatic target identification method for large-scale remote sensing images
CN105891215A (en) * 2016-03-31 2016-08-24 浙江工业大学 Welding visual detection method and device based on convolutional neural network
CN107122825A (en) * 2017-03-09 2017-09-01 华南理工大学 A kind of activation primitive generation method of neural network model


Also Published As

Publication number Publication date
CN108898213A (en) 2018-11-27

Similar Documents

Publication Publication Date Title
CN108898213B (en) Adaptive activation function parameter adjusting method for deep neural network
Wang et al. TL-GDBN: Growing deep belief network with transfer learning
Jia Investigation into the effectiveness of long short term memory networks for stock price prediction
Bawa et al. Linearized sigmoidal activation: A novel activation function with tractable non-linear characteristics to boost representation capability
CN109829541A (en) Deep neural network incremental training method and system based on learning automaton
Kong et al. Hexpo: A vanishing-proof activation function
Zhang et al. Evolving neural network classifiers and feature subset using artificial fish swarm
CN108009635A (en) A kind of depth convolutional calculation model for supporting incremental update
Hu et al. A dynamic rectified linear activation units
Yonekawa et al. A ternary weight binary input convolutional neural network: Realization on the embedded processor
CN111382840B (en) HTM design method based on cyclic learning unit and oriented to natural language processing
Chen et al. CNN-based broad learning with efficient incremental reconstruction model for facial emotion recognition
Roudi et al. Learning with hidden variables
Dudek Data-driven randomized learning of feedforward neural networks
CN111144500A (en) Differential privacy deep learning classification method based on analytic Gaussian mechanism
Lagani et al. Training convolutional neural networks with competitive hebbian learning approaches
Liu et al. Comparison and evaluation of activation functions in term of gradient instability in deep neural networks
Qiao et al. SRS-DNN: a deep neural network with strengthening response sparsity
Deng et al. Ensemble SVR for prediction of time series
Chen et al. Deep sparse autoencoder network for facial emotion recognition
Talafha et al. Biologically inspired sleep algorithm for variational auto-encoders
Zhang et al. Weight asynchronous update: Improving the diversity of filters in a deep convolutional network
Ben-Bright et al. Taxonomy and a theoretical model for feedforward neural networks
Abramova et al. Research of the extreme learning machine as incremental learning
CN114004353A (en) Optical neural network chip construction method and system for reducing number of optical devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant