CN115331053A - Image classification model generation method and device based on L2NU activation function - Google Patents


Info

Publication number
CN115331053A
Authority
CN
China
Prior art keywords
image classification
l2nu
classification model
activation function
activation
Prior art date
Legal status
Pending
Application number
CN202210962126.1A
Other languages
Chinese (zh)
Inventor
邱鲤鲤
张泽洋
周昌乐
Current Assignee
Xiamen University
Original Assignee
Xiamen University
Priority date
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202210962126.1A priority Critical patent/CN115331053A/en
Publication of CN115331053A publication Critical patent/CN115331053A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to an image classification model generation method based on an L2NU activation function, which comprises the following steps: S101: replacing the ReLU activation functions in the native structures of all residual modules in the ResNet-50 image classification model with L2NU activation functions; S102: removing all BN layers from the ResNet-50 image classification model of step S101; S103: setting the activation function of the output layer of the ResNet-50 image classification model of step S102 to the L2NU_p function (the L2NU function shifted and scaled to the range [0, 1]); S104: initializing each weight matrix of the neural network of the ResNet-50 image classification model of step S103 to a standard normal distribution and setting the bias of each layer of the neural network to the zero vector; S105: training the ResNet-50 image classification model of step S104 on an image classification training data set to obtain an image classification model based on the L2NU activation function. Compared with traditional activation functions, the L2NU activation function of the application has the beneficial effect of improving model classification accuracy when applied to deep learning models for image classification.

Description

Image classification model generation method and device based on L2NU activation function
Technical Field
The application relates to the technical field of image classification, in particular to an image classification model generation method and device based on an L2NU activation function.
Background
In current deep learning models, network structures keep expanding, and in particular networks keep getting deeper, which brings many training problems. When the number of layers of a network model exceeds a certain point, the training process becomes slower and may even fail to optimize at all. This is a problem faced by all mainstream deep learning models today. Taking the fully-connected network (FCN) as an example, without introducing any training tricks, a network with more than 10 layers cannot be optimized by gradient-descent training, and the accuracy of the model may not improve within a limited training time. This training problem mainly results from the failure of the update gradients to propagate backward efficiently.
Disclosure of Invention
In order to solve the technical problems, the application provides an image classification model generation method and device based on an L2NU activation function.
In a first aspect, the present application provides a method for generating an image classification model based on an L2NU activation function, the method including the following steps:
s101: replacing the ReLU activation functions in the native structures of all residual modules in the ResNet-50 image classification model with L2NU activation functions;
the L2NU activation function is specifically:
L2NU(x_i) = x_i / sqrt(x_1^2 + x_2^2 + … + x_n^2)
wherein n is the number of neurons in the network layer of the image classification model, X is the pre-activation pattern of the network layer, X = (x_1, x_2, …, x_n), x_i is the output of the i-th neuron, and i ∈ {1, 2, …, n};
s102: removing all BN layers in the ResNet-50 image classification model in the step S101;
S103:setting the activation function of the output layer of the ResNet-50 image classification model in step S102 to be
L2NU_p(x_i) = (1 + L2NU(x_i)) / 2 = (1 + x_i / sqrt(x_1^2 + x_2^2 + … + x_n^2)) / 2
S104: initializing each weight matrix of the neural network of the ResNet-50 image classification model in the step S103 into standard normal distribution, and setting each layer bias of the neural network as 0 vector;
s105: and training the ResNet-50 image classification model in the step S104 based on the image classification training data set to obtain an image classification model based on an L2NU activation function.
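For illustration, the following is a minimal sketch of the L2NU activation described in step S101, assuming a PyTorch implementation; the module name and the small epsilon guard against division by zero are illustrative additions, not part of the original disclosure. It simply divides the pre-activation pattern of a layer by its L2 norm.

```python
import torch
import torch.nn as nn

class L2NU(nn.Module):
    """L2 normalization unit: maps a layer's pre-activation pattern X
    to X / ||X||_2, a point on the unit hypersphere."""

    def __init__(self, eps: float = 1e-12):
        super().__init__()
        self.eps = eps  # illustrative guard against division by zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # For a fully-connected layer, normalize over the feature dimension
        # so that each sample's activation vector has unit L2 norm.
        return x / x.norm(p=2, dim=-1, keepdim=True).clamp_min(self.eps)

# Usage: y = L2NU()(torch.randn(4, 10)); each row of y has unit L2 norm.
```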
By adopting the above technical scheme, compared with traditional activation functions such as ReLU, Tanh and Sigmoid, the L2NU activation function of the application has the following effects when applied to deep learning models for image classification: model classification accuracy is improved; the training time of the model parameters is shortened and the model converges faster; and the model training process is more stable, with the numerical ranges of the parameters and of the back-propagated gradients remaining relatively stable during training.
Preferably, S103 specifically includes: setting the activation function of the output layer of the ResNet-50 image classification model in step S102 to
L2NU_p(x_i) = (1 + L2NU(x_i)) / 2 = (1 + x_i / sqrt(x_1^2 + x_2^2 + … + x_n^2)) / 2
and computing the cross-entropy loss of the ResNet-50 image classification model based on the L2NU_p function; the loss function is specifically:
Loss = - Σ_{i=1}^{n} z_i · log(L2NU_p(y_i))
where Y is the logits output of the network, Y = (y_1, y_2, …, y_n), y_i is the unactivated output value of the i-th neuron of the network output layer, n is the dimension of the label, and z_i is the i-th element of the sample label Z = (z_1, z_2, …, z_n) in one-hot format.
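A hedged sketch of the L2NU_p output activation and the cross-entropy loss described above, again assuming PyTorch; the function names and the clamp on the logarithm argument are illustrative choices, not prescribed by the application.

```python
import torch

def l2nu_p(logits: torch.Tensor) -> torch.Tensor:
    """Shift and scale L2NU into [0, 1]: (1 + x_i / ||X||_2) / 2."""
    normed = logits / logits.norm(p=2, dim=-1, keepdim=True).clamp_min(1e-12)
    return 0.5 * (1.0 + normed)

def l2nu_cross_entropy(logits: torch.Tensor, one_hot: torch.Tensor) -> torch.Tensor:
    """Loss = -sum_i z_i * log(L2NU_p(y_i)), averaged over the batch."""
    p = l2nu_p(logits).clamp_min(1e-12)   # avoid log(0)
    return -(one_hot * p.log()).sum(dim=-1).mean()
```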
Preferably, S105 specifically includes: training the ResNet-50 image classification model of step S104 on the CIFAR-10 and CIFAR-100 image classification training data sets using stochastic gradient descent, with the optimizer set to a Momentum optimizer, the momentum coefficient set to 0.9, and the data batch size set to 100; after 100 rounds of training at a learning rate of 0.1, 10 rounds of parameter fine-tuning are performed at a learning rate of 0.01, yielding the image classification model based on the L2NU activation function.
Preferably, after step S105, the method further comprises: s106: and performing performance verification on the image classification model based on the L2NU activation function.
Preferably, S106 specifically includes: training FCNs with different numbers of layers using the Sigmoid, Tanh, ReLU and L2NU activation functions respectively, and recording the best accuracy of each FCN after 3000 rounds of training on the CIFAR-10 image classification training data set at a learning rate of 0.01.
Preferably, S106 further includes: learning the CIFAR-100 image classification training data set with the VGG-11, VGG-13, VGG-16 and ResNet convolutional neural network models respectively; replacing the ReLU activation functions of the convolutional layers and fully-connected layers in the native structure of each convolutional neural network with L2NU activation functions to compare the performance difference between the L2NU and ReLU activation functions; and replacing the Softmax function of the output layer of each convolutional neural network with the L2NU_p function to compare the performance difference between the L2NU_p function and the Softmax function.
In a second aspect, the present application further provides an image classification method, including the following steps:
s201: acquiring an image classification sample needing to be classified;
s202: inputting the image classification sample into an image classification model based on an L2NU activation function, wherein the image classification model based on the L2NU activation function is obtained by training in advance based on the method of the first aspect;
s203: and outputting the classification result of the image classification model based on the L2NU activation function.
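A minimal inference sketch for steps S201 through S203, assuming a trained PyTorch model and a torchvision-style preprocessing pipeline; the transform values (CIFAR-sized resize) and function name are illustrative assumptions, not specified by the application.

```python
import torch
from PIL import Image
from torchvision import transforms

def classify_image(model: torch.nn.Module, image_path: str) -> int:
    """S201: load the sample; S202: feed it to the L2NU-based model; S203: return the class."""
    preprocess = transforms.Compose([
        transforms.Resize((32, 32)),   # CIFAR-sized input (assumption)
        transforms.ToTensor(),
    ])
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    model.eval()
    with torch.no_grad():
        scores = model(x)              # raw scores; argmax is unchanged by the monotone L2NU_p
    return int(scores.argmax(dim=-1).item())
```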
In a third aspect, the present application further provides an apparatus for generating an image classification model based on an L2NU activation function, where the apparatus includes:
a model modification module: used for replacing the ReLU activation functions in the native structures of all residual modules in the ResNet-50 image classification model with L2NU activation functions; also used for removing all BN layers in the ResNet-50 image classification model; also used for setting the activation function of the output layer of the ResNet-50 image classification model to
L2NU_p(x_i) = (1 + L2NU(x_i)) / 2 = (1 + x_i / sqrt(x_1^2 + x_2^2 + … + x_n^2)) / 2
and also used for initializing each weight matrix of the neural network of the ResNet-50 image classification model to a standard normal distribution and setting the bias of each layer of the neural network to the zero vector;
the mathematical definition of the L2NU activation function is as follows:
L2NU(x_i) = x_i / sqrt(x_1^2 + x_2^2 + … + x_n^2)
wherein n is the number of neurons in the network layer of the image classification model, X is the pre-activation pattern of the network layer, X = (x_1, x_2, …, x_n), x_i is the output of the i-th neuron, and i ∈ {1, 2, …, n};
a training module: training the modified ResNet-50 image classification model based on the image classification training data set;
a determination module: for determining the modified ResNet-50 image classification model, trained by the method described in the first aspect, as the image classification model based on the L2NU activation function.
In a fourth aspect, the present application further proposes an image classification apparatus, comprising:
an acquisition module: used for acquiring image classification samples collected by an image acquisition device;
a classification module: used for inputting the image classification samples into an image classification model based on the L2NU activation function, the image classification model based on the L2NU activation function being obtained in advance by training with the method of the first aspect, and for outputting the classification result of the image classification model based on the L2NU activation function.
In a fifth aspect, the present application also proposes a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method described in the first aspect.
In summary, the present application at least includes the following beneficial technical effects:
1. The application provides a brand-new deep learning activation function that greatly increases the maximum trainable depth of a model when gradient optimization algorithms are applied to deep learning models. This characteristic ensures that application fields requiring deeper network models, such as image classification and natural language processing, can obtain more effective and more stable optimization gradients when using deep models;
2. Compared with traditional activation functions such as ReLU, Tanh and Sigmoid, the L2NU activation function has the following effects when applied to a deep learning model for image classification: model classification accuracy is improved; the training time of the model parameters is shortened and the model converges faster; and the model training process is more stable, with the numerical ranges of the parameters and of the back-propagated gradients remaining relatively stable during training.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and together with the description serve to explain the principles of the application. Other embodiments and many of the intended advantages of embodiments will be readily appreciated as they become better understood by reference to the following detailed description. The elements of the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding similar parts.
Fig. 1 is a flowchart of an image classification model generation method based on an L2NU activation function in an embodiment of the present application.
Fig. 2 is a schematic diagram illustrating changes in data distribution of L2NU acting on a two-dimensional plane.
Fig. 3 is a schematic diagram of an L2NU function image plotted based on different a values.
FIG. 4 is a graph comparing a conventional activation function with an L2 normalization unit.
Fig. 5 is a schematic diagram of replacing the ReLUs in the native structures of all residual modules in the ResNet-50 image classification model with L2NU.
Fig. 6 is a schematic diagram of the image of the function |Softmax(x) - Softmax(y)| and the gradient direction that maximizes it.
Fig. 7 is a schematic diagram of the image of the function |L2NU_p(x) - L2NU_p(y)| and the gradient direction that maximizes it.
Fig. 8 is a schematic diagram of the |f(x) - f(y)| gradient images corresponding to Softmax and L2NU_p.
FIG. 9 is a diagram illustrating the classification accuracy of different FCNs on a CIFAR-10 data set as a function of the number of network layers.
FIG. 10 is a flow chart of an image classification method in one embodiment of the present application.
Fig. 11 is a schematic block diagram of an image classification model generation apparatus based on an L2NU activation function in an embodiment of the present application.
Fig. 12 is a schematic block diagram of an image classification apparatus according to an embodiment of the present application.
FIG. 13 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In a deep learning model, the activation function is an indispensable component of every network layer and plays a rich set of roles, including providing nonlinearity for the model, numerically compressing intermediate outputs, and converting network outputs into prediction probabilities with a statistical meaning. Commonly used activation functions include the Sigmoid and Tanh functions, whose gradients tend to zero at both ends; ReLU, ELU, SeLU and the like, which are linear at one end and have a gradient equal to or approaching zero at the other; non-monotonic activation functions such as Swish and GeLU; and Maxout, an activation function that selects among multiple activations. For an activation function whose gradient tends to zero at both ends, if the pre-activation value of the network is too large or too small, the gradient of the corresponding sample in back propagation tends to zero; if the activation value at that point differs from the target activation value, moving it to the target under such a near-zero gradient consumes enormous computation, and the process is long and may not even finish in a limited time. This is called neuron saturation.
On the other hand, because the gradient shrinks at both ends of such activation functions, the back-propagated gradients diminish layer by layer under the chain rule during training, and the parameters of earlier layers become increasingly difficult to train effectively. Because of these problems, functions that are linear at one end, like ReLU, are widely used: since the linear end applies no numerical compression, such functions help the model propagate gradients backward effectively and alleviate the vanishing gradient problem.
However, this kind of activation function shifts the mean of the data distribution toward positive values rather than keeping it zero-centered. Without normalization, the shifted distribution implicitly biases the network parameters and enlarges the numerical scale of the update targets of some parameters; that is, the fitting process of early training requires more iterations, which is not conducive to fine-tuning the parameters in later training.
More importantly, the gradient of the negative end of ReLU is exactly zero, which means that when the output of a certain neuron is less than or equal to zero for all samples, the neuron can no longer receive any gradient; the neuron is said to die. In the extreme case where this happens to an entire layer, the network breaks down completely and loses the ability to update the parameters of that layer and of all layers before it.
Functions such as Swish and GeLU improve on ReLU with a smoother curve and a smoother gradient, making the optimization process of the network more stable, but the neuron saturation problem and the output distribution shift problem still exist.
The Maxout activation function replaces a single neuron with a group of neurons and outputs the maximum value of the group each time, which introduces nonlinearity and allows the gradient to propagate backward without an additional nonlinear function, improving the gradient optimization capability of the model. However, compared with a traditional neural network, the parameter requirement of the model increases exponentially, which is not conducive to building complex, large network structures.
In a deeper neural network structure, a new technique is widely used, which is called Batch Normalization (BN) method.
The BN method normalizes a batch of input data, and then scales and shifts the normalized data distribution through trainable parameters, so that the output data distribution of a specified network layer is unified.
The BN method is also called BN layer because it has trainable parameters and acts on the output of the entire network. The BN layer aims to keep the output of the network layer in a stable numerical range through a normalization process so as to reduce the possibility of gradient propagation problems, and aims to finely adjust the distribution of output data through trainable parameters so as to enable the distribution to be more easily fitted or divided by a subsequent network layer, thereby also ensuring the smooth proceeding of gradient back propagation.
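For reference, a minimal NumPy sketch of the batch normalization forward pass described above (training-mode statistics only; the function name, variable names and epsilon are illustrative, not taken from the application).

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a batch over axis 0, then scale and shift with trainable gamma/beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero-mean, unit-variance per feature
    return gamma * x_hat + beta               # learned re-scaling and re-shifting

x = np.random.randn(100, 16)                  # a batch of 100 samples, 16 features
y = batch_norm_forward(x, gamma=np.ones(16), beta=np.zeros(16))
```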
Nowadays, convolutional network models such as VGG, GoogLeNet and ResNet all make extensive use of BN layers to ensure that network training proceeds smoothly and to stabilize training.
However, the special operation mechanism of the BN layer consumes enormous computational resources. And a small number of trainable parameters in the BN layer have a large influence on the model performance, so that higher requirements are set on the hyper-parameters of the model, otherwise, reasonable model performance is difficult to obtain.
Several studies have shown that reasonable parameter initialization can effectively reduce the probability of gradient propagation problems and improve network performance. For example, the Xavier initialization method initializes the network weights to a uniform distribution with a specific variance so that the data distributions in forward and backward propagation remain as consistent as possible, thereby keeping parameter training stable.
However, Xavier initialization rests on strict assumptions, and most commonly used activation functions do not satisfy them. The He initialization method was therefore proposed as a complementary initialization method for the ReLU activation function: by scaling the target variance of Xavier initialization, it balances the variance change caused by ReLU and effectively improves the gradient optimization performance of the model.
However, the gradient stable propagation of the model in the long-term training process cannot be guaranteed only by the initialization method, and an additional mechanism needs to be introduced to guarantee the stable parameter distribution of the optimization process.
Therefore, we propose a new activation function with the following features:
(1) It is rotationally symmetric about 0, so that the output data have a zero-mean (zero-centered) distribution, which facilitates model fitting;
(2) It has nonlinear compression capability, with the output confined to a bounded range;
(3) Its curve is smooth, which benefits the stability of parameter updates;
(4) The gradient in the back propagation process stays above a certain value, preventing neuron saturation and neuron death. The second and fourth requirements, however, conflict in nature: numerical compression capability implies smaller gradients, which hinders gradient propagation in a multilayer network. Resolving this conflict requires an activation function of a new form.
A first aspect of the embodiment of the present application discloses an image classification model generation method based on an L2NU activation function, with reference to fig. 1, the method specifically includes the following steps:
s101: replacing the ReLU activation functions in the native structures of all residual modules in the ResNet-50 image classification model with L2NU activation functions;
the L2NU activation function is specifically:
L2NU(x_i) = x_i / sqrt(x_1^2 + x_2^2 + … + x_n^2)
wherein n is the number of neurons in the network layer of the image classification model, X is the pre-activation pattern of the network layer, X = (x_1, x_2, …, x_n), x_i is the output of the i-th neuron, and i ∈ {1, 2, …, n};
s102: removing all BN layers in the ResNet-50 image classification model in the step S101;
s103: setting the activation function of the output layer of the ResNet-50 image classification model in step S102 to be
L2NU_p(x_i) = (1 + L2NU(x_i)) / 2 = (1 + x_i / sqrt(x_1^2 + x_2^2 + … + x_n^2)) / 2
In a specific embodiment, S103 specifically includes: setting the activation function of the output layer of the ResNet-50 image classification model in step S102 to
L2NU_p(x_i) = (1 + L2NU(x_i)) / 2 = (1 + x_i / sqrt(x_1^2 + x_2^2 + … + x_n^2)) / 2
and computing the cross-entropy loss of the ResNet-50 image classification model based on the L2NU_p function; the loss function is specifically:
Loss = - Σ_{i=1}^{n} z_i · log(L2NU_p(y_i))
where Y is the logits output of the network, Y = (y_1, y_2, …, y_n), y_i is the unactivated output value of the i-th neuron of the network output layer, n is the dimension of the label, and z_i is the i-th element of the sample label Z = (z_1, z_2, …, z_n) in one-hot format.
S104: initializing each weight matrix of the neural network of the ResNet-50 image classification model in the step S103 into standard normal distribution, and setting bias of each layer of the neural network as 0 vector;
In a specific embodiment, no Xavier initialization or He initialization technique is adopted: each weight matrix of the network is directly initialized to a standard normal distribution, and the bias of each layer of the network is set to the zero vector.
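The following sketch shows one way steps S101 through S104 could be realized on the torchvision ResNet-50, assuming the L2NU module defined earlier. Replacing BN layers with Identity and the recursive module-walking helper are illustrative choices, not the application's prescribed code; for brevity the sketch reuses the simple L2NU module, whereas the application notes that convolutional layers normalize each feature map independently with an extra scaling factor, which is not reproduced here.

```python
import torch.nn as nn
from torchvision.models import resnet50

def _swap_modules(module: nn.Module) -> None:
    # S101/S102: recursively swap ReLU -> L2NU and BatchNorm2d -> Identity.
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, L2NU())          # simplification: FC-style L2NU everywhere
        elif isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.Identity())
        else:
            _swap_modules(child)

def build_l2nu_resnet50(num_classes: int = 100) -> nn.Module:
    model = resnet50(num_classes=num_classes)
    _swap_modules(model)
    # S104: weights drawn from a standard normal, biases zero (no Xavier/He initialization).
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(m.weight, mean=0.0, std=1.0)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    # S103 (the L2NU_p output activation) is applied inside the loss, as in the
    # loss sketch given earlier, rather than being attached to the module itself.
    return model
```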
S105: and training the ResNet-50 image classification model in the step S104 based on the image classification training data set to obtain an image classification model based on an L2NU activation function.
In a specific embodiment, S105 specifically includes: training the ResNet-50 image classification model of step S104 on the CIFAR-10 and CIFAR-100 image classification training data sets using stochastic gradient descent, with the optimizer set to a Momentum optimizer, the momentum coefficient set to 0.9, and the data batch size set to 100; after 100 rounds of training at a learning rate of 0.1, 10 rounds of parameter fine-tuning are performed at a learning rate of 0.01, yielding the image classification model based on the L2NU activation function.
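A hedged sketch of the training schedule in this embodiment (SGD with momentum 0.9, batch size 100, 100 epochs at learning rate 0.1 followed by 10 epochs at 0.01), assuming PyTorch data loaders and the l2nu_cross_entropy helper from the earlier sketch; the data path, device choice and omitted augmentation are illustrative.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_l2nu_resnet50(model, data_root: str = "./data", device: str = "cuda"):
    tfm = transforms.ToTensor()                       # augmentation omitted for brevity
    train_set = datasets.CIFAR100(data_root, train=True, download=True, transform=tfm)
    loader = DataLoader(train_set, batch_size=100, shuffle=True)
    model.to(device)

    def criterion(logits, target):
        one_hot = torch.nn.functional.one_hot(target, 100).float()
        return l2nu_cross_entropy(logits, one_hot)    # L2NU_p-based cross entropy

    for lr, epochs in [(0.1, 100), (0.01, 10)]:       # coarse training, then fine-tuning
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        for _ in range(epochs):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                opt.zero_grad()
                loss = criterion(model(x), y)
                loss.backward()
                opt.step()
    return model
```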
S106: and performing performance verification on the image classification model based on the L2NU activation function.
In a specific embodiment, S106 specifically includes: training FCNs with different numbers of layers using the Sigmoid, Tanh, ReLU and L2NU activation functions respectively, and recording the best accuracy of each FCN after 3000 rounds of training on the CIFAR-10 image classification training data set at a learning rate of 0.01.
In a further embodiment, S106 further includes: learning the CIFAR-100 image classification training data set with the VGG-11, VGG-13, VGG-16 and ResNet convolutional neural network models respectively; replacing the ReLU activation functions of the convolutional layers and fully-connected layers in the native structure of each convolutional neural network with L2NU activation functions to compare the performance difference between the L2NU and ReLU activation functions; and replacing the Softmax function of the output layer of each convolutional neural network with the L2NU_p function to compare the performance difference between the L2NU_p function and the Softmax function.
In step S101, the embodiment of the present invention discloses a new type of activation function, called the L2 Normalization Unit (L2NU). Through an ingenious construction, the L2NU satisfies the four characteristics an activation function should have and introduces a process similar to a BN layer into the network; without requiring a BN layer or any particular parameter initialization, it possesses the desirable properties of a model that applies a BN layer and a parameter initialization method, and it greatly increases the trainable depth of the model.
In effect, L2NU compresses a distribution in n-dimensional space, by numerical scaling, onto the surface of the n-dimensional unit hypersphere centered at the origin, as shown in fig. 2, which illustrates how L2NU changes the data distribution on a two-dimensional plane.
The purpose of constructing the L2NU is to keep the variance of the intermediate outputs of each layer of the network stable. For any fully-connected network with L2NU as the activation function, suppose a certain layer has input dimension n and output dimension m, the layer input is X = (x_1, x_2, …, x_n), the weight W is an n × m matrix, the bias B is an m-dimensional vector, and the unactivated output is Y = X · W + B. Assuming X, W and B are independently distributed with mean 0, we have Var(Y) = n · Var(X) · Var(W), where Var(X) denotes the variance of the distribution corresponding to X. Since X is the activated output of the previous layer, its sum of squares has already been normalized to 1, so Var(X) ≈ 1/n and equation (1-2) holds:
Var(Y) = n · Var(X) · Var(W) = Var(W)    (1-2)
That is, the variance of the unactivated output distribution of any layer of the network depends only on the weight distribution variance of that layer; simply setting the same initial distribution for the weights of all layers ensures that the distribution remains stable during forward propagation.
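To make the Var(Y) = Var(W) argument concrete, the short numerical check below (an illustrative sketch, not taken from the application) draws random L2-normalized inputs and confirms that the variance of the unactivated output tracks the weight variance rather than the input dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, samples = 256, 128, 10000

X = rng.standard_normal((samples, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # activated output: unit L2 norm per sample
W = rng.normal(0.0, 1.0, size=(n, m))            # Var(W) = 1
Y = X @ W                                        # unactivated output of the next layer

print(np.var(W), np.var(Y))                      # both close to 1, independent of n
```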
In addition, L2NU has many excellent properties that are competent for activation functions, including:
(1) The distribution of the output data is zero-centered; (2) It has numerical compression capability, with the function's value range within (-1, 1); (3) The function is smooth, with continuous values and continuous gradients; (4) It has an adaptive gradient; (5) The output is independent of the modulus (length) of the pre-activation pattern, which can effectively reduce model training cost when L2NU replaces Softmax.
It should be added that when L2NU is used as the activation function of a convolutional layer, the normalization is performed independently on each feature map rather than on all output elements of a single sample, which strengthens the expressive capability of the model. In addition, the L2NU output in the convolutional layer is amplified by a constant factor equal to the square of the product of the length and width of the corresponding feature map in order to keep the forward-propagated distribution of the network stable.
In a specific embodiment, the adaptive gradient of the L2NU is the key to guaranteeing smooth optimization of the network; the application discloses how L2NU adaptively changes the gradient of each neuron's activation function so that gradient back propagation proceeds more smoothly.
In Maxout, only one neuron in each group takes part in forward propagation or gradient back propagation at a time; the propagation of the remaining neurons is masked by the activation function and does not participate in the update, which causes a large amount of parameter redundancy in the network. For L2NU, the activation function acts on a whole group of neurons in a network layer: the output values of these neurons jointly construct mutually associated activation functions that generate their respective activation values.
Assume that the output of a layer of neurons is (x_1, x_2, …, x_n). Consider only the effect that the value x_i output by one neuron of this layer has on its own activation value, treating the activation values of the remaining neurons as constants, and let
a = sqrt( Σ_{j ≠ i} x_j^2 )
Then the activation function of this neuron can be expressed as formula (1-3):
L2NU(x_i) = x_i / sqrt(x_i^2 + a^2)    (1-3)
the function curves will also differ for different values of a, and fig. 3 shows a function image drawn based on several different values of a.
As fig. 3 shows, different values of a directly change the form of the activation function, and more specifically the range of its gradient: when a tends to 0, the function tends to the Sign function, i.e., the gradient near the origin of the coordinate axes is large and changes sharply while the gradient at both ends decreases toward 0; when a is large, the function is nearly linear in an interval around the origin, and the gradient gradually decreases to 0 toward both ends.
Such functional features bring about the following characteristics:
(1) When the output values of the neurons in the layer differ greatly, for instance when the outputs of the other neurons are all small and one neuron's output is large, the back-propagated gradient of that neuron tends to 0 while the other neurons receive larger gradients, which counteracts the tendency of an individual neuron's oversized value to drown out the others. This characteristic ensures that during training the parameters of all neurons in the same layer receive reasonable gradient updates and stay at a similar numerical scale;
(2) When the average scale of the output values of the layer is small, each neuron's parameters receive a larger update gradient; conversely, when the average scale is large, each neuron's parameters receive a smaller update gradient. Since the scale of the network's intermediate outputs generally grows over time, the network obtains larger update gradients early in training and more stable update gradients later, and the activation function also keeps the parameter scale within a certain range, reducing the possibility of gradient explosion;
(3) Based on the first two properties, the numerical scales of the neurons gradually approach each other during training, so it is easy to estimate that, for a layer with n neurons whose weights W have mean 0 and variance 1, the gradient of each neuron's activation function stays around n^(-1/2). Since the output value is independent of the neuron's pre-activation value, neuron saturation is effectively prevented and neuron death is not a concern;
(4) The L2NU has the ability to scale network layer parameters while keeping the activation value constant, as shown in equations (1-4):
L2NU(X×W+B)=L2NU(X×k×W+k×B) (1-4)
wherein X is the input pattern, W is the weight matrix of the network layer, B is the bias vector of the network layer, and k is a positive constant. This feature means the training process does not need an L2 regularization term on the parameters in the loss function to prevent the parameter values from growing excessively.
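The scale invariance of equation (1-4) is easy to verify numerically; the snippet below is an illustrative check (the helper name and random shapes are ours, not the application's).

```python
import numpy as np

def l2nu(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
X = rng.standard_normal(64)          # input pattern
W = rng.standard_normal((64, 32))    # layer weights
B = rng.standard_normal(32)          # layer bias
k = 7.3                              # any positive constant

lhs = l2nu(X @ W + B)
rhs = l2nu(X @ (k * W) + k * B)
print(np.allclose(lhs, rhs))         # True: scaling W and B by k leaves the activation unchanged
```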
Based on the characteristics brought by the L2NU, the back-propagated gradient values of the network change dynamically during training. This dynamic change prevents all neurons from reaching extreme values at which back propagation fails, and guarantees that at least one neuron in each network layer propagates an effective gradient to the layers in front of it. Therefore, applying L2NU to the network as the activation function greatly reduces the possibility of gradient problems in gradient-based optimization, so that network structures can be designed even deeper.
Fig. 4 compares conventional activation functions such as Sigmoid and ReLU with the L2 normalization unit: a conventional activation function takes the output value of a single neuron as input and maps it to a new value, whereas the L2 normalization unit takes the output pattern of a network layer as input and maps it to a new pattern.
As shown in fig. 5, the method is to replace the ReLU activation function in the native structure of all residual modules in the ResNet-50 image classification model with the L2NU activation function.
According to the embodiment of the present invention, in step S103: in the recent prior art, almost all classification neural networks use the Softmax function as the activation function of the output layer, and the combination of the Softmax function and the cross-entropy loss function, backed by a solid statistical foundation, works very well for training classification models. However, owing to the special construction of the L2NU, replacing the traditional Softmax function as the output layer activation function makes the network easier to optimize.
To make use of the cross-entropy loss function conveniently, when L2NU is used as the output layer activation function it is shifted and scaled so that its value range becomes [0, 1]. The transformed activation function is called L2NU_p, and its expression is formula (1-5):
L2NU_p(x_i) = (1 + L2NU(x_i)) / 2 = (1 + x_i / sqrt(x_1^2 + x_2^2 + … + x_n^2)) / 2    (1-5)
assuming a classification problem with two classes, the sample labels of the two classes are (0,1) and (1,0), respectively, so that the maximum of the predicted values (x, y) of the network output, | f (x) -f (y) | can improve the classification capability of the model, where f is the activation function. Fig. 6 and 7 respectively plot | f (x) -f (y) | function images of the network prediction value activated by the Softmax activation function and the network prediction value activated by L2 NUp.
Observing fig. 6, it is easy to see that only when the numerical scale of the network prediction values is large can the two activation values Softmax(x) and Softmax(y) be far apart; the gradient direction of network training must therefore first increase the numerical scale of both values before the network can obtain a better result. For L2NU_p, as shown in fig. 7, the two activation values can be maximally separated at every scale, even close to the origin of the coordinate axes, and the gradient that maximizes their separation points toward x = -y. The advantage is that the optimal target of maximizing the distance between the two activations can be reached regardless of the scale of the network parameters. Moreover, the optimization gradient is the same at every numerical scale, so the process is faster than with the Softmax function; for L2NU_p this optimum is attainable, whereas for Softmax it can only be approached by iteration and never reached. The gradient comparison of the two functions plotted in fig. 8 makes this process clearer.
The above advantages follow from the fourth characteristic of the L2NU described above. In essence, because the computation of the network is no longer affected by the modulus of each layer's activation pattern but only by its direction, the logits output during optimization no longer needs to fit values of unbounded magnitude, only the vector direction of the labels. The cost of rotating one vector to align with another is far lower than that of training a vector toward another vector of much greater modulus, so applying L2NU_p to the output layer effectively accelerates the training of the model.
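The scale argument above can be checked numerically. The snippet below (an illustrative sketch with our own helper names) shows that the Softmax separation of a two-class output grows only as the logit scale grows, while the L2NU_p separation is already maximal at x = -y regardless of scale.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def l2nu_p(v):
    return 0.5 * (1.0 + v / np.linalg.norm(v))

for scale in (0.1, 1.0, 10.0):
    logits = np.array([scale, -scale])   # x = -y, the direction favored by the L2NU_p gradient
    s = softmax(logits)
    p = l2nu_p(logits)
    print(scale, abs(s[0] - s[1]), abs(p[0] - p[1]))
# Softmax separation: ~0.10, ~0.76, ~1.00 (needs a large logit scale)
# L2NU_p separation:  ~0.71 at every scale
```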
According to the embodiment of the invention, in step S106 the application performs performance verification on the image classification model based on the L2NU activation function. The verification experiments compare the influence of the ReLU, Sigmoid, Tanh and L2NU activation functions on the accuracy of fully-connected networks and on the maximum trainable number of layers, and show how training behavior and accuracy change after L2NU replaces the native activation functions of several existing high-performing convolutional network structures.
Two common image classification data sets, CIFAR-10 and CIFAR-100, are used in the performance verification experiments to verify the influence of each activation function on model accuracy and gradient back propagation. The CIFAR-10 data set consists of 32 × 32 pixel color images with three RGB channels covering ten categories of animals and vehicles, with 50000 training images and 10000 test images. The CIFAR-100 data set is similar to CIFAR-10, with the same image size and the same numbers of training and test samples; the difference is that the number of classes increases to 100, which demands stronger fitting and classification ability from the model. All image data are augmented by translation and flipping to expand the number of training samples, where images are translated in the 4 directions up, down, left and right by at most 4 pixels.
The performance verification experiment is divided into two parts. The first part trains FCNs of different depths with different activation functions and records the best accuracy of each network on the CIFAR-10 data set after 3000 rounds (epochs) of training at a learning rate of 0.01, ensuring that the models are fully trained and eliminating possible effects of overfitting. It should be added that, to make it convenient to construct and describe network models of different depths, the output dimensions (numbers of neurons) of all layers form a geometric progression. Specifically, for an n-layer network, the dimension of each layer is defined by formula (1-6), where d_i is the output dimension of the i-th layer, d_in is the input dimension of the network (i.e., the dimension of the data), d_out is the output dimension of the network (i.e., the dimension of the label), and ⌈·⌉ denotes rounding up:
d_i = ⌈ d_in · (d_out / d_in)^(i/n) ⌉    (1-6)
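Under the geometric-progression reading of formula (1-6) given above (our reconstruction of the formula, so treat the exact expression as an assumption), the layer dimensions for the FCN experiments could be generated as follows.

```python
import math

def fcn_layer_dims(d_in: int, d_out: int, n_layers: int) -> list[int]:
    """Output dimension of each layer, interpolating geometrically from d_in to d_out."""
    return [math.ceil(d_in * (d_out / d_in) ** (i / n_layers))
            for i in range(1, n_layers + 1)]

# CIFAR-10 FCN: 32 * 32 * 3 = 3072 inputs, 10 classes, e.g. a 5-layer network
print(fcn_layer_dims(3072, 10, 5))   # the last entry is always d_out = 10
```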
The second part of the experiment uses three VGG convolutional neural network structures and a ResNet model to learn the CIFAR-100 classification data set. The three VGG networks are VGG-11, VGG-13 and VGG-16, i.e., structures A, B and D of the original work, and by default the models do not use batch normalization. L2NU is compared with ReLU as an activation function by replacing the ReLU of the convolutional-layer and fully-connected-layer activation functions in the native network structure with L2NU. On the other hand, the Softmax function of the network output layer is replaced with L2NU_p to compare the performance difference between L2NU_p and Softmax as output layer activation functions. The VGG networks use the Adam optimizer with a learning rate of 0.001, a data batch size of 2000 samples and 100 training rounds, and each model is run in 3 parallel experiments. ResNet uses a Momentum optimizer with a momentum coefficient of 0.9 and a data batch size of 100; after 100 rounds of training at a learning rate of 0.1, 10 rounds of parameter fine-tuning are performed at a learning rate of 0.01. When measuring model running time, the models are trained on the following hardware: Intel 9960X + Nvidia RTX 2080Ti.
In the performance verification experiment that trains FCNs of different depths with different activation functions, the experiment shows the training behavior on the CIFAR-10 data set when the Sigmoid function, the Tanh function, ReLU and L2NU are used as the hidden layer activation function of the FCN. The models corresponding to the Sigmoid, Tanh and ReLU functions use the Softmax function as the output layer activation function, while for the networks using L2NU as the hidden activation unit two output-layer variants are trained, one with Softmax and one with L2NU_p. The FCN structures used in the experiment are networks of 2 to 16 layers, a 20-layer network and a 50-layer network, with the dimension of each layer defined by formula (1-6).
The training accuracy of FCNs with different numbers of layers under the given conditions is shown in fig. 9. As fig. 9 shows, within 6 layers the networks using ReLU achieve the best classification accuracy among the activation functions compared. However, the models corresponding to ReLU, the Sigmoid function and the Tanh function all reach their best performance at 3 layers, and performance starts to decline once the depth exceeds 3 layers. When the depth grows to 8 layers or more, the FCN using ReLU can no longer be trained and optimized normally in any of the parallel experiments and its accuracy stays at 10%, meaning the model performs no better than blind guessing; the FCN using the Tanh function can no longer be trained beyond 9 layers, and the FCN using the Sigmoid function can no longer be trained beyond 10 layers. In contrast, for the FCN using L2NU as the hidden layer activation function and L2NU_p as the output layer activation function, training accuracy improves steadily as the number of layers increases: at a depth of 12 layers it surpasses the accuracy of the 3-layer ReLU network, and even with a 50-layer structure it still trains normally and reaches higher classification accuracy under the same number of training iterations. The 50-layer FCN using L2NU reaches a classification accuracy of 66.72%, far higher than the 61.15% of the three-layer FCN using ReLU. In addition, if the output layer of the network using L2NU is replaced with the Softmax function, the results of the parallel experiments fluctuate strongly, and the average classification accuracy also trends downward as the number of network layers increases.
These experiments show that after L2NU is used as the hidden layer activation function, the trainable depth of the FCN model increases greatly, and even a 50-layer network can be trained with an effective gain in accuracy. On the other hand, the experiments also verify the effectiveness of L2NU_p as an output layer activation function participating in the loss computation: replacing the Softmax function with L2NU_p effectively improves the training stability of FCN models that use L2NU and the classification performance of deep structures.
In the performance verification experiment that replaces the activation functions of convolutional networks with L2NU, L2NU replaces the activation functions of three VGG models of different depths to show its influence, compared with ReLU, on model classification accuracy when applied to the fully-connected layers and the convolutional layers; the verification metrics are Top-1 accuracy and Top-5 accuracy. By default the VGG networks do not use a BN layer, and only the VGG-16 experiment adds a comparison with a BN layer. We then show the accuracy of L2NU replacing ReLU on the ResNet-50 model and compare the accuracy and the computational cost of training with the BN layers enabled and disabled. The specific experimental data are shown in the following tables:
TABLE 1 model accuracy performance after replacement of VGG-11 activation function with L2NU
TABLE 2 model accuracy performance after replacement of VGG-13 activation function with L2NU
TABLE 3 model accuracy Performance after replacement of VGG-16 activation function with L2NU
Model | Activation function | BN layer used | Precision (%) | Total training time (h)
ResNet-50 | ReLU | No | 1.00 | 1.61
ResNet-50 | L2NU | No | 68.97 | 3.01
ResNet-50 | ReLU | Yes | 66.72 | 5.16
ResNet-50 | L2NU | Yes | 67.27 | 6.46
TABLE 4 model accuracy versus training elapsed time after replacement of ResNet activation function with L2NU
The data in Table 1 show that replacing all activation functions of VGG-11 with L2NU effectively improves the Top-1 and Top-5 classification accuracy of the model. Likewise, the data in Tables 2 and 3 show that replacing all activation functions of VGG-13 and VGG-16 with L2NU also effectively improves Top-1 and Top-5 accuracy. On the other hand, although VGG-11 without BN can still be trained with its native activation functions, in the deeper VGG-13 the model using ReLU for all hidden layers cannot be optimized effectively within the given number of training rounds: its Top-1 accuracy is only 1% and its Top-5 accuracy only 5%, i.e., the network learns nothing from the data set. In the still deeper VGG-16, both models that use ReLU as the fully-connected layer activation function lose the ability to learn. Furthermore, in all three groups of experiments, replacing only the ReLU of the fully-connected layers with L2NU already yields improvements in Top-1 and Top-5 accuracy. Together, the three groups of experiments show that when the network is deep, using L2NU as the fully-connected layer activation function yields better classification ability. Although the second model of Table 1 does not improve Top-1 accuracy over the first, it still brings a small improvement in Top-5 accuracy relative to the original VGG-11. In addition, Tables 1, 2 and 3 show that replacing the native ReLU structure with L2NU at any position improves Top-1 and Top-5 accuracy simultaneously.
Comparing the VGG-16 models in Table 3 that use the BN layer with those that do not, the VGG-16 using ReLU changes from untrainable to trainable, but the BN layer greatly increases the training cost, and relatively stable accuracy cannot be reached within 100 rounds of training. The VGG-16 using L2NU, on the other hand, further improves its Top-1 and Top-5 accuracy under the same number of training rounds. This indicates that L2NU can also obtain an accuracy benefit from the BN structure, and does so at a lower training cost than the same network structure using ReLU.
It should be noted that when ReLU is used as the hidden fully-connected layer activation function with L2NU_p as the output layer activation function, or when L2NU is used as the hidden fully-connected layer activation function with the Softmax function as the output layer activation function, network accuracy decreases to different degrees. Combined with the previous experiments, this suggests that the behavior of the last layer's activation function is affected by the activation functions of the preceding layer or layers, and a model applying L2NU needs to use L2NU_p as the output layer activation function to achieve the best effect.
The data in Table 4 show that, with the BN layer, both activation functions allow ResNet-50 to train smoothly, and the network using L2NU as the activation function reaches higher classification accuracy. Without the BN layer, by contrast, the training stability of the ResNet model using ReLU drops sharply: in the parallel experiments, gradient problems occur after 10 rounds of training and the model cannot be optimized, whereas the ResNet model using L2NU not only trains normally but also reaches higher classification accuracy than the BN-equipped network under the same number of training rounds, indicating that the BN layer brings extra computational cost along with increased training difficulty. More importantly, although the network applying L2NU takes 25% more time than native ResNet, L2NU can be trained without the BN layer, and removing the BN layer saves 56% of the computation time.
The present application proposes a new activation function called the L2 normalization unit, abbreviated L2NU. Unlike common activation functions such as ReLU or the Sigmoid function, whose object of action is a single neuron, the L2NU acts on a whole network layer: it computes the L2 normalization of the layer's output and uses it as the activation value, which gives it a certain nonlinear capability and brings many additional advantages.
In the training of fully-connected networks, after the activation functions commonly used by such networks, e.g. the Sigmoid function, ReLU and the Tanh function, are replaced with L2NU, the trainable depth of the model increases greatly: even without introducing any additional structure or training tricks, a 50-layer fully-connected network is trained successfully, and 50 layers is not the upper limit of what L2NU can handle. Moreover, in the experiments, once the network structure is deepened to 12 layers, the fully-connected network using the L2NU activation function exceeds the best accuracy of the other activation functions over all network structures, and its accuracy improves steadily as the number of layers grows.
In the training of convolutional networks, after ReLU, the activation function used most frequently in such networks, is replaced with L2NU, the training difficulty of the model drops noticeably while its accuracy improves. In the experiments, VGG convolutional networks of different structures serve as base models, and the activation functions of the convolutional layers and of the fully-connected layers are replaced separately; in the models with deeper structures, replacing any ReLU with L2NU effectively improves Top-1 and Top-5 accuracy. The models using L2NU complete training normally and stably without batch normalization, whereas VGG-13 (10 convolutional layers and 3 fully-connected layers) and the deeper VGG-16 using ReLU cannot be optimized under the same conditions. Finally, models using L2NU can still achieve better accuracy when batch normalization is applied. The experiments then take ResNet-50 as the base model, replace its native activation function with L2NU, and obtain an effective accuracy improvement. When the BN layers are disabled, only the network corresponding to L2NU trains successfully. Combined with the earlier experiments, these results demonstrate the excellent gradient propagation ability of L2NU: it can be optimized independently of the BN layer, obtain similar or even better model performance, and save a large amount of computation by removing the BN layer.
This brand-new deep learning activation function greatly extends the maximum trainable depth of deep learning models optimized with gradient-based algorithms. The L2 normalization unit is a pattern-to-pattern mapping, unlike the value-to-value mapping of traditional activation functions; this special mathematical structure guarantees that gradients are propagated effectively during back-propagation, so that application fields requiring deeper network models, such as image classification and natural language processing, obtain more effective and more stable optimization gradients when deep models are used.
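As supporting reasoning added here for illustration (not taken from the original text): if the unit computes $f(X) = X / \lVert X \rVert_2$, its Jacobian has the closed form

$$\frac{\partial f(X)}{\partial X} \;=\; \frac{1}{\lVert X \rVert_2}\Bigl(I - \hat{X}\hat{X}^{\top}\Bigr), \qquad \hat{X} = \frac{X}{\lVert X \rVert_2},$$

so back-propagation through the unit rescales the incoming gradient by $1/\lVert X \rVert_2$ and removes only its component along $\hat{X}$, rather than zeroing individual entries as ReLU does or saturating as Sigmoid does.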
Compared with traditional activation functions such as ReLU, Tanh, and Sigmoid, this activation function has the following advantages when applied to deep learning models for image classification: (1) the classification accuracy of the model is improved; (2) the training time of the model parameters is shortened and the model converges faster; (3) the training process is more stable, with the numerical ranges of the parameters and of the back-propagated gradients remaining relatively stable throughout training.
Referring to fig. 10, in a second aspect, the present application further proposes an image classification method, including the steps of:
S201: acquiring an image classification sample to be classified;
S202: inputting the image classification sample into an image classification model based on the L2NU activation function, wherein the image classification model based on the L2NU activation function is obtained by training in advance based on the method described in the first aspect;
S203: outputting the classification result of the image classification model based on the L2NU activation function.
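A minimal usage sketch of steps S201 to S203 follows; the constructor name, checkpoint path, input size, and preprocessing below are illustrative placeholders rather than values taken from the disclosure:

```python
import torch
from PIL import Image
from torchvision import transforms

# Hypothetical names: build_l2nu_resnet50() stands for a constructor that applies
# the modification steps of the first aspect; the checkpoint path is a placeholder.
model = build_l2nu_resnet50()
model.load_state_dict(torch.load("l2nu_resnet50_weights.pt", map_location="cpu"))
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((32, 32)),   # CIFAR-sized input, matching the training data sets
    transforms.ToTensor(),
])

image = Image.open("sample.png").convert("RGB")         # S201: acquire the sample
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))      # S202: run the L2NU-based model
print(logits.argmax(dim=1).item())                      # S203: output the class index
```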
Referring to fig. 11, in a third aspect, the present application further provides an image classification model generation apparatus based on an L2NU activation function, the apparatus including:
model modification module 301: used for replacing the ReLU activation functions in the native structures of all residual modules in the ResNet-50 image classification model with L2NU activation functions; and also for removing all BN layers in the ResNet-50 image classification model; and also for setting the activation function of the output layer of the ResNet-50 image classification model to the L2NU_p function;
and also for initializing each weight matrix of the neural network of the modified ResNet-50 image classification model to a standard normal distribution, and setting the bias of each layer of the neural network to a zero vector;
the mathematical definition of the L2NU activation function is as follows:
L2NU(X)_i = x_i / sqrt(x_1^2 + x_2^2 + … + x_n^2), i = 1, 2, …, n,
wherein n is the number of neurons in the network layer of the image classification model, X is the pre-activation pattern of the network layer, X = (x_1, x_2, …, x_n), and x_i is the output of the i-th neuron, i ∈ {1, 2, …, n};
the training module 302: used for training the modified ResNet-50 image classification model based on the image classification training data set;
the determination module 303: used for training the modified ResNet-50 image classification model into the image classification model based on the L2NU activation function by the method described in the first aspect.
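For illustration, a hedged sketch of how the model modification module's operations might be implemented in PyTorch; the torchvision ResNet-50 definition, the module-swapping helper, and the use of nn.Identity to drop BN layers are all illustrative assumptions, and the L2NU class is the sketch given earlier:

```python
import torch.nn as nn
from torchvision.models import resnet50  # used only for concreteness

def modify_resnet50_for_l2nu() -> nn.Module:
    """Sketch of the modification module: swap ReLU for L2NU, strip BN layers,
    and re-initialize weights from a standard normal with zero biases."""
    model = resnet50(weights=None)

    def swap(module: nn.Module) -> None:
        for name, child in module.named_children():
            if isinstance(child, nn.ReLU):
                setattr(module, name, L2NU())         # L2NU sketch defined earlier
            elif isinstance(child, nn.BatchNorm2d):
                setattr(module, name, nn.Identity())  # removing BN = making it a no-op
            else:
                swap(child)

    swap(model)

    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(m.weight, mean=0.0, std=1.0)   # standard normal init
            if m.bias is not None:
                nn.init.zeros_(m.bias)                     # zero-vector bias
    # The output-layer activation (L2NU_p) and the cross-entropy loss are
    # applied outside this sketch, during training.
    return model
```

Training would then follow the training module, for example stochastic gradient descent with a Momentum optimizer (momentum coefficient 0.9), batch size 100, and a final fine-tuning phase at learning rate 0.01, as recited in claim 3.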
Referring to fig. 12, in a fourth aspect, the present application further provides an image classification apparatus, including:
the acquisition module 401: used for acquiring an image classification sample collected by the image acquisition device;
the classification module 402: used for inputting the image classification sample into an image classification model based on the L2NU activation function, the image classification model being obtained in advance by training based on the method in the first aspect, and for outputting the classification result of the image classification model based on the L2NU activation function.
In a fifth aspect, reference is made to fig. 13, which illustrates a schematic structural diagram of a computer system 100 suitable for implementing an electronic device of an embodiment of the present application. The electronic device shown in fig. 13 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 13, the computer system 100 includes a Central Processing Unit (CPU) 101 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 102 or a program loaded from a storage section 108 into a Random Access Memory (RAM) 103. In the RAM 103, various programs and data necessary for the operation of the system 100 are also stored. The CPU 101, ROM 102, and RAM 103 are connected to each other via a bus 104. An input/output (I/O) interface 105 is also connected to bus 104.
The following components are connected to the I/O interface 105: an input portion 106 including a keyboard, a mouse, and the like; an output section 107 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 108 including a hard disk and the like; and a communication section 109 including a network interface card such as a LAN card, a modem, or the like. The communication section 109 performs communication processing via a network such as the internet. A drive 110 is also connected to the I/O interface 105 as needed. A removable medium 111 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 110 as necessary, so that a computer program read out therefrom is mounted into the storage section 108 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 109, and/or installed from the removable medium 111. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 101.
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method shown in fig. 1.
It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While the present invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
In the description of the present application, it is to be understood that the terms "upper", "lower", "inner", "outer", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present application and simplifying the description, and do not indicate or imply that the device or element referred to must have a particular orientation, be constructed in a particular orientation, and operate, and thus, should not be construed as limiting the present application. The word 'comprising' does not exclude the presence of elements or steps not listed in a claim. The word 'a' or 'an' preceding an element does not exclude the presence of a plurality of such elements. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. Any reference signs in the claims shall not be construed as limiting the scope.

Claims (10)

1. An image classification model generation method based on an L2NU activation function is characterized in that: the method comprises the following steps:
S101: replacing the ReLU activation functions in the native structures of all residual modules in the ResNet-50 image classification model with L2NU activation functions;
the L2NU activation function is specifically:
L2NU(X)_i = x_i / sqrt(x_1^2 + x_2^2 + … + x_n^2), i = 1, 2, …, n,
wherein n is the number of neurons in the network layer of the image classification model, X is the pre-activation pattern of the network layer, X = (x_1, x_2, …, x_n), and x_i is the output of the i-th neuron, i ∈ {1, 2, …, n};
S102: removing all BN layers in the ResNet-50 image classification model in the step S101;
S103: setting the activation function of the output layer of the ResNet-50 image classification model in step S102 to the L2NU_p function;
S104: initializing each weight matrix of the neural network of the ResNet-50 image classification model in the step S103 to a standard normal distribution, and setting the bias of each layer of the neural network to a zero vector;
S105: training the ResNet-50 image classification model in the step S104 based on the image classification training data set to obtain an image classification model based on the L2NU activation function.
2. The method for generating an image classification model based on an L2NU activation function according to claim 1, wherein: the S103 specifically includes: setting the activation function of the output layer of the ResNet-50 image classification model in step S102 to the L2NU_p function,
and, based on the L2NU_p function, performing cross entropy loss calculation for the loss function of the ResNet-50 image classification model, the loss function being specifically:
L = -Σ_{i=1}^{n} z_i · log(L2NU_p(Y)_i),
where Y is the logits output of the network, Y = (y_1, y_2, …, y_n), y_i is the unactivated output of the i-th neuron of the network output layer, n is the label dimension, and z_i is the i-th element of the one-hot sample label Z = (z_1, z_2, …, z_n).
3. The method for generating an image classification model based on an L2NU activation function according to claim 1, wherein: the S105 specifically includes: based on the CIFAR-10 and CIFAR-100 image classification training data sets, training the ResNet-50 image classification model in the step S104 by stochastic gradient descent, setting the optimizer to a Momentum optimizer with a momentum coefficient of 0.9 and a data batch size of 100, and, after 100 rounds of training, performing 10 rounds of parameter fine-tuning with a learning rate of 0.01 to obtain an image classification model based on the L2NU activation function.
4. The method for generating an image classification model based on an L2NU activation function according to claim 1, wherein: after step S105, the method further comprises:
S106: performing performance verification on the image classification model based on the L2NU activation function.
5. The method for generating an image classification model based on an L2NU activation function according to claim 4, wherein: the S106 specifically includes: training FCNs with different numbers of layers using the Sigmoid, Tanh, ReLU, and L2NU activation functions respectively, and recording the best accuracy of each FCN after 3000 rounds of training on the CIFAR-10 image classification training data set with a learning rate of 0.01.
6. The method for generating an image classification model based on an L2NU activation function according to claim 5, wherein: the S106 further includes: using the VGG-11, VGG-13, VGG-16, and ResNet convolutional neural network models respectively to learn the CIFAR-100 image classification training data set, replacing the ReLU activation functions of the convolutional layers and fully-connected layers in the native structure of each convolutional neural network with L2NU activation functions to compare the performance difference between the L2NU activation function and the ReLU activation function, and replacing the Softmax function of the output layer of each convolutional neural network with the L2NU_p function to compare the performance difference between the L2NU_p function and the Softmax function.
7. An image classification method, characterized by: the method comprises the following steps:
S201: acquiring an image classification sample to be classified;
S202: inputting the image classification sample into an image classification model based on the L2NU activation function, wherein the image classification model based on the L2NU activation function is obtained by training in advance based on the method of any one of claims 1 to 6;
S203: outputting the classification result of the image classification model based on the L2NU activation function.
8. An image classification model generation device based on an L2NU activation function is characterized in that: the device comprises:
a model modification module: used for replacing the ReLU activation functions in the native structures of all residual modules in the ResNet-50 image classification model with L2NU activation functions; and also for removing all BN layers in the ResNet-50 image classification model; and also for setting the activation function of the output layer of the ResNet-50 image classification model to the L2NU_p function;
and also for initializing each weight matrix of the neural network of the modified ResNet-50 image classification model to a standard normal distribution, and setting the bias of each layer of the neural network to a zero vector;
the mathematical definition of the L2NU activation function is as follows:
L2NU(X)_i = x_i / sqrt(x_1^2 + x_2^2 + … + x_n^2), i = 1, 2, …, n,
wherein n is the number of neurons in the network layer of the image classification model, X is the pre-activation pattern of the network layer, X = (x_1, x_2, …, x_n), and x_i is the output of the i-th neuron, i ∈ {1, 2, …, n};
a training module: for training the modified ResNet-50 image classification model based on the image classification training data set;
a determination module: for training the modified ResNet-50 image classification model into the image classification model based on the L2NU activation function by the method of any one of claims 1 to 6.
9. An image classification device characterized by: the device comprises:
an acquisition module: for acquiring image classification samples collected by an image acquisition device;
a classification module: for inputting the image classification samples into an image classification model based on the L2NU activation function, the image classification model being obtained in advance by training based on the method of any one of claims 1 to 6, and for outputting the classification result of the image classification model based on the L2NU activation function.
10. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202210962126.1A 2022-08-11 2022-08-11 Image classification model generation method and device based on L2NU activation function Pending CN115331053A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210962126.1A CN115331053A (en) 2022-08-11 2022-08-11 Image classification model generation method and device based on L2NU activation function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210962126.1A CN115331053A (en) 2022-08-11 2022-08-11 Image classification model generation method and device based on L2NU activation function

Publications (1)

Publication Number Publication Date
CN115331053A true CN115331053A (en) 2022-11-11

Family

ID=83922780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210962126.1A Pending CN115331053A (en) 2022-08-11 2022-08-11 Image classification model generation method and device based on L2NU activation function

Country Status (1)

Country Link
CN (1) CN115331053A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824281A (en) * 2023-08-30 2023-09-29 浙江大学 Privacy-protected image classification method and device
CN116824281B (en) * 2023-08-30 2023-11-14 浙江大学 Privacy-protected image classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination