CN110135580B - Convolution network full integer quantization method and application method thereof - Google Patents


Info

Publication number
CN110135580B
Authority
CN
China
Prior art keywords
network
output
weight
integer
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910344069.9A
Other languages
Chinese (zh)
Other versions
CN110135580A (en
Inventor
钟胜
周锡雄
王建辉
商雄
蔡智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201910344069.9A priority Critical patent/CN110135580B/en
Publication of CN110135580A publication Critical patent/CN110135580A/en
Application granted granted Critical
Publication of CN110135580B publication Critical patent/CN110135580B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a full integer quantization method for convolutional networks, belonging to the technical field of convolutional network quantization and compression. The input feature map, network weights, and output feature map of the convolutional network are all expressed as integers, so that the forward inference of each network layer involves only integer computation. To preserve performance after integer quantization, the network is retrained, and the result of full integer inference is simulated during training. The invention also discloses an application method of the full integer quantized convolutional network. Compared with a convolutional network expressed in single-precision floating point, the scheme occupies fewer resources and infers faster; compared with a fixed-point quantized network, it expresses the input, output, and weights of the network with fixed-length integers, does not need to account for the growing bit width of the layer-by-layer output results, has stronger regularity, and is therefore better suited to resource-limited platforms such as FPGAs and ASICs.

Description

Convolution network full integer quantization method and application method thereof
Technical Field
The invention belongs to the technical field of quantization compression of a convolutional network, and particularly relates to a convolutional network full integer quantization method and an application method thereof.
Background
Since AlexNet was published in 2012, deep learning methods represented by convolutional neural networks have achieved year-by-year breakthroughs in target discrimination and recognition, and the accuracy of existing complex networks can exceed 95%; however, such networks were not designed with deployment on resource-limited embedded platforms in mind. Resource-constrained applications, such as AR/VR, smartphones, and FPGA/ASIC platforms, require quantization and compression of the models to reduce their size and computing-resource demands so that they fit these embedded platforms.
Facing the model quantization and compression problem, there are two main approaches. The first is to design a more efficient, lightweight network structure that accommodates constrained computing resources, such as MobileNet or ShuffleNet. The second is to apply low-bit quantization to the intermediate results of an existing network, including weights, inputs, and outputs, reducing the network's computing-resource requirements and computing latency while leaving the network structure unchanged and preserving its accuracy.
For the second approach, existing low-bit quantization methods include TWN, BNN, and XNOR-Net. These methods reduce the weights and input/output quantities of the network to 1 or 3 bits, so that the multiply-accumulate operations of the convolution process can be replaced by XNOR and shift operations, reducing the use of computing resources. However, this approach has a significant drawback: the loss of accuracy is large. As for other quantization methods, they do not consider actual deployment in hardware; they quantize only the network weights, focusing on meeting the demand for storage resources while ignoring the demand for computing resources.
Disclosure of Invention
In view of the above drawbacks or needs for improvement in the prior art, the present invention provides a convolution network full integer quantization method and an application method thereof, which aim to express the input, output, and weights of a network with fixed-length integers; the quantization method keeps the accuracy loss of the network to about 5% while reducing the consumption of computing and storage resources.
In order to achieve the above object, the present invention provides a convolution network full integer quantization method, which comprises the following steps:
(1) obtaining a model, a floating point type weight and a training data set of a convolutional network, and initializing the network;
(2) for each convolution layer, firstly, calculating the distribution range of input IN, output OUT and weight WT of each layer through a floating point type reasoning process, and respectively calculating the maximum absolute extreme value of the input IN, the output OUT and the weight WT;
(3) updating the maximum absolute extreme values of the three in the training process of the current layer;
(4) performing integer quantization on the input and the weights of the current layer in the convolutional network according to the maximum absolute extreme values of the input IN, the output OUT, and the weight WT;
(5) according to the input and the weight of the integer quantization, the output of the integer quantization of the current layer is solved;
(6) carrying out inverse quantization on the output of the integer quantization of the current layer, reducing the output into a floating point type, and outputting to the next layer; if the next layer is the batch norm layer, merging the parameters of the batch norm layer into the current layer by adopting a merging means; repeating the steps (3) to (6) until the last layer in the convolutional network;
(7) back propagation, continuously updating the weight until the network converges, and storing the quantized weight and the additional parameters; the parameters after integer quantization are used in the forward derivation process of full integer, and integer is used to replace the original floating point operation.
Further, in step (3), updating the maximum absolute extreme values of the three during training specifically means updating them with an exponential moving average algorithm:
x_n = α·x_(n−1) + (1 − α)·x
wherein x_n is the maximum absolute extreme value of the input, output, or weight after this update, x_(n−1) is the maximum absolute extreme value from the previous update, x is the input, output, or weight extreme value obtained in the current calculation, and α is a weighting coefficient.
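A minimal sketch of this update, assuming α = 0.99 as in the embodiment described later (function and variable names are illustrative, not from the patent):

```python
# Exponential-moving-average update of a tracked extreme value:
#   x_n = alpha * x_{n-1} + (1 - alpha) * x
def ema_update(x_prev, x_new, alpha=0.99):
    return alpha * x_prev + (1 - alpha) * x_new

# Track the running maximum absolute value of a layer quantity over batches.
def update_abs_max(running_max, batch_values, alpha=0.99):
    batch_abs_max = max(abs(v) for v in batch_values)
    return ema_update(running_max, batch_abs_max, alpha)
```

Because α is close to 1, the tracked extreme value reflects the statistics of many batches rather than a single input, which is the property the patent relies on for generalization.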
Further, the step (4) is specifically as follows:
input integer quantization:
Q_IN = clamp(IN/S1)
wherein Q_IN represents the integer-quantized input; S1 = max{|IN|}/Γ, Γ = 2^N; N represents the number of quantization bits; clamp() truncates the part after the decimal point; max{|IN|} represents the maximum absolute extreme value of the input;
integer quantization of the weights:
Q_WT = clamp(WT/S2)
wherein Q_WT represents the integer-quantized weights; S2 = max{|WT|}/Γ, Γ = 2^N; max{|WT|} represents the maximum absolute extreme value of the weights.
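The two quantization formulas above can be sketched as follows, assuming N = 8 as in the embodiment and direct truncation for clamp(); names and list-based data handling are illustrative, and a separate clip to the signed int8 range is omitted for brevity:

```python
# Integer quantization with scale S = max{|x|} / 2^N and truncating clamp().
def quantize(values, abs_max, n_bits=8):
    scale = abs_max / (2 ** n_bits)          # S = max{|x|} / 2^N
    return [int(v / scale) for v in values]  # int() truncates the decimal part

# Inverse quantization restores an approximate floating point value.
def dequantize(q_values, abs_max, n_bits=8):
    scale = abs_max / (2 ** n_bits)
    return [q * scale for q in q_values]
```

For example, with abs_max = 1.0 and N = 8, the value 0.5 maps to 128 and 0.99 maps to 253 (truncated, not rounded).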
Further, the step (5) is specifically:
the output of the integer quantization, Q _ OUT, is:
Q_OUT=Q_IN×Q_WT×M
M=S1×S2/S3
wherein Q_IN represents the integer-quantized input and Q_WT the integer-quantized weights; since M = S1 × S2/S3 is floating point, let
M ≈ C / 2^S
The derivation of the parameters C and S is as follows:
first solve M = S1 × S2/S3:
wherein S1 = max{|IN|}/Γ, Γ = 2^N, and max{|IN|} represents the maximum absolute extreme value of the input; S2 = max{|WT|}/Γ, and max{|WT|} represents the maximum absolute extreme value of the weights; S3 = max{|OUT|}/Γ, and max{|OUT|} represents the maximum absolute extreme value of the output; N represents the number of quantization bits;
repeatedly multiply or divide M by 2 until 0 < M < 0.5; with a initialized to 0, each time M is multiplied by 2, a = a + 1, and each time M is divided by 2, a = a − 1; this count gives the final value of a;
then preset a value v, with 0 < v ≤ 32, and solve S and C according to the following formulas:
S = v + a
C = round(M × 2^v)
0 < C ≤ 2^v
where round() denotes rounding to the nearest integer.
Further, the integer-quantized output Q_OUT is:
Q_OUT = Q_IN × Q_WT × M
Before the output is integer-quantized, the nonlinear activation of Q_IN × Q_WT is carried out; the nonlinear activation adopts a shift approximation operation.
Further, the nonlinear activation of Q_IN × Q_WT is specifically:
nonlinear activation is performed on y = Q_IN × Q_WT by using a leaky activation function, whose specific form is as follows:
f(y) = y, for y ≥ 0; f(y) = 0.1·y, for y < 0
To ensure that Q_IN × Q_WT remains an integer after nonlinear activation, the above formula is approximated by shifts, as follows:
f(y) = y, for y ≥ 0; f(y) = (y + y<<1) >> 5, for y < 0
wherein y<<1 indicates that the binary value y is shifted left by one bit, and (y + y<<1) >> 5 indicates that the binary value (y + y<<1) is shifted right by 5 bits; the final nonlinearly activated Q_IN × Q_WT remains an integer.
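A minimal sketch of the shift-approximated activation on a single integer; Python's arithmetic right shift on negative integers floors toward negative infinity, which is assumed here to match the intended hardware behavior:

```python
# Shift approximation of the leaky activation on an integer value:
# for negative y, 0.1*y is replaced by (y + (y << 1)) >> 5 = 3y/32 ~= 0.094*y,
# so the result stays an integer.
def leaky_shift(y):
    if y >= 0:
        return y
    return (y + (y << 1)) >> 5  # 3*y / 32, floored
```

For instance, leaky_shift(-100) yields -10, close to 0.1 × (-100), while leaky_shift(64) passes positive values through unchanged.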
Further, if the next layer in the step (6) is a batch norm layer, merging the parameters of the batch norm layer into the current layer by adopting a merging means specifically comprises:
the calculation process of the batch norm layer is:
y = γ·(x − μ)/√(σ² + ε) + β
wherein x represents the input, y represents the output, ε represents a small value added to the denominator, μ represents the output mean, σ represents the output standard deviation, γ is a parameter generated in the calculation process of the batch norm layer, and β represents the bias;
since the batch norm follows the convolution process, the convolution process is expressed as:
x = Σ w × fmap(i, j)
wherein fmap(i, j) is the image feature at input-image position (i, j), w is the weight, and x is the convolution output fed into the batch norm layer;
therefore, merging the batch norm layer parameters into the convolution process gives:
the merged weight:
w_fold = γ·w/√(σ² + ε)
the merged bias:
β_fold = β − γ·μ/√(σ² + ε)
the convolution process after merging: y = Σ w_fold × fmap(i, j) + β_fold.
According to another aspect of the present invention, there is provided an application method of a full integer quantization convolution network, the application method comprising the steps of:
s1, obtaining a model, a floating point type weight and a training data set of the convolutional network, and initializing the network;
s2, for each convolution layer, firstly, the distribution range of the input IN, the output OUT and the weight WT of each layer is obtained through the reasoning process of a floating point form, and the maximum absolute extreme values of the input IN, the output OUT and the weight WT are respectively obtained;
s3, updating the maximum absolute extreme values of the three in the training process of the current layer;
s4, performing integer quantization on the input and the weights of the current layer in the convolutional network according to the maximum absolute extreme values of the input IN, the output OUT, and the weight WT;
s5, obtaining the output of the current layer integer quantization according to the input and weight of the integer quantization;
s6, carrying out inverse quantization on the output of the integer quantization of the current layer, reducing the output into a floating point type and outputting to the next layer; if the next layer is the batch norm layer, merging the parameters of the batch norm layer into the current layer by adopting a merging means; repeatedly and sequentially executing the steps S3 to S6 until the last layer in the convolutional network;
s7, used for back propagation, continuously updating the weight until the network convergence, saving the quantized weight, and additional parameters; the parameters after integer quantization are used in the forward derivation process of full integer, and integer is used for replacing the original floating point operation;
s8, inputting the image of the target to be detected into a full integer quantization convolution network, and dividing the image of the target to be detected into S × S grids;
s9, setting n anchor boxes with fixed aspect ratios, predicting n anchor boxes for each grid, each anchor box independently predicting the coordinates (x, y, w, h) of the target, the confidence p, and the probabilities of m categories; wherein x, y represent the target coordinates, and w, h represent the width and height of the target;
s10, according to the probability corresponding to each category calculated in the previous step, firstly, carrying out preliminary screening through a fixed threshold, filtering out candidate frames with the confidence coefficient lower than the threshold in the corresponding category, and then removing overlapped target frames through a non-maximum inhibition method;
and S11, selecting the targets with the corresponding probability exceeding the threshold in different categories for the reserved target frames to be displayed visually, and outputting the target detection result.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
(1) the invention adopts a full integer quantization method in which the input, output, and weights of the network are expressed by fixed-length integers; the quantization method can control the accuracy loss of the network to about 5 percent, and since the forward propagation process comprises only fixed-length integer multiplications, the demand on computing resources is friendlier;
(2) the absolute value extreme values of input and output of the network are calculated by adopting an exponential moving average algorithm, then quantization operation is carried out through the extreme values, the exponential moving average algorithm counts the distribution characteristics of a batch of data, so that the quantization result can meet the numerical characteristics of the batch of data and is not limited to specific input, and the quantization method is a necessary guarantee for generalization in practical application;
(3) merging measures are taken for the batch norm layer, parameters of the batch norm layer are directly merged to the convolutional layer, the process of quantifying the batch norm layer is directly omitted, and meanwhile, the process does not need to consider calculation of the batch norm layer when the network carries out forward reasoning;
(4) the shift activation process is moved before the quantization of the network output result: the shift activation operation is performed first on the intermediate output result, and the quantization of the network output is carried out afterwards. The method is based on the following reasoning: if the output were quantized to 8 bits before executing the shift activation, the shift would operate on an 8-bit signed number, whose precision is correspondingly coarse; before the output is quantized, it is expressed with a 32-bit value, and performing the shift activation on this value gives much finer precision. Thus, by changing the order of the operations, the error caused by the shift approximation of the activation layer is reduced.
Drawings
FIG. 1 is a training flow diagram of a full integer quantization method of the present invention;
FIG. 2 is a diagram illustrating an example of the structure of a convolutional neural network in an embodiment of the present invention;
FIG. 3 is a diagram illustrating a batch norm integration method according to the present invention;
FIG. 4 is an exemplary diagram of the cancellation of quantization and dequantization between adjacent layers of a network in the present invention;
FIG. 5 is a schematic diagram of the full integer forward derivation process of the present invention;
FIG. 6 is a graph of the target detection results before quantization;
FIG. 7 is a graph of the target detection results after quantization.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, the method of the present invention comprises the steps of:
(1) obtaining a model, a floating point type weight and a training data set of a convolutional network, and initializing the network;
specifically, the supporting embodiment of the invention adopts a network structure of YOLOV 2-tiny. Referring to FIG. 2, there are 6 max pool layers, 9 convolutional layers followed by a batch norm. The training framework employs a darknet, which is written in c language and opens open sources. The Yolo web author provides floating point type weights on the personal home page for download. The training data was trained using VOC2012 and VOC2007 data sets, which contain 20 classes of targets, and a total of 9963+ 11540-21503 labeled data. The width of an input image for initializing the network is 416 pixels, the height of the input image is 416 pixels, the number of channels of the image is 3, the number of pictures subjected to iterative training each time is 64, the momentum is 0.9, the learning rate is 0.001, the maximum iteration number is 60200, the network output is the position, the size and the confidence coefficient of a target in the image, and due to the fact that cross redundancy exists in detection results, the detection results need to be fused by a non-maximum suppression method, and therefore the output result of each detected target corresponds uniquely.
(2) For each convolution layer, firstly, the distribution range of input, output and weight of each convolution layer is obtained through a floating point type reasoning process, the maximum absolute extreme value | max | of the input, output and weight of each convolution layer is respectively obtained, and the extreme value is updated by using an exponential moving average algorithm (EMA) in a training process;
specifically, each layer of network weight includes parameters w and β, and input and output need to be quantized, which requires statistics of the maximum absolute values of 4 groups of w, β, IN, and OUT. In order to enable statistical absolute maxima, reflecting the statistical characteristics of the data set, rather than maxima under a particular input image, these extrema need to be updated using the EMA. The specific formula is as follows: x is the number ofn=αxn-1+(1-α)x。
xnValue, x, reserved for the current endn-1The value reserved for the last iteration process, and x is the result of this calculation. Alpha is a weight coefficient, generally selected between 0.9 and 1, and in the embodiment of the invention, alpha is 0.99.
(3) Quantize the input and the weights of the network according to the obtained maximum absolute values using the following quantization formulas, so that they can be expressed in int8:
quantized input: Q_IN = clamp(IN/S1)
quantized weight: Q_WT = clamp(WT/S2)
quantization coefficients: S1 = |MAX|/Γ, |MAX| = max{|IN|}, Γ = 2^N
S2 = |MAX|/Γ, |MAX| = max{|WT|}, Γ = 2^N
wherein N represents the number of quantization bits; IN is the input, WT is the weight, max{|IN|} is the maximum absolute extreme value of the input, and max{|WT|} is the maximum absolute extreme value of the weight;
Specifically, experience shows that the absolute values of each layer's input and weights lie in the range 0 to 1. A linear transformation using the statistical maximum absolute value normalizes the weights and the input to [−127, 127] by the above formulas. When a value is rounded to an integer, direct truncation is used rather than round-to-nearest; in the formulas, clamp() represents the truncation operation: int = clamp(float). In the embodiment of the invention, N = 8.
(4) The quantized output of the current layer can be obtained from the quantized input and weights. To ensure that the network output is also an integer value, quantization is performed using the following formulas:
floating point output: OUT = IN × WT = Q_IN × Q_WT × S1 × S2
quantized output: Q_OUT = OUT/S3 = Q_IN × Q_WT × (S1 × S2/S3)
where S3 is the output quantization coefficient. Since M = S1 × S2/S3 is a floating point number, to ensure the network inference process is pure integer computation, it can be approximated by a multiplication and a shift, and the coefficients C and S generated by the approximation are stored as parameters, as follows:
approximate calculation:
M ≈ C / 2^S
Q_OUT = (Q_IN × Q_WT × C) >> S
Specifically, since M = S1 × S2/S3 is a floating point number, and it must be ensured that the quantized output can be represented as an integer without floating point operations in the calculation, M is computed approximately: let M = M_Δ × 2^(−a). To keep the bit width of the integer multiplication as small as possible while keeping the approximate calculation accurate, the numerical range of C must be selected; in the embodiment of the invention, 0 < C ≤ 2^v with v = 24.
The calculation that solves for C and S repeatedly multiplies or divides M by 2 until 0 < M_Δ < 0.5. With a initialized to 0, each time M is multiplied by 2, a is incremented by 1, and each time M is divided by 2, a is decremented by 1. Finally, C = round(M_Δ × 2^v) and S = v + a, where round() denotes rounding to the nearest integer.
(5) Before a layer's result is output to the next layer, a nonlinear activation process is required. This process is a floating point operation, so to simulate the full-integer forward propagation calculation, a shift approximation is adopted for it. The result (in int8 representation) after the approximated shift activation is restored to a floating point representation by inverse quantization and output to the next layer; processes (2) to (5) are repeated until the last layer of the network. For a network with batch norm layers, a merging step is needed to fold the batch norm layer parameters directly into the previous layer.
Specifically, for a network with a batch norm layer, a merging approach needs to be taken, as shown in fig. 3. The specific implementation process is as follows: the batch norm calculation is described by the formula
y = γ·(x − μ)/√(σ² + ε) + β
wherein μ represents the output mean; ε is a small value added to the denominator to prevent division by zero when dividing by the variance, defaulting to 1e-5; σ represents the output standard deviation; γ is a parameter generated by the batch norm process; and β represents the bias. Since the batch norm follows the convolution process, i.e., x = Σ w × fmap(i, j), where w is the network weight and fmap(i, j) is the input feature map, a simple transformation integrates the batch norm into the convolution process, expressed as follows:
the merged weight w:
w_fold = γ·w/√(σ² + ε)
the merged bias β:
β_fold = β − γ·μ/√(σ² + ε)
the convolution process after merging: y = Σ w_fold × fmap(i, j) + β_fold
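A sketch of the merging step for a single channel, treating w, γ, μ, σ, and β as per-channel scalars for simplicity (an illustrative simplification; in practice these are vectors over the output channels):

```python
import math

# Fold batch norm parameters into the preceding convolution.
# epsilon defaults to 1e-5 as stated in the text; names are illustrative.
def fold_batchnorm(w, gamma, mu, sigma, beta, eps=1e-5):
    denom = math.sqrt(sigma ** 2 + eps)
    w_fold = gamma * w / denom             # merged weight
    beta_fold = beta - gamma * mu / denom  # merged bias
    return w_fold, beta_fold
```

Applying the folded parameters, w_fold × fmap + β_fold reproduces batch norm applied to the original convolution output, so the batch norm layer disappears from the inference graph.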
The invention applies a shift approximation to the nonlinear activation function, guaranteeing a full-integer forward derivation process. The invention adopts a leaky activation function, of the specific form:
f(x) = x, for x ≥ 0; f(x) = 0.1·x, for x < 0
The activation function mainly involves two operations: a data comparison and a floating point multiplication. To ensure that the forward derivation process uses only integer calculation, the invention adopts a shift approximation for it, of the specific form:
f(y) = y, for y ≥ 0; f(y) = (y + y<<1) >> 5, for y < 0
The shift approximation of the invention is numerically equivalent to the approximation:
0.1 ≈ 3/32 = 0.09375
In the actual calculation process, the shift activation operation is performed before the quantization of the final output in step (4). The bit width of the final output value is kept consistent with that of the input value, preparing for the forward derivation of the next layer, and the error caused by the shift approximation of the activation layer is thereby reduced.
(6) Back propagation, continuously updating the weight until the network converges, and storing the quantized weight and the additional parameters; the parameters after integer quantization can be used in the forward derivation process of full integer, and integer is used to replace the original floating point operation.
Specifically, assuming the convolutional layer has L_M input channels, L_N output channels, and a convolution kernel of size K, the storage space required after integer quantization is about 1/4 of that required before, as shown below.
After quantization:
Storage_int8 = L_M × L_N × K × K + L_N + 2 × sizeof(int32)/sizeof(int8)
Before quantization:
Storage_float = (L_M × L_N × K × K + L_N + bn × L_N × 3) × sizeof(float), bn ∈ {0, 1}
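A sketch of the storage comparison, under one plausible reading of the formulas above (int8 weights at 1 byte each; biases and the two rescale parameters C and S stored as int32 at 4 bytes each; this grouping is an assumption):

```python
# Per-layer storage in bytes after integer quantization (assumed grouping:
# int8 weights, int32 biases, plus int32 C and S).
def storage_int8(l_m, l_n, k):
    return l_m * l_n * k * k * 1 + (l_n + 2) * 4

# Per-layer storage in bytes before quantization (float weights, biases,
# and, when bn=1, three batch-norm parameter vectors per output channel).
def storage_float(l_m, l_n, k, bn=1):
    return (l_m * l_n * k * k + l_n + bn * l_n * 3) * 4
```

For a 16-in, 32-out, 3×3 layer this gives a ratio near 0.25, consistent with the roughly 4× reduction claimed in the text.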
as shown in fig. 4, there is a quantization as well as an inverse quantization process between the two layers. In the actual forward derivation process, the two can cancel each other out, so in the actual calculation process, only the inverse quantization processing needs to be performed on the output of the last layer of the network, and only the full integer calculation exists in the middle layer, as shown in fig. 5.
In addition, the performance of the invention was measured using the darknet framework: quantization was performed on the YOLOv2-tiny network structure, and comparing the average mAP values before and after quantization, the loss was 5.1%, as shown in Table 1:
Category | Before quantization | After quantization | Error
Boat | 0.1415 | 0.1657 | 0.0242
Bird | 0.1807 | 0.1621 | -0.0186
Train | 0.5145 | 0.4441 | -0.0704
Bus | 0.5306 | 0.4669 | -0.0637
Person | 0.4633 | 0.4061 | -0.0572
Dog | 0.3379 | 0.3023 | -0.0356
Diningtable | 0.3433 | 0.238 | -0.1053
Sheep | 0.3322 | 0.2644 | -0.0678
Pottedplant | 0.0864 | 0.0756 | -0.0108
Sofa | 0.3187 | 0.2076 | -0.1111
Car | 0.5195 | 0.4358 | -0.0837
Aeroplane | 0.4157 | 0.2801 | -0.1356
Bicycle | 0.48 | 0.4563 | -0.0237
Tvmonitor | 0.4029 | 0.3335 | -0.0694
Bottle | 0.0522 | 0.037 | -0.0152
Motorbike | 0.536 | 0.4221 | -0.1139
Cat | 0.3847 | 0.3633 | -0.0214
Chair | 0.1776 | 0.1235 | -0.0541
Cow | 0.3049 | 0.2972 | -0.0077
Horse | 0.5222 | 0.4384 | -0.0838
Average mAP | 0.3521 | 0.301 | -0.0511
TABLE 1
The invention uses the parameters before and after quantization to perform target detection and identification:
inputting a given image into the convolutional network, and dividing the image into S × S grids;
setting n anchor boxes with fixed length-width ratios, predicting the n anchor boxes for each grid, and independently predicting the coordinates (x, y, w, h), the confidence (p) and the probability of 20 categories of the target by each anchor box;
performing non-maximum suppression (NMS) on the extracted S, S and n targets, removing overlapped frames, and keeping a prediction frame with high confidence;
and outputting and visually displaying the result.
For a certain class of targets, the confidence of the corresponding class in all candidate frames needs to be calculated, and the calculation process is shown as the following formula:
P(class)=P(class|obj)×P(obj)
wherein P(class) represents the final confidence of a class of target in a candidate box, P(class|obj) represents the class score regressed in the candidate box, and P(obj) represents the probability, regressed in the candidate box, that the box contains a target. After the probability of the corresponding category is calculated, a preliminary screening is first performed with a fixed threshold to filter out candidate boxes with low confidence in the corresponding category, and then overlapping target boxes are removed by non-maximum suppression (NMS).
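The confidence computation and the fixed-threshold pre-filter can be sketched as follows (a Python sketch; the box scores and the threshold value are made up for illustration):

```python
# Sketch of the per-box class confidence P(class) = P(class|obj) * P(obj)
# followed by the fixed-threshold pre-filter. The scores and threshold are
# illustrative.

def class_confidence(p_class_given_obj, p_obj):
    return p_class_given_obj * p_obj

candidates = [
    # (P(class|obj), P(obj))
    (0.9, 0.8),   # confident box -> 0.72, kept
    (0.6, 0.3),   # weak objectness -> 0.18, filtered out
    (0.5, 0.7),   # -> 0.35, kept
]
THRESHOLD = 0.25   # hypothetical fixed threshold

kept = [class_confidence(pc, po) for pc, po in candidates
        if class_confidence(pc, po) >= THRESHOLD]
assert len(kept) == 2   # the 0.18-confidence box was filtered out
```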
The non-maximum suppression (NMS) removal of overlapping boxes is performed according to each category, and the process is summarized as follows:
(1) sorting P (class) of a certain class in all the candidate frames in a descending order, and marking all the frames in an unprocessed state;
(2) calculating the overlap rate between the frame with the maximum probability and each of the other frames; if the overlap rate exceeds 0.5, keeping the frame with the maximum probability, removing the overlapping frames accordingly, and marking them as processed;
(3) finding out the second largest target frame of P (class) in sequence, and marking according to the step (2);
(4) repeating steps (2) - (3) until all frames are marked as processed;
(5) and selecting the targets exceeding the threshold in the P (class) for visual display and outputting the result for the reserved target frames.
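Steps (1)-(5) above can be sketched as follows, assuming the overlap rate is intersection-over-union (IoU) and using the 0.5 threshold from step (2); the box coordinates and scores are illustrative:

```python
# Sketch of the per-class NMS described in steps (1)-(5). Boxes are
# (x1, y1, x2, y2); the overlap rate is taken to be IoU (an assumption),
# with the 0.5 threshold from step (2).

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, overlap_thresh=0.5):
    # (1) sort candidates by P(class) in descending order
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep, suppressed = [], set()
    for i in order:
        if i in suppressed:
            continue
        keep.append(i)                        # keep the highest-scoring frame
        for j in order:
            # (2) suppress frames overlapping it by more than the threshold
            if j != i and j not in suppressed and \
                    iou(boxes[i], boxes[j]) > overlap_thresh:
                suppressed.add(j)
    return keep                               # (3)-(4) repeat until all marked

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the two overlapping boxes collapse to one
```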
As shown in fig. 6, which is a schematic diagram of the recognition results of an ordinary convolutional network on an image; target detection and recognition on the same picture with the fully integer-quantized convolutional network is shown in fig. 7. It can be seen that the performance loss after full integer quantization is small: the recognition effect is nearly the same as that of the ordinary convolutional network, while detection and recognition are faster and consume fewer computing resources.
It will be appreciated by those skilled in the art that the foregoing is only a preferred embodiment of the invention, and is not intended to limit the invention, such that various modifications, equivalents and improvements may be made without departing from the spirit and scope of the invention.

Claims (7)

1. An application method of a full integer quantization convolution network, the application method comprising the steps of:
s1, obtaining a model, a floating point type weight and a training data set of the convolutional network, and initializing the network;
s2, for each convolution layer, first obtaining the distribution ranges of the input IN, the output OUT and the weight WT of the layer through a floating-point inference process, and obtaining the maximum absolute extreme values of IN, OUT and WT respectively;
s3, updating the maximum absolute extreme values of the three in the training process of the current layer;
s4, performing integer quantization on the input and the weight of the current layer in the convolution network according to the maximum absolute extreme values of the input IN, the output OUT and the weight WT;
s5, obtaining the output of the current layer integer quantization according to the input and weight of the integer quantization;
s6, carrying out inverse quantization on the output of the integer quantization of the current layer, reducing the output into a floating point type and outputting to the next layer; if the next layer is the batch norm layer, merging the parameters of the batch norm layer into the current layer by adopting a merging means; repeatedly and sequentially executing the steps S3 to S6 until the last layer in the convolutional network;
s7, performing back propagation and continuously updating the weights until the network converges, then saving the quantized weights and the additional parameters; the integer-quantized parameters are used in the full-integer forward derivation process, replacing the original floating-point operations with integer operations;
s8, inputting the image of the target to be detected into a full integer quantization convolution network, and dividing the image of the target to be detected into S × S grids;
s9, setting n anchor boxes with fixed length-width ratios, predicting n anchor boxes for each grid, and independently predicting the coordinates (x, y, w, h), the confidence coefficient p and the probability of m categories of the target by each anchor box; wherein x, y represent the target coordinates, w, h represent the height and width of the target;
s10, according to the probability corresponding to each category calculated in the previous step, firstly, carrying out preliminary screening through a fixed threshold, filtering out candidate frames with the confidence coefficient lower than the threshold in the corresponding category, and then removing overlapped target frames through a non-maximum inhibition method;
and S11, selecting the targets with the corresponding probability exceeding the threshold in different categories for the reserved target frames to be displayed visually, and outputting the target detection result.
2. The method of claim 1, wherein the maximum absolute extremum of the full integer quantization convolutional network is updated in the training process in step S3, specifically, the maximum absolute extremum is updated by using an exponential moving average algorithm:
x_n = α · x_(n-1) + (1 − α) · x
wherein x_n is the maximum absolute extreme value of the input, output, or weight after this update, x_(n-1) is the maximum absolute extreme value from the previous update, x is the input, output, or weight value obtained in the current calculation, and α is a weight coefficient.
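This exponential-moving-average update can be sketched as follows (α = 0.99 and the per-batch extremum values are illustrative):

```python
# Minimal sketch of the exponential-moving-average update
# x_n = alpha * x_(n-1) + (1 - alpha) * x used in claim 2 to track the
# maximum absolute extremum during training. alpha = 0.99 and the
# per-batch |max| values below are illustrative.

def ema_update(prev, current, alpha=0.99):
    return alpha * prev + (1 - alpha) * current

running_max = 1.0                        # extremum after some warm-up
for batch_max in [1.2, 0.9, 1.1, 1.0]:   # hypothetical per-batch |max| values
    running_max = ema_update(running_max, batch_max)

# The running value moves only slowly toward the recent batch statistics.
assert 0.9 < running_max < 1.2
```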
3. The method of claim 1, wherein the step S4 is specifically as follows:
input integer quantization:
Q_IN=clamp(IN/S1)
wherein Q_IN represents the integer-quantized input; S1 = max{|IN|}/2^N; N represents the number of quantization bits; clamp() truncates the part after the decimal point; max{|IN|} represents the maximum absolute extreme value of the input;
integer quantization of weights:
Q_WT=clamp(WT/S2)
wherein Q_WT represents the integer-quantized weight; S2 = max{|WT|}/2^N; max{|WT|} represents the maximum absolute extreme value of the weight.
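The symmetric quantization of this claim can be sketched as follows (N = 8 and the tensor values are illustrative; clamp() is modeled as truncation of the fractional part, following the claim's description):

```python
# Sketch of the symmetric integer quantization of claim 3:
# S = max{|X|} / 2^N and Q = clamp(X / S), with clamp() modeled as
# truncation toward zero. N = 8 and the values are illustrative.

def quantize_tensor(values, n_bits=8):
    max_abs = max(abs(v) for v in values)   # max{|X|}
    scale = max_abs / (2 ** n_bits)         # S = max{|X|} / 2^N
    q = [int(v / scale) for v in values]    # clamp(): truncate the fraction
    return q, scale

q, scale = quantize_tensor([0.5, -1.0, 0.25])
print(q)  # → [128, -256, 64]
```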
4. The method of claim 1, wherein the step S5 is specifically as follows:
the output of the integer quantization, Q _ OUT, is:
Q_OUT=Q_IN×Q_WT×M
M=S1×S2/S3
wherein Q_IN represents the integer-quantized input and Q_WT the integer-quantized weight; since M = S1 × S2/S3 is a floating-point number, let
M ≈ C / 2^S
so that the output can be computed entirely in integers as Q_OUT = (Q_IN × Q_WT × C) >> S.
The derivation process of the parameter C and the parameter S is as follows:
first, compute M = S1 × S2/S3:
wherein S1 = max{|IN|}/2^N, and max{|IN|} represents the maximum absolute extreme value of the input; S2 = max{|WT|}/2^N, and max{|WT|} represents the maximum absolute extreme value of the weight; S3 = max{|OUT|}/2^N, and max{|OUT|} represents the maximum absolute extreme value of the output; N represents the number of quantization bits;
repeatedly multiplying or dividing M by 2 until 0 < M < 0.5, with a initialized to 0; each time M is multiplied by 2, a becomes a + 1, and each time M is divided by 2, a becomes a − 1; counting in this way gives the final value of a;
then presetting a value of v, wherein v is more than 0 and less than or equal to 32, and solving S and C according to the following formula:
S = v + a
C = round(M × 2^v)
0 < C ≤ 2^v
where round () means to return round rounding.
5. The method of claim 4, wherein the output Q _ OUT of the full integer quantization convolution network is:
Q_OUT=Q_IN×Q_WT×M
before the output is integer-quantized, nonlinear activation is performed on Q_IN × Q_WT, and the nonlinear activation adopts a shift approximation operation.
6. The method of claim 5, wherein the non-linear activation of Q _ IN and Q _ WT is specifically:
nonlinear activation is performed on y = Q_IN × Q_WT by using a leaky activation function, in the specific form:

f(y) = y,       y ≥ 0
f(y) = 0.1·y,   y < 0
to ensure that Q_IN × Q_WT remains an integer after nonlinear activation, the above equation is approximated with shifts, as follows:

f(y) = y,                  y ≥ 0
f(y) = (y + y<<1) >> 5,    y < 0

wherein y<<1 indicates shifting the binary value y left by one bit, and (y + y<<1) >> 5 indicates shifting (y + y<<1) right by 5 bits; since (y + y<<1) >> 5 = 3y/32 ≈ 0.1·y, the final nonlinearly activated Q_IN × Q_WT remains an integer.
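The shift approximation can be sketched as follows (note the explicit parentheses: the intended expression is y plus y shifted left by one, i.e. 3y):

```python
# Sketch of the claim-6 shift-approximated leaky activation. For negative
# y the multiplication by 0.1 is replaced by (y + (y << 1)) >> 5, i.e.
# 3y/32 = 0.09375*y, so the result stays an integer.

def leaky_shift(y):
    if y >= 0:
        return y
    return (y + (y << 1)) >> 5   # arithmetic shift: 3y >> 5 ≈ 0.1 * y

print(leaky_shift(100))    # → 100 (positive values pass through unchanged)
print(leaky_shift(-100))   # → -10 (exact 0.1 * (-100) would also give -10)
```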
7. The method of claim 1, wherein if the next layer is a batch norm layer in step S6, merging the parameters of the batch norm layer into the current layer by using a merging means specifically comprises:
the calculation process of the batch norm layer is as follows:
y = γ · (x − μ) / √(σ² + ε) + β
wherein x represents the input, y the output, ε a small value added to the denominator, μ the output mean, σ the output standard deviation, γ a scale parameter generated in the calculation process of the batch norm layer, and β the bias;
since the batch norm follows the convolution process, the convolution process is expressed as:
y=∑w×fmap(i,j)
wherein fmap (i, j) is an image feature at the input image (i, j); w is a weight; y represents an output;
therefore, merging the batch norm layer parameters into the convolution process by adopting a merging means is as follows:
the combined weight is as follows:
w_fold = γ · w / √(σ² + ε)
combined bias:
β_fold = β − γ · μ / √(σ² + ε)
The convolution process after combination: y = ∑ w_fold × fmap(i, j) + β_fold.
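The folding can be sketched and checked numerically as follows (all parameter values are illustrative; the assertion confirms that convolution followed by batch norm equals the folded convolution):

```python
# Sketch of the claim-7 batch-norm folding:
#   w_fold    = gamma * w / sqrt(sigma^2 + eps)
#   beta_fold = beta - gamma * mu / sqrt(sigma^2 + eps)
# Parameter values are illustrative.
import math

def fold_bn(w, gamma, beta, mu, sigma, eps=1e-5):
    std = math.sqrt(sigma ** 2 + eps)
    return gamma * w / std, beta - gamma * mu / std

w, gamma, beta, mu, sigma, eps = 0.8, 1.5, 0.1, 0.4, 2.0, 1e-5
x = 3.0                                     # one input feature value

conv = w * x                                # y = sum w * fmap(i, j)
bn = gamma * (conv - mu) / math.sqrt(sigma ** 2 + eps) + beta

w_fold, beta_fold = fold_bn(w, gamma, beta, mu, sigma, eps)
folded = w_fold * x + beta_fold             # y = sum w_fold * fmap + beta_fold
assert abs(bn - folded) < 1e-9              # conv + BN == folded conv
```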