CN113705791B - Neural network inference quantization method and device, electronic equipment and storage medium

Neural network inference quantization method and device, electronic equipment and storage medium

Info

Publication number
CN113705791B
CN113705791B (application CN202111016047.3A)
Authority
CN
China
Prior art keywords
channel
target
neural network
feature
tensor
Prior art date
Legal status
Active
Application number
CN202111016047.3A
Other languages
Chinese (zh)
Other versions
CN113705791A (en)
Inventor
刘明庄
胡英俊
徐宁仪
Current Assignee
Shanghai Power Tensors Intelligent Technology Co Ltd
Original Assignee
Shanghai Power Tensors Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Power Tensors Intelligent Technology Co Ltd
Priority to CN202111016047.3A
Publication of CN113705791A
Priority to PCT/CN2022/092775 (published as WO2023029579A1)
Application granted
Publication of CN113705791B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/04: Inference or reasoning models
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a neural network inference quantization method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a trained target neural network; dividing the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups; for each channel group, determining a target maximum value and a target minimum value for the channel group based on the feature extremum of each channel in the group; and quantizing the inference process of the target neural network based on the target maximum value and the target minimum value of each channel group.

Description

Neural network inference quantization method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of deep learning, and in particular to a neural network inference quantization method and apparatus, an electronic device, and a storage medium.
Background
With the development of neural networks, neural network inference has been widely and successfully applied to machine vision tasks such as image classification and object detection and segmentation. However, deploying neural networks on end-side devices remains difficult due to limitations in computing resources, memory space, and power consumption. It is therefore important to find methods that compress a neural network, increase its inference speed, or reduce its power consumption.
Disclosure of Invention
In view of this, the present disclosure provides at least a neural network inference quantization method and apparatus, an electronic device, and a storage medium.
In a first aspect, the present disclosure provides a neural network inference quantization method, comprising:

acquiring a trained target neural network;

dividing the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups;

for each channel group, determining a target maximum value and a target minimum value for the channel group based on the feature extremum of each channel in the group;

and quantizing the inference process of the target neural network based on the target maximum value and the target minimum value of each channel group.
In this method, the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network are divided into a plurality of channel groups, and for each channel group a target maximum value and a target minimum value are determined based on the feature extremum of each channel in the group; that is, all channels in a group share the same target maximum value and target minimum value.

Meanwhile, compared with using a single target maximum value and target minimum value for all channels of the feature tensor, dividing the channels into groups and determining a target maximum value and a target minimum value per group makes these values better match the channels within each group. This improves the accuracy of the per-group target maximum and minimum values, so the inference process of the target neural network can be quantized accurately based on them.
In a possible implementation, dividing the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups comprises:

determining, for each feature tensor, the maximum feature value on each channel of the feature tensor;

and dividing the channels of the feature tensor into a plurality of channel groups in descending order of per-channel maximum feature value.
In this embodiment, the channels of the feature tensor are divided in descending order of their maximum feature values, so that the deviation between the maximum values of the channels within each resulting group is small, which improves the accuracy of the target maximum value and target minimum value determined for the group.
In a possible embodiment, the method further comprises:

acquiring a calibration data set corresponding to the target neural network, wherein the calibration data set comprises a plurality of calibration images;

inputting the calibration data set into the target neural network to obtain the output feature data of each calibration image at the at least one network processing layer, wherein the number of channels of the output feature data matches the number of channels of the output feature tensor among the feature tensors;

in this case, dividing the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups comprises:

determining, based on the output feature data of each calibration image, a target feature value for each channel of the output feature tensor among the feature tensors;

and dividing the channels of the output feature tensor into a plurality of channel groups in descending order of target feature value.
Here, the acquired calibration images are input into the target neural network to obtain the output feature data of each calibration image at the at least one network processing layer, so that the channels of the output feature tensor can be divided into channel groups more accurately based on the output feature data of the calibration images.
In a possible implementation, determining, based on the output feature data of each calibration image, the target feature value for each channel of the output feature tensor among the feature tensors comprises:

for each target channel of the output feature tensor, determining, in the output feature data of each calibration image, the candidate channel matching the target channel, and determining the maximum feature value contained in each candidate channel;

determining the target feature value for the target channel of the output feature tensor based on the maximum feature values contained in the candidate channels of the calibration images.

Here, by determining the maximum feature value of each calibration image's candidate channel, the target feature value for the target channel of the output feature tensor can be determined more accurately, so that the channels of the output feature tensor can in turn be divided into channel groups more accurately in descending order of target feature value.
In a possible implementation, quantizing the inference process of the target neural network based on the target maximum value and the target minimum value of each channel group comprises:

determining the quantization coefficient of each channel based on the target maximum value and the target minimum value of the channel group to which the channel belongs;

and quantizing the inference process of the target neural network based on the quantization coefficients of the channels.

Here, the quantization coefficient of each channel is determined from the target maximum value and target minimum value of its channel group, so all channels in a group share the same quantization coefficient. For example, if a channel group contains 3 channels, then once the quantization coefficient of any one channel in the group has been determined, the quantization coefficients of the other channels in the group are determined as well; there is no need to compute a quantization coefficient separately for every channel, which reduces the amount of computation and the consumption of computing resources.
In a possible implementation, determining the quantization coefficient of each channel based on the target maximum value and the target minimum value of its channel group comprises:

determining the scaling factor of each channel based on the target maximum value and the target minimum value of the channel and the configured bit width for quantized values;

and determining the quantization coefficient of each channel based on its scaling factor.
In a possible implementation, in a case where the feature tensors corresponding to the network processing layer include an input feature tensor, an output feature tensor, and a weight feature tensor, determining the quantization coefficient of each channel based on its scaling factor comprises:

determining the quantization coefficient of each channel of the feature tensors corresponding to the network processing layer based on a first scaling factor of the channel in the input feature tensor, a second scaling factor in the output feature tensor, and a third scaling factor in the weight feature tensor.

Here, when the feature tensors include an input feature tensor, an output feature tensor, and a weight feature tensor, the quantization coefficient of a channel can be determined more accurately from the first, second, and third scaling factors, so that the inference process of the target neural network can in turn be quantized more accurately based on the determined quantization coefficients.
In a possible implementation, determining the quantization coefficient of the channel based on the first scaling factor of the channel in the input feature tensor, the second scaling factor in the output feature tensor, and the third scaling factor in the weight feature tensor comprises:

multiplying the first scaling factor by the third scaling factor, and dividing the product by the second scaling factor to obtain the quantization coefficient of the channel.
For the effects of the apparatus, the electronic device, and the storage medium, reference is made to the description of the method above; details are not repeated here.
In a second aspect, the present disclosure provides a neural network inference quantization apparatus, comprising:

an acquisition module, configured to acquire a trained target neural network;

a dividing module, configured to divide the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups;

a determining module, configured to determine, for each channel group, a target maximum value and a target minimum value for the channel group based on the feature extremum of each channel in the group;

and a quantization module, configured to quantize the inference process of the target neural network based on the target maximum value and the target minimum value of each channel group.
In a possible implementation, the dividing module, when dividing the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups, is configured to:

determine, for each feature tensor, the maximum feature value on each channel of the feature tensor;

and divide the channels of the feature tensor into a plurality of channel groups in descending order of per-channel maximum feature value.
In a possible embodiment, the apparatus further comprises a processing module, configured to:

acquire a calibration data set corresponding to the target neural network, wherein the calibration data set comprises a plurality of calibration images;

input the calibration data set into the target neural network to obtain the output feature data of each calibration image at the at least one network processing layer, wherein the number of channels of the output feature data matches the number of channels of the output feature tensor among the feature tensors;

the dividing module, when dividing the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups, is configured to:

determine, based on the output feature data of each calibration image, a target feature value for each channel of the output feature tensor among the feature tensors;

and divide the channels of the output feature tensor into a plurality of channel groups in descending order of target feature value.
In a possible implementation, the dividing module, when determining, based on the output feature data of each calibration image, the target feature value for each channel of the output feature tensor among the feature tensors, is configured to:

for each target channel of the output feature tensor, determine, in the output feature data of each calibration image, the candidate channel matching the target channel, and determine the maximum feature value contained in each candidate channel;

determine the target feature value for the target channel of the output feature tensor based on the maximum feature values contained in the candidate channels of the calibration images.
In a possible implementation, the quantization module, when quantizing the inference process of the target neural network based on the target maximum value and the target minimum value of each channel group, is configured to:

determine the quantization coefficient of each channel based on the target maximum value and the target minimum value of the channel group to which the channel belongs;

and quantize the inference process of the target neural network based on the quantization coefficients of the channels.
In a possible implementation, the quantization module, when determining the quantization coefficient of each channel based on the target maximum value and the target minimum value of its channel group, is configured to:

determine the scaling factor of each channel based on the target maximum value and the target minimum value of the channel and the configured bit width for quantized values;

and determine the quantization coefficient of each channel based on its scaling factor.
In a possible implementation, in a case where the feature tensors corresponding to the network processing layer include an input feature tensor, an output feature tensor, and a weight feature tensor, the quantization module, when determining the quantization coefficient of each channel based on its scaling factor, is configured to:

determine the quantization coefficient of each channel of the feature tensors corresponding to the network processing layer based on a first scaling factor of the channel in the input feature tensor, a second scaling factor in the output feature tensor, and a third scaling factor in the weight feature tensor.
In a possible implementation, the quantization module, when determining the quantization coefficient of the channel based on the first scaling factor of the channel in the input feature tensor, the second scaling factor in the output feature tensor, and the third scaling factor in the weight feature tensor, is configured to:

multiply the first scaling factor by the third scaling factor, and divide the product by the second scaling factor to obtain the quantization coefficient of the channel.
In a third aspect, the present disclosure provides an electronic device comprising a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate over the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the neural network inference quantization method described in the first aspect or any of the embodiments above.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the neural network inference quantization method described in the first aspect or any of the embodiments above.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required for the embodiments are briefly described below. The drawings, which are incorporated in and constitute a part of the specification, show embodiments consistent with the present disclosure and, together with the description, serve to illustrate the technical solutions of the disclosure. It should be understood that the following drawings show only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope; a person of ordinary skill in the art may derive other related drawings from them without inventive effort.
FIG. 1 is a flowchart of a neural network inference quantization method provided by an embodiment of the present disclosure;

FIG. 2 is a flowchart of a specific manner of dividing the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups based on per-channel maximum feature values, in a neural network inference quantization method according to an embodiment of the present disclosure;

FIG. 3 is a flowchart of a specific manner of dividing the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups based on a calibration data set, in a neural network inference quantization method according to an embodiment of the present disclosure;

FIG. 4 shows a schematic architecture of a neural network inference quantization apparatus provided by an embodiment of the present disclosure;

FIG. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of configurations. Thus, the following detailed description of the embodiments is not intended to limit the scope of the disclosure as claimed, but merely represents selected embodiments of the disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of this disclosure without inventive effort fall within the scope of protection of this disclosure.
With the development of neural networks, neural network inference has been widely and successfully applied to machine vision tasks such as image classification and object detection and segmentation. However, deploying neural networks on end-side devices remains difficult due to limitations in computing resources, memory space, and power consumption. It is therefore important to find methods that compress a neural network, increase its inference speed, or reduce its power consumption.
In general, post-training quantization can be used to increase the inference speed of a neural network. One option is to quantize only the weights, converting the weights of the trained neural network from floating point to 8-bit fixed point; this process is relatively simple. Another option is to quantize both the weights and the activation values, which requires a calibration data set to determine the dynamic range of the activations and achieves higher accuracy.
Based on the above, embodiments of the present disclosure provide a neural network inference quantization method and apparatus, an electronic device, and a storage medium.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
To facilitate understanding of the embodiments of the present disclosure, the neural network inference quantization method disclosed herein is first described in detail. The execution subject of the method may be a terminal device, which may include user equipment (UE), a mobile device, a user terminal, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, and so on. In some possible implementations, the method may be implemented by a processor invoking computer-readable instructions stored in a memory.
Referring to FIG. 1, which shows a flowchart of a neural network inference quantization method according to an embodiment of the present disclosure, the method comprises S101-S104:

S101: acquire a trained target neural network;

S102: divide the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups;

S103: for each channel group, determine a target maximum value and a target minimum value for the channel group based on the feature extremum of each channel in the group;

S104: quantize the inference process of the target neural network based on the target maximum value and the target minimum value of each channel group.
In this method, the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network are divided into a plurality of channel groups, and for each channel group a target maximum value and a target minimum value are determined based on the feature extremum of each channel in the group; that is, all channels in a group share the same target maximum value and target minimum value.

Meanwhile, compared with using a single target maximum value and target minimum value for all channels of the feature tensor, dividing the channels into groups and determining a target maximum value and a target minimum value per group makes these values better match the channels within each group. This improves the accuracy of the per-group target maximum and minimum values, so the inference process of the target neural network can be quantized accurately based on them.
S101 to S104 are specifically described below.
For S101:
The trained target neural network may be any neural network; for example, a neural network for object detection or a neural network for image segmentation. The network structure of the trained target neural network can be set according to actual requirements.
For S102:
The target neural network may include a plurality of network processing layers, and the feature tensors corresponding to each network processing layer may include an input feature tensor, an output feature tensor, and a weight feature tensor. Between two adjacent network processing layers, the output feature tensor of the previous layer serves as the input feature tensor of the next layer.
A feature tensor is generally multi-channel data; that is, it contains a plurality of feature matrices, each corresponding to one channel. For example, if a feature tensor has size 128×128×256, then 256 is the number of channels and 128×128 is the length and width of each channel; in other words, the feature tensor contains 256 feature matrices of size 128×128.
In implementation, quantization may be applied to at least one of the network processing layers. For a network processing layer being quantized, quantization may be applied to the weight feature tensor alone, or to both the output feature tensor and the weight feature tensor. Since the input feature tensor of a layer is the output feature tensor of the previous layer, the input feature tensor does not need to be quantized separately.
In an alternative embodiment, referring to FIG. 2, dividing the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups may include:

S201: for each feature tensor, determining the maximum feature value on each channel of the feature tensor;

S202: dividing the channels of the feature tensor into a plurality of channel groups in descending order of per-channel maximum feature value.
In implementation, for each feature tensor, the maximum feature value on each of its channels may be determined. For example, if the feature tensor has size 128×128×256, then for each of the 256 channels a maximum feature value is determined from the corresponding 128×128 feature matrix, that is, from 128×128 feature elements; this yields the maximum feature values of the 256 channels.

The channels are then sorted in descending order of maximum feature value, and the sorted channels are divided into a plurality of channel groups. The division may be even or uneven; several possibilities are illustrated here.

For example, with even division, if there are 256 channels and 4 channel groups, the channels may be divided into 4 groups of 64 channels each. With uneven division, a first threshold may be set: channels whose maximum feature value exceeds the first threshold form a first channel group, and the remaining channels form a second channel group. Alternatively, a deviation threshold may be set, and channels whose maximum feature values deviate from one another by less than the deviation threshold are placed in the same channel group, yielding the divided channel groups.
In specific implementation, the channels of the weight feature tensor of a network processing layer may be divided through steps S201 and S202.

In this embodiment, the channels of the feature tensor are divided in descending order of maximum feature value, so that the deviation between the maximum values of the channels within each resulting group is small, which improves the accuracy of the target maximum value and target minimum value determined for the group; a sketch of this grouping follows.
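As an illustration of S201 and S202, the following Python sketch sorts per-channel maxima and splits the channel indices evenly; the channels-first layout, the function name, and the even group count are assumptions made for the example, not part of the disclosure.

```python
import numpy as np

def group_channels_by_max(tensor: np.ndarray, num_groups: int) -> list:
    """Divide channels into groups in descending order of per-channel maximum.

    tensor is assumed to be laid out channels-first, e.g. (256, 128, 128).
    Returns a list of channel-index arrays, one per group.
    """
    num_channels = tensor.shape[0]
    # S201: maximum feature value on each channel
    per_channel_max = tensor.reshape(num_channels, -1).max(axis=1)
    # S202: sort channel indices by maximum feature value, largest first
    order = np.argsort(-per_channel_max)
    # Even (average) division; threshold-based uneven division is also possible
    return np.array_split(order, num_groups)

# Example: 256 channels of a weight feature tensor split into 4 groups of 64
weights = np.random.randn(256, 128, 128).astype(np.float32)
groups = group_channels_by_max(weights, num_groups=4)
```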
In an alternative embodiment, the method further comprises steps A1 and A2:

A1: acquiring a calibration data set corresponding to the target neural network, wherein the calibration data set comprises a plurality of calibration images;

A2: inputting the calibration data set into the target neural network to obtain the output feature data of each calibration image at the at least one network processing layer, wherein the number of channels of the output feature data matches the number of channels of the output feature tensor among the feature tensors.

In implementation, a calibration data set corresponding to the target neural network may be acquired, comprising a plurality of calibration images. Each calibration image may be input into the target neural network to obtain its output feature data at the at least one network processing layer.

For example, suppose the target neural network includes three network processing layers in the order first, second, third. A calibration image is input into the target neural network to obtain the output feature data of the first network processing layer; this output is then taken as the input of the second network processing layer to obtain the output feature data of the second layer; and the output of the second layer is taken as the input of the third network processing layer to obtain the output feature data of the third layer.

The number of channels of the output feature data at each network processing layer matches the number of channels of that layer's output feature tensor. For example, if the output feature tensor of a given network processing layer has size 56×56×256, then the output feature data of each calibration image at that layer also has size 56×56×256.
In S102, referring to FIG. 3, dividing the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups may include:

S301: determining, based on the output feature data of each calibration image, a target feature value for each channel of the output feature tensor;

S302: dividing the channels of the output feature tensor into a plurality of channel groups in descending order of target feature value.

In S301, in an alternative embodiment, determining the target feature values based on the output feature data of each calibration image may include:

S3011: for each target channel of the output feature tensor, determining, in the output feature data of each calibration image, the candidate channel matching the target channel, and determining the maximum feature value contained in each candidate channel;

S3012: determining the target feature value for the target channel of the output feature tensor based on the maximum feature values contained in the candidate channels of the calibration images.
Here, by determining the maximum feature value of each calibration image's candidate channel, the target feature value for the target channel of the output feature tensor can be determined more accurately based on those maximum feature values, so that the channels of the output feature tensor can in turn be divided into channel groups more accurately in descending order of target feature value.
In S3011, for example, if the output feature tensor contains 256 channels, each of the 256 channels may be taken in turn as the target channel. Taking the first channel as the target channel, the first channel in the output feature data of each calibration image is determined as the candidate channel matching the target channel, yielding the candidate channels contained in the output feature data of the calibration images; the maximum feature value contained in each candidate channel is then determined.

For example, if the calibration data set includes 100 calibration images, the candidate channels in the output feature data of the 100 images are determined, and the maximum feature value in each of the 100 candidate channels is determined, yielding 100 maximum feature values. If a candidate channel has size 56×56, its maximum feature value is determined from 56×56 feature values.
In S3012, the target feature value for the target channel may be determined from the maximum feature values contained in the candidate channels.

For instance, the largest of the maximum feature values contained in the candidate channels may be taken as the target feature value for the target channel: among the 100 maximum feature values illustrated in S3011, the largest one is determined and taken as the target feature value. Alternatively, a K-L divergence method may be used to determine the target feature value from the maximum feature values of the candidate channels.

There are various other ways to determine the target feature value for the target channel. As a further example, the average of the maximum feature values contained in the candidate channels of the calibration images may be computed, and the maximum feature value with the smallest deviation from that average taken as the target feature value for the target channel.
In S302, the channels of the output feature tensor may be sorted in descending order of target feature value, and the sorted channels divided into channel groups. The channels may be divided evenly, unevenly, and so on; for the specific manners of division, refer to the description of S202 above, which is not repeated here.

Here, the acquired calibration images are input into the target neural network to obtain the output feature data of each calibration image at the at least one network processing layer, so that the channels of the output feature tensor can be divided into channel groups more accurately based on the output feature data of the calibration images; a sketch of this procedure follows.
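A minimal sketch of S301-S302 under the simplest reduction described above (taking the largest per-image maximum as a channel's target feature value; K-L divergence is an alternative the sketch does not implement). Collecting the per-layer output feature data is assumed to happen elsewhere, e.g. via framework hooks.

```python
import numpy as np

def target_values_from_calibration(outputs: list) -> np.ndarray:
    """Per-channel target feature values from calibration output feature data.

    outputs holds one (C, H, W) array per calibration image. For each target
    channel, the maximum feature value of the matching candidate channel is
    taken in every image (S3011), then the largest of these maxima is used
    as the channel's target feature value (S3012).
    """
    per_image_max = np.stack(
        [o.reshape(o.shape[0], -1).max(axis=1) for o in outputs]
    )  # shape (num_images, C)
    return per_image_max.max(axis=0)  # shape (C,)

def group_output_channels(target_values: np.ndarray, num_groups: int) -> list:
    # S302: descending order of target feature value, then even division
    order = np.argsort(-target_values)
    return np.array_split(order, num_groups)

# Example: 100 calibration images, a 56x56 output feature tensor with 256 channels
calib_outputs = [np.random.randn(256, 56, 56).astype(np.float32) for _ in range(100)]
groups = group_output_channels(target_values_from_calibration(calib_outputs), 4)
```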
For S103:
For example, for each channel group containing a plurality of channels, the feature extremum on each channel may be determined; the feature extremum includes a feature maximum and/or a feature minimum. When the feature extremum includes both, the largest of the feature maxima of the channels may be taken as the group's target maximum value, and the smallest of the feature minima as the group's target minimum value. When the feature extremum includes the feature maximum but not the feature minimum, the largest of the feature maxima may be taken as the group's target maximum value, and its negative as the target minimum value.

In implementation, for a channel group under the output feature tensor, each channel in the group corresponds to one target feature value; the largest of these target feature values may be taken as the group's target maximum value, and its negative as the group's target minimum value. Alternatively, for each channel to be determined in the group, the candidate channel matching it is found in the output feature data of each calibration image; the minimum feature value contained in each such candidate channel is determined, and the minimum feature value of the channel to be determined is derived from the minimum feature values of the calibration images' candidate channels. Finally, the smallest of the minimum feature values of the channels in the group is taken as the group's target minimum value.

For example, when the calibration data set includes 100 calibration images, 100 candidate channels matching the channel to be determined can be found, and the minimum feature value in each is determined, yielding 100 minimum feature values. The minimum feature value of the channel to be determined may then be derived from these 100 values, for example by taking the smallest of them, or by using K-L divergence; this gives the minimum feature value of each channel to be determined in the group. Finally, the smallest of these is taken as the group's target minimum value.
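A sketch of S103 for the symmetric case described above, where a group's target minimum is taken as the negative of its target maximum; the asymmetric variant that tracks per-channel minima across calibration images follows the same pattern.

```python
import numpy as np

def group_target_range(target_values: np.ndarray, group: np.ndarray):
    """Shared (target_min, target_max) for one channel group (S103).

    target_values holds the per-channel target feature values; group holds
    the channel indices of one group. All channels in the group share the
    returned pair; here target_min is the negative of target_max.
    """
    target_max = float(target_values[group].max())
    return -target_max, target_max
```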
For S104:
Here, during inference of the target neural network, the inference process may be quantized based on the target maximum value and the target minimum value of each channel group. For example, when the target neural network is a neural network for image detection, the detection of an image to be detected may be quantized based on the per-group target maximum and minimum values.

In an alternative embodiment, quantizing the inference process of the target neural network based on the target maximum value and the target minimum value of each channel group may include steps B1 and B2:

B1: determining the quantization coefficient of each channel based on the target maximum value and the target minimum value of the channel group to which the channel belongs;

B2: quantizing the inference process of the target neural network based on the quantization coefficients of the channels.
Here, the quantization coefficient of each channel is determined from the target maximum value and target minimum value of its channel group, so the channels in a group share the same quantization coefficient. For example, if a channel group contains 3 channels, then once the quantization coefficient of any one channel in the group has been determined, the quantization coefficients of the other channels are determined as well; there is no need to compute a quantization coefficient separately for every channel, which reduces the amount of computation and the consumption of computing resources.
In step B1, in an alternative embodiment, determining the quantization coefficient of each channel based on the target maximum value and the target minimum value of its channel group may include:

B11: determining the scaling factor of each channel based on the target maximum value and the target minimum value of the channel and the configured bit width for quantized values;

B12: determining the quantization coefficient of each channel based on its scaling factor.

For each channel, the difference between its target maximum value and target minimum value can be computed; the channel's scaling factor is positively correlated with this difference and negatively correlated with the configured bit width.
In practice, the scaling factor corresponding to each channel may be determined according to the following equation (1):
$$S = \frac{V_{\max} - V_{\min}}{2^{n} - 1} \tag{1}$$

where $S$ is the scaling factor of the channel, $V_{\max}$ is the target maximum value of the channel, $V_{\min}$ is the target minimum value of the channel, and $n$ is the configured bit width for quantized values; for example, $n$ may be 8, so that the values of the quantized feature tensor are stored in 8 bits.
The quantization coefficients corresponding to each channel may then be determined based on the scaling factor corresponding to each channel.
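A one-line sketch of step B11; the denominator 2^n - 1 is an assumption consistent with the reconstruction of equation (1) above (larger range, larger factor; more bits, smaller factor).

```python
def scaling_factor(target_min: float, target_max: float, n: int = 8) -> float:
    # Equation (1): the factor grows with the target range and shrinks with bit width
    return (target_max - target_min) / (2 ** n - 1)
```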
In an alternative embodiment, in a case where the feature tensors corresponding to the network processing layer include an input feature tensor, an output feature tensor, and a weight feature tensor, determining the quantization coefficient of each channel based on its scaling factor includes: for each channel of the feature tensors corresponding to the network processing layer, determining the quantization coefficient of the channel based on a first scaling factor of the channel in the input feature tensor, a second scaling factor in the output feature tensor, and a third scaling factor in the weight feature tensor.

Here, when the feature tensors include an input feature tensor, an output feature tensor, and a weight feature tensor, the quantization coefficient of a channel can be determined more accurately from the first, second, and third scaling factors, so that the inference process of the target neural network can in turn be quantized more accurately based on the determined quantization coefficients.
When the feature tensors corresponding to the network processing layer include an input feature tensor, an output feature tensor, and a weight feature tensor, formula (1) may be used to determine a first scaling factor $S_1$ for the input feature tensor, a second scaling factor $S_2$ for the output feature tensor, and a third scaling factor $S_3$ for the weight feature tensor; the quantization coefficient of the channel is then determined from $S_1$, $S_2$, and $S_3$.
In implementation, the first scaling factor may be multiplied by the third scaling factor, and the obtained product may be divided by the second scaling factor to obtain a quantization coefficient corresponding to the channel. The quantization coefficients corresponding to the channels can be determined according to the following formula (2):
$$M = \frac{S_1 \cdot S_3}{S_2} \tag{2}$$

where $M$ is the determined quantization coefficient.
In implementation, when the feature tensors corresponding to the network processing layer include only an input feature tensor and an output feature tensor, that is, when the network processing layer has no weight parameters, the quantization coefficient M may be determined according to the following formula (3):
$$M = \frac{S_1}{S_2} \tag{3}$$
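Formulas (2) and (3) can be sketched together as follows; treating the weight scaling factor as optional is an assumption about how a layer without weight parameters might be represented.

```python
from typing import Optional

def quantization_coefficient(s1: float, s2: float, s3: Optional[float] = None) -> float:
    """Formula (2) for layers with weights, formula (3) for layers without."""
    return s1 * s3 / s2 if s3 is not None else s1 / s2
```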
In step B2, the quantization coefficients of the channels may be used to quantize the inference process of the target neural network.

In practice, each channel group contains a plurality of channels that share one target maximum value and one target minimum value, so the channels in a group have the same scaling factor and the same quantization coefficient. Each channel in the group can therefore share these values: once the scaling factor and quantization coefficient of any channel in the group have been determined, those of the other channels are determined as well, and they need not be computed per channel, which reduces the waste of computing resources. A sketch of applying the quantization during inference follows.
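As an illustration of step B2, the sketch below quantizes one channel with its group's shared scaling factor; symmetric signed rounding and int8 storage are assumptions, since the disclosure does not fix the exact rounding scheme.

```python
import numpy as np

def quantize_channel(x: np.ndarray, s: float, n: int = 8) -> np.ndarray:
    """Map floating-point feature values to n-bit fixed point with scaling factor s."""
    qmin, qmax = -(2 ** (n - 1)), 2 ** (n - 1) - 1
    # int8 storage assumes n <= 8
    return np.clip(np.round(x / s), qmin, qmax).astype(np.int8)

def dequantize_channel(q: np.ndarray, s: float) -> np.ndarray:
    return q.astype(np.float32) * s
```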
It will be appreciated by those skilled in the art that the written order of the steps in the specific embodiments above does not imply a strict order of execution; the execution order of the steps should be determined by their function and internal logic.
Based on the same concept, an embodiment of the present disclosure further provides a neural network inference quantization apparatus. Referring to FIG. 4, which shows a schematic architecture of the apparatus, it includes an obtaining module 401, a dividing module 402, a determining module 403, and a quantization module 404. Specifically:
an acquisition module 401, configured to acquire a trained target neural network;
a dividing module 402, configured to divide each channel of the feature tensor corresponding to at least one network processing layer in the target neural network into a plurality of channel groups;
a determining module 403, configured to determine, for each channel group, a target maximum value and a target minimum value for the channel group based on the feature extremum of each channel in the group;

and a quantization module 404, configured to quantize the inference process of the target neural network based on the target maximum value and the target minimum value of each channel group.
In a possible implementation, the dividing module 402, when dividing the channels of the feature tensor corresponding to each of at least one network processing layer in the target neural network into a plurality of channel groups, is configured to:

determine, for each feature tensor, the maximum feature value on each channel of the feature tensor;

and divide the channels of the feature tensor into a plurality of channel groups in descending order of per-channel maximum feature value.
In a possible embodiment, the apparatus further comprises a processing module 405, configured to:

acquire a calibration data set corresponding to the target neural network, wherein the calibration data set comprises a plurality of calibration images;

input the calibration data set into the target neural network to obtain the output feature data of each calibration image at the at least one network processing layer, wherein the number of channels of the output feature data matches the number of channels of the output feature tensor among the feature tensors;
the dividing module 402 is configured to, when dividing each channel of the feature tensor corresponding to at least one network processing layer in the target neural network into a plurality of channel groups:
determining a target feature value of the calibration image corresponding to each channel of an output feature tensor in the feature tensor based on the output feature data respectively corresponding to each calibration image;
and dividing the channels of the output feature tensor into a plurality of channel groups in descending order of the target feature value.
In a possible implementation manner, the dividing module 402 is configured to, when determining, based on the output feature data respectively corresponding to each calibration image, a target feature value of the calibration image corresponding to each channel of the output feature tensor in the feature tensor:
determining, for each target channel of the output feature tensor, the candidate channel of each calibration image matched with the target channel in the output feature data corresponding to that calibration image, and determining a maximum feature value included in each candidate channel;
and determining a target feature value of the calibration image corresponding to the target channel of the output feature tensor based on the maximum feature value included in the candidate channel of each calibration image.
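A sketch of this calibration step, assuming each image's output feature data is a (C, H, W) array and that the per-image channel maxima are aggregated by a further maximum (the aggregation rule is an assumption; the text above only says the target value is determined based on those maxima):

```python
import numpy as np

def target_values_from_calibration(output_features):
    """output_features: one (C, H, W) array per calibration image,
    all taken at the same network processing layer."""
    # For each image, the candidate channel matched with target channel c
    # is its own channel c; take the maximum feature value of each candidate.
    per_image_max = np.stack(
        [f.reshape(f.shape[0], -1).max(axis=1) for f in output_features]
    )                                  # shape: (num_images, C)
    # Aggregate across calibration images (max chosen as an assumption).
    return per_image_max.max(axis=0)   # one target feature value per channel
```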
In a possible implementation manner, the quantization module 404 is configured to, when performing quantization processing on the inference process of the target neural network based on the target maximum value and the target minimum value that respectively correspond to each of the channel groups:
determining a quantization coefficient corresponding to each channel based on the target maximum value and the target minimum value corresponding to the channel group to which each channel belongs;
and carrying out quantization processing on the reasoning process of the target neural network based on the quantization coefficients respectively corresponding to the channels.
In a possible implementation manner, the quantization module 404 is configured to, when determining the quantization coefficient corresponding to each channel based on the target maximum value and the target minimum value corresponding to the channel group to which each channel belongs:
determining a scaling factor corresponding to each channel based on the target maximum value and the target minimum value corresponding to each channel and the set number of storage bits for quantized values;
and determining the quantization coefficient corresponding to each channel based on the scaling factor corresponding to each channel.
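One common way to realize the first of these two steps, shown as a sketch (the asymmetric-range formula is an assumption; the text above fixes only the inputs, namely the two extrema and the storage bit number):

```python
def scaling_factor(target_max, target_min, num_bits=8):
    """Map the range [target_min, target_max] onto num_bits integer levels."""
    return (target_max - target_min) / (2**num_bits - 1)

# e.g. a channel group with extrema [-1.2, 3.1] stored in 8 bits:
# scaling_factor(3.1, -1.2) == 4.3 / 255 ≈ 0.016863
```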
In a possible implementation manner, in a case where the feature tensor corresponding to the network processing layer includes an input feature tensor, an output feature tensor, and a weight feature tensor, the quantization module 404 is configured to, when determining, based on the scaling factor corresponding to each channel, a quantization coefficient corresponding to each channel:
and determining a quantization coefficient corresponding to each channel in the characteristic tensor corresponding to the network processing layer based on a first scaling factor corresponding to the channel in the input characteristic tensor, a second scaling factor corresponding to the output characteristic tensor and a third scaling factor corresponding to the weight characteristic tensor.
In a possible implementation manner, the quantization module 404 is configured to, when determining the quantization coefficient corresponding to the channel based on the first scaling factor corresponding to the channel in the input feature tensor, the second scaling factor corresponding to the output feature tensor, and the third scaling factor corresponding to the weight feature tensor:
multiplying the first scaling factor by the third scaling factor, and dividing the obtained product by the second scaling factor to obtain a quantization coefficient corresponding to the channel.
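A worked sketch of this combination (variable names are illustrative):

```python
def quantization_coefficient(s_input, s_weight, s_output):
    """First scaling factor times the third, divided by the second."""
    return (s_input * s_weight) / s_output

# e.g. s_input = 0.02, s_weight = 0.005, s_output = 0.04
# -> coefficient = (0.02 * 0.005) / 0.04 = 0.0025
```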
In some embodiments, the functions or modules included in the apparatus provided by the embodiments of the present disclosure may be used to perform the methods described in the foregoing method embodiments; for specific implementation, reference may be made to the descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
Based on the same technical concept, an embodiment of the disclosure further provides an electronic device. Referring to fig. 5, a schematic structural diagram of an electronic device according to an embodiment of the disclosure, the device includes a processor 501, a memory 502 and a bus 503. The memory 502 is configured to store execution instructions and includes an internal memory 5021 and an external memory 5022; the internal memory 5021 is used for temporarily storing operation data in the processor 501 and data to be exchanged with the external memory 5022, such as a hard disk. The processor 501 exchanges data with the external memory 5022 through the internal memory 5021. When the electronic device 500 is running, the processor 501 and the memory 502 communicate with each other through the bus 503, so that the processor 501 executes the following instructions:
acquiring a trained target neural network;
dividing each channel of the feature tensor respectively corresponding to at least one network processing layer in the target neural network into a plurality of channel groups;
for each channel group, determining a target maximum value and a target minimum value corresponding to the channel group based on the feature extremum corresponding to each channel contained in the channel group;
and performing quantization processing on the reasoning process of the target neural network based on the target maximum value and the target minimum value respectively corresponding to each channel group.
Furthermore, the disclosed embodiments also provide a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, performs the steps of the method for neural network reasoning quantization described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
Embodiments of the present disclosure further provide a computer program product carrying program code, where the instructions included in the program code may be used to perform the steps of the method for neural network reasoning quantization described in the foregoing method embodiments; for details, reference may be made to the foregoing method embodiments, which are not repeated herein.
The above-mentioned computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, it is embodied as a software product, such as a software development kit (Software Development Kit, SDK).
It will be clear to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the system and apparatus described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation; as another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some communication interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
The foregoing is merely a specific embodiment of the disclosure, but the protection scope of the disclosure is not limited thereto; any changes or substitutions that can readily be conceived by a person skilled in the art within the technical scope of the disclosure shall be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (11)

1. A method for neural network reasoning quantization, comprising:
acquiring a trained target neural network; the target neural network comprises a neural network for image detection;
dividing each channel of the feature tensor respectively corresponding to at least one network processing layer in the target neural network into a plurality of channel groups;
for each channel group, determining a target maximum value and a target minimum value corresponding to the channel group based on a feature extremum corresponding to each channel contained in the channel group;
based on the target maximum value and the target minimum value respectively corresponding to each channel group, performing quantization processing on the reasoning process of the target neural network, wherein performing quantization processing on the reasoning process of the target neural network comprises: in the process of detecting an image to be detected by the target neural network, quantizing the detection process based on the target maximum value and the target minimum value corresponding to each channel group.
2. The method of claim 1, wherein dividing each channel of the feature tensor respectively corresponding to at least one network processing layer in the target neural network into a plurality of channel groups comprises:
determining, for each feature tensor, a maximum feature value on each channel of the feature tensor;
and dividing the channels included in the feature tensor into a plurality of channel groups in descending order of the maximum feature value corresponding to each channel.
3. The method according to claim 1 or 2, characterized in that the method further comprises:
acquiring a calibration data set corresponding to the target neural network, wherein the calibration data set comprises a plurality of calibration images;
inputting the calibration data set into the target neural network to obtain corresponding output characteristic data of each calibration image under the at least one network processing layer; wherein the number of channels of the output feature data is consistent with the number of channels of the output feature tensor in the feature tensor;
dividing each channel of the feature tensor respectively corresponding to at least one network processing layer in the target neural network into a plurality of channel groups, including:
determining a target feature value of the calibration image corresponding to each channel of the output feature tensor in the feature tensor based on the output feature data respectively corresponding to each calibration image;
and dividing the channels of the output feature tensor into a plurality of channel groups in descending order of the target feature value.
4. A method according to claim 3, wherein said determining a target feature value of the calibration image corresponding to each channel of the output feature tensor in the feature tensor based on the output feature data respectively corresponding to each calibration image comprises:
determining, for each target channel of the output feature tensor, the candidate channel of each calibration image matched with the target channel in the output feature data corresponding to that calibration image, and determining a maximum feature value included in each candidate channel;
and determining a target feature value of the calibration image corresponding to the target channel of the output feature tensor based on the maximum feature value included in the candidate channel of each calibration image.
5. The method according to claim 1 or 2, wherein said quantifying the inference process of the target neural network based on the target maximum value and the target minimum value respectively corresponding to each of the channel groups comprises:
determining a quantization coefficient corresponding to each channel based on the target maximum value and the target minimum value corresponding to the channel group to which each channel belongs;
and carrying out quantization processing on the reasoning process of the target neural network based on the quantization coefficients respectively corresponding to the channels.
6. The method of claim 5, wherein determining the quantization coefficients for each channel based on the target maximum and the target minimum for each channel group to which each channel belongs, comprises:
determining a scaling factor corresponding to each channel based on the target maximum value and the target minimum value corresponding to each channel and the set number of storage bits for quantized values;
and determining the quantization coefficient corresponding to each channel based on the scaling factor corresponding to each channel.
7. The method of claim 6, wherein, in a case where the feature tensor corresponding to the network processing layer includes an input feature tensor, an output feature tensor, and a weight feature tensor, the determining the quantization coefficient corresponding to each channel based on the scaling factor corresponding to each channel includes:
and determining a quantization coefficient corresponding to each channel in the characteristic tensor corresponding to the network processing layer based on a first scaling factor corresponding to the channel in the input characteristic tensor, a second scaling factor corresponding to the output characteristic tensor and a third scaling factor corresponding to the weight characteristic tensor.
8. The method of claim 7, wherein determining the quantization factor for the channel based on the first scaling factor for the channel corresponding in the input feature tensor, the second scaling factor corresponding in the output feature tensor, and the third scaling factor corresponding in the weight feature tensor comprises:
multiplying the first scaling factor by the third scaling factor, and dividing the obtained product by the second scaling factor to obtain a quantization coefficient corresponding to the channel.
9. An apparatus for neural network reasoning quantization, comprising:
the acquisition module is used for acquiring the trained target neural network; the target neural network comprises a neural network for image detection;
the dividing module is used for dividing each channel of the feature tensor respectively corresponding to at least one network processing layer in the target neural network into a plurality of channel groups;
the determining module is used for determining, for each channel group, a target maximum value and a target minimum value corresponding to the channel group based on the feature extremum corresponding to each channel contained in the channel group;
and the quantization module is configured to perform quantization processing on the reasoning process of the target neural network based on the target maximum value and the target minimum value respectively corresponding to each of the channel groups, wherein performing quantization processing on the reasoning process of the target neural network comprises: in the process of detecting an image to be detected by the target neural network, quantizing the detection process based on the target maximum value and the target minimum value corresponding to each channel group.
10. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication over the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the method of neural network inference quantization as claimed in any one of claims 1 to 8.
11. A computer readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the steps of the method of neural network inference quantization as claimed in any of claims 1 to 8.
CN202111016047.3A 2021-08-31 2021-08-31 Neural network reasoning quantification method and device, electronic equipment and storage medium Active CN113705791B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111016047.3A CN113705791B (en) 2021-08-31 2021-08-31 Neural network reasoning quantification method and device, electronic equipment and storage medium
PCT/CN2022/092775 WO2023029579A1 (en) 2021-08-31 2022-05-13 Neural network inference quantization method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111016047.3A CN113705791B (en) 2021-08-31 2021-08-31 Neural network reasoning quantification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113705791A CN113705791A (en) 2021-11-26
CN113705791B 2023-12-19

Family

ID=78658302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111016047.3A Active CN113705791B (en) 2021-08-31 2021-08-31 Neural network reasoning quantification method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113705791B (en)
WO (1) WO2023029579A1 (en)


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109615068A (en) * 2018-11-08 2019-04-12 阿里巴巴集团控股有限公司 The method and apparatus that feature vector in a kind of pair of model is quantified
US11775611B2 (en) * 2019-11-01 2023-10-03 Samsung Electronics Co., Ltd. Piecewise quantization for neural networks
JP6856112B1 (en) * 2019-12-25 2021-04-07 沖電気工業株式会社 Neural network weight reduction device, neural network weight reduction method and program
CN112488287A (en) * 2020-02-17 2021-03-12 上海交通大学 Convolutional neural network compression method, system, device and medium
US20220067512A1 (en) * 2020-08-28 2022-03-03 Nvidia Corporation Fine-grained per-vector scaling for neural network quantization
CN112329910B (en) * 2020-10-09 2024-06-04 东南大学 Deep convolution neural network compression method for structure pruning combined quantization
CN113163203B (en) * 2021-04-29 2022-09-13 上海大学 Deep learning feature compression and decompression method, system and terminal

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960340A (en) * 2018-07-23 2018-12-07 电子科技大学 Convolutional neural networks compression method and method for detecting human face
CN113196305A (en) * 2018-12-18 2021-07-30 微软技术许可有限责任公司 Training neural network accelerators using mixed-precision data formats
CN110059822A (en) * 2019-04-24 2019-07-26 苏州浪潮智能科技有限公司 One kind compressing quantization method based on channel packet low bit neural network parameter
CN110569967A (en) * 2019-09-11 2019-12-13 山东浪潮人工智能研究院有限公司 Neural network model compression encryption method and system based on arithmetic coding
CN111144511A (en) * 2019-12-31 2020-05-12 上海云从汇临人工智能科技有限公司 Image processing method, system, medium and electronic terminal based on neural network
CN111652366A (en) * 2020-05-09 2020-09-11 哈尔滨工业大学 Combined neural network model compression method based on channel pruning and quantitative training
CN111582229A (en) * 2020-05-21 2020-08-25 中国科学院空天信息创新研究院 Network self-adaptive semi-precision quantized image processing method and system
CN112287986A (en) * 2020-10-16 2021-01-29 浪潮(北京)电子信息产业有限公司 Image processing method, device and equipment and readable storage medium
CN112766456A (en) * 2020-12-31 2021-05-07 平安科技(深圳)有限公司 Quantification method, device, equipment and storage medium of floating point type deep neural network

Also Published As

Publication number Publication date
CN113705791A (en) 2021-11-26
WO2023029579A1 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
US11023801B2 (en) Data processing method and apparatus
US11775611B2 (en) Piecewise quantization for neural networks
CN108701250A (en) Data fixed point method and apparatus
CN110880038A (en) System for accelerating convolution calculation based on FPGA and convolution neural network
CN106855952B (en) Neural network-based computing method and device
CN109344893B (en) Image classification method based on mobile terminal
CN110826685A (en) Method and device for convolution calculation of neural network
CN111178514A (en) Neural network quantification method and system
KR20230130591A (en) Information processing apparatus, information processing method, non-transitory computer-readable storage medium
CN111240746A (en) Floating point data inverse quantization and quantization method and equipment
CN111709415B (en) Target detection method, device, computer equipment and storage medium
CN110337636A (en) Data transfer device and device
CN115759192A (en) Neural network acceleration method, device, equipment, chip and storage medium
CN113705791B (en) Neural network reasoning quantification method and device, electronic equipment and storage medium
CN115759238B (en) Quantization model generation method and device, electronic equipment and storage medium
EP4128067A1 (en) Method and system for generating a predictive model
US12100196B2 (en) Method and machine learning system to perform quantization of neural network
CN111160517A (en) Convolutional layer quantization method and device of deep neural network
US11699077B2 (en) Multi-layer neural network system and method
CN113177634B (en) Image analysis system, method and equipment based on neural network input and output quantification
Dürichen et al. Binary Input Layer: Training of CNN models with binary input data
CN115034389A (en) Neural network quantization and processing method and device, electronic equipment and storage medium
US11823043B2 (en) Machine learning with input data domain transformation
CN110097183B (en) Information processing method and information processing system
CN112668702B (en) Fixed-point parameter optimization method, system, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40053984; Country of ref document: HK)
GR01 Patent grant