CN113361701A - Quantification method and device of neural network model

Quantification method and device of neural network model

Info

Publication number: CN113361701A
Application number: CN202010144339.4A
Authority: CN (China)
Prior art keywords: network model, candidate, quantization, neural network, factor set
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 希滕, 张刚, 温圣召
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority: CN202010144339.4A
Publication: CN113361701A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent


Abstract

The present disclosure relates to the field of artificial intelligence. Embodiments of the disclosure disclose a quantization method and apparatus for a neural network model. The method includes: obtaining quantization factors for each network layer of the neural network model, determined by iteratively executing a search operation a plurality of times; and quantizing the neural network model based on the obtained quantization factors. The search operation includes: searching out a candidate quantization factor set from a preset quantization method search space and quantizing the neural network model based on the candidate quantization factor set to obtain a candidate network model, where the candidate quantization factor set includes candidate quantization factors respectively corresponding to the network layers of the neural network model; performing back propagation based on the performance of the candidate network model to update the searched candidate quantization factor set; and, in response to determining that the performance of the candidate network model satisfies a preset convergence condition, determining the quantization factors of the network layers of the neural network model based on the current candidate quantization factor set. The method can reduce the quantization loss of the model.

Description

Quantification method and device of neural network model
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, in particular to the field of artificial intelligence, and more particularly to a quantization method and apparatus for a neural network model.
Background
Quantization of a neural network model converts high-bit-width model parameters into low-bit-width model parameters in order to increase the computation speed of the model. Quantization is usually performed after training of the high-bit-width neural network model is completed: the high-bit-width neural network model is quantized into a low-bit-width neural network model according to a specified low bit width. Generally, the low-bit-width neural network model obtained after quantization is used directly to execute the corresponding deep learning task. However, the precision loss of the quantized parameters can be large, which may cause the precision loss of the quantized model to exceed an acceptable range; the low-bit-width neural network model then needs to be retrained, which increases its training cost.
Disclosure of Invention
Embodiments of the present disclosure provide a quantization method and apparatus of a neural network model, an electronic device, and a computer-readable medium.
In a first aspect, an embodiment of the present disclosure provides a method for quantizing a neural network model, including: obtaining quantization factors of each network layer of the neural network model determined by iteratively executing a plurality of search operations; quantizing the neural network model based on the obtained quantization factor; wherein the search operation comprises: searching a candidate quantization factor set from a preset quantization method search space, and quantizing the neural network model based on the candidate quantization factor set to obtain a candidate network model, wherein the candidate quantization factor set comprises candidate quantization factors respectively corresponding to each network layer of the neural network model; acquiring the performance of a candidate network model, and performing back propagation on the basis of the performance of the candidate network model to update the searched candidate quantization factor set; and in response to determining that the performance of the candidate network model meets the preset convergence condition, determining the quantization factors of each network layer of the neural network model based on the current candidate quantization factor set.
In some embodiments, the search operation further includes: determining, based on the parameter distribution of the neural network model and the parameter distribution of the candidate network model, whether the candidate quantization factor set satisfies a preset distribution constraint condition; and the acquiring the performance of the candidate network model includes: in response to determining that the candidate quantization factor set satisfies the preset distribution constraint condition, acquiring the performance of the candidate network model.
In some embodiments, the searching out a candidate quantization factor set from a preset quantization method search space and quantizing the neural network model based on the candidate quantization factor set to obtain a candidate network model includes: searching out at least two candidate quantization factor sets from the preset quantization method search space, and quantizing the neural network model based on each candidate quantization factor set to obtain at least two corresponding candidate network models; and the search operation further includes: deleting each candidate quantization factor set that does not satisfy the preset distribution constraint condition, together with its corresponding candidate network model.
In some embodiments, the preset distribution constraint condition includes: the distance between the parameter distribution of the neural network model and the parameter distribution of the candidate network model does not exceed a preset distance threshold.
In some embodiments, the quantization method search space includes candidate quantization factors for quantizing parameters of the neural network model to data of a predetermined bit width.
In a second aspect, an embodiment of the present disclosure provides an apparatus for quantizing a neural network model, including: an acquisition unit configured to acquire quantization factors of network layers of the neural network model determined by iteratively performing a plurality of search operations; a quantization unit configured to quantize the neural network model based on the obtained quantization factor; wherein the search operation comprises: searching a candidate quantization factor set from a preset quantization method search space, and quantizing the neural network model based on the candidate quantization factor set to obtain a candidate network model, wherein the candidate quantization factor set comprises candidate quantization factors respectively corresponding to each network layer of the neural network model; acquiring the performance of a candidate network model, and performing back propagation on the basis of the performance of the candidate network model to update the searched candidate quantization factor set; and in response to determining that the performance of the candidate network model meets the preset convergence condition, determining the quantization factors of each network layer of the neural network model based on the current candidate quantization factor set.
In some embodiments, the search operation further includes: determining, based on the parameter distribution of the neural network model and the parameter distribution of the candidate network model, whether the candidate quantization factor set satisfies a preset distribution constraint condition; and, in the search operation, the performance of the candidate network model is acquired as follows: in response to determining that the candidate quantization factor set satisfies the preset distribution constraint condition, acquiring the performance of the candidate network model.
In some embodiments, in the search operation, the searching out a candidate quantization factor set from a preset quantization method search space and quantizing the neural network model based on the candidate quantization factor set to obtain a candidate network model includes: searching out at least two candidate quantization factor sets from the preset quantization method search space, and quantizing the neural network model based on each candidate quantization factor set to obtain at least two corresponding candidate network models; and the search operation further includes: deleting each candidate quantization factor set that does not satisfy the preset distribution constraint condition, together with its corresponding candidate network model.
In some embodiments, the preset distribution constraint condition includes: the distance between the parameter distribution of the neural network model and the parameter distribution of the candidate network model does not exceed a preset distance threshold.
In some embodiments, the quantization method search space includes candidate quantization factors for quantizing parameters of the neural network model to data of a predetermined bit width.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by one or more processors, cause the one or more processors to implement a method of quantifying a neural network model as provided in the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for quantifying a neural network model provided in the first aspect.
According to the quantization method and apparatus for a neural network model provided by the embodiments of the present disclosure, the quantization factors of each network layer of the neural network model are determined by iteratively executing a search operation a plurality of times, and the neural network model is then quantized based on the determined quantization factors. The search operation includes: searching out a candidate quantization factor set from a preset quantization method search space and quantizing the neural network model based on the candidate quantization factor set to obtain a candidate network model, where the candidate quantization factor set includes candidate quantization factors respectively corresponding to the network layers of the neural network model; acquiring the performance of the candidate network model, and performing back propagation based on that performance to update the searched candidate quantization factor set; and, in response to determining that the performance of the candidate network model satisfies a preset convergence condition, determining the quantization factors of the network layers of the neural network model based on the current candidate quantization factor set. The method and apparatus can reduce the precision loss of the quantized model.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which embodiments of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method of quantifying a neural network model according to the present disclosure;
FIG. 3 is a flow diagram of another embodiment of a method of quantifying a neural network model according to the present disclosure;
FIG. 4 is a schematic structural diagram of one embodiment of a quantization apparatus of a neural network model of the present disclosure;
FIG. 5 is a schematic block diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein merely illustrate the invention and do not limit it. It should also be noted that, for convenience of description, only the portions related to the invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which the quantization method of the neural network model or the quantization apparatus of the neural network model of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 interact with the server 105 via the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 may be user-end devices on which various client applications may be installed, such as image-processing applications, information-analysis applications, voice-assistant applications, shopping applications, and financial applications.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server running various services, for example services such as object detection and recognition, text or speech recognition, and signal conversion based on data such as images, video, speech, text, and digital signals. The server 105 may acquire deep learning task data from the terminal devices 101, 102, 103 or from a database to construct training samples and train a neural network model for performing a deep learning task.
The server 105 may also be a backend server providing backend support for applications installed on the terminal devices 101, 102, 103. For example, the server 105 may receive data to be processed sent by the terminal devices 101, 102, 103, process the data using the neural network model, and return the processing result to the terminal devices 101, 102, 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
The trained neural network model may be deployed and run on the terminal devices 101, 102, 103. The terminal devices 101, 102, 103 usually have more stringent requirements on the operating speed of the model. In the context of the embodiment of the present application, the server 105 may determine a quantization factor of the neural network model according to hardware or software constraints of the terminal devices 101, 102, 103 (such as latency of a processor, power consumption, operation efficiency in an application program running environment, and the like), and then the server 105 or the terminal devices 101, 102, 103 may quantize the neural network model with a high bit width according to the quantization factor determined by the server 105.
Alternatively, in some scenarios, the terminal devices 101, 102, 103 may also search for a quantization factor of the neural network model with a high bit width through a search operation, and quantize the neural network model with a high bit width based on the searched quantization factor.
The quantization method of the neural network model provided by the embodiment of the present disclosure may be executed by the terminal device 101, 102, 103 or the server 105, and accordingly, the quantization apparatus of the neural network model may be disposed in the terminal device 101, 102, 103 or the server 105.
In some scenarios, the terminal device 101, 102, 103 or the server 105 may locally read or obtain source data required for model quantization from a database or the like, for example, locally read a trained high-bit-width neural network model, and read a search space file of a quantization method. At this point, the exemplary system architecture 100 may not include the network 104 and the server 105, or the terminal devices 101, 102, 103 and the network 104.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method of quantizing a neural network model in accordance with the present disclosure is shown. The quantization method of the neural network model includes the following steps:
step 201, obtaining quantization factors of each network layer of the neural network model determined by iteratively executing a plurality of search operations.
In this embodiment, the execution subject of the quantization method of the neural network model may first locally read the quantization factors of the network layers of the neural network model determined by iteratively executing the search operation a plurality of times, or may receive the quantization factors of the network layers of the neural network model searched by other electronic devices.
In practice, the operation of determining the quantization factors for the network layers of the neural network model may be performed by a server. The server has strong computing power and can quickly search the quantization factors. The terminal device running the neural network model may receive the searched quantization factor from the server and quantize the locally deployed neural network model. Alternatively, the server may read the searched quantization factor from the memory and perform the quantization operation on the neural network model by the server.
The search operation described above includes the following steps S2011, S2012, and S2013:
In step S2011, a candidate quantization factor set is searched out from a preset quantization method search space, and the neural network model is quantized based on the candidate quantization factor set to obtain a candidate network model.
In this embodiment, candidate quantization factors corresponding to each network layer of the neural network may be searched in a quantization method search space to obtain a candidate quantization factor set.
The quantization method search space may include an optional quantization method used to perform quantization operations on layers of the neural network model. The quantization method may include a quantization factor, or may further include a quantization algorithm, a quantization bit width, and the like. In this embodiment, the quantization method search space may include selectable quantization factors for each network layer of the neural network.
Alternatively, the corresponding quantization factors may be searched according to a pre-specified quantization bit width. In this case, the quantization method search space may include candidate quantization factors for quantizing the parameters of the neural network model to data of the predetermined bit width. For example, suppose the parameters of the neural network model are of type fp32 (32-bit single-precision floating point) and the specified quantization bit width is 8, i.e., the fp32 parameters need to be quantized to the int8 (8-bit integer) type; the quantization method search space then includes candidate quantization factors for quantizing the fp32 neural network model to an int8 model. In this way, quantization factors unsuitable for the preset bit width can be excluded when the search space is constructed, which improves search efficiency.
A quantization factor is a parameter of the mathematical conversion formula used to convert high-bit-width data into low-bit-width data, and may include a scale factor and, optionally, a bias factor. In practice, for example, when converting fp32 data x into int8 data y, the conversion formula y = ax + b may be used, where a is the scale factor and b is the bias factor.
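As a concrete illustration (a minimal sketch, not the patent's prescribed procedure; the function names and the round-and-clip behavior are assumptions), such an affine fp32-to-int8 conversion could be written as:

```python
import numpy as np

def quantize_fp32_to_int8(x: np.ndarray, a: float, b: float) -> np.ndarray:
    """Affine quantization y = a*x + b, rounded and clipped to the int8 range."""
    y = np.round(a * x + b)
    return np.clip(y, -128, 127).astype(np.int8)

def dequantize_int8_to_fp32(y: np.ndarray, a: float, b: float) -> np.ndarray:
    """Approximate inverse of the affine mapping above."""
    return (y.astype(np.float32) - b) / a

# Example: weights in [-1, 1] mapped onto the int8 range with scale 127.
w = np.random.uniform(-1.0, 1.0, size=(4, 4)).astype(np.float32)
w_int8 = quantize_fp32_to_int8(w, a=127.0, b=0.0)
w_restored = dequantize_int8_to_fp32(w_int8, a=127.0, b=0.0)
print(np.abs(w - w_restored).max())  # error introduced by rounding
```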
In general, when a neural network model is quantized by scaling all model parameters in equal proportion according to their overall numerical range, the parameters of the resulting model tend to fall into a few intervals that lie far apart and contain very different numbers of parameters, so the precision loss of the quantized neural network model is very large. It is therefore necessary to quantize the parameters of the high-bit-width neural network model using appropriate quantization factors.
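To make this failure mode concrete (an illustrative toy case, not an example from the patent): if a single outlier stretches the numeric range, an equal-proportion scale wastes most int8 levels and collapses the small weights:

```python
import numpy as np

w = np.concatenate([np.random.normal(0, 0.01, 10000), [10.0]]).astype(np.float32)
a = 127.0 / np.abs(w).max()            # scale chosen from the full range (outlier-driven)
w_q = np.clip(np.round(a * w), -128, 127)
w_restored = w_q / a
print(np.abs(w - w_restored).mean())   # most small weights round to zero
```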
In this embodiment, the candidate quantization factor set may be searched out from a preset quantization method search space in each search operation. The candidate quantization factor set includes candidate quantization factors respectively corresponding to network layers of the neural network model. That is, the candidate quantization factor set may be composed of quantization factors corresponding to the network layers of the neural network model one to one.
A preset controller may be used to search out a candidate quantization factor set from the preset quantization method search space. The controller may be implemented as a recurrent neural network, a reinforcement learning algorithm, an evolutionary algorithm, or the like. The controller generates a coding sequence representing the quantization factors of the network layers, and the coding sequence is decoded according to a predefined decoding rule to obtain the corresponding candidate quantization factor set.
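For illustration only, one controller step under assumed encodings might look like the following sketch; the discrete search space, the random stand-in controller, and the decoding rule are hypothetical, not the patent's specification:

```python
import numpy as np

# Hypothetical search space: for each layer, a discrete set of candidate scale factors.
SCALE_CHOICES = np.array([32.0, 64.0, 127.0, 255.0], dtype=np.float32)

def sample_encoding(num_layers: int, rng: np.random.Generator) -> np.ndarray:
    """Stand-in controller: sample one code per network layer."""
    return rng.integers(0, len(SCALE_CHOICES), size=num_layers)

def decode(encoding: np.ndarray) -> list:
    """Predefined decoding rule: map each code to its candidate scale factor."""
    return [float(SCALE_CHOICES[c]) for c in encoding]

rng = np.random.default_rng(0)
encoding = sample_encoding(num_layers=5, rng=rng)
candidate_factor_set = decode(encoding)  # one candidate quantization factor per layer
print(candidate_factor_set)
```

In a real search, the random sampler would be replaced by the trained controller, and the decoded factor set would be used to quantize the model as described above.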
After the current candidate quantization factor set is searched, the candidate quantization factor set can be used for quantizing the neural network model to obtain a candidate network model. The neural network model here is a high bit width model to be quantized, and training is completed in advance based on the performed deep learning task. And the candidate network model obtained after quantization is the corresponding low bit width model.
It should be noted that each search operation may search out a plurality of candidate quantization factor sets, and each set may be used to quantize the neural network model, yielding a plurality of corresponding candidate network models.
Step S2012, obtaining the performance of the candidate network model, and performing back propagation to update the searched candidate quantization factor set based on the performance of the candidate network model.
In this embodiment, the performance of the candidate network model may include an accuracy measure, such as an error, an accuracy, and the like, of its execution of the corresponding deep learning task, or may further include one or more of a corresponding operation efficiency, a hardware latency, a memory occupancy, and the like. The performance of the candidate network model may be obtained by running the candidate network model based on the corresponding task data.
Here, the task data may be media data such as image, voice, text, audio, and the like. Because the number of bytes of memory occupied by the media data is large, the processing efficiency can be effectively improved by adopting the quantized model to process the media data, and the accuracy of the processing result needs to be ensured aiming at the deep learning task related to the media data. Therefore, the accuracy of the processing result of the quantized model on the media data can be accurately judged by acquiring the performance of the candidate network model.
In this embodiment, the performance of the candidate network model may be propagated back into the search of the set of candidate quantization factors. Specifically, the next search operation may be iteratively performed based on the error of the candidate network model, so that the controller for generating the candidate quantization factor set generates a new candidate quantization factor set with the goal of reducing the error.
For example, if the controller is implemented as an evolutionary algorithm, the precision of the candidate network models corresponding to the different candidate quantization factor sets may be taken as the fitness of those sets in the population, and the population characterizing the candidate quantization factor sets may then be optimized according to the fitness of each set.
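A sketch of this evolutionary variant follows; the population size, mutation rule, and the evaluate() placeholder are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, POP_SIZE, TOP_K = 5, 8, 4

def evaluate(factor_set: np.ndarray) -> float:
    """Placeholder fitness: in practice, quantize the model with this factor set
    and measure the candidate network model's precision on task data."""
    return -float(np.abs(factor_set - 127.0).mean())  # dummy objective

population = rng.uniform(32.0, 255.0, size=(POP_SIZE, NUM_LAYERS))
for generation in range(20):
    fitness = np.array([evaluate(ind) for ind in population])
    parents = population[np.argsort(fitness)[-TOP_K:]]             # keep the fittest sets
    children = parents + rng.normal(0.0, 4.0, size=parents.shape)  # mutate the factors
    population = np.concatenate([parents, children])

best = population[np.argmax([evaluate(ind) for ind in population])]
print(best)  # best candidate quantization factor set found by the search
```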
As another example, if the controller is implemented as a recurrent neural network, the error of the candidate network model may be used as the loss function of the recurrent neural network, whose parameters are then updated by gradient descent; the next search operation is then performed, in which a new candidate quantization factor set is generated by the recurrent neural network with the updated parameters.
After obtaining the performance of the candidate network model, a next search operation may be performed in which a new set of candidate quantization factors is searched out.
In step S2013, in response to determining that the performance of the candidate network model satisfies the preset convergence condition, the quantization factors of the network layers of the neural network model are determined based on the current candidate quantization factor set.
If, in the current search operation, it is determined that the performance of the candidate network model satisfies the preset convergence condition, for example that the error of the candidate network model falls within a preset error range or that its accuracy is not less than a preset accuracy threshold, the candidate quantization factor set searched out in the current search operation may be determined as the quantization factors of the network layers of the neural network model.
And step 202, quantizing the neural network model based on the acquired quantization factor.
After the quantization factors of each network layer of the neural network model are obtained, each network layer can be quantized respectively based on the quantization factors of each network layer, so that the quantized neural network model is obtained.
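A minimal per-layer application of the searched factors might look like the following sketch; it reuses the hypothetical affine quantizer above and assumes, purely for illustration, a dict-of-weights model representation:

```python
import numpy as np

def quantize_model(layer_weights: dict, layer_factors: dict) -> dict:
    """Quantize each network layer with its own searched (scale, bias) factors."""
    quantized = {}
    for name, w in layer_weights.items():
        a, b = layer_factors[name]
        quantized[name] = np.clip(np.round(a * w + b), -128, 127).astype(np.int8)
    return quantized

weights = {"conv1": np.random.randn(8, 8).astype(np.float32),
           "fc1": np.random.randn(4, 4).astype(np.float32)}
factors = {"conv1": (64.0, 0.0), "fc1": (127.0, 0.0)}  # searched per-layer factors
print({k: v.dtype for k, v in quantize_model(weights, factors).items()})
```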
Since the quantization factors of the network layers come from a candidate quantization factor set that was iteratively updated during the search operations based on the performance of the quantized model, the resulting factor set can guarantee the performance of the quantized model. Moreover, because the candidate quantization factor set consists of quantization factors corresponding one-to-one to the network layers, i.e., the quantization factors for the parameters of all network layers of the entire model are searched simultaneously in each search operation, the implicit associations between the parameters of the layers can be exploited when determining the factor set. This improves the consistency between the inter-layer parameter associations of the quantized model and those of the neural network model before quantization, and thus further improves the accuracy of the quantized model.
With continued reference to fig. 3, shown is a flow diagram of another embodiment of a method of quantifying a neural network model in accordance with the present disclosure. As shown in fig. 3, a flow 300 of the quantization method of the neural network model of the present embodiment includes the following steps:
step 301, obtaining quantization factors of each network layer of the neural network model determined by iteratively executing a plurality of search operations.
In this embodiment, the execution subject of the quantization method of the neural network model may locally read or receive the quantization factors of the network layers of the neural network model determined by other electronic devices.
The search operation includes the following steps S3011, S3012, S3013, and S3014.
Step S3011, a candidate quantization factor set is searched from a preset quantization method search space, and the neural network model is quantized based on the candidate quantization factor set to obtain a candidate network model.
Step S3011 of this embodiment is the same as step S2011 of the previous embodiment, and the specific implementation manner of step S3011 may refer to the description of step S2011 in the previous embodiment, which is not described herein again.
Step S3012, determining whether the candidate quantization factor set satisfies a preset distribution constraint condition based on the parameter distribution of the neural network model and the parameter distribution of the candidate network model.
In this embodiment, the distribution of the parameters of the unquantized neural network model may be counted to determine its parameter distribution. After the neural network model is quantized based on the currently searched candidate quantization factor sets to obtain candidate models, the distribution of the parameters of each candidate model can likewise be counted to determine the corresponding parameter distribution.
Then, the consistency between the parameter distribution of the unquantized neural network model and the parameter distribution of the candidate model obtained after quantization can be assessed. For example, the mutual information or similarity between the two parameter distributions may be calculated as a consistency metric. It may then be determined whether this consistency satisfies the preset distribution constraint condition, for example whether the consistency metric reaches a preset consistency threshold; if so, the currently searched candidate quantization factor set satisfies the preset distribution constraint condition, and otherwise it does not.
Optionally, the preset distribution constraint condition includes: the distance between the parameter distribution of the neural network model and the parameter distribution of the candidate network model does not exceed a preset distance threshold. The distance between the two parameter distributions can be calculated using the KL divergence (Kullback-Leibler divergence). If the KL divergence of the two distributions exceeds the preset distance threshold, it is determined that the candidate quantization factor set does not satisfy the preset distribution constraint condition; if it does not exceed the threshold, the candidate quantization factor set satisfies the condition.
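A histogram-based KL check could be sketched as follows; the binning scheme, the comparison on dequantized parameter values, and the threshold value are illustrative assumptions:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-10) -> float:
    """KL(P || Q) between two discrete distributions given as histogram counts."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def satisfies_constraint(params_fp32: np.ndarray, params_dequant: np.ndarray,
                         threshold: float = 0.05, bins: int = 256) -> bool:
    """Compare the parameter distributions of the model before and after quantization."""
    lo = min(params_fp32.min(), params_dequant.min())
    hi = max(params_fp32.max(), params_dequant.max())
    p, _ = np.histogram(params_fp32, bins=bins, range=(lo, hi))
    q, _ = np.histogram(params_dequant, bins=bins, range=(lo, hi))
    return kl_divergence(p.astype(np.float64), q.astype(np.float64)) <= threshold

w = np.random.randn(10000).astype(np.float32)
w_q = np.round(w * 64.0) / 64.0  # dequantized parameters after fine-grained quantization
print(satisfies_constraint(w, w_q))
```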
By calculating the distance between the parameter distributions of the models before and after quantization, the consistency between them can be measured accurately. Screening the candidate quantization factor sets based on this consistency then effectively filters out candidate network models whose parameter distribution is inconsistent with that of the model before quantization, removes unreasonable candidate quantization factor sets, and improves the efficiency of the quantization factor search.
Step S3013, in response to determining that the candidate quantization factor set satisfies the preset distribution constraint condition, obtaining performance of the candidate network model, and performing back propagation to update the searched candidate quantization factor set based on the performance of the candidate network model.
If it is determined in step S3012 that the currently searched candidate quantization factor set satisfies the preset distribution constraint condition, that is, that the consistency between the parameter distribution of the candidate model quantized with the candidate quantization factor set and the parameter distribution of the unquantized neural network model reaches the preset condition, the performance of the quantized candidate network model may be acquired. The performance of the candidate network model may be obtained by running the candidate network model on a test data set. The test data set is a data set of the same type as the training data set of the neural network model and may be a collection of media data.
The operation of obtaining the performance of the candidate network model in this step and performing back propagation based on the performance of the candidate network model to update the searched candidate quantization factor set is consistent with step S2012 of the foregoing embodiment, and a specific implementation manner may refer to the description of step S2012 of the foregoing embodiment, which is not described herein again.
In step S3014, in response to determining that the performance of the candidate network model satisfies the preset convergence condition, the quantization factors of the network layers of the neural network model are determined based on the current candidate quantization factor set.
Step S3014 of this embodiment is the same as step S2013 of the previous embodiment, and the specific implementation manner of step S3014 may refer to the description of step S2013 in the previous embodiment, which is not described herein again.
Alternatively, if the performance of the candidate network model does not satisfy the preset convergence condition, execution may return to the next search operation, in which the candidate quantization factor set is updated by back propagation based on the performance of the candidate network model from the current search operation.
Optionally, in step S3011 of the search operation, at least two candidate quantization factor sets may be searched out from the preset quantization method search space, and the neural network model is quantized based on each candidate quantization factor set to obtain at least two corresponding candidate network models. In this case, before step S3014 is performed, the search operation may further include: deleting each candidate quantization factor set that does not satisfy the preset distribution constraint condition, together with its corresponding candidate network model.
Specifically, at least two candidate quantization factor sets may be searched out in each search operation, and each set may be used to quantize the neural network model. For each candidate network model obtained after quantization, it may be determined whether its parameter distribution and the parameter distribution of the unquantized neural network model satisfy the preset distribution constraint condition; any candidate quantization factor set that does not satisfy the condition is deleted together with the candidate network model generated by quantizing with it, as sketched below. In this way, unreasonable candidate quantization factor sets and their corresponding candidate network models are removed without performance evaluation, which can effectively improve the efficiency of searching for quantization factors.
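An illustrative pruning step over several candidates, with the same hypothetical KL-based check as above defined inline, might look like:

```python
import numpy as np

def distribution_ok(p_params: np.ndarray, q_params: np.ndarray,
                    threshold: float = 0.05, bins: int = 256) -> bool:
    """KL-based closeness check, same idea as satisfies_constraint() above."""
    lo = min(p_params.min(), q_params.min())
    hi = max(p_params.max(), q_params.max())
    p, _ = np.histogram(p_params, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_params, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    eps = 1e-10
    return float(np.sum(p * np.log((p + eps) / (q + eps)))) <= threshold

def prune_candidates(original_params: np.ndarray, candidates: list) -> list:
    """Delete (factor_set, candidate_params) pairs that violate the distribution
    constraint, so only the survivors undergo performance evaluation."""
    return [(fs, cp) for fs, cp in candidates if distribution_ok(original_params, cp)]

w = np.random.randn(10000).astype(np.float32)
cands = [([64.0], np.round(w * 64.0) / 64.0),  # fine-grained quantization: likely kept
         ([1.0], np.round(w))]                 # very coarse quantization: likely pruned
print(len(prune_candidates(w, cands)))
```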
After the set of quantization factors is obtained, the process continues to step 302, and the neural network model is quantized based on the obtained quantization factors.
The specific implementation of step 302 may refer to the description of step 202 in the foregoing embodiment, and is not described herein again.
In particular, the layers of a deep neural network depend strongly on one another, and changes in the associations between the parameters of the layers have a great influence on the performance of the model; the deeper the model, the greater this influence. If the associations between the parameters of the quantized low-bit-width model differ too much from those of the high-bit-width model, the precision loss of the quantized neural network model is severe, and the precision may even become too low to meet service requirements. In this embodiment, by screening the candidate quantization factor sets based on the consistency between the parameter distributions of the models before and after quantization, unreasonable quantization factors can be filtered out, search efficiency is improved, and the factor set obtained by the search is ensured to change the parameter distribution of the neural network model as little as possible, which further improves the performance of the quantized low-bit-width model.
Referring to fig. 4, as an implementation of the quantization method for the neural network model, the present disclosure provides an embodiment of a quantization apparatus for a neural network model, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2 and 3, and the apparatus may be applied to various electronic devices.
As shown in fig. 4, the quantization apparatus 400 of the neural network model of the present embodiment includes an acquisition unit 401 and a quantization unit 402. Wherein the obtaining unit 401 is configured to obtain quantization factors of network layers of the neural network model determined by iteratively performing a plurality of search operations; the quantization unit 402 is configured to quantize the neural network model based on the obtained quantization factor; wherein the search operation comprises: searching a candidate quantization factor set from a preset quantization method search space, and quantizing the neural network model based on the candidate quantization factor set to obtain a candidate network model, wherein the candidate quantization factor set comprises candidate quantization factors respectively corresponding to each network layer of the neural network model; acquiring the performance of a candidate network model, and performing back propagation on the basis of the performance of the candidate network model to update the searched candidate quantization factor set; and in response to determining that the performance of the candidate network model meets the preset convergence condition, determining the quantization factors of each network layer of the neural network model based on the current candidate quantization factor set.
In some embodiments, the search operation further includes: determining, based on the parameter distribution of the neural network model and the parameter distribution of the candidate network model, whether the candidate quantization factor set satisfies a preset distribution constraint condition; and, in the search operation, the performance of the candidate network model is acquired as follows: in response to determining that the candidate quantization factor set satisfies the preset distribution constraint condition, acquiring the performance of the candidate network model.
In some embodiments, in the search operation, the searching out a candidate quantization factor set from a preset quantization method search space and quantizing the neural network model based on the candidate quantization factor set to obtain a candidate network model includes: searching out at least two candidate quantization factor sets from the preset quantization method search space, and quantizing the neural network model based on each candidate quantization factor set to obtain at least two corresponding candidate network models; and the search operation further includes: deleting each candidate quantization factor set that does not satisfy the preset distribution constraint condition, together with its corresponding candidate network model.
In some embodiments, the preset distribution constraint condition includes: the distance between the parameter distribution of the neural network model and the parameter distribution of the candidate network model does not exceed a preset distance threshold.
In some embodiments, the quantization method search space includes candidate quantization factors for quantizing parameters of the neural network model to data of a predetermined bit width.
The units in the apparatus 400 described above correspond to the steps in the method described with reference to fig. 2 and 3. Thus, the operations, features and technical effects described above for the quantization method of the neural network model are also applicable to the apparatus 400 and the units included therein, and are not described herein again.
Referring now to FIG. 5, a schematic diagram of an electronic device (e.g., the server shown in FIG. 1) 500 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device 500 may include a processing means (e.g., a central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 502 or a program loaded from a storage means 508 into a random access memory (RAM) 503. Various programs and data necessary for the operation of the electronic device 500 are also stored in the RAM 503. The processing means 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; a storage device 508 including, for example, a hard disk; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 5 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program, when executed by the processing device 501, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: obtaining quantization factors of each network layer of the neural network model determined by iteratively executing a plurality of search operations; quantizing the neural network model based on the obtained quantization factor; wherein the search operation comprises: searching a candidate quantization factor set from a preset quantization method search space, and quantizing the neural network model based on the candidate quantization factor set to obtain a candidate network model, wherein the candidate quantization factor set comprises candidate quantization factors respectively corresponding to each network layer of the neural network model; acquiring the performance of a candidate network model, and performing back propagation on the basis of the performance of the candidate network model to update the searched candidate quantization factor set; and in response to determining that the performance of the candidate network model meets the preset convergence condition, determining the quantization factors of each network layer of the neural network model based on the current candidate quantization factor set.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit and a quantization unit. The names of the units do not form a limitation to the units themselves in some cases, and for example, the obtaining unit may also be described as a "unit that obtains quantization factors of network layers of the neural network model determined by iteratively performing a plurality of search operations".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is possible without departing from the inventive concept as defined above. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (12)

1. A method of quantifying a neural network model, comprising:
obtaining quantization factors of each network layer of the neural network model determined by iteratively executing a plurality of search operations;
quantizing the neural network model based on the acquired quantization factor;
wherein the search operation comprises:
searching a candidate quantization factor set from a preset quantization method search space, and quantizing a neural network model based on the candidate quantization factor set to obtain a candidate network model, wherein the candidate quantization factor set comprises candidate quantization factors respectively corresponding to network layers of the neural network model;
acquiring the performance of the candidate network model, and performing back propagation on the basis of the performance of the candidate network model to update the searched candidate quantization factor set;
in response to determining that the performance of the candidate network model satisfies a preset convergence condition, determining quantization factors of network layers of the neural network model based on a current set of candidate quantization factors.
2. The method of claim 1, wherein the search operation further comprises:
determining whether the candidate quantization factor set meets a preset distribution constraint condition based on the parameter distribution of the neural network model and the parameter distribution of the candidate network model; and
the obtaining performance of the candidate network model comprises:
in response to determining that the candidate quantization factor set satisfies the preset distribution constraint condition, acquiring the performance of the candidate network model.
3. The method of claim 2, wherein the searching out a candidate quantization factor set from a preset quantization method search space, and quantizing the neural network model based on the candidate quantization factor set to obtain a candidate network model comprises:
searching at least two candidate quantization factor sets from a preset quantization method search space, and quantizing the neural network model based on each candidate quantization factor set to obtain at least two corresponding candidate network models; and
the search operation further comprises:
deleting the candidate quantization factor sets that do not meet the preset distribution constraint condition, together with the corresponding candidate network models.
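The batched variant of claim 3 can be pictured with the following sketch, where quantize_fn and constraint_fn stand in for the quantization step and the distribution-constraint check; both names are illustrative assumptions rather than terms used in the disclosure.

def prune_candidates(candidate_factor_sets, float_weights, quantize_fn, constraint_fn):
    # keep only the candidate factor sets (and their candidate network models)
    # that satisfy the preset distribution constraint; delete the rest
    survivors = []
    for factors in candidate_factor_sets:
        candidate_model = quantize_fn(float_weights, factors)
        if constraint_fn(float_weights, candidate_model):
            survivors.append((factors, candidate_model))
    return survivors  # performance is then evaluated only on these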
4. The method of claim 2 or 3, wherein the preset distribution constraint condition comprises:
the distance between the parameter distribution of the neural network model and the parameter distribution of the candidate network model does not exceed a preset distance threshold.
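Claims 2-4 leave the distance metric between the two parameter distributions open. As one concrete possibility, the sketch below compares histograms of the float and quantized parameters with a symmetrized KL divergence; the bin count and threshold are assumptions.

import numpy as np

def distribution_distance(orig, cand, bins=128):
    lo = float(min(orig.min(), cand.min()))
    hi = float(max(orig.max(), cand.max()))
    p, _ = np.histogram(orig, bins=bins, range=(lo, hi))
    q, _ = np.histogram(cand, bins=bins, range=(lo, hi))
    p = p / p.sum() + 1e-12   # normalize; epsilon avoids log(0)
    q = q / q.sum() + 1e-12
    return 0.5 * float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def meets_distribution_constraint(orig_params, cand_params, threshold=0.05):
    # a candidate factor set is retained only while the distance stays in bounds
    return distribution_distance(orig_params, cand_params) <= threshold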
5. The method of claim 1, wherein the quantization method search space comprises candidate quantization factors that quantize parameters of the neural network model to data of a predetermined bit width.
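Claim 5 ties each candidate quantization factor to a predetermined bit width. The sketch below shows the common symmetric linear mapping to signed integers of that width; the concrete scheme is an assumption, since the claim does not spell one out.

import numpy as np

def quantize_to_bit_width(weights, factor, bits=8):
    # map float parameters to signed integers of the predetermined bit width
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(weights / factor), -qmax - 1, qmax)
    return q.astype(np.int8 if bits <= 8 else np.int32)

def dequantize(q, factor):
    # approximate reconstruction; the residual is the quantization loss
    return q.astype(np.float32) * factor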
6. An apparatus for quantizing a neural network model, comprising:
an acquisition unit configured to acquire quantization factors of network layers of the neural network model determined by iteratively performing a plurality of search operations;
a quantization unit configured to quantize the neural network model based on the acquired quantization factors;
wherein the search operation comprises:
searching a candidate quantization factor set from a preset quantization method search space, and quantizing the neural network model based on the candidate quantization factor set to obtain a candidate network model, wherein the candidate quantization factor set comprises candidate quantization factors respectively corresponding to the network layers of the neural network model;
acquiring the performance of the candidate network model, and performing back propagation based on the performance of the candidate network model to update the searched candidate quantization factor set;
in response to determining that the performance of the candidate network model satisfies a preset convergence condition, determining the quantization factors of the network layers of the neural network model based on the current candidate quantization factor set.
7. The apparatus of claim 6, wherein the search operation further comprises:
determining whether the candidate quantization factor set meets a preset distribution constraint condition based on the parameter distribution of the neural network model and the parameter distribution of the candidate network model; and
in the search operation, the performance of the candidate network model is acquired as follows:
acquiring the performance of the candidate network model in response to determining that the candidate quantization factor set meets the preset distribution constraint condition.
8. The apparatus of claim 7, wherein, in the search operation, searching a candidate quantization factor set from a preset quantization method search space and quantizing the neural network model based on the candidate quantization factor set to obtain a candidate network model comprises:
searching at least two candidate quantization factor sets from a preset quantization method search space, and quantizing the neural network model based on each candidate quantization factor set to obtain at least two corresponding candidate network models; and
the search operation further comprises:
deleting the candidate quantization factor sets that do not meet the preset distribution constraint condition, together with the corresponding candidate network models.
9. The apparatus of claim 7 or 8, wherein the preset distribution constraint condition comprises:
the distance between the parameter distribution of the neural network model and the parameter distribution of the candidate network model does not exceed a preset distance threshold.
10. The apparatus of claim 6, wherein the quantization method search space comprises candidate quantization factors that quantize parameters of the neural network model to data of a predetermined bit width.
11. An electronic device, comprising:
one or more processors;
a storage device storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
12. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-5.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010144339.4A | 2020-03-04 | 2020-03-04 | Quantification method and device of neural network model

Publications (1)

Publication Number | Publication Date
CN113361701A | 2021-09-07

Family

ID=77523608

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010144339.4A | Quantification method and device of neural network model | 2020-03-04 | 2020-03-04

Country Status (1)

Country | Link
CN | CN113361701A

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US5398302A * | 1990-02-07 | 1995-03-14 | Thrift; Philip | Method and apparatus for adaptive learning in neural networks
EP3093821A1 * | 2015-04-16 | 2016-11-16 | Siemens Aktiengesellschaft | Method and system for anatomical object pose detection using marginal space deep neural networks
WO2020019236A1 * | 2018-07-26 | 2020-01-30 | Intel Corporation | Loss-error-aware quantization of a low-bit neural network
CN110766142A * | 2019-10-30 | 2020-02-07 | Beijing Baidu Netcom Science and Technology Co Ltd | Model generation method and device
CN110852438A * | 2019-11-11 | 2020-02-28 | Beijing Baidu Netcom Science and Technology Co Ltd | Model generation method and device
CN110852421A * | 2019-11-11 | 2020-02-28 | Beijing Baidu Netcom Science and Technology Co Ltd | Model generation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
周光朕; 杜姗姗; 冯瑞; 欧丽君; 刘斌: "Face recognition method based on residual-quantization convolutional neural networks" (基于残差量化卷积神经网络的人脸识别方法), Computer Systems & Applications, no. 08 *
蔡瑞初; 钟椿荣; 余洋; 陈炳丰; 卢冶; 陈瑶: "Quantization and compression methods of convolutional neural networks for edge applications" (面向"边缘"应用的卷积神经网络量化与压缩方法), Journal of Computer Applications, no. 09 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115034388A * | 2022-07-07 | 2022-09-09 | Beijing Baidu Netcom Science and Technology Co Ltd | Method and device for determining quantization parameters of sequencing model and electronic equipment

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination