WO2020057000A1

WO2020057000A1 - Network quantization method, service processing method and related products

Info

Publication number: WO2020057000A1
Application number: PCT/CN2018/124834
Authority: WO
Inventors: 周争光; 王孝宇; 吕旭涛; 黄轩
Original assignee: 深圳云天励飞技术有限公司
Priority date: 2018-09-19
Filing date: 2018-12-28
Publication date: 2020-03-26
Also published as: CN110929865A; CN110929865B

Abstract

Embodiments of the present invention provide a network quantization method, a service processing method and related products. The method comprises obtaining an original deep neural network to be quantified, the original deep neural network comprising multilayer convolutional networks, any layer of the convolutional networks comprising a first branch and a second branch, and the second branch being a full precision convolutional structure; performing quantitative training on each layer of the convolutional networks, and performing attenuation processing on each layer of the convolutional networks according to a scaling factor, wherein the scaling factor decreases as the increase of the number of training steps of the quantitative training; when all layers of the convolutional networks complete the quantitative training and the scaling factor decreases to zero, removing the second branch in each layer of the convolutional networks in the original deep neural network to obtain a quantified target deep neural network. According to the present invention, fluctuations during network fixed point quantitative training can be smaller, and the trained network precision is higher, facilitating smoothly performing a service processing process.

Description

Network quantification method, business processing method and related products

This application claims the priority of a Chinese patent application filed on September 19, 2018 with the Chinese Patent Office under the application number 201811092329.X and the invention name "Network Quantization Method, Business Processing Method and Related Products", the entire contents of which are hereby incorporated by reference Incorporated in this application.

Technical field

The present invention relates to the field of machine learning, and particularly to the field of deep neural networks, and in particular, to a network quantization method, a service processing method based on a deep neural network, a network quantization device, a service processing device based on a deep neural network, and A network device.

Background technique

Deep neural network (DNN) has achieved remarkable results in many computer vision and natural language processing tasks. In recent years, DNN has been increasingly applied to mobile phones and embedded devices. However, high-performance DNNs require a large amount of storage space and calculations, which makes running DNNs on mobile devices more challenging. Therefore, more and more network compression and acceleration methods have been proposed to reduce the storage space of the network and improve the speed of the network without significantly reducing the performance of the DNN. Among them, network quantization is an effective compression method, which uses a small number of bits to represent each weight value or activation value of each layer in the network, so that it can be efficiently used in the CPU (Central Processing Unit, Central Processing Unit), GPU (Graphics, Processing Unit) and FPGA (Field-Programmable Gate Array, field programmable gate array). However, current network quantization methods tend to fluctuate greatly, and the accuracy of trained networks is also low.

Summary of the Invention

The embodiments of the present invention provide a network quantization method, a deep neural network-based service processing method, and related products, which can make the deep neural network less volatile during network quantization training, and the trained network has higher accuracy and is beneficial to the business Smooth execution of the process.

In one aspect, an embodiment of the present invention provides a network quantization method, including:

Obtaining an original deep neural network to be quantified, the original deep neural network including a multi-layer convolutional network, each layer of the convolutional network including a first branch and a second branch, the second branch being a full-precision convolution structure;

Perform quantization training on each layer of the convolutional network, and perform attenuation processing on each layer of the convolutional network according to a scaling factor; wherein the scaling factor decreases as the number of training steps of the quantization training increases;

When all layers of the convolutional network have completed quantization training and the scaling factor is reduced to zero, the second branch in each layer of the convolutional network in the original deep neural network is removed, and after quantization is obtained Target deep neural network.

In another aspect, an embodiment of the present invention provides a service processing method based on a deep neural network, including:

Receiving a service request, the service request carrying a business object to be processed; the service request includes any of the following: an image processing request, a face recognition request, a visual processing request, and a natural language recognition processing request;

Calling a target deep neural network to process the business object to obtain a business processing result; wherein the target deep neural network is obtained by using the network quantization method of the above aspect;

Output the business processing result.

In another aspect, an embodiment of the present invention provides a network quantization apparatus, including:

An obtaining unit for obtaining an original deep neural network to be quantified; the original deep neural network includes a multi-layer convolutional network, and each layer of the convolutional network includes a first branch and a second branch, and the second branch is a full branch Precision convolution structure;

A quantization unit is configured to perform quantization training on each layer of the convolutional network, and perform attenuation processing on each layer of the convolutional network according to a scaling factor; wherein the scaling factor decreases as the number of training steps of the quantization training increases. small;

A processing unit configured to remove the second branch in each layer of the convolutional network in the original deep neural network when the quantization training is completed in all layers and the scaling factor is reduced to zero Processing to get the quantified target deep neural network.

In another aspect, an embodiment of the present invention provides a service processing apparatus based on a deep neural network, including:

A request receiving unit, configured to receive a service request, the service request carrying a business object to be processed; the service request includes any of the following: an image processing request, a face recognition request, a visual processing request, and a natural language recognition processing request;

A business processing unit is configured to call a target deep neural network to process the business object to obtain a business processing result. The target deep neural network is obtained by using the network quantification method in the above aspect:

A result output unit, configured to output the service processing result.

In another aspect, an embodiment of the present invention provides a network device, including:

A processor adapted to implement one or more instructions; and

A computer storage medium storing one or more first instructions, where the one or more first instructions are adapted to be loaded by the processor and execute the following network quantization method:

Obtaining an original deep neural network to be quantified, the original deep neural network including a multi-layer convolution network, and each layer of the convolution network includes a first branch and a second branch, and the second branch is a full-precision convolution structure;

When all layers of the convolutional network have completed quantization training and the scaling factor is reduced to zero, the second branch in each layer of the convolutional network in the original deep neural network is removed, and after quantization is obtained Target deep neural network;

Alternatively, the computer storage medium stores one or more second instructions, and the one or more second instructions are suitable for being loaded by the processor and executing a business processing method based on a deep neural network as follows:

Output the business processing result.

In the network quantization process of the embodiment of the present invention, the original deep neural network includes a multi-layer convolutional network, and each layer of the convolutional network includes a first branch and a second branch, and the second branch is a full-precision convolution structure; Each layer of the convolutional network is subjected to quantization training, and each layer of the convolutional network is subjected to attenuation processing according to a scaling factor; wherein the scaling factor decreases as the number of training steps of the quantization training increases; When the convolutional network has completed quantization training and the scaling factor is reduced to zero, the second branch in each layer of the convolutional network in the original deep neural network is removed to obtain a quantized target deep neural network. The internet. By implementing the embodiment of the present invention, the second branch of the full-precision convolution structure is set in the convolutional network of each layer of the original deep neural network, which makes the output of the convolutional network of each layer have a stronger expression ability; Each layer of the convolutional network is subjected to quantized training and attenuation processing, which can make the fluctuation of the network during quantized training smaller, the trained network has higher accuracy, and the obtained target deep neural network has better network performance.

During the service processing process based on the deep neural network in the embodiment of the present invention, a service request may be received, and the service request carries a business object to be processed; the service request includes any of the following: an image processing request, a face recognition request, a vision A processing request and a natural language recognition processing request; calling a target deep neural network to process the business object to obtain a business processing result; wherein the target deep neural network is obtained using the network quantization method; and outputting the business processing Results; In the implementation of the embodiment of the present invention, since the target deep neural network used for business processing is obtained through the network quantization method, the target deep neural network has better network performance and higher accuracy, which can effectively improve services. Processing efficiency and quality of business processing.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly explain the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings in the following description are merely These are some embodiments of the present invention. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without paying creative labor.

FIG. 1 is a flowchart of a network quantization method according to an embodiment of the present invention; FIG.

2 is a schematic structural diagram of an initial deep neural network according to an embodiment of the present invention;

3 is a schematic structural diagram of an original deep neural network according to an embodiment of the present invention;

FIG. 4 is another schematic structural diagram of an original deep neural network according to an embodiment of the present invention; FIG.

5 is a flowchart of a specific implementation manner of step s102 shown in FIG. 1;

6 is a flowchart of another specific implementation manner of step s102 shown in FIG. 1;

FIG. 7 is a flowchart of a deep neural network-based service processing method according to an embodiment of the present invention; FIG.

8a-8c are application scenario diagrams of a deep neural network-based service processing method according to an embodiment of the present invention;

9 is a schematic structural diagram of a network quantization apparatus according to an embodiment of the present invention;

10 is a schematic structural diagram of a deep neural network-based service processing apparatus according to an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a network device according to an embodiment of the present invention.

detailed description

In the following, the technical solutions in the embodiments of the present invention will be clearly and completely described with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

An embodiment of the present invention proposes a network quantization method. Referring to FIG. 1, the method specifically includes the following steps:

s101. Obtain an original deep neural network to be quantified. The original deep neural network includes a multi-layer convolution network. Each layer of the convolution network includes a first branch and a second branch. The second branch is a full-precision convolution. structure.

The original deep neural network is obtained by adding branches to the initial deep neural network. The initial deep neural network is currently a common deep neural network. The initial deep neural network includes a multi-layer convolutional network, and each layer of the convolutional network has a single structure. For convenience of description, it is assumed that the initial neural network includes L-layer convolution. Network, L is a positive integer, and any layer of the convolutional network is represented as a layer l convolutional network, l is a positive integer, and 1≤l≤L, the structure of the layer l convolutional network of the initial deep neural network may be See Figure 2. Because the initial deep neural network has only a single structure, when performing network quantization training, network parameters (including weights and activation values) with only a single structure are quantized to a limited number of values, for example: if an 8-bit fixed-point number is used as required To quantify the weight and activation value of the initial deep neural network, then the weight and activation value are quantized to a limited number of values determined by the 8-bit fixed-point number; for another example: if a 2-bit fixed-point number is used to initialize the initial The weight and activation value of the deep neural network are quantified, and then the weight and activation value are quantized to a limited number of values determined by a 2-bit fixed point number. It can be seen that the initial deep neural network has only a single structure, and the network parameters of the single structure are quantified to a limited number of values, which makes the output expression ability of the initial deep neural network weak, increases the difficulty of quantization training, and makes the quantization training process The fluctuations are very large. In order to reduce the fluctuation of the quantization training process and reduce the difficulty of quantization training, the embodiment of the present invention improves the initial deep neural network, and determines the original single structure of each layer of the convolutional network in the initial deep neural network as the first branch. On this basis, a second branch is added to form the original deep neural network.

The original deep neural network includes a multi-layer convolutional network, and each layer of the convolutional network includes a first branch and a second branch. Specifically, the original deep neural network includes an L-layer convolutional network, where L is a positive integer, where The convolutional network of any layer is represented as the convolutional network of the first layer, where l is a positive integer, and 1≤l≤L. The first branch may be a convolution structure of fixed-point quantization accuracy. The so-called fixed-point quantization accuracy refers to a precision range determined by the number of fixed-points. The determined accuracy range; if a 2-bit fixed-point number is used for network quantization, the fixed-point quantization accuracy is the accuracy range determined by the 2-bit fixed-point number. The first branch in the layer l convolution network includes a weighting unit, a first convolution unit, a first activation unit, and an activation value quantization unit; wherein the weighting unit uses a weighting function to perform weighting on the first layer Quantization training; the first convolution unit is configured to perform a convolution operation on the quantized weight of the first layer and the output value of the first layer to the first layer; the first activation unit is configured to perform a convolution operation on the first convolution unit The output value is processed non-linearly; the activation value quantization unit uses the activation value quantization function to perform quantization training on the convolution network. The second branch is a full-precision convolution structure. The so-called full-precision refers to the precision range determined by floating-point numbers. For example, if a 32-bit floating-point number is used as required, the full precision refers to the accuracy determined by a 32-bit floating-point number. range. The second branch in the layer l convolution network includes a second convolution unit, a second activation unit, and a scaling unit; wherein the second convolution unit is used to combine the weight of the layer l with the weight of the layer 1-1 The output value is subjected to a convolution operation; the second activation unit is configured to perform non-linear processing on the output value of the second convolution unit; and the scaling unit is configured to perform attenuation processing on the second branch. In a feasible implementation manner, the first branch and the second branch of each layer of the convolutional network in the original deep neural network may be combined by first adding and then quantizing, as shown in FIG. 3; In a feasible implementation manner, the first branch of each layer of the convolutional network in the original deep neural network may be combined by first quantizing and then adding the second branch. For details, see FIG. 4.

S102: Perform quantitative training on each layer of the convolutional network, and perform attenuation processing on each layer of the convolutional network according to a scaling factor; wherein the scaling factor decreases as the number of training steps of the quantization training increases.

The process of quantizing training a certain convolutional network includes a process of quantizing the weight of the convolutional network by using a weighting function, and / or a process of quantizing an activation value of the convolutional network by using an activation value quantization function. Since the convolutional network of each layer of the original deep neural network in this embodiment includes two branches, in step s102, the weights and activation values of the first branch of each layer of the convolutional network are quantified and trained, The weights and activation values of the second branch are not subjected to quantization training, and only attenuation processing is performed according to the scaling factor.

In specific implementation, for any layer of the convolutional network, step s102 may use the following steps a1-a2 to obtain the scaling factor:

a1, obtaining the number of training steps in the first layer of the convolutional network for quantized training;

a2, calling a cosine attenuation function to calculate a scaling factor corresponding to the number of training steps when the convolutional network of the first layer performs quantization training.

s103. When all layers of the convolutional network have completed quantization training and the scaling factor is reduced to zero, the second branch in the convolutional network of each layer in the original deep neural network is removed to obtain Quantified target deep neural network.

The quantization training starts from the input layer of the original deep neural network and iterates successively layer by layer. Each layer is subjected to quantitative training according to the method described in step s102, that is, the first branch of each layer is subjected to quantitative training, and Attenuation processing is performed on the second branch of each layer according to the scaling factor until all layers of the original deep neural network have completed the quantization training and the scaling factor of the second branch is reduced to zero. When the scaling factor is reduced to zero, the second branch has no effect on the entire network. When the network converges, the second branch of each layer can be removed, specifically, each layer is rolled up. The full-precision convolution structure of the second branch in the convolutional network is deleted (that is, the second branch shown by the dotted line in Fig. 3 or Fig. 4 is pruned). After the deletion, each layer of the convolutional network only includes the first branch. The convolutional network of each layer after the two branches constitutes a quantified target deep neural network.

Aiming at the structure of the original deep neural network shown in FIG. 3, an embodiment of the present invention provides a flowchart of a specific implementation manner of step s102 shown in FIG. 1 described above. Referring to FIG. 5, step s102 may specifically include the following step s501 -s508:

S501. Obtain input parameters of a layer l convolutional network, where the input parameters include weights and activation values;

s502. Quantitative training is performed on the input parameters of the first layer of the convolutional network in the first branch of the first layer of the convolutional network by using a weighted function to obtain the first of the first branch in the first layer of the convolutional network. Output parameters;

In a feasible implementation manner, step s502 specifically includes the following steps:

b1, using a fixed-point quantization method to obtain a weighted value of the first branch of the first layer of the convolutional network;

b2. The first convolution unit receives the weighted value of the first branch of the first layer of the convolutional network and the output value of the l-1 layer, and convolves the two values to obtain the first convolution of the first branch. value;

b3. The first activation unit performs a non-linear process on the first convolution value of the first branch by using a non-linear function to obtain a first output parameter of the first branch.

According to step s102, the fixed-point quantization method uses the low-precision quantization model to perform quantitative training on the weights of the first branch of the first-level convolutional network to obtain the first branch of the first-level convolutional network. Weighted value. In steps

among them,

Represents the weighted value of the first branch of the first-level convolutional network, W _l ¹ represents the weight of the first branch of the first-level convolutional network, and Q _w represents the quantization function of the weight by the fixed-point quantization method.

In step b2, the first convolution unit receives the weighted value of the first branch of the first-level convolution network and the output value of the first-first layer, and the calculation formula of the convolution of the two values is as follows;

among them,

Represents the first convolution value of the first branch,

Represents the output value of layer l-1.

In step b3, the first activation unit performs a non-linear processing on the first convolution value of the first branch by using a non-linear function, and a calculation formula is as follows:

among them,

Represents the first output parameter of the first branch, and φ represents the activation function of the first activation unit.

s503, transmitting the input parameters of the first layer of the convolutional network to the second branch in the first layer of the convolutional network for training to obtain the original output parameters of the second branch in the first layer of the convolutional network;

In a feasible implementation manner, step s503 specifically includes:

c1. The second convolution unit receives the weight of the second branch of the l-th layer of the convolutional network and the output value of the l-1 layer, and convolves the two values to obtain the second convolution value of the second branch;

c2. The second activation unit performs a non-linear processing on the second convolution value by using a non-linear function to obtain an original output parameter of a second branch of the first-layer convolution network.

According to step s102, the weight of the second branch of the first-layer convolutional network is expressed by a full-precision model. The second activation unit uses a non-linear function (such as a ReLu function) to perform non-linear processing on the second convolution value. In step c1, the second convolution unit receives the weight of the second branch of the l-th layer of the convolutional network and the output value of the l-1 layer, and the calculation formula of the convolution of the two values is as follows:

among them,

Represents the second convolution value of the second branch, W _l ² represents the weight of the second branch of the l-th layer convolution network,

Represents the output value of layer l-1.

In step c2, the second activation unit performs nonlinear processing on the second convolution value by using a non-linear function, and a calculation formula is as follows:

among them,

Represents the original output parameter of the second branch, and φ ′ represents the activation function of the second activation unit.

s504: Obtain a corresponding scaling factor according to the number of training steps during the quantized training of the first-layer convolutional network;

S505: Attenuate the original output parameter of the second branch in the first-layer convolutional network by using the obtained scaling factor to obtain the second output parameter of the second branch in the first-layer convolutional network;

In step s505, the obtained scaling factor is used to attenuate the original output parameters of the second branch in the first-layer convolutional network, and the calculation formula is as follows:

Wherein, factor represents the obtained scaling factor.

S506: Sum the first output parameter and the second output parameter to obtain an intermediate parameter.

In step s506, the first output parameter and the second output parameter are summed to obtain an intermediate parameter, and the calculation formula is as follows:

Among them, A _l represents an intermediate parameter.

S507: Quantize training the intermediate parameters by using an activation value quantization function to obtain output parameters of the first-layer convolution network;

In step s507, quantization training is performed on the intermediate parameter by using an activation value quantization function, and a calculation formula is as follows:

among them,

Represents the output parameters of the first layer of the convolutional network, and Q _a represents the quantization function of the activation value by the fixed-point quantization method.

s508. Determine the output parameters of the layer l convolutional network as the input parameters of the layer l + 1 layer convolution network, and repeat the above steps until all the layer convolutional networks have completed quantization training and the scaling factor is reduced to zero. .

In the network quantization process of the embodiment of the present invention, the original deep neural network includes a multi-layer convolutional network, and each layer of the convolutional network includes a first branch and a second branch, and the second branch is a full-precision convolution structure; Each layer of the convolutional network is subjected to quantization training, and each layer of the convolutional network is subjected to attenuation processing according to the scaling factor until all layers of the convolutional network have completed quantization training and the scaling factor is reduced to zero; wherein, the The scaling factor decreases as the number of training steps of the quantized training increases. Implementing the embodiment of the present invention, by adding a second branch of the full-precision convolution structure to the convolutional network of each layer of the original deep neural network, this makes the output of the convolutional network of each layer have a stronger expression ability; Each layer of the convolutional network is subjected to quantized training and attenuation processing, so that the fluctuations in the quantized training of the network are smaller, and the trained network has higher accuracy. The obtained target deep neural network can converge to a better local best advantage. Better network performance.

Aiming at the structure of the original deep neural network shown in FIG. 3, an embodiment of the present invention proposes a flowchart of another specific implementation manner of step s102 shown in FIG. 1. Referring to FIG. 6, step s102 specifically includes the following steps s601-s607. :

S601: Obtain input parameters of a layer l convolutional network, where the input parameters include weights and activation values;

s602. Quantitative training is performed on the input parameters of the first layer of the convolutional network in the first branch of the first layer of the convolutional network by using a weighting function and an activation value quantization function to obtain the first layer of the first layer of the convolutional network. The first output parameter of a branch;

In a feasible implementation manner, step s602 specifically includes the following steps:

d1, using a fixed-point quantization method to obtain a weighted value of the first branch of the first layer of the convolutional network;

d2. The first convolution unit receives the weighted value of the first branch of the convolutional network of the first layer and the output value of the first to first layer, and convolves the two values to obtain the first volume of the first branch. Product value

d3. The first activation unit performs a nonlinear process on the first convolution value of the first branch by using a non-linear function to obtain an activation value of the first branch;

d4. Use the activation value quantization function to perform quantization training on the activation value of the first branch to obtain a first output parameter of the first branch in the first-layer convolutional network.

In step d1, a fixed-point quantization method is used to obtain the weighted value of the first branch of the first layer of the convolution network, and the calculation formula is as follows:

among them,

In step d2, the first convolution unit receives the weighted value of the first branch of the first-layer convolutional network and the output value of the l-1 layer, and the calculation formula of the convolution of the two values is as follows:

among them,

Represents the first convolution value of the first branch,

Represents the output value of layer l-1.

In step d3, the first activation unit performs a non-linear processing on the first convolution value of the first branch by using a non-linear function, and a calculation formula is as follows:

among them,

In step d4, the activation value quantization function is used to perform quantitative training on the activation value of the first branch, and the calculation formula is as follows:

among them,

Represents the first output parameter of the first branch in the first-layer convolutional network, and Q _a represents the quantization function of the activation value by the fixed-point quantization method.

s603. Transmit the input parameters of the first layer of the convolutional network to the second branch in the first layer of the convolutional network for training to obtain the original output parameters of the second branch in the first layer of the convolutional network;

In a feasible implementation manner, step s603 specifically includes:

e1. The second convolution unit receives the weight of the second branch of the first layer of the convolutional network and the output value of the first layer -1, and convolves the two values to obtain the second convolution value;

e2. The second activation unit performs a non-linear processing on the second convolution value by using a non-linear function to obtain an original output parameter of a second branch of the first-layer convolution network.

In step e1, the second convolution unit receives the weight of the second branch of the l-th layer convolution network and the output value of the l-1 layer, and the calculation formula of the convolution of the two values is as follows:

among them,

Represents the output value of layer l-1.

In step e2, the second activation unit performs nonlinear processing on the second convolution value by using a non-linear function, and a calculation formula is as follows:

among them,

s604. Obtain a corresponding scaling factor according to the number of training steps during the quantized training of the first-layer convolutional network.

S605: Attenuate the original output parameter of the second branch in the first-layer convolutional network by using the obtained scaling factor to obtain the second output parameter of the second branch in the first-layer convolutional network;

In step s605, the obtained scaling factor is used to attenuate the original output parameters of the second branch in the first-layer convolutional network, and the calculation formula is as follows:

Wherein, factor represents the obtained scaling factor.

S606: Sum the first output parameter and the second output parameter to obtain an output parameter of the first-layer convolution network;

In step s606, the first output parameter and the second output parameter are summed up, and the calculation formula is as follows:

among them,

Represents the output parameters of the l-th layer convolutional network.

s607. Determine the output parameters of the layer l convolutional network as the input parameters of the layer l + 1 layer convolution network, and repeat the above steps until all layer convolution networks have completed quantization training and the scaling factor is reduced to zero. .

In the network quantization process of the embodiment of the present invention, the original deep neural network includes a multi-layer convolutional network, and each layer of the convolutional network includes a first branch and a second branch, and the second branch is a full-precision convolution structure; Each layer of the convolutional network is subjected to quantization training, and each layer of the convolutional network is subjected to attenuation processing according to the scaling factor until all layers of the convolutional network have completed quantization training and the scaling factor is reduced to zero; wherein, the The scaling factor decreases as the number of training steps of the quantized training increases. By implementing the embodiment of the present invention, the second branch of the full-precision convolution structure is set in the convolutional network of each layer of the original deep neural network, which makes the output of the convolutional network of each layer have a stronger expression ability; Each layer of the convolutional network is subjected to quantized training and attenuation processing, so that the fluctuations in the quantized training of the network are smaller, and the trained network has higher accuracy. The obtained target deep neural network can converge to a better local best advantage. Better network performance.

Based on the description of the foregoing embodiment of the network quantization method, an embodiment of the present invention proposes a service processing method based on a deep neural network. Referring to FIG. 7, the method specifically includes the following steps:

S701: Receive a service request, where the service request carries a business object to be processed; the service request includes any of the following: an image processing request, a face recognition request, a visual processing request, and a natural language recognition processing request. It can be understood that the services to be processed may include, but are not limited to, image processing services, face recognition services, visual processing services, and natural language recognition services.

S702: Invoke a target deep neural network to process the business object to obtain a business processing result. The target deep neural network is obtained by using the network quantization method.

The target deep neural network may be obtained by using the network quantization method shown in FIG. 1 to FIG. 6. The target deep neural network may be set in a network device. The network device may include, but is not limited to, terminal devices, embedded devices, and networks. Server and so on. The terminal device may include, but is not limited to, a smart phone, a tablet computer, and a mobile wearable device; the embedded device may include, but is not limited to, a DSP (Digital Signal Processing) chip device. When a service request is received, the network device invokes the target deep neural network, and transmits the to-be-processed business objects (such as images, face images, etc.) carried by the service request as input parameters to the target deep neural network for corresponding processing. Business processing to get business processing results.

S703: Output the service processing result.

According to different business objects, the business processing result corresponds to the business object one-to-one; for example, if the business object is image processing, the corresponding business processing result may include, but is not limited to, image blurring, sharpening, and edge detection. For another example, if the business object is face recognition, the corresponding business processing result is matching face images, searching for associated identity information, and the like.

Based on the foregoing deep neural network-based service processing method, an embodiment of the present invention provides an application scenario of a deep neural network-based service processing method. Please refer to FIG. 8a to FIG. 8c, taking a face recognition service as an example, and setting a target deep neural network. The network is set in a face recognition APP (Application) in a mobile phone. The processing steps of the face recognition service are as follows: ① The user uses a mobile phone with a camera app (Application, application) and a face recognition app installed, opens the camera app, and clicks to take a photo, as shown in FIG. 8a. After the user takes the photo, Open the face recognition APP and choose to perform face recognition processing, then the face recognition APP calls the target deep neural network to perform face recognition processing on the face photo taken by the user, as shown in FIG. 8b; after the processing is completed, output Face recognition results are shown in Figure 8c.

In the service processing process based on the deep neural network of the embodiment of the present invention, a face recognition request may be received, and the face recognition request carries a face image to be processed; the target deep neural network is called to perform recognition processing on the face image to obtain Face recognition results; wherein the target deep neural network is obtained by using the network quantization method; outputting the face recognition results; implementation of the embodiment of the present invention due to the target deep neural network used for face recognition processing Obtained through the network quantization method, the target deep neural network has better network performance and higher accuracy, which can effectively improve the efficiency of face recognition processing and ensure the accuracy of face recognition processing.

Based on the description of the foregoing embodiment of the network quantization method, an embodiment of the present invention provides a network quantization apparatus. The apparatus may be a computer program running on a network device, and may be applied to the above-mentioned FIGS. 1, 5, and 6. The illustrated network quantization method is used to perform the corresponding steps in the network quantization method. Referring to FIG. 9, the device may include:

The obtaining unit 101 is configured to obtain an original deep neural network to be quantified. The original deep neural network includes a multi-layer convolutional network. Each layer of the convolutional network includes a first branch and a second branch. The second branch is Full precision convolution structure;

A quantization unit 102 is configured to perform quantization training on each layer of the convolutional network, and perform attenuation processing on each layer of the convolutional network according to a scaling factor; wherein the scaling factor increases as the number of training steps of the quantization training increases. Decrease

A processing unit 103, configured to move the second branch in each layer of the convolutional network in the original deep neural network when the quantization training is completed in all layers and the scaling factor is reduced to zero; Divide the processing to get the quantified target deep neural network.

In an embodiment, the obtaining unit 101 is specifically configured to:

Obtaining an initial deep neural network, where the initial deep neural network includes a multi-layer convolutional network, each layer of the convolutional network includes a first branch, and the first branch is a fixed-point quantization accuracy convolution structure;

A second branch is set for each layer of the initial deep neural network to obtain the original deep neural network.

In another embodiment, the quantization unit 102 is specifically configured to:

Obtaining input parameters of a layer l convolutional network, where the input parameters include weights and activation values;

A weighted function is used to quantitatively train the input parameters of the first layer of the convolutional network in the first branch in the first layer of the convolutional network to obtain the first output parameters of the first branch in the first layer of the convolutional network. ;

Transmitting the input parameters of the first-level convolutional network to the second branch in the first-level convolutional network for training to obtain the original output parameters of the second branch in the first-level convolutional network;

Obtain a corresponding scaling factor according to the number of training steps during the quantized training of the first-layer convolutional network;

Attenuating the original output parameters of the second branch in the first-layer convolutional network by using the obtained scaling factor to obtain the second output parameters of the second branch in the first-layer convolutional network;

Summing the first output parameter and the second output parameter to obtain an intermediate parameter;

Use the activation value quantization function to perform quantitative training on the intermediate parameters to obtain the output parameters of the first-layer convolutional network;

The output parameters of the layer l convolutional network are determined as the input parameters of the layer l + 1 layer convolutional network, and the above steps are repeated until all the layer convolutional networks have completed quantization training and the scaling factor is reduced to zero.

In another embodiment, the quantization unit 102 is specifically configured to:

Obtaining input parameters in a layer l convolutional network, where the input parameters include weights and activation values;

The weighted function and the activation value quantization function are used to quantify and train the input parameters of the first-layer convolutional network in the first branch of the first-layer convolutional network to obtain the first branch in the first-layer convolutional network First output parameter;

Attenuating the original output parameters of the second branch in the first-layer convolutional network by using the calculated scaling factor to obtain the second output parameters of the second branch in the first-layer convolutional network;

Summing the first output parameter and the second output parameter to obtain an output parameter of the first-layer convolution network;

In another embodiment, the quantization unit 102 is specifically configured to:

Obtain the number of training steps when the l-th layer convolutional network is used for quantized training;

The cosine attenuation function is called to calculate the scaling factor corresponding to the training steps in the first layer of the convolutional network for quantized training.

Based on the foregoing description of the embodiment of the deep neural network-based service processing method, an embodiment of the present invention provides a deep neural network-based service processing apparatus. The apparatus may be a computer program running on a network device, and may be applied to In the above-mentioned deep neural network-based service processing method shown in FIG. 7, the method is used to execute corresponding steps in the deep neural network-based service processing method. Referring to FIG. 10, the device may include:

The request receiving unit 201 is configured to receive a service request, where the service request carries a business object to be processed; the service request includes any of the following: an image processing request, a face recognition request, a visual processing request, and a natural language recognition processing request ;

A business processing unit 202 is configured to call a target deep neural network to process the business object to obtain a service processing result; wherein the target deep neural network is obtained by using the network quantization method;

The result output unit 203 is configured to output the service processing result.

Based on the description of the embodiments of the network quantization method and device, and the deep neural network-based service processing method and device, the embodiment of the present invention further provides a network device that can be applied to the above-mentioned FIG. 1, FIG. 5, and FIG. 6 The network quantization method and the deep neural network-based business processing method shown in FIG. 7 are used to perform the corresponding steps in the network quantization method and the deep neural network-based business processing method. Referring to FIG. 11, the internal structure of the network device may include a processor, a network interface, and a computer storage medium. The processor, the communication interface, and the computer storage medium in the network device may be connected through a bus or other methods.

The communication interface is a medium for implementing interaction and information exchange between network equipment and external equipment. A processor (or CPU (Central Processing Unit)) is the computing core and control core of a network device. It is suitable for implementing one or more instructions, and is specifically suitable for loading and executing one or more instructions to achieve the corresponding Method flow or corresponding function; a computer storage medium (Memory) is a memory device in a server and is used to store programs and data. It can be understood that the computer storage medium herein may include both a built-in storage medium of a network device and an extended storage medium supported by the network device. The computer storage medium provides a storage space that stores an operating system of a network device. Moreover, one or more instructions suitable for being loaded and executed by the processor are stored in the storage space, and these instructions may be one or more computer programs (including program code). It should be noted that the computer storage medium here may be a high-speed RAM memory, or a non-volatile memory (non-volatile memory), for example, at least one disk memory; optionally, at least one is located far away from the foregoing processor. Computer storage media.

In one embodiment, the computer storage medium stores one or more first instructions, and the processor loads and executes the one or more first instructions stored in the computer storage medium to implement the above-mentioned FIG. 1, FIG. 5 or FIG. Corresponding steps in the network quantization method flow shown in 6; in specific implementation, one or more first instructions in a computer storage medium are loaded by a processor and executed as follows:

In another embodiment, the acquiring the original deep neural network to be quantified includes:

In still another embodiment, the original deep neural network includes a L-layer convolutional network, where L is a positive integer; any one of the convolutional networks is represented as a l-th layer convolutional network, where l is a positive integer, and 1 ≤ l≤L;

Among them, full precision includes floating-point precision and fixed-point quantization precision.

In still another embodiment, performing quantized training on each layer of the convolutional network and performing attenuation processing on each layer of the convolutional network according to a scaling factor includes:

In another embodiment, obtaining the corresponding scaling factor according to the number of training steps during quantized training according to the first-layer convolutional network includes:

In one embodiment, the computer storage medium stores one or more second instructions, and the processor loads and executes one or more second instructions stored in the computer storage medium to implement the above-mentioned deep neural-based system shown in FIG. 7. Corresponding steps in the network business processing method flow; in specific implementation, one or more second instructions in a computer storage medium are loaded by a processor and executed as follows:

Invoking a target deep neural network to process the business object to obtain a business processing result; wherein the target deep neural network is obtained by using the network quantization method;

Output the business processing result.

Claims

A network quantization method, characterized in that the method includes:

Obtaining an original deep neural network to be quantified, the original deep neural network including a multi-layer convolutional network, each layer of the convolutional network including a first branch and a second branch, the second branch being a full-precision convolution structure;

Perform quantization training on each layer of the convolutional network, and perform attenuation processing on each layer of the convolutional network according to a scaling factor; wherein the scaling factor decreases as the number of training steps of the quantization training increases;

When all layers of the convolutional network have completed quantization training and the scaling factor is reduced to zero, the second branch in each layer of the convolutional network in the original deep neural network is removed, and after quantization is obtained Target deep neural network.
The method according to claim 1, wherein the obtaining the original deep neural network to be quantified comprises:

Obtaining an initial deep neural network, where the initial deep neural network includes a multi-layer convolutional network, each layer of the convolutional network includes a first branch, and the first branch is a fixed-point quantization accuracy convolution structure;

A second branch is set for each layer of the initial deep neural network to obtain the original deep neural network.
The method according to claim 1, wherein the original deep neural network comprises an L-layer convolutional network, where L is a positive integer; any one of the convolutional networks is represented as a l-th layer convolutional network, and l is Positive integer, and 1≤l≤L;

Among them, full precision includes floating point precision.
The method according to claim 3, wherein performing quantized training on each layer of the convolutional network and performing attenuation processing on each layer of the convolutional network according to a scaling factor comprises:

Obtaining input parameters of a layer l convolutional network, where the input parameters include weights and activation values;

A weighted function is used to quantitatively train the input parameters of the first layer of the convolutional network in the first branch in the first layer of the convolutional network to obtain the first output parameters of the first branch in the first layer of the convolutional network. ;

Transmitting the input parameters of the first-level convolutional network to the second branch in the first-level convolutional network for training to obtain the original output parameters of the second branch in the first-level convolutional network;

Obtain a corresponding scaling factor according to the number of training steps during the quantized training of the first-layer convolutional network;

Attenuating the original output parameters of the second branch in the first-layer convolutional network by using the obtained scaling factor to obtain the second output parameters of the second branch in the first-layer convolutional network;

Summing the first output parameter and the second output parameter to obtain an intermediate parameter;

Use the activation value quantization function to perform quantitative training on the intermediate parameters to obtain the output parameters of the first-layer convolutional network;

The output parameters of the layer l convolutional network are determined as the input parameters of the layer l + 1 layer convolutional network, and the above steps are repeated until all the layer convolutional networks have completed quantization training and the scaling factor is reduced to zero.
The method according to claim 3, wherein performing quantized training on each layer of the convolutional network and performing attenuation processing on each layer of the convolutional network according to a scaling factor comprises:

Obtaining input parameters in a layer l convolutional network, where the input parameters include weights and activation values;

The weighted function and the activation value quantization function are used to quantify and train the input parameters of the first-layer convolutional network in the first branch of the first-layer convolutional network to obtain the first branch in the first-layer convolutional network. First output parameter;

Transmitting the input parameters of the first-level convolutional network to the second branch in the first-level convolutional network for training to obtain the original output parameters of the second branch in the first-level convolutional network;

Obtain a corresponding scaling factor according to the number of training steps during the quantized training of the first-layer convolutional network;

Attenuating the original output parameters of the second branch in the first-layer convolutional network by using the calculated scaling factor to obtain the second output parameters of the second branch in the first-layer convolutional network;

Summing the first output parameter and the second output parameter to obtain an output parameter of the first-layer convolution network;

The output parameters of the layer l convolutional network are determined as the input parameters of the layer l + 1 layer convolutional network, and the above steps are repeated until all the layer convolutional networks have completed quantization training and the scaling factor is reduced to zero.
The method according to claim 4 or 5, wherein the acquiring the corresponding scaling factor during the number of training steps during quantized training according to the first-layer convolution network comprises:

Obtain the number of training steps when the l-th layer convolutional network is used for quantized training;

The cosine attenuation function is called to calculate the scaling factor corresponding to the training steps in the first layer of the convolutional network for quantized training.
A business processing method based on a deep neural network, wherein the method includes:

Receiving a service request, the service request carrying a business object to be processed; the service request includes any of the following: an image processing request, a face recognition request, a visual processing request, and a natural language recognition processing request;

Calling a target deep neural network to process the business object to obtain a business processing result; wherein the target deep neural network is obtained by using the network quantization method according to any one of claims 1-6;

Output the business processing result.
A network quantization device, characterized in that the device includes:

An obtaining unit for obtaining an original deep neural network to be quantified; the original deep neural network includes a multi-layer convolutional network, and each layer of the convolutional network includes a first branch and a second branch, and the second branch is a full branch Precision convolution structure;

A quantization unit is configured to perform quantization training on each layer of the convolutional network, and perform attenuation processing on each layer of the convolutional network according to a scaling factor; wherein the scaling factor decreases as the number of training steps of the quantization training increases. small;

A processing unit configured to remove the second branch in each layer of the convolutional network in the original deep neural network when the quantization training is completed in all layers and the scaling factor is reduced to zero Processing to get the quantified target deep neural network.
A service processing device based on a deep neural network is characterized in that it includes:

A request receiving unit, configured to receive a service request, the service request carrying a business object to be processed; the service request includes any of the following: an image processing request, a face recognition request, a visual processing request, and a natural language recognition processing request;

A business processing unit for invoking a target deep neural network to process the business object to obtain a business processing result; wherein the target deep neural network is obtained by using the network quantization method according to any one of claims 1-6;

A result output unit, configured to output the service processing result.
A network device, comprising:

A processor adapted to implement one or more instructions; and

A computer storage medium that stores one or more first instructions, and the one or more first instructions are adapted to be loaded by the processor and executed according to any one of claims 1-6 Network quantization method; or, the computer storage medium stores one or more second instructions, and the one or more second instructions are adapted to be loaded by the processor and execute the deep neural-based method according to claim 7. Network business processing methods.