CN112219208A - Deep neural network quantization method, device, equipment and medium - Google Patents

Deep neural network quantization method, device, equipment and medium

Info

Publication number
CN112219208A
CN112219208A (application CN201980037725.9A)
Authority
CN
China
Prior art keywords
neural network
deep neural
full
training model
floating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980037725.9A
Other languages
Chinese (zh)
Inventor
隋志成
周力
刘默翰
俞清华
赵磊
蒋洪睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN112219208A publication Critical patent/CN112219208A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The application provides a deep neural network quantization method, device, equipment and medium. The method includes: acquiring a full-floating-point deep neural network training model, where the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one piece of data is less than 8 bits; compressing the operation set of the full-floating-point deep neural network training model according to the relationships among the operations in the operation set, to obtain the operation set of a deep neural network inference model corresponding to the full-floating-point deep neural network training model; and sending the deep neural network inference model to a terminal. In this way, the impact on terminal performance, such as the ROM, RAM, CPU, and inference-time performance of the terminal, can be reduced.

Description

Deep neural network quantization method, device, equipment and medium
Technical Field
The present application relates to the field of data processing, and in particular, to a method, an apparatus, a device, and a medium for quantizing a deep neural network.
Background
Making terminals more intelligent is a major trend for the future. In fields such as Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Optical Character Recognition (OCR), Artificial Intelligence (AI) applications are increasing year by year, and different Deep Neural Network (DNN) models are generally used in different application scenarios. As a result, users' requirements on terminal configuration keep rising; however, it is not realistic for a user to replace the terminal just to use an AI application.
Usually, the training phase of the DNN model is executed by a server, and the inference phase of the terminal-side application scenario is executed by the terminal. When the terminal executes the inference phase of the DNN model, the user expects the performance indicators of the terminal to be affected as little as possible, such as Read-Only Memory (ROM) performance, Random Access Memory (RAM) performance, Central Processing Unit (CPU) performance, and inference time. Therefore, how to process the DNN model used in the inference stage so that each performance indicator of the terminal is affected as little as possible, without changing the terminal configuration, becomes a technical problem to be solved urgently in this application. Moreover, without changing the configuration of the user's terminal, this also allows memory usage to be greatly reduced and performance to be improved.
Disclosure of Invention
The application provides a deep neural network quantization method, device, equipment and medium, so that the impact on terminal performance, such as the ROM, CPU, and inference-time performance of the terminal, can be reduced.
In a first aspect, the present application provides a processing method for a deep neural network, including: acquiring a full-floating-point deep neural network training model, where the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one piece of data is less than 8 bits; compressing the operation set of the full-floating-point deep neural network training model according to the relationships among the operations in the operation set, to obtain the operation set of a deep neural network inference model corresponding to the full-floating-point deep neural network training model; and sending the deep neural network inference model to the terminal.
Optionally, compressing the operation set of the full-floating-point deep neural network training model according to a relationship between operations in the operation set of the full-floating-point deep neural network training model includes: and compressing a plurality of operations in the operation set of the full floating point deep neural network training model into one operation according to the relation among the operations in the operation set of the full floating point deep neural network training model.
The network equipment compresses the operation set of the full floating point deep neural network training model to obtain the operation set of the deep neural network inference model, and when the terminal executes the deep neural network inference model, the influence on the performance of the terminal can be reduced, such as the influence on the performances of ROM, CPU, inference time and the like of the terminal.
Optionally, before sending the deep neural network inference model to the terminal, the method further includes: and quantifying parameters involved in the operation set of the deep neural network inference model.
In the embodiment of the application, the network device may quantize parameters related to an operation set of the deep neural network inference model, and when the terminal executes the deep neural network inference model, the influence on the performance of the terminal may be further reduced, for example, the influence on the performance of the terminal, such as ROM, CPU, inference time, and the like, may be reduced.
Optionally, the parameters involved in the operation set of the deep neural network inference model are floating point type parameters. Accordingly, quantifying parameters involved in an operational set of the deep neural network inference model includes: and quantizing parameters corresponding to the operation set of the deep neural network inference model into 1-bit parameters.
Optionally, the deep neural network inference model is used for the server to process operations needing to be processed in the operation set of the full-floating-point deep neural network training model.
The following provides a processing method, a device, equipment and a storage medium of a deep neural network.
In a second aspect, the present application provides a processing method for a deep neural network, including: receiving a deep neural network inference model sent by network equipment, wherein an operation set of the deep neural network inference model is obtained by compressing an operation set of a full-floating-point deep neural network training model, and the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits. And executing a deep neural network inference model.
Optionally, after the executing the deep neural network inference model, further comprising: and acquiring the performance data of the terminal. And sending the performance data of the terminal to the control equipment, wherein the performance data of the terminal is used for determining the operation needing to be processed in the operation set of the full floating point deep neural network training model.
In a third aspect, the present application provides a processing method for a deep neural network, including: the method comprises the steps of receiving performance data of a terminal, wherein the performance data are obtained by the terminal through execution of a deep neural network inference model, an operation set of the deep neural network inference model is obtained by compression of an operation set of a full-floating-point deep neural network training model, and the storage length of at least one parameter of the full-floating-point deep neural network training model is smaller than 8 bits and/or the storage length of at least one data is smaller than 8 bits. And processing the performance data of the terminal to determine the operation needing to be processed in the operation set of the full floating point deep neural network training model. And sending an indication message to the server, wherein the indication message is used for indicating the server to process the operation needing to be processed in the operation set of the full floating point deep neural network training model.
In a fourth aspect, the present application provides a processing method for a deep neural network, including: receiving an indication message; processing, according to the indication message, the operation that needs to be processed in the operation set of the full-floating-point deep neural network training model, where the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one piece of data is less than 8 bits; and sending the processed full-floating-point deep neural network training model to the network device.
In a fifth aspect, the present application provides a processing apparatus for a deep neural network, including:
the acquisition module is used for acquiring a full-floating-point deep neural network training model, wherein the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits.
And the compression module is used for compressing the operation set of the full-floating-point deep neural network training model according to the relation among the operations in the operation set of the full-floating-point deep neural network training model so as to obtain the operation set of the deep neural network inference model corresponding to the full-floating-point deep neural network training model.
And the sending module is used for sending the deep neural network inference model to the terminal.
In a sixth aspect, the present application provides a processing apparatus for a deep neural network, including:
the receiving module is used for receiving the deep neural network inference model sent by the network equipment, the operation set of the deep neural network inference model is obtained by compressing the operation set of the full-floating-point deep neural network training model, the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits.
And the execution module is used for executing the deep neural network inference model.
In a seventh aspect, the present application provides a processing apparatus for a deep neural network, including:
the receiving module is used for receiving performance data of the terminal, the performance data are obtained by the terminal through executing a deep neural network inference model, an operation set of the deep neural network inference model is obtained by compressing an operation set of a full-floating-point deep neural network training model, the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits.
And the processing module is used for processing the performance data of the terminal so as to determine the operation needing to be processed in the operation set of the full floating point deep neural network training model.
And the sending module is used for sending an indication message to the server, wherein the indication message is used for indicating the server to process the operation needing to be processed in the operation set of the full floating point deep neural network training model.
In an eighth aspect, the present application provides a processing apparatus for a deep neural network, including:
and the receiving module is used for receiving the indication message.
And the processing module is used for processing the operation needing to be processed in the operation set of the full floating point deep neural network training model according to the indication message, wherein the storage length of at least one parameter of the full floating point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits.
And the sending module is used for sending the processed full floating point deep neural network training model to the network equipment.
In a ninth aspect, the present application provides a network device, comprising:
a processor.
A memory for storing executable instructions of the processor for causing the processor to perform the method of processing of a deep neural network as set forth in the first aspect or alternatives thereof.
In a tenth aspect, the present application provides a terminal, comprising:
a processor.
A memory for storing executable instructions of the processor for causing the processor to perform the method of processing of a deep neural network as set forth in the second aspect or alternatives of the second aspect.
In an eleventh aspect, the present application provides a control apparatus comprising:
a processor.
A memory for storing executable instructions of the processor for causing the processor to perform the method of processing of a deep neural network as set forth in the third aspect or alternatives thereof.
In a twelfth aspect, the present application provides a server, comprising:
a processor.
A memory for storing executable instructions of the processor to cause the processor to perform the method of processing of a deep neural network as set forth in the fourth aspect or alternatives thereof.
In a thirteenth aspect, the present application provides a storage medium comprising: and executable instructions for implementing the processing method of the deep neural network.
In a fourteenth aspect, the present application provides a computer program product comprising: and executable instructions for implementing the processing method of the deep neural network.
Drawings
Fig. 1 is an interaction flowchart of a processing method of a deep neural network according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating compression of an operation set of a full floating point deep neural network training model according to an embodiment of the present application;
fig. 3 is a schematic diagram of a terminal interface provided in an embodiment of the present application;
fig. 4 is an interaction flowchart of a processing method of a deep neural network according to another embodiment of the present application;
FIG. 5 is an interactive flowchart of a processing method of a deep neural network according to yet another embodiment of the present application;
FIG. 6 is a schematic diagram of a full floating point deep neural network training model before operation processing and a full floating point deep neural network training model after operation processing according to an example of the present application;
FIG. 7 is a schematic diagram of a full floating point deep neural network training model before operation processing and a full floating point deep neural network training model after operation processing according to example two of the present application;
FIG. 8 is a schematic diagram of a full floating point deep neural network training model before operation processing and a full floating point deep neural network training model after operation processing according to example three of the present application;
FIG. 9 is a schematic diagram of a full floating point deep neural network training model before operation processing and a full floating point deep neural network training model after operation processing according to example four of the present application;
fig. 10 is a schematic diagram of a processing device 1000 of a deep neural network according to an embodiment of the present application;
fig. 11 is a schematic diagram of a processing apparatus 1100 for a deep neural network according to an embodiment of the present application;
fig. 12 is a schematic diagram of a processing apparatus 1200 of a deep neural network according to an embodiment of the present application;
fig. 13 is a schematic diagram of a processing apparatus 1300 for a deep neural network according to an embodiment of the present application;
fig. 14 is a schematic diagram of a network device 1400 according to an embodiment of the present application;
fig. 15 is a schematic diagram of a terminal 1500 according to an embodiment of the present application;
fig. 16 is a schematic diagram of a control device 1600 provided in an embodiment of the present application;
fig. 17 is a schematic diagram of a server 1700 according to an embodiment of the present application.
Detailed Description
As described above, as AI applications increase year by year and different DNN models are generally used in different application scenarios, users' requirements on terminal configuration also keep rising; however, it is not realistic for a user to replace the terminal just to use an AI application. Typically, the training phase of the DNN model is performed by the server and the inference phase is performed by the terminal. When the terminal executes the inference stage of the DNN model, the user expects the performance indicators of the terminal, such as ROM performance, RAM performance, CPU performance, and inference time, to be affected as little as possible. Therefore, how to process the DNN model used in the inference stage so that each performance indicator of the terminal is affected as little as possible, without changing the terminal configuration, becomes a technical problem to be solved urgently in this application. Moreover, without changing the configuration of the user's terminal, this also allows memory usage to be greatly reduced and performance to be improved.
In order to solve the technical problem, the present application provides a quantization method, apparatus, device and medium for a deep neural network. Before introducing the technical solution of the present application, the following terms are explained first:
deep neural network training model: it refers to the deep neural network model involved in the training phase.
Deep neural network inference model: it refers to the deep neural network model involved in the inference phase.
Full-precision deep neural network training model: the full-precision deep neural network model involved in the training stage. Its parameters and data are full-precision values; the storage length of each parameter is greater than or equal to 8 bits, and the storage length of each piece of data is also greater than or equal to 8 bits. For example, a parameter of a full-precision deep neural network model may be a value such as 1.4.
Full-floating-point deep neural network training model: a deep neural network training model in which the storage length of at least one parameter is less than 8 bits and/or the storage length of at least one piece of data is less than 8 bits. For example, the parameters in a full-floating-point deep neural network training model may be values such as +1.0 and -1.0. Since a deep neural network training model comprises a plurality of operations, each operation can be understood as a deep neural network layer, or a deep neural network layer can be understood as comprising at least one operation. Each operation involves parameters, input data, and output data; here, "parameter" refers to a parameter involved in an operation, and "data" refers to the input data and/or output data involved in an operation.
Fig. 1 is an interaction flowchart of a processing method of a deep neural network according to an embodiment of the present application. The network elements involved in the method include a network device and a terminal. The network device may be part or all of an intelligent device such as a tablet computer, a notebook computer, or a server (in which a full-precision deep neural network training model is stored), and the terminal may be a mobile phone, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like. As shown in fig. 1, the method comprises the following steps:
step S101: the network equipment acquires a full floating point deep neural network training model.
Step S102: and the network equipment compresses the operation set of the full-floating-point deep neural network training model according to the relation among the operations in the operation set of the full-floating-point deep neural network training model to obtain the operation set of the deep neural network inference model corresponding to the full-floating-point deep neural network training model.
Step S103: and the network equipment sends the deep neural network inference model to the terminal.
Step S101 is explained as follows:
in one possible design: assuming that the network device is not a server storing the full-precision deep neural network training model, and the server quantizes at least one parameter and/or at least one data in the full-precision deep neural network training model so that the storage length of the quantized at least one parameter is less than 8 bits and/or the storage length of the at least one data is less than 8 bits, that is, the server processes the full-precision deep neural network training model to obtain the full-floating-point deep neural network training model. The network device may retrieve the full floating-point deep neural network training model from the server. The network device may also be part of a server that maintains a full-precision deep neural network training model.
In another possible design: assuming that the network device is not a server storing the full-precision deep neural network training model, the network device may obtain the full-precision deep neural network training model from the server, and quantize at least one parameter and/or at least one data in the full-precision deep neural network training model, so that the storage length of the quantized at least one parameter is less than 8 bits and/or the storage length of the at least one data is less than 8 bits, that is, the network device processes the full-precision deep neural network training model to obtain the full-floating-point deep neural network training model. The network device may process the full-precision deep neural network training model by using the prior art, which is not described herein.
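The text above defers to the prior art for this quantization step. Purely as an illustration, the following Python sketch shows one common prior-art approach, sign binarization; it is an assumption for illustration only and not the specific quantization algorithm used by the network device or server here.

```python
def binarize(weights):
    # Replace each full-precision weight with its sign (+1.0 or -1.0);
    # each resulting weight can then be stored in 1 bit instead of 32.
    return [+1.0 if w >= 0 else -1.0 for w in weights]

print(binarize([1.4, -0.2, 0.0, -3.7]))   # [1.0, -1.0, 1.0, -1.0]
```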
Step S102 is explained as follows:
before the network device executes step S102, the full-floating-point deep neural network training model and the full-floating-point deep neural network inference model are the same, and as the name suggests, the operation set of the full-floating-point deep neural network training model and the operation set of the full-floating-point deep neural network inference model are also the same. As described above, it is difficult for the configuration of the terminal to meet the requirements of various AI applications, and therefore, in the present application, the network device will compress the operation set of the full-floating-point deep neural network training model. The network equipment compresses the operation set of the full-floating-point deep neural network training model according to the relation among the operations in the operation set of the full-floating-point deep neural network training model to obtain the operation set of the deep neural network inference model corresponding to the full-floating-point deep neural network training model.
For example: fig. 2 is a schematic diagram of operation set compression for a full-floating-point deep neural network training model according to an embodiment of the present application, where the Batch Normalization layer (Bn), the Scale layer (Scale), the binary activation layer (BinAct), and the binary tanh layer (BinTanh) shown in fig. 2 are four different operations. Bn, Scale, and BinAct are some of the operations of the full-floating-point deep neural network training model before the network device compresses the operation set, and BinTanh is the single operation obtained after the network device compresses Bn, Scale, and BinAct.
Bn denotes y = alpha * (x - mu) / sqrt(delta + epsilon) + beta;
Scale denotes thresh = sqrt(delta + epsilon) * (-beta) + alpha * mu;
BinAct denotes y >= 0.
Where y is the output data of Bn, x is the input data of Bn, and alpha, mu, delta, epsilon, and beta are the parameters involved in Bn.
thresh is the output data of Scale, and delta, epsilon, beta, alpha, and mu are the parameters involved in Scale.
y is also the input data of BinAct.
"*" denotes multiplication, "/" denotes division, ">=" denotes greater than or equal to, and sqrt denotes the square root function.
It should be emphasized that the present application does not limit the meaning of the data and parameters such as alpha, mu, delta, epsilon, beta, Thresh, x, and y, and the like, and the present application focuses on the relationship among Bn, Scale, and BinAct.
From the relationship among Bn, Scale, and BinAct, the network device may combine y = alpha * (x - mu) / sqrt(delta + epsilon) + beta, thresh = sqrt(delta + epsilon) * (-beta) + alpha * mu, and y >= 0 into the single condition alpha * x >= thresh. Accordingly, BinTanh on the right side of fig. 2 denotes alpha * x >= thresh. Based on this, the network device achieves the purpose of compressing the three operations Bn, Scale, and BinAct into one operation, BinTanh.
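To make the fusion concrete, the following minimal Python sketch (not part of the patent; the function names and sample parameter values are illustrative) shows that precomputing thresh lets the three floating-point steps collapse into a single multiply-and-compare:

```python
import math

def bn(x, alpha, mu, delta, epsilon, beta):
    # Bn as given above: y = alpha * (x - mu) / sqrt(delta + epsilon) + beta
    return alpha * (x - mu) / math.sqrt(delta + epsilon) + beta

def bin_act(y):
    # BinAct as given above: output 1 when y >= 0, else 0
    return 1 if y >= 0 else 0

def bin_tanh(x, alpha, mu, delta, epsilon, beta):
    # Fused BinTanh: thresh is computed once from the parameters (the Scale step),
    # so inference reduces to the comparison alpha * x >= thresh
    thresh = math.sqrt(delta + epsilon) * (-beta) + alpha * mu
    return 1 if alpha * x >= thresh else 0

# The fused operation gives the same result as Bn followed by BinAct
for x in (-2.0, -0.1, 0.0, 0.7, 3.5):
    assert bin_act(bn(x, 0.5, 1.0, 0.04, 1e-5, -0.3)) == bin_tanh(x, 0.5, 1.0, 0.04, 1e-5, -0.3)
```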
It should be noted that, in practice, the full-floating-point deep neural network training model may include more than the above three operations, and may also include other operations, as long as a certain relationship exists between the operations, the network device may compress the operations according to the relationship between the operations, so as to obtain the compressed operations.
In addition, by taking the above example of compressing three operations into one operation, in fact, multiple operations in the operation set of the full-floating-point deep neural network training model may be compressed into one operation, for example, two operations are compressed into one operation, or four operations are compressed into one operation, which is not limited in this application.
For example: suppose one operation computes the products x_i * y_i, where i is any integer, and another operation sums these products over all i, that is, computes sum_i x_i * y_i. Based on this, the network device can compress these two operations into one operation, expressed as N - 2 * bitcount(xnor(x_i, y_i)), where N is the number of terms, bitcount is a bit-wise counting function, and xnor is an exclusive-NOR function.
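As a sketch of the bitcount/xnor idea (not the patent's own implementation: it assumes weights taken from {+1, -1} are packed with +1 encoded as bit 1 and -1 as bit 0, so the sign convention of the expression above may differ), the sum of products reduces to one xnor plus one population count:

```python
def binary_dot(x_bits: int, y_bits: int, n: int) -> int:
    # x_bits and y_bits hold n weights from {+1, -1}, packed as bits (+1 -> 1, -1 -> 0)
    mask = (1 << n) - 1
    matches = bin(~(x_bits ^ y_bits) & mask).count("1")   # bitcount(xnor(x, y))
    # sum_i x_i * y_i = (#matching positions) - (#differing positions) = 2 * matches - n
    return 2 * matches - n

# Cross-check against the element-wise sum of products
x = [+1, -1, +1, +1, -1, +1, -1, -1]
y = [+1, +1, -1, +1, -1, -1, -1, +1]
pack = lambda v: sum(1 << i for i, w in enumerate(v) if w == +1)
assert binary_dot(pack(x), pack(y), len(x)) == sum(a * b for a, b in zip(x, y))
```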
Step S103 is explained as follows:
and the network equipment sends the deep neural network inference model to the terminal. Wherein the set of operations in the deep neural network inference model comprises compressed operations. And after the terminal acquires the deep neural network inference model, executing the deep neural network inference model, and if the deep neural network inference model is about AI application, the terminal can obtain AI data by executing the model. On the other hand, the terminal may also obtain its own performance data, such as: ROM performance data, RAM performance data, CPU performance data. For example: fig. 3 is a schematic diagram of a terminal interface according to an embodiment of the present Application, and as shown in fig. 3, after a user clicks an Application (APP) on a desktop of the terminal, the terminal displays an interface including an occupancy rate of a ROM.
In summary, the present application provides a processing method of a deep neural network, the method including: the network equipment acquires a full floating point deep neural network training model. And the network equipment compresses the operation set of the full-floating-point deep neural network training model according to the relation among the operations in the operation set of the full-floating-point deep neural network training model to obtain the operation set of the deep neural network inference model corresponding to the full-floating-point deep neural network training model. And the network equipment sends the deep neural network inference model to the terminal. The network equipment compresses the operation set of the full floating point deep neural network training model to obtain the operation set of the deep neural network inference model, and when the terminal executes the deep neural network inference model, the influence on the performance of the terminal can be reduced, such as the influence on the performances of ROM, CPU, inference time and the like of the terminal.
The compression of the operation set of the deep neural network inference model has been introduced above as a way to reduce the impact on terminal performance. In addition, the present application also considers quantizing the parameters involved in the operation set of the deep neural network inference model, to further reduce the impact on terminal performance. Fig. 4 is an interaction flowchart of a processing method of a deep neural network according to another embodiment of the present application; the network elements involved in the method include a network device and a terminal. As shown in fig. 4, before step S103, the method further includes the following step:
step S104: the network device quantifies parameters involved in an operational set of the deep neural network inference model.
Example one: assume that the parameters involved in the operation set include the weights of a certain operation, and that this operation involves 32 weights, which are, for example, respectively:
+1-1+1+1-1+1-1-1+1+1+1-1-1+1+1+1+1-1+1+1-1+1-1-1+1+1+1-1-1+1+1+1。
Assuming that each weight is stored as a float, the 32 weights occupy 32 x 4 bytes before the network device quantizes them. The network device can quantize each weight to 1 bit, for example quantizing +1 to 1 and -1 to 0, so that the 32 weights are quantized to 10110100111001111011010011100111. In this way, the storage length of the 32 weights is compressed by a factor of 32.
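A minimal Python sketch of this bit-packing (illustrative only; the actual on-device storage format is not specified in the text) reproduces the numbers in example one:

```python
weights = [+1, -1, +1, +1, -1, +1, -1, -1,
           +1, +1, +1, -1, -1, +1, +1, +1,
           +1, -1, +1, +1, -1, +1, -1, -1,
           +1, +1, +1, -1, -1, +1, +1, +1]            # 32 weights: 32 * 4 = 128 bytes as float32

bit_string = "".join("1" if w == +1 else "0" for w in weights)   # +1 -> 1, -1 -> 0
packed = int(bit_string, 2)                                      # 32 bits = 4 bytes, a 32x reduction
print(bit_string)                                                # 10110100111001111011010011100111
```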
Example two: assume that the parameters involved in the operation set include a floating-point parameter a, where a takes the values 0, 1, and 2, and each of these values is stored in 4 bytes. In this case, the network device may quantize the parameter a by representing it with 2 bits, for example 00 for 0, 01 for 1, and 10 for 2. This is equivalent to quantizing a 4-byte parameter into 2 bits, compressing the storage length of a by a factor of 16.
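A comparable sketch for example two (the three values and the 2-bit codes are taken from the example; the decoding table is an illustrative addition):

```python
codebook = {0.0: 0b00, 1.0: 0b01, 2.0: 0b10}        # each 4-byte float value gets a 2-bit code
decode = {code: value for value, code in codebook.items()}

values = [2.0, 0.0, 1.0, 2.0]                       # parameter a, originally 4 bytes per value
codes = [codebook[v] for v in values]               # 2 bits per value: a 16x reduction
assert [decode[c] for c in codes] == values         # exact here, since a only takes three values
```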
In summary, in the embodiment of the present application, the network device may quantize parameters related to the operation set of the deep neural network inference model, and when the terminal executes the deep neural network inference model, the impact on the performance of the terminal may be further reduced, for example, the impact on the performance of the terminal, such as ROM, CPU, inference time, and the like, may be reduced.
Based on any of the above embodiments, further, after receiving the deep neural network inference model, the terminal executes the model, acquires performance data, and sends the performance data to the control device; the control device determines the operations that need to be processed in the operation set of the full-floating-point deep neural network training model and sends an indication message to the server, so that the server processes those operations according to the indication message. Specifically, based on the second embodiment, the processing method of the deep neural network is further described below, assuming that the network device is not the server storing the full-precision deep neural network training model. Fig. 5 is an interaction flowchart of a processing method of a deep neural network provided by yet another embodiment of the present application; the network elements involved in the method include a network device, a terminal, a control device, and a server. As shown in fig. 5, after step S103, the method further includes the following steps:
step S105: and the terminal executes the deep neural network inference model.
Step S106: the terminal obtains performance data of the terminal.
Step S107: the terminal sends the performance data of the terminal to the control device.
Step S108: and the control equipment processes the performance data of the terminal to determine the operation needing to be processed in the operation set of the full floating point deep neural network training model.
Step S109: the control device sends an indication message to the server.
Step S110: and the server processes the operation needing to be processed in the operation set of the full floating point deep neural network training model according to the indication message.
The following will be explained with respect to step S105 to step S107:
the terminal executes the deep neural network inference model to obtain AI data, and the terminal also obtains performance data of the terminal, such as: ROM performance data, RAM performance data, CPU performance data, inference time, etc. The terminal sends the performance data of the terminal to the control device.
The following is explained with respect to step S108 to step S110:
in one possible design: the control device stores the corresponding relationship between the performance data of the terminal and the operations to be processed (including deletion, update, and the like) in the operation set of the full floating point deep neural network training model, which is specifically shown in table 1:
TABLE 1
[Table 1 is provided as an image (PCTCN2019074381-APPB-000001) in the original publication and is not reproduced here.]
Table 1 only illustrates an exemplary correspondence between the ROM performance data of the terminal and the operations that need to be processed (including deletion, update, and the like) in the operation set of the full-floating-point deep neural network training model, and in fact, the control device may obtain a correspondence between at least one of the ROM performance data, the RAM performance data, the CPU performance data, and the inference time and the operations that need to be processed in the operation set of the full-floating-point deep neural network training model, and determine the operations that need to be processed in the operation set of the full-floating-point deep neural network training model according to the correspondence.
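Since Table 1 itself is not reproduced, the following Python sketch only illustrates the general shape of such a correspondence; the thresholds and field names are hypothetical, and the operation names are borrowed from the examples described below:

```python
from typing import Dict, List

def operations_to_process(perf: Dict[str, float]) -> List[str]:
    # Hypothetical stand-in for Table 1: map terminal performance data to the
    # operations in the training model's operation set that should be updated or deleted.
    actions = []
    if perf.get("rom_usage_percent", 0.0) > 80.0:
        actions.append("update BinAct -> ReLU")
    if perf.get("inference_time_ms", 0.0) > 50.0:
        actions.append("delete Pool and merge it into BinConv with stride")
    return actions

print(operations_to_process({"rom_usage_percent": 85.0, "inference_time_ms": 30.0}))
# ['update BinAct -> ReLU']
```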
Optionally, the indication message is used to instruct the server to process an operation that needs to be processed in the operation set of the full-floating-point deep neural network training model. And after receiving the indication message, the server updates or deletes the operation needing to be processed in the operation set of the full floating point deep neural network training model according to the indication message.
The following describes, by way of example, processing performed by the control device on operations that need processing in an operation set of the full-floating-point deep neural network training model:
example one: fig. 6 is a schematic diagram of a full floating-point deep neural network training model before operation processing and a full floating-point deep neural network training model after operation processing, which are provided in an example of the present application, where BN-Scale-BinAct operation sets in the training models are shown in fig. 2, and a network device converts them into BinTanh operations, and experiments prove that a storage space of a ROM can be saved and inference time performance can be improved. However, when the terminal acquires the performance data and sends the performance data to the control device, the control device determines the operation needing to be processed in the operation set of the full floating point deep neural network training model according to the performance data and sends an indication message to the server. And the server processes the operation needing to be processed in the operation set of the full floating point deep neural network training model according to the indication message. Through the feedback mechanism, it is determined that if BinAct operation in the full-floating-point deep neural network training model shown in FIG. 6 is updated to ReLU, both performance indexes of ROM and inference time can be within an acceptable range, and the precision of the full-floating-point deep neural network training model can be improved by 10%. Based on the method, the combined optimization of the precision and the terminal performance of the full-floating-point deep neural network training model can be achieved through the feedback mechanism.
Example two: fig. 7 is a schematic diagram of a full-floating-point deep neural network training model before operation processing and after operation processing provided in example two of the present application. As shown in fig. 7, even after the operation set of BinConv, Pool, and BinAct is compressed and the parameters are quantized, the terminal's performance still drops noticeably. When the terminal acquires its performance data and sends it to the control device, the control device determines, according to the performance data, the operation that needs to be processed in the operation set of the full-floating-point deep neural network training model and sends an indication message to the server. The server processes that operation according to the indication message. Through this feedback mechanism, it is determined that if the Pool operation is deleted and merged into a BinConv with stride, the terminal's inference time is shortened by 10% without any loss of precision of the full-floating-point deep neural network training model.
Example three: fig. 8 is a schematic diagram of a full floating point deep neural network training model before operation processing and a full floating point deep neural network training model after operation processing provided in example three of the present application, and as shown in fig. 8, when a terminal acquires performance data thereof and sends the performance data to a control device, the control device determines, according to the performance data, an operation that needs to be processed in an operation set of the full floating point deep neural network training model, and sends an instruction message to a server. And the server processes the operation needing to be processed in the operation set of the full floating point deep neural network training model according to the indication message. After the BinConv in BinConv and BinAct is updated to Conv through the feedback mechanism, although the ROM performance of the terminal is reduced, the precision of the full-floating-point deep neural network training model is obviously improved, and in this case, when the ROM performance reduction is within the acceptable range of the user, the BinConv is allowed to be updated to Conv.
Example four: fig. 9 is a schematic diagram of a full-floating-point deep neural network training model before operation processing and after operation processing provided in example four of the present application. As shown in fig. 9, when the terminal acquires its performance data and sends it to the control device, the control device determines, according to the performance data, the operation that needs to be processed in the operation set of the full-floating-point deep neural network training model and sends an indication message to the server. The server processes that operation according to the indication message. Through this feedback mechanism, in the combination of the biased binary convolution layer (BinConv with Bias), Pool, and ReLU, the ReLU operation is updated to BinAct. Specifically, because of the ReLU, the bias-addition part of BinConv with Bias cannot be optimized and can only be executed as a full-precision Conv; once the ReLU operation is changed to BinAct, the precision of the full-floating-point deep neural network training model decreases, but the inference performance and the ROM performance improve. In this case, when the precision loss of the full-floating-point deep neural network training model is within the user's acceptable range, the ReLU is allowed to be updated to BinAct.
In summary, in the example of the present application, when the terminal acquires the performance data thereof and sends the performance data to the control device, the control device determines, according to the performance data, an operation that needs to be processed in the operation set of the full-floating-point deep neural network training model, and sends an instruction message to the server. And the server processes the operation needing to be processed in the operation set of the full floating point deep neural network training model according to the indication message. The combined optimization or the balanced optimization of the precision of the full-floating-point deep neural network training model and the terminal performance can be achieved through the feedback mechanism.
Fig. 10 is a schematic diagram of a processing apparatus 1000 for a deep neural network according to an embodiment of the present application, as shown in fig. 10, the apparatus may be part or all of a network device such as a computer, a notebook computer, a server (in which a full-precision deep neural network training model is stored), and the apparatus includes:
the obtaining module 1001 is configured to obtain a full-floating-point deep neural network training model, where a storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or a storage length of at least one data is less than 8 bits.
The compressing module 1002 is configured to compress the operation set of the full-floating-point deep neural network training model according to a relationship between operations in the operation set of the full-floating-point deep neural network training model, so as to obtain an operation set of a deep neural network inference model corresponding to the full-floating-point deep neural network training model.
A sending module 1003, configured to send the deep neural network inference model to the terminal.
In one possible design, the compression module 1002 is specifically configured to: and compressing a plurality of operations in the operation set of the full floating point deep neural network training model into one operation according to the relation among the operations in the operation set of the full floating point deep neural network training model.
In one possible design, the apparatus further comprises: a quantization module 1004 for quantizing parameters involved in the set of operations of the deep neural network inference model before sending the deep neural network inference model to the terminal.
In one possible design, the parameters involved in the set of operations of the deep neural network inference model are floating point type parameters.
Accordingly, the quantization module 1004 is specifically configured to: and quantizing parameters corresponding to the operation set of the deep neural network inference model into 1-bit parameters.
In one possible design, the deep neural network inference model is used for processing operations needing to be processed in the operation set of the full floating point deep neural network training model by the server.
The processing apparatus of the deep neural network provided in the embodiment of the present application may be configured to execute the processing method of the deep neural network executed by the network device side, and the content and the effect of the processing apparatus of the deep neural network may refer to the method section, which is not described again.
Fig. 11 is a schematic diagram of a processing apparatus 1100 for a deep neural network according to an embodiment of the present disclosure, as shown in fig. 11, the apparatus may be part or all of a terminal such as a mobile phone, a tablet computer, and the apparatus includes:
the receiving module 1101 is configured to receive a deep neural network inference model sent by a network device, where an operation set of the deep neural network inference model of a terminal device is obtained by compressing an operation set of a full-floating-point deep neural network training model, and a storage length of at least one parameter of the full-floating-point deep neural network training model of the terminal device is less than 8 bits and/or a storage length of at least one data is less than 8 bits.
And the execution module 1102 is used for executing the deep neural network inference model of the terminal equipment.
In one possible design, the apparatus further includes:
an obtaining module 1103, configured to obtain performance data of the terminal after executing the deep neural network inference model of the terminal device.
A sending module 1104, configured to send performance data of the terminal device terminal to the control device, where the performance data of the terminal device terminal is used to determine an operation to be processed in an operation set of the terminal device full-floating-point deep neural network training model.
The processing apparatus of the deep neural network provided in the embodiment of the present application may be configured to execute the processing method of the deep neural network executed by the terminal side, and the content and the effect of the processing apparatus of the deep neural network may refer to the method part, which is not described again.
Fig. 12 is a schematic diagram of a processing apparatus 1200 of a deep neural network according to an embodiment of the present application, and as shown in fig. 12, the apparatus may be part or all of a control device, and the apparatus includes:
the receiving module 1201 is configured to receive performance data of a terminal, where the performance data of the terminal is obtained by the terminal executing a deep neural network inference model, an operation set of the deep neural network inference model of the terminal is obtained by compressing the operation set of the full-floating-point deep neural network training model, and a storage length of at least one parameter of the full-floating-point deep neural network training model of the terminal is less than 8 bits and/or a storage length of at least one data is less than 8 bits.
The processing module 1202 is configured to process performance data of the terminal device terminal to determine an operation to be processed in an operation set of the terminal device full-floating-point deep neural network training model.
A sending module 1203, configured to send an instruction message to the server, where the terminal device instruction message is used to instruct the terminal device server to process an operation that needs to be processed in an operation set of the terminal device full-floating-point deep neural network training model.
The processing apparatus of the deep neural network provided in the embodiment of the present application may be configured to execute the processing method of the deep neural network executed by the control device side, and the content and the effect of the processing apparatus of the deep neural network may refer to the method section, which is not described again.
Fig. 13 is a schematic diagram of a processing apparatus 1300 of a deep neural network according to an embodiment of the present application, as shown in fig. 13, the apparatus may be part or all of a server, and the apparatus includes:
a receiving module 1301, configured to receive an indication message.
The processing module 1302 is configured to process, according to the indication message, the operation that needs to be processed in the operation set of the full-floating-point deep neural network training model, where the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one piece of data is less than 8 bits.
And a sending module 1303, configured to send the processed full floating point deep neural network training model to a network device.
The processing apparatus of the deep neural network provided in the embodiment of the present application may be used to execute the processing method of the deep neural network executed by the server side, and the content and effect of the processing apparatus of the deep neural network may refer to the method part, which is not described herein again.
Fig. 14 is a schematic diagram of a network device 1400 according to an embodiment of the present application, and as shown in fig. 14, the network device includes: a memory 1401, a processor 1402 and a transceiver 1403, wherein the memory 1401 is used for storing computer instructions to make the processor 1402 execute the instructions to realize the processing method of the deep neural network.
The processor 1402 is configured to: and acquiring a full-floating-point deep neural network training model, wherein the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits. And compressing the operation set of the full-floating-point deep neural network training model according to the relation among the operations in the operation set of the full-floating-point deep neural network training model to obtain the operation set of the deep neural network inference model corresponding to the full-floating-point deep neural network training model.
The transceiver 1403 is used for transmitting the deep neural network inference model to the terminal.
Optionally, the processor 1402 is specifically configured to: and compressing a plurality of operations in the operation set of the full floating point deep neural network training model into one operation according to the relation among the operations in the operation set of the full floating point deep neural network training model.
Optionally, the processor 1402 is further configured to: and quantifying parameters involved in the operation set of the deep neural network inference model.
Optionally, the parameters involved in the operation set of the deep neural network inference model are floating point type parameters. Accordingly, the processor 1402 is specifically configured to: and quantizing parameters corresponding to the operation set of the deep neural network inference model into 1-bit parameters.
Optionally, the deep neural network inference model is used for the server to process operations needing to be processed in the operation set of the full-floating-point deep neural network training model.
The network device provided in the embodiment of the present application may be configured to execute the processing method of the deep neural network executed by the network device side, and the content and the effect of the processing method may refer to the method part, which is not described herein again.
Fig. 15 is a schematic diagram of a terminal 1500 according to an embodiment of the present application, and as shown in fig. 15, the terminal includes: a memory 1501, a processor 1502, and a transceiver 1503, wherein the memory 1501 is used for storing computer instructions to cause the processor 1502 to execute the instructions to implement a processing method of a deep neural network.
The transceiver 1503 is configured to receive a deep neural network inference model sent by a network device, where an operation set of the deep neural network inference model is obtained by compressing an operation set of a full-floating-point deep neural network training model, and a storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or a storage length of at least one data of the full-floating-point deep neural network training model is less than 8 bits.
The processor 1502 is configured to execute a deep neural network inference model.
Optionally, the processor 1502 is further configured to obtain performance data of the terminal. The transceiver 1503 is further configured to send the performance data of the terminal to the control device, where the performance data of the terminal is used to determine operations that need to be processed in the operation set of the full-floating-point deep neural network training model.
The terminal provided in the embodiment of the present application may be configured to execute the processing method of the deep neural network executed by the terminal side, and the content and the effect of the processing method may refer to the method part, which is not described again.
Fig. 16 is a schematic diagram of a control device 1600 according to an embodiment of the present application, and as shown in fig. 16, the control device includes: a memory 1601, a processor 1602, and a transceiver 1603, wherein the memory 1601 is used to store computer instructions to cause the processor 1602 to execute the instructions to implement a processing method of a deep neural network.
The transceiver 1603 is configured to receive performance data of the terminal, where the performance data is obtained by the terminal by executing a deep neural network inference model, an operation set of the deep neural network inference model is obtained by compressing an operation set of a full-floating-point deep neural network training model, and a storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or a storage length of at least one data is less than 8 bits.
The processor 1602 is configured to process the performance data of the terminal to determine operations that need to be processed in the operation set of the full-floating-point deep neural network training model.
The transceiver 1603 is further configured to send an indication message to the server, where the indication message is used to instruct the server to process an operation which needs to be processed in the operation set of the full-floating-point deep neural network training model.
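One way the control device could turn the terminal's performance data into an indication message is sketched below. The latency budget, the per-operation cost estimates, and the message fields are illustrative assumptions; the application does not prescribe a particular decision rule.

```python
def build_indication(perf: dict, op_costs_ms: dict, latency_budget_ms: float) -> dict:
    """Select operations to offload to the server when the terminal is too slow.

    perf:        performance data reported by the terminal
    op_costs_ms: estimated per-operation latency on the terminal, in milliseconds
    Returns an indication message naming the operations of the training
    model's operation set that the server should process.
    """
    if perf["inference_latency_ms"] <= latency_budget_ms:
        return {"offload_ops": []}        # budget met, nothing to offload
    overshoot = perf["inference_latency_ms"] - latency_budget_ms
    offload, saved = [], 0.0
    # Offload the most expensive operations first until the overshoot is covered.
    for op, cost in sorted(op_costs_ms.items(), key=lambda kv: kv[1], reverse=True):
        if saved >= overshoot:
            break
        offload.append(op)
        saved += cost
    return {"offload_ops": offload}
```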
The control device provided in this embodiment of the present application may be configured to execute the processing method of the deep neural network performed on the control device side; for the content and effects of the method, refer to the method part, and details are not repeated here.
Fig. 17 is a schematic diagram of a server 1700 according to an embodiment of the present application. As shown in Fig. 17, the server includes a memory 1701, a processor 1702, and a transceiver 1703, where the memory 1701 stores computer instructions that the processor 1702 executes to implement a processing method of a deep neural network.
The transceiver 1703 is configured to receive an indication message.
The processor 1702 is configured to process operations that need to be processed in an operation set of a full floating-point deep neural network training model according to an indication message, a storage length of at least one parameter of the full floating-point deep neural network training model being less than 8 bits and/or a storage length of at least one data being less than 8 bits.
The transceiver 1703 is further configured to transmit the processed full floating point deep neural network training model to a network device.
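On the server side, a heavily simplified sketch of acting on the indication message could look as follows. The dictionary model representation and the process_op callback stand in for whatever server-side processing is actually applied; they are assumptions made only to show the control flow.

```python
def handle_indication(indication: dict, training_model: dict, process_op) -> dict:
    """Process the operations named in the indication message.

    training_model: assumed to map operation names to their parameters
    process_op:     callable performing the server-side processing of one
                    operation; its behaviour is deliberately left abstract here
    """
    for op_name in indication.get("offload_ops", []):
        training_model[op_name] = process_op(training_model[op_name])
    return training_model  # the processed model is then sent to the network device
```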
The server provided in this embodiment of the present application may be configured to execute the processing method of the deep neural network performed on the server side; for the content and effects of the method, refer to the method part, and details are not repeated here.
The present application further provides a computer storage medium. The computer storage medium includes computer instructions used to implement the processing method of the deep neural network; for the content and effects, refer to the method part, and details are not repeated here.
The present application further provides a computer program product. The computer program product includes computer instructions used to implement the processing method of the deep neural network; for the content and effects, refer to the method part, and details are not repeated here.

Claims (23)

  1. A processing method of a deep neural network is characterized by comprising the following steps:
    acquiring a full-floating-point deep neural network training model, wherein the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits;
    compressing the operation set of the full-floating-point deep neural network training model according to the relation among the operations in the operation set of the full-floating-point deep neural network training model to obtain the operation set of the deep neural network inference model corresponding to the full-floating-point deep neural network training model;
    and sending the deep neural network inference model to a terminal.
  2. The method of claim 1, wherein compressing the set of operations of the full-floating-point deep neural network training model according to relationships between operations in the set of operations of the full-floating-point deep neural network training model comprises:
    and compressing a plurality of operations in the operation set of the full floating point deep neural network training model into one operation according to the relation among the operations in the operation set of the full floating point deep neural network training model.
  3. The method according to claim 1 or 2, further comprising, before sending the deep neural network inference model to a terminal:
    quantifying parameters involved in the set of operations of the deep neural network inference model.
  4. The method according to claim 3, wherein the parameters involved in the operation set of the deep neural network inference model are floating point type parameters;
    correspondingly, the quantifying parameters involved in the operation set of the deep neural network inference model comprises:
    and quantizing parameters corresponding to the operation set of the deep neural network inference model into 1-bit parameters.
  5. The method according to any one of claims 1 to 4, wherein the deep neural network inference model is used for a server to process operations which need to be processed in the operation set of the full-floating-point deep neural network training model.
  6. A processing method of a deep neural network is characterized by comprising the following steps:
    receiving a deep neural network inference model sent by network equipment, wherein an operation set of the deep neural network inference model is obtained by compressing an operation set of a full-floating-point deep neural network training model, and the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits;
    and executing the deep neural network inference model.
  7. The method of claim 6, further comprising, after said executing the deep neural network inference model:
    acquiring performance data of a terminal;
    and sending the performance data of the terminal to control equipment, wherein the performance data of the terminal is used for determining the operation needing to be processed in the operation set of the full floating point deep neural network training model.
  8. A processing method of a deep neural network is characterized by comprising the following steps:
    receiving performance data of a terminal, wherein the performance data is obtained by executing a deep neural network inference model by the terminal, an operation set of the deep neural network inference model is obtained by compressing an operation set of a full-floating-point deep neural network training model, and the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits;
    processing the performance data of the terminal to determine the operation needing to be processed in the operation set of the full floating point deep neural network training model;
    and sending an indication message to a server, wherein the indication message is used for indicating the server to process the operation needing to be processed in the operation set of the full floating point deep neural network training model.
  9. A processing method of a deep neural network is characterized by comprising the following steps:
    receiving an indication message;
    processing operations needing to be processed in an operation set of the full floating point deep neural network training model according to the indication message, wherein the storage length of at least one parameter of the full floating point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits;
    and sending the processed full floating point deep neural network training model to the network equipment.
  10. A processing apparatus for a deep neural network, comprising:
    the acquisition module is used for acquiring a full-floating-point deep neural network training model, wherein the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits;
    the compression module is used for compressing the operation set of the full-floating-point deep neural network training model according to the relation among the operations in the operation set of the full-floating-point deep neural network training model so as to obtain the operation set of the deep neural network inference model corresponding to the full-floating-point deep neural network training model;
    and the sending module is used for sending the deep neural network inference model to a terminal.
  11. The apparatus of claim 10, wherein the compression module is specifically configured to:
    and compressing a plurality of operations in the operation set of the full floating point deep neural network training model into one operation according to the relation among the operations in the operation set of the full floating point deep neural network training model.
  12. The apparatus of claim 10 or 11, further comprising:
    and the quantization module is used for quantizing parameters related to the operation set of the deep neural network inference model before sending the deep neural network inference model to the terminal.
  13. The apparatus according to claim 12, wherein the parameters involved in the operation set of the deep neural network inference model are floating point type parameters;
    correspondingly, the quantization module is specifically configured to:
    and quantizing parameters corresponding to the operation set of the deep neural network inference model into 1-bit parameters.
  14. The apparatus according to any one of claims 10-13, wherein the deep neural network inference model is used for a server to process operations which need to be processed in the operation set of the full-floating-point deep neural network training model.
  15. A processing apparatus for a deep neural network, comprising:
    the deep neural network inference model comprises a receiving module, a processing module and a processing module, wherein the receiving module is used for receiving a deep neural network inference model sent by network equipment, an operation set of the deep neural network inference model is obtained by compressing an operation set of a full-floating-point deep neural network training model, and the storage length of at least one parameter of the full-floating-point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits;
    and the execution module is used for executing the deep neural network inference model.
  16. The apparatus of claim 15, further comprising:
    the acquisition module is used for acquiring the performance data of the terminal after the deep neural network reasoning model is executed;
    and the sending module is used for sending the performance data of the terminal to the control equipment, and the performance data of the terminal is used for determining the operation needing to be processed in the operation set of the full floating point deep neural network training model.
  17. A processing apparatus for a deep neural network, comprising:
    the system comprises a receiving module, a processing module and a processing module, wherein the receiving module is used for receiving performance data of a terminal, the performance data is obtained by the terminal through executing a deep neural network inference model, an operation set of the deep neural network inference model is obtained by compressing an operation set of a full floating point deep neural network training model, and the storage length of at least one parameter of the full floating point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits;
    the processing module is used for processing the performance data of the terminal so as to determine the operation needing to be processed in the operation set of the full floating point deep neural network training model;
    and the sending module is used for sending an indication message to a server, wherein the indication message is used for indicating the server to process the operation needing to be processed in the operation set of the full floating point deep neural network training model.
  18. A processing apparatus for a deep neural network, comprising:
    a receiving module, configured to receive an indication message;
    the processing module is used for processing the operation needing to be processed in the operation set of the full floating point deep neural network training model according to the indication message, wherein the storage length of at least one parameter of the full floating point deep neural network training model is less than 8 bits and/or the storage length of at least one data is less than 8 bits;
    and the sending module is used for sending the processed full floating point deep neural network training model to the network equipment.
  19. A network device, comprising:
    a processor;
    a memory for storing executable instructions of the processor to cause the processor to perform the processing method of the deep neural network of any one of claims 1-5.
  20. A terminal, comprising:
    a processor;
    a memory for storing executable instructions of the processor to cause the processor to perform the processing method of the deep neural network of claim 6 or 7.
  21. A control apparatus, characterized by comprising:
    a processor;
    a memory for storing executable instructions of the processor to cause the processor to perform the processing method of the deep neural network of claim 8.
  22. A server, comprising:
    a processor;
    a memory for storing executable instructions of the processor to cause the processor to perform the processing method of the deep neural network of claim 9.
  23. A storage medium, comprising: executable instructions for implementing a method of processing a deep neural network as claimed in any one of claims 1 to 9.
CN201980037725.9A 2019-02-01 2019-02-01 Deep neural network quantization method, device, equipment and medium Pending CN112219208A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/074381 WO2020155091A1 (en) 2019-02-01 2019-02-01 Deep neural network quantization method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
CN112219208A true CN112219208A (en) 2021-01-12

Family

ID=71841517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980037725.9A Pending CN112219208A (en) 2019-02-01 2019-02-01 Deep neural network quantization method, device, equipment and medium

Country Status (2)

Country Link
CN (1) CN112219208A (en)
WO (1) WO2020155091A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10762426B2 (en) * 2016-08-12 2020-09-01 Beijing Deephi Intelligent Technology Co., Ltd. Multi-iteration compression for deep neural networks
CN108734266A (en) * 2017-04-21 2018-11-02 展讯通信(上海)有限公司 Compression method and device, terminal, the storage medium of deep neural network model
CN107368857A (en) * 2017-07-24 2017-11-21 深圳市图芯智能科技有限公司 Image object detection method, system and model treatment method, equipment, terminal
CN107766939A (en) * 2017-11-07 2018-03-06 维沃移动通信有限公司 A kind of data processing method, device and mobile terminal
CN109165720A (en) * 2018-09-05 2019-01-08 深圳灵图慧视科技有限公司 Neural network model compression method, device and computer equipment

Also Published As

Publication number Publication date
WO2020155091A1 (en) 2020-08-06

Similar Documents

Publication Publication Date Title
CN111667054B (en) Method, device, electronic equipment and storage medium for generating neural network model
CN109840589B (en) Method and device for operating convolutional neural network on FPGA
CN110929865B (en) Network quantification method, service processing method and related product
US20190236453A1 (en) Method and system for data transmission, and electronic device
CN111401550A (en) Neural network model quantification method and device and electronic equipment
CN112434188B (en) Data integration method, device and storage medium of heterogeneous database
CN112686382B (en) Convolution model lightweight method and system
WO2022168604A1 (en) Softmax function approximation calculation device, approximation calculation method, and approximation calculation program
CN112347246B (en) Self-adaptive document clustering method and system based on spectrum decomposition
CN114529741A (en) Picture duplicate removal method and device and electronic equipment
CN114418086B (en) Method and device for compressing neural network model
CN112219208A (en) Deep neural network quantization method, device, equipment and medium
CN106557178B (en) Method and device for updating entries of input method
CN115936092A (en) Neural network model quantization method and device, storage medium and electronic device
US8577861B2 (en) Apparatus and method for searching information
CN110276448B (en) Model compression method and device
CN113902928A (en) Image feature extraction method and device and electronic equipment
EP3683733A1 (en) A method, an apparatus and a computer program product for neural networks
US20230042275A1 (en) Network quantization method and network quantization device
CN111782812A (en) K-Means text clustering method and device and terminal equipment
US20220188077A1 (en) Arithmetic processing device, arithmetic processing method, and storage medium
CN116611495B (en) Compression method, training method, processing method and device of deep learning model
US20230009941A1 (en) Method of processing data for target model, electronic device, and storage medium
CN110727442B (en) Data storage optimization method and system for embedded platform
CN111680051B (en) Data serialization and deserialization method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination