CN114698395A - Quantization method and device of neural network model, and data processing method and device


Info

Publication number
CN114698395A
Authority
CN
China
Prior art keywords
operator
data
input data
training
quantized
Prior art date
Legal status
Pending
Application number
CN202080016479.1A
Other languages
Chinese (zh)
Inventor
昌晶
连朔
孙方轩
王晨曦
周君
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN114698395A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A neural network model quantization method and device, and a data processing method and device, belonging to the field of artificial intelligence. The original neural network model includes a first operator, a second operator and a first operation module, where the first operation module is used to operate on the output of the first operator and the output of the second operator. The neural network model quantization method includes the following steps: determining a data quantization parameter according to the range of first training input data of the first operator and the range of second training input data of the second operator; and determining a quantized neural network model, in which the data quantization parameter is used to quantize the first input data of the quantized first operator and the second input data of the quantized second operator respectively. The processing result of the quantized first operator and the processing result of the quantized second operator can then be operated on directly, so that both the data processing precision and the data processing efficiency of the neural network model are improved.

Description

Quantization method and device of neural network model, and data processing method and device
Technical Field
The present application relates to the field of artificial intelligence, and more particularly, to a method and an apparatus for quantizing a neural network model, and a method and an apparatus for processing data.
Background
Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, AI basic theory, and so on.
Neural network models are widely applied. Performing model quantization on the operators in a neural network model, that is, quantizing the parameters of the operators and quantizing their input data, converts floating-point operations into fixed-point operations and yields benefits in several respects, such as model size, inference speed and power consumption. Determining an operator's quantization parameters according to the data range of that operator can improve the precision of the quantized operator's data processing results. However, before subsequent operation processing can be performed on the data processing results of multiple quantized operators, inverse quantization needs to be performed on those results, which leads to poor overall processing performance.
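For illustration only (a sketch under assumed conventions, not part of the application), the following Python snippet shows 8-bit symmetric quantization and inverse quantization of a floating-point tensor, where the scale plays the role of a data quantization parameter derived from the observed data range; all names are illustrative:

```python
import numpy as np

def quantize_int8(x, scale):
    # Map floating-point values to 8-bit fixed-point values using the scale.
    q = np.round(x / scale)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale):
    # Inverse quantization: recover an approximation of the original values.
    return q.astype(np.float32) * scale

x = np.array([-1.6, 0.0, 0.75, 2.4], dtype=np.float32)
scale = np.max(np.abs(x)) / 127.0      # data quantization parameter from the data range
q = quantize_int8(x, scale)
print(q, dequantize(q, scale))
```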
Disclosure of Invention
The application provides a neural network model quantization method and a data processing method, which can simplify the operation of the neural network model and improve the data processing efficiency of the neural network model.
In a first aspect, a neural network model quantization method is provided, the method including: the method comprises the steps of obtaining an original neural network model, wherein the original neural network model comprises a first operator, a second operator and a first operation module, the first operator and the second operator are used for carrying out the same type of operation, and the first operation module is used for carrying out the first operation on the output of the first operator and the output of the second operator; determining a data quantization parameter according to a range of first training input data and a range of second training input data, wherein the first training input data is input data of the first operator, and the second training input data is input data of the second operator; determining a quantized neural network model according to the original neural network model, wherein the quantized neural network model comprises a quantization module, a third operator, a fourth operator and a second operation module, the quantization module is used for quantizing first input data of the third operator and second input data of the fourth operator respectively by using the data quantization parameter, the third operator is the quantized first operator, the fourth operator is the quantized second operator, and the second operation module is used for performing the first operation.
In other words, a data quantization parameter is determined according to the numerical ranges of the input data of the first operator and of the second operator in the original neural network model, and this data quantization parameter is used to quantize the input data of the third operator and of the fourth operator in the quantized neural network model respectively.
Because the quantized neural network model quantizes the data input to the two different operators with the same data quantization parameter, the quantization parameter corresponding to the processing result of the third operator is the same as the quantization parameter corresponding to the processing result of the fourth operator. The first operation can therefore be performed directly on the processing result of the third operator and the processing result of the fourth operator, without performing inverse quantization or other processing on these results beforehand, which simplifies the operation of the quantized neural network model and improves the data processing efficiency of the neural network model.
Because the data quantization parameter used to quantize the input data of the third operator and the fourth operator is determined according to the numerical ranges of the data actually processed by the first operator and the second operator, the precision of the data processing results of the third operator and the fourth operator is improved. The data processing efficiency of the neural network model is thus improved while the influence of the quantized neural network model on the accuracy of the data processing results is reduced.
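A minimal sketch of this step, assuming symmetric 8-bit quantization: a single data quantization parameter is derived from the union of the ranges of the first and second training input data, so that the same parameter can later quantize the inputs of both quantized operators (the function names and the symmetric scheme are assumptions, not taken from the application):

```python
import numpy as np

def shared_data_scale(first_inputs, second_inputs, num_bits=8):
    # Cover the ranges of both operators' input data with one scale.
    max_abs = max(np.max(np.abs(first_inputs)), np.max(np.abs(second_inputs)))
    return max_abs / (2 ** (num_bits - 1) - 1)

def quantize(x, scale, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

first_train = np.random.randn(1000).astype(np.float32) * 0.8    # first training input data
second_train = np.random.randn(1000).astype(np.float32) * 1.5   # second training input data
scale = shared_data_scale(first_train, second_train)
# The quantization module later applies the same scale to both operators' inputs.
q1, q2 = quantize(first_train, scale), quantize(second_train, scale)
```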
With reference to the first aspect, in some possible implementations, the method further includes: acquiring preset training output data corresponding to a training input data set, wherein the training input data set comprises the first training input data and the second training input data; quantizing the first training input data and the second training input data respectively by using the data quantization parameters; processing the quantized first training input data and the quantized second training input data by using the quantized neural network model to obtain actual training output data; adjusting the data quantization parameter according to the difference between the actual training output data and the preset training output data to minimize the difference; the quantization module is configured to quantize the first input data of the third operator and the second input data of the fourth operator respectively by using the adjusted data quantization parameter.
The preset training output data may be set manually. The preset training output data may also be obtained by processing the first training input data and the second training input data with the original neural network model; for example, the preset training output data may be the output of the first operation module.
Because the data quantization parameter is adjusted so as to minimize the difference between the actual training output data produced by the quantized neural network model for given data and the preset training output data corresponding to that data, the adjusted data quantization parameter enables the third operator and the fourth operator to process the quantized data with higher precision. The influence of the quantized neural network model on the accuracy of the data processing results is therefore reduced while the data processing efficiency of the neural network model is improved.
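One possible way to realize such an adjustment is sketched below, assuming a simple search over scaled candidates and a mean-squared-error measure of the difference; run_quantized_model is a hypothetical callable standing in for the quantized neural network model, and the search strategy is an assumption, not the application's method:

```python
import numpy as np

def calibrate_scale(initial_scale, run_quantized_model, first_train, second_train,
                    preset_output, factors=np.linspace(0.5, 1.5, 21)):
    # Search around the initial data quantization parameter for the value that
    # minimizes the difference between actual and preset training output data.
    best_scale, best_err = initial_scale, float("inf")
    for factor in factors:
        scale = initial_scale * factor
        actual_output = run_quantized_model(first_train, second_train, scale)
        err = np.mean((actual_output - preset_output) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```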
With reference to the first aspect, in some possible implementations, the method further includes: determining operator quantization parameters according to the parameter range of the first operator and the parameter range of the second operator; quantizing the parameter of the first operator by using the operator quantization parameter to obtain a parameter of the third operator; and quantizing the parameters of the second operator by using the operator quantization parameters to obtain the parameters of the fourth operator.
Because the operator quantization parameter is determined according to the parameter range of the first operator and the parameter range of the second operator, the quantized neural network model improves the data processing efficiency and reduces the influence on the accuracy and precision of the data processing result.
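A corresponding sketch for the operator quantization parameter, again assuming symmetric 8-bit quantization over the union of both operators' parameter ranges (the names are illustrative):

```python
import numpy as np

def shared_weight_scale(params_op1, params_op2, num_bits=8):
    # One operator quantization parameter covering both operators' parameter ranges.
    max_abs = max(np.max(np.abs(params_op1)), np.max(np.abs(params_op2)))
    return max_abs / (2 ** (num_bits - 1) - 1)

w1 = np.random.randn(64, 64).astype(np.float32) * 0.1   # parameters of the first operator
w2 = np.random.randn(64, 64).astype(np.float32) * 0.3   # parameters of the second operator
w_scale = shared_weight_scale(w1, w2)
q_w1 = np.clip(np.round(w1 / w_scale), -128, 127).astype(np.int8)  # third operator's parameters
q_w2 = np.clip(np.round(w2 / w_scale), -128, 127).astype(np.int8)  # fourth operator's parameters
```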
With reference to the first aspect, in some possible implementations, the quantized neural network model further includes a compression module, where the compression module is configured to compress outputs of the third operator and the fourth operator according to an offset parameter, where the offset parameter is used to indicate a position of a highest bit in the compressed data in the data before the compression, and the second operation module is configured to perform the first operation on the compressed data; the method further comprises the following steps: quantizing the first training input data and the second training input data respectively by using the data quantization parameters; processing the quantized first training input data by using the third operator, wherein the third operator outputs first training operation data; processing the quantized second training input data by using the fourth operator, wherein the fourth operator outputs second training operation data; and determining the offset parameter according to the significant digit of the first training operation data and the significant digit of the second training operation data.
Using the same offset parameter to compress the output of the third operator and the output of the fourth operator improves the data processing efficiency of the neural network model. Because the offset parameter is determined according to the significant digits of the intermediate operation results obtained when the quantized neural network model processes the training input data, compressing the intermediate operation results with this offset parameter when the quantized neural network model processes data reduces the influence on the accuracy and precision of the final data processing result.
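The sketch below illustrates one possible reading of this compression step, assuming 32-bit intermediate accumulators compressed to 16 bits by a shared right shift whose amount (standing in for the offset parameter) is chosen from the number of significant bits observed in both operators' training operation data; the bit widths and the shift-based scheme are assumptions for illustration, not the application's definition:

```python
import numpy as np

def choose_offset(acc1, acc2, out_bits=16):
    # Offset chosen from the significant bits of both operators' intermediate
    # results, so the same shift can be applied to both outputs.
    max_abs = int(max(np.max(np.abs(acc1)), np.max(np.abs(acc2))))
    significant_bits = max_abs.bit_length()
    return max(significant_bits - out_bits, 0)

def compress(acc, offset, out_bits=16):
    # Drop the low 'offset' bits so the retained window fits into out_bits bits.
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    return np.clip(acc >> offset, lo, hi).astype(np.int16)

acc1 = np.random.randint(-2**24, 2**24, size=256)   # intermediate result of the third operator
acc2 = np.random.randint(-2**20, 2**20, size=256)   # intermediate result of the fourth operator
offset = choose_offset(acc1, acc2)
c1, c2 = compress(acc1, offset), compress(acc2, offset)
```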
In a second aspect, a data processing method is provided, the method comprising: obtaining a quantized neural network model, where the quantized neural network model is obtained by quantizing an original neural network model, the original neural network model comprises a first operator, a second operator and a first operation module, the first operator and the second operator are used for carrying out the same type of operation, and the first operation module is used for carrying out the first operation on the output of the first operator and the output of the second operator; and processing first input data of a third operator and second input data of a fourth operator by using the quantized neural network model, where the quantized neural network model comprises a quantization module, the third operator, the fourth operator and a second operation module, the quantization module is configured to quantize the first input data and the second input data by using a data quantization parameter, the second operation module is configured to perform the first operation, the third operator is the quantized first operator, the fourth operator is the quantized second operator, and the data quantization parameter is determined according to a range of the first training input data of the first operator and a range of the second training input data of the second operator.
With reference to the second aspect, in some possible implementations, the data quantization parameter is obtained by adjusting an initial data quantization parameter, where the adjustment minimizes a difference between actual training output data and preset training output data, the initial data quantization parameter is determined according to a range of the first training input data and a range of the second training input data, the preset training output data corresponds to a training input data set, the training input data set includes the first training input data and the second training input data, the actual training output data is obtained by processing the first training input data and the second training input data using the quantized neural network model, and the quantization module is configured to quantize the first training input data and the second training input data respectively using the initial data quantization parameter.
With reference to the second aspect, in some possible implementations, the parameter of the third operator is obtained by quantizing the parameter of the first operator by using an operator quantization parameter, the parameter of the fourth operator is obtained by quantizing the parameter of the second operator by using the operator quantization parameter, and the operator quantization parameter is determined according to the parameter range of the first operator and the parameter range of the second operator.
With reference to the second aspect, in some possible implementations, the quantized neural network model further includes a compression module, where the compression module is configured to compress the outputs of the third operator and the fourth operator according to an offset parameter, where the offset parameter is used to indicate a position of a highest bit in the compressed data in the data before the compression, and the second operation module is configured to perform the first operation on the compressed data; the offset parameter is determined according to the significant digit of first training operation data and the significant digit of second training operation data, where the first training operation data is obtained by using the third operator to process first training input data quantized with the data quantization parameter, and the second training operation data is obtained by using the fourth operator to process second training input data quantized with the data quantization parameter.
In a third aspect, an apparatus for quantizing a neural network model is provided, the apparatus including: the device comprises a storage module and a processing module, wherein the storage module is used for storing programs; when the program is run in the processing module, the processing module is to: the method comprises the steps of obtaining an original neural network model, wherein the original neural network model comprises a first operator, a second operator and a first operation module, the first operator and the second operator are used for carrying out the same type of operation, and the first operation module is used for carrying out the first operation on the output of the first operator and the output of the second operator; determining a data quantization parameter according to a range of first training input data and a range of second training input data, wherein the first training input data is input data of the first operator, and the second training input data is input data of the second operator; determining a quantized neural network model according to the original neural network model, wherein the quantized neural network model comprises a quantization module, a third operator, a fourth operator and a second operation module, the quantization module is used for quantizing first input data of the third operator and second input data of the fourth operator respectively by using the data quantization parameter, the third operator is the quantized first operator, the fourth operator is the quantized second operator, and the second operation module is used for performing the first operation.
With reference to the third aspect, in some possible implementation manners, the processing module is further configured to obtain preset training output data corresponding to a training input data set, where the training input data set includes the first training input data and the second training input data; the processing module is further configured to quantize the first training input data and the second training input data using the data quantization parameter, respectively; the processing module is further used for processing the quantized first training input data and the quantized second training input data by using the quantized neural network model to obtain actual training output data; the processing module is further configured to adjust the data quantization parameter according to a difference between the actual training output data and the preset training output data to minimize the difference; the quantization module is configured to quantize the first input data of the third operator and the second input data of the fourth operator respectively by using the adjusted data quantization parameter.
With reference to the third aspect, in some possible implementations, the processing module is further configured to determine an operator quantization parameter according to the parameter range of the first operator and the parameter range of the second operator; the processing module is further configured to quantize the parameter of the first operator by using the operator quantization parameter to obtain a parameter of the third operator; the processing module is further configured to quantize the parameter of the second operator by using the operator quantization parameter to obtain a parameter of the fourth operator.
With reference to the third aspect, in some possible implementations, the quantized neural network model further includes a compression module, where the compression module is configured to compress outputs of the third operator and the fourth operator according to an offset parameter, where the offset parameter is used to indicate a position of a highest bit in the compressed data in the data before the compression is performed, and the second operation module is configured to perform the first operation on the compressed data; the processing module is further configured to quantize the first training input data and the second training input data respectively by using the data quantization parameter; the processing module is further configured to process the quantized first training input data by using the third operator, and the third operator outputs first training operation data; the processing module is further configured to process the quantized second training input data by using the fourth operator, and the fourth operator outputs second training operation data; the processing module is further configured to determine the offset parameter according to the significand of the first training operation data and the significand of the second training operation data.
In a fourth aspect, a data processing apparatus is provided, including a storage module and a processing module, where the storage module is used for storing a program; when the program is run in the processing module, the processing module is configured to: obtain a quantized neural network model, where the quantized neural network model is obtained by quantizing an original neural network model, the original neural network model comprises a first operator, a second operator and a first operation module, the first operator and the second operator are used for carrying out the same type of operation, and the first operation module is used for carrying out the first operation on the output of the first operator and the output of the second operator; and process first input data of a third operator and second input data of a fourth operator by using the quantized neural network model, where the quantized neural network model comprises a quantization module, the third operator, the fourth operator and a second operation module, the quantization module is configured to quantize the first input data and the second input data by using a data quantization parameter, the second operation module is configured to perform the first operation, the third operator is the quantized first operator, the fourth operator is the quantized second operator, and the data quantization parameter is determined according to a range of the first training input data of the first operator and a range of the second training input data of the second operator.
With reference to the fourth aspect, in some possible implementations, the data quantization parameter is obtained by adjusting an initial data quantization parameter, where the adjustment minimizes a difference between actual training output data and preset training output data, the initial data quantization parameter is determined according to a range of the first training input data and a range of the second training input data, the preset training output data corresponds to a training input data set, the training input data set includes the first training input data and the second training input data, the actual training output data is obtained by processing the first training input data and the second training input data using the quantized neural network model, and the quantization module is configured to quantize the first training input data and the second training input data respectively using the initial data quantization parameter.
With reference to the fourth aspect, in some possible implementation manners, the parameter of the third operator is obtained by quantizing the parameter of the first operator by using an operator quantization parameter, the parameter of the fourth operator is obtained by quantizing the parameter of the second operator by using the operator quantization parameter, and the operator quantization parameter is determined according to the parameter range of the first operator and the parameter range of the second operator.
With reference to the fourth aspect, in some possible implementations, the quantized neural network model further includes a compression module, where the compression module is configured to compress the outputs of the third operator and the fourth operator according to an offset parameter, where the offset parameter is used to indicate a position of a highest bit in the compressed data in the data before the compression, and the second operation module is configured to perform the first operation on the compressed data; the offset parameter is determined according to the significant digit of first training operation data and the significant digit of second training operation data, where the first training operation data is obtained by using the third operator to process first training input data quantized with the data quantization parameter, and the second training operation data is obtained by using the fourth operator to process second training input data quantized with the data quantization parameter.
In a fifth aspect, an electronic device is provided that includes a memory for storing program instructions and a processor; when the program instructions are executed in the processor, the processor is configured to perform the method of the first aspect or the second aspect.
The processor in the fifth aspect may include a central processing unit (CPU), or a combination of a CPU and a neural network operation processor.
In a sixth aspect, a computer-readable medium is provided, which stores program code for execution by a device, the program code comprising instructions for performing the method in the first aspect or in any one of the implementations of the first aspect.
In a seventh aspect, a computer program product containing instructions is provided, which when run on a computer causes the computer to perform the method of the first aspect or any one of the implementation manners of the first aspect.
In an eighth aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads an instruction stored in a memory through the data interface to execute the method in the first aspect or any implementation manner of the first aspect.
Optionally, as an implementation manner, the chip may further include a memory, where instructions are stored in the memory, and the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the first aspect or the method in any implementation manner of the first aspect.
The chip may be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
Drawings
Fig. 1 is a schematic structural diagram of a system architecture according to an embodiment of the present disclosure.
Fig. 2 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of another convolutional neural network provided in the embodiment of the present application.
Fig. 4 is a schematic diagram of a hardware structure of a chip according to an embodiment of the present disclosure.
Fig. 5 is a schematic diagram of a system architecture according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of a neural network model quantization apparatus according to an embodiment of the present application.
Fig. 7 is a schematic flowchart of a neural network model quantization method provided in an embodiment of the present application.
Fig. 8 is a schematic flow chart of another neural network model quantization method provided in an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a data processing system according to an embodiment of the present application.
Fig. 10 is a schematic diagram of data before and after compression provided by an embodiment of the present application.
Fig. 11 is a schematic flowchart of a data processing method according to an embodiment of the present application.
Fig. 12 is a schematic flow chart of another data processing method provided in the embodiment of the present application.
Fig. 13 is a schematic flowchart of a processing structure identification method provided in an embodiment of the present application.
Fig. 14 is a schematic structural diagram of a neural network model quantization apparatus according to an embodiment of the present application.
Fig. 15 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
Fig. 16 is a hardware configuration diagram of a data processing apparatus according to an embodiment of the present application.
Fig. 17 is a schematic hardware configuration diagram of a neural network model quantizing device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all of them. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Since the embodiments of the present application involve the application of a large number of neural networks, for ease of understanding, the terms and concepts related to neural networks that may be involved in the embodiments of the present application are first described below.
(1) Neural network
A neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and its output may be:

$h_{W,b}(x) = f\big(\sum_{s=1}^{n} W_s x_s + b\big)$

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e. the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected to a local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
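As a plain numerical illustration of the formula above (a sketch, not part of the application), a single neural unit with a sigmoid activation can be written as:

```python
import numpy as np

def neural_unit(x, w, b):
    # Output of one neural unit: f(sum_s W_s * x_s + b), with f a sigmoid here.
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs x_s
w = np.array([0.8, 0.1, -0.4])   # weights W_s
print(neural_unit(x, w, b=0.2))
```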
(2) Deep neural network
Deep neural networks (DNNs), also called multi-layer neural networks, can be understood as neural networks with multiple hidden layers. According to the positions of the different layers, the layers of a DNN can be divided into three categories: the input layer, the hidden layers and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is, any neuron at the i-th layer is necessarily connected to any neuron at the (i+1)-th layer.
Although a DNN appears complex, the work of each layer is not complex; it is simply the following linear relational expression:

$\vec{y} = \alpha(W \vec{x} + \vec{b})$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, W is the weight matrix (also called the coefficients), and $\alpha()$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has a large number of layers, the number of coefficients W and offset vectors $\vec{b}$ is also large. These parameters are defined in the DNN as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^3_{24}$. The superscript 3 represents the layer in which the coefficient W is located, while the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the kth neuron at layer L-1 to the jth neuron at layer L is defined as $W^L_{jk}$.
Note that the input layer has no W parameter. In a deep neural network, more hidden layers make the network better able to depict complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices (formed by the vectors W of many layers) of all layers of the trained deep neural network.
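For illustration (an assumption-laden sketch, not part of the application), a single fully connected DNN layer computing y = α(Wx + b) could look as follows, with a ReLU arbitrarily chosen as the activation α:

```python
import numpy as np

def dnn_layer(x, W, b):
    # One fully connected layer: y = alpha(W x + b), here with a ReLU activation.
    return np.maximum(W @ x + b, 0.0)

x = np.random.randn(4)          # input vector
W = np.random.randn(3, 4)       # weight matrix, entries W^L_jk of this layer
b = np.random.randn(3)          # offset vector
y = dnn_layer(x, W, b)
```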
(3) Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way data information is extracted is location independent. The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
(4) Loss function
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is really desired, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the really desired target value (of course, an initialization process usually takes place before the first update, that is, parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment is continued until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
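As a small example (not taken from the application), the mean squared error is one such loss function; a larger value indicates a larger difference between the predicted value and the target value:

```python
import numpy as np

def mse_loss(predicted, target):
    # Larger loss means a larger gap between prediction and target,
    # so training aims to make this value as small as possible.
    return np.mean((predicted - target) ** 2)

print(mse_loss(np.array([0.9, 0.1]), np.array([1.0, 0.0])))
```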
(5) Back propagation algorithm
The neural network can adopt a Back Propagation (BP) algorithm to correct the size of parameters in the initial neural network model in the training process, so that the reconstruction error loss of the neural network model is smaller and smaller. Specifically, the error loss is generated by transmitting the input signal in the forward direction until the output, and the parameters in the initial neural network model are updated by reversely propagating the error loss information, so that the error loss is converged. The back propagation algorithm is a back propagation motion with error loss as a dominant factor, aiming at obtaining the optimal parameters of the neural network model, such as a weight matrix.
(6) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
As shown in fig. 1, the present embodiment provides a system architecture 100. In fig. 1, a data acquisition device 160 is used to acquire training data. For the data processing method of the embodiment of the present application, the training data may include a plurality of training input data and a training identifier corresponding to each training input data.
After the training data is collected, data collection device 160 stores the training data in database 130, and training device 120 trains target model/rule 101 based on the training data maintained in database 130.
The following describes how the training device 120 obtains the target model/rule 101 based on the training data: the training device 120 processes the input training input data and compares the output result with the training identifier corresponding to that training input data, until the difference between the output result of the training device 120 and the training identifier is smaller than a certain threshold, thereby completing the training of the target model/rule 101.
The above-described target model/rule 101 can be used to implement the data processing method of the embodiment of the present application. The target model/rule 101 in the embodiment of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 130 may not necessarily all come from the acquisition of the data acquisition device 160, and may also be received from other devices. It should be noted that, the training device 120 does not necessarily perform the training of the target model/rule 101 based on the training data maintained by the database 130, and may also obtain the training data from the cloud or other places for performing the model training.
The target model/rule 101 obtained by training by the training device 120 may be applied to different systems or devices, for example to the execution device 110 shown in fig. 1. The execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device or a vehicle-mounted terminal, or may be a server, a cloud, or the like. In fig. 1, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include the data to be processed input by the client device.
The preprocessing module 113 and the preprocessing module 114 are configured to perform preprocessing according to input data (such as data to be processed) received by the I/O interface 112, and in this embodiment, the input data may be processed directly by the computing module 111 without the preprocessing module 113 and the preprocessing module 114 (or only one of them may be used).
In the process that the execution device 110 preprocesses the input data or in the process that the calculation module 111 of the execution device 110 executes the calculation or other related processes, the execution device 110 may call the data, the code, and the like in the data storage system 150 for corresponding processes, and may store the data, the instruction, and the like obtained by corresponding processes in the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the processing result of the data obtained as described above, to the client device 140, thereby providing it to the user.
It should be noted that the training device 120 may generate corresponding target models/rules 101 for different targets or different tasks based on different training data, and the corresponding target models/rules 101 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 1, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form can be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data, and storing the new sample data in the database 130. Of course, the input data inputted to the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.
As shown in fig. 1, the target model/rule 101 is obtained by training by the training device 120. In this embodiment of the present application, the target model/rule 101 may be a neural network; specifically, the neural network used in this embodiment of the present application may be a CNN, a deep convolutional neural network (DCNN), a recurrent neural network (RNN), or the like.
Since CNN is a very common neural network, the structure of CNN will be described in detail below with reference to fig. 2. As described in the introduction of the basic concept, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to data input thereto.
The structure of the neural network specifically adopted in the data processing method according to the embodiment of the present application may be as shown in fig. 2. In fig. 2, Convolutional Neural Network (CNN)200 may include an input layer 210, a convolutional/pooling layer 220 (where pooling layer is optional), and a neural network layer 230. The input layer 210 may obtain data to be processed, and deliver the obtained data to be processed to the convolutional layer/pooling layer 220 and the following neural network layer 230 for processing, so as to obtain a processing result of the data. The following describes the internal layer structure in CNN 200 in fig. 2 in detail.
Convolutional layer/pooling layer 220:
Convolutional layer:
the convolutional layer/pooling layer 220 shown in fig. 2 may include layers such as example 221 and 226, for example: in one implementation, 221 is a convolutional layer, 222 is a pooling layer, 223 is a convolutional layer, 224 is a pooling layer, 225 is a convolutional layer, 226 is a pooling layer; in another implementation, 221, 222 are convolutional layers, 223 is a pooling layer, 224, 225 are convolutional layers, and 226 is a pooling layer. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
The inner working principle of a convolutional layer will be described below by taking convolutional layer 221 as an example.
Convolutional layer 221 may include a number of convolution operators, also known as kernels, whose role in data processing is to act as a filter that extracts specific information from the input data matrix. A convolution operator may essentially be a weight matrix, which is usually predefined.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input data, so that the convolutional neural network 200 can make correct prediction.
When convolutional neural network 200 has multiple convolutional layers, the initial convolutional layers (e.g., 221) tend to extract more general features, which may also be referred to as low-level features; as the depth of convolutional neural network 200 increases, the later convolutional layers (e.g., 226) extract more and more complex features, such as features with high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
A pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. In the layers 221-226 illustrated by 220 in fig. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During data processing, the only purpose of the pooling layer is to reduce the spatial size of the data.
The neural network layer 230:
after processing by convolutional layer/pooling layer 220, convolutional neural network 200 is not sufficient to output the required output information. Because, as previously described, convolutional layer/pooling layer 220 only extracts features and reduces parameters brought by the input data. However, in order to generate the final output information (required class information or other relevant information), the convolutional neural network 200 needs to generate one or a set of the required number of classes of outputs using the neural network layer 230. Accordingly, a plurality of hidden layers (231, 232 to 23n shown in fig. 2) and an output layer 240 may be included in the neural network layer 230, and parameters included in the hidden layers may be obtained by pre-training according to related training data of a specific task type, for example, the task type may include recognition, classification, and the like.
After the hidden layers in the neural network layer 230 comes the output layer 240, which is the last layer of the whole convolutional neural network 200. The output layer 240 has a loss function similar to the categorical cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the whole convolutional neural network 200 (the propagation from 210 to 240 in fig. 2) is completed, the backward propagation (the propagation from 240 to 210 in fig. 2) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, that is, the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
The structure of the neural network specifically adopted in the data processing method according to the embodiment of the present application may be as shown in fig. 3. In fig. 3, Convolutional Neural Network (CNN)200 may include an input layer 210, a convolutional/pooling layer 220 (where pooling is optional), and a neural network layer 230. Compared with fig. 2, in the convolutional layer/pooling layer 220 in fig. 3, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the neural network layer 230 for processing.
It should be noted that the convolutional neural networks shown in fig. 2 and fig. 3 are only examples of two possible convolutional neural networks of the data processing method according to the embodiment of the present application, and in a specific application, the convolutional neural networks used in the data processing method according to the embodiment of the present application may also exist in the form of other network models.
Fig. 4 is a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor 50. The chip may be provided in the execution device 110 as shown in fig. 1 to complete the calculation work of the calculation module 111. The chip may also be disposed in the training apparatus 120 as shown in fig. 1 to complete the training work of the training apparatus 120 and output the target model/rule 101. The algorithms for the various layers in the convolutional neural network shown in fig. 2 and 3 can be implemented in a chip as shown in fig. 4.
The neural network processor NPU 50 is mounted as a coprocessor on a main processing unit (CPU) (host CPU), and tasks are distributed by the main CPU. The core portion of the NPU is an arithmetic circuit 503, and the controller 504 controls the arithmetic circuit 503 to extract data in a memory (weight memory or input memory) and perform an operation.
In some implementations, the arithmetic circuit 503 internally includes a plurality of processing units (PEs). In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and buffers it in each PE of the arithmetic circuit. The arithmetic circuit takes the data of matrix A from the input memory 501, performs a matrix operation with matrix B, and stores the partial or final results of the obtained matrix in an accumulator 508.
The vector calculation unit 507 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 507 may be used for network calculation of non-convolution/non-FC layers in a neural network, such as pooling (Pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector calculation unit 507 can store the processed output vector to the unified buffer 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 503, for example for use in subsequent layers in a neural network.
The unified memory 506 is used to store input data as well as output data.
A storage unit access controller 505 (direct memory access controller, DMAC) is used to transfer the input data in the external memory to the input memory 501 and/or the unified memory 506, to store the weight data from the external memory into the weight memory 502, and to store the data in the unified memory 506 into the external memory.
A Bus Interface Unit (BIU) 510, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 509 through a bus.
An instruction fetch buffer 509 connected to the controller 504 for storing instructions used by the controller 504;
the controller 504 is configured to call the instruction cached in the instruction storage 509 to control the operation process of the operation accelerator.
Generally, the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are On-Chip memories, the external memories are memories external to the NPU, and the external memories may be double data rate synchronous dynamic random access memories (DDR SDRAMs), High Bandwidth Memories (HBMs), or other readable and writable memories.
The operations of the layers in the convolutional neural networks shown in fig. 2 and 3 may be performed by the operation circuit 503 or the vector calculation unit 507.
The executing device 110 in fig. 1 described above can execute the steps of the data processing method in the embodiment of the present application, and the CNN model shown in fig. 2 and 3 and the chip shown in fig. 4 can also be used to execute the steps of the data processing method in the embodiment of the present application. The following describes the neural network training method and the data processing method in the embodiments in detail with reference to the drawings.
As shown in fig. 5, the present embodiment provides a system architecture 300. The system architecture includes a local device 301, a local device 302, and an execution device 110 and a data storage system 150, wherein the local device 301 and the local device 302 are connected with the execution device 110 through a communication network.
The execution device 110 may be implemented by one or more servers. Optionally, the execution device 110 may be used with other computing devices, such as: data storage, routers, load balancers, and the like. The execution device 110 may be disposed on one physical site or distributed across multiple physical sites. The execution device 110 may use data in the data storage system 150 or call program code in the data storage system 150 to implement the data processing method of the embodiment of the present application.
The user may operate respective user devices (e.g., local device 301 and local device 302) to interact with the execution device 110. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car, or other types of cellular phones, media consumption devices, wearable devices, set-top boxes, gaming consoles, and so forth.
Each user's local device may interact with the execution device 110 via a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a peer-to-peer connection, etc., or any combination thereof.
In one implementation, the local device 301 or the local device 302 acquires the relevant parameters of the target neural network from the execution device 110, deploys the target neural network on the local device 301 or the local device 302, and performs data classification or identification and the like by using the target neural network.
In another implementation, the execution device 110 may directly deploy a target neural network, and the execution device 110 obtains the data to be processed from the local device 301 and the local device 302, and classifies or otherwise processes the data to be processed according to the target neural network.
The execution device 110 may also be a cloud device, and at this time, the execution device 110 may be deployed in a cloud; alternatively, the execution device 110 may also be a terminal device, in which case, the execution device 110 may be deployed at a user terminal side, which is not limited in this embodiment of the application.
At present, neural network models are widely applied in many fields such as image, video, and speech processing, and their capability exceeds that of traditional methods. However, the computation amount and the number of parameters of a neural network model are large, which brings great challenges to deploying the neural network on terminal devices.
Model quantization is used to quantize the parameters of the operators in the neural network model and to quantize the input data. Quantizing the parameters of an operator can optimize the size of the operator and reduce the resources it occupies. On this basis, quantizing the input data of the operator can convert the floating-point operations of the operator into fixed-point operations, which improves the inference speed and reduces power consumption. Compared with a neural network model represented by single-precision floating-point numbers (generally 32-bit), a quantized neural network model obtained by 8-bit quantization can reduce the storage space occupied by each parameter to one quarter and process data at a higher inference speed.
In the process of performing model quantization on operators, in order to improve the data processing precision of the quantized operators, operator quantization parameters need to be determined according to the range of each operator's parameters, and data quantization parameters need to be determined according to the range of the input data. Differences between the parameter ranges and input data ranges of different operators lead to differences between the data quantization parameters and/or operator quantization parameters corresponding to the operators. Before the data processing results of the quantized operators are operated on, the results need to be inversely quantized separately to ensure the accuracy of the operation result. In an NPU, the number of processing units available for the inverse quantization operation is limited, which limits the rate of inverse quantization and results in low overall processing efficiency and poor performance. The inverse quantization parameter used when inversely quantizing the data processing result of each operator can be determined according to the data quantization parameter and the operator quantization parameter corresponding to that operator. In order to solve the above problem, an embodiment of the present application provides a neural network model quantization apparatus, which can reduce the number of inverse quantization operations that need to be performed subsequently on the quantized neural network model and improve the overall processing performance.
Fig. 6 is a schematic structural diagram of a neural network model quantization apparatus according to an embodiment of the present application. The neural network model quantizing device 1300 can be located in the training apparatus 120 shown in fig. 1 or other apparatuses. The neural network model quantizing device 1300 includes a data quantization parameter generation model 1310 and an operator quantization model 1320. The neural network model quantizing device 1300 is used for quantizing the original neural network model. The original neural network model comprises a first operator, a second operator and a first operation module. The first operator and the second operator are used for carrying out the same type of operation. The first operation module is used for performing a first operation on the output of the first operator and the output of the second operator. The data quantization parameter generation model 1310 is used to generate the data quantization parameter based on the range of data in the training input data set. The training input data set includes first training input data of the first operator and second training input data of the second operator. The operator quantization model 1320 is used for quantizing operation units in the original neural network model, such as the first operator and the second operator. A quantized neural network model is obtained according to the data quantization parameter and the quantized operation units.
The quantized neural network model comprises a quantization module, a quantized first operator, a quantized second operator and a second operation module. The quantization module is used for quantizing the first input data and the second input data respectively by using the data quantization parameter. The second operation module corresponds to the first operation module in the original neural network model and is used for carrying out first operation on the first operation data and the second operation data. The first operation data is obtained by operating the quantized first input data by the quantized first operator. The second operation data is obtained by operating the quantized second input data by the quantized second operator.
Fig. 7 is a schematic flowchart of a neural network model quantization method provided in an embodiment of the present application. The neural network model quantification method may be performed by the training device 120 shown in fig. 1 or other devices.
S1101, obtaining an original neural network model, wherein the original neural network model comprises a first operator, a second operator and a first operation module. The first operator and the second operator are used for carrying out the same type of operation, and the first operation module is used for carrying out the first operation on the output of the first operator and the output of the second operator. The raw neural network model may be obtained from messages sent by other devices. Alternatively, the raw neural network model may be retrieved from memory. The original neural network model may be CNN, etc.
The first operation and the second operation are the same type of operation; that is, the first operator and the second operator are the same type of operator. For example, the first operator and the second operator may both be convolutional layers in a CNN or may both be fully-connected layers. The first operation module may be configured to perform a bitwise operation, such as a bitwise addition or a bitwise multiplication operation, on the output of the first operator and the output of the second operator. In general, the first operation module may be used for linear operations.
The neural network model quantization method includes steps S1101 to S1103, and is used for quantizing an original neural network model to obtain a quantized neural network model. The neural network model quantification method may be performed by the training device 120 shown in fig. 1 or other devices.
At S1102, a data quantization parameter is determined according to a range of first training input data and a range of second training input data, where the first training input data is input data of the first operator, and the second training input data is input data of the second operator. Specifically, the average data range upper bound may be determined from a maximum value of the plurality of first training input data for the first operator and a maximum value of the plurality of second training input data for the second operator. The average data range lower bound may be determined from a minimum of the plurality of first training input data for the first operator and a minimum of the plurality of second training input data for the second operator. The data quantization parameter may be determined based on an upper average data range limit and a lower average data range limit.
The upper average data range limit may be understood as the average of the maximum value of the plurality of first training input data and the maximum value of the plurality of second training input data. The lower average data range limit may be understood as the average of the minimum value of the plurality of first training input data and the minimum value of the plurality of second training input data.
The upper average data range limit and the lower average data range limit may be updated each time training input data is entered. Thus, the calculation of the upper limit of the average data range and the lower limit of the average data range is performed in a decentralized manner, and the requirement on the calculation resources can be reduced compared with the way of obtaining a plurality of training input data inputs for performing the calculation of the average value.
Weights may be introduced to enable updating of the upper and lower average data range limits. The specific upper limit of the average data range and the updating manner of the average data range can be referred to the description of fig. 8. It will be appreciated that the setting of the weights may be used to measure the impact of training input data input to the first operator and/or the second operator on the upper and lower average data range limits as the number of iterations increases.
Generally, the bit number (i.e., the number of bits) of the quantized input data of the operator is a preset value, for example, 8 bits. Of course, in some embodiments, the number of bits of the quantized input data may also be obtained through manually input information or the like. The data quantization parameter may include a step size (scale) and an offset (offset). The number of bits corresponds to the number of scales in the data quantization parameter. The number-of-scales parameter can be understood as the maximum value that the number of bits of the quantized data can represent, i.e., 2^m - 1, where m is the number of bits of the quantized data. The scale is obtained according to the difference between the upper limit of the average data range and the lower limit of the average data range and the number of scales. For example, the parameter scale may be the quotient obtained by dividing the difference between the upper limit of the average data range and the lower limit of the average data range by the number of scales, or the parameter scale may be the quotient obtained by adding 1 to that difference and dividing by the number of scales. The offset in the data quantization parameter may be the ratio of the lower limit of the average data range to the parameter scale.
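For illustration only, the following Python sketch shows one way in which the scale and offset described above could be computed and applied; the function names data_quant_params and quantize, and the use of numpy, are assumptions of this sketch rather than part of the embodiments.

```python
import numpy as np

def data_quant_params(d_max: float, d_min: float, m: int = 8):
    """Derive the scale and offset from the average data range bounds."""
    num_scales = 2 ** m - 1              # maximum value representable by m bits
    scale = (d_max - d_min) / num_scales
    offset = d_min / scale               # ratio of the average data range lower limit to the scale
    return scale, offset

def quantize(x: np.ndarray, scale: float, offset: float, m: int = 8) -> np.ndarray:
    """Quantize floating-point data to the m-bit range [0, 2^m - 1]."""
    q = np.round(x / scale - offset)
    return np.clip(q, 0, 2 ** m - 1).astype(np.int64)
```

For example, with an average data range upper limit of 6.0, a lower limit of 0.0, and m = 8, the scale would be 6/255 and the offset 0.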
At S1103, a quantized neural network model is determined according to the original neural network model, where the quantized neural network model includes a quantization module, a third operator, a fourth operator, and a second operation module, the quantization module is configured to quantize first input data of the third operator and second input data of the fourth operator respectively by using the data quantization parameter, and the second operation module is configured to perform the first operation.
The second operation module may be configured to perform the first operation on first operation data and second operation data, where the first operation data is obtained by operating on the quantized first input data with the third operator, and the second operation data is obtained by operating on the quantized second input data with the fourth operator.
That is, the second operation module corresponds to the first operation module in the original neural network model.
Through S1101 to S1103, a data quantization parameter is determined according to the numerical range of the input data of the first operator and the second operator in the original neural network model, and the data quantization parameter is used for quantizing the input data of the third operator and the fourth operator in the quantized neural network model respectively. The third operator and the first operator are used for carrying out the same operation, the fourth operator and the second operator are used for carrying out the same operation, and the types of the operation carried out by the first operator and the second operator are the same.
Through S1101 to S1103, the quantized neural network model can quantize data input to two different operators by using the same data quantization parameter, so that the quantization parameter corresponding to the processing result of the third operator is the same as that corresponding to the processing result of the fourth operator. The processing result of the third operator and the processing result of the fourth operator can therefore be directly subjected to the first operation, without performing inverse quantization or other processing on them beforehand, thereby simplifying the operation of the quantized neural network model and improving the data processing efficiency of the neural network model.
In addition, according to the numerical value ranges of data respectively processed by the first operator and the second operator, data quantization parameters for quantizing the input data of the third operator and the fourth operator are determined, so that the precision of the processing results of the quantized data of the third operator and the fourth operator is improved, the data processing efficiency of the neural network model is improved, and the influence of the quantized neural network model on the accuracy of the data processing results is reduced.
Further, preset training output data corresponding to a training input data set may be obtained, where the training input data set includes the first training input data and the second training input data.
The preset training output data may be set manually. The preset training output data may also be obtained by processing the first training input data and the second training input data by the original neural network model. For example, the pre-set training output data may be an output of the operational module.
After S1102, the first training input data and the second training input data may be quantized, respectively, using the data quantization parameter. The quantized first training input data and the quantized second training input data may be processed using the quantized neural network model to obtain actual training output data.
The data quantization parameter may be adjusted to minimize a difference between the actual training output data and the preset training output data.
The quantization module is configured to quantize the first input data of the third operator and the second input data of the fourth operator respectively by using the adjusted data quantization parameter.
That is, the input data of the third operator and the input data of the fourth operator may be quantized by the adjusted data quantization parameter.
Because the adjustment mode of the data quantization parameter is to minimize the difference between the actual training output data of the quantized neural network model to the data and the preset training output data corresponding to the data, the adjusted data quantization parameter can enable the result precision of the third operator and the fourth operator for processing the quantized data to be higher. The influence of the quantitative neural network model on the accuracy of the data processing result is reduced while the data processing efficiency of the neural network model is improved.
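As a minimal, non-limiting sketch of how this adjustment could be organized, the following Python code searches a small set of candidate scales and keeps the one that minimizes the difference between the actual training output data and the preset training output data; the helper run_quantized_model, the candidate range, and the mean-squared-error measure are assumptions of the sketch, not part of the embodiments.

```python
import numpy as np

def adjust_data_quant_param(scale, offset, train_pairs, preset_outputs,
                            run_quantized_model, num_candidates=21):
    """Keep the candidate scale whose actual training output is closest to the preset output."""
    best_scale, best_err = scale, float("inf")
    for factor in np.linspace(0.8, 1.2, num_candidates):    # candidates around the initial scale
        cand = scale * factor
        err = 0.0
        for (x1, x2), y_ref in zip(train_pairs, preset_outputs):
            y = run_quantized_model(x1, x2, cand, offset)    # actual training output data
            err += float(np.mean((np.asarray(y) - np.asarray(y_ref)) ** 2))
        if err < best_err:
            best_err, best_scale = err, cand
    return best_scale, offset
```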
The first operator and the third operator are used for carrying out the same operation. The second operator and the fourth operator are used for carrying out the same operation.
The two operators are used for carrying out the same operation, and can also be understood as that the two operators carry out the same operation on input data, the parameters of the two operators are only different in precision, and the parameter of one operator is obtained by quantizing the parameter of the other operator. By processing the quantized input data with the quantized operator, the amount of computation can be reduced.
It should be understood that, in order to make the processing results of the input data by the third operator and the fourth operator comparable, i.e., to operate on them directly without performing inverse quantization or other processing in subsequent operations, the parameters of the third operator and the parameters of the fourth operator may be obtained by using the same operator quantization parameter.
Operator quantization parameters may be determined based on the parameter ranges of the first operator and the second operator.
The operator quantization parameter may be used to quantize the parameter of the first operator to obtain a parameter of a third operator, and the operator quantization parameter may be used to quantize the parameter of the second operator to obtain a parameter of a fourth operator.
Because the operator quantization parameter is determined according to the parameter range of the first operator and the parameter range of the second operator, the quantized neural network model improves the data processing efficiency and reduces the influence on the accuracy and precision of the data processing result.
In order to further improve the data processing efficiency of the quantized neural network model, the data processing result output by the third operator and the data processing result output by the fourth operator may be compressed.
The quantized neural network model further comprises a compression module, wherein the compression module is used for respectively compressing the output of the third operator and the output of the fourth operator according to an offset parameter so as to obtain the first operational data and the second operational data, and the offset parameter is used for indicating the position of the highest bit in the compressed data in the data before compression.
In general, the first operation data obtained by compressing the data processing result output by the third operator and the second operation data obtained by compressing the data processing result output by the fourth operator have the same number of bits.
The offset parameter indicates the position, in the data before compression, of the highest bit of the compressed data. The output of the third operator and the output of the fourth operator are compressed with the same offset parameter, so that the first operation data and the second operation data are comparable and can be operated on directly, without performing inverse quantization or other processing in subsequent operations.
In order to improve the data processing efficiency of the neural network model and reduce the influence on the accuracy and precision of the data processing result, the offset parameter can be determined according to the output significant digit obtained by processing the quantized first training input data by the third operator and the output significant digit obtained by processing the quantized second training input data by the fourth operator.
That is, the first training input data and the second training input data are quantized separately using the data quantization parameter. The quantized first training input data may be processed using the third operator and the quantized second training input data may be processed using the fourth operator. And the output of the third operator is first training operation data, and the output of the fourth operator is second training operation data.
The offset parameter may then be determined based on the significand of the first training calculation and the significand of the second training calculation.
For the determination of the data quantization parameter, the operator quantization parameter, and the offset parameter, reference may be made to the description of fig. 8.
Through S1101 to S1103, a quantized neural network model can be obtained. The quantized neural network model may be, for example, the data processing system 600 shown in fig. 9, or the data processing system 600 may call each operator or module in the quantized neural network model to perform data processing.
Fig. 8 is a schematic flowchart of a neural network model quantization method provided in an embodiment of the present application.
The neural network model quantification method 800 may also be understood as an optimization method or a further training method for the neural network model. The training method 800 may be performed by the training device 120 shown in fig. 1 or other devices.
The original neural network model comprises a first operator, a second operator and an operation module. The operation module is used for operating the output of the first operator and the output of the second operator. The first operator and the second operator are used for carrying out the same type of operation.
In the original neural network model, the parameters of the first operator and the second operator are represented in the format of floating point numbers.
At S810, an operator quantization parameter is determined based on the parameter range of the first operator and the parameter range of the second operator.
The scale in the operator quantization parameter can be expressed as s 2:
s2 = (f_max - f_min) / (2^a - 1)
where f_max is the maximum parameter value of the first operator and the second operator represented by floating point numbers, f_min is the minimum parameter value of the first operator and the second operator represented by floating point numbers, and a is the number of bits of the quantization result. For data in int8 format, a takes the value 8.
The offset in the operator quantization parameter can be expressed as o 2:
o2 = f_min / s2
In general, f_min is a negative number.
The operator quantization parameter may be used to quantize the parameter of the first operator to obtain a quantized first operator. The operator quantization parameter may be used to quantize the parameter of the second operator to obtain a quantized second operator.
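For illustration, the following sketch shows how a shared operator quantization parameter could be derived from the combined parameter range of the two operators and then applied to both; the function names and the int8 default are assumptions of the sketch.

```python
import numpy as np

def operator_quant_params(w1: np.ndarray, w2: np.ndarray, a: int = 8):
    """Shared operator quantization parameters derived from both operators' parameter ranges."""
    f_max = float(max(w1.max(), w2.max()))   # maximum parameter value over the first and second operators
    f_min = float(min(w1.min(), w2.min()))   # minimum parameter value over the first and second operators
    s2 = (f_max - f_min) / (2 ** a - 1)
    o2 = f_min / s2
    return s2, o2

def quantize_operator(w: np.ndarray, s2: float, o2: float, a: int = 8) -> np.ndarray:
    """Quantize an operator's floating-point parameters to a-bit integers using the shared parameters."""
    q = np.round(w / s2 - o2)
    return np.clip(q, 0, 2 ** a - 1).astype(np.int64)
```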
At S820, a training data set is obtained, where the training data set includes a training input data set and a preset operation result corresponding to the training input data set. Each set of training input data includes first training input data and second training input data.
It should be appreciated that a predetermined operation result corresponds to a first training input data and a second training input data. The preset operation result corresponding to the first training input data and the second training input data may be an operation result expressed by a floating point number obtained by operating the processing result of the first operator on the first training input data and the processing result of the second operator on the second training input data. Alternatively, the preset operation result corresponding to the first training input data and the second training input data may be manually set.
At S830, the data quantization parameter is determined according to the range of the first training input data and the range of the second training input data.
The training data set may comprise a plurality of first training input data and a plurality of second training input data. A data quantization parameter may be determined from each training input data range. The training input data is either the first training input data or the second training input data. The data quantization parameter may be an average range of the plurality of first training input data and the plurality of second training input data.
The scale in the data quantization parameter may be represented as s 1:
s1 = (d_max_t - d_min_t) / (2^m - 1)
where d_max_t is the average maximum value (which can also be understood as the upper limit of the average data range) obtained after the t-th iteration, d_min_t is the average minimum value (which can also be understood as the lower limit of the average data range) obtained after the t-th iteration, and m indicates the number of bits of the quantization result obtained by quantizing the training input data. For data in int8 format, m takes the value 8. It should be understood that m is a preset value.
The average maximum of the training input data may be expressed as:
d_max_t = (1 - 1/c_t) · d_max_(t-1) + (1/c_t) · v_max_t
where d_max_(t-1) is the average maximum value of the training input data obtained after the (t-1)-th iteration, v_max_t is the maximum value counted over the input data of the operator in the t-th iteration, and c_t is continuously updated with the number of iterations, c_t = β1 · c_(t-1) + 1, where β1 is a constant. In the multiple iteration process, the input data of the operator includes the first training input data of the first operator and the second training input data of the second operator. The factor 1/c_t may be understood as a weight.
When β1 is greater than 1, as the iteration proceeds, the larger the number of iterations, the smaller the influence of the maximum value of the training input data on the upper limit of the average data range. When β1 is less than 1, the greater the number of iterations, the greater the influence of the maximum value of the training input data on the upper limit of the average data range. When β1 is equal to 1, the maximum value of each training input data has the same influence on the upper limit of the average data range. In general, the value of β1 is slightly larger than 1, so as to avoid excessive correction of the upper limit of the average data range.
The average minimum of the training input data may be expressed as:
d_min_t = (1 - 1/c_t) · d_min_(t-1) + (1/c_t) · v_min_t
where d_min_(t-1) is the average minimum value of the training input data obtained after the (t-1)-th iteration, v_min_t is the minimum value counted over the input data of the operator in the t-th iteration, and c_t is continuously updated with the number of iterations, c_t = β2 · c_(t-1) + 1, where β2 is a constant. In the multiple iteration process, the input data of the operator includes the first training input data of the first operator and the second training input data of the second operator.
Similarly, when β2 is greater than 1, the larger the number of iterations, the smaller the influence of the minimum value of the training input data on the lower limit of the average data range. When β2 is less than 1, the greater the number of iterations, the greater the influence of the minimum value of the training input data on the lower limit of the average data range. When β2 is equal to 1, the minimum value of each training input data has the same influence on the lower limit of the average data range. In general, the value of β2 is slightly larger than 1, so as to avoid excessive correction of the lower limit of the average data range. β2 and β1 may be equal or unequal.
Before performing the iteration, the parameters c_0, β, d_max_0, and d_min_0 may be set. In general, the parameter c_0 may be set to 0. d_max_0 and d_min_0 may be set according to empirical values; for example, d_max_0 may be set to 6.
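The exact update rule is not reproduced here; the following sketch assumes a weighted running average with weight 1/c_t, consistent with the description above, and uses a single β for both bounds (separate counters would be kept if β1 and β2 differ). The function name is hypothetical.

```python
def update_average_range(d_max, d_min, c, v_max, v_min, beta=1.05):
    """One iteration of the running update of the average data range bounds."""
    c_new = beta * c + 1                 # c_t = beta * c_(t-1) + 1, with c_0 = 0
    w = 1.0 / c_new                      # weight of the statistics of the current iteration
    d_max_new = (1 - w) * d_max + w * v_max
    d_min_new = (1 - w) * d_min + w * v_min
    return d_max_new, d_min_new, c_new
```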
In S840, a first training operation data and a second training operation data are operated to obtain training output data, where the first training operation data is obtained by processing a first training input data quantized by the data quantization parameter with a first operator quantized by an operator quantization parameter, and the second training operation data is obtained by processing a second training input data quantized by the data quantization parameter with a second operator quantized by the operator quantization parameter.
The first training input data and the second training input data may be processed using the data quantization parameter, respectively, to obtain quantized first training input data and quantized second training input data. The quantized first training input data may be input into a quantized first operator to obtain first training operation data. The quantized second training input data may be input into a quantized second operator to obtain second training operation data. Thereafter, the first training operational data and the second training operational data may be operated on to obtain training output data.
The first training operation data may be obtained by converting data output by the quantized first operator. The number of bits of the first training operation data is smaller than the number of bits of the data output by the quantized first operator.
The second training operation data may be obtained by converting data output by the quantized second operator. The number of bits of the second training operation data is smaller than the number of bits of the data output by the quantized second operator.
The average significand can be counted for the result of the processing of the quantized first training input data by the quantized first operator and the result of the processing of the quantized second training input data by the quantized second operator.
The upper limit of the average data range may be counted to obtain the average significand.
After t iterations, the upper limit of the obtained average data range can be expressed as:
b_t = (1 - 1/c_t) · b_(t-1) + (1/c_t) · bn_t
where b_(t-1) is the upper limit of the average data range obtained after the (t-1)-th iteration, bn_t is the upper limit of the range counted for the output data of the operator in the t-th iteration, and c_t is continuously updated with the number of iterations, c_t = β3 · c_(t-1) + 1, where β3 is a constant. The input data of the t-th iteration may be the first training input data or the second training input data.
Before performing the iteration, the parameters c_0, β3, and b_0 may be set. In general, the parameters b_0 and c_0 may be set to 0.
Any one of the parameters β1, β2, and β3 may be set randomly or according to a certain rule, or any of them may also be set manually. In the iteration process, any one of the parameters β1, β2, and β3 may be kept unchanged or may be adjusted according to a certain rule. This is not limited in this application.
According to the upper limit of the average data range obtained by t iterations, an offset parameter N can be determined:
N = max(ceil(log2(b_t)) - m, 0)
where ceil is the ceiling function, ceil(log2(b_t)) is the number of significant bits of b_t, i.e., the average number of significant bits after t iterations, and m is the number of bits of the data after the number of bits is reduced.
Taking the operator output format as int32 and the value of m as 16, N = max(ceil(log2(b_t)) - 16, 0).
For the output of the quantized operator, a value whose significant digits exceed N is subjected to a saturation operation, that is, N bits of '1' are used to represent such a value. In other words, when the output of the quantized operator is larger than the value indicated by N bits of '1', the numerical value is represented by N bits of '1'.
Each value in the output data of the quantized operator is shifted to the right by N bits. Thereafter, a saturation operation may be performed to limit the resulting value to m bits. It should be understood that m is a preset number. When the result of shifting to the right by N bits is larger than the range of values that m bits can represent, m bits of '1' are taken as the result of the saturation operation. When the result of shifting to the right by N bits is less than or equal to the numerical range that m bits can represent, the result of shifting to the right by N bits is used as the result of the saturation operation.
The bit indicated by the offset parameter N is the P-th bit in the output data of the quantized operator, where P = Q - N.
According to the offset parameter N, only the (P + 1)-th bit to the (P + m)-th bit (or only the P-th bit to the (P + m - 1)-th bit) in the output data of the quantized operator may be reserved, thereby implementing compression (i.e., format conversion) of the output data of the quantized operator.
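As an illustrative sketch under the assumptions above (int32 operator outputs and m = 16), the offset parameter N and the bit-retention step could be written as follows; the function names offset_parameter and compress are hypothetical.

```python
import math
import numpy as np

def offset_parameter(b_t: float, m: int = 16) -> int:
    """Offset parameter N from the average significand of the quantized operator output."""
    return max(math.ceil(math.log2(b_t)) - m, 0)

def compress(out32: np.ndarray, n: int, m: int = 16) -> np.ndarray:
    """Keep m bits starting at the position indicated by N: right-shift, then saturate to m bits."""
    shifted = out32 >> n                       # drop the lowest N bits
    return np.clip(shifted, 0, 2 ** m - 1)     # values that do not fit in m bits become m bits of '1'
```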
Thereafter, the converted first training operation data and second training operation data may be operated on.
The processing of the input data by the quantized first operator can be expressed as:
conv1_out = d1_q1 * w1_q2
where conv1_out is the output of the quantized first operator, d1_q1 is the input data of the first operator quantized with the data quantization parameter, and w1_q2 is the parameter of the first operator quantized with the operator quantization parameter.
The processing of the input data by the quantized second operator can be expressed as:
conv2_out = d2_q1 * w2_q2
where conv2_out is the output of the quantized second operator, d2_q1 is the input data of the second operator quantized with the data quantization parameter, and w2_q2 is the parameter of the second operator quantized with the operator quantization parameter.
Taking the bitwise accumulation operation (elementwise) as an example, the operation result can be expressed as:
R = tr(conv1_out, N) + tr(conv2_out, N)
where R is the operation result, tr (x, N) represents the conversion of the data x, and the converted result includes the lowest predetermined number of bits after the data x is shifted to the right by N bits.
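Putting the pieces together, a minimal sketch of this bitwise accumulation over two quantized branches might look as follows, with matrix products standing in for the conv operators; the inputs d1_q and d2_q are assumed to have already been quantized with the shared data quantization parameter, and the weights with the shared operator quantization parameter.

```python
import numpy as np

def quantized_branch_add(d1_q, d2_q, w1_q, w2_q, n, m=16):
    """R = tr(conv1_out, N) + tr(conv2_out, N) for two quantized conv-like branches."""
    conv1_out = d1_q @ w1_q                          # int8-range operands accumulated in wider integers
    conv2_out = d2_q @ w2_q
    tr = lambda x: np.clip(x >> n, 0, 2 ** m - 1)    # right-shift by N, then saturate to m bits
    return tr(conv1_out) + tr(conv2_out)             # added directly, without inverse quantization
```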
In S850, the data quantization parameter and the operator quantization parameter are adjusted according to a difference between the dequantized training output data and the preset operation result.
If the training output data is obtained after a conversion that reduces the number of bits, the training output data may be inversely converted, and the inversely converted data may then be inversely quantized. That is, bits may be added to the right side of the converted data so that the number of bits of the inversely converted data equals the number of bits of the quantized output data of the first operator and the second operator. It should be understood that the values of the added bits may all be '0'. The data after the bit addition may then be shifted left by N bits to obtain the inversely converted training output data.
For the first operator and the second operator which are both conv operators, training output data is obtained by performing bitwise addition operation on the output of the first operator and the output of the second operator, and the inverse quantization of the training output data can be the product of the training output data multiplied by the scale in the data quantization parameter and the scale in the operator quantization parameter.
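A minimal sketch of this inverse conversion and inverse quantization is shown below, assuming the integer training output data R and the scales s1 and s2 of the data quantization parameter and the operator quantization parameter; the function name is hypothetical.

```python
def dequantize_output(r, n, s1, s2):
    """Undo the bit compression and the quantization of the accumulated result."""
    restored = r << n          # inverse conversion: left-shift by the offset parameter N
    return restored * s1 * s2  # inverse quantization using the data and operator quantization scales
```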
Through S810 to S850, a data quantization parameter and an operator quantization parameter that make the operation result more accurate can be obtained.
The parameters of the first operator and the second operator are quantized respectively by using the operator quantization parameter. The data processing system 600 may be determined from the data quantization parameter, the quantized first operator, the quantized second operator, and the offset parameter N.
In the data processing system 600, the operation model 640 may be an operation model in the original neural network model before quantization, or parameters in the operation model 640 may be quantized parameters in the operation model in the original neural network model.
In some embodiments, S810 to S850 may be performed by a server. The server may send the data quantization parameter, the quantized first operator, the quantized second operator, the offset parameter, and the like to the terminal device. So that the terminal device can determine the data processing system 600 shown in fig. 9.
Fig. 9 is a schematic structural diagram of a data processing system according to an embodiment of the present application. The data processing system 600 may be located in the computing module 111 of the execution device 110, and the data processing system 600 may be the target model/rule 101 shown in fig. 1. The data processing system 600 may be obtained by the training device 120 shown in fig. 1 or another device quantizing the trained neural network model.
The data processing system 600 may also be referred to as a quantized neural network model. Data processing system 600 may be part of CNN 200 shown in fig. 2 or CNN 300 shown in fig. 3, or various components of data processing system 600 may be located in one or more CNNs. Data processing system 600 includes quantization model 610, first operator 620, second operator 630, and operational model 640. The quantization model 610 is used to quantize the first input data of the first operator 620 and the second input data of the second operator 630 using the data quantization parameter, respectively.
The format of the first input data and the format of the second input data may both be floating point numbers. For example, the floating point number may be a single-precision floating point number (float32) of 32 bits (bit) or a half-precision floating point number (float16) of 16 bits. The quantization model 610 may quantize the first input data and the second input data using the data quantization parameter to obtain quantized first input data and quantized second input data, respectively. The format of the quantized first input data and the format of the quantized second input data may both be 8-bit quantization results (int 8).
The data quantization parameter may include a step size (scale) and an offset (offset). Here, scale represents the increment of the floating-point number corresponding to each increment of '1' of the quantization result, and offset represents the ratio of the floating-point number represented by the minimum value of the quantization result to the scale. The first operator 620 is configured to process the first input data quantized by the quantization model 610 to obtain first operation data. The second operator 630 is configured to process the second input data quantized by the quantization model 610 to obtain second operation data. The parameters of the first operator 620 and the parameters of the second operator 630 are quantized using operator quantization parameters. That is, when determining the data processing system 600, the operator quantization parameter is used to quantize the parameter of the first operator before quantization in the trained neural network model, so as to obtain the parameter of the first operator 620; the operator quantization parameter is used to quantize the parameter of the second operator before quantization in the trained neural network model, so as to obtain the parameter of the second operator 630.
The operator quantization parameter may include a step size (scale) and an offset (offset). The determination of the data quantization parameter and the determination of the operator quantization parameter can be referred to the description of fig. 6 to 8. The first operator 620 and the second operator 630 are used to perform the same type of operation. That is, the first operator 620 and the second operator 630 may both be the same type of operator in the neural network model. For example, the first operator 620 and the second operator 630 may both be convolution operators for convolution operations, e.g., the first operator 620 and the second operator 630 may represent one convolution layer, respectively. The various modules in data processing system 600 may each be part of CNN 200 shown in fig. 2 or part of CNN 300 shown in fig. 3.
The first operator 620 and the second operator 630 may also both be fully-connected layers. The excitation function of each neuron of the full connection layer generally adopts a linear rectification function (ReLU). The output of the first operator 620 and the output of the second operator 630 subsequently need to be operated on by the operation model 640. In some embodiments, the first operator 620 and the second operator 630 may be located in different CNNs, and the operation model 640 may be used to process data output by different CNNs. Of course, the first operator 620 and the second operator 630 can be the same type of operator in other types of neural network models.
When the parameter of the conv operator and the input data of the conv operator are both int8, the output of the conv operator is a 32-bit quantization result (int32). That is, when the first operator and the second operator are both conv operators, and the parameters of the first operator, the parameters of the second operator, the quantized first input data, and the quantized second input data are all in int8 format, the output data of the first operator and the second operator are in int32 format. For the conv operator, the parameters of the conv operator can also be understood as the weights in the conv operator.
For the conv operator, the result conv_out (in int32 format) obtained by processing the quantized input data d_q1 (in int8 format) can be expressed as:
conv_out = d_q1 * w1_q2
where w1_q2 is the parameter of the operator quantized using the operator quantization parameter, also in int8 format.
The operation model 640 is used for operating on the first operation data and the second operation data. The operation model 640 may perform a linear operation on the first operation data and the second operation data. The operation model 640 may also perform a bitwise operation on the first operation data and the second operation data, such as a bitwise addition or a bitwise multiplication operation.
The data processing system 600 quantizes the input data of the two operators respectively by using the data quantization parameters, the parameters of the two operators are obtained by using the operator quantization parameters, and then the output of the two operators can be operated, so that the output of the two operators is prevented from being inversely quantized, the calculation amount is reduced, the operation power consumption is reduced, and the data processing performance of the data processing system 600 is improved.
Inverse quantization is a way of vector computation. In general NPU, the vector calculation method is less computationally expensive than the matrix calculation method. The matrix calculation method includes convolution operation and the like. The operation of the neural network model often includes multiple matrix calculations and multiple vector calculations in series. In general, the computational power of the processor on the matrix computation is higher than that of vector computation, and when the neural network model needs to perform a large number of vector computations, the matrix computation depending on the vector computation result is in a waiting state in the case of incomplete vector computation, which causes pipeline interruption and performance bottleneck (called vector bound).
The first operator and the second operator are obtained by utilizing the same quantization parameters for quantization, the input data of the first operator and the input data of the second operator are quantized by utilizing the same parameters, the operation module 640 of the data processing system 600 can operate the output of the first operator and the output of the second operator, and the data processing system 600 can reduce inverse quantization operation required by a neural network model, so that a vector bound is relieved, and the data processing capacity of the neural network model is effectively improved.
The data processing system 600 may also include a format conversion model 650. The format conversion model 650 may be used for data compression and may also be referred to as a compression model. The format conversion model 650 is used to reduce the number of bits of the first original operation data output by the first operator 620 to obtain the first operation data. The format conversion model 650 is also used to reduce the number of bits of the second original operation data output by the second operator 630 to obtain the second operation data.
For example, format conversion model 650 is used to convert the formats of the first and second raw operational data formatted as int32 to int16, respectively. The data of int16 obtained by format conversion of the first original operation data output by the first operator 620 is the first operation data, and the data of int16 obtained by format conversion of the second original operation data output by the second operator 630 is the second operation data. Format conversion model 650 may determine the first operational data and the second operational data based on an offset parameter. The processing of format conversion model 650 may be understood as compression of data.
As shown in fig. 10 (a), when bits before a bit indicated by the offset parameter in first original operation data that is obtained by processing and outputting quantized first input data by the first operator are all 0, the first operation data includes bits indicated by the offset parameter and bits after the bit in the first original operation data, which are a total of a preset number of bits.
As shown in (B) of fig. 10, when the bits before the bit indicated by the offset parameter in the first original operation data are not all 0, and there is a bit having a value of 1, the preset number of bits included in the first operation data are all "1".
Similarly, when the bits before the bit indicated by the offset parameter in the second original operation data that is processed and output by the second operator on the quantized second input data are all 0, the second operation data includes the bit indicated by the offset parameter in the second original operation data and the bit number after the bit that are the bits of the preset number altogether.
When the bits before the bit indicated by the second original operation data offset parameter are not all 0, and there is a bit with a value of 1, the preset number of bits included in the second operation data are all "1". The format conversion model 650 may perform a shift right operation and a saturation operation on any one of the first raw operation data and the second raw operation data.
The shift right operation can be expressed as:
conv_out' = conv_out >> N
where >> is the right-shift symbol, conv_out represents the original operation result, conv_out' denotes the result of the right-shift operation, and N is the number of bits shifted to the right. It should be understood that the sum of the number of right-shifted bits N and the number of bits of the data output by the format conversion model 650 is less than or equal to the number of bits of the original operation result conv_out.
Saturation operations may be performed on the right shift operation result:
conv_INT16 = clip(conv_out', 0, 2^p - 1)
where conv_INT16 represents the result of the saturation operation, and p is the difference between the number of bits of the original operation result conv_out and the number of right-shifted bits N. The clip(a, b, c) operator indicates that a is limited between b and c: when a is smaller than b, the operation result is b; when a is greater than or equal to b and less than or equal to c, the operation result is a; and when a is greater than or equal to c, the operation result is c.
When the format of the original operation data is int32, that is, the original operation data includes 32 bits, and the number of bits of the operation result, that is, the preset number is 16, p is 32-N, and N is less than or equal to 16.
The result of the clip operator may be represented with the same number of bits as the original operation result, and the lowest m bits (the preset number) of the saturation operation result may be taken as the operation result. That is, the lowest m bits of the saturation operation result are the operation data corresponding to the original operation data.
That is, the original operation data may be shifted to the right by N bits. The right-shift result of the original operation data is then compared with the maximum value that the preset number of bits can represent (i.e., the preset number of bits all being '1'). When the right-shift result of the original operation data is larger, the preset number of '1' bits is used as the operation result; otherwise, the lowest preset number of bits of the right-shift result of the original operation data is taken as the operation data.
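The following sketch combines the right-shift and saturation operations described above into a single format-conversion helper; the q_bits and m defaults (32 and 16) follow the int32/int16 example, the function name is hypothetical, and the final m-bit saturation follows the behavior described for the compression module.

```python
import numpy as np

def format_convert(conv_out: np.ndarray, n: int, q_bits: int = 32, m: int = 16) -> np.ndarray:
    """Right-shift, saturate, and keep an m-bit operation result."""
    p = q_bits - n                                   # e.g. p = 32 - N for an int32 conv output
    shifted = conv_out >> n                          # conv_out' = conv_out >> N
    saturated = np.clip(shifted, 0, 2 ** p - 1)      # clip(conv_out', 0, 2^p - 1)
    return np.minimum(saturated, 2 ** m - 1)         # results exceeding m bits become m bits of '1'
```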
The first operation data may be the first original operation data, or may be data obtained by performing a shift operation and a saturation operation on the first original operation data. The second operation data may be the second original operation data, or may be data obtained by performing a shift right operation and a saturation operation on the second original operation data.
After the right-shift operation and the saturation operation, the format conversion model 650 may convert the original operation data into operation data, which is used as the input of the operation module 640.
Through the conversion of the data format by the format conversion model 650, the amount of computation by the operation module 640 can be reduced, thereby improving the data processing performance of the data processing system 600.
The data processing method described in fig. 11 or fig. 12 can be implemented using the quantized neural network model. Fig. 11 is a schematic flowchart of a data processing method according to an embodiment of the present application. The data processing method may be performed by the calculation module 111 in the execution device 110 shown in fig. 1.
S1201, obtaining a quantized neural network model, wherein the quantized neural network model is obtained by quantizing an original neural network model, the original neural network model comprises a first operator, a second operator and a first operation module, the first operator is used for carrying out the same type of operation as the second operator, and the first operation module is used for carrying out the first operation on the output of the first operator and the output of the second operator.
S1202, the quantized neural network model is used to process first input data of a third operator and second input data of a fourth operator, where the quantized neural network model includes a quantization module, the third operator, the fourth operator, and a second operation module, the quantization module is used to quantize the first input data and the second input data respectively by using a data quantization parameter, the second operation module is used to perform the first operation, the third operator is the quantized first operator, the fourth operator is the quantized second operator, and the data quantization parameter is determined according to the range of the first training input data of the first operator and the range of the second training input data of the second operator.
The second operation module may be configured to perform the first operation on first operation data and second operation data. The first operation data is obtained by processing the quantized first input data with the third operator, and the second operation data is obtained by processing the quantized second input data with the fourth operator. The original neural network is quantized, and the obtained quantized neural network model performs the same operations as the original neural network model; only the precision of the operation result changes.
The data quantization parameter is determined according to the range of the first training input data of the first operator and the range of the second training input data of the second operator, so that the data processing precision of the quantized neural network model is improved.
The first input data and the second input data are processed by using the data quantization parameter, so that the second operation module can operate the first operation data and the second operation data without performing operation after inverse quantization on the first operation data and the second operation data. The first operation data is obtained by processing the quantized first input data by the third operator, and the second operation data is obtained by processing the quantized second input data by the fourth operator.
Through S1201 to S1202, the operation precision of the quantized neural network model is improved, and meanwhile, the requirement of the quantized neural network model on inverse quantization operation is reduced, operation resources are saved, and the processing efficiency is improved.
Optionally, the data quantization parameter is obtained by adjusting an initial data quantization parameter, where the adjustment minimizes a difference between actual training output data and preset training output data.
The initial quantization parameter is determined from a range of the first training input data and a range of the second training input data.
The preset training output data corresponds to a training input data set that includes the first training input data and the second training input data.
The actual training output data is obtained by processing the first training input data and the second training input data by using the quantized neural network model, and the quantization module is used for quantizing the first training input data and the second training input data by using the initial data quantization parameter.
Determining an initial quantization parameter based on the range of the first training input data and the range of the second training input data. And processing the first training input data and the second training input data by using the quantized neural network model to obtain actual training output data, wherein the quantization module quantizes the first training input data and the second training input data by using the initial data quantization parameter. And adjusting the initial data quantization parameter to minimize the difference between the actual training output data and the preset training output data so as to obtain the data quantization parameter.
It should be appreciated that in the quantized neural network model, the third operator is configured to perform a first operation on the quantized first training input data to obtain first training operation data; the fourth operator is used for carrying out second operation on the quantized second training input data to obtain second training operation data; the second operation module is used for performing third operation on the first training operation data and the second training operation data to obtain the actual training output data.
Because the difference between the actual training output data and the preset training output data is minimum due to the data quantization parameters, the quantized neural network model has higher precision.
Minimizing the difference between the actual training output data and the preset training output data may be understood as gradually adjusting the initial data quantization parameter according to the difference between the actual training output data and the preset training output data until that difference falls within a preset range, or, when the number of adjustments reaches a preset number, determining the initial data quantization parameter at that time as the adjusted data quantization parameter.
Optionally, the parameter of the third operator is obtained by quantizing the parameter of the first operator by using an operator quantization parameter, the parameter of the fourth operator is obtained by quantizing the parameter of the second operator by using the operator quantization parameter, and the operator quantization parameter is determined according to a parameter range of the first operator and a parameter range of the second operator.
An operator quantization parameter is determined according to the parameter range of the first operator and the parameter range of the second operator, and the parameter of the first operator and the parameter of the second operator are quantized respectively by using the operator quantization parameter to obtain the parameter of the third operator and the parameter of the fourth operator. In this way, the data processing accuracy of the quantized neural network model is improved while quantization reduces the amount of computation.
Optionally, the quantized neural network model further includes a compression module, the compression module is configured to compress the output of the third operator and the output of the fourth operator according to an offset parameter, so as to obtain the first operational data and the second operational data, the offset parameter is used to indicate a position of a highest bit in the compressed data in the data before the compression, and the second operational module is configured to perform the first operation on the compressed data.
The offset parameter is determined according to a significand of first training operation data obtained by processing first training input data quantized by the data quantization parameter using the third operator and a significand of second training operation data obtained by processing second training input data quantized by the data quantization parameter using the fourth operator.
Determining the offset parameter according to the significant digits of the first training operation data and the significant digits of the second training operation data reduces the amount of computation and improves the data processing precision of the quantized neural network model.
Fig. 12 is a schematic flowchart of a data processing method according to an embodiment of the present application.
The data processing method 700 includes S710 to S720. The data processing method 700 may be performed by the calculation module 111 in the execution device 110 shown in fig. 1.
At S710, first input data of a first operator in a neural network model and second input data of a second operator in the neural network model are quantized respectively by using a data quantization parameter.
At S720, an operation is performed on first processing information and second processing information. The first processing information is obtained by processing the quantized first input data by using the first operator, and the second processing information is obtained by processing the quantized second input data by using the second operator.
Through S710 and S720, the first input data of the first operator and the second input data of the second operator are quantized by using the same data quantization parameter, so that the output of the first operator and the output of the second operator can be directly operated without carrying out inverse quantization and other processing, and the data processing efficiency of the neural network model is improved.
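Illustratively, the following sketch (not part of the claimed solution) shows why a shared data quantization parameter allows the outputs of the two operators to be operated on directly; simple linear quantization is assumed, the element-wise multiplication stands in for the operation performed by the first operator and the second operator, and the names scale_x and scale_w are introduced only for illustration.

    import numpy as np

    def quantize(x, scale):
        # Linear quantization: represent x as round(x / scale) in the integer domain.
        return np.round(x / scale).astype(np.int32)

    scale_x = 0.05   # shared data quantization parameter for both operators' input data
    scale_w = 0.01   # shared operator quantization parameter for both operators' parameters

    x1, x2 = np.random.randn(4), np.random.randn(4)   # first input data and second input data
    w1, w2 = np.random.randn(4), np.random.randn(4)   # parameters of the two operators

    # Integer-domain processing by the two quantized operators.
    y1_q = quantize(w1, scale_w) * quantize(x1, scale_x)
    y2_q = quantize(w2, scale_w) * quantize(x2, scale_x)

    # Because both branches share the same scales, their integer outputs lie in the same
    # domain and can be operated on directly; a single inverse quantization at the end
    # recovers an approximation of the floating-point result.
    sum_q = y1_q + y2_q
    print(sum_q * scale_x * scale_w)   # approximately equal to w1 * x1 + w2 * x2

If the two branches used different scales, the integer outputs would first have to be inverse quantized or rescaled before the operation, which is exactly the overhead the shared parameter avoids.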
The first parameter of the first operator and the second parameter of the second operator may be floating point numbers, or may be obtained by quantizing parameters of floating point numbers by using operator quantization parameters.
When the first parameter and the second parameter are obtained through quantization, the sizes of the first operator and the second operator can be reduced, and fewer resources are occupied when the first operator and the second operator process data. Because the same operator quantization parameter is used for the quantization of the first parameter and the quantization of the second parameter, the data processing results of the first operator and the second operator can be operated on directly, without inverse quantization or other processing, which improves the data processing efficiency of the neural network model.
In order to improve the calculation accuracy of the quantized first operator and second operator, the operator quantization parameter may be obtained according to a range of the first parameter and a range of the second parameter.
The determination of the operator quantization parameter may be performed by the training device 120 shown in fig. 1 or by other devices. Of course, the apparatus that determines the operator quantization parameter may be the same as or different from the apparatus that performs S710 to S720.
The operator quantization parameter can be obtained according to the maximum value and the minimum value in the first parameter and the second parameter. Illustratively, the operator quantization parameter may include scale and offset. The difference between the maximum value and the minimum value of the first parameter and the second parameter can be equally divided according to the number of bits of the quantization result to obtain the scale of the operator quantization parameter. The offset in the operator quantization parameter may be determined according to a ratio of a minimum value of the first parameter and the second parameter to a scale in the operator quantization parameter.
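A minimal sketch of this derivation is given below, assuming 8-bit unsigned quantization; the function names and the exact rounding are illustrative assumptions rather than the specific implementation of this application.

    import numpy as np

    def operator_quant_params(w1, w2, num_bits=8):
        # Shared scale and offset derived from the joint range of both operators' parameters.
        w_min = float(min(w1.min(), w2.min()))
        w_max = float(max(w1.max(), w2.max()))
        # Divide the span between the maximum and the minimum evenly over the representable levels.
        scale = (w_max - w_min) / (2 ** num_bits - 1)
        # Offset taken from the ratio of the minimum value to the scale.
        offset = int(round(w_min / scale))
        return scale, offset

    def quantize_parameter(w, scale, offset, num_bits=8):
        # Quantize a parameter tensor of the first or second operator with the shared
        # operator quantization parameter to obtain the third or fourth operator's parameter.
        q = np.round(w / scale) - offset
        return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

Because the parameters of both operators are quantized with the same (scale, offset), the outputs of the quantized operators fall in one integer domain and can be operated on without rescaling.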
In order to improve the calculation accuracy of the quantized first operator and second operator, the data quantization parameter may be determined according to the range of data processed by the first operator and the second operator.
The determination of the data quantization parameter may be performed by the training device 120 shown in fig. 1 or by other devices. Of course, the apparatus that determines the data quantization parameter may be the same as or different from the apparatus that performs S710 to S720.
In particular, a training data set may be obtained, the training data set comprising first training input data and second training input data. The first training input data is input data of the first operator before quantization, and the second training input data is input data of the second operator before quantization. The data quantization parameter may be determined based on the range of the first training input data and the range of the second training input data.
For example, the data quantization parameter may be determined by a plurality of first training input data and a plurality of second training input data. Each of the first training input data includes a plurality of values, each of the second training input data includes a plurality of values, an average maximum value in each of the first training input data and each of the second training input data may be used as a maximum value that can be represented by a quantization result of the data quantization parameter, and an average minimum value in each of the first training input data and each of the second training input data may be used as a minimum value that can be represented by a quantization result of the data quantization parameter. The average maximum value may be a weighted average of a plurality of maximum values, and the average minimum value may be a weighted average of a plurality of minimum values. The weight may be understood as a degree of influence of a maximum value or a minimum value in each of the first training input data and the second training input data on the data quantization parameter. In particular, reference may be made to the description of fig. 8.
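The following sketch illustrates such a weighted-average derivation over a calibration set; representing the data quantization parameter as a (scale, offset) pair and using uniform weights by default are assumptions made for illustration.

    import numpy as np

    def data_quant_params(first_inputs, second_inputs, num_bits=8, weights=None):
        # Per-sample maxima and minima over both operators' training input data.
        maxima = [float(d.max()) for d in first_inputs] + [float(d.max()) for d in second_inputs]
        minima = [float(d.min()) for d in first_inputs] + [float(d.min()) for d in second_inputs]
        # The weighted averages serve as the largest and smallest values that the
        # quantization result of the data quantization parameter can represent.
        avg_max = np.average(maxima, weights=weights)
        avg_min = np.average(minima, weights=weights)
        scale = (avg_max - avg_min) / (2 ** num_bits - 1)
        offset = int(round(avg_min / scale))
        return scale, offset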
In order to improve the calculation accuracy of the quantized first operator and the quantized second operator, before the data quantization parameter and the operator quantization parameter are used for processing the actual data, the data quantization parameter and/or the operator quantization parameter may be adjusted according to a difference between a result obtained by operating the processing result of the quantized first training input data by the first operator and the processing result of the second training input data by the second operator and a preset operation result.
Specifically, the training data set further includes preset operation results corresponding to the first training input data and the second training input data. The preset operation result corresponding to the first training input data and the second training input data may be an operation result obtained by operating a processing result of the first operator on the first training input data before quantization and a processing result of the second operator on the second training input data before quantization. Alternatively, the result of the preset operation may be manually set. The format of the preset operation result may be a floating point number.
The operation in S720 may be performed on the first training operational data and the second training operational data to obtain training output data. The first training operation data is obtained by processing first training input data quantized by using the data quantization parameter by a first operator, and the second training operation data is obtained by processing second training input data quantized by using the data quantization parameter by a second operator.
The training output data may be inverse quantized. The data quantization parameter and/or the operator quantization parameter may be adjusted according to a difference between an inverse quantization result of the training output data and the preset operation result.
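A sketch of this adjustment loop is shown below; run_quantized_model (assumed to return the integer training output data together with its output scale), the multiplicative update, and the stopping thresholds are illustrative assumptions and not the specific adjustment rule of this application.

    def calibrate_data_scale(scale, first_inputs, second_inputs, preset_results,
                             run_quantized_model, tolerance=1e-2, max_steps=100, shrink=0.95):
        for _ in range(max_steps):
            diffs = []
            for x1, x2, target in zip(first_inputs, second_inputs, preset_results):
                out_q, out_scale = run_quantized_model(x1, x2, scale)
                # Inverse-quantize the training output data and compare it with the preset result.
                diffs.append(float(abs(out_q * out_scale - target).mean()))
            if sum(diffs) / len(diffs) < tolerance:
                break                 # the difference is small enough; stop adjusting
            scale *= shrink           # simple heuristic adjustment of the data quantization parameter
        return scale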
In order to further reduce the operation amount of the neural network model, the operation result of the first operator and the operation result of the second operator may be subjected to a processing of reducing the number of bits.
The first operator processes the quantized first input data and outputs first original operation data. The second operator processes the quantized second input data and outputs second original operation data.
The predetermined number of most significant bits in the first original operation data may be used as the first operation data, and the predetermined number of most significant bits in the second original operation data may be used as the second operation data, for the subsequent operation. The predetermined number of most significant bits is, in other words, the leftmost predetermined number of bits.
Alternatively, the first operational data and/or the second operational data may be determined based on the offset parameter.
When bits before the bit indicated by the offset parameter in the first original operation data are all 0, the first operation data includes a preset number of bits after the bit indicated by the offset parameter in the first original operation data.
When the bits before the bit indicated by the offset parameter in the first original operation data are not all 0, and there is a bit with a value of 1, the preset number of bits included in the first operation data are all "1".
Similarly, when bits before the bit indicated by the offset parameter in the second original operation data are all 0, the second operation data includes a preset number of bits after the bit indicated by the offset parameter in the second original operation data.
When the bits before the bit indicated by the offset parameter in the second original operation data are not all 0, that is, there is a bit with a value of 1, the preset number of bits included in the second operation data are all "1".
Optionally, the first original operation data and the second original operation data are both subjected to a compression process for reducing the number of bits.
It should be understood that if the processing result of the first operator or the processing result of the second operator has valid data at or above the bit indicated by the offset parameter, the first operation data corresponding to the processing result of the first operator or the second operation data corresponding to the processing result of the second operator may be represented as the preset number of bits all set to "1". This approach may also be understood as a saturation operation. That is, when the processing result is greater than the maximum value that can be represented by the bits following the bit indicated by the offset parameter, the processing result is represented as all "1"s in the preset number of bits, i.e., the maximum value that the preset number of bits can represent.
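The saturating compression described above may be sketched as follows for a non-negative integer accumulator; the bit-numbering convention used here (the offset parameter gives the position, counted from the least significant bit, of the highest bit that is kept) is an assumption made for illustration.

    def compress(acc, offset_bit, keep_bits):
        # If any bit above the position indicated by the offset parameter is 1, saturate:
        # the compressed value becomes all '1's, the maximum that keep_bits bits can represent.
        if acc >> (offset_bit + 1):
            return (1 << keep_bits) - 1
        # Otherwise keep the preset number of bits whose highest bit sits at offset_bit.
        return (acc >> (offset_bit + 1 - keep_bits)) & ((1 << keep_bits) - 1)

    # Example: keep 8 bits with the highest kept bit at position 11.
    print(bin(compress(0b0000_1011_0101_0011, 11, 8)))   # window kept, no saturation
    print(bin(compress(0b0101_0000_0000_0000, 11, 8)))   # higher bits are set, result is 0b11111111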
The offset parameter may be obtained according to a result of processing the first training input data by the first operator and a result of processing the second training input data by the second operator.
The offset parameter may be determined according to the significand of data output by the first operator processing the quantized first training input data, and the significand of data output by the second operator processing the quantized second training input data.
For example, the first operator may process a plurality of pieces of quantized first training input data, and the processing result of the first operator for each piece includes a plurality of values, which may form a matrix, a vector, or the like. Similarly, the second operator may process a plurality of pieces of quantized second training input data, and each of its processing results also includes a plurality of values. The offset parameter may be determined based on an average of the largest significant-digit counts in the processing results. For example, the average may be rounded up, and the offset parameter is then used to indicate the most significant bit corresponding to that rounded-up number of significant digits.
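A sketch of this derivation is given below, assuming the processing results are numpy integer arrays and that the significant-digit count of a result is the binary length of its largest magnitude; both assumptions are made only for illustration.

    import math
    import numpy as np

    def derive_offset(first_results, second_results):
        # Largest significant-bit count within each processing result.
        sig_bits = [int(np.abs(r).max()).bit_length()
                    for r in list(first_results) + list(second_results)]
        # Average over all processing results, rounded up; the offset parameter then marks
        # the corresponding most significant bit kept by the compression module.
        return math.ceil(sum(sig_bits) / len(sig_bits))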
The offset parameter may also be adjusted according to a difference between an inverse quantization result of the training output data and the preset operation result. Therefore, the precision and the accuracy of the data processing result are higher.
In the process of processing data such as images and audio by a neural network model, a plurality of operators are generally required. Before the neural network model quantization method provided by the embodiment of the present application is performed, the original neural network model may be traversed to determine the processing structure of the original neural network model, which includes a first operator, a second operator, and an operation module for operating on the output of the first operator and the output of the second operator. Fig. 9 illustrates an example in which the first operator and the second operator are convolution operators and the operation module is an eltwise operator.
Fig. 13 is a schematic flowchart of a processing structure identification method provided in an embodiment of the present application. The processing structure identification method may be performed by the training device 120 shown in fig. 1 or other devices.
In S910, it is determined whether the node corresponding to the node i is a convolution operator. If the node i is not a convolution operator, let i equal i +1, and proceed with S910 again. If the node i is a convolution operator, S920 is performed.
At S920, it is determined whether the output data of node i is input to the eltwise operator. If the output data of the node i is not the input of the eltwise operator, let i be i +1, and proceed again to S910. If the output data of node i is the input of the eltwise operator, S930 is performed.
At S930, it is determined whether the other input of the eltwise operator is the output data of a convolution operator. If the other input of the eltwise operator is not the output data of a convolution operator, let i = i + 1, and perform S910 again. If the other input of the eltwise operator is the output data of a convolution operator, the method 800 proceeds with node i as the first operator and the convolution operator producing the other input of the eltwise operator as the second operator. Then, let i = i + 1, and perform S910 again. When all the nodes in the neural network have been traversed, that is, when i is larger than the number of nodes in the neural network model, S910 is stopped. By the method 900, all structures in the neural network model in which the outputs of two convolution operators are input into one eltwise operator can be identified.
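A sketch of this traversal is given below, assuming a graph representation in which each node exposes its operator type, the nodes consuming its output, and the nodes producing its inputs; these attribute names are illustrative and do not correspond to a particular framework.

    def find_conv_eltwise_structures(nodes):
        # Return (first convolution, second convolution, eltwise) triples in which the
        # outputs of two convolution operators feed the same eltwise operator.
        structures = []
        for node in nodes:                                  # S910: is node i a convolution operator?
            if node.op_type != "Conv":
                continue
            for consumer in node.output_nodes:              # S920: does its output feed an eltwise operator?
                if consumer.op_type != "Eltwise":
                    continue
                for other in consumer.input_nodes:          # S930: is the other input also a convolution output?
                    if other is not node and other.op_type == "Conv":
                        structures.append((node, other, consumer))
        return structures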
The data processing system, the neural network model quantization method, and the data processing method provided by the embodiment of the present application are described above with reference to fig. 1 to 13, and the apparatus embodiment of the present application is described below with reference to fig. 14 to 17. It should be understood that the description of the data processing system, the neural network model quantification method, and the data processing method correspond to the description of the apparatus embodiment, and therefore, portions not described in detail may be referred to the above description.
Fig. 14 is a schematic structural diagram of a neural network model quantization apparatus according to an embodiment of the present application. The neural network model quantizing device 3000 may be located in the training apparatus 120 shown in fig. 1 or in other apparatuses. The neural network model quantizing device 3000 includes a storage module 3010 and a processing module 3020. The storage module 3010 is used to store programs.
When the program is run in the processing module 3020, the processing module 3020 is configured to: obtain an original neural network model, wherein the original neural network model comprises a first operator, a second operator and a first operation module, the first operator is used for carrying out a first operation, the second operator is used for carrying out a second operation, the first operation and the second operation are the same type of operation, and the first operation module is used for carrying out a third operation on the output of the first operator and the output of the second operator; determine a data quantization parameter according to a range of first training input data and a range of second training input data, wherein the first training input data is input data of the first operator, and the second training input data is input data of the second operator; and determine a quantized neural network model according to the original neural network model, wherein the quantized neural network model comprises a quantization module, a third operator, a fourth operator and a second operation module, the quantization module is used for quantizing first input data of the third operator and second input data of the fourth operator respectively by using the data quantization parameter, the third operator is the quantized first operator, the fourth operator is the quantized second operator, and the second operation module is used for performing the third operation.
Optionally, the processing module 3020 is further configured to obtain preset training output data corresponding to a training input data set, where the training input data set includes the first training input data and the second training input data.
The processing module 3020 is further configured to quantize the first training input data and the second training input data respectively by using the data quantization parameter. The processing module 3020 is further configured to process the quantized first training input data and the quantized second training input data by using the quantized neural network model to obtain actual training output data. The processing module 3020 is further configured to adjust the data quantization parameter according to a difference between the actual training output data and the preset training output data, so as to minimize the difference.
The quantization module is configured to quantize the first input data of the third operator and the second input data of the fourth operator respectively by using the adjusted data quantization parameter. Optionally, the processing module 3020 is further configured to determine an operator quantization parameter according to the parameter range of the first operator and the parameter range of the second operator.
The processing module 3020 is further configured to quantize the parameter of the first operator by using the operator quantization parameter to obtain a parameter of the third operator. The processing module 3020 is further configured to quantize the parameter of the second operator by using the operator quantization parameter, so as to obtain the parameter of the fourth operator.
Optionally, the quantized neural network model further includes a compression module, the compression module is configured to compress the output of the third operator and the output of the fourth operator according to an offset parameter, the offset parameter is used to indicate a position of a highest bit in the compressed data in the data before the compression, and the second operation module is configured to perform the first operation on the compressed data.
The processing module 3020 is further configured to quantize the first training input data and the second training input data respectively by using the data quantization parameter. The processing module 3020 is further configured to process the quantized first training input data by using the third operator, where the third operator outputs first training operation data. The processing module 3020 is further configured to process the quantized second training input data by using the fourth operator, where the fourth operator outputs second training operation data. The processing module 3020 is further configured to determine the offset parameter according to the significand of the first training operation data and the significand of the second training operation data.
Fig. 15 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus 2000 may be located in the execution device 110 shown in fig. 1 or in another device. The data processing device 2000 includes a memory module 2010 and a processing module 2020. The storage module 2010 is used for storing programs.
When the program is run in the processing module 2020, the processing module 2020 is configured to: obtain a quantized neural network model, wherein the quantized neural network model is obtained by quantizing an original neural network model, the original neural network model comprises a first operator, a second operator and a first operation module, the first operator and the second operator are used for carrying out the same type of operation, and the first operation module is used for carrying out a first operation on the output of the first operator and the output of the second operator; and process first input data of a third operator and second input data of a fourth operator by using the quantized neural network model, where the quantized neural network model includes a quantization module, the third operator, the fourth operator, and a second operation module, the quantization module is configured to quantize the first input data and the second input data by using a data quantization parameter, the second operation module is configured to perform the first operation, the third operator is the quantized first operator, the fourth operator is the quantized second operator, and the data quantization parameter is determined according to a range of first training input data of the first operator and a range of second training input data of the second operator.
Optionally, the data quantization parameter is obtained by adjusting an initial data quantization parameter, where the adjustment minimizes a difference between actual training output data and preset training output data.
The initial quantization parameter is determined from a range of the first training input data and a range of the second training input data. The preset training output data corresponds to a training input data set that includes the first training input data and the second training input data.
The actual training output data is obtained by processing the first training input data and the second training input data by using the quantized neural network model, and the quantization module is used for quantizing the first training input data and the second training input data by using the initial data quantization parameter.
Optionally, the parameter of the third operator is obtained by quantizing the parameter of the first operator by using an operator quantization parameter, the parameter of the fourth operator is obtained by quantizing the parameter of the second operator by using the operator quantization parameter, and the operator quantization parameter is determined according to a parameter range of the first operator and a parameter range of the second operator.
Optionally, the quantized neural network model further includes a compression module, the compression module is configured to compress the output of the third operator and the output of the fourth operator according to an offset parameter, the offset parameter is used to indicate a position of a highest bit in the compressed data in the data before the compression, and the second operation module is configured to perform the first operation on the compressed data.
The offset parameter is determined according to a significand of first training operation data obtained by processing first training input data quantized by the data quantization parameter using the third operator and a significand of second training operation data obtained by processing second training input data quantized by the data quantization parameter using the fourth operator.
Fig. 16 is a hardware configuration diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus 4000 shown in fig. 16 includes a memory 4001, a processor 4002, a communication interface 4003, and a bus 4004. The memory 4001, the processor 4002 and the communication interface 4003 are communicatively connected to each other via a bus 4004.
The memory 4001 may be a ROM, a static storage device, or a RAM. The memory 4001 may store a program; when the program stored in the memory 4001 is executed by the processor 4002, the processor 4002 and the communication interface 4003 are used to perform the steps of the data processing method of the embodiment of the present application.
The processor 4002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute the relevant programs to implement the functions required by the units in the data processing apparatus according to the embodiment of the present application, or to perform the data processing method according to the embodiment of the present application.
The processor 4002 may also be an integrated circuit chip having signal processing capabilities, such as the chip shown in fig. 4. In implementation, the steps of the data processing method according to the embodiment of the present application may be implemented by integrated logic circuits of hardware in the processor 4002 or instructions in the form of software.
The processor 4002 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM, an EPROM, or a register. The storage medium is located in the memory 4001; the processor 4002 reads the information in the memory 4001 and, in combination with its hardware, completes the functions required to be performed by the units included in the data processing apparatus according to the embodiment of the present application, or performs the data processing method according to the embodiment of the present application.
Communication interface 4003 enables communication between apparatus 4000 and other devices or a communication network using transceiver means such as, but not limited to, a transceiver. For example, the image to be processed may be acquired through the communication interface 4003.
Bus 4004 may include a path that transfers information between various components of apparatus 4000 (e.g., memory 4001, processor 4002, and communication interface 4003).
Fig. 17 is a schematic hardware configuration diagram of a neural network model quantization apparatus according to an embodiment of the present application. Similar to the apparatus 4000 described above, the neural network model quantizing apparatus 5000 shown in fig. 17 includes a memory 5001, a processor 5002, a communication interface 5003, and a bus 5004. The memory 5001, the processor 5002 and the communication interface 5003 are connected to each other via a bus 5004.
The original neural network model can be quantized by the neural network model quantization apparatus 5000 shown in fig. 17, and the quantized neural network model can be used for executing the data processing method according to the embodiment of the present application.
Specifically, the apparatus shown in fig. 17 may obtain a training data set and a raw neural network model required for quantization from the outside through the communication interface 5003, and then perform quantization of the neural network model by the processor according to the training data set and the raw neural network model.
It should be noted that although the above-described apparatus 4000 and apparatus 5000 show only memories, processors, and communication interfaces, in particular implementations, those skilled in the art will appreciate that the apparatus 4000 and apparatus 5000 may also include other devices necessary to achieve normal operation. Also, those skilled in the art will appreciate that apparatus 4000 and apparatus 5000 may also include hardware devices for performing other additional functions, according to particular needs. Furthermore, those skilled in the art will appreciate that apparatus 4000 and apparatus 5000 may also include only those components necessary to implement embodiments of the present application, and need not include all of the components shown in fig. 16 and 17.
It should be understood that the processor in the embodiments of the present application may be a Central Processing Unit (CPU), and the processor may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It will also be appreciated that the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. The procedures or functions according to the embodiments of the present application are wholly or partially generated when the computer instructions or the computer program are loaded or executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more collections of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone, where A and B may be singular or plural. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship, but may also indicate an "and/or" relationship, which may be understood with reference to the context.
In the present application, "at least one" means one or more, "a plurality" means two or more. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

  1. A neural network model quantization method, comprising:
    the method comprises the steps of obtaining an original neural network model, wherein the original neural network model comprises a first operator, a second operator and a first operation module, the first operator and the second operator are used for carrying out the same type of operation, and the first operation module is used for carrying out first operation on the output of the first operator and the output of the second operator;
    determining a data quantization parameter according to a range of first training input data and a range of second training input data, wherein the first training input data is input data of the first operator, and the second training input data is input data of the second operator;
    determining a quantized neural network model according to the original neural network model, wherein the quantized neural network model comprises a quantization module, a third operator, a fourth operator and a second operation module, the quantization module is used for quantizing first input data of the third operator and second input data of the fourth operator respectively by using the data quantization parameter, the third operator is the quantized first operator, the fourth operator is the quantized second operator, and the second operation module is used for performing the first operation.
  2. The method of claim 1, further comprising:
    acquiring preset training output data corresponding to a training input data set, wherein the training input data set comprises the first training input data and the second training input data;
    quantizing the first training input data and the second training input data respectively by using the data quantization parameter;
    processing the quantized first training input data and the quantized second training input data by using the quantized neural network model to obtain actual training output data;
    adjusting the data quantization parameter according to the difference between the actual training output data and the preset training output data to minimize the difference;
    the quantization module is configured to quantize the first input data of the third operator and the second input data of the fourth operator respectively by using the adjusted data quantization parameter.
  3. The method of claim 2, further comprising:
    determining operator quantization parameters according to the parameter range of the first operator and the parameter range of the second operator;
    quantizing the parameter of the first operator by using the operator quantization parameter to obtain a parameter of the third operator;
    and quantizing the parameters of the second operator by using the operator quantization parameters to obtain the parameters of the fourth operator.
  4. The method according to any one of claims 1 to 3, wherein the quantized neural network model further comprises a compression module for compressing the output of the third operator and the output of the fourth operator respectively according to an offset parameter indicating the position of the highest bit in the compressed data in the data before the compression, and the second operation module is configured to perform the first operation on the compressed data;
    the method further comprises the following steps:
    quantizing the first training input data and the second training input data respectively by using the data quantization parameters;
    processing the quantized first training input data by using the third operator, wherein the third operator outputs first training operation data;
    processing the quantized second training input data by using the fourth operator, wherein the fourth operator outputs second training operation data;
    determining the offset parameter according to the significand of the first training operand data and the significand of the second training operand data.
  5. A method of data processing, the method comprising:
    the method comprises the steps of obtaining a quantized neural network model, wherein the quantized neural network model is obtained by quantizing an original neural network model, the original neural network model comprises a first operator, a second operator and a first operation module, the first operator and the second operator are used for carrying out the same type of operation, and the first operation module is used for carrying out the first operation on the output of the first operator and the output of the second operator;
    processing the first input data of the third operator and the second input data of the fourth operator by using the quantized neural network model, where the quantized neural network model includes a quantization module, a third operator, a fourth operator, and a second operation module, the quantization module is configured to quantize the first input data and the second input data by using a data quantization parameter, the second operation module is configured to perform the first operation, the third operator is the quantized first operator, the fourth operator is the quantized second operator, and the data quantization parameter is determined according to a range of the first training input data of the first operator and a range of the second training input data of the second operator.
  6. The method of claim 5,
    the data quantization parameter is obtained by adjusting the initial data quantization parameter, the adjustment minimizes the difference between the actual training output data and the preset training output data,
    the initial quantization parameter is determined from a range of the first training input data and a range of the second training input data,
    the preset training output data corresponds to a training input data set comprising the first training input data and the second training input data,
    the actual training output data is obtained by processing the first training input data and the second training input data by using the quantized neural network model, and the quantization module is used for quantizing the first training input data and the second training input data by using the initial data quantization parameter.
  7. The method according to claim 6, wherein the parameter of the third operator is obtained by quantizing the parameter of the first operator by using an operator quantization parameter, and the parameter of the fourth operator is obtained by quantizing the parameter of the second operator by using the operator quantization parameter, and the operator quantization parameter is determined according to the parameter range of the first operator and the parameter range of the second operator.
  8. The method according to any one of claims 5 to 7, wherein the quantized neural network model further comprises a compression module for compressing the output of the third operator and the output of the fourth operator respectively according to an offset parameter indicating the position of the highest bit in the compressed data in the data before the compression, and the second operation module is configured to perform the first operation on the compressed data;
    the offset parameter is determined according to a significand of first training operation data obtained by processing first training input data quantized by the data quantization parameter using the third operator and a significand of second training operation data obtained by processing second training input data quantized by the data quantization parameter using the fourth operator.
  9. An apparatus for quantizing a neural network model, the apparatus comprising: a storage module and a processing module, wherein,
    the storage module is used for storing programs;
    when the program is run in the processing module, the processing module is to:
    the method comprises the steps of obtaining an original neural network model, wherein the original neural network model comprises a first operator, a second operator and a first operation module, the first operator and the second operator are used for carrying out the same type of operation, and the first operation module is used for carrying out the first operation on the output of the first operator and the output of the second operator;
    determining a data quantization parameter according to a range of first training input data and a range of second training input data, wherein the first training input data is input data of the first operator, and the second training input data is input data of the second operator;
    determining a quantized neural network model according to the original neural network model, wherein the quantized neural network model comprises a quantization module, a third operator, a fourth operator and a second operation module, the quantization module is used for quantizing first input data of the third operator and second input data of the fourth operator respectively by using the data quantization parameter, the third operator is the quantized first operator, the fourth operator is the quantized second operator, and the second operation module is used for performing the first operation.
  10. The apparatus of claim 9,
    the processing module is further configured to obtain preset training output data corresponding to a training input data set, where the training input data set includes the first training input data and the second training input data;
    the processing module is further configured to quantize the first training input data and the second training input data respectively by using the data quantization parameter;
    the processing module is further used for processing the quantized first training input data and the quantized second training input data by using the quantized neural network model to obtain actual training output data;
    the processing module is further configured to adjust the data quantization parameter according to a difference between the actual training output data and the preset training output data to minimize the difference;
    the quantization module is configured to quantize the first input data of the third operator and the second input data of the fourth operator respectively by using the adjusted data quantization parameter.
  11. The apparatus of claim 10,
    the processing module is further used for determining operator quantization parameters according to the parameter range of the first operator and the parameter range of the second operator;
    the processing module is further configured to quantize the parameter of the first operator by using the operator quantization parameter to obtain a parameter of the third operator;
    the processing module is further configured to quantize the parameter of the second operator by using the operator quantization parameter to obtain a parameter of the fourth operator.
  12. The apparatus according to any one of claims 9-11, wherein the quantized neural network model further comprises a compression module configured to compress outputs of the third operator and the fourth operator according to an offset parameter, respectively, the offset parameter indicating a position of a highest bit in the compressed data in data before the compression, and the second operation module configured to perform the first operation on the compressed data;
    the processing module is further configured to quantize the first training input data and the second training input data using the data quantization parameter, respectively;
    the processing module is further configured to process the quantized first training input data by using the third operator, and the third operator outputs first training operation data;
    the processing module is further configured to process the quantized second training input data by using the fourth operator, and the fourth operator outputs second training operation data;
    the processing module is further configured to determine the offset parameter according to the significand of the first training operation data and the significand of the second training operation data.
  13. A data processing apparatus, comprising: a storage module and a processing module, wherein,
    the storage module is used for storing programs;
    when the program is run in the processing module, the processing module is to:
    the method comprises the steps of obtaining a quantized neural network model, wherein the quantized neural network model is obtained by quantizing an original neural network model, the original neural network model comprises a first operator, a second operator and a first operation module, the first operator and the second operator are used for carrying out the same type of operation, and the first operation module is used for carrying out the first operation on the output of the first operator and the output of the second operator;
    processing the first input data of the third operator and the second input data of the fourth operator by using the quantized neural network model, where the quantized neural network model includes a quantization module, a third operator, a fourth operator, and a second operation module, the quantization module is configured to quantize the first input data and the second input data by using a data quantization parameter, the second operation module is configured to perform the first operation, the third operator is the quantized first operator, the fourth operator is the quantized second operator, and the data quantization parameter is determined according to a range of the first training input data of the first operator and a range of the second training input data of the second operator.
  14. The apparatus of claim 13,
    the data quantization parameter is obtained by adjusting the initial data quantization parameter, the adjustment minimizes the difference between the actual training output data and the preset training output data,
    the initial quantization parameter is determined from a range of the first training input data and a range of the second training input data,
    the preset training output data corresponds to a training input data set comprising the first training input data and the second training input data,
    the actual training output data is obtained by processing the first training input data and the second training input data by using the quantized neural network model, and the quantization module is used for quantizing the first training input data and the second training input data by using the initial data quantization parameter.
  15. The apparatus according to claim 14, wherein the parameter of the third operator is obtained by quantizing the parameter of the first operator by using an operator quantization parameter, and the parameter of the fourth operator is obtained by quantizing the parameter of the second operator by using the operator quantization parameter, and the operator quantization parameter is determined according to a parameter range of the first operator and a parameter range of the second operator.
  16. The apparatus according to any one of claims 13-15, wherein the quantized neural network model further comprises a compression module configured to compress outputs of the third operator and the fourth operator according to an offset parameter, respectively, the offset parameter indicating a position of a highest bit in the compressed data in data before the compression, and the second operation module configured to perform the first operation on the compressed data;
    the offset parameter is determined according to a significand of first training operation data obtained by processing first training input data quantized by the data quantization parameter using the third operator and a significand of second training operation data obtained by processing second training input data quantized by the data quantization parameter using the fourth operator.
  17. A computer-readable storage medium, characterized in that the computer-readable medium stores program code for execution by a device, which when executed by the device performs the method of any one of claims 1 to 8.
  18. A chip comprising a processor and a data interface, the processor reading instructions stored on a memory through the data interface to perform the method of any one of claims 1 to 8.
CN202080016479.1A 2020-10-30 2020-10-30 Quantification method and device of neural network model, and data processing method and device Pending CN114698395A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/125370 WO2022088063A1 (en) 2020-10-30 2020-10-30 Method and apparatus for quantizing neural network model, and method and apparatus for processing data

Publications (1)

Publication Number Publication Date
CN114698395A true CN114698395A (en) 2022-07-01

Family

ID=81381775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080016479.1A Pending CN114698395A (en) 2020-10-30 2020-10-30 Quantification method and device of neural network model, and data processing method and device

Country Status (2)

Country Link
CN (1) CN114698395A (en)
WO (1) WO2022088063A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841325A (en) * 2022-05-20 2022-08-02 安谋科技(中国)有限公司 Data processing method and medium of neural network model and electronic device
CN116258178B (en) * 2023-03-24 2023-09-22 美的集团(上海)有限公司 Model conversion method, device, electronic equipment and readable storage medium
CN116579400B (en) * 2023-05-19 2024-02-23 北京百度网讯科技有限公司 Quantization method, data processing method and device of deep learning model
CN117634577A (en) * 2024-01-25 2024-03-01 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598839A (en) * 2018-06-12 2019-12-20 华为技术有限公司 Convolutional neural network system and method for quantizing convolutional neural network
US20200097818A1 (en) * 2018-09-26 2020-03-26 Xinlin LI Method and system for training binary quantized weight and activation function for deep neural networks
US11651192B2 (en) * 2019-02-12 2023-05-16 Apple Inc. Compressed convolutional neural network models
CN110322008A (en) * 2019-07-10 2019-10-11 杭州嘉楠耘智信息科技有限公司 Residual convolution neural network-based quantization processing method and device
CN111176853A (en) * 2020-02-19 2020-05-19 珠海市杰理科技股份有限公司 Data quantization method and device, computer equipment and storage medium
CN111652366A (en) * 2020-05-09 2020-09-11 哈尔滨工业大学 Combined neural network model compression method based on channel pruning and quantitative training

Also Published As

Publication number Publication date
WO2022088063A1 (en) 2022-05-05

Similar Documents

Publication Publication Date Title
CN112418392A (en) Neural network construction method and device
CN110222717B (en) Image processing method and device
CN113011575A (en) Neural network model updating method, image processing method and device
CN114698395A (en) Quantification method and device of neural network model, and data processing method and device
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN112183718A (en) Deep learning training method and device for computing equipment
CN113705769A (en) Neural network training method and device
CN113326930A (en) Data processing method, neural network training method, related device and equipment
CN112215332A (en) Searching method of neural network structure, image processing method and device
CN112561028A (en) Method for training neural network model, and method and device for data processing
CN111797882A (en) Image classification method and device
CN113065635A (en) Model training method, image enhancement method and device
CN111612215A (en) Method for training time sequence prediction model, time sequence prediction method and device
CN115081588A (en) Neural network parameter quantification method and device
WO2018228399A1 (en) Computing device and method
EP4318313A1 (en) Data processing method, training method for neural network model, and apparatus
CN114492723A (en) Neural network model training method, image processing method and device
CN112529146A (en) Method and device for training neural network model
CN114067007A (en) Image processing method and device and neural network training method and device
CN111428854A (en) Structure searching method and structure searching device
CN114078195A (en) Training method of classification model, search method and device of hyper-parameters
CN111931901A (en) Neural network construction method and device
CN117501245A (en) Neural network model training method and device, and data processing method and device
CN113536970A (en) Training method of video classification model and related device
CN114004383A (en) Training method of time series prediction model, time series prediction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination