WO2022088063A1 - Neural network model quantization method and apparatus, and data processing method and apparatus - Google Patents

Neural network model quantization method and apparatus, and data processing method and apparatus

Info

Publication number
WO2022088063A1
Authority
WO
WIPO (PCT)
Prior art keywords
operator
data
input data
training
quantized
Prior art date
Application number
PCT/CN2020/125370
Other languages
English (en)
French (fr)
Inventor
昌晶
连朔
孙方轩
王晨曦
周君
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to PCT/CN2020/125370 priority Critical patent/WO2022088063A1/zh
Priority to CN202080016479.1A priority patent/CN114698395A/zh
Publication of WO2022088063A1 publication Critical patent/WO2022088063A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Definitions

  • The present application relates to the field of artificial intelligence, and more particularly, to a method and apparatus for quantizing a neural network model, and a method and apparatus for data processing.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that responds in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
  • Neural network models are widely used.
  • By quantizing the operators in a neural network model, that is, quantizing the parameters of the operators and quantizing their input data, floating-point operations can be converted into fixed-point operations, and a quantized model can be obtained.
  • Determining the quantization parameter of an operator according to the data range of that operator can improve the accuracy of the data processing results of the quantized operator.
  • The present application provides a neural network model quantization method and a data processing method, which can simplify the operation of the neural network model and improve the data processing efficiency of the neural network model.
  • In a first aspect, a method for quantizing a neural network model is provided. The method includes: obtaining an original neural network model, where the original neural network model includes a first operator, a second operator and a first operation module, the first operator and the second operator are used to perform the same type of operation, and the first operation module is used to perform a first operation on the output of the first operator and the output of the second operator; determining a data quantization parameter according to the range of first training input data and the range of second training input data, where the first training input data is the input data of the first operator and the second training input data is the input data of the second operator; and determining a quantized neural network model according to the original neural network model, where the data quantization parameter is used to quantize the input data of a third operator and the input data of a fourth operator in the quantized neural network model.
  • Because the quantized neural network model uses the same data quantization parameter to quantize the data input to two different operators, the processing result of the third operator and the processing result of the fourth operator correspond to the same quantization parameter. The first operation can therefore be performed directly on the processing result of the third operator and the processing result of the fourth operator, without performing inverse quantization on these processing results first. This simplifies the operation of the quantized neural network model and improves the data processing efficiency of the neural network model.
  • Moreover, because the data quantization parameter used to quantize the input data of the third operator and the fourth operator is determined from the ranges of the training input data, the accuracy of the quantized data processing results is maintained: the data processing efficiency of the neural network model is improved while the impact of quantizing the neural network model on the accuracy of the data processing results is reduced.
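  • As an illustrative sketch only (assumed NumPy code, not the publication's implementation), the following fragment shows the idea behind the first aspect: one data quantization scale shared by the inputs of two operators whose outputs feed the same element-wise addition, so the integer results can be combined directly and dequantized once at the end. Names such as shared_scale and quantize are hypothetical.

        import numpy as np

        def quantize(x, scale, bits=8):
            # Map float values to signed fixed-point integers using a shared scale.
            qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
            return np.clip(np.round(x / scale), qmin, qmax).astype(np.int32)

        x1 = np.random.randn(4).astype(np.float32)   # input data of the first operator
        x2 = np.random.randn(4).astype(np.float32)   # input data of the second operator
        w1, w2 = 0.5, -1.25                          # toy per-operator weights

        # One data scale derived from the union of both input ranges, plus a shared weight scale.
        shared_scale = max(np.abs(x1).max(), np.abs(x2).max()) / 127.0
        w_scale = max(abs(w1), abs(w2)) / 127.0

        q_x1, q_x2 = quantize(x1, shared_scale), quantize(x2, shared_scale)
        q_w1, q_w2 = quantize(np.float32(w1), w_scale), quantize(np.float32(w2), w_scale)

        # Both integer branch results correspond to the same scale shared_scale * w_scale,
        # so the element-wise addition runs directly on fixed-point data, followed by a
        # single dequantization.
        acc = q_x1 * q_w1 + q_x2 * q_w2
        y = acc * (shared_scale * w_scale)
        print(np.allclose(y, x1 * w1 + x2 * w2, atol=0.1))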
  • The method further includes: acquiring preset training output data corresponding to a training input data set, where the training input data set includes the first training input data and the second training input data; quantizing the first training input data and the second training input data respectively by using the data quantization parameter; processing the quantized first training input data and the quantized second training input data by using the quantized neural network model to obtain actual training output data; and adjusting the data quantization parameter according to the difference between the actual training output data and the preset training output data so as to minimize the difference. The quantization module is used to quantize the first input data of the third operator and the second input data of the fourth operator respectively by using the adjusted data quantization parameter.
  • the preset training output data may be manually set.
  • the preset training output data may also be obtained by processing the first training input data and the second training input data by the original neural network model.
  • the preset training output data may be the output of the operation module.
  • The adjusted data quantization parameter enables the third operator and the fourth operator to process the quantized data with high precision. While the data processing efficiency of the neural network model is improved, the impact of quantizing the neural network model on the accuracy of the data processing results is reduced.
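  • A minimal calibration sketch of this adjustment, with assumed helper names (run_quantized, reference_output) and a simple grid search standing in for whatever adjustment procedure is actually used:

        import numpy as np

        def run_quantized(x1, x2, scale):
            # Toy quantized forward pass: two quantized branches plus an element-wise add.
            q1 = np.clip(np.round(x1 / scale), -128, 127)
            q2 = np.clip(np.round(x2 / scale), -128, 127)
            return (q1 + q2) * scale          # single dequantization after the addition

        x1 = np.random.randn(1000).astype(np.float32)   # first training input data
        x2 = np.random.randn(1000).astype(np.float32)   # second training input data
        reference_output = x1 + x2                      # preset training output data

        # Start from the range-based scale, then keep the candidate that minimizes
        # the difference between actual and preset training output data.
        init_scale = max(np.abs(x1).max(), np.abs(x2).max()) / 127.0
        best_scale, best_err = init_scale, np.inf
        for factor in np.linspace(0.5, 1.5, 21):
            scale = init_scale * factor
            err = float(np.mean((run_quantized(x1, x2, scale) - reference_output) ** 2))
            if err < best_err:
                best_scale, best_err = scale, err
        print(best_scale, best_err)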
  • The method further includes: determining an operator quantization parameter according to the parameter range of the first operator and the parameter range of the second operator; quantizing the parameters of the first operator by using the operator quantization parameter to obtain the parameters of the third operator; and quantizing the parameters of the second operator by using the operator quantization parameter to obtain the parameters of the fourth operator.
  • In this way, the quantized neural network model not only improves the data processing efficiency, but also reduces the effect of quantization on the accuracy and precision of the data processing results.
  • The quantized neural network model further includes a compression module. The compression module is configured to compress the output of the third operator and the output of the fourth operator respectively according to an offset parameter, the offset parameter is used to indicate the position, in the data before compression, of the highest bit of the compressed data, and the second operation module is used to perform the first operation on the compressed data.
  • The method further includes: quantizing the first training input data and the second training input data respectively by using the data quantization parameter; processing the quantized first training input data by using the third operator, the third operator outputting first training operation data; processing the quantized second training input data by using the fourth operator, the fourth operator outputting second training operation data; and determining the offset parameter according to the number of significant bits of the first training operation data and the number of significant bits of the second training operation data.
  • Because the offset parameter is determined according to the number of significant bits of the intermediate operation results obtained when the quantized neural network model processes the training input data, using the offset parameter to compress the intermediate operation results when the quantized neural network model processes data reduces the impact of the compression on the accuracy and precision of the final data processing results.
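  • The offset-based compression can be pictured roughly as follows (an assumed bit-shift scheme, not necessarily the publication's exact one): the offset records where the highest retained bit of the compressed value sits in the original accumulator, and it is chosen from the significant bits observed on training operation data.

        import numpy as np

        def significant_bits(x):
            # Number of bits needed to represent the largest magnitude in x.
            return int(np.max(np.abs(x))).bit_length()

        def compress(acc, offset, out_bits=16):
            # Keep the top out_bits bits; offset marks where the highest retained bit
            # sits in the data before compression.
            shift = max(offset - out_bits, 0)
            return acc >> shift, shift

        # Intermediate results of the third and fourth operators on training data.
        acc1 = np.random.randint(-2**20, 2**20, size=8).astype(np.int64)
        acc2 = np.random.randint(-2**18, 2**18, size=8).astype(np.int64)

        offset = max(significant_bits(acc1), significant_bits(acc2))
        c1, shift = compress(acc1, offset)
        c2, _ = compress(acc2, offset)
        summed = (c1 + c2) << shift        # first operation on the compressed data, rescaled
        print(offset, shift, summed[:3], (acc1 + acc2)[:3])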
  • In a second aspect, a data processing method is provided. The method includes: obtaining a quantized neural network model, where the quantized neural network model is obtained by quantizing an original neural network model, the original neural network model includes a first operator, a second operator, and a first operation module, the first operator and the second operator are used to perform the same type of operation, and the first operation module is used to perform a first operation on the output of the first operator and the output of the second operator; and processing first input data of a third operator and second input data of a fourth operator by using the quantized neural network model.
  • The quantized neural network model includes a quantization module, the third operator, the fourth operator, and a second operation module, where the quantization module is used to quantize the first input data and the second input data respectively by using a data quantization parameter, the second operation module is used to perform the first operation, the third operator is the quantized first operator, and the fourth operator is the quantized second operator.
  • The data quantization parameter is determined according to the range of the first training input data of the first operator and the range of the second training input data of the second operator.
  • The data quantization parameter is obtained by adjusting an initial data quantization parameter, and the adjustment minimizes the difference between actual training output data and preset training output data. The initial data quantization parameter is determined according to the range of the first training input data and the range of the second training input data, the preset training output data corresponds to a training input data set, the training input data set includes the first training input data and the second training input data, and the actual training output data is obtained by processing the first training input data and the second training input data by using the quantized neural network model.
  • the quantization module is configured to use the initial data quantization parameter to quantize the first training input data and the second training input data respectively.
  • The parameters of the third operator are obtained by quantizing the parameters of the first operator by using an operator quantization parameter, the parameters of the fourth operator are obtained by quantizing the parameters of the second operator by using the operator quantization parameter, and the operator quantization parameter is determined according to the parameter range of the first operator and the parameter range of the second operator.
  • The quantized neural network model further includes a compression module. The compression module is configured to compress the output of the third operator and the output of the fourth operator respectively according to an offset parameter, the offset parameter is used to indicate the position, in the data before compression, of the highest bit of the compressed data, and the second operation module is used to perform the first operation on the compressed data.
  • The offset parameter is determined according to the number of significant bits of first training operation data and the number of significant bits of second training operation data, where the first training operation data is obtained by the third operator processing the first training input data quantized by using the data quantization parameter, and the second training operation data is obtained by the fourth operator processing the second training input data quantized by using the data quantization parameter.
  • In a third aspect, a neural network model quantization apparatus is provided, including a storage module and a processing module. The storage module is used to store a program; when the program runs in the processing module, the processing module is used to: obtain an original neural network model, where the original neural network model includes a first operator, a second operator and a first operation module, the first operator and the second operator are used to perform the same type of operation, and the first operation module is used to perform a first operation on the output of the first operator and the output of the second operator; determine a data quantization parameter according to the range of first training input data and the range of second training input data, where the first training input data is the input data of the first operator and the second training input data is the input data of the second operator; and determine a quantized neural network model according to the original neural network model, where the quantized neural network model includes a quantization module, a third operator, a fourth operator, and a second operation module, the quantization module is used to quantize the first input data of the third operator and the second input data of the fourth operator respectively by using the data quantization parameter, the third operator is the quantized first operator, the fourth operator is the quantized second operator, and the second operation module is used to perform the first operation.
  • The processing module is further configured to acquire preset training output data corresponding to a training input data set, where the training input data set includes the first training input data and the second training input data; the processing module is further configured to quantize the first training input data and the second training input data respectively by using the data quantization parameter; the processing module is further configured to process the quantized first training input data and the quantized second training input data by using the quantized neural network model to obtain actual training output data; the processing module is further configured to adjust the data quantization parameter according to the difference between the actual training output data and the preset training output data so as to minimize the difference; and the quantization module is used to quantize the first input data of the third operator and the second input data of the fourth operator respectively by using the adjusted data quantization parameter.
  • The processing module is further configured to determine an operator quantization parameter according to the parameter range of the first operator and the parameter range of the second operator; the processing module is further configured to quantize the parameters of the first operator by using the operator quantization parameter to obtain the parameters of the third operator; and the processing module is further configured to quantize the parameters of the second operator by using the operator quantization parameter to obtain the parameters of the fourth operator.
  • The quantized neural network model further includes a compression module. The compression module is configured to compress the output of the third operator and the output of the fourth operator respectively according to an offset parameter, the offset parameter is used to indicate the position, in the data before compression, of the highest bit of the compressed data, and the second operation module is used to perform the first operation on the compressed data. The processing module is further configured to quantize the first training input data and the second training input data respectively by using the data quantization parameter; the processing module is further configured to process the quantized first training input data by using the third operator, the third operator outputting first training operation data; the processing module is further configured to process the quantized second training input data by using the fourth operator, the fourth operator outputting second training operation data; and the processing module is further configured to determine the offset parameter according to the number of significant bits of the first training operation data and the number of significant bits of the second training operation data.
  • In a fourth aspect, a data processing apparatus is provided, including a storage module and a processing module. The storage module is used to store a program; when the program runs in the processing module, the processing module is used to: obtain a quantized neural network model, where the quantized neural network model is obtained by quantizing an original neural network model, the original neural network model includes a first operator, a second operator, and a first operation module, the first operator and the second operator are used to perform the same type of operation, and the first operation module is used to perform a first operation on the output of the first operator and the output of the second operator; and process first input data of a third operator and second input data of a fourth operator by using the quantized neural network model.
  • The quantized neural network model includes a quantization module, the third operator, the fourth operator, and a second operation module, where the quantization module is used to quantize the first input data and the second input data by using a data quantization parameter, the second operation module is used to perform the first operation, the third operator is the quantized first operator, and the fourth operator is the quantized second operator. The data quantization parameter is determined according to the range of the first training input data of the first operator and the range of the second training input data of the second operator.
  • The data quantization parameter is obtained by adjusting an initial data quantization parameter, and the adjustment minimizes the difference between actual training output data and preset training output data. The initial data quantization parameter is determined according to the range of the first training input data and the range of the second training input data, the preset training output data corresponds to a training input data set, the training input data set includes the first training input data and the second training input data, and the actual training output data is obtained by processing the first training input data and the second training input data by using the quantized neural network model.
  • the quantization module is configured to use the initial data quantization parameter to quantize the first training input data and the second training input data respectively.
  • The parameters of the third operator are obtained by quantizing the parameters of the first operator by using an operator quantization parameter, the parameters of the fourth operator are obtained by quantizing the parameters of the second operator by using the operator quantization parameter, and the operator quantization parameter is determined according to the parameter range of the first operator and the parameter range of the second operator.
  • The quantized neural network model further includes a compression module. The compression module is configured to compress the output of the third operator and the output of the fourth operator respectively according to an offset parameter, the offset parameter is used to indicate the position, in the data before compression, of the highest bit of the compressed data, and the second operation module is used to perform the first operation on the compressed data.
  • The offset parameter is determined according to the number of significant bits of first training operation data and the number of significant bits of second training operation data, where the first training operation data is obtained by the third operator processing the first training input data quantized by using the data quantization parameter, and the second training operation data is obtained by the fourth operator processing the second training input data quantized by using the data quantization parameter.
  • In a fifth aspect, an electronic device is provided, including a memory and a processor. The memory is used to store program instructions; when the program instructions are executed by the processor, the processor is used to execute the method described in the first aspect or the second aspect.
  • the processor in the third aspect above may include either a central processing unit (CPU), or a combination of a CPU and a neural network computing processor.
  • In a sixth aspect, a computer-readable medium is provided. The computer-readable medium stores program code for execution by a device, and the program code includes instructions for executing the method in the first aspect or any one of the implementations of the first aspect.
  • In a seventh aspect, a computer program product including instructions is provided. When the computer program product runs on a computer, it causes the computer to execute the method in the first aspect or any one of the implementations of the first aspect.
  • In an eighth aspect, a chip is provided. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to execute the method in the first aspect or any one of the implementations of the first aspect.
  • The chip may further include a memory in which instructions are stored. The processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method in the first aspect or any one of the implementations of the first aspect.
  • the above chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
  • FIG. 1 is a schematic structural diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of another convolutional neural network provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a hardware structure of a chip according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an apparatus for quantizing a neural network model provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a method for quantizing a neural network model provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of another neural network model quantization method provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a data processing system provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of data before and after compression provided by an embodiment of the present application.
  • FIG. 11 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 12 is a schematic flowchart of another data processing method provided by an embodiment of the present application.
  • FIG. 13 is a schematic flowchart of a processing structure identification method provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of an apparatus for quantizing a neural network model provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of a hardware structure of a data processing apparatus according to an embodiment of the present application.
  • FIG. 17 is a schematic diagram of a hardware structure of a neural network model quantization apparatus according to an embodiment of the present application.
  • A neural network can be composed of neural units. A neural unit can be an operation unit that takes x_s and an intercept of 1 as inputs, and the output of the operation unit can be: h = f(∑_s W_s·x_s + b), where W_s is the weight of x_s and b is the bias of the neural unit.
  • f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting a plurality of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • a deep neural network also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • The layers inside a DNN can be divided, according to the positions of the different layers, into three categories: the input layer, hidden layers, and the output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although a DNN looks complicated, the work of each layer is not complicated. In short, each layer performs the linear relationship expression y = α(Wx + b), where x is the input vector, y is the output vector, b is the offset (bias) vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, the number of coefficient matrices W and offset vectors b is also large.
  • These parameters are defined in the DNN as follows, taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the fourth neuron in the second layer to the second neuron in the third layer is defined as W^3_24, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
  • In general, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W^L_jk.
  • Note that the input layer has no W parameters.
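  • A minimal NumPy illustration of one such layer computing y = α(Wx + b), with a sigmoid as the activation α (layer sizes and values are arbitrary, chosen only for clarity):

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        x = np.array([0.5, -1.0, 2.0, 0.1])   # input vector (4 neurons in layer L-1)
        W = np.random.randn(3, 4)             # W[j, k]: from neuron k in layer L-1 to neuron j in layer L
        b = np.zeros(3)                       # offset (bias) vector of layer L

        y = sigmoid(W @ x + b)                # output vector of layer L (3 neurons)
        print(y)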
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors W of many layers).
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional layers and subsampling layers, which can be viewed as a filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • In a convolutional layer of a convolutional neural network, a neuron can be connected to only some of the neurons in adjacent layers.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as the way to extract data information is independent of location.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the neural network can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller.
  • The input signal is propagated forward until an error loss is produced at the output, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges.
  • the back-propagation algorithm is a back-propagation movement dominated by error loss, aiming to obtain the parameters of the optimal neural network model, such as the weight matrix.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • an embodiment of the present application provides a system architecture 100 .
  • a data collection device 160 is used to collect training data.
  • the training data may include a plurality of training input data and a training identifier corresponding to each training input data.
  • After collecting the training data, the data collection device 160 stores the training data in the database 130, and the training device 120 obtains the target model/rule 101 by training based on the training data maintained in the database 130.
  • The training device 120 processes the input training input data and compares the output result with the training identifier corresponding to the training input data, until the difference between the output result of the training device 120 and the training identifier is smaller than a certain threshold, at which point the training of the target model/rule 101 is completed.
  • the above target model/rule 101 can be used to implement the data processing method of the embodiment of the present application.
  • the target model/rule 101 in this embodiment of the present application may specifically be a neural network.
  • the training data maintained in the database 130 may not necessarily come from the collection of the data collection device 160, and may also be received from other devices.
  • the training device 120 may not necessarily train the target model/rule 101 completely based on the training data maintained by the database 130, and may also obtain training data from the cloud or other places for model training.
  • The above description should not be construed as a limitation on the embodiments of this application.
  • The target model/rule 101 obtained by training by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1. The execution device 110 may be a terminal such as a laptop, an augmented reality (AR)/virtual reality (VR) device, or an in-vehicle terminal, and may also be a server or the cloud.
  • The execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user can input data to the I/O interface 112 through the client device 140. In this embodiment of the present application, the input data may include the data to be processed input by the client device.
  • The preprocessing module 113 and the preprocessing module 114 are used to perform preprocessing according to the input data (such as the data to be processed) received by the I/O interface 112. The preprocessing module 113 and the preprocessing module 114 may also be absent, or only one of the preprocessing modules may be present, in which case the calculation module 111 is used directly to process the input data.
  • When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call the data, code, and the like in the data storage system 150 for the corresponding processing, and the data and instructions obtained by the corresponding processing may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the processing result of the data obtained above, to the client device 140, so as to be provided to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete The above task, thus providing the user with the desired result.
  • the user can manually specify the input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store them in the database 130 .
  • Alternatively, the I/O interface 112 may directly store, as new sample data, the input data input into the I/O interface 112 and the output result of the I/O interface 112 shown in the figure into the database 130.
  • FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • For example, in FIG. 1, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
  • the target model/rule 101 is obtained by training the training device 120.
  • the target model/rule 101 may be the neural network in the present application in this embodiment of the present application.
  • The neural network used in this embodiment of the present application may be, for example, a convolutional neural network (CNN), a deep convolutional neural network (DCNN), or a recurrent neural network (RNN).
  • A CNN is a very common neural network.
  • a convolutional neural network is a deep neural network with a convolutional structure and a deep learning architecture.
  • A deep learning architecture refers to performing learning at multiple levels of abstraction by means of machine learning algorithms.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to data input into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230 .
  • the input layer 210 can obtain the data to be processed, and pass the obtained data to be processed by the convolutional layer/pooling layer 220 and the subsequent neural network layer 230 for processing, and the processing result of the data can be obtained.
  • the internal layer structure in the CNN 200 in Figure 2 is described in detail below.
  • The convolutional layer/pooling layer 220 may include, as examples, layers 221 to 226. In one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or can be used as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 may include many convolution operators.
  • the convolution operator is also called a kernel, and its role in data processing is equivalent to a filter that extracts specific information from the input data matrix.
  • The convolution operator can essentially be a weight matrix, and this weight matrix is usually predefined.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can be used to extract information from the input data, so that the convolutional neural network 200 can make correct predictions .
  • The features extracted by the initial convolutional layer (e.g., 221) are relatively simple, while the features extracted by the later convolutional layers (e.g., 226) become more and more complex, such as high-level semantic features; features with higher-level semantics are more suitable for the problem to be solved.
  • A convolutional layer may be followed by a single pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In the process of data processing, the only purpose of the pooling layer is to reduce the spatial size of the data.
  • After the processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet ready to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input data. However, in order to generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the neural network layer 230 to generate one output or a set of outputs of the required number of classes. Therefore, the neural network layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 2) and an output layer 240, and the parameters contained in the multiple hidden layers may be obtained by pre-training based on the relevant training data of a specific task type; for example, the task type may include identification, classification, and so on.
  • After the multiple hidden layers in the neural network layer 230, that is, as the last layer of the entire convolutional neural network 200, the output layer 240 has a loss function similar to classification cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 (as shown in FIG. 2, the propagation from 210 to 240 is forward propagation) is completed, back propagation (as shown in FIG. 2, the propagation from 240 to 210 is back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
  • a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230.
  • the multiple convolution layers/pooling layers in the convolutional layer/pooling layer 220 in FIG. 3 are parallel, and the respectively extracted features are input to the neural network layer 230 for processing.
  • It should be noted that the convolutional neural networks shown in FIG. 2 and FIG. 3 are only two examples of possible convolutional neural networks used in the data processing method of the embodiments of the present application. In specific applications, the convolutional neural network used in the data processing method of the embodiments of the present application may also exist in the form of other network models.
  • FIG. 4 is a hardware structure of a chip provided by an embodiment of the application, and the chip includes a neural network processor 50 .
  • the chip can be set in the execution device 110 as shown in FIG. 1 to complete the calculation work of the calculation module 111 .
  • the chip can also be set in the training device 120 as shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101 .
  • the algorithms of each layer in the convolutional neural network shown in Figures 2 and 3 can be implemented in the chip shown in Figure 4.
  • The neural network processor (NPU) 50 is mounted as a coprocessor on the main central processing unit (host CPU), and the main CPU allocates tasks to it.
  • the core part of the NPU is the operation circuit 503, and the controller 504 controls the operation circuit 503 to extract the data in the memory (weight memory or input memory) and perform operations.
  • the arithmetic circuit 503 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 503 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 502 and buffers it on each PE in the operation circuit.
  • the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 501 to perform matrix operation, and stores the partial result or final result of the matrix in the accumulator 508 .
  • the vector calculation unit 507 can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector computing unit 507 can be used for network computation of non-convolutional/non-FC layers in the neural network, such as pooling, batch normalization, local response normalization, etc. .
  • vector computation unit 507 can store the processed output vectors to unified buffer 506 .
  • the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 507 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 503, eg, for use in subsequent layers in a neural network.
  • Unified memory 506 is used to store input data and output data.
  • The storage unit access controller 505 (direct memory access controller, DMAC) transfers the input data in the external memory to the input memory 501 and/or the unified memory 506, stores the weight data in the external memory into the weight memory 502, and stores the data in the unified memory 506 into the external memory.
  • a bus interface unit (BIU) 510 is used to realize the interaction between the main CPU, the DMAC and the instruction fetch memory 509 through the bus.
  • the instruction fetch memory (instruction fetch buffer) 509 connected with the controller 504 is used to store the instructions used by the controller 504;
  • The controller 504 is used to call the instructions cached in the instruction fetch memory 509, so as to control the working process of the operation accelerator.
  • the unified memory 506, the input memory 501, the weight memory 502 and the instruction fetch memory 509 are all on-chip (On-Chip) memories, and the external memory is the memory outside the NPU, and the external memory can be double data rate synchronous dynamic random access Memory (double data rate synchronous dynamic random access memory, referred to as DDR SDRAM), high bandwidth memory (high bandwidth memory, HBM) or other readable and writable memory.
  • each layer in the convolutional neural network shown in FIG. 2 and FIG. 3 may be performed by the operation circuit 503 or the vector calculation unit 507 .
  • The execution device 110 in FIG. 1 described above can execute the steps of the data processing method of the embodiments of the present application. The CNN models shown in FIG. 2 and FIG. 3 and the chip shown in FIG. 4 can also be used to execute the steps of the data processing method of the embodiments of the present application. The method for training a neural network according to the embodiments of the present application and the data processing method according to the embodiments of the present application are described in detail below with reference to the accompanying drawings.
  • an embodiment of the present application provides a system architecture 300 .
  • the system architecture includes a local device 301, a local device 302, an execution device 110 and a data storage system 150, wherein the local device 301 and the local device 302 are connected with the execution device 110 through a communication network.
  • the execution device 110 may be implemented by one or more servers.
  • the execution device 110 may be used in conjunction with other computing devices, such as data storage, routers, load balancers and other devices.
  • the execution device 110 may be arranged on one physical site, or distributed across multiple physical sites.
  • the execution device 110 may use the data in the data storage system 150 or call the program code in the data storage system 150 to implement the data processing method in this embodiment of the present application.
  • a user may operate respective user devices (eg, local device 301 and local device 302 ) to interact with execution device 110 .
  • Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, etc.
  • Each user's local device may interact with the execution device 110 through any communication mechanism/standard communication network, which may be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • The local device 301 and the local device 302 obtain the relevant parameters of the target neural network from the execution device 110, deploy the target neural network on the local device 301 and the local device 302, and use the target neural network for data classification, identification, and the like.
  • Alternatively, the target neural network can be directly deployed on the execution device 110. The execution device 110 obtains the data to be processed from the local device 301 and the local device 302, and classifies the data to be processed or performs other types of processing on it according to the target neural network.
  • The above execution device 110 may also be a cloud device, in which case the execution device 110 may be deployed in the cloud; or the above execution device 110 may also be a terminal device, in which case the execution device 110 may be deployed on the user terminal side. This is not limited in this embodiment of the present application.
  • the neural network model has been widely used in many fields such as image, video, and voice, showing the ability to surpass the traditional method.
  • However, the neural network model itself involves a large amount of computation and a large number of parameters, which poses great challenges for deploying the neural network on terminal devices.
  • Model quantization is used to quantize the parameters of the operators in the neural network model and to quantize the input data.
  • the size of the operator can be optimized and the resources occupied by the operator can be reduced.
  • quantizing the input data of the operator can convert the floating-point number operation of the operator into a fixed-point number operation, thereby improving the inference speed and reducing power consumption.
  • For example, a quantized neural network model obtained by 8-bit quantization can reduce the storage space occupied by each parameter to a quarter and achieve a faster speed at which inference processes the data.
  • the inverse quantization parameter used when performing inverse quantization on the data processing result of each operator can be determined according to the data quantization parameter and operator quantization parameter corresponding to the operator.
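  • Both points can be illustrated with a short assumed sketch: 8-bit quantization shrinks the stored parameters to a quarter of their float32 size, and the inverse quantization parameter for an operator's fixed-point result is the product of the data quantization scale and the operator quantization scale (the symmetric scheme and all names below are assumptions, not the publication's definition).

        import numpy as np

        params = np.random.randn(1024).astype(np.float32)   # operator parameters, 4 bytes each
        data = np.random.randn(1024).astype(np.float32)     # operator input data

        w_scale = np.abs(params).max() / 127.0               # operator quantization parameter
        x_scale = np.abs(data).max() / 127.0                 # data quantization parameter
        q_params = np.clip(np.round(params / w_scale), -128, 127).astype(np.int8)
        q_data = np.clip(np.round(data / x_scale), -128, 127).astype(np.int8)
        print(params.nbytes, q_params.nbytes)                # 4096 -> 1024 bytes (one quarter)

        # Fixed-point operation plus one inverse quantization of the result.
        acc = q_params.astype(np.int32) @ q_data.astype(np.int32)
        dequant_scale = w_scale * x_scale                    # inverse quantization parameter
        print(acc * dequant_scale, float(params @ data))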
  • an embodiment of the present application provides a neural network model quantization device, which can reduce the number of inverse quantization operations that the quantized neural network model needs to perform subsequently, and improve the overall processing performance.
  • FIG. 6 is a schematic structural diagram of an apparatus for quantizing a neural network model provided by an embodiment of the present application.
  • the neural network model quantization apparatus 1300 may be located in the training device 120 shown in FIG. 1 or other devices.
  • the neural network model quantization apparatus 1300 includes a data quantization parameter generation model 1310 and an operator quantization model 1320 .
  • the neural network model quantization device 1300 is used for quantizing the original neural network model.
  • the original neural network model includes a first operator, a second operator and a first operation module.
  • the first operator and the second operator are used to perform the same type of operation.
  • the first operation module is configured to perform a first operation on the output of the first operator and the output of the second operator.
  • the quantization parameter generation model 1310 is configured to generate data quantization parameters according to the range of data in the training input data set.
  • the training input data set includes first training input data for the first operator and second training input data for the second operator.
  • the operator quantization model 1320 is used to quantize the operation units in the original neural network model such as the first operator and the second operator. According to the data quantization parameter and the quantized operation unit, the quantized neural network model can be obtained.
  • the quantized neural network model includes a quantization module, a quantized first operator, a quantized second operator, and a second operation module.
  • the quantization module is used for quantizing the first input data and the second input data respectively by using the data quantization parameter.
  • the second operation module corresponds to the first operation module in the original neural network model, and is used for performing the first operation on the first operation data and the second operation data.
  • the first operation data is obtained by operating the quantized first input data by the quantized first operator.
  • the second operation data is obtained by the quantized second operator operating on the quantized second input data.
  • FIG. 7 is a schematic flowchart of a method for quantizing a neural network model provided by an embodiment of the present application.
  • the neural network model quantization method may be performed by the training device 120 shown in FIG. 1 or other devices.
  • the original neural network model includes a first operator, a second operator, and a first operation module.
  • the first operator and the second operator are used to perform the same type of operation, and the first operation module is used to perform a first operation on the output of the first operator and the output of the second operator.
  • the original neural network model can be obtained from messages sent by other devices. Alternatively, the original neural network model can also be obtained from memory.
  • the original neural network model can be CNN, etc.
  • The operation performed by the first operator and the operation performed by the second operator are operations of the same type; that is, the first operator and the second operator are operators of the same type.
  • For example, the first operator and the second operator may both be convolutional layers in a CNN, or may both be fully connected layers.
  • The first operation module may be configured to perform an element-wise operation on the output of the first operator and the output of the second operator, such as element-wise addition or element-wise multiplication.
  • the first operation module can be used for linear operation.
  • the neural network model quantization method includes steps S1101 and S1102 for quantizing the original neural network model to obtain a quantized neural network model.
  • the neural network model quantization method may be performed by the training device 120 shown in FIG. 1 or other devices.
  • a data quantization parameter is determined according to the range of the first training input data and the range of the second training input data, the first training input data is the input data of the first operator, and the second training input data is the input data of the second operator.
  • the upper limit of the average data range may be determined according to the maximum value of multiple first training input data of the first operator and the maximum value of multiple second training input data of the second operator.
  • the lower limit of the average data range may be determined according to the minimum value of the plurality of first training input data of the first operator and the minimum value of the plurality of second training input data of the second operator.
  • the data quantization parameter may be determined according to the upper limit of the average data range and the lower limit of the average data range.
  • the upper limit of the average data range can be understood as the average value of the maximum value of the plurality of first training input data and the maximum value of the plurality of second training input data.
  • the lower limit of the average data range may be understood as the average value of the minimum value of the plurality of first training input data and the minimum value of the plurality of second training input data.
• the upper limit and the lower limit of the average data range can be updated each time training input data is input. The calculation of the upper limit and the lower limit of the average data range is therefore performed incrementally, which reduces the demand on computing resources compared with first collecting multiple training input data and then computing the average.
  • Weights can be introduced to enable updates to the upper and lower average data range limits.
• the number of digits (i.e., the number of bits) of the quantized input data of the operator is a preset value; for example, it may be 8 bits.
  • the number of bits of the quantized input data may also be obtained by manually inputting information or the like.
  • Data quantization parameters may include scale and offset.
• the number of bits corresponds one-to-one with the number of scales in the data quantization parameter.
• the number of scales can be understood as the maximum value that can be represented by the number of bits of the quantized data, that is, 2^m - 1, where m is the number of bits of the quantized data.
  • the scale can be obtained from the difference between the upper limit of the average data range and the lower limit of the average data range and the number of scales.
• the parameter scale can be the quotient obtained by dividing the difference between the upper limit of the average data range and the lower limit of the average data range by the number of scales; alternatively, the parameter scale can be the quotient obtained by dividing that difference plus 1 by the number of scales.
  • the offset in the data quantization parameter may be the ratio of the lower limit of the average data range to the parameter scale.
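• As an illustrative sketch of the scale and offset derivation described above (the function names, the rounding scheme, and the unsigned representation are assumptions for illustration, not the reference implementation), the data quantization parameter and the quantization of input data could look like this:

```python
import numpy as np

def data_quant_params(d_max, d_min, m=8):
    # Number of scales representable by m bits: 2^m - 1.
    num_scales = 2 ** m - 1
    # Scale: quotient of the average-range width and the number of scales.
    scale = (d_max - d_min) / num_scales
    # Offset: ratio of the lower limit of the average data range to the scale.
    offset = d_min / scale
    return scale, offset

def quantize(x, scale, offset, m=8):
    # Map floating-point input data to unsigned m-bit integers.
    q = np.round(np.asarray(x, dtype=np.float64) / scale - offset)
    return np.clip(q, 0, 2 ** m - 1).astype(np.uint8 if m <= 8 else np.int32)
```

• For example, with d_max = 6.0, d_min = -6.0 and m = 8, each quantization step corresponds to 12/255 ≈ 0.047.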
• the quantized neural network model includes a quantization module, a third operator, a fourth operator, and a second operation module. The quantization module is configured to use the data quantization parameter to quantize the first input data of the third operator and the second input data of the fourth operator respectively, and the second operation module is configured to perform the first operation.
• the second operation module may be configured to perform the first operation on the first operation data and the second operation data, where the first operation data is obtained by the third operator operating on the quantized first input data, and the second operation data is obtained by the fourth operator operating on the quantized second input data.
  • the second operation module corresponds to the first operation module in the original neural network model.
• the data quantization parameter is determined according to the numerical range of the input data of the first operator and the second operator in the original neural network model, and is used to quantize the input data of the third operator and the input data of the fourth operator in the quantized neural network model respectively.
  • the third operator and the first operator are used to perform the same operation
  • the fourth operator and the second operator are used to perform the same operation
• the types of operations performed by the first operator and the second operator are the same.
• the quantized neural network model can use the same data quantization parameter to quantize the data input to two different operators, so that the processing result of the third operator and the processing result of the fourth operator correspond to the same quantization parameter. The operation can then be performed directly on the processing result of the third operator and the processing result of the fourth operator, without inverse-quantizing those processing results before the operation, which simplifies the operation of the quantized neural network model and improves the data processing efficiency of the neural network model.
• the data quantization parameter used to quantize the input data of the third operator and the fourth operator is determined according to the numerical range of the data processed by the first operator and the second operator respectively, which improves the accuracy of the results that the third operator and the fourth operator produce on the quantized data. This not only improves the data processing efficiency of the neural network model, but also reduces the influence of quantization on the accuracy of the data processing results.
  • preset training output data corresponding to a training input data set may be obtained, and the training input data set includes the first training input data and the second training input data.
  • the preset training output data may be manually set.
  • the preset training output data may also be obtained by processing the first training input data and the second training input data by the original neural network model.
  • the preset training output data may be the output of the operation module.
  • the first training input data and the second training input data may be quantized respectively by using the data quantization parameter.
  • the quantized first training input data and the quantized second training input data may be processed by using the quantized neural network model to obtain actual training output data.
  • the data quantization parameter may be adjusted according to the difference between the actual training output data and the preset training output data to minimize the difference.
  • the quantization module is configured to use the adjusted data quantization parameter to quantize the first input data of the third operator and the second input data of the fourth operator respectively.
  • the input data of the third operator and the input data of the fourth operator may be quantized respectively by using the adjusted data quantization parameter.
• the adjusted data quantization parameter enables the results of processing the quantized data by the third operator and the fourth operator to have higher precision. While the data processing efficiency of the neural network model is improved, the influence of quantization on the accuracy of the data processing results is reduced.
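• A minimal sketch of such an adjustment, assuming a simple search over candidate scales rather than any particular optimization procedure (run_quantized_model, calib_inputs and preset_outputs are hypothetical placeholders supplied by the caller):

```python
import numpy as np

def adjust_data_scale(scale, offset, run_quantized_model, calib_inputs, preset_outputs,
                      steps=20, rel_step=0.01):
    # Try small multiplicative perturbations of the data-quantization scale and keep
    # the one whose actual training output is closest to the preset training output.
    best_scale, best_err = scale, float("inf")
    for k in range(-steps, steps + 1):
        s = scale * (1.0 + rel_step * k)
        err = 0.0
        for (x1, x2), y_ref in zip(calib_inputs, preset_outputs):
            y = run_quantized_model(x1, x2, s, offset)  # forward pass of the quantized model
            err += float(np.mean((np.asarray(y) - np.asarray(y_ref)) ** 2))
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale
```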
  • the first operator and the third operator are used to perform the same operation.
  • the second operator and the fourth operator are used to perform the same operation.
  • Two operators are used to perform the same operation. It can also be understood that two operators perform the same operation on the input data.
  • the parameters of the two operators are only different in precision.
• that is, the parameters of one operator can be obtained by quantizing the parameters of the other operator. By using the quantized operator to process the quantized input data, the amount of calculation can be reduced.
• the parameters of the third operator and the parameters of the fourth operator can be obtained by quantizing the parameters of the first operator and the parameters of the second operator respectively.
  • the operator quantization parameter may be determined according to the parameter range of the first operator and the parameter range of the second operator.
• the parameters of the first operator may be quantized using the operator quantization parameter to obtain the parameters of the third operator, and the parameters of the second operator may be quantized using the operator quantization parameter to obtain the parameters of the fourth operator.
• in this way, the quantized neural network model not only improves the data processing efficiency, but also reduces the effect of quantization on the precision and accuracy of the data processing results.
  • the data processing result output by the third operator and the data processing result output by the fourth operator may be compressed.
• the quantized neural network model further includes a compression module, which is configured to compress the output of the third operator and the output of the fourth operator according to an offset parameter to obtain the first operation data and the second operation data. The offset parameter is used to indicate the position, in the data before compression, of the most significant bit of the data after compression.
  • the first operation data obtained by compressing the data processing result output by the third operator and the second operation data obtained by compressing the data processing result output by the fourth operator have the same number of bits.
  • the offset parameter indicates the position of the highest bit in the compressed data in the data before the compression, and the output of the third operator and the output of the fourth operator are compressed by the same offset parameter, so that The first operation data and the second operation data are comparable, and operations can be performed directly without performing processing such as inverse quantization in subsequent reasoning.
• the offset parameter may be determined according to the number of effective bits of the output obtained by the third operator processing the quantized first training input data and the number of effective bits of the output obtained by the fourth operator processing the quantized second training input data.
  • the first training input data and the second training input data are respectively quantized by using the data quantization parameter.
  • the third operator may be used to process the quantized first training input data
  • the fourth operator may be used to process the quantized second training input data.
  • the output of the third operator is the first training operation data
  • the output of the fourth operator is the second training operation data.
  • the offset parameter may be determined according to the significant digits of the first training operation data and the significant digits of the second training operation data.
  • the quantized neural network model can be, for example, the data processing system 600 shown in FIG. 9 , or the data processing system 600 can call each operator or module in the quantized neural network model to process data.
  • FIG. 8 is a schematic flowchart of a method for quantizing a neural network model provided by an embodiment of the present application.
  • the neural network model quantization method 800 can also be understood as an optimization method or a further training method for the neural network model.
  • the training method 800 may be performed by the training device 120 shown in FIG. 1 or other devices.
  • the original neural network model includes a first operator, a second operator and an operation module.
  • the operation module is used to operate the output of the first operator and the output of the second operator.
  • the first operator and the second operator are used to perform the same type of operation.
  • the parameters of the first operator and the second operator are expressed in the format of floating point numbers.
  • the operator quantization parameter is determined according to the parameter range of the first operator and the parameter range of the second operator.
  • the scale in the operator quantization parameter can be expressed as s2:
• f_max is the maximum parameter value, represented as a floating-point number, of the first operator and the second operator
• f_min is the minimum parameter value, represented as a floating-point number, of the first operator and the second operator
• a is the number of bits of the quantization result. For data in int8 format, the value of a is 8.
  • the offset in the operator quantization parameter can be expressed as o2:
  • f min is a negative number
  • the parameters of the first operator may be quantized by using the operator quantization parameter to obtain the quantized first operator.
  • the parameters of the second operator may be quantized by using the operator quantization parameter to obtain a quantized second operator.
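• The key point is that a single set of operator quantization parameters is derived from the parameter ranges of both operators and then applied to each of them. A hedged sketch (the exact scale formula is assumed to mirror the data-quantization case, and an unsigned 8-bit representation is used purely for simplicity):

```python
import numpy as np

def operator_quant_params(w1, w2, a=8):
    # Shared range over the parameters of both operators.
    f_max = float(max(w1.max(), w2.max()))
    f_min = float(min(w1.min(), w2.min()))
    s2 = (f_max - f_min) / (2 ** a - 1)   # scale (assumed form)
    o2 = f_min / s2                       # offset: ratio of the minimum to the scale
    return s2, o2

def quantize_params(w, s2, o2, a=8):
    # Quantize one operator's parameters with the shared (s2, o2).
    q = np.round(w / s2 - o2)
    return np.clip(q, 0, 2 ** a - 1).astype(np.uint8)

# Both operators are quantized with the same (s2, o2), e.g.:
# s2, o2 = operator_quant_params(w_conv1, w_conv2)
# w1_q2 = quantize_params(w_conv1, s2, o2)
# w2_q2 = quantize_params(w_conv2, s2, o2)
```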
  • a training data set is obtained, where the training data set includes a training input data set and a preset operation result corresponding to the training input data set.
  • Each set of training input data includes first training input data and second training input data.
  • a preset operation result corresponds to a first training input data and a second training input data.
  • the preset operation result corresponding to the first training input data and the second training input data may be obtained by operating the result of processing the first training input data by the first operator and the processing result of the second training input data by the second operator.
  • the preset operation results corresponding to the first training input data and the second training input data may be manually set.
  • the data quantization parameter is determined according to the range of the first training input data and the range of the second training input data.
  • the training data set may include a plurality of first training input data and a plurality of second training input data.
  • Data quantization parameters can be determined based on each training input data range.
  • the training input data is the first training input data or the second training input data.
• the data quantization parameter may be determined according to the average range of the plurality of first training input data and the plurality of second training input data.
  • the scale in the data quantization parameter can be expressed as s1:
• d_max_t is the average maximum value obtained after the t-th iteration (which can also be understood as the upper limit of the average data range)
• d_min_t is the average minimum value obtained after the t-th iteration (which can also be understood as the lower limit of the average data range)
  • m is used to represent the number of bits of the quantization result obtained by quantizing the training input data. For data in int8 format, the value of m is 8. It should be understood that m is a preset value.
  • the average maximum value of the training input data can be expressed as:
• d_max_t-1 is the average maximum value of the training input data obtained after the (t-1)-th iteration
• v_max_t is the maximum value of the operator's input data in the t-th iteration
• c_t is continuously updated with the number of iterations: c_t = γ1·c_t-1 + 1, where γ1 is a constant.
• the input data of the operator includes the first training input data of the first operator and the second training input data of the second operator. The coefficient c_t can be understood as a weight.
• when γ1 is greater than 1, as the iteration progresses, the more iterations there are, the smaller the effect of the maximum value of the training input data on the upper limit of the average data range.
• when γ1 is less than 1, as the iteration progresses, the more iterations there are, the greater the effect of the maximum value of the training input data on the upper limit of the average data range.
• when γ1 is equal to 1, the maximum value of each training input data has the same effect on the upper limit of the average data range.
• the value of γ1 can be slightly larger than 1 to avoid excessive correction of the upper limit of the average data range.
  • the average minimum value of the training input data can be expressed as:
• d_min_t-1 is the average minimum value of the training input data obtained after the (t-1)-th iteration
• v_min_t is the minimum value of the operator's input data in the t-th iteration
• c_t is continuously updated with the number of iterations: c_t = γ2·c_t-1 + 1, where γ2 is a constant.
• the input data of the operator includes the first training input data of the first operator and the second training input data of the second operator.
• when γ2 is greater than 1, as the iteration progresses, the more iterations there are, the smaller the effect of the minimum value of the training input data on the lower limit of the average data range.
• when γ2 is less than 1, as the iteration progresses, the more iterations there are, the greater the effect of the minimum value of the training input data on the lower limit of the average data range.
• when γ2 is equal to 1, the minimum value of each training input data has the same effect on the lower limit of the average data range.
• the value of γ2 can be slightly larger than 1 to avoid excessive correction of the lower limit of the average data range.
• γ2 and γ1 may or may not be equal.
• the parameters c_0, γ1, γ2, d_max_0, and d_min_0 can be set.
• for example, the parameters c_0, d_max_0, and d_min_0 can be set to 0.
• v_max_0 and v_min_0 can be set according to empirical values; for example, v_min_0 can be set to 6.
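• The exact recurrence behind this weighted running average is not reproduced above; the sketch below assumes c_t = γ·c_t-1 + 1 and a weighted mean of the previous average and the per-iteration extremum, which matches the qualitative behaviour described for γ greater than, equal to, and less than 1:

```python
def update_average_range(d_max, d_min, c_max, c_min, v_max, v_min,
                         gamma1=1.0001, gamma2=1.0001):
    # One calibration iteration: fold the current batch extrema (v_max, v_min)
    # into the running averages (d_max, d_min) with iteration-dependent weights.
    c_max_new = gamma1 * c_max + 1
    d_max_new = (gamma1 * c_max * d_max + v_max) / c_max_new
    c_min_new = gamma2 * c_min + 1
    d_min_new = (gamma2 * c_min * d_min + v_min) / c_min_new
    return d_max_new, d_min_new, c_max_new, c_min_new

# With c = 0 and d_max = d_min = 0 as initial values, the first iteration simply
# adopts the extrema of the first batch of training input data.
```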
  • the first training input data and the second training input data may be processed respectively by using the data quantization parameter to obtain the quantized first training input data and the quantized second training input data.
  • the quantized first training input data may be input into the quantized first operator to obtain first training operation data.
  • the quantized second training input data may be input into the quantized second operator to obtain second training operation data. Afterwards, operations may be performed on the first training operation data and the second training operation data to obtain training output data.
• the first training operation data may be obtained by converting the data output by the quantized first operator.
• the number of bits of the first training operation data is smaller than the number of bits of the data output by the quantized first operator.
• the second training operation data may be obtained by converting the data output by the quantized second operator.
• the number of bits of the second training operation data is smaller than the number of bits of the data output by the quantized second operator.
• the average number of significant bits may be obtained by statistics.
• specifically, statistics on the upper limit of the average data range of the operator outputs can be collected to obtain the average number of significant bits.
  • the upper limit of the average data range can be obtained as:
• b_t-1 is the upper limit of the average data range obtained after the (t-1)-th iteration
• b_n_t is the upper limit of the range calculated for the output data of the operator in the t-th iteration
• c_t is continuously updated with the number of iterations: c_t = γ3·c_t-1 + 1, where γ3 is a constant.
  • the t-th iteration input data may be the first training input data or the second training input data.
• the parameters c_0, γ3, and b_0 can be set; for example, b_0 and c_0 can be set to 0.
• any one of the parameters γ1, γ2, and γ3 may be set randomly or according to certain rules, or may be set manually. In the iterative process, any one of the parameters γ1, γ2, and γ3 can be kept unchanged, or can be adjusted according to certain rules. This application does not limit this.
  • the offset parameter N can be determined:
• N = max(ceil(log2(b_t)) - m, 0)
• ceil is the round-up function
• ceil(log2(b_t)) is the number of significant bits of b_t, that is, the average number of significant bits after t iterations
• m is the number of bits of the data after the bit-number reduction.
• for example, when m is 16, N = max(ceil(log2(b_t)) - 16, 0).
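• A direct reading of the formula above, where b_t is the tracked upper limit of the average output range and m is the target bit width (the function name is illustrative):

```python
import math

def offset_parameter(b_t, m=16):
    # N = max(ceil(log2(b_t)) - m, 0): number of bits by which the operator output
    # must be shifted right so that its significant bits fit into m bits.
    return max(math.ceil(math.log2(b_t)) - m, 0)

# e.g. offset_parameter(1 << 20, m=16) == 4
```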
• a saturation operation is performed on a value whose number of significant bits is greater than N; that is, such a value is represented by N bits of "1". In other words, when the output of the quantized operator is larger than the value represented by N bits of "1", the value is represented by N bits of "1".
• bits P+1 to P+m in the output data of the quantized operator can be reserved (or, only bits P to P+m-1 can be reserved), so as to achieve compression (i.e., format conversion) of the output data of the quantized operator.
  • operations may be performed on the converted first training operation data and the second training operation data.
  • the processing of the input data by the quantized first operator can be expressed as:
• conv1_out is the output of the quantized first operator
• d1_q1 is the input data of the first operator obtained by quantization using the data quantization parameter
• w1_q2 is the parameter of the first operator obtained by quantization using the operator quantization parameter.
  • the processing of the input data by the quantized second operator can be expressed as:
• conv2_out is the output of the quantized second operator
• d2_q1 is the input data of the second operator obtained by quantization using the data quantization parameter
• w2_q2 is the parameter of the second operator obtained by quantization using the operator quantization parameter.
  • the operation result can be expressed as:
  • R is the operation result
  • tr(x, N) represents the conversion of the data x
  • the converted result includes the lowest preset number of bits after the data x is shifted to the right by N bits.
  • the data quantization parameter and the operator quantization parameter are adjusted according to the difference between the inverse quantized training output data and the preset operation result.
  • the training output data can be inversely transformed, and the data after the inverse transformation can be inversely quantized. That is to say, bits may be added to the right of the converted data, so that the bits of the inversely converted data and the quantized output data of the first operator and the second operator are equal. It should be understood that the value of the added bits may all be "0". Afterwards, the data after the increased bits can be left shifted by N bits to obtain the inverse-transformed training output data.
• when both the first operator and the second operator are conv operators, the inverse quantization of the training output data can be performed by multiplying the training output data by the scale in the data quantization parameter and the scale in the operator quantization parameter.
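• A minimal sketch of this inverse conversion and inverse quantization for the conv case, assuming the compression was a right shift by N bits and denoting the scales of the data quantization parameter and the operator quantization parameter by s1 and s2:

```python
import numpy as np

def inverse_convert(r, n_shift):
    # Widen the compressed result and undo the right shift applied during compression.
    return np.left_shift(np.asarray(r, dtype=np.int64), n_shift)

def dequantize_output(r, n_shift, s1, s2):
    # For conv operators, a floating-point estimate of the training output is the
    # inverse-converted result multiplied by both scales.
    return inverse_convert(r, n_shift).astype(np.float64) * s1 * s2
```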
  • the parameters of the first operator and the second operator are respectively quantized.
  • the data processing system 600 can be determined according to the data quantization parameter, the quantized first operator, the quantized second operator, and the offset parameter N.
  • the operation model 640 may be the operation model in the original neural network model before quantization, or the parameters in the operation model 640 may be obtained by quantizing the parameters in the operation model in the original neural network model.
  • S810 to S850 may be performed by a server.
  • the server may send the data quantization parameter, the quantized first operator, the quantized second operator, the offset parameter, and the like to the terminal device.
  • the terminal device can determine the data processing system 600 shown in FIG. 9 .
  • FIG. 9 is a schematic structural diagram of a data processing system provided by an embodiment of the present application.
  • the data processing system 600 may be located in the computing module 111 of the execution device 110 , and the data processing system 600 may be the target model/rule 101 shown in FIG. 1 .
  • the data processing system 600 may be obtained by quantizing the trained neural network model by the training equipment 120 shown in FIG. 1 or other apparatuses.
  • Data processing system 600 may also be referred to as a quantized neural network model.
  • Data processing system 600 may be part of CNN 200 shown in FIG. 2 or CNN 300 shown in FIG. 3, or the various components of data processing system 600 may be located in one or more CNNs.
  • the data processing system 600 includes a quantization model 610 , a first operator 620 , a second operator 630 and an operation model 640 .
  • the quantization model 610 is used to quantize the first input data of the first operator 620 and the second input data of the second operator 630 respectively by using the data quantization parameter.
  • the format of the first input data and the format of the second input data may both be floating point numbers. For example, it can be a 32-bit single-precision floating-point number (float32) or a 16-bit half-precision floating-point number (float16).
  • the quantization model 610 may use the data quantization parameter to quantize the first input data and the second input data, respectively, to obtain the quantized first input data and the quantized second input data.
  • the format of the quantized first input data and the format of the quantized second input data may both be an 8-bit quantization result (int8).
  • Data quantization parameters may include scale and offset. Among them, scale is used to represent the increment of the floating-point number corresponding to each increase of "1" in the quantization result, and offset is used to represent the ratio of the floating-point number represented by the minimum value of the quantization result to the scale.
  • the first operator 620 is used to process the first input data quantized by the quantization model 610 to obtain first operation data.
  • the second operator 630 is used to process the second input data quantized by the quantization model 610 to obtain second operation data.
  • the parameters of the first operator 620 and the parameters of the second operator 630 are obtained by quantization using the operator quantization parameters.
• the parameters of the first operator before quantization in the trained neural network model are quantized by using the operator quantization parameter to obtain the parameters of the first operator 620;
• the parameters of the second operator before quantization in the trained neural network model are quantized by using the operator quantization parameter to obtain the parameters of the second operator 630.
  • the operator quantization parameters may include a step size (scale) and an offset (offset).
  • the first operator 620 and the second operator 630 are used to perform the same type of operation. That is, the first operator 620 and the second operator 630 may both be the same type of operator in the neural network model.
  • the first operator 620 and the second operator 630 may both be convolutional (convolutional, conv) operators for convolution operations.
• the first operator 620 and the second operator 630 may each represent a convolutional layer.
  • Each module in the data processing system 600 may be a part of the CNN 200 shown in FIG. 2 or a part of the CNN 300 shown in FIG. 3 .
  • the first operator 620 and the second operator 630 may also be fully connected layers.
• the activation function of each neuron in the fully connected layer generally adopts a rectified linear unit (ReLU).
  • the output of the first operator 620 and the output of the second operator 630 need to be subsequently operated by the operation model 640 .
  • the first operator 620 and the second operator 630 may be located in different CNNs, and the operation model 640 may be used to process data output by different CNNs.
  • the first operator 620 and the second operator 630 may also be the same type of operators in other types of neural network models.
• the output of the conv operator is a 32-bit quantization result (int32). That is to say, when the first operator and the second operator are both conv operators, and the parameters of the first operator, the parameters of the second operator, the quantized first input data, and the quantized second input data are all in int8 format, the output data of the first operator and the second operator is in int32 format.
  • the parameters of the conv operator can also be understood as the weights in the conv operator.
• the processing result conv_out (int32 format) of the quantized input data d_q1 (int8 format) can be expressed as:
• w_q2 is the parameter of the operator obtained by quantization using the operator quantization parameter, and its format is also int8.
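• To illustrate why int8 × int8 products are accumulated into an int32 result, here is a small sketch that uses a plain dot product in place of a full convolution (the principle is the same for the conv operator):

```python
import numpy as np

# Quantized input data d_q1 and quantized operator parameters w_q2, both int8.
d_q1 = np.random.randint(0, 128, size=(64,), dtype=np.int8)
w_q2 = np.random.randint(0, 128, size=(64,), dtype=np.int8)

# Each product fits in 16 bits, but the sum of many products needs a wider
# accumulator, so the output conv_out is kept in int32.
conv_out = np.dot(d_q1.astype(np.int32), w_q2.astype(np.int32))
print(conv_out.dtype)  # int32
```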
  • the operation model 640 is used to perform operations on the first operation data and the second operation data.
  • the operation model 640 may perform a linear operation on the first operation data and the second operation data.
  • the operation model 640 may also perform a bit-by-bit operation on the first operation data and the second operation data, such as a bit-by-bit addition or a bit-by-bit multiplication operation.
• the data processing system 600 uses the data quantization parameter to quantize the input data of the two operators respectively, and the parameters of the two operators are obtained by quantization using the operator quantization parameter.
• the outputs of the two operators can therefore be operated on directly, which avoids inverse-quantizing the outputs of the two operators, reduces the amount of calculation, reduces the power consumption of the operation, and improves the data processing performance of the data processing system 600.
  • Inverse quantization is a way of vector computing.
  • the computing power of the vector computing method is weaker than that of the matrix computing method.
  • the method of matrix calculation includes convolution operation and so on.
  • the operations of neural network models often include multiple matrix calculations and multiple vector calculations in series.
  • the computing power of the processor for matrix calculation is higher than that for vector calculation.
• the operation model 640 can operate directly on the output of the first operator and the output of the second operator, and the data processing system 600 can reduce the inverse quantization operations required by the neural network model, thereby alleviating the vector bound and effectively improving the data processing capability of the neural network model.
  • Data processing system 600 may also include format conversion model 650 .
  • the format conversion model 650 may be used for data compression and may also be referred to as a compression model.
  • the format conversion model 650 is used to reduce the number of bits of the first original operation data output by the first operator 620 to obtain the first operation data.
• the format conversion model 650 is also used to reduce the number of bits of the second original operation data output by the second operator 630 to obtain the second operation data.
  • the format conversion model 650 is used to convert the formats of the first original operation data and the second original operation data whose format is int32 to int16, respectively.
• the int16 data obtained by format conversion of the first original operation data output by the first operator 620 is the first operation data
• the int16 data obtained by format conversion of the second original operation data output by the second operator 630 is the second operation data.
  • the format conversion model 650 may determine the first operation data and the second operation data according to the offset parameter.
  • the processing process of the format conversion model 650 can be understood as data compression.
• when, in the first original operation data, the bits before the bit indicated by the offset parameter are all 0, the first operation data includes the bit indicated by the offset parameter in the first original operation data and the bits after it, a preset number of bits in total.
• otherwise, the first operation data is the preset number of bits all set to "1".
• similarly, the second operation data includes the bit indicated by the offset parameter in the second original operation data and the bits after it, a preset number of bits in total.
  • the format conversion model 650 can perform a right shift operation and a saturation operation for any one of the first original operation data or the second original operation data.
  • the right shift operation can be expressed as:
• conv_out' = conv_out >> N
• N is the number of bits shifted to the right. It should be understood that the sum of the number of right-shifted bits N and the number of bits of the data output by the format conversion model 650 is less than or equal to the number of bits of the original operation result conv_out.
• conv_INT16 = clip(conv_out', 0, 2^p - 1)
• conv_INT16 represents the saturation operation result
• p is the difference between the number of bits of the original operation result conv_out and the number of right-shifted bits N.
• the clip(a, b, c) operator means that a is limited to the range between b and c: when a is less than b, the result of the operation is b; when a is greater than or equal to b and less than or equal to c, the result of the operation is a; when a is greater than c, the result of the operation is c.
• the number of bits of the operation result of the clip operator may be the same as the number of bits of the original operation result, and the lowest preset number m of bits may be taken from the saturation operation result as the operation data. That is, the lowest m bits of the saturation operation result are the operation data corresponding to the original operation data.
• in other words, the original operation data can be shifted right by N bits, and the result is compared with the result of right-shifting, by N bits, a binary number that has the same number of bits as the original operation result and all bits set to 1.
• when the right-shift result of the original operation data does not exceed the right-shift result of the all-ones number, the lowest preset number of bits of the right-shift result of the original operation data is taken as the operation data; otherwise, the preset number of "1" bits is used as the operation data.
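• A sketch of this right-shift-and-saturate conversion that produces a 16-bit operation value from a 32-bit original operation value. The saturation limit is taken here directly as the largest value representable in the target bit width, which is one reasonable reading of the clip-and-truncate steps above:

```python
import numpy as np

def compress(conv_out, n_shift, out_bits=16):
    # Right shift by N, then saturate so that anything that still does not fit in
    # out_bits bits becomes all '1's (the maximum value out_bits bits can represent).
    x = np.right_shift(np.asarray(conv_out, dtype=np.int64), n_shift)
    limit = (1 << out_bits) - 1
    return np.clip(x, 0, limit).astype(np.int32)

# e.g. compress(np.array([0x0012_3456], dtype=np.int64), n_shift=8) -> [0x1234]
```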
  • the first operation data may be the first original operation data, or may be data obtained by performing a right shift operation and a saturation operation on the first original operation data.
  • the second operation data may be the second original operation data, or may be data obtained by performing a right shift operation and a saturation operation on the second original operation data.
  • the format conversion model 650 can convert the original operation data into operation data, and use the operation data as the input of the operation module 640 .
  • the calculation amount of the operation module 640 can be reduced, thereby improving the data processing performance of the data processing system 600 .
  • FIG. 11 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • the data processing method can be executed by the calculation module 111 in the execution device 110 shown in FIG. 1 .
  • S1201 Obtain a quantized neural network model, where the quantized neural network model is obtained by quantizing an original neural network model, where the original neural network model includes a first operator, a second operator, and a first operation module , the first operator is used for performing the same type of operation as the second operator, and the first operation module is used for the output of the first operator and the output of the second operator Perform the first operation.
• the quantized neural network model includes a quantization module, a third operator, a fourth operator, and a second operation module; the quantization module is used to quantize the first input data and the second input data respectively by using the data quantization parameter.
  • the third operator is the first operator after quantization
  • the fourth operator is the second operator after quantization
• the data quantization parameter is determined according to the range of the first training input data of the first operator and the range of the second training input data of the second operator.
  • the second operation module may be configured to perform the first operation on the first operation data and the second operation data.
• the first operation data is obtained by using the third operator to operate on the quantized first input data, and the second operation data is obtained by using the fourth operator to operate on the quantized second input data.
• the original neural network model is quantized, and the obtained quantized neural network model performs the same operations as the original neural network model; only the precision of the operation results changes.
  • the data quantization parameter is determined according to the range of the first training input data of the first operator and the range of the second training input data of the second operator, thereby improving the data processing accuracy of the quantized neural network model .
• the second operation module can perform the operation on the first operation data and the second operation data directly, without inverse-quantizing the first operation data and the second operation data before the operation.
  • the first operation data is obtained by processing the quantized first input data by the third operator
  • the second operation data is obtained by processing the quantized second input data by the fourth operator.
  • the data quantization parameter is obtained by adjusting the initial data quantization parameter, and the adjustment minimizes the difference between the actual training output data and the preset training output data.
  • the initial quantization parameter is determined according to the range of the first training input data and the range of the second training input data.
  • the preset training output data corresponds to a training input data set, and the training input data set includes the first training input data and the second training input data.
• the actual training output data is obtained by using the quantized neural network model to process the first training input data and the second training input data, where the quantization module uses the initial data quantization parameter to quantize the first training input data and the second training input data respectively.
  • the initial quantization parameter is determined according to the range of the first training input data and the range of the second training input data.
  • the first training input data and the second training input data are processed by using the quantized neural network model to obtain actual training output data, wherein the quantization module uses the initial data quantization parameters to perform the first training input data and the second training input data. quantify.
  • the initial data quantization parameters are adjusted to minimize the difference between the actual training output data and the preset training output data, thereby obtaining the data quantization parameters.
• the third operator is used to operate on the quantized first training input data to obtain the first training operation data; the fourth operator is used to operate on the quantized second training input data to obtain the second training operation data; the second operation module is used to perform the first operation on the first training operation data and the second training operation data to obtain the actual training output data.
• since the data quantization parameter minimizes the difference between the actual training output data and the preset training output data, the quantized neural network model has higher accuracy.
• minimizing the difference between the actual training output data and the preset training output data can be understood as gradually adjusting the initial data quantization parameter according to the difference between the actual training output data and the preset training output data until the difference between the actual training output data and the preset training output data is within a certain preset range, or until the number of adjustments reaches a preset number of times; the initial data quantization parameter at this point is determined as the adjusted data quantization parameter.
• the parameter of the third operator is obtained by quantizing the parameter of the first operator by using the operator quantization parameter
• the parameter of the fourth operator is obtained by quantizing the parameter of the second operator by using the operator quantization parameter
• the operator quantization parameter is determined according to the parameter range of the first operator and the parameter range of the second operator.
• the quantized neural network model further includes a compression module, and the compression module is configured to compress the output of the third operator and the output of the fourth operator respectively according to the offset parameter, so as to obtain the first operation data and the second operation data.
• the offset parameter is used to indicate the position, in the data before compression, of the highest bit of the data after compression, and the second operation module is configured to perform the first operation on the compressed data.
• the offset parameter is determined according to the significant bits of the first training operation data and the significant bits of the second training operation data; the first training operation data is obtained by using the third operator to process the first training input data quantized with the data quantization parameter, and the second training operation data is obtained by using the fourth operator to process the second training input data quantized with the data quantization parameter.
  • the data processing accuracy of the quantized neural network model is improved while reducing the amount of operation.
  • FIG. 12 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • the data processing method 700 includes S710 to S720.
  • the data processing method 700 may be executed by the computing module 111 in the execution device 110 shown in FIG. 1 .
  • the first input data of the first operator in the neural network model and the second input data of the second operator in the neural network model are respectively quantized by using the data quantization parameter.
  • the first processing information is obtained by processing the quantized first input data by using the first operator
• the second processing information is obtained by processing the quantized second input data by using the second operator.
• the first input data of the first operator and the second input data of the second operator are quantized by using the same data quantization parameter, so that the output of the first operator and the output of the second operator can be operated on directly without inverse quantization or other processing, which improves the data processing efficiency of the neural network model.
  • the first parameter of the first operator and the second parameter of the second operator may be floating-point numbers, or may be obtained by quantizing the parameters of the floating-point numbers by using an operator quantization parameter.
  • the first parameter and the second parameter are obtained by quantization, which can reduce the size of the first operator and the second operator, and reduce the occupation of resources when the first operator and the second operator process data.
  • the quantization of the first parameter and the quantization of the second parameter both use the operator quantization parameter, so that the data processing results of the first operator and the second operator can be directly calculated without performing other processing such as inverse quantization. Data processing efficiency of network models.
  • the operator quantization parameter may be obtained according to the range of the first parameter and the range of the second parameter.
  • the determination of the operator quantization parameters may be performed by the training device 120 shown in FIG. 1 or other devices.
  • the device for determining the data quantization parameter and the device for performing S710 to S720 may be the same or different devices.
  • the operator quantization parameter may be obtained according to the maximum value and the minimum value of the first parameter and the second parameter.
  • the operator quantization parameters may include scale and offset.
  • the difference between the maximum value and the minimum value in the first parameter and the second parameter may be equally divided according to the number of bits of the quantization result, so as to obtain the scale of the operator quantization parameter.
  • the offset in the operator quantization parameter may be determined according to the ratio of the minimum value of the first parameter and the second parameter to the scale in the operator quantization parameter.
  • the data quantization parameter may be determined according to the range of the data processed by the first operator and the second operator.
  • the determination of the data quantization parameters may be performed by the training device 120 shown in FIG. 1 or other devices.
  • the device for determining the data quantization parameter and the device for performing S710 to S720 may be the same or different devices.
  • a training data set may be obtained, and the training data set includes first training input data and second training input data.
  • the first training input data is the input data of the first operator before quantization
  • the second training input data is the input data of the second operator before quantization.
  • the data quantization parameter may be determined according to the range of the first training input data and the range of the second training input data.
  • the data quantization parameter may be determined from a plurality of first training input data and a plurality of second training input data.
• each first training input data and each second training input data includes a plurality of numerical values.
• the average maximum value over the first training input data and the second training input data can be used as the maximum value that can be represented by the quantization result under the data quantization parameter, and the average minimum value over the first training input data and the second training input data can be used as the minimum value that can be represented by the quantization result under the data quantization parameter.
  • the average maximum value may be a weighted average of multiple maximum values
  • the average minimum value may be a weighted average of multiple minimum values.
  • the weight can be understood as the degree of influence of the maximum value or the minimum value in each of the first training input data and the second training input data on the data quantization parameter. Specifically, reference may be made to the description of FIG. 8 .
• the data quantization parameter and/or the operator quantization parameter may be adjusted according to the difference between the preset operation result and the result obtained by operating on the first operator's processing result of the quantized first training input data and the second operator's processing result of the quantized second training input data.
  • the training data set further includes preset operation results corresponding to the first training input data and the second training input data.
• the preset operation result corresponding to the first training input data and the second training input data may be the operation result obtained by operating on the processing result of the first training input data by the first operator before quantization and the processing result of the second training input data by the second operator before quantization; alternatively, the preset operation result may be manually set.
  • the format of the preset operation result can be a floating point number.
  • the operation in S720 may be performed on the first training operation data and the second training operation data to obtain training output data.
• the first training operation data is obtained by the first operator processing the first training input data quantized with the data quantization parameter, and the second training operation data is obtained by the second operator processing the second training input data quantized with the data quantization parameter.
  • the training output data can be inverse quantized.
  • the data quantization parameter and/or the operator quantization parameter may be adjusted according to the difference between the inverse quantization result of the training output data and the preset operation result.
  • the operation result of the first operator and the operation result of the second operator may be processed by reducing the number of digits.
  • the first operator processes the quantized first input data and outputs the first original operation data.
  • the second operator processes the quantized second input data and outputs second original operation data.
• the highest-order preset number of bits of the first original operation data may be taken as the first operation data, and the highest-order preset number of bits of the second original operation data may be taken as the second operation data, and the subsequent operation is then performed.
• the highest-order preset number of bits is the leftmost preset number of bits.
  • the first operation data and/or the second operation data may be determined according to the offset parameter.
• when, in the first original operation data, the bits before the bit indicated by the offset parameter are all 0, the first operation data includes the bit indicated by the offset parameter in the first original operation data and a preset number of bits after it; otherwise, the preset number of bits included in the first operation data are all "1".
• similarly, when, in the second original operation data, the bits before the bit indicated by the offset parameter are all 0, the second operation data includes the bit indicated by the offset parameter in the second original operation data and a preset number of bits after it; otherwise, the preset number of bits included in the second operation data are all "1".
  • compression processing to reduce the number of bits is performed on both the first original operation data and the second original operation data.
• when the processing result of the first operator or the processing result of the second operator has valid data at the bit indicated by the offset parameter or at higher bits, the first operation data corresponding to the processing result of the first operator or the second operation data corresponding to the processing result of the second operator may be represented as a preset number of "1" bits.
• this method can also be understood as a saturation operation. That is to say, when the processing result is greater than the maximum value that can be represented by the number of bits after the bit indicated by the offset parameter, the processing result is represented by the preset number of bits all set to "1", that is, by the maximum value that the preset number of bits can represent.
  • the offset parameter may be obtained according to the result of processing the first training input data by the first operator and the processing result of the second training input data by the second operator.
  • the first operator may process multiple quantized first training parameters, and the processing result of each quantized first training parameter by the first operator includes multiple numbers.
  • the plurality of numbers may form a matrix or a vector or the like.
  • the second operator may process multiple quantized second training parameters, and the processing result of the second operator for each quantized second training parameter includes multiple numbers.
• the offset parameter can be determined based on the average value of the largest number of significant digits in each processing result. For example, the average value can be rounded up, and the offset parameter is used to indicate the highest significant bit position corresponding to the rounded-up average value.
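• One plausible way to derive the offset parameter from such calibration results, assuming the "significant digits" are the bit length of the largest magnitude in each processing result and that the per-result values are averaged and rounded up (function and variable names are illustrative):

```python
import math
import numpy as np

def offset_from_results(results, m=16):
    # results: outputs produced by the quantized operators on calibration inputs.
    bit_lengths = []
    for r in results:
        peak = int(np.abs(np.asarray(r)).max())
        bit_lengths.append(max(peak, 1).bit_length())
    avg_bits = math.ceil(sum(bit_lengths) / len(bit_lengths))
    # Shift amount so that the retained window of m bits covers the average
    # most significant bit position.
    return max(avg_bits - m, 0)
```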
  • the offset parameter may also be adjusted according to the difference between the inverse quantization result of the training output data and the preset operation result. Therefore, the precision and accuracy of the data processing result are higher.
• the original neural network model may be traversed to determine whether the original neural network model includes a first operator, a second operator, and an operation module that operates on the output of the first operator and the output of the second operator.
  • FIG. 9 takes the first operator and the second operator as the convolution operator and the operation model as the eltwise operator as an example for description.
  • FIG. 13 is a schematic flowchart of a processing structure identification method provided by an embodiment of the present application.
  • the processing structure identification method may be performed by the training device 120 shown in FIG. 1 or other devices.
  • the data processing system, the neural network model quantization method, and the data processing method provided by the embodiments of the present application are described above with reference to FIGS. 1 to 13 , and the apparatus embodiments of the embodiments of the present application are described below with reference to FIGS. 14 to 17 . It should be understood that the descriptions of the data processing system, the neural network model quantification method, and the data processing method correspond to the descriptions of the apparatus embodiments. Therefore, for the parts not described in detail, reference may be made to the above descriptions.
  • FIG. 14 is a schematic structural diagram of an apparatus for quantizing a neural network model provided by an embodiment of the present application.
  • the neural network model quantization apparatus 3000 may be located in the training device 120 shown in FIG. 1 or other devices.
  • the neural network model quantization apparatus 3000 includes a storage module 3010 and a processing module 3020 .
  • the storage module 3010 is used to store programs.
• the processing module 3020 is configured to: obtain an original neural network model, where the original neural network model includes a first operator, a second operator and a first operation module, the first operator is used to perform a first operation, the second operator is used to perform a second operation, the first operation and the second operation are operations of the same type, and the first operation module is used to perform a third operation on the output of the first operator and the output of the second operator; determine a data quantization parameter according to the range of first training input data and the range of second training input data, where the first training input data is the input data of the first operator and the second training input data is the input data of the second operator; and determine a quantized neural network model according to the original neural network model, where the quantized neural network model includes a quantization module, a third operator, a fourth operator, and a second operation module, the quantization module is configured to use the data quantization parameter to quantize the first input data of the third operator and the second input data of the fourth operator respectively, and the second operation module is configured to perform the third operation.
  • the processing module 3020 is further configured to acquire preset training output data corresponding to a training input data set, where the training input data set includes the first training input data and the second training input data.
  • the processing module 3020 is further configured to use the data quantization parameter to quantize the first training input data and the second training input data respectively.
  • the processing module 3020 is further configured to process the quantized first training input data and the quantized second training input data by using the quantized neural network model to obtain actual training output data.
  • the processing module 3020 is further configured to, according to the difference between the actual training output data and the preset training output data, adjust the data quantization parameter to minimize the difference.
  • the quantization module is configured to use the adjusted data quantization parameter to quantize the first input data of the third operator and the second input data of the fourth operator respectively.
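One hedged way to picture the adjustment step is a coarse search over candidate scales, as sketched below; run_quantized_model is an assumed callable standing in for the quantized model, and the fixed perturbation factors are illustrative. A real adjustment could equally be gradient-based or finer-grained.

```python
import numpy as np

# Illustrative calibration sketch only: perturb the shared scale by a few
# factors and keep the value whose quantized-model output is closest to the
# preset training output.
def adjust_data_scale(scale, offset, first_in, second_in, preset_out,
                      run_quantized_model, factors=(0.8, 0.9, 1.0, 1.1, 1.2)):
    best_scale, best_err = scale, float('inf')
    for f in factors:
        candidate = scale * f
        actual_out = run_quantized_model(first_in, second_in, candidate, offset)
        err = float(np.mean((np.asarray(actual_out) - np.asarray(preset_out)) ** 2))
        if err < best_err:
            best_scale, best_err = candidate, err
    return best_scale
```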
  • the processing module 3020 is further configured to determine the operator quantization parameter according to the parameter range of the first operator and the parameter range of the second operator.
  • the processing module 3020 is further configured to quantize the parameter of the first operator by using the operator quantization parameter to obtain the parameter of the third operator.
  • the processing module 3020 is further configured to quantize the parameter of the second operator by using the operator quantization parameter to obtain the parameter of the fourth operator.
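A minimal sketch of this shared operator quantization, under the assumption of simple min/max range statistics over both operators' parameters, might look as follows; quantize_both_weights and its rounding details are illustrative only.

```python
import numpy as np

# Hedged sketch: the two operators' parameters share one operator quantization
# parameter derived from their combined value range.
def quantize_both_weights(w1, w2, num_bits=8):
    merged = np.concatenate([np.ravel(w1), np.ravel(w2)])
    f_min, f_max = float(merged.min()), float(merged.max())
    s2 = max((f_max - f_min) / (2 ** num_bits - 1), 1e-12)
    o2 = f_min / s2

    def to_int(w):
        q = np.round(np.asarray(w, dtype=np.float64) / s2 - o2)
        return np.clip(q, 0, 2 ** num_bits - 1).astype(np.int64)

    return to_int(w1), to_int(w2), s2, o2
```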
  • the quantized neural network model further includes a compression module, and the compression module is configured to compress the output of the third operator and the output of the fourth operator according to the offset parameter, respectively.
  • the offset parameter is used to indicate the position of the highest bit in the compressed data in the data before the compression, and the second operation module is used to perform the first operation on the compressed data.
  • the processing module 3020 is further configured to use the data quantization parameter to quantize the first training input data and the second training input data respectively.
  • the processing module 3020 is further configured to process the quantized first training input data by using the third operator, and the third operator outputs the first training operation data.
  • the processing module 3020 is further configured to process the quantized second training input data by using the fourth operator, and the fourth operator outputs the second training operation data.
  • the processing module 3020 is further configured to determine the offset parameter according to the significant digits of the first training operation data and the significant digits of the second training operation data.
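The following sketch illustrates one plausible reading of this step: the offset parameter is taken as a shift amount derived from the significant bit width of the training operation data, and compression is a right shift followed by saturation. The helper names, the use of the peak absolute value, and the 16-bit target width are assumptions.

```python
import math
import numpy as np

# Sketch: derive a shared shift amount N from the significant bit width of
# both operators' intermediate outputs, then compress int32 outputs to a
# smaller width by right shift plus saturation.
def offset_parameter(train_out1, train_out2, kept_bits=16):
    peak = max(int(np.abs(train_out1).max()), int(np.abs(train_out2).max()), 1)
    significant_bits = math.ceil(math.log2(peak + 1))
    return max(significant_bits - kept_bits, 0)

def compress(x, n_shift, kept_bits=16):
    # Right-shift by N, then saturate anything that still exceeds kept_bits bits.
    shifted = np.asarray(x, dtype=np.int64) >> n_shift
    return np.clip(shifted, 0, 2 ** kept_bits - 1)
```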
  • FIG. 15 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • the data processing apparatus 2000 may be located in the execution device 110 shown in FIG. 1 or in other devices.
  • the data processing apparatus 2000 includes a storage module 2010 and a processing module 2020 .
  • the storage module 2010 is used to store programs.
  • the processing module 2020 is used to: obtain a quantized neural network model, where the quantized neural network model is obtained by quantizing an original neural network model, the original neural network model includes a first operator, a second operator, and a first operation module, the first operator and the second operator are used to perform operations of the same type, and the first operation module is used to perform a first operation on the output of the first operator and the output of the second operator; and process first input data of a third operator and second input data of a fourth operator by using the quantized neural network model, where the quantized neural network model includes a quantization module, the third operator, the fourth operator, and a second operation module, the quantization module is used to quantize the first input data and the second input data respectively by using a data quantization parameter, the second operation module is used to perform the first operation, the third operator is the quantized first operator, and the fourth operator is the quantized second operator.
  • the data quantization parameter is determined according to the range of the first training input data of the first operator and the range of the second training input data of the second operator.
  • the data quantization parameter is obtained by adjusting the initial data quantization parameter, and the adjustment minimizes the difference between the actual training output data and the preset training output data.
  • the initial quantization parameter is determined according to the range of the first training input data and the range of the second training input data.
  • the preset training output data corresponds to a training input data set, and the training input data set includes the first training input data and the second training input data.
  • the actual training output data is obtained by using the quantized neural network model to process the first training input data and the second training input data, where the quantization module quantizes the first training input data and the second training input data respectively by using the initial data quantization parameter.
  • the parameter of the third operator is obtained by quantizing the parameter of the first operator by using the operator quantization parameter
  • the parameter of the fourth operator is obtained by quantizing the parameter of the second operator by using the operator quantization parameter, and the operator quantization parameter is determined according to the parameter range of the first operator and the parameter range of the second operator.
  • the quantized neural network model further includes a compression module, and the compression module is configured to compress the output of the third operator and the output of the fourth operator according to the offset parameter, respectively.
  • the offset parameter is used to indicate the position of the highest bit in the compressed data in the data before the compression, and the second operation module is used to perform the first operation on the compressed data.
  • the offset parameter is determined according to the significant digits of the first training operation data and the significant digits of the second training operation data, where the first training operation data is obtained by using the third operator to process the first training input data quantized by using the data quantization parameter, and the second training operation data is obtained by using the fourth operator to process the second training input data quantized by using the data quantization parameter. An end-to-end sketch of this quantized inference path is given after this paragraph.
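Putting the pieces together, the sketch below shows the quantized inference path implied by this description, reusing the quantize() and compress() helpers from the earlier sketches. It is a simplified illustration, not the apparatus's actual implementation: the integer matrix product stands in for the real convolution, and offset handling inside the integer arithmetic is deliberately omitted. The point is that both branches share one data quantization parameter and one offset parameter, so the element-wise operation runs directly on the compressed integers without any dequantization in between.

```python
# End-to-end sketch of the quantized inference path (uses quantize() and
# compress() from the earlier sketches).
def quantized_forward(x1, x2, w3_q, w4_q, scale, offset, n_shift, kept_bits=16):
    q1 = quantize(x1, scale, offset)   # first input data of the third operator
    q2 = quantize(x2, scale, offset)   # second input data of the fourth operator
    out3 = q1 @ w3_q                   # stand-in for the quantized convolution
    out4 = q2 @ w4_q
    c3 = compress(out3, n_shift, kept_bits)
    c4 = compress(out4, n_shift, kept_bits)
    return c3 + c4                     # second operation module: element-wise add
```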
  • FIG. 16 is a schematic diagram of a hardware structure of a data processing apparatus according to an embodiment of the present application.
  • the data processing apparatus 4000 shown in FIG. 16 includes a memory 4001 , a processor 4002 , a communication interface 4003 , and a bus 4004 .
  • the memory 4001 , the processor 4002 , and the communication interface 4003 are connected to each other through the bus 4004 for communication.
  • the memory 4001 may be ROM, static storage device and RAM.
  • the memory 4001 may store a program. When the program stored in the memory 4001 is executed by the processor 4002, the processor 4002 and the communication interface 4003 are used to execute each step of the data processing method of the embodiment of the present application.
  • the processor 4002 may adopt a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is used to execute a related program, so as to realize the functions required to be performed by the units in the data processing apparatus of the embodiments of the present application, Or execute the data processing method of the method embodiment of the present application.
  • the processor 4002 may also be an integrated circuit chip with signal processing capability, for example, the chip shown in FIG. 4 .
  • each step of the data processing method of the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 4002 or an instruction in the form of software.
  • the above-mentioned processor 4002 may also be a general-purpose processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components.
  • the methods, steps, and logic block diagrams disclosed in the embodiments of this application can be implemented or executed.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 4001, and the processor 4002 reads the information in the memory 4001 and, in combination with its hardware, completes the functions required to be performed by the units included in the data processing apparatus of the embodiments of the present application, or executes the data processing method of the method embodiments of the present application.
  • the communication interface 4003 implements communication between the device 4000 and other devices or a communication network using a transceiver device such as, but not limited to, a transceiver.
  • the image to be processed can be acquired through the communication interface 4003 .
  • Bus 4004 may include a pathway for communicating information between various components of device 4000 (eg, memory 4001, processor 4002, communication interface 4003).
  • FIG. 17 is a schematic diagram of a hardware structure of a neural network model quantization apparatus according to an embodiment of the present application. Similar to the above-mentioned apparatus 4000 , the neural network model quantization apparatus 5000 shown in FIG. 17 includes a memory 5001 , a processor 5002 , a communication interface 5003 and a bus 5004 . The memory 5001 , the processor 5002 , and the communication interface 5003 are connected to each other through the bus 5004 for communication.
  • the original neural network model can be quantified by the neural network model quantization apparatus 5000 shown in FIG. 17 , and the quantized neural network model can be used to execute the data processing method of the embodiment of the present application.
  • the apparatus shown in FIG. 17 can obtain the training data set and the original neural network model required for quantization from the outside through the communication interface 5003, and then the processor quantizes the neural network model according to the training data set and the original neural network model.
  • although the apparatus 4000 and the apparatus 5000 only show a memory, a processor, and a communication interface, in a specific implementation process, those skilled in the art should understand that the apparatus 4000 and the apparatus 5000 may also include other devices necessary for normal operation. Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 4000 and the apparatus 5000 may further include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the apparatus 4000 and the apparatus 5000 may also include only the devices necessary for implementing the embodiments of the present application, and do not necessarily include all the devices shown in FIG. 16 and FIG. 17.
  • processor in the embodiments of the present application may be a central processing unit (central processing unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (digital signal processors, DSP), application-specific integrated circuits (application specific integrated circuit, ASIC), off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically programmable Erase programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • Volatile memory may be random access memory (RAM), which acts as an external cache.
  • By way of example but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM).
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware or any other combination.
  • the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (for example, infrared, radio, or microwave) manner.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, or the like that contains one or more sets of available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media.
  • the semiconductor medium may be a solid state drive.
  • "At least one" means one or more, and "a plurality of" means two or more.
  • "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or a plurality of items.
  • For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be singular or plural.
  • the size of the sequence numbers of the above-mentioned processes does not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application, in essence, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, or other media that can store program code.

Abstract

A neural network model quantization method and apparatus, and a data processing method and apparatus, belonging to the field of artificial intelligence. The original neural network model includes a first operator, a second operator, and a first operation module, where the first operation module is used to operate on the output of the first operator and the output of the second operator. The neural network model quantization method includes: determining a data quantization parameter according to the range of first training input data of the first operator and the range of second training input data of the second operator; and determining a quantized neural network model, where the quantized neural network model uses the data quantization parameter to respectively quantize the first input data of the quantized first operator and the second input data of the quantized second operator. The processing result of the quantized first operator and the processing result of the quantized second operator can be operated on directly, which improves the data processing efficiency while improving the data processing precision of the neural network model.

Description

神经网络模型的量化方法和装置、数据处理的方法和装置 技术领域
本申请涉及人工智能领域,更具体地,涉及一种神经网络模型的量化方法和装置、数据处理的方法和装置。
背景技术
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
神经网络模型应用广泛,通过对神经网络模型中的算子进行模型量化,即对算子的参数的量化和对输入数据的量化,可以将浮点数的运算转换为对定点数的运算,获得模型大小、推理速度和功耗等多方面的收益。根据算子的数据范围,确定算子的量化参数,能够提高量化后的算子的数据处理结果的精度。但是,在对量化后的多个算子的数据处理结果进行后续运算处理之前,需要对这些数据处理结果进行反量化,导致整体的处理性能较差。
发明内容
本申请提供一种神经网络模型量化方法和一种数据处理方法,能够简化神经网络模型的运算,提高神经网络模型的数据处理效率。
第一方面,提供一种神经网络模型量化方法,所述方法包括:获取原始神经网络模型,所述原始神经网络模型包括第一算子、第二算子和第一运算模块,所述第一算子与所述第二算子用于进行相同类型的运算,所述第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第一运算;根据第一训练输入数据的范围和第二训练输入数据的范围,确定数据量化参数,所述第一训练输入数据为所述第一算子的输入数据,所述第二训练输入数据为所述第二算子的输入数据;根据所述原始神经网络模型,确定量化后的神经网络模型,所述量化后的神经网络模型包括量化模块、第三算子、第四算子和第二运算模块,所述量化模块用于利用所述数据量化参数分别对所述第三算子的第一输入数据、所述第四算子的第二输入数据进行量化,所述第三算子为量化后的第一算子,所述第四算子为量化后的第二算子,所述第二运算模块用于进行所述第一运算。
根据原始神经网络模型中第一算子和第二算子的输入数据的数值范围,确定数据量化参数,数据量化参数用于分别对量化后的神经网络模型中的第三算子和第四算子的输入数据进行量化。
量化后的神经网络模型能够采用相同的数据量化参数,对输入两个不同算子的数据进行量化,从而使得第三算子的处理结果与第四算子的处理结果对应的量化参数相同,可以直接对第三算子的处理结果与第四算子的处理结果进行第三运算,而无需在进行第三运算之前对第三算子的处理结果、第四算子的处理结果进行反量化等处理,简化了量化后的神经网络模型的运算,提高了神经网络模型的数据处理效率。
根据第一算子和第二算子分别处理的数据的数值范围,确定对第三算子和第四算子的输入数据进行量化的数据量化参数,提高了第三算子、第四算子的对量化后的数据的处理结果的精度,在提高神经网络模型的数据处理效率的同时,减小了量化神经网络模型对数据处理结果的准确性的影响。
结合第一方面,在一些可能的实现方式中,所述方法还包括:获取训练输入数据组对应的预设训练输出数据,所述训练输入数据组包括所述第一训练输入数据和所述第二训练输入数据;利用所述数据量化参数,分别对所述第一训练输入数据和所述第二训练输入数据进行量化;利用所述量化后的神经网络模型对量化后的第一训练输入数据和量化后的第二训练输入数据进行处理,以得到实际训练输出数据;根据所述实际训练输出数据与所述预设训练输出数据的差异,调整所述数据量化参数,以最小化所述差异;所述量化模块用于利用调整后的数据量化参数分别对所述第三算子的第一输入数据、所述第四算子的第二输入数据进行量化。
预设训练输出数据可以是人工设置的。预设训练输出数据也可以是原始神经网络模型对第一训练输入数据和第二训练输入数据进行处理得到的。例如,预设训练输出数据可以是运算模块的输出。
由于数据量化参数的调整方式是最小化所述量化后的神经网络模型对数据的实际训练输出数据与该数据对应的预设训练输出数据之间的差异,调整后的数据量化参数能够使得第三算子、第四算子的对量化后的数据进行处理的结果精度较高。在提高神经网络模型的数据处理效率的同时,减小了量化神经网络模型对数据处理结果的准确性的影响。
结合第一方面,在一些可能的实现方式中,所述方法还包括:根据所述第一算子的参数范围、所述第二算子的参数范围,确定算子量化参数;利用所述算子量化参数对所述第一算子的参数进行量化,以得到所述第三算子的参数;利用所述算子量化参数对所述第二算子的参数进行量化,以得到所述第四算子的参数。
由于算子量化参数是根据第一算子的参数范围、第二算子的参数范围确定的,量化后的神经网络模型在提高数据处理效率的同时,减小了对数据处理结果的准确性和精度的影响。
结合第一方面,在一些可能的实现方式中,所述量化后的神经网络模型还包括压缩模块,所述压缩模块用于根据偏移参数分别对所述第三算子的输出和所述第四算子的输出进行压缩,所述偏移参数用于指示进行所述压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置,所述第二运算模块用于对压缩后的数据进行所述第一运算;所述方法还包括:利用所述数据量化参数分别对所述第一训练输入数据和第二训练输入数据进行量化;利用所述第三算子对量化后的第一训练输入数据进行处理,所述第三算子输出第一训练运算数据;利用所述第四算子对量化后的第二训练输入数据进行处理,所述第四算子输出第二训练运算数据;根据所述第一训练运算数据的有效位数,以及所述第二训练运算数 据的有效位数,确定所述偏移参数。
对第三算子和第四算子的输出采用相同的偏移参数进行数据压缩,可以提高神经网络模型数据处理效率。由于偏移参数是根据量化后的神经网络模型对训练输入数据处理得到的中间运算结果的有效位数确定的,在量化后的神经网络模型对数据进行处理时,利用偏移参数对中间运算结果进行压缩,可以减小了对最终数据处理结果的准确性和精度的影响。
第二方面,提供一种数据处理方法,所述方法包括:获取量化后的神经网络模型,所述量化后的神经网络模型是对原始神经网络模型进行量化得到的,所述原始神经网络模型包括第一算子、第二算子和第一运算模块所述第一算子与所述第二算子用于进行相同类型的运算,所述第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第一运算;利用所述量化后的神经网络模型对所述第三算子的第一输入数据和所述第四算子的第二输入数据进行处理,所述量化后的神经网络模型包括量化模块、第一算子、第二算子和第二运算模块,所述量化模块用于利用数据量化参数分别对所述第一输入数据、所述第二输入数据进行量化,所述第二运算模块用于进行所述第一运算,所述第三算子为量化后的第一算子,所述第四算子为量化后的第二算子,所述数据量化参数是根据所述第一算子的第一训练输入数据的范围和所述第二算子的第二训练输入数据的范围确定的。
结合第二方面,在一些可能的实现方式中,所述数据量化参数是对初始数据量化参数进行调整得到的,所述调整使得根据实际训练输出数据与预设训练输出数据的差异最小化,所述初始量化参数是根据所述第一训练输入数据的范围和所述第二训练输入数据的范围确定的,所述预设训练输出数据对应于训练输入数据组,所述训练输入数据组包括所述第一训练输入数据和所述第二训练输入数据,所述实际训练输出数据是利用所述量化后的神经网络模型对所述第一训练输入数据和所述第二训练输入数据进行处理得到的,所述量化模块用于利用所述初始数据量化参数分别对所述第一训练输入数据、所述第二训练输入数据进行量化。
结合第二方面,在一些可能的实现方式中,所述第三算子的参数是利用算子量化参数对所述第一算子的参数进行量化得到的,所述第四算子的参数是利用所述算子量化参数对所述第二算子的参数进行量化得到的,所述算子量化参数是根据所述第一算子的参数范围、所述第二算子的参数范围确定的。
结合第二方面,在一些可能的实现方式中,所述量化后的神经网络模型还包括压缩模块,所述压缩模块用于根据偏移参数分别对所述第三算子的输出和所述第四算子的输出进行压缩,所述偏移参数用于指示进行所述压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置,所述第二运算模块用于对压缩后的数据进行所述第一运算;所述偏移参数是根据第一训练运算数据的有效位数和第二训练运算数据的有效位数确定的,所述第一训练运算数据是利用所述第三算子对使用所述数据量化参数量化后的第一训练输入数据进行处理得到的,所述第二训练运算数据是利用所述第四算子对使用所述数据量化参数量化后的第二训练输入数据进行处理得到的。
第三方面,提供一种神经网络模型量化装置,所述装置包括:存储模块和处理模块,所述存储模块用于存储程序;当所述程序在所述处理模块中运行时,所述处理模块用于:获取原始神经网络模型,所述原始神经网络模型包括第一算子、第二算子和第一运算模块, 所述第一算子与所述第二算子用于进行相同类型的运算,所述第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第一运算;根据第一训练输入数据的范围和第二训练输入数据的范围,确定数据量化参数,所述第一训练输入数据为所述第一算子的输入数据,所述第二训练输入数据为所述第二算子的输入数据;根据所述原始神经网络模型,确定量化后的神经网络模型,所述量化后的神经网络模型包括量化模块、第三算子、第四算子和第二运算模块,所述量化模块用于利用所述数据量化参数分别对所述第三算子的第一输入数据、所述第四算子的第二输入数据进行量化,所述第三算子为量化后的第一算子,所述第四算子为量化后的第二算子,所述第二运算模块用于进行所述第一运算。
结合第三方面,在一些可能的实现方式中,所述处理模块还用于,获取训练输入数据组对应的预设训练输出数据,所述训练输入数据组包括所述第一训练输入数据和所述第二训练输入数据;所述处理模块还用于,利用所述数据量化参数,分别对所述第一训练输入数据和所述第二训练输入数据进行量化;所述处理模块还用于,利用所述量化后的神经网络模型对量化后的第一训练输入数据和量化后的第二训练输入数据进行处理,以得到实际训练输出数据;所述处理模块还用于,根据所述实际训练输出数据与所述预设训练输出数据的差异,调整所述数据量化参数,以最小化所述差异;所述量化模块用于利用调整后的数据量化参数分别对所述第三算子的第一输入数据、所述第四算子的第二输入数据进行量化。
结合第三方面,在一些可能的实现方式中,所述处理模块还用于,根据所述第一算子的参数范围、所述第二算子的参数范围,确定算子量化参数;所述处理模块还用于,利用所述算子量化参数对所述第一算子的参数进行量化,以得到所述第三算子的参数;所述处理模块还用于,利用所述算子量化参数对所述第二算子的参数进行量化,以得到所述第四算子的参数。
结合第三方面,在一些可能的实现方式中,所述量化后的神经网络模型还包括压缩模块,所述压缩模块用于根据偏移参数分别对所述第三算子的输出和所述第四算子的输出进行压缩,所述偏移参数用于指示进行所述压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置,所述第二运算模块用于对压缩后的数据进行所述第一运算;所述处理模块还用于,利用所述数据量化参数分别对所述第一训练输入数据和第二训练输入数据进行量化;所述处理模块还用于,利用所述第三算子对量化后的第一训练输入数据进行处理,所述第三算子输出第一训练运算数据;所述处理模块还用于,利用所述第四算子对量化后的第二训练输入数据进行处理,所述第四算子输出第二训练运算数据;所述处理模块还用于,根据所述第一训练运算数据的有效位数,以及所述第二训练运算数据的有效位数,确定所述偏移参数。
第四方面,提供一种数据处理装置,包括:存储模块和处理模块,所述存储模块用于存储程序;当所述程序在所述处理模块中运行时,所述处理模块用于:获取量化后的神经网络模型,所述量化后的神经网络模型是对原始神经网络模型进行量化得到的,所述原始神经网络模型包括第一算子、第二算子和第一运算模块所述第一算子与所述第二算子用于进行相同类型的运算,所述第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第一运算;利用所述量化后的神经网络模型对所述第三算子的第一输入数据和所述第四算子的第二输入数据进行处理,所述量化后的神经网络模型包括量化模块、第一算子、 第二算子和第二运算模块,所述量化模块用于利用数据量化参数分别对所述第一输入数据、所述第二输入数据进行量化,所述第二运算模块用于进行所述第一运算,所述第三算子为量化后的第一算子,所述第四算子为量化后的第二算子,所述数据量化参数是根据所述第一算子的第一训练输入数据的范围和所述第二算子的第二训练输入数据的范围确定的。
结合第四方面,在一些可能的实现方式中,所述数据量化参数是对初始数据量化参数进行调整得到的,所述调整使得根据实际训练输出数据与预设训练输出数据的差异最小化,所述初始量化参数是根据所述第一训练输入数据的范围和所述第二训练输入数据的范围确定的,所述预设训练输出数据对应于训练输入数据组,所述训练输入数据组包括所述第一训练输入数据和所述第二训练输入数据,所述实际训练输出数据是利用所述量化后的神经网络模型对所述第一训练输入数据和所述第二训练输入数据进行处理得到的,所述量化模块用于利用所述初始数据量化参数分别对所述第一训练输入数据、所述第二训练输入数据进行量化。
结合第四方面,在一些可能的实现方式中,所述第三算子的参数是利用算子量化参数对所述第一算子的参数进行量化得到的,所述第四算子的参数是利用所述算子量化参数对所述第二算子的参数进行量化得到的,所述算子量化参数是根据所述第一算子的参数范围、所述第二算子的参数范围确定的。
结合第四方面,在一些可能的实现方式中,所述量化后的神经网络模型还包括压缩模块,所述压缩模块用于根据偏移参数分别对所述第三算子的输出和所述第四算子的输出进行压缩,所述偏移参数用于指示进行所述压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置,所述第二运算模块用于对压缩后的数据进行所述第一运算;所述偏移参数是根据第一训练运算数据的有效位数和第二训练运算数据的有效位数确定的,所述第一训练运算数据是利用所述第三算子对使用所述数据量化参数量化后的第一训练输入数据进行处理得到的,所述第二训练运算数据是利用所述第四算子对使用所述数据量化参数量化后的第二训练输入数据进行处理得到的。
第五方面,提供一种电子设备,包括存储器和处理器,所述存储器用于存储程序指令;当所述程序指令在所述处理器中执行时,所述处理器用于执行第一方面或第二方面所述的方法。
上述第三方面中的处理器既可以包括中央处理器(central processing unit,CPU),也可以包括CPU与神经网络运算处理器的组合。
第六方面,提供一种计算机可读介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行第一方面或第一方面中的任意一种实现方式中的方法。
第七方面,提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述第一方面或第一方面中的任意一种实现方式中的方法。
第八方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行上述第一方面或第一方面中的任意一种实现方式中的方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执 行第一方面或第一方面中的任意一种实现方式中的方法。
上述芯片具体可以是现场可编程门阵列(field-programmable gate array,FPGA)或者专用集成电路(application-specific integrated circuit,ASIC)。
附图说明
图1为本申请实施例提供的一种系统架构的结构示意图。
图2为本申请实施例提供的一种卷积神经网络的结构示意图。
图3为本申请实施例提供的另一种卷积神经网络的结构示意图。
图4为本申请实施例提供的一种芯片的硬件结构示意图。
图5为本申请实施例提供的一种系统架构的示意图。
图6是本申请实施例提供的一种神经网络模型量化装置的示意性结构图。
图7是本申请实施例提供的一种神经网络模型量化方法的示意性流程图。
图8是本申请实施例提供的另一种神经网络模型量化方法的示意性流程图。
图9是本申请实施例提供的一种数据处理系统的示意性结构图。
图10是本申请实施例提供的压缩前后的数据的示意图。
图11是本申请实施例提供的一种数据处理方法的示意性流程图。
图12是本申请实施例提供的另一种数据处理方法的示意性流程图。
图13是本申请实施例提供的一种处理结构识别方法的示意性流程图。
图14是本申请实施例提供的一种神经网络模型量化装置的示意性结构图。
图15是本申请实施例提供的一种数据处理装置的示意性结构图。
图16是本申请实施例的数据处理装置的硬件结构示意图。
图17是本申请实施例的神经网络模型量化装置的硬件结构示意图。
具体实施方式
下面将结合附图,对本申请中的技术方案进行描述。下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例可能涉及的神经网络的相关术语和概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距1为输入的运算单元,该运算单元的输出可以为:
$h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输 入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有多层隐含层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。
虽然DNN看起来很复杂，但是就每一层的工作来说，其实并不复杂，简单来说就是如下线性关系表达式：$\vec{y}=\alpha(W\vec{x}+\vec{b})$。其中，$\vec{x}$是输入向量，$\vec{y}$是输出向量，$\vec{b}$是偏移向量，W是权重矩阵（也称系数），α()是激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。由于DNN层数多，系数W和偏移向量$\vec{b}$的数量也比较多。这些参数在DNN中的定义如下所述：以系数W为例：假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$，上标3代表系数W所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。综上，第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。需要注意的是，输入层是没有W参数的。在深度神经网络中，更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言，参数越多的模型复杂度越高，“容量”也就越大，也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程，其最终目的是得到训练好的深度神经网络的所有层的权重矩阵（由很多层的向量W形成的权重矩阵）。
(3)卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取数据信息的方式与位置无关。卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断地调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标 值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(5)反向传播算法
神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的神经网络模型中参数的大小,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的参数,例如权重矩阵。
(6)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
如图1所示,本申请实施例提供了一种系统架构100。在图1中,数据采集设备160用于采集训练数据。针对本申请实施例的数据处理方法来说,训练数据可以包括多个训练输入数据和每个训练输入数据对应的训练标识。
在采集到训练数据之后,数据采集设备160将这些训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。
下面对训练设备120基于训练数据得到目标模型/规则101进行描述,训练设备120对输入的训练输入数据进行处理,将输出的结果与该训练输入数据对应的训练标识进行对比,直到根据训练设备120输出的结果与该训练标识的差值小于一定的阈值,从而完成目标模型/规则101的训练。
上述目标模型/规则101能够用于实现本申请实施例的数据处理方法。本申请实施例中的目标模型/规则101具体可以为神经网络。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图1所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)AR/虚拟现实(virtual reality,VR),车载终端等,还可以是服务器或者云端等。在图1中,执行设备110配置输入/输出(input/output,I/O)接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:客户设备输入的待处理数据。
预处理模块113和预处理模块114用于根据I/O接口112接收到的输入数据(如待处理数据)进行预处理,在本申请实施例中,也可以没有预处理模块113和预处理模块114(也可以只有其中的一个预处理模块),而直接采用计算模块111对输入数据进行处理。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果,如上述得到的数据的处理结果返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在图1中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,图1仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图1中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
如图1所示,根据训练设备120训练得到目标模型/规则101,该目标模型/规则101在本申请实施例中可以是本申请中的神经网络,具体的,本申请实施例使用神经网络可以为CNN,深度卷积神经网络(deep convolutional neural networks,DCNN),循环神经网络(recurrent neural network,RNN)等等。
由于CNN是一种非常常见的神经网络,下面结合图2重点对CNN的结构进行详细的介绍。如上文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的数据作出响应。
本申请实施例的数据处理方法具体采用的神经网络的结构可以如图2所示。在图2中,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及神经网络层230。其中,输入层210可以获取待处理数据,并将获取到的待处理数据交由卷积层/池化层220以及后面的神经网络层230进行处理,可以得到数据的处理结果。下面对图2中的CNN 200中内部的层结构进行详细的介绍。
卷积层/池化层220:
卷积层:
如图2所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在数据处理中的作用相当于一个从输入数据矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入数据中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图2中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在数据处理过程中,池化层的唯一目的就是减少数据的空间大小。
神经网络层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入数据带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用神经网络层230来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层230中可以包括多层隐含层(如图2所示的231、232至23n)以及输出层240,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括识别、分类等等。
在神经网络层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图2由210至240方向的传播为前向传播)完成,反向传播(如图2由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
本申请实施例的数据处理方法具体采用的神经网络的结构可以如图3所示。在图3中,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选 的),以及神经网络层230。与图2相比,图3中的卷积层/池化层220中的多个卷积层/池化层并行,将分别提取的特征均输入给神经网络层230进行处理。
需要说明的是,图2和图3所示的卷积神经网络仅作为一种本申请实施例的数据处理方法的两种可能的卷积神经网络的示例,在具体的应用中,本申请实施例的数据处理方法所采用的卷积神经网络还可以以其他网络模型的形式存在。
图4为本申请实施例提供的一种芯片的硬件结构,该芯片包括神经网络处理器50。该芯片可以被设置在如图1所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图1所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。如图2和图3所示的卷积神经网络中各层的算法均可在如图4所示的芯片中得以实现。
神经网络处理器NPU 50作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路503,控制器504控制运算电路503提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路503内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路503是二维脉动阵列。运算电路503还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路503是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器502中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器501中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)508中。
向量计算单元507可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元507可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现种,向量计算单元能507将经处理的输出的向量存储到统一缓存器506。例如,向量计算单元507可以将非线性函数应用到运算电路503的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元507生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路503的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器506用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器505(direct memory access controller,DMAC)将外部存储器中的输入数据搬运到输入存储器501和/或统一存储器506、将外部存储器中的权重数据存入权重存储器502,以及将统一存储器506中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)510,用于通过总线实现主CPU、DMAC和取指存储器509之间进行交互。
与控制器504连接的取指存储器(instruction fetch buffer)509,用于存储控制器504使用的指令;
控制器504,用于调用指存储器509中缓存的指令,实现控制该运算加速器的工作过 程。
一般地,统一存储器506,输入存储器501,权重存储器502以及取指存储器509均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,简称DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
其中,图2和图3所示的卷积神经网络中各层的运算可以由运算电路503或向量计算单元507执行。
上文中介绍的图1中的执行设备110能够执行本申请实施例的数据处理方法的各个步骤,图2和图3所示的CNN模型和图4所示的芯片也可以用于执行本申请实施例的数据处理方法的各个步骤。下面结合附图对本申请实施例的神经网络训练的方法和本申请实施例的数据处理方法进行详细的介绍。
如图5所示,本申请实施例提供了一种系统架构300。该系统架构包括本地设备301、本地设备302以及执行设备110和数据存储系统150,其中,本地设备301和本地设备302通过通信网络与执行设备110连接。
执行设备110可以由一个或多个服务器实现。可选的,执行设备110可以与其它计算设备配合使用,例如:数据存储器、路由器、负载均衡器等设备。执行设备110可以布置在一个物理站点上,或者分布在多个物理站点上。执行设备110可以使用数据存储系统150中的数据,或者调用数据存储系统150中的程序代码来实现本申请实施例的数据处理的方法。
用户可以操作各自的用户设备(例如本地设备301和本地设备302)与执行设备110进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与执行设备110进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。
在一种实现方式中,本地设备301、本地设备302从执行设备110获取到目标神经网络的相关参数,将目标神经网络部署在本地设备301、本地设备302上,利用该目标神经网络进行数据分类或者识别等等。
在另一种实现中,执行设备110上可以直接部署目标神经网络,执行设备110通过从本地设备301和本地设备302获取待处理数据,并根据目标神经网络对待处理数据进行分类或者其他类型的数据处理。
上述执行设备110也可以为云端设备,此时,执行设备110可以部署在云端;或者,上述执行设备110也可以为终端设备,此时,执行设备110可以部署在用户终端侧,本申请实施例对此并不限定。
目前,神经网络模型广泛应用到了图像、视频、语音等多个领域,展现出超越传统方法的能力,而神经网络模型本身计算量和参数量都很大,这给将神经网络在终端设备上的部署带来了很大的挑战。
模型量化用于对神经网络模型中算子的参数的量化和对输入数据的量化。对算子的参 数的量化,可以对算子的大小进行优化,减小算子占用的资源。在此基础上,对算子的输入数据进行量化,可以将算子的浮点数运算转换为定点数运算,提高推理速度,降低功耗。相比于单精度浮点数(一般为32bit)表示的神经网络模型,8bit量化得到的量化后的神经网络模型可以将每个参数占据的存储空间缩小到四分之一,并且以更好的推理速度对数据进行处理。
在对算子进行模型量化过程中,为了提高量化后的算子的数据处理精度,需要根据每个算子的参数的范围确定算子量化参数,并根据输入数据的范围,分别数据量化参数。而不同算子的参数的范围、输入数据的范围的差异,导致各个算子对应的数据量化参数和/或算子量化参数存在差别。在对多个量化后的算子的数据处理结果进行运算之前,需要对这些数据处理结果分别进行反量化,以保证预算结果的准确性。在NPU中,用于进行反量化运算的处理单元数量有限,导致反量化的运算速率受限,使得整体处理效率较低,性能较差。对每个算子的数据处理结果分别进行反量化时使用的反量化参数可以根据该算子对应的数据量化参数和算子量化参数确定。为了解决上述问题,本申请实施例提供了一种神经网络模型量化装置,能够减少量化后的神经网络模型在后续需要进行的反量化运算的次数,提高整体处理性能。
图6是本申请实施例提供的一种神经网络模型量化装置的示意性结构图。神经网络模型量化装置1300可以位于图1所示的训练设备120或其他设备中。神经网络模型量化装置1300包括数据量化参数生成模型1310和算子量化模型1320。神经网络模型量化装置1300用于对原始神经网络模型进行量化。原始神经网络模型包括第一算子、第二算子和第一运算模块。第一算子与第二算子用于进行相同类型的运算。第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第一运算。量化参数生成模型1310用于根据训练输入数据组中数据的范围,生成数据量化参数。训练输入数据组包括第一算子的第一训练输入数据和第二算子的第二训练输入数据。算子量化模型1320用于对第一算子、第二算子等原始神经网络模型中的运算单元进行量化。根据数据量化参数以及量化后的运算单元,可以得到量化后的神经网络模型。
量化后的神经网络模型包括量化模块、量化后的第一算子、量化后的第二算子、第二运算模块。量化模块用于利用数据量化参数分别对第一输入数据、第二输入数据进行量化。第二运算模块与原始神经网络模型中的第一运算模块相对应,用于对第一运算数据、第二运算数据进行第一运算。第一运算数据是量化后的第一算子对量化后的第一输入数据进行运算得到的。第二运算数据是量化后的第二算子对量化后的第二输入数据进行运算得到的。
图7是本申请实施例提供的一种神经网络模型量化方法的示意性流程图。该神经网络模型量化方法可以由图1所示的训练设备120或其他设备执行。
S1101,获取原始神经网络模型,所述原始神经网络模型包括第一算子、第二算子和第一运算模块。所述第一算子和第二算子用于进行相同类型的运算,所述第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第一运算。可以从其他设备发送的消息中,获取原始神经网络模型。或者,也可以从存储器中获取原始神经网络模型。原始神经网络模型可以是CNN等。
第一运算与第二运算为相同类型的运算,即第一算子、第二算子为相同类型的算子。 例如,第一算子、第二算子可以均为CNN中的卷子层或均为全连接层。第一运算模块可以用于对第一算子的输出和第二算子的输出进行逐位运算,例如逐位相加或逐位相乘的运算。一般情况下,第一运算模块可以用于线性运算。
神经网络模型量化方法包括步骤S1101和S1102,用于对原始神经网络模型进行量化,以得到量化后的神经网络模型。神经网络模型量化方法可以由图1所示的训练设备120或其他设备执行。
在S1102,根据第一训练输入数据的范围和第二训练输入数据的范围,确定数据量化参数,所述第一训练输入数据为所述第一算子的输入数据,所述第二训练输入数据为所述第二算子的输入数据。具体地,可以根据第一算子的多个第一训练输入数据的最大值和第二算子的多个第二训练输入数据的最大值,确定平均数据范围上限。可以根据第一算子的多个第一训练输入数据的最小值和第二算子的多个第二训练输入数据的最小值,确定平均数据范围下限。可以根据平均数据范围上限和平均数据范围下限,确定数据量化参数。
平均数据范围上限可以理解为该多个第一训练输入数据的最大值和该多个第二训练输入数据的最大值的平均值。平均数据范围下限可以理解为该多个第一训练输入数据的最小值和该多个第二训练输入数据的最小值的平均值。
可以在每次训练输入数据输入时对平均数据范围上限和平均数据范围下限进行更新。从而,使得对平均数据范围上限和平均数据范围下限的计算分散的进行,与获取多个训练输入数据输入进行平均值计算的方式相比,可以减小对计算资源的要求。
可以引入权重,以实现对平均数据范围上限和平均数据范围下限的更新。具体的平均数据范围上限和平均数据范围的更新方式可以参见图8的说明。应当理解,权重的设置可以用于对随着迭代次数增加输入第一算子和/或第二算子的训练输入数据对平均数据范围上限、平均数据范围下限的影响的大小。
一般情况下,算子的量化后的输入数据的比特数(即比特位的数量)为预设值,例如,可以是8bit。当然,在一些实施例中,也可以通过人工输入信息等获取量化后的输入数据的比特数。数据量化参数可以包括步长(scale)和偏移(offset)。该比特数与数据量化参数中的scale的数量一一对应。参数scale的数量可以理解为量化后的数据的比特数能够表示的最大值,即2 m-1,m为量化后的数据的比特数。根据平均数据范围上限和平均数据范围下限之间的差值和scale的数量,即可得到scale。例如参数scale可以是平均数据范围上限和平均数据范围下限之间的差值除以scale的数量得到的商,或者是参数scale可以是平均数据范围上限和平均数据范围下限之间的差值加1除以scale的数量得到的商。数据量化参数中的offset可以是该平均数据范围下限与参数scale的比值。
在S1103,根据所述原始神经网络模型,确定量化后的神经网络模型,所述量化后的神经网络模型包括量化模块、第三算子、第四算子和第二运算模块,所述量化模块用于利用所述数据量化参数分别对所述第三算子的第一输入数据、所述第四算子的第二输入数据进行量化,所述第二运算模块用于进行所述第一运算。
所述第二运算模块可以用于对第一运算数据和第二运算数据进行所述第三第一运算,所述第一运算数据是利用所述第三算子对量化后的第一输入数据进行运算得到的,所述第二运算数据是利用所述第四算子对量化后的第二输入数据进行运算得到。
也就是说,第二运算模块对应于原始神经网络模型中的第一运算模块。
通过S1101至步骤S1103,根据原始神经网络模型中第一算子和第二算子的输入数据的数值范围,确定数据量化参数,数据量化参数用于分别对量化后的神经网络模型中的第三算子和第四算子的输入数据进行量化。其中,第三算子与第一算子运用于进行相同的运算,第四算子与第二算子用于进行相同的运算,且上述第一算子和第二算子进行的运算的类型相同。
通过S1101至步骤S1103,量化后的神经网络模型能够采用相同的数据量化参数,对输入两个不同算子的数据进行量化,从而使得第三算子的处理结果与第四算子的处理结果对应的量化参数相同,可以直接对第三算子的处理结果与第四算子的处理结果进行第三运算,而无需在进行第三运算之前对第三算子的处理结果、第四算子的处理结果进行反量化等处理,简化了量化后的神经网络模型的运算,提高了神经网络模型的数据处理效率。
另外,根据第一算子和第二算子分别处理的数据的数值范围,确定对第三算子和第四算子的输入数据进行量化的数据量化参数,提高了第三算子、第四算子的对量化后的数据的处理结果的精度,在提高神经网络模型的数据处理效率的同时,减小了量化神经网络模型对数据处理结果的准确性的影响。
进一步地,可以获取训练输入数据组对应的预设训练输出数据,所述训练输入数据组包括所述第一训练输入数据和所述第二训练输入数据。
预设训练输出数据可以是人工设置的。预设训练输出数据也可以是原始神经网络模型对第一训练输入数据和第二训练输入数据进行处理得到的。例如,预设训练输出数据可以是运算模块的输出。
在S1102之后,可以利用所述数据量化参数,分别对所述第一训练输入数据和所述第二训练输入数据进行量化。可以利用所述量化后的神经网络模型对量化后的第一训练输入数据和量化后的第二训练输入数据进行处理,以得到实际训练输出数据。
可以根据所述实际训练输出数据与所述预设训练输出数据的差异,调整所述数据量化参数,以最小化所述差异。
所述量化模块用于利用调整后的数据量化参数分别对所述第三算子的第一输入数据、所述第四算子的第二输入数据进行量化。
也就是说,可以利用调整后的数据量化参数分别对第三算子的输入数据、第四算子的输入数据进行量化。
由于数据量化参数的调整方式是最小化所述量化后的神经网络模型对数据的实际训练输出数据与该数据对应的预设训练输出数据之间的差异,调整后的数据量化参数能够使得第三算子、第四算子的对量化后的数据进行处理的结果精度较高。在提高神经网络模型的数据处理效率的同时,减小了量化神经网络模型对数据处理结果的准确性的影响。
第一算子、第三算子用于进行相同的运算。第二算子、第四算子用于进行相同的运算。
两个算子用于进行相同的运算,也可以理解为两个算子对输入的数据进行相同的运算,两个算子的参数仅仅是精度不同,一个算子的参数是对另一个算子的参数进行量化得到的。通过利用量化后的算子对量化后的输入数据进行处理,能够降低计算量。
应当理解,为了使得第三算子和第四算子对输入数据的处理结果具有可比性,即可以直接运算,而无需在后续推理中进行反量化等处理,第三算子的参数和第四算子的参数可以是通过相同的算子量化参数得到的。
可以根据所述第一算子的参数范围、所述第二算子的参数范围,确定算子量化参数。
可以利用所述算子量化参数对所述第一算子的参数进行量化以得到第三算子的参数,并利用所述算子量化参数对所述第二算子的参数进行量化以得到第四算子的参数。
由于算子量化参数是根据第一算子的参数范围、第二算子的参数范围确定的,量化后的神经网络模型在提高数据处理效率的同时,减小了对数据处理结果的准确性和精度的影响。
为了进一步提高量化后的神经网络模型的数据处理效率,可以对第三算子输出的数据处理结果和第四算子输出的数据处理结果进行压缩。
所述量化后的神经网络模型还包括压缩模块,所述压缩模块用于根据偏移参数分别对所述第三算子的输出和所述第四算子的输出进行压缩,以得到所述第一运算数据和所述第二运算数据,所述偏移参数用于指示进行所述压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置。
一般情况下,在对第三算子输出的数据处理结果进行压缩得到的第一运算数据和第四算子输出的数据处理结果进行压缩得到的第二运算数据,具有相同数量的比特位。
偏移参数指示压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置,通过相同的偏移参数对第三算子的输出和所述第四算子的输出进行压缩,使得第一运算数据和第二运算数据具有可比性,可以直接进行运算,而无需在后续推理中进行反量化等处理。
为了在提高神经网络模型数据处理效率的同时,减小了对数据处理结果的准确性和精度的影响,可以根据第三算子对量化后的第一训练输入数据进行处理得到的输出的有效位数,以及第四算子对量化后的第二训练输入数据进行处理得到的输出的有效位数,确定偏移参数。
也就是说,利用所述数据量化参数分别对所述第一训练输入数据和第二训练输入数据进行量化。可以利用所述第三算子对量化后的第一训练输入数据进行处理,并利用所述第四算子对量化后的第二训练输入数据进行处理。其中,第三算子的输出为第一训练运算数据,第四算子的输出为第二训练运算数据。
之后,可以根据所述第一训练运算数据的有效位数,以及所述第二训练运算数据的有效位数,确定所述偏移参数。
数据量化参数、算子量化参数、偏移参数的确定方式,具体可以参见对图8的说明。
通过S1101至S1103,可以得到量化后的神经网络模型。量化后的神经网络模型例如可以是图9所示的数据处理系统600,或者数据处理系统600可以调用量化后的神经网络模型中的各个算子或模块进行数据的处理。
图8是本申请实施例提供的一种神经网络模型量化方法的示意性流程图。
神经网络模型量化方法800也可以理解为对神经网络模型的优化方法或进一步的训练方法。训练方法800可以由图1所示的训练设备120或其他设备执行。
原始神经网络模型包括第一算子、第二算子和运算模块。运算模块用于对第一算子的输出和第二算子的输出进行运算。第一算子、第二算子用于进行相同类型的运算。
在原始神经网络模型中,第一算子和第二算子的参数通过浮点数的格式表示。
在S810,根据第一算子的参数范围和第二算子的参数范围,确定算子量化参数。
算子量化参数中的scale可以表示为s2:
$s2=\frac{f_{max}-f_{min}}{2^{a}-1}$
其中,f max为浮点数表示的第一算子和第二算子的最大参数值,f min为浮点数表示的第一算子和第二算子的最小参数值,a为量化结果的位数。对于格式为int8的数据,a的取值为8。
算子量化参数中的offset可以表示为o2:
$o2=\frac{f_{min}}{s2}$
一般情况下,f min为负数。
可以利用算子量化参数对第一算子的参数进行量化,以得到量化后的第一算子。可以利用算子量化参数对第二算子的参数进行量化,以得到量化后的第二算子。
在S820,获取训练数据集,所述训练数据集包括训练输入数据组,以及该训练输入数据组对应的预设运算结果。每组训练输入数据组包括第一训练输入数据和第二训练输入数据。
应当理解,一个预设运算结果对应于一个第一训练输入数据和一个第二训练输入数据。第一训练输入数据和第二训练输入数据对应的预设运算结果可以是对第一算子对第一训练输入数据的处理结果与第二算子对第二训练输入数据的处理结果进行运算得到的浮点数表示的运算结果。或者,第一训练输入数据和第二训练输入数据对应的预设运算结果可以是人工设置的。
在S830,根据所述第一训练输入数据的范围和所述第二训练输入数据的范围,确定所述数据量化参数。
训练数据集可以包括多个第一训练输入数据和多个第二训练输入数据。可以根据每个训练输入数据范围,确定数据量化参数。训练输入数据为第一训练输入数据或第二训练输入数据。数据量化参数可以是该多个第一训练输入数据和该多个第二训练输入数据的平均范围。
数据量化参数中的scale可以表示为s1:
$s1=\frac{d_{max\_t}-d_{min\_t}}{2^{m}-1}$
其中,d max_t为第t次迭代后得到的平均最大值(也可以理解为平均数据范围上限),d min_t为第t次迭代后得到的平均最小值(也可以理解为平均数据范围下限),m用于表示对训练输入数据进行量化得到的量化结果的比特数。对于格式为int8的数据,m的取值为8。应当理解,m为预设值。
训练输入数据的平均最大值可以表示为:
$d_{max\_t}=\frac{\beta_{1}\cdot c_{t-1}\cdot d_{max\_t-1}+v_{max\_t}}{c_{t}}$
其中,d max_t-1为第t-1次迭代后得到的训练输入数据的平均最大值,v max_t为第t次迭代对算子的输入数据统计出的最大值,c t是随着迭代次数不断更新,c t=β 1·c t-1+1,β 1为常数。在多次迭代过程中,算子的输入数据包括第一算子的第一训练输入数据,也包括第二算子的第二训练输入数据。
其中，$\frac{\beta_{1}\cdot c_{t-1}}{c_{t}}$可以理解为权重。
当β 1大于1时,随着迭代的进行,迭代次数越多,训练输入数据的最大值对平均数据范围上限的影响越小。当β 1小于1时,随着迭代的进行,迭代次数越多,训练输入数据的最大值对平均数据范围上限的影响越大。当β 1等于1时,每个训练输入数据的最大值对平均数据范围上限的影响相同。一般情况下,β 1的取值略大于1,避免对平均数据范围上限的过度修正。
训练输入数据的平均最小值可以表示为:
$d_{min\_t}=\frac{\beta_{2}\cdot c_{t-1}\cdot d_{min\_t-1}+v_{min\_t}}{c_{t}}$
其中，d min_t-1为第t-1次迭代后得到的训练输入数据的平均最小值，v min_t为第t次迭代对算子的输入数据统计出的最小值，c t是随着迭代次数不断更新，c t=β 2·c t-1+1，β 2为常数。在多次迭代过程中，算子的输入数据包括第一算子的第一训练输入数据，也包括第二算子的第二训练输入数据。
类似的，当β 2大于1时，随着迭代的进行，迭代次数越多，训练输入数据的最小值对平均数据范围下限的影响越小。当β 2小于1时，随着迭代的进行，迭代次数越多，训练输入数据的最小值对平均数据范围下限的影响越大。当β 2等于1时，每个训练输入数据的最小值对平均数据范围下限的影响相同。一般情况下，β 2的取值略大于1，避免对平均数据范围下限的过度修正。β 2与β 1可以相等或不相等。
在进行迭代之前,可以对参数c 0、β、d max_0、d min_0进行设置,一般情况下,参数c 0、均可以设置为0。v max_0、v min_0可以根据经验值进行设置,例如v min_0可以设置为6。
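The running range statistics above can be pictured with a small helper. This is a minimal sketch under the assumption that the averaged extremum is a weighted running average normalized by c_t = β·c_{t-1} + 1, which is consistent with the surrounding description (β = 1 reduces to the plain arithmetic mean); it is not the patent's authoritative formula.

```python
# Minimal sketch of the running range statistic: c_t = beta * c_{t-1} + 1 and
# the averaged extremum is a weighted combination of the previous average and
# the new per-iteration extremum. The exact update form is an assumption
# inferred from the surrounding description.
def update_running_extremum(prev_avg, new_value, c_prev, beta):
    c_t = beta * c_prev + 1.0
    avg_t = (beta * c_prev * prev_avg + new_value) / c_t
    return avg_t, c_t

# With beta == 1 this reduces to the ordinary arithmetic mean over iterations.
```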
在S840,对第一训练运算数据和第二训练运算数据进行运算,以得到训练输出数据,所述第一训练运算数据是利用算子量化参数量化后的第一算子对利用所述数据量化参数量化后的第一训练输入数据处理得到的,所述第二训练运算数据是利用算子量化参数量化后的第二算子对利用所述数据量化参数量化后的第二训练输入数据处理得到的。
可以利用数据量化参数分别对第一训练输入数据和第二训练输入数据进行处理,以得到量化后的第一训练输入数据和量化后的第二训练输入数据。可以将量化后的第一训练输入数据输入量化后的第一算子,以得到第一训练运算数据。可以将量化后的第二训练输入数据输入量化后的第二算子,以得到第二训练运算数据。之后,可以对第一训练运算数据和第二训练运算数据进行运算,以得到训练输出数据。
在第一训练运算数据可以是对量化后的第一算子输出的数据进行转换得到的。第一训练运算数据的位数小于量化后的第一算子输出的数据的位数。
在第二训练运算数据可以是对量化后的第二算子输出的数据进行转换得到的。第二训练运算数据的位数小于量化后的第二算子输出的数据的位数。
对于量化后的第一算子对量化后的第一训练输入数据的处理结果,以及量化后的第二算子对量化后的第二训练输入数据的处理结果,可以统计平均有效位数。
可以统计平均数据范围上限,从而得到平均有效位数。
经过t次迭代,得到平均数据范围上限可以表示为:
$b_{t}=\frac{\beta_{3}\cdot c_{t-1}\cdot b_{t-1}+b_{nt}}{c_{t}}$
其中,b t-1为第t-1次迭代后得到的平均数据范围上限,b nt为第t次迭代对算子的输出 数据统计出的范围上限,c t是随着迭代次数不断更新,c t=β 3·c t-1+1,β 3为常数。第t次迭代输入数据可以是第一训练输入数据或第二训练输入数据。
在进行迭代之前,可以对参数c 0、β 3、b 0进行设置,一般情况下,参数b 0、c 0可以设置为0。
参数β 1、β 2、β 3中的任一个可以随机或者根据一定规则设置，或者，参数β 1、β 2、β 3中的任一个也可以是人工设置的。在迭代的过程中，参数β 1、β 2、β 3中的任一个可以保持不变，或者，也可以按照一定规则进行调整。本申请对此不做限定。
根据t次迭代得到的平均数据范围上限,可以确定偏移参数N:
N=max(ceil(log 2bt)-m,0)
其中,ceil为上取整函数,ceil(log 2b t)为b t的有效位数,即t次迭代后的平均有效位数,m为降低位数之后的数据的位数。
以算子输出格式为int32,m的取值为16为例,N=max(ceil(log 2b t)-16,0)。
对于量化后的算子的输出,对有效位数大于N的数值进行饱和运算,即以N位“1”表示有效位数大于N的数值。也就是说,当量化后的算子的输出大于N位“1”所表示的大小时,以N位“1”表示该数值。
对于量化后的算子的输出数据中的每个数值进行向右移N位。之后,可以进行饱和运算,将以为结果的值限制在m个比特位中。应当理解,m为预设数量。当向右移N位的结果大于m个比特位所能表示的数值范围时,以m个比特位的“1”作为饱和运算的结果。当向右移N位的结果小于或等于m个比特位所能表示的数值范围时,向右移N位的结果作为饱和运算的结果。
偏移参数N指示的比特位为量化后的算子的输出数据中第P个比特位,P=Q-N。
根据偏移参数N,可以仅保留量化后的算子的输出数据中第P+1位至第P+m位(或者,可以仅保留第P位至第P+m-1位),从而实现对量化后的算子的输出数据的压缩(即格式的转换)。
之后,可以对转换得到的第一训练运算数据和第二训练运算数据进行运算。
量化后的第一算子的对输入数据的处理可以表示为:
conv1 out=d1 q1*w1 q2
其中,conv1 out为量化后的第一算子的输出,d1 q1为利用数据量化参数量化得到的第一算子的输入数据,w1 q2为利用算子量化参数量化得到的第一算子的参数。
量化后的第二算子的对输入数据的处理可以表示为:
conv2 out=d2 q1*w2 q2
其中conv2 out为量化后的第二算子的输出,d2 q1为利用算子量化参数量化得到的第二算子的输入数据,w2 q2为利用算子量化参数量化得到的第二算子的参数。
以逐位累加运算(elementwise)为例,运算结果可以表示为:
R=tr(conv1 out,N)+tr(conv2 out,N)
其中,R为运算结果,tr(x,N)表示对数据x进行转换,转换后的结果包括数据x右移N位之后的最低的预设数量的比特位。
在S850,根据反量化后的所述训练输出数据与所述预设运算结果的差异,调整所述数据量化参数和所述算子量化参数。
如果训练输出数据是减小比特数的转换之后得到的,可以对训练输出数据进行反转换,对反转换之后数据进行反量化。也就是说,可以在转换后的数据右侧增加比特位,以使得反转换后的数据与量化后的第一算子、第二算子的输出数据的比特位相等。应当理解,增加的比特位的值可以均为“0”。之后,可以将增加比特位之后的数据左移N位,以得到反转换后的训练输出数据。
对于第一算子与第二算子均为conv算子,对第一算子的输出和第二算子的输出进行逐位相加运算的方式得到的训练输出数据,训练输出数据的反量化,可以是对训练输出数据乘以数据量化参数中的scale与算子量化参数中的scale的乘积。
通过S810至S850,可以得到使得运算结果准确性更高的数据量化参数和算子量化参数。
利用算子量化参数,分别对第一算子和第二算子的参数进行量化。根据数据量化参数、量化后的第一算子、量化后的第二算子、偏移参数N,可以确定数据处理系统600。
在数据处理系统600中,运算模型640可以是量化前的原始神经网络模型中的运算模型,或者,运算模型640中的参数可以是对原始神经网络模型中的运算模型中的参数量化得到的。
在一些实施例中,S810至S850可以由服务器执行。服务器可以将数据量化参数、量化后的第一算子、量化后的第二算子、偏移参数等发送至终端设备。从而终端设备可以确定图9所示的数据处理系统600。
图9是本申请实施例提供的一种数据处理系统的示意性结构图。数据处理系统600可以位于执行设备110的计算模块111中,数据处理系统600可以是图1所示的目标模型/规则101。数据处理系统600可以是图1所示的训练设备120或其他装置对训练完成的神经网络模型进行量化得到的。
数据处理系统600也可以称为量化后的神经网络模型。数据处理系统600可以是图2所示的CNN 200或图3所示的CNN 300中的一部分,或者,数据处理系统600的各个组成部分可以位于一个或多个CNN中。数据处理系统600包括量化模型610、第一算子620、第二算子630和运算模型640。量化模型610用于利用数据量化参数分别对第一算子620的第一输入数据和第二算子630的第二输入数据进行量化。
第一输入数据的格式和第二输入数据的格式可以均为浮点数。例如,可以是32位(bit)的单精度浮点数(float32),也可以是16位的半精度浮点数(float16)。量化模型610可以利用数据量化参数分别对第一输入数据和第二输入数据进行量化,以得到量化后的第一输入数据和量化后的第二输入数据。量化后的第一输入数据的格式和量化后的第二输入数据的格式可以均是8位的量化结果(int8)。
数据量化参数可以包括步长(scale)和偏移(offset)。其中,scale用于表示量化结果每增加“1”对应的浮点数的增加量,offset用于表示量化结果的最小值代表的浮点数与scale的比值。第一算子620用于对经量化模型610量化后的第一输入数据进行处理,以得到第一运算数据。第二算子630用于对经量化模型610量化后的第二输入数据进行处理,以得到第二运算数据。第一算子620的参数和第二算子630的参数是利用算子量化参数量化得到的。也就是说,在确定数据处理系统600时,利用算子量化参数,对训练完成的神经网络模型中的量化前的第一算子的参数进行量化,可以得到第一算子620的参数;利用 算子量化参数,对训练完成的神经网络模型中的量化前的第二算子的参数进行量化,可以得到第二算子620的参数。
算子量化参数可以包括步长(scale)和偏移(offset)。数据量化参数的确定和算子量化参数的确定可以参见图6至图8的说明。第一算子620和第二算子630用于进行相同类型的运算。也就是说,第一算子620和第二算子630可以均为神经网络模型中相同类型的算子。例如,第一算子620和第二算子630可以均为卷积(convolutional,conv)算子,用于卷积运算,例如,第一算子620和第二算子630可以即第一算子620和第二算子630可以分别表示一个卷积层。数据处理系统600中的各个模块可以均为图2所示的CNN 200中的一部分或图3所示的CNN 300中的一部分。
第一算子620和第二算子630也可以均为全连接层。全连接层每个神经元的激励函数一般采用线性整流函数(rectified linear unit,ReLU)。第一算子620的输出和第二算子630的输出后续需要经过运算模型640的运算。在一些实施例中,第一算子620、第二算子630可以位于不同的CNN中,运算模型640可以用于对不同的CNN输出的数据进行处理。当然,第一算子620和第二算子630也可以是其他类型的神经网络模型中相同类型的算子。
当conv算子的参数和conv算子的输入数据均为int8时,该conv算子的输出为32为的量化结果(int32)。也就是说,当第一算子和第二算子可以均为conv算子,且第一算子的参数、第二算子的参数、量化后的第一输入数据、量化后的第二输入数据的格式均为int8时,第一算子和第二算子的输出数据的格式均为int32。对于conv算子,conv算子的参数也可以理解为conv算子中的权重。
对于conv算子,对量化后的输入数据d q1(格式为int8)的处理结果conv out(格式为int32)可以表示为:
conv out=d q1*w1 q2
其中,w1 q2为利用算子量化参数量化得到的该算子的参数,格式也为int8。
运算模型640用于对第一运算数据和第二运算数据进行运算。运算模型640可以对第一运算数据和第二运算数据进行线性运算。运算模型640也可以对第一运算数据和第二运算数据进行逐位运算,例如逐位相加或逐位相乘的运算。
数据处理系统600利用数据量化参数,分别对两个算子的输入数据进行量化,两算子的参数是利用算子量化参数得到的,之后,可以对该两个算子的输出进行运算,避免了对该两个算子的输出进行反量化,减少了计算量,降低了运算功耗,提高了数据处理系统600的数据处理性能。
反量化运算是一种向量计算的方式。在一般的NPU中,与矩阵计算的方式相比,向量计算的方式运算能力较弱。矩阵计算的方式包括卷积运算等。神经网络模型的运算往往包括串行的多个矩阵计算和多个向量计算。一般情况下,处理器对矩阵计算的算力高于对向量计算的算力,当神经网络模型需要进行大量向量计算时,在向量计算未完成的情况下,依赖于向量计算结果的矩阵计算处于等待状态,导致流水中断,出现性能瓶颈(称为vector bound)。
通过利用相同的量化参数进行量化以得到第一算子和第二算子,并利用相同的参数对第一算子的输入数据和第二算子的输入数据进行量化,数据处理系统600的运算模块640可以对第一算子的输出和第二算子的输出进行运算,数据处理系统600能够减少神经网络 模型所需的反量化运算,从而缓解了vector bound,有效提高了神经网络模型的数据处理能力。
数据处理系统600还可以包括格式转换模型650。格式转换模型650可以用于数据压缩,也可以称为压缩模型。格式转换模型650用于降低第一算子620输出的第一原始运算数据的位数,以得到第一运算数据。格式转换模型650还用于降低第二算子用于630输出的第二原始运算数据的位数,以得到第二运算数据。
例如,格式转换模型650用于分别将格式为int32的第一原始运算数据和第二原始运算数据的格式转换为int16。第一算子620输出的第一原始运算数据经过格式转换得到的int16的数据即为第一运算数据,第二算子630输出的第二原始运算数据经过格式转换得到的int16的数据即为第二运算数据。格式转换模型650可以根据偏移参数,确定所述第一运算数据和所述第二运算数据。格式转换模型650的处理过程可以理解为对数据的压缩。
如图10中的(A)所示,当所述第一算子对量化后的第一输入数据进行处理输出的第一原始运算数据中,所述偏移参数指示的比特位之前的比特位均为0时,所述第一运算数据包括所述第一原始运算数据中所述偏移参数指示的比特位以及该比特位之后的比特数一共为预设数量的比特位。
如图10中的(B)所示,当第一原始运算数据中偏移参数指示的比特位之前的比特位不是均为0,存在值为1的比特位时,第一运算数据包括的预设数量的比特位均为“1”。
类似的,当所述第二算子对量化后的第二输入数据进行处理输出的第二原始运算数据中,所述偏移参数指示的比特位之前的比特位均为0时,所述第二运算数据包括所述第二原始运算数据中所述偏移参数指示的比特位以及该比特位之后比特数一共为预设数量的比特位。
当第二原始运算数据偏移参数指示的比特位之前的比特位不是均为0,存在值为1的比特位时,第二运算数据包括的预设数量的比特位均为“1”。格式转换模型650对于第一原始运算数据或第二原始运算数据中的任一个原始运算数据,可以进行向右移位运算和饱和运算。
向右移位运算可以表示为:
$conv_{out}'=conv_{out}\gg N$，其中，$\gg$为右移符号，conv out表示原始运算结果，conv out’表示右移运算结果，N为右移的比特数。应当理解，右移比特数N与格式转换模型650输出的数据的比特位之和小于或等于原始运算结果conv out中的比特数。
对右移运算结果可以进行饱和运算:
conv INT16=clip(conv out’,0,2 p-1)
其中,conv INT16表示饱和运算结果,p为原始运算结果conv out中的比特数与右移比特数N的差值。clip(a,b,c)运算符表示将a限制在b与c之间,当a小于b时,运算结果为b;当a大于或等于b且a小于或等于c时,运算结果为a;当a大于或等于c时,运算结果为c。
当原始运算数据的格式为int32,即原始运算数据包括32个比特位,运算结果的比特数即预设数量为16时,p=32-N,N≤16。
运算符clip的运算结果的比特数m可以与原始运算结果的比特数相同,可以从饱和运算结果中取最低的预设数量m的比特位作为运算结果。即,饱和运算结果中最低的m个的比特位,即为该原始运算数据对应的运算数据。
也就是说,可以将原始运算数据右移N个比特位。确定与该原始运算结果比特数相同且各个比特位为1的二进制数右移N个比特位的结果与原始运算数据的右移结果的大小。当原始运算数据的右移结果较大时,取该原始运算数据的右移结果中最低位的预设数量的比特位,即为运算数据;反之,当原始运算数据的右移结果不是较大时,将预设数量的“1”作为运算结果。
第一运算数据可以是第一原始运算数据,也可以是经过对第一原始运算数据向右移位运算和饱和运算得到的数据。第二运算数据可以是第二原始运算数据,也可以是经过对第二原始运算数据向右移位运算和饱和运算运算得到的数据。
经过向右移位运算和饱和预算,格式转换模型650可以将原始运算数据转换为运算数据,将运算数据作为运算模块640的输入。
通过格式转换模型650对数据格式的转换,可以降低运算模块640的计算量,从而提高数据处理系统600的数据处理性能。
利用量化后的神经网络模型,可以实现图11或图12所述的数据处理方法。图11是本申请实施例提供的一种数据处理方法的示意性流程图。该数据处理方法可以由图1所示的执行设备110中的计算模块111中执行。
S1201,获取量化后的神经网络模型,所述量化后的神经网络模型是对原始神经网络模型进行量化得到的,所述原始神经网络模型包括第一算子、第二算子和第一运算模块,所述第一算子用于和所述第二算子用于进行相同类型的运算,所述第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第一运算。
S1202,利用所述量化后的神经网络模型对所述第三算子的第一输入数据和所述第四算子的第二输入数据进行处理,所述量化后的神经网络模型包括量化模块、第一算子、第二算子和第一运算模块,所述量化模块用于利用数据量化参数分别对所述第一输入数据、所述第二输入数据进行量化,所述第二运算模块用于进行所述第一运算,所述第三算子为量化后的第一算子,所述第四算子为量化后的第二算子,所述数据量化参数是根据所述第一算子的第一训练输入数据的范围和所述第二算子的第二训练输入数据的范围确定的。
所述第二运算模块可以用于对第一运算数据和第二运算数据进行所述第一运算。所述第一运算数据是利用所述第三算子对量化后的第一输入数据进行所述第一运算得到的,所述第二运算数据是利用所述第四算子对量化后的第二输入数据进行所述第二运算得到。对原始神经网络进行量化处理,得到的量化后的神经网络模型与原始神经网络模型进行相同的运算,仅是运算结果的精度变化。
数据量化参数是根据所述第一算子的第一训练输入数据的范围和所述第二算子的第二训练输入数据的范围确定的,从而提高了量化后的神经网络模型的数据处理精度。
通过使用数据量化参数对第一输入数据和第二输入数据进行处理,使得第二运算模块可以对第一运算数据和第二运算数据进行运算,而无需对第一运算数据和第二运算数据进行反量化之后再进行运算。其中,第一运算数据是第三算子对量化后的第一输入数据进行处理得到的,第二运算数据是第四算子对量化后的第二输入数据进行处理得到的。
通过S1201至S1202,在提高量化后的神经网络模型运算精度的同时,降低量化后的神经网络模型对反量化运算的需求,节约运算资源,提高处理效率。
可选地,所述数据量化参数是对初始数据量化参数进行调整得到的,所述调整使得根据实际训练输出数据与预设训练输出数据的差异最小化。
所述初始量化参数是根据所述第一训练输入数据的范围和所述第二训练输入数据的范围确定的。
所述预设训练输出数据对应于训练输入数据组,所述训练输入数据组包括所述第一训练输入数据和所述第二训练输入数据。
所述实际训练输出数据是利用所述量化后的神经网络模型对所述第一训练输入数据和所述第二训练输入数据进行处理得到的,所述量化模块用于利用所述初始数据量化参数分别对所述第一训练输入数据、所述第二训练输入数据进行量化。
根据所述第一训练输入数据的范围和所述第二训练输入数据的范围确定初始量化参数。利用量化后的神经网络模型对第一训练输入数据和第二训练输入数据进行处理以得到实际训练输出数据,其中,量化模块使用初始数据量化参数对第一训练输入数据、第二训练输入数据进行量化。对初始数据量化参数进行调整以最小化实际训练输出数据与预设训练输出数据之间的差异,从而得到数据量化参数。
应当理解,在量化后的神经网络模型中,第三算子用于对量化后的第一训练输入数据进行第一运算以得到第一训练运算数据;第四算子用于对量化后的第二训练输入数据进行第二运算以得到第二训练运算数据;第二运算模块用于对第一训练运算数据和第二训练运算数据进行第三运算以得到该实际训练输出数据。
由于数据量化参数使得实际训练输出数据与预设训练输出数据之间的差异最小,量化后的神经网络模型具有更高的精度。
最小化实际训练输出数据与预设训练输出数据之间的差异,可以理解为,根据实际训练输出数据与预设训练输出数据的差异来逐渐调整初始动作识别系统的初始数据量化参数,直到实际训练输出数据与预设训练输出数据之间的差异在一定的预设范围内,或者,当调整次数达到预设次数时,将此时的初始数据量化参数确定为调整后的数据量化参数。
可选地,所述第三算子的参数是利用算子量化参数对所述第一算子的参数进行量化得到的,所述第四算子的参数是利用所述算子量化参数对所述第二算子的参数进行量化得到的,所述算子量化参数是根据所述第一算子的参数范围、所述第二算子的参数范围确定的。
根据第一算子的参数范围、第二算子的参数范围确定算子量化参数,并利用算子量化参数分别对第一算子的参数、第二算子的参数进行量化,以得到第三算子的参数和第四算子的参数。在进行量化以降低运算量的情况下,提高的量化后的神经网络模型的数据处理精度。
可选地,所述量化后的神经网络模型还包括压缩模块,所述压缩模块用于根据偏移参数分别对所述第三算子的输出和所述第四算子的输出进行压缩,以得到所述第一运算数据和所述第二运算数据,所述偏移参数用于指示进行所述压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置,所述第二运算模块用于对压缩后的数据进行所述第一运算。
所述偏移参数是根据第一训练运算数据的有效位数和第二训练运算数据的有效位数 确定的,所述第一训练运算数据是利用所述第三算子对使用所述数据量化参数量化后的第一训练输入数据进行处理得到的,所述第二训练运算数据是利用所述第四算子对使用所述数据量化参数量化后的第二训练输入数据进行处理得到的。
根据第一训练运算数据的有效位数和第二训练运算数据的有效位数,在降低运算量的同时,提高量化后的神经网络模型的数据处理精度。
图12是本申请实施例提供的一种数据处理方法的示意性流程图。
数据处理方法700包括S710至S720。数据处理方法700可以由图1所示的执行设备110中的计算模块111中执行。
在S710,利用数据量化参数分别对神经网络模型中第一算子的第一输入数据和所述神经网络模型中第二算子的第二输入数据进行量化处理。
在S720,对第一处理信息和第二处理信息进行运算。所述第一处理信息是利用所述第一算子对量化后的第一输入数据处理得到的,所述第二处理信息是利用所述第二算子对量化后的第二输入数据处理得到的。
通过S710和步骤S720,通过利用相同的数据量化参数对第一算子的第一输入数据和第二算子的第二输入数据进行量化,使得第一算子的输出和第二算子的输出可以直接进行运算,无需进行反量化等处理,提高神经网络模型的数据处理效率。
所述第一算子的第一参数和第二算子的第二参数可以是浮点数,也可以是利用算子量化参数对浮点数的参数进行量化得到的。
第一参数和第二参数是量化得到的,可以降低第一算子和第二算子的大小,减小第一算子和第二算子处理数据时对资源的占用。而第一参数的量化和第二参数的量化,均利用算子量化参数,可以使得第一算子和第二算子的数据处理结果可以直接进行计算,无需进行反量化等其他处理,提高神经网络模型的数据处理效率。
为了提高量化后的第一算子和第二算子的计算精度,算子量化参数可以是根据所述第一参数的范围和所述第二参数的范围得到的。
算子量化参数的确定可以由图1所示的训练设备120或其他设备执行。当然,确定数据量化参数的设备与执行S710至S720的设备可以是相同或不同的设备。
可以根据第一参数和第二参数中的最大值和最小值,得到算子量化参数。示例性地,算子量化参数可以包括scale和offset。可以根据量化结果的位数,对第一参数和第二参数中的最大值和最小值之间的差值进行等分,以得到算子量化参数的scale。可以根据第一参数和第二参数中的最小值与算子量化参数中的scale的比值,确定算子量化参数中的offset。
为了提高量化后的第一算子和第二算子的计算精度,可以根据第一算子和第二算子处理的数据的范围,确定数据量化参数。
数据量化参数的确定可以由图1所示的训练设备120或其他设备执行。当然,确定数据量化参数的设备与执行S710至S720的设备可以是相同或不同的设备。
具体地,可以获取训练数据集,所述训练数据集包括第一训练输入数据,第二训练输入数据。其中,第一训练输入数据为量化前的第一算子的输入数据,第二训练输入数据为量化前的第二算子的输入数据。可以根据第一训练输入数据的范围和第二训练输入数据的范围,确定所述数据量化参数。
例如,可以通过多个第一训练输入数据和多个第二训练输入数据确定数据量化参数。每个第一训练输入数据中包括多个数值,每个第二训练输入数据中包括多个数值,可以将每个第一训练输入数据和每个第二训练输入数据中的平均最大值作为数据量化参数的量化结果能够表示的最大值,将每个第一训练输入数据和每个第二训练输入数据中的平均最小值作为数据量化参数的量化结果能够表示的最小值。平均最大值可以是多个最大值的加权平均值,平均最小值可以是多个最小值的加权平均值。权重,可以理解为第一训练输入数据、第二训练输入数据中的每个训练输入数据中的最大值或最小值对数据量化参数的影响程度。具体地,可以参见图8的说明。
为了提高量化后的第一算子和第二算子的计算精度,在将数据量化参数和算子量化参数用于实际数据的处理之前,可以根据第一算子对量化后的第一训练输入数据的处理结果与第二算子对第二训练输入数据的处理结果进行运算之后的结果,与预设运算结果之间的差异,调整数据量化参数和/或算子量化参数。
具体地,所述训练数据集还包括第一训练输入数据和第二训练输入数据对应的预设运算结果。第一训练输入数据和第二训练输入数据对应的预设运算结果可以是对量化前的第一算子对第一训练输入数据的处理结果与量化前的第二算子对第二训练输入数据的处理结果进行运算得到的运算结果。或者,预设运算结果可以是人工设置的。预设运算结果的格式可以是浮点数。
可以对第一训练运算数据和第二训练运算数据进行S720中的运算,以得到训练输出数据。其中,第一训练运算数据是第一算子对利用数据量化参数量化后的第一训练输入数据处理得到的,第二训练运算数据是第二算子对利用数据量化参数量化后的第二训练输入数据处理得到的。
可以对训练输出数据进行反量化。可以根据训练输出数据的反量化结果与该预设运算结果之间的差异,调整数据量化参数和/或算子量化参数。
为了进一步降低神经网络模型的运算量,可以对第一算子的运算结果、第二算子的运算结果进行减少位数的处理。
第一算子对量化后的第一输入数据进行处理,输出第一原始运算数据。第二算子对量化后的第二输入数据进行处理,输出第二原始运算数据。
可以取第一原始运算数据中位数最高的预设数量的比特位作为第一运算结果,取第二原始运算数据中位数最高的预设数量的比特位作为第二运算数据,进行后续的运算。位数最高的预设数量的比特位,即最左端的预设数量的比特位。
或者,可以根据偏移参数,确定第一运算数据和/或第二运算数据。
当第一原始运算数据中,所述偏移参数指示的比特位之前的比特位均为0时,所述第一运算数据包括所述第一原始运算数据中所述偏移参数指示的比特位之后的预设数量的比特位。
当第一原始运算数据中偏移参数指示的比特位之前的比特位不是均为0,存在值为1的比特位时,第一运算数据包括的预设数量的比特位均为“1”。
类似的,当第二原始运算数据中,所述偏移参数指示的比特位之前的比特位均为0时,所述第二运算数据包括所述第二原始运算数据中所述偏移参数指示的比特位之后的预设数量的比特位。
当第二原始运算数据偏移参数指示的比特位之前的比特位不是均为0,存在值为1的比特位时,第二运算数据包括的预设数量的比特位均为“1”。
可选地,对第一原始运算数据和第二原始运算数据均进行减少比特位数量数的压缩处理。
应当理解,如果第一算子的处理结果或第二算子的处理结果在该偏移参数指示的比特位或更高的比特位具有有效数据,该第一算子的处理结果对应的第一运算数据或第二算子的处理结果对应的第二运算数据可以表示为预设数量的“1”。该方式也可以理解为饱和运算。也就是说,当处理结果大于偏移参数指示的比特位之后的比特位数量能够表示的最大值时,将处理结果表示为在预设数量的比特位中表示为全“1”,即该预设数量的比特位能够表示的最大值。
该偏移参数可以是根据第一算子对第一训练输入数据的处理结果、第二算子对第二训练输入数据的处理结果得到的。
可以根据第一算子对量化后的第一训练输入数据进行处理输出的数据的有效位数,以及第二所述第二算子对量化后的第二训练输入数据进行处理输出的数据的有效位数,确定该偏移参数。
例如,第一算子可以对多个量化后的第一训练参数进行处理,第一算子对每个量化后的第一训练参数的处理结果中包括多个数。该多个数可以形成矩阵或者向量等。第二算子可以对多个量化后的第二训练参数进行处理,第二算子对每个量化后的第二训练参数的处理结果中包括多个数。可以根据每个处理结果中最大的有效位数的平均值,确定偏移参数。例如,可以对该平均值进行上取整,偏移参数用于指示可以是对该平均值进行上取整得到的有效位数的最高位。
根据训练输出数据的反量化结果与该预设运算结果之间的差异,还可以调整偏移参数。从而,使得数据处理结果的精度和准确性更高。
神经网络模型对图像、音频等数据进行处理的过程中,一般需要使用多个算子。在进行本申请实施例提供的神经网络模型量化方法之前,可以遍历原始神经网络模型,以确定原始神经网络模型中包括第一算子、第二算子以及用于对第一算子的输出和第二算子的输出进行运算的运算模型的处理结构。图9以第一算子和第二算子为卷积算子,运算模型为eltwise算子为例进行说明。
图13是本申请实施例提供的一种处理结构识别方法的示意性流程图。该处理结构识别方法可以由图1所示的训练设备120或其他设备执行。
在S910,判断节点i对应的节点是否为卷积算子。如果节点i不是卷积算子,令i=i+1,重新进行S910。如果节点i是卷积算子,进行S920。
在S920,判断节点i的输出数据是否为eltwise算子的输入。如果节点i的输出数据不是eltwise算子的输入,令i=i+1,重新进行S910。如果节点i的输出数据是eltwise算子的输入,进行S930。
在S930,判断该eltwise算子的另一路输入是否为卷积算子的输出数据。如果该eltwise算子的另一路输入不是卷积算子的输出数据,令i=i+1,重新进行S910。如果该eltwise算子的另一路输入是卷积算子的输出数据,将节点i作为第一算子,将提供eltwise算子的另一路输入的卷积算子第二算子,进行方法800。并且,令i=i+1,重新进行S910。遍历神 经网络中的全部节点,即i大于神经网络中模型中的节点数据量时,停止进行S910。通过方法900,可以确定神经网络模型中的所有两路卷积算子的输出结果输入一个eltwise算子的结构。
上文结合图1至图13的描述了本申请实施例提供的数据处理系统、神经网络模型量化方法以及数据处理方法,下面结合图14至图17,描述本申请实施例的装置实施例。应理解,数据处理系统、神经网络模型量化方法以及数据处理方法的描述与装置实施例的描述相互对应,因此,未详细描述的部分可以参见上文的描述。
图14是本申请实施例提供的一种神经网络模型量化装置的示意性结构图。神经网络模型量化装置3000可以位于图1所示的训练设备120或其他设备中。神经网络模型量化装置3000包括存储模块3010和处理模块3020。存储模块3010用于存储程序。
当所述程序在处理模块3020中运行时,处理模块3020用于:获取原始神经网络模型,所述原始神经网络模型包括第一算子、第二算子和第一运算模块,所述第一算子用于进行第一运算,所述第二算子用于进行第二运算,所述第一运算与所述第二运算为相同类型的运算,所述第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第三运算;根据第一训练输入数据的范围和第二训练输入数据的范围,确定数据量化参数,所述第一训练输入数据为所述第一算子的输入数据,所述第二训练输入数据为所述第二算子的输入数据;根据所述原始神经网络模型,确定量化后的神经网络模型,所述量化后的神经网络模型包括量化模块、第三算子、第四算子和第二运算模块,所述量化模块用于利用所述数据量化参数分别对所述第三算子的第一输入数据、所述第四算子的第二输入数据进行量化,所述第三算子为量化后的第一算子,所述第四算子为量化后的第二算子,所述第二运算模块用于进行所述第一运算。
可选地,处理模块3020还用于,获取训练输入数据组对应的预设训练输出数据,所述训练输入数据组包括所述第一训练输入数据和所述第二训练输入数据。
处理模块3020还用于,利用所述数据量化参数,分别对所述第一训练输入数据和所述第二训练输入数据进行量化。处理模块3020还用于,利用所述量化后的神经网络模型对量化后的第一训练输入数据和量化后的第二训练输入数据进行处理,以得到实际训练输出数据。处理模块3020还用于,根据所述实际训练输出数据与所述预设训练输出数据的差异,调整所述数据量化参数,以最小化所述差异。
所述量化模块用于利用调整后的数据量化参数分别对所述第三算子的第一输入数据、所述第四算子的第二输入数据进行量化。可选地,处理模块3020还用于,根据所述第一算子的参数范围、所述第二算子的参数范围,确定算子量化参数。
处理模块3020还用于,利用所述算子量化参数对所述第一算子的参数进行量化,以得到所述第三算子的参数。处理模块3020还用于,利用所述算子量化参数对所述第二算子的参数进行量化,以得到所述第四算子的参数。
可选地,所述量化后的神经网络模型还包括压缩模块,所述压缩模块用于根据偏移参数分别对所述第三算子的输出和所述第四算子的输出进行压缩,所述偏移参数用于指示进行所述压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置,所述第二运算模块用于对压缩后的数据进行所述第一运算。
处理模块3020还用于，利用所述数据量化参数分别对所述第一训练输入数据和第二训练输入数据进行量化。处理模块3020还用于，利用所述第三算子对量化后的第一训练输入数据进行处理，所述第三算子输出第一训练运算数据。处理模块3020还用于，利用所述第四算子对量化后的第二训练输入数据进行处理，所述第四算子输出第二训练运算数据。处理模块3020还用于，根据所述第一训练运算数据的有效位数，以及所述第二训练运算数据的有效位数，确定所述偏移参数。
图15是本申请实施例提供的一种数据处理装置的示意性结构图。数据处理装置2000可以位于图1所示的执行设备110或其他设备中。数据处理装置2000包括存储模块2010和处理模块2020。存储模块2010用于存储程序。
当所述程序在处理模块2020中运行时，处理模块2020用于：获取量化后的神经网络模型，所述量化后的神经网络模型是对原始神经网络模型进行量化得到的，所述原始神经网络模型包括第一算子、第二算子和第一运算模块，所述第一算子与所述第二算子用于进行相同类型的运算，所述第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第一运算；利用所述量化后的神经网络模型对第三算子的第一输入数据和第四算子的第二输入数据进行处理，所述量化后的神经网络模型包括量化模块、所述第三算子、所述第四算子和第二运算模块，所述量化模块用于利用数据量化参数分别对所述第一输入数据、所述第二输入数据进行量化，所述第二运算模块用于进行所述第一运算，所述第三算子为量化后的第一算子，所述第四算子为量化后的第二算子，所述数据量化参数是根据所述第一算子的第一训练输入数据的范围和所述第二算子的第二训练输入数据的范围确定的。
可选地，所述数据量化参数是对初始数据量化参数进行调整得到的，所述调整使得实际训练输出数据与预设训练输出数据的差异最小化。
所述初始数据量化参数是根据所述第一训练输入数据的范围和所述第二训练输入数据的范围确定的。所述预设训练输出数据对应于训练输入数据组，所述训练输入数据组包括所述第一训练输入数据和所述第二训练输入数据。
所述实际训练输出数据是利用所述量化后的神经网络模型对所述第一训练输入数据和所述第二训练输入数据进行处理得到的,所述量化模块用于利用所述初始数据量化参数分别对所述第一训练输入数据、所述第二训练输入数据进行量化。
可选地,所述第三算子的参数是利用算子量化参数对所述第一算子的参数进行量化得到的,所述第四算子的参数是利用所述算子量化参数对所述第二算子的参数进行量化得到的,所述算子量化参数是根据所述第一算子的参数范围、所述第二算子的参数范围确定的。
可选地,所述量化后的神经网络模型还包括压缩模块,所述压缩模块用于根据偏移参数分别对所述第三算子的输出和所述第四算子的输出进行压缩,所述偏移参数用于指示进行所述压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置,所述第二运算模块用于对压缩后的数据进行所述第一运算。
所述偏移参数是根据第一训练运算数据的有效位数和第二训练运算数据的有效位数确定的,所述第一训练运算数据是利用所述第三算子对使用所述数据量化参数量化后的第一训练输入数据进行处理得到的,所述第二训练运算数据是利用所述第四算子对使用所述数据量化参数量化后的第二训练输入数据进行处理得到的。
图16是本申请实施例的数据处理装置的硬件结构示意图。图16所示的数据处理装置4000包括存储器4001、处理器4002、通信接口4003以及总线4004。其中，存储器4001、处理器4002、通信接口4003通过总线4004实现彼此之间的通信连接。
存储器4001可以是ROM、静态存储设备或RAM。存储器4001可以存储程序，当存储器4001中存储的程序被处理器4002执行时，处理器4002和通信接口4003用于执行本申请实施例的数据处理方法的各个步骤。
处理器4002可以采用通用的CPU、微处理器、ASIC、GPU或者一个或多个集成电路，用于执行相关程序，以实现本申请实施例的数据处理装置中的单元所需执行的功能，或者执行本申请方法实施例的数据处理方法。
处理器4002还可以是一种集成电路芯片,具有信号的处理能力,例如,可以是图4所示的芯片。在实现过程中,本申请实施例的数据处理方法的各个步骤可以通过处理器4002中的硬件的集成逻辑电路或者软件形式的指令完成。
上述处理器4002还可以是通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件，可以实现或者执行本申请实施例中公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成，或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器、闪存、只读存储器、可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器4001，处理器4002读取存储器4001中的信息，结合其硬件完成本申请实施例的数据处理装置中包括的单元所需执行的功能，或者执行本申请方法实施例的数据处理方法。
通信接口4003使用例如但不限于收发器一类的收发装置,来实现装置4000与其他设备或通信网络之间的通信。例如,可以通过通信接口4003获取待处理图像。
总线4004可包括在装置4000各个部件(例如,存储器4001、处理器4002、通信接口4003)之间传送信息的通路。
图17是本申请实施例的神经网络模型量化装置的硬件结构示意图。与上述装置4000类似,图17所示的神经网络模型量化装置5000包括存储器5001、处理器5002、通信接口5003以及总线5004。其中,存储器5001、处理器5002、通信接口5003通过总线5004实现彼此之间的通信连接。
可以通过图17所示的神经网络模型量化装置5000对原始神经网络模型进行量化，量化得到的神经网络模型即可用于执行本申请实施例的数据处理方法。
具体地,图17所示的装置可以通过通信接口5003从外界获取量化所需的训练数据集以及原始神经网络模型,然后由处理器根据训练数据集和原始神经网络模型进行神经网络模型的量化。
应注意,尽管上述装置4000和装置5000仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,装置4000和装置5000还可以包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置4000和装置5000还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置4000和装置5000也可仅仅包括实现本申请实施例所必须的器件,而不必包括图16和图17中所示的全部器件。
应理解，本申请实施例中的处理器可以为中央处理单元（central processing unit，CPU），该处理器还可以是其他通用处理器、数字信号处理器（digital signal processor，DSP）、专用集成电路（application specific integrated circuit，ASIC）、现成可编程门阵列（field programmable gate array，FPGA）或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
还应理解,本申请实施例中的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的随机存取存储器(random access memory,RAM)可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
上述实施例，可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时，上述实施例可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令或计算机程序。在计算机上加载或执行所述计算机指令或计算机程序时，全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中，或者从一个计算机可读存储介质向另一个计算机可读存储介质传输，例如，所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线或无线（例如红外、无线、微波等）方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质（例如，软盘、硬盘、磁带）、光介质（例如，DVD）、或者半导体介质。半导体介质可以是固态硬盘。
应理解,本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况,其中A,B可以是单数或者复数。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系,但也可能表示的是一种“和/或”的关系,具体可参考前后文进行理解。
本申请中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。
应理解，在本申请的各种实施例中，上述各过程的序号的大小并不意味着执行顺序的先后，各过程的执行顺序应以其功能和内在逻辑确定，而不应对本申请实施例的实施过程构成任何限定。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (18)

  1. 一种神经网络模型量化方法,其特征在于,包括:
    获取原始神经网络模型,所述原始神经网络模型包括第一算子、第二算子和第一运算模块,所述第一算子与所述第二算子用于进行相同类型的运算,所述第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第一运算;
    根据第一训练输入数据的范围和第二训练输入数据的范围,确定数据量化参数,所述第一训练输入数据为所述第一算子的输入数据,所述第二训练输入数据为所述第二算子的输入数据;
    根据所述原始神经网络模型,确定量化后的神经网络模型,所述量化后的神经网络模型包括量化模块、第三算子、第四算子和第二运算模块,所述量化模块用于利用所述数据量化参数分别对所述第三算子的第一输入数据、所述第四算子的第二输入数据进行量化,所述第三算子为量化后的第一算子,所述第四算子为量化后的第二算子,所述第二运算模块用于进行所述第一运算。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    获取训练输入数据组对应的预设训练输出数据,所述训练输入数据组包括所述第一训练输入数据和所述第二训练输入数据;
    利用所述数据量化参数,分别对所述第一训练输入数据和所述第二训练输入数据进行量化;
    利用所述量化后的神经网络模型对量化后的第一训练输入数据和量化后的第二训练输入数据进行处理,以得到实际训练输出数据;
    根据所述实际训练输出数据与所述预设训练输出数据的差异,调整所述数据量化参数,以最小化所述差异;
    所述量化模块用于利用调整后的数据量化参数分别对所述第三算子的第一输入数据、所述第四算子的第二输入数据进行量化。
  3. 根据权利要求2所述的方法,其特征在于,所述方法还包括:
    根据所述第一算子的参数范围、所述第二算子的参数范围,确定算子量化参数;
    利用所述算子量化参数对所述第一算子的参数进行量化,以得到所述第三算子的参数;
    利用所述算子量化参数对所述第二算子的参数进行量化,以得到所述第四算子的参数。
  4. 根据权利要求1-3中任一项所述的方法,其特征在于,所述量化后的神经网络模型还包括压缩模块,所述压缩模块用于根据偏移参数分别对所述第三算子的输出和所述第四算子的输出进行压缩,所述偏移参数用于指示进行所述压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置,所述第二运算模块用于对压缩后的数据进行所述第一运算;
    所述方法还包括:
    利用所述数据量化参数分别对所述第一训练输入数据和第二训练输入数据进行量化;
    利用所述第三算子对量化后的第一训练输入数据进行处理,所述第三算子输出第一训练运算数据;
    利用所述第四算子对量化后的第二训练输入数据进行处理,所述第四算子输出第二训练运算数据;
    根据所述第一训练运算数据的有效位数,以及所述第二训练运算数据的有效位数,确定所述偏移参数。
  5. 一种数据处理方法,其特征在于,所述方法包括:
    获取量化后的神经网络模型，所述量化后的神经网络模型是对原始神经网络模型进行量化得到的，所述原始神经网络模型包括第一算子、第二算子和第一运算模块，所述第一算子与所述第二算子用于进行相同类型的运算，所述第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第一运算；
    利用所述量化后的神经网络模型对第三算子的第一输入数据和第四算子的第二输入数据进行处理，所述量化后的神经网络模型包括量化模块、所述第三算子、所述第四算子和第二运算模块，所述量化模块用于利用数据量化参数分别对所述第一输入数据、所述第二输入数据进行量化，所述第二运算模块用于进行所述第一运算，所述第三算子为量化后的第一算子，所述第四算子为量化后的第二算子，所述数据量化参数是根据所述第一算子的第一训练输入数据的范围和所述第二算子的第二训练输入数据的范围确定的。
  6. 根据权利要求5所述的方法,其特征在于,
    所述数据量化参数是对初始数据量化参数进行调整得到的，所述调整使得实际训练输出数据与预设训练输出数据的差异最小化，
    所述初始数据量化参数是根据所述第一训练输入数据的范围和所述第二训练输入数据的范围确定的，
    所述预设训练输出数据对应于训练输入数据组,所述训练输入数据组包括所述第一训练输入数据和所述第二训练输入数据,
    所述实际训练输出数据是利用所述量化后的神经网络模型对所述第一训练输入数据和所述第二训练输入数据进行处理得到的,所述量化模块用于利用所述初始数据量化参数分别对所述第一训练输入数据、所述第二训练输入数据进行量化。
  7. 根据权利要求6所述的方法,其特征在于,所述第三算子的参数是利用算子量化参数对所述第一算子的参数进行量化得到的,所述第四算子的参数是利用所述算子量化参数对所述第二算子的参数进行量化得到的,所述算子量化参数是根据所述第一算子的参数范围、所述第二算子的参数范围确定的。
  8. 根据权利要求5-7中任一项所述的方法,其特征在于,所述量化后的神经网络模型还包括压缩模块,所述压缩模块用于根据偏移参数分别对所述第三算子的输出和所述第四算子的输出进行压缩,所述偏移参数用于指示进行所述压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置,所述第二运算模块用于对压缩后的数据进行所述第一运算;
    所述偏移参数是根据第一训练运算数据的有效位数和第二训练运算数据的有效位数确定的，所述第一训练运算数据是利用所述第三算子对使用所述数据量化参数量化后的第一训练输入数据进行处理得到的，所述第二训练运算数据是利用所述第四算子对使用所述数据量化参数量化后的第二训练输入数据进行处理得到的。
  9. 一种神经网络模型量化装置,其特征在于,所述装置包括:存储模块和处理模块,
    所述存储模块用于存储程序;
    当所述程序在所述处理模块中运行时,所述处理模块用于:
    获取原始神经网络模型,所述原始神经网络模型包括第一算子、第二算子和第一运算模块,所述第一算子与所述第二算子用于进行相同类型的运算,所述第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第一运算;
    根据第一训练输入数据的范围和第二训练输入数据的范围,确定数据量化参数,所述第一训练输入数据为所述第一算子的输入数据,所述第二训练输入数据为所述第二算子的输入数据;
    根据所述原始神经网络模型,确定量化后的神经网络模型,所述量化后的神经网络模型包括量化模块、第三算子、第四算子和第二运算模块,所述量化模块用于利用所述数据量化参数分别对所述第三算子的第一输入数据、所述第四算子的第二输入数据进行量化,所述第三算子为量化后的第一算子,所述第四算子为量化后的第二算子,所述第二运算模块用于进行所述第一运算。
  10. 根据权利要求9所述的装置,其特征在于,
    所述处理模块还用于,获取训练输入数据组对应的预设训练输出数据,所述训练输入数据组包括所述第一训练输入数据和所述第二训练输入数据;
    所述处理模块还用于,利用所述数据量化参数,分别对所述第一训练输入数据和所述第二训练输入数据进行量化;
    所述处理模块还用于,利用所述量化后的神经网络模型对量化后的第一训练输入数据和量化后的第二训练输入数据进行处理,以得到实际训练输出数据;
    所述处理模块还用于,根据所述实际训练输出数据与所述预设训练输出数据的差异,调整所述数据量化参数,以最小化所述差异;
    所述量化模块用于利用调整后的数据量化参数分别对所述第三算子的第一输入数据、所述第四算子的第二输入数据进行量化。
  11. 根据权利要求10所述的装置,其特征在于,
    所述处理模块还用于,根据所述第一算子的参数范围、所述第二算子的参数范围,确定算子量化参数;
    所述处理模块还用于,利用所述算子量化参数对所述第一算子的参数进行量化,以得到所述第三算子的参数;
    所述处理模块还用于,利用所述算子量化参数对所述第二算子的参数进行量化,以得到所述第四算子的参数。
  12. 根据权利要求9-11中任一项所述的装置,其特征在于,所述量化后的神经网络模型还包括压缩模块,所述压缩模块用于根据偏移参数分别对所述第三算子的输出和所述第四算子的输出进行压缩,所述偏移参数用于指示进行所述压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置,所述第二运算模块用于对压缩后的数据进行所述第一运算;
    所述处理模块还用于，利用所述数据量化参数分别对所述第一训练输入数据和第二训练输入数据进行量化；
    所述处理模块还用于,利用所述第三算子对量化后的第一训练输入数据进行处理,所述第三算子输出第一训练运算数据;
    所述处理模块还用于,利用所述第四算子对量化后的第二训练输入数据进行处理,所述第四算子输出第二训练运算数据;
    所述处理模块还用于,根据所述第一训练运算数据的有效位数,以及所述第二训练运算数据的有效位数,确定所述偏移参数。
  13. 一种数据处理装置,其特征在于,包括:存储模块和处理模块,
    所述存储模块用于存储程序;
    当所述程序在所述处理模块中运行时,所述处理模块用于:
    获取量化后的神经网络模型，所述量化后的神经网络模型是对原始神经网络模型进行量化得到的，所述原始神经网络模型包括第一算子、第二算子和第一运算模块，所述第一算子与所述第二算子用于进行相同类型的运算，所述第一运算模块用于对所述第一算子的输出和所述第二算子的输出进行第一运算；
    利用所述量化后的神经网络模型对第三算子的第一输入数据和第四算子的第二输入数据进行处理，所述量化后的神经网络模型包括量化模块、所述第三算子、所述第四算子和第二运算模块，所述量化模块用于利用数据量化参数分别对所述第一输入数据、所述第二输入数据进行量化，所述第二运算模块用于进行所述第一运算，所述第三算子为量化后的第一算子，所述第四算子为量化后的第二算子，所述数据量化参数是根据所述第一算子的第一训练输入数据的范围和所述第二算子的第二训练输入数据的范围确定的。
  14. 根据权利要求13所述的装置,其特征在于,
    所述数据量化参数是对初始数据量化参数进行调整得到的，所述调整使得实际训练输出数据与预设训练输出数据的差异最小化，
    所述初始数据量化参数是根据所述第一训练输入数据的范围和所述第二训练输入数据的范围确定的，
    所述预设训练输出数据对应于训练输入数据组,所述训练输入数据组包括所述第一训练输入数据和所述第二训练输入数据,
    所述实际训练输出数据是利用所述量化后的神经网络模型对所述第一训练输入数据和所述第二训练输入数据进行处理得到的,所述量化模块用于利用所述初始数据量化参数分别对所述第一训练输入数据、所述第二训练输入数据进行量化。
  15. 根据权利要求14所述的装置,其特征在于,所述第三算子的参数是利用算子量化参数对所述第一算子的参数进行量化得到的,所述第四算子的参数是利用所述算子量化参数对所述第二算子的参数进行量化得到的,所述算子量化参数是根据所述第一算子的参数范围、所述第二算子的参数范围确定的。
  16. 根据权利要求13-15中任一项所述的装置,其特征在于,所述量化后的神经网络模型还包括压缩模块,所述压缩模块用于根据偏移参数分别对所述第三算子的输出和所述第四算子的输出进行压缩,所述偏移参数用于指示进行所述压缩后的数据中最高比特位在进行所述压缩之前的数据中的位置,所述第二运算模块用于对压缩后的数据进行所述第一运算;
    所述偏移参数是根据第一训练运算数据的有效位数和第二训练运算数据的有效位数确定的,所述第一训练运算数据是利用所述第三算子对使用所述数据量化参数量化后的第一训练输入数据进行处理得到的,所述第二训练运算数据是利用所述第四算子对使用所述数据量化参数量化后的第二训练输入数据进行处理得到的。
  17. 一种计算机可读存储介质,其特征在于,所述计算机可读介质存储用于设备执行的程序代码,该程序代码被所述设备执行时,所述设备执行如权利要求1至8中任一项所述的方法。
  18. 一种芯片,其特征在于,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,以执行如权利要求1至8中任一项所述的方法。
PCT/CN2020/125370 2020-10-30 2020-10-30 神经网络模型的量化方法和装置、数据处理的方法和装置 WO2022088063A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/125370 WO2022088063A1 (zh) 2020-10-30 2020-10-30 神经网络模型的量化方法和装置、数据处理的方法和装置
CN202080016479.1A CN114698395A (zh) 2020-10-30 2020-10-30 神经网络模型的量化方法和装置、数据处理的方法和装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/125370 WO2022088063A1 (zh) 2020-10-30 2020-10-30 神经网络模型的量化方法和装置、数据处理的方法和装置

Publications (1)

Publication Number Publication Date
WO2022088063A1 true WO2022088063A1 (zh) 2022-05-05

Family

ID=81381775

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/125370 WO2022088063A1 (zh) 2020-10-30 2020-10-30 神经网络模型的量化方法和装置、数据处理的方法和装置

Country Status (2)

Country Link
CN (1) CN114698395A (zh)
WO (1) WO2022088063A1 (zh)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598839A (zh) * 2018-06-12 2019-12-20 华为技术有限公司 卷积神经网络系统和卷积神经网络量化的方法
US20200097818A1 (en) * 2018-09-26 2020-03-26 Xinlin LI Method and system for training binary quantized weight and activation function for deep neural networks
US20200257960A1 (en) * 2019-02-12 2020-08-13 XNOR.ai, Inc. Compressed convolutional neural network models
CN110322008A (zh) * 2019-07-10 2019-10-11 杭州嘉楠耘智信息科技有限公司 一种基于残差卷积神经网络的量化处理方法及装置
CN111176853A (zh) * 2020-02-19 2020-05-19 珠海市杰理科技股份有限公司 数据量化方法、装置、计算机设备和存储介质
CN111652366A (zh) * 2020-05-09 2020-09-11 哈尔滨工业大学 一种基于通道剪枝和量化训练的联合神经网络模型压缩方法

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841325A (zh) * 2022-05-20 2022-08-02 安谋科技(中国)有限公司 神经网络模型的数据处理方法、介质及电子设备
CN116258178A (zh) * 2023-03-24 2023-06-13 美的集团(上海)有限公司 模型转换方法、装置、电子设备和可读存储介质
CN116258178B (zh) * 2023-03-24 2023-09-22 美的集团(上海)有限公司 模型转换方法、装置、电子设备和可读存储介质
CN116579400A (zh) * 2023-05-19 2023-08-11 北京百度网讯科技有限公司 深度学习模型的量化方法、数据处理方法和装置
CN116579400B (zh) * 2023-05-19 2024-02-23 北京百度网讯科技有限公司 深度学习模型的量化方法、数据处理方法和装置
CN117634577A (zh) * 2024-01-25 2024-03-01 深圳市九天睿芯科技有限公司 向量处理器、神经网络加速器、芯片及电子设备
CN117634577B (zh) * 2024-01-25 2024-06-07 深圳市九天睿芯科技有限公司 向量处理器、神经网络加速器、芯片及电子设备

Also Published As

Publication number Publication date
CN114698395A (zh) 2022-07-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20959225

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20959225

Country of ref document: EP

Kind code of ref document: A1