WO2023231794A1 - A neural network parameter quantization method and device - Google Patents

A neural network parameter quantization method and device

Info

Publication number
WO2023231794A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
data
parameters
model
parameter
Prior art date
Application number
PCT/CN2023/095019
Other languages
English (en)
French (fr)
Inventor
聂迎
韩凯
刘传建
马俊辉
王云鹤
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023231794A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • the application relates to the field of artificial intelligence, and in particular to a neural network parameter quantification method and device.
  • Model compression technology is a common technical means to build lightweight neural networks.
  • Neural network models generally use FP32 (32-bit floating point data) for storage. Research has found that neural networks have good robustness. If the parameters of large neural networks are reduced in accuracy through quantization, coding, etc., they can still maintain relatively good performance.
  • Commonly used low-precision data formats include FP16 (half-precision floating point), INT16 (16-bit fixed-point integer), INT8 (8-bit fixed-point integer), INT4 (4-bit fixed-point integer), 1-bit and other numerical formats. Considering both network performance and the degree of model compression, converting weight parameters from 32-bit floating point (FP32) to 8-bit fixed-point integer (INT8) is currently the most commonly used quantization method.
  • This application provides a neural network parameter quantization method and device for quantizing neural networks, reducing the accuracy loss during low-bit quantization and obtaining a lightweight model with more accurate output.
  • In a first aspect, this application provides a neural network parameter quantization method, which includes: first, obtaining the parameters of each neuron in the model to be quantized to obtain a parameter set; then clustering the parameters in the parameter set to obtain multiple classes of classification data; and quantizing each class of the classification data to obtain at least one quantization parameter, where the at least one quantization parameter is used to obtain a compression model and the precision of the at least one quantization parameter is lower than the precision of the parameters in the model to be quantized.
  • Therefore, the parameters of the neural network are first clustered and then quantized, with the parameters of each class quantized separately, thereby improving the expressive ability of the model.
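  • As a minimal illustration of the cluster-then-quantize idea described above (a sketch only, not the application's reference implementation; the function name cluster_and_quantize and the choice of K-Means with n_clusters=4 are illustrative assumptions), the parameter set can be grouped and each group given its own INT8 scale:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_quantize(params, n_clusters=4, n_bits=8):
    """Cluster a flat parameter set, then quantize each cluster with its own scale."""
    params = np.asarray(params, dtype=np.float32).reshape(-1, 1)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(params)

    qmax = 2 ** (n_bits - 1) - 1                      # 127 for INT8
    scales = np.zeros(n_clusters, dtype=np.float32)
    codes = np.zeros(len(params), dtype=np.int8)
    for c in range(n_clusters):
        idx = labels == c
        scales[c] = max(np.abs(params[idx]).max() / qmax, 1e-12)  # one scaling coefficient per class
        codes[idx] = np.clip(np.round(params[idx, 0] / scales[c]), -qmax - 1, qmax)
    return scales, labels, codes

# Dequantize an element i as codes[i] * scales[labels[i]] to approximate the original value.
```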
  • the aforementioned clustering of the parameter set to obtain multiple classes of classification data may include: clustering the parameter set to obtain at least one cluster of data, and then intercepting a preset number of parameters from each cluster to obtain the multiple classes of classification data.
  • Therefore, the parameters of each class are truncated after clustering, thereby reducing outliers in each class and improving the expressive ability of the subsequent model.
  • the parameters in the model to be quantized include parameters of the output features of each neuron or parameter values within each neuron. Therefore, in the embodiment of the present application, the internal parameters of each neuron in the neural network and the output feature values are quantized, thereby reducing the bits occupied by the quantized model and obtaining a lightweight model.
  • the model to be quantized includes an additive neural network. For additive networks, if all neurons share a single scaling coefficient, the expressive ability of the model is reduced; and if the quantization method used for multiplicative convolution is applied directly, the scaling coefficients of the weight data and the input features may differ, so the result may be a non-INT8 value. Therefore, with the method provided by this application, the parameters can be clustered and each class of parameters quantized separately, which improves the expressive ability of the compressed model and avoids non-INT8 values after quantization.
  • the compression model is used to perform at least one of image recognition, classification tasks or target detection. Therefore, the method provided in this application can be applied to a variety of scenarios and has strong generalization ability.
  • this application provides a neural network parameter quantification device, including:
  • the acquisition module is used to obtain the parameters of each neuron in the model to be quantified and obtain the parameter set;
  • the clustering module is used to cluster the parameter set to obtain multiple classes of classification data;
  • the quantization module is used to quantize each class of the multiple classes of classification data to obtain at least one quantization parameter;
  • the at least one quantization parameter is used to obtain a compression model;
  • the precision of the at least one quantization parameter is lower than the precision of the parameters in the model to be quantized.
  • the clustering module is specifically configured to: cluster the parameter set to obtain at least one cluster of data, and intercept a preset number of parameters from each cluster in the at least one cluster of data to obtain multiple classes of classification data.
  • the parameters in the model to be quantified include parameters in the output characteristics of each neuron or parameter values within each neuron.
  • the model to be quantified includes an additive neural network.
  • the compression model is used to perform at least one of image recognition, classification tasks or target detection.
  • a neural network parameter quantization device, including: a processor and a memory, where the processor and the memory are interconnected through lines, and the processor calls the program code in the memory to execute the method in the above-mentioned first aspect or any optional implementation of the first aspect.
  • the neural network parameter quantification device may be a chip.
  • Embodiments of the present application further provide a neural network parameter quantization device.
  • the neural network parameter quantification device may also be called a digital processing chip or chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit obtains program instructions through the communication interface; the program instructions are executed by the processing unit, and the processing unit is configured to perform the processing-related functions in the above-mentioned first aspect or any optional implementation of the first aspect.
  • embodiments of the present application provide a computer-readable storage medium that includes instructions that, when run on a computer, cause the computer to execute the method in the above-mentioned first aspect or any optional implementation of the first aspect.
  • embodiments of the present application provide a computer program product containing instructions that, when run on a computer, cause the computer to execute the method in the above-mentioned first aspect or any optional implementation of the first aspect.
  • Figure 1 is a schematic diagram of an artificial intelligence main framework applied in this application.
  • Figure 2 is a schematic structural diagram of a convolution kernel applied in this application.
  • Figure 3 is a schematic structural diagram of a convolutional neural network provided by this application.
  • Figure 4 is a schematic diagram of a system architecture provided by this application.
  • Figure 5 is a schematic diagram of another system architecture provided by this application.
  • Figure 6 is a schematic diagram of a quantification method provided by this application.
  • Figure 7 is a schematic flow chart of a neural network parameter quantification method provided by this application.
  • Figure 8 is a schematic flow chart of another neural network parameter quantification method provided by this application.
  • Figure 9 is a schematic diagram of a parameter truncation method provided by this application.
  • Figure 10 is a schematic diagram of another parameter truncation method provided by this application.
  • Figure 11 is a schematic structural diagram of a model to be quantified provided by this application.
  • Figure 12 is an accuracy comparison chart between the solution proposed in this application and the commonly used quantization solution for additive networks.
  • Figure 13 is a schematic structural diagram of a neural network parameter quantification device provided by this application.
  • Figure 14 is a schematic structural diagram of another neural network parameter quantification device provided by this application.
  • Figure 15 is a schematic structural diagram of a chip provided by this application.
  • Figure 1 shows a structural schematic diagram of the artificial intelligence main framework.
  • the above artificial intelligence framework is elaborated below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensation process of "data-information-knowledge-wisdom".
  • the "IT value chain” reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of human intelligence and information (providing and processing technology implementation) to the systematic industrial ecological process.
  • Infrastructure provides computing power support for artificial intelligence systems, enables communication with the external world, and supports it through basic platforms.
  • computing power is provided by smart chips, that is, hardware acceleration chips such as the central processing unit (CPU), neural-network processing unit (NPU), graphics processing unit (GPU), application-specific integrated circuit (ASIC) or field programmable gate array (FPGA);
  • the basic platform includes related platform guarantees and support such as a distributed computing framework and networks, and can include cloud storage and computing, interconnection networks, etc.
  • sensors communicate with the outside world to obtain data, which are provided to smart chips in the distributed computing system provided by the basic platform for calculation.
  • Data from the upper layer of the infrastructure is used to represent data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice and text, as well as IoT data of traditional devices, including business data of existing systems and sensing data such as force, displacement, liquid level, temperature and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formal information to perform machine thinking and problem solving based on reasoning control strategies. Typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of further data processing, such as algorithms or a general system, for example translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. The application fields mainly include intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, smart cities, etc.
  • the neural network can be composed of neural units.
  • the neural unit can refer to an arithmetic unit that takes xs and intercept 1 as input.
  • the output of the arithmetic unit can be as shown in the formula: h_{W,b}(x) = f(W^T x) = f(Σ_{s=1}^{n} W_s x_s + b), where s = 1, 2, …, n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of this activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
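  • A plain NumPy rendering of this neural-unit formula (an illustrative sketch only; the three-input example values are assumptions, and the sigmoid is used as the activation function f as in the example above):

```python
import numpy as np

def neuron_output(x, w, b):
    """Compute f(sum_s W_s * x_s + b) with a sigmoid activation f."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

x = np.array([0.5, -1.0, 2.0])   # inputs x_s
w = np.array([0.2, 0.4, -0.1])   # weights W_s
b = 0.1                          # bias
print(neuron_output(x, w, b))    # output signal fed to the next layer
```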
  • a neural network is a network formed by connecting multiple above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • Deep neural network (DNN), also known as multi-layer neural network.
  • DNN can be understood as a neural network with multiple intermediate layers.
  • DNN is divided according to the positions of different layers.
  • the neural network inside the DNN can be divided into three categories: input layer, intermediate layer, and output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in between are all intermediate layers, or hidden layers.
  • the layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although DNN looks complex, the work of each layer can be expressed as the linear relationship y = α(W·x + b), where x is the input vector, y is the output vector, b is the offset vector (bias parameter), W is the weight matrix (also called coefficients), and α() is the activation function.
  • Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, the number of coefficients W and offset vectors b is also relatively large.
  • These parameters are defined in the DNN as follows, taking the coefficient w as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
  • In general, the coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as W^L_{jk}.
  • the input layer has no W parameter.
  • more intermediate layers make the network more capable of describing complex situations in the real world.
  • a model with more parameters has higher complexity and greater "capacity", which means it can complete more complex learning tasks.
  • Training a deep neural network is the process of learning the weight matrix. The ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (a weight matrix formed by the vectors W of many layers).
  • Convolutional neural network (CNN)
  • CNN has excellent features such as local perception and weight sharing, which can significantly reduce weight parameters and greatly improve network performance. It has achieved many breakthrough results in fields such as computer vision and image analysis, and has become the core technology of artificial intelligence and deep learning.
  • the convolutional neural network contains a feature extractor consisting of a convolutional layer and a subsampling layer, which can be regarded as a filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal. In the convolutional layer of a convolutional neural network, a neuron can be connected to only some of the neighboring layer neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as a way to extract image information independent of position.
  • the convolution kernel can be initialized in the form of a random-sized matrix. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • Recurrent neural network (RNN)
  • In an ordinary neural network, the layers are fully connected, while the nodes within each layer are unconnected.
  • Although this kind of ordinary neural network has solved many difficult problems, it is still incapable of handling many others. For example, to predict the next word of a sentence, the previous words generally need to be used, because the preceding and following words in a sentence are not independent.
  • the reason why RNN is called a recurrent neural network is that the current output of a sequence is also related to the previous output.
  • RNN can process sequence data of any length.
  • the training of RNN is the same as the training of traditional CNN or DNN.
  • Residual neural network (ResNet)
  • the residual neural network (ResNet) was proposed to solve the degradation problem that occurs when there are too many hidden layers in a neural network.
  • the degradation problem refers to the following: when the number of hidden layers in the network increases, the accuracy of the network saturates and then degrades sharply; this degradation is not caused by overfitting, but by the fact that the correlation of the gradients becomes small when they are propagated back to the bottom layers during backpropagation, so the gradient update is insufficient, which reduces the accuracy of the labels predicted by the final model.
  • when the neural network degrades, a shallow network can achieve better training results than a deep network; in that case, if the low-level features are passed directly to the higher levels, the result should be at least no worse than that of the shallow network. An identity mapping can therefore be used to achieve this effect; this identity mapping is called a residual connection (shortcut), and optimizing the residual mapping is easier than optimizing the original mapping.
  • a multiplicative convolution scheme can be used, the core of which is to use convolution multiplication operations to extract the similarity between the filter and the input image, which can be expressed as Y(m, n, t) = Σ_i Σ_j Σ_k S(X(m+i, n+j, k), F(i, j, k, t)), where S(x, y) represents the similarity between x and y (for multiplicative convolution, S(x, y) = x × y), X represents the input image, F represents the filter used for the convolution calculation, i and j represent the horizontal and vertical coordinates within a convolution kernel, k represents the input channel, and t represents the output channel.
  • the difference from the aforementioned multiplicative neural network is that the additive neural network can use the subtraction operation of the input image and the feature extractor to extract the similarity between the two.
  • the difference between multiplicative convolution kernels and additive convolution kernels is introduced as an example.
  • the difference between a multiplicative convolution kernel and an additive convolution kernel can be shown in Figure 2.
  • the convolution kernel can perform a convolution operation on the input matrix to obtain the convolved output.
  • the convolution kernel is a multiplicative convolution kernel
  • the convolution operation of the multiplicative convolution kernel is a multiplication operation.
  • the convolution kernel is an additive convolution kernel
  • the convolution operation is an addition operation, also called a subtraction operation. As shown in Figure 2, the input matrix contains a sub-matrix currently to be operated on, together with the matrix corresponding to the convolution kernel.
  • when the convolution kernel is an additive convolution kernel, its operation takes the difference between the input sub-matrix and each element in the convolution kernel and sums the absolute values with a negative sign, i.e. the output can be expressed as -Σ_{i,j} |X(i, j) - F(i, j)|; for the example shown in Figure 2 the result is -26.
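  • To make the contrast concrete, the following sketch (the 2×2 input values and kernel values are assumptions, not the matrices of Figure 2) computes one output position with a multiplicative kernel and with an additive kernel:

```python
import numpy as np

def multiplicative_response(patch, kernel):
    """Similarity S(x, y) = x * y, summed over the patch (ordinary convolution)."""
    return np.sum(patch * kernel)

def additive_response(patch, kernel):
    """Similarity S(x, y) = -|x - y|, summed over the patch (additive convolution kernel)."""
    return -np.sum(np.abs(patch - kernel))

patch = np.array([[1., 2.], [3., 4.]])    # sub-matrix of the input currently being operated on
kernel = np.array([[0., 1.], [1., 0.]])   # 2x2 convolution kernel (illustrative values)
print(multiplicative_response(patch, kernel))   # 5.0
print(additive_response(patch, kernel))         # -8.0
```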
  • the neural network can use the error back propagation (BP) algorithm to modify the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters in the initial neural network model are updated by backpropagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.
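  • A minimal numeric sketch of one such update (illustrative values only; plain gradient descent on a single weight with a squared-error loss, not the exact training procedure of any model in this application):

```python
# One weight w, squared-error loss L = (w*x - y)^2, updated by gradient descent.
w, x, y, lr = 0.5, 2.0, 3.0, 0.1

for step in range(3):
    y_hat = w * x                 # forward propagation
    loss = (y_hat - y) ** 2       # error loss at the output
    grad = 2 * (y_hat - y) * x    # backpropagated gradient dL/dw
    w -= lr * grad                # update the parameter so the error loss shrinks
    print(step, round(loss, 4), round(w, 4))
```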
  • Model quantization is a model compression method that converts high-bit representations into low-bit representations.
  • model compression technology that converts regular 32-bit floating point operations into low-bit integer operations is called model quantization.
  • when the low-bit quantization is to 8 bits, it can be called INT8 quantization; that is, a weight that originally needed to be represented by float32 only needs to be represented by int8 after quantization, which in theory can yield a 4-fold network acceleration.
  • compared with 32-bit, 8-bit can reduce the storage space by 4 times and also reduce the computing time, thus achieving model compression and acceleration.
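  • A minimal INT8 quantize/dequantize round trip (a sketch under the common symmetric-scaling assumption; the function names are illustrative and not taken from this application):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric INT8 quantization: map float32 values into [-127, 127] with one scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64).astype(np.float32)   # weights originally stored in 32 bits
q, s = quantize_int8(w)                      # stored in 8 bits: roughly 4x smaller
w_hat = dequantize_int8(q, s)
print(np.max(np.abs(w - w_hat)))             # quantization error introduced by the conversion
```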
  • CNN is a commonly used neural network.
  • the neural networks mentioned below in this application include convolutional neural networks with additive convolution kernels or multiplicative convolution kernels. To facilitate understanding, the structure of the convolutional neural network is introduced as an example below.
  • a convolutional neural network is a deep neural network with a convolutional structure. It is a deep learning architecture.
  • the deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network can respond to the image input into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230 .
  • each layer is called a stage. The relevant content of these layers is introduced in detail below.
  • the convolution layer/pooling layer 220 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, 221 and 222 are convolutional layers, 223 is a pooling layer, 224 and 225 are convolutional layers, and 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • The following takes convolutional layer 221 as an example to introduce the internal working principle of a convolutional layer.
  • the convolution layer 221 can include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on the image, the weight matrix is usually slid along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to complete the process of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolved output with a single depth dimension, but in most cases multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same type, are applied instead of a single weight matrix.
  • the output of each weight matrix is stacked to form the depth dimension of the convolution image.
  • the dimension here can be understood as being determined by the "multiple" mentioned above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to remove unnecessary noise in the image.
  • the multiple weight matrices have the same size (rows × columns), so the feature maps extracted by them also have the same size; the extracted feature maps of the same size are then merged to form the output of the convolution operation.
  • weight values in these weight matrices require a large amount of training in practical applications.
  • Each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, thereby allowing the convolutional neural network 200 to make correct predictions.
  • as the depth of the convolutional neural network 200 increases, the features extracted by the subsequent convolutional layers (for example, 226) become more and more complex than those extracted by the initial convolutional layer (for example, 221), such as high-level semantic features.
  • the pooling layer can also be called a downsampling layer.
  • the only purpose of the pooling layer is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value in a specific range as the result of max pooling.
  • the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer.
  • Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
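  • A small sketch of average and max pooling over non-overlapping 2×2 sub-regions (the 4×4 input is an illustrative assumption):

```python
import numpy as np

def pool2x2(img, mode="max"):
    """Pool non-overlapping 2x2 regions of a 2-D image; each output pixel is the
    average or the maximum of the corresponding sub-region."""
    h, w = img.shape
    blocks = img[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

img = np.arange(16, dtype=np.float32).reshape(4, 4)
print(pool2x2(img, "max"))   # each entry is the maximum of a 2x2 sub-region
print(pool2x2(img, "avg"))   # each entry is the average of a 2x2 sub-region
```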
  • the convolutional neural network 200 After being processed by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not enough to output the required output information. Because as mentioned above, the convolutional layer/pooling layer 220 will only extract features and reduce the parameters brought by the input image. However, in order to generate the final output information (required class information or other related information), the convolutional neural network 200 needs to use the neural network layer 230 to generate an output or a set of required number of classes. Therefore, the neural network layer 230 may include multiple intermediate layers (231, 232 to 23n as shown in Figure 3) and an output layer 240, which may also be called a fully connected (fully connected, FC) layer.
  • the parameters included in the multi-layer intermediate layer can be pre-trained based on the relevant training data of a specific task type. For example, the task type can include image recognition, image classification, image super-resolution reconstruction, etc.
  • the output layer 240 has a loss function similar to classification cross entropy, specifically used to calculate the prediction error.
  • the forward propagation of the entire convolutional neural network 200 (the propagation from the direction 210 to 240 in Figure 3 is forward propagation) is completed, the back propagation (the propagation from the direction 240 to 210 in Figure 3 is back propagation) will begin.
  • the weight values and biases of each layer mentioned above are updated to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
  • the convolutional neural network 200 shown in Figure 3 is only an example of a convolutional neural network. In specific applications, the convolutional neural network can also exist in the form of other network models.
  • the convolutional neural network 200 shown in Figure 3 can be used to process the image to be processed to obtain the classification result of the image to be processed.
  • the image to be processed is processed by the input layer 210, the convolution layer/pooling layer 220 and the neural network layer 230 and then the classification result of the image to be processed is output.
  • the neural network parameter quantification method provided by the embodiment of the present application can be executed on the server and can also be executed on the terminal device.
  • the terminal device may be a mobile phone with an image processing function, a tablet personal computer (TPC), a media player, a smart TV, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a camcorder, a smart watch, a wearable device (WD) or a self-driving vehicle, etc.; the embodiments of this application are not limited to this.
  • this embodiment of the present application provides a system architecture 100.
  • data collection device 160 is used to collect training data.
  • the training data may include training images and classification results corresponding to the training images, where the classification results of the training images may be manually pre-annotated results.
  • After collecting the training data, the data collection device 160 stores the training data into the database 130, and the training device 120 trains the target model/rules 101 based on the training data maintained in the database 130.
  • the training set mentioned in the following embodiments of this application may be obtained from the database 130 or may be obtained through user input data.
  • the target model/rule 101 may be the neural network mentioned below in the embodiment of this application.
  • the training device 120 processes the input original image and compares the output image with the original image until the difference between the image output by the training device 120 and the original image is less than a certain threshold, thereby completing the training of the target model/rule 101.
  • the above target model/rule 101 can be used to implement the first neural network obtained by the neural network parameter quantification method in the embodiment of the present application, that is, the data to be processed (such as images) is input into the target model/rule 101 after relevant preprocessing, You can get the processing results.
  • the target model/rule 101 in the embodiment of this application may specifically be the first neural network mentioned below in this application.
  • the first neural network may be the aforementioned CNN, DNN or RNN type of neural network. It should be noted that in actual applications, the training data maintained in the database 130 may not necessarily be collected by the data collection device 160, but may also be received from other devices.
  • the training device 120 may not necessarily train the target model/rules 101 entirely based on the training data maintained by the database 130; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be construed as a limitation on the embodiments of this application.
  • the target model/rules 101 trained according to the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in Figure 5.
  • the execution device 110 can also be called a computing device.
  • the execution device 110 can be a terminal, such as a mobile phone terminal, a tablet, a laptop, an augmented reality (AR)/virtual reality (VR) device or a vehicle-mounted terminal, etc.; it can also be a server or a cloud device, etc.
  • the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 140.
  • the input data may include: data to be processed input by the client device.
  • the preprocessing module 113 and the preprocessing module 114 are used to perform preprocessing according to the input data (such as data to be processed) received by the I/O interface 112.
  • the preprocessing module 113 and the preprocessing module 114 may not be present, or there may be only one preprocessing module; when they are not present, the calculation module 111 can be used directly to process the input data.
  • When the execution device 110 preprocesses input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call data, code, etc. in the data storage system 150 for the corresponding processing, and the data, instructions, etc. obtained by the corresponding processing can also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result to the client device 140 to provide it to the user; for example, if the first neural network is used for image classification and the processing result is a classification result, the I/O interface 112 returns the classification result obtained above to the client device 140 to provide it to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired results.
  • the execution device 110 and the training device 120 may be the same device, or located within the same computing device. To facilitate understanding, this application will introduce the execution device and the training device separately, which is not a limitation.
  • the user can manually set the input data, and the manual setting can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send input data to the I/O interface 112. If automatically sending input data requires the user's authorization, the user can set the corresponding permissions in the client device 140.
  • the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be display, sound, action, etc.
  • the client device 140 can also serve as a data collection end, collecting the input data input to the I/O interface 112 and the predicted labels output from the I/O interface 112, as shown in the figure, as new sample data and storing them in the database 130.
  • Alternatively, the I/O interface 112 directly uses the input data input to the I/O interface 112 and the predicted labels output from the I/O interface 112, as shown in the figure, as new sample data, and stores the data in the database 130.
  • Figure 4 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 can also be placed in the execution device 110.
  • the target model/rule 101 is obtained by training according to the training device 120.
  • the target model/rule 101 can be the neural network mentioned below in the present application.
  • the neural network can be CNN, deep convolutional neural networks (DCNN), recurrent neural network (RNN), etc.
  • the execution device 110 is implemented by one or more servers, optionally cooperating with other computing devices, such as data storage, routers, load balancers and other devices; the execution device 110 can be arranged at one physical site or distributed across multiple physical sites.
  • the execution device 110 can use the data in the data storage system 150, or call the program code in the data storage system 150 to implement the steps of the neural network parameter quantification method mentioned below in this application.
  • Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, game console, etc.
  • Each user's local device can interact with the execution device 110 through a communication network of any communication mechanism/communication standard.
  • the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network, etc.
  • the wireless network includes but is not limited to any one or a combination of: a fifth-generation mobile communication technology (5G) system, a long term evolution (LTE) system, a global system for mobile communication (GSM) or code division multiple access (CDMA) network, a wideband code division multiple access (WCDMA) network, wireless fidelity (WiFi), Bluetooth, the Zigbee protocol, radio frequency identification (RFID), long range (Lora) wireless communication, and near field communication (NFC).
  • the wired network may include an optical fiber communication network or a network composed of coaxial cables.
  • execution device 110 may be implemented by each local device, for example, local device 401 may provide local data or feedback calculation results to execution device 110 .
  • the local device 401 implements the functions of the execution device 110 and provides services for its own users, or provides services for users of the local device 402 .
  • Model compression is a common method for building lightweight neural networks.
  • FP32 (32-bit floating point data), INT16 (16-bit fixed-point integer), INT8 (8-bit fixed-point integer), INT4 (4-bit fixed-point integer), 1-bit and other numerical formats are involved. Considering both network performance and the degree of model compression, converting weight parameters from 32-bit floating point (FP32) to 8-bit fixed-point integer (INT8) is a commonly used quantization method.
  • In one scheme, a quantization parameter s is used to perform INT8 quantization after the subtraction of the weights and the input image, and good results have been achieved in INT8 quantization.
  • In another scheme, the weights and the input images are quantized to INT8 separately.
  • This method corresponds to two scaling coefficients s1 and s2, and its hardware requires 2 FP32 multipliers and 1 INT8 multiplier.
  • each layer uses a non-shared scale to perform INT8 quantization on the weight data and input image feature data respectively.
  • when this separate quantization method is applied to an additive network, the required multiplication operations will increase the computational energy consumption.
  • In addition, because the scaling coefficients of the weight data and the input feature data are not necessarily equal, the final result may be a non-INT8 numerical value.
  • this application provides a neural network parameter quantization method that can be applied to various neural network quantization scenarios and achieves efficient low-bit quantization.
  • As shown in Figure 7, a schematic flow of the neural network parameter quantization method provided by this application is as follows.
  • the model to be quantified may include an additive network or a multiplicative network, etc.
  • the model to be quantified may include multiple neurons, and the parameters of each neuron may be read to obtain a parameter set of the model to be quantified.
  • the parameters of each neuron may include parameter values within the neuron, or may include the weight of the output of the neurons in each intermediate layer. Therefore, during subsequent quantization, most parameters in the neural network can be quantized, resulting in a lightweight model.
  • the parameters in the parameter set can be clustered to obtain one or more types of classified data, which can be understood as dividing the parameters in the parameter set into one or more categories through clustering.
  • Specific clustering methods can include K-Means clustering, mean-shift clustering, density-based spatial clustering of applications with noise (DBSCAN), expectation-maximization clustering based on a Gaussian mixture model, etc.
  • the matching clustering method can be selected according to the actual application scenario, and this application does not limit this.
  • the specific process of obtaining the multiple classes of classification data may include: clustering the parameters in the parameter set to obtain one or more clusters of data, and then intercepting a preset number of parameters from the one or more clusters to obtain the classification data mentioned above.
  • Therefore, in the implementation of this application, there is no need to define a truncation threshold; intercepting a fixed number of parameters reduces the computation required to determine a threshold and improves the deployment generalization of the method provided by this application.
  • each category of data can be quantified, that is, the number of bits occupied by parameters in each category of data is reduced to obtain at least one quantified parameter.
  • the at least one quantization parameter is used to obtain the compression model. For example, if the data type of the parameter in the model to be quantized is FP16, its data type can be converted to INT8, thereby reducing the bits occupied by the parameter and achieving low-bit compression of the parameter. Quantization, thereby achieving model compression and obtaining a lightweight model.
  • the parameters in the parameter set can be divided into parameter values of the neuron itself, such as weight parameters, and output feature values of the neuron, such as feature parameters.
  • the feature parameters can include the feature values output by each neuron when an input image is fed to the model to be quantized, which are then quantized.
  • the input image can be a preset image or a randomly selected image. Feature parameters and weight parameters usually influence each other, but the range sizes of the two kinds of parameters may differ. If only one quantization parameter is used for both, some of the quantized parameters may be truncated or bits may be wasted.
  • the parameters in the model to be quantified are clustered, and the parameters are divided into multiple categories before being quantified. This can achieve classification and quantification and improve the expressive ability of the model obtained after quantification, especially for additive networks.
  • the post-clustering quantification provided by this application can significantly improve the expression ability of the model and obtain a lightweight model with higher output accuracy.
  • the method provided in this application quantifies the parameters after clustering, and only requires a small increase in workload to obtain a lightweight model with more accurate output, which can be applied to more scenarios that require the deployment of lightweight models.
  • As shown in Figure 8, a schematic flow of another neural network parameter quantization method provided by this application is as follows.
  • the model to be quantized 801 may include a multiplication network or an addition network, etc.
  • the method provided by this application can be deployed on a server or a terminal.
  • the method provided by this application can be deployed on a server.
  • the server quantifies the model to be quantified
  • the resulting lightweight model can be deployed on the terminal, so that the lightweight model can be run in the terminal and improve the operating efficiency of the terminal.
  • the model to be quantified may specifically be CNN, ANN or RNN, etc.
  • the model to be quantified may include multiple network layers, for example, it may be divided into an input layer, an intermediate layer and an output layer.
  • Each network layer may include one or more neurons.
  • the output of the neurons in the previous network layer can be used as the input of the neurons in the next network layer.
  • the model to be quantized can be used to perform one or more tasks such as image recognition, target detection, segmentation tasks, and classification tasks.
  • the neural network can be compressed, thereby reducing the computing resources required to run the neural network while maintaining the output accuracy of the neural network.
  • the method provided by this application can be applied to multiple types of neural networks, and the specific type can be determined according to the actual application scenario, which is not limited by this application.
  • parameters are extracted from the model to be quantified 801 to obtain a parameter set 802.
  • parameters such as the internal parameters of each neuron or the weight of the output of each neuron can be extracted to obtain a parameter set.
  • the characteristic value output by each neuron is called the characteristic parameter (expressed as w)
  • the convolution kernel parameter of each neuron is called the weight parameter (expressed as x).
  • the model 801 to be quantified may include multiple neurons, and each neuron may have internal parameters.
  • each neuron may include one or more of the following operations: average pooling (avg_pool_3x3) with a pooling kernel size of 3×3, max pooling (max_pool_3x3) with a pooling kernel size of 3×3, separable convolution (sep_conv_3x3) with a convolution kernel size of 3×3, separable convolution (sep_conv_5x5) with a convolution kernel size of 5×5, dilated (atrous) convolution (dil_conv_3x3) with a convolution kernel size of 3×3 and a dilation rate of 2, dilated convolution (dil_conv_5x5) with a convolution kernel size of 5×5 and a dilation rate of 2, a skip-connect operation, or a zeroing operation (all neurons at the corresponding positions are set to zero, referred to as Zero), etc. The parameters can be extracted from these operations to obtain the internal parameters of the neurons, that is, the weight parameters; the output of a neuron in the previous layer can be used as the input of a neuron in the next layer.
  • the output of each neuron may be different.
  • the feature values output by each neuron are extracted to obtain the feature parameters.
  • the parameters inside the neuron or the characteristic values of the output of each neuron can form the parameter set 802.
  • the parameters in the parameter set 802 are clustered to obtain multiple classification data 803 .
  • the parameters are divided into multiple types, and a certain number of parameters are intercepted from each type of parameters, thereby reducing the risk of abnormal parameters.
  • this application intercepts a certain number of parameters from each type of parameters, which can reduce the workload of calculating thresholds.
  • feature parameters have a great influence on the features extracted by the model, so outliers in the feature parameters have a great impact on the statistical range of the model's features, which in turn affects the quantization scaling coefficient.
  • Therefore, the absolute values of the feature parameters are first taken, then the absolute values are sorted in ascending order, and finally they are truncated according to a certain proportion or quantity for subsequent quantization.
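  • A sketch of that truncation step (the keep_ratio value is an illustrative assumption, and clipping to the resulting cutoff is one common reading of "truncated according to a certain proportion"):

```python
import numpy as np

def truncate_by_ratio(features, keep_ratio=0.999):
    """Sort |features| in ascending order and clip everything above the keep_ratio
    quantile, so a few outliers no longer inflate the quantization range."""
    mags = np.sort(np.abs(features).ravel())                  # ascending absolute values
    cutoff = mags[int(np.ceil(keep_ratio * len(mags))) - 1]   # magnitude at the kept proportion
    return np.clip(features, -cutoff, cutoff), cutoff

feats = np.concatenate([np.random.randn(10000), [50.0, -80.0]])   # mostly small, two outliers
clipped, cutoff = truncate_by_ratio(feats)
print(cutoff, np.abs(clipped).max())   # the quantization range is now set by cutoff, not the outliers
```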
  • If the feature parameters and the weight parameters use shared quantization parameters, as shown in Figure 10, the range of the activation values (features) and the range of the weights differ.
  • If the range of the weights is used to quantize the activation values, only a small number of the available quantization levels are actually used, resulting in a waste of bits.
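  • A small numeric illustration of this effect (the ranges below are illustrative assumptions; the helper reuses a single shared scale derived from the weight range for the features):

```python
import numpy as np

def int8_levels_used(x, scale):
    """Count how many distinct INT8 levels a tensor actually occupies under a given scale."""
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return len(np.unique(q))

weights = np.random.uniform(-6.0, 6.0, 10000)       # wider dynamic range
features = np.random.uniform(-0.05, 0.05, 10000)    # much narrower dynamic range

weight_scale = np.abs(weights).max() / 127.0        # shared scale taken from the weight range
feature_scale = np.abs(features).max() / 127.0      # scale taken from the feature range itself

print(int8_levels_used(features, weight_scale))     # only a few levels are used: bits are wasted
print(int8_levels_used(features, feature_scale))    # close to the full 255 levels
```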
  • the parameter set is clustered, so that when the parameters are intercepted, mutual truncation between the weight parameters and the feature parameters can be avoided, and the subsequent quantization of the weight parameters and the feature parameters can be effectively decoupled.
  • this application truncates the range of the weight parameters to the range covered by the feature parameters, and integrates the truncated weight parameter values and feature parameter values into the bias; that is, without affecting the accuracy of the model, most of the weight parameters and feature parameters are retained, which improves the expressive ability of the quantized lightweight model.
  • the various classification data 803 are quantized to obtain at least one quantization parameter 804.
  • the structure of the model to be quantified can be shown in Figure 11.
  • the neural network can be divided into multiple layers.
  • the internal weight parameters of each neuron can be expressed as x, and the characteristic parameters can be expressed as w.
  • the lightweight model can be deployed on various devices, such as mobile phones, bracelets and other terminals or servers with low computing power.
  • the model can be quantified to obtain a lightweight model, which can be deployed in various devices and obtain accurate output while reducing the demand for storage, bandwidth, energy and computing resources.
  • the lightweight model is suitable for more scenarios and has stronger generalization ability.
  • the method provided by this application can be deployed in various terminals or servers, especially in devices with limited resources. Through the method provided by this application, a quantized lightweight model can be obtained while the expressive ability of the model is ensured, so that more accurate output results can also be obtained on resource-constrained devices.
  • the method provided by this application can be deployed in a neural network accelerator.
  • the neural network accelerator uses parallel computing in hardware modules to improve the running speed of convolutional networks.
  • This application can significantly reduce hardware resource consumption for additive networks, and then make full use of hardware resources to build a convolution acceleration module with higher parallelism, further improving the acceleration effect.
  • the method provided by this application can be deployed on low-power AI chips.
  • power consumption is a core issue.
  • This application can effectively reduce the operating power consumption of the circuit for the additive convolution kernel, which is beneficial to the deployment of AI chips in resource-constrained terminal devices.
  • Figure 12 compares the accuracy of the solution proposed by this application and the commonly used quantization scheme for additive networks. It can be seen that in 4-bit quantization, the method provided by this application has higher quantization accuracy.
  • the neural network parameter quantification device includes:
  • the acquisition module 1301 is used to obtain the parameters of each neuron in the model to be quantified and obtain a parameter set;
  • the clustering module 1302 is used to cluster parameter sets to obtain a variety of classified data
  • the quantization module 1303 is used to quantize each class of the multiple classes of classification data to obtain at least one quantization parameter; the at least one quantization parameter is used to obtain a compression model, and the precision of the at least one quantization parameter is lower than the precision of the parameters in the model to be quantized.
  • the clustering module 1302 is specifically configured to: cluster the parameter set to obtain at least one cluster of data, and intercept a preset number of parameters from each cluster in the at least one cluster of data to obtain multiple classes of classification data.
  • the parameters in the model to be quantized include parameters of the output features of each neuron or parameter values within each neuron.
  • the model to be quantified includes an additive neural network.
  • the compression model is used to perform at least one of image recognition, classification tasks or target detection.
  • Figure 14 is a schematic structural diagram of another neural network parameter quantification device provided by this application, as described below.
  • the neural network parameter quantification device may include a processor 1401 and a memory 1402.
  • the processor 1401 and the memory 1402 are interconnected through lines.
  • the memory 1402 stores program instructions and data.
  • the memory 1402 stores program instructions and data corresponding to the steps in the aforementioned FIGS. 7-12.
  • the processor 1401 is configured to execute the method steps performed by the neural network parameter quantization device shown in any of the embodiments shown in FIGS. 7 to 12 .
  • the neural network parameter quantification device may also include a transceiver 1403 for receiving or sending data.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a program which, when run on a computer, causes the computer to execute the steps in the methods described in the embodiments shown in Figures 7 to 12.
  • the aforementioned neural network parameter quantization device shown in Figure 14 is a chip.
  • Embodiments of the present application also provide a neural network parameter quantification device.
  • the neural network parameter quantification device can also be called a digital processing chip or chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit obtains program instructions through the communication interface.
  • the program instructions are executed by the processing unit, and the processing unit is configured to execute the method steps executed by the neural network parameter quantization device shown in any of the embodiments in FIGS. 7 to 12.
  • An embodiment of the present application also provides a digital processing chip.
  • the digital processing chip integrates circuits and one or more interfaces for implementing the functions of the above-mentioned processor 1401.
  • When a memory is integrated in the digital processing chip, the digital processing chip can complete the method steps of any one or more of the foregoing embodiments.
  • When the digital processing chip does not have an integrated memory, it can be connected to an external memory through a communication interface.
  • the digital processing chip implements the actions performed by the neural network parameter quantization device in the above embodiment according to the program code stored in the external memory.
  • An embodiment of the present application also provides a computer program product that, when run on a computer, causes the computer to perform the steps performed by the neural network parameter quantification device in the method described in the embodiments shown in FIGS. 7 to 12 .
  • the neural network parameter quantification device can be a chip.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit can be, for example, a processor.
  • the communication unit can be, for example, an input/output interface, a pin, a circuit, or the like.
  • the processing unit can execute computer execution instructions stored in the storage unit, so that the chip in the server executes the neural network parameter quantification method described in the embodiments shown in FIGS. 7-12.
  • the storage unit is a storage unit within the chip, such as a register, cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM), etc.
  • the aforementioned processing unit or processor may be a central processing unit (CPU), a network processor (neural-network processing unit, NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor or any conventional processor, etc.
  • Figure 15 is a structural schematic diagram of a chip provided by an embodiment of the present application.
  • the chip can be represented as a neural network processor NPU 150.
  • the NPU 150 serves as a co-processor mounted on the host CPU (Host CPU), which allocates tasks to it.
  • the core part of the NPU is the arithmetic circuit 1503.
  • the arithmetic circuit 1503 is controlled by the controller 1504 to extract the matrix data in the memory and perform multiplication operations.
  • the computing circuit 1503 includes multiple processing engines (PEs) internally.
  • arithmetic circuit 1503 is a two-dimensional systolic array.
  • the arithmetic circuit 1503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 1503 is a general-purpose matrix processor.
  • For example, assume there is an input matrix A, a weight matrix B and an output matrix C. The arithmetic circuit obtains the data corresponding to matrix B from the weight memory 1502 and caches it on each PE in the arithmetic circuit.
  • the operation circuit takes the data of matrix A from the input memory 1501, performs matrix operations with matrix B, and stores the partial result or final result of the obtained matrix in an accumulator 1508.
  • the unified memory 1506 is used to store input data and output data.
  • the weight data is transferred to the weight memory 1502 through the direct memory access controller (DMAC) 1505.
  • Input data is also transferred to unified memory 1506 via DMAC.
  • Bus interface unit (bus interface unit, BIU) 1510 is used for interaction between the AXI bus and DMAC and instruction fetch buffer (IFB) 1509.
  • the bus interface unit 1510 (BIU) is used for the instruction fetch buffer 1509 to obtain instructions from the external memory, and is also used for the storage unit access controller 1505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1506 or the weight data to the weight memory 1502 or the input data to the input memory 1501 .
  • the vector calculation unit 1507 includes multiple arithmetic processing units, and if necessary, further processes the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as batch normalization, pixel-level summation, upsampling of feature planes, etc.
  • vector calculation unit 1507 can store the processed output vectors to unified memory 1506 .
  • the vector calculation unit 1507 can apply a linear function and/or a nonlinear function to the output of the operation circuit 1503, such as linear interpolation on the feature plane extracted by the convolution layer, or a vector of accumulated values, to generate an activation value.
  • vector calculation unit 1507 generates normalized values, pixel-wise summed values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 1503, such as for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 1509 connected to the controller 1504 is used to store instructions used by the controller 1504;
  • the unified memory 1506, the input memory 1501, the weight memory 1502 and the instruction fetch buffer 1509 are all on-chip memories. The external memory is private to the NPU hardware architecture.
  • the operations of each layer in the recurrent neural network can be performed by the operation circuit 1503 or the vector calculation unit 1507.
  • the processor mentioned in any of the above places may be a general central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control program execution of the methods in Figures 7 to 12.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • the present application can be implemented by software plus the necessary general-purpose hardware; of course, it can also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, etc. In general, all functions performed by a computer program can be easily implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits or special-purpose circuits. However, for this application, a software program implementation is the better implementation in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product.
  • the computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk, and includes a number of instructions to cause a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in the various embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, radio, microwave, etc.) means.
  • the computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a server or data center integrated with one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state disk (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

本申请提供人工智能领域中一种神经网络参数量化方法和装置,用于对神经网络进行量化,降低低比特量化时的精度损失,得到输出更准确的轻量化模型。该方法包括:首先,获取待量化模型中各个神经元的参数,得到参数集合;随后对参数集合中的参数进行聚类,得到多种分类数据;对多种分类数据中的每种分类数据进行量化,得到至少一种量化参数,至少一种量化参数用于得到压缩模型,至少一种量化参数的精度低于待量化模型中的参数的精度。

Description

一种神经网络参数量化方法和装置
本申请要求于2022年05月30日提交中国专利局、申请号为“202210600648.7”、申请名称为“一种神经网络参数量化方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域，尤其涉及一种神经网络参数量化方法和装置。
背景技术
模型压缩技术是构筑轻量级神经网络的常用技术手段。神经网络模型中一般使用FP32(32位浮点数据)进行存储。研究发现,神经网络具有较好的鲁棒性,将大型神经网络的参数通过量化、编码等方式减小精度,其依然可以保有相对良好的性能。常用的低精度数据包括FP16(半精度浮点)、INT16(16位的定点整数)、INT8(8位的定点整数)、INT4(4位的定点整数)、1bit等等数值格式。从网络性能和模型压缩程度两方面综合考虑,将权重参数由32bit浮点型(FP32)转化为8bit定点整形(INT8),是目前最为常用的量化手段。
然而，在进行量化时，尤其针对加法网络，在低比特量化时精度损失较大，因此，如何降低低比特量化时的精度损失，成为亟待解决的问题。
发明内容
本申请提供一种神经网络参数量化方法和装置,用于对神经网络进行量化,降低低比特量化时的精度损失,得到输出更准确的轻量化模型。
有鉴于此,第一方面,本申请提供一种神经网络参数量化方法,包括:首先,获取待量化模型中各个神经元的参数,得到参数集合;随后对参数集合中的参数进行聚类,得到多种分类数据;对多种分类数据中的每种分类数据进行量化,得到至少一种量化参数,至少一种量化参数用于得到压缩模型,至少一种量化参数的精度低于待量化模型中的参数的精度。
因此,本申请实施方式中,对神经网络的参数进行聚类后再进行量化,针对每个分类的参数分别进行量化,从而可以提高模型的表达能力。
在一种可能的实施方式中,前述的对参数集合进行聚类,得到多种分类数据,可以包括:对参数集合进行聚类,得到至少一种聚类数据;从至少一种聚类数据中的每种聚类数据中截取预设数量的参数,得到多种分类数据。
因此,本申请实施方式中,聚类后对每种分类参数进行截断,从而减少每种分类中的离群值,提高后续的模型表达能力。
在一种可能的实施方式中,待量化模型中的参数包括每个神经元的输出的特征中的参数或者每个神经元内的参数值。因此本申请实施方式中,针对神经网络中各个神经元的内部参数以及输出的特征值均进行量化,从而减少量化后模型所占的比特,得到轻量化模型。
在一种可能的实施方式中，待量化模型包括加法神经网络。因此，本申请实施方式中，针对加法网络，若各个神经元共享缩放系数，会降低模型表达能力；若使用乘法卷积的量化方式，则可能由于权重数据和输入特征的缩放系数不相同，导致量化后出现非INT8的数值。因此，通过本申请提供的方式，可以对参数进行聚类，针对每一类参数进行量化，提高压缩后模型的表达能力，避免量化后出现非INT8的数值。
在一种可能的实施方式中,压缩模型用于进行图像识别、分类任务或者目标检测中的至少一种。因此,本申请提供的方法可以适用于多种场景,泛化能力强。
第二方面,本申请提供一种神经网络参数量化装置,包括:
获取模块,用于获取待量化模型中各个神经元的参数,得到参数集合;
聚类模块,用于对参数集合进行聚类,得到多种分类数据;
量化模块,用于对多种分类数据中的每种分类数据进行量化,得到至少一种量化参数,至少一种量化参数用于得到压缩模型,至少一种量化参数的精度低于待量化模型中的参数的精度。
在一种可能的实施方式中,聚类模块,具体用于:对参数集合进行聚类,得到至少一种聚类数据;从至少一种聚类数据中的每种聚类数据中截取预设数量的参数,得到多种分类数据。
在一种可能的实施方式中,待量化模型中的参数包括每个神经元的输出的特征中的参数或者每个神经元内的参数值。
在一种可能的实施方式中,待量化模型包括加法神经网络。
在一种可能的实施方式中,压缩模型用于进行图像识别、分类任务或者目标检测中的至少一种。
第三方面,本申请实施例提供一种神经网络参数量化装置,包括:处理器和存储器,其中,处理器和存储器通过线路互联,处理器调用存储器中的程序代码用于执行上述第一方面任一项所示的神经网络参数量化方法中与处理相关的功能。可选地,该神经网络参数量化装置可以是芯片。
第四方面,本申请实施例提供了一种神经网络参数量化装置,该神经网络参数量化装置也可以称为数字处理芯片或者芯片,芯片包括处理单元和通信接口,处理单元通过通信接口获取程序指令,程序指令被处理单元执行,处理单元用于执行如上述第一方面或第一方面任一可选实施方式中与处理相关的功能。
第五方面,本申请实施例提供了一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行上述第一方面或第一方面任一可选实施方式中的方法。
第六方面,本申请实施例提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面或第一方面任一可选实施方式中的方法。
附图说明
图1为本申请应用的一种人工智能主体框架示意图;
图2为本申请应用的一种卷积核的结构示意图;
图3为本申请提供的一种卷积神经网络的结构示意图;
图4为本申请提供的一种系统架构示意图;
图5为本申请提供的另一种系统架构示意图;
图6为本申请提供的一种量化方式示意图;
图7为本申请提供的一种神经网络参数量化方法的流程示意图;
图8为本申请提供的另一种神经网络参数量化方法的流程示意图;
图9为本申请提供的一种参数截断方式示意图;
图10为本申请提供的另一种参数截断方式示意图;
图11为本申请提供的一种待量化模型的结构示意图;
图12为本申请所提出的方案与常用的针对加法网络的量化方案的精度对比图;
图13为本申请提供的一种神经网络参数量化装置的结构示意图;
图14为本申请提供的另一种神经网络参数量化装置的结构示意图;
图15为本申请提供的一种芯片的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
首先对人工智能系统总体工作流程进行描述，请参见图1，图1示出的为人工智能主体框架的一种结构示意图，下面从“智能信息链”（水平轴）和“IT价值链”（垂直轴）两个维度对上述人工智能主体框架进行阐述。其中，“智能信息链”反映从数据的获取到处理的一系列过程。举例来说，可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中，数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人工智能的底层基础设施、信息（提供和处理技术实现）到系统的产业生态过程，反映人工智能为信息技术产业带来的价值。
(1)基础设施
基础设施为人工智能系统提供计算能力支持，实现与外部世界的沟通，并通过基础平台实现支撑。通过传感器与外部沟通；计算能力由智能芯片（如中央处理器（central processing unit，CPU）、网络处理器（neural-network processing unit，NPU）、图形处理器（graphics processing unit，GPU）、专用集成电路（application specific integrated circuit，ASIC）或现场可编程逻辑门阵列（field programmable gate array，FPGA）等硬件加速芯片）提供；基础平台包括分布式计算框架及网络等相关的平台保障和支持，可以包括云存储和计算、互联互通网络等。举例来说，传感器和外部沟通获取数据，这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、 液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能交通、智能医疗、自动驾驶、智慧城市等。
本申请实施例涉及了大量神经网络的相关应用,为了更好地理解本申请实施例的方案,下面先对本申请实施例可能涉及的神经网络的相关术语和概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的，神经单元可以是指以$x_s$和截距1为输入的运算单元，该运算单元的输出可以如下述公式所示：$h_{W,b}(x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$
其中，s=1、2、……、n，n为大于1的自然数，$W_s$为$x_s$的权重，$b$为神经单元的偏置。$f$为神经单元的激活函数（activation functions），用于将非线性特性引入神经网络中，来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入，激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络，即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连，来提取局部接受域的特征，局部接受域可以是由若干个神经单元组成的区域。
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有多层中间层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,中间层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是中间层,或者称为隐层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。
虽然DNN看起来很复杂，其每一层可以表示为线性关系表达式：$\vec{y}=\alpha(W\vec{x}+\vec{b})$，其中，$\vec{x}$是输入向量，$\vec{y}$是输出向量，$\vec{b}$是偏移向量或者称为偏置参数，$W$是权重矩阵（也称系数），$\alpha()$是激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。由于DNN层数多，系数$W$和偏移向量$\vec{b}$的数量也比较多。这些参数在DNN中的定义如下所述：以系数$w$为例：假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$w_{24}^{3}$，上标3代表系数$W$所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。
综上，第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$w_{jk}^{L}$。
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的中间层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
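作为上述线性关系表达式的一个示意性片段（仅为便于理解的示例代码，并非本申请的实现方式，其中的维度与激活函数均为假设），下述Python代码给出了单个全连接层的前向计算 $\vec{y}=\alpha(W\vec{x}+\vec{b})$：

```python
import numpy as np

def dense_forward(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """单层前向计算：y = alpha(W x + b)，此处激活函数 alpha 取 sigmoid（仅为示例）。"""
    z = W @ x + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(4).astype(np.float32)      # 输入向量
W = np.random.randn(3, 4).astype(np.float32)   # 权重矩阵（系数）
b = np.zeros(3, dtype=np.float32)              # 偏置向量
y = dense_forward(x, W, b)                     # 输出向量
```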
(3)卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。CNN具有局部感知、权值共享等优异特性,能大幅减少权重参数,同时大幅提升网络性能。其已在计算机视觉、图像分析等领域取得了众多突破性成果,并成为人工智能和深度学习的核心技术。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)循环神经网络(recurrent neural networks,RNN),也称为递归神经网络,是用来处理序列数据的。在传统的神经网络模型中,是从输入层到中间层再到输出层,层与层之间是全连接的,而对于每一层层内之间的各个节点是无连接的。这种普通的神经网络虽然解决了很多难题,但是却仍然对很多问题无能无力。例如,你要预测句子的下一个单词是什么,一般需要用到前面的单词,因为一个句子中前后单词并不是独立的。RNN之所以称为循环神经网路,即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即中间层本层之间的节点不再无连接而是有连接的,并且中间层的输入不仅包括输入层的输出还包括上一时刻中间层的输出。理论上,RNN能够对任何长度的序列数据进行处理。对于RNN的训练和对传统的CNN或DNN的训练一样。
(5)残差神经网络(ResNet)
残差神经网络是为了解决深度神经网络的隐藏层过多时产生的退化（degradation）问题而提出的。退化问题是指：当网络隐藏层变多时，网络的准确度达到饱和然后急剧退化，而且这个退化不是由于过拟合引起的，而是在进行反向传播时，传播到底层时各个梯度相关性不大，梯度更新不充分，从而使最终得到的模型的预测标签的准确度降低。当神经网络退化时，浅层网络能够达到比深层网络更好的训练效果，这时如果把低层的特征传到高层，那么效果应该至少不比浅层的网络效果差，因此可以通过一条恒等映射（Identity Mapping）来达到此效果。这条恒等映射称为残差连接（shortcut），优化这种残差映射要比优化原始的映射容易。
(6)乘法神经网络
针对前述的神经网络,如CNN、DNN、RNN或者ResNet等,可以采用乘法卷积方案,其核心在于采用卷积乘法运算提取滤波器和输入图像的相似度。
通常，乘法卷积核可以表示为：$Y(m,n,t)=\sum_{i}\sum_{j}\sum_{k}S\left[X(m+i,n+j,k),F(i,j,k,t)\right]$
其中S(x,y)代表x和y的相似度,X代表输入图像,F代表卷积计算的滤波器,i和j代表一个卷积核的横向坐标和纵向坐标,k代表输入通道,t代表输出通道。
(7)加法神经网络
与前述乘法神经网络的区别在于,加法神经网络可以利用输入图像与特征提取器的减法运算提取两者的相似度。
例如，加法卷积核可以表示为：$Y(m,n,t)=-\sum_{i}\sum_{j}\sum_{k}\left|X(m+i,n+j,k)-F(i,j,k,t)\right|$。
为便于理解，对乘法卷积核和加法卷积核的区别进行示例性介绍。示例性地，乘法卷积核和加法卷积核的区别可以如图2所示，以其中一个卷积核为例，卷积核可以对输入的矩阵进行卷积操作，得到卷积后的输出。当卷积核为乘法卷积核时，该乘法卷积核的卷积操作为乘法操作，当卷积核为加法卷积核时，其卷积操作为加法操作，或者称为减法操作。如图2中所示，当前参与运算的两个3×3矩阵分别为$\begin{pmatrix}-1&0&1\\-1&0&1\\-1&0&1\end{pmatrix}$与$\begin{pmatrix}1&0&2\\5&4&2\\3&4&5\end{pmatrix}$，其中一个为输入矩阵中当前待操作的子矩阵，另一个为卷积核对应的矩阵。
当卷积核为乘法卷积核时,其操作包括对输入矩阵和卷积核中的各个元素进行乘法操作,如表示为:
(-1)*1+0*0+1*2
+(-1)*5+0*4+1*2
+(-1)*3+0*4+1*5
=0
当卷积核为加法卷积核时,其操作包括对输入矩阵和卷积核中的各个元素进行加法操作,如表示为:
-|(-1)-1|-|0-0|-|1-2|
-|(-1)-5|-|0-4|-|1-2|
-|(-1)-3|-|0-4|-|1-5|
=-26
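为便于理解上述逐项计算，下述Python片段（仅为示例代码，并非本申请的实现；两个3×3矩阵的取值来自上文示例，其中哪一个为输入子矩阵、哪一个为卷积核不影响计算结果）复现了乘法卷积核与加法卷积核在该窗口上的输出：

```python
import numpy as np

a = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]], dtype=np.float32)
b = np.array([[1, 0, 2],
              [5, 4, 2],
              [3, 4, 5]], dtype=np.float32)

mul_out = np.sum(a * b)            # 乘法卷积核：逐元素相乘后求和，结果为 0
add_out = -np.sum(np.abs(a - b))   # 加法卷积核：逐元素相减取绝对值求和后取负，结果为 -26
print(mul_out, add_out)
```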
(8)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断地调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(9)反向传播算法
神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的神经网络模型中参数的大小,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的参数,例如权重矩阵。
(10)模型量化:是一种由高比特转换为低比特的模型压缩方式。例如,将常规32位浮点运算转换为低bit整型运算的模型压缩技术,即可称为模型量化。如当低bit量化为8bit时,可以称之为int8量化,即原来表示一个权重需要float32表示,量化后只需要用int8表示,理论上能够获得4倍的网络加速,同时8位相较于32位能够减少4倍存储空间,减少了存储空间和运算时间,从而达到了压缩模型和加速的目的。
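为便于理解上述由高比特到低比特的转换，下面给出一个对称INT8量化/反量化的示意性片段（仅为说明性示例，并非本申请的具体量化方式）：

```python
import numpy as np

def int8_quantize(x: np.ndarray):
    """对称INT8量化：按最大绝对值计算缩放系数，将FP32数值映射到[-127, 127]。"""
    scale = 127.0 / max(float(np.max(np.abs(x))), 1e-8)
    q = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """反量化：将INT8数值还原为近似的FP32数值。"""
    return q.astype(np.float32) / scale

w = np.random.randn(64).astype(np.float32)   # 一组假设的FP32权重
q, s = int8_quantize(w)
w_hat = int8_dequantize(q, s)                # 存储由32bit降为8bit，约节省4倍空间
```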
通常,CNN是一种常用的神经网络,本申请以下提及的神经网络即可包括设置了加法卷积核或者乘法卷积核的卷积神经网络。为便于理解,下面示例性地,对卷积神经网络的结构进行介绍。
下面结合图3示例性地对CNN的结构进行详细的介绍。如上文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图3所示,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及神经网络层230。在本申请以下实施方式中,为便于理解,将每一层称为一个stage。下面对这些层的相关内容做详细介绍。
卷积层/池化层220:
卷积层:
如图3所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现方式中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层/池化层220:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,池化层也可以称为下采样层。在如图3中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处 理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
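作为池化操作的一个示意性片段（仅为便于理解的示例，并非本申请实现），下述代码对一个特征图按2×2不重叠窗口分别进行最大池化与平均池化：

```python
import numpy as np

def pool2d(x: np.ndarray, k: int = 2, mode: str = "max") -> np.ndarray:
    """对二维特征图按 k×k 不重叠窗口做池化，mode 取 "max" 或 "avg"。"""
    h, w = x.shape
    x = x[:h // k * k, :w // k * k].reshape(h // k, k, w // k, k)
    return x.max(axis=(1, 3)) if mode == "max" else x.mean(axis=(1, 3))

feat = np.arange(16, dtype=np.float32).reshape(4, 4)
print(pool2d(feat, 2, "max"))   # 每个2×2子区域的最大值
print(pool2d(feat, 2, "avg"))   # 每个2×2子区域的平均值
```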
神经网络层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用神经网络层230来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层230中可以包括多层中间层(如图3所示的231、232至23n)以及输出层240,该输出层也可以称为全连接(fully connected,FC)层,该多层中间层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。
在神经网络层230中的多层中间层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图3由210至240方向的传播为前向传播)完成,反向传播(如图3由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图3所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在。
本申请中,可以采用图3所示的卷积神经网络200对待处理图像进行处理,得到待处理图像的分类结果。如图3所示,待处理图像经过输入层210、卷积层/池化层220以及神经网络层230的处理后输出待处理图像的分类结果。
本申请实施例提供的神经网络参数量化方法可以在服务器上被执行,还可以在终端设备上被执行。其中该终端设备可以是具有图像处理功能的移动电话、平板个人电脑(tablet personal computer,TPC)、媒体播放器、智能电视、笔记本电脑(laptop computer,LC)、个人数字助理(personal digital assistant,PDA)、个人计算机(personal computer,PC)、照相机、摄像机、智能手表、可穿戴式设备(wearable device,WD)或者自动驾驶的车辆等,本申请实施例对此不作限定。
如图4所示,本申请实施例提供了一种系统架构100。在图4中,数据采集设备160用于采集训练数据。在一些可选的实现中,针对于图像分类方法来说,训练数据可以包括训练图像以及训练图像对应的分类结果,其中,训练图像的分类结果可以是人工预先标注的结果。
在采集到训练数据之后,数据采集设备160将这些训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。可选地,在本申请以下实施方式中所提及的训练集,可以是从该数据库130中得到,也可以是通过用户的输入数据得到。
其中,目标模型/规则101可以为本申请实施例以下提及的神经网络。
下面对训练设备120基于训练数据得到目标模型/规则101进行描述,训练设备120对输入的原始图像进行处理,将输出的图像与原始图像进行对比,直到训练设备120输出的图像与原始图像的差值小于一定的阈值,从而完成目标模型/规则101的训练。
上述目标模型/规则101能够用于实现本申请实施例的神经网络参数量化方法得到的第一神经网络,即,将待处理数据(如图像)通过相关预处理后输入该目标模型/规则101,即可得到处理结果。本申请实施例中的目标模型/规则101具体可以为本申请以下所提及的第一神经网络,该第一神经网络可以是前述的CNN、DNN或者RNN等类型的神经网络。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图5所示的执行设备110,该执行设备110也可以称为计算设备,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR),车载终端等,还可以是服务器或者云端设备等。在图5中,执行设备110配置输入/输出(input/output,I/O)接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:客户设备输入的待处理数据。
预处理模块113和预处理模块114用于根据I/O接口112接收到的输入数据(如待处理数据)进行预处理,在本申请实施例中,也可以没有预处理模块113和预处理模块114(也可以只有其中的一个预处理模块),而直接采用计算模块111对输入数据进行处理。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后，I/O接口112将处理结果返回给客户设备140，从而提供给用户。例如，若第一神经网络用于进行图像分类，处理结果为分类结果，则I/O接口112将上述得到的分类结果返回给客户设备140，从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。在一些场景中,执行设备110和训练设备120可以是相同的设备,或者位于相同的计算设备内部,为便于理解,本申请将执行设备和训练设备分别进行介绍,并不作为限定。
在图4所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端, 采集如图所示输入I/O接口112的输入数据及输出I/O接口112的预测标签作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的预测标签,作为新的样本数据存入数据库130。
值得注意的是,图4仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图4中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
如图4所示，根据训练设备120训练得到目标模型/规则101，该目标模型/规则101在本申请实施例中可以是本申请以下所提及的神经网络，具体的，本申请实施例提供的神经网络可以是CNN、深度卷积神经网络（deep convolutional neural networks，DCNN）、循环神经网络（recurrent neural network，RNN）等等。
参见附图5,本申请实施例还提供了一种系统架构400。执行设备110由一个或多个服务器实现,可选的,与其它计算设备配合,例如:数据存储、路由器、负载均衡器等设备;执行设备110可以布置在一个物理站点上,或者分布在多个物理站点上。执行设备110可以使用数据存储系统150中的数据,或者调用数据存储系统150中的程序代码实现本申请以下提及的神经网络参数量化方法的步骤。
用户可以操作各自的用户设备(例如本地设备401和本地设备402)与执行设备110进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与执行设备110进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。具体地,该通信网络可以包括无线网络、有线网络或者无线网络与有线网络的组合等。该无线网络包括但不限于:第五代移动通信技术(5th-Generation,5G)系统,长期演进(long term evolution,LTE)系统、全球移动通信系统(global system for mobile communication,GSM)或码分多址(code division multiple access,CDMA)网络、宽带码分多址(wideband code division multiple access,WCDMA)网络、无线保真(wireless fidelity,WiFi)、蓝牙(bluetooth)、紫蜂协议(Zigbee)、射频识别技术(radio frequency identification,RFID)、远程(Long Range,Lora)无线通信、近距离无线通信(near field communication,NFC)中的任意一种或多种的组合。该有线网络可以包括光纤通信网络或同轴电缆组成的网络等。
在另一种实现中,执行设备110的一个方面或多个方面可以由每个本地设备实现,例如,本地设备401可以为执行设备110提供本地数据或反馈计算结果。
需要注意的,执行设备110的所有功能也可以由本地设备实现。例如,本地设备401实现执行设备110的功能并为自己的用户提供服务,或者为本地设备402的用户提供服务。
随着神经网络模型的性能越来越强,其参数也越来越多。网络运行时对存储、计算、 带宽、能量的需求和消耗越来越大,不利于人工智能算法在资源受限的硬件终端设备中进行部署。将神经网络通过剪枝、压缩等技术手段降低存储和计算需求,已成为神经网络算法在实际终端落地中的重要一环。
模型压缩是构筑轻量级神经网络的常用手段。神经网络中通常使用FP32(32位浮点数据)进行存储。通常,神经网络具有较好的鲁棒性,将大型神经网络的参数通过量化、编码等方式减小精度,其依然可以保有相对良好的性能。常用的低精度数据包括FP16(半精度浮点)、INT16(16位的定点整数)、INT8(8位的定点整数)、INT4(4位的定点整数)、1bit等等数值格式。从网络性能和模型压缩程度两方面综合考虑,将权重参数由32bit浮点型(FP32)转化为8bit定点整形(INT8),是常用的量化手段。
又例如，采用轻量级的卷积方案是构筑轻量级神经网络的另一种技术手段。常用的卷积神经网络大都采用乘法卷积方案，其核心在于采用卷积乘法运算提取滤波器和输入图像的相似度。针对加法神经网络，例如采用共享型（share）的加法网络量化技术，权重和输入图像在相减后采用一个量化参数s来进行INT8量化，在INT8量化时取得了不错的效果。
例如，如图6所示，在常用的CNN量化技术中，权重和输入图像分别采取INT8量化，该方法将对应两个缩放系数s1、s2，其硬件需要2个FP32乘法器，1个INT8乘法器。然而，针对上述常用的量化方法，乘法网络量化时每层采用非共享的scale对权重数据和输入图像特征数据分别进行INT8量化，但当这种分立式的量化方式应用于加法网络时，所需的乘法运算会增加计算能耗。此外，由于权重数据和输入特征图像数据的缩放系数不一定相等，最终结果可能会出现非INT8型的数值计算。
此外,只使用一个共享参数来量化加法网络的卷积核参数和特征,不可避免地会造成其表达能力的下降,且仅在8bit量化中验证了其基本无损的效果。在低比特(<8bit)量化中,量化后的加法网络精度损失较大。另一方面,随着bit数的减少,基础操作的能耗和所需芯片面积都会有不同程度的减少。
因此,本申请提供一种神经网络参数量化方法,可以应用于各种神经网络的量化场景中,实现了高效的低比特量化。
下面基于前述的介绍,对本申请提供的神经网络参数量化方法进行详细介绍。
参阅图7,本申请提供的一种神经网络参数量化方法的流程示意图,如下所述。
701、获取待量化模型中各个神经元的参数,得到参数集合。
其中,该待量化模型可以包括加法网络或者乘法网络等,该待量化模型可以包括多个神经元,可以读取每个神经元的参数,得到待量化模型的参数集合。
该各个神经元的参数可以包括神经元内的参数值,也可以包括各个中间层的神经元的输出所占的权重。因此,在进行后续量化时,可以对神经网络中的大部分参数进行量化,从而得到轻量化模型。
702、对参数集合进行聚类,得到多种分类数据。
在得到参数集合之后,即可对参数集合中的参数进行聚类,得到一种或者多种分类数据,可以理解为通过聚类将参数集合中的参数分为一类或者多类。
具体的聚类方式可以包括K-Means聚类、均值漂移聚类、基于密度的聚类方法 (density-Based spatial clustering of applications with noise,DBSCAN)、基于高斯混合模型的最大期望聚类等,具体可以根据实际应用场景选择匹配的聚类方式,本申请对此不作限定。
可选地,得到多种分类数据的具体过程可以包括:对参数集合中的参数进行聚类,得到一种或者多种聚类数据,然后从该一种或者多种聚类数据中截取预设数量的参数,得到前述的多种分类数据。因此,本申请实施方式中,无需划定阈值,通过截取一定数量的方式,来减少针对阈值的计算量,提高本申请提供的方法的部署泛化性。
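作为步骤702的一个示意性片段（仅为便于理解的示例，并非本申请的具体实现；其中使用了sklearn的K-Means，聚类数与截取数量均为假设取值），下述代码先对参数聚类，再从每种聚类数据中截取预设数量的参数：

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_truncate(params: np.ndarray, n_clusters: int = 4, keep: int = 1000):
    """对参数集合做K-Means聚类，并从每一类中按绝对值升序截取预设数量的参数。"""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(params.reshape(-1, 1))
    groups = []
    for c in range(n_clusters):
        g = params[labels == c]
        g = g[np.argsort(np.abs(g))][:keep]   # 截取预设数量，减少每类中的离群值
        groups.append(g)
    return groups
```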
703、对多种分类数据中的每种数据进行量化,得到至少一种量化参数。
在进行聚类得到多种分类数据之后,即可对每种分类数据进行量化,即降低每种分类数据中的参数所占的比特,得到至少一种量化参数。该至少一种量化参数用于得到压缩模型,例如,若待量化模型中的参数的数据类型为FP16,可以将其数据类型转换为INT8,从而降低参数所占的比特,实现对参数的低比特量化,从而实现模型压缩,得到轻量化模型。
通常,参数集合中的参数可以分为神经元自身的参数值,如称为权重参数,以及神经元的输出特征值,如称为特征参数,该特征参数可以包括将输入图像输入至待量化模型后待量化模型中各个神经元输出的特征值,该输入图像可以是预先设定的图像,也可以是随机选取的图像。特征参数与权重参数通常互相影响,但通常这两种参数的范围大小可能不相同,若仅使用一种量化参数进行量化,则可能导致量化部分的参数被截断或者造成比特位的浪费。例如,如果使用激活值的范围来量化权重,大部分的权重会被截断,会极大损害量化模型的精度;如果使用权重的范围来量化激活值,只有少部分的bit位可以利用,造成bit位的浪费。本申请可以将权重的范围截断到特征的范围内,在不降低模型精度的情况下的尽可能有效利用比特位,避免比特位浪费。
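下述片段示意了“将权重范围截断到特征范围内”的思路（仅为说明性示例，截断方式与阈值选取均为假设）：

```python
import numpy as np

def clip_weights_to_feature_range(w: np.ndarray, x_feat: np.ndarray) -> np.ndarray:
    """将权重截断到特征（激活值）覆盖的范围内，避免以权重范围量化时浪费bit位。"""
    t = float(np.max(np.abs(x_feat)))   # 特征参数的数值范围
    return np.clip(w, -t, t)            # 超出该范围的少量权重被截断
```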
因此,本申请实施方式中,对待量化模型中的参数进行了聚类,将参数分为多类后再进行量化,从而可以实现分类量化,提高量化后得到的模型的表达能力,尤其针对加法网络,相对于使用共享参数来进行量化,本申请提供的聚类后量化,可以显著提高模型表达能力,得到输出精度更高的轻量化模型。并且,本申请提供的方法,对参数进行聚类后量化,仅需增加较少的工作量,即可得到输出更准确的轻量化模型,可以适用更多需要部署轻量化模型的场景。
前述对本申请提供的方法流程进行了介绍,为便于理解,下面结合具体的应用场景,对本申请提供的神经网络参数量化方法的流程进行更详细介绍。
参阅图8,本申请提供的另一种神经网络参数量化方法的流程示意图,如下所述。
首先,获取待量化模型801,该待量化模型801可以包括乘法网络或者加法网络等。
本申请提供的方法可以部署于服务器,也可以部署于终端。例如,本申请提供的方法可以部署于服务器,服务器对待量化模型进行量化后,得到的轻量化模型可以部署于终端,从而在终端中可以运行轻量化的模型,提高终端的运行效率。
该待量化模型具体可以是CNN、ANN或者RNN等，该待量化模型可以包括多层网络层，如可以分为输入层、中间层和输出层，每层网络层中可以包括一个或者多个神经元，通常，上一层网络层中的神经元的输出可以作为下一层网络层中神经元的输入。具体例如，该待量化模型可以用于进行图像识别、目标检测、分割任务、分类任务等一种或者多种任务。通常，为了降低运行神经网络所需的计算资源，可以对神经网络进行压缩，在保持神经网络的输出精度的情况下，降低运行神经网络所需的计算资源。
本申请提供的方法可以适用于多种类型的神经网络,其具体类型可以根据实际应用场景确定,本申请对此并不作限定。
然后,从待量化模型801中提取参数,得到参数集合802。
具体可以提取各个神经元内部的参数或者各个神经元的输出所占权重等参数,得到参数集合。以下为便于区分,将各个神经元输出的特征值称为特征参数(表示为w),将各个神经元的自身的卷积核参数称为权重参数(表示为x)。
例如,待量化模型801可以包括多个神经元,每个神经元内部具有参数,如每个神经元可以包括以下一项或多项:池化核大小为3×3的均值池化(avg_pool_3x3)、池化核大小为3×3的最大值池化(max_pool_3x3)、卷积核大小为3×3的分离卷积(sep_conv_3x3)、卷积核大小为5×5的分离卷积(sep_conv_5x5)、卷积核大小为3×3且空洞率为2的空洞卷积(dil_conv_3x3)、卷积核大小为5×5且空洞率为2的空洞卷积(dil_conv_5x5)、skip-connect操作或置零操作(相应位置所有神经元置零,简称Zero)等,即可从这些运算方式中提取参数,得到神经元内部的参数,即权重参数;上一层神经元的输出可以作为下一层神经元的输入,每个神经元的输出可能不相同,在输入图像输入至待量化模型后,提取各个神经元输出的特征值,即可得到特征参数。神经元内部的参数或者各个神经元的输出的特征值即可组成参数集合802。
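作为参数提取的一个示意性片段（仅为便于理解的示例；此处假设待量化模型为PyTorch风格的模块，通过forward hook采集各神经元输出的特征值）：

```python
import torch

def collect_params(model: torch.nn.Module, sample: torch.Tensor):
    """收集权重参数（各神经元内部的参数）与特征参数（各层输出的特征值），组成参数集合。"""
    weight_params = [p.detach().flatten() for p in model.parameters()]

    feature_params = []
    hooks = []
    for m in model.modules():
        if len(list(m.children())) == 0:   # 仅在叶子模块上挂hook
            hooks.append(m.register_forward_hook(
                lambda mod, inp, out: feature_params.append(out.detach().flatten())))
    with torch.no_grad():
        model(sample)                      # 输入一张图像，触发特征值采集
    for h in hooks:
        h.remove()
    return weight_params, feature_params
```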
随后,对参数集合802中的参数进行聚类,得到多种分类数据803。
如通过K-Means聚类、均值漂移聚类或者DBSCAN等聚类方式进行聚类,将参数分为多种类型,并从每种类型的参数中截取一定数量的参数,从而减少对异常参数的量化,提高模型的表达能力。例如,可以将权重参数和特征参数分别进行分类,从而避免特征参数和权重使用相同的量化参数。并且,相对于选取阈值来截断参数的方式,本申请从每种类型的参数中截取一定数量的参数,可以减少计算阈值的工作量。
通常,特征参数对于模型提取特征的影响很大,因此特征参数的离群值对于模型统计的特征的范围影响很大,进而对量化的缩放系数造成影响。例如,如图9所示,先对特征参数取绝对值,接着对绝对值按照升序进行排序,最后按照一定比例或者数量进行截断,进行后续的量化。
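如图9所述的截断过程可以用下述示意性片段表示（仅为示例，保留比例为假设取值）：

```python
import numpy as np

def truncate_outliers(features: np.ndarray, keep_ratio: float = 0.99) -> np.ndarray:
    """对特征参数取绝对值并按升序排序，按比例截断尾部的离群值。"""
    order = np.argsort(np.abs(features))        # 绝对值升序排序
    keep = int(len(features) * keep_ratio)      # 按比例保留，截断剩余的离群值
    return features[order[:keep]]
```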
此外,针对加法网络,通常特征参数和权重参数使用了共享的量化参数,如图10所示,通常特征参数和权重参数的范围存在较大差异,因此如果使用激活值(特征)的范围来量化权重,大部分的权重会被截断,会极大损害量化模型的精度;如果使用权重的范围来量化激活值,只有少部分的bit位可以利用,造成bit位的浪费。而本申请提供的方法中,对参数集合进行了聚类,从而在截取参数时,即可避免权重参数和特征参数之间的截断影响,使后续对权重参数和特征参数的量化有效解耦。
可以理解为，本申请将权重参数的范围截断至特征参数覆盖的范围内，将截断后的权重参数值和特征参数值融入偏置（bias）上，即在不影响模型精度的前提下，保留大部分权重参数和特征参数，提高量化后的轻量化模型的表达能力。
随后,对多种分类数据803进行量化,得到至少一种量化参数804。
例如，待量化模型的结构可以如图11所示，神经网络可以分为多层，每个神经元内部的权重参数可以表示为x，特征参数表示为w。
其中，量化方式可以采用多种，如以采用对称型量化方法为例：首先，在每种分类数据中所有权重参数中找出绝对值最大项，记为$\max(|X_f|)$；其次，确定欲量化的比特数$N$和其数值表示范围$[-2^{N-1},2^{N-1}-1]$，以INT8为例，其所能代表的数值范围为[-128,127]；再次，确定权重参数的缩放系数，如表示为：$scale=(2^{N-1}-1)/\max(|X_f|)$；最后，将该组分类数据所有参数乘以该系数后取其近似整数，这样即完成该层参数的量化。
由于$scale=(2^{N-1}-1)/\max(|v|)$，$v=x$或$w$，即scale与$\max(|w|)$相关，因此本申请首先提取$\max(|w|)$作为卷积核的特征参数，接着使用k-means聚类的方式对卷积核参数进行聚类。本申请中多个共享scale的量化方案所引入的额外计算量非常少，因此不会带来功耗的提升。
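结合上述对称型量化步骤，下述示意性片段（仅为便于理解的示例，并非本申请的具体实现；分组数、比特数均为假设取值）展示了以各卷积核的max(|w|)作为特征进行k-means聚类、同组共享scale并分别量化的过程：

```python
import numpy as np
from sklearn.cluster import KMeans

def grouped_symmetric_quantize(kernels, n_bits: int = 4, n_groups: int = 4):
    """以每个卷积核的max(|w|)为特征聚类，同组卷积核共享一个scale做对称量化。"""
    qmax = 2 ** (n_bits - 1) - 1
    feats = np.array([np.max(np.abs(k)) for k in kernels]).reshape(-1, 1)
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(feats)

    scales = np.zeros(n_groups)
    for g in range(n_groups):
        scales[g] = qmax / max(float(feats[labels == g].max()), 1e-8)   # scale=(2^(N-1)-1)/max(|w|)

    quantized = [np.clip(np.round(k * scales[labels[i]]), -qmax - 1, qmax).astype(np.int8)
                 for i, k in enumerate(kernels)]
    return quantized, scales, labels

kernels = [np.random.randn(3, 3).astype(np.float32) * s for s in (0.1, 0.5, 1.0, 2.0)]
q, scales, labels = grouped_symmetric_quantize(kernels, n_bits=4, n_groups=2)
```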
随后,得到的该至少一种量化参数用于组成轻量化模型805。轻量化模型可以部署于各种设备中,如部署于手机、手环等算力较低的终端或者服务器等。
通常,随着神经网络的性能越强,神经网络的规模越大,参数也就越多,而对存储、带宽、能量以及计算资源的需求和消耗也就越大。通过本申请提供的神经网络量化方法,可以对模型进行量化,得到轻量化模型,从而可以部署于各种设备中,在降低对存储、带宽、能量以及计算资源的需求的情况下,得到输出准确的轻量化模型,适用于更多场景,泛化能力更强。
本申请提供的方法可以部署于各种终端或者服务器中,尤其在资源受限的设备中,通过本申请提供的方法可以在保障模型的表达能力的情况下,得到量化后的轻量化模型,从而在资源受限的设备中也可以得到更准确的输出结果。
例如,本申请提供的方法可以部署于神经网络加速器中,神经网络加速器即在硬件模块中通过采用并行计算等方式提升卷积网络的运行速度。本申请针对加法网络可以大幅降低硬件资源消耗,继而能更充分利用硬件资源构建更高并行度的卷积加速模块,进一步提升加速效果。
又例如,本申请提供的方法可以部署于低功耗AI芯片,在人工神经网络芯片的硬件终端部署上,功耗问题是核心问题。本申请针对加法卷积核可以有效降低电路的运行功耗,利于AI芯片在资源受限的终端设备进行部署。
为便于理解,在一些常用的数据集中部署本申请提供的方法的效果可以如表1所示。
表1
由表1可知,在较高bit量化时(如8、6、5),本申请所提出的分组共享scale的量化方案在仅使用后量化(PTQ)时,量化模型精度损失很小。在4bit量化时,通过训练量化(QAT),量化模型精度的损失也很小。
此外,在ImageNet中也进行了测试,如表2所示。
表2
由表2可知,在较高bit量化时(8、6、5),本发明所提出的分组共享scale的量化方案在仅使用后量化(PTQ)时,量化模型精度损失很小。在4bit量化时,通过训练量化(QAT),量化模型精度的损失也很小。
图12对比了本申请所提出的方案与常用的针对加法网络的量化方案的精度对比,可知,在4bit量化时,本申请提供的方法的量化精度更高。
前述对本申请提供的方法的流程进行了详细介绍,下面对执行前述方法的装置进行介绍。
参阅图13,本申请提供的一种神经网络参数量化装置的结构示意图,该神经网络参数量化装置包括:
获取模块1301,用于获取待量化模型中各个神经元的参数,得到参数集合;
聚类模块1302,用于对参数集合进行聚类,得到多种分类数据;
量化模块1303,用于对多种分类数据中的每种分类数据进行量化,得到至少一种量化参数,至少一种量化参数用于得到压缩模型,至少一种量化参数的精度低于待量化模型中的参数的精度。
在一种可能的实施方式中,聚类模块1302,具体用于:对参数集合进行聚类,得到至少一种聚类数据;从至少一种聚类数据中的每种聚类数据中截取预设数量的参数,得到多种分类数据。
在一种可能的实施方式中,待量化模型中的参数包括每个神经元的输出的特征中的参 数或者每个神经元内的参数值。
在一种可能的实施方式中,待量化模型包括加法神经网络。
在一种可能的实施方式中,压缩模型用于进行图像识别、分类任务或者目标检测中的至少一种。
请参阅图14,本申请提供的另一种神经网络参数量化装置的结构示意图,如下所述。
该神经网络参数量化装置可以包括处理器1401和存储器1402。该处理器1401和存储器1402通过线路互联。其中,存储器1402中存储有程序指令和数据。
存储器1402中存储了前述图7-图12中的步骤对应的程序指令以及数据。
处理器1401用于执行前述图7-图12中任一实施例所示的神经网络参数量化装置执行的方法步骤。
可选地,该神经网络参数量化装置还可以包括收发器1403,用于接收或者发送数据。
本申请实施例中还提供一种计算机可读存储介质，该计算机可读存储介质中存储有用于生成车辆行驶速度的程序，当其在计算机上运行时，使得计算机执行如前述图7-图12所示实施例描述的方法中的步骤。
可选地,前述的图14中所示的神经网络参数量化装置为芯片。
本申请实施例还提供了一种神经网络参数量化装置,该神经网络参数量化装置也可以称为数字处理芯片或者芯片,芯片包括处理单元和通信接口,处理单元通过通信接口获取程序指令,程序指令被处理单元执行,处理单元用于执行前述图7-图12中任一实施例所示的神经网络参数量化装置执行的方法步骤。
本申请实施例还提供一种数字处理芯片。该数字处理芯片中集成了用于实现上述处理器1401,或者处理器1401的功能的电路和一个或者多个接口。当该数字处理芯片中集成了存储器时,该数字处理芯片可以完成前述实施例中的任一个或多个实施例的方法步骤。当该数字处理芯片中未集成存储器时,可以通过通信接口与外置的存储器连接。该数字处理芯片根据外置的存储器中存储的程序代码来实现上述实施例中神经网络参数量化装置执行的动作。
本申请实施例中还提供一种计算机程序产品，当其在计算机上运行时，使得计算机执行如前述图7-图12所示实施例描述的方法中神经网络参数量化装置所执行的步骤。
本申请实施例提供的神经网络参数量化装置可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使服务器内的芯片执行上述图7-图12所示实施例描述的神经网络参数量化方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体地,前述的处理单元或者处理器可以是中央处理器(central processing unit,CPU)、网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、数字信号处理器(digital signal processor,DSP)、专用集成 电路(application specific integrated circuit,ASIC)或现场可编程逻辑门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者也可以是任何常规的处理器等。
示例性地,请参阅图15,图15为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 150,NPU 150作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路1503,通过控制器1504控制运算电路1503提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路1503内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路1503是二维脉动阵列。运算电路1503还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1503是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器1502中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器1501中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1508中。
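作为上述运算过程的一个示意性片段（仅为便于理解的示例，不代表NPU的真实硬件实现），下述代码以按K维分块乘加的方式模拟“部分结果累加到accumulator”的过程：

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 2) -> np.ndarray:
    """按K维分块计算C = A @ B，每个分块得到的部分结果累加到累加器C中。"""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)       # 相当于accumulator中保存的部分和
    for k0 in range(0, K, tile):
        C += A[:, k0:k0 + tile] @ B[k0:k0 + tile, :]
    return C

A = np.random.randn(4, 6).astype(np.float32)     # 输入矩阵A
B = np.random.randn(6, 3).astype(np.float32)     # 权重矩阵B
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```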
统一存储器1506用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(direct memory access controller,DMAC)1505,DMAC被搬运到权重存储器1502中。输入数据也通过DMAC被搬运到统一存储器1506中。
总线接口单元(bus interface unit,BIU)1510,用于AXI总线与DMAC和取指存储器(instruction fetch buffer,IFB)1509的交互。
总线接口单元1510(bus interface unit,BIU),用于取指存储器1509从外部存储器获取指令,还用于存储单元访问控制器1505从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器1506或将权重数据搬运到权重存储器1502中或将输入数据数据搬运到输入存储器1501中。
向量计算单元1507包括多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如批归一化(batch normalization),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元1507能将经处理的输出的向量存储到统一存储器1506。例如,向量计算单元1507可以将线性函数和/或非线性函数应用到运算电路1503的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1507生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路1503的激活输入,例如用于在神经网络中的后续层中的使用。
控制器1504连接的取指存储器(instruction fetch buffer)1509,用于存储控制器 1504使用的指令;
统一存储器1506,输入存储器1501,权重存储器1502以及取指存储器1509均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,循环神经网络中各层的运算可以由运算电路1503或向量计算单元1507执行。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述图7-图12的方法的程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、只读存储器(read only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理 解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
最后应说明的是:以上,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。

Claims (12)

  1. 一种神经网络参数量化方法,其特征在于,包括:
    获取待量化模型中各个神经元的参数,得到参数集合;
    对所述参数集合进行聚类,得到多种分类数据;
    对所述多种分类数据中的每种分类数据进行量化,得到至少一种量化参数,所述至少一种量化参数用于得到压缩模型,所述至少一种量化参数的精度低于所述待量化模型中的参数的精度。
  2. 根据权利要求1所述的方法,其特征在于,所述对所述参数集合进行聚类,得到多种分类数据,包括:
    对所述参数集合进行聚类,得到至少一种聚类数据;
    从所述至少一种聚类数据中的每种聚类数据中截取预设数量的参数,得到所述多种分类数据。
  3. 根据权利要求1或2所述的方法,其特征在于,所述待量化模型中的参数包括每个神经元的输出的特征中的参数或者每个神经元内的参数值。
  4. 根据权利要求1-3中任一项所述的方法,其特征在于,所述待量化模型包括加法神经网络。
  5. 根据权利要求1-4中任一项所述的方法,其特征在于,所述压缩模型用于进行图像识别、分类任务或者目标检测中的至少一种。
  6. 一种神经网络参数量化装置,其特征在于,包括:
    获取模块,用于获取待量化模型中各个神经元的参数,得到参数集合;
    聚类模块,用于对所述参数集合进行聚类,得到多种分类数据;
    量化模块,用于对所述多种分类数据中的每种分类数据进行量化,得到至少一种量化参数,所述至少一种量化参数用于得到压缩模型,所述至少一种量化参数的精度低于所述待量化模型中的参数的精度。
  7. 根据权利要求6所述的装置,其特征在于,所述聚类模块,具体用于:
    对所述参数集合进行聚类,得到至少一种聚类数据;
    从所述至少一种聚类数据中的每种聚类数据中截取预设数量的参数,得到所述多种分类数据。
  8. 根据权利要求6或7所述的装置,其特征在于,所述待量化模型中的参数包括每个神经元的输出的特征中的参数或者每个神经元内的参数值。
  9. 根据权利要求6-8中任一项所述的装置,其特征在于,所述待量化模型包括加法神经网络。
  10. 根据权利要求6-9中任一项所述的装置,其特征在于,所述压缩模型用于进行图像识别、分类任务或者目标检测中的至少一种。
  11. 一种神经网络参数量化装置,其特征在于,包括处理器,所述处理器和存储器耦合,所述存储器存储有程序,当所述存储器存储的程序指令被所述处理器执行时实现权利要求1至5中任一项所述的方法。
  12. 一种计算机可读存储介质,包括程序,当其被处理单元所执行时,执行如权利要求1至5中任一项所述的方法。
PCT/CN2023/095019 2022-05-30 2023-05-18 一种神经网络参数量化方法和装置 WO2023231794A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210600648.7A CN115081588A (zh) 2022-05-30 2022-05-30 一种神经网络参数量化方法和装置
CN202210600648.7 2022-05-30

Publications (1)

Publication Number Publication Date
WO2023231794A1 true WO2023231794A1 (zh) 2023-12-07

Family

ID=83250263

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/095019 WO2023231794A1 (zh) 2022-05-30 2023-05-18 一种神经网络参数量化方法和装置

Country Status (2)

Country Link
CN (1) CN115081588A (zh)
WO (1) WO2023231794A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114492779A (zh) * 2022-02-16 2022-05-13 安谋科技(中国)有限公司 神经网络模型的运行方法、可读介质和电子设备

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115081588A (zh) * 2022-05-30 2022-09-20 华为技术有限公司 一种神经网络参数量化方法和装置
CN115589436B (zh) * 2022-12-14 2023-03-28 三亚海兰寰宇海洋信息科技有限公司 一种数据处理方法、装置及设备
CN116579407B (zh) * 2023-05-19 2024-02-13 北京百度网讯科技有限公司 神经网络模型的压缩方法、训练方法、处理方法和装置
CN116702861B (zh) * 2023-06-19 2024-03-01 北京百度网讯科技有限公司 深度学习模型的压缩方法、训练方法、处理方法和装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859281A (zh) * 2019-01-25 2019-06-07 杭州国芯科技股份有限公司 一种稀疏神经网络的压缩编码方法
CN110309904A (zh) * 2019-01-29 2019-10-08 广州红贝科技有限公司 一种神经网络压缩方法
CN110874627A (zh) * 2018-09-04 2020-03-10 华为技术有限公司 数据处理方法、数据处理装置及计算机可读介质
US20200250539A1 (en) * 2017-10-20 2020-08-06 Shanghai Cambricon Information Technology Co., Ltd Processing method and device
CN113222098A (zh) * 2020-01-21 2021-08-06 上海商汤智能科技有限公司 数据处理方法和相关产品
CN113396427A (zh) * 2019-02-25 2021-09-14 蒂普爱可斯有限公司 用于人工神经网络的比特量化的方法和系统
CN115081588A (zh) * 2022-05-30 2022-09-20 华为技术有限公司 一种神经网络参数量化方法和装置

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200250539A1 (en) * 2017-10-20 2020-08-06 Shanghai Cambricon Information Technology Co., Ltd Processing method and device
CN110874627A (zh) * 2018-09-04 2020-03-10 华为技术有限公司 数据处理方法、数据处理装置及计算机可读介质
CN109859281A (zh) * 2019-01-25 2019-06-07 杭州国芯科技股份有限公司 一种稀疏神经网络的压缩编码方法
CN110309904A (zh) * 2019-01-29 2019-10-08 广州红贝科技有限公司 一种神经网络压缩方法
CN113396427A (zh) * 2019-02-25 2021-09-14 蒂普爱可斯有限公司 用于人工神经网络的比特量化的方法和系统
CN113222098A (zh) * 2020-01-21 2021-08-06 上海商汤智能科技有限公司 数据处理方法和相关产品
CN115081588A (zh) * 2022-05-30 2022-09-20 华为技术有限公司 一种神经网络参数量化方法和装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114492779A (zh) * 2022-02-16 2022-05-13 安谋科技(中国)有限公司 神经网络模型的运行方法、可读介质和电子设备
CN114492779B (zh) * 2022-02-16 2024-09-27 安谋科技(中国)有限公司 神经网络模型的运行方法、可读介质和电子设备

Also Published As

Publication number Publication date
CN115081588A (zh) 2022-09-20

Similar Documents

Publication Publication Date Title
WO2020221200A1 (zh) 神经网络的构建方法、图像处理方法及装置
WO2022042713A1 (zh) 一种用于计算设备的深度学习训练方法和装置
CN110084281B (zh) 图像生成方法、神经网络的压缩方法及相关装置、设备
WO2022083536A1 (zh) 一种神经网络构建方法以及装置
WO2023231794A1 (zh) 一种神经网络参数量化方法和装置
WO2021190451A1 (zh) 训练图像处理模型的方法和装置
WO2021218517A1 (zh) 获取神经网络模型的方法、图像处理方法及装置
WO2022001805A1 (zh) 一种神经网络蒸馏方法及装置
WO2021057056A1 (zh) 神经网络架构搜索方法、图像处理方法、装置和存储介质
WO2021022521A1 (zh) 数据处理的方法、训练神经网络模型的方法及设备
WO2021164750A1 (zh) 一种卷积层量化方法及其装置
WO2022111617A1 (zh) 一种模型训练方法及装置
US20230082597A1 (en) Neural Network Construction Method and System
WO2022156561A1 (zh) 一种自然语言处理方法以及装置
WO2021051987A1 (zh) 神经网络模型训练的方法和装置
CN110222718B (zh) 图像处理的方法及装置
WO2021218470A1 (zh) 一种神经网络优化方法以及装置
WO2021129668A1 (zh) 训练神经网络的方法和装置
WO2024041479A1 (zh) 一种数据处理方法及其装置
WO2022012668A1 (zh) 一种训练集处理方法和装置
WO2023284716A1 (zh) 一种神经网络搜索方法及相关设备
CN113326930A (zh) 数据处理方法、神经网络的训练方法及相关装置、设备
WO2022161387A1 (zh) 一种神经网络的训练方法及相关设备
WO2022156475A1 (zh) 神经网络模型的训练方法、数据处理方法及装置
WO2022088063A1 (zh) 神经网络模型的量化方法和装置、数据处理的方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23814979

Country of ref document: EP

Kind code of ref document: A1