WO2024021827A1 - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
WO2024021827A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
parameter
blocks
neural network
convolutional neural
Prior art date
Application number
PCT/CN2023/096625
Other languages
English (en)
French (fr)
Inventor
刘保庆
何雷骏
胡丁晟
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2024021827A1

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Definitions

  • This application relates to the field of artificial intelligence, and in particular, to a data processing method and device.
  • This application provides a data processing method and device for improving the processing performance of an AI hardware accelerator.
  • A data processing method is provided, including: obtaining a parameter matrix of a convolutional neural network model; dividing the parameter matrix into multiple matrix blocks, where the multiple matrix blocks include P types of matrix blocks with different contents; respectively generating an index value for each matrix block in the P types of matrix blocks, where the index value is used to uniquely indicate the corresponding matrix block; and generating a parameter dictionary of the convolutional neural network model, where the parameter dictionary includes the P types of matrix blocks and the index values respectively corresponding to the P types of matrix blocks.
  • In this method, the parameter matrix of the convolutional neural network model can be regarded as a large matrix constructed from small matrix blocks. These matrix blocks include matrix blocks with the same content, so matrix blocks with the same content can be represented by the same index value, and a parameter matrix can therefore be expressed as a set of index values. The AI hardware accelerator then only needs to read from memory the index values corresponding to the parameters of the current operation and restore the required parameter content inside the AI hardware accelerator according to the obtained index values, which can greatly improve the actual utilization efficiency of the IO bandwidth (a compression-side sketch is given below).
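  • The following is a minimal Python/NumPy sketch of this dictionary-style compression, written for illustration only and not taken from this application; the function name, block size, and the assumption that the matrix divides evenly into blocks are all choices made for the example.

```python
import numpy as np

def compress_parameter_matrix(W: np.ndarray, bh: int, bw: int):
    """Split W into bh x bw matrix blocks, deduplicate identical blocks,
    and return (parameter_dictionary, index_matrix)."""
    H, Wd = W.shape
    assert H % bh == 0 and Wd % bw == 0, "matrix must divide evenly into blocks"
    dictionary = []                      # the P distinct matrix blocks
    block_to_index = {}                  # block content -> index value
    indices = np.empty((H // bh, Wd // bw), dtype=np.int32)
    for i in range(0, H, bh):
        for j in range(0, Wd, bw):
            block = W[i:i + bh, j:j + bw]
            key = block.tobytes()        # content-based lookup key
            if key not in block_to_index:
                block_to_index[key] = len(dictionary)
                dictionary.append(block.copy())
            indices[i // bh, j // bw] = block_to_index[key]
    return dictionary, indices           # the index matrix replaces W in off-chip memory
```

In this sketch, the index matrix is what would be stored off chip, while the much smaller dictionary can be kept on chip.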
  • Obtaining the parameter matrix of the convolutional neural network model includes: obtaining the original parameter matrix of the convolutional neural network model; and retraining the original parameter matrix to obtain the parameter matrix of the convolutional neural network model that satisfies a constraint, where the constraint includes enabling the parameter matrix to be divided into multiple matrix blocks, the multiple matrix blocks including matrix blocks with the same content.
  • the convolutional neural network is mainly divided into two major links: training and inference.
  • training refers to the process of using training sample data to continuously adjust the parameters of each layer in the convolutional neural network so that the output results of the convolutional neural network model meet the requirements.
  • Inference refers to the process of applying the trained convolutional neural network model in actual scenarios such as image classification, image recognition, or speech recognition, by inputting the image, audio, or other data to be processed into the convolutional neural network model to obtain the processing results.
  • the purpose of training is to select a convolutional neural network model and parameter matrix that can perform inference with high efficiency and accuracy.
  • any model and parameters can meet the requirements as long as they can achieve the expected accuracy of inference. This gives the training process of the parameter matrix a certain amount of room for maneuver.
  • If the parameter matrix of an existing convolutional neural network model in the related art is directly divided into matrix blocks, there may be too many types of matrix blocks, resulting in a poor compression effect. Therefore, in this implementation, the parameter matrix of the convolutional neural network model can be retrained, with constraints added during the retraining process, so that the trained parameter matrix both satisfies the data compression format (that is, the parameter matrix can be divided into multiple matrix blocks, and the multiple matrix blocks include matrix blocks with the same content) and preserves the accuracy of inference.
  • the method further includes: determining specifications of the matrix block according to specifications of hardware resources and difficulty of model training.
  • the specifications of the matrix blocks can be determined according to the specifications of hardware resources and the difficulty of model training, so as to take into account hardware resource overhead and compression benefits.
  • a data processing method is provided, which method is applied to an artificial intelligence AI hardware accelerator.
  • the method includes: obtaining a first index value; searching for a matrix block corresponding to the first index value in a parameter dictionary;
  • the parameter dictionary includes P types of matrix blocks and index values corresponding to the P types of matrix blocks, wherein each of the matrix blocks is a part of the parameter matrix of the convolutional neural network model.
  • In this method, the parameter matrix can be regarded as a large matrix constructed from small matrix blocks. These matrix blocks include matrix blocks with the same content, so matrix blocks with the same content can be represented by the same index value, and a parameter matrix can therefore be expressed as a set of index values. The AI hardware accelerator then only needs to read from memory the index values corresponding to the parameters of the current operation and restore the required parameter content inside the AI hardware accelerator according to the obtained index values, which can greatly improve the actual utilization efficiency of the IO bandwidth.
  • the method further includes: splicing matrix blocks corresponding to the first index values to obtain a first parameter set of the convolutional neural network model.
  • With this design, the index values of multiple matrix blocks can be obtained at one time; the matrix block corresponding to each index value is then looked up in the parameter dictionary and the matrix blocks are spliced, thereby obtaining the parameters required for the current operation (see the sketch below).
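  • As an illustration of the lookup-and-splice step (not the accelerator's actual interface), the following Python sketch assumes the dictionary and index layout produced by the compression sketch above; the names are hypothetical.

```python
import numpy as np

def decompress_blocks(dictionary, indices):
    """Look up each index value in the parameter dictionary and splice the
    matrix blocks back into the original parameter-matrix layout."""
    rows = [np.hstack([dictionary[idx] for idx in row]) for row in indices]
    return np.vstack(rows)
```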
  • the method further includes: loading the parameter dictionary into an on-chip cache of the AI hardware accelerator.
  • the data processing device can directly search for the matrix block corresponding to the first index value in the parameter dictionary cached in the on-chip cache of the AI hardware accelerator.
  • obtaining the first index value includes: reading the first index value in an off-chip memory of the AI hardware accelerator.
  • The method further includes: caching the first parameter set included in the matrix blocks corresponding to the first index value into the weight buffer of the AI hardware accelerator, so that the AI hardware accelerator performs operations according to the first parameter set.
  • A data processing device is provided, including: an acquisition unit, used to acquire a parameter matrix of a convolutional neural network model; a dividing unit, used to divide the parameter matrix into multiple matrix blocks, where the multiple matrix blocks include P types of matrix blocks with different contents; an index generation unit, used to respectively generate the index value of each matrix block in the P types of matrix blocks, where the index value is used to uniquely indicate the corresponding matrix block; and a dictionary generation unit, used to generate a parameter dictionary of the convolutional neural network model, where the parameter dictionary includes the P types of matrix blocks and the index values corresponding to the P types of matrix blocks.
  • The acquisition unit being used to obtain the parameter matrix of the convolutional neural network model includes: the acquisition unit being used to obtain the original parameter matrix of the convolutional neural network model, and being used to retrain the original parameter matrix to obtain the parameter matrix of the convolutional neural network model that satisfies a constraint, where the constraint includes enabling the parameter matrix to be divided into multiple matrix blocks, the multiple matrix blocks including matrix blocks with the same content.
  • the segmentation unit is also used to determine the specifications of the matrix blocks based on the specifications of hardware resources and the difficulty of model training.
  • a data processing device is provided.
  • the data processing device is applied to an artificial intelligence AI hardware accelerator.
  • The data processing device includes: an acquisition unit, used to acquire the first index value; and a search unit, used to search a parameter dictionary for the matrix block corresponding to the first index value, where the parameter dictionary includes P types of matrix blocks and the index values corresponding to the P types of matrix blocks, and each of the matrix blocks is a part of the parameter matrix of the convolutional neural network model.
  • the search unit is also used to splice matrix blocks corresponding to the first index values to obtain the first parameter set of the convolutional neural network model.
  • the acquisition unit is also used to load the parameter dictionary into an on-chip cache of the AI hardware accelerator.
  • the obtaining unit is used to obtain the first index value, including: the obtaining unit is used to read the first index value in the off-chip memory of the AI hardware accelerator.
  • The data processing device further includes: a writing unit, configured to cache the first parameter set included in the matrix blocks corresponding to the first index value into the weight buffer of the AI hardware accelerator, so that the AI hardware accelerator performs operations according to the first parameter set.
  • a communication device including a processor and an interface.
  • the processor receives or sends data through the interface.
  • The processor is configured to implement the method described in the above first aspect or any design of the first aspect, or in the above second aspect or any design of the second aspect.
  • In a sixth aspect, a computer-readable storage medium is provided. Instructions are stored in the computer-readable storage medium, and when the instructions are run on a processor, they implement the method described in the above first aspect or any design of the first aspect, or in the above second aspect or any design of the second aspect.
  • A computer program product is provided, the computer program product including instructions which, when run on a processor, are used to implement the method described in the above first aspect or any design of the first aspect, or in the above second aspect or any design of the second aspect.
  • Figure 1 is the first structural schematic diagram of an AI hardware accelerator provided by this application;
  • Figure 2 is a schematic diagram of the operation flow of an AI hardware accelerator after the parameters of a convolutional neural network model are compressed, provided by this application;
  • Figure 3 is a schematic diagram of parameter compression provided by this application;
  • Figure 4 is the first schematic flowchart of a data processing method provided by this application;
  • Figure 5 is a schematic structural diagram of an SOC provided by this application;
  • Figure 6 is the second structural schematic diagram of an AI hardware accelerator provided by this application;
  • Figure 7 is the second schematic flowchart of a data processing method provided by this application;
  • Figure 8 is the first structural schematic diagram of a data processing device provided by this application;
  • Figure 9 is the second structural schematic diagram of a data processing device provided by this application;
  • Figure 10 is the third structural schematic diagram of a data processing device provided by this application.
  • words such as “first” and “second” are used to distinguish identical or similar items with basically the same functions and effects.
  • Words such as “first” and “second” do not limit the number or the execution order, nor do they imply that the items so described are necessarily different.
  • words such as “exemplary” or “for example” are used to represent examples, illustrations or explanations. Any embodiment or design described in this embodiment as “exemplary” or “such as” is not intended to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as “exemplary” or “such as” is intended to present related concepts in a concrete manner that is easier to understand.
  • Convolutional neural network is an AI computing method commonly used in image classification, image recognition, audio recognition and other related fields. Generally, convolutional neural networks can be used to identify features in input information (such as images).
  • Convolutional neural networks generally include four types of layer operations: the convolution (Conv) layer, the activation function, i.e., rectified linear unit (ReLU), layer, the pooling layer, and the fully connected (FC) layer.
  • A convolutional neural network model usually includes multiple convolution layers, activation function layers, and pooling layers that alternate several times, and the data is finally input into the fully connected layer to obtain the output result. For example: input image -> Conv -> ReLU -> Conv -> ReLU -> Pooling -> Conv -> ReLU -> Conv -> ReLU -> Pooling -> Conv -> ReLU -> Conv -> ReLU -> Pooling -> FC -> output (a layer-stack sketch is given below).
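  • A purely illustrative sketch of the Conv-ReLU-Pooling-FC pattern just described, written with PyTorch; the channel counts, kernel sizes, and the 32x32-input assumption are arbitrary and not taken from this application.

```python
import torch.nn as nn

# Conv/ReLU pairs alternating with pooling, ending in a fully connected layer.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 10),   # assumes 32x32 input images and 10 output classes
)
```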
  • the function of the convolutional layer is to identify the features of the input image data through multiple filters.
  • Each filter has a scanning range and is used to scan the data information in a certain area of the input image.
  • the calculation results obtained by the current convolutional layer will be input to the next layer (the next layer can be an activation function, a pooling layer or a fully connected layer) for processing.
  • The activation function layer performs an operation similar to MAX(0, x) on the input image data; that is, each value in the input image data is compared with 0: if it is greater than 0 it is retained, and if it is less than 0 it is set to 0.
  • The activation function layer provides the sparsity rate of the input image data (that is, the percentage of 0 values among the total number of data values) and does not change the size of the input image data (that is, the amount of data).
  • the function of the pooling layer is to downsample, that is, to extract data from alternate rows or columns in the two-dimensional matrix of each layer of the input data, thereby reducing the size of the input image data.
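  • A toy NumPy illustration of the activation (MAX(0, x)) and downsampling (keeping alternate rows and columns) operations described above; the values are arbitrary examples, not data from this application.

```python
import numpy as np

x = np.array([[-1.0,   2.0,  -3.0,   4.0],
              [ 5.0,  -6.0,   7.0,  -8.0],
              [-9.0,  10.0, -11.0,  12.0],
              [13.0, -14.0,  15.0, -16.0]])

relu_out = np.maximum(x, 0.0)   # values below 0 are set to 0
pooled   = relu_out[::2, ::2]   # keep every other row and column (downsampling)
```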
  • The operation process of the fully connected layer is similar to that of the convolution layer. The difference is that the filter of the fully connected layer does not scan a small area of the input image data, but scans the entire input image data at once and then outputs a value. The fully connected layer has multiple filters, corresponding to multiple different, very specific image features, and the output value is equivalent to a "score" representing the "possibility" that the input image data contains these features.
  • the core of the AI hardware accelerator is the convolutional layer and the fully connected layer.
  • The amount of computation of the convolution layer and fully connected layer operations may account for more than 90% of the total computation of the convolutional neural network, so it can be said that the computing performance of the convolution layer and the fully connected layer usually determines the overall performance of the AI hardware accelerator.
  • FIG. 1 is a schematic structural diagram of an AI accelerator provided in this embodiment.
  • The AI accelerator 10 includes a bus interface unit (BIU) 110, a direct memory access controller (DMAC) 101, a weight buffer 102 (also called a parameter cache), an input buffer 103, a matrix calculation unit (cube unit) 104, a vector calculation unit (vector unit) 105, an accumulator 106, a unified buffer 107, a flow controller (flow control) 108, and an instruction fetch buffer 109.
  • the weight cache 102, the input cache 103, the unified cache 107, and the instruction fetch cache 109 are on-chip caches (on-chip-buffer) in the AI accelerator 10.
  • the AI accelerator 10 is mounted on the main control CPU (host CPU) 20 as a co-processor, and the main control CPU allocates tasks.
  • the DMAC 101 in the AI accelerator 10 directly accesses the memory through the BIU 110, obtains feature data and parameters of the convolutional neural network, caches the feature data into the input cache 103, and caches the parameters into the weight cache 102.
  • the memory can be an external storage unit private to the AI accelerator hardware architecture.
  • the memory can be a double data rate (DDR) memory.
  • the “parameters” referred to in this embodiment specifically refer to the parameters obtained by training the convolutional neural network model. In other words, these parameters can represent the trained convolutional neural network model.
  • Since the parameters of the convolutional neural network model are usually stored and calculated in the form of a matrix, for convenience of description the matrix constructed from the parameters is called the "parameter matrix" below.
  • The data on which inference is to be performed using the convolutional neural network model to obtain inference results is called "feature data", and the matrix composed of the feature data is called the "feature data matrix".
  • parameters can be understood as parameters configured in each layer of the convolutional neural network model used to recognize images, and these parameters constitute a "parameter matrix”.
  • the data of the image to be recognized that is input to the convolutional neural network model is the feature data, and these feature data constitute the "feature data matrix”.
  • the understanding of parameters, parameter matrices, feature data, and feature data matrices can refer to the above description.
  • the matrix calculation unit 104 is the core component of the AI accelerator and is used to complete the matrix multiplication matrix calculation corresponding to the convolution layer and fully connected layer operations.
  • When performing a convolution layer or fully connected layer operation, the matrix calculation unit 104 reads the data of the feature data matrix from the input cache 103 and the data of the parameter matrix from the weight cache 102, performs the matrix multiplication on the matrix calculation unit 104, and stores the resulting partial or final operation results in the accumulator 106.
  • the vector calculation unit 105 is used to further process the output of the matrix calculation unit 104 when necessary, such as vector multiplication, vector addition, exponential operation, logarithm operation, size comparison and other processing.
  • the vector calculation unit 105 is mainly used for network calculations other than the convolutional layer and the fully connected layer in the convolutional neural network (for example, the vector calculation unit 105 can be used for the calculation of the activation function layer and the pooling layer).
  • the unified cache 107 is used to store output calculation results and input data of certain layers (such as activation function layers and pooling layers).
  • BIU 110 is used to interact between memory, DMAC 101 and instruction fetch cache 109 through the bus.
  • the DMAC 101 is used to move data in the memory to the weight cache 102, the input cache 103 or the unified cache 107, or to move the data in the unified cache 107 to the memory.
  • the instruction fetch cache 109 is used to cache instructions, and controls the working process of the AI hardware accelerator 10 through the instructions.
  • the flow controller 108 is used to manage the execution process of instructions.
  • compressing the parameters of the convolutional neural network model has become one of the means to avoid IO bandwidth becoming a bottleneck and improve the overall performance of the AI hardware accelerator 10 .
  • data compression is a technology that reduces the data volume by changing the expression format and organizational structure of the data based on the redundancy of the data.
  • the parameters are first compressed offline through a compression algorithm to obtain the compression parameters, and then the compression parameters are stored in off-chip memory.
  • feature data is also stored in off-chip memory.
  • the memory is connected to the AI hardware accelerator 10 through the advanced extensible interface (AXI) bus.
  • When the AI hardware accelerator is running, a portion of the compressed data is obtained from the compression parameters according to the address addr of the parameters required in real time. If the length of this compressed data is len, it is read into the chip through an IO read request <addr, len>, and the compressed data is then restored to the original parameters through real-time online decompression inside the chip. If the volume of the compressed data is 2 times smaller than that of the original parameters, then compared with reading the original parameters directly from memory, the same function can be achieved with 1/2 of the original bandwidth, thereby saving the AI hardware accelerator's valuable bandwidth resources.
  • In this way, the rate at which the processing unit (process element, PE) in the AI hardware accelerator 10 actually obtains parameters can reach, for example, 5 Gbps, and the actual utilization efficiency of the IO bandwidth can be improved (see the arithmetic sketch below).
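  • A back-of-the-envelope Python illustration of this effect; the physical-bandwidth figure is an assumption added for the example and is not stated in this application.

```python
io_bandwidth_gbps = 2.5    # assumed physical IO bandwidth from off-chip memory
compression_ratio = 2.0    # compressed parameters are half the original size
effective_gbps = io_bandwidth_gbps * compression_ratio   # 5.0 Gbps of parameters seen by the PEs
```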
  • the operations performed inside the AI hardware accelerator during the operation of the AI hardware accelerator are called “online” operations, and the operations performed outside the AI hardware accelerator are called “offline” operations.
  • The “offline compression” mentioned above can be understood as using the computing power of devices other than the AI hardware accelerator to compress the parameters of the convolutional neural network model to obtain the compression parameters.
  • the “online decompression” mentioned above can be understood as using the computing power in the AI hardware accelerator to decompress the compression parameters.
  • the decompression module responsible for online decompression is the most critical link.
  • the performance and area of the decompression module need to meet the corresponding specifications of the chip.
  • the bandwidth that the decompression modules of commonly used compression algorithms can achieve is relatively low.
  • The decoding rate that a single Huffman-coding decompression engine can achieve is about 1-2 bit/cycle.
  • This decompression rate is far below the bandwidth required in actual application scenarios (about 32 B/cycle, i.e., 256 bit/cycle).
  • Even a single engine optimized for decompression bandwidth can only reach about 32 bit/cycle, so how to complete high-bandwidth decompression is a problem that needs to be solved.
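  • For a rough sense of the gap quoted above, the following sketch estimates how many parallel engines would be needed; it is simple division with an assumed datapath width, ignoring any synchronization overhead.

```python
target_bits_per_cycle = 256              # 32 B/cycle datapath, as quoted above
plain_huffman_rate    = 2                # ~1-2 bit/cycle per plain Huffman engine
optimized_rate        = 32               # ~32 bit/cycle per bandwidth-optimized engine
engines_plain     = target_bits_per_cycle // plain_huffman_rate   # 128 engines
engines_optimized = target_bits_per_cycle // optimized_rate       # 8 engines
```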
  • a related technology can be used to improve the compression rate of parameters by increasing the sparsity rate of parameters as much as possible during the model training stage.
  • the sparsity rate refers to the proportion of 0 values in the parameters.
  • this embodiment provides a data processing method.
  • the parameter matrix can be regarded as a large matrix constructed from small matrix blocks. These matrix blocks include matrix blocks with the same content, so matrix blocks with the same content can be represented by the same index value.
  • a parameter matrix can be expressed as several index values.
  • the AI hardware accelerator only needs to read the index value corresponding to the current operation parameter from the memory each time, and then restore the required parameter content based on the obtained index value inside the AI hardware accelerator. This can greatly improve the actual utilization efficiency of IO bandwidth.
  • the parameter matrix A can be divided into multiple matrix blocks. Take matrix block a, matrix block b and matrix block c as an example. The contents of matrix block a and matrix block b are the same, so the same index value can be used to represent matrix block a and matrix block b. As shown in Figure 3, index value 1 is used to represent the contents of matrix block a and matrix block b, and index value 2 is used to represent the content of matrix block c.
  • a parameter dictionary is also constructed to record the correspondence between each index value and the content of the matrix block.
  • When the AI hardware accelerator needs to obtain the parameters in matrix block a of parameter matrix A, it only needs to read index value 1 corresponding to matrix block a from memory and then query the parameter dictionary to determine the matrix content corresponding to index value 1; that is, the parameters in matrix block a can be obtained in order to complete the operation.
  • the method when applied to the scenario of compressing the parameter matrix of the convolutional neural network model, can be implemented by a data processing device.
  • The data processing device may include a desktop computer, a tablet computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) or virtual reality (VR) device, or another device capable of processing data.
  • The data processing device may also be an AI hardware accelerator; this embodiment does not specifically limit the specific form of the data processing device.
  • the method may include:
  • the data processing device obtains the parameter matrix of the convolutional neural network model.
  • the convolutional neural network model in this embodiment may be a convolutional neural network model used in fields such as image classification, image recognition, or speech recognition.
  • the "parameters” referred to in this embodiment specifically refer to the parameters obtained by training the convolutional neural network model. In other words, these parameters can represent the trained convolutional neural network model.
  • the parameters of the convolutional neural network model are usually stored and calculated in the form of a matrix, the matrix constructed by the parameters is called a "parameter matrix" for convenience of description.
  • S201 can specifically include:
  • the data processing device obtains the original parameter matrix of the convolutional neural network model.
  • the original parameter matrix can be understood as a parameter matrix that has not been processed according to S2012 below in this embodiment.
  • the original parameter matrix may be a parameter matrix obtained by training the convolutional neural network model according to the model training method of the related technology.
  • the data processing device retrains the original parameter matrix to obtain the parameter matrix of the convolutional neural network model that satisfies the constraint conditions.
  • constraints specifically include enabling the parameter matrix to be divided into multiple matrix blocks, and the multiple matrix blocks include matrix blocks with the same content.
  • this constraint can also be understood as: enabling the parameter matrix to be composed of several identical matrix blocks.
  • the convolutional neural network is mainly divided into two major links: training and inference.
  • training refers to the process of using training sample data to continuously adjust the parameters of each layer in the convolutional neural network so that the output results of the convolutional neural network model meet the requirements.
  • Inference refers to using the trained convolutional neural network model in actual image classification, image recognition or speech recognition scenarios, and obtaining the processing results by inputting the image, audio and other data to be processed into the convolutional neural network model. process.
  • the purpose of training is to select a convolutional neural network model and parameter matrix that can perform inference with high efficiency and accuracy.
  • any model and parameters can meet the requirements as long as they can achieve the expected accuracy of inference. This gives the training process of the parameter matrix a certain amount of room for maneuver.
  • If the parameter matrix of an existing convolutional neural network model in the related art is directly divided into matrix blocks, there may be too many types of matrix blocks, resulting in a poor compression effect. Therefore, in this implementation, the parameter matrix of the convolutional neural network model can be retrained, with constraints added during the retraining process, so that the trained parameter matrix both satisfies the data compression format (that is, the parameter matrix can be divided into multiple matrix blocks, and the multiple matrix blocks include matrix blocks with the same content) and preserves the accuracy of inference.
  • the data processing device divides the parameter matrix into multiple matrix blocks.
  • the plurality of matrix blocks include P types of matrix blocks with different contents.
  • the parameter matrix is a 1000 ⁇ 1000 matrix. If the parameter matrix is divided according to the 10 ⁇ 10 specification, the parameter matrix can be divided into 10,000 matrix blocks (that is, the parameter matrix is divided into multiple matrix blocks). In addition, if 7000 matrix blocks out of 10000 matrix blocks have the same content as other matrix blocks, that means there are 3000 matrix blocks with different contents among the above 10000 matrix blocks (that is, P is 3000). In other words, 3000 matrix blocks can be used to represent the entire parameter matrix.
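  • The block statistics in this example could be computed, for instance, with the following NumPy sketch (illustrative only; the 10x10 block size mirrors the example above):

```python
import numpy as np

def count_unique_blocks(W: np.ndarray, bh: int = 10, bw: int = 10) -> int:
    """Return P, the number of matrix blocks with distinct contents."""
    H, Wd = W.shape
    blocks = (W.reshape(H // bh, bh, Wd // bw, bw)
               .transpose(0, 2, 1, 3)
               .reshape(-1, bh * bw))            # one row per matrix block
    return np.unique(blocks, axis=0).shape[0]
```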
  • the data processing device generates index values for each matrix block in the P types of matrix blocks.
  • the index value is used to uniquely indicate the corresponding matrix block.
  • the data processing device generates a parameter dictionary of the convolutional neural network model.
  • the parameter dictionary includes P types of matrix blocks and index values corresponding to P types of matrix blocks respectively.
  • this method can also include:
  • the data processing device determines the specifications of the matrix block according to the specifications of the hardware resources and the difficulty of model training.
  • The specifications of the hardware resources may specifically include the various hardware resource specifications of the equipment used to train the convolutional neural network model. The lower the specifications of the hardware resources, the smaller the specification of the matrix block; similarly, the more difficult the model training, the smaller the specification of the matrix block.
  • the method can also include:
  • the data processing device generates compressed data of the parameter matrix based on the index value of each matrix block in the P types of matrix blocks.
  • each matrix block in the parameter matrix is replaced with the corresponding index value to obtain the compressed data of the parameter matrix.
  • For example, if each matrix block is 128 B and the index value corresponding to each matrix block is 1 B, a parameter block that originally required loading 128 B now only requires loading 1 B, giving a compression rate of up to 128 times.
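  • An illustrative estimate of this compression rate that also accounts for an assumed dictionary size; the block count and dictionary size below are assumptions added for the example, not figures from this application.

```python
block_bytes = 128                 # bytes per matrix block, as in the example above
index_bytes = 1                   # bytes per index value
num_blocks  = 10_000              # assumed total number of blocks in the parameter matrix
p_unique    = 256                 # assumed number of distinct blocks (fits a 1-byte index)

original_size   = num_blocks * block_bytes
compressed_size = num_blocks * index_bytes + p_unique * block_bytes  # index stream + dictionary
rate = original_size / compressed_size   # ~30x here; ~128x when counting the index stream alone
```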
  • the method can be implemented by a data processing device.
  • the functions of the data processing device can be implemented by part/all of the hardware in the AI hardware accelerator.
  • the AI hardware accelerator 30 may include a data processing device 301 and a computing engine 302 .
  • the data processing device 301 is used to implement the process of online decompression of parameters in the method provided in this embodiment.
  • The data processing device 301 may include a compression parameter cache 3011 for caching the compression parameters and an online decompression engine 3012 for decompressing the parameters online.
  • the calculation engine 302 is used to perform inference operations according to the parameters of the convolutional neural network model.
  • the AI hardware accelerator 30 is also connected to the main control CPU and DDR as memory through the AXI bus.
  • the calculation engine 302 is understood to be a functional module including a weight cache 102, a matrix calculation unit 104 and other modules.
  • The structure of the AI hardware accelerator 30 can also be: a bus interface unit (BIU) 310, a direct memory access controller (DMAC) 311, a weight buffer 302 (also called a parameter cache), an input buffer 303, a matrix calculation unit (cube unit) 304, a vector calculation unit (vector unit) 305, an accumulator 306, a unified buffer 307, a flow controller (flow control) 308, and an instruction fetch buffer 309.
  • BIU Bus Interface Unit
  • DMAC Direct Memory Access Controller
  • the method includes:
  • the data processing device 301 obtains the first index value.
  • the first index value corresponds to some parameters in the parameter matrix of the convolutional neural network model.
  • S401 may include: the data processing device 301 reads the first index value in the off-chip memory of the AI hardware accelerator 30 .
  • the data processing device 301 accesses the corresponding address in the memory through the BIU to obtain the first index value.
  • the data processing device 301 searches for the matrix block corresponding to the first index value in the parameter dictionary.
  • the parameter dictionary includes P types of matrix blocks and index values corresponding to P types of matrix blocks respectively.
  • P matrix blocks are part of the parameter matrix of the convolutional neural network model.
  • the first index value is first stored in the compression parameter cache 3011. Then the online decompression engine 3012 reads the first index value in the compression parameter cache 3011, and searches the matrix block corresponding to the first index value in the parameter dictionary to complete online decompression of the parameters.
  • the method also includes:
  • the data processing device 301 splices the matrix blocks corresponding to the first index values to obtain the first parameter set of the convolutional neural network model.
  • The first index value may include multiple index values. Therefore, after determining the matrix block corresponding to each index value in the first index value, the data processing device 301 may splice the matrix blocks corresponding to the first index values to obtain a parameter set of the convolutional neural network model (i.e., the first parameter set).
  • the online decompression engine 3012 splices the matrix blocks corresponding to the first index values to obtain the first parameter set of the convolutional neural network model.
  • the data processing device 301 caches the first parameter set into the weight cache 302 of the AI hardware accelerator 30.
  • the online decompression engine 3012 caches the first parameter set into the weight cache 302.
  • the AI hardware accelerator 30 performs operations according to the first parameter set.
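  • An end-to-end sketch of this online decompression flow (S401-S404), modeled in plain Python for illustration; the buffer objects and names are hypothetical stand-ins, not the accelerator's actual interfaces.

```python
import numpy as np

def load_parameters(index_values, parameter_dictionary, weight_buffer: list):
    """Read index values, look them up in the parameter dictionary, splice the
    matrix blocks, and cache the result in the weight buffer."""
    blocks = [parameter_dictionary[i] for i in index_values]   # S402: dictionary lookup
    first_parameter_set = np.hstack(blocks)                    # S403: splice the matrix blocks
    weight_buffer.append(first_parameter_set)                  # S404: cache for the compute engine
    return first_parameter_set
```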
  • the method may also include:
  • the data processing device 301 loads the parameter dictionary into the on-chip cache of the AI hardware accelerator.
  • the data processing device 301 may preload the parameter dictionary into its internal cache, such as the compression parameter cache 3011 in FIG. 5 .
  • the data processing device 301 may also load the parameter dictionary into other on-chip caches of the AI hardware accelerator, for example, load the parameter dictionary into the weight cache 302.
  • the data processing device 301 can directly search for the matrix block corresponding to the first index value in the parameter dictionary cached in the on-chip cache of the AI hardware accelerator.
  • the data processing device 301 can also load part of the parameter dictionary into the on-chip cache in advance, which can further reduce the IO bandwidth occupation, which is not limited in this embodiment.
  • In this method, the parameter matrix can be regarded as a large matrix constructed from small matrix blocks. These matrix blocks include matrix blocks with the same content, so matrix blocks with the same content can be represented by the same index value, and a parameter matrix can therefore be expressed as a set of index values. The AI hardware accelerator then only needs to read from memory the index values corresponding to the parameters of the current operation and restore the required parameter content inside the AI hardware accelerator according to the obtained index values, which can greatly improve the actual utilization efficiency of the IO bandwidth.
  • For example, consider the VGG16 (Visual Geometry Group 16) model and the Transformer model proposed by Google in 2017. Table 1 below lists the data amount of each parameter in the two models in an experiment comparing the existing technical solution with the technical solution provided in this embodiment.
  • variable, avg_grad, m, and v are parameters of different layers in the convolutional neural network model. It can be seen that, in this experiment, the technical solution provided by this embodiment can achieve a compression rate of 150-220 times, so the utilization efficiency of the IO bandwidth and the overall performance of the AI hardware accelerator 10 can be greatly improved.
  • For example, S205 can be executed at any time before the matrix blocks are divided in S202; as another example, in the data processing method described in Figure 7 above, S406 can be executed at any time before the matrix block lookup in S402.
  • the size of the serial numbers of the above processes does not mean the order of execution. The execution order of each process should be determined by its function and internal logic.
  • the data processing device 50 may be a software and hardware device used for training a convolutional neural network model.
  • The data processing device 50 may include a desktop computer, a tablet computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) or virtual reality (VR) device, or another device capable of processing data.
  • the data processing device 50 may be used to perform all or part of the steps performed by the data processing device in Figure 4 above.
  • the data processing device 50 includes:
  • Obtaining unit 501 used to obtain the parameter matrix of the convolutional neural network model
  • the dividing unit 502 is used to divide the parameter matrix into multiple matrix blocks, wherein the multiple matrix blocks include P kinds of matrix blocks with different contents;
  • the index generation unit 503 is used to respectively generate the index value of each matrix block in the P types of matrix blocks; the index value is used to uniquely indicate the corresponding matrix block;
  • the dictionary generation unit 504 is used to generate a parameter dictionary of the convolutional neural network model; the parameter dictionary includes the P types of matrix blocks and the index values corresponding to the P types of matrix blocks.
  • the obtaining unit 501 is used to obtain the parameter matrix of the convolutional neural network model, including:
  • the obtaining unit 501 is used to obtain the original parameter matrix of the convolutional neural network model
  • The acquisition unit 501 is used to retrain the original parameter matrix to obtain the parameter matrix of the convolutional neural network model that satisfies a constraint, where the constraint includes enabling the parameter matrix to be divided into multiple matrix blocks, the multiple matrix blocks including matrix blocks with the same content.
  • the segmentation unit 502 is also configured to determine the specifications of the matrix blocks according to the specifications of hardware resources and the difficulty of model training.
  • the data processing device 60 may be a software and hardware device used for convolutional neural network model inference. Specifically, the data processing device 60 may also be all or part of the software/hardware devices in the AI hardware accelerator. The data processing device 60 may be used to perform all or part of the steps performed by the data processing device in FIG. 7 above.
  • The data processing device 60 includes:
  • an acquisition unit 601, used to obtain the first index value; and a search unit 602, used to search a parameter dictionary for the matrix block corresponding to the first index value;
  • the parameter dictionary includes P types of matrix blocks and the index values corresponding to the P types of matrix blocks, where each of the matrix blocks is a part of the parameter matrix of the convolutional neural network model.
  • the search unit 602 is also used to splice the matrix blocks corresponding to the first index value to obtain the first parameter set of the convolutional neural network model.
  • the acquisition unit 601 is also used to load the parameter dictionary into the on-chip cache of the AI hardware accelerator.
  • the obtaining unit 601 is used to obtain the first index value, including: the obtaining unit 601 is used to read the first index value in the off-chip memory of the AI hardware accelerator.
  • the data processing device 60 also includes:
  • a writing unit 603, configured to cache the first parameter set included in the matrix blocks corresponding to the first index value into the weight buffer of the AI hardware accelerator, so that the AI hardware accelerator performs operations according to the first parameter set.
  • FIG 10 is a schematic structural diagram of another data processing device provided in this embodiment.
  • The data processing device 70 may be a chip or a system on a chip. Specifically, it can include a desktop computer, a tablet computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) or virtual reality (VR) device, or another device capable of processing data.
  • the data processing device may also be all or part of the software/hardware devices in the AI hardware accelerator.
  • The data processing device 70 may include some or all of the following components: a processor 701, a communication line 706, a memory 703, and at least one communication interface 702.
  • the processor 701 is used to execute all or part of the steps executed by the data processing device in the data processing method provided in FIG. 4 or FIG. 7 in this embodiment.
  • The processor 701 may include a general-purpose central processing unit (CPU).
  • The processor 701 may also include a microprocessor, a field programmable gate array (FPGA), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like.
  • the processor 701 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 10 .
  • memory 703 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • The non-volatile memory can be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache.
  • By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM).
  • the memory 703 may exist independently and be connected to the processor 701 through a communication line 706 .
  • Memory 703 may also be integrated with processor 701.
  • memory 703 stores computer instructions.
  • the processor 701 can be used to execute all or part of the steps in the data processing method provided in this embodiment by executing computer instructions stored in the memory 703.
  • the computer execution instructions in this embodiment can also be called application codes, which are not specifically limited in this embodiment.
  • The communication interface 702 uses any device such as a transceiver for communicating with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
  • the communication line 706 is used to connect various components in the communication device 70 .
  • the communication line 706 may include a data bus, a power bus, a control bus, a status signal bus, etc.
  • the various buses are labeled as communication lines 706 in the figure.
  • the communication device 70 may also include a storage medium 704.
  • The storage medium 704 is used to store the computer instructions and various data for implementing the technical solution of this embodiment, so that when the communication device 70 executes the above method of this embodiment, the computer instructions and data stored in the storage medium 704 are loaded into the memory 703 and the processor 701 executes the computer instructions stored in the memory 703 to perform the method provided by this embodiment.
  • The communication device 70 may correspond to the communication device 40 in this embodiment and may correspond to the subject that performs the method according to this embodiment; the above and other operations and/or functions of each module in the communication device 70 are respectively intended to implement the corresponding processes of the methods in Figure 4 or Figure 7, and are not described again here for the sake of brevity.
  • the method steps in this embodiment can be implemented by hardware or by a processor executing software instructions.
  • Software instructions can be composed of corresponding software modules, and the software modules can be stored in RAM, flash memory, ROM, PROM, EPROM, EEPROM, registers, hard disks, mobile hard disks, CD-ROM or any other form of storage media well known in the art.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from the storage medium and write information to the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage media may be located in an ASIC. Additionally, the ASIC may be located in a communication device or terminal equipment.
  • the processor and the storage medium may also exist as discrete components in the communication device or terminal equipment.
  • the computer program product includes one or more computer programs or instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, a communication device, a user equipment, or other programmable device.
  • the computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
  • The computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means.
  • The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media.
  • the available media may be magnetic media, such as floppy disks, hard disks, and magnetic tapes; optical media, such as digital video discs (DVDs); or semiconductor media, such as SSDs.
  • “at least one” refers to one or more
  • “multiple” refers to two or more
  • other quantifiers are similar.
  • “And/or” describes the relationship between associated objects, indicating that there can be three relationships. For example, A and/or B can mean: A alone exists, A and B exist simultaneously, and B alone exists.
  • Elements appearing in the singular forms "a", "an", and "the" do not mean "one or only one" unless the context clearly requires otherwise, but rather mean "one or more".
  • "a device” means one or more such devices.
  • "at least one of" means one or any combination of subsequent associated objects, for example, "at least one of A, B and C” includes A, B, C, AB, AC, BC, or ABC.
  • The character “/” generally indicates that the associated objects are in an “or” relationship; in the formulas of this embodiment, the character “/” indicates that the associated objects are in a “division” relationship.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

This application provides a data processing method and device, relating to the field of artificial intelligence. The method can be applied to an artificial intelligence (AI) hardware accelerator and includes: obtaining a first index value; and searching a parameter dictionary for the matrix block corresponding to the first index value, where the parameter dictionary includes P types of matrix blocks and the index values respectively corresponding to the P types of matrix blocks, and each of the matrix blocks is a part of the parameter matrix of a convolutional neural network model. The method is used to improve the performance of the AI hardware accelerator.

Description

Data processing method and device
This application claims priority to Chinese patent application No. 202210895071.7, entitled "数据处理方法及装置" (Data processing method and device), filed with the Chinese Patent Office on July 28, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a data processing method and device.
Background
In recent years, the impressive performance of convolutional neural networks in image classification, image recognition, audio recognition, and other related fields has made them a focus of research and development in both academia and industry. Using an artificial intelligence (AI) hardware accelerator (a dedicated hardware integrated acceleration circuit) to accelerate convolutional neural network operations can improve the running efficiency of convolutional neural network applications and shorten their execution time, and is a current research hotspot.
How to improve the processing performance of an AI hardware accelerator is a problem that currently needs to be solved.
Summary
This application provides a data processing method and device for improving the processing performance of an AI hardware accelerator.
In a first aspect, a data processing method is provided, including: obtaining a parameter matrix of a convolutional neural network model; dividing the parameter matrix into multiple matrix blocks, where the multiple matrix blocks include P types of matrix blocks with different contents; respectively generating an index value for each matrix block in the P types of matrix blocks, where the index value is used to uniquely indicate the corresponding matrix block; and generating a parameter dictionary of the convolutional neural network model, where the parameter dictionary includes the P types of matrix blocks and the index values respectively corresponding to the P types of matrix blocks.
In this method, it is considered that the parameter matrix of a convolutional neural network model can be regarded as a large matrix constructed from small matrix blocks. These matrix blocks include matrix blocks with the same content, so matrix blocks with the same content can be represented by the same index value, and a parameter matrix can therefore be expressed as a number of index values. In this way, the AI hardware accelerator only needs to read from memory the index values corresponding to the parameters of the current operation and then restore the required parameter content inside the AI hardware accelerator according to the obtained index values, which can greatly improve the actual utilization efficiency of the IO bandwidth.
In a possible design, obtaining the parameter matrix of the convolutional neural network model includes: obtaining an original parameter matrix of the convolutional neural network model; and retraining the original parameter matrix to obtain the parameter matrix of the convolutional neural network model that satisfies a constraint, where the constraint includes enabling the parameter matrix to be divided into multiple matrix blocks, the multiple matrix blocks including matrix blocks with the same content.
The above design takes into account that, on the one hand, a convolutional neural network mainly involves two major phases: training and inference. Training refers to the process of using training sample data to continuously adjust the parameters of each layer in the convolutional neural network so that the output of the convolutional neural network model meets the requirements. Inference refers to the process of applying the trained convolutional neural network model in actual scenarios such as image classification, image recognition, or speech recognition, by inputting the image, audio, or other data to be processed into the convolutional neural network model to obtain the processing results. The purpose of training is to select a convolutional neural network model and parameter matrix that can perform inference efficiently and accurately, so in theory any model and parameters that allow the inference accuracy to reach the expected target meet the requirements, which gives the training process of the parameter matrix a certain amount of room for maneuver. On the other hand, as described above, if the parameter matrix of an existing convolutional neural network model in the related art is directly divided into matrix blocks, there may be too many types of matrix blocks, resulting in a poor compression effect. Therefore, in this implementation, the parameter matrix of the convolutional neural network model can be retrained, with constraints added during the retraining process, so that the trained parameter matrix both satisfies the data compression format (that is, the parameter matrix can be divided into multiple matrix blocks and the multiple matrix blocks include matrix blocks with the same content) and preserves the accuracy of inference.
In a possible design, the method further includes: determining the specification of the matrix blocks according to the specifications of the hardware resources and the difficulty of model training.
The above design takes into account that, when determining the specification of the matrix blocks, on the one hand, the larger the matrix block, the higher the compression efficiency but the more difficult it is to train the convolutional neural network model; on the other hand, the smaller the matrix block, the easier it is to train the convolutional neural network model but the lower the compression efficiency. Therefore, the specification of the matrix blocks can be determined according to the specifications of the hardware resources and the difficulty of model training, so as to balance hardware resource overhead against compression gains.
In a second aspect, a data processing method is provided. The method is applied to an artificial intelligence (AI) hardware accelerator and includes: obtaining a first index value; and searching a parameter dictionary for the matrix block corresponding to the first index value, where the parameter dictionary includes P types of matrix blocks and the index values respectively corresponding to the P types of matrix blocks, and each of the matrix blocks is a part of the parameter matrix of a convolutional neural network model.
In this method, it is considered that the parameter matrix of a convolutional neural network model can be regarded as a large matrix constructed from small matrix blocks. These matrix blocks include matrix blocks with the same content, so matrix blocks with the same content can be represented by the same index value, and a parameter matrix can therefore be expressed as a number of index values. In this way, the AI hardware accelerator only needs to read from memory the index values corresponding to the parameters of the current operation and then restore the required parameter content inside the AI hardware accelerator according to the obtained index values, which can greatly improve the actual utilization efficiency of the IO bandwidth.
In a possible design, the method further includes: splicing the matrix blocks corresponding to the first index values to obtain a first parameter set of the convolutional neural network model.
The above design takes into account that, because the specifications of the matrix blocks differ, when the AI hardware accelerator is running, the index values of multiple matrix blocks can be obtained at one time; the matrix block corresponding to each index value is then looked up in the parameter dictionary and the matrix blocks are spliced, thereby obtaining the parameters required for the current operation.
In a possible design, the method further includes: loading the parameter dictionary into an on-chip cache of the AI hardware accelerator.
With the above design, after obtaining the first index value, the data processing device can directly look up the matrix block corresponding to the first index value in the parameter dictionary cached in the on-chip cache of the AI hardware accelerator.
In a possible design, obtaining the first index value includes: reading the first index value from an off-chip memory of the AI hardware accelerator.
In a possible design, the method further includes: caching the first parameter set included in the matrix blocks corresponding to the first index value into the weight buffer of the AI hardware accelerator, so that the AI hardware accelerator performs operations according to the first parameter set.
In a third aspect, a data processing device is provided, including: an obtaining unit, configured to obtain a parameter matrix of a convolutional neural network model; a dividing unit, configured to divide the parameter matrix into multiple matrix blocks, where the multiple matrix blocks include P types of matrix blocks with different contents; an index generation unit, configured to respectively generate an index value for each matrix block in the P types of matrix blocks, where the index value is used to uniquely indicate the corresponding matrix block; and a dictionary generation unit, configured to generate a parameter dictionary of the convolutional neural network model, where the parameter dictionary includes the P types of matrix blocks and the index values respectively corresponding to the P types of matrix blocks.
In a possible design, the obtaining unit being configured to obtain the parameter matrix of the convolutional neural network model includes: the obtaining unit being configured to obtain an original parameter matrix of the convolutional neural network model; and the obtaining unit being configured to retrain the original parameter matrix to obtain the parameter matrix of the convolutional neural network model that satisfies a constraint, where the constraint includes enabling the parameter matrix to be divided into multiple matrix blocks, the multiple matrix blocks including matrix blocks with the same content.
In a possible design, the dividing unit is further configured to determine the specification of the matrix blocks according to the specifications of the hardware resources and the difficulty of model training.
In a fourth aspect, a data processing device is provided. The data processing device is applied to an artificial intelligence (AI) hardware accelerator and includes: an obtaining unit, configured to obtain a first index value; and a search unit, configured to search a parameter dictionary for the matrix block corresponding to the first index value, where the parameter dictionary includes P types of matrix blocks and the index values respectively corresponding to the P types of matrix blocks, and each of the matrix blocks is a part of the parameter matrix of a convolutional neural network model.
In a possible design, the search unit is further configured to splice the matrix blocks corresponding to the first index values to obtain a first parameter set of the convolutional neural network model.
In a possible design, the obtaining unit is further configured to load the parameter dictionary into an on-chip cache of the AI hardware accelerator.
In a possible design, the obtaining unit being configured to obtain the first index value includes: the obtaining unit being configured to read the first index value from an off-chip memory of the AI hardware accelerator.
In a possible design, the data processing device further includes: a writing unit, configured to cache the first parameter set included in the matrix blocks corresponding to the first index value into the weight buffer of the AI hardware accelerator, so that the AI hardware accelerator performs operations according to the first parameter set.
In a fifth aspect, a communication device is provided, including a processor and an interface. The processor receives or sends data through the interface, and the processor is configured to implement the method described in the above first aspect or any design of the first aspect, or in the above second aspect or any design of the second aspect.
In a sixth aspect, a computer-readable storage medium is provided. Instructions are stored in the computer-readable storage medium, and when the instructions are run on a processor, they implement the method described in the above first aspect or any design of the first aspect, or in the above second aspect or any design of the second aspect.
A computer program product is provided, the computer program product including instructions which, when run on a processor, are used to implement the method described in the above first aspect or any design of the first aspect, or in the above second aspect or any design of the second aspect.
Brief Description of the Drawings
FIG. 1 is a first schematic structural diagram of an AI hardware accelerator according to this application;
FIG. 2 is a schematic diagram of the running flow of an AI hardware accelerator after the parameters of a convolutional neural network model are compressed according to this application;
FIG. 3 is a schematic diagram of parameter compression according to this application;
FIG. 4 is a first schematic flowchart of a data processing method according to this application;
FIG. 5 is a schematic structural diagram of an SOC according to this application;
FIG. 6 is a second schematic structural diagram of an AI hardware accelerator according to this application;
FIG. 7 is a second schematic flowchart of a data processing method according to this application;
FIG. 8 is a first schematic structural diagram of a data processing apparatus according to this application;
FIG. 9 is a second schematic structural diagram of a data processing apparatus according to this application;
FIG. 10 is a third schematic structural diagram of a data processing apparatus according to this application.
Detailed Description
The technical solutions in the embodiments are described below with reference to the accompanying drawings. To describe the technical solutions clearly, the words "first", "second", and the like are used in the embodiments of this application to distinguish between identical or similar items whose functions and purposes are basically the same. A person skilled in the art will understand that the words "first", "second", and the like do not limit quantity or execution order, and do not imply that the items are necessarily different. In addition, in the embodiments, words such as "exemplary" or "for example" are used to give examples, illustrations, or explanations; any embodiment or design described as "exemplary" or "for example" should not be interpreted as being preferred or advantageous over other embodiments or designs. Rather, such words are intended to present the related concepts in a concrete manner to facilitate understanding.
To facilitate understanding of the technical solutions provided in the embodiments, the related technologies involved are introduced first.
A convolutional neural network is an AI computation method commonly used in image classification, image recognition, audio recognition, and other related fields. Generally, a convolutional neural network can be used to recognize features in input information (for example, an image).
A convolutional neural network generally involves four kinds of layers: convolution (Conv) layers, activation function layers, that is, rectified linear unit (ReLU) layers, pooling layers, and fully connected (FC) layers.
A convolutional neural network model usually contains multiple convolution, activation, and pooling layers that appear alternately several times, and the data is finally fed into a fully connected layer to obtain the output result, for example: input image -> Conv -> ReLU -> Conv -> ReLU -> Pooling -> Conv -> ReLU -> Conv -> ReLU -> Pooling -> Conv -> ReLU -> Conv -> ReLU -> Pooling -> FC -> output.
The convolution layer recognizes features in the input image data through multiple filters. Each filter has a scanning range and scans the data within a certain region of the input image. The result computed by the current convolution layer is passed to the next layer (which may be an activation function layer, a pooling layer, or a fully connected layer) for processing.
The activation function layer performs an operation similar to MAX(0, x) on the input image data: each value of the input is compared with 0; values greater than 0 are kept, and values smaller than 0 are set to 0. The activation function layer determines the sparsity rate of the input image data (that is, the percentage of zero values in the data) and does not change the size (data volume) of the input image data.
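As a small illustration of the MAX(0, x) operation and the sparsity rate it produces (a sketch using NumPy; the array values are made up for the example):

```python
import numpy as np

# A small, made-up block of feature data.
x = np.array([[-1.5,  0.3, -0.2],
              [ 2.0, -0.7,  0.0],
              [ 0.4, -3.1,  1.2]])

# ReLU: keep values greater than 0, set the rest to 0 (MAX(0, x)).
y = np.maximum(0.0, x)

# Sparsity rate: proportion of zero values in the output.
sparsity = np.count_nonzero(y == 0.0) / y.size
print(y)
print(f"sparsity rate = {sparsity:.2%}")  # 5 of the 9 values are zero here
```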
The pooling layer performs downsampling: it extracts data from every other row or column of the two-dimensional matrix of each layer of the input data, thereby reducing the size of the input image data.
The operation of a fully connected layer is similar to that of a convolution layer, except that a fully connected layer's filter does not scan a small region of the input image data but scans the entire input image data at once and outputs one value. A fully connected layer has multiple filters, corresponding to multiple different, very specific image features, and the output values act as "scores" indicating the "likelihood" that the input image data contains those features.
The core of an AI hardware accelerator is the convolution layers and fully connected layers. In most convolutional neural networks, the computation of the convolution and fully connected layers can account for more than 90% of the total computation of the network, so the computing performance of the convolution and fully connected layers usually determines the overall performance of the AI hardware accelerator.
FIG. 1 is a schematic structural diagram of an AI accelerator according to this embodiment.
The AI accelerator 10 includes a bus interface unit (BIU) 110, a direct memory access controller (DMAC) 101, a weight buffer 102 (also called a parameter buffer), an input buffer 103, a matrix computing unit (cube unit) 104, a vector computing unit (vector unit) 105, an accumulator 106, a unified buffer 107, a flow controller (flow control) 108, and an instruction fetch buffer 109. The weight buffer 102, the input buffer 103, the unified buffer 107, and the instruction fetch buffer 109 are on-chip buffers of the AI accelerator 10.
The AI accelerator 10 is mounted on a host CPU 20 as a coprocessor, and the host CPU assigns tasks to it. Specifically, the DMAC 101 in the AI accelerator 10 directly accesses the memory through the BIU 110 to obtain the feature data and the parameters of the convolutional neural network, caches the feature data in the input buffer 103, and caches the parameters in the weight buffer 102. The memory may be an external storage unit private to the AI hardware accelerator architecture, for example, a double data rate (DDR) memory.
It should be noted that the "parameters" in this embodiment specifically refer to the parameters obtained by training the convolutional neural network model; in other words, these parameters represent the trained convolutional neural network model. Because the parameters of a convolutional neural network model are usually stored and computed in matrix form, for ease of description the matrix constructed from the parameters is referred to below as the "parameter matrix". In addition, in this embodiment, the data on which inference is to be performed using the convolutional neural network model to obtain an inference result is referred to as "feature data", and the matrix composed of the feature data is referred to as the "feature data matrix".
For example, in a scenario where the convolutional neural network model is used for image recognition, the parameters can be understood as the parameters configured in each layer of the convolutional neural network model used to recognize images, and these parameters form the "parameter matrix". The data of the image to be recognized that is input into the convolutional neural network model is the feature data, and this feature data forms the "feature data matrix". Unless otherwise specified, the parameters, parameter matrix, feature data, and feature data matrix mentioned below can be understood with reference to the above description.
In the AI accelerator 10, the matrix computing unit 104 is the core component and performs the matrix-by-matrix multiplication corresponding to the convolution layer and fully connected layer operations. When performing a convolution layer or fully connected layer operation, the matrix computing unit 104 reads the feature data matrix from the input buffer 103 and the parameter matrix from the weight buffer 102, performs the matrix multiplication, and stores the partial or final operation results in the accumulator 106.
The vector computing unit 105 is used to further process the output of the matrix computing unit 104 when necessary, for example vector multiplication, vector addition, exponential operations, logarithmic operations, and comparisons. The vector computing unit 105 is mainly used for network computations other than the convolution and fully connected layers (for example, it can be used for the activation function layers and pooling layers).
The unified buffer 107 is used to store output results and the input data of certain layers (for example, activation function layers and pooling layers). The BIU 110 is used for interaction among the memory, the DMAC 101, and the instruction fetch buffer 109 through the bus. The DMAC 101 is used to move data from the memory to the weight buffer 102, the input buffer 103, or the unified buffer 107, or to move data from the unified buffer 107 to the memory. The instruction fetch buffer 109 is used to cache instructions, which control the working process of the AI hardware accelerator 10. The flow controller 108 is used to manage the execution of instructions.
During the running of the AI hardware accelerator, specifically when implementing convolution layer and fully connected layer operations, the number of parameters involved is so large that they cannot all be kept in the weight buffer 102 inside the AI hardware accelerator. Therefore, during inference, the parameters involved in the current computation usually need to be imported in real time, and the parameters imported in real time occupy the input/output (IO) bandwidth of the AI hardware accelerator 10. If the IO bandwidth becomes a bottleneck, the computing capacity of the AI hardware accelerator 10 remains idle, which reduces its overall performance.
Therefore, compressing the parameters of the convolutional neural network model is one way to prevent the IO bandwidth from becoming a bottleneck and to improve the overall performance of the AI hardware accelerator 10.
Specifically, data compression is a technique that reduces data volume by changing the expression format and organizational structure of the data according to its redundancy. By compressing the parameters to reduce the volume of the convolutional neural network parameters and decompressing them on-chip in real time to restore the original parameters, more external parameters can be supplied at the same bandwidth, thereby improving the actual utilization efficiency of the IO bandwidth.
For example, as shown in FIG. 2, the parameters of the convolutional neural network model are first compressed offline by a compression algorithm to obtain compressed parameters, and the compressed parameters are stored in the off-chip memory. The feature data is also stored in the off-chip memory. The memory is connected to the AI hardware accelerator 10 through an advanced extensible interface (AXI) bus. When the AI hardware accelerator runs, a part of the compressed data is obtained from the compressed parameters according to the address addr of the parameters required in real time. If the length of the compressed data is len, this part of the compressed data is read on-chip through an IO read request <addr, len> and is then restored to the original parameters on-chip through real-time online decompression. If the volume of the compressed data is reduced by a factor of 2 relative to the original parameters, then compared with reading the original parameters directly from the memory, the same function can be achieved with half the original bandwidth, saving the precious bandwidth resources of the AI hardware accelerator.
For example, in FIG. 2, if the rate at which the AI hardware accelerator 10 reads the compressed data from the memory is 2.5 Gbps, then through on-chip online decompression, the rate at which the processing elements (PEs) in the AI hardware accelerator 10 actually obtain the original parameters can reach 5 Gbps. In this way, the actual utilization efficiency of the IO bandwidth is improved.
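A back-of-the-envelope sketch of this effect (the 2.5 Gbps figure and the 2x ratio are simply the example numbers from FIG. 2; the real gain also depends on the on-chip decompressor keeping up):

```python
def effective_parameter_rate(io_rate_gbps: float, compression_ratio: float) -> float:
    """Rate at which original parameters become available on-chip,
    assuming the on-chip decompressor is not the bottleneck."""
    return io_rate_gbps * compression_ratio

print(effective_parameter_rate(2.5, 2.0))  # 5.0 Gbps, matching the FIG. 2 example
```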
It should be noted that, in this embodiment, operations performed inside the AI hardware accelerator while it is running are called "online" operations, and operations performed outside the AI hardware accelerator are called "offline" operations. For example, the "offline compression" above can be understood as compressing the parameters of the convolutional neural network model outside the AI hardware accelerator, using the computing power of other devices, to obtain the compressed parameters. Likewise, the "online decompression" above can be understood as decompressing the compressed parameters inside the AI hardware accelerator using the accelerator's own computing power.
In the above process, the decompression module responsible for online decompression is the most critical link, and its performance and area must meet the corresponding chip specifications. At present, the decompression modules of commonly used compression algorithms achieve relatively low bandwidth; for example, a single Huffman decompression engine can reach a decoding rate of only about 1-2 bit/cycle. This decompression speed is clearly far too low for the bandwidth of real application scenarios (about 32 B/cycle, that is, 256 bit/cycle). Even a single engine specially optimized for decompression bandwidth can only reach about 32 bit/cycle. How to achieve high-bandwidth real-time decompression is therefore a problem that needs to be solved.
To solve this technical problem, in one related technology, the compression ratio of the parameters can be improved by maximizing the sparsity rate of the parameters during the model training stage, where the sparsity rate refers to the proportion of zero values among the parameters.
However, in this related technology, on the one hand, raising the parameter sparsity rate too far affects the computation accuracy of the convolutional neural network; on the other hand, for parameters with a low sparsity rate, the compression ratio is hard to improve, and the gain in reduced IO bandwidth demand is not obvious.
In view of the above, this embodiment provides a data processing method. In this method, the parameter matrix of a convolutional neural network model can be viewed as a large matrix constructed from small matrix blocks. Because some of these matrix blocks have identical content, blocks with the same content can be represented by a single index value, so that a parameter matrix can be expressed as a sequence of index values. The AI hardware accelerator then only needs to read, from memory, the index values corresponding to the parameters of the current operation, and restores the required parameter content inside the accelerator from the obtained index values, which greatly improves the actual utilization efficiency of the IO bandwidth.
Specifically, as shown in FIG. 3, the parameter matrix A of the convolutional neural network model can be divided into multiple matrix blocks. Taking matrix block a, matrix block b, and matrix block c as examples, matrix block a and matrix block b have the same content, so the same index value can be used to represent both of them. In FIG. 3, index value 1 represents the content of matrix blocks a and b, and index value 2 represents the content of matrix block c. In addition, a parameter dictionary is constructed in this embodiment to record the correspondence between each index value and the matrix block content.
In this way, while the AI hardware accelerator is running, when it needs the parameters in matrix block a of parameter matrix A, it only needs to read from memory the index value 1 corresponding to matrix block a, and then determine the matrix content of index value 1 by querying the parameter dictionary, so as to obtain the parameters in matrix block a and complete the operation.
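A minimal sketch of the offline dictionary-building step and the online lookup described above (the block size, matrix values, and helper names are illustrative assumptions, not a required implementation):

```python
import numpy as np

def build_dictionary(param_matrix: np.ndarray, block: int):
    """Split the parameter matrix into block x block tiles, assign one index
    value per distinct tile content, and return (dictionary, index map)."""
    rows, cols = param_matrix.shape
    dictionary = {}          # index value -> matrix block content
    content_to_index = {}    # block content (as bytes) -> index value
    index_map = np.empty((rows // block, cols // block), dtype=np.int32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = param_matrix[i:i + block, j:j + block]
            key = tile.tobytes()
            if key not in content_to_index:
                idx = len(content_to_index)
                content_to_index[key] = idx
                dictionary[idx] = tile.copy()
            index_map[i // block, j // block] = content_to_index[key]
    return dictionary, index_map

def lookup(dictionary, index_value):
    """Restore a matrix block from its index value (the 'online' step)."""
    return dictionary[index_value]

# Usage with a toy parameter matrix A built from two distinct 2x2 blocks.
a = np.array([[1., 2.], [3., 4.]])
c = np.array([[5., 6.], [7., 8.]])
A = np.block([[a, a], [a, c]])
d, idx = build_dictionary(A, block=2)
print(idx)           # [[0 0]   -> the two copies of block a share index 0
                     #  [0 1]]  -> block c gets index 1
print(lookup(d, 0))  # restores the content of block a
```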
The following describes, with reference to examples, the specific implementation of the data processing method provided in this embodiment when it is applied to the scenario of compressing the parameter matrix of a convolutional neural network model. Specifically, in this scenario, the method can be implemented by a data processing apparatus.
In one implementation, the data processing apparatus may include a device capable of data processing such as a desktop computer, a tablet computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), or an augmented reality (AR)/virtual reality (VR) device. In another implementation, the data processing apparatus may also be an AI hardware accelerator. The specific form of the data processing apparatus is not specially limited in the embodiments of this disclosure.
As shown in FIG. 4, the method may include:
S201: The data processing apparatus obtains a parameter matrix of a convolutional neural network model.
In this embodiment, the specific type of the convolutional neural network model is not limited. For example, the convolutional neural network model may be one used in fields such as image classification, image recognition, or speech recognition.
As described above, the "parameters" in this embodiment specifically refer to the parameters obtained by training the convolutional neural network model; in other words, these parameters represent the trained model. Because the parameters of a convolutional neural network model are usually stored and computed in matrix form, the matrix constructed from the parameters is referred to as the "parameter matrix" for ease of description.
In one implementation, considering that directly dividing the parameter matrix of an existing convolutional neural network model in the related art into matrix blocks may result in too many kinds of matrix blocks and therefore a poor compression effect, S201 may specifically include:
S2011: The data processing apparatus obtains an original parameter matrix of the convolutional neural network model.
The original parameter matrix can be understood as a parameter matrix that has not been processed according to S2012 below; in other words, the original parameter matrix may be a parameter matrix obtained by training the convolutional neural network model according to a model training method in the related art.
S2012: The data processing apparatus retrains the original parameter matrix to obtain a parameter matrix of the convolutional neural network model that satisfies a constraint condition.
The constraint condition specifically requires that the parameter matrix can be divided into a plurality of matrix blocks and that the plurality of matrix blocks include matrix blocks with identical content. In other words, the constraint condition can also be understood as: the parameter matrix can be composed of a number of identical matrix blocks.
This implementation takes two points into account. On the one hand, a convolutional neural network involves two main stages: training and inference. Training is the process of using training sample data and continuously adjusting the parameters of each layer of the convolutional neural network so that the output of the model meets the requirements. Inference is the process of applying the trained convolutional neural network model to actual scenarios such as image classification, image recognition, or speech recognition, by feeding the images, audio, or other data to be processed into the model to obtain a processing result. The purpose of training is to select a convolutional neural network model and parameter matrix that can perform inference efficiently and accurately; in theory, any model and parameters that allow the inference accuracy to reach the expected target are acceptable, which leaves some room for manoeuvre in how the parameter matrix is trained. On the other hand, as described above, if the parameter matrix of an existing convolutional neural network model in the related art is directly divided into matrix blocks, there may be too many kinds of matrix blocks, resulting in a poor compression effect. Therefore, in this implementation, the parameter matrix of the convolutional neural network model can be retrained with an added constraint condition, so that the trained parameter matrix both satisfies the data compression format (that is, the parameter matrix can be divided into a plurality of matrix blocks that include blocks with identical content) and preserves the inference accuracy.
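The embodiment does not prescribe how the constraint condition is enforced during retraining. One heuristic that approximates it (an assumption of this sketch, not the patent's method) is to periodically project the weight matrix onto the constrained format by clustering its tiles and snapping each tile to its cluster centroid between ordinary gradient steps:

```python
import numpy as np

def project_to_shared_blocks(W: np.ndarray, block: int, num_kinds: int, iters: int = 10):
    """Project a weight matrix onto the constrained format: every block x block
    tile is replaced by one of `num_kinds` shared tiles (k-means over tiles).
    Interleaving this projection with normal gradient steps is one heuristic
    way to retrain toward the constraint; that scheme is an assumption here."""
    rows, cols = W.shape
    tiles = np.stack([W[i:i + block, j:j + block].ravel()
                      for i in range(0, rows, block)
                      for j in range(0, cols, block)])
    # Simple k-means over tile contents.
    rng = np.random.default_rng(0)
    centroids = tiles[rng.choice(len(tiles), num_kinds, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmin(((tiles[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for k in range(num_kinds):
            members = tiles[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    # Write the shared tiles back into W.
    out = W.astype(float).copy()
    t = 0
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            out[i:i + block, j:j + block] = centroids[assign[t]].reshape(block, block)
            t += 1
    return out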
S202: The data processing apparatus divides the parameter matrix into multiple matrix blocks.
The multiple matrix blocks include P kinds of matrix blocks with different content.
For example, suppose the parameter matrix is a 1000 x 1000 matrix. If the parameter matrix is divided at a granularity of 10 x 10, it can be divided into 10,000 matrix blocks (that is, the parameter matrix is divided into multiple matrix blocks). Furthermore, if 7,000 of the 10,000 matrix blocks have the same content as other matrix blocks, that is, the 10,000 matrix blocks contain 3,000 kinds of matrix blocks with different content (P is 3,000), then the whole parameter matrix can be represented using 3,000 kinds of matrix blocks.
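A small, self-contained sketch of this counting step (the matrix contents below are randomly generated purely to reproduce the orders of magnitude in the example; the real parameter matrix comes from the trained model):

```python
import numpy as np

def count_block_kinds(W: np.ndarray, block: int) -> int:
    """Number P of distinct block x block tiles in W (distinct by content)."""
    rows, cols = W.shape
    kinds = {W[i:i + block, j:j + block].tobytes()
             for i in range(0, rows, block)
             for j in range(0, cols, block)}
    return len(kinds)

# Toy check: a 1000 x 1000 matrix tiled from 3000 candidate 10 x 10 patterns
# yields P <= 3000 (the figures in the text are only an illustration).
rng = np.random.default_rng(0)
patterns = rng.integers(0, 4, size=(3000, 10, 10))
choice = rng.integers(0, 3000, size=(100, 100))
W = np.block([[patterns[choice[r, c]] for c in range(100)] for r in range(100)])
print(count_block_kinds(W, 10))  # at most 3000 distinct kinds
```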
S203: The data processing apparatus generates an index value for each of the P kinds of matrix blocks.
The index value uniquely indicates the corresponding matrix block.
Continuing the example above, if the 10,000 matrix blocks of the parameter matrix contain 3,000 kinds of matrix blocks with different content, the values 0 to 2999 can be used as the index values of these 3,000 kinds of matrix blocks.
S204: The data processing apparatus generates a parameter dictionary of the convolutional neural network model.
The parameter dictionary includes the P kinds of matrix blocks and the index values respectively corresponding to the P kinds of matrix blocks.
In one implementation, considering that when determining the matrix block size, a larger block gives higher compression efficiency but makes the convolutional neural network model harder to train, whereas a smaller block makes the model easier to train but lowers the compression efficiency, the method may further include:
S205: The data processing apparatus determines the size of the matrix blocks according to the specifications of the hardware resources and the difficulty of model training.
The hardware resource specifications may specifically include the specifications of the hardware resources of the device used to train the convolutional neural network model. The lower the hardware resource specifications, the smaller the matrix block size; likewise, the more difficult the model training, the smaller the matrix block size.
In addition, after the parameter dictionary is generated, the method may further include:
S206: The data processing apparatus generates compressed data of the parameter matrix according to the index values of the P kinds of matrix blocks.
For example, each matrix block in the parameter matrix is replaced by its corresponding index value, thereby obtaining the compressed data of the parameter matrix.
For example, suppose each matrix block is 128 B and the index value corresponding to each matrix block is 1 B. Then, where 128 B of parameters previously had to be loaded, only 1 B is needed now, a compression ratio of up to 128x.
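A sketch of S206 together with the size arithmetic above (the `dictionary` argument is assumed to come from a dictionary-building step such as the one sketched earlier, and the 1 B index assumes P does not exceed 256):

```python
import numpy as np

def to_index_stream(W: np.ndarray, block: int, dictionary: dict) -> np.ndarray:
    """S206: replace every block x block tile of W with the index value of the
    dictionary entry whose content matches it (dictionary: index -> tile)."""
    content_to_index = {tile.tobytes(): idx for idx, tile in dictionary.items()}
    rows, cols = W.shape
    stream = [content_to_index[W[i:i + block, j:j + block].tobytes()]
              for i in range(0, rows, block)
              for j in range(0, cols, block)]
    return np.asarray(stream, dtype=np.uint8)   # assumes P <= 256, so 1 B per index

# Size arithmetic from the example: a 128 B block replaced by a 1 B index.
block_bytes, index_bytes = 128, 1
print(f"per-block compression ratio: {block_bytes // index_bytes}x")   # 128x
```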
The following describes, with reference to examples, the specific implementation of the data processing method provided in this embodiment when it is applied to the inference process of a convolutional neural network model. Specifically, in this scenario, the method can be implemented by a data processing apparatus.
Specifically, the functions of the data processing apparatus can be implemented by some or all of the hardware in an AI hardware accelerator.
In one implementation, as shown in FIG. 5, the AI hardware accelerator 30 may include a data processing apparatus 301 and a computing engine 302. The data processing apparatus 301 is used to implement the online decompression of parameters in the method provided in this embodiment. Specifically, the data processing apparatus 301 may include a compressed parameter buffer 3011 for caching compressed parameters and an online decompression engine 3012 for decompressing parameters online.
The computing engine 302 is used to perform inference operations according to the parameters of the convolutional neural network model. Taking an ARM SOC architecture as an example, the AI hardware accelerator 30 is also connected, through an AXI bus, to the host CPU and to the DDR that serves as memory.
Specifically, when the AI hardware accelerator 30 is considered together with the AI hardware accelerator 10 shown in FIG. 1, the computing engine 302 can be understood as a functional module that includes the weight buffer 102, the matrix computing unit 104, and other modules. Furthermore, as shown in FIG. 6, the structure of the AI hardware accelerator 30 may also include: a bus interface unit (BIU) 310, a direct memory access controller (DMAC) 311, a weight buffer 302 (also called a parameter buffer), an input buffer 303, a matrix computing unit (cube unit) 304, a vector computing unit (vector unit) 305, an accumulator 306, a unified buffer 307, a flow controller 308, and an instruction fetch buffer 309. For the functional modules of the AI hardware accelerator 30 in FIG. 6 other than the data processing apparatus 301, refer to the descriptions of the corresponding functional modules in FIG. 1 above; details are not repeated here.
The following describes the method provided in this embodiment with reference to the running process of the data processing apparatus 301. Specifically, as shown in FIG. 7, the method includes:
S401: The data processing apparatus 301 obtains a first index value.
The first index value corresponds to some of the parameters in the parameter matrix of the convolutional neural network model.
Specifically, S401 may include: the data processing apparatus 301 reads the first index value from the off-chip memory of the AI hardware accelerator 30.
For example, while the AI hardware accelerator 30 is running, after the currently required parameters are determined, the data processing apparatus 301 accesses the corresponding address in the memory through the BIU to obtain the first index value.
S402: The data processing apparatus 301 looks up, in the parameter dictionary, the matrix block corresponding to the first index value.
As described in S204 above, the parameter dictionary in this embodiment includes P kinds of matrix blocks and the index values respectively corresponding to the P kinds of matrix blocks, where each of the P kinds of matrix blocks is a part of the parameter matrix of the convolutional neural network model.
For example, in the data processing apparatus 301, after the first index value is obtained, it is first stored in the compressed parameter buffer 3011. The online decompression engine 3012 then reads the first index value from the compressed parameter buffer 3011 and looks up the matrix block corresponding to the first index value in the parameter dictionary, completing the online decompression of the parameters.
In one implementation, given the size of the matrix blocks, the AI hardware accelerator can, at run time, obtain the index values of multiple matrix blocks at a time, look up the matrix block corresponding to each index value in the parameter dictionary, and concatenate the blocks to obtain the parameters required for the current operation. Therefore, the method further includes:
S403: The data processing apparatus 301 concatenates the matrix blocks corresponding to the first index value to obtain a first parameter set of the convolutional neural network model.
Specifically, the first index value may include multiple index values. After determining the matrix block corresponding to each of those index values, the data processing apparatus 301 can concatenate the matrix blocks corresponding to the first index value to obtain one parameter set of the convolutional neural network model (that is, the first parameter set).
For example, in the data processing apparatus 301, the online decompression engine 3012 concatenates the matrix blocks corresponding to the first index value to obtain the first parameter set of the convolutional neural network model.
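A sketch of the S402 lookup and S403 concatenation (the dictionary layout, the row-major stitching order, and the tile shapes are assumptions of this illustration):

```python
import numpy as np

def decompress(index_values, dictionary, blocks_per_row: int) -> np.ndarray:
    """Look up each index value in the parameter dictionary and stitch the
    resulting blocks, row-major, into one parameter set (S402 + S403)."""
    tiles = [dictionary[int(i)] for i in index_values]            # S402: lookup
    rows = [np.hstack(tiles[r:r + blocks_per_row])                # S403: stitch
            for r in range(0, len(tiles), blocks_per_row)]
    return np.vstack(rows)

# Usage: two distinct 2x2 blocks, four index values -> a 4x4 parameter set.
dictionary = {0: np.ones((2, 2)), 1: np.full((2, 2), 7.0)}
first_index_value = [0, 1, 1, 0]
print(decompress(first_index_value, dictionary, blocks_per_row=2))
```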
S404: The data processing apparatus 301 caches the first parameter set into the weight buffer 302 of the AI hardware accelerator 30.
For example, in the data processing apparatus 301, after determining the first parameter set, the online decompression engine 3012 caches the first parameter set into the weight buffer 302.
S405: The AI hardware accelerator 30 performs an operation according to the first parameter set.
For the process in which the AI hardware accelerator 30 performs an operation according to the first parameter set, refer to the related art; details are not repeated here.
In one implementation, to reduce the IO bandwidth occupied while the AI hardware accelerator performs inference, the method may further include:
S406: The data processing apparatus 301 loads the parameter dictionary into an on-chip cache of the AI hardware accelerator.
For example, the data processing apparatus 301 may preload the parameter dictionary into its internal buffer, for example the compressed parameter buffer 3011 in FIG. 5. As another example, the data processing apparatus 301 may also load the parameter dictionary into another on-chip cache of the AI hardware accelerator, for example the weight buffer 302.
In this way, after obtaining the first index value, the data processing apparatus 301 can look up the matrix block corresponding to the first index value directly in the parameter dictionary cached in the on-chip cache of the AI hardware accelerator.
It should be noted that, in practice, the data processing apparatus 301 may also preload only part of the parameter dictionary into the on-chip cache, which can further reduce the IO bandwidth occupied; this is not limited in this embodiment.
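One way to model a partially loaded parameter dictionary is a small on-chip lookup table that falls back to off-chip memory on a miss; the eviction policy and the fetch_off_chip callback below are assumptions of this sketch rather than anything the embodiment prescribes:

```python
from collections import OrderedDict

class PartialDictionaryCache:
    """Keeps the most recently used dictionary entries on-chip; on a miss,
    fetches the block from off-chip memory (fetch_off_chip is a stand-in)."""
    def __init__(self, capacity: int, fetch_off_chip):
        self.capacity = capacity
        self.fetch_off_chip = fetch_off_chip
        self.entries = OrderedDict()        # index value -> matrix block

    def lookup(self, index_value):
        if index_value in self.entries:
            self.entries.move_to_end(index_value)    # hit: no extra IO needed
            return self.entries[index_value]
        block = self.fetch_off_chip(index_value)     # miss: costs IO bandwidth
        self.entries[index_value] = block
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)         # evict least recently used
        return block
```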
In the above method provided in this embodiment, the parameter matrix of the convolutional neural network model can be viewed as a large matrix constructed from small matrix blocks. Because some of these matrix blocks have identical content, blocks with the same content can be represented by a single index value, so that a parameter matrix can be expressed as a sequence of index values. The AI hardware accelerator then only needs to read, from memory, the index values corresponding to the parameters of the current operation, and restores the required parameter content inside the accelerator from the obtained index values, which greatly improves the actual utilization efficiency of the IO bandwidth.
As an example, take the Visual Geometry Group 16 (VGG16) model proposed by the Visual Geometry Group at the University of Oxford and the Transformer model proposed by Google in 2017. Table 1 below lists, for one experiment, the data volumes of the parameters of the two models under the existing technical solution and under the technical solution provided in this embodiment.
Table 1
Here, variable, avg_grad, m, and v are parameters of different layers of the convolutional neural network model. It can be seen that, in this experiment, the technical solution provided in this embodiment achieves compression ratios of 150x to 220x, which can greatly improve the utilization efficiency of the IO bandwidth and the overall performance of the AI hardware accelerator 10.
It can be understood that the various numeric labels involved in the above data processing method are merely distinctions made for ease of description and are not intended to limit the scope of this embodiment. For example, in the data processing method described in FIG. 4, S205 may be performed at any time before the matrix blocks are divided in S202; likewise, in the data processing method described in FIG. 7, S406 may be performed at any time before the matrix block lookup in S402. The sequence numbers of the processes do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic.
The data processing method provided in this embodiment has been described in detail above with reference to FIG. 3 to FIG. 7. The apparatuses and devices corresponding to the data processing method provided in this embodiment are described below.
FIG. 8 is a schematic structural diagram of a data processing apparatus according to this embodiment. The data processing apparatus 50 may be a software/hardware apparatus for training a convolutional neural network model. Specifically, the data processing apparatus 50 may include a device capable of data processing such as a desktop computer, a tablet computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), or an augmented reality (AR)/virtual reality (VR) device; alternatively, the data processing apparatus 50 may be an AI hardware accelerator. The data processing apparatus 50 can be used to perform all or some of the steps performed by the data processing apparatus in FIG. 4 above. Specifically, the data processing apparatus 50 includes:
an obtaining unit 501, configured to obtain a parameter matrix of a convolutional neural network model;
a splitting unit 502, configured to divide the parameter matrix into a plurality of matrix blocks, where the plurality of matrix blocks include P kinds of matrix blocks with different content;
an index generation unit 503, configured to generate an index value for each of the P kinds of matrix blocks, where the index value uniquely indicates the corresponding matrix block; and
a dictionary generation unit 504, configured to generate a parameter dictionary of the convolutional neural network model, where the parameter dictionary includes the P kinds of matrix blocks and the index values respectively corresponding to the P kinds of matrix blocks.
Optionally, that the obtaining unit 501 is configured to obtain a parameter matrix of a convolutional neural network model includes:
the obtaining unit 501 is configured to obtain an original parameter matrix of the convolutional neural network model; and
the obtaining unit 501 is configured to retrain the original parameter matrix to obtain a parameter matrix of the convolutional neural network model that satisfies a constraint condition, where the constraint condition requires that the parameter matrix can be divided into a plurality of matrix blocks and that the plurality of matrix blocks include matrix blocks with identical content.
Optionally, the splitting unit 502 is further configured to determine the size of the matrix blocks according to the specifications of the hardware resources and the difficulty of model training.
FIG. 9 is a schematic structural diagram of another data processing apparatus according to this embodiment. The data processing apparatus 60 may be a software/hardware apparatus for performing convolutional neural network model inference. Specifically, the data processing apparatus 60 may also be all or part of the software/hardware in an AI hardware accelerator. The data processing apparatus 60 can be used to perform all or some of the steps performed by the data processing apparatus in FIG. 7 above.
Specifically, the data processing apparatus 60 includes:
an obtaining unit 601, configured to obtain a first index value; and
a lookup unit 602, configured to look up, in a parameter dictionary, the matrix block corresponding to the first index value, where the parameter dictionary includes P kinds of matrix blocks and the index values respectively corresponding to the P kinds of matrix blocks, and each matrix block is a part of a parameter matrix of a convolutional neural network model.
Optionally, the lookup unit 602 is further configured to concatenate the matrix blocks corresponding to the first index value to obtain a first parameter set of the convolutional neural network model.
Optionally, the obtaining unit 601 is further configured to load the parameter dictionary into an on-chip cache of the AI hardware accelerator.
Optionally, that the obtaining unit 601 is configured to obtain a first index value includes: the obtaining unit 601 is configured to read the first index value from an off-chip memory of the AI hardware accelerator.
Optionally, the data processing apparatus 60 further includes:
a writing unit 603, configured to cache the first parameter set included in the matrix blocks corresponding to the first index value into a weight buffer of the AI hardware accelerator, so that the AI hardware accelerator performs an operation according to the first parameter set.
FIG. 10 is a schematic structural diagram of another data processing apparatus according to this embodiment. The data processing apparatus 70 may be a chip or a system on chip. Specifically, it may include a device capable of data processing such as a desktop computer, a tablet computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), or an augmented reality (AR)/virtual reality (VR) device. In another implementation, the data processing apparatus may also be all or part of the software/hardware in an AI hardware accelerator.
The data processing apparatus 70 may include some or all of the following components: a processor 701, a communication line 706, a memory 703, and at least one communication interface 702.
The processor 701 is configured to perform all or some of the steps performed by the data processing apparatus in the data processing method provided in FIG. 4 or FIG. 7 of this embodiment.
Specifically, the processor 701 may include a general-purpose central processing unit (CPU), and may also include a microprocessor, a field programmable gate array (FPGA), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
In a specific implementation, as an embodiment, the processor 701 may include one or more CPUs, for example CPU0 and CPU1 in FIG. 10.
In addition, the memory 703 may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), used as an external cache. By way of example and not limitation, many forms of RAM are available, such as a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchlink dynamic random access memory (SLDRAM), and a direct rambus random access memory (DR RAM). The memory 703 may exist independently and be connected to the processor 701 through the communication line 706, or the memory 703 may be integrated with the processor 701.
The memory 703 stores computer instructions. By executing the computer instructions stored in the memory 703, the processor 701 can perform all or some of the steps of the data processing method provided in this embodiment.
Optionally, the computer-executable instructions in this embodiment may also be called application program code; this is not specifically limited in this embodiment.
In addition, the communication interface 702 uses any transceiver-type apparatus to communicate with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The communication line 706 is used to connect the components of the data processing apparatus 70. Specifically, the communication line 706 may include a data bus, a power bus, a control bus, a status signal bus, and the like; however, for clarity, the various buses are all labelled as the communication line 706 in the figure.
In addition, the data processing apparatus 70 may further include a storage medium 704. The storage medium 704 is used to store the computer instructions and various data for implementing the technical solutions of this embodiment, so that when the data processing apparatus 70 performs the above method of this embodiment, the computer instructions and data stored in the storage medium 704 are loaded into the memory 703, and the processor 701 can then perform the method provided in this embodiment by executing the computer instructions stored in the memory 703.
It should be understood that the data processing apparatus 70 according to this embodiment may correspond to the data processing apparatus in this embodiment and to the corresponding body that performs the methods according to this embodiment, and that the above and other operations and/or functions of the modules in the data processing apparatus 70 are respectively intended to implement the corresponding procedures of the methods in FIG. 4 or FIG. 7; for brevity, details are not repeated here.
The method steps in this embodiment may be implemented by hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a RAM, a flash memory, a ROM, a PROM, an EPROM, an EEPROM, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium; of course, the storage medium may also be a component of the processor. The processor and the storage medium may be located in an ASIC, and the ASIC may be located in a communication apparatus or a terminal device; of course, the processor and the storage medium may also exist as discrete components in a communication apparatus or a terminal device.
All or some of the above embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used, all or some of them may be implemented in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer programs or instructions are loaded and executed on a computer, all or some of the procedures or functions described in this embodiment are performed. The computer may be a general-purpose computer, a dedicated computer, a computer network, a communication apparatus, a user device, or another programmable apparatus. The computer programs or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data centre to another website, computer, server, or data centre in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible by the computer, or a data storage device such as a server or data centre that integrates one or more usable media. The usable medium may be a magnetic medium, for example, a floppy disk, a hard disk, or a magnetic tape; an optical medium, for example, a digital video disc (DVD); or a semiconductor medium, for example, an SSD.
In this embodiment, unless otherwise specified or in case of logical conflict, the terms and/or descriptions in different implementations are consistent and may be mutually referenced, and the technical features in different embodiments may be combined according to their internal logical relationships to form new embodiments.
In this embodiment, "at least one" means one or more, "a plurality of" means two or more, and other quantifiers are similar. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: only A exists, both A and B exist, or only B exists. In addition, the singular forms "a", "an", and "the" do not mean "one or only one" but "one or more than one", unless the context clearly indicates otherwise; for example, "a device" means one or more such devices. Furthermore, "at least one of ..." means one of, or any combination of, the subsequent associated objects; for example, "at least one of A, B, and C" includes A, B, C, AB, AC, BC, or ABC. In the textual descriptions of this embodiment, the character "/" generally indicates an "or" relationship between the associated objects; in the formulas of this embodiment, the character "/" indicates a "division" relationship between the associated objects.

Claims (18)

  1. A data processing method, wherein the method comprises:
    obtaining a parameter matrix of a convolutional neural network model;
    dividing the parameter matrix into a plurality of matrix blocks, wherein the plurality of matrix blocks comprise P kinds of matrix blocks with different content;
    generating an index value for each of the P kinds of matrix blocks, wherein the index value uniquely indicates the corresponding matrix block; and
    generating a parameter dictionary of the convolutional neural network model, wherein the parameter dictionary comprises the P kinds of matrix blocks and the index values respectively corresponding to the P kinds of matrix blocks.
  2. The method according to claim 1, wherein the obtaining a parameter matrix of a convolutional neural network model comprises:
    obtaining an original parameter matrix of the convolutional neural network model; and
    retraining the original parameter matrix to obtain a parameter matrix of the convolutional neural network model that satisfies a constraint condition, wherein the constraint condition comprises that the parameter matrix can be divided into a plurality of matrix blocks, and the plurality of matrix blocks comprise matrix blocks with identical content.
  3. The method according to claim 1 or 2, wherein the method further comprises: determining the size of the matrix blocks according to specifications of hardware resources and the difficulty of model training.
  4. A data processing method, wherein the method is applied to an artificial intelligence (AI) hardware accelerator and comprises:
    obtaining a first index value; and
    looking up, in a parameter dictionary, the matrix block corresponding to the first index value, wherein the parameter dictionary comprises P kinds of matrix blocks and the index values respectively corresponding to the P kinds of matrix blocks, and each matrix block is a part of a parameter matrix of a convolutional neural network model.
  5. The method according to claim 4, wherein the method further comprises:
    concatenating the matrix blocks corresponding to the first index value to obtain a first parameter set of the convolutional neural network model.
  6. The method according to claim 4 or 5, wherein the method further comprises: loading the parameter dictionary into an on-chip cache of the AI hardware accelerator.
  7. The method according to any one of claims 4 to 6, wherein the obtaining a first index value comprises: reading the first index value from an off-chip memory of the AI hardware accelerator.
  8. The method according to any one of claims 4 to 7, wherein the method further comprises:
    caching the first parameter set comprised in the matrix blocks corresponding to the first index value into a weight buffer of the AI hardware accelerator, so that the AI hardware accelerator performs an operation according to the first parameter set.
  9. A data processing apparatus, comprising:
    an obtaining unit, configured to obtain a parameter matrix of a convolutional neural network model;
    a splitting unit, configured to divide the parameter matrix into a plurality of matrix blocks, wherein the plurality of matrix blocks comprise P kinds of matrix blocks with different content;
    an index generation unit, configured to generate an index value for each of the P kinds of matrix blocks, wherein the index value uniquely indicates the corresponding matrix block; and
    a dictionary generation unit, configured to generate a parameter dictionary of the convolutional neural network model, wherein the parameter dictionary comprises the P kinds of matrix blocks and the index values respectively corresponding to the P kinds of matrix blocks.
  10. The data processing apparatus according to claim 9, wherein that the obtaining unit is configured to obtain a parameter matrix of a convolutional neural network model comprises:
    the obtaining unit is configured to obtain an original parameter matrix of the convolutional neural network model; and
    the obtaining unit is configured to retrain the original parameter matrix to obtain a parameter matrix of the convolutional neural network model that satisfies a constraint condition, wherein the constraint condition comprises that the parameter matrix can be divided into a plurality of matrix blocks, and the plurality of matrix blocks comprise matrix blocks with identical content.
  11. The data processing apparatus according to claim 9 or 10, wherein the splitting unit is further configured to determine the size of the matrix blocks according to specifications of hardware resources and the difficulty of model training.
  12. A data processing apparatus, wherein the data processing apparatus is applied to an artificial intelligence (AI) hardware accelerator and comprises:
    an obtaining unit, configured to obtain a first index value; and
    a lookup unit, configured to look up, in a parameter dictionary, the matrix block corresponding to the first index value, wherein the parameter dictionary comprises P kinds of matrix blocks and the index values respectively corresponding to the P kinds of matrix blocks, and each matrix block is a part of a parameter matrix of a convolutional neural network model.
  13. The data processing apparatus according to claim 12, wherein the lookup unit is further configured to concatenate the matrix blocks corresponding to the first index value to obtain a first parameter set of the convolutional neural network model.
  14. The data processing apparatus according to claim 12 or 13, wherein the obtaining unit is further configured to load the parameter dictionary into an on-chip cache of the AI hardware accelerator.
  15. The data processing apparatus according to any one of claims 12 to 14, wherein that the obtaining unit is configured to obtain a first index value comprises: the obtaining unit is configured to read the first index value from an off-chip memory of the AI hardware accelerator.
  16. The data processing apparatus according to any one of claims 12 to 15, wherein the data processing apparatus further comprises:
    a writing unit, configured to cache the first parameter set comprised in the matrix blocks corresponding to the first index value into a weight buffer of the AI hardware accelerator, so that the AI hardware accelerator performs an operation according to the first parameter set.
  17. A communication apparatus, comprising a processor and an interface, wherein the processor receives or sends data through the interface, and the processor is configured to implement the method according to any one of claims 1 to 3, or the processor is configured to implement the method according to any one of claims 4 to 8.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions, and when the instructions are run on a processor, the method according to any one of claims 1 to 3 or the method according to any one of claims 4 to 8 is implemented.
PCT/CN2023/096625 2022-07-28 2023-05-26 Data processing method and apparatus WO2024021827A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210895071.7A CN117540774A (zh) 2022-07-28 Data processing method and apparatus
CN202210895071.7 2022-07-28

Publications (1)

Publication Number Publication Date
WO2024021827A1 true WO2024021827A1 (zh) 2024-02-01

Family

ID=89705282

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/096625 WO2024021827A1 (zh) Data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN117540774A (zh)
WO (1) WO2024021827A1 (zh)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020219229A1 (en) * 2019-04-23 2020-10-29 Microsoft Technology Licensing, Llc Direct computation with compressed weight in training deep neural network
CN111079781A (zh) * 2019-11-07 2020-04-28 South China University of Technology Lightweight convolutional neural network image recognition method based on low-rank and sparse decomposition
US20210150362A1 (en) * 2019-11-15 2021-05-20 Microsoft Technology Licensing, Llc Neural network compression based on bank-balanced sparsity
CN113762493A (zh) * 2020-06-01 2021-12-07 Alibaba Group Holding Limited Compression method and apparatus for neural network model, acceleration unit, and computing system
CN113822410A (zh) * 2020-06-18 2021-12-21 Huawei Technologies Co., Ltd. Neural network model training, image classification, and text translation methods, apparatuses, and devices
CN112541159A (zh) * 2020-09-30 2021-03-23 Huawei Technologies Co., Ltd. Model training method and related device

Also Published As

Publication number Publication date
CN117540774A (zh) 2024-02-09


Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
Ref document number: 23845041
Country of ref document: EP
Kind code of ref document: A1