WO2023071658A1 - AI model processing method, computing method, and device - Google Patents

AI model processing method, computing method, and device

Info

Publication number
WO2023071658A1
WO2023071658A1 PCT/CN2022/121335 CN2022121335W WO2023071658A1 WO 2023071658 A1 WO2023071658 A1 WO 2023071658A1 CN 2022121335 W CN2022121335 W CN 2022121335W WO 2023071658 A1 WO2023071658 A1 WO 2023071658A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
weight
model
matrix
block
Prior art date
Application number
PCT/CN2022/121335
Other languages
English (en)
French (fr)
Inventor
朱敏超
卢帆
左盟
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023071658A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • the present application relates to the field of artificial intelligence, and more specifically, to an AI model processing method, computing method and device.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
  • the present application provides an AI model processing method, computing method and device, which can increase the computing speed of the AI model and reduce computing overhead.
  • a method for processing an AI model includes: obtaining an initial AI model and the size of the minimum computing unit of an accelerator, where the accelerator is used to perform the operations of a target AI model; and performing sparse processing on the initial AI model to obtain the target AI model. The weight matrix of the target AI model includes multiple weight blocks, comprising at least one invalid weight block and at least one target weight block. An invalid weight block is a weight block that does not participate in the operation, and a target weight block is a weight block that participates in the operation. The size of each weight block is an integer multiple of the size of the minimum computing unit of the accelerator, and the number of target weight blocks along the accumulation direction of the matrix operation is the same throughout the weight matrix of the i-th layer of the target AI model.
  • the size of the weight block is used as the granularity of the sparse processing, that is, the weight values in the weight matrix are retained or discarded in units of weight blocks, and the size of the weight block is an integer multiple of the size of the minimum computing unit of the accelerator.
  • the solution of the embodiment of the present application reduces redundant weights in the initial AI model, reduces the calculation amount of the model, and is beneficial to improve the processing speed of the accelerator.
  • the size of the target weight block of the target AI model is an integer multiple of the size of the smallest computing unit.
  • the accelerator can obtain the data required for the operation at one time according to the data format supported by the smallest computing unit.
  • the size of the minimum computing unit refers to the scale of matrix operations that the accelerator can handle at one time, or the amount of calculations performed by the computing unit in one operation.
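  • The block-granular sparse processing described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the method claimed in the application: the function name `block_sparsify`, the L1-norm importance score, and the reading of "accumulation direction" as the K dimension of `x @ weight` (so that every block column keeps the same number of blocks) are all hypothetical choices.

```python
import numpy as np

def block_sparsify(weight, block_shape, keep_per_column):
    """Zero out weight blocks, keeping the same number of blocks in every block column.

    weight: 2-D array of shape (K, N); K is taken as the accumulation direction.
    block_shape: (bk, bn), assumed to be an integer multiple of the accelerator's
                 minimum computing unit (the caller picks this value).
    keep_per_column: number of target (kept) blocks per block column.
    Returns the pruned weight and a boolean block mask of shape (K//bk, N//bn).
    """
    K, N = weight.shape
    bk, bn = block_shape
    assert K % bk == 0 and N % bn == 0, "matrix must tile evenly into blocks"
    rows, cols = K // bk, N // bn
    # View the matrix as a (rows, cols) grid of (bk, bn) blocks.
    blocks = weight.reshape(rows, bk, cols, bn).transpose(0, 2, 1, 3)
    # Score each block by its L1 norm (an assumed importance criterion).
    scores = np.abs(blocks).sum(axis=(2, 3))
    # In every block column, keep the highest-scoring blocks.
    order = np.argsort(-scores, axis=0, kind="stable")
    mask = np.zeros((rows, cols), dtype=bool)
    np.put_along_axis(mask, order[:keep_per_column], True, axis=0)
    # Invalid blocks are zeroed; target blocks are left unchanged.
    pruned = (blocks * mask[:, :, None, None]).transpose(0, 2, 1, 3).reshape(K, N)
    return pruned, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
pruned, mask = block_sparsify(w, block_shape=(2, 4), keep_per_column=2)
print(mask.sum(axis=0))  # number of target blocks kept in each block column
```

  • Because whole blocks are kept or dropped, the surviving data already matches the shape the computing unit consumes; the tuning step that follows pruning is not shown here.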
  • the accelerator may include at least one of the following: a central processing unit (CPU), a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and so on.
  • performing sparse processing on the initial AI model to obtain the target AI model includes: setting at least one weight block in the weight matrix of each of at least one target network layer in the initial AI model as a weight block that does not participate in the operation, to obtain a sparse AI model; and tuning the sparse AI model to obtain the target AI model.
  • multiple weight blocks in a weight matrix have the same size.
  • the method further includes: storing the weight matrix of the target neural network model in the form of compressed weights, where the compressed weights are composed of at least one target weight block.
  • storing the weight matrix of the at least one target network layer in the form of compressed weights can reduce the demand for storage space.
  • a structured sparse matrix can be obtained.
  • the compressed weight can be directly loaded into the calculation unit for matrix operation, which is beneficial to improve the processing speed.
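  • One possible layout for the compressed weights is sketched below, assuming the target blocks are packed per block column together with a weight index that records each block's row position. The names `compress_blocks` and `decompress_blocks` and the `(keep, cols, bk, bn)` layout are hypothetical; the application does not specify a storage format at this point.

```python
import numpy as np

def compress_blocks(weight, mask, block_shape):
    """Pack the target blocks of each block column and record their block-row indices."""
    bk, bn = block_shape
    rows, cols = mask.shape
    keep = int(mask[:, 0].sum())
    assert np.all(mask.sum(axis=0) == keep), "every block column must keep the same count"
    # index[j, c] is the block row of the j-th target block in block column c.
    index = np.stack([np.flatnonzero(mask[:, c]) for c in range(cols)], axis=1)
    packed = np.empty((keep, cols, bk, bn), dtype=weight.dtype)
    for c in range(cols):
        for j, r in enumerate(index[:, c]):
            packed[j, c] = weight[r * bk:(r + 1) * bk, c * bn:(c + 1) * bn]
    return packed, index

def decompress_blocks(packed, index, rows):
    """Rebuild the dense (rows*bk, cols*bn) matrix; invalid blocks become zeros."""
    keep, cols, bk, bn = packed.shape
    dense = np.zeros((rows * bk, cols * bn), dtype=packed.dtype)
    for c in range(cols):
        for j in range(keep):
            r = index[j, c]
            dense[r * bk:(r + 1) * bk, c * bn:(c + 1) * bn] = packed[j, c]
    return dense

# A 4x2 grid of (2, 4) blocks with two target blocks per block column.
mask = np.array([[True, False],
                 [False, True],
                 [True, True],
                 [False, False]])
rng = np.random.default_rng(0)
dense = rng.standard_normal((8, 8)) * np.repeat(np.repeat(mask, 2, axis=0), 4, axis=1)
packed, index = compress_blocks(dense, mask, (2, 4))
restored = decompress_blocks(packed, index, rows=4)
print(packed.size, dense.size)  # the packed form stores half of the dense entries
```

  • Requiring the same number of target blocks in every block column is what lets the packed array be rectangular, so no per-column length needs to be stored.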
  • the multiple corresponding to the integer multiple is obtained by searching in the search space.
  • the sizes of the weight blocks in the weight matrices of different network layers in the target AI network model are different.
  • the size of the weight block is searched in the search space, and the size of the weight block suitable for each network layer can be obtained, which is beneficial to improve the accuracy of the model.
  • a weight block of a certain size can guarantee a certain sparse rate under a certain target accuracy, which can improve the processing efficiency of the model.
  • the sparsity rate of the weight matrix of the target AI model is obtained by searching in the search space, and the sparsity rate of the target AI model meets the target sparsity rate.
  • the sparsity rate of the weight matrix of the target AI model is obtained by searching in the search space, and the accuracy of the sparse neural network model meets the target accuracy.
  • the sparsity rate of the weight matrix of the target AI model can be searched in the search space to obtain a sparsity rate suited to the target AI model, which helps reduce the calculation amount of the model and thereby accelerate its operation.
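  • The text does not say how the search space is explored. As a purely illustrative sketch, an exhaustive grid search over candidate block-size multiples and sparsity rates could look like the following; `search_sparsity`, the `evaluate` callback, and the toy accuracy function are hypothetical stand-ins for pruning, tuning, and validating a real model.

```python
from itertools import product

def search_sparsity(multiples, sparsity_rates, evaluate, target_accuracy):
    """Return (multiple, sparsity_rate, accuracy) with the highest sparsity rate
    whose accuracy meets the target, or None if no candidate qualifies."""
    best = None
    for multiple, rate in product(multiples, sparsity_rates):
        accuracy = evaluate(multiple, rate)  # assumed: prune, tune, then validate
        if accuracy < target_accuracy:
            continue
        if best is None or rate > best[1]:
            best = (multiple, rate, accuracy)
    return best

# Toy stand-in for a real evaluation: accuracy drops with sparsity and block size.
def toy_evaluate(multiple, rate):
    return 1.0 - rate * multiple / 10

best = search_sparsity([1, 2], [0.25, 0.5, 0.75], toy_evaluate, target_accuracy=0.9)
print(best)
```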
  • the network layer corresponding to the weight matrix of the target AI model belongs to multiple candidate network layers.
  • sparse processing can be performed within the scope of the specified candidate network layers to avoid affecting other network layers, for example, network layers that are highly correlated with the accuracy of the model, which helps preserve the accuracy of the model.
  • when the network layer corresponding to the weight matrix of the target AI model is a convolutional layer, the weight matrix of the target AI model can be obtained by converting the convolution kernels of that network layer into a two-dimensional matrix.
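  • A common way to turn convolution kernels into a two-dimensional weight matrix is to flatten each kernel into one column and to unfold the input with im2col, so that the convolution becomes a matrix multiplication. The sketch below assumes a `(C_out, C_in, kh, kw)` kernel layout, stride 1, and no padding; the helper names are hypothetical and the application may use a different layout.

```python
import numpy as np

def kernels_to_matrix(kernels):
    """Flatten kernels of shape (C_out, C_in, kh, kw) into a (C_in*kh*kw, C_out) matrix."""
    c_out = kernels.shape[0]
    return kernels.reshape(c_out, -1).T

def im2col(x, kh, kw):
    """Unfold x of shape (C_in, H, W) into (out_h*out_w, C_in*kh*kw); stride 1, no padding."""
    c_in, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((out_h * out_w, c_in * kh * kw), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Each row holds one receptive field, flattened in (channel, row, col) order.
            cols[i * out_w + j] = x[:, i:i + kh, j:j + kw].reshape(-1)
    return cols

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 5))
kernels = rng.standard_normal((3, 2, 3, 3))
weight_2d = kernels_to_matrix(kernels)        # shape (18, 3)
out = im2col(x, 3, 3) @ weight_2d             # shape (9, 3)
feature_maps = out.T.reshape(3, 3, 3)         # (C_out, out_h, out_w)
```

  • In this layout the rows of `weight_2d` run along the accumulation direction, so the block sparsity described earlier applies to it like to any fully connected weight matrix.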
  • an AI model calculation method is provided. The weight matrix of the i-th layer of the target AI model includes a plurality of weight blocks, comprising at least one invalid weight block and at least one target weight block; an invalid weight block is a weight block that does not participate in the operation, and a target weight block is a weight block that participates in the operation. The size of each weight block is an integer multiple of the size of the minimum computing unit of the accelerator, and the number of target weight blocks along the accumulation direction of the matrix operation is the same throughout the weight matrix of the target AI model. The following steps are performed while the accelerator executes the operations of the target AI model: obtaining at least one target data block from the input data matrix according to the weight index of the weight matrix of the target AI model, where the at least one target data block corresponds to the at least one target weight block and the weight index indicates the position of the at least one target weight block in the weight matrix of the target AI model; and performing a matrix operation based on the at least one target data block and the at least one target weight block to obtain a result of the matrix operation.
  • the accelerator can obtain the data required for the operation at one time according to the data format that the minimum computing unit supports, so there is no need to cut or piece together the data to convert it into that format, which reduces the delay of the operation.
  • the index is used to indicate the position of the target weight block, which reduces the complexity of the index, can reduce the addressing time of input data during the operation process, and can realize effective hardware acceleration without specific hardware support.
  • the at least one target data block refers to the input data to be calculated in the matrix operation, that is, the input data corresponding to the target weight block in the weight matrix.
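  • The index-driven computation described above can be sketched as follows, assuming the packed `(keep, cols, bk, bn)` block layout and a `(keep, cols)` block-row index. Both the layout and the name `block_sparse_matmul` are hypothetical; the sketch only shows that skipping invalid blocks gives the same result as the dense product.

```python
import numpy as np

def block_sparse_matmul(x, packed, index):
    """Compute x @ W where W is given as packed target blocks plus a block-row index.

    x: (M, K) input data matrix.
    packed: (keep, cols, bk, bn) target weight blocks.
    index: (keep, cols) block row of each target block inside W.
    """
    keep, cols, bk, bn = packed.shape
    out = np.zeros((x.shape[0], cols * bn), dtype=np.result_type(x, packed))
    for c in range(cols):
        for j in range(keep):
            r = index[j, c]
            # Target data block: the slice of x that multiplies this weight block.
            x_block = x[:, r * bk:(r + 1) * bk]
            out[:, c * bn:(c + 1) * bn] += x_block @ packed[j, c]
    return out

rng = np.random.default_rng(0)
index = np.array([[0, 1], [2, 2]])            # two target blocks per block column
packed = rng.standard_normal((2, 2, 2, 4))    # blocks of size (2, 4)
x = rng.standard_normal((3, 8))
y = block_sparse_matmul(x, packed, index)     # shape (3, 8)
```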
  • the weight matrix of the target AI model is stored in the form of compressed weights, and the compressed weights are composed of at least one target weight block.
  • storing the weight matrix of the at least one target network layer in the form of compressed weights can reduce the demand for storage space.
  • a structured sparse matrix can be obtained.
  • the compressed weight can be directly loaded into the calculation unit for matrix operation, which is beneficial to improve the processing speed.
  • the method further includes: combining, from the at least one target data block, the target data blocks that correspond to each of multiple weight groups into multiple data matrices, where one weight group includes the target weight blocks that lie along the accumulation direction of the matrix operation in the weight matrix of the target AI model; and performing the matrix operation based on the at least one target data block and the at least one target weight block to obtain the result of the matrix operation includes: performing matrix operations on the multiple data matrices and the multiple weight groups to obtain the results of multiple matrix operations; and rearranging the data of those results to obtain the result of the matrix operation.
  • the processing efficiency of the accelerator can be improved by performing matrix operations in parallel.
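  • A rough NumPy illustration of the grouping idea: because every block column keeps the same number of target blocks, the weight groups can be stacked into one regular array and multiplied in a single batched call. The function name `grouped_block_matmul` and the packed layout are hypothetical, and `np.matmul` over a leading batch axis merely stands in for parallel hardware execution.

```python
import numpy as np

def grouped_block_matmul(x, packed, index):
    """Batched form of the block-sparse product.

    x: (M, K) input; packed: (keep, cols, bk, bn); index: (keep, cols) block rows.
    """
    keep, cols, bk, bn = packed.shape
    m = x.shape[0]
    # Columns of x needed by each weight group: shape (cols, keep*bk).
    col_ids = (index.T[:, :, None] * bk + np.arange(bk)).reshape(cols, keep * bk)
    # One data matrix per weight group: shape (cols, M, keep*bk).
    data = x[:, col_ids].transpose(1, 0, 2)
    # One stacked weight group per block column: shape (cols, keep*bk, bn).
    groups = packed.transpose(1, 0, 2, 3).reshape(cols, keep * bk, bn)
    # Batched matrix multiplication over the group axis: shape (cols, M, bn).
    partial = np.matmul(data, groups)
    # Data rearrangement: place the per-group results side by side, shape (M, cols*bn).
    return partial.transpose(1, 0, 2).reshape(m, cols * bn)

rng = np.random.default_rng(0)
index = np.array([[0, 1], [2, 2]])
packed = rng.standard_normal((2, 2, 2, 4))
x = rng.standard_normal((3, 8))
y = grouped_block_matmul(x, packed, index)
```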
  • the weight matrix in the target AI model is obtained by converting the convolution kernels of the network layer into a two-dimensional matrix.
  • obtaining at least one target data block from the input data matrix according to the weight index of the weight matrix of the target AI model includes: determining a target position in the input data matrix according to the weight index of the weight matrix of the target AI model; and converting the data at the target position in the input data matrix into the at least one target data block.
  • when the network layer is a convolutional layer, the data at the target position in the input data matrix is located according to the weight index and expanded into multiple data matrices through im2col. There is therefore no need to first convert the whole input data matrix by im2col and then obtain the at least one target data block from the converted matrix, and the converted input data matrix does not have to be saved, which reduces the demand for storage space and improves processing efficiency.
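  • The idea of unfolding only the data that a target weight block needs can be sketched as below. It assumes stride 1, no padding, and a weight matrix whose rows are ordered as (input channel, kernel row, kernel column); `gather_target_block` is a hypothetical helper, and the full `im2col` is included only as a reference for comparison.

```python
import numpy as np

def im2col(x, kh, kw):
    """Reference: full unfolding of x (C_in, H, W) into (out_h*out_w, C_in*kh*kw)."""
    c_in, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    rows = [x[:, i:i + kh, j:j + kw].reshape(-1)
            for i in range(out_h) for j in range(out_w)]
    return np.stack(rows)

def gather_target_block(x, kh, kw, block_row, bk):
    """Build only the unfolded columns that meet weight block row `block_row`.

    Returns an array of shape (out_h*out_w, bk) without materialising the
    full im2col matrix.
    """
    c_in, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    # Rows of the 2-D weight matrix covered by this block.
    k_ids = np.arange(block_row * bk, (block_row + 1) * bk)
    # Map each row back to (input channel, kernel row offset, kernel column offset).
    c, di, dj = np.unravel_index(k_ids, (c_in, kh, kw))
    block = np.empty((out_h * out_w, bk), dtype=x.dtype)
    for n in range(bk):
        # All output positions read the same kernel offset, shifted over the image.
        block[:, n] = x[c[n], di[n]:di[n] + out_h, dj[n]:dj[n] + out_w].reshape(-1)
    return block

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 5))
target = gather_target_block(x, 3, 3, block_row=1, bk=6)   # shape (9, 6)
```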
  • the multiple corresponding to the integer multiple is obtained by searching in the search space.
  • the sparsity rate of the weight matrix of the target AI model is obtained by searching in the search space, and the sparsity rate of the target AI model meets the target sparsity rate.
  • an AI model processing device includes a module or unit for executing the first aspect and the method in any one implementation manner of the first aspect.
  • an AI model computing device includes a module or unit for executing the method in the second aspect and any implementation manner of the second aspect.
  • an AI model processing device is provided, which includes: a memory for storing programs; and a processor for executing the programs stored in the memory. When the programs stored in the memory are executed, the processor is configured to execute the method in the first aspect or any implementation of the first aspect.
  • the processor in the fifth aspect above can be a central processing unit (CPU), or a combination of a CPU and an AI computing processor, where the AI computing processor can include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and so on.
  • TPU is an artificial intelligence accelerator ASIC fully customized by Google for machine learning.
  • an AI model computing device is provided, which includes: a memory for storing programs; and a processor for executing the programs stored in the memory. When the programs stored in the memory are executed, the processor is configured to execute the method in any one of the implementation manners of the first aspect and the second aspect.
  • the processor in the sixth aspect above may be a CPU, or a combination of a CPU and an AI computing processor, where the AI computing processor may include a GPU, an NPU, a TPU, and the like.
  • a computer-readable storage medium is provided, which stores program code for execution by a device, and the program code includes instructions for executing the method in any implementation of the first aspect or the second aspect.
  • a computer program product containing instructions is provided, and when the computer program product is run on a computer, the computer is made to execute the method in any implementation of the first aspect or the second aspect.
  • a chip is provided. The chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface and executes the method in any implementation of the first aspect or the second aspect.
  • the chip may further include a memory that stores instructions, and the processor is configured to execute the instructions stored in the memory. When the instructions are executed, the processor is configured to execute the method in any implementation of the first aspect or the second aspect.
  • the aforementioned chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
  • Fig. 1 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a hardware structure of a chip provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a compiler provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an AI model processing method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a sparse processing process of a weight matrix provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the size of a minimum computing unit provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a sparse processing process provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the expression of the weight matrix provided by the embodiment of the present application.
  • FIG. 11 is a schematic flow chart of an AI model computing method provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of the operation process of the fully connected layer provided by the embodiment of the present application.
  • FIG. 13 is a schematic diagram of a processing method of input data of a convolutional layer provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of an operation process of a convolutional layer provided in an embodiment of the present application.
  • Fig. 15 is a comparison diagram of the accuracy of the target neural network model and the initial neural network model provided by the embodiment of the present application;
  • Fig. 16 is a comparison diagram of the operation time of a single operator in the initial neural network model and the target neural network model provided by the embodiment of the present application;
  • Fig. 17 is a schematic diagram of the degree of acceleration of the target neural network model provided by the embodiment of the present application.
  • Fig. 18 is a schematic block diagram of an AI model processing device provided by an embodiment of the present application.
  • Fig. 19 is a schematic block diagram of an accelerator provided by an embodiment of the present application.
  • Fig. 20 is a schematic block diagram of another AI model processing device provided by the embodiment of the present application.
  • Fig. 21 is a schematic block diagram of an AI model computing device provided by an embodiment of the present application.
  • Figure 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of an artificial intelligence system and is applicable to general artificial intelligence field requirements.
  • Intelligent information chain reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has undergone a condensed process of "data-information-knowledge-wisdom".
  • IT value chain reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of artificial intelligence, information (provided and processed by technology) to the systematic industrial ecological process.
  • the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform.
  • the infrastructure can communicate with the outside through sensors, and the computing power of the infrastructure can be provided by smart chips.
  • the smart chip here can be a hardware acceleration chip such as a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA).
  • the basic platform of infrastructure can include related platform guarantees and supports such as distributed computing framework and network, and can include cloud storage and computing, interconnection and interworking network, etc.
  • data can be obtained through sensors and external communication, and then these data can be provided to smart chips in the distributed computing system provided by the basic platform for calculation.
  • Data from the upper layer of the infrastructure is used to represent data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, text, and IoT data of traditional equipment, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • the above data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other processing methods.
  • machine learning and deep learning can symbolize and formalize intelligent information modeling, extraction, preprocessing, training, etc. of data.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, and using formalized information to carry out machine thinking and solve problems according to reasoning control strategies.
  • the typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image processing identification, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They package the overall artificial intelligence solution, commercialize intelligent information decision-making, and put it into practical use. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, automatic driving, safe city, smart terminals, etc.
  • the embodiments of the present application can be applied in many fields of artificial intelligence, for example, intelligent manufacturing, intelligent transportation, intelligent home, intelligent medical care, intelligent security, automatic driving, safe city and other fields.
  • the embodiments of the present application can be specifically applied in fields that require the use of (deep) neural networks, such as automatic driving, image classification, image retrieval, image semantic segmentation, image quality enhancement, image super-resolution, and natural language processing.
  • When a user stores a large number of pictures on a terminal device (for example, a mobile phone) or a cloud disk, identifying the images in the album makes it convenient for the user or the system to classify and manage the album, thereby improving user experience.
  • Using the calculation method of the AI model in the embodiment of the present application can reduce hardware overhead and be more friendly to terminal devices.
  • the speed of using the neural network to classify pictures can be improved, which is conducive to labeling pictures of different categories in real time, which is convenient for users to view and search.
  • Surveillance scenarios include: smart city, field surveillance, indoor surveillance, outdoor surveillance, in-vehicle surveillance, etc.
  • multiple attribute recognition is required, such as pedestrian attribute recognition and riding attribute recognition.
  • Deep neural networks play an important role in multiple attribute recognition by virtue of their powerful capabilities.
  • the processing efficiency of the neural network model can be improved, which is beneficial to the real-time processing of the input road picture and to faster identification of the different attribute information in the road picture, while also reducing power consumption.
  • a neural network can be composed of neural units. A neural unit can refer to an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit can be: $f\left(\sum_{s=1}^{n} W_s x_s + b\right)$
  • where $s = 1, 2, \ldots, n$, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit.
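  • As a small worked example of the neural unit output, the weighted sum of the inputs plus the bias is passed through the activation function f. The function name `neuron_output` and the choice of `tanh` as the default activation are illustrative assumptions only.

```python
import numpy as np

def neuron_output(x, w, b, f=np.tanh):
    """Output of one neural unit: f(sum_s W_s * x_s + b)."""
    return f(np.dot(w, x) + b)

x = np.array([1.0, 2.0, 3.0])     # inputs x_s
w = np.array([0.1, 0.2, 0.3])     # weights W_s
print(neuron_output(x, w, b=-1.4))  # weighted sum is 1.4, so the pre-activation is about 0
```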
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to transform the input signal in the neural unit into an output signal.
  • the output signal of this activation function can be used as the input of the next layer.
  • a neural network is a network formed by connecting multiple above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • DNN is divided according to the position of different layers, and the neural network inside DNN can be divided into three categories: input layer, hidden layer, and output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in the middle are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although a DNN looks very complicated, the work of each layer is actually not complicated.
  • it is the following linear relationship expression: $\vec{y} = \alpha(W \vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, W is the weight matrix (also called the coefficients), and $\alpha()$ is the activation function.
  • Each layer simply applies this operation to the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the number of coefficients W and offset vectors $\vec{b}$ is also large.
  • These parameters are defined in a DNN as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the fourth neuron of the second layer to the second neuron of the third layer is defined as $W^3_{24}$. The superscript 3 represents the layer in which the coefficient W is located, and the subscript corresponds to the output index 2 of the third layer and the input index 4 of the second layer.
  • In general, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as $W^L_{jk}$.
  • the input layer has no W parameter.
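  • The per-layer relationship and the index convention can be illustrated with a short NumPy sketch. The ReLU activation, the helper names, and the toy weights are assumptions for illustration; rows of each matrix index the output neuron j and columns index the input neuron k, so the coefficient written with subscript "24" sits at zero-based position `W[1, 3]`.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers, act=relu):
    """Apply y = act(W @ x + b) layer by layer; `layers` is a list of (W, b) pairs."""
    for W, b in layers:
        x = act(W @ x + b)
    return x

# Two weighted layers on top of a 2-element input; the input layer itself has no W.
W1 = np.array([[1.0, -1.0],
               [0.0,  2.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 0.5]])
b2 = np.array([1.0])
y = forward(np.array([1.0, 2.0]), [(W1, b1), (W2, b2)])
print(y)  # hidden layer gives [0, 4]; output is 0*1 + 4*0.5 + 1 = 3
```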
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically speaking, a model with more parameters has a higher complexity and a greater "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
  • the convolutional neural network (CNN) is a deep neural network with a convolutional structure, and it is a deep learning architecture. A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction by means of machine learning algorithms.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images input into it.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a subsampling layer, which can be regarded as a filter.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can only be connected to some adjacent neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as a way to extract image information that is independent of location.
  • the convolution kernel can be formalized as a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the convolution layer can include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to extract specific features from the image.
  • the size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends across the entire depth of the input image.
  • therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied instead of a single one.
  • the output of each weight matrix is stacked to form the depth dimension of the convolution image, where the dimension can be understood as determined by the "multiple" mentioned above.
  • Different weight matrices can be used to extract different features from the image. For example, one weight matrix is used to extract image edge information, another is used to extract specific colors of the image, and yet another is used to blur unwanted noise in the image.
  • the multiple weight matrices have the same size (rows × columns), so the feature maps they extract also have the same size, and the extracted feature maps of the same size are then combined to form the output of the convolution operation.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network can make correct predictions.
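  • The relationship between stride, kernel depth, and output depth described above can be sketched in NumPy. This is an illustrative sketch only: the function name `conv2d`, the channel-last layout, and the random data are assumptions made for the example and do not come from the application.

```python
import numpy as np

def conv2d(image, kernels, stride=1):
    """Valid convolution (cross-correlation, as used in CNNs).

    image:   (H, W, C)       input with depth C
    kernels: (N, kh, kw, C)  N weight matrices, each spanning the full depth C
    returns: (H_out, W_out, N), one feature map per weight matrix
    """
    H, W, C = image.shape
    N, kh, kw, kc = kernels.shape
    assert kc == C, "kernel depth must equal input depth"
    h_out = (H - kh) // stride + 1
    w_out = (W - kw) // stride + 1
    out = np.zeros((h_out, w_out, N))
    for n in range(N):                      # one output channel per weight matrix
        for i in range(h_out):
            for j in range(w_out):
                r, c = i * stride, j * stride
                patch = image[r:r + kh, c:c + kw, :]
                out[i, j, n] = np.sum(patch * kernels[n])
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))        # hypothetical 6x6 input, depth 3
kernels = rng.standard_normal((4, 3, 3, 3))   # four 3x3 weight matrices, depth 3
out_s1 = conv2d(image, kernels, stride=1)     # kernel moves one pixel at a time
out_s2 = conv2d(image, kernels, stride=2)     # kernel moves two pixels at a time
```

  • The output depth equals the number of weight matrices, and a stride of 2 visits every second position that a stride of 1 visits.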
  • the initial convolutional layer often extracts more general features, which can also be called low-level features; as the depth of the convolutional neural network deepens,
  • the features extracted by the later convolutional layers become more and more complex, such as high-level semantic features, and the higher semantic features are more suitable for the problem to be solved.
  • the convolution operation can be converted into a matrix multiplication operation by means of img2col.
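  • The conversion of a convolution into a matrix multiplication can be sketched as follows. The technique is often called im2col; the helper names `im2col` and `conv2d_as_matmul` and the data layout are assumptions for illustration, not the implementation used in the application.

```python
import numpy as np

def im2col(image, kh, kw, stride=1):
    """Unfold every (kh, kw, C) patch of an (H, W, C) image into one row."""
    H, W, C = image.shape
    h_out = (H - kh) // stride + 1
    w_out = (W - kw) // stride + 1
    cols = np.empty((h_out * w_out, kh * kw * C))
    for i in range(h_out):
        for j in range(w_out):
            r, c = i * stride, j * stride
            cols[i * w_out + j] = image[r:r + kh, c:c + kw, :].ravel()
    return cols, (h_out, w_out)

def conv2d_as_matmul(image, kernels, stride=1):
    """Convolution expressed as a single matrix multiplication."""
    N, kh, kw, C = kernels.shape
    cols, (h_out, w_out) = im2col(image, kh, kw, stride)
    weight_matrix = kernels.reshape(N, -1).T      # shape (kh*kw*C, N)
    return (cols @ weight_matrix).reshape(h_out, w_out, N)

rng = np.random.default_rng(0)
image = rng.standard_normal((5, 5, 2))
kernels = rng.standard_normal((3, 2, 2, 2))
result = conv2d_as_matmul(image, kernels)

# Direct sliding-window reference used only to check the result.
reference = np.array([[[np.sum(image[i:i + 2, j:j + 2, :] * k) for k in kernels]
                       for j in range(4)] for i in range(4)])
```

  • Once the convolution is in this form, the weight matrix of a convolutional layer can be handled like the weight matrix of any other matrix multiplication.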
  • the loss function (or objective function) measures the difference between the predicted value and the target value.
  • the training of the deep neural network becomes a process of reducing the loss as much as possible.
  • the smaller the loss the higher the training quality of the deep neural network, and the larger the loss, the lower the training quality of the deep neural network.
  • the smaller the loss fluctuation the more stable the training; the larger the loss fluctuation, the more unstable the training.
  • the neural network can use the error back propagation (BP) algorithm to adjust the values of the parameters in the neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller.
  • specifically, forward propagation of the input signal through to the output produces an error loss,
  • and the parameters in the neural network model are updated by backpropagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.
  • the loss value generated in each training iteration of the neural network model is passed layer by layer from back to front through the neural network model.
  • as the loss is passed to each layer, the update amount of that layer's parameters is calculated (a partial derivative operation), and this update amount is related to the gradient.
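  • A minimal sketch of this training process for a two-layer network follows. The network size, the toy data, the learning rate, and the mean squared error loss are all assumptions chosen for the example; the application does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 3))
y = np.sin(X.sum(axis=1, keepdims=True))     # toy regression target

W1 = rng.standard_normal((3, 8)) * 0.5       # first-layer weight matrix
W2 = rng.standard_normal((8, 1)) * 0.5       # second-layer weight matrix
lr = 0.05                                    # hypothetical learning rate
losses = []

for step in range(300):
    # Forward propagation: input signal -> output -> error loss.
    h = np.tanh(X @ W1)
    pred = h @ W2
    err = pred - y
    losses.append(np.mean(err ** 2))

    # Back propagation: the loss gradient flows from the last layer to the first.
    grad_pred = 2 * err / err.size           # d(loss)/d(pred)
    grad_W2 = h.T @ grad_pred                # update amount for layer 2
    grad_h = grad_pred @ W2.T                # gradient passed back to layer 1
    grad_W1 = X.T @ (grad_h * (1 - h ** 2))  # update amount for layer 1

    # Parameter update along the negative gradient.
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2
```

  • The recorded `losses` list shows the error loss shrinking over the iterations, which is what training aims for.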
  • the embodiment of the present application provides a system architecture 200 .
  • the data collection device 260 is used to collect training data.
  • the training data may include training images and processing results corresponding to the training images.
  • for example, the classification result corresponding to a training image may be a manually pre-labeled result.
  • After collecting the training data, the data collection device 260 stores the training data in the database 230, and the training device 220 obtains the target model/rule 201 based on the training data maintained in the database 230.
  • the training device 220 obtains the target model/rule 201 based on the training data.
  • the training device 220 processes the input raw data and compares the output value with the target value until the difference between the value output by the training device 220 and the target value is less than a certain threshold, thus completing the training of the target model/rule 201.
  • the target model/rule 201 in the embodiment of the present application may specifically be a neural network model.
  • for example, a convolutional neural network or a residual neural network.
  • the training data maintained in the database 230 may not all be collected by the data collection device 260, but may also be received from other devices.
  • the training device 220 does not necessarily train the target model/rule 201 based entirely on the training data maintained in the database 230; it may also obtain training data from the cloud or elsewhere for model training. This should not be construed as a limitation on the embodiments of the present application.
  • the target model/rule 201 trained by the training device 220 can be applied to different systems or devices, such as the execution device 210 shown in FIG. 2. The execution device 210 may be a terminal, such as a laptop, an augmented reality (AR)/virtual reality (VR) device, or a vehicle terminal, and may also be a server, a cloud device, and so on.
  • the execution device 210 configures an input/output (input/output, I/O) interface 212 for data interaction with external devices.
  • the input data may include the data to be processed that is input by the client device.
  • When the execution device 210 preprocesses the input data, or when the calculation module 211 of the execution device 210 performs calculation or other related processing, the execution device 210 can call data, code, and the like in the data storage system 250 for the corresponding processing, and can also store the data, instructions, and the like obtained from that processing in the data storage system 250.
  • the I/O interface 212 returns the processing result, such as the processing result of the data obtained above, to the client device 240, thereby providing it to the user.
  • the training device 220 can generate corresponding target models/rules 201 based on different training data for different goals or different tasks, and the corresponding target models/rules 201 can be used to achieve the above-mentioned goals or complete the above-mentioned task to provide the user with the desired result.
  • the user can manually specify the input data, and the manual specification can be operated through the interface provided by the I/O interface 212 .
  • the client device 240 can automatically send the input data to the I/O interface 212. If automatically sending the input data requires the user's authorization, the user can set the corresponding permission in the client device 240.
  • the user can view the results output by the execution device 210 on the client device 240, and the specific presentation form may be specific ways such as display, sound, and action.
  • the client device 240 can also be used as a data collection terminal, collecting the input data fed into the I/O interface 212 and the output results of the I/O interface 212 as new sample data and storing them in the database 230.
  • alternatively, the I/O interface 212 directly stores the input data fed into the I/O interface 212 and the output results of the I/O interface 212, as shown in the figure, in the database 230 as new sample data.
  • FIG. 2 is only a schematic diagram of a system architecture provided by the embodiment of the present application, and the positional relationship between the devices, components, modules, and so on shown in the figure does not constitute any limitation.
  • for example, in FIG. 2 the data storage system 250 is an external memory relative to the execution device 210; in other cases, the data storage system 250 may also be placed in the execution device 210.
  • the target model/rule 201 is obtained according to the training device 220.
  • the target model/rule 201 in the embodiment of the present application may be the neural network in the present application.
  • the neural network in the embodiment of the present application may be a CNN or the like.
  • FIG. 3 is a hardware structure of a chip provided by an embodiment of the present application, and the chip includes a neural network processor 30 .
  • the chip can be set in the execution device 210 shown in FIG. 2 to complete the computing work of the computing module 211 .
  • the chip can also be set in the training device 220 shown in FIG. 2 to complete the training work of the training device 220 and output the target model/rule 201 .
  • the method in the embodiment of the present application can be implemented in the chip shown in FIG. 3 .
  • the neural network processor 30 can be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or any other processor suitable for large-scale XOR operation processing.
  • Taking the NPU as an example: the NPU 30 is mounted on the main central processing unit (host CPU) as a coprocessor, and the host CPU assigns tasks to it.
  • the core part of the NPU is the operation circuit 303, and the controller 304 controls the operation circuit 303 to extract data in the memory (weight memory or input memory) and perform operations.
  • TPU is an artificial intelligence accelerator ASIC fully customized by Google for machine learning.
  • the operation circuit 303 includes multiple processing engines (PE).
  • in some implementations, the operation circuit 303 is a two-dimensional systolic array.
  • the operation circuit 303 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • in some implementations, the operation circuit 303 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 302, and caches it in each PE in the operation circuit.
  • the operation circuit fetches the data of matrix A from the input memory 301 and performs matrix operation with matrix B, and the obtained partial results or final results of the matrix are stored in the accumulator (accumulator) 308 .
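  • The way partial results build up in an accumulator can be sketched in software as a tiled matrix multiplication. This is only an analogy under stated assumptions: the tile size, the function name `tiled_matmul`, and the loop order are hypothetical and say nothing about how the operation circuit 303 is actually wired.

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Compute A @ B tile by tile, accumulating partial results.

    A plays the role of the input matrix and B the role of the weight matrix.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    assert M % tile == 0 and K % tile == 0 and N % tile == 0
    C = np.zeros((M, N))
    for m in range(0, M, tile):
        for n in range(0, N, tile):
            acc = np.zeros((tile, tile))          # stands in for the accumulator
            for k in range(0, K, tile):           # accumulation direction
                acc += A[m:m + tile, k:k + tile] @ B[k:k + tile, n:n + tile]
            C[m:m + tile, n:n + tile] = acc       # final result for this tile
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 12))
B = rng.standard_normal((12, 8))
C = tiled_matmul(A, B, tile=4)
```

  • Each pass over `k` contributes one partial result, and a weight tile that is entirely zero would contribute nothing to `acc`; that is the saving block-level sparsity aims at.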
  • the vector computing unit 307 can perform further processing on the output of the computing circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector calculation unit 307 can be used for network calculations of non-convolution/non-FC layers in neural networks, such as pooling, batch normalization (BN), and local response normalization.
  • the vector calculation unit 307 can store the processed output vectors in the unified memory 306.
  • the vector computing unit 307 may apply a non-linear function to the output of the computing circuit 303, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 307 generates normalized values, binned values, or both.
  • the vector of processed outputs can be used as an activation input to the arithmetic circuit 303, for example for use in a subsequent layer in a neural network.
  • the unified memory 306 is used to store input data and output data.
  • the direct memory access controller (DMAC) 305 transfers input data from the external memory to the input memory 301 and/or the unified memory 306, stores weight data from the external memory into the weight memory 302, and stores data from the unified memory 306 into the external memory.
  • a bus interface unit (BIU) 310 is configured to implement interaction between the host CPU, the DMAC, and the instruction fetch buffer 309 through the bus.
  • An instruction fetch buffer 309 connected to the controller 304 is used to store instructions used by the controller 304.
  • the controller 304 is configured to call the instructions cached in the instruction fetch buffer 309 to control the operation process of the operation accelerator.
  • the unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch buffer 309 are all on-chip memories.
  • the external memory is a memory outside the NPU.
  • the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • the execution device 210 in FIG. 2 or the chip in FIG. 3 described above can be used as an accelerator to execute each step of the AI model calculation method of the embodiment of the present application.
  • the training device 220 in FIG. 2 or the chip in FIG. 3 introduced above can execute each step of the AI model processing method of the embodiment of the present application.
  • the embodiment of the present application provides a system architecture 400 .
  • the system architecture includes a local device 401, a local device 402, an execution device 410, and a data storage system 450, wherein the local device 401 and the local device 402 are connected to the execution device 410 through a communication network.
  • Execution device 410 may be implemented by one or more servers.
  • the execution device 410 may be used in cooperation with other computing devices, such as data storage, routers, load balancers and other devices.
  • Execution device 410 may be arranged on one physical site, or distributed on multiple physical sites.
  • the execution device 410 may use the data in the data storage system 450 or call the program code in the data storage system 450 to implement the AI model computing method or the AI model processing method of the embodiment of the present application.
  • the execution device 410 may perform the following process:
  • a method for processing an AI model, comprising: obtaining an initial AI model and the size of the minimum computing unit of an accelerator, where the accelerator is used to perform operations of a target AI model; and performing sparse processing on the initial AI model to obtain the target AI model, wherein the weight matrix of the target AI model includes multiple weight blocks, the multiple weight blocks include at least one invalid weight block and at least one target weight block, an invalid weight block is a weight block that does not participate in the operation, a target weight block is a weight block that participates in the operation, the size of a weight block is an integer multiple of the size of the minimum computing unit of the accelerator, and the number of target weight blocks in the direction of matrix operation accumulation is the same in the weight matrix of the i-th layer of the target neural network model.
  • Each local device can represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, game console, etc.
  • Each user's local device can interact with the execution device 410 through a communication network of any communication mechanism or communication standard, and the communication network can be a wide area network, a local area network, a point-to-point connection, or any combination thereof.
  • the local device 401 and the local device 402 obtain relevant parameters of the target AI model from the execution device 410, deploy the target AI model on the local device 401 and the local device 402, and use the target AI model to perform image classification, image processing, voice processing, text processing, and so on.
  • the target AI model can be directly deployed on the execution device 410, and the execution device 410 obtains the data to be processed from the local device 401 and the local device 402, and uses the target AI model to process the data to be processed.
  • the above execution device 410 may also be a cloud device. In this case, the execution device 410 may be deployed on the cloud; or, the above execution device 410 may also be a terminal device. In this case, the execution device 410 may be deployed on the user terminal side. This is not limited.
  • FIG. 5 shows a schematic diagram of a compiler in an accelerator provided by an embodiment of the present application.
  • the compiler 500 includes: a sparse tool 510 , a graph optimization and compilation module 520 and an operator compilation module 530 .
  • the sparse tool 510 is used to perform sparse processing on the initial AI model to obtain the target AI model.
  • the sparse tool 510 includes a sparse network generating module 511 and a sparse network tuning module 512 .
  • the sparse network generating module 511 is used to designate at least one weight block in each of at least one target network layer of the initial AI model as a weight block that does not participate in the operation, so as to obtain a sparse AI model.
  • the sparse network tuning module 512 is used to fine tune the sparse AI model obtained by the sparse network generation module 511 to obtain the target AI model.
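  • One common way to fine tune a sparse model while keeping its pruned blocks at zero is to mask the gradient update. The sketch below illustrates that idea on a single linear layer; the masking strategy, the toy data, the loss, and the learning rate are assumptions for illustration, since the application does not specify how module 512 performs the fine tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
block = 2
block_mask = np.array([[1, 0],
                       [0, 1]])                  # 1 = target block, 0 = invalid block
mask = np.kron(block_mask, np.ones((block, block)))   # element-level mask, 4x4

# Toy data generated by a weight matrix with the same block pattern.
W_true = rng.standard_normal((4, 4)) * mask
X = rng.standard_normal((128, 4))
Y = X @ W_true

W = rng.standard_normal((4, 4)) * mask           # sparse model before fine tuning
lr = 0.1                                         # hypothetical learning rate

def loss_fn(W):
    # Squared error summed over outputs, averaged over samples.
    return np.sum((X @ W - Y) ** 2) / len(X)

initial_loss = loss_fn(W)
for _ in range(200):
    grad = 2 * X.T @ (X @ W - Y) / len(X)
    W -= lr * grad * mask                        # masked update: invalid blocks stay 0
final_loss = loss_fn(W)
```

  • Only the target weight blocks are adjusted, so the sparsity pattern produced by the generating step is preserved while accuracy is recovered.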
  • the sparse tool 510 can obtain the target AI model through the processing method of the AI model in the embodiment of the present application, and the specific process can refer to the method 600 below.
  • the graph optimization and compilation module 520 is used to process the topology graph of the AI model.
  • the graph optimization and compilation module 520 may include a graph engine (GE) 521, a fusion engine (FE) 522, an AI CPU engine (AICPUE) 523, and the Huawei collective communication library (HCCL) 524.
  • GE521 is used to provide a unified intermediate representation (IR) interface for different deep learning frameworks.
  • GE521 can be used to parse the topology map of the target AI model obtained after processing by the sparse tool 510 into a topology map of a model suitable for the current hardware, that is, to rewrite the topology map of the target AI model.
  • FE522 is used to fuse some operators in the graph processed by GE521.
  • AICPUE523 is used to execute the calculation of the target AI model.
  • HCCL524 is used to provide collective communication operators to realize data transmission between different accelerators. For example, data transmission can be realized through HCCL524 between different NPUs in distributed training.
  • the operator compilation module 530 is used to compile each operator in the graph processed by the graph optimization and compilation module 520 , that is, each node in the graph, and generate a corresponding executable file.
  • the operator compilation module 530 includes an operator library 531 and a tensor boost engine (tensor boost engine, TBE) 532.
  • the operator library 531 may include sparse operators.
  • the sparse operator is used to realize the operation of the sparse network layer in the target AI model.
  • the operation of sparse network layer can also be called sparse calculation. For specific description, refer to the method 900 in the embodiment of this application.
  • TBE532 is used to provide GE521 with operator information, and provide FE522 with subgraph optimization information and TBE operator call information, and finally generate executable tasks.
  • TBE532 is also used to generate custom executable operators.
  • TBE532 can provide users with a custom operator development method.
  • the compiler 500 can be obtained based on an existing compiler; for example, the compiler 500 can be obtained by integrating the sparse tool 510 into the Ascend tensor compiler (ATC).
  • the structure of the compiler in FIG. 5 is only an example, and those skilled in the art should understand that the compiler 500 may also include other devices necessary for normal operation during specific implementation. Meanwhile, according to specific needs, those skilled in the art should understand that the compiler 500 may also include hardware devices for implementing other additional functions. In addition, those skilled in the art should understand that the compiler 500 may also only include devices necessary to realize the embodiment of the present application, and does not necessarily include all the modules shown in FIG. 5 .
  • the calculation of AI models usually requires the support of high computing power and storage space, which affects the calculation speed of the model.
  • performing sparse calculation on the model is an effective means to achieve accelerated calculation.
  • Weight sparsification reduces the computational load of the model by reducing redundant weights in the AI model, thereby improving the processing speed and energy efficiency of the hardware accelerator.
  • existing weight sparsification schemes especially fine-grained sparsification schemes, usually require specific hardware support to achieve model acceleration.
  • For example, in Nvidia's A100, every 4 adjacent weights are grouped and 2 non-zero weights are retained in each group, which achieves a sparsity rate of 50%.
  • This scheme uses a single weight as the pruning unit, and the non-zero weights in the sparse network obtained after weight sparsification appear at random positions.
  • As a result, the complexity of the weight index is high, and specific hardware support is required to achieve effective hardware acceleration.
  • The tensor core structure of the A100 uses a 4x4 data structure, that is, 4 weights form a group, to support 2-out-of-4 sparse calculation. In addition, the 2-out-of-4 sparse scheme limits the randomness of the weights to within a group.
  • The A100 loads all the data corresponding to a group of weights into the registers of the computing core, which reduces the latency caused by random access to external memory and achieves effective hardware acceleration.
  • On hardware without this support, the 2-out-of-4 sparse scheme cannot achieve effective hardware acceleration, and may not even be able to perform sparse calculation.
  • The hardware design of the A100 also means that it supports only a sparsity rate of 50%.
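  • The 2-out-of-4 grouping described above can be sketched as follows. Keeping the two largest-magnitude weights in each group is an assumption made for the example (the text only says that 2 non-zero weights are retained per group), and the function name `prune_2_of_4` is hypothetical.

```python
import numpy as np

def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in every group of 4 adjacent weights."""
    w = np.asarray(weights, dtype=float)
    assert w.shape[-1] % 4 == 0, "last dimension must be a multiple of 4"
    groups = w.reshape(-1, 4)
    # Indices of the two smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(w.shape)

w = np.array([[0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]])
sparse_w = prune_2_of_4(w)
nonzeros_per_group = np.count_nonzero(sparse_w.reshape(-1, 4), axis=1)
```

  • Because the surviving weights can sit at any 2 of the 4 positions, each group needs its own index, which is the index complexity that makes dedicated hardware support necessary.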
  • the embodiment of the present application provides an AI model processing method, which reduces the calculation amount of the AI model, and improves the calculation speed of the AI model without adding additional hardware support.
  • FIG. 6 shows a method 600 for processing an AI model provided by an embodiment of the present application.
  • the method shown in Figure 6 can be performed by an AI model training device, which can be a cloud service device or a terminal device, for example, a computer, a server and other devices with sufficient computing power to perform AI model training, or It is a system composed of cloud service equipment and terminal equipment.
  • the method 600 may be executed by the training device 220 in FIG. 2, the neural network processor 30 in FIG. 3, the execution device 410 in FIG. 4, or a local device.
  • the method 600 can be executed by calling the sparse tool 510 in FIG. 5 by an AI model training device.
  • the method 600 includes steps S610 and S620, which are described in detail below.
  • the AI model may be a model such as a neural network model, a support vector machine, a random forest, or a decision tree.
  • the initial AI model may be an initial neural network model
  • the target AI model may be a target neural network model.
  • the neural network model in this embodiment of the present application may be an existing neural network model, for example, a residual network or a convolutional neural network.
  • the neural network model may also be a custom-built neural network model with another structure. This is not limited in the embodiments of the present application.
  • the initial AI model may be a trained AI model.
  • the type of the first training data used to train the initial AI model is related to the tasks of the AI model.
  • the first training data may be image data.
  • image processing tasks include image classification, image detection, image segmentation, or image generation.
  • the AI model is used for a text processing task
  • the first training data may be text data.
  • text processing tasks include text recognition or text translation.
  • the AI model is used for a speech processing task
  • the first training data may be speech data.
  • speech processing tasks include speech recognition and the like.
  • the embodiment of the present application does not limit the type of the first training data.
  • the initial AI model can be obtained in various ways.
  • the initial AI model input by the user may be obtained. That is, the initial AI model is provided by the user.
  • the locally stored initial AI model may be read.
  • the initial AI model sent by other devices may be received.
  • the AI model may be trained to obtain an initial AI model.
  • the embodiment of the present application does not limit the specific manner of obtaining the initial AI model.
  • the computing unit is a core unit in an accelerator that provides computing power, and may also be called a core.
  • the calculation unit may be used to perform calculation operations such as matrix operations, vector operations, and scalar data operations.
  • the size of the minimum computing unit refers to the scale of matrix operations that the accelerator can handle at one time, or the amount of calculations performed by the computing unit in one operation.
  • the size of the minimum calculation unit may be the scale of matrix operations that the accelerator can complete within one clock cycle.
  • the accelerator may include at least one of the following: CPU, GPU, NPU, TPU and so on.
  • the smallest computing unit may also be called a block unit (blockunit).
  • the blockunit is determined by the hardware specification of the accelerator. Different accelerators may have different blockunits. For example, as shown in FIG. 8 , accelerator 1 , accelerator 2 , accelerator 3 , accelerator 4 and accelerator 5 respectively correspond to five different blockunits: blockunit_1 , blockunit_2 , blockunit_3 , blockunit_4 and blockunit_5 .
  • the size of the minimum calculation unit may be input by the user.
  • the size of the minimum computing unit may also be pre-stored.
  • the embodiment of the present application does not limit the manner of obtaining the size of the minimum computing unit.
  • the weight matrix of the target AI model includes multiple weight blocks, and the multiple weight blocks include at least one invalid weight block and at least one target weight block. An invalid weight block is a weight block that does not participate in the operation, and a target weight block is a weight block that participates in the operation. The size of a weight block is an integer multiple of the size of the minimum computing unit of the accelerator, and the number of target weight blocks in the direction of matrix operation accumulation is the same in the weight matrix of the target AI model.
  • step S620 may include: performing sparse processing on the initial neural network model to obtain a target neural network model.
  • the weight matrix of the i-th layer of the target neural network model includes a plurality of weight blocks, the plurality of weight blocks include at least one invalid weight block and at least one target weight block, the invalid weight block is a weight block that does not participate in the operation, and the target weight block is the weight block involved in the operation.
  • the size of the weight block is an integer multiple of the size of the smallest computing unit of the accelerator, and the size of the weight matrix of the i-th layer is an integer multiple of the size of the weight block.
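  • The two size constraints can be checked with a few lines of code. In this sketch, sizes are treated as (rows, columns) pairs and the 16x16 minimum computing unit is a hypothetical value; neither the representation nor the numbers are taken from the application.

```python
def check_block_sizes(weight_shape, block_shape, unit_shape):
    """Return the weight-block grid (rows, cols) if the sizes are compatible."""
    for b, u in zip(block_shape, unit_shape):
        if b % u != 0:
            raise ValueError(
                "weight block size must be an integer multiple of the "
                "minimum computing unit size")
    for w, b in zip(weight_shape, block_shape):
        if w % b != 0:
            raise ValueError(
                "weight matrix size must be an integer multiple of the "
                "weight block size")
    return tuple(w // b for w, b in zip(weight_shape, block_shape))

# Hypothetical layer: 64x128 weight matrix, 32x32 weight blocks, 16x16 unit.
grid = check_block_sizes((64, 128), (32, 32), (16, 16))

# A 24x32 block is rejected because 24 is not a multiple of 16.
try:
    check_block_sizes((64, 128), (24, 32), (16, 16))
    rejected = False
except ValueError:
    rejected = True
```

  • With these example sizes the layer is divided into a 2x4 grid of weight blocks, each of which maps onto a whole number of minimum computing units.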
  • the position of the at least one target weight block in the weight matrix of the i-th layer of the target neural network model is indicated by the weight index of the i-th layer.
  • i is a positive integer.
  • Invalid weight blocks and target weight blocks can be represented in a number of ways.
  • all weight values in the invalid weight block are set to 0.
  • all weight blocks with a weight value of 0 are invalid weight blocks.
  • At least one weight block whose weight value is not 0 is the target weight block.
  • all weight values in the invalid weight block are not activated, and the weight values in the target weight block are activated.
  • the weight block that is not activated during the operation is the invalid weight block, and the weight block that is activated during the operation is the target weight block.
  • the invalid weight blocks and the target weight blocks of the target AI model may be indicated by a mask in the form of a matrix with values 0 and 1, that is, which weight blocks are activated and which weight blocks are not activated.
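  • A block-level 0/1 mask of this kind can be expanded to element level with a Kronecker product, as in the sketch below. The 2x2 block size, the mask values, and the choice of rows as the accumulation direction (for a product of the form x @ W) are assumptions for illustration.

```python
import numpy as np

block = (2, 2)                           # hypothetical weight block size
block_mask = np.array([[1, 0, 1],
                       [0, 1, 1],
                       [1, 1, 0]])       # 1 = target block, 0 = invalid block

# Expand the block-level mask so that each entry covers a whole weight block.
element_mask = np.kron(block_mask, np.ones(block, dtype=int))

weights = np.arange(1, 37, dtype=float).reshape(6, 6)
sparse_weights = weights * element_mask  # invalid blocks become all zeros

# For y = x @ W, accumulation runs down the rows of W, so the number of
# target blocks is counted per block column.
blocks_per_column = block_mask.sum(axis=0)
```

  • In this example every block column keeps two target blocks, so every output tile needs the same number of block multiplications.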
  • the invalid weight block and the target weight block may also be expressed in other ways, which is not limited in this embodiment of the present application.
  • the solution of the present application is described hereinafter only by setting the weight value to 0 as an example of not participating in the operation, and does not constitute a limitation to the solution of the embodiment of the present application.
  • step S620 can be understood as performing sparse processing on the weight matrix of at least one target network layer in the initial neural network model to obtain the target neural network.
  • the granularity of sparse processing is the size of the weight block of the target network layer, that is, the weight values in the weight matrix of the target network layer of the initial neural network model are retained or discarded in units of weight blocks.
  • the target network layer can also be called a sparse network layer.
  • the i-th layer may be any network layer in the at least one target network layer.
  • the i-th layer is taken as an example to illustrate the process of sparse processing.
  • Other network layers in the at least one target network layer may also perform sparse processing in the same manner.
  • Sizes of the weight blocks of the target network layers in the at least one target network layer may be the same or different.
  • the size of the weight block is used as the granularity of the sparse processing, that is, the weight value in the weight matrix is retained or discarded in units of the weight block, and the size of the weight block is an integer multiple of the size of the smallest computing unit of the accelerator .
  • the solution of the embodiment of the present application reduces redundant weights in the initial AI model, reduces the calculation amount of the model, and is beneficial to improve the processing speed of the accelerator.
  • the size of the target weight block of the target AI model is an integer multiple of the size of the smallest computing unit.
  • the accelerator can obtain the data required for the operation at one time according to the data format supported by the smallest computing unit.
  • step S620 includes step S621 and step S622 (not shown in FIG. 6 ).
  • S621, set at least one weight block in each target network layer of the at least one target network layer in the initial neural network model as a weight block that does not participate in operations, so as to obtain a sparse neural network model.
  • the at least one target network layer includes the i-th layer.
  • step S621 may be performed by the sparse network generation module 511 in FIG. 5 .
  • step S621 may be to set the weight value of at least one weight block in each target network layer in at least one target network layer in the initial neural network model to 0, so as to obtain a sparse neural network model.
  • the at least one target network layer includes the i-th layer. Step S621 will be described below by taking the i-th layer as an example.
  • the weight matrix of the i-th layer of the initial neural network model includes multiple weight blocks. By setting the weight value of at least one weight block in the multiple weight blocks to 0, the weight matrix of the i-th layer of the sparse neural network model can be obtained.
  • In this embodiment of the present application, setting the weight values of at least one weight block to 0 may also be understood as deleting the weight values in the at least one weight block.
  • a weight block whose weight value is set to 0 is an invalid weight block.
  • a weight block whose weight value is not set to 0 is a valid weight block to be reserved. Taking the i-th layer as an example, in the weight matrix of the i-th layer of the sparse neural network model, at least one weight value in the valid weight block is not 0, and all weight values in the invalid weight block are 0.
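  • The distinction between valid and invalid weight blocks can be sketched as follows, assuming that an invalid block is recognised by all of its weight values being 0. The function name `block_validity` and the sample matrix are hypothetical.

```python
def block_validity(weights, block_rows, block_cols):
    """Return a 0/1 matrix: 1 where the block holds at least one nonzero weight."""
    n_block_rows = len(weights) // block_rows
    n_block_cols = len(weights[0]) // block_cols
    validity = []
    for br in range(n_block_rows):
        row = []
        for bc in range(n_block_cols):
            has_nonzero = any(
                weights[r][c] != 0
                for r in range(br * block_rows, (br + 1) * block_rows)
                for c in range(bc * block_cols, (bc + 1) * block_cols)
            )
            row.append(1 if has_nonzero else 0)
        validity.append(row)
    return validity


# Hypothetical sparse 4x4 weight matrix with 2x2 blocks.
sparse = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 3],
    [0, 0, 2, 0],
]
validity = block_validity(sparse, 2, 2)
```

  • A valid block may still contain individual zeros; only a block whose values are all 0 counts as invalid in this sketch.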
  • Fig. 7 shows a schematic diagram of a weight matrix.
  • the size of the weight matrix is K×N, and the size of the minimum computing unit of the accelerator is BU_K×BU_N.
  • the size of the weight block (block) of the target network layer is an integer multiple of the size of the smallest computing unit, so the size of the block can be expressed as (α*BU_K)×(β*BU_N), and the size of the weight matrix is an integer multiple of the size of the block.
  • α is a positive integer
  • β is a positive integer
  • K is a positive integer
  • N is a positive integer
  • BU_K is a positive integer
  • BU_N is a positive integer.
  • the weight matrix is divided into multiple blocks according to the size of the block, and a block to be reserved is selected among the multiple blocks. For example, as shown in FIG. 7 , the weight matrix is divided into 15 blocks, and the weight values of 9 blocks are reserved. The 9 blocks are effective weight blocks, and the remaining 6 weight blocks are invalid weight blocks.
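  • The size relationships above can be checked with a short sketch. The function `block_grid` and its parameter names `mult_k` and `mult_n` (the two integer multipliers of the minimum computing unit) are hypothetical, and the dimensions 80×48 with a 16×16 unit are assumed values chosen only so that the grid contains 15 blocks, as in the example of FIG. 7.

```python
def block_grid(k, n, bu_k, bu_n, mult_k, mult_n):
    """Return (block rows, block columns) of a K x N weight matrix."""
    block_k = mult_k * bu_k  # block height: integer multiple of the unit height
    block_n = mult_n * bu_n  # block width: integer multiple of the unit width
    if k % block_k or n % block_n:
        raise ValueError("weight matrix size must be a multiple of the block size")
    return k // block_k, n // block_n


# Assumed sizes: 80 x 48 weight matrix, 16 x 16 minimum computing unit.
rows, cols = block_grid(80, 48, 16, 16, 1, 1)
total_blocks = rows * cols
```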
  • Multiple weight blocks in a weight matrix can have the same size. If the sizes of multiple weight blocks in the weight matrix are different, the operators in the AI model need to be compiled separately for the weight blocks of different sizes, which increases the number of compilations. In the solution of the embodiment of the present application, multiple weight blocks in a weight matrix have the same size, so only one operator compilation is required when compiling the operators in the AI model. This reduces the number of compilations and therefore the compilation time, and improves the calculation speed of the model. In addition, when the accelerator includes multiple computing units, parallel computing of multiple weight blocks can be realized, further improving the computing speed of the model.
  • the weight matrix of the i-th layer of the sparse neural network model includes at least one effective weight block, and at least one weight value in the effective weight block is not zero.
  • in the weight matrix of the i-th layer of the sparse neural network model, the number of effective weight blocks located in each matrix operation accumulation (reduce) direction is the same.
  • in other words, the number of discarded weight blocks is the same in each matrix operation accumulation direction, and so is the number of retained weight blocks.
  • the weight matrix is divided into three columns in the direction of accumulation, that is, three dotted boxes in FIG. 7 , and the number of weight blocks reserved in each column is the same.
  • Each column in Figure 7 reserves three weight blocks.
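  • The equal-count condition can be expressed as a simple check on a block mask. The mask below has 15 blocks with 3 retained in each of 3 columns, matching the counts given for FIG. 7, but the positions of the retained blocks are hypothetical, as are the function names.

```python
def retained_per_column(block_mask):
    """Count retained blocks (mask value 1) in each accumulation column."""
    n_cols = len(block_mask[0])
    return [sum(row[c] for row in block_mask) for c in range(n_cols)]


def is_balanced(block_mask):
    """True if every accumulation column retains the same number of blocks."""
    return len(set(retained_per_column(block_mask))) == 1


# 5 x 3 block mask: 15 blocks, 9 retained, 3 per column (positions assumed).
block_mask = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]
counts = retained_per_column(block_mask)
balanced = is_balanced(block_mask)
```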
  • the size of the weight block of the at least one target network layer may be artificially set.
  • the size of the weight block may be obtained by searching in the search space.
  • the size of the weight block is an integer multiple of the size of the smallest computing unit.
  • the size of the weight block can be obtained by searching in the search space, or it can be understood as: the multiple corresponding to the integer multiple is obtained by searching in the search space.
  • the search space refers to the range that can be searched or the range that can be selected.
  • Automated machine learning (AutoML) usually pre-defines a search space and continuously generates different configurations in the search space, forming a closed loop of evaluation, feedback and regeneration of configurations until the desired configuration is obtained.
  • the search space is determined according to the task to be searched. For example, when searching for the size of the weight block, different multiples can be generated in the search space and the currently obtained weight block size is evaluated, for example, by evaluating the model corresponding to the current weight block size. Other multiples are then generated based on the fed-back evaluation result, until the desired multiple is obtained.
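  • The evaluation and feedback loop can be sketched with a deliberately simple exhaustive search. This is not AMC or any specific AutoML method; it is a plain grid search swapped in for illustration, and `toy_evaluate` is a made-up stand-in for evaluating the model that corresponds to a given block size.

```python
from itertools import product


def search_block_multiples(candidates, evaluate):
    """Try every (mult_k, mult_n) pair and keep the best-scoring one."""
    best_pair, best_score = None, float("-inf")
    for pair in product(candidates, repeat=2):
        score = evaluate(pair)      # evaluation of the current configuration
        if score > best_score:      # feedback: remember the best result so far
            best_pair, best_score = pair, score
    return best_pair, best_score


def toy_evaluate(pair):
    """Made-up score; a real evaluation would measure the sparsified model."""
    mult_k, mult_n = pair
    return -abs(mult_k - 2) - abs(mult_n - 1)


best_pair, best_score = search_block_multiples([1, 2, 4], toy_evaluate)
```

  • A practical search would replace the exhaustive loop with a strategy that proposes new multiples based on earlier scores, since each evaluation is expensive.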
  • the size of the weight block of the i-th layer may be obtained by searching in the search space.
  • the specific search method can adopt the scheme in the prior art, for example, adopt automatic machine learning model compression (AutoML model compression, AMC) to obtain the size of the weight block.
  • the weight block sizes of the multiple weight matrices may be set in the same manner or differently.
  • the sizes of the weight blocks of some of the weight matrices among the plurality of weight matrices may be artificially set, and the sizes of the weight blocks of the remaining weight matrices may be obtained by searching in the search space. This embodiment of the present application does not limit this.
  • the size of the weight block in the weight matrix of the i-th layer of the target neural network model is different from the size of the weight block in the weight matrix of the j-th layer of the target neural network model, where j is a positive integer.
  • the sizes of the weight blocks in the weight matrices of the different target network layers may be the same or different.
  • the size of the weight block is searched in the search space, and the size of the weight block suitable for each weight matrix can be obtained, which is beneficial to improve the accuracy of the model.
  • the sparsity rates of the weight matrices in the target AI model can be the same or different.
  • the sparse ratio of the weight matrix of the target AI model may be artificially set.
  • the sparsity rate of the weight matrix of the target AI model is obtained by searching in the search space, and the sparsity rate of the target AI model satisfies the target sparsity rate.
  • the sparsity rate of the target AI model satisfying the target sparsity rate may mean that the sparsity rate of the target AI model is less than or equal to the target sparsity rate.
  • alternatively, it may mean that the difference between the sparsity rate of the target AI model and the target sparsity rate is less than or equal to the first threshold.
  • the sparsity rate of at least one target network layer for the sparsification process can be obtained by searching in the search space.
  • the sparsity rate of the sparse neural network model meets the target sparsity rate.
  • the at least one target network layer includes the i-th layer, that is, the sparsity rate of the i-th layer is obtained by searching in the search space.
  • the sparse neural network model may be a candidate sparse neural network model with the highest accuracy obtained through searching among the candidate sparse neural network models whose sparsification rate satisfies the target sparsity rate.
  • the sparsity rate of the candidate sparse neural network model satisfying the target sparsity rate may mean that the sparsity rate of the candidate sparse neural network model is less than or equal to the target sparsity rate.
  • alternatively, it may mean that the difference between the sparsity rate of the candidate sparse neural network model and the target sparsity rate is less than or equal to the first threshold.
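  • The selection rule described above can be sketched as a filter followed by a maximum. The candidate records, the field names and the use of integer percentages are assumptions made for illustration; the two interpretations of "satisfies" follow the wording of the text (less than or equal to the target, or within a threshold of it).

```python
def meets_target_sparsity(rate, target, threshold=None):
    """Check a sparsity rate against the target under either interpretation."""
    if threshold is None:
        return rate <= target             # first interpretation
    return abs(rate - target) <= threshold  # second interpretation


def pick_most_accurate(candidates, target, threshold=None):
    """Among candidates that satisfy the target, return the most accurate."""
    feasible = [
        c for c in candidates
        if meets_target_sparsity(c["sparsity"], target, threshold)
    ]
    return max(feasible, key=lambda c: c["accuracy"]) if feasible else None


# Hypothetical candidates; sparsity and accuracy are integer percentages.
candidates = [
    {"name": "a", "sparsity": 40, "accuracy": 91},
    {"name": "b", "sparsity": 50, "accuracy": 89},
    {"name": "c", "sparsity": 60, "accuracy": 93},
]
chosen_strict = pick_most_accurate(candidates, target=50)
chosen_within = pick_most_accurate(candidates, target=50, threshold=10)
```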
  • the target sparsity rate may be input by the user.
  • the target sparse rate can also be determined according to the computing power of the accelerator.
  • a specific search method may use a solution in the prior art, for example, use AMC to obtain the sparsity rate of the weight matrix.
  • the sparse rate of each weight matrix can be searched in the search space, and the sparse rate suitable for each weight matrix can be obtained, which is conducive to improving the accuracy of the model.
  • the sparse rate of at least one target network layer for sparse processing is obtained by searching in the search space.
  • the accuracy of the sparse neural network model meets the target accuracy.
  • the sparse neural network model may be a candidate sparse neural network model with the highest sparsity rate obtained through searching among the candidate sparse neural network models whose accuracy meets the target accuracy.
  • the accuracy of the candidate sparse neural network model satisfying the target accuracy may mean that the accuracy of the candidate sparse neural network model is greater than or equal to the target accuracy.
  • alternatively, it may mean that the difference between the target accuracy and the accuracy of the candidate sparse neural network model is less than or equal to the second threshold.
  • the target accuracy may be user-input.
  • a specific search method may use a solution in the prior art, for example, use AMC to obtain the sparsity rate of the weight matrix.
  • the sparse rate of each weight matrix can be searched in the search space, and the sparse rate suitable for each weight matrix can be obtained, which is conducive to reducing the sparse rate of the model, so that the calculation amount of the model is reduced and the accelerated operation of the model is realized.
  • the setting methods of the sparsity rate of each weight matrix can be the same or different.
  • the sparsity rate of some weight matrices can be set artificially, and the sparsity rate of the rest of the weight matrices can be obtained by searching in the search space. This embodiment of the present application does not limit this.
  • the network layer corresponding to the weight matrix of the target AI model is randomly determined.
  • At least one target network layer for sparse processing may be randomly determined.
  • the network layer corresponding to the weight matrix of the target AI model belongs to multiple candidate network layers.
  • the at least one target network layer is at least one candidate network layer among multiple candidate network layers.
  • part or all of the plurality of candidate network layers may be selected as at least one target network layer for sparse processing.
  • At least one target network layer may be obtained by searching among multiple candidate network layers.
  • the plurality of candidate network layers may be artificially set.
  • the plurality of candidate network layers may include other network layers than the first layer.
  • the at least one target network layer may be set manually.
  • the at least one target network layer may include other network layers than the first layer.
  • sparse processing can be performed within the scope of the specified candidate network layers to avoid affecting other network layers, for example, network layers that are highly correlated with the accuracy of the model, which is conducive to ensuring the accuracy of the model.
  • step S622 may be performed by the sparse network tuning module 512 in FIG. 5 .
  • the type of the second training data used for tuning the sparse neural network model may be the same as the type of the first training data.
  • step S622 is an optional step, and if step S620 does not include S622, the sparse neural network model may be used as the target neural network model.
  • the accuracy of the target neural network model can be improved.
  • the sparse configuration of the sparse neural network model will not be changed. Specifically, tuning will not change the sparsity rate of each target network layer in the at least one target network layer of the sparse neural network model and the size of the weight block of each target network layer. Correspondingly, tuning does not change the sparsity rate of the sparse neural network model. Tuning also does not change the position of the effective weight blocks in each target network layer in the sparse neural network model.
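  • One common way to keep the sparse configuration fixed during tuning is to restrict each weight update to the retained blocks. The text does not state how tuning is implemented, so the masked gradient step below is an assumption, and the function name, learning rate and gradient values are hypothetical.

```python
def masked_sgd_step(weights, grads, mask, block_rows, block_cols, lr):
    """Update only weights in retained blocks; invalid blocks stay at zero."""
    updated = []
    for r, (w_row, g_row) in enumerate(zip(weights, grads)):
        new_row = []
        for c, (w, g) in enumerate(zip(w_row, g_row)):
            keep = mask[r // block_rows][c // block_cols]
            new_row.append(w - lr * g if keep else 0.0)
        updated.append(new_row)
    return updated


# 2 x 4 weight matrix with 2 x 2 blocks: left block retained, right block invalid.
weights = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
]
grads = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
mask = [[1, 0]]
tuned = masked_sgd_step(weights, grads, mask, 2, 2, lr=0.5)
```

  • Because the mask is never modified, the block size, the sparsity rate and the positions of the retained blocks are all unchanged after the step.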
  • the sparse rate of each target network layer in the sparse neural network model is the sparse rate of each target network layer in the target neural network model.
  • the size of the weight block of each target network layer of the sparse neural network model is the size of the weight block of each target network layer of the target neural network model.
  • the sparse rate of the sparse neural network model is the sparse rate of the target neural network model.
  • the position of the effective weight block of each target network layer of the sparse neural network model in the weight matrix of that layer is the same as the position of the target weight block of each target network layer of the target neural network model in the weight matrix of that layer.
  • the sparse rate of the at least one target network layer of the sparse neural network model and the sparse rate of the at least one network layer of the target neural network model are collectively referred to as the sparse rate of the at least one network layer; the size of the weight blocks in the weight matrix of the at least one network layer of the sparse neural network model and the size of the weight blocks in the weight matrix of the at least one network layer of the target neural network model are collectively referred to as the size of the weight blocks in the weight matrix of the at least one network layer.
  • when the at least one target network layer includes the i-th layer, "the sparse rate of the i-th layer of the sparse neural network model" and "the sparse rate of the i-th layer of the target neural network model" are collectively referred to as "the sparse rate of the i-th layer"; "the size of the weight block in the weight matrix of the i-th layer of the sparse neural network model" and "the size of the weight block in the weight matrix of the i-th layer of the target neural network model" are collectively referred to as "the size of the weight block in the weight matrix of the i-th layer", or "the size of the weight block of the i-th layer".
  • the weight matrix of the i-th layer of the sparse neural network model includes at least one effective weight block.
  • the weight matrix of the i-th layer of the target neural network model includes at least one target weight block.
  • among the at least one effective weight block, the number of effective weight blocks located in each matrix operation accumulation direction of the weight matrix of the i-th layer of the sparse neural network model is the same. It can also be understood that, among the at least one target weight block, the number of target weight blocks located in each matrix operation accumulation direction of the weight matrix of the i-th layer of the target neural network model is the same.
  • the at least one target network layer may include a fully connected layer or a convolutional layer.
  • the i-th layer can be a convolutional layer or a fully connected layer.
  • the i-th layer is a convolutional layer.
  • the weight matrix of the i-th layer can be obtained by converting the convolution kernel of the i-th layer into a two-dimensional matrix.
  • the convolution operation can be converted into a matrix multiplication operation by means of im2col.
  • the convolution kernel of the i-th layer is expanded into a two-dimensional matrix suitable for matrix multiplication operation, and the two-dimensional matrix is the weight matrix of the convolution layer in the embodiment of the present application.
  • the two-dimensional matrix obtained after expansion can be sparsely processed in the manner described above, and will not be described here.
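  • The im2col conversion can be sketched for a small single-channel case. The sketch assumes stride 1, no padding and the cross-correlation convention used by most deep learning frameworks; the helper names and the sample values are hypothetical.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch (stride 1, no padding) into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows


def kernels_to_matrix(kernels):
    """Flatten each kernel into one column of a K x N weight matrix."""
    flat = [[v for row in kernel for v in row] for kernel in kernels]
    return [list(col) for col in zip(*flat)]


def matmul(a, b):
    """Plain matrix multiplication on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernels = [
    [[1, 0], [0, 1]],  # sums the main diagonal of each patch
    [[0, 1], [1, 0]],  # sums the anti-diagonal of each patch
]
data = im2col(image, 2, 2)                  # M x K = 4 x 4
weight_matrix = kernels_to_matrix(kernels)  # K x N = 4 x 2
output = matmul(data, weight_matrix)        # M x N = 4 x 2
```

  • After this conversion the weight matrix is an ordinary K×N matrix, so it can be partitioned into weight blocks in the same way as the weight matrix of a fully connected layer.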
  • the sparse configuration of the entire network is searched based on the sparse principle to obtain a sparse neural network model.
  • the sparse neural network model may be a candidate sparse neural network model with the highest accuracy when the candidate sparse neural network model satisfies the target sparse rate.
  • the sparse neural network model is tuned to obtain the target neural network model.
  • a specific search method may adopt a solution in the prior art, for example, use an AMC method to search to obtain a sparse neural network model.
  • the size of the weight block of each target network layer is an integer multiple of the size of the smallest computing unit of the accelerator, and the size of the weight matrix of each target network layer is an integer multiple of the size of the weight block of each target network layer.
  • the size of the weight block of each target network layer is used as the granularity for sparse processing of the weight matrix of each target network layer in the initial neural network model.
  • the sparse tool obtains the size of the minimum computing unit of the accelerator, the target sparse rate, and the original deep neural network (original DNN).
  • the size of the minimum computing unit of the accelerator may be 16*16. That is to say, the data format that the minimum computing unit of the accelerator can handle is 256 elements arranged in a matrix form of 16*16.
  • the target sparsity rate can be 50%.
  • the original DNN is an example of the initial neural network model in the embodiment of this application.
  • the sparse tool can perform step S620 to perform sparse processing on the original DNN, for example, search the sparse configuration of the entire network based on the sparse principle to obtain a sparse neural network model, and tune the sparse neural network model to obtain a target neural network model.
  • the sparse configuration of the entire network may include the sparse rate of each target network layer in at least one target network layer in the sparse neural network model, the size of the weight block of each target network layer, and the position of the effective weight block in each target network layer.
  • the sparsity rates of the target network layers in the target neural network model can be the same or different.
  • the size of the weight block of each target network layer in the target neural network model may be the same or different.
  • the sparsity rates of the 4 target network layers in the target neural network model in Figure 9 are 50%, 66%, 25% and 33%, respectively.
  • the sizes of the weight blocks of the four target network layers are 16*16, 32*32, 16*32 and 32*32, respectively.
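  • The figures quoted above can be checked against the 16*16 minimum computing unit with a short sketch. The record layout and the function name `block_multiples` are hypothetical; the sparsity rates and block sizes are the ones listed for the four target network layers.

```python
MIN_UNIT = (16, 16)  # size of the accelerator's minimum computing unit

layers = [
    {"sparsity_percent": 50, "block": (16, 16)},
    {"sparsity_percent": 66, "block": (32, 32)},
    {"sparsity_percent": 25, "block": (16, 32)},
    {"sparsity_percent": 33, "block": (32, 32)},
]


def block_multiples(block, unit):
    """Return the integer multipliers of the block size, or raise if not integral."""
    if block[0] % unit[0] or block[1] % unit[1]:
        raise ValueError(f"block {block} is not a multiple of unit {unit}")
    return block[0] // unit[0], block[1] // unit[1]


multiples = [block_multiples(layer["block"], MIN_UNIT) for layer in layers]
```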
  • the method 600 further includes: compressing the weight matrix of the target AI model to obtain compressed weights, and storing the weight matrix of the target AI model in the form of compressed weights.
  • this step can be understood as: compressing the weight matrix of the at least one target network layer in the target neural network model to obtain the compressed weights of the at least one target network layer, and storing the weight matrix of the at least one target network layer in the form of the compressed weights of the at least one target network layer.
  • Storing the weight matrix of at least one target network layer in the target neural network model in the form of compressed weights can be understood as storing only the target weight blocks in the weight matrix of each target network layer, without storing the invalid weight blocks in the weight matrix of each target network layer.
  • Compressed weights can be represented as dense matrices.
  • when the at least one target network layer includes the i-th layer, the weight matrix of the i-th layer of the target neural network model is compressed to obtain the compressed weights of the i-th layer.
  • the weight matrix of the i-th layer of the target neural network model includes at least one target weight block.
  • the compressed weights of the i-th layer are composed of the at least one target weight block. The weight matrix of the i-th layer of the target neural network model is stored in the form of the compressed weights of the i-th layer.
  • the index of each target network layer is used to indicate the position of the target weight block in each target network layer in the weight matrix of each target network layer.
  • the index of each target network layer may be expressed as a two-dimensional matrix, that is, the index of each target network layer is stored in the form of a two-dimensional matrix.
  • This two-dimensional matrix can also be referred to as an index matrix.
  • the position and value of each element in the index matrix of a target network layer are used to indicate the position of a target weight block.
  • the element values in the first column of the index matrix of a target network layer may be used to indicate the row numbers of the target weight blocks in the first column of the weight matrix of the target network layer.
  • the index matrix includes 8 elements, which are respectively used to indicate the positions of the 8 target weight blocks in the weight matrix.
  • 0 in the first column indicates that a target weight block is located in row 0 of the first column in the weight matrix
  • 2 in the first column indicates that a target weight block is located in row 2 of the first column in the weight matrix.
  • the weight matrix of the target network layer can be represented by the compressed weight of the target network layer and the index of the target network layer.
  • a K*N weight matrix can be represented by the compressed weight of N'*K'*block and the index matrix of N'*K', where N'*K'*block denotes N'*K' blocks, that is, the compressed weight includes N'*K' blocks.
  • N' represents the number of columns of the index matrix
  • K' represents the number of rows of the index matrix.
  • the number of target weight blocks is the number of elements in the index matrix.
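  • The compressed weight and index representation can be sketched as follows. The sketch assumes that target weight blocks are recognised by containing at least one nonzero value, and it stores the index column by column, so that `index[c]` plays the role of column c of the index matrix. The function names and the sample matrix are hypothetical.

```python
def compress(weights, block_k, block_n):
    """Keep only blocks with a nonzero weight and record their block-row numbers."""
    n_block_rows = len(weights) // block_k
    n_block_cols = len(weights[0]) // block_n
    index, blocks = [], []  # one entry per block column
    for bc in range(n_block_cols):
        col_index, col_blocks = [], []
        for br in range(n_block_rows):
            block = [row[bc * block_n:(bc + 1) * block_n]
                     for row in weights[br * block_k:(br + 1) * block_k]]
            if any(v != 0 for row in block for v in row):
                col_index.append(br)
                col_blocks.append(block)
        index.append(col_index)
        blocks.append(col_blocks)
    return index, blocks


def decompress(index, blocks, k, n, block_k, block_n):
    """Rebuild the K x N weight matrix from compressed blocks and the index."""
    weights = [[0] * n for _ in range(k)]
    for bc, (col_index, col_blocks) in enumerate(zip(index, blocks)):
        for br, block in zip(col_index, col_blocks):
            for i, row in enumerate(block):
                for j, v in enumerate(row):
                    weights[br * block_k + i][bc * block_n + j] = v
    return weights


# 6 x 4 weight matrix with 2 x 2 blocks; two target blocks per block column.
weights = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 5, 6],
    [0, 0, 7, 8],
    [9, 1, 2, 3],
    [4, 5, 6, 7],
]
index, blocks = compress(weights, 2, 2)
restored = decompress(index, blocks, 6, 4, 2, 2)
```

  • Here 4 of the 6 blocks are stored, and the index holds one block-row number per stored block, which is much smaller than an element-wise position list.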
  • Storing the weight matrix in the form of compressed weights can reduce the demand for storage space.
  • the compressed weight can be directly loaded into the calculation unit for matrix operation, which is beneficial to improve the processing speed.
  • weight matrix in the target AI model may also be stored in other forms.
  • all weight values in the weight matrix are stored. This embodiment of the present application does not limit it.
  • the embodiment of the present application also provides a calculation method of the AI model.
  • the calculation method of the AI model provided in the embodiment of the present application will be described below with reference to FIG. 11 .
  • the method shown in Figure 11 can be executed by an execution device of the AI model, which can be a cloud service device or a terminal device, for example, a computer, a server or another device with sufficient computing power to execute AI model operations, or a system composed of cloud service equipment and terminal equipment.
  • the method 900 may be executed by the execution device 210 in FIG. 2 , the AI processor 50 in FIG. 3 , or the execution device 410 in FIG. 4 or a local device.
  • the accelerator can call the sparse operator in FIG. 5 to execute method 900 and implement sparse calculation.
  • the target AI model in the calculation method shown in FIG. 11 can be constructed by the method shown in FIG. 6 , and the specific description of the target AI model can refer to the aforementioned method 600. In order to avoid unnecessary repetition, duplicate descriptions are omitted as appropriate when method 900 is described below.
  • the target AI model is obtained by performing sparse processing on the initial AI model.
  • the weight matrix of the target AI model includes multiple weight blocks.
  • the multiple weight blocks include at least one invalid weight block and at least one target weight block.
  • the invalid weight block is a weight block that does not participate in the operation
  • the target weight block is a weight block that participates in the operation.
  • the size of the weight block is an integer multiple of the size of the smallest computing unit of the accelerator.
  • the accelerator is used to execute the calculation of the target AI model.
  • the number of target weight blocks in the direction of matrix accumulation in the weight matrix of the target AI model is the same.
  • the target AI model may be a target neural network model
  • the initial AI model may be an initial neural network model.
  • the target neural network model is obtained by performing sparse processing on at least one target network layer in the initial neural network model.
  • the at least one target network layer includes the i-th layer.
  • the weight matrix of the i-th layer of the target neural network model includes multiple weight blocks, and the multiple weight blocks include at least one invalid weight block and at least one target weight block. The invalid weight block is a weight block that does not participate in the operation, and the target weight block is a weight block that participates in the operation. The size of the weight block is an integer multiple of the size of the smallest computing unit of the accelerator, and the size of the weight matrix of the i-th layer is an integer multiple of the size of the weight block.
  • the accelerator is used to execute the operation of the target neural network model, and i is a positive integer.
  • Steps S910 to S920 are executed during the operation process of the accelerator executing the target AI model.
  • S910, acquire at least one target data block from the input data matrix according to the weight index of the weight matrix of the target AI model, where the at least one target data block corresponds to the at least one target weight block, and the weight index is used to indicate the position of the at least one target weight block in the weight matrix of the target AI model.
  • the at least one target data block refers to the input data to be calculated in the matrix operation, that is, the input data corresponding to the target weight block in the weight matrix.
  • the type of input data of the target AI model may be image data, speech data or text data.
  • the type of input data is related to the task of the target AI model.
  • the type of the input data may be image data.
  • the target image processing tasks include image classification, image detection, image segmentation, image recognition, or image generation.
  • the type of the input data may be text data.
  • text processing tasks include text recognition or text translation.
  • the type of the input data may be speech data.
  • speech processing tasks include speech recognition and the like. The embodiment of the present application does not limit the type of input data.
  • the accelerator can obtain the data required for the operation at one time according to the data format that the minimum computing unit can support, and there is no need to convert the data required for the operation into the data format that the minimum computing unit can support by cutting or piecing together, which reduces the delay of operation.
  • the index is used to indicate the position of the target weight block, which reduces the complexity of the index, can reduce the addressing time of input data during the operation process, and can realize effective hardware acceleration without specific hardware support.
  • the method 900 will be described below by taking the AI model as a neural network model as an example.
  • the weight matrix may be the weight matrix of the i-th layer in the target neural network model.
  • Steps S910 to S920 are executed during the process of the accelerator executing the operation of the i-th layer of the target neural network model.
  • the i-th layer may be a fully connected layer or a convolutional layer.
  • the weight matrix of the target AI model is stored in the form of compressed weights, and the compressed weights are formed by the at least one target weight block.
  • the weight matrix of the i-th layer of the target neural network model is stored in the form of compressed weights, and the compressed weights of the i-th layer are composed of the at least one target weight block.
  • Step S910 will be described below with reference to FIG. 12 .
  • the size of the weight matrix in FIG. 12 is K*N, and the size of the input data matrix is M*K.
  • the feature map shown in Figure 12 is an example of the input data matrix.
  • the matrix multiplication operation is performed on the input data matrix and the weight matrix, and the size of the obtained matrix operation result is M*N.
  • the size of the compressed weight is N'*K'*block, and block represents the weight block in the weight matrix.
  • the size of the block can be expressed as (α*BU_K)×(β*BU_N), where α and β are positive integers.
  • the index can be expressed as an index matrix of N'*K'.
  • the weight matrix can be represented by the compressed weight of N'*K'*block and the index matrix of N'*K'.
  • the compressed weights can be directly moved into the computing unit of the accelerator. Load the target data block in the feature map into the computing unit according to the index.
  • 0 in the first column of the index matrix in Figure 12 indicates that a target weight block is located in row 0 of the first column of the weight matrix, and 2 in the first column indicates that a target weight block is located in row 2 of the first column of the weight matrix.
  • The target data blocks corresponding to these two target weight blocks are the data blocks in column 0 and column 2 of the feature map, respectively.
  • the size of each data block is M*(α*BU_K), where α*BU_K is the number of rows of a weight block. The data blocks of column 0 and column 2 of the feature map are loaded into the computing unit.
  • the compressed weights and the target data blocks can be loaded into four computing units, so that parallel computing can be realized.
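  • Fetching the target data blocks according to the index can be sketched as column slicing of the feature map. The function name `gather_data_blocks`, the block height of 2 and the sample values are assumptions made for illustration; the index values 0 and 2 follow the example in the text.

```python
def gather_data_blocks(feature_map, block_rows_index, block_k):
    """For each indexed weight-block row, slice the matching column block of the input."""
    return [
        [row[br * block_k:(br + 1) * block_k] for row in feature_map]
        for br in block_rows_index
    ]


# Hypothetical 2 x 6 feature map (M = 2, K = 6) with a weight-block height of 2.
feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
]
# Index column [0, 2]: target weight blocks sit in block rows 0 and 2.
data_blocks = gather_data_blocks(feature_map, [0, 2], 2)
```

  • Column block 1 of the feature map is never read, because the weight block it would be multiplied with is invalid.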
  • the method 900 further includes step S911.
  • step S911 can be understood as combining the target data blocks corresponding to multiple weight groups in the at least one target data block into multiple data matrices, where one weight group of the multiple weight groups includes the target weight blocks located in the same matrix operation accumulation direction in the weight matrix of the i-th layer of the target neural network model.
  • a weight group refers to a matrix composed of target weight blocks located in the matrix operation accumulation direction in the weight matrix of the i-th layer of the target neural network model.
  • the target data blocks corresponding to a weight group can be combined into a dense matrix.
  • the compressed weights consist of 4 columns of weight blocks.
  • the column direction is the matrix operation accumulation direction. Therefore, the compressed weights include 4 weight groups.
  • the target data blocks corresponding to the four weight groups can form four data matrices respectively.
  • the first column of the index matrix in Figure 12 indicates the position of the first weight group. The target data blocks corresponding to the first weight group are the data blocks in column 0 and column 2 of the feature map, and these two data blocks form one data matrix.
  • step S920 may be performed through the following steps.
  • matrix operations are performed on the four data matrices and the four weight groups, respectively, to obtain four result matrices.
  • the 4 sets of operation processes can be executed in parallel by 4 computing units.
  • the plurality of data matrices can constitute a feature map with parallel dimensions.
  • the plurality of data matrices have parallel dimensions so that matrix operations are performed in parallel.
  • the parallel dimension can be understood as the dimension of parallel computing.
  • the results of the plurality of matrix operations also have a parallel dimension.
  • data rearrangement may be performed on the results of the multiple matrix operations according to the positions of the multiple weight groups in the weight matrix to obtain the results of the matrix operations.
  • the results of the multiple matrix operations corresponding to the four weight groups are result 1, result 2, result 3 and result 4, which are located in the first, second, third, and fourth columns of the matrix operation result of this layer, respectively.
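The grouping, per-group matrix operation, and rearrangement described above can be sketched as follows. This is a minimal NumPy illustration under assumed shapes (compressed weights of shape N'×K'×bk×bn and an index of shape N'×K'); the function names are hypothetical, and the batched variant is only one possible way to express the parallel dimension.

```python
import numpy as np

def block_sparse_matmul(x, blocks, index):
    """x: (M, K) input data; blocks: (N', K', bk, bn); index: (N', K')."""
    m, k = x.shape
    nb, kp, bk, bn = blocks.shape
    # View the feature map as K/bk data blocks of size M x bk.
    x_blocks = x.reshape(m, k // bk, bk)
    results = []
    for j in range(nb):  # one weight group per block column, independent of the others
        data = x_blocks[:, index[j], :].reshape(m, kp * bk)  # dense data matrix of group j
        group = blocks[j].reshape(kp * bk, bn)               # stacked target weight blocks
        results.append(data @ group)
    # Rearrangement: result j becomes block column j of the layer output.
    return np.concatenate(results, axis=1)

def block_sparse_matmul_batched(x, blocks, index):
    """Same computation with the weight groups as an explicit parallel dimension."""
    m, k = x.shape
    nb, kp, bk, bn = blocks.shape
    data = x.reshape(m, k // bk, bk)[:, index, :]        # (M, N', K', bk)
    out = np.einsum("mjpk,jpkn->mjn", data, blocks)      # (M, N', bn)
    return out.reshape(m, nb * bn)

# Toy check against a dense reference built from the same blocks.
rng = np.random.default_rng(0)
bk, bn = 2, 2
index = np.array([[0, 2], [1, 3], [0, 1], [2, 3]])
blocks = rng.standard_normal((4, 2, bk, bn))
w_dense = np.zeros((4 * bk, 4 * bn))
for j in range(4):
    for p in range(2):
        i = index[j, p]
        w_dense[i * bk:(i + 1) * bk, j * bn:(j + 1) * bn] = blocks[j, p]
x = rng.standard_normal((3, 4 * bk))
y_loop = block_sparse_matmul(x, blocks, index)
y_batched = block_sparse_matmul_batched(x, blocks, index)
```

Because every weight group holds the same number of target blocks, the gathered data matrices all have the same shape, which is what allows them to be stacked along one parallel dimension.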
  • the weight matrix may be obtained by converting the convolution kernel of the network layer into a two-dimensional matrix.
  • the i-th layer is a convolutional layer
  • the weight matrix of the i-th layer of the target neural network model can be obtained by converting the convolution kernel of the i-th layer into a two-dimensional matrix.
  • step S910 may include: converting the input data matrix through im2col to obtain a converted input data matrix, and obtaining at least one target data block from the converted input data matrix according to the weight index, where the at least one target data block corresponds to the at least one target weight block.
  • the target data blocks corresponding to multiple weight groups are respectively combined into multiple data matrices, where one of the multiple weight groups includes the target weight blocks located in the matrix operation accumulation direction in the weight matrix of the i-th layer of the target neural network model.
  • the input three-dimensional (3D) feature map can be expanded into a two-dimensional (2D) feature map through im2col, and then at least one target data block is obtained from the two-dimensional feature map according to the weight index of the i-th layer to form multiple data matrices.
  • step S910 includes: determining the target position in the input data matrix according to the weight index; and converting the data at the target position of the input data matrix into the at least one target data block.
  • step S910 includes: determining the target position in the input data matrix of the i-th layer according to the weight index of the i-th layer; and converting the data at the target position of the input data matrix into the at least one target data block.
  • the position of at least one target data block in the i-th layer in the expanded input data matrix of the i-th layer can be determined according to the weight index of the i-th layer.
  • the at least one target data block corresponds to at least one target weight block.
  • the input data matrix of the i-th layer is not expanded into a two-dimensional matrix through im2col.
  • the size of the input data matrix is fixed, and correspondingly, the size of the expanded two-dimensional matrix is also fixed. Therefore, the position of at least one target data block in the expanded input data matrix can be determined without performing the expansion operation of im2col.
  • the position of the data of the at least one target data block in the input data matrix of the i-th layer, that is, the target position in the input data matrix of the i-th layer, can then be derived in reverse.
  • the at least one target data block can be obtained by expanding the data at the target position by means of im2col.
  • the target data blocks corresponding to multiple weight groups in the at least one target data block are respectively combined into multiple data matrices through im2col, where one of the multiple weight groups includes the target weight blocks located in the matrix operation accumulation direction in the weight matrix of the i-th layer of the target neural network model.
  • the input data matrix of the i-th layer is not expanded; instead, the target position is determined directly in the input data matrix of the i-th layer according to the weight index of the i-th layer, and the data at the target position is then formed into multiple data matrices.
  • the data at the target position in the input data matrix of the i-th layer can be obtained according to the weight index of the i-th layer and expanded into multiple data matrices by im2col.
  • Figure 14 shows a schematic diagram of the operation process when the i-th layer is a convolutional layer.
  • the main difference between Figure 14 and Figure 12 is that the dimensions of the input feature map and of the result are different. For details, refer to the description of Figure 12, which is not repeated here.
  • the data at the target position in the input data matrix of the i-th layer is obtained according to the weight index of the i-th layer and expanded into multiple data matrices by im2col. In this way, there is no need to first convert the input data matrix of the i-th layer through im2col and then obtain the at least one target data block from the converted input data matrix, and the converted input data matrix does not need to be saved, which reduces the storage space requirement and improves processing efficiency.
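To make the idea of skipping the full im2col expansion more concrete, the sketch below builds only the data block that a given block row of the weight matrix needs, by mapping each required column of the (never materialized) expanded matrix back to a position in the 3D input. It assumes stride 1, no padding, a (C, H, W) input layout, and columns ordered as (channel, kernel row, kernel column); these conventions and the function names are assumptions of the example, not details stated in the application.

```python
import numpy as np

def im2col(x, kh, kw):
    """Full expansion for reference: (C, H, W) -> (Hout*Wout, C*kh*kw)."""
    c, h, w = x.shape
    ho, wo = h - kh + 1, w - kw + 1
    cols = [x[ci, dy:dy + ho, dx:dx + wo].reshape(-1)
            for ci in range(c) for dy in range(kh) for dx in range(kw)]
    return np.stack(cols, axis=1)

def gather_target_block(x, kh, kw, block_row, bk):
    """Build only the M x bk data block that `block_row` occupies in the im2col matrix."""
    c, h, w = x.shape
    ho, wo = h - kh + 1, w - kw + 1
    cols = []
    for k in range(block_row * bk, (block_row + 1) * bk):
        # Target position in the input for column k of the expanded matrix.
        ci, dy, dx = np.unravel_index(k, (c, kh, kw))
        cols.append(x[ci, dy:dy + ho, dx:dx + wo].reshape(-1))
    return np.stack(cols, axis=1)

# Toy example: 4 channels, 5 x 5 input, 3 x 3 kernel, so K = 36 and M = 9.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 5))
full = im2col(x, 3, 3)                               # only used here as a reference
block0 = gather_target_block(x, 3, 3, block_row=0, bk=4)
block2 = gather_target_block(x, 3, 3, block_row=2, bk=4)
```

With a weight index that selects block rows 0 and 2, only `block0` and `block2` are produced, so the 9×36 expanded matrix never has to be stored.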
  • the integer multiple is obtained by searching in the search space.
  • the sparsity rate of the weight matrix of the target AI model is obtained by searching in the search space, and the sparsity rate of the target AI model satisfies the target sparsity rate.
  • the method 900 only uses one weight matrix as an example to illustrate the calculation process, and the same method can also be used to perform the calculation process for other weight matrices in the target AI model.
  • Table 1 shows some parameters of a ResNet50 provided by the embodiment of the present application.
  • the model in Table 1 is compressed by using the method 600 of the embodiment of the present application to obtain the compressed model.
  • the minimum computing unit of the accelerator is 16*16
  • the target sparse rate is 20%.
  • Table 2 shows the weight block size of each convolutional layer in the compressed model and the sparse rate of each convolutional layer, that is, the sparse configuration of each convolutional layer. Based on the compressed model shown in Table 2, the data to be processed can be processed to obtain the processing result of the data to be processed. Exemplarily, the sparse configuration of each convolutional layer may be searched using the method of AMC.
  • FIG. 15 shows a comparison chart of the accuracy of the target neural network model and the initial neural network model in the embodiment of the present application.
  • FIG. 15 shows the comparison of the TOP-1 accuracy (TOP-1ACC) of the 4 groups of target neural network models and the initial neural network models shown in Table 1.
  • TOP-1ACC refers to the probability that the top-ranked category in the output result matches the actual result.
  • the size of the weight block is 16×64.
  • the first set of comparison results compares the accuracy of the target neural network model with the accuracy of the initial neural network model when conv1 and fc (not shown in Table 1) are not sparsely processed, that is, conv1 and fc are skipped, and the sparse rate of the target neural network model is 50%.
  • the second set of comparison results compares the accuracy of the target neural network model with the accuracy of the initial neural network model when the sparse rate of the target neural network model is 50%.
  • the third set of comparison results compares the accuracy of the target neural network model with the accuracy of the initial neural network model when conv1 and fc are not sparsely processed, the sparse rate of operators with a 3×3 convolution kernel is 50%, and the sparse rate of operators with a 1×1 convolution kernel is 30%.
  • the fourth set of comparison results compares the accuracy of the target neural network model with the accuracy of the initial neural network model when conv1 and fc are not sparsely processed and the sparse rate of the target neural network model is 30%.
  • the accuracy difference of the target neural network model is in the range of +0.102% to -0.794%.
  • when the sparsity rate is 30%, the accuracy of the target neural network model can exceed that of the initial neural network model.
  • when the sparsity rate is 50%, the degree of accuracy regression can be reduced by adjusting the sparse configuration, that is, the decline in model accuracy can be kept under control.
  • the accuracy regression in the worst case is less than 1%.
  • FIG. 16 shows a comparison chart of the operation time of the conv2D single operator in the initial neural network model and the target neural network model. Specifically, FIG. 16 shows the processing time of 4 conv2D single operators (conv2_x, conv3_x, conv4_x, conv5_x) on 4 batches of data (batch_1, batch_2, batch_3 and batch_4). As shown in Figure 16, the operation time of the conv2D single operator in the target neural network model is significantly less than that of the conv2D single operator in the initial neural network model.
  • FIG. 17 shows a schematic diagram of the degree of acceleration of the target neural network model corresponding to FIG. 16 .
  • the acceleration degree of a conv2D operator in the target neural network model is obtained by dividing the operation time of the operator in the initial neural network model by the operation time of the operator in the target neural network model. As shown in Figure 16 and Figure 17, the average speedup of a conv2D operator is 30%.
  • the device of the embodiment of the present application will be described below with reference to FIG. 18 to FIG. 21 . It should be understood that the device described below can execute the method of the aforementioned embodiment of the present application. In order to avoid unnecessary repetition, repeated descriptions are appropriately omitted when introducing the device of the embodiment of the present application below.
  • Fig. 18 is a schematic block diagram of an AI model processing device according to an embodiment of the present application.
  • the AI model processing device 3000 shown in FIG. 18 includes an acquisition unit 3010 and a processing unit 3020 .
  • the acquisition unit 3010 and the processing unit 3020 may be used to execute the AI model processing method of the embodiment of the present application, specifically, may be used to execute the method 600 .
  • the obtaining unit 3010 is used to obtain the initial AI model and the size of the minimum computing unit of the accelerator, where the accelerator is used to execute the operation of the target AI model.
  • the processing unit 3020 is configured to perform sparse processing on the initial AI model to obtain a target AI model, wherein the weight matrix of the target AI model includes multiple weight blocks, and the multiple weight blocks include at least one invalid weight block and at least one target weight block. An invalid weight block is a weight block that does not participate in the operation, and a target weight block is a weight block that participates in the operation. The size of a weight block is an integer multiple of the size of the minimum computing unit of the accelerator, and the number of target weight blocks located in the matrix operation accumulation direction in the weight matrix of the target AI model is the same.
  • the device further includes a storage unit configured to: store the weight matrix of the target AI model in the form of compressed weights, where the compressed weights are composed of at least one target weight block.
  • the multiple corresponding to the integer multiple is obtained by searching in the search space.
  • the sparsity rate of the weight matrix of the target AI model is obtained by searching in the search space, and the sparsity rate of the target AI model meets the target sparsity rate.
  • Fig. 19 is a schematic block diagram of an accelerator 4000 provided by an embodiment of the present application.
  • the accelerator 4000 shown in FIG. 19 includes an acquisition unit 4010 and a processing unit 4020 .
  • the acquisition unit 4010 and the processing unit 4020 may be used to execute the calculation method of the AI model of the embodiment of the present application, for example, may be used to execute the method 900 .
  • the obtaining unit 4010 is configured to obtain at least one target data block from the input data matrix according to the weight index of the weight matrix of the target AI model, at least one target data block corresponds to at least one target weight block, and the weight index is used to indicate at least one target weight block Position in the weight matrix of the target AI model.
  • the processing unit 4020 is configured to perform a matrix operation based on at least one target data block and at least one target weight block, so as to obtain a result of the matrix operation.
  • the weight matrix of the target AI model is stored in the form of compressed weights, and the compressed weights are composed of at least one target weight block.
  • the processing unit 4020 is further configured to: combine the target data blocks corresponding to multiple weight groups in the at least one target data block into multiple data matrices, respectively, where one of the multiple weight groups includes the target weight blocks located in the matrix operation accumulation direction in the weight matrix of the target AI model; and the processing unit 4020 is specifically configured to: perform matrix operations on the multiple data matrices and the multiple weight groups, respectively, to obtain the results of multiple matrix operations; and perform data rearrangement on the results of the multiple matrix operations to obtain the result of the matrix operation.
  • the weight matrix in the target AI model is obtained by converting the convolution kernel of the network layer into a two-dimensional matrix.
  • the processing unit 4020 is specifically configured to: determine the target position in the input data matrix according to the weight index of the weight matrix of the target AI model; convert the data at the target position in the input data matrix into at least one target data block.
  • the multiple corresponding to the integer multiple is obtained by searching in the search space.
  • the sparsity rate of the weight matrix of the target AI model is obtained by searching in the search space, and the sparsity rate of the target AI model meets the target sparsity rate.
  • the term "unit" here may be implemented in the form of software and/or hardware, which is not specifically limited.
  • for example, a "unit" may be a software program, a hardware circuit, or a combination of the two that realizes the above functions.
  • the hardware circuit may include an application-specific integrated circuit (ASIC), an electronic circuit, a processor (such as a shared processor, a dedicated processor, or a group processor) for executing one or more software or firmware programs together with a memory, a merged logic circuit, and/or other suitable components that support the described functions.
  • the units of each example described in the embodiments of the present application can be realized by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.
  • FIG. 20 is a schematic diagram of a hardware structure of an AI model processing device provided by an embodiment of the present application.
  • the AI model processing apparatus 5000 shown in FIG. 20 includes a memory 5001 , a processor 5002 , a communication interface 5003 and a bus 5004 .
  • the memory 5001 , the processor 5002 , and the communication interface 5003 are connected to each other through a bus 5004 .
  • the memory 5001 may be a read only memory (read only memory, ROM), a static storage device, a dynamic storage device or a random access memory (random access memory, RAM).
  • the memory 5001 may store a program, and when the program stored in the memory 5001 is executed by the processor 5002, the processor 5002 is configured to execute each step of the AI model processing method of the embodiment of the present application. Specifically, the processor 5002 may execute the method 600 shown in FIG. 6 above.
  • the processor 5002 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is used to execute related programs to implement the AI model processing method of the method embodiment of the present application.
  • the processor 5002 may also be an integrated circuit chip with signal processing capabilities, for example, it may be the chip shown in FIG. 3 .
  • each step of the AI model processing method of the present application may be completed by an integrated logic circuit of hardware in the processor 5002 or instructions in the form of software.
  • the above-mentioned processor 5002 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium that is mature in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 5001, and the processor 5002 reads the information in the memory 5001 and, in combination with its hardware, completes the functions required by the units included in the processing device shown in Figure 18, or executes the AI model processing method shown in Figure 6 of the method embodiment of the present application.
  • the communication interface 5003 implements communication between the apparatus 5000 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver.
  • the initial neural network model and the size of the minimum computing unit of the accelerator can be obtained through the communication interface 5003 .
  • the bus 5004 may include a pathway for transferring information between various components of the device 5000 (eg, memory 5001, processor 5002, communication interface 5003).
  • Fig. 21 is a schematic diagram of the hardware structure of the computing device of the AI model of the embodiment of the present application.
  • the data processing device 6000 shown in FIG. 21 includes a memory 6001 , a processor 6002 , a communication interface 6003 and a bus 6004 .
  • the memory 6001 , the processor 6002 , and the communication interface 6003 are connected to each other through a bus 6004 .
  • the memory 6001 may be a ROM, a static storage device, or a RAM.
  • the memory 6001 can store programs, and when the programs stored in the memory 6001 are executed by the processor 6002, the processor 6002 and the communication interface 6003 are used to execute each step of the AI model computing method of the embodiment of the present application.
  • the processor 6002 may execute the method 900 shown in FIG. 11 above.
  • the processor 6002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is used to execute related programs so as to realize the functions required by the units in the accelerator of the embodiment of the present application, or to execute the calculation method of the AI model in the method embodiment of the present application.
  • the processor 6002 may also be an integrated circuit chip with signal processing capabilities, for example, it may be the chip shown in FIG. 3 .
  • each step of the calculation method of the AI model in the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 6002 or instructions in the form of software.
  • the above processor 6002 may also be a general processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium that is mature in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 6001, and the processor 6002 reads the information in the memory 6001 and, in combination with its hardware, completes the functions required by the units included in the accelerator of the embodiment of the present application, or executes the calculation method of the AI model of the method embodiment of the present application.
  • the communication interface 6003 implements communication between the apparatus 6000 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver. For example, input data can be obtained through the communication interface 6003 .
  • the bus 6004 may include pathways for transferring information between various components of the device 6000 (eg, memory 6001 , processor 6002 , communication interface 6003 ).
  • the device 5000 and device 6000 may also include other devices.
  • the apparatus 5000 and the apparatus 6000 may also include hardware devices for implementing other additional functions.
  • the device 5000 and the device 6000 may only include the devices necessary to realize the embodiment of the present application, instead of all the devices shown in FIG. 20 and FIG. 21 .
  • the embodiment of the present application also provides a computer-readable storage medium, where the computer-readable medium stores program code for execution by a device, and the program code includes instructions for performing the method in any implementation manner in the embodiments of the present application.
  • the embodiment of the present application also provides a computer program product including instructions, and when the computer program product is run on a computer, the computer is made to execute the method in any implementation manner in the embodiments of the present application.
  • the embodiment of the present application also provides a chip. The chip includes a processor and a data interface, and the processor reads, through the data interface, the instructions stored in a memory and executes the method in any implementation manner in the embodiments of the present application.
  • the chip may further include a memory, the memory stores instructions, and the processor is configured to execute the instructions stored in the memory. When the instructions are executed, the processor is configured to execute the method in any implementation manner in the embodiments of the present application.
  • the aforementioned chip may specifically be a field-programmable gate array (field-programmable gate array, FPGA) or an application-specific integrated circuit (application-specific integrated circuit, ASIC).
  • the processor in the embodiment of the present application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory.
  • the volatile memory may be a random access memory (RAM), which serves as an external cache.
  • many forms of RAM are available, for example: static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
  • the above-described embodiments may be implemented in whole or in part in the form of computer program products.
  • the computer program product comprises one or more computer instructions or computer programs.
  • when the computer instructions or computer programs are loaded or executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (for example, infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
  • the semiconductor medium may be a solid state drive.
  • "at least one" means one or more, and "multiple" means two or more.
  • "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items.
  • for example, at least one item (piece) of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c can be single or multiple.
  • the sequence numbers of the above-mentioned processes do not indicate the order of execution. The execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions described above are realized in the form of software function units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, a network device, or the like) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

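This application discloses a processing method, an operation method and an apparatus for an AI model in the field of artificial intelligence. In the processing method, sparse processing is performed on an initial AI model with the weight block as the sparsity granularity to obtain a target AI model, where the size of a weight block is an integer multiple of the size of the minimum computing unit of the accelerator. The solution reduces the amount of computation, increases the running speed of the model, and reduces operation overhead.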

Description

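Processing method, operation method and apparatus for an AI model
This application claims priority to Chinese Patent Application No. 202111265704.8, filed with the China National Intellectual Property Administration on October 28, 2021 and entitled "AI模型的处理方法、运算方法及装置" (processing method, operation method and apparatus for an AI model), which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence, and more specifically, to a processing method, an operation method and an apparatus for an AI model.
Background
Artificial intelligence (AI) is the theory, methods, technologies and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best result. AI studies the design principles and implementation methods of intelligent machines so that machines can perceive, reason and make decisions. Research in the field covers robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
With the rapid development and iteration of AI technology, the performance of neural network models has improved continuously. At the same time, training a neural network model requires a massive amount of data computation, which places higher demands on computing chips. To speed up training, sparse computation and similar techniques can be used to reduce the training workload.
Existing sparse computation schemes usually require specific hardware support. For example, in a weight matrix, 2 out of every 4 weights are selected to participate in the matrix operation. This scheme can sparsify the weight matrix only with specific hardware support; on other general-purpose AI accelerators that lack such support, it may not be possible to sparsify the weight matrix at all.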
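Summary
This application provides a processing method, an operation method and an apparatus for an AI model, which can increase the operation speed of the AI model and reduce operation overhead.
According to a first aspect, a processing method for an AI model is provided. The method includes: obtaining an initial AI model and the size of the minimum computing unit of an accelerator, where the accelerator is configured to execute operations of a target AI model; and performing sparse processing on the initial AI model to obtain the target AI model. A weight matrix of the target AI model includes a plurality of weight blocks, and the plurality of weight blocks include at least one invalid weight block and at least one target weight block. An invalid weight block is a weight block that does not participate in operations, and a target weight block is a weight block that participates in operations. The size of a weight block is an integer multiple of the size of the minimum computing unit of the accelerator, and the number of target weight blocks located along the accumulation direction of the matrix operation in the weight matrix of the i-th layer of the target neural network model is the same.
In the solution of the embodiments of this application, the weight block size is the granularity of sparse processing; that is, weight values in the weight matrix are retained or discarded in units of weight blocks, and the weight block size is an integer multiple of the size of the accelerator's minimum computing unit. The solution removes redundant weights from the initial AI model and reduces the computation of the model, which helps increase the processing speed of the accelerator. Moreover, because the target weight block size is an integer multiple of the minimum computing unit size, during operation of the target AI model the accelerator can fetch the data required for an operation in one pass, in the data format supported by the minimum computing unit, without cropping or stitching the data into that format, which reduces operation latency. In addition, using the weight block as the granularity of sparse processing reduces the complexity of the weight index and the time spent addressing input data during operation, so effective hardware acceleration can be achieved without specific hardware support.
The size of the minimum computing unit is the scale of matrix operation that the accelerator can process at one time, or in other words, the amount of computation that a computing unit performs in one operation.
For example, the accelerator may include at least one of the following: a central processing unit (CPU), a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and the like.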
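With reference to the first aspect, in some implementations of the first aspect, performing sparse processing on the initial AI model to obtain the target AI model includes: treating at least one weight block in each of at least one target network layer of the initial AI model as a weight block that does not participate in operations, to obtain a sparse AI model; and fine-tuning the sparse AI model to obtain the target AI model.
Optionally, the weight blocks in one weight matrix have the same size.
If the weight blocks in a weight matrix had different sizes, operators would have to be compiled separately for each block size when compiling the operators of the AI model, which increases the number of compilations. In the solution of the embodiments of this application, the weight blocks in one weight matrix have the same size, so operator compilation needs to be performed only once. This reduces the number of compilations and the compilation time, and increases the operation speed of the model. In addition, when the accelerator includes multiple computing units, multiple weight blocks can be computed in parallel, further increasing the operation speed of the model.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: storing the weight matrix of the target neural network model in the form of compressed weights, where the compressed weights consist of the at least one target weight block.
In the solution of the embodiments of this application, storing the weight matrix of the at least one target network layer in the form of compressed weights reduces the storage space required. In addition, when the number of target weight blocks along the accumulation direction of the matrix operation is the same, a structured sparse matrix is obtained, and during operation the compressed weights can be loaded directly into the computing unit for the matrix operation, which helps increase processing speed.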
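With reference to the first aspect, in some implementations of the first aspect, the multiple corresponding to the integer multiple is obtained by searching in a search space.
Optionally, the weight blocks in the weight matrices of different network layers of the target AI network model have different sizes.
In the solution of the embodiments of this application, searching for the weight block size in the search space yields a weight block size suited to each network layer, which helps improve the accuracy of the model. In addition, a weight block of a given size can ensure a given sparsity rate at a given target accuracy, which improves the processing efficiency of the model.
With reference to the first aspect, in some implementations of the first aspect, the sparsity rate of the weight matrix of the target AI model is obtained by searching in a search space, and the sparsity rate of the target AI model satisfies a target sparsity rate.
In the solution of the embodiments of this application, on the premise that the sparsity rate of the model satisfies the target sparsity rate, searching the search space for the sparsity rate of the weight matrix of the target AI model yields a suitable sparsity rate, which helps improve the accuracy of the model.
With reference to the first aspect, in some implementations of the first aspect, the sparsity rate of the weight matrix of the target AI model is obtained by searching in a search space, and the accuracy of the sparse neural network model satisfies a target accuracy.
In the solution of the embodiments of this application, on the premise that the accuracy of the model satisfies the target accuracy, searching the search space for the sparsity rate of the weight matrix of the target AI model yields a sparsity rate suited to that weight matrix, which helps lower the sparsity rate of the model and thus the computation of the model, so that the model's operations can be accelerated.
With reference to the first aspect, in some implementations of the first aspect, the network layer corresponding to the weight matrix of the target AI model belongs to a plurality of candidate network layers.
In the solution of the embodiments of this application, sparse processing can be confined to the specified candidate network layers so that other network layers are not affected, for example layers that are strongly correlated with the accuracy of the model, which helps preserve the accuracy of the model.
With reference to the first aspect, in some implementations of the first aspect, when the network layer corresponding to the weight matrix of the target AI model is a convolutional layer, the weight matrix of the target AI model may be obtained by converting the convolution kernels of that network layer into a two-dimensional matrix.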
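According to a second aspect, an operation method for an AI model is provided. A weight matrix of the i-th layer of a target AI model includes a plurality of weight blocks, and the plurality of weight blocks include at least one invalid weight block and at least one target weight block. An invalid weight block is a weight block that does not participate in operations, and a target weight block is a weight block that participates in operations. The size of a weight block is an integer multiple of the size of the minimum computing unit of an accelerator, and the number of target weight blocks located along the accumulation direction of the matrix operation in the weight matrix of the target AI model is the same. The following steps are performed while the accelerator executes the operations of the target AI model: obtaining at least one target data block from an input data matrix according to a weight index of the weight matrix of the target AI model, where the at least one target data block corresponds to the at least one target weight block, and the weight index indicates the position of the at least one target weight block in the weight matrix of the target AI model; and performing a matrix operation based on the at least one target data block and the at least one target weight block to obtain a result of the matrix operation.
In the solution of the embodiments of this application, during operation the accelerator can fetch the data required for an operation in one pass, in the data format supported by the minimum computing unit, without cropping or stitching the data into that format, which reduces operation latency. Moreover, the index indicates the positions of target weight blocks, which reduces the complexity of the index and the time spent addressing input data during operation, so effective hardware acceleration can be achieved without specific hardware support.
The at least one target data block is the input data that needs to be computed in the matrix operation, that is, the input data corresponding to the target weight blocks in the weight matrix.
With reference to the second aspect, in some implementations of the second aspect, the weight matrix of the target AI model is stored in the form of compressed weights, and the compressed weights consist of the at least one target weight block.
In the solution of the embodiments of this application, storing the weight matrix of the at least one target network layer in the form of compressed weights reduces the storage space required. In addition, when the number of target weight blocks along the accumulation direction of the matrix operation is the same, a structured sparse matrix is obtained, and during operation the compressed weights can be loaded directly into the computing unit for the matrix operation, which helps increase processing speed.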
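With reference to the second aspect, in some implementations of the second aspect, the method further includes: combining the target data blocks, among the at least one target data block, that correspond to a plurality of weight groups into a plurality of data matrices, respectively, where one of the weight groups includes the target weight blocks located along the accumulation direction of the matrix operation in the weight matrix of the target AI model. Performing the matrix operation based on the at least one target data block and the at least one target weight block to obtain the result of the matrix operation then includes: performing matrix operations on the plurality of data matrices and the plurality of weight groups, respectively, to obtain a plurality of matrix operation results; and rearranging the data of the plurality of matrix operation results to obtain the result of the matrix operation.
In the solution of the embodiments of this application, executing the matrix operations in parallel improves the processing efficiency of the accelerator.
With reference to the second aspect, in some implementations of the second aspect, when the network layer corresponding to the weight matrix of the target AI model is a convolutional layer, the weight matrix of the target AI model is obtained by converting the convolution kernels of that network layer into a two-dimensional matrix.
With reference to the second aspect, in some implementations of the second aspect, obtaining the at least one target data block from the input data matrix according to the weight index of the weight matrix of the target AI model includes: determining a target position in the input data matrix according to the weight index of the weight matrix of the target AI model; and converting the data at the target position in the input data matrix into the at least one target data block.
In the solution of this application, when the network layer is a convolutional layer, the data at the target position is obtained from the input data matrix according to the weight index and is unfolded into a plurality of data matrices through im2col. There is therefore no need to first convert the whole input data matrix through im2col and then fetch the at least one target data block from the converted matrix, and no need to store the converted input data matrix, which reduces the storage space required and improves processing efficiency.
With reference to the second aspect, in some implementations of the second aspect, the multiple corresponding to the integer multiple is obtained by searching in a search space.
With reference to the second aspect, in some implementations of the second aspect, the sparsity rate of the weight matrix of the target AI model is obtained by searching in a search space, and the sparsity rate of the target AI model satisfies a target sparsity rate.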
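According to a third aspect, a processing apparatus for an AI model is provided. The apparatus includes modules or units configured to perform the method in the first aspect or any implementation of the first aspect.
According to a fourth aspect, an operation apparatus for an AI model is provided. The apparatus includes modules or units configured to perform the method in the second aspect or any implementation of the second aspect.
It should be understood that the extensions, limitations, explanations and descriptions of the relevant content in the first aspect also apply to the same content in the second, third and fourth aspects.
According to a fifth aspect, a processing apparatus for an AI model is provided. The apparatus includes: a memory configured to store a program; and a processor configured to execute the program stored in the memory. When the program stored in the memory is executed, the processor is configured to perform the method in the first aspect or any implementation of the first aspect.
The processor in the fifth aspect may be a central processing unit (CPU), or a combination of a CPU and an AI operation processor. The AI operation processor may include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and the like. The TPU is an artificial intelligence accelerator application-specific integrated circuit fully customized by Google for machine learning.
According to a sixth aspect, an operation apparatus for an AI model is provided. The apparatus includes: a memory configured to store a program; and a processor configured to execute the program stored in the memory. When the program stored in the memory is executed, the processor is configured to perform the method in the second aspect or any implementation of the second aspect.
The processor in the sixth aspect may be a CPU, or a combination of a CPU and an AI operation processor. The AI operation processor may include a GPU, an NPU, a TPU, and the like.
According to a seventh aspect, a computer-readable storage medium is provided. The computer-readable medium stores program code to be executed by a device, and the program code includes instructions for performing the method in any implementation of the first aspect or the second aspect.
According to an eighth aspect, a computer program product containing instructions is provided. When the computer program product runs on a computer, the computer is caused to perform the method in any implementation of the first aspect or the second aspect.
According to a ninth aspect, a chip is provided. The chip includes a processor and a data interface. The processor reads, through the data interface, instructions stored in a memory, and performs the method in any implementation of the first aspect or the second aspect.
Optionally, in an implementation, the chip may further include a memory that stores instructions. The processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to perform the method in any implementation of the first aspect or the second aspect.
The chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).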
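Brief Description of the Drawings
FIG. 1 is a schematic diagram of an artificial intelligence main framework according to an embodiment of this application;
FIG. 2 is a schematic structural diagram of a system architecture according to an embodiment of this application;
FIG. 3 is a schematic diagram of the hardware structure of a chip according to an embodiment of this application;
FIG. 4 is a schematic diagram of a system architecture according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of a compiler according to an embodiment of this application;
FIG. 6 is a schematic diagram of a processing method for an AI model according to an embodiment of this application;
FIG. 7 is a schematic diagram of the sparse processing of a weight matrix according to an embodiment of this application;
FIG. 8 is a schematic diagram of the size of a minimum computing unit according to an embodiment of this application;
FIG. 9 is a schematic diagram of a sparse processing procedure according to an embodiment of this application;
FIG. 10 is a schematic diagram of a representation of a weight matrix according to an embodiment of this application;
FIG. 11 is a schematic flowchart of an operation method for an AI model according to an embodiment of this application;
FIG. 12 is a schematic diagram of the operation process of a fully connected layer according to an embodiment of this application;
FIG. 13 is a schematic diagram of a way of processing the input data of a convolutional layer according to an embodiment of this application;
FIG. 14 is a schematic diagram of the operation process of a convolutional layer according to an embodiment of this application;
FIG. 15 is a comparison of the accuracy of the target neural network model and the initial neural network model according to an embodiment of this application;
FIG. 16 is a comparison of the operation time of a single operator in the initial neural network model and the target neural network model according to an embodiment of this application;
FIG. 17 is a schematic diagram of the degree of acceleration of the target neural network model according to an embodiment of this application;
FIG. 18 is a schematic block diagram of a processing apparatus for an AI model according to an embodiment of this application;
FIG. 19 is a schematic block diagram of an accelerator according to an embodiment of this application;
FIG. 20 is a schematic block diagram of another processing apparatus for an AI model according to an embodiment of this application;
FIG. 21 is a schematic block diagram of an operation apparatus for an AI model according to an embodiment of this application.
Detailed Description
The technical solutions in this application are described below with reference to the accompanying drawings.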
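FIG. 1 is a schematic diagram of an artificial intelligence main framework. The main framework describes the overall workflow of an artificial intelligence system and applies to general requirements in the field of artificial intelligence.
The main framework is described in detail below along two dimensions: the "intelligent information chain" (horizontal axis) and the "information technology (IT) value chain" (vertical axis).
The "intelligent information chain" reflects the series of processes from data acquisition to data processing. For example, it may be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data is refined from data to information, then to knowledge, and then to wisdom.
The "IT value chain" runs from the underlying infrastructure of artificial intelligence and information (the provision and processing technology implementation) to the industrial ecosystem of the system, and reflects the value that artificial intelligence brings to the information technology industry.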
(1)基础设施:
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。
基础设施可以通过传感器与外部沟通,基础设施的计算能力可以由智能芯片提供。
这里的智能芯片可以是中央处理器(central processing unit,CPU)、神经网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、专门应用的集成电路(application specific integrated circuit,ASIC)以及现场可编程门阵列(field programmable gate array,FPGA)等硬件加速芯片。
基础设施的基础平台可以包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。
例如,对于基础设施来说,可以通过传感器和外部沟通获取数据,然后将这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据:
基础设施的上一层的数据用于表示人工智能领域的数据来源。该数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理:
上述数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等处理方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力:
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用:
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决 方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,平安城市,智能终端等。
本申请实施例可以应用在人工智能中的很多领域,例如,智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,平安城市等领域。
具体地,本申请实施例可以具体应用在自动驾驶、图像分类、图像检索、图像语义分割、图像质量增强、图像超分辨率和自然语言处理等需要使用(深度)神经网络的领域。
下面对图片分类和监控这两种应用场景进行简单的介绍。
图片分类:
当用户在终端设备(例如,手机)或者云盘上存储了大量的图片时,通过对相册中图像进行识别可以方便用户或者系统对相册进行分类管理,提升用户体验。
利用本申请实施例的AI模型的运算方法,能够降低硬件开销,对终端设备更友好。此外,能够提高利用该神经网络对图片进行分类的速度,有利于实时为不同的类别的图片打上标签,便于用户查看和查找。
监控:
监控场景包括:智慧城市、野外监控、室内监控、室外监控、车内监控等。其中,智慧城市场景下,需要进行多种属性识别,例如行人属性识别和骑行属性识别,深度神经网络凭借着其强大的能力在多种属性识别中发挥着重要的作用。
通过采用本申请实施例的AI模型的运算方法,能够提高神经网络模型的处理效率,有利于对输入的道路画面进行实时处理,更快地识别出道路画面中的不同的属性信息,同时降低功耗。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例可能涉及的神经网络的相关术语和概念进行介绍。
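(1) Neural network
A neural network may be composed of neural units. A neural unit may be an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit may be:
$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$
where s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit.
f is the activation function of the neural unit, which introduces nonlinearity into the neural network and transforms the input signal of the neural unit into an output signal. The output signal of the activation function can serve as the input of the next layer.
A neural network is a network formed by joining many such single neural units together, so that the output of one neural unit can be the input of another. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of that local receptive field, and a local receptive field may be a region composed of several neural units.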
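(2) Deep neural network
A deep neural network (DNN), also called a multi-layer neural network, can be understood as a neural network with multiple hidden layers. Dividing a DNN by the position of its layers, the layers fall into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. The layers are fully connected; that is, any neuron in the i-th layer is connected to any neuron in the (i+1)-th layer.
Although a DNN looks complicated, the work of each layer is not. In short, each layer computes the following linear relationship followed by an activation:
$\vec{y} = \alpha(W\vec{x} + \vec{b})$
where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply applies this operation to the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, there are also many coefficients W and offset vectors $\vec{b}$. These parameters are defined in a DNN as follows, taking the coefficient W as an example. Suppose that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$. The superscript 3 is the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer.
In summary, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as $W^{L}_{jk}$.
Note that the input layer has no W parameters. In a deep neural network, more hidden layers allow the network to better describe complex situations in the real world. In theory, a model with more parameters has higher complexity and larger "capacity", which means it can complete more complex learning tasks. Training a deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).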
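(3) Convolutional neural network (CNN)
A convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture performs multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron can respond to the image input to it.
A convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers, and the feature extractor can be regarded as a filter. A convolutional layer is a layer of neurons in the network that performs convolution on the input signal. In a convolutional layer, a neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as extracting image information in a way that is independent of position. A convolution kernel can be initialized as a matrix of random size, and during training of the convolutional neural network the kernel obtains reasonable weights through learning. In addition, a direct benefit of sharing weights is that it reduces the connections between the layers of the convolutional neural network while also lowering the risk of overfitting.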
下面介绍一层卷积层的内部工作原理。
卷积层可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又 一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络进行正确的预测。
当卷积神经网络有多个卷积层的时候,初始的卷积层往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络深度的加深,越往后的卷积层提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
To increase the speed of the convolution operation, the convolution operation can be converted into a matrix multiplication through im2col.
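(4) Loss function
When training a deep neural network, the output of the network should be as close as possible to the value that is actually desired. The predicted value of the current network is therefore compared with the desired target value, and the weight vectors of each layer are updated according to the difference between the two (there is usually an initialization process before the first update, in which parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the network can predict the desired target value or a value very close to it. This requires defining in advance how to compare the predicted value with the target value, which is the role of the loss function or objective function: important equations that measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training a deep neural network becomes the process of reducing this loss as much as possible. Generally, the smaller the loss, the higher the training quality of the deep neural network, and the larger the loss, the lower the training quality. Similarly, the smaller the fluctuation of the loss, the more stable the training; the larger the fluctuation, the less stable the training.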
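(5) Back propagation algorithm
A neural network can use the error back propagation (BP) algorithm to correct the parameter values of the initial neural network model during training, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, passing the input signal forward until the output produces an error loss, and the parameters of the initial neural network model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward pass driven by the error loss and aims to obtain the optimal parameters of the neural network model, such as the weight matrices.
For example, the loss value produced in each training iteration is passed layer by layer from back to front through the neural network model. When it reaches a layer, the update amount of that layer's parameters is computed at the same time (a partial derivative operation), and this update amount is related to the gradient.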
如图2所示,本申请实施例提供了一种系统架构200。在图2中,数据采集设备260用于采集训练数据。例如,针对本申请实施例的AI模型的处理方法来说,若训练数据为图像数据,则训练数据可以包括训练图像以及训练图像对应的处理结果。例如,训练图像对应的分类结果,训练图像的分类结果可以是人工预先标注的结果。
在采集到训练数据之后,数据采集设备260将这些训练数据存入数据库230,训练设备220基于数据库230中维护的训练数据训练得到目标模型/规则201。
下面对训练设备220基于训练数据得到目标模型/规则201进行描述,训练设备220 对输入的原始数据进行处理,将输出值与目标值进行对比,直到训练设备220输出的值与目标值的差值小于一定的阈值,从而完成目标模型/规则201的训练。
本申请实施例中的目标模型/规则201具体可以为神经网络模型。例如,卷积神经网络或残差神经网络等。需要说明的是,在实际的应用中,数据库230中维护的训练数据不一定都来自于数据采集设备260的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备220也不一定完全基于数据库230维护的训练数据进行目标模型/规则201的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
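The target model/rule 201 trained by the training device 220 can be applied to different systems or devices, for example to the execution device 210 shown in FIG. 2. The execution device 210 may be a terminal, such as a mobile phone, a tablet computer, a laptop, an augmented reality (AR)/virtual reality (VR) device or a vehicle-mounted terminal, or it may be a server, the cloud, or the like. In FIG. 2, the execution device 210 is configured with an input/output (I/O) interface 212 for data interaction with external devices. A user can input data to the I/O interface 212 through the client device 240, and in the embodiments of this application the input data may include the data to be processed that is input by the client device.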
在执行设备210对输入数据进行预处理,或者在执行设备210的计算模块211执行计算等相关的处理过程中,执行设备210可以调用数据存储系统250中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统250中。
最后,I/O接口212将处理结果,如上述得到的数据的处理结果返回给客户设备240,从而提供给用户。
值得说明的是,训练设备220可以针对不同的目标或不同的任务,基于不同的训练数据生成相应的目标模型/规则201,该相应的目标模型/规则201即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在图2中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口212提供的界面进行操作。另一种情况下,客户设备240可以自动地向I/O接口212发送输入数据,如果要求客户设备240自动发送输入数据需要获得用户的授权,则用户可以在客户设备240中设置相应权限。用户可以在客户设备240查看执行设备210输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备240也可以作为数据采集端,采集如图所示输入I/O接口212的输入数据及输出I/O接口212的输出结果作为新的样本数据,并存入数据库230。当然,也可以不经过客户设备240进行采集,而是由I/O接口212直接将如图所示输入I/O接口212的输入数据及输出I/O接口212的输出结果,作为新的样本数据存入数据库230。
值得注意的是,图2仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图2中,数据存储系统250相对执行设备210是外部存储器,在其它情况下,也可以将数据存储系统250置于执行设备210中。
如图2所示,根据训练设备220训练得到目标模型/规则201,该目标模型/规则201在本申请实施例中可以是本申请中的神经网络,具体的,本申请实施例的神经网络可以为CNN等。
图3为本申请实施例提供的一种芯片的硬件结构,该芯片包括神经网络处理器30。该芯片可以被设置在如图2所示的执行设备210中,用以完成计算模块211的计算工作。该芯片也可以被设置在如图2所示的训练设备220中,用以完成训练设备220的训练工作并输出目标模型/规则201。本申请实施例中的方法可在如图3所示的芯片中得以实现。
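The neural network processor 30 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or any other processor suitable for large-scale exclusive-OR operation processing. Taking the NPU as an example, the NPU 30 is mounted as a coprocessor on a host central processing unit (host CPU), and the host CPU assigns tasks to it. The core part of the NPU is the operation circuit 303; the controller 304 controls the operation circuit 303 to fetch data from memory (the weight memory or the input memory) and perform operations. The TPU is an artificial intelligence accelerator application-specific integrated circuit fully customized by Google for machine learning.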
在一些实现中,运算电路303内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路303是二维脉动阵列。运算电路303还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路303是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器302中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器301中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)308中。
向量计算单元307可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元307可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization,BN),局部响应归一化(local response normalization)等。
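In some implementations, the vector computation unit 307 can store the processed output vector in the unified memory 306. For example, the vector computation unit 307 may apply a nonlinear function to the output of the operation circuit 303, such as a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit 307 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 303, for example for use in a subsequent layer of the neural network.
The unified memory 306 is configured to store input data and output data.
A direct memory access controller (DMAC) 305 transfers input data from the external memory to the input memory 301 and/or the unified memory 306, stores weight data from the external memory into the weight memory 302, and stores data from the unified memory 306 into the external memory.
A bus interface unit (BIU) 310 is configured to enable interaction among the host CPU, the DMAC and the instruction fetch buffer 309 over a bus.
The instruction fetch buffer 309, connected to the controller 304, is configured to store instructions used by the controller 304.
The controller 304 is configured to invoke the instructions cached in the instruction fetch buffer 309 to control the working process of the operation accelerator.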
一般地,统一存储器306,输入存储器301,权重存储器302以及取指存储器309均 为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
上文中介绍的图2中的执行设备210或图3中的芯片可以作为加速器执行本申请实施例的AI模型的运算方法的各个步骤。上文中介绍的图2中的训练设备220或图3中的芯片能够执行本申请实施例的AI模型的处理方法的各个步骤。
如图4所示,本申请实施例提供了一种系统架构400。该系统架构包括本地设备401、本地设备402以及执行设备410和数据存储系统450,其中,本地设备401和本地设备402通过通信网络与执行设备410连接。
执行设备410可以由一个或多个服务器实现。可选的,执行设备410可以与其它计算设备配合使用,例如:数据存储器、路由器、负载均衡器等设备。执行设备410可以布置在一个物理站点上,或者分布在多个物理站点上。执行设备410可以使用数据存储系统450中的数据,或者调用数据存储系统450中的程序代码来实现本申请实施例的AI模型的运算方法或AI模型的处理方法。
具体地,在一种实现方式中,执行设备410可以执行以下过程:
提供了一种AI模型的处理方法,该方法包括:获取初始AI模型和加速器的最小计算单元的尺寸,加速器用于执行目标AI模型的运算;对初始AI模型进行稀疏处理,以得到目标AI模型,其中,目标AI模型的权重矩阵包括多个权重块,多个权重块包括至少一个无效权重块和至少一个目标权重块,无效权重块为不参与运算的权重块,目标权重块为参与运算的权重块,权重块的尺寸为加速器的最小计算单元的尺寸的整数倍,其中,位于目标神经网络模型的第i层的权重矩阵中的矩阵运算累加方向上的目标权重块的数量相同。
用户可以操作各自的用户设备(例如本地设备401和本地设备402)与执行设备410进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与执行设备410进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。
在一种实现方式中,本地设备401、本地设备402从执行设备410获取到目标AI模型的相关参数,将目标神经网络部署在本地设备401、本地设备402上,利用该目标AI模型进行图像分类、进行图像处理、语音处理或者文本处理等等。
在另一种实现中,执行设备410上可以直接部署目标AI模型,执行设备410通过从本地设备401和本地设备402获取待处理数据,并采用目标AI模型对待处理数据进行处理。
上述执行设备410也可以为云端设备,此时,执行设备410可以部署在云端;或者,上述执行设备410也可以为终端设备,此时,执行设备410可以部署在用户终端侧,本申请实施例对此并不限定。
图5示出了本申请实施例提供的一种加速器中的编译器的示意图,如图5所示,编译器500包括:稀疏工具510、图优化和编译模块520和算子编译模块530。
稀疏工具510用于对初始AI模型进行稀疏处理,以得到目标AI模型。
可选地,稀疏工具510包括稀疏网络生成模块511和稀疏网络调优模块512。
其中,稀疏网络生成模块511用于将初始AI模型中的至少一个目标网络层中的各个目标网络层中的至少一个权重块作为不参与运算的权重块,以得到稀疏AI模型。
稀疏网络调优模块512用于对稀疏网络生成模块511得到的稀疏AI模型进行调优(fine tune),以得到目标AI模型。
稀疏工具510可以通过本申请实施例中的AI模型的处理方法得到目标AI模型,具体过程可以参见后文中的方法600。
图优化和编译模块520用于对AI模型的拓扑图进行处理。
具体地,如图5所示,图优化和编译模块520可以包括图引擎(graph engine,GE)521,融合引擎(fusion engine,FE)522,AI CPU引擎(AI CPU engine,AICPUE)523和华为集合通信库(Huawei collective communication library,HCCL)524。
GE521用于为不同的深度学习框架提供统一的中间表示(intermediate representation,IR)接口。
在本申请实施例中,GE521可以用于将稀疏工具510处理后得到的目标AI模型的拓扑图解析为适配当前的硬件的模型的拓扑图,即将目标AI模型的拓扑图进行改写。
FE522用于对GE521处理过的图中的部分算子进行融合。
AICPUE523用于执行目标AI模型的运算。
HCCL524用于提供集合通信算子,实现不同的加速器之间的数据传输。例如,分布式训练中不同的NPU之间可以通过HCCL524实现数据传输。
算子编译模块530用于对图优化和编译模块520处理过的图中的每个算子,即图中的每个节点进行编译,生成对应的可执行文件。
算子编译模块530包括算子库531和张量加速引擎(tensor boost engine,TBE)532。
算子库531中可以包括稀疏算子。稀疏算子用于实现目标AI模型中的稀疏网络层的运算。稀疏网络层的运算也可以称为稀疏计算。具体描述可以参见本申请实施例中的方法900。
TBE532用于为GE521提供算子信息,以及为FE522提供子图优化信息和TBE算子调用信息,最终生成可执行任务。
TBE532还用于生成自定义的可执行算子。换言之,TBE532能够为用户提供自定义算子开发方式。
示例性地,该编译器500可以是基于现有编译器得到的,例如,该编译器500可以是通过在神经网络计算架构(compute architecture for neural networks,CANN)中的昇腾张量编译器(ascend tensor compiler,ATC)中集成稀疏工具510得到的。
需要说明的是,图5中的编译器的结构仅为示例,在具体实现过程中,本领域的技术人员应当理解,编译器500还可以包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,编译器500还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,编译器500也可仅仅包括实现本申请实施例所必须的器件,而不必包括图5中所示的全部模块。
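Operating an AI model usually requires high computing power and a large amount of storage space, which limits the operation speed of the model. Provided that the accuracy of the model is preserved, sparse computation is an effective way to accelerate computation.
Weight sparsification reduces the computation of a model by removing redundant weights from the AI model, which in turn improves the processing speed and energy efficiency of the hardware accelerator. However, existing weight sparsification schemes, especially fine-grained ones, usually achieve accelerated computation only with specific hardware support.
In NVIDIA's A100, every 4 adjacent weights form a group, and 2 non-zero weights are retained in each group, giving a sparsity rate of 50%. This scheme uses a single weight as the pruning unit; the non-zero weights in the resulting sparse network appear at random positions, so the weight index is complex and effective hardware acceleration requires specific hardware support. The tensor core structure of the A100 uses a 4x4 data structure, that is, groups of 4 weights, to support 2-out-of-4 sparse computation. Further, the 2-out-of-4 scheme confines the randomness of the weights to within a group. During operation, the A100 loads all the data corresponding to a group of weights into the registers of the compute core, which avoids the latency caused by weight randomness, that is, it reduces the latency of random accesses to external memory, and achieves effective hardware acceleration.
On other general-purpose AI accelerators without specific hardware support, the 2-out-of-4 scheme cannot achieve effective hardware acceleration and may not even be able to perform sparse computation. In addition, the hardware design of the A100 means that it can support only a 50% sparsity rate.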
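As a point of comparison, the 2-out-of-4 scheme described above can be sketched in a few lines. This is only an illustration: the text says that 2 non-zero weights are kept in each group of 4, and keeping the two largest magnitudes is an assumption made here, as is the function name `prune_2_of_4`.

```python
def prune_2_of_4(weights):
    """Zero out 2 of every 4 adjacent weights (assumption: keep the 2 largest magnitudes)."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest magnitudes; ties are broken by position.
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.3, -0.7, 0.2, 0.2, -0.5, 0.05]
print(prune_2_of_4(row))  # [0.9, 0.0, 0.0, -0.7, 0.2, 0.0, -0.5, 0.0]
```

The surviving positions differ from group to group, which is why this fine-grained pattern needs a per-element index and dedicated hardware to exploit.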
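The embodiments of this application provide a processing method for an AI model that reduces the computation of the AI model and increases its operation speed without requiring additional hardware support.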
下面结合图6至图10对本申请实施例中的AI模型的处理方法进行详细的描述。
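FIG. 6 shows a processing method 600 for an AI model according to an embodiment of this application. The method shown in FIG. 6 may be performed by a training apparatus for the AI model. The apparatus may be a cloud service device or a terminal device, for example a computer, a server, or another apparatus with enough computing power to train the AI model, or it may be a system composed of a cloud service device and a terminal device. For example, method 600 may be performed by the training device 220 in FIG. 2, the neural network processor 30 in FIG. 3, or the execution device 410 or a local device in FIG. 4. For example, method 600 may be performed by the training apparatus invoking the sparse tool 510 in FIG. 5.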
方法600包括步骤S610至步骤S620。下面对步骤S610至步骤S620进行详细介绍。
S610,获取初始AI模型和加速器的最小计算单元的尺寸。该加速器用于执行目标AI模型的运算。
示例性地,AI模型可以为神经网络模型、支持向量机、随机森林或决策树等模型。
为了便于描述和说明,后文中主要以神经网络模型为例对本申请实施例的方案进行说明,不对本申请实施例的方案构成限定。即初始AI模型可以为初始神经网络模型,目标AI模型可以为目标神经网络模型。本申请实施例中的神经网络模型可以是现有的神经网络模型,例如,残差网络或卷积神经网络等。或者,该神经网络模型也可以是自行构建的其他结构的神经网络模型。本申请实施例对此不作限定。
初始AI模型可以是训练好的AI模型。
用于训练初始AI模型的第一训练数据的类型与AI模型的任务有关。例如,AI模型用于图像处理任务,则该第一训练数据可以为图像数据。具体地,图像处理任务包括图像分类、图像检测、图像分割或图像生成等。再如,AI模型用于文本处理任务,则该第一训练数据可以为文本数据。具体地,文本处理任务包括文本识别或文本翻译等。再如,AI模型用于语音处理任务,则该第一训练数据可以为语音数据。具体地,语音处理任务包括语音识别等。本申请实施例对第一训练数据的类型不做限定。
步骤S610中可以通过多种方式获取初始AI模型。例如,步骤S610中可以获取用户输入的初始AI模型。即该初始AI模型是由用户提供的。或者,步骤S610中可以读取本地存储的初始AI模型。或者,步骤S610中可以接收其他设备发送的初始AI模型。或者,步骤S610中可以对AI模型进行训练得到初始AI模型。本申请实施例对初始AI模型的具体获取方式不做限定。
在本申请实施例中,计算单元是加速器中提供算力的核心单元,也可以称为核(core)。例如,计算单元可以用于执行矩阵运算、向量运算以及标量数据运算等运算操作。
最小计算单元的尺寸指的是加速器能够一次性处理的矩阵运算的规模,或者说,计算单元一次运算执行的计算量。示例性地,最小计算单元的尺寸可以为加速器在一个时钟周期内能够完成的矩阵运算的规模。
示例性地,加速器可以包括以下至少一项:CPU、GPU、NPU和TPU等等。
本申请实施例中,最小计算单元也可以称为块单元(blockunit)。blockunit是由加速器的硬件规格决定的。不同的加速器可能具有不同的blockunit。例如,如图8所示,加速器1、加速器2、加速器3、加速器4和加速器5分别对应五个不同的blockunit:blockunit_1、blockunit_2、blockunit_3、blockunit_4和blockunit_5。
示例性地,最小计算单元的尺寸可以是用户输入的。或者,最小计算单元的尺寸也可以是预先存储的。本申请实施例对最小计算单元的尺寸的获取方式不做限定。
S620,对初始AI模型进行稀疏处理,以得到目标AI模型,其中,目标AI模型的权重矩阵包括多个权重块,多个权重块包括至少一个无效权重块和至少一个目标权重块,无效权重块为不参与运算的权重块,目标权重块为参与运算的权重块,权重块的尺寸为加速器的最小计算单元的尺寸的整数倍,位于目标AI模型的权重矩阵中的矩阵运算累加方向上的目标权重块的数量相同。
下面以AI模型为神经网络模型对步骤S620进行说明。在初始AI模型为初始神经网络模型的情况下,步骤S620可以包括:对初始神经网络模型进行稀疏处理,以得到目标神经网络模型。其中,目标神经网络模型的第i层的权重矩阵包括多个权重块,多个权重块包括至少一个无效权重块和至少一个目标权重块,无效权重块为不参与运算的权重块,目标权重块为参与运算的权重块。权重块的尺寸为加速器的最小计算单元的尺寸的整数倍,第i层的权重矩阵的尺寸为权重块的尺寸的整数倍。
该至少一个目标权重块在目标神经网络模型的第i层的权重矩阵中的位置由第i层的权重索引指示。i为正整数。
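In other words, when the target AI model processes the data to be processed, none of the weight values in an invalid weight block participate in operations, and at least one weight value in a target weight block participates in operations.
Invalid weight blocks and target weight blocks can be represented in multiple ways.
For example, all weight values in an invalid weight block are set to 0. In other words, among the weight blocks of the target AI model, a weight block whose weight values are all 0 is an invalid weight block, and a weight block with at least one non-zero weight value is a target weight block.
For example, the weight values in an invalid weight block are not activated, and the weight values in a target weight block are activated. In other words, among all weight blocks of the weight matrix of the target AI model, only some are activated during operation. A weight block that is not activated during operation is an invalid weight block, and a weight block that is activated during operation is a target weight block. For example, a mask in the form of a matrix of 0s and 1s can indicate the invalid weight blocks and target weight blocks of the target AI model, that is, which weight blocks are activated and which are not.
The above are only examples. Invalid weight blocks and target weight blocks can also be represented in other ways, which is not limited in the embodiments of this application. For ease of understanding and brevity, the following description uses setting weight values to 0 as one example of not participating in operations, which does not limit the solution of the embodiments of this application.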
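The mask representation mentioned above can be illustrated with a minimal sketch. It assumes one 0/1 entry per weight block and expands that block-level mask to element level before applying it; the helper names `expand_block_mask` and `apply_mask` are hypothetical and not taken from the application.

```python
def expand_block_mask(block_mask, block_rows, block_cols):
    """Expand a 0/1 mask with one entry per weight block into a per-element mask."""
    element_mask = []
    for mask_row in block_mask:
        expanded = [bit for bit in mask_row for _ in range(block_cols)]
        element_mask.extend(list(expanded) for _ in range(block_rows))
    return element_mask


def apply_mask(weights, element_mask):
    """Zero every weight whose block is marked 0 (an invalid weight block)."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, element_mask)]


block_mask = [[1, 0],
              [0, 1]]                      # 1 = target weight block, 0 = invalid weight block
weights = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
masked = apply_mask(weights, expand_block_mask(block_mask, 2, 2))
print(masked)
```

Because the mask has one entry per block rather than per weight, it is far smaller than an element-level mask for the same matrix.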
在AI模型为神经网络模型的情况下,步骤S620可以理解为,对初始神经网络模型中的至少一个目标网络层的权重矩阵进行稀疏处理,以得到目标神经网络。对于该至少一个目标网络层中的任一个目标网络层而言,稀疏处理的粒度即为该目标网络层的权重块的尺寸,即以目标网络层的权重块为单位保留或舍弃初始神经网络模型的目标网络层的权重矩阵中的权重值。目标神经网络模型中的该至少一个目标网络层也可以称为稀疏网络层。
第i层可以为该至少一个目标网络层中的任一网络层。为了便于描述,本申请实施例中仅以第i层为例对稀疏处理的过程进行说明。该至少一个目标网络层中的其他网络层也可以采用相同的方式进行稀疏处理。
该至少一个目标网络层中的各个目标网络层的权重块的尺寸可以相同,也可以不同。
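In the solution of the embodiments of this application, the weight block size is the granularity of sparse processing; that is, weight values in the weight matrix are retained or discarded in units of weight blocks, and the weight block size is an integer multiple of the size of the accelerator's minimum computing unit. The solution removes redundant weights from the initial AI model and reduces the computation of the model, which helps increase the processing speed of the accelerator. Moreover, because the target weight block size is an integer multiple of the minimum computing unit size, during operation of the target AI model the accelerator can fetch the data required for an operation in one pass, in the data format supported by the minimum computing unit, without cropping or stitching the data into that format, which reduces operation latency. In addition, using the weight block as the granularity of sparse processing reduces the complexity of the weight index and the time spent addressing input data during operation, so effective hardware acceleration can be achieved without specific hardware support.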
具体地,在AI模型为神经网络模型的情况下,步骤S620包括步骤S621和步骤S622(图6中未示出)。
S621,将初始神经网络模型中的至少一个目标网络层中的各个目标网络层中的至少一个权重块作为不参与运算的权重块,以得到稀疏神经网络模型。该至少一个目标网络层包括第i层。
示例性地,步骤S621可以由图5中的稀疏网络生成模块511执行。
示例性地,步骤S621可以为,将初始神经网络模型中的至少一个目标网络层中的各个目标网络层中的至少一个权重块的权重值置为0,以得到稀疏神经网络模型。
该至少一个目标网络层包括第i层。下面以第i层为例对步骤S621进行说明。
如前所述,初始神经网络模型的第i层的权重矩阵包括多个权重块。将该多个权重块中的至少一个权重块的权重值置为0,即可得到稀疏神经网络模型的第i层的权重矩阵。
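In the embodiments of this application, setting the weight values of a weight block to 0 can also be understood as deleting the weight values in that weight block.
A weight block whose weight values are set to 0 is an invalid weight block. A weight block whose weight values are not set to 0 is a retained, valid weight block. Taking the i-th layer as an example, in the weight matrix of the i-th layer of the sparse neural network model, at least one weight value in a valid weight block is non-zero, and all weight values in an invalid weight block are 0.
FIG. 7 is a schematic diagram of a weight matrix. The size of the weight matrix is K×N, and the size of the accelerator's minimum computing unit is BU_K×BU_N. The size of a weight block (block) of the target network layer is an integer multiple of the minimum computing unit and can be expressed as (α*BU_K)×(β*BU_N), and the size of the weight matrix is an integer multiple of the block size. α, β, K, N, BU_K and BU_N are all positive integers, with α*BU_K≤K and β*BU_N≤N.
The weight matrix is divided into multiple blocks according to the block size, and the blocks to be retained are selected from among them. For example, as shown in FIG. 7, the weight matrix is divided into 15 blocks and the weight values of 9 blocks are retained; those 9 blocks are valid weight blocks and the remaining 6 are invalid weight blocks.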
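The size constraints above can be checked mechanically. The following sketch computes the block size (α*BU_K)×(β*BU_N) and the resulting block grid, and rejects configurations in which the weight matrix is not a whole number of blocks. The function name `block_layout` and the concrete sizes (a 16×16 minimum computing unit and an 80×192 weight matrix) are assumptions chosen so that the grid has 15 blocks, as in the FIG. 7 example.

```python
def block_layout(K, N, bu_k, bu_n, alpha, beta):
    """Return ((block_k, block_n), (grid_rows, grid_cols)) for a K x N weight matrix."""
    if alpha < 1 or beta < 1:
        raise ValueError("alpha and beta must be positive integers")
    block_k, block_n = alpha * bu_k, beta * bu_n
    if block_k > K or block_n > N:
        raise ValueError("block is larger than the weight matrix")
    if K % block_k or N % block_n:
        raise ValueError("weight matrix size must be a multiple of the block size")
    return (block_k, block_n), (K // block_k, N // block_n)


# Hypothetical sizes: 16x16 minimum computing unit, 80 x 192 weight matrix.
block, grid = block_layout(K=80, N=192, bu_k=16, bu_n=16, alpha=1, beta=4)
print(block, grid, grid[0] * grid[1])  # (16, 64) (5, 3) 15
```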
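The weight blocks in one weight matrix may have the same size. If the weight blocks in a weight matrix had different sizes, operators would have to be compiled separately for each block size when compiling the operators of the AI model, which increases the number of compilations. In the solution of the embodiments of this application, the weight blocks in one weight matrix have the same size, so operator compilation needs to be performed only once. This reduces the number of compilations and the compilation time, and increases the operation speed of the model. In addition, when the accelerator includes multiple computing units, multiple weight blocks can be computed in parallel, further increasing the operation speed of the model.
Optionally, the weight matrix of the i-th layer of the sparse neural network model includes at least one valid weight block, and at least one weight value in a valid weight block is non-zero. Among the at least one valid weight block, the number of valid weight blocks located along the accumulation (reduce) direction of the matrix operation in the weight matrix of the i-th layer of the sparse neural network model is the same.
That is, for the weight matrix of the i-th layer of the initial neural network model, the number of weight blocks discarded along the accumulation direction of the matrix operation is the same, or in other words, the number of weight blocks retained is the same.
For example, as shown in FIG. 7, the weight matrix is divided into three columns along the accumulation direction, which are the three dashed boxes in FIG. 7, and the number of retained weight blocks is the same in each column. Each column in FIG. 7 retains three weight blocks.
In this way, when the accelerator includes multiple computing units, parallel computation is possible, further increasing the operation speed of the model.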
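A minimal sketch of block-granularity pruning with the same number of retained blocks in every column along the accumulation direction follows. The text does not fix how the retained blocks are chosen (they may come from a search), so ranking blocks by their L1 norm is purely an assumption made for illustration, as is the function name `block_prune`.

```python
def block_prune(weights, bk, bn, keep):
    """Keep `keep` blocks per block-column of a K x N matrix; zero the rest.

    Assumption: blocks are ranked by L1 norm. Returns (pruned_weights, block_mask).
    """
    K, N = len(weights), len(weights[0])
    if K % bk or N % bn:
        raise ValueError("matrix size must be a multiple of the block size")
    grid_rows, grid_cols = K // bk, N // bn
    if not 1 <= keep <= grid_rows:
        raise ValueError("keep must be between 1 and the number of block rows")

    def l1_norm(r, c):
        return sum(abs(weights[i][j])
                   for i in range(r * bk, (r + 1) * bk)
                   for j in range(c * bn, (c + 1) * bn))

    mask = [[0] * grid_cols for _ in range(grid_rows)]
    for c in range(grid_cols):
        # Same number of retained blocks in every column along the accumulation direction.
        ranked = sorted(range(grid_rows), key=lambda r: (-l1_norm(r, c), r))
        for r in ranked[:keep]:
            mask[r][c] = 1
    pruned = [[weights[i][j] if mask[i // bk][j // bn] else 0.0 for j in range(N)]
              for i in range(K)]
    return pruned, mask


W = [[1.0, 0.1],
     [1.0, 0.1],
     [0.2, 3.0],
     [0.2, 3.0]]
pruned, mask = block_prune(W, bk=2, bn=1, keep=1)
print(mask)    # [[1, 0], [0, 1]]
print(pruned)  # [[1.0, 0.0], [1.0, 0.0], [0.0, 3.0], [0.0, 3.0]]
```

With a 5×3 block grid and `keep=3`, the same function reproduces the shape of the FIG. 7 example: 9 retained blocks, three in each column.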
示例性地,该至少一个目标网络层的权重块的尺寸可以是人为设置的。
然而,采用人为设置的方式难以针对不同的初始神经网络模型得到各个目标网络层的权重块的最优尺寸,影响模型的精度。
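Optionally, the weight block size may be obtained by searching in a search space.
As described above, the weight block size is an integer multiple of the size of the minimum computing unit. Obtaining the weight block size by searching in a search space can also be understood as obtaining the multiple corresponding to the integer multiple by searching in the search space.
A search space is the range that can be searched, or in other words, the range of available choices. Automated machine learning (AutoML) usually predefines a search space, continually generates different configurations within it, and forms a closed loop of evaluation, feedback, and generation of new configurations until the desired configuration is obtained.
Specifically, the search space is determined by the task to be searched. For example, when searching for the weight block size, different multiples can be generated in the search space and the resulting weight block size is evaluated, for example by evaluating the model corresponding to the current weight block size. Other multiples are then generated based on the evaluation feedback until the desired multiple is obtained.
For example, when the AI model is a neural network model, the weight block size of the i-th layer may be obtained by searching in a search space.
The specific search method may use solutions in the prior art, for example AutoML model compression (AMC), to obtain the weight block size.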
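The evaluate-and-feedback loop can be sketched as follows. This is a deliberately simple exhaustive search over candidate multiples, used here as a stand-in for a method such as AMC; the function name `search_block_multiples` and the `evaluate` callback (which would, for instance, prune the layer with the candidate block size and return a validation score) are hypothetical.

```python
from itertools import product


def search_block_multiples(alphas, betas, evaluate):
    """Try every (alpha, beta) pair and return (best_score, alpha, beta).

    `evaluate(alpha, beta)` is a caller-supplied scoring function; higher is better.
    """
    best = None
    for alpha, beta in product(alphas, betas):
        score = evaluate(alpha, beta)
        if best is None or score > best[0]:
            best = (score, alpha, beta)
    return best


# Toy scoring function standing in for "prune, evaluate the model, report accuracy".
def toy_score(alpha, beta):
    return -((alpha - 2) ** 2) - ((beta - 4) ** 2)


print(search_block_multiples([1, 2, 4], [1, 2, 4], toy_score))  # (0, 2, 4)
```

A learned search policy would replace the exhaustive loop by proposing the next candidate from the scores seen so far, but the interface (propose multiples, evaluate, feed back) stays the same.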
需要说明的是,在AI模型包括多个权重矩阵的情况下,该多个权重矩阵的权重块的尺寸的设置方式可以相同,也可以不同。例如,该多个权重矩阵中的部分权重矩阵的权重块的尺寸可以是人为设置的,其余权重矩阵的权重块的尺寸可以是在搜索空间中搜索得到的。本申请实施例对此不做限定。
可选地,在AI模型为神经网络模型的情况下,目标神经网络模型的第i层的权重矩阵中的权重块的尺寸和目标神经网络模型的第j层的权重矩阵中的权重块的尺寸不同,j为正整数。
换言之,该不同目标网络层的权重矩阵中的权重块的尺寸可以是相同的,也可以是不同的。
在本申请实施例的方案中,在搜索空间内搜索权重块的尺寸,能够得到适合各个权重矩阵的权重块的尺寸,有利于提高模型的精度。
目标AI模型中的各个权重矩阵的稀疏率可以相同,也可以不同。
示例性地,目标AI模型的权重矩阵的稀疏率(sparse ratio)可以是人为设置的。
可选地,目标AI模型的权重矩阵的稀疏率是通过在搜索空间内进行搜索得到的,目标AI模型的稀疏率满足目标稀疏率。
示例性地,目标AI模型的稀疏率满足目标稀疏率,可以为,目标AI模型的稀疏率小于或等于目标稀疏率。目标AI模型的稀疏率越小,目标AI模型中被保留的权重块越少,即目标AI模型所需的计算量越小。或者,目标AI模型的稀疏率满足目标稀疏率,也可以为,目标AI模型的稀疏率与目标稀疏率之间的差值小于或等于第一阈值。
以AI模型为神经网络模型为例,进行稀疏处理的至少一个目标网络层的稀疏率可以是在搜索空间内进行搜索得到的。稀疏神经网络模型的稀疏率满足目标稀疏率。
该至少一个目标网络层包括第i层,即第i层的稀疏率是在搜索空间内进行搜索得到的。
示例性地,稀疏神经网络模型可以为在稀疏率满足目标稀疏率的候选稀疏神经网络模型中,搜索得到的精度最高的候选稀疏神经网络模型。
示例性地,候选稀疏神经网络模型的稀疏率满足目标稀疏率,可以为,候选稀疏神经网络模型的稀疏率小于或等于目标稀疏率。候选稀疏神经网络模型的稀疏率越小,候选稀疏神经网络模型中被保留的权重块越少,即候选稀疏神经网络模型所需的计算量越小。或者,候选稀疏神经网络模型的稀疏率满足目标稀疏率,也可以为,候选稀疏神经网络模型的稀疏率与目标稀疏率之间的差值小于或等于第一阈值。
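The selection rule described above (among the candidates whose sparsity rate satisfies the target, take the most accurate one) can be written as a short filter. The dictionary keys `sparsity` and `accuracy`, the function name `pick_candidate`, and the sample numbers are illustrative assumptions; both ways of "satisfying" the target from the text are covered.

```python
def pick_candidate(candidates, target_sparsity, threshold=None):
    """Return the most accurate candidate whose sparsity rate satisfies the target.

    threshold=None: sparsity rate <= target sparsity rate.
    threshold=t:    |sparsity rate - target sparsity rate| <= t (the "first threshold").
    """
    def satisfies(candidate):
        if threshold is None:
            return candidate["sparsity"] <= target_sparsity
        return abs(candidate["sparsity"] - target_sparsity) <= threshold

    feasible = [c for c in candidates if satisfies(c)]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["accuracy"])


# Made-up candidate configurations for illustration only.
candidates = [
    {"name": "a", "sparsity": 0.60, "accuracy": 0.76},
    {"name": "b", "sparsity": 0.50, "accuracy": 0.75},
    {"name": "c", "sparsity": 0.45, "accuracy": 0.73},
]
print(pick_candidate(candidates, target_sparsity=0.5)["name"])  # b
```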
示例性地,目标稀疏率可以是用户输入的。或者,目标稀疏率也可以是根据加速器的算力确定的。
具体的搜索方法可以采用现有技术中的方案,例如,采用AMC得到权重矩阵的稀疏率。
在本申请实施例的方案中,在模型的稀疏率满足目标稀疏率的前提下,在搜索空间内搜索各个权重矩阵的稀疏率,能够得到适合各个权重矩阵的稀疏率,有利于提高模型的精度。
或者,进行稀疏处理的至少一个目标网络层的稀疏率是在搜索空间内进行搜索得到 的。稀疏神经网络模型的精度满足目标精度。
示例性地,该稀疏神经网络模型可以为在精度满足目标精度的候选稀疏神经网络模型中,搜索得到的稀疏率最高的候选稀疏神经网络模型。
示例性地,候选稀疏神经网络模型的精度满足目标精度,可以为,候选稀疏神经网络模型的精度大于或等于目标精度。或者,候选稀疏神经网络模型的精度满足目标精度,也可以为,目标精度与候选稀疏神经网络模型的精度之间的差值小于或等于第二阈值。
示例性地,目标精度可以是用户输入的。
具体的搜索方法可以采用现有技术中的方案,例如,采用AMC得到权重矩阵的稀疏率。
在本申请实施例的方案中,在模型的精度满足目标精度的前提下,在搜索空间内搜索各个权重矩阵的稀疏率,能够得到适合各个权重矩阵的稀疏率,有利于降低模型的稀疏率,进而减少模型的计算量,以便实现模型的加速运算。
各个权重矩阵的稀疏率的设置方式可以相同,也可以不同。例如,部分权重矩阵的稀疏率可以是人为设置的,其余权重矩阵的稀疏率可以是在搜索空间中搜索得到的。本申请实施例对此不做限定。
示例性地,目标AI模型的权重矩阵对应的网络层是随机确定的。
也就是说,在AI模型为神经网络模型的情况下,进行稀疏处理的至少一个目标网络层可以是随机确定的。
可选地,目标AI模型的权重矩阵对应的网络层属于多个候选网络层。
也就是说,该至少一个目标网络层为多个候选网络层中的至少一个候选网络层。
在该情况下,在AI模型为神经网络模型的情况下,可以选择该多个候选网络层中的部分或全部候选网络层作为进行稀疏处理的至少一个目标网络层。
例如,至少一个目标网络层可以是在多个候选网络层中进行搜索得到的。
示例性地,该多个候选网络层可以是人为设定的。例如,该多个候选网络层可以包括第一层以外的其他网络层。
可替换地,该至少一个目标网络层可以是人为设定的。例如,该至少一个目标网络层可以包括第一层以外的其他网络层。
在本申请实施例的方案中,可以在指定的候选网络层的范围内进行稀疏处理,避免影响其他网络层,例如,避免影响与模型的精度的相关性较高的网络层,有利于保证模型的精度。
S622,对稀疏神经网络模型进行调优(fine tune),以得到目标神经网络模型。
示例性地,步骤S622可以由图5中的稀疏网络调优模块512执行。
用于对稀疏神经网络模型进行调优的第二训练数据的类型与第一训练数据的类型可以是相同的。
需要说明的是,步骤S622为可选步骤,在步骤S620不包括S622的情况下,可以将稀疏神经网络模型作为目标神经网络模型。
通过对稀疏神经网络模型进行调优,能够提高目标神经网络模型的精度。
需要说明的是,在对稀疏神经网络模型进行调优的过程中,不会改变稀疏神经网络模型的稀疏配置。具体来说,调优不会改变稀疏神经网络模型的该至少一个目标网络层中的 各个目标网络层的稀疏率以及各个目标网络层的权重块的尺寸。相应地,调优也不会改变稀疏神经网络模型的稀疏率。调优也不会改变稀疏神经网络模型中的各个目标网络层中的有效权重块的位置。
换言之,稀疏神经网络模型中的各个目标网络层的稀疏率即为目标神经网络模型的各个目标网络层的稀疏率。稀疏神经网络模型的各个目标网络层的权重块的尺寸即为目标神经网络模型的各个目标网络层的权重块的尺寸。稀疏神经网络模型的稀疏率即为目标神经网络模型的稀疏率。稀疏神经网络模型中的各个目标网络层中的有效权重块在各个目标网络层的权重矩阵中的位置与目标神经网络模型中的各个目标网络层中的目标权重块在各个目标网络层的权重矩阵中的位置是相同的。
为了便于描述,本申请实施例将稀疏神经网络模型的该至少一个目标网络层的稀疏率和目标神经网络模型的该至少一个网络层的稀疏率统称为该至少一个网络层的稀疏率,将稀疏神经网络模型的该至少一个网络层的权重矩阵中的权重块的尺寸和目标神经网络模型的该至少一个网络层的权重矩阵中的权重块的尺寸统称为该至少一个网络层的权重矩阵中的权重块的尺寸。例如,该至少一个目标网络层包括第i层,“稀疏神经网络模型的第i层的稀疏率”和“目标神经网络模型的第i层的稀疏率”统称为“第i层的稀疏率”;“稀疏神经网络模型的第i层的权重矩阵中的权重块的尺寸”和“目标神经网络模型的第i层的权重矩阵中的权重块的尺寸”统称为“第i层的权重矩阵中的权重块的尺寸”,或者“第i层的权重块的尺寸”。
如前所述,稀疏神经网络模型的第i层的权重矩阵包括至少一个有效权重块。目标神经网络模型的第i层的权重矩阵包括至少一个目标权重块。该至少一个有效权重块中位于稀疏神经网络模型的第i层的权重矩阵中的矩阵运算累加方向上的有效权重块的数量相同,也可以理解为,该至少一个目标权重块中位于目标神经网络模型的第i层的权重矩阵中的矩阵运算累加方向上的目标权重块的数量相同。
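The at least one target network layer may include a fully connected layer or a convolutional layer. The i-th layer may be a convolutional layer or a fully connected layer.
Optionally, the i-th layer is a convolutional layer. The weight matrix of the i-th layer may be obtained by converting the convolution kernels of the i-th layer into a two-dimensional matrix.
To increase the speed of the convolution operation, the convolution operation can be converted into a matrix multiplication through im2col. Correspondingly, the convolution kernels of the i-th layer are unfolded into a two-dimensional matrix suitable for matrix multiplication, and this two-dimensional matrix is the weight matrix of the convolutional layer in the embodiments of this application. The unfolded two-dimensional matrix can be sparsified in the manner described above, which is not repeated here.
This increases the speed of the convolution operation.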
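Unfolding the convolution kernels into a two-dimensional weight matrix can be sketched as below. The layout is an assumption: kernels are given as nested lists indexed [output channel][input channel][row][column], and the resulting matrix has K = C*kh*kw rows (the accumulation direction) and one column per output channel. The function name `kernels_to_matrix` is hypothetical.

```python
def kernels_to_matrix(kernels):
    """Flatten kernels[n][c][i][j] into a K x N matrix with K = C*kh*kw and N output channels."""
    # Each kernel becomes one column, flattened in (channel, row, column) order.
    columns = [[v for channel in kernel for kernel_row in channel for v in kernel_row]
               for kernel in kernels]
    return [list(row) for row in zip(*columns)]


kernels = [
    [[[1, 2], [3, 4]]],   # output channel 0: one input channel, 2x2 kernel
    [[[5, 6], [7, 8]]],   # output channel 1
]
print(kernels_to_matrix(kernels))  # [[1, 5], [2, 6], [3, 7], [4, 8]]
```

The row order chosen here must match the column order used when unfolding the input with im2col, otherwise the matrix product no longer equals the convolution.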
下面结合图9对方法600的一种实现方式进行举例说明。
具体地,基于稀疏原则对整网的稀疏配置进行搜索,得到稀疏神经网络模型。该稀疏神经网络模型可以为在候选稀疏神经网络模型满足目标稀疏率的情况下精度最高的候选稀疏神经网络模型。对稀疏神经网络模型进行调优,得到目标神经网络模型。
具体的搜索方法可以采用现有技术中的方案,例如,采用AMC的方式搜索得到稀疏神经网络模型。
稀疏原则包括:
(1)各个目标网络层的权重块的尺寸为加速器的最小计算单元的尺寸的整数倍,各个目标网络层的权重矩阵的尺寸分别为各个目标网络层的权重块的尺寸的整数倍。以各个目标网络层的权重块的尺寸作为对初始神经网络模型中的各个目标网络层的权重矩阵进行稀疏处理的粒度。
(2)各个目标网络层的权重矩阵中位于矩阵运算累加方向上的有效权重块的数量相同。
如图9所示,稀疏工具获取加速器的最小计算单元的尺寸、目标稀疏率以及原始的深度神经网络(original DNN)。例如,加速器的最小计算单元的尺寸可以为16*16。也就是说,加速器的最小计算单元能够处理的数据格式为排布成16*16的矩阵形式的256个元素。目标稀疏率可以为50%。original DNN即为本申请实施例中的初始神经网络模型的一例。稀疏工具可以执行步骤S620对original DNN进行稀疏处理,例如,基于稀疏原则对整网的稀疏配置进行搜索,得到稀疏神经网络模型,并对稀疏神经网络模型进行调优,以得到目标神经网络模型。
整网的稀疏配置可以包括稀疏神经网络模型中的至少一个目标网络层中的各个目标网络层的稀疏率、各个目标网络层的权重块的尺寸以及各个目标网络层中的有效权重块的位置。
如前所述,目标神经网络模型中的各个目标网络层的稀疏率可以相同,也可以不同。目标神经网络模型中的各个目标网络层的权重块的尺寸可以相同,也可以不同。
例如,图9中的目标神经网络模型中的4个目标网络层的稀疏率分别为50%、66%、25%和33%。该4个目标网络层的权重块的尺寸分别为16*16、32*32、16*32和32*32。
可选地,方法600还包括:对目标AI模型的权重矩阵进行压缩,得到压缩权重,以压缩权重的形式存储目标AI模型的权重矩阵。
在AI模型为神经网络模型的情况下,该步骤可以理解为:对目标神经网络模型中的该至少一个目标网络层的权重矩阵进行压缩,得到至少一个目标网络层的压缩权重,以至少一个目标网络层的压缩权重的形式存储该至少一个目标网络层的权重矩阵。
以该至少一个目标网络层的压缩权重的形式存储目标神经网络模型中的至少一个目标网络层的权重矩阵,可以理解为仅存储目标神经网络模型中的至少一个目标网络层的权重矩阵中的目标权重块,无需保存各个目标网络层的权重矩阵中的无效权重块。压缩权重可以表示为稠密矩阵。
例如,该至少一个目标网络层包括第i层,对目标神经网络模型的第i层的权重矩阵进行压缩,得到第i层的压缩权重。目标神经网络模型的第i层的权重矩阵包括至少一个目标权重块。第i层的压缩权重由该至少一个目标权重块构成。以第i层的压缩权重的形式存储目标神经网络模型的第i层的权重矩阵。
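The index of each target network layer indicates the positions of the target weight blocks of that layer in the layer's weight matrix.
For example, the index of each target network layer may be represented, and stored, as a two-dimensional matrix, which may also be called an index matrix.
The positions and values of the elements in the index matrix of a target network layer indicate where the target weight blocks are located. For example, the element values in the first column of the index matrix of a target network layer may indicate the row numbers of the target weight blocks in the first column of that layer's weight matrix.
For example, the first index matrix from the top in FIG. 9 includes 8 elements, which indicate the positions of 8 target weight blocks in the weight matrix. A 0 in the first column indicates that a target weight block is located in row 0 of the first column of the weight matrix, and a 2 in the first column indicates that a target weight block is located in row 2 of the first column of the weight matrix.
The weight matrix of a target network layer can be represented by the layer's compressed weights and the layer's index.
For example, as shown in FIG. 10, a K*N weight matrix can be represented by N’*K’*block compressed weights and an N’*K’ index matrix, where N’*K’*block denotes N’*K’ blocks, that is, the compressed weights include N’*K’ blocks. N’ is the number of columns of the index matrix, and K’ is the number of rows of the index matrix. The number of target weight blocks equals the number of elements in the index matrix.
It should be understood that the index may also be represented in other forms, which is not limited in the embodiments of this application.
Storing the weight matrix in the form of compressed weights reduces the storage space required. In addition, during operation the compressed weights can be loaded directly into the computing unit for the matrix operation, which helps increase processing speed.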
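The compressed-weight-plus-index representation can be sketched as follows. Two assumptions are made for simplicity: a block counts as a target weight block if it contains any non-zero value, and the index is held as one list of block-row numbers per block-column (the transpose of the K’-row, N’-column index matrix described above). The names `compress` and `decompress` are hypothetical.

```python
def compress(weights, bk, bn):
    """Split a block-sparse K x N matrix into (blocks, index), one list per block-column."""
    K, N = len(weights), len(weights[0])
    blocks, index = [], []
    for c in range(N // bn):
        col_blocks, col_rows = [], []
        for r in range(K // bk):
            block = [row[c * bn:(c + 1) * bn] for row in weights[r * bk:(r + 1) * bk]]
            if any(v != 0 for block_row in block for v in block_row):
                col_blocks.append(block)
                col_rows.append(r)      # block-row number of this target weight block
        blocks.append(col_blocks)
        index.append(col_rows)
    if len({len(rows) for rows in index}) != 1:
        raise ValueError("every block-column must keep the same number of blocks")
    return blocks, index


def decompress(blocks, index, K, N, bk, bn):
    """Rebuild the full K x N matrix; invalid weight blocks come back as zeros."""
    weights = [[0] * N for _ in range(K)]
    for c, rows in enumerate(index):
        for block, r in zip(blocks[c], rows):
            for i, block_row in enumerate(block):
                weights[r * bk + i][c * bn:(c + 1) * bn] = block_row
    return weights


W = [[1, 0],
     [2, 0],
     [0, 3],
     [0, 4]]
blocks, index = compress(W, bk=2, bn=1)
print(index)  # [[0], [1]]
```

Only the retained blocks and one small integer per block are stored, so the storage cost scales with the number of target weight blocks rather than with K*N.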
需要说明的是,还可以以其他形式存储目标AI模型中的该权重矩阵。例如,存储该权重矩阵中的全部权重值。本申请实施例对此不做限定。
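The embodiments of this application further provide an operation method for an AI model, which is described below with reference to FIG. 11. The method shown in FIG. 11 may be performed by an execution apparatus for the AI model. The apparatus may be a cloud service device or a terminal device, for example a computer, a server, or another apparatus with enough computing power to run the AI model, or it may be a system composed of a cloud service device and a terminal device. For example, method 900 may be performed by the execution device 210 in FIG. 2, the neural network processor 30 in FIG. 3, or the execution device 410 or a local device in FIG. 4. For example, the accelerator may invoke the sparse operator in FIG. 5 to perform method 900 and carry out the sparse computation.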
图11所示的运算方法中的目标AI模型可以是通过图6所示的方法构建的,目标AI模型的具体描述可以参考前述方法600,为了避免不必要的重复,下面在介绍方法900时适当省略重复的描述。
目标AI模型是通过对初始AI模型进行稀疏处理得到的。目标AI模型的权重矩阵包括多个权重块,多个权重块包括至少一个无效权重块和至少一个目标权重块,无效权重块为不参与运算的权重块,目标权重块为参与运算的权重块,权重块的尺寸为加速器的最小计算单元的尺寸的整数倍。加速器用于执行目标AI模型的运算。位于目标AI模型的权重矩阵中的矩阵累加方向上的目标权重块的数量相同。
示例性地,目标AI模型可以为目标神经网络模型,初始AI模型可以为初始神经网络模型。目标神经网络模型是通过对初始神经网络模型中的至少一个目标网络层进行稀疏处理得到的。该至少一个目标网络层包括第i层。目标神经网络模型的第i层的权重矩阵包括多个权重块,多个权重块包括至少一个无效权重块和至少一个目标权重块,无效权重块为不参与运算的权重块,目标权重块为参与运算的权重块,权重块的尺寸为加速器的最小计算单元的尺寸的整数倍,第i层的权重矩阵的尺寸为权重块的尺寸的整数倍。加速器用于执行目标神经网络模型的运算,i为正整数。
在加速器执行目标AI模型的运算过程中执行步骤S910至步骤S920。
S910,根据目标AI模型的权重矩阵的权重索引从输入数据矩阵中获取至少一个目标数据块,至少一个目标数据块对应于至少一个目标权重块,权重索引用于指示至少一个目 标权重块在目标AI模型的权重矩阵中的位置。
S920,基于至少一个目标数据块和至少一个目标权重块执行矩阵运算,以得到矩阵运算的结果。
该至少一个目标数据块指的是矩阵运算中需要计算的输入数据,即权重矩阵中的目标权重块对应的输入数据。
目标AI模型的输入数据的类型可以为图像数据、语音数据或文本数据等。
输入数据的类型与目标AI模型的任务有关。例如,目标AI模型用于图像处理任务,则该输入数据的类型可以为图像数据。具体地,目标图像处理任务包括图像分类、图像检测、图像分割、图像识别或图像生成等。再如,目标AI模型用于文本处理任务,则该输入数据的类型可以为文本数据。具体地,文本处理任务包括文本识别或文本翻译等。再如,目标AI模型用于语音处理任务,则该输入数据的类型可以为语音数据。具体地,语音处理任务包括语音识别等。本申请实施例对输入数据的类型不做限定。
根据本申请实施例的方案,在运算过程中,加速器可以根据最小计算单元能够支持的数据格式一次性获取运算所需的数据,无需通过裁剪或拼凑等方式将运算所需的数据转换为最小计算单元能够支持的数据格式,降低了运算的时延。而且,索引用于指示目标权重块的位置,降低了索引的复杂性,在运算过程中能够减少输入数据的寻址时间,无需特定的硬件支持即可实现有效的硬件加速。
下面以AI模型为神经网络模型为例对方法900进行说明。该权重矩阵可以为目标神经网络模型中的第i层的权重矩阵。
在加速器执行目标神经网络模型的第i层的运算的过程中执行步骤S910至步骤S920。
S910,根据第i层的权重索引从第i层的输入数据矩阵中获取至少一个目标数据块,至少一个目标数据块对应于该至少一个目标权重块,第i层的权重索引用于指示该至少一个目标权重块在目标神经网络模型的第i层的权重矩阵中的位置。
S920,基于该至少一个目标数据块和该至少一个目标权重块执行矩阵运算,以得到第i层的矩阵运算结果。
示例性地,第i层可以为全连接层或卷积层。
可选地,目标AI模型的权重矩阵以压缩权重的形式存储,压缩权重由该至少一个目标权重块构成。
以AI模型为神经网络模型为例,目标神经网络模型的第i层的权重矩阵以压缩权重的形式存储,第i层的压缩权重由该至少一个目标权重块构成。
下面结合图12对步骤S910进行说明。
示例性地,图12中的权重矩阵的尺寸为K*N,输入数据矩阵的尺寸为M*K。如图12所示的特征图(feature map)即为输入数据矩阵的一例。对输入数据矩阵和权重矩阵执行矩阵乘法运算,得到的矩阵运算结果的尺寸为M*N。
压缩权重的尺寸为N’*K’*block,block表示权重矩阵中的权重块。该block的尺寸可以表示为(α*BU_K)×(β*BU_N)。索引可以表示为N’*K’的索引矩阵。权重矩阵可以通过N’*K’*block的压缩权重和N’*K’的索引矩阵表示。
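As shown in FIG. 12, the compressed weights can be moved directly into the computing units of the accelerator. The target data blocks in the feature map are loaded into the computing units according to the index.
For example, a 0 in the first column of the index matrix in FIG. 12 indicates that a target weight block is located in row 0 of the first column of the weight matrix, and a 2 in the first column indicates that a target weight block is located in row 2 of the first column of the weight matrix. The target data blocks corresponding to these two target weight blocks are the data blocks in column 0 and column 2 of the feature map, respectively. The size of a data block is M*(α*BU_K). The data blocks in column 0 and column 2 of the feature map are loaded into the computing unit.
Further, in FIG. 12, the compressed weights and the target data blocks can be loaded into 4 computing units so that the computation runs in parallel.
Optionally, method 900 further includes step S911.
S911: Combine the target data blocks, among the at least one target data block, that correspond to a plurality of weight groups into a plurality of data matrices, respectively, where one of the weight groups includes the target weight blocks located along the accumulation direction of the matrix operation in the weight matrix of the target AI model.
Taking a neural network model as the AI model, step S911 can be understood as: combining the target data blocks, among the at least one target data block, that correspond to a plurality of weight groups into a plurality of data matrices, respectively, where one of the weight groups includes the target weight blocks located along the accumulation direction of the matrix operation in the weight matrix of the i-th layer of the target neural network model.
In other words, a weight group is the matrix formed by the target weight blocks located along the accumulation direction of the matrix operation in the weight matrix of the i-th layer of the target neural network model.
The target data blocks corresponding to one weight group can be combined into one dense matrix.
As shown in FIG. 12, the compressed weights include 4 columns of weight blocks. In FIG. 12, the column direction is the accumulation direction of the matrix operation, so the compressed weights include 4 weight groups. The target data blocks corresponding to the 4 weight groups form 4 data matrices, respectively. For example, the first column of the index matrix in FIG. 12 indicates the position of the first weight group; the target data blocks corresponding to the first weight group include the data blocks in column 0 and column 2 of the feature map, and these data blocks are combined into one data matrix.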
在方法900包括步骤S911的情况下,步骤S920可以通过以下步骤执行。
(1)对多个数据矩阵和多个权重组分别执行矩阵运算,得到多个矩阵运算的结果。
例如,如图12所示,对该4个数据矩阵和该4个权重组分别执行矩阵运算,得到4个结果(result)矩阵。
这样,通过多个数据矩阵和多个权重组之间的稠密计算即可得到多个矩阵运算的结果。
可选地,对该多个数据矩阵和多个权重组并行执行矩阵运算,得到多个矩阵运算的结果。
该4组运算过程可以由4个计算单元并行执行。
示例性地,该多个数据矩阵可以构成具有并行维度的特征图。或者说,该多个数据矩阵具有并行维度,以便并行执行矩阵运算。并行维度可以理解为并行计算的维度。相应地,该多个矩阵运算的结果也具有并行维度。
这样,通过并行执行矩阵运算,能够提高加速器的处理效率。
(2)对多个矩阵运算的结果进行数据重排,以得到矩阵运算的结果。
示例性地,在将该多个矩阵运算的结果输出计算单元的过程中,对该多个矩阵运算的结果进行数据重排,得到矩阵运算结果。
例如,可以根据该多个权重组在权重矩阵中的位置对该多个矩阵运算的结果进行数据重排,得到矩阵运算的结果。
例如,按照如图12所示的N’对应的箭头所指示的方向,该4个权重组依次为权重组1、权重组2、权重组3和权重组4,分别位于第i层的权重矩阵的第一列、第二列、第三列和第四列。按照如图12所示的N’对应的箭头所指示的方向,该4个权重组对应的多个矩阵运算的结果分别为结果1、结果2、结果3和结果4,分别位于该层的矩阵运算结果中的第一列、第二列、第三列和第四列。
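Steps S910, S911 and S920 can be put together in a small sketch for the fully connected case. It assumes the layout used for illustration here: `blocks[c]` holds the retained bk×bn blocks of block-column c in order, and `index[c]` holds their block-row numbers. For each weight group the matching column blocks of the input are gathered into a dense data matrix, multiplied with the stacked weight group, and the partial result is written to the output columns of that group. The function names are hypothetical, and the loop over groups is sequential here although the groups are independent and could run on separate computing units.

```python
def matmul(a, b):
    """Dense matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def block_sparse_matmul(x, blocks, index, bk, bn):
    """Compute x @ W where W is given as compressed blocks plus a per-column index."""
    M = len(x)
    out = [[0] * (len(index) * bn) for _ in range(M)]
    for c, rows in enumerate(index):
        # S910/S911: gather the target data blocks named by the index into one dense matrix.
        data = [[v for r in rows for v in x_row[r * bk:(r + 1) * bk]] for x_row in x]
        # The weight group: the retained blocks of this column stacked along K.
        group = [block_row for block in blocks[c] for block_row in block]
        # S920: dense matrix operation for this group.
        partial = matmul(data, group)
        # Data rearrangement: place the partial result at the group's column position.
        for m in range(M):
            out[m][c * bn:(c + 1) * bn] = partial[m]
    return out


# W = [[1, 0], [2, 0], [0, 3], [0, 4]] stored as compressed blocks (bk=2, bn=1).
blocks = [[[[1], [2]]], [[[3], [4]]]]
index = [[0], [1]]
x = [[1, 2, 3, 4]]
print(block_sparse_matmul(x, blocks, index, bk=2, bn=1))  # [[5, 25]]
```

Because every group has the same number of blocks, all the gathered data matrices have the same shape, which is what makes a single compiled dense kernel reusable across groups.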
可选地,在目标AI模型的权重矩阵所在的网络层为卷积层的情况下,该权重矩阵可以是通过将该网络层的卷积核转换为二维矩阵得到的。
也就是说,第i层为卷积层,目标神经网络模型的第i层的权重矩阵可以是通过对第i层的卷积核转换为二维矩阵得到的。
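For example, step S910 may include: converting the input data matrix through im2col to obtain a converted input data matrix, and obtaining at least one target data block from the converted input data matrix according to the weight index, where the at least one target data block corresponds to the at least one target weight block.
Further, the target data blocks, among the at least one target data block, that correspond to multiple weight groups are combined into multiple data matrices, respectively, where one weight group of the multiple weight groups includes the target weight blocks that lie along the accumulation direction of the matrix operation in the weight matrix of the i-th layer of the target neural network model.
For example, as shown in part (a) of FIG. 13, when the i-th layer is a convolutional layer, the input three-dimensional (3D) feature map can be unfolded into a two-dimensional (2D) feature map through im2col, and at least one target data block is then obtained from the 2D feature map according to the weight index of the i-th layer to form multiple data matrices.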
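The im2col unfolding referred to above can be sketched as follows, assuming stride 1, no padding, and a `feature[channel][row][col]` layout; these choices and the function name are illustrative assumptions rather than details given in the application.

```python
def im2col(feature, kh, kw):
    """Unfold feature[c][h][w] into an M x K matrix (stride 1, no padding).

    Each output row holds the c * kh * kw input values under one kernel
    position, ordered by channel, then kernel row, then kernel column.
    """
    c, h, w = len(feature), len(feature[0]), len(feature[0][0])
    rows = []
    for y in range(h - kh + 1):
        for x in range(w - kw + 1):
            rows.append([feature[ch][y + i][x + j]
                         for ch in range(c)
                         for i in range(kh)
                         for j in range(kw)])
    return rows


feature = [[[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]]
print(im2col(feature, 2, 2))
# [[1, 2, 4, 5], [2, 3, 5, 6], [4, 5, 7, 8], [5, 6, 8, 9]]
```

The target data blocks are then column slices of this 2D matrix, selected by the weight index exactly as in the fully connected case.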
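Optionally, when the network layer containing the weight matrix of the target AI model is a convolutional layer, step S910 includes: determining target positions in the input data matrix according to the weight index; and converting the data at the target positions of the input data matrix into the at least one target data block.
Taking a neural network model as an example of the AI model, when the i-th layer is a convolutional layer, step S910 includes: determining target positions in the input data matrix of the i-th layer according to the weight index of the i-th layer; and converting the data at the target positions of the input data matrix of the i-th layer into the at least one target data block.
Specifically, the position of the at least one target data block of the i-th layer in the unfolded input data matrix of the i-th layer can be determined according to the weight index of the i-th layer. The at least one target data block corresponds to the at least one target weight block.
It should be noted that, during execution, the input data matrix of the i-th layer is not actually unfolded into a two-dimensional matrix through im2col. The size of the input data matrix is usually fixed, so the size of the unfolded two-dimensional matrix is also fixed. Therefore, the position of the at least one target data block in the unfolded input data matrix can be determined without performing the im2col unfolding operation.
From the position of the at least one target data block in the unfolded input data matrix, the position of its data in the input data matrix of the i-th layer, that is, the target position in the input data matrix of the i-th layer, can be derived in reverse. Unfolding the data at the target position in the im2col manner yields the at least one target data block.
Further, the target data blocks, among the at least one target data block, that correspond to multiple weight groups are combined through im2col into multiple data matrices, respectively, where one weight group of the multiple weight groups includes the target weight blocks that lie along the accumulation direction of the matrix operation in the weight matrix of the i-th layer of the target neural network model.
That is, the input data matrix of the i-th layer is not unfolded. Instead, the target positions are determined directly in the input data matrix of the i-th layer according to the weight index of the i-th layer, and the data at the target positions is then assembled into multiple data matrices in the im2col manner.
For example, as shown in part (b) of FIG. 13, when the i-th layer is a convolutional layer, the data at the target positions in the input data matrix of the i-th layer can be obtained according to the weight index of the i-th layer and unfolded into multiple data matrices through im2col.
FIG. 14 is a schematic diagram of the operation process when the i-th layer is a convolutional layer. FIG. 14 differs from FIG. 12 mainly in the dimensionality of the input feature map and of the result. For details, refer to the description of FIG. 12, which is not repeated here.
According to the solution of this application, when the i-th layer is a convolutional layer, the data at the target positions is obtained from the input data matrix of the i-th layer according to the weight index of the i-th layer and is unfolded into multiple data matrices through im2col. There is therefore no need to first convert the whole input data matrix of the i-th layer through im2col and then obtain the at least one target data block from the converted matrix, and no need to store the converted input data matrix. This reduces the storage requirement and improves processing efficiency.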
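The index-driven variant, which never materializes the full unfolded matrix, can be sketched as follows. It assumes stride 1, no padding, and the column order channel, kernel row, kernel column for the virtual unfolded matrix; `gather_from_feature` is a hypothetical name.

```python
def gather_from_feature(feature, col_index, bk, kh, kw):
    """Build one data matrix straight from feature[c][h][w].

    col_index lists the block numbers one weight group needs; each block
    covers bk consecutive columns of the (virtual) im2col matrix.
    """
    h, w = len(feature[0]), len(feature[0][0])
    # Map every needed unfolded column back to (channel, kernel row, kernel col).
    taps = []
    for b in col_index:
        for col in range(b * bk, (b + 1) * bk):
            ch, rem = divmod(col, kh * kw)
            i, j = divmod(rem, kw)
            taps.append((ch, i, j))
    # Read only the target positions for every kernel placement.
    return [[feature[ch][y + i][x + j] for ch, i, j in taps]
            for y in range(h - kh + 1)
            for x in range(w - kw + 1)]


feature = [[[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]]
print(gather_from_feature(feature, [0, 3], bk=1, kh=2, kw=2))
# [[1, 5], [2, 6], [4, 8], [5, 9]]
```

The result matches columns 0 and 3 of a full 2×2 im2col of the same feature map, but only the needed values are read and no intermediate M*K matrix is stored.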
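Optionally, the multiple corresponding to the integer multiple, and hence the weight block size, is obtained by searching within a search space.
Optionally, the sparsity rate of the weight matrix of the target AI model is obtained by searching within a search space, and the sparsity rate of the target AI model satisfies the target sparsity rate.
It should be understood that method 900 describes the operation process using only one weight matrix as an example. The same method can be applied to the other weight matrices in the target AI model.
Table 1 shows some parameters of a ResNet50 provided in an embodiment of this application. The model in Table 1 is compressed using method 600 of the embodiments of this application to obtain a compressed model. The minimum computing unit of the accelerator is 16*16, and the target sparsity rate is 20%.
Table 1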
Figure PCTCN2022121335-appb-000010
Figure PCTCN2022121335-appb-000011
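Table 2 shows the weight block size and the sparsity rate of each convolutional layer in the compressed model, that is, the sparsity configuration of each convolutional layer. Processing the data to be processed with the compressed model shown in Table 2 yields the processing result for that data. For example, the sparsity configuration of each convolutional layer may be found by searching with the AMC method.
Table 2
layer name    weight block    sparsity rate
conv1         16×64           0%
conv2_x       16×64           50%
conv3_x       16×64           50%
conv4_x       16×64           50%
conv5_x       16×64           50%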
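The relationship between the values in Table 2 and the 16*16 minimum computing unit can be checked with a short worked example. Which side of "16×64" lies along K and which along N is an assumption here, as is reading the sparsity rate as the fraction of invalid blocks in each weight group (rounded down); the depth K = 256 is a made-up value used only for illustration.

```python
def block_multiples(block, unit):
    """Return (alpha, beta) such that block = (alpha * BU_K, beta * BU_N)."""
    (bk, bn), (uk, un) = block, unit
    if bk % uk or bn % un:
        raise ValueError("block size is not an integer multiple of the unit")
    return bk // uk, bn // un


def kept_blocks_per_group(k, bk, sparsity_percent):
    """Number of target blocks K' kept in each weight group."""
    total = k // bk                       # blocks along the K direction
    return total * (100 - sparsity_percent) // 100


print(block_multiples((16, 64), (16, 16)))   # (1, 4)
print(kept_blocks_per_group(256, 16, 50))    # 8
```

Under these assumptions a 16×64 weight block corresponds to α = 1 and β = 4, and a 50% sparsity rate halves the number of blocks that each weight group has to load and multiply.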
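FIG. 15 compares the accuracy of the target neural network models in the embodiments of this application with that of the initial neural network model.
Specifically, FIG. 15 compares the TOP-1 accuracy (TOP-1 ACC) of four groups of target neural network models with that of the initial neural network model shown in Table 1. TOP-1 ACC is the probability that the top-ranked class in the output matches the actual result. The weight block size is 16×64. In the first group, conv1 and fc (fc is not shown in Table 1) are not sparsified, that is, conv1 and fc are skipped, and the sparsity rate of the target neural network model is 50%. In the second group, conv1, fc, and conv1×1 (conv1×1 is not shown in Table 1) are not sparsified, and the sparsity rate of the target neural network model is 50%. In the third group, conv1 and fc are not sparsified, the sparsity rate of operators with 3×3 convolution kernels is 50%, and the sparsity rate of operators with 1×1 convolution kernels is 30%. In the fourth group, conv1 and fc are not sparsified, and the sparsity rate of the target neural network model is 30%. Each group compares the accuracy of the target neural network model under the stated configuration with the accuracy of the initial neural network model.
As shown in FIG. 15, compared with the initial neural network model, the accuracy difference of the target neural network models lies in the range of -0.794% to +0.102%. The fourth group shows that, at a sparsity rate of 30%, the accuracy of the target neural network model can exceed that of the initial neural network model. At a sparsity rate of 50%, adjusting the sparsity configuration reduces the accuracy regression, that is, it limits how far the model accuracy drops. The first group shows that the accuracy regression in the worst case is less than 1%.
FIG. 16 compares the computation time of single conv2D operators in the initial neural network model and in the target neural network model. Specifically, FIG. 16 shows the processing time of four single conv2D operators (conv2_x, conv3_x, conv4_x, conv5_x) on four batches of data (batch_1, batch_2, batch_3, and batch_4). As shown in FIG. 16, the computation time of a single conv2D operator in the target neural network model is clearly shorter than that of the same operator in the initial neural network model.
FIG. 17 is a schematic diagram of the speedup of the target neural network model corresponding to FIG. 16. The speedup of a conv2D operator in the target neural network model is obtained by dividing the computation time of that operator in the initial neural network model by its computation time in the target neural network model. As shown in FIG. 16 and FIG. 17, the average speedup of a conv2D operator is 30%.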
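The apparatuses of the embodiments of this application are described below with reference to FIG. 18 to FIG. 21. It should be understood that the apparatuses described below can perform the methods of the foregoing embodiments of this application. To avoid unnecessary repetition, repeated descriptions are omitted where appropriate.
FIG. 18 is a schematic block diagram of an AI model processing apparatus according to an embodiment of this application. The AI model processing apparatus 3000 shown in FIG. 18 includes an obtaining unit 3010 and a processing unit 3020.
The obtaining unit 3010 and the processing unit 3020 can be used to perform the AI model processing method of the embodiments of this application, specifically, method 600.
The obtaining unit 3010 is configured to obtain an initial AI model and the size of the minimum computing unit of an accelerator, where the accelerator is configured to perform the operations of the target AI model.
The processing unit 3020 is configured to perform sparsification on the initial AI model to obtain the target AI model. The weight matrix of the target AI model includes multiple weight blocks, and the multiple weight blocks include at least one invalid weight block and at least one target weight block. An invalid weight block is a weight block that does not participate in the operation, and a target weight block is a weight block that participates in the operation. The size of a weight block is an integer multiple of the size of the minimum computing unit of the accelerator, and the numbers of target weight blocks located along the accumulation direction of the matrix operation in the weight matrix of the target AI model are the same.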
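Optionally, in an embodiment, the apparatus further includes a storage unit configured to store the weight matrix of the target AI model in the form of compressed weights, where the compressed weights consist of the at least one target weight block.
Optionally, in an embodiment, the multiple corresponding to the integer multiple is obtained by searching within a search space.
Optionally, in an embodiment, the sparsity rate of the weight matrix of the target AI model is obtained by searching within a search space, and the sparsity rate of the target AI model satisfies the target sparsity rate.
FIG. 19 is a schematic block diagram of an accelerator 4000 provided in an embodiment of this application. The accelerator 4000 shown in FIG. 19 includes an obtaining unit 4010 and a processing unit 4020.
The obtaining unit 4010 and the processing unit 4020 can be used to perform the AI model operation method of the embodiments of this application, for example, method 900.
The obtaining unit 4010 is configured to obtain at least one target data block from the input data matrix according to the weight index of the weight matrix of the target AI model. The at least one target data block corresponds to the at least one target weight block, and the weight index indicates the position of the at least one target weight block in the weight matrix of the target AI model.
The processing unit 4020 is configured to perform a matrix operation based on the at least one target data block and the at least one target weight block to obtain the result of the matrix operation.
Optionally, in an embodiment, the weight matrix of the target AI model is stored in the form of compressed weights, and the compressed weights consist of the at least one target weight block.
Optionally, in an embodiment, the processing unit 4020 is further configured to combine the target data blocks, among the at least one target data block, that correspond to multiple weight groups into multiple data matrices, respectively, where one weight group of the multiple weight groups includes the target weight blocks that lie along the accumulation direction of the matrix operation in the weight matrix of the target AI model. The processing unit 4020 is specifically configured to: perform matrix operations on the multiple data matrices and the multiple weight groups, respectively, to obtain the results of multiple matrix operations; and perform data rearrangement on the results of the multiple matrix operations to obtain the result of the matrix operation.
Optionally, in an embodiment, when the network layer corresponding to the weight matrix of the target AI model is a convolutional layer, the weight matrix of the target AI model is obtained by converting the convolution kernels of the network layer into a two-dimensional matrix.
Optionally, in an embodiment, the processing unit 4020 is specifically configured to: determine target positions in the input data matrix according to the weight index of the weight matrix of the target AI model; and convert the data at the target positions in the input data matrix into the at least one target data block.
Optionally, in an embodiment, the multiple corresponding to the integer multiple is obtained by searching within a search space.
Optionally, in an embodiment, the sparsity rate of the weight matrix of the target AI model is obtained by searching within a search space, and the sparsity rate of the target AI model satisfies the target sparsity rate.
It should be noted that the apparatus 3000 and the accelerator 4000 are embodied in the form of functional units. The term "unit" here may be implemented in software and/or hardware, which is not specifically limited.
For example, a "unit" may be a software program, a hardware circuit, or a combination of the two that implements the foregoing functions. The hardware circuit may include an application specific integrated circuit (ASIC), an electronic circuit, a processor (for example, a shared processor, a dedicated processor, or a group processor) and a memory for executing one or more software or firmware programs, a merged logic circuit, and/or other suitable components that support the described functions.
Therefore, the units of the examples described in the embodiments of this application can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such an implementation should not be considered beyond the scope of this application.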
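FIG. 20 is a schematic diagram of the hardware structure of an AI model processing apparatus provided in an embodiment of this application. The AI model processing apparatus 5000 shown in FIG. 20 (the apparatus 5000 may specifically be a computer device) includes a memory 5001, a processor 5002, a communication interface 5003, and a bus 5004. The memory 5001, the processor 5002, and the communication interface 5003 are communicatively connected to one another through the bus 5004.
The memory 5001 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 5001 can store a program. When the program stored in the memory 5001 is executed by the processor 5002, the processor 5002 is configured to perform the steps of the AI model processing method of the embodiments of this application. Specifically, the processor 5002 can perform method 600 shown in FIG. 6 above.
The processor 5002 may be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the AI model processing method of the method embodiments of this application.
The processor 5002 may also be an integrated circuit chip with signal processing capability, for example, the chip shown in FIG. 3. During implementation, the steps of the AI model processing method of this application can be completed by integrated logic circuits of hardware in the processor 5002 or by instructions in software form.
The processor 5002 may also be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in the embodiments of this application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 5001. The processor 5002 reads the information in the memory 5001 and, in combination with its hardware, completes the functions to be performed by the units included in the processing apparatus shown in FIG. 18, or performs the AI model processing method shown in FIG. 6 of the method embodiments of this application.
The communication interface 5003 uses a transceiver apparatus such as, but not limited to, a transceiver to implement communication between the apparatus 5000 and other devices or communication networks. For example, the initial neural network model and the size of the minimum computing unit of the accelerator can be obtained through the communication interface 5003.
The bus 5004 may include a path for transferring information between the components of the apparatus 5000 (for example, the memory 5001, the processor 5002, and the communication interface 5003).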
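FIG. 21 is a schematic diagram of the hardware structure of an AI model operation apparatus according to an embodiment of this application. The operation apparatus 6000 shown in FIG. 21 includes a memory 6001, a processor 6002, a communication interface 6003, and a bus 6004. The memory 6001, the processor 6002, and the communication interface 6003 are communicatively connected to one another through the bus 6004.
The memory 6001 may be a ROM, a static storage device, or a RAM. The memory 6001 can store a program. When the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 and the communication interface 6003 are configured to perform the steps of the AI model operation method of the embodiments of this application. Specifically, the processor 6002 can perform method 900 shown in FIG. 11 above.
The processor 6002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to implement the functions to be performed by the units in the accelerator of the embodiments of this application, or to perform the AI model operation method of the method embodiments of this application.
The processor 6002 may also be an integrated circuit chip with signal processing capability, for example, the chip shown in FIG. 3. During implementation, the steps of the AI model operation method of the embodiments of this application can be completed by integrated logic circuits of hardware in the processor 6002 or by instructions in software form.
The processor 6002 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in the embodiments of this application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 6001. The processor 6002 reads the information in the memory 6001 and, in combination with its hardware, completes the functions to be performed by the units included in the accelerator of the embodiments of this application, or performs the AI model operation method of the method embodiments of this application.
The communication interface 6003 uses a transceiver apparatus such as, but not limited to, a transceiver to implement communication between the apparatus 6000 and other devices or communication networks. For example, input data can be obtained through the communication interface 6003.
The bus 6004 may include a path for transferring information between the components of the apparatus 6000 (for example, the memory 6001, the processor 6002, and the communication interface 6003).
It should be noted that although only the memory, the processor, and the communication interface are shown for the apparatus 5000 and the apparatus 6000, a person skilled in the art should understand that, in a specific implementation, the apparatus 5000 and the apparatus 6000 may further include other components necessary for normal operation. In addition, depending on specific needs, the apparatus 5000 and the apparatus 6000 may further include hardware components that implement other additional functions. Furthermore, the apparatus 5000 and the apparatus 6000 may include only the components necessary to implement the embodiments of this application, and need not include all the components shown in FIG. 20 and FIG. 21.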
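An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores program code to be executed by a device, and the program code includes instructions for performing the method in any implementation of the embodiments of this application.
An embodiment of this application further provides a computer program product containing instructions. When the computer program product runs on a computer, the computer is caused to perform the method in any implementation of the embodiments of this application.
An embodiment of this application further provides a chip. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method in any implementation of the embodiments of this application.
Optionally, in an implementation, the chip may further include a memory that stores instructions. The processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to perform the method in any implementation of the embodiments of this application.
The chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
It should be understood that the processor in the embodiments of this application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It should also be understood that the memory in the embodiments of this application may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically EPROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the procedures or functions according to the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means or by wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center that contains one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid state drive.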
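It should be understood that the term "and/or" in this document merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may represent three cases: A alone, both A and B, and B alone, where A and B may each be singular or plural. In addition, the character "/" in this document generally indicates an "or" relationship between the associated objects, but may also indicate an "and/or" relationship, as can be understood from the context.
In this application, "at least one" means one or more, and "multiple" means two or more. "At least one of the following items" or a similar expression refers to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be single or multiple.
It should be understood that, in the various embodiments of this application, the sequence numbers of the foregoing processes do not imply an order of execution. The order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such an implementation should not be considered beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely a description of specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement that a person skilled in the art can readily conceive of within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.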

Claims (27)

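  1. A method for processing an artificial intelligence (AI) model, comprising:
    obtaining an initial AI model and the size of a minimum computing unit of an accelerator, wherein the accelerator is configured to perform operations of a target AI model; and
    performing sparsification on the initial AI model to obtain the target AI model, wherein a weight matrix of the target AI model comprises multiple weight blocks, the multiple weight blocks comprise at least one invalid weight block and at least one target weight block, the invalid weight block is a weight block that does not participate in the operation, the target weight block is a weight block that participates in the operation, the size of the weight block is an integer multiple of the size of the minimum computing unit of the accelerator, and the numbers of target weight blocks located along the accumulation direction of the matrix operation in the weight matrix of the target AI model are the same.
  2. The method according to claim 1, wherein the method further comprises:
    storing the weight matrix of the target AI model in the form of compressed weights, wherein the compressed weights consist of the at least one target weight block.
  3. The method according to claim 1 or 2, wherein the multiple corresponding to the integer multiple is obtained by searching within a search space.
  4. The method according to any one of claims 1 to 3, wherein the sparsity rate of the weight matrix of the target AI model is obtained by searching within a search space, and the sparsity rate of the target AI model satisfies a target sparsity rate.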
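  5. A method for performing operations of an AI model, wherein a weight matrix of a target artificial intelligence (AI) model comprises multiple weight blocks, the multiple weight blocks comprise at least one invalid weight block and at least one target weight block, the invalid weight block is a weight block that does not participate in the operation, the target weight block is a weight block that participates in the operation, the size of the weight block is an integer multiple of the size of a minimum computing unit of an accelerator, the accelerator is configured to perform the operations of the target AI model, and the numbers of target weight blocks located along the accumulation direction of the matrix operation in the weight matrix of the target AI model are the same,
    the following steps being performed while the accelerator performs the operations of the target AI model:
    obtaining at least one target data block from an input data matrix according to a weight index of the weight matrix of the target AI model, wherein the at least one target data block corresponds to the at least one target weight block, and the weight index indicates the position of the at least one target weight block in the weight matrix of the target AI model; and
    performing a matrix operation based on the at least one target data block and the at least one target weight block to obtain a result of the matrix operation.
  6. The method according to claim 5, wherein the weight matrix of the target AI model is stored in the form of compressed weights, and the compressed weights consist of the at least one target weight block.
  7. The method according to claim 5 or 6, wherein the method further comprises:
    combining the target data blocks, among the at least one target data block, that correspond to multiple weight groups into multiple data matrices, respectively, wherein one weight group of the multiple weight groups comprises the target weight blocks that lie along the accumulation direction of the matrix operation in the weight matrix of the target AI model; and
    the performing a matrix operation based on the at least one target data block and the at least one target weight block to obtain a result of the matrix operation comprises:
    performing matrix operations on the multiple data matrices and the multiple weight groups, respectively, to obtain results of multiple matrix operations; and
    performing data rearrangement on the results of the multiple matrix operations to obtain the result of the matrix operation.
  8. The method according to claim 7, wherein, when the network layer corresponding to the weight matrix of the target AI model is a convolutional layer, the weight matrix of the target AI model is obtained by converting the convolution kernels of the network layer into a two-dimensional matrix.
  9. The method according to claim 8, wherein the obtaining at least one target data block from an input data matrix according to a weight index of the weight matrix of the target AI model comprises:
    determining target positions in the input data matrix according to the weight index of the weight matrix of the target AI model; and
    converting the data at the target positions in the input data matrix into the at least one target data block.
  10. The method according to any one of claims 5 to 9, wherein the multiple corresponding to the integer multiple is obtained by searching within a search space.
  11. The method according to any one of claims 5 to 10, wherein the sparsity rate of the weight matrix of the target AI model is obtained by searching within a search space, and the sparsity rate of the target AI model satisfies a target sparsity rate.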
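  12. An apparatus for processing an AI model, comprising:
    an obtaining unit, configured to obtain an initial AI model and the size of a minimum computing unit of an accelerator, wherein the accelerator is configured to perform operations of a target AI model; and
    a processing unit, configured to perform sparsification on the initial AI model to obtain the target AI model, wherein a weight matrix of the target AI model comprises multiple weight blocks, the multiple weight blocks comprise at least one invalid weight block and at least one target weight block, the invalid weight block is a weight block that does not participate in the operation, the target weight block is a weight block that participates in the operation, the size of the weight block is an integer multiple of the size of the minimum computing unit of the accelerator, and the numbers of target weight blocks located along the accumulation direction of the matrix operation in the weight matrix of the target AI model are the same.
  13. The apparatus according to claim 12, wherein the apparatus further comprises a storage unit configured to:
    store the weight matrix of the target AI model in the form of compressed weights, wherein the compressed weights consist of the at least one target weight block.
  14. The apparatus according to claim 12 or 13, wherein the multiple corresponding to the integer multiple is obtained by searching within a search space.
  15. The apparatus according to any one of claims 12 to 14, wherein the sparsity rate of the weight matrix of the target AI model is obtained by searching within a search space, and the sparsity rate of the target AI model satisfies a target sparsity rate.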
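  16. An accelerator, wherein a weight matrix of a target AI model comprises multiple weight blocks, the multiple weight blocks comprise at least one invalid weight block and at least one target weight block, the invalid weight block is a weight block that does not participate in the operation, the target weight block is a weight block that participates in the operation, the size of the weight block is an integer multiple of the size of a minimum computing unit of the accelerator, the accelerator is configured to perform operations of the target AI model, the numbers of target weight blocks located along the accumulation direction of the matrix operation in the weight matrix of the target AI model are the same, and the accelerator comprises:
    an obtaining unit, configured to obtain at least one target data block from an input data matrix according to a weight index of the weight matrix of the target AI model, wherein the at least one target data block corresponds to the at least one target weight block, and the weight index indicates the position of the at least one target weight block in the weight matrix of the target AI model; and
    a processing unit, configured to perform a matrix operation based on the at least one target data block and the at least one target weight block to obtain a result of the matrix operation.
  17. The accelerator according to claim 16, wherein the weight matrix of the target AI model is stored in the form of compressed weights, and the compressed weights consist of the at least one target weight block.
  18. The accelerator according to claim 16 or 17, wherein the processing unit is further configured to:
    combine the target data blocks, among the at least one target data block, that correspond to multiple weight groups into multiple data matrices, respectively, wherein one weight group of the multiple weight groups comprises the target weight blocks that lie along the accumulation direction of the matrix operation in the weight matrix of the target AI model; and the processing unit is specifically configured to:
    perform matrix operations on the multiple data matrices and the multiple weight groups, respectively, to obtain results of multiple matrix operations; and
    perform data rearrangement on the results of the multiple matrix operations to obtain the result of the matrix operation.
  19. The accelerator according to claim 18, wherein, when the network layer corresponding to the weight matrix of the target AI model is a convolutional layer, the weight matrix of the target AI model is obtained by converting the convolution kernels of the network layer into a two-dimensional matrix.
  20. The accelerator according to claim 19, wherein the processing unit is specifically configured to:
    determine target positions in the input data matrix according to the weight index of the weight matrix of the target AI model; and
    convert the data at the target positions in the input data matrix into the at least one target data block.
  21. The accelerator according to any one of claims 16 to 20, wherein the multiple corresponding to the integer multiple is obtained by searching within a search space.
  22. The accelerator according to any one of claims 16 to 21, wherein the sparsity rate of the weight matrix of the target AI model is obtained by searching within a search space, and the sparsity rate of the target AI model satisfies a target sparsity rate.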
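  23. An apparatus for processing an AI model, comprising a processor and a memory, wherein the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1 to 4.
  24. An accelerator, comprising a processor and a memory, wherein the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 5 to 11.
  25. A computer-readable storage medium, wherein the computer-readable storage medium is configured to store program code to be executed by a device, and the program code comprises instructions for performing the method according to any one of claims 1 to 4 or claims 5 to 11.
  26. A computer program product containing instructions, wherein, when the computer program product runs on a computer, the computer is caused to perform the method according to any one of claims 1 to 4 or claims 5 to 11.
  27. A chip, wherein the chip comprises a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method according to any one of claims 1 to 4 or claims 5 to 11.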
PCT/CN2022/121335 2021-10-28 2022-09-26 Ai模型的处理方法、运算方法及装置 WO2023071658A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111265704.8 2021-10-28
CN202111265704.8A CN116050469A (zh) 2021-10-28 2021-10-28 Ai模型的处理方法、运算方法及装置

Publications (1)

Publication Number Publication Date
WO2023071658A1 true WO2023071658A1 (zh) 2023-05-04

Family

ID=86118720

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/121335 WO2023071658A1 (zh) 2021-10-28 2022-09-26 Ai模型的处理方法、运算方法及装置

Country Status (2)

Country Link
CN (1) CN116050469A (zh)
WO (1) WO2023071658A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977704A (zh) * 2017-11-10 2018-05-01 中国科学院计算技术研究所 权重数据存储方法和基于该方法的神经网络处理器
CN112101534A (zh) * 2019-06-17 2020-12-18 英特尔公司 用于深度神经网络的可重新配置存储器压缩技术
CN112633477A (zh) * 2020-12-28 2021-04-09 电子科技大学 一种基于现场可编程阵列的量化神经网络加速方法
CN113537465A (zh) * 2021-07-07 2021-10-22 深圳市易成自动驾驶技术有限公司 Lstm模型优化方法、加速器、装置及介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093816A (zh) * 2023-10-19 2023-11-21 上海登临科技有限公司 矩阵乘运算方法、装置和电子设备
CN117093816B (zh) * 2023-10-19 2024-01-19 上海登临科技有限公司 矩阵乘运算方法、装置和电子设备

Also Published As

Publication number Publication date
CN116050469A (zh) 2023-05-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22885545

Country of ref document: EP

Kind code of ref document: A1