EP4379607A1 - Neural network accelerator, and data processing method for neural network accelerator - Google Patents

Info

Publication number
EP4379607A1
Authority
EP
European Patent Office
Prior art keywords
module
result
matrix
neural network
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP21952154.9A
Other languages
German (de)
English (en)
French (fr)
Inventor
Chen XIN
Honghui YUAN
Kwok Wai HON
Feng Qiu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of EP4379607A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 Learning methods

Definitions

  • This application relates to data processing technologies in the artificial intelligence field, and in particular, to a neural network accelerator and a data processing method for a neural network accelerator.
  • a neural network is a network structure that simulates behavior features of an animal neural network for information processing, and is widely used in a plurality of fields such as image classification, image segmentation, target recognition, image enhancement, and audio recognition.
  • Operations of the neural network mainly include two forms: a matrix operation and a vector operation.
  • the vector operation is completed by using a general-purpose vector acceleration unit, for example, a vector processing unit (vector processing unit) in a tensor processing unit (tensor processing unit, TPU), a tensor processor core (tensor processor core, TPC) unit of Habana GoYA, or a single point data processor (single point data processor) of Nvidia.
  • a general-purpose vector acceleration unit usually completes a specific type of vector operation in each clock cycle. When a vector operation amount in a neural network is large, an operation speed is greatly reduced, and overall performance is affected.
  • This application provides a neural network accelerator and a data processing method for a neural network accelerator, to improve processing efficiency of a neural network model.
  • a neural network accelerator including a matrix unit and a post-processing unit, where the matrix unit is configured to perform a matrix operation in a neural network model; and the post-processing unit is configured to perform a part of or all vector operations in the neural network model on a result of the matrix operation by using at least one of a plurality of dedicated modules, where one of the dedicated modules is configured to perform n vector operations, and n is a positive integer less than or equal to a first threshold.
  • vector operations in the neural network model are completed using one or more dedicated modules.
  • the plurality of vector operations may be completed using the plurality of dedicated modules respectively, thereby increasing a vector operation speed and further improving processing efficiency of the neural network model.
  • using a vector calculation unit can be avoided as much as possible, thereby reducing a requirement for the vector calculation unit, helping optimize an area of the vector calculation unit, and improving performance and an energy efficiency ratio of the neural network accelerator. Further, decoupling from the vector calculation unit is facilitated.
  • no vector calculation unit is disposed in the neural network accelerator, and all vector operations in the neural network model are completed using one or more dedicated modules, thereby further improving performance and an energy efficiency ratio.
  • the post-processing unit is decoupled from an existing processor architecture, and is applicable to different neural-network processing unit architectures.
  • the plurality of dedicated modules are merely logical division. During specific implementation, some of the plurality of dedicated modules may share a same hardware circuit. In other words, a same hardware circuit may perform different vector operations based on different instructions.
  • the post-processing unit is configured to perform some vector operations in the neural network model on a result of the matrix operation by using at least one of a plurality of dedicated modules; and the neural network accelerator further includes a vector calculation unit, where the vector calculation unit is configured to perform remaining vector operations in the neural network model.
  • the vector calculation unit is a general-purpose processing module, and can implement various vector operations.
  • the vector calculation unit may complete a vector operation that cannot be performed by the post-processing unit.
  • the at least one dedicated module is determined based on a structure of the neural network model.
  • a dedicated module in the post-processing unit may be configured based on a requirement, to meet different vector operation requirements in a plurality of neural network models, so that the neural network accelerator is applicable to operations of the plurality of neural network models.
  • the at least one dedicated module is determined based on a network feature or an operation requirement of the neural network model.
  • One or more dedicated modules may be selected from the plurality of dedicated modules based on an operation requirement of the neural network model, to perform a corresponding vector operation.
  • the at least one enabled dedicated module is determined based on a structure of the neural network model.
  • the at least one dedicated module of the plurality of modules is enabled based on a structure of the neural network model to perform a corresponding vector operation.
  • bypass (bypass) connection lines are disposed at two ends of the plurality of dedicated modules, and the enabled bypass connection line is determined based on a structure of the neural network model.
  • the bypass connection line is configured to bypass an unnecessary module.
  • One or more bypass connection lines are enabled based on a structure of the neural network model, to bypass an unnecessary module, so that the at least one dedicated module performs a corresponding vector operation.
  • the at least one dedicated module is indicated by an instruction or a parameter of a register in the post-processing unit.
  • the operations performed by the post-processing unit are determined based on a structure of the neural network model.
  • the post-processing unit is configured based on a vector operation that needs to be performed in an operation process of the neural network model, so that the post-processing unit performs a corresponding vector operation.
  • the dedicated module may be configured to perform an operation that needs to be performed in the neural network in the plurality of operations.
  • the plurality of dedicated modules include at least one of the following: a quantization module, an element-wise operation module, a bias operation module, an activation function module, or a pooling module, where the quantization module is configured to perform at least one of the following operations on data input to the quantization module: a quantization operation, a dequantization operation, or a weighting operation, where the data input to the quantization module is determined based on the result of the matrix operation; the element-wise operation module is configured to perform an element-wise operation on data input to the element-wise operation module, where the data input to the element-wise operation module is determined based on the result of the matrix operation; the bias operation module is configured to perform a bias operation on data input to the bias operation module, where the data input to the bias operation module is determined based on the result of the matrix operation; the activation function module is configured to process, according to an activation function, data input to the activation function module, where the data input to the activation function module is determined based on the result of the matrix operation; and the pooling module is configured to perform pooling processing on data input to the pooling module, where the data input to the pooling module is determined based on the result of the matrix operation.
  • the plurality of dedicated modules can support operations of more types of neural network models, thereby improving flexibility of the neural network accelerator.
  • the quantization module includes a first quantization module and a second quantization module
  • the activation function module includes a first activation function module and a second activation function module
  • the plurality of dedicated modules are connected in the following sequence: the bias operation module, the first quantization module, the first activation function module, the pooling module, the element-wise operation module, the second activation function module, and the second quantization module.
  • a connection sequence of the dedicated modules and a connection sequence obtained by bypassing or enabling some dedicated modules can support most post-processing procedures, and cover most neural network requirements.
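  • As an illustrative, non-normative sketch of this idea, the following Python snippet models the fixed connection sequence with per-layer enable/bypass decisions; the stage behaviors, parameter values, and the run_post_processing helper are assumptions for illustration only, not the patented circuits.

```python
import numpy as np

# Fixed module order from the description above; per layer, each stage is either
# enabled or bypassed, so one hardware sequence covers many post-processing flows.
STAGE_ORDER = ["bias", "quant1", "act1", "pool", "eltwise", "act2", "quant2"]

def run_post_processing(x, stages, enabled):
    """Apply the enabled dedicated modules in their fixed connection order."""
    for name in STAGE_ORDER:
        if name in enabled:                 # a disabled stage is skipped via its bypass line
            x = stages[name](x)
    return x

# Placeholder stage behaviors (illustrative only).
stages = {
    "bias":    lambda x: x + 1.0,                       # channel bias
    "quant1":  lambda x: x * 0.5,                       # e.g. a dequantization scale
    "act1":    lambda x: np.maximum(x, 0.0),            # RELU-type activation
    "pool":    lambda x: x,                             # pooling omitted in this sketch
    "eltwise": lambda x: x + x,                         # e.g. an element-wise addition
    "act2":    lambda x: np.maximum(x, 0.0),
    "quant2":  lambda x: np.clip(np.rint(x * 64.0), -128, 127).astype(np.int8),
}

y = run_post_processing(np.random.randn(4, 4).astype(np.float32), stages,
                        enabled={"bias", "quant1", "act1", "quant2"})
```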
  • Using a vector calculation unit is avoided as much as possible, thereby helping optimize an area of the vector calculation unit, and improving performance and an energy efficiency ratio of the neural network accelerator. Further, decoupling from the vector calculation unit is facilitated.
  • no vector calculation unit is disposed in the neural network accelerator, and all vector operations in the neural network model are completed using one or more dedicated modules, thereby further improving performance and an energy efficiency ratio.
  • the pooling module and the element-wise operation module share a hardware module.
  • the activation function module uses an activation function of a rectified linear unit (rectified linear unit, RELU) type
  • the activation function module and the quantization module share a hardware module.
  • a multiplication operation used in the RELU-type activation function and a multiplication operation in an operation of a quantization type are combined and merged into one multiplication operation by using parameters, so that the activation function module and the quantization module can share a hardware module, thereby reducing a requirement for operators, reducing costs, and reducing power consumption.
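  • A minimal sketch of this parameter merging is shown below, assuming an int8 output range and illustrative names (fused_prelu_quantize, alpha, scale): the PRELU/Leaky-RELU coefficient applied to negative inputs and the quantization scale are folded into one per-element multiplier.

```python
import numpy as np

def fused_prelu_quantize(x, alpha, scale, zero_point=0):
    """Fold the PRELU/Leaky-RELU multiply (alpha on negative inputs) and the
    quantization scale into one per-element multiplication."""
    multiplier = np.where(x >= 0, scale, scale * alpha)   # merged coefficient
    q = np.rint(x * multiplier) + zero_point              # a single multiply per element
    return np.clip(q, -128, 127).astype(np.int8)          # assumed int8 output range

x = np.array([-2.0, -0.5, 0.0, 1.5], dtype=np.float32)
print(fused_prelu_quantize(x, alpha=0.25, scale=100.0))   # [-50 -12 0 127] after clipping
```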
  • the quantization module is configured to perform any one of the following operations: a quantization operation, a dequantization operation, or a weighting operation.
  • the activation function module is configured to perform an operation of an activation function or an operation of a lookup table.
  • the activation function module may use an operation circuit.
  • the activation function module may use a multiplication circuit to perform an operation of the RELU-type activation function.
  • the activation function module may be provided with no operation circuit, but obtain a processing result of the activation function by using a lookup table. In this way, calculation of more types of nonlinear activation functions can be implemented.
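  • For example, a lookup-table-based activation can be sketched as follows; the sigmoid choice, the int8 coding, and the scale values are illustrative assumptions, the point being that the nonlinear function is evaluated offline and only a table read remains at run time.

```python
import numpy as np

def build_sigmoid_lut(in_scale, out_scale):
    """Evaluate the nonlinear function offline for all 256 possible int8 input codes."""
    codes = np.arange(-128, 128, dtype=np.float32)
    y = 1.0 / (1.0 + np.exp(-codes * in_scale))            # sigmoid on dequantized inputs
    return np.clip(np.rint(y / out_scale), -128, 127).astype(np.int8)

def lut_activation(x_q, table):
    """At run time the activation is a table read, with no arithmetic circuit."""
    return table[x_q.astype(np.int32) + 128]               # shift int8 codes to indices 0..255

table = build_sigmoid_lut(in_scale=0.05, out_scale=1.0 / 127)
print(lut_activation(np.array([-100, 0, 100], dtype=np.int8), table))
```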
  • the activation function module is configured to perform any one of the following operations: an operation of a rectified linear unit RELU, an operation of a parametric rectified linear unit (parametric rectified linear unit, PRELU), or an operation of a leaky rectified linear unit (Leaky-RELU).
  • the plurality of dedicated modules support a multi-level pipeline processing manner.
  • dedicated modules corresponding to the plurality of vector operations may process data in a multi-level pipeline parallel manner.
  • the plurality of dedicated modules in this embodiment of this application may enable the plurality of operations to be performed in a pipeline parallel manner, thereby improving performance of the neural network accelerator.
  • the plurality of dedicated modules complete the plurality of operations in the multi-level pipeline manner, and there is no need to write a vector operation into a buffer each time the vector operation is performed. This can reduce a quantity of times of read/write for the buffer, reduce power consumption, and help improve an energy efficiency ratio.
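  • The following software analogy (illustrative only) shows the intent of the multi-level pipeline: tiles of the matrix result stream through the enabled modules one after another, so no intermediate buffer write is needed between operations.

```python
import numpy as np

def pipeline(tiles, stages):
    """Compose the enabled modules lazily: each tile flows stage to stage without an
    intermediate buffer write between operations (software analogy of the hardware
    multi-level pipeline, where the stages work on different tiles concurrently)."""
    stream = iter(tiles)
    for stage in stages:
        stream = map(stage, stream)        # stage i+1 consumes the output of stage i
    return stream

tiles = [np.random.rand(16, 16).astype(np.float32) for _ in range(4)]   # matrix-unit result tiles
stages = [lambda t: t + 0.5,                                            # bias
          lambda t: np.maximum(t, 0.0),                                 # RELU
          lambda t: np.clip(np.rint(t / 0.05), -128, 127).astype(np.int8)]  # quantization
for out in pipeline(tiles, stages):
    pass  # each processed tile would be written once to the destination memory here
```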
  • the post-processing unit is specifically configured to: perform, in a process of moving the result of the matrix operation from a memory of the matrix unit to another memory, a vector operation on the result of the matrix operation by using the at least one dedicated module.
  • the post-processing unit may be used as a data path between the memory in the matrix unit and the another memory, to implement associated movement.
  • the post-processing unit performs a vector operation in an associated manner.
  • Associated execution means that a related operation is performed during data movement.
  • Each of the plurality of dedicated modules may support “associated” execution.
  • the post-processing unit and the matrix unit can process data in parallel.
  • the post-processing unit and the matrix unit basically do not generate additional time overheads in a process of processing data in parallel, thereby further improving performance of the neural network accelerator.
  • the post-processing unit is specifically configured to: remove, in the process of moving the result of the matrix operation from the memory of the matrix unit to the another memory, invalid data from the result of the matrix operation.
  • the deleting invalid data in the result of the matrix operation in the associated manner means deleting invalid data in the result of the matrix operation when the result of the matrix operation is moved from the memory of the matrix unit.
  • the invalid data may also be referred to as dummy (dummy) data or junk data.
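  • A hedged sketch of this associated removal is given below; the tensor shapes and the amount of padding are assumptions chosen only to illustrate stripping dummy data while the result is being copied to the destination memory.

```python
import numpy as np

def move_and_strip(result_buffer, valid_channels, valid_cols, dst):
    """While moving a matrix-unit result to another memory, drop the padding
    ("dummy"/invalid) channels and columns that exist only because the matrix unit
    computes on fixed-size blocks; the stripping happens during the copy itself."""
    dst[:] = result_buffer[:valid_channels, :, :valid_cols]
    return dst

padded = np.zeros((32, 8, 16), dtype=np.int32)   # e.g. padded to 32 channels and 16 columns
dst = np.empty((24, 8, 14), dtype=np.int32)      # only 24 channels and 14 columns are valid
move_and_strip(padded, 24, 14, dst)
```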
  • a plurality of elements in the result of the matrix operation are arranged based on a first location relationship, and the post-processing unit is further configured to arrange the plurality of elements based on a second location relationship, to obtain a target result.
  • a data format conversion operation may also be referred to as a data rearrangement operation.
  • a quantity of channels for the result of the matrix operation is greater than a quantity of channels for the target result, a height of the result of the matrix operation is less than a height of the target result, and a width of the result of the matrix operation is less than a width of the target result.
  • the result of the matrix operation is an output result of a convolution operation obtained according to a Winograd algorithm
  • the target result is an output result of the convolution operation
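  • The rearrangement from a channel-heavy matrix result to a wider and taller target result can be illustrated with a depth-to-space style transform; the function below is a sketch under the assumption of a single square block size and is not the exact mapping used for the Winograd output.

```python
import numpy as np

def depth_to_space(x, block):
    """Rearrange a (C*block*block, H, W) result into a (C, H*block, W*block) target,
    i.e. fewer channels but a larger height and width."""
    c, h, w = x.shape
    x = x.reshape(c // (block * block), block, block, h, w)
    x = x.transpose(0, 3, 1, 4, 2)                    # C, H, block_h, W, block_w
    return x.reshape(c // (block * block), h * block, w * block)

y = depth_to_space(np.arange(2 * 4 * 3 * 3).reshape(8, 3, 3), block=2)
assert y.shape == (2, 6, 6)                           # 8 channels, 3x3 -> 2 channels, 6x6
```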
  • the post-processing unit may be further configured to perform a bit width conversion operation on floating-point data.
  • the post-processing unit may be further configured to perform a precision conversion operation on floating-point data.
  • a data processing method for a neural network accelerator is provided, where the neural network accelerator includes a matrix unit and a post-processing unit, and the method includes: The matrix unit performs a matrix operation in a neural network model; and the post-processing unit performs a part of or all vector operations in the neural network model on a result of the matrix operation by using at least one of a plurality of dedicated modules, where one of the dedicated modules is configured to perform n vector operations, and n is a positive integer less than or equal to a first threshold.
  • vector operations in the neural network model are completed using one or more dedicated modules.
  • the plurality of vector operations may be completed using the plurality of dedicated modules respectively, thereby improving processing efficiency of the neural network model.
  • using a vector calculation unit can be avoided as much as possible, thereby reducing a requirement for the vector calculation unit, helping optimize an area of the vector calculation unit, and improving performance and an energy efficiency ratio of the neural network accelerator. Further, decoupling from the vector calculation unit is facilitated.
  • no vector calculation unit is disposed in the neural network accelerator, and all vector operations in the neural network model are completed using one or more dedicated modules, thereby further improving performance and an energy efficiency ratio.
  • the post-processing unit is decoupled from an existing processor architecture, and is applicable to different neural-network processing unit architectures.
  • the at least one dedicated module is determined based on a structure of the neural network model.
  • the at least one dedicated module is indicated by an instruction or a parameter of a register in the post-processing unit.
  • the plurality of dedicated modules include at least one of the following: a quantization module or an element-wise operation module, where the quantization module is configured to perform at least one of the following operations on data input to the quantization module: a quantization operation, a dequantization operation, or a weighting operation, where the data input to the quantization module is determined based on the result of the matrix operation; or the element-wise operation module is configured to perform an element-wise operation on data input to the element-wise operation module, where the data input to the element-wise operation module is determined based on the result of the matrix operation.
  • the plurality of dedicated modules further include at least one of the following: a bias operation module, an activation function module, or a pooling module, where the bias operation module is configured to perform a bias operation on data input to the bias operation module, where the data input to the bias operation module is determined based on the result of the matrix operation;
  • the activation function module is configured to process, according to an activation function, data input to the activation function module, where the data input to the activation function module is determined based on the result of the matrix operation;
  • the pooling module is configured to perform pooling processing on data input to the pooling module, where the data input to the pooling module is determined based on the result of the matrix operation.
  • the quantization module includes a first quantization module and a second quantization module
  • the activation function module includes a first activation function module and a second activation function module
  • the plurality of dedicated modules in the post-processing unit are connected in the following sequence: the bias operation module, the first quantization module, the first activation function module, the pooling module, the element-wise operation module, the second activation function module, and the second quantization module.
  • the quantization module performs any one of the following operations: a quantization operation, a dequantization operation, or a weighting operation.
  • the activation function module performs any one of the following operations: an operation of an RELU, an operation of a PRELU, or an operation of a Leaky-RELU.
  • that the post-processing unit processes a result of the matrix operation by using at least one of a plurality of dedicated modules includes that the at least one dedicated module supports processing the result of the matrix operation in a multi-level pipeline processing manner.
  • that the post-processing unit processes a result of the matrix operation by using at least one of a plurality of dedicated modules includes that the post-processing unit performs, in a process of moving the result of the matrix operation from a memory of the matrix unit to another memory, a vector operation on the result of the matrix operation by using the at least one dedicated module.
  • the post-processing unit removes, in the process of moving the result of the matrix operation from the memory of the matrix unit to the another memory, invalid data from the result of the matrix operation.
  • a plurality of elements in the result of the matrix operation are arranged based on a first location relationship
  • the method further includes: The post-processing unit arranges the plurality of elements based on a second location relationship, to obtain a target result.
  • a quantity of channels for the result of the matrix operation is greater than a quantity of channels for the target result, a height of the result of the matrix operation is less than a height of the target result, and a width of the result of the matrix operation is less than a width of the target result.
  • the result of the matrix operation is an output result of a convolution operation obtained according to a Winograd algorithm
  • the target result is an output result of the convolution operation
  • a neural network accelerator including a memory, configured to store a program; and a processor, configured to execute the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to perform the method according to the second aspect or any implementation of the second aspect.
  • the processor in the third aspect may be a central processing unit (central processing unit, CPU), or may be a combination of a CPU and a neural network operation processor.
  • the neural network operation processor herein may include a graphics processing unit (graphics processing unit, GPU), a neural-network processing unit (neural-network processing unit, NPU), a tensor processing unit (tensor processing unit, TPU), and the like.
  • the TPU is an artificial intelligence accelerator application-specific integrated circuit customized by Google (Google) for machine learning.
  • a computer-readable storage medium stores program code to be executed by a device, and the program code includes instructions used to perform the method according to the second aspect or any implementation of the second aspect.
  • a computer program product including instructions is provided.
  • the computer program product is run on a computer, the computer is enabled to perform the method according to the second aspect or any implementation of the second aspect.
  • a chip includes a processor and a data interface.
  • the processor reads, through the data interface, instructions stored in a memory to perform the method according to the second aspect or any implementation of the second aspect.
  • the chip may further include the memory, and the memory stores the instructions.
  • the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to perform the method according to the second aspect or any implementation of the second aspect.
  • a system on a chip (system on a chip, SoC) is provided.
  • the SoC includes the neural network accelerator according to the first aspect or any implementation of the first aspect.
  • an electronic device includes the neural network accelerator according to the first aspect or any implementation of the first aspect.
  • embodiments of this application relate to massive application of a neural network, for ease of understanding, the following describes terms and concepts related to the neural network that may be used in embodiments of this application.
  • the neural network may include a neuron.
  • the neuron may be an operation unit that uses inputs x_s and an intercept of 1 as its input.
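  • For example, the output of such a neuron is commonly written as

    $$ h_{W,b}(x) = f\left(W^{\mathsf T} x\right) = f\left(\sum_{s=1}^{n} W_s x_s + b\right) $$

    where W_s denotes the weights, b is the bias associated with the intercept of 1, and f is the activation function described below.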
  • f is an activation function (activation function) of the neuron, which is used to introduce a nonlinear feature into the neural network, to convert an input signal in the neuron into an output signal.
  • the output signal of the activation function may be used as an input of a next layer.
  • the activation function may be a rectified linear unit (rectified linear unit, RELU), a tanh function, or a sigmoid function.
  • the neural network is a network formed by connecting a plurality of single neurons together.
  • output of a neuron may be input of another neuron.
  • An input of each neuron may be connected to a local receptive field of a previous layer to extract a feature of the local receptive field.
  • the local receptive field may be a region including several neurons.
  • a deep neural network (deep neural network, DNN), also referred to as a multi-layer neural network, may be understood as a neural network having a plurality of hidden layers.
  • the DNN is divided based on locations of different layers, so that the neural network in the DNN can be divided into three types: an input layer, hidden layers, and an output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • all intermediate layers are the hidden layers. The layers are fully connected. To be specific, any neuron at an i-th layer is necessarily connected to any neuron at an (i+1)-th layer.
  • at each layer, the output vector y is obtained by merely performing such a simple operation (a linear transformation followed by an activation function) on the input vector x.
  • because the DNN includes a large quantity of layers, there are also a large quantity of coefficients W and bias vectors b. The definitions of these parameters in the DNN are as follows:
  • the coefficient W is used as an example.
  • a linear coefficient from the fourth neuron at the second layer to the second neuron at the third layer is defined as $W_{24}^{3}$.
  • the superscript 3 indicates a layer at which the coefficient W is located, and the subscript corresponds to an output third-layer index 2 and an input second-layer index 4.
  • a coefficient from a k-th neuron at an (L-1)-th layer to a j-th neuron at an L-th layer is defined as $W_{jk}^{L}$.
  • the input layer does not have the parameter W.
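  • With this notation, the "simple operation" that each layer other than the input layer performs on its input vector x to obtain its output vector y can be written as

    $$ \vec{y} = \alpha\left(W \vec{x} + \vec{b}\right) $$

    where W is the layer's coefficient matrix with entries $W_{jk}^{L}$ as defined above, b is the layer's bias vector, and α is the activation function.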
  • more hidden layers make the network more capable of describing a complex case in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity". It indicates that the model can complete a more complex learning task.
  • a process of training the deep neural network is a process of learning a weight matrix, and a final objective of training is to obtain weight matrices (weight matrices formed by vectors at many layers) of all layers of a trained deep neural network.
  • a convolutional neural network (convolutional neural network, CNN) is a deep neural network with a convolutional structure.
  • the convolutional neural network includes a feature extractor that includes a convolutional layer and a sub-sampling layer, and the feature extractor may be considered as a filter.
  • the convolutional layer is a neuron layer that is in the convolutional neural network and at which convolution processing is performed on an input signal.
  • one neuron may be connected only to some adjacent-layer neurons.
  • One convolutional layer usually includes several feature planes, and each feature plane may include some neurons that are in a rectangular arrangement. Neurons at a same feature plane share a weight, and the weight shared herein is a convolution kernel.
  • Weight sharing may be understood as that an image information extraction manner is irrelevant to a location.
  • the convolution kernel may be initialized in a form of a matrix of a random size.
  • an appropriate weight may be obtained for the convolution kernel through learning.
  • benefits directly brought by weight sharing are that connections among layers of the convolutional neural network are reduced, and an overfitting risk is reduced.
  • Computation of the convolutional neural network mainly includes a matrix operation and a vector operation.
  • the matrix operation may also be understood as a convolution operation.
  • the vector operation includes a bias (bias) operation, quantization, dequantization, weighting, batch normalization (batch normalization, BN), an element-wise (element-wise) operation, pooling (pooling), a nonlinear activation function, or the like.
  • some convolutional neural networks, for example, a convolutional neural network of an image enhancement type, usually include an operation of extending a depth (depth) dimension of a feature map (feature map) obtained after matrix calculation to a space (space) dimension, to improve image quality without increasing a calculation amount.
  • the convolutional neural network may be divided into two phases: a training (training) phase and an inference (inference) phase.
  • the objective of the training phase is to obtain a network with high precision and a low computing power requirement, and the training phase is less sensitive to training time consumption.
  • a high-precision floating-point data type is usually used to complete training of parameters of a model, to improve precision of the model.
  • the trained network is deployed in the actual application scenario, and parameters of the model are quantized.
  • a fixed-point operation is used to replace a floating-point operation, to improve energy efficiency; and operators in the model may be further merged to simplify an operation procedure, thereby improving a running speed and reducing energy consumption without affecting precision of the model significantly.
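  • As a minimal illustration of replacing floating-point values with fixed-point codes, the following sketch shows a per-tensor affine quantization and its inverse; the int8 range, scale, and zero-point values are assumptions for illustration.

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Map float values to int8 codes: q = round(x / scale) + zero_point."""
    return np.clip(np.rint(x / scale) + zero_point, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.2, 0.0, 0.7, 3.1], dtype=np.float32)
q = quantize(x, scale=0.05, zero_point=0)
assert np.allclose(dequantize(q, 0.05, 0), x, atol=0.05)   # error bounded by the scale step
```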
  • specific operation manners involved in different phases may be the same or different, and may be set based on a situation to meet different requirements of the phases.
  • the Winograd algorithm is a common fast convolution calculation method, and can greatly reduce a quantity of matrix multiplication calculation times without affecting a calculation result, reduce a calculation amount of a convolution operation, and improve calculation efficiency.
  • the Winograd algorithm is essentially to combine multiplication operations, add addition and subtraction operations, and reduce multiplication operations to improve energy efficiency.
  • according to the Winograd algorithm, the output is computed as Y = A^T((G g G^T) ⊙ (B^T d B))A, where Y represents an output data matrix
  • g represents a convolution kernel, that is, a weight matrix
  • d represents a tile (tile) of an input feature map, that is, an input data matrix
  • ⊙ represents element-wise multiplication (element-wise multiplication), that is, point multiplication between matrices.
  • Each of A, G, and B represents a transformation matrix. Specifically, A represents an output transformation matrix, A T represents a transposed matrix of A, G represents a weight transformation matrix, G T represents a transposed matrix of G, B represents an input transformation matrix, and B T represents a transposed matrix of B.
  • a Winograd algorithm such as F(m, r) is used to quickly calculate a convolution operation in which a size of a convolution kernel is r and a size of an output feature map is m.
  • the size may also be understood as a dimension.
  • a transformation matrix may be determined based on a dimension and a stride (stride) of a convolution kernel.
  • each combination of B, G, and A for a convolution kernel of a specific size and a stride is fixed, and may be deduced according to the Winograd algorithm.
  • a common form of the Winograd algorithm is F(2×2, 3×3)
  • an output data matrix Y is a matrix of 2×2
  • a weight matrix g is a matrix of 3×3
  • an input data matrix is a matrix of 4×4
  • a stride is 1.
  • the output matrix is a result of convolution calculation performed on the input data matrix d and the weight matrix g.
  • the Winograd algorithm can improve the performance by 2.25 times without affecting the calculation result.
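  • The following self-contained sketch of F(2×2, 3×3) uses the standard transformation matrices and checks the result against a direct sliding-window convolution; it is an illustration of the algorithm described above, not the accelerator's hardware implementation.

```python
import numpy as np

# Standard transformation matrices for Winograd F(2x2, 3x3) with stride 1.
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """Y = A^T((G g G^T) ⊙ (B^T d B))A: 16 multiplications for one 2x2 output tile."""
    U = G @ g @ G.T            # transformed 3x3 weight -> 4x4
    V = B_T @ d @ B_T.T        # transformed 4x4 input tile -> 4x4
    return A_T @ (U * V) @ A_T.T

def direct_conv(d, g):
    """Reference sliding-window convolution (36 multiplications for the same tile)."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out

d = np.arange(16, dtype=float).reshape(4, 4)   # 4x4 input data matrix (one tile)
g = np.arange(9, dtype=float).reshape(3, 3)    # 3x3 weight matrix
assert np.allclose(winograd_f2x2_3x3(d, g), direct_conv(d, g))   # 36 / 16 = 2.25x fewer multiplications
```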
  • a data collection device 160 is configured to collect training data.
  • the training data is related to a task of a model.
  • the data collection device 160 stores the training data in a database 130, and a training device 120 obtains a target model/rule 101 through training based on the training data maintained in the database 130.
  • the training device 120 processes the input training data to obtain a predicted value, and compares the predicted value with a target value until a difference between the predicted value that is output by the training device 120 and the target value is less than a specific threshold, to complete training of the target model/rule 101.
  • the training device 120 may obtain the target model/rule 101 through training by using a neural network accelerator in embodiments of this application.
  • the target model/rule 101 in this embodiment of this application may be specifically a neural network model, for example, a convolutional neural network.
  • the training data maintained in the database 130 is not necessarily all collected by the data collection device 160, or may be received from another device.
  • the training device 120 may not necessarily train the target model/rule 101 completely based on the training data maintained in the database 130, or may obtain training data from a cloud or another place to perform model training.
  • the foregoing descriptions should not be construed as a limitation on embodiments of this application.
  • the target model/rule 101 obtained through training by the training device 120 can be applied to different systems or devices, for example, applied to an execution device 110 shown in FIG. 1 .
  • the execution device 110 may be a terminal such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device, or a vehicle terminal; or may be a server, a cloud, or the like.
  • an input/output (input/output, I/O) interface 112 is configured for the execution device 110, and is configured to exchange data with an external device. A user may input data to the I/O interface 112 by using a client device 140.
  • the execution device 110 may invoke data, code, and the like in a data storage system 150 for corresponding processing, and may further store, in the data storage system 150, data, an instruction, and the like that are obtained through the corresponding processing.
  • the calculation module 111 processes the input data by using the target model/rule 101.
  • the calculation module 111 may execute the target model/rule 101 by using the neural network accelerator in embodiments of this application.
  • the I/O interface 112 returns a processing result to the client device 140, and provides the processing result to the user.
  • the training device 120 may generate, for different targets or different tasks, corresponding target models/rules 101 based on different training data.
  • the corresponding target models/rules 101 may be used to achieve the foregoing targets or complete the foregoing tasks, to provide a required result for the user.
  • the user may manually provide the input data.
  • the manually providing may be performed by using a screen provided on the I/O interface 112.
  • the client device 140 may automatically send input data to the I/O interface 112. If the client device 140 needs to obtain authorization from the user to automatically send the input data, the user may set corresponding permission on the client device 140.
  • the user may view, on the client device 140, a result output by the execution device 110. Specifically, the result may be presented in a form of displaying, a sound, an action, or the like.
  • the client device 140 may alternatively be used as a data collection end, to collect, as new sample data, input data that is input to the I/O interface 112 and an output result that is output from the I/O interface 112 that are shown in the figure, and store the new sample data in the database 130. It is clear that the client device 140 may alternatively not perform collection. Instead, the I/O interface 112 directly stores, in the database 130 as new sample data, the input data input to the I/O interface 112 and the output result output from the I/O interface 112.
  • FIG. 1 is merely a schematic diagram of a system architecture according to an embodiment of this application.
  • a location relationship between the devices, the components, the modules, and the like shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110, but in another case, the data storage system 150 may alternatively be disposed in the execution device 110.
  • the target model/rule 101 is obtained based on training by the training device 120.
  • the target model/rule 101 may be a neural network in this embodiment of this application.
  • the neural network in this embodiment of this application may be a CNN, a deep convolutional neural network (deep convolutional neural networks, DCNN), or the like.
  • the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture.
  • the deep learning architecture is to perform multi-level learning at different abstract levels by using a neural network model update algorithm.
  • the CNN is a feedforward (feedforward) artificial neural network, and neurons in the feedforward artificial neural network may respond to an input image.
  • a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (the pooling layer is optional), and a fully connected layer 230.
  • the input layer 210 may obtain a to-be-processed image, and send the obtained to-be-processed image to the convolutional layer/pooling layer 220 and the following fully connected layer 230 for processing, to obtain a processing result of the image.
  • the convolutional layer/pooling layer 220 may include, for example, layers 221 to 226.
  • the layer 221 is a convolutional layer
  • the layer 222 is a pooling layer
  • the layer 223 is a convolutional layer
  • the layer 224 is a pooling layer
  • the layer 225 is a convolutional layer
  • the layer 226 is a pooling layer.
  • the layers 221 and 222 are convolutional layers
  • the layer 223 is a pooling layer
  • the layers 224 and 225 are convolutional layers
  • the layer 226 is a pooling layer.
  • an output of a convolutional layer may be used as an input of a following pooling layer, or may be used as an input of another convolutional layer to continue a convolution operation.
  • the following uses the convolutional layer 221 as an example to describe an internal working principle of one convolutional layer.
  • the convolutional layer 221 may include a plurality of convolution operators.
  • the convolution operator is also referred to as a kernel.
  • the convolution operator functions as a filter that extracts specific information from an input image matrix.
  • the convolution operator may essentially be a weight matrix, and the weight matrix is usually predefined.
  • in a process of performing a convolution operation on an image, the weight matrix usually processes pixels at a granularity level of one pixel (or two pixels, depending on a value of a stride (stride)) in a horizontal direction on the input image, to extract a specific feature from the image.
  • a size of the weight matrix needs to be related to a size of the image.
  • a depth dimension (depth dimension) of the weight matrix is the same as a depth dimension of the input image.
  • the weight matrix extends to an entire depth of the input image. Therefore, a convolutional output of a single depth dimension is generated through convolution with a single weight matrix.
  • a single weight matrix is not used, but a plurality of weight matrices with a same size (rows x columns), namely, a plurality of same-class matrices, are used.
  • Outputs of the weight matrices are stacked to form a depth dimension of a convolutional image.
  • the dimension herein may be understood as being determined based on the foregoing "plurality”.
  • Different weight matrices may be used to extract different features from the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract a specific color of the image, and a further weight matrix is used to blur unnecessary noise in the image.
  • the plurality of weight matrices have the same size (rows x columns), and convolutional feature maps extracted from the plurality of weight matrices with the same size have a same size. Then, the plurality of extracted convolutional feature maps with the same size are combined to form an output of the convolution operation.
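  • A compact sketch of this stacking, assuming a single-channel input, "valid" padding, and an illustrative conv_layer helper, is shown below; each weight matrix produces one feature map, and the K feature maps form the depth dimension of the output.

```python
import numpy as np

def conv_layer(image, kernels, stride=1):
    """Apply K same-size weight matrices to one input and stack the K feature maps."""
    K, kh, kw = kernels.shape
    H, W = image.shape
    oh, ow = (H - kh) // stride + 1, (W - kw) // stride + 1
    out = np.zeros((K, oh, ow))
    for k in range(K):                       # one output channel per weight matrix
        for i in range(oh):
            for j in range(ow):
                patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[k, i, j] = np.sum(patch * kernels[k])
    return out

image = np.random.rand(8, 8)
kernels = np.random.rand(3, 3, 3)            # three 3x3 weight matrices
assert conv_layer(image, kernels).shape == (3, 6, 6)   # depth dimension == number of matrices
```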
  • Weight values in these weight matrices need to be obtained through a lot of training during actual application.
  • Each weight matrix formed by using the weight values obtained through training may be used to extract information from an input image, to enable the convolutional neural network 200 to perform correct prediction.
  • when the convolutional neural network 200 has a plurality of convolutional layers, a large quantity of general features are usually extracted at an initial convolutional layer (for example, 221).
  • the general feature may also be referred to as a low-level feature.
  • as the depth of the convolutional neural network 200 increases, a feature extracted at a subsequent convolutional layer (for example, 226) becomes more complex, for example, a high-level semantic feature. A feature with higher semantics is more applicable to a to-be-resolved problem.
  • pooling layers usually need to be periodically introduced after the convolutional layers.
  • the pooling layer is only used to reduce a space size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator, to perform sampling on the input image to obtain an image with a small size.
  • the average pooling operator may be used to calculate pixel values in the image in a specific range, to generate an average value. The average value is used as an average pooling result.
  • the maximum pooling operator may be used to select a pixel with a maximum value in a specific range as a maximum pooling result.
  • an operator at the pooling layer also needs to be related to the size of the image.
  • a size of a processed image output from the pooling layer may be less than a size of an image input to the pooling layer.
  • Each pixel in the image output from the pooling layer represents an average value or a maximum value of a corresponding sub-region of the image input to the pooling layer.
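  • The two pooling operators can be sketched as follows (non-overlapping size x size windows are an assumption for brevity); each output pixel summarizes one sub-region of the input feature map.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Max or average pooling over non-overlapping size x size windows."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, 2, "max"))    # [[ 5.  7.] [13. 15.]]
print(pool2d(fmap, 2, "avg"))    # [[ 2.5  4.5] [10.5 12.5]]
```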
  • after processing performed at the convolutional layer/pooling layer 220, the convolutional neural network 200 is not ready to output required output information. As described above, at the convolutional layer/pooling layer 220, only a feature is extracted, and parameters brought by the input image are reduced. However, to generate final output information (required class information or other related information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate an output of one required class or outputs of a group of required classes. Therefore, the fully connected layer 230 may include a plurality of hidden layers (231 and 232 to 23n shown in FIG. 3 ) and an output layer 240. Parameters included in the plurality of hidden layers may be obtained through pre-training based on related training data of a specific task type. For example, the task type may include image recognition, image classification, super-resolution image reconstruction, and the like.
  • the plurality of hidden layers are followed by the output layer 240, namely, a last layer of the entire convolutional neural network 200.
  • the output layer 240 has a loss function similar to a categorical cross entropy, and the loss function is specifically configured to calculate a prediction error.
  • a structure of a neural network specifically used in this embodiment of this application may be shown in FIG. 3 .
  • a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (the pooling layer is optional), and a fully connected layer 230.
  • in FIG. 3 , a plurality of convolutional layers/pooling layers in the convolutional layer/pooling layer 220 are parallel, and all separately extracted features are input to the fully connected layer 230 for processing.
  • the convolutional neural networks shown in FIG. 2 and FIG. 3 are merely used as examples of two possible convolutional neural networks in this embodiment of this application.
  • the neural network in this embodiment of this application may alternatively exist in a form of another network model.
  • the neural network accelerator in embodiments of this application may further perform an operation of a neural network in a form of another network model.
  • FIG. 4 shows a hardware structure of a chip according to an embodiment of this application.
  • the chip includes a neural-network processing unit (neural-network processing unit, NPU) 40.
  • the chip may be disposed in the execution device 110 shown in FIG. 1 , to complete calculation work of the calculation module 111.
  • the chip may alternatively be disposed in the training device 120 shown in FIG. 1 , to complete training work of the training device 120 and output the target model/rule 101.
  • An algorithm at each layer in the convolutional neural networks shown in FIG. 2 and FIG. 3 may be implemented in the chip shown in FIG. 4 .
  • the NPU 40 is mounted to a host central processing unit (central processing unit, CPU) (host CPU) as a coprocessor, and the host CPU allocates tasks.
  • a core part of the NPU is an operation circuit 403.
  • a controller 404 controls the operation circuit 403 to extract data in a memory (a weight memory or an input memory) and perform an operation.
  • the operation circuit 403 internally includes a plurality of processing engines (processing engines, PEs). In some implementations, the operation circuit 403 is a two-dimensional systolic array. The operation circuit 403 may alternatively be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 403 is a general-purpose matrix processor.
  • the operation circuit fetches corresponding data of the matrix B from the weight memory 402, and buffers the data on each PE in the operation circuit.
  • the operation circuit obtains data of the matrix A from the input memory 401, performs a matrix operation on the data and the matrix B, and stores an obtained partial result or final result of the matrix in an accumulator (accumulator) 408.
  • a vector calculation unit 407 may perform further processing on the output of the operation circuit, for example, vector multiplication, vector addition, exponential operation, logarithmic operation, and value comparison.
  • the vector calculation unit 407 may be configured to perform network calculation, such as pooling (pooling), batch normalization (batch normalization, BN), or local response normalization (local response normalization), at a non-convolutional/non-FC layer in the neural network.
  • the vector calculation unit 407 can store a processed and output vector in a unified buffer 406.
  • the vector calculation unit 407 may apply a non-linear function to the output of the operation circuit 403, for example, a vector of an accumulated value, to generate an activation value.
  • the vector calculation unit 407 generates a normalized value, a combined value, or both.
  • the processed and output vector can be used as an activation input to the operation circuit 403, for example, for use in subsequent layers in the neural network.
  • the unified buffer (unified buffer) 406 is configured to store input data and output data.
  • by using a direct memory access controller (direct memory access controller, DMAC) 405, input data in an external memory is transferred to the input memory 401 and/or the unified buffer 406, weight data in the external memory is stored in the weight memory 402, and the data in the unified buffer 406 is stored in the external memory.
  • a bus interface unit (bus interface unit, BIU) 410 is configured to implement interaction between the host CPU, the DMAC, and an instruction fetch buffer 409 by using a bus.
  • the instruction fetch buffer (instruction fetch buffer) 409 connected to the controller 404 is configured to store an instruction used by the controller 404.
  • the controller 404 is configured to invoke the instruction buffered in the instruction fetch buffer 409, to control a working process of the operation accelerator.
  • the unified buffer 406, the input memory 401, the weight memory 402, and the instruction fetch buffer 409 are all on-chip (On-Chip) memories.
  • the external memory is a memory outside the neural-network processing unit 40.
  • the external memory may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), a high bandwidth memory (high bandwidth memory, HBM), or another readable and writable memory.
  • the operation circuit 403 is configured to perform a matrix operation
  • the vector calculation unit 407 is configured to perform a vector operation, for example, operations of some activation functions or pooling.
  • the vector calculation unit 407 generally can complete only one specific type of vector calculation in each clock cycle.
  • for a neural network with a large vector operation amount and a small matrix operation amount, for example, a depthwise (depthwise) network or a pointwise (pointwise) network
  • the vector operation becomes a main factor that restricts calculation efficiency improvement, and affects overall processing efficiency of the network.
  • a synchronization instruction usually needs to be added to the vector calculation unit 407, to ensure accuracy of a data read/write sequence. In this way, a pipeline of the vector calculation is interrupted, and processing performance of the vector calculation unit 407 is affected.
  • Embodiments of this application provide a neural network accelerator, to improve a vector operation speed in a neural network model, thereby improving overall processing efficiency of the neural network model.
  • FIG. 5 shows a hardware structure of a neural network accelerator according to an embodiment of this application.
  • the neural network accelerator 500 in FIG. 5 can perform a part of or all operation processes of a neural network model.
  • the neural network accelerator 500 in this embodiment of this application may be deployed on a cloud service device or a terminal device, for example, a device such as a computer, a server, a vehicle, an unmanned aerial vehicle, or a mobile phone.
  • the neural network accelerator 500 may be deployed on a system on a chip (system on a chip, SoC) in a terminal device such as a mobile phone.
  • the neural network accelerator 500 may be disposed in the execution device 110 shown in FIG. 1 , to complete calculation work of the calculation module 111.
  • the neural network accelerator 500 may alternatively be disposed in the training device 120 shown in FIG. 1 , to complete training work of the training device 120 and output the target model/rule 101.
  • An algorithm at each layer in the convolutional neural networks shown in FIG. 2 and FIG. 3 may be implemented in the neural network accelerator 500.
  • the neural network accelerator 500 includes a matrix unit 510 and a post-processing unit 520.
  • the matrix unit 510 is configured to perform a matrix operation in the neural network model.
  • the post-processing unit 520 is configured to perform a part of or all vector operations in the neural network model on a result of the matrix operation by using at least one of a plurality of dedicated modules, where one of the dedicated modules is configured to perform n vector operations, and n is a positive integer less than or equal to a first threshold.
  • the first threshold may be 3.
  • the matrix operation includes a matrix multiplication operation, a convolution operation, or the like.
  • the matrix unit 510 may be configured to perform a multiplication operation at a fully connected layer, a convolution operation at a convolutional layer, or the like in the neural network model.
  • the matrix unit 510 may be implemented by using a matrix unit in an existing neural-network processing unit.
  • the matrix unit 510 may perform the matrix operation by using the operation circuit 403 in FIG. 4 .
  • a specific structure of the matrix unit 510 is not limited in this embodiment of this application, provided that the matrix unit can perform the matrix operation in the neural network model.
  • the post-processing unit 520 is configured to perform a vector operation in the neural network model based on the result of the matrix operation.
  • the post-processing unit 520 includes a plurality of dedicated modules.
  • the plurality of dedicated modules are separately configured to perform different vector operations.
  • a dedicated module is a module configured to implement one or more specific functions.
  • the vector calculation unit 407 in FIG. 4 may be understood as a general-purpose module, and can complete all vector operations in the neural network model.
  • the dedicated module in this embodiment of this application is configured to complete one or more specific vector operations.
  • the plurality of dedicated modules are merely logical division. During specific implementation, some of the plurality of dedicated modules may share a same hardware circuit. In other words, a same hardware circuit may perform different vector operations based on different instructions.
  • the post-processing unit 520 may obtain the result of the matrix operation from a source address (source, src), process the result of the matrix operation, and send processed data to a destination address (destination, dst).
  • There may be one or more srcs. There can be one or more dsts. Setting manners of src and dst are related to the hardware architecture of the neural network accelerator.
  • the post-processing unit 520 may further obtain a vector operation parameter from a vector operation parameter address.
  • Any one of the following may be indicated by an instruction or a parameter of a register in the post-processing unit: dst, src, or a vector operation parameter address.
  • the matrix unit 510 includes a matrix operation unit and a first memory.
  • the matrix operation unit may be configured to perform the matrix operation in the neural network model.
  • the first memory may be configured to store data related to the matrix operation.
  • the first memory may be configured to store at least one of the following: input data of the matrix operation, a parameter of the neural network model, the result of the matrix operation, or the like.
  • the parameter of the neural network model includes a weight parameter of a matrix operation.
  • the first memory may also include a plurality of memories.
  • the plurality of memories are separately configured to store the input data of the matrix operation, the parameter of the neural network model, the result of the matrix operation, and the like.
  • the first memory includes an input memory and a result memory.
  • the input memory is configured to store the input data of the matrix operation and the parameter of the neural network model, and provide an input matrix for the matrix operation unit.
  • the input memory may be a data buffer (data buffer).
  • the input matrix may include a left matrix (left matrix) and a right matrix (right matrix), which respectively correspond to the input data of the matrix operation and a weight parameter of a matrix operation.
  • the result memory is configured to store the result of the matrix operation.
  • the result memory may include a result buffer (result buffer).
  • the post-processing unit 520 includes a post-processing operation unit and a second memory.
  • the post-processing operation unit includes the plurality of dedicated modules.
  • the post-processing operation unit may obtain the result of the matrix operation from the first memory, for example, the result memory in the first memory.
  • src may include the result memory in the first memory.
  • the second memory is configured to store data related to a vector operation.
  • the second memory is configured to store a vector operation parameter.
  • the post-processing operation unit may obtain the vector operation parameter from the second memory.
  • the vector operation parameter is related to a vector operation type.
  • the vector operation parameter includes a bias (bias) operation parameter, a quantization parameter, or the like.
  • the vector operation parameter is a channel wise (channel wise) parameter, or a channel dimensional parameter.
  • the second memory may store only the vector operation parameter. Because the channel wise parameter occupies little storage space, a capacity requirement for the second memory can be reduced.
  • the first memory, for example, the input memory in the first memory, is further configured to store a parameter of the neural network.
  • the parameter of the neural network further includes the vector operation parameter.
  • the second memory may obtain the vector operation parameter from the input memory.
  • the vector operation parameter address includes the input memory in the first memory.
  • the post-processing operation unit may input the result of the vector operation into the first memory, for example, input the result into the input memory in the first memory.
  • the result of the vector operation may be used as the input data of the matrix operation.
  • a result of a vector operation at a current network layer may provide input data for a matrix operation at a next layer.
  • dst includes the input memory in the first memory.
  • a data path between the post-processing operation unit and the first memory is a configurable data path.
  • the data path may be enabled or disabled based on a requirement.
  • the post-processing operation unit may output the result of the vector operation to an external memory.
  • dst includes the external memory.
  • a data path between the post-processing operation unit and the external memory is a configurable data path.
  • the data path may be enabled or disabled based on a requirement.
  • the post-processing operation unit may further obtain, from the first memory, for example, the input memory in the first memory, data required for an element-wise (element-wise, elt-wise) operation.
  • the neural network accelerator may further include a vector calculation unit 530.
  • the post-processing unit is configured to perform some vector operations in the neural network model on a result of the matrix operation by using at least one of a plurality of dedicated modules; and the vector calculation unit is configured to perform remaining vector operations in the neural network model.
  • the vector calculation unit is a general-purpose processing module, and can implement various vector operations.
  • the vector calculation unit may complete a vector operation that cannot be performed by the post-processing unit.
  • the post-processing operation unit may output the result of the vector operation to the vector calculation unit.
  • dst includes the vector calculation unit.
  • a data path between the post-processing operation unit and the vector calculation unit is a configurable data path.
  • the data path may be enabled or disabled based on a requirement.
  • a communication interface between the post-processing unit and a corresponding module may be set based on the foregoing data transmission relationship, that is, a connection relationship between the post-processing unit and the corresponding module is set.
  • the matrix unit may perform a bias operation.
  • the second memory may provide the bias parameter to the matrix operation unit, that is, a data path exists between the second memory and the matrix operation unit.
  • The foregoing connection relationship is merely an example, and the connection relationship in the neural network accelerator may alternatively be set in other forms based on different application scenarios and limitation conditions.
  • the post-processing unit may further include a communication interface between the post-processing unit and another module, or the post-processing unit may not include the foregoing one or more communication interfaces. This is not limited in this embodiment of this application.
  • no communication interface between the post-processing operation unit and the external memory is disposed in the neural network accelerator, and a result output by the post-processing operation unit is transmitted to an internal memory, for example, the first memory.
  • no communication interface between the post-processing operation unit and the first memory is disposed in the neural network accelerator, and a result output by the post-processing operation unit is transmitted to the external memory.
  • no communication interface between the first memory and the post-processing operation unit may be disposed in the neural network accelerator.
  • the post-processing operation unit may obtain, by using another data interface, data required for the elt-wise operation.
  • the accelerator in FIG. 5 is merely an example. In a specific implementation process, a person skilled in the art should understand that the accelerator in FIG. 5 may further include another device required for normal running. In addition, based on a specific requirement, a person skilled in the art should understand that the accelerator in FIG. 5 may further include a hardware device for implementing another additional function. In addition, a person skilled in the art should understand that the accelerator in FIG. 5 may include only devices required for implementing this embodiment of this application, and does not need to include all devices shown in FIG. 5 .
  • vector operations in the neural network model are completed using one or more dedicated modules.
  • the plurality of vector operations may be completed using the plurality of dedicated modules respectively, thereby increasing a vector operation speed and further improving processing efficiency of the neural network model.
  • In this way, use of a vector calculation unit can be avoided as much as possible, thereby reducing a requirement for the vector calculation unit, helping optimize an area of the vector calculation unit, and improving performance and an energy efficiency ratio of the neural network accelerator. Further, this facilitates decoupling from the vector calculation unit.
  • no vector calculation unit is disposed in the neural network accelerator, and all vector operations in the neural network model are completed using one or more dedicated modules, thereby further improving performance and an energy efficiency ratio.
  • the post-processing unit is decoupled from an existing processor architecture, and is applicable to different neural-network processing unit architectures.
  • the neural network accelerator may use an architecture of an existing neural-network processing unit.
  • the neural network accelerator may use the architecture shown in FIG. 4 , and only a post-processing unit needs to be disposed in the neural-network processing unit shown in FIG. 4 .
  • the at least one dedicated module is determined based on a structure of the neural network model.
  • the at least one dedicated module is determined based on a network feature or an operation requirement of the neural network model.
  • One or more dedicated modules may be selected from the plurality of dedicated modules based on an operation requirement of the neural network model, to perform a corresponding vector operation.
  • A necessary dedicated module, that is, the at least one dedicated module, is enabled, and an unnecessary module is bypassed (bypass).
  • the at least one enabled dedicated module is determined based on a structure of the neural network model.
  • the at least one dedicated module of the plurality of modules is enabled based on a structure of the neural network model to perform a corresponding vector operation.
  • the post-processing unit may include one or more bypass connection lines.
  • Each of the one or more bypass connection lines is disposed at two ends of one or more dedicated modules, and is configured to bypass the one or more dedicated modules.
  • a bypass connection line is disposed at two ends of each dedicated module. In this way, an unnecessary dedicated module may be flexibly bypassed based on a requirement.
  • two ends of the plurality of dedicated modules include a bypass connection line, and the enabled bypass connection line is determined based on a structure of the neural network model.
  • One or more bypass connection lines are enabled based on a structure of the neural network model, to bypass an unnecessary module, so that the at least one dedicated module performs a corresponding vector operation.
  • the at least one dedicated module is indicated by an instruction or a parameter of the register in the post-processing unit.
  • the enabled bypass connection line may be indicated by an instruction or a parameter of the register in the post-processing unit.
  • the at least one enabled dedicated module is indicated by an instruction or a parameter of the register in the post-processing unit.
  • the at least one dedicated module is determined based on a structure of the neural network model.
  • the operations performed by the post-processing unit are determined based on a structure of the neural network model.
  • the post-processing unit is configured based on a vector operation that needs to be performed in an operation process of the neural network model, so that the post-processing unit performs a corresponding vector operation.
  • the dedicated module may be configured, by using an instruction or a parameter of the register in the post-processing unit, to perform an operation that needs to be performed in the neural network in the plurality of operations.
  • the neural network model may include a plurality of network layers, and data processing procedures after all the network layers may be the same or different.
  • a dedicated module may be configured based on a data processing requirement after each network layer.
  • a dedicated module in the post-processing unit may be configured based on a requirement, to meet different vector operation requirements in a plurality of neural network models, so that the neural network accelerator is applicable to operations of the plurality of neural network models.
  • If the vector calculation unit is used, when image division is performed in the buffer in advance, both a storage space requirement of the matrix unit and a storage space requirement of the vector calculation unit need to be considered, to meet a storage size limitation condition in the neural network accelerator, and pipeline control needs to be performed on an operation in the vector calculation unit.
  • the vector operation in the neural network model is completed in an associated manner by enabling an operation supported by the post-processing unit, and only the storage space requirement of the matrix unit needs to be considered when image division is performed in the buffer.
  • there is no need to perform a pipeline control operation on an operation in the post-processing unit, which helps simplify program implementation.
  • the plurality of dedicated modules may be parallel modules.
  • the plurality of dedicated modules support a multi-level pipeline processing manner.
  • dedicated modules corresponding to the plurality of vector operations may process data in a multi-level pipeline parallel manner.
  • the neural network includes a plurality of operations such as quantization, an operation of an activation function, and pooling.
  • the post-processing unit may obtain some results (for example, a result 1# and a result 2#) of the matrix operation.
  • a dedicated module 1# performs a quantization operation on the result 1#, and writes the result into a register.
  • a dedicated module 2# obtains data from the register and performs an operation of an activation function.
  • the dedicated module 1# may perform a quantization operation on the result 2#. In this way, utilization of the dedicated modules can be improved, and performance of the neural network accelerator can be further improved.
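As an illustration of this multi-level pipeline, the following Python sketch models two dedicated modules processing successive partial results of the matrix operation. The tile shape, the quantize and relu helpers, and the single register between the stages are illustrative assumptions, not the actual hardware behavior; the hardware performs the two stages concurrently, which is only modeled sequentially here.

```python
import numpy as np

def quantize(tile, scale):
    # Dedicated module 1#: quantization (floating point -> INT8), simplified model.
    return np.clip(np.round(tile * scale), -128, 127).astype(np.int8)

def relu(tile):
    # Dedicated module 2#: operation of an activation function (RELU).
    return np.maximum(tile, 0)

def pipelined_post_process(result_tiles, scale):
    """Two-stage pipeline: while module 2# activates result i, module 1#
    already quantizes result i+1 (modeled sequentially in this sketch)."""
    outputs = []
    stage_register = None                      # models the register between the two modules
    for tile in list(result_tiles) + [None]:   # one extra step to drain the pipeline
        if stage_register is not None:
            outputs.append(relu(stage_register))    # stage 2 consumes the register
        if tile is not None:
            stage_register = quantize(tile, scale)  # stage 1 refills the register
    return outputs

# Example: two partial results (result 1# and result 2#) of a matrix operation.
tiles = [np.random.randn(16, 16).astype(np.float32) for _ in range(2)]
print(len(pipelined_post_process(tiles, scale=64.0)))  # -> 2
```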
  • the neural network includes steps of a plurality of vector operations, for example, steps such as a quantization operation, an operation of an RELU, and a pooling operation.
  • parallelism between the plurality of steps cannot be implemented by using an existing vector calculation unit.
  • the plurality of dedicated modules in this embodiment of this application may enable the plurality of operations to be performed in a pipeline parallel manner, thereby improving performance of the neural network accelerator.
  • If the existing vector calculation unit is used to perform the steps of the plurality of vector operations, a result needs to be written into the buffer each time one step is performed.
  • the plurality of dedicated modules complete the plurality of operations in the multi-level pipeline manner, and there is no need to write a vector operation into a buffer each time the vector operation is performed. This can reduce a quantity of times of read/write for the buffer, reduce power consumption, and help improve an energy efficiency ratio.
  • the plurality of dedicated modules can support operations of one or more data types, for example, a 16-bit floating point number (floating point 16, FP16), a 32-bit integer number (short int32, S32), or a 16-bit integer number (short int16, S16).
  • Data types supported by all the plurality of dedicated modules may be the same or different.
  • a data type used by the at least one dedicated module is indicated by an instruction or a parameter of the register in the post-processing unit.
  • the post-processing unit 520 is specifically configured to perform, in a process of moving the result of the matrix operation from a memory of the matrix unit to another memory, a vector operation on the result of the matrix operation by using the at least one dedicated module.
  • the post-processing unit 520 may be used as a data path between the memory in the matrix unit and the another memory, to implement associated movement. In other words, in a process of moving data, the post-processing unit 520 performs a vector operation in an associated manner.
  • Associated execution means that a related operation is performed during data movement.
  • Each of the plurality of dedicated modules may support “associated” execution.
  • the post-processing unit 520 may be configured to complete data movement between the first memory and another memory.
  • the another memory may be the result memory.
  • the post-processing unit and the matrix unit can process data in parallel.
  • the post-processing unit and the matrix unit basically do not generate additional time overheads in a process of processing data in parallel, thereby further improving performance of the neural network accelerator.
  • the post-processing unit may directly move the result of the matrix operation from the memory of the matrix unit to another memory.
  • the post-processing unit can perform a direct memory access (direct memory access, DMA) operation, or the post-processing unit can implement direct data movement.
  • the post-processing unit 520 removes, in the process of moving the result of the matrix operation from the memory of the matrix unit to the another memory, invalid data from the result of the matrix operation in an associated manner.
  • Deleting invalid data from the result of the matrix operation in the associated manner means deleting the invalid data from the result of the matrix operation when the result of the matrix operation is moved from the memory of the matrix unit.
  • the invalid data may also be referred to as dummy (dummy) data or junk data.
  • the plurality of dedicated modules include at least one of the following: a quantization module or an element-wise (element-wise, elt-wise) operation module.
  • the quantization module is configured to perform at least one of the following operations on data input to the quantization module: a quantization operation, a dequantization operation, or a weighting operation.
  • the data input to the quantization module is determined based on the result of the matrix operation.
  • the dequantization operation includes a vector operation of converting data of an integer data type into data of a floating-point data type.
  • the quantization operation includes a vector operation of converting data of a floating-point data type into data of an integer data type.
  • the weighting operation includes a vector operation of converting data of an integer data type into data of an integer data type with a smaller quantity of bits (bit), that is, converting data of an integer data type with more bits into data of an integer data type with fewer bits.
  • the quantity of bits of data obtained after the weighting operation is less than that of data before the weighting operation.
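For illustration only, the three operations of the quantization type can be sketched in numpy as follows. The linear scale-based scheme and the chosen bit widths (S32 input, FP32 intermediate, S8 output) are assumptions for this sketch and do not describe the actual quantization circuit.

```python
import numpy as np

def dequantize(x_int32, scale):
    # Dequantization: integer data -> floating-point data.
    return x_int32.astype(np.float32) * scale

def quantize(x_float, scale):
    # Quantization: floating-point data -> integer data.
    return np.clip(np.round(x_float / scale), -128, 127).astype(np.int8)

def requantize(x_int32, scale_in, scale_out):
    # Weighting operation: integer data -> integer data with fewer bits.
    return quantize(dequantize(x_int32, scale_in), scale_out)

acc = np.array([[120000, -4096], [512, 77]], dtype=np.int32)  # matrix-operation result
print(dequantize(acc, 1e-4))          # S32 -> FP32
print(requantize(acc, 1e-4, 0.1))     # S32 -> S8
```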
  • the quantization module is configured to perform any one of the following operations: a quantization operation, a dequantization operation, or a weighting operation.
  • the quantization module supports only one of a quantization operation, a dequantization operation, or a weighting operation.
  • the quantization module can only be configured to perform an operation supported by the module.
  • a plurality of quantization modules may be disposed in the post-processing unit to separately perform different operations such as a quantization operation, a dequantization operation, or a weighting operation.
  • the quantization module may support at least two of a quantization operation, a dequantization operation, or a weighting operation.
  • the quantization module may be configured to perform any operation supported by the module.
  • the quantization module is instructed, by using a program instruction or a parameter of the register, to perform any one of the foregoing operations. In this way, an application scope of the neural network accelerator can be improved, that is, operations of more types of neural network models can be supported.
  • An example in which the quantization module supports the quantization operation, the dequantization operation, and the weighting operation is used below for description, and does not constitute a limitation on the solution in this embodiment of this application.
  • the quantization operation, the dequantization operation, and the weighting operation are collectively referred to as operations of a quantization type.
  • the element-wise operation module is configured to perform an element-wise operation on data input to the element-wise operation module, where the data input to the element-wise operation module is determined based on the result of the matrix operation.
  • the element-wise operation is an operation between two tensors. Specifically, the element-wise operation is an operation between corresponding elements in two tensors. The corresponding elements in the two tensors are elements at identical locations in the two tensors.
  • An arithmetic operation between two tensors may be considered as an element-wise operation, for example, an addition operation, a subtraction operation, or a multiplication operation.
  • an element-wise addition operation and an element-wise subtraction operation may be used in a neural network having a residual connection or used in an adder neural network.
  • An element-wise operation may be understood as performing an arithmetic operation on corresponding elements one by one in two feature maps whose shapes (shape) are the same.
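For illustration, an element-wise operation between two feature maps of identical shape can be sketched as follows (the NHWC shape and FP16 data type are assumptions for the example):

```python
import numpy as np

a = np.random.randn(1, 4, 4, 8).astype(np.float16)  # feature map determined from the matrix operation result
b = np.random.randn(1, 4, 4, 8).astype(np.float16)  # feature map from another address

elt_add = a + b   # element-wise addition, e.g. for a residual connection
elt_sub = a - b   # element-wise subtraction
elt_mul = a * b   # element-wise multiplication
```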
  • Input data of the element-wise operation module may come from a same address, or may come from different addresses.
  • the element-wise operation module can support element-wise operations of one or more data types.
  • the element-wise operation module can support a 16-bit floating point number (floating point 16, FP16) and a 16-bit integer number (short int16, S16).
  • a data type used by the element-wise operation module may be determined based on a data type input to the element-wise operation module, for example, determined based on a data type of a result of a matrix operation.
  • the plurality of dedicated modules include an element-wise operation module or a quantization module, and can support operations of more types of neural network models, thereby improving flexibility of the neural network accelerator.
  • the plurality of dedicated modules further include at least one of the following: a bias (bias) operation module, an activation function module, or a pooling (pooling) module.
  • the bias operation module is configured to perform a bias operation on data input to the bias operation module, where the data input to the bias operation module is determined based on the result of the matrix operation.
  • the bias operation module can support a bias addition operation of one or more data types.
  • the data type includes integer data or floating-point data.
  • a data type of the bias operation module may be configured based on an operation requirement of the neural network model.
  • the data type of the bias operation module is configured based on a data type of a result of a matrix operation.
  • the activation function module is configured to process, according to an activation function, data input to the activation function module, where the data input to the activation function module is determined based on the result of the matrix operation.
  • the activation function includes a sigmoid function, a tanh function, an RELU-type activation function, or the like.
  • the RELU-type activation function includes a rectified linear unit (rectified linear unit, RELU), a leaky rectified linear unit (Leaky-RELU), a parametric rectified linear unit (parametric rectified linear unit, PRELU), or the like.
  • the activation function module is configured to perform any one of the following operations: an operation of an RELU, an operation of a PRELU, or an operation of a Leaky-RELU.
  • the activation function module supports only an operation of an RELU, an operation of a PRELU, or an operation of a Leaky-RELU. In this case, the activation function module can only be configured to perform an operation supported by the module.
  • a plurality of activation function modules may be disposed in the post-processing unit to separately perform operations of different types of activation functions.
  • the activation function module may support at least two of an operation of an RELU, an operation of a PRELU, or an operation of a Leaky-RELU.
  • the activation function module may be configured to perform any operation supported by the module.
  • the activation function module is instructed, by using a program instruction or a parameter of the register, to perform any one of the foregoing operations.
  • a specific operation may be configured based on an operation required in the neural network model. In this way, an application scope of the neural network accelerator can be improved, that is, operations of more types of neural network models can be supported.
  • the activation function module is configured to perform an operation of an activation function or an operation of a lookup table.
  • the activation function module may replace an operation circuit such as a multiplication circuit of an activation function in a manner of lookup table (lookup table, LUT), to implement calculation of more types of nonlinear activation functions.
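A lookup-table-based activation can be sketched as follows. The 256-entry table, the input range, the sigmoid function, and the linear interpolation are illustrative assumptions; the actual LUT organization of the activation function module is not specified here.

```python
import numpy as np

# Build a 256-entry lookup table for a nonlinear activation (sigmoid here) over [-8, 8].
XS = np.linspace(-8.0, 8.0, 256, dtype=np.float32)
LUT = 1.0 / (1.0 + np.exp(-XS))

def lut_activation(x):
    # Replace the multiplication/exponential circuits with a table lookup plus linear interpolation.
    return np.interp(x, XS, LUT).astype(np.float32)

print(lut_activation(np.array([-10.0, 0.0, 2.5], dtype=np.float32)))
```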
  • the pooling module is configured to perform pooling processing on data input to the pooling module, where the data input to the pooling module is determined based on the result of the matrix operation.
  • the pooling module may be configured to perform at least one of the following operations: a max pooling (max pooling) operation or an average pooling (average pooling) operation.
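For illustration, the two pooling operations can be sketched as follows, assuming a 2x2 window with stride 2 and an NHWC layout (parameters chosen for the example only):

```python
import numpy as np

def pool2x2(x, mode="max"):
    """x: [N, H, W, C] with even H and W; 2x2 window, stride 2 (illustrative only)."""
    n, h, w, c = x.shape
    blocks = x.reshape(n, h // 2, 2, w // 2, 2, c)
    if mode == "max":
        return blocks.max(axis=(2, 4))   # max pooling
    return blocks.mean(axis=(2, 4))      # average pooling

x = np.random.randn(1, 4, 4, 8).astype(np.float16)
print(pool2x2(x, "max").shape, pool2x2(x, "avg").shape)  # (1, 2, 2, 8) (1, 2, 2, 8)
```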
  • the pooling module is configured to perform any one of the following operations: a max pooling operation or an average pooling operation.
  • the pooling module supports only a max pooling operation or an average pooling operation.
  • the pooling module can only be configured to perform an operation supported by the module.
  • a plurality of pooling modules may be disposed in the post-processing unit to separately perform different types of pooling operations.
  • the pooling module may support a max pooling operation and an average pooling operation.
  • the pooling module may be configured to perform any operation supported by the module.
  • the pooling module is instructed, by using a program instruction or a parameter of the register, to perform any one of the foregoing operations.
  • a specific operation may be configured based on an operation required in the neural network model. In this way, an application scope of the neural network accelerator can be improved, that is, operations of more types of neural network models can be supported.
  • the post-processing unit may control a bit width of read data by using a mode control bit (mode control bit, MCB).
  • the post-processing unit may control, by using the MCB, a bit width of the read result of the matrix operation, to reduce a requirement for a read bit width of the first memory.
  • the quantization module includes a first quantization module and a second quantization module.
  • the activation function module includes a first activation function module and a second activation function module.
  • the plurality of dedicated modules in the post-processing unit 520 are connected in the following sequence: the bias operation module, the first quantization module, the first activation function module, the pooling module, the element-wise operation module, the second activation function module, and the second quantization module.
  • The connection sequence of the dedicated modules in this embodiment of this application is obtained by analyzing a typical neural network model.
  • This connection sequence of the dedicated modules, together with connection sequences obtained by bypassing or enabling some of the dedicated modules, can support most post-processing procedures and cover most neural network requirements.
  • Using a vector calculation unit is avoided as much as possible, thereby helping optimize an area of the vector calculation unit, and improving performance and an energy efficiency ratio of the neural network accelerator. Further, decoupling from the vector calculation unit is facilitated.
  • no vector calculation unit is disposed in the neural network accelerator, and all vector operations in the neural network model are completed using one or more dedicated modules, thereby further improving performance and an energy efficiency ratio.
  • the first quantization module and the second quantization module may be implemented by using different hardware circuits.
  • the first activation function module and the second activation function module may be implemented by using different hardware circuits.
  • Sharing a hardware module can reduce a requirement for operators, reduce costs, and reduce power consumption.
  • a same hardware module may implement a corresponding function in an alternative form.
  • a same hardware module may implement a plurality of functions simultaneously.
  • FIG. 6 is a schematic flowchart of a processing process of a post-processing unit.
  • a pooling module and an element-wise operation module may share a hardware circuit.
  • the hardware circuit may be configured to perform pooling processing or perform an elt-wise operation, that is, implement a corresponding function in an alternative form.
  • the plurality of dedicated modules in the post-processing unit 520 may alternatively be understood as being connected in the following sequence: the bias operation module, the first quantization module, the first activation function module, the pooling module, the second activation function module, and the second quantization module.
  • the plurality of dedicated modules in the post-processing unit 520 may be understood as being connected in the following sequence: the bias operation module, the first quantization module, the first activation function module, the element-wise operation module, the second activation function module, and the second quantization module.
  • FIG. 7 is a schematic flowchart of a processing process of another post-processing unit.
  • an activation function used by an activation function module is an RELU-type activation function
  • a quantization module and the activation function module may share a hardware module, or may not share a hardware module.
  • an activation function used by a first activation function module is an RELU-type activation function.
  • An activation function used by a second activation function module is an RELU-type activation function.
  • a first quantization module and the first activation function module may share a hardware module, or may not share a hardware module.
  • a second quantization module and the second activation function module may share a hardware module, or may not share a hardware module.
  • the following uses an example in which the first quantization module and the first activation function module share a hardware module for description.
  • the hardware module may be configured to perform an operation of a quantization type.
  • the hardware module may be configured to perform an operation of an RELU-type activation function.
  • the hardware module may be configured to perform an operation of a quantization type and an operation of an RELU-type activation function. Specifically, a parameter of the operation of the quantization type and a parameter of the RELU-type activation function may be combined, or the operation of the quantization type and the operation of the RELU-type activation function may be combined.
  • the hardware module may perform the operation of the quantization type and the operation of the RELU-type activation function based on a combined parameter. For a specific implementation, refer to the following description. Details are not described herein again.
  • a multiplication operation used in the RELU-type activation function and a multiplication operation in an operation of a quantization type are combined and merged into one multiplication operation by using parameters, so that the activation function module and the quantization module can share a hardware module, thereby reducing a requirement for operators, reducing costs, and reducing power consumption.
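As a sketch of this merging, assuming the dequantization is a multiplication by a coefficient M1 (with M1 > 0) and the activation is a Leaky-RELU with coefficient a, the two multiplications can be combined as follows:

```latex
% Dequantization followed by a Leaky-RELU, merged into a single multiplication
y = M_1 x \quad (M_1 > 0), \qquad
\mathrm{LeakyRELU}(y) = \begin{cases} y, & y \ge 0 \\ a\, y, & y < 0 \end{cases}
% Because M_1 > 0, the sign of y equals the sign of x, so
\mathrm{LeakyRELU}(M_1 x) = \begin{cases} M_1\, x, & x \ge 0 \\ M_2\, x, & x < 0 \end{cases},
\qquad M_2 = a\, M_1 .
```

The PRELU case is analogous, with a per-channel coefficient in place of a.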
  • an unnecessary module may be bypassed by using a bypass connection line.
  • a bypass connection line is disposed at two ends of each dedicated module shown in FIG. 6 and FIG. 7 .
  • any dedicated module in FIG. 6 and FIG. 7 may be bypassed.
  • an unnecessary dedicated module may be bypassed based on an operation requirement configuration of the neural network model, to implement an operation of the neural network model, to complete a plurality of vector operations in a multi-level pipeline parallel manner.
  • FIG. 7 is used as an example.
  • Post-processing operations after a convolutional layer include a bias operation, a dequantization operation, and an operation of an RELU.
  • a pooling module, an element-wise operation module, the second activation function module, and the second quantization module in FIG. 7 may be bypassed.
  • the first quantization module is configured to perform a dequantization operation, and the first activation function module is configured to perform an operation of an RELU.
  • a manner of setting a bypass connection line in FIG. 6 and FIG. 7 is merely an example.
  • a bypass connection line may be further disposed at two ends of a plurality of modules, that is, when the bypass connection line is enabled, the plurality of modules are bypassed.
  • a bypass connection line may be further disposed at two ends of some of the plurality of dedicated modules.
  • the post-processing unit is further configured to perform a data format conversion operation.
  • a data format conversion operation may also be referred to as a data rearrangement operation.
  • a plurality of elements in the result of the matrix operation are arranged based on a first location relationship, and the post-processing unit is further configured to arrange the plurality of elements based on a second location relationship, to obtain a target result.
  • a location relationship between elements may be indicated by storage addresses of the elements.
  • the data format conversion operation includes an operation of converting depth data of a feature in a result of a matrix operation into space data.
  • a quantity of channels for the result of the matrix operation is greater than a quantity of channels for the target result, a height of the result of the matrix operation is less than a height of the target result, and a width of the result of the matrix operation is less than a width of the target result.
  • the post-processing unit may implement a conversion operation from depth data of a feature to space data (depth to space, depth2space).
  • In the depth2space conversion, a plurality of pieces of channel wise data in the feature map are stored in locations of width and height wise data, that is, data in the channel dimension is rearranged into the width and height dimensions.
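A depth2space conversion of this kind can be sketched in numpy as follows; the NHWC layout and the block size r are assumptions for the example:

```python
import numpy as np

def depth2space(x, r):
    """Rearrange channel (depth) data into spatial data.
    x: [N, H, W, C*r*r] -> [N, H*r, W*r, C] (NHWC layout assumed)."""
    n, h, w, c = x.shape
    c_out = c // (r * r)
    x = x.reshape(n, h, w, r, r, c_out)
    x = x.transpose(0, 1, 3, 2, 4, 5)          # interleave each r*r block into H and W
    return x.reshape(n, h * r, w * r, c_out)

y = depth2space(np.arange(1 * 2 * 2 * 8).reshape(1, 2, 2, 8).astype(np.float32), r=2)
print(y.shape)  # (1, 4, 4, 2): greater height/width, fewer channels, as described above
```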
  • the result of the matrix operation is an output result of a convolution operation obtained according to a Winograd algorithm
  • the target result is an output result of the convolution operation
  • a data format conversion operation may need to be performed.
  • a storage format of a result of a matrix operation may not meet a storage format of input data of a next matrix operation.
  • a data format conversion operation needs to be performed, to meet the storage format of the input data of the next matrix operation.
  • the matrix operation is a convolution operation.
  • An output result of the convolution operation may be directly obtained through calculation, that is, the convolution operation is performed on two input matrices to obtain the output result.
  • a storage format of the output result can generally meet the storage format of the input data of the next matrix operation.
  • an output result of the convolution operation may be obtained according to a Winograd algorithm.
  • the result obtained according to the Winograd algorithm is referred to as an output result of Winograd convolution.
  • Although the output result of the Winograd convolution is consistent with the output result of the convolution operation, the output results obtained by the two are stored in different data formats.
  • the post-processing unit may convert a data format of the output result of the Winograd convolution into a data format of the output result of the convolution operation.
  • the post-processing unit may arrange elements in an output matrix of the Winograd convolution based on storage addresses of corresponding elements in an output matrix of the convolution operation.
  • the storage addresses of the elements in the output matrix of the convolution operation indicate the second location relationship.
  • the elements in the output matrix of the Winograd convolution are written into destination addresses of the corresponding elements in the output matrix of the convolution operation.
  • the destination addresses of the corresponding elements in the output matrix of the convolution operation indicate the second location relationship.
  • the post-processing unit may be configured to implement a data rearrangement operation of the output result of the Winograd convolution in collaboration with the Winograd algorithm.
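For illustration only, the following sketch shows one possible rearrangement for an F(2x2, 3x3) Winograd convolution, assuming the matrix unit emits the output as per-tile 2x2 blocks of shape [tiles_h, tiles_w, 2, 2, C]; the actual storage format of the Winograd output is not specified in this application, so this layout is purely an assumption.

```python
import numpy as np

def winograd_tiles_to_conv_layout(tiles):
    """Scatter per-tile 2x2 Winograd outputs to the row-major layout of the
    ordinary convolution output.
    tiles: [tiles_h, tiles_w, 2, 2, C] -> [tiles_h*2, tiles_w*2, C] (assumed layout)."""
    th, tw, bh, bw, c = tiles.shape
    out = np.empty((th * bh, tw * bw, c), dtype=tiles.dtype)
    for i in range(th):
        for j in range(tw):
            # write each tile to the destination addresses of the corresponding
            # elements in the output matrix of the convolution operation
            out[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw, :] = tiles[i, j]
    return out

print(winograd_tiles_to_conv_layout(np.zeros((3, 3, 2, 2, 4), dtype=np.float16)).shape)
# (6, 6, 4)
```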
  • the post-processing unit may be further configured to perform a bit width conversion operation on floating-point data.
  • the post-processing unit may be further configured to perform a precision conversion operation on floating-point data.
  • The bit width conversion operation of the floating-point data may be performed by the quantization module, or may be performed by another dedicated module.
  • the bit width conversion operation of the floating-point data may share a hardware module with another operation, or the bit width conversion operation of the floating-point data may be performed by a separate hardware module.
  • the post-processing unit performs, in the process of moving the result of the matrix operation from the memory of the matrix unit to the another memory, the bit width conversion operation of the floating-point data in an associated manner.
  • the batch normalization operation may be integrated into a bias (bias) and a weight (weight), that is, the batch normalization operation is completed by performing a matrix operation and a bias addition operation. In this way, there is no need to provide an operation circuit separately to support the operation, thereby reducing hardware costs.
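As a sketch of this integration, using the standard batch normalization parameters (gamma, beta, mu, sigma^2, which are not symbols from this application), the operation can be folded into the weight and the bias as follows:

```latex
% Batch normalization y = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,
% applied after a convolution x = W * a + b, can be folded into the convolution:
W' = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}\, W ,\qquad
b' = \frac{\gamma\,(b - \mu)}{\sqrt{\sigma^2 + \epsilon}} + \beta ,
\qquad y = W' * a + b' .
```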
  • the post-processing unit in FIG. 7 includes a second memory and a plurality of dedicated modules.
  • the second memory is configured to store a vector operation parameter, for example, a bias parameter or a quantization parameter, and provide the vector operation parameter to a dedicated module.
  • the plurality of dedicated modules are connected in the following sequence: the bias operation module, the first quantization module, the first activation function module, the pooling module, the element-wise operation module, the second activation function module, and the second quantization module.
  • the vector operation parameter address includes a memory (memory), and src includes a matrix unit.
  • the matrix unit includes a convolution unit or a general matrix multiplication (general matrix multiplication, GEMM) unit.
  • the post-processing unit in FIG. 7 obtains a vector operation parameter from the memory, and obtains a result of the matrix operation from the matrix unit.
  • the memory in FIG. 7 may include an input memory in a first memory, and the matrix unit may include a result memory in the first memory.
  • the post-processing unit may not directly obtain the result of the matrix operation from the matrix unit.
  • the post-processing unit may alternatively obtain the result of the matrix operation from another memory.
  • dst includes a memory. There may be one or more memories. In other words, the output data of the post-processing unit may be sent to one destination address, or may be sent to a plurality of destination addresses.
  • dst may include a first memory. In this way, data processed by the post-processing unit may be used as input data of a next matrix operation.
  • dst may further include a memory outside the neural network accelerator.
  • dst may further include a vector calculation unit. In this case, the data processed by the post-processing unit may be further processed by the vector calculation unit.
  • src, dst, or a vector operation parameter address in FIG. 7 may be indicated by a hardware instruction, or may be indicated by a preconfigured register parameter.
  • Each of the plurality of dedicated modules in FIG. 7 may include one or more hardware operation circuits, and is configured to complete a vector operation required by the neural network model.
  • the hardware operation circuit includes a multiplication (multiplication, MUL) circuit, an add (add) circuit, a max (max) circuit, or the like.
  • the hardware operation circuit can support one or more data types, for example, S32, S16, or FP16.
  • a bypass connection line may be disposed at two ends of each dedicated module in the post-processing unit shown in FIG. 7 .
  • a corresponding bypass connection line may be enabled based on a requirement, to bypass a corresponding module. In this way, any one of the plurality of dedicated modules may be bypassed based on an operation requirement of the neural network model.
  • Whether each dedicated module in FIG. 7 is bypassed may be indicated by an instruction. Alternatively, whether each dedicated module is bypassed may be indicated by a preconfigured register parameter.
  • the post-processing unit 520 may be configured to perform, in a process of moving the result of the matrix operation from a memory of the matrix unit to another memory, a vector operation on the result of the matrix operation by using the at least one dedicated module.
  • the post-processing unit 520 may remove, in the process of moving the result of the matrix operation from the memory of the matrix unit to the another memory, invalid data from the result of the matrix operation in an associated manner.
  • a memory configured to store the result of the matrix operation may store the result in a unit of a fixed block size.
  • data stored in the result memory in the first memory may be stored in a unit of a fixed block size.
  • a parameter of each layer in the neural network model and input data of each layer do not necessarily meet a requirement of a fixed block size.
  • data is stored in the result memory based on a block size of 16*16.
  • a result obtained after matrix multiplication is performed on the input left matrix and the input right matrix is a 16*16 matrix, and is stored in the result memory.
  • a real matrix operation result obtained by multiplying the real left matrix by the real right matrix is an 8*8 matrix. As shown in FIG. 8 , data in the result memory other than the real matrix operation result is invalid data.
  • the real left matrix and the real right matrix are a left matrix and a right matrix required for an operation at a network layer of the neural network model.
  • a matrix operation result stored in the result memory is a 16*16 matrix, and the result in the result memory is greater than the real matrix operation result. After invalid data in two dimensions of the matrix operation result is removed (remove), a real calculation result, that is, an 8*8 matrix, may be obtained.
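Removal of the invalid data in this example amounts to keeping only the valid region of the stored block during the move, as sketched below (the 16*16 block size and the 8*8 valid size follow the example above; addressing details are omitted):

```python
import numpy as np

BLOCK = 16                 # fixed block size used by the result memory
VALID_M, VALID_N = 8, 8    # size of the real matrix operation result

stored_block = np.random.randn(BLOCK, BLOCK).astype(np.float16)  # content of the result memory
real_result = stored_block[:VALID_M, :VALID_N]   # drop invalid (dummy) data in both dimensions
# real_result can now be written to the destination address during the move
print(real_result.shape)  # (8, 8)
```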
  • the post-processing unit supports removal of invalid data in an associated manner. In this case, whether to enable the function may be configured based on a requirement. In another implementation, the post-processing unit may support removal of invalid data in an associated manner, or does not support removal of invalid data in an associated manner, that is, whether to enable the function cannot be configured based on a requirement. If input data of each module in the post-processing unit is a result of a matrix operation, the result of the matrix operation may be a result of deleting invalid data, or may be a result of not deleting invalid data. To avoid repetition, whether to delete invalid data is not distinguished in the following description when another module processes a result of a matrix operation.
  • the bias operation module is configured to perform a bias addition operation. Specifically, the bias operation module is configured to perform a bias addition operation on data input to the bias operation module. Based on the connection sequence shown in FIG. 7 , the data input to the bias operation module may be a result of a matrix operation.
  • the bias operation module may include an add circuit.
  • the bias operation module completes a bias addition operation by using the add circuit.
  • the bias operation module may support an operation on data of a type such as FP16, S16, or S32.
  • the bias operation module can support a bias addition operation of one or more data types.
  • the data type includes integer data or floating-point data.
  • a data type of the bias operation module may be configured based on an operation requirement of the neural network model.
  • the data type of the bias operation module is configured based on a data type of a result of a matrix operation.
  • A represents an input feature map (feature map) whose dimension is [N, H_in, W_in, C_in], that is, a feature map input to a convolutional layer, where N represents a quantity of input feature maps, H_in represents a height of the input feature map, W_in represents a width of the input feature map, and C_in represents a quantity of channels of the input feature map.
  • B represents a weight (weight) matrix whose dimension is [C_out, k_h, k_w, C_in], that is, a parameter matrix of a convolution kernel, where C_out represents a quantity of convolution kernels, that is, a quantity of channels of an output feature map, k_h represents a height of the convolution kernel, and k_w represents a width of the convolution kernel.
  • Bias represents bias data whose dimension is [C_out].
  • C represents an output feature map whose dimension is [N, H_out, W_out, C_out], where H_out represents a height of the output feature map, and W_out represents a width of the output feature map. C is the feature map output by the bias operation module.
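The bias addition described by these symbols can be sketched as a per-channel broadcast addition over the convolution result (numpy, NHWC layout assumed for the example):

```python
import numpy as np

N, H_out, W_out, C_out = 1, 4, 4, 8
conv_result = np.random.randn(N, H_out, W_out, C_out).astype(np.float32)  # result of the matrix operation on A and B
bias = np.random.randn(C_out).astype(np.float32)                          # Bias, dimension [C_out]

C = conv_result + bias   # broadcast over N, H_out, W_out: the output feature map
print(C.shape)           # (1, 4, 4, 8)
```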
  • the second memory obtains the bias parameter, and provides the bias parameter to the bias operation module.
  • the bias operation module performs a bias addition operation on a result of the matrix operation.
  • the post-processing unit may perform a bias addition operation on the result of the matrix operation in an associated manner by using the bias operation module while moving the result of the matrix operation.
  • a bypass connection line is disposed at two ends of the bias operation module shown in FIG. 7 , and is configured to bypass the bias operation module.
  • the bypass connection line is configured to bypass the bias operation module based on an operation requirement of the neural network model. For example, in an operation process of the neural network model, if a bias addition operation does not need to be performed after a matrix operation is performed at a network layer, the bypass connection line may be enabled to bypass the bias operation module after the matrix operation is performed at the network layer, and a subsequent operation is performed on a result of the matrix operation.
  • The bypass operation is described herein only by using the bias operation module as an example, and a bypass operation of another dedicated module is consistent with that of the bias operation module. For brevity of description, details are not described below.
  • the post-processing unit may control a bit width of read data by using an MCB.
  • the quantization module is configured to perform any one of the following operations: a quantization operation, a dequantization operation, or a weighting operation.
  • the quantization module can perform only one of the operations, for example, the dequantization operation.
  • a plurality of quantization modules may be disposed in the post-processing unit to separately perform different operations such as a quantization operation, a dequantization operation, or a weighting operation.
  • In an operation process of the neural network model, the quantization module may be configured to perform any one of the following operations: a quantization operation, a dequantization operation, or a weighting operation.
  • the quantization module is instructed, by using a program instruction, to perform any one of the foregoing operations.
  • a specific operation may be configured based on an operation required in the neural network model.
  • the quantization module can perform a quantization operation, a dequantization operation, and a weighting operation.
  • the quantization module may be configured to perform any one of the foregoing operations based on a requirement to implement a corresponding function.
  • this implementation is used as an example below to describe the solution in this embodiment of this application, and does not constitute a limitation on the solution in this embodiment of this application.
  • a dequantization operation may be involved in an operation process of the neural network. For example, in an inference phase of the neural network model, floating point number matrix multiplication may be replaced with fixed-point number matrix multiplication, to improve an energy efficiency ratio. However, in a vector operation after the matrix operation, due to a precision requirement, some vector operations still need to be calculated by using a floating point number. A matrix calculation result can be converted from a fixed-point number to a floating point number by performing the dequantization operation.
  • the dequantization operation includes a vector operation of converting data of an integer data type into data of a floating-point data type.
  • the first quantization module in FIG. 7 may be configured to perform a dequantization operation.
  • the first activation function module uses an RELU-type activation function
  • the first activation function module includes a multiplication circuit
  • the first quantization module includes a multiplication circuit
  • the first quantization module may further include an add circuit.
  • Each of the first activation function module and the first quantization module includes a multiplication circuit.
  • the RELU-type activation function includes an RELU, a Leaky-RELU, a PRELU, or the like.
  • RELU(y_hwc) = max(0, x_hwc), where RELU(y_hwc) represents a result obtained after processing through the RELU, and x_hwc represents data input to the activation function.
  • LeakyRELU(y_hwc) = x_hwc if x_hwc >= 0, or a*x_hwc if x_hwc < 0, where LeakyRELU(y_hwc) represents a result obtained after processing through the Leaky-RELU, and a represents a Leaky-RELU coefficient.
  • PRELU(y_hwc) = x_hwc if x_hwc >= 0, or a_c*x_hwc if x_hwc < 0, where PRELU(y_hwc) represents a result obtained after processing through the PRELU, and a_c represents a PRELU coefficient.
  • the first activation function module and the first quantization module may share a hardware module by merging an operation of the RELU-type activation function with a dequantization operation.
  • two multiplication operations namely, a multiplication operation used for the dequantization operation and a multiplication operation used for the operation of the RELU-type activation function, may be merged into one multiplication operation by using parameters.
  • the first activation function module and the first quantization module share a hardware module.
  • the shared hardware module may be configured to perform the dequantization operation.
  • the shared hardware module may be configured to perform the operation of the RELU-type activation function.
  • the hardware module may be configured to perform the dequantization operation and the operation of the RELU-type activation function.
  • a multiplication operation used in the RELU-type activation function and a multiplication operation in a dequantization operation are combined and merged into one multiplication operation by using parameters, so that the activation function module and the quantization module can share a hardware module, thereby reducing a requirement for operators, reducing costs, and reducing power consumption.
  • a process of sequentially performing operations of convolution (convolution), batch normalization, Scale (Scale), and the RELU-type activation function is a common operation procedure.
  • Parameters of the operations of batch normalization and Scale may be integrated into a bias (bias) and a weight (weight), that is, the batch normalization and Scale operations are completed by performing a matrix operation and a bias addition operation in convolution. In this way, no additional operation circuit needs to be provided.
  • the dequantization operation and the operation of the RELU-type activation function may share a hardware module, to reduce a requirement for an operator.
  • the following describes a process of performing a dequantization operation or an operation of an activation function in a scenario in which the quantization module and the activation function module share a hardware module.
  • the input data refers to data that is input to the shared hardware module.
  • the input data is obtained from the first memory, that is, the input data is obtained based on the result of the matrix operation.
  • the result of the matrix operation is used as the input data.
  • the bias operation module is bypassed, and the input data is obtained from the first memory.
  • the input data is obtained based on an output result of the bias operation module.
  • an output result obtained after processing through the bias operation module is used as the input data.
  • a shift operation is performed based on an MCB control bit, and high-order data is read as the input data.
  • whether to configure the MCB may be selected through a program.
  • a manner of obtaining the input data may be determined based on a structure of the neural network model. This is not limited in this embodiment of this application.
  • Input data of each module in this embodiment of this application may be determined based on a structure of the neural network model. For example, refer to the foregoing manner for setting. To avoid repetition, details are not described below.
  • S12: Convert the input data into high-precision intermediate data, where the intermediate data is data of a floating point type.
  • "High precision" and "low precision" in embodiments of this application are relative. Precision of the input data only needs to be lower than that of the intermediate data.
  • the high-precision intermediate data may be intermediate data of an FP32 type.
  • S13: Configure a multiplication operation parameter for the intermediate data based on a structure of the neural network model, and multiply the intermediate data by the parameter.
  • For example, if the operation of the activation function is disabled, that is, the shared hardware module is configured to perform only the dequantization operation, the intermediate data is multiplied by a quantization coefficient M1, that is, the multiplication operation parameter is M1.
  • If the activation function is a Leaky-RELU, the intermediate data is multiplied by a first coefficient M2, where M2 is a product of the quantization coefficient M1 and the Leaky-RELU coefficient.
  • M2 may be stored in the second memory in FIG. 7 .
  • If the activation function is a PRELU, the intermediate data is multiplied by a coefficient M3, where M3 is a product of the quantization coefficient M1 and the PRELU coefficient.
  • M3 may be stored in the second memory in FIG. 7 .
  • the data processing procedure further includes step S14.
  • S14: Convert a data type of the data obtained in step S13 into a data type required for a subsequent operation.
  • the data type required for the subsequent operation may be a low-precision data type.
  • the data obtained in step S13 may be considered as data of a high-precision floating-point data type
  • the high-precision floating-point data type is converted into a low-precision floating-point data type in step S14, and then data of the low-precision floating-point data type can be transmitted to the subsequent operation.
  • step S14 is determined based on the data type required for the subsequent operation. In other words, if an operation after the shared hardware module requires the low-precision data type, step S14 is performed. If the operation after the shared hardware module does not require the low-precision data type, step S14 does not need to be performed.
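  • A minimal NumPy sketch of steps S11 to S14 is given below; the function name, the int32 accumulator input, and the single negative-branch coefficient (M1 for plain dequantization, M2 = M1 × the Leaky-RELU slope, or M3 = M1 × the PRELU slope) are illustrative assumptions, not the hardware implementation.

```python
import numpy as np

def dequantize_with_relu(acc_s32, m1, neg_coeff=None, out_dtype=np.float16):
    """m1: dequantization coefficient; neg_coeff: merged coefficient for negative inputs
    (None disables the activation, 0.0 gives a plain RELU, M2/M3 give Leaky-RELU/PRELU).
    One multiply covers both the dequantization and the activation."""
    x = acc_s32.astype(np.float32)                     # S12: widen to high-precision float
    if neg_coeff is None:
        y = x * m1                                     # S13: dequantization only
    else:
        y = np.where(x >= 0.0, x * m1, x * neg_coeff)  # S13: merged multiply
    return y.astype(out_dtype)                         # S14: narrow to the type the next op needs
```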
  • a quantization operation may be involved in the operation process of the neural network. For example, in an inference phase of the neural network model, floating point number matrix multiplication may be replaced with fixed-point number matrix multiplication in a matrix operation process, to improve an energy efficiency ratio. Due to a precision requirement, some operations need to be calculated by using a floating point number, to obtain floating-point data. For example, a result of a floating-point matrix multiplication operation, a pooling operation, or an ele-wise operation is floating-point data. Fixed-point data obtained by quantizing the floating-point data may be used as input data of a matrix operation at a next layer, for example, input data of a convolution operation.
  • the quantization operation includes a vector operation of converting data of a floating-point data type into data of an integer data type.
  • the second quantization module in FIG. 7 may be configured to perform a quantization operation.
  • the second activation function module uses an RELU-type activation function
  • the second activation function module includes a multiplication circuit
  • the second quantization module includes a multiplication circuit.
  • Each of the second activation function module and the second quantization module includes a multiplication circuit.
  • the second activation function module and the second quantization module may share a hardware module by merging an operation of the RELU-type activation function with a quantization operation.
  • two multiplication operations, namely a multiplication operation used for the quantization operation and a multiplication operation used for the operation of the RELU-type activation function, may be merged into one multiplication operation by using parameters.
  • the second activation function module and the second quantization module share a hardware module.
  • the shared hardware module may be configured to perform the quantization operation.
  • the shared hardware module may be configured to perform the operation of the RELU-type activation function.
  • the hardware module may be configured to perform the quantization operation and the operation of the RELU-type activation function.
  • a multiplication operation used in the RELU-type activation function and a multiplication operation in a quantization operation are combined and merged into one multiplication operation by using parameters, so that the activation function module and the quantization module can share a hardware module, thereby reducing a requirement for operators, reducing costs, and reducing power consumption.
  • the following describes a process of performing a quantization operation or an operation of an activation function in a scenario in which the quantization module and the activation function module share a hardware module.
  • the input data refers to data that is input to the shared hardware module.
  • a hardware module shared by the second quantization module and the second activation function module in FIG. 7 is enabled.
  • the input data may come from any module before the hardware module.
  • For a detailed description of the obtaining manner, refer to step S11. Details are not described herein again.
  • Step S22 is an optional step. For example, if the input data is high-precision floating-point data, step S22 does not need to be performed.
  • a multiplication operation parameter is configured for the intermediate data based on a structure of the neural network model.
  • the intermediate data is multiplied by M4.
  • the multiplication operation parameter is M4.
  • the operation of the activation function is disabled.
  • the shared hardware module is configured to perform only a quantization operation.
  • the intermediate data is multiplied by 0.
  • the multiplication operation parameter is 0.
  • the intermediate data is multiplied by a third coefficient M5, where M5 is a product of the quantization coefficient M4 and the Leaky-RELU coefficient.
  • M5 may be stored in the second memory in FIG. 7 .
  • the intermediate data is multiplied by a fourth coefficient M6, where M6 is a product of the quantization coefficient M4 and the PRELU coefficient.
  • M6 may be stored in the second memory in FIG. 7 .
  • Step S24: Convert a data type of the data obtained in step S23 into an integer data type.
  • Step S25: Add the data obtained in step S24 to an integer offset (offset).
  • the data processing procedure further includes step S26.
  • Step S26: Convert a data type of the data obtained in step S25 into a data type required for a subsequent operation.
  • Whether to perform step S26 is determined based on the data type required for the subsequent operation. If the data type of the data obtained in step S25 meets a requirement of the subsequent operation, step S26 does not need to be performed.
  • For specific descriptions of step S21 to step S26, refer to the foregoing calculation procedure of the dequantization operation and the operation of the RELU-type activation function.
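  • The following sketch mirrors steps S21 to S26 for the quantization case; M4, the merged negative-branch coefficient (0 for RELU, M5/M6 for Leaky-RELU/PRELU), the rounding mode, and the int8 output type are assumptions chosen only to make the example concrete.

```python
import numpy as np

def quantize_with_relu(x, m4, neg_coeff=None, offset=0, out_dtype=np.int8):
    """One multiply performs both the quantization and the RELU-type activation."""
    x = x.astype(np.float32)                                 # S22: widen if the input is low precision
    coeff = m4 if neg_coeff is None else np.where(x >= 0.0, m4, neg_coeff)
    y = np.rint(x * coeff).astype(np.int32)                  # S23/S24: multiply, then to integer
    y = y + offset                                           # S25: add the integer offset
    lo, hi = np.iinfo(out_dtype).min, np.iinfo(out_dtype).max
    return np.clip(y, lo, hi).astype(out_dtype)              # S26: type required by the next operation
```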
  • a weighting operation may be involved in the operation process of the neural network model.
  • a plurality of consecutive convolution operations may be calculated by using a fully fixed-point data type. Because the bit width of the fixed-point data used for the accumulation operation of a matrix is greater than the bit width of the data input to the matrix operation, before a result of one matrix operation is used as input data of a next layer, the bit width of the integer data needs to be reduced by performing a weighting operation, that is, an integer data type with a larger bit width is converted into an integer data type with a smaller bit width.
  • the weighting operation includes a vector operation of converting data of an integer data type into data of an integer data type with a smaller quantity of bits, that is, converting integer data with more bits into integer data with fewer bits.
  • the first quantization module or the second quantization module in FIG. 7 may be configured to perform the weighting operation.
  • the following uses an example in which the first quantization module performs the weighting operation for description.
  • the first activation function module uses an RELU-type activation function
  • the first activation function module includes a multiplication circuit
  • the first quantization module includes a multiplication circuit.
  • Each of the first activation function module and the first quantization module includes a multiplication circuit.
  • the first activation function module and the first quantization module may share a hardware module by merging an operation of the RELU-type activation function with a quantization operation.
  • two multiplication operations, namely a multiplication operation used for the weighting operation and a multiplication operation used for the operation of the RELU-type activation function, may be merged into one multiplication operation by using parameters.
  • the first activation function module and the first quantization module share a hardware module.
  • the shared hardware module may be configured to perform the weighting operation.
  • the shared hardware module may be configured to perform the operation of the RELU-type activation function.
  • the hardware module may be configured to perform the weighting operation and the operation of the RELU-type activation function.
  • a multiplication operation used in the RELU-type activation function and a multiplication operation in a weighting operation are combined and merged into one multiplication operation by using parameters, so that the activation function module and the quantization module can share a hardware module, thereby reducing a requirement for operators, reducing costs, and reducing power consumption.
  • the following describes a process of performing a weighting operation or an operation of an activation function in a scenario in which the quantization module and the activation function module share a hardware module.
  • the input data refers to data that is input to the shared hardware module.
  • a hardware module shared by the first quantization module and the first activation function module in FIG. 7 is enabled.
  • the input data may come from any module before the hardware module.
  • dedicated modules in FIG. 7 other than the first quantization module and the first activation function module are bypassed.
  • For a detailed description of the obtaining manner, refer to step S11. Details are not described herein again.
  • S32: Convert the input data into high-precision intermediate data, where the intermediate data is data of a floating-point type.
  • “High precision” and “low precision” in embodiments of this application are relative terms. The precision of the input data only needs to be lower than that of the intermediate data.
  • the high-precision intermediate data may be intermediate data of an FP32 type.
  • a multiplication operation parameter is configured for the intermediate data based on a structure of the neural network model.
  • the intermediate data is multiplied by M7.
  • the multiplication operation parameter is M7.
  • the operation of the activation function is disabled.
  • the shared hardware module is configured to perform only a weighting operation.
  • the intermediate data is multiplied by 0.
  • the multiplication operation parameter is 0.
  • the intermediate data is multiplied by a fifth coefficient M8, where M8 is a product of the weighting coefficient M7 and the Leaky-RELU coefficient.
  • M8 may be stored in the second memory in FIG. 7 .
  • the intermediate data is multiplied by a sixth coefficient M9, where M9 is a product of the weighting coefficient M7 and the PRELU coefficient.
  • M9 may be stored in the second memory in FIG. 7 .
  • Step S34: Convert a data type of the data obtained in step S33 into a low-precision integer data type.
  • Step S35: Add the data obtained in step S34 to an integer offset (offset).
  • Step S36: Convert a data type of the data obtained in step S35 into an integer data type with a smaller bit width.
  • A larger bit width and a smaller bit width are relative.
  • The bit width of the data obtained in step S35 only needs to be greater than the bit width of the data obtained through conversion.
  • The bit width of the data obtained through conversion is determined based on a data type required for a subsequent operation.
  • For specific descriptions of step S31 to step S36, refer to the foregoing calculation procedure of the dequantization operation and the operation of the RELU-type activation function.
  • a floating point number precision conversion operation may be involved in the operation process of the neural network model.
  • a high-precision floating-point data type is usually used in a matrix operation process
  • a low-precision floating-point data type is used for input matrices of the matrix operation, that is, a left matrix and a right matrix. Therefore, after the matrix operation, high-precision floating-point data needs to be converted into low-precision floating-point data, to be used for a next matrix operation.
  • the post-processing unit may be further configured to perform a bit width conversion operation on floating-point data.
  • the post-processing unit may be further configured to perform a precision conversion operation on a floating point number.
  • the first quantization module or the second quantization module may perform a bit width conversion operation on floating-point data.
  • the bit width conversion operation on the floating-point data and an operation of a quantization type may share a hardware module, so that consumption of an operator can be reduced.
  • Although the output result obtained according to the Winograd algorithm is consistent with the output result obtained through the convolution operation, the two output results are stored in different data formats. In other words, elements in the output result matrices obtained in the two manners are stored based on different location relationships. Generally, the output result obtained according to the Winograd algorithm cannot meet a storage format of input data of a next matrix operation.
  • a plurality of elements in the result of the matrix operation are arranged based on a first location relationship, and the post-processing unit is further configured to arrange the plurality of elements based on a second location relationship, to obtain a target result.
  • the post-processing unit may be configured to adjust a location relationship between the elements in the result of the matrix operation.
  • the post-processing unit may be further configured to perform a data rearrangement operation.
  • the result of the matrix operation is an output result of a convolution operation obtained according to a Winograd algorithm
  • the target result is an output result of the convolution operation.
  • the post-processing unit may arrange elements in an output matrix of the Winograd convolution based on storage addresses of corresponding elements in an output matrix of the convolution operation. In other words, the elements in the output matrix of the Winograd convolution are written into destination addresses of the corresponding elements in the output matrix of the convolution operation.
  • a Winograd inverse transform result is an output result of the Winograd convolution.
  • the output result of the Winograd convolution is stored in the result memory based on a data format shown in FIG. 10 .
  • C0 represents a quantity of channels
  • each of A, B, C, and D represents a storage location of an operation result in the result memory
  • 0_0 represents a first element in a first row of an operation result in a destination address
  • 1_0 represents a first element in a second row of the operation result in the destination address
  • other numbers in FIG. 10 are merely used to describe an arrangement manner of elements in an output result of the Winograd convolution and an arrangement manner of corresponding elements in an output result of the convolution operation, and do not have other limitation functions.
  • the post-processing unit may sequentially read the elements in the output result of the Winograd convolution in an order of ABCD, and write the elements into destination addresses of the corresponding elements in the output result of the convolution operation, to obtain a data format of the output result of the convolution operation, for processing by a subsequent layer.
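  • A sketch of this rearrangement is shown below; it assumes the Winograd result is stored as four blocks A, B, C, and D of shape [out_h/2, out_w/2, C0], each holding one element of every 2*2 output tile (in the spirit of FIG. 10), and scatters them into the plain convolution layout. The exact block-to-position mapping and the channels-last layout are assumptions.

```python
import numpy as np

def scatter_winograd_output(blocks, out_h, out_w):
    """blocks: array of shape [4, out_h // 2, out_w // 2, C0] (the A/B/C/D regions).
    Returns the convolution-layout result of shape [out_h, out_w, C0]."""
    a, b, c, d = blocks
    out = np.empty((out_h, out_w, a.shape[-1]), dtype=a.dtype)
    out[0::2, 0::2, :] = a   # element 0_0 of every 2x2 tile
    out[0::2, 1::2, :] = b   # element 0_1
    out[1::2, 0::2, :] = c   # element 1_0
    out[1::2, 1::2, :] = d   # element 1_1
    return out
```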
  • the pooling module may be configured to perform a pooling operation.
  • the pooling module may be configured to perform at least one of the following operations: a max pooling (max pooling) operation or an average pooling (average pooling) operation.
  • the pooling module may be a max circuit, an add circuit, or the like.
  • the pooling module may support an operation on data of a type such as S16 or FP16.
  • the pooling module may perform a pooling operation in collaboration with the Winograd algorithm.
  • each 2*2 matrix formed by elements corresponding to ABCD may be considered as an independent operation result.
  • the 2*2 matrix just forms a sliding window of the pooling operation with the stride of 2*2.
  • the pooling module may separately obtain elements in the 2*2 matrix, and perform a pooling operation with a stride of 2*2 in an associated manner, to obtain a result of the pooling with the stride of 2*2.
  • the pooling operation may be a max pooling operation, an average pooling operation, or the like.
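  • Under the same assumed [4, out_h/2, out_w/2, C0] layout as in the sketch above, the 2*2 stride-2 pooling collapses into a single element-wise reduction across the A/B/C/D blocks; this is only an illustration of the idea, not the hardware data path.

```python
import numpy as np

def pool_winograd_tiles(blocks, mode="max"):
    """Each 2x2 pooling window is exactly one index of the four A/B/C/D blocks."""
    return blocks.max(axis=0) if mode == "max" else blocks.mean(axis=0)
```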
  • the average pooling operation includes a global average pooling (global average pooling) operation.
  • the global average pooling operation calculates an average value by accumulating feature values in height and width (height*width, H*W) directions of a feature map (feature map). Most of the calculation workload of the global average pooling operation is the calculation workload of accumulation of the feature values in the H*W directions.
  • the pooling module may perform an accumulation operation in global average pooling by using an add circuit and an intermediate register.
  • the pooling module performs an accumulation operation in global average pooling by performing the following steps.
  • the intermediate register may be initialized through instruction configuration.
  • S42: Obtain a part of the input data, and accumulate the part of data into the intermediate register in an associated manner.
  • the input data in step S42 is data input to the pooling module.
  • the input data may be a result of a matrix operation, that is, step S42 is performed in a process of moving the data in the result memory.
  • precision of the data in the result memory is lower than precision of the data in the intermediate register.
  • the type of the data in the result memory may be S16, and the type of the data in the intermediate register may be S32. In this way, calculation precision can be improved, and data overflow can be prevented.
  • a quantity of channels for the result of the matrix operation is 16, that is, N is 16.
  • a part of the data in the result of the matrix operation, that is, data whose size is 1*4*16 in FIG. 12, is moved each time and accumulated into the intermediate register in an associated manner.
  • an element-wise addition operation is performed on the moved part of data in the result of the matrix operation and data in the intermediate register, and a size of an obtained summation (SUM) result is 1*4*16.
  • the pooling module may be further configured to perform a division operation on the result of the accumulation operation, to obtain a result of global average pooling.
  • the pooling module may further write the result of the accumulation operation into a destination address.
  • the destination address may be the vector calculation unit.
  • the accumulation result is output to the vector calculation unit, and the vector calculation unit performs a division operation.
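  • A NumPy sketch of the accumulation described above is shown below; the [H, W, 16] S16 input, the 1*4*16 tile size, and performing the final division locally (rather than in the vector calculation unit) are assumptions for illustration only.

```python
import numpy as np

def global_avg_pool(result_s16, tile=4):
    """Accumulate the H*W feature values per channel into a wider S32 register, then average."""
    h, w, c0 = result_s16.shape
    flat = result_s16.reshape(h * w, c0).astype(np.int32)    # S16 data, S32 accumulation precision
    acc = np.zeros((tile, c0), dtype=np.int32)               # intermediate register, initialized to zero
    for i in range(0, h * w, tile):
        chunk = flat[i:i + tile]                             # S42: move a 1*tile*c0 slice ...
        acc[:chunk.shape[0]] += chunk                        # ... and add it element-wise
    return acc.sum(axis=0).astype(np.float32) / (h * w)      # per-channel global average
```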
  • the element-wise operation module is configured to perform an element-wise operation on data input to the element-wise operation module.
  • the element-wise operation module may include an add circuit.
  • the element-wise operation module may support an operation on data of a type such as S16 or FP16.
  • Input of the element-wise operation module may come from a same address, or may come from different addresses.
  • a specific source of the input data may be determined based on a structure of the neural network model.
  • input of the element-wise operation module comes from the result memory and another memory separately.
  • the data from the result memory may be obtained by bypassing a plurality of dedicated modules before the element-wise operation module.
  • the input of the element-wise operation module may include a processing result of a preceding dedicated module and a result of a matrix operation.
  • the element-wise operation module obtains the processing result from the preceding dedicated module, and obtains the result of the matrix operation from the result memory.
  • a result of the element-wise operation may be processed by a subsequent dedicated module, or may bypass a subsequent dedicated module, and may be optionally written into other memories based on different hardware architectures.
  • the element-wise operation module and the pooling module may share a hardware module.
  • the element-wise operation module may be further configured to perform a pooling operation.
  • the element-wise operation module may be further configured to perform a part of a global average pooling (global average pooling) operation in an associated manner.
  • the element-wise operation module may be configured to perform an accumulation operation in the H*W directions of the feature map.
  • the element-wise operation module and the pooling module may share a hardware module. In this way, consumption of the operator can be further reduced.
  • the element-wise operation module may be further configured to perform another type of pooling operation, for example, perform the pooling operation in collaboration with the Winograd algorithm.
  • the activation function module is configured to process, according to an activation function, data input to the activation function module.
  • the activation function includes a sigmoid function, a tanh function, an RELU-type activation function, or the like.
  • an operation of the RELU-type activation function may be merged with the quantization operation, the dequantization operation, or the weighting operation.
  • the corresponding operation of the RELU-type activation function may be performed, based on the distribution of the data, before and after the element-wise operation.
  • each of the two ends of the element-wise operation module is connected to an activation function module, so that the corresponding operation of the RELU-type activation function may be performed, based on the distribution of the data, before and after the element-wise operation.
  • a bypass connection line is disposed at two ends of each of the element-wise operation module and the activation function modules, and is configured to bypass a corresponding module.
  • a data format conversion operation may be involved.
  • an upsampling (upsample) operation, that is, an operation of increasing a height and a width of a feature map, is usually used to improve resolution of the feature map.
  • the upsampling operation includes a conversion operation from depth data to space data (depth to space, depth2space).
  • the depth2space conversion operation is an operation of extending a depth dimension of the feature map to a space dimension, so that resolution of the feature map can be improved without consuming computing power.
  • the post-processing unit may be configured to perform the conversion operation from the depth data to the space data (depth to space, depth2space).
  • the post-processing unit may be configured to adjust a location relationship between the elements in the result of the matrix operation, to obtain a target result.
  • a quantity of channels for the result of the matrix operation is greater than a quantity of channels for the target result, a height of the result of the matrix operation is less than a height of the target result, and a width of the result of the matrix operation is less than a width of the target result.
  • FIG. 15 shows a depth2space conversion operation of converting 64 channels to 16 channels, to convert data of 48 channels of the 64 channels to locations at width and height layers of the remaining 16 channels.
  • the conversion operation from depth data to space data may include any one of the following: an operation of converting 128 channels to 32 channels, an operation of converting 128 channels to 8 channels, an operation of converting 64 channels to 16 channels, an operation of converting 64 channels to 4 channels, and an operation of converting 32 channels to 8 channels.
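  • The following sketch shows one possible depth2space rearrangement (for example, 64 channels to 16 channels with a 2*2 spatial block); the channels-last layout and the ordering of the block offsets inside the channel dimension are assumptions, since only the channel-count conversions listed above are fixed here.

```python
import numpy as np

def depth2space(x, block=2):
    """x: [H, W, C] -> [H * block, W * block, C // (block * block)], with no arithmetic."""
    h, w, c = x.shape
    c_out = c // (block * block)
    y = x.reshape(h, w, block, block, c_out)   # split channels into (row offset, col offset, c_out)
    y = y.transpose(0, 2, 1, 3, 4)             # interleave the offsets into the H and W axes
    return y.reshape(h * block, w * block, c_out)
```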
  • the post-processing unit further includes a multiplexer (multiplexer, MUX), and the MUX is configured to output a processing result of the post-processing unit.
  • the MUX may select output data from data input to the MUX, and output the output data as a processing result of the post-processing unit.
  • the post-processing unit in FIG. 7 may be further configured to implement another function.
  • the following describes a data processing method for a neural network accelerator in an embodiment of this application with reference to FIG. 16 .
  • the data processing method shown in FIG. 16 may be performed by the neural network accelerator shown in FIG. 5 .
  • For specific descriptions of the neural network accelerator shown in FIG. 5, refer to the foregoing related descriptions of the neural network accelerator.
  • repeated descriptions are properly omitted.
  • the method 1600 shown in FIG. 16 includes step S1610 and step S1620.
  • the neural network accelerator includes a matrix unit and a post-processing unit.
  • the matrix unit performs a matrix operation in a neural network model.
  • the post-processing unit performs a part of or all vector operations in the neural network model on a result of the matrix operation by using at least one of a plurality of dedicated modules, where one of the dedicated modules is configured to perform n vector operations, and n is a positive integer less than or equal to a first threshold.
  • the at least one dedicated module is determined based on a structure of the neural network model.
  • the at least one dedicated module is indicated by an instruction or a parameter of a register in the post-processing unit.
  • the plurality of dedicated modules include at least one of the following: a quantization module, an element-wise operation module, a bias operation module, an activation function module, or a pooling module, where the quantization module is configured to perform at least one of the following operations on data input to the quantization module: a quantization operation, a dequantization operation, or a weighting operation, where the data input to the quantization module is determined based on the result of the matrix operation; the element-wise operation module is configured to perform an element-wise operation on data input to the element-wise operation module, where the data input to the element-wise operation module is determined based on the result of the matrix operation; the bias operation module is configured to perform a bias operation on data input to the bias operation module, where the data input to the bias operation module is determined based on the result of the matrix operation; the activation function module is configured to process, according to an activation function, data input to the activation function module, where the data input to the activation function module is determined based on the result of the matrix operation; or the pooling module is configured to perform a pooling operation on data input to the pooling module, where the data input to the pooling module is determined based on the result of the matrix operation.
  • the quantization module includes a first quantization module and a second quantization module
  • the activation function module includes a first activation function module and a second activation function module
  • the plurality of dedicated modules in the post-processing unit are connected in the following sequence: the bias operation module, the first quantization module, the first activation function module, the pooling module, the element-wise operation module, the second activation function module, and the second quantization module.
  • the quantization module performs any one of the following operations: a quantization operation, a dequantization operation, or a weighting operation.
  • the activation function module performs any one of the following operations: an operation of an RELU, an operation of a PRELU, or an operation of a Leaky-RELU.
  • that the post-processing unit processes a result of the matrix operation by using at least one of a plurality of dedicated modules includes that the at least one dedicated module supports processing the result of the matrix operation in a multi-level pipeline processing manner.
  • that the post-processing unit processes a result of the matrix operation by using at least one of a plurality of dedicated modules includes that the post-processing unit performs, in a process of moving the result of the matrix operation from a memory of the matrix unit to another memory, a vector operation on the result of the matrix operation by using the at least one dedicated module.
  • the method further includes: the post-processing unit removes, in the process of moving the result of the matrix operation from the memory of the matrix unit to the another memory, invalid data from the result of the matrix operation.
  • a plurality of elements in the result of the matrix operation are arranged based on a first location relationship
  • the method further includes: The post-processing unit arranges the plurality of elements based on a second location relationship, to obtain a target result.
  • a quantity of channels for the result of the matrix operation is greater than a quantity of channels for the target result, a height of the result of the matrix operation is less than a height of the target result, and a width of the result of the matrix operation is less than a width of the target result.
  • the result of the matrix operation is an output result of a convolution operation obtained according to a Winograd algorithm
  • the target result is an output result of the convolution operation
  • vector operations in the neural network model are completed using one or more dedicated modules.
  • the plurality of vector operations may be completed using the plurality of dedicated modules respectively, thereby increasing a vector operation speed and further improving processing efficiency of the neural network model.
  • using a vector calculation unit can be avoided as much as possible, thereby reducing a requirement for the vector calculation unit, helping optimize an area of the vector calculation unit, and improving performance and an energy efficiency ratio of the neural network accelerator. Further, decoupling from the vector calculation unit is facilitated.
  • no vector calculation unit is disposed in the neural network accelerator, and all vector operations in the neural network model are completed using one or more dedicated modules, thereby further improving performance and an energy efficiency ratio.
  • FIG. 17 is a schematic diagram of a hardware structure of a neural network accelerator according to an embodiment of this application.
  • the neural network accelerator 6000 (the neural network accelerator 6000 may be specifically a computer device) shown in FIG. 17 includes a memory 6001, a processor 6002, a communication interface 6003, and a bus 6004. Communication connections between the memory 6001, the processor 6002, and the communication interface 6003 are implemented through the bus 6004.
  • the memory 6001 may be a read-only memory (read only memory, ROM), a static storage device, a dynamic storage device, or a random access memory (random access memory, RAM).
  • the memory 6001 may store a program.
  • When the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 is configured to perform the steps of the data processing method in embodiments of this application. Specifically, the processor 6002 may perform the foregoing method 1600.
  • the processor 6002 includes the matrix unit and the post-processing unit in FIG. 5 .
  • the processor 6002 may be a general-purpose central processing unit (central processing unit, CPU), a microprocessor, an application-specific integrated circuit (application-specific integrated circuit, ASIC), a graphics processing unit (graphics processing unit, GPU), or one or more integrated circuits, and is configured to execute a related program, to implement the data processing method in the method embodiment of this application.
  • the processor 6002 may alternatively be an integrated circuit chip and has a signal processing capability.
  • the processor 6002 may alternatively be a general-purpose processor, a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (field programmable gate array, FPGA) or another programmable logic device, a discrete gate or a transistor logic device, or a discrete hardware component.
  • the processor can implement or execute methods, steps, and logical block diagrams disclosed in embodiments of this application.
  • the general-purpose processor may be a microprocessor, or the processor may be any conventional processor and the like. Steps of the methods disclosed with reference to embodiments of this application may be directly performed and completed by using a hardware decoding processor, or may be performed and completed by using a combination of hardware and a software module in the decoding processor.
  • the software module may be located in a mature storage medium in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 6001.
  • the processor 6002 reads information in the memory 6001, and completes functions that need to be performed by units included in the apparatus shown in FIG. 5 or performs the data processing method in the method embodiment of this application in combination with hardware of the processor.
  • the communication interface 6003 uses a transceiver apparatus, for example, but not limited to a transceiver, to implement communication between the apparatus 6000 and another device or a communication network.
  • a neural network model may be obtained by using the communication interface 6003.
  • the bus 6004 may include a path for transmitting information between the components (for example, the memory 6001, the processor 6002, and the communication interface 6003) of the apparatus 6000.
  • the apparatus 6000 may further include another component necessary for implementing proper running.
  • the apparatus 6000 may further include a hardware device for implementing another additional function.
  • the apparatus 6000 may alternatively include only devices necessary for implementing this embodiment of this application, and does not need to include all devices shown in FIG. 17 .
  • An embodiment of this application further provides a computer-readable storage medium.
  • the computer-readable storage medium stores program code to be executed by a device, and the program code is used to perform the data processing method in embodiments of this application.
  • An embodiment of this application further provides a computer program product including instructions.
  • the computer program product When the computer program product is run on a computer, the computer is enabled to perform the data processing method in embodiments of this application.
  • An embodiment of this application further provides a chip.
  • the chip includes a processor and a data interface.
  • the processor reads, by using the data interface, instructions stored in a memory, to perform the data processing method in embodiments of this application.
  • the chip may further include the memory, and the memory stores the instructions.
  • the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to perform the data processing method in embodiments of this application.
  • An embodiment of this application further provides a system on a chip SoC, and the SoC includes the neural network accelerator in embodiments of this application.
  • An embodiment of this application further provides an electronic device.
  • the electronic device includes the neural network accelerator in embodiments of this application.
  • the processor in embodiments of this application may be a central processing unit (central processing unit, CPU).
  • the processor may be further another general-purpose processor, a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (application-specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA), or another programmable logic device, discrete gate or transistor logic device, discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor, or the processor may be any conventional processor and the like.
  • the memory in this embodiment of this application may be a volatile memory or a nonvolatile memory, or may include both a volatile memory and a nonvolatile memory.
  • the nonvolatile memory may be a read-only memory (read-only memory, ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory.
  • the volatile memory may be a random access memory (random access memory, RAM) that is used as an external cache.
  • random access memories in many forms may be used, for example, a static random access memory (static RAM, SRAM), a dynamic random access memory (dynamic random access memory, DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), a synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and a direct rambus random access memory (direct rambus RAM, DR RAM).
  • All or some of the foregoing embodiments may be implemented by using software, hardware, firmware, or any combination thereof.
  • When software is used to implement the embodiments, all or some of the foregoing embodiments may be implemented in a form of a computer program product.
  • the computer program product includes one or more computer instructions or computer programs. When the program instructions or the computer programs are loaded and executed on the computer, the procedure or functions according to embodiments of this application are all or partially generated.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
  • the computer instructions may be stored in a computer-readable storage medium or may be transmitted from a computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (for example, infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or a data center, integrating one or more usable media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium, or the like.
  • the semiconductor medium may be a solid state drive.
  • “At least one” means one or more, and “a plurality of” means two or more.
  • “At least one of the following items (pieces)” or similar expressions refer to any combination of these items, including any combination of singular items (pieces) or plural items (pieces).
  • at least one of a, b, or c may indicate: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may be singular or plural.
  • sequence numbers of the foregoing processes do not mean execution sequences in various embodiments of this application.
  • the execution sequences of the processes should be determined based on functions and internal logic of the processes, and should not be construed as any limitation on the implementation processes of embodiments of this application.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the described apparatus embodiment is merely an example.
  • division into the units is merely logical function division and may be other division in actual implementation.
  • a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces.
  • the indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
  • the units described as separate parts may or may not be physically separate, and components displayed as units may or may not be physical units, that is, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected based on actual requirements to achieve the objectives of the solutions of embodiments.
  • When the functions are implemented in a form of a software functional module and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the conventional technology, or a part of the technical solutions may be implemented in a form of a software product.
  • the computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or a part of the steps of the methods described in embodiments of this application.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)
EP21952154.9A 2021-08-02 2021-08-02 Neural network accelerator, and data processing method for neural network accelerator Pending EP4379607A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/110067 WO2023010244A1 (zh) 2021-08-02 2021-08-02 Neural network accelerator and data processing method for neural network accelerator

Publications (1)

Publication Number Publication Date
EP4379607A1 true EP4379607A1 (en) 2024-06-05

Family

ID=85154880

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21952154.9A Pending EP4379607A1 (en) 2021-08-02 2021-08-02 Neural network accelerator, and data processing method for neural network accelerator

Country Status (3)

Country Link
EP (1) EP4379607A1 (zh)
CN (1) CN117751366A (zh)
WO (1) WO2023010244A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863490B (zh) * 2023-09-04 2023-12-12 之江实验室 Digit recognition method and hardware accelerator for a FeFET memory array
CN117195989B (zh) * 2023-11-06 2024-06-04 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip, and electronic device
CN117391149B (zh) * 2023-11-30 2024-03-26 爱芯元智半导体(宁波)有限公司 Method and apparatus for processing neural network output data, and chip
CN118095194A (zh) * 2024-04-28 2024-05-28 英诺达(成都)电子科技有限公司 Multiplexer identification processing method, apparatus, device, medium, and product

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3557425B1 (en) * 2018-04-19 2024-05-15 Aimotive Kft. Accelerator and system for accelerating operations
US10839222B2 (en) * 2018-09-24 2020-11-17 Apical Limited Video data processing
CN112766470B (zh) * 2019-10-21 2024-05-07 地平线(上海)人工智能技术有限公司 Feature data processing method, instruction sequence generation method, apparatus, and device
CN111667051B (zh) * 2020-05-27 2023-06-06 上海赛昉科技有限公司 Neural network accelerator suitable for edge devices and neural network accelerated computing method

Also Published As

Publication number Publication date
CN117751366A (zh) 2024-03-22
WO2023010244A1 (zh) 2023-02-09

Similar Documents

Publication Publication Date Title
EP4379607A1 (en) Neural network accelerator, and data processing method for neural network accelerator
Zhou et al. Rethinking bottleneck structure for efficient mobile network design
US20180260709A1 (en) Calculating device and method for a sparsely connected artificial neural network
Wang et al. ADSCNet: asymmetric depthwise separable convolution for semantic segmentation in real-time
EP3923233A1 (en) Image denoising method and apparatus
WO2022067508A1 (zh) 一种神经网络加速器、加速方法以及装置
EP3564863B1 (en) Apparatus for executing lstm neural network operation, and operational method
CN111914997B (zh) 训练神经网络的方法、图像处理方法及装置
US20220157041A1 (en) Image classification method and apparatus
CN111882031A (zh) 一种神经网络蒸馏方法及装置
CN114255361A (zh) 神经网络模型的训练方法、图像处理方法及装置
EP4235506A1 (en) Neural network model training method, image processing method, and apparatus
CN111695673B (zh) 训练神经网络预测器的方法、图像处理方法及装置
US20220157046A1 (en) Image Classification Method And Apparatus
CN112789627B (zh) 一种神经网络处理器、数据处理方法及相关设备
EP4318313A1 (en) Data processing method, training method for neural network model, and apparatus
CN114078195A (zh) 分类模型的训练方法、超参数的搜索方法以及装置
Chen et al. MRI image reconstruction via learning optimization using neural ODEs
CN110689045A (zh) 一种深度学习模型的分布式训练方法及装置
CN114970807A (zh) Softmax和指数在硬件中的实施
WO2022156475A1 (zh) 神经网络模型的训练方法、数据处理方法及装置
Ma et al. Accelerating deep neural network filter pruning with mask-aware convolutional computations on modern CPUs
EP4293575A1 (en) Hardware implementation of an attention-based neural network
Pei et al. Neural network compression and acceleration by federated pruning
WO2022227024A1 (zh) 神经网络模型的运算方法、训练方法及装置

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20240228

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR