WO2022257980A1 - Computing device, method for implementing convolution operations using a computing device, and related products - Google Patents

Computing device, method for implementing convolution operations using a computing device, and related products

Info

Publication number
WO2022257980A1
WO2022257980A1 (application PCT/CN2022/097669, CN2022097669W)
Authority
WO
WIPO (PCT)
Prior art keywords
weight
processing circuit
circuit
block
slave processing
Prior art date
Application number
PCT/CN2022/097669
Other languages
English (en)
French (fr)
Inventor
何皓源
郑万凯
陈伟伦
陶劲桦
Original Assignee
寒武纪(西安)集成电路有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 寒武纪(西安)集成电路有限公司 filed Critical 寒武纪(西安)集成电路有限公司
Priority to US18/565,068 priority Critical patent/US20240265242A1/en
Publication of WO2022257980A1 publication Critical patent/WO2022257980A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present disclosure relates generally to the field of data processing. More specifically, the present disclosure relates to a computing device configured to perform convolution operations, a method for implementing convolution operations using the computing device, a chip, and a board.
  • Deep learning (Deep Learning) has become an important branch of machine learning and strongly drives the development of artificial intelligence (AI).
  • the convolution layer is one of the commonly used hidden layers in the neural network model, which extracts features from the input data through convolution operations.
  • the neural network model contains a large number of convolution operations, and the computing performance of the convolution operation greatly affects the computing performance of the entire neural network model.
  • the corresponding input feature maps and weights may have different dimensions.
  • convolution operations of different scales need to be optimized to improve the computational performance of executing neural network models.
  • this disclosure proposes a computing device in various aspects, which can effectively improve the operating efficiency of large-scale convolution operations by dividing the input feature map and weights into blocks.
  • the convolution operation in the embodiment of the present disclosure can be an operation in various neural network models, and these neural network models can be applied in various fields, such as image processing, speech processing, text processing, etc., such processing can include but not limited to identification and classification.
  • an embodiment of the present disclosure provides a computing device configured to perform a convolution operation, the computing device including a main processing circuit and a plurality of slave processing circuits, wherein: the main processing circuit is used for: during the convolution operation, transmitting at least one feature block of the input feature map to a plurality of scheduled slave processing circuits in a broadcast manner, wherein the feature block is obtained by partitioning the input feature map according to the lowest storage dimension; and each of the scheduled slave processing circuits is used to: perform a convolution operation on the feature block and the corresponding weight block, wherein the weight block is obtained by partitioning the weights according to the output channel dimension; and return the operation result to the main processing circuit.
  • an embodiment of the present disclosure provides a chip, which includes the computing device in any embodiment of the foregoing first aspect.
  • an embodiment of the present disclosure provides a board, which includes the chip in any embodiment of the foregoing second aspect.
  • an embodiment of the present disclosure provides a method for performing a convolution operation by the computing device in any embodiment of the first aspect.
  • the scheme of the embodiments of the present disclosure partitions the large-scale input feature map and weights to fit the processing capability of a single computing device, so as to make full use of the parallel processing capability of the deep learning processor, which can effectively improve the efficiency of the convolution operation.
  • input feature maps and weights can be transmitted through different data paths, thereby supporting multiple multiplexing of input feature maps and weights, further optimizing convolution operations, and reducing data throughput.
  • Fig. 1 shows the structural diagram of the board card of the disclosed embodiment
  • FIG. 2 shows a structural diagram of a combination processing device according to an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of the internal structure of a processor core of a single-core or multi-core computing device according to an embodiment of the disclosure
  • FIG. 4 shows an example of an exemplary convolution operation principle to which embodiments of the present disclosure can be applied
  • FIG. 5 exemplarily shows a convolution operation process according to an embodiment of the present disclosure
  • FIG. 6 shows an exemplary structural diagram of a computing device according to an embodiment of the disclosure
  • FIG. 7 shows a partial structural schematic diagram of a slave processing circuit according to an embodiment of the present disclosure
  • FIG. 8 shows a schematic storage manner of weight data in a second storage circuit according to an embodiment of the present disclosure.
  • Fig. 9 shows an exemplary flowchart of a convolution operation method according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure.
  • the board 10 includes a chip 101, which is a system-on-chip (System on Chip, SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, meeting the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements for the storage capacity and computing power of the platform.
  • the board 10 of this embodiment is suitable for cloud intelligence applications, with large off-chip storage, large on-chip storage, and powerful computing capabilities.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected and data transmitted with the control device 106 and the chip 101 through the bus.
  • the control device 106 in the board 10 is configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing the combined processing means in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201 , an interface device 202 , a processing device 203 and a storage device 204 .
  • the computing device 201 is configured to perform operations specified by the user, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor for performing deep learning or machine learning calculations, which can interact with the processing device 203 through the interface device 202 to Work together to complete user-specified operations.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into a storage device on the computing device 201 .
  • the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the chip of the computing device 201 .
  • the interface device 202 may also read data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201 .
  • the processing device 203 may be one or more types of a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU) or other general-purpose and/or special-purpose processors.
  • processors including but not limited to digital signal processors (digital signal processors, DSPs), application specific integrated circuits (application specific integrated circuits, ASICs), field-programmable gate arrays (field-programmable gate arrays, FPGAs) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when considering the integration of the computing device 201 and the processing device 203 together, they are considered to form a heterogeneous multi-core structure.
  • the storage device 204 is used to store data to be processed; it may be a DRAM (DDR memory), typically 16 GB or larger, and is used to store data of the computing device 201 and/or the processing device 203.
  • FIG. 3 shows a schematic diagram of the internal structure of a processing core when the computing device 201 is a single-core or multi-core device.
  • the computing device 301 is used for processing input data in fields such as computer vision, speech, natural language processing, and data mining.
  • the computing device 301 includes three modules: a control module 31 , an operation module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete the task of deep learning, which includes an instruction fetch unit (IFU) 311 and an instruction decoding unit (instruction decode unit, IDU) 312.
  • the instruction fetching unit 311 is used to obtain instructions from the processing device 203 , and the instruction decoding unit 312 decodes the obtained instructions and sends the decoding results to the computing module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
  • the storage module 33 is used to store or transfer relevant data, including a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons and intermediate results after calculation;
  • WRAM 332 is used to store convolution kernels of deep learning networks, that is, weights;
  • DMA 333 is connected to DRAM 204 through bus 34 and is responsible for data transfer between the computing device 301 and DRAM 204.
  • an embodiment of the present disclosure provides a computing device configured to perform a convolution operation, so that the convolution operation in a neural network model, for example, can be optimized.
  • Fig. 4 shows an example of an exemplary convolution operation principle to which embodiments of the present disclosure may be applied.
  • a convolutional layer in a neural network model can perform a convolution operation by applying a convolution kernel (also called a filter or weights) to the input data, so as to extract features.
  • the figure shows an example of input data with a size of 6 ⁇ 6 ⁇ 3, which can represent three input feature maps of size 6 ⁇ 6 (ie, a three-dimensional matrix of 6 ⁇ 6 ⁇ 3), representing three different features .
  • the width W of the feature map in this example is 6, and the height H is also 6.
  • the number of input feature maps can also be referred to as the number of input channels Ci.
  • the example input in the figure has 3 feature maps, also called 3 feature channels or 3 input channels.
  • the figure also exemplarily shows a convolution kernel with a size of 2 ⁇ 3 ⁇ 3 ⁇ 3, which can represent two three-dimensional convolution kernels with a size of 3 ⁇ 3 ⁇ 3 (that is, two three-dimensional matrices of 3 ⁇ 3 ⁇ 3 ), each three-dimensional convolution kernel (also known as a filter) has three different 3 ⁇ 3 two-dimensional convolution kernels, corresponding to three different input feature maps.
  • the number of three-dimensional convolution kernels can be referred to as the number of output channels Co, which is 2 in this example.
  • the number of two-dimensional convolution kernels can be called the number of input channels Ci, which is consistent with the number of channels of the input feature map.
  • Each two-dimensional convolution kernel has a corresponding width Kw and height Kh, both Kw and Kh are 3 in this example.
  • the convolution result of the input feature map and the filter outputs two 4 ⁇ 4 feature maps.
  • the convolution result of the input feature map and the upper three-dimensional convolution kernel obtains an upper 4 ⁇ 4 output feature map
  • the convolution result of the input feature map and the lower three-dimensional convolution kernel obtains the lower 4×4 output feature map.
  • the value at each position in the output feature map is obtained by performing a two-dimensional convolution operation on the corresponding block and the corresponding convolution kernel of each input feature map and then summing.
  • the figure shows that the value at the (0,0) position of the upper output feature map (that is, a convolution output point) is obtained by performing two-dimensional convolution operations between the block framed by the black cube in the input feature map and the upper three-dimensional convolution kernel: three values are obtained and then summed to yield the final value.
  • each convolution output point has a corresponding receptive field
  • the shape of the receptive field is equal to the shape of the convolution kernel; for example, the receptive field of the convolution output point at position (0,0) on the output feature map in the figure is the 3×3×3 black cube box.
  • the value of each convolution output point corresponds to the result of the multiplication and accumulation of the input feature map in its receptive field and the weight value.
  • the receptive field is relative to a single convolutional layer, and the feature vector of a certain position in the input feature map of the current layer is calculated from the input of the fixed region of the previous layer. The region is the receptive field at this location.
  • the position of the convolution kernel can be moved on the input feature map, that is, the receptive field of the convolution output point can be moved.
  • when the convolution step size (Sx, Sy) is (1,1), moving the convolution kernel by one step in the width or height direction and performing the convolution operation yields the value at position (0,1) or (1,0), respectively, on the upper output feature map.
  • in a convolutional layer of a neural network, there is a set of input feature maps containing H×W×Ci pieces of data in total, where H and W are the height and width of the input feature maps, and Ci is the number of input feature maps, also called the number of input channels.
  • the convolutional layer has Ci×Co convolution kernels of size Kh×Kw, where Ci is the number of input channels, Co is the number of output feature maps (or the number of output channels), and Kh and Kw are the height and width of the convolution kernel, respectively.
  • the output feature map contains Ho ⁇ Wo ⁇ Co pieces of information, where Ho and Wo are the height and width of the output feature map, respectively, and Co is the number of output channels.
  • the convolution also has step sizes (Sx, Sy), whose values affect the size of the output feature map.
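  • As a non-limiting illustration (not part of the original disclosure), the following NumPy sketch reproduces the arithmetic described above for the Figure 4 shapes; the variable names, random data, and no-padding assumption are ours.

```python
import numpy as np

# Shapes from the Figure 4 example: H=W=6, Ci=3, Co=2, Kh=Kw=3, stride (Sx, Sy)=(1, 1).
H, W, Ci = 6, 6, 3
Co, Kh, Kw = 2, 3, 3
Sx, Sy = 1, 1

x = np.random.randn(H, W, Ci)        # input feature maps, HWC layout
w = np.random.randn(Co, Kh, Kw, Ci)  # one three-dimensional kernel per output channel

# Output size without padding: Ho = (H - Kh) // Sy + 1, Wo = (W - Kw) // Sx + 1.
Ho, Wo = (H - Kh) // Sy + 1, (W - Kw) // Sx + 1
y = np.zeros((Ho, Wo, Co))

for co in range(Co):
    for ho in range(Ho):
        for wo in range(Wo):
            # Receptive field: a Kh x Kw x Ci block of the input, same shape as the kernel.
            field = x[ho * Sy:ho * Sy + Kh, wo * Sx:wo * Sx + Kw, :]
            # Each convolution output point is the multiply-accumulate over its receptive field.
            y[ho, wo, co] = np.sum(field * w[co])

assert y.shape == (4, 4, 2)  # two 4x4 output feature maps, as in Figure 4
```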
  • the dimensions of the involved multidimensional data are represented by (N, H, W, C) or (Co, H, W, Ci), which represent the storage order of the data in the memory. It can be understood that although the multidimensional data has multiple dimensions, since the layout of the memory is always one-dimensional, there is a corresponding relationship between the multidimensional data and the storage order on the memory. Multidimensional data is usually allocated in continuous storage space, that is, multidimensional data can be expanded in one dimension and stored in the memory in sequence.
  • the initial input feature map can be stored sequentially in dimension-priority order in which C/Ci is the lowest dimension; in order to optimize the convolution operation, the storage order of the input feature map can be adjusted during the operation.
  • Adjacent dimensions refer to dimensions that are next to each other in the dimension information representation of multidimensional data, for example, W and Ci are adjacent, and adjacent dimensions may also be called continuous dimensions.
  • the design of artificial intelligence chips usually takes the Ci dimension as the lowest dimension, that is, the above-mentioned NHWC arrangement order, and the data on the Ci dimension is continuous. Therefore, vectorization alignment requires that the size of the Ci dimension be aligned to a specified value, such as the alignment value Aci, so that the number of accesses is performed in units of the alignment value Aci. Based on different designs, Aci can have different values, such as 64, 128, 256, 512, etc. Usually, the size of the input port of the operator array is also related to the alignment value.
  • the input port size of the operator array is usually twice the alignment value, that is, one alignment value Aci of input data and one Aci of weight data are processed at one time.
  • for large-scale input feature map data and weight data, when the Ci dimension of the input feature map is large, it is easier to meet the above alignment requirement.
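  • As a non-limiting illustration, the following sketch shows the alignment rule described above; the helper name align_up and the example value Aci = 64 elements (one 512-bit access of 8-bit data) are assumptions for the example only.

```python
def align_up(ci: int, aci: int = 64) -> int:
    """Round the Ci dimension up to the alignment value Aci so that accesses
    are made in whole units of Aci (Aci is design-dependent: 64, 128, 256, ...)."""
    return ((ci + aci - 1) // aci) * aci

# A Ci of 100 8-bit values would be zero-padded to 128, i.e. two 512-bit cache lines.
assert align_up(100) == 128
assert align_up(512) == 512
```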
  • Fig. 5 exemplarily shows a convolution operation process according to an embodiment of the present disclosure.
  • the Ci (represented as fi in the figure) dimension of the input feature map is relatively large, so only a part of the data is taken for calculation each time, for example an amount that matches the maximum amount the operator can process at one time, so as to make full use of the operator's computing power and save computing time.
  • the alignment value is 512 bits, that is, the data of one line (one cache line) required to be read at one time is 512 bits.
  • a cache line can include 64 pieces of 8-bit data or 32 pieces of 16-bit data.
  • the input feature map 510 has a large scale, and its input channel dimension fi exceeds 512 bits, for example a multiple of 512; the input channel dimension Ci of the weights 520 is equal to the input channel dimension fi of the input feature map 510 and also exceeds 512 bits. Therefore, one row of input data 511 can be read from the input feature map 510 each time, and one row of weight data 521 can be read from the weights 520 as convolution kernel data, each corresponding to a segment within the receptive field (such as 531 in the figure).
  • each convolution output point corresponds to the result of the multiplication and accumulation of the input feature map and weight in its receptive field.
  • when the input data rows and the weight rows traverse the entire receptive field together, multiple partial sums are obtained and accumulated, yielding the value of the convolution output point corresponding to that receptive field.
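  • A minimal sketch of this row-wise traversal follows (purely illustrative; the function name, the 64-element row width, and the flattening of the receptive field are our assumptions).

```python
import numpy as np

def conv_point_by_rows(x_field: np.ndarray, w_kernel: np.ndarray, row: int = 64) -> float:
    """Compute one convolution output point by stepping through its receptive field
    one 'row' at a time (e.g. one 64-byte cache line of 8-bit data, as in Figure 5).
    x_field and w_kernel are the receptive field and the 3-D kernel (same total size)."""
    x_flat, w_flat = x_field.ravel(), w_kernel.ravel()
    acc = 0.0
    for start in range(0, x_flat.size, row):
        # One input-data row and one weight row are read and multiply-accumulated
        # into a partial sum; summing all partial sums yields the output point value.
        acc += float(np.dot(x_flat[start:start + row], w_flat[start:start + row]))
    return acc
```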
  • FIG. 6 shows a schematic structural block diagram of a computing device 600 according to an embodiment of the disclosure. It can be understood that this structure can be regarded as the refinement of the internal structure of the operation module of a single processing core in FIG. 3 , or can be regarded as a functional division block diagram based on the combination of multiple operation modules of the processing core shown in FIG. 3 .
  • a computing device 600 may be configured to perform a convolution operation, and may include a master processing circuit 610 and a plurality of slave processing circuits 620 .
  • the master processing circuit and the slave processing circuits, as well as the slave processing circuits among themselves, can communicate with each other through various connections.
  • the main processing circuit and the slave processing circuit can cooperate with each other, thereby realizing parallel operation processing.
  • the master processing circuit can be used, for example, to perform pre-processing on the input data, such as splitting the data, and to receive intermediate results from multiple slave processing circuits and perform subsequent processing to obtain the final operation result of the operation instruction.
  • the slave processing circuit can be used to perform intermediate operations on corresponding data (for example, split data) in parallel according to the operation instruction to obtain multiple intermediate results, and transmit the multiple intermediate results back to the main processing circuit.
  • the connections between the multiple slave processing circuits can be hard-wired, or logically configured according to, for example, micro-instructions, so as to form a variety of slave processing circuit array topologies; embodiments of the present disclosure are not limited in this respect.
  • by configuring the computing device 600 into a master-slave structure (for example, one master and multiple slaves, or multiple masters and multiple slaves; the present disclosure is not limited in this respect), the data for a forward-operation calculation instruction can be split according to the instruction, so that multiple slave processing circuits perform parallel calculations on the computation-heavy part, thereby improving calculation speed, saving calculation time, and reducing power consumption.
  • the main processing circuit and the slave processing circuit may include various calculation circuits, for example, may include a vector operation unit and a matrix operation unit.
  • the vector operation unit is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit is responsible for the core calculations of deep learning algorithms, such as matrix multiplication and convolution.
  • the master processing circuit 610 may broadcast at least one feature block of the input feature map to a plurality of scheduled slave processing circuits during the convolution operation, wherein the feature block is obtained by partitioning the input feature map according to the lowest storage dimension.
  • each scheduled slave processing circuit 620 can perform a convolution operation on the broadcast feature block and the corresponding weight block, wherein the weight block is obtained by partitioning the weights according to the output channel dimension, and return the operation result to the main processing circuit.
  • the above-mentioned lowest storage dimension is, for example, the input channel Ci dimension.
  • the above-described block processing of input feature maps and weights may be performed at different locations and at different times.
  • the main processing circuit may include a block function for splitting the input feature map and weights respectively.
  • the main processing circuit can read the input feature map and weights in their original storage format from an external storage circuit (such as DDR), then partition and store the input feature map according to the lowest storage dimension, and partition the weights according to the output channel dimension and store them for the scheduled slave processing circuits to load the corresponding weight blocks.
  • the above block process can be performed during or before the operation to prepare the data.
  • the main processing circuit may include a partial blocking function, which is used to partition only the input feature map to be broadcast, while the weights to be distributed may be partitioned by an external blocking circuit.
  • the main processing circuit may not include or perform a blocking function at all.
  • the input feature map and weights are partitioned by a partitioning circuit independent of the main processing circuit. The divided input feature map and weights can be stored in corresponding storage circuits.
  • the main processing circuit 610 when the main processing circuit 610 broadcasts the feature block, it may align the feature block to the first alignment requirement in the lowest storage dimension, and the first alignment requirement is determined according to the processing capability of the slave processing circuit. For example, depending on the maximum throughput of the operator array in the slave processing circuit, the first alignment requirement may eg be equal to the maximum throughput, so that the entire operator array can be utilized.
  • the first alignment requirement is, for example, 64 bytes, that is, 512 bits, so the size of each aligned feature block in the lowest storage dimension is 64 bytes.
  • the size in each of the remaining storage dimensions is 1 data element.
  • when the data bit width is 8 bits, the input feature map can be divided into feature blocks of shape 64×1×1, each containing 64 pieces of data.
  • when the data bit width is 16 bits, it can be divided into feature blocks of shape 32×1×1, each containing 32 pieces of data.
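  • A non-limiting sketch of this feature-block splitting is given below; the function name, the zero-padding of Ci, and the use of NumPy arrays are our assumptions rather than the patented implementation.

```python
import numpy as np

def split_feature_blocks(x: np.ndarray, align_bytes: int = 64) -> list:
    """Split an input feature map stored in HWC order into feature blocks along the
    lowest storage dimension (Ci). Each block covers a single (h, w) position and
    align_bytes of channel data: 64 values for 8-bit data, 32 values for 16-bit data."""
    elems = align_bytes // x.itemsize               # elements per block in the Ci dimension
    h, w, ci = x.shape
    ci_pad = ((ci + elems - 1) // elems) * elems
    padded = np.zeros((h, w, ci_pad), dtype=x.dtype)
    padded[:, :, :ci] = x                           # zero-pad Ci up to the alignment requirement
    return [padded[i, j, k:k + elems]               # one 64x1x1 (or 32x1x1) block per slice
            for i in range(h) for j in range(w) for k in range(0, ci_pad, elems)]

blocks = split_feature_blocks(np.random.randint(-128, 128, (6, 6, 100), dtype=np.int8))
# With int8 data each block holds 64 values; with 16-bit data it would hold 32.
```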
  • in order to perform convolution operations with the divided feature blocks, the weights also need to be divided. From the description of Figure 4, the weights have one more dimension than the input feature map, namely the output channel Co dimension, so the division of the weights is slightly different from the division of the input feature map.
  • the weight may first be divided into multiple weight blocks according to the Co dimension, and each weight block corresponds to weight data of an output channel.
  • each weight block is equivalent to a three-dimensional convolution kernel (for example, refer to the three-dimensional convolution kernel in FIG. 4 ).
  • convolution operation processing can be performed in parallel on different slave processing circuits for different weight blocks.
  • the convolution results on different output channels do not need to be accumulated, so each slave processing circuit can perform operation processing relatively independently.
  • each weight block it can be divided in a similar manner to the input feature map, that is, it can be divided into multiple weight rows according to the lowest storage dimension (eg Ci dimension). Similarly, the weight row is also aligned to the first alignment requirement in the lowest storage dimension, so that the bitwise multiply-accumulate operation can be performed on the feature block and the weight row.
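  • The weight partitioning just described can be sketched as follows (again illustrative only: the shapes, names, and dict-based layout are our assumptions).

```python
import numpy as np

def split_weight_blocks(w: np.ndarray, elems: int = 64) -> dict:
    """Split weights of shape (Co, Kh, Kw, Ci) first by output channel Co -- one weight
    block per three-dimensional kernel -- and then into weight rows along the lowest
    storage dimension Ci, aligned to the same requirement as the feature blocks."""
    co, kh, kw, ci = w.shape
    ci_pad = ((ci + elems - 1) // elems) * elems
    padded = np.zeros((co, kh, kw, ci_pad), dtype=w.dtype)
    padded[..., :ci] = w
    weight_blocks = {}
    for c in range(co):                             # one weight block per output channel Co
        weight_blocks[c] = [padded[c, i, j, k:k + elems]
                            for i in range(kh) for j in range(kw)
                            for k in range(0, ci_pad, elems)]
    return weight_blocks                            # block c -> list of weight rows for Co = c

wb = split_weight_blocks(np.random.randn(8, 3, 3, 100), elems=64)
assert len(wb) == 8 and len(wb[0]) == 3 * 3 * 2     # Ci=100 pads to 128 -> 2 rows per (kh, kw)
```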
  • multiple multiplexing methods of input feature maps and weights can be supported, thereby reducing data throughput during operations and improving processing efficiency .
  • the computing device 600 may further include a first storage circuit 630 and a second storage circuit 640 for respectively storing data transmitted via different data channels.
  • the first storage circuit 630 can be used to store multicast data, that is, the data in the first storage circuit will be transmitted to multiple slave processing circuits through the broadcast bus, and these slave processing circuits receive the same data. It can be understood that broadcasting and multicasting can be implemented through the broadcasting bus. Multicast refers to a communication method that transmits a piece of data to multiple slave processing circuits; broadcasting is a communication method that transmits a piece of data to all slave processing circuits, which is a special case of multicast. Since both multicast and broadcast correspond to a one-to-many transmission mode, there is no special distinction between the two in this document. Broadcast and multicast can be collectively referred to as multicast, and those skilled in the art can clarify their meanings according to the context.
  • the second storage circuit 640 may be used to store and distribute data, that is, the data in the second storage circuit will be transmitted to different slave processing circuits respectively, and each slave processing circuit receives different data.
  • the master processing circuit may store the input feature map in the first storage circuit 630, so as to broadcast the divided feature map blocks to the scheduled multiple slave processing circuits during operation.
  • the master processing circuit may store the weight values in blocks in the second storage circuit 640 in the aforementioned manner, and the weight value blocks therein may be distributed to corresponding slave processing circuits before operation.
  • although each processing circuit and storage circuit is shown as a separate module in FIG. 6, according to different configurations the storage circuits and the processing circuits may also be combined into one module.
  • the first storage circuit 630 can be combined with the main processing circuit 610
  • the second storage circuit 640 can be shared by multiple slave processing circuits 620, and an independent storage area is assigned to each slave processing circuit to speed up access.
  • Embodiments of the present disclosure are not limited in this respect.
  • the main processing circuit and the slave processing circuit may belong to different modules of the same processor or chip, or may belong to different processors, and the present disclosure is not limited in this respect.
  • FIG. 7 shows a schematic diagram of the internal structure of a slave processing circuit according to an embodiment of the disclosure.
  • the slave processing circuit 700 includes a first buffer circuit 710 , a second buffer circuit 720 and a plurality of arithmetic circuits 730 .
  • the first buffer circuit 710 may be used for buffering and processing weight values or input feature maps.
  • the second buffer circuit 720 can be used for buffering and processing input feature maps or weights. These two buffer circuits are used to select the data involved in the operation.
  • the data of the first buffer circuit 710 may come from, for example, the first storage circuit 630 or the second storage circuit 640 in FIG. 6; correspondingly, the data of the second buffer circuit 720 may come from, for example, the second storage circuit 640 or the first storage circuit 630.
  • the first buffer circuit 710 is used to buffer weight rows from the weight block of the second storage circuit. These weight rows are formed by dividing the weight block according to the lowest storage dimension (for example, Ci dimension) in the second storage circuit, for example, according to the above-described division method of aligning the lowest storage dimension to the first alignment requirement. These weight value rows may be distributed to corresponding operation circuits 730 during operation.
  • the second buffering circuit 720 is used to buffer feature blocks in the input feature map from the first storage circuit broadcast by the main processing circuit. These characteristic tiles may be broadcast and transmitted to all computing circuits 730 in the slave processing circuit 700 during computing.
  • Each operation circuit 730 may be configured to perform a bitwise multiply-accumulate operation on weight rows distributed from the first buffer circuit 710 and feature tiles broadcast from the second buffer circuit 720 .
  • the slave processing circuit 700 may further include a third buffer circuit 740 for buffering the operation results of each operation circuit 730 .
  • although several computing circuits 730 are shown in the figure, according to different hardware configurations, more or fewer computing circuits may be included in the slave processing circuit, and embodiments of the present disclosure are not limited in this respect.
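  • The following toy model (not the actual hardware, and with names we have invented) mirrors the FIG. 7 structure: a first buffer holding weight rows, a second buffer holding the broadcast feature block, and several operation circuits each performing a bitwise multiply-accumulate.

```python
import numpy as np

class SlaveProcessingCircuit:
    """Illustrative software stand-in for the slave processing circuit of FIG. 7."""

    def __init__(self, num_ops: int = 4):
        self.first_buffer = []     # weight rows, one distributed to each operation circuit
        self.second_buffer = None  # feature block, broadcast to all operation circuits
        self.num_ops = num_ops

    def compute(self) -> list:
        # Every operation circuit sees the same broadcast feature block but its own weight
        # row (a different output channel Co), so the results need no cross-accumulation.
        return [float(np.dot(self.second_buffer, w_row))
                for w_row in self.first_buffer[:self.num_ops]]

spc = SlaveProcessingCircuit(num_ops=4)
spc.first_buffer = [np.random.randn(64) for _ in range(4)]  # 4 weight rows (4 Co values)
spc.second_buffer = np.random.randn(64)                     # one broadcast feature block
partial_sums = spc.compute()                                # 4 partial sums, one per Co
```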
  • the speed of data access can be accelerated by reasonably allocating the storage modes of each data.
  • Fig. 8 shows a schematic storage manner of weight data in a second storage circuit according to an embodiment of the present disclosure.
  • the second storage circuit 800 can allocate a storage area for each slave processing circuit, so that the weights required for each slave processing circuit operation only need to be read from its corresponding storage area.
  • the figure exemplarily shows that 16 storage areas 801-816 are allocated to 16 slave processing circuits. Each storage area stores weight blocks to be processed by the slave processing circuit. It can be understood that depending on different hardware configurations, the number of slave processing circuits may be different, for example, 4, 8, 32 or more. In the example in FIG. 8 , each slave processing circuit includes 4 arithmetic circuits as an example for description, but this embodiment of the disclosure is not limited thereto.
  • the results of operations on the Co dimension do not need to be accumulated, so they can be assigned to different operation circuits to perform operations relatively independently.
  • weights on different Co dimensions can be stored in each storage area, that is, different weight blocks can be stored; in the example in the figure, the Co values corresponding to the weight blocks in the 16 storage areas shown are all different.
  • the weight blocks used in each round can be grouped according to the order of the operation round, and the number of weight blocks in each weight block group corresponds to the total computing capacity of the slave processing circuits scheduled in the corresponding round of operation.
  • assuming each slave processing circuit includes 4 computing circuits, a total of 64 computing circuits can be scheduled in each round of computing, so 64 Co values can be operated on respectively.
  • when the Co dimension of the weights is 128, exceeding the total of 64 schedulable computing circuits, the computation can be divided into two rounds to complete all calculations.
  • the weight blocks in each weight block group can be segmented in sequence according to the slave processing circuits scheduled in the corresponding round of operations; each weight block segment corresponds to a scheduled slave processing circuit, and each weight block segment is stored in the storage area allocated to the corresponding slave processing circuit in the second storage circuit.
  • Each weight block segment contains at least one weight block, that is, each slave processing circuit corresponds to more than one weight block.
  • the number of weight blocks included in each weight block segment is equal to the number of arithmetic circuits included in each slave processing circuit.
  • for example, the second weight block segment 832, containing 4 weight blocks, is allocated to the second slave processing circuit, and the 4 weight blocks are respectively allocated to the 4 arithmetic circuits in the second slave processing circuit; and so on.
  • the remaining weight block segments are divided and stored correspondingly in a similar manner, which will not be repeated here.
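  • A non-limiting sketch of this grouping into rounds and segments (e.g. 16 slave processing circuits with 4 arithmetic circuits each, as in the FIG. 8 example) might look like the following; the function and its return structure are assumptions for illustration.

```python
def assign_weight_blocks(num_co: int, num_slaves: int = 16, ops_per_slave: int = 4) -> list:
    """Group weight blocks (one per Co value) by operation round and split each group
    into segments, one segment per scheduled slave processing circuit.
    Returns a list: rounds -> slave index -> list of Co values handled by that slave."""
    per_round = num_slaves * ops_per_slave           # total computing capacity of one round
    rounds = []
    for start in range(0, num_co, per_round):        # e.g. Co=128 with capacity 64 -> 2 rounds
        group = list(range(start, min(start + per_round, num_co)))
        rounds.append([group[i:i + ops_per_slave]    # one weight-block segment per slave
                       for i in range(0, len(group), ops_per_slave)])
    return rounds

layout = assign_weight_blocks(num_co=128)
# layout[0][1] == [4, 5, 6, 7]: the second slave circuit's segment in the first round,
# one weight block per arithmetic circuit, stored in that slave's storage area (e.g. 832).
```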
  • the foregoing describes the hardware structure of the computing device in the embodiment of the present disclosure and an exemplary data storage method.
  • the above-mentioned hardware structure can provide different data paths for the input feature maps and weights participating in the calculation, thereby using different data transmission methods (such as broadcast, multicast, and distribution) to reduce the amount of data transferred during operation and improve operation efficiency.
  • different multiplexing methods can be adopted, including weight multiplexing and/or input feature map multiplexing.
  • the input feature map can be multiplexed on all the operation circuits of the same slave processing circuit, and each operation circuit performs operations on the weight blocks corresponding to different output channels and the input feature map.
  • the input feature map is broadcasted to all computing circuits, and each computing circuit can preload the weights of the corresponding output channels.
  • each scheduled slave processing circuit can, according to its assigned Co dimension values, read in turn from the second storage circuit each weight row in the weight block segment assigned to it in the current round of operations.
  • the read weight row is then stored into the first buffer circuit of the slave processing circuit.
  • the slave processing circuit may distribute to different operation circuits in the slave processing circuit according to the Co dimension corresponding to each weight row.
  • the slave processing circuit may broadcast the feature image blocks in the second buffer circuit to each operation circuit.
  • the operation circuit can perform a bitwise multiply-accumulate operation on the distributed weight row and the broadcast feature block to obtain a partial-sum result for the receptive field corresponding to that weight row and feature block.
  • for example, each storage area in the second storage circuit allocated to a slave processing circuit continuously stores weight blocks for 4 Co values.
  • the slave processing circuit broadcasts the feature image blocks buffered in the second buffer circuit to all the operation circuits therein.
  • in the first computing step, each computing circuit of the slave processing circuit obtains a partial sum corresponding to the first receptive field on its assigned Co.
  • the slave processing circuit can control and read the content in the first buffer circuit and the second buffer circuit according to the weight and/or the multiplexing manner of the input feature map.
  • the slave processing circuit can continuously broadcast, to its multiple operation circuits, the feature blocks cached in the second buffer circuit that correspond to different convolution output points/receptive fields.
  • each operation circuit can use the same weight row to perform bitwise multiply-accumulate operations on the continuously broadcast feature blocks, obtaining SR partial-sum results belonging to different convolution output points.
  • the partial-sum results belonging to the same convolution output point can be accumulated each time until all partial-sum results obtained by traversing the corresponding receptive field have been accumulated, so as to obtain the final result of that convolution output point.
  • the times of weight multiplexing may be different, for example, SR may be 2, 4, 8, . . . .
  • the number of weight multiplexing SR is limited by the read bandwidth and the number of read ports of the second storage circuit. For example, when the read bandwidth of the second storage circuit is 64 bytes and the number of ports is 1, at least 1 beat of 64 bytes is read to the first buffer circuit, and at most 8 beats of 64 bytes of data are read. At this time, the weight multiplexing times SR are at most 32 times.
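  • A minimal sketch of this weight-multiplexing scheme follows (illustrative only; the accumulator dictionary and function name are ours).

```python
import numpy as np

def weight_multiplexed_mac(weight_row, feature_blocks, out_points, acc):
    """One weight row stays resident in the operation circuit while SR feature blocks
    (belonging to SR different convolution output points) are broadcast in turn;
    each partial sum is accumulated into its own output point."""
    for block, point in zip(feature_blocks, out_points):   # SR = len(feature_blocks)
        acc[point] = acc.get(point, 0.0) + float(np.dot(weight_row, block))
    return acc

acc = {}
w_row = np.random.randn(64)
blocks = [np.random.randn(64) for _ in range(4)]           # SR = 4 reuses of the weight row
acc = weight_multiplexed_mac(w_row, blocks, out_points=[0, 1, 2, 3], acc=acc)
```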
  • input feature map multiplexing can be used, that is, the same feature map block can be used for multiple different weight rows.
  • the multiplexing of the input feature map here means that the same input feature map is used multiple times for operations with different weight rows in a single operation circuit.
  • the input feature map multiplexing is multiplexed on all operation circuits, that is, the same input feature map is operated on multiple operation circuits with different weight rows respectively.
  • the slave processing circuit can read, according to the Co dimension, one weight row from each weight block in the weight block segment assigned to that slave processing circuit, wherein the number of weight rows read is equal to the product of the input feature map multiplexing times NR and the number of computing circuits in the slave processing circuit. The read weight rows can then be buffered in the first buffer circuit and distributed to the respective operation circuits.
  • each operation circuit uses the NR weight rows distributed from the first buffer circuit to perform bitwise multiply-accumulate operations on the feature block broadcast from the second buffer circuit, obtaining NR partial-sum results belonging to different Co dimensions.
  • for example, the storage area of each slave processing circuit contains weight blocks for 8 Co values.
  • Two Co results are calculated in each operation circuit of the first slave processing circuit, that is, each feature map block is multiplexed twice.
  • the partial-sum results obtained each time that belong to the same Co dimension can be accumulated to obtain the convolution output on the corresponding Co.
  • the number of multiplexing of the input feature map can be different, for example, NR can be 2, 4, 8, . . . .
  • the number of times NR of multiplexing the input feature map is limited by the capacity of the first buffer circuit.
  • the first buffer circuit can store 9 ⁇ 64B data.
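  • By analogy, a minimal sketch of input feature map multiplexing (again with invented names) reuses one broadcast feature block against NR weight rows of different output channels in a single operation circuit.

```python
import numpy as np

def feature_multiplexed_mac(feature_block, weight_rows, co_ids, acc):
    """One broadcast feature block is reused against NR weight rows belonging to NR
    different output channels Co; partial sums are accumulated per Co."""
    for w_row, co in zip(weight_rows, co_ids):   # NR = len(weight_rows)
        acc[co] = acc.get(co, 0.0) + float(np.dot(feature_block, w_row))
    return acc

acc = {}
block = np.random.randn(64)
rows = [np.random.randn(64) for _ in range(2)]   # NR = 2: the same block is used twice
acc = feature_multiplexed_mac(block, rows, co_ids=[0, 1], acc=acc)
```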
  • the weight multiplexing and input feature map multiplexing within a single operation circuit described above can be used alone or in combination. No matter which multiplexing method is adopted, the main processing circuit can splice the operation results returned by the scheduled multiple slave processing circuits over multiple rounds of operations according to the blocking and multiplexing methods to obtain the final result. Specifically, the partial-sum results belonging to the same Co dimension and the same receptive field are accumulated to obtain the result of the convolution output point corresponding to that receptive field on that Co dimension.
  • the master processing circuit may, for example, receive intermediate results from multiple slave processing circuits and perform subsequent processing to obtain a final calculation result.
  • the main processing circuit may be configured to concatenate the operation results of the slave processing circuits that process different Co dimensions, so as to obtain the convolution operation results on the entire Co dimension.
  • each slave processing circuit that completes the convolution operation for a single Co dimension through multiple rounds of calculations can accumulate and aggregate the partial-sum results of each round according to the corresponding convolution output points/receptive fields, and then return them to the main processing circuit.
  • Embodiments of the present disclosure also provide a method for performing a convolution operation by using the aforementioned computing device.
  • FIG. 9 shows an exemplary flowchart of a convolution operation method 900 according to an embodiment of the present disclosure.
  • in step 910, during the convolution operation, the main processing circuit partitions the input feature map into blocks according to the lowest storage dimension and broadcasts the feature blocks to the scheduled multiple slave processing circuits.
  • in step 920, the main processing circuit partitions the weights into blocks according to the Co dimension, so that the scheduled slave processing circuits load the corresponding weight blocks.
  • in step 930, each scheduled slave processing circuit performs a convolution operation on the feature block and the corresponding weight block, and returns the operation result to the main processing circuit.
  • step 910 and step 920 may be performed at the same time, or step 920 may be performed before step 910 .
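  • To tie the steps together, the following plain NumPy sketch mirrors only the data flow of method 900 for a single receptive field (the function name and shapes are assumptions, and the hardware parallelism is not modeled).

```python
import numpy as np

def convolution_method_900(x_field: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Steps 910/920: the master partitions the input along Ci and the weights along Co;
    step 930: each scheduled slave convolves the broadcast feature blocks with its own
    weight blocks, and the master splices the returned results along Co."""
    co = weights.shape[0]
    out = np.zeros(co)
    field = x_field.ravel()                                  # one receptive field, flattened
    for c in range(co):                                      # each Co value -> one weight block,
        out[c] = float(np.dot(field, weights[c].ravel()))    # handled by some operation circuit
    return out                                               # results spliced along Co

# One receptive field of a 3x3x64 input patch against 8 output channels:
y = convolution_method_900(np.random.randn(3, 3, 64), np.random.randn(8, 3, 3, 64))
assert y.shape == (8,)
```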
  • An embodiment of the present disclosure also provides a chip, which may include the computing device in any embodiment described above with reference to the accompanying drawings. Further, the present disclosure also provides a board, which may include the aforementioned chip.
  • the electronic equipment or devices disclosed in this disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • Said vehicles include airplanes, ships and/or vehicles;
  • said household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, range hoods;
  • said medical equipment includes nuclear magnetic resonance instruments, ultrasound instruments, and/or electrocardiographs.
  • the electronic equipment or device disclosed herein can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical treatment. Further, the electronic device or device disclosed herein can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge, and terminal.
  • electronic devices or devices with high computing power according to the disclosed solutions can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or Edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that according to the hardware information of the terminal device and/or the edge device, the hardware resources of the cloud device can be Match appropriate hardware resources to simulate the hardware resources of terminal devices and/or edge devices, so as to complete the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-end integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present disclosure is not limited by the order of the described actions . Therefore, according to the disclosure or teaching of the present disclosure, those skilled in the art may understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, according to different schemes, the description of some embodiments in this disclosure also has different emphases. In view of this, those skilled in the art may understand the part that is not described in detail in a certain embodiment of the present disclosure, and may also refer to related descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit exists physically independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which can be, for example, a variable resistance memory (Resistive Random Access Memory, RRAM), dynamic Random Access Memory (Dynamic Random Access Memory, DRAM), Static Random Access Memory (Static Random Access Memory, SRAM), Enhanced Dynamic Random Access Memory (Enhanced Dynamic Random Access Memory, EDRAM), High Bandwidth Memory (High Bandwidth Memory , HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM and RAM, etc.
  • a computing device configured to perform a convolution operation, the computing device comprising a master processing circuit and a plurality of slave processing circuits, wherein:
  • the main processing circuit is used for:
  • transmitting at least one feature block of the input feature map to a plurality of scheduled slave processing circuits in a broadcast manner, wherein the feature block is obtained by partitioning the input feature map according to the lowest storage dimension;
  • Each of said slave processing circuits is scheduled for:
  • the input feature map is partitioned by the lowest storage dimension.
  • Clause 3 The computing device of clause 1 or 2, wherein the main processing circuit is further configured to:
  • Clause 4 The computing device according to Clause 3, wherein the first alignment requirement is equal to a single maximum data processing amount of the arithmetic circuits in the slave processing circuit, and the size of each aligned feature block in the lowest storage dimension is equal to this single maximum data processing amount.
  • the weights are divided into blocks according to the dimension of the output channel, so that the scheduled slave processing circuit loads corresponding weight blocks.
  • a plurality of weight blocks that are continuously divided on the dimension of the output channel are grouped according to the order of operation rounds, and the number of weight blocks in each weight block group corresponds to the number of slave processing circuits scheduled in the corresponding round of operation total computing power;
  • each weight block segment corresponds to a scheduled slave processing circuit
  • Each weight block segment is respectively stored in a storage area allocated to the corresponding slave processing circuit.
  • each slave processing circuit further comprises a first buffer circuit, a second buffer circuit, and a plurality of arithmetic circuits, wherein:
  • the first buffer circuit is used for buffering one or more weight rows divided by the lowest storage dimension in at least one weight block corresponding to the slave processing circuit, and the weight rows are distributed to corresponding arithmetic circuits;
  • the second buffer circuit is used to buffer the feature blocks broadcast by the main processing circuit, and the feature blocks are broadcast and transmitted to all the operation circuits in the slave processing circuit during operation;
  • each operation circuit is configured to: perform a bitwise multiply-accumulate operation on the weight row distributed from the first buffer circuit and the feature block broadcast from the second buffer circuit.
  • according to the output channel dimension corresponding to each weight row, the weight row is distributed to different computing circuits in the slave processing circuit to perform bitwise multiply-accumulate operations with the feature blocks broadcast from the second buffer circuit, so as to obtain the partial-sum results of the corresponding convolution output points.
  • Clause 9 The computing device according to Clause 8, wherein the slave processing circuit is further configured to: control the reading of the contents of the first buffer circuit and the second buffer circuit, so that the weight row and the feature block traverse the entire receptive field of the convolution output point together while performing bitwise multiply-accumulate operations, obtaining multiple partial-sum results that are accumulated to obtain the convolution output at the corresponding convolution output point.
  • the slave processing circuit is further configured to continuously broadcast the feature blocks corresponding to different convolution output points in the input feature map cached in the second buffer circuit to the plurality of operation circuits, wherein the different convolution The number of output points is equal to the number of weight multiplexing SR;
  • Each arithmetic circuit is further used to:
  • the same weight row is used to perform the bitwise multiply-accumulate operation respectively to obtain SR parts and results belonging to different convolution output points;
  • the partial sum results belonging to the same convolution output point obtained in multiple rounds of operations are accumulated to obtain the convolution output on the corresponding convolution output point.
  • the slave processing circuit is further used to:
  • one weight line is read from each weight block in the weight block section assigned to the slave processing circuit, wherein the number of weight lines read is equal to the number of times the input feature map is multiplexed the product of NR and the number of arithmetic circuits within the slave processing circuit;
  • Each arithmetic circuit is further used to:
  • the partial sum results obtained in multiple rounds of operations belonging to the same output channel dimension are accumulated to obtain the convolution output on the corresponding output channel dimension.
  • Clause 12 The computing device according to any one of clauses 1-11, wherein the main processing circuit is further configured to: calculate the operation results returned by the multiple scheduled slave processing circuits in multiple rounds of operations according to the block and complex Stitch together in ways to get the final result.
  • Item 13 A chip, characterized in that the chip includes the computing device described in any one of Items 1-12.
  • Item 14 A board, characterized in that the board includes the chip described in Item 12.
  • Clause 15 A method of performing a convolution operation by the computing device of any one of clauses 1-12.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure discloses a computing device, a method for implementing a convolution operation using the computing device, and related products. The computing device may be included in a combined processing device, which may further include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further include a storage device, which is connected to the computing device and the other processing devices respectively and is used to store data of the computing device and the other processing devices. The solution of the present disclosure optimizes the convolution operation and improves operation processing efficiency.

Description

计算装置、利用计算装置实施卷积运算的方法及相关产品
相关申请的交叉引用
本公开要求于2021年6月10日申请的、申请号为202110648346.2、发明名称为“计算装置、利用计算装置实施卷积运算的方法及相关产品”的中国专利申请的优先权。
技术领域
本披露一般地涉及数据处理领域。更具体地,本披露涉及一种配置用于执行卷积运算的计算装置、利用计算装置实施卷积运算的方法、芯片和板卡。
背景技术
目前,深度学习(Deep Learning)已经成为机器学习中的重要分支,也大力助推着人工智能(AI)的发展。深度学习的核心技术——深度神经网络(DNN)已在诸多行业有着广泛的应用。
卷积层是神经网络模型中的常用隐含层之一,其通过卷积运算对输入数据进行特征提取。神经网络模型中包含了大量的卷积运算,卷积运算的计算性能极大地影响整个神经网络模型的计算性能。当神经网络模型应用于不同领域时,例如语音识别、机器翻译、图像处理等等,其对应的输入特征图和权值的各个维度大小可能各有不同。为了充分利用深度学习处理器的硬件优势,需要针对不同规模的卷积运算进行优化,以提高执行神经网络模型的计算性能。
发明内容
为了至少解决如上所提到的一个或多个技术问题,本披露在多个方面中提出了一种计算装置,其通过对输入特征图和权值进行分块,可以有效提高大规模卷积运算的运算效率。本披露实施例的卷积运算可以是各种神经网络模型中的运算,这些神经网络模型可以应用于各种领域,诸如图像处理、语音处理、文本处理等等,这些处理例如可以包括但不限于识别和分类。
在第一方面中,本披露实施例提供了一种计算装置,配置用于执行卷积运算,所述计算装置包括主处理电路和多个从处理电路,其中:所述主处理电路用于:在所述卷积运算期间,以广播方式将输入特征图的至少一个特征图块传输给调度的多个从处理电路,其中所述特征图块是将所述输入特征图按最低存储维度分块获得的;并且所调度的每个所述从处理电路用于:针对所述特征图块和对应的权值块,执行卷积运算,其中所述权值块是按照输出通道维度分块获得的;以及将运算结果返回给所述主处理电路。
在第二方面中,本披露实施例提供了一种芯片,其包括前述第一方面任一实施例的计算装置。
在第三方面中,本披露实施例提供了一种板卡,其包括前述第二方面的任一实施例的芯片。
在第四方面中,本披露实施例提供了一种由前述第一方面任一实施例的计算装置实施卷积运算的方法。
通过如上所提供的计算装置、芯片、板卡以及由计算装置实施卷积运算的方法,本披露实施例的方案针对大规模的输入特征图和权值进行分块,以适应单个运算装置的处理能力,从而充分利用深度学习处理器的并行处理能力,可以有效提高卷积运算的运算效率。此外,在一些实施例中,输入特征图和权值可以通过不同的数据路径进行传输,从而支持输入特征图和权值的多种复用方式,进一步优化卷积运算,减小数据吞吐量。
附图说明
通过参考附图阅读下文的详细描述,本公开示例性实施方式的上述以及其他目的、特征和优点将变得易于理解。在附图中,以示例性而非限制性的方式示出了本公开的若干实施方式,并且相同或对应的标号表示相同或对应的部分,其中:
图1示出本披露实施例的板卡的结构图;
图2示出本披露实施例的组合处理装置的结构图;
图3示出本披露实施例的单核或多核计算装置的处理器核的内部结构示意图;
图4示出可以应用本披露实施例的示例性卷积运算原理示例;
图5示例性示出根据本披露实施例的卷积运算过程;
图6示出根据本披露实施例的计算装置的示例性结构示意图;
图7示出根据本披露实施例的从处理电路的部分结构示意图;
图8示出根据本披露实施例的权值数据在第二存储电路中的示意性存储方式;以及
图9示出根据本披露实施例的卷积运算方法的示例性流程图。
具体实施方式
下面将结合本披露实施例中的附图,对本披露实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本披露一部分实施例,而不是全部的实施例。基于本披露中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本披露保护的范围。
应当理解,本披露的权利要求、说明书及附图中可能出现的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。
下面结合附图来详细描述本披露的具体实施方式。
图1示出本披露实施例的一种板卡10的结构示意图。如图1所示,板卡10包括芯片101,其是一种系统级芯片(System on Chip,SoC),或称片上系统,集成有一个或多个组合处理装置,组合处理装置是一种人工智能运算单元,用以支持各类深度学习和机器学习算法,满足计算机视觉、语音、自然语言处理、数据挖掘等领域复杂场景下的智能处理需求。特别是深度学习技术大量应用在云端智能领域,云端智能应用的一个显著特点是输入数据量大,对平台的存储能力和计算能力有很高的要求,此实施例的板卡10适用在云端智能应用,具有庞大的片外存储、片上存储和强大的计算能力。
芯片101通过对外接口装置102与外部设备103相连接。外部设备103例如是服务器、计算机、摄像头、显示器、鼠标、键盘、网卡或wifi接口等。待处理的数据可以由外部设备103通过对外接口装置102传递至芯片101。芯片101的计算结果可以经由对外接口装置102传送回外部设备103。根据不同的应用场景,对外接口装置102可以具有不同的接口形式,例如PCIe接口等。
板卡10还包括用于存储数据的存储器件104,其包括一个或多个存储单元105。存储器件104通过总线与控制器件106和芯片101进行连接和数据传输。板卡10中的控制器件106配置用于对芯片101的状态进行调控。为此,在一个应用场景中,控制器件106可以包括单片机(Micro Controller Unit,MCU)。
图2是示出此实施例的芯片101中的组合处理装置的结构图。如图2中所示,组合处理装置20包括计算装置201、接口装置202、处理装置203和存储装置204。
计算装置201配置成执行用户指定的操作,主要实现为单核智能处理器或者多核智能处理器,用以执行深度学习或机器学习的计算,其可以通过接口装置202与处理装置203进行交互,以共同完成用户指定的操作。
接口装置202用于在计算装置201与处理装置203间传输数据和控制指令。例如,计算装置201可以经由接口装置202从处理装置203中获取输入数据,写入计算装置201片上的存储装置。进一步,计算装置201可以经由接口装置202从处理装置203中获取控制指令,写入计算装置201片上的控制缓存中。替代地或可选地,接口装置202也可以读取计算装置201的存储装置中的数据并传输给处理装置203。
处理装置203作为通用的处理装置,执行包括但不限于数据搬运、对计算装置201的开启和/或停止等基本控制。根据实现方式的不同,处理装置203可以是中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)或其他通用和/或专用处理器中的一种或多种类型的处理器,这些处理器包括但不限于数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,并且其数目可以根据实际需要来确定。如前所述,仅就本披露的计算装置201而言,其可以视为具有单核结构或者同构多核结构。然而,当将计算装置201和处理装置203整合共同考虑时,二者视为形成异构多核结构。
存储装置204用以存储待处理的数据,其可以是DRAM,为DDR内存,大小通常为16G或更大,用于保存计算装置201和/或处理装置203的数据。
图3示出了计算装置201为单核或多核装置时处理核的内部结构示意图。计算装置301用以处理计算机视觉、语音、自然语言、数据挖掘等输入数据,计算装置301包括三大模块:控制模块31、运算模块32及存储模块33。
控制模块31用以协调并控制运算模块32和存储模块33的工作,以完成深度学习的任务,其包括取指单元(instruction fetch unit,IFU)311及指令译码单元(instruction decode unit,IDU)312。取指单元311用以获取来自处理装置203的指令,指令译码单元312则将获取的指令进行译码,并将译码结果作为控制信息发送给运算模块32和存储模块33。
运算模块32包括向量运算单元321及矩阵运算单元322。向量运算单元321用以执行向量运算,可支持向量乘、加、非线性变换等复杂运算;矩阵运算单元322负责深度学习算法的核心计算,即矩阵乘及卷积。
存储模块33用来存储或搬运相关数据,包括神经元存储单元(neuron RAM,NRAM)331、权值存储单元(weight RAM,WRAM)332、直接内存访问模块(direct memory access,DMA)333。NRAM 331用以存储输入神经元、输出神经元和计算后的中间结果;WRAM 332则用以存储深度学习网络的卷积核,即权值;DMA 333通过总线34连接DRAM 204,负责计算装置301与DRAM 204间的数据搬运。
基于前述硬件环境,在一个方面中,本披露实施例提供了一种计算装置,其配置用于执行卷积运算,从而可以对例如神经网络模型中的卷积运算进行优化。
图4示出了可以应用本披露实施例的示例性卷积运算原理示例。如图所示,例如神经网络模型中的卷积层可以执行卷积运算,通过对输入特征图(也称为输入数据、神经元或输入神经元)应用卷积核(也称为滤波器、权值等)做卷积处理,从而进行特征提取。
图中示例性示出了大小为6×6×3的输入数据,其可以表示3个6×6大小的输入特征图(即6×6×3的三维矩阵),分别表示三个不同的特征。此示例中特征图的宽度W为6,高度H也为6。输入特征图的数量也可以称为输入通道数Ci。例如图中示例输入有3个特征图,也称为3个特征通道或3个输入通道。
图中还示例性示出了大小为2×3×3×3的卷积核,其可以表示2个3×3×3大小的立体卷积核(即2个3×3×3的三维矩阵),每个立体卷积核(又称为滤波器)又具有3个不同的3×3大小的二维卷积核,对应输入的3个不同的特征图。立体卷积核的数量可以称为输出通道数Co,此示例中为2。每个立体卷积核中,二维卷积核的数量可以称为输入通道数Ci,其与输入特征图的通道数一致。每个二维卷积核具有相应的宽度Kw和高度Kh,在此示例中Kw和Kh均为3。
输入特征图与滤波器的卷积结果输出2个4×4大小的特征图。其中,输入特征图与上方的 立体卷积核的卷积结果得到上方的1个4×4的输出特征图,输入特征图与下方的立体卷积核的卷积结果得到下方的1个4×4的输出特征图。输出特征图中每个位置上的值由每个输入特征图的对应区块和对应卷积核做二维卷积运算之后再加和得到。例如,图中示出了上方的输出特征图上(0,0)位置的值(也即卷积输出点)由输入特征图中黑色立方体框出的区块与上方的立体卷积核进行二维卷积运算得到3个值,再加和得到最终值。
在本披露实施例中,每个卷积输出点具有对应的感受野,感受野的形状等于卷积核的形状,例如图中输出特征图上(0,0)位置的卷积输出点的感受野是图中的3×3×3的黑色立方体框。每个卷积输出点的值对应于其感受野内的输入特征图与权值的对位乘累加结果。可以理解,在本披露实施例中,感受野是相对于单个卷积层而言的,当前层的输入特征图中某个位置的特征向量是由前一层固定区域的输入计算出来的,这个区域就是这个位置的感受野。
为了得到其他位置的输出,可以在输入特征图上移动卷积核的位置,也即移动卷积输出点的感受野。在图中示例中,卷积步长(Sx,Sy)为(1,1),当横向(宽度方向)向右或纵向(高度方向)向下移动一格后做卷积运算,可以分别得到上方的输出特征图上(0,1)或(1,0)位置的值。
从上面的描述可知,在神经网络的一个卷积层中,有一组输入特征图,共包含H×W×Ci个信息,其中H和W分别是输入特征图的高度和宽度,Ci是输入特征图的个数,也称为输入通道数。卷积层有Ci×Co个Kh×Kw大小的卷积核,其中Ci是输入通道数,Co是输出特征图的个数(或输出通道数),Kh和Kw分别是卷积核的高度和宽度。输出特征图包含Ho×Wo×Co个信息,其中Ho和Wo分别是输出特征图的高度和宽度,Co是输出通道数。此外,在卷积运算中,还会涉及到卷积步长(Sx,Sy),卷积步长的大小会影响输出特征图的尺寸。
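
To make the shape bookkeeping above concrete, the following is a minimal NumPy sketch of the same 6×6×3 input convolved with two 3×3×3 filters at stride 1. It is an illustration of the multiply-accumulate rule only, not the hardware scheme of this disclosure; the array names and the use of NumPy are assumptions.

```python
import numpy as np

H, W, Ci = 6, 6, 3          # input feature map: height, width, input channels
Co, Kh, Kw = 2, 3, 3        # 2 filters of size 3x3x3 (Ci matches the input)
Sy, Sx = 1, 1               # convolution strides

x = np.random.randn(H, W, Ci)            # input laid out as HWC
w = np.random.randn(Co, Kh, Kw, Ci)      # weights laid out as Co, Kh, Kw, Ci

Ho, Wo = (H - Kh) // Sy + 1, (W - Kw) // Sx + 1   # 4 x 4 output per filter
y = np.zeros((Ho, Wo, Co))

for co in range(Co):
    for ho in range(Ho):
        for wo in range(Wo):
            # receptive field of output point (ho, wo): a Kh x Kw x Ci block
            field = x[ho * Sy:ho * Sy + Kh, wo * Sx:wo * Sx + Kw, :]
            # element-wise multiply-accumulate over the whole receptive field
            y[ho, wo, co] = np.sum(field * w[co])

print(y.shape)   # (4, 4, 2): two 4x4 output feature maps
```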
在本披露实施例中,所涉及的多维数据的维度表征为(N,H,W,C)或(Co,H,W,Ci),其代表了数据在存储器中的存储顺序。可以理解,虽然多维数据具有多个维度,但是因为存储器的布局始终是一维的,因此多维数据与存储器上的存储顺序之间存在对应关系。多维数据通常被分配在连续的存储空间中,也即可以将多维数据进行一维展开,按顺序存储在存储器上。例如,在本披露实施例中,初始的输入特征图可以按照低维度(此处C/Ci为最低维度)优先方式,进行顺序存储;而为了优化卷积运算,在运算过程中可以调整输入特征图的存储顺序,如后面将详细描述的。相邻的维度是指多维数据的维度信息表示中相互紧挨着的维度,例如,W和Ci相邻,相邻的维度也可以称为连续的维度。
为了充分利用带宽,适配运算器阵列的吞吐量等需求,通常需要将数据进行向量化对齐。人工智能芯片的设计通常以Ci维度为最低维度,也即上述NHWC摆放顺序,Ci维度上的数据是连续的。因此,向量化对齐要求需要Ci维度的大小对齐到指定数值,例如对齐值Aci,从而以该对齐值Aci为单位进行存取数。基于不同的设计,Aci可以有不同的数值,例如64、128、256、512等。通常,运算器阵列的输入端口大小也与该对齐值相关,例如在输入数据位宽对称的情形下,运算器阵列的输入端口大小通常为对齐值的2倍,也即一次性处理对齐值Aci规模的输入特征图数据和权值数据。当输入特征图的Ci维度较大时,比较容易满足上述对齐要求。
图5示例性示出了根据本披露实施例的卷积运算过程。在此实施例中,输入特征图的Ci(图中表示为fi)维度较大,因此每次只取一部分数据进行运算,例如取的数据量满足运算器的一次最大处理量,从而既充分利用了运算器的算力,又能节省运算时间。在此示例中,假设对齐值为512比特,也即要求一次读取的一行(一个缓存行)数据为512比特。为了描述简便起见,在本披露的示例中,假设输入特征图与权值的数据位宽相同,例如都是8比特或16比特,则一个缓存行可以包括64个8比特数据或32个16比特数据。
如图所示,输入特征图510的规模较大,输入通道维度fi超过512比特,例如为512的倍数;权值520的输入通道维度Ci与输入特征图510的输入通道维度fi大小相等,也超过512比特。因此,每次可以从输入特征图510中读取一行输入数据511,从权值520中读取一行权值数据521作为卷积核数据,二者执行对位乘累加运算,得到卷积结果530中的一个部分和531。
从前面图4的描述可知,每个卷积输出点的值对应于其感受野内的输入特征图与权值的对位 乘累加结果。通过多次取数和对位乘累加运算,将输入数据行与权值行同时遍历整个感受野,得到多个部分和并进行累加,则可获得对应该感受野的卷积输出点的值。
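
The row-by-row partial-sum idea in the two preceding paragraphs can be sketched as follows. It assumes 8-bit data, so one 64-byte row holds 64 values along Ci, and checks that accumulating the per-row partial sums over the whole receptive field reproduces the full value of the convolution output point; the sizes and variable names are illustrative assumptions.

```python
import numpy as np

Kh, Kw, Ci = 3, 3, 256      # assumed kernel size; Ci is a multiple of 64
ROW = 64                    # one 64-byte line of 8-bit data = 64 values

field = np.random.randint(-8, 8, (Kh, Kw, Ci))   # receptive field of one output point
kernel = np.random.randint(-8, 8, (Kh, Kw, Ci))  # one filter (one output channel)

acc = 0
for kh in range(Kh):
    for kw in range(Kw):
        for c0 in range(0, Ci, ROW):
            # one input row and one weight row, each 64 values along Ci
            x_row = field[kh, kw, c0:c0 + ROW]
            w_row = kernel[kh, kw, c0:c0 + ROW]
            acc += int(np.dot(x_row, w_row))     # one partial sum per row pair

assert acc == int(np.sum(field * kernel))        # equals the full output point
```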
由此可见,上述卷积运算中计算各个部分和的过程具有并行性,因此通过适当的硬件配置,充分利用并行处理可能性,则可以加速运算,提升效率。此外,由于卷积过程中卷积核的移动,也即感受野的移动,导致计算部分和的过程中存在部分数据的重复使用,因此如果能够合理利用数据的复用,则可以进一步降低运算期间的数据吞吐量,从而提升效率。
图6示出了根据本披露实施例的计算装置600的示意性结构框图。可以理解,该结构可以视为图3中单个处理核的运算模块的内部结构细化,也可以视为在多个图3所示处理核的运算模块基础上联合的功能划分框图。如图6所示,本披露实施例的计算装置600可以配置用于执行卷积运算,其可以包括主处理电路610和多个从处理电路620。主处理电路和从处理电路之间以及多个从处理电路之间可以通过各种连接相互通信。
主处理电路和从处理电路可以相互配合,由此实现并行运算处理。在这种配置中,主处理电路例如可以用于对输入数据执行前序处理,例如对数据进行拆分,以及从多个从处理电路接收中间结果并执行后续处理,以得到运算指令的最终运算结果。从处理电路例如可以用于根据运算指令,对相应的数据(例如,拆分的数据)并行执行中间运算得到多个中间结果,并将多个中间结果传输回主处理电路。
在不同的应用场景中,多个从处理电路之间的连接方式既可以是通过硬线布置的硬连接方式,也可以是根据例如微指令进行配置的逻辑连接方式,以形成多种从处理电路阵列的拓扑结构。本披露实施例在此方面没有限制。
通过将计算装置600设置成主从结构(例如一主多从结构,或者多主多从结构,本披露在此方面没有限制),对于正向运算的计算指令,可以根据计算指令将数据进行拆分,从而通过多个从处理电路对计算量较大的部分进行并行运算以提高运算速度,节省运算时间,进而降低功耗。
为了支持运算功能,主处理电路和从处理电路可以包括各种计算电路,例如可以包括向量运算单元及矩阵运算单元。向量运算单元用以执行向量运算,可支持向量乘、加、非线性变换等复杂运算;矩阵运算单元负责深度学习算法的核心计算,例如矩阵乘和卷积。
在一些实施例中,主处理电路610可以在卷积运算期间,以广播方式将输入特征图的至少一个特征图块传输给调度的多个从处理电路,其中特征图块是将输入特征图按最低存储维度分块获得的。此时,所调度的每个从处理电路620可以针对广播的特征图块和对应的权值块,执行卷积运算,其中权值块是将权值按照输出通道维度分块获得的;以及将运算结果返回给主处理电路。上述最低存储维度例如是输入通道Ci维度。取决于不同的硬件配置和/或其他考虑因素,输入特征图和权值的上述分块处理可以在不同位置、不同时间执行。
在一些实施例中,主处理电路可以包括分块功能,用于分别针对输入特征图和权值进行拆分处理。例如,主处理电路可以从外部存储电路(例如DDR)读取原始存储格式的输入特征图和权值,然后对输入特征图按最低存储维度分块和存储;以及对权值按照输出通道维度分块和存储,以供所调度的从处理电路加载对应的权值块。上述分块过程可以在运算期间执行也可以在运算前执行,以准备好数据。
在另一些实施例中,主处理电路可以包括部分分块功能,用于仅针对将要进行广播传输的输入特征图进行分块,而要分发的权值则可以通过外部分块电路进行分块。
在又一些实施例中,主处理电路可以完全不包括分块功能或不执行分块功能。在这些实施例中,输入特征图和权值由独立于主处理电路的分块电路进行分块。经分块后的输入特征图和权值可以存储在相应的存储电路中。
在一些实现中,主处理电路610在广播特征图块时,可以将特征图块在最低存储维度上对齐到第一对齐要求,该第一对齐要求根据从处理电路的处理能力而确定。例如,取决于从处理电路中运算器阵列的最大吞吐量,第一对齐要求例如可以等于最大吞吐量,从而可以利用全部的运算器阵列。
在一个示例中,第一对齐要求例如是64字节,也即512比特,由此对齐后的每个特征图块 在最低存储维度上的大小为64字节。其余存储维度的大小均为1个数据位。例如,对于三维特征图,假设数据位宽为8比特,则可以划分成64×1×1形状的包含64个数据的特征图块。假设数据位宽为16比特,则可以划分成32×1×1形状的包含32个数据的特征图块。
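
A hedged sketch of this feature-map blocking follows: it pads Ci up to a multiple of the assumed 64-value alignment and cuts each pixel's Ci vector into 64-value feature blocks. The function name and the zero-padding choice are assumptions for illustration.

```python
import numpy as np

def split_feature_blocks(x_hwc, align=64):
    """Split an HWC feature map into blocks of `align` values along Ci,
    zero-padding Ci up to the next multiple of `align` (sketch only)."""
    H, W, Ci = x_hwc.shape
    ci_pad = -Ci % align
    x_pad = np.pad(x_hwc, ((0, 0), (0, 0), (0, ci_pad)))
    # each block holds `align` values of one pixel along Ci (64x1x1 in the text)
    return x_pad.reshape(H, W, -1, align)

x = np.random.randn(6, 6, 100)      # Ci = 100 is not aligned
blocks = split_feature_blocks(x)
print(blocks.shape)                  # (6, 6, 2, 64): two 64-value blocks per pixel
```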
为了与划分的特征图块执行卷积运算,权值也需要进行划分。从图4的描述可知,权值比输入特征图多一个维度:输出通道Co维度,因此权值的划分与输入特征图的划分略有不同。
在一些实施例中,可以首先按照Co维度,将权值分成多个权值块,每个权值块对应一个输出通道的权值数据。可以理解,每个权值块相当于一个立体卷积核(例如参考图4的立体卷积核)。由此,可以在不同的从处理电路上针对不同的权值块并行地执行卷积运算处理。根据前述的卷积原理可以理解,不同输出通道上的卷积结果无需进行累加,因此各从处理电路可以相对独立地进行运算处理。
在每个权值块中,可以按与输入特征图类似的方式进行划分,也即按最低存储维度(例如Ci维度)划分成多个权值行。同样地,权值行在最低存储维度上也对齐到第一对齐要求,从而特征图块与权值行可以执行对位乘累加运算。
当权值行和特征图块同时遍历某个卷积输出点的感受野执行对位乘累加运算时,可以得到多个部分和,这些部分和的累加结果即为该卷积输出点的最终值。
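
The corresponding weight-side partitioning can be sketched in the same style: one weight block per output channel Co, each block further cut into 64-value weight rows along Ci so that a weight row and a feature block can be multiplied element-wise and accumulated. Shapes and names are assumed for illustration.

```python
import numpy as np

def split_weight_blocks(w, align=64):
    """Split weights (Co, Kh, Kw, Ci) into one weight block per output channel,
    each block further cut into `align`-value weight rows along Ci (sketch)."""
    Co, Kh, Kw, Ci = w.shape
    ci_pad = -Ci % align
    w_pad = np.pad(w, ((0, 0), (0, 0), (0, 0), (0, ci_pad)))
    # blocks[co] has shape (Kh, Kw, Ci//align, align): its weight rows
    return w_pad.reshape(Co, Kh, Kw, -1, align)

w = np.random.randn(128, 3, 3, 100)
blocks = split_weight_blocks(w)
print(blocks.shape)      # (128, 3, 3, 2, 64): 128 weight blocks, rows of 64 values
```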
在本披露一些实施例中,通过利用不同的数据通路传输输入特征图和权值,可以支持输入特征图和权值的多种复用方式,从而减小运算期间的数据吞吐量,提升处理效率。
具体地,计算装置600中还可以包括第一存储装置630和第二存储装置640,用于分别存储经由不同数据通道传输的数据。
第一存储电路630可以用于存储多播数据,也即第一存储电路中的数据将通过广播总线传输给多个从处理电路,这些从处理电路接收到相同的数据。可以理解,通过广播总线可以实现广播和多播。多播是指将一份数据传输到多个从处理电路的通信方式;而广播是将一份数据传输到所有从处理电路的通信方式,是多播的一个特例。由于多播和广播都对应一对多的传输方式,本文中未对二者特意区分,广播和多播可以统称为多播,本领域技术人员根据上下文可以明确其含义。
第二存储电路640可以用于存储分发数据,也即第二存储电路中的数据将分别传输给不同的从处理电路,每个从处理电路接收到不同的数据。
通过分别提供第一存储电路和第二存储电路,可以支持针对待运算的数据以不同传输方式进行传输,从而通过在多个从处理电路之间复用多播数据来降低数据吞吐量。
在一些实施例中,主处理电路可以将输入特征图存储在第一存储电路630中,以在运算期间通过广播方式将划分的特征图块传输给调度的多个从处理电路。对应地,主处理电路可以将权值按前述方式分块存储在第二存储电路640中,其中的权值块可以在运算前分发给对应的从处理电路。
可以理解,虽然在图6中将各个处理电路与存储电路示出为分立的模块,但是根据不同的配置,存储电路与处理电路也可以合并成一个模块。例如,第一存储电路630可以与主处理电路610合并在一起,第二存储电路640则可以由多个从处理电路620共享,并为每个从处理电路分配独立的存储区域,加速访问。本披露实施例在此方面没有限制。此外,在该计算装置中,主处理电路和从处理电路可以属于同一处理器或芯片的不同模块,也可以属于不同处理器,本披露在此方面也没有限制。
图7示出了根据本披露实施例的从处理电路的内部结构示意图。如图所示,从处理电路700包括第一缓冲电路710、第二缓冲电路720以及多个运算电路730。
第一缓冲电路710可以用于缓存并处理权值或输入特征图。相应地,第二缓冲电路720则可以用于缓存并处理输入特征图或权值。这两个缓冲电路均用于选取参与运算的数据。第一缓冲电路710的数据可以来自例如图6中的第一存储电路630或第二存储电路640,对应地,第二缓冲电路720的数据可以来自例如图6中的第二存储电路640或第一存储电路630。
在一些实施例中,第一缓冲电路710用于缓存来自第二存储电路的权值块中的权值行。这些 权值行是权值块按照在第二存储电路中的最低存储维度(例如,Ci维度)划分而成的,例如按照前面描述的在最低存储维度上对齐到第一对齐要求的划分方式。这些权值行可以在运算期间被分发给对应的运算电路730。
在一些实施例中,第二缓冲电路720用于缓存由主处理电路广播的、来自第一存储电路的输入特征图中的特征图块。这些特征图块可以在运算期间被广播传输给从处理电路700内的所有运算电路730。
每个运算电路730可以用于针对从第一缓冲电路710分发的权值行和从第二缓冲电路720广播的特征图块,执行对位乘累加运算。
从处理电路700中还可以包括第三缓冲电路740,用于缓存各个运算电路730的运算结果。
可以理解,虽然图中示出了4个运算电路730,但是根据不同的硬件配置,从处理电路中可以包括更多或更少的运算电路,本披露实施例在此方面没有限制。
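
As a purely software picture of this slave-circuit structure (assumed, not hardware-accurate), the toy class below keeps weight rows in a first buffer, the broadcast feature block in a second buffer, and collects one multiply-accumulate result per arithmetic circuit in a third buffer.

```python
import numpy as np

class SlaveCircuitSketch:
    """Toy software model of the structure around Fig. 7 (illustration only)."""
    def __init__(self, num_alus=4):
        self.num_alus = num_alus
        self.first_buffer = []       # one weight row per arithmetic circuit
        self.second_buffer = None    # broadcast feature block
        self.third_buffer = []       # per-circuit results

    def load(self, weight_rows, feature_block):
        assert len(weight_rows) == self.num_alus
        self.first_buffer = weight_rows
        self.second_buffer = feature_block

    def step(self):
        # each arithmetic circuit: element-wise multiply-accumulate of its own
        # weight row with the feature block broadcast to all circuits
        self.third_buffer = [float(np.dot(w, self.second_buffer))
                             for w in self.first_buffer]
        return self.third_buffer

slave = SlaveCircuitSketch()
slave.load([np.random.randn(64) for _ in range(4)], np.random.randn(64))
print(slave.step())   # four partial sums, one per arithmetic circuit
```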
如前面所提到,在一些实施例中,通过合理地分配各个数据的存储方式,可以加快数据访存速度。
图8示出了根据本披露实施例的权值数据在第二存储电路中的示意性存储方式。
如图所示,第二存储电路800可以为每个从处理电路分配一个存储区域,从而每个从处理电路运算所需的权值只需要从其对应的存储区域读取即可。图中示例性示出了为16个从处理电路分配了16块存储区域801~816。每个存储区域中存储该从处理电路要处理的权值块。可以理解,取决于不同的硬件配置,从处理电路的数量可以不同,例如4个、8个、32个或更多。在图8的示例中,以每个从处理电路包括4个运算电路为例进行描述,但是本披露实施例不限于此。
如前面所提到的,Co维度上的运算结果无需累加,因此分配在不同的运算电路上可以相对独立地进行运算。由此,在每个存储区域中可以存储不同Co维度上的权值,也即可以存储不同的权值块。从图中示例中,所示的16块存储区域中的权值块对应的Co各不相同。
当Co维度大小超过可调度的从处理电路的数量时,需要通过多个运算轮次来执行运算。各轮次使用的权值块可以按照运算轮次顺序分组,每个权值块组中的权值块数量对应于对应轮次运算中所调度的从处理电路的总运算能力。
以图中示例为例,假设总共16个从处理电路均可调度,并且每个从处理电路包括4个运算电路,则每轮运算中可以调度总计64个运算电路,分别为64个Co执行运算。进一步假设权值的Co维度大小为128,超过可调度的运算电路总数64,则可以分成两轮运算来完成全部计算。在第一轮运算中,64个运算电路分别针对Co=0,1,…,63的权值进行运算;在第二轮运算中,这64个运算电路分别针对Co=64,65,…,127的权值进行运算。因此,权值按照Co维度可以拆分成128个权值块,前64个权值块可以是第一权值块组821,后64个权值块可以是第二权值块组822。
进一步地,由于第二存储电路是按照从处理电路分配存储区域的,而每个从处理电路中包括多个运算电路,因此,在一些实施例中,可以将每个权值块组中的权值块按照对应轮次运算中所调度的从处理电路顺序分段,每个权值块段对应一个调度的从处理电路,每个权值块段分别存储在第二存储电路中为对应的从处理电路分配的存储区域中。每个权值块段内包含至少一个权值块,即每个从处理电路对应的权值块为一个以上。可选地,每个权值块段内包括的权值块数量等于每个从处理电路中包括的运算电路的数量。
如图所示,在第一权值块组821中,64个权值块按照16个从处理电路顺序分成16个权值块段,其中包括Co=0,1,2,3这4个权值块的第一权值块段831分配给第一从处理电路,其中的4个权值块分别分配给第一从处理电路中的4个运算电路;包括Co=4,5,6,7这4个权值块的第二权值块段832分配给第二从处理电路,其中的4个权值块分别分配给第二从处理电路中的4个运算电路;以此类推。在第二权值块组822中,也类似地划分权值块段并相应地进行存储,此处不再重复。
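
The mapping implied by this grouping example (128 output channels, an assumed 16 slave circuits with 4 arithmetic circuits each, hence 2 rounds) can be written down as a small helper; it only illustrates the index arithmetic, not the actual storage circuitry.

```python
NUM_SLAVES, ALUS_PER_SLAVE = 16, 4          # assumed scheduling, as in the example
PER_ROUND = NUM_SLAVES * ALUS_PER_SLAVE     # 64 output channels per round

def schedule(co):
    """Map an output channel index to (round, slave, alu): consecutive Co blocks
    are grouped by round, then cut into one segment per slave (sketch only)."""
    rnd, offset = divmod(co, PER_ROUND)
    slave, alu = divmod(offset, ALUS_PER_SLAVE)
    return rnd, slave, alu

print(schedule(0))     # (0, 0, 0): round 0, first slave circuit, first ALU
print(schedule(5))     # (0, 1, 1): Co=4..7 form the segment of the second slave
print(schedule(64))    # (1, 0, 0): Co=64 opens the second round
```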
前面描述了本披露实施例的计算装置的硬件结构以及数据的示例性存储方式,上述硬件结构可以为参与运算的输入特征图和权值提供不同的数据通路,从而利用不同的数据传输方式(例如,广播、多播、分发等)来减少运算期间的数据吞吐量,提高运算效率。在实际运算中,根据参与 运算的数据的规模特性,可以采取不同的复用方式,包括权值复用和/或输入特征图复用。
在一些实施例中,输入特征图可以在同一个从处理电路的所有运算电路上复用,而每个运算电路针对不同的输出通道对应的权值块和该输入特征图执行运算。此时,输入特征图以广播方式传输给所有运算电路,而每个运算电路可以预先加载对应输出通道的权值。
在一些实现中,每个被调度的从处理电路可以按照所分配的Co维度值,从第二存储电路中轮流读取当前轮次运算中分配给该从处理电路的权值块段中各个权值块的权值行。所读取的权值行继而被存储到该从处理电路的第一缓冲电路中。在运算前,从处理电路可以按照各个权值行所对应的Co维度,分发给该从处理电路中的不同运算电路。运算期间,从处理电路可以将第二缓冲电路中的特征图块广播给各个运算电路。由此,运算电路可以针对分发的权值行与广播的特征图块执行对位乘累加运算,得到权值行与特征图块所对应的感受野上的部分和结果。
以图8的示例为例,第二存储电路中为每个从处理电路分配的存储区域连续存储4个Co的权值。第一从处理电路可以在Co方向交替读取数据,例如在第一步运算中,先读Co=0的第一权值行,然后Co=1的第一权值行,Co=2的第一权值行,最后是Co=3的第一权值行。在下一步运算中,先读Co=0的第二权值行,然后Co=1的第二权值行,Co=2的第二权值行,最后是Co=3的第二权值行。
读出的权值行存在第一缓冲电路中,并按Co分发给不同的运算电路。例如,Co=0的权值行发送给第一运算电路,Co=1的权值行发送给第二运算电路,以此类推。
从处理电路将缓存在第二缓冲电路上的特征图块广播给其内的所有运算电路。该从处理电路的各个运算电路分别在第一步运算中得到对应Co上首个(或第一)感受野上的一个部分和。
为了得到整个感受野上的所有部分和结果,从而获得与感受野对应的卷积输出点的最终结果,需要遍历整个感受野,重复多次获取相应的权值行与特征图块,执行权值行与特征图块对位乘累加运算得到多个部分和结果,这些部分和结果累加得到对应卷积输出点的最终结果。
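
Putting the last few paragraphs together, here is a sketch of one slave circuit producing, for a single output point, the convolution outputs of the output channels assigned to it: the same feature block is broadcast to every arithmetic circuit, each circuit applies its own weight row, and the per-step partial sums are accumulated while the receptive field is traversed. The dictionary layout and sizes are assumptions for illustration.

```python
import numpy as np

def slave_one_output_point(weight_rows_per_co, feature_rows):
    """weight_rows_per_co: dict Co -> list of 64-value weight rows covering the
    receptive field; feature_rows: the matching 64-value feature blocks.
    Each arithmetic circuit handles one Co; the feature block is broadcast."""
    acc = {co: 0.0 for co in weight_rows_per_co}
    for step, x_row in enumerate(feature_rows):        # traverse the receptive field
        for co, w_rows in weight_rows_per_co.items():  # one circuit per Co
            acc[co] += float(np.dot(w_rows[step], x_row))   # accumulate partial sums
    return acc   # convolution outputs of this output point for the slave's Cos

# toy usage: 2 row pairs per receptive field, Cos 0..3 assigned to one slave
feature_rows = [np.random.randn(64) for _ in range(2)]
weights = {co: [np.random.randn(64) for _ in range(2)] for co in range(4)}
print(slave_one_output_point(weights, feature_rows))
```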
在遍历过程中,可以采用不同的复用方式。相应地,从处理电路可以根据权值和/或输入特征图的复用方式,控制读取第一缓冲电路和第二缓冲电路中的内容。
在一些实现中,当采用权值复用时,也即同一权值行可以用于多个不同的输入特征图块,从处理电路可以将第二缓冲电路中缓存的输入特征图中对应不同卷积输出点/感受野的特征图块连续广播给其内的多个运算电路。此处,不同卷积输出点/感受野的数量等于权值复用次数SR。例如,当权值复用次数SR=2时,可以将对应例如第一卷积输出点的第一特征图块和对应第二卷积输出点的第二特征图块连续广播给从处理电路内的所有4个运算电路。
在这些实现中，每个运算电路可以针对连续广播的特征图块，使用同一权值行分别执行对位乘累加运算，得到属于不同卷积输出点的SR个部分和结果。
在多轮运算中,可以将每次得到的属于同一卷积输出点的部分和结果进行累加,直至已累加了遍历对应感受野得到的所有部分和结果,从而获得该卷积输出点的最终结果。
可以理解,根据具体情况,权值复用次数可以不同,例如SR可以取2、4、8、….。权值复用次数SR受限于第二存储电路的读取带宽和读取端口数量。例如,当第二存储电路的读取带宽为64字节,端口数为1时,至少读1拍的64字节到第一缓冲电路,最多读8拍的64字节数据。此时,权值复用次数SR最多为32次。
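
A minimal sketch of this weight-reuse step, assuming SR = 2: one 64-value weight row is applied to SR consecutively broadcast feature blocks, producing SR partial sums that belong to SR different convolution output points.

```python
import numpy as np

SR = 2   # assumed weight reuse count

def alu_weight_reuse(w_row, feature_blocks):
    """One arithmetic-circuit step under weight reuse: the same weight row is
    applied to SR feature blocks, one per convolution output point (sketch)."""
    return [float(np.dot(w_row, x_row)) for x_row in feature_blocks]

w_row = np.random.randn(64)
blocks = [np.random.randn(64) for _ in range(SR)]   # output points 0 and 1
print(alu_weight_reuse(w_row, blocks))   # SR partial sums, accumulated separately
```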
可选地或附加地,在一些实现中,可以采用输入特征图复用,也即同一特征图块可以用于多个不同的权值行。注意,此处的输入特征图复用是指在单个运算电路中同一输入特征图多次用于与不同的权值行进行运算。在前面描述的实施例中,输入特征图复用是在所有运算电路上复用,也即同一输入特征图在多个运算电路上分别与不同的权值行进行运算。
当在单个运算电路中采用输入特征图复用时,从处理电路可以按照Co维度,从分配给该从处理电路的权值块段中的各个权值块中各读取一个权值行,其中读取的权值行数量等于输入特征图复用次数NR与该从处理电路内的运算电路数量之积。然后,读取的权值行可以缓存在第一缓冲电路中并分发给各个运算电路。
在这些实现中,每个运算电路针对从第二缓冲电路广播的特征图块,使用从第一缓冲电路分 发的NR个权值行分别执行对位乘累加运算,得到属于不同Co维度的NR个部分和结果。
例如以图8的示例为例,在第二存储电路中存储了128个Co的权值块,分配给每个从处理电路的存储区域中各包括8个Co的权值块。在采用输入特征图复用次数NR=2的方案时,例如对于第一从处理电路,在读数时每次从8个权值块中各取出一个权值行存入第一缓冲电路中。在第一从处理电路的每个运算电路中计算两个Co的结果,也即每个特征图块被复用2次。例如第一运算电路计算Co=0和Co=64的结果,第二运算电路计算Co=1和Co=65的结果,以此类推。由此,16个从处理电路可以同时计算16×4×2=128个Co的结果。可以理解,取决于Co的分配方式,也可以每个运算电路处理顺序的两个Co的权值行,例如第一运算电路计算Co=0和Co=1的结果,本披露在此方面没有限制。还可以理解,当Co超过128时,还需要在Co维度上遍历,重复发送输入特征图块,而从处理电路读取不同的Co所对应的权值行。
同样地,在多轮运算中,可以将每次得到的属于同一Co维度的部分和结果进行累加,以得到对应Co上的卷积输出。
可以理解,根据具体情况,输入特征图复用次数可以不同,例如NR可以取2、4、8、….。输入特征图复用次数NR受限于第一缓冲电路的容量大小。例如,第一缓冲电路可以存储9×64B的数据。在无输入特征图复用时,存放4×64B的权值,分别对应4个运算电路;存在输入特征图复用时,存放8×64B的权值,每2×64B对应1个运算电路。因此,在此示例中,受限于第一缓冲电路,输入特征图的复用次数NR最多为2次。
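
The mirror image of the weight-reuse sketch above, assuming NR = 2: one broadcast feature block is reused against NR weight rows belonging to NR different output channels, producing NR partial sums (for example Co=0 and Co=64 on the first arithmetic circuit).

```python
import numpy as np

NR = 2   # assumed input feature map reuse count, limited by the first buffer

def alu_feature_reuse(w_rows, x_row):
    """One arithmetic-circuit step under input feature map reuse: the broadcast
    feature block is reused against NR weight rows of NR output channels (sketch)."""
    return [float(np.dot(w, x_row)) for w in w_rows]

x_row = np.random.randn(64)                         # one broadcast feature block
w_rows = [np.random.randn(64) for _ in range(NR)]   # weight rows of two Cos
print(alu_feature_reuse(w_rows, x_row))             # NR partial sums
```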
上面描述的单个运算电路内的权值复用和输入特征图复用,可以单独使用,也可以组合使用。无论采用哪种复用方式,主处理电路可以将从所调度的多个从处理电路在多轮运算中返回的运算结果按照分块和复用方式进行拼接,以得到最终结果。具体地,属于同一Co维度的、属于同一感受野的部分和结果进行累加,得到该Co维度上与感受野对应的卷积输出点的结果。
如前面所提到的,主处理电路例如可以从多个从处理电路接收中间结果并执行后续处理,以得到最终运算结果。具体地,在上述实施例中,主处理电路可以用于将处理不同Co维度的从处理电路的运算结果进行拼接,以得到整个Co维度上的卷积运算结果。
在另一些实施例中,通过多轮计算来完成单个Co维度的卷积运算的每个从处理电路可以将各轮计算中的部分和结果按照对应的卷积输出点/感受野进行累加汇总后再返回给主处理电路。
本披露实施例还提供了利用前述计算装置执行卷积运算的方法。图9示出根据本披露实施例的卷积运算方法900的示例性流程图。
如图所示,在步骤910中,主处理电路在卷积运算期间,将输入特征图按最低存储维度分块,以广播方式将特征图块传输给调度的多个从处理电路。在步骤920中,主处理电路将权值按照Co维度分块,以供所调度的从处理电路加载对应的权值块。在步骤930中,所调度的每个从处理电路针对特征图块和对应的权值块,执行卷积运算;以及将运算结果返回给主处理电路。
本领域技术人员可以理解,方法流程图中描述的步骤与前面结合图4-6描述的计算装置的各个电路相对应,因此前面描述的特征同样适用于方法步骤,此处不再重复。虽然上面按照方法流程顺序描述了本披露实施例的卷积运算方法,但是本领域技术人员可以理解,这些方法步骤也可以采用其他顺序来执行或者同时执行。例如,步骤910和步骤920可以同时执行,或者步骤920可以在步骤910之前执行。
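
As an end-to-end check of this flow, the self-contained sketch below blocks the input along Ci (an analogue of step 910), blocks the weights along Co (an analogue of step 920), and accumulates block-wise partial sums into every output point (an analogue of step 930), then compares the result against a direct computation. It is a software illustration under assumed shapes, stride 1 and no padding, not the hardware implementation.

```python
import numpy as np

def conv_by_blocks(x, w, align=64):
    """Convolution computed block by block: 64-value feature blocks along Ci,
    one weight block per output channel, partial sums accumulated per point."""
    H, W, Ci = x.shape
    Co, Kh, Kw, _ = w.shape
    pad = -Ci % align
    x = np.pad(x, ((0, 0), (0, 0), (0, pad)))
    w = np.pad(w, ((0, 0), (0, 0), (0, 0), (0, pad)))
    Ho, Wo = H - Kh + 1, W - Kw + 1
    y = np.zeros((Ho, Wo, Co))
    for ho in range(Ho):
        for wo in range(Wo):
            for kh in range(Kh):
                for kw in range(Kw):
                    for c0 in range(0, Ci + pad, align):
                        x_blk = x[ho + kh, wo + kw, c0:c0 + align]   # broadcast block
                        for co in range(Co):                         # one weight block per Co
                            y[ho, wo, co] += np.dot(w[co, kh, kw, c0:c0 + align], x_blk)
    return y

x = np.random.randn(6, 6, 100)
w = np.random.randn(8, 3, 3, 100)
ref = np.zeros((4, 4, 8))
for ho in range(4):
    for wo in range(4):
        field = x[ho:ho + 3, wo:wo + 3, :]
        ref[ho, wo] = np.tensordot(w, field, axes=([1, 2, 3], [0, 1, 2]))
print(np.allclose(conv_by_blocks(x, w), ref))   # True
```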
本披露实施例还提供了一种芯片,其可以包括前面结合附图描述的任一实施例的计算装置。进一步地,本披露还提供了一种板卡,该板卡可以包括前述芯片。
根据不同的应用场景,本披露的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、 制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步,本披露的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中,根据本披露方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器),而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中,云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容,从而可以根据终端设备和/或边缘端设备的硬件信息,从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源,以便完成端云一体或云边端一体的统一管理、调度和协同工作。
需要说明的是,为了简明的目的,本披露将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此,依据本披露的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本披露对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本披露某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。
在具体实现方面,基于本披露的公开和教导,本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行拆分,而实际实现时也可以有另外的拆分方式。又例如,可以将多个单元或组件结合或者集成到另一个系统,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。
在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。
在另外一些实现场景中,上述集成的单元也可以采用硬件的形式实现,即为具体的硬件电路,其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件,而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此,本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现,例如中央处理器、GPU、FPGA、DSP和ASIC等。进一步,前述的所述存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等),其例如可以是可变电阻式存储器(Resistive Random Access Memory,RRAM)、动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,EDRAM)、高带宽存储器(High Bandwidth Memory,HBM)、混合存储器立方体(Hybrid Memory Cube,HMC)、ROM和RAM等。
依据以下条款可更好地理解前述内容:
条款1、一种计算装置,配置用于执行卷积运算,所述计算装置包括主处理电路和多个从处理电路,其中:
所述主处理电路用于:
在所述卷积运算期间,以广播方式将输入特征图的至少一个特征图块传输给调度的多个从处理电路,其中,所述特征图块是将所述输入特征图按最低存储维度分块获得的;以及
所调度的每个所述从处理电路用于:
针对所述特征图块和对应的权值块,执行卷积运算,其中,所述权值块是按照输出通道维度分块获得的;以及
将运算结果返回给所述主处理电路。
条款2、根据条款1所述的计算装置,其中所述主处理电路进一步用于:
在卷积运算期间,将所述输入特征图按最低存储维度分块。
条款3、根据条款1或2所述的计算装置,其中,所述主处理电路进一步用于:
在广播所述特征图块时,将所述特征图块在所述最低存储维度上对齐到第一对齐要求,所述第一对齐要求根据所述从处理电路的处理能力而确定。
条款4、根据条款3所述的计算装置,其中所述第一对齐要求等于所述从处理电路中的运算电路的单次最大数据处理量,对齐后的每个特征图块在所述最低存储维度上的大小等于所述单次最大数据处理量。
条款5、根据条款1-4任一所述的计算装置,其中,所述主处理电路进一步用于:
将所述权值按照输出通道维度分块,以供所调度的所述从处理电路加载对应的权值块。
条款6、根据条款5所述的计算装置,其中,所述主处理电路进一步用于:
将在所述输出通道维度上连续划分的多个权值块,按照运算轮次顺序分组,每个权值块组中的权值块数量对应于对应轮次运算中所调度的从处理电路的总运算能力;
将每个权值块组中的权值块按照对应轮次运算中所调度的从处理电路顺序分段,每个权值块段对应一个调度的从处理电路;以及
将每个权值块段分别存储在为对应的从处理电路分配的存储区域中。
条款7、根据条款1-6任一所述的计算装置,其中每个从处理电路还包括第一缓冲电路、第二缓冲电路和多个运算电路,其中:
所述第一缓冲电路,用于缓存所述从处理电路对应的至少一个所述权值块中按最低存储维度划分的一个或多个权值行,所述权值行在运算期间被分发给对应的运算电路;以及
所述第二缓冲电路,用于缓存所述主处理电路广播的特征图块,所述特征图块在运算期间被广播传输给所述从处理电路内的所有运算电路;
其中每个运算电路用于:针对从所述第一缓冲电路中分发的权值行和从第二缓冲电路中广播的特征图块,执行对位乘累加运算。
条款8、根据条款7所述的计算装置,其中所述从处理电路进一步用于:
按照所述输出通道维度,轮流读取当前轮次运算中分配给所述从处理电路的权值块段中各个权值块的权值行;
将读取的权值行存储到所述第一缓冲电路中;以及
按照各个权值行所对应的输出通道维度,分发给所述从处理电路中的不同运算电路以与从第二缓冲电路广播的特征图块执行对位乘累加运算,得到对应卷积输出点上的部分和结果。
条款9、根据条款8所述的计算装置,其中所述从处理电路进一步用于:根据所述权值和/或所述输入特征图的复用方式,控制读取所述第一缓冲电路和第二缓冲电路中的内容,以将所述权值行和所述特征图块同时遍历卷积输出点的整个感受野执行对位乘累加运算,得到多个部分和结果并累加得到对应卷积输出点上的卷积输出。
条款10、根据条款9所述的计算装置,其中:
所述从处理电路进一步用于将所述第二缓冲电路中缓存的所述输入特征图中对应不同卷积输出点的特征图块连续广播给所述多个运算电路,其中所述不同卷积输出点的数量等于权值复用次数SR;并且
每个运算电路进一步用于:
针对连续广播的特征图块,使用同一权值行分别执行对位乘累加运算,得到属于不同卷积输出点的SR个部分和结果;以及
将在多轮运算中得到的属于同一卷积输出点的部分和结果进行累加,以得到对应卷积输出点上的卷积输出。
条款11、根据条款9-10任一所述的计算装置,其中:
所述从处理电路进一步用于:
按照所述输出通道维度,从分配给所述从处理电路的权值块段中的各个权值块中各读取一个权值行,其中读取的权值行数量等于输入特征图复用次数NR与所述从处理电路内的运算电路数量之积;以及
将读取的权值行存入所述第一缓冲电路中并分发给所述多个运算电路;并且
每个运算电路进一步用于:
针对从所述第二缓冲电路广播的特征图块,使用从所述第一缓冲电路分发的NR个权值行分别执行对位乘累加运算,得到属于不同输出通道维度的NR个部分和结果;以及
将在多轮运算中得到的属于同一输出通道维度的部分和结果进行累加,以得到对应输出通道维度上的卷积输出。
条款12、根据条款1-11任一所述的计算装置,其中所述主处理电路进一步用于:将从所调度的多个从处理电路在多轮运算中返回的运算结果按照分块和复用方式进行拼接,以得到最终结果。
条款13、一种芯片,其特征在于,所述芯片包括如条款1-12任一所述的计算装置。
条款14、一种板卡,其特征在于,所述板卡包括条款12所述的芯片。
条款15、一种由条款1-12任一所述的计算装置实施卷积运算的方法。
以上对本披露实施例进行了详细介绍,本文中应用了具体个例对本披露的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本披露的方法及其核心思想;同时,对于本领域的一般技术人员,依据本披露的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本披露的限制。

Claims (15)

  1. 一种计算装置,配置用于执行卷积运算,所述计算装置包括主处理电路和多个从处理电路,其中:
    所述主处理电路用于:
    在所述卷积运算期间,以广播方式将输入特征图的至少一个特征图块传输给调度的多个从处理电路,其中,所述特征图块是将所述输入特征图按最低存储维度分块获得的;以及
    所调度的每个所述从处理电路用于:
    针对所述特征图块和对应的权值块,执行卷积运算,其中,所述权值块是按照输出通道维度分块获得的;以及
    将运算结果返回给所述主处理电路。
  2. 根据权利要求1所述的计算装置,其中所述主处理电路进一步用于:
    在卷积运算期间,将所述输入特征图按最低存储维度分块。
  3. 根据权利要求1或2所述的计算装置,其中,所述主处理电路进一步用于:
    在广播所述特征图块时,将所述特征图块在所述最低存储维度上对齐到第一对齐要求,所述第一对齐要求根据所述从处理电路的处理能力而确定。
  4. 根据权利要求3所述的计算装置,其中所述第一对齐要求等于所述从处理电路中的运算电路的单次最大数据处理量,对齐后的每个特征图块在所述最低存储维度上的大小等于所述单次最大数据处理量。
  5. 根据权利要求1-4任一所述的计算装置,其中,所述主处理电路进一步用于:
    将所述权值按照输出通道维度分块,以供所调度的所述从处理电路加载对应的权值块。
  6. 根据权利要求5所述的计算装置,其中,所述主处理电路进一步用于:
    将在所述输出通道维度上连续划分的多个权值块,按照运算轮次顺序分组,每个权值块组中的权值块数量对应于对应轮次运算中所调度的从处理电路的总运算能力;
    将每个权值块组中的权值块按照对应轮次运算中所调度的从处理电路顺序分段,每个权值块段对应一个调度的从处理电路;以及
    将每个权值块段分别存储在为对应的从处理电路分配的存储区域中。
  7. 根据权利要求1-6任一所述的计算装置,其中每个从处理电路还包括第一缓冲电路、第二缓冲电路和多个运算电路,其中:
    所述第一缓冲电路,用于缓存所述从处理电路对应的至少一个所述权值块中按最低存储维度划分的一个或多个权值行,所述权值行在运算期间被分发给对应的运算电路;以及
    所述第二缓冲电路,用于缓存所述主处理电路广播的特征图块,所述特征图块在运算期间被广播传输给所述从处理电路内的所有运算电路;
    其中每个运算电路用于:针对从所述第一缓冲电路中分发的权值行和从第二缓冲电路中广播的特征图块,执行对位乘累加运算。
  8. 根据权利要求7所述的计算装置,其中所述从处理电路进一步用于:
    按照所述输出通道维度,轮流读取当前轮次运算中分配给所述从处理电路的权值块段中各个权值块的权值行;
    将读取的权值行存储到所述第一缓冲电路中;以及
    按照各个权值行所对应的输出通道维度,分发给所述从处理电路中的不同运算电路以与从第二缓冲电路广播的特征图块执行对位乘累加运算,得到对应卷积输出点上的部分和结果。
  9. 根据权利要求8所述的计算装置,其中所述从处理电路进一步用于:
    根据所述权值和/或所述输入特征图的复用方式,控制读取所述第一缓冲电路和第二缓冲电路中的内容,以将所述权值行和所述特征图块同时遍历卷积输出点的整个感受野执行对位乘累加运算,得到多个部分和结果并累加得到对应卷积输出点上的卷积输出。
  10. 根据权利要求9所述的计算装置,其中:
    所述从处理电路进一步用于将所述第二缓冲电路中缓存的所述输入特征图中对应不同卷积 输出点的特征图块连续广播给所述多个运算电路,其中所述不同卷积输出点的数量等于权值复用次数SR;并且
    每个运算电路进一步用于:
    针对连续广播的特征图块,使用同一权值行分别执行对位乘累加运算,得到属于不同卷积输出点的SR个部分和结果;以及
    将在多轮运算中得到的属于同一卷积输出点的部分和结果进行累加,以得到对应卷积输出点上的卷积输出。
  11. 根据权利要求9-10任一所述的计算装置,其中:
    所述从处理电路进一步用于:
    按照所述输出通道维度,从分配给所述从处理电路的权值块段中的各个权值块中各读取一个权值行,其中读取的权值行数量等于输入特征图复用次数NR与所述从处理电路内的运算电路数量之积;以及
    将读取的权值行存入所述第一缓冲电路中并分发给所述多个运算电路;并且
    每个运算电路进一步用于:
    针对从所述第二缓冲电路广播的特征图块,使用从所述第一缓冲电路分发的NR个权值行分别执行对位乘累加运算,得到属于不同输出通道维度的NR个部分和结果;以及
    将在多轮运算中得到的属于同一输出通道维度的部分和结果进行累加,以得到对应输出通道维度上的卷积输出。
  12. 根据权利要求1-11任一所述的计算装置,其中所述主处理电路进一步用于:
    将从所调度的多个从处理电路在多轮运算中返回的运算结果按照分块和复用方式进行拼接,以得到最终结果。
  13. 一种芯片,其特征在于,所述芯片包括如权利要求1-12任一所述的计算装置。
  14. 一种板卡,其特征在于,所述板卡包括权利要求13所述的芯片。
  15. 一种由权利要求1-12任一所述的计算装置实施卷积运算的方法。
PCT/CN2022/097669 2021-06-10 2022-06-08 计算装置、利用计算装置实施卷积运算的方法及相关产品 WO2022257980A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/565,068 US20240265242A1 (en) 2021-06-10 2022-06-08 Computing apparatus, method for implementing convulution operation by using computing apparatus, and related product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110648346.2A CN115470176B (zh) 2021-06-10 2021-06-10 计算装置、利用计算装置实施卷积运算的方法及相关产品
CN202110648346.2 2021-06-10

Publications (1)

Publication Number Publication Date
WO2022257980A1 true WO2022257980A1 (zh) 2022-12-15

Family

ID=84363557

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/097669 WO2022257980A1 (zh) 2021-06-10 2022-06-08 计算装置、利用计算装置实施卷积运算的方法及相关产品

Country Status (3)

Country Link
US (1) US20240265242A1 (zh)
CN (1) CN115470176B (zh)
WO (1) WO2022257980A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059797A (zh) * 2018-10-10 2019-07-26 北京中科寒武纪科技有限公司 一种计算装置及相关产品
CN110866589A (zh) * 2018-08-10 2020-03-06 高德软件有限公司 深度神经网络模型的运行方法、装置及框架
US20200143254A1 (en) * 2018-11-02 2020-05-07 Tata Consultancy Services Limited Method and system for partitioning of deep convolution network for executing on computationally constraint devices
CN112288082A (zh) * 2020-11-23 2021-01-29 天津大学 一种基于hls的可重构通用标准卷积加速器设计方法
CN112508184A (zh) * 2020-12-16 2021-03-16 重庆邮电大学 一种基于卷积神经网络的快速图像识别加速器设计方法
CN112801901A (zh) * 2021-01-21 2021-05-14 北京交通大学 基于分块多尺度卷积神经网络的图像去模糊算法

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11645512B2 (en) * 2019-04-30 2023-05-09 Baidu Usa Llc Memory layouts and conversion to improve neural network inference performance
US11741350B2 (en) * 2019-11-27 2023-08-29 Amazon Technologies, Inc. Efficient utilization of processing element array
CN112470138A (zh) * 2019-11-29 2021-03-09 深圳市大疆创新科技有限公司 计算装置、方法、处理器和可移动设备
JP6888074B2 (ja) * 2019-12-06 2021-06-16 カンブリコン テクノロジーズ コーポレーション リミテッドCambricon Technologies Corporation Limited チップ装置および関連製品
JP6888073B2 (ja) * 2019-12-06 2021-06-16 カンブリコン テクノロジーズ コーポレーション リミテッドCambricon Technologies Corporation Limited チップ装置および関連製品
CN112633490B (zh) * 2020-12-31 2023-09-26 上海寒武纪信息科技有限公司 执行神经网络模型的数据处理装置、方法及相关产品

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866589A (zh) * 2018-08-10 2020-03-06 高德软件有限公司 深度神经网络模型的运行方法、装置及框架
CN110059797A (zh) * 2018-10-10 2019-07-26 北京中科寒武纪科技有限公司 一种计算装置及相关产品
US20200143254A1 (en) * 2018-11-02 2020-05-07 Tata Consultancy Services Limited Method and system for partitioning of deep convolution network for executing on computationally constraint devices
CN112288082A (zh) * 2020-11-23 2021-01-29 天津大学 一种基于hls的可重构通用标准卷积加速器设计方法
CN112508184A (zh) * 2020-12-16 2021-03-16 重庆邮电大学 一种基于卷积神经网络的快速图像识别加速器设计方法
CN112801901A (zh) * 2021-01-21 2021-05-14 北京交通大学 基于分块多尺度卷积神经网络的图像去模糊算法

Also Published As

Publication number Publication date
CN115470176A (zh) 2022-12-13
US20240265242A1 (en) 2024-08-08
CN115470176B (zh) 2024-04-09

Similar Documents

Publication Publication Date Title
WO2023045445A1 (zh) 数据处理装置、数据处理方法及相关产品
US20230367722A1 (en) Data processing device and method, and related products
CN112633490B (zh) 执行神经网络模型的数据处理装置、方法及相关产品
WO2023123919A1 (zh) 数据处理电路、数据处理方法及相关产品
WO2023045446A1 (zh) 计算装置、数据处理方法及相关产品
CN113469336A (zh) 优化神经网络模型的编译方法、执行方法及相关产品
US20240160689A1 (en) Method for optimizing convolution operation of system on chip and related product
WO2022134873A1 (zh) 数据处理装置、数据处理方法及相关产品
WO2024149112A1 (zh) 卷积算子的编译方法及相关产品
CN112084023A (zh) 数据并行处理的方法、电子设备及计算机可读存储介质
WO2023045638A1 (zh) 计算装置、利用计算装置实施卷积运算的方法及相关产品
WO2022257980A1 (zh) 计算装置、利用计算装置实施卷积运算的方法及相关产品
CN114358261A (zh) 融合神经网络的装置、板卡、方法及可读存储介质
WO2022134872A1 (zh) 数据处理装置、数据处理方法及相关产品
CN114281561A (zh) 处理单元、用于处理单元的同步方法及相应产品
WO2023087698A1 (zh) 执行卷积运算的计算装置、方法及相关产品
WO2022135600A1 (zh) 计算神经网络的装置、板卡、方法及可读存储介质
CN113792867B (zh) 运算电路、芯片和板卡
WO2023087814A1 (zh) 计算装置、利用计算装置实施卷积运算的方法及相关产品
WO2023045444A1 (zh) 执行多维数据的二元运算的计算装置、方法及相关产品
CN113469333B (zh) 执行神经网络模型的人工智能处理器、方法及相关产品
CN117235424A (zh) 计算装置、计算方法及相关产品
CN114692841A (zh) 数据处理装置、数据处理方法及相关产品
CN114444677A (zh) 进行稀疏化训练的装置、板卡、方法及可读存储介质
CN114692840A (zh) 数据处理装置、数据处理方法及相关产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22819577

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22819577

Country of ref document: EP

Kind code of ref document: A1