WO2023045446A1 - Computing device, data processing method and related products - Google Patents

Computing device, data processing method and related products

Info

Publication number
WO2023045446A1
WO2023045446A1 (PCT/CN2022/100303)
Authority
WO
WIPO (PCT)
Prior art keywords
dimension
input feature
processing
data
core
Prior art date
Application number
PCT/CN2022/100303
Other languages
English (en)
French (fr)
Inventor
肖麟慧
郑鎏韬
王楠
喻歆
Original Assignee
寒武纪(西安)集成电路有限公司
Priority date
Filing date
Publication date
Application filed by 寒武纪(西安)集成电路有限公司
Publication of WO2023045446A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 - General purpose image data processing
    • G06T 1/20 - Processor architectures; Processor configuration, e.g. pipelining
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08 - Learning methods

Definitions

  • the present disclosure relates generally to the field of data processing. More specifically, the present disclosure relates to a computing device, a method for processing data by using the computing device, a chip and a board.
  • Neural networks are one of the most critical technologies in artificial intelligence and deep learning, among which the Convolutional Neural Network (CNN) is the most important network type.
  • the most critical calculation in the convolutional neural network is the convolution operation (Convolution Operation) of the convolution layer (Conv layer).
  • the function of the convolutional layer is to extract features from the input data. Through multi-layer convolution, complex features can be extracted to ensure that the network has sufficient expressive ability and generalization ability.
  • the neural network model contains a large number of various types of convolution operations, and the computing performance of the convolution operation greatly affects the computing performance of the entire neural network model.
  • For different convolutional layers, the corresponding input feature maps and weights may have different dimensions.
  • In view of this, the present disclosure proposes a computing device in various aspects, which splits the task of the convolution operation so that convolution operations of various scales can be adapted to the hardware performing the convolution operation, thereby improving the computational efficiency of the convolution operation.
  • The convolution operation in the embodiments of the present disclosure may be an operation in various neural network models, and these neural network models can be applied in various fields such as image processing, speech processing, and text processing; such processing may include, but is not limited to, recognition and classification.
  • An embodiment of the present disclosure provides a computing device including a master device and a slave device, where the slave device includes one or more processing cores: the master device is configured to launch a first task, which is used to perform a convolution operation on input feature data and convolution kernel data; and the slave device is configured to schedule a corresponding number of the processing cores to perform the convolution operation according to the split strategy of the first task.
  • In each core pass of the first task, the convolution operation is performed on a portion of the input feature data.
  • an embodiment of the present disclosure provides a chip, which includes the computing device in the aforementioned first aspect.
  • an embodiment of the present disclosure provides a board, which includes the aforementioned chip in the second aspect.
  • an embodiment of the present disclosure provides a method for processing data by using the computing device in the aforementioned first aspect.
  • The solutions of the embodiments of the present disclosure provide an optimization for splitting the convolution operation task on a single-core or multi-core computing device so as to match the processing capability of the hardware, which makes full use of the parallel processing capability of multiple slave processing circuits and effectively improves the computing efficiency of the convolution operation.
  • Fig. 1 shows a structural diagram of a board according to an embodiment of the present disclosure
  • FIG. 2 shows a structural diagram of a combination processing device according to an embodiment of the present disclosure
  • FIG. 3a shows a schematic diagram of the internal structure of a processor core of a single-core computing device according to an embodiment of the disclosure
  • Fig. 3b shows a simplified schematic diagram of the internal structure of a multi-core computing device according to an embodiment of the present disclosure
  • FIG. 4 shows an example of an exemplary convolution operation principle to which embodiments of the present disclosure can be applied
  • Fig. 5 shows a schematic structural block diagram of a computing device according to an embodiment of the disclosure
  • FIG. 6 shows an exemplary data storage sequence according to an embodiment of the present disclosure
  • Figures 7a-7c illustrate several exemplary grouping modes according to embodiments of the present disclosure
  • Fig. 8 shows an exemplary split schematic diagram of an input feature map according to an embodiment of the present disclosure
  • FIG. 9 exemplarily shows an exemplary structural diagram of a computing device that can implement an embodiment of the present disclosure.
  • Fig. 10 shows a schematic diagram of splitting neurons in different situations according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure.
  • The board 10 includes a chip 101, which is a system-on-chip (System on Chip, SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements for the storage capacity and computing power of the platform.
  • The board 10 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, on-chip storage, and powerful computing capabilities.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • The storage device 104 is connected to the control device 106 and the chip 101 through a bus and transmits data with them.
  • the control device 106 in the board 10 is configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing the combined processing means in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201 , an interface device 202 , a processing device 203 and a storage device 204 .
  • The computing device 201 is configured to perform operations specified by the user. It is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor for performing deep learning or machine learning calculations, and it can interact with the processing device 203 through the interface device 202 to jointly complete the operations specified by the user.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into a storage device on the computing device 201 .
  • the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the chip of the computing device 201 .
  • the interface device 202 may also read data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201 .
  • the processing device 203 may be one or more types of a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU) or other general-purpose and/or special-purpose processors.
  • processors including but not limited to digital signal processors (digital signal processors, DSPs), application specific integrated circuits (application specific integrated circuits, ASICs), field-programmable gate arrays (field-programmable gate arrays, FPGAs) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when considering the integration of the computing device 201 and the processing device 203 together, they are considered to form a heterogeneous multi-core structure.
  • The storage device 204 is used to store the data to be processed. It may be a DRAM (DDR memory), usually 16G or larger in size, and is used to store data of the computing device 201 and/or the processing device 203.
  • Fig. 3a shows a schematic diagram of the internal structure of a processing core when the computing device 201 is a single-core device.
  • the computing device 301 is used for processing input data such as computer vision, speech, natural language, data mining, etc.
  • the computing device 301 includes three modules: a control module 31 , an operation module 32 and a storage module 33 .
  • The control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete deep learning tasks; it includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312.
  • the instruction fetching unit 311 is used to obtain instructions from the processing device 203 , and the instruction decoding unit 312 decodes the obtained instructions and sends the decoding results to the computing module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
  • the storage module 33 is used to store or transfer relevant data, including a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons and intermediate results after calculation;
  • WRAM 332 is used to store convolution kernels of deep learning networks, that is, weights;
  • DMA 333 is connected to DRAM 204 through bus 34 and is responsible for data transfer between the computing device 301 and DRAM 204.
  • Fig. 3b shows a simplified schematic diagram of the multi-core internal structure of the computing device 201 .
  • Multi-core computing devices can be abstracted using a hierarchical hardware model. As shown in the figure, the multi-core computing device can be abstracted into four levels, namely card level (Card) 350 , chip level (Chip) 360 , processor cluster level (Cluster) 370 and processor core level (Core) 380 .
  • the embodiments of the present disclosure mainly involve the data transmission of the storage unit and the calculation unit, so the drawings and description briefly show and introduce the relevant calculation structure, and other parts are omitted.
  • each board contains local DDR storage, and each processor chip acts as a computing and control unit.
  • each processor chip contains multiple multiprocessors as computing units.
  • each multiprocessor includes multiple accelerator cores as control and computing units, and a shared storage SRAM as a storage unit.
  • each accelerator core contains local storage and an array of local processing units.
  • NFU refers to the Neuron Function Unit, which is used for convolution calculations.
  • the storage model includes board global memory, SRAM (shared memory) on the Cluster, NRAM, WRAM and registers on the Core, and the like.
  • SRAM is included in the storage processing core MPU (Memory Process Unit Core, also called Mem Core).
  • An IPU Core (Intelligent Process Unit Core, or simply Core) refers to an intelligent processing core in the multi-core computing device; each IPU Core contains NRAM, WRAM, an NFU, and so on.
  • Cluster refers to a processor cluster or a computing cluster.
  • a multi-core computing device includes several Clusters, and a Cluster includes 1 Mem Core+N IPU Cores.
  • The convolutional layer in a neural network model performs feature extraction by applying a convolution operation between convolution kernels (also called filters or weights) and input feature maps (also called input data, neurons, or input neurons).
  • the convolution layer can contain multiple convolution kernels, and each element that makes up the convolution kernel corresponds to a weight coefficient and a bias.
  • Embodiments of the present disclosure can be applied to data splitting of various convolution operations.
  • A basic convolution can be expressed as Y(ho, wo) = Σ_{kh=0..Kh-1} Σ_{kw=0..Kw-1} X(ho*sh + kh, wo*sw + kw) * K(kh, kw), where X is the input data, Y is the output data, K is the convolution kernel, Kh and Kw are the length and width of K, and sh and sw are the strides in the length and width directions.
  • The formula ignores the bias, the padding (pad) and the dilation, and assumes that the input data X has already been padded and that the convolution kernel has already been dilated. The formula also ignores the N dimension and the C dimension; the forward calculation of a neural network model is independent in the N dimension and fully connected in the C dimension.
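  • As a concrete illustration of the formula above, the following Python sketch (illustrative only, not the hardware implementation of this disclosure) computes the basic two-dimensional convolution, ignoring bias, padding, dilation and the N and C dimensions as stated:

        import numpy as np

        def conv2d_basic(X, K, sh=1, sw=1):
            """Y[ho, wo] = sum over (kh, kw) of X[ho*sh + kh, wo*sw + kw] * K[kh, kw]."""
            Hi, Wi = X.shape
            Kh, Kw = K.shape
            Ho = (Hi - Kh) // sh + 1
            Wo = (Wi - Kw) // sw + 1
            Y = np.zeros((Ho, Wo), dtype=X.dtype)
            for ho in range(Ho):
                for wo in range(Wo):
                    window = X[ho * sh:ho * sh + Kh, wo * sw:wo * sw + Kw]
                    Y[ho, wo] = np.sum(window * K)  # multiply-accumulate over the window
            return Y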
  • Fig. 4 shows an example of an exemplary conventional 3D convolution operation principle to which embodiments of the present disclosure can be applied.
  • The figure exemplarily shows four-dimensional input data X with a size of [N Hi Wi Ci], which can be expressed as N three-dimensional rectangles 410 of size Hi×Wi×Ci.
  • The figure also exemplarily shows a four-dimensional convolution kernel K with a size of [Co Kh Kw Ci], which can be expressed as Co three-dimensional convolution kernels 420 of size Kh×Kw×Ci.
  • The convolution of the input data X with the convolution kernel K yields the output data Y, which is four-dimensional data of size [N Ho Wo Co] and can be expressed as N three-dimensional rectangles 430 of size Ho×Wo×Co.
  • The figure also specifically shows an example of the convolution operation, in which the input data is an input feature map 440 with a size of 6×6×3 (the N dimension is omitted); the convolution kernel is a three-dimensional convolution kernel 450 with a size of 3×3×3, for a single Co; and the output data is a 4×4 output feature map 460.
  • the specific operation process is as follows:
  • The convolution kernel 450 scans the input feature map 440 with a certain step size, performing element-wise multiplication and summation on the input features within the convolution window 470 and adding the bias. That is, the value at each position of the output feature map 460 is obtained by performing a two-dimensional convolution between the corresponding block of each input feature map channel and the corresponding slice of the convolution kernel and then summing the results. For example, the figure shows that the value at the (0,0) position of the output feature map 460 (that is, a convolution output point) is obtained by performing two-dimensional convolutions between the convolution window 470, framed by the black cube in the input feature map, and the three-dimensional convolution kernel 450, which yields 3 values (one per channel) that are summed to obtain the final value.
  • the position of the convolution kernel 450 can be moved on the input feature map 440 , that is, the convolution window of the convolution output point can be moved.
  • the convolution step size (Sx, Sy) is (1,1).
  • By moving the convolution window, the values at the (0,1) and (1,0) positions of the output feature map 460 can be obtained respectively.
  • In a convolutional layer of the neural network, there are N groups of input feature maps, and each group contains Hi×Wi×Ci pieces of information, where Hi and Wi are the height and width of the input feature map, and Ci is the number of input feature maps, also known as the number of input channels.
  • The convolutional layer has Ci×Co convolution kernels of size Kh×Kw, where Ci is the number of input channels, Co is the number of output feature maps (or the number of output channels), and Kh and Kw are the height and width of the convolution kernel.
  • The output feature map contains Ho×Wo×Co pieces of information, where Ho and Wo are the height and width of the output feature map, respectively, and Co is the number of output channels.
  • The convolution also has a step size (Sx, Sy), and the size of the convolution step size affects the size of the output feature map.
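  • The relationship between the input size, kernel size, step size and output size implied above can be written out as a small helper (a hedged sketch; the formula below is the standard one and is not spelled out in the original text):

        def conv_output_shape(Hi, Wi, Kh, Kw, Sx=1, Sy=1, pad_h=0, pad_w=0):
            # Output height/width of a convolution; Sy is the step size along H (Y),
            # Sx the step size along W (X), pad_h/pad_w the padding per side.
            Ho = (Hi + 2 * pad_h - Kh) // Sy + 1
            Wo = (Wi + 2 * pad_w - Kw) // Sx + 1
            return Ho, Wo

        # The Fig. 4 example: a 6x6x3 input with a 3x3x3 kernel and step size (1, 1)
        # yields a 4x4 output feature map.
        print(conv_output_shape(6, 6, 3, 3))   # (4, 4)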
  • input feature map (Feature map), input data, neuron or input neuron are used interchangeably; convolution kernel, filter or weight are used interchangeably.
  • H (height) and Y dimensions are used interchangeably, and the W (width) and X dimensions are used interchangeably.
  • the H dimension of the input feature map can be expressed as Hi or Yi
  • the H dimension of the output feature map can be expressed as Ho or Yo
  • the W dimension can be expressed similarly.
  • each convolution output point has a corresponding convolution window, and the shape of the convolution window is equal to the shape of the convolution kernel. The value of each convolution output point corresponds to the result of the multiplication and accumulation of the input feature map and the weight in its convolution window.
  • a computing device with a master-slave structure may be used to implement the above convolution operation.
  • different data paths can be configured for input feature maps and convolution kernels, thereby improving memory access efficiency.
  • FIG. 5 shows a schematic structural block diagram of a computing device 500 according to an embodiment of the disclosure. It can be understood that this structure can be regarded as a refinement of the internal structure of the operation module of a single processing core in Fig. 3a, or as a block diagram of the functional division formed by combining the operation modules of multiple processing cores shown in Fig. 3a or of the processing cores shown in Fig. 3b.
  • A computing device 500 in an embodiment of the present disclosure may be configured to perform various types of convolution operations, and may include a master processing circuit (MA) 510 and a plurality of slave processing circuits (SL) 520; 16 slave processing circuits SL0 to SL15 are shown in the figure.
  • the master processing circuit and the slave processing circuits, as well as multiple slave processing circuits, can communicate with each other through various connections.
  • The connections between the multiple slave processing circuits may be hard-wired, or may be logically configured according to, for example, micro-instructions, so as to form a variety of slave processing circuit array topologies. Embodiments of the present disclosure are not limited in this regard.
  • the main processing circuit and the slave processing circuit can cooperate with each other, thereby realizing parallel operation processing.
  • the main processing circuit and the slave processing circuit may include various calculation circuits, for example, may include a vector operation unit and a matrix operation unit.
  • the vector operation unit is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit is responsible for the core calculations of deep learning algorithms, such as matrix multiplication and convolution.
  • the slave processing circuit can be used to perform intermediate operations on corresponding data in parallel according to the operation instruction to obtain multiple intermediate results, and transmit the multiple intermediate results back to the main processing circuit.
  • By arranging the computing device 500 in a master-slave structure (for example, a one-master-multiple-slave structure or a multiple-master-multiple-slave structure; the present disclosure is not limited in this respect), for the calculation instructions of a forward operation the data can be split according to the calculation instruction, so that multiple slave processing circuits perform parallel calculations on the computation-intensive part, which improves the calculation speed, saves calculation time, and reduces power consumption.
  • multiple multiplexing methods of input feature maps and weights can be supported, thereby reducing the amount of data access during operations and improving processing efficiency .
  • In some embodiments, the computing device 500 may further include a first storage circuit 530 and a second storage circuit 540 for respectively storing data transmitted through different data channels.
  • the first storage circuit 530 can be used to store multicast data, that is, the data in the first storage circuit will be transmitted to multiple slave processing circuits through the broadcast bus, and these slave processing circuits receive the same data. It can be understood that broadcasting and multicasting can be implemented through the broadcasting bus. Multicast refers to a communication method that transmits a piece of data to multiple slave processing circuits; broadcasting is a communication method that transmits a piece of data to all slave processing circuits, which is a special case of multicast. Since both multicast and broadcast correspond to a one-to-many transmission mode, there is no special distinction between the two in this document. Broadcast and multicast can be collectively referred to as multicast, and those skilled in the art can clarify their meanings according to the context.
  • the second storage circuit 540 may be used to store and distribute data, that is, the data in the second storage circuit will be transmitted to different slave processing circuits respectively, and each slave processing circuit receives different data.
  • the main processing circuit may determine one of the input feature map and the convolution kernel as multicast data and store it in the first storage circuit, so as to transmit the data to the scheduled multiple from the processing circuit.
  • the main processing circuit may determine the other of the input feature map and the convolution kernel as distribution data and store it in the second storage circuit. These distributed data can be distributed to corresponding slave processing circuits before operation.
  • FIG. 5 also shows a schematic diagram of the internal structure of the slave processing circuit SL according to an embodiment of the present disclosure.
  • each slave processing circuit 520 may include a plurality of operation circuits CU 521, a first buffer circuit 522 and a second buffer circuit 523.
  • four arithmetic circuits CU0 to CU3 are shown.
  • the number of computing circuits may be more or less depending on specific hardware configurations, and the embodiments of the present disclosure are not limited in this respect.
  • the first buffer circuit 522 may be used for buffering weights or input feature maps assigned to the slave processing circuit.
  • the second buffer circuit 523 may be used for buffering the input feature map or the weight assigned to the slave processing circuit. These two buffer circuits are used to select the data involved in the operation.
  • The data in the first buffer circuit 522 may be a plurality of data rows from, for example, the first storage circuit 530 or the second storage circuit 540; correspondingly, the data in the second buffer circuit 523 may be a plurality of data rows from, for example, the second storage circuit 540 or the first storage circuit 530. Depending on the specific multiplexing method, these data rows may be distributed to the corresponding operation circuits CU 521 or broadcast to all CUs 521 within the slave processing circuit 520 during the operation.
  • Each operation circuit CU 521 is used to perform, in each calculation, an element-wise multiply-accumulate operation on data rows selected from the first buffer circuit and data rows selected from the second buffer circuit.
  • the slave processing circuit 520 may also include a third buffer circuit 524 for buffering the calculation results of each calculation circuit CU 521.
  • Although each processing circuit and storage circuit are shown as separate modules in FIG. 5, according to different configurations the storage circuits and processing circuits may also be combined into one module.
  • the first storage circuit 530 can be combined with the main processing circuit 510
  • the second storage circuit 540 can be shared by multiple slave processing circuits 520, and an independent storage area is assigned to each slave processing circuit to speed up access.
  • Embodiments of the present disclosure are not limited in this respect.
  • the main processing circuit and the slave processing circuit may belong to different modules of the same processor or chip, or may belong to different processors, and the present disclosure is not limited in this respect.
  • the dimensions of the involved multidimensional data are represented by (N, H, W, C) or (Co, H, W, Ci), which represent the storage order of the data in the memory. It can be understood that although the multidimensional data has multiple dimensions, since the layout of the memory is always one-dimensional, there is a corresponding relationship between the multidimensional data and the storage order on the memory. Multidimensional data is usually allocated in continuous storage space, that is, multidimensional data can be expanded in one dimension and stored in the memory in sequence.
  • The initial input feature map may be stored sequentially with lower dimensions taking priority (where C/Ci is the lowest dimension); and in order to optimize the convolution operation, the dimension storage order of the input feature map can be adjusted during the operation.
  • Adjacent dimensions refer to dimensions that are next to each other in the dimension information representation of multidimensional data, for example, W and Ci are adjacent, and adjacent dimensions may also be called continuous dimensions.
  • the main computing unit of the hardware is a vector multiply-accumulate operator.
  • Supporting various convolution algorithms in the hardware design essentially amounts to extracting the multiply-add operations in the algorithm to the maximum extent, and to efficiently exchanging the input and output data of the multiply-accumulate operations between the on-chip RAM (such as the NRAM and WRAM in Fig. 3) and the arithmetic units through the data path.
  • Hardware storage is organized in lines (cache lines), and read, write, and compute operations are most efficient when aligned to a whole line. Therefore, in order to make full use of the bandwidth and to match the memory access requirements of the operator array, the data usually needs to be vectorized and aligned.
  • The design of artificial intelligence chips usually takes the Ci dimension as the lowest dimension, that is, the above-mentioned NHWC arrangement order, in which the data along the Ci dimension is contiguous. Therefore, vectorization alignment requires that the size of the Ci dimension be aligned to a specified value, for example an alignment value M, so that memory accesses are performed in units of the alignment value M; M may also be called the maximum amount of data the hardware processes in a single operation.
  • M can have different values, such as 64bit, 128bit, 256bit, 512bit, etc.
  • the size of the input port of the operator array is also related to M.
  • The input port size of the operator array is usually twice M, that is, it processes input feature map data and weight data, each of scale M, at one time.
  • When the Ci dimension of the input feature map is large, it is easier to meet the above alignment requirements.
  • When the Ci dimension of the input feature map is small, for example smaller than the size of one cache line, the Ci dimension needs to be padded to one full line of data (for example, 512 bits), that is, padded with invalid zeros. Such padding causes a large number of redundant calculations, wasting resources and reducing the efficiency of the operation.
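  • A quick calculation of that padding overhead (a hedged sketch; the 64-byte line and Ci = 4 with int8 data are example values, not fixed by the disclosure):

        def ci_padding_waste(ci_bytes, line_bytes=64):
            """Fraction of a cache line occupied by zero padding when Ci is padded to one full line."""
            padded = max(ci_bytes, line_bytes)
            return (padded - ci_bytes) / padded

        # e.g. an int8 feature map with Ci = 4 padded to a 64-byte (512-bit) line:
        print(ci_padding_waste(4))   # 0.9375 -> about 94% of the multiply-accumulates act on zeros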
  • In view of this, embodiments of the present disclosure provide a convolution operation scheme, which can determine a corresponding convolution splitting scheme according to the size of the lowest storage dimension (such as Ci) of the input feature map, wherein the convolution splitting scheme at least indicates the shape of the split unit of the data to be operated on.
  • the amount of data contained in a split unit does not exceed the maximum single operation amount of the hardware.
  • the amount of data contained in a split unit can be set as the one-time processing alignment value M of the hardware, so that the calculation and processing can be performed in units of split units, which can fully utilize the computing power of the hardware and avoid or reduce invalid calculations. .
  • the data type can be Int8, Int16, Float16 or Float32
  • the split scheme of 64B×1×1 shape is called Forward64
  • the split scheme of 16B×2×2 shape is called Forward16
  • the split scheme of 4B×4×4 shape is called Forward4
  • the split scheme of 4B×4×4 shape applied to the depthwise convolution operation is called Forward1
  • the split scheme of 4B×4×4 shape applied to the reverse depthwise convolution operation is called Update1
  • the split scheme of 4B×4×4 shape applied to the cross-product convolution operation is called Update4.
  • these splitting schemes are suitable for scenarios where channel C is relatively small in convolution calculations, so they can also be collectively referred to as small convolutions.
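  • All of the named schemes keep one split unit at 64 bytes (one data row, see Fig. 6 below); the sketch that follows is illustrative only, with element counts simply derived from the byte widths of the listed data types:

        # split scheme -> (bytes in the C direction, block height, block width);
        # each split unit holds C_bytes * H * W = 64 bytes of data.
        SPLIT_SCHEMES = {
            "Forward64": (64, 1, 1),
            "Forward16": (16, 2, 2),
            "Forward4":  (4, 4, 4),   # the same 4Bx4x4 shape is also used by Forward1/Update1/Update4
        }

        DTYPE_BYTES = {"Int8": 1, "Int16": 2, "Float16": 2, "Float32": 4}

        def c_elems_per_unit(scheme, dtype):
            c_bytes, h, w = SPLIT_SCHEMES[scheme]
            assert c_bytes * h * w == 64          # one split unit = one 64-byte data row
            return c_bytes // DTYPE_BYTES[dtype]  # elements along C inside one unit

        print(c_elems_per_unit("Forward4", "Float16"))   # 2 C elements per 4B block
        print(c_elems_per_unit("Forward16", "Int8"))     # 16 C elements per 16B block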
  • a split unit includes data of the lowest storage dimension and at least one other storage dimension, and the total data volume of a split unit does not exceed the maximum single operation of the hardware.
  • The input feature map and the convolution kernel can be split into multiple corresponding split units according to the determined convolution splitting scheme, and the dimension storage order can be converted, so that the data within one split unit is stored contiguously as one data row, which facilitates subsequent reading and processing in units of split units (data rows).
  • One or more split units may be read, in units of split units and in a first reading order, from the data to be operated on, which is stored in a first-dimension storage order; the read split units are stored on the corresponding storage circuit such that the data within each split unit is stored according to a second-dimension storage order, and the data between split units is stored according to a third-dimension storage order.
  • FIG. 6 shows an exemplary data storage sequence according to an embodiment of the present disclosure.
  • The diagram 610 on the left represents the storage of the four-dimensional tensor to be operated on, which includes N three-dimensional sub-tensors with N as the highest dimension; that is, the first-dimension storage order of the four-dimensional tensor is NHWC.
  • H and Y, W and X are used interchangeably herein.
  • Each subtensor is divided into smaller data blocks or split units, and the number of data blocks in each dimension is C/Y/X respectively.
  • the diagram 620 in the middle shows the storage method of each sub-tensor, and each data block is stored as a continuous 64Byte, that is, one row.
  • the order between rows changes accordingly.
  • The data blocks are read in the direction of C first, then X, and finally Y; that is, the first reading order is YXC, and the rows are stored in the order of Y*X*C, i.e., the third-dimension storage order is YXC (or HWC).
  • the third dimension is stored in the same order as the first dimension. It can be understood that other reading orders can also be used, resulting in the storage order of the third dimension being different from that of the first dimension, which will not be listed here.
  • the diagram 630 on the right shows the order in each row, that is, the order of data in each data block, and its shape is blockC*blockY*blockX. At this time, the storage order of the second dimension is CYX or CHW.
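  • The dimension-order conversion of Fig. 6 can be illustrated with the following numpy sketch for the Forward4 (4B×4×4) scheme, assuming int8 data with Ci = 4 so that one split unit is exactly one 64-byte row (the function name and the Ci = 4 assumption are illustrative only):

        import numpy as np

        def to_forward4_layout(x):
            """x: [N, H, W, C] int8 tensor with H, W multiples of 4 and C == 4."""
            N, H, W, C = x.shape
            assert H % 4 == 0 and W % 4 == 0 and C == 4
            # carve the tensor into split units of 4(H) x 4(W) x 4(C)
            blocks = x.reshape(N, H // 4, 4, W // 4, 4, C)      # N, Yb, y, Xb, x, c
            # inside a block the order becomes C, H, W (second-dimension order CYX) ...
            blocks = blocks.transpose(0, 1, 3, 5, 2, 4)          # N, Yb, Xb, c, y, x
            # ... and the blocks follow one another in Y, X order (third-dimension order YXC),
            # each block laid out as one contiguous 64-byte data row.
            return blocks.reshape(N, (H // 4) * (W // 4), 64)

        x = (np.arange(1 * 8 * 8 * 4) % 64).astype(np.int8).reshape(1, 8, 8, 4)
        print(to_forward4_layout(x).shape)   # (1, 4, 64): 4 data rows of 64 bytes each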
  • the small convolution adopts the block form. Compared with the traditional convolution, the advantage is that the alignment in the Ci direction only needs to satisfy the alignment of the block in the Ci direction.
  • the weight (co*Kh*kw*ci) is generally small, Kh and Kw are usually single digits, and co and ci are almost the same.
  • the storage space of the second storage circuit (such as the WRAM 332 in FIG. 3) is larger than that of the first storage circuit (such as the NRAM 331 in FIG. 3).
  • From the convolution operation principle described above, it can be seen that the operation results along the Co dimension (the C dimension in the case of depthwise convolution) do not need to be accumulated with each other, so the operations for different Co values can be assigned relatively independently to different operation circuits.
  • In some embodiments, the size of the output channel dimension Co of the convolution kernel processed in a single round of operation does not exceed the number of scheduled slave processing circuits, so the operation for a single Co is completed by one or more slave processing circuits.
  • When the Co dimension is large, it can be handled by splitting the work into multiple rounds of operations, where the number of Co values processed in each round does not exceed the number of scheduled slave processing circuits.
  • Accordingly, the number of calculation rounds required to complete the convolution operation, as well as the number of Co values processed in each round or the corresponding grouping mode, can be determined.
  • the convolution kernel is multiplexed on Rs SLs in the same SLB, and Rs represents the number of times the convolution kernel is multiplexed between slave processing circuits.
  • Factors such as the limitation of the hardware buffer space (for example, the sizes of the first buffer circuit and the second buffer circuit in Fig. 5) can be considered to determine the maximum number of times rs that the convolution kernel can be multiplexed and the maximum number of times rn that the input feature map can be multiplexed within a single slave processing circuit.
  • For simplicity, the case in which a slave processing circuit processes multiple Co values in a single round of operation is not considered for now; only the case in which one or more slave processing circuits each handle a single Co value in a single round of operation is considered.
  • Different grouping modes can be used according to the number of slave processing circuits SL that process the same Co value in a single round of operation. It is preferable to distribute the schedulable slave processing circuits SL evenly so as to balance the computing power: for example, one group per 2 SLs, so that 16 SLs can process 8 Co values at the same time; or one group per 4 SLs, so that 16 SLs can process 4 Co values at the same time; and so on.
  • the second storage circuit WRAM has 16 storage areas, which are allocated to the 16 slave processing circuits SL respectively. Further, every 4 blocks can be combined into a storage block, which is assigned to the corresponding slave processing circuit group SLB.
  • the following grouping modes can be selected: Group1 mode, Group4 mode and Group16 mode.
  • grouping modes can refer to the above three representative grouping modes given herein for corresponding processing.
  • the above grouping mode can be uniformly expressed as GroupN, representing that all slave processing circuits SL scheduled in the current round of operations are divided into N groups, each slave processing circuit group SLB processes the same Co value, and different slave processing circuit groups SLB handles different Co values.
  • N can be 1, 4, or 16, corresponding to Group1, Group4, and Group16 above.
  • Figures 7a-7d illustrate several exemplary grouping schemes according to embodiments of the present disclosure.
  • Figure 7a shows a Group1 mode
  • Figure 7b shows a Group16 mode
  • Figure 7c shows a Group4 mode
  • Figure 7d shows another Group4 mode.
  • the Group1 mode means that all 16 schedulable SLs belong to one group and jointly process one Co value, for example, SL0-SL15 belong to group G0. Thus, operations for this one output channel are distributed over 16 SLs.
  • priority can be given to broadcasting the convolution kernel 720 of the output channel to each SL, and the input feature map 710 is split and distributed to each SL, thereby improving memory access efficiency.
  • the convolution kernel can be stored in the first storage circuit 530 in FIG. 5 for transmission using a broadcast channel.
  • the input feature map can be divided according to the XY direction of the output feature map and stored in the second storage circuit 540 to be allocated to different SLs.
  • all SLs jointly compute an output feature map of Co.
  • the Group16 mode means that all 16 schedulable SLs are divided into 16 groups, that is, each group has one SL, and each SL handles a different Co value.
  • SL0 belongs to group G0
  • SL1 belongs to group G1
  • SL15 belongs to group G15.
  • In this mode, the same input feature map 730 can be reused among the 16 SLs, so priority can be given to broadcasting the input feature map 730 to each SL, while the convolution kernels 740 corresponding to different Co values are distributed to the corresponding SLs.
  • 16 copies of the input feature map may be copied and stored in 16 storage areas allocated to the 16 slave processing circuits on the second storage circuit.
  • the convolution kernel is divided according to Co, one SL corresponds to one Co, and 16 Cos are processed at a time, stored in the first storage circuit, and distributed to different SLs in a unicast manner.
  • all SLs compute output feature maps of different Co for the same input feature map.
  • the Group4 mode means that all 16 schedulable SLs are divided into 4 groups, and each group processes a Co value.
  • SL0-SL3 belong to group G0
  • SL4-SL7 belong to group G1
  • SL8-SL11 belong to group G2
  • SL12-SL15 belong to group G3.
  • This mode is between Group1 and Group16, so either the convolution kernel or the input feature map can be determined as multicast data, while the other can be determined as distribution data.
  • the convolution kernels can be divided into 4 groups according to Co, and stored in the first storage circuit 530 in FIG. 5 , so as to be transmitted through a broadcast channel.
  • the input feature map can be divided into 4 parts according to the XY direction of the output feature map, copied into 4 parts, stored in the second storage circuit 540, and distributed to the 4 SLBs.
  • Each SLB obtains the same input feature map, and then distributes it to the 4 SLs in the SLB according to the 4 divided parts.
  • all SLs in each SLB jointly compute the output feature map of a Co, and the 4 SLBs process a different Co respectively.
  • In one implementation, the convolution kernels are divided into 4 groups by assigning Co values to the groups in an interleaved manner with an interval of 1.
  • For example, when Co = 12, the four groups of Co values are {0, 4, 8}, {1, 5, 9}, {2, 6, 10} and {3, 7, 11} respectively.
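  • A small sketch of how Co values can be assigned under the GroupN modes described above (the interval-1, interleaved assignment is taken from the Co = 12 example; the function name is illustrative):

        def group_co_assignment(co_total, num_groups):
            """Assign output channels to slave-processing-circuit groups (SLBs)
            in an interleaved (interval-1) manner."""
            groups = [[] for _ in range(num_groups)]
            for co in range(co_total):
                groups[co % num_groups].append(co)
            return groups

        # Group4 with Co = 12 reproduces {0,4,8}, {1,5,9}, {2,6,10}, {3,7,11}:
        print(group_co_assignment(12, 4))
        # Group16 assigns one Co per SL per round; Group1 keeps all Co values in the
        # single group, which processes them round by round.
        print(group_co_assignment(16, 16))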
  • the neurons can be stored in the second storage circuit WRAM, and the weights can be stored in the first storage circuit NRAM.
  • the input feature map needs to be split between these multiple SLs.
  • the Group1 grouping mode needs to split the input feature map into 16 parts.
  • the Group4 grouping mode needs to split the input feature map into 4 parts.
  • In some embodiments, the input feature map may be divided among the Rs slave processing circuits SL included in each slave processing circuit group as follows: according to the size of the corresponding output feature map, the output feature map is evenly divided in the XY dimensions (that is, the Ho/Wo dimensions) into Rs output feature blocks of the same shape; and according to the input feature map area required for calculating each output feature block, the input feature map is correspondingly divided in the XY dimensions (that is, the Hi/Wi dimensions) into Rs input feature blocks, which are distributed to the Rs slave processing circuits. It can be understood that, depending on the size of the convolution kernel and the convolution step size, the input feature regions corresponding to adjacent output points of the output feature map may overlap.
  • Fig. 8 shows an exemplary split diagram of an input feature map according to an embodiment of the present disclosure.
  • the input feature map is divided into 16 parts and distributed on 16 SLs, corresponding to the Group1 mode.
  • the 16 output feature blocks can be mapped to the input feature map 820 to obtain the 16 input feature map regions required to calculate the 16 output feature blocks respectively, which also divides the input feature map in the XY direction.
  • These 16 input feature map regions can be assigned to 16 slave processing circuits SL accordingly.
  • As mentioned above, the input feature map is split in units of split units according to the determined convolution splitting scheme. Therefore, in the above embodiment, the division of the input feature map should make each divided input feature block a multiple of the split unit in the XY directions, that is, each block can be aligned to the split unit in the XY directions. For example, when the 4×4×4 convolution splitting scheme is chosen, each input feature block is aligned to 4×4; when the 16×2×2 convolution splitting scheme is chosen, each input feature block is aligned to 2×2.
  • If the output feature map is not aligned to the split unit (such as 4×4 or 2×2), the input feature map needs to be padded accordingly (for example, zero-padded), so that the actually computed output XY is aligned to the split unit (e.g., 4×4 or 2×2) and the input XY is likewise aligned to the split unit (e.g., 4×4 or 2×2).
  • the output feature map can also be divided according to other rules in the XY direction, for example, divided into 16 output feature blocks with the same shape according to 1 ⁇ 16, and assigned to SL0-SL15 respectively.
  • Embodiments of the present disclosure are not limited in this respect.
  • This splitting method can also be applied to splitting in other scenarios, for example, splitting among the operation circuits CU within a single slave processing circuit SL; the embodiments of the present disclosure are not limited in this respect.
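  • The mapping from an output feature block back to the input region it needs, including the overlap and the split-unit alignment mentioned above, can be sketched as follows (a simplified illustration along one dimension, assuming a split-unit alignment of 4 and ignoring boundary padding):

        def output_block_to_input_region(ho_start, ho_end, kh, sy, align=4):
            """For output rows [ho_start, ho_end), return the input-row range needed,
            with both block sizes rounded up to the split-unit alignment."""
            # each output row ho needs input rows [ho*sy, ho*sy + kh)
            hi_start = ho_start * sy
            hi_end = (ho_end - 1) * sy + kh

            def align_up(n, a):
                return (n + a - 1) // a * a

            out_rows = align_up(ho_end - ho_start, align)
            in_rows = align_up(hi_end - hi_start, align)   # in practice reached by zero padding
            return (hi_start, hi_start + in_rows), out_rows

        # Two adjacent 4-row output blocks, 3x3 kernel, stride 1: their unaligned input
        # regions overlap by kh - sy = 2 rows (more once aligned to the split unit).
        print(output_block_to_input_region(0, 4, kh=3, sy=1))   # ((0, 8), 4)
        print(output_block_to_input_region(4, 8, kh=3, sy=1))   # ((4, 12), 4)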
  • In this way, multiple slave processing circuits can be scheduled to perform the convolution operation on the corresponding data rows of the input feature map and the convolution kernel, and then, according to the convolution splitting scheme, the plurality of operation results returned by the slave processing circuits are spliced together to obtain the output feature map of the convolution of the input feature map and the convolution kernel.
  • a plurality of operation circuits CU and each buffer circuit (see FIG. 5 ) in the slave processing circuit can be used to perform a specific convolution operation process.
  • multiple computing cycles are generally required to complete the required computing in each round of computing.
  • Each output feature block corresponds to the amount that all NCU schedulable operation circuits within a single SL can compute in a single pass (NCU*Nop output points).
  • the output feature map can be divided into output feature blocks according to the alignment of 16 output points in the XoYo dimension, and each output feature block can be calculated one by one. It can be understood that the 16 output points may be in a 4*4 format, or may be in a 1*16 format, which is not limited in the embodiment of the present disclosure.
  • The output points of an output feature block can be further divided among the NCU operation circuits to determine the processing object of each operation circuit. Then, according to this division of output points and using the split unit as a sliding window, NCU input feature data rows are selected from the first buffer circuit and distributed to the NCU operation circuits, and the corresponding weight data is selected from the second buffer circuit and broadcast to the NCU operation circuits, so that the output points corresponding to multiple sliding windows can be computed in parallel while the weight data is reused. Nk sliding selections are performed, where Nk is determined by the smaller of the convolution kernel size in the X and Y dimensions and the maximum convolution kernel size supported by a single operation of the processing circuit in the current convolution split mode.
  • When performing a conventional three-dimensional convolution operation, the corresponding weight data can be selected as follows: select, in a sliding manner corresponding to that used for the first buffer circuit, 1/Nop of a weight data row from the second buffer circuit, copy it Nop-1 times to expand it into an extended weight row, and broadcast the extended weight row to the NCU operation circuits in the slave processing circuit.
  • During each sliding calculation, each operation circuit performs an element-wise multiply-accumulate, in units of 1/Nop of a data row, between one input feature data row from the first buffer circuit and one extended weight data row from the second buffer circuit, obtaining Nop partial sums; the Nk*Nop partial sums obtained from the Nk sliding selections are then accumulated according to their corresponding convolution output points to obtain and output Nop operation results.
  • When the slave processing circuit outputs the output points computed by its internal operation circuits, it can output them in a specific order according to the way the output points were divided, so that consecutively output points are contiguous in the X and/or Y dimension, which is convenient for subsequent processing.
  • the main processing circuit may further store the operation results returned from each slave processing circuit in a fourth dimension storage order. According to the situation, the main processing circuit can also convert the operation result into a desired dimension storage sequence for storage.
  • In some embodiments, a cluster has 4 accelerator cores (Core), and the calculations of the cores are independent of one another yet interrelated.
  • Forward4 has obvious advantages when the HW dimension of the input feature map is large.
  • When the HW dimensions of the input feature map are small, the advantage is not obvious because of the alignment problem.
  • Therefore, the convolution calculation scenario in which the HW dimensions of the input feature map are large is considered here.
  • the number of channels C is often relatively small.
  • the number of channels C is usually 4. So in this scenario, it is not necessary to consider the situation that the on-chip space cannot fit all the channels.
  • the embodiment of the present disclosure proposes a splitting strategy of the input feature map in order to adapt to the restriction of alignment and ensure the correctness of the calculation when the HW dimension of the input feature map is large.
  • the splitting strategy here refers to splitting irregular and different-shaped input feature maps (neuron tensors) into basic blocks that can be used for processing for calculation.
  • the provided splitting strategy can be applied to multi-core computing devices, for example, to determine the splitting of input feature maps on multiple cores (in different spaces).
  • This splitting strategy can also be applied to a single-core computing device, for example, to determine the splitting of the input feature map at different time rounds on a single core.
  • FIG. 9 exemplarily shows an exemplary structural diagram of a computing device that can implement embodiments of the present disclosure.
  • the computing device 900 includes a master device 910 and a slave device 920 , wherein the slave device 920 may include one or more processing cores 921 .
  • the master device 910 may be, for example, a general-purpose processor, which acts as a control device (referred to as the host end) and is responsible for complex control and scheduling.
  • the master device 910 may be, for example, the processing device 203 in FIG. 2 .
  • the slave device 920 may be, for example, various domain-specific processors, responsible for large-scale parallel computing or domain-specific computing tasks.
  • the slave device 920 may be, for example, the computing device 201 in FIG. 2 . The two work together to complete computing tasks.
  • computing device 900 may be used to perform convolution operations.
  • the master device 910 may transmit the first task.
  • This first task is used to perform a convolution operation on input feature data and convolution kernel data.
  • the input channel Ci dimension of the input feature data and convolution kernel data is smaller than the first threshold, and any of the width W and height H dimensions of the input feature data exceeds the second threshold.
  • the first threshold can be, for example, 64B, 32B, etc., so that the on-chip space can store data of all channels.
  • the second threshold may be a multiple of 4, a multiple of 8, a multiple of 16, or even a multiple of 64, so as to complement the channel dimension or facilitate group splitting.
  • Correspondingly, the slave device 920 may schedule a corresponding number of processing cores 921 to execute the first task according to the splitting strategy of the first task, wherein in each core pass the above-mentioned convolution operation is performed on a part of the input feature data.
  • As mentioned above, the splitting strategy can be applied to multi-core computing devices as well as single-core computing devices, so the "core passes" mentioned herein can refer either to the calculation rounds of a single core at different times or to the operation rounds of different cores at the same time.
  • In some embodiments, the master device 910 may determine the splitting strategy for the first task. Whether execution is on multiple cores or on a single core, the splitting strategy may include the number of core passes required to complete the first task and the input feature block to be processed in each core pass.
  • Here the "number of core passes" refers to how many processing cores, executing how many times, are required to complete the task, that is, the expansion of the task in the space or time dimension.
  • the slave device 920 may further include one or more storage cores 922, and each storage core may be shared by one or more processing cores 921 to store data before, during and/or after processing by the processing cores.
  • In some embodiments, the master device can determine the splitting strategy according to the scale of the input feature data. It may be assumed that the input feature data is [N Hi Wi Ci], where N is the batch dimension, H is the height dimension, W is the width dimension, and C is the input channel dimension.
  • the output feature data is [Ho Wo Co].
  • the master device may first determine the splitting strategy according to the total number of batches B of input feature data and the total number of schedulable processing cores Ncore.
  • each processing core can process n consecutive batches of data.
  • Considering that the processing cores in a multi-core computing device are usually divided and managed in the form of computing clusters, the processing of such a single batch can be split among the computing clusters.
  • the master device may allocate Brem batches of input feature data to multiple computing clusters for processing, wherein Nc computing clusters jointly process the same batch of input feature data.
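  • A hedged sketch of this batch-level allocation (the text does not state the exact formulas; n = B // Ncore and Brem = B % Ncore are assumptions that match the description of n consecutive batches per core plus Brem leftover batches shared by clusters):

        def split_batches(B, Ncore):
            """Whole-batch allocation: each core processes n consecutive batches, and the
            remaining Brem batches are then split across computing clusters, with Nc
            clusters jointly processing one such batch (see below)."""
            n = B // Ncore      # consecutive batches per processing core (assumed)
            Brem = B % Ncore    # remaining batches handled cluster by cluster (assumed)
            return n, Brem

        print(split_batches(B=10, Ncore=8))   # (1, 2): one batch per core, 2 batches left over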
  • Data splitting between computational clusters can be done in a number of ways.
  • splitting may be performed according to the HW dimension.
  • Specifically, the master device may evenly split the output feature data corresponding to the convolution operation of the input feature data into Nc output feature blocks of the same shape in a first dimension, and correspondingly split the input feature data into Nc input feature blocks in the first dimension, which are respectively assigned to the above Nc computing clusters.
  • the above-mentioned first dimension may be either of H and W, or both.
  • preferably, the first dimension is a lower dimension in the storage order, such as the H dimension, thereby reducing the stride of jumps when reading data.
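The sketch below illustrates this inter-cluster split along the first dimension (H is used here); the mapping from output rows back to the required input rows uses standard convolution arithmetic with an assumed kernel height, stride and padding, which are not spelled out in the bullets above:

```python
def split_across_clusters(Ho, Nc, Kh, stride_h, pad_h=0):
    """Hypothetical sketch: evenly split the output along H into Nc blocks and
    map each block back to the input rows needed to compute it."""
    assert Ho % Nc == 0, "output is assumed to split evenly into Nc equal blocks"
    ho_blk = Ho // Nc
    blocks = []
    for c in range(Nc):
        ho_start, ho_end = c * ho_blk, (c + 1) * ho_blk        # output rows [start, end)
        hi_start = ho_start * stride_h - pad_h                  # first input row needed
        hi_end = (ho_end - 1) * stride_h - pad_h + Kh           # one past the last row needed
        # hi_start/hi_end may fall outside the tensor when padding is used;
        # clamping and overlap (halo) handling are omitted for brevity.
        blocks.append({"cluster": c,
                       "out_rows": (ho_start, ho_end),
                       "in_rows": (hi_start, hi_end)})
    return blocks
```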
  • a computing cluster includes a storage core and multiple processing cores, so operations within the same computing cluster need to be further split onto the multiple processing cores inside it.
  • at this level, the splitting logic can be more fine-grained.
  • the master device can further split the input feature block across the multiple cores in a single computing cluster as follows: according to the size wo of the first dimension of the output feature block corresponding to the input feature block, determine the splitting method of the input feature block on the multiple processing cores. That is, the processing is divided into different cases according to whether the data of the first dimension (W dimension) can be processed in one pass.
  • the input feature map whose W dimension can be processed in one pass is called a small map
  • the input feature map whose W dimension can be processed in no more than Ncc passes is called a middle map
  • the input feature map whose W dimension can be processed in exactly Ncc passes is called a large map
  • the input feature map whose W dimension requires a multiple of Ncc passes is called a super-large map, where Ncc is the number of processing cores in a single computing cluster
  • Fig. 10 shows a schematic diagram of splitting input feature blocks in different situations according to an embodiment of the present disclosure. As can be seen from the above, the splitting of an input feature block is essentially driven by the splitting of the output feature block. Therefore, each panel in Figure 10 represents the Ho*Wo dimensions of an output feature block, the gray part in each panel represents the data loaded into the cluster-level SRAM, and each small grid cell represents the data processed on one processing core, which can be called a basic block.
  • Figure 10a shows the small-map case. Since the data in the first dimension can be processed in one pass, the split is not performed along the first dimension but along the second dimension.
  • the second dimension is, for example, a higher storage dimension than the first dimension, such as the H dimension.
  • the master device can determine the splitting mode of the input feature block on the Ncc processing cores as follows: when wo ≤ the single-core-time processing capacity S, split the corresponding output feature block into Ncc output feature sub-blocks in the second dimension according to its second-dimension size ho; and according to the input feature region required to compute each output feature sub-block, correspondingly split the input feature block into Ncc input feature sub-blocks in the second dimension, which are allocated to the Ncc processing cores respectively.
  • in the Ncc=4 example, the SRAM can load 4*1 (H*W) basic blocks each time, which are allocated to the 4 processing cores respectively.
  • Figure 10b shows the middle-map case.
  • for a middle map, the data in the first dimension cannot be processed in one pass, yet it is not enough to saturate (or nearly saturate) the Ncc processing cores.
  • in this case, data from the second dimension is used to supplement the first dimension, that is, the first and second dimensions are split at the same time.
  • in the Ncc=4 example, the SRAM can load 2*2 (H*W) basic blocks each time, which are allocated to the 4 processing cores respectively.
  • Figure 10c shows the large-map case. For a large map, the data in the first dimension is enough to be processed in exactly Ncc passes, so the split can be performed along the first dimension only.
  • Figure 10d shows the super-large-map case.
  • for a super-large map, the data in the first dimension can be processed in multiple rounds on the Ncc processing cores, so it can likewise be split directly along the first dimension.
  • the master device can determine the splitting mode of the input feature block on the Ncc processing cores as follows: when wo > Ncc/2*S, split the output feature block into m*Ncc output feature sub-blocks in the first dimension according to wo, where m is a natural number and the first-dimension size of each output feature sub-block does not exceed S; correspondingly, the input feature block is split into m*Ncc input feature sub-blocks, which are assigned to the Ncc processing cores in turn.
  • in the Ncc=4 example, the SRAM can load 1*4 (H*W) basic blocks each time, which are allocated to the 4 processing cores respectively.
  • the choice among the small-map, middle-map, large-map and super-large-map strategies can be made dynamically according to the available on-chip memory resources.
  • the feature map size handled in small-map mode is on the order of 200*200
  • the feature map size handled in large-map mode is on the order of 960*960
  • super-large-map mode can handle 1080p-sized (1080*1920) and even 2K-level images.
  • although the splitting strategy for input feature maps of various scales across the multiple processing cores of a single computing cluster has been described above, the same strategy can also be applied to single-core boards.
  • single-core boards generally adopt a 1 storage core + 1 processing core architecture. Therefore, according to the on-chip space of the storage core, the amount of data held by one storage core can be taken as one computation task, which a single processing core completes in multiple passes; the splitting strategy between passes can still follow the strategy described above, the only difference being that what was assigned to Ncc processing cores is instead assigned to Ncc processing passes.
  • the above splitting strategies can thus be unified as: according to the size wo of the first dimension of the output feature block corresponding to the input feature block on the storage core, determine the splitting method of the input feature block over Ncc core times.
  • when a multi-core computing device is used, Ncc corresponds to the number of processing cores in a single computing cluster, and the capacity of the storage core usually serves one pass of the Ncc processing cores; when a single-core computing device is used, Ncc corresponds to the number of passes in which a single processing core consumes the data held by the storage core.
  • the splitting method over the Ncc core times can likewise be handled case by case according to the small-map, middle-map and large-map modes described above, and is not repeated here; a minimal dispatcher sketch is given below.
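Purely as an illustration of this case analysis, a dispatcher might look like the sketch below; the exact boundary conditions and the rounding of the middle-map split are assumptions (Ncc is assumed to be a small power of two such as 4):

```python
def choose_core_split(wo, S, Ncc):
    """Hypothetical sketch of the small/middle/large/super-large dispatch.

    Returns (mode, w_parts, h_parts): how many pieces the first (W) and
    second (H) output dimensions are cut into over the Ncc core times.
    """
    if wo <= S:
        # small map: W fits in one core time, so split along H instead (e.g. 4*1)
        return "small", 1, Ncc
    if wo <= Ncc // 2 * S:
        # middle map: split W and H jointly so that w_parts * h_parts == Ncc (e.g. 2*2)
        w_parts = -(-wo // S)              # ceil(wo / S)
        while Ncc % w_parts:               # round up to a divisor of Ncc
            w_parts += 1
        return "middle", w_parts, Ncc // w_parts
    # large / super-large map: split along W only into m*Ncc pieces of width <= S (e.g. 1*4)
    m = -(-wo // (Ncc * S))                # smallest m such that each piece fits in S
    return ("large" if m == 1 else "super-large"), m * Ncc, 1
```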
  • the task-splitting solution of the disclosed embodiments can be flexibly adapted to different board forms (single-core boards and multi-core boards).
  • the amount of data on one storage core can be regarded as one computation task, and according to the above splitting strategy that task can either be divided among different processing cores for parallel computation or divided over time on a single processing core for sequential computation, as sketched below.
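The following sketch shows how the same list of sub-blocks can be consumed either in parallel on a multi-core board or sequentially on a single-core board; the scheduling function and its output format are illustrative assumptions:

```python
def schedule(sub_blocks, num_cores):
    """Hypothetical sketch: map sub-blocks to (core, pass) slots.

    With num_cores > 1 the sub-blocks of one storage core are distributed to
    different processing cores in parallel; with num_cores == 1 the same list
    is simply consumed as successive passes of the single core.
    """
    return [{"core": i % num_cores,     # which processing core runs the block
             "pass": i // num_cores,    # which time slot (core time) it runs in
             "block": blk}
            for i, blk in enumerate(sub_blocks)]
```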
  • Embodiments of the present disclosure also provide methods for allocating and executing convolution operations using the aforementioned computing device. Those skilled in the art can understand that the steps of the method correspond to the features of the computing device described above in conjunction with the accompanying drawings, so the features described above are also applicable to the steps of the method and will not be repeated here.
  • An embodiment of the present disclosure also provides a chip, which may include the computing device in any embodiment described above with reference to the accompanying drawings. Further, the present disclosure also provides a board, which may include the aforementioned chip.
  • the electronic equipment or devices disclosed herein may include servers, cloud servers, server computing clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, webcams, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous-driving terminals, vehicles, household appliances, and/or medical equipment.
  • said vehicles include airplanes, ships and/or land vehicles;
  • said household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves and range hoods;
  • said medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners and/or electrocardiographs.
  • the electronic equipment or device disclosed herein can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical treatment. Further, the electronic device or device disclosed herein can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge, and terminal.
  • electronic equipment or devices with high computing power according to the disclosed solutions can be applied to cloud devices (such as cloud servers), while electronic equipment or devices with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby achieving unified management, scheduling and collaborative work in device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present disclosure is not limited by the order of the described actions. Therefore, according to the disclosure or teaching of the present disclosure, those skilled in the art may understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, according to different schemes, the descriptions of some embodiments in this disclosure have different emphases. In view of this, for parts not described in detail in a certain embodiment of the present disclosure, reference may also be made to the related descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit exists physically independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), such as resistive random-access memory (RRAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), enhanced dynamic random-access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, etc.

Abstract

The present disclosure discloses a computing device, a method for processing data by using the computing device, and related products. The computing device may be included in a combined processing device, which may further include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further include a storage device, which is connected to the computing device and the other processing devices respectively and is used to store data of the computing device and the other processing devices. The solution of the present disclosure realizes task splitting of convolution operations on a single core or multiple cores, improving the efficiency of operation processing.

Description

计算装置、数据处理方法及相关产品
相关申请的交叉引用
本公开要求于2021年9月26日申请的、申请号为202111131275.5、发明名称为“计算装置、数据处理方法及相关产品”的中国专利申请的优先权。
技术领域
本披露一般地涉及数据处理领域。更具体地,本披露涉及一种计算装置、利用计算装置对数据进行处理的方法、芯片和板卡。
背景技术
目前,深度学习(Deep Learning)已经成为机器学习中的重要分支,也大力助推着人工智能(AI)的发展。深度学习的核心技术——深度神经网络(DNN)已在诸多行业有着广泛的应用。
神经网络是人工智能、深度学习中最为关键的技术之一,其中卷积神经网络(Convolution Neural Network,CNN)是最为重要的一种网络类型。卷积神经网络中最为关键的计算即为卷积层(Conv layer)的卷积运算(Convolution Operation)。卷积层的功能是对输入数据进行特征提取,通过多层卷积,能够抽取复杂特征,以保证网络具有足够的表达能力和泛化能力。神经网络模型中包含了大量的、各种类型的卷积运算,卷积运算的计算性能极大地影响整个神经网络模型的计算性能。当神经网络模型应用于不同领域时,例如语音识别、机器翻译、图像处理等等,其对应的输入特征图和权值的各个维度大小可能各有不同。为了充分利用深度学习处理器的硬件优势,需要针对不同规模的、不同类型的卷积运算进行优化,以提高执行神经网络模型的计算性能。
发明内容
为了至少解决如上所提到的一个或多个技术问题,本披露在多个方面中提出了一种计算装置,其通过对卷积运算任务进行拆分,可以使得各种规模的卷积运算能够适配卷积运算的硬件,从而提高卷积运算的计算效率。本披露实施例的卷积运算可以是各种神经网络模型中的运算,这些神经网络模型可以应用于各种领域,诸如图像处理、语音处理、文本处理等等,这些处理例如可以包括但不限于识别和分类。
在第一方面中,本披露实施例提供了一种计算装置,包括主设备和从设备,所述从设备包括一个或多个处理核,其中:所述主设备配置用于发射第一任务,所述第一任务用于针对输入特征数据和卷积核数据执行卷积运算;以及所述从设备配置用于根据所述第一任务的拆分策略,调度相应数量的所述处理核执行所述第一任务,其中在每核次处理中针对所述输入特征数据中的一部分执行所述卷积运算。
在第二方面中,本披露实施例提供了一种芯片,其包括前述第一方面的计算装置。
在第三方面中,本披露实施例提供了一种板卡,其包括前述第二方面的芯片。
在第四方面中,本披露实施例提供了一种利用前述第一方面的计算装置来处理数据的方法。
通过如上所提供的计算装置、芯片、板卡以及由计算装置处理数据的方法,本披露实施例的方案针对卷积运算任务在单核或多核计算装置上拆分提供了优化方案,以适应硬件运算装置的处理能力,从而充分利用多个从处理电路的并行处理能力,可以有效提高卷积运算的运算效率。
附图说明
通过参考附图阅读下文的详细描述,本公开示例性实施方式的上述以及其他目的、特征和优点将变得易于理解。在附图中,以示例性而非限制性的方式示出了本公开的若干实施方式,并且相同或对应的标号表示相同或对应的部分,其中:
图1示出本披露实施例的板卡的结构图;
图2示出本披露实施例的组合处理装置的结构图;
图3a示出本披露实施例的单核计算装置的处理器核的内部结构示意图;
图3b示出本披露实施例的多核计算装置的内部结构简化示意图;
图4示出可以应用本披露实施例的示例性卷积运算原理示例;
图5示出了根据本披露实施例的计算装置的示意性结构框图;
图6示出了根据本披露实施例的一种示例性数据存储顺序;
图7a-7c示出了根据本披露实施例的几种示例性分组模式;
图8示出了根据本披露实施例的输入特征图的示例性拆分示意图;
图9示例性示出可以实施本披露实施例的计算装置的示例性结构图;以及
图10示出根据本披露实施例的不同情况下的神经元的拆分示意图。
具体实施方式
下面将结合本披露实施例中的附图,对本披露实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本披露一部分实施例,而不是全部的实施例。基于本披露中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本披露保护的范围。
应当理解,本披露的权利要求、说明书及附图中可能出现的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。
示例性硬件环境
图1示出本披露实施例的一种板卡10的结构示意图。如图1所示,板卡10包括芯片101,其是一种系统级芯片(System on Chip,SoC),或称片上系统,集成有一个或多个组合处理装置,组合处理装置是一种人工智能运算单元,用以支持各类深度学习和机器学习算法,满足计算机视觉、语音、自然语言处理、数据挖掘等领域复杂场景下的智能处理需求。特别是深度学习技术大量应用在云端智能领域,云端智能应用的一个显著特点是输入数据量大,对平台的存储能力和计算能力有很高的要求,此实施例的板卡10适用在云端智能应用,具有庞大的片外存储、片上存储和强大的计算能力。
芯片101通过对外接口装置102与外部设备103相连接。外部设备103例如是服务器、计算机、摄像头、显示器、鼠标、键盘、网卡或wifi接口等。待处理的数据可以由外部设备103通过对外接口装置102传递至芯片101。芯片101的计算结果可以经由对外接口装置102传送回外部设备103。根据不同的应用场景,对外接口装置102可以具有不同的接口形式,例如PCIe接口等。
板卡10还包括用于存储数据的存储器件104,其包括一个或多个存储单元105。存储器件104通过总线与控制器件106和芯片101进行连接和数据传输。板卡10中的控制器件106配置用于对芯片101的状态进行调控。为此,在一个应用场景中,控制器件106可以包括单片机(Micro Controller Unit,MCU)。
图2是示出此实施例的芯片101中的组合处理装置的结构图。如图2中所示,组合处理装置20包括计算装置201、接口装置202、处理装置203和存储装置204。
计算装置201配置成执行用户指定的操作,主要实现为单核智能处理器或者多核智能处理器,用以执行深度学习或机器学习的计算,其可以通过接口装置202与处理装置203进行交互,以共 同完成用户指定的操作。
接口装置202用于在计算装置201与处理装置203间传输数据和控制指令。例如,计算装置201可以经由接口装置202从处理装置203中获取输入数据,写入计算装置201片上的存储装置。进一步,计算装置201可以经由接口装置202从处理装置203中获取控制指令,写入计算装置201片上的控制缓存中。替代地或可选地,接口装置202也可以读取计算装置201的存储装置中的数据并传输给处理装置203。
处理装置203作为通用的处理装置,执行包括但不限于数据搬运、对计算装置201的开启和/或停止等基本控制。根据实现方式的不同,处理装置203可以是中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)或其他通用和/或专用处理器中的一种或多种类型的处理器,这些处理器包括但不限于数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,并且其数目可以根据实际需要来确定。如前所述,仅就本披露的计算装置201而言,其可以视为具有单核结构或者同构多核结构。然而,当将计算装置201和处理装置203整合共同考虑时,二者视为形成异构多核结构。
存储装置204用以存储待处理的数据,其可以是DRAM,为DDR内存,大小通常为16G或更大,用于保存计算装置201和/或处理装置203的数据。
图3a示出了计算装置201为单核装置时处理核的内部结构示意图。计算装置301用以处理计算机视觉、语音、自然语言、数据挖掘等输入数据,计算装置301包括三大模块:控制模块31、运算模块32及存储模块33。
控制模块31用以协调并控制运算模块32和存储模块33的工作,以完成深度学习的任务,其包括取指单元(instruction fetch unit,IFU)311及指令译码单元(instruction decode unit,IDU)312。取指单元311用以获取来自处理装置203的指令,指令译码单元312则将获取的指令进行译码,并将译码结果作为控制信息发送给运算模块32和存储模块33。
运算模块32包括向量运算单元321及矩阵运算单元322。向量运算单元321用以执行向量运算,可支持向量乘、加、非线性变换等复杂运算;矩阵运算单元322负责深度学习算法的核心计算,即矩阵乘及卷积。
存储模块33用来存储或搬运相关数据,包括神经元存储单元(neuron RAM,NRAM)331、权值存储单元(weight RAM,WRAM)332、直接内存访问模块(direct memory access,DMA)333。NRAM 331用以存储输入神经元、输出神经元和计算后的中间结果;WRAM 332则用以存储深度学习网络的卷积核,即权值;DMA 333通过总线34连接DRAM 204,负责计算装置301与DRAM 204间的数据搬运。
图3b示出了计算装置201为多核的内部结构简化示意图。多核计算装置可以用层次化硬件模型来进行抽象。如图所示,多核计算装置可以抽象为四个层级,即板卡级(Card)350、芯片级(Chip)360、处理器簇级(Cluster)370和处理器核级(Core)380。本披露实施例中主要涉及存储单元的数据传输和计算单元部分,因此附图和描述简要示出和介绍相关的计算结构,省略其他部分。
在板卡级,每块板卡上包含本地DDR存储,每个处理器芯片作为计算和控制单元。
在芯片级,每个处理器芯片包含多个多处理器作为计算单元。
在计算簇级,每个多处理器包括多个加速器核作为控制和计算单元,另外还有共享存储SRAM作为存储单元。
在处理器核级,每个加速器核包含本地存储及本地处理单元阵列。NFU指神经运算单元(Neuron Function Unit),用于进行卷积计算。
在该多核计算装置中,存储模型包括板卡全局内存、Cluster上的SRAM(共享存储器)、Core上的NRAM、WRAM和寄存器等。为了获得更好的性能,可以显式地控制Card以下各存储层次之间的数据搬移以及访存/计算间的平衡。SRAM包含在存储处理单元MPU(Memory Process Unit  Core,简称MPU,或者Mem Core)中。Core指多核计算装置中的智能处理核(Intelligent Process Unit Core,简称IPU Core或者Core)。1个IPU Core包含NRAM,WRAM,NFU等等。Cluster指处理器簇或称计算簇,通常多核计算装置包含若干个Cluster,一个Cluster包含1个Mem Core+N个IPU Core。
示例性卷积运算类型
神经网络模型中的卷积层可以执行卷积运算,通过对输入特征图(也称为输入数据、神经元或输入神经元)应用卷积核(也称为过滤器、权值等)做卷积处理,从而进行特征提取。卷积层内部可以包含多个卷积核,组成卷积核的每个元素对应一个权值系数和一个偏置bias。本披露实施例可以应用于各种卷积运算的数据拆分中。
在常规3D卷积运算中,假设卷积层中输入特征图(Feature map)张量形状表示为X[N Hi Wi Ci],卷积核(kernel)的张量形状表示为K[Co Kh Kw Ci],输出的结果为Y[N Ho Wo Co],那么,简化的卷积运算的数学计算公式可以表示如下:
Y in,jc,jh,jw=∑ 0≤ic≤ci,0≤ih≤kh,0≤iw≤kwX in,ic,jh×sh+ih,jw×sw+iw×K jc,ic,ih,iw    (1)
上式中,X是输入数据,Y是输出数据,K是卷积核,Kh和Kw是K的长和宽,sh和sw是在长和宽方向上的步长(stride),公式忽略了偏置bias,填充pad和膨胀dilation,并且假设输入数据X已经做了填充,卷积核已经做了膨胀。公式忽略了N维度和C维度,神经网络模型的正向计算在N维度上的计算都是独立的,在C维度上是全连接的。卷积核在工作时,会按照一定的步长扫过输入特征,在卷积窗口内对输入特征做矩阵元素乘法求和并叠加偏差量。
图4示出了可以应用本披露实施例的示例性常规3D卷积运算原理示例。
图中示例性示出了大小为[N Hi Wi Ci]的四维输入数据X,其可以表示成N个Hi×Wi×Ci大小的立体矩形410。图中还示例性示出了大小为[Co Kh Kw Ci]的四维卷积核K,其可以表示成Co个Kh×Kw×Ci大小的立体卷积核420。输入数据X与卷积核K的卷积结果得到输出数据Y,其为[N Ho Wo Co]大小的四维数据,可以表示成N个Ho×Wo×Co大小的立体矩形430。
图中还具体示出了一个卷积运算示例,其中输入数据为6×6×3大小的输入特征图440,省去N维度;卷积核为3×3×3大小的立体卷积核450,针对单个Co;输出数据为4×4的输出特征图460。具体运算过程如下:
卷积核450按照一定的步长扫过输入特征图440,在卷积窗口470内对输入特征做矩阵元素乘法求和并叠加偏置。也即,输出特征图460中每个位置上的值由每个输入特征图的对应区块和对应卷积核做二维卷积运算之后再加和得到。例如,图中示出了输出特征图460上(0,0)位置的值(也即卷积输出点)由输入特征图中黑色立方体框出的卷积窗口470与立体卷积核450进行二维卷积运算得到3个值,再加和得到最终值。
为了得到其他位置的输出,可以在输入特征图440上移动卷积核450的位置,也即移动卷积输出点的卷积窗口。在图中示例中,卷积步长(Sx,Sy)为(1,1),当横向(宽度方向)向右或纵向(高度方向)向下移动一格后做卷积运算,可以分别得到输出特征图460上(0,1)或(1,0)位置的值。
从上面的描述可知,在神经网络的一个卷积层中,有N组输入特征图,每组包含Hi×Wi×Ci个信息,其中Hi和Wi分别是输入特征图的高度和宽度,Ci是输入特征图的个数,也称为输入通道数。卷积层有Ci×Co个Kh×Kw大小的卷积核,其中Ci是输入通道数,Co是输出特征图的个数(或输出通道数),Kh和Kw分别是卷积核的高度和宽度。输出特征图包含Ho×Wo×Co个信息,其中Ho和Wo分别是输出特征图的高度和宽度,Co是输出通道数。此外,在卷积运算中,还会涉及到卷积步长(Sx,Sy),卷积步长的大小会影响输出特征图的尺寸。
在本文中,输入特征图(Feature map)、输入数据、神经元或输入神经元可互换使用;卷积核、过滤器或权值可互换使用。此外,H(高度)和Y维度可互换使用,W(宽度)和X维度可互换使用。相应地,输入特征图的H维度可以表示为Hi或Yi,输出特征图的H维度可以表示为Ho或Yo,W维度类似表示。在本披露实施例中,每个卷积输出点具有对应的卷积窗口,卷积窗 口的形状等于卷积核的形状。每个卷积输出点的值对应于其卷积窗口内的输入特征图与权值的对位乘累加结果。
示例性计算装置/数据处理装置
在本披露实施例中,可以采用主从结构的计算装置来实施上述卷积运算。进一步地,可以为输入特征图和卷积核配置不同的数据通路,从而提高访存效率。
图5示出了根据本披露实施例的计算装置500的示意性结构框图。可以理解,该结构可以视为图3a中单个处理核的运算模块的内部结构细化,也可以视为在多个图3a所示处理核或图3b所示处理核的运算模块基础上联合的功能划分框图。如图5所示,本披露实施例的计算装置500可以配置用于执行各种类型的卷积运算,其可以包括主处理电路(MA)510和多个从处理电路(SL)520,图中示出了16个从处理电路SL0~SL15。本领域技术人员可以理解,从处理电路的数量可以更多或更少,取决于具体的硬件配置,本披露实施例在此方面没有限制。
主处理电路和从处理电路之间以及多个从处理电路之间可以通过各种连接相互通信。在不同的应用场景中,多个从处理电路之间的连接方式既可以是通过硬线布置的硬连接方式,也可以是根据例如微指令进行配置的逻辑连接方式,以形成多种从处理电路阵列的拓扑结构。本披露实施例在此方面没有限制。主处理电路和从处理电路可以相互配合,由此实现并行运算处理。
为了支持运算功能,主处理电路和从处理电路可以包括各种计算电路,例如可以包括向量运算单元及矩阵运算单元。向量运算单元用以执行向量运算,可支持向量乘、加、非线性变换等复杂运算;矩阵运算单元负责深度学习算法的核心计算,例如矩阵乘和卷积。
从处理电路例如可以用于根据运算指令,对相应的数据并行执行中间运算得到多个中间结果,并将多个中间结果传输回主处理电路。
通过将计算装置500设置成主从结构(例如一主多从结构,或者多主多从结构,本披露在此方面没有限制),对于正向运算的计算指令,可以根据计算指令将数据进行拆分,从而通过多个从处理电路对计算量较大的部分进行并行运算以提高运算速度,节省运算时间,进而降低功耗。
在本披露一些实施例中,通过利用不同的数据通路传输输入特征图和权值,可以支持输入特征图和权值的多种复用方式,从而减小运算期间的数据访存量,提升处理效率。
具体地,计算装置500中还可以包括第一存储装置530和第二存储装置540,用于分别存储经由不同数据通道传输的数据。
第一存储电路530可以用于存储多播数据,也即第一存储电路中的数据将通过广播总线传输给多个从处理电路,这些从处理电路接收到相同的数据。可以理解,通过广播总线可以实现广播和多播。多播是指将一份数据传输到多个从处理电路的通信方式;而广播是将一份数据传输到所有从处理电路的通信方式,是多播的一个特例。由于多播和广播都对应一对多的传输方式,本文中未对二者特意区分,广播和多播可以统称为多播,本领域技术人员根据上下文可以明确其含义。
第二存储电路540可以用于存储分发数据,也即第二存储电路中的数据将分别传输给不同的从处理电路,每个从处理电路接收到不同的数据。
通过分别提供第一存储电路和第二存储电路,可以支持针对待运算的数据以不同传输方式进行传输,从而通过在多个从处理电路之间复用多播数据来降低数据访存量。
在一些实施例中,主处理电路可以将输入特征图和卷积核中之一确定为多播数据并存储在第一存储电路中,以在运算期间通过广播方式将数据传输给调度的多个从处理电路。对应地,主处理电路可以将输入特征图和卷积核中另一确定为分发数据并存储在第二存储电路中。这些分发数据可以在运算前分发给对应的从处理电路。
图5还示出了根据本披露实施例的从处理电路SL的内部结构示意图。如图所示,每个从处理电路520可以包括多个运算电路CU 521、第一缓冲电路522和第二缓冲电路523。图中示出了4个运算电路CU0~CU3。本领域技术人员可以理解,运算电路的数量可以更多或更少,取决于具体的硬件配置,本披露实施例在此方面没有限制。
在一些实施例中,第一缓冲电路522可以用于缓存分配给该从处理电路的权值或输入特征图。 相应地,第二缓冲电路523则可以用于缓存分配给该从处理电路的输入特征图或权值。这两个缓冲电路均用于选取参与运算的数据。第一缓冲电路522的数据可以是来自例如第一存储电路530或第二存储电路540的多个数据行,对应地,第二缓冲电路523的数据可以来自例如第二存储电路540或第一存储电路530的多个数据行。取决于具体的复用方式,这些数据行可以在运算期间被分发给对应的运算电路CU 521或广播给该从处理电路520内的所有CU 521。
每个运算电路CU 521用于在每次计算时,针对分别从第一缓冲电路中选取的数据行和从第二缓冲电路中选取的数据行执行对位乘累加运算。
通过分别提供第一缓冲电路和第二缓冲电路,可以支持针对待运算的数据以不同传输方式进行传输,从而通过在单个从处理电路内的多个运算电路之间尽可能复用数据来降低数据访存量。
从处理电路520中还可以包括第三缓冲电路524,用于缓存各个运算电路CU 521的运算结果。
可以理解,虽然在图5中将各个处理电路与存储电路示出为分立的模块,但是根据不同的配置,存储电路与处理电路也可以合并成一个模块。例如,第一存储电路530可以与主处理电路510合并在一起,第二存储电路540则可以由多个从处理电路520共享,并为每个从处理电路分配独立的存储区域,加速访问。本披露实施例在此方面没有限制。此外,在该计算装置中,主处理电路和从处理电路可以属于同一处理器或芯片的不同模块,也可以属于不同处理器,本披露在此方面也没有限制。
示例性数据拆分和存储
在本披露实施例中,所涉及的多维数据的维度表征为(N,H,W,C)或(Co,H,W,Ci),其代表了数据在存储器中的存储顺序。可以理解,虽然多维数据具有多个维度,但是因为存储器的布局始终是一维的,因此多维数据与存储器上的存储顺序之间存在对应关系。多维数据通常被分配在连续的存储空间中,也即可以将多维数据进行一维展开,按顺序存储在存储器上。例如,在本披露实施例中,初始的输入特征图可以按照低维度(此处C/Ci为最低维度)优先方式,进行顺序存储;而为了优化卷积运算,在运算过程中可以调整输入特征图的存储顺序,如后面将详细描述的。相邻的维度是指多维数据的维度信息表示中相互紧挨着的维度,例如,W和Ci相邻,相邻的维度也可以称为连续的维度。
在智能处理器中,出于算力的需要和面积功耗开销的考虑,硬件的主要运算单元是向量的乘加运算器。在硬件设计中实现各类卷积算法的支持,本质上是最大化地提取算法中的乘加运算,并且通过数据通路实现在片上RAM(诸如图3中的NRAM、WRAM等)和运算器之间高效地交换乘加运算的输入和输出数据。
硬件在存储上是以一行一行(缓存行)进行存储的,读、写、计算操作在整行对齐时效率最高,因此为了充分利用带宽,适配运算器阵列的访存量等需求,通常需要将数据进行向量化对齐。人工智能芯片的设计通常以Ci维度为最低维度,也即上述NHWC摆放顺序,Ci维度上的数据是连续的。因此,向量化对齐要求需要Ci维度的大小对齐到指定数值,例如对齐值M,从而以该对齐值M为单位进行存取数,M也可以称为硬件单次最大运算量。基于不同的硬件设计,M可以有不同的数值,例如64bit、128bit、256bit、512bit等。通常,运算器阵列的输入端口大小也与M相关,例如在输入数据位宽对称的情形下,运算器阵列的输入端口大小通常为M的2倍,也即一次性处理对齐值M规模的输入特征图数据和权值数据。当输入特征图的Ci维度较大时,比较容易满足上述对齐要求。
当输入特征图的Ci维度较小时,例如小于一个缓存行的大小,则需将Ci维度补齐到一行数据(例如,512比特),即填充无效数据0。这种填充会造成大量的冗余计算,导致资源浪费,降低了运算的效率。
在本披露实施例中,提出了一种卷积运算方案,其可以根据输入特征图的最低存储维度(例如Ci)的大小,确定对应的卷积拆分方案,其中卷积拆分方案至少指示待运算数据的拆分单元的形状。一个拆分单元包含的数据量不超过硬件单次最大运算量。
在一些实施例中,一个拆分单元包含的数据量可以设置成硬件的一次性处理对齐值M,从而 以拆分单元为单位进行运算处理,可以充分发挥硬件的算力,避免或减少无效计算。
在本披露的示例性描述中,不防假设M=512bit=64Byte,数据类型可以是Int8、Int16、Float16或Float32,并且输入特征图与卷积核的数据类型一致。由于数据类型至少需要1字节的宽度,并且运算处理的最小单位是一个数据,因此在下面的示例中均以字节为单位进行各种计算,例如M=64B,Ci=28B等等,其中有时候为了简洁起见省略单位。
当拆分单元的数据量等于M时,每个拆分单元的数据块形状为blockC*blockY*blockX,其可能存在多种情形,表1列出了其中的几种:
Figure PCTCN2022100303-appb-000001
表1、数据块形状
从表1可以看出,有些数据块形状的X和Y维度尺寸相等(如深色行所示),这种形状可以简化后续的运算。因此在本披露实施例中,可以优选使用这种数据块形状对待运算数据进行拆分。
为了简便起见,将64B×1×1形状的拆分方案称为Forward64,将16B×2×2形状的拆分方案称为Forward16,将4B×4×4形状的拆分方案称为Forward4,将4B×4×4形状的应用于深度卷积运算的拆分方案称为Forward1,将4B×4×4形状的应用于反向深度卷积运算的拆分方案称为Update1,将4B×4×4形状的应用于叉乘卷积运算的拆分方案称为Update4。除了Forward64之外,这些拆分方案适合卷积计算中通道C比较小的场景,因此也可以统称为小卷积。在这些小卷积拆分方案中,一个拆分单元包括最低存储维度和至少一个其他存储维度的数据,并且一个拆分单元的总数据量不超过硬件单次最大运算量。
不同的卷积拆分方案可以适用于不同的运算场景,从而获得不同程度的性能优化。
在确定了拆分方案之后,接着可以按照所确定的卷积拆分方案,将输入特征图和卷积核拆分成多个对应的拆分单元并转换其维度存储顺序,以使得一个拆分单元内的数据连续存储为一个数据行,从而方便后续以拆分单元(数据行)为单位进行读取处理。
在一些实施例中,对于三维或者四维的神经元或者权值的数据,将其全部划分为大小为blockC*blockY*blockX(Uc×Uy×Ux)大小的数据块,每一个数据块连续存储在例如M=64B的一行上,由此在读取一行数据时,实际取出一个数据块的数据。
具体地,可以从以第一维度存储顺序存储的待运算数据中,以拆分单元为单位,按第一读取顺序读取一个或多个拆分单元,将读取的拆分单元存储到对应的存储电路上,其中每个拆分单元内的数据按照第二维度存储顺序存储,拆分单元之间按照第三维度存储顺序存储。
图6示出了根据本披露实施例的一种示例性数据存储顺序。
如图所示,610表示待运算的四维张量的存储方式,包含N个3维的子张量,N在最高维度,也即四维张量的第一维度存储顺序为NHWC。注意,本文中H和Y、W和X可互换使用。每一个子张量被划分为更小的数据块或拆分单元,每一维的数据块的个数分别为C/Y/X。
中间的图620表示每一个子张量的存储方式,每个数据块被存储为连续的64Byte,也即一行。当读取数据块的顺序不同时,行之间的顺序也会相应地变化。在图中示例中,按照先C、然后X、最后Y的方向读取数据块,也即第一读取顺序为YXC,则各行之间按照Y*X*C的顺序存储,也即第三维度存储顺序为YXC或HWC。在此示例中,第三维度存储顺序与第一维度存储顺序相同。可以理解,也可以使用其他读取顺序,从而导致第三维度存储顺序与第一维度存储顺序不同,此 处不再一一列举。
右侧的图630表示每一行内的顺序,也即每个数据块内的数据顺序,其形状为blockC*blockY*blockX,此时第二维度存储顺序为CYX或CHW。
示例性分组运算
小卷积采用block形式,较传统卷积的优势在于Ci方向对齐只需要满足block在Ci方向对齐即可。在这一小通道的场景下,权值(co*Kh*kw*ci)普遍较小,Kh和Kw通常是个位数,co和ci差不多。在前面结合图5描述的计算装置/数据处理装置中,通常第二存储电路(例如图3的WRAM 332)的存储空间比第一存储电路(例如,图3的NRAM 331)要大。因此,为了充分利用片上的计算空间,在大部分小卷积方案中,例如Forward4、Forward1等,采用与正常卷积的神经元和权值存储位置互换的方案,也即将神经元存储在第二存储电路WRAM上,权值存储在第一存储电路NRAM上。
卷积的计算是每一个输入特征图都需要和每一个Co的卷积核进行乘加运算,从而输出Co个输出特征图。然而,并不是片上空间一定能同时存储下所有规模的卷积核和输入特征图,因此,对于硬件而言存在一系列重复加载输入特征数据或者权值数据的操作,如何平衡重复加载输入特征数据还是权值数据对计算的效率会有一定影响。在实际运算中,为了减少频繁的片外访存,存在对神经元和权值的拆分策略问题。在一些实施例中,根据参与运算的数据的规模特性,可以采取不同的拆分方式。
根据前面描述的卷积运算原理可知,Co维度(深度卷积为C维度)上的运算结果无需累加,因此不同Co上的运算分配在不同的运算电路上可以相对独立地进行。在小卷积场景中,通常单轮运算中卷积核的输出通道Co维度的尺寸不超过所调度的从处理电路的数量,因此单个Co的运算需要由一个或多个从处理电路来完成。更一般地,即使Co维度较大时,也可以通过拆分成多轮运算来实现,其中每轮运算处理的Co尺寸不超过所调度的从处理电路的数量。由此,在一个示例中,可以首先基于卷积核的输出通道Co维度尺寸和可调度的从处理电路数量Ns,确定完成卷积运算所需的运算轮次以及各轮次运算中处理的Co数量或相应的分组模式。
不管哪种分配方式,在单轮运算中,Co可能存在两种分配情况:多个从处理电路处理一个Co值,或者单个从处理电路处理一个或多个Co值。具体地,在处理Nco个输出通道的单个运算轮次中,每Rs个SL构成一个从处理电路组SLB,处理对应同一输出Co值的卷积核,Rs=[Ns/Nco],也即同一卷积核在同一SLB内的Rs个SL上复用,Rs表示卷积核在从处理电路之间的复用次数。与之相应地,输入特征图可以在各个从处理电路组SLB之间复用,Rn=[Ns/Rs],表示输入特征图在从处理电路之间的复用次数。
可选地或附加地,当每个从处理电路处理对应rn个Co值的卷积核,rn=[Nco/Ns],此时每个从处理电路处理的输入特征图可以重复用于rn个卷积核,rn表示输入特征图在单个从处理电路内的复用次数。可以考虑硬件缓冲空间限制等因素(例如图5中的第一缓冲电路和第二缓冲电路的大小)来确定单个从处理电路内可应用的最大卷积核复用次数rs和最大输入特征图复用次数rn。
考虑到硬件电路中的缓存大小限制和复用收益,在本披露一些实施例中暂时不考虑一个从处理电路在单轮运算中处理多个Co值的情况,而只考虑一个或多个从处理电路在单轮运算中只处理一个Co值的情况。
根据在单轮运算中处理同一Co值的从处理电路SL的个数,可以采用不同的分组模式。可以理解,优选对可调用的从处理电路SL平均分配,从而均衡算力,例如,每2个SL一组,从而16个SL可以同时处理8个Co值;或者每4个SL一组,从而16个SL可以同时处理4个Co值;等等。在前面结合图5描述的计算装置中,第二存储电路WRAM具有16块存储区域,分别分配给16个从处理电路SL。进一步地,每4块又可以组合成一个存储区块,分给对应的从处理电路组SLB。因此,在一些实施例中,对于图5所示的包括Ns=16个SL的计算装置,可以选择如下几种分组模式:Group1模式、Group4模式和Group16模式。本领域技术人员可以理解,根据Ns的数值不同,可以有不同的分组模式,每种分组模式均可以参考本文给出的以上三种代表性分组 模式进行对应的处理。
在一些实施例中,上述分组模式可以统一表示为GroupN,代表当前轮次运算中调度的所有从处理电路SL分为N组,每个从处理电路组SLB处理同一Co值,不同从处理电路组SLB处理不同Co值。对于总计16个SL可调度的场合下,N可以取1,4,16,分别对应上面的Group1、Group4和Group16。
图7a-7d示出了根据本披露实施例的几种示例性分组模式。图7a示出了Group1模式,图7b示出了Group16模式,图7c示出了一种Group4模式,以及图7d示出了另一种Group4模式。
如图7a所示,Group1模式是指所有可调度的16个SL属于一个组,共同处理一个Co值,例如SL0~SL15属于组G0。从而,针对该一个输出通道的运算被分配在16个SL上。在这种模式下,可以优先考虑将该输出通道的卷积核720以广播方式传输到各个SL,输入特征图710则进行拆分分配给各个SL,从而提高访存效率。
在一个实施例中,可以将卷积核存储在图5的第一存储电路530上,以利用广播通道进行传输。输入特征图则可以按照输出特征图的XY方向划分,存储在第二存储电路540上,以分配给不同的SL。由此,所有SL共同计算一个Co的输出特征图。后面将结合附图详细描述输入特征图的划分和存储。
如图7b所示,Group16模式是指所有可调度的16个SL分成16个组,也即每组一个SL,每个SL处理一个不同的Co值。例如SL0属于组G0,SL1属于组G1,以此类推,直至SL15属于组G15。在这种模式下,同一块输入特征图730可以在16个SL之间重复使用,因此可以优先考虑将输入特征图730以广播方式传输到各个SL,而对应不同Co的卷积核740则分发给对应的SL。
在一个实施例中,可以将输入特征图复制16份,存储在第二存储电路上为16个从处理电路分配的16个存储区域上。卷积核则根据Co划分,一个SL对应一个Co,一次处理16个Co,存储在第一存储电路上,以单播方式分配给不同的SL。由此,所有SL针对同一输入特征图计算不同Co的输出特征图。
如图7c所示,Group4模式是指所有可调度的16个SL分成4个组,每组处理一个Co值。每个SL组(简称SLB)包括的SL数量等于Rs=Ns/4=4。例如SL0~SL3属于组G0,SL4~SL7属于组G1,SL8~SL11属于组G2,以及SL12~SL15属于组G3。这种模式介于Group1和Group16之间,因此可以将卷积核或输入特征图任一确定为多播数据,而将另一确定为分发数据。
在一个实施例中,可以将卷积核按照Co划分成4组,存储在图5的第一存储电路530上,以利用广播通道进行传输。输入特征图则可以按照输出特征图的XY方向划分为4份并复制4份,存储在第二存储电路540上,以分发给4个SLB。每个SLB获得相同的输入特征图,在SLB内再按照所划分的4份分发给其内的4个SL。由此,每个SLB中的所有SL共同计算一个Co的输出特征图,4个SLB则分别处理一个不同的Co。
如图7c所示,将卷积核分成4组,按照Co以间隔1为单位划分至各组。例如,当Co=12时,分成的4组Co分别为{0,4,8}、{1,5,9}、{2,6,10}和{3,7,11}。每一次发送各组的一个Co,例如第一次发送Co=0~3,一个Co对应一个SLB,在一个SLB内的4个SL共用相同权值;第二次发送Co=4~7,依次类推。由此,每轮运算完成后,各个SLB输出的运算结果的Co维度是连续的。
当采用Forward4这种小卷积拆分运算方案时,为了同时支持以上三种模式,可以统一将神经元存储在第二存储电路WRAM上,将权值存储在第一存储电路NRAM上。
输入特征图的示例性拆分
从前面的描述可以看出,当多个SL共同处理一个Co值时,需要在这多个SL之间对输入特征图进行拆分,例如Group1分组模式需要将输入特征图拆分成16份,而Group4分组模式需要将输入特征图拆分成4份。
为了保证拆分的输入特征图可以共用卷积核,可以根据输出特征图的Ho/Wo方向来划分,从而映射回到输入特征图的划分。在一些实施例中,在每个从处理电路组内包括的Rs个从处理 电路SL之间可以按如下划分输入特征图:根据对应的输出特征图的尺寸,将输出特征图在XY维度(也即Ho/Wo维度)上平均划分为Rs个形状相同的输出特征块;以及根据计算每个输出特征块所需的输入特征图区域,将输入特征图在XY维度(也即Hi/Wi维度)上划分为Rs个输入特征块,以分配给Rs个从处理电路。可以理解,取决于卷积核尺寸和卷积步长,输出特征图上相邻的输出点所对应的输入特征图可能会存在重叠。
图8示出了根据本披露实施例的输入特征图的示例性拆分示意图。在此示例中,将输入特征图划分成16份分配在16个SL上,对应Group1模式。
图中810代表单个Co的输出特征图,其在XY方向上按照4×4方式划分成16个形状相同的输出特征块,分别分配给SL0~SL15。继而,由这16个输出特征块可以映射到输入特征图820上,获得分别计算这16个输出特征块所需的16个输入特征图区域,其同样是将输入特征图在XY方向上划分。这16个输入特征图区域可以相应地分配给16个从处理电路SL。
根据前文描述,会按照确定的卷积拆分方案,以拆分单元为单位对输入特征图进行拆分,因此,上述实施例中对输入特征图的分块要使得划分的每个输入特征图块在XY方向上是拆分单元XY方向维度的倍数,也即在XY方向上可以按照拆分单元对齐。例如,在选择4×4×4的卷积拆分方案时,每个输入特征图块按4×4对齐;而在选择16×2×2的卷积拆分方案时,每个输入特征图块按2×2对齐。
对于输出特征图不按拆分单元(例如4×4或2×2)对齐的情况,需要相应的在输入特征图上填补(例如补0),使得实际计算的输出XY是按拆分单元(例如4×4或2×2)对齐的并且输入XY也是按拆分单元(例如4×4或2×2)对齐的。
本领域技术人员可以理解,也可以在输出特征图的XY方向按照其他规则进行拆分,例如按照1×16方式拆分成16个形状相同的输出特征块,分别分配给SL0~SL15。本披露实施例在此方面没有限制。此外,还可以理解,虽然前面结合从处理电路之间的拆分进行描述,但是这种拆分方式也可以应用于其他场景下的拆分,例如单个从处理电路SL内的运算电路CU之间的拆分,本披露实施例在此方面没有限制。
单个从处理电路内的示例性卷积运算过程
在拆分好待运算数据并进行相应的摆放存储之后,就可以调度多个从处理电路对输入特征图和卷积核的对应数据行执行卷积运算,继而可以根据卷积拆分方案,对多个从处理电路返回的运算结果进行拼接处理,以得到输入特征图和卷积核的卷积运算的输出特征图。具体地,可以利用从处理电路中的多个运算电路CU以及各个缓冲电路(参见图5)来执行具体的卷积运算过程。取决于从处理电路内部缓冲电路的空间大小以及运算电路的算力限制,在每轮运算中通常需要执行多个运算周期来完成所需运算。
从前面描述可知,在针对常规3D卷积运算场景下,单个从处理电路内的所有运算电路计算对应同一输出通道Co的一个输出特征图或部分输出特征图。取决于从处理电路SL内第一缓冲电路和第二缓冲电路的缓冲空间大小、运算电路CU的处理能力(例如内部寄存器等),从处理电路可能无法一次计算完分配给其的输出特征图。因此,可以以运算电路单次运算能力(例如,单次计算Nop个输出点或部分和)为单位,划分输出特征块,每个输出特征块对应单个SL内所有可调度的N CU个运算电路的单次运算能力(N CU*Nop个输出点)。例如,以前文图5中每个SL包括4个CU为例,假设每个CU单次可以计算Nop=4个输出点或输出点的部分和,则单个SL单次可以计算4*4=16个输出点(或部分和)。因此,可以将输出特征图在XoYo维度上按照16个输出点对齐划分输出特征块,逐个计算各个输出特征块。可以理解,这16个输出点可以按照4*4形式,也可以按照1*16形式,本披露实施例在此方面没有限制。
在计算每个划分的输出特征块时,又可以进一步在这N CU个运算电路之间划分该输出特征块的输出点,以确定各个运算电路的处理对象。继而,可以根据输出点的划分,以拆分单元为滑动窗口,从第一缓冲电路中选取N CU个输入特征数据行分发给N CU个运算电路,从第二缓冲电路中选取对应的权值数据,广播给N CU个运算电路,从而通过复用权值数据来实现多个滑动窗口对应 的输出点的并行计算。执行Nk次滑动选取,其中Nk根据卷积核在X和Y维度的尺寸和从处理电路在当前卷积拆分模式下单次运算所支持的最大卷积核尺寸中的较小值来确定。
在一些实施例中,当执行常规三维卷积运算时,可以按如下选取对应的权值数据:从第二缓冲电路中按照与在第一缓冲电路中对应的滑动方式选取1/Nop个权值行,将其复制Nop-1份扩展为一个扩展权值行,广播给从处理电路内的N CU个运算电路。
此时,每个运算电路可以在每次滑动选数计算时,针对来自第一缓冲电路的一个输入特征行和来自第二缓冲电路的一个扩展权值数据行,以1/Nop个数据行为单位进行对位乘累加,得到Nop个部分和;以及将Nk个滑动选数计算得到的Nk*Nop个部分和按照对应的卷积输出点进行累加,得到并输出Nop个运算结果。
从处理电路在输出其内运算电路的输出点时,可以根据输出点的划分方式,按特定顺序输出其内多个运算电路计算的输出点,以使得连续输出的输出点在X和/或Y维度上连续,方便后续处理。在一些实施例中,主处理电路可以进一步将从各个从处理电路返回的运算结果以第四维度存储顺序存储。根据情况,主处理电路还可以将运算结果转换为期望的维度存储顺序存储。
运算电路之间输出点的划分可以有多种方式,相应地滑动选数卷积过程以及输出点的输出顺序也有所不同。
单核/多核计算装置上卷积运算任务的拆分方案
前文描述了各种卷积运算在单个处理核内的处理方式,包括分组模式、组内拆分方式等。
当采用小卷积运算方案时,由于按照拆分单元为计算单元,那么不可避免在计算时就存在对齐限制。根据不同的Group模式,相同Group模式的不同H*W的拆分方式,最终在计算时的对齐限制也不一样。在对齐的计算上,可以首先根据输出特征图的拆分方式确定ho*wo的对齐限制,再由ho*wo反推回hi*wi,由于输入神经元需要摆成拆分单元块的形式,从而还需要再对齐一次。以Forwrd4为例,其按照4B*4*4的块为计算单元,上述对齐限制可以汇总如下表2:
Figure PCTCN2022100303-appb-000002
表2、对齐限制
当在多核计算装置上执行卷积运算时,例如参考图3b示出的多核结构简化示意图,在处理器簇(cluster)层级里,一个cluster有4个加速器核(core),每个核的计算都是相互独立而又相关的。
根据上表Forward4不同Group模式和不同H*W拆分方式下的对齐限制可知,在Forward4计算时,对hi和wi存在一定的对齐限制,如果不考虑卷积计算时对数据的逻辑拆分,将会导致由Cluster分发数据到Core上时,会导致对齐过度从而产生算力浪费,同时还会带来计算时对数据分块的复杂度。
此外,由前文所述,Forward4在处理输入特征图的HW维度较大的情况下具有明显优势,在输入特征图的HW维度较小时因为存在对齐问题优势不太明显。鉴于此,可以考虑在输入特征图的HW维度较大场景下的卷积计算场景。在这种场景下,往往通道数C维度会比较小,比如first conv场景,通常通道数C为4。那么在这种场景下,可以不考虑片上空间放不下全部通道的情形。
基于上述分析,本披露实施例针对输入特征图的HW维度较大的情形,为了适应对齐的限制,同时能确保计算的正确性,提出了一种输入特征图的拆分策略。此处的拆分策略是指将不规则、不同形状的输入特征图(神经元张量),拆分成可以用于处理的基本块(basic block)进行计算。
所提供的拆分策略可以应用于多核计算装置上,例如用于确定输入特征图在多个核上(不同空间上)的拆分。该拆分策略也可以应用于单核计算装置上,例如用于确定输入特征图在单个核上不同时间轮次上的拆分。
图9示例性示出了可以实施本披露实施例的计算装置的示例性结构图。如图所示,计算装置900包括主设备910和从设备920,其中从设备920中可以包括一个或多个处理核921。
主设备910例如可以是通用处理器,其作为控制设备(简称主机端),负责复杂控制和调度等工作。主设备910例如可以是图2中的处理装置203。从设备920例如可以是各种领域专用处理器,负责大规模的并行计算或领域专用计算任务。从设备920例如可以是图2中的计算装置201。二者协同完成计算任务。
在一些实施例中,计算装置900可以用于执行卷积运算。具体地,主设备910可以发射第一任务。该第一任务用于针对输入特征数据和卷积核数据执行卷积运算。优选地,在这些卷积运算场景中,输入特征数据和卷积核数据的输入通道Ci维度尺寸小于第一阈值,输入特征数据的宽度W和高度H维度尺寸中任一超过第二阈值。第一阈值例如可以是64B,32B等,从而片上空间可以放下全部通道的数据。第二阈值例如可以是4的倍数、8的倍数、16的倍数、甚至64的倍数,从而可以为通道维度补数或方便分组拆分。
从设备920可以根据第一任务的拆分策略,调度相应数量的处理核921执行第一任务。其中,在每一核次处理中,针对输入特征数据中的一部分执行上述卷积运算。
如前面所描述的,拆分策略可以应用在多核计算装置上,也可以应用在单核计算装置上,因此本文中提到的“核次”既可以表示单个核在不同时间上的运算轮次,也可以表示不同核在同一时间上的运算轮次。
由此,在一些实施例中,主设备910可以确定第一任务的拆分策略。无论是在多核上还是在单核上,上述拆分策略可以包括:完成第一任务所需处理核的核次数,以及每一核次对应要处理的输入特征块。这里,“核次数”是指完成一项任务需要多少个处理核执行多少次,也即对应任务在空间或时间维度的展开。
从设备920还可以包括一个或多个存储核922,每个存储核可以由一个或多个处理核921共享,以存储处理核处理前、处理期间和/或处理后的数据。例如对于单核板卡而言,一般采用1存储核+1处理核的架构;对于多核板卡而言,则采用1存储核+N处理核的架构,例如N=4。
主设备可以根据输入特征数据的规模来确定拆分策略。不防假设输入特征数据为[N Hi Wi Ci],N为批次维度,H是高度维度,W是宽度维度,C是输入通道维度。输出特征数据为[Ho Wo Co]。
考虑到批次之间的运算相对独立,因此在一些实施例中,主设备可以首先根据输入特征数据的总批次数B和可调度的处理核的总核数Ncore来确定拆分策略。
具体地,主设备可以确定处理轮数L,以及各轮处理的批次数Bi,i=1,…,L,其中L=ceil(B/Ncore),
Figure PCTCN2022100303-appb-000003
也即,可以将各批次平均分配在Ncore个处理核上。
在一个示例中,当总批次数B是总核数Ncore的倍数时,可以将每连续的n个批次的输入特征数据分配给一个处理核进行处理,n=B/Ncore。由此,每个处理核可以处理连续的n个批次的数据。
当总批次数B不是总核数Ncore的整数倍时,例如,B=n*Ncore+Brem,其中n=L-1,Brem<Ncore,也即Brem个批次需要分配在Ncore个处理核上进行,由此会存在多个处理核共同处理同一批次数据的情况。可以理解,当B=1时,也即数据是单个批次时,也存在单个批次在多个处理核上拆分处理的情形。
考虑到多核计算装置中的处理核通常以计算簇(cluster)的方式划分管理,因此在一些实施例中,这种单个批次的处理可以按照计算簇进行拆分。
具体地,在一个示例中,主设备可以将Brem个批次的输入特征数据分配在多个计算簇上处 理,其中Nc个计算簇共同处理同一个批次的输入特征数据。
计算簇之间的数据拆分可以采取多种方式。在本披露一些实施例中,考虑到输入特征图的通道数较小,而HW维度较大,因此可以根据HW维度进行拆分。
在一个示例中,主设备可以将输入特征数据的卷积运算所对应的输出特征数据在第一维度上平均拆分成Nc个形状相同的输出特征块;以及根据计算每个输出特征块所需的输入特征区域,将输入特征数据在该第一维度上对应地拆分出Nc个输入特征块,分别分配给上述Nc个计算簇。
通过根据输出特征图的拆分来反推输入特征图的拆分,可以确保拆分运算相互之间的独立性。具体的拆分原理可以参考前文结合图8描述的对输入特征图的拆分部分,此处不再赘述。
上述第一维度可以是H和W任一维度或二者,优选地,第一维度是存储顺序上较低的维度,例如H维度,由此可以减少数据读取时跳跃的步长。
从图3b所示的多核计算装置的架构可知,一个计算簇(cluster)内包括一个存储核以及多个处理核,因此同一计算簇内的运算还需要进一步拆分到其内的多个处理核上。此时,拆分逻辑可以更为细化。
在一些实施例中,主设备可以进一步按如下在单个计算簇内的多个核上拆分输入特征块:根据输入特征块对应的输出特征块的第一维度的大小wo,确定输入特征块在多个处理核上的拆分方式。也即,根据第一维度(W维度)的数据是否能够一次处理完,分不同情况进行处理。
大体上,可以分为以下几种情况。为了描述简便起见,在下文中,将W维度能够一次处理完的输入特征图称为小图,将W维度可以分不超过Ncc次处理完的输入特征图称为中图,将W维度可以分Ncc次处理完的输入特征图称为大图,将W维度可以分Ncc的倍数次处理完的输入特征图称为超大图,其中Ncc是单个计算簇内的处理核数量,例如在前文示例中,Ncc=4。
图10示出了根据本披露实施例的不同情况下的输入特征块的拆分示意图。从前文可知,输入特征块的拆分实质是根据输出特征块进行划分的,因此图10中每个图代表输出特征块的Ho*Wo维度,图中的灰色部分代表cluster层面上加载在SRAM的数据,每一个小格子则代表一个处理核上处理的数据,可以称为一个基本块(basicblock)。
图10a示出了小图情形。针对小图情形,由于一次可以处理完第一维度的数据,因此不从第一维度拆分,而是从第二维度拆分。第二维度例如是比第一维度更高的存储维度,比如H维度。
在这些实施例中,主设备可以按如下确定输入特征块在Ncc个处理核上的拆分方式:当wo≤单核次处理量S时,根据对应的输出特征块的第二维度的大小ho,在第二维度上拆分成Ncc个输出特征子块;以及根据计算每个输出特征子块所需的输入特征区域,将输入特征块在第二维度上对应地拆分出Ncc个输入特征子块,分别分配给该Ncc个处理核。
对于Ncc=4的示例,SRAM每次可以加载4*1(H*W)个基本块,分别分配给4个处理核。
图10b示出了中图情形。针对中图情形,由于一次处理不完第一维度的数据,但是第一维度的数据又不足以让Ncc个处理核运算饱和或近似饱和(“打满”),此时可以将第二维度的数据补充到第一维度,也即同时进行第一维度和第二维度的拆分。
在这些实施例中,主设备可以按如下确定输入特征块在Ncc个处理核上的拆分方式:当单核次处理量
Figure PCTCN2022100303-appb-000004
时,根据对应的输出特征块的第二维度的大小ho和第一维度大小wo,在第一维度和第二维度上联合地拆分成Ncc个输出特征子块,其中第一维度拆分成Ws份,第二维度拆分成Wh份,Ws*Wh=Ncc,每个输出特征子块的第一维度大小不超过S;以及根据计算每个输出特征子块所需的输入特征区域,将输入特征块在第一和第二维度上对应地拆分出Ncc个输入特征子块,分别分配给Ncc个处理核。
对于Ncc=4的示例,SRAM每次可以加载2*2(H*W)个基本块,分别分配给4个处理核。
图10c示出了大图情形。针对大图情形,由于第一维度的数据足够分Ncc次处理完,此时可以直接仅根据第一维度进行拆分。
图10d示出了超大图情形。针对超大图情形,第一维度的数据可以在Ncc个处理核上分多轮处理完,因此,也可以直接根据第一维度进行拆分。
大图与超大图情形类似,可以按类似方式进行处理。在这些实施例中,主设备可以按如下确 定输入特征块在Ncc个处理核上的拆分方式:当wo>Ncc/2*S时,根据wo,在第一维度上拆分成m*Ncc个输出特征子块,m是自然数,每个输出特征子块的第一维度大小不超过S;以及根据计算每个输出特征子块所需的输入特征区域,将输入特征块在第一维度上对应地拆分出m*Ncc个输入特征子块,依次分配给Ncc个处理核。
对于Ncc=4的示例,SRAM每次可以加载1*4(H*W)个基本块,分别分配给4个处理核。
在上述实施例中,小图、中图、大图、超大图的策略选择可以根据片上空间的资源情况来动态选择。一般说来,小图模式处理的特征图尺寸量级在200*200左右,大图模式处理的特征图尺寸量级在960*960左右,对于超大图模式,则可以处理1080p尺寸(1080*1920)、甚至2K量级的图片。
虽然以上针对各种规模的输入特征图在单个计算簇内的多个处理核上的拆分策略进行了描述,但是上述拆分策略可以同样应用于单核板卡。如前所提到,单核板卡一般采用1存储核+1处理核的架构,因此可以根据存储核的片上空间,以一个存储核上的数据量作为一次计算任务,由单个处理核分多次完成,各次之间的拆分策略仍然可以沿用上文描述的策略,区别仅在于由分配给Ncc个处理核改为分配给Ncc次处理。
由此,上述拆分策略可以统一为:根据存储核上的输入特征块对应的输出特征块的第一维度的大小wo,确定输入特征块在Ncc个核次上的拆分方式。当采用多核计算装置时,Ncc对应单个计算簇内的处理核数量,通常存储核上的容量可供Ncc个处理核的单次处理;当采用单核计算装置时,Ncc对应存储核上的容量可供单个处理核处理Ncc次。
同样地,Ncc个核次上的拆分方式可以按照前述小图、中图、大图模式,根据不同情况进行不同拆分,此处不再重复。
综上,本披露实施例的任务拆分方案对于不同的板卡形态(单核板卡、多核板卡)都可以灵活适配。例如,在进行计算任务的分配时,可以将一个存储核上的数据量作为一次计算任务,一个存储核的计算任务可以按照上述的拆分策略划分在不同处理核上进行并行计算,或者由一个处理核在时间上划分成多次进行顺序计算。
本披露实施例还提供了利用前述计算装置来分配并执行卷积运算的方法。本领域技术人员可以理解,该方法的步骤与前面结合附图描述的计算装置的各个特征相对应,因此前面描述的特征同样适用于方法步骤,此处不再重复。
本披露实施例还提供了一种芯片,其可以包括前面结合附图描述的任一实施例的计算装置。进一步地,本披露还提供了一种板卡,该板卡可以包括前述芯片。
根据不同的应用场景,本披露的电子设备或装置可以包括服务器、云端服务器、服务器计算簇、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步,本披露的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中,根据本披露方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器),而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中,云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容,从而可以根据终端设备和/或边缘端设备的硬件信息,从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源,以便完成端云一体或云边端一体的统一管理、调度和协同工作。
需要说明的是,为了简明的目的,本披露将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此,依据本 披露的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本披露对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本披露某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。
在具体实现方面,基于本披露的公开和教导,本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行拆分,而实际实现时也可以有另外的拆分方式。又例如,可以将多个单元或组件结合或者集成到另一个系统,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。
在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。
在另外一些实现场景中,上述集成的单元也可以采用硬件的形式实现,即为具体的硬件电路,其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件,而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此,本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现,例如中央处理器、GPU、FPGA、DSP和ASIC等。进一步,前述的所述存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等),其例如可以是可变电阻式存储器(Resistive Random Access Memory,RRAM)、动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,EDRAM)、高带宽存储器(High Bandwidth Memory,HBM)、混合存储器立方体(Hybrid Memory Cube,HMC)、ROM和RAM等。
以上对本披露实施例进行了详细介绍,本文中应用了具体个例对本披露的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本披露的方法及其核心思想;同时,对于本领域的一般技术人员,依据本披露的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本披露的限制。

Claims (16)

  1. 一种计算装置,包括主设备和从设备,所述从设备包括一个或多个处理核,其中:
    所述主设备配置用于发射第一任务,所述第一任务用于针对输入特征数据和卷积核数据执行卷积运算;以及
    所述从设备配置用于根据所述第一任务的拆分策略,调度相应数量的所述处理核执行所述第一任务,其中在每核次处理中针对所述输入特征数据中的一部分执行所述卷积运算。
  2. 根据权利要求1所述的计算装置,其中所述主设备进一步配置用于确定所述第一任务的拆分策略,所述拆分策略包括:完成所述第一任务所需处理核的核次数,以及每一核次对应要处理的输入特征块。
  3. 根据权利要求2所述的计算装置,其中所述主设备进一步配置用于:
    根据所述输入特征数据的总批次数B和可调度的所述处理核的总核数Ncore,确定处理轮数L,以及各轮处理的批次数Bi,i=1,…,L,其中L=ceil(B/Ncore),
    Figure PCTCN2022100303-appb-100001
    其中n=L-1,Brem<Ncore。
  4. 根据权利要求3所述的计算装置,其中所述主设备进一步配置用于:
    将每连续的n个批次的输入特征数据分配给一个处理核进行处理。
  5. 根据权利要求3-4任一所述的计算装置,其中所述从设备包括多个计算簇,每个计算簇包括多个所述处理核,所述主设备进一步配置用于:
    将Brem个批次的输入特征数据分配在所述多个计算簇上处理,其中Nc个计算簇共同处理同一个批次的输入特征数据。
  6. 根据权利要求5所述的计算装置,其中所述主设备进一步配置用于:
    将所述输入特征数据的卷积运算所对应的输出特征数据在第一维度上平均拆分成Nc个形状相同的输出特征块;以及
    根据计算每个输出特征块所需的输入特征区域,将所述输入特征数据在所述第一维度上对应地拆分出Nc个输入特征块,分别分配给所述Nc个计算簇。
  7. 根据权利要求2-6任一所述的计算装置,所述从设备还包括一个或多个存储核,其中一个存储核为一个或多个处理核提供输入特征数据,所述主设备进一步配置用于按如下在多个核次上分配存储核上的输入特征块:
    根据所述输入特征块对应的输出特征块的所述第一维度的大小wo,确定所述输入特征块在所述多个核次上的拆分方式。
  8. 根据权利要求7所述的计算装置,其中所述主设备进一步配置用于按如下确定所述输入特征块在Ncc个核次上的拆分方式:
    当所述wo≤单核次处理量S时,根据所述对应的输出特征块的第二维度的大小ho,在第二维度上拆分成Ncc个输出特征子块;以及
    根据计算每个输出特征子块所需的输入特征区域,将所述输入特征块在所述第二维度上对应地拆分出Ncc个输入特征子块,分别分配给所述Ncc个核次。
  9. 根据权利要求7-8任一所述的计算装置,其中所述主设备进一步配置用于按如下确定所述输入特征块在Ncc个核次上的拆分方式:
    当单核次处理量
    Figure PCTCN2022100303-appb-100002
    时,根据所述对应的输出特征块的第二维度的大小ho和所述wo,在第一和第二维度上联合地拆分成Ncc个输出特征子块,其中第一维度拆分成Ws份,第二维度拆分成Wh份,Ws*Wh=Ncc,每个输出特征子块的第一维度大小不超过S;以及
    根据计算每个输出特征子块所需的输入特征区域,将所述输入特征块在所述第一和第二维度上对应地拆分出Ncc个输入特征子块,分别分配给所述Ncc个核次。
  10. 根据权利要求7-9任一所述的计算装置,其中所述主设备进一步配置用于按如下规则确定所述输入特征块在Ncc个核次上的拆分方式:
    Figure PCTCN2022100303-appb-100003
    时,根据所述wo,在第一维度上拆分成m*Ncc个输出特征子块,每个输出特征 子块的第一维度大小不超过S;以及
    根据计算每个输出特征子块所需的输入特征区域,将所述输入特征块在第一维度上对应地拆分出m*Ncc个输入特征子块,依次分配给所述Ncc个核次。
  11. 根据权利要求7-10任一所述的计算装置,其中:
    当所述从设备中仅单个可调度的处理核时,所述多个核次为单个处理核在时间上顺次执行的多个处理轮次;和/或
    当所述从设备包括多个可调度的处理核时,所述多个核次为多个处理核在时间上并发执行的多个处理轮次。
  12. 根据权利要求6-11任一所述的计算装置,其中所述第一维度是宽度W维度,所述第二维度是高度H维度,所述第一维度在存储顺序上低于所述第二维度。
  13. 根据权利要求1-12任一所述的计算装置,其中每个处理核配置用于针对分配给当前核次的输入特征子块应用Forward4卷积运算方案来执行所述卷积运算。
  14. 一种芯片,包括根据权利要求1-13任一所述的计算装置。
  15. 一种板卡,包括根据权利要求14所述的芯片。
  16. 一种使用权利要求1-13任一所述的计算装置来处理数据的方法。
PCT/CN2022/100303 2021-09-26 2022-06-22 计算装置、数据处理方法及相关产品 WO2023045446A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111131275.5 2021-09-26
CN202111131275.5A CN113837922A (zh) 2021-09-26 2021-09-26 计算装置、数据处理方法及相关产品

Publications (1)

Publication Number Publication Date
WO2023045446A1 true WO2023045446A1 (zh) 2023-03-30

Family

ID=78970224

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100303 WO2023045446A1 (zh) 2021-09-26 2022-06-22 计算装置、数据处理方法及相关产品

Country Status (2)

Country Link
CN (1) CN113837922A (zh)
WO (1) WO2023045446A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837922A (zh) * 2021-09-26 2021-12-24 安徽寒武纪信息科技有限公司 计算装置、数据处理方法及相关产品
CN116503029B (zh) * 2023-06-27 2023-09-05 北京中电科卫星导航系统有限公司 用于自动驾驶的模块数据协同处理方法及系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846235A (zh) * 2016-12-26 2017-06-13 中国科学院计算技术研究所 一种利用NVIDIA Kepler GPU汇编指令加速的卷积优化方法及系统
CN111767246A (zh) * 2020-06-09 2020-10-13 上海寒武纪信息科技有限公司 数据处理方法、相关设备及计算机可读介质
CN111767243A (zh) * 2020-06-09 2020-10-13 上海寒武纪信息科技有限公司 数据处理方法、相关设备及计算机可读介质
CN113222125A (zh) * 2020-01-21 2021-08-06 北京希姆计算科技有限公司 卷积运算方法及芯片
CN113222099A (zh) * 2020-01-21 2021-08-06 北京希姆计算科技有限公司 卷积运算方法及芯片
CN113837922A (zh) * 2021-09-26 2021-12-24 安徽寒武纪信息科技有限公司 计算装置、数据处理方法及相关产品

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10936943B2 (en) * 2017-08-31 2021-03-02 Qualcomm Incorporated Providing flexible matrix processors for performing neural network convolution in matrix-processor-based devices
CN110059797B (zh) * 2018-10-10 2020-03-10 中科寒武纪科技股份有限公司 一种计算装置及相关产品
CN113222136A (zh) * 2020-01-21 2021-08-06 北京希姆计算科技有限公司 卷积运算方法及芯片

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846235A (zh) * 2016-12-26 2017-06-13 中国科学院计算技术研究所 一种利用NVIDIA Kepler GPU汇编指令加速的卷积优化方法及系统
CN113222125A (zh) * 2020-01-21 2021-08-06 北京希姆计算科技有限公司 卷积运算方法及芯片
CN113222099A (zh) * 2020-01-21 2021-08-06 北京希姆计算科技有限公司 卷积运算方法及芯片
CN111767246A (zh) * 2020-06-09 2020-10-13 上海寒武纪信息科技有限公司 数据处理方法、相关设备及计算机可读介质
CN111767243A (zh) * 2020-06-09 2020-10-13 上海寒武纪信息科技有限公司 数据处理方法、相关设备及计算机可读介质
CN113837922A (zh) * 2021-09-26 2021-12-24 安徽寒武纪信息科技有限公司 计算装置、数据处理方法及相关产品

Also Published As

Publication number Publication date
CN113837922A (zh) 2021-12-24

Similar Documents

Publication Publication Date Title
US20240112006A1 (en) Deep learning hardware
WO2023045446A1 (zh) 计算装置、数据处理方法及相关产品
WO2023045445A1 (zh) 数据处理装置、数据处理方法及相关产品
CN108170640B (zh) 神经网络运算装置及应用其进行运算的方法
CN112799599B (zh) 一种数据存储方法、计算核、芯片和电子设备
CN112686379A (zh) 集成电路装置、电子设备、板卡和计算方法
WO2023123919A1 (zh) 数据处理电路、数据处理方法及相关产品
CN113469336A (zh) 优化神经网络模型的编译方法、执行方法及相关产品
WO2022134873A1 (zh) 数据处理装置、数据处理方法及相关产品
CN113469337B (zh) 用于优化神经网络模型的编译方法及其相关产品
WO2022095675A1 (zh) 神经网络稀疏化的装置、方法及相关产品
CN113850377A (zh) 数据处理装置、数据处理方法及相关产品
CN114692844A (zh) 数据处理装置、数据处理方法及相关产品
CN113850379A (zh) 数据处理装置、数据处理方法及相关产品
WO2023087698A1 (zh) 执行卷积运算的计算装置、方法及相关产品
WO2022134872A1 (zh) 数据处理装置、数据处理方法及相关产品
CN112801276A (zh) 数据处理方法、处理器及电子设备
WO2022135600A1 (zh) 计算神经网络的装置、板卡、方法及可读存储介质
WO2023087814A1 (zh) 计算装置、利用计算装置实施卷积运算的方法及相关产品
WO2022063183A1 (zh) 执行神经网络计算的装置、板卡、方法及可读存储介质
WO2022257980A1 (zh) 计算装置、利用计算装置实施卷积运算的方法及相关产品
WO2022135599A1 (zh) 融合分支结构的装置、板卡、方法及可读存储介质
WO2023045638A1 (zh) 计算装置、利用计算装置实施卷积运算的方法及相关产品
CN113742266B (zh) 集成电路装置、电子设备、板卡和计算方法
WO2022063217A1 (zh) 向前融合神经网络的装置、板卡、方法及可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871507

Country of ref document: EP

Kind code of ref document: A1