WO2023045445A1 - Data processing device, data processing method and related product - Google Patents

Data processing device, data processing method and related product

Info

Publication number
WO2023045445A1
Authority
WO
WIPO (PCT)
Prior art keywords
dimension
data
size
input
output
Prior art date
Application number
PCT/CN2022/100302
Other languages
English (en)
Chinese (zh)
Inventor
肖麟慧
郑鎏韬
王楠
喻歆
Original Assignee
寒武纪(西安)集成电路有限公司
Priority date
Filing date
Publication date
Application filed by 寒武纪(西安)集成电路有限公司
Publication of WO2023045445A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present disclosure relates generally to the field of data processing. More specifically, the present disclosure relates to a data processing device, a data processing method for executing block instructions on data by using the data processing device, a chip and a board.
  • In the fields of artificial intelligence (AI) and deep learning (Deep Learning), the neural network is one of the most critical technologies, among which the Convolutional Neural Network (CNN) is the most important type of network.
  • the most critical calculation in the convolutional neural network is the convolution operation (Convolution Operation) of the convolution layer (Conv layer).
  • the function of the convolutional layer is to extract features from the input data. Through multi-layer convolution, complex features can be extracted to ensure that the network has sufficient expressive ability and generalization ability.
  • the neural network model contains a large number of various types of convolution operations, and the computing performance of the convolution operation greatly affects the computing performance of the entire neural network model.
  • the corresponding input feature maps and weights may have different dimensions.
  • In view of this, the present disclosure proposes, in various aspects, a data processing device that can adapt data of various dimensions to the hardware performing the convolution operation, thereby improving the computational efficiency of the convolution operation.
  • The convolution operation in the embodiments of the present disclosure can be an operation in various neural network models, and these neural network models can be applied in various fields, such as image processing, speech processing, and text processing; such processing can include, but is not limited to, recognition and classification.
  • In a first aspect, an embodiment of the present disclosure provides a data processing device including a control circuit, a first storage circuit and a second storage circuit, wherein: the first storage circuit is used to store data before processing; the second storage circuit is used to store processed data; and the control circuit is used to configure and execute a block instruction, so that input data stored on the first storage circuit in a first dimension storage order is split in units of split units and stored as output data on the second storage circuit, wherein on the second storage circuit the data within each split unit is stored in a second dimension storage order, and the split units themselves are stored in a third dimension storage order.
  • In a second aspect, an embodiment of the present disclosure provides a chip that includes the data processing device of the aforementioned first aspect.
  • In a third aspect, an embodiment of the present disclosure provides a board that includes the chip of the aforementioned second aspect.
  • In a fourth aspect, an embodiment of the present disclosure provides a data processing method for executing a block instruction on input data by using the data processing device of the aforementioned first aspect.
  • With the solutions of the embodiments of the present disclosure, the data is processed in blocks under various convolution splitting schemes so as to adapt to the processing capability of the hardware computing device; the parallel processing capability of multiple slave processing circuits can thus be fully utilized, and the computing efficiency of the convolution operation can be effectively improved.
  • Fig. 1 shows the structural diagram of the board card of the disclosed embodiment
  • FIG. 2 shows a structural diagram of a combination processing device according to an embodiment of the present disclosure
  • FIG. 3a shows a schematic diagram of the internal structure of a processor core of a single-core computing device according to an embodiment of the disclosure
  • Fig. 3b shows a simplified schematic diagram of the internal structure of a multi-core computing device according to an embodiment of the present disclosure
  • FIG. 4 shows an example of an exemplary convolution operation principle to which embodiments of the present disclosure can be applied
  • Fig. 5 shows a schematic structural block diagram of a computing device according to an embodiment of the disclosure
  • FIG. 6 shows an exemplary data storage sequence according to an embodiment of the present disclosure
  • Figures 7a-7c illustrate several exemplary grouping modes according to embodiments of the present disclosure
  • Fig. 8 shows an exemplary split schematic diagram of an input feature map according to an embodiment of the present disclosure
  • FIG. 9 shows a schematic diagram of splitting and storage of the Forward4 scheme according to an embodiment of the present disclosure.
  • FIG. 10 shows a schematic diagram of division of output points of the computing circuit in the Forward4 scheme according to an embodiment of the present disclosure
  • FIG. 11 shows a schematic diagram of a single operation in the Forward4 scheme according to an embodiment of the disclosure
  • Fig. 12 shows a schematic diagram of sliding convolution in the Forward4 scheme according to an embodiment of the present disclosure
  • Fig. 13 shows a schematic diagram of the output data format of the Forward4 scheme according to an embodiment of the present disclosure
  • FIG. 14 shows an overall data handling process according to an embodiment of the present disclosure
  • Figure 15 shows a schematic conceptual diagram of Trans Tiling according to an embodiment of the disclosure
  • Figure 16 shows a schematic diagram of the front and back tables
  • Fig. 17 shows a schematic diagram of executing a block instruction on neuron data according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure.
  • The board card 10 includes a chip 101, which is a system-on-chip (System on Chip, SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, so as to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements for the storage capacity and computing power of the platform.
  • The board 10 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, huge on-chip storage, and powerful computing capabilities.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102.
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • The storage device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 through a bus.
  • the control device 106 in the board 10 is configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing the combined processing means in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201 , an interface device 202 , a processing device 203 and a storage device 204 .
  • the computing device 201 is configured to perform operations specified by the user, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor for performing deep learning or machine learning calculations, which can interact with the processing device 203 through the interface device 202 to Work together to complete user-specified operations.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into a storage device on the computing device 201 .
  • the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the chip of the computing device 201 .
  • the interface device 202 may also read data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201 .
  • The processing device 203 may be one or more types of a central processing unit (CPU), a graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to digital signal processors (DSPs), application specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like, and the number thereof can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when considering the integration of the computing device 201 and the processing device 203 together, they are considered to form a heterogeneous multi-core structure.
  • The storage device 204 is used to store data to be processed. It may be a DRAM, typically DDR memory, with a size of usually 16 GB or larger, and is used to store data of the computing device 201 and/or the processing device 203.
  • Fig. 3a shows a schematic diagram of the internal structure of a processing core when the computing device 201 is a single-core device.
  • the computing device 301 is used for processing input data such as computer vision, speech, natural language, data mining, etc.
  • the computing device 301 includes three modules: a control module 31 , an operation module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete the task of deep learning, which includes an instruction fetch unit (IFU) 311 and an instruction decoding unit (instruction decode unit, IDU) 312.
  • the instruction fetching unit 311 is used to obtain instructions from the processing device 203 , and the instruction decoding unit 312 decodes the obtained instructions and sends the decoding results to the computing module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
  • the storage module 33 is used to store or transfer relevant data, including a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons and intermediate results after calculation;
  • WRAM 332 is used to store convolution kernels of deep learning networks, that is, weights;
  • The DMA 333 is connected to the DRAM 204 through the bus 34, and is responsible for data transfer between the computing device 301 and the DRAM 204.
  • Fig. 3b shows a simplified schematic diagram of the multi-core internal structure of the computing device 201 .
  • Multi-core computing devices can be abstracted using a hierarchical hardware model. As shown in the figure, the multi-core computing device can be abstracted into four levels, namely card level (Card) 350 , chip level (Chip) 360 , processor cluster level (Cluster) 370 and processor core level (Core) 380 .
  • the embodiments of the present disclosure mainly involve the data transmission of the storage unit and the calculation unit, so the drawings and description briefly show and introduce the relevant calculation structure, and other parts are omitted.
  • each board contains local DDR storage, and each processor chip acts as a computing and control unit.
  • each processor chip contains multiple multiprocessors as computing units.
  • each multiprocessor includes multiple accelerator cores as control and computing units, and a shared storage SRAM as a storage unit.
  • each accelerator core contains local storage and an array of local processing units.
  • NFU refers to the Neuron Function Unit, which is used for convolution calculations.
  • the storage model includes board global memory, SRAM (shared memory) on the Cluster, NRAM, WRAM and registers on the Core, and the like.
  • The SRAM is included in a memory processing unit (Memory Process Unit Core, abbreviated MPU or Mem Core).
  • An IPU Core, or simply Core, refers to an intelligent processing core (Intelligent Process Unit Core) in the multi-core computing device.
  • An IPU Core contains the NRAM, the WRAM, the NFU, and so on.
  • Cluster refers to a processor cluster or a computing cluster.
  • a multi-core computing device includes several Clusters, and a Cluster includes 1 Mem Core+N IPU Cores.
  • The convolutional layer in a neural network model can perform convolution operations by applying convolution kernels (also called filters or weights) to input feature maps (also called input data, neurons, or input neurons) for feature extraction.
  • the convolution layer can contain multiple convolution kernels, and each element that makes up the convolution kernel corresponds to a weight coefficient and a bias.
  • Embodiments of the present disclosure can be applied to data splitting of various convolution operations.
  • X is the input data
  • Y is the output data
  • K is the convolution kernel
  • Kh and Kw are the length and width of K
  • sh and sw are the strides in the length and width directions
  • The formula ignores bias, padding (pad) and dilation, and assumes that the input data X has already been padded and that the convolution kernel has already been dilated.
  • the formula ignores the N dimension and the C dimension.
  • the forward calculation of the neural network model is independent in the N dimension and fully connected in the C dimension.
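  • The formula referenced by the symbol definitions above does not survive in this text. Under those definitions, and per the notes that bias, padding, dilation and the N and C dimensions are ignored, a plausible reconstruction of the sliding convolution is:

    $$ Y(h_o, w_o) = \sum_{k_h=0}^{K_h-1} \sum_{k_w=0}^{K_w-1} X(h_o \cdot s_h + k_h,\; w_o \cdot s_w + k_w) \cdot K(k_h, k_w) $$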
  • Fig. 4 shows an example of an exemplary conventional 3D convolution operation principle to which embodiments of the present disclosure can be applied.
  • The figure exemplarily shows four-dimensional input data X with a size of [N Hi Wi Ci], which can be expressed as N three-dimensional rectangles 410 of size Hi×Wi×Ci.
  • The figure also exemplarily shows a four-dimensional convolution kernel K with a size of [Co Kh Kw Ci], which can be expressed as Co three-dimensional convolution kernels 420 of size Kh×Kw×Ci.
  • The convolution of the input data X with the convolution kernel K gives the output data Y, which is four-dimensional data of size [N Ho Wo Co] and can be expressed as N three-dimensional rectangles 430 of size Ho×Wo×Co.
  • The figure also specifically shows an example of the convolution operation, in which the input data is an input feature map 440 with a size of 6×6×3 (the N dimension is omitted); the convolution kernel is a three-dimensional convolution kernel 450 with a size of 3×3×3, for a single Co; and the output data is a 4×4 output feature map 460.
  • the specific operation process is as follows:
  • The convolution kernel 450 scans the input feature map 440 with a certain step size, performing element-wise multiplication and summation on the input features within the convolution window 470 and adding the bias. That is, the value at each position of the output feature map 460 is obtained by performing a two-dimensional convolution of the corresponding block of each input feature map with the corresponding convolution kernel and then summing the results. For example, the figure shows that the value at the (0,0) position of the output feature map 460 (that is, a convolution output point) is obtained by performing a two-dimensional convolution of the convolution window 470 framed by the black cube in the input feature map with the three-dimensional convolution kernel 450, which yields 3 values that are then summed to obtain the final value.
  • the position of the convolution kernel 450 can be moved on the input feature map 440 , that is, the convolution window of the convolution output point can be moved.
  • the convolution step size (Sx, Sy) is (1,1).
  • In this way, the values at the (0,1) or (1,0) positions of the output feature map 460 can be obtained respectively.
  • In a convolutional layer of the neural network, there are N groups of input feature maps, and each group contains Hi×Wi×Ci pieces of information, where Hi and Wi are the height and width of the input feature maps, and Ci is the number of input feature maps, also known as the number of input channels.
  • The convolutional layer has Ci×Co convolution kernels of size Kh×Kw, where Ci is the number of input channels, Co is the number of output feature maps (or the number of output channels), and Kh and Kw are the height and width of the convolution kernel.
  • The output feature maps contain Ho×Wo×Co pieces of information, where Ho and Wo are respectively the height and width of the output feature map, and Co is the number of output channels.
  • The convolution strides Sx and Sy affect the size of the output feature map.
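  • As a hedged illustration (standard convolution arithmetic, not quoted from this text), since the input is assumed to be already padded, the output size follows from the input size, kernel size and strides as:

    $$ H_o = \left\lfloor \frac{H_i - K_h}{S_y} \right\rfloor + 1, \qquad W_o = \left\lfloor \frac{W_i - K_w}{S_x} \right\rfloor + 1 $$

    For the example of Fig. 4, a 6×6 input with a 3×3 kernel and stride 1 gives (6-3)/1+1 = 4, matching the 4×4 output feature map 460.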
  • input feature map (Feature map), input data, neuron or input neuron are used interchangeably; convolution kernel, filter or weight are used interchangeably.
  • H (height) and Y dimensions are used interchangeably, and the W (width) and X dimensions are used interchangeably.
  • the H dimension of the input feature map can be expressed as Hi or Yi
  • the H dimension of the output feature map can be expressed as Ho or Yo
  • the W dimension can be expressed similarly.
  • each convolution output point has a corresponding convolution window, and the shape of the convolution window is equal to the shape of the convolution kernel. The value of each convolution output point corresponds to the result of the multiplication and accumulation of the input feature map and the weight in its convolution window.
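  • To make the relation between a convolution output point and its convolution window concrete, the following is a minimal NumPy reference sketch of the conventional 3D convolution described above; the function name and the no-padding/no-dilation assumption are illustrative and not taken from the patent text.

```python
import numpy as np

def conv3d_reference(x, k, sx=1, sy=1):
    """Naive convolution: x is [Hi, Wi, Ci], k is [Co, Kh, Kw, Ci].
    Each output point (ho, wo, co) is the multiply-accumulate of the
    convolution window of x with the co-th kernel (no pad/dilation)."""
    Hi, Wi, Ci = x.shape
    Co, Kh, Kw, _ = k.shape
    Ho = (Hi - Kh) // sy + 1
    Wo = (Wi - Kw) // sx + 1
    y = np.zeros((Ho, Wo, Co), dtype=np.float32)
    for ho in range(Ho):
        for wo in range(Wo):
            # convolution window of this output point
            window = x[ho * sy: ho * sy + Kh, wo * sx: wo * sx + Kw, :]
            for co in range(Co):
                y[ho, wo, co] = np.sum(window * k[co])
    return y

# 6x6x3 input, 3x3x3 kernel (single Co), stride 1 -> 4x4 output, as in Fig. 4
x = np.random.rand(6, 6, 3).astype(np.float32)
k = np.random.rand(1, 3, 3, 3).astype(np.float32)
print(conv3d_reference(x, k).shape)  # (4, 4, 1)
```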
  • a computing device with a master-slave structure may be used to implement the above convolution operation.
  • different data paths can be configured for input feature maps and convolution kernels, thereby improving memory access efficiency.
  • FIG. 5 shows a schematic structural block diagram of a computing device 500 according to an embodiment of the disclosure. It can be understood that this structure can be regarded as the refinement of the internal structure of the operation module of a single processing core in FIG. 3 , or can be regarded as a functional division block diagram based on the combination of multiple operation modules of the processing core shown in FIG. 3 .
  • A computing device 500 in an embodiment of the present disclosure may be configured to perform various types of convolution operations, and may include a master processing circuit (MA) 510 and a plurality of slave processing circuits (SL) 520; 16 slave processing circuits SL0 to SL15 are shown in the figure.
  • the master processing circuit and the slave processing circuits, as well as multiple slave processing circuits, can communicate with each other through various connections.
  • The connections between the multiple slave processing circuits may be hard-wired, or may be logically configured according to, for example, micro-instructions, so as to form a variety of slave processing circuit array topologies; embodiments of the present disclosure are not limited in this regard.
  • the main processing circuit and the slave processing circuit can cooperate with each other, thereby realizing parallel operation processing.
  • the main processing circuit and the slave processing circuit may include various calculation circuits, for example, may include a vector operation unit and a matrix operation unit.
  • the vector operation unit is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit is responsible for the core calculations of deep learning algorithms, such as matrix multiplication and convolution.
  • the slave processing circuit can be used to perform intermediate operations on corresponding data in parallel according to the operation instruction to obtain multiple intermediate results, and transmit the multiple intermediate results back to the main processing circuit.
  • By setting the computing device 500 into a master-slave structure (for example, a one-master-multiple-slave structure or a multi-master-multiple-slave structure; the present disclosure is not limited in this respect), the data can be split according to the calculation instruction of the forward operation, so that multiple slave processing circuits perform parallel calculations on the computation-intensive part, thereby improving the calculation speed, saving calculation time, and reducing power consumption.
  • multiple multiplexing methods of input feature maps and weights can be supported, thereby reducing the amount of data access during operations and improving processing efficiency .
  • The computing device 500 may further include a first storage circuit 530 and a second storage circuit 540, for respectively storing data transmitted via different data channels.
  • the first storage circuit 530 can be used to store multicast data, that is, the data in the first storage circuit will be transmitted to multiple slave processing circuits through the broadcast bus, and these slave processing circuits receive the same data. It can be understood that broadcasting and multicasting can be implemented through the broadcasting bus. Multicast refers to a communication method that transmits a piece of data to multiple slave processing circuits; broadcasting is a communication method that transmits a piece of data to all slave processing circuits, which is a special case of multicast. Since both multicast and broadcast correspond to a one-to-many transmission mode, there is no special distinction between the two in this document. Broadcast and multicast can be collectively referred to as multicast, and those skilled in the art can clarify their meanings according to the context.
  • The second storage circuit 540 may be used to store distribution data, that is, the data in the second storage circuit will be transmitted to different slave processing circuits respectively, and each slave processing circuit receives different data.
  • The main processing circuit may determine one of the input feature map and the convolution kernel as multicast data and store it in the first storage circuit, so as to transmit this data to the multiple scheduled slave processing circuits.
  • the main processing circuit may determine the other of the input feature map and the convolution kernel as distribution data and store it in the second storage circuit. These distributed data can be distributed to corresponding slave processing circuits before operation.
  • FIG. 5 also shows a schematic diagram of the internal structure of the slave processing circuit SL according to an embodiment of the present disclosure.
  • each slave processing circuit 520 may include a plurality of operation circuits CU 521, a first buffer circuit 522 and a second buffer circuit 523.
  • four arithmetic circuits CU0 to CU3 are shown.
  • the number of computing circuits may be more or less depending on specific hardware configurations, and the embodiments of the present disclosure are not limited in this regard.
  • the first buffer circuit 522 may be used for buffering weights or input feature maps assigned to the slave processing circuit.
  • the second buffer circuit 523 may be used for buffering the input feature map or the weight assigned to the slave processing circuit. These two buffer circuits are used to select the data involved in the operation.
  • The data in the first buffer circuit 522 may be a plurality of data rows from, for example, the first storage circuit 530 or the second storage circuit 540; correspondingly, the data in the second buffer circuit 523 may be a plurality of data rows from, for example, the second storage circuit 540 or the first storage circuit 530. Depending on the specific multiplexing method, these data rows can be distributed to the corresponding computing circuits CU 521 or broadcast to all CUs 521 in the slave processing circuit 520 during the operation.
  • Each operation circuit CU521 is used for performing a bitwise multiply-accumulate operation on data rows selected from the first buffer circuit and data rows selected from the second buffer circuit during each calculation.
  • the slave processing circuit 520 may also include a third buffer circuit 524 for buffering the calculation results of each calculation circuit CU 521.
  • each processing circuit and storage circuit are shown as separate modules in FIG. 5 , according to different configurations, the storage circuit and the processing circuit may also be combined into one module.
  • the first storage circuit 530 can be combined with the main processing circuit 510
  • the second storage circuit 540 can be shared by multiple slave processing circuits 520, and an independent storage area is assigned to each slave processing circuit to speed up access.
  • Embodiments of the present disclosure are not limited in this regard.
  • the main processing circuit and the slave processing circuit may belong to different modules of the same processor or chip, or may belong to different processors, and the present disclosure is not limited in this respect.
  • the dimensions of the involved multidimensional data are represented by (N, H, W, C) or (Co, H, W, Ci), which represent the storage order of the data in the memory. It can be understood that although the multidimensional data has multiple dimensions, since the layout of the memory is always one-dimensional, there is a corresponding relationship between the multidimensional data and the storage order on the memory. Multidimensional data is usually allocated in continuous storage space, that is, multidimensional data can be expanded in one dimension and stored in the memory in sequence.
  • The initial input feature map may be stored sequentially in low-dimension-first order (where C/Ci is the lowest dimension); in order to optimize the convolution operation, the dimension storage order of the input feature map can be adjusted during the operation.
  • Adjacent dimensions refer to dimensions that are next to each other in the dimension information representation of multidimensional data, for example, W and Ci are adjacent, and adjacent dimensions may also be called continuous dimensions.
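  • As a small illustration of how NHWC multidimensional data maps to one-dimensional storage (standard row-major indexing, not quoted from the patent), the linear offset of element (n, h, w, c) in an N×H×W×C tensor is:

    $$ \text{offset}(n, h, w, c) = ((n \cdot H + h) \cdot W + w) \cdot C + c $$

    so C/Ci is the fastest-varying (lowest) dimension and W is adjacent to it.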
  • the main computing unit of the hardware is a vector multiply-accumulate operator.
  • Implementing support for various convolution algorithms in the hardware design essentially means extracting the multiply-accumulate operations in the algorithm to the maximum extent, and efficiently exchanging the input and output data of those multiply-accumulate operations between the on-chip RAM (such as the NRAM and WRAM in Figure 3) and the arithmetic units through the data path.
  • Hardware storage is organized line by line (cache line).
  • Read, write, and calculation operations are most efficient when whole lines are accessed with alignment. Therefore, in order to make full use of the bandwidth and to meet the memory access requirements of the arithmetic unit array, it is usually necessary to vectorize and align the data.
  • The design of artificial intelligence chips usually takes the Ci dimension as the lowest dimension, that is, the above-mentioned NHWC arrangement order, in which the data along the Ci dimension is contiguous. Vectorization alignment therefore requires the size of the Ci dimension to be aligned to a specified value, such as the alignment value M, so that memory accesses are performed in units of the alignment value M; M may also be called the maximum amount of data in a single operation of the hardware.
  • M can have different values, such as 64bit, 128bit, 256bit, 512bit, etc.
  • the size of the input port of the operator array is also related to M.
  • the input port size of the operator array is usually twice the size of M, that is, the input of the alignment value M scale is processed at one time.
  • For feature map data and weight data, when the Ci dimension of the input feature map is large, it is easier to meet the above alignment requirement.
  • When the Ci dimension of the input feature map is small, for example smaller than one cache line, the Ci dimension needs to be padded up to one full line of data (for example, 512 bits), that is, padded with invalid zeros. Such padding causes a large number of redundant calculations, wasting resources and reducing operation efficiency.
  • In view of this, embodiments of the present disclosure provide a convolution operation scheme that determines the corresponding convolution splitting scheme according to the size of the lowest storage dimension (such as Ci) of the input feature map, wherein the convolution splitting scheme at least indicates the shape of the split unit of the data to be operated on.
  • the amount of data contained in a split unit does not exceed the maximum single operation amount of the hardware.
  • the amount of data contained in a split unit can be set as the one-time processing alignment value M of the hardware, so that the calculation and processing can be performed in units of split units, which can fully utilize the computing power of the hardware and avoid or reduce invalid calculations. .
  • the data type can be Int8, Int16, Float16 or Float32
  • The split scheme of 64B×1×1 shape is called Forward64.
  • The split scheme of 16B×2×2 shape is called Forward16.
  • The split scheme of 4B×4×4 shape is called Forward4.
  • The split scheme of 4B×4×4 shape applied to the depthwise convolution operation is called Forward1.
  • The split scheme of 4B×4×4 shape applied to the reverse depthwise convolution operation is called Update1.
  • The split scheme of 4B×4×4 shape applied to the cross-product convolution operation is called Update4.
  • these splitting schemes are suitable for scenarios where channel C is relatively small in convolution calculations, so they can also be collectively referred to as small convolutions.
  • a split unit includes data of the lowest storage dimension and at least one other storage dimension, and the total data volume of a split unit does not exceed the maximum single operation of the hardware.
  • The input feature map and the convolution kernel can be split into multiple corresponding split units according to the determined convolution splitting scheme, and the dimension storage order can be converted, so that the data within one split unit is stored contiguously as one data row, which facilitates subsequent reading and processing in units of split units (data rows).
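  • The text above describes the scheme selection only qualitatively. The sketch below is a hypothetical illustration of how a splitting scheme might be chosen from the Ci size, assuming a 64-byte data row as the maximum single-operation amount and using the scheme shapes named above; the thresholds and function name are assumptions, not a rule quoted from the patent.

```python
def choose_split_scheme(ci_bytes: int):
    """Hypothetically pick a convolution splitting scheme from the size
    (in bytes) of the lowest storage dimension Ci.  Every split unit
    holds one 64-byte data row, shaped as (C bytes, H, W)."""
    if ci_bytes >= 64:
        return "Forward64", (64, 1, 1)   # large Ci: no HW folding needed
    if ci_bytes >= 16:
        return "Forward16", (16, 2, 2)   # medium Ci: fold 2x2 of HW into a row
    return "Forward4", (4, 4, 4)         # small Ci: fold 4x4 of HW into a row

print(choose_split_scheme(4))   # ('Forward4', (4, 4, 4))
```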
  • one or more split units may be read in the first reading order from the data to be operated stored in the storage order of the first dimension, in units of split units, and the read split units may be stored in On the corresponding storage circuit, the data in each split unit is stored according to the storage order of the second dimension, and the data between the split units is stored according to the storage order of the third dimension.
  • FIG. 6 shows an exemplary data storage sequence according to an embodiment of the present disclosure.
  • 610 represents the storage method of the four-dimensional tensor to be calculated, including N three-dimensional sub-tensors, and N is in the highest dimension, that is, the storage order of the first dimension of the four-dimensional tensor is NHWC.
  • H and Y, W and X are used interchangeably herein.
  • Each subtensor is divided into smaller data blocks or split units, and the number of data blocks in each dimension is C/Y/X respectively.
  • the diagram 620 in the middle shows the storage method of each sub-tensor, and each data block is stored as a continuous 64Byte, that is, one row.
  • the order between rows changes accordingly.
  • The data blocks are read in the C direction first, then X, and finally Y; that is, the first reading order is YXC, and the rows are stored in the order Y*X*C, so the third dimension storage order is YXC or HWC.
  • the third dimension is stored in the same order as the first dimension. It can be understood that other reading orders may also be used, resulting in the storage order of the third dimension being different from that of the first dimension, which will not be listed here.
  • the diagram 630 on the right shows the order in each row, that is, the order of data in each data block, and its shape is blockC*blockY*blockX. At this time, the storage order of the second dimension is CYX or CHW.
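  • A minimal NumPy sketch of this layout conversion is given below, assuming an HWC tensor whose H and W are multiples of the block sizes and whose C is a multiple of the block channel size; the function name and the example shapes are illustrative only.

```python
import numpy as np

def to_block_layout(x, bc, by, bx):
    """Convert an HWC tensor into the blocked layout of Fig. 6:
    rows (split units) are ordered Y*X*C (third dimension order YXC),
    and data inside each row is ordered CHW (second dimension order)."""
    H, W, C = x.shape
    assert H % by == 0 and W % bx == 0 and C % bc == 0
    # split every dimension into (block index, offset inside the block)
    t = x.reshape(H // by, by, W // bx, bx, C // bc, bc)
    # outer order: Yblk, Xblk, Cblk ; inner order: c, y, x (CHW inside a row)
    t = t.transpose(0, 2, 4, 5, 1, 3)
    # each row is one contiguous split unit of bc*by*bx elements
    return t.reshape(-1, bc * by * bx)

x = np.arange(8 * 8 * 4, dtype=np.uint8).reshape(8, 8, 4)
rows = to_block_layout(x, bc=4, by=4, bx=4)   # Forward4-style 4x4x4 blocks
print(rows.shape)                             # (4, 64): 4 rows of 64 bytes
```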
  • the small convolution adopts the block form. Compared with the traditional convolution, the advantage is that the alignment in the Ci direction only needs to satisfy the alignment of the block in the Ci direction.
  • the weight (co*Kh*kw*ci) is generally small, Kh and Kw are usually single digits, and co and ci are similar.
  • the storage space of the second storage circuit (such as the WRAM 332 in FIG. 3) is larger than that of the first storage circuit (such as the NRAM 331 in FIG. 3).
  • the convolution operation principle described above it can be known that the operation results on the Co dimension (the depth convolution is the C dimension) do not need to be accumulated, so the operation allocation on different Co can be carried out relatively independently on different operation circuits.
  • the size of the Co dimension of the output channel of the convolution kernel in a single round of operation does not exceed the number of scheduled slave processing circuits, so the operation of a single Co needs to be completed by one or more slave processing circuits.
  • When the Co dimension is large, it can be handled by splitting into multiple rounds of operation, wherein the number of Co values processed in each round does not exceed the number of scheduled slave processing circuits.
  • Accordingly, the number of calculation rounds required to complete the convolution operation, and the number of Co values processed in each round or the corresponding grouping mode, can be determined.
  • the convolution kernel is multiplexed on Rs SLs in the same SLB, and Rs represents the number of times the convolution kernel is multiplexed between slave processing circuits.
  • Factors such as the limitation of the hardware buffer space (for example, the sizes of the first buffer circuit and the second buffer circuit in Figure 5) can be considered to determine the maximum number of convolution kernel multiplexing times rs and the maximum number of input feature map multiplexing times rn applicable within a single slave processing circuit.
  • For the time being, the situation in which one slave processing circuit processes multiple Co values in a single round of operation is not considered; only the case in which one or more slave processing circuits each handle a single Co value in a single round of operation is considered.
  • Different grouping modes can be used according to the number of slave processing circuits SL that process the same Co value in a single round of operation. It can be understood that it is preferable to distribute the schedulable slave processing circuits SL evenly, so as to balance the computing power; for example, in groups of 2 SLs so that 16 SLs can process 8 Co values at the same time, or in groups of 4 SLs so that 16 SLs can process 4 Co values simultaneously, and so on.
  • the second storage circuit WRAM has 16 storage areas, which are allocated to the 16 slave processing circuits SL respectively. Further, every 4 blocks can be combined into a storage block, which is assigned to the corresponding slave processing circuit group SLB.
  • the following grouping modes can be selected: Group1 mode, Group4 mode and Group16 mode.
  • Other grouping modes can refer to the three representative grouping modes given herein for corresponding processing.
  • the above grouping mode can be uniformly expressed as GroupN, representing that all slave processing circuits SL scheduled in the current round of operations are divided into N groups, each slave processing circuit group SLB processes the same Co value, and different slave processing circuit groups SLB handles different Co values.
  • N can be 1, 4, or 16, corresponding to Group1, Group4, and Group16 above.
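  • A hedged sketch of this GroupN bookkeeping is shown below: it merely assigns the 16 slave processing circuits to groups (SLBs), where every SL in a group handles the same Co value in a round; the function name is illustrative and assumes the contiguous SL grouping used in Figures 7a-7d.

```python
def group_assignment(num_sl=16, n_groups=4):
    """GroupN mode: num_sl slave processing circuits are divided into
    n_groups groups (SLBs); all SLs of one group process the same Co."""
    per_group = num_sl // n_groups
    return {g: list(range(g * per_group, (g + 1) * per_group))
            for g in range(n_groups)}

print(group_assignment(n_groups=1))    # Group1 : {0: [SL0 .. SL15]}
print(group_assignment(n_groups=4))    # Group4 : {0: [SL0..SL3], 1: [SL4..SL7], ...}
print(group_assignment(n_groups=16))   # Group16: one SL per group
```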
  • Figures 7a-7d illustrate several exemplary grouping schemes according to embodiments of the present disclosure.
  • Figure 7a shows a Group1 mode
  • Figure 7b shows a Group16 mode
  • Figure 7c shows a Group4 mode
  • Figure 7d shows another Group4 mode.
  • the Group1 mode means that all 16 schedulable SLs belong to one group and jointly process one Co value, for example, SL0-SL15 belong to group G0. Thus, operations for this one output channel are distributed over 16 SLs.
  • priority can be given to broadcasting the convolution kernel 720 of the output channel to each SL, and the input feature map 710 is split and distributed to each SL, thereby improving memory access efficiency.
  • the convolution kernel can be stored in the first storage circuit 530 in FIG. 5 for transmission using a broadcast channel.
  • the input feature map can be divided according to the XY direction of the output feature map and stored in the second storage circuit 540 to be allocated to different SLs.
  • all SLs jointly compute an output feature map of Co.
  • the Group16 mode means that all 16 schedulable SLs are divided into 16 groups, that is, each group has one SL, and each SL handles a different Co value.
  • SL0 belongs to group G0
  • SL1 belongs to group G1
  • SL15 belongs to group G15.
  • The same input feature map 730 can be reused among the 16 SLs, so priority can be given to broadcasting the input feature map 730 to each SL, while the convolution kernels 740 corresponding to the different Co values are distributed to the corresponding SLs.
  • 16 copies of the input feature map may be copied and stored in 16 storage areas allocated to the 16 slave processing circuits on the second storage circuit.
  • the convolution kernel is divided according to Co, one SL corresponds to one Co, and 16 Cos are processed at a time, stored in the first storage circuit, and distributed to different SLs in a unicast manner.
  • all SLs compute output feature maps of different Co for the same input feature map.
  • the Group4 mode means that all 16 schedulable SLs are divided into 4 groups, and each group processes a Co value.
  • SL0-SL3 belong to group G0
  • SL4-SL7 belong to group G1
  • SL8-SL11 belong to group G2
  • SL12-SL15 belong to group G3.
  • This mode is between Group1 and Group16, so either the convolution kernel or the input feature map can be determined as multicast data, while the other can be determined as distribution data.
  • the convolution kernels can be divided into 4 groups according to Co, and stored in the first storage circuit 530 in FIG. 5 , so as to be transmitted through a broadcast channel.
  • the input feature map can be divided into 4 parts according to the XY direction of the output feature map, copied into 4 parts, stored in the second storage circuit 540, and distributed to the 4 SLBs.
  • Each SLB obtains the same input feature map, and then distributes it to the 4 SLs in the SLB according to the 4 divided parts.
  • all SLs in each SLB jointly compute the output feature map of a Co, and the 4 SLBs process a different Co respectively.
  • In some embodiments, the convolution kernels are divided into 4 groups by assigning Co values to the groups at an interval of 1.
  • For example, when Co is 12, the four groups of Co values are {0, 4, 8}, {1, 5, 9}, {2, 6, 10} and {3, 7, 11} respectively.
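  • This interval-1 (round-robin) division can be written out directly; the small sketch below reproduces the Co = 12 example above (the helper name is illustrative).

```python
def split_co_round_robin(co_total=12, n_groups=4):
    """Divide Co values among n_groups groups at an interval of 1."""
    return [[co for co in range(co_total) if co % n_groups == g]
            for g in range(n_groups)]

print(split_co_round_robin())  # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
```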
  • the neurons can be stored in the second storage circuit WRAM, and the weights can be stored in the first storage circuit NRAM.
  • the input feature map needs to be split between these multiple SLs.
  • the Group1 grouping mode needs to split the input feature map into 16 parts.
  • the Group4 grouping mode needs to split the input feature map into 4 parts.
  • The input feature map may be divided among the Rs slave processing circuits SL included in each slave processing circuit group as follows: according to the size of the corresponding output feature map, the output feature map is divided evenly in the XY dimension (that is, the Ho/Wo dimension) into Rs output feature blocks of the same shape; and according to the input feature map region required to calculate each output feature block, the input feature map is divided in the XY dimension (that is, the Hi/Wi dimension) into Rs input feature blocks to be distributed to the Rs slave processing circuits. It can be understood that, depending on the size of the convolution kernel and the convolution stride, the input feature map regions corresponding to adjacent output points of the output feature map may overlap.
  • Fig. 8 shows an exemplary split diagram of an input feature map according to an embodiment of the present disclosure.
  • the input feature map is divided into 16 parts and distributed on 16 SLs, corresponding to the Group1 mode.
  • the 16 output feature blocks can be mapped to the input feature map 820 to obtain the 16 input feature map regions required to calculate the 16 output feature blocks respectively, which also divides the input feature map in the XY direction.
  • These 16 input feature map regions can be assigned to 16 slave processing circuits SL accordingly.
  • The input feature map will be split in units of split units according to the determined convolution splitting scheme. Therefore, in the above embodiment, the blocking of the input feature map should make the size of each divided input feature block in the XY direction a multiple of the split unit's dimensions in the XY direction, that is, each block should be aligned to the split unit in the XY direction. For example, when a 4×4×4 convolution splitting scheme is chosen, each input feature block is aligned by 4×4; when a 16×2×2 convolution splitting scheme is chosen, each input feature block is aligned by 2×2.
  • When the output feature map is not aligned to the split unit (such as 4×4 or 2×2), the input feature map needs to be padded accordingly (for example with zeros), so that the actually computed output XY is aligned to the split unit (e.g. 4×4 or 2×2) and the input XY is also aligned to the split unit (e.g. 4×4 or 2×2). A sketch of this output-to-input mapping follows.
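  • The sketch below is a hedged illustration of mapping one output feature block back to the input region required to compute it, then rounding that region up to the split-unit alignment; the function name and the zero-padding convention are assumptions for illustration, not taken from the patent.

```python
import math

def input_region_for_output_block(oy0, oy1, ox0, ox1,
                                  kh, kw, sy=1, sx=1, align=4):
    """Map an output block [oy0, oy1) x [ox0, ox1) back to the input
    region needed to compute it, then align the region size to the
    split unit (e.g. 4x4 for Forward4).  Overlap between neighbouring
    blocks appears naturally when the kernel is larger than the stride."""
    iy0, ix0 = oy0 * sy, ox0 * sx
    iy1 = (oy1 - 1) * sy + kh            # exclusive end row in the input
    ix1 = (ox1 - 1) * sx + kw            # exclusive end column in the input
    h = math.ceil((iy1 - iy0) / align) * align   # pad (e.g. with zeros) to align
    w = math.ceil((ix1 - ix0) / align) * align
    return iy0, ix0, h, w

# one of 16 output blocks of a 16x16 output map, 3x3 kernel, stride 1
print(input_region_for_output_block(0, 4, 0, 4, kh=3, kw=3))  # (0, 0, 8, 8)
```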
  • the output feature map can also be divided according to other rules in the XY direction, for example, divided into 16 output feature blocks with the same shape according to 1 ⁇ 16, and assigned to SL0-SL15 respectively.
  • Embodiments of the present disclosure are not limited in this regard.
  • This splitting method can also be applied to splitting in other scenarios, for example, splitting between the computing circuits CU within a single slave processing circuit SL; the embodiments of the present disclosure are not limited in this respect.
  • Multiple slave processing circuits can be scheduled to perform convolution operations on the corresponding data rows of the input feature map and the convolution kernel, and then, according to the convolution splitting scheme, the plurality of operation results returned by the slave processing circuits are spliced together to obtain the output feature map of the convolution of the input feature map with the convolution kernel.
  • a plurality of operation circuits CU and each buffer circuit (see FIG. 5 ) in the slave processing circuit can be used to perform a specific convolution operation process.
  • multiple computing cycles are generally required to complete the required computing in each round of computing.
  • Each output feature block corresponds to the single-pass computation capability of all NCU schedulable operation circuits in a single SL (NCU*Nop output points).
  • the output feature map can be divided into output feature blocks according to the alignment of 16 output points in the XoYo dimension, and each output feature block can be calculated one by one. It can be understood that the 16 output points may be in a 4*4 format, or may be in a 1*16 format, which is not limited in the embodiment of the present disclosure.
  • The output points of the output feature block can be further divided among the NCU operation circuits, so as to determine the processing object of each operation circuit. Then, according to this division of output points and using the split unit as a sliding window, NCU input feature data rows are selected from the first buffer circuit and distributed to the NCU computing circuits, and the corresponding weight data is selected from the second buffer circuit and broadcast to the NCU computing circuits, so that the output points corresponding to multiple sliding windows can be computed in parallel by multiplexing the weight data. Nk sliding selections are performed, where Nk is determined according to the smaller of the convolution kernel size in the X and Y dimensions and the maximum convolution kernel size supported by a single operation of the slave processing circuit in the current convolution split mode.
  • When performing a conventional three-dimensional convolution operation, the corresponding weight data can be selected as follows: 1/Nop of a weight data row is selected from the second buffer circuit in a sliding manner corresponding to that in the first buffer circuit, copied Nop-1 times to expand it into one extended weight row, and broadcast to the NCU computing circuits in the slave processing circuit.
  • During each sliding calculation, each operation circuit performs a bit-wise multiply-accumulate, in units of 1/Nop of a data row, between one input feature row from the first buffer circuit and one extended weight row from the second buffer circuit, obtaining Nop partial sums; the Nk*Nop partial sums obtained over the Nk sliding selections are then accumulated according to the convolution output points to which they belong, to obtain and output Nop operation results.
  • the slave processing circuit When the slave processing circuit outputs the output points of its internal operation circuit, it can output the output points calculated by multiple operation circuits in it in a specific order according to the division method of the output points, so that the output points of continuous output are in X and/or Y Dimensionally continuous, convenient for subsequent processing.
  • the main processing circuit may further store the operation results returned from each slave processing circuit in a fourth dimension storage order. According to the situation, the main processing circuit can also convert the operation result into a desired dimension storage sequence for storage.
  • In the Forward4 scheme, the shape of the split unit (block) is 4B×4×4. Depending on the data type, the exact shape of the block differs slightly; Table 2 shows the block shapes of Forward4 under the different data types.
  • Fig. 9 shows a schematic diagram of splitting and storage of the Forward4 scheme according to an embodiment of the present disclosure.
  • the example in the figure assumes the data type is Int8.
  • the figure shows the original data to be operated (which may be neurons or weights), and its storage order is HWC.
  • the Forward4 scheme can support multiple grouping modes.
  • In the Group1 grouping mode, the layout of the input neurons varies according to the HoWo splitting method:
  • The 16 means the 16 slave processing circuits SL.
  • The trailing 4*4*4 (CHW) means the CHW block split from the three dimensions; hi and wi are each divided by 4 twice: the first division by 4 means splitting hi*wi into 16 parts and distributing them to the 16 SLs, and the second division by 4 means folding hi and wi into the ci direction.
  • The meaning of the 1*16 splitting is the same.
  • In the Group4 grouping mode, the layout of the input neurons likewise varies according to the HoWo splitting method:
  • The first 4 at the front means the 4 SLBs; that is, the neurons have been copied into 4 copies.
  • The second 4 means that the neurons are split over the 4 SLs of one SLB.
  • The last 4*4*4 means the CHW block split from the three dimensions.
  • In the Group16 grouping mode, the input neurons do not need to be split, and their layout is as follows:
  • The leading 16 means that the neurons are replicated on the 16 SLs, and the last 4*4*4 means the CHW block split from the three dimensions; hi and wi are each divided by 4, which means folding hi and wi into the ci direction.
  • Fig. 10 shows a schematic diagram of assigning interval output points to each operation circuit in the Forward4 scheme according to some embodiments of the present disclosure.
  • The output feature block can be equally divided into Nop output feature sub-blocks of the same shape among the NCU computing circuits; each output feature sub-block includes NCU output points, which are assigned to the NCU computing circuits respectively.
  • Within each output feature sub-block, the 2*2 output points are allocated to the 4 operation circuits.
  • each arithmetic circuit calculates one output point in each of the four output feature sub-blocks.
  • different backgrounds are used to show the output points assigned to four different arithmetic circuits CU0-CU3. It can be seen from the figure that each calculation circuit calculates a plurality of output points spaced in X and/or Y dimensions on the output feature map during each calculation.
  • According to the output point positions of each output feature sub-block, and based on the data required for calculating that output feature sub-block, NCU data rows can be correspondingly selected from the first buffer circuit for the operation. For example, when selecting input feature data for the first time, according to the 4 input feature blocks required to calculate the 4 output points in the first output feature sub-block 1011, 4 input data rows are selected from the corresponding input feature blocks and distributed to the 4 arithmetic circuits. It can be understood that, since the four output points are continuous in the X and/or Y direction, the interval (step size) in the X and/or Y direction between the four input data rows selected at the same time is 1.
  • the corresponding weight data can be selected from the second buffer circuit and broadcast to NCU computing circuits, so as to achieve parallel calculation of output points corresponding to multiple computing circuits by multiplexing the weight data .
  • weight multiplexing can be performed in a single input data row , thus computing Nop output points or partial sums simultaneously.
  • the extended weight value row can also be broadcast to N CU computing circuits, so that while multiplexing the weights among multiple computing circuits, a smaller granularity (for example, 1 /Nop line) to reuse weights.
  • N CU *Nop output points or partial sums can be calculated each time by correspondingly taking N CU input feature data rows and taking 1/Nop weight value rows to copy and expand into 1 weight value row.
  • When the calculation result is a partial sum, partial sums can be computed over multiple sliding passes, and the partial sums from each pass are accumulated according to the output points to which they belong to obtain the final result.
  • the number of slides and the slide step of the convolution operation can be determined.
  • the maximum convolution kernel size supported by a single operation of the processing circuit is at least determined by the space sizes of the first buffer circuit and the second buffer circuit. It can be understood that when the convolution kernel exceeds the maximum convolution kernel size, it needs to be split in the Kx and Ky directions according to the maximum convolution kernel size.
  • Fig. 11 shows a schematic diagram of a single operation process in the Forward4 scheme according to an embodiment of the present disclosure.
  • The size of the first buffer circuit 1110 is 3×3×64B, that is, at most 9 rows of data can be buffered.
  • The size of the second buffer circuit 1120 is 2×2×64B, that is, at most 4 rows of data can be buffered.
  • the storage in the buffer circuit in the figure is also shown in the split unit.
  • the figure shows the operation process of the first sliding fetch.
  • using the split unit as a sliding window, NCU input feature rows are slidingly selected from the first buffer circuit and sent to the NCU operation circuits for calculation.
  • from the second buffer circuit, 1/Nop of a weight row is selected in a manner corresponding to the sliding in the first buffer circuit, where Nop is the maximum number of convolution output points that each operation circuit can calculate at a time; it is copied Nop−1 times and expanded into one extended weight row, which is broadcast to the NCU operation circuits in the slave processing circuit.
  • each operation circuit calculates 2 ⁇ 2 output points with an interval of 1 in the X and Y dimensions for each calculation.
  • one input feature data row is selected from the first buffer circuit 1110 at the initial position and at each position shifted by 1 in the X and/or Y direction, for a total of four input feature data rows, which are correspondingly sent to the four operation circuits 1140 in the slave processing circuit SL.
  • a 1/4 weight data row is selected at the starting position from the second buffer circuit 1120, that is, data of 2×2 size; it is copied 3 times and expanded into an extended weight data row 1130, which is broadcast to the 4 operation circuits 1140 inside the SL.
  • each operation circuit performs element-wise multiplication and accumulation, in units of 1/Nop of a data row, on one input feature row from the first buffer circuit and one extended weight row from the second buffer circuit, obtaining Nop partial sums.
  • the four operation circuits 1140 perform element-wise multiply-accumulate operations on the distributed input feature data rows and the broadcast extended weight data row to obtain the computation result 1150.
  • the different background colors in 1150 represent the results obtained by the different operation circuits 1140. It can be seen that in each operation, one CU calculates the partial sums of 4 output points, so the 4 CUs obtain a total of 4×4 partial sums. It can also be seen that the output points calculated by each CU are not adjacent in the Xo/Yo dimensions of the output feature map.
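  • As a minimal numerical sketch of this first fetch (a numpy model for illustration only, not the hardware implementation; the 64B row size, the toy integer data and the variable names are assumptions drawn from this example), each of the 4 CUs obtains its 4 partial sums from one input feature row and the broadcast extended weight row as follows:

```python
import numpy as np

# Hedged sketch of one fetch in the Forward4 scheme (assumed sizes:
# 64 B data rows, 4x4x4 split units, Nop = 4, NCU = 4).
NCU, NOP, ROW = 4, 4, 64
UNIT = ROW // NOP                      # 16 B: one 4x2x2 (Ci x Y x X) unit

rng = np.random.default_rng(0)
feature_rows = rng.integers(-8, 8, size=(NCU, ROW))   # one input row per CU
weight_quarter = rng.integers(-8, 8, size=UNIT)       # 1/Nop of a weight row

# The 1/Nop weight row is copied Nop-1 more times into one extended
# weight row, which is broadcast to all CUs.
extended_weights = np.tile(weight_quarter, NOP)

# Each CU multiplies element-wise and accumulates in units of 1/Nop of a
# row, producing Nop partial sums (one per output point).
partials = (feature_rows * extended_weights).reshape(NCU, NOP, UNIT).sum(axis=-1)
print(partials.shape)                  # (4, 4): NCU x Nop partial sums per fetch
```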
  • the data is then fetched synchronously by sliding, and the next calculation is performed.
  • the number of slidings is Nk = ceil(Kx/2) * ceil(Ky/2),
  • where Kx and Ky are, in the X and Y dimensions respectively, the smaller of the convolution kernel size and the maximum convolution kernel size supported by a single operation of the slave processing circuit in the current convolution split mode.
  • the operation circuit accumulates the Nk*Nop partial sums calculated in the Nk sliding calculations according to the convolution output points to which they belong, obtaining Nop operation results.
  • the maximum convolution kernel size supported by a single operation of the processing circuit is 8 ⁇ 8.
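  • The sliding-count rule above can be checked with a short sketch (the function name is illustrative; the 2×2 patch per sliding and the 8×8 cap are taken from this example):

```python
import math

# Hedged sketch of the sliding count rule Nk = ceil(Kx/2) * ceil(Ky/2).
def sliding_count(kx, ky, max_k=8):
    # Kernel extent handled in one operation round: the smaller of the
    # actual kernel size and the maximum size supported by a single operation.
    ex, ey = min(kx, max_k), min(ky, max_k)
    # Each sliding covers a 2x2 (Y x X) patch of kernel weights.
    return math.ceil(ex / 2) * math.ceil(ey / 2)

print(sliding_count(5, 5))   # 9 slidings for a 5x5 kernel (cf. Fig. 12)
print(sliding_count(3, 3))   # 4 slidings for a 3x3 kernel
```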
  • Fig. 12 shows a schematic diagram of a sliding convolution process in the Forward4 scheme according to an embodiment of the present disclosure.
  • this example uses a 9×9 input feature map and a 5×5 convolution kernel; with a convolution stride of 1, the output feature map size is 5×5.
  • the input feature map needs to be aligned to 12×12, divided into 9 blocks of 4×4×4 (C×H×W) size, and stored in the first buffer circuit, shown as 1210 in the figure, where the C dimension is omitted.
  • the 5×5 convolution kernel needs to be aligned to 8×8, with the aligned part filled with 0, and stored in the second buffer circuit, shown as 1220 in the figure, where the C dimension is also omitted.
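  • The shape bookkeeping of this example can be reproduced with a brief sketch (the align_up helper is an assumption for illustration; the alignment values 4 and 8 follow the text above):

```python
# Hedged shape bookkeeping for the Fig. 12 example.
def align_up(x, m):
    return (x + m - 1) // m * m

hi = wi = 9            # 9x9 input feature map
kx = ky = 5            # 5x5 convolution kernel, stride 1, no padding
ho, wo = hi - ky + 1, wi - kx + 1                # 5x5 output feature map
hi_a, wi_a = align_up(hi, 4), align_up(wi, 4)    # input aligned to 12x12
blocks = (hi_a // 4) * (wi_a // 4)               # 9 blocks of 4x4x4 (CxHxW)
kx_a, ky_a = align_up(kx, 8), align_up(ky, 8)    # kernel zero-padded to 8x8
print(ho, wo, hi_a, wi_a, blocks, kx_a, ky_a)    # 5 5 12 12 9 8 8
```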
  • the copy operation can be realized by hardware.
  • Figure 12 shows the selection ranges of the input feature map in the first buffer circuit and of the convolution kernel in the second buffer circuit for each sliding, with a total of 9 sub-figures representing 9 slidings.
  • block 1210 represents the input feature map in the first buffer circuit, and the four dotted-line boxes represent the areas selected to be sent to four CUs;
  • block 1220 represents the convolution kernel in the second buffer circuit, and the dotted-line box represents the selected 1/4 row, which is copied 3 times and expanded into one row and then broadcast to the 4 CUs.
  • in each calculation, each CU performs element-wise multiplication and accumulation, in units of 1/4 of a data row, on one input feature data row from the first buffer circuit and one extended weight data row from the second buffer circuit, obtaining 4 partial sums; it then accumulates the Nk partial sums obtained for the same convolution output point across the Nk calculations of the current operation round, to obtain and output 4 operation results.
  • in each single calculation, each output point corresponds to a standard convolution over 4×2×2 (Ci×Y×X) data.
  • after the accumulation is completed in the Y×X directions, a complete 4×4 (Y×X) output is obtained in one SL (as shown in Fig. 10b).
  • a single calculation only supports the case where the convolution kernel is not larger than 8 ⁇ 8.
  • each slave processing circuit SL can convert the operation result of its internal operation circuit CU into a specified format, for example, the format of Nco ⁇ Uy ⁇ Ux.
  • each slave processing circuit may each time output partial operation results of some of its internal operation circuits, and these partial operation results are contiguous in the X and/or Y dimension of the output feature map.
  • the main processing circuit may further store the operation results returned from each slave processing circuit in a fourth dimension storage order. As required, the main processing circuit can also convert the operation results into a desired dimension storage order for storage.
  • depending on the grouping mode, the output data format is slightly different.
  • Fig. 13 shows a schematic diagram of an output data format of the Forward4 scheme according to an embodiment of the present disclosure.
  • assume the grouping mode is Group1.
  • each SL outputs a 1×1×4 (Co×Y×X) area each time, that is, each time it outputs part of the operation results of some of its internal operation circuits, for example 2 operation results from each of 2 CUs (see FIG. 10); these partial results are contiguous in the X and/or Y dimension of the output feature map, for example in the same row (as shown in FIG. 13) or the same column.
  • by returning 4 times in succession, a 1×4×4 (Co×Y×X) area is output, that is, the 4 operation results of each of the 4 CUs.
  • different SLs output different regions of the output feature map for the same Co. After all 4×4 areas of this Co have been output, continuing to output switches to different output points.
  • 1320 in the figure shows the written-out data structure of the 16 SLs.
  • after being written into the storage circuit (for example, the first storage circuit), the final output data takes the format Yo*Xo*Co*4*16*4, where Yo and Xo are the numbers of blocks into which the output feature map handled by each SL is divided, and 16 is the division over the 16 SLs.
  • data rearrangement operations can be performed afterwards to convert the data into other desired formats.
  • in other grouping modes, the output data format is also slightly different. Assuming the original output size is:
  • (4*16*4) is the basic output block of Forward4, whose directions correspond to h*c*w respectively, where 16 represents the division of ho and wo of the same co over the 16 SLs; ho and wo are each divided by 4 twice, where the first 4 indicates the 4×4 splitting when data is stored in an SL, and the second 4 indicates the data block folding in the h and w directions.
  • This shape is also the shape of the schematic diagram in FIG. 19 .
  • the Group4 output data shape is:
  • (4*16*4) has the same meaning as above, except that 16 represents the wo output division of 4 cos on 4 SLs.
  • the Group16 output data shape is:
  • (4*16*4) has the same meaning as above, except that 16 represents the output division of 16 COs on 16 SLs.
  • when outputting, the hardware can automatically write out the neurons according to the dimension order 4*16*4 (Y*SL*X) within a row and the dimension order Y*X*C between rows. The same holds for larger convolution kernels.
  • the bias (Bias) is the offset applied after the convolution calculation.
  • the original format of the bias is [1*1*co].
  • in the blocked bias formats, a 64 means that a single offset is copied 64 times and placed contiguously.
  • a 16 means that a single offset is copied 16 times and placed contiguously.
  • a 4 means that a single offset is copied 4 times and placed contiguously.
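  • A minimal sketch of this replication (the toy bias values are arbitrary; which replication factor applies to which output layout is not restated here) is:

```python
import numpy as np

# Hedged sketch: each single offset of the [1x1xco] bias is copied a given
# number of times and placed contiguously.
bias = np.array([0.5, -1.0, 2.0])          # toy bias, one value per co
bias_x64 = np.repeat(bias, 64)             # each offset copied 64 times
bias_x16 = np.repeat(bias, 16)             # each offset copied 16 times
bias_x4  = np.repeat(bias, 4)              # each offset copied 4 times
print(bias_x64.shape, bias_x16.shape, bias_x4.shape)   # (192,) (48,) (12,)
```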
  • FIG. 14 shows an overall data moving process according to an embodiment of the present disclosure.
  • weights are read from off-chip storage, such as DDR, into SRAM via a global direct memory access module (GDMA).
  • the transfer process of neurons is similar to that of weights, except that after transferring to NRAM through block instructions, it also needs to be transferred to WRAM.
  • during the neuron computation, as the convolution kernel slides, most of the data overlaps, which greatly reduces the efficiency of data handling.
  • the img2col instruction is therefore used to distribute the data; details are described later.
  • the output data can be stored back to NRAM, its dimension change can likewise be completed through block instructions, and it can then be transferred to SRAM and stored back to the off-chip storage DDR via GDMA.
  • data dimension change and rearrangement refer to the process of arranging tensor data of a given shape into another required shape.
  • Data movement refers to the read and write operations of data in different memory spaces.
  • the Forward4 convolution operation scheme requires that the neurons and weights used for convolution operations be placed and aligned according to a specific block pattern.
  • the output data is likewise produced in the specific output format of Forward4, which requires the tensor data to be arranged in block form before calculation and to be restored to the normal tensor shape as required after the calculation is completed.
  • the Deform instruction series provides data shape transformation and data type conversion capabilities for the IO data path, mainly including functions such as TRANS (transposition), MOVE (transportation), and ROTATE (rotation).
  • the mode that implements the transpose function in this instruction series is named Trans Tiling, which mainly provides performance support for various shape transformations of small convolutions.
  • Deform divides a 3-dimensional data block into inner and outer layers.
  • the inner layer has three dimensions (corresponding to the parameters n0–n2 in the instruction).
  • the unit of the lowest inner dimension is the byte, while the second-lowest and highest inner dimensions are unitless and represent counts of the next lower dimension.
  • the outer layer also has three dimensions (corresponding to the parameters n3–n5 in the instruction), all of which represent multiples of the corresponding inner-layer dimensions.
  • the input data stored in the first dimension storage order (such as HWC) needs to be split, dimension-converted and stored in units of split units, where each split unit is stored according to the second dimension storage order (such as CHW) and the split units themselves are stored according to the third dimension storage order (such as HWC).
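  • A minimal numpy sketch of these three storage orders, assuming a 4×4×4 (UH×UW×UCi) split unit and toy sizes that divide evenly, is:

```python
import numpy as np

# Hedged sketch: HWC input cut into 4x4x4 split units; each unit re-ordered
# to C x H x W, while the units themselves remain in HWC order over the grid.
UH = UW = UC = 4
H, W, C = 8, 8, 8

x = np.arange(H * W * C).reshape(H, W, C)            # first order: H x W x C
units = x.reshape(H // UH, UH, W // UW, UW, C // UC, UC)
y = units.transpose(0, 2, 4, 5, 1, 3)                # (H/4, W/4, C/4, UC, UH, UW)
print(y.shape)                                        # (2, 2, 2, 4, 4, 4)
```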
  • Figure 15 shows a schematic conceptual diagram of Trans Tiling according to an embodiment of the present disclosure.
  • the left panel in the figure shows the input data before deformation.
  • the three-dimensional input data is described by six dimensions, n0 and n3 correspond to the first dimension (such as the lowest dimension) of the original three-dimensional data, n1 and n4 correspond to the second dimension (such as the second-lowest dimension) of the original three-dimensional data, n2 and n5 correspond to the third dimension (for example, the highest dimension) of the data block.
  • the right panel in the figure shows the output data after deformation. Three-dimensional output data is also described using six dimensions.
  • the inner layer of the output data corresponds to the deformed splitting unit.
  • Trans Tiling also has the function of inline shuffle, including inline shuffle before Tiling based on Pretable and inline shuffle after Tiling based on Posttable.
  • the pre-allocation table provides the function of rearranging the n0 data at the input of Tiling.
  • the post-allocation table provides the function of rearranging the n0 data at the output of Tiling. Leaving aside the flag bits, both the pre-allocation table and the post-allocation table are essentially arrays describing the positions of 64 bytes of data.
  • Fig. 16 shows a schematic diagram of the pre- and post-allocation tables.
  • the pre- and post-allocation tables respectively indicate the rearranged positions of one row of data in dimension n0 of the input or output, which comprises 64B.
  • the 8 bits of each table byte include a 6-bit index field, which records which of the 0–63 bytes of the original data is stored at this byte position; a 1-bit zero_en field indicating whether to force the value to 0 (if this bit is 1, a 0 is written and the index bits [5:0] are invalid); and a 1-bit mask field indicating whether the data at this position is valid.
  • the data of n0 of the input data of the block instruction can be rearranged when needed, and/or the data of n0 of the output data of the block instruction can be rearranged.
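  • A hedged sketch of applying such a 64-entry table to one 64B row of n0 data is shown below; the 6-bit index, zero_en and mask fields follow the description above, while the exact bit positions and the mask polarity are assumptions for illustration:

```python
import numpy as np

# Hedged sketch: apply a 64-entry allocation table to one 64 B row.
def apply_table(row64, table64):
    out = np.zeros(64, dtype=row64.dtype)
    for dst, entry in enumerate(table64):
        index = entry & 0x3F            # which of the 64 source bytes to take
        zero_en = (entry >> 6) & 1      # force this byte to zero
        mask = (entry >> 7) & 1         # whether this byte is valid
        if not mask:
            continue                    # invalid position: leave as zero
        out[dst] = 0 if zero_en else row64[index]
    return out

row = np.arange(64, dtype=np.uint8)
identity = np.array([0x80 | i for i in range(64)])   # mask=1, zero_en=0, index=i
assert np.array_equal(apply_table(row, identity), row)
```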
  • Table 4 shows the meaning of each parameter of the block instruction. The bit width of the data to be divided into blocks is denoted dwidth, in units of B (bytes), and the data size of one atomic operation of the block instruction is called the block bit width T, also in units of B (bytes).
  • a total of 11 parameters, n0–n5 and s1–s5, are required to describe the tensor shapes of the inner-layer and outer-layer data, among which n0–n2 and s1–s2 describe the inner layer, and n3–n5 and s3–s5 describe the outer layer.
  • the input tensor and the output tensor each need such a set of parameters, giving a total of 22 parameters: in0–in5, is1–is5, on0–on5 and os1–os5.
  • the block instruction can support various block bit widths T, such as 1B, 2B, 4B, 6B, 8B, 16B, 32B, etc., and the corresponding value can be set based on different block tasks. Therefore, the block instruction also includes the parameter of block bit width T.
  • a data processing device including a control circuit, a first storage circuit, and a second storage circuit.
  • the first storage circuit is used for storing data before executing the block instruction; the second storage circuit is used for storing data after executing the block instruction.
  • the control circuit is used to configure and execute the block instruction.
  • the data processing device may be, for example, a processor cluster in the multi-core computing device shown in FIG.
  • the first storage circuit is, for example, the shared storage SRAM, while the second storage circuit is, for example, the NRAM within a processor core.
  • the function of the block instruction is, during the transfer of the input neurons from, for example, SRAM to NRAM, to split the input neurons stored in the first dimension storage order (such as HWC), convert their dimensions, and store them in units of split units.
  • each split unit is stored in the second dimension storage order (such as CHW), while the split units themselves are stored in the third dimension storage order (such as HWC).
  • the alignment value M required by the block instruction is a multiple of UCi.
  • for the Forward4 scheme, the neuron data needs to be arranged from [1*hi*wi*ci] to [1*hi/4*wi/4*ci/4*(4*4*4)].
  • Fig. 17 shows a schematic diagram of executing a block instruction on neuron data according to an embodiment of the present disclosure.
  • the left part of the figure shows the neuron data before block processing (that is, the input tensor of the block instruction).
  • the three-dimensional neuron data [hi*wi*ci] (the N dimension is omitted here) is divided into inner and outer layers, each described by three dimensions.
  • according to the shape of the split unit, the in1 dimension can be set to UW (4 in this example), and the in2 dimension can likewise be set to UH (4 in this example).
  • the sizes of the three outer dimensions in3, in4 and in5 can also be determined accordingly, and their sizes are respectively equal to the numbers of the inner data blocks contained in the corresponding dimensions.
  • the right part of the figure shows the neuron data after block processing (that is, the output tensor of the block instruction). The shape of the neuron data now becomes [hi/4*wi/4*(ci*16)], which is likewise divided into inner and outer layers, each described by three dimensions. Since the neuron data needs to be split according to the split unit, and considering the constraints of the block instruction, the block bit width T can be set to UCi, that is, the data volume of one atomic operation is UCi, so that the storage order can conveniently be adjusted in units of split units.
  • the inner-layer data block 1702 corresponds to the inner-layer data block 1701 of the input tensor, but its shape changes from M×UH×UW to (M*UH*UW)×1×1, shown in the figure as one large strip composed of 16 thin strips.
  • the sizes of the three outer dimensions on3, on4 and on5 can also be determined accordingly, and are respectively equal to the numbers of inner data blocks of the output tensor contained in the corresponding dimensions.
  • the control circuit in the data processing device can further configure the block instruction as follows: set the post-allocation table of the block instruction, so that the inner lowest-dimension data of the output tensor of the block instruction is rearranged according to the indications in the post-allocation table.
  • the control circuit can further set the post-allocation table so as to convert the inner lowest-dimension (on0) data of the output tensor from the first dimension storage order (for example, HWC) into the second dimension storage order (for example, CHW).
  • the writing sequence of the 64B data is related to the data bit width dwidth of the data.
  • the post-configuration table can be configured according to the logic shown in the pseudo code in Table 5 below.
  • the control circuit in the data processing device can divide the input data (for example, neurons) into an integer segment and a remainder segment along the input channel (Ci) dimension, where the Ci dimension of the integer segment is aligned to the alignment value M and the Ci dimension of the remainder segment is smaller than M. A first block instruction can then be configured and executed for the integer segment, and a second block instruction can be configured and executed for the remainder segment.
  • depending on the size of ci, there may be only the integer segment, only the remainder segment, or both.
  • within ci, the length of the segment aligned to 64B is denoted ci_full (the integer segment), and the length of the segment not aligned to 64B is denoted ci_rem (the remainder segment).
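  • A minimal sketch of this split (assuming M corresponds to 64B of data, i.e. 64/dwidth elements) is:

```python
# Hedged sketch of splitting the Ci dimension into an integer segment
# aligned to 64 B and a remainder segment smaller than M.
def split_ci(ci, dwidth):
    m = 64 // dwidth               # alignment value M, in elements (assumed)
    ci_full = (ci // m) * m        # integer segment, aligned to 64 B
    ci_rem = ci - ci_full          # remainder segment, smaller than M
    return ci_full, ci_rem

print(split_ci(100, dwidth=1))     # (64, 36)
print(split_ci(100, dwidth=2))     # (96, 4)
print(split_ci(32, dwidth=1))      # (0, 32): only a remainder segment
```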
  • Table 6 shows the shape change of neuron data before and after executing the chunking instruction.
  • the first block instruction can be configured with reference to the content described above in conjunction with FIG. 17 .
  • the parameters of the input tensor of the first block instruction can be configured as follows: set the inner lowest-dimension size in0 to M, the inner second-lowest dimension size in1 to UW, and the inner highest-dimension size in2 to UH; and set the sizes in3, in4 and in5 of the three outer dimensions according to the size of each dimension of the integer segment of the input data, where the three outer-dimension sizes represent the numbers of inner data blocks of the input tensor contained in the corresponding dimensions.
  • the parameters of the output tensor of the first block instruction can be configured as follows: set the inner lowest-dimension size on0 to in1*in2*T, the inner second-lowest dimension size on1 to M/T, and the inner highest-dimension size on2 to 1; and set the sizes on3, on4 and on5 of the three outer dimensions according to the size of each dimension of the integer segment of the input data, where the three outer-dimension sizes represent the numbers of inner data blocks of the output tensor contained in the corresponding dimensions.
  • the dimension steps also need to be set.
  • the dimension steps of the five dimensions other than the inner lowest dimension can be set according to how adjacent data are arranged in those dimensions.
  • the block bit width T can be set to UCi according to the constraints of the block instruction and the shape transformation of the split unit before and after processing.
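  • As a hedged sketch of how these size parameters fit together (the pairing of the outer dimensions with the Ci, W and H extents, the use of ceil for partial blocks, and the equality of the input and output outer sizes are assumptions; the dimension steps of Table 7 are not reproduced):

```python
import math

# Hedged sketch: size parameters of the first block instruction, assembled
# literally from the rules above (steps is1..is5 / os1..os5 are omitted).
def first_block_sizes(ci_full, hi, wi, M, UH=4, UW=4, UCi=4):
    T = UCi                                     # block bit width set to UCi
    sizes = {
        "in0": M, "in1": UW, "in2": UH,         # inner input: M x UW x UH
        "in3": ci_full // M,                    # outer blocks along Ci (assumed)
        "in4": math.ceil(wi / UW),              # outer blocks along W (assumed)
        "in5": math.ceil(hi / UH),              # outer blocks along H (assumed)
        "on0": UW * UH * T,                     # inner output lowest dim = in1*in2*T
        "on1": M // T, "on2": 1,
    }
    # Assumed: each output inner block covers the same data as an input
    # inner block, so the outer counts match.
    sizes.update(on3=sizes["in3"], on4=sizes["in4"], on5=sizes["in5"])
    return sizes

print(first_block_sizes(ci_full=64, hi=9, wi=9, M=64))
```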
  • the first block instruction for the integer segment can be configured according to the following Table 7.
  • here, ci, hi and wi respectively represent the numbers of data in the Ci, H and W dimensions of the input data; dwidth represents the data bit width; ci_full represents the number of data in the Ci dimension of the integer segment; B represents bytes; T represents the block bit width; is1–is5 represent the five dimension steps of the input tensor; and os1–os5 represent the five dimension steps of the output tensor.
  • the second block instruction can be configured with slight adjustments relative to the configuration for the integer segment.
  • specifically, according to the Ci dimension size of the integer segment, the input tensor offset and the output tensor offset of the second block instruction executed for the remainder segment are set, where the input tensor offset indicates the offset of the unprocessed remainder segment relative to the initial storage address of the input data, and the output tensor offset indicates the offset of the processed remainder segment relative to the initial storage address of the output data.
  • in other words, the input tensor address and the output tensor address of the second block instruction are adjusted to account for the memory space occupied by the integer segment.
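  • A hedged sketch of this address adjustment (the formulas below are assumptions for illustration: the remainder starts ci_full elements into the Ci dimension of the HWC input, and, in the output, after the space occupied by the blocked integer segment) is:

```python
# Hedged sketch of the offsets used by the remainder-segment (second)
# block instruction, in bytes.
def second_block_offsets(ci_full, hi, wi, dwidth, UH=4, UW=4):
    in_offset = ci_full * dwidth                       # skip integer part of Ci
    h_blocks = (hi + UH - 1) // UH                     # split-unit blocks along H
    w_blocks = (wi + UW - 1) // UW                     # split-unit blocks along W
    out_offset = h_blocks * w_blocks * UH * UW * ci_full * dwidth
    return in_offset, out_offset

print(second_block_offsets(ci_full=64, hi=9, wi=9, dwidth=1))   # (64, 9216)
```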
  • the parameters of the input tensor of the second block instruction can be configured as follows: set the inner lowest-dimension size in0 to R, where R is the Ci dimension size of the remainder segment, the inner second-lowest dimension size in1 to UW, and the inner highest-dimension size in2 to UH; and set the sizes in3, in4 and in5 of the three outer dimensions according to the size of each dimension of the remainder segment of the input data, where the three outer-dimension sizes respectively represent the numbers of inner data blocks of the input tensor contained in the corresponding dimensions.
  • the parameters of the output tensor of the second block instruction can be configured as follows: set the inner lowest-dimension size on0 to in1*in2*T, the inner second-lowest dimension size on1 to R/T, and the inner highest-dimension size on2 to 1; and set the sizes on3, on4 and on5 of the three outer dimensions according to the size of each dimension of the remainder segment of the input data, where the three outer-dimension sizes respectively represent the numbers of inner data blocks of the output tensor contained in the corresponding dimensions.
  • based on the six dimension sizes of the input tensor and the output tensor and on the dimension sizes of the input data before processing, the control circuit can set the dimension steps of the five dimensions other than the inner lowest dimension, which describe the spacing between adjacent data in those dimensions.
  • the second block instruction for the remainder segment can be configured according to the following Table 8.
  • here, ci, hi and wi respectively represent the numbers of data in the Ci, H and W dimensions of the input data; dwidth represents the data bit width; ci_rem represents the number of data in the Ci dimension of the remainder segment; B represents bytes; T represents the block bit width; is1–is5 represent the five dimension steps of the input tensor; and os1–os5 represent the five dimension steps of the output tensor.
  • the embodiment of the present disclosure provides a block processing solution for neuron data.
  • neuron data of any shape can be arranged from [1*hi*wi*ci] to [1*hi/4*wi/4*ci/4*(4*4*4)] through the two-stage block processing.
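  • A round-trip check of this arrangement for sizes that are exact multiples of 4 (alignment and padding of a remainder segment omitted; the inner 4×4×4 block is taken in C×H×W order as described earlier) is:

```python
import numpy as np

# Hedged round-trip check: blocking and un-blocking recover the original data.
hi, wi, ci = 8, 8, 8
x = np.arange(hi * wi * ci).reshape(1, hi, wi, ci)               # [1, hi, wi, ci]
blocked = (x.reshape(1, hi // 4, 4, wi // 4, 4, ci // 4, 4)
             .transpose(0, 1, 3, 5, 6, 2, 4))                    # [1, hi/4, wi/4, ci/4, 4, 4, 4]
restored = blocked.transpose(0, 1, 5, 2, 6, 3, 4).reshape(1, hi, wi, ci)
assert np.array_equal(restored, x)                               # lossless rearrangement
```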
  • the block processing of the weight data is similar to that of the neuron data. Specifically, for the weight data in the Forward4 scheme, the data needs to be arranged from [co*kh*kw*ci] to:
  • compared with the neuron data, the weight data has an additional co dimension. Since the co dimension and the kh dimension are stored contiguously, the co dimension can be folded into the kh dimension.
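  • A minimal sketch of this folding (toy sizes, exact multiples of 4) is:

```python
import numpy as np

# Hedged sketch: because co and kh are stored contiguously, the weight tensor
# [co, kh, kw, ci] can be viewed as [co*kh, kw, ci] and then blocked with the
# same scheme as the neuron data.
co, kh, kw, ci = 2, 8, 8, 8
w = np.arange(co * kh * kw * ci).reshape(co, kh, kw, ci)
w_merged = w.reshape(co * kh, kw, ci)        # co folded into kh, no data movement
print(w_merged.shape)                        # (16, 8, 8)
```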
  • Table 9 shows the shape change of the weight data before and after executing the block instruction.
  • the weight data can still be divided into blocks using the scheme described above for the neuron data.
  • two-stage processing can be adopted for weight data of any scale, that is, integer segment block processing and remainder segment block processing.
  • the first block instruction for the integer segment of the weight data can be configured according to the following Table 10.
  • the second block instruction for the remainder segment of the weight data can be configured according to the following Table 11.
  • an embodiment of the present disclosure also provides a data processing method for executing block instructions by using the aforementioned data processing device.
  • An embodiment of the present disclosure also provides a chip, which may include the data processing device in any embodiment described above with reference to the accompanying drawings. Further, the present disclosure also provides a board, which may include the aforementioned chip.
  • the electronic equipment or devices disclosed herein may include servers, cloud servers, server computing clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves and range hoods; the medical equipment includes nuclear magnetic resonance instruments, ultrasound scanners and/or electrocardiographs.
  • the electronic equipment or device disclosed herein can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical treatment. Further, the electronic device or device disclosed herein can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge, and terminal.
  • electronic devices or devices with high computing power according to the disclosed solutions can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or Edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present disclosure is not limited by the order of the described actions . Therefore, according to the disclosure or teaching of the present disclosure, those skilled in the art may understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, according to different schemes, the description of some embodiments in this disclosure also has different emphases. In view of this, those skilled in the art may understand the part that is not described in detail in a certain embodiment of the present disclosure, and may also refer to related descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit exists physically independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic or magneto-optical storage medium, etc.), for example a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, etc.

Abstract

Disclosed are a data processing device, a data processing method that uses the data processing device to execute block instructions, and a related product. The data processing device may be included as a computing device in a combined processing device, and the combined processing device may also comprise an interface device and another processing device. The computing device interacts with the other processing device to jointly complete computing operations specified by a user. The combined processing device may also comprise a storage device, which is respectively connected to the computing device and the other processing device and is used to store data of the computing device and the other processing device. The solution of the present disclosure implements split storage of data in small convolution operations and improves operation processing efficiency.
PCT/CN2022/100302 2021-09-26 2022-06-22 Dispositif de traitement de données, procédé de traitement de données et produit associé WO2023045445A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111129610.8 2021-09-26
CN202111129610.8A CN113850380A (zh) 2021-09-26 2021-09-26 数据处理装置、数据处理方法及相关产品

Publications (1)

Publication Number Publication Date
WO2023045445A1 true WO2023045445A1 (fr) 2023-03-30

Family

ID=78979679

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100302 WO2023045445A1 (fr) 2021-09-26 2022-06-22 Dispositif de traitement de données, procédé de traitement de données et produit associé

Country Status (2)

Country Link
CN (1) CN113850380A (fr)
WO (1) WO2023045445A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118041706A (zh) * 2024-04-12 2024-05-14 深圳市中农网有限公司 一种基于crm下的农产品数据双模式存储方法

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113850380A (zh) * 2021-09-26 2021-12-28 安徽寒武纪信息科技有限公司 数据处理装置、数据处理方法及相关产品
TWI814618B (zh) * 2022-10-20 2023-09-01 創鑫智慧股份有限公司 矩陣運算裝置及其操作方法
CN115796239B (zh) * 2022-12-14 2023-10-31 北京登临科技有限公司 Ai算法架构的实现装置、卷积计算装置及相关方法与设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180285715A1 (en) * 2017-03-28 2018-10-04 Samsung Electronics Co., Ltd. Convolutional neural network (cnn) processing method and apparatus
CN111079917A (zh) * 2018-10-22 2020-04-28 北京地平线机器人技术研发有限公司 张量数据分块存取的方法及装置
CN112416433A (zh) * 2020-11-24 2021-02-26 中科寒武纪科技股份有限公司 一种数据处理装置、数据处理方法及相关产品
CN113850380A (zh) * 2021-09-26 2021-12-28 安徽寒武纪信息科技有限公司 数据处理装置、数据处理方法及相关产品

Also Published As

Publication number Publication date
CN113850380A (zh) 2021-12-28

Similar Documents

Publication Publication Date Title
WO2023045445A1 (fr) Dispositif de traitement de données, procédé de traitement de données et produit associé
WO2023045446A1 (fr) Appareil informatique, procédé de traitement de données et produit associé
CN112799599B (zh) 一种数据存储方法、计算核、芯片和电子设备
WO2022134873A1 (fr) Dispositif de traitement de données, procédé de traitement de données et produit associé
CN113469336A (zh) 优化神经网络模型的编译方法、执行方法及相关产品
CN112084023A (zh) 数据并行处理的方法、电子设备及计算机可读存储介质
CN113850379A (zh) 数据处理装置、数据处理方法及相关产品
CN113850377A (zh) 数据处理装置、数据处理方法及相关产品
CN113469337B (zh) 用于优化神经网络模型的编译方法及其相关产品
WO2022095675A1 (fr) Appareil et procédé d'amenuisement de réseau neuronal, et dispositif associé
CN114692844A (zh) 数据处理装置、数据处理方法及相关产品
CN114281561A (zh) 处理单元、用于处理单元的同步方法及相应产品
WO2022134872A1 (fr) Appareil de traitement de données, procédé de traitement de données et produit associé
CN114691353A (zh) 一种张量的读取方法、装置以及相关产品
WO2023087698A1 (fr) Appareil de calcul et procédé pour exécuter une opération de convolution, et produits associés
WO2023045638A1 (fr) Dispositif informatique, procédé de mise en œuvre d'une opération de convolution à l'aide d'un dispositif informatique, et produit associé
WO2023087814A1 (fr) Appareil informatique, procédé de mise en œuvre d'une opération de convolution à l'aide d'un appareil informatique et produit associé
CN113867800A (zh) 计算装置、集成电路芯片、板卡、电子设备和计算方法
WO2022063183A1 (fr) Dispositif et procédé pour l'informatique neuronale, ainsi que carte et support de stockage lisible
WO2022257980A1 (fr) Appareil informatique, procédé de mise en œuvre d'une opération de convolution à l'aide d'un appareil informatique, et produit associé
WO2022135599A1 (fr) Dispositif, carte et procédé pour fusionner des structures de ramification, et support de stockage lisible
WO2022135600A1 (fr) Appareil de réseau neuronal de calcul, carte, procédé et support de stockage lisible
CN113850378A (zh) 数据处理装置、数据处理方法及相关产品
CN113837923A (zh) 数据处理装置、数据处理方法及相关产品
CN113837921A (zh) 数据处理装置、数据处理方法及相关产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871506

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE