CN113837921A - Data processing device, data processing method and related product


Info

Publication number
CN113837921A
Authority
CN
China
Prior art keywords
data
dimension
dimensional
output
block
Prior art date
Legal status
Pending
Application number
CN202111131270.2A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority to CN202111131270.2A
Publication of CN113837921A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 - General purpose image data processing
    • G06T 1/20 - Processor architectures; Processor configuration, e.g. pipelining
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Abstract

The disclosure provides a data processing apparatus, a data processing method that executes a blocking instruction using the data processing apparatus, and related products. The data processing apparatus may be included, as a computing apparatus, in a combined processing apparatus, which may also include an interface apparatus and other processing apparatus. The computing apparatus interacts with the other processing apparatus to jointly complete computing operations specified by a user. The combined processing apparatus may further comprise a storage apparatus connected to the computing apparatus and the other processing apparatus, respectively, for storing data of the computing apparatus and the other processing apparatus. The disclosed scheme implements data dimension conversion and storage in small convolution operations and improves the efficiency of operation processing.

Description

Data processing device, data processing method and related product
Technical Field
The present disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to a data processing apparatus, a data processing method for executing a blocking instruction on data using the data processing apparatus, a chip, and a board card.
Background
At present, deep learning has become an important branch of machine learning and has also greatly promoted the development of artificial intelligence (AI). The core technology of deep learning, the deep neural network (DNN), has been widely applied in many industries.
Neural networks are among the most critical technologies in artificial intelligence and deep learning, and the convolutional neural network (CNN) is one of the most important network types. The most critical computation in a convolutional neural network is the convolution operation of the convolutional layer (Conv layer). The function of the convolutional layer is to extract features from the input data; complex features can be extracted through multiple layers of convolution, ensuring that the network has sufficient expressive power and generalization capability. A neural network model contains a large number of convolution operations of various types, and their computational performance greatly affects the computational performance of the whole model. When the neural network model is applied to different fields, such as speech recognition, machine translation, and image processing, the dimensions of its input feature maps and weights may differ. To make full use of the hardware advantages of a deep learning processor, convolution operations of different types and different scales need to be optimized to improve the computational performance of executing the neural network model.
Disclosure of Invention
To address at least one or more of the technical problems mentioned above, the present disclosure proposes, in various aspects, a data processing apparatus that can enable data of various dimensional sizes to be adapted to hardware of a convolution operation by executing a blocking instruction on the data, thereby improving computational efficiency of the convolution operation. The convolution operations of the disclosed embodiments may be operations in various neural network models that may be applied in various fields, such as image processing, speech processing, text processing, and so forth, which may include, for example, but not limited to, recognition and classification.
In a first aspect, embodiments of the present disclosure provide a data processing apparatus comprising a control circuit, a first storage circuit, and a second storage circuit, wherein: the first storage circuit is used for storing first data before processing; the second storage circuit is used for storing processed second data; and the control circuit is configured to: determine a preferred alignment value according to the co dimension of the first data; determine a de-padding allocation for the dimensions adjacent to the co dimension according to the preferred alignment value; and configure and execute a blocking instruction according to the de-padding allocation, so as to convert the first data, stored on the first storage circuit in a first-dimension storage order, into the second data, stored on the second storage circuit in a second-dimension storage order, wherein the first data is multidimensional data whose multidimensional shape is:
[high-dimension ho * middle-dimension wo * co * multi-dimensional mix]
wherein the multi-dimensional mix includes at least various combinations of: co, high-dimension ho, low-dimension ho, high-dimension wo, and low-dimension wo;
the second data is three-dimensional data, and the three-dimensional shape of the second data is:
[ho*wo*co]
where co represents the lowest storage dimension of the second data, wo represents the next lowest storage dimension of the second data, and ho represents the highest storage dimension of the second data.
In a second aspect, embodiments of the present disclosure provide a chip including the data processing apparatus of the first aspect.
In a third aspect, the disclosed embodiments provide a board card comprising the chip of the second aspect.
In a fourth aspect, embodiments of the present disclosure provide a data processing method for executing a blocking instruction on input data by using the data processing apparatus of the first aspect.
With the data processing apparatus, the chip, the board card, and the data processing method for executing a blocking instruction using the data processing apparatus provided above, the scheme of the embodiments of the present disclosure performs blocking processing on the output data of various convolution splitting schemes, and in particular optimizes output data with a small number of channels, so that the efficiency of the blocking processing matches the processing capability of the hardware arithmetic device. This makes full use of the parallel processing capability of the multiple slave processing circuits and effectively improves the efficiency of the convolution operation.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
fig. 1 shows a block diagram of a board card of an embodiment of the present disclosure;
FIG. 2 shows a block diagram of a combined processing device of an embodiment of the disclosure;
FIG. 3a is a schematic diagram illustrating an internal structure of a processor core of a single-core computing device according to an embodiment of the disclosure;
FIG. 3b shows a simplified schematic diagram of the internal structure of a multi-core computing device of an embodiment of the present disclosure;
FIG. 4 illustrates an example of an exemplary convolution operation principle to which embodiments of the present disclosure may be applied;
FIG. 5 shows a schematic block diagram of a computing device according to an embodiment of the present disclosure;
FIG. 6 illustrates an exemplary data storage sequence in accordance with embodiments of the present disclosure;
FIGS. 7a-7d illustrate several exemplary grouping patterns according to embodiments of the present disclosure;
FIG. 8 illustrates an exemplary split schematic of an input feature map in accordance with an embodiment of the present disclosure;
FIG. 9 shows a split and store schematic of a Forward4 approach in accordance with an embodiment of the present disclosure;
FIG. 10 shows a schematic diagram of output point division of an arithmetic circuit in a Forward4 scheme according to an embodiment of the disclosure;
FIG. 11 shows a schematic diagram of a single operation in a Forward4 scheme according to an embodiment of the present disclosure;
FIG. 12 shows a schematic diagram of sliding convolution in a Forward4 implementation in accordance with an embodiment of the present disclosure;
FIG. 13 shows a schematic diagram of an output data format in a Forward4 scheme according to an embodiment of the present disclosure;
FIG. 14 illustrates an overall data handling process according to an embodiment of the present disclosure;
FIG. 15 shows a schematic conceptual diagram of Trans tiling according to an embodiment of the present disclosure;
FIG. 16 shows a schematic diagram of a run-to-run table;
FIG. 17 illustrates a schematic diagram of a block instruction being performed on output neuron data in accordance with an embodiment of the present disclosure; and
FIG. 18 illustrates an optimization scheme for performing a blocking instruction on output neuron data in accordance with an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. as may appear in the claims, specification, and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
Exemplary hardware Environment
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the disclosure. As shown in fig. 1, the board card 10 includes a chip 101, which is a System-on-Chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence arithmetic unit for supporting various deep learning and machine learning algorithms and meeting intelligent processing requirements in fields such as computer vision, speech, natural language processing, and data mining under complex scenarios. Deep learning technology in particular is widely applied in the field of cloud intelligence; a notable characteristic of cloud intelligence applications is the large amount of input data, which places high requirements on the storage capacity and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and transfers data with, the control device 106 and the chip 101 through a bus. The control device 106 in the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a single-chip microcomputer (MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.
The computing device 201 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively perform the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processor, such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or another general-purpose and/or special-purpose processor, including but not limited to a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered on its own, may be viewed as having a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure.
The storage device 204 is used to store the data to be processed. It may be a DRAM or DDR memory, is typically 16 GB or larger in size, and is used to store data of the computing device 201 and/or the processing device 203.
Fig. 3a shows an internal structure diagram of a processing core when the computing device 201 is a single-core device. The computing device 301 is used for processing input data such as computer vision, voice, natural language, data mining, and the like, and the computing device 301 includes three major modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decode unit 312 decodes the obtained instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, nonlinear transformation, and the like; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM)331, a weight storage unit (weight RAM, WRAM)332, and a Direct Memory Access (DMA) 333. NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; WRAM 332 is used to store the convolution kernel of the deep learning network, i.e. the weight; the DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204.
Fig. 3b shows a simplified schematic diagram of the internal structure of the computing device 201 when it is multi-core. The multi-core computing device may be abstracted with a hierarchical hardware model. As shown, the multi-core computing device may be abstracted into four levels, namely the board level (Card) 350, the chip level (Chip) 360, the processor cluster level (Cluster) 370, and the processor core level (Core) 380. The embodiments of the present disclosure mainly concern the data transfer of the storage units and the computing unit parts, so the drawing and the description briefly show the related computing structure and omit other parts.
At the board level, each board contains local DDR storage, and each processor chip serves as a compute and control unit.
At the chip level, each processor chip contains multiple processors as computing units.
At the processor cluster level, each multiprocessor includes a plurality of accelerator cores as control and computing units, plus a shared SRAM as a storage unit.
At the processor core level, each accelerator core includes local storage and an array of local processing units. The NFU (Neuron Function Unit) is the unit that performs the convolution calculations.
In the multi-core computing device, the storage model includes the board-level global memory, the SRAM (shared memory) on each Cluster, and the NRAM, WRAM, registers, and the like on each Core. For better performance, the data movement between the storage levels below the Card, and the balance between memory access and computation, can be explicitly controlled. The SRAM is contained in a memory processing unit (Memory Process Unit Core, abbreviated MPU or Mem Core). Core refers to an intelligent processing core (IPU Core, or Core for short) in the multi-core computing device; one IPU Core contains an NRAM, a WRAM, an NFU, and so on. Cluster refers to a processor cluster or computing cluster; generally, a multi-core computing device comprises several Clusters, and one Cluster comprises one Mem Core plus N IPU Cores.
Exemplary convolution operation types
The convolutional layer in the neural network model may perform a convolution operation, and perform convolution processing by applying a convolution kernel (also referred to as a filter, a weight, or the like) to an input feature map (also referred to as input data, a neuron, or an input neuron), thereby performing feature extraction. The convolutional layer may contain a plurality of convolutional kernels, and each element constituting a convolutional kernel corresponds to a weight coefficient and a bias. The disclosed embodiments may be applied to data splitting for various convolution operations.
In the conventional 3D convolution operation, assuming that the tensor shape of the input Feature map (Feature map) in the convolutional layer is represented as X [ N Hi Wi Ci ], the tensor shape of the convolution kernel (kernel) is represented as K [ Co Kh Kw Ci ], and the output result is Y [ N Ho Wo Co ], then the mathematical calculation formula of the simplified convolution operation can be expressed as follows:
$Y_{in,jc,jh,jw} = \sum_{0 \le ic \le ci,\ 0 \le ih \le kh,\ 0 \le iw \le kw} X_{in,\,ic,\,jh \times sh + ih,\,jw \times sw + iw} \times K_{jc,\,ic,\,ih,\,iw}$    (1)
In the above equation, X is the input data, Y is the output data, and K is the convolution kernel; Kh and Kw are the height and width of K, and sh and sw are the strides in the height and width directions. The bias, the padding pad, and the dilation are ignored in the formula, and it is assumed that the input data X has already been padded and the convolution kernel has already been dilated. The formula also omits the N dimension and the C dimension: the forward computation of a neural network model is independent in the N dimension and fully connected in the C dimension. When the convolution kernel works, it sweeps over the input features according to a certain stride, performs element-wise multiplication and summation of the input features within the convolution window, and superimposes the bias.
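As an illustration of equation (1), the following is a minimal reference sketch of the plain 3D convolution it describes; the function name and default stride parameters are illustrative assumptions and not part of the disclosure, and bias, padding, and dilation are omitted as in the formula.

```python
import numpy as np

def conv3d_reference(X, K, sh=1, sw=1):
    """X: input [N, Hi, Wi, Ci]; K: kernel [Co, Kh, Kw, Ci]; returns Y: [N, Ho, Wo, Co]."""
    N, Hi, Wi, Ci = X.shape
    Co, Kh, Kw, _ = K.shape
    Ho, Wo = (Hi - Kh) // sh + 1, (Wi - Kw) // sw + 1
    Y = np.zeros((N, Ho, Wo, Co), dtype=np.float32)
    for n in range(N):
        for jc in range(Co):          # output channel
            for jh in range(Ho):      # output row
                for jw in range(Wo):  # output column
                    # multiply-accumulate over the convolution window (ih, iw, ic)
                    window = X[n, jh * sh:jh * sh + Kh, jw * sw:jw * sw + Kw, :]
                    Y[n, jh, jw, jc] = np.sum(window * K[jc])
    return Y
```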
FIG. 4 illustrates an example of an exemplary conventional 3D convolution operation principle to which embodiments of the present disclosure may be applied.
The figure exemplarily shows four-dimensional input data X of size [N Hi Wi Ci], which can be represented as N solid rectangles 410 of size Hi × Wi × Ci. Also shown is a four-dimensional convolution kernel K of size [Co Kh Kw Ci], which can be represented as Co three-dimensional convolution kernels 420 of size Kh × Kw × Ci. Convolving the input data X with the convolution kernel K produces the output data Y, which is four-dimensional data of size [N Ho Wo Co] and can be represented as N solid rectangles 430 of size Ho × Wo × Co.
The figure also shows a specific example of the convolution operation, in which the input data is an input feature map 440 of size 6 × 6 × 3 (the N dimension is omitted); the convolution kernel is a three-dimensional convolution kernel 450 of size 3 × 3 × 3 for a single Co; and the output data is a 4 × 4 output feature map 460. The specific operation process is as follows:
the convolution kernel 450 sweeps the input signature graph 440 by a certain step size, and performs matrix element multiplication summation and superposition offset on the input signatures within the convolution window 470. That is, the value at each position in the output feature map 460 is obtained by performing two-dimensional convolution operation on the corresponding block of each input feature map and the corresponding convolution kernel, and then summing the two-dimensional convolution operation. For example, the values (i.e., convolution output points) of the (0,0) positions on the output feature map 460 are shown to be obtained by performing a two-dimensional convolution operation on the convolution window 470 framed by the black cube in the input feature map and the stereo convolution kernel 450 to obtain 3 values, and then summing the values to obtain the final value.
To obtain the output at other positions, the convolution kernel 450, i.e., the convolution window of the convolution output point, is shifted over the input feature map 440. In the example shown, the convolution stride (Sx, Sy) is (1,1); after shifting one cell to the right in the horizontal (width) direction or one cell down in the vertical (height) direction, the convolution operation yields the value at position (0,1) or (1,0) of the output feature map 460, respectively.
From the above description, in a convolutional layer of a neural network there are N groups of input feature maps, each group containing Hi × Wi × Ci pieces of information, where Hi and Wi are the height and width of the input feature maps, respectively, and Ci is the number of input feature maps, also referred to as the number of input channels. The convolutional layer has Ci × Co convolution kernels of size Kh × Kw, where Ci is the number of input channels, Co is the number of output feature maps (or output channels), and Kh and Kw are the height and width of the convolution kernels, respectively. The output feature map contains Ho × Wo × Co pieces of information, where Ho and Wo are the height and width of the output feature map, respectively, and Co is the number of output channels. In addition, the convolution strides (Sx, Sy) are involved in the convolution operation and affect the size of the output feature map.
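To make the sizing relation explicit, the short sketch below (an illustration under the stated assumptions, not text from the patent) reproduces the common output-size formula without padding or dilation; it matches the example of FIG. 4, where a 6 × 6 input and a 3 × 3 kernel with stride (1,1) give a 4 × 4 output.

```python
def output_size(Hi, Wi, Kh, Kw, Sy=1, Sx=1):
    # no padding, no dilation, as in the example above
    Ho = (Hi - Kh) // Sy + 1
    Wo = (Wi - Kw) // Sx + 1
    return Ho, Wo

assert output_size(6, 6, 3, 3, 1, 1) == (4, 4)
```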
Herein, input Feature maps (Feature maps), input data, neurons, or input neurons may be used interchangeably; convolution kernels, filters, or weights may be used interchangeably. Further, the H (height) and Y dimensions may be used interchangeably, and the W (width) and X dimensions may be used interchangeably. Accordingly, the H dimension of the input feature map may be represented as Hi or Yi, the H dimension of the output feature map may be represented as Ho or Yo, and the W dimension is similarly represented. In embodiments of the present disclosure, each convolution output point has a corresponding convolution window, the shape of which is equal to the shape of the convolution kernel. The value of each convolution output point corresponds to the result of the pair-wise multiplication and accumulation of the input feature map and the weight value in the convolution window.
Exemplary computing device/data processing device
In the embodiments of the present disclosure, the above convolution operation may be implemented by using a computing device of a master-slave structure. Further, different data paths can be configured for the input feature map and the convolution kernel, so that the memory access efficiency is improved.
FIG. 5 shows a schematic block diagram of a computing device 500 according to an embodiment of the present disclosure. It is understood that the structure can be regarded as an internal structure refinement of the operation module of a single processing core in fig. 3, and can also be regarded as a function division block diagram combined on the basis of a plurality of operation modules of the processing cores shown in fig. 3. As shown in fig. 5, a computing device 500 of an embodiment of the present disclosure may be configured to perform various types of convolution operations, which may include a master processing circuit (MA)510 and a plurality of slave processing circuits (SL)520, 16 slave processing circuits SL 0-SL 15 being shown. Those skilled in the art will appreciate that the number of slave processing circuits may be more or less, depending on the particular hardware configuration, and the disclosed embodiments are not limited in this respect.
The master processing circuit and the slave processing circuits and the plurality of slave processing circuits may communicate with each other through various connections. In different application scenarios, the connection manner between the multiple slave processing circuits may be a hard connection manner arranged by a hard wire, or a logic connection manner configured according to, for example, a microinstruction, so as to form a topology of multiple slave processing circuit arrays. The disclosed embodiments are not limited in this respect. The master processing circuit and the slave processing circuit may cooperate with each other, thereby realizing parallel arithmetic processing.
To support the arithmetic function, the master processing circuit and the slave processing circuit may include various calculation circuits, and may include, for example, a vector operation unit and a matrix operation unit. The vector operation unit is used for executing vector operation and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit is responsible for core calculation of the deep learning algorithm, such as matrix multiplication and convolution.
The slave processing circuit may be configured to perform an intermediate operation on the corresponding data in parallel according to the operation instruction to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results back to the master processing circuit.
By configuring the computing apparatus 500 to be in a master-slave configuration (e.g., a master-slave configuration, or a multi-master-slave configuration, which is not limited in this respect), for a forward-direction computation instruction, data can be split according to the computation instruction, so that a portion with a large computation amount is computed in parallel by a plurality of slave processing circuits to increase the computation speed, save the computation time, and further reduce the power consumption.
In some embodiments of the present disclosure, by using different data paths to transmit the input feature map and the weight, multiple multiplexing modes of the input feature map and the weight can be supported, thereby reducing the data access amount during operation and improving the processing efficiency.
Specifically, the computing apparatus 500 may further include a first storage 530 and a second storage 540 for storing data transmitted via different data channels, respectively.
The first memory circuit 530 may be used to store multicast data, i.e. the data in the first memory circuit will be transmitted via the broadcast bus to a plurality of slave processing circuits, which receive the same data. It will be appreciated that broadcast and multicast may be implemented via a broadcast bus. Multicast refers to a communication mode in which a piece of data is transmitted to a plurality of slave processing circuits; broadcast is a communication mode for transmitting a piece of data to all slave processing circuits, and is a special case of multicast. Since multicast and broadcast both correspond to one-to-many transmission modes, and are not specifically distinguished herein, broadcast and multicast may be collectively referred to as multicast, the meaning of which may be clear to those skilled in the art depending on the context.
The second memory circuit 540 may be used for storing distribution data, i.e. data in the second memory circuit will be transmitted to different slave processing circuits, respectively, each receiving different data.
By providing the first storage circuit and the second storage circuit separately, it is possible to support transmission in different transmission manners for data to be operated on, thereby reducing the data access amount by multiplexing multicast data among a plurality of slave processing circuits.
In some embodiments, the master processing circuit may determine one of the input signature graph and the convolution kernel as multicast data and store in the first storage circuit to transmit the data to the scheduled plurality of slave processing circuits in a broadcast manner during the operation. Correspondingly, the main processing circuit may determine the other of the input feature map and the convolution kernel as distribution data and store it in the second storage circuit. These distribution data may be distributed to the corresponding slave processing circuits prior to the operation.
Fig. 5 also shows an internal structural schematic diagram of the slave processing circuit SL according to an embodiment of the present disclosure. As shown, each slave processing circuit 520 may include a plurality of arithmetic circuits CU 521, a first buffer circuit 522, and a second buffer circuit 523. The figure shows 4 arithmetic circuits CU0 to CU 3. Those skilled in the art will appreciate that the number of operational circuits may be greater or lesser depending on the particular hardware configuration, and embodiments of the present disclosure are not limited in this respect.
In some embodiments, the first buffer circuit 522 may be used to buffer the weights or the input feature map assigned to the slave processing circuit. Correspondingly, the second buffer circuit 523 may be used to buffer the input feature map or the weights assigned to the slave processing circuit. Both buffer circuits are used to select the data to be involved in the operation. The data of the first buffer circuit 522 may be a plurality of data lines from, for example, the first storage circuit 530 or the second storage circuit 540; correspondingly, the data of the second buffer circuit 523 may be a plurality of data lines from, for example, the second storage circuit 540 or the first storage circuit 530. Depending on the particular multiplexing scheme, these data lines may be distributed to the corresponding operation circuits CU 521 during the operation, or broadcast to all CUs 521 within the slave processing circuit 520.
In each computation, each operation circuit CU 521 performs an element-wise multiply-accumulate operation on one data line selected from the first buffer circuit and one data line selected from the second buffer circuit.
By providing the first buffer circuit and the second buffer circuit separately, it is possible to support transmission in different transmission manners for data to be operated on, thereby reducing the data access amount by multiplexing data as much as possible among a plurality of operation circuits within a single slave processing circuit.
The slave processing circuit 520 may further include a third buffer circuit 524 for buffering the operation result of each operation circuit CU 521.
It will be appreciated that although the individual processing and memory circuits are shown as separate blocks in fig. 5, the memory and processing circuits may be combined into one block according to different configurations. For example, the first memory circuit 530 may be incorporated with the master processing circuit 510, and the second memory circuit 540 may be shared by a plurality of slave processing circuits 520, and each slave processing circuit may be assigned a separate memory region to speed up access. The disclosed embodiments are not limited in this respect. Furthermore, in the computing device, the master processing circuit and the slave processing circuit may belong to different modules of the same processor or chip, or may belong to different processors, and the disclosure is not limited in this respect.
Exemplary data splitting and storage
In the disclosed embodiments, the dimensions of multidimensional data are characterized as (N, H, W, C) or (Co, H, W, Ci), which represents the storage order of the data in the memory. It will be appreciated that although the multidimensional data has multiple dimensions, there is a correspondence between the multidimensional data and the storage order on the memory because the layout of the memory is always one-dimensional. The multidimensional data is usually allocated in a contiguous storage space, i.e., the multidimensional data can be expanded one-dimensionally and stored on the memory in sequence. For example, in embodiments of the present disclosure, the initial input feature maps may be stored sequentially with the lowest dimension first (where C/Ci is the lowest dimension); in order to optimize the convolution operation, the storage order of the input feature maps may be adjusted during the operation, as will be described in detail later. Adjacent dimensions are dimensions that are next to each other in the dimension representation of the multidimensional data, for example W and Ci are adjacent; adjacent dimensions may also be referred to as continuous dimensions.
In an intelligent processor, the main operation unit of hardware is a vector multiplication and addition operator due to the need of calculation power and the consideration of area power consumption overhead. The support of various convolution algorithms is realized in hardware design, the multiplication and addition operation in the algorithms is extracted in a maximized mode, and input and output data of the multiplication and addition operation are exchanged between an on-chip RAM (such as NRAM, WRAM and the like in FIG. 3) and an operator efficiently through a data path.
Hardware storage is organized row by row (in cache lines), and read, write, and compute operations are most efficient when aligned to whole rows; therefore, data generally needs to be vectorized and aligned in order to fully utilize the bandwidth and meet the access requirements of the operator array. Artificial intelligence chips are usually designed with the Ci dimension as the lowest dimension, i.e., the NHWC placement order mentioned above, in which data in the Ci dimension is contiguous. Vectorization alignment therefore requires the size of the Ci dimension to be aligned to a specified value, for example an alignment value M, so that data is accessed in units of the alignment value M; M may also be referred to as the maximum amount of data the hardware can operate on at one time. Depending on the hardware design, M may take different values, such as 64 bits, 128 bits, 256 bits, or 512 bits. Generally, the size of the input port of the operator array is also related to M; for example, when the input data bit widths are symmetric, the input port of the operator array is generally 2 times M, i.e., it processes input feature map data and weight data of size M at one time. When the Ci dimension of the input feature map is large, the above alignment requirement is relatively easy to satisfy.
When the Ci dimension of the input feature map is small, for example smaller than one cache line, the Ci dimension needs to be padded to a full line of data (for example, 512 bits), i.e., filled with invalid zeros. This padding causes a large amount of redundant computation, wasting resources and reducing computational efficiency.
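The following small sketch illustrates the padding overhead just described, assuming a 64-Byte line (M = 64B) and 1-Byte (Int8) elements; the helper name is hypothetical.

```python
def padded_ci(ci_bytes, M=64):
    """Round the Ci dimension (in bytes) up to a whole line; the remainder is invalid zeros."""
    return ((ci_bytes + M - 1) // M) * M

for ci in (3, 28, 64, 100):
    aligned = padded_ci(ci)
    print(f"Ci={ci:3d}B -> aligned to {aligned:3d}B, padding ratio {1 - ci / aligned:.0%}")
# e.g. Ci=3B is padded to 64B, so about 95% of the line is invalid data and redundant computation.
```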
In the disclosed embodiment, a convolution operation scheme is proposed, which can determine a corresponding convolution splitting scheme according to the size of the lowest storage dimension (for example, Ci) of an input feature map, wherein the convolution splitting scheme at least indicates the shape of a splitting unit of data to be operated on. The data volume contained in one split unit does not exceed the single maximum operation volume of hardware.
In some embodiments, the data amount contained in one split unit can be set to the one-time processing alignment value M of the hardware, so that the calculation processing is performed by taking the split unit as a unit, the calculation power of the hardware can be fully exerted, and invalid calculation can be avoided or reduced.
In the exemplary description of the present disclosure, assume without loss of generality that M = 512 bit = 64 Byte; the data type may be Int8, Int16, Float16, or Float32, and the input feature map has the same data type as the convolution kernel. Since each data type occupies at least 1 byte and the minimum unit of arithmetic processing is one data element, various calculations in the following examples are performed in units of bytes, for example M = 64B, Ci = 28B, and so on, with the units sometimes omitted for brevity.
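For reference, the number of elements that fit into one such 64-Byte line for each of the data types listed above is shown in the short sketch below (standard byte widths assumed; illustrative only).

```python
M = 64  # bytes per data line
widths = {"Int8": 1, "Int16": 2, "Float16": 2, "Float32": 4}  # bytes per element
for dtype, w in widths.items():
    print(f"{dtype}: {M // w} elements per line")
# Int8: 64, Int16: 32, Float16: 32, Float32: 16
```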
When the data amount of one splitting unit equals M, the data block shape of each splitting unit is blockC × blockY × blockX, which may take various forms; Table 1 lists several of them:
TABLE 1: Data block shapes (blockC × blockY × blockX); the table is provided as an image in the original document
As can be seen from Table 1, some data block shapes have equal dimensions in the X and Y dimensions (as indicated by the dark rows), which simplifies subsequent operations. Therefore, in the embodiments of the present disclosure, it may be preferable to use such a data block shape to split the data to be operated on.
For brevity, the splitting scheme with the 64B × 1 × 1 shape is referred to as Forward64, the splitting scheme with the 16B × 2 × 2 shape as Forward16, and the splitting scheme with the 4B × 4 × 4 shape as Forward4; the 4B × 4 × 4 splitting scheme applied to the depthwise convolution operation is referred to as Forward1, the 4B × 4 × 4 splitting scheme applied to the reverse depthwise convolution operation as Update1, and the 4B × 4 × 4 splitting scheme applied to the cross-product convolution operation as Update4. Apart from Forward64, these splitting schemes are suited to scenarios in which the channel dimension C of the convolution computation is small, and may therefore also be collectively referred to as small convolutions. In these small-convolution splitting schemes, one splitting unit includes data from the lowest storage dimension and at least one other storage dimension, and the total data amount of one splitting unit does not exceed the maximum amount the hardware can operate on at one time.
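A small sketch of the shape enumeration behind Table 1 and the naming above follows; the candidate factor sets are assumptions for illustration, not an exhaustive reproduction of the table.

```python
M = 64  # bytes, one splitting unit
shapes = [(c, y, x)
          for c in (4, 16, 64)    # candidate blockC values (bytes)
          for y in (1, 2, 4, 8)   # candidate blockY values
          for x in (1, 2, 4, 8)   # candidate blockX values
          if c * y * x == M]
# Shapes with equal Y and X extents simplify subsequent operations and correspond to the
# Forward4 / Forward16 / Forward64 splitting schemes named above.
square = [s for s in shapes if s[1] == s[2]]
print(square)  # [(4, 4, 4), (16, 2, 2), (64, 1, 1)]
```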
Different convolution splitting schemes can be suitable for different operation scenes, so that performance optimization of different degrees is obtained.
After the splitting scheme is determined, the input feature map and the convolution kernel can be split into a plurality of corresponding splitting units according to the determined convolution splitting scheme, and the dimension storage order of the splitting units is converted, so that data in one splitting unit can be continuously stored as one data line, and subsequent reading processing is facilitated by taking the splitting unit (data line) as a unit.
In some embodiments, the three-dimensional or four-dimensional neuron or weight data is divided entirely into data blocks of size blockC × blockY × blockX (Uc × Uy × Ux), and each data block is stored contiguously in one row of, for example, M = 64B, so that reading one row of data actually retrieves the data of one data block.
Specifically, one or more splitting units may be read, in units of splitting units and in a first reading order, from the data to be operated on that is stored in the first-dimension storage order, and the read splitting units are stored in the corresponding storage circuit, where the data within each splitting unit is stored in the second-dimension storage order and the splitting units are arranged in the third-dimension storage order.
FIG. 6 illustrates an exemplary data storage sequence in accordance with embodiments of the present disclosure.
As shown in the figure, 610 represents the storage of a four-dimensional tensor to be computed, which includes N three-dimensional sub-tensors, with N in the highest dimension; that is, the first-dimension storage order of the four-dimensional tensor is NHWC. Note that H and Y, and W and X, are used interchangeably herein. Each sub-tensor is divided into smaller data blocks, or splitting units, whose counts along the three dimensions are denoted C, Y, and X, respectively.
The middle graph 620 represents the storage of each sub-tensor, with each data block stored as a contiguous 64Byte, i.e., a row. When the order in which the data blocks are read differs, the order between the rows may also change accordingly. In the example shown in the figure, the data blocks are read in the directions of C, then X, and finally Y, i.e., the first reading order is YXC, and the rows are stored in the order of Y X C, i.e., the third dimension storage order is YXC or HWC. In this example, the third dimension storage order is the same as the first dimension storage order. It will be appreciated that other reading orders may be used, resulting in a third dimension storage order that is different from the first dimension storage order, and are not listed here.
The right diagram 630 shows the order within each row, i.e., the data order within each data block of shape blockC × blockY × blockX; here the second-dimension storage order is CYX, or CHW.
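As a concrete illustration of the storage-order conversion in FIG. 6, the sketch below reorders one HWC sub-tensor into contiguous blocks whose internal order is C-Y-X and whose block order is Y-X-C; it assumes the dimensions divide evenly by the splitting unit, and the function name is hypothetical.

```python
import numpy as np

def to_block_layout(x, Uc=4, Uy=4, Ux=4):
    """x: one sub-tensor [H, W, C] in the first-dimension storage order (HWC).
    Returns a 1-D buffer in which blocks follow the third-dimension order Y, X, C
    and the data inside each block follows the second-dimension order C, Y, X."""
    H, W, C = x.shape
    x = x.reshape(H // Uy, Uy, W // Ux, Ux, C // Uc, Uc)
    # put block indices (Yb, Xb, Cb) outermost and order each block's data as (Uc, Uy, Ux)
    x = x.transpose(0, 2, 4, 5, 1, 3)
    return np.ascontiguousarray(x).reshape(-1)
```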
Exemplary grouping operations
Small convolutions adopt a block form; compared with conventional convolution, only the alignment of the block in the Ci direction needs to be satisfied. In this small-channel context, the weights (co Kh Kw ci) are generally small: Kh and Kw are usually single-digit, and co and ci are of roughly the same magnitude. In the computing device/data processing device described above in connection with FIG. 5, the second storage circuit (e.g., WRAM 332 of FIG. 3) typically has a larger storage space than the first storage circuit (e.g., NRAM 331 of FIG. 3). Therefore, in order to make full use of the on-chip space, most small-convolution schemes, such as Forward4 and Forward1, interchange the storage locations used by normal convolution for the neurons and the weights: the neurons are stored on the second storage circuit WRAM, and the weights are stored on the first storage circuit NRAM.
In the convolution calculation, each input feature map needs to undergo multiply-add operations with each of the Co convolution kernels, producing Co output feature maps. However, the on-chip space cannot necessarily hold convolution kernels and input feature maps of all scales at the same time, so the hardware performs a series of operations that repeatedly load input feature data or weight data, and how to balance this repeated loading of input feature data versus weight data has a certain influence on calculation efficiency. In actual operation, in order to reduce frequent off-chip memory accesses, there is a splitting-strategy problem for the neurons and the weights. In some embodiments, different splitting modes can be adopted according to the scale characteristics of the data participating in the operation.
According to the convolution operation principle described above, the operation results in the Co dimension (the C dimension for depthwise convolution) do not need to be accumulated, so the operations for different Co values can be distributed relatively independently to different operation circuits. In a small convolution scenario, the size of the output channel dimension Co of the convolution kernel is typically no larger than the number of slave processing circuits scheduled in a single round of operation, so the operation for a single Co needs to be completed by one or more slave processing circuits. More generally, even when the Co dimension is large, this can be achieved by splitting into multiple rounds of operation, where the number of Co values processed per round does not exceed the number of scheduled slave processing circuits. Thus, in one example, the number of rounds of operation required to complete the convolution, and the number of Co values processed in each round (or the corresponding grouping mode), may first be determined based on the size of the output channel dimension Co of the convolution kernel and the number Ns of schedulable slave processing circuits.
Regardless of the allocation method, in a single round of operation there are two possible allocation cases for Co: multiple slave processing circuits process one Co value, or a single slave processing circuit processes one or more Co values. Specifically, in a single round of operation processing Nco output channels, every Rs SLs constitute a slave processing circuit group SLB that processes the convolution kernel corresponding to the same output Co value, where Rs = [Ns/Nco]; that is, the same convolution kernel is multiplexed across the Rs SLs within the same SLB, and Rs represents the number of times the convolution kernel is multiplexed among the slave processing circuits. Correspondingly, the input feature map may be multiplexed among the slave processing circuit groups SLB, and Rn = [Ns/Rs] indicates the number of times the input feature map is multiplexed among the slave processing circuits.
Alternatively or additionally, when each slave processing circuit processes the convolution kernels corresponding to rn Co values, with rn = [Nco/Ns], the input feature map processed by each slave processing circuit can be reused for the rn convolution kernels, and rn represents the number of times the input feature map is multiplexed within a single slave processing circuit. The maximum applicable number of convolution-kernel multiplexing times rs and input-feature-map multiplexing times rn within a single slave processing circuit may be determined taking into account factors such as hardware buffer space limitations (e.g., the sizes of the first buffer circuit and the second buffer circuit in FIG. 5).
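The multiplexing counts described in the last two paragraphs can be summarized by the sketch below (an assumed helper, with the bracketed divisions read as integer division).

```python
def multiplex_factors(Ns=16, Nco=4):
    """Ns: schedulable slave processing circuits; Nco: output channels handled per round."""
    if Nco <= Ns:
        Rs = Ns // Nco   # SLs per group SLB: the same convolution kernel is reused Rs times
        Rn = Ns // Rs    # number of groups: the input feature map is reused Rn times
        return {"Rs": Rs, "Rn": Rn}
    rn = Nco // Ns       # Co values per SL: the input feature map is reused rn times per SL
    return {"rn": rn}

print(multiplex_factors(16, 4))   # e.g. {'Rs': 4, 'Rn': 4}, a Group4-like case
print(multiplex_factors(16, 16))  # e.g. {'Rs': 1, 'Rn': 16}, a Group16-like case
```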
Considering the buffer size limitations and the multiplexing gain in hardware circuits, some embodiments of the present disclosure do not, for now, consider the case in which one slave processing circuit processes multiple Co values in a single round of operation, but only the case in which one or more slave processing circuits process a single Co value in a single round of operation.
Different grouping modes can be used depending on the number of slave processing circuits SL that process the same Co value in a single round of operation. It will be appreciated that it is preferable to distribute the slave processing circuits SL that can be invoked equally, so as to balance the computational effort, for example, in groups of 2 SLs each, so that 16 SLs can process 8 Co values simultaneously; or one set of every 4 SLs so that 16 SLs can process 4 Co values simultaneously; and so on. In the computing device described above in connection with fig. 5, the second storage circuit WRAM has 16 blocks of storage areas, respectively allocated to 16 slave processing circuits SL. Furthermore, every 4 blocks can be combined into a memory block, which is distributed to the corresponding slave processing circuit group SLB. Thus, in some embodiments, for a computing device including Ns 16 SLs as shown in fig. 5, several grouping modes may be selected as follows: group1 mode, Group4 mode, and Group16 mode. It will be appreciated by those skilled in the art that there may be different grouping patterns depending on the value of Ns, and each grouping pattern may be processed correspondingly with reference to the above three representative grouping patterns given herein.
In some embodiments, the grouping pattern may be collectively expressed as GroupN, which represents that all the slave processing circuits SL scheduled in the current round of operation are grouped into N groups, each slave processing circuit group SLB processes the same Co value, and different slave processing circuit groups SLB process different Co values. For 16 SL total schedulable cases, N may take 1,4,16, corresponding to Group1, Group4, and Group16, respectively, above.
Figures 7a-7d illustrate several exemplary grouping patterns according to embodiments of the present disclosure. Fig. 7a shows a Group1 mode, fig. 7b shows a Group16 mode, fig. 7c shows one Group4 mode, and fig. 7d shows another Group4 mode.
As shown in fig. 7a, the Group1 mode means that all schedulable 16 SLs belong to one Group, collectively handling one Co value, e.g. SL 0-SL 15 belong to Group G0. Thus, the operation for the one output channel is distributed over 16 SLs. In this mode, it may be considered that the convolution kernel 720 of the output channel is transmitted to each SL in a broadcast manner, and the input feature map 710 is split and allocated to each SL, so as to improve the memory access efficiency.
In one embodiment, the convolution kernel may be stored on the first storage circuit 530 of FIG. 5 for transmission over the broadcast channel. The input feature map may then be divided along the XY directions of the output feature map and stored on the second storage circuit 540 for assignment to different SLs. Thus, all SLs collectively compute the output feature map of one Co. The division and storage of the input feature map will be described in detail later with reference to the drawings.
As shown in fig. 7b, the Group16 mode means that all 16 schedulable SLs are divided into 16 groups, i.e. one SL per group, with each SL handling a different Co value. For example, SL0 belongs to group G0, SL1 belongs to group G1, and so on until SL15, which belongs to group G15. In this mode, the same block of the input feature map 730 can be reused across the 16 SLs, so it may be preferable to broadcast the input feature map 730 to each SL, while the convolution kernels 740 corresponding to different Co are distributed to the corresponding SLs.
In one embodiment, the input feature map may be copied into 16 copies and stored on the 16 storage regions of the second storage circuit allocated to the 16 slave processing circuits. The convolution kernels are divided by Co, with one SL corresponding to one Co and 16 Co values processed at a time; they are stored on the first storage circuit and distributed to the different SLs in a unicast manner. Thus, all SLs compute the output feature maps of different Co for the same input feature map.
As shown in fig. 7c, the Group4 mode means that all schedulable 16 SLs are divided into 4 groups, each Group handling one Co value. Each SL group (SLB) includes the number of SLs equal to Rs Ns/4. For example, SL0 to SL3 belong to group G0, SL4 to SL7 belong to group G1, SL8 to SL11 belong to group G2, and SL12 to SL15 belong to group G3. This pattern is between Group1 and Group16, so either the convolution kernel or the input signature graph can be determined to be multicast data, while the other is determined to be distribution data.
In one embodiment, the convolution kernels may be divided into 4 groups by Co and stored on the first storage circuit 530 of fig. 5 so that they can be transmitted over a broadcast channel. The input feature map may be replicated into 4 copies, stored in the second storage circuit 540, and distributed to the 4 SLBs, so that each SLB obtains the same input feature map; within each SLB, the input feature map is divided into 4 portions in the XY direction of the output feature map and distributed to its 4 SLs. Thus, all SLs in each SLB collectively compute the output feature map of one Co, with the 4 SLBs each processing a different Co.
As shown in fig. 7c, the convolution kernels are divided into 4 groups, with the Co values assigned to the groups in an interleaved manner at a stride of 1. For example, when Co = 12, the Co values of the 4 groups are {0,4,8}, {1,5,9}, {2,6,10}, and {3,7,11}, respectively. In each transmission, one Co of each group is sent; for example, the first transmission sends Co = 0 to 3, with one Co corresponding to one SLB and the 4 SLs in one SLB sharing the same weights; the second transmission sends Co = 4 to 7, and so on. Thus, after each round of operation is completed, the Co dimensions of the operation results output by the SLBs are contiguous.
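As an illustration, the interleaved Co assignment and per-round transmission described above can be sketched as follows (a minimal Python sketch with hypothetical helper names; it is not part of the disclosed apparatus).

```python
# Hypothetical sketch: interleaved assignment of output channels (Co) to 4 SLB groups,
# matching the example above (Co = 12 -> {0,4,8}, {1,5,9}, {2,6,10}, {3,7,11}).
def assign_co_to_groups(co_total, num_groups=4):
    groups = [[] for _ in range(num_groups)]
    for co in range(co_total):
        groups[co % num_groups].append(co)   # stride-1 interleaving across the SLB groups
    return groups

def transmission_rounds(co_total, num_groups=4):
    # Round r transmits one Co per SLB (Co = r*num_groups .. r*num_groups + num_groups - 1),
    # so the Co values produced in one round are contiguous.
    return [list(range(r * num_groups, min((r + 1) * num_groups, co_total)))
            for r in range(-(-co_total // num_groups))]

if __name__ == "__main__":
    print(assign_co_to_groups(12))    # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
    print(transmission_rounds(12))    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```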
When the Forward4 small-convolution splitting scheme is adopted, in order to support all three of the above modes simultaneously, the neurons can be uniformly stored on the second storage circuit WRAM, and the weights can be stored on the first storage circuit NRAM.
Exemplary splitting of the input feature map
As can be seen from the foregoing description, when multiple SLs collectively process one Co value, the input feature map needs to be split among the multiple SLs, for example, the Group1 grouping mode needs to split the input feature map into 16 parts, and the Group4 grouping mode needs to split the input feature map into 4 parts.
To ensure that the split parts of the input feature map can share a convolution kernel, the split may be performed according to the Ho/Wo directions of the output feature map and then mapped back to a division of the input feature map. In some embodiments, the input feature map may be divided among the Rs slave processing circuits SL included in each slave processing circuit group as follows: the output feature map is divided evenly, according to its size, into Rs output feature blocks of the same shape in the XY dimension (i.e. the Ho/Wo dimensions); and the input feature map is divided into Rs input feature blocks in the XY dimension (i.e. the Hi/Wi dimensions), according to the input feature map region required to calculate each output feature block, to be assigned to the Rs slave processing circuits. It will be appreciated that, depending on the convolution kernel size and the convolution stride, the input feature map regions corresponding to adjacent output points of the output feature map may overlap.
FIG. 8 illustrates an exemplary split schematic of an input feature map in accordance with an embodiment of the disclosure. In this example, the input signature is divided into 16 shares distributed over 16 SLs, corresponding to the Group1 mode.
In the figure, 810 represents the output feature map of a single Co, which is divided in a 4 × 4 manner in the XY direction into 16 output feature blocks of the same shape, allocated to SL0 to SL15 respectively. The 16 output feature blocks can then be mapped onto the input feature map 820 to obtain the 16 input feature map regions required to calculate them, which likewise divides the input feature map in the XY direction. These 16 input feature map regions may be correspondingly assigned to the 16 slave processing circuits SL.
As described above, the input feature map is split in units of splitting units according to the determined convolution splitting scheme. Therefore, in the above embodiment, the input feature map is partitioned such that each partitioned input feature map block is a multiple of the splitting unit in the XY direction, that is, each partitioned input feature map block is aligned to the splitting unit in the XY direction. For example, when a 4 × 4 × 4 convolution splitting scheme is selected, each input feature block is aligned to 4 × 4; whereas when a 16 × 2 × 2 convolution splitting scheme is selected, each input feature block is aligned to 2 × 2.
For the case where the output feature map is not aligned to the splitting unit (e.g. 4 × 4 or 2 × 2), corresponding padding (e.g. zero padding) of the input feature map is required, so that the actually calculated output XY is aligned to the splitting unit and the input XY is also aligned to the splitting unit.
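To make the mapping from an output feature block back to its input region concrete, the following sketch (a simplification under the assumption of a standard convolution with stride s and kernel size k; the helper names are hypothetical) computes the input region required for one output block and its size after alignment to the splitting unit.

```python
import math

# Hypothetical sketch: map a block of output points back to the input region needed to
# compute it, then align that region to the splitting unit (e.g. 4 for the 4Bx4x4 scheme).
def output_block_to_input_region(out_start, out_size, kernel, stride, pad=0):
    in_start = out_start * stride - pad          # first input index needed (may be < 0 -> padding)
    in_size = (out_size - 1) * stride + kernel   # inputs covered by out_size consecutive outputs
    return in_start, in_size

def align_up(n, unit):
    return math.ceil(n / unit) * unit

if __name__ == "__main__":
    # 3x3 kernel, stride 1: an output block of 4 points starting at output index 4 needs
    # input indices 4..9; adjacent blocks' input regions overlap by kernel - stride = 2.
    print(output_block_to_input_region(out_start=4, out_size=4, kernel=3, stride=1))  # (4, 6)
    print(align_up(6, 4))                                                             # 8
```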
It will be understood by those skilled in the art that the output feature map may be split in the XY direction according to other rules, for example, split into 16 output feature blocks with the same shape in a 1 × 16 manner, and assigned to SL0 to SL15, respectively. The disclosed embodiments are not limited in this respect. Furthermore, it is to be understood that, although the foregoing is described in conjunction with splitting between slave processing circuits, this splitting manner may also be applied to splitting in other scenarios, for example, splitting between operation circuits CU within a single slave processing circuit SL, and the embodiments of the present disclosure are not limited in this respect.
Exemplary convolution operation procedure within Single Slave processing Circuit
After the data to be operated on has been split and stored accordingly, a plurality of slave processing circuits can be scheduled to perform convolution operations on the corresponding data rows of the input feature map and the convolution kernel, and the operation results returned by the slave processing circuits can then be spliced according to the convolution splitting scheme to obtain the output feature map of the convolution of the input feature map and the convolution kernel. Specifically, the convolution operation itself may be performed using the plurality of operation circuits CU and the respective buffer circuits within each slave processing circuit (see fig. 5). Depending on the size of the buffer space within the slave processing circuit and the computational capability of the operation circuits, multiple cycles of operation are usually required in each round to complete the required computation.
As can be seen from the foregoing description, in the scenario of a conventional 3D convolution operation, all the operation circuits within a single slave processing circuit calculate an output feature map, or partial output feature map, corresponding to the same output channel Co. Depending on the sizes of the buffer spaces of the first buffer circuit and the second buffer circuit within the slave processing circuit SL and the processing capacity of the operation circuits CU (e.g. internal registers), the slave processing circuit may not be able to calculate the output feature map assigned to it at once. Thus, the output feature map may be divided into output feature blocks in units of the single-operation capability of the operation circuits (e.g. a single computation of Nop output points or partial sums), each block corresponding to the single-operation capability of all Ncu schedulable operation circuits within a single SL (Ncu × Nop output points). For example, taking the example of fig. 5 in which each SL includes 4 CUs, and assuming that each CU can calculate Nop = 4 output points or partial sums at a time, a single SL can calculate 4 × 4 = 16 output points (or partial sums) at a time. Therefore, the output feature map can be divided into output feature blocks aligned to 16 output points in the XoYo dimension, and the output feature blocks can be calculated one by one. It is to be understood that the 16 output points may be arranged as 4 × 4 or 1 × 16, and embodiments of the present disclosure are not limited in this respect.
In calculating each divided output feature block, the output points of the output feature block may be further divided among the Ncu operation circuits to determine the processing object of each operation circuit. Then, according to this division of output points, with the splitting unit as the sliding window, Ncu input feature data rows are selected from the first buffer circuit and distributed to the Ncu operation circuits, and the corresponding weight data is selected from the second buffer circuit and broadcast to the Ncu operation circuits, so that the output points corresponding to the sliding windows are calculated in parallel by reusing the weight data. Nk sliding selections are performed, where Nk is determined according to the smaller of the convolution kernel size in the X and Y dimensions and the maximum convolution kernel size supported by a single operation of the slave processing circuit in the current convolution splitting mode.
In some embodiments, when performing a conventional three-dimensional convolution operation, the corresponding weight data may be selected as follows: 1/Nop of a weight row is selected from the second buffer circuit in a manner corresponding to the sliding in the first buffer circuit, copied Nop-1 times to expand it into one expanded weight row, and broadcast to the Ncu operation circuits in the slave processing circuit.
Each time a sliding selection is computed, each operation circuit may then perform element-wise multiply-accumulate, in units of 1/Nop of a data row, between one input feature row from the first buffer circuit and one expanded weight row from the second buffer circuit, obtaining Nop partial sums; the partial sums obtained over the Nk sliding selections are accumulated according to the convolution output points to which they belong, to obtain and output Nop operation results.
When the slave processing circuit outputs the output points of its operation circuits, the output points calculated by the operation circuits can be output in a specific order according to the division of the output points, so that consecutively output points are contiguous in the X and/or Y dimension, which facilitates subsequent processing. In some embodiments, the master processing circuit may further store the operation results returned by the respective slave processing circuits in a fourth-dimension storage order. Depending on the situation, the master processing circuit may also convert the operation results into a desired dimension storage order for storage.
The output points can be divided among the operation circuits in various ways, and the sliding-selection convolution process and the output order of the output points differ accordingly.
The whole process of data splitting, storage, sliding convolution, and output of results is described in detail below in connection with the Forward4 scheme.
Shape description of input neurons and weights for Forward4 scheme
In Forward4, the shape of the splitting unit block is 4B × 4 × 4. The exact block shape differs slightly depending on the data type. Table 2 shows the Forward4 block shapes for different data types.
TABLE 2 Forward4 block shapes under different data types (reproduced only as an image in the original)
Figure 9 illustrates a split and store schematic of a Forward4 scheme according to one embodiment of the present disclosure. For simplicity, the illustration in the figure assumes a data type of Int 8.
Reference 910 shows the original data to be operated on (which may be neurons or weights), stored in HWC order. The figure also shows 4 data blocks 911-914 obtained by splitting the original data by the splitting unit; each data block contains 4 × 4 × 4 = 64 data elements.
The split data is shown in a tiled format at 920 for ease of reading. It can be seen that each original data block (e.g. 911-914) becomes one data row (e.g. row 921). Within each row, the data is stored in CHW order; for example, in data row 921 there are first the 16 data with C = 0, then the 16 with C = 1, then the 16 with C = 2, and finally the 16 with C = 3.
Specifically, for neurons, the data needs to be rearranged from the shape [1 Hi Wi Ci] into the seven-dimensional tensor shape [1 × Hi/4 × Wi/4 × Ci/4 × (4 × 4 × 4)].
For the weights, the data needs to be rearranged from the shape [Co Kh Kw Ci] into the seven-dimensional tensor shape [Co × Kh/4 × Kw/4 × Ci/4 × (4 × 4 × 4)].
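The rearrangement can be pictured with a small NumPy sketch (illustrative only; it assumes Int8 data with Hi, Wi and Ci divisible by 4 and uses hypothetical names): it folds every 4 × 4 × 4 neighbourhood of the [1 Hi Wi Ci] tensor into one contiguous 64-element data row stored in CHW order, matching the data rows shown at 920.

```python
import numpy as np

# Hypothetical sketch of the Forward4 neuron rearrangement for Int8 data.
# Input : [1, Hi, Wi, Ci] stored in HWC order.
# Output: [1, Hi/4, Wi/4, Ci/4, 64] where each 64-element row is one 4x4x4 split unit
#         laid out in CHW order (first the 16 values with C = 0, then C = 1, ...).
def to_forward4_blocks(x):
    n, hi, wi, ci = x.shape
    assert n == 1 and hi % 4 == 0 and wi % 4 == 0 and ci % 4 == 0
    x = x.reshape(n, hi // 4, 4, wi // 4, 4, ci // 4, 4)       # split every dim into (outer, 4)
    x = x.transpose(0, 1, 3, 5, 6, 2, 4)                       # -> [n, Ho, Wo, Co, c, h, w] inside block
    return x.reshape(n, hi // 4, wi // 4, ci // 4, 64)

if __name__ == "__main__":
    x = np.arange(1 * 8 * 8 * 4, dtype=np.int8).reshape(1, 8, 8, 4)
    print(to_forward4_blocks(x).shape)   # (1, 2, 2, 1, 64)
```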
From the foregoing description, the Forward4 scheme can support multiple grouping modes. For the neurons, the way the seven-dimensional block-format shape is finally split into the storage regions of the second storage circuit differs slightly for different grouping modes and for different Ho/Wo splitting modes within a group.
Assume the original input neuron size is: [1 hi wi ci]
In the Group1 grouping mode, the arrangement of the input neurons differs according to the Ho/Wo splitting mode:
Ho × Wo = 4 × 4 split: 16 × [hi/(4×4), wi/(4×4), ci/4 × (4×4×4)]
Ho × Wo = 1 × 16 split: 16 × [hi/(4), wi/(16×4), ci/4 × (4×4×4)]
In the 4 × 4 split above, the leading 16 indicates the 16 slave processing circuits SL, and the trailing (4×4×4) (CHW) indicates the BLOCK split from the three dimensions C, H and W. Of the two 4s dividing hi and wi, the first 4 indicates that hi and wi are split into 16 parts distributed to the 16 SLs, and the second 4 indicates that hi and wi are folded into the ci direction. The 1 × 16 split is interpreted in the same way.
In the Group4 grouping mode, the arrangement of the input neurons likewise differs according to the Ho/Wo splitting mode:
Ho × Wo = 1 × 4 split: 4 × 4 × [hi/(1×4), wi/(4×4), ci/4 × (4×4×4)]
For one slave processing circuit SL: [hi/(1×4), wi/(4×4), ci/4 × (4×4×4)]
In the above representation, the first 4 represents the 4 SLBs (the neurons are replicated into 4 copies), the second 4 represents the split of the neurons over the 4 SLs of one SLB, and the final (4×4×4) represents the BLOCK split from the three dimensions C, H and W.
In the Group16 grouping mode, the input neurons do not need to be split, and they are arranged as follows:
16 × [hi/4, wi/4, ci/4 × (4×4×4)]
The leading 16 indicates that the neurons are replicated across the 16 SLs, the final (4×4×4) indicates the BLOCK split from the three dimensions C, H and W, and hi and wi are each divided by 4, indicating the folding of hi and wi into the ci direction.
Output point splitting between operation circuits in Forward4 scheme
When a plurality of operation circuits CU in a single slave processing circuit SL collectively process one Co value, the output points need to be split among the plurality of CUs.
Fig. 10 shows a schematic diagram of assigning interleaved output points to the operation circuits in the Forward4 scheme according to some embodiments of the present disclosure. In these embodiments, the output feature block may be divided equally among the Ncu operation circuits into Nop output feature sub-blocks of the same shape, each containing Ncu output points, which are assigned to the Ncu operation circuits respectively. For example, taking the case where each SL includes 4 CUs and each CU can calculate Nop = 4 output points or partial sums at a time, the output feature block 1010 shown contains 4 × 4 output points, and each of the equally divided output feature sub-blocks 1011-1014 contains 2 × 2 output points. In each output feature sub-block, these 2 × 2 output points are assigned to the 4 operation circuits. Thus, each operation circuit calculates one output point in each of the 4 output feature sub-blocks. The output points assigned to the 4 different operation circuits CU0-CU3 are shown with different backgrounds. As can be seen from the figure, in each calculation, each operation circuit calculates a plurality of output points that are spaced apart in the X and/or Y dimension of the output feature map.
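The interval assignment of fig. 10 can be sketched as follows (illustrative index arithmetic; the exact mapping of CU index to sub-block position is an assumption for illustration).

```python
# Hypothetical sketch: assignment of the output points of a 4x4 output feature block to 4 CUs.
# The block is divided into four 2x2 sub-blocks; each CU takes the same relative position in
# every sub-block, so the 4 points computed by one CU are spaced 2 apart in X and Y.
def cu_output_points(cu, block_h=4, block_w=4):
    dy, dx = divmod(cu, 2)                       # assumed position of this CU inside a 2x2 sub-block
    return [(sy * 2 + dy, sx * 2 + dx)
            for sy in range(block_h // 2)
            for sx in range(block_w // 2)]

if __name__ == "__main__":
    for cu in range(4):
        print(cu, cu_output_points(cu))
    # CU0 -> [(0, 0), (0, 2), (2, 0), (2, 2)]: one point per sub-block, not adjacent in the output map
```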
Based on this division of output points, when the convolution operation is performed by sliding selection, Ncu data rows can be selected from the first buffer circuit according to the positions of the output points of each output feature sub-block and the data required to calculate that sub-block. For example, at the first selection of input feature data, 4 input data rows may be selected from the corresponding input feature blocks, according to the 4 input feature blocks required to calculate the 4 output points in the first output feature sub-block 1011, and distributed to the 4 operation circuits. It will be appreciated that since these 4 output points are consecutive in the X and/or Y direction, the spacing, or step size, in the X and/or Y direction of the 4 simultaneously selected input data rows is 1.
When selecting the weight data, the corresponding weight data can be selected from the second buffer circuit and broadcast to the Ncu operation circuits, so that the output points corresponding to the operation circuits are calculated in parallel by reusing the weight data. Further, in some embodiments, in order to fully exploit the computing power inside each operation circuit CU (e.g. its multiply-add operators), for example to compute Nop output points or partial sums at a time, the weights may also be reused within a single input data row, thereby computing Nop output points or partial sums simultaneously.
For example, when selecting the weight data, only 1/Nop of a weight row may be selected and copied Nop-1 times to expand it into one weight row, the expanded weight row thus containing Nop identical 1/Nop rows. The expanded weight row can then be broadcast to the Ncu operation circuits, so that the weights are reused among the plurality of operation circuits and, at a finer granularity (e.g. 1/Nop of a row), among the computations of the Nop output points of a single operation circuit.
Thus, by taking out Ncu input feature data rows and 1/Nop of a weight row (copied and expanded into one weight row) each time, Ncu × Nop output points or partial sums can be calculated at a time. When the calculation result is a partial sum, partial sums can be calculated over multiple sliding selections, and the partial sums are accumulated according to the output points to which they belong to obtain the final result.
The number of sliding selections and the sliding step size of the convolution operation can be determined from the division of the output points. For the division of fig. 10, the number of sliding selections is Nk = ceil(Kx/2) × ceil(Ky/2), where Kx and Ky are respectively the smaller of the convolution kernel size in the X and Y dimensions and the maximum convolution kernel size supported by a single operation of the slave processing circuit in the current convolution splitting mode, and the sliding step size is 2. The maximum convolution kernel size supported by a single operation of the slave processing circuit is determined, for example, by at least the sizes of the first buffer circuit and the second buffer circuit. It will be appreciated that when the convolution kernel exceeds this maximum size, it must first be split in the Kx and Ky directions according to the maximum convolution kernel size.
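As a quick check of this formula (assuming the 8 × 8 single-operation limit of the Forward4 mode mentioned below; the helper name is hypothetical):

```python
import math

# Hypothetical sketch: number of sliding selections Nk under the output-point division of fig. 10.
def sliding_count(kx, ky, max_kernel=8):
    kx_eff, ky_eff = min(kx, max_kernel), min(ky, max_kernel)   # larger kernels are pre-split
    return math.ceil(kx_eff / 2) * math.ceil(ky_eff / 2)

if __name__ == "__main__":
    print(sliding_count(5, 5))   # 9, matching the 5x5 example of fig. 12
    print(sliding_count(3, 3))   # 4
```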
Convolution sliding process in the Forward4 scheme
Fig. 11 shows a schematic diagram of a single operation in the Forward4 scheme according to an embodiment of the present disclosure. In this example, the first buffer circuit 1110 has a size of 3 × 3 × 64B, i.e. it can buffer at most 9 data rows, and the second buffer circuit 1120 has a size of 2 × 2 × 64B, i.e. it can buffer at most 4 data rows. To be consistent with the splitting units, the storage within the buffer circuits in the figure is also shown in units of splitting units.
The figure shows the operation process of the first sliding selection. With the splitting unit as the sliding window, Ncu input feature rows are selected from the first buffer circuit by sliding, in a manner corresponding to the division of the output points, and are sent to the Ncu operation circuits respectively for calculation. 1/Nop of a weight row is selected from the second buffer circuit in a manner corresponding to the sliding in the first buffer circuit, where Nop is the maximum number of convolution output points each operation circuit can calculate at a time; it is copied Nop-1 times to expand it into an expanded weight row, which is broadcast to the Ncu operation circuits in the slave processing circuit.
Specifically, in the computing device shown in fig. 5, Ncu = Nop = 4. With the output points divided as described above, each operation circuit calculates, in each calculation, 2 × 2 output points that are separated by one point (i.e. at intervals of 2) in the X and Y dimensions.
As shown, one input feature data row is selected from the first buffer circuit 1110 at the starting position and at each position offset by 1 in the X and/or Y direction, giving a total of 4 input feature data rows, which are correspondingly sent to the 4 operation circuits 1140 in the slave processing circuit SL. 1/4 of a weight data row, i.e. a 2 × 2 block of data, is selected from the second buffer circuit 1120 at the starting position, copied 3 times to form an expanded weight data row 1130, and broadcast to the 4 operation circuits 1140 in the SL.
In each calculation, each operation circuit performs element-wise multiply-accumulate, in units of 1/Nop of a data row, on one input feature row from the first buffer circuit and one expanded weight row from the second buffer circuit to obtain Nop partial sums.
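The per-CU step just described can be sketched in NumPy as follows (illustrative only; it assumes Int8 data so that one data row is 64 elements and 1/Nop of a row is 16 elements).

```python
import numpy as np

# Hypothetical sketch: one CU processing one sliding selection (Int8 data assumed).
#   input_line : 64 elements, one 4x4x4 split unit of the input feature map.
#   weight_part: 16 elements, 1/Nop of a weight row (a 4x2x2 piece of the kernel).
# The weight_part is replicated Nop times into one expanded weight row; the CU then
# multiply-accumulates in 1/Nop (16-element) units, producing Nop partial sums.
def cu_single_step(input_line, weight_part, nop=4):
    assert input_line.size == weight_part.size * nop
    expanded = np.tile(weight_part.astype(np.int32), nop)   # copy Nop-1 times -> one weight row
    products = input_line.astype(np.int32) * expanded
    return products.reshape(nop, -1).sum(axis=1)             # Nop partial sums

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.integers(-8, 8, size=64, dtype=np.int8)
    w = rng.integers(-8, 8, size=16, dtype=np.int8)
    print(cu_single_step(x, w))   # 4 partial sums, later accumulated over the Nk slides
```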
As shown, the 4 operation circuits 1140 perform multiply-accumulate operations on the distributed input feature data rows and the broadcast expanded weight data rows to obtain the operation result 1150. The results with different background colors in 1150 represent those obtained by the different operation circuits 1140. It can be seen that each CU calculates the partial sums of 4 output points at a time, so the 4 CUs together obtain 4 × 4 partial sums. It can also be seen that the output points computed by each CU are not adjacent in the XoYo dimensions of the output feature map.
Then, data is synchronously selected by sliding in the first buffer circuit and the second buffer circuit, and the next calculation is performed. Nk sliding selections are performed in total, where Nk = ceil(Kx/2) × ceil(Ky/2), and Kx and Ky are respectively the smaller of the convolution kernel size in the X and Y dimensions and the maximum convolution kernel size supported by a single operation of the slave processing circuit in the current convolution splitting mode. Accordingly, each operation circuit accumulates the Nk × Nop partial sums calculated over the Nk sliding calculations according to the corresponding convolution output points, to obtain Nop operation results.
In some embodiments, in the Forward4 mode the maximum convolution kernel size supported by a single operation of the slave processing circuit is 8 × 8.
Fig. 12 shows a schematic diagram of the sliding convolution process in the Forward4 scheme according to an embodiment of the present disclosure. In this example, the input feature map is 9 × 9, the convolution kernel is 5 × 5, and the convolution stride is 1, so the output feature map size is 5 × 5. The input feature map needs to be aligned to 12 × 12 and divided into 9 blocks of size 4 × 4 × 4 (C × H × W), stored in the first buffer circuit, shown as 1210 with the C dimension omitted. The 5 × 5 convolution kernel needs to be aligned to 8 × 8, with the alignment portion padded with 0, and stored in the second buffer circuit, shown as 1220, again with the C dimension omitted. In each calculation, a 2 × 2 block of the convolution kernel is selected and copied 4 times, which exactly corresponds to a 4 × 4 block of the input feature map; the copying can be performed in hardware.
Fig. 12 shows, in 9 sub-figures representing the 9 sliding selections, the selection ranges of the input feature map and the convolution kernel in the first buffer circuit and the second buffer circuit for each slide. Block 1210 represents the input feature map in the first buffer circuit, where the four dashed boxes represent the regions selected for the four CUs; block 1220 represents the convolution kernel in the second buffer circuit, where the dashed box represents the selected 1/4 row, which is copied 3 times to expand into one row and broadcast to the 4 CUs. The number of sliding selections is Nk = ceil(Kx/2) × ceil(Ky/2) = 9.
In each calculation, each CU performs element-wise multiply-accumulate, in units of 1/4 of a data row, on one input feature data row from the first buffer circuit and one expanded weight data row from the second buffer circuit to obtain 4 partial sums; the Nk partial sums corresponding to the same convolution output point obtained over the Nk calculations of the current round are accumulated, to obtain and output 4 operation results.
Specifically, for each sub-figure in fig. 12, the number of CUs is Ncu = 4, and each CU calculates Nop = 4 output points or partial sums at a time, each being the result of an element-wise multiply-accumulate over 1/4 of a data row, i.e. each output point is a standard convolution of size 4 × 2 × 2 (Ci × Y × X). After sliding Nk = ceil(Kx/2) × ceil(Ky/2) = 9 times and accumulating in the Y × X direction, one SL finally obtains a complete 4 × 4 (Y × X) output (as shown in fig. 10). In this mode, a single calculation only supports convolution kernels no larger than 8 × 8; larger kernels need to be split into 8 × 8 pieces in the Kx and Ky directions, and the split operations can be performed on the same principle.
It can be appreciated that when Ci > 4, it is necessary to traverse in the Ci direction while switching inputs and weights until a complete output is calculated. When the Xo/Yo computed by each CU is greater than 4, sliding along the Xo/Yo direction is needed, reading different input neurons and weights. The calculation process can be derived similarly by those skilled in the art from the foregoing description and is not described in detail here.
Output shape description in Forward4 scheme
As can be seen from the foregoing output point division and sliding convolution process, the result output in this sliding mode is not in the normal arrangement order of conventional convolution output data. Therefore, during output, each slave processing circuit SL may convert the operation results of its operation circuits CU into a specified format, for example a format of Nco × Uy × Ux. In some embodiments, each slave processing circuit may output at a time a partial operation result of some of its internal operation circuits, the partial operation results being contiguous in the X and/or Y dimension of the output feature map. The master processing circuit may further store the operation results returned by the respective slave processing circuits in a fourth-dimension storage order, and may, as required, convert the operation results into a desired dimension storage order for storage.
The output data format differs slightly for different grouping modes and/or different splitting modes of the input feature map within a single SLB (i.e. different Ho/Wo splitting modes of the output feature map).
FIG. 13 shows an output data format of the Forward4 scheme according to an embodiment of the present disclosure. In this embodiment, the grouping mode is Group1, and the splitting mode of the input feature map within the single SLB (comprising 16 SLs) is Ho × Wo = 1 × 16.
The raw output of one SL is shown at 1310. As can be seen from the figure, each SL outputs a 1 × 1 × 4 (Co × Y × X) region at a time, i.e. each SL outputs a partial operation result of some of its internal operation circuits, for example 2 operation results from each of 2 CUs (see fig. 10), and these partial results are contiguous in the X and/or Y dimension of the output feature map, for example lying in the same row (as shown in fig. 13) or the same column. A 1 × 4 × 4 (Co × Y × X) region is returned in 4 consecutive outputs, i.e. 4 operation results for each of the 4 CUs. Different SLs output different regions of the output feature map of the same Co. When the 4 × 4 regions of all Co have been output, the output switches to different output points.
The stored data structure for the 16 SLs is shown at 1320. As shown, after the final output data is written into the storage circuit (e.g. the first storage circuit), it takes the format Yo × Xo × Co × (4 × 16 × 4), where Yo and Xo are the numbers of output feature map blocks divided for each SL, and the 16 corresponds to the division over the 16 SLs. In some implementations, a further rearrangement operation may be performed as needed to convert the data to other desired formats.
As mentioned above, when the grouping mode and/or the splitting mode of the input feature map among the multiple SLs within a single SLB differ, the output data format differs slightly. Assume the original output size is:
1*ho*wo*co
then, the output data shape of Group1 when Ho × Wo is split by 4 × 4 is:
ho/(4*4)*wo/(4*4)*co/group*(4*16*4)
In the above formula, (4 × 16 × 4) is the basic output block of Forward4, whose directions correspond to h × c × w respectively; here the 16 represents the division of the ho and wo of the same co over the 16 SLs, and the block can be decomposed in detail as 4 [high-dimensional ho] × 4 [low-dimensional ho] × 4 [high-dimensional wo] × 4 [low-dimensional wo]. ho and wo are each divided by 4 twice: the first 4 corresponds to the 4 × 4 split when the SLs store the data, and the second 4 corresponds to the folding of the data blocks in the h and w directions. In Group1 mode, group = 1 in the formula above.
The output data shape of Group1 when Ho Wo splits by 1 × 16 is:
ho/(4)*wo/(4*16)*co/group*(4*16*4)
In the above formula, (4 × 16 × 4) is again the basic output block of Forward4, corresponding to the h × c × w directions; here the 16 represents the division of the ho and wo of the same co over the 16 SLs, and the block can be decomposed in detail as 4 [low-dimensional ho] × 16 [high-dimensional wo] × 4 [low-dimensional wo]. In Group1 mode, group = 1. This shape is also the shape schematically shown in fig. 19.
It follows that in the Group1 case, the 16 SLs equally divide the Yo × Xo plane of one output feature map. The intra-row SL dimension of the output data corresponds one-to-one to the way the 16 SLs equally divide the output neurons in the Yo × Xo direction. This scenario is suitable for input neurons with large Y × X extents and a small Co value.
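Reading the decomposition above literally, the conversion of this Group1, 1 × 16 output layout back to the standard [ho, wo, co] arrangement can be sketched in NumPy as follows; the dimension order used is an interpretation of the shapes given here, not a verified reference implementation.

```python
import numpy as np

# Hypothetical sketch: convert the Forward4 output layout for Group1, Ho x Wo = 1 x 16
# back to the standard [ho, wo, co] order. The blocked layout (N omitted) is read as
#   [ho/4, wo/64, co, 4 (low ho), 16 (SL = high wo), 4 (low wo)].
def forward4_output_to_standard(y_blocked):
    ho4, wo64, co, lh, sl, lw = y_blocked.shape
    assert (lh, sl, lw) == (4, 16, 4)
    # Target dimension order: high ho, low ho, high wo (SL), middle wo, low wo, co.
    y = y_blocked.transpose(0, 3, 4, 1, 5, 2)
    return y.reshape(ho4 * 4, sl * wo64 * 4, co)

if __name__ == "__main__":
    ho, wo, co = 8, 128, 32
    blocked = np.arange(ho * wo * co, dtype=np.float32).reshape(ho // 4, wo // 64, co, 4, 16, 4)
    print(forward4_output_to_standard(blocked).shape)   # (8, 128, 32)
```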
The output data shape of Group4 when Ho Wo splits by 2 x 2 is:
ho/(2*4)*wo/(2*4)*co/group*(4*16*4)
In the above formula, (4 × 16 × 4) has the same meaning as above, except that the 16 represents the output division of 4 co values, each over 4 SLs; the block can be decomposed in detail as 4 [co] × 4 [high-dimensional ho] × 2 [low-dimensional ho] × 2 [high-dimensional wo] × 4 [low-dimensional wo]. In Group4 mode, group = 4 in the formula above.
The output data shape of Group4 when Ho × Wo splits by 1 × 4 is:
ho/(1*4)*wo/(4*4)*co/group*(4*16*4)
In the above formula, (4 × 16 × 4) has the same meaning as above, except that the 16 represents the output division of 4 co values, each over 4 SLs; the block can be decomposed in detail as 4 [co] × 4 [low-dimensional ho] × 4 [high-dimensional wo] × 4 [low-dimensional wo]. In Group4 mode, group = 4 in the formula above.
Group16 output data shape is:
ho/4*wo/4*co/group*(4*16*4)
In the above formula, (4 × 16 × 4) has the same meaning as above, except that the 16 represents the output division of 16 co values over the 16 SLs; the block can be decomposed in detail as 4 [low-dimensional ho] × 16 [co] × 4 [low-dimensional wo]. In Group16 mode, group = 16 in the formula above.
Since the different Group modes have different splitting categories in the H × W direction, the 16 in the (4 × 16 × 4) block above has a different detailed decomposition in each case. Because Forward4 uses a 4B × 4 × 4 block as its calculation unit, alignment restrictions during calculation are unavoidable. The alignment restrictions differ for different Group modes and, within the same Group mode, for different H × W splitting modes. For the alignment calculation, the alignment limits of ho and wo can be determined according to the splitting mode of the output feature map, hi and wi are then back-derived from ho and wo, and since the input neurons must be arranged in splitting-unit blocks, the input neurons need to be aligned again. These alignment restrictions are summarized in Table 3 below:
TABLE 3 Alignment restrictions (reproduced only as an image in the original)
In summary, at output time the hardware can automatically output the neurons in the 4 × 16 × 4 (Y × SL × X) dimensions within a row and in the Y × X × C dimensions between rows. The same holds for larger convolution kernels.
Bias shape description in Forward4 implementation
The bias is the offset added after the convolution calculation is finished; its original format is [1 1 co].
Since the data output by Forward4 is in the format ho × wo × co/group × (4 × 16 × 4), if a bias needs to be applied on-chip directly to the data output by Forward4, the basic shape of the bias must be changed. The on-chip placement format of the bias is related to the Group grouping mode. Specifically, the bias is arranged in each grouping mode as follows:
In the Group1 grouping mode, the bias is arranged as: [1 1 co 64]
where 64 indicates that each bias value is replicated 64 times and placed contiguously.
In the Group4 grouping mode, the bias is arranged as: [1 1 co 16]
where 16 indicates that each bias value is replicated 16 times and placed contiguously.
In the Group16 grouping mode, the bias is arranged as: [1 1 co 4]
where 4 indicates that each bias value is replicated 4 times and placed contiguously.
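The bias rearrangement for the three grouping modes can be sketched as follows (illustrative only; the replication count is taken as 64/group from the shapes listed above).

```python
import numpy as np

# Hypothetical sketch: arrange the bias [1, 1, co] into the on-chip layout [1, 1, co, R],
# where R = 64 / group (Group1 -> 64, Group4 -> 16, Group16 -> 4), so that each bias
# value is laid out contiguously R times.
def layout_bias(bias, group):
    repeats = 64 // group
    return np.repeat(bias[..., np.newaxis], repeats, axis=-1)

if __name__ == "__main__":
    bias = np.arange(8, dtype=np.float32).reshape(1, 1, 8)
    print(layout_bias(bias, 1).shape)    # (1, 1, 8, 64)
    print(layout_bias(bias, 4).shape)    # (1, 1, 8, 16)
    print(layout_bias(bias, 16).shape)   # (1, 1, 8, 4)
```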
Data handling process
From the description of the small-convolution operation scheme, it can be seen that both the input neurons and the weights need to undergo splitting and storage-dimension transformation, and the output neurons also need a certain dimension transformation. When the hardware structure of the multi-core computing device shown in fig. 3b is used, in view of hardware IO efficiency, the input data needs to be read from the global memory first and then stored in the shared memory SRAM after loading. As mentioned above, Forward4 requires the neurons to be split, and, taking the alignment factors into account, this splitting characteristic means that Forward4 has more of a computational advantage when processing larger input feature maps with a smaller number of channels. Thus, in the hardware design for Forward4, the larger neurons may be stored on the WRAM and the relatively smaller weights on the NRAM. Meanwhile, since the weight and neuron data need to be arranged in the block form described above, the neurons stored on the WRAM also need to undergo the tensor-data shape transformation via the NRAM.
FIG. 14 illustrates an overall data handling process according to an embodiment of the present disclosure.
As shown, the weights are read from off-chip storage, such as DDR, into the SRAM via the global direct memory access module (GDMA). HW-plane alignment and padding operations are completed on the SRAM. A blocking instruction (Tiling) is used in the process of transferring the data from the SRAM to the NRAM, completing the data transfer together with the data dimension conversion and alignment.
The neurons are transferred by a process similar to that of the weights, except that after being transferred to the NRAM by the blocking instruction, the neurons also need to be transferred to the WRAM. Because most of the neuron data overlaps as the convolution kernel slides during computation, the efficiency of data transfer would otherwise be greatly reduced. To address this issue, some embodiments of the present disclosure employ an img2col instruction to distribute the data, as described in detail below.
The output data can be stored back to the NRAM, its dimension change completed by the blocking instruction, and then carried to the SRAM. Afterwards it may be stored back into the off-chip memory DDR via the GDMA.
Exemplary principles of a blocking instruction
The data dimension change and arrangement is the process of placing tensor data of a specific shape into another required shape. Data transfer refers to the read and write operations performed on data in different memory spaces. As described above, the Forward4 convolution scheme requires the neurons and weights used in the convolution operation to be arranged and aligned according to a specific block pattern. In addition, the output data is produced in the Forward4-specific output format, which requires the tensor data to be arranged in block form before calculation and restored to the normal tensor shape after calculation is finished.
In the embodiments of the present disclosure, a blocking instruction (Trans Tiling) is used to complete the transfer of the input neurons, weights and bias data from the SRAM to the NRAM, and of the output data from the NRAM to the SRAM. During these transfers, both the basic data-movement process and the dimension change and arrangement of the data need to be completed, so as to meet the requirements of the calculation.
The Deform instruction set provides the IO data path with data shape transformation and data type conversion capabilities, and mainly includes functions such as TRANS (transpose), MOVE and ROTATE. The mode implementing the transpose function in this instruction series is named Trans Tiling, and mainly provides performance support for the various shape transformations of the small convolution. Deform divides a 3-dimensional data block into an inner layer and an outer layer. The inner layer has three dimensions (corresponding to the parameters n0-n2 in the instruction); the lowest dimension is in units of bytes, while the next-lowest and highest dimensions are unitless and represent counts of the next lower level. The outer layer also has three dimensions (corresponding to the parameters n3-n5 in the instruction), each representing a multiple of the corresponding inner dimension.
When a small-convolution splitting scheme is implemented, input data stored in a first-dimension storage order (e.g. HWC) needs to be split in units of splitting units, dimension-converted and stored, such that the data within each splitting unit is stored in a second-dimension storage order (e.g. CHW) while the splitting units themselves are stored in a third-dimension storage order (e.g. HWC).
FIG. 15 shows a schematic conceptual diagram of Trans Tiling according to an embodiment of the present disclosure.
The left diagram shows the input data before deformation. It can be seen that the three-dimensional input data is described using six dimensions: n0 and n3 correspond to a first dimension (e.g. the lowest dimension) of the original three-dimensional data, n1 and n4 correspond to a second dimension (e.g. the next-lowest dimension), and n2 and n5 correspond to the third dimension (e.g. the highest dimension) of the data block. In the example in the figure, the inner layer of the input data corresponds to a splitting unit; taking the Forward4 scheme as an example, the inner data block of the input data is a 4B × 4 × 4 data block, where n0 = 4B and n1 = n2 = 4.
The right diagram shows the output data after deformation. The three-dimensional output data is likewise described using six dimensions. Here the inner layer of the output data corresponds to the deformed splitting unit; in the Forward4 scheme, the inner data block of the output data is a 64B × 1 × 1 data block, where n0 = 64B and n1 = n2 = 1.
In addition, the transposing blocking instruction (Trans Tiling) also has an inline transform (inline shuffle) function, including a forward inline transform based on a pre-arrangement table (Pretable) and a backward inline transform based on a post-arrangement table (Posttable). The pre-arrangement table rearranges the n0 data at the input of Tiling, and the post-arrangement table rearranges the n0 data at the output of Tiling. Ignoring the flag bits, the pre- and post-arrangement tables are essentially arrays describing the positions of 64 bytes of data.
FIG. 16 shows a schematic of the pre/post arrangement table.
As shown, the arrangement table specifies, for the input or the output respectively, the rearranged positions of one row of n0-dimension data, which comprises 64B. The 8 bits of each byte comprise: 6 index bits, recording which byte (0 to 63) of the original data is stored at this position; a 1-bit zero_en bit indicating whether zero is forced, where if this bit is 1 a 0 is written and the [5:0] index bits are invalid; and a 1-bit mask bit indicating whether the data at this position is valid.
Through the pre/post arrangement tables, the n0 data of the input of the blocking instruction can be rearranged when needed, and/or the n0 data of the output of the blocking instruction can be rearranged.
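The 8-bit entry format just described can be sketched as follows (the exact bit positions of zero_en and mask are an assumption for illustration; only the field widths are taken from the text).

```python
# Hypothetical sketch of one byte of the pre/post arrangement table. Assumed layout:
#   bits [5:0] : index   - which of the 64 source bytes supplies this position
#   bit  6     : zero_en - when set, a 0 is forced and the index bits are ignored
#   bit  7     : mask    - whether the data at this position is valid
def encode_entry(index, zero_en=False, mask=True):
    assert 0 <= index < 64
    return (index & 0x3F) | (int(zero_en) << 6) | (int(mask) << 7)

def apply_table(table, line):
    # Rearrange one 64-byte n0 row according to a 64-entry table (invalid positions left as 0 here).
    assert len(table) == len(line) == 64
    out = bytearray(64)
    for i, entry in enumerate(table):
        if (entry >> 7) & 1:
            out[i] = 0 if (entry >> 6) & 1 else line[entry & 0x3F]
    return bytes(out)

if __name__ == "__main__":
    identity = [encode_entry(i) for i in range(64)]
    line = bytes(range(64))
    assert apply_table(identity, line) == line
    reverse = [encode_entry(63 - i) for i in range(64)]
    print(list(apply_table(reverse, line)[:4]))   # [63, 62, 61, 60]
```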
Table 4 shows the meaning of each parameter of the blocking instruction. Assuming the bit width of the data to be blocked is dwidth, in units of B (bytes), the data size of one atomic operation of the blocking instruction is called the blocking bit width T, also in units of B (bytes). Among the parameters of the blocking instruction, the 11 parameters n0-n5 and s1-s5 are required to describe the tensor shapes of the inner-layer and outer-layer data, where n0-n2 and s1-s2 describe the inner layer and n3-n5 and s3-s5 describe the outer layer.
TABLE 4 Parameter meanings of the blocking instruction (reproduced only as an image in the original)
For the tensor descriptions before and after execution of the blocking instruction, the input tensor and the output tensor each require a set of parameters, 22 parameters in total: in0-in5, is1-is5, on0-on5 and os1-os5. The blocking instruction may support a variety of blocking bit widths T, e.g. 1B, 2B, 4B, 6B, 8B, 16B, 32B, etc., and the corresponding value can be set according to the blocking task. Therefore, the blocking bit width T is also included as a parameter of the blocking instruction.
When a blocking instruction is used, there are some basic usage limitations or constraints, including for example: in0, in1, in2, on0, on1, on2 ≤ 64; in0 should be 64B-aligned for performance; in0 = on1 × on2 × T and on0 = in1 × in2 × T; in3 × in4 × in5 = on3 × on4 × on5; T ≤ 32B; and the pre/post arrangement tables are 64B.
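These constraints can be collected into a small sanity check (a sketch of the constraints as listed above, not an official validator; the example values follow the Int8, T = 1B configuration discussed later for the output neurons, with the outer dimensions set to 1 for simplicity).

```python
# Hypothetical sketch: check a set of blocking-instruction parameters against the
# constraints listed above (as understood from the text).
def check_tiling_params(T, in_dims, on_dims):
    in0, in1, in2, in3, in4, in5 = in_dims
    on0, on1, on2, on3, on4, on5 = on_dims
    assert T <= 32, "blocking bit width T must not exceed 32B"
    assert all(d <= 64 for d in (in0, in1, in2, on0, on1, on2)), "inner dimensions limited to 64"
    assert in0 == on1 * on2 * T, "in0 must equal on1 * on2 * T"
    assert on0 == in1 * in2 * T, "on0 must equal in1 * in2 * T"
    assert in3 * in4 * in5 == on3 * on4 * on5, "outer element counts must match"

if __name__ == "__main__":
    check_tiling_params(T=1, in_dims=(64, 64, 1, 1, 1, 1), on_dims=(64, 4, 16, 1, 1, 1))
    print("parameters satisfy the listed constraints")
```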
Furthermore, a blocking instruction cannot operate in place, i.e. it requires two memory regions. Accordingly, in an embodiment of the present disclosure, a data processing apparatus is provided that includes a control circuit, a first storage circuit and a second storage circuit. The first storage circuit stores the first data before the blocking instruction is executed; the second storage circuit stores the second data after the blocking instruction is executed. The control circuit is used to configure and execute the blocking instruction. In some embodiments, the data processing apparatus may be, for example, a processor cluster in the multi-core computing apparatus shown in fig. 3b, the control circuit may be a processor core within the cluster, the first storage circuit may be the shared memory SRAM or the NRAM within a processor core of the cluster, and the second storage circuit may correspondingly be the NRAM or the SRAM. When the blocking instruction is executed for different data (input neurons, weights, output neurons, etc.), the dimension changes and transfer processes to be realized differ, and different parameter configuration schemes of the blocking instruction need to be designed.
General scheme for blocking instructions of output neurons
As can be seen from the foregoing description of small-convolution operation schemes such as Forward4, when Forward4 adopts different Group grouping modes and/or different Ho × Wo splitting modes within a group, the formats of the resulting output data (i.e. the output neurons) are not exactly the same; the detailed output data formats are described in the output shape description of the Forward4 scheme above. Since the output data format is not a conventional output format, the output data needs to undergo shape transformation and dimension conversion using the blocking instruction.
For the output neuron, the blocking instruction functions to convert first data stored on a first storage circuit (e.g., NRAM) in a first-dimension storage order into second data stored on a second storage circuit (e.g., SRAM) in a second-dimension storage order in a process of the output neuron being carried from, e.g., NRAM to SRAM.
To better understand the blocking process for the output neurons, the following description explains the blocking-instruction processing of the output data in detail for the case of the Group1 grouping mode and the Ho × Wo = 1 × 16 splitting mode of the Forward4 convolution splitting scheme. Those skilled in the art will appreciate that other output data formats can be processed similarly.
In the case of Group1, Ho × Wo split 1 × 16, the shape of the Forward4 output data is:
1*ho/(4)*wo/(4*16)*co*(4*16*4)
The specific meaning of each dimension of this shape is as described above for the Group1, Ho × Wo = 1 × 16 output format (the detailed table is reproduced only as an image in the original).
the execution purpose of the blocking instruction is to convert the first data in the six-dimensional data format (without considering the N-dimension) into second data in a standard output format (without considering the N-dimension), which is a three-dimensional shape:
[ho*wo*co]
where co represents the lowest storage dimension of the second data, wo represents the next lowest storage dimension of the second data, and ho represents the highest storage dimension of the second data.
Since the output data has too many dimensions, and the (4 × 16 × 4) block also contains data that crosses dimensions, some simplification is required before the blocking instruction can be executed. Specifically, related dimensions of the output data can be merged to form three-dimensional data, part of the data can then be processed by the blocking instruction in an ordered manner, and finally all the data is converted back to the standard format.
In some embodiments, the control circuit in the data processing apparatus may regard the first data, from its six-dimensional shape, as the following three-dimensional equivalent shape before the blocking process:
[ho/4 × wo/(4 × 16)] × co × (4 × 16 × 4)
where the highest dimension is the merged [ho/4 × wo/(4 × 16)], the next-highest dimension is co, and the lowest dimension is 4 [low-dimensional ho] × 16 [high-dimensional wo] × 4 [low-dimensional wo]. It can be seen that in this three-dimensional equivalent shape, the highest and next-highest dimensions already conform to the dimension order of the final second data, and only the lowest dimension needs to be adjusted. Specifically, the 4 [low-dimensional ho] × 16 [high-dimensional wo] part of the lowest dimension needs to be adjusted, as a whole, to lie between the high-dimensional ho and the middle-dimensional wo of the highest dimension, while the 4 [low-dimensional wo] part of the lowest dimension needs to be adjusted to lie between the highest dimension and the next-highest dimension, thereby conforming to the h × w × c order.
Alternatively or additionally, in some embodiments, the control circuit in the data processing apparatus may regard the second data, from its three-dimensional shape, as a corresponding three-dimensional equivalent shape after the blocking process (the shape formula is reproduced only as an image in the original; it corresponds to the per-block (4 × 16) × 4 × co_full layout described below).
FIG. 17 illustrates a schematic diagram of a block instruction being performed on output neuron data, in accordance with an embodiment of the present disclosure.
The left diagram shows the output neuron data (i.e. the input tensor of the blocking instruction) before the blocking process. It can be seen that the six-dimensional output neuron data can be represented by the three-dimensional equivalent shape [ho/4 × wo/(4 × 16)] × co × (4 × 16 × 4).
To switch the ho and wo contained in the low dimension (4 × 16 × 4) back to the high dimensions, a data block of size 1 × co_full × (4 × 16 × 4) may be processed each time, where co_full represents the integer segment of co aligned to the basic alignment value M of the blocking instruction.
Thus, the input tensor can be divided into an inner and an outer layer, each described using three dimensions. Since the 4 [low-dimensional ho] × 16 [high-dimensional wo] part of the lowest dimension (4 × 16 × 4) needs to have its order adjusted as a whole, and it also satisfies the M = 64B alignment requirement, it can be taken as the 4 × 16 inner data block 1701. According to the restrictions of the blocking instruction, the in0 dimension of the inner data block 1701 is aligned to the first alignment value, e.g. M = 64B; the in1 dimension can be set to M/dwidth = 64B/dwidth according to the data bit width dwidth; and the in2 dimension is set to 64/in1, so that in2 × in1 × in0 = (64/in1) × in1 × M = 64 × 64B. After the inner data blocks are determined, the sizes of the outer three dimensions in3, in4 and in5 can be determined accordingly; each equals the number of inner data blocks contained in the corresponding dimension.
After the chunking process is completed, the data chunks of 1 × co _ full (4 × 16 × 4) will be converted into data chunks of (4 × 16) × 4 × co _ full, i.e., the data of the lower dimension has been converted into the higher dimension.
The right diagram in fig. 17 shows the output neuron data after the blocking process (i.e. the output tensor of the blocking instruction). It can be seen that the shape of the output neuron data has now become (4 × 16) × 4 × co_full.
It is likewise divided into an inner and an outer layer, each described using three dimensions. Since the neuron data needs extensive positional adjustment, combined with the constraints of the blocking instruction, the blocking bit width T may be set to dwidth, i.e. the data size of one atomic operation is one data element, which makes it convenient to adjust the storage order element by element. The inner data block 1702 then corresponds to the inner data block 1701 of the input tensor, but with a changed shape. For example, for a co integer segment, the shape changes from the flat 1 × co_full × (4 × 16 × 4) form to the upright (4 × 16) × 4 × co_full form. According to the constraints of the blocking instruction, the on0 dimension of the inner data block is set to in1 × in2 × T = 64T; the on1 dimension is set to 4; and the on2 dimension is set to in0/T/on1 = 16B/T. After the inner data blocks are determined, the sizes of the outer three dimensions on3, on4 and on5 can be determined accordingly; each equals the number of inner data blocks of the output tensor contained in the corresponding dimension.
The above describes the blocking process for one data block of size 1 × co_full × (4 × 16 × 4) at a time. Multiple such processes can be performed cyclically in a certain order. Thus, in some embodiments, the control circuit may execute the blocking instruction in a loop over the three-dimensional-equivalent-shaped output neurons (i.e. the first data), the loop comprising three levels: an inner co-dimension loop, a middle wo-dimension loop, and an outer ho-dimension loop.
Specifically, in some embodiments, the inner co-dimension loop comprises: dividing the data into an integer segment and/or a remainder segment according to the co-dimension size, where the co-dimension size of the integer segment is aligned to the basic alignment value M of the blocking instruction and the co-dimension size of the remainder segment is smaller than M; and repeating repeat_co times, where repeat_co = (co + M - 1)/M = (co + 64 - 1)/64, with one data block for every M channels along the co dimension and one data block for the remainder segment.
Alternatively or additionally, in some embodiments, the middle wo-dimension loop comprises: repeating repeat_wo times, where repeat_wo = wo/(4 × 16), with one data block per step along the wo dimension.
Alternatively or additionally, in some embodiments, the outer ho-dimension loop comprises: repeating repeat_ho times, where repeat_ho = ho/4, with one data block per step along the ho dimension.
In total, the three loops require executing repeat_co × repeat_wo × repeat_ho blocking instructions.
In the above blocking-instruction scheme, the (4 × 16 × 4) dimensions are restored appropriately, so no arrangement table is needed. Since the processing is performed over multiple repeats, an offset must be applied to the data storage space before each blocking operation.
In particular, in some embodiments, each time before the blocking instruction is executed for a data block, the control circuit may set an input tensor offset and an output tensor offset for the blocking instruction according to the size of the data already processed, where the input tensor offset represents the offset of the data block before processing relative to the starting storage address of the first data, and the output tensor offset represents the offset of the data block after processing relative to the starting storage address of the second data.
In one example, the offsets may be applied, and the entire processing controlled, according to the logic shown in the pseudo-code of Table 5 below.
TABLE 5 Output neuron blocking instruction loop processing logic (reproduced only as an image in the original)
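Since Table 5 is reproduced only as an image, the control flow it describes can be sketched as follows, as a reconstruction from the surrounding text with hypothetical names; the offset arithmetic is illustrative only and expressed in elements rather than bytes.

```python
# Hypothetical reconstruction of the three-level loop issuing blocking instructions for the
# output neurons: outer ho loop, middle wo loop, inner co loop split into integer segments
# of M channels plus an optional remainder segment. `issue_tiling` stands in for configuring
# and launching one blocking instruction.
def process_output_neurons(ho, wo, co, M=64, issue_tiling=None):
    if issue_tiling is None:
        issue_tiling = lambda **kw: print(kw)
    repeat_ho, repeat_wo = ho // 4, wo // (4 * 16)
    co_full, co_rem = (co // M) * M, co % M
    block = 4 * 16 * 4                                        # elements in one (4*16*4) output block
    for iho in range(repeat_ho):
        for iwo in range(repeat_wo):
            src_tile = (iho * repeat_wo + iwo) * co * block   # start of this tile in the first data
            dst_tile = (iho * 4 * wo + iwo * 4) * co          # first element of this tile in [ho, wo, co]
            for ico in range(co // M):                        # integer segments, M channels each
                issue_tiling(src=src_tile + ico * M * block, dst=dst_tile + ico * M, co_size=M)
            if co_rem:                                        # remainder segment
                issue_tiling(src=src_tile + co_full * block, dst=dst_tile + co_full, co_size=co_rem)

if __name__ == "__main__":
    process_output_neurons(ho=8, wo=128, co=96)   # 2 * 2 * (1 + 1) = 8 blocking instructions issued
```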
In actual operation, the shape of the output neurons changes dynamically, i.e. the size of co is arbitrary. As mentioned previously, in the inner co loop the blocking instruction may be executed for an integer segment and/or a remainder segment according to the co-dimension size, where the co-dimension size of the integer segment is aligned to the basic alignment value M of the blocking instruction and every M channels form one data block, while the co-dimension size of the remainder segment is smaller than M and the remainder segment forms a single data block.
It will be appreciated that, depending on the value of co, there may be only an integer segment, only a remainder segment, or both. Suppose the 64B-aligned integer segment length within co is co_full and the non-64B-aligned remainder is co_rem. For example, for an Int8 output neuron of 256 × 96, co = 96, so co_full = 64 and co_rem = 32.
Table 6 shows the shape change of the output neuron data before and after the execution of the blocking instruction.
TABLE 6 Shape change of the output neuron data before and after the blocking process (reproduced only as an image in the original)
Note that the shapes assumed in Table 6 are the parameters after the alignment required by the alignment restrictions in Table 3 above for the different Group modes and different H × W splits of Forward4.
For the integer segment portion, the first chunking instruction may be configured with reference to what was described above in connection with fig. 17.
In one example, for the Forward4 scheme with the Group1 grouping mode and the Ho × Wo = 1 × 16 splitting mode, when M = 64B the first blocking instruction for the integer segment may be configured as in Table 7 below.
TABLE 7 Parameter configuration of the blocking instruction for the integer segment of the output neurons (reproduced only as an image in the original)
Here dwidth denotes the data bit width and B denotes bytes; T is the blocking bit width, i.e. the data amount of one atomic operation of the blocking instruction; in0, in1 and in2 denote respectively the inner lowest, next-lowest and highest dimension sizes of the inner data block of the input tensor of the first blocking instruction; in3, in4 and in5 denote the three outer dimension sizes of the input tensor, whose values equal the numbers of inner data blocks of the input tensor contained in the corresponding dimensions; is1 to is5 denote the step sizes of the five dimensions of the input tensor other than the inner lowest dimension; on0, on1 and on2 denote respectively the inner lowest, next-lowest and highest dimension sizes of the inner data block of the output tensor of the first blocking instruction; on3, on4 and on5 denote the three outer dimension sizes of the output tensor, whose values equal the numbers of inner data blocks of the output tensor contained in the corresponding dimensions; and os1 to os5 denote the step sizes of the five dimensions of the output tensor other than the inner lowest dimension.
For the remainder segment portion, the second blocking instruction may be configured with slight adjustments relative to the integer segment portion.
In one example, when M = 64B for the Forward4 scheme, Group1 grouping pattern and Ho × Wo 1 × 16 split mode, the second blocking instruction for the remainder segment may be configured as in Table 8 below.
TABLE 8 Parameter configuration scheme of the output neuron remainder segment blocking instruction
Here dwidth denotes the data bit width, co_rem denotes the co-dimension size of the remainder segment, and B denotes bytes; T is the blocking bit width, i.e. the amount of data handled by one atomic operation of the blocking instruction; in0, in1 and in2 denote the inner-layer lowest, middle and highest dimension sizes of the inner data block of the input tensor of the second blocking instruction; in3, in4 and in5 denote the three outer dimension sizes of the input tensor, each equal to the number of inner data blocks of the input tensor contained in the corresponding dimension; is1 to is5 denote the step sizes of the five dimensions of the input tensor other than the inner-layer lowest dimension; on0, on1 and on2 denote the inner-layer lowest, middle and highest dimension sizes of the inner data block of the output tensor of the second blocking instruction; on3, on4 and on5 denote the three outer dimension sizes of the output tensor, each equal to the number of inner data blocks of the output tensor contained in the corresponding dimension; and os1 to os5 denote the step sizes of the five dimensions of the output tensor other than the inner-layer lowest dimension.
Thus, the blocking processing scheme for the output neuron has been described above in connection with the concrete example of the Forward4 scheme, Group1 grouping mode and Ho × Wo 1 × 16 split mode. As can be seen from the output shape description of the Forward4 scheme, although different grouping patterns and different intra-group Ho × Wo split patterns lead to different formats of the final output neurons, these formats can all be summarized in the following form:
[high-dimensional ho] × [middle-dimensional wo] × [co dimension] × [multidimensional blending]
More specifically:
ho/(hgs × 4) [high dimension ho] × wo/(wgs × 4) [middle dimension wo] × co/group [co dimension] × (4 × 16 × 4)
where hgs and wgs respectively represent the intra-group Ho × Wo split mode and hgs × wgs = 16/group; the lower dimensions (4 × 16 × 4) are mixed dimensions whose meaning differs depending on the grouping pattern and/or intra-group split mode, but which always comprise a mixture of dimensions drawn from the following: co, high-dimensional ho, low-dimensional ho, high-dimensional wo, and low-dimensional wo.
In order to convert such a mixed multidimensional output neuron into the three-dimensional data [ho wo co], it is necessary, as before, to return the ho and wo dimensions currently located in the low-dimensional mixture (4 × 16 × 4) to their proper positions. Thus, also in the manner described before, the multidimensional output neuron can first be regarded, by dimension merging, as an equivalent three-dimensional shape:
([high-dimensional ho] × [middle-dimensional wo]) × [co dimension] × [multidimensional blending]
Here again the [multidimensional blending] contains at least two parts to be merged back: the first part needs to be moved in between the current highest dimension ho and the middle dimension wo as a whole, and the second part needs to be moved in between the current highest dimension and the next-highest dimension, i.e. between the middle dimension wo and the co dimension.
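As a hedged illustration of this rearrangement (the assignment of the trailing (4 × 16 × 4) factors to low-dimensional ho and wo below is an assumption made purely for this example and is not taken from the present embodiment), a NumPy sketch of moving dimension-mixed data back into the [ho wo co] order might look as follows.

import numpy as np

# Hypothetical sizes: ho = 8, wo = 128, co = 32 (not taken from the embodiment).
ho_hi, ho_lo = 2, 4                # assume ho = ho_hi * 4
wo_hi, wo_mid, wo_lo = 2, 16, 4    # assume wo = wo_hi * 16 * 4
co = 32

# Dimension-mixed layout: [ho_hi][wo_hi][co][(ho_lo, wo_mid, wo_lo)]
mixed = np.arange(ho_hi * wo_hi * co * ho_lo * wo_mid * wo_lo)
mixed = mixed.reshape(ho_hi, wo_hi, co, ho_lo, wo_mid, wo_lo)

# Move the low-order ho/wo factors back next to their high-order parts,
# then collapse to the desired [ho, wo, co] order.
hwc = mixed.transpose(0, 3, 1, 4, 5, 2).reshape(ho_hi * ho_lo, wo_hi * wo_mid * wo_lo, co)
print(hwc.shape)  # (8, 128, 32)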
Similarly, a data block of size 1 × co_full × (4 × 16 × 4) is processed each time, and the multiple passes are performed in three loops: an inner co-dimension loop, a middle wo-dimension loop and an outer ho-dimension loop.
The inner co-dimension loop can be processed as before and can likewise be divided into an integer segment and/or a remainder segment: in the integer segment, every M along the co dimension forms one data block, and the remainder segment forms one data block; the loop repeats repeat_co times, where repeat_co = (co + M − 1)/M.
The middle wo-dimension loop includes: processing one data block per step along the wo dimension, repeated repeat_wo times, where repeat_wo equals the middle-dimension wo size mentioned above.
The outer ho-dimension loop includes: processing one data block per step along the ho dimension, repeated repeat_ho times, where repeat_ho equals the high-dimension ho size.
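For illustration only, the three-level loop and the offset bookkeeping described above might be sketched in Python as follows; the function run_blocking_loops, the callback issue and the offset increments are hypothetical and merely indicate the control flow, not the actual pseudo code of Table 5.

def run_blocking_loops(co, M, repeat_wo, repeat_ho, issue):
    """Sketch of the loop control flow: outer ho loop, middle wo loop, inner co loop,
    updating the input/output tensor offsets before each blocking instruction."""
    segments = [M] * (co // M) + ([co % M] if co % M else [])  # integer segments, then remainder
    in_off = out_off = 0
    for _ in range(repeat_ho):          # outer ho-dimension loop
        for _ in range(repeat_wo):      # middle wo-dimension loop
            for seg in segments:        # inner co-dimension loop
                issue(seg, in_off, out_off)
                in_off += seg * 256     # advance past the block just read (illustrative)
                out_off += seg * 256    # advance past the block just written (illustrative)

# Example: co = 96B, M = 64B, two wo steps, two ho steps
run_blocking_loops(96, 64, 2, 2, lambda seg, i, o: print(seg, i, o))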
The configuration of the specific blocking instructions can be designed similarly according to the aforementioned principles and will not be expanded here.
Thus, the disclosed embodiments provide a blocking processing scheme for output neuron data that can rearrange dimension-mixed multi-dimensional data into a specified dimension order. In some embodiments, the blocking process may be simplified by appropriate shape equivalence. Further, through the three-level loop, data can be placed block by block into the desired locations. The division of co into integer and remainder segments makes the scheme suitable for output neurons of arbitrary shape.
Blocking instruction optimization scheme for output neurons
In actual operation, the shape of the output neuron changes dynamically, i.e. the size of co is arbitrary. As mentioned previously, to accommodate the general case, in the inner co loop the blocking instructions may be executed separately for an integer segment and/or a remainder segment according to the co-dimension size. While this general scheme can support output neuron data of any scale, 100% efficiency is achieved only when co is an integer multiple of 64B, i.e. only the integer segment exists.
The blocking instruction has two requirements for full performance: first, in0 should be a multiple of 64B; second, on0 = in1 × in2 × T should be a multiple of 64B. That is, in0 and on0 both need to be aligned to the base alignment value M = 64B. In Forward4 output neuron blocking, since the lowest dimension of the blocking process is (4 × 16 × 4) = 256, 256 × dwidth is inevitably an integer multiple of 64B regardless of the output data type, so in0 always reaches full performance. However, for on0 = in1 × in2 × T, since in2 = 1 and T = dwidth, whether on0 reaches full performance depends on in1, and in1 is determined exactly by co.
The small convolution operation scheme mainly targets scenarios with a small number of channels, where co is usually less than 64B. In this case, if the blocking scheme that supports output neuron data of arbitrary size continues to be used, performance suffers. Therefore, a more efficient optimization scheme is needed for the scenario where co is smaller than the base alignment value M (64B) of the blocking instruction.
In the small-channel scenario, the main reason why the performance of the blocking instruction cannot be fully exploited is that in1 is "under-filled", i.e. co is too small. The inventors noted that in the output tensor shape (ho/4) × (wo/4/16) × co × (4 × 16 × 4), splits taken from adjacent dimensions are equivalent, and the dimensions adjacent to co are wo/4/16 and the lowest-dimension 4; therefore data from the adjacent dimensions can be complemented into the co dimension for blocking, which solves the problem of in1 being under-filled and losing performance.
As already mentioned above, the specific meaning of the output tensor shape (ho/4) × (wo/4/16) × co × (4 × 16 × 4) is as follows: ho/4 is the high-dimensional part of ho, wo/4/16 is the middle-dimensional part of wo, co is the channel dimension, and the lowest-dimensional (4 × 16 × 4) is the multidimensional blend containing the remaining low-dimensional ho and wo factors.
Since the smaller co is, the more data must be complemented, and the lowest dimension 4 can contribute a factor of at most 4, the lowest dimension alone can support co alignment of at most 16B and 32B. For still smaller co, e.g. the 8B and 4B scenarios, an additional factor of 2 or 4 must be complemented; because the low-dimensional 4 is followed by the high-dimensional wo 16, a cross-dimension problem arises, so the extra factor can only be complemented from the middle-dimensional wo/4/16.
In some embodiments, the control circuit in the aforementioned data processing apparatus may determine the preferred alignment value P according to the co-dimension size of the first data; determine the complement allocation of the dimensions adjacent to the co dimension according to the preferred alignment value P; and configure and execute the blocking instructions according to the complement allocation. It will be appreciated that when the preferred alignment value P is smaller than the base alignment value M of the blocking instruction, in1 is easier to fill, thereby fully exploiting the performance of the blocking instruction.
Further, when the preferred alignment value P is smaller than the base alignment value M of the blocking instruction, the control circuit may determine the complement allocation for the dimensions adjacent to the co dimension as follows:
when M/P ≤ 4, Ws1 = M/P and Ws2 = 1; and
when M/P > 4, Ws1 = 4 and Ws2 = M/(4P),
where Ws1 indicates that Ws1 folds of data are split from the adjacent low-dimensional side of the co dimension (e.g. the low-dimensional ho described above) to complement the co dimension, and Ws2 indicates that Ws2 folds of data are split from the adjacent high-dimensional side of the co dimension (e.g. the middle-dimensional wo described above) to complement the co dimension. Thus, by padding data from one or more adjacent dimensions into the co dimension, the alignment requirement on co can be reduced.
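For illustration only, the above rule may be expressed by the following Python sketch; the function name complement_allocation is hypothetical, and P is assumed to divide M evenly.

def complement_allocation(M: int, P: int) -> tuple:
    """Return (Ws1, Ws2): folds borrowed from the low- and high-dimensional neighbours of co."""
    ratio = M // P
    if ratio <= 4:
        return ratio, 1       # Ws1 = M/P, Ws2 = 1
    return 4, ratio // 4      # Ws1 = 4,   Ws2 = M/(4P)

print(complement_allocation(64, 32))  # (2, 1)
print(complement_allocation(64, 8))   # (4, 2)
print(complement_allocation(64, 4))   # (4, 4)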
In some embodiments, the preferred alignment value P may be determined according to the range of values of co. Table 9 shows the optimization schemes corresponding to different preferred alignment values P.
TABLE 9 Optimization scheme for small-channel blocking
As can be seen from table 9, the control circuit may determine the preferred alignment value P according to the co-dimension size of the first data as follows:
when 0 < co ≤ 4B, P = 4B;
when 2^n · 4B < co ≤ 2^(n+1) · 4B and 2^(n+1) · 4B < M, where n is a non-negative integer, P = 2^(n+1) · 4B; and
when M/2 < co, P = M.
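For illustration only, one way to realize this selection rule in Python is sketched below; the function name preferred_alignment is hypothetical, and co and M are given in bytes.

def preferred_alignment(co_bytes: int, M: int = 64) -> int:
    """Smallest value in {4B, 8B, 16B, 32B, ...} that is >= co, capped at M."""
    p = 4
    while p < co_bytes and p < M:
        p *= 2
    return min(p, M)

for co in (3, 4, 7, 16, 24, 40, 96):
    print(co, preferred_alignment(co))   # 3->4, 4->4, 7->8, 16->16, 24->32, 40->64, 96->64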
It can also be seen from Table 9 that the implementation of the blocking instruction in the small-channel scenario is related to the number of complements from adjacent dimensions: in the table, Ws1 represents the number of folds split from the lowest dimension (the 4 × 16 × 4 dimension) into the co direction (short for wo supply 1), and Ws2 represents the number of folds split from the higher dimension (the wo/4/16 dimension) into the co direction (short for wo supply 2). Note that the smaller co is, the smaller the preferred alignment value P becomes, the more data needs to be complemented from the other dimensions, and the stricter the restriction on wo becomes; specifically, wo/4/16/Ws2 ≥ 1 is required.
In addition, although the rules for determining the preferred alignment value P are listed above, these rules are merely preferred embodiments that select the alignment value best suited to the current co value. As can be seen from Table 9, an optimization scheme with a larger co alignment is compatible with one with a smaller co alignment, i.e. a large-P scheme can cover a small-P scheme. For example, the co 64B alignment scheme can handle co of any scale, and the co 32B scheme can handle the range co ≤ 32B.
FIG. 18 shows a diagram of a blocking processing optimization scheme for an output neuron according to an embodiment of the present disclosure. Table 10 shows the parameter configuration scheme of the blocking instruction optimization scheme for the small-channel scenario (co ≤ 32B) of output neuron data.
TABLE 10 Parameter configuration scheme of the small-channel blocking instruction optimization scheme for output neuron data
Here B denotes bytes; in0, in1 and in2 denote the inner-layer lowest, middle and highest dimension sizes of the inner data block of the input tensor of the blocking instruction; in3, in4 and in5 denote the three outer dimension sizes of the input tensor, each equal to the number of inner data blocks of the input tensor contained in the corresponding dimension; is1 to is5 denote the step sizes of the five dimensions of the input tensor other than the inner-layer lowest dimension; on0, on1 and on2 denote the inner-layer lowest, middle and highest dimension sizes of the inner data block of the output tensor of the blocking instruction; on3, on4 and on5 denote the three outer dimension sizes of the output tensor, each equal to the number of inner data blocks of the output tensor contained in the corresponding dimension; and os1 to os5 denote the step sizes of the five dimensions of the output tensor other than the inner-layer lowest dimension.
As can be seen from fig. 18 and Table 10, the output neuron data before the blocking process (i.e. the input tensor of the blocking instruction) can be represented as the three-dimensional equivalent shape ((ho/4) × (wo/4/16/Ws2)) × (Ws2 × co × Ws1) × (4 × 16 × 4/Ws1).
In this optimization scheme co ≤ 32B, so co needs to be complemented from the adjacent dimensions. The figure shows Ws1 folds being split from the lowest dimension (the 4 × 16 × 4 dimension) into the co direction and Ws2 folds being split from the higher dimension (the wo/4/16 dimension) into the co direction. To switch the ho and wo parts in the low dimension (4 × 16 × 4) back to the high dimensions, a data block of size 1 × (Ws2 × co × Ws1) × (4 × 16 × 4/Ws1) may be processed at a time.
As can be seen from the above complementing process, data in the lowest-dimension (4 × 16 × 4) direction is processed into the next-highest-dimension co direction. In the general scheme of the output data blocking instruction, because performance is not the concern, the blocking bit width T equals the data bit width dwidth, i.e. the minimum unit processed each time is exactly one output element, in which case the read/write order of the data does not need to change. In the output data blocking instruction optimization scheme, in order to incorporate the "4" of the lowest dimension into on0 (on0 = in1 × in2 × T), embodiments of the present disclosure may place the supplementary data inside the blocking bit width T, which neither greatly affects the read/write order handled by the blocking instruction nor prevents the performance from being filled.
Thus, the input tensor can be divided into an inner layer and an outer layer, each represented using three dimensions. As shown in the left diagram of fig. 18, according to the constraints of the blocking instruction the in0 dimension of the inner data block 1801 is aligned to the first alignment value, e.g. M = 64B; the in1 dimension may then be set to co; and the in2 dimension is set to 1 × Ws2, so that in2 × in1 × in0 = (1 × Ws2) × co × 64B. After the inner data block is determined, the sizes of the three outer dimensions in3, in4 and in5 can be determined accordingly, each being equal to the number of inner data blocks contained in the corresponding dimension.
After the blocking process is completed, the data block 1 × (Ws2 × co × Ws1) × (4 × 16 × 4/Ws1) is converted into a data block of (4 × 16) × (4/Ws1) × (Ws2 × Ws1 × co), i.e. the data of the lower dimensions is converted into data of the higher dimensions.
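The following NumPy sketch illustrates the shape relationship of this conversion for one data block; co = 16 INT8 elements, Ws1 = 4 and Ws2 = 1 are assumed sizes, and the exact element permutation performed by the hardware blocking instruction and its post-configuration table may differ.

import numpy as np

Ws1, Ws2, co = 4, 1, 16   # assumed sizes for illustration

# One input data block: 1 x (Ws2*co*Ws1) x (4*16*4/Ws1), with the leading 1 dropped.
blk = np.arange(Ws2 * co * Ws1 * (256 // Ws1)).reshape(Ws2, co, Ws1, 64, 4 // Ws1)

# One plausible element mapping for
# (Ws2*co*Ws1) x (256/Ws1)  ->  (4*16) x (4/Ws1) x (Ws2*Ws1*co):
out = blk.transpose(3, 4, 0, 2, 1).reshape(64, 4 // Ws1, Ws2 * Ws1 * co)
print(out.shape)  # (64, 1, 64)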
The right diagram in fig. 18 shows the output neuron data after the blocking process (i.e. the output tensor of the blocking instruction). It can be seen that the output neuron data shape at this point becomes (ho/4) × (wo/4/16/Ws2) × (4 × 16) × (4/Ws1) × (Ws2 × Ws1 × co).
It is likewise divided into an inner layer and an outer layer, each described using three dimensions. The inner data block 1802 may correspond to the inner data block 1801 of the input tensor, but with a changed shape. According to the constraints of the blocking instruction, the on0 dimension of the inner data block is set to in1 × in2 × T = co × 1 × Ws2 × Ws1 × dwidth = 64 × dwidth; the on1 dimension is set to 4/Ws1, and the on2 dimension is set to in0/T/on1. After the inner data block is determined, the sizes of the three outer dimensions on3, on4 and on5 can be determined accordingly, each being equal to the number of inner data blocks of the output tensor contained in the corresponding dimension.
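For illustration only, the inner-block parameters just described might be derived as in the following Python sketch; the function name table10_parameters is hypothetical and the derivation is a reading of the description above rather than the exact table of the embodiment.

def table10_parameters(co: int, Ws1: int, Ws2: int, dwidth: int = 1) -> dict:
    """Derive the inner-block parameters (co in elements, dwidth in bytes)."""
    T = Ws1 * dwidth                      # blocking bit width: bytes per atomic operation
    in0 = (4 * 16 * 4 // Ws1) * dwidth    # inner lowest dimension of the input block, in bytes
    in1, in2 = co, 1 * Ws2                # inner middle / highest dimensions of the input block
    on0 = in1 * in2 * T                   # inner lowest dimension of the output block, in bytes
    on1 = 4 // Ws1                        # inner middle dimension of the output block
    on2 = in0 // T // on1                 # inner highest dimension of the output block
    return dict(T=T, in0=in0, in1=in1, in2=in2, on0=on0, on1=on1, on2=on2)

# co = 16 INT8 elements with 16B alignment: Ws1 = 4, Ws2 = 1
print(table10_parameters(16, 4, 1))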
Further, as described above, since extra wo data is supplemented through Ws1 and Ws2, the data positions of the multiple read/write operations need to be adjusted appropriately using a post-configuration table, so that the output data of one micro-instruction in the blocking process is changed from the [co wo] order to the [wo co] order.
Thus, in some embodiments, the control circuit further configures the blocking instruction as follows: a post-configuration table of the blocking instruction is set, and the inner-layer lowest-dimension data of the output tensor of the blocking instruction is rearranged according to the indication of the post-configuration table. Specifically, the control circuit may set the post-configuration table so as to convert the inner-layer lowest-dimension data of the output tensor from the [co wo] dimension order into the [wo co] dimension order.
In one implementation, the post-configuration table may be configured according to the logic shown in the pseudo code of Table 11 below.
TABLE 11 Post-configuration table pseudo code for the output neuron optimization scheme blocking instruction
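Since the pseudo code of Table 11 is given only in the drawing, the following Python sketch shows, under stated assumptions, what such a post-configuration table could look like: a permutation that moves the inner-most data from [co wo] order to [wo co] order; the function name and the index scheme are hypothetical.

def post_configuration_table(co: int, wo_sup: int) -> list:
    """Permutation table rearranging inner-most output data from [co, wo_sup] to [wo_sup, co] order,
    where wo_sup = Ws1 * Ws2 is the number of supplemented wo elements."""
    table = []
    for w in range(wo_sup):                 # target order: wo outer ...
        for c in range(co):                 # ... co inner
            table.append(c * wo_sup + w)    # source index in [co, wo] order
    return table

# co = 4, wo_sup = 2 -> [0, 2, 4, 6, 1, 3, 5, 7]
print(post_configuration_table(4, 2))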
The optimization scheme of the small-channel blocking instruction for output neurons has been described above. On the whole, the blocking logic of the optimized scheme is identical to that of the general scheme described above, i.e. the blocking instructions are executed through a three-level loop comprising an inner co-dimension loop, a middle wo-dimension loop and an outer ho-dimension loop; the implementation, however, differs slightly.
For the inner co-dimension loop, because the optimization scheme only considers the small-channel scenario, co is smaller than 64B, so there is no division into an integer segment and a remainder segment; the whole co dimension is processed as one data block, i.e. the inner co-dimension loop count repeat_co = 1.
For the middle wo-dimension loop, because the optimization scheme supplements Ws2 folds in the wo direction, split from the middle wo dimension (e.g. wo/4/16), each step along the wo dimension still processes one data block, but the repeat count repeat_wo = middle-dimension wo/Ws2; in the foregoing example, repeat_wo = wo/4/16/Ws2.
For the outer ho-dimension loop, the direction is unchanged, so one data block is still processed per step along the ho dimension, repeated repeat_ho times, where repeat_ho equals the high-dimension ho size; in the foregoing example, repeat_ho = ho/4.
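For illustration only, the loop counts of the optimized scheme in the foregoing Forward4, Group1, 1 × 16 split example might be computed as follows; the function name is hypothetical.

def optimized_loop_counts(ho: int, wo: int, Ws2: int) -> tuple:
    """Loop counts for the small-channel optimization: co is one block, wo advances Ws2 groups per step."""
    repeat_co = 1
    repeat_wo = wo // 4 // 16 // Ws2
    repeat_ho = ho // 4
    return repeat_co, repeat_wo, repeat_ho

print(optimized_loop_counts(ho=16, wo=256, Ws2=2))  # (1, 2, 4)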
It will be appreciated that although the foregoing describes a blocking instruction optimization scheme for output neurons for a particular example, it may also be applicable to other output data formats.
Without loss of generality, the multi-dimensional shape of the first data that needs to be block processed can be expressed as:
[high-dimensional ho] × [middle-dimensional wo] × [co dimension] × [multidimensional blending]
Wherein the multi-dimensional blending includes at least various combinations of: co, high-dimensional ho, low-dimensional ho, high-dimensional wo, and low-dimensional wo.
When the co dimension is complemented from adjacent dimensions according to the optimization scheme, the first data can be regarded, from its multi-dimensional shape, as the following three-dimensional equivalent shape before the blocking process:
([high-dimensional ho] × [middle-dimensional wo/Ws2]) × (Ws2 × co dimension × Ws1) × (multidimensional blending after complementing)
where the highest dimension is [high-dimensional ho] × [middle-dimensional wo/Ws2], the next-highest dimension is Ws2 × co dimension × Ws1, the lowest dimension is the multidimensional blending after complementing, and the low-dimensional wo within that blending becomes [low-dimensional wo/Ws1].
The parameters of a particular block instruction may be similarly designed according to the principles described above, and are not expanded one by one here.
Thus, embodiments of the present disclosure provide a blocking processing scheme for small-channel output neuron data. When the channel dimension of the output neuron data is small, a suitable preferred alignment value P can be selected instead of the base alignment value M of the blocking instruction, so that the alignment requirement is reduced, the performance of the blocking instruction is exploited more effectively, and processing efficiency is improved.
The embodiment of the disclosure also provides a data processing method for executing the blocking instruction by using the data processing device. Those skilled in the art will appreciate that the method steps of executing the blocking instructions correspond to the various features of the computing device described above in connection with the figures, and thus the features described above apply equally to the method steps and are not repeated here.
The disclosed embodiments also provide a chip, which may include the data processing apparatus of any one of the embodiments described above with reference to the drawings. Further, the disclosure also provides a board card, which may include the aforementioned chip.
According to different application scenarios, the electronic device or apparatus disclosed herein may include a server, a cloud server, a server computing cluster, a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, an internet of things terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description and is intended to be exemplary only and is not intended to be exhaustive or to limit the invention to the precise forms disclosed; meanwhile, for the person skilled in the art, based on the idea of the present disclosure, there may be variations in the specific embodiments and the application scope, and in summary, the present disclosure should not be construed as limiting the present disclosure.

Claims (17)

1. A data processing apparatus comprising a control circuit, a first storage circuit and a second storage circuit, wherein:
the first storage circuit is used for storing first data before processing;
the second storage circuit is used for storing the processed second data; and
the control circuit is configured to:
determining a preferred alignment value according to the co dimension of the first data;
determining a complement allocation of dimensions adjacent to the co dimension according to the preferred alignment value; and
configuring and executing a blocking instruction according to the complement allocation to convert first data stored on a first storage circuit in a first dimension storage order into second data stored on a second storage circuit in a second dimension storage order, wherein the first data is multidimensional data whose multidimensional shape is:
[high-dimensional ho] × [middle-dimensional wo] × [co dimension] × [multidimensional blending]
Wherein the multi-dimensional blending includes at least various combinations of: co, high-dimensional ho, low-dimensional ho, high-dimensional wo, and low-dimensional wo;
the second data is three-dimensional data, and the three-dimensional shape of the second data is:
[ho*wo*co]
where co represents the lowest storage dimension of the second data, wo represents the next lowest storage dimension of the second data, and ho represents the highest storage dimension of the second data.
2. The data processing apparatus according to claim 1, wherein the control circuit is further configured to determine, when the preferred alignment value P is less than a base alignment value M of the blocking instruction, the complement allocation for the dimensions adjacent to the co dimension as follows:
when M/P ≤ 4, Ws1 = M/P and Ws2 = 1; and
when M/P > 4, Ws1 = 4 and Ws2 = M/(4P),
wherein Ws1 represents splitting Ws1 folds of data from the adjacent low-dimensional side of the co dimension to complement the co dimension, and Ws2 represents splitting Ws2 folds of data from the adjacent high-dimensional side of the co dimension to complement the co dimension.
3. The data processing apparatus according to claim 2, wherein the control circuit is further configured to determine the preferred alignment value P from the co-dimensional size of the first data according to the following rule:
when 0 < co ≤ 4B, P = 4B;
when 2^n · 4B < co ≤ 2^(n+1) · 4B and 2^(n+1) · 4B < M, where n is a non-negative integer, P = 2^(n+1) · 4B; and
when M/2 < co, P = M.
4. A data processing apparatus as claimed in any of claims 2 to 3, wherein the control circuitry is further arranged to:
according to the complement allocation, regard the first data, from its multi-dimensional shape, as the following three-dimensional equivalent shape before the blocking process:
([high-dimensional ho] × [middle-dimensional wo/Ws2]) × (Ws2 × co dimension × Ws1) × (multidimensional blending after complementing)
wherein the highest dimension is [high-dimensional ho] × [middle-dimensional wo/Ws2], the next-highest dimension is Ws2 × co dimension × Ws1, the lowest dimension is the multidimensional blending after complementing, and the low-dimensional wo within that blending becomes [low-dimensional wo/Ws1]; and
regard the second data, from its three-dimensional shape, as the three-dimensional equivalent shape after the blocking process: [formula not reproduced in the text]
5. the data processing apparatus of claim 4, wherein the control circuitry is further to:
and executing a blocking instruction on the first data of the three-dimensional equivalent shape according to a loop, wherein the loop comprises an inner co-dimensional loop, a middle wo-dimensional loop and an outer ho-dimensional loop.
6. The data processing device of claim 5, wherein the inner co-dimensional loop comprises:
processing the whole co dimension as one data block, wherein the inner co-dimension loop count repeat_co is 1.
7. The data processing device of claim 6, wherein the middle wo-dimension loop comprises:
processing one data block per step along the wo dimension, repeated repeat_wo times, where repeat_wo is the middle-dimension wo/Ws2.
8. The data processing apparatus of claim 7, wherein the outer ho dimension loop comprises:
processing one data block per step along the ho dimension, repeated repeat_ho times, where repeat_ho is the high-dimension ho size.
9. The data processing apparatus of claim 8, wherein the control circuitry is further to configure the blocking instruction as follows:
setting a post-configuration table of the blocking instruction, wherein the inner-layer lowest-dimension data of the output tensor of the blocking instruction is rearranged according to the indication of the post-configuration table.
10. The data processing apparatus of claim 9, wherein the control circuitry is further to set the late provisioning table as follows:
converting the inner-layer lowest-dimension data of the output tensor from being arranged in [co wo] dimension order to being arranged in [wo co] dimension order.
11. The data processing apparatus of claim 10, wherein the control circuitry is further to:
before executing the blocking instruction for a data block each time, setting an input tensor offset and an output tensor offset of the blocking instruction executed for the current data block according to the processed data block size, wherein the input tensor offset represents the offset of the data block before processing relative to the initial storage address of the first data, and the output tensor offset represents the offset of the data block after processing relative to the initial storage address of the second data.
12. The data processing apparatus of claim 11, wherein the control circuitry is further to:
setting the blocking bit width T of the blocking instruction to Ws1 × dwidth, wherein T represents the amount of data of one atomic operation of the blocking instruction and dwidth represents the data bit width.
13. The data processing apparatus according to claim 12, wherein the first data is output neurons of a Forward4 small convolution scheme in Group1 mode with a Ho × Wo 1 × 16 split, whose multidimensional shape is: (ho/4) × (wo/4/16) × co × (4 × 16 × 4).
14. The data processing apparatus according to claim 13, wherein, when M = 64B, the control circuit is further configured to configure the blocking instruction for each data block as follows:
TABLE 1
[table not reproduced in the text]
where B denotes bytes; in0, in1 and in2 denote the inner-layer lowest, middle and highest dimension sizes of the inner data block of the input tensor of the blocking instruction; in3, in4 and in5 denote the three outer dimension sizes of the input tensor, each equal to the number of inner data blocks of the input tensor contained in the corresponding dimension; is1 to is5 denote the step sizes of the five dimensions of the input tensor other than the inner-layer lowest dimension; on0, on1 and on2 denote the inner-layer lowest, middle and highest dimension sizes of the inner data block of the output tensor of the blocking instruction; on3, on4 and on5 denote the three outer dimension sizes of the output tensor, each equal to the number of inner data blocks of the output tensor contained in the corresponding dimension; and os1 to os5 denote the step sizes of the five dimensions of the output tensor other than the inner-layer lowest dimension.
15. A chip comprising a data processing device according to any one of claims 1 to 14.
16. A board comprising the chip of claim 15.
17. A data processing method for executing a blocking instruction on output neuron data using the data processing apparatus according to any one of claims 1 to 14.
CN202111131270.2A 2021-09-26 2021-09-26 Data processing device, data processing method and related product Pending CN113837921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111131270.2A CN113837921A (en) 2021-09-26 2021-09-26 Data processing device, data processing method and related product

Publications (1)

Publication Number Publication Date
CN113837921A true CN113837921A (en) 2021-12-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination