CN112633490B - Data processing device, method and related product for executing neural network model - Google Patents

Data processing device, method and related product for executing neural network model

Info

Publication number
CN112633490B
Authority
CN
China
Prior art keywords
dimension
data
convolution kernel
convolution
filter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011631704.0A
Other languages
Chinese (zh)
Other versions
CN112633490A (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202011631704.0A priority Critical patent/CN112633490B/en
Publication of CN112633490A publication Critical patent/CN112633490A/en
Application granted granted Critical
Publication of CN112633490B publication Critical patent/CN112633490B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure discloses a data processing apparatus, method, and related products for executing neural network models. The data processing apparatus may be included as a computing apparatus in a combined processing apparatus, which may also include an interface apparatus and other processing apparatuses. The computing apparatus interacts with the other processing apparatuses to jointly complete a user-specified computing operation. The combined processing apparatus may further include a storage apparatus connected to the computing apparatus and the other processing apparatuses, respectively, for storing data of the computing apparatus and the other processing apparatuses. The scheme optimizes the convolution operation on multidimensional arrays and improves processing efficiency.

Description

Data processing device, method and related product for executing neural network model
Technical Field
The present disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to a data processing apparatus, a data processing method, a chip, and a board for executing a neural network model.
Background
Deep learning has become an important branch of machine learning and has greatly advanced the development of artificial intelligence (AI). Its core technology, the deep neural network (DNN), has found wide application in many industries.
The convolutional layer is one of the common hidden layers in a neural network model; it performs feature extraction on input data through convolution operations. A neural network model contains a large number of convolution operations, and the computational performance of the convolution operations greatly affects the computational performance of the whole model. In convolution operations, each dimension of the filter of the convolution layer is subject to alignment requirements from both instructions and hardware (e.g., parallel operation units). Therefore, it is necessary to optimize the convolution operation to improve the computational performance of executing the neural network model.
Disclosure of Invention
To address at least one or more of the technical problems mentioned above, the present disclosure proposes, in various aspects, a data processing scheme for executing a neural network model, which may effectively improve the computational performance of convolution operations by transforming filters of a convolution layer. The neural network model of embodiments of the present disclosure may be applied to various fields such as image processing, speech processing, text processing, etc., which may include, for example, but not limited to, recognition and classification.
In a first aspect, the present disclosure provides a data processing apparatus for executing a neural network model, comprising: a storage circuit configured to store a folded filter of a convolution layer of the neural network model, the folded filter being obtained by dimension folding of an original filter, wherein the dimension folding includes rearranging data of the convolution kernel width dimension and/or the convolution kernel height dimension into the output channel dimension; and a processing circuit configured to: perform a convolution operation on an input feature map using the folded filter to obtain intermediate results; and accumulate the intermediate results to obtain an output feature map.
In a second aspect, the present disclosure provides a chip comprising the data processing apparatus of any one of the embodiments of the first aspect.
In a third aspect, the present disclosure provides a board comprising the chip of any one of the embodiments of the second aspect described above.
In a fourth aspect, the present disclosure provides a method implemented by a data processing apparatus for executing a neural network model, the data processing apparatus comprising a storage circuit and a processing circuit, the method comprising: the processing circuit performs a convolution operation on an input feature map using a folded filter of a convolution layer of the neural network model stored in the storage circuit to obtain intermediate results, wherein the folded filter is obtained by dimension folding of an original filter, and the dimension folding includes rearranging data of the convolution kernel width dimension and/or the convolution kernel height dimension into the output channel dimension; and the processing circuit accumulates the intermediate results to obtain an output feature map.
With the data processing apparatus, chip, board, and data processing method provided above, the scheme of the present disclosure optimizes convolution operations by folding filters. Embodiments of the present disclosure are particularly applicable to cases where the output channel dimension of the original filter is small. In conventional convolution operations, when the output channel dimension of the filter is small, the alignment requirement on the number of parallel operation units causes considerable waste of resources. In the embodiments of the present disclosure, the multiple expanded filters obtained by shifting the original filter by the convolution step several times are combined into a folded filter, so that the available parallel operation units can be fully utilized, the waste of operation resources is avoided, and the computational performance of hardware-accelerated convolution operations is improved.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 is a block diagram illustrating a board of an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating an integrated circuit device of an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating the internal structure of a single core computing device of an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating the internal architecture of a multi-core computing device of an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating the internal architecture of a processor core of an embodiment of the present disclosure;
FIG. 6 illustrates an example convolution operation to which embodiments of the present disclosure may be applied;
FIG. 7 illustrates an exemplary schematic diagram of a data processing scheme of an embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of a split width dimension and a split height dimension according to an embodiment of the present disclosure;
FIGS. 9A-9B illustrate examples of dimensional folding in accordance with embodiments of the present disclosure;
FIG. 10 illustrates an exemplary comparison graph of the calculation processes before and after folding in accordance with an embodiment of the present disclosure;
FIG. 11 illustrates a schematic diagram of result accumulation according to an embodiment of the disclosure;
FIG. 12 illustrates a schematic block diagram of a data processing apparatus in which embodiments of the present disclosure may be implemented; and
FIG. 13 shows an exemplary flowchart of a data processing method according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," and "third," and the like, as may be used in the claims, specification, and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in fig. 1, the board 10 includes a chip 101, which is a System-on-Chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology is applied in large quantities in the cloud intelligence field in particular; a notable characteristic of cloud intelligence applications is the large amount of input data, which places high demands on the storage and computing capabilities of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, with large off-chip storage, large on-chip storage, and strong computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102, and the calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning computations; it may interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control including, but not limited to, data handling and starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more of a central processing unit (CPU), a graphics processing unit (GPU), or another general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered on its own, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The DRAM 204 is used to store data to be processed. It is a DDR memory, typically 16 GB or larger, for storing data of the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal architecture of computing device 201 as a single core. The single-core computing device 301 is used for processing input data such as computer vision, voice, natural language, data mining, etc., and the single-core computing device 301 comprises three major modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used to coordinate and control the operation of the operation module 32 and the storage module 33 to complete deep learning tasks, and comprises an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 fetches instructions from the processing device 203, and the instruction decode unit 312 decodes the fetched instructions and sends the decoded results to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 322 is responsible for the core computation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transfer related data and includes a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. NRAM 331 stores input neurons, output neurons, and intermediate results after calculation; WRAM 332 stores the convolution kernels, i.e., the weights, of the deep learning network; DMA 333 is coupled to the DRAM 204 via the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
Fig. 4 shows a schematic diagram of the internal architecture of the computing device 201 as a multi-core device. The multi-core computing device 41 is designed with a hierarchical structure: it is a system-on-chip including at least one cluster, and each cluster in turn includes a plurality of processor cores. In other words, the multi-core computing device 41 is organized in a system-on-chip, cluster, processor-core hierarchy.
At the system-on-chip level, as shown in FIG. 4, the multi-core computing device 41 includes an external memory controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and a plurality of clusters 405.
There may be a plurality of external memory controllers 401, of which 2 are shown by way of example, for accessing external memory devices, such as the DRAM 204 in FIG. 2, to read data from or write data to off-chip memory in response to access requests issued by the processor cores. The peripheral communication module 402 is configured to receive control signals from the processing device 203 through the interface device 202 and to start the computing device 201 to perform tasks. The on-chip interconnect module 403 connects the external memory controllers 401, the peripheral communication module 402, and the plurality of clusters 405, and transmits data and control signals between the modules. The synchronization module 404 is a global barrier controller (GBC) for coordinating the work progress of the clusters to ensure synchronization of information. The plurality of clusters 405 are the computing cores of the multi-core computing device 41; 4 are illustratively shown, and as hardware evolves the multi-core computing device 41 of the present disclosure may further include 8, 16, 64, or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.
At the cluster level, as shown in FIG. 4, each cluster 405 includes a plurality of processor cores (IPU cores) 406 and a memory core (MEM core) 407.
The processor cores 406 are illustratively shown as 4 in the figure, and the present disclosure does not limit their number. Their internal architecture is shown in fig. 5. Each processor core 406 is similar to the single-core computing device 301 of fig. 3 and likewise comprises three major modules: a control module 51, an operation module 52, and a storage module 53. The functions and structures of the control module 51, the operation module 52, and the storage module 53 are substantially the same as those of the control module 31, the operation module 32, and the storage module 33, and are not described again. The storage module 53 includes an input/output direct memory access module (IODMA) 533 and a transfer direct memory access module (MVDMA) 534. IODMA 533 controls access between NRAM 531/WRAM 532 and the DRAM 204 over the broadcast bus 409; MVDMA 534 controls access between NRAM 531/WRAM 532 and the storage unit (SRAM) 408.
Returning to FIG. 4, the memory core 407 is mainly used for storage and communication, i.e., for storing shared data or intermediate results among the processor cores 406, and for performing communication between the cluster 405 and the DRAM 204, among the clusters 405, among the processor cores 406, and so on. In other embodiments, the memory core 407 has scalar operation capability for performing scalar operations.
The memory core 407 includes an SRAM 408, a broadcast bus 409, a cluster direct memory access module (CDMA) 410, and a global direct memory access module (GDMA) 411. The SRAM 408 plays the role of a high-performance data transfer station: data multiplexed between different processor cores 406 in the same cluster 405 does not need to be fetched from the DRAM 204 by each processor core 406 separately, but is relayed between the processor cores 406 through the SRAM 408. The memory core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to the multiple processor cores 406, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 409, the CDMA 410, and the GDMA 411 are used to perform communication among the processor cores 406, communication among the clusters 405, and data transfer between the clusters 405 and the DRAM 204, respectively. Each is described below.
The broadcast bus 409 is used to facilitate high-speed communications among the processor cores 406 within the cluster 405. The broadcast bus 409 of this embodiment supports inter-core communications including unicast, multicast and broadcast. Unicast refers to the transmission of data from point to point (e.g., single processor core to single processor core), multicast is a communication scheme that transfers a piece of data from SRAM 408 to a specific number of processor cores 406, and broadcast is a communication scheme that transfers a piece of data from SRAM 408 to all processor cores 406, a special case of multicast.
CDMA 410 is used to control access to the SRAM 408 between different clusters 405 within the same computing device 201.
The GDMA 411 cooperates with the external memory controller 401 to control access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. From the foregoing, communication between the DRAM 204 and NRAM 531 or WRAM 532 may be accomplished via two channels. The first channel connects the DRAM 204 directly with NRAM 531 or WRAM 532 through IODMA 533; the second channel transfers data between the DRAM 204 and the SRAM 408 via the GDMA 411, and then between the SRAM 408 and NRAM 531 or WRAM 532 via the MVDMA 534. Although the second channel seemingly requires more components and a longer data path, in practice, in some embodiments, the bandwidth of the second channel is much greater than that of the first channel, so communication between the DRAM 204 and NRAM 531 or WRAM 532 may be more efficient through the second channel. Embodiments of the present disclosure may select the data transmission channel according to the hardware conditions.
In other embodiments, the functionality of the GDMA 411 and the functionality of the IODMA 533 may be integrated into the same component. The GDMA 411 and the IODMA 533 are treated as different components only for convenience of description; for a person skilled in the art, as long as the functions and technical effects achieved are similar to those of the present disclosure, the implementation falls within the scope of protection of the present disclosure. Further, the functions of the GDMA 411, the IODMA 533, the CDMA 410, and the MVDMA 534 may be implemented by the same component.
A neural network model typically includes an input layer, convolution layers, activation functions, pooling layers, fully connected layers, and so on, ranging from a few layers to hundreds of layers. Each layer executes one operator, e.g., a convolution layer executes a convolution operator, so the model needs to execute as many operators as it has layers.
Training a neural network model means adjusting the parameters of each layer using training samples so that the result computed by the model is as close as possible to the true result. Training includes forward propagation and backward propagation. In forward propagation, an input training sample is computed layer by layer through the existing model, and the input feature map is gradually extracted into abstract features. In backward propagation, a loss function is computed from the forward result and the ground truth, and gradient descent is applied, using the chain rule to compute the partial derivative of the loss function with respect to each parameter and update the parameters. Training then continues with the updated parameters and is repeated many times until the forward result meets expectations. Using the trained neural network model to perform forward computation on real-world inputs to complete the set task is the inference of the neural network model.
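As a minimal illustration of the forward-propagation / back-propagation loop described above (a sketch only, not part of the patent; the single linear layer, squared-error loss, and learning rate are hypothetical choices), a gradient-descent parameter update might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))        # training samples
y = rng.standard_normal((8, 1))        # ground-truth values
w = np.zeros((4, 1))                   # parameters to be learned
lr = 0.1                               # learning rate

for step in range(100):
    y_pred = x @ w                           # forward propagation through the layer
    loss = np.mean((y_pred - y) ** 2)        # loss between forward result and truth
    grad = 2 * x.T @ (y_pred - y) / len(x)   # chain rule: d(loss)/d(w)
    w -= lr * grad                           # gradient-descent parameter update
```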
The disclosed embodiments provide a data processing scheme for executing a neural network model, more specifically, a scheme for optimizing convolution operations in the neural network model, based on the aforementioned hardware environment.
FIG. 6 illustrates an example convolution operation to which embodiments of the present disclosure may be applied. As shown, the convolution layer in the neural network model may perform feature extraction by applying a filter to the input feature map to perform convolution processing.
The figure shows an exemplary 6×6×3 input feature map, which may represent 3 feature maps of size 6×6 (i.e., a 6×6×3 three-dimensional matrix) representing three different features. In this example, the width W of the feature map is 6 and the height H is also 6. The number of input feature maps may also be referred to as the number of input channels Ci. For example, the figure inputs 3 feature maps, also referred to as 3 feature channels.
Also shown by way of example is a 2×3×3×3 filter, which may represent 2 stereo convolution kernels of size 3×3×3 (i.e., 2 three-dimensional 3×3×3 matrices), each containing 3 different two-dimensional convolution kernels of size 3×3 corresponding to the 3 different input feature maps. The number of stereo convolution kernels may be referred to as the number of output channels Co, which in this example is 2. Within each stereo convolution kernel, the number of two-dimensional convolution kernels may be referred to as the number of input channels Ci, which matches the number of channels of the input feature map. Each two-dimensional convolution kernel has its own width Kw and height Kh, both of which are 3 in this example.
The convolution of the input feature map with the filter outputs 2 feature maps of size 4×4. The convolution of the input feature map with the upper stereo convolution kernel gives the upper 4×4 output feature map, and the convolution with the lower stereo convolution kernel gives the lower 4×4 output feature map. The value at each position of an output feature map is obtained by performing a two-dimensional convolution between each input feature map and its corresponding two-dimensional kernel and then summing the results. For example, the figure shows that the value at position (0, 0) of the upper output feature map is obtained by performing two-dimensional convolutions between the block outlined by the black cube in the input feature map and the upper stereo convolution kernel, yielding 3 values that are then added to obtain the final value. To obtain the outputs at other positions, the position of the convolution kernel is shifted over the input feature map. In the example in the figure, the convolution step (Sx, Sy) is (1, 1); after the kernel is moved one cell to the right in the width direction or one cell down in the height direction and the convolution is performed again, the value at position (0, 1) or (1, 0) of the upper output feature map is obtained, respectively.
From the above description, one convolutional layer of a neural network has a set of input feature maps containing H×W×Ci pieces of information, where H and W are the height and width of the input feature maps, respectively, and Ci is the number of input feature maps, also called the number of input channels. The convolution layer has Ci×Co convolution kernels of size Kh×Kw, where Ci is the number of input channels, Co is the number of output feature maps (or the number of output channels), and Kh and Kw are the height and width of each convolution kernel, respectively. The output feature maps contain Ho×Wo×Co pieces of information, where Ho and Wo are the height and width of the output feature maps, respectively, and Co is the number of output channels. In addition, the convolution operation involves a convolution step (Sx, Sy), whose size affects the size of the output feature maps.
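To make the dimension bookkeeping above concrete, the following sketch (an illustrative reference implementation only, not the patent's hardware method; the function name conv2d_nhwc and the use of NumPy are assumptions) computes a standard convolution over data laid out as (N, H, W, Ci) with a filter laid out as (Co, Kh, Kw, Ci):

```python
import numpy as np

def conv2d_nhwc(x, w, stride=(1, 1)):
    """Plain convolution: x is (N, H, W, Ci), w is (Co, Kh, Kw, Ci).

    Returns the output feature map of shape (N, Ho, Wo, Co), where
    Ho = (H - Kh) // Sy + 1 and Wo = (W - Kw) // Sx + 1 (no padding).
    """
    N, H, W, Ci = x.shape
    Co, Kh, Kw, _ = w.shape
    Sy, Sx = stride
    Ho = (H - Kh) // Sy + 1
    Wo = (W - Kw) // Sx + 1
    out = np.zeros((N, Ho, Wo, Co), dtype=x.dtype)
    for n in range(N):
        for ho in range(Ho):
            for wo in range(Wo):
                patch = x[n, ho * Sy:ho * Sy + Kh, wo * Sx:wo * Sx + Kw, :]
                # sum over Kh, Kw and Ci for every output channel
                out[n, ho, wo, :] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out
```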
In order to accelerate the computation of the neural network model, a plurality of computation units are generally used to perform parallel computation. For example, the operation module 32 in fig. 3 or the operation module 52 in fig. 5 may include a plurality of convolution-specific calculation units (or convolution units), each of which is capable of performing, for example, a complete calculation of the (H, W, ci) dimension. In other words, the computation of the Co (H, W, ci) dimensions can be distributed over Co convolution units for parallel computation, thereby improving the computation speed. In general, the number of convolution units is fixed, and if the size of Co dimension is small, there will be spare convolution units, and the computing resources cannot be fully utilized. In some cases, it may be desirable to align the size of the Co dimension to the number of convolution units in order to unify the scheduling. However, when the Co dimension is small, such alignment restrictions introduce ineffective computation, resulting in a large amount of resource waste.
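As a rough illustration of that waste (the numbers follow the running example with 64 parallel convolution units; this is only back-of-the-envelope arithmetic, not a measurement from the patent):

```python
aco = 64                          # number of parallel convolution units (hardware alignment value)
co = 4                            # output channel dimension of the original filter
utilization = co / aco            # 4 / 64 = 6.25 % of the units do useful work
padded_waste = 1 - utilization    # 93.75 % of the Co-aligned computation is zero padding
```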
In view of this, the embodiments of the present disclosure provide a data processing scheme for executing a neural network model, which is optimized for Co dimensions in a convolutional layer, based on the aforementioned hardware environment, reducing resource waste as much as possible.
In the embodiments of the present disclosure, the dimensions of the multidimensional data involved are characterized as (N, H, W, C) or (Co, H, W, Ci), which represents the order in which the data is stored in memory. It will be appreciated that, although multidimensional data has multiple dimensions, there is a correspondence between the multidimensional data and its storage order in memory, because the layout of memory is always one-dimensional. Multidimensional data is typically allocated in contiguous memory space, i.e., it can be flattened into one dimension and stored sequentially in memory. For example, in the embodiments of the present disclosure, data is stored sequentially with the lowest dimension varying fastest (Ci being the lowest dimension). Adjacent dimensions are dimensions that are next to each other in the dimension representation of the multidimensional data; for example, W and Ci are adjacent, and adjacent dimensions may also be called consecutive dimensions.
FIG. 7 illustrates, by way of a specific example, an exemplary schematic diagram of a data processing scheme of an embodiment of the present disclosure. Let the number of convolution units used to perform the convolution operation be Aco. Aco may have different values, such as 32, 64, 128, etc., depending on the hardware design. In the following examples, Aco = 64 is used as an example. According to the alignment requirements of the hardware, the Co dimension of the filter needs to be aligned to Aco, i.e., to 64.
The left-hand side of the figure shows the original filter of the convolution layer, which is denoted, for example, as 4×3×3×64, i.e., the number of output channels Co is 4, the number of input channels Ci is 64, and each convolution kernel has a size of 3×3. Further, the convolution step of the original filter is (Sx, Sy) = (1, 1). As can be seen from the figure, the Co dimension of the original filter is much smaller than the hardware alignment requirement (64). In the conventional processing manner, the Co dimension would be zero-padded to align to 64. Padding from 4 to 64 adds a very large number of redundant calculations, resulting in a waste of resources.
The right side of the figure shows a folded filter according to an embodiment of the present disclosure, which is denoted as 36×1×1×64, i.e., the number of output channels Co' is 36, the number of input channels is the same as in the original filter (64), each convolution kernel has a size of 1×1, and the convolution step is still (1, 1). It can be seen that the number of output channels of the folded filter (36) is much larger than that of the original filter (4), and aligning to the hardware alignment requirement (64) only requires padding from 36 to 64. Compared with aligning from 4 to 64, many redundant calculations are eliminated, and the convolution units can be used for operation processing more effectively.
In the folding process described above, data in the convolution kernel width and height dimensions of the original filter is transferred to the output channel dimension, thereby expanding the size of the output channel dimension. The dimension folding is based on the following consideration: if Co alignment is achieved by zero padding as in the original computation manner, redundant computation is generated and computing resources are wasted. If data of other dimensions is instead transferred to the Co dimension so that the Co dimension is filled close to or equal to the hardware alignment value, the amount of zero padding can be avoided or reduced, the waste of computing resources is avoided as much as possible, and computational efficiency is improved.
To clearly illustrate the manner in which data is transferred, the middle of FIG. 7 shows the splitting of data in the convolution kernel width and height dimensions of the original filter. Specifically, along the direction of the output channel dimension, data in the convolution kernel width and/or height dimension is extracted in multiples of Co and laid out sequentially in the output channel dimension. As shown, one column of data is taken at a time in the Co direction and laid out in the output channel dimension of the folded filter, so that n groups of Co are stacked to form Co'. The layout order may be width first and then height, or height first and then width; embodiments of the present disclosure are not limited in this respect.
In the example of fig. 7, the shape of the original filter is changed from (Co, Kh, Kw, Ci) to (Co×Kh×Kw, 1, 1, Ci), thereby enlarging the size of the output channel dimension. FIG. 7 shows the case in which the convolution kernel width dimension and the convolution kernel height dimension are shifted (split) simultaneously. Depending on the number of output channels Co, the convolution kernel width Kw, the convolution kernel height Kh of the original filter, and the first threshold Aco, different splitting manners may exist; for example, only the convolution kernel width dimension, or only the convolution kernel height dimension, may be split.
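As an illustration of this folding, the following sketch assumes a NumPy layout (Co, Kh, Kw, Ci) and a group ordering in which each kernel position contributes one group of Co output channels, matching the 9 groups of 4 channels later shown in FIG. 11; the function name fold_filter is not from the patent, and the exact ordering is an assumption:

```python
import numpy as np

def fold_filter(w):
    """Fold (Co, Kh, Kw, Ci) -> (Kh*Kw*Co, 1, 1, Ci).

    Each kernel position (i, j) contributes one group of Co output channels
    of the folded filter; the input channel dimension Ci is left untouched.
    """
    Co, Kh, Kw, Ci = w.shape
    return w.transpose(1, 2, 0, 3).reshape(Kh * Kw * Co, 1, 1, Ci)

w = np.random.randn(4, 3, 3, 64)   # original filter of FIG. 7: Co=4, Kh=Kw=3, Ci=64
w_folded = fold_filter(w)          # shape (36, 1, 1, 64): Co'=36, 1x1 kernel, Ci=64
```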
Fig. 8 illustrates examples of splitting only the width dimension and splitting only the height dimension in accordance with an embodiment of the present disclosure. Let the original filter be (4, 3, 3, 16) and the first threshold Aco = 16. If the width and height dimensions were split at the same time, the resulting number of output channels Co' = 4×3×3 = 36 > Aco would exceed the first threshold. In this case, only one dimension may be chosen to be split.
The upper part of the figure shows splitting only in the width dimension. As can be seen from the figure, the folded filter obtained after splitting is (4×3, 3, 1, 16) = (12, 3, 1, 16). Only an alignment from 12 to 16 is then required.
The lower part of the figure shows splitting only in the height dimension. As can be seen from the figure, the folded filter obtained after splitting is (4×3, 1, 3, 16) = (12, 1, 3, 16). Again, only an alignment from 12 to 16 is required.
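A sketch of the two single-dimension splits of FIG. 8, under the same layout and ordering assumptions as the fold_filter sketch above (only the shapes matter here):

```python
import numpy as np

w = np.random.randn(4, 3, 3, 16)   # original filter (Co, Kh, Kw, Ci), first threshold Aco = 16

# Fold only the kernel width dimension: (4, 3, 3, 16) -> (4*3, 3, 1, 16) = (12, 3, 1, 16)
w_fold_w = w.transpose(2, 0, 1, 3).reshape(4 * 3, 3, 1, 16)

# Fold only the kernel height dimension: (4, 3, 3, 16) -> (4*3, 1, 3, 16) = (12, 1, 3, 16)
w_fold_h = w.transpose(1, 0, 2, 3).reshape(4 * 3, 1, 3, 16)
```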
In some cases, the convolution kernel width and/or height dimension cannot be split evenly, or splitting it at a granularity of one element would cause the output channel dimension Co' of the folded filter to exceed the hardware alignment value. In such cases, the convolution kernel width and/or height dimension may be appropriately zero-padded so that it can be split evenly.
Fig. 9A-9B illustrate examples of dimension folding in accordance with embodiments of the present disclosure. As shown in fig. 9A, assume that the original filter 910 is (4, 3, 17, 64) and the first threshold Aco = 64. If the width and height dimensions are split simultaneously, the resulting number of output channels Co' = 4×3×17 = 204 > Aco exceeds the first threshold. If only the width dimension is split, the resulting number of output channels Co' = 4×17 = 68 > Aco; if only the height dimension is split, the resulting number of output channels Co' = 4×3 = 12 < Aco/2. In this case, appropriate zero padding of the width or height dimension may be chosen to enable a suitable split.
For the above example, fig. 9A shows an example of zero padding of the width dimension. The zero-padded filter is shown at 920 with a row of zero-padded regions added in the width dimension, as shown by diagonal blocks 921. In this example, the width dimension is padded to 18 so that the zero padded filter becomes (4,3,18,64).
Next, the zero-padded filter is split: the width dimension may be split as 18 = 2×9, as shown at 930. The split data is transferred to the output channel dimension, as shown at 940. Thus, the folded filter shown at 940 has dimensions (36, 3, 2, 64). On this basis, when hardware alignment is performed, only an alignment from 36 to 64 is needed instead of the original alignment from 4 to 64.
Those skilled in the art will appreciate that other ways of zero padding and corresponding splitting are possible. Fig. 9B illustrates another exemplary zero padding approach. For example, for the same example described above, the width dimension may be padded from 17 to 20, the height dimension from 3 to 4, and then the data transfer may be performed simultaneously for both the width and height dimensions. Specifically, the width dimension may be split into 20=4×5, the height dimension into 4=4×1, all split by a factor of 4 and transferred to the output channel dimension such that the output channel dimension Co' =4×4×4=64 is directly aligned to the hardware alignment value.
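The zero-pad-then-fold transformations of FIGS. 9A and 9B can likewise be sketched as reshapes. This is only a shape illustration; the exact order in which padded columns are extracted into the output channel dimension is an assumption, since the patent leaves that layout choice open:

```python
import numpy as np

w = np.random.randn(4, 3, 17, 64)                    # original filter (Co, Kh, Kw, Ci), Aco = 64

# FIG. 9A: pad the width from 17 to 18, split 18 = 2 * 9, move the factor 9 into Co
w_pad = np.pad(w, ((0, 0), (0, 0), (0, 1), (0, 0)))  # -> (4, 3, 18, 64)
w_9a = (w_pad.reshape(4, 3, 9, 2, 64)                # split Kw into (9, 2)
              .transpose(2, 0, 1, 3, 4)              # bring the factor 9 next to Co
              .reshape(36, 3, 2, 64))                # folded filter (36, 3, 2, 64)

# FIG. 9B: pad width 17 -> 20 and height 3 -> 4, move a factor of 4 from each into Co
w_pad2 = np.pad(w, ((0, 0), (0, 1), (0, 3), (0, 0))) # -> (4, 4, 20, 64)
w_9b = (w_pad2.reshape(4, 4, 1, 4, 5, 64)            # split Kh into (4, 1), Kw into (4, 5)
               .transpose(1, 3, 0, 2, 4, 5)          # bring both factors next to Co
               .reshape(64, 1, 5, 64))               # folded filter (64, 1, 5, 64), Co' = 64
```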
Those skilled in the art, based on the teachings herein, may select an appropriate manner of dimension folding to minimize redundant computation and resource waste. Preferably, after the data in the convolution kernel width and/or height dimension is rearranged into the output channel dimension, the number of output channels Co' satisfies: Aco/2 ≤ Co' ≤ Aco, where Aco is the number of convolution operation units in the processing circuit.
Various dimension folding manners provided by embodiments of the present disclosure are described above in connection with the accompanying drawings. As can be seen from the folding process, the number of output channels of the folded filter is a multiple of the number of output channels of the original filter, so the folding scheme of the embodiments of the present disclosure is particularly suitable for cases where the number of output channels Co of the original filter is small, e.g., Co ≤ Aco/2. Practical results show that the smaller Co is, the greater the improvement over the existing algorithm.
It should be noted that, since no accumulation in the Co direction is performed in the convolution operation (see the description of the convolution operation of fig. 6), the result obtained by convolving the input feature map with the folded filter is not the final result. The intermediate results corresponding to the convolution kernel width and height data that were moved into the Co dimension still need to be accumulated to obtain the final result.
Fig. 10 shows an exemplary comparison of the calculation processes before and after folding according to an embodiment of the present disclosure.
As can be seen from the preceding folding process, the input channel dimensions of the folded filter are identical to those of the original filter, so that the input feature map can be directly convolved with the folded filter without any processing.
As shown, the input feature map 1010 is assumed to be (1, 5, 5, Ci), i.e., Ci feature maps of size 5×5. The upper part of the figure shows the result obtained by performing the convolution with the original filter 1020, which has dimensions (4, 3, 3, Ci) and convolution step (Sx, Sy) = (1, 1): the output feature map 1030 is (1, 3, 3, 4), i.e., 4 feature maps of size 3×3. Those skilled in the art will appreciate that, for clarity of illustration, the data in the Ci dimension is not drawn and is indicated by characters only.
In contrast, the lower part of the figure shows the result obtained by performing the convolution with the folded filter 1040, which has dimensions (36, 1, 1, Ci) and convolution step (Sx', Sy') = (1, 1): the output feature map 1050 is (1, 5, 5, 36), i.e., 36 feature maps of size 5×5.
As can be seen from the comparison in the figure, since the data originally located in the convolution kernel width and height dimensions has been transferred to the output channel dimension, and the convolution does not accumulate over the output channel dimension, the result obtained after performing the convolution with the folded filter provided by the embodiments of the present disclosure is an intermediate result; the data in the width and height directions still need to be accumulated to obtain the final result.
FIG. 11 illustrates a schematic diagram of result accumulation according to an embodiment of the present disclosure. Continuing with the example of FIG. 10, as shown, the intermediate result calculated using the folded filter (36, 1, 1, Ci) is (1, 5, 5, 36), i.e., 36 feature maps of size 5×5. These 36 feature maps can be further divided into 9 groups, each group containing the original 4 output channels.
Since the products in the convolution kernel width and height dimensions are not accumulated on the output channel, an accumulation process that adds intermediate results is required. In some embodiments, the accumulating process includes accumulating data in the intermediate result scattered over the output channel dimension by corresponding locations.
Specifically, the intermediate results corresponding to the same output channel of the original filter may be accumulated to obtain the output feature map of that output channel. In the accumulation for each output channel, data with the same index is accumulated, and the index is determined based on the convolution step.
For example, each of the 9 blocks in the figure contains intermediate results for the 4 output channels, and the intermediate results of the same output channel, distributed over the 9 blocks, need to be accumulated. The figure shows the result of the first output channel in each of the 9 blocks, with an index in each result indicating the data to be accumulated. Data with the same index is accumulated to obtain the final result for that output channel. The same processing is performed for the other output channels, yielding the final results of the 4 output channels, i.e., the output result shown in the figure has dimensions (1, 3, 3, 4).
As can be seen from the convolution operation described above, the position and size of the index can be determined from the convolution kernel of the original filter, the convolution step, and the size of the input feature map, and will not be described in detail here.
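Under the same layout assumptions as the fold_filter sketch above, the index-and-accumulate step, together with a round-trip check against the plain convolution conv2d_nhwc defined earlier, could be sketched as follows (the function name unfold_accumulate and the Ci value of 16 are illustrative only):

```python
import numpy as np

def unfold_accumulate(intermediate, Co, Kh, Kw, stride=(1, 1)):
    """Accumulate the intermediate result of the folded 1x1 convolution.

    intermediate has shape (N, H, W, Kh*Kw*Co), one group of Co channels per
    original kernel position (i, j).  For each output position, the group of
    position (i, j) is read at the input location shifted by (i, j) according
    to the convolution step, and all groups are summed per output channel.
    """
    N, H, W, _ = intermediate.shape
    Sy, Sx = stride
    Ho = (H - Kh) // Sy + 1
    Wo = (W - Kw) // Sx + 1
    out = np.zeros((N, Ho, Wo, Co), dtype=intermediate.dtype)
    for i in range(Kh):
        for j in range(Kw):
            g = (i * Kw + j) * Co
            out += intermediate[:, i:i + Ho * Sy:Sy, j:j + Wo * Sx:Sx, g:g + Co]
    return out

# Round-trip check on the FIG. 10/11 sizes (Ci chosen as 16 here for brevity)
x = np.random.randn(1, 5, 5, 16)
w = np.random.randn(4, 3, 3, 16)
ref = conv2d_nhwc(x, w)                   # ordinary convolution, shape (1, 3, 3, 4)
inter = conv2d_nhwc(x, fold_filter(w))    # 1x1 convolution, shape (1, 5, 5, 36)
assert np.allclose(ref, unfold_accumulate(inter, Co=4, Kh=3, Kw=3))
```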
The above describes a solution for dimensional folding of filters according to embodiments of the present disclosure. In some embodiments, the folding filter may be generated offline. For example, in reasoning using neural network models, convolution operations may be performed with the input feature map using pre-arranged, offline generated folding filters to perform the reasoning process. In other embodiments, the folding filter may be generated in-line. For example, during training of the neural network model, the filter of the convolutional layer may be folded online and then convolved with the training data to perform the training process.
Whichever process uses the folded filter of the embodiments of the present disclosure, expanding the Co dimension by folding can greatly reduce the amount of computation of the convolution operation. The performance of the scheme of the disclosed embodiments with respect to the convolution calculation is compared with that of the existing convolution operation as follows.
Let P denote the amount of convolution computation, A_ci denote the value of Ci after alignment, and A_Co denote the value of Co after alignment. Then

P = n * A_ci * A_Co * Ho * Wo * Kw * Kh    (1)

For hardware adopting the NHWC layout order, since the Ci dimension is the lowest dimension, vector instruction alignment requires Ci to be aligned to A_ci; A_ci therefore reflects the vector instruction alignment requirement. Artificial intelligence computing acceleration hardware typically has multiple parallel high-performance convolution computing units, so A_Co reflects the alignment requirement on the convolution kernel Co dimension, and its value is the number of high-performance parallel computing units.
Before optimization, the amount of computation of the existing convolution operation is:

P_before = n * A_ci * A_Co * Ho * Wo * Kw * Kh    (2)
After optimization with the scheme of the embodiments of the present disclosure, the amount of computation of the convolution operation is:

P_after = n * A_ci * A_Co' * Ho * Wo * Kw' * Kh'    (3)

where A_Co' is the aligned value of the output channel number Co' of the folded filter, and Kw' and Kh' are the width and height of its convolution kernel. The optimization rate of the performance after Co folding is:

optimization rate = (P_before - P_after) / P_before    (4)
Taking the example described above with reference to FIG. 7:

Before optimization: P_before = 1 * 64 * 64 (Co = 4 aligned to 64) * Ho * Wo * 3 * 3, with (Sx, Sy) = (1, 1)

After optimization: P_after = 1 * 64 * 64 (Co' = 36 aligned to 64) * Ho * Wo * 1 * 1, with (Sx, Sy) = (1, 1)

The amount of convolution computation on the convolution calculation units is thus reduced by about 90%.
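Plugging the numbers in (a quick sanity check following the patent's own accounting, with the common factor Ho*Wo omitted since it appears on both sides):

```python
a_ci, a_co = 64, 64                       # aligned Ci and Co values
p_before = 1 * a_ci * a_co * 3 * 3        # original 3x3 kernel  -> 36864 per output point
p_after  = 1 * a_ci * a_co * 1 * 1        # folded 1x1 kernel    ->  4096 per output point
rate = (p_before - p_after) / p_before    # 8/9 ≈ 0.889, roughly the 90 % reduction cited above
```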
As can be seen from the above comparison of computation amounts, the folded filter scheme provided by the embodiments of the present disclosure can effectively reduce the amount of computation on the convolution units, thereby improving the computational performance of the convolution operation.
The disclosed embodiments also provide a data processing apparatus for executing a neural network model, and a method implemented by the data processing apparatus for executing a neural network model.
FIG. 12 illustrates a schematic block diagram of a data processing apparatus in which embodiments of the present disclosure may be implemented. As shown in fig. 12, the data processing apparatus 1200 includes a processing circuit 1210 and a storage circuit 1220.
The processing circuitry 1210 is responsible for processing various functions on the data processing device 1200 including, but not limited to, control, decoding, operations, and the like. The processing circuit 1210 may include, for example, the control module 31 and/or the operation module 32 of fig. 3.
In some embodiments, processing circuitry 1210 may be configured to perform convolution operations on the input feature map using the folding filter of embodiments of the present disclosure to obtain intermediate results; and accumulating the intermediate results to obtain an output characteristic diagram.
The storage circuit 1220 may be used to store or transfer related data; it may be, for example, the various RAMs or on-chip caches shown in fig. 3 or 5. In some embodiments, the storage circuit 1220 may be configured to store a folded filter of a convolutional layer of the neural network model. The folded filter is obtained by dimension folding of the original filter, where the dimension folding includes rearranging data of the convolution kernel width dimension and/or the convolution kernel height dimension into the output channel dimension.
In some embodiments, the data processing apparatus 1200 may be configured to perform a training process of the neural network model. In this case, the processing circuit 1210 may be configured to perform the folding process of the disclosed embodiments online on the filter of the convolutional layer of the neural network model during training, and then perform the convolution operation on the training data with the resulting folded filter to carry out the training process. The specific folding process performed by the processing circuit 1210 is described in the foregoing and is not repeated here.
In other embodiments, the data processing apparatus 1200 may be configured to perform an inference process of the neural network model. In this case, the processing circuit 1210 may be configured to directly perform the convolution operation on the input neurons using the folded filter already stored in the storage circuit 1220 to carry out the inference process.
The intermediate result obtained by performing the convolution operation with the folding filter needs to be subjected to data accumulation processing to obtain a final output feature map. Specific accumulation processing operations may be referred to in the foregoing description and are not repeated here.
Fig. 13 shows an exemplary flowchart of a data processing method according to an embodiment of the present disclosure.
As shown, the data processing method 1300 includes step 1310, in which the processing circuit performs a convolution operation on the input feature map using a folded filter of a convolution layer of the neural network model stored in the storage circuit to obtain intermediate results. The folded filter is obtained by dimension folding of the original filter, where the dimension folding includes rearranging data of the convolution kernel width dimension and/or the convolution kernel height dimension into the output channel dimension.
Next, in step 1320, the processing circuit accumulates the intermediate results to obtain the output feature map.
Those skilled in the art will appreciate that the processes of the filter folding method, the accumulation of intermediate results, and the like of the embodiments of the present disclosure described above with reference to the drawings are equally applicable to the data processing apparatus of fig. 12 and the data processing method of fig. 13, and thus a repetitive description will not be made.
The present disclosure also provides a chip which may include a data processing apparatus of any of the embodiments described hereinbefore with reference to the accompanying drawings. Further, the present disclosure also provides a board that may include the foregoing chip.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that aspects of the present disclosure are not limited by the order of the actions described. Accordingly, based on the disclosure or teachings herein, those of ordinary skill in the art will appreciate that certain steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as optional embodiments, in that the actions or modules involved are not necessarily required for implementing one or more aspects of this disclosure. In addition, depending on the scenario, the descriptions of different embodiments in this disclosure have different emphases. In view of this, those skilled in the art will appreciate that, for portions of one embodiment of the disclosure that are not described in detail, reference may be made to the related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings herein, those of ordinary skill in the art will appreciate that the several embodiments disclosed in this disclosure may also be implemented in ways not disclosed herein. For example, with respect to the foregoing embodiments of the electronic device or apparatus, the division of units herein is based on a consideration of logical functions, and other ways of dividing the units are possible in actual implementation. For another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connection relationships between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected, as needed, to achieve the objectives of the embodiments of the disclosure. In addition, in some scenarios, multiple units in the embodiments of the disclosure may be integrated into one unit, or each unit may physically exist alone.
In some implementation scenarios, the above-described integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. In this regard, when aspects of the present disclosure are embodied in a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuits may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, the various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage media or magneto-optical storage media, etc.), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM, etc.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. The appended claims are intended to define the scope of the disclosure and are therefore to cover all equivalents or alternatives falling within the scope of these claims.

Claims (22)

1. A data processing apparatus for executing a neural network model, comprising:
a storage circuit configured to store a folded filter of a convolution layer of the neural network model, the folded filter being obtained by performing dimension folding on an original filter, wherein the dimension folding includes rearranging data of a convolution kernel width dimension and/or a convolution kernel height dimension into an output channel dimension; and
processing circuitry configured to:
performing a convolution operation on the input feature map using the folded filter to obtain intermediate results; and
accumulating the intermediate results to obtain an output feature map.
2. The data processing apparatus of claim 1, wherein the number of output channels Co of the original filter is smaller than the number of output channels Co' of the folded filter.
3. The data processing apparatus of claim 2, wherein the processing circuitry is configured to perform the dimensional folding as follows:
determining, based on the number of output channels Co of the original filter, the convolution kernel width Kw, the convolution kernel height Kh, and a first threshold Aco, the data in the convolution kernel width dimension and/or the convolution kernel height dimension that needs to be rearranged into the output channel dimension; and
determining, based on the determined data to be rearranged, the number of output channels Co', the convolution kernel size, and the convolution step size of the folded filter.
4. A data processing apparatus according to claim 3, wherein the processing circuitry is further configured to determine the data that needs to be rearranged into the output channel dimension as follows:
rearranging the data in the convolution kernel width dimension and/or the convolution kernel height dimension into the output channel dimension, such that the number of output channels Co' satisfies Aco/2 ≤ Co' ≤ Aco, where Aco is the number of convolution operation units in the processing circuit.
5. The data processing apparatus of any of claims 3-4, wherein the processing circuitry is further configured to determine the data that needs to be rearranged into the output channel dimension as follows:
when the convolution kernel width dimension and/or the convolution kernel height dimension cannot be evenly divided, zero-padding the convolution kernel width dimension and/or the convolution kernel height dimension so that it can be evenly divided.
6. The data processing apparatus of any of claims 3-5, wherein the processing circuitry is further configured to rearrange data in the convolution kernel width dimension and/or the convolution kernel height dimension into the output channel dimension as follows:
extracting the data in the convolution kernel width dimension and/or the convolution kernel height dimension in multiples of Co along the output channel dimension, and arranging the data sequentially in the direction of the output channel dimension.
7. The data processing apparatus according to any of claims 1-6, wherein the processing circuit is further configured to accumulate intermediate results as follows:
accumulating, by corresponding position, the data in the intermediate results that are distributed over the output channel dimension.
8. The data processing apparatus of claim 7, wherein the processing circuit is further configured to accumulate as follows:
accumulating the intermediate results corresponding to a same output channel of the original filter to obtain an output feature map for that output channel, wherein data having the same index are accumulated, the index being determined based at least in part on the convolution step size.
9. The data processing apparatus according to any one of claims 1 to 8, wherein the number of input channels of the original filter is equal to the number of input channels of the folded filter.
10. A data processing apparatus according to any of claims 1-9, wherein the folded filter is generated offline or online.
11. A chip, characterized in that the chip comprises a data processing device according to any of claims 1-10.
12. A board comprising the chip of claim 11.
13. A method implemented by a data processing apparatus for executing a neural network model, the data processing apparatus comprising storage circuitry and processing circuitry, the method comprising:
the processing circuit performs a convolution operation on the input feature map using a folded filter of a convolution layer of the neural network model stored in the storage circuit to obtain intermediate results, wherein the folded filter is obtained by performing dimension folding on an original filter, and the dimension folding includes rearranging data of a convolution kernel width dimension and/or a convolution kernel height dimension into an output channel dimension; and
the processing circuit accumulates the intermediate results to obtain an output feature map.
14. The method of claim 13, wherein the number of output channels Co of the original filter is less than the number of output channels Co' of the folded filter.
15. The method of claim 14, wherein the processing circuitry is configured to perform the dimensional folding as follows:
determining, based on the number of output channels Co of the original filter, the convolution kernel width Kw, the convolution kernel height Kh, and a first threshold Aco, the data in the convolution kernel width dimension and/or the convolution kernel height dimension that needs to be rearranged into the output channel dimension; and
determining, based on the determined data to be rearranged, the number of output channels Co', the convolution kernel size, and the convolution step size of the folded filter.
16. The method of claim 15, wherein the processing circuit is further configured to determine the data that needs to be rearranged into the output channel dimension as follows:
rearranging the data in the convolution kernel width dimension and/or the convolution kernel height dimension into the output channel dimension, such that the number of output channels Co' satisfies Aco/2 ≤ Co' ≤ Aco, where Aco is the number of convolution operation units in the processing circuit.
17. The method of any of claims 15-16, wherein the processing circuit is further configured to determine the data that needs to be rearranged into the output channel dimension as follows:
when the convolution kernel width dimension and/or the convolution kernel height dimension cannot be evenly divided, zero-padding the convolution kernel width dimension and/or the convolution kernel height dimension so that it can be evenly divided.
18. The method of any of claims 15-17, wherein the processing circuitry is further configured to rearrange data in the convolution kernel width dimension and/or the convolution kernel height dimension into the output channel dimension as follows:
extracting the data in the convolution kernel width dimension and/or the convolution kernel height dimension in multiples of Co along the output channel dimension, and arranging the data sequentially in the direction of the output channel dimension.
19. The method of any of claims 13-18, wherein the processing circuit is further configured to accumulate intermediate results as follows:
accumulating, by corresponding position, the data in the intermediate results that are distributed over the output channel dimension.
20. The method of claim 19, wherein the processing circuit is further configured to accumulate as follows:
accumulating the intermediate results corresponding to a same output channel of the original filter to obtain an output feature map for that output channel, wherein data having the same index are accumulated, the index being determined based at least in part on the convolution step size.
21. The method of any of claims 13-20, wherein the number of input channels of the original filter is equal to the number of input channels of the folded filter.
22. The method of any of claims 13-21, wherein the folded filter is generated offline or online.
CN202011631704.0A 2020-12-31 2020-12-31 Data processing device, method and related product for executing neural network model Active CN112633490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011631704.0A CN112633490B (en) 2020-12-31 2020-12-31 Data processing device, method and related product for executing neural network model


Publications (2)

Publication Number Publication Date
CN112633490A CN112633490A (en) 2021-04-09
CN112633490B true CN112633490B (en) 2023-09-26

Family

ID=75289817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011631704.0A Active CN112633490B (en) 2020-12-31 2020-12-31 Data processing device, method and related product for executing neural network model

Country Status (1)

Country Link
CN (1) CN112633490B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115221102B (en) * 2021-04-16 2024-01-19 中科寒武纪科技股份有限公司 Method for optimizing convolution operation of system-on-chip and related product
CN115470176B (en) * 2021-06-10 2024-04-09 中科寒武纪科技股份有限公司 Computing device, method for implementing convolution operation by utilizing computing device and related product
CN113792867B (en) * 2021-09-10 2024-05-10 中科寒武纪科技股份有限公司 Arithmetic circuit, chip and board card
CN114943635B (en) * 2021-09-30 2023-08-22 太初(无锡)电子科技有限公司 Fusion operator design and implementation method based on heterogeneous collaborative computing core
CN116150556A (en) * 2021-11-19 2023-05-23 中科寒武纪科技股份有限公司 Computing device, method and related product for performing convolution operation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9984325B1 (en) * 2017-10-04 2018-05-29 StradVision, Inc. Learning method and learning device for improving performance of CNN by using feature upsampling networks, and testing method and testing device using the same
CN109255438A (en) * 2018-09-17 2019-01-22 地平线(上海)人工智能技术有限公司 The method and apparatus for adjusting tensor data
CN110188773A (en) * 2019-05-24 2019-08-30 北京迈格威科技有限公司 Feature extracting method, image processing method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102415508B1 (en) * 2017-03-28 2022-07-01 삼성전자주식회사 Convolutional neural network processing method and apparatus
CN107844827B (en) * 2017-11-28 2020-05-26 南京地平线机器人技术有限公司 Method and apparatus for performing operations on convolutional layers in convolutional neural networks
US11100352B2 (en) * 2018-10-16 2021-08-24 Samsung Electronics Co., Ltd. Convolutional neural network for object detection


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A convolutional neural network coprocessor design based on programmable logic devices; Yang Yichen; Liang Feng; Zhang Guohe; He Ping; Wu Bin; Gao Zhenting; Journal of Xi'an Jiaotong University (No. 07); full text *

Also Published As

Publication number Publication date
CN112633490A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN112633490B (en) Data processing device, method and related product for executing neural network model
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
CN115221102B (en) Method for optimizing convolution operation of system-on-chip and related product
US20230259780A1 (en) Neural network sparsification apparatus and method and related product
CN113469337B (en) Compiling method for optimizing neural network model and related products thereof
CN113469326A (en) Integrated circuit device and board card for executing pruning optimization in neural network model
CN114692820A (en) Data processing device and method for executing neural network model and related products
CN114764608A (en) Data processing device and method for executing neural network model and related products
CN113469333B (en) Artificial intelligence processor, method and related products for executing neural network model
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
CN115470176B (en) Computing device, method for implementing convolution operation by utilizing computing device and related product
CN116781484B (en) Data processing method, device, computer equipment and storage medium
CN113792867B (en) Arithmetic circuit, chip and board card
CN114692819A (en) Data processing device and method for executing neural network model and related products
CN114692847B (en) Data processing circuit, data processing method and related products
WO2023045638A1 (en) Computing device, method for implementing convolution operation by using computing device, and related product
CN114444677A (en) Device, board card and method for sparse training and readable storage medium
CN116483255A (en) Apparatus and method for accelerating data movement
CN116484926A (en) Self-adaptive splitting optimization equipment and method
CN115081602A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN114691083A (en) Matrix multiplication circuit, method and related product
CN114692848A (en) Device and board card for obtaining convolution result
CN117234408A (en) Method and device for reading target data in data based on instruction
CN115469827A (en) Convolution operation device and method and related product
CN116090519A (en) Compiling method of convolution operator and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant