CN114429194A - Device, board card, method and readable storage medium for processing neural network calculation - Google Patents

Info

Publication number
CN114429194A
Authority
CN
China
Prior art keywords: layer, convolution, storage unit, point, computing device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011183061.8A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority to CN202011183061.8A priority Critical patent/CN114429194A/en
Publication of CN114429194A publication Critical patent/CN114429194A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The present disclosure relates to a device, a board card, a method and a readable storage medium for processing neural network model computations. The computing device of the present disclosure is included in an integrated circuit device that also includes a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by the user. The integrated circuit device may further include a storage device, connected to the computing device and the other processing devices respectively, for storing data of the computing device and the other processing devices.

Description

Device, board card, method and readable storage medium for processing neural network calculation
Technical Field
The present disclosure relates generally to the field of neural networks. More particularly, the present disclosure relates to an apparatus, a board, a method, and a readable storage medium for processing neural network model computations.
Background
Depth separation (depthwise) convolution combined with point-by-point separation (pointwise) convolution is called depth separable (depthwise separable) convolution. Its overall operation is similar to a conventional convolution operation and can likewise be used to extract features; the depth separation convolution significantly reduces dimensionality and the amount of computation, while the point-by-point separation convolution performs inter-channel fusion or dimension change. Academia considers that, measured in floating-point operations (FLOPs), it performs better than the conventional convolution operation, so this structure is often used in lightweight networks such as the MobileNet model.
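To make the FLOPs comparison above concrete, a rough multiply-accumulate count can be sketched in a few lines of Python; the figures below are a back-of-the-envelope illustration with assumed shapes (k = 3, c_in = 3, c_out = 4, a 5 × 5 output), not data taken from this disclosure.

    # Rough multiply-accumulate counts for a k x k kernel, c_in input channels,
    # c_out output channels and an H x W output (illustrative values only).
    k, c_in, c_out, H, W = 3, 3, 4, 5, 5
    conventional = k * k * c_in * c_out * H * W            # 2700 for these shapes
    separable = (k * k * c_in + c_in * c_out) * H * W      # (27 + 12) * 25 = 975
    print(conventional, separable, separable / conventional)
    # The ratio is roughly 1/c_out + 1/k**2, which is the source of the low-FLOPs reputation.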
Although the depth separable convolution theoretically reduces the number of parameters and can improve the operation speed to a certain extent, in actual computation current hardware and software are not optimized for it, so despite its low-FLOPs advantage the training speed is not necessarily faster, and the computational overhead is often larger than that of a conventional convolution. Furthermore, the depth separable convolution does not exploit the information coupling between different channels, and its accuracy is poor relative to the conventional convolution, so for neural network computation the depth separable convolution is not an ideal choice. A convolution scheme that reduces computational overhead, adapts to the hardware configuration and has high accuracy is highly desirable.
Disclosure of Invention
To at least partially solve the technical problems mentioned in the background, the solution of the present disclosure provides an apparatus, a board, a method and a readable storage medium for processing neural network model calculation.
In one aspect, the present disclosure discloses a computing device for processing neural network model computations, connected to an off-chip memory, the neural network model including a depth separation convolutional layer whose convolution kernel has a size of h × w × c × m and a point-by-point separation convolutional layer whose convolution kernel has a size of 1 × 1 × m × p. The computing device comprises a neuron storage unit, a weight storage unit and an operation module. The neuron storage unit is used to load a feature map; the weight storage unit is used to load a specific convolution kernel; and the operation module is used to: load the feature map from the neuron storage unit, load the specific convolution kernel from the weight storage unit, and perform a convolution calculation according to the feature map and the specific convolution kernel to generate an intermediate result. The size of the specific convolution kernel is h × w × c × p.
In another aspect, the present disclosure discloses an integrated circuit device including the aforementioned computing device and a board including the aforementioned integrated circuit device.
In another aspect, the present disclosure discloses a method of processing a neural network model computation by a computing device, the computing device connected to an off-chip memory, the neural network model including a depth convolution layer and a point-by-point convolution layer, a convolution kernel of the depth convolution layer having a size of h × w × c × m, and a convolution kernel of the point-by-point convolution layer having a size of 1 × 1 × m × p, the computing device including a neuron storage unit and a weight storage unit. The method comprises the following steps: loading a feature map from the off-chip memory into the neuron storage unit; loading a specific convolution kernel from the off-chip memory to the weight storage unit; performing convolution calculation according to the feature map and the specific convolution kernel to generate an intermediate result; wherein the size of the specific convolution kernel is h × w × c × p.
In another aspect, the present disclosure discloses a computer readable storage medium having stored thereon computer program code for processing a neural network model calculation by a computing device, the computer program code, when executed by the processing device, performing the aforementioned method.
In another aspect, the present disclosure discloses a method of computing a neural network model, the neural network model including a depth separation convolutional layer and a point-by-point separation convolutional layer, the method comprising: generating a specific convolution kernel according to the convolution kernels of the depth separation convolutional layer and the point-by-point separation convolutional layer; loading a feature map; and performing a convolution calculation according to the feature map and the specific convolution kernel. The result of the convolution calculation is the result of computing the depth separation convolutional layer and the point-by-point separation convolutional layer.
The present disclosure overcomes the technical prejudice of those skilled in the art, replaces the deep separable convolution with the conventional convolution, and proposes a convolution scheme that reduces the computational overhead, adapts to the hardware configuration, and has high precision.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the drawings, several embodiments of the disclosure are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts and in which:
FIG. 1 is a schematic diagram showing a conventional convolution of a neural network being replaced with a depth separable convolution;
FIG. 2 is a schematic diagram showing the inverted residual structure of MobileNet v2;
fig. 3 is a structural diagram illustrating a board card according to an embodiment of the present disclosure;
FIG. 4 is a block diagram illustrating an integrated circuit device of an embodiment of the present disclosure;
fig. 5 is a schematic diagram illustrating an internal structure of a single-core computing device according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating the internal architecture of a multi-core computing device of an embodiment of the present disclosure;
FIG. 7 is an internal block diagram of a processor core illustrating an embodiment of the present disclosure;
FIG. 8 is a schematic diagram showing when one processor core wants to write data to another clustered processor core;
FIG. 9 is a flow diagram illustrating computing a neural network model with hardware according to an embodiment of the present disclosure;
FIG. 10 is a flow diagram illustrating processing of neural network model calculations by a single-core computing device according to an embodiment of the present disclosure; and
FIG. 11 is a flow diagram illustrating processing of neural network model calculations by a multi-core computing device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
FIG. 1 shows a schematic diagram of a conventional convolution of a neural network being replaced with a depth separable convolution. The conventional convolution 10 is illustratively applied to a 5 × 5 pixel, three-channel input feature map 101 (of dimensions 5(h) × 5(w) × 3(c_in)), which passes through a convolutional layer of four 3 × 3 convolution kernels 102 (the convolution kernels have a size of 3(h) × 3(w) × 3(c_in) × 4(c_out)), and finally 4 output feature maps 103 are produced. Here h is the number of pixels of the input feature map 101 in the vertical direction, w is the number of pixels of the input feature map 101 in the horizontal direction, c_in is the number of input channels of the conventional convolution 10, and c_out is the number of output channels of the conventional convolution 10.
The relationship between the conventional convolution 10 and the depth separable convolution is as follows. As shown, the depth separable convolution includes a depth separation convolution 11 and a point-by-point separation convolution 12. First, the three-channel input feature map 101 is subjected to the depth separation convolution calculation; each convolution kernel 104 of the depth separation convolution 11 is responsible for one channel and operates in a two-dimensional plane, so the number of convolution kernels 104 corresponds one-to-one to the number of channels. After the calculation, three 5 × 5 intermediate feature maps 105 are generated, i.e. c_m is 3, where c_m refers to the number of output channels of the depth separation convolution 11. The number of intermediate feature maps 105 is the same as the number of channels of the input feature map 101 and is not expanded. Because this operation convolves each channel of the input feature map 101 independently, it cannot effectively extract feature information of different channels at the same spatial position, so the point-by-point separation convolution 12 is further needed to combine the intermediate feature maps 105 into an output feature map.
If 4 output feature maps 103 are to be generated as in the conventional convolution 10, the point-by-point separation convolution 12 must include 4 sets of convolution kernels 106, each convolution kernel 106 having a size of 1(h) × 1(w) × 3(c_m). Since the intermediate feature maps 105 are at the same time the output feature maps of the depth separation convolution 11 and the input feature maps of the point-by-point separation convolution 12, c_m is also the number of input channels of the point-by-point separation convolution 12. The convolution operation of the point-by-point separation convolution 12 performs a weighted combination of the intermediate feature maps 105 in the depth direction, thereby generating the output feature maps 103.
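The computation just described can be summarized with a short NumPy sketch using the illustrative shapes of FIG. 1 (a 5 × 5, three-channel input feature map, 3 × 3 depth separation kernels and four 1 × 1 point-by-point kernels). This is only an illustration of the arithmetic, not of the hardware described later, and all variable names are illustrative.

    import numpy as np

    kh = kw = 3                                 # height and width of each convolution kernel
    c_in, c_out = 3, 4                          # input and output channels (c_m equals c_in here)
    x = np.random.rand(5, 5, c_in)              # input feature map 101
    k_depth = np.random.rand(kh, kw, c_in)      # convolution kernels 104 (one 3 x 3 plane per channel)
    k_point = np.random.rand(c_in, c_out)       # convolution kernels 106 (four 1 x 1 x c_m kernels)

    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))   # "same" padding, stride 1
    mid = np.zeros_like(x)                      # intermediate feature maps 105
    for c in range(c_in):                       # depth separation convolution 11: one channel per kernel
        for i in range(5):
            for j in range(5):
                mid[i, j, c] = np.sum(pad[i:i + kh, j:j + kw, c] * k_depth[:, :, c])

    out = mid @ k_point                         # point-by-point separation convolution 12: channel mixing
    print(out.shape)                            # (5, 5, 4), i.e. the four output feature maps 103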
As mentioned above, those skilled in the art believe that better FLOPs performance can be obtained by using a depth separation convolution and a point-by-point separation convolution instead of the conventional convolution, so the structure of a depth separation convolution followed by a point-by-point separation convolution appears in commonly used neural network models, for example the inverted residual structure (inverted residual) in MobileNet v2. As shown in FIG. 2, the inverted residual structure of MobileNet v2 includes a depth separation convolution structure 21 and a point-by-point separation convolution structure 22, where the depth separation convolution structure 21 includes a 3 × 3 depth separation convolutional layer 201, a batch normalization layer (batch normalization) 202 and a ReLU6 activation layer 203, and the point-by-point separation convolution structure 22 includes a 1 × 1 point-by-point separation convolutional layer 204, a batch normalization layer 205 and a ReLU6 activation layer 206. The depth separation convolution structure 21 and the point-by-point separation convolution structure 22 implement the depth separation convolution 11 and the point-by-point separation convolution 12 of FIG. 1.
However, most artificial intelligence chips currently on the market do not implement the depth separation convolution and the point-by-point separation convolution as computation primitives (primitive), so when these convolutions are computed they have to be assembled from combinations of other computation primitives (such as matrix multiplication, element-wise operations, and the like), and the benefit expected by academia cannot be realized.
The present disclosure provides an embodiment which, through cooperation with the hardware, efficiently performs the depth separation convolution and the point-by-point separation convolution in a neural network model. Fig. 3 shows a schematic structural diagram of a board card 30 according to an embodiment of the disclosure. As shown in fig. 3, the board card 30 includes a chip 301, which is a system-on-chip (SoC) integrated with one or more combined processing devices. A combined processing device is an artificial intelligence arithmetic unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of fields such as computer vision, speech, natural language processing and data mining under complex scenarios. Deep learning technology is applied particularly widely in the field of cloud intelligence; one remarkable characteristic of cloud intelligence applications is the large amount of input data, which places high demands on the storage capacity and computing capacity of the platform. The board card 30 of this embodiment is therefore suitable for cloud intelligence applications, having huge off-chip storage, on-chip storage and strong computing capacity.
The chip 301 is connected to an external device 303 through an external interface device 302. The external device 303 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface. Data to be processed may be transferred by the external device 303 to the chip 301 through the external interface device 302, and the calculation results of the chip 301 may be transmitted back to the external device 303 via the external interface device 302. The external interface device 302 may take different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 30 also includes a memory device 304 for storing data, which includes one or more memory units 305. The memory device 304 is connected to, and transfers data with, the control device 306 and the chip 301 through a bus. The control device 306 in the board card 30 is configured to regulate the state of the chip 301. To this end, in one application scenario, the control device 306 may include a single-chip microcomputer (micro controller unit, MCU).
Fig. 4 is a structural diagram showing a combined processing device in the chip 301 of this embodiment. As shown in fig. 4, the combination processing device 40 includes a computing device 401, an interface device 402, a processing device 403, and a DRAM 404.
The computing device 401 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 403 through the interface device 402 to collectively perform the user-specified operations.
The interface device 402 is used for transmitting data and control instructions between the computing device 401 and the processing device 403. For example, the computing device 401 may obtain input data from the processing device 403 via the interface device 402 and write the input data to a storage device on the computing device 401 chip. Further, the computing device 401 may obtain control instructions from the processing device 403 via the interface device 402, and write the control instructions into a control cache on the computing device 401 chip. Alternatively or optionally, the interface device 402 may also read data from a storage device of the computing device 401 and transmit the data to the processing device 403.
The processing device 403, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and turning the computing device 401 on and/or off. Depending on the implementation, the processing device 403 may be one or more types of central processing unit (CPU), graphics processing unit (GPU) or other general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc., and their number may be determined according to actual needs. As previously mentioned, the computing device 401 of the present disclosure, considered on its own, may be viewed as having a single-core structure or a homogeneous multi-core structure; however, when the computing device 401 and the processing device 403 are considered together, they form a heterogeneous multi-core structure.
The DRAM404 is used as an off-chip memory for storing data to be processed, and is a DDR memory, which is typically 16G or larger in size and is used for storing data of the computing device 401 and/or the processing device 403.
Fig. 5 shows an internal structural diagram of the single-core computing device 401. The computing device 401 is used for processing input data in fields such as computer vision, speech, natural language and data mining, and the single-core computing device 401 comprises three modules: a control module 51, an operation module 52 and a storage module 53.
The control module 51 is used for coordinating and controlling the operations of the operation module 52 and the storage module 53 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 511 and an Instruction Decode Unit (IDU) 512. The instruction fetch unit 511 is used for obtaining an instruction from the processing device 403, and the instruction decode unit 512 decodes the obtained instruction and sends the decoded result to the operation module 52 and the storage module 53 as control information.
The operation module 52 includes a vector operation unit 521 and a matrix operation unit 522. The vector operation unit 521 is used for performing vector operations, and can support complex operations such as vector multiplication, addition and nonlinear transformation; the matrix operation unit 522 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
The storage module 53 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM)531, a weight storage unit (weight RAM, WRAM)532, and a Direct Memory Access (DMA) 533. NRAM531 is used to store input neurons, output neurons, and intermediate results after computation; the WRAM 532 is used for storing a convolution kernel of the deep learning network, namely a weight; the DMA 533 is connected to the DRAM404 through the bus 54, and is responsible for data transfer between the computing device 401 and the DRAM 404.
FIG. 6 illustrates the structure of the multi-core computing device 401. The multi-core computing device 401 is designed with a hierarchical structure: the computing device 401 is a system on chip (SoC) including at least one cluster (cluster), and each cluster includes a plurality of processor cores. In other words, the computing device 401 is constructed in a SoC-cluster-processor core hierarchy.
Looking at the system-on-chip hierarchy, as shown in FIG. 6, the computing device 401 includes an external storage controller 601, a peripheral communication module 602, an on-chip interconnect module 603, a synchronization module 604, and a plurality of clusters 605.
There may be multiple external memory controllers 601, illustratively 2 shown, for accessing an external memory device, such as DRAM404 in fig. 4, to read data from or write data to off-chip in response to an access request issued by the processor core. The peripheral communication module 602 is configured to receive a control signal from the processing device 403 through the interface device 402, and start the computing device 401 to perform a task. The on-chip interconnect module 603 connects the external memory controller 601, the peripheral communication module 602, and the plurality of clusters 605 for transmitting data and control signals between the respective modules. The synchronization module 604 is a global synchronization barrier controller (GBC) for coordinating the operation progress of the clusters and ensuring the synchronization of the information. The clusters 605 are the computing cores of the computing device 401, 4 are exemplarily shown in the figure, and as hardware advances, the computing device 401 of the present disclosure may further include 8, 16, 64, or even more clusters 605. The cluster 605 is used to efficiently execute the deep learning algorithm.
Viewed at the cluster level, as shown in fig. 6, each cluster 605 includes a plurality of processor cores 606 and a storage core 607.
Four processor cores 606 are exemplarily shown in the figure, and the present disclosure does not limit their number. The internal architecture of a processor core is shown in fig. 7. Each processor core 606 is similar to the single-core computing device 401 of fig. 5 and likewise comprises three modules: a control module 71, an operation module 72 and a storage module 73. The functions and structures of the control module 71, the operation module 72 and the storage module 73 are substantially the same as those of the control module 51, the operation module 52 and the storage module 53, and are not described again. It should be noted that the storage module 73 includes an input/output direct memory access (IODMA) module 733 and a move direct memory access (MVDMA) module 734. The IODMA 733 controls accesses between the NRAM 731/WRAM 732 and the DRAM 404 through the broadcast bus 609; the MVDMA 734 controls accesses between the NRAM 731/WRAM 732 and the core memory unit (SRAM) 608.
Returning to FIG. 6, the storage core 607 is primarily used to store and communicate, i.e., store shared data or intermediate results among the processor cores 606, as well as perform communications between the cluster 605 and the DRAM404, communications between clusters 605, communications between processor cores 606, and the like. In other embodiments, the memory core 607 may have the capability of scalar operations to perform scalar operations.
The memory core 607 includes the SRAM 608, a broadcast bus 609, a cluster direct memory access (CDMA) module 610 and a global direct memory access (GDMA) module 611. The SRAM 608 plays the role of a high-performance data transfer station: data multiplexed between different processor cores 606 in the same cluster 605 need not be fetched from the DRAM 404 by each processor core 606 separately, but is relayed among the processor cores 606 through the SRAM 608. The memory core 607 only needs to distribute the multiplexed data from the SRAM 608 to the processor cores 606 quickly, which improves inter-core communication efficiency and greatly reduces on-chip and off-chip input/output accesses.
The broadcast bus 609, CDMA610 and GDMA611 are used to perform communication among the processor cores 606, communication among the cluster 605, and data transfer between the cluster 605 and DRAM404, respectively. As will be described separately below.
The broadcast bus 609 is used to accomplish high-speed communication among the processor cores 606 in the cluster 605, and the broadcast bus 609 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (e.g., from a single processor core to a single processor core) data transfer, multicast is a communication that transfers a copy of data from SRAM 608 to a particular number of processor cores 606, and broadcast is a communication that transfers a copy of data from SRAM 608 to all processor cores 606, and is a special case of multicast.
The CDMA 610 is used to control access to the SRAM 608 among different clusters 605 within the same computing device 401. Fig. 8 shows a schematic diagram of one processor core writing data to a processor core of another cluster, to illustrate the operating principle of the CDMA 610. In this application scenario, the same computing device includes multiple clusters; for convenience of description, only cluster 0 and cluster 1 are shown in the figure, and each of them includes a plurality of processor cores. Also for convenience of description, only processor core 0 is shown in cluster 0 and only processor core 1 is shown in cluster 1. Processor core 0 wants to write data to processor core 1.
First, processor core 0 sends a unicast write request to write the data into the local SRAM 0. CDMA 0 acts as the master (master) end and CDMA 1 acts as the slave (slave) end. The master end pushes the write request to the slave end, that is, the master end sends a write address AW and write data W, and the data is transferred into the SRAM 1 of cluster 1. The slave end then sends a write response B as an acknowledgment. Finally, processor core 1 of cluster 1 sends a unicast read request to read the data from the SRAM 1.
Returning to fig. 6, the GDMA 611 cooperates with the external memory controller 601 to control access from the SRAM 608 of the cluster 605 to the DRAM 404, or to read data from the DRAM 404 into the SRAM 608. As can be seen from the foregoing, communication between the DRAM 404 and the NRAM 731 or WRAM 732 may be achieved via two channels. The first channel is to access the DRAM 404 directly from the NRAM 731 or WRAM 732 through the IODMA 733; the second channel is to transfer data between the DRAM 404 and the SRAM 608 via the GDMA 611, and then between the SRAM 608 and the NRAM 731 or WRAM 732 via the MVDMA 734. Although the second channel seemingly requires more components to participate and the data flow is longer, in some embodiments its bandwidth is much greater than that of the first channel, so communication between the DRAM 404 and the NRAM 731 or WRAM 732 may be more efficient over the second channel. An embodiment of the present disclosure may select a data transmission channel according to its own hardware conditions.
In other embodiments, the functionality of the GDMA611 and the functionality of the IODMA 733 may be integrated in the same component. Further, the functions of GDMA611, IODMA 733, CDMA610 and MVDMA 734 may be implemented by the same component.
This embodiment provides a method of computing a neural network model using the aforementioned hardware; more specifically, it provides, in combination with the hardware, a better alternative computation scheme for the depth separation convolutional layer and the point-by-point separation convolutional layer in a neural network model. Neural network models including a depth separation convolutional layer and a point-by-point separation convolutional layer are exemplified by the MobileNet series, the EfficientNet series, the MixeNet series, the GhostNet series, the Fbnet series, and the like. FIG. 9 is a flow chart illustrating computing a neural network model with this hardware according to this embodiment.
In step 901, a specific convolution kernel is generated from the convolution kernels of the depth separation convolutional layer and the point-by-point separation convolutional layer. When analyzing the neural network model, once the processing device 403 finds that the neural network model contains a depth separation convolutional layer and a point-by-point separation convolutional layer (such as the MobileNet v2 model of FIG. 2), it replaces them with an equivalent conventional convolution according to the hardware structure of the computing device 401.
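The disclosure does not give an explicit formula for constructing the specific convolution kernel. As one hedged illustration: when no nonlinear layer separates the two convolutions (or after any purely linear interposer, such as the normalization layers discussed below, has been folded into the weights) and c_m equals c_in, the two kernels can be merged exactly, because the point-by-point weight for output channel o simply scales the depth separation kernel of channel c. A minimal NumPy sketch under these assumptions follows; the array names are illustrative, not the patent's.

    import numpy as np

    def fuse_kernels(k_depth, k_point):
        """Merge a depth separation kernel of shape (h, w, c_in) and a point-by-point
        kernel of shape (c_m, c_out) into one conventional kernel of shape
        (h, w, c_in, c_out). Assumes c_m == c_in and no nonlinearity in between."""
        h, w, c_in = k_depth.shape
        c_m, c_out = k_point.shape
        assert c_m == c_in
        # k_specific[i, j, c, o] = k_depth[i, j, c] * k_point[c, o]
        return k_depth[:, :, :, None] * k_point[None, None, :, :]

    k_depth = np.random.rand(3, 3, 3)      # h x w x c_in
    k_point = np.random.rand(3, 4)         # c_m x c_out
    k_specific = fuse_kernels(k_depth, k_point)
    print(k_specific.shape)                # (3, 3, 3, 4), i.e. h x w x c_in x c_out

When a nonlinear activation such as ReLU6 sits between the two layers, this closed-form merge is no longer exact, so the equivalence described in this disclosure should not be read as this simple product in the general case.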
In addition to the depth separation convolutional layer directly followed by the point-by-point separation convolutional layer, this embodiment can also handle the structure of depth separation convolutional layer + interposer + point-by-point separation convolutional layer, as long as the input feature map and the output feature map of the interposer have the same size. Such an interposer may be a normalization layer, an activation layer, or a normalization layer plus an activation layer. The normalization layer is illustratively a BatchNorm (BN) layer, a LayerNorm (LN) layer, an InstanceNorm (IN) layer or a GroupNorm (GN) layer. The BN layer normalizes in the batch direction; the LN layer normalizes in the channel direction, computing the mean over C × H × W; the IN layer also normalizes in the channel direction but computes the mean over H × W; and the GN layer groups channels and normalizes within each group. The activation layer is illustratively a sigmoid function layer, a tanh function layer, a ReLU layer, a Swish layer, a ReLU6 layer, a PReLU layer or an ELU layer. In addition, as shown in FIG. 2, in the inverted residual structure of MobileNet v2, the interposer between the depth separation convolution structure 21 and the point-by-point separation convolution structure 22 consists of a normalization layer (batch normalization layer 202) plus an activation layer (ReLU6 activation layer 203).
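Where the interposer is a pure normalization layer, its affine transform is linear and can in principle be absorbed into the neighbouring convolution weights; the sketch below shows the standard batch-norm folding computation as background only, not as a step this disclosure prescribes. It assumes the kernel's last axis is the output-channel axis (which holds both for a conventional kernel of shape h × w × c_in × c_out and for a depth separation kernel of shape h × w × c), and all names are illustrative.

    import numpy as np

    def fold_batchnorm(kernel, bias, gamma, beta, mean, var, eps=1e-5):
        """Fold a BatchNorm layer that follows a convolution into the convolution
        itself. gamma, beta, mean and var are the BN parameters/statistics, one
        value per output channel; the kernel's last axis is the output-channel axis."""
        scale = gamma / np.sqrt(var + eps)      # per-output-channel scale
        folded_kernel = kernel * scale          # broadcasts over the last axis
        folded_bias = (bias - mean) * scale + beta
        return folded_kernel, folded_bias

    k = np.random.rand(3, 3, 3, 4)              # h x w x c_in x c_out
    b = np.zeros(4)
    g, bt, m, v = np.ones(4), np.zeros(4), np.zeros(4), np.ones(4)
    fk, fb = fold_batchnorm(k, b, g, bt, m, v)  # an identity BN here, so fk is approximately k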
Next, step 902 is executed to load a feature map. If the computing device 401 is a single-core computing device, the NRAM 531 is used to load the input feature map, and the WRAM 532 is used to load the corresponding weight, i.e. the specific convolution kernel of the aforementioned conventional convolution. In the case of a multi-core computing device 401, the SRAM 608 is used to load the input feature map and the specific convolution kernel from the DRAM 404; since each cluster illustratively includes 4 processor cores 606, the input feature map is split into 4 parts, the NRAM 731 of each processor core 606 is loaded with its portion of the input feature map via the broadcast bus 609, and the WRAM 732 of each processor core 606 is likewise loaded with the specific convolution kernel via the broadcast bus 609.
In step 903, a convolution calculation is performed according to the feature map and the specific convolution kernel. If the computing device 401 is a single-core computing device, the control module 51 controls the operation module 52 to load the input feature map from the NRAM 531 and the specific convolution kernel from the WRAM 532, and the matrix operation unit 522 performs the convolution calculation according to the input feature map and the specific convolution kernel to obtain a calculation result, which is stored in the NRAM 531. If it is a multi-core computing device 401, the control module 71 controls the operation module 72 to fetch the split input feature map from the NRAM 731 and the weight from the WRAM 732, and the matrix operation unit 722 performs the convolution calculation to obtain intermediate results; one of the processor cores 606 then reduces the intermediate results of the 4 processor cores 606 to generate the calculation result of the whole input feature map, which is stored in the SRAM 608 and finally stored back from the SRAM 608 to the DRAM 404. At this point, the equivalent calculation of the depth separation convolutional layer and the point-by-point separation convolutional layer is complete.
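The disclosure does not state along which axis the input feature map is split among the processor cores. Purely as an illustration, if the split were made along the input-channel axis, each core would compute a partial convolution over its channel subset, and the per-core intermediate results would be combined by element-wise addition, which is consistent with the accumulating reduction described later. A NumPy sketch of that arrangement (the split axis and all names are assumptions):

    import numpy as np

    def conv2d(x, k):
        """Naive 'same'-padded, stride-1 convolution; x is (H, W, c_in), k is (kh, kw, c_in, c_out)."""
        kh, kw, c_in, c_out = k.shape
        H, W, _ = x.shape
        xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2), (0, 0)))
        y = np.zeros((H, W, c_out))
        for i in range(H):
            for j in range(W):
                y[i, j] = np.tensordot(xp[i:i + kh, j:j + kw], k, axes=3)
        return y

    x = np.random.rand(5, 5, 8)                # input feature map with 8 channels
    k = np.random.rand(3, 3, 8, 4)             # specific convolution kernel
    subsets = np.split(np.arange(8), 4)        # one channel subset per processor core
    partials = [conv2d(x[:, :, s], k[:, :, s, :]) for s in subsets]   # per-core intermediate results
    result = np.sum(partials, axis=0)          # reduction: element-wise accumulation
    assert np.allclose(result, conv2d(x, k))   # matches the unsplit convolution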
By targeting specific hardware and replacing the depth separable convolution with a conventional convolution, this embodiment solves the latency problem and achieves the technical effects of reducing computational overhead, adapting to the hardware configuration and having high accuracy.
Fig. 10 further illustrates a flow chart of how the single-core computing device 401 computes a conventional convolution.
In step 1001, the feature map is loaded from the DRAM 404 into the NRAM 531. The input feature map is originally stored in the off-chip DRAM 404; the control module 51 executes the executable instructions and loads the feature map from the DRAM 404 to the NRAM 531 via the DMA 533, ready for calculation by the operation module 52.
In step 1002, the specific convolution kernel is loaded from the DRAM 404 into the WRAM 532. In step 901, the processing device 403 has already generated the specific convolution kernel from the convolution kernels of the depth separation convolutional layer and the point-by-point separation convolutional layer; this convolution kernel corresponds to a conventional convolutional layer and has a size of h × w × c_in × c_out, such that the computation of the conventional convolutional layer is equivalent to the depth separation convolutional layer plus the point-by-point separation convolutional layer. In this step, the control module 51 executes executable instructions to load the specific convolution kernel from the DRAM 404 into the WRAM 532.
At step 1003, a convolution calculation is performed based on the feature map and the particular convolution kernel to generate an intermediate result. The control module 51 controls the operation module 52 to load the input feature map from the NRAM531 and the specific convolution kernel from the WRAM 532, and causes the matrix operation unit 522 to perform convolution calculation according to the input feature map and the specific convolution kernel to obtain a calculation result, which is stored in the NRAM 531.
FIG. 11 further illustrates a flow diagram of how the multi-core computing device 401 computes a conventional convolution.
In step 1101, the GDMA611 loads the signature graph from the DRAM404 into the SRAM 608.
In step 1102, the control module 71 loads the feature map from the SRAM 608 to the NRAM731, ready for calculation by the operation module 72.
In step 1103, the GDMA611 loads the specific convolution kernel of the conventional convolution from the DRAM404 to the SRAM 608.
In step 1104, the control module 71 loads the convolution kernel from the SRAM 608 to the WRAM 732 in preparation for calculation by the operation module 72.
In step 1105, a convolution calculation is performed according to the feature map and the specific convolution kernel to produce an intermediate result. The control module 71 controls the operation module 72 to load the input feature map from the NRAM 731 and the specific convolution kernel from the WRAM 732, and causes the matrix operation unit 722 to perform the convolution calculation according to the input feature map and the specific convolution kernel to obtain an intermediate result, which is stored in the NRAM 731.
In step 1106, one of the processor cores 606 reduces all intermediate results in the same cluster 605 to produce a computed result. There are various ways in which reduction can be achieved, and the following takes a ring full reduction (ring reduce) as an example to illustrate how this embodiment performs reduction.
The ring full reduction organizes the clusters 605 into a logical ring: each cluster 605 is connected only to the previous cluster 605 and the next cluster 605, and receives and sends data in the same direction. A reduction procedure is then performed, in which the clusters 605 carry out N-1 reduction iterations (N being 4 in this embodiment). In each iteration, each cluster 605 sends intermediate results to the next cluster 605 and receives intermediate results from the previous cluster 605 for accumulation, and the intermediate results sent and received by each cluster 605 differ from iteration to iteration. After these iterations, within each cluster 605 one processor core 606 holds a completely reduced part of the result. To achieve the full reduction, the clusters 605 must then exchange these completed results so that all clusters 605 hold the same final values; this step is called allgather (allgather). The allgather procedure is similar to the reduction procedure, i.e. N-1 iterations are performed, but the values received by a cluster 605 are overwritten rather than accumulated; finally all the processor cores 606 hold the complete calculation result.
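The ring procedure just described can be simulated in a few lines of Python. The sketch below only illustrates the data movement and accumulation order; the participants are written generically (corresponding to the clusters or processor cores in the text above), it is not the hardware mechanism, and all names are illustrative.

    import numpy as np

    def ring_allreduce(chunks):
        """Toy ring full reduction. 'chunks' is a list with one entry per participant;
        each entry is a list of N equally sized numpy arrays. Afterwards every
        participant holds the element-wise sum of all participants' data."""
        n = len(chunks)
        # Reduce phase: N-1 iterations; each participant passes one chunk to the
        # next participant, which accumulates it.
        for step in range(n - 1):
            for p in range(n):
                idx = (p - step) % n
                chunks[(p + 1) % n][idx] += chunks[p][idx]
        # Allgather phase: N-1 iterations; the completed chunks circulate and
        # overwrite, so every participant ends up with all reduced chunks.
        for step in range(n - 1):
            for p in range(n):
                idx = (p + 1 - step) % n
                chunks[(p + 1) % n][idx] = chunks[p][idx].copy()
        return chunks

    # Four participants (N = 4), each holding four chunks of an intermediate result.
    data = [[np.full(2, p, dtype=float) for _ in range(4)] for p in range(4)]
    ring_allreduce(data)
    print(data[0][0])   # every chunk on every participant is now [6., 6.] (= 0 + 1 + 2 + 3)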
The above ring full reduction operation is only used to illustrate one implementation of the reduction of this embodiment, and the disclosure does not limit the reduction manner.
In step 1107, the MVDMA 734 stores the reduced calculation result in the SRAM 608.
In step 1108, the GDMA611 stores the calculation back to the DRAM404 from the SRAM 608.
Both the embodiments of fig. 10 and fig. 11 use the hardware configuration to replace the deep convolutional layer and the point-by-point convolutional layer in the neural network model with a conventional convolutional layer, thereby achieving the technical effects of reducing the computation overhead, adapting to the hardware configuration and having high precision.
Another embodiment of the present disclosure is a computer-readable storage medium having stored thereon computer program code for processing a neural network model computation by a computing device; when the computer program code is executed by a processor, it performs the method of the embodiments described above. In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. In this regard, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server or a network device, etc.) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The memory may include, but is not limited to, a USB flash disk, a flash memory, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disk.
Although the embodiments of the present disclosure are described with a convolution kernel of 3(h) × 3(w), the present disclosure is not limited thereto, and the techniques of the present disclosure may be applied to convolution kernels of any size. In addition, although the embodiments of the present disclosure only use dimensions such as h, w, c_in and c_out for illustration, the disclosure is likewise not limited thereto, and the techniques of the disclosure can be applied to any dimension involved in convolution calculations. Those skilled in the art, with reference to the disclosure herein, may readily extend the techniques of the present disclosure to convolution kernels of any size and any dimension without inventive effort, and such extensions still fall within the scope of the present disclosure.
Some measured data are given below to illustrate the difference between the prior art and the disclosed technique in inference speed and accuracy. Taking an input feature map of size 244 × 244 × 3 running the MobileNet v2 model as an example, when inference is performed on the Cambricon MLU270 cloud artificial intelligence chip, the prior art (depth separable convolution) runs at 1982 frames/s with an accuracy of 71.8%, while the disclosed technique (conventional convolution) runs at 4480 frames/s with an accuracy of 73.5%. When the GhostNet model is run under the same conditions, the accuracy improves from 73.9% to 75.3%; when the FBNet model is run under the same conditions, the accuracy improves from 75.2% to 76.9%. Likewise, when inference is performed on an NVIDIA Tesla T4 GPU, the prior art runs at 4757 frames/s with an accuracy of 71.8%, while the disclosed technique runs at 5074 frames/s with an accuracy of 73.5%. It is apparent that, after taking the hardware configuration of the artificial intelligence chip into account, replacing the depth separable convolution in such a neural network model with the conventional convolution of the present disclosure brings significant improvements in both speed and accuracy.
For a long time, those skilled in the art have considered the depth separable convolution to be more efficient than the conventional convolution; this is a perception widespread in academia that deviates from practical application. Such erroneous recognition leads those skilled in the art away from considering other possibilities and hinders research and development in this technical field. The present disclosure overcomes this technical prejudice: instead of pursuing low FLOPs, it targets specific hardware and replaces the depth separable convolution with the conventional convolution, adopting a technical means that had been abandoned because of the prejudice, thereby solving the technical problem and providing a convolution scheme that reduces computational overhead, adapts to the hardware configuration and has high accuracy.
Moreover, the present disclosure achieves unexpected technical effects, producing changes in "quality" and "quantity" beyond what one would expect compared with the prior art. Such changes in "quality" or "quantity" could not have been predicted or inferred in advance by those skilled in the art, which on the one hand shows that the present disclosure represents a significant advance, and on the other hand reflects that the technical solutions of the present disclosure are not obvious and possess prominent substantive features.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud, an edge, and a terminal. In one or more embodiments, an electronic device or apparatus with high computing power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), and an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of brevity, this disclosure presents some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the aspects of the disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of being practiced in other than the specifically disclosed embodiments, and that the acts or modules illustrated herein are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the related description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the scheme of the embodiment of the disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause a1, a computing device for processing computation of a neural network model, connected to an off-chip memory, the neural network model comprising a depth-separated convolutional layer and a point-by-point-separated convolutional layer, the convolutional cores of the depth-convolutional layer having a size of h × w × cin×cmThe convolution kernel of the point-by-point convolution layer has a size of 1 × 1 × cm×coutThe computing device comprises: the neuron storage unit is used for loading the characteristic diagram; a weight value storage unit for loading a specific convolution kernel; and an operation module for: loading the feature map from the neuron storage unit; loading the specific convolution kernel from the weight storage unit; performing convolution calculation according to the feature map and the specific convolution kernel to generate an intermediate result; wherein the size of the specific convolution kernel is h × w × cin×cout
Clause a2, the computing device of clause a1, wherein the neural network model further comprises a mediating layer positioned between the depth convolution layer and the point-by-point convolution layer, an input feature map and an output feature map of the mediating layer being the same size.
Clause A3, the computing device of clause a2, wherein the interposer is a normalization layer.
Clause a4, the computing device of clause a2, wherein the interposer is an activation layer.
Clause a5, the computing device of clause a4, wherein the activation layer is one of a ReLU layer, a Swish layer, and a ReLU6 layer.
Clause a6, the computing device of clause a2, wherein the interposer is a normalization layer plus an activation layer.
Clause A7, the computing device of clause A1, further comprising at least one cluster, each cluster comprising: a core storage unit for loading the feature map and the specific convolution kernel from the off-chip memory; and a plurality of processor cores, each processor core including the neuron storage unit and the weight storage unit; wherein the neuron storage unit loads the feature map from the core storage unit, and the weight storage unit loads the specific convolution kernel from the core storage unit.
Clause A8, the computing device of clause a7, wherein one of the plurality of processor cores reduces the intermediate result for each processor core to produce a computed result, stored into the core storage unit.
Clause a9, the computing device of clause A8, wherein each cluster further comprises a direct memory access module to store the computation results from the core storage unit back to the off-chip memory.
Clause a10, the computing device of clause a1, wherein the neural network model is one of the MobileNet series, the EfficientNet series, the MixeNet series, the GhostNet series, and the Fbnet series.
Clause a11, an integrated circuit device comprising the computing device of any one of clauses a 1-10.
Clause a12, a board comprising the integrated circuit device of clause a 11.
Clause A13, a method of processing a computation of a neural network model by a computing device, the computing device connected to an off-chip memory, the neural network model including a depth separation convolutional layer and a point-by-point separation convolutional layer, the convolution kernel of the depth separation convolutional layer having a size of h × w × c_in × c_m, and the convolution kernel of the point-by-point separation convolutional layer having a size of 1 × 1 × c_m × c_out, the computing device comprising a neuron storage unit and a weight storage unit, the method comprising: loading a feature map from the off-chip memory into the neuron storage unit; loading a specific convolution kernel from the off-chip memory into the weight storage unit; and performing a convolution calculation according to the feature map and the specific convolution kernel to generate an intermediate result; wherein the size of the specific convolution kernel is h × w × c_in × c_out.
Clause A14, the method of clause A13, wherein the neural network model further comprises an intermediate layer located between the depthwise convolution layer and the pointwise convolution layer, the intermediate layer having an input feature map and an output feature map of the same size.
Clause A15, the method of clause A14, wherein the intermediate layer is a normalization layer.
Clause A16, the method of clause A14, wherein the intermediate layer is an activation layer.
Clause A17, the method of clause A16, wherein the activation layer is one of a ReLU layer, a Swish layer, and a ReLU6 layer.
Clause A18, the method of clause A14, wherein the intermediate layer is a normalization layer followed by an activation layer.
Clause A19, the method of clause A13, wherein the computing device further comprises at least one cluster, each cluster comprising a core storage unit and a plurality of processor cores, each processor core comprising the neuron storage unit and the weight storage unit, wherein the step of loading the feature map from the off-chip memory into the neuron storage unit comprises: loading the feature map from the off-chip memory to the core storage unit; and loading the feature map from the core storage unit to the neuron storage unit; and wherein the step of loading the specific convolution kernel from the off-chip memory to the weight storage unit comprises: loading the specific convolution kernel from the off-chip memory to the core storage unit; and loading the specific convolution kernel from the core storage unit to the weight storage unit.
Clause A20, the method of clause A19, further comprising: reducing all of the intermediate results to produce a computation result; and storing the computation result in the core storage unit.
Clause A21, the method of clause A20, further comprising: storing the computation result from the core storage unit back to the off-chip memory.
Clause A22, the method of clause A13, wherein the neural network model is one of the MobileNet series, the EfficientNet series, the MixNet series, the GhostNet series, and the FBNet series.
Clause A23, a computer-readable storage medium having stored thereon computer program code for processing a computation of a neural network model by a computing device, the computer program code, when executed by a processing device, performing the method of any one of clauses A13 to A22.
Clause A24, a method of computing a neural network model, the neural network model including a depthwise convolution layer and a pointwise convolution layer, the method comprising: generating a specific convolution kernel according to the convolution kernels of the depthwise convolution layer and the pointwise convolution layer; loading a feature map; and performing a convolution calculation according to the feature map and the specific convolution kernel; wherein the result of the convolution calculation is the result of computing the depthwise convolution layer and the pointwise convolution layer (a numerical sketch of this fusion follows these clauses).
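The following is a minimal NumPy sketch, not taken from the disclosure, of the kernel fusion described in clauses A1 and A24. It treats the two layers as ordinary convolutions with the kernel shapes stated in clause A1 (the exact depthwise indexing used by the disclosure may differ), contracts the h × w × c_in × c_m kernel with the 1 × 1 × c_m × c_out kernel over the shared c_m dimension to obtain the specific convolution kernel of size h × w × c_in × c_out, and verifies numerically that a single convolution with the fused kernel matches the two-step computation when no intermediate layer (or only one already folded into the kernels) sits between the two convolutions. All function and variable names are illustrative.

```python
import numpy as np

def conv2d(x, w):
    """Direct convolution, valid padding, stride 1.
    x: (H, W, c_in), w: (kh, kw, c_in, c_out) -> (H-kh+1, W-kw+1, c_out)."""
    H, W, _ = x.shape
    kh, kw, _, c_out = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.einsum('abc,abcd->d', x[i:i + kh, j:j + kw, :], w)
    return out

def fuse_kernels(w_depth, w_point):
    """Contract over the shared c_m axis: (h, w, c_in, c_m) x (1, 1, c_m, c_out)
    -> specific convolution kernel of shape (h, w, c_in, c_out)."""
    return np.einsum('hwim,mo->hwio', w_depth, w_point[0, 0])

# Hypothetical sizes, for illustration only.
h, w, c_in, c_m, c_out = 3, 3, 4, 8, 16
x = np.random.randn(10, 10, c_in)
w_depth = np.random.randn(h, w, c_in, c_m)
w_point = np.random.randn(1, 1, c_m, c_out)

two_step = conv2d(conv2d(x, w_depth), w_point)        # first convolution, then 1x1 convolution
one_step = conv2d(x, fuse_kernels(w_depth, w_point))  # single convolution with the fused kernel
assert np.allclose(two_step, one_step)
```

The final assertion checks the equivalence on random data; it holds because a 1 × 1 convolution is a purely channel-wise linear map and therefore commutes into the preceding convolution's kernel.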
The embodiments of the present disclosure are described in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the descriptions of the embodiments are only intended to help in understanding the methods and core ideas of the present disclosure. Meanwhile, a person skilled in the art may, based on the ideas of the present disclosure, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present disclosure.
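Clauses A2 to A6 allow an intermediate layer between the two convolutions as long as its input and output feature maps have the same size. For the normalization-layer case of clause A3, one common way to keep the computation as a single convolution is to fold the per-channel scale into the first kernel before fusion and to propagate the per-channel shift through the pointwise kernel as an output bias. The sketch below shows that folding under the assumption that the normalization layer is BatchNorm-like with parameters over the c_m axis; the disclosure does not state that it handles the intermediate layer this way, and an activation layer (clauses A4, A5) is nonlinear and cannot be folded in this manner. Names and shapes are illustrative.

```python
import numpy as np

def fold_batchnorm_and_fuse(w_depth, w_point, gamma, beta, mean, var, eps=1e-5):
    """Hypothetical folding of a BatchNorm-style normalization layer into the fused kernel.
    w_depth: (h, w, c_in, c_m), w_point: (1, 1, c_m, c_out),
    gamma/beta/mean/var: (c_m,) per-channel normalization parameters.
    Returns (fused_kernel, bias) so that
    pointwise(batchnorm(first_conv(x))) == conv(x, fused_kernel) + bias."""
    scale = gamma / np.sqrt(var + eps)      # per-channel scale of the normalization layer
    shift = beta - scale * mean             # per-channel shift of the normalization layer
    w_depth_scaled = w_depth * scale        # broadcasts over the trailing c_m axis
    fused = np.einsum('hwim,mo->hwio', w_depth_scaled, w_point[0, 0])
    bias = shift @ w_point[0, 0]            # (c_out,): shift passed through the 1x1 kernel
    return fused, bias
```

With such a folding, the operation module of clause A1 still performs a single convolution per output position; only a per-channel bias addition is appended to the result.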

Claims (24)

1. A computing device for processing neural network model calculations, coupled to an off-chip memory, the neural network model comprising a depthwise convolution layer and a pointwise convolution layer, wherein the size of a convolution kernel of the depthwise convolution layer is h × w × c_in × c_m and the size of a convolution kernel of the pointwise convolution layer is 1 × 1 × c_m × c_out, the computing device comprising:
a neuron storage unit for loading a feature map;
a weight storage unit for loading a specific convolution kernel; and
an operation module for:
loading the feature map from the neuron storage unit;
loading the specific convolution kernel from the weight storage unit; and
performing convolution calculation according to the feature map and the specific convolution kernel to generate an intermediate result;
wherein the size of the specific convolution kernel is h × w × c_in × c_out.
2. The computing device of claim 1, wherein the neural network model further comprises an intermediate layer between the depthwise convolution layer and the pointwise convolution layer, an input feature map of the intermediate layer being the same size as an output feature map of the intermediate layer.
3. The computing device of claim 2, wherein the intermediate layer is a normalization layer.
4. The computing device of claim 2, wherein the intermediate layer is an activation layer.
5. The computing device of claim 4, wherein the activation layer is one of a ReLU layer, a Swish layer, and a ReLU6 layer.
6. The computing device of claim 2, wherein the intermediate layer is a normalization layer followed by an activation layer.
7. The computing device of claim 1, further comprising at least one cluster, each cluster comprising:
a core storage unit for loading the feature map and the specific convolution kernel from the off-chip memory;
a plurality of processor cores, each processor core including the neuron storage unit and the weight storage unit;
wherein the neuron storage unit loads the feature map from the core storage unit, and the weight storage unit loads the specific convolution kernel from the core storage unit.
8. The computing device of claim 7, wherein one of the plurality of processor cores reduces the intermediate results of all the processor cores to produce a computation result, which is stored into the core storage unit.
9. The computing device of claim 8, wherein each cluster further comprises a direct memory access module to store the computation result from the core storage unit back to the off-chip memory.
10. The computing device of claim 1, wherein the neural network model is one of the MobileNet series, the EfficientNet series, the MixNet series, the GhostNet series, and the FBNet series.
11. An integrated circuit device comprising a computing device according to any of claims 1-10.
12. A board card comprising the integrated circuit device of claim 11.
13. A method of processing a neural network model computation by a computing device, the computing device being connected to an off-chip memory, the neural network model comprising a depthwise convolution layer and a pointwise convolution layer, the convolution kernel of the depthwise convolution layer having a size of h × w × c_in × c_m and the convolution kernel of the pointwise convolution layer having a size of 1 × 1 × c_m × c_out, the computing device comprising a neuron storage unit and a weight storage unit, the method comprising the following steps:
loading a feature map from the off-chip memory into the neuron storage unit;
loading a specific convolution kernel from the off-chip memory to the weight storage unit; and
performing convolution calculation according to the feature map and the specific convolution kernel to generate an intermediate result;
wherein the size of the specific convolution kernel is h × w × c_in × c_out.
14. The method of claim 13, wherein the neural network model further comprises an intermediate layer located between the depthwise convolution layer and the pointwise convolution layer, the intermediate layer having an input feature map and an output feature map of the same size.
15. The method of claim 14, wherein the intermediate layer is a normalization layer.
16. The method of claim 14, wherein the intermediate layer is an activation layer.
17. The method of claim 16, wherein the activation layer is one of a ReLU layer, a Swish layer, and a ReLU6 layer.
18. The method of claim 14, wherein the intermediate layer is a normalization layer followed by an activation layer.
19. The method of claim 13, wherein the computing device further comprises at least one cluster, each cluster comprising a core storage unit and a plurality of processor cores, each processor core comprising the neuron storage unit and the weight storage unit,
wherein the step of loading a feature map from the off-chip memory into the neuron storage unit comprises:
loading the feature map from the off-chip memory to the core storage unit; and
loading the feature map from the core storage unit to the neuron storage unit;
wherein the step of loading the specific convolution kernel from the off-chip memory to the weight storage unit includes:
loading the specific convolution kernel from the off-chip memory to the core storage unit; and
loading the specific convolution kernel from the core storage unit to the weight storage unit (see the data-movement sketch following the claims).
20. The method of claim 19, further comprising:
reducing all of the intermediate results to produce a computation result; and
storing the computation result into the core storage unit.
21. The method of claim 20, further comprising:
storing the computation result from the core storage unit back to the off-chip memory.
22. The method of claim 13, wherein the neural network model is one of the MobileNet series, the EfficientNet series, the MixNet series, the GhostNet series, and the FBNet series.
23. A computer-readable storage medium having stored thereon computer program code for processing a neural network model calculation by a computing device, the computer program code, when executed by a processing device, performing the method of any of claims 13 to 22.
24. A method of computing a neural network model, the neural network model including a depthwise convolution layer and a pointwise convolution layer, the method comprising:
generating a specific convolution kernel according to the convolution kernels of the depthwise convolution layer and the pointwise convolution layer;
loading a feature map; and
performing a convolution calculation according to the feature map and the specific convolution kernel;
wherein the result of the convolution calculation is the result of computing the depthwise convolution layer and the pointwise convolution layer.
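As a reading aid for claims 19 to 21 (and the corresponding apparatus claims 7 to 9), the toy simulation below mimics the described data movement: the feature map and the specific convolution kernel travel from off-chip memory to the cluster-level core storage unit, then into the per-core neuron and weight storage units; each processor core produces an intermediate result, one core reduces them into the computation result, and the result is written back to off-chip memory. The claims do not specify how work is partitioned across processor cores, so this sketch splits the input channels and sums the per-core partial results during the reduce step; the dictionary-based "memories", the splitting scheme, and all names are assumptions for illustration only.

```python
import numpy as np

NUM_CORES = 4
h, w, c_in, c_out = 3, 3, 8, 16                       # hypothetical sizes
off_chip = {
    "feature_map": np.random.randn(10, 10, c_in),
    "specific_kernel": np.random.randn(h, w, c_in, c_out),
}

# Claim 19: off-chip memory -> core storage unit of the cluster.
core_storage = {k: v.copy() for k, v in off_chip.items()}

# Core storage unit -> per-core neuron / weight storage units (channel slices).
channel_slices = np.array_split(np.arange(c_in), NUM_CORES)
intermediate_results = []
for channels in channel_slices:
    neuron_storage = core_storage["feature_map"][:, :, channels]
    weight_storage = core_storage["specific_kernel"][:, :, channels, :]
    # Each processor core convolves its channel slice (valid padding, stride 1).
    H, W = neuron_storage.shape[:2]
    partial = np.zeros((H - h + 1, W - w + 1, c_out))
    for i in range(partial.shape[0]):
        for j in range(partial.shape[1]):
            partial[i, j] = np.einsum('abc,abcd->d',
                                      neuron_storage[i:i + h, j:j + w, :],
                                      weight_storage)
    intermediate_results.append(partial)

# Claim 20: one processor core reduces the intermediate results.
core_storage["result"] = np.sum(intermediate_results, axis=0)

# Claim 21: the direct memory access module writes the result back off-chip.
off_chip["result"] = core_storage["result"].copy()
```

Because a convolution sums over input channels, summing the per-core partial results reproduces the full convolution; other partitionings (for example, spatial tiling with concatenation as the reduce step) would fit the same claims.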