CN114358264A - Device, board card and method for fusing neural network and readable storage medium - Google Patents

Device, board card and method for fusing neural network and readable storage medium

Info

Publication number
CN114358264A
CN114358264A
Authority
CN
China
Prior art keywords
layer
neural network
rule
template
fusion unit
Prior art date
Legal status
Pending
Application number
CN202011045888.2A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd
Priority to CN202011045888.2A
Priority to PCT/CN2021/119943 (published as WO2022063183A1)
Priority to US18/003,682 (published as US20230274158A1)
Publication of CN114358264A


Abstract

The disclosure relates to devices, board cards, methods, and readable storage media for fusing neural networks. The computing device of the disclosure is included in an integrated circuit device that also comprises a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by the user. The integrated circuit device may further include a storage device connected to both the computing device and the other processing devices for storing their data.

Description

Device, board card and method for fusing neural network and readable storage medium
Technical Field
The present disclosure relates generally to the field of neural networks. More particularly, the present disclosure relates to devices, boards, methods and readable storage media for fusing neural networks.
Background
A neural network is a system of many neurons connected according to certain rules; it is roughly composed of the following four kinds of layers: an input layer, convolutional layers, pooling layers, and fully connected layers.
The input layer intercepts part of the information from the input data and converts it into a feature matrix that carries the features corresponding to that information. The convolutional layer receives the feature matrix from the input layer and extracts features from the input data through convolution operations; in practice, several convolutional layers may be stacked. The pooling layer replaces a region of data with a single value, typically the maximum or the average of all values in that region; through pooling, the model size can be reduced and the computation sped up without losing too much information. The fully connected layer acts as the classifier of the whole convolutional neural network; it is equivalent to a feature-space transformation that extracts and integrates the useful information produced by the preceding layers and compares it against the different classes to judge what the input data resembles.
With the development of science and technology, neural networks contain more and more layers. Taking the classic VGG architecture as an example, VGG-A has 11 weight layers, VGG-B has 13, VGG-C has 16, VGG-D has 16, and VGG-E has 19, where convolutional layers and fully connected layers are collectively referred to as weight layers. Some neural networks have structures of hundreds of layers. Furthermore, as the number of layers increases, the number of parameters of the neural network also increases dramatically; AlexNet, for example, has 60 million parameters participating in the computation.
The many layers and parameters require a large number of off-chip input/output accesses, which consume considerable resources and delay computation. A mechanism for reducing input/output accesses is therefore highly desirable in the field of artificial intelligence.
Disclosure of Invention
To at least partially solve the technical problems mentioned in the background, the disclosed solution provides an apparatus, a board, a method and a readable storage medium for fusing a neural network.
In one aspect, the present disclosure discloses an integrated circuit device for fusing a neural network that includes an i-th layer whose input feature map is smaller than its output feature map. The integrated circuit device includes a processing device and a computing device. The processing device is configured to establish a template fusion unit according to a fusion strategy; the computing device is configured to perform neural network computation according to the template fusion unit. The template fusion unit includes the i-th layer, and i is a natural number.
In another aspect, the present disclosure discloses a board card including the integrated circuit device according to the foregoing description.
In another aspect, the present disclosure discloses a method of fusing a neural network that includes an i-th layer whose input feature map is smaller than its output feature map. The method comprises: establishing a template fusion unit according to a fusion strategy, wherein the template fusion unit includes the i-th layer; and performing neural network computation according to the template fusion unit, wherein i is a natural number.
In another aspect, the present disclosure discloses a computer readable storage medium having stored thereon computer program code for fusing a neural network, which, when executed by a processing device, performs the aforementioned method.
By setting a fusion strategy, the present disclosure dynamically determines a template fusion unit, fuses layers of a neural network to form a new custom layer, and loads the data required to compute the template fusion unit at one time, thereby reducing input/output overhead.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 is a block diagram illustrating a board card according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating an integrated circuit device of an embodiment of the present disclosure;
FIG. 3 is a schematic diagram showing the internal structure of a computing device of an embodiment of the present disclosure;
FIG. 4 is an internal block diagram illustrating a processor core of an embodiment of the present disclosure;
FIG. 5 is a schematic diagram showing one processor core writing data to a processor core of another cluster;
FIG. 6 is a schematic diagram showing an AlexNet model;
FIG. 7 is a schematic diagram illustrating an exemplary neural network model;
FIG. 8 is a schematic diagram illustrating two convolutional layers of an embodiment of the present disclosure fused together;
FIG. 9 is a schematic format diagram showing NCHW and NHWC;
FIG. 10 is a flow chart illustrating the performance of neural network computations by the disclosed embodiment using a template fusion unit;
FIG. 11 is a flow diagram illustrating dynamic fusion of neural networks according to a fusion strategy in accordance with an embodiment of the present disclosure;
FIG. 12 is a flow chart illustrating the performance of neural network computations by the template fusion unit of the disclosed embodiments;
FIG. 13 is a schematic diagram illustrating a neural network model with a block structure;
FIG. 14 is a schematic diagram showing input/output feature maps forming a regular pyramid structure;
FIG. 15A is a schematic diagram illustrating an unpooling operation for max pooling;
FIG. 15B is a schematic diagram illustrating an unpooling operation for average pooling;
FIG. 16 is a schematic diagram illustrating an up-sampling operation;
FIG. 17 is a flow chart illustrating fusing an exemplary neural network according to an embodiment of the present disclosure; and
FIG. 18 is a diagram illustrating an exemplary neural network model.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
A neural network is composed of layers such as an input layer, convolutional layers, activation functions, pooling layers and fully connected layers; a small network may have only a few layers while a large one may have hundreds, and each layer executes an operator (for example, a convolutional layer executes a convolution operator), so a network with many layers must execute that many operators. In this disclosure, when a particular layer is referred to, the operator corresponding to that layer is meant.
When performing neural network computations, the input information and the output results of the layers of the model differ from one inference to the next, so they are regarded as variable data; variable data are generally represented by feature maps (matrices). The parameters of the network model, on the other hand, do not change frequently once training has stabilized, or can be compiled and generated once the network topology and hardware parameters are determined and then remain unchanged during computation, so they can be regarded as constant data; constant data include, but are not limited to, weights, biases, device hardware instructions, and the mean and variance of batch normalization (batchnorm), and in this disclosure all constant data are uniformly referred to as weights. In this disclosure, the term "data" generally refers to the graph structure that, according to the fusion strategy, allows the operations of corresponding operators in the neural network model to be fused together, and this graph structure covers both variable data and constant data, that is, the feature maps and the corresponding weights.
FIG. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the disclosure. As shown in FIG. 1, the board card 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. A combined processing device is an artificial-intelligence computing unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of fields such as computer vision, speech, natural language processing and data mining in complex scenarios. Deep learning technology in particular is widely applied in cloud intelligence, and a notable characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the storage capacity and computing capability of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 through a bus. The control device 106 in the board card 10 is configured to regulate the state of the chip 101; to this end, in one application scenario, the control device 106 may include a micro controller unit (MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform operations specified by the user and is mainly implemented as a single-core or multi-core intelligent processor that performs deep learning or machine learning computations; it may interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU) or other general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered alone, may be viewed as having a single-core structure or a homogeneous multi-core structure; however, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The DRAM 204 is used to store the data to be processed. It is a DDR memory, typically 16 GB or larger, and stores data for the computing device 201 and/or the processing device 203.
FIG. 3 shows the internal structure of the computing device 201. The computing device 201 processes input data from fields such as computer vision, speech, natural language and data mining, and adopts a multi-core hierarchical design: as a system on chip it includes a plurality of clusters, and each cluster in turn includes a plurality of processor cores. In other words, the computing device 201 is organized as a system-on-chip / cluster / processor-core hierarchy.
Looking at the system-on-chip hierarchy, as shown in FIG. 3, the computing device 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and a plurality of clusters 305.
There may be multiple external memory controllers 301 (two are shown in the figure by way of example) for accessing an external memory device, such as the DRAM 204 in FIG. 2, so as to read data from or write data to off-chip memory in response to access requests issued by the processor cores. The peripheral communication module 302 receives control signals from the processing device 203 through the interface device 202 and starts the computing device 201 to execute tasks. The on-chip interconnect module 303 connects the external memory controllers 301, the peripheral communication module 302 and the plurality of clusters 305, and transmits data and control signals between these modules. The synchronization module 304 is a global synchronization barrier controller (GBC) that coordinates the progress of the clusters and ensures synchronization of information. The plurality of clusters 305 are the computing cores of the computing device 201; four are shown in the figure by way of example, and as hardware evolves the computing device 201 of the present disclosure may include 8, 16, 64 or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.
Viewed at the cluster level, as shown in FIG. 3, each cluster 305 includes a plurality of processor cores (IPU core)306 and a memory core (MEM core) 307.
Four processor cores 306 are shown in the figure by way of example; the present disclosure does not limit the number of processor cores 306. The internal architecture of a processor core is shown in FIG. 4. Each processor core 306 includes three main modules: a control module 41, an operation module 42 and a storage module 43.
The control module 41 is used for coordinating and controlling the operations of the operation module 42 and the storage module 43 to complete the deep learning task, and includes an Instruction Fetch Unit (IFU) 411 and an Instruction Decode Unit (IDU) 412. The instruction fetch unit 411 is used to obtain an instruction from the processing device 203, and the instruction decode unit 412 decodes the obtained instruction and sends the decoded result to the operation module 42 and the storage module 43 as control information.
The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 422 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
The storage module 43 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM)431, a weight storage unit (weight RAM, WRAM)432, an input/output direct memory access (IODMA) 433, and a transport direct memory access (MVDMA) 434. NRAM 431 is used to store the feature map for processor core 306 to compute and the intermediate result after computation; the WRAM 432 is used for storing the weight of the deep learning network; IODMA 433 controls access of NRAM 431/WRAM 432 and DRAM204 through broadcast bus 309; the MVDMA 434 is used to control access of the NRAM 431/WRAM 432 and the SRAM 308.
Returning to FIG. 3, the storage core 307 is mainly used for storage and communication, i.e., storing the data shared among the processor cores 306 and intermediate results, and carrying out communication between a cluster 305 and the DRAM 204, communication among the clusters 305, communication among the processor cores 306, and so on. In other embodiments, the storage core 307 has scalar computation capability and can perform scalar operations.
The storage core 307 includes a shared memory unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access module (CDMA) 310, and a global direct memory access module (GDMA) 311. The SRAM 308 acts as a high-performance data relay station: data reused by different processor cores 306 within the same cluster 305 need not be fetched from the DRAM 204 by each processor core 306 separately, but is relayed among the processor cores 306 through the SRAM 308, and the storage core 307 only needs to distribute the reused data from the SRAM 308 to the processor cores 306 quickly. This improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 309, CDMA 310, and GDMA 311 are used to perform communication among the processor cores 306, communication among the cluster 305, and data transfer between the cluster 305 and DRAM204, respectively. As will be described separately below.
The broadcast bus 309 is used to accomplish high-speed communication among the processor cores 306 in the cluster 305, and the broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (i.e., from a single processor core to a single processor core) data transfer, multicast is a communication for transferring a copy of data from SRAM308 to a specific number of processor cores 306, and broadcast is a communication for transferring a copy of data from SRAM308 to all processor cores 306, and is a special case of multicast.
The CDMA 310 is used to control access to the SRAM 308 across different clusters 305 within the same computing device 201. FIG. 5 shows a schematic diagram of one processor core writing data to a processor core of another cluster, illustrating the operating principle of the CDMA 310. In this application scenario the same computing device contains multiple clusters; for convenience of description only cluster 0 and cluster 1 are shown, each containing multiple processor cores, and likewise for convenience only processor core 0 of cluster 0 and processor core 1 of cluster 1 are shown. Processor core 0 wants to write data to processor core 1.
First, processor core 0 sends a unicast write request to write the data into the local SRAM 0. CDMA 0 acts as the master end and CDMA 1 acts as the slave end; the master pushes the write request to the slave, i.e., the master sends the write address AW and the write data W so that the data is transferred into SRAM 1 of cluster 1, and the slave then returns a write response B as acknowledgement. Finally, processor core 1 of cluster 1 sends a unicast read request to read the data from SRAM 1.
Returning to FIG. 3, the GDMA 311 cooperates with the external memory controller 301 to control access from the SRAM 308 of a cluster 305 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 308. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be achieved via two channels. The first channel directly connects the DRAM 204 with the NRAM 431 or WRAM 432 through the IODMA 433; the second channel transfers data between the DRAM 204 and the SRAM 308 via the GDMA 311, and then between the SRAM 308 and the NRAM 431 or WRAM 432 via the MVDMA 434. Although the second channel seemingly requires more components and a longer data path, in some embodiments its bandwidth is substantially greater than that of the first channel, so communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be more efficient over the second channel. Embodiments of the present disclosure may select a data transmission channel according to their own hardware conditions.
In other embodiments, the functionality of the GDMA 311 and the functionality of the IODMA 433 may be integrated in the same component. For convenience of description, this disclosure treats the GDMA 311 and the IODMA 433 as different components; it will be apparent to those skilled in the art that implementations whose functions and technical effects are similar to those of the present disclosure fall within its scope. Further, the functions of the GDMA 311, the IODMA 433, the CDMA 310 and the MVDMA 434 may also be realized by the same component, and such an implementation likewise falls within the scope of the present disclosure as long as the functions and technical effects achieved are similar.
The neural network structures relevant to the present disclosure fall into two categories: long-chain structures and block structures. A long-chain structure is a neural network model composed of layers connected in series in a single chain, where each layer has only one input and one output and the whole network is a single branch, such as VGG16 or the AlexNet model shown in FIG. 6. A block structure means that a sub-network of the neural network has only one input and one output but contains multiple branches inside, i.e., some layers of the sub-network have multiple inputs or outputs, such as the resblock structure of ResNet50 or the block structure of Inception_v3. FIG. 7 shows a schematic diagram of an exemplary neural network model, which includes a sub-network 701 and a sub-network 702. The sub-network 701 has only one input and one output and includes a first layer through a sixth layer, with the first layer having 2 outputs and the sixth layer having 2 inputs; the sub-network 701 therefore contains 2 branches, one being first layer → second layer → third layer → sixth layer and the other being first layer → fourth layer → fifth layer → sixth layer, so the sub-network 701 constitutes a block structure. Similarly, the sub-network 702 also forms a block structure.
Each layer of a deep learning computation requires a large amount of off-chip access; in particular, input data is read from the DRAM 204 into the computing device 201, and the computation results of the computing device 201 are stored back to the DRAM 204. Such frequent accesses consume significant hardware resources. To address this issue, the present disclosure greatly reduces off-chip data transfer by fusing adjacent layers of the neural network.
Fig. 8 shows a schematic view of fusing two convolution layers together. The input to the first convolutional layer 810 is a 7 × 7 feature map 801, and the layer convolves the feature map 801 with a 3 × 3 kernel (not shown) to obtain a feature map 802 of the first convolutional layer 810. Where the value of 5 x 5 feature sub-graph 804 affects 3 x 3 feature sub-graph 805. Assuming that the step size (stride) is 1, after computing 5 × 5 feature sub-graph 804, first layer convolutional layer 810 will then compute 5 × 5 feature sub-graph 806, and the value of 5 × 5 feature sub-graph 806 will affect 3 × 3 feature sub-graph 807.
When calculating the second layer convolutional layer 811, the feature map 802 is input to the second layer convolutional layer 811, and is convolved with a 3 × 3 kernel in the same manner, thereby obtaining a feature map 803 of the second layer convolutional layer 811. Wherein the value of the 3 × 3 feature sub-graph 805 affects the 1 × 1 feature sub-graph 808 in the feature graph 803. After the computation of 3 × 3 feature sub-graph 805, second layer convolutional layer 811 computes 3 × 3 feature sub-graph 807, and the value of 3 × 3 feature sub-graph 807 affects 1 × 1 feature sub-graph 809 in feature graph 803.
If the two layers are not fused, the computing device 201 reads the 5 × 5 feature sub-graph 804 from the DRAM 204 when performing the first-layer convolution 810, stores the 3 × 3 feature sub-graph 805 back to the DRAM 204 after the computation is completed, then reads the 5 × 5 feature sub-graph 806 from the DRAM 204, and stores the 3 × 3 feature sub-graph 807 to the DRAM 204 after computation. When the second-layer convolution 811 is performed, the 3 × 3 feature sub-graph 805 must again be read from the DRAM 204, the 1 × 1 feature sub-graph 808 stored to the DRAM 204 after calculation, the 3 × 3 feature sub-graph 807 read from the DRAM 204, and the 1 × 1 feature sub-graph 809 stored to the DRAM 204 after calculation. As is clear from the above, the feature map 802 is, as intermediate data, repeatedly read from and stored to off-chip memory, which occupies considerable system resources.
If the first convolutional layer 810 and the second convolutional layer 811 are fused, the feature map 802 is kept in the NRAM 431 (and the weights of the first convolutional layer 810 and the second convolutional layer 811 can likewise be kept in the WRAM 432), so that the number of accesses between the computing device 201 and the DRAM 204 is reduced and the execution efficiency of the overall neural network is improved. This is called pyramid fusion because the feature maps participating in the fusion (e.g., feature map 801, feature map 802, feature map 803), taken as a whole, look like an inverted pyramid in the context logic of the neural network model.
Pyramid fusion is usually based on fusing backward from a particular convolutional or pooling layer of the neural network; that is, the initial layer of the fusion is a convolutional or pooling layer, and it is fused backward with several layers according to the hardware conditions, possibly spanning several more convolutional and pooling layers along the way. However, with the development of deep learning and neural networks, the ordering of layers has become complicated; for example, if an activation layer is placed before a convolutional layer, that activation layer should also be considered for fusion with the convolutional layer that follows it. Therefore, beyond fusion centered only on convolutional and pooling layers, the present disclosure provides various fusion methods that flexibly select the layers of the neural network to fuse without taking convolutional or pooling layers as the core or adopting one specific strategy; even user-defined layers can be fused as long as they comply with the fusion strategy, thereby optimizing overall performance.
Another embodiment of the present disclosure is a novel fusion method implemented with the hardware structure of FIGS. 1, 2, 3 and 4, called a template fusion unit (TFU). The template fusion unit elastically fuses several layers into one layer through a certain fusion strategy to reduce the input/output overhead of the network; it covers the aforementioned pyramid fusion as well as other fusion schemes. The set of fused layers constitutes the template fusion unit, which can be regarded as a new layer or a custom layer.
In this embodiment, the feature map and weights required by the template fusion unit are loaded from the DRAM 204 to the on-chip SRAM 308 at one time. The feature map, once loaded into the SRAM 308, is called an on-chip unit map. The on-chip unit map is cut into subgraphs; each time, one subgraph is loaded from the SRAM 308 to the NRAM 431 of the processor core 306 assigned to compute it, and the weights needed to compute that subgraph are likewise loaded from the SRAM 308 to the WRAM 432. Each subgraph yields an intermediate result after computation, which is stored back to the SRAM 308, and after all subgraphs have been computed, the calculation result is stored back to the DRAM 204 at one time. In other words, the on-chip unit map and weights, together with the result obtained by running the operators of the neural network model on them, are transferred between the DRAM 204 and the SRAM 308, while the outputs (intermediate results) of the subgraphs are transferred between the SRAM 308 and the NRAM 431. From the perspective of the computing device 201, the data of the template fusion unit is loaded in units of on-chip unit maps and computed in units of subgraphs.
More specifically, the SRAM 308 is one of the important references for the fusion strategy; its size determines whether the template fusion unit works in big-graph mode or small-graph mode. The small-graph and big-graph modes refer to whether a feature map stored in the DRAM 204 can be moved to the SRAM 308 for processing at one time; the processing device 203 compares the storage space required by the feature map with the available space of the SRAM 308. If the space of the SRAM 308 is insufficient and the feature map cannot fit, the big-graph mode applies; if the SRAM 308 is sufficient to accommodate the entire feature map, the small-graph mode applies. It should be noted that in big-graph mode the on-chip unit map is only a part of the feature map, whereas in small-graph mode, if the available space of the SRAM 308 is large enough or the feature map is small enough, the SRAM 308 may accommodate several feature maps at one time, i.e., the on-chip unit map may include several feature maps.
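A minimal sketch of this mode decision is given below; the function name, the element size and the SRAM budget are illustrative assumptions, not values from this disclosure.

```python
# Minimal sketch: the processing device compares the storage space required by a
# feature map with the available SRAM space to pick the working mode.
def select_graph_mode(feature_map_bytes: int, sram_available_bytes: int) -> str:
    if feature_map_bytes > sram_available_bytes:
        return "big graph mode"    # the on-chip unit map will only be a slice of the feature map
    return "small graph mode"      # the SRAM may even hold several feature maps at once

# Example: a 1x3x224x224 float16 feature map against an assumed 2 MB SRAM budget.
feature_map_bytes = 1 * 3 * 224 * 224 * 2              # about 294 KB
print(select_graph_mode(feature_map_bytes, 2 * 1024 * 1024))   # -> small graph mode
```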
In big-graph mode, the feature map must be split before it can be loaded into the computing device 201. The processing device 203 splits the feature map on the DRAM 204 until an on-chip unit map is produced that is small enough to meet the space requirement of the SRAM 308 and can be moved to the SRAM 308 at one time. Splitting the feature map may produce input-dependent operations and output-dependent operations.
An input-dependent operation means that the split on-chip unit maps at least partially overlap: each subgraph needs some extra copies of the input to perform a complete operation, which causes data redundancy in the split, where data redundancy means that the same piece of data is reused in the system. Input-dependent operations arise when the template fusion unit includes layers such as convolution, pooling or matrix multiplication.
An output-dependent operation means that after each subgraph produces an intermediate result, a reduction is needed to obtain the calculation result. Reduction means that, based on an understanding of the content of the on-chip unit map, the unit map is divided into subgraphs that are calculated separately so as to reduce the computation scale and minimize the data volume while preserving the original on-chip unit map as much as possible, and the calculation result is then restored or integrated from the subgraph results. The calculation results are interdependent during the reduction. Output-dependent operations arise when the template fusion unit includes layers such as inner product, convolution, matrix multiplication, sorting or counting.
The data format of the feature map that this embodiment can process includes N, H, W, C dimensions, where N represents batch (batch), H represents height (height), W represents width (width), and C represents channel (channel). Taking the image data as an example, N indicates how many images are shared in the batch, H indicates how many pixels are in the vertical direction of the image, W indicates the number of pixels in the horizontal direction, and C indicates the number of channels (for example, the number of channels C of a monochrome image is 1, and the number of channels C of an RGB color image is 3).
The ordering of these dimensions determines how the data is organized; the two common orderings are NHWC and NCHW, and FIG. 9 illustrates the difference between the NCHW and NHWC formats using an RGB color image as an example, where R denotes a red pixel, G a green pixel and B a blue pixel. Array 91 is in the NCHW format: N is arranged outermost, the pixels within each channel are adjacent, and the channels are arranged in R, G, B order; the offset in memory of the element with coordinates (n, c, h, w) is ((n × C + c) × H + h) × W + w. Array 92 is in the NHWC format: C is arranged innermost, so the pixels of the several channels at the same spatial location are adjacent. The figure also shows the positions of input pixels 901, 902 and 903 under the different arrangements; taken together, these three input pixels give the color of one point in the image. For NHWC, the offset in memory of the element with coordinates (n, c, h, w) is ((n × H + h) × W + w) × C + c. NHWC is closer than NCHW to the BMP picture storage format, in which data is stored pixel by pixel with the color values of all channels stored in each pixel, so no extra dimension conversion is needed when reading an input picture. NHWC therefore has better memory-access locality: one output pixel can be obtained for every three input pixels, whereas NCHW must wait until the inputs of all channels are ready before producing the final output, which requires a large buffer.
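The two offset formulas can be checked with a short sketch; the function names and the 3x4x4 example tensor are illustrative assumptions.

```python
# Minimal sketch of the NCHW and NHWC offset formulas given above.
def offset_nchw(n, c, h, w, C, H, W):
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, C, H, W):
    return ((n * H + h) * W + w) * C + c

# For an RGB image (C=3) of height 4 and width 4, the three channel values of the
# pixel at (h=0, w=0) are contiguous in NHWC but strided by H*W elements in NCHW.
C, H, W = 3, 4, 4
print([offset_nchw(0, c, 0, 0, C, H, W) for c in range(C)])  # [0, 16, 32]
print([offset_nhwc(0, c, 0, 0, C, H, W) for c in range(C)])  # [0, 1, 2]
```

This is the locality difference noted above: NHWC yields one output pixel from three adjacent inputs, while NCHW must gather values that lie a whole channel plane apart.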
Based on the data described above, this embodiment can fuse the layers of the neural network into a template fusion unit; FIG. 10 shows the corresponding flowchart.
In step 1001, the processing device 203 determines whether the storage space required by the feature map is larger than the available space of the SRAM 308. If so, the feature map cannot be loaded into the SRAM 308 at one time, so step 1002 is executed to split the feature map. In this embodiment, the processing device 203 preferably splits along the N dimension, because this generates no input-dependent or output-dependent operations; if splitting along N cannot meet the requirement, it splits along the H or W dimension, which may produce input-dependent or output-dependent operations. This embodiment also supports splitting along the C dimension, in particular along the Cout direction, so that one convolution is split into several convolutions through data optimization and the weights can be distributed over the WRAM 432 of several processor cores, for example split across the four processor cores 306. Whichever dimension the split is performed along, as long as the result can be processed by the computing device 201, it falls within the scope of the disclosure.
Furthermore, the processing device 203 may split successively along the N, H and W dimensions at a specific granularity, where the specific granularity may be a fixed or variable ratio or be expressed as a function. In one application scenario, the processing device 203 splits the feature map or the weights from large to small. Taking the feature map as an example, the feature map of dimensions N × H × W × C is first split along the N dimension into a feature map of N1 × H × W × C and a feature map of N2 × H × W × C, where the specific granularity is a fixed ratio and N1 and N2 are each one half of N. If this is still not small enough, the processing device 203 continues along the H dimension and splits the N1 × H × W × C feature map into an N1 × H1 × W × C feature map and an N1 × H2 × W × C feature map, where H1 and H2 are each one half of H. If this is still not small enough, the processing device 203 continues along the W dimension and splits the N1 × H1 × W × C feature map into an N1 × H1 × W1 × C feature map and an N1 × H1 × W2 × C feature map, where W1 and W2 are each one half of W. The processing device 203 may then continue splitting into smaller sizes in the order N, W, H, for example into quarters, eighths or sixteenths, until the feature map is small enough to serve as an on-chip unit map that can be loaded into the SRAM 308 at one time.
It will be appreciated that the processing device 203 may also keep splitting along one dimension until it can be split no further, and only then choose another dimension to continue splitting. For example, it may keep splitting along the H dimension; if even the minimum unit along H cannot be loaded into the SRAM 308, it switches to splitting along the W dimension until the minimum unit is reached.
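A minimal sketch of the large-to-small splitting is shown below, assuming a fixed granularity of one half per step and a simple cycle over the N, H and W dimensions; the function name, element size and SRAM budget are illustrative assumptions.

```python
def split_until_fits(shape, elem_bytes, sram_bytes):
    """shape is [N, H, W, C]; halve N, H and W in turn until the candidate
    on-chip unit map fits in the available SRAM space."""
    shape = list(shape)
    dims = [0, 1, 2]                      # indices of N, H, W; C is kept whole here
    i = 0
    while shape[0] * shape[1] * shape[2] * shape[3] * elem_bytes > sram_bytes:
        if all(shape[d] == 1 for d in dims):
            raise RuntimeError("cannot split further along N, H or W")
        while shape[dims[i % 3]] == 1:    # skip a dimension that is already exhausted
            i += 1
        shape[dims[i % 3]] //= 2
        i += 1
    return shape

# An 8x224x224x64 float16 feature map against an assumed 4 MiB SRAM budget.
print(split_until_fits([8, 224, 224, 64], 2, 4 * 1024 * 1024))   # -> [2, 112, 112, 64]
```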
It should be noted that, since the splitting manner is from large splitting to small splitting, when the split feature map satisfies the condition, the size of the required storage space is usually almost the same as the available space of the SRAM 308. In other words, in the large graph mode, the DRAM204 can only transfer one split feature to the SRAM308 at a time, but in the small graph mode, the space of the SRAM308 may be able to load multiple features from the DRAM204 at one time.
In another application scenario, the processing device 203 splits from small to large, and the specific granularity may likewise be a fixed or variable ratio or be expressed as a function. For example, splitting first takes the smallest unit along the N dimension as the specific granularity, i.e. 1 × H × W × C. If the SRAM 308 can load this, the processing device 203 enlarges the split feature map, for example to 2 × H × W × C. If it can still be loaded, the size keeps being enlarged until n × H × W × C cannot be loaded, in which case the size of the on-chip unit map is (n-1) × H × W × C.
If the storage space required by 1 × H × W × C already exceeds the available space of the SRAM 308, the processing device 203 continues splitting along another dimension, for example starting from the H dimension, and evaluates 1 × 1 × W × C. If this is small enough, it grows along the H dimension until it finds that the storage space required by 1 × (H-1) × W × C is just close to, and not greater than, the available space of the SRAM 308. If the available space of the SRAM 308 is still exceeded, the processing device 203 continues splitting from yet another dimension, for example the W dimension, and so on, until the best input data that can be loaded into the SRAM 308 at one time is found. "Best" here means that the storage space required by the on-chip unit map is closest to, but not greater than, the available space of the SRAM 308.
After the processing device 203 splits the feature map, the process returns to step 1001, and the processing device 203 determines whether the storage space required by the split feature map is still larger than the available space of the SRAM308, if so, the process executes step 1002 again, and continues to split the feature map.
If the processing device 203 determines that the storage space required by the split feature map is not larger than the available space of the SRAM308, which indicates that the SRAM308 can load the split feature map at one time, step 1003 is executed, and the processing device 203 sets the split feature map as the on-chip unit map.
Finally, in step 1004, the processing device 203 determines a template fusion unit according to the size of the on-chip unit map. This step will be described in detail later.
In other application scenarios, when the processing device 203 executes steps 1001 and 1002 repeatedly, the storage space required by the split feature map approaches the available space of the SRAM 308 step by step. For example, suppose the feature map requires 100k of storage and the SRAM 308 has 40k available. In step 1001, the processing device 203 determines that the storage space required by the feature map is larger than the available space of the SRAM 308, so step 1002 is executed and the feature map is split in half along the N dimension, giving a split feature map of 50k. Step 1001 is executed again; the required storage is still larger than the available space of the SRAM 308, so step 1002 splits in half along the N dimension again, giving 25k. Step 1001 is executed once more; the required storage is now smaller than the available space of the SRAM 308, so step 1003 is executed and the processing device 203 sets the split feature map (of size 25k) as the on-chip unit map.
The available space of the SRAM 308 is 40k while the on-chip unit map requires only 25k, leaving 15k unused, because the splitting in step 1002 is done in halves and the granularity of the last split is too coarse. This embodiment can gradually reduce the specific granularity of the split as the number of splits increases, so that the storage space required by the resulting on-chip unit map comes as close as possible to the available space of the SRAM 308. For example, the specific granularity may be set to one half at first, three quarters next, and four fifths last. Again taking a feature map that requires 100k and an SRAM 308 with 40k available: in step 1001 the processing device 203 determines that the required storage exceeds the available space, so step 1002 is executed with the specific granularity set to one half, giving a split feature map of 50k; step 1001 finds this still too large, so step 1002 is executed again with the granularity adjusted to three quarters, giving 37.5k; step 1001 now finds the required storage smaller than the available space, so step 1003 is executed and the processing device 203 sets the split feature map (of size 37.5k) as the on-chip unit map. 37.5k is closer to 40k than 25k is, which uses the available space of the SRAM 308 more effectively. This embodiment does not limit the specific granularity, which may be set according to the application scenario.
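The progressively finer granularity can be sketched as follows; the schedule of one half, three quarters and four fifths mirrors the example above, while the function name and the use of kilobytes as the unit are illustrative assumptions.

```python
def split_with_schedule(required_kb, sram_kb, schedule=(1/2, 3/4, 4/5)):
    """Shrink the feature map step by step, keeping a larger fraction at each
    later step so the on-chip unit map ends up close to the SRAM budget."""
    size, step = float(required_kb), 0
    while size > sram_kb:
        size *= schedule[min(step, len(schedule) - 1)]
        step += 1
    return size

# The example from the text: a 100k feature map against 40k of available SRAM.
print(split_with_schedule(100, 40))   # 100 -> 50 -> 37.5, closer to 40 than 25
```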
After the on-chip unit map is sized, step 1004 is performed, which is to dynamically fuse the neural networks according to a fusion policy, and fig. 11 shows a method for dynamically fusing the neural networks according to the fusion policy in this embodiment.
In step 1101, a start layer of the template fusion unit is selected according to the start rule of the fusion policy. The processing device 203 selects the initial layer of the template fusion unit according to the initial rule of the fusion strategy, that is, selects the layer to start fusion from the layers that are not fused in the neural network.
In one application scenario, the starting rule may be that the starting layer is the frontmost layer of the neural network that has not yet been fused, and the processing device 203 searches for that frontmost un-fused layer. Taking the AlexNet neural network model of FIG. 6 as an example, it has 23 layers; assuming that layers 1 to 5 have been fused, under this starting rule the processing device 203 selects the ReLU activation layer of layer 6 as the starting layer and fuses backward (i.e., in the direction of layer 7). Note that under this starting rule the starting layer is not necessarily a convolutional or pooling layer.
In another application scenario, considering that the convolutional and pooling layers consume the most input/output resources, the starting rule is that the starting layer is the frontmost convolutional or pooling layer that has not yet been fused. The processing device 203 first finds all convolutional and pooling layers among the un-fused layers of the neural network model and fuses backward starting from the frontmost of them. Again taking the AlexNet neural network model of FIG. 6 as an example, assuming that layers 1 to 9 have been fused, the processing device 203 finds all convolutional and pooling layers among the un-fused layers, i.e. layers 11, 13 and 15, and starts fusing from the frontmost un-fused convolutional or pooling layer, i.e. the starting layer is layer 11.
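The two starting rules can be expressed as a minimal sketch; the layer representation (a type string and a fused flag) and the function names are illustrative assumptions.

```python
START_TYPES = ("conv", "pool")

def first_unfused_layer(layers):
    """Starting rule 1: the frontmost layer that has not yet been fused."""
    for idx, layer in enumerate(layers):
        if not layer["fused"]:
            return idx
    return None

def first_unfused_conv_or_pool(layers):
    """Starting rule 2: the frontmost un-fused convolution or pooling layer."""
    for idx, layer in enumerate(layers):
        if not layer["fused"] and layer["type"] in START_TYPES:
            return idx
    return None
```

Applied to the FIG. 6 example above, with layers 1 to 9 already fused, the second rule would return layer 11.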
In step 1102, fusion is performed starting from the starting layer, and all rules of the fusion policy are checked one by one to establish the template fusion unit. The processing device 203 fuses layers based on the starting layer and checks each rule of the fusion policy in turn; only when all rules are satisfied are the hardware resources of the computing device 201 sufficient to load the data required to compute the template fusion unit at one time and then perform the neural network computation according to the template fusion unit. In addition to the aforementioned starting rules, the fusion policy may illustratively include the following rules:
Rule one: backward fusion
Backward fusion means fusing from the starting layer in the direction of the neural network model's inference; taking FIG. 6 as an example, it fuses in the direction first layer → second layer → third layer. If un-fused layers precede the starting layer, those layers are not considered for inclusion in the template fusion unit under this rule.
Rule two: preferential forward fusion
Forward fusion means fusing from the starting layer in the direction opposite to the inference, for example in the direction third layer → second layer → first layer in FIG. 6. This rule is usually used together with the earlier starting rule that takes the frontmost un-fused convolutional or pooling layer as the starting layer, since there may still be un-fused layers before that convolutional or pooling layer. After the starting layer is selected, the processing device 203 preferentially fuses forward and tries to incorporate the un-fused layers before the starting layer into the template fusion unit. Again taking the AlexNet neural network model of FIG. 6 as an example, assuming that layers 1 and 2 have been fused, the processing device 203 finds that the frontmost un-fused convolutional or pooling layer is layer 5, so the starting layer is layer 5; it preferentially fuses layers 4 and 3 forward and, if fusion can continue, then fuses layers 6, 7 and so on backward.
Rule three: preferably in units of block structures
When the neural network model has a block structure, this rule requires the processing device 203 to preferentially add layers to or delete layers from the template fusion unit in units of block structures rather than single layers; only if an entire block cannot be fused does it consider fusing layer by layer along the branches. Taking the neural network model of FIG. 7 as an example, the processing device 203 preferentially treats the sub-network 701 or the sub-network 702 as the unit of fusion.
When the neural network has a long-chain structure, there is no block structure, so layers are added to or deleted from the template fusion unit directly in units of layers. This rule does not apply to neural network models with a long-chain structure.
Rule four: single branch output
The fusion strategy of this embodiment does not allow the template fusion unit to be a multi-output network, because the shape derivation carried out inside the template fusion unit mainly works backward from the output toward the input; with multiple outputs, the derivation would have to proceed from each output separately, and the derivation results would not necessarily converge on the same feature map.
In other words, the output of the template fusion unit must be a single-branch output, i.e. the last layer of the template fusion unit may have only one output. FIG. 7 illustrates two ways of fusing the sub-network 701: the first fuses the first through fifth layers into a template fusion unit 703, and the second fuses the first through sixth layers into a template fusion unit 704. Since the outputs of both the third layer and the fifth layer are outputs of the template fusion unit 703, the template fusion unit 703 is a multi-output network, i.e. a multi-branch output. The output of the sixth layer is the only output of the template fusion unit 704, so the template fusion unit 704 is a single-output network, i.e. a single-branch output. The processing device 203 judges whether the output of the template fusion unit is a single-branch output; if this rule is not satisfied, the processing device 203 adds or deletes layers of the template fusion unit until the rule is satisfied.
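A minimal sketch of the single-branch-output check follows, with an assumed representation of the candidate unit as a set of layer indices and of the network as a consumer map.

```python
def is_single_branch_output(unit, consumers):
    """Count how many layers in the candidate unit feed a layer outside it
    (or are final network outputs); exactly one such layer is allowed."""
    escaping = 0
    for layer in unit:
        users = consumers.get(layer, [])
        if not users or any(u not in unit for u in users):
            escaping += 1
    return escaping == 1

# Sub-network 701 of FIG. 7: 1 -> {2, 4}, 2 -> 3, 3 -> 6, 4 -> 5, 5 -> 6.
consumers = {1: [2, 4], 2: [3], 3: [6], 4: [5], 5: [6], 6: []}
print(is_single_branch_output({1, 2, 3, 4, 5}, consumers))     # False: unit 703, layers 3 and 5 escape
print(is_single_branch_output({1, 2, 3, 4, 5, 6}, consumers))  # True: unit 704, only layer 6 escapes
```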
Rule five: comprising at least 2 main layers
When the logic of the fused layers is too simple, the template fusion unit performs no better than the un-fused layers, so under this fusion strategy the processing device 203 evaluates whether the operations of the fused layers are complex enough for the fusion to be beneficial. To obtain a benefit, main layers should be incorporated into the template fusion unit as far as possible, where a main layer is a layer that consumes substantial input/output resources, such as matrix multiplication, pooling or convolution. Pooling here covers its various forms, e.g. max pooling (maxpool) or average pooling (avgpool), and convolution likewise covers its various forms, e.g. ordinary convolution, convolution with mean, and depthwise convolution (depthwise conv). This rule requires the template fusion unit to include at least 2 main layers. When the processing device 203 judges that this rule is not satisfied, it adjusts the template fusion unit until the rule is satisfied.
Rule six: comprises a continuous structure with a main layer, a main layer and a non-main layer which are adjacent in sequence
This rule requires the template fusion unit to include a continuous structure of main layer, main layer and non-main layer, i.e. a main layer, another main layer and a non-main layer that are adjacent in sequence; such an operation is complex enough for the fusion to be beneficial. Referring to FIG. 6, layer 4 is a max pooling layer, layer 5 is a convolutional layer and layer 6 is a ReLU activation layer, which matches the continuous structure of main layer, main layer and non-main layer adjacent in sequence, so a template fusion unit including layers 4, 5 and 6 satisfies this rule. When the processing device 203 judges that this rule is not satisfied, it adjusts the template fusion unit until the rule is satisfied.
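A minimal sketch of the checks for rules five and six is shown below; the set of main-layer type names and the list representation of the candidate unit are illustrative assumptions.

```python
MAIN = {"conv", "pool", "matmul"}

def has_two_main_layers(unit_types):
    """Rule five: at least two main layers in the candidate unit."""
    return sum(t in MAIN for t in unit_types) >= 2

def has_main_main_nonmain(unit_types):
    """Rule six: a main layer, a main layer and a non-main layer adjacent in sequence."""
    return any(
        unit_types[i] in MAIN and unit_types[i + 1] in MAIN and unit_types[i + 2] not in MAIN
        for i in range(len(unit_types) - 2)
    )

# Layers 4 to 6 of FIG. 6: max pooling, convolution, ReLU activation.
print(has_two_main_layers(["pool", "conv", "relu"]))    # True
print(has_main_main_nonmain(["pool", "conv", "relu"]))  # True
```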
Rule seven: continuous structure comprising scalar calculation layers and vector calculation layer adjacency
This rule is that the template fusion unit comprises a continuous structure of a scalar calculation layer and a vector calculation layer, namely a scalar calculation layer and a vector calculation layer that are adjacent in sequence. The scalar calculation layer refers to an addition layer, a subtraction layer or a multiplication layer, and the vector calculation layer refers to an activation layer, a batch normalization layer or a scaling layer. When the processing device 203 determines that the rule is not satisfied, the processing device 203 adjusts the template fusion unit until the rule is satisfied.
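Purely as an illustrative sketch (not the patent's implementation), rules five to seven can be expressed as simple checks over the ordered list of layer types in a candidate template fusion unit; the type names below are assumptions made for this example.

MAIN_LAYERS = {"matmul", "maxpool", "avgpool", "conv", "depthwise_conv"}   # rules five and six
SCALAR_LAYERS = {"add", "sub", "mult"}                                     # rule seven
VECTOR_LAYERS = {"activation", "batchnorm", "scale"}                       # rule seven

def rule_five(types):
    # At least two main layers in the template fusion unit.
    return sum(t in MAIN_LAYERS for t in types) >= 2

def rule_six(types):
    # A consecutive main / main / non-main structure exists.
    return any(types[i] in MAIN_LAYERS and types[i + 1] in MAIN_LAYERS
               and types[i + 2] not in MAIN_LAYERS
               for i in range(len(types) - 2))

def rule_seven(types):
    # A scalar computation layer immediately followed by a vector computation layer.
    return any(types[i] in SCALAR_LAYERS and types[i + 1] in VECTOR_LAYERS
               for i in range(len(types) - 1))

# Layers 4 to 6 of fig. 6 (max pooling, convolution, ReLU activation) satisfy rules five and six.
print(rule_five(["maxpool", "conv", "activation"]), rule_six(["maxpool", "conv", "activation"]))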
Rule eight: the convolutional layer not weighted by the output of a certain layer
The rule is that the weight of the convolutional layer in the template fusion unit is not the output of any layer of the neural network, regardless of whether the layer is incorporated in the template fusion unit. When the processing unit 203 determines that the rule is not satisfied, the processing unit 203 removes the convolutional layer from the template fusion unit.
Rule nine: the weight of the convolutional layer is not shared with any layer of the neural network
Because the weights of the operators involved in the template fusion unit are placed in a special arrangement, if a fused convolution operator shared its weights with another operator, the placement logics of the weights would conflict. This rule is therefore that the weights of the convolution operators in the template fusion unit are not shared with any layer of the neural network. When the processing device 203 determines that the rule is not satisfied, the processing device 203 removes the convolution operator from the template fusion unit.
Rule ten: available space with weight not greater than WRAM
The large graph mode places less pressure on the WRAM 432, because the on-chip unit graph loaded into the SRAM 308 is only a part of one feature graph, and the WRAM 432 only needs to store the weights of that feature graph when the template fusion unit is calculated. In the small graph mode, however, multiple feature graphs may be loaded into the SRAM 308, so more weights are required, and it must be evaluated carefully whether the available space of the WRAM 432 is sufficient. This rule is that the storage space required by the weights of the on-chip unit graph is not larger than the available space of the WRAM 432; when the processing device 203 determines that the rule is not satisfied, the processing device 203 reduces the size of the on-chip unit graph.
If the weight is split based on the output channel parameter Cout of the C dimension, since the weight is evenly distributed to the plurality of processor cores 306, the rule is adjusted as follows:
Wj / n ≤ W

where Wj is the storage space required by the weights involved in on-chip unit graph j, n is the number of processor cores 306 in the cluster, and W is the available space of the WRAM 432.
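As a hedged illustration of the adjusted rule (relying on the reconstruction Wj / n ≤ W above; the function and parameter names are assumptions), the check might look as follows.

def rule_ten_split_by_cout(weight_bytes_wj, num_cores_n, wram_bytes_per_core_w):
    # Weights split along Cout are spread evenly over the n processor cores of a cluster,
    # so each core must be able to hold Wj / n bytes in its WRAM.
    return weight_bytes_wj / num_cores_n <= wram_bytes_per_core_w

# Example with assumed sizes: 3 MB of weights, 4 cores, 1 MB of WRAM per core.
print(rule_ten_split_by_cout(3 * 2**20, 4, 1 * 2**20))   # True: 0.75 MB per core fits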
Rule eleven: percentage of redundancy
The redundancy percentage is the ratio of the sum of the redundancy generated by the input-dependent and output-dependent operations to the normal input/output amount of the template fusion unit, where the normal input/output amount refers to the data amount of the on-chip unit graph without redundancy, before the on-chip unit graph is split. After the current layer is fused into the template fusion unit, the processing device 203 obtains the memory access amount size_TFU of the on-chip unit graph carried from the DRAM 204 to the SRAM 308 and the normal input/output amount size_ori (without redundancy), where size_TFU is the theoretical memory access amount, i.e. size_ori plus the redundancy sum. The redundancy percentage is calculated as follows:

percentage = (size_TFU − size_ori) / size_ori
The processing device 203 factors in the splitting information and shape derivation of the template fusion unit and sets the percentage threshold to 50%, 75%, 100%, 125% or 150%, preferably 100%. Taking a percentage threshold of 100% as an example, fusion is not performed when the theoretical memory access amount size_TFU exceeds twice the normal input/output amount of the template fusion unit, i.e. when the redundancy sum exceeds the normal input/output amount. This rule is that the redundancy generated by splitting the on-chip unit graph does not exceed the specific ratio given by the percentage threshold; once the redundancy sum exceeds that ratio, the redundant portion is excessive, a large amount of resources would be consumed computing redundancy, and performance would degrade, so when the processing device 203 determines that the rule is not satisfied, the processing device 203 stops fusing.
It should be noted that, in the small graph mode, at least one complete feature graph is loaded at a time when data is carried from the DRAM 204 to the SRAM 308, so no redundancy is generated. This rule therefore does not apply to the small graph mode.
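To make the redundancy-percentage rule concrete, a hedged sketch follows; size_tfu, size_ori and the 100% threshold follow the definitions above, while the function name and example figures are illustrative only.

def rule_eleven(size_tfu, size_ori, threshold=1.0):
    # Redundancy percentage = (size_tfu - size_ori) / size_ori must not exceed the threshold.
    redundancy_sum = size_tfu - size_ori
    return redundancy_sum / size_ori <= threshold

# With a 100% threshold, fusion stops once the theoretical memory access amount
# exceeds twice the normal input/output amount of the template fusion unit.
print(rule_eleven(size_tfu=1.8e6, size_ori=1.0e6))   # True:  80% redundancy
print(rule_eleven(size_tfu=2.3e6, size_ori=1.0e6))   # False: 130% redundancy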
Rule twelve: on-chip unit diagram input-output size
Assuming that the space size of the SRAM308 is S, the storage space required by the on-chip cell map is IN, and the storage space required by the calculation result of the on-chip cell map is OUT, the rule is that the space size of the SRAM308 needs to satisfy the following condition:
IN + OUT < S if IN and OUT cannot multiplex memory
MAX (IN, OUT) < S if IN and OUT can multiplex memory space
That is, if IN and OUT cannot multiplex the storage space, the sum of the storage space required by the on-chip unit graph and the storage space required by the calculation result must be less than the available space of the SRAM 308; if IN and OUT can multiplex the storage space, the larger of the two must be less than the available space of the SRAM 308.
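The two conditions of rule twelve can be summarized in one small helper, shown here only as a hedged sketch with illustrative names.

def rule_twelve(in_bytes, out_bytes, sram_bytes, can_reuse):
    # The on-chip unit graph (IN) and its calculation result (OUT) must fit in the SRAM,
    # either sharing the same storage space or occupying separate space.
    if can_reuse:
        return max(in_bytes, out_bytes) < sram_bytes   # IN and OUT multiplex the space
    return in_bytes + out_bytes < sram_bytes           # IN and OUT need separate space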
Rule thirteen: wi+IN1+IN2≤S
In the small graph mode, this rule is that the space size of the SRAM308 needs to satisfy the following condition:
Wi+IN1+IN2≤S
That is, the sum of the storage space Wi required by the weights of sub-graph i, the storage space IN1 required by the on-chip unit graph, and the cache space IN2 is not greater than the available space of the SRAM 308. When the processing device 203 determines that this rule is not satisfied, the processing device 203 reduces the number of on-chip unit graphs until this rule is satisfied.
Rule fourteen: SubINi + Wi+IN2≤S
In the small graph mode, this rule is that the space size of the SRAM308 needs to satisfy the following condition:
SubINi+Wi+IN2≤S
That is, the sum of the storage space SubINi required by sub-graph i, the storage space Wi required by the weights of sub-graph i, and the cache space IN2 is not greater than the available space of the SRAM 308. When the processing device 203 determines that this rule is not satisfied, the processing device 203 reduces the number of on-chip unit graphs until the rule is satisfied.
Rule fifteen: SubOUTi + Wi+1+IN2≤S
In the small graph mode, this rule is that the space size of the SRAM308 needs to satisfy the following condition:
SubOUTi+Wi+1+IN2≤S
That is, the sum of the storage space SubOUTi required by the intermediate result of sub-graph i, the storage space Wi+1 required by the weights of the next sub-graph, and the cache space IN2 is not greater than the available space of the SRAM 308. When the processing device 203 determines that this rule is not satisfied, the processing device 203 reduces the number of on-chip unit graphs until the rule is satisfied.
Rule sixteen: wi+Wi+1≤W
The weights involved in the convolution operations in the template fusion unit are carried independently and reside on the WRAM 432. In the small graph mode, if a sub-graph includes multiple feature graphs, then, considering the pipelining between sub-graphs, the WRAM 432 stores at most the weights of two adjacent sub-graphs at the same time. Assuming that the storage space required by the weights of sub-graph i is Wi and the total space of the WRAM 432 is W, this rule is that the space of the WRAM 432 needs to satisfy the following condition:
Wi+Wi+1≤W
That is, the sum of the storage space Wi required by the weights of sub-graph i and the storage space Wi+1 required by the weights of the next sub-graph is not greater than the available space of the WRAM 432. When the processing device 203 determines that this rule is not satisfied, the processing device 203 reduces the number of on-chip unit graphs until the rule is satisfied.
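Rules thirteen to sixteen describe the small-graph-mode pipelining constraints; a hedged summary in code follows, with all variable names being assumptions for this example.

def small_graph_mode_rules(w_i, w_next, sub_in_i, sub_out_i, in1, in2, sram_s, wram_w):
    # rule 13: Wi + IN1 + IN2       <= S  (weights + on-chip unit graph + cache fit in SRAM)
    # rule 14: SubINi + Wi + IN2    <= S  (sub-graph input + weights + cache fit in SRAM)
    # rule 15: SubOUTi + Wi+1 + IN2 <= S  (intermediate result + next weights + cache fit in SRAM)
    # rule 16: Wi + Wi+1            <= W  (weights of two adjacent sub-graphs fit in WRAM)
    return (w_i + in1 + in2 <= sram_s and
            sub_in_i + w_i + in2 <= sram_s and
            sub_out_i + w_next + in2 <= sram_s and
            w_i + w_next <= wram_w)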
Seventeen rules are: the storage space required by the subgraph is not more than the available space of NRAM
This rule is that the storage space required by a sub-graph is not greater than the available space of the NRAM 431. When an on-chip unit graph on the SRAM 308 is to be split into sub-graphs and carried to the NRAM 431, the processing device 203 may perform fine-grained splitting in the N, H and W dimensions. If the available space of the NRAM 431 is insufficient, the processing device 203 splits the on-chip unit graph more finely until this rule is satisfied. Generally, the NRAM 431 has enough space that the on-chip unit graph can be loaded at once with a reasonable degree of splitting, and from the perspective of the fusion policy the template fusion unit is not affected by the number of batches. However, the more finely the on-chip unit graph is split (i.e. the more sub-graphs there are), the lower the processing speed, so the processing device 203 needs to evaluate the space of the NRAM 431.
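A hedged sketch of how the on-chip unit graph might be split ever more finely along the N, H and W dimensions until one sub-graph fits in the NRAM; the size model is deliberately simplified and all names are illustrative.

def split_until_fits(n, h, w, c, bytes_per_element, nram_bytes):
    # Double the split factor along N, then H, then W until a single sub-graph fits in NRAM.
    splits, dims, axis = [1, 1, 1], [n, h, w], 0
    while True:
        sub_elems = (dims[0] // splits[0]) * (dims[1] // splits[1]) * (dims[2] // splits[2]) * c
        if sub_elems * bytes_per_element <= nram_bytes:
            return splits                      # rule seventeen satisfied
        if all(s >= d for s, d in zip(splits, dims)):
            raise ValueError("cannot split finely enough to fit the NRAM")
        splits[axis] = min(splits[axis] * 2, dims[axis])
        axis = (axis + 1) % 3                  # rotate through N, H, W

print(split_until_fits(n=4, h=64, w=64, c=128, bytes_per_element=2, nram_bytes=256 * 1024))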
In some embodiments, the space of the SRAM 308 corresponds to the number of NRAMs 431 of the processor cores 306 within the cluster 305; for example, if the cluster 305 includes 4 processor cores 306, then the space of the SRAM 308 is 4 times the space of the NRAM 431. In other words, the on-chip unit graph in the large graph mode can generally be allocated to the 4 processor cores 306 for processing; this architectural design already ensures that the data loaded into the SRAM 308 can be allocated to all the NRAMs 431 at once, so this rule need not be considered in the large graph mode.
Eighteen rules: the number of feature maps is not greater than a feature map threshold
In the small graph mode, the on-chip unit graph may include multiple feature graphs. The more feature graphs there are, the more times sub-graphs must be transferred between the SRAM 308 and the NRAM 431, and the lower the efficiency becomes, so the processing device 203 calculates an appropriate number of fusion layers according to the number of feature graphs in the on-chip unit graph so as to maximize the benefit. When the processing device 203 determines that the rule is not satisfied, the processing device 203 reduces the number of feature graphs in the on-chip data until the rule is satisfied.
Rule nineteen: step size redundancy
Step-size redundancy refers to the following situation: when the template fusion unit fuses too many layers, and the kernel length and width of the convolution and pooling layers are larger than their step sizes, the input data required by neighbouring output points overlap, which is the input-dependent operation, and the overlapping part is the step-size redundancy. Step-size redundancy causes each processor core 306 to read more data, and although this data is reused it still occupies off-chip access resources; the more layers the template fusion unit includes, the more serious the step-size redundancy becomes. This rule is that the sum of the differences between the kernel side length and the step size of the convolution and pooling layers is not greater than a redundancy threshold.
In this embodiment, the redundancy threshold is defined as follows. Assume that the kernel length and width of a convolution or pooling layer are kx and ky, and the step sizes in the length and width directions are sx and sy. The step-size redundancy in the length direction is the sum of (kx − sx) over all convolution and pooling layers in the template fusion unit; similarly, the step-size redundancy in the width direction is the sum of (ky − sy). The redundancy threshold of this embodiment may be 3, 4, 5 or 6, preferably 4. This rule is not satisfied as long as the step-size redundancy in either the length or the width direction is greater than the redundancy threshold. The processing device 203 then adjusts the template fusion unit, typically by reducing the number of fused layers, until this rule is satisfied.
The fusion strategy sets an exception to the step-size redundancy rule. If the layers to be fused contain multiple branches and the template fusion unit can fuse the whole multi-branch structure, the performance of the template fusion unit will be better; in this case the processing device 203 ignores the step-size redundancy rule, that is, step-size redundancy does not prevent the template fusion unit from fusing multiple branches. In the fusion strategy of this embodiment, fusing multiple branches takes precedence over the step-size redundancy limitation, so step-size redundancy is only considered in the single-branch case.
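The step-size redundancy of rule nineteen can be accumulated layer by layer as sketched below; the kernel/step definitions and the preferred threshold of 4 follow the text above, the multi-branch exception is folded in as a flag, and everything else is illustrative.

def rule_nineteen(conv_pool_layers, threshold=4, fuses_whole_multi_branch=False):
    # conv_pool_layers: (kx, ky, sx, sy) for every convolution/pooling layer in the unit.
    if fuses_whole_multi_branch:
        return True   # exception: fusing a whole multi-branch structure overrides this rule
    redundancy_x = sum(kx - sx for kx, ky, sx, sy in conv_pool_layers)
    redundancy_y = sum(ky - sy for kx, ky, sx, sy in conv_pool_layers)
    return redundancy_x <= threshold and redundancy_y <= threshold

# Example: two 3x3 stride-1 convolutions plus one 2x2 stride-2 pooling layer.
print(rule_nineteen([(3, 3, 1, 1), (3, 3, 1, 1), (2, 2, 2, 2)]))   # True: 2 + 2 + 0 = 4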
The above rules are merely examples, and the present disclosure does not limit the execution order of the rules, nor does the disclosure limit the rules to be considered at the same time, and those skilled in the art can add or delete the rules according to the actual situation in different application scenarios to implement the fusion policy according to the current application scenario.
Returning to fig. 11, in step 1103, neural network calculations are performed according to the built template fusion unit. Based on the three-level operation hierarchy of system-on-chip, cluster and processor core, matched with the three-level memory design of DRAM-SRAM-NRAM/WRAM, the computing device 201 treats the template fusion unit as a custom layer in the neural network, loads the data required to compute the template fusion unit from the DRAM 204 into the SRAM 308 at one time, caches and computes the data at the appropriate level so as to form sufficient pipelining, and carries the computation results from the SRAM 308 back to the DRAM 204 after the computation is completed, which greatly reduces the input/output overhead in neural network computation.
When various deep learning and machine learning algorithms are applied to input data in fields such as computer vision, speech, natural language processing and data mining, the template fusion unit can reduce the input/output overhead in neural network computation. Another embodiment of the present disclosure is a method of performing neural network computation using a template fusion unit. Fig. 12 shows its flow.
In step 1201, a template fusion unit is determined according to a fusion policy. The processing device 203 selects an initial layer of the template fusion unit according to an initial rule of the fusion strategy; and fusing by taking the initial layer as a reference, and checking all rules of a fusion strategy one by one to establish a template fusion unit. The previous embodiment has illustrated various rules of the fusion policy in detail, and is not described again.
In this step, the template fusion unit is expressed as source code, which then needs to be converted by a compiler into object code (also referred to as machine code) in a machine language. The following steps describe how the compiler converts the source code of the template fusion unit into machine-language object code.
In step 1202, the shape of the template fusion unit is derived. For the data to be processed by the template fusion unit, this embodiment adopts a back-derivation approach: the compiler derives backwards, starting from the output, what size of input is required. Taking fig. 8 as an example, the derivation proceeds backwards from the feature map 803 to the feature map 802, and then back to the feature map 801. In this step, the compiler not only derives the input data required by the template fusion unit, but also further derives the redundancy.
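A hedged sketch of back-to-front shape derivation for a chain of convolution/pooling layers, in the spirit of deriving feature map 803 back to 802 and then 801 in fig. 8; padding and dilation are ignored for brevity and all names are illustrative.

def required_input_hw(out_h, out_w, layers_back_to_front):
    # layers_back_to_front: (kernel, stride) pairs, last layer of the fusion unit first.
    h, w = out_h, out_w
    for kernel, stride in layers_back_to_front:
        # inverse of: out = (in - kernel) // stride + 1
        h = (h - 1) * stride + kernel
        w = (w - 1) * stride + kernel
    return h, w

# A 3x3 stride-1 convolution followed by a 2x2 stride-2 pooling, derived backwards
# from a 16x16 output: the pooling needs 32x32 input, and the convolution needs 34x34.
print(required_input_hw(16, 16, [(2, 2), (3, 1)]))   # (34, 34)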
Step 1203 is performed next to derive the addresses. According to the shape of the template fusion unit, the compiler derives the on-chip storage space addresses of the entire control flow graph and realizes access through general addresses, so as to simplify computing resources and shorten computing time. A control flow graph is an abstract data structure used in the compiler that represents all the paths a program may execute; it reflects, in flow-chart form, the possible flow directions of all nodes within a process. A control flow graph is composed of nodes and the relationships between nodes. A node, also called a basic block (BB), is a maximal sequence of statements executed in order in the program; each basic block has only one entry and one exit, and execution enters through its entry and leaves through its exit. A characteristic of the basic block is that once its first instruction is executed, all instructions within the basic block are executed in order.
Each basic block contains at least one instruction, the instructions in the basic block possibly pointing to a specific on-chip memory space using pointers. A pointer is a variable used to hold an address of a particular address space. With the pointer, the processor core 306 can load data into the space of, or fetch data from, the particular address pointed to by the pointer.
The compiler initially divides the basic blocks according to the division of the template fusion unit, and after iterative operation confirms the basic blocks and their interrelations, thereby completing the object code of the template fusion unit.
Moreover, the compiler also analyzes the data reused between the preceding and following template fusion units in the neural network, determines how much data of the former template fusion unit can be left on chip for the next template fusion unit to use, and plans the storage address of each piece of data according to this determination.
In this step, the compiler completes the derivation of the address in the control flow graph.
In step 1204, on-chip storage space is allocated. The processing device 203 allocates physical spaces of the SRAM308, NRAM 431, and WRAM 432 based on the derivation of the template fusion unit address. In this step, the compiler completes pointing of the pointer in the control flow graph.
Finally, step 1205 is performed to generate executable instructions. In this step, the object code generated by the compiler is linked with libraries by a linker to produce an executable file. More specifically, the object code is a program module comprising machine code and information usable by the linker; the linker resolves undefined symbolic references, replaces the placeholders in the object code with the addresses of those symbols, and generates the executable instructions. The executable instructions may be executed directly by the computing device 201 to perform the calculations of the neural network.
The present disclosure dynamically determines a template fusion unit by setting a fusion policy, fuses layers in a neural network to form a new custom layer, and loads data required for calculating the template fusion unit at a time to reduce input/output overhead.
When the template fusion unit is determined according to the rules of the fusion strategy, the fusion does not necessarily have to start from a convolutional layer or a pooling layer. The foregoing embodiments mention that, in one application scenario, the start rule may be that the start layer is the first layer in the neural network that has not yet been fused, and this layer may be a layer other than a convolutional layer or a pooling layer. This start rule makes the establishment of the template fusion unit more flexible: for different neural networks, an appropriate start layer can be selected, based on the ordering of the layers, to begin the fusion, without being limited by the position and number of convolutional or pooling layers in the neural network model. The solution is therefore suitable for various network models, the fusion is more comprehensive, and the overall benefit is improved.
For example, taking the neural network model of fig. 6 and assuming that layers 1 to 5 have been fused, when the next template fusion unit is established, if the start rule were that the start layer is the earliest un-fused convolution or pooling layer, the next convolution or pooling layer would be layer 8; in other words, layers 6 and 7 might never be fused, which would affect the overall benefit.
Another embodiment of the present disclosure is a solution for fusing neural networks, the starting layers of which are layers other than convolutional layers and pooling layers, i.e., non-convolutional layers and non-pooling layers. This embodiment is also realized on the basis of the framework of fig. 1 to 4. This embodiment also executes the flowchart shown in fig. 11.
In step 1101, a start layer is selected according to the fusion policy. The processing device 203 selects the start layer according to the fusion policy; for example, the start rule of the fusion policy is that the start layer is the first layer in the neural network that has not yet been fused, and that layer is a layer other than a convolutional layer or a pooling layer.
It should be noted that this step does not use the start rule of taking the earliest un-fused convolution or pooling layer as the start layer; if the start layer were selected according to that rule, it would be limited to being a convolution or pooling layer. This embodiment is therefore not limited by the position and number of convolution or pooling layers in the neural network model.
In one application scenario, the start layer may be an element-wise layer, which operates on each element of a vector; for such operations the input data and the output data have the same shape. Element-wise layers include the following (a brief sketch follows the list):
1. basic operation: vector addition, vector subtraction, vector multiplication, etc
2. Step operation: absolute value, square root, division, exponent, remainder, exponentiation, and the like
3. Trigonometric function operation
4. Rounding operations: rounding up (ceiling), rounding down (floor), truncation to integer, etc
5. Activation function: sigmoid, tanh, ReLU, etc
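A minimal illustration (not taken from the patent) that element-wise layers keep the output shape identical to the input shape:

def elementwise(op, x, y=None):
    # Apply a unary or binary operation element by element; output shape equals input shape.
    if y is None:
        return [op(a) for a in x]
    return [op(a, b) for a, b in zip(x, y)]

print(elementwise(lambda a, b: a + b, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # [5.0, 7.0, 9.0]
print(elementwise(abs, [-1.0, 2.0, -3.0]))                                 # [1.0, 2.0, 3.0]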
In another application scenario, the start layer may be an add-padding (addpadding) layer. Padding adds elements carrying blank information around the input data, so that the data size can be kept consistent with the original image without discarding original image information.
In another application scenario, the start layer may be a custom layer. With the development of deep learning and the increasing complexity of neural networks, operators are no longer limited to known or standard ones, and more and more operators with customized operation rules are applied in neural networks; this embodiment can select such a custom layer as the start layer.
In another application scenario, the start rule of the fusion policy of this embodiment has the processing device 203 further determine whether the neural network includes a block structure. If not, the neural network is a long-chain structure, and the processing device 203 selects the first un-fused layer in the neural network according to the aforementioned start rule. If so, this embodiment follows the rule three mentioned above and preferentially fuses in units of block structures; the processing device 203 then determines whether the frontmost layer in the block structure is a layer other than a convolutional layer and a pooling layer, and if so, the processing device 203 starts from that frontmost layer.
When the processing device 203 determines that the frontmost layer is a convolutional layer or a pooling layer, the processing device 203 may directly select that convolutional or pooling layer as the start layer, or select forward the nearest layer other than a convolutional layer and a pooling layer as the start layer. Fig. 13 shows a neural network model with a block structure; this exemplary neural network model includes a sub-network 1301 and a sub-network 1302. The sub-network 1301 includes the first to sixth layers, the sub-network 1302 includes the eighth to eleventh layers, and the sub-network 1301 and the sub-network 1302 are connected at the seventh layer. Assuming that the sub-network 1301 has been fused, when the sub-network 1302 is to be fused, the processing device 203 determines according to this rule whether the frontmost layer of the sub-network 1302 (i.e. the eighth layer) is a layer other than a convolutional layer and a pooling layer. If so, the eighth layer is directly selected as the start layer for fusion. If the eighth layer is a convolutional or pooling layer, the processing device 203 may still select the eighth layer as the start layer, or select forward the nearest layer other than a convolutional layer and a pooling layer as the start layer; the layer immediately preceding the eighth layer is the seventh layer, which has not yet been fused, so assuming the seventh layer is not a convolutional or pooling layer, the processing device 203 selects the seventh layer as the start layer. If the seventh layer is also a convolutional or pooling layer, this embodiment may choose either the seventh layer or the eighth layer as the start layer.
This embodiment preferentially fuses the whole block structure to improve the fusion benefit. However, in a specific application scenario the processing device 203 may be unable to select forward a layer other than a convolutional layer and a pooling layer as the start layer. Taking the neural network model of fig. 7 as an example, assume that the sub-network 701 has been fused; when the sub-network 702 is to be fused, if the seventh layer is a convolution or pooling layer, then, since the sub-network 701 has already been fused, the processing device 203 cannot select forward a layer other than a convolutional layer and a pooling layer as the start layer. The processing device 203 then turns backward and selects the nearest layer other than a convolutional layer and a pooling layer (i.e. the eighth layer) as the start layer, but in that case the whole block structure cannot be incorporated into the template fusion unit. Since the fusion effect with the eighth layer as the start layer is not ideal, the processing device 203 may also directly select the seventh layer as the start layer.
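The start-layer selection just described might be summarized, purely as an illustrative sketch, as follows; the layer objects, the kind attribute and the ordering assumptions are hypothetical.

def choose_start_layer(all_layers, block_front, fused):
    # all_layers: network layers in topological order; block_front: frontmost layer of the
    # block structure to be fused next; fused: set of layers already fused.
    if block_front.kind not in ("conv", "pool"):
        return block_front                          # e.g. the eighth layer of fig. 13
    # block_front is a convolution/pooling layer: search forward (towards earlier layers)
    # for the nearest un-fused layer that is neither convolution nor pooling (e.g. layer 7);
    # the search cannot pass layers that are already fused, as in the fig. 7 scenario.
    for layer in reversed(all_layers[:all_layers.index(block_front)]):
        if layer in fused:
            break
        if layer.kind not in ("conv", "pool"):
            return layer
    return block_front                              # otherwise fall back to the conv/pool layer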
After the start layer is selected, step 1102 is performed to build a template fusion unit based on the start layer. The processing device 203 may establish the template fusion unit according to the rules (rule one to rule nineteen) illustrated in the foregoing embodiments, which are only examples, and this embodiment does not limit the execution sequence of the rules, nor limits the rules to be considered at the same time, and those skilled in the art may add or delete the rules according to the actual situation in different application scenarios to implement the fusion policy according to the current application scenario.
Steps 1101 and 1102 correspond to the determination of template fusion means according to the fusion policy in step 1201. The compiler then derives the shape of the template fusion unit (step 1202), derives the address (step 1203), allocates on-chip storage space (step 1204), and finally generates executable instructions by the linker (step 1205).
In step 1103, neural network computation is performed according to the built template fusion unit. The computing device 201 executes the aforementioned executable instructions to perform neural network computations according to the template fusion unit.
The start layer of this embodiment can be a layer other than a convolution or pooling layer. This start rule makes the establishment of the template fusion unit more flexible: for different neural networks, an appropriate start layer can be selected to begin the fusion, without being limited by the position and number of convolutional or pooling layers in the neural network model. The solution is therefore suitable for various network models, the fusion is more comprehensive, and the overall benefit is improved.
In modern neural network models, the input/output feature maps of every layer do not necessarily take the inverted-pyramid form shown in fig. 8; in some layers the input feature map is smaller than the output feature map. Such layers are often used in the deep learning field of computer vision where, in certain situations, an image needs to be restored to its original size for further computation. When computing such layers, the image size is enlarged, mapping the image from a smaller to a larger resolution. Fig. 14 shows a schematic diagram of such layers; as can be seen from fig. 14, the input/output feature maps take the form of a regular pyramid, so such a layer is referred to in this disclosure as a positive pyramid layer, while a layer as in fig. 8, whose input feature map is larger than its output feature map, is referred to as an inverted pyramid layer.
In practice, the positive pyramid layers include a deconvolution layer, an unpooling layer (unpooling), and an upsampling layer (upsampling).
Deconvolution, also called transposed convolution or dilated convolution, is not the complete inverse of forward convolution; it is a special kind of forward convolution whose parameters participate in the calculation and must be trained and learned. Deconvolution first enlarges the input image by padding zeros in a certain proportion, then rotates the convolution kernel and performs a forward convolution.
Unpooling is divided into max-pooling unpooling and average-pooling unpooling. Max-pooling unpooling retains the position information of the maximum value and fills the remaining positions with 0. Fig. 15A shows a max pooling layer 1501: the input feature map 1502 passes through the max pooling layer 1501 to produce the output feature map 1503; it also shows a max-pooling unpooling layer 1504, where the input feature map 1503 passes through the unpooling layer 1504 to produce the output feature map 1505, whose size is larger than that of the input feature map 1503. Average-pooling unpooling fills the average value into each of its corresponding positions in the original data region. Fig. 15B shows an average pooling layer 1506: the input feature map 1507 passes through the average pooling layer 1506 to produce the output feature map 1508; it also shows an average-pooling unpooling layer 1509, where the input feature map 1508 passes through the unpooling layer 1509 to produce the output feature map 1510, whose size is larger than that of the input feature map 1508.
Upsampling expands a feature map directly in the corresponding original data region according to a kernel. Fig. 16 shows a schematic diagram of upsampling: the input feature map 1601 passes through a maximum pooling layer (not shown) to produce the intermediate feature map 1602, and the intermediate feature map 1602 is expanded by the kernel 1603 of an upsampling layer (not shown) to obtain the output feature map 1604, whose size is larger than that of the intermediate feature map 1602.
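As a small illustration of the positive-pyramid behaviour described above, a max-pooling unpooling step can be sketched as follows (a 2x2 kernel is assumed and the data layout is simplified to nested Python lists; the names are illustrative).

def max_unpool_2x2(values, indices, out_h, out_w):
    # values:  pooled maxima as nested lists, shape (out_h // 2, out_w // 2)
    # indices: (row, col) position of each maximum in the enlarged output
    # Returns an out_h x out_w map with the maxima restored and zeros elsewhere,
    # so the output feature map is larger than the input feature map.
    out = [[0.0] * out_w for _ in range(out_h)]
    for i, row in enumerate(values):
        for j, v in enumerate(row):
            r, c = indices[i][j]
            out[r][c] = v
    return out

vals = [[9.0, 8.0], [7.0, 6.0]]
idx = [[(0, 1), (1, 3)], [(2, 0), (3, 2)]]
print(max_unpool_2x2(vals, idx, 4, 4))   # a 4x4 output produced from a 2x2 input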
These operators all share the characteristic that the input feature map is smaller than the output feature map. In addition, there may be user-defined layers that also have this characteristic. Another embodiment of the present disclosure is a solution for fusing a neural network, likewise based on the framework shown in figs. 1 to 4, whose fusion strategy can fuse positive pyramid layers. Fig. 17 shows the flow of this embodiment fusing the neural network shown in fig. 18. Fig. 18 is an exemplary neural network model with 14 layers, in which the first segment 1801 includes layers 1 to 4, which are inverted pyramid layers; the second segment 1802 includes layers 5 to 9, which are positive pyramid layers; and the third segment 1803 includes layers 10 to 14, which are inverted pyramid layers.
In step 1701, a template fusion unit is built according to the fusion policy. The processing device 203 selects the start layer of the template fusion unit according to the start rule of the fusion policy. In one application scenario, the start rule may be that the first un-fused layer is the start layer of the template fusion unit. Assuming that layers 1 to 3 have been fused, layer 4 is the start layer of the template fusion unit, and fusion proceeds backward from layer 4, checking all rules of the fusion strategy one by one to establish the template fusion unit. First the positive pyramid layer 5 is fused, and if fusion can continue, the processing device 203 continues to fuse backward. In another application scenario, the start rule may be that the frontmost positive pyramid layer among all un-fused layers is the start layer of the template fusion unit. Again assuming that layers 1 to 3 have been fused, layer 5 is the frontmost positive pyramid layer, so layer 5 is the start layer of this template fusion unit, and fusion proceeds backward.
This embodiment does not limit the way the positive pyramid layers and the inverted pyramid layers are fused. All the positive pyramid layers may be fused together, for example a template fusion unit including layers 5 to 9; or they may be fused across segments, for example a template fusion unit including layers 3 to 6, or a template fusion unit including layers 9 to 12. In other words, the template fusion unit may include only positive pyramid layers, or inverted pyramid layers followed by positive pyramid layers, or positive pyramid layers followed by inverted pyramid layers.
Furthermore, the positive pyramid layer and the inverted pyramid layer may be adjacent in the template fusion unit, for example, the 4 th layer and the 5 th layer, and the 9 th layer and the 10 th layer.
When the neural network is in a block structure and the block structure includes the positive pyramid layer, the rule of the fusion strategy in this embodiment is to add or delete template fusion units in units of block structures.
The main layers in this embodiment are defined as matrix multiplication, pooling, convolution, deconvolution, unpooling and upsampling layers. When the neural network includes multiple main layers, one rule of the fusion policy is that the template fusion unit includes at least 2 main layers; when the processing device 203 determines that this rule is not satisfied, the processing device 203 adjusts the template fusion unit until the rule is satisfied. This embodiment may further include another rule of the fusion policy, namely that the template fusion unit includes a continuous structure in which a main layer, a main layer and a non-main layer are adjacent in sequence; when the processing device 203 determines that this rule is not satisfied, the processing device 203 adjusts the template fusion unit until the rule is satisfied.
In step 1702, the shape of the template fusion unit is derived. Step 1703 is executed to derive an address. In step 1704, on-chip storage space is allocated. In step 1705, executable instructions are generated. These steps are the same as those of steps 1202 to 1205 and will not be described again.
Finally, step 1706 is executed, in which the neural network calculation is executed according to the template fusion unit. The computing device 201 executes the aforementioned executable instructions to perform neural network computations according to the template fusion unit.
Another embodiment of the disclosure is a computer readable storage medium having stored thereon computer program code for dynamically fusing neural networks according to a fusion policy, which when executed by a processor performs the method of fig. 10, 11, 12, 17.
The embodiment can fuse the positive pyramid layer and the inverted pyramid layer, and the template fusion unit is more flexibly established by the fusion strategy without being limited by the sizes of the input characteristic diagram and the output characteristic diagram, so that the template fusion method is suitable for various network models, the fusion is more comprehensive, and the overall benefit is improved.
The present disclosure dynamically determines a template fusion unit by setting a fusion policy, fuses layers in a neural network to form a new custom layer, and loads data required for calculating the template fusion unit at a time to reduce input/output overhead.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer readable memory. In this regard, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory, which may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in embodiments of the present disclosure. The Memory may include, but is not limited to, a usb disk, a flash disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description and is intended to be exemplary only and is not intended to be exhaustive or to limit the invention to the precise forms disclosed; meanwhile, for the person skilled in the art, based on the idea of the present disclosure, there may be variations in the specific embodiments and the application scope, and in summary, the present disclosure should not be construed as limiting the present disclosure.

Claims (22)

1. An integrated circuit device that merges a neural network, the neural network including an i-th layer, an input feature map of the i-th layer being smaller than an output feature map, the integrated circuit device comprising:
the processing device is used for establishing a template fusion unit according to a fusion strategy; and
the calculating device is used for executing neural network calculation according to the template fusion unit;
wherein the template fusion unit comprises the ith layer, and i is a natural number.
2. The integrated circuit device according to claim 1, wherein the neural network further comprises a j-th layer, the j-th layer is located before the i-th layer, the input feature map of the j-th layer is larger than the output feature map, the template fusion unit further comprises the j-th layer, j is a natural number, and i is not equal to j.
3. The integrated circuit device according to claim 1, wherein the neural network further comprises a j-th layer, the j-th layer is located after the i-th layer, the input feature map of the j-th layer is larger than the output feature map, the template fusion unit further comprises the j-th layer, j is a natural number, and i is not equal to j.
4. The integrated circuit device according to claim 2 or 3, wherein the ith layer and the jth layer are adjacent.
5. The integrated circuit device of claim 1, wherein the fusion policy is that the ith layer is a starting layer of the template fusion unit.
6. The integrated circuit device according to claim 1, wherein the ith layer is located in a block structure of the neural network, and a rule of the fusion policy is to add or delete the template fusion unit in units of the block structure.
7. The integrated circuit device according to claim 1, wherein the i-th layer is one of a deconvolution layer, an unpooling layer, and an upsampling layer.
8. The integrated circuit device according to claim 7, wherein the neural network comprises a plurality of main layers, the main layers being one of matrix multiplication, pooling, convolution, and the i-th layer, the rule of the fusion policy is that the template fusion unit comprises at least 2 main layers, and when the processing unit determines that the rule is not satisfied, the processing device adjusts the template fusion unit until the rule is satisfied.
9. The integrated circuit device according to claim 7, wherein the neural network comprises a plurality of main layers, the main layers are one of matrix multiplication, pooling, convolution and the i-th layer, the rule of the fusion policy is that the template fusion unit comprises a continuous structure in which the main layers, the main layers and the non-main layers are adjacent in sequence, and when the processing unit determines that the rule is not satisfied, the processing device adjusts the template fusion unit until the rule is satisfied.
10. The integrated circuit device according to claim 1, wherein the ith layer is a custom layer.
11. A board card comprising an integrated circuit device according to any of claims 1 to 10.
12. A method of fusing a neural network, the neural network including an i-th layer, an input feature map of the i-th layer being smaller than an output feature map, i being a natural number, the method comprising:
establishing a template fusion unit according to a fusion strategy, wherein the template fusion unit comprises the ith layer; and
and executing neural network calculation according to the template fusion unit.
13. The method according to claim 12, wherein the neural network further comprises a j-th layer, the j-th layer is located before the i-th layer, the input feature map of the j-th layer is larger than the output feature map, the template fusion unit further comprises the j-th layer, j is a natural number, and i is not equal to j.
14. The method according to claim 12, wherein the neural network further comprises a j-th layer, the j-th layer is located after the i-th layer, the input feature map of the j-th layer is larger than the output feature map, the template fusion unit further comprises the j-th layer, j is a natural number, and i is not equal to j.
15. The method of claim 13 or 14, wherein the ith layer and the jth layer are adjacent.
16. The method of claim 12, wherein the fusion policy is that the ith layer is a starting layer of the template fusion unit.
17. The method according to claim 12, wherein the ith layer is located in a block structure of the neural network, and a rule of the fusion policy is to add or delete the template fusion unit in units of the block structure.
18. The method of claim 12, wherein the i-th layer is one of a deconvolution layer, an unpooling layer, and an upsampling layer.
19. The method of claim 18, wherein the neural network comprises a plurality of primary layers, the primary layers being one of matrix multiplication, pooling, convolution, and the i-th layer, the rule of the fusion policy being that the template fusion unit comprises at least 2 primary layers, the processing device adjusting the template fusion unit until the rule is satisfied when the processing unit determines that the rule is not satisfied.
20. The method of claim 18, wherein the neural network comprises a plurality of main layers, the main layers are one of matrix multiplication, pooling, convolution and the i-th layer, the rule of the fusion policy is that the template fusion unit comprises a continuous structure with the main layers, the main layers and the non-main layers being adjacent in sequence, and when the processing unit determines that the rule is not satisfied, the processing device adjusts the template fusion unit until the rule is satisfied.
21. The method of claim 12, wherein the ith layer is a custom layer.
22. A computer readable storage medium having stored thereon computer program code for a converged neural network, which when executed by a processing apparatus, performs the method of any one of claims 12 to 21.