WO2022063183A1 - Apparatus, board, method and readable storage medium for performing neural network computation - Google Patents

Apparatus, board, method and readable storage medium for performing neural network computation

Info

Publication number
WO2022063183A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
template
fusion unit
fusion
unit
Prior art date
Application number
PCT/CN2021/119943
Other languages
English (en)
French (fr)
Inventor
兰慧盈
王瑞涛
罗海钊
曹博
陈峋宇
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202011045888.2A (CN114358264A)
Priority claimed from CN202011043896.3A (CN114282642A)
Priority claimed from CN202011045852.4A (CN114358261A)
Priority claimed from CN202011045871.7A (CN114358263A)
Application filed by 中科寒武纪科技股份有限公司
Priority to US18/003,682 (published as US20230274158A1)
Publication of WO2022063183A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Definitions

  • the present disclosure relates generally to the field of neural networks. More specifically, the present disclosure relates to apparatuses, boards, methods, and readable storage media for performing neural network computations.
  • A neural network is a system of multiple neurons connected according to certain rules, roughly composed of four kinds of layers: an input layer, convolutional layers, pooling layers, and fully connected layers.
  • the input layer intercepts part of the information from the input data and converts it into a feature matrix for presentation, which contains the features corresponding to the part of the information.
  • the convolution layer is configured to receive the feature matrix from the input layer, and perform feature extraction on the input data through a convolution operation.
  • Convolutional layers can be constructed with multiple layers of convolutional layers in practical applications.
  • the pooling layer is configured to replace a certain region of the data with a value, which is usually the maximum or average of all the values in that region. Through pooling, the model size can be reduced and the calculation speed can be improved without losing too much information.
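  • As an illustration only (not part of the disclosure), the following Python sketch shows how a pooling layer replaces each region with a single value, using either the maximum or the average of the region:

```python
import numpy as np

def pool2d(x: np.ndarray, k: int = 2, mode: str = "max") -> np.ndarray:
    h, w = x.shape
    out = np.empty((h // k, w // k))
    for i in range(h // k):
        for j in range(w // k):
            region = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, mode="max"))   # each 2x2 region replaced by its maximum
print(pool2d(x, mode="mean"))  # each 2x2 region replaced by its average
```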
  • The fully connected layer acts as a classifier for the entire convolutional neural network. It is equivalent to a feature-space transformation: it extracts and integrates all the useful information produced by the preceding layers and compares it against the different classes to determine how closely the input data matches each target.
  • VGG-A has 11 weight layers
  • VGG-B has 13 weight layers
  • VGG-C has 16 weight layers
  • VGG-D has a total of 16 weight layers
  • VGG-E has a total of 19 weight layers.
  • The weight layers generally refer to the convolutional layers and the fully connected layers.
  • Some neural networks have hundreds of layers. Not only that, as the number of layers increases, the number of parameters of the neural network also increases exponentially. For example, AlexNet has 60 million parameters involved in the calculation.
  • the solutions disclosed in the present disclosure provide an apparatus, a board, a method and a readable storage medium for performing neural network computation.
  • The present disclosure discloses an integrated circuit device for performing neural network computation, comprising: a processing device for creating a template fusion unit; a compiler for converting the template fusion unit into object code; a linker for linking the object code with a library to form an executable file; and a computing device for executing the executable file to perform the neural network computation.
  • the present disclosure discloses a board including the integrated circuit device according to the foregoing.
  • The present disclosure discloses a method for performing neural network computation, comprising: establishing a template fusion unit; converting the template fusion unit into object code; linking the object code with a library to form an executable file; and executing the executable file to implement the neural network computation.
  • the present disclosure discloses a computer-readable storage medium having stored thereon computer program code for performing neural network computations, and when the computer program code is executed by a processing device, performs the aforementioned method.
  • The present disclosure dynamically determines the template fusion unit by setting a fusion strategy, fuses multiple layers of the neural network into a new custom layer, and loads the data required by the template fusion unit in one pass, thereby reducing input/output overhead.
  • FIG. 1 is a structural diagram illustrating a board according to an embodiment of the present disclosure
  • FIG. 2 is a block diagram illustrating an integrated circuit device of an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram showing the internal structure of a computing device according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram showing when one processor core wants to write data to a processor core of another cluster
  • Figure 6 is a schematic diagram showing the AlexNet model
  • FIG. 7 is a schematic diagram illustrating an exemplary neural network model
  • FIG. 8 is a schematic diagram illustrating fusion of two convolutional layers according to an embodiment of the present disclosure.
  • Figure 10 is a flowchart illustrating an embodiment of the present disclosure using a template fusion unit to perform neural network computation
  • FIG. 11 is a flowchart illustrating the dynamic fusion of neural networks according to a fusion strategy according to an embodiment of the present disclosure
  • Figure 12 is a flowchart illustrating an embodiment of the present disclosure using a template fusion unit to perform neural network computation
  • FIG. 13 is a schematic diagram showing a neural network model with a block structure
  • FIG. 14 is a schematic diagram illustrating input/output feature maps arranged in an upright-pyramid structure
  • FIG. 15A is a schematic diagram illustrating an up-pooling operation of max pooling
  • FIG. 15B is a schematic diagram illustrating an up-pooling operation of average pooling
  • FIG. 16 is a schematic diagram illustrating an upsampling operation
  • Figure 17 is a flow diagram illustrating the fusion of an exemplary neural network according to an embodiment of the present disclosure
  • FIG. 18 is a diagram illustrating an exemplary neural network model
  • FIG. 19 is a flowchart illustrating the computation of a neural network based on executable instructions according to an embodiment of the present disclosure
  • FIG. 20 is a diagram illustrating a ring allreduce (toroidal full reduction) framework
  • FIG. 21 is a schematic diagram illustrating a plurality of clusters forming a logical ring
  • FIG. 22A is a schematic diagram illustrating the first iteration of a ring allreduce
  • FIG. 22B is a schematic diagram illustrating the second iteration of a ring allreduce
  • FIG. 22C is a schematic diagram illustrating the third iteration of a ring allreduce
  • FIG. 23A is a schematic diagram illustrating one processor core of each cluster performing the complete reduction calculation in a ring allreduce
  • FIG. 23B is a schematic diagram illustrating the complete calculation after performing a ring allreduce
  • FIG. 24 is a diagram illustrating an exemplary long-chain neural network
  • Figure 25 is a flowchart illustrating an embodiment of the present disclosure implementing a forward fusion neural network
  • FIG. 26 is a diagram illustrating an exemplary long-chain neural network
  • FIG. 28 is a diagram illustrating an exemplary block-structured neural network
  • FIG. 29 is a schematic diagram illustrating a sub-template fusion unit division according to an embodiment of the present disclosure.
  • FIG. 31 is a flowchart illustrating the execution of a calculation program based on a template fusion unit according to an embodiment of the present disclosure.
  • FIG. 32 is a flowchart illustrating a two-layer three-stage pipeline according to an embodiment of the present disclosure.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • A neural network is composed of an input layer, convolutional layers, activation functions, pooling layers, and fully connected layers, ranging from a few layers to hundreds of layers. Each layer executes one operator; for example, a convolutional layer executes the convolution operator, so a network with a given number of layers needs to execute the same number of operators. In this disclosure, when a specific layer is mentioned, it refers to the operator corresponding to that layer.
  • variable data are generally represented by feature maps (matrix).
  • the input information of the entire neural network model and the input maps of each layer of the model are collectively referred to as feature maps.
  • Once a feature map is loaded onto the on-chip memory component, it is referred to as an on-chip unit map in this disclosure.
  • The parameters of a trained network model usually do not change frequently once training is stable, or can be generated at compile time once the network topology and hardware parameters are determined; they do not change during computation, so they can be regarded as constant data.
  • Constant data includes, but is not limited to, weights, biases, device hardware instructions, and the mean and variance of batch normalization.
  • In this disclosure, "weights" is used uniformly to refer to all constant data.
  • When this disclosure refers to fusing the data of the layers of a neural network, it generally refers to a graph structure that allows the operations of the corresponding operators in the neural network model to be fused together according to a fusion strategy.
  • The graph structure involves both variable data and constant data, that is, the feature maps plus the corresponding weights.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a monitor, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be transmitted back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected to the control device 106 and the chip 101 through a bus and performs data transmission.
  • the control device 106 in the board 10 is configured to control the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning calculations; it can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write the input data into the storage device on-chip of the computing device 201 .
  • the computing device 201 can obtain the control instruction from the processing device 203 via the interface device 202 and write it into the control cache on the computing device 201 .
  • the interface device 202 can also read the data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201, etc.
  • the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processors.
  • processors include but are not limited to digital signal processors (DSPs), application specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, the two are considered to form a heterogeneous multi-core structure.
  • the DRAM 204 is used to store the data to be processed, and is a DDR memory with a size of 16G or more, and is used to save the data of the computing device 201 and/or the processing device 203.
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 .
  • the computing device 201 is used to process input data such as computer vision, speech, natural language, and data mining.
  • the computing device 201 in the figure is designed with a multi-core hierarchical structure.
  • The computing device 201 is a system-on-chip that includes multiple clusters, and each cluster further includes a plurality of processor cores; in other words, the computing device 201 is organized in a system-on-chip / cluster / processor-core hierarchy.
  • the computing device 201 includes an external storage controller 301 , a peripheral communication module 302 , an on-chip interconnect module 303 , a synchronization module 304 , and multiple clusters 305 .
  • the peripheral communication module 302 is used for receiving a control signal from the processing device 203 through the interface device 202 to start the computing device 201 to perform tasks.
  • the on-chip interconnection module 303 connects the external storage controller 301 , the peripheral communication module 302 and the multiple clusters 305 to transmit data and control signals among the modules.
  • the synchronization module 304 is a global synchronization barrier controller (GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information.
  • The plurality of clusters 305 are the computing cores of the computing device 201; four are shown in the figure by way of example. With the development of hardware, the computing device 201 of the present disclosure may also include 8, 16, 64, or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.
  • each cluster 305 includes multiple processor cores (IPU cores) 306 and one memory core (MEM core) 307 .
  • the processor cores 306 are exemplarily shown as four in the figure, and the present disclosure does not limit the number of the processor cores 306 . Its internal structure is shown in Figure 4. Each processor core 306 includes three modules: a control module 41 , an arithmetic module 42 and a storage module 43 .
  • the control module 41 is used to coordinate and control the work of the arithmetic module 42 and the storage module 43 to complete the task of deep learning, and it includes an instruction fetch unit (instruction fetch unit, IFU) 411 and an instruction decoding unit (instruction Decode unit, IDU) 412.
  • the instruction fetching unit 411 is used to acquire the instruction from the processing device 203 , and the instruction decoding unit 412 decodes the acquired instruction, and sends the decoding result to the operation module 42 and the storage module 43 as control information.
  • The storage module 43 is used to store and transport related data, and includes a neuron storage unit (neuron RAM, NRAM) 431, a weight storage unit (weight RAM, WRAM) 432, an input/output direct memory access module (IODMA) 433, and a move direct memory access module (MVDMA) 434.
  • The NRAM 431 is used to store the feature maps computed by the processor core 306 and the intermediate results of the computation;
  • the WRAM 432 is used to store the weights of the deep learning network;
  • the IODMA 433 is used to control memory access between the NRAM 431/WRAM 432 and the DRAM 204;
  • the MVDMA 434 is used to control memory access between the NRAM 431/WRAM 432 and the SRAM 308.
  • The storage core 307 is mainly used for storage and communication, that is, to store the data shared among the processor cores 306 and intermediate results, and to carry out communication between the cluster 305 and the DRAM 204, communication among the clusters 305, and communication among the processor cores 306, etc.
  • the memory core 307 has scalar operation capability for performing scalar operations.
  • the storage core 307 includes a shared storage unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access (CDMA) 310 and a global direct memory access (GDMA) 311.
  • the SRAM 308 assumes the role of a high-performance data transfer station.
  • Data that is multiplexed among different processor cores 306 within the same cluster 305 does not need to be fetched from the DRAM 204 by each processor core 306 separately; instead, it is relayed among the processor cores 306 through the SRAM 308.
  • the storage core 307 only needs to quickly distribute the multiplexed data from the SRAM 308 to the multiple processor cores 306, so as to improve the communication efficiency between the cores and greatly reduce the on-chip and off-chip input/output accesses.
  • the broadcast bus 309, the CDMA 310 and the GDMA 311 are used to perform the communication between the processor cores 306, the communication between the clusters 305 and the data transmission between the clusters 305 and the DRAM 204, respectively. They will be explained separately below.
  • the broadcast bus 309 is used to complete high-speed communication among the processor cores 306 in the cluster 305.
  • the broadcast bus 309 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • Unicast refers to point-to-point (i.e., single processor core to single processor core) data transmission;
  • multicast is a communication method that transmits one copy of data from the SRAM 308 to certain specific processor cores 306;
  • broadcast is a communication method that transmits one copy of data from the SRAM 308 to all processor cores 306, and is a special case of multicast.
  • the CDMA 310 is used to control access to the SRAM 308 between different clusters 305 within the same computing device 201.
  • Figure 5 shows a schematic diagram when one processor core wants to write data to the processor cores of another cluster to illustrate the working principle of CDMA 310.
  • the same computing device includes multiple clusters. For the convenience of description, only cluster 0 and cluster 1 are shown in the figure, and cluster 0 and cluster 1 respectively include multiple processor cores. Cluster 0 shows only processor core 0, and cluster 1 shows only processor core 1. Core 0 wants to write data to Core 1.
  • First, processor core 0 sends a unicast write request to write the data into the local SRAM 0; CDMA 0 acts as the master and CDMA 1 acts as the slave.
  • The master pushes the write request to the slave, that is, the master sends the write address AW and the write data W, transferring the data to SRAM 1 of cluster 1, and the slave then sends a write response B in reply.
  • Finally, processor core 1 of cluster 1 sends a unicast read request to read the data out of SRAM 1.
  • the GDMA 311 cooperates with the external memory controller 301 to control the memory access from the SRAM 308 of the cluster 305 to the DRAM 204 , or to read data from the DRAM 204 to the SRAM 308 .
  • The communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be implemented through two channels. The first channel directly connects the DRAM 204 with the NRAM 431 or WRAM 432 through the IODMA 433; the second channel transfers data between the DRAM 204 and the SRAM 308 through the GDMA 311, and then between the SRAM 308 and the NRAM 431 or WRAM 432 through the MVDMA 434.
  • The bandwidth of the second channel is much larger than that of the first channel, so communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be more efficient through the second channel.
  • the embodiments of the present disclosure can select data transmission channels according to their own hardware conditions.
  • GDMA 311 and the functionality of IODMA 433 may be integrated in the same component.
  • GDMA 311 and IODMA 433 are regarded as different components.
  • the function of GDMA 311, the function of IODMA 433, the function of CDMA 310, and the function of MVDMA 434 can also be realized by the same component.
  • As long as the functions realized and the technical effects achieved are similar to those of the present disclosure, they all fall within the scope of protection of the present disclosure.
  • the structures of neural networks relevant to the present disclosure fall into two categories: long-chain structures and block structures.
  • the long-chain structure means that the neural network model is composed of layers connected in series by a single chain, each layer has only one input and one output, and the whole belongs to a single branch, such as the VGG16 model or the AlexNet model shown in Figure 6.
  • The block structure means that a sub-network in the neural network has only one input and one output, but there are multiple branches within the sub-network, that is, some layers of the sub-network have multiple inputs or outputs, such as the resblock structure of resnet50 and the block structure of inception_v3.
  • FIG. 7 shows a schematic diagram of an exemplary neural network model including a sub-network 701 and a sub-network 702 .
  • The sub-network 701 has only one input and one output and includes the first to sixth layers. The first layer has 2 outputs and the sixth layer has 2 inputs, so the sub-network 701 includes 2 branches: one branch is first layer → second layer → third layer → sixth layer, and the other branch is first layer → fourth layer → fifth layer → sixth layer. The sub-network 701 therefore constitutes a block structure.
  • the sub-network 702 also constitutes a block structure.
  • The present disclosure largely reduces off-chip/on-chip data transfers by fusing adjacent layers of the neural network.
  • Figure 8 shows a schematic diagram of fusing two convolutional layers together.
  • The input of the first convolutional layer 810 is a 7×7 feature map 801, which is convolved with a 3×3 kernel (not shown) to obtain the feature map 802 of the first convolutional layer 810.
  • the value of the 5x5 feature submap 804 affects the 3x3 feature submap 805.
  • Since the stride is 1, after calculating the 5×5 feature submap 804, the first convolutional layer 810 will then calculate the 5×5 feature submap 806, and the value of the 5×5 feature submap 806 will affect the 3×3 feature submap 807.
  • The feature map 802 then becomes the input of the second convolutional layer 811 and is likewise convolved with a 3×3 kernel to obtain the feature map 803 of the second convolutional layer 811.
  • the value of the 3 ⁇ 3 feature sub-map 805 will affect the 1 ⁇ 1 feature sub-map 808 in the feature map 803 .
  • the second convolutional layer 811 will then calculate the 3 ⁇ 3 feature submap 807 , and the value of the 3 ⁇ 3 feature submap 807 will affect the 1 ⁇ 1 value in the feature map 803 Feature subgraph 809.
  • the computing device 201 reads the 5 ⁇ 5 feature sub-map 804 from the DRAM 204 when the first layer of convolution 810 is performed, and stores the 3 ⁇ 3 feature sub-map 805 back to the DRAM 204 after the calculation is completed, and then from the DRAM 204 reads the 5 ⁇ 5 feature submap 806 , and stores the 3 ⁇ 3 feature submap 807 in the DRAM 204 after the calculation.
  • When performing the second convolutional layer 811, the computing device 201 also needs to read the 3×3 feature submap 805 from the DRAM 204, and after computation the 1×1 feature submap 808 is stored in the DRAM 204; it then reads the 3×3 feature submap 807 from the DRAM 204, and after computation the 1×1 feature submap 809 is stored in the DRAM 204. It can be seen from the above that the feature map 802, as intermediate data, is repeatedly read from and stored to off-chip memory, which considerably occupies system resources.
  • If the calculations of the first convolutional layer 810 and the second convolutional layer 811 are fused, the feature map 802 is kept in the NRAM 431 (the weights of the first convolutional layer 810 and the second convolutional layer 811 can likewise be kept in the WRAM 432), so that the number of accesses between the computing device 201 and the DRAM 204 is reduced, thereby improving the execution efficiency of the overall neural network.
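  • The following Python sketch (an illustration added here, not part of the disclosure) reproduces the shape arithmetic of FIG. 8 for two 3×3 convolutions with stride 1 and no padding, showing that a 5×5 sub-map loaded on chip is enough to produce a 1×1 output of the second layer without storing the 3×3 intermediate off chip:

```python
def conv_out(size: int, kernel: int = 3, stride: int = 1) -> int:
    # output side length of a valid (no-padding) convolution
    return (size - kernel) // stride + 1

# full feature maps: 801 (7x7) -> 802 -> 803
print(conv_out(7), conv_out(conv_out(7)))   # 5 3
# fused view: a 5x5 sub-map (804) yields a 3x3 intermediate (805) and a 1x1 output (808)
print(conv_out(5), conv_out(conv_out(5)))   # 3 1
```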
  • Since the feature maps participating in the fusion (such as the feature map 801, the feature map 802, and the feature map 803), viewed in the context of the neural network model, look like an inverted pyramid as a whole, this is called pyramid fusion.
  • Pyramid fusion is usually based on backward fusion from a specific convolutional or pooling layer of the neural network, that is, the starting layer of the fusion is a convolutional or pooling layer, and multiple layers are fused backward according to the hardware's own conditions; there may be several convolutional and pooling layers in between.
  • The ordering of layers has become complicated; for example, if an activation layer is placed before a convolutional layer, how this activation layer can be fused with the subsequent convolutional layer must also be considered. Therefore, in addition to fusion centered on the convolutional and pooling layers, the present disclosure provides various fusion methods.
  • Another embodiment of the present disclosure is a novel fusion method implemented using the hardware structures of the aforementioned FIGS. 1, 2, 3 and 4; this fusion is called a template fusion unit (TFU).
  • the template fusion unit mainly flexibly fuses multiple layers into one layer through a certain fusion strategy to reduce the input/output overhead of the network, which includes the aforementioned pyramid fusion and other fusion methods.
  • The set of these fused layers is the template fusion unit, which can be regarded as a new layer or a custom layer.
  • The feature maps, weights, etc. required by the template fusion unit are loaded from the DRAM 204 to the on-chip SRAM 308 in one pass. After a feature map is loaded into the SRAM 308 it is called an on-chip unit map, and the on-chip unit map is cut into subgraphs.
  • the weights required to calculate the subgraph are also loaded from the SRAM 308 to the WRAM 432 , after the calculation of each subgraph is completed, the corresponding intermediate result is obtained, and the intermediate result is stored back to the SRAM 308.
  • After all subgraphs are calculated, the calculation result is stored back to the DRAM 204 in one pass. That is to say, the on-chip unit maps and weights participating in the operations of the operators in the neural network model, and the corresponding results, are passed between the DRAM 204 and the SRAM 308, while the outputs (intermediate results) corresponding to the subgraphs are passed between the SRAM 308 and the NRAM 431. From the perspective of the computing device 201, the data loading of the template fusion unit is in units of on-chip unit maps, and the calculation is in units of subgraphs.
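  • The following toy NumPy simulation (an added illustration; the memory levels are plain arrays and the fused operators are replaced by a stand-in kernel) mirrors this data flow: one DRAM-to-SRAM load of the on-chip unit map, per-subgraph computation standing in for the NRAM level, and one SRAM-to-DRAM store of the result:

```python
import numpy as np

def run_template_fusion_unit(dram_feature_map: np.ndarray, n_subgraphs: int) -> np.ndarray:
    sram_unit_map = dram_feature_map.copy()               # one-time DRAM -> SRAM load
    subgraphs = np.array_split(sram_unit_map, n_subgraphs, axis=0)
    intermediate_results = []
    for sub in subgraphs:                                  # each subgraph is processed on an NRAM
        nram_input = sub.copy()
        intermediate_results.append(nram_input * 2.0)      # stand-in for the fused operators
    sram_result = np.concatenate(intermediate_results, axis=0)  # gathered back in SRAM
    return sram_result                                     # one-time SRAM -> DRAM store

out = run_template_fusion_unit(np.ones((8, 4)), n_subgraphs=4)
print(out.shape)  # (8, 4); off-chip memory is touched once for input and once for output
```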
  • The available space of the SRAM 308 is one of the important reference indicators for the fusion strategy; its size determines whether the template fusion unit operates in large image mode or small image mode.
  • the small image mode and the large image mode refer to whether a feature map stored in the DRAM 204 can be moved to the SRAM 308 for processing at one time, and the processing device 203 will compare the storage space required for the feature map with the available space in the SRAM 308. If the space of SRAM 308 is insufficient and the feature map cannot fit, it is in the large image mode; if the SRAM 308 is large enough to accommodate the entire feature map, it is in the small image mode.
  • In the large image mode, the on-chip unit map is only a part of the feature map; in the small image mode, if the available space of the SRAM 308 is large enough or the feature map is small enough, the SRAM 308 may accommodate multiple feature maps, that is, the on-chip unit map can include multiple feature maps.
  • In the large image mode, the feature map must be split before being loaded into the computing device 201.
  • the processing device 203 will split the feature map on the DRAM 204 until a sufficiently small on-chip cell map is generated to meet the space requirements of the SRAM 308, so that the on-chip cell map can be moved to the SRAM 308 for processing at one time.
  • When the feature map is split, input-dependent operations and output-dependent operations may be generated.
  • An input-dependent operation means that the split on-chip unit maps overlap at least partially, and each subset requires some additional copies of the input to perform a complete operation, resulting in data redundancy in the split operation; so-called data redundancy means that the same piece of data is multiplexed in the system.
  • Input-dependent operations are caused when the template fusion unit includes layers such as convolution, pooling, or matrix multiplication.
  • the output-dependent operation means that after each subgraph produces an intermediate result, it needs to be reduced to obtain the calculation result.
  • Reduction means that, based on an understanding of the content of the on-chip unit map itself, the map is divided into subgraphs that are calculated separately so as to reduce the calculation scale and minimize the amount of data while keeping the original on-chip unit map as intact as possible, and the calculation results are then restored or integrated based on the subgraphs.
  • Computational results are interdependent when reducing.
  • When the template fusion unit includes layers such as inner product, convolution, matrix multiplication, sorting, counting, etc., output-dependent operations are caused.
  • The data format of the feature maps that can be processed by this embodiment includes the N, H, W, and C dimensions, where N represents batch, H represents height, W represents width, and C represents channel.
  • N represents the number of images in this batch
  • H represents the number of pixels in the vertical direction of the image
  • W represents the number of pixels in the horizontal direction
  • C represents the number of channels (for example, the number of channels of a black-and-white image is 1, while the number of channels C of an RGB color image is 3).
  • the order of these dimensions determines the composition of the data.
  • the common composition methods are NHWC and NCHW.
  • Figure 9 shows the format difference between NCHW and NHWC, taking an RGB color image as an example: R represents a red pixel, G represents a green pixel, and B represents a blue pixel.
  • The sequence 91 is in NCHW format: N is arranged in the outermost layer, the pixels within each channel are adjacent, and the channels are arranged in RGB order; the offset of the element with coordinates (n, c, h, w) in storage is ((n×C+c)×H+h)×W+w.
  • Sequence 92 is in NHWC format, C is arranged in the innermost layer, and the RGB pixels corresponding to the spatial positions of multiple channels are close together.
  • The figure also shows the positions of the input pixel 901, the input pixel 902, and the input pixel 903 in the different arrangements; the three input pixels 901, 902, and 903 together form the color of one point in the image.
  • In NHWC format, the offset of the element with coordinates (n, c, h, w) is ((n×H+h)×W+w)×C+c.
  • NHWC is closer to the BMP image data storage format than NCHW.
  • This is because a BMP-format file stores data pixel by pixel, and each pixel stores the color values of all channels, so no additional dimensional transformation is needed when reading the input image. Therefore, NHWC has better memory access locality: one output pixel can be obtained for every three input pixels, whereas NCHW must wait for all channel inputs to be ready before the final output can be obtained, which requires a large cache space.
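  • The two offset formulas above can be transcribed directly into code and checked numerically (the shape values below are arbitrary examples, added here for illustration only):

```python
import numpy as np

def offset_nchw(n, c, h, w, C, H, W):
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, C, H, W):
    return ((n * H + h) * W + w) * C + c

N, C, H, W = 2, 3, 4, 5
x = np.arange(N * C * H * W)
assert x.reshape(N, C, H, W)[1, 2, 3, 4] == offset_nchw(1, 2, 3, 4, C, H, W)
assert x.reshape(N, H, W, C)[1, 3, 4, 2] == offset_nhwc(1, 2, 3, 4, C, H, W)
print("both offset formulas verified")
```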
  • In this embodiment, multiple layers of the neural network, together with their data, may be fused into a template fusion unit; FIG. 10 shows the corresponding flowchart.
  • In step 1001, the processing device 203 determines whether the storage space required for the feature map is greater than the available space of the SRAM 308. If so, the feature map cannot be loaded into the SRAM 308 in one pass, so step 1002 is executed to split the feature map.
  • When splitting, the processing device 203 preferentially splits in the N dimension, because this generates no input- or output-dependent operations. If splitting in the N dimension cannot meet the requirement, it then considers splitting in the H or W dimension, where input- or output-dependent operations may occur.
  • This embodiment also supports splitting in the C dimension, especially along the Cout direction, so that one convolution is split into multiple convolutions by means of data optimization and the WRAM 432 needs to hold fewer weights; for example, the weights are divided among the four processor cores 306. Therefore, as long as a split in a certain dimension can be handled by the computing device 201, it falls within the scope of this disclosure.
  • the processing device 203 may sequentially perform splitting with a specific granularity among the N, H, and W dimensions, and the specific granularity may be a fixed or variable ratio, or represented by a function.
  • The processing device 203 splits the feature map or weights from large to small. Taking the feature map as an example, the feature map of dimensions NHWC is first split in the N dimension into a feature map of N1HWC and a feature map of N2HWC, where the specific granularity is a fixed ratio and N1 and N2 are each one half of N.
  • If this is not small enough, the processing device 203 continues to split the N1HWC feature map in the H dimension into a feature map of N1H1WC and a feature map of N1H2WC, where H1 and H2 are each one half of H. If it is still not small enough, the processing device 203 continues to split the N1H1WC feature map in the W dimension into a feature map of N1H1W1C and a feature map of N1H1W2C, where W1 and W2 are each one half of W.
  • If it is still not small enough, the processing device 203 may continue to perform smaller-granularity splits in the N, W, and H dimensions, such as one quarter, one eighth, or one sixteenth, until the feature map is small enough to become an on-chip unit map that can be loaded into the SRAM 308 in one pass.
  • Alternatively, the processing device 203 may keep splitting along one dimension and only select another dimension to continue splitting once the first can no longer be split. For example, it keeps splitting along the H dimension; if the smallest unit still cannot be loaded into the SRAM 308, it then splits along the W dimension until the smallest unit is reached.
  • the size of the required storage space is usually similar to the available space of the SRAM 308.
  • In the large image mode, the DRAM 204 can only transmit one split feature map to the SRAM 308 at a time, whereas in the small image mode the SRAM 308 may be able to load multiple feature maps from the DRAM 204 at one time.
  • Alternatively, the processing device 203 may split from small to large; the specific granularity can likewise be a fixed or variable ratio, or be represented by a function.
  • For example, the N dimension is first split with the smallest unit as the specific granularity, that is, 1×H×W×C. If the SRAM 308 can load it, the processing device 203 enlarges the split, for example to 2×H×W×C. If it can still be loaded, it continues to enlarge until n×H×W×C cannot be loaded, in which case the size of the on-chip unit map is (n−1)×H×W×C.
  • If that still does not fit, the processing device 203 continues to split along another dimension, for example the H dimension: it evaluates 1×1×W×C, and if that is small enough it increases along the H dimension until the required storage space of 1×(h−1)×W×C is just close to, but not larger than, the available space of the SRAM 308. If the available space of the SRAM 308 is still exceeded, the processing device 203 continues to split along yet another dimension, for example the W dimension. In this sequential manner, the best input data that can be loaded into the SRAM 308 in one pass is found; "best" here means that the storage space required by the on-chip unit map is closest to, but not larger than, the available space of the SRAM 308.
  • After the processing device 203 splits the feature map, the flow returns to step 1001, where the processing device 203 determines whether the storage space required for the split feature map is still larger than the available space of the SRAM 308; if so, step 1002 is executed again and the splitting continues.
  • If not, step 1003 is executed, and the processing device 203 sets the split feature map as the on-chip unit map.
  • Next, step 1004 is executed, and the processing device 203 determines the template fusion unit according to the size of the on-chip unit map. This step is explained in detail later.
  • the processing device 203 after the processing device 203 repeatedly executes steps 1001 and 1002 multiple times, it means that the required storage space for the split feature map is getting closer and closer to the available space of the SRAM 308.
  • For example, assume the storage space required for the feature map is 100k and the available space of the SRAM 308 is 40k.
  • In step 1001, the processing device 203 determines that the storage space required for the feature map is greater than the available space of the SRAM 308, so step 1002 is executed and the feature map is split in half along the N dimension; the split feature map is now 50k. Returning to step 1001, the storage space required for the split feature map is still larger than the available space of the SRAM 308, so step 1002 is executed again and the map is split in half along the N dimension once more; the split feature map is now 25k. Returning to step 1001, the storage space required for the split feature map is now less than the available space of the SRAM 308, so step 1003 is executed, and the processing device 203 sets the split feature map (25k in size) as the on-chip unit map.
  • In this example, the available space of the SRAM 308 is 40k while the storage space required by the on-chip unit map is only 25k, leaving 15k unused; this indicates that the split granularity is too large.
  • the specific granularity of the split can be gradually reduced with the number of splits, so that the required storage space of the split on-chip cell map is as close as possible to the available space of the SRAM 308. For example, a specific granularity can be set to one-half at first, three-quarters next, and four-fifths at the end.
  • Continuing the example above: in step 1001 the processing device 203 determines that the storage space required for the feature map (100k) is greater than the available space of the SRAM 308, so step 1002 is executed with the specific granularity set to one half, giving a split feature map of 50k; returning to step 1001, the required storage space is still larger than the available space of the SRAM 308, so step 1002 is executed again with the specific granularity adjusted to three quarters, giving a split feature map of 37.5k; returning to step 1001, the required storage space is now less than the available space of the SRAM 308, so step 1003 is executed and the processing device 203 sets the split feature map (37.5k in size) as the on-chip unit map. Since 37.5k is closer to 40k than 25k is, this latter approach makes fuller use of the available space of the SRAM 308 and is more efficient.
  • This embodiment does not limit the size of the specific granularity.
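  • A minimal sketch of steps 1001 to 1003 under the assumptions above (sizes are treated as plain numbers, and the shrinking granularity follows the one-half / three-quarters / four-fifths example; this is an added illustration, not part of the disclosure):

```python
def split_until_fits(feature_map_size: float, sram_available: float,
                     ratios=(1 / 2, 3 / 4, 4 / 5)) -> float:
    size = feature_map_size
    step = 0
    while size > sram_available:                      # step 1001: does it fit?
        ratio = ratios[min(step, len(ratios) - 1)]    # step 1002: ever finer granularity
        size *= ratio
        step += 1
    return size                                       # step 1003: the on-chip unit map

print(split_until_fits(100.0, 40.0))  # 100k -> 50k -> 37.5k, which fits within 40k
```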
  • step 1004 is executed. This step is to dynamically fuse the neural network according to the fusion strategy.
  • FIG. 11 shows a method for dynamically merging the neural network according to the fusion strategy in this embodiment.
  • In step 1101, the starting layer of the template fusion unit is selected according to the starting rule of the fusion strategy.
  • the processing device 203 selects the start layer of the template fusion unit according to the start rule of the fusion strategy, that is, selects the layer to start fusion among the layers that have not been fused in the neural network.
  • The starting rule may be that the starting layer is the frontmost unfused layer in the neural network, and the processing device 203 will search for that layer.
  • Taking the AlexNet neural network model of FIG. 6 as an example, there are 23 layers in total. Assuming that the first to fifth layers have been fused, when the starting rule is that the starting layer is the frontmost unfused layer in the neural network, the processing device 203 will select the ReLU activation layer of the 6th layer as the starting layer and fuse backward (that is, fuse in the direction of the 7th layer). It should be noted that under this starting rule the starting layer is not necessarily a convolutional or pooling layer.
  • In another scenario, the starting rule is that the starting layer is the frontmost convolutional or pooling layer that has not been fused; the processing device 203 first finds all the convolutional and pooling layers among the unfused layers in the neural network model and fuses backward starting from the frontmost of them. Again taking the AlexNet neural network model of FIG. 6 as an example, the processing device 203 finds that the convolutional and pooling layers among the unfused layers are the 11th, 13th, and 15th layers, and then starts the fusion from the frontmost unfused convolutional or pooling layer, that is, the starting layer is the 11th layer.
  • In step 1102, fusion is performed on the basis of the starting layer, and all the rules of the fusion strategy are checked one by one to establish the template fusion unit.
  • the processing device 203 performs fusion based on the starting layer, and checks all the rules of the fusion strategy one by one, so as to establish a template fusion unit.
  • When all the rules are satisfied, the hardware resources of the computing device 201 are sufficient to support loading the data required by the template fusion unit in one pass, and the neural network calculation can then be performed according to the template fusion unit.
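  • As a condensed illustration (an assumption about control flow, not a verbatim description of the disclosure), steps 1101 and 1102 can be sketched as follows, with the fusion-strategy rules modelled as a list of predicates over the candidate unit:

```python
def build_template_fusion_unit(layers, already_fused, rules):
    start = next(l for l in layers if l not in already_fused)   # step 1101: frontmost unfused layer
    unit = [start]
    for layer in layers[layers.index(start) + 1:]:              # step 1102: fuse backward
        candidate = unit + [layer]
        if all(rule(candidate) for rule in rules):
            unit = candidate
        else:
            break
    return unit

# Toy usage: the only rule is "at most three layers per unit".
print(build_template_fusion_unit(list("ABCDEF"), set("AB"),
                                 [lambda u: len(u) <= 3]))      # ['C', 'D', 'E']
```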
  • the fusion strategy can exemplarily include the following rules:
  • The so-called forward fusion refers to fusing forward from the starting layer, in the direction opposite to the neural network inference order. Taking FIG. 6 as an example, the fusion proceeds in the direction third layer → second layer → first layer.
  • This rule is usually paired with the aforementioned starting rule that the starting layer is the first unfused convolution or pooling layer, because there may be unfused layers before the convolution or pooling layer.
  • Under this rule, the processing device 203 preferentially fuses forward and tries to incorporate the layers that have not been fused before the starting layer into the template fusion unit. Again taking the AlexNet neural network model of FIG. 6 as an example, the processing device 203 finds that the frontmost unfused convolutional or pooling layer is the fifth layer, so the starting layer is the fifth layer; the fourth and third layers are fused forward first, and if fusion can continue, the sixth and seventh layers are then fused backward.
  • When the neural network model has a block structure, this rule requires the processing device 203 to preferentially add layers to or delete layers from the template fusion unit in units of block structures rather than in units of individual layers when performing fusion.
  • the processing device 203 will prioritize the sub-network 701 or the sub-network 702 for fusion.
  • When the neural network model has a long-chain structure, layers are directly added to or deleted from the template fusion unit in units of individual layers, since this rule does not apply to neural network models with a long-chain structure.
  • the fusion strategy of this embodiment does not support that the template fusion unit is a multi-output network.
  • The reason is that the shape derivation implemented inside the template fusion unit mainly takes the form of deriving backward from the output toward the input; with multiple outputs, the derivation results do not necessarily converge to the same feature map, so the derivation cannot converge.
  • FIG. 7 shows two ways of fusing the sub-network 701: the first is to fuse the first to fifth layers into a template fusion unit 703, and the second is to fuse the first to sixth layers into a template fusion unit 704. Since the outputs of the third layer and the fifth layer are both outputs of the template fusion unit 703, the template fusion unit 703 is a multi-output network, that is, it has multi-branch output.
  • the output of the sixth layer is the output of the template fusion unit 704, and only one output data is generated, so the template fusion unit 704 belongs to a single-output network, that is, a single-branch output.
  • The processing device 203 determines whether the output of the template fusion unit is a single-branch output; if this rule is not satisfied, the processing device 203 adds or deletes layers in the template fusion unit until the rule is satisfied.
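  • A minimal sketch of this single-branch-output check (the successor mapping and layer numbering follow the example of FIG. 7; the helper itself is an assumption added here, not taken from the disclosure):

```python
def is_single_output(fused: set, successors: dict) -> bool:
    # A fused layer counts as an output if it feeds a layer outside the unit
    # or is a network output (empty successor list).
    outputs = [l for l in fused
               if not successors.get(l) or any(s not in fused for s in successors[l])]
    return len(outputs) == 1

# Sub-network 701 of FIG. 7: 1 -> (2, 4), 2 -> 3, 3 -> 6, 4 -> 5, 5 -> 6, 6 -> output
succ = {1: [2, 4], 2: [3], 3: [6], 4: [5], 5: [6], 6: []}
print(is_single_output({1, 2, 3, 4, 5}, succ))     # False: layers 3 and 5 both exit unit 703
print(is_single_output({1, 2, 3, 4, 5, 6}, succ))  # True: only layer 6 exits unit 704
```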
  • The processing device 203 evaluates whether the operations of the layers to be fused are complex enough that the fusion produces a benefit.
  • the main layer refers to a layer that consumes a lot of input/output resources such as matrix multiplication, pooling or convolution.
  • The pooling here includes various types of pooling, such as max pooling (maxpool) and average pooling (avgpool), and the convolution likewise includes various types of convolution, such as ordinary convolution, convolution with mean, and depthwise convolution (depthwise conv), etc.
  • This rule is that the template fusion unit includes at least 2 main layers.
  • Rule 6 Include a continuous structure in which the main layer, the main layer, and the non-main layer are adjacent in turn
  • the template fusion unit needs to include a continuous structure of the main layer, the main layer and the non-main layer, that is, the continuous structure in which the main layer, the main layer and the non-main layer are adjacent in sequence.
  • Such operations are complex enough for fusion to be beneficial.
  • When the processing device 203 determines that this rule is not satisfied, the processing device 203 adjusts the template fusion unit until the rule is satisfied.
  • This rule is a continuous structure in which the template fusion unit includes a scalar computing layer and a vector computing layer, that is, a continuous structure in which the scalar computing layer and the vector computing layer are adjacent in sequence.
  • the scalar calculation layer refers to an addition layer, a subtraction layer or a multiplication layer
  • the vector calculation layer refers to an activation layer, a batch normalization layer or a scaling layer.
  • This rule is that the weight of the convolutional layer in the template fusion unit is not the output of any layer of the neural network, regardless of whether the layer is included in the template fusion unit or not.
  • When the processing device 203 determines that this rule is not satisfied, that is, the weight of a convolutional layer in the template fusion unit is the output of some layer, the processing device 203 removes that convolutional layer (i.e., the corresponding convolution operator) from the template fusion unit.
  • the large image mode has fewer restrictions on the WRAM 432, because the on-chip cell map loaded into the SRAM 308 is only a part of the feature map.
  • In that case, the WRAM 432 only needs to store all the weights.
  • the small image mode may load multiple feature maps into the SRAM 308, in this case, the required weights will increase, and it is necessary to carefully evaluate whether the available space of the WRAM 432 is sufficient.
  • This rule is that the storage space required for the weights in the on-chip unit map is not greater than the available space of the WRAM 432.
  • When the processing device 203 determines that this rule is not satisfied, the processing device 203 reduces the size of the on-chip unit map.
  • Here W_j is the storage space required for the weights involved in the on-chip unit map j, n is the number of processor cores in the cluster, and W is the available space of the WRAM 432.
  • The redundancy percentage is the ratio of the sum of the redundancy generated by input-dependent operations and output-dependent operations to the normal input/output amount of the template fusion unit, i.e., the amount of redundant data.
  • The processing device 203 calculates, for the template fusion unit after fusing the current layer, the percentage relationship between the memory access amount size_TFU of the on-chip unit map moved from the DRAM 204 to the SRAM 308 and the normal (redundancy-free) input/output amount size_ori, where the memory access amount size_TFU is the theoretical memory access amount size_ori plus the sum of the redundancy, that is: size_TFU = size_ori + (sum of redundancy).
  • The processing device 203 takes the split information and shape derivation of the template fusion unit into account and sets the percentage threshold to 50%, 75%, 100%, 125%, or 150%, preferably 100%. Taking a percentage threshold of 100% as an example, this means that when the sum of the redundancy is greater than twice the normal input/output amount of the template fusion unit, the fusion is not performed. This rule is that the sum of the redundancy generated by splitting the on-chip unit map does not exceed a specific ratio related to the percentage threshold; once it is exceeded, there are too many redundant parts and many resources would be spent computing the redundancy, degrading performance. Therefore, when the processing device 203 determines that this rule is not satisfied, the processing device 203 stops the fusion.
  • In the small image mode, since at least one complete feature map is loaded from the DRAM 204 to the SRAM 308 at a time, there is no redundancy, so this rule does not apply to the small image mode.
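  • A hedged sketch of this redundancy rule (an added illustration; the exact comparison is stated above only in prose, so the factor of 2 below simply mirrors the 100%-threshold example):

```python
def redundancy_allows_fusion(size_ori: float, redundancy_sum: float,
                             max_redundancy_ratio: float = 2.0) -> bool:
    size_tfu = size_ori + redundancy_sum          # total DRAM -> SRAM memory access
    return size_tfu - size_ori <= max_redundancy_ratio * size_ori

print(redundancy_allows_fusion(100.0, 150.0))  # True: redundancy is 1.5x the normal amount
print(redundancy_allows_fusion(100.0, 250.0))  # False: fusion stops here
```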
  • If the storage space of the on-chip unit map (IN) and the storage space of the calculation result (OUT) cannot be reused, this rule is that their sum must not be greater than the available space of the SRAM 308; if the storage space of IN and OUT can be reused, this rule is that the larger of the two must not be greater than the available space of the SRAM 308.
  • When the processing device 203 determines that this rule is not satisfied, the processing device 203 reduces the number of on-chip unit maps until the rule is satisfied.
  • the weights involved in the convolution operation in the template fusion unit are carried independently and reside on the WRAM 432 .
  • Considering the pipelining between subgraphs, the WRAM 432 stores the weights of two adjacent subgraphs at the same time. Assuming that the storage space required by the weights of each subgraph i is W_i and the total space of the WRAM 432 is W, this rule is that the space of the WRAM 432 must satisfy W_i + W_(i+1) ≤ W for every pair of adjacent subgraphs.
  • When the processing device 203 determines that this rule is not satisfied, the processing device 203 reduces the number of on-chip unit maps until the rule is satisfied.
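  • A small sketch of the WRAM residency condition, under the stated assumption that the weights of every two adjacent subgraphs must fit in the WRAM at the same time so that the pipeline between subgraphs can flow:

```python
def wram_can_hold(subgraph_weight_sizes: list, wram_total: float) -> bool:
    pairs = zip(subgraph_weight_sizes, subgraph_weight_sizes[1:])
    return all(w_i + w_next <= wram_total for w_i, w_next in pairs)

print(wram_can_hold([30, 40, 35], wram_total=80))  # True: 30+40 and 40+35 both fit
print(wram_can_hold([50, 45, 20], wram_total=80))  # False: 50+45 exceeds the WRAM
```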
  • This rule is that the storage space required by the subgraph is not larger than the available space of the NRAM 431.
  • the processing device 203 can perform fine-grained splitting in the N, H, and W dimensions. If there is insufficient space in NRAM 431, processing device 203 will split the on-chip cell map finer until this rule is satisfied.
  • In general, the NRAM 431 has a reasonable amount of available space, so the on-chip unit map can be split to a reasonable extent and loaded in one pass; from the perspective of the fusion strategy, the template fusion unit is not affected by the number of batches. However, the more finely the on-chip unit map is split (that is, the more subgraphs there are), the slower the processing becomes, so the processing device 203 needs to evaluate the space of the NRAM 431.
  • the space of the SRAM 308 corresponds to the number of NRAMs 431 of the processor cores 306 in the cluster 305.
  • For example, if the cluster 305 includes 4 processor cores 306, the space of the SRAM 308 is 4 times the space of the NRAM 431.
  • That is, the on-chip unit map in the large image mode can generally be allocated to the four processor cores 306 for processing; this architectural design already ensures that the data loaded into the SRAM 308 can be allocated to all the NRAMs 431 at one time, so this rule does not need to be considered in the large image mode.
  • the on-chip cell map may include multiple feature maps.
  • the processing device 203 will calculate an appropriate number of fusion layers according to the number of feature maps in the on-chip unit map, so as to maximize the benefit.
  • This rule is that the number of feature maps in the on-chip unit map is not greater than the feature map threshold.
• When the processing device 203 determines that this rule is not satisfied, it reduces the number of feature maps in the on-chip data until the rule is satisfied.
• Stride redundancy refers to the following: when the template fusion unit has many fused layers and the length and width of the convolution and pooling kernels are larger than the stride, the input data required by neighbouring output points overlap (this is the aforementioned input-dependent operation), and the overlapping portion is the stride redundancy. Stride redundancy forces each processor core 306 to read more data, and this multiplexed data occupies on-chip and off-chip access resources; the more layers the template fusion unit includes, the more severe the stride redundancy becomes. This rule is that the sum, over the convolution and pooling layers, of the difference between the kernel edge length and the stride must not be greater than the redundancy threshold.
• the redundancy threshold is defined as follows: assuming that the kernel length and width of the convolution and pooling layers are kx and ky, and the strides in the length and width directions are sx and sy respectively, the stride redundancy in the length direction is the sum of (kx - sx) over all convolution and pooling layers in the template fusion unit; similarly, the stride redundancy in the width direction is the sum of (ky - sy) over all convolution and pooling layers in the template fusion unit.
• the redundancy threshold in this embodiment can be 3, 4, 5 or 6, preferably 4. This rule is not satisfied as soon as the stride redundancy in either the length or the width direction is greater than the redundancy threshold.
• When this rule is not satisfied, the processing device 203 adjusts the template fusion unit, usually by reducing the number of layers to be fused, until the rule is satisfied.
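• The stride-redundancy check can be sketched as below; the tuple layout of the layer parameters and the default threshold of 4 come from the description above, while everything else is an illustrative assumption.

```python
def stride_redundancy_ok(conv_pool_layers, redundancy_threshold=4):
    # Sketch of the stride-redundancy rule: `conv_pool_layers` lists the
    # convolution and pooling layers of the template fusion unit as
    # (kx, ky, sx, sy) tuples.  The redundancy in each direction is the sum of
    # kernel size minus stride over these layers; the rule fails if either
    # direction exceeds the threshold.
    red_x = sum(kx - sx for kx, _, sx, _ in conv_pool_layers)
    red_y = sum(ky - sy for _, ky, _, sy in conv_pool_layers)
    return red_x <= redundancy_threshold and red_y <= redundancy_threshold
```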
• In step 1103, the neural network calculation is performed according to the established template fusion unit.
  • the computing device 201 is based on the three-level operation level of the system-on-chip-cluster-processor core, and is matched with a three-level memory design such as DRAM-SRAM-NRAM/WRAM.
• the data required to calculate the template fusion unit is loaded from the DRAM 204 to the SRAM 308, so that the data can be cached and calculated at an appropriate level and a sufficient pipeline is formed.
  • the calculation results are sent from the SRAM 308 to the DRAM 204.
  • the present disclosure is based on the template fusion unit, which can reduce the input/output overhead in neural network computing.
  • Another embodiment of the present disclosure is a method of performing neural network computations using a template fusion unit.
  • Fig. 12 shows its flow.
  • a template fusion unit is determined according to the fusion strategy.
  • the processing device 203 selects the start layer of the template fusion unit according to the start rule of the fusion strategy; performs fusion based on the start layer, and checks all the rules of the fusion strategy one by one to establish a template fusion unit.
  • Various rules of the fusion policy have been described in detail in the previous embodiment, and will not be repeated here.
  • Each basic block contains at least one instruction, and the instructions in the basic block may use pointers to specific on-chip memory spaces.
  • a pointer is a variable that holds the address of a specific address space. Through the pointer, the processor core 306 can load data into the space of the specific address pointed to by the pointer, or fetch data from the specific address pointed to by the pointer.
  • the compiler initially divides the basic blocks, and after iterative operations, confirms the basic blocks and their interrelationships, and thus completes the target code for implementing the template fusion unit.
• the compiler also analyzes the data reused by two adjacent template fusion units in the neural network, determines how much data from the previous template fusion unit can be left on chip for the next template fusion unit, and plans the storage address of each piece of data accordingly.
  • the compiler completes the deduction of the address in the control flow graph.
• In step 1204, on-chip storage space is allocated.
  • the processing device 203 allocates the physical space of the SRAM 308, the NRAM 431 and the WRAM 432 based on the derivation of the template fusion unit address.
  • the compiler completes the pointing of the pointer in the control flow graph.
  • step 1205 is performed to generate executable instructions.
  • the linker links the object code generated by the compiler and the library to make it an executable file.
  • object code is a program module that includes machine code and information available to the linker.
• the linker's job is to resolve undefined symbol references, replace the placeholders in the object code with the addresses of the symbols, and generate executable instructions.
  • the executable instructions can be directly executed by the computing device 201 to complete the computation of the neural network.
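• The overall flow of steps 1201 to 1205 can be summarised in a hypothetical Python sketch; the compiler and linker method names below are placeholders for illustration only and are not the patent's API.

```python
def build_executable(network, fusion_strategy, compiler, linker):
    # Hypothetical orchestration of the five steps of Fig. 12; `compiler` and
    # `linker` are assumed objects exposing one method per step described in
    # the text.
    tfu = fusion_strategy.determine_template_fusion_unit(network)   # step 1201
    shapes = compiler.derive_shapes(tfu)                            # step 1202
    cfg = compiler.derive_addresses(tfu, shapes)                    # step 1203: address deduction in the control flow graph
    compiler.allocate_on_chip_storage(cfg)                          # step 1204: bind SRAM/NRAM/WRAM space
    return linker.link(compiler.emit_object_code(cfg))              # step 1205: executable instructions
```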
• the present disclosure dynamically determines the template fusion unit by setting the fusion strategy, fuses multiple layers in the neural network to form a new custom layer, and loads the data required for the calculation of the template fusion unit at one time to reduce input/output overhead.
  • the next convolution or pooling layer is the 8th layer, in other words, the 6th and 7th layers may not be merged, which affects the overall benefit.
• Another embodiment of the present disclosure is a scheme for fusing a neural network in which the starting layer is a layer other than a convolutional layer or a pooling layer, that is, a non-convolutional, non-pooling layer.
  • This embodiment is also implemented based on the framework of FIGS. 1 to 4 .
  • This embodiment also executes the flowchart shown in FIG. 11 .
• this step does not use the starting rule that the starting layer is the frontmost convolution or pooling layer that has not yet been fused. If the starting layer were selected according to that rule, it would have to be a convolutional or pooling layer, and the advantage of this embodiment, namely not being limited by the location and number of convolutional or pooling layers in the neural network model, would not exist.
• the starting layer can be an element-wise layer, also known as an element-by-element layer, which operates on each element of a vector.
• the shapes of the input data and the output data of such operations are consistent.
• the element-wise layer includes the following categories:
• activation functions: sigmoid, tanh, ReLU, etc.
• the starting layer may be an add-padding layer.
• the purpose of adding padding is to avoid discarding the original image information and to keep the size of the input data consistent with the original image; elements of blank information are added around the input data.
• the starting layer can also be a custom layer, that is, a custom layer can be selected as the starting layer.
• the starting rule of the fusion strategy in this embodiment causes the processing device 203 to further determine whether the neural network includes a block structure. If it does not, the neural network has a long-chain structure, and the processing device 203 selects the frontmost unfused layer in the neural network according to the aforementioned starting rule; if it does, since a block structure is fused as a unit, the processing device 203 then determines whether the frontmost layer in the block structure is a layer other than a convolutional layer or a pooling layer, and if so, it takes that frontmost layer as the starting layer.
• the processing device 203 determines whether the frontmost layer (i.e., the eighth layer) of the sub-network 1302 is a layer other than a convolutional layer or a pooling layer. If so, the eighth layer is directly selected as the starting layer for fusion; if the eighth layer is a convolutional or pooling layer, the processing device 203 can either still select the eighth layer as the starting layer or search forward for the closest suitable layer.
• the frontmost layer other than a convolutional layer or a pooling layer is the starting layer
  • the previous layer closest to the eighth layer is the seventh layer
• the seventh layer has not been fused; assuming the seventh layer is neither a convolutional layer nor a pooling layer, the processing device 203 selects the seventh layer as the starting layer. If the seventh layer is also a convolutional or pooling layer, this embodiment may select either the seventh layer or the eighth layer as the starting layer.
• the processing device 203 searches backwards from the frontmost layer and selects the closest layer that is neither a convolutional layer nor a pooling layer.
• if the frontmost layer (i.e., the eighth layer) is taken as the starting layer, the entire block structure cannot be incorporated into the template fusion unit; since the fusion effect of using the eighth layer as the starting layer is not ideal, the processing device 203 may directly select the seventh layer as the starting layer.
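• The starting-layer selection discussed above can be sketched as follows; the layer representation and helper names are illustrative assumptions.

```python
def pick_starting_layer(layers, block_front_index=None):
    # Hedged sketch of the starting rule discussed above.  `layers` is an
    # ordered list of dicts such as {"type": "conv", "fused": False};
    # `block_front_index` is the index of the frontmost layer of the block
    # structure, or None for a long-chain network.
    def is_conv_or_pool(layer):
        return layer["type"] in ("conv", "pool")

    if block_front_index is None:
        # Long-chain network: take the frontmost unfused layer.
        return next((i for i, l in enumerate(layers) if not l["fused"]), None)

    front = layers[block_front_index]
    if not is_conv_or_pool(front):
        return block_front_index            # use the block's frontmost layer directly
    # Otherwise search towards the network input for the closest unfused layer
    # that is neither convolution nor pooling; fall back to the block's
    # frontmost layer if none is found.
    for i in range(block_front_index - 1, -1, -1):
        if layers[i]["fused"]:
            break
        if not is_conv_or_pool(layers[i]):
            return i
    return block_front_index
```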
  • step 1102 is then executed to establish a template fusion unit based on the starting layer.
  • the processing device 203 may establish a template fusion unit according to the rules (rules 1 to 19) exemplified in the foregoing embodiments. These rules are only examples, and this embodiment does not limit the order in which the rules are executed, nor does it limit the requirements of these rules. At the same time, it is considered that those skilled in the art can add or delete rules according to the actual situation in different application scenarios, so as to realize the fusion strategy conforming to the current application scenarios.
  • Steps 1101 and 1102 correspond to the step 1201 of determining the template fusion unit according to the fusion strategy.
  • the compiler deduces the shape of the template fusion unit (step 1202 ), deduces the address (step 1203 ), allocates on-chip storage space (step 1204 ), and finally generates executable instructions by the linker (step 1205 ).
  • the starting layer of this embodiment can be a layer other than convolution and pooling. Such starting rules make the establishment of the template fusion unit more flexible, and the starting layer can be appropriately selected to start fusion for different neural networks. It is not limited by the position and number of convolutional layers or pooling layers in the neural network model, and thus adapts to various network models, making the fusion more comprehensive and improving the overall efficiency.
• Figure 14 shows a schematic diagram of such a layer. As can be seen from Figure 14, the output feature map is larger than the input feature map, so the input/output feature maps take the form of a positive pyramid; such a layer is referred to as a positive pyramid layer in this disclosure, while the layers in Figure 8, whose input feature map is larger than the output feature map, are called inverted pyramid layers.
• positive pyramid layers include deconvolution layers, unpooling layers, and upsampling layers.
• Deconvolution, also known as transposed convolution or hole convolution, is not a complete inverse of forward convolution.
• Deconvolution is a special forward convolution whose parameters participate in the calculation and are learned through training. Deconvolution first expands the size of the input image by zero-filling according to a certain proportion, then rotates the convolution kernel, and then performs forward convolution.
• corresponding to the pooling operation, the unpooling operation is divided into the unpooling of max pooling and the unpooling of average pooling.
• the unpooling of max pooling retains the position information of the maximum values and fills the remaining positions with 0. As shown in Figure 15A, the max pooling layer 1501 receives the input feature map 1502 and generates the output feature map 1503; the figure also shows the unpooling layer 1504 for max pooling, through which the feature map 1503 passes to generate the output feature map 1505, whose size is larger than that of the feature map 1503.
• the unpooling of average pooling fills the average value into the corresponding positions of the corresponding original data area. As shown in Figure 15B, the average pooling layer 1506 receives the input feature map 1507 and generates the output feature map 1508; the figure also shows the unpooling layer 1509 for average pooling, through which the feature map 1508 passes to generate the output feature map 1510, whose size is larger than that of the feature map 1508.
• Upsampling directly expands the feature map according to the kernel over the corresponding original data area.
  • Figure 16 shows a schematic diagram of upsampling.
  • the input feature map 1601 passes through the max pooling layer (not shown) to generate an intermediate feature map 1602, and the intermediate feature map 1602 is expanded by the kernel 1603 of the upsampling layer (not shown).
  • the output feature map 1604 is obtained, and the size of the output feature map 1604 is larger than the size of the intermediate feature map 1602 .
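• For illustration, a minimal NumPy sketch of upsampling and max-unpooling is given below; the array layout and the way the recorded maximum positions are passed in are assumptions, not the patent's implementation.

```python
import numpy as np

def upsample(feature_map, kernel=(2, 2)):
    # Sketch of the upsampling layer of Fig. 16: every element of the input is
    # expanded over a kernel-sized block, so the output is larger than the input.
    kh, kw = kernel
    return np.repeat(np.repeat(feature_map, kh, axis=0), kw, axis=1)

def max_unpool(pooled, max_positions, output_shape):
    # Sketch of the max-unpooling layer of Fig. 15A: the pooled maxima are put
    # back at the positions recorded during max pooling and every other
    # position is filled with 0.  `max_positions` is assumed to be a pair of
    # index arrays (rows, cols) saved by the pooling layer.
    out = np.zeros(output_shape, dtype=pooled.dtype)
    out[max_positions] = pooled.ravel()
    return out
```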
  • Fig. 17 shows a flowchart of this embodiment integrating the neural network shown in Fig. 18.
• Fig. 18 is an exemplary neural network model with 14 layers in total, wherein the first segment 1801 includes layers 1 to 4, which are inverted pyramid layers; the second segment 1802 includes layers 5 to 9, which are positive pyramid layers; and the third segment 1803 includes layers 10 to 14, which are inverted pyramid layers.
  • a template fusion unit is established according to the fusion strategy.
  • the processing device 203 first selects the starting layer of the template fusion unit according to the starting rule of the fusion strategy.
  • the starting rule may be that the first unfused layer is the starting layer of the template fusion unit.
  • the fourth layer is the starting layer of the template fusion unit, and the fourth layer is fused backwards, and all the rules of the fusion strategy are checked one by one to establish a template fusion unit.
• the fifth layer, which is a positive pyramid layer, is also fused in, and if the fusion can continue, the processing device 203 continues to fuse backwards.
  • the starting rule may be that the foremost positive pyramid layer among all unfused layers is the starting layer of the template fusion unit. It is also assumed that the 1st to 3rd layers have been fused, then the 5th layer is the first positive pyramid layer, so the 5th layer is the starting layer of this template fusion unit, which is fused backwards.
• This embodiment does not limit the way positive pyramid layers and inverted pyramid layers are fused. All positive pyramid layers can be fused together, for example the template fusion unit includes layers 5 to 9; positive and inverted pyramid layers can also be mixed together, for example the template fusion unit includes layers 3 to 6, or layers 9 to 12, and so on.
  • the template fusion unit may include only the positive pyramid layer, or may include the inverted pyramid layer plus the positive pyramid layer, or the positive pyramid layer and the inverted pyramid layer.
• a positive pyramid layer and an inverted pyramid layer can be adjacent in the template fusion unit, such as the 4th and 5th layers, or the 9th and 10th layers.
  • the rule of the fusion strategy in this embodiment is to add or delete template fusion units in units of the block structure.
  • the main layers in this embodiment are defined as matrix multiplication, pooling, convolution, deconvolution, up-pooling and up-sampling layers.
• one rule of the fusion strategy is that the template fusion unit includes at least two main layers; when the processing device 203 judges that the rule is not satisfied, it adjusts the template fusion unit until the rule is satisfied.
• This embodiment may also include another rule of the fusion strategy: the template fusion unit includes a continuous structure of main layer + main layer + non-main layer; when the processing device 203 determines that the rule is not satisfied, it adjusts the template fusion unit until the rule is satisfied.
• In step 1702, the shape of the template fusion unit is deduced.
• In step 1703, the address is deduced.
• In step 1704, on-chip storage space is allocated.
• In step 1705, executable instructions are generated.
  • step 1706 is performed, and the neural network calculation is performed according to the template fusion unit.
  • the computing device 201 executes the aforementioned executable instructions to perform neural network computations according to the template fusion unit.
  • This embodiment can fuse the positive pyramid layer and the inverted pyramid layer.
• Such a fusion strategy makes the establishment of the template fusion unit more flexible and is not limited by the relative sizes of the input and output feature maps, thereby adapting to various network models, making the fusion more comprehensive and improving the overall efficiency.
  • the computing device 201 can reason about the neural network in units of template fusion units according to the executable instructions.
• Another embodiment of the present disclosure is a solution for computing a neural network based on executable instructions. This solution also uses the architecture shown in FIGS. 1 to 4 to compute the graph of the template fusion unit, implementing the process shown in FIG. 19.
• In step 1901, the feature map of the neural network is stored.
  • the processing device 203 fuses the multiple layers of the neural network according to the fusion strategy to generate a template fusion unit, and appropriately splits the feature map into an on-chip unit map based on each rule.
• when the processing device 203 determines the template fusion unit according to the fusion strategy in step 1201 of FIG. 12 and judges that the feature map is larger than the available space of the SRAM 308, that is, in the large image mode, it is necessary to split the feature map so that it can be loaded into the SRAM 308 over multiple passes.
  • the splitting method may be split with a specific granularity in at least one of the N, H, and W dimensions. In this embodiment, the specific granularity may be, but not limited to, half.
  • the on-chip cell map may include single or multiple feature maps, depending on how many feature maps can be loaded in the available space of the SRAM 308.
  • the technical details of converting the feature map into the on-chip unit map have been described with respect to the large image mode and the small image mode, and will not be repeated.
  • the feature maps to be calculated by the neural network are all stored in the DRAM 204.
• The on-chip cell map is then loaded. Since the executable instructions compute the neural network based on the template fusion unit, when the computing device 201 executes the executable instructions, the neural network calculation is performed according to the template fusion unit rather than layer by layer according to each layer of the neural network.
• the executable instructions carry information on how the feature map is split into on-chip cell maps, that is, they contain the address information of the on-chip cell map, and the SRAM 308 loads the on-chip cell map from the appropriate address of the DRAM 204 through the GDMA 311 according to this address information.
• In step 1903, the subgraph is loaded.
• the NRAM 431 loads subgraphs through the MVDMA 434.
• the on-chip unit graph is split into 4 subgraphs: the on-chip unit graph is divided, within the cluster 305, with a specific granularity in at least one of the N, H, and W dimensions into 4 subgraphs, which are sent to the NRAM 431 of each processor core 306 through the MVDMA 434, respectively.
  • the specific granularity may be, but is not limited to, one-half.
• In step 1904, the subgraphs are computed and corresponding intermediate results are generated.
• the arithmetic module 42 of each processor core 306 fetches its subgraph from the NRAM 431 for calculation and generates an intermediate result, which is then stored back in the NRAM 431. It should be noted that since the subgraphs allocated to the processor cores 306 belong to different parts of the on-chip unit map, each intermediate result reflects only a part of the calculation result.
• In step 1905, the intermediate results are reduced to produce a calculation result corresponding to the on-chip cell map.
  • Reduction refers to combining intermediate results into a calculation result, which is the aforementioned output-dependent operation.
  • the broadcast bus 309 transmits the intermediate result of each processor core 306 to the next processor core 306, and the processor core 306 calculates the intermediate result of the previous processor core 306 and the stored corresponding intermediate result to generate the calculation result .
  • the reduction can be implemented in many ways. The following uses ring allreduce as an example to illustrate how to perform the reduction in this embodiment.
• FIG. 20 shows the ring allreduce framework.
• the ring allreduce framework 2000 exemplarily shows four clusters in a computing device 201: a first cluster 2001, a second cluster 2002, a third cluster 2003 and a fourth cluster 2004, and each cluster includes four processor cores.
• the ring allreduce framework 2000 organizes these clusters into a logical ring. Each cluster connects only to the previous cluster and the next cluster, and receives and sends data in the same direction. As shown by the arrows in Figure 20, each cluster receives data from the previous cluster in a clockwise direction and sends data to the next cluster after calculation; the data transmission is carried out through the CDMA 310 under the control and coordination of the synchronization module 304.
  • the framework can fully utilize the input/output bandwidth of each cluster.
• the reduction procedure is then executed: these clusters perform N-1 reduction iterations (where N is 4). In each iteration, each cluster sends intermediate results to the next cluster and receives intermediate results from the previous cluster for computation, and the intermediate results sent and received by each cluster differ from iteration to iteration.
• Figure 22A shows that in the first iteration, the intermediate result a0 of the first cluster 2001 is sent to the second cluster 2002 and added to the intermediate result a1; the intermediate result b1 of the second cluster 2002 is sent to the third cluster 2003 and added to the intermediate result b2; the intermediate result c2 of the third cluster 2003 is sent to the fourth cluster 2004 and added to the intermediate result c3; and the intermediate result d3 of the fourth cluster 2004 is sent to the first cluster 2001 and added to the intermediate result d0.
• Figure 22B shows that in the second iteration, the intermediate result a0+a1 of the second cluster 2002 is passed to the third cluster 2003 and added to the intermediate result a2; the intermediate result b1+b2 of the third cluster 2003 is sent to the fourth cluster 2004 and added to the intermediate result b3; the intermediate result c2+c3 of the fourth cluster 2004 is sent to the first cluster 2001 and added to the intermediate result c0; and the intermediate result d0+d3 of the first cluster 2001 is passed to the second cluster 2002 and added to the intermediate result d1.
• Figure 22C shows that in the third iteration, the intermediate result a0+a1+a2 of the third cluster 2003 is passed to the fourth cluster 2004 and added to the intermediate result a3; the intermediate result b1+b2+b3 of the fourth cluster 2004 is sent to the first cluster 2001 and added to the intermediate result b0; the intermediate result c0+c2+c3 of the first cluster 2001 is sent to the second cluster 2002 and added to the intermediate result c1; and the intermediate result d0+d1+d3 of the second cluster 2002 is passed to the third cluster 2003 and added to the intermediate result d2.
  • Each cluster has one processor core to perform the complete reduction calculation, that is, adding up all the intermediate results corresponding to the longitudinal direction, for example:
• the second processor core of the first cluster 2001 carries the calculation result b0+b1+b2+b3
• the third processor core of the second cluster 2002 carries the calculation result c0+c1+c2+c3
• the fourth processor core of the third cluster 2003 carries the calculation result d0+d1+d2+d3
• the first processor core of the fourth cluster 2004 carries the calculation result a0+a1+a2+a3.
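• A minimal Python sketch of the reduce-scatter phase of this ring allreduce follows; it models each cluster's data as a list of chunks and omits the subsequent all-gather, so it illustrates the idea rather than the CDMA-based implementation.

```python
def ring_reduce_scatter(cluster_chunks):
    # Sketch of the reduce-scatter phase of the ring allreduce in Figs. 20-22.
    # cluster_chunks[i][j] is chunk j held by cluster i (e.g. [a0, b0, c0, d0]
    # for the first cluster).  After N-1 iterations, cluster i holds the fully
    # reduced chunk (i + 1) % N, matching Fig. 22C; the all-gather phase that
    # redistributes the reduced chunks is omitted here.
    n = len(cluster_chunks)
    for step in range(n - 1):
        for i in range(n):
            chunk = (i - step) % n            # chunk this cluster passes on
            nxt = (i + 1) % n                 # data always flows to the next cluster
            cluster_chunks[nxt][chunk] += cluster_chunks[i][chunk]
    return cluster_chunks

# Example: four clusters, four chunks each (numbers stand in for tensors).
# After the call, cluster 0 holds the fully reduced "b" chunk, cluster 1 the
# "c" chunk, and so on.
```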
  • step 1906 is executed to store the calculation result back.
  • the SRAM 308 stores the calculation results back to the DRAM 204 through the GDMA 311. These computations are the result of the cluster computing the on-chip cell graph. So far, the computing device 201 has completed the calculation of the on-chip cell map.
• the neural network is calculated based on executable instructions, and the executable instructions perform the calculation according to the template fusion unit instead of each individual layer of the neural network, which reduces on-chip and off-chip input/output consumption and improves computing efficiency.
  • FIG. 24 shows an exemplary long-chain neural network with 14 layers in total.
  • Another embodiment of the present disclosure is a method for implementing a forward fusion neural network using the framework of FIG. 1 to FIG. 4 , and the neural network is exemplarily the long-chain neural network shown in FIG. 24 . The method is shown in Figure 25.
  • the starting layer for fusion is selected according to the fusion strategy.
  • the processing device 203 selects the starting layer for fusion according to the fusion strategy.
  • the processing device 203 determines which of the unfused layers are convolutional layers or pooling layers. As shown in the figure, the 8th layer is the maximum pooling layer, and the 9th layer is the convolutional layer. Therefore, the convolution or pooling layer that has not been fused before is the 8th layer, and the processing device 203 sets the 8th layer as the starting layer of this fusion.
• In step 2502, fusion is performed towards the starting point of the neural network to establish a template fusion unit.
  • each layer in the template fusion unit needs to be continuous, and the unfused layer cannot be fused beyond the fused layer, that is, each layer in the template fusion unit needs to be a continuous unfused layer.
  • the fusion in the direction of the starting point of the neural network 241 is to incorporate the 7th layer into the template fusion unit, and the processing device 203 judges whether the 7th layer is an unfused layer.
• the fifth layer is the last layer that has been fused into the template fusion unit 2401, so the seventh layer is an unfused layer.
  • the processing device 203 sets the seventh layer (partial normalization) and the eighth layer (maximum pooling) for fusion, that is, the template fusion unit 2402.
• the processing device 203 regards the foremost layer in the template fusion unit 2402 as its input layer, that is, the seventh layer, and regards the last layer, namely the starting layer (the eighth layer), as its output layer; the processing device 203 performs pyramid fusion based on the input layer and the output layer.
  • the template fusion unit 2402 is based on the inverted pyramid data structure shown in FIG. 8, and the input of the seventh layer is the input of the template fusion unit 2402, and the output of the eighth layer is the output of the template fusion unit 2402.
• the shape is derived backwards from the output data to the input data, and the intermediate data between layers 7 and 8 is stored in the SRAM 308 and not stored back into the DRAM 204.
  • judgment is made according to the rules of the fusion strategy mentioned in the foregoing embodiments to determine whether the seventh layer plus the eighth layer satisfy the rules and can become a template fusion unit.
  • the processing device 203 continues to fuse towards the starting point of the neural network 241, that is, attempts to incorporate the sixth layer (ReLU activation layer) into the template fusion unit, that is, the template fusion unit 2403 .
  • the template fusion unit 2403 also has an inverted pyramid data structure as shown in FIG. 8 .
  • the input of the sixth layer is the input of the template fusion unit 2403
  • the output of the eighth layer is the output of the template fusion unit 2403.
• the intermediate data between the 6th and 7th layers and between the 7th and 8th layers are all stored in the SRAM 308 and are not stored back to the DRAM 204; judgment is made according to the rules of the fusion strategy mentioned in the previous embodiment to determine whether the 6th to 8th layers satisfy the rules and can become a template fusion unit.
  • the processing device 203 then performs fusion in the direction of the starting point of the neural network 241, that is, attempts to incorporate the fifth layer into the template fusion unit.
  • the processing device 203 will determine whether the newly added layer has been fused. Since the fifth layer has been fused into the template fusion unit 2401, the processing device 203 will not incorporate the fifth layer, and the fusion will be stopped at this point.
• the template fusion unit at this stage is thus established, namely the template fusion unit 2403.
  • the entire neural network 241 will be fused based on the aforementioned method.
  • the neural network 242 shows a possible final fusion result.
• the entire neural network 242 includes 14 layers, that is, 14 operators. After the fusion is completed, it becomes the template fusion unit 2401, the template fusion unit 2403, the template fusion unit 2404 and the template fusion unit 2405, i.e., four custom layers, namely four custom operators.
  • the neural network calculation is performed according to the template fusion unit.
  • the computing device 201 performs the neural network calculation according to the four custom layers composed of the template fusion unit 2401, the template fusion unit 2403, the template fusion unit 2404, and the template fusion unit 2405.
• when the computing device 201 executes the neural network calculation, it executes the aforementioned 4 custom layers instead of the original 14 layers, thereby achieving the technical effect of reducing input/output overhead and improving resource efficiency.
• the WRAM 432 can also be preloaded with weights. If the WRAM 432 is large enough, the weights of the first convolutional layer and the second convolutional layer can be loaded from the SRAM 308 into the WRAM 432 at one time; when the calculation of the first convolutional layer is completed, the weights of the second convolutional layer do not need to be loaded from the SRAM 308 into the WRAM 432, and the arithmetic module 42 directly reads the weights of the second convolutional layer from the WRAM 432 for calculation, which further reduces the weight loading time and improves the overall running speed.
  • Another embodiment of the present disclosure is a method for implementing a bidirectional fusion neural network using the framework of FIG. 1 to FIG. 4 .
  • the neural network is also taken as an example of the long-chain neural network in FIG. 24 , which is also shown in FIG. 26 for illustration.
  • Bidirectional fusion means that fusion can be performed forward or backward. This method is shown in Figure 27.
  • the fusion strategy is fused forward and backward at the same time to establish a template fusion unit, and then the neural network calculation is performed according to the template fusion unit.
• layers 1 to 5 in FIG. 26 have been fused into the template fusion unit 2601, and the starting rule of the fusion strategy in this embodiment is that the starting layer is the frontmost convolution or pooling layer that has not yet been fused.
  • the processing device 203 selects the starting layer for fusion according to the fusion strategy.
  • the processing device 203 determines that the convolution or pooling layer that has not been fused at the front is the maximum pooling layer of the 8th layer, so the processing device 203 sets the 8th layer as the starting layer of this fusion.
• In step 2702, fusion is then performed towards the starting point of the neural network.
• the processing device 203 incorporates the seventh layer forward into the template fusion unit, and the seventh layer becomes the newly added layer.
• In step 2703, the processing device 203 determines whether the newly added layer is an unfused layer.
  • the seventh layer is the unfused layer.
  • Step 2704 is executed, and the processing device 203 sets the seventh layer and the eighth layer as the template fusion unit 2602 .
  • step 2705 is executed, and the processing device 203 determines whether the template fusion unit 2602 conforms to the rules of the fusion strategy.
  • the processing device 203 regards the foremost layer in the template fusion unit 2602 as the input layer of the template fusion unit 2602, that is, the seventh layer is the input layer, and regards the starting layer as the output layer of the template fusion unit 2602, that is, the eighth layer.
  • the processing device 203 performs pyramid fusion based on the input layer and the output layer.
• step 2706 is executed, and the processing device 203 performs fusion from the starting layer towards the end point of the neural network; that is, starting from the 8th layer, it first fuses the 7th layer and then, in this step, jumps backwards to fuse the 9th layer, forming the template fusion unit 2603. This way of jumping backwards and forwards is called jump fusion.
• the processing device 203 determines whether the template fusion unit 2603 conforms to the rules of the fusion strategy. During fusion, the processing device 203 regards the foremost of the successive layers in the template fusion unit 2603 as its input layer, that is, the seventh layer, and the last layer of the backward jump as its output layer, that is, the ninth layer; the processing device 203 performs pyramid fusion based on the input layer and the output layer.
• If the rules of the fusion strategy are met, the flow returns to step 2702, where fusion is again performed in the direction of the starting point of the neural network, and the processing device 203 incorporates the sixth layer into the template fusion unit.
• In step 2703, the processing device 203 determines whether the newly added layer is an unfused layer.
  • the sixth layer is an unfused layer, so step 2704 is executed, and the processing device 203 sets the sixth layer and the ninth layer as the template fusion unit 2604 .
  • step 2705 is executed, and the processing device 203 determines whether the template fusion unit 2604 conforms to the rules of the fusion strategy.
• the processing device 203 regards the foremost layer in the template fusion unit 2604 as its input layer, that is, the sixth layer, and regards the last layer of the backward jump as its output layer, that is, the ninth layer; the processing device 203 performs pyramid fusion based on the input layer and the output layer.
• step 2706 is then executed: the processing device 203 performs fusion in the direction of the end point of the neural network and jumps to fuse the 10th layer, forming the template fusion unit 2605.
• In step 2707, the processing device 203 determines whether the template fusion unit 2605 conforms to the rules of the fusion strategy. During fusion, the processing device 203 regards the foremost of the successive layers in the template fusion unit 2605 as its input layer, that is, the sixth layer, and the last layer of the backward jump as its output layer, that is, the tenth layer; the processing device 203 performs pyramid fusion based on the input layer and the output layer.
• Back in step 2703, the processing device 203 determines whether the fifth layer is an unfused layer. Since the fifth layer has already been fused into the template fusion unit 2601, step 2708 is executed and the processing device 203 stops the fusion. In steps 2705 and 2707, when the processing device 203 determines that the template fusion unit does not conform to the rules of the fusion strategy, step 2708 is also executed and the processing device 203 stops the fusion. At this point, the processing device 203 has established the template fusion unit.
  • step 2709 is executed, and the computing device 201 performs neural network calculation according to the established template fusion unit.
• the processing device 203 may also jump towards the end of the neural network to perform fusion. For example, when the processing device 203 determines that the 5th layer has already been fused, step 2706 can be executed directly: the processing device 203 performs fusion in the direction of the end point of the neural network and jumps to fuse the 11th layer, so that the new template fusion unit includes layers 6 to 11; fusion proceeds in this way until the fusion strategy is no longer satisfied.
  • the skip fusion in this embodiment may be fused backward first, then fused forward, and jump in sequence.
  • the processing device 203 first selects and fuses the ninth layer backward, then jumps forward to fuse the seventh layer, and then jumps backward to fuse the tenth layer, and so on.
  • the present disclosure does not limit the sequence of jump fusion before and after.
• This embodiment illustrates the operation of jump fusion. It can be understood that the aforementioned jump fusion jumps forward or backward once for each fused layer, as shown by the arrows on the left side of FIG. 26. Those skilled in the art can easily adjust the jumping manner within the scope of the present disclosure, performing one jump for every n fused layers, where n is a natural number; for example, jumping forward or backward once for every two fused layers, or once for every three fused layers, is also within the protection scope of the present disclosure.
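• For illustration, the order in which layers are tried under jump fusion can be sketched as follows; the function is a hypothetical helper, and whether a layer is actually fused still depends on the rules of the fusion strategy.

```python
def jump_fusion_order(start_layer, num_layers, n=1, backward_first=False):
    # Sketch of jump (bidirectional) fusion: starting from the starting layer,
    # alternately extend the template fusion unit towards the start and the end
    # of the network, switching direction after every n fused layers.
    order = [start_layer]
    lo, hi = start_layer - 1, start_layer + 1
    towards_end = backward_first
    while lo >= 1 or hi <= num_layers:
        for _ in range(n):
            if towards_end and hi <= num_layers:
                order.append(hi); hi += 1
            elif not towards_end and lo >= 1:
                order.append(lo); lo -= 1
        towards_end = not towards_end
    return order

# jump_fusion_order(8, 14) -> [8, 7, 9, 6, 10, 5, 11, ...], matching the
# forward/backward jumps described for the starting layer 8 above.
```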
  • Another embodiment of the present disclosure is a method of implementing a bidirectional fusion neural network, illustratively having a block structure as shown in FIG. 28 , using the framework of FIGS. 1 to 4 .
• the starting rule of the fusion strategy in this embodiment is also that the starting layer is the frontmost convolution or pooling layer that has not been fused; jump fusion is performed from the starting layer towards both the start and the end of the neural network to establish a template fusion unit, and then the neural network calculation is performed according to the template fusion unit.
  • one of the rules of the fusion strategy of this embodiment is to fuse the block structure as a unit. The manner in which the template fusion unit is determined will be further explained below.
  • the processing device 203 selects the starting layer for fusion according to the fusion strategy, and performs fusion from the starting layer to the starting point of the neural network. Assuming that the first unfused convolutional layer or pooling layer is the seventh layer, the processing device 203 sets the seventh layer as the starting layer of the current fusion, and further includes the sixth layer into the template fusion unit. Although the sixth layer is an unfused layer and can be fused, the processing device 203 determines that the sixth layer belongs to the block structure 2801 . According to the fusion strategy, the processing device 203 needs to fuse the block structure 2801 as a unit, so the processing device 203 incorporates all the first to sixth layers at one time to form a template fusion unit 2802 .
  • the processing device 203 determines whether the template fusion unit 2802 conforms to other rules of the fusion strategy. During fusion, the processing device 203 regards the first layer as the input layer of the template fusion unit 2802, and regards the seventh layer as the output layer of the template fusion unit 2802, and the processing device 203 performs pyramid fusion based on the input layer and the output layer.
• an appropriate fusion strategy may be composed with reference to Rules 1 to 19, for example Rule 5: include at least 2 main layers; Rule 6: include a continuous structure of main layer + main layer + non-main layer; Rule 7: include a continuous structure of scalar computing layer + vector computing layer; and so on.
  • the processing device 203 performs fusion in the direction of the end point of the neural network, that is, fusion of the eighth layer.
  • the eighth layer has two outputs, so that the template fusion unit becomes a multi-branch output, which does not conform to Rule 4.
• the eighth layer belongs to the block structure 2803, and the processing device 203 will fuse the entire block structure 2803 into the template fusion unit 2804.
  • the processing device 203 determines whether the template fusion unit 2804 conforms to the rules of the fusion strategy. If so, the template fusion unit 2804 is the final template fusion unit.
  • the computing device 201 uses the template fusion unit 2804 to perform neural network computation.
• the processing device 203 stops fusion at this time, and one of the template fusion units, namely the template fusion unit 2802, is established.
  • the processing device 203 will continue to try to fuse the block structure 2803 to become another template fusion unit 2805 . Assuming that the template fusion unit 2805 conforms to the fusion strategy, the processing device 203 creates another template fusion unit.
  • the computing device 201 performs neural network computation according to the two established template fusion units, namely, the template fusion unit 2802 and the template fusion unit 2805, which greatly reduces input/output consumption compared to 10-layer computation.
  • Another embodiment of the present disclosure is a scheme for implementing a forward, backward, bidirectional, and skip fusion neural network using the framework of FIGS. 1 to 4 .
  • the forward, backward, bidirectional, and skip-type fusion neural network solutions have been described in the foregoing embodiments, and will not be described separately.
  • the fusion strategy of this embodiment has multiple fusion flexibility.
  • the advantages and disadvantages of various template fusion unit schemes for forward, backward, bidirectional, and skip fusion are respectively evaluated, and then the best scheme is selected as the template fusion unit.
  • the so-called optimal solution may be the least number of template fusion units, the most main layer fusion, the least non-fused layers, or the least on-chip storage space occupied.
• Since this embodiment can accept multiple fusion methods and select the best solution from them as the template fusion unit, it can make full use of the hardware environment of the computing device 201; compared with the foregoing embodiments, this embodiment saves more input/output overhead and further improves computing efficiency.
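• A possible way to pick the optimal scheme is sketched below; the lexicographic ordering of the four criteria is an illustrative assumption, since the text lists them as alternatives.

```python
def choose_best_scheme(candidate_schemes):
    # Sketch of selecting the best of the forward, backward, bidirectional and
    # skip fusion schemes.  Each candidate is assumed to be a dict carrying the
    # metrics mentioned above.
    return min(candidate_schemes,
               key=lambda s: (s["num_fusion_units"],        # fewest template fusion units
                              -s["num_fused_main_layers"],  # most fused main layers
                              s["num_unfused_layers"],      # fewest unfused layers
                              s["on_chip_bytes"]))          # least on-chip storage occupied
```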
  • steps 1103, 1706, 2503, 2709, etc. all refer to performing neural network computations according to the template fusion unit.
• Another embodiment of the present disclosure is a method for executing a template fusion unit, which makes full use of the concurrency among the GDMA 311, IODMA 433, MVDMA 434 and the operation module 42 and introduces the concept of pipelining, so that intermediate results reside on-chip as much as possible; this not only reduces input/output between on-chip and off-chip but also exploits the high on-chip bandwidth for the on-chip unit map to speed up processing. Concurrency here means that the aforementioned elements can operate independently and in parallel without being affected by the other elements.
  • the input of the first layer and the output of the last layer in the template fusion unit are used as the interactive data between the template fusion unit and the DRAM 204, during which the computation of each layer does not need to access the DRAM 204.
  • the processing device 203 further divides the template fusion unit into several sub-template fusion units according to the size of the NRAM 431 and the WRAM 432.
  • FIG. 29 shows a schematic diagram of dividing the sub-template fusion unit.
  • Layers T1 to T11 in the figure are a section in a specific deep learning network.
• the processing device 203 divides the template fusion unit 2901 into a first sub-template fusion unit 2911 and a second sub-template fusion unit 2912. In other embodiments, the processing device 203 may divide the template fusion unit into an unspecified number of sub-template fusion units.
• the GDMA 311 transfers the data required by the template fusion unit 2901 from the DRAM 204 to the SRAM 308 at one time; the MVDMA 434 then transfers the subgraphs required to execute the first sub-template fusion unit 2911 onto the NRAM 431, and the arithmetic module 42 begins performing the task of the first sub-template fusion unit 2911, that is, computing layers T1 to T6, during which there is no need to access the SRAM 308. After the first sub-template fusion unit 2911 completes its calculation, the first intermediate result is generated, and the MVDMA 434 transfers the first intermediate result from the NRAM 431 to the SRAM 308.
  • the MVDMA 434 transfers the subgraphs required to execute the second sub-template fusion unit 2912 from the SRAM 308 to the NRAM 431, and the arithmetic module 42 performs the task of the second sub-template fusion unit 2912, that is, calculates the T7th to T11th layers , the SRAM 308 no longer needs to be accessed during this period.
  • the second sub-template fusion unit 2912 completes the calculation, a second intermediate result is generated.
• the MVDMA 434 transfers the second intermediate result from the NRAM 431 to the SRAM 308, and one of the processor cores 306 performs reduction on the first intermediate result and the second intermediate result to generate the calculation result.
• the GDMA 311 transfers the calculation result from the SRAM 308 to the DRAM 204 at one time. At this point, the task of the template fusion unit 2901, that is, the task of layers T1 to T11, is completed. The DRAM 204 is accessed only at the start and the end of the template fusion unit 2901, greatly reducing the number of input/output operations.
  • An important reason for the strong computing power of the computing device 201 lies in the three-level computing level of the system-on-chip-cluster-processor core, and the three-level memory design such as DRAM-SRAM-NRAM/WRAM, so that data can be stored in an appropriate level. Cache and calculate to form a sufficient flow.
  • this embodiment adopts a two-layer three-stage pipeline.
• the loading stage 3001, the computing stage 3002 and the store-back stage 3003 of the first layer take place at the cluster level.
• in the first layer loading stage 3001, when executing the template fusion unit, the GDMA 311 loads the data from the DRAM 204 into the SRAM 308.
• the first layer calculation stage 3002 is for the cluster 305 to compute the loaded on-chip cell graph and generate a calculation result.
• the first layer store-back stage 3003 is for the GDMA 311 to store the calculation results from the SRAM 308 back into the DRAM 204.
• in the first layer computing stage 3002, the storage core 307 actually divides the on-chip unit graph into corresponding subgraphs and broadcasts them to the processor cores 306 for calculation; therefore, the second layer three-stage pipeline occurs in the processor core 306.
  • the second layer loading stage 3004 is to execute the sub-template fusion unit.
  • the MVDMA 434 loads the sub-images from the SRAM 308 into the NRAM 431, and simultaneously loads the required weights into the WRAM 432.
• the second layer calculation stage 3005 is to transfer the subgraphs and weights to the operation module 42 for calculation and then transfer the intermediate results back to the NRAM 431.
• the MVDMA 434 then stores the intermediate results from the NRAM 431 to the SRAM 308.
• the pipeline of the first layer means that the first layer loading stage 3001, the first layer computing stage 3002 and the first layer store-back stage 3003 can proceed in parallel at the same time.
  • the same cluster 305 wants to process the first on-chip cell map, the second on-chip cell map and the third on-chip cell map.
• the first on-chip cell map is loaded into the SRAM 308 in the first layer loading stage 3001.
  • the first on-chip cell map is calculated in the first layer calculation stage 3002, and the first calculation result is transferred back to the SRAM 308.
• meanwhile, the second on-chip cell map is loaded into the SRAM 308 in the first layer loading stage 3007.
• the second on-chip cell map is calculated in the first layer calculation stage 3008 and the second calculation result is transferred back to the SRAM 308, while the third on-chip cell map is loaded into the SRAM 308 in the first layer loading stage 3010.
  • the SRAM 308 of this embodiment includes two storage spaces: a ping storage space and a pong storage space.
  • the pipeline of the template fusion unit is divided into three types according to the ping-pong attributes of the SRAM 308: input/output ping-pong (IO parity), input ping-pong (input parity), and no ping-pong (no parity).
  • Input/output ping-pong can support parallel loading, computing and saving.
• for input/output ping-pong, the ping storage space and the pong storage space need to be exactly equal, and they are used for loading and storing back respectively.
  • Input ping-pong only supports parallel storage and calculation, which will increase the time of SRAM 308 transfer.
• for input ping-pong, the ping storage space and the pong storage space do not need to be exactly equal, but an extra block of store-back storage space needs to be allocated.
  • No ping-pong refers to the serialization of load/store and computation, and no additional allocation of space is required.
  • the SRAM 308 of this embodiment has the same size of ping storage space and pong storage space to achieve the effect of input/output ping-pong.
• the first layer loading stage 3001, the first layer calculation stage 3002 and the first layer store-back stage 3003 of the first on-chip cell map are executed on the ping storage space, while the first layer loading stage 3007, the first layer computing stage 3008 and the first layer store-back stage 3009 of the second on-chip cell map are executed on the pong storage space; the first layer loading stage 3010, the first layer computing stage 3011 and the first layer store-back stage 3012 of the third on-chip cell map are executed on the ping storage space again, and so on.
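• The first-layer three-stage pipeline with input/output ping-pong can be sketched serially as follows; in hardware the three stages run concurrently, so this only illustrates the alternation of the ping and pong buffers, and the callback names are assumptions.

```python
def first_layer_pipeline(unit_maps, load, compute, store):
    # Serial sketch of the first-layer pipeline: unit map i is loaded into one
    # SRAM buffer while (in hardware) the other buffer is being computed on and
    # the previous result is being stored back.  `load`, `compute` and `store`
    # stand for the GDMA load, the cluster computation and the GDMA store-back.
    buffers = ("ping", "pong")
    pending = []                              # (buffer, result) waiting to be stored back
    for i, unit_map in enumerate(unit_maps):
        buf = buffers[i % 2]                  # alternate between ping and pong space
        load(unit_map, buf)                   # stage 1: DRAM -> SRAM(buf)
        pending.append((buf, compute(buf)))   # stage 2: cluster computes on SRAM(buf)
        if len(pending) > 1:
            store(*pending.pop(0))            # stage 3: store back the previous result
    while pending:
        store(*pending.pop(0))                # drain the pipeline
```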
• the pipeline of the second layer means that the second layer loading stage 3004, the second layer computing stage 3005 and the second layer store-back stage 3006 can proceed in parallel at the same time.
• suppose the same processor core 306 needs to process a first subgraph, a second subgraph and a third subgraph.
  • the first subgraph is loaded into the NRAM 431 in the second layer loading stage 3004, and the required weights are loaded into the WRAM 432.
  • the first subgraph is calculated and reduced in the second layer calculation stage 3005, and the reduced intermediate result is transferred back to the NRAM 431.
• the second subgraph is loaded into the NRAM 431 in the second layer loading stage 3013, and the required weights are loaded into the WRAM 432.
• the first intermediate result is stored into the SRAM 308 in the second layer store-back stage 3006, the second subgraph is calculated and reduced in the second layer calculation stage 3014 with the reduced intermediate result transferred back to the NRAM 431, and the third subgraph is loaded into the NRAM 431 in the second layer loading stage 3015 with the required weights loaded into the WRAM 432.
  • the synchronization module 304 of this embodiment uses a synchronization barrier instruction to synchronize the completion times of the tasks to avoid timing errors.
• weight replacement is not enabled in this embodiment; that is, during the execution of the aforementioned pipeline, while a subgraph is being calculated, the weights of the next subgraph are broadcast synchronously, so the WRAM 432 stores the weights of two adjacent subgraphs at the same time. Since the weights of multiple subgraphs affect each other in the WRAM 432, the space occupied in the WRAM 432 by the weights of two adjacent subgraphs may be larger than the sum of the weights of those two subgraphs.
• the processing device 203 needs to allocate multiple spaces for storing weights in the SRAM 308, and these weights always reside in the SRAM 308; if the template fusion unit includes multiple convolutional layers, the space of the SRAM 308 or the WRAM 432 may not be large enough to load all the weights needed to fuse that many layers.
• When the processing device 203 determines that the template fusion unit includes multiple convolutional layers, this embodiment switches to the weight replacement mode.
  • the weight replacement means that when a certain sub-picture is calculated, the processing device 203 loads the weights of the next sub-picture from the DRAM 204 into the SRAM 308.
  • the broadcast bus 309 broadcasts the weights to the WRAM 432 when it is the turn of the next subgraph to compute.
• the SRAM 308 only needs to be configured with storage space for the largest weight, and it stores the weights of only one subgraph at any time, so the space can be reused.
• the rules of the fusion strategy can include switching weight replacement on or off: when the sum of the weights of the template fusion unit is small, weight replacement is not used, so as to achieve a faster calculation speed; when the sum of the weights is large, weight replacement is used so that more layers can be fused.
• This embodiment is based on the three-level computing hierarchy of system-on-chip, cluster and processor core and the three-level memory design of DRAM-SRAM-NRAM/WRAM, establishing a two-layer three-stage pipeline, making full use of hardware resources and improving the computational efficiency of neural networks.
  • FIG. 31 shows a flowchart of executing a calculation program based on a template fusion unit according to another embodiment, where the template fusion unit includes a plurality of sub-template fusion units.
  • the data required by the template fusion unit is transferred from the off-chip DRAM 204 to the SRAM 308 at one time.
  • step 3102 it is determined whether all the sub-template fusion units in the template fusion unit have been calculated. If not, then execute step 3103, select an uncalculated sub-template fusion unit and transfer its required data to NRAM 431 and WRAM 432.
  • step 3104 the task of the selected sub-template fusion unit is performed, and the SRAM is not needed to be accessed during the execution.
  • the generated intermediate result is transferred from the NRAM 431 to the SRAM 308, and the process returns to step 3102.
  • Once all the sub-template fusion units have been calculated, step 3106 is executed to reduce all the intermediate results and generate a calculation result.
  • In step 3107, the calculation result is transferred from the SRAM 308 to the DRAM 204. At this point, the task of the template fusion unit is completed.
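  • As a rough, self-contained Python sketch of this flow (not the disclosure's implementation), the memory levels below are modelled as plain Python containers, and compute/reduce_fn are placeholder callables:

```python
from typing import Callable, List, Sequence

def run_template_fusion_unit(
    dram_data: Sequence[Sequence[float]],       # one entry per sub-unit
    dram_weights: Sequence[Sequence[float]],    # matching weights per sub-unit
    compute: Callable[[Sequence[float], Sequence[float]], float],
    reduce_fn: Callable[[List[float]], float] = sum,
) -> float:
    # One transfer of everything the template fusion unit needs to the "SRAM".
    sram = {"data": list(dram_data), "weights": list(dram_weights), "partials": []}

    # Loop until every sub-template fusion unit has been calculated.
    for subgraph, weights in zip(sram["data"], sram["weights"]):
        nram = list(subgraph)              # subgraph into the "NRAM"
        wram = list(weights)               # its weights into the "WRAM"
        partial = compute(nram, wram)      # execution touches no "SRAM" here
        sram["partials"].append(partial)   # intermediate result back to "SRAM"

    result = reduce_fn(sram["partials"])   # reduce all intermediate results
    return result                          # to be stored back to the "DRAM"

# Example: each sub-unit computes a dot product; the reduction sums them.
out = run_template_fusion_unit(
    dram_data=[[1.0, 2.0], [3.0, 4.0]],
    dram_weights=[[0.5, 0.5], [0.25, 0.25]],
    compute=lambda x, w: sum(a * b for a, b in zip(x, w)),
)  # out == 1.5 + 1.75 == 3.25
```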
  • Figure 32 shows a flow diagram of a two-layer three-stage pipeline of another embodiment.
  • a first on-chip cell map is loaded.
  • At the same time, the first on-chip cell map is calculated to generate a first intermediate result, and the second on-chip cell map is loaded.
  • At the same time, the first intermediate result is stored back, the second on-chip cell map is calculated to generate a second intermediate result, and the third on-chip cell map is loaded.
  • Step 3202 further includes the following steps.
  • A first subgraph is loaded, wherein the first subgraph is at least a part of the first on-chip cell map.
  • In step 3205, the first subgraph is calculated to generate the first intermediate result and, at the same time, the second subgraph is loaded, wherein the second subgraph is at least a part of the first on-chip cell map.
  • In step 3206, the first intermediate result is stored back, the second subgraph is calculated, and the third subgraph is loaded at the same time, wherein the third subgraph is also at least a part of the first on-chip cell map.
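  • The overlap of steps 3204-3206 can be pictured with the following sequential Python sketch, which steps through explicit pipeline beats instead of real concurrent DMA; all names are illustrative assumptions:

```python
from typing import Callable, List, Optional, Sequence

def three_stage_pipeline(
    subgraphs: Sequence[Sequence[float]],
    compute: Callable[[Sequence[float]], float],
) -> List[float]:
    loaded: Optional[Sequence[float]] = None   # what the load stage produced
    computed: Optional[float] = None           # result awaiting store-back
    stored: List[float] = []                   # intermediate results "in SRAM"
    n = len(subgraphs)

    for beat in range(n + 2):                  # n subgraphs drain in n+2 beats
        if computed is not None:               # store stage: previous beat's result
            stored.append(computed)
        computed = compute(loaded) if loaded is not None else None  # compute stage
        loaded = subgraphs[beat] if beat < n else None              # load stage
    return stored

# In beat k the k-th subgraph is loaded while the (k-1)-th is computed and the
# (k-2)-th result is stored back, matching the load/compute/store overlap.
results = three_stage_pipeline([[1, 2], [3, 4], [5, 6]], compute=sum)
assert results == [3, 7, 11]
```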
  • Another embodiment of the present disclosure is a computer-readable storage medium on which computer program code for dynamically fusing a neural network according to a fusion strategy is stored; when the computer program code is run by a processing device, the method of FIG. 12 is performed.
  • The present disclosure dynamically determines the template fusion unit by setting a fusion strategy, fuses multiple layers in the neural network to form a new custom layer, and loads the data required for calculating the template fusion unit at one time, so as to reduce input/output overhead.
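  • As a schematic illustration of such dynamic determination (a sketch under assumed, simplified rules, not the disclosure's fusion strategy), a candidate template fusion unit can be extended greedily while every rule still passes:

```python
from typing import Callable, List, Sequence

MAIN_LAYERS = {"conv", "pool", "matmul"}   # simplified notion of "main layers"

def at_least_two_main_layers(unit: Sequence[str]) -> bool:
    return sum(layer in MAIN_LAYERS for layer in unit) >= 2

def fits_on_chip(unit: Sequence[str], budget: int = 8) -> bool:
    return len(unit) <= budget             # placeholder for real size checks

def build_template_fusion_unit(
    layers: Sequence[str],
    start: int,
    rules: Sequence[Callable[[Sequence[str]], bool]] = (
        at_least_two_main_layers,
        fits_on_chip,
    ),
) -> List[str]:
    candidate: List[str] = []
    best: List[str] = []
    for layer in layers[start:]:
        candidate.append(layer)                      # try fusing one more layer
        if all(rule(candidate) for rule in rules):   # keep only rule-satisfying units
            best = list(candidate)
    return best

# Example: starting from the first unfused layer of a small chain.
print(build_template_fusion_unit(["conv", "relu", "pool", "relu"], start=0))
# -> ['conv', 'relu', 'pool', 'relu']
```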
  • The electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, home appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • The hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby completing unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions . Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present disclosure. In addition, according to different solutions, the present disclosure also has different emphases in the description of some embodiments. In view of this, those skilled in the art can understand the parts that are not described in detail in a certain embodiment of the present disclosure, and can also refer to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (eg, a computer-readable storage medium), the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • The aforementioned memory may include, but is not limited to, a U disk (USB flash drive), a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • The various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like.
  • The aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), such as a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, and the like.
  • Clause A1. An integrated circuit device for fusing a neural network, the neural network comprising an i-th layer, the input feature map of the i-th layer being smaller than its output feature map, the integrated circuit device comprising:
  • a processing device for establishing a template fusion unit according to a fusion strategy; and
  • a computing device for performing neural network computation according to the template fusion unit;
  • wherein the template fusion unit includes the i-th layer, and i is a natural number.
  • Clause A3. The integrated circuit device of Clause A1, wherein the neural network further comprises a j-th layer located after the i-th layer, the input feature map of the j-th layer is larger than its output feature map, the template fusion unit also includes the j-th layer, j is a natural number, and i is not equal to j.
  • Clause A4 The integrated circuit device of clause A2 or clause A3, wherein the i-th layer and the j-th layer are adjacent.
  • Clause A5 The integrated circuit device of Clause A1, wherein the fusion strategy is that the i-th layer is a starting layer of the template fusion unit.
  • Clause A6. The integrated circuit device of Clause A1, wherein the i-th layer is located in a block structure of the neural network, and a rule of the fusion strategy is to add or delete the template fusion unit in units of the block structure.
  • Clause A7 The integrated circuit device of Clause A1, wherein the i-th layer is one of a deconvolution layer, an up-pooling layer, and an up-sampling layer.
  • Clause A8. The integrated circuit device of Clause A7, wherein the neural network comprises a plurality of main layers, the main layers being one of matrix multiplication, pooling, convolution, and the i-th layer; the rule of the fusion strategy is that the template fusion unit includes at least two main layers, and when the processing device determines that the rule is not satisfied, the processing device adjusts the template fusion unit until the rule is satisfied.
  • Clause A9. The integrated circuit device of Clause A7, wherein the neural network comprises a plurality of main layers, the main layers being one of matrix multiplication, pooling, convolution, and the i-th layer; the rule of the fusion strategy is that the template fusion unit includes a continuous structure in which a main layer, a main layer, and a non-main layer are adjacent in turn, and when the processing device determines that the rule is not satisfied, the processing device adjusts the template fusion unit until the rule is satisfied.
  • Clause A10 The integrated circuit device of Clause A1, wherein the i-th layer is a custom layer.
  • Clause A11 A board comprising the integrated circuit device according to any one of Clause A1 to Clause A10.
  • Clause A12. A method for fusing a neural network, the neural network comprising an i-th layer, the input feature map of the i-th layer being smaller than its output feature map, and i being a natural number, the method comprising:
  • establishing a template fusion unit according to a fusion strategy, the template fusion unit comprising the i-th layer; and
  • Neural network computations are performed according to the template fusion unit.
  • Clause A13. The method of Clause A12, wherein the neural network further comprises a j-th layer located before the i-th layer, the input feature map of the j-th layer is larger than its output feature map, the template fusion unit further includes the j-th layer, j is a natural number, and i is not equal to j.
  • Clause A14. The method of Clause A12, wherein the neural network further comprises a j-th layer located after the i-th layer, the input feature map of the j-th layer is larger than its output feature map, the template fusion unit further includes the j-th layer, j is a natural number, and i is not equal to j.
  • Clause A15 The method of clause A13 or clause A14, wherein the i-th layer and the j-th layer are adjacent.
  • Clause A17. The method according to Clause A12, wherein the i-th layer is located in a block structure of the neural network, and a rule of the fusion strategy is to add or delete the template fusion unit in units of the block structure.
  • Clause A18 The method of Clause A12, wherein the ith layer is one of a deconvolution layer, an uppooling layer, and an upsampling layer.
  • Clause A20. The method of Clause A18, wherein the neural network comprises a plurality of main layers, the main layers being one of matrix multiplication, pooling, convolution, and the i-th layer; the rule of the fusion strategy is that the template fusion unit includes a continuous structure in which a main layer, a main layer, and a non-main layer are adjacent in turn, and when the rule is not satisfied, the processing device adjusts the template fusion unit until the rule is satisfied.
  • Clause A22. A computer-readable storage medium having stored thereon computer program code for fusing a neural network, which, when executed by a processing device, performs the method of any one of Clause A12 to Clause A21. 2020110458882
  • Clause B1. A computing device for computing a neural network according to a template fusion unit, the template fusion unit fusing multiple layers of the neural network, the computing device comprising a plurality of clusters, each cluster comprising:
  • a plurality of processor cores, each including:
  • a neuron storage unit to load a subgraph from the shared storage unit, the subgraph being part of the on-chip cell graph
  • an arithmetic module for calculating the subgraph and generating an intermediate result
  • the intermediate result is reduced among the plurality of processor cores to generate a calculation result corresponding to the on-chip unit graph, and the shared storage unit stores the calculation result back to the off-chip memory.
  • Clause B2. The computing device according to Clause B1, wherein the off-chip memory stores a feature map, and when the storage space required by the feature map is greater than the available space of the shared storage unit, the on-chip unit map is a part of the feature map.
  • Clause B3. The computing device of Clause B2, wherein the feature map includes N, H, W, and C dimensions, and the on-chip cell map is the feature map split at a specific granularity in at least one of the N, H, and W dimensions.
  • Clause B4. The computing device according to Clause B1, wherein the off-chip memory stores multiple feature maps, and when the storage space required by the multiple feature maps is not greater than the available space of the shared storage unit, the on-chip cell map includes the plurality of feature maps.
  • Clause B5. The computing device of Clause B4, wherein the submap is one of the plurality of feature maps.
  • Clause B6. The computing device of Clause B1, wherein the on-chip cell graph includes N, H, W, and C dimensions, and the subgraph is the on-chip cell graph split at a specific granularity in at least one of the N, H, and W dimensions.
  • each cluster further includes a broadcast bus, and one of the plurality of processor cores splits the on-chip unit graph according to the number of the plurality of processor cores, wherein the broadcast bus transmits the intermediate result of each processor core to the next processor core, and the processor core calculates the intermediate result of the previous processor core with the stored corresponding intermediate result to generate the calculation result.
  • Clause B8. An integrated circuit device for computing a neural network according to a template fusion unit, comprising:
  • off-chip memory for storing the feature map of the neural network
  • a processing device for fusing the multiple layers of the neural network according to a fusion strategy to generate the template fusion unit, and splitting the feature map into an on-chip unit map;
  • a computing device including a plurality of clusters, each cluster including:
  • a plurality of processor cores, each including:
  • a neuron storage unit for loading a subgraph from the shared storage unit, the subgraph being a part of the on-chip cell graph
  • an arithmetic module for calculating the subgraph and generating an intermediate result
  • the intermediate result is reduced among the plurality of processor cores to generate a calculation result corresponding to the on-chip unit graph, and the shared storage unit stores the calculation result back to the off-chip memory.
  • Clause B9. The integrated circuit device of Clause B8, wherein when the processing device determines that the storage space required by the feature map is greater than the available space of the shared storage unit, the feature map is split into the on-chip unit map.
  • Clause B10. The integrated circuit device of Clause B9, wherein the feature map includes N, H, W, and C dimensions, and the processing device splits the feature map at a specific granularity in at least one of the N, H, and W dimensions.
  • Clause B11. The integrated circuit device of Clause B8, wherein when the processing device determines that the storage space required by the feature maps is not greater than the available space of the shared storage unit, the on-chip cell map includes a plurality of feature maps.
  • Clause B12. The integrated circuit device of Clause B11, wherein the submap is one of the plurality of feature maps.
  • Clause B13. The integrated circuit device of Clause B9, wherein the on-chip cell map includes N, H, W, and C dimensions, and one of the plurality of processor cores splits the on-chip cell map into the subgraphs at a specific granularity in at least one of the N, H, and W dimensions.
  • each cluster further includes a broadcast bus, one of the plurality of processor cores splits the on-chip cell map according to the number of the plurality of processor cores, and the broadcast bus transmits the intermediate result of each processor core to the next processor core, and the processor core calculates the intermediate result of the previous processor core with the stored corresponding intermediate result to generate the calculation result.
  • Clause B15 A board comprising the integrated circuit device of any one of clauses B8 to B14.
  • Clause B16. A method of computing a neural network according to a template fusion unit, the template fusion unit fusing multiple layers of the neural network, the method comprising:
  • the calculation result is stored back.
  • Clause B17 The method according to Clause B16, further comprising:
  • a feature map of the neural network is stored.
  • Clause B18. A computer-readable storage medium having stored thereon computer program code for computing a neural network according to a template fusion unit, which, when executed by a processing device, performs the method of any one of Clause B16 to Clause B17. 2020110438963
  • Clause C1. An integrated circuit device for fusing a neural network, comprising:
  • a computing device including multiple processor cores, each processor core including a neuron storage unit and a weight storage unit;
  • a processing device configured to:
  • divide the template fusion unit into a plurality of sub-template fusion units, each sub-template fusion unit corresponding to a subgraph and the corresponding weights of the subgraph, wherein the subgraph corresponding to the sub-template fusion unit is a part of the on-chip unit graph, and the corresponding weights of the subgraph are a part of the corresponding weights of the on-chip unit graph;
  • the computing device loads the sub-graph into the neuron storage unit, loads the corresponding weight of the sub-graph into the weight storage unit, and performs calculation in units of the sub-template fusion unit.
  • Clause C2. The integrated circuit device of Clause C1, wherein the computing device further comprises a shared storage unit, the subgraphs are carried from the shared storage unit to the neuron storage units, and the corresponding weights of the subgraphs are transferred from the shared storage unit to the weight storage units.
  • Clause C3 The integrated circuit device of Clause C2, further comprising off-chip memory, the on-chip cell map and the corresponding weights of the on-chip cell map being transferred from the off-chip memory to the shared storage unit.
  • Clause C4. The integrated circuit device of Clause C3, wherein the computing device further comprises an arithmetic module that reads the subgraph from the neuron storage unit and reads the corresponding weights of the subgraph from the weight storage unit, performs calculation to generate an intermediate result, and temporarily stores the intermediate result in the neuron storage unit.
  • Clause C5. The integrated circuit device of clause C4, wherein the intermediate result is carried from the neuron storage unit to the shared storage unit.
  • Clause C6. The integrated circuit device of Clause C5, wherein one of the plurality of processor cores reduces the intermediate result of each sub-template fusion unit to generate a calculation result, and the calculation result is transferred from the shared storage unit to the off-chip memory.
  • Clause C8. A computing device for computing a neural network according to a template fusion unit, wherein the template fusion unit is divided into a plurality of sub-template fusion units, each sub-template fusion unit corresponds to a subgraph and the corresponding weights of the subgraph, the computing device includes a plurality of processor cores, each processor core includes a neuron storage unit and a weight storage unit, and the computing device loads the subgraph into the neuron storage unit, loads the corresponding weights of the subgraph into the weight storage unit, and performs calculation in units of the sub-template fusion unit.
  • Clause C9. The computing device of Clause C8, wherein the computing device further comprises a shared storage unit, the subgraphs are transferred from the shared storage unit to the neuron storage unit, and the corresponding weights of the subgraphs are transferred from the shared storage unit to the weight storage unit.
  • Clause C10 The computing device of Clause C9, connected to off-chip memory, wherein the template fusion unit corresponds to an on-chip unit graph and corresponding weights of the on-chip unit graph, and the subgraph is a portion of the on-chip unit graph , the corresponding weight of the sub-graph is a part of the corresponding weight of the on-chip unit map, and the corresponding weights of the on-chip unit map and the on-chip unit map are transferred from the off-chip memory to the shared storage unit .
  • Clause C11. The computing device of Clause C10, further comprising an arithmetic module that reads the subgraph from the neuron storage unit and reads the corresponding weights of the subgraph from the weight storage unit; after the calculation, an intermediate result is generated, and the intermediate result is temporarily stored in the neuron storage unit.
  • Clause C12 The computing device of Clause C11, wherein the intermediate result is carried from the neuron storage unit to the shared storage unit.
  • Clause C13 The computing device of Clause C12, wherein one of the plurality of processor cores reduces an intermediate result of each sub-template fusion unit to generate a calculation result, the calculation result from the shared storage Cells are transported to the off-chip memory.
  • Clause C14. A processing device for fusing a neural network, connected to a computing device, the computing device comprising a plurality of processor cores, each processor core comprising a neuron storage unit and a weight storage unit, the processing device being configured to:
  • divide the template fusion unit into a plurality of sub-template fusion units, each sub-template fusion unit corresponding to a subgraph and the corresponding weights of the subgraph, wherein the subgraph corresponding to the sub-template fusion unit is a part of the on-chip unit graph, and the corresponding weights of the subgraph are a part of the corresponding weights of the on-chip unit graph.
  • Clause C15. A method of fusing a neural network in an integrated circuit device, the integrated circuit device comprising a computing device, the computing device comprising a plurality of processor cores, each processor core comprising a neuron storage unit and a weight storage unit, the method comprising:
  • the template fusion unit corresponds to the on-chip unit graph and the corresponding weights of the on-chip unit graph;
  • the template fusion unit is divided into a plurality of sub-template fusion units, each sub-template fusion unit corresponds to a subgraph and the corresponding weights of the subgraph, wherein the subgraph corresponding to the sub-template fusion unit is a part of the on-chip unit graph, and the corresponding weights of the subgraph are a part of the corresponding weights of the on-chip unit graph;
  • Clause C16. A computer-readable storage medium having stored thereon computer program code for fusing a neural network, which, when executed by a processing device, performs the method of Clause C15. 2020110458524

Abstract

An apparatus, a board card, a method, and a readable storage medium for performing neural network computations, wherein the computing device (201) is included in an integrated circuit device, and the integrated circuit device includes a universal interconnection interface and an other processing device (203). The computing device (201) interacts with the other processing device (203) to jointly complete computing operations specified by a user. The integrated circuit device may further include a storage device, which is connected to the computing device (201) and the other processing device (203), respectively, and is used for data storage of the computing device (201) and the other processing device (203).

Description

执行神经网络计算的装置、板卡、方法及可读存储介质
相关申请的交叉引用
本申请要求于2020年9月28日申请的,申请号为2020110458882,名称为“融合神经网络的装置、板卡、方法及可读存储介质”;于2020年9月28日申请的,申请号为2020110438963,名称为“计算神经网络的计算装置、板卡、方法及可读存储介质”;于2020年9月28日申请的,申请号为2020110458717,名称为“执行神经网络计算的装置、板卡、方法及可读存储介质”;于2020年9月28日申请的,申请号为2020110458524,名称为“融合神经网络的装置、板卡、方法及可读存储介质”的中国专利申请的优先权。
技术领域
本披露一般地涉及神经网络领域。更具体地,本披露涉及执行神经网络计算的装置、板卡、方法及可读存储介质。
背景技术
神经网络是按照一定规则连接起来的多个神经元系统,大致上是由以下四种层结构所组成:输入层、卷积层(convolution layer)、池化层(pooling layer)、全连接层(fully connected layer)。
输入层是自输入数据中截取部分信息,转化成特征矩阵方式呈现,其中载有对应该部分信息的特征。卷积层配置成接收来自输入层的特征矩阵,通过卷积操作对输入数据进行特征抽取。卷积层在实际运用时可以建制多层卷积层。池化层配置成对数据的某一个区域用一个值代替,这值通常是该区域所有数值里的最大值或平均值。通过池化,在不至于损失过多信息的前提下,可以缩减模型大小、提高计算速度。全连接层在整个卷积神经网络中起到分类器的作用,相当于特征空间变换,把前面所有有用的信息提取整合,基于不同的分类做信息比对,借以判断输入数据是否相似于比对的标的。
随着科技的发展,神经网络的层数越来越多,以经典的VGG架构为例,VGG-A共有11个权重层、VGG-B有13个权重层、VGG-C有16个权重层、VGG-D共有16个权重层、VGG-E共有19个权重层。其中,卷积层和全连接层的泛指权重层。有些神经网络更是具有上百层结构。不仅如此,随着层数的增加,神经网络的参数数量也呈指数级的增加,例如AlexNet具有6000万个参数参与计算。
多层数与多参数都需要大量片上片外的输入/输出访问,这将会耗去许多资源,同时延迟运算时间。因此一种减少输入/输出访问的机制是人工智能领域中迫切需要的。
发明内容
为了至少部分地解决背景技术中提到的技术问题,本披露的方案提供了一种执行神经网络计算的装置、板卡、方法及可读存储介质。
在一个方面中,本披露揭露一种执行神经网络计算的集成电路装置,包括:处理装置,用以建立模板融合单元;编译器,用以将所述模板融合单元转换成目标代码;链接器,用以将所述目标代码外加库链接,形成可执行文件;以及计算装置,用以执行所述可执行文件,以实现神经网络计算。
在另一个方面,本披露揭露一种板卡,包括根据前述的集成电路装置。
在另一个方面,本披露揭露一种执行神经网络计算的方法,包括:建立模板融合单元;将所述模板融合单元转换成目标代码;将所述目标代码外加库链接,形成可执行文件;以及执行所述可执行文件,以实现神经网络计算。
另一个方面,本披露揭露一种计算机可读存储介质,其上存储有执行神经网络计算的计算机程序代码,当所述计算机程序代码由处理装置运行时,执行前述的方法。
本披露通过设定融合策略,动态地决定模板融合单元,融合神经网络中的多个层,以形成新的自定义的层,一次性载入计算模板融合单元所需的数据,以减少输入/输出开销。
附图说明
通过参考附图阅读下文的详细描述,本披露示例性实施方式的上述以及其他目的、特征和优点将变得易于理解。在附图中,以示例性而非限制性的方式示出了本披露的若干实施方式,并且相同或对应的标号表示相同或对应的部分其中:
图1是示出本披露实施例的板卡的结构图;
图2是示出本披露实施例的集成电路装置的结构图;
图3是示出本披露实施例的计算装置的内部结构示意图;
图4是示出本披露实施例的处理器核的内部结构示意图;
图5是示出当一个处理器核欲将数据写入至另一个集群的处理器核时的示意图;
图6是示出AlexNet模型的示意图;
图7是示出一种示例性地神经网络模型的示意图;
图8是示出本披露实施例的两个卷积层融合在一起的示意图;
图9是示出NCHW与NHWC的格式示意图;
图10是示出本披露实施例利用模板融合单元执行神经网络计算的流程图;
图11是示出本披露实施例根据融合策略动态融合神经网络的流程图;
图12是示出本披露实施例利用模板融合单元执行神经网络计算的流程图;
图13是示出具有块结构的神经网络模型的示意图;
图14是示出输入/输出特征图如正金字塔结构的示意图;
图15A是示出最大池化的上池化操作的示意图;
图15B是示出平均池化的上池化操作的示意图;
图16是示出上采样操作的示意图;
图17是示出本披露实施例融合示例性神经网络的流程图;
图18是示出示例性的神经网络模型;
图19是示出本披露实施例基于可执行指令计算神经网络的流程图;
图20是示出环形全归约框架图;
图21是示出处于逻辑环路中的多个集群的示意图;
图22A是示出环形全归约第一次迭代的示意图;
图22B是示出环形全归约第二次迭代的示意图;
图22C是示出环形全归约第三次迭代的示意图;
图23A是示出环形全归约每个集群都有一个处理器核执行完整的归约计算的示意图;
图23B是示出环形全归约执行完整计算后的示意图;
图24是示出示例性的长链式神经网络;
图25是示出本披露实施例实现向前融合神经网络的流程图;
图26是示出示例性的长链式神经网络;
图27是示出本披露实施例实现双向融合神经网络的流程图;
图28是示出示例性的块结构神经网络;
图29是示出本披露实施例划分子模板融合单元的示意图;
图30是示出本披露实施例两层三级流水线的示意图;
图31是示出本披露实施例基于模板融合单元执行计算程序的流程图;以及
图32是示出本披露实施例两层三级流水线的流程图。
具体实施方式
下面将结合本披露实施例中的附图,对本披露实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本披露一部分实施例,而不是全部的实施例。基于本披露中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本披露保护的范围。
应当理解,本披露的权利要求、说明书及附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包 括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。
下面结合附图来详细描述本披露的具体实施方式。
神经网络是由输入层、卷积层、激活函数、池化层、全连接层所组成,少则数层,多则上百层,每层执行一个算子,例如卷积层执行卷积算子,有多少层便需要执行多少算子。在本披露中,当提及特定层时,便表示该层相对应的算子。
在进行神经网络计算时,输入信息和模型各层的输出结果在每次推理计算时是不同的,它们被视为变量数据,变量数据一般都是以特征图(矩阵)来表现的,在本披露中,整个神经网络模型的输入信息和模型各层的输入图统称为特征图,一旦特征图加载到片上存储器部件上,在本披露中称为片上单元图。训练网络模型的参数在训练稳定之后通常不会频繁改动,或是网络拓扑结构和硬件参数确定后就可以编译生成,在计算过程中不会变更,因此它们可以被视为常量数据,常量数据包括但不限于权值、偏置、设备硬件指令、批标准化(batchnorm)的均值和方差等,在本披露中统一以权值代表所有的常量数据。而本披露中提及“数据”时,泛指根据融合策略使得神经网络模型中允许对应算子的运算操作融合在一起的图结构,该图结构所涉及变量数据和常量数据,也就是特征图加上相应的权值。
图1示出本披露实施例的一种板卡10的结构示意图。如图1所示,板卡10包括芯片101,其是一种系统级芯片(System on Chip,SoC),或称片上系统,集成有一个或多个组合处理装置,组合处理装置是一种人工智能运算单元,用以支持各类深度学习和机器学习算法,满足计算机视觉、语音、自然语言处理、数据挖掘等领域复杂场景下的智能处理需求。特别是深度学习技术大量应用在云端智能领域,云端智能应用的一个显著特点是输入数据量大,对平台的存储能力和计算能力有很高的要求,此实施例的板卡10适用在云端智能应用,具有庞大的片外存储、片上存储和大量的计算能力。
芯片101通过对外接口装置102与外部设备103相连接。外部设备103例如是服务器、计算机、摄像头、显示器、鼠标、键盘、网卡或wifi接口等。待处理的数据可以由外部设备103通过对外接口装置102传递至芯片101。芯片101的计算结果可以经由对外接口装置102传送回外部设备103。根据不同的应用场景,对外接口装置102可以具有不同的接口形式,例如PCIe接口等。
板卡10还包括用于存储数据的存储器件104,其包括一个或多个存储单元105。存储器件104通过总线与控制器件106和芯片101进行连接和数据传输。板卡10中的控制器件106配置用于对芯片101的状态进行调控。为此,在一个应用场景中,控制器件106可以包括单片机(Micro Controller Unit,MCU)。
图2是示出此实施例的芯片101中的组合处理装置的结构图。如图2中所示,组合处理装置20包括计算装置201、接口装置202、处理装置203和DRAM 204。
计算装置201配置成执行用户指定的操作,主要实现为单核智能处理器或者多核智能处理器,用以执行深度学习或机器学习的计算,其可以通过接口装置202与处理装置203进行交互,以共同完成用户指定的操作。
接口装置202用于在计算装置201与处理装置203间传输数据和控制指令。例如,计算装置201可以经由接口装置202从处理装置203中获取输入数据,写入计算装置201片上的存储装置。进一步,计算装置201可以经由接口装置202从处理装置203中获取控制指令,写入计算装置201片上的控制缓存中。替代地或可选地,接口装置202也可以读取计算装置201的存储装置中的数据并传输给处理装置203。
处理装置203作为通用的处理装置,执行包括但不限于数据搬运、对计算装置201的开启和/或停 止等基本控制。根据实现方式的不同,处理装置203可以是中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)或其他通用和/或专用处理器中的一种或多种类型的处理器,这些处理器包括但不限于数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,并且其数目可以根据实际需要来确定。如前所述,仅就本披露的计算装置201而言,其可以视为具有单核结构或者同构多核结构。然而,当将计算装置201和处理装置203整合共同考虑时,二者视为形成异构多核结构。
DRAM 204用以存储待处理的数据,为DDR内存,大小通常为16G或更大,用于保存计算装置201和/或处理装置203的数据。
图3示出了计算装置201的内部结构示意图。计算装置201用以处理计算机视觉、语音、自然语言、数据挖掘等输入数据,图中的计算装置201采用多核分层结构设计,计算装置201作为一个片上系统,其包括多个集群(cluster),每个集群又包括多个处理器核,换言之,计算装置201是以片上系统-集群-处理器核的层次所构成的。
以片上系统的层级来看,如图3所示,计算装置201包括外部存储控制器301、外设通信模块302、片上互联模块303、同步模块304以及多个集群305。
外部存储控制器301可以有多个,在图中示例性地展示2个,其用以响应处理器核发出的访问请求,访问外部存储设备,例如图2中的DRAM 204,从而自片外读取数据或是将数据写入。外设通信模块302用以通过接口装置202接收来自处理装置203的控制信号,启动计算装置201执行任务。片上互联模块303将外部存储控制器301、外设通信模块302及多个集群305连接起来,用以在各个模块间传输数据和控制信号。同步模块304是一种全局同步屏障控制器(global barrier controller,GBC),用以协调各集群的工作进度,确保信息的同步。多个集群305是计算装置201的计算核心,在图中示例性地展示4个,随着硬件的发展,本披露的计算装置201还可以包括8个、16个、64个、甚至更多的集群305。集群305用以高效地执行深度学习算法。
以集群的层级来看,如图3所示,每个集群305包括多个处理器核(IPU core)306及一个存储核(MEM core)307。
处理器核306在图中示例性地展示4个,本披露不限制处理器核306的数量。其内部架构如图4所示。每个处理器核306包括三大模块:控制模块41、运算模块42及存储模块43。
控制模块41用以协调并控制运算模块42和存储模块43的工作,以完成深度学习的任务,其包括取指单元(instruction fetch unit,IFU)411及指令译码单元(instruction decode unit,IDU)412。取指单元411用以获取来自处理装置203的指令,指令译码单元412则将获取的指令进行译码,并将译码结果作为控制信息发送给运算模块42和存储模块43。
运算模块42包括向量运算单元421及矩阵运算单元422。向量运算单元421用以执行向量运算,可支持向量乘、加、非线性变换等复杂运算;矩阵运算单元422负责深度学习算法的核心计算,即矩阵乘及卷积。
存储模块43用来存储或搬运相关数据,包括神经元存储单元(neuron RAM,NRAM)431、权值存储单元(weight RAM,WRAM)432、输入/输出直接内存访问模块(input/output direct memory access,IODMA)433、搬运直接内存访问模块(move direct memory access,MVDMA)434。NRAM 431用以存储供处理器核306计算的特征图及计算后的中间结果;WRAM 432则用以存储深度学习网络的权值;IODMA 433通过广播总线309控制NRAM 431/WRAM 432与DRAM 204的访存;MVDMA 434则用以控制NRAM 431/WRAM 432与SRAM 308的访存。
回到图3,存储核307主要用以存储和通信,即存储处理器核306间的共享数据或中间结果、以及执行集群305与DRAM 204之间的通信、集群305间彼此的通信、处理器核306间彼此的通信等。在其他实施例中,存储核307具有标量运算的能力,用以执行标量运算。
存储核307包括共享存储单元(SRAM)308、广播总线309、集群直接内存访问模块(cluster direct memory access,CDMA)310及全局直接内存访问模块(global direct memory access,GDMA)311。SRAM 308承担高性能数据中转站的角色,在同一个集群305内不同处理器核306之间所复用的数据 不需要通过处理器核306各自向DRAM 204获得,而是经SRAM 308在处理器核306间中转,存储核307只需要将复用的数据从SRAM 308迅速分发给多个处理器核306即可,以提高核间通讯效率,亦大大减少片上片外的输入/输出访问。
广播总线309、CDMA 310及GDMA 311则分别用来执行处理器核306间的通信、集群305间的通信和集群305与DRAM 204的数据传输。以下将分别说明。
广播总线309用以完成集群305内各处理器核306间的高速通信,此实施例的广播总线309支持核间通信方式包括单播、多播与广播。单播是指点对点(即单一处理器核至单一处理器核)的数据传输,多播是将一份数据从SRAM 308传输到特定几个处理器核306的通信方式,而广播则是将一份数据从SRAM 308传输到所有处理器核306的通信方式,属于多播的一种特例。
CDMA 310用以控制在同一个计算装置201内不同集群305间的SRAM 308的访存。图5示出当一个处理器核欲将数据写入至另一个集群的处理器核时的示意图,以说明CDMA 310的工作原理。在此应用场景中,同一个计算装置包括多个集群,为方便说明,图中仅展示集群0与集群1,集群0与集群1分别包括多个处理器核,同样为了说明方便,图中的集群0仅展示处理器核0,集群1仅展示处理器核1。处理器核0欲将数据写入至处理器核1。
首先,处理器核0发送单播写请求将数据写入本地的SRAM 0中,CDMA 0作为主(master)端,CDMA 1作为从(slave)端,主端向从端推送写请求,即主端发送写地址AW和写数据W,将数据传送到集群1的SRAM 1中,接着从端发送写响应B作为回应,最后集群1的处理器核1发送单播读请求将数据从SRAM 1中读取出来。
回到图3,GDMA 311与外部存储控制器301协同,用以控制集群305的SRAM 308到DRAM 204的访存,或是将数据自DRAM 204读取至SRAM 308中。从前述可知,DRAM 204与NRAM 431或WRAM 432间的通信可以经由2个渠道来实现。第一个渠道是通过IODAM 433直接联系DRAM 204与NRAM 431或WRAM 432;第二个渠道是先经由GDMA 311使得数据在DRAM 204与SRAM 308间传输,再经过MVDMA 434使得数据在SRAM 308与NRAM 431或WRAM 432间传输。虽然表面上看来第二个渠道需要更多的元件参与,数据流较长,但实际上在部分实施例中,第二个渠道的带宽远大于第一个渠道,因此DRAM 204与NRAM 431或WRAM 432间的通信通过第二个渠道可能更有效率。本披露的实施例可根据本身硬件条件选择数据传输渠道。
在其他实施例中,GDMA 311的功能和IODMA 433的功能可以整合在同一部件中。本披露为了方便描述,将GDMA 311和IODMA 433视为不同部件,对于本领域技术人员来说,只要其实现的功能以及达到的技术效果与本披露类似,即属于本披露的保护范围。进一步地,GDMA 311的功能、IODMA 433的功能、CDMA 310的功能、MVDMA 434的功能亦可以由同一部件来实现,同样地,只要其实现的功能以及达到的技术效果与本披露类似,均属于本披露的保护范围。
与本披露相关的神经网络的结构分为两类:长链式结构与块结构。长链式结构指的是神经网络模型为单链条串接的层所组成,每层只有一个输入及一个输出,整体属于单分支,例如VGG16模型或是图6所示的AlexNet模型。块结构指的是神经网络中的子网络仅有一个输入及一个输出,但子网络内存在多分支,即子网络的部分层具有多个输入或输出,例如resnet50的resblock结构、inception_v3的block结构等。图7示出一种示例性地神经网络模型的示意图,该示例性神经网络模型包括子网络701及子网络702。子网络701仅有一个输入及一个输出,其包括第一层到第六层,第一层具有2个输出,第六层具有2个输入,因此子网络701包括2个分支,一个分支为第一层→第二层→第三层→第六层,而另一个分支为第一层→第四层→第五层→第六层,子网络701构成一个块结构。同样地,子网络702亦构成一个块结构。
在执行深度学习的各层计算时,需要大量的片外片上访问,特别是将输入数据自DRAM 204读取至计算装置201中,再将计算装置201的计算结果存储至DRAM 204。这种频繁的访问会耗去极大的硬件资源。为了解决这个问题,本披露通过融合神经网络的相邻层,在很大程度上减少了片外片上的数据传输。
图8示出将两个卷积层融合在一起的示意图。第一层卷积层810的输入为7×7的特征图801,该层将特征图801与3×3的内核(未示出)进行卷积后,得到第一层卷积层810的特征图802。其中,5×5特 征子图804的数值会影响3×3特征子图805。假设步长(stride)为1,在计算完5×5特征子图804后,第一层卷积层810会接着计算5×5特征子图806,而5×5特征子图806的数值会影响3×3特征子图807。
在进行第二层卷积层811的计算时,特征图802成为第二层卷积层811的输入,同样与3×3的内核进行卷积,得到第二层卷积层811的特征图803。其中,3×3特征子图805的数值会影响特征图803中的1×1特征子图808。在计算完3×3特征子图805后,第二层卷积层811会接着计算3×3特征子图807,而3×3特征子图807的数值会影响特征图803中的1×1特征子图809。
如果未融合,计算装置201在进行第一层卷积810时,自DRAM 204读取5×5特征子图804,计算完后将3×3特征子图805存储回DRAM 204,接着再从DRAM 204读取5×5特征子图806,计算完后将3×3特征子图807存储至DRAM 204。在进行第二层卷积811时,同样需要自DRAM 204读取3×3特征子图805,计算完后将1×1特征子图808存储至DRAM 204,接着自DRAM 204读取3×3特征子图807,计算完后将1×1特征子图809存储至DRAM 204。通过上述说明可知,特征图802作为中间数据反复在片外片上被读取存储,相当占用系统资源。
如果将第一层卷积层810与第二层卷积层811进行融合,也就是把特征图802存储在NRAM 431中(第一层卷积层810与第二层卷积层811的权值亦可存储在WRAM 432中),如此便可减少计算装置201与DRAM 204间的访问次数,进而提高整体神经网络的执行效率。由于参与融合的特征图(如特征图801、特征图802、特征图803)在神经网络模型上下文逻辑中整体看起来像倒金字塔,故称金字塔融合。
金字塔融合通常是基于神经网络中的特定卷积层和池化层向后进行融合,亦即融合的起始层为卷积层或池化层,根据其本身硬件条件向后融合了多层,其间可能包含多个卷积层和池化层。但随着深度学习及神经网络的发展,层的排序变得复杂,例如在卷积层前面设置有激活层,则此激活层应该也要被考虑如何与其后的卷积层进行融合。因此,除了单纯以卷积层和池化层为核心进行融合之外,本披露提供多样的融合方式,不必然以卷积层和池化层为核心,而采取特定的策略,弹性地选择神经网络的各层进行融合,即便是用户自定义的层,只要符合融合策略便可被融合,使得整体效能最佳化。
本披露的另一个实施例是一种新式的融合方法,通过利用前述图1、图2、图3及图4的硬件结构来实施的,这种融合称为模板融合单元(template fuse unit,TFU)。模板融合单元主要是通过一定的融合策略弹性地将多个层融合成一个层,来减少网络的输入/输出开销,其包括前述的金字塔融合及其他融合方式,这些被融合的层的集合即为模板融合单元,可以视为是新的层或是自定义的层。
此实施例一次性将模板融合单元所需的特征图、权值等自DRAM 204载入至片上的SRAM 308,特征图载入至SRAM 308后称为片上单元图,片上单元图会被切割成子图,每次自SRAM 308载入一份子图到被指派计算该子图的处理器核306的NRAM 431,且计算该子图所需的权值亦会自SRAM 308被载入至WRAM 432上,每个子图计算完成后获得对应的中间结果,中间结果被存回SRAM 308,所有子图都完成计算后再一次性地将计算结果存回DRAM 204。也就是说,片上单元图和权值参与神经网络模型中算子的运算操作获得的对应结果在DRAM 204与SRAM 308间传递,子图对应的输出(中间结果)在SRAM 308与NRAM 431间传递。从计算装置201的角度来看,模板融合单元的数据载入是以片上单元图为单位,而计算是以子图为单位。
更详细来说,SRAM 308是融合策略的重要参考指标之一,其空间大小决定了模板融合单元为大图模式或是小图模式。小图模式与大图模式是指存储在DRAM 204的一张特征图是否能一次性地搬到SRAM 308进行处理,处理装置203会将该特征图所需存储空间与SRAM 308可用空间进行比较。如果SRAM 308空间不足,特征图摆不下,则为大图模式;如果SRAM 308足以容纳整张特征图,就是小图模式。需特别注意的是,在大图模式下,片上单元图只是特征图的一部分;在小图模式下,如果SRAM 308的可用空间足够大,或是特征图足够小,SRAM 308或许可以一次性地容纳多张特征图,即片上单元图可以包括多张特征图。
如是大图模式,则必须拆分该特征图方能载入计算装置201中。处理装置203会在DRAM 204上将该特征图进行拆分,直到产生足够小的片上单元图能够满足SRAM 308的空间需求,使得该片上单元图可以一次性地搬到SRAM 308进行处理。而特征图在进行拆分时,可能会产生输入依赖运算和输 出依赖运算。
输入依赖运算是指拆分后的各片上单元图至少部分重叠,每个子集都需要一些输入的额外副本,以进行完整的运算,从而导致拆分操作中的数据冗余,所谓的数据冗余是指同一段数据在系统中被复用。当模板融合单元包括卷积、池化或矩阵乘等层时都会导致输入依赖运算。
输出依赖运算是指每个子图产出中间结果后,还需要进行归约(reduce),才能得到计算结果。归约是指在基于对片上单元图本身内容理解的基础上,拆分成子图后分别计算,以缩减计算规模,从而在尽可能保持原片上单元图原貌的前提下,最大限度地精简数据量,再以子图为基础还原或整合计算结果。进行归约时计算结果是互为依赖的。当模板融合单元包括内积、卷积、矩阵乘、排序、计数等层时都会导致输出依赖运算。
此实施例可以处理的特征图的数据格式包括N、H、W、C维度,其中N代表批处理(batch)、H代表高度(height)、W代表宽度(width)、C代表通道(channel)。以图像数据为例,N表示这批图像共有几张,H表示图像在竖直方向有多少像素,W表示水平方向像素数,C表示通道数(例如黑白图像的通道数C为1,而RGB彩色图像的通道数C为3)。
这些维度的排序决定了数据的组成方式,常见的组成方式有NHWC和NCHW两种,图9示出NCHW与NHWC的格式区别,此图是以RGB彩色图像为例,图中R表示红色像素、G表示绿色像素、B表示蓝色像素。数列91为NCHW格式,N排列在外层,每个通道内像素紧挨在一起,再依RGB的顺序排列,坐标为(n,c,h,w)的元素在存储中的偏移为((n×C+c)×H+h)×W+w。数列92是NHWC格式,C排列在最内层,多个通道对应空间位置的RGB像素紧挨在一起。图中亦显示出输入像素901、输入像素902、输入像素903在不同排列方式下所处的位置,而这三个输入像素901、输入像素902、输入像素903合起来便是图像中一个点的颜色。坐标为(n,c,h,w)的元素相应的坐标向偏移的换算方法是((n×H+h)×W+w)×C+c。NHWC首先相比NCHW更加接近BMP的图片数据存储格式,BMP格式的文件中按照一个个像素点来存储数据,每个像素点存储了所有通道的颜色值,这使得在读取输入图片时不需要进行额外的维度转换。因此,NHWC的访存局部性较佳,每三个输入像素即可得到一个输出像素,NCHW则必须等所有通道输入准备好才能得到最终输出结果,需要占用较大的缓存空间。
此实施例可以根据数据融合神经网络各层为模板融合单元,图10示出相应的流程图。
在步骤1001中,处理装置203判断特征图所需存储空间是否大于SRAM 308的可用空间。如是,表示该特征图无法一次性地载入至SRAM 308中,因此执行步骤1002,拆分特征图。在此实施例中,处理装置203优先选择在N维度上进行拆分,因为不会产生输入或输出依赖运算,如在N维度上进行拆分无法满足要求,再考虑在H或是W维度上进行拆分,这时便可能会产生输入或输出依赖运算。此实施例亦支持在C维度上进行拆分,特别是沿着Cout方向拆分,这样通过数据优化的方式把一个卷积拆分成多个卷积,使得WRAM 432可以放得下权值,例如:将权值拆分到四个处理器核306上。因此,只要在某一维度上进行拆分是计算装置201能处理的,都是在本披露揭露的范围中。
更进一步来说,处理装置203可以依序在N、H、W维度间进行特定粒度的拆分,特定粒度可以是一个固定或变动的比值,或是以一个函数来表示。在一种应用场景下,处理装置203由大往小对特征图或权值进行拆分。以特征图为例,首先在N维度上将维度为NHWC的特征图拆分成N 1HWC的特征图与N 2HWC的特征图,其中特定粒度是固定比值,N 1与N 2各为N的二分之一。如果还不够小,处理装置203则在H维度上继续将N 1HWC的特征图拆分成N 1H 1WC的特征图与N 1H 2WC的特征图,其中H 1与H 2各为H的二分之一。如果还不够小,处理装置203则在W维度上继续将N 1H 1WC的特征图拆分成N 1H 1W 1C的特征图与N 1H 1W 2C的特征图,其中W 1与W 2各为W的二分之一。处理装置203可以在N、W、H维度上继续进行更小粒度的拆分,像是做四分之一等分、八分之一等分或十六分之一等分的切割,直到特征图足够小,成为可以一次性地载入SRAM 308的片上单元图为止。
可以理解的是,处理装置203还可以在一个维度上持续拆分,直到不能再拆分,才会选择另外一个维度持续拆分。例如持续在H维度上进行拆分,如果拆分至最小单位仍无法载入至SRAM 308中时,才改以在W维度上进行拆分,直到拆分至最小单位。
需特别注意的是,由于这样的拆分方式是由大拆到小,因此当拆分的特征图满足条件时,其所需 存储空间的大小通常会与SRAM 308的可用空间相差无几。换言之,在大图模式下,DRAM 204每次仅能传送一张拆分后的特征图至SRAM 308,但在小图模式下,SRAM 308的空间却可能可以一次性地自DRAM 204载入多张特征图。
在另一种应用场景下,处理装置203由小往大进行拆分,特定粒度同样可以是一个固定或变动的比值,或是以一个函数来表示。举例来说,首先在N维度上以特定粒度是最小单位进行拆分,即1×H×W×C。如果SRAM308可以载入,处理单元203继续放大特征图的拆分,例如放大为2×H×W×C。如果还可以载入,便继续放大,直到n×H×W×C无法载入为止,则片上单元图的尺寸即为(n-1)×H×W×C。
如果1×H×W×C所需存储空间已经超出SRAM 308的可用空间,处理装置203将从另一个维度继续拆分,例如:从H维度着手,则处理装置203接着判断1×1×W×C。如果够小,则沿着H维度往上增加,直到找到1×(h-1)×W×C所需存储空间恰好接近又不大于SRAM 308的可用空间。如果还是超出SRAM 308的可用空间,处理装置203再从另一个维度继续拆分,例如从W维度。依次方式找到最佳的可以一次性地载入SRAM 308的输入数据为止。在此所谓最佳指的是片上单元图所需存储空间最接近但不大于SRAM 308的可用空间。
处理装置203拆分特征图后,回到步骤1001,处理装置203判断拆分后的特征图所需存储空间是否还大于SRAM 308的可用空间,如是,则再次执行步骤1002,继续往下拆分。
如处理装置203判断拆分后的特征图所需存储空间不大于SRAM 308的可用空间时,表示SRAM 308可以一次性地载入拆分后的特征图,则执行步骤1003,处理装置203设定拆分后的特征图为片上单元图。
最后执行步骤1004,处理装置203根据片上单元图的尺寸决定模板融合单元。此步骤将在后详细说明。
在其他应用场景下,当处理装置203在步骤1001与步骤1002间反复执行多次后,表示拆分后的特征图所需存储空间越来越接近SRAM 308的可用空间,举例来说,假设特征图所需存储空间为100k,SRAM 308的可用空间为40k,在步骤1001中,处理装置203判断特征图所需存储空间大于SRAM 308的可用空间,故执行步骤1002,沿着N维度拆分为一半,这时拆分后的特征图为50k,接着回到步骤1001,拆分后的特征图所需存储空间还是大于SRAM 308的可用空间,继续执行步骤1002,沿着N维度再拆分为一半,这时拆分后的特征图为25k,接着回到步骤1001,拆分后的特征图所需存储空间小于SRAM 308的可用空间,故执行步骤1003,处理装置203设定拆分后的特征图(尺寸为25k)为片上单元图。
SRAM 308的可用空间为40k,而片上单元图所需存储空间为25k,尚有15k的空间闲置,其原因在于步骤1002均以二分之一为单位进行拆分,以至于最后一次拆分时粒度太大。此实施例可以随着拆分的次数,逐渐缩小拆分的特定粒度,使拆分后的片上单元图所需存储空间尽可能接近SRAM 308的可用空间。例如,刚开始特定粒度可以设定为二分之一,接下来设定为四分之三,最后设定为五分之四。同样以特征图所需存储空间为100k,SRAM 308的可用空间为40k为例,在步骤1001中,处理装置203判断特征图所需存储空间大于SRAM 308的可用空间,故执行步骤1002,特定粒度设定为二分之一,拆分后的特征图为50k,接着回到步骤1001,拆分后的特征图所需存储空间还是大于SRAM 308的可用空间,继续执行步骤1002,这时特定粒度调整为四分之三,拆分后的特征图为37.5k,接着回到步骤1001,拆分后的特征图所需存储空间小于SRAM 308的可用空间,故执行步骤1003,处理装置203设定拆分后的特征图(尺寸为37.5k)为片上单元图。37.5k较25k更接近40k,后者的方式会更充分利用SRAM 308的可用空间,效率更高。此实施例不限制特定粒度的大小,可根据应用场景设定之。
在确定片上单元图的尺寸后,执行步骤1004,此步骤是根据融合策略动态融合神经网络,图11示出此实施例根据融合策略动态融合神经网络的方法。
在步骤1101中,根据融合策略的起始规则,选择模板融合单元的起始层。处理装置203根据融合策略的起始规则,选择模板融合单元的起始层,也就是在神经网络里尚未融合的层中选择开始融合的层。
在一种应用场景下,所述起始规则可以是起始层为神经网络中最前未被融合的层,处理装置203 会搜索出最前未被融合的层。以图6的AlexNet神经网络模型为例,共有23层,假设第1层至第5层已融合,则当起始规则是起始层为神经网络中最前未被融合的层时,处理装置203会选择第6层的ReLU激活层为起始层,向后融合(即向第7层的方向融合)。需注意的是,在此起始规则下,起始层不必然为卷积层或池化层。
在另一种应用场景下,考虑到卷积和池化层最消耗输入/输出资源,因此起始规则为起始层为最前未被融合的卷积或池化层,处理装置203会先找出神经网络模型中未融合层的所有卷积和池化层,从最前未被融合的卷积或池化层开始向后融合。同样以图6的AlexNet神经网络模型为例,假设第1层至第9层已融合,处理装置203会找出神经网络模型中未融合层的所有卷积和池化层,即第11层、第13层、第15层,接着从最前未被融合的卷积或池化层开始融合,也就是起始层为第11层。
在步骤1102中,以所述起始层为基准进行融合,逐一排查融合策略的所有规则,以建立模板融合单元。处理装置203以起始层为基准进行融合,逐一排查融合策略的所有规则,以建立模板融合单元。在满足所有规则的前提下,计算装置201的硬件资源足以支撑一次性地载入计算模板融合单元所需的数据,进而根据模板融合单元执行神经网络计算。除了前述的起始规则外,融合策略示例性地还可以包括以下规则:
规则一:向后融合
所谓的向后融合指的是自起始层往神经网络模型推理的方向融合,以图6为例,即是按着第一层→第二层→第三层的方向融合。如果起始层之前还有未融合层,则在此规则下这些未融合层将不被考虑纳入模板融合单元中。
规则二:优先向前融合
所谓的向前融合指的是自起始层往神经网络推理的反方向融合,以图6为例,则是按着第三层→第二层→第一层的方向融合。此规则通常与前述起始层为最前未被融合的卷积或池化层的起始规则搭配,原因在于所述的卷积或池化层前可能还有未被融合的层。在选定起始层后,处理装置203优先向前融合,试图把起始层前尚未被融合的层纳入模板融合单元中。同样以图6的AlexNet神经网络模型为例,假设第1层至第2层已融合,处理装置203发现最前未被融合的卷积或池化层是第5层,故起始层为第5层,优先向前融合第4层、第3层,如果还能继续融合,则接着向后融合第6层、第7层等。
规则三:优先以块结构为单位
当神经网络模型具有块结构时,此规则要求处理装置203优先以块结构而不是以层为单位增删模板融合单元,如果一整个块的运算逻辑融合不成功,才考虑从各个分支上的层进行融合。以图7的神经网络模型为例,处理装置203会优先考虑子网络701或子网络702为单位进行融合。
当神经网络为长链结构时,由于不存在块结构,故直接以层为单位增删模板融合单元。此规则不适用于长链结构的神经网络模型。
规则四:单分支输出
此实施例的融合策略不支持模板融合单元为多输出网络,其原因在于模板融合单元内部实现的形状推导主要采用从后向前推导的形式,多输出网络意味着需要从不同的输出分别向前推导,推导的结果不必然会归结到同一个特征图上,以至于无法收敛。
换言之,模板融合单元的输出需为单分支输出,也就是模板融合单元的最后一层只能具有一个输出。图7标示了子网络701的二种融合方式,第一种是将第一层至第五层融合成一个模板融合单元703,第二种是将第一层至第六层融合成一个模板融合单元704。由于第三层及第五层的输出都是模板融合单元703的输出,故模板融合单元703属于多输出网络,即多分支输出。而第六层的输出是模板融合单元704的输出,只产生一个输出数据,故模板融合单元704属于单输出网络,即单分支输出。处理单元203会判断模板融合单元的输出是否为单分支输出,如果此规则未被满足时,处理装置203增删模 板融合单元内的层直到此规则被满足。
规则五:包括至少2个主层
当层逻辑过于简单时,模板融合单元的性能还不如未融合的层的性能,故以层逻辑作为融合策略时,处理装置203会评估所融合的各层的运算是否足够复杂,使得融合产生效益。欲产生效益,就需要尽量将主层纳入模板融合单元,主层指的是矩阵乘、池化或卷积等耗费大量输入/输出资源的层,此处的池化包括各类池化,像是最大池化(maxpool)或均值池化(avgpool),卷积也包括各类卷积,像是普通卷积、带均值的卷积、分通道卷积(depthwise conv)等。此规则为模板融合单元包括至少2个主层。当处理单元203判断此规则未被满足时,处理装置203会调整模板融合单元直到此规则被满足。
规则六:包括主层、主层、非主层依次相邻的连续结构
此规则为模板融合单元需包括主层、主层及非主层的连续结构,即:主层、主层以及非主层依次相邻的连续结构。这样的运算足够复杂,使得融合具有效益。参阅图6中的第4层-第5层-第6层,其中第4层为最大池化层,第5层为卷积层,第6层为ReLU激活层,符合主层、主层、非主层依次相邻的连续结构,因此包括第4层、第5层、第6层的模板融合单元便可满足此规则。当处理单元203判断此规则未被满足时,处理装置203会调整模板融合单元直到此规则被满足。
规则七:包括标量计算层以及向量计算层相邻的连续结构
此规则为模板融合单元包括标量计算层以及向量计算层的连续结构,即:标量计算层、向量计算层依次相邻的连续结构。所述标量计算层指的是加法层、减法层或乘法层,所述向量计算层指的是激活层、批标准化层或缩放层。当处理单元203判断此规则未被满足时,处理装置203会调整模板融合单元直到此规则被满足。
规则八:卷积层的权值不为某个层的输出
此规则为模板融合单元中的卷积层的权值不为神经网络的任一层的输出,不论该层是否被纳入在模板融合单元。当处理单元203判断此规则未被满足时,处理装置203会将此卷积层自模板融合单元中移除。
规则九:卷积层的权值不与神经网络的任一层共用
由于模板融合单元涉及的神经网络模型中算子的权值具有特别的摆放形式,当被融合的卷积算子与其他算子共用权值时,权值的摆放逻辑会发生冲突,此规则为模板融合单元中的卷积算子的权值不与神经网络的任一层共用。当处理单元203判断此规则未被满足时,处理装置203会将此卷积算子自模板融合单元中移除。
规则十:权值不大于WRAM的可用空间
大图模式对于WRAM 432的限制较少,原因在于载入SRAM 308的片上单元图只是特征图的一部分,在计算模板融合单元时,WRAM 432只需要存放该特征图的所有权值即可。但由于小图模式可能会将多张特征图加载至SRAM 308,在这种情况下所需的权值会变多,便要谨慎评估WRAM 432的可用空间是否足够。此规则为片上单元图中的权值所需存储空间不大于WRAM 432的可用空间,当处理装置203判断此规则未被满足时,处理装置203会减少片上单元图的大小。
如果权值是基于C维度的输出通道参数Cout进行拆分,由于权值会被平均分配到多个处理器核306中,则本规则调整为:
Figure PCTCN2021119943-appb-000001
其中,W j为片上单元图j涉及的权值所需存储空间,n为集群中处理器核的数量,W为WRAM 432 的可用空间。
规则十一:冗余百分比
冗余百分比为当输入依赖运算与输出依赖运算所产生的冗余总和与模板融合单元正常输入/输出量的比例,此处正常输入/输出量指的是片上单元图在未被拆分前没有冗余的数据量。处理装置203会计算模板融合单元将当前层融合进来后,片上单元图从DRAM 204至SRAM 308的访存量size TFU,与正常输入/输出量(不含冗余)size ori的百分比,其中访存量size TFU指的是理论的访存量size ori加上冗余总和。其公式如下:
Figure PCTCN2021119943-appb-000002
处理装置203会将模板融合单元的拆分信息和形状推导计算在内,并设定百分比阈值为50%、75%、100%、125%或150%,较佳为100%。以百分比阈值为100%为例,表示当冗余总和大于模板融合单元正常输入/输出量的2倍时,便不再融合。此规则为拆分片上单元图所产生的冗余总和不超出与百分比阈值相关的特定比例,一旦超过,表示冗余部分过多,大量的资源将耗费在计算冗余上,效能下降,因此当处理装置203判断此规则未被满足时,处理装置203会停止融合。
需要注意的是,在小图模式下,由于从DRAM 204至SRAM 308过程一次加载至少一整张完整的特征图,故不会产生冗余。此规则不适用于小图模式。
规则十二:片上单元图输入输出尺寸
假设SRAM 308的空间尺寸为S,片上单元图所需存储空间为IN,片上单元图的计算结果所需存储空间为OUT,则此规则为SRAM 308的空间尺寸需要满足以下条件:
如果IN和OUT不能复用存储空间的话,IN+OUT<S
如果IN和OUT可以复用存储空间的话,MAX(IN,OUT)<S
即如果IN和OUT不能复用存储空间的话,片上单元图的存储空间与计算结果的存储空间之和小于SRAM 308的可用空间;如果IN和OUT可复用存储空间的话,片上单元图的存储空间与计算结果的存储空间较大者小于SRAM 308的可用空间。
规则十三:W i+IN1+IN2≤S
在小图模式下,此规则为SRAM 308的空间尺寸需要满足以下条件:
W i+IN1+IN2≤S
即子图i的权值所需存储空间W i、片上单元图所需存储空间IN1、缓存空间IN2的总和不大于SRAM 308的可用空间。当处理装置203判断此规则未被满足时,处理装置203减少片上单元图的数量直到此规则被满足。
规则十四:SubINi+W i+IN2≤S
在小图模式下,此规则为SRAM 308的空间尺寸需要满足以下条件:
SubINi+W i+IN2≤S
即子图i的所需存储空间SubINi、子图i的权值所需存储空间W i、缓存空间IN2的总和不大于SRAM 308的可用空间。当处理装置203判断此规则未被满足时,处理装置203减少片上单元图的数量直到所 述规则被满足。
规则十五:SubOUTi+W i+1+IN2≤S
在小图模式下,此规则为SRAM 308的空间尺寸需要满足以下条件:
SubOUTi+W i+1+IN2≤S
即子图i的中间结果所需存储空间SubOUTi、下一个子图的权值所需存储空间W i+1、缓存空间IN2的总和不大于SRAM 308的可用空间。当处理装置203判断此规则未被满足时,处理装置203减少片上单元图的数量直到所述规则被满足。
规则十六:W i+W i+1≤W
模板融合单元中参与卷积运算的权值会被独立搬运并驻留在WRAM 432上。在小图模式下,如果子图包括多张特征图,考虑到子图间的流水,WRAM 432最多同时存储相邻两个子图的权值。假设每个子图i的所需存储空间为W i,且WRAM 432的总空间为W,此规则为WRAM 432的空间尺寸需要满足以下条件:
W i+W i+1≤W
即子图i的权值所需存储空间W i、下一个子图的权值所需存储空间W i+1总和不大于WRAM 432的可用空间。当处理装置203判断此规则未被满足时,处理装置203减少片上单元图的数量直到所述规则被满足。
规则十七:子图所需存储空间不大于NRAM的可用空间
此规则为子图所需存储空间不大于NRAM 431的可用空间。当SRAM 308上的片上单元图要被拆分成子图搬运至NRAM 431时,处理装置203可以在N、H、W维度上进行细粒度拆分。如果NRAM 431的空间不足,处理装置203会把片上单元图拆分得更细,直到此规则被满足。一般来说,NRAM 431都会具有合理的可用空间,使得片上单元图被拆分到合理的程度便可一次性地被载入,就融合策略的角度来看,模板融合单元不会受到批处理数目的影响。然而,片上单元图被拆分的越小(即子图越多),处理速度会下降,故处理装置203需要评估NRAM 431的空间。
在一些实施例中,SRAM 308的空间与集群305内的处理器核306的NRAM 431的个数相对应,例如集群305包括4个处理器核306,则SRAM 308的空间为NRAM 431的空间的4倍。换言之,大图模式下的片上单元图一般能分配给4个处理器核306处理,这种架构设计已考虑载入SRAM 308的数据能一次性地分配给所有NRAM 431。因此在大图模式下不需要考虑此规则。
规则十八:特征图的数量不大于特征图阈值
在小图模式下,片上单元图可能会包括多张特征图,特征图越多,SRAM 308与NRAM 431间的子图传输次数就越多,效率便会下降,因此并非片上单元图包括的特征图越多越好,处理装置203会根据片上单元图中特征图的数量来计算合适的融合层数,使其效益最大化。此规则为片上单元图中的特征图的数量不大于特征图阈值,当处理装置203判断此规则未被满足时,处理装置203减少片上数据中特征图的数量直到所述规则被满足。
规则十九:步长冗余
步长冗余指的是:当模板融合单元融合层数太多,再加上卷积和池化的内核的长宽大于步长时,每个输出点需要的输入数据存在重叠部分,也就是前述的输入依赖运算,该重叠部分即为步长冗余。步长冗余使得每个处理器核306需要多读取一些数据,但是这一部分复用的数据会占去片上片外的访 问资源,模板融合单元包括的层数越多,步长冗余越严重。此规则为卷积层或池化层的内核的边长与步长的差值总和不大于冗余阈值。
在此实施例中,冗余阈值的定义如下。假设卷积和池化层的内核的长和宽为k x和k y,长和宽方向的步长分别为s x和s y,则长方向的步长冗余为模板融合单元内所有卷积及池化层的k x-s x的总和;同理,宽方向的步长冗余为模板融合单元内所有卷积及池化层的k y-s y的总和。此实施例的冗余阈值可以为3、4、5或6,较佳为4。只要长方向或宽方向任一方向的步长冗余大于冗余阈值,此规则便不被满足。处理装置203调整模板融合单元,通常为减少被融合的层数,直到此规则被满足。
融合策略对于步长冗余设定了例外规则。如欲融合的层里存在多分支且模板融合单元能融合整个多分支的前提下,模板融合单元的性能会表现的更为优异,在这种情况下,处理装置203会忽略步长冗余的规则,亦即步长冗余不会限制模板融合单元融合多分支,即在此实施例的融合策略中,融合多分支优先于步长冗余的限制。也就是说,步长冗余只有在单分支的情况下才会被考虑。
以上的规则仅为示例,本披露并不限制各规则执行的顺序,亦不限制这些规则需同时被考虑,本领域技术人员在不同的应用场景下可以根据实际情况增删规则,以实现符合当下应用场景的融合策略。
回到图11,在步骤1103中,根据建立后的模板融合单元执行神经网络计算。计算装置201基于片上系统-集群-处理器核的三级运算层次,搭配DRAM-SRAM-NRAM/WRAM这样的三层内存设计,将模板融合单元视为神经网络中一个自定义的层,一次性地自DRAM 204载入计算模板融合单元所需的数据至SRAM 308,使得数据能够在适当的层级里缓存并计算,形成充分的流水,计算完成后再将计算结果自SRAM 308传送至DRAM 204,大大减少神经网络计算中的输入/输出开销。
当计算机视觉、语音、自然语言处理、数据挖掘等领域的输入数据欲进行各类深度学习和机器学习算法时,本披露基于模板融合单元,可以减少神经网络计算中的输入/输出开销。本披露的另一个实施例是一种利用模板融合单元执行神经网络计算的方法。图12示出其流程。
在步骤1201中,根据融合策略决定模板融合单元。处理装置203根据融合策略的起始规则,选择模板融合单元的起始层;并以所述起始层为基准进行融合,逐一排查融合策略的所有规则,以建立模板融合单元。前一个实施例已详细示例说明融合策略的各种规则,不再赘述。
在此步骤中,模板融合单元会以源代码的方式展现,接下来需要通过编译器将源代码转换成机器语言的目标代码(object code),又称作机器代码。以下多个步骤即为编译器将模板融合单元的源代码转换成机器语言的目标代码的过程。
在步骤1202中,推导模板融合单元的形状。对于模板融合单元需要处理的数据,此实施例采用的是逆推的方式,编译器从输出向前逆推出需要多少尺寸的输入,以图8为例,便是从特征图803逆向推导至特征图802,再逆向推导至特征图801。在此步骤中,编译器不仅根据模板融合单元推导所需输入数据,还会进一步推导冗余。
接下来执行步骤1203,推导地址。根据模板融合单元的形状,编译器对整个控制流图进行片上存储空间地址的推导,并且实现通用地址的访问,以达到精简计算资源、缩短计算时间的目的。控制流图是用在编译器中的一种抽象数据结构,代表了一个程序可能会执行的所有路径,以流程图的形式反映过程内所有节点的可能流向。控制流图是由节点和节点间的关系所组成的。节点又称为基本块(basic block,BB),是程序中最大限度顺序执行的语句序列,每个基本块只有一个入口和出口,执行时从其入口进入,自其出口退出。基本块的特点是只要第一条指令被执行了,那么基本块内所有指令都会按照顺序被执行。
每个基本块包含至少一条指令,基本块中的指令可能使用指针指向特定的片上存储空间。指针是一种变量,用以保存特定地址空间的地址。通过指针,处理器核306可以将数据载入到指针指向的特定地址的空间中,或是从指针指向的特定地址中的数据取出。
编译器根据模板融合单元的划分情况,初始划分基本块,再经过迭代运算后,确认基本块及相互关系,至此完成实现模板融合单元的目标代码。
不仅如此,编译器还会针对神经网络中前后两个模板融合单元的复用数据进行分析,判断前一次模板融合单元中有多少数据可以留在片上供下一个模板融合单元使用,并根据判断结果规划各数据的存储地址。
在此步骤中,编译器完成控制流图中地址的推导。
在步骤1204中,分配片上存储空间。处理装置203基于模板融合单元地址的推导,分配SRAM 308、NRAM 431及WRAM 432的物理空间。在此步骤中,编译器完成控制流图中指针的指向。
最后执行步骤1205,生成可执行指令。在此步骤中,链接器(linker)将编译器所生成的目标代码外加库链接,使其成为一个可执行文件。更详细来说,目标代码是包括机器码和链接器可用信息的程序模块,链接器的工作是解析未定义的符号引用,将目标代码中的占位符替换为符号的地址,进而生成可执行指令。可执行指令可直接被计算装置201执行,以完成神经网络的计算。
本披露通过设定融合策略,动态地决定模板融合单元,融合神经网络中的多个层,以形成新的自定义的层,一次性载入计算模板融合单元所需的数据,以减少输入/输出开销。
以前述融合策略的规则来决定模板融合单元时,不必然要以卷积层或是池化层为起始展开融合。前述实施例提及在一种应用场景下,起始规则可以是起始层为神经网络中最前未被融合的层,这层可以是卷积层或池化层以外的层。这样的起始规则使得模板融合单元的建立更为弹性,能够针对不同的神经网络,基于各层的排序,适当地选择起始层开始融合,不受卷积层或是池化层在神经网络模型中的位置和数量所限,进而适应各种网络模型,让融合更加全面,提升整体效益。
举例来说,以图6的神经网络模型为例,假设第1层至第5层已融合完毕,在建立下一个模板融合单元时,如果起始规则采用起始层为最前未被融合的卷积或池化层,则下一个卷积或池化层为第8层,换言之,第6层及第7层可能不会被融合,影响整体效益。
本披露的另一个实施例为一种融合神经网络的方案,其起始层为除卷积层及池化层之外的层,即非卷积层和非池化层。此实施例同样基于图1至图4的框架来实现。此实施例同样执行如图11所示的流程图。
在步骤1101中,根据融合策略选择起始层。处理装置203根据融合策略选择起始层,例如融合策略的起始规则是起始层为神经网络中最前未被融合的层,这层是卷积层或池化层以外的层。
需注意的是,此步骤不采用起始规则为起始层为最前未被融合的卷积或池化层,如果按此起始规则选择起始层,便会限制起始层必须为卷积或是池化层,此实施例不受卷积层或是池化层在神经网络模型中的位置和数量所限的优势便不存在了。
在一种应用场景下,起始层可以为元素对元素(element-wise)层,又称逐元素层,该层是对向量的每一个元素进行操作,此类操作的输入数据与输出数据形状一致。元素对元素层包括以下几类:
1.基本运算:向量加、向量减、向量乘等
2.进阶运算:绝对值、平方根、除法、指数、求余、求幂等
3.三角函数运算
4.取整运算:上取整、四舍五入、下取整、只保留整数等
5.激活函数:sigmoid、tanh、ReLU等
在另一种应用场景下,起始层可以为添加填充(addpadding)层。添加填充是为了不丢弃原图信息,并保持输入数据的大小与原图一致,在输入数据周围添加空白信息的元素。
在另一种应用场景下,起始层可以为自定义层。随着深度学习的发展以及神经网络的复杂化,公知或是标准的算子已不敷使用,越来越多自定义运算规则的算子应用于神经网络中,此实施例可以选择自定义层作为起始层。
在另一种应用场景下,此实施例的融合策略的起始规则使得处理装置203进一步判断神经网络是否包括块结构。如不包括,表示神经网络为长链式结构,处理装置203根据前述起始规则选择神经网络中最前未被融合的层即可;如包括,此实施例参考前述规则三,优先以块结构为单位进行融合,故处理装置203接着判断块结构中的最前层是否为除卷积层及池化层之外的层。如是,则处理装置203以所述最前层为起始层。
当处理装置203判断最前层为卷积层及池化层其中之一时,处理装置203可以直接选择所述卷积层或池化层为起始层,或是向前选择最接近所述最前层的除卷积层及池化层之外的层为起始层。图13 示出具有块结构的神经网络模型,该示例性神经网络模型包括子网络1301及子网络1302。子网络1301包括第一层到第六层,子网络1302包括第八层到第十一层,子网络1301及子网络1302以第七层连接。假设子网络1301已融合完毕,在融合子网络1302时,根据前述规则,处理装置203判断子网络1302的最前层(即第八层)是否为除卷积层及池化层之外的层。如果是,则直接选择第八层为起始层进行融合;如果第八层是卷积层或池化层,处理装置203同样可以选择第八层为起始层,或是向前选择最接近最前层的除卷积层及池化层之外的层为起始层,最接近第八层的前一层为第七层,第七层尚未被融合,且假设第七层不是卷积也不是池化层,则处理装置203选择第七层为起始层。如果第七层还是卷积或池化层,则此实施例可以选择从第七层或第八层作为起始层。
此实施例会优先融合整个块结构,以提升融合效益。但在特定应用场景下,处理装置203无法向前选择最接近最前层的除卷积层及池化层之外的层为起始层。以图7的神经网络模型为例,假设子网络701已融合完毕,在融合子网络702时,如果第七层为卷积或池化层,而在子网络701已融合的情况下,处理装置203无法向前选择最接近最前层的除卷积层及池化层之外的层为起始层,这时处理装置203改向后选择最接近最前层的除卷积层及池化层之外的层(即第八层)为起始层,但如此便无法将整个块结构纳入模板融合单元中。由于以第八层作为起始层的融合效果不理想,处理装置203还可以直接选择以第七层作为起始层。
在选择起始层后,接着执行步骤1102,基于起始层建立模板融合单元。处理装置203可以根据前述实施例中例举的各规则(规则一至规则十九)建立模板融合单元,这些规则仅为示例,此实施例并不限制各规则执行的顺序,亦不限制这些规则需同时被考虑,本领域技术人员在不同的应用场景下可以根据实际情况增删规则,以实现符合当下应用场景的融合策略。
步骤1101及步骤1102对应至步骤1201的根据融合策略决定模板融合单元。接着编译器推导模板融合单元的形状(步骤1202)、推导地址(步骤1203)、分配片上存储空间(步骤1204),最后由链接器生成可执行指令(步骤1205)。
在步骤1103中,根据建立后的模板融合单元执行神经网络计算。计算装置201执行前述可执行指令,以根据模板融合单元执行神经网络计算。
此实施例的起始层可以为除卷积及池化外的层,这样的起始规则使得模板融合单元的建立更为弹性,能够针对不同的神经网络,适当地选择起始层开始融合,不受卷积层或是池化层在神经网络模型中的位置和数量所限,进而适应各种网络模型,让融合更加全面,提升整体效益。
在现代的神经网络模型中,不必然各层的输入/输出特征图都具有如图8所示的倒金字塔形式,有些层的输入特征图尺寸会小于特征图数据尺寸,这种层常应用在计算机视觉的深度学习领域,在特定情境下,会需要将图像恢复到原来的尺寸以便进行进一步的计算。在计算这种层时,图像尺寸被扩大,以实现图像由小分辨率到大分辨率的映射的操作。图14示出这种层的示意图,从图14可知,输入/输出特征图会产生有如正金字塔的形式,这种层在本披露中称为正金字塔层,而图8中输入特征图大于输出特征图的层称为倒金字塔层。
实务上,正金字塔层包括反卷积层(deconvolution layer)、上池化层(unpooling layer)或上采样层(unsampling layer)。
反卷积又称为转置卷积或空洞卷积,它并不是正向卷积的完全逆过程,反卷积是一种特殊的正向卷积,需要有参数参与计算,而参数是要进行训练学习的。反卷积是先按照一定的比例通过补0,来扩大输入图像的尺寸,接着旋转卷积核,再进行正向卷积。
上池化操作分为最大池化的上池化操作及平均池化的上池化操作。最大池化的上池化会保留最大值的位置信息,然后在其余位置补0,如图15A所示,图中示出最大池化层1501,其输入特征图1502经过最大池化层1501后产生输出特征图1503,图中还示出最大池化的上池化层1504,其输入特征图1503经过上池化层1504后产生输出特征图1505,输出特征图1505的尺寸大于输入特征图1503的尺寸。平均池化的上池化则是将平均值都填入其对应原始数据区域中相应位置即可,如图15B所示,图中示出平均池化层1506,其输入特征图1507经过平均池化层1506后产生输出特征图1508,图中还示出平均池化的上池化层1509,其输入特征图1508经过上池化层1509后产生输出特征图1510,输出特征图1510的尺寸大于输入特征图1508的尺寸。
上采样是直接在对应原始数据区域根据内核来扩充特征图。图16示出上采样的示意图,图中输入特征图1601经过最大池化层(未绘出)后产生中间特征图1602,中间特征图1602经过上采样层(未绘出)的内核1603扩充后,得到输出特征图1604,输出特征图1604的尺寸大于中间特征图1602的尺寸。
前述这些算子的特征都是在于输入特征图小于输出特征图。另外,可能还存在用户自定义层同样具有输入特征图小于输出特征图的特征。本披露的另一个实施例是一种融合神经网络的方案,此方案同样具有图1至图4所示的框架,其融合策略可以融合正金字塔层。图17示出此实施例融合如图18所示的神经网络的流程图,图18为示例性的神经网络模型,共有14层,其中第一段1801包括第1层至第4层,为倒金字塔层,第二段1802包括第5层至第9层,为正金字塔层,第三段1803包括第10层至第14层,为倒金字塔层。
在步骤1701中,根据融合策略建立模板融合单元。处理装置203先根据融合策略的起始规则,选择模板融合单元的起始层,在一种应用场景中,其起始规则可以是最前未融合的层为模板融合单元的起始层。假设第1层至第3层已被融合,则第4层为这次模板融合单元的起始层,并以第4层起向后融合,逐一排查融合策略的所有规则,以建立模板融合单元。首先将正金字塔层的第5层也融合进去,如果还能继续融合,则处理装置203继续往后融合。在另一种应用场景中,起始规则可以是所有未融合层中最前的正金字塔层为模板融合单元的起始层。同样假设第1层至第3层已被融合,则第5层是最前的正金字塔层,故第5层为这次模板融合单元的起始层,向后融合。
此实施例不限制正金字塔层和倒金字塔层的融合方式,可以全是正金字塔层融合在一块,例如模板融合单元包括第5层至第9层,也可以混合融合在一块,例如模板融合单元包括第3层至第6层,或是模板融合单元包括第9层至第12层等。换言之,模板融合单元可以只包括正金字塔层,也可以包括倒金字塔层加上正金字塔层,或是正金字塔层加上倒金字塔层。
不仅如此,正金字塔层与倒金字塔层可以在模板融合单元中相邻,例如第4层与第5层、第9层与第10层。
当神经网络为块结构式,而该块结构包括正金字塔层时,此实施例的融合策略的规则为以块结构为单位增删模板融合单元。
在此实施例中的主层定义为矩阵乘、池化、卷积、反卷积、上池化及上采样层,当神经网络包括多个主层时,融合策略的规则为模板融合单元包括至少2个主层,当处理单元203判断所述规则未被满足时,处理装置203调整模板融合单元直到所述规则被满足。此实施例还可以包括另一个融合策略的规则为模板融合单元包括主层+主层+非主层的连续结构,当处理单元203判断所述规则未被满足时,处理装置203调整模板融合单元直到所述规则被满足。
在步骤1702中,推导模板融合单元的形状。接下来执行步骤1703,推导地址。在步骤1704中,分配片上存储空间。在步骤1705中,生成可执行指令。这些步骤与步骤1202至1205无异,不再赘述。
最后执行步骤1706,根据模板融合单元执行神经网络计算。计算装置201执行前述可执行指令,以根据模板融合单元执行神经网络计算。
此实施例可以融合正金字塔层和倒金字塔层,这样的融合策略使得模板融合单元的建立更为弹性,不受输入特征图及输出特征图尺寸的限制,进而适应各种网络模型,让融合更加全面,提升整体效益。
在生成可执行指令后,计算装置201便可以根据可执行指令以模板融合单元为单位来推理神经网络。本披露的另一个实施例是一种基于可执行指令计算神经网络的方案,此方案同样具有图1至图4的架构,用以计算模板融合单元的图,其实施如图19所示的流程。
在步骤1901中,存储神经网络的特征图。如前述实施例中所描述,处理装置203根据融合策略融合神经网络的多层,以产生模板融合单元,并基于各规则适当拆分特征图成片上单元图。
更详细来说,当处理装置203在图12的步骤1201中根据融合策略决定模板融合单元,且判断特征图大于SRAM 308的可用空间时,即大图模式,需要将特征图拆分,使其能多次性地载入至SRAM 308中。拆分的方式可以在N、H、W维度至少其中之一进行特定粒度的拆分,在此实施例中,特定粒度可以是但不限于为二分之一。而当处理装置203判断特征图不大于SRAM 308的可用空间时,即小图模式,片上单元图可能包括单个或多个特征图,取决于SRAM 308的可用空间能载入多少特征图。在 前述实施例中已就大图模式及小图模式描述过将特征图转换为片上单元图的技术细节,不再赘述。
欲进行神经网络计算的特征图均存储在DRAM 204中。
在步骤1902中,载入片上单元图。由于可执行指令是基于模板融合单元计算神经网络的,因此当计算装置201执行可执行指令时,便是根据模板融合单元进行神经网络计算,而不是根据神经网络的各层逐层计算。可执行指令载有如何拆分特征图成片上单元图的信息,也就是载有片上单元图的地址信息,SRAM 308根据地址信息,通过GMDA 311自DRAM 204的适当地址载入片上单元图。
在步骤1903中,载入子图。NRAM 432通过MVDMA 434载入子图。以1个集群305包括4个处理器核306为例,片上单元图会被拆分为4个子图,集群305中的一个处理器核306将片上单元图在N、H、W维度至少其中之一进行特定粒度的拆分成4个子图,通过MVDMA 434分别发送到每个处理器核306的NRAM 432。在此实施例中,特定粒度可以是但不限于为二分之一。
在步骤1904中,计算子图并产生对应的中间结果。每个处理器核306的运算模块42自NRAM 431取出子图进行计算,产出中间结果后再存回NRAM 431中。需注意的是,由于每个处理器核306分配到的子图属于片上单元图的不同部分,因此每个中间结果也反映了计算结果的一部分。
在步骤1905中,归约中间结果,以产生对应片上单元图的计算结果。归约指的是将中间结果结合起来成为计算结果,也就是前述的输出依赖运算。广播总线309传送每个处理器核306的中间结果至下一个处理器核306,处理器核306将前一个处理器核306的中间结果与所存储相对应的中间结果进行计算,以产生计算结果。归约有多种方式可以实现,以下以环形全归约(ring allreduce)为例说明此实施例如何进行归约。
图20显示环形全归约框架。环形全归约框架2000示例性展示一个计算装置201中的4个集群:第一集群2001、第二集群2002、第三集群2003及第四集群2004,每个集群包括4个处理器核。环形全归约框架2000将这些集群组织成一个逻辑环路。每个集群只与前一个集群和下一个集群连接,并往同一个方向接收及发送数据。如图20中箭头方向所示,每个集群以顺时针方向从前一个集群处接收数据,计算后向下一个集群发送数据,其数据传输在同步模块304的控制协调下通过CDMA 310来进行,这样的框架可以充分利用每个集群的输入/输出带宽。
这些处于逻辑环路中的多个集群如图21所示。在进行环状全归约前,各集群内的处理器核已完成该核的子图计算工作,也就是产出了中间结果,分别存储在每个处理器核的NRAM中,以第一集群2001为例,其4个处理器核分别生成中间结果a 0、b 0、c 0、d 0
接下来执行归约程序,这些集群将进行N-1次(在此处N为4)的归约迭代。在每次迭代中,这些集群将向下一个集群发送全部中间结果,并从前一个集群接收所有中间结果进行计算,每个集群发送和接收的中间结果在每次迭代中都是不同的。
为说明方便,在此假设输出依赖运算只需要将这些中间结果相加便生成计算结果。图22A示出在第一次迭代时,第一集群2001的中间结果a 0被传送至第二集群2002与中间结果a 1相加,第二集群2002的中间结果b 1被传送至第三集群2003与中间结果b 2相加,第三集群2003的中间结果c 2被传送至第四集群2004与中间结果c 3相加,第四集群2004的中间结果d 3被传送至第一集群2001与中间结果d 0相加。
图22B示出在第二次迭代时,第二集群2002的中间结果a 0+a 1被传送至第三集群2003与中间结果a 2相加,第三集群2003的中间结果b 1+b 2被传送至第四集群2004与中间结果b 3相加,第四集群2004的中间结果c 2+c 3被传送至第一集群2001与中间结果c 0相加,第一集群2001的中间结果d 0+d 3被传送至第二集群2002与中间结果d 1相加。
图22C示出在第三次迭代时,第三集群2003的中间结果a 0+a 1+a 2被传送至第四集群2004与中间结果a 3相加,第四集群2004的中间结果b 1+b 2+b 3被传送至第一集群2001与中间结果b 0相加,第一集群2001的中间结果c 0+c 2+c 3被传送至第二集群2002与中间结果c 1相加,第二集群2002的中间结果d 0+d 1+d 3被传送至第三集群2003与中间结果d 2相加。
执行完前述加法计算后,可以得到如图23A所示的状态,每个集群都有一个处理器核执行了完整的归约计算,也就是将纵向相对应的中间结果全相加起来,例如:第一集群2001的第二个处理器核载有计算结果b 0+b 1+b 2+b 3,第二集群2002的第三个处理器核载有计算结果c 0+c 1+c 2+c 3,第三集群2003 的第四个处理器核载有计算结果d 0+d 1+d 2+d 3,第四集群2004的第一个处理器核载有中间结果a 0+a 1+a 2+a 3
为了实现全归约,集群必须交换这些计算结果,使得所有集群都具有相同的最终值,这步骤称为全集(allgather)。全集程序的过程与归约程序的流程相近,也就是再进行N-1次的迭代,但集群接收的数值不累加,而是进行覆盖,最后会得到如图23B所示的结果,所有的处理器核都载有完整的计算结果,这些计算结果会被存储在SRAM 308中。
以上的环状全归约操作仅用以说明此实施例归约的一种实施方式,本披露不限制归约的方式。
最后执行步骤1906,将计算结果存回。SRAM 308通过GDMA 311将计算结果存回至DRAM 204。这些计算结果是集群计算片上单元图后的结果。至此计算装置201完成片上单元图的计算。
此实施例基于可执行指令计算神经网络,其可执行指令是根据模板融合单元而不是神经网络的各层进行计算的,减少了片上片外的输入/输出消耗,提升计算功效。
如前述融合策略的规则二中提到的,本披露可以选择优先向前融合。向前融合指的是自起始层往神经网络推理的反方向融合,也就是往神经网络起点的方向进行融合,图24示出一种示例性的长链式神经网络,共有14层。本披露的另一个实施例为利用图1至图4的框架实现向前融合神经网络的方法,所述的神经网络示例性地是图24所示的长链式神经网络。该方法如图25所示。
在步骤2501中,根据融合策略选择融合的起始层。先参考神经网络241,处理装置203根据融合策略选择融合的起始层。为方便说明,假设图24中的第1层至第5层已融合成模板融合单元2401,且此实施例的融合策略的规则之一是起始层为最前未被融合的卷积或池化层。在此步骤中,当处理装置203执行融合时,判断未融合的层中有哪些是卷积层或池化层,如图所示,第8层为最大池化层,第9层为卷积层,因此最前未被融合的卷积或池化层为第8层,处理装置203将第8层设定为本次融合的起始层。
在步骤2502中,向神经网络的起点方向进行融合,以建立模板融合单元。在此实施例中,模板融合单元内的各层需为连续,不得越过已融合的层去融合未融合的层,也就是模板融合单元内的各层需为连续不间断的未融合层。以第8层为起始层,向神经网络241的起点方向进行融合即是把第7层纳入模板融合单元中,处理装置203判断第7层是否为未融合层,由于仅有第1层至第5层已融合成模板融合单元2401,故第7层是未融合层,处理装置203设定第7层(局部归一层)与第8层(最大池化)进行融合,即模板融合单元2402。
融合时,处理装置203视模板融合单元2402中的最前层为模板融合单元2402的输入层,即第7层为输入层,并视最后1层为模板融合单元2402的输出层,即起始层第8层为输出层,处理装置203基于输入层及输出层进行金字塔融合。更详细来说,模板融合单元2402基于如图8所示的倒金字塔数据结构,以第7层的输入为模板融合单元2402的输入,以第8层的输出为模板融合单元2402的输出,从输出数据往回推导至输入数据,第7层至第8层间的中间数据存储在SRAM 308中不存回DRAM 204。在此原则下,根据前述实施例提及的融合策略的各规则做判断,以决定第7层加上第8层是否满足规则,可以成为模板融合单元。
假设模板融合单元2402满足融合策略的所有规则,接着处理装置203继续向神经网络241的起点方向进行融合,即试图把第6层(ReLU激活层)也纳入模板融合单元中,即模板融合单元2403。模板融合单元2403也具有如图8所示的倒金字塔数据结构,以第6层的输入为模板融合单元2403的输入,以第8层的输出为模板融合单元2403的输出,第6层至第7层间和第7层至第8层间的中间数据均存储在SRAM 308中不存回DRAM 204,根据前述实施例提及的融合策略的各规则做判断,以决定第6层至第8层是否满足规则,可以成为模板融合单元。
假设模板融合单元2403亦满足融合策略的所有规则,接着处理装置203再向神经网络241的起点方向进行融合,即试图把第5层也纳入模板融合单元中。处理装置203会判断新加入的层是否已被融合,由于第5层已被融合成模板融合单元2401,故处理装置203不会将第5层纳入,至此停止融合,此阶段的模板融合单元建立完毕,即模板融合单元2403。
整个神经网络241都会基于前述的方式进行融合,神经网络242示出一种可能的最终融合结果,原本整个神经网络242包括14层,即14个算子,在融合完成后,变成由模板融合单元2401、模板融 合单元2403、模板融合单元2404、模板融合单元2405所组成的4层自定义层,即4个自定义算子。
回到图25,在步骤2503中,根据模板融合单元执行神经网络计算。在神经网络242中,计算装置201根据模板融合单元2401、模板融合单元2403、模板融合单元2404、模板融合单元2405所组成的4层自定义层,执行神经网络计算。换言之,计算装置201在执行神经网络计算时,是执行前述4层自定义层,来取代执行原本的14层,进而达到减少输入/输出开销,提升资源效益的技术功效。
在计算神经网络时,由于模板融合单元包括多个层,以模板融合单元为单位进行计算时,本披露会将所需的权值一次性地从DRAM 204加载至SRAM 308中。以一个模板融合单元包括第一卷积层与第二卷积层为例,在计算该模板融合单元时,处理装置203不单单将第一卷积层的权值载入至SRAM 308中,还会将第二卷积层的权值一并载入。更详细来说,当处理器核306在计算第一卷积层时,第二卷积层的权值已经存储在SRAM 308中,一旦第一卷积层计算完毕,第二卷积层的权值可以立即从SRAM 308载入至WRAM 432中,以提升权值加载的速度。
不仅如此,WRAM 432同样可以预加载权值。如果WRAM 432足够大,第一卷积层与第二卷积层的权值可以一次性地从SRAM 308加载至WRAM 432中,当第一卷积层计算完毕时,第二卷积层的权值不需要从SRAM 308载入至WRAM 432中,运算模块42直接从WRAM 432读取第二卷积层的权值计算,更加地降低权值加载的时间,提升整体运行的速度。
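下面以一段示意性的Python草图说明上述权值预加载的数据流向。load_from_dram、copy_to_wram、compute等接口均为假设,仅用于表达"以模板融合单元为单位一次性载入共享存储、逐层就近取用"的思路,并非实际的搬运指令。

```python
# 权值预加载的示意性草图:以模板融合单元为单位,一次性把各卷积层的权值
# 从片外内存载入共享存储(对应 SRAM 308),计算某一层时即可直接就近取用。
def run_tfu_with_preload(conv_layers, load_from_dram, copy_to_wram, compute):
    # 一次性载入模板融合单元内所有卷积层的权值(假设每层有 name 与 weights 字段)
    sram_weights = {layer.name: load_from_dram(layer.weights) for layer in conv_layers}
    for layer in conv_layers:
        wram = copy_to_wram(sram_weights[layer.name])   # 可在前一层计算期间提前执行
        compute(layer, wram)                            # 运算模块直接读取 WRAM 中的权值
```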
本披露另一个实施例为利用图1至图4的框架实现双向融合神经网络的方法,所述的神经网络同样以图24的长链式神经网络为例,另展示于图26进行说明。
双向融合指的是可以向前也可以向后进行融合。该方法如图27所示,融合策略同时向前亦向后进行融合,以建立模板融合单元,再根据模板融合单元进行神经网络计算。同样地,假设图26中的第1层至第5层已融合成模板融合单元2601,且此实施例的融合策略的起始规则是起始层为最前未被融合的卷积或池化层。
在步骤2701中,处理装置203根据融合策略选择融合的起始层。处理装置203判断最前未被融合的卷积或池化层为第8层的最大池化层,因此处理装置203将第8层设定为本次融合的起始层。
在步骤2702中,接着向神经网络的起点方向进行融合。处理装置203向前将第7层纳入模板融合单元中,第7层成为新加入的层。
在步骤2703中,处理装置203判断新加入的层是否为未融合层。第7层是未融合层,执行步骤2704,处理装置203设定第7层与第8层为模板融合单元2602。
接着执行步骤2705,处理装置203判断模板融合单元2602是否符合融合策略的规则。融合时,处理装置203视模板融合单元2602中的最前层为模板融合单元2602的输入层,即第7层为输入层,并视起始层为模板融合单元2602的输出层,即第8层为输出层,处理装置203基于输入层及输出层进行金字塔融合。
如符合融合策略的规则,则执行步骤2706,处理装置203自起始层向神经网络的终点方向进行融合,也就是自第8层起,先融合第7层,在此步骤中跳转往后融合第9层,以形成模板融合单元2603。这种往前往后跳转融合的方式称为跳跃式融合。
在步骤2707中,处理装置203判断模板融合单元2603是否符合融合策略的规则。融合时,处理装置203视模板融合单元2603中的连续各层的最前层为模板融合单元2603的输入层,即第7层,而向后跳跃的最后一层为模板融合单元2603的输出层,即第9层,处理装置203基于输入层及输出层进行金字塔融合。
如符合融合策略的规则,回到步骤2702,再向神经网络的起点方向进行融合,处理装置203将第6层纳入模板融合单元中。在步骤2703中,处理装置203判断新加入的层是否为未融合层。第6层是未融合层,故执行步骤2704,处理装置203设定第6层与第9层为模板融合单元2604。
接着执行步骤2705,处理装置203判断模板融合单元2604是否符合融合策略的规则。融合时,处理装置203视模板融合单元2604中的最前层为模板融合单元2604的输入层,即第6层为输入层,并视向后跳跃的最后一层为模板融合单元2604的输出层,即第9层,处理装置203基于输入层及输出层进行金字塔融合。
如符合融合策略的规则,执行步骤2706,处理装置203向神经网络的终点方向进行融合,这时跳转融合第10层,以形成模板融合单元2605。在步骤2707中,处理装置203判断模板融合单元2605是否符合融合策略的规则。融合时,处理装置203视模板融合单元2605中的连续各层的最前层为模板融合单元2605的输入层,即第6层,而向后跳跃的最后一层为模板融合单元2605的输出层,即第10层,处理装置203基于输入层及输出层进行金字塔融合。
如符合融合策略的规则,再回到步骤2702,向神经网络的起点方向进行融合,处理装置203将第5层纳入模板融合单元中。在步骤2703中,处理装置203判断第5层是否为未融合层。由于第5层已被融合进模板融合单元2601中,故执行步骤2708,处理装置203停止融合。在步骤2705及步骤2707中,当处理装置203判断模板融合单元不符合融合策略的规则时,同样执行步骤2708,处理装置203停止融合。至此,处理装置203建立了模板融合单元。
最后执行步骤2709,计算装置201根据建立好的模板融合单元进行神经网络计算。
在另一种应用场景下,如果在步骤2703中处理装置203判断新加入的层已被融合,处理装置203可以跳转向神经网络的终点方向进行融合。举例来说,当处理装置203判断第5层已被融合时,可以直接执行步骤2706,处理装置203向神经网络的终点方向进行融合,跳转融合第11层,也就是新的模板融合单元包括第6层至第11层,依此往后融合,直到融合策略不再满足为止。
在另一种应用场景下,此实施例的跳跃式融合可以先往后融合,再往前融合,依序跳跃。同样以图26的第8层为起始层为例,处理装置203首先往后选择融合第9层,接着往前跳跃融合第7层,再往后跳跃融合第10层,以此类推。本披露并不限制前后跳跃融合的先后顺序。
此实施例说明了跳跃式融合的运作模式。可以理解地,前述的跳跃式融合是以每融合一层往前或往后跳跃一次,如图26左侧的箭头所示。本领域技术人员可以在本披露的范围内轻易地调整跳跃方式,每融合n层跳跃一次,其中n为自然数。例如每融合二层往前或往后跳跃一次或每融合三层往前或往后跳跃一次,这样的调整均涵盖在本披露的揭露范围中,亦在本披露的保护范围中。
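为便于理解跳跃式融合的候选层顺序,下面给出一段示意性的Python草图,按"每融合n层跳跃一次"的方式生成前后交替的候选序列。skip_fusion_order及其参数均为假设名称,仅用于说明候选顺序的产生方式,不代表融合策略本身。

```python
# 跳跃式融合顺序的示意性草图:以起始层为中心,每连续取 n 层就跳跃到另一个方向,
# 产生前后交替的候选层序列(下标以 0 起算)。
def skip_fusion_order(start, total_layers, n=1, backward_first=False):
    order = []
    front, back = start - 1, start + 1      # 分别指向起点方向与终点方向的下一个候选层
    go_back = backward_first
    while front >= 0 or back < total_layers:
        for _ in range(n):                   # 每个方向连续取 n 层后再跳跃
            if go_back and back < total_layers:
                order.append(back)
                back += 1
            elif not go_back and front >= 0:
                order.append(front)
                front -= 1
        go_back = not go_back
    return order

# 以第8层(下标7)为起始层、共14层、每融合一层跳跃一次为例:
# skip_fusion_order(7, 14) 依次给出 6, 8, 5, 9, 4, 10, ... 的候选顺序,
# 对应文中先融合第7层、再跳转融合第9层、再融合第6层的过程。
```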
本披露的另一个实施例是一种利用图1至图4的框架实现双向融合神经网络的方法,所述神经网络示例性地具有如图28所示的块结构。此实施例的融合策略的起始规则同样是起始层为最前未被融合的卷积或池化层,自起始层向神经网络的起点方向及终点方向进行跳跃式融合,以建立模板融合单元,再根据模板融合单元进行神经网络计算。此外,由于此神经网络为块结构,此实施例的融合策略的规则之一是以块结构为单位融合。以下将进一步说明决定模板融合单元的方式。
首先,处理装置203根据融合策略选择融合的起始层,并自起始层向神经网络的起点方向进行融合。假设最前未被融合的卷积层或池化层为第七层,因此处理装置203将第七层设定为本次融合的起始层,并往前将第六层纳入模板融合单元中。虽然第六层是未融合层,可以融合,但处理装置203判断出第六层属于块结构2801。根据融合策略,处理装置203需以块结构2801为单位进行融合,因此处理装置203一次性地将第一层至第六层全部纳入,形成模板融合单元2802。
接着,处理装置203判断模板融合单元2802是否符合融合策略的其他规则。融合时,处理装置203视第一层为模板融合单元2802的输入层,并视第七层为模板融合单元2802的输出层,处理装置203基于输入层及输出层进行金字塔融合。此实施例可以参考规则一至规则十九选择合适的规则组成融合策略,例如规则五:包括至少2个主层、规则六:包括主层+主层+非主层的连续结构、规则七:包括标量计算层+向量计算层的连续结构等。
如模板融合单元2802符合融合策略的规则,接着处理装置203向神经网络的终点方向进行融合,即融合第八层。但第八层具有两个输出,使得模板融合单元成为多分支输出,不符合规则四,再者第八层属于块结构2803,处理装置203会将整个块结构2803融合进来,成为模板融合单元2804。处理装置203接着判断模板融合单元2804是否符合融合策略的规则。如果符合,则模板融合单元2804为最终的模板融合单元。计算装置201以模板融合单元2804进行神经网络计算。如果不符合,表示计算装置201的硬件条件不足以支撑一次性执行模板融合单元2804,这时处理装置203停止融合,建立起其中一个模板融合单元,即模板融合单元2802。
处理装置203会继续试图融合块结构2803成为另一个模板融合单元2805,假设模板融合单元2805符合融合策略,处理装置203便又建立另一个模板融合单元。
最后计算装置201根据建立好的两个模板融合单元,即模板融合单元2802与模板融合单元2805进行神经网络计算,相较于进行10层计算,大大减少了输入/输出消耗。
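下面以一段示意性的Python草图说明"以块结构为单位融合"的处理方式:当候选层属于某个块结构时,一次性把整个块纳入模板融合单元。expand_with_block与layer_to_block均为假设名称,仅作说明。

```python
# 以块结构为单位融合的示意性草图:layer_to_block 为假设的映射,
# 给出每层所属块结构包含的层编号集合;不属于任何块结构的层单独纳入。
def expand_with_block(tfu, candidate, layer_to_block):
    block = layer_to_block.get(candidate)
    new_layers = sorted(block) if block else [candidate]   # 属于块结构则整块一次性纳入
    return sorted(set(tfu) | set(new_layers))

block_map = {i: {1, 2, 3, 4, 5, 6} for i in range(1, 7)}   # 假设第1至第6层组成块结构2801
print(expand_with_block([7], 6, block_map))                # 输出 [1, 2, 3, 4, 5, 6, 7]
```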
本披露的另一个实施例是一种利用图1至图4的框架实现向前、向后、双向、跳跃式融合神经网络的方案。向前、向后、双向、跳跃式融合神经网络的方案已在前述多个实施例中描述,不再各别赘述。此实施例的融合策略具有多种融合弹性,针对同一个神经网络分别评估向前、向后、双向、跳跃式融合的各种模板融合单元方案的优劣,进而选择最佳方案作为模板融合单元。在此实施例中,所谓最佳方案可以是模板融合单元数量最少、主层融合最多、未被融合层最少或所占用片上存储空间最少等。由于此实施例可以接受多种融合方式,并从中选择最佳方案作为模板融合单元,因此此实施例能充分利用计算装置201的硬件环境,相较于前述的实施例,此实施例更能节省输入/输出损耗,进一步提升计算效率。
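下面给出一段示意性的Python草图,说明如何对多种融合方式产生的候选方案进行评估并择优。此处仅以"模板融合单元数量最少、未被融合层最少"为例,字段名与评分方式均为假设,实际可替换为主层数量或片上存储占用等指标。

```python
# 评估多种融合方案并择优的示意性草图:scheme 假设为字典,
# 含 "tfus"(模板融合单元列表)与 "unfused"(未被融合的层列表)两个字段。
def pick_best_scheme(schemes):
    def score(scheme):
        # 分数越小越好:优先比较模板融合单元数量,其次比较未被融合的层数
        return (len(scheme["tfus"]), len(scheme["unfused"]))
    return min(schemes, key=score)

# 例如分别由向前、向后、双向、跳跃式融合得到的四个候选方案中,
# pick_best_scheme 返回模板融合单元最少且未融合层最少的那一个。
```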
在前述的实施例中,步骤1103、步骤1706、步骤2503、步骤2709等均提及根据模板融合单元来进行神经网络计算。本披露的另一个实施例是一种执行模板融合单元的方法,该方法充分利用GDMA 311、IODMA 433、MVDMA 434和运算模块42之间的并发性,将流水线的概念导入,使得中间结果尽可能地驻留在片上,不仅减少片上与片外之间的输入/输出,同时利用了片上单元图搬运高带宽的优点,加速处理速度。此处的并发性指的是前述几个元件可以独立且平行运作,不受其他元件的影响。
如前所述,模板融合单元中首层的输入和末层的输出作为该模板融合单元与DRAM 204的交互数据,期间各层的计算皆不需要再访问DRAM 204。在此实施例中,处理装置203进一步根据NRAM 431与WRAM 432的大小把模板融合单元划分为若干个子模板融合单元。
图29示出划分子模板融合单元的示意图,图中的T1层至T11层是特定深度学习网络中的一段,处理装置203再根据NRAM 431与WRAM 432的大小把模板融合单元2901划分为第一子模板融合单元2911及第二子模板融合单元2912。在其他实施例中,处理装置203可以把模板融合单元划分为不特定数量的子模板融合单元。
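下面以一段示意性的Python草图说明按NRAM与WRAM容量划分子模板融合单元的一种可能做法。容量判断方式与字段名均为假设,仅用于说明"沿层序累加存储需求、超出容量即切分"的思路。

```python
# 依据 NRAM/WRAM 容量划分子模板融合单元的示意性草图:
# layers 中每层假设含 name、neuron_bytes(神经元数据量)、weight_bytes(权值数据量)。
def split_into_sub_tfus(layers, nram_capacity, wram_capacity):
    sub_tfus, current = [], []
    neuron_used = weight_used = 0
    for layer in layers:
        need_n, need_w = layer["neuron_bytes"], layer["weight_bytes"]
        if current and (neuron_used + need_n > nram_capacity or
                        weight_used + need_w > wram_capacity):
            sub_tfus.append(current)              # 当前子模板融合单元已满,封口
            current, neuron_used, weight_used = [], 0, 0
        current.append(layer["name"])
        neuron_used += need_n
        weight_used += need_w
    if current:
        sub_tfus.append(current)
    return sub_tfus           # 例如得到 [T1..T6] 与 [T7..T11] 两个子模板融合单元
```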
在开始计算模板融合单元2901前,GDMA 311将模板融合单元2901所需的数据一次性从DRAM 204搬运到SRAM 308上,接着MVDMA 434把执行第一子模板融合单元2911所需的子图搬运到NRAM 431上,运算模块42便开始执行第一子模板融合单元2911的任务,也就是计算第T1层至第T6层,期间无需再访问SRAM 308。当第一子模板融合单元2911计算完毕后,产出第一中间结果,MVDMA 434将第一中间结果自NRAM 431搬运至SRAM 308。
接着MVDMA 434把执行第二子模板融合单元2912所需的子图自SRAM 308搬运到NRAM 431上,运算模块42执行第二子模板融合单元2912的任务,也就是计算第T7层至第T11层,期间同样无需再访问SRAM 308。当第二子模板融合单元2912计算完毕后,产出第二中间结果,MVDMA 434将第二中间结果自NRAM 431搬运至SRAM 308,其中一个处理器核306将第一中间结果与第二中间结果进行归约,以产生计算结果,最后GDMA 311将计算结果一次性从SRAM 308搬运到DRAM 204上,至此完成模板融合单元2901的任务,也就是完成T1层至T11层的任务,期间仅在模板融合单元2901的开始与结束访问DRAM 204,大大降低输入/输出的次数。
计算装置201具有很强算力的一个重要原因在于片上系统-集群-处理器核的三级运算层次,搭配DRAM-SRAM-NRAM/WRAM这样的三层内存设计,使得数据能够在适当的层级里缓存并计算,形成充分的流水。
计算装置201在进行计算时,主要可以分为以下三大阶段。载入阶段(load):将数据载入;计算阶段(compute):搬运数据、计算、搬运中间结果;存回阶段(store):存回结果。
更详细来说,此实施例采两层三级流水线,如图30所示,第一层的载入阶段3001、计算阶段3002、存回阶段3003发生在集群层次中。第一层载入阶段3001是执行模板融合单元,由GDMA 311将数据自DRAM 204载入至SRAM 308中,第一层计算阶段3002是集群305对载入的片上单元图进行计算,并产生计算结果,第一层存回阶段3003是GDMA 311将计算结果从SRAM 308存回至DRAM 204中。
由于集群305包括多个处理器核306,第一层计算阶段3002实际上会透过存储核307将片上单元图分割成对应子图,广播发送至处理器核306进行计算,因此,第二层的三级流水线发生在处理器核306中。更详细来说,第二层载入阶段3004是执行子模板融合单元,由MVDMA 434从SRAM 308中将子图载入到NRAM 431中,同时将所需权值载入至WRAM 432,第二层计算阶段3005是将子图和权值搬运至运算模块42,进行计算,再将中间结果搬运回NRAM 431中,第二层存回阶段3006则是MVDMA 434将中间结果从NRAM 431存回至SRAM 308中。
第一层的流水线指的是第一层载入阶段3001、第一层计算阶段3002和第一层存回阶段3003可以同时并行。现以同一个集群305欲处理第一片上单元图、第二片上单元图和第三片上单元图为例,首先,第一片上单元图在第一层载入阶段3001载入到SRAM 308中。接着第一片上单元图在第一层计算阶段3002被计算,并将第一计算结果搬运回SRAM 308中,在计算第一片上单元图的同时,第二片上单元图在第一层载入阶段3007载入到SRAM 308中。当第一计算结果在第一层存回阶段3003被存回至DRAM 204中时,第二片上单元图在第一层计算阶段3008被计算,并将第二计算结果搬运回SRAM 308中,且第三片上单元图在第一层载入阶段3010被载入到SRAM 308中。
为了配合前述流水线的操作,此实施例的SRAM 308包括2个存储空间:乒存储空间与乓存储空间。模板融合单元的流水按照SRAM 308的乒乓属性分为3种:输入/输出乒乓(IO parity)、输入乒乓(input parity)、无乒乓(no parity)。输入/输出乒乓可以支援载入、计算和存回并行,为实现输入/输出乒乓,乒存储空间与乓存储空间需要完全相等,分别供载入和存回使用。输入乒乓仅支援存回和计算并行,会额外增加SRAM 308上搬运的时间,与输入/输出乒乓相比,乒存储空间与乓存储空间不需要完全相等,但是需要多分配一块与存回存储空间大小相等的高速缓存(cache)。无乒乓指的是载入/存回和计算串行,空间不需要进行额外分配。
为了实现前述的第一层流水线,此实施例的SRAM 308具有相同大小的乒存储空间与乓存储空间,以达到输入/输出乒乓的效果。再以图30做说明,第一个片上单元图的第一层载入阶段3001、第一层计算阶段3002和第一层存回阶段3003是在乒存储空间上执行的,而第二个片上单元图的第一层载入阶段3007、第一层计算阶段3008和第一层存回阶段3009则在乓存储空间上执行,第三个片上单元图的第一层载入阶段3010、第一层计算阶段3011和第一层存回阶段3012则又在乒存储空间上执行,以此类推。
第二层的流水线指的是第二层载入阶段3004、第二层计算阶段3005和第二层存回阶段3006可以同时并行。试以同一个处理器核306欲处理第一子图、第二子图和第三子图为例。首先,第一子图在第二层载入阶段3004载入到NRAM 431中,并将所需权值载入至WRAM 432。接着第一子图在第二层计算阶段3005进行计算并归约,并将归约后的中间结果搬运回NRAM 431中,同时,第二子图在第二层载入阶段3013载入到NRAM 431中,并将所需权值载入至WRAM 432。最后第一中间结果在第二层存回阶段3006被存回至SRAM 308中,同时第二子图在第二层计算阶段3014被计算并归约,并将归约后的中间结果搬运回NRAM 431中,且第三子图在第二层载入阶段3015被载入到NRAM 431中,并将所需权值载入至WRAM 432。
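下面给出一段示意性的Python草图,演示三级流水线中载入、计算、存回三个阶段在时间步上的重叠关系。three_stage_schedule为假设名称,仅模拟时序,不涉及真实的GDMA/MVDMA搬运,两层流水线中的任一层都可套用同样的时序关系。

```python
# 三级流水线重叠关系的示意性草图:第 i 份数据的载入、计算、存回
# 分别排在第 i-1、i、i+1 个时间步,使三个阶段可以同时并行。
def three_stage_schedule(num_tiles):
    timeline = {}
    for t in range(num_tiles + 2):                 # 共 num_tiles + 2 个时间步
        stages = []
        if t < num_tiles:
            stages.append(f"载入 第{t + 1}份")
        if 1 <= t <= num_tiles:
            stages.append(f"计算 第{t}份")
        if t >= 2:
            stages.append(f"存回 第{t - 1}份")
        timeline[t] = stages
    return timeline

# three_stage_schedule(3) 的时间步1为 ["载入 第2份", "计算 第1份"],
# 时间步2为 ["载入 第3份", "计算 第2份", "存回 第1份"],与图30所示的重叠关系一致。
```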
考虑到每个集群305的任务不同,完成的时间自然不一致,此实施例的同步模块304会利用同步屏障指令来同步任务完成的时间,以避免时序出现错误。
在一种应用场景下,此实施例未开启权值置换,也就是在执行前述流水过程中,当计算一个子图时,会同步广播下一个子图的权值,因此同一时间内WRAM 432会存储相邻2个子图的权值,由于多个子图的权值在WRAM 432内的空间会相互影响,因此相邻2个子图的权值在WRAM 432内所占用的空间会大于相邻2个子图的权值的总和。对于SRAM 308来说,为了便于在多个批处理的情况下直接广播而不是多次访问DRAM 204,处理装置203需要为SRAM 308分配多个存储权值的空间,这些权值会一直驻留在SRAM 308中,如果模板融合单元包括多个卷积层,则SRAM 308或WRAM 432的空间可能不够大到足以载入所有的权值,以至于无法融合多层。
当处理装置203判断模板融合单元中包括多个卷积层时,此实施例会切换成权值置换模式。权值置换指的是在计算某一个子图时,处理装置203才将下一个子图的权值自DRAM 204载入至SRAM 308中。轮到下一个子图进行计算时,广播总线309将权值广播至WRAM 432。相较于未开启权值置换,权值置换虽然会增加片外片上的访问次数,但SRAM 308仅需配置最大权值的存储空间,而任何时间SRAM 308仅存储一份子图的权值,这空间可以复用。融合策略的规则可以包括权值置换的切换,当模板融合单元的权值总和小时,不采用权值置换,以争取较快的计算速度;当权值总和大时,采用权值置换,以争取融合较多的层数。
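下面以一段示意性的Python草图说明权值置换模式的切换判断。其中的阈值比例为假设值,实际判断依据融合策略的规则以及SRAM 308的空间分配情况,此处仅表达"权值总和小则常驻、权值总和大则置换"的取舍。

```python
# 权值置换模式切换的示意性草图:conv_weight_bytes 为模板融合单元内各卷积层权值大小,
# sram_available 为 SRAM 可用空间,threshold 为假设的切换比例。
def choose_weight_mode(conv_weight_bytes, sram_available, threshold=0.5):
    total = sum(conv_weight_bytes)
    if total <= sram_available * threshold:
        return "resident"      # 权值总和较小:不置换,权值一直驻留 SRAM,换取较快的计算速度
    return "swap"              # 权值总和较大:开启权值置换,SRAM 只需容纳最大的一份权值,换取融合更多层
```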
此实施例基于片上系统-集群-处理器核的三级运算层次,以及DRAM-SRAM-NRAM/WRAM 的三层内存设计,建立起两层三级流水线,充分利用硬件资源,提升神经网络计算效率。
图31示出另一实施例基于模板融合单元执行计算程序的流程图,模板融合单元包括多个子模板融合单元。在步骤3101中,将模板融合单元所需的数据一次性从片外DRAM 204搬运到SRAM 308上。在步骤3102中,判断模板融合单元中的子模板融合单元是否都计算完毕。如否,则执行步骤3103,选择一个未计算的子模板融合单元并将其所需的数据搬运到NRAM 431和WRAM 432上。在步骤3104中,执行所选择的子模板融合单元的任务,执行期间无需再访问SRAM。在步骤3105中,将产出的中间结果自NRAM 431搬运至SRAM 308,并回到步骤3102。
如果在步骤3102中,判断所有的子模板融合单元都已执行完毕,则执行步骤3106,将所有中间结果进行归约,以产生计算结果。在步骤3107中,将计算结果从SRAM 308搬运到DRAM 204上。至此完成模板融合单元的任务。
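下面给出一段示意性的Python草图,对应图31所示流程的整体执行顺序。各搬运与计算接口均为假设的回调函数,仅用于说明"一次性载入、逐个子模板融合单元计算、归约后一次性存回"的流程。

```python
# 图31流程的示意性草图:模板融合单元的数据一次性载入共享存储,
# 逐个子模板融合单元计算并暂存中间结果,最后归约并一次性存回片外内存。
def run_template_fusion_unit(tfu_data, sub_tfus, load_to_sram, load_to_core,
                             compute_sub, reduce_fn, store_to_dram):
    sram = load_to_sram(tfu_data)                 # 步骤3101:DRAM -> SRAM,一次性载入
    intermediates = []
    for sub in sub_tfus:                          # 步骤3102~3105:遍历各子模板融合单元
        nram, wram = load_to_core(sram, sub)      # 子图载入 NRAM,相应权值载入 WRAM
        intermediates.append(compute_sub(sub, nram, wram))  # 执行期间无需再访问 SRAM
    result = reduce_fn(intermediates)             # 步骤3106:归约所有中间结果
    return store_to_dram(result)                  # 步骤3107:SRAM -> DRAM,一次性存回
```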
图32示出另一实施例的两层三级流水线的流程图。在步骤3201中,载入第一片上单元图。在步骤3202中,同步计算第一片上单元图并产生第一中间结果,及载入第二片上单元图。在步骤3203中,同步存回第一中间结果、计算第二片上单元图并产生第二中间结果、及载入第三片上单元图。其中步骤3202更包括以下步骤。在步骤3204中,载入第一子图,其中第一子图为第一片上单元图的至少一部分。在步骤3205中,同步计算第一子图并产生第一中间结果,及载入第二子图,其中第二子图为第一片上单元图的至少一部分。在步骤3206中,同步存回第一中间结果、计算第二子图、及载入第三子图,其中第三子图亦为第一片上单元图的至少一部分。
本披露另一个实施例为一种计算机可读存储介质,其上存储有根据融合策略动态融合神经网络的计算机程序代码,当所述计算机程序代码由处理器运行时,执行如图10、图11、图12、图17、图19、图25、图27、图31、图32所述的方法。
本披露通过设定融合策略,动态地决定模板融合单元,融合神经网络中的多个层,以形成新的自定义的层,一次性载入计算模板融合单元所需的数据,以减少输入/输出开销。
根据不同的应用场景,本披露的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步,本披露的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中,根据本披露方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器),而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中,云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容,从而可以根据终端设备和/或边缘端设备的硬件信息,从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源,以便完成端云一体或云边端一体的统一管理、调度和协同工作。
需要说明的是,为了简明的目的,本披露将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此,依据本披露的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本披露对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本披露某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。
在具体实现方面,基于本披露的公开和教导,本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行拆分,而实际实现时也可以有另外的拆分方式。又例如,可以将多个单元或组件结合或者集成到另一个系统,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。
在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。
在一些实现场景中,上述集成的单元可以采用软件程序模块的形式来实现。如果以软件程序模块的形式实现并作为独立的产品销售或使用时,所述集成的单元可以存储在计算机可读取存储器中。基于此,当本披露的方案以软件产品(例如计算机可读存储介质)的形式体现时,该软件产品可以存储在存储器中,其可以包括若干指令用以使得计算机设备(例如个人计算机、服务器或者网络设备等)执行本披露实施例所述方法的部分或全部步骤。前述的存储器可以包括但不限于U盘、闪存盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
在另外一些实现场景中,上述集成的单元也可以采用硬件的形式实现,即为具体的硬件电路,其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件,而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此,本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现,例如中央处理器、GPU、FPGA、DSP和ASIC等。进一步,前述的所述存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等),其例如可以是可变电阻式存储器(Resistive Random Access Memory,RRAM)、动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,EDRAM)、高带宽存储器(High Bandwidth Memory,HBM)、混合存储器立方体(Hybrid Memory Cube,HMC)、ROM和RAM等。
依据以下条款可更好地理解前述内容:
2020110458882条款A1、一种融合神经网络的集成电路装置,所述神经网络包括第i层,所述第i层的输入特征图小于输出特征图,所述集成电路装置包括:
处理装置,用以根据融合策略建立模板融合单元;以及
计算装置,用以根据所述模板融合单元执行神经网络计算;
其中,所述模板融合单元包括所述第i层,i为自然数。
条款A2、根据条款A1所述的集成电路装置,其中所述神经网络还包括第j层,位于所述第i层前,所述第j层的输入特征图大于输出特征图,所述模板融合单元还包括所述第j层,j为自然数,且i不等于j。
条款A3、根据条款A1所述的集成电路装置,其中所述神经网络还包括第j层,位于所述第i层后,所述第j层的输入特征图大于输出特征图,所述模板融合单元还包括所述第j层,j为自然数,且i不等于j。
条款A4、根据条款A2或条款A3所述的集成电路装置,其中所述第i层及所述第j层相邻。
条款A5、根据条款A1所述的集成电路装置,其中所述融合策略为所述第i层为所述模板融合单元的起始层。
条款A6、根据条款A1所述的集成电路装置,其中所述第i层位于所述神经网络的块结构中,所述融合策略的规则为以所述块结构为单位增删所述模板融合单元。
条款A7、根据条款A1所述的集成电路装置,其中所述第i层为反卷积层、上池化层及上采样层其中之一。
条款A8、根据条款A7所述的集成电路装置,其中所述神经网络包括多个主层,所述主层为矩阵乘、池化、卷积及所述第i层其中之一,所述融合策略的规则为所述模板融合单元包括至少2个主层,当所述处理装置判断所述规则未被满足时,所述处理装置调整所述模板融合单元直到所述规则被满足。
条款A9、根据条款A7所述的集成电路装置,其中所述神经网络包括多个主层,所述主层为矩阵乘、池化、卷积及所述第i层其中之一,所述融合策略的规则为所述模板融合单元包括主层、主层、非主层依次相邻的连续结构,当所述处理装置判断所述规则未被满足时,所述处理装置调整所述模板融合单元直到所述规则被满足。
条款A10、根据条款A1所述的集成电路装置,其中所述第i层为自定义层。
条款A11、一种板卡,包括根据条款A1至条款A10任一项所述的集成电路装置。
条款A12、一种融合神经网络的方法,所述神经网络包括第i层,所述第i层的输入特征图小于输出特征图,i为自然数,所述方法包括:
根据融合策略建立模板融合单元,所述模板融合单元包括所述第i层;以及
根据所述模板融合单元执行神经网络计算。
条款A13、根据条款A12所述的方法,其中所述神经网络还包括第j层,位于所述第i层前,所述第j层的输入特征图大于输出特征图,所述模板融合单元还包括所述第j层,j为自然数,且i不等于j。
条款A14、根据条款A12所述的方法,其中所述神经网络还包括第j层,位于所述第i层后,所述第j层的输入特征图大于输出特征图,所述模板融合单元还包括所述第j层,j为自然数,且i不等于j。
条款A15、根据条款A13或条款A14所述的方法,其中所述第i层及所述第j层相邻。
条款A16、根据条款A12所述的方法,其中所述融合策略为所述第i层为所述模板融合单元的起始层。
条款A17、根据条款A12所述的方法,其中所述第i层位于所述神经网络的块结构中,所述融合策略的规则为以所述块结构为单位增删所述模板融合单元。
条款A18、根据条款A12所述的方法,其中所述第i层为反卷积层、上池化层及上采样层其中之一。
条款A19、根据条款A18所述的方法,其中所述神经网络包括多个主层,所述主层为矩阵乘、池化、卷积及所述第i层其中之一,所述融合策略的规则为所述模板融合单元包括至少2个主层,当所述处理装置判断所述规则未被满足时,所述处理装置调整所述模板融合单元直到所述规则被满足。
条款A20、根据条款A18所述的方法,其中所述神经网络包括多个主层,所述主层为矩阵乘、池化、卷积及所述第i层其中之一,所述融合策略的规则为所述模板融合单元包括主层、主层、非主层依次相邻的连续结构,当所述处理装置判断所述规则未被满足时,所述处理装置调整所述模板融合单元直到所述规则被满足。
条款A21、根据条款A12所述的方法,其中所述第i层为自定义层。
条款A22、一种计算机可读存储介质,其上存储有融合神经网络的计算机程序代码,当所述计算机程序代码由处理装置运行时,执行条款A12至条款A21任一项所述的方法。2020110458882
2020110438963条款B1、一种根据模板融合单元计算神经网络的计算装置,所述模板融合单元融合所述神经网络的多层,所述计算装置包括多个集群,每个集群包括:
共享存储单元,用以自片外内存载入片上单元图;以及
多个处理器核,每个处理器核包括:
神经元存储单元,用以自所述共享存储单元载入子图,所述子图为所述片上单元图的一部分;以及
运算模块,用以计算所述子图并产生中间结果;
其中,所述中间结果在所述多个处理器核间归约,以产生对应所述片上单元图的计算结果,所述共享存储单元将所述计算结果存回至所述片外内存。
条款B2、根据条款B1所述的计算装置,其中所述片外内存存储特征图,当所述特征图所需存储空间大于所述共享存储单元的可用空间时,所述片上单元图为所述特征图的一部分。
条款B3、根据条款B2所述的计算装置,其中所述特征图包括N、H、W、C维度,所述片上单元图是所述特征图在所述N、H、W维度至少其中之一进行特定粒度的拆分。
条款B4、根据条款B1所述的计算装置,其中所述片外内存存储多个特征图,当所述多个特征图所需存储空间不大于所述共享存储单元的可用空间时,所述片上单元图包括所述多个特征图。
条款B5、根据条款B4所述的计算装置,其中所述子图为所述多个特征图其中之一。
条款B6、根据条款B1所述的计算装置,其中所述片上单元图包括N、H、W、C维度,所述子图是所述片上单元图在所述N、H、W维度至少其中之一进行特定粒度的拆分。
条款B7、根据条款B1所述的计算装置,其中每个集群还包括广播总线,所述多个处理器核其中之一根据所述多个处理器核的数量拆分所述片上单元图,所述广播总线传送每个处理器核的中间结果至下一个处理器核,所述处理器核将前一个处理器核的中间结果与所存储相对应的中间结果进行计算,以产生所述计算结果。
条款B8、一种根据模板融合单元计算神经网络的集成电路装置,包括:
片外内存,用以存储所述神经网络的特征图;
处理装置,用以根据融合策略融合所述神经网络的多层,以产生所述模板融合单元,并拆分所述特征图成片上单元图;以及
计算装置,包括多个集群,每个集群包括:
共享存储单元,用以载入所述片上单元图;
多个处理器核,每个处理器核包括:
神经元存储单元,用以自所述共享存储单元载入子图,所述子图为所述片上单元图的一部分;
运算模块,用以计算所述子图并产生中间结果;
其中,所述中间结果在所述多个处理器核间归约,以产生对应所述片上单元图的计算结果,所述共享存储单元将所述计算结果存回至所述片外内存。
条款B9、根据条款B8所述的集成电路装置,其中当所述处理装置判断所述特征图所需存储空间大于所述共享存储单元的可用空间时,将所述特征图拆分为所述片上单元图。
条款B10、根据条款B9所述的集成电路装置,其中所述特征图包括N、H、W、C维度,所述处理装置将所述特征图在所述N、H、W维度至少其中之一进行特定粒度的拆分。
条款B11、根据条款B8所述的集成电路装置,其中当处理装置判断所述特征图所需存储空间不大于所述共享存储单元的可用空间时,所述片上单元图包括多个特征图。
条款B12、根据条款B11所述的集成电路装置,其中所述子图为所述多个特征图其中之一。
条款B13、根据条款B9所述的集成电路装置,其中所述片上单元图包括N、H、W、C维度,所述多个处理器核其中之一将所述片上单元图在所述N、H、W维度至少其中之一进行特定粒度的拆分成所述子图。
条款B14、根据条款B8所述的集成电路装置,其中每个集群还包括广播总线,所述多个处理器核其中之一根据所述多个处理器核的数量拆分所述片上单元图,所述广播总线传送每个处理器核的中间结果至下一个处理器核,所述处理器核将前一个处理器核的中间结果与所存储相对应的中间结果进行计算,以产生所述计算结果。
条款B15、一种板卡,包括根据条款B8至条款B14任一项所述的集成电路装置。
条款B16、一种根据模板融合单元计算神经网络的方法,所述模板融合单元融合所述神经网络的多层,所述方法包括:
载入所述片上单元图;
载入子图,所述子图为所述片上单元图的一部分;
计算所述子图并产生中间结果;
归约所述中间结果,以产生对应所述片上单元图的计算结果;以及
将所述计算结果存回。
条款B17、根据条款B16所述的方法,还包括:
存储所述神经网络的特征图。
条款B18、一种计算机可读存储介质,其上存储有根据模板融合单元计算神经网络的计算机程序代码,当所述计算机程序代码由处理装置运行时,执行条款B16至条款B17任一项所述的方法。2020110438963
2020110458524条款C1、一种融合神经网络的集成电路装置,包括:
计算装置,包括多个处理器核,每个处理器核包括神经元存储单元及权值存储单元;
处理装置,用以:
融合所述神经网络以建立模板融合单元,所述模板融合单元对应有片上单元图及片上单元图的相应权值;以及
根据所述神经元存储单元与所述权值存储单元的大小将所述模板融合单元划分为多个子模板融合单元,每个子模板融合单元对应有子图及子图的相应权值,所述子图为所述片上单元图的一部分,所述子图的相应权值为所述片上单元图的相应权值的一部分;
其中,所述计算装置将所述子图载入所述神经元存储单元,将所述子图的相应权值载入所述权值存储单元,以所述子模板融合单元为单位进行计算。
条款C2、根据条款C1所述的集成电路装置,其中所述计算装置还包括共享存储单元,所述子图自所述共享存储单元搬运至所述神经元存储单元,所述子图的相应权值自所述共享存储单元搬运至所述权值存储单元。
条款C3、根据条款C2所述的集成电路装置,还包括片外内存,所述片上单元图及所述片上单元图的相应权值自所述片外内存搬运至所述共享存储单元。
条款C4、根据条款C3所述的集成电路装置,其中所述计算装置还包括运算模块,自所述神经元存储单元读取所述子图,并自所述权值存储单元读取所述子图的相应权值,计算后产出中间结果,所述中间结果暂存于所述神经元存储单元中。
条款C5、根据条款C4所述的集成电路装置,其中所述中间结果自所述神经元存储单元搬运至所述共享存储单元。
条款C6、根据条款C5所述的集成电路装置,其中所述多个处理器核其中之一将每个子模板融合单元的中间结果进行归约,以产生计算结果,所述计算结果自所述共享存储单元搬运至所述片外内存。
条款C7、一种板卡,包括根据条款C1至条款C6任一项所述的集成电路装置。
条款C8、一种根据模板融合单元计算神经网络的计算装置,所述模板融合单元划分为多个子模板融合单元,每个子模板融合单元对应有子图及子图的相应权值,所述计算装置包括多个处理器核,每个处理器核包括神经元存储单元及权值存储单元,所述计算装置将所述子图载入所述神经元存储单元,将所述子图的相应权值载入所述权值存储单元,以所述子模板融合单元为单位进行计算。
条款C9、根据条款C8所述的计算装置,其中所述计算装置还包括共享存储单元,所述子图自所述共享存储单元搬运至所述神经元存储单元,所述子图的相应权值自所述共享存储单元搬运至所述权值存储单元。
条款C10、根据条款C9所述的计算装置,连接至片外内存,其中所述模板融合单元对应有片上单元图及片上单元图的相应权值,所述子图为所述片上单元图的一部分,所述子图的相应权值为所述片上单元图的相应权值的一部分,所述片上单元图及所述片上单元图的相应权值自所述片外内存搬运至所述共享存储单元。
条款C11、根据条款C10所述的计算装置,还包括运算模块,自所述神经元存储单元读取所述子图,并自所述权值存储单元读取所述子图的相应权值,计算后产出中间结果,所述中间结果暂存于所述神经元存储单元中。
条款C12、根据条款C11所述的计算装置,其中所述中间结果自所述神经元存储单元搬运至所述共享存储单元。
条款C13、根据条款C12所述的计算装置,其中所述多个处理器核其中之一将每个子模板融合单元的中间结果进行归约,以产生计算结果,所述计算结果自所述共享存储单元搬运至所述片外内存。
条款C14、一种融合神经网络的处理装置,连接至计算装置,所述计算装置包括多个处理器核,每个处理器核包括神经元存储单元及权值存储单元,所述处理装置用以:
融合所述神经网络以建立模板融合单元,所述模板融合单元对应有片上单元图及片上单元图的相应权值;以及
根据所述神经元存储单元与所述权值存储单元的大小将所述模板融合单元划分为多个子模板融合单元,每个子模板融合单元对应有子图及子图的相应权值,所述子图为所述片上单元图的一部分,所述子图的相应权值为所述片上单元图的相应权值的一部分。
条款C15、一种在集成电路装置中融合神经网络的方法,所述集成电路装置包括计算装置,包括多个处理器核,每个处理器核包括神经元存储单元及权值存储单元,所述方法包括:
融合所述神经网络以建立模板融合单元,所述模板融合单元对应有片上单元图及片上单元图的相应权值;
根据所述神经元存储单元与所述权值存储单元的大小将所述模板融合单元划分为多个子模板融合单元,每个子模板融合单元对应有子图及子图的相应权值,所述子图为所述片上单元图的一部分,所述子图的相应权值为所述片上单元图的相应权值的一部分;以及
将所述子图载入所述神经元存储单元,将所述子图的相应权值载入所述权值存储单元,以所述子模板融合单元为单位进行计算。
条款C16、一种计算机可读存储介质,其上存储有融合神经网络的计算机程序代码,当所述计算机程序代码由处理装置运行时,执行条款C15所述的方法。2020110458524
以上对本披露实施例进行了详细介绍,本文中应用了具体个例对本披露的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本披露的方法及其核心思想;同时,对于本领域的一般技术人员,依据本披露的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本披露的限制。

Claims (18)

  1. 一种执行神经网络计算的集成电路装置,包括:
    处理装置,用以建立模板融合单元;
    编译器,用以将所述模板融合单元转换成目标代码;
    链接器,用以将所述目标代码外加库链接,形成可执行文件;以及
    计算装置,用以执行所述可执行文件,以实现神经网络计算。
  2. 根据权利要求1所述的集成电路装置,其中当所述处理装置建立模板融合单元时,用以:
    根据融合策略的起始规则,选择所述模板融合单元的起始层;以及
    以所述起始层为基准进行融合,排查所述融合策略的规则,以建立所述模板融合单元。
  3. 根据权利要求1所述的集成电路装置,其中当所述编译器将所述模板融合单元转换成目标代码时,推导所述模板融合单元的形状。
  4. 根据权利要求3所述的集成电路装置,其中所述编译器从输出向前逆推出所需的输入数据及冗余。
  5. 根据权利要求3所述的集成电路装置,其中所述编译器对整个控制流图进行片上存储空间地址的推导,并实现通用地址的访问。
  6. 根据权利要求5所述的集成电路装置,其中所述编译器用以:
    根据所述模板融合单元的划分情况,初始划分基本块;以及
    经过迭代运算后,确认所述基本块及相互关系。
  7. 根据权利要求5所述的集成电路装置,其中所述编译器用以:
    判断前一次模板融合单元中有多少数据可以供下一个模板融合单元使用;以及
    根据判断结果,规划所述片上存储空间地址。
  8. 根据权利要求7所述的集成电路装置,其中所述处理装置根据所述片上存储空间地址分配片上存储空间。
  9. 一种板卡,包括根据权利要求1至8任一项所述的集成电路装置。
  10. 一种执行神经网络计算的方法,包括:
    建立模板融合单元;
    将所述模板融合单元转换成目标代码;
    将所述目标代码外加库链接,形成可执行文件;以及
    执行所述可执行文件,以实现神经网络计算。
  11. 根据权利要求10所述的方法,其中所述建立步骤包括:
    根据融合策略的起始规则,选择所述模板融合单元的起始层;以及
    以所述起始层为基准进行融合,逐一排查所述融合策略的规则,以建立所述模板融合单元。
  12. 根据权利要求10所述的方法,其中所述转换步骤包括:
    推导所述模板融合单元的形状。
  13. 根据权利要求12所述的方法,其中所述推导步骤从输出向前逆推出所需的输入数据及冗余。
  14. 根据权利要求12所述的方法,其中所述推导步骤对整个控制流图进行片上存储空间地址的推导,并实现通用地址的访问。
  15. 根据权利要求14所述的方法,其中所述转换步骤还包括:
    根据所述模板融合单元的划分情况,初始划分基本块;以及
    经过迭代运算后,确认所述基本块及相互关系。
  16. 根据权利要求14所述的方法,其中所述转换步骤还包括:
    判断前一次模板融合单元中有多少数据可以供下一个模板融合单元使用;以及
    根据判断结果,规划所述片上存储空间地址。
  17. 根据权利要求16所述的方法,还包括:
    根据所述片上存储空间地址分配片上存储空间。
  18. 一种计算机可读存储介质,其上存储有执行神经网络计算的计算机程序代码,当所述计算机程序代码由处理装置运行时,执行权利要求10至17任一项所述的方法。
PCT/CN2021/119943 2020-09-28 2021-09-23 执行神经网络计算的装置、板卡、方法及可读存储介质 WO2022063183A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/003,682 US20230274158A1 (en) 2020-09-28 2021-09-23 Device and method for neural network computing, and board and readable storage medium

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN202011045888.2A CN114358264A (zh) 2020-09-28 2020-09-28 融合神经网络的装置、板卡、方法及可读存储介质
CN202011045871.7 2020-09-28
CN202011043896.3A CN114282642A (zh) 2020-09-28 2020-09-28 计算神经网络的计算装置、板卡、方法及可读存储介质
CN202011045852.4A CN114358261A (zh) 2020-09-28 2020-09-28 融合神经网络的装置、板卡、方法及可读存储介质
CN202011045852.4 2020-09-28
CN202011045871.7A CN114358263A (zh) 2020-09-28 2020-09-28 执行神经网络计算的装置、板卡、方法及可读存储介质
CN202011045888.2 2020-09-28
CN202011043896.3 2020-09-28

Publications (1)

Publication Number Publication Date
WO2022063183A1 true WO2022063183A1 (zh) 2022-03-31

Family

ID=80844935

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119943 WO2022063183A1 (zh) 2020-09-28 2021-09-23 执行神经网络计算的装置、板卡、方法及可读存储介质

Country Status (2)

Country Link
US (1) US20230274158A1 (zh)
WO (1) WO2022063183A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180314928A1 (en) * 2015-11-17 2018-11-01 Intitute of Computing Technology, Chinese Academy of Sciences Operation apparatus and method for acceleration chip for accelerating deep neural network algorithm
CN109754073A (zh) * 2018-12-29 2019-05-14 北京中科寒武纪科技有限公司 数据处理方法、装置、电子设备和可读存储介质
CN110321999A (zh) * 2018-03-30 2019-10-11 北京深鉴智能科技有限公司 神经网络计算图优化方法
CN110908667A (zh) * 2019-11-18 2020-03-24 北京迈格威科技有限公司 神经网络联合编译的方法、装置和电子设备


Also Published As

Publication number Publication date
US20230274158A1 (en) 2023-08-31


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21871552

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01.09.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21871552

Country of ref document: EP

Kind code of ref document: A1