WO2022135599A1 - Device, board, method and readable storage medium for fusing branch structures - Google Patents


Info

Publication number
WO2022135599A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2021/141393
Other languages
English (en)
French (fr)
Inventor
兰慧盈
王瑞涛
罗海钊
曹博
陈峋宇
Original Assignee
中科寒武纪科技股份有限公司
Priority date
Filing date
Publication date
Priority claimed from CN202011561973.4A (published as CN114692837A)
Priority claimed from CN202011563266.9A (published as CN114757327A)
Application filed by 中科寒武纪科技股份有限公司
Publication of WO2022135599A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks

Definitions

  • the present invention generally relates to the field of neural networks. More specifically, the present invention relates to a device, a board, a method and a readable storage medium for dynamically fusing branch structures of a neural network according to a fusion strategy.
  • A neural network is a system of multiple neurons connected according to certain rules, and it is roughly composed of the following four kinds of layer structures: the input layer, the convolution layer, the pooling layer, and the fully connected layer.
  • the input layer intercepts part of the information from the input data and converts it into a feature matrix for presentation, which contains the features corresponding to the part of the information.
  • the convolution layer is configured to receive the feature matrix from the input layer, and perform feature extraction on the input data through a convolution operation.
  • Convolutional layers can be constructed with multiple layers of convolutional layers in practical applications.
  • the pooling layer is configured to replace a certain region of the data with a value, which is usually the maximum or average of all the values in that region. Through pooling, the model size can be reduced and the calculation speed can be improved without losing too much information.
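  • For illustration only, the following is a minimal sketch (not part of the original disclosure) of 2×2 max pooling and average pooling; the helper function and the sample matrix are hypothetical:

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """Replace each non-overlapping 2x2 region with its maximum or its average."""
    h, w = feature_map.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            region = feature_map[i:i + 2, j:j + 2]
            out[i // 2, j // 2] = region.max() if mode == "max" else region.mean()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [1, 2, 3, 4],
              [5, 6, 7, 8]], dtype=float)
print(pool2x2(x, "max"))  # [[7. 8.] [6. 8.]]
print(pool2x2(x, "avg"))  # [[4.  5. ] [3.5 5.5]]
```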
  • The fully connected layer plays the role of a classifier in the entire convolutional neural network. It is equivalent to a feature space transformation that extracts and integrates all the previously obtained useful information, and it compares the information against the different classes to determine whether the input data resembles the comparison target.
  • the solution of the present invention provides an apparatus, board, method and readable storage medium for dynamically merging branch structures of neural networks according to a fusion strategy.
  • the present invention discloses an integrated circuit device that dynamically fuses branch structures of a neural network according to a fusion strategy, including a processing device and a computing device.
  • the processing device is configured to establish a topological sequence according to the branch structure, perform fusion based on the starting layer of the topological sequence, and check the rules in the fusion strategy to establish a template fusion unit.
  • the computing device is used for performing neural network computation according to the template fusion unit.
  • the present invention discloses a board including the aforementioned integrated circuit device.
  • the present invention discloses a method for dynamically merging branch structures of a neural network according to a fusion strategy, comprising: establishing a topological sequence according to the branch structure; performing fusion based on the starting layer of the topological sequence, and checking rules within the fusion strategy to establish a template fusion unit; and performing neural network computations according to the template fusion unit.
  • the present invention discloses a computer-readable storage medium on which computer program codes for dynamically merging branch structures of neural networks according to a fusion strategy are stored.
  • When the computer program codes are executed by a processing device, the aforementioned method is performed.
  • The present invention fuses the branch structure to generate a template fusion unit; the input of the first layer and the output of the last layer in the template fusion unit serve as the interaction data between the template fusion unit and the off-chip memory, and the calculation of each layer in between does not need to access the off-chip memory, which greatly reduces the frequency of on-chip/off-chip I/O accesses.
  • FIG. 1 is a structural diagram illustrating a board according to an embodiment of the present invention.
  • FIG. 2 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram illustrating an internal structure of a computing device according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram showing the case when one processor core wants to write data to a processor core of another cluster.
  • FIG. 6A is a schematic diagram illustrating the AlexNet model.
  • FIG. 6B is a schematic diagram illustrating input/output feature maps forming a positive pyramid structure.
  • FIG. 7A is a schematic diagram illustrating the unpooling operation of max pooling.
  • FIG. 7B is a schematic diagram illustrating the unpooling operation of average pooling.
  • FIG. 7C is a schematic diagram illustrating an exemplary neural network model.
  • FIG. 8A is a schematic diagram illustrating an upsampling operation.
  • FIG. 8B is a schematic diagram illustrating fusion of two convolutional layers according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram illustrating the format difference between NCHW and NHWC.
  • FIG. 10 is a flow chart showing the determination of the size of an on-chip unit map according to an embodiment of the present invention.
  • FIG. 11 is a schematic diagram illustrating an exemplary neural network model.
  • FIG. 12A is a flow chart illustrating dynamic fusion of a neural network according to a fusion strategy according to an embodiment of the present invention.
  • FIG. 12B is the first flowchart illustrating the execution of neural network computation by the template fusion unit according to an embodiment of the present invention.
  • FIG. 12C is the second flowchart illustrating the execution of neural network computation by the template fusion unit according to an embodiment of the present invention.
  • FIG. 13A is a schematic diagram illustrating the shapes of the input/output feature maps of an exemplary neural network model.
  • FIG. 13B is a flowchart illustrating the establishment of a template fusion unit based on a positive pyramid layer according to an embodiment of the present invention.
  • FIG. 13C is a fragment showing an exemplary neural network model.
  • FIG. 14 is a schematic diagram illustrating a topological sequence of a branch structure according to an embodiment of the present invention.
  • FIG. 15 is a schematic diagram illustrating the reduction of a long-chain structure to a branch structure according to an embodiment of the present invention.
  • FIG. 16 is a fragment showing another exemplary neural network model.
  • FIG. 17 is a schematic diagram illustrating a topological sequence of a branch structure according to another embodiment of the present invention.
  • FIG. 18 is a flow chart showing the fusion of the branch structure of a neural network according to another embodiment of the present invention.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • Generally speaking, a neural network is composed of an input layer, convolution layers, activation functions, pooling layers, and fully connected layers, ranging from a few layers to hundreds of layers. Each layer performs an operator; for example, the convolution layer performs a convolution operation. In the present invention, when a specific layer is referred to, it means the operator corresponding to that layer.
  • In neural network computation, variable data are generally represented by feature maps (matrices). In the present invention, the input information of the entire neural network model and the input maps of each layer of the model are collectively referred to as feature maps. Once the feature maps are loaded onto the on-chip memory component, they are referred to as on-chip unit maps in the present invention.
  • The parameters of the network model usually do not change frequently once training is stable, or they can be compiled and generated after the network topology and hardware parameters are determined and do not change during the calculation process, so they can be regarded as constant data. Constant data includes, but is not limited to, weights, biases, device hardware instructions, the mean and variance of batch normalization, etc. In the present invention, weights are uniformly used to represent all constant data. When the present invention refers to data to be fused, it generally refers to the graph structure that allows the operations of the corresponding operators in the neural network model to be fused together according to the fusion strategy; the graph structure involves variable data and constant data, that is, the feature maps plus the corresponding weights.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present invention.
  • The board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements on the storage capacity and computing capacity of the platform.
  • The board 10 in this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, huge on-chip storage, and massive computing power.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a monitor, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be transmitted back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected to the control device 106 and the chip 101 through a bus and performs data transmission.
  • the control device 106 in the board 10 is configured to control the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing a combined processing device in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
  • The computing device 201 is configured to perform operations specified by the user. It is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning calculations, and it can interact with the processing device 203 through the interface device 202 so that they work together to complete a user-specified operation.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write the input data into the storage device on-chip of the computing device 201 .
  • the computing device 201 can obtain the control instruction from the processing device 203 via the interface device 202 and write it into the control cache on the computing device 201 .
  • the interface device 202 can also read the data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201, and the like.
  • the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processors.
  • These processors include, but are not limited to, digital signal processors (DSPs), application specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 201 of the present invention can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, the two are considered to form a heterogeneous multi-core structure.
  • The DRAM 204 is used to store the data to be processed; it is a DDR memory with a size of 16 GB or more, and it is used to save the data of the computing device 201 and/or the processing device 203.
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 .
  • the computing device 201 is used to process input data such as computer vision, speech, natural language, and data mining.
  • the computing device 201 in the figure is designed with a multi-core hierarchical structure.
  • the computing device 201 is a system-on-a-chip, which includes multiple clusters. Each cluster further includes a plurality of processor cores, in other words, the computing device 201 is constituted at the level of system-on-chip-cluster-processor cores.
  • the computing device 201 includes an external storage controller 301 , a peripheral communication module 302 , an on-chip interconnect module 303 , a synchronization module 304 , and multiple clusters 305 .
  • the peripheral communication module 302 is used for receiving a control signal from the processing device 203 through the interface device 202 to start the computing device 201 to perform tasks.
  • the on-chip interconnection module 303 connects the external storage controller 301 , the peripheral communication module 302 and the multiple clusters 305 to transmit data and control signals among the modules.
  • the synchronization module 304 is a global synchronization barrier controller (GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information.
  • A plurality of clusters 305 are the computing cores of the computing device 201; four are exemplarily shown in the figure. With the development of hardware, the computing device 201 of the present invention may further include 8, 16, 64, or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.
  • each cluster 305 includes multiple processor cores (IPU cores) 306 and one memory core (MEM core) 307 .
  • Four processor cores 306 are exemplarily shown in the figure, and the present invention does not limit the number of processor cores 306; the internal structure of a processor core is shown in FIG. 4.
  • Each processor core 306 includes three modules: a control module 41 , an arithmetic module 42 and a storage module 43 .
  • The control module 41 is used to coordinate and control the work of the arithmetic module 42 and the storage module 43 to complete the task of deep learning, and it includes an instruction fetch unit (IFU) 411 and an instruction decode unit (IDU) 412.
  • the instruction fetching unit 411 is used to acquire the instruction from the processing device 203 , and the instruction decoding unit 412 decodes the acquired instruction, and sends the decoding result to the operation module 42 and the storage module 43 as control information.
  • the operation module 42 includes a vector operation unit 421 and a matrix operation unit 422 .
  • the vector operation unit 421 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 422 is responsible for the core calculation of the deep learning algorithm, that is, matrix multiplication and convolution.
  • The storage module 43 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM) 431, a weight storage unit (weight RAM, WRAM) 432, an input/output direct memory access module (input/output direct memory access, IODMA) 433, and a move direct memory access module (move direct memory access, MVDMA) 434.
  • The NRAM 431 is used to store the feature maps calculated by the processor core 306 and the intermediate results after the calculation; the WRAM 432 is used to store the weights of the deep learning network; the IODMA 433 is used to control memory access between the NRAM 431/WRAM 432 and the DRAM 204; and the MVDMA 434 is used to control memory access between the NRAM 431/WRAM 432 and the SRAM 308.
  • The storage core 307 is mainly used for storage and communication, that is, to store the shared data or intermediate results between the processor cores 306, and to execute the communication between the cluster 305 and the DRAM 204, the communication among the clusters 305, the communication among the processor cores 306, etc.
  • the memory core 307 has scalar operation capability for performing scalar operations.
  • the storage core 307 includes a shared storage unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access (CDMA) 310 and a global direct memory access (GDMA) 311.
  • the SRAM 308 assumes the role of a high-performance data transfer station.
  • The data multiplexed among different processor cores 306 in the same cluster 305 does not need to be obtained from the DRAM 204 by each processor core 306 separately, but is relayed among the processor cores 306 through the SRAM 308.
  • the storage core 307 only needs to quickly distribute the multiplexed data from the SRAM 308 to the multiple processor cores 306, so as to improve the communication efficiency between the cores and greatly reduce the on-chip and off-chip input/output accesses.
  • the broadcast bus 309, the CDMA 310 and the GDMA 311 are used to perform the communication between the processor cores 306, the communication between the clusters 305 and the data transmission between the clusters 305 and the DRAM 204, respectively. They will be explained separately below.
  • the broadcast bus 309 is used to complete high-speed communication among the processor cores 306 in the cluster 305.
  • the broadcast bus 309 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • Unicast refers to point-to-point data transmission (i.e., from a single processor core to a single processor core), multicast is a communication method that transmits a piece of data from the SRAM 308 to specific processor cores 306, and broadcast is a communication method that transmits a copy of the data from the SRAM 308 to all processor cores 306; broadcast is a special case of multicast.
  • the CDMA 310 is used to control access to the SRAM 308 between different clusters 305 within the same computing device 201.
  • Figure 5 shows a schematic diagram when one processor core wants to write data to the processor cores of another cluster to illustrate the working principle of CDMA 310.
  • Assume the same computing device includes multiple clusters; for convenience of description, only cluster 0 and cluster 1 are shown in the figure, and each of them includes multiple processor cores. Only processor core 0 of cluster 0 and processor core 1 of cluster 1 are shown, and processor core 0 wants to write data to processor core 1.
  • First, processor core 0 sends a unicast write request to write the data into local SRAM 0; CDMA 0 acts as the master and CDMA 1 acts as the slave. The master pushes the write request to the slave, that is, the master sends the write address AW and the write data W, transferring the data to SRAM 1 of cluster 1, and the slave then returns a write response B as acknowledgement. Finally, processor core 1 of cluster 1 sends a unicast read request to read the data out of SRAM 1.
  • the GDMA 311 cooperates with the external memory controller 301 to control the memory access from the SRAM 308 of the cluster 305 to the DRAM 204 , or to read data from the DRAM 204 to the SRAM 308 .
  • The communication between the DRAM 204 and the NRAM 431 or the WRAM 432 can be implemented through two channels. The first channel is to directly connect the DRAM 204 and the NRAM 431 or WRAM 432 through the IODMA 433; the second channel is to transfer data between the DRAM 204 and the SRAM 308 through the GDMA 311, and then transfer data between the SRAM 308 and the NRAM 431 or WRAM 432 through the MVDMA 434.
  • a data transmission channel can be selected according to its own hardware conditions.
  • the functionality of GDMA 311 and the functionality of IODMA 433 may be integrated in the same component.
  • GDMA 311 and IODMA 433 are regarded as different components.
  • the function of GDMA 311, the function of IODMA 433, the function of CDMA 310, and the function of MVDMA 434 can also be realized by the same component.
  • As long as the realized functions and the technical effects achieved are similar to those of the present invention, they all fall within the protection scope of the present invention.
  • the structure of the neural network model is mainly divided into two categories: long chain structure and branch structure.
  • the long-chain structure means that the neural network model is composed of layers connected by a single chain, each layer has only one input and one output, and the whole belongs to a single branch, such as the VGG16 model or the AlexNet model shown in Figure 6A.
  • The branch structure means that a sub-network in the neural network has only one input and one output, but there are multiple branches within the sub-network, that is, some layers of the sub-network have multiple inputs or outputs, such as the resblock structure of resnet50, the block structure of inception_v3, etc.
  • An example of the branch structure is shown in FIG. 7C.
  • The sub-network 701' has only one input and one output and includes the first to sixth layers; the first layer has 2 outputs and the sixth layer has 2 inputs, so the sub-network 701' includes 2 branches: one branch is the first layer → the second layer → the third layer → the sixth layer, and the other branch is the first layer → the fourth layer → the fifth layer → the sixth layer. The sub-network 701' therefore constitutes a branch structure.
  • the sub-network 702' also constitutes a branch structure.
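  • For illustration only, the following is a minimal sketch of how a topological sequence could be derived for a branch structure such as the sub-network 701' described above; the layer names and the use of Kahn's algorithm are assumptions of this sketch, not a description of the patented method:

```python
from collections import deque

# Edges of sub-network 701': layer1 -> layer2 -> layer3 -> layer6
#                            layer1 -> layer4 -> layer5 -> layer6
edges = {
    "layer1": ["layer2", "layer4"],
    "layer2": ["layer3"],
    "layer3": ["layer6"],
    "layer4": ["layer5"],
    "layer5": ["layer6"],
    "layer6": [],
}

def topological_sequence(graph):
    """Kahn's algorithm: repeatedly emit layers whose inputs have all been emitted."""
    indegree = {node: 0 for node in graph}
    for outputs in graph.values():
        for node in outputs:
            indegree[node] += 1
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order

print(topological_sequence(edges))
# ['layer1', 'layer2', 'layer4', 'layer3', 'layer5', 'layer6']
```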
  • the present invention greatly reduces off-chip on-chip data transmission by fusing adjacent layers of the neural network.
  • Figure 8B shows a schematic diagram of fusing two convolutional layers together.
  • The input of the first convolutional layer 810 is a 7×7 feature map 801, which is convolved with a 3×3 kernel (not shown) to obtain the feature map 802 of the first convolutional layer 810.
  • the value of the 5 ⁇ 5 feature sub-map 804 affects the 3 ⁇ 3 feature sub-map 805 .
  • The first convolutional layer 810 will then calculate the 5×5 feature sub-map 806, and the value of the 5×5 feature sub-map 806 affects the 3×3 feature sub-map 807.
  • The feature map 802 then becomes the input of the second convolutional layer 811, which is likewise convolved with a 3×3 kernel to obtain the feature map 803 of the second convolutional layer 811.
  • the value of the 3 ⁇ 3 feature sub-map 805 will affect the 1 ⁇ 1 feature sub-map 808 in the feature map 803 .
  • the second convolutional layer 811 will then calculate the 3 ⁇ 3 feature submap 807 , and the value of the 3 ⁇ 3 feature submap 807 will affect the 1 ⁇ 1 value in the feature map 803 Feature subgraph 809.
  • Without fusion, the computing device 201 reads the 5×5 feature sub-map 804 from the DRAM 204 when the first convolutional layer 810 is performed, stores the 3×3 feature sub-map 805 back to the DRAM 204 after the calculation is completed, then reads the 5×5 feature sub-map 806 from the DRAM 204, and stores the 3×3 feature sub-map 807 in the DRAM 204 after the calculation.
  • When the second convolutional layer 811 is performed, it is likewise necessary to read the 3×3 feature sub-map 805 from the DRAM 204, store the resulting 1×1 feature sub-map 808 in the DRAM 204, then read the 3×3 feature sub-map 807 from the DRAM 204, and store the resulting 1×1 feature sub-map 809 in the DRAM 204. It can be seen from the above description that the feature map 802 is repeatedly read from and stored to the off-chip memory as intermediate data, which considerably occupies system resources.
  • If the first convolutional layer 810 and the second convolutional layer 811 are fused, the feature map 802 is kept in the NRAM 431 (the weights of the first convolutional layer 810 and the second convolutional layer 811 can also be stored in the WRAM 432), so that the number of accesses between the computing device 201 and the DRAM 204 can be reduced, thereby improving the execution efficiency of the overall neural network.
  • Since the feature maps shown in FIG. 8B (e.g., feature map 801, feature map 802, and feature map 803) look like an inverted pyramid as a whole in the context of a neural network model, such layers are called inverted pyramid layers.
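  • For illustration only, the following sketch (assuming a "valid" convolution with unit stride) reproduces the size arithmetic of the FIG. 8B example and notes where fusion saves off-chip traffic:

```python
def conv_out_size(in_size, kernel=3, stride=1, pad=0):
    """Output edge length of a convolution (the 'valid', stride-1 case of FIG. 8B)."""
    return (in_size + 2 * pad - kernel) // stride + 1

region = 5
after_first = conv_out_size(region)        # 5x5 -> 3x3 (feature sub-maps 805/807)
after_second = conv_out_size(after_first)  # 3x3 -> 1x1 (feature sub-maps 808/809)
print(after_first, after_second)           # 3 1

# Unfused: the 3x3 intermediate is written to and re-read from off-chip DRAM.
# Fused:   the 3x3 intermediate stays in on-chip NRAM, so the off-chip traffic is
#          only the 5x5 input and the 1x1 output of the fused unit.
```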
  • FIG. 6B shows a schematic diagram of another kind of layer. Combined with the principle of the inverted pyramid layer shown in FIG. 8B, it can be seen from FIG. 6B that the input/output feature maps take the form of a positive pyramid, so such layers are called positive pyramid layers.
  • Positive pyramid layers include deconvolution layers, unpooling layers, and upsampling layers.
  • Deconvolution, also known as transposed convolution or hole convolution, is not a complete inverse process of forward convolution.
  • Deconvolution is a special forward convolution that requires parameters to participate in the calculation, and the parameters are learned through training. Deconvolution first expands the image by padding zeros according to a certain proportion, then rotates the convolution kernel, and then performs a forward convolution.
  • The pooling operation is divided into max pooling and average pooling.
  • The unpooling of max pooling retains the position information of the maximum value and fills the remaining positions with 0, as shown in FIG. 7A. In the figure, the output feature map 703 is generated by max pooling, and the max unpooling layer 704 is also shown; the feature map 703 passes through the unpooling layer 704 to generate the output feature map 705, whose size is larger than that of the feature map 703.
  • The unpooling of average pooling fills the average value back into every position of the corresponding original data area, as shown in FIG. 7B. In the figure, the average pooling layer 706 is shown, the input feature map 707 passes through it to generate the output feature map 708, and the average unpooling layer 709 is also shown; the feature map 708 passes through the unpooling layer 709 to generate the output feature map 710, whose size is larger than that of the feature map 708.
  • FIG. 8A shows a schematic diagram of upsampling, in which the input feature map 801′ passes through a max pooling layer (not shown) to generate the intermediate feature map 802′, and the intermediate feature map 802′ is then expanded by the kernel 803′ of an upsampling layer (not shown) to obtain the output feature map 804′, whose size is larger than that of the intermediate feature map 802′.
  • The common feature of the aforementioned operators is that the input feature map is smaller than the output feature map.
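  • For illustration only, the following is a minimal sketch of max unpooling and average unpooling on small matrices; the helper functions and the recorded maximum positions are hypothetical and do not reproduce the exact values of FIG. 7A and FIG. 7B:

```python
import numpy as np

def max_unpool2x2(pooled, argmax_pos, out_shape):
    """Place each pooled maximum back at its recorded position; fill the rest with 0."""
    out = np.zeros(out_shape)
    for (i, j), (oi, oj) in argmax_pos.items():
        out[oi, oj] = pooled[i, j]
    return out

def avg_unpool2x2(pooled):
    """Copy each pooled average into every position of its 2x2 region."""
    return np.kron(pooled, np.ones((2, 2)))

pooled = np.array([[7., 8.],
                   [6., 8.]])
# Recorded positions of the maxima in a hypothetical 4x4 input.
argmax_pos = {(0, 0): (1, 1), (0, 1): (1, 3), (1, 0): (3, 1), (1, 1): (3, 3)}
print(max_unpool2x2(pooled, argmax_pos, (4, 4)))
print(avg_unpool2x2(np.array([[4., 5.], [3.5, 5.5]])))
```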
  • Conventional neural network fusion is usually based on backward fusion from specific convolutional and pooling layers in the neural network, that is, the starting layer of the fusion is a convolutional layer or a pooling layer, and multiple layers are fused backward according to the hardware conditions; the fused segment may contain multiple convolutional and pooling layers.
  • The ordering of layers in neural network models has become complicated. For example, if an activation layer is placed in front of a convolutional layer, it should also be considered how this activation layer can be fused with the subsequent convolutional layer. Therefore, in addition to fusion with the convolutional layer and the pooling layer as the core, the present invention provides a variety of fusion methods.
  • Another embodiment of the present invention is a novel fusion method, which is implemented by utilizing the hardware structures of the aforementioned FIG. 1, FIG. 2, FIG. 3, and FIG. 4; this fusion is called a template fusion unit (TFU).
  • The template fusion unit mainly and flexibly fuses multiple layers into one layer through a certain fusion strategy to reduce the input/output overhead of the network, and it covers the aforementioned neural network fusion as well as other fusion methods.
  • The set of these fused layers is the template fusion unit, which can be regarded as a new layer or a custom layer.
  • The feature maps, weights, etc. required by the template fusion unit are loaded from the DRAM 204 to the on-chip SRAM 308 at one time. After a feature map is loaded into the SRAM 308, it is called the on-chip unit map, and the on-chip unit map is split into subgraphs.
  • The weights required to calculate a subgraph are loaded from the SRAM 308 to the WRAM 432; after the calculation of each subgraph is completed, the corresponding intermediate result is obtained and stored back to the SRAM 308.
  • After all subgraphs have been calculated, the calculation result is stored back to the DRAM 204 at one time. That is to say, the on-chip unit map, the weights participating in the operations of the operators in the neural network model, and the corresponding calculation results are passed between the DRAM 204 and the SRAM 308, while the outputs (intermediate results) corresponding to the subgraphs are passed between the SRAM 308 and the NRAM 431. From the perspective of the computing device 201, the data loading of the template fusion unit is in units of on-chip unit maps, and the calculation is in units of subgraphs.
  • The space of the SRAM 308 is one of the important reference indicators for the fusion strategy, and its size determines whether the template fusion unit operates in the large image mode or the small image mode.
  • the small image mode and the large image mode refer to whether a feature map stored in the DRAM 204 can be moved to the SRAM 308 for processing at one time, and the processing device 203 will compare the storage space required for the feature map with the available space in the SRAM 308. If the space of SRAM 308 is insufficient and the feature map cannot fit, it is in the large image mode; if the SRAM 308 is large enough to accommodate the entire feature map, it is in the small image mode.
  • In the large image mode, the on-chip unit map is only a part of the feature map; in the small image mode, if the available space of the SRAM 308 is large enough or the feature maps are small enough, the SRAM 308 may accommodate multiple feature maps, that is, the on-chip unit map can include multiple feature maps.
  • In the case of the large image mode, the feature map must be split before being loaded into the computing device 201.
  • The processing device 203 splits the feature map on the DRAM 204 until an on-chip unit map small enough to meet the space requirement of the SRAM 308 is generated, so that the on-chip unit map can be moved to the SRAM 308 for processing at one time.
  • When the feature map is split, input-dependent operations and output-dependent operations may be generated.
  • An input-dependent operation means that the on-chip unit maps after splitting overlap at least partially, and each subset requires some additional copies of the input to perform a complete operation, resulting in data redundancy in the splitting operation; so-called data redundancy means that the same piece of data is multiplexed in the system.
  • Input-dependent operations are caused when the template fusion unit includes layers such as convolution, pooling, or matrix multiplication.
  • the output-dependent operation means that after each subgraph produces an intermediate result, it needs to be reduced to obtain the calculation result.
  • Reduction means that, based on an understanding of the content of the on-chip unit map itself, the map is divided into subgraphs that are calculated separately to reduce the calculation scale, minimizing the amount of data while keeping the original on-chip unit map as intact as possible, and then the calculation results are restored or integrated on the basis of the subgraphs.
  • Computational results are interdependent when reducing.
  • When the template fusion unit includes layers such as inner product, convolution, matrix multiplication, sorting, counting, etc., output-dependent operations are caused.
  • The data formats of the feature maps that can be processed by this embodiment include the N, H, W, and C dimensions, where N represents batch, H represents height, W represents width, and C represents channel.
  • N represents the number of images in this batch
  • H represents the number of pixels in the vertical direction of the image
  • W represents the number of pixels in the horizontal direction
  • C represents the number of channels (for example, the number of channels of a black-and-white image is 1, and the number of channels C of an RGB color image is 3).
  • the order of these dimensions determines the composition of the data.
  • the common composition methods are NHWC and NCHW.
  • FIG. 9 shows the format difference between NCHW and NHWC. The figure takes an RGB color image as an example, where R represents a red pixel, G represents a green pixel, and B represents a blue pixel.
  • The sequence 91 is in NCHW format: N is arranged in the outer layer, the pixels within each channel are adjacent, and the channels are arranged in RGB order; the offset of the element whose coordinates are (n, c, h, w) in storage is ((n × C + c) × H + h) × W + w.
  • Sequence 92 is in NHWC format, C is arranged in the innermost layer, and the RGB pixels corresponding to the spatial positions of multiple channels are close together.
  • the figure also shows the positions of the input pixel 901, the input pixel 902, and the input pixel 903 in different arrangements, and the three input pixels 901, the input pixel 902, and the input pixel 903 are combined to form a point in the image. color.
  • the conversion method of the corresponding coordinate offset of the element whose coordinates are (n, c, h, w) is ((n ⁇ H+h) ⁇ W+w) ⁇ C+c.
  • NHWC is closer to the BMP image data storage format than NCHW.
  • The BMP format file stores data pixel by pixel, and each pixel stores the color values of all channels, which makes it unnecessary to perform additional dimensional transformations when reading the input image. Therefore, the memory access locality of NHWC is better: one output pixel can be obtained for every three input pixels, whereas NCHW must wait for all channel inputs to be ready before the final output can be obtained, which requires a large cache space.
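  • For illustration only, the following sketch verifies the two offset formulas given above on a hypothetical 2×2 RGB image, showing that the three channel values of one pixel are adjacent in NHWC but far apart in NCHW:

```python
def offset_nchw(n, c, h, w, N, C, H, W):
    # NCHW layout: ((n*C + c)*H + h)*W + w
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, N, C, H, W):
    # NHWC layout: ((n*H + h)*W + w)*C + c
    return ((n * H + h) * W + w) * C + c

# One RGB image (N=1, C=3) of size 2x2: channel values of pixel (h=0, w=1).
N, C, H, W = 1, 3, 2, 2
print([offset_nhwc(0, c, 0, 1, N, C, H, W) for c in range(3)])  # [3, 4, 5] - adjacent
print([offset_nchw(0, c, 0, 1, N, C, H, W) for c in range(3)])  # [1, 5, 9] - strided
```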
  • This embodiment determines the size of the on-chip unit map, and FIG. 10 shows the corresponding flow chart.
  • In step 1001, the processing device 203 determines whether the storage space required for the feature map is greater than the available space of the SRAM 308. If so, the feature map cannot be loaded into the SRAM 308 at one time, so step 1002 is executed to split the feature map. In this embodiment, the processing device 203 may choose to split along any dimension; it preferentially splits along the N dimension, because this generates no input-dependent or output-dependent operations. If splitting along the N dimension cannot meet the requirement, splitting along the H or W dimension is then considered, where input-dependent or output-dependent operations may occur.
  • This embodiment also supports splitting in the C dimension, especially along the Cout direction, so that one convolution is split into multiple convolutions by means of data optimization and the WRAM 432 only needs to hold fewer weights; for example, the weights are divided among the four processor cores 306. Therefore, as long as the split along a certain dimension can be handled by the computing device 201, it is an acceptable splitting manner in this embodiment.
  • the processing device 203 may sequentially perform splitting with a specific granularity among the N, H, and W dimensions, and the specific granularity may be a fixed or variable ratio, or represented by a function.
  • In this embodiment, the processing device 203 splits the feature map or the weights from large to small. Taking the feature map as an example, the feature map with dimensions NHWC is first split in the N dimension into a feature map of N1HWC and a feature map of N2HWC, where the specific granularity is a fixed ratio and N1 and N2 are each one-half of N.
  • If this is not small enough, the processing device 203 continues to split the feature map of N1HWC in the H dimension into a feature map of N1H1WC and a feature map of N1H2WC, where H1 and H2 are each one-half of H. If this is still not small enough, the processing device 203 continues to split the feature map of N1H1WC in the W dimension into a feature map of N1H1W1C and a feature map of N1H1W2C, where W1 and W2 are each one-half of W.
  • If necessary, the processing device 203 may continue to perform splits of smaller granularity in the N, H, and W dimensions, such as one-quarter, one-eighth, or one-sixteenth, until the feature map is small enough to be an on-chip unit map that can be loaded into the SRAM 308 in one go.
  • The processing device 203 may also keep splitting along one dimension, and only select another dimension to continue splitting when the first can no longer be split. For example, it keeps splitting along the H dimension; if the smallest unit still cannot be loaded into the SRAM 308, it then splits along the W dimension until the smallest unit is reached.
  • In the large image mode, the storage space required by the split on-chip unit map is usually close to the available space of the SRAM 308, so the DRAM 204 can only transmit one split feature map to the SRAM 308 at a time; in the small image mode, the SRAM 308 may be able to load several feature maps from the DRAM 204 at one time.
  • After the processing device 203 splits the feature map, the flow returns to step 1001, and the processing device 203 determines whether the storage space required by the split feature map is still larger than the available space of the SRAM 308; if so, step 1002 is executed again and the splitting continues.
  • If not, step 1003 is executed, and the processing device 203 sets the split feature map as the on-chip unit map. At this point, the processing device 203 has determined the size of the on-chip unit map.
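  • For illustration only, the following is a minimal sketch of the splitting loop of FIG. 10 under simplifying assumptions (the element size and SRAM budget are hypothetical, and alignment and redundancy are ignored):

```python
def determine_on_chip_unit_map(shape, elem_bytes, sram_available):
    """Halve the feature map along N, then H, then W until it fits the SRAM budget.

    shape = [N, H, W, C]; a simplified sketch of the FIG. 10 flow that ignores
    alignment, redundancy from input-dependent operations, and the C dimension.
    """
    n, h, w, c = shape
    for dim in (0, 1, 2):  # prefer N, then H, then W
        while n * h * w * c * elem_bytes > sram_available and shape[dim] > 1:
            shape[dim] = (shape[dim] + 1) // 2   # split at one-half granularity
            n, h, w, c = shape
    if n * h * w * c * elem_bytes > sram_available:
        raise ValueError("cannot split the feature map small enough")
    return shape  # the on-chip unit map shape

# Hypothetical example: NHWC = 4x224x224x64, 2-byte elements, 2 MiB of SRAM.
print(determine_on_chip_unit_map([4, 224, 224, 64], 2, 2 * 1024 * 1024))
# [1, 56, 224, 64]
```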
  • the processing device 203 determines the template fusion unit according to the size of the on-chip unit map.
  • FIG. 11 shows an exemplary neural network model with 14 layers in total, wherein the first segment 1101 includes layers 1 to 4, which are inverted pyramid layers, the second segment 1102 includes layers 5 to 9, which are positive pyramid layers, and the third segment 1103 includes layers 10 to 14, which are inverted pyramid layers.
  • FIG. 12A shows a flow chart of this embodiment for dynamically fusing a neural network according to the fusion strategy. As shown in FIG. 12A, in step 1201 the starting layer of the template fusion unit is selected according to the starting rule of the fusion strategy, that is, the processing device 203 selects, among the layers of the neural network that have not yet been fused, the layer from which fusion starts.
  • The starting rule may be that the starting layer is the frontmost layer of the neural network that has not yet been fused, and the processing device 203 will search for that layer.
  • Taking the AlexNet neural network model of FIG. 6A as an example, there are 23 layers in total. Assuming that the first to fifth layers have already been fused, when the starting rule is that the starting layer is the frontmost unfused layer of the neural network, the processing device 203 will select the ReLU activation layer of the 6th layer as the starting layer and fuse backward (that is, fuse in the direction of the 7th layer). It should be noted that under this starting rule, the starting layer is not necessarily a convolutional layer or a pooling layer.
  • Another starting rule is that the starting layer is the frontmost convolution or pooling layer that has not yet been fused; the processing device 203 first finds all convolution and pooling layers among the unfused layers of the neural network model, and fuses backward starting from the frontmost unfused convolution or pooling layer.
  • Also taking the AlexNet neural network model in FIG. 6A as an example, the processing device 203 will find all the convolution and pooling layers among the unfused layers of the neural network model, that is, the 11th layer, the 13th layer, and the 15th layer, and then start the fusion from the frontmost unfused convolution or pooling layer, that is, the starting layer is the 11th layer. Assuming that layers 1 to 3 in FIG. 11 have already been fused, if the starting rule is that the starting layer is the frontmost unfused layer of the neural network, then the 4th layer will be set as the starting layer of this template fusion unit and the fusion proceeds backward from the 4th layer.
  • In step 1202, the processing device 203 performs fusion on the basis of the starting layer and checks all the rules of the fusion strategy one by one, so as to establish the template fusion unit.
  • These rules ensure that the hardware resources of the computing device 201 are sufficient to support the one-time loading of the data required by the template fusion unit, so that the neural network calculation can then be performed according to the template fusion unit.
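  • For illustration only, the following is a greedy sketch of steps 1201 and 1202; the data structures and the rule predicates are assumptions of this sketch rather than the concrete rules listed below:

```python
def build_template_fusion_unit(layers, fused_flags, rules):
    """Pick a starting layer, then extend the template fusion unit backward
    while every rule of the fusion strategy still holds.

    `layers` is the topological sequence of the model, `fused_flags[i]` marks
    already-fused layers, and each rule is a predicate over the candidate unit.
    """
    # Starting rule (one variant): the frontmost layer not yet fused.
    start = next(i for i, done in enumerate(fused_flags) if not done)
    chosen = [start]
    for i in range(start + 1, len(layers)):
        candidate = chosen + [i]
        if all(rule([layers[j] for j in candidate]) for rule in rules):
            chosen = candidate        # the layer is absorbed into the unit
        else:
            break                     # a rule failed: keep the last valid unit
    for j in chosen:
        fused_flags[j] = True
    return [layers[j] for j in chosen]

# Hypothetical usage with a single toy rule limiting the unit to four layers.
layers = ["conv1", "relu1", "pool1", "conv2", "relu2"]
fused = [False] * len(layers)
print(build_template_fusion_unit(layers, fused, rules=[lambda unit: len(unit) <= 4]))
# ['conv1', 'relu1', 'pool1', 'conv2']
```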
  • the fusion strategy can exemplarily include the following rules:
  • the so-called backward fusion refers to the fusion from the initial layer to the direction of the neural network model inference. Taking Figure 6A as an example, the fusion is in the direction of the first layer ⁇ the second layer ⁇ the third layer. If there are unfused layers before the starting layer, these unfused layers will not be considered for inclusion in the template fusion unit under this rule.
  • the so-called forward fusion refers to the fusion from the initial layer to the reverse direction of neural network inference.
  • the fusion is in the direction of the third layer ⁇ the second layer ⁇ the first layer.
  • This rule is usually paired with the aforementioned starting rule that the starting layer is the first unfused convolution or pooling layer, because there may be unfused layers before the convolution or pooling layer.
  • Under this rule, the processing device 203 preferentially fuses forward, trying to incorporate the layers that have not been fused before the starting layer into the template fusion unit, and then fuses backward.
  • Also taking the AlexNet neural network model in FIG. 6A as an example, if the processing device 203 finds that the frontmost unfused convolution or pooling layer is the fifth layer, the starting layer is the fifth layer: the 4th and 3rd layers are first fused forward, and if the fusion can continue, the 6th and 7th layers are then fused backward.
  • In another example, the processing device 203 finds that the frontmost unfused convolution or pooling layer is the fourth layer, so the starting layer is the fourth layer: the third layer is first fused forward, and if the fusion can continue, the fifth and sixth layers are then fused backward.
  • This rule requires the processing device 203 to preferentially add layers to or delete layers from the template fusion unit in units of whole branch structures rather than in units of individual layers.
  • Taking FIG. 7C as an example, the processing device 203 will preferentially take the sub-network 701' or the sub-network 702' as a unit of fusion.
  • For a long-chain structure, layers are directly added to or deleted from the template fusion unit in units of layers; in other words, this rule does not apply to neural network models with a long-chain structure.
  • The fusion strategy of this embodiment does not support a template fusion unit that forms a multi-output network.
  • The reason is that the shape derivation performed inside the template fusion unit mainly proceeds from back to front; with multiple outputs, the results of the derivation will not necessarily converge to the same feature map.
  • FIG. 7C shows two fusion methods of the sub-network 701': the first is to fuse the first to fifth layers into the template fusion unit 703', and the second is to fuse the first to sixth layers into the template fusion unit 704'.
  • Since the outputs of the third layer and the fifth layer are both outputs of the template fusion unit 703', the template fusion unit 703' is a multi-output network, that is, it has a multi-branch output.
  • In contrast, only the output of the sixth layer is the output of the template fusion unit 704', and only one output datum is generated, so the template fusion unit 704' is a single-output network, that is, it has a single-branch output.
  • The processing device 203 determines whether the output of the template fusion unit is a single-branch output; if this rule is not satisfied, the processing device 203 adds or deletes layers in the template fusion unit until the rule is satisfied.
  • The processing device 203 will evaluate whether the operations of the layers to be fused are complex enough so that the fusion produces benefits.
  • A main layer refers to a layer that consumes a large amount of input/output resources, such as a matrix multiplication, pooling, or convolution layer.
  • Pooling here includes various types of pooling, such as max pooling (maxpool) and average pooling (avgpool), and convolution likewise includes various types of convolution, such as ordinary convolution, convolution with mean, depthwise convolution (depthwise conv), etc.
  • This rule is that the template fusion unit includes at least 2 main layers.
  • Rule 6: include a continuous structure in which a main layer, a main layer, and a non-main layer are adjacent in sequence
  • the template fusion unit needs to include a continuous structure of the main layer, the main layer and the non-main layer, that is, the continuous structure in which the main layer, the main layer and the non-main layer are adjacent in sequence.
  • Such operations are complex enough for fusion to be beneficial.
  • For example, the 4th layer - 5th layer - 6th layer in FIG. 6A, where the 4th layer is a max pooling layer, the 5th layer is a convolutional layer, and the 6th layer is a ReLU activation layer, conforms to the continuous structure of main layer, main layer, and non-main layer adjacent in sequence, so a template fusion unit including the 4th, 5th, and 6th layers satisfies this rule.
  • If the processing device 203 determines that the rule is not satisfied, it adjusts the template fusion unit until the rule is satisfied.
  • This rule is a continuous structure in which the template fusion unit includes a scalar computing layer and a vector computing layer, that is, a continuous structure in which the scalar computing layer and the vector computing layer are adjacent in sequence.
  • the scalar calculation layer refers to an addition layer, a subtraction layer or a multiplication layer
  • the vector calculation layer refers to an activation layer, a batch normalization layer or a scaling layer.
  • This rule is that the weight of the convolutional layer in the template fusion unit is not the output of any layer of the neural network, regardless of whether the layer is included in the template fusion unit or not.
  • If the processing device 203 determines that this rule is not satisfied, it removes the convolutional layer (i.e., the convolution operator) from the template fusion unit.
  • The large image mode places fewer restrictions on the WRAM 432, because the on-chip unit map loaded into the SRAM 308 is only a part of one feature map, and the WRAM 432 only needs to store all the weights required by that feature map.
  • the small image mode may load multiple feature maps into the SRAM 308, in this case, the required weights will increase, and it is necessary to carefully evaluate whether the available space of the WRAM 432 is sufficient.
  • This rule is that the storage space required for the weights involved in the on-chip unit map is not greater than the available space of the WRAM 432. If the processing device 203 determines that this rule is not satisfied, it reduces the size of the on-chip unit map.
  • Here, Wj is the storage space required for the weights involved in the on-chip unit map j, n is the number of processor cores in the cluster, and W is the available space of the WRAM 432.
  • The redundancy percentage is the ratio of the sum of the redundancy generated by input-dependent operations and output-dependent operations to the normal, non-redundant input/output amount of the template fusion unit.
  • The processing device 203 calculates, after the template fusion unit fuses the current layer, the percentage of the memory access amount size_TFU of the on-chip unit map moved from the DRAM 204 to the SRAM 308 relative to the normal input/output amount (excluding redundancy) size_ori, where the memory access amount size_TFU refers to the theoretical memory access amount size_ori plus the sum of the redundancy. Its formula is as follows:
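  • The formula itself is not reproduced in this text; a plausible reconstruction from the surrounding definitions is:

```latex
% Plausible reconstruction from the surrounding definitions; the original
% formula is not reproduced in this text.
\mathit{size}_{\mathrm{TFU}} = \mathit{size}_{\mathrm{ori}} + \sum \mathrm{redundancy},
\qquad
\mathrm{percentage} = \frac{\mathit{size}_{\mathrm{TFU}}}{\mathit{size}_{\mathrm{ori}}}
```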
  • Taking into account the split information and shape derivation of the template fusion unit, the processing device 203 sets the percentage threshold to 50%, 75%, 100%, 125%, or 160%, preferably 100%. Taking a percentage threshold of 100% as an example, it means that when the sum of the redundancy is greater than twice the normal input/output amount of the template fusion unit, the fusion is not performed. This rule is that the sum of the redundancy generated by splitting the on-chip unit map does not exceed a specific ratio related to the percentage threshold; once it is exceeded, there are too many redundant parts and a lot of resources will be spent on computing the redundancy, degrading performance. Therefore, when the processing device 203 determines that this rule is not satisfied, it stops the fusion.
  • In the small image mode, since at least one complete feature map is loaded from the DRAM 204 to the SRAM 308 at a time, there is no redundancy; this rule does not apply to the small image mode.
  • If the storage spaces of the input (IN, the on-chip unit map) and the output (OUT, the calculation result) cannot be reused, this rule is that the sum of the storage space of the on-chip unit map and the storage space of the calculation result is not greater than the available space of the SRAM 308; if the storage spaces of IN and OUT can be reused, this rule is that the larger of the storage space of the on-chip unit map and the storage space of the calculation result is not greater than the available space of the SRAM 308.
  • If the processing device 203 judges that this rule is not satisfied, it reduces the number of on-chip unit maps until the rule is satisfied.
  • the weights involved in the convolution operation in the template fusion unit are carried independently and reside on the WRAM 432 .
  • The WRAM 432 stores the weights of two adjacent subgraphs at the same time. Assuming that the storage space required for the weights of each subgraph i is Wi and the total space of the WRAM 432 is W, this rule is that the space of the WRAM 432 needs to satisfy the following condition:
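  • The condition itself is not reproduced in this text; one plausible reading, given that the weights of two adjacent subgraphs reside on the WRAM 432 at the same time, is:

```latex
% One plausible reading of the elided condition: the weights of any two
% adjacent subgraphs must fit in the WRAM 432 at the same time.
\max_{i}\left(W_{i} + W_{i+1}\right) \le W
```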
  • If the processing device 203 judges that this rule is not satisfied, it reduces the number of on-chip unit maps until the rule is satisfied.
  • This rule is that the storage space required by the subgraph is not larger than the available space of the NRAM 431.
  • For the subgraphs, the processing device 203 can perform fine-grained splitting in the N, H, and W dimensions; if the space of the NRAM 431 is insufficient, the processing device 203 splits the on-chip unit map more finely until this rule is satisfied.
  • In general, the NRAM 431 has a reasonable amount of available space, so the on-chip unit map can be split to a reasonable extent and loaded at one time; from the perspective of the fusion strategy, the template fusion unit is not affected by the number of batches. However, the more finely the on-chip unit map is split (that is, the more subgraphs there are), the lower the processing speed, so the processing device 203 needs to evaluate the space of the NRAM 431.
  • the space of the SRAM 308 corresponds to the number of NRAMs 431 of the processor cores 306 in the cluster 305.
  • For example, the cluster 305 includes 4 processor cores 306, and the space of the SRAM 308 is equal to 4 times the space of the NRAM 431.
  • In the large image mode, the on-chip unit map can generally be allocated to the 4 processor cores 306 for processing.
  • This architectural design has considered that the data loaded into the SRAM 308 can be allocated to all the NRAMs 431 at one time. Therefore, this rule does not need to be considered in large image mode.
  • Rule 18: the number of feature maps is not greater than the feature map threshold
  • The on-chip unit map may include multiple feature maps.
  • the processing device 203 will calculate an appropriate number of fusion layers according to the number of feature maps in the on-chip unit map, so as to maximize the benefit.
  • This rule is that the number of feature maps in the on-chip unit map is not greater than the feature map threshold.
  • If the processing device 203 determines that this rule is not satisfied, it reduces the number of feature maps in the on-chip unit map until the rule is satisfied.
  • Stride redundancy refers to the following: when the template fusion unit fuses too many layers, and the kernel length and width of the convolution and pooling layers are larger than their strides, the input data required by neighboring output points overlap; this corresponds to the input-dependent operations mentioned above, and the overlapping portion is the stride redundancy. Stride redundancy forces each processor core 306 to read more data, and this multiplexed data occupies on-chip and off-chip access resources; the more layers the template fusion unit includes, the more serious the stride redundancy becomes. This rule is that the sum, over the convolution and pooling layers, of the difference between the kernel edge length and the stride is not greater than the redundancy threshold.
  • The redundancy threshold is defined as follows. Assuming that the kernel length and width of a convolution or pooling layer are kx and ky, and the strides in the length and width directions are sx and sy respectively, the stride redundancy in the length direction is the sum of kx - sx over all convolution and pooling layers in the template fusion unit; similarly, the stride redundancy in the width direction is the sum of ky - sy over all convolution and pooling layers in the template fusion unit.
  • The redundancy threshold in this embodiment can be 3, 4, 5 or 6, preferably 4; this rule is not satisfied as long as the stride redundancy in either the length or the width direction is greater than the redundancy threshold.
  • In that case the processing device 203 adjusts the template fusion unit, usually by reducing the number of fused layers, until this rule is satisfied.
  • However, the fusion strategy sets an exception to the stride redundancy rule: if the layers to be fused contain multiple branches and the template fusion unit can fuse the entire multi-branch structure, the performance of the template fusion unit will be better, so in this case the processing device 203 ignores stride redundancy.
  • In other words, stride redundancy does not restrict the template fusion unit from fusing multiple branches; in the fusion strategy of this embodiment, fusing multiple branches takes precedence over the stride redundancy restriction, and stride redundancy is only considered in the single-branch case.
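  • A minimal sketch of the stride redundancy check, including the multi-branch exception, is given below; the threshold default of 4 follows the preferred value mentioned above, and all names are illustrative assumptions.

```python
def stride_redundancy_ok(layers, threshold=4, fuses_whole_multibranch=False):
    """Check the stride redundancy rule for a template fusion unit.

    `layers` is an iterable of (kx, ky, sx, sy) tuples, one for every
    convolution and pooling layer in the unit. The rule is ignored when the
    unit fuses an entire multi-branch structure."""
    if fuses_whole_multibranch:
        return True  # exception: fusing multiple branches takes precedence
    redundancy_x = sum(kx - sx for kx, _ky, sx, _sy in layers)
    redundancy_y = sum(ky - sy for _kx, ky, _sx, sy in layers)
    return redundancy_x <= threshold and redundancy_y <= threshold
```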
  • the neural network calculation is performed according to the established template fusion unit.
  • the computing device 201 is based on the three-level operation level of the system-on-chip-cluster-processor core, and is matched with a three-level memory design such as DRAM-SRAM-NRAM/WRAM.
  • First, the data required to compute the template fusion unit is loaded from the DRAM 204 to the SRAM 308, so that the data can be cached and computed at the appropriate level and a sufficient pipeline is formed.
  • the calculation results are sent from the SRAM 308 to the DRAM 204.
  • the present invention is based on a template fusion unit, which can reduce input/output overhead in neural network computing.
  • FIG. 12B shows its flow.
  • the template fusion unit is determined according to the fusion strategy.
  • the processing device 203 selects the start layer of the template fusion unit according to the start rule of the fusion strategy; performs fusion based on the start layer, and checks all the rules of the fusion strategy one by one to establish a template fusion unit.
  • Various rules of the fusion policy have been described in detail in the previous embodiment, and will not be repeated here.
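  • Purely as an illustration of how fusion might proceed while the rules are checked one by one, the following greedy sketch extends the fusion unit layer by layer and rolls the last layer back when any rule fails; the real embodiment may instead adjust the on-chip unit map or other parameters, so this is only a simplified model with assumed names.

```python
def build_template_fusion_unit(layers, start_idx, rules):
    """Greedily fuse layers starting from the starting layer; whenever any
    rule of the fusion strategy fails, drop the last added layer and stop."""
    tfu = [layers[start_idx]]
    for layer in layers[start_idx + 1:]:
        tfu.append(layer)
        if not all(rule(tfu) for rule in rules):
            tfu.pop()  # the last extension violated a rule
            break
    return tfu
```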
  • The template fusion unit is presented in the form of source code, which then needs to be converted by the compiler into object code in machine language, also known as machine code.
  • The following steps describe the process by which the compiler converts the source code of the template fusion unit into the object code of machine language.
  • In step 1202', the shape of the template fusion unit is deduced.
  • To do so, this embodiment adopts a reverse inference method.
  • That is, the compiler deduces the required input size backwards from the output. Taking FIG. 8B as an example, the deduction proceeds backwards from the feature map 803 to the feature map 802, and then backwards to the feature map 801.
  • The compiler not only deduces the required input data according to the template fusion unit, but also further deduces the redundancy.
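  • For a chain of convolution or pooling layers, the backward shape deduction can be illustrated with the usual size relation in = (out - 1) * stride + kernel - 2 * padding, as in the following sketch with assumed names; applied to the two 3x3, stride-1 layers of FIG. 8B it recovers 3x3 -> 5x5 -> 7x7.

```python
def input_extent(out_extent: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Input size along one axis needed to produce `out_extent` outputs."""
    return (out_extent - 1) * stride + kernel - 2 * padding


def deduce_input_shape(out_hw, layers_last_to_first):
    """Walk from the last fused layer back to the first, enlarging the shape.

    `layers_last_to_first` is a list of (kernel, stride, padding) tuples."""
    h, w = out_hw
    for kernel, stride, padding in layers_last_to_first:
        h = input_extent(h, kernel, stride, padding)
        w = input_extent(w, kernel, stride, padding)
    return h, w


# Two 3x3, stride-1, unpadded layers: 3x3 -> 5x5 -> 7x7, as in FIG. 8B.
assert deduce_input_shape((3, 3), [(3, 1, 0), (3, 1, 0)]) == (7, 7)
```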
  • step 1203' is executed to deduce the address.
  • In this step, the compiler deduces the addresses of the on-chip storage space for the entire control flow graph and implements access through general addresses, so as to reduce the use of computing resources and shorten the computing time.
  • a control flow graph is an abstract data structure used in the compiler, which represents all the paths that a program may execute, and reflects the possible flow of all nodes in the process in the form of a flowchart.
  • a control flow graph is composed of nodes and relationships between nodes.
  • a node also known as a basic block (BB) is a sequence of statements that are executed sequentially in the program to the greatest extent possible. Each basic block has only one entry and one exit. When executing, it enters from its entry and exits from its exit. The characteristic of the basic block is that as long as the first instruction is executed, all the instructions in the basic block will be executed in order.
  • Each basic block contains at least one instruction, and the instructions in the basic block may use pointers to specific on-chip memory spaces.
  • a pointer is a variable that holds the address of a specific address space. Through the pointer, the processor core 306 can load data into the space of the specific address pointed to by the pointer, or fetch data from the specific address pointed to by the pointer.
  • the compiler initially divides the basic blocks, and after iterative operations, confirms the basic blocks and their interrelationships, and thus completes the target code for implementing the template fusion unit.
  • The compiler also analyzes the data multiplexed between two adjacent template fusion units of the neural network, determines how much data from the previous template fusion unit can be left on the chip for the next template fusion unit, and plans the storage address of each piece of data accordingly.
  • the compiler completes the deduction of the address in the control flow graph.
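  • A very simplified picture of a control flow graph node and of binding its pointers to derived on-chip addresses is sketched below; this data structure is an assumption for illustration and is not the compiler's actual representation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BasicBlock:
    """A control flow graph node: a maximal straight-line instruction sequence
    with a single entry and a single exit."""
    instructions: List[str] = field(default_factory=list)
    successors: List["BasicBlock"] = field(default_factory=list)
    # Symbolic pointers into on-chip storage; the concrete addresses are only
    # filled in after address derivation and on-chip space allocation.
    pointers: Dict[str, Optional[int]] = field(default_factory=dict)


def bind_addresses(blocks: List[BasicBlock], layout: Dict[str, int]) -> None:
    """Point every symbolic pointer of every basic block at its derived address."""
    for block in blocks:
        for name in block.pointers:
            block.pointers[name] = layout.get(name)
```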
  • step 1204' on-chip storage space is allocated.
  • the processing device 203 allocates the physical space of the SRAM 308, the NRAM 431 and the WRAM 432 based on the derivation of the template fusion unit address.
  • the compiler completes the pointing of the pointer in the control flow graph.
  • step 1205' is executed to generate executable instructions.
  • the linker links the object code generated by the compiler and the library to make it an executable file.
  • object code is a program module that includes machine code and information available to the linker.
  • the linker's job is to resolve undefined symbol references, replace the placeholders in the object code with the addresses of the symbols, and generate executable instruction.
  • the executable instructions can be directly executed by the computing device 201 to complete the computation of the neural network.
  • FIG. 12C shows its flow.
  • In step 1201", the initial layer of the template fusion unit is selected according to the starting rule of the fusion strategy.
  • That is, the processing device 203 selects the initial layer of the template fusion unit according to the starting rule of the fusion strategy; from the layers of the neural network model that have not yet been fused, it selects the layer at which to start fusing.
  • In step 1202", fusion is performed on the basis of the initial layer, and all the rules of the fusion strategy are checked one by one to establish a template fusion unit.
  • That is, the processing device 203 performs fusion on the basis of the initial layer and checks all the rules of the fusion strategy one by one.
  • When all the rules are satisfied, the template fusion unit is established.
  • The various rules of the fusion strategy have been illustrated in detail in the description of FIG. 12B and are not repeated here. Under the premise that all the rules are satisfied, the hardware resources of the computing device 201 are sufficient to load the data required to compute the template fusion unit at one time, after which the neural network calculation is performed according to the template fusion unit.
  • Since the fourth layer is exemplarily set as the starting layer of the template fusion unit in step 1201", in this step fusion proceeds backwards from the fourth layer, and all the rules of the fusion strategy are checked one by one to establish the template fusion unit.
  • The fifth layer, which is a positive pyramid layer, is also fused in, and if the fusion can continue, the processing device 203 continues to fuse backwards.
  • FIG. 13A shows the shapes of the input/output feature maps of the fifth layer and the sixth layer in FIG. 11 .
  • the fifth layer is the deconvolution layer and the sixth layer is the Relu activation layer.
  • The input feature map of the fifth layer exemplarily includes three input data X1, X2 and X3. After passing through the fifth layer, the input data X1 generates output data Y1 to Y2, the input data X2 generates output data Y3 to Y4, and the input data X3 generates output data Y5 to Y6; the output data Y1 to Y6 are then activated by the sixth layer.
  • The activation layer does not change the amount of data, so the input data Y1 to Y6 pass through the sixth layer to generate the output data Z1 to Z6 shown in the figure, respectively.
  • FIG. 13B shows a flow chart of establishing a template fusion unit based on a positive pyramid layer.
  • As shown in FIG. 13A, the fifth layer includes 3 fusion blocks: X1-Y1-Y2 is the first fusion block 1301, X2-Y3-Y4 is the second fusion block 1302, and X3-Y5-Y6 is the third fusion block 1303. The sixth layer also includes 3 fusion blocks: Y1-Y2-Z1-Z2 is the fourth fusion block 1304 (derived from input data X1), Y3-Y4-Z3-Z4 is the fifth fusion block 1305 (derived from input data X2), and Y5-Y6-Z5-Z6 is the sixth fusion block 1306 (derived from input data X3).
  • In step 1301', the processing device 203 sets all output data corresponding to the same input data as a fusion block, that is, it identifies the aforementioned fusion blocks 1301 to 1306.
  • In step 1302', taking the fusion block as the unit, a template fusion unit is established according to the fusion strategy; a simplified sketch of step 1301' follows.
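```python
def make_fusion_blocks(produces):
    """Group every input datum with all the output data it generates.

    `produces` maps an input name to the outputs it produces, e.g.
    {"X1": ["Y1", "Y2"], "X2": ["Y3", "Y4"], "X3": ["Y5", "Y6"]}."""
    return [[x] + list(ys) for x, ys in produces.items()]


# Fifth layer of FIG. 13A: fusion blocks 1301, 1302 and 1303.
blocks = make_fusion_blocks({"X1": ["Y1", "Y2"],
                             "X2": ["Y3", "Y4"],
                             "X3": ["Y5", "Y6"]})
# blocks == [['X1', 'Y1', 'Y2'], ['X2', 'Y3', 'Y4'], ['X3', 'Y5', 'Y6']]
```
  • The mapping passed to make_fusion_blocks is an assumed representation used only for illustration of how outputs produced by the same input are grouped.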
  • the rules related to the fusion of positive pyramid layers also include:
  • One rule is that work is allocated to each processor core 306 in units of fusion blocks. Since a fusion block comes from the same input data and is a complete data block, it is more convenient to divide fusion blocks into subgraphs. If a subgraph included incomplete fusion blocks, for example fusion block 1301, fusion block 1304, part of fusion block 1302 (data block 1307) and part of fusion block 1305 (data block 1308), it would be difficult for the next processor core 306 to determine which parts of the fusion blocks 1302 and 1305 have been processed and which have not.
  • In other words, the next processor core 306 cannot know the sizes of the data blocks 1307 and 1308, so a problem arises when the on-chip unit map is split into subgraphs, which could result in parts of the data being missed and not calculated.
  • Therefore, the processing device 203 allocates work to each processor core 306 in units of fusion blocks. Assuming a certain processor core 306 still has space after completely calculating the fusion blocks 1301 and 1304, the processing device 203 further determines whether that processor core 306 can also calculate the fusion blocks 1302 and 1305; if so, the fusion blocks 1302 and 1305 are also assigned to this processor core 306, and if not, they are assigned to the next processor core 306.
  • Another rule is that when a specific fusion block would otherwise be calculated repeatedly by several processor cores, a specific processor core 306 is assigned to calculate that fusion block, the intermediate result of calculating it is stored in the SRAM 308, and the storage core 307 merges this intermediate result with the intermediate results produced by the other processor cores 306.
  • For example, if the fusion blocks 1301, 1302, 1304 and 1305 were allocated to the first processor core and the fusion blocks 1302, 1303, 1305 and 1306 were allocated to the second processor core, the fusion blocks 1302 and 1305 would be calculated repeatedly. In that case the processing device 203 re-adjusts the task allocation and assigns the fusion blocks 1302 and 1305 to only one of the processor cores, for example the first processor core, so the first processor core still calculates the fusion blocks 1301, 1302, 1304 and 1305, while the second processor core only needs to calculate the fusion blocks 1303 and 1306.
  • The intermediate results are stored in the SRAM 308, and the storage core 307 merges the intermediate results obtained by the first processor core for the fusion blocks 1302 and 1305 with the intermediate results obtained by the second processor core for the fusion blocks 1303 and 1306, thereby producing the intermediate results corresponding to the fusion blocks 1301, 1302, 1304 and 1305 and the intermediate results corresponding to the fusion blocks 1302, 1303, 1305 and 1306.
  • On the one hand this saves computing resources, and on the other hand it also satisfies the output dependency relationship; a sketch of such an allocation follows.
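```python
def assign_fusion_blocks(block_sizes, num_cores, core_capacity):
    """Assign whole fusion blocks to processor cores, never splitting a block
    and never assigning the same block to two cores."""
    assignment = {core: [] for core in range(num_cores)}
    core, used = 0, 0
    for block, size in block_sizes.items():
        if used + size > core_capacity:  # move on once the current core is full
            core, used = core + 1, 0
            if core >= num_cores:
                raise RuntimeError("not enough processor cores in one pass")
        assignment[core].append(block)
        used += size
    return assignment


# With equal-sized blocks and room for four per core, the first core gets
# 1301, 1304, 1302 and 1305, while the second core gets 1303 and 1306.
print(assign_fusion_blocks({1301: 2, 1304: 2, 1302: 2, 1305: 2,
                            1303: 2, 1306: 2}, num_cores=2, core_capacity=8))
```
  • This sketch uses a single assumed capacity per core and ignores the other hardware-resource rules; it is illustrative only.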
  • If the processing device 203 judges that such a rule is not satisfied, the processing device 203 reduces the number of on-chip unit maps until the rule is satisfied.
  • the storage space required for the weights involved in the fusion block is not larger than the available space of the WRAM 432 .
  • If the processing device 203 determines that these fusion-strategy rules are not satisfied, the processing device 203 reduces the number of fusion blocks; the other rules are not repeated here.
  • It should be noted that a positive pyramid layer may need to be padded with zeros at a certain ratio to expand the size of the output image, after which the convolution kernel is rotated and a forward convolution is performed; therefore, when positive pyramid layers are fused, the weights refer to the output-channel weights after zero padding.
  • This embodiment does not limit how positive pyramid layers and inverted pyramid layers are fused. All the positive pyramid layers can be fused together, for example the template fusion unit includes layers 5 to 9; positive and inverted pyramid layers can also be mixed together, for example the template fusion unit includes layers 3 to 6, or the template fusion unit includes layers 9 to 12, and so on. In other words, the template fusion unit may include only positive pyramid layers, inverted pyramid layers followed by positive pyramid layers, or positive pyramid layers followed by inverted pyramid layers.
  • In step 1202", the template fusion unit is presented in the form of source code, which then needs to be converted by a compiler into object code in machine language, also called machine code. The subsequent steps describe the process by which the compiler converts the source code of the template fusion unit into the object code of machine language.
  • Step 1203" is then executed to deduce the shape of the template fusion unit.
  • Like the previous embodiment, this embodiment adopts a reverse inference method: the compiler deduces the required input size backwards from the output. Taking FIG. 8B as an example, the deduction proceeds backwards from the feature map 803 to the feature map 802, and then backwards to the feature map 801.
  • The compiler not only deduces the required input data according to the template fusion unit, but also further deduces the redundancy.
  • step 1204" is executed to deduce the address.
  • the compiler deduces the address of the on-chip storage space for the entire control flow graph, and realizes the access of the general address, so as to achieve the purpose of reducing computing resources and shortening computing time.
  • Control flow graph is an abstract data structure used in the compiler, which represents all the paths that a program may execute, and reflects the possible flow of all nodes in the process in the form of a flowchart.
  • A control flow graph is composed of nodes and the relationships between the nodes.
  • a node is also called a basic block (BB), which is a sequence of statements executed in the program to the greatest extent possible.
  • Each basic block has only one entry and one exit; execution enters through its entry and leaves through its exit. The characteristic of a basic block is that once its first instruction is executed, all the instructions in the basic block are executed in order.
  • Each basic block contains at least one instruction, and the instructions in the basic block may use pointers to specific on-chip memory spaces.
  • a pointer is a variable that holds the address of a specific address space. Through the pointer, the processor core 306 can load data into the space of the specific address pointed to by the pointer, or fetch data from the specific address pointed to by the pointer.
  • the compiler initially divides the basic blocks, and after iterative operations, confirms the basic blocks and their interrelationships, and thus completes the target code for implementing the template fusion unit.
  • The compiler also analyzes the data multiplexed between two adjacent template fusion units of the neural network, determines how much data from the previous template fusion unit can be left on the chip for the next template fusion unit, and plans the storage address of each piece of data accordingly.
  • the compiler completes the deduction of the address in the control flow graph.
  • step 1205" the on-chip storage space is allocated.
  • the processing device 203 allocates the physical space of SRAM 308, NRAM 431 and WRAM 432 based on the derivation of the template fusion unit address.
  • the compiler completes the pointing of the pointer in the control flow graph .
  • step 1206" is executed to generate executable instructions.
  • the linker links the object code generated by the compiler and the library to make it an executable file.
  • The object code is a program module that includes machine code and information available to the linker; the linker's job is to resolve undefined symbol references, replace the placeholders in the object code with the addresses of the symbols, and then generate executable instructions.
  • Finally, the computing device 201 executes the executable instructions to perform the neural network computation according to the template fusion unit.
  • This embodiment can fuse the positive pyramid layer and the inverted pyramid layer.
  • Such a fusion strategy makes the establishment of the template fusion unit more flexible and not limited by the sizes of the input and output feature maps, thereby adapting to various network models, making the fusion more comprehensive and improving overall efficiency.
  • the starting rule may be that the starting layer is the most unfused layer in the neural network, and this layer may be a layer other than a convolutional layer or a pooling layer.
  • Such a starting rule makes the establishment of the template fusion unit more flexible: for different neural networks, the starting layer can be chosen appropriately based on the ordering of the layers, without being limited by the position and number of convolutional or pooling layers in the model, so as to adapt to various network models, make the fusion more comprehensive and improve overall efficiency.
  • For example, if the next convolution or pooling layer after the current fusion point is the 8th layer, then under a convolution-or-pooling-only starting rule the 6th and 7th layers might not be fused, which affects the overall benefit.
  • Another embodiment of the present invention is a scheme for integrating neural networks, wherein the starting layer is a layer other than the convolutional layer and the pooling layer, that is, the non-convolutional layer and the non-pooling layer.
  • This embodiment is also implemented based on the framework of FIGS. 1 to 4 .
  • This embodiment also executes the flowchart shown in FIG. 12A.
  • the starting layer is selected according to the fusion strategy.
  • the processing device 203 selects the starting layer according to the fusion strategy.
  • the starting rule of the fusion strategy is that the starting layer is the most unfused layer in the neural network, and this layer is a layer other than the convolutional layer or the pooling layer.
  • the starting layer can be an element-wise layer, an addpadding layer, or a custom layer.
  • In other words, this step does not use the starting rule that the starting layer must be the earliest unfused convolution or pooling layer. If the starting layer were selected according to that rule, it would have to be a convolutional or pooling layer, and the advantage of this embodiment, namely not being limited by the position and number of convolutional or pooling layers in the neural network model, would be lost. A sketch of this start-layer selection follows.
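```python
def select_start_layer(layers, fused_indices):
    """Pick the earliest layer that has not been fused yet, regardless of its
    type (it may be element-wise, add-padding, a custom layer, and so on)."""
    for idx, layer in enumerate(layers):
        if idx not in fused_indices:
            return idx, layer
    return None, None  # every layer has already been fused
```
  • The function and parameter names above are illustrative assumptions used to summarize the starting rule of this embodiment.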
  • the fusion is preferentially performed in the unit of the branch structure.
  • However, when the branch structure is too complex for the entire branch structure to be fused into the template fusion unit, fusing the branch structure can only be abandoned under the aforementioned rules.
  • Rule 4 requires that the output of the template fusion unit must be a single branch output, which also reflects that fusion must be performed in units of branch structures.
  • Therefore, a fusion strategy built on rules 3 and 4 is not friendly to neural network models with branch structures, and the fusion effect is poor.
  • FIG. 13C shows an exemplary neural network model segment that includes a branch structure 1300". The starting point of the branch structure 1300" is the T1 layer and the end point is the T10 layer. A first branch 1301" and a second branch 1302" expand from the T1 layer; the first branch 1301" includes the T2 layer and the T3 layer, and the second branch 1302" includes the T4 layer to the T9 layer.
  • A topological sequence refers to a linear sequence of all nodes in a directed acyclic graph and must satisfy the following two conditions: each node appears exactly once, and if there is a path from node A to node B, then node A appears before node B in the sequence. Simply put, it is the process of obtaining a total order on a set from a partial order on that set. Based on these principles, when establishing the topological sequence, the processing device 203 first identifies the start point and end point of the branch structure 1300", that is, the start point is the T1 layer and the end point is the T10 layer.
  • The processing device 203 sets the start point of the branch structure 1300" as the start point of the topological sequence.
  • the starting point is also set as the starting layer of the template fusion unit, and the end point of the branch structure 1300" is set as the end point of the topological sequence, and then the middle layers in the branch structure 1300" are arranged according to the topology.
  • The middle layers can be arranged in the following two ways.
  • The first arrangement method is to compare the number of layers of each branch and arrange the layers of the branches in order from the most layers to the fewest; the second arrangement method is to compare the number of layers of each branch and arrange the layers of the branches in order from the fewest layers to the most.
  • This embodiment adopts the second arrangement.
  • The first branch 1301" has 2 layers and the second branch 1302" has 6 layers; the first branch 1301" has fewer layers, so the layers in the first branch 1301" are arranged before the layers in the second branch 1302".
  • Thus a topological sequence of T1 layer → T2 layer → T3 layer → T4 layer → T5 layer → T6 layer → T7 layer → T8 layer → T9 layer → T10 layer is formed.
  • As a result, the topological sequence of the branch structure 1300" forms a long-chain structure 1400.
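  • Flattening a branch structure into a topological sequence in which whole branches are ordered by their number of layers can be sketched as follows, assuming each branch is a simple chain without sub-branches; the example reproduces the sequence for the branch structure 1300".

```python
def branch_topological_sequence(start, end, branches, fewest_first=True):
    """Flatten a branch structure into a topological sequence: the start layer
    first, the end layer last, whole branches ordered by their layer count
    (fewest first here, matching the arrangement adopted in this example)."""
    ordered = sorted(branches, key=len, reverse=not fewest_first)
    return [start] + [layer for branch in ordered for layer in branch] + [end]


# Branch structure 1300": T1 expands into (T2, T3) and (T4 ... T9), ends at T10.
seq = branch_topological_sequence(
    "T1", "T10",
    [["T2", "T3"], ["T4", "T5", "T6", "T7", "T8", "T9"]])
# seq == ['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9', 'T10']
```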
  • Next, layers in the topological sequence, rather than the entire branch structure, are used as the unit for adding layers to or removing layers from the template fusion unit.
  • a template fusion unit is established.
  • In other words, the processing device 203 regards the neural network model with the branch structure 1300" as a neural network model with the long-chain structure 1400, and takes the starting layer (the T1 layer) of the long-chain structure 1400 as the reference for fusion.
  • Any rules of the aforementioned fusion strategy except rule 3 and rule 4 can be selected to establish the template fusion unit.
  • the template fusion unit does not necessarily need to include the entire branch structure.
  • the long-chain structure 1400 can generate two template fusion units: the first template fusion unit 1401 includes layers T1 to T5, and the second template fusion unit 1402 includes layers T6 to T10.
  • When restored to the branch structure, the shapes of the first template fusion unit 1401 and the second template fusion unit 1402 are as shown in FIG. 15. The first template fusion unit 1401 has two branch outputs, which are connected to the T10 layer and the T6 layer of the second template fusion unit 1402 respectively; that is to say, the first template fusion unit 1401 has two outputs and the second template fusion unit 1402 has two inputs.
  • the processing device 203 determines whether the first template fusion unit 1401 includes the end point of the branch structure.
  • In this example the first template fusion unit 1401 does not include the T10 layer, so the processing device 203 further determines whether the available space of the NRAM 431 is large enough. If so, when deriving the addresses the processing device 203 causes the computing device 201 to store the two calculation results output by the first template fusion unit 1401 (that is, the intermediate results of its last layers, the T3 layer and the T5 layer) in the NRAM 431, so that the second template fusion unit 1402 can read the values directly from the NRAM 431 for calculation.
  • If the available space of the NRAM 431 is not large enough, the processing device 203 further determines whether the available space of the SRAM 308 is large enough; if it is, the two calculation results are stored in the SRAM 308 and can be read directly from the SRAM 308 when the second template fusion unit 1402 is calculated.
  • Since these two calculation results are the on-chip unit map of the second template fusion unit 1402, the computing device 201 does not need to load the on-chip unit map from the DRAM 204 when calculating the second template fusion unit 1402, but reads it directly from the NRAM 431 or the SRAM 308 for calculation, reducing on-chip and off-chip accesses.
  • If neither the NRAM 431 nor the SRAM 308 has enough available space, the computing device 201 stores the two calculation results produced by the first template fusion unit 1401 back into the DRAM 204, and when the computing device 201 calculates the second template fusion unit 1402, the two calculation results are loaded from the DRAM 204 for calculation.
  • the processing device 203 determines whether the second template fusion unit 1402 includes the end point of the branch structure.
  • Since the second template fusion unit 1402 does include the T10 layer, when the processing device 203 deduces the addresses it causes the computing device 201 to store the calculation result produced by the second template fusion unit 1402 back into the DRAM 204. A sketch of this placement decision follows.
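```python
def place_intermediate_results(includes_endpoint: bool, result_size: int,
                               nram_free: int, sram_free: int) -> str:
    """Decide where the last-layer results of a template fusion unit live,
    preferring the NRAM, then the SRAM, then the DRAM."""
    if includes_endpoint:
        return "DRAM"   # the result of the whole branch structure goes off-chip
    if result_size <= nram_free:
        return "NRAM"   # the next unit reads directly from the core memory
    if result_size <= sram_free:
        return "SRAM"   # the next unit reads from the shared cluster memory
    return "DRAM"       # fall back to off-chip storage
```
  • The names in this sketch are assumptions; it only summarizes the preference order described in the surrounding text.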
  • In summary, the processing device 203 of this embodiment converts the branch structure into a long-chain structure, which is simple and makes it easy to generate a template fusion unit, and then restores the long-chain structure to the branch structure for shape and address derivation, so that fusion no longer needs to be performed in units of the entire branch structure.
  • the computing device 201 performs neural network computation according to the template fusion unit.
  • FIG. 16 shows an exemplary neural network model segment, which includes a branch structure 1600, the starting point of the branch structure 1600 is the T1 layer, the end point is the T11 layer, and the first branch 1601 and the second branch 1602 are expanded from the T1 layer.
  • One branch 1601 includes layers T2 to T7, and the second branch 1602 includes layers T8 to T10.
  • the first branch 1601 includes a sub-branch structure, the starting point of the sub-branch structure is the T3 layer and the end point is the T7 layer, the first sub-branch 1603 includes the T4 layer and the T5 layer, and the second sub-branch 1604 includes the T6 layer.
  • When fusing the branch structure 1600, the processing device 203 first establishes a topological sequence for the branch structure 1600. It first identifies the start point and end point of the branch structure 1600, that is, the start point is the T1 layer and the end point is the T11 layer. The processing device 203 sets the start point of the branch structure 1600 as the start point of the topological sequence, this start point is also set as the starting layer of the template fusion unit, and the end point of the branch structure 1600 is set as the end point of the topological sequence. The processing device 203 further determines whether there is a sub-branch structure in the branch structure 1600; the branch structure 1600 does have one, and its start point, end point and middle layers can be arranged in the following two ways.
  • The first arrangement method is to compare the number of layers on each sub-branch of the sub-branch structure and arrange the layers of the sub-branches in order from the most layers to the fewest.
  • The first sub-branch 1603 has 2 layers and the second sub-branch 1604 has 1 layer; the first sub-branch 1603 has more layers, so the layers in the first sub-branch 1603 are arranged before the layers in the second sub-branch 1604.
  • In this case the topological order of the sub-branch structure is T3 layer → T4 layer → T5 layer → T6 layer → T7 layer.
  • the second arrangement method is to compare the number of layers on the sub-branch in the sub-branch structure, and arrange the layers of the sub-branch from the least to the most according to the number of layers.
  • the second sub-branch 1604 has fewer layers, so the layers in the second sub-branch 1604 are arranged before the layers in the first sub-branch 1603 .
  • In this case the topological order of the sub-branch structure is T3 layer → T6 layer → T4 layer → T5 layer → T7 layer.
  • After completing the topological sorting of the sub-branch structure, the processing device 203 continues to sort the branch structure 1600.
  • the sorting method of the branch structure 1600 is the same as that of the sub-branch structure.
  • the branch structure 1600 is also arranged by the number of layers.
  • The first branch 1601 has more layers than the second branch 1602, so the layers of the first branch 1601 are arranged before the layers of the second branch 1602, that is, the long-chain structure 1701 shown in FIG. 17.
  • the processing device 203 replaces the branch structure 1600 with the long-chain structure 1701 or 1702, adds or deletes template fusion units in units of layers in the topological sequence, checks the rules in the fusion strategy, and establishes template fusion units.
  • the template fusion unit does not necessarily need to include the entire branch structure 1601 or 1602 .
  • When deriving the shape, the processing device 203 determines whether the template fusion unit includes the end point of the branch structure or of the sub-branch structure. If not, the processing device 203 further judges whether the available space of the NRAM 431 is large enough; if so, when deriving the addresses the processing device 203 causes the computing device 201 to store the intermediate result of the last layer produced by the template fusion unit in the NRAM 431. If the available space of the NRAM 431 is not large enough, the processing device 203 further determines whether the available space of the SRAM 308 is large enough; if it is, the intermediate result of the last layer is stored in the SRAM 308 and can be read directly from the SRAM 308 when calculating the next template fusion unit.
  • When the template fusion unit does not include the end point of the branch structure or sub-branch structure, its output (the intermediate result of the last layer) is the on-chip unit map of the next template fusion unit; therefore, when calculating the next template fusion unit, the computing device 201 does not need to load the on-chip unit map from the DRAM 204 but reads it directly from the NRAM 431 or the SRAM 308 for calculation, reducing on-chip and off-chip accesses.
  • If neither the NRAM 431 nor the SRAM 308 has enough available space, the intermediate result of the last layer of the template fusion unit is stored back into the DRAM 204, and when the computing device 201 calculates the next template fusion unit, it is loaded from the DRAM 204 for calculation.
  • If the processing device 203 determines that the template fusion unit includes the end point of the branch structure 1600 or of the sub-branch structure, then when deriving the addresses it causes the computing device 201 to store the intermediate result of the last layer produced by the template fusion unit back into the DRAM 204.
  • the processing device 203 of this embodiment converts the branch/sub-branch structure into a long-chain structure, the long-chain structure is simple, and it is easy to generate a template fusion unit, and then restores the long-chain structure to a branched structure for shape and address derivation.
  • the computing device 201 performs neural network computation according to the template fusion unit.
  • Another embodiment of the present invention is a method for dynamically fusing branch structures of a neural network according to a fusion strategy.
  • This embodiment fuses branch structures with sub-branches by an apparatus having the structures of FIGS. 1 to 4 .
  • FIG. 18 shows a flowchart of this embodiment.
  • step 1801 a topological sequence is established for the branch structure. This step is further divided into the following steps.
  • step 1802 the start and end points of the branch structure are identified.
  • step 1803 the starting point of the branch structure is set as the starting point of the topological sequence.
  • step 1804 the starting point is set as the starting layer of the template fusion unit.
  • step 1805 the end point of the branch structure is set as the end point of the topological sequence.
  • step 1806 it is determined whether there is a sub-branch structure in the branch structure, if yes, step 1807 is executed to identify the start point and the end point of the sub-branch structure.
  • In step 1808, the start point, the end point and the middle layers of the sub-branch structure are arranged in a specific order. There are two ways of arrangement: comparing the number of layers on each sub-branch of the sub-branch structure and arranging the layers of the sub-branches from the most layers to the fewest, or comparing the number of layers on each sub-branch and arranging the layers of the sub-branches from the fewest layers to the most.
  • Then step 1809 is executed to sort the layers of the branch structure in a specific order.
  • The ordering method of the branch structure is the same as that of the sub-branch structure. At this point, this embodiment has converted the branch structure into a long-chain structure.
  • Next, step 1810 is executed: the branch structure is replaced with the long-chain structure, layers are added to or removed from the template fusion unit in units of layers in the topological sequence, and, based on the starting layer set in step 1804, the rules in the fusion strategy are checked to establish the template fusion unit. In this step the branch structure is treated as a long-chain structure and step 1202 is executed; the technical details are not repeated here.
  • When deriving the shape of the template fusion unit for the branch structure or sub-branch structure, this embodiment determines whether the template fusion unit includes the end point of the branch structure or of the sub-branch structure. If not, it further determines whether the available space of the NRAM 431 is large enough; if so, when deriving the addresses the computing device 201 is caused to store the intermediate result of the last layer produced by the template fusion unit in the NRAM 431. If the available space of the NRAM 431 is not large enough, it further judges whether the available space of the SRAM 308 is large enough; if it is, the intermediate result of the last layer is stored in the SRAM 308 and can be read directly from the SRAM 308 when the next template fusion unit is calculated.
  • Only when neither the NRAM 431 nor the SRAM 308 has enough available space does this embodiment store the intermediate result of the last layer of the template fusion unit back into the DRAM 204, from which it is loaded for calculation when the next template fusion unit is computed.
  • If the template fusion unit does include the end point of the branch structure or of the sub-branch structure, this embodiment causes the computing device 201, when deriving the addresses, to store the intermediate result of the last layer produced by the template fusion unit back into the DRAM 204.
  • step 1811 is performed, and the neural network calculation is performed according to the template fusion unit.
  • Another embodiment of the present invention is a computer-readable storage medium on which computer program codes for dynamically merging branch structures of neural networks according to fusion strategies are stored, and when the computer program codes are executed by a processor, the foregoing embodiments are executed the method described.
  • the invention dynamically determines the template fusion unit by setting the fusion strategy, fuses the branch structure in the neural network to form a new self-defined layer, and loads the data required by the calculation template fusion unit at one time, so as to reduce the input/output overhead.
  • Clause A1. An integrated circuit device for fusing a neural network, the neural network comprising a positive pyramid layer, wherein an input feature map of the positive pyramid layer is smaller than an output feature map, and an input datum in the input feature map produces at least one output datum in the output feature map, the integrated circuit device comprising:
  • a processing device, configured to:
  • set all output data corresponding to the same input datum as a fusion block, the output feature map comprising a plurality of fusion blocks; and
  • establish a template fusion unit according to a fusion strategy; and
  • a computing device, configured to perform neural network computation according to the template fusion unit.
  • Clause A2. The integrated circuit device of Clause A1, wherein the computing device includes a plurality of clusters, each cluster includes a plurality of processor cores, and the fusion strategy is to assign work to each processor core in units of fusion blocks based on the hardware resources of each processor core.
  • Clause A3. The integrated circuit device of Clause A2, wherein each cluster further includes a storage core, the storage core includes a shared storage unit, and the fusion strategy is that, when a specific fusion block would be repeatedly calculated among specific processor cores, one of the specific processor cores is assigned to compute the specific fusion block and store the intermediate result to the shared storage unit.
  • Clause A4. The integrated circuit device of Clause A3, wherein the storage core merges the intermediate result into the intermediate results generated by the other specific processor cores.
  • Clause A5. The integrated circuit device of Clause A3, wherein the shared storage unit includes a cache space, the fusion strategy is that the sum of the storage space required for the weights of the next subgraph, the storage space required for all output data, and the cache space is not greater than the available space of the shared storage unit, and when the processing device determines that this fusion strategy is not satisfied, the processing device stops fusing the positive pyramid layer.
  • Clause A6. The integrated circuit device of Clause A2, wherein each processor core includes a weight storage unit, and the fusion strategy is that the storage space required for the weights involved in the fusion block is not larger than the available space of the weight storage unit.
  • Clause A7. The integrated circuit device of Clause A1, wherein the positive pyramid layer comprises a deconvolution layer, an unpooling layer, or an upsampling layer.
  • Clause A8. A board card comprising the integrated circuit device of any one of Clauses A1 to A7.
  • Clause A9. A method of fusing a neural network, the neural network comprising a positive pyramid layer, wherein an input feature map of the positive pyramid layer is smaller than an output feature map, and an input datum in the input feature map produces at least one output datum in the output feature map, the method comprising:
  • All output data corresponding to the same input data are set as fusion blocks, and the output feature map includes a plurality of fusion blocks;
  • a template fusion unit is established according to the fusion strategy.
  • Neural network computations are performed according to the template fusion unit.
  • Clause A10. A computer-readable storage medium having stored thereon computer program code for fusing a neural network which, when executed by a processing device, performs the method of Clause A9.
  • the electronic device or device of the present invention may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, Internet of Things terminals, mobile Terminals, mobile phones, driving recorders, navigators, sensors, cameras, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, home appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the electronic device or device of the present invention can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical treatment and other fields. Further, the electronic device or device of the present invention can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, and terminal.
  • the electronic device or device with high computing power according to the solution of the present invention can be applied to cloud devices (eg cloud server), while the electronic device or device with low power consumption can be applied to terminal devices and/or Edge devices (such as smartphones or cameras).
  • The hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, thereby completing unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present invention expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solution of the present invention is not limited by the sequence of the described actions . Accordingly, based on the disclosure or teachings of the present invention, those skilled in the art will understand that some of the steps may be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present invention may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present invention. In addition, according to different solutions, the present invention also has different emphases in the description of some embodiments. In view of this, those skilled in the art can understand the parts that are not described in detail in a certain embodiment of the present invention, and can also refer to the related descriptions of other embodiments.
  • units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present invention.
  • multiple units in this embodiment of the present invention may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the solution of the present invention is embodied in the form of a software product (eg, a computer-readable storage medium), the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the method described in the embodiments of the present invention.
  • The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • The various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like.
  • The aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which can be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, and the like.


Abstract

An apparatus, a board card, a method and a readable storage medium for dynamically fusing branch structures of a neural network according to a fusion strategy, wherein the computing device is included in an integrated circuit device, and the integrated circuit device includes a universal interconnection interface and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The integrated circuit device may further include a storage device, which is respectively connected to the computing device and the other processing devices and is used for data storage of the computing device and the other processing devices.

Description

Apparatus, board card and method for fusing branch structures, and readable storage medium
Cross-reference to related applications
This application claims priority to the Chinese patent application filed on December 25, 2020 with application number 2020115619734 and entitled "Apparatus, board card and method for fusing branch structures, and readable storage medium", and to the Chinese patent application filed on December 25, 2020 with application number 2020115632669 and entitled "Apparatus, board card and method for fusing a neural network, and readable storage medium".
技术领域
本发明一般地涉及神经网络领域。更具体地,本发明涉及根据融合策略动态融合神经网络的分支结构的装置、板卡、方法及可读存储介质。
背景技术
神经网络是按照一定规则连接起来的多个神经元系统,大致上是由以下四种层结构所组成:输入层、卷积层(convolution layer)、池化层(pooling layer)、全连接层(fully connected layer)。
输入层是自输入数据中截取部分信息,转化成特征矩阵方式呈现,其中载有对应该部分信息的特征。卷积层配置成接收来自输入层的特征矩阵,通过卷积操作对输入数据进行特征抽取。卷积层在实际运用时可以建制多层卷积层。池化层配置成对数据的某一个区域用一个值代替,这值通常是该区域所有数值里的最大值或平均值。通过池化,在不至于损失过多信息的前提下,可以缩减模型大小、提高计算速度。全连接层在整个卷积神经网络中起到分类器的作用,相当于特征空间变换,把前面所有有用的信息提取整合,基于不同的分类做信息比对,借以判断输入数据是否相似于比对的标的。
随着科技的发展,神经网络的层数越来越多,结构也越来越复杂,现今已经开发出许多带有分支结构的神经网络模型,例如ResNet模型。具有分支结构的模型在计算时会耗去大量资源,同时延迟运算时间。
因此,一种减少神经网络模型分支结构的输入/输出访问的机制是人工智能领域中迫切需要的。
发明内容
为了至少部分地解决背景技术中提到的技术问题,本发明的方案提供了一种根据融合策略动态融合神经网络的分支结构的装置、板卡、方法及可读存储介质。
在一个方面中,本发明揭露一种根据融合策略动态融合神经网络的分支结构的集成电路装置,包括处理装置及计算装置。处理装置用以根据所述分支结构,建立拓扑序列,以所述拓扑序列的起始层为基准进行融合,排查所述融合策略内的规则,以建立模板融合单元。计算装置用以根据所述模板融合单元执行神经网络计算。
在另一个方面,本发明揭露一种板卡,包括根据前述的集成电路装置。
在另一个方面,本发明揭露一种根据融合策略动态融合神经网络的分支结构的方法,包括:根据所述分支结构,建立拓扑序列;以所述拓扑序列的起始层为基准进行融合,排查所述融合策略内的规则,以建立模板融合单元;以及根据所述模板融合单元执行神经网络计算。
另一个方面,本发明揭露一种计算机可读存储介质,其上存储有根据融合策略动态融合神经网络的分支结构的计算机程序代码,当所述计算机程序代码由处理装置运行时,执行前述的方法。
本发明对分支结构进行融合以产生模板融合单元,模板融合单元中首层的输入和末层的输出作为模板融合单元与片外内存的交互数据,期间各层的计算皆不需要访问片外内存,大大减少片上片外的输入/输出访问频率。
Description of the drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are shown by way of example and not limitation, and the same or corresponding reference numerals indicate the same or corresponding parts, in which:
FIG. 1 is a structural diagram showing a board card according to an embodiment of the present invention;
FIG. 2 is a structural diagram showing an integrated circuit device according to an embodiment of the present invention;
FIG. 3 is a schematic diagram showing the internal structure of a computing device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present invention;
FIG. 5 is a schematic diagram showing the case where one processor core intends to write data to a processor core of another cluster;
FIG. 6A is a schematic diagram showing the AlexNet model;
FIG. 6B is a schematic diagram showing input/output feature maps shaped like a positive pyramid;
FIG. 7A is a schematic diagram showing the unpooling operation of max pooling;
FIG. 7B is a schematic diagram showing the unpooling operation of average pooling;
FIG. 7C is a schematic diagram showing an exemplary neural network model;
FIG. 8A is a schematic diagram showing an upsampling operation;
FIG. 8B is a schematic diagram showing two convolutional layers fused together according to an embodiment of the present invention;
FIG. 9 is a schematic diagram showing the NCHW and NHWC formats;
FIG. 10 is a flowchart showing how an embodiment of the present invention decides the size of the on-chip unit map;
FIG. 11 is a schematic diagram showing an exemplary neural network model;
FIG. 12A is a flowchart showing an embodiment of the present invention dynamically fusing a neural network according to a fusion strategy;
FIG. 12B is the first flowchart showing an embodiment of the present invention performing neural network computation using a template fusion unit;
FIG. 12C is the second flowchart showing an embodiment of the present invention performing neural network computation using a template fusion unit;
FIG. 13A is a schematic diagram showing the shapes of the input and output feature maps of an exemplary neural network model;
FIG. 13B is a flowchart showing an embodiment of the present invention establishing a template fusion unit based on a positive pyramid layer;
FIG. 13C shows an exemplary neural network model segment;
FIG. 14 is a schematic diagram showing the topological sequence of a branch structure according to an embodiment of the present invention;
FIG. 15 is a schematic diagram showing a long-chain structure of an embodiment of the present invention restored to a branch structure;
FIG. 16 shows another exemplary neural network model segment;
FIG. 17 is a schematic diagram showing the topological sequence of a branch structure according to another embodiment of the present invention; and
FIG. 18 is a flowchart showing another embodiment of the present invention fusing the branch structure of a neural network.
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
应当理解,本发明的权利要求、说明书及附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本发明的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本发明说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本发明。如在本发明说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本发明说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。
下面结合附图来详细描述本发明的具体实施方式。
神经网络是由输入层、卷积层、激活函数、池化层、全连接层所组成,少则数层,多则上百层,每层执行一个算子,例如卷积层执行卷积算子,有多少层便需要执行多少算子。在本发明中,当提及特定层时,便表示该层相对应的算子。
在进行神经网络计算时,输入信息和模型各层的输出结果在每次推理计算时是不同的,它们被视为变量数据,变量数据一般都是以特征图(矩阵)来表现的,在本发明中,整个神经网络模型的输入信息和模型各层的输入图统称为特征图,一旦特征图加载到片上存储器部件上,在本发明中称为片上单元图。训练网络模型的参数在训练稳定之后通常不会频繁改动,或是网络拓扑结构和硬件参数确定后就可以编译生成,在计算过程中不会变更,因此它们可以被视为常量数据,常量数据包括但不限于权值、偏置、设备硬件指令、批标准化(batchnorm)的均值和方差等,在本发明中统一以权值代表所有的常量数据。而本发明中提及“数据”时,泛指根据融合策略使得神经网络模型中允许对应算子的运算操作融合在一起的图结构,该图结构所涉及变量数据和常量数据,也就是特征图加上相应的权值。
图1示出本发明实施例的一种板卡10的结构示意图。如图1所示,板卡10包括芯片101,其是一种系统级芯片(System on Chip,SoC),或称片上系统,集成有一个或多个组合处理装置,组合处理装置是一种人工智能运算单元,用以支持各类深度学习和机器学习算法,满足计算机视觉、语音、自然语言处理、数据挖掘等领域复杂场景下的智能处理需求。特别是深度学习技术大量应用在云端智能领域,云端智能应用的一个显著特点是输入数据量大,对平台的存储能力和计算能力有很高的要求,此实施例的板卡10适用在云端智能应用,具有庞大的片外存储、片上存储和大量的计算能力。
芯片101通过对外接口装置102与外部设备103相连接。外部设备103例如是服务器、计算机、摄像头、显示器、鼠标、键盘、网卡或wifi接口等。待处理的数据可以由外部设备103通过对外接口装置102传递至芯片101。芯片101的计算结果可以经由对外接口装置102传送回外部设备103。根据不同的应用场景,对外接口装置102可以具有不同的接口形式,例如PCIe接口等。
板卡10还包括用于存储数据的存储器件104,其包括一个或多个存储单元105。存储器件104通过总线与控制器件106和芯片101进行连接和数据传输。板卡10中的控制器件106配置用于对芯片101的状态进行调控。为此,在一个应用场景中,控制器件106可以包括单片机(Micro Controller Unit,MCU)。
图2是示出此实施例的芯片101中的组合处理装置的结构图。如图2中所示,组合处理装置20包括计算装置201、接口装置202、处理装置203和DRAM 204。
计算装置201配置成执行用户指定的操作,主要实现为单核智能处理器或者多核智能处理器,用以执行深度学习或机器学习的计算,其可以通过接口装置202与处理装置203进行交互,以共同完成用户指定的操作。
接口装置202用于在计算装置201与处理装置203间传输数据和控制指令。例如,计算装置201可以经由接口装置202从处理装置203中获取输入数据,写入计算装置201片上的存储装置。进一步,计算装置201可以经由接口装置202从处理装置203中获取控制指令,写入计算装置201片上的控制缓存中。替代地或可选地,接口装置202也可以读取计算装置201的存储装置中的数据并传输给处理装置203。
处理装置203作为通用的处理装置,执行包括但不限于数据搬运、对计算装置201的开启和/或停止等基本控制。根据实现方式的不同,处理装置203可以是中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)或其他通用和/或专用处理器中的一种或多种类型的处理器,这些处理器包括但不限于数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,并且其数目可以根据实际需要来确定。如前所述,仅就本发明的计算装置201而言,其可以视为具有单核结构或者同构多核结构。然而,当将计算装置201和处理装置203整合共同考虑时,二 者视为形成异构多核结构。
DRAM 204用以存储待处理的数据,为DDR内存,大小通常为16G或更大,用于保存计算装置201和/或处理装置203的数据。
图3示出了计算装置201的内部结构示意图。计算装置201用以处理计算机视觉、语音、自然语言、数据挖掘等输入数据,图中的计算装置201采用多核分层结构设计,计算装置201作为一个片上系统,其包括多个集群(cluster),每个集群又包括多个处理器核,换言之,计算装置201是以片上系统-集群-处理器核的层次所构成的。
以片上系统的层级来看,如图3所示,计算装置201包括外部存储控制器301、外设通信模块302、片上互联模块303、同步模块304以及多个集群305。
外部存储控制器301可以有多个,在图中示例性地展示2个,其用以响应处理器核发出的访问请求,访问外部存储设备,例如图2中的DRAM 204,从而自片外读取数据或是将数据写入。外设通信模块302用以通过接口装置202接收来自处理装置203的控制信号,启动计算装置201执行任务。片上互联模块303将外部存储控制器301、外设通信模块302及多个集群305连接起来,用以在各个模块间传输数据和控制信号。同步模块304是一种全局同步屏障控制器(global barrier controller,GBC),用以协调各集群的工作进度,确保信息的同步。多个集群305是计算装置201的计算核心,在图中示例性地展示4个,随着硬件的发展,本发明的计算装置201还可以包括8个、16个、64个、甚至更多的集群305。集群305用以高效地执行深度学习算法。
以集群的层级来看,如图3所示,每个集群305包括多个处理器核(IPU core)306及一个存储核(MEM core)307。
处理器核306在图中示例性地展示4个,本发明不限制处理器核306的数量。其内部架构如图4所示。每个处理器核306包括三大模块:控制模块41、运算模块42及存储模块43。
控制模块41用以协调并控制运算模块42和存储模块43的工作,以完成深度学习的任务,其包括取指单元(instruction fetch unit,IFU)411及指令译码单元(instruction decode unit,IDU)412。取指单元411用以获取来自处理装置203的指令,指令译码单元412则将获取的指令进行译码,并将译码结果作为控制信息发送给运算模块42和存储模块43。
运算模块42包括向量运算单元421及矩阵运算单元422。向量运算单元421用以执行向量运算,可支持向量乘、加、非线性变换等复杂运算;矩阵运算单元422负责深度学习算法的核心计算,即矩阵乘及卷积。
存储模块43用来存储或搬运相关数据,包括神经元存储单元(neuron RAM,NRAM)431、权值存储单元(weight RAM,WRAM)432、输入/输出直接内存访问模块(input/output direct memory access,IODMA)433、搬运直接内存访问模块(move direct memory access,MVDMA)434。NRAM 431用以存储供处理器核306计算的特征图及计算后的中间结果;WRAM 432则用以存储深度学习网络的权值;IODMA 433通过广播总线309控制NRAM 431/WRAM 432与DRAM 204的访存;MVDMA 434则用以控制NRAM 431/WRAM 432与SRAM 308的访存。
回到图3,存储核307主要用以存储和通信,即存储处理器核306间的共享数据或中间结果、以及执行集群305与DRAM 204之间的通信、集群305间彼此的通信、处理器核306间彼此的通信等。在其他实施例中,存储核307具有标量运算的能力,用以执行标量运算。
存储核307包括共享存储单元(SRAM)308、广播总线309、集群直接内存访问模块(cluster direct memory access,CDMA)310及全局直接内存访问模块(global direct memory access,GDMA)311。SRAM 308承担高性能数据中转站的角色,在同一个集群305内不同处理器核306之间所复用的数据不需要通过处理器核306各自向DRAM 204获得,而是经SRAM 308在处理器核306间中转,存储核307只需要将复用的数据从SRAM 308迅速分发给多个处理器核306即可,以提高核间通讯效率,亦大大减少片上片外的输入/输出访问。
广播总线309、CDMA 310及GDMA 311则分别用来执行处理器核306间的通信、集群305间的通信和集群305与DRAM 204的数据传输。以下将分别说明。
广播总线309用以完成集群305内各处理器核306间的高速通信,此实施例的广播总线309 支持核间通信方式包括单播、多播与广播。单播是指点对点(即单一处理器核至单一处理器核)的数据传输,多播是将一份数据从SRAM 308传输到特定几个处理器核306的通信方式,而广播则是将一份数据从SRAM 308传输到所有处理器核306的通信方式,属于多播的一种特例。
CDMA 310用以控制在同一个计算装置201内不同集群305间的SRAM 308的访存。图5示出当一个处理器核欲将数据写入至另一个集群的处理器核时的示意图,以说明CDMA 310的工作原理。在此应用场景中,同一个计算装置包括多个集群,为方便说明,图中仅展示集群0与集群1,集群0与集群1分别包括多个处理器核,同样为了说明方便,图中的集群0仅展示处理器核0,集群1仅展示处理器核1。处理器核0欲将数据写入至处理器核1。
首先,处理器核0发送单播写请求将数据写入本地的SRAM 0中,CDMA 0作为主(master)端,CDMA 1作为从(slave)端,主端向从端推送写请求,即主端发送写地址AW和写数据W,将数据传送到集群1的SRAM 1中,接着从端发送写响应B作为回应,最后集群1的处理器核1发送单播读请求将数据从SRAM 1中读取出来。
回到图3,GDMA 311与外部存储控制器301协同,用以控制集群305的SRAM 308到DRAM 204的访存,或是将数据自DRAM 204读取至SRAM 308中。从前述可知,DRAM 204与NRAM 431或WRAM 432间的通信可以经由2个渠道来实现。第一个渠道是通过IODAM 433直接联系DRAM 204与NRAM 431或WRAM 432;第二个渠道是先经由GDMA 311使得数据在DRAM 204与SRAM 308间传输,再经过MVDMA 434使得数据在SRAM 308与NRAM 431或WRAM 432间传输。虽然表面上看来第二个渠道需要更多的元件参与,数据流较长,但实际上在部分实施例中,第二个渠道的带宽远大于第一个渠道,因此DRAM 204与NRAM 431或WRAM 432间的通信通过第二个渠道可能更有效率。本发明的实施例可根据本身硬件条件选择数据传输渠道。
在其他实施例中,GDMA 311的功能和IODMA 433的功能可以整合在同一部件中。本发明为了方便描述,将GDMA 311和IODMA 433视为不同部件,对于本领域技术人员来说,只要其实现的功能以及达到的技术效果与本发明类似,即属于本发明的保护范围。进一步地,GDMA 311的功能、IODMA 433的功能、CDMA 310的功能、MVDMA 434的功能亦可以由同一部件来实现,同样地,只要其实现的功能以及达到的技术效果与本发明类似,均属于本发明的保护范围。
神经网络模型的结构主要分为两类:长链结构与分支结构。长链结构指的是神经网络模型为单链条串接的层所组成,每层只有一个输入及一个输出,整体属于单支,例如VGG16模型或是图6A所示的AlexNet模型。分支结构指的是神经网络中的子网络仅有一个输入及一个输出,但子网络内存在多分支,即子网络的部分层具有多个输入或输出,例如resnet50的resblock结构、inception_v3的block结构等。分支结构如图7C所示,是一种示例性地神经网络模型,该示例性神经网络模型包括子网络701’及子网络702’。子网络701’仅有一个输入及一个输出,其包括第一层到第六层,第一层具有2个输出,第六层具有2个输入,因此子网络701’包括2个分支,一个分支为第一层→第二层→第三层→第六层,而另一个分支为第一层→第四层→第五层→第六层,子网络701’构成一个分支结构。同样地,子网络702’亦构成一个分支结构。
在执行深度学习的各层计算时,需要大量的片外片上访问,特别是将输入数据自DRAM 204读取至计算装置201中,再将计算装置201的计算结果存储至DRAM 204。这种频繁的访问会耗去极大的硬件资源。为了解决这个问题,本发明通过融合神经网络的相邻层,在很大程度上减少了片外片上的数据传输。
图8B示出将两个卷积层融合在一起的示意图。第一层卷积层810的输入为7×7的特征图801,该层将特征图801与3×3的内核(未示出)进行卷积后,得到第一层卷积层810的特征图802。其中,5×5特征子图804的数值会影响3×3特征子图805。假设步长(stride)为1,在计算完5×5特征子图804后,第一层卷积层810会接着计算5×5特征子图806,而5×5特征子图806的数值会影响3×3特征子图807。
在进行第二层卷积层811的计算时,特征图802成为第二层卷积层811的输入,同样与3×3的内核进行卷积,得到第二层卷积层811的特征图803。其中,3×3特征子图805的数值会影响特征图803中的1×1特征子图808。在计算完3×3特征子图805后,第二层卷积层811会接着计 算3×3特征子图807,而3×3特征子图807的数值会影响特征图803中的1×1特征子图809。
如果未融合,计算装置201在进行第一层卷积810时,自DRAM 204读取5×5特征子图804,计算完后将3×3特征子图805存储回DRAM 204,接着再从DRAM 204读取5×5特征子图806,计算完后将3×3特征子图807存储至DRAM 204。在进行第二层卷积811时,同样需要自DRAM 204读取3×3特征子图805,计算完后将1×1特征子图808存储至DRAM 204,接着自DRAM 204读取3×3特征子图807,计算完后将1×1特征子图809存储至DRAM 204。通过上述说明可知,特征图802作为中间数据反复在片外片上被读取存储,相当占用系统资源。
如果将第一层卷积层810与第二层卷积层811进行融合,也就是把特征图802存储在NRAM 431中(第一层卷积层810与第二层卷积层811的权值亦可存储在WRAM 432中),如此便可减少计算装置201与DRAM 204间的访问次数,进而提高整体神经网络的执行效率。还有,图8B所示的图(如特征图801、特征图802、特征图803)在神经网络模型上下文中整体看起来像倒金字塔,故称为倒金字塔层。
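为直观说明融合前后片外访问量的差异,以下给出一段示意性的Python代码(仅为帮助理解的简化示例,并非本发明的实际实现;其中的卷积实现、数据尺寸与以元素个数计量访存量的方式均为假设):

```python
import numpy as np

def conv2d(x, k):
    # 朴素的valid卷积,stride=1,仅用于说明
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

x = np.random.rand(7, 7)      # 对应7×7的特征图801
k1 = np.random.rand(3, 3)     # 第一层卷积层810的内核
k2 = np.random.rand(3, 3)     # 第二层卷积层811的内核

y = conv2d(x, k1)             # 对应特征图802
z = conv2d(y, k2)             # 对应特征图803

# 未融合:中间特征图802需先写回DRAM、再读出,产生两次额外的片外访问
dram_traffic_unfused = x.size + y.size * 2 + z.size
# 融合:特征图802驻留片上(NRAM),只有输入801与输出803经过DRAM
dram_traffic_fused = x.size + z.size

print(dram_traffic_unfused, dram_traffic_fused)  # 融合后的片外访问量明显更小
```

由此可见,融合后中间特征图不再往返于片外存储,节省的访存量随融合层数的增加而累积。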
在现代的神经网络模型中,不必然各层的输入/输出特征图都具有如图8B所示的倒金字塔形式,有些层的输入特征图尺寸会小于输出特征图尺寸,这种层常应用在计算机视觉的深度学习领域。在特定情境下,会需要将图像恢复到原来的尺寸以便进行进一步的计算。在计算这种层时,图像尺寸被扩大,以实现图像由小分辨率到大分辨率的映射。图6B示出这种层的示意图,结合图8B所示倒金字塔层的原理,从图6B可知,这种层的输入/输出特征图会产生有如正金字塔的形式,故称为正金字塔层。
实务上,正金字塔层包括反卷积层(deconvolution layer)、上池化层(unpooling layer)或上采样层(upsampling layer)。
反卷积又称为转置卷积,它并不是正向卷积的完全逆过程,而是一种特殊的正向卷积,需要有参数参与计算,而参数是要通过训练学习得到的。反卷积是先按照一定的比例通过补0来扩大输出图像的尺寸,接着旋转卷积核,再进行正向卷积。
上池化操作分为最大池化的上池化操作及平均池化的上池化操作。最大池化的上池化会保留最大值的位置信息,然后在其余位置补0,如图7A所示,图中示出最大池化层701,其输入特征图702经过最大池化层701后产生输出特征图703,图中还示出最大池化的上池化层704,其输入特征图703经过上池化层704后产生输出特征图705,输出特征图705的尺寸大于输入特征图703的尺寸。平均池化的上池化则是将平均值都填入其对应原始数据区域中相应位置即可,如图7B所示,图中示出平均池化层706,其输入特征图707经过平均池化层706后产生输出特征图708,图中还示出平均池化的上池化层709,其输入特征图708经过上池化层709后产生输出特征图710,输出特征图710的尺寸大于输入特征图708的尺寸。
上采样是直接在对应原始数据区域根据内核来扩充特征图。图8A示出上采样的示意图,图中输入特征图801’经过最大池化层(未绘出)后产生中间特征图802’,中间特征图802’经过上采样层(未绘出)的内核803’扩充后,得到输出特征图804’,输出特征图804’的尺寸大于中间特征图802’的尺寸。
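以下为一段示意性的Python代码,粗略演示最大池化的上池化(保留最大值位置、其余位置补0)与按内核扩充的上采样,对应图7A与图8A所述的操作。代码假设输入为长宽均为偶数的二维特征图、池化窗口为2×2,函数名均为说明用途而取,并非本发明的实际实现:

```python
import numpy as np

def max_pool2x2_with_indices(x):
    # 2×2最大池化,同时记录每个窗口内最大值的位置,供上池化使用
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    pos = np.zeros((H // 2, W // 2, 2), dtype=int)
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            block = x[i:i+2, j:j+2]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            out[i // 2, j // 2] = block[r, c]
            pos[i // 2, j // 2] = (i + r, j + c)
    return out, pos

def max_unpool2x2(y, pos, shape):
    # 最大池化的上池化:把数值放回原最大值位置,其余位置补0
    out = np.zeros(shape)
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            out[tuple(pos[i, j])] = y[i, j]
    return out

def upsample(y, kernel=np.ones((2, 2))):
    # 上采样:按内核直接扩充特征图(对应图8A的内核803')
    return np.kron(y, kernel)

x = np.arange(16, dtype=float).reshape(4, 4)
y, pos = max_pool2x2_with_indices(x)
print(max_unpool2x2(y, pos, x.shape))  # 输出尺寸大于输入y的尺寸
print(upsample(y))
```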
前述这些算子的特征都是输入特征图小于输出特征图。另外,可能还存在用户自定义层同样具有输入特征图小于输出特征图的特征。
神经网络融合通常是基于神经网络中的特定卷积层和池化层向后进行融合,亦即融合的起始层为卷积层或池化层,根据其本身硬件条件向后融合了多层,其间可能包含多个卷积层和池化层。但随着深度学习及神经网络的发展,层的排序变得复杂,例如在卷积层前面设置有激活层,则此激活层应该也要被考虑如何与其后的卷积层进行融合。因此,除了单纯以卷积层和池化层为核心进行融合之外,本发明提供多样的融合方式,不必然以卷积层和池化层为核心,而采取特定的策略,弹性地选择神经网络的各层进行融合,即便是用户自定义的层,只要符合融合策略便可被融合,使得整体效能最佳化。
本发明的另一个实施例是一种新式的融合方法,通过利用前述图1、图2、图3及图4的硬件结构来实施的,这种融合称为模板融合单元(template fuse unit,TFU)。模板融合单元主要是 通过一定的融合策略弹性地将多个层融合成一个层,来减少网络的输入/输出开销,其包括前述的神经网络融合及其他融合方式,这些被融合的层的集合即为模板融合单元,可以视为是新的层或是自定义的层。
此实施例一次性将模板融合单元所需的特征图、权值等自DRAM 204载入至片上的SRAM 308,特征图载入至SRAM 308后称为片上单元图,片上单元图会被切割成子图,每次自SRAM 308载入一份子图到被指派计算该子图的处理器核306的NRAM 431,且计算该子图所需的权值亦会自SRAM 308被载入至WRAM 432上,每个子图计算完成后获得对应的中间结果,中间结果被存回SRAM 308,所有子图都完成计算后再一次性地将计算结果存回DRAM 204。也就是说,片上单元图和权值参与神经网络模型中算子的运算操作获得的对应结果在DRAM 204与SRAM 308间传递,子图对应的输出(中间结果)在SRAM 308与NRAM 431间传递。从计算装置201的角度来看,模板融合单元的数据载入是以片上单元图为单位,而计算是以子图为单位。
更详细来说,SRAM 308是融合策略的重要参考指标之一,其空间大小决定了模板融合单元为大图模式或是小图模式。小图模式与大图模式是指存储在DRAM 204的一张特征图是否能一次性地搬到SRAM 308进行处理,处理装置203会将该特征图所需存储空间与SRAM 308可用空间进行比较。如果SRAM 308空间不足,特征图摆不下,则为大图模式;如果SRAM 308足以容纳整张特征图,就是小图模式。需特别注意的是,在大图模式下,片上单元图只是特征图的一部分;在小图模式下,如果SRAM 308的可用空间足够大,或是特征图足够小,SRAM 308或许可以一次性地容纳多张特征图,即片上单元图可以包括多张特征图。
如是大图模式,则必须拆分该特征图方能载入计算装置201中。处理装置203会在DRAM 204上将该特征图进行拆分,直到产生足够小的片上单元图能够满足SRAM 308的空间需求,使得该片上单元图可以一次性地搬到SRAM 308进行处理。而特征图在进行拆分时,可能会产生输入依赖运算和输出依赖运算。
输入依赖运算是指拆分后的各片上单元图至少部分重叠,每个子集都需要一些输入的额外副本,以进行完整的运算,从而导致拆分操作中的数据冗余,所谓的数据冗余是指同一段数据在系统中被复用。当模板融合单元包括卷积、池化或矩阵乘等层时都会导致输入依赖运算。
输出依赖运算是指每个子图产出中间结果后,还需要进行归约(reduce),才能得到计算结果。归约是指在基于对片上单元图本身内容理解的基础上,拆分成子图后分别计算,以缩减计算规模,从而在尽可能保持原片上单元图原貌的前提下,最大限度地精简数据量,再以子图为基础还原或整合计算结果。进行归约时计算结果是互为依赖的。当模板融合单元包括内积、卷积、矩阵乘、排序、计数等层时都会导致输出依赖运算。
此实施例可以处理的特征图的数据格式包括N、H、W、C维度,其中N代表批处理(batch)、H代表高度(height)、W代表宽度(width)、C代表通道(channel)。以图像数据为例,N表示这批图像共有几张,H表示图像在竖直方向有多少像素,W表示水平方向像素数,C表示通道数(例如黑白图像的通道数C为1,而RGB彩色图像的通道数C为3)。
这些维度的排序决定了数据的组成方式,常见的组成方式有NHWC和NCHW两种,图9示出NCHW与NHWC的格式区别,此图是以RGB彩色图像为例,图中R表示红色像素、G表示绿色像素、B表示蓝色像素。数列91为NCHW格式,N排列在外层,每个通道内像素紧挨在一起,再依RGB的顺序排列,坐标为(n,c,h,w)的元素在存储中的偏移为((n×C+c)×H+h)×W+w。数列92是NHWC格式,C排列在最内层,多个通道对应空间位置的RGB像素紧挨在一起。图中亦显示出输入像素901、输入像素902、输入像素903在不同排列方式下所处的位置,而这三个输入像素901、输入像素902、输入像素903合起来便是图像中一个点的颜色。坐标为(n,c,h,w)的元素相应的坐标向偏移的换算方法是((n×H+h)×W+w)×C+c。NHWC首先相比NCHW更加接近BMP的图片数据存储格式,BMP格式的文件中按照一个个像素点来存储数据,每个像素点存储了所有通道的颜色值,这使得在读取输入图片时不需要进行额外的维度转换。因此,NHWC的访存局部性较佳,每三个输入像素即可得到一个输出像素,NCHW则必须等所有通道输入准备好才能得到最终输出结果,需要占用较大的缓存空间。
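以下给出一段示意性的Python代码,按正文给出的换算式计算NCHW与NHWC两种排布下元素的偏移,并验证同一像素点的R、G、B在NHWC中相邻、在NCHW中相距H×W;函数名为说明用途而假设:

```python
def offset_nchw(n, c, h, w, C, H, W):
    # NCHW排布:偏移为((n×C+c)×H+h)×W+w
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, C, H, W):
    # NHWC排布:偏移为((n×H+h)×W+w)×C+c
    return ((n * H + h) * W + w) * C + c

# 以一张2×2的RGB彩色图像为例(N=1, C=3, H=2, W=2)
C, H, W = 3, 2, 2
print([offset_nhwc(0, c, 0, 0, C, H, W) for c in range(3)])  # [0, 1, 2]:三个通道相邻
print([offset_nchw(0, c, 0, 0, C, H, W) for c in range(3)])  # [0, 4, 8]:相距H×W=4
```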
此实施例决定片上单元图大小,图10示出相应的流程图。
在步骤1001中,处理装置203判断特征图所需存储空间是否大于SRAM 308的可用空间。如是,表示该特征图无法一次性地载入至SRAM 308中,因此执行步骤1002,拆分特征图。在此实施例中,处理装置203选择在任何维度上进行拆分。处理装置203优先选择在N维度上进行拆分,因为不会产生输入或输出依赖运算,如在N维度上进行拆分无法满足要求,再考虑在H或是W维度上进行拆分,这时便可能会产生输入或输出依赖运算。此实施例亦支持在C维度上进行拆分,特别是沿着Cout方向拆分,这样通过数据优化的方式把一个卷积拆分成多个卷积,使得WRAM 432可以放得下权值,例如:将权值拆分到四个处理器核306上。因此,只要在某一维度上进行拆分是计算装置201能处理的,都是此实施例可以接受的拆分方式。
更进一步来说,处理装置203可以依序在N、H、W维度间进行特定粒度的拆分,特定粒度可以是一个固定或变动的比值,或是以一个函数来表示。在一种应用场景下,处理装置203由大往小对特征图或权值进行拆分。以特征图为例,首先在N维度上将维度为NHWC的特征图拆分成N_1HWC的特征图与N_2HWC的特征图,其中特定粒度是固定比值,N_1与N_2各为N的二分之一。如果还不够小,处理装置203则在H维度上继续将N_1HWC的特征图拆分成N_1H_1WC的特征图与N_1H_2WC的特征图,其中H_1与H_2各为H的二分之一。如果还不够小,处理装置203则在W维度上继续将N_1H_1WC的特征图拆分成N_1H_1W_1C的特征图与N_1H_1W_2C的特征图,其中W_1与W_2各为W的二分之一。处理装置203可以在N、W、H维度上继续进行更小粒度的拆分,像是做四分之一等分、八分之一等分或十六分之一等分的切割,直到特征图足够小,成为可以一次性地载入SRAM 308的片上单元图为止。
可以理解的是,处理装置203还可以在一个维度上持续拆分,直到不能再拆分,才会选择另外一个维度持续拆分。例如持续在H维度上进行拆分,如果拆分至最小单位仍无法载入至SRAM 308中时,才改以在W维度上进行拆分,直到拆分至最小单位。
需特别注意的是,由于这样的拆分方式是由大拆到小,因此当拆分的特征图满足条件时,其所需存储空间的大小通常会与SRAM 308的可用空间相差无几。换言之,在大图模式下,DRAM 204每次仅能传送一张拆分后的特征图至SRAM 308,但在小图模式下,SRAM 308的空间却可能可以一次性地自DRAM 204载入多张特征图。
处理装置203拆分特征图后,回到步骤1001,处理装置203判断拆分后的特征图所需存储空间是否还大于SRAM 308的可用空间,如是,则再次执行步骤1002,继续往下拆分。
如处理装置203判断拆分后的特征图所需存储空间不大于SRAM 308的可用空间时,表示SRAM 308可以一次性地载入拆分后的特征图,则执行步骤1003,处理装置203设定拆分后的特征图为片上单元图。至此,处理装置203确定了片上单元图的大小。
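以下为一段示意性的Python代码,粗略对应图10的步骤1001至步骤1003:按N→H→W的优先级轮流对特征图进行二分拆分,直到其所需存储空间不大于SRAM的可用空间。其中数据按每元素2字节估算、SRAM可用空间的取值以及忽略拆分产生的冗余,均为此示例的假设,并非本发明的实际实现:

```python
def split_to_fit(shape, sram_bytes, dtype_bytes=2):
    # shape为NHWC。按N→H→W的顺序轮流二分(相当于先对半,再四等分、八等分……),
    # 直到拆分后的特征图可一次性载入SRAM,对应图10的步骤1001~1003;C维度此处不拆分。
    n, h, w, c = shape
    dims = [n, h, w]
    i = 0
    while dims[0] * dims[1] * dims[2] * c * dtype_bytes > sram_bytes:
        if all(d == 1 for d in dims):
            break                              # 已拆至最小单位
        if dims[i % 3] > 1:
            dims[i % 3] = (dims[i % 3] + 1) // 2
        i += 1
    return (dims[0], dims[1], dims[2], c)

# 示例:批大小为4的224×224×64特征图,假设SRAM可用空间为2MB
print(split_to_fit((4, 224, 224, 64), sram_bytes=2 * 1024 * 1024))
```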
进一步地,处理装置203根据片上单元图的尺寸决定模板融合单元。
接着,此实施例开始融合神经网络各层为模板融合单元,图11示出一种示例性的神经网络模型,共有14层,其中第一段1101包括第1层至第4层,为倒金字塔层,第二段1102包括第5层至第9层,为正金字塔层,第三段1103包括第10层至第14层,为倒金字塔层。在确定片上单元图的尺寸后,图12A示出此实施例根据融合策略动态融合神经网络的流程图。如图12A所示,包括:在步骤1201中,根据融合策略的起始规则,选择模板融合单元的起始层。处理装置203根据融合策略的起始规则,选择模板融合单元的起始层,也就是在神经网络里尚未融合的层中选择开始融合的层。
在一种应用场景下,所述起始规则可以是起始层为神经网络中最前未被融合的层,处理装置203会搜索出最前未被融合的层。以图6A的AlexNet神经网络模型为例,共有23层,假设第1层至第5层已融合,则当起始规则是起始层为神经网络中最前未被融合的层时,处理装置203会选择第6层的ReLU激活层为起始层,向后融合(即向第7层的方向融合)。需注意的是,在此起始规则下,起始层不必然为卷积层或池化层。
在另一种应用场景下,考虑到卷积和池化层最消耗输入/输出资源,因此起始规则为起始层为最前未被融合的卷积或池化层,处理装置203会先找出神经网络模型中未融合层的所有卷积和 池化层,从最前未被融合的卷积或池化层开始向后融合。同样以图6A的AlexNet神经网络模型为例,假设第1层至第9层已融合,处理装置203会找出神经网络模型中未融合层的所有卷积和池化层,即第11层、第13层、第15层,接着从最前未被融合的卷积或池化层开始融合,也就是起始层为第11层。假设图11的第1层至第3层已被融合,如果起始规则是起始层为神经网络中最前未被融合的层,则第4层会被设定为这次模板融合单元的起始层,并以第4层起向后融合。
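以下为一段示意性的Python代码,演示两种起始规则的差异:其一取最前未被融合的层,其二取最前未被融合的卷积或池化层。层序列与下标(从0开始)仅为近似图6A前若干层的假设,函数名亦为说明用途而取,并非本发明的实际实现:

```python
def pick_start_layer(layers, fused, rule="first_unfused"):
    """layers:依推理顺序排列的层类型列表;fused:已融合层的下标集合。
    rule="first_unfused" 对应起始规则一(最前未被融合的层);
    rule="first_unfused_conv_pool" 对应起始规则二(最前未被融合的卷积或池化层)。"""
    for i, op in enumerate(layers):
        if i in fused:
            continue
        if rule == "first_unfused":
            return i
        if rule == "first_unfused_conv_pool" and op in ("conv", "pool"):
            return i
    return None  # 所有层均已融合

# 近似图6A前11层的层类型序列(下标0对应第1层)
layers = ["conv", "relu", "lrn", "pool", "conv", "relu", "lrn", "pool", "conv", "relu", "conv"]
fused = set(range(5))                                           # 假设第1层至第5层已融合
print(pick_start_layer(layers, fused))                          # 5:第6层的ReLU激活层
print(pick_start_layer(layers, fused, "first_unfused_conv_pool"))  # 7:最前未被融合的池化层
```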
在步骤1202中,以起始层为基准进行融合,逐一排查融合策略的所有规则,以建立模板融合单元。处理装置203以起始层为基准进行融合,逐一排查融合策略的所有规则,以建立模板融合单元。在满足所有规则的前提下,计算装置201的硬件资源足以支撑一次性地载入计算模板融合单元所需的数据,进而根据模板融合单元执行神经网络计算。除了前述的起始规则外,融合策略示例性地还可以包括以下规则:
规则一:向后融合
所谓的向后融合指的是自起始层往神经网络模型推理的方向融合,以图6A为例,即是按着第一层→第二层→第三层的方向融合。如果起始层之前还有未融合层,则在此规则下这些未融合层将不被考虑纳入模板融合单元中。
规则二:优先向前融合
所谓的向前融合指的是自起始层往神经网络推理的反方向融合,以图6A为例,则是按着第三层→第二层→第一层的方向融合。此规则通常与前述起始层为最前未被融合的卷积或池化层的起始规则搭配,原因在于所述的卷积或池化层前可能还有未被融合的层。在选定起始层后,处理装置203优先向前融合,试图把起始层前尚未被融合的层纳入模板融合单元中。同样以图6A的AlexNet神经网络模型为例,假设第1层至第2层已融合,处理装置203发现最前未被融合的卷积或池化层是第5层,故起始层为第5层,优先向前融合第4层、第3层,如果还能继续融合,则接着向后融合第6层、第7层等。
还有,同样以图11的神经网络模型为例,假设第1层至第2层已融合,处理装置203发现最前未被融合的卷积或池化层是第4层,故起始层为第4层,优先向前融合第3层,如果还能继续融合,则接着向后融合第5层、第6层等。
规则三:优先以分支结构为单位
当神经网络模型具有分支结构时,此规则要求处理装置203优先以分支结构而不是以层为单位增删模板融合单元,如果一整个块的运算逻辑融合不成功,才考虑从各个分支上的层进行融合。以图7C的神经网络模型为例,处理装置203会优先考虑子网络701’或子网络702’为单位进行融合。
当神经网络为长链结构时,由于不存在分支结构,故直接以层为单位增删模板融合单元。此规则不适用于长链结构的神经网络模型。
规则四:单分支输出
此实施例的融合策略不支持模板融合单元为多输出网络,其原因在于模板融合单元内部实现的形状推导主要采用从后向前推导的形式,多输出网络意味着需要从不同的输出分别向前推导,推导的结果不必然会归结到同一个特征图上,以至于无法收敛。
换言之,模板融合单元的输出需为单分支输出,也就是模板融合单元的最后一层只能具有一个输出。图7C标示了子网络701’的二种融合方式,第一种是将第一层至第五层融合成一个模板融合单元703’,第二种是将第一层至第六层融合成一个模板融合单元704’。由于第三层及第五层的输出都是模板融合单元703’的输出,故模板融合单元703’属于多输出网络,即多分支输出。而第六层的输出是模板融合单元704’的输出,只产生一个输出数据,故模板融合单元704’属于单输出网络,即单分支输出。处理单元203会判断模板融合单元的输出是否为单分支输出,如果此规则未被满足时,处理装置203增删模板融合单元内的层直到此规则被满足。
规则五:包括至少2个主层
当层逻辑过于简单时,模板融合单元的性能还不如未融合的层的性能,故以层逻辑作为融合策略时,处理装置203会评估所融合的各层的运算是否足够复杂,使得融合产生效益。欲产生效益,就需要尽量将主层纳入模板融合单元,主层指的是矩阵乘、池化或卷积等耗费大量输入/输出资源的层,此处的池化包括各类池化,像是最大池化(maxpool)或均值池化(avgpool),卷积也包括各类卷积,像是普通卷积、带均值的卷积、分通道卷积(depthwise conv)等。此规则为模板融合单元包括至少2个主层。当处理单元203判断此规则未被满足时,处理装置203会调整模板融合单元直到此规则被满足。
规则六:包括主层、主层、非主层依次相邻的连续结构
此规则为模板融合单元需包括主层、主层及非主层的连续结构,即:主层、主层以及非主层依次相邻的连续结构。这样的运算足够复杂,使得融合具有效益。参阅图6A中的第4层-第5层-第6层,其中第4层为最大池化层,第5层为卷积层,第6层为ReLU激活层,符合主层、主层、非主层依次相邻的连续结构,因此包括第4层、第5层、第6层的模板融合单元便可满足此规则。当处理单元203判断此规则未被满足时,处理装置203会调整模板融合单元直到此规则被满足。
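以下为一段示意性的Python代码,演示规则五(包括至少2个主层)与规则六(主层、主层、非主层依次相邻的连续结构)的检查方式。其中主层集合MAIN_OPS依正文列举的矩阵乘、各类池化与各类卷积假设而来,并非穷举,亦非本发明的实际实现:

```python
# 主层:矩阵乘、池化或卷积等耗费大量输入/输出资源的层(集合内容为示例假设)
MAIN_OPS = {"matmul", "maxpool", "avgpool", "conv", "conv_with_mean", "depthwise_conv"}

def satisfies_rule_5(tfu_ops):
    # 规则五:模板融合单元包括至少2个主层
    return sum(op in MAIN_OPS for op in tfu_ops) >= 2

def satisfies_rule_6(tfu_ops):
    # 规则六:存在"主层、主层、非主层"依次相邻的连续结构
    for a, b, c in zip(tfu_ops, tfu_ops[1:], tfu_ops[2:]):
        if a in MAIN_OPS and b in MAIN_OPS and c not in MAIN_OPS:
            return True
    return False

tfu = ["maxpool", "conv", "relu"]   # 对应图6A的第4层-第5层-第6层
print(satisfies_rule_5(tfu), satisfies_rule_6(tfu))   # True True
```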
规则七:包括标量计算层以及向量计算层相邻的连续结构
此规则为模板融合单元包括标量计算层以及向量计算层的连续结构,即:标量计算层、向量计算层依次相邻的连续结构。所述标量计算层指的是加法层、减法层或乘法层,所述向量计算层指的是激活层、批标准化层或缩放层。当处理单元203判断此规则未被满足时,处理装置203会调整模板融合单元直到此规则被满足。
规则八:卷积层的权值不为某个层的输出
此规则为模板融合单元中的卷积层的权值不为神经网络的任一层的输出,不论该层是否被纳入在模板融合单元。当处理单元203判断此规则未被满足时,处理装置203会将此卷积层自模板融合单元中移除。
规则九:卷积层的权值不与神经网络的任一层共用
由于模板融合单元涉及的神经网络模型中算子的权值具有特别的摆放形式,当被融合的卷积算子与其他算子共用权值时,权值的摆放逻辑会发生冲突,此规则为模板融合单元中的卷积算子的权值不与神经网络的任一层共用。当处理单元203判断此规则未被满足时,处理装置203会将此卷积算子自模板融合单元中移除。
规则十:权值不大于WRAM的可用空间
大图模式对于WRAM 432的限制较少,原因在于载入SRAM 308的片上单元图只是特征图的一部分,在计算模板融合单元时,WRAM 432只需要存放该特征图的所有权值即可。但由于小图模式可能会将多张特征图加载至SRAM 308,在这种情况下所需的权值会变多,便要谨慎评估WRAM 432的可用空间是否足够。此规则为片上单元图中的权值所需存储空间不大于WRAM 432的可用空间,当处理装置203判断此规则未被满足时,处理装置203会减少片上单元图的大小。
如果权值是基于C维度的输出通道参数Cout进行拆分,由于权值会被平均分配到多个处理器核306中,则本规则调整为:
W_j/n≤W
其中,W_j为片上单元图j涉及的权值所需存储空间,n为集群中处理器核的数量,W为WRAM 432的可用空间。
规则十一:冗余百分比
冗余百分比为输入依赖运算与输出依赖运算所产生的冗余总和与模板融合单元正常输入/输出量的比例,此处正常输入/输出量指的是片上单元图在未被拆分前没有冗余的数据量。处理装置203会计算模板融合单元将当前层融合进来后,片上单元图从DRAM 204至SRAM 308的访存量size_TFU,与正常输入/输出量(不含冗余)size_ori的百分比,其中访存量size_TFU指的是理论的访存量size_ori加上冗余总和。其公式如下:
冗余百分比 = size_TFU/size_ori×100%,其中size_TFU = size_ori+冗余总和
处理装置203会将模板融合单元的拆分信息和形状推导计算在内,并设定百分比阈值为50%、75%、100%、125%或160%,较佳为100%。以百分比阈值为100%为例,表示当冗余总和大于模板融合单元正常输入/输出量的2倍时,便不再融合。此规则为拆分片上单元图所产生的冗余总和不超出与百分比阈值相关的特定比例,一旦超过,表示冗余部分过多,大量的资源将耗费在计算冗余上,效能下降,因此当处理装置203判断此规则未被满足时,处理装置203会停止融合。
需要注意的是,在小图模式下,由于从DRAM 204至SRAM 308过程一次加载至少一整张完整的特征图,故不会产生冗余。此规则不适用于小图模式。
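以下为一段示意性的Python代码,按正文定义计算冗余百分比(size_TFU与size_ori之比),并据此决定是否继续融合。代码假设判断条件为该百分比不超过100%加上百分比阈值,且小图模式直接跳过此规则;具体判定门槛以实际实现为准,函数名与数值均为说明用途而假设:

```python
def redundancy_ratio(size_ori, overlap_in, overlap_out):
    # size_TFU = 理论访存量size_ori + 输入依赖与输出依赖产生的冗余总和
    size_tfu = size_ori + overlap_in + overlap_out
    return size_tfu / size_ori * 100.0          # 百分比

def keep_fusing(size_ori, overlap_in, overlap_out, threshold=100.0, small_graph=False):
    # 小图模式下不产生冗余,此规则不适用
    if small_graph:
        return True
    # 假设:百分比不超过100%+阈值才继续融合
    return redundancy_ratio(size_ori, overlap_in, overlap_out) <= 100.0 + threshold

print(keep_fusing(size_ori=1000, overlap_in=600, overlap_out=300))   # True:访存量1900,未超过2倍
print(keep_fusing(size_ori=1000, overlap_in=900, overlap_out=300))   # False:访存量2200,停止融合
```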
规则十二:片上单元图输入输出尺寸
假设SRAM 308的空间尺寸为S,片上单元图所需存储空间为IN,片上单元图的计算结果所需存储空间为OUT,则此规则为SRAM 308的空间尺寸需要满足以下条件:
如果IN和OUT不能复用存储空间的话,IN+OUT<S
如果IN和OUT可以复用存储空间的话,MAX(IN,OUT)<S
即如果IN和OUT不能复用存储空间的话,片上单元图的存储空间与计算结果的存储空间之和小于SRAM 308的可用空间;如果IN和OUT可复用存储空间的话,片上单元图的存储空间与计算结果的存储空间较大者小于SRAM 308的可用空间。
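以下为一段示意性的Python代码,对应规则十二中IN与OUT能否复用存储空间的两种情形;数值仅为假设:

```python
def satisfies_rule_12(in_bytes, out_bytes, sram_bytes, can_reuse):
    # 可复用时要求MAX(IN,OUT)<S;不能复用时要求IN+OUT<S
    if can_reuse:
        return max(in_bytes, out_bytes) < sram_bytes
    return in_bytes + out_bytes < sram_bytes

S = 2 * 2**20                                                   # 假设SRAM可用空间为2MB
print(satisfies_rule_12(1.5 * 2**20, 1.2 * 2**20, S, can_reuse=True))    # True
print(satisfies_rule_12(1.5 * 2**20, 1.2 * 2**20, S, can_reuse=False))   # False
```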
规则十三:W_i+IN1+IN2≤S
在小图模式下,此规则为SRAM 308的空间尺寸需要满足以下条件:
W_i+IN1+IN2≤S
即子图i的权值所需存储空间W_i、片上单元图所需存储空间IN1、缓存空间IN2的总和不大于SRAM 308的可用空间。当处理装置203判断此规则未被满足时,处理装置203减少片上单元图的数量直到此规则被满足。
规则十四:SubIN_i+W_i+IN2≤S
在小图模式下,此规则为SRAM 308的空间尺寸需要满足以下条件:
SubIN_i+W_i+IN2≤S
即子图i的所需存储空间SubIN_i、子图i的权值所需存储空间W_i、缓存空间IN2的总和不大于SRAM 308的可用空间。当处理装置203判断此规则未被满足时,处理装置203减少片上单元图的数量直到所述规则被满足。
规则十五:SubOUT_i+W_{i+1}+IN2≤S
在小图模式下,此规则为SRAM 308的空间尺寸需要满足以下条件:
SubOUT_i+W_{i+1}+IN2≤S
即子图i的中间结果所需存储空间SubOUT_i、下一个子图的权值所需存储空间W_{i+1}、缓存空间IN2的总和不大于SRAM 308的可用空间。当处理装置203判断此规则未被满足时,处理装置203减少片上单元图的数量直到所述规则被满足。
规则十六:W_i+W_{i+1}≤W
模板融合单元中参与卷积运算的权值会被独立搬运并驻留在WRAM 432上。在小图模式下,如果子图包括多张特征图,考虑到子图间的流水,WRAM 432最多同时存储相邻两个子图的权值。假设每个子图i的权值所需存储空间为W_i,且WRAM 432的总空间为W,此规则为WRAM 432的空间尺寸需要满足以下条件:
W_i+W_{i+1}≤W
即子图i的权值所需存储空间W_i、下一个子图的权值所需存储空间W_{i+1}的总和不大于WRAM 432的可用空间。当处理装置203判断此规则未被满足时,处理装置203减少片上单元图的数量直到所述规则被满足。
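以下为一段示意性的Python代码,把规则十三至规则十六合并成一次检查,符号与正文一致(S为SRAM可用空间、W为WRAM可用空间、IN2为缓存空间);函数名与数值均为说明用途而假设,并非本发明的实际实现:

```python
def check_small_graph_rules(W_i, W_next, IN1, IN2, SubIN_i, SubOUT_i, S, W):
    """小图模式下规则十三至规则十六的合并检查。
    W_i/W_next为子图i及下一个子图的权值所需存储空间,IN1为片上单元图所需存储空间,
    SubIN_i/SubOUT_i为子图i的输入与中间结果所需存储空间。"""
    return {
        "rule13": W_i + IN1 + IN2 <= S,
        "rule14": SubIN_i + W_i + IN2 <= S,
        "rule15": SubOUT_i + W_next + IN2 <= S,
        "rule16": W_i + W_next <= W,
    }

checks = check_small_graph_rules(W_i=0.3e6, W_next=0.4e6, IN1=1.0e6, IN2=0.2e6,
                                 SubIN_i=0.6e6, SubOUT_i=0.5e6, S=2.0e6, W=1.0e6)
print(checks)   # 任一项为False时,处理装置减少片上单元图的数量后重新检查
```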
规则十七:子图所需存储空间不大于NRAM的可用空间
此规则为子图所需存储空间不大于NRAM 431的可用空间。当SRAM 308上的片上单元图要被拆分成子图搬运至NRAM 431时,处理装置203可以在N、H、W维度上进行细粒度拆分。如果NRAM 431的空间不足,处理装置203会把片上单元图拆分得更细,直到此规则被满足。一般来说,NRAM 431都会具有合理的可用空间,使得片上单元图被拆分到合理的程度便可一次性地被载入,就融合策略的角度来看,模板融合单元不会受到批处理数目的影响。然而,片上单元图被拆分的越小(即子图越多),处理速度会下降,故处理装置203需要评估NRAM 431的空间。
在一些实施例中,SRAM 308的空间与集群305内的处理器核306的NRAM 431的个数相对应,例如集群305包括4个处理器核306,则SRAM 308的空间为NRAM 431的空间的4倍。换言之,大图模式下的片上单元图一般能分配给4个处理器核306处理,这种架构设计已考虑载入SRAM 308的数据能一次性地分配给所有NRAM 431。因此在大图模式下不需要考虑此规则。
规则十八:特征图的数量不大于特征图阈值
在小图模式下,片上单元图可能会包括多张特征图,特征图越多,SRAM 308与NRAM 431间的子图传输次数就越多,效率便会下降,因此并非片上单元图包括的特征图越多越好,处理装置203会根据片上单元图中特征图的数量来计算合适的融合层数,使其效益最大化。此规则为片上单元图中的特征图的数量不大于特征图阈值,当处理装置203判断此规则未被满足时,处理装置203减少片上数据中特征图的数量直到所述规则被满足。
规则十九:步长冗余
步长冗余指的是:当模板融合单元融合层数太多,再加上卷积和池化的内核的长宽大于步长时,每个输出点需要的输入数据存在重叠部分,也就是前述的输入依赖运算,该重叠部分即为步长冗余。步长冗余使得每个处理器核306需要多读取一些数据,但是这一部分复用的数据会占去片上片外的访问资源,模板融合单元包括的层数越多,步长冗余越严重。此规则为卷积层或池化层的内核的边长与步长的差值总和不大于冗余阈值。
在此实施例中,冗余阈值的定义如下。假设卷积和池化层的内核的长和宽为k_x和k_y,长和宽方向的步长分别为s_x和s_y,则长方向的步长冗余为模板融合单元内所有卷积及池化层的k_x-s_x的总和;同理,宽方向的步长冗余为模板融合单元内所有卷积及池化层的k_y-s_y的总和。此实施例的冗余阈值可以为3、4、5或6,较佳为4。只要长方向或宽方向任一方向的步长冗余大于冗余阈值,此规则便不被满足。处理装置203调整模板融合单元,通常为减少被融合的层数,直到此规则被满足。
融合策略对于步长冗余设定了例外规则。在欲融合的层里存在多分支且模板融合单元能融合整个多分支的前提下,模板融合单元的性能会表现得更为优异,在这种情况下,处理装置203会忽略步长冗余的规则,亦即步长冗余不会限制模板融合单元融合多分支,即在此实施例的融合策略中,融合多分支优先于步长冗余的限制。也就是说,步长冗余只有在单分支的情况下才会被考虑。
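以下为一段示意性的Python代码,按正文计算长、宽两个方向的步长冗余,并体现融合整个多分支时忽略此规则的例外;冗余阈值取正文的较佳值4,函数名为说明用途而假设:

```python
def stride_redundancy_ok(layers, threshold=4, multi_branch=False):
    """layers为模板融合单元内各卷积/池化层的(k_x, k_y, s_x, s_y)列表。
    规则十九:长、宽任一方向的步长冗余(k-s的总和)大于冗余阈值即不满足;
    若模板融合单元融合了整个多分支,则忽略此规则。"""
    if multi_branch:
        return True
    red_x = sum(kx - sx for kx, ky, sx, sy in layers)
    red_y = sum(ky - sy for kx, ky, sx, sy in layers)
    return red_x <= threshold and red_y <= threshold

# 三个3×3、步长1的卷积层:每层冗余为2,总和6大于阈值4,需减少融合层数
print(stride_redundancy_ok([(3, 3, 1, 1)] * 3))                       # False
print(stride_redundancy_ok([(3, 3, 1, 1)] * 2))                       # True:总和4,不大于阈值
print(stride_redundancy_ok([(3, 3, 1, 1)] * 3, multi_branch=True))    # True:融合多分支优先
```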
以上的规则仅为示例,本发明并不限制各规则执行的顺序,亦不限制这些规则需同时被考虑,本领域技术人员在不同的应用场景下可以根据实际情况增删规则,以实现符合当下应用场景的融合策略。
回到图12A,在步骤1203中,根据建立后的模板融合单元执行神经网络计算。计算装置201基于片上系统-集群-处理器核的三级运算层次,搭配DRAM-SRAM-NRAM/WRAM这样的三层内存设计,将模板融合单元视为神经网络中一个自定义的层,一次性地自DRAM 204载入计算模板融合单元所需的数据至SRAM 308,使得数据能够在适当的层级里缓存并计算,形成充分的流水,计算完成后再将计算结果自SRAM 308传送至DRAM 204,大大减少神经网络计算中的输入/输出开销。
当计算机视觉、语音、自然语言处理、数据挖掘等领域的输入数据欲进行各类深度学习和机器学习算法时,本发明基于模板融合单元,可以减少神经网络计算中的输入/输出开销。
本发明的另一个实施例是一种利用模板融合单元执行神经网络计算的方法之一。图12B示出其流程。
在步骤1201’中,根据融合策略决定模板融合单元。处理装置203根据融合策略的起始规则,选择模板融合单元的起始层;并以起始层为基准进行融合,逐一排查融合策略的所有规则,以建立模板融合单元。前一个实施例已详细示例说明融合策略的各种规则,不再赘述。
在此步骤中,模板融合单元会以源代码的方式展现,接下来需要通过编译器将源代码转换成机器语言的目标代码(object code),又称作机器代码。以下多个步骤即为编译器将模板融合单元的源代码转换成机器语言的目标代码的过程。
在步骤1202’中,推导模板融合单元的形状。对于模板融合单元需要处理的数据,此实施例采用的是逆推的方式,编译器从输出向前逆推出需要多少尺寸的输入,以图8B为例,便是从特征图803逆向推导至特征图802,再逆向推导至特征图801。在此步骤中,编译器不仅根据模板融合单元推导所需输入数据,还会进一步推导冗余。
接下来执行步骤1203’,推导地址。根据模板融合单元的形状,编译器对整个控制流图进行片上存储空间地址的推导,并且实现通用地址的访问,以达到精简计算资源、缩短计算时间的目的。控制流图是用在编译器中的一种抽象数据结构,代表了一个程序可能会执行的所有路径,以流程图的形式反映过程内所有节点的可能流向。控制流图是由节点和节点间的关系所组成的。节点又称为基本块(basic block,BB),是程序中最大限度顺序执行的语句序列,每个基本块只有一个入口和出口,执行时从其入口进入,自其出口退出。基本块的特点是只要第一条指令被执行了,那么基本块内所有指令都会按照顺序被执行。
每个基本块包含至少一条指令,基本块中的指令可能使用指针指向特定的片上存储空间。指针是一种变量,用以保存特定地址空间的地址。通过指针,处理器核306可以将数据载入到指针指向的特定地址的空间中,或是从指针指向的特定地址中的数据取出。
编译器根据模板融合单元的划分情况,初始划分基本块,再经过迭代运算后,确认基本块及相互关系,至此完成实现模板融合单元的目标代码。
不仅如此,编译器还会针对神经网络中前后两个模板融合单元的复用数据进行分析,判断前一次模板融合单元中有多少数据可以留在片上供下一个模板融合单元使用,并根据判断结果规划各数据的存储地址。
在此步骤中,编译器完成控制流图中地址的推导。
在步骤1204’中,分配片上存储空间。处理装置203基于模板融合单元地址的推导,分配SRAM 308、NRAM 431及WRAM 432的物理空间。在此步骤中,编译器完成控制流图中指针的指向。
最后执行步骤1205’,生成可执行指令。在此步骤中,链接器(linker)将编译器所生成的目标代码外加库链接,使其成为一个可执行文件。更详细来说,目标代码是包括机器码和链接器可用信息的程序模块,链接器的工作是解析未定义的符号引用,将目标代码中的占位符替换为符号的地址,进而生成可执行指令。可执行指令可直接被计算装置201执行,以完成神经网络的计算。
本发明的另一个实施例是一种利用模板融合单元执行神经网络计算的方法之二,图12C示出其流程。
在步骤1201”中,根据融合策略的起始规则,选择模板融合单元的起始层。处理装置203根据融合策略的起始规则,选择模板融合单元的起始层,也就是在神经网络模型中尚未融合的层中选择开始融合的层。
在步骤1202”中,以起始层为基准进行融合,逐一排查融合策略的所有规则,以建立模板融合单元。处理装置203以起始层为基准进行融合,逐一排查融合策略的所有规则,以建立模板融合单元。描述图12B时已详细示例说明融合策略的各种规则,不再赘述。在满足所有规则的前提下,计算装置201的硬件资源便足以支撑一次性地载入计算模板融合单元所需的数据,进而根据模板融合单元执行神经网络计算。
由于在步骤1201”中已示例性的设定第4层为这次模板融合单元的起始层,因此在此步骤中从第4层起向后融合,逐一排查融合策略的所有规则,以建立模板融合单元。首先将正金字塔层的第5层也融合进去,如果还能继续融合,则处理装置203继续往后融合。
以下将说明正金字塔层的融合方式。图13A示出图11中的第5层与第6层的输入/输出特征图的形状。以融合图13A中的第5层与第6层为例,假设第5层为反卷积层,第6层为ReLU激活层。为说明方便,第5层的输入特征图示例性地包括3个输入数据X_1、X_2、X_3,输入数据X_1经过第5层后会产生输出数据Y_1至Y_2,输入数据X_2经过第5层后会产生输出数据Y_3至Y_4,输入数据X_3经过第5层后会产生输出数据Y_5至Y_6,输出数据Y_1至Y_6再经过第6层激活,由于激活层不会改变数据量,因此输入数据Y_1至Y_6经过第6层后分别产生如图所示的输出数据Z_1至Z_6。图13B示出基于正金字塔层建立模板融合单元的流程图。
在此步骤中根据融合策略建立模板融合单元时,此实施例将同一个输入数据的所有输出数据视为一个融合块,图13A显示了第5层包括3个融合块,即X_1―Y_1―Y_2是第一融合块1301、X_2―Y_3―Y_4是第二融合块1302、X_3―Y_5―Y_6是第三融合块1303;第6层亦包括3个融合块,即Y_1―Y_2―Z_1―Z_2是第四融合块1304(来自输入数据X_1)、Y_3―Y_4―Z_3―Z_4是第五融合块1305(来自输入数据X_2)、Y_5―Y_6―Z_5―Z_6是第六融合块1306(来自输入数据X_3)。
在步骤1301’中,处理装置203设定对应同一个输入数据的所有输出数据为融合块,也就是识别出前述的融合块1301-1306。
在步骤1302'中,以融合块为单位,根据融合策略建立模板融合单元。除了前述的规则一至规则十九,与融合正金字塔层相关的规则还包括:
规则二十:以融合块为单位
基于每个处理器核306的硬件资源,以融合块为单位分配给每个处理器核306。由于融合块具有同一个输入数据,是一个完整的数据块,以融合块为单位切割成子图较为便捷。倘若一个子图包括不完整的融合块,例如某个子图包括融合块1301、1304、部分的融合块1302(数据块1307)及部分的融合块1305(数据块1308),这会使得下一个处理器核306难以判断融合块1302与1305已处理和未处理的部分,更详细来说,由于硬件通信的限制,下一个处理器核306无法得知数据块1307、1308的大小,以至于由片上单元图分割成子图时出现问题,可能导致遗漏部分数据未计算的情况。
为避免发生前述情况,处理装置203以融合块为单位分配给每个处理器核306。假设某个处理器核306在完整计算融合块1301、1304后尚有空间,处理装置203会进一步判断该处理器核306能否一并计算融合块1302、1305,如果可以,便将融合块1302、1305也分配给该处理器核306,如果不行,将融合块1302、1305分配给下一个处理器核306。
规则二十一:重复计算融合块
当特定的融合块在处理器核间重复计算时,将指派特定处理器核306计算特定融合块,并将计算特定融合块的中间结果存储至SRAM 308中,存储核307将中间结果合并至其他处理器核306所产生的中间结果。
举例来说,假设根据其他融合策略,融合块1301、1302、1304、1305分配给第一处理器核,融合块1302、1303、1305、1306分配给第二处理器核,融合块1302、1305重复计算了,为节省计算资源,处理装置203会重新调整任务量,把融合块1302、1305仅指派其中一个处理器核,例如第一处理器核,故第一处理器核还是计算融合块1301、1302、1304、1305,但第二处理器却只需计算融合块1303、1306。故第一处理器核计算完毕后,中间结果存储至SRAM 308中,存储核307会将第一处理器核计算融合块1302、1305所获得的中间结果与第二处理器核计算融合块1303、1306的中间结果合并,以产出对应融合块1301、1302、1304、1305的中间结果,以及对应融合块1302、1303、1305、1306的中间结果。一方面节省计算资源,另一方面也满足输出依赖的关系。
前述其他规则还是必须满足的。例如根据规则十五,在小图模式下,子图i的中间结果所需存储空间SubOUT_i、下一个子图的权值所需存储空间W_{i+1}、缓存空间IN2的总和不大于SRAM 308的可用空间。当处理装置203判断此规则未被满足时,处理装置203减少片上单元图的数量直到所述规则被满足。又例如规则十,融合块涉及的权值所需存储空间不大于WRAM 432的可用空间。当处理装置203判断这些融合策略未被满足时,处理装置203减少融合块的数量。其他规则不再赘述。
由于正金字塔层可能需要按照一定的比例通过补0来扩大输出图像的尺寸,接着旋转卷积核,再进行正向卷积,因此当涉及到融合正金字塔层时,权值指的是补0后的输出通道权值。
此实施例不限制正金字塔层和倒金字塔层的融合方式,可以全是正金字塔层融合在一块,例如模板融合单元包括第5层至第9层,也可以混合融合在一块,例如模板融合单元包括第3层至第6层,或是模板融合单元包括第9层至第12层等。换言之,模板融合单元可以只包括正金字塔层,也可以包括倒金字塔层加上正金字塔层,或是正金字塔层加上倒金字塔层。
回到图12C,在步骤1202”中,模板融合单元会以源代码的方式展现,接下来需要通过编译器将源代码转换成机器语言的目标代码(object code),又称作机器代码。以下多个步骤即为编译器将模板融合单元的源代码转换成机器语言的目标代码的过程。
接着执行步骤1203”,推导模板融合单元的形状。对于模板融合单元需要处理的数据,此实施例采用的是逆推的方式,编译器从输出向前逆推出需要多少尺寸的输入,以图8B为例,便是从特征图803逆向推导至特征图802,再逆向推导至特征图801。在此步骤中,编译器不仅根据模板融合单元推导所需输入数据,还会进一步推导冗余。
接下来执行步骤1204”,推导地址。根据模板融合单元的形状,编译器对整个控制流图进行片上存储空间地址的推导,并且实现通用地址的访问,以达到精简计算资源、缩短计算时间的目的。控制流图是用在编译器中的一种抽象数据结构,代表了一个程序可能会执行的所有路径,以流程图的形式反映过程内所有节点的可能流向。控制流图是由节点和节点间的关系所组成的。节点又称为基本块(basic block,BB),是程序中最大限度顺序执行的语句序列,每个基本块只有一个入口和出口,执行时从其入口进入,自其出口退出。基本块的特点是只要第一条指令被执行了,那么基本块内所有指令都会按照顺序被执行。
每个基本块包含至少一条指令,基本块中的指令可能使用指针指向特定的片上存储空间。指针是一种变量,用以保存特定地址空间的地址。通过指针,处理器核306可以将数据载入到指 针指向的特定地址的空间中,或是从指针指向的特定地址中的数据取出。
编译器根据模板融合单元的划分情况,初始划分基本块,再经过迭代运算后,确认基本块及相互关系,至此完成实现模板融合单元的目标代码。
不仅如此,编译器还会针对神经网络中前后两个模板融合单元的复用数据进行分析,判断前一次模板融合单元中有多少数据可以留在片上供下一个模板融合单元使用,并根据判断结果规划各数据的存储地址。
在此步骤中,编译器完成控制流图中地址的推导。
在步骤1205”中,分配片上存储空间。处理装置203基于模板融合单元地址的推导,分配SRAM 308、NRAM 431及WRAM 432的物理空间。在此步骤中,编译器完成控制流图中指针的指向。
最后执行步骤1206”,生成可执行指令。在此步骤中,链接器(linker)将编译器所生成的目标代码外加库链接,使其成为一个可执行文件。更详细来说,目标代码是包括机器码和链接器可用信息的程序模块,链接器的工作是解析未定义的符号引用,将目标代码中的占位符替换为符号的地址,进而生成可执行指令。计算装置201执行可执行指令,以根据模板融合单元执行神经网络计算。
此实施例可以融合正金字塔层和倒金字塔层,这样的融合策略使得模板融合单元的建立更为弹性,不受输入特征图及输出特征图尺寸的限制,进而适应各种网络模型,让融合更加全面,提升整体效益。
另外,以前述融合策略的规则来决定模板融合单元时,不必然要以卷积层或是池化层为起始展开融合。前述实施例提及在一种应用场景下,起始规则可以是起始层为神经网络中最前未被融合的层,这层可以是卷积层或池化层以外的层。这样的起始规则使得模板融合单元的建立更为弹性,能够针对不同的神经网络,基于各层的排序,适当地选择起始层开始融合,不受卷积层或是池化层在神经网络模型中的位置和数量所限,进而适应各种网络模型,让融合更加全面,提升整体效益。
举例来说,以图6A的神经网络模型为例,假设第1层至第5层已融合完毕,在建立下一个模板融合单元时,如果起始规则采用起始层为最前未被融合的卷积或池化层,则下一个卷积或池化层为第8层,换言之,第6层及第7层可能不会被融合,影响整体效益。
本发明的另一个实施例为一种融合神经网络的方案,其起始层为除卷积层及池化层之外的层,即非卷积层和非池化层。此实施例同样基于图1至图4的框架来实现。此实施例同样执行如图12A所示的流程图。
在步骤1201中,根据融合策略选择起始层。处理装置203根据融合策略选择起始层,例如融合策略的起始规则是起始层为神经网络中最前未被融合的层,这层是卷积层或池化层以外的层。起始层可以为元素对元素(element-wise)层、添加填充(addpadding)层或是自定义层。
需注意的是,此步骤不采用起始规则为起始层为最前未被融合的卷积或池化层,如果按此起始规则选择起始层,便会限制起始层必须为卷积或是池化层,此实施例不受卷积层或是池化层在神经网络模型中的位置和数量所限的优势便不存在了。
如果神经网络包括分支结构,根据前述规则三,优先以分支结构为单位进行融合。然而,有时候分支结构太过复杂,无法将整个分支结构整合至模板融合单元中,基于前述的规则只能放弃融合分支结构。不仅如此,规则四要求模板融合单元的输出需为单分支输出,亦反映了必须以分支结构为单位进行融合。换言之,规则三和四的融合策略对具有分支结构的神经网络模型并不友善,融合效果不佳。
本发明的另一个实施例是一种根据融合策略动态融合神经网络的分支结构的装置,该装置同样具有图1至图4的结构。此实施例不必然一定要以完整的分支结构进行融合。图13C示出一种示例性的神经网络模型片段,其包括分支结构1300”,分支结构1300”的始点是T1层,终点是T10层,由T1层展开有第一分支1301”与第二分支1302”,第一分支1301”包括T2层与T3层,第二分支1302”包括T4层至T9层。
在融合分支结构1300”时,处理装置203先为分支结构1300”建立起拓扑序列。拓扑序列指的是对一个有向无环图中的所有节点排成一个线性序列,且必须满足以下两个条件:每个节点必须出现且只出现一次;若存在一条从节点A到节点B的路径,那么在序列中节点A出现在节点B的前面。简单的说,就是由某个集合上的一个偏序得到该集合上的一个全序的过程。基于前述原则,处理装置203在建立拓扑序列时,先识别分支结构1300”的始点与终点,即始点为T1层和终点为T10层。处理装置203设定分支结构1300”的始点为拓扑序列的始点,该始点亦被设定为模板融合单元的起始层,并设定分支结构1300”的终点为拓扑序列的终点,再依拓扑排列分支结构1300”中的中间各层,排列的方式有以下2种。
第一种排列方式是比较各分支的层数,依层数数量由多至少排列子分支的各层;第二种排列方式是比较各分支上的层数,依层数数量由少至多排列子分支的各层。此实施例采第二种排列方式。第一分支1301”有2层,第二分支1302”有6层,第一分支1301”的层数较少,因此第一分支1301”中的各层排在第二分支1302”中的各层之前。基于这种排列方式,如图14所示,形成具有T1层→T2层→T3层→T4层→T5层→T6层→T7层→T8层→T9层→T10层的拓扑序列。经过转换,分支结构1300”的拓扑序列形成一个长链结构1400。
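以下为一段示意性的Python代码,演示把分支结构展开为拓扑序列(长链结构)的排列过程,并以图13C的分支结构1300"为例重现图14的长链结构1400;函数名与参数均为说明用途而假设,并非本发明的实际实现:

```python
def build_topo_sequence(start, end, branches, ascending=True):
    """branches为各分支上的层列表(不含始点与终点)。
    ascending=True时依分支层数由少至多排列(对应此实施例采用的第二种排列方式),
    ascending=False时由多至少排列(第一种排列方式);返回长链结构的拓扑序列。"""
    ordered = sorted(branches, key=len, reverse=not ascending)
    seq = [start]
    for branch in ordered:
        seq.extend(branch)
    seq.append(end)
    return seq

# 图13C的分支结构1300":始点T1、终点T10,第一分支为T2~T3,第二分支为T4~T9
branch1 = ["T2", "T3"]
branch2 = ["T4", "T5", "T6", "T7", "T8", "T9"]
print(build_topo_sequence("T1", "T10", [branch1, branch2]))
# ['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9', 'T10'],即图14的长链结构1400
```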
此实施例是以拓扑序列中的层为单位,而不是以整个分支结构为单位来增删模板融合单元,处理装置203以长链结构1400来取代分支结构1300”,排查融合策略内的规则,以建立模板融合单元。换言之,处理装置203将带有分支结构1300”的神经网络模型视为是具有一个长链结构1400的神经网络模型,以长链结构1400的起始层(T1层)作为基准进行融合,如此便可以选用前述融合策略内的任何规则(除了规则三和规则四),来建立模板融合单元。
在此实施例中,不必然模板融合单元需要包括整个分支结构。举例来说,假设长链结构1400可以生成2个模板融合单元:第一模板融合单元1401包括T1层至T5层,第二模板融合单元1402包括T6层至T10层。当长链结构1400还原成分支结构时,第一模板融合单元1401与第二模板融合单元1402的形状如图15所示,第一模板融合单元1401具有2个分支输出,分别连接至第二模板融合单元1402的T10层与T6层,也就是说,第一模板融合单元1401具有2个输出端,第二模板融合单元1402具有2个输入端。
为了让数据的搬运更有效率,在推导第一模板融合单元1401的形状时,处理装置203接着判断第一模板融合单元1401是否包括分支结构的终点。第一模板融合单元1401并未包括T10层,处理装置203进一步判断NRAM 431的可用空间是否足够大,如是,则处理装置203在推导地址时,使得计算装置201将第一模板融合单元1401产出的2个计算结果(也就是最末层T3层与T5层的中间结果)存储在NRAM 431中,原因是第二模板融合单元1402可以直接自NRAM 431取值计算。如果NRAM 431的可用空间不够大,处理装置203进一步判断SRAM 308的可用空间是否足够大。如果SRAM 308的可用空间足够大,这2个计算结果便会存储在SRAM 308中,在计算第二模板融合单元1402时可以直接自SRAM 308取值计算。
由于这2个计算结果就是第二模板融合单元1402的片上单元图,因此计算装置201在计算第二模板融合单元1402时,不需要从DRAM 204载入片上单元图,直接从NRAM 431或SRAM 308读取计算,减少了片上片外的访问。
如果NRAM 431及SRAM 308的可用空间都不够大,计算装置201才会将第一模板融合单元1401产出的2个计算结果存回至DRAM 204中,当计算装置201计算第二模板融合单元1402时,将从DRAM 204载入这2个计算结果进行计算。
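以下为一段示意性的Python代码,概括前述中间结果存放位置的决策:模板融合单元不包括分支结构终点时优先存入NRAM,其次SRAM,两者都放不下才写回DRAM;包括终点则直接写回DRAM。函数名与各容量数值均为假设,并非本发明的实际实现:

```python
def pick_output_location(tfu_contains_endpoint, result_bytes, nram_free, sram_free):
    """决定模板融合单元最末层中间结果的存放位置。"""
    if tfu_contains_endpoint:
        return "DRAM"                 # 包括分支结构的终点:计算结果存回DRAM
    if result_bytes <= nram_free:
        return "NRAM"                 # 下一个模板融合单元可直接自NRAM取值计算
    if result_bytes <= sram_free:
        return "SRAM"                 # 其次考虑SRAM,减少片上片外访问
    return "DRAM"                     # 片上空间都不足时才写回DRAM

# 第一模板融合单元1401不包括终点T10,假设两个计算结果共0.3MB、NRAM可用0.5MB
print(pick_output_location(False, int(0.3e6), nram_free=int(0.5e6), sram_free=int(2e6)))  # NRAM
print(pick_output_location(True, int(0.3e6), nram_free=int(0.5e6), sram_free=int(2e6)))   # DRAM
```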
在推导第二模板融合单元1402的形状时,处理装置203判断第二模板融合单元1402是否包括分支结构的终点。第二模板融合单元1402确实包括T10层,则处理装置203在推导地址时,使得计算装置201将第二模板融合单元1402产出的计算结果存回至DRAM 204中。
综上所述,此实施例的处理装置203将分支结构转换成长链结构,长链结构简单,容易生成模板融合单元,再将长链结构还原成分支结构进行形状和地址的推导,不再需要以整个分支结构为单位进行融合。计算装置201根据模板融合单元来执行神经网络计算。
本发明的另一个实施例同样是融合分支结构的装置,该装置亦具有图1至图4的结构。与图15的分支结构不同处在于,此实施例可融合具有子分支的分支结构。图16示出一种示例性的神经网络模型片段,其包括分支结构1600,分支结构1600的始点是T1层,终点是T11层,由T1层展开有第一分支1601与第二分支1602,第一分支1601包括T2层至T7层,第二分支1602包括T8层至T10层。而第一分支1601包括子分支结构,子分支结构的始点是T3层,终点是T7层,第一子分支1603包括T4层及T5层,第二子分支1604包括T6层。
在融合分支结构1600时,处理装置203先为分支结构1600建立起拓扑序列。首先识别分支结构1600的始点与终点,即始点为T1层,终点为T11层,处理装置203设定分支结构1600的始点为拓扑序列的始点,该始点亦被设定为模板融合单元的起始层,并设定分支结构1600的终点为拓扑序列的终点。处理装置203进一步判断分支结构1600是否存在子分支结构,分支结构1600确实存在子分支结构,处理装置203先识别子分支结构的始点与终点,即T3层与T7层,再依拓扑排列子分支结构中的始点、终点及中间各层,排列的方式有以下2种。
第一种排列方式是比较子分支结构中的子分支上的层数,依层数数量由多至少排列子分支的各层。第一子分支1603有2层,第二子分支1604有1层,第一子分支1603的层数较多,因此第一子分支1603中的各层排在第二子分支1604中的各层之前。基于这种排列方式,子分支结构的拓扑排序为T3层→T4层→T5层→T6层→T7层。
第二种排列方式是比较子分支结构中的子分支上的层数,依层数数量由少至多排列子分支的各层。第二子分支1604的层数较少,因此第二子分支1604中的各层排在第一子分支1603中的各层之前。基于这种排列方式,子分支结构的拓扑排序为T3层→T6层→T4层→T5层→T7层。
在处理完子分支结构的拓扑排序后,处理装置203继续对分支结构1600进行排序。在此实施例中,分支结构1600的排序方式与子分支结构相同,换言之,如果子分支采第一种排列方式(以层数数量由多至少来排列)则分支结构1600亦以层数数量由多至少来排列,第一分支1601的层数多于第二分支1602,因此第一分支1601的各层排在第二分支1602的各层之前,即生成图17所示的长链结构1701;如果子分支采第二种排列方式(以层数数量由少至多来排列)则分支结构1600的第二分支1602的各层排在第一分支1601的各层之前,即生成图17所示的长链结构1702。
接着处理装置203以长链结构1701或1702来取代分支结构1600,以拓扑序列中的层为单位来增删模板融合单元,排查融合策略内的规则,以建立模板融合单元。同样地,在此实施例中,不必然模板融合单元需要包括整个分支结构1601或1602。
为了让数据的搬运更有效率,在推导分支结构1601或1602的模板融合单元的形状时,处理装置203会判断模板融合单元是否包括分支结构或子分支结构的终点,如果未包括,处理装置203进一步判断NRAM 431的可用空间是否足够大,如是,则处理装置203在推导地址时,使得计算装置201将该模板融合单元产出的最末层的中间结果存储在NRAM 431中。如果NRAM 431的可用空间不够大,处理装置203进一步判断SRAM 308的可用空间是否足够大。如果SRAM 308的可用空间足够大,则最末层的中间结果便会存储在SRAM 308中,在计算该模板融合单元时可以直接自SRAM 308取值计算。
如果模板融合单元未包括分支结构或子分支的终点,表示其输出(最末层的中间结果)是下一个模板融合单元的片上单元图,因此计算装置201在计算下一个模板融合单元时,不需要从DRAM 204载入片上单元图,直接从NRAM 431或SRAM 308读取计算,减少了片上片外的访问。
但如果NRAM 431及SRAM 308的可用空间都不够大,模板融合单元的最末层的中间结果会被存回至DRAM 204中,当计算装置201计算下一个模板融合单元时,从DRAM 204载入进行计算。
如果处理装置203判断该模板融合单元包括分支结构1600或子分支的终点,则处理装置203在推导地址时,使得计算装置201将该模板融合单元产出的最末层的中间结果存回至DRAM 204中。
此实施例虽然以分支结构包括一个子分支结构进行说明,本领域技术人员可以容易地推及多个子分支的情况,故不再细化。此实施例的处理装置203将分支/子分支结构转换成长链结构,长链结构简单,容易生成模板融合单元,再将长链结构还原成分支结构进行形状和地址的推导。计算装置201根据模板融合单元来执行神经网络计算。
本发明的另一个实施例是根据融合策略动态融合神经网络的分支结构的方法,此实施例由具有图1至图4的结构的装置来融合具有子分支的分支结构。图18示出此实施例的流程图。
在步骤1801中,为分支结构建立起拓扑序列。此步骤又分为以下多个步骤。
在步骤1802中,识别分支结构的始点与终点。在步骤1803中,设定分支结构的始点为拓扑序列的始点。在步骤1804中,设定该始点为模板融合单元的起始层。在步骤1805中,设定分支结构的终点为拓扑序列的终点。在步骤1806中,判断分支结构是否存在子分支结构,如是,执行步骤1807,识别子分支结构的始点与终点。在步骤1808中,依特定顺序排列子分支结构中的始点、终点及中间各层,排列的方式有以下2种:比较子分支结构中的子分支上的层数,依层数数量由多至少排列子分支的各层,以及比较子分支结构中的子分支上的层数,依层数数量由少至多排列子分支的各层。在处理完子分支结构的拓扑排序后,或是在步骤1806中判断分支结构不存在子分支结构,则执行步骤1809,依特定顺序对分支结构的各层进行排序。在此实施例中,分支结构的排序方式与子分支结构相同。至此,此实施例将分支结构转换成长链结构。
接着执行步骤1810,以长链结构来取代分支结构,用拓扑序列中的层为单位来增删模板融合单元,基于在步骤1804中设定的起始层,排查融合策略内的规则,以建立模板融合单元。此步骤即是以长链结构来取代分支结构,执行步骤1202,技术细节不再赘述。
在推导分支结构或子分支结构的模板融合单元的形状时,此实施例会判断模板融合单元是否包括分支结构或子分支结构的终点,如果未包括,进一步判断NRAM 431的可用空间是否足够大,如是,则在推导地址时使得计算装置201将该模板融合单元产出的最末层的中间结果存储在NRAM 431中。如果NRAM 431的可用空间不够大,进一步判断SRAM 308的可用空间是否足够大。如果SRAM 308的可用空间足够大,则最末层的中间结果便会存储在SRAM 308中,在计算该模板融合单元时可以直接自SRAM 308取值计算。
但如果NRAM 431及SRAM 308的可用空间都不够大,此实施例才会将模板融合单元的最末层的中间结果存回至DRAM 204中,在计算下一个模板融合单元时,将从DRAM 204载入进行计算。
如果模板融合单元包括分支结构或子分支结构的终点,则此实施例在推导地址时使得计算装置201将该模板融合单元产出的最末层的中间结果存回至DRAM 204中。
最后执行步骤1811,根据模板融合单元来执行神经网络计算。
本发明另一个实施例为一种计算机可读存储介质,其上存储有根据融合策略动态融合神经网络的分支结构的计算机程序代码,当所述计算机程序代码由处理器运行时,执行前述实施例所述的方法。
本发明通过设定融合策略,动态地决定模板融合单元,融合神经网络中的分支结构,以形成新的自定义的层,一次性载入计算模板融合单元所需的数据,以减少输入/输出开销。
依据以下条款可更好地理解前述内容:
202011563266.9 条款A1,一种融合神经网络的集成电路装置,所述神经网络包括正金字塔层,所述正金字塔层的输入特征图小于输出特征图,所述输入特征图中的输入数据产生所述输出特征图中的至少一个输出数据,所述集成电路装置包括:
处理装置,用以:
设定对应同一个输入数据的所有输出数据为融合块,所述输出特征图包括多个融合块;
以所述融合块为单位,根据融合策略建立模板融合单元;以及
计算装置,用以根据所述模板融合单元执行神经网络计算。
条款A2,根据条款A1所述的集成电路装置,其中所述计算装置包括多个集群,每个集群包括多个处理器核,所述融合策略为基于每个处理器核的硬件资源,以所述融合块为单位分配给每个处理器核。
条款A3,根据条款A2所述的集成电路装置,其中每个集群还包括存储核,所述存储核包括共享存储单元,所述融合策略为当特定融合块在特定处理器核间重复计算时,指派所述特定处理器核其中之一计算所述特定融合块,并将中间结果存储至所述共享存储单元。
条款A4,根据条款A3所述的集成电路装置,其中所述存储核将所述中间结果合并至其他特定处理器核所产生的中间结果中。
条款A5,根据条款A3所述的集成电路装置,其中所述共享存储单元包括缓存空间,所述融合策略为下一个子图的权值所需存储空间、所有输出数据所需存储空间、所述缓存空间的总和不大于所述共享存储单元的可用空间,当所述处理装置判断所述融合策略未被满足时,所述处理装置停止融合所述正金字塔层。
条款A6,根据条款A2所述的集成电路装置,其中每个处理器核包括权值存储单元,所述融合策略为所述融合块涉及的权值所需存储空间不大于所述权值存储单元的可用空间,当所述处理装置判断所述融合策略未被满足时,所述处理装置减少所述融合块的数量。
条款A7,根据条款A1所述的集成电路装置,其中所述正金字塔层包括反卷积层(deconvolution layer)、上池化层(unpooling layer)或上采样层(upsampling layer)。
条款A8,一种板卡,包括根据条款A1至A7任一项所述的集成电路装置。
条款A9,一种融合神经网络的方法,所述神经网络包括正金字塔层,所述正金字塔层的输入特征图小于输出特征图,所述输入特征图中的输入数据产生所述输出特征图中的至少一个输出数据,所述方法包括:
设定对应同一个输入数据的所有输出数据为融合块,所述输出特征图包括多个融合块;
以所述融合块为单位,根据融合策略建立模板融合单元;以及
根据所述模板融合单元执行神经网络计算。
条款A10,一种计算机可读存储介质,其上存储有融合神经网络的计算机程序代码,当所述计算机程序代码由处理装置运行时,执行条款A9所述的方法。
根据不同的应用场景,本发明的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本发明的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步,本发明的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中,根据本发明方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器),而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中,云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容,从而可以根据终端设备和/或边缘端设备的硬件信息,从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源,以便完成端云一体或云边端一体的统一管理、调度和协同工作。
需要说明的是,为了简明的目的,本发明将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本发明的方案并不受所描述的动作的顺序限制。因此,依据本发明的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本发明所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本发明某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本发明对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本发明某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。
在具体实现方面,基于本发明的公开和教导,本领域技术人员可以理解本发明所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中 的各个单元来说,本文在考虑了逻辑功能的基础上对其进行拆分,而实际实现时也可以有另外的拆分方式。又例如,可以将多个单元或组件结合或者集成到另一个系统,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。
在本发明中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本发明实施例所述方案的目的。另外,在一些场景中,本发明实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。
在一些实现场景中,上述集成的单元可以采用软件程序模块的形式来实现。如果以软件程序模块的形式实现并作为独立的产品销售或使用时,所述集成的单元可以存储在计算机可读取存储器中。基于此,当本发明的方案以软件产品(例如计算机可读存储介质)的形式体现时,该软件产品可以存储在存储器中,其可以包括若干指令用以使得计算机设备(例如个人计算机、服务器或者网络设备等)执行本发明实施例所述方法的部分或全部步骤。前述的存储器可以包括但不限于U盘、闪存盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
在另外一些实现场景中,上述集成的单元也可以采用硬件的形式实现,即为具体的硬件电路,其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件,而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此,本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现,例如中央处理器、GPU、FPGA、DSP和ASIC等。进一步,前述的所述存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等),其例如可以是可变电阻式存储器(Resistive Random Access Memory,RRAM)、动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,EDRAM)、高带宽存储器(High Bandwidth Memory,HBM)、混合存储器立方体(Hybrid Memory Cube,HMC)、ROM和RAM等。
以上对本发明实施例进行了详细介绍,本文中应用了具体个例对本发明的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本发明的方法及其核心思想;同时,对于本领域的一般技术人员,依据本发明的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本发明的限制。

Claims (14)

  1. 一种根据融合策略动态融合神经网络的分支结构的集成电路装置,包括:
    处理装置,用以:
    根据所述分支结构,建立拓扑序列;以及
    以所述拓扑序列的起始层为基准进行融合,排查所述融合策略内的规则,以建立模板融合单元;以及
    计算装置,用以根据所述模板融合单元执行神经网络计算。
  2. 根据权利要求1所述的集成电路装置,其中所述处理装置在建立所述拓扑序列时,用以识别所述分支结构的始点与终点,所述处理装置设定所述始点为所述起始层。
  3. 根据权利要求2所述的集成电路装置,其中所述处理装置在建立所述拓扑序列时,还用以:
    设定所述分支结构的始点为所述拓扑序列的始点;
    设定所述分支结构的终点为所述拓扑序列的终点。
  4. 根据权利要求3所述的集成电路装置,其中所述处理装置在建立所述拓扑序列时,还用以:
    判断所述分支结构是否存在子分支结构,如是,则:
    识别所述子分支结构的始点与终点;
    依特定顺序排列所述子分支结构中的始点、终点及层。
  5. 根据权利要求4所述的集成电路装置,其中所述处理装置在依特定顺序排列所述子分支结构中的始点、终点及层时,还用以:
    比较所述子分支结构中的子分支上的层数;以及
    依层数数量由多至少排列所述子分支的层。
  6. 根据权利要求4所述的集成电路装置,其中所述处理装置在依特定顺序排列所述子分支结构中的始点、终点及层时,还用以:
    比较所述子分支结构中的子分支上的层数;以及
    依层数数量由少至多排列所述子分支的层。
  7. 根据权利要求4所述的集成电路装置,其中所述计算装置包括多个集群,每个集群包括共享存储单元,所述处理装置还用以判断所述模板融合单元是否包括所述子分支结构的终点;如否,所述计算装置将所述模板融合单元中,所述子分支结构中各子分支最末层的中间结果存储在所述共享存储单元中。
  8. 根据权利要求4所述的集成电路装置,其中所述计算装置包括多个集群,每个集群包括多个处理器核,每个处理器核包括神经元存储单元,所述处理装置还用以判断所述模板融合单元是否包括所述子分支结构的终点;如否,所述计算装置将所述模板融合单元中,所述子分支结构中各子分支最末层的中间结果存储在所述神经元存储单元中。
  9. 根据权利要求2所述的集成电路装置,其中所述计算装置包括多个集群,每个集群包括共享存储单元,所述处理装置还用以判断所述模板融合单元是否包括所述分支结构的终点;如否,所述计算装置将所述模板融合单元中,所述分支结构中各分支最末层的中间结果存储在所述共享存储单元中。
  10. 根据权利要求2所述的集成电路装置,其中所述计算装置包括多个集群,每个集群包括多个处理器核,每个处理器核包括神经元存储单元,所述处理装置还用以判断所述模板融合单元是否包括所述分支结构的终点;如否,所述计算装置将所述模板融合单元中,所述分支结构中各分支最末层的中间结果存储在所述神经元存储单元中。
  11. 根据权利要求1所述的集成电路装置,其中所述起始层为除卷积层及池化层之外的层。
  12. 根据权利要求1所述的集成电路装置,其中所述起始层为所述神经网络中最前未被融合的层。
  13. 根据权利要求1所述的集成电路装置,其中所述融合策略为以所述拓扑序列中的层为单位增删所述模板融合单元。
  14. 一种板卡,包括根据权利要求1至13任一项所述的集成电路装置。
PCT/CN2021/141393 2020-12-25 2021-12-25 融合分支结构的装置、板卡、方法及可读存储介质 WO2022135599A1 (zh)





