CN114202067A - Bandwidth optimization method for convolutional neural network accelerator and related equipment - Google Patents

Bandwidth optimization method for convolutional neural network accelerator and related equipment Download PDF

Info

Publication number
CN114202067A
CN114202067A (application CN202111445760.XA)
Authority
CN
China
Prior art keywords
chip memory
convolution
weight parameter
layer
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111445760.XA
Other languages
Chinese (zh)
Inventor
曾成龙
蔡权雄
牛昕宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Industry Research Kunyun Artificial Intelligence Research Institute Co ltd
Original Assignee
Shandong Industry Research Kunyun Artificial Intelligence Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Industry Research Kunyun Artificial Intelligence Research Institute Co ltd filed Critical Shandong Industry Research Kunyun Artificial Intelligence Research Institute Co ltd
Priority to CN202111445760.XA priority Critical patent/CN114202067A/en
Publication of CN114202067A publication Critical patent/CN114202067A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

Before the convolutional neural network operates, a first weight parameter of a convolutional layer to be calculated is transported to a first on-chip memory from a first off-chip memory, and a second weight parameter of the convolutional layer to be calculated is transported to a second on-chip memory from a second off-chip memory; when convolution calculation of a convolution layer to be calculated is executed, a first weight parameter is conveyed to a first calculation engine from a first on-chip memory to execute first convolution operation, and a second weight parameter is conveyed to a second calculation engine from a second on-chip memory to execute second convolution operation; when the first convolution operation and the second convolution operation are executed, the second weight parameter is transported from the second on-chip memory to the first calculation engine to execute the third convolution operation, and the first weight parameter is transported from the first on-chip memory to the second calculation engine to execute the fourth convolution operation; and when the third convolution operation and the fourth convolution operation are executed, performing convolution calculation of the next convolution layer.

Description

Bandwidth optimization method for convolutional neural network accelerator and related equipment
Technical Field
The invention relates to the technical field of convolutional neural networks, and in particular to a bandwidth optimization method and related equipment for a convolutional neural network accelerator.
Background
With the rapid development of deep learning, convolutional neural networks are widely applied in computer vision and other fields. To improve the accuracy of convolutional neural networks, the models have become larger and larger, and the amount of computation they require has grown accordingly. Implementing a convolutional neural network algorithm on a CPU is slow; implementing it on a GPU incurs high power consumption and large latency.
Disclosure of Invention
The embodiment of the invention provides a bandwidth optimization method and related equipment for a convolutional neural network accelerator, which improve the transmission bandwidth of weight parameters, reduce the transmission time of the weight parameters and improve the performance of the convolutional neural network accelerator.
In a first aspect, an embodiment of the present invention provides a bandwidth optimization method for a convolutional neural network accelerator, where the method includes:
before the convolutional neural network operates, a first weight parameter of a convolutional layer to be calculated is transported to a first on-chip memory from a first off-chip memory in advance, and a second weight parameter of the convolutional layer to be calculated is transported to a second on-chip memory from a second off-chip memory in advance, wherein the first weight parameter and the second weight parameter are two weight parameters of the convolutional layer to be calculated in the convolutional neural network;
when convolution calculation of the convolution layer to be calculated is executed, the first weight parameter is carried to a first calculation engine from the first on-chip memory to execute first convolution operation, and the second weight parameter is carried to a second calculation engine from the second on-chip memory to execute second convolution operation;
when the first convolution operation and the second convolution operation are executed, the second weight parameter is transported from the second on-chip memory to the first calculation engine to execute a third convolution operation, and the first weight parameter is transported from the first on-chip memory to the second calculation engine to execute a fourth convolution operation;
and when the third convolution operation and the fourth convolution operation are executed, performing convolution calculation of a next convolution layer.
Further, before the first weight parameter is transported from the first off-chip memory to the first on-chip memory and the second weight parameter is transported from the second off-chip memory to the second on-chip memory, the method further comprises:
acquiring weight parameters of each convolution layer in the convolution neural network;
dividing the weight parameter of each convolution layer in the convolution neural network into a first weight parameter and a second weight parameter;
and storing the first weight parameter into a first off-chip memory, and storing the second weight parameter into a second off-chip memory.
Further, after the first weight parameter of the convolution layer to be calculated is transported from the first off-chip memory to the first on-chip memory, the method further includes:
judging whether the first on-chip memory has enough space to store the first weight data of the next convolution layer;
and if the first on-chip memory has enough space to store the first weight data of the next convolution layer, carrying the first weight data of the next convolution layer from the first off-chip memory to the first on-chip memory in advance.
Further, after the first weight parameter of the convolution layer to be calculated is transported from the first off-chip memory to the first on-chip memory, the method further includes:
and if the first on-chip memory has insufficient space for storing the first weight data of the next convolution layer, waiting for the first on-chip memory to release sufficient storage space.
Further, after the second weight parameter of the convolutional layer to be calculated is transported from the second off-chip memory to the second on-chip memory, the method further includes:
judging whether the second on-chip memory has enough space to store the second weight data of the next convolution layer;
and if so, transporting the second weight data of the next convolution layer from the second off-chip memory to the second on-chip memory in advance.
Further, after the second weight parameter of the convolution layer to be calculated is transported from the second off-chip memory to the second on-chip memory, the method further includes:
and if the second on-chip memory does not have enough space to store the second weight data of the next convolution layer, waiting for the second on-chip memory to release enough storage space.
In a second aspect, an embodiment of the present invention provides a bandwidth optimization apparatus for a convolutional neural network accelerator, where the apparatus includes:
the first carrying module is used for carrying a first weight parameter of a convolutional layer to be calculated from a first off-chip memory to a first on-chip memory and carrying a second weight parameter of the convolutional layer to be calculated from a second off-chip memory to a second on-chip memory in advance before the convolutional neural network operates, wherein the first weight parameter and the second weight parameter are two weight parameters of the convolutional layer to be calculated in the convolutional neural network;
a second carrying module, configured to carry the first weight parameter from the first on-chip memory to a first computation engine to perform a first convolution operation and carry the second weight parameter from the second on-chip memory to a second computation engine to perform a second convolution operation when performing convolution calculation on the convolution layer to be calculated;
a third carrying module, configured to carry the second weight parameter from the second on-chip memory to the first calculation engine to perform a third convolution operation and carry the first weight parameter from the first on-chip memory to the second calculation engine to perform a fourth convolution operation when the first convolution operation and the second convolution operation are performed;
and the calculation module is used for performing convolution calculation of the next convolution layer when the third convolution operation and the fourth convolution operation are executed.
In a third aspect, an embodiment of the present invention provides an AI acceleration chip, including a first off-chip memory, a second off-chip memory, a first on-chip memory, a second on-chip memory, an on-chip network, a first computation engine and a second computation engine, where the first off-chip memory, the second off-chip memory, the first on-chip memory, the second on-chip memory, the first computation engine and the second computation engine are connected by the on-chip network, and the AI acceleration chip is configured to execute the steps in the bandwidth optimization method for a convolutional neural network accelerator according to any one of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including: the device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the bandwidth optimization method for the convolutional neural network accelerator according to any one of the embodiments of the present invention.
In a fifth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the bandwidth optimization method for a convolutional neural network accelerator according to any one of the embodiments of the present invention.
In the embodiment of the invention, before the operation of the convolutional neural network, a first weight parameter of a convolutional layer to be calculated is transported to a first on-chip memory from a first off-chip memory in advance, and a second weight parameter of the convolutional layer to be calculated is transported to a second on-chip memory from a second off-chip memory in advance, wherein the first weight parameter and the second weight parameter are two weight parameters of the convolutional layer to be calculated in the convolutional neural network; when convolution calculation of the convolution layer to be calculated is executed, the first weight parameter is carried to a first calculation engine from the first on-chip memory to execute first convolution operation, and the second weight parameter is carried to a second calculation engine from the second on-chip memory to execute second convolution operation; when the first convolution operation and the second convolution operation are executed, the second weight parameter is transported from the second on-chip memory to the first calculation engine to execute a third convolution operation, and the first weight parameter is transported from the first on-chip memory to the second calculation engine to execute a fourth convolution operation; and when the third convolution operation and the fourth convolution operation are executed, performing convolution calculation of a next convolution layer. Embodiments of the present invention may accelerate one or more convolutional neural networks simultaneously by multiple compute engines.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a bandwidth optimization method for a convolutional neural network accelerator according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a bandwidth optimization method for a convolutional neural network accelerator according to an embodiment of the present invention;
FIG. 3 is a flow chart of another bandwidth optimization method for a convolutional neural network accelerator provided in the present application;
fig. 4 is a schematic structural diagram of a bandwidth optimizing device for a convolutional neural network accelerator according to an embodiment of the present invention.
Wherein:
DDR: an off-chip memory;
scratchpad memory: an on-chip memory;
noc: a network on chip;
AI engine: a convolution calculation engine.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be illustrative of the present invention and should not be construed as limiting the present invention, and all other embodiments that can be obtained by one skilled in the art based on the embodiments of the present invention without inventive efforts shall fall within the scope of protection of the present invention.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "circumferential," "radial," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the present invention and to simplify the description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and are therefore not to be considered limiting of the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; may be connected directly or indirectly through an intermediate medium, or the two elements may communicate internally. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, "above" or "below" a first feature means that the first and second features are in direct contact, or that the first and second features are not in direct contact but are in contact with each other via another feature therebetween. Also, the first feature being "on," "above" and "over" the second feature includes the first feature being directly on and obliquely above the second feature, or merely indicating that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature includes the first feature being directly under and obliquely below the second feature, or simply meaning that the first feature is at a lesser elevation than the second feature.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a bandwidth optimization method for a convolutional neural network accelerator according to the present invention. As shown in fig. 1, the accelerator includes two DDRs, two scratchpad memories, a noc (network on chip) and two calculation engines. The DDR is an off-chip memory with large capacity and low speed (small bandwidth); the scratchpad memory is an on-chip memory with small capacity and high speed (large bandwidth); the AI engine is a convolution calculation engine used to accelerate convolutional neural network algorithms. DDR0 is the first off-chip memory, DDR1 is the second off-chip memory, scratchpad memory0 is the first on-chip memory, scratchpad memory1 is the second on-chip memory, engine0 is the first engine, and engine1 is the second engine. The noc is a network on chip connecting the DDRs, the scratchpad memories and the AI engines so that data can be transmitted among the three. The numbers of DDRs, scratchpad memories and AI engines in the accelerator may be extended.
In the embodiment of the invention, the DDR is a storage space that can be directly addressed by the CPU and applies an advanced synchronous circuit, so that the main steps of transferring and outputting the designated address and data are executed independently while remaining fully synchronized with the CPU. The scratchpad memory consists of an SRAM storage component, an address decoding component and a data output circuit, and is connected to the processor through an on-chip high-speed bus. The engine is a core component upon which a program or system on an electronic platform is developed; with the engine, the functions required by the program can be quickly established, or the operation of auxiliary programs can be utilized. The noc (Network-on-Chip) is an electronic system based on network communication implemented on a single chip, in the form of an integrated circuit chip.
In the embodiment of the invention, by adding the scratchpad memories, the convolutional neural network accelerator can reduce the transmission of half of the weight parameters, thereby reducing the bandwidth requirement and improving the performance of the convolutional neural network accelerator.
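The following is a minimal, illustrative sketch of the fig. 1 topology, written in Python to make the roles of the components concrete. It is not part of the patent: the patent only fixes the roles (two DDRs, two scratchpad memories, a noc, two AI engines), so every class name, field name and capacity/bandwidth figure below is an assumption.

```python
from dataclasses import dataclass

# Illustrative sketch of the Fig. 1 topology; all names and numbers are assumptions.

@dataclass
class Memory:
    name: str
    capacity_bytes: int     # DDR: large capacity, small bandwidth
    bandwidth_gbps: float   # scratchpad: small capacity, large bandwidth
    used_bytes: int = 0

    def free_bytes(self) -> int:
        return self.capacity_bytes - self.used_bytes

@dataclass
class Engine:
    name: str               # AI engine: convolution calculation engine

# The noc is implicit here: it simply allows any memory to exchange data with
# any engine, and the number of each kind of component may be extended.
ddr = [Memory("DDR0", 4 << 30, 25.6), Memory("DDR1", 4 << 30, 25.6)]
scratchpad = [Memory("scpd0", 2 << 20, 200.0), Memory("scpd1", 2 << 20, 200.0)]
engines = [Engine("engine0"), Engine("engine1")]
```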
Referring to fig. 2, fig. 2 is a schematic diagram illustrating a bandwidth optimization method for a convolutional neural network accelerator according to the present application, where the method includes:
201. before the convolutional neural network operates, the first weight parameters of the convolutional layer to be calculated are carried to the first on-chip memory from the first off-chip memory in advance, and the second weight parameters of the convolutional layer to be calculated are carried to the second on-chip memory from the second off-chip memory in advance.
In an embodiment of the present invention, the first weight parameter and the second weight parameter are two weight parameters of a convolutional layer to be calculated in the convolutional neural network. The first weight parameter is a first half weight parameter, and the second weight parameter is a second half weight parameter.
In an embodiment of the present invention, the Convolutional Neural Network (CNN) includes a convolutional layer (Convolutional layer) and a pooling layer (Pooling layer), and the convolutional neural network includes at least two convolutional layers. Convolutional neural networks are an efficient recognition method that has been developed in recent years and has attracted extensive attention.
In the embodiment of the invention, the convolutional layer is composed of a plurality of convolution units, the parameters of each convolution unit are obtained through a back-propagation algorithm, and the purpose of the convolution operation is to extract different features of the input.
In the embodiment of the invention, the off-chip memory is a storage space that can be directly addressed by the CPU and utilizes an advanced synchronous circuit, so that the main steps of transferring and outputting the designated address and data are executed independently while remaining fully synchronized with the CPU.
In the embodiment of the invention, the CPU is the Central Processing Unit (CPU/Processor), one of the main devices of an electronic computer and a core component of the computer. Its functions are mainly to interpret computer instructions and to process data in computer software. The CPU is the core component responsible for reading, decoding and executing instructions in all operations of the computer.
In the embodiment of the present invention, the on-chip memory (scratchpad memory) is composed of three components, namely an SRAM storage component, an address decoding component and a data output circuit, is connected to the processor through an on-chip high-speed bus, and occupies an address space that is unified with, but does not overlap, that of main memory.
Further, before the first weight parameter is transferred from the first off-chip memory to the first on-chip memory, and the second weight parameter is transferred from the second off-chip memory to the second on-chip memory, the method further includes: acquiring weight parameters of each convolution layer in the convolution neural network; dividing the weight parameter of each convolution layer in the convolution neural network into a first weight parameter and a second weight parameter; and storing the first weight parameter into a first off-chip memory, and storing the second weight parameter into a second off-chip memory.
In the embodiment of the present invention, before the convolutional neural network operates, the weight parameter may be obtained by calculation in advance, the weight parameter of each layer needs to be divided into two parts, the first weight parameter is a weight parameter for calculating a first half part of the convolutional layer, and the second weight parameter is a weight parameter for calculating a second half part of the convolutional layer. The invention stores the first weight parameter in a first off-chip memory in advance, and stores the second weight parameter in a second off-chip memory.
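As an illustration of this preparation step, the sketch below splits each layer's weights in two and writes the halves to two dictionaries standing in for DDR0 and DDR1. Splitting along the output-channel axis is an assumption for illustration; the patent only requires that each layer's weights be divided into two parts.

```python
import numpy as np

def split_and_store(layer_weights, ddr0, ddr1):
    """Split each convolution layer's weights into two halves offline.

    layer_weights: one weight array per convolution layer.
    ddr0 / ddr1: dictionaries standing in for the two off-chip memories.
    """
    for i, w in enumerate(layer_weights):
        half = w.shape[0] // 2            # assumed split over output channels
        ddr0[f"L{i}_coef0"] = w[:half]    # first weight parameter of layer i
        ddr1[f"L{i}_coef1"] = w[half:]    # second weight parameter of layer i

ddr0, ddr1 = {}, {}
weights = [np.random.randn(64, 3, 3, 3), np.random.randn(128, 64, 3, 3)]
split_and_store(weights, ddr0, ddr1)      # Lx_coef0 -> DDR0, Lx_coef1 -> DDR1
```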
Further, after the first weight parameter of the convolution layer to be calculated is transferred from the first off-chip memory to the first on-chip memory, the method further includes: judging whether the first on-chip memory has enough space to store the first weight data of the next convolution layer; and if the first on-chip memory has enough space to store the first weight data of the next convolution layer, carrying the first weight data of the next convolution layer from the first off-chip memory to the first on-chip memory in advance.
In the embodiment of the present invention, after the first weight parameter of the convolution layer to be calculated is transferred from the first off-chip memory to the first on-chip memory, it is determined whether the first on-chip memory has enough space to store the first weight data of the next convolution layer, and if the first on-chip memory has enough space to store the first weight data of the next convolution layer, the first weight data of the next convolution layer is transferred from the first off-chip memory to the first on-chip memory in advance.
Further, after the first weight parameter of the convolution layer to be calculated is transferred from the first off-chip memory to the first on-chip memory, the method further includes: and if the first on-chip memory has insufficient space for storing the first weight data of the next convolution layer, waiting for the first on-chip memory to release sufficient storage space.
In the embodiment of the present invention, after the first weight parameter of the convolution layer to be calculated is transferred from the first off-chip memory to the first on-chip memory, if the first on-chip memory has insufficient space to store the first weight data of the next convolution layer, the first on-chip memory is waited to release sufficient storage space.
Further, after the second weight parameter of the convolutional layer to be calculated is transferred from the second off-chip memory to the second on-chip memory, the method further includes: judging whether the second on-chip memory has enough space to store the second weight data of the next convolution layer; and if so, transferring the second weight data of the next convolution layer from the second off-chip memory to the second on-chip memory in advance.
In the embodiment of the invention, it is judged whether the second on-chip memory has enough space to store the second weight data of the next convolution layer; if so, the second weight data of the next convolution layer is transferred from the second off-chip memory to the second on-chip memory in advance.
Further, after the second weight parameter of the convolution layer to be calculated is transferred from the second off-chip memory to the second on-chip memory, the method further includes: if the second on-chip memory does not have enough space to store the second weight data of the next convolution layer, waiting for the second on-chip memory to release enough storage space.
In the embodiment of the present invention, after the second weight parameter of the convolution layer to be calculated is transferred from the second off-chip memory to the second on-chip memory, if the second on-chip memory does not have enough space to store the second weight data of the next convolution layer, the system waits for the second on-chip memory to release enough storage space.
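The space check described above is symmetric for the two on-chip memories, so it can be captured in one hypothetical helper, sketched below. Scratchpad and dma_copy are placeholders for illustration, not an interface defined by the patent.

```python
import time
from dataclasses import dataclass

@dataclass
class Scratchpad:
    capacity: int
    used: int = 0

    def has_room_for(self, nbytes: int) -> bool:
        return self.capacity - self.used >= nbytes

def prefetch_next_layer(scpd: Scratchpad, nbytes: int, dma_copy) -> None:
    # Judge whether the on-chip memory has enough space for the next layer's
    # half-weights; if not, wait for storage space to be released.
    while not scpd.has_room_for(nbytes):
        time.sleep(0)                # yield until space is freed
    dma_copy()                       # carry the next layer's half-weights from DDR
    scpd.used += nbytes
```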
202. When convolution calculation of the convolution layer to be calculated is executed, the first weight parameter is carried to the first calculation engine from the first on-chip memory to execute the first convolution operation, and the second weight parameter is carried to the second calculation engine from the second on-chip memory to execute the second convolution operation.
In the embodiment of the invention, the convolutional layer consists of a plurality of convolution units, and the parameters of each convolution unit are obtained by a back-propagation algorithm. The convolution operation aims to extract different input features: the first convolution layer may only extract low-level features such as edges, lines and corners, while networks with more layers can iteratively extract more complex features from these low-level features.
In the embodiment of the invention, the back-propagation algorithm is a learning algorithm suitable for multilayer neural networks and is based on the gradient descent method.
In the embodiment of the present invention, when performing the calculation of the convolution layer to be calculated, the first weight parameter is transferred from the first on-chip memory to the first calculation engine to perform the first convolution operation, and it is first determined whether the first convolution operation is completed; if the first convolution operation is finished, continuing to perform the next step; if the first convolution operation is not completed, the first weight parameter is transferred from the first on-chip memory to the first calculation engine again to execute the first convolution operation until the operation is completed and the convolution operation of the next convolution layer is entered.
In the embodiment of the present invention, when performing the convolution calculation of the convolution layer to be calculated, the second weight parameter is transferred from the second on-chip memory to the second calculation engine to perform the second convolution operation, and it is first determined whether the second convolution operation is completed; if the second convolution operation is finished, the next step continues; if the second convolution operation is not completed, the second weight parameter is transferred from the second on-chip memory to the second calculation engine again to execute the second convolution operation until the operation is completed and the convolution operation of the next convolution layer is entered.
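A compact way to picture step 202 is one engine streaming its half of the weights from its scratchpad and polling until that convolution finishes, as in the sketch below. MockEngine is purely a software stand-in for the hardware engine, and its method names are assumptions.

```python
class MockEngine:
    """Software stand-in for an AI engine; the real engine is hardware."""
    def __init__(self, name: str):
        self.name = name
        self._busy = False

    def load_weights(self, weights) -> None:
        self.weights = weights           # weights arrive over the noc

    def run_convolution(self, feature_map) -> None:
        self._busy = False               # the mock finishes instantly

    def is_done(self) -> bool:
        return not self._busy

def run_half_convolution(engine, scratchpad, key, feature_map) -> None:
    engine.load_weights(scratchpad[key])     # carry half-weights to the engine
    engine.run_convolution(feature_map)      # first/second convolution operation
    while not engine.is_done():              # poll until this half completes
        pass
```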
In the embodiment of the invention, for the same convolution layer, while the first calculation engine reads the first half of the weight parameters for calculation, the second calculation engine can read the second half of the weight parameters for calculation, so each calculation engine saves the transmission of half of the weight parameters. For example, suppose reading a complete set of weight parameters takes 1 second. If both engines read the complete weight parameters from the same off-chip memory, the second calculation engine must wait for the first engine to finish reading, so the two engines need 2 seconds to obtain the same complete weight parameters. If the complete weight parameters are stored in both off-chip memories, the first engine can read from the first off-chip memory while the second engine reads from the second off-chip memory, and the two engines can obtain the same complete weight parameters in only 1 second. In the present invention, the complete weights are divided into two weight parameters, each taking 0.5 second to read: the first engine takes one weight parameter from the first off-chip memory while the second engine takes the other weight parameter from the second off-chip memory.
In the embodiment of the invention, the weight parameters of the convolutional neural network of the next layer in the off-chip memory are transmitted into the on-chip memory in advance while the convolutional neural network of the current layer is calculated. And when the next layer of operation is performed, the weight parameter is directly obtained from the on-chip memory, and the bandwidth of the on-chip memory is greater than that of the off-chip memory, so that the transmission bandwidth of the weight parameter is improved, the transmission time of the weight parameter is reduced, and the performance of the convolutional neural network accelerator is improved.
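The timing comparison in the example above can be summarized with a short calculation. The 1-second figure is the patent's illustrative number, and treating on-chip reads as negligible is a simplification used here only to keep the comparison readable.

```python
t_full = 1.0                        # illustrative: read one layer's full weights from a DDR

shared_ddr     = 2 * t_full         # both engines take turns on one DDR       -> 2.0 s
duplicated_ddr = t_full             # full copy in each DDR, read in parallel  -> 1.0 s
split_halves   = t_full / 2         # each engine fetches only half, in parallel -> 0.5 s

print(shared_ddr, duplicated_ddr, split_halves)   # 2.0 1.0 0.5
```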
203. And when the first convolution operation and the second convolution operation are executed, the second weight parameter is transferred from the second on-chip memory to the first calculation engine to execute a third convolution operation, and the first weight parameter is transferred from the first on-chip memory to the second calculation engine to execute a fourth convolution operation.
In an embodiment of the present invention, when the first convolution operation and the second convolution operation are completed, the second weight parameter is transferred from the second on-chip memory to the first calculation engine to execute a third convolution operation; firstly, judging whether the third convolution operation is finished or not; if the third convolution operation is finished, continuing the next step; if the third convolution operation is not completed, the second weight parameter is transferred from the second on-chip memory to the first calculation engine again to execute the third convolution operation until the operation is completed and the convolution operation of the next convolution layer is entered.
In an embodiment of the present invention, when the first convolution operation and the second convolution operation are completed, the first weight parameter is transferred from the first on-chip memory to the second calculation engine to execute a fourth convolution operation; firstly, judging whether the fourth convolution operation is finished or not; if the fourth convolution operation is finished, continuing the next step; if the fourth convolution operation is not completed, the first weight parameter is transferred from the first on-chip memory to the second calculation engine again to execute the fourth convolution operation until the operation is completed and the convolution operation of the next convolution layer is entered.
In the embodiment of the present invention, for the same convolution layer, while the first calculation engine reads the first half of the weight parameters for calculation, the second calculation engine can read the second half of the weight parameters for calculation, so each calculation engine saves the transmission of half of the weight parameters. For example, suppose reading a complete set of weight parameters takes 1 second. If both engines read the complete weight parameters from the same off-chip memory, the second calculation engine must wait for the first engine to finish reading, so the two engines need 2 seconds to obtain the same complete weight parameters. If the complete weight parameters are stored in both off-chip memories, the first engine can read from the first off-chip memory while the second engine reads from the second off-chip memory, and the two engines can obtain the same complete weight parameters in only 1 second. In the present invention, the complete weights are divided into two weight parameters, each taking 0.5 second to read: the first engine takes one weight parameter from the first off-chip memory while the second engine takes the other weight parameter from the second off-chip memory.
In the embodiment of the invention, the weight parameters of the convolutional neural network of the next layer in the off-chip memory are transmitted into the on-chip memory in advance while the convolutional neural network of the current layer is calculated. And when the next layer of operation is performed, the weight parameter is directly obtained from the on-chip memory, and the bandwidth of the on-chip memory is greater than that of the off-chip memory, so that the transmission bandwidth of the weight parameter is improved, the transmission time of the weight parameter is reduced, and the performance of the convolutional neural network accelerator is improved.
204. And when the third convolution operation and the fourth convolution operation are executed, performing convolution calculation of the next convolution layer.
In an embodiment of the present invention, when the third convolution operation and the fourth convolution operation are completed, the calculation of the next convolution layer is performed.
In an embodiment of the present invention, the third convolution operation is a convolution neural network operation executed by transferring the second weight parameter from the second on-chip memory to the first calculation engine; and entering the convolution neural network operation of the next convolution layer after the third convolution operation is completed. The fourth convolution operation is a convolution neural network operation for carrying the first weight parameter from the first on-chip memory to the second calculation engine to execute; and after the fourth convolution operation is completed, the convolution neural network operation of the next convolution layer is carried out.
In the embodiment of the invention, the weight parameters of the convolutional neural network of the next layer in the off-chip memory are transmitted into the on-chip memory in advance while the convolutional neural network of the current layer is calculated. And when the next layer of operation is performed, the weight parameter is directly obtained from the on-chip memory, and the bandwidth of the on-chip memory is greater than that of the off-chip memory, so that the transmission bandwidth of the weight parameter is improved, the transmission time of the weight parameter is reduced, and the performance of the convolutional neural network accelerator is improved.
In the embodiment of the invention, before the operation of the convolutional neural network, a first weight parameter of a convolutional layer to be calculated is transported to a first on-chip memory from a first off-chip memory in advance, and a second weight parameter of the convolutional layer to be calculated is transported to a second on-chip memory from a second off-chip memory in advance, wherein the first weight parameter and the second weight parameter are two weight parameters of the convolutional layer to be calculated in the convolutional neural network; when convolution calculation of the convolution layer to be calculated is executed, the first weight parameter is carried to a first calculation engine from the first on-chip memory to execute first convolution operation, and the second weight parameter is carried to a second calculation engine from the second on-chip memory to execute second convolution operation; when the first convolution operation and the second convolution operation are executed, the second weight parameter is transported from the second on-chip memory to the first calculation engine to execute a third convolution operation, and the first weight parameter is transported from the first on-chip memory to the second calculation engine to execute a fourth convolution operation; and when the third convolution operation and the fourth convolution operation are executed, performing convolution calculation of a next convolution layer. According to the embodiment of the invention, one or more convolutional neural networks can be accelerated by a plurality of computing engines at the same time.
Referring to fig. 3, fig. 3 is a flowchart of another bandwidth optimization method for a convolutional neural network accelerator according to the present application, where the method includes:
step S0, start;
step S1: before the convolutional neural network operates, the weight parameters can be calculated in advance, and the weight parameters of each layer need to be divided into two parts, wherein one part is stored in DDR0, and the other part is stored in DDR 1. The accelerated neural network model here has only 2 convolutions, so L0_ coef0 and L1_ coef0 are stored in advance in DDR 0.
In an embodiment of the invention, DDR0 is the first off-chip memory and DDR1 is the second off-chip memory; L0_coef0 represents the first weight parameter of layer0, and L0_coef1 represents the second weight parameter of layer0. Here, layer0 is the first convolution layer and layer1 is the second convolution layer.
Step S2: L0_coef1 and L1_coef1 are stored in advance in DDR1.
In the embodiment of the present invention, L0_coef1 is the second weight parameter of the first convolution layer, and L1_coef1 is the second weight parameter of the second convolution layer.
Step S3: carry L0_coef0 from DDR0 to scratchpad memory0 (scpd0);
In the embodiment of the present invention, scratchpad memory0 (scpd0) is the first on-chip memory.
Step S4: carry L0_coef1 from DDR1 to scratchpad memory1 (scpd1);
In the embodiment of the present invention, scratchpad memory1 (scpd1) is the second on-chip memory.
Step S5: a determination is made as to whether L0_coef0 has all been carried to scpd0; if yes, steps S7 and S10 are carried out; otherwise, step S3 is carried out and the carrying of L0_coef0 continues;
Step S6: a determination is made as to whether L0_coef1 has all been carried to scpd1; if yes, steps S8 and S11 are carried out; otherwise, step S4 is carried out and the carrying of L0_coef1 continues;
Step S7: it is determined whether scpd0 has enough space to store L1_coef0; if yes, go to step S9; otherwise, wait for scpd0 to release enough storage space;
Step S8: it is determined whether scpd1 has enough space to store L1_coef1; if yes, go to step S12; otherwise, wait for scpd1 to release enough storage space;
Step S9: carry L1_coef0 from DDR0 to scpd0;
Step S10: engine0 takes L0_coef0 from scpd0 and performs the convolution operation.
In an embodiment of the present invention, engine0 is the first engine.
Step S11: engine1 takes L0_coef1 from scpd1 and performs the convolution operation;
in an embodiment of the present invention, engine1 is the second engine.
Step S12: carry L1_coef1 from DDR1 to scpd1;
Step S13: determine whether the carrying of L1_coef0 is finished; if so, check whether step S19 is finished; if step S19 is also finished, perform step S21; otherwise, continue to wait for the carrying to finish;
Step S14: determine whether engine0 has completed the L0_coef0 calculation and engine1 has completed the L0_coef1 calculation; if yes, go to step S17; otherwise, continue to wait for the calculations to finish;
Step S15: determine whether engine0 has completed the L0_coef0 calculation and engine1 has completed the L0_coef1 calculation; if yes, go to step S18; otherwise, continue to wait for the calculations to finish;
Step S16: determine whether the carrying of L1_coef1 is finished; if so, check whether step S20 is finished; if step S20 is also finished, perform step S22; otherwise, continue to wait for the carrying to finish;
Step S17: engine0 takes L0_coef1 from scpd1 and performs the convolution operation;
Step S18: engine1 takes L0_coef0 from scpd0 and performs the convolution operation;
Step S19: determine whether engine0 has completed the L0_coef1 calculation and engine1 has completed the L0_coef0 calculation; if so, check whether S13 is satisfied; if S13 is also satisfied, proceed to step S21;
Step S20: determine whether engine0 has completed the L0_coef1 calculation and engine1 has completed the L0_coef0 calculation; if so, check whether S16 is satisfied; if S16 is also satisfied, proceed to step S22;
Step S21: engine0 begins the second-layer convolutional neural network operation;
Step S22: engine1 begins the second-layer convolutional neural network operation.
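To tie steps S0 to S22 together, the sketch below condenses the flow for the two-layer model into runnable Python. Dictionaries stand in for the DDR and scratchpad memories, compute() stands in for an engine running one half-convolution, and the prefetch of the layer-1 weights is done up front here for simplicity, whereas in the accelerator it overlaps with the layer-0 computation.

```python
from threading import Thread

DDR0 = {"L0_coef0": "w00", "L1_coef0": "w10"}    # S1: first halves in DDR0
DDR1 = {"L0_coef1": "w01", "L1_coef1": "w11"}    # S2: second halves in DDR1
scpd0, scpd1 = {}, {}

def compute(engine, weights):                     # placeholder for a convolution
    print(f"{engine} runs a convolution with {weights}")

def engine_path(engine, own_scpd, other_scpd, own_key, other_key):
    compute(engine, own_scpd[own_key])            # S10/S11: own half of layer 0
    compute(engine, other_scpd[other_key])        # S17/S18: swapped half of layer 0

# S3/S4: carry the layer-0 halves on chip; S7/S9 and S8/S12: prefetch layer 1.
scpd0["L0_coef0"] = DDR0["L0_coef0"]
scpd1["L0_coef1"] = DDR1["L0_coef1"]
scpd0["L1_coef0"] = DDR0["L1_coef0"]
scpd1["L1_coef1"] = DDR1["L1_coef1"]

t0 = Thread(target=engine_path, args=("engine0", scpd0, scpd1, "L0_coef0", "L0_coef1"))
t1 = Thread(target=engine_path, args=("engine1", scpd1, scpd0, "L0_coef1", "L0_coef0"))
t0.start(); t1.start()
t0.join(); t1.join()                              # S13-S20: both halves of layer 0 done

compute("engine0", scpd0["L1_coef0"])             # S21: engine0 starts layer 1
compute("engine1", scpd1["L1_coef1"])             # S22: engine1 starts layer 1
```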
In the embodiment of the invention, the weight parameters of the convolutional neural network of the next layer in the off-chip memory are transmitted into the on-chip memory in advance while the convolutional neural network of the current layer is calculated. And when the next layer of operation is performed, the weight parameter is directly obtained from the on-chip memory, and the bandwidth of the on-chip memory is greater than that of the off-chip memory, so that the transmission bandwidth of the weight parameter is improved, the transmission time of the weight parameter is reduced, and the performance of the convolutional neural network accelerator is improved.
In the embodiment of the invention, before the operation of the convolutional neural network, a first weight parameter of a convolutional layer to be calculated is transported to a first on-chip memory from a first off-chip memory in advance, and a second weight parameter of the convolutional layer to be calculated is transported to a second on-chip memory from a second off-chip memory in advance, wherein the first weight parameter and the second weight parameter are two weight parameters of the convolutional layer to be calculated in the convolutional neural network; when convolution calculation of the convolution layer to be calculated is executed, the first weight parameter is carried to a first calculation engine from the first on-chip memory to execute first convolution operation, and the second weight parameter is carried to a second calculation engine from the second on-chip memory to execute second convolution operation; when the first convolution operation and the second convolution operation are executed, the second weight parameter is transported from the second on-chip memory to the first calculation engine to execute a third convolution operation, and the first weight parameter is transported from the first on-chip memory to the second calculation engine to execute a fourth convolution operation; and when the third convolution operation and the fourth convolution operation are executed, performing convolution calculation of a next convolution layer. According to the embodiment of the invention, one or more convolutional neural networks can be accelerated by a plurality of computing engines at the same time.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating a bandwidth optimization apparatus for a convolutional neural network accelerator according to the present application, where the apparatus includes:
a first carrying module 401, configured to carry a first weight parameter of a convolutional layer to be calculated from a first off-chip memory to a first on-chip memory in advance before a convolutional neural network operates, and carry a second weight parameter of the convolutional layer to be calculated from a second off-chip memory to a second on-chip memory, where the first weight parameter and the second weight parameter are two weight parameters of the convolutional layer to be calculated in the convolutional neural network;
a second carrying module 402, configured to carry the first weight parameter from the first on-chip memory to a first computation engine to perform a first convolution operation, and carry the second weight parameter from the second on-chip memory to a second computation engine to perform a second convolution operation when performing convolution calculation on the convolutional layer to be calculated;
a third carrying module 403, configured to carry the second weight parameter from the second on-chip memory to the first calculation engine to perform a third convolution operation and carry the first weight parameter from the first on-chip memory to the second calculation engine to perform a fourth convolution operation when the first convolution operation and the second convolution operation are performed;
a calculating module 404, configured to perform convolution calculation on a next convolution layer when the third convolution operation and the fourth convolution operation are completed.
Further, before the first weight parameter is transported from the first off-chip memory to the first on-chip memory and the second weight parameter is transported from the second off-chip memory to the second on-chip memory, the apparatus further includes:
the acquiring module is used for acquiring weight parameters of each convolution layer in the convolution neural network;
the dividing module is used for dividing the weight parameters of each convolution layer in the convolution neural network into a first weight parameter and a second weight parameter;
and the storage module is used for storing the first weight parameter into a first off-chip memory and storing the second weight parameter into a second off-chip memory.
Further, after the first weight parameter of the convolution layer to be calculated is transported from the first off-chip memory to the first on-chip memory, the apparatus further includes:
the first judging module is used for judging whether the first on-chip memory has enough space to store the first weight data of the next convolution layer;
and a fourth carrying module, configured to, if the first on-chip memory has enough space to store the first weight data of the next convolution layer, carry the first weight data of the next convolution layer from the first off-chip memory to the first on-chip memory in advance.
Further, after the first weight parameter of the convolution layer to be calculated is transported from the first off-chip memory to the first on-chip memory, the apparatus further includes:
a first releasing module, configured to wait for the first on-chip memory to release sufficient storage space if the first on-chip memory has insufficient space to store the first weight data of the next convolution layer.
Further, after the second weight parameter of the convolutional layer to be calculated is transported from the second off-chip memory to the second on-chip memory, the apparatus further includes:
the second judging module is used for judging whether the second on-chip memory has enough space to store the second weight data of the next convolution layer;
and a fifth carrying module, configured to carry the second weight data of the next convolution layer from the second off-chip memory to the second on-chip memory in advance if the second on-chip memory has enough space to store the second weight data of the next convolution layer.
Further, after the second weight parameter of the convolution layer to be calculated is transported from the second off-chip memory to the second on-chip memory, the apparatus further includes:
and the second releasing module is used for waiting for the second on-chip memory to release enough storage space if the second on-chip memory does not have enough space to store the second weight data of the next convolution layer.
In the embodiment of the invention, before the operation of the convolutional neural network, a first weight parameter of a convolutional layer to be calculated is transported to a first on-chip memory from a first off-chip memory in advance, and a second weight parameter of the convolutional layer to be calculated is transported to a second on-chip memory from a second off-chip memory in advance, wherein the first weight parameter and the second weight parameter are two weight parameters of the convolutional layer to be calculated in the convolutional neural network; when convolution calculation of the convolution layer to be calculated is executed, the first weight parameter is carried to a first calculation engine from the first on-chip memory to execute first convolution operation, and the second weight parameter is carried to a second calculation engine from the second on-chip memory to execute second convolution operation; when the first convolution operation and the second convolution operation are executed, the second weight parameter is transported from the second on-chip memory to the first calculation engine to execute a third convolution operation, and the first weight parameter is transported from the first on-chip memory to the second calculation engine to execute a fourth convolution operation; and when the third convolution operation and the fourth convolution operation are executed, performing convolution calculation of a next convolution layer. According to the embodiment of the invention, one or more convolutional neural networks can be accelerated by a plurality of computing engines at the same time.
The embodiment of the present invention further provides an AI acceleration chip, including a first off-chip memory, a second off-chip memory, a first on-chip memory, a second on-chip memory, an on-chip network, a first calculation engine and a second calculation engine, where the first off-chip memory, the second off-chip memory, the first on-chip memory, the second on-chip memory, the first calculation engine and the second calculation engine are connected by the on-chip network, and the AI acceleration chip is configured to execute the steps in the bandwidth optimization method for the convolutional neural network accelerator according to any one of the embodiments of the present invention.
Embodiments of the present invention also provide a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program enables a computer to execute part or all of the steps of any of the bandwidth optimization methods for a convolutional neural network accelerator as set forth in the above method embodiments.
Embodiments of the present invention also provide an electronic device, which includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform part or all of the steps of any one of the bandwidth optimization methods for a convolutional neural network accelerator as described in the above method embodiments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules illustrated are not necessarily required to practice the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as a stand-alone product, it may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing associated hardware. The program may be stored in a computer-readable memory, which may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
The embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of these embodiments is only intended to help in understanding the method of the present invention and its core idea. Meanwhile, a person skilled in the art may, based on the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A bandwidth optimization method for a convolutional neural network accelerator is characterized by comprising the following steps:
before the convolutional neural network operates, a first weight parameter of a convolutional layer to be calculated is transported to a first on-chip memory from a first off-chip memory in advance, and a second weight parameter of the convolutional layer to be calculated is transported to a second on-chip memory from a second off-chip memory in advance, wherein the first weight parameter and the second weight parameter are two weight parameters of the convolutional layer to be calculated in the convolutional neural network;
when convolution calculation of the convolution layer to be calculated is executed, the first weight parameter is carried to a first calculation engine from the first on-chip memory to execute first convolution operation, and the second weight parameter is carried to a second calculation engine from the second on-chip memory to execute second convolution operation;
when the first convolution operation and the second convolution operation are executed, the second weight parameter is transported from the second on-chip memory to the first calculation engine to execute a third convolution operation, and the first weight parameter is transported from the first on-chip memory to the second calculation engine to execute a fourth convolution operation;
and when the third convolution operation and the fourth convolution operation are executed, performing convolution calculation of a next convolution layer.
2. The bandwidth optimization method for a convolutional neural network accelerator according to claim 1, wherein before the first weight parameter is carried from the first off-chip memory to the first on-chip memory and the second weight parameter is carried from the second off-chip memory to the second on-chip memory, the method further comprises:
acquiring the weight parameters of each convolutional layer in the convolutional neural network;
dividing the weight parameters of each convolutional layer in the convolutional neural network into a first weight parameter and a second weight parameter;
and storing the first weight parameter into the first off-chip memory, and storing the second weight parameter into the second off-chip memory.
3. The bandwidth optimization method for a convolutional neural network accelerator according to claim 1, wherein after the first weight parameter of the convolutional layer to be calculated is transported from the first off-chip memory to the first on-chip memory, the method further comprises:
judging whether the first on-chip memory has enough space to store the first weight data of the next convolutional layer;
and if the first on-chip memory has enough space to store the first weight data of the next convolutional layer, carrying the first weight data of the next convolutional layer from the first off-chip memory to the first on-chip memory in advance.
4. The bandwidth optimization method for a convolutional neural network accelerator according to claim 3, wherein after the first weight parameter of the convolutional layer to be calculated is transported from the first off-chip memory to the first on-chip memory, the method further comprises:
and if the first on-chip memory does not have enough space to store the first weight data of the next convolutional layer, waiting for the first on-chip memory to release enough storage space.
5. The bandwidth optimization method for a convolutional neural network accelerator according to claim 1, wherein after the second weight parameter of the convolutional layer to be calculated is carried from the second off-chip memory to the second on-chip memory, the method further comprises:
judging whether the second on-chip memory has enough space to store the second weight data of the next convolutional layer;
and if the second on-chip memory has enough space to store the second weight data of the next convolutional layer, carrying the second weight data of the next convolutional layer from the second off-chip memory to the second on-chip memory in advance.
6. The bandwidth optimization method for a convolutional neural network accelerator according to claim 5, wherein after the second weight parameter of the convolutional layer to be calculated is transported from the second off-chip memory to the second on-chip memory, the method further comprises:
and if the second on-chip memory does not have enough space to store the second weight data of the next convolutional layer, waiting for the second on-chip memory to release enough storage space.
7. An apparatus for bandwidth optimization for a convolutional neural network accelerator, the apparatus comprising:
a first carrying module, configured to carry, in advance before the convolutional neural network operates, a first weight parameter of a convolutional layer to be calculated from a first off-chip memory to a first on-chip memory and to carry a second weight parameter of the convolutional layer to be calculated from a second off-chip memory to a second on-chip memory, wherein the first weight parameter and the second weight parameter are two weight parameters of the convolutional layer to be calculated in the convolutional neural network;
a second carrying module, configured to carry the first weight parameter from the first on-chip memory to a first calculation engine to perform a first convolution operation and to carry the second weight parameter from the second on-chip memory to a second calculation engine to perform a second convolution operation when the convolution calculation of the convolutional layer to be calculated is performed;
a third carrying module, configured to carry the second weight parameter from the second on-chip memory to the first calculation engine to perform a third convolution operation and to carry the first weight parameter from the first on-chip memory to the second calculation engine to perform a fourth convolution operation when the first convolution operation and the second convolution operation are performed;
and a calculation module, configured to perform the convolution calculation of the next convolutional layer when the third convolution operation and the fourth convolution operation are performed.
8. An AI acceleration chip, characterized by comprising a first off-chip memory, a second off-chip memory, a first on-chip memory, a second on-chip memory, an on-chip network, a first calculation engine and a second calculation engine, wherein the on-chip network connects the first off-chip memory, the second off-chip memory, the first on-chip memory, the second on-chip memory, the first calculation engine and the second calculation engine, and the on-chip network is configured to perform the steps in the bandwidth optimization method for a convolutional neural network accelerator according to any one of claims 1 to 6.
9. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the bandwidth optimization method for a convolutional neural network accelerator as claimed in any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for bandwidth optimization for a convolutional neural network accelerator as defined in any one of claims 1 to 6.
CN202111445760.XA 2021-11-30 2021-11-30 Bandwidth optimization method for convolutional neural network accelerator and related equipment Pending CN114202067A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111445760.XA CN114202067A (en) 2021-11-30 2021-11-30 Bandwidth optimization method for convolutional neural network accelerator and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111445760.XA CN114202067A (en) 2021-11-30 2021-11-30 Bandwidth optimization method for convolutional neural network accelerator and related equipment

Publications (1)

Publication Number Publication Date
CN114202067A true CN114202067A (en) 2022-03-18

Family

ID=80649781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111445760.XA Pending CN114202067A (en) 2021-11-30 2021-11-30 Bandwidth optimization method for convolutional neural network accelerator and related equipment

Country Status (1)

Country Link
CN (1) CN114202067A (en)

Citations (12)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203807A (en) * 2016-03-16 2017-09-26 中国科学院计算技术研究所 The computational methods of neutral net, system and its apparatus
US20190180170A1 (en) * 2017-12-12 2019-06-13 Amazon Technologies, Inc. Multi-memory on-chip computational network
CN109284817A (en) * 2018-08-31 2019-01-29 中国科学院上海高等研究院 Depth separates convolutional neural networks processing framework/method/system and medium
CN110770740A (en) * 2018-09-30 2020-02-07 深圳市大疆创新科技有限公司 Image processing method and device based on convolutional neural network and unmanned aerial vehicle
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel
CN110009103A (en) * 2019-03-26 2019-07-12 深兰科技(上海)有限公司 A kind of method and apparatus of deep learning convolutional calculation
CN109978143A (en) * 2019-03-29 2019-07-05 南京大学 It is a kind of based on the stacking-type self-encoding encoder of SIMD framework and coding method
CN113554157A (en) * 2020-04-24 2021-10-26 上海商汤智能科技有限公司 Data processing method and related product
GB202016226D0 (en) * 2020-10-13 2020-11-25 Imagination Tech Ltd Implementation of a neural network in multicore hardware
GB202016225D0 (en) * 2020-10-13 2020-11-25 Imagination Tech Ltd Implementation of neural network in multicore hardware
CN113449852A (en) * 2021-08-05 2021-09-28 安谋科技(中国)有限公司 Convolutional neural network computing method, system on chip and electronic device
CN114519425A (en) * 2022-02-21 2022-05-20 南京广捷智能科技有限公司 Convolution neural network acceleration system with expandable scale

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RUIZHE ZHAO ET AL.: "Hardware Compilation of Deep Neural Networks: An Overview", 2018 IEEE 29TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP), 26 August 2018 (2018-08-26), pages 1 - 8 *
徐欣; 刘强; 王少军: "A Highly Parallel Convolutional Neural Network Accelerator Design Method" (in Chinese), Journal of Harbin Institute of Technology, no. 04, 30 April 2020 (2020-04-30), pages 31 - 37 *

Similar Documents

Publication Publication Date Title
CN111176727B (en) Computing device and computing method
CN112214727B (en) Operation accelerator
US20180260709A1 (en) Calculating device and method for a sparsely connected artificial neural network
CN107203807B (en) On-chip cache bandwidth balancing method, system and device of neural network accelerator
CN111859273A (en) Matrix multiplier
CN107301455A (en) Mixing cube storage system and speed-up computation method for convolutional neural networks
CN107066239A (en) A kind of hardware configuration for realizing convolutional neural networks forward calculation
CN112633490B (en) Data processing device, method and related product for executing neural network model
CN106598692A (en) Method for creating mirror image file in logical volume and starting virtual machine, and server
CN112686379A (en) Integrated circuit device, electronic equipment, board card and calculation method
WO2021142713A1 (en) Neural network processing method, device and system
CN110009103B (en) Deep learning convolution calculation method and device
CN110009644B (en) Method and device for segmenting line pixels of feature map
CN114202067A (en) Bandwidth optimization method for convolutional neural network accelerator and related equipment
CN116185937A (en) Binary operation memory access optimization method and device based on multi-layer interconnection architecture of many-core processor
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
Qiu et al. An FPGA‐Based Convolutional Neural Network Coprocessor
CN114595811A (en) Method and apparatus for performing deep learning operations
US11657252B2 (en) Point to point connected processing elements with data joiner components
CN112446464B (en) Neural network convolution operation method and device and related products
CN112446463A (en) Neural network full-connection layer operation method and device and related products
CN111061507A (en) Operation method, operation device, computer equipment and storage medium
CN116781484B (en) Data processing method, device, computer equipment and storage medium
CN110825311A (en) Method and apparatus for storing data
CN109992198A (en) The data transmission method and Related product of neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination