CN107862378B - Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal - Google Patents
- Publication number
- CN107862378B CN107862378B CN201711273248.5A CN201711273248A CN107862378B CN 107862378 B CN107862378 B CN 107862378B CN 201711273248 A CN201711273248 A CN 201711273248A CN 107862378 B CN107862378 B CN 107862378B
- Authority
- CN
- China
- Prior art keywords
- neural network
- dot product
- convolutional neural
- vector dot
- core
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/065—Analogue means
Abstract
The invention provides a multi-core-based convolutional neural network acceleration method and system, a storage medium and a terminal. The method comprises the steps of: splitting a layer of a convolutional neural network into at least two subtasks, each subtask corresponding to a convolution kernel, the convolution kernels being connected in series; executing, on each convolution kernel, a first preset number of vector dot product operations in parallel, each vector dot product operation comprising a second preset number of multiplications, where the product of the first preset number and the second preset number is the number of multiplier-adders in the convolution kernel; and outputting the vector dot product results of each convolution kernel in output-priority order. The method and system save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels and, for the same hardware data bandwidth, improve the processing speed of the convolutional neural network through parallel vector dot product operations within each convolution kernel.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a multi-core-based convolutional neural network acceleration method and system, a storage medium and a terminal.
Background
At present, deep learning and machine learning are widely applied in the fields of vision processing, speech recognition and image analysis, and convolutional neural networks are a core component of both. Increasing the processing speed of the convolutional neural network therefore proportionally increases the processing speed of deep learning and machine learning applications.
In the prior art, applications in vision processing, speech recognition and image analysis are based on multi-layer convolutional neural networks. Each layer of the convolutional neural network requires a large amount of data processing and convolution operations, placing high demands on hardware processing speed and resource consumption. With the continuing development of wearable devices, Internet-of-things applications and autonomous driving, implementing a convolutional neural network in an embedded product at a smooth processing speed has become a major challenge for current hardware architecture design. Taking the typical convolutional neural networks ResNet and VGG16 as examples: at 16-bit floating-point precision, ResNet requires 15 GB/s of bandwidth to run at 60 frames per second, and VGG16 requires 6.0 GB/s of bandwidth to run at 60 frames per second.
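The bandwidth figures above follow from a simple relation: the required bandwidth is roughly the memory traffic per frame times the frame rate. A minimal sketch of this estimate (the per-frame traffic value below is an illustrative assumption, not a figure from the patent):

```python
def required_bandwidth_gb_s(bytes_per_frame: float, fps: float) -> float:
    """Rough estimate: memory traffic per frame times frames per second."""
    return bytes_per_frame * fps / 1e9

# Illustrative: a network that moves 0.1 GB of 16-bit data per frame
# needs 6 GB/s to sustain 60 fps.
print(required_bandwidth_gb_s(0.1e9, 60))  # 6.0
```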
At present, acceleration of the convolutional neural network is achieved by arranging multiple convolution units in parallel. Ideally, the more convolution units, the faster the processing. In practice, however, the data bandwidth severely limits the processing speed of the convolution units: hardware bandwidth is a precious resource, and increasing it is very costly. Improving the processing speed of the convolutional neural network under limited data bandwidth and hardware overhead is therefore an urgent problem for current hardware architecture design.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention provides a multi-core-based convolutional neural network acceleration method and system, a storage medium and a terminal, which save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels and, for the same hardware data bandwidth, improve the processing speed of the convolutional neural network through parallel vector dot product operations within each convolution kernel.
To achieve the above and other related objects, the present invention provides a multi-core-based convolutional neural network acceleration method, comprising the steps of: splitting a layer of a convolutional neural network into at least two subtasks, each subtask corresponding to a convolution kernel, the convolution kernels being connected in series so that input data can be transmitted serially between them; executing, on each convolution kernel, a first preset number of vector dot product operations in parallel, each vector dot product operation comprising a second preset number of multiplications, where the product of the first preset number and the second preset number is the number of multiplier-adders in the convolution kernel; and outputting the vector dot product results of each convolution kernel in output-priority order.
In an embodiment of the invention, the second predetermined number is 3 to support a 3D vector dot product.
In an embodiment of the present invention, the output priority is determined by the order in which input data reaches the convolution kernels: a convolution kernel that receives its input data first has a higher output priority than one that receives it later.
In an embodiment of the present invention, executing a first preset number of vector dot product operations in parallel on each convolution kernel comprises the following steps:
acquiring (N + M - 1) input data, where N is the first preset number and M is the second preset number;
inputting the 1st to Nth input data to the 1st to Nth multiplier-adders respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M-1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders respectively, to be multiplied by the Mth coefficient;
and, for each of the N positions, accumulating the M products at that position to obtain N vector dot product results.
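The steps above can be sketched as a behavioral model of the multiplier-adder array (the function name and list-based layout are illustrative, not the patent's hardware interfaces):

```python
def parallel_dot_products(inputs, coeffs, n):
    """Compute n sliding vector dot products with m = len(coeffs) taps.

    Pass j feeds inputs[j : j + n] to one row of n multiplier-adders,
    multiplying by coeffs[j]; the m products at each of the n positions
    are then accumulated.
    """
    m = len(coeffs)
    assert len(inputs) >= n + m - 1   # needs (N + M - 1) input data
    acc = [0] * n
    for j in range(m):                # one pass per coefficient
        for i in range(n):            # n multiplier-adders per pass
            acc[i] += inputs[i + j] * coeffs[j]
    return acc

# Example: n = 3 outputs of a 2-tap dot product.
print(parallel_dot_products([1, 2, 3, 4], [1, 1], 3))  # [3, 5, 7]
```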
Correspondingly, the invention provides a multi-core-based convolutional neural network acceleration system, comprising a convolution kernel setting module, a vector dot product module and an output module.
the convolution kernel setting module is used for splitting a layer of convolution neural network into at least two subtasks, and each subtask corresponds to a convolution kernel; the convolution kernels are connected in series, so that input data can be transmitted in series between the convolution kernels;
the vector dot product module is used for executing a first preset number of vector dot product operations in parallel based on each convolution kernel, and each vector dot product operation comprises a second preset number of multiplication operations; the product of the first preset quantity and the second preset quantity is the number of the multiplier-adders in the convolution kernel;
and the output module is used for outputting the vector dot product operation result of each convolution kernel according to the output priority order.
In an embodiment of the invention, the second predetermined number is 3 to support a 3D vector dot product.
In an embodiment of the present invention, the output priority is determined by the order in which input data reaches the convolution kernels: a convolution kernel that receives its input data first has a higher output priority than one that receives it later.
In an embodiment of the present invention, the vector dot product module performs the following steps when executing a first preset number of vector dot product operations in parallel on each convolution kernel:
acquiring (N + M - 1) input data, where N is the first preset number and M is the second preset number;
inputting the 1st to Nth input data to the 1st to Nth multiplier-adders respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M-1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders respectively, to be multiplied by the Mth coefficient;
and, for each of the N positions, accumulating the M products at that position to obtain N vector dot product results.
The present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described multi-core based convolutional neural network acceleration method.
Finally, the invention provides a terminal comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is used for executing the computer program stored in the memory so as to enable the terminal to execute the multi-core-based convolutional neural network acceleration method.
As described above, the multi-core convolutional neural network acceleration method and system, the storage medium, and the terminal according to the present invention have the following advantageous effects:
(1) the bandwidth consumption of the dynamic memory during convolutional neural network operation is saved; taking the 4-convolution-kernel mode as an example, 75% of the input image data bandwidth can be saved at the same data processing speed;
(2) the processing speed of the convolutional neural network is improved; for the same hardware data bandwidth, taking the 3D vector dot product as an example, the processing speed can be increased to 300% of the original;
(3) the dynamic power consumption of the convolutional neural network is reduced; taking 4 convolution kernels with the 3D vector dot product as an example, the operation time is reduced to 33% of the original, the input image bandwidth is reduced to 25% of the original, and the dynamic power consumption is reduced by 85%;
(4) the processing speed of the convolutional neural network in embedded products is optimized; the method has a clear architecture and division of labor, is easy to implement, has a simple flow, and can be widely applied to the Internet of things, wearable devices and vehicle-mounted devices.
Drawings
FIG. 1 is a flow chart illustrating a method for accelerating a convolutional neural network based on multiple cores according to an embodiment of the present invention;
FIG. 2 shows a schematic diagram of coordinates of an input image, coefficients and an output image;
FIG. 3 is a block diagram illustrating an embodiment of a multi-core convolutional neural network acceleration method according to the present invention;
FIG. 4 is a first state diagram illustrating the parallel 3D vector dot products in one embodiment of the multi-core convolutional neural network acceleration method of the present invention;
FIG. 5 is a second state diagram illustrating the parallel 3D vector dot products in one embodiment of the multi-core convolutional neural network acceleration method of the present invention;
FIG. 6 is a third state diagram illustrating the parallel 3D vector dot products in one embodiment of the multi-core convolutional neural network acceleration method of the present invention;
FIG. 7 is a state diagram illustrating the summation of parallel 3D vector dot products in one embodiment of the multi-core based convolutional neural network acceleration method of the present invention;
FIG. 8 is a schematic diagram illustrating an embodiment of a multi-core convolutional neural network acceleration system according to the present invention;
fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the invention.
Description of the element reference numerals
21 convolution kernel setting module
22 vector dot product module
23 output module
31 processor
32 memory
Detailed Description
The embodiments of the present invention are described below by way of specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the disclosure of this specification. The invention may also be implemented or applied through other, different specific embodiments, and the details in this specification may be modified or changed in various ways without departing from the spirit and scope of the invention. It should be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided with the following embodiments only illustrate the basic idea of the invention in a schematic way. The drawings show only the components related to the invention, not the number, shape and size of the components in an actual implementation; in practice, the type, quantity and proportion of the components may vary, and the component layout may be more complicated.
Under limited data bandwidth, the multi-core-based convolutional neural network acceleration method and system, storage medium and terminal save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels; for the same hardware data bandwidth, they improve the processing speed of the convolutional neural network through parallel vector dot product operations within each convolution kernel. The processing speed of the convolutional neural network in embedded products is thereby optimized; the method has a clear architecture and division of labor, is easy to implement, has a simple flow, and can be widely applied to the Internet of things, wearable devices and vehicle-mounted devices.
As shown in fig. 1, in an embodiment, the method for accelerating a convolutional neural network based on multiple cores of the present invention includes the following steps:
step S1, splitting a layer of convolutional neural network into at least two subtasks, wherein each subtask corresponds to a convolutional kernel; the convolution kernels are connected in series, so that input data can be transmitted in series between the convolution kernels.
Taking image processing as an example, as shown in fig. 2, the input image and the output image are both three-dimensional. The input image comprises an abscissa inx, an ordinate iny and a depth coordinate kz; the output image comprises an abscissa outx, an ordinate outy and a depth coordinate z. The coefficients are four-dimensional data, comprising an abscissa kx, an ordinate ky, a coefficient depth coordinate kz and an output depth coordinate z.
When a layer of the convolutional neural network is split into four subtasks, the coefficient sequence is split into four groups along the z direction, as shown in fig. 2, and each group of coefficients is assigned to a different convolution kernel. Data is communicated serially between the convolution kernels, so bandwidth is saved through data sharing.
According to the processing characteristics of the convolutional neural network, each group of coefficients is convolved with the entire input image to produce one z-plane of the output image, so the input image is highly reusable. As shown in fig. 3, the convolution kernels are connected by a serial path so that the input image data passes serially between them. Specifically, after the input image data is read from memory into convolution kernel 0, kernel 0 convolves it with the first group of coefficients and simultaneously forwards it to convolution kernel 1 over the serial data channel. Convolution kernel 1 therefore does not need to read the input image data from memory, saving bandwidth. Convolution kernels 2 and 3 perform the same data forwarding, likewise avoiding redundant reads of the input image data. In the 4-convolution-kernel mode, the serial data channel eliminates three memory reads of the input image data, saving 75% of the input image data bandwidth and reducing memory power consumption; it also eases the trade-off between routing congestion and achievable clock frequency at the physical implementation level.
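The forwarding scheme described above can be modeled behaviorally as follows (a sketch only; `convolve` and the data representation are placeholders, not the patent's hardware interfaces):

```python
def run_kernel_chain(input_data, coeff_groups, convolve):
    """Kernel 0 reads input_data from memory once; each kernel forwards
    the same data to the next over the serial channel, so the chain
    performs a single memory read regardless of the number of kernels."""
    memory_reads = 1                              # only kernel 0 reads memory
    outputs = [convolve(input_data, g) for g in coeff_groups]
    bandwidth_saved = 1 - memory_reads / len(coeff_groups)
    return outputs, bandwidth_saved

# In the 4-kernel mode, 3 of 4 reads are eliminated: 75% saved.
dot = lambda d, g: sum(x * y for x, y in zip(d, g))
_, saved = run_kernel_chain([1, 2, 3], [[1, 0, 0]] * 4, dot)
print(saved)  # 0.75
```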
Step S2, executing a first preset number of vector dot product operations in parallel based on each convolution kernel, each vector dot product operation including a second preset number of multiplication operations; the product of the first preset number and the second preset number is the number of the multiplier-adders in the convolution kernel.
In an embodiment of the present invention, executing a first preset number of vector dot product operations in parallel on each convolution kernel comprises the following steps:
21) acquiring (N + M - 1) input data, where N is the first preset number and M is the second preset number;
22) inputting the 1st to Nth input data to the 1st to Nth multiplier-adders respectively, to be multiplied by the first coefficient;
23) inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders respectively, to be multiplied by the second coefficient;
24) and so on, until the Mth to (N+M-1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders respectively, to be multiplied by the Mth coefficient;
25) for each of the N positions, accumulating the M products at that position to obtain N vector dot product results.
The vector dot product is further described below with a 3D vector dot product comprising three multiplications (M = 3). In this embodiment, each convolution kernel contains 24 multiplier-adders. To maximize utilization of the multiplier-adders under limited bandwidth, 8 3D vector dot product results (N = 8) are computed simultaneously. Specifically, the 3D vector dot product formulas are as follows:
out0=in0*k0+in1*k1+in2*k2
out1=in1*k0+in2*k1+in3*k2
out2=in2*k0+in3*k1+in4*k2
......
out7=in7*k0+in8*k1+in9*k2
First, as shown in fig. 4, 10 input image data, i.e. in0, in1, in2, …, in9, are read from memory. The 0th to 7th data (in0 through in7) are multiplied by coefficient 0 (k0), and the products are written to the accumulator inputs out00, out01, …, out07.
Next, as shown in fig. 5, the same 10 input data are reused. The 1st to 8th data (in1 through in8) are multiplied by coefficient 1 (k1), and the products are written to the accumulator inputs out10, out11, …, out17.
Again, as shown in fig. 6, the same 10 input data are reused. The 2nd to 9th data (in2 through in9) are multiplied by coefficient 2 (k2), and the products are written to the accumulator inputs out20, out21, …, out27.
Finally, as shown in fig. 7, the results at the corresponding positions of the three multiplications are sequentially accumulated to obtain 8 3D vector dot product results. Wherein the results out00, out10, and out20 at the first position of each multiplication are added to obtain a first 3D vector dot product result; adding the results out01, out11, and out21 at the second position of each multiplication to obtain a second 3D vector dot product result; by analogy, the results out07, out17, and out27 at the eighth position of each multiplication are added to obtain an eighth 3D vector dot product result.
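The three passes and the final accumulation can be checked with a short sketch (a behavioral model of figs. 4 to 7, not the hardware itself):

```python
def dot3_parallel8(inp, k):
    """24 multiplies in three passes of 8 (figs. 4-6), then the
    per-position accumulation of fig. 7, yielding 8 3D dot products."""
    assert len(inp) == 10 and len(k) == 3   # (N + M - 1) = 10 inputs
    pass0 = [inp[i]     * k[0] for i in range(8)]   # out00..out07
    pass1 = [inp[i + 1] * k[1] for i in range(8)]   # out10..out17
    pass2 = [inp[i + 2] * k[2] for i in range(8)]   # out20..out27
    return [a + b + c for a, b, c in zip(pass0, pass1, pass2)]

# out_i = in_i*k0 + in_{i+1}*k1 + in_{i+2}*k2, matching the formulas above.
print(dot3_parallel8(list(range(10)), [1, 2, 3]))
# [8, 14, 20, 26, 32, 38, 44, 50]
```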
Step S3, outputting the vector dot product operation result of each convolution kernel according to the output priority order.
Specifically, the output priority is determined by the order in which input data reaches the convolution kernels: the earlier a kernel receives its input data, the higher its output priority. A convolution kernel that receives input data first therefore outputs before a kernel that receives it later.
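The priority rule amounts to emitting results in the order the kernels received their inputs; a minimal sketch (the tuple representation is an assumption for illustration):

```python
def ordered_outputs(tagged_results):
    """tagged_results: (arrival_order, result) pairs, one per kernel.
    Earlier arrival of input data means higher output priority."""
    return [result for _, result in sorted(tagged_results)]

# Kernel results tagged with the order their input data arrived.
print(ordered_outputs([(2, "k2"), (0, "k0"), (3, "k3"), (1, "k1")]))
# ['k0', 'k1', 'k2', 'k3']
```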
As shown in fig. 8, in an embodiment, the multi-core based convolutional neural network acceleration system of the present invention includes a convolution kernel setting module 21, a vector dot product module 22, and an output module 23.
The convolution kernel setting module 21 is configured to split a layer of convolutional neural network into at least two subtasks, where each subtask corresponds to a convolution kernel; the convolution kernels are connected in series, so that input data can be transmitted in series between the convolution kernels.
Taking image processing as an example, as shown in fig. 2, the input image and the output image are both three-dimensional. The input image comprises an abscissa inx, an ordinate iny and a depth coordinate kz; the output image comprises an abscissa outx, an ordinate outy and a depth coordinate z. The coefficients are four-dimensional data, comprising an abscissa kx, an ordinate ky, a coefficient depth coordinate kz and an output depth coordinate z.
When a layer of the convolutional neural network is split into four subtasks, the coefficient sequence is split into four groups along the z direction, as shown in fig. 2, and each group of coefficients is assigned to a different convolution kernel. Data is communicated serially between the convolution kernels, so bandwidth is saved through data sharing.
According to the processing characteristics of the convolutional neural network, each group of coefficients is convolved with the entire input image to produce one z-plane of the output image, so the input image is highly reusable. As shown in fig. 3, the convolution kernels are connected by a serial path so that the input image data passes serially between them. Specifically, after the input image data is read from memory into convolution kernel 0, kernel 0 convolves it with the first group of coefficients and simultaneously forwards it to convolution kernel 1 over the serial data channel. Convolution kernel 1 therefore does not need to read the input image data from memory, saving bandwidth. Convolution kernels 2 and 3 perform the same data forwarding, likewise avoiding redundant reads of the input image data. In the 4-convolution-kernel mode, the serial data channel eliminates three memory reads of the input image data, saving 75% of the input image data bandwidth and reducing memory power consumption; it also eases the trade-off between routing congestion and achievable clock frequency at the physical implementation level.
The vector dot product module 22 is connected to the convolution kernel setting module 21, and configured to execute a first preset number of vector dot product operations in parallel based on each convolution kernel, where each vector dot product operation includes a second preset number of multiplication operations; the product of the first preset number and the second preset number is the number of the multiplier-adders in the convolution kernel.
In an embodiment of the present invention, the vector dot product module 22 performs the following steps when executing a first preset number of vector dot product operations in parallel on each convolution kernel:
21) acquiring (N + M - 1) input data, where N is the first preset number and M is the second preset number;
22) inputting the 1st to Nth input data to the 1st to Nth multiplier-adders respectively, to be multiplied by the first coefficient;
23) inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders respectively, to be multiplied by the second coefficient;
24) and so on, until the Mth to (N+M-1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders respectively, to be multiplied by the Mth coefficient;
25) for each of the N positions, accumulating the M products at that position to obtain N vector dot product results.
The following further illustrates a 3D vector dot product comprising three multiplications (M = 3). In this embodiment, each convolution kernel contains 24 multiplier-adders. To maximize utilization of the multiplier-adders under limited bandwidth, 8 3D vector dot product results (N = 8) are computed simultaneously. Specifically, the 3D vector dot product formulas are as follows:
out0=in0*k0+in1*k1+in2*k2
out1=in1*k0+in2*k1+in3*k2
out2=in2*k0+in3*k1+in4*k2
......
out7=in7*k0+in8*k1+in9*k2
First, as shown in fig. 4, 10 input image data, i.e. in0, in1, in2, …, in9, are read from memory. The 0th to 7th data (in0 through in7) are multiplied by coefficient 0 (k0), and the products are written to the accumulator inputs out00, out01, …, out07.
Next, as shown in fig. 5, the same 10 input data are reused. The 1st to 8th data (in1 through in8) are multiplied by coefficient 1 (k1), and the products are written to the accumulator inputs out10, out11, …, out17.
Again, as shown in fig. 6, the same 10 input data are reused. The 2nd to 9th data (in2 through in9) are multiplied by coefficient 2 (k2), and the products are written to the accumulator inputs out20, out21, …, out27.
Finally, as shown in fig. 7, the results at the corresponding positions of the three multiplications are sequentially accumulated to obtain 8 3D vector dot product results. Wherein the results out00, out10, and out20 at the first position of each multiplication are added to obtain a first 3D vector dot product result; adding the results out01, out11, and out21 at the second position of each multiplication to obtain a second 3D vector dot product result; by analogy, the results out07, out17, and out27 at the eighth position of each multiplication are added to obtain an eighth 3D vector dot product result.
The output module 23 is connected to the vector dot product module 22, and configured to output the vector dot product operation result of each convolution kernel according to the output priority order.
Specifically, the output priority is determined by the order in which input data reaches the convolution kernels: the earlier a kernel receives its input data, the higher its output priority. A convolution kernel that receives input data first therefore outputs before a kernel that receives it later.
It should be noted that the division of the above system into modules is only a logical division; in an actual implementation, the modules may be wholly or partially integrated into one physical entity, or physically separated. A module may be implemented as software invoked by a processing element, entirely in hardware, or as a mix of both. For example, the x module may be a separate processing element, may be integrated into a chip of the apparatus, or may be stored in the memory of the apparatus as program code whose function is invoked and executed by a processing element of the apparatus; the other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). As another example, when one of the above modules is implemented as program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of invoking program code. These modules may also be integrated together and implemented as a system-on-chip (SoC).
The storage medium of the present invention stores a computer program that, when executed by a processor, implements the above multi-core-based convolutional neural network acceleration method. Preferably, the storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
As shown in fig. 9, in one embodiment, the terminal of the present invention includes a processor 31 and a memory 32.
The memory 32 is used for storing computer programs.
Preferably, the memory 32 includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The processor 31 is connected to the memory 32, and is configured to execute the computer program stored in the memory 32, so as to enable the terminal to execute the above-mentioned multi-core-based convolutional neural network acceleration method.
Preferably, the processor 31 may be a general-purpose processor, including a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In summary, the multi-core-based convolutional neural network acceleration method and system, storage medium, and terminal of the invention reduce the dynamic-memory bandwidth consumed by convolutional neural network operation: taking the 4-convolution-kernel mode as an example, 75% of the input image data bandwidth can be saved at the same data processing speed. They increase the processing speed of the convolutional neural network: with the same hardware data bandwidth, taking the 3D vector dot product as an example, the processing speed can be raised to 300% of the original. They reduce the dynamic power consumption of the convolutional neural network: taking 4 convolution kernels and the 3D vector dot product as an example, the operation time falls to 33% of the original, the input image bandwidth falls to 25% of the original, and the dynamic power consumption falls by 85%. The invention thereby optimizes the processing speed of convolutional neural networks in embedded products; it has a clear architecture and division of labor, is easy to implement, has a simple flow, and can be widely applied to the Internet of Things, wearable devices, and vehicle-mounted devices. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.
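The quoted figures follow from simple ratios between the multi-kernel configuration and a single-kernel baseline. A back-of-envelope sketch of that arithmetic (illustrative only; not measurements from the patent):

```python
# 4 serially connected kernels share one input stream, so each image
# is fetched from dynamic memory once instead of four times.
kernels = 4
input_bandwidth = 1 / kernels          # fraction of the original bandwidth
bandwidth_saved = 1 - input_bandwidth  # 0.75, i.e. 75% saved

# A 3D vector dot product performs 3 multiply-accumulates per cycle
# instead of 1, tripling throughput at the same hardware data bandwidth.
dot_width = 3
speed_vs_original = dot_width * 100          # 300% of the original speed
op_time_vs_original = round(100 / dot_width)  # ~33% of the original time

print(bandwidth_saved, speed_vs_original, op_time_vs_original)
```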
The foregoing embodiments merely illustrate the principles and utility of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.
Claims (10)
1. A multi-core-based convolutional neural network acceleration method is characterized by comprising the following steps:
splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; the convolution kernels are connected in series, so that input data can be transferred serially between the convolution kernels;
based on each convolution kernel, executing a first preset number of vector dot product operations in parallel, wherein each vector dot product operation comprises a second preset number of multiplication operations; the product of the first preset number and the second preset number equals the number of multiplier-adders in the convolution kernel;
and outputting the vector dot product operation result of each convolution kernel according to the output priority sequence.
2. The multi-core-based convolutional neural network acceleration method of claim 1, wherein the second preset number is 3 to support a 3D vector dot product.
3. The multi-core-based convolutional neural network acceleration method of claim 1, wherein the output priority is determined according to the order in which the input data is input into the convolution kernels, and a convolution kernel into which input data is input first has a higher output priority than a convolution kernel into which input data is input later.
4. The multi-core-based convolutional neural network acceleration method of claim 1, wherein performing a first preset number of vector dot product operations in parallel on a per convolutional core basis comprises the steps of:
acquiring (N + M-1) input data; wherein N is the first preset number, and M is the second preset number;
inputting the 1st to Nth input data into the 1st to Nth multiplier-adders, respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data into the (N+1)th to 2Nth multiplier-adders, respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M-1)th input data are input into the (N×M-N+1)th to (N×M)th multiplier-adders, respectively, to be multiplied by the Mth coefficient;
and accumulating the products at corresponding positions across the N positions of the M multiplication passes to obtain N vector dot product operation results.
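The indexing in the steps above amounts to N sliding-window dot products computed over (N+M-1) shared inputs. A Python sketch of the computation (an illustrative reconstruction for the reader; the function name and loop structure are assumptions, not part of the claims):

```python
def parallel_dot_products(inputs, coeffs, n):
    """Compute n sliding-window vector dot products as in claim 4.

    inputs: n + m - 1 data values, where m = len(coeffs)
    coeffs: the m convolution coefficients
    Result j equals sum(inputs[j + k] * coeffs[k] for k in range(m)).
    """
    m = len(coeffs)
    assert len(inputs) == n + m - 1
    # Pass k feeds inputs[k : k + n] to multiplier-adders k*n .. k*n + n - 1,
    # each multiplying by the k-th coefficient; products at matching window
    # positions are then accumulated across the m passes.
    results = [0] * n
    for k in range(m):        # the M multiplication passes
        for j in range(n):    # the N parallel window positions
            results[j] += inputs[k + j] * coeffs[k]
    return results

# N = 4 parallel 3D dot products (M = 3) over 4 + 3 - 1 = 6 inputs
print(parallel_dot_products([1, 2, 3, 4, 5, 6], [1, 0, 2], n=4))
# [7, 10, 13, 16]
```

In hardware, the m×n multiply-accumulates of the inner loops would execute in parallel on the N×M multiplier-adders; the nested Python loops only make the index mapping explicit.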
5. A multi-core-based convolutional neural network acceleration system is characterized by comprising a convolutional core setting module, a vector dot product module and an output module;
the convolution kernel setting module is used for splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; the convolution kernels are connected in series, so that input data can be transferred serially between the convolution kernels;
the vector dot product module is used for executing a first preset number of vector dot product operations in parallel based on each convolution kernel, wherein each vector dot product operation comprises a second preset number of multiplication operations; the product of the first preset number and the second preset number equals the number of multiplier-adders in the convolution kernel;
and the output module is used for outputting the vector dot product operation result of each convolution kernel according to the output priority order.
6. The multi-core based convolutional neural network acceleration system of claim 5, wherein the second preset number is 3 to support a 3D vector dot product.
7. The multi-core-based convolutional neural network acceleration system of claim 5, wherein the output priority is determined according to the order in which the input data is input into the convolution kernels, and a convolution kernel into which input data is input first has a higher output priority than a convolution kernel into which input data is input later.
8. The multi-core based convolutional neural network acceleration system as claimed in claim 5, wherein the vector dot product module performs the following steps when performing a first preset number of vector dot product operations in parallel based on each convolution core:
acquiring (N + M-1) input data; wherein N is the first preset number, and M is the second preset number;
inputting the 1st to Nth input data into the 1st to Nth multiplier-adders, respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data into the (N+1)th to 2Nth multiplier-adders, respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M-1)th input data are input into the (N×M-N+1)th to (N×M)th multiplier-adders, respectively, to be multiplied by the Mth coefficient;
and accumulating the products at corresponding positions across the N positions of the M multiplication passes to obtain N vector dot product operation results.
9. A storage medium on which a computer program is stored, which program, when executed by a processor, implements the multi-core based convolutional neural network acceleration method of any one of claims 1 to 4.
10. A terminal comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the memory-stored computer program to cause the terminal to perform the multi-core based convolutional neural network acceleration method of any of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711273248.5A CN107862378B (en) | 2017-12-06 | 2017-12-06 | Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711273248.5A CN107862378B (en) | 2017-12-06 | 2017-12-06 | Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107862378A CN107862378A (en) | 2018-03-30 |
CN107862378B true CN107862378B (en) | 2020-04-24 |
Family
ID=61705060
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711273248.5A Active CN107862378B (en) | 2017-12-06 | 2017-12-06 | Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107862378B (en) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108681773B (en) * | 2018-05-23 | 2020-01-10 | 腾讯科技(深圳)有限公司 | Data operation acceleration method, device, terminal and readable storage medium |
CN109117940B (en) * | 2018-06-19 | 2020-12-15 | 腾讯科技(深圳)有限公司 | Target detection method, device, terminal and storage medium based on convolutional neural network |
US20200005125A1 (en) * | 2018-06-27 | 2020-01-02 | International Business Machines Corporation | Low precision deep neural network enabled by compensation instructions |
CN108920413B (en) * | 2018-06-28 | 2019-08-09 | 中国人民解放军国防科技大学 | Convolutional neural network multi-core parallel computing method facing GPDSP |
US11954573B2 (en) * | 2018-09-06 | 2024-04-09 | Black Sesame Technologies Inc. | Convolutional neural network using adaptive 3D array |
CN109740733B (en) * | 2018-12-27 | 2021-07-06 | 深圳云天励飞技术有限公司 | Deep learning network model optimization method and device and related equipment |
CN109740747B (en) | 2018-12-29 | 2019-11-12 | 北京中科寒武纪科技有限公司 | Operation method, device and Related product |
CN110889497B (en) * | 2018-12-29 | 2021-04-23 | 中科寒武纪科技股份有限公司 | Learning task compiling method of artificial intelligence processor and related product |
CN111563586B (en) * | 2019-02-14 | 2022-12-09 | 上海寒武纪信息科技有限公司 | Splitting method of neural network model and related product |
CN109886400B (en) * | 2019-02-19 | 2020-11-27 | 合肥工业大学 | Convolution neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof |
CN110109646B (en) * | 2019-03-28 | 2021-08-27 | 北京迈格威科技有限公司 | Data processing method, data processing device, multiplier-adder and storage medium |
CN110689115B (en) * | 2019-09-24 | 2023-03-31 | 安徽寒武纪信息科技有限公司 | Neural network model processing method and device, computer equipment and storage medium |
CN110689121A (en) * | 2019-09-24 | 2020-01-14 | 上海寒武纪信息科技有限公司 | Method for realizing neural network model splitting by using multi-core processor and related product |
US20220383082A1 (en) * | 2019-09-24 | 2022-12-01 | Anhui Cambricon Information Technology Co., Ltd. | Neural network processing method and apparatus, computer device and storage medium |
CN110738317A (en) * | 2019-10-17 | 2020-01-31 | 中国科学院上海高等研究院 | FPGA-based deformable convolution network operation method, device and system |
CN110796245B (en) * | 2019-10-25 | 2022-03-22 | 浪潮电子信息产业股份有限公司 | Method and device for calculating convolutional neural network model |
WO2021081854A1 (en) * | 2019-10-30 | 2021-05-06 | 华为技术有限公司 | Convolution operation circuit and convolution operation method |
CN111610963B (en) * | 2020-06-24 | 2021-08-17 | 上海西井信息科技有限公司 | Chip structure and multiply-add calculation engine thereof |
CN114399828B (en) * | 2022-03-25 | 2022-07-08 | 深圳比特微电子科技有限公司 | Training method of convolution neural network model for image processing |
CN116303108B (en) * | 2022-09-07 | 2024-05-14 | 芯砺智能科技(上海)有限公司 | Weight address arrangement method suitable for parallel computing architecture |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104915322A (en) * | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolution neutral network hardware and AXI bus IP core thereof |
CN106203617A (en) * | 2016-06-27 | 2016-12-07 | 哈尔滨工业大学深圳研究生院 | A kind of acceleration processing unit based on convolutional neural networks and array structure |
CN106599883A (en) * | 2017-03-08 | 2017-04-26 | 王华锋 | Face recognition method capable of extracting multi-level image semantics based on CNN (convolutional neural network) |
CN106845635A (en) * | 2017-01-24 | 2017-06-13 | 东南大学 | CNN convolution kernel hardware design methods based on cascade form |
CN106875011A (en) * | 2017-01-12 | 2017-06-20 | 南京大学 | The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator |
CN106951395A (en) * | 2017-02-13 | 2017-07-14 | 上海客鹭信息技术有限公司 | Towards the parallel convolution operations method and device of compression convolutional neural networks |
CN107301456A (en) * | 2017-05-26 | 2017-10-27 | 中国人民解放军国防科学技术大学 | Deep neural network multinuclear based on vector processor speeds up to method |
WO2017186829A1 (en) * | 2016-04-27 | 2017-11-02 | Commissariat A L'energie Atomique Et Aux Energies Alternatives | Device and method for calculating convolution in a convolutional neural network |
2017
- 2017-12-06 CN CN201711273248.5A patent/CN107862378B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104915322A (en) * | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolution neutral network hardware and AXI bus IP core thereof |
WO2017186829A1 (en) * | 2016-04-27 | 2017-11-02 | Commissariat A L'energie Atomique Et Aux Energies Alternatives | Device and method for calculating convolution in a convolutional neural network |
CN106203617A (en) * | 2016-06-27 | 2016-12-07 | 哈尔滨工业大学深圳研究生院 | A kind of acceleration processing unit based on convolutional neural networks and array structure |
CN106875011A (en) * | 2017-01-12 | 2017-06-20 | 南京大学 | The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator |
CN106845635A (en) * | 2017-01-24 | 2017-06-13 | 东南大学 | CNN convolution kernel hardware design methods based on cascade form |
CN106951395A (en) * | 2017-02-13 | 2017-07-14 | 上海客鹭信息技术有限公司 | Towards the parallel convolution operations method and device of compression convolutional neural networks |
CN106599883A (en) * | 2017-03-08 | 2017-04-26 | 王华锋 | Face recognition method capable of extracting multi-level image semantics based on CNN (convolutional neural network) |
CN107301456A (en) * | 2017-05-26 | 2017-10-27 | 中国人民解放军国防科学技术大学 | Deep neural network multinuclear based on vector processor speeds up to method |
Non-Patent Citations (2)
Title |
---|
CUDA-CONVNET deep convolutional neural network algorithm; Li Daxia; China Master's Theses Full-text Database, Information Science and Technology; 2016-03-15; pp. 29-30 *
A concise and efficient method for accelerating convolutional neural networks; Liu Jinfeng; Science Technology and Engineering; 2014-11-28; Vol. 14, No. 33; pp. 240-244 *
Also Published As
Publication number | Publication date |
---|---|
CN107862378A (en) | 2018-03-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107862378B (en) | Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal | |
JP6857286B2 (en) | Improved performance of neural network arrays | |
US11960566B1 (en) | Reducing computations for data including padding | |
EP3349153B1 (en) | Convolutional neural network (cnn) processing method and apparatus | |
EP3480745A1 (en) | Hardware implementation of convolution layer of deep neural network | |
CN111860812B (en) | Apparatus and method for performing convolutional neural network training | |
CN107341541B (en) | Apparatus and method for performing full connectivity layer neural network training | |
EP3480747A1 (en) | Single plane filters | |
CN111476360A (en) | Apparatus and method for Winograd transform convolution operation of neural network | |
US20190034327A1 (en) | Accessing prologue and epilogue data | |
TW202123093A (en) | Method and system for performing convolution operation | |
WO2023065983A1 (en) | Computing apparatus, neural network processing device, chip, and data processing method | |
CN110109646A (en) | Data processing method, device and adder and multiplier and storage medium | |
US20230019151A1 (en) | Implementation of pooling and unpooling or reverse pooling in hardware | |
CN114138231B (en) | Method, circuit and SOC for executing matrix multiplication operation | |
CN111047025B (en) | Convolution calculation method and device | |
CN114764615A (en) | Convolution operation implementation method, data processing method and device | |
CN110245706B (en) | Lightweight target detection method for embedded application | |
JP7387017B2 (en) | Address generation method and unit, deep learning processor, chip, electronic equipment and computer program | |
CN110825311B (en) | Method and apparatus for storing data | |
CN117063182A (en) | Data processing method and device | |
EP4295276A1 (en) | Accelerated execution of convolution operation by convolutional neural network | |
WO2020194465A1 (en) | Neural network circuit | |
CN111832714A (en) | Operation method and device | |
Anand et al. | Scaling computation on GPUs using powerlists |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 201203 China (Shanghai) Free Trade Pilot Zone 20A, Zhangjiang Building, 289 Chunxiao Road Applicant after: Xinyuan Microelectronics (Shanghai) Co., Ltd. Applicant after: Core chip technology (Shanghai) Co., Ltd. Address before: 201203 Zhangjiang Building 20A, 560 Songtao Road, Zhangjiang High-tech Park, Pudong New Area, Shanghai Applicant before: VeriSilicon Microelectronics (Shanghai) Co., Ltd. Applicant before: Core chip technology (Shanghai) Co., Ltd. |
|
CB02 | Change of applicant information | ||
GR01 | Patent grant | ||
GR01 | Patent grant |