CN107862378B - Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal - Google Patents
- Publication number
- CN107862378B CN107862378B CN201711273248.5A CN201711273248A CN107862378B CN 107862378 B CN107862378 B CN 107862378B CN 201711273248 A CN201711273248 A CN 201711273248A CN 107862378 B CN107862378 B CN 107862378B
- Authority
- CN
- China
- Prior art keywords
- neural network
- dot product
- convolutional neural
- vector dot
- core
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/065—Analogue means
Abstract
The invention provides a multi-core-based convolutional neural network acceleration method and system, a storage medium and a terminal. The method comprises the steps of: splitting a layer of a convolutional neural network into at least two subtasks, each subtask corresponding to a convolution kernel, the convolution kernels being connected in series; executing, on each convolution kernel, a first preset number of vector dot product operations in parallel, each vector dot product operation comprising a second preset number of multiplications, where the product of the first preset number and the second preset number is the number of multiplier-adders in the convolution kernel; and outputting the vector dot product results of each convolution kernel in output-priority order. The method and system save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels and, for the same hardware data bandwidth, improve the processing speed of the convolutional neural network through parallel vector dot product operations within each convolution kernel.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a multi-core-based convolutional neural network acceleration method and system, a storage medium and a terminal.
Background
At present, deep learning and machine learning are widely applied in the fields of vision processing, speech recognition and image analysis, and convolutional neural networks are a core component of both. Increasing the processing speed of the convolutional neural network therefore proportionally increases the processing speed of deep learning and machine learning applications.
In the prior art, applications in vision processing, speech recognition and image analysis are based on multi-layer convolutional neural networks. Each layer of the convolutional neural network requires a large amount of data processing and convolution operations, placing high demands on hardware processing speed and resource consumption. With the continuing development of wearable devices, Internet-of-things applications and autonomous driving, implementing a convolutional neural network in an embedded product at a smooth processing speed has become a major challenge for current hardware architecture design. Taking the typical convolutional neural networks ResNet and VGG16 as examples: at 16-bit floating-point precision, ResNet requires 15 GB/s of bandwidth to run at 60 frames per second, and VGG16 requires 6.0 GB/s of bandwidth to run at 60 frames per second.
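The bandwidth figures above follow from a simple relation: the required bandwidth is roughly the memory traffic per frame times the frame rate. A minimal sketch of this estimate (the per-frame traffic value below is an illustrative assumption, not a figure from the patent):

```python
def required_bandwidth_gb_s(bytes_per_frame: float, fps: float) -> float:
    """Rough estimate: memory traffic per frame times frames per second."""
    return bytes_per_frame * fps / 1e9

# Illustrative: a network that moves 0.1 GB of 16-bit data per frame
# needs 6 GB/s to sustain 60 fps.
print(required_bandwidth_gb_s(0.1e9, 60))  # 6.0
```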
At present, acceleration of the convolutional neural network is achieved by arranging multiple convolution units in parallel. Ideally, the more convolution units, the faster the processing. In practice, however, the data bandwidth severely limits the processing speed of the convolution units: hardware bandwidth is a precious resource, and increasing it is very costly. Improving the processing speed of the convolutional neural network under limited data bandwidth and hardware overhead is therefore an urgent problem for current hardware architecture design.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention provides a multi-core-based convolutional neural network acceleration method and system, a storage medium and a terminal, which save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels and, for the same hardware data bandwidth, improve the processing speed of the convolutional neural network through parallel vector dot product operations within each convolution kernel.
To achieve the above and other related objects, the present invention provides a multi-core-based convolutional neural network acceleration method, comprising the steps of: splitting a layer of a convolutional neural network into at least two subtasks, each subtask corresponding to a convolution kernel, the convolution kernels being connected in series so that input data can be transmitted serially between them; executing, on each convolution kernel, a first preset number of vector dot product operations in parallel, each vector dot product operation comprising a second preset number of multiplications, where the product of the first preset number and the second preset number is the number of multiplier-adders in the convolution kernel; and outputting the vector dot product results of each convolution kernel in output-priority order.
In an embodiment of the invention, the second predetermined number is 3 to support a 3D vector dot product.
In an embodiment of the present invention, the output priority is determined by the order in which input data reaches the convolution kernels: a convolution kernel that receives its input data first has a higher output priority than one that receives it later.
In an embodiment of the present invention, executing a first preset number of vector dot product operations in parallel on each convolution kernel comprises the following steps:
acquiring (N + M - 1) input data, where N is the first preset number and M is the second preset number;
inputting the 1st to Nth input data to the 1st to Nth multiplier-adders respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M-1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders respectively, to be multiplied by the Mth coefficient;
and, for each of the N positions, accumulating the M products at that position to obtain N vector dot product results.
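The steps above can be sketched as a behavioral model of the multiplier-adder array (the function name and list-based layout are illustrative, not the patent's hardware interfaces):

```python
def parallel_dot_products(inputs, coeffs, n):
    """Compute n sliding vector dot products with m = len(coeffs) taps.

    Pass j feeds inputs[j : j + n] to one row of n multiplier-adders,
    multiplying by coeffs[j]; the m products at each of the n positions
    are then accumulated.
    """
    m = len(coeffs)
    assert len(inputs) >= n + m - 1   # needs (N + M - 1) input data
    acc = [0] * n
    for j in range(m):                # one pass per coefficient
        for i in range(n):            # n multiplier-adders per pass
            acc[i] += inputs[i + j] * coeffs[j]
    return acc

# Example: n = 3 outputs of a 2-tap dot product.
print(parallel_dot_products([1, 2, 3, 4], [1, 1], 3))  # [3, 5, 7]
```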
Correspondingly, the invention provides a multi-core-based convolutional neural network acceleration system, comprising a convolution kernel setting module, a vector dot product module and an output module.
the convolution kernel setting module is used for splitting a layer of convolution neural network into at least two subtasks, and each subtask corresponds to a convolution kernel; the convolution kernels are connected in series, so that input data can be transmitted in series between the convolution kernels;
the vector dot product module is used for executing a first preset number of vector dot product operations in parallel based on each convolution kernel, and each vector dot product operation comprises a second preset number of multiplication operations; the product of the first preset quantity and the second preset quantity is the number of the multiplier-adders in the convolution kernel;
and the output module is used for outputting the vector dot product operation result of each convolution kernel according to the output priority order.
In an embodiment of the invention, the second predetermined number is 3 to support a 3D vector dot product.
In an embodiment of the present invention, the output priority is determined by the order in which input data reaches the convolution kernels: a convolution kernel that receives its input data first has a higher output priority than one that receives it later.
In an embodiment of the present invention, the vector dot product module performs the following steps when executing a first preset number of vector dot product operations in parallel on each convolution kernel:
acquiring (N + M - 1) input data, where N is the first preset number and M is the second preset number;
inputting the 1st to Nth input data to the 1st to Nth multiplier-adders respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M-1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders respectively, to be multiplied by the Mth coefficient;
and, for each of the N positions, accumulating the M products at that position to obtain N vector dot product results.
The present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described multi-core based convolutional neural network acceleration method.
Finally, the invention provides a terminal comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is used for executing the computer program stored in the memory so as to enable the terminal to execute the multi-core-based convolutional neural network acceleration method.
As described above, the multi-core convolutional neural network acceleration method and system, the storage medium, and the terminal according to the present invention have the following advantageous effects:
(1) the bandwidth consumption of the dynamic memory during convolutional neural network operation is saved; taking the 4-convolution-kernel mode as an example, 75% of the input image data bandwidth can be saved at the same data processing speed;
(2) the processing speed of the convolutional neural network is improved; for the same hardware data bandwidth, taking the 3D vector dot product as an example, the processing speed can be increased to 300% of the original;
(3) the dynamic power consumption of the convolutional neural network is reduced; taking 4 convolution kernels with the 3D vector dot product as an example, the operation time is reduced to 33% of the original, the input image bandwidth is reduced to 25% of the original, and the dynamic power consumption is reduced by 85%;
(4) the processing speed of the convolutional neural network in embedded products is optimized; the method has a clear architecture and division of labor, is easy to implement, has a simple flow, and can be widely applied to the Internet of things, wearable devices and vehicle-mounted devices.
Drawings
FIG. 1 is a flow chart illustrating a method for accelerating a convolutional neural network based on multiple cores according to an embodiment of the present invention;
FIG. 2 shows a schematic diagram of coordinates of an input image, coefficients and an output image;
FIG. 3 is a block diagram illustrating an embodiment of a multi-core convolutional neural network acceleration method according to the present invention;
FIG. 4 is a first state diagram illustrating the parallel 3D vector dot products in one embodiment of the multi-core convolutional neural network acceleration method of the present invention;
FIG. 5 is a second state diagram illustrating the parallel 3D vector dot products in one embodiment of the multi-core convolutional neural network acceleration method of the present invention;
FIG. 6 is a third state diagram illustrating the parallel 3D vector dot products in one embodiment of the multi-core convolutional neural network acceleration method of the present invention;
FIG. 7 is a state diagram illustrating the summation of parallel 3D vector dot products in one embodiment of the multi-core based convolutional neural network acceleration method of the present invention;
FIG. 8 is a schematic diagram illustrating an embodiment of a multi-core convolutional neural network acceleration system according to the present invention;
fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the invention.
Description of the element reference numerals
21 convolution kernel setting module
22 vector dot product module
23 output module
31 processor
32 memory
Detailed Description
The embodiments of the present invention are described below by way of specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the disclosure of this specification. The invention may also be implemented or applied through other, different specific embodiments, and the details in this specification may be modified or changed in various ways without departing from the spirit and scope of the invention. It should be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided with the following embodiments only illustrate the basic idea of the invention in a schematic way. The drawings show only the components related to the invention, not the number, shape and size of the components in an actual implementation; in practice, the type, quantity and proportion of the components may vary, and the component layout may be more complicated.
Under limited data bandwidth, the multi-core-based convolutional neural network acceleration method and system, storage medium and terminal save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels; for the same hardware data bandwidth, they improve the processing speed of the convolutional neural network through parallel vector dot product operations within each convolution kernel. The processing speed of the convolutional neural network in embedded products is thereby optimized; the method has a clear architecture and division of labor, is easy to implement, has a simple flow, and can be widely applied to the Internet of things, wearable devices and vehicle-mounted devices.
As shown in fig. 1, in an embodiment, the method for accelerating a convolutional neural network based on multiple cores of the present invention includes the following steps:
step S1, splitting a layer of convolutional neural network into at least two subtasks, wherein each subtask corresponds to a convolutional kernel; the convolution kernels are connected in series, so that input data can be transmitted in series between the convolution kernels.
Taking image processing as an example, as shown in fig. 2, the input image and the output image are both three-dimensional. The input image comprises an abscissa inx, an ordinate iny and a depth coordinate kz; the output image comprises an abscissa outx, an ordinate outy and a depth coordinate z. The coefficients are four-dimensional data, comprising an abscissa kx, an ordinate ky, a coefficient depth coordinate kz and an output depth coordinate z.
When a layer of the convolutional neural network is split into four subtasks, the coefficient sequence is split into four groups along the z direction, as shown in fig. 2, and each group of coefficients is assigned to a different convolution kernel. Data is communicated serially between the convolution kernels, so bandwidth is saved through data sharing.
According to the processing characteristics of the convolutional neural network, each group of coefficients is convolved with the entire input image to produce one z-plane of the output image, so the input image is highly reusable. As shown in fig. 3, the convolution kernels are connected by a serial path so that the input image data passes serially between them. Specifically, after the input image data is read from memory into convolution kernel 0, kernel 0 convolves it with the first group of coefficients and simultaneously forwards it to convolution kernel 1 over the serial data channel. Convolution kernel 1 therefore does not need to read the input image data from memory, saving bandwidth. Convolution kernels 2 and 3 perform the same data forwarding, likewise avoiding redundant reads of the input image data. In the 4-convolution-kernel mode, the serial data channel eliminates three memory reads of the input image data, saving 75% of the input image data bandwidth and reducing memory power consumption; it also eases the trade-off between routing congestion and achievable clock frequency at the physical implementation level.
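The forwarding scheme described above can be modeled behaviorally as follows (a sketch only; `convolve` and the data representation are placeholders, not the patent's hardware interfaces):

```python
def run_kernel_chain(input_data, coeff_groups, convolve):
    """Kernel 0 reads input_data from memory once; each kernel forwards
    the same data to the next over the serial channel, so the chain
    performs a single memory read regardless of the number of kernels."""
    memory_reads = 1                              # only kernel 0 reads memory
    outputs = [convolve(input_data, g) for g in coeff_groups]
    bandwidth_saved = 1 - memory_reads / len(coeff_groups)
    return outputs, bandwidth_saved

# In the 4-kernel mode, 3 of 4 reads are eliminated: 75% saved.
dot = lambda d, g: sum(x * y for x, y in zip(d, g))
_, saved = run_kernel_chain([1, 2, 3], [[1, 0, 0]] * 4, dot)
print(saved)  # 0.75
```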
Step S2, executing a first preset number of vector dot product operations in parallel based on each convolution kernel, each vector dot product operation including a second preset number of multiplication operations; the product of the first preset number and the second preset number is the number of the multiplier-adders in the convolution kernel.
In an embodiment of the present invention, executing a first preset number of vector dot product operations in parallel on each convolution kernel comprises the following steps:
21) acquiring (N + M - 1) input data, where N is the first preset number and M is the second preset number;
22) inputting the 1st to Nth input data to the 1st to Nth multiplier-adders respectively, to be multiplied by the first coefficient;
23) inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders respectively, to be multiplied by the second coefficient;
24) and so on, until the Mth to (N+M-1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders respectively, to be multiplied by the Mth coefficient;
25) for each of the N positions, accumulating the M products at that position to obtain N vector dot product results.
The vector dot product is further described below with a 3D vector dot product comprising three multiplications (M = 3). In this embodiment, each convolution kernel contains 24 multiplier-adders. To maximize utilization of the multiplier-adders under limited bandwidth, 8 3D vector dot product results (N = 8) are computed simultaneously. Specifically, the 3D vector dot product formulas are as follows:
out0=in0*k0+in1*k1+in2*k2
out1=in1*k0+in2*k1+in3*k2
out2=in2*k0+in3*k1+in4*k2
......
out7=in7*k0+in8*k1+in9*k2
First, as shown in fig. 4, 10 input image data, i.e. in0, in1, in2, …, in9, are read from memory. The 0th to 7th data (in0 through in7) are multiplied by coefficient 0 (k0), and the products are written to the accumulator inputs out00, out01, …, out07.
Next, as shown in fig. 5, the same 10 input data are reused. The 1st to 8th data (in1 through in8) are multiplied by coefficient 1 (k1), and the products are written to the accumulator inputs out10, out11, …, out17.
Again, as shown in fig. 6, the same 10 input data are reused. The 2nd to 9th data (in2 through in9) are multiplied by coefficient 2 (k2), and the products are written to the accumulator inputs out20, out21, …, out27.
Finally, as shown in fig. 7, the results at the corresponding positions of the three multiplications are sequentially accumulated to obtain 8 3D vector dot product results. Wherein the results out00, out10, and out20 at the first position of each multiplication are added to obtain a first 3D vector dot product result; adding the results out01, out11, and out21 at the second position of each multiplication to obtain a second 3D vector dot product result; by analogy, the results out07, out17, and out27 at the eighth position of each multiplication are added to obtain an eighth 3D vector dot product result.
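The three passes and the final accumulation can be checked with a short sketch (a behavioral model of figs. 4 to 7, not the hardware itself):

```python
def dot3_parallel8(inp, k):
    """24 multiplies in three passes of 8 (figs. 4-6), then the
    per-position accumulation of fig. 7, yielding 8 3D dot products."""
    assert len(inp) == 10 and len(k) == 3   # (N + M - 1) = 10 inputs
    pass0 = [inp[i]     * k[0] for i in range(8)]   # out00..out07
    pass1 = [inp[i + 1] * k[1] for i in range(8)]   # out10..out17
    pass2 = [inp[i + 2] * k[2] for i in range(8)]   # out20..out27
    return [a + b + c for a, b, c in zip(pass0, pass1, pass2)]

# out_i = in_i*k0 + in_{i+1}*k1 + in_{i+2}*k2, matching the formulas above.
print(dot3_parallel8(list(range(10)), [1, 2, 3]))
# [8, 14, 20, 26, 32, 38, 44, 50]
```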
Step S3, outputting the vector dot product operation result of each convolution kernel according to the output priority order.
Specifically, the output priority is determined by the order in which input data reaches the convolution kernels: the earlier a kernel receives its input data, the higher its output priority. A convolution kernel that receives input data first therefore outputs before a kernel that receives it later.
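The priority rule amounts to emitting results in the order the kernels received their inputs; a minimal sketch (the tuple representation is an assumption for illustration):

```python
def ordered_outputs(tagged_results):
    """tagged_results: (arrival_order, result) pairs, one per kernel.
    Earlier arrival of input data means higher output priority."""
    return [result for _, result in sorted(tagged_results)]

# Kernel results tagged with the order their input data arrived.
print(ordered_outputs([(2, "k2"), (0, "k0"), (3, "k3"), (1, "k1")]))
# ['k0', 'k1', 'k2', 'k3']
```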
As shown in fig. 8, in an embodiment, the multi-core based convolutional neural network acceleration system of the present invention includes a convolution kernel setting module 21, a vector dot product module 22, and an output module 23.
The convolution kernel setting module 21 is configured to split a layer of convolutional neural network into at least two subtasks, where each subtask corresponds to a convolution kernel; the convolution kernels are connected in series, so that input data can be transmitted in series between the convolution kernels.
Taking image processing as an example, as shown in fig. 2, the input image and the output image are both three-dimensional. The input image comprises an abscissa inx, an ordinate iny and a depth coordinate kz; the output image comprises an abscissa outx, an ordinate outy and a depth coordinate z. The coefficients are four-dimensional data, comprising an abscissa kx, an ordinate ky, a coefficient depth coordinate kz and an output depth coordinate z.
When a layer of the convolutional neural network is split into four subtasks, the coefficient sequence is split into four groups along the z direction, as shown in fig. 2, and each group of coefficients is assigned to a different convolution kernel. Data is communicated serially between the convolution kernels, so bandwidth is saved through data sharing.
According to the processing characteristics of the convolutional neural network, each group of coefficients is convolved with the entire input image to produce one z-plane of the output image, so the input image is highly reusable. As shown in fig. 3, the convolution kernels are connected by a serial path so that the input image data passes serially between them. Specifically, after the input image data is read from memory into convolution kernel 0, kernel 0 convolves it with the first group of coefficients and simultaneously forwards it to convolution kernel 1 over the serial data channel. Convolution kernel 1 therefore does not need to read the input image data from memory, saving bandwidth. Convolution kernels 2 and 3 perform the same data forwarding, likewise avoiding redundant reads of the input image data. In the 4-convolution-kernel mode, the serial data channel eliminates three memory reads of the input image data, saving 75% of the input image data bandwidth and reducing memory power consumption; it also eases the trade-off between routing congestion and achievable clock frequency at the physical implementation level.
The vector dot product module 22 is connected to the convolution kernel setting module 21, and configured to execute a first preset number of vector dot product operations in parallel based on each convolution kernel, where each vector dot product operation includes a second preset number of multiplication operations; the product of the first preset number and the second preset number is the number of the multiplier-adders in the convolution kernel.
In an embodiment of the present invention, the vector dot product module 22 performs the following steps when executing a first preset number of vector dot product operations in parallel on each convolution kernel:
21) acquiring (N + M - 1) input data, where N is the first preset number and M is the second preset number;
22) inputting the 1st to Nth input data to the 1st to Nth multiplier-adders respectively, to be multiplied by the first coefficient;
23) inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders respectively, to be multiplied by the second coefficient;
24) and so on, until the Mth to (N+M-1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders respectively, to be multiplied by the Mth coefficient;
25) for each of the N positions, accumulating the M products at that position to obtain N vector dot product results.
The following further illustrates a 3D vector dot product comprising three multiplications (M = 3). In this embodiment, each convolution kernel contains 24 multiplier-adders. To maximize utilization of the multiplier-adders under limited bandwidth, 8 3D vector dot product results (N = 8) are computed simultaneously. Specifically, the 3D vector dot product formulas are as follows:
out0=in0*k0+in1*k1+in2*k2
out1=in1*k0+in2*k1+in3*k2
out2=in2*k0+in3*k1+in4*k2
......
out7=in7*k0+in8*k1+in9*k2
First, as shown in fig. 4, 10 input image data, i.e. in0, in1, in2, …, in9, are read from memory. The 0th to 7th data (in0 through in7) are multiplied by coefficient 0 (k0), and the products are written to the accumulator inputs out00, out01, …, out07.
Next, as shown in fig. 5, the same 10 input data are reused. The 1st to 8th data (in1 through in8) are multiplied by coefficient 1 (k1), and the products are written to the accumulator inputs out10, out11, …, out17.
Again, as shown in fig. 6, the same 10 input data are reused. The 2nd to 9th data (in2 through in9) are multiplied by coefficient 2 (k2), and the products are written to the accumulator inputs out20, out21, …, out27.
Finally, as shown in fig. 7, the results at the corresponding positions of the three multiplications are sequentially accumulated to obtain 8 3D vector dot product results. Wherein the results out00, out10, and out20 at the first position of each multiplication are added to obtain a first 3D vector dot product result; adding the results out01, out11, and out21 at the second position of each multiplication to obtain a second 3D vector dot product result; by analogy, the results out07, out17, and out27 at the eighth position of each multiplication are added to obtain an eighth 3D vector dot product result.
The output module 23 is connected to the vector dot product module 22, and configured to output the vector dot product operation result of each convolution kernel according to the output priority order.
Specifically, the output priority is determined by the order in which input data reaches the convolution kernels: the earlier a kernel receives its input data, the higher its output priority. A convolution kernel that receives input data first therefore outputs before a kernel that receives it later.
It should be noted that the division of the above system into modules is only a logical division; in an actual implementation, the modules may be wholly or partially integrated into one physical entity, or physically separated. A module may be implemented as software invoked by a processing element, entirely in hardware, or as a mix of both. For example, the x module may be a separate processing element, may be integrated into a chip of the apparatus, or may be stored in the memory of the apparatus as program code whose function is invoked and executed by a processing element of the apparatus; the other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). As another example, when one of the above modules is implemented as program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of invoking program code. These modules may also be integrated together and implemented as a system-on-chip (SoC).
The storage medium of the present invention stores a computer program that, when executed by a processor, implements the above multi-core-based convolutional neural network acceleration method. Preferably, the storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
As shown in fig. 9, in one embodiment, the terminal of the present invention includes a processor 31 and a memory 32.
The memory 32 is used for storing computer programs.
Preferably, the memory 32 includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The processor 31 is connected to the memory 32, and is configured to execute the computer program stored in the memory 32, so as to enable the terminal to execute the above-mentioned multi-core-based convolutional neural network acceleration method.
Preferably, the processor 31 may be a general-purpose processor, including a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In summary, the multi-core-based convolutional neural network acceleration method and system, storage medium, and terminal of the invention reduce the dynamic-memory bandwidth consumed by convolutional neural network operation: taking the 4-convolution-kernel mode as an example, 75% of the input image data bandwidth can be saved at the same data processing speed. They increase the processing speed of the convolutional neural network: with the same hardware data bandwidth, taking the 3D vector dot product as an example, the processing speed can be raised to 300% of the original. They reduce the dynamic power consumption of the convolutional neural network: taking 4 convolution kernels and the 3D vector dot product as an example, the operation time falls to 33% of the original, the input image bandwidth falls to 25% of the original, and the dynamic power consumption falls by 85%. The invention thereby optimizes the processing speed of convolutional neural networks in embedded products; it has a clear architecture and division of labor, is easy to implement, has a simple flow, and can be widely applied to the Internet of Things, wearable devices, and vehicle-mounted devices. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.
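The quoted figures follow from simple ratios between the multi-kernel configuration and a single-kernel baseline. A back-of-envelope sketch of that arithmetic (illustrative only; not measurements from the patent):

```python
# 4 serially connected kernels share one input stream, so each image
# is fetched from dynamic memory once instead of four times.
kernels = 4
input_bandwidth = 1 / kernels          # fraction of the original bandwidth
bandwidth_saved = 1 - input_bandwidth  # 0.75, i.e. 75% saved

# A 3D vector dot product performs 3 multiply-accumulates per cycle
# instead of 1, tripling throughput at the same hardware data bandwidth.
dot_width = 3
speed_vs_original = dot_width * 100          # 300% of the original speed
op_time_vs_original = round(100 / dot_width)  # ~33% of the original time

print(bandwidth_saved, speed_vs_original, op_time_vs_original)
```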
The foregoing embodiments merely illustrate the principles and utility of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.
Claims (10)
1. A multi-core-based convolutional neural network acceleration method is characterized by comprising the following steps:
splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; the convolution kernels are connected in series, so that input data can be transferred serially between the convolution kernels;
based on each convolution kernel, executing a first preset number of vector dot product operations in parallel, wherein each vector dot product operation comprises a second preset number of multiplication operations; the product of the first preset number and the second preset number equals the number of multiplier-adders in the convolution kernel;
and outputting the vector dot product operation result of each convolution kernel according to the output priority sequence.
2. The multi-core-based convolutional neural network acceleration method of claim 1, wherein the second preset number is 3 to support a 3D vector dot product.
3. The multi-core-based convolutional neural network acceleration method of claim 1, wherein the output priority is determined according to the order in which the input data is input into the convolution kernels, and a convolution kernel into which input data is input first has a higher output priority than a convolution kernel into which input data is input later.
4. The multi-core-based convolutional neural network acceleration method of claim 1, wherein performing a first preset number of vector dot product operations in parallel on a per convolutional core basis comprises the steps of:
acquiring (N + M-1) input data; wherein N is the first preset number, and M is the second preset number;
inputting the 1st to Nth input data into the 1st to Nth multiplier-adders, respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data into the (N+1)th to 2Nth multiplier-adders, respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M-1)th input data are input into the (N×M-N+1)th to (N×M)th multiplier-adders, respectively, to be multiplied by the Mth coefficient;
and accumulating the products at corresponding positions across the N positions of the M multiplication passes to obtain N vector dot product operation results.
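The indexing in the steps above amounts to N sliding-window dot products computed over (N+M-1) shared inputs. A Python sketch of the computation (an illustrative reconstruction for the reader; the function name and loop structure are assumptions, not part of the claims):

```python
def parallel_dot_products(inputs, coeffs, n):
    """Compute n sliding-window vector dot products as in claim 4.

    inputs: n + m - 1 data values, where m = len(coeffs)
    coeffs: the m convolution coefficients
    Result j equals sum(inputs[j + k] * coeffs[k] for k in range(m)).
    """
    m = len(coeffs)
    assert len(inputs) == n + m - 1
    # Pass k feeds inputs[k : k + n] to multiplier-adders k*n .. k*n + n - 1,
    # each multiplying by the k-th coefficient; products at matching window
    # positions are then accumulated across the m passes.
    results = [0] * n
    for k in range(m):        # the M multiplication passes
        for j in range(n):    # the N parallel window positions
            results[j] += inputs[k + j] * coeffs[k]
    return results

# N = 4 parallel 3D dot products (M = 3) over 4 + 3 - 1 = 6 inputs
print(parallel_dot_products([1, 2, 3, 4, 5, 6], [1, 0, 2], n=4))
# [7, 10, 13, 16]
```

In hardware, the m×n multiply-accumulates of the inner loops would execute in parallel on the N×M multiplier-adders; the nested Python loops only make the index mapping explicit.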
5. A multi-core-based convolutional neural network acceleration system is characterized by comprising a convolutional core setting module, a vector dot product module and an output module;
the convolution kernel setting module is used for splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; the convolution kernels are connected in series, so that input data can be transferred serially between the convolution kernels;
the vector dot product module is used for executing a first preset number of vector dot product operations in parallel based on each convolution kernel, wherein each vector dot product operation comprises a second preset number of multiplication operations; the product of the first preset number and the second preset number equals the number of multiplier-adders in the convolution kernel;
and the output module is used for outputting the vector dot product operation result of each convolution kernel according to the output priority order.
6. The multi-core based convolutional neural network acceleration system of claim 5, wherein the second preset number is 3 to support a 3D vector dot product.
7. The multi-core-based convolutional neural network acceleration system of claim 5, wherein the output priority is determined according to the order in which the input data is input into the convolution kernels, and a convolution kernel into which input data is input first has a higher output priority than a convolution kernel into which input data is input later.
8. The multi-core based convolutional neural network acceleration system as claimed in claim 5, wherein the vector dot product module performs the following steps when performing a first preset number of vector dot product operations in parallel based on each convolution core:
acquiring (N + M-1) input data; wherein N is the first preset number, and M is the second preset number;
inputting the 1st to Nth input data into the 1st to Nth multiplier-adders, respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data into the (N+1)th to 2Nth multiplier-adders, respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M-1)th input data are input into the (N×M-N+1)th to (N×M)th multiplier-adders, respectively, to be multiplied by the Mth coefficient;
and accumulating the products at corresponding positions across the N positions of the M multiplication passes to obtain N vector dot product operation results.
9. A storage medium on which a computer program is stored, which program, when executed by a processor, implements the multi-core based convolutional neural network acceleration method of any one of claims 1 to 4.
10. A terminal comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the memory-stored computer program to cause the terminal to perform the multi-core based convolutional neural network acceleration method of any of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711273248.5A CN107862378B (en) | 2017-12-06 | 2017-12-06 | Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711273248.5A CN107862378B (en) | 2017-12-06 | 2017-12-06 | Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107862378A CN107862378A (en) | 2018-03-30 |
CN107862378B true CN107862378B (en) | 2020-04-24 |
Family
ID=61705060
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711273248.5A Active CN107862378B (en) | 2017-12-06 | 2017-12-06 | Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107862378B (en) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108681773B (en) * | 2018-05-23 | 2020-01-10 | 腾讯科技(深圳)有限公司 | Data operation acceleration method, device, terminal and readable storage medium |
CN109117940B (en) * | 2018-06-19 | 2020-12-15 | 腾讯科技(深圳)有限公司 | Target detection method, device, terminal and storage medium based on convolutional neural network |
US20200005125A1 (en) * | 2018-06-27 | 2020-01-02 | International Business Machines Corporation | Low precision deep neural network enabled by compensation instructions |
CN108920413B (en) * | 2018-06-28 | 2019-08-09 | 中国人民解放军国防科技大学 | Convolutional neural network multi-core parallel computing method facing GPDSP |
US11954573B2 (en) * | 2018-09-06 | 2024-04-09 | Black Sesame Technologies Inc. | Convolutional neural network using adaptive 3D array |
CN109740733B (en) * | 2018-12-27 | 2021-07-06 | 深圳云天励飞技术有限公司 | Deep learning network model optimization method and device and related equipment |
CN109740747B (en) | 2018-12-29 | 2019-11-12 | 北京中科寒武纪科技有限公司 | Operation method, device and Related product |
CN110889497B (en) * | 2018-12-29 | 2021-04-23 | 中科寒武纪科技股份有限公司 | Learning task compiling method of artificial intelligence processor and related product |
CN111563586B (en) * | 2019-02-14 | 2022-12-09 | 上海寒武纪信息科技有限公司 | Splitting method of neural network model and related product |
CN109886400B (en) * | 2019-02-19 | 2020-11-27 | 合肥工业大学 | Convolution neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof |
CN110109646B (en) * | 2019-03-28 | 2021-08-27 | 北京迈格威科技有限公司 | Data processing method, data processing device, multiplier-adder and storage medium |
CN110689115B (en) * | 2019-09-24 | 2023-03-31 | 安徽寒武纪信息科技有限公司 | Neural network model processing method and device, computer equipment and storage medium |
CN110689121A (en) * | 2019-09-24 | 2020-01-14 | 上海寒武纪信息科技有限公司 | Method for realizing neural network model splitting by using multi-core processor and related product |
US20220383082A1 (en) * | 2019-09-24 | 2022-12-01 | Anhui Cambricon Information Technology Co., Ltd. | Neural network processing method and apparatus, computer device and storage medium |
CN110738317A (en) * | 2019-10-17 | 2020-01-31 | 中国科学院上海高等研究院 | FPGA-based deformable convolution network operation method, device and system |
CN110796245B (en) * | 2019-10-25 | 2022-03-22 | 浪潮电子信息产业股份有限公司 | Method and device for calculating convolutional neural network model |
WO2021081854A1 (en) * | 2019-10-30 | 2021-05-06 | 华为技术有限公司 | Convolution operation circuit and convolution operation method |
CN111610963B (en) * | 2020-06-24 | 2021-08-17 | 上海西井信息科技有限公司 | Chip structure and multiply-add calculation engine thereof |
CN114399828B (en) * | 2022-03-25 | 2022-07-08 | 深圳比特微电子科技有限公司 | Training method of convolution neural network model for image processing |
CN116303108B (en) * | 2022-09-07 | 2024-05-14 | 芯砺智能科技(上海)有限公司 | Weight address arrangement method suitable for parallel computing architecture |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104915322A (en) * | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolution neutral network hardware and AXI bus IP core thereof |
CN106203617A (en) * | 2016-06-27 | 2016-12-07 | 哈尔滨工业大学深圳研究生院 | A kind of acceleration processing unit based on convolutional neural networks and array structure |
CN106599883A (en) * | 2017-03-08 | 2017-04-26 | 王华锋 | Face recognition method capable of extracting multi-level image semantics based on CNN (convolutional neural network) |
CN106845635A (en) * | 2017-01-24 | 2017-06-13 | 东南大学 | CNN convolution kernel hardware design methods based on cascade form |
CN106875011A (en) * | 2017-01-12 | 2017-06-20 | 南京大学 | The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator |
CN106951395A (en) * | 2017-02-13 | 2017-07-14 | 上海客鹭信息技术有限公司 | Towards the parallel convolution operations method and device of compression convolutional neural networks |
CN107301456A (en) * | 2017-05-26 | 2017-10-27 | 中国人民解放军国防科学技术大学 | Deep neural network multinuclear based on vector processor speeds up to method |
WO2017186829A1 (en) * | 2016-04-27 | 2017-11-02 | Commissariat A L'energie Atomique Et Aux Energies Alternatives | Device and method for calculating convolution in a convolutional neural network |
2017
- 2017-12-06 CN CN201711273248.5A patent/CN107862378B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104915322A (en) * | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolution neutral network hardware and AXI bus IP core thereof |
WO2017186829A1 (en) * | 2016-04-27 | 2017-11-02 | Commissariat A L'energie Atomique Et Aux Energies Alternatives | Device and method for calculating convolution in a convolutional neural network |
CN106203617A (en) * | 2016-06-27 | 2016-12-07 | 哈尔滨工业大学深圳研究生院 | A kind of acceleration processing unit based on convolutional neural networks and array structure |
CN106875011A (en) * | 2017-01-12 | 2017-06-20 | 南京大学 | The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator |
CN106845635A (en) * | 2017-01-24 | 2017-06-13 | 东南大学 | CNN convolution kernel hardware design methods based on cascade form |
CN106951395A (en) * | 2017-02-13 | 2017-07-14 | 上海客鹭信息技术有限公司 | Towards the parallel convolution operations method and device of compression convolutional neural networks |
CN106599883A (en) * | 2017-03-08 | 2017-04-26 | 王华锋 | Face recognition method capable of extracting multi-level image semantics based on CNN (convolutional neural network) |
CN107301456A (en) * | 2017-05-26 | 2017-10-27 | 中国人民解放军国防科学技术大学 | Deep neural network multinuclear based on vector processor speeds up to method |
Non-Patent Citations (2)
Title |
---|
CUDA-CONVNET deep convolutional neural network algorithm; Li Daxia; China Master's Theses Full-text Database, Information Science and Technology; 2016-03-15; pp. 29-30 *
A concise and efficient method for accelerating convolutional neural networks; Liu Jinfeng; Science Technology and Engineering; 2014-11-28; Vol. 14, No. 33; pp. 240-244 *
Also Published As
Publication number | Publication date |
---|---|
CN107862378A (en) | 2018-03-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107862378B (en) | Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal | |
JP6857286B2 (en) | Improved performance of neural network arrays | |
US11960566B1 (en) | Reducing computations for data including padding | |
EP3349153B1 (en) | Convolutional neural network (cnn) processing method and apparatus | |
EP3480745A1 (en) | Hardware implementation of convolution layer of deep neural network | |
CN111860812B (en) | Apparatus and method for performing convolutional neural network training | |
CN107341541B (en) | Apparatus and method for performing full connectivity layer neural network training | |
EP3480747A1 (en) | Single plane filters | |
CN111476360A (en) | Apparatus and method for Winograd transform convolution operation of neural network | |
US20190034327A1 (en) | Accessing prologue and epilogue data | |
TW202123093A (en) | Method and system for performing convolution operation | |
WO2023065983A1 (en) | Computing apparatus, neural network processing device, chip, and data processing method | |
CN110109646A (en) | Data processing method, device and adder and multiplier and storage medium | |
US20230019151A1 (en) | Implementation of pooling and unpooling or reverse pooling in hardware | |
CN114138231B (en) | Method, circuit and SOC for executing matrix multiplication operation | |
CN111047025B (en) | Convolution calculation method and device | |
CN114764615A (en) | Convolution operation implementation method, data processing method and device | |
CN110245706B (en) | Lightweight target detection method for embedded application | |
JP7387017B2 (en) | Address generation method and unit, deep learning processor, chip, electronic equipment and computer program | |
CN110825311B (en) | Method and apparatus for storing data | |
CN117063182A (en) | Data processing method and device | |
EP4295276A1 (en) | Accelerated execution of convolution operation by convolutional neural network | |
WO2020194465A1 (en) | Neural network circuit | |
CN111832714A (en) | Operation method and device | |
Anand et al. | Scaling computation on GPUs using powerlists |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 201203 China (Shanghai) Free Trade Pilot Zone 20A, Zhangjiang Building, 289 Chunxiao Road Applicant after: Xinyuan Microelectronics (Shanghai) Co., Ltd. Applicant after: Core chip technology (Shanghai) Co., Ltd. Address before: 201203 Zhangjiang Building 20A, 560 Songtao Road, Zhangjiang High-tech Park, Pudong New Area, Shanghai Applicant before: VeriSilicon Microelectronics (Shanghai) Co., Ltd. Applicant before: Core chip technology (Shanghai) Co., Ltd. |
|
CB02 | Change of applicant information | ||
GR01 | Patent grant | ||
GR01 | Patent grant |