CN107862378B - Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal

Info

Publication number
CN107862378B
Authority
CN
China
Prior art keywords
neural network
dot product
convolutional neural
vector dot
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711273248.5A
Other languages
Chinese (zh)
Other versions
CN107862378A (en)
Inventor
Zhang Huiming (张慧明)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivante Technology Shanghai Co ltd
VeriSilicon Microelectronics Shanghai Co Ltd
Original Assignee
Vivante Technology Shanghai Co ltd
VeriSilicon Microelectronics Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivante Technology Shanghai Co ltd, VeriSilicon Microelectronics Shanghai Co Ltd filed Critical Vivante Technology Shanghai Co ltd
Priority to CN201711273248.5A priority Critical patent/CN107862378B/en
Publication of CN107862378A publication Critical patent/CN107862378A/en
Application granted granted Critical
Publication of CN107862378B publication Critical patent/CN107862378B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/065 Analogue means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a multi-core-based convolutional neural network acceleration method and system, a storage medium and a terminal. The method comprises the steps of: splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to a convolution kernel, the convolution kernels being connected in series; executing, in each convolution kernel, a first preset number of vector dot product operations in parallel, each vector dot product operation comprising a second preset number of multiplications, where the product of the first preset number and the second preset number equals the number of multiplier-adders in the convolution kernel; and outputting the vector dot product results of the convolution kernels in output priority order. The method and system save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels and, for the same hardware data bandwidth, increase the processing speed of the convolutional neural network through parallel vector dot product operations within each convolution kernel.

Description

Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal
Technical Field
The invention relates to the technical field of data processing, in particular to a multi-core-based convolutional neural network acceleration method and system, a storage medium and a terminal.
Background
At present, deep learning and machine learning are widely applied in the fields of vision processing, speech recognition and image analysis, and convolutional neural networks are a key component of both. Increasing the processing speed of the convolutional neural network therefore increases the processing speed of deep learning and machine learning applications in equal proportion.
In the prior art, applications in visual processing, speech recognition and image analysis are based on multi-layer convolutional neural networks. Each layer of a convolutional neural network requires a large amount of data handling and convolution operations, placing high demands on hardware processing speed and resource consumption. With the continuing development of wearable devices, Internet of Things applications and autonomous driving, implementing a convolutional neural network in an embedded product at a smooth processing speed has become a major challenge for current hardware architecture design. Taking the typical convolutional neural networks ResNet and VGG16 as examples: at 16-bit floating-point precision, ResNet requires 15 GBytes/s of bandwidth to run at 60 frames per second, and VGG16 requires 6.0 GBytes/s of bandwidth to run at 60 frames per second.
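To put these figures in perspective, dividing the stated bandwidth by the frame rate gives the per-frame data traffic: 15 GBytes/s ÷ 60 frames/s ≈ 250 MBytes per frame for ResNet, and 6.0 GBytes/s ÷ 60 frames/s ≈ 100 MBytes per frame for VGG16.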
At present, acceleration of the convolutional neural network is achieved by arranging a plurality of convolution units in parallel. Ideally, the more convolution units, the faster the processing. In practice, however, the data bandwidth severely limits the processing speed of the convolution units: hardware bandwidth resources are precious, and increasing the hardware data bandwidth is very costly. Improving the processing speed of the convolutional neural network under limited data bandwidth and hardware overhead is therefore a problem that current hardware architecture design urgently needs to solve.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention provides a multi-core-based convolutional neural network acceleration method and system, a storage medium and a terminal, which save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels and, for the same hardware data bandwidth, increase the processing speed of the convolutional neural network through parallel vector dot product operations within each convolution kernel.
To achieve the above and other related objects, the present invention provides a multi-core-based convolutional neural network acceleration method, comprising the steps of: splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to a convolution kernel, the convolution kernels being connected in series so that input data can be transferred serially between them; executing, in each convolution kernel, a first preset number of vector dot product operations in parallel, each vector dot product operation comprising a second preset number of multiplications, where the product of the first preset number and the second preset number equals the number of multiplier-adders in the convolution kernel; and outputting the vector dot product results of the convolution kernels in output priority order.
In an embodiment of the invention, the second predetermined number is 3 to support a 3D vector dot product.
In an embodiment of the invention, the output priority is determined by the order in which the input data reach the convolution kernels: the convolution kernel that receives the input data first has a higher output priority than the convolution kernel that receives the input data later.
In an embodiment of the invention, executing a first preset number of vector dot product operations in parallel in each convolution kernel comprises the following steps:
acquiring N + M − 1 input data, where N is the first preset number and M is the second preset number;
inputting the 1st to Nth input data to the 1st to Nth multiplier-adders, respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders, respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M−1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders, respectively, to be multiplied by the Mth coefficient;
and accumulating the products at corresponding positions across the M multiplication passes to obtain N vector dot product results.
Correspondingly, the invention provides a multi-core-based convolutional neural network acceleration system, which comprises a convolution kernel setting module, a vector dot product module and an output module;
the convolution kernel setting module is used for splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to a convolution kernel, the convolution kernels being connected in series so that input data can be transferred serially between them;
the vector dot product module is used for executing, in each convolution kernel, a first preset number of vector dot product operations in parallel, each vector dot product operation comprising a second preset number of multiplications, where the product of the first preset number and the second preset number equals the number of multiplier-adders in the convolution kernel;
and the output module is used for outputting the vector dot product results of the convolution kernels in output priority order.
In an embodiment of the invention, the second predetermined number is 3 to support a 3D vector dot product.
In an embodiment of the invention, the output priority is determined by the order in which the input data reach the convolution kernels: the convolution kernel that receives the input data first has a higher output priority than the convolution kernel that receives the input data later.
In an embodiment of the invention, when executing a first preset number of vector dot product operations in parallel in each convolution kernel, the vector dot product module performs the following steps:
acquiring N + M − 1 input data, where N is the first preset number and M is the second preset number;
inputting the 1st to Nth input data to the 1st to Nth multiplier-adders, respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders, respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M−1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders, respectively, to be multiplied by the Mth coefficient;
and accumulating the products at corresponding positions across the M multiplication passes to obtain N vector dot product results.
The present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described multi-core based convolutional neural network acceleration method.
Finally, the invention provides a terminal comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is used for executing the computer program stored in the memory so as to enable the terminal to execute the multi-core-based convolutional neural network acceleration method.
As described above, the multi-core convolutional neural network acceleration method and system, the storage medium, and the terminal according to the present invention have the following advantageous effects:
(1) the bandwidth consumption of dynamic memory during convolutional neural network operation is saved; taking the 4-convolution-kernel mode as an example, at the same data processing speed, 75% of the input image data bandwidth can be saved when the convolutional neural network operates;
(2) the processing speed of the convolutional neural network is improved; for the same hardware data bandwidth, taking the 3D vector dot product as an example, the processing speed can be increased to 300% of the original;
(3) the dynamic power consumption of the convolutional neural network is reduced; taking 4 convolution kernels with 3D vector dot products as an example, the operation time is reduced to 33% of the original, the input image bandwidth is reduced to 25% of the original, and the dynamic power consumption is reduced by 85%;
(4) the processing speed of the convolutional neural network in embedded products is optimized; the method has a clear architecture, a clear division of labor, easy implementation and a simple flow, and can be widely applied to the Internet of Things, wearable devices and vehicle-mounted devices.
Drawings
FIG. 1 is a flow chart illustrating a method for accelerating a convolutional neural network based on multiple cores according to an embodiment of the present invention;
FIG. 2 shows a schematic diagram of coordinates of an input image, coefficients and an output image;
FIG. 3 is a block diagram illustrating an embodiment of a multi-core convolutional neural network acceleration method according to the present invention;
FIG. 4 is a first state diagram illustrating the parallel 3D vector dot products in one embodiment of the multi-core convolutional neural network acceleration method of the present invention;
FIG. 5 is a second state diagram illustrating the parallel 3D vector dot products in one embodiment of the multi-core convolutional neural network acceleration method of the present invention;
FIG. 6 is a third state diagram illustrating the parallel 3D vector dot products in one embodiment of the multi-core convolutional neural network acceleration method of the present invention;
FIG. 7 is a state diagram illustrating the summation of parallel 3D vector dot products in one embodiment of the multi-core based convolutional neural network acceleration method of the present invention;
FIG. 8 is a schematic diagram illustrating an embodiment of a multi-core convolutional neural network acceleration system according to the present invention;
FIG. 9 is a schematic structural diagram of a terminal according to an embodiment of the invention.
Description of the element reference numerals
21 convolution kernel setting module
22 vector dot product module
23 output module
31 processor
32 memory
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Under limited data bandwidth, the multi-core-based convolutional neural network acceleration method and system, the storage medium and the terminal of the present invention save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels; for the same hardware data bandwidth, they increase the processing speed of the convolutional neural network through parallel vector dot product operations within each convolution kernel. The processing speed of the convolutional neural network in embedded products is thereby optimized; the method has a clear architecture, a clear division of labor, easy implementation and a simple flow, and can be widely applied to the Internet of Things, wearable devices and vehicle-mounted devices.
As shown in fig. 1, in an embodiment, the method for accelerating a convolutional neural network based on multiple cores of the present invention includes the following steps:
step S1, splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to a convolution kernel; the convolution kernels are connected in series so that input data can be transferred serially between them.
Taking image processing as an example, as shown in fig. 2, the input image and the output image are both three-dimensional images, and the input image includes an abscissa inx, an ordinate iny, and a coefficient depth coordinate kz. The output image comprises an abscissa outx, an ordinate outy and a depth coordinate z. The coefficients are four-dimensional data including an abscissa kx, an ordinate ky, a coefficient depth coordinate kz, and an output depth coordinate z.
When one layer of the convolutional neural network is split into four subtasks, the coefficients are split into four groups along the z direction, as shown in fig. 2, and each group of coefficients is assigned to a different convolution kernel. Data are then transferred serially between the different convolution kernels, so that bandwidth is saved through data sharing.
According to the processing characteristics of the convolutional neural network, each group of coefficients is convolved with the entire input image to produce one z-plane of the output image, so the input image is highly reusable. As shown in fig. 3, the different convolution kernels are connected by a serial path so that the input image data passes serially between them. Specifically, after the input image data is read from memory into convolution kernel 0, convolution kernel 0 convolves it with the first group of coefficients and, at the same time, forwards it to convolution kernel 1 through the serial data channel, so convolution kernel 1 spends no bandwidth reading the input image data. Convolution kernels 2 and 3 perform the same data forwarding to avoid further input-image bandwidth consumption. In the 4-convolution-kernel mode, the serial data channels eliminate three of the four reads of the input image data, saving 75% of the input image data bandwidth while also reducing memory power consumption; they also strike a good balance between routing congestion and achievable frequency at the physical implementation level.
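The bandwidth saving can be illustrated with a short software model. The following Python sketch is purely illustrative — the patent describes a hardware serial channel, not this code — and the names read_from_memory, convolve and run_kernel_chain are assumptions made for the example.

```python
# Illustrative model of the 4-kernel serial data channel: only kernel 0
# reads the input tile from memory; each kernel forwards the same data
# to the next one, so memory is read once instead of four times.

memory_reads = 0

def read_from_memory(tile):
    """Simulate one read of an input-image tile from dynamic memory."""
    global memory_reads
    memory_reads += 1
    return tile

def convolve(data, coeffs):
    """Placeholder 1-D convolution standing in for the per-kernel work."""
    m = len(coeffs)
    return [sum(d * c for d, c in zip(data[i:i + m], coeffs))
            for i in range(len(data) - m + 1)]

def run_kernel_chain(tile, coefficient_groups):
    """Each coefficient group belongs to one convolution kernel in the chain."""
    data = read_from_memory(tile)        # the only memory access
    outputs = []
    for coeffs in coefficient_groups:    # kernel 0 -> 1 -> 2 -> 3
        outputs.append(convolve(data, coeffs))
        # 'data' is handed to the next kernel over the serial channel,
        # so no re-read from memory is needed.
    return outputs

run_kernel_chain(list(range(16)), [[1, 2, 3]] * 4)
assert memory_reads == 1                 # 1 read instead of 4: 75% saved
```

Reading once and forwarding three times is exactly the 75% figure quoted above: of the four reads that four independent kernels would otherwise issue, three are removed.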
Step S2, executing, in each convolution kernel, a first preset number of vector dot product operations in parallel, each vector dot product operation comprising a second preset number of multiplications; the product of the first preset number and the second preset number equals the number of multiplier-adders in the convolution kernel.
In an embodiment of the invention, executing a first preset number of vector dot product operations in parallel in each convolution kernel comprises the following steps:
21) acquiring N + M − 1 input data, where N is the first preset number and M is the second preset number;
22) inputting the 1st to Nth input data to the 1st to Nth multiplier-adders, respectively, to be multiplied by the first coefficient;
23) inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders, respectively, to be multiplied by the second coefficient;
24) and so on, until the Mth to (N+M−1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders, respectively, to be multiplied by the Mth coefficient;
25) accumulating the products at corresponding positions across the M multiplication passes to obtain N vector dot product results.
The following further describes a 3D vector dot product comprising three multiplications (M = 3). In this embodiment, each convolution kernel contains 24 multiplier-adders. To maximize multiplier-adder utilization under limited bandwidth, 8 3D vector dot product results (N = 8) are computed simultaneously. Specifically, the 3D vector dot product formulas are as follows:
out0=in0*k0+in1*k1+in2*k2
out1=in1*k0+in2*k1+in3*k2
out2=in2*k0+in3*k1+in4*k2
......
out7=in7*k0+in8*k1+in9*k2
First, as shown in fig. 4, 10 input image data, i.e. in0, in1, in2, ..., in9, are read from memory. The 0th to 7th data (in0 through in7) are each multiplied by coefficient k0, and the products are written to the accumulator inputs out00, out01, ..., out07.
Next, as shown in fig. 5, the same 10 input image data are reused. The 1st to 8th data (in1 through in8) are each multiplied by coefficient k1, and the products are written to the accumulator inputs out10, out11, ..., out17.
Again, as shown in fig. 6, the same 10 input image data are reused. The 2nd to 9th data (in2 through in9) are each multiplied by coefficient k2, and the products are written to the accumulator inputs out20, out21, ..., out27.
Finally, as shown in fig. 7, the results at corresponding positions of the three multiplications are accumulated to obtain 8 3D vector dot product results: out00, out10 and out20 at the first position are added to obtain the first result; out01, out11 and out21 at the second position are added to obtain the second result; and so on, until out07, out17 and out27 at the eighth position are added to obtain the eighth result.
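For concreteness, the three multiplication passes and the final accumulation can be modeled in a few lines of Python. This is a minimal sketch of the arithmetic only, assuming the N = 8, M = 3 configuration above; it does not model the 24 physical multiplier-adders themselves, and the name dot_products_3d is invented for the example.

```python
# Software model of the worked example: N = 8 parallel 3D dot products,
# M = 3 multiplications each, matching out0 ... out7 above.
N, M = 8, 3

def dot_products_3d(ins, k):
    """ins: N + M - 1 = 10 input data; k: the three coefficients k0, k1, k2."""
    assert len(ins) == N + M - 1 and len(k) == M
    # Pass m multiplies ins[m], ..., ins[m + N - 1] by coefficient k[m];
    # in hardware each pass occupies one bank of N of the 24 multiplier-adders.
    partial = [[ins[m + j] * k[m] for j in range(N)] for m in range(M)]
    # Accumulate the products at corresponding positions (the step of fig. 7).
    return [sum(partial[m][j] for m in range(M)) for j in range(N)]

ins = list(range(10))   # stands in for in0 ... in9
k = [2, 3, 5]           # example coefficients k0, k1, k2
outs = dot_products_3d(ins, k)
assert outs[0] == ins[0] * k[0] + ins[1] * k[1] + ins[2] * k[2]  # out0
assert outs[7] == ins[7] * k[0] + ins[8] * k[1] + ins[9] * k[2]  # out7
```

Note how the ten inputs are read once and reused across all three passes, which is what keeps the 24 multiplier-adders busy without any extra input bandwidth.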
Step S3, outputting the vector dot product operation result of each convolution kernel in output priority order.
Specifically, the output priority is determined by the order in which the input data reach the convolution kernels: the earlier a convolution kernel acquires its input data, the higher its output priority. The convolution kernel that receives the input data first therefore outputs before the convolution kernel that receives it later.
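Under the assumption of this simple first-fed, first-drained rule, the arbitration can be sketched as follows; feed_order, feed and drain are illustrative names, not part of the patent.

```python
from collections import deque

# Record kernels in the order they received input data, then drain their
# results in that same order: the kernel fed first outputs first.
feed_order = deque()

def feed(kernel_id):
    """Note that this kernel has just received its input data."""
    feed_order.append(kernel_id)

def drain(results_by_kernel):
    """Yield each kernel's dot-product results in feed order."""
    while feed_order:
        kernel_id = feed_order.popleft()
        yield kernel_id, results_by_kernel[kernel_id]

for k_id in (0, 1, 2, 3):          # the serial chain feeds kernel 0 first
    feed(k_id)
ordered = list(drain({0: "r0", 1: "r1", 2: "r2", 3: "r3"}))
assert [k for k, _ in ordered] == [0, 1, 2, 3]
```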
As shown in fig. 8, in an embodiment, the multi-core based convolutional neural network acceleration system of the present invention includes a convolution kernel setting module 21, a vector dot product module 22, and an output module 23.
The convolution kernel setting module 21 is configured to split one layer of a convolutional neural network into at least two subtasks, where each subtask corresponds to a convolution kernel; the convolution kernels are connected in series so that input data can be transferred serially between them.
Taking image processing as an example, as shown in fig. 2, the input image and the output image are both three-dimensional images, and the input image includes an abscissa inx, an ordinate iny, and a coefficient depth coordinate kz. The output image comprises an abscissa outx, an ordinate outy and a depth coordinate z. The coefficients are four-dimensional data including an abscissa kx, an ordinate ky, a coefficient depth coordinate kz, and an output depth coordinate z.
When one layer of the convolutional neural network is split into four subtasks, the coefficients are split into four groups along the z direction, as shown in fig. 2, and each group of coefficients is assigned to a different convolution kernel. Data are then transferred serially between the different convolution kernels, so that bandwidth is saved through data sharing.
According to the processing characteristics of the convolutional neural network, each group of coefficients is convolved with the entire input image to produce one z-plane of the output image, so the input image is highly reusable. As shown in fig. 3, the different convolution kernels are connected by a serial path so that the input image data passes serially between them. Specifically, after the input image data is read from memory into convolution kernel 0, convolution kernel 0 convolves it with the first group of coefficients and, at the same time, forwards it to convolution kernel 1 through the serial data channel, so convolution kernel 1 spends no bandwidth reading the input image data. Convolution kernels 2 and 3 perform the same data forwarding to avoid further input-image bandwidth consumption. In the 4-convolution-kernel mode, the serial data channels eliminate three of the four reads of the input image data, saving 75% of the input image data bandwidth while also reducing memory power consumption; they also strike a good balance between routing congestion and achievable frequency at the physical implementation level.
The vector dot product module 22 is connected to the convolution kernel setting module 21 and configured to execute, in each convolution kernel, a first preset number of vector dot product operations in parallel, where each vector dot product operation comprises a second preset number of multiplications; the product of the first preset number and the second preset number equals the number of multiplier-adders in the convolution kernel.
In an embodiment of the invention, when executing a first preset number of vector dot product operations in parallel in each convolution kernel, the vector dot product module 22 performs the following steps:
21) acquiring N + M − 1 input data, where N is the first preset number and M is the second preset number;
22) inputting the 1st to Nth input data to the 1st to Nth multiplier-adders, respectively, to be multiplied by the first coefficient;
23) inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders, respectively, to be multiplied by the second coefficient;
24) and so on, until the Mth to (N+M−1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders, respectively, to be multiplied by the Mth coefficient;
25) accumulating the products at corresponding positions across the M multiplication passes to obtain N vector dot product results.
The following further describes a 3D vector dot product comprising three multiplications (M = 3). In this embodiment, each convolution kernel contains 24 multiplier-adders. To maximize multiplier-adder utilization under limited bandwidth, 8 3D vector dot product results (N = 8) are computed simultaneously. Specifically, the 3D vector dot product formulas are as follows:
out0=in0*k0+in1*k1+in2*k2
out1=in1*k0+in2*k1+in3*k2
out2=in2*k0+in3*k1+in4*k2
......
out7=in7*k0+in8*k1+in9*k2
First, as shown in fig. 4, 10 input image data, i.e. in0, in1, in2, ..., in9, are read from memory. The 0th to 7th data (in0 through in7) are each multiplied by coefficient k0, and the products are written to the accumulator inputs out00, out01, ..., out07.
Next, as shown in fig. 5, the same 10 input image data are reused. The 1st to 8th data (in1 through in8) are each multiplied by coefficient k1, and the products are written to the accumulator inputs out10, out11, ..., out17.
Again, as shown in fig. 6, the same 10 input image data are reused. The 2nd to 9th data (in2 through in9) are each multiplied by coefficient k2, and the products are written to the accumulator inputs out20, out21, ..., out27.
Finally, as shown in fig. 7, the results at corresponding positions of the three multiplications are accumulated to obtain 8 3D vector dot product results: out00, out10 and out20 at the first position are added to obtain the first result; out01, out11 and out21 at the second position are added to obtain the second result; and so on, until out07, out17 and out27 at the eighth position are added to obtain the eighth result.
The output module 23 is connected to the vector dot product module 22, and configured to output the vector dot product operation result of each convolution kernel according to the output priority order.
Specifically, the output priority is determined by the order in which the input data reach the convolution kernels: the earlier a convolution kernel acquires its input data, the higher its output priority. The convolution kernel that receives the input data first therefore outputs before the convolution kernel that receives it later.
It should be noted that the division of the above system into modules is only a logical division; in actual implementation, the modules may be wholly or partially integrated into one physical entity or kept physically separate. These modules may all be implemented as software invoked by a processing element, entirely as hardware, or partly as software invoked by a processing element and partly as hardware. For example, the x module may be a separately arranged processing element, may be integrated into a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code whose function is invoked and executed by a processing element of the apparatus; the other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of invoking program code. For another example, these modules may be integrated together and implemented in the form of a system-on-chip (SoC).
The storage medium of the present invention stores thereon a computer program that, when executed by a processor, implements the above-described multi-core based convolutional neural network acceleration method. Preferably, the storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
As shown in fig. 9, in one embodiment, the terminal of the present invention includes a processor 31 and a memory 32.
The memory 32 is used for storing computer programs.
Preferably, the memory 32 comprises: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The processor 31 is connected to the memory 32, and is configured to execute the computer program stored in the memory 32, so as to enable the terminal to execute the above-mentioned multi-core-based convolutional neural network acceleration method.
Preferably, the processor 31 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In summary, the multi-core-based convolutional neural network acceleration method and system, the storage medium and the terminal of the present invention save the bandwidth consumption of dynamic memory during convolutional neural network operation: taking the 4-convolution-kernel mode as an example, at the same data processing speed, 75% of the input image data bandwidth can be saved. The processing speed of the convolutional neural network is improved: for the same hardware data bandwidth, taking the 3D vector dot product as an example, the processing speed can be increased to 300% of the original. The dynamic power consumption of the convolutional neural network is reduced: taking 4 convolution kernels with 3D vector dot products as an example, the operation time is reduced to 33% of the original, the input image bandwidth is reduced to 25% of the original, and the dynamic power consumption is reduced by 85%. The processing speed of the convolutional neural network in embedded products is optimized; the method has a clear architecture, a clear division of labor, easy implementation and a simple flow, and can be widely applied to the Internet of Things, wearable devices and vehicle-mounted devices. The invention thus effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A multi-core-based convolutional neural network acceleration method is characterized by comprising the following steps:
splitting a layer of convolutional neural network into at least two subtasks, each subtask corresponding to a convolution kernel; the convolution kernels are connected in series, so that input data can be transmitted in series between the convolution kernels;
based on each convolution kernel, executing a first preset number of vector dot product operations in parallel, wherein each vector dot product operation comprises a second preset number of multiplication operations, and the product of the first preset number and the second preset number equals the number of multiplier-adders in the convolution kernel;
and outputting the vector dot product operation result of each convolution kernel according to the output priority sequence.
2. The multi-core-based convolutional neural network acceleration method of claim 1, wherein the second preset number is 3 to support a 3D vector dot product.
3. The multi-core-based convolutional neural network acceleration method of claim 1, wherein the output priority is determined by the order in which the input data reach the convolution kernels, and the convolution kernel that receives the input data first has a higher output priority than the convolution kernel that receives the input data later.
4. The multi-core-based convolutional neural network acceleration method of claim 1, wherein executing a first preset number of vector dot product operations in parallel based on each convolution kernel comprises the following steps:
acquiring N + M − 1 input data, where N is the first preset number and M is the second preset number;
inputting the 1st to Nth input data to the 1st to Nth multiplier-adders, respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders, respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M−1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders, respectively, to be multiplied by the Mth coefficient;
and accumulating the products at corresponding positions across the M multiplication passes to obtain N vector dot product results.
5. A multi-core-based convolutional neural network acceleration system is characterized by comprising a convolution kernel setting module, a vector dot product module and an output module;
the convolution kernel setting module is used for splitting a layer of convolution neural network into at least two subtasks, and each subtask corresponds to a convolution kernel; the convolution kernels are connected in series, so that input data can be transmitted in series between the convolution kernels;
the vector dot product module is used for executing a first preset number of vector dot product operations in parallel based on each convolution kernel, wherein each vector dot product operation comprises a second preset number of multiplication operations, and the product of the first preset number and the second preset number equals the number of multiplier-adders in the convolution kernel;
and the output module is used for outputting the vector dot product operation result of each convolution kernel according to the output priority order.
6. The multi-core based convolutional neural network acceleration system of claim 5, wherein the second preset number is 3 to support a 3D vector dot product.
7. The multi-core-based convolutional neural network acceleration system of claim 5, wherein the output priority is determined by the order in which the input data reach the convolution kernels, and the convolution kernel that receives the input data first has a higher output priority than the convolution kernel that receives the input data later.
8. The multi-core-based convolutional neural network acceleration system of claim 5, wherein the vector dot product module performs the following steps when executing a first preset number of vector dot product operations in parallel based on each convolution kernel:
acquiring N + M − 1 input data, where N is the first preset number and M is the second preset number;
inputting the 1st to Nth input data to the 1st to Nth multiplier-adders, respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiplier-adders, respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M−1)th input data are input to the (N×M−N+1)th to (N×M)th multiplier-adders, respectively, to be multiplied by the Mth coefficient;
and accumulating the products at corresponding positions across the M multiplication passes to obtain N vector dot product results.
9. A storage medium on which a computer program is stored, which program, when executed by a processor, implements the multi-core based convolutional neural network acceleration method of any one of claims 1 to 4.
10. A terminal comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the memory-stored computer program to cause the terminal to perform the multi-core based convolutional neural network acceleration method of any of claims 1 to 4.
CN201711273248.5A 2017-12-06 2017-12-06 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal Active CN107862378B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711273248.5A CN107862378B (en) 2017-12-06 2017-12-06 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711273248.5A CN107862378B (en) 2017-12-06 2017-12-06 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN107862378A CN107862378A (en) 2018-03-30
CN107862378B true CN107862378B (en) 2020-04-24

Family

ID=61705060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711273248.5A Active CN107862378B (en) 2017-12-06 2017-12-06 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN107862378B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681773B (en) * 2018-05-23 2020-01-10 腾讯科技(深圳)有限公司 Data operation acceleration method, device, terminal and readable storage medium
CN109117940B (en) * 2018-06-19 2020-12-15 腾讯科技(深圳)有限公司 Target detection method, device, terminal and storage medium based on convolutional neural network
US20200005125A1 (en) * 2018-06-27 2020-01-02 International Business Machines Corporation Low precision deep neural network enabled by compensation instructions
CN108920413B (en) * 2018-06-28 2019-08-09 中国人民解放军国防科技大学 Convolutional neural network multi-core parallel computing method facing GPDSP
US11954573B2 (en) * 2018-09-06 2024-04-09 Black Sesame Technologies Inc. Convolutional neural network using adaptive 3D array
CN109740733B (en) * 2018-12-27 2021-07-06 深圳云天励飞技术有限公司 Deep learning network model optimization method and device and related equipment
CN109740747B (en) 2018-12-29 2019-11-12 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN110889497B (en) * 2018-12-29 2021-04-23 中科寒武纪科技股份有限公司 Learning task compiling method of artificial intelligence processor and related product
CN111563586B (en) * 2019-02-14 2022-12-09 上海寒武纪信息科技有限公司 Splitting method of neural network model and related product
CN109886400B (en) * 2019-02-19 2020-11-27 合肥工业大学 Convolution neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof
CN110109646B (en) * 2019-03-28 2021-08-27 北京迈格威科技有限公司 Data processing method, data processing device, multiplier-adder and storage medium
CN110689115B (en) * 2019-09-24 2023-03-31 安徽寒武纪信息科技有限公司 Neural network model processing method and device, computer equipment and storage medium
CN110689121A (en) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Method for realizing neural network model splitting by using multi-core processor and related product
US20220383082A1 (en) * 2019-09-24 2022-12-01 Anhui Cambricon Information Technology Co., Ltd. Neural network processing method and apparatus, computer device and storage medium
CN110738317A (en) * 2019-10-17 2020-01-31 中国科学院上海高等研究院 FPGA-based deformable convolution network operation method, device and system
CN110796245B (en) * 2019-10-25 2022-03-22 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
WO2021081854A1 (en) * 2019-10-30 2021-05-06 华为技术有限公司 Convolution operation circuit and convolution operation method
CN111610963B (en) * 2020-06-24 2021-08-17 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN114399828B (en) * 2022-03-25 2022-07-08 深圳比特微电子科技有限公司 Training method of convolution neural network model for image processing
CN116303108B (en) * 2022-09-07 2024-05-14 芯砺智能科技(上海)有限公司 Weight address arrangement method suitable for parallel computing architecture

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN106203617A (en) * 2016-06-27 2016-12-07 哈尔滨工业大学深圳研究生院 A kind of acceleration processing unit based on convolutional neural networks and array structure
CN106599883A (en) * 2017-03-08 2017-04-26 王华锋 Face recognition method capable of extracting multi-level image semantics based on CNN (convolutional neural network)
CN106845635A (en) * 2017-01-24 2017-06-13 东南大学 CNN convolution kernel hardware design methods based on cascade form
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN106951395A (en) * 2017-02-13 2017-07-14 上海客鹭信息技术有限公司 Towards the parallel convolution operations method and device of compression convolutional neural networks
CN107301456A (en) * 2017-05-26 2017-10-27 中国人民解放军国防科学技术大学 Deep neural network multinuclear based on vector processor speeds up to method
WO2017186829A1 (en) * 2016-04-27 2017-11-02 Commissariat A L'energie Atomique Et Aux Energies Alternatives Device and method for calculating convolution in a convolutional neural network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
WO2017186829A1 (en) * 2016-04-27 2017-11-02 Commissariat A L'energie Atomique Et Aux Energies Alternatives Device and method for calculating convolution in a convolutional neural network
CN106203617A (en) * 2016-06-27 2016-12-07 哈尔滨工业大学深圳研究生院 A kind of acceleration processing unit based on convolutional neural networks and array structure
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN106845635A (en) * 2017-01-24 2017-06-13 东南大学 CNN convolution kernel hardware design methods based on cascade form
CN106951395A (en) * 2017-02-13 2017-07-14 上海客鹭信息技术有限公司 Towards the parallel convolution operations method and device of compression convolutional neural networks
CN106599883A (en) * 2017-03-08 2017-04-26 王华锋 Face recognition method capable of extracting multi-level image semantics based on CNN (convolutional neural network)
CN107301456A (en) * 2017-05-26 2017-10-27 中国人民解放军国防科学技术大学 Deep neural network multinuclear based on vector processor speeds up to method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
The CUDA-CONVNET deep convolutional neural network algorithm; Li Daxia; China Master's Theses Full-text Database, Information Science and Technology; 2016-03-15; pp. 29-30 *
A concise and efficient method for accelerating convolutional neural networks; Liu Jinfeng; Science Technology and Engineering; 2014-11-28; Vol. 14, No. 33; pp. 240-244 *

Also Published As

Publication number Publication date
CN107862378A (en) 2018-03-30

Similar Documents

Publication Publication Date Title
CN107862378B (en) Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal
JP6857286B2 (en) Improved performance of neural network arrays
US11960566B1 (en) Reducing computations for data including padding
EP3349153B1 (en) Convolutional neural network (cnn) processing method and apparatus
EP3480745A1 (en) Hardware implementation of convolution layer of deep neural network
CN111860812B (en) Apparatus and method for performing convolutional neural network training
CN107341541B (en) Apparatus and method for performing full connectivity layer neural network training
EP3480747A1 (en) Single plane filters
CN111476360A (en) Apparatus and method for Winograd transform convolution operation of neural network
US20190034327A1 (en) Accessing prologue and epilogue data
TW202123093A (en) Method and system for performing convolution operation
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN110109646A (en) Data processing method, device and adder and multiplier and storage medium
US20230019151A1 (en) Implementation of pooling and unpooling or reverse pooling in hardware
CN114138231B (en) Method, circuit and SOC for executing matrix multiplication operation
CN111047025B (en) Convolution calculation method and device
CN114764615A (en) Convolution operation implementation method, data processing method and device
CN110245706B (en) Lightweight target detection method for embedded application
JP7387017B2 (en) Address generation method and unit, deep learning processor, chip, electronic equipment and computer program
CN110825311B (en) Method and apparatus for storing data
CN117063182A (en) Data processing method and device
EP4295276A1 (en) Accelerated execution of convolution operation by convolutional neural network
WO2020194465A1 (en) Neural network circuit
CN111832714A (en) Operation method and device
Anand et al. Scaling computation on GPUs using powerlists

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 201203 China (Shanghai) Free Trade Pilot Zone 20A, Zhangjiang Building, 289 Chunxiao Road

Applicant after: Xinyuan Microelectronics (Shanghai) Co., Ltd.

Applicant after: Core chip technology (Shanghai) Co., Ltd.

Address before: 201203 Zhangjiang Building 20A, 560 Songtao Road, Zhangjiang High-tech Park, Pudong New Area, Shanghai

Applicant before: VeriSilicon Microelectronics (Shanghai) Co., Ltd.

Applicant before: Core chip technology (Shanghai) Co., Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant