CN110147252A - Parallel computing method and device for a convolutional neural network - Google Patents

Parallel computing method and device for a convolutional neural network

Info

Publication number
CN110147252A
Authority
CN
China
Prior art keywords
convolution
dsp
pixel
image data
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910348849.0A
Other languages
Chinese (zh)
Inventor
陈海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Deep Blue Technology Shanghai Co Ltd
Original Assignee
Deep Blue Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Deep Blue Technology Shanghai Co Ltd filed Critical Deep Blue Technology Shanghai Co Ltd
Priority to CN201910348849.0A priority Critical patent/CN110147252A/en
Publication of CN110147252A publication Critical patent/CN110147252A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

A parallel computing method and device for a convolutional neural network, for improving the efficiency of convolution operations. The parallel computing method for a convolutional neural network includes: obtaining image data to be convolved, wherein any pixel data in the image data to be convolved is obtained by merging first pixel data located at the position corresponding to that pixel data in original image data with at least one second pixel data separated from it by N pixel data, the first pixel data and the at least one second pixel data being located in the same row, and N being determined by a first convolution stride; and convolving the image data to be convolved with weight data to obtain at least two convolution results.

Description

Parallel computing method and device for a convolutional neural network
Technical field
This application relates to the field of convolutional neural network acceleration, and in particular to a parallel computing method and device for a convolutional neural network.
Background
As deep learning technology continues to mature, convolutional neural networks (Convolutional Neural Network, CNN) are widely used in fields such as computer vision, speech recognition, and natural language processing. The core operation of a convolutional neural network is convolution.
Currently, the basic unit for convolution computation is the processing element (Processing Element, PE). A PE includes multiple digital signal processing (Digital Signal Processing, DSP) blocks, and one DSP can perform the convolution operation for one image pixel; for example, a DSP can perform the convolution operation for an image pixel with a bit width of 16 bits (bit).
As more accurate deep learning models are developed, many application problems such as target detection and image recognition only need INT8 or lower fixed-point precision to maintain acceptable recognition accuracy. Compared with an INT16 data bit width, the bit width is halved and the data storage space is also halved. Under the same hardware resources, if the traditional computation described above continues to be used, the operation efficiency is low and storage resources are significantly wasted.
Summary of the invention
Embodiments of the present application provide a parallel computing method and device for a convolutional neural network, for improving the operation efficiency of the convolutional neural network.
In a first aspect, the present application provides a parallel computing method for a convolutional neural network, including: obtaining image data to be convolved, wherein any pixel data in the image data to be convolved is obtained by merging first pixel data located at the position corresponding to that pixel data in original image data with at least one second pixel data separated from it by N pixel data, the first pixel data and the at least one second pixel data being located in the same row, and N being determined by a first convolution stride; and convolving the image data to be convolved with weight data to obtain at least two convolution results.
In the embodiments of the present application, since any pixel data in the image data to be convolved is obtained by merging the first pixel data at the corresponding position in the original image data with at least one second pixel data, performing a convolution operation on that pixel data actually performs the convolution operation on at least two pixel data of the original image data. Compared with the prior art, in which only the convolution operation of one pixel data is performed per clock cycle, this improves operation efficiency.
In a possible design, the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data;
obtaining the image data to be convolved includes: merging every two pixel data that are separated by N pixel data in each row of the original image data into one pixel data, to obtain rearranged image data; and obtaining the image data to be convolved from the rearranged image data.
In the embodiments of the present application, the original image data may first be rearranged and the image data to be convolved then obtained from it; alternatively, when a convolution operation needs to be performed, the first pixel data and the second pixel data may be read directly from the original image data. Those of ordinary skill in the art can choose according to actual needs, and no restriction is imposed here.
In a possible design, the second convolution stride corresponding to the rearranged image data is twice the first convolution stride.
In a possible design, the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data; convolving the image data to be convolved with the weight data to obtain at least two convolution results includes: inputting the i-th first pixel data corresponding to the i-th pixel data in the image data to be convolved into the high bits of a first port of the i-th DSP, inputting the i-th second pixel data separated from it by N pixel data into the low bits of a second port of the i-th digital signal processing (DSP) block, and inputting the i-th weight data corresponding to the i-th pixel data in the weight data into the low bits of a third port of the i-th DSP; traversing i from 1 to M to obtain 2 × M multiply-accumulate results, where M is the number of weight data; and accumulating the 2 × M multiply-accumulate results to obtain two convolution results.
In a possible design, every eight of the M DSPs are cascaded as one group; accumulating the 2 × M multiply-accumulate results to obtain two convolution results includes: adding the multiply-accumulate result of the j-th DSP in each group of eight DSPs to the multiply-accumulate result of the (j+1)-th DSP, and traversing j from 1 to 8 to obtain M/8 intermediate accumulated values; and accumulating the M/8 intermediate accumulated values with a DSP other than the M DSPs to obtain two convolution results.
In the embodiments of the present application, one group of eight DSPs plus the DSP other than the M DSPs, that is, nine DSPs in total, can implement 2 × 8 multiply-add operations, an operation efficiency of 16/9 = 1.77; compared with the prior art, in which one pixel data operation is completed per clock cycle, the operation efficiency is improved by nearly a factor of two.
In a possible design, the time interval between the moment at which the (j+1)-th DSP outputs its multiply-accumulate result and the moment at which the j-th DSP outputs its multiply-accumulate result is a first preset interval; the time interval between the moment at which the k-th group of DSPs outputs its intermediate accumulated value and the moment at which the (k+1)-th group of DSPs outputs its intermediate accumulated value is a second preset time interval, where k is any integer from 1 to M/8.
In the embodiments of the present application, since multiplication and addition are performed in the DSP, each group of DSPs is delayed by one beat, that is, one clock cycle, relative to the previous group, and within a group each DSP is delayed by one beat relative to the previous DSP.
In a possible design, the convolution stride corresponding to the rearranged image data is twice the convolution stride corresponding to the original image data.
In a second aspect, an embodiment of the present application provides a parallel computing device for a convolutional neural network, including:
a memory, configured to store instructions; and
a processor, configured to read the instructions in the memory and perform the following process:
obtaining image data to be convolved, wherein any pixel data in the image data to be convolved is obtained by merging first pixel data located at the position corresponding to that pixel data in original image data with at least one second pixel data separated from it by N pixel data, the first pixel data and the at least one second pixel data being located in the same row, and N being determined by a first convolution stride; and
convolving the image data to be convolved with weight data to obtain at least two convolution results.
In a possible design, the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data;
when obtaining the image data to be convolved, the processor is specifically configured to:
merge every two pixel data that are separated by N pixel data in each row of the original image data into one pixel data, to obtain rearranged image data; and
obtain the image data to be convolved from the rearranged image data.
In a possible design, the second convolution stride corresponding to the rearranged image data is twice the first convolution stride.
In a possible design, the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data;
when convolving the image data to be convolved with the weight data to obtain at least two convolution results, the processor is specifically configured to:
input the i-th first pixel data corresponding to the i-th pixel data in the image data to be convolved into the high bits of a first port of the i-th DSP, input the i-th second pixel data separated from it by N pixel data into the low bits of a second port of the i-th digital signal processing (DSP) block, and input the i-th weight data corresponding to the i-th pixel data in the weight data into the low bits of a third port of the i-th DSP, and traverse i from 1 to M to obtain 2 × M multiply-accumulate results, where M is the number of weight data; and
accumulate the 2 × M multiply-accumulate results to obtain two convolution results.
In a possible design, every eight of the M DSPs are cascaded as one group;
when accumulating the 2 × M multiply-accumulate results to obtain two convolution results, the processor is specifically configured to:
add the multiply-accumulate result of the j-th DSP in each group of eight DSPs to the multiply-accumulate result of the (j+1)-th DSP, and traverse j from 1 to 8 to obtain M/8 intermediate accumulated values; and
accumulate the M/8 intermediate accumulated values with a DSP other than the M DSPs to obtain two convolution results.
In a possible design,
the time interval between the moment at which the (j+1)-th DSP outputs its multiply-accumulate result and the moment at which the j-th DSP outputs its multiply-accumulate result is a first preset interval; and
the time interval between the moment at which the k-th group of DSPs outputs its intermediate accumulated value and the moment at which the (k+1)-th group of DSPs outputs its intermediate accumulated value is a second preset time interval, where k is any integer from 1 to M/8.
In a third aspect, an embodiment of the present application provides a parallel computing device for a convolutional neural network, including:
an obtaining unit, configured to obtain image data to be convolved, wherein any pixel data in the image data to be convolved is obtained by merging first pixel data located at the position corresponding to that pixel data in original image data with at least one second pixel data separated from it by N pixel data, the first pixel data and the at least one second pixel data being located in the same row, and N being determined by a first convolution stride; and
a convolution unit, configured to convolve the image data to be convolved with weight data to obtain at least two convolution results.
In a possible design, the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data;
the obtaining unit is configured to:
merge every two pixel data that are separated by N pixel data in each row of the original image data into one pixel data, to obtain rearranged image data; and
obtain the image data to be convolved from the rearranged image data.
In a possible design, the second convolution stride corresponding to the rearranged image data is twice the first convolution stride.
In a possible design, the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data;
the convolution unit is configured to:
input the i-th first pixel data corresponding to the i-th pixel data in the image data to be convolved into the high bits of a first port of the i-th DSP, input the i-th second pixel data separated from it by N pixel data into the low bits of a second port of the i-th digital signal processing (DSP) block, and input the i-th weight data corresponding to the i-th pixel data in the weight data into the low bits of a third port of the i-th DSP, and traverse i from 1 to M to obtain 2 × M multiply-accumulate results, where M is the number of weight data; and
accumulate the 2 × M multiply-accumulate results to obtain two convolution results.
In a possible design, every eight of the M DSPs are cascaded as one group;
the convolution unit is configured to:
add the multiply-accumulate result of the j-th DSP in each group of eight DSPs to the multiply-accumulate result of the (j+1)-th DSP, and traverse j from 1 to 8 to obtain M/8 intermediate accumulated values; and
accumulate the M/8 intermediate accumulated values with a DSP other than the M DSPs to obtain two convolution results.
In a possible design,
the time interval between the moment at which the (j+1)-th DSP outputs its multiply-accumulate result and the moment at which the j-th DSP outputs its multiply-accumulate result is a first preset interval; and
the time interval between the moment at which the k-th group of DSPs outputs its intermediate accumulated value and the moment at which the (k+1)-th group of DSPs outputs its intermediate accumulated value is a second preset time interval, where k is any integer from 1 to M/8.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program, the computer program including program instructions that, when executed by a computer, cause the computer to perform the method of the first aspect or any one possible design of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product that stores a computer program, the computer program including program instructions that, when executed by a computer, cause the computer to perform the method of the first aspect or any one possible design of the first aspect.
In the embodiments of the present application, since any pixel data in the image data to be convolved is obtained by merging the first pixel data at the corresponding position in the original image data with at least one second pixel data, performing a convolution operation on that pixel data actually performs the convolution operation on at least two pixel data of the original image data. Compared with the prior art, in which only the convolution operation of one pixel data is performed per clock cycle, this improves operation efficiency.
Brief description of the drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required in the embodiments of the present invention are briefly described below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of a convolution operation in the prior art;
Fig. 2 is a schematic diagram of a system block diagram provided by an embodiment of the present application;
Fig. 3 shows the data structure and parameter distribution of each layer of the yolov3-tiny neural network in the prior art;
Fig. 4 is a schematic structural diagram of the parallel computation module provided by the present application;
Fig. 5 is a schematic structural diagram of a PE array provided by an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a PE provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a DSP provided by an embodiment of the present application;
Fig. 8 is a schematic flowchart of a parallel computing method for a convolutional neural network provided by an embodiment of the present application;
Fig. 9 is a schematic diagram of rearranged image data provided by an embodiment of the present application;
Fig. 10 is a schematic diagram of the i-th pixel data undergoing convolution, provided by an embodiment of the present application;
Fig. 11 is a schematic diagram of the convolution result obtained after the rearranged image data is convolved with the weight data, provided by an embodiment of the present application;
Fig. 12 shows a parallel computing device for a convolutional neural network provided by an embodiment of the present application;
Fig. 13 shows another parallel computing device for a convolutional neural network provided by an embodiment of the present application.
Detailed description of embodiments
In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings of the embodiments of the present invention.
In the following, some terms used in the embodiments of the present application are explained to facilitate understanding by those skilled in the art.
(1) convolution algorithm
Assume the input is a 6 × 6 × 1 image matrix and the convolution kernel is a 3 × 3 × 1 weight matrix. The convolution operation between the image matrix and the weight matrix proceeds as follows:
Referring to Fig. 1, nine pixels p1, p2, p3, p7, p8, p9, p13, p14, p15 are selected from the 6 × 6 × 1 image matrix, each of the nine pixels is multiplied by the corresponding point in the weight matrix of the convolution kernel, and the products are added to obtain the convolution result V1. The calculation is as follows:
V1 = p1*k1 + p2*k2 + p3*k3 + p7*k4 + p8*k5 + p9*k6 + p13*k7 + p14*k8 + p15*k9;
Similarly, the following can be calculated:
V2 = p2*k1 + p3*k2 + p4*k3 + p8*k4 + p9*k5 + p10*k6 + p14*k7 + p15*k8 + p16*k9;
V3 = p3*k1 + p4*k2 + p5*k3 + p9*k4 + p10*k5 + p11*k6 + p15*k7 + p16*k8 + p17*k9;
V16 = p22*k1 + p23*k2 + p24*k3 + p28*k4 + p29*k5 + p30*k6 + p34*k7 + p35*k8 + p36*k9.
Through the above calculation process, a 4 × 4 × 1 output matrix is obtained. In a specific implementation, to keep the size of the output matrix the same as the size of the image matrix, a zero-padding operation can be performed on the image matrix, that is, zeros are padded around the image matrix so that it becomes an 8 × 8 × 1 image matrix. In this way, after the 8 × 8 × 1 image matrix is convolved with a 3 × 3 × 1 convolution kernel, an output matrix of size 6 × 6 × 1 is obtained.
In the above convolution operation, the number of convolution kernels is 1. Of course, according to actual needs, the number of convolution kernels can also be 3, 16, 32, 64, 128, 255, 256, 512, or another value. The number of channels of the output matrix after convolution is equal to the number of convolution kernels; that is, the depth of the output matrix equals the number of convolution kernels.
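As a minimal illustration of the arithmetic above (not part of the patent text; the function name `conv2d_valid` is chosen here for clarity), the following Python sketch computes the 4 × 4 output of a 3 × 3 kernel sliding over a 6 × 6 input with stride 1:

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Plain sliding-window convolution (no padding), as in the V1..V16 example."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow), dtype=image.dtype)
    for r in range(oh):
        for c in range(ow):
            window = image[r * stride:r * stride + kh, c * stride:c * stride + kw]
            out[r, c] = np.sum(window * kernel)  # V_{r,c} = sum of 9 products
    return out

image = np.arange(1, 37).reshape(6, 6)    # p1..p36
kernel = np.arange(1, 10).reshape(3, 3)   # k1..k9
print(conv2d_valid(image, kernel))        # 4 x 4 output; V1 is out[0, 0]
```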
Refer to Fig. 2, which is a system block diagram provided by an embodiment of the present application. It is specifically a system on chip (System-on-a-Chip, SoC) that includes an embedded processor (Processing System, PS) side and a field-programmable gate array (Field-Programmable Gate Array, FPGA) programmable logic (Programmable Logic, PL) side. The PS side includes an accelerated processing unit (Accelerated Processing Units, APU), a graphics processing unit (Graphics Processing Unit, GPU), and a Double Data Rate (DDR) controller; the PL side includes a parallel computation module (Computation), a capture (Capture) module, and a DDR controller.
With continued reference to Fig. 2, the system may further include a secure digital card (Secure Digital Memory, SD), a PS_DDR memory, a PL_DDR memory, and a camera. The weight parameters and configuration parameters of the trained convolutional neural network are stored in the SD card, and each parameter is 8 bits. After the system is powered up, the APU loads the weight parameters and configuration parameters from the SD card into the PS_DDR memory. The image data collected by the camera is buffered in an internal buffer by the Capture module, and then the parallel computation module is started to begin the computation of the convolutional neural network. After the computation is complete, the extracted feature image data is written into the PS_DDR memory through the AXI4 (Advanced eXtensible Interface) bus, and the APU and GPU obtain the original image data and the feature image data from the PS_DDR memory for post-processing and display.
In the embodiment of the present application, the APU can load the weight parameters and configuration parameters from the SD card without importing them through a personal computer (Personal Computer, PC), so that computation can be performed completely independent of a PC, which simplifies the overall system architecture. In addition, it should be noted that the Capture module on the PL side is not mandatory, because the original image data can be obtained from a server, stored in the SD card, and retrieved from the SD card when a convolution operation needs to be performed. Thus, when the Capture module is not required, the camera is also not required. Therefore, those of ordinary skill in the art can adapt the system architecture shown in Fig. 2 according to actual needs and the technical solution proposed by this application.
Next, the parallel computation module in the system architecture shown in Fig. 2 is described in detail. Taking the feature extraction of the yolov3-tiny neural network as an example, refer to Fig. 3, which shows the data structure and parameter distribution of each layer of the yolov3-tiny neural network. As can be seen from Fig. 3, the network has 20 layers in total, including 13 convolutional layers, 6 pooling layers, and 1 upsampling (upsample) layer, and the convolution multiply-add operations of the convolutional layers total 5,564,961,792. Since the structure of this convolutional neural network is complex and the amount of computation is huge, in a specific implementation the parallel computation module can be realized by a PE array. For example, the parallel computation module may include an input buffer module (Input Buffer), a PE array, and an output buffer module (Out Buffer); see Fig. 4 for details.
In the embodiment of the present application, since the hardware resources of the PL side contain 1728 DSPs and four external DDR4 x16bit memories, 16 PEs can be arranged in the PE array module according to the principle of matching computation and memory bandwidth; see Fig. 5 for details. Each PE is responsible for the convolution computation in one channel direction, and one PE includes 64 DSPs; see Fig. 6 for details. The 64 DSPs shown in Fig. 6 use a cascade structure, that is, the 64 DSPs are divided into 8 groups, and every 8 DSPs are cascaded as one group. Such a cascade structure reduces the use of adder trees and also reduces the consumption of FPGA interconnect resources.
The DSP is introduced next. Each DSP is responsible for the multiply-accumulate operation of one pixel data in the depth direction. The internal structure of the DSP is shown in Fig. 7. As an example, the DSP shown in Fig. 7 includes four ports, a pre-adder, a multiplier, and an adder, where the four ports can be port A, port B, port C, and port D, and the adder is used to accumulate the multiply-accumulate results of other DSPs. For example, after the input of port A is added to the input of port D by the pre-adder, the sum can be multiplied by the input of port B. Here, the bit width of port A can be 27 bits, the bit width of port D can be 27 bits, and the bit width of port B can be 18 bits. Of course, with the continuous development of DSP technology, the bit widths of the DSP ports may take other values; this is only an example and does not limit the bit widths of the DSP ports.
The technical solution provided by the embodiments of the present application is introduced below with reference to the accompanying drawings. In the following introduction, the technical solution provided by this application is applied to the system architecture shown in Fig. 2 as an example.
Referring to Fig. 8, an embodiment of the present application provides a parallel computing method for a convolutional neural network. The flow of the method is described as follows:
S801: Obtain image data to be convolved, wherein any pixel data in the image data to be convolved is obtained by merging first pixel data located at the position corresponding to that pixel data in the original image data with at least one second pixel data separated from it by N pixel data; the first pixel data and the at least one second pixel data are located in the same row, and N is determined by a first convolution stride.
In the embodiment of the present application, since any pixel data in the image data to be convolved is obtained by merging the first pixel data at the corresponding position in the original image data with at least one second pixel data separated from it by N pixel data, when that pixel data is convolved with the corresponding weight data, the convolution of two pixel data of the original image data is actually performed at the same time. Compared with the prior art, in which the convolution of only one pixel data can be performed at a time, this improves the efficiency of the convolution operation.
In the embodiment of the present application, N is determined by the first convolution stride; if the first convolution stride is denoted by S, then N = S - 1.
As an example, if S = 1, then N = S - 1 = 1 - 1 = 0, which means that the first pixel data and the at least one second pixel data are two adjacent pixel data. Take an original image of size 4 × 4 × 3 and a convolution kernel of size 3 × 3 × 3 as an example: if the first pixel data is pixel data 1 in the original image data, the at least one second pixel data can be pixel data 4 in the original image data, or pixel data 4 and 7 in the original image data; if the first pixel data is pixel data 4 in the original image data, the at least one second pixel data can be pixel data 7 in the original pixel data, or pixel data 7 and 10 in the original image data.
As another example, if S = 2, then N = S - 1 = 2 - 1 = 1, which means that the first pixel data and the at least one second pixel data are not adjacent and are separated by one pixel data. Continuing the above example, if the first pixel data is pixel data 1 in the original image data, the at least one second pixel data can be pixel data 7 in the original image data, or pixel data 7 and 10 in the original image data; if the first pixel data is pixel data 4 in the original image data, the at least one second pixel data can be pixel data 10 in the original image data, or pixel data 10 and 13 in the original image data.
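To make the pairing rule concrete, the following sketch (an illustration, not from the patent; it assumes one channel row is numbered 1, 4, 7, 10, 13, 16 as in Fig. 9, and only considers merging pairs) lists which pixel data are merged for a given stride:

```python
def merged_pairs(row, stride):
    """For each first pixel data in a channel row, return the second pixel data
    merged with it; N = stride - 1 pixel data are skipped between the two."""
    pairs = []
    for start in range(len(row) - stride):
        pairs.append((row[start], row[start + stride]))
    return pairs

row = [1, 4, 7, 10, 13, 16]            # one channel row, numbered as in Fig. 9
print(merged_pairs(row, stride=1))     # [(1, 4), (4, 7), (7, 10), (10, 13), (13, 16)]
print(merged_pairs(row, stride=2))     # [(1, 7), (4, 10), (7, 13), (10, 16)]
```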
In the following description, an example is used in which the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data.
In the embodiment of the present application, the ways of obtaining the image data to be convolved include, but are not limited to, the following two, which are introduced separately below:
Mode one
Merge every two pixel data that are separated by N pixel data in each row of the original image data into one pixel data, to obtain rearranged image data; then obtain the image data to be convolved from the rearranged image data. A sketch of this rearrangement is given after the next two paragraphs.
In a specific implementation, refer to Fig. 9. Take S = 1 and the first row of the original image data as an example. Under this assumption, two adjacent pixel data in the row are merged into one pixel data. Specifically, in the first row, pixel data 1 and 4 are merged into pixel data (1, 4), pixel data 4 and 7 are merged into pixel data (4, 7), pixel data 7 and 10 are merged into pixel data (7, 10), pixel data 10 and 13 are merged into pixel data (10, 13), and pixel data 13 and 16 are merged into pixel data (13, 16). Following the above example, for the second row in the depth direction, pixel data 2 and 5 are merged into pixel data (2, 5), pixel data 5 and 8 are merged into pixel data (5, 8), pixel data 8 and 11 are merged into pixel data (8, 11), pixel data 14 and 17 are merged into pixel data (14, 17), and so on, until the original image data is converted into 3 × 3 × 3 rearranged image data.
After the rearranged image data is obtained, the image data to be convolved is obtained from the rearranged image data. In the embodiment of the present application, since one PE array includes 16 PEs and one PE is responsible for the convolution computation in one channel direction, the data to be convolved that is obtained is the image data, in that channel direction, corresponding to the initial position of the convolution kernel on the rearranged image, or the image data corresponding to the position after the convolution kernel has moved once by the convolution stride.
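The sketch below illustrates Mode one for a single 8-bit channel row. It assumes the two pixels are packed into one 16-bit word for illustration; the patent only requires that the two pixel data travel together to the DSP ports, and the exact word layout is an assumption here:

```python
def rearrange_row(row, stride):
    """Merge every two pixels separated by (stride - 1) pixels into one packed value."""
    n = stride - 1                             # pixel data skipped between the pair
    packed = []
    for i in range(len(row) - (n + 1)):
        first, second = row[i], row[i + n + 1]
        packed.append((first << 8) | second)   # two 8-bit pixels in one 16-bit word
    return packed

row = [1, 4, 7, 10, 13, 16]
print([hex(v) for v in rearrange_row(row, stride=1)])
# ['0x104', '0x407', '0x70a', '0xa0d', '0xd10']  -> (1,4), (4,7), (7,10), (10,13), (13,16)
```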
Mode two
In the embodiment of the present application, the original image data may also be left unconverted; instead, when a convolution calculation is performed, the first pixel data at the position of the any pixel data and the at least one second pixel data separated from it by N pixel data are read directly from the original image data.
Those of ordinary skill in the art can choose the implementation in Mode one or Mode two according to actual needs, and no restriction is imposed here.
S802: Convolve the image data to be convolved with the weight data to obtain at least two convolution results. The at least two convolution results are two adjacent pixel data in the same row of the convolution result obtained after the original image data is convolved with the weight data.
In the embodiment of the present application, one DSP is responsible for the convolution operation of one pixel data in the depth direction, that is, one DSP corresponds to one pixel data. Therefore, in a specific implementation, taking the i-th pixel data as an example, the i-th first pixel data corresponding to the i-th pixel data in the image data to be convolved is input to the high bits of the first port of the i-th DSP, the i-th second pixel data separated from it by N pixel data is input to the low bits of the second port of the i-th DSP, and the i-th weight data corresponding to the i-th pixel data in the weight data is input to the low bits of the third port of the i-th DSP. Traversing i from 1 to M yields 2 × M multiply-accumulate results, where M is the number of weight data; the 2 × M multiply-accumulate results are then accumulated to obtain two convolution results. The first port can be port A shown in Fig. 7, the second port can be port D shown in Fig. 7, and the third port can be port B shown in Fig. 7.
Continuing the above example, assume that each pixel data in the original image data is 8 bits wide and the weight data is also 8 bits wide, and take the i-th pixel data (1, 4) in the image data to be convolved as an example. The 8 bits corresponding to pixel data 1 are input to the high 8 bits of port A, with the low 19 bits padded with zeros; the 8-bit data corresponding to pixel data 4 is input to the low 8 bits of port D, with the high 19 bits padded with zeros; and the 8-bit weight data corresponding to pixel data 1 is input to the low 8 bits of port B, with the high 10 bits padded with zeros. In this way, the operation A × B + D × B is realized through the pre-adder and multiplier of the DSP; see Fig. 10 for details.
In the embodiment of the present application, the above i-th pixel data (1, 4) can be understood as, in one channel direction, the point corresponding to the convolution kernel at its first position on the original image data together with the corresponding point after the kernel has moved by the convolution stride to its second position. In this way, the convolution of two pixel data is performed at once, which improves the efficiency of the convolution operation.
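The following Python sketch models the packing described above in unsigned arithmetic. It is a simplification: real INT8 operands need sign-handling corrections, and the 27/27/18-bit port widths are the example values given for Fig. 7:

```python
def packed_mac(p_high, p_low, w):
    """Model of one DSP slice computing two 8-bit products with a single multiplier.

    Port A carries p_high in its top 8 bits (19 low zeros), port D carries p_low
    in its low 8 bits, and port B carries the 8-bit weight. The pre-adder forms
    A + D, and the multiplier yields (p_high << 19) * w + p_low * w, so the two
    products occupy disjoint bit fields of the wide result.
    """
    a = p_high << 19                  # 27-bit port A
    d = p_low                         # 27-bit port D
    b = w                             # 18-bit port B
    product = (a + d) * b             # pre-adder, then multiplier: A*B + D*B
    low = product & 0xFFFF            # p_low * w
    high = (product >> 19) & 0xFFFF   # p_high * w
    return high, low

print(packed_mac(1, 4, 9))            # (9, 36): 1*9 and 4*9 from one multiplication
```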
By traversing i from 1 to M, 2 × M multiply-accumulate results are obtained. After the 2 × M multiply-accumulate results are accumulated, two convolution results can be obtained. The specific implementation process is as follows:
the multiply-accumulate result of the j-th DSP in each group of eight DSPs is added to the multiply-accumulate result of the (j+1)-th DSP, and j is traversed from 1 to 8 to obtain M/8 intermediate accumulated values;
the M/8 intermediate accumulated values are accumulated by a DSP other than the M DSPs to obtain two convolution results.
In the embodiment of the present application, since the DSP contains an adder, the accumulation of DSP operation results can be implemented inside the DSP. However, only a 3-bit interval is kept between the high-order term and the low-order term in the DSP multiply-accumulate result, so to guarantee that the low-order term does not overflow, each DSP can accumulate at most 8 product terms, and an additional DSP is needed to continue the accumulation after the 8 product terms. Therefore, in the embodiment of the present application, after each cascade of DSPs has accumulated 8 product terms, a DSP other than the M DSPs is used to further accumulate the M/8 intermediate accumulated values.
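A sketch of this cascade, reusing the hypothetical packed representation above: each group of eight DSPs forms a running sum of packed products, and one extra accumulator DSP sums the M/8 group outputs before the two convolution results are unpacked (guard-bit and sign handling are simplified):

```python
def cascade_convolve(first_pixels, second_pixels, weights):
    """Two convolution results from M packed multiply-accumulates (M a multiple of 8)."""
    M = len(weights)
    packed_partials = []
    for g in range(M // 8):                       # each group of eight cascaded DSPs
        acc = 0
        for j in range(8):                        # DSP j feeds its result to DSP j+1
            i = g * 8 + j
            acc += ((first_pixels[i] << 19) + second_pixels[i]) * weights[i]
        packed_partials.append(acc)
    total = sum(packed_partials)                  # the extra accumulator DSP
    low = total & 0x7FFFF                         # sum of second_pixel * weight terms
    high = total >> 19                            # sum of first_pixel * weight terms
    return high, low

fp, sp, w = [1] * 8, [2] * 8, [3] * 8
print(cascade_convolve(fp, sp, w))                # (24, 48): 8*1*3 and 8*2*3
```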
Based on the above analysis, in the embodiment of the present application, 9 DSPs can perform 8 × 2 INT8 multiply-add operations. Compared with computing the multiply-add operation of 1 pixel data per clock cycle, the operation performance is improved by 16/9 = 1.77 times; that is, the computing performance of the convolutional neural network can be improved by nearly a factor of two. Moreover, with all input image data and weight data in INT8, the memory space and memory bandwidth are halved compared with INT16.
It should be noted here that when the bit widths of DSP ports A and D are 27 bits and each pixel data of the original image data is 8 bits, the technical solution provided by this application enables the multiply-add operations of 2 pixel data to be computed in one clock cycle; when the bit widths of the DSP ports increase, or the bit width of each pixel data of the original image data decreases, the multiply-add operations of more than 2 pixel data can be computed in one clock cycle.
It should be noted that, for the rearranged image data, the convolution stride is the second convolution stride, which is twice the first convolution stride. For example, if the first convolution stride is 1, the second convolution stride is 2.
After each PE has processed the convolution operation of the convolution kernel of its channel, the convolution result of the original image data convolved with the weight data is obtained; see Fig. 11 for details. In Fig. 11, the size of the original image data is D × W × H, the number of convolution kernels is P (that is, the number of output channels is P), and the size of each convolution kernel is K × K × D. The output after the convolution operation of the original image data with the convolution kernels is R × C × P, where R = (W + 2p - K)/S + 1, C = (H + 2p - K)/S + 1, p is the number of rows or columns filled by zero padding, p = 1, and S is the first convolution stride. To apply the parallel computing method described above, the original image data is first converted into rearranged image data according to the rearrangement method shown in Fig. 9; at this point the convolution stride needs to be adjusted to the second convolution stride, which is twice the first convolution stride.
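To make the output-size bookkeeping concrete, here is a small helper (hypothetical, using the symbols from Fig. 11; the example input size is an assumption, not taken from the patent) that evaluates R = (W + 2p - K)/S + 1 and C = (H + 2p - K)/S + 1:

```python
def output_size(w, h, k, pad, stride):
    """R x C spatial output of a K x K kernel over a W x H input with zero padding."""
    r = (w + 2 * pad - k) // stride + 1
    c = (h + 2 * pad - k) // stride + 1
    return r, c

# Assumed example: 416 x 416 input, 3 x 3 kernel, p = 1, first stride S = 1.
print(output_size(416, 416, 3, pad=1, stride=1))  # (416, 416)
```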
Further, in the embodiment of the present application, the time interval between the moment at which the (j+1)-th DSP outputs its multiply-accumulate result and the moment at which the j-th DSP outputs its multiply-accumulate result is a first preset interval;
the time interval between the moment at which the k-th group of DSPs outputs its intermediate accumulated value and the moment at which the (k+1)-th group of DSPs outputs its intermediate accumulated value is a second preset time interval, where k is any integer from 1 to M/8.
In the embodiment of the present application, the first preset time interval and the second preset time interval can each be one clock cycle. Since a DSP must complete both a multiplication and an addition, in order for the multiply-accumulate result output by the previous DSP to be accumulated by the next DSP, the next DSP needs to be delayed by one beat, that is, one clock cycle. The output of the last DSP of each group is accumulated by the DSP other than the M DSPs, that is, the accumulating DSP, and the accumulating DSP has one input port; therefore, in order to complete the accumulation of the intermediate accumulated values output by the groups of DSPs, each group of DSPs is delayed by one beat, that is, one clock cycle, relative to the previous group.
Finally, it should be noted that when the parallel computing method for a convolutional neural network provided by this application is applied to the above yolov3-tiny neural network, the depth direction of the first 3 convolutional layers is less than 64, so rearrangement or zero padding is needed in the scheduling of the original image data and the weight data; from the 4th convolutional layer onward, the convolution operation can be performed without additional data scheduling.
The devices provided by the embodiments of the present application are introduced below with reference to the accompanying drawings.
Referring to Fig. 12, a parallel computing device 1200 for a convolutional neural network provided by the present application includes:
a memory 1201, configured to store instructions; and
a processor 1202, configured to read the instructions in the memory and perform the following process:
obtaining image data to be convolved, wherein any pixel data in the image data to be convolved is obtained by merging first pixel data located at the position corresponding to that pixel data in original image data with at least one second pixel data separated from it by N pixel data, the first pixel data and the at least one second pixel data being located in the same row, and N being determined by the first convolution stride; and
convolving the image data to be convolved with weight data to obtain at least two convolution results.
In the embodiment of the present application, the processor 1202 can be a central processing unit (central processing unit, CPU) or an application-specific integrated circuit (application-specific integrated circuit, ASIC), can be one or more integrated circuits for controlling program execution, or can be a baseband chip, and so on. The number of memories 1201 can be one or more, and the memory can be a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk memory, and so on.
In a possible design, the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data;
when obtaining the image data to be convolved, the processor 1202 is specifically configured to:
merge every two pixel data that are separated by N pixel data in each row of the original image data into one pixel data, to obtain rearranged image data; and
obtain the image data to be convolved from the rearranged image data.
In a possible design, the second convolution stride corresponding to the rearranged image data is twice the first convolution stride.
In a possible design, the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data;
when convolving the image data to be convolved with the weight data to obtain at least two convolution results, the processor 1202 is specifically configured to:
input the i-th first pixel data corresponding to the i-th pixel data in the image data to be convolved into the high bits of a first port of the i-th DSP, input the i-th second pixel data separated from it by N pixel data into the low bits of a second port of the i-th digital signal processing (DSP) block, and input the i-th weight data corresponding to the i-th pixel data in the weight data into the low bits of a third port of the i-th DSP, and traverse i from 1 to M to obtain 2 × M multiply-accumulate results, where M is the number of weight data; and
accumulate the 2 × M multiply-accumulate results to obtain two convolution results.
In a possible design, every eight of the M DSPs are cascaded as one group;
when accumulating the 2 × M multiply-accumulate results to obtain two convolution results, the processor 1202 is specifically configured to:
add the multiply-accumulate result of the j-th DSP in each group of eight DSPs to the multiply-accumulate result of the (j+1)-th DSP, and traverse j from 1 to 8 to obtain M/8 intermediate accumulated values; and
accumulate the M/8 intermediate accumulated values with a DSP other than the M DSPs to obtain two convolution results.
In a possible design,
the time interval between the moment at which the (j+1)-th DSP outputs its multiply-accumulate result and the moment at which the j-th DSP outputs its multiply-accumulate result is a first preset interval; and
the time interval between the moment at which the k-th group of DSPs outputs its intermediate accumulated value and the moment at which the (k+1)-th group of DSPs outputs its intermediate accumulated value is a second preset time interval, where k is any integer from 1 to M/8.
Referring to Fig. 13, a parallel computing device 1300 for a convolutional neural network provided by the present application includes:
an obtaining unit 1301, configured to obtain image data to be convolved, wherein any pixel data in the image data to be convolved is obtained by merging first pixel data located at the position corresponding to that pixel data in original image data with at least one second pixel data separated from it by N pixel data, the first pixel data and the at least one second pixel data being located in the same row, and N being determined by the first convolution stride; and
a convolution unit 1302, configured to convolve the image data to be convolved with weight data to obtain at least two convolution results.
In a possible design, the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data;
the obtaining unit 1301 is configured to:
merge every two pixel data that are separated by N pixel data in each row of the original image data into one pixel data, to obtain rearranged image data; and
obtain the image data to be convolved from the rearranged image data.
In a possible design, the second convolution stride corresponding to the rearranged image data is twice the first convolution stride.
In a possible design, the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data;
the convolution unit 1302 is configured to:
input the i-th first pixel data corresponding to the i-th pixel data in the image data to be convolved into the high bits of a first port of the i-th DSP, input the i-th second pixel data separated from it by N pixel data into the low bits of a second port of the i-th digital signal processing (DSP) block, and input the i-th weight data corresponding to the i-th pixel data in the weight data into the low bits of a third port of the i-th DSP, and traverse i from 1 to M to obtain 2 × M multiply-accumulate results, where M is the number of weight data; and
accumulate the 2 × M multiply-accumulate results to obtain two convolution results.
In a possible design, every eight of the M DSPs are cascaded as one group;
the convolution unit 1302 is configured to:
add the multiply-accumulate result of the j-th DSP in each group of eight DSPs to the multiply-accumulate result of the (j+1)-th DSP, and traverse j from 1 to 8 to obtain M/8 intermediate accumulated values; and
accumulate the M/8 intermediate accumulated values with a DSP other than the M DSPs to obtain two convolution results.
In a possible design,
the time interval between the moment at which the (j+1)-th DSP outputs its multiply-accumulate result and the moment at which the j-th DSP outputs its multiply-accumulate result is a first preset interval; and
the time interval between the moment at which the k-th group of DSPs outputs its intermediate accumulated value and the moment at which the (k+1)-th group of DSPs outputs its intermediate accumulated value is a second preset time interval, where k is any integer from 1 to M/8.
The above embodiments merely describe the technical solution of this application in detail; the description of the above embodiments is only intended to help understand the method of the embodiments of the present invention and should not be construed as limiting the embodiments of the present invention. Any changes or substitutions that can readily be conceived by those skilled in the art shall fall within the protection scope of the embodiments of the present invention.

Claims (14)

1. A parallel computing method for a convolutional neural network, characterized in that the method comprises:
obtaining image data to be convolved, wherein any pixel data in the image data to be convolved is obtained by merging first pixel data located at the position corresponding to that pixel data in original image data with at least one second pixel data separated from it by N pixel data, the first pixel data and the at least one second pixel data being located in the same row, and N being determined by a first convolution stride; and
convolving the image data to be convolved with weight data to obtain at least two convolution results.
2. The method according to claim 1, characterized in that the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data;
obtaining the image data to be convolved comprises:
merging every two pixel data that are separated by N pixel data in each row of the original image data into one pixel data, to obtain rearranged image data; and
obtaining the image data to be convolved from the rearranged image data.
3. The method according to claim 2, characterized in that
a second convolution stride corresponding to the rearranged image data is twice the first convolution stride.
4. The method according to claim 1, characterized in that the any pixel data is obtained by merging the first pixel data with one second pixel data separated from it by N pixel data;
convolving the image data to be convolved with the weight data to obtain at least two convolution results comprises:
inputting the i-th first pixel data corresponding to the i-th pixel data in the image data to be convolved into the high bits of a first port of an i-th digital signal processing (DSP) block, inputting the i-th second pixel data separated from it by N pixel data into the low bits of a second port of the i-th DSP, and inputting i-th weight data corresponding to the i-th pixel data in the weight data into the low bits of a third port of the i-th DSP, and traversing i from 1 to M to obtain 2 × M multiply-accumulate results, M being the number of weight data; and
accumulating the 2 × M multiply-accumulate results to obtain two convolution results.
5. The method according to claim 4, characterized in that every eight of the M DSPs are cascaded as one group;
accumulating the 2 × M multiply-accumulate results to obtain two convolution results comprises:
adding the multiply-accumulate result of the j-th DSP in each group of eight DSPs to the multiply-accumulate result of the (j+1)-th DSP, and traversing j from 1 to 8 to obtain M/8 intermediate accumulated values; and
accumulating the M/8 intermediate accumulated values with a DSP other than the M DSPs to obtain two convolution results.
6. The method according to claim 5, characterized in that
the time interval between the moment at which the (j+1)-th DSP outputs its multiply-accumulate result and the moment at which the j-th DSP outputs its multiply-accumulate result is a first preset interval; and
the time interval between the moment at which a k-th group of DSPs outputs an intermediate accumulated value and the moment at which a (k+1)-th group of DSPs outputs an intermediate accumulated value is a second preset time interval, wherein k is any integer from 1 to M/8.
7. a kind of parallel computation unit of convolutional neural networks characterized by comprising
Memory, for storing instruction;
Processor executes following process for reading the instruction in the memory:
Obtain the image data to convolution, wherein any pixel point data in the image data to convolution is by original image As corresponded in data the first pixel number of any pixel point data position according to be separated by N number of pixel number evidence extremely What few second pixel number was obtained according to merging, the first pixel number evidence and at least one described second pixel number evidence Positioned at same a line, N is determined by the first convolution step-length;
The image data to convolution and weighted data are subjected to convolution, obtain at least two convolution results.
8. The device according to claim 7, characterized in that the any pixel point data is obtained by merging the first pixel point data with one second pixel point data separated from it by N pixel point data;
when obtaining the image data to be convolved, the processor is specifically configured to:
merge, in each row of the original image data, every two pixel point data separated by N pixel point data into one pixel point data, to obtain rearranged image data;
obtain the image data to be convolved from the rearranged image data.
9. The device according to claim 8, characterized in that
the second convolution stride corresponding to the rearranged image data is twice the first convolution stride.
10. The device according to claim 7, characterized in that the any pixel point data is obtained by merging the first pixel point data with one second pixel point data separated from it by N pixel point data;
when performing convolution on the image data to be convolved and the weight data to obtain at least two convolution results, the processor is specifically configured to:
input the i-th first pixel point data, corresponding to the i-th pixel point data in the image data to be convolved, into the high bits of a first port of an i-th digital signal processor (DSP), input the i-th second pixel point data, separated from it by N pixel point data, into the low bits of a second port of the i-th DSP, and input the i-th weight data of the weight data, corresponding to the i-th pixel point data, into the low bits of a third port of the i-th DSP; traverse i from 1 to M to obtain 2 × M multiply-accumulate results, wherein M is the number of the weight data;
accumulate the 2 × M multiply-accumulate results to obtain the two convolution results.
11. The device according to claim 10, characterized in that every eight DSPs of the M DSPs are cascaded as one group;
when accumulating the 2 × M multiply-accumulate results to obtain the two convolution results, the processor is specifically configured to:
add the multiply-accumulate result of the j-th DSP in each group of eight DSPs to the multiply-accumulate result of the (j+1)-th DSP, traversing j from 1 to 8, to obtain M/8 intermediate accumulated values;
accumulate the M/8 intermediate accumulated values by DSPs other than the M DSPs to obtain the two convolution results.
12. The device according to claim 11, characterized in that
the time interval between the moment at which the (j+1)-th DSP outputs its multiply-accumulate result and the moment at which the j-th DSP outputs its multiply-accumulate result is a first preset interval;
the time interval between the moment at which the k-th group of DSPs outputs its intermediate accumulated value and the moment at which the (k+1)-th group of DSPs outputs its intermediate accumulated value is a second preset time interval, wherein k is any integer from 1 to M/8.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 6.
14. A computer program product, characterized in that the computer program product stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 6.
CN201910348849.0A 2019-04-28 2019-04-28 A kind of parallel calculating method and device of convolutional neural networks Pending CN110147252A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910348849.0A CN110147252A (en) 2019-04-28 2019-04-28 A kind of parallel calculating method and device of convolutional neural networks

Publications (1)

Publication Number Publication Date
CN110147252A true CN110147252A (en) 2019-08-20

Family

ID=67594002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910348849.0A Pending CN110147252A (en) 2019-04-28 2019-04-28 A kind of parallel calculating method and device of convolutional neural networks

Country Status (1)

Country Link
CN (1) CN110147252A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1506807A (en) * 2002-10-25 2004-06-23 Intel Corp Method and apparatus for parallel right-shift merging of data
US20170347060A1 (en) * 2015-02-19 2017-11-30 Magic Pony Technology Limited Visual Processing Using Sub-Pixel Convolutions
CN107622427A (en) * 2016-07-13 2018-01-23 阿里巴巴集团控股有限公司 The method, apparatus and system of deep learning
CN107341127A (en) * 2017-07-05 2017-11-10 西安电子科技大学 Convolutional neural networks accelerated method based on OpenCL standards
CN107609601A (en) * 2017-09-28 2018-01-19 北京计算机技术及应用研究所 A kind of ship seakeeping method based on multilayer convolutional neural networks
CN108229379A (en) * 2017-12-29 2018-06-29 广东欧珀移动通信有限公司 Image-recognizing method, device, computer equipment and storage medium
CN108154192A (en) * 2018-01-12 2018-06-12 西安电子科技大学 High Resolution SAR terrain classification method based on multiple dimensioned convolution and Fusion Features
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm
CN109063825A (en) * 2018-08-01 2018-12-21 清华大学 Convolutional neural networks accelerator
CN109492615A (en) * 2018-11-29 2019-03-19 中山大学 Crowd density estimation method based on CNN low layer semantic feature density map

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHANG Liang et al., "Convolutional Neural Networks in Image Understanding", Acta Automatica Sinica *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110930290A (en) * 2019-11-13 2020-03-27 东软睿驰汽车技术(沈阳)有限公司 Data processing method and device
CN110930290B (en) * 2019-11-13 2023-07-07 东软睿驰汽车技术(沈阳)有限公司 Data processing method and device
WO2021184143A1 (en) * 2020-03-16 2021-09-23 华为技术有限公司 Data processing apparatus and data processing method
CN113743602A (en) * 2020-05-27 2021-12-03 合肥君正科技有限公司 Method for improving model post-processing speed
CN113743602B (en) * 2020-05-27 2024-05-03 合肥君正科技有限公司 Method for improving post-processing speed of model
CN112200300A (en) * 2020-09-15 2021-01-08 厦门星宸科技有限公司 Convolutional neural network operation method and device
CN112200300B (en) * 2020-09-15 2024-03-01 星宸科技股份有限公司 Convolutional neural network operation method and device
CN112329681A (en) * 2020-11-13 2021-02-05 北京思比科微电子技术股份有限公司 Filtering method applied to fingerprint identification
CN112668708A (en) * 2020-12-28 2021-04-16 中国电子科技集团公司第五十二研究所 Convolution operation device for improving data utilization rate
CN112668708B (en) * 2020-12-28 2022-10-14 中国电子科技集团公司第五十二研究所 Convolution operation device for improving data utilization rate

Similar Documents

Publication Publication Date Title
CN110147252A (en) A kind of parallel calculating method and device of convolutional neural networks
CN106844294B (en) Convolution algorithm chip and communication equipment
CN110050267B (en) System and method for data management
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
CN106951395B (en) Parallel convolution operations method and device towards compression convolutional neural networks
CN108205701B (en) System and method for executing convolution calculation
CN104915322B (en) A kind of hardware-accelerated method of convolutional neural networks
US10459876B2 (en) Performing concurrent operations in a processing element
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN104899182B (en) A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks
CN108229645A (en) Convolution accelerates and computation processing method, device, electronic equipment and storage medium
CN108805266A (en) A kind of restructural CNN high concurrents convolution accelerator
CN108280514A (en) Sparse neural network acceleration system based on FPGA and design method
CN112200300B (en) Convolutional neural network operation method and device
CN111797982A (en) Image processing system based on convolution neural network
CN115880132B (en) Graphics processor, matrix multiplication task processing method, device and storage medium
CN110188869A (en) A kind of integrated circuit based on convolutional neural networks algorithm accelerates the method and system of calculating
CN109284761A (en) A kind of image characteristic extracting method, device, equipment and readable storage medium storing program for executing
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN107516131A (en) Acceleration method and device, electronic equipment and the storage medium of convolutional calculation
CN114461978B (en) Data processing method and device, electronic equipment and readable storage medium
CN110009644B (en) Method and device for segmenting line pixels of feature map
US11995569B2 (en) Architecture to support tanh and sigmoid operations for inference acceleration in machine learning
CN109242091B (en) Image recognition method, device, equipment and readable storage medium
CN111886605B (en) Processing for multiple input data sets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190820