CN110059808A - Data reading method and data reading apparatus for a convolutional neural network - Google Patents

Data reading method and data reading apparatus for a convolutional neural network

Info

Publication number
CN110059808A
CN110059808A (application CN201910547468.5A)
Authority
CN
China
Prior art keywords
matrix
weight
submatrix
image
buffer
Prior art date
Legal status
Granted
Application number
CN201910547468.5A
Other languages
Chinese (zh)
Other versions
CN110059808B (en)
Inventor
陈海波
Current Assignee
DeepBlue AI Chips Research Institute Jiangsu Co Ltd
Original Assignee
DeepBlue AI Chips Research Institute Jiangsu Co Ltd
Priority date
Filing date
Publication date
Application filed by DeepBlue AI Chips Research Institute Jiangsu Co Ltd
Priority to CN201910547468.5A
Publication of CN110059808A
Application granted
Publication of CN110059808B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention discloses a data reading method and a data reading apparatus for a convolutional neural network. The method comprises: a computing unit reads, in a single read from an image buffer, a first submatrix of a first matrix; the computing unit reads, in each read from each weight buffer, a second submatrix of a second matrix; the first matrix is an m*(m+2)*z matrix and the second matrix is an n*(n+1)*z matrix; the total number of bits occupied by the first submatrix, and by the second submatrix, equals the bit width K of the data bus; and the computing unit performs the convolution calculation on the first matrix and the second matrix thus read out to obtain the output image. Because neither the first matrix nor the second matrix needs zero padding on each channel, the amounts of image data and of weight data are each halved compared with the prior art, in which the image matrix and the convolution weights are padded with zeros on each channel. The computation load and latency of the FPGA are therefore also halved, which improves the convolution efficiency of the FPGA.

Description

Data reading method and data reading apparatus for a convolutional neural network
Technical field
The present invention relates to the technical field of field programmable gate array (FPGA) hardware acceleration, and in particular to a data reading method and a data reading apparatus for a convolutional neural network.
Background art
With the continuous development of artificial intelligence (AI), which has evolved from the manual feature engineering of its early days to systems that can learn from massive data, important breakthroughs have been achieved in fields such as machine vision, speech recognition and natural language processing. The convolutional neural network (Convolutional Neural Network, CNN) is increasingly favored in the AI field; it is one of the most representative network structures in deep learning and has been particularly successful in image processing. As networks grow larger and more complex, large amounts of computing resources are needed to train them, so attention has turned to field programmable gate array (Field Programmable Gate Array, FPGA) devices. An FPGA combines the programmability and flexibility of software with the high-throughput, low-latency characteristics of an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), and with its abundant input/output (I/O) interfaces it is also well suited to serve as a CNN hardware accelerator. CNN hardware accelerators provide advanced capabilities such as image classification, object recognition and tracking, face and speech recognition, and natural language processing, bringing advanced intelligent networks into scenarios such as automated production and control, improving the productivity of the relevant industries and providing better services to users.
At present, when an FPGA implements a convolutional neural network hardware accelerator, the most basic problem is implementing the convolution calculation, and to do so the image data and the weight data must first be obtained. However, to improve the read/write efficiency of the data, when the total bit width of the image data on each channel differs from the bit width of the data bus, a corresponding number of zeros is generally appended on each channel so that the per-channel image data after zero padding matches the data-bus width. The same approach is taken when the total bit width of the weight data on each channel of a convolution weight differs from the bit width of the data bus.
However, because a corresponding number of zeros has been appended on each channel of the input image and of the convolution weights, the amounts of image data and weight data both increase, which also increases the computation load and latency of the FPGA and leaves its convolution efficiency low.
Summary of the invention
Embodiments of the present invention provide a data reading method and a data reading apparatus for a convolutional neural network, so as to improve the convolution efficiency of an FPGA.
In a first aspect, an embodiment of the present invention provides a data reading method for a convolutional neural network, applied to a field programmable gate array (FPGA), the FPGA comprising an image buffer, at least one weight buffer and a computing unit, the method comprising:
the image buffer contains a first matrix of an image to be processed, the first matrix being an m*(m+2)*z matrix; each weight buffer of the at least one weight buffer stores at least one convolution weight, each convolution weight corresponding to a second matrix, the second matrix being an n*(n+1)*z matrix; wherein the first matrix being an m*(m+2)*z matrix means that the height of the first matrix is m, its width is m+2 and its depth is z; the second matrix being an n*(n+1)*z matrix means that the height of the second matrix is n, its width is n+1 and its depth is z; and m, n and z are integers greater than or equal to 1;
the computing unit reads, in a single read from the image buffer, a first submatrix of the first matrix, the first submatrix being a P*1*z matrix; wherein P is K/q/z, K is the bit width of the data bus in bits, and q is the number of bits occupied by each element of the first matrix; the total number of bits occupied by the first submatrix equals K; and the first submatrix being a P*1*z matrix means that the width of the first submatrix is P, its height is 1 and its depth is z;
the computing unit reads, in each read from each weight buffer, a second submatrix of the second matrix, the second submatrix being a P*1*z matrix; wherein P is K/q/z and q is the number of bits occupied by each element of the second matrix; the total number of bits occupied by the second submatrix equals K; and the second submatrix being a P*1*z matrix means that the width of the second submatrix is P, its height is 1 and its depth is z;
the computing unit performs the convolution calculation on the first matrix and the second matrix thus read out, obtaining the output image.
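For concreteness, a minimal Python sketch (not part of the patent text; the numeric values are the ones used in the embodiments below) checks this sizing rule:

```python
# Submatrix sizing rule P = K / q / z, with the embodiment values
# (512-bit bus, 16-bit elements, depth 16).
K = 512   # bit width of the data bus
q = 16    # bits per element
z = 16    # depth (number of channels)

P = K // q // z            # width of a first/second submatrix
assert P == 2
assert P * 1 * z * q == K  # one P*1*z submatrix occupies exactly K bits
```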
Optionally, the FPGA is connected to an external memory, and the method further comprises:
the FPGA reads a third matrix of the image to be processed from the external memory, the third matrix being an m*m*z matrix;
the FPGA determines a first storage address in the image buffer;
wherein the first storage address is wr_addr_temp + wr_vcnt*Image_Z/32*2 - Image_Z/32; wr_addr_temp characterizes the f-th third submatrix of the third matrix, wr_vcnt characterizes the row of the third matrix in which the f-th third submatrix lies, and Image_Z characterizes the number of image data in the f-th third submatrix; the third submatrix is a P*1*z matrix, where P is K/q/z and q is the number of bits occupied by each element of the third matrix; the total number of bits occupied by the third submatrix equals K;
the FPGA stores, based on the first storage address, the third submatrix of the third matrix at the first storage address.
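As an illustrative transcription only (assuming integer arithmetic throughout; not the patent's own code), the first storage address can be written as:

```python
def image_write_addr(wr_addr_temp: int, wr_vcnt: int, image_z: int = 32) -> int:
    # First storage address of the f-th third submatrix, which lies in
    # row wr_vcnt of the third matrix; image_z is the number of image
    # data per third submatrix (32 in the embodiments).
    return wr_addr_temp + wr_vcnt * (image_z // 32) * 2 - image_z // 32
```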
Optionally, the method further comprises:
the FPGA reads a fourth matrix of each convolution weight from the external memory, obtaining at least one fourth matrix; the fourth matrix is an n*(n+1)*z matrix obtained by applying zero padding to the convolution weight, each convolution weight being an n*n*z matrix;
the FPGA determines a second storage address of the r-th weight buffer;
wherein the second storage address is (wr_hcnt-1) + 16*(wr_scnt-1) + ((Weight_h+1)/(32/Weight_Z))*Weight_W*(wr_vcnt-1);

wr_hcnt characterizes the y-th fourth submatrix of the e-th fourth matrix of the at least one fourth matrix; Weight_h characterizes the height n of each convolution weight, Weight_Z its depth z, and Weight_W its width n; wr_vcnt characterizes the r-th weight buffer of the at least one weight buffer, the r-th weight buffer being the target weight buffer of the y-th fourth submatrix; wr_scnt characterizes the order of the e-th fourth matrix among all the fourth matrices in the r-th weight buffer; the fourth submatrix is a P*1*z matrix, where P is K/q/z and q is the number of bits occupied by each element of the fourth matrix; the total number of bits occupied by the fourth submatrix equals K;
the FPGA stores, based on the second storage address, the y-th fourth submatrix at the second storage address.
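A matching sketch for the second storage address (again an assumption, with the embodiment constants Weight_h = 3, Weight_Z = 16 and Weight_W = 3 as defaults):

```python
def weight_write_addr(wr_hcnt: int, wr_scnt: int, wr_vcnt: int,
                      weight_h: int = 3, weight_z: int = 16,
                      weight_w: int = 3) -> int:
    # Second storage address of the y-th fourth submatrix inside the
    # r-th weight buffer (wr_vcnt = r).
    per_matrix = ((weight_h + 1) // (32 // weight_z)) * weight_w  # = 6 here
    return (wr_hcnt - 1) + 16 * (wr_scnt - 1) + per_matrix * (wr_vcnt - 1)
```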
Optionally, before the computing unit reads, in a single read from the image buffer, the first submatrix of the first matrix, the method further comprises:

the FPGA applies zero padding to the third matrix, obtaining the first matrix.
Optionally, the computing unit reading, in a single read from the image buffer, the first submatrix of the first matrix comprises:

the computing unit determines a third storage address of the first submatrix;
wherein the third storage address is (rd_scnt-1) + Image_Z/32*(rd_wcnt-1) + (Image_W+2)*Image_Z/32*(rd_hcnt-1) + Image_Z/32*(rd_fcnt-1)*S + floor((rd_fcnt-1)/Image_W)*(Weight_W-1)*Image_Z/32 - addr_temp1;

rd_scnt characterizes the minimum unit of the first matrix, namely one first submatrix; Image_Z characterizes the number of image data in a first submatrix and equals P*z; rd_wcnt characterizes the column of the first matrix in which the data lie; Image_W characterizes the width m of the third matrix; rd_hcnt characterizes the first submatrix in the i-th row of the first matrix; rd_fcnt characterizes the total convolution count; S characterizes a preset stride value; floor((rd_fcnt-1)/Image_W)*(Weight_W-1)*Image_Z/32 characterizes the starting row of the current convolution in the first matrix; and addr_temp1 characterizes the storage address of the 1st first submatrix of the first matrix;
the computing unit reads the first submatrix based on the third storage address.
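A hedged transcription of the third storage address (an illustrative assumption; note that in the worked examples later in the description Image_W is counted in 512-bit words, i.e. 104 for a 208-column image, so that unit is assumed here):

```python
def image_read_addr(rd_scnt: int, rd_wcnt: int, rd_hcnt: int, rd_fcnt: int,
                    addr_temp1: int, image_z: int = 32, image_w: int = 104,
                    weight_w: int = 3, s: int = 1) -> int:
    # Third storage address of a first submatrix; image_w is measured in
    # 512-bit words per the worked examples (104 for 208 columns).
    words = image_z // 32
    return ((rd_scnt - 1)
            + words * (rd_wcnt - 1)
            + (image_w + 2) * words * (rd_hcnt - 1)
            + words * (rd_fcnt - 1) * s
            + ((rd_fcnt - 1) // image_w) * (weight_w - 1) * words
            - addr_temp1)
```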
Optionally, the computing unit reading, in each read from each weight buffer, the second submatrix of the second matrix comprises:

the computing unit determines a fourth storage address of the second submatrix of the e-th second matrix in each weight buffer;
the fourth storage address is (rd_hcnt-1) + RD_HCNT_VALUE_TEMP*(rd_vcnt-1) + addr_temp2;
rd_vcnt characterizes the j-th second matrix of the current weight buffer; rd_hcnt characterizes the h-th second submatrix of the j-th second matrix; RD_HCNT_VALUE_TEMP characterizes the storage address of the 1st second submatrix of the j-th second matrix; and addr_temp2 is used to determine the storage address of the 1st second submatrix to be read for each second matrix in the current weight buffer;
the computing unit reads, according to the fourth storage address, the second submatrix of the e-th second matrix from each weight buffer.
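And the fourth storage address in the same sketch style (an assumption; the row-dependent values of RD_HCNT_VALUE_TEMP and addr_temp2 are given immediately below):

```python
def weight_read_addr(rd_hcnt: int, rd_vcnt: int,
                     rd_hcnt_value_temp: int, addr_temp2: int) -> int:
    # Fourth storage address of a second submatrix within a weight buffer.
    return (rd_hcnt - 1) + rd_hcnt_value_temp * (rd_vcnt - 1) + addr_temp2
```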
Optionally, for the first row of the output image, addr_temp1 = 0; for the other rows of the output image, addr_temp1 = Image_Z/32*(Image_W+2).
Optionally, for the first and last rows of the output image, rd_hcnt = 1 ~ ((Weight_h+1)/(32/Weight_Z))*(Weight_W-1) and RD_HCNT_VALUE_TEMP = ((Weight_h+1)/(32/Weight_Z))*(Weight_W-1); for the other rows of the output image, excluding the first and last rows, rd_hcnt = 1 ~ ((Weight_h+1)/(32/Weight_Z))*Weight_W and RD_HCNT_VALUE_TEMP = ((Weight_h+1)/(32/Weight_Z))*Weight_W;

for the first row of the output image, addr_temp2 = (Weight_h-1)*rd_vcnt = 2*rd_vcnt; for the last row of the output image, addr_temp2 = (Weight_h-1)*(rd_vcnt-1) = 2*(rd_vcnt-1); and for the other rows of the output image, addr_temp2 = 0.
In a second aspect, an embodiment of the present invention provides a data reading apparatus for a convolutional neural network, the data reading apparatus comprising an FPGA; wherein the FPGA comprises an image buffer, at least one weight buffer and a computing unit;
the image buffer is configured to store a first matrix of an image to be processed, the first matrix being an m*(m+2)*z matrix; wherein the first matrix being an m*(m+2)*z matrix means that the height of the first matrix is m, its width is m+2 and its depth is z; and m and z are integers greater than or equal to 1;
each weight buffer of the at least one weight buffer is configured to store at least one convolution weight, each convolution weight corresponding to a second matrix, the second matrix being an n*(n+1)*z matrix; wherein the second matrix being an n*(n+1)*z matrix means that the height of the second matrix is n, its width is n+1 and its depth is z; and n is an integer greater than or equal to 1;
the computing unit is configured to read, in a single read from the image buffer, a first submatrix of the first matrix, the first submatrix being a P*1*z matrix; wherein P is K/q/z, K is the bit width of the data bus in bits, and q is the number of bits occupied by each element of the first matrix; the total number of bits occupied by the first submatrix equals K; and the first submatrix being a P*1*z matrix means that its width is P, its height is 1 and its depth is z;
the computing unit is further configured to read, in each read from each weight buffer, a second submatrix of the second matrix, the second submatrix being a P*1*z matrix; wherein P is K/q/z and q is the number of bits occupied by each element of the second matrix; the total number of bits occupied by the second submatrix equals K; and the second submatrix being a P*1*z matrix means that its width is P, its height is 1 and its depth is z;
the computing unit is further configured to perform the convolution calculation on the first matrix and the second matrix thus read out, obtaining the output image.
Optionally, the FPGA is connected to an external memory, and the FPGA is configured to:
read a third matrix of the image to be processed from the external memory, the third matrix being an m*m*z matrix;
determine a first storage address in the image buffer;
wherein the first storage address is wr_addr_temp + wr_vcnt*Image_Z/32*2 - Image_Z/32; wr_addr_temp characterizes the f-th third submatrix of the third matrix, wr_vcnt characterizes the row of the third matrix in which the f-th third submatrix lies, and Image_Z characterizes the number of image data in the f-th third submatrix; the third submatrix is a P*1*z matrix, where P is K/q/z and q is the number of bits occupied by each element of the third matrix; the total number of bits occupied by the third submatrix equals K;
store, based on the first storage address, the third submatrix of the third matrix at the first storage address.
Optionally, the FPGA is further configured to:
read a fourth matrix of each convolution weight from the external memory, obtaining at least one fourth matrix; the fourth matrix is an n*(n+1)*z matrix obtained by applying zero padding to the convolution weight, each convolution weight being an n*n*z matrix;
determine a second storage address of the r-th weight buffer;
wherein the second storage address is (wr_hcnt-1) + 16*(wr_scnt-1) + ((Weight_h+1)/(32/Weight_Z))*Weight_W*(wr_vcnt-1);

wr_hcnt characterizes the y-th fourth submatrix of the e-th fourth matrix of the at least one fourth matrix; Weight_h characterizes the height n of each convolution weight, Weight_Z its depth z, and Weight_W its width n; wr_vcnt characterizes the r-th weight buffer of the at least one weight buffer, the r-th weight buffer being the target weight buffer of the y-th fourth submatrix; wr_scnt characterizes the order of the e-th fourth matrix among all the fourth matrices in the r-th weight buffer; the fourth submatrix is a P*1*z matrix, where P is K/q/z and q is the number of bits occupied by each element of the fourth matrix; the total number of bits occupied by the fourth submatrix equals K;
store, based on the second storage address, the y-th fourth submatrix at the second storage address.
Optionally, before the computing unit reads, in a single read from the image buffer, the first submatrix of the first matrix, the FPGA is specifically configured to:

apply zero padding to the third matrix, obtaining the first matrix.
Optionally, when reading, in a single read from the image buffer, the first submatrix of the first matrix, the computing unit is specifically configured to:

determine a third storage address of the first submatrix;
wherein the third storage address is (rd_scnt-1) + Image_Z/32*(rd_wcnt-1) + (Image_W+2)*Image_Z/32*(rd_hcnt-1) + Image_Z/32*(rd_fcnt-1)*S + floor((rd_fcnt-1)/Image_W)*(Weight_W-1)*Image_Z/32 - addr_temp1;

rd_scnt characterizes the minimum unit of the first matrix, namely one first submatrix; Image_Z characterizes the number of image data in a first submatrix and equals P*z; rd_wcnt characterizes the column of the first matrix in which the data lie; Image_W characterizes the width m of the third matrix; rd_hcnt characterizes the first submatrix in the i-th row of the first matrix; rd_fcnt characterizes the total convolution count; S characterizes a preset stride value; floor((rd_fcnt-1)/Image_W)*(Weight_W-1)*Image_Z/32 characterizes the starting row of the current convolution in the first matrix; and addr_temp1 characterizes the storage address of the 1st first submatrix of the first matrix;
read the first submatrix based on the third storage address.
Optionally, when reading, in each read from each weight buffer, the second submatrix of the second matrix, the computing unit is specifically configured to:

determine a fourth storage address of the second submatrix of the e-th second matrix in each weight buffer;
the fourth storage address is (rd_hcnt-1) + RD_HCNT_VALUE_TEMP*(rd_vcnt-1) + addr_temp2;
rd_vcnt characterizes the j-th second matrix of the current weight buffer; rd_hcnt characterizes the h-th second submatrix of the j-th second matrix; RD_HCNT_VALUE_TEMP characterizes the storage address of the 1st second submatrix of the j-th second matrix; and addr_temp2 is used to determine the storage address of the 1st second submatrix to be read for each second matrix in the current weight buffer;
read, according to the fourth storage address, the second submatrix of the e-th second matrix from each weight buffer.
Optionally, for the first row of the output image, addr_temp1 = 0; for the other rows of the output image, addr_temp1 = Image_Z/32*(Image_W+2).
Optionally, for the first and last rows of the output image, rd_hcnt = 1 ~ ((Weight_h+1)/(32/Weight_Z))*(Weight_W-1) and RD_HCNT_VALUE_TEMP = ((Weight_h+1)/(32/Weight_Z))*(Weight_W-1); for the other rows of the output image, excluding the first and last rows, rd_hcnt = 1 ~ ((Weight_h+1)/(32/Weight_Z))*Weight_W and RD_HCNT_VALUE_TEMP = ((Weight_h+1)/(32/Weight_Z))*Weight_W;

for the first row of the output image, addr_temp2 = (Weight_h-1)*rd_vcnt = 2*rd_vcnt; for the last row of the output image, addr_temp2 = (Weight_h-1)*(rd_vcnt-1) = 2*(rd_vcnt-1); and for the other rows of the output image, addr_temp2 = 0.
In a third aspect, an embodiment of the present invention provides a data reading apparatus for a convolutional neural network, comprising a processor and a memory; wherein the memory is configured to store one or more computer programs; and when the one or more computer programs stored in the memory are executed by the processor, the data reading apparatus of the convolutional neural network implements the method of the first aspect or of any possible design of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of the first aspect or of any possible design of the first aspect.
In a fifth aspect, an embodiment of the present invention provides a computer program product storing a computer program, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of the first aspect or of any possible design of the first aspect.
The present invention has the following beneficial effects:

In the technical solutions of the embodiments of the invention, the image buffer contains a first matrix of the image to be processed, the first matrix being an m*(m+2)*z matrix; each weight buffer of the at least one weight buffer stores at least one convolution weight, and the second matrix is an n*(n+1)*z matrix; the first matrix being an m*(m+2)*z matrix means that its height is m, its width is m+2 and its depth is z; the second matrix being an n*(n+1)*z matrix means that its height is n, its width is n+1 and its depth is z; and m, n and z are integers greater than or equal to 1. The computing unit reads, in a single read from the image buffer, a first submatrix of the first matrix, the first submatrix being a P*1*z matrix, where P is K/q/z, K is the bit width of the data bus in bits, and q is the number of bits occupied by each element of the first matrix; the total number of bits occupied by the first submatrix equals K; and the first submatrix being a P*1*z matrix means that its width is P, its height is 1 and its depth is z. The computing unit reads, in each read from each weight buffer, a second submatrix of the second matrix, the second submatrix likewise being a P*1*z matrix whose total bit count equals K. The computing unit performs the convolution calculation on the first matrix and the second matrix thus read out, obtaining the output image. In this way, neither the first matrix nor the second matrix needs zero padding on each channel; compared with the prior art, in which the image matrix and the convolution weights are padded with zeros on each channel, the amounts of image data and weight data are each halved, so the computation load and latency of the FPGA are also halved, which improves the convolution efficiency of the FPGA.
Brief description of the drawings
Fig. 1 is a schematic diagram of a data reading method for a convolutional neural network in the prior art;

Fig. 2 is a flow diagram of a data reading method for a convolutional neural network according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of how image data is stored in DDR according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of how image data is stored in DDR and in the image buffer according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of counters computing the storage addresses of image data in the image buffer according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of the arrangement of image-data storage addresses in the image buffer according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of a first matrix according to an embodiment of the present invention;

Fig. 8 is a schematic diagram of how weight data is stored in DDR according to an embodiment of the present invention;

Fig. 9 is a schematic diagram of how weight data is stored in the weight buffers according to an embodiment of the present invention;

Fig. 10 is a schematic diagram of counters computing the storage addresses of weight data in the weight buffers according to an embodiment of the present invention;

Fig. 11 is a schematic diagram of the storage addresses of the first submatrices of the first matrix that are convolved with the 1st second matrix to compute the first row of the output image, according to an embodiment of the present invention;

Fig. 12 is a schematic diagram of the storage addresses of the first submatrices of the first matrix that are convolved with the 1st second matrix to compute the second row of the output image, according to an embodiment of the present invention;

Fig. 13 is a schematic diagram of computing the storage addresses of the second submatrices of a second matrix to be read from a weight buffer, according to an embodiment of the present invention;

Fig. 14 is a schematic diagram of counters computing the storage addresses of the second submatrices read from a weight buffer, according to an embodiment of the present invention;

Fig. 15 is a structural schematic diagram of a data reading apparatus for a convolutional neural network according to an embodiment of the present invention;

Fig. 16 is a structural schematic diagram of an FPGA accelerator according to an embodiment of the present invention;

Fig. 17 is a structural schematic diagram of a data reading apparatus for a convolutional neural network according to an embodiment of the present invention;

Fig. 18 is a structural schematic diagram of a data reading apparatus for a convolutional neural network according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The shapes and sizes of the components in the drawings do not reflect actual proportions; they are intended only to illustrate the content of the present invention schematically.
Refer to Fig. 1, which is a schematic diagram of a data reading method for a convolutional neural network in the prior art. In the example of Fig. 1, each image datum and each weight datum has a bit width of 16 bits, and the data bus has a bit width of 512 bits.
In general, to guarantee that the matrix of the output image is the same size as the matrix of the input image (in Fig. 1 the matrix of the input image is 208*208*16: the height, i.e. number of rows, of the matrix is 208, the width, i.e. number of columns, is 208, and the depth is 16), i.e. that the width and height of the output-image matrix equal the width and height of the input-image matrix, a zero padding operation is applied to the input-image matrix: the first and last columns of the matrix are padded with zeros, turning it into a 210*208*16 image matrix. In this way, after convolving the zero-padded 210*208*16 image with 32 convolution weights of size 3*3*16, an output-image matrix of size 208*208*32 is obtained.
The total number of bits of the image data on each channel of the zero-padded image is 16*16 = 256 (the zero-padded image is a 210*208*16 matrix, and each channel of that matrix is a small 1*1*16 matrix, i.e. each channel holds 16 image data). Likewise, the total number of bits of the weight data on each channel of a convolution weight is 16*16 = 256 (a convolution weight is a 3*3*16 matrix, and each channel of it is a small 1*1*16 matrix holding 16 weight data). Both are smaller than the 512-bit width of the data bus. Therefore, to improve the efficiency with which the FPGA reads and writes data, 16 zeros must be appended on each channel of the zero-padded image so that the total per-channel bit width matches the bus width, i.e. each channel of the image becomes a small 1*1*32 matrix; and 16 zeros must be appended on each channel of the convolution weights so that their total per-channel bit width matches the bus width, i.e. each channel of a convolution weight becomes a small 1*1*32 matrix.
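The padding arithmetic described above can be checked with a minimal sketch (values assumed from the example):

```python
# Prior-art per-channel padding: 16 elements of 16 bits = 256 bits,
# which is half the 512-bit bus, so 16 zero elements are appended.
bus_bits = 512
elem_bits = 16
channel_depth = 16                        # elements per channel position

channel_bits = channel_depth * elem_bits  # 256 bits
pad_elems = (bus_bits - channel_bits) // elem_bits
assert channel_bits == 256
assert pad_elems == 16                    # doubles the data volume
```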
Therefore, because 16 zeros are appended on each channel of the input-image matrix and of the convolution-weight matrices, the data volumes of the zero-padded image matrix and of the convolution-weight matrices each double, so the computation load and latency of the FPGA also double, leaving the convolution efficiency of the FPGA low.
To improve the convolution efficiency of the FPGA, an embodiment of the present invention provides a data reading method for a convolutional neural network.
Illustratively, refer to Fig. 2, which is a flow diagram of a data reading method for a convolutional neural network according to an embodiment of the present invention. The method is applied to an FPGA, and the FPGA may comprise an image buffer, at least one weight buffer (16 weight buffers are taken as the example below) and a computing unit. As shown in Fig. 2, the method flow comprises:
S201: the image buffer contains a first matrix of an image to be processed, the first matrix being an m*(m+2)*z matrix; each weight buffer of the at least one weight buffer stores at least one convolution weight, each convolution weight corresponding to a second matrix, the second matrix being an n*(n+1)*z matrix; wherein the first matrix being an m*(m+2)*z matrix means that its height is m, its width is m+2 and its depth is z; the second matrix being an n*(n+1)*z matrix means that its height is n, its width is n+1 and its depth is z; and m, n and z are integers greater than or equal to 1.
Optionally, in an embodiment of the present invention, before step S201 is executed the method further includes:

The FPGA reads image data and weight data from an external memory, stores the image data read from the external memory in the image buffer, and stores the weight data read from the external memory in the 16 weight buffers. The external memory may be a double data rate dynamic random-access memory (Double Data Rate, DDR) or a secure digital memory card (Secure Digital Memory Card, SD); the embodiment of the present invention places no limit on this. DDR is taken as the example of the external memory below.
Illustratively, refer to Fig. 3, which is a schematic diagram of how image data is stored in DDR according to an embodiment of the present invention. As shown in Fig. 3, the FPGA can read a third matrix of the image to be processed from the DDR, the third matrix being an m*m*z matrix (a 208*208*16 third matrix is taken as the example below).
Optionally, since the bit width of the data bus is 512 bits and each element of the third matrix (each image datum) occupies 16 bits, the image data of every two channels of the third matrix can be combined into one 512-bit datum and stored at the same DDR storage address. For example, every two columns of image data in each row of the third matrix can be combined into a 512-bit datum and stored at the same DDR storage address; that is, each third submatrix 2*1*16 of the third matrix (where 2 is the number of columns, i.e. the width, and 1 is the number of rows, i.e. the height) can be stored at the same DDR address. For instance, data1 and data2 of the third matrix can be stored at the same DDR storage address (e.g. addr0). The image data of row 1 of the third matrix thus occupies 208/2 = 104 storage addresses.
The above takes the storage of the image data of row 1 of the third matrix in DDR as an example; rows 2 through 208 of the third matrix can of course be stored in DDR by the same or a similar method as row 1, and the embodiment of the present invention places no limit on this. Assuming here that rows 2 through 208 are stored in DDR by the same method as row 1, the image data of the third matrix occupies 104*208 = 21632 DDR storage addresses in total.
Because the third matrix is stored in DDR by combining every two columns of image data in each row into one 512-bit datum at one DDR address, the FPGA can read two columns of image data of a row from DDR per read, i.e. the FPGA can read one third submatrix of the third matrix from DDR per read.
From the foregoing it can be seen that the input image (i.e. the third matrix) provided by the embodiment of the present invention does not need 16 zeros appended on each channel; compared with the prior-art image matrix with zeros appended on each channel, the amount of image data is halved, so the bandwidth with which the FPGA accesses the DDR is also halved.
The process by which the FPGA stores the image data read from DDR into the image buffer is described below.
Optionally, the FPGA can store each third submatrix read from DDR at one storage address of the image buffer. The third submatrix is a P*1*z matrix, where P is K/q/z and q is the number of bits occupied by each element of the third matrix; the total number of bits occupied by the third submatrix equals K. For example, the third submatrix is a 2*1*16 matrix, P is 512/16/16 = 2, q is the number of bits occupied by each element of the third matrix, i.e. q = 16, and the total number of bits occupied by the third submatrix equals 512 (i.e. K = 512).
Optionally, when the FPGA reads a third submatrix of the third matrix from DDR, the FPGA determines the first storage address at which the third submatrix is to be stored (i.e. the first storage address of the third submatrix in the image buffer). The first storage address is wr_addr_temp + wr_vcnt*Image_Z/32*2 - Image_Z/32, where wr_addr_temp characterizes the f-th third submatrix of the third matrix, wr_vcnt characterizes the row of the third matrix in which the f-th third submatrix lies, and Image_Z characterizes the number of image data in the f-th third submatrix. The FPGA then stores, based on the first storage address, the third submatrix of the third matrix at the first storage address.
For example, referring to Fig. 3 and Fig. 4 (Fig. 4 is a schematic diagram of how image data is stored in DDR and in the image buffer according to an embodiment of the present invention), when the FPGA reads the 1st third submatrix of the third matrix from DDR (i.e. data1 and data2 of row 1 of the third matrix), wr_addr_temp = 0 (wr_addr_temp counts from 0; since the third matrix has 104*208 = 21632 third submatrices in total, the value range of wr_addr_temp is 0 ~ 21631), wr_vcnt = 1 and Image_Z = 32 (32 image data are combined into one 512-bit datum), so the first storage address = 0 + 1*(32/32)*2 - 32/32 = 1; i.e. the first storage address of the 1st third submatrix of the third matrix is addr1. When the FPGA reads the 2nd third submatrix of the third matrix (i.e. data3 and data4 of row 1 of the third matrix), wr_addr_temp = 1, wr_vcnt = 1 and Image_Z = 32 (a third submatrix contains 32 image data in total, so Image_Z = 32), so the first storage address = 1 + 1*(32/32)*2 - 32/32 = 2; i.e. the first storage address of the 2nd third submatrix of the third matrix is addr2. While the FPGA writes image data into the image buffer, the counting of wr_vcnt and wr_hcnt can be implemented with counters; see Fig. 5 (a schematic diagram of counters computing the storage addresses of image data in the image buffer according to an embodiment of the present invention). In Fig. 5, when the count of wr_hcnt reaches 104, the count of wr_vcnt increments by 1. Since each row of the third matrix has 208/2 = 104 third submatrices, the value range of wr_hcnt is 1 ~ 104; since the third matrix has 208 rows, the value range of wr_vcnt is 1 ~ 208.
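These two worked values can be reproduced directly (a sketch of the arithmetic, not part of the patent text):

```python
# First storage address: wr_addr_temp + wr_vcnt*Image_Z/32*2 - Image_Z/32
assert 0 + 1 * (32 // 32) * 2 - 32 // 32 == 1  # 1st third submatrix -> addr1
assert 1 + 1 * (32 // 32) * 2 - 32 // 32 == 2  # 2nd third submatrix -> addr2
```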
Optionally, to distinguish the two columns of image data within each row of the third matrix, the FPGA can store the two columns at different regions of the same first storage address. For example, the FPGA can divide the first storage address storing a third submatrix in the image buffer into a first sub-storage region and a second sub-storage region; the first sub-storage region stores the low-order data of the third submatrix and the second sub-storage region stores the high-order data; the high-order data and the low-order data of a third submatrix each have a bit width of 256 bits.
For example, referring to Fig. 4, the FPGA can store the low-order data of each third submatrix 2*1*16 of the third matrix in the first sub-storage region of the first storage address, and the high-order data in the second sub-storage region of the first storage address. For instance, the image datum in row 1, column 1 of the third matrix (i.e. data1) is stored in the first sub-storage region of the 1st first storage address (i.e. addr1), and the image datum in row 1, column 2 (i.e. data2) is stored in the second sub-storage region of addr1. Since the image data in the image buffer are stored row by row, refer to Fig. 6, which is a schematic diagram of the arrangement of image-data storage addresses in the image buffer according to an embodiment of the present invention.
It should be noted that in Fig. 4 and Fig. 6 some storage addresses store image data equal to 0 (in Fig. 6, the blank storage addresses are the ones storing 0). This is because the FPGA has applied a zero padding operation to the third matrix (the FPGA appends one column of zeros on each side of the third matrix); after zero-padding the third matrix, the FPGA obtains the first matrix. Illustratively, refer to Fig. 7, which is a schematic diagram of a first matrix according to an embodiment of the present invention. As shown in Fig. 7, the first column and the last column of the first matrix are zeros.
The process by which the FPGA stores the weight data read from DDR into the weight buffers is described below. Illustratively, refer to Fig. 8, which is a schematic diagram of how weight data is stored in DDR according to an embodiment of the present invention. As shown in Fig. 8, the FPGA can read a fourth matrix of a convolution weight from the external memory, the fourth matrix being an n*(n+1)*z matrix (a 3*4*16 fourth matrix is taken as the example below). It is assumed below that 32 convolution weights are stored in DDR.
Optionally, since the bit width of the data bus is 512 bits, each element of a convolution weight occupies 16 bits, and the convolution weight itself is a 3*3*16 matrix with only 3 columns of weight data per row, a row cannot be combined into two 512-bit data. Therefore, in DDR, convolution weights are stored in the form of fourth matrices, a fourth matrix being the matrix obtained by applying zero padding to a convolution weight. Specifically, the fourth matrix is obtained by appending one column of zeros on the right of the convolution weight, i.e. the last column of the fourth matrix is zeros (e.g. data4, data8 and data12 are zero). In DDR, the weight data of every two channels of the fourth matrix can then be combined into one 512-bit datum stored at the same DDR address, i.e. the fourth matrix can be stored in DDR one fourth submatrix at a time (a fourth submatrix being a 2*1*16 matrix). The fourth submatrices of a fourth matrix can be stored in DDR in the same or a similar way as the third submatrices of the third matrix above, which is not repeated here. For example, data1 and data2 of a fourth matrix are stored at the same DDR storage address (e.g. addr0); the weight data of row 1 of the fourth matrix then occupies 4/2 = 2 storage addresses, and the weight data of the whole fourth matrix occupies 2*3 = 6 storage addresses.
From the foregoing it can be seen that the fourth matrix provided by the embodiment of the present invention does not need 16 zeros appended on each channel; compared with the prior-art convolution weight with zeros appended on each channel, the amount of weight data is roughly halved, so the bandwidth with which the FPGA accesses the DDR is also roughly halved.
Since the fourth matrix is stored in DDR in the form of fourth submatrices, one per storage address, the FPGA can read one fourth submatrix from DDR per read.
Optionally, the FPGA can store each fourth submatrix read from DDR into the r-th weight buffer of the 16 weight buffers, i.e. the FPGA can store the e-th fourth matrix of the 32 fourth matrices into the r-th weight buffer.
Optionally, when the FPGA reads a fourth submatrix of the e-th fourth matrix from DDR, the FPGA determines the second storage address at which the current fourth submatrix is to be stored. The second storage address is (wr_hcnt-1) + 16*(wr_scnt-1) + ((Weight_h+1)/(32/Weight_Z))*Weight_W*(wr_vcnt-1). wr_hcnt characterizes the y-th fourth submatrix of the e-th fourth matrix; wr_vcnt characterizes the r-th weight buffer of the at least one weight buffer, the r-th weight buffer being the target weight buffer of the y-th fourth submatrix; and wr_scnt characterizes the order of the e-th fourth matrix among all the fourth matrices in the r-th weight buffer. The FPGA stores, based on the second storage address, the y-th fourth submatrix at the second storage address. Weight_h characterizes the height n of each convolution weight (i.e. Weight_h = 3), Weight_Z characterizes the depth z of each convolution weight (i.e. Weight_Z = 16), and Weight_W characterizes the width n of each convolution weight (i.e. Weight_W = 3). Since each fourth matrix has ((Weight_h+1)/(32/Weight_Z))*Weight_W fourth submatrices, the value range of wr_hcnt is 1 ~ ((3+1)/(32/16))*3, i.e. 1 ~ 6. Since there are 16 weight buffers, the value range of wr_vcnt is 1 ~ 16. Since each weight buffer can store 2 fourth matrices, the range of wr_scnt is 1 ~ 2.
For example, referring to Fig. 8 and Fig. 9 (Fig. 9 is a schematic diagram of how weight data is stored in the weight buffers according to an embodiment of the present invention), when the FPGA stores the 1st fourth submatrix of the 1st of the 32 fourth matrices read from DDR (i.e. data1 and data2 of row 1 of the 1st fourth matrix) into the 1st of the 16 weight buffers, wr_hcnt = 1, wr_vcnt = 1 and wr_scnt = 1, so the second storage address = (1-1) + 16*(1-1) + ((3+1)/(32/16))*3*(1-1) = 0, i.e. the current fourth submatrix is stored at second storage address addr0 of the first weight buffer. The storage addresses of each weight buffer run from 0 to 1023.
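This worked value can be checked the same way (sketch, embodiment constants assumed):

```python
# Second storage address for wr_hcnt = wr_scnt = wr_vcnt = 1:
# (1-1) + 16*(1-1) + ((3+1)/(32/16))*3*(1-1) = 0
assert (1-1) + 16*(1-1) + ((3+1)//(32//16))*3*(1-1) == 0  # -> addr0
```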
Optionally, while the FPGA writes weight data into the weight buffers, wr_hcnt, wr_vcnt and wr_scnt can each be counted by a corresponding counter; see Fig. 10 (a schematic diagram of counters computing the storage addresses of weight data in the weight buffers according to an embodiment of the present invention). In Fig. 10, when the count corresponding to wr_hcnt reaches 6, i.e. the fourth submatrices of the first fourth matrix have all been written into the weight buffer, wr_vcnt increments by 1; when the count corresponding to wr_vcnt reaches 16, the counters have finished computing the second storage addresses for writing the 16th fourth matrix into the 16th weight buffer. wr_scnt then increments by 1, which indicates that the storage region of each weight buffer for the second fourth matrix starts being written with the subsequent fourth matrices.
Optionally, to distinguish the two columns of weight data within each row of the fourth matrix, the FPGA can store the two columns at different regions of the same second storage address. For example, the FPGA can divide the second storage address storing a fourth submatrix in each weight buffer into a third sub-storage region (identified as RAMA in Fig. 9) and a fourth sub-storage region (identified as RAMB in Fig. 9). The third sub-storage region stores the high-order data of the fourth submatrix and the fourth sub-storage region stores the low-order data; the high-order data and the low-order data of a fourth submatrix each have a bit width of 256 bits. For example, the FPGA can store the weight datum in row 1, column 1 of the 1st fourth matrix (i.e. data1) in RAMA of the 1st second storage address (i.e. addr0) of the first weight buffer (i.e. weight buffer 0), and the weight datum in row 1, column 2 of the 1st fourth matrix (i.e. data2) in RAMB of the 1st second storage address (i.e. addr0) of weight buffer 0.
The process by which the FPGA reads image data from the image buffer is described below, taking the computing unit of the FPGA reading image data from the image buffer as the example. To distinguish the convolution weights read from DDR (i.e. the fourth matrices) from the convolution weights read from the weight buffers, a second matrix denotes a convolution weight read from a weight buffer below.
S202: the computing unit reads, in a single read from the image buffer, a first submatrix of the first matrix, the first submatrix being a P*1*z matrix; wherein P is K/q/z, K is the bit width of the data bus in bits, and q is the number of bits occupied by each element of the first matrix; the total number of bits occupied by the first submatrix equals K; and the first submatrix being a P*1*z matrix means that its width is P, its height is 1 and its depth is z.

Optionally, as described above, the first matrix is what the FPGA stores into the image buffer after zero-padding the third matrix, so the computing unit can read a first submatrix of the first matrix from the image buffer in a single read, the first submatrix being a 2*1*16 matrix, where P is 512/16/16, i.e. P = 2, K is 512, q is the number of bits occupied by each element of the first matrix, i.e. q = 16, the total number of bits occupied by the first submatrix equals 512, and the first submatrix being a 2*1*16 matrix means that its width is 2, its height is 1 and its depth is 16.
When the computing unit reads a first submatrix from the image buffer, the computing unit can determine the third storage address of the first submatrix and then read the first submatrix based on the third storage address. The third storage address is (rd_scnt-1) + Image_Z/32*(rd_wcnt-1) + (Image_W+2)*Image_Z/32*(rd_hcnt-1) + Image_Z/32*(rd_fcnt-1)*S + floor((rd_fcnt-1)/Image_W)*(Weight_W-1)*Image_Z/32 - addr_temp1. rd_scnt characterizes the minimum unit of the first matrix, namely one first submatrix; Image_Z characterizes the number of image data in a first submatrix and equals P*z; rd_wcnt characterizes the column of the first matrix in which the data lie; Image_W characterizes the width of the third matrix (in the worked examples below it is measured in 512-bit words, i.e. 104 for a 208-column image); rd_hcnt characterizes the first submatrix in the i-th row of the first matrix; rd_fcnt characterizes the total convolution count; S characterizes a preset stride value; floor((rd_fcnt-1)/Image_W)*(Weight_W-1)*Image_Z/32 characterizes the starting row of the current convolution in the first matrix; and addr_temp1 characterizes the storage address of the 1st first submatrix of the first matrix. For the first row of the output image, addr_temp1 = 0; for the other rows of the output image, addr_temp1 = Image_Z/32*(Image_W+2).
For example, take the computing unit reading the first submatrices of the first matrix that are convolved with the 1st second matrix to compute the first row of the output image. When the computing unit reads the first submatrix that is the 1st submatrix of row 1 of the first matrix, rd_scnt = 1, Image_Z = 32 (a first submatrix holds 16*2 = 32 image data), rd_wcnt = 1 (the current first submatrix lies in column 1 of the first matrix), Image_W = 104 (the width of the third matrix in 512-bit words: 208 columns at two columns per word), rd_hcnt = 1 (the current first submatrix is a first submatrix in row 1 of the first matrix), rd_fcnt = 1 (the convolution count in row 1 of the first matrix), S = 1 (the counter increments by 1 per step), Weight_W = 3 (the width n of each convolution weight is 3, i.e. each convolution weight has 3 columns) and addr_temp1 = 0. The third storage address of the current first submatrix is then (1-1) + 32/32*(1-1) + (104+2)*32/32*(1-1) + 32/32*(1-1)*1 + floor((1-1)/104)*(3-1)*32/32 - 0 = 0, i.e. the storage address of the 1st first submatrix of row 1 of the first matrix is addr0. By analogy, the storage address of the 2nd first submatrix of row 1 of the first matrix is addr1, the storage address of the 1st first submatrix of row 2 of the first matrix is addr106, and that of the 2nd first submatrix is addr107; see Fig. 11 (a schematic diagram of the storage addresses of the first submatrices of the first matrix that are convolved with the 1st second matrix to compute the first row of the output image, according to an embodiment of the present invention). Since two second matrices are stored in each weight buffer, the computing unit only needs to run twice when reading the first submatrices of the first matrix that are convolved with the 1st second matrix for the first row of the output image.
Similarly, take the computing unit reading the first submatrices of the first matrix that are convolved with the 1st second matrix to compute the second row of the output image. When the computing unit reads the first submatrix that is the 1st submatrix of row 1 of the first matrix, rd_scnt = 1, Image_Z = 32, rd_wcnt = 1 (the current first submatrix lies in column 1 of the first matrix), Image_W = 104, rd_hcnt = 1 (the current first submatrix is a first submatrix in row 1 of the first matrix), rd_fcnt = 105 (the convolution count for row 2 of the first matrix), S = 1, Weight_W = 3 and addr_temp1 = 32/32*(104+2) = 106. The third storage address of the current first submatrix is then (1-1) + 32/32*(1-1) + (104+2)*32/32*(1-1) + 32/32*(105-1)*1 + floor((105-1)/104)*(3-1)*32/32 - 106 = 0, i.e. the storage address of the 1st first submatrix of row 1 of the first matrix is addr0. By analogy, the storage address of the 2nd first submatrix of row 1 of the first matrix is addr1, the storage addresses of the 1st and 2nd first submatrices of row 2 of the first matrix are addr106 and addr107, and the storage addresses of the 1st and 2nd first submatrices of row 3 of the first matrix are addr212 and addr213; see Fig. 12 (a schematic diagram of the storage addresses of the first submatrices of the first matrix that are convolved with the 1st second matrix to compute the second row of the output image, according to an embodiment of the present invention).
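The read-address arithmetic of these two examples can be checked directly (sketch; Image_Z/32 = 1 is folded in and Image_W = 104 is in 512-bit words, both assumptions consistent with the worked numbers):

```python
# Third storage address, output row 1 (addr_temp1 = 0, rd_fcnt = 1):
assert (1-1) + (1-1) + 106*(1-1) + (1-1)*1 + ((1-1)//104)*2 - 0 == 0    # addr0
assert (1-1) + (1-1) + 106*(2-1) + (1-1)*1 + ((1-1)//104)*2 - 0 == 106  # addr106
# Output row 2 (addr_temp1 = 106, rd_fcnt = 105):
assert (1-1) + (1-1) + 106*(1-1) + (105-1)*1 + ((105-1)//104)*2 - 106 == 0    # addr0
assert (1-1) + (1-1) + 106*(3-1) + (105-1)*1 + ((105-1)//104)*2 - 106 == 212  # addr212
```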
S203: the computing unit reads, in each read from each weight buffer, a second submatrix of the second matrix, the second submatrix being a P*1*z matrix; wherein P is K/q/z and q is the number of bits occupied by each element of the second matrix; the total number of bits occupied by the second submatrix equals K; and the second submatrix being a P*1*z matrix means that its width is P, its height is 1 and its depth is z.
Optionally, as described above, there are 32 second matrices in total, with 2 second matrices stored in each of the 16 weight buffers. Since the computing unit reads second submatrices from the 16 weight buffers simultaneously and the steps for reading a second submatrix are the same for every weight buffer, the following takes the computing unit reading, per read, a second submatrix of the e-th second matrix in the r-th weight buffer as the example. The second submatrix is a 2*1*16 matrix; P is 512/16/16 = 2; q is the number of bits occupied by each element of the second matrix, i.e. q = 16; the total number of bits occupied by the second submatrix equals 512; and the second submatrix being a 2*1*16 matrix means that its width is 2, its height is 1 and its depth is 16.
Optionally, when the computing unit reads a second submatrix of the e-th second matrix from each weight buffer (the current second submatrix), the computing unit may determine the fourth storage address of the current second submatrix and then read the current second submatrix according to that fourth storage address. The fourth storage address is (rd_hcnt-1) + RD_HCNT_VALUE_TEMP*(rd_vcnt-1) + addr_temp2; rd_vcnt is used to characterize the j-th second matrix of the current weight buffer, rd_hcnt is used to characterize the h-th second submatrix of the j-th second matrix, RD_HCNT_VALUE_TEMP is used to determine the storage address of the first second submatrix of the j-th second matrix, and addr_temp2 is used to determine the storage address of the first second submatrix to be read of each second matrix in the current weight buffer. For the first and last rows of the output image, rd_hcnt = 1 ~ ((Weight_h+1)/(32/Weight_Z))*(Weight_W-1) = 1 ~ ((3+1)/(32/16))*(3-1), i.e., rd_hcnt ranges from 1 to 4, and RD_HCNT_VALUE_TEMP = ((Weight_h+1)/(32/Weight_Z))*(Weight_W-1) = ((3+1)/(32/16))*(3-1) = 4. For the other rows of the output image, excluding the first and last rows, rd_hcnt = 1 ~ ((Weight_h+1)/(32/Weight_Z))*Weight_W = 1 ~ ((3+1)/(32/16))*3, i.e., rd_hcnt ranges from 1 to 6, and RD_HCNT_VALUE_TEMP = ((Weight_h+1)/(32/Weight_Z))*Weight_W = ((3+1)/(32/16))*3 = 6. For the first row of the output image, addr_temp2 = (Weight_h-1)*rd_vcnt = 2*rd_vcnt; for the last row of the output image, addr_temp2 = (Weight_h-1)*(rd_vcnt-1) = 2*(rd_vcnt-1); for the other rows of the output image, addr_temp2 = 0.
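The weight-side read address may be transcribed in the same way. The helper below is a hypothetical sketch of the formula, not the patented circuit; the checks reproduce the example addresses given in the next paragraph for Weight_h = Weight_W = 3 and Weight_Z = 16.

    def fourth_storage_address(rd_hcnt, rd_vcnt, rd_hcnt_value_temp, addr_temp2):
        # Walk the second submatrices of one second matrix (rd_hcnt), jump
        # between the second matrices of the buffer (rd_vcnt), and apply
        # the row-boundary offset addr_temp2.
        return (rd_hcnt - 1) + rd_hcnt_value_temp * (rd_vcnt - 1) + addr_temp2

    # First output row (RD_HCNT_VALUE_TEMP = ((3+1)//(32//16))*(3-1) = 4):
    assert fourth_storage_address(1, 1, 4, 2 * 1) == 2  # first second matrix
    assert fourth_storage_address(1, 2, 4, 2 * 2) == 8  # second second matrix
    # Interior rows (RD_HCNT_VALUE_TEMP = ((3+1)//(32//16))*3 = 6):
    assert fourth_storage_address(1, 1, 6, 0) == 0
    assert fourth_storage_address(1, 2, 6, 0) == 6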
For example, for the first row of the output image, the storage address of the first second submatrix read from the first second matrix of the r-th weight buffer is 2 (i.e., Weight_rd_addr=2), and the storage address of the first second submatrix read from the second second matrix is 8. For the rows of the output image other than the first and last rows, the storage address of the first second submatrix read from the first second matrix of the r-th weight buffer is 0 (i.e., Weight_rd_addr=0), and the storage address of the first second submatrix read from the second second matrix is 6. See Figure 13 (a schematic diagram, provided by an embodiment of the present invention, of computing the storage addresses of the second submatrices of a second matrix to be read from a weight buffer), in which rd_scnt characterizes the total convolution count that the second matrix needs to perform.
In embodiments of the present invention, the counts rd_hcnt, rd_vcnt, and rd_scnt may each be implemented with a counter; see Figure 14 (a schematic diagram, provided by an embodiment of the present invention, of a counter computing the storage address of a second submatrix read from a weight buffer). In Figure 14, for the first and last rows of the output image, when the count value of rd_hcnt reaches 4, the computing unit has finished reading the second submatrices of the first second matrix of the current weight buffer, and the count value of rd_vcnt is incremented by 1. rd_scnt is then incremented by 1; since each row of the first matrix contains 105 submatrices, the weight data of a second submatrix has to be read 105 times.
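A minimal software model of the counter cascade of Figure 14 might look as follows. It is a sketch under explicit assumptions: the wrap values are those of the first-row example (rd_hcnt wraps at 4, two second matrices per buffer), and the nesting order of the counters is inferred from the description rather than stated by it. In the full sequence, an outer rd_scnt counter would replay this pass once per convolution window of an image row (105 times in the worked example).

    def weight_read_addresses(hcnt_max=4, vcnt_max=2, temp=4, first_row=True):
        # rd_hcnt counts second submatrices; when it wraps, rd_vcnt advances
        # to the next second matrix stored in the same weight buffer.
        for rd_vcnt in range(1, vcnt_max + 1):
            addr_temp2 = 2 * rd_vcnt if first_row else 0
            for rd_hcnt in range(1, hcnt_max + 1):
                yield (rd_hcnt - 1) + temp * (rd_vcnt - 1) + addr_temp2

    # One pass over one buffer for the first output row; the two second
    # matrices are entered at addresses 2 and 8, as stated in the text.
    assert list(weight_read_addresses()) == [2, 3, 4, 5, 8, 9, 10, 11]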
It should be noted that the FPGA may execute step S202 first and then step S203, may execute step S203 first and then step S202, or may execute steps S202 and S203 simultaneously; the embodiments of the present invention do not limit the execution order of steps S202 and S203.
S204: The computing unit performs the convolution calculation according to the first matrix and the second matrices that have been read, obtaining the output image.
Optionally, the computing unit may perform the convolution calculation using the first submatrices read from the first matrix and the second submatrices read from the 16*2 second matrices, obtaining the output image.
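The convolution itself then reduces to multiply-accumulate operations between paired submatrices: each bus read delivers P*z image values or P*z weight values, and a processing element multiplies corresponding elements and accumulates. The following schematic sketch (hypothetical helper name) omits the partial-sum bookkeeping across window positions and the parallelism over the 16 weight buffers:

    def mac_submatrices(img_sub, wgt_sub, acc=0):
        # img_sub and wgt_sub are P x z blocks (2 x 16 in the embodiment),
        # each fetched in a single 512-bit bus transaction; the processing
        # element accumulates their element-wise products into acc.
        for img_row, wgt_row in zip(img_sub, wgt_sub):
            for a, b in zip(img_row, wgt_row):
                acc += a * b
        return acc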
As can be seen from the foregoing, the first matrix and the second matrices provided by the embodiments of the present invention do not need to be padded with 16 zeros on each channel. Compared with the prior art, in which the image matrix and the convolution weights are zero-padded on each channel, the amounts of image data and of weight data are each halved, so that the computation load and the delay time of the FPGA are likewise halved, thereby improving the convolution efficiency of the FPGA.
It should be noted that the above description takes a data bus width of 512 bits and image-data and weight-data bit widths of 16 bits as an example; of course, when the data bus width is not 512 bits and the bit widths of the image data and the weight data are not 16 bits, data storage and reading can be performed in the same or a similar manner as described above, and the embodiments of the present invention impose no limitation on this.
As can be seen from the above description, in the technical solution of the embodiments of the present invention, the image buffer contains the first matrix of the image to be processed, the first matrix being a matrix of m*(m+2)*z; each weight buffer of the at least one weight buffer stores at least one convolution weight, each convolution weight corresponding to one second matrix, the second matrix being a matrix of n*(n+1)*z. Here, the first matrix being a matrix of m*(m+2)*z means that the height of the first matrix is m, its width is m+2, and its depth is z; the second matrix being a matrix of n*(n+1)*z means that the height of the second matrix is n, its width is n+1, and its depth is z; m, n, and z are integers greater than or equal to 1. The computing unit reads, on each read from the image buffer, a first submatrix of the first matrix, the first submatrix being a matrix of P*1*z, where P is K/q/z, K is the bit width of the data bus in bits, and q is the bit width in bits occupied by each element of the first matrix; the total number of bits occupied by the first submatrix equals K; the first submatrix being a matrix of P*1*z means that its width is P, its height is 1, and its depth is z. The computing unit reads, on each read from each weight buffer, a second submatrix of one second matrix, the second submatrix likewise being a matrix of P*1*z, with P = K/q/z, q being the bit width in bits occupied by each element of the second matrix, and the total number of bits occupied by the second submatrix equaling K. The computing unit performs the convolution calculation according to the first matrix and the second matrices that have been read, obtaining the output image. In this way, the first matrix and the second matrices require no zero padding on each channel; compared with the prior-art image matrix and convolution weights that are zero-padded on each channel, the amounts of image data and of weight data are each halved, so that the computation load and the delay time of the FPGA are likewise halved, thereby improving the convolution efficiency of the FPGA.
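The arithmetic behind the claimed saving is that one submatrix exactly fills one bus word when the channel depth is left unpadded; a two-line check under the 512-bit bus, 16-bit element, 16-channel assumptions of the embodiment:

    K, q, z = 512, 16, 16   # bus width (bits), element width (bits), depth
    P = K // q // z         # row elements packed per bus word
    assert P == 2 and P * q * z == K   # one submatrix fills one bus word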
Based on the same inventive concept, an embodiment of the present invention provides a data reading device for a convolutional neural network. Refer to Figure 15, which is a schematic structural diagram of a data reading device of a convolutional neural network provided by an embodiment of the present invention.
As shown in Figure 15, the data reading device of the convolutional neural network comprises three parts: a host side 10, an FPGA (the dashed-line box in Figure 15, labeled 11), and an external memory. The external memory is connected to the host side 10 and to the FPGA, respectively. The external memory comprises external cache 1 (labeled 121 in Figure 15), external cache 2 (labeled 122 in Figure 15), and external cache 3 (labeled 123 in Figure 15). The FPGA comprises a direct memory access (DMA) 110, an AXI4-Lite interface 111 (AXI4: Advanced eXtensible Interface), an AXI4 bus (labeled 112 in Figure 15), an interconnect module 113, cache module 1 (labeled 114 in Figure 15), cache module 2 (labeled 115 in Figure 15), and a computing unit 116. The AXI4-Lite interface 111 is a data bus interface, and AXI4 is a data bus. The host side 10 is able to access the DMA. The computing unit 116 comprises multiple processing elements (PEs) for performing the multiply-accumulate calculations.
Illustratively, referring to Figure 15 together with Figure 16 (Figure 16 is a schematic structural diagram of an FPGA accelerator provided by an embodiment of the present invention), the computing unit 116 may, through the interconnect module 113, read image data from external cache 2 (i.e., the image data is stored in external cache 2) and read weight data from external cache 3 (i.e., the weight data is stored in external cache 3); alternatively, the computing unit 116 may, through the interconnect module 113, read weight data from external cache 2 (the weight data being stored in external cache 2) and read image data from external cache 3 (the image data being stored in external cache 3). The embodiments of the present invention do not limit the types of data stored in external cache 2 and external cache 3.
Continuing to refer to Figures 15 and 16, the computing unit 116 may store the image data read from the external memory in cache module 1 and store the weight data read from the external memory in cache module 2, or it may store the weight data read from the external memory in cache module 1 and the image data read from the external memory in cache module 2; the embodiments of the present invention do not limit the data types cached by cache module 1 and cache module 2. When the computing unit 116 needs to perform the convolution calculation, it can read the image data and the weight data from cache module 1 and cache module 2 for the convolution calculation. In Figure 16, DFF is a type of flip-flop; the DFFs added to the FPGA accelerator make routing in the FPGA easier and simplify the structure of the FPGA.
The data reading device of the convolutional neural network in this embodiment and the data reading method of the convolutional neural network shown in the foregoing Figure 2 are inventions under the same concept. From the foregoing detailed description of the data reading method of the convolutional neural network, a person skilled in the art can clearly understand the implementation process of the data reading device in this embodiment, so for the conciseness of this specification, the details are not repeated here.
Based on the same inventive concept, an embodiment of the present invention provides a data reading device for a convolutional neural network. Refer to Figure 17, which is a schematic structural diagram of a data reading device of a convolutional neural network provided by an embodiment of the present invention.
As shown in Figure 17, the data reading device 1700 of the convolutional neural network comprises an FPGA (labeled 1710 in Figure 17); wherein the FPGA comprises an image buffer 1711, at least one weight buffer 1712, and a computing unit 1713;
the image buffer 1711 is configured to store a first matrix of an image to be processed, the first matrix being a matrix of m*(m+2)*z; wherein the first matrix being a matrix of m*(m+2)*z is used to characterize that the height of the first matrix is m, the width is m+2, and the depth is z; m and z are integers greater than or equal to 1;
each weight buffer of the at least one weight buffer 1712 is configured to store at least one convolution weight, each convolution weight corresponding to one second matrix, the second matrix being a matrix of n*(n+1)*z; wherein the second matrix being a matrix of n*(n+1)*z is used to characterize that the height of the second matrix is n, the width is n+1, and the depth is z; n is an integer greater than or equal to 1;
the computing unit 1713 is configured to read, on each read from the image buffer 1711, a first submatrix of the first matrix, the first submatrix being a matrix of P*1*z; wherein P is K/q/z, K is the bit width of the data bus in bits, and q is the bit width in bits occupied by each element of the first matrix; the total number of bits occupied by the first submatrix equals K; the first submatrix being a matrix of P*1*z is used to characterize that the width of the first submatrix is P, the height is 1, and the depth is z;
the computing unit 1713 is further configured to read, on each read from each weight buffer, a second submatrix of one second matrix, the second submatrix being a matrix of P*1*z; wherein P is K/q/z and q is the bit width in bits occupied by each element of the second matrix; the total number of bits occupied by the second submatrix equals K; the second submatrix being a matrix of P*1*z is used to characterize that the width of the second submatrix is P, the height is 1, and the depth is z;
the computing unit 1713 is further configured to perform a convolution calculation according to the first matrix and the second matrices that have been read, obtaining an output image.
Optionally, the FPGA is connected to an external memory (not shown in Figure 17), and the FPGA is configured to:
read a third matrix of the image to be processed from the external memory, the third matrix being a matrix of m*m*z;
determine a first storage address in the image buffer 1711;
wherein the first storage address is wr_addr_temp + wr_vcnt*Image_Z/32*2 - Image_Z/32; wr_addr_temp is used to characterize the f-th third submatrix of the third matrix, and wr_vcnt is used to characterize the row of the third matrix in which the f-th third submatrix is located; Image_Z is used to characterize the number of image data in the f-th third submatrix; the third submatrix is a matrix of P*1*z, P is K/q/z, and q is the bit width in bits occupied by each element of the third matrix; the total number of bits occupied by the third submatrix equals K;
based on the first storage address, store the third submatrix of the third matrix at the first storage address (the address computation is transcribed in the sketch below).
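As a sketch of the formula just given (the helper name is invented for this description, and the interpretation in the comment is an assumption, since the text states only the formula):

    def first_storage_address(wr_addr_temp, wr_vcnt, Image_Z=32):
        unit = Image_Z // 32  # bus words per third submatrix
        # The wr_vcnt term offsets each image row so that the stored first
        # matrix leaves room for padding columns (interpretation, not stated).
        return wr_addr_temp + wr_vcnt * unit * 2 - unit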
Optionally, the FPGA is further configured to:
read a fourth matrix of each convolution weight from the external memory, obtaining at least one fourth matrix; the fourth matrix is a matrix of n*(n+1)*z and is obtained by applying zero-padding to each convolution weight, each convolution weight being a matrix of n*n*z;
determine a second storage address of the r-th weight buffer;
wherein the second storage address is (wr_hcnt-1) + 16*(wr_scnt-1) + ((Weight_h+1)/(32/Weight_Z))*Weight_W*(wr_vcnt-1);
wr_hcnt is used to characterize the y-th fourth submatrix of the e-th fourth matrix of the at least one fourth matrix; Weight_h is used to characterize the height n of each convolution weight, Weight_Z is used to characterize the depth z of each convolution weight, and Weight_W is used to characterize the width of each convolution weight; wr_vcnt is used to characterize the r-th weight buffer of the at least one weight buffer 1712, the r-th weight buffer being the target weight buffer of the y-th fourth submatrix; wr_scnt is used to characterize the position of the e-th fourth matrix in the sequence of all fourth matrices in the r-th weight buffer; the fourth submatrix is a matrix of P*1*z, P is K/q/z, and q is the bit width in bits occupied by each element of the fourth matrix; the total number of bits occupied by the fourth submatrix equals K;
based on the second storage address, store the y-th fourth submatrix at the second storage address (see the sketch following this list).
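The weight-write address can be sketched likewise; the helper name is invented, the default arguments reflect the 3*3*16 weights of the embodiment, and the reading of the constant 16 as the number of weight buffers is an assumption:

    def second_storage_address(wr_hcnt, wr_scnt, wr_vcnt,
                               Weight_h=3, Weight_Z=16, Weight_W=3):
        # per_matrix is the ((Weight_h+1)/(32/Weight_Z))*Weight_W term;
        # with the 3x3x16 weights of the embodiment it equals (4//2)*3 = 6.
        per_matrix = ((Weight_h + 1) // (32 // Weight_Z)) * Weight_W
        return (wr_hcnt - 1) + 16 * (wr_scnt - 1) + per_matrix * (wr_vcnt - 1)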
Optionally, before the computing unit 1713 reads a first submatrix of the first matrix from the image buffer 1711, the FPGA is specifically configured to:
apply zero-padding to the third matrix to obtain the first matrix.
Optionally, when reading, on each read from the image buffer 1711, a first submatrix of the first matrix, the computing unit 1713 is specifically configured to:
determine a third storage address of the first submatrix;
wherein the third storage address is (rd_scnt-1) + Image_Z/32*(rd_wcnt-1) + (Image_W+2)*Image_Z/32*(rd_hcnt-1) + Image_Z/32*(rd_fcnt-1)*S + floor((rd_fcnt-1)/Image_W)*(Weight_W-1)*Image_Z/32 - addr_temp1, where floor denotes rounding down to an integer;
rd_scnt is used to characterize the minimum unit of the first matrix, namely one first submatrix; Image_Z is used to characterize the number of image data in a first submatrix and equals P*z; rd_wcnt is used to characterize the data in the i-th column of the first matrix; Image_W is used to characterize the width m of the third matrix; rd_hcnt is used to characterize the first submatrix in the i-th row of the first matrix; rd_fcnt is used to characterize the total convolution count; S is used to characterize a preset stride value; floor((rd_fcnt-1)/Image_W)*(Weight_W-1)*Image_Z/32 is used to characterize the starting row of the current convolution in the first matrix; addr_temp1 is used to characterize the storage address of the first first submatrix in the first matrix;
read the first submatrix based on the third storage address.
Optionally, when reading, on each read from each weight buffer, a second submatrix of one second matrix, the computing unit 1713 is specifically configured to:
determine a fourth storage address of the second submatrix of the e-th second matrix in each weight buffer;
the fourth storage address is (rd_hcnt-1) + RD_HCNT_VALUE_TEMP*(rd_vcnt-1) + addr_temp2;
rd_vcnt is used to characterize the j-th second matrix of the current weight buffer; rd_hcnt is used to characterize the h-th second submatrix of the j-th second matrix; RD_HCNT_VALUE_TEMP is used to characterize the storage address of the first second submatrix of the j-th second matrix; addr_temp2 is used to determine the storage address of the first second submatrix to be read of each second matrix in the current weight buffer;
read, according to the fourth storage address, the second submatrix of the e-th second matrix from each weight buffer.
Optionally, for the first row of the output image, addr_temp1 = 0; for the other rows of the output image, addr_temp1 = Image_Z/32*(Image_W+2).
Optionally, for the first and last rows of the output image, rd_hcnt = 1 ~ ((Weight_h+1)/(32/Weight_Z))*(Weight_W-1) and RD_HCNT_VALUE_TEMP = ((Weight_h+1)/(32/Weight_Z))*(Weight_W-1); for the other rows of the output image, excluding the first and last rows, rd_hcnt = 1 ~ ((Weight_h+1)/(32/Weight_Z))*Weight_W and RD_HCNT_VALUE_TEMP = ((Weight_h+1)/(32/Weight_Z))*Weight_W;
for the first row of the output image, addr_temp2 = (Weight_h-1)*rd_vcnt = 2*rd_vcnt; for the last row of the output image, addr_temp2 = (Weight_h-1)*(rd_vcnt-1) = 2*(rd_vcnt-1); for the other rows of the output image, addr_temp2 = 0. These row-dependent constants are gathered into the helper sketch below.
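The following two helpers are a sketch of the row-dependent constants above; the names and the zero-based row indexing are choices of this description, not of the patent:

    def addr_temp1_for(row, Image_Z, Image_W):
        # 0 for the first output row, otherwise one padded row stride.
        return 0 if row == 0 else (Image_Z // 32) * (Image_W + 2)

    def addr_temp2_for(row, last_row, rd_vcnt, Weight_h=3):
        if row == 0:               # first output row
            return (Weight_h - 1) * rd_vcnt
        if row == last_row:        # last output row
            return (Weight_h - 1) * (rd_vcnt - 1)
        return 0                   # interior rows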
The data reading device 1700 of the convolutional neural network in this embodiment and the data reading method of the convolutional neural network shown in the foregoing Figure 2 are inventions under the same concept. From the foregoing detailed description of the data reading method of the convolutional neural network, a person skilled in the art can clearly understand the implementation process of the data reading device 1700 of the convolutional neural network in this embodiment, so for the conciseness of this specification, the details are not repeated here.
Based on the same inventive concept, an embodiment of the present invention provides a data reading device for a convolutional neural network. Refer to Figure 18, which is a schematic structural diagram of a data reading device of a convolutional neural network provided by an embodiment of the present invention.
As shown in Figure 18, the data reading device 1800 of the convolutional neural network comprises a processor 1801 and a memory 1802. Optionally, the processor 1801 may be a general-purpose central processing unit (CPU) or an application-specific integrated circuit (ASIC), and may be one or more integrated circuits for controlling program execution.
Optionally, the memory 1802 may include a high-speed random access memory and may also include a non-volatile memory, such as a disk memory, a flash memory device, or another non-volatile solid-state memory component; the embodiments of the present invention impose no limitation on this.
Optionally, the memory 1802 is configured to store one or more computer programs; when the one or more computer programs stored in the memory 1802 are executed by the processor 1801, the data reading device 1800 of the convolutional neural network is enabled to implement all or part of the steps of the embodiment shown in Figure 2.
The data reading device 1800 of the convolutional neural network in this embodiment and the data reading method of the convolutional neural network shown in the foregoing Figure 2 are inventions under the same concept. From the foregoing detailed description of the data reading method of the convolutional neural network, a person skilled in the art can clearly understand the implementation process of the data reading device 1800 of the convolutional neural network in this embodiment, so for the conciseness of this specification, the details are not repeated here.
Based on the same inventive concept, an embodiment of the present invention provides a computer-readable storage medium. Optionally, the computer-readable storage medium stores a computer program, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the above data reading method of the convolutional neural network. Since the computer program in this embodiment and the data reading method of the convolutional neural network shown in the foregoing Figure 2 are inventions under the same concept, from the foregoing detailed description of the data reading method of the convolutional neural network, a person skilled in the art can clearly understand the implementation process of the computer program in this embodiment, so for the conciseness of this specification, the details are not repeated here.
Based on the same inventive concept, an embodiment of the present invention provides a computer program product. The computer program product stores a computer program, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the above data reading method of the convolutional neural network. Since the computer program product in this embodiment and the data reading method of the convolutional neural network shown in the foregoing Figure 2 are inventions under the same concept, from the foregoing detailed description of the data reading method of the convolutional neural network, a person skilled in the art can clearly understand the implementation process of the computer program product in this embodiment, so for the conciseness of this specification, the details are not repeated here.
The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or the other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, the instruction device realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are executed on the computer or the other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or the other programmable device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, a person skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include these modifications and variations.

Claims (12)

1. A data reading method for a convolutional neural network, applied to a field-programmable gate array (FPGA), the FPGA comprising an image buffer, at least one weight buffer, and a computing unit, characterized in that the method comprises:
the image buffer contains a first matrix of an image to be processed, the first matrix being a matrix of m*(m+2)*z; each weight buffer of the at least one weight buffer stores at least one convolution weight, each convolution weight corresponding to one second matrix, the second matrix being a matrix of n*(n+1)*z; wherein the first matrix being a matrix of m*(m+2)*z is used to characterize that the height of the first matrix is m, the width is m+2, and the depth is z; the second matrix being a matrix of n*(n+1)*z is used to characterize that the height of the second matrix is n, the width is n+1, and the depth is z; m, n, and z are integers greater than or equal to 1;
the computing unit reads, on each read from the image buffer, a first submatrix of the first matrix, the first submatrix being a matrix of P*1*z; wherein P is K/q/z, K is the bit width of the data bus in bits, and q is the bit width in bits occupied by each element of the first matrix; the total number of bits occupied by the first submatrix equals K; the first submatrix being a matrix of P*1*z is used to characterize that the width of the first submatrix is P, the height is 1, and the depth is z;
the computing unit reads, on each read from each weight buffer, a second submatrix of one second matrix, the second submatrix being a matrix of P*1*z; wherein P is K/q/z and q is the bit width in bits occupied by each element of the second matrix; the total number of bits occupied by the second submatrix equals K; the second submatrix being a matrix of P*1*z is used to characterize that the width of the second submatrix is P, the height is 1, and the depth is z;
the computing unit performs a convolution calculation according to the first matrix and the second matrix that have been read, obtaining an output image.
2. The method according to claim 1, characterized in that the FPGA is connected to an external memory, and the method further comprises:
the FPGA reads a third matrix of the image to be processed from the external memory, the third matrix being a matrix of m*m*z;
the FPGA determines a first storage address in the image buffer;
wherein the first storage address is wr_addr_temp + wr_vcnt*Image_Z/32*2 - Image_Z/32; wr_addr_temp is used to characterize the f-th third submatrix of the third matrix, and wr_vcnt is used to characterize the row of the third matrix in which the f-th third submatrix is located; Image_Z is used to characterize the number of image data in the f-th third submatrix; the third submatrix is a matrix of P*1*z, P is K/q/z, and q is the bit width in bits occupied by each element of the third matrix; the total number of bits occupied by the third submatrix equals K;
the FPGA stores, based on the first storage address, the third submatrix of the third matrix at the first storage address.
3. The method according to claim 2, characterized in that the method further comprises:
the FPGA reads a fourth matrix of each convolution weight from the external memory, obtaining at least one fourth matrix; the fourth matrix is a matrix of n*(n+1)*z and is obtained by applying zero-padding to each convolution weight, each convolution weight being a matrix of n*n*z;
the FPGA determines a second storage address of the r-th weight buffer;
wherein the second storage address is (wr_hcnt-1) + 16*(wr_scnt-1) + ((Weight_h+1)/(32/Weight_Z))*Weight_W*(wr_vcnt-1);
wr_hcnt is used to characterize the y-th fourth submatrix of the e-th fourth matrix of the at least one fourth matrix; Weight_h is used to characterize the height n of each convolution weight, Weight_Z is used to characterize the depth z of each convolution weight, and Weight_W is used to characterize the width of each convolution weight; wr_vcnt is used to characterize the r-th weight buffer of the at least one weight buffer, the r-th weight buffer being a target weight buffer of the y-th fourth submatrix; wr_scnt is used to characterize the position of the e-th fourth matrix in the sequence of all fourth matrices in the r-th weight buffer; the fourth submatrix is a matrix of P*1*z, P is K/q/z, and q is the bit width in bits occupied by each element of the fourth matrix; the total number of bits occupied by the fourth submatrix equals K;
the FPGA stores, based on the second storage address, the y-th fourth submatrix at the second storage address.
4. The method according to claim 3, characterized in that, before the computing unit reads a first submatrix of the first matrix from the image buffer, the method further comprises:
the FPGA applies zero-padding to the third matrix to obtain the first matrix.
5. The method according to claim 2, characterized in that the computing unit reading, on each read from the image buffer, a first submatrix of the first matrix comprises:
the computing unit determines a third storage address of the first submatrix;
wherein the third storage address is (rd_scnt-1) + Image_Z/32*(rd_wcnt-1) + (Image_W+2)*Image_Z/32*(rd_hcnt-1) + Image_Z/32*(rd_fcnt-1)*S + floor((rd_fcnt-1)/Image_W)*(Weight_W-1)*Image_Z/32 - addr_temp1, where floor denotes rounding down to an integer;
rd_scnt is used to characterize the minimum unit of the first matrix, namely one first submatrix; Image_Z is used to characterize the number of image data in a first submatrix and equals P*z; rd_wcnt is used to characterize the data in the i-th column of the first matrix; Image_W is used to characterize the width m of the third matrix; rd_hcnt is used to characterize the first submatrix in the i-th row of the first matrix; rd_fcnt is used to characterize the total convolution count; S is used to characterize a preset stride value; floor((rd_fcnt-1)/Image_W)*(Weight_W-1)*Image_Z/32 is used to characterize the starting row of the current convolution in the first matrix; addr_temp1 is used to characterize the storage address of the first first submatrix in the first matrix;
the computing unit reads the first submatrix based on the third storage address.
6. The method according to any one of claims 1 to 5, characterized in that the computing unit reading, on each read from each weight buffer, a second submatrix of one second matrix comprises:
the computing unit determines a fourth storage address of the second submatrix of the e-th second matrix in each weight buffer;
the fourth storage address is (rd_hcnt-1) + RD_HCNT_VALUE_TEMP*(rd_vcnt-1) + addr_temp2;
rd_vcnt is used to characterize the j-th second matrix of the current weight buffer; rd_hcnt is used to characterize the h-th second submatrix of the j-th second matrix; RD_HCNT_VALUE_TEMP is used to characterize the storage address of the first second submatrix of the j-th second matrix; addr_temp2 is used to determine the storage address of the first second submatrix to be read of each second matrix in the current weight buffer;
the computing unit reads, according to the fourth storage address, the second submatrix of the e-th second matrix from each weight buffer.
7. The method according to claim 5, characterized in that, for the first row of the output image, addr_temp1 = 0, and for the other rows of the output image, addr_temp1 = Image_Z/32*(Image_W+2).
8. The method according to claim 6, characterized in that, for the first and last rows of the output image, rd_hcnt = 1 ~ ((Weight_h+1)/(32/Weight_Z))*(Weight_W-1) and RD_HCNT_VALUE_TEMP = ((Weight_h+1)/(32/Weight_Z))*(Weight_W-1); for the other rows of the output image, excluding the first and last rows, rd_hcnt = 1 ~ ((Weight_h+1)/(32/Weight_Z))*Weight_W and RD_HCNT_VALUE_TEMP = ((Weight_h+1)/(32/Weight_Z))*Weight_W;
for the first row of the output image, addr_temp2 = (Weight_h-1)*rd_vcnt = 2*rd_vcnt; for the last row of the output image, addr_temp2 = (Weight_h-1)*(rd_vcnt-1) = 2*(rd_vcnt-1); for the other rows of the output image, addr_temp2 = 0.
9. A data reading device for a convolutional neural network, characterized in that the data reading device of the convolutional neural network comprises an FPGA; wherein the FPGA comprises an image buffer, at least one weight buffer, and a computing unit;
the image buffer is configured to store a first matrix of an image to be processed, the first matrix being a matrix of m*(m+2)*z; wherein the first matrix being a matrix of m*(m+2)*z is used to characterize that the height of the first matrix is m, the width is m+2, and the depth is z; m and z are integers greater than or equal to 1;
each weight buffer of the at least one weight buffer is configured to store at least one convolution weight, each convolution weight corresponding to one second matrix, the second matrix being a matrix of n*(n+1)*z; wherein the second matrix being a matrix of n*(n+1)*z is used to characterize that the height of the second matrix is n, the width is n+1, and the depth is z; n is an integer greater than or equal to 1;
the computing unit is configured to read, on each read from the image buffer, a first submatrix of the first matrix, the first submatrix being a matrix of P*1*z; wherein P is K/q/z, K is the bit width of the data bus in bits, and q is the bit width in bits occupied by each element of the first matrix; the total number of bits occupied by the first submatrix equals K; the first submatrix being a matrix of P*1*z is used to characterize that the width of the first submatrix is P, the height is 1, and the depth is z;
the computing unit is further configured to read, on each read from each weight buffer, a second submatrix of one second matrix, the second submatrix being a matrix of P*1*z; wherein P is K/q/z and q is the bit width in bits occupied by each element of the second matrix; the total number of bits occupied by the second submatrix equals K; the second submatrix being a matrix of P*1*z is used to characterize that the width of the second submatrix is P, the height is 1, and the depth is z;
the computing unit is further configured to perform a convolution calculation according to the first matrix and the second matrix that have been read, obtaining an output image.
10. A data reading device for a convolutional neural network, characterized by comprising a processor and a memory; wherein
the memory is configured to store one or more computer programs; when the one or more computer programs stored in the memory are executed by the processor, the data reading device of the neural network is caused to perform the method according to any one of claims 1 to 8.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 8.
12. A computer program product, characterized in that the computer program product stores a computer program, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 8.
CN201910547468.5A 2019-06-24 2019-06-24 A kind of method for reading data and reading data device of convolutional neural networks Active CN110059808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910547468.5A CN110059808B (en) 2019-06-24 2019-06-24 A kind of method for reading data and reading data device of convolutional neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910547468.5A CN110059808B (en) 2019-06-24 2019-06-24 A kind of method for reading data and reading data device of convolutional neural networks

Publications (2)

Publication Number Publication Date
CN110059808A true CN110059808A (en) 2019-07-26
CN110059808B CN110059808B (en) 2019-10-18

Family

ID=67325757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910547468.5A Active CN110059808B (en) 2019-06-24 2019-06-24 A kind of method for reading data and reading data device of convolutional neural networks

Country Status (1)

Country Link
CN (1) CN110059808B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018184192A1 (en) * 2017-04-07 2018-10-11 Intel Corporation Methods and systems using camera devices for deep channel and convolutional neural network images and formats
CN109146060A (en) * 2018-08-09 2019-01-04 郑州云海信息技术有限公司 A kind of method and device based on convolutional neural networks processing data
CN109598338A (en) * 2018-12-07 2019-04-09 东南大学 A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA
CN109840589A (en) * 2019-01-25 2019-06-04 深兰人工智能芯片研究院(江苏)有限公司 A kind of method, apparatus and system running convolutional neural networks on FPGA
CN109615067A (en) * 2019-03-05 2019-04-12 深兰人工智能芯片研究院(江苏)有限公司 A kind of data dispatching method and device of convolutional neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XUECHAO WEI et al.: "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs", 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC) *
LIU QINRANG et al.: "Computation optimization of convolutional neural networks exploiting parameter sparsity and its FPGA accelerator design", Journal of Electronics & Information Technology *
LI XIAOYAN et al.: "FPGA-based convolutional neural network acceleration system", Journal of Hebei University (Natural Science Edition) *

Also Published As

Publication number Publication date
CN110059808B (en) 2019-10-18

Similar Documents

Publication Publication Date Title
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
EP3149595B1 (en) Systems and methods for segmenting data structures in a memory system
CN103336758B (en) The sparse matrix storage means of a kind of employing with the sparse row of compression of local information and the SpMV implementation method based on the method
CN111079917B (en) Tensor data block access method and device
CN110390382B (en) Convolutional neural network hardware accelerator with novel feature map caching module
US11928580B2 (en) Interleaving memory requests to accelerate memory accesses
CN106779057A (en) The method and device of the calculating binary neural network convolution based on GPU
CN109754359A (en) A kind of method and system that the pondization applied to convolutional neural networks is handled
CN110333827B (en) Data loading device and data loading method
CN110781447B (en) DDR-based high-efficiency matrix transposition processing method
CN111401518A (en) Neural network quantization method and device and computer readable storage medium
CN111415003B (en) Three-dimensional stacked storage optimization method and device for neural network acceleration chip
CN116720549A (en) FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache
CN109615067B (en) A kind of data dispatching method and device of convolutional neural networks
CN110059808B (en) A kind of method for reading data and reading data device of convolutional neural networks
CN109902821A (en) A kind of data processing method, device and associated component
CN113869495A (en) Method, device and equipment for optimizing convolutional weight layout of neural network and readable medium
CN103389413A (en) Real-time statistical method for frequency spectrum histogram
JP6332756B2 (en) Data processing method, apparatus, and system
CN110569970B (en) Data transmission method applied to hardware accelerator in convolutional neural network
CN106776600A (en) The method and device of text cluster
CN103456354B (en) A kind of method and apparatus of nonvolatile memory difference storage lattice
CN108920097A (en) A kind of three-dimensional data processing method based on Laden Balance
CN115034351A (en) Data processing method, convolutional neural network training method and device and FPGA
CN110490312B (en) Pooling calculation method and circuit

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant