CN111898743A - CNN acceleration method and accelerator - Google Patents

CNN acceleration method and accelerator

Info

Publication number
CN111898743A
Authority
CN
China
Prior art keywords
group
feature vector
denotes
index
eigenvector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010784854.9A
Other languages
Chinese (zh)
Inventor
陈乔乔
刘洪杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Jiutian Ruixin Technology Co ltd
Original Assignee
Shenzhen Jiutian Ruixin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Jiutian Ruixin Technology Co ltd filed Critical Shenzhen Jiutian Ruixin Technology Co ltd
Publication of CN111898743A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a CNN (convolutional neural network) acceleration method and accelerator, relating to the technical field of convolutional neural networks and mainly addressing the technical problems of complex control, repeated reading and writing, and poor expansibility in conventional CNN accelerators. The CNN acceleration method comprises the following steps: inputting initial data and reading it in sequence to obtain a first feature vector group; multiplying and accumulating the convolution kernel with the first feature vector group to obtain a second feature vector group; performing partial-sum accumulation on the second feature vector group to obtain a third feature vector group; and classifying the third feature vector group to obtain a classification result. The invention performs no redundant read-write operations, its data access pattern is very friendly to off-chip DDR, and no read-write efficiency problem exists. The invention is also easy to scale: for high compute requirements, the clock frequency for reading and writing data on the interface side can be raised, or a multi-bank scheme adopted; each port is completely independent of the others, without any dependency.

Description

CNN acceleration method and accelerator
Technical Field
The invention relates to the technical field of convolutional neural networks, in particular to a CNN (convolutional neural network) acceleration method and accelerator.
Background
Convolutional Neural Networks (CNNs) are a class of feed-forward neural networks that include convolution calculations and have a deep structure; they are among the representative algorithms of deep learning. A CNN's artificial neurons respond to surrounding units within part of the coverage range, giving excellent performance on large-scale image processing.
Conventional CNN accelerators and algorithms suffer from complex control, repeated read-write operations, various limitations, and poor expansibility. The present invention therefore optimizes the existing technical solutions.
Disclosure of Invention
One purpose of the present invention is to provide a CNN acceleration method and accelerator that solve the technical problems of complex control, repeated reading and writing, and poor expansibility of CNNs in the prior art. Advantageous effects achievable in preferred embodiments of the present invention are described in detail below.
In order to achieve the purpose, the invention provides the following technical scheme:
the invention relates to a CNN acceleration method, which comprises the following steps:
inputting initial data, and reading the initial data in sequence to obtain a first feature vector group; specifically, scanning the initial data in the order of the channel direction, the horizontal direction, and then the vertical direction to obtain the first feature vector group;
multiplying and accumulating the convolution kernel with the first feature vector group to obtain a second feature vector group; specifically:
ofm_t[h1][w1][m1][i1][j1] = \sum_{k=0}^{C_K - 1} ifm[h][w][k] \times kernel[m][i][j][k]
wherein ofm_t denotes the second feature vector group, ifm denotes the first feature vector group, and kernel denotes the convolution kernel;
h1 denotes the vertical direction index of the second feature vector group, h denotes the vertical direction index of the first feature vector group, and H denotes the maximum value of the vertical direction index of the first feature vector group;
w1 denotes the horizontal direction index of the second feature vector group, w denotes the horizontal direction index of the first feature vector group, and W denotes the maximum value of the horizontal direction index of the first feature vector group;
m1 denotes the channel direction index of the second feature vector group, m denotes the group number index of the convolution kernels, and M denotes the maximum value of the channel direction index of the second feature vector group;
i denotes the vertical direction index of the convolution kernel, and H_K denotes the maximum value of the vertical direction index of the convolution kernel;
j denotes the horizontal direction index of the convolution kernel, and W_K denotes the maximum value of the horizontal direction index of the convolution kernel;
k denotes the channel direction index of the convolution kernel, and C_K denotes the maximum value of the channel direction index of the convolution kernel;
i1 denotes the row index of the second feature vector group, and j1 denotes the column index of the second feature vector group;
accumulating the partial sums of the second feature vector group to obtain a third feature vector group; specifically:
shift-accumulating the second feature vector group to obtain the third feature vector group, expressed as:
ofm_F[h2][w2][m2] = \sum_{i=0}^{H_{K1} - 1} \sum_{j=0}^{W_{K1} - 1} ofm_t[h2 \cdot s + i][w2 \cdot s + j][m2][i][j]
wherein ofm_F denotes the third feature vector group and ofm_t denotes the second feature vector group;
h2 denotes the vertical direction index of the third feature vector group, h1 denotes the vertical direction index of the second feature vector group, and H1 denotes the maximum value of the vertical direction index of the second feature vector group;
w2 denotes the horizontal direction index of the third feature vector group, w1 denotes the horizontal direction index of the second feature vector group, and W1 denotes the maximum value of the horizontal direction index of the second feature vector group;
m2 denotes the channel direction index of the third feature vector group, m1 denotes the channel direction index of the second feature vector group, and M1 denotes the maximum value of the channel direction index of the second feature vector group;
H_K1 denotes the maximum value in the row direction of the second feature vector group, and W_K1 denotes the maximum value in the column direction of the second feature vector group;
s denotes the window stride, set according to actual requirements;
the accumulating of the partial sums of the second feature vector group to obtain the third feature vector group further includes:
storing the partial sums of one window in the row direction in a register;
storing the partial sums of all windows in the row direction in an on-chip RAM;
classifying the third feature vector group to obtain a classification result; specifically:
substituting the third feature vector group into the softmax() function and performing classification to obtain the classification result.
Further, in the step of multiplying and accumulating the convolution kernel with the first feature vector group to obtain the second feature vector group:
the convolution kernel is a constant vector.
Further, in the step of multiplying and accumulating the convolution kernel with the first feature vector group to obtain the second feature vector group:
there is a plurality of convolution kernels.
The present invention also includes a computer-readable storage medium having stored thereon a computer program which, when executed, performs the CNN acceleration method as described above.
The invention also includes a CNN accelerator comprising: a processor, and a memory coupled to the processor, the memory having a computer program stored therein, the computer program, when executed by the processor, performing the CNN acceleration method as described above.
The CNN acceleration method and the CNN accelerator provided by the invention at least have the following beneficial technical effects:
the whole control of the invention is simple, only initial data is needed to be scanned in sequence from the channel direction, the horizontal direction and the vertical direction on the input layer, and no complex control is needed, such as window division (s is 1, s is 2, etc.); in addition, the invention can read the feature vector of the initial data once, the feature vector of the output layer can be output on the corresponding channel, and no redundant read-write operation exists. The data reading and writing of the invention is very friendly to the off-chip DDR, and the problem of reading and writing efficiency does not exist. The invention is easy to expand, and can improve the interface side data reading clock or adopt a multi-bank mode for high calculation force requirement; and each port is completely independent of each other without any dependence.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a CNN acceleration method according to the present invention;
FIG. 2 is a schematic diagram of a second feature vector set according to the present invention;
FIG. 3 is a schematic diagram of a first feature vector set obtained by the present invention;
FIG. 4 is a schematic diagram of the structure of the convolution kernel of the present invention;
FIG. 5 is a schematic diagram of the present invention in which partial sum accumulation is performed;
fig. 6 is a schematic structural diagram of a CNN accelerator according to the present invention.
In fig. 6: 1 - processor; 2 - memory.
Detailed Description
To make the objects, aspects, and advantages of the present invention more apparent, various exemplary embodiments are described below with reference to the accompanying drawings, which form a part hereof and in which the exemplary embodiments are shown by way of illustration. Unless otherwise specified, like numerals in different drawings represent the same or similar elements. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure; they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims. Other embodiments may be used, and structural and functional modifications may be made, without departing from the scope and spirit of the present disclosure. Detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description with unnecessary detail.
Referring to fig. 1, the present invention provides an embodiment of a CNN acceleration method, which includes:
S1: inputting initial data, and reading the initial data in sequence to obtain a first feature vector group;
S2: multiplying and accumulating the convolution kernel and the first feature vector group to obtain a second feature vector group;
S3: accumulating the partial sums of the second feature vector group to obtain a third feature vector group;
S4: classifying the third feature vector group to obtain a classification result.
The CNN itself comprises three layers connected in sequence: an input layer, a convolutional layer, and an output layer. Initial data, preferably a picture, is input at the input layer and read in sequence to obtain a first feature vector group, which is sent to the convolutional layer. In the convolutional layer, the convolution kernel and the first feature vector group are multiplied and accumulated to obtain a second feature vector group, whose partial sums are accumulated to obtain a third feature vector group, which is sent to the output layer; the third feature vector group is then classified to obtain a classification result. The final classification result can distinguish various objects on the picture, such as cats, dogs, characters, flowers, and birds.
The CNN acceleration method provided by the invention is an implementation based on vector x vector operations followed by partial-sum accumulation. The partial sums are formed by taking the result of each vector x vector product of the input feature vector group and accumulating, on the corresponding channel, its contribution to the final output feature vector group; this differs from the partial-sum handling of other schemes. The invention is therefore optimal in DDR efficiency, internal control logic, energy consumption, and computing-power scalability.
Referring specifically to fig. 3, S1 (inputting initial data and reading it in sequence to obtain a first feature vector group) comprises:
scanning the initial data sequentially in the order of the channel direction, the horizontal direction, and the vertical direction to obtain the first feature vector group.
It should be noted that the initial data (the feature map in the figure) is scanned sequentially in the order of channel direction first (the C direction of the coordinate system in fig. 3), horizontal direction second (the W direction), and vertical direction last (the H direction) to obtain the first feature vector group. The feature vectors of each sub-datum in the initial data are scanned sequentially from top to bottom and from left to right, and the feature vectors of all sub-data are assembled into the first feature vector group. For example, if the initial data is a picture, the information of each point on the picture is read sequentially from top to bottom and from left to right; the information of one point across all channel directions forms one feature vector, and traversing the whole picture yields a number of different feature vectors. That is, the picture is scanned in CWH order and converted into a first feature vector group composed of single feature vectors with different coordinates.
the feature vectors are scanned starting from when H is 0:
vector 0 includes the following feature vectors: H0W0C0, H0W0C1, H0W0c2.. H0W0 Cn-1;
vector 1 includes the following feature vectors: H0W1C0, H0W1C1, H0W1C2.. H0W1 Cn-1;
the vectors 2, 3, 4. H0Wn-1C0, H0Wn-1C1.... H0Wn-1 Cn-1;
then, starting to scan the feature vectors when H is 1:
vector 0 includes the following feature vectors: H1W0C0, H1W0C1, h1w0c2.. H1W0 Cn-1;
vector 1 includes the following feature vectors: H1W1C0, H1W1C1, h1w1c2.. H1W1 Cn-1;
the vectors 2, 3, 4. H1Wn-1C0, H1Wn-1C1.... H1Wn-1 Cn-1;
and then, when the H is 2 and the H is 3.
In the input layer, the invention only needs to scan the feature vectors of the initial data in the order of the channel direction, the horizontal direction, and the vertical direction; no complex control is needed, which keeps the control simple.
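For illustration only (this sketch is not part of the original disclosure; the dimensions and names are assumptions), reading the channel direction first, then the horizontal direction, then the vertical direction is exactly a row-major flatten of an (H, W, C) array, so the first feature vector group can be modeled in NumPy as:

    import numpy as np

    H, W, C = 4, 4, 8                                    # illustrative picture dimensions
    feature_map = np.arange(H * W * C).reshape(H, W, C)  # initial data, indexed [h][w][c]

    # C direction first, then W, then H: each (h, w) point contributes one
    # feature vector holding all of its channels, scanned top-to-bottom,
    # left-to-right.
    first_group = feature_map.reshape(H * W, C)

    # vector 0 is H0W0C0..H0W0Cn-1, vector 1 is H0W1C0..H0W1Cn-1, and the
    # first vector of the next row starts at index W.
    assert np.array_equal(first_group[0], feature_map[0, 0, :])
    assert np.array_equal(first_group[1], feature_map[0, 1, :])
    assert np.array_equal(first_group[W], feature_map[1, 0, :])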
Referring specifically to figs. 2 and 4, in S2, the convolution kernel is multiplied and accumulated with the first feature vector group to obtain the second feature vector group, wherein:
the convolution kernel is a constant vector, and there may be a plurality of convolution kernels.
It should be noted that the convolution kernel uses constant vectors that have been trained and finalized on publicly available mass data. For example, the convolution kernels can be extracted from a published deep-learning model, such as a YOLOv3 model trained on the public large-scale image library ImageNet.
As shown in fig. 4, a single convolution kernel (taking kernel_group0 in fig. 4 as an example) is itself also a vector group, and its vectors are likewise read by scanning in the order of channel direction (C direction) first, horizontal direction (W direction) second, and vertical direction (H direction) last.
Also taking the kernel_group0 convolution kernel in fig. 4 as an example, scanning in the CWH direction yields the following vector group:
when H = 0,
vector 0 includes: H0W0C0, H0W0C1, ..., H0W0Cn-1,
vector 1 includes: H0W1C0, H0W1C1, ..., H0W1Cn-1,
and vectors 2, 3, 4, ... follow in the same way, up to H0Wn-1C0, H0Wn-1C1, ..., H0Wn-1Cn-1;
scanning then continues likewise for H = 1, 2, 3, and so on.
If there are multiple convolution kernels, their number in the present invention is M, denoted kernel_group0, kernel_group1, kernel_group2, ..., kernel_groupM-1.
As shown in fig. 2, after the convolution kernel and the first feature vector group are multiplied and accumulated, the second feature vector group is obtained and sent to the register.
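The core operation here is one dot product over the channel direction per combination of input position, kernel position, and kernel group. A minimal sketch of that single multiply-accumulate (the names and sizes below are illustrative assumptions, not from the patent):

    import numpy as np

    C_K = 8                               # channel depth (illustrative)
    ifm_vec = np.random.rand(C_K)         # first-group vector at position (h, w)
    kernel_vec = np.random.rand(C_K)      # kernel vector at position (m, i, j)

    # One multiply-accumulate over k yields one partial sum ofm_t[h][w][m][i][j].
    partial_sum = float(np.dot(ifm_vec, kernel_vec))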
S2: multiplying and accumulating the convolution kernel with the first feature vector group to obtain the second feature vector group, specifically:
ofm_t[h1][w1][m1][i1][j1] = \sum_{k=0}^{C_K - 1} ifm[h][w][k] \times kernel[m][i][j][k], where h1 = h, w1 = w, m1 = m, i1 = i, j1 = j
The simplified expression is:
ofm_t[h][w][m][i][j] = \sum_{k=0}^{C_K - 1} ifm[h][w][k] \times kernel[m][i][j][k]
wherein ofm_t denotes the second feature vector group, ifm denotes the first feature vector group, and kernel denotes the convolution kernel;
h1 denotes the vertical direction index of the second feature vector group, h denotes the vertical direction index of the first feature vector group, and H denotes the maximum value of the vertical direction index of the first feature vector group;
w1 denotes the horizontal direction index of the second feature vector group, w denotes the horizontal direction index of the first feature vector group, and W denotes the maximum value of the horizontal direction index of the first feature vector group;
m1 denotes the channel direction index of the second feature vector group, m denotes the group number index of the convolution kernels, and M denotes the maximum value of the channel direction index of the second feature vector group, where m1 equals the group number index m of the convolution kernel;
i denotes the vertical direction index of the convolution kernel, and H_K denotes the maximum value of the vertical direction index of the convolution kernel;
j denotes the horizontal direction index of the convolution kernel, and W_K denotes the maximum value of the horizontal direction index of the convolution kernel;
k denotes the channel direction index of the convolution kernel, and C_K denotes the maximum value of the channel direction index of the convolution kernel, where the channel direction index of the convolution kernel equals the channel direction index of the first feature vector group;
i1 denotes the row index of the second feature vector group, and j1 denotes the column index of the second feature vector group.
The specific algorithm is as follows:
for (h = 0; h < H; h++)
  for (w = 0; w < W; w++)
    for (m = 0; m < M; m++)
      for (i = 0; i < H_K; i++)
        for (j = 0; j < W_K; j++)
          for (k = 0; k < C_K; k++)
            ofm_t[h][w][m][i][j] += ifm[h][w][k] * kernel[m][i][j][k];
wherein ofm_t denotes a feature vector of the second feature vector group of the present invention, ifm denotes a feature vector of the first feature vector group, and kernel denotes the convolution kernel; h denotes the vertical direction index of the second feature vector group, w its horizontal direction index, and m its channel direction index, where m equals the group number index of the convolution kernel; i denotes the vertical direction index of the kernel, j its horizontal direction index, and k its channel direction index, where k equals the channel direction index of the first feature vector group.
Note that, compared with the standard convolution formula, ofm_t represents the partial sums of the output feature vector group (ofm): relative to the final ofm, i x j partial sums remain to be added, i.e., as many as the w x h size of the convolution kernel. It can also be seen from the above formula that ifm requires no jump-address reading, only sequential reading. The standard convolution formula is as follows:
ofm[h0][w0][m0] = \sum_{i=0}^{H_K - 1} \sum_{j=0}^{W_K - 1} \sum_{k=0}^{C_K - 1} ifm[h0 \cdot s + i][w0 \cdot s + j][k] \times kernel[m0][i][j][k]
wherein ofm denotes the output feature vector group, ifm denotes the first feature vector group (also called the input feature vector group), and kernel denotes the convolution kernel;
h0 denotes the vertical direction index of the output feature vector group, h denotes the vertical direction index of the input feature vector group, and H denotes the maximum value of the vertical direction index of the input feature vector group;
w0 denotes the horizontal direction index of the output feature vector group, w denotes the horizontal direction index of the input feature vector group, and W denotes the maximum value of the horizontal direction index of the input feature vector group;
m0 denotes the channel direction index of the output feature vector group, m denotes the group number index of the convolution kernels, and M denotes the maximum value of the channel direction index of the output feature vector group;
i denotes the vertical direction index of the kernel, and H_K denotes the maximum value of the vertical direction index of the convolution kernel;
j denotes the horizontal direction index of the kernel, and W_K denotes the maximum value of the horizontal direction index of the convolution kernel;
k denotes the channel direction index of the kernel, and C_K denotes the maximum value of the channel direction index of the convolution kernel, where k is the same as the channel direction index of the input feature vector group;
s denotes the window stride.
The specific algorithm is as follows:
for (h0 = 0; h0 <= (H - H_K)/s; h0++)
  for (w0 = 0; w0 <= (W - W_K)/s; w0++)
    for (m0 = 0; m0 < M; m0++)
      for (i = 0; i < H_K; i++)
        for (j = 0; j < W_K; j++)
          for (k = 0; k < C_K; k++)
            ofm[h0][w0][m0] += ifm[h0*s + i][w0*s + j][k] * kernel[m0][i][j][k];
wherein ofm denotes the output feature vector group (output feature map), ifm denotes the input feature vector group (input feature map), and kernel denotes the convolution kernel; h denotes the vertical direction index of the output feature vector group, w its horizontal direction index, and m its channel direction index, where m equals the group number index of the convolution kernel; i denotes the vertical direction index of the kernel, j its horizontal direction index, and k its channel direction index, where k equals the channel direction index of the input feature vector group; s denotes the window stride.
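As a cross-reference, the standard formula can be run directly; the following NumPy sketch (array shapes and names are assumptions for illustration, not taken from the patent) implements it with valid padding and stride s:

    import numpy as np

    def conv_direct(ifm, kernel, s=1):
        # ofm[h0][w0][m0] = sum over i, j, k of
        #   ifm[h0*s + i][w0*s + j][k] * kernel[m0][i][j][k]
        H, W, C = ifm.shape             # input feature map, H x W x C
        M, HK, WK, CK = kernel.shape    # M kernel groups, each H_K x W_K x C_K
        assert C == CK                  # kernel depth equals input channel count
        H0 = (H - HK) // s + 1          # output height (valid padding)
        W0 = (W - WK) // s + 1          # output width
        ofm = np.zeros((H0, W0, M))
        for h0 in range(H0):
            for w0 in range(W0):
                for m0 in range(M):
                    window = ifm[h0 * s:h0 * s + HK, w0 * s:w0 * s + WK, :]
                    ofm[h0, w0, m0] = np.sum(window * kernel[m0])
        return ofm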
S3: accumulating the partial sums of the second feature vector group to obtain a third feature vector group, comprising:
shift-accumulating the second feature vector group to obtain the third feature vector group, expressed as:
ofm_F[h2][w2][m2] = \sum_{i=0}^{H_{K1} - 1} \sum_{j=0}^{W_{K1} - 1} ofm_t[h2 \cdot s + i][w2 \cdot s + j][m2][i][j]
The simplified expression is:
ofm_F[h2][w2][m2] = \sum_{i,j} ofm_t[h2 \cdot s + i][w2 \cdot s + j][m2][i][j]
wherein ofm_F denotes the third feature vector group and ofm_t denotes the second feature vector group;
h2 denotes the vertical direction index of the third feature vector group, h1 denotes the vertical direction index of the second feature vector group, and H1 denotes the maximum value of the vertical direction index of the second feature vector group;
w2 denotes the horizontal direction index of the third feature vector group, w1 denotes the horizontal direction index of the second feature vector group, and W1 denotes the maximum value of the horizontal direction index of the second feature vector group;
m2 denotes the channel direction index of the third feature vector group, m1 denotes the channel direction index of the second feature vector group, and M1 denotes the maximum value of the channel direction index of the second feature vector group;
H_K1 denotes the maximum value in the row direction of the second feature vector group, and W_K1 denotes the maximum value in the column direction of the second feature vector group;
s denotes the window stride, set according to actual requirements.
The specific algorithm is as follows:
for (h2 = 0; h2 <= (H1 - H_K1)/s; h2++)
  for (w2 = 0; w2 <= (W1 - W_K1)/s; w2++)
    for (m2 = 0; m2 < M1; m2++)
      for (i = 0; i < H_K1; i++)
        for (j = 0; j < W_K1; j++)
          ofm_F[h2][w2][m2] += ofm_t[h2*s + i][w2*s + j][m2][i][j];
For example, with s = 1, m = 0, and a kernel of w = h = 3, the shift-accumulation process of step S3 operates as shown in the following table:
ofm_F[0][0][0]=ofm_t[0][0][0][0][0]+ofm_t[0][1][0][0][1]+ofm_t[0][2][0][0][2]+ofm_t[1][0][0][1][0]+ofm_t[1][1][0][1][1]+ofm_t[1][2][0][1][2]+ofm_t[2][0][0][2][0]+ofm_t[2][1][0][2][1]+ofm_t[2][2][0][2][2]
[Table: decomposition of the partial sums ofm_t into the final result ofm_F. The leftmost column gives the row index of ofm_t, the second column the point index within each row, columns 3 to 11 the vector x vector partial sums, and the last column ofm_F; identically shaded entries are the ones accumulated together.]
The table above shows the specific decomposition by which the final result ofm_F is obtained from the partial sums ofm_t: the leftmost column gives the row index of ofm_t; the second column gives the index of each point on each row (00 is the 0th point of row 0, 01 is the 1st point of row 0, and so on); and columns 3 through 11 hold the partial sums obtained by multiplying the input first feature vector group by each feature vector of the convolution kernel (i.e., the vector x vector results).
To obtain the final result ofm_F, i.e., the last column, it is only necessary to add up the vector x vector results marked with the same shading. Clearly, the accumulation divides into partial-sum accumulation within a row and partial-sum accumulation between rows: each row contains 3 identically shaded blocks to be added, over 3 rows in total, so 9 shaded blocks in all are accumulated (diagonally), yielding the final ofm_F.
Therefore, as shown in the above table and in conjunction with fig. 5, the process of accumulating the partial sums of the second feature vector group further comprises two steps:
S31: storing the partial sums of one window in the row direction in a register;
S32: storing the partial sums of all windows in the row direction in on-chip RAM.
If the on-chip RAM capacity is insufficient, the partial sums of all windows in the row direction can be moved into and out of the chip.
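The two-stage scheme (stage 1: vector x vector partial sums; stage 2: shift accumulation) can be sketched end to end as follows. This is a minimal model for verification only, assuming stride s and valid padding; the names and shapes are illustrative, not from the patent:

    import numpy as np

    def conv_partial_sum(ifm, kernel, s=1):
        H, W, C = ifm.shape
        M, HK, WK, CK = kernel.shape
        assert C == CK
        # Stage 1: ofm_t[h][w][m][i][j] = sum_k ifm[h][w][k] * kernel[m][i][j][k];
        # the input is visited strictly sequentially, one vector at a time.
        ofm_t = np.einsum('hwk,mijk->hwmij', ifm, kernel)
        # Stage 2: shift-accumulate each window's partial sums into ofm_F.
        H2 = (H - HK) // s + 1
        W2 = (W - WK) // s + 1
        ofm_F = np.zeros((H2, W2, M))
        for h2 in range(H2):
            for w2 in range(W2):
                for i in range(HK):
                    for j in range(WK):
                        ofm_F[h2, w2, :] += ofm_t[h2 * s + i, w2 * s + j, :, i, j]
        return ofm_F

    # Cross-check against the standard convolution on random data.
    rng = np.random.default_rng(0)
    ifm = rng.random((6, 6, 4))        # H x W x C
    kernel = rng.random((2, 3, 3, 4))  # M x H_K x W_K x C_K
    ref = np.zeros((4, 4, 2))
    for h0 in range(4):
        for w0 in range(4):
            for m0 in range(2):
                ref[h0, w0, m0] = np.sum(ifm[h0:h0 + 3, w0:w0 + 3, :] * kernel[m0])
    assert np.allclose(conv_partial_sum(ifm, kernel), ref)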
The invention performs the multiply-accumulate calculation and the partial-sum accumulation directly on the initial data and the convolution kernels, so data reading and writing are very friendly to off-chip DDR and no read-write efficiency problem arises. The design is easy to scale to high compute requirements: the interface-side data read clock can be raised, or a multi-bank mode adopted, and each port is completely independent of the others, without any dependency.
S4: classifying the third feature vector group to obtain a classification result; the specific implementation is as follows:
the third feature vector group is substituted into the softmax() function and classified to obtain the result.
Softmax () function:
S_i = \frac{e^{V_i}}{\sum_j e^{V_j}}
where V_i is an element of the i-th feature vector of the third feature vector group, V_j is an element of the j-th feature vector of the third feature vector group, j runs over the number of feature vectors in the third feature vector group, and S_i is the classification probability of the picture.
Note that pictures whose final results S_i fall within the same range, or take the same numerical value, are classified into one class. For example, if the probability range of a picture class is 0.3-0.5 and S_i is 0.4, the picture belongs to that class. The end result is therefore a categorization of the input pictures.
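A numerically stable sketch of this classification step (the class labels are illustrative examples taken from the description above, and the scores are made up):

    import numpy as np

    def softmax(v):
        # S_i = exp(V_i) / sum_j exp(V_j); subtracting max(v) avoids overflow
        e = np.exp(v - np.max(v))
        return e / np.sum(e)

    classes = ["cat", "dog", "character", "flower", "bird"]  # example labels
    scores = np.array([1.2, 3.4, 0.5, -0.7, 2.1])            # third-group values
    probs = softmax(scores)                                  # sums to 1.0
    print(classes[int(np.argmax(probs))])                    # prints "dog"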
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the CNN acceleration method described above.
Referring to fig. 6, the present invention also includes a CNN accelerator, which includes: a processor, and a memory coupled to the processor, the memory having a computer program stored therein, the computer program, when executed by the processor, performing the CNN acceleration method described above.
It should be noted that the processor itself includes registers, and the memory includes RAM. Therefore, the invention optimizes the CNN algorithm and improves the performance efficiency on hardware.
After reading the above description, it will be apparent to one skilled in the art that the features described herein can be implemented by a method, a data processing system, or a computer program product. Accordingly, these features may be embodied entirely in hardware, entirely in software, or in a combination of hardware and software. They may also take the form of a computer program product stored on one or more computer-readable storage media having computer-readable program code segments or instructions embodied therein. The readable storage medium stores various types of data to support operation of the device and may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as an electrostatic hard disk, static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), optical storage, magnetic storage, flash memory, a magnetic or optical disk, and/or combinations thereof.
Specific application methods of the invention are as follows:
First, a garbage classification application of the invention comprises the following steps:
The garbage classification results comprise 5 types: plastic bottles, books, paper boxes, waste batteries, and kitchen garbage. A picture is input into the invention and scanned in the CWH order to obtain a first feature vector group. The convolution kernels are two kernels of 7 × 16 and 6 × 32; the first feature vector group and the convolution kernels are multiplied and accumulated to obtain a second feature vector group (the calculation of formula 1). Partial-sum accumulation (the calculation of formula 2) is then performed on the second feature vector group to obtain a third feature vector group. Finally, the third feature vector group is classified by the softmax() function to obtain a probability value, which is mapped to the corresponding garbage class.
Second, a license plate recognition application of the invention comprises the following steps:
Chinese license plate numbers involve the Chinese characters of 31 provinces, 26 letters, and 10 digits. A license plate picture is captured by a camera or other acquisition device and input into the invention. The license plate picture is scanned in the CWH order to obtain a first feature vector group; the first feature vector group and the convolution kernel are multiplied and accumulated to obtain a second feature vector group (the calculation of formula 1); partial-sum accumulation (the calculation of formula 2) is then performed on the second feature vector group to obtain a third feature vector group; and the third feature vector group is classified by the softmax() function to obtain probability values, which are mapped to the corresponding Chinese characters, letters, and digits to form a readable license plate number, facilitating the subsequent work of traffic police officers.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (5)

1. A CNN acceleration method, comprising:
inputting initial data, and reading the initial data in sequence to obtain a first feature vector group; specifically, scanning the initial data in the order of the channel direction, the horizontal direction, and then the vertical direction to obtain the first feature vector group;
multiplying and accumulating the convolution kernel with the first feature vector group to obtain a second feature vector group; specifically:
ofm_t[h1][w1][m1][i1][j1] = \sum_{k=0}^{C_K - 1} ifm[h][w][k] \times kernel[m][i][j][k]
wherein ofm_t denotes the second feature vector group, ifm denotes the first feature vector group, and kernel denotes the convolution kernel;
h1 denotes the vertical direction index of the second feature vector group, h denotes the vertical direction index of the first feature vector group, and H denotes the maximum value of the vertical direction index of the first feature vector group;
w1 denotes the horizontal direction index of the second feature vector group, w denotes the horizontal direction index of the first feature vector group, and W denotes the maximum value of the horizontal direction index of the first feature vector group;
m1 denotes the channel direction index of the second feature vector group, m denotes the group number index of the convolution kernels, and M denotes the maximum value of the channel direction index of the second feature vector group;
i denotes the vertical direction index of the convolution kernel, and H_K denotes the maximum value of the vertical direction index of the convolution kernel;
j denotes the horizontal direction index of the convolution kernel, and W_K denotes the maximum value of the horizontal direction index of the convolution kernel;
k denotes the channel direction index of the convolution kernel, and C_K denotes the maximum value of the channel direction index of the convolution kernel;
i1 denotes the row index of the second feature vector group, and j1 denotes the column index of the second feature vector group;
accumulating the partial sums of the second feature vector group to obtain a third feature vector group; specifically:
shift-accumulating the second feature vector group to obtain the third feature vector group, expressed as:
ofm_F[h2][w2][m2] = \sum_{i=0}^{H_{K1} - 1} \sum_{j=0}^{W_{K1} - 1} ofm_t[h2 \cdot s + i][w2 \cdot s + j][m2][i][j]
wherein ofm_F denotes the third feature vector group and ofm_t denotes the second feature vector group;
h2 denotes the vertical direction index of the third feature vector group, h1 denotes the vertical direction index of the second feature vector group, and H1 denotes the maximum value of the vertical direction index of the second feature vector group;
w2 denotes the horizontal direction index of the third feature vector group, w1 denotes the horizontal direction index of the second feature vector group, and W1 denotes the maximum value of the horizontal direction index of the second feature vector group;
m2 denotes the channel direction index of the third feature vector group, m1 denotes the channel direction index of the second feature vector group, and M1 denotes the maximum value of the channel direction index of the second feature vector group;
H_K1 denotes the maximum value in the row direction of the second feature vector group, and W_K1 denotes the maximum value in the column direction of the second feature vector group;
s denotes the window stride, set according to actual requirements;
the accumulating of the partial sums of the second feature vector group to obtain the third feature vector group further comprises:
storing the partial sums of one window in the row direction in a register;
storing the partial sums of all windows in the row direction in an on-chip RAM;
classifying the third feature vector group to obtain a classification result; specifically:
substituting the third feature vector group into the softmax() function and performing classification to obtain the classification result.
2. The CNN acceleration method according to claim 1, wherein, in the multiplying and accumulating of the convolution kernel with the first feature vector group to obtain the second feature vector group:
the convolution kernel is a constant vector.
3. The CNN acceleration method according to claim 2, wherein, in the multiplying and accumulating of the convolution kernel with the first feature vector group to obtain the second feature vector group:
there is a plurality of convolution kernels.
4. A computer-readable storage medium, having stored thereon a computer program which, when executed, performs the CNN acceleration method according to any one of claims 1-3.
5. A CNN accelerator, comprising: a processor, and a memory coupled to the processor, the memory having stored therein a computer program that, when executed by the processor, performs the CNN acceleration method of any one of claims 1-3.
CN202010784854.9A 2020-06-02 2020-08-06 CNN acceleration method and accelerator Pending CN111898743A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010487589 2020-06-02
CN2020104875898 2020-06-02

Publications (1)

Publication Number Publication Date
CN111898743A true CN111898743A (en) 2020-11-06

Family

ID=73246567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010784854.9A Pending CN111898743A (en) 2020-06-02 2020-08-06 CNN acceleration method and accelerator

Country Status (1)

Country Link
CN (1) CN111898743A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024040421A1 (en) * 2022-08-23 2024-02-29 Intel Corporation Fractional-bit quantization and deployment of convolutional neural network models


Similar Documents

Publication Publication Date Title
Wilkinson et al. Semantic and verbatim word spotting using deep neural networks
Zheng et al. SIFT meets CNN: A decade survey of instance retrieval
Cakir et al. Online supervised hashing
KR102305568B1 (en) Finding k extreme values in constant processing time
CN101446962B (en) Data conversion method, device thereof and data processing system
WO2013129580A1 (en) Approximate nearest neighbor search device, approximate nearest neighbor search method, and program
CN111914085A (en) Text fine-grained emotion classification method, system, device and storage medium
US11829376B2 (en) Technologies for refining stochastic similarity search candidates
US20190080226A1 (en) Method of designing neural network system
CN106570173B (en) Spark-based high-dimensional sparse text data clustering method
CN114419381A (en) Semantic segmentation method and road ponding detection method and device applying same
CN114399649B (en) Rapid multi-view semi-supervised learning method and system based on learning graph
CN115390565A (en) Unmanned ship dynamic path planning method and system based on improved D-star algorithm
CN111898743A (en) CNN acceleration method and accelerator
Djenouri et al. Deep learning based decomposition for visual navigation in industrial platforms
CN103119606B (en) A kind of clustering method of large-scale image data and device
CN113255892A (en) Method and device for searching decoupled network structure and readable storage medium
US11989553B2 (en) Technologies for performing sparse lifting and procrustean orthogonal sparse hashing using column read-enabled memory
CN116150694A (en) Dynamic graph anomaly detection method
Song et al. Prada: Point cloud recognition acceleration via dynamic approximation
CN109614581A (en) The Non-negative Matrix Factorization clustering method locally learnt based on antithesis
Hamid et al. Supervised learning of salient 2D views of 3D models
CN112733807A (en) Face comparison graph convolution neural network training method and device
CN116777727B (en) Integrated memory chip, image processing method, electronic device and storage medium
US20230334289A1 (en) Deep neural network accelerator with memory having two-level topology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination