CN111967582B

CN111967582B - CNN convolutional layer operation method and CNN convolutional layer operation accelerator

Info

Publication number: CN111967582B
Application number: CN202010791455.5A
Authority: CN
Inventors: 杨继林
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2020-08-07
Filing date: 2020-08-07
Publication date: 2022-07-08
Anticipated expiration: 2040-08-07
Also published as: CN111967582A

Abstract

The invention provides a CNN convolutional layer operation method and a CNN convolutional layer operation accelerator, both of which: read the convolution kernel used for performing the CNN convolutional layer operation on the feature image to be processed, and convert the read convolution kernel into a weight matrix H(h_pq); read a block image from the feature image to be processed according to a preset image size threshold, and, according to the weight matrix H(h_pq), calculate the local CNN convolutional layer operation result corresponding to the currently read block image; and judge whether the whole feature image to be processed has been read: if so, arrange the obtained local operation results according to the relative positions of the block images and splice them to obtain the CNN convolutional layer operation result corresponding to the whole feature image to be processed; if not, continue by reading the next block image. The invention serves to reduce the complexity of the CNN convolution operation, relieve storage bandwidth pressure, and reduce the cost of completing the CNN convolutional layer operation.

Description

CNN convolutional layer operation method and CNN convolutional layer operation accelerator

Technical Field

The invention relates to the field of convolution operation acceleration, in particular to a CNN convolution layer operation method and a CNN convolution layer operation accelerator.

Background

With the continuous development of CNNs (Convolutional Neural Networks), CNNs are applied more and more widely in the fields of image classification, image recognition, and the like.

The operation of a CNN convolutional layer is a two-dimensional convolution. A common implementation uses a sliding window: for a k × k convolution kernel, a dedicated control module obtains a two-dimensional feature-map window of the same size, slides this window over the feature map on which the convolutional layer calculation is to be performed, and performs multiply-accumulate operations between the feature-map values in the window and the corresponding points of the convolution kernel. Realizing two-dimensional convolution with a sliding window is intuitive, and once the correct two-dimensional feature-map window is available the subsequent calculation is relatively simple. However, the control module that generates the two-dimensional window is relatively complex to implement. In addition, for a k × k convolution kernel and a feature map that is input online, an additional k-1 lines of storage are often required, which increases the cost. Furthermore, in the conventional CNN convolutional layer operation the weights in the convolution kernel must be read repeatedly, which increases the storage bandwidth pressure to a certain extent.
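For reference, the conventional sliding-window scheme described above can be sketched as follows. This is a minimal illustration, not taken from the patent: the function name `sliding_window_conv`, a stride of 1, and the absence of padding are assumptions, and the kernel is applied without flipping, as is usual in CNN layers.

```python
def sliding_window_conv(image, kernel):
    """Sliding-window 2D convolution (assumed: stride 1, no padding, no kernel flip)."""
    m, n = len(image), len(image[0])
    k = len(kernel)
    out = [[0] * (n - k + 1) for _ in range(m - k + 1)]
    for r in range(m - k + 1):            # top row of the current window
        for c in range(n - k + 1):        # left column of the current window
            acc = 0
            for p in range(k):
                for q in range(k):
                    # every weight is touched again for every window position
                    acc += kernel[p][q] * image[r + p][c + q]
            out[r][c] = acc
    return out


# Arbitrary example data (not from the patent).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
kernel = [[1, 0],
          [0, 1]]
result = sliding_window_conv(image, kernel)
```

Note that the innermost loops revisit every weight once per window position, which is the repeated weight access the background section refers to.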

Therefore, the present invention provides a CNN convolutional layer operation method and a CNN convolutional layer operation accelerator, which are used to solve the above problems.

Disclosure of Invention

In view of the above-mentioned deficiencies of the prior art, the present invention provides a CNN convolutional layer operation method and a CNN convolutional layer operation accelerator, which are used for reducing the complexity of the CNN convolution operation. The invention is also used for reducing storage bandwidth pressure and for reducing the cost of completing the CNN convolutional layer operation.

In a first aspect, the present invention provides a CNN convolutional layer operation method, including the steps of:

S1, reading the convolution kernel used for performing the CNN convolutional layer operation on the feature image to be processed, and converting the read convolution kernel into a weight matrix H(h_pq), wherein the convolution kernel is a k × k convolution kernel and h_pq is the (p, q) element of the weight matrix H(h_pq); p = 0, 1, 2, …, k-1; q = 0, 1, 2, …, k-1;

S2, reading a block image from the feature image to be processed according to a preset image size threshold, and calculating, according to the weight matrix H(h_pq), the local CNN convolutional layer operation result corresponding to the currently read block image;

S3, judging whether the whole feature image to be processed has been read; if so, continuing to step S4, otherwise repeating step S2;

S4, arranging the local CNN convolutional layer operation results obtained in step S2 according to the relative positions of the block images, and splicing them to obtain the CNN convolutional layer operation result corresponding to the whole feature image to be processed;

wherein, in step S2, calculating the local CNN convolutional layer operation result corresponding to the currently read block image according to the weight matrix H(h_pq) comprises the following steps:

P1, converting the currently read block image into an image matrix A(a_ij), wherein the block image is a digital image of m × n pixels and a_ij is the (i, j) element of the image matrix A(a_ij); i = 0, 1, 2, …, m-1; j = 0, 1, 2, …, n-1;

P2, reading each element h_pq of the weight matrix H(h_pq), obtaining for each element h_pq the product matrix formed by all elements of the image matrix A(a_ij) that need to be multiplied by that element, and multiplying each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq;

P3, calculating the sum of all the obtained local matrices, the sum being the local CNN convolutional layer operation result corresponding to the currently read block image;

wherein, the block images read in step S2 each time are different from each other;

in step S2, a block image on the feature image to be processed is read according to a preset image size threshold, and the reading method includes:

when reading the block image for the first time, reading a block image meeting the requirement of the image size threshold from the feature image to be processed according to a preset reading initial position;

when the block images are read again, each read block image contains k-1 rows or k-1 columns of pixels of its respective adjacent block image.
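The overlapping read pattern in step S2 can be sketched along one image axis as follows. The helper name `block_ranges`, the use of half-open (start, stop) index ranges, and the clipping of the last block at the image edge are assumptions made for illustration; the text only requires that adjacent blocks share k-1 rows or columns.

```python
def block_ranges(length, block, k):
    """Half-open (start, stop) ranges along one axis; neighbours share k-1 entries.

    Assumes block >= k, so that each step advances by at least one pixel.
    """
    if length <= block:
        return [(0, length)]            # the whole axis fits into one block
    step = block - (k - 1)              # advance so that k-1 entries are re-read
    ranges = []
    start = 0
    while True:
        stop = min(start + block, length)   # clip the last block (assumption)
        ranges.append((start, stop))
        if stop == length:
            break
        start += step
    return ranges


# Arbitrary example values (not from the patent): axis of 20 pixels, blocks of 8, k = 3.
ranges = block_ranges(20, 8, 3)
```

Because adjacent blocks share k-1 rows or columns, the local results of neighbouring blocks line up without gaps or overlap when they are spliced in step S4, assuming a stride of 1 and no padding.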

Further, the product matrix corresponding to the element h_pq in step P2 covers the following cases:

when p = 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) that remain after removing columns n-k+1, n-k+2, n-k+3, …, n-1 and rows m-k+1, m-k+2, m-k+3, …, m-1;

when p = 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) that remain after removing columns 0, 1, 2, …, q-1 and n-k+q+1, n-k+q+2, n-k+q+3, …, n-1 and rows m-k+1, m-k+2, m-k+3, …, m-1;

when p ≠ 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) that remain after removing columns n-k+1, n-k+2, n-k+3, …, n-1 and rows 0, 1, 2, …, p-1 and m-k+p+1, m-k+p+2, m-k+p+3, …, m-1;

when p ≠ 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) that remain after removing columns 0, 1, 2, …, q-1 and n-k+q+1, n-k+q+2, n-k+q+3, …, n-1 and rows 0, 1, 2, …, p-1 and m-k+p+1, m-k+p+2, m-k+p+3, …, m-1.
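The four cases above reduce to one rule per axis: for an offset p (rows) or q (columns), a contiguous run of m-k+1 rows or n-k+1 columns starting at that offset is kept, and everything else is removed. A minimal sketch of that bookkeeping follows; `removed_indices` is a hypothetical helper name, and the sizes in the usage lines are arbitrary example values.

```python
def removed_indices(size, offset, k):
    """Indices dropped along one axis for kernel offset `offset` (p for rows, q for columns)."""
    kept = range(offset, offset + size - k + 1)   # size-k+1 consecutive indices are kept
    return [i for i in range(size) if i not in kept]


# Arbitrary example: n = 5 columns, k = 3, so each product matrix keeps 3 columns.
n, k = 5, 3
cols_removed = {q: removed_indices(n, q, k) for q in range(k)}
```

For q = 0 only trailing columns are removed, for q = k-1 only leading columns, and for the offsets in between some of each, which matches the case distinction in the text.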

Further, the weight matrix H(h_pq) obtained by the conversion in step S1 is stored in a cache, and the image matrix A(a_ij) obtained by the conversion in step P1 is stored in a cache.

Further, the CNN convolutional layer operation method is implemented based on an FPGA.

Further, in step P2, a multiplier array is used to multiply each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq.

In another aspect, the present invention provides a CNN convolutional layer arithmetic accelerator, including:

the first data pre-reading module is used for reading the convolution kernel used for performing the CNN convolutional layer operation on the feature image to be processed and converting the read convolution kernel into a weight matrix H(h_pq), wherein the convolution kernel is a k × k convolution kernel and h_pq is the (p, q) element of the weight matrix H(h_pq); p = 0, 1, 2, …, k-1; q = 0, 1, 2, …, k-1;

the second data pre-reading module is used for reading a block image from the feature image to be processed according to a preset image size threshold;

a local operation module for calculating, according to the weight matrix H(h_pq), the local CNN convolutional layer operation result corresponding to the block image currently read by the second data pre-reading module;

the judging module is used for judging whether the whole characteristic image to be processed is read completely;

the convolutional layer operation result output module is used for arranging the local operation results of the CNN convolutional layers obtained by the local operation module according to the relative position relationship between the block images when the judgment module judges that the whole characteristic image to be processed is read, and then splicing the CNN convolutional layers to obtain and output the CNN convolutional layer operation results corresponding to the whole characteristic image to be processed;

the calling module is used for calling the second data pre-reading module to continue reading when the judging module judges that the whole feature image to be processed has not been read;

wherein, the local operation module comprises:

an image matrix conversion unit for converting the currently read block image into an image matrix A(a_ij), wherein the block image is a digital image of m × n pixels and a_ij is the (i, j) element of the image matrix A(a_ij); i = 0, 1, 2, …, m-1; j = 0, 1, 2, …, n-1;

a local matrix acquisition unit for reading each element h_pq of the weight matrix H(h_pq), obtaining for each element h_pq the product matrix formed by all elements of the image matrix A(a_ij) that need to be multiplied by that element, and multiplying each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq;

a local operation result acquisition unit for calculating the sum of all the obtained local matrices, the sum being the local CNN convolutional layer operation result corresponding to the currently read block image;

the second data pre-reading module reads different block images of the characteristic image to be processed each time;

the second data pre-reading module reads a block image on the characteristic image to be processed according to a preset image size threshold, and the reading method comprises the following steps:

when reading the block image for the first time, reading a block image meeting the requirement of the image size threshold from the characteristic image to be processed according to a preset reading initial position;

Further, the product matrix corresponding to the element h_pq in the local matrix acquisition unit covers the following cases:

when p = 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) that remain after removing columns 0, 1, 2, …, q-1 and n-k+q+1, n-k+q+2, n-k+q+3, …, n-1 and rows m-k+1, m-k+2, m-k+3, …, m-1;

when p ≠ 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) that remain after removing columns n-k+1, n-k+2, n-k+3, …, n-1 and rows 0, 1, 2, …, p-1 and m-k+p+1, m-k+p+2, m-k+p+3, …, m-1;

Furthermore, the CNN convolution layer operation accelerator also comprises a cache;

the weight matrix H(h_pq) obtained by the conversion in the first data pre-reading module is stored in the cache;

the image matrix A(a_ij) obtained by the conversion in the image matrix conversion unit is stored in the cache.

Further, the CNN convolutional layer operation accelerator is implemented based on an FPGA.

Further, the local matrix acquisition unit uses a multiplier array to multiply each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq.

The beneficial effect of the invention is that,

(1) the CNN convolutional layer operation method and the CNN convolutional layer operation accelerator provided by the invention avoid the use of a feature map two-dimensional window in the prior art, further avoid the use of a control module for generating the feature map two-dimensional window in the prior art, reduce the complexity of CNN convolutional operation to a certain extent and are convenient to realize.

(2) The CNN convolutional layer operation method and the CNN convolutional layer operation accelerator provided by the invention take each weight in the convolution kernel (corresponding to an element h_pq) as the starting point, directly obtain the product matrix formed by all feature points in the block image (corresponding to elements of the image matrix A(a_ij) converted from the block image) that need to be multiplied by that weight, multiply each weight in the convolution kernel by its corresponding product matrix to obtain the local matrix corresponding to each weight, and then add the local matrices of all weights in the convolution kernel to obtain the local CNN convolutional layer operation result corresponding to each block image.

(3) The CNN convolutional layer operation method and the CNN convolutional layer operation accelerator provided by the invention store the weight matrix H(h_pq) and the image matrix A(a_ij) required in the operation process in a cache. On the one hand, this avoids extra storage and reduces the cost of completing the CNN convolutional layer operation to a certain extent; on the other hand, it reduces the number of times the data required for the CNN convolutional layer operation is read from external memory, which helps to increase the speed of the CNN convolutional layer operation to a certain extent.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention or of the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It will be obvious to those skilled in the art that other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.

Fig. 2 is a schematic diagram of the distribution of the relative position relationship of the block image F1 in the feature image to be processed in the present invention.

Fig. 3 is a schematic diagram of the distribution of the relative position relationship of the block image F2 in the feature image to be processed in the present invention.

Fig. 4 is a schematic diagram of the distribution of the relative position relationship of the block image F3 in the feature image to be processed in the present invention.

Fig. 5 is a schematic diagram of the distribution of the relative position relationship of the block image F4 in the feature image to be processed in the present invention.

FIG. 6 is a schematic diagram of the arrangement positions of the matrix C1, the matrix C2, the matrix C3 and the matrix C4 in the present invention.

FIG. 7 is a schematic block diagram of a system of one embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

FIG. 1 is a schematic flow chart of a CNN convolutional layer operation method according to an embodiment of the present invention.

As shown in fig. 1, the CNN convolutional layer operation method includes:

Step S1, reading the convolution kernel used for performing the CNN convolutional layer operation on the feature image to be processed, and converting the read convolution kernel into a weight matrix H(h_pq), wherein the convolution kernel is a k × k convolution kernel and h_pq is the (p, q) element of the weight matrix H(h_pq); p = 0, 1, 2, …, k-1; q = 0, 1, 2, …, k-1;

Step S2, reading a block image from the feature image to be processed according to the preset image size threshold, and calculating, according to the weight matrix H(h_pq), the local CNN convolutional layer operation result corresponding to the currently read block image;

step S3, judging whether the whole characteristic image to be processed is read completely, if so, continuing to execute step S4, otherwise, repeatedly executing step S2;

Step S4, arranging the local CNN convolutional layer operation results obtained in step S2 according to the relative positions of the block images, and splicing them to obtain the CNN convolutional layer operation result corresponding to the whole feature image to be processed.

Wherein, in step S2, calculating the local CNN convolutional layer operation result corresponding to the currently read block image according to the weight matrix H(h_pq) comprises:

Step P1, converting the currently read block image into an image matrix A(a_ij), wherein the block image is a digital image of m × n pixels and a_ij is the (i, j) element of the image matrix A(a_ij); i = 0, 1, 2, …, m-1; j = 0, 1, 2, …, n-1;

Step P2, reading each element h_pq of the weight matrix H(h_pq), obtaining for each element h_pq the product matrix formed by all elements of the image matrix A(a_ij) that need to be multiplied by that element, and multiplying each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq;

Step P3, calculating the sum of all the obtained local matrices, the sum being the local CNN convolutional layer operation result corresponding to the currently read block image.
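Steps P1 to P3 can be summarised in one formula. The notation A^{(p,q)} for the product matrix of h_pq and C for the local operation result is introduced here for illustration and does not appear in the original text; a stride of 1 and no padding are assumed, consistent with the (m-k+1) × (n-k+1) size of the product matrices.

```latex
C = \sum_{p=0}^{k-1}\sum_{q=0}^{k-1} h_{pq}\,A^{(p,q)},
\qquad
A^{(p,q)}_{rs} = a_{r+p,\,s+q},
\quad r = 0,\dots,m-k,\; s = 0,\dots,n-k,
\qquad\text{so that}\qquad
c_{rs} = \sum_{p=0}^{k-1}\sum_{q=0}^{k-1} h_{pq}\,a_{r+p,\,s+q}.
```

Each entry c_rs is therefore the same value that a k × k sliding window anchored at (r, s) would produce; only the order of the multiplications and additions differs.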

Wherein, the block images of the characteristic image to be processed read each time in the step S2 are different from each other;

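In step S2, a block image is read from the feature image to be processed according to a preset image size threshold as follows:

when a block image is read for the first time, a block image meeting the image size threshold is read from the feature image to be processed starting at a preset reading start position;

when a block image is read again, each read block image contains k-1 rows or k-1 columns of pixels of its adjacent block image.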

Optionally, as an embodiment of the present invention, the product matrix corresponding to the element h_pq in step P2 covers the following cases:

when p ≠ 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) that remain after removing columns n-k+1, n-k+2, n-k+3, …, n-1 and rows 0, 1, 2, …, p-1 and m-k+p+1, m-k+p+2, m-k+p+3, …, m-1.

Optionally, as an embodiment of the present invention, the weight matrix H(h_pq) obtained by the conversion in step S1 is stored in a cache, and the image matrix A(a_ij) obtained by the conversion in step P1 is stored in a cache.

Optionally, as an embodiment of the present invention, the CNN convolutional layer operation method is implemented based on an FPGA.

Optionally, as an embodiment of the present invention, in step P2, a multiplier array is used to multiply each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq.

To facilitate understanding of the present invention, the CNN convolutional layer operation method provided by the invention is further described below, based on its principle and in combination with the process of performing the CNN convolutional layer operation on the feature image to be processed in the embodiment.

Specifically, the CNN convolutional layer operation method includes:

Step L1, reading the convolution kernel used for performing the CNN convolutional layer operation on the feature image to be processed, and converting the read convolution kernel into a weight matrix H(h_pq), wherein the convolution kernel is a k × k convolution kernel and h_pq is the (p, q) element of the weight matrix H(h_pq); p = 0, 1, 2, …, k-1; q = 0, 1, 2, …, k-1.

The characteristic image to be processed is an image which needs to be subjected to CNN convolutional layer operation.

The feature image to be processed and the convolution kernel required for CNN convolution layer operation are both stored in an external DDR (Double Data Rate) in advance.

In the present embodiment, for convenience of description, k = 3 is taken as an example. Correspondingly, the convolution kernel used for the CNN convolutional layer operation on the feature image to be processed in this embodiment is a 3 × 3 convolution kernel, and the corresponding weight matrix H(h_pq) is a third-order matrix, specifically with p, q = 0, 1, 2.

To make the weight matrix H(h_pq) convenient to read, the third-order matrix obtained by the conversion in step L1 is stored in a cache, so that each element (i.e., weight) of the weight matrix H(h_pq) can be read directly from the cache when it is needed.

Step L2, reading a block image from the feature image to be processed according to the preset image size threshold, and calculating, according to the weight matrix H(h_pq), the local CNN convolutional layer operation result corresponding to the currently read block image.

In step L2, the block images of the feature image to be processed read each time are different from each other.

In step L2, a block image on the feature image to be processed is read according to a preset image size threshold, and the reading method includes:

when the block images are read again, each of the read block images contains two rows or two columns of pixels of its respective adjacent block image.

Therefore, in this embodiment, when the size of the feature image to be processed is smaller than or equal to the image size threshold, a block image that meets the requirement of the image size threshold on the feature image to be processed, which is read for the first time according to the preset image size threshold in step L2, is the feature image to be processed itself; when the size of the feature image to be processed is larger than the image size threshold, a block image which meets the requirement of the image size threshold on the feature image to be processed and is read for the first time according to the preset image size threshold in step L2 is a local image of the feature image to be processed.

In this embodiment, the size of the feature image to be processed is larger than the image size threshold, the preset image size threshold is 10 × 9 pixels, and the feature image to be processed is 15 × 12 pixels.

In this embodiment, in step L2, a block image F1 with a size of 10 × 9 pixels is read from the feature image to be processed for the first time according to the preset image size threshold (the reading start position can be preset as the pixel at the upper left corner of the feature image to be processed). The position of the block image F1 on the feature image to be processed (15 × 12 pixels) is shown in fig. 2. The dashed grid in fig. 2 represents the feature image to be processed of 15 × 12 pixels, each dashed box represents one pixel of the feature image to be processed, and the portion framed by the black rectangle in fig. 2 represents the block image F1.

Wherein, in step L2, calculating the local CNN convolutional layer operation result corresponding to the currently read block image F1 according to the weight matrix H(h_pq) comprises the following steps:

Step L21, converting the currently read block image F1 into an image matrix A(a_ij), where a_ij is the (i, j) element of the image matrix A(a_ij); i = 0, 1, 2, …, m-1; j = 0, 1, 2, …, n-1. The block image F1 in the present embodiment corresponds to m = 7, n = 6; the resulting 7 × 6 image matrix is denoted as image matrix A1.

The image matrix A(a_ij) obtained by the conversion in step L21 is stored in a cache, so that it can be taken directly from the cache whenever it is needed.

Step L22, reading each element h_pq of the weight matrix H(h_pq), obtaining for each element h_pq the product matrix formed by all elements of the image matrix A(a_ij) that need to be multiplied by that element, and multiplying each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq.

Wherein, the product matrix corresponding to the element h_pq in step L22 covers the following cases:

when p = 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) that remain after removing columns n-k+1, n-k+2, n-k+3, …, n-1 and rows m-k+1, m-k+2, m-k+3, …, m-1;

when p ≠ 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) that remain after removing columns 0, 1, 2, …, q-1 and n-k+q+1, n-k+q+2, n-k+q+3, …, n-1 and rows 0, 1, 2, …, p-1 and m-k+p+1, m-k+p+2, m-k+p+3, …, m-1.

In the present embodiment, when the element h_pq read from the weight matrix H(h_pq) is h₀₀, since p = 0 and q = 0, the product matrix corresponding to the element h₀₀ is:

the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A1 that remain after removing columns n-k+1, n-k+2, n-k+3, …, n-1 and rows m-k+1, m-k+2, m-k+3, …, m-1, namely the 5 × 4 matrix formed by all rows and columns of the image matrix A1 that remain after removing the 4th and 5th columns and the 5th and 6th rows (i.e., the elements a_ij with i = 0, …, 4 and j = 0, …, 3), which is recorded as product matrix 00.

In the present embodiment, when the element h_pq read from the weight matrix H(h_pq) is h₀₁, since p = 0 and q = 1 ≠ 0, the product matrix corresponding to the element h₀₁ is:

the 5 × 4 matrix formed by splicing all rows and columns of the image matrix A1 that remain after removing the 0th and 5th columns and the 5th and 6th rows (i.e., the elements a_ij with i = 0, …, 4 and j = 1, …, 4), which is recorded as product matrix 01.

In the present embodiment, when the element h_pq read from the weight matrix H(h_pq) is h₁₀, since q = 0 and p = 1 ≠ 0, the product matrix corresponding to the element h₁₀ is:

the 5 × 4 matrix formed by splicing all rows and columns of the image matrix A1 that remain after removing the 4th and 5th columns and the 0th and 6th rows (i.e., the elements a_ij with i = 1, …, 5 and j = 0, …, 3), which is recorded as product matrix 10.

In the present embodiment, when the element h_pq read from the weight matrix H(h_pq) is h₁₁, since p = q = 1, the product matrix corresponding to the element h₁₁ is:

the 5 × 4 matrix formed by splicing all rows and columns of the image matrix A1 that remain after removing the 0th and 5th columns and the 0th and 6th rows (i.e., the elements a_ij with i = 1, …, 5 and j = 1, …, 4), which is recorded as product matrix 11.

When other elements h_pq of the weight matrix H(h_pq) are read, the corresponding product matrices can be obtained in the same manner.
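The product matrices of this embodiment can be reproduced with a small sketch. The element labels (strings such as "a00") and the helper name `product_matrix` are illustrative assumptions; m = 7, n = 6 and k = 3 are the values used in this embodiment.

```python
m, n, k = 7, 6, 3

# Symbolic image matrix A1: entry (i, j) is the label "a<i><j>".
A1 = [[f"a{i}{j}" for j in range(n)] for i in range(m)]


def product_matrix(A, p, q, k):
    """Rows p..p+m-k and columns q..q+n-k of A, i.e. what h_pq is multiplied with."""
    rows = len(A) - k + 1
    cols = len(A[0]) - k + 1
    return [row[q:q + cols] for row in A[p:p + rows]]


pm00 = product_matrix(A1, 0, 0, k)   # drops columns 4, 5 and rows 5, 6
pm01 = product_matrix(A1, 0, 1, k)   # drops columns 0, 5 and rows 5, 6
pm10 = product_matrix(A1, 1, 0, k)   # drops columns 4, 5 and rows 0, 6
pm11 = product_matrix(A1, 1, 1, k)   # drops columns 0, 5 and rows 0, 6
```

Every product matrix is 5 × 4, and the remaining five elements h₀₂, h₁₂, h₂₀, h₂₁, h₂₂ follow from the same call with the corresponding (p, q).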

For the image matrix A1, after each element h_pq of the weight matrix H(h_pq) has been read and its corresponding product matrix obtained, each element h_pq is multiplied by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq.

It can be seen that the present invention takes each weight in the convolution kernel (corresponding to an element h_pq) as the starting point and directly obtains, for each weight, the product matrix formed by all feature points in the block image that need to be multiplied by that weight. This reduces, to a certain extent, the number of times each weight in the convolution kernel is read from external storage (DDR), which helps to relieve the storage bandwidth pressure.
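As a rough illustration of the saving, assume a naive sliding-window implementation that fetches all k × k weights again for every window position (the text does not specify the baseline in this detail), and compare it with fetching each weight once per block image. With the m = 7, n = 6, k = 3 values of this embodiment:

```latex
N_{\text{window}} = k^2\,(m-k+1)(n-k+1) = 9 \cdot 5 \cdot 4 = 180,
\qquad
N_{\text{weight-first}} = k^2 = 9 .
```

Both counts are weight fetches for one block image; the symbols N are introduced here only for this comparison.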

Step L23, calculating the sum of the local matrices obtained in step L22, the sum being the local CNN convolutional layer operation result corresponding to the currently read block image F1.

The sum of the local matrices is calculated by addition of the matrices.
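Steps L22 and L23 can be sketched as follows: one local matrix is built per weight, and the local matrices are then added element by element. The function names and the small 4 × 4 example data are assumptions for illustration, and a stride of 1 with no padding is assumed.

```python
def local_matrices(A, H):
    """One local matrix per weight: h_pq times its product matrix (step L22)."""
    k = len(H)
    rows, cols = len(A) - k + 1, len(A[0]) - k + 1
    result = []
    for p in range(k):
        for q in range(k):
            product = [row[q:q + cols] for row in A[p:p + rows]]
            result.append([[H[p][q] * a for a in row] for row in product])
    return result


def matrix_sum(matrices):
    """Element-wise sum of equally sized matrices (step L23)."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(M[r][s] for M in matrices) for s in range(cols)]
            for r in range(rows)]


# Arbitrary example data (not from the patent).
A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
H = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
C = matrix_sum(local_matrices(A, H))
```

With this diagonal kernel each output entry is the sum of three diagonal neighbours, for example C[0][0] = 1 + 6 + 11 = 18, which is what a 3 × 3 sliding window at the top-left corner would give.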

Step L3, judging whether the whole feature image to be processed has been read; if so, continuing to step L4, otherwise repeating step L2.

In this embodiment, it is obvious that after the block image F1 is read, the entire feature image to be processed is not completely read, and it is necessary to continue reading the block images of the feature image to be processed.

In this embodiment, once the whole feature image to be processed has been read in step L2 according to the preset image size threshold (10 × 9 pixels), the block images read include, in addition to the block image F1, a block image F2, a block image F3, and a block image F4, whose positions on the whole feature image to be processed are shown schematically in figs. 3, 4, and 5, respectively.

For each of the block image F2, the block image F3, and the block image F4, the CNN convolutional layer local operation result (which is also a matrix) corresponding thereto may be obtained with reference to the block image F1 in step L2.

It should be noted that, before each new block image is read, the last stored block image may be cleared.

Step L4: arrange the CNN convolutional layer local operation results obtained in step L2 according to the relative positions of the block images, and splice them to obtain the CNN convolutional layer operation result corresponding to the whole feature image to be processed.

For example, the feature image to be processed with 15 × 12 pixels is to be read four times, corresponding to 4 block images, where the 4 block images sequentially include a block image F1, a block image F2, a block image F3, and a block image F4 according to the reading order, and a schematic diagram of a distribution of relative positions of the block image F1, the block image F2, the block image F3, and the block image F4 in the feature image to be processed is shown in fig. 2.

The CNN convolutional layer local operation results corresponding to the block image F1, the block image F2, the block image F3, and the block image F4 are recorded as a matrix C1, a matrix C2, a matrix C3, and a matrix C4, respectively; a schematic diagram of the arrangement positions of the matrix C1, the matrix C2, the matrix C3, and the matrix C4 is shown in fig. 6. The matrix C1, the matrix C2, the matrix C3, and the matrix C4 are spliced according to their arrangement positions to form a spliced matrix, which is the CNN convolutional layer operation result corresponding to the whole feature image to be processed in this embodiment. The CNN convolutional layer operation result is then converted into an image and output, giving the image obtained from the whole feature image to be processed after the CNN convolutional layer operation in this embodiment.
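The splicing step can be checked numerically. The sketch below is an illustration under stated assumptions rather than the embodiment itself: it assumes a 3 × 3 kernel, stride 1, no padding, a feature image 15 pixels wide and 12 pixels high, a threshold of 10 × 9 pixels, and an overlap of k-1 rows or columns between adjacent blocks. The helper names (`conv_valid`, `block_ranges`, `conv_by_blocks`) and the pixel values are hypothetical.

```python
def conv_valid(A, H):
    """Stride-1, no-padding CNN convolution of matrix A with a k x k kernel H."""
    m, n, k = len(A), len(A[0]), len(H)
    return [[sum(H[p][q] * A[i + p][j + q] for p in range(k) for q in range(k))
             for j in range(n - k + 1)]
            for i in range(m - k + 1)]


def block_ranges(length, threshold, k):
    """(start, end) ranges along one axis; neighbouring blocks share k-1 lines."""
    if threshold < k:
        raise ValueError("threshold must be at least k")
    ranges, start = [], 0
    while length - start > threshold:
        ranges.append((start, start + threshold))
        start += threshold - (k - 1)
    ranges.append((start, length))
    return ranges


def conv_by_blocks(image, H, max_rows, max_cols):
    """Convolve block by block, then splice the local results by position."""
    k = len(H)
    row_ranges = block_ranges(len(image), max_rows, k)
    col_ranges = block_ranges(len(image[0]), max_cols, k)
    result = []
    for r0, r1 in row_ranges:
        # Local results of all blocks in this band of rows, left to right.
        band = [conv_valid([row[c0:c1] for row in image[r0:r1]], H)
                for c0, c1 in col_ranges]
        # Horizontal splice: concatenate matching rows of neighbouring local results.
        for i in range(len(band[0])):
            result.append([v for local in band for v in local[i]])
    return result


# Hypothetical data: 12 rows x 15 columns, arbitrary integer pixel values.
image = [[(7 * r + 3 * c) % 11 for c in range(15)] for r in range(12)]
H = [[1, 0, -1],
     [2, 0, -2],
     [1, 0, -1]]

spliced = conv_by_blocks(image, H, max_rows=9, max_cols=10)
print(spliced == conv_valid(image, H))  # → True
```

With these assumed sizes the image splits into two ranges per axis, that is, four blocks, and the four local results tile the 10 × 13 output without gaps or duplicates.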

The CNN convolutional layer operation method in this embodiment is implemented based on an FPGA.

In summary, the CNN convolutional layer operation method of the present invention stores the weight matrix H(h_pq) and the image matrix A(a_ij) required in the CNN convolutional layer operation in a cache, which avoids extra storage and helps, to a certain extent, to reduce the cost of completing the CNN convolutional layer operation. It also reduces the number of times the data required by the CNN convolutional layer operation is read from external storage (DDR), which helps, to a certain extent, to increase the speed of the CNN convolutional layer operation.

In addition, the CNN convolution layer operation method also avoids the use of a feature map two-dimensional window in the prior art, further avoids the use of a control module for generating the feature map two-dimensional window in the prior art, reduces the complexity of CNN convolution operation to a certain extent, and is convenient to implement.

FIG. 7 is a diagram of an embodiment of a CNN convolutional layer arithmetic accelerator according to the present invention.

As shown in fig. 7, the CNN convolutional layer arithmetic accelerator 100 includes:

a first data pre-reading module 101, configured to read in a convolution kernel used for performing the CNN convolutional layer operation on a feature image to be processed, and convert the read convolution kernel into a weight matrix H(h_pq), where the convolution kernel is a k × k convolution kernel, h_pq is the (p, q) element of the weight matrix H(h_pq), p = 0, 1, 2, ..., k-1; q = 0, 1, 2, ..., k-1;

the second data pre-reading module 102 is configured to read a block image on the feature image to be processed according to a preset image size threshold;

a local operation module 103, configured to calculate, according to the weight matrix H(h_pq), the CNN convolutional layer local operation result corresponding to the block image currently read in by the second data pre-reading module 102;

the judging module 104 is used for judging whether the whole characteristic image to be processed is read completely;

a convolutional layer operation result output module 105, configured to, when the determination module 104 determines that the entire feature image to be processed has been read completely, arrange the local operation results of each CNN convolutional layer obtained by the local operation module 103 according to the relative position relationship between the block images, and then splice to obtain and output a CNN convolutional layer operation result corresponding to the entire feature image to be processed;

the calling module 106 is configured to call the second data pre-reading module 102 to continue executing when the judging module 104 judges that the whole feature image to be processed has not been completely read;

wherein, the local operation module 103 includes:

an image matrix conversion unit 1031, configured to convert the currently read block image into an image matrix A(a_ij), where the block image is a digital image of m × n pixels, a_ij is the (i, j) element of the image matrix A(a_ij), i = 0, 1, 2, ..., m-1; j = 0, 1, 2, ..., n-1;

a local matrix acquisition unit 1032, configured to read each element h_pq of the weight matrix H(h_pq), obtain, for each element h_pq, the product matrix formed by all elements of the image matrix A(a_ij) that need to be multiplied by that element, and multiply each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq;

a local operation result obtaining unit 1033, configured to calculate a sum of the obtained local matrices, where the sum is a local operation result of the CNN convolution layer corresponding to the currently read block image.
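In compact form, and assuming stride 1 with no padding, the work of the units 1031 to 1033 can be written as follows. The symbols C (the local operation result) and A^{(p,q)} (the product matrix of the element h_pq) are notation introduced here for illustration only.

```latex
C \;=\; \sum_{p=0}^{k-1} \sum_{q=0}^{k-1} h_{pq}\, A^{(p,q)},
\qquad
A^{(p,q)} \;=\; \bigl( a_{i+p,\; j+q} \bigr)_{0 \le i \le m-k,\;\; 0 \le j \le n-k}
```

Under this reading, each A^{(p,q)} is an (m-k+1) × (n-k+1) sub-block of A, each term h_pq A^{(p,q)} is one local matrix, and every weight h_pq is used once per block image.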

The second data pre-reading module reads different block images each time.

The second data pre-reading module 102 reads a block image on the feature image to be processed according to a preset image size threshold, and the reading method includes:

Optionally, as an embodiment of the present invention, the product matrix corresponding to the element h_pq involved in the local matrix acquisition unit 1032 includes the following cases:

Optionally, as an embodiment of the present invention, the CNN convolution layer operation accelerator further includes a cache;

the first data pre-reading module 101 stores the weight matrix H(h_pq) obtained by conversion in the cache;

the image matrix conversion unit 1031 stores the image matrix A(a_ij) obtained by conversion in the cache.

Optionally, as an embodiment of the present invention, the CNN convolutional layer arithmetic accelerator is implemented based on an FPGA.

Optionally, as an embodiment of the present invention, a block image is read on the feature image to be processed according to the image size threshold as follows:

when the size of the part which is not read on the feature image to be processed is larger than the image size threshold, reading a local image which is equal to the image size threshold in size on the part which is not read on the feature image to be processed;

and when the size of the part which is not read on the characteristic image to be processed is smaller than or equal to the image size threshold, reading all the images which are not read on the characteristic image to be processed.
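The reading rule above can be sketched along a single axis as follows. This is a hedged illustration, not the accelerator's actual addressing logic: it assumes that each new block re-reads the last k-1 rows or columns of the previous block, that the "unread part" is measured from the start of the next block, and that k = 3; the function name `block_ranges` is hypothetical.

```python
def block_ranges(length, threshold, k):
    """Return (start, end) index ranges of successive blocks along one axis.

    Assumptions (not stated in this form in the text): stride 1, no padding,
    and each new block re-reads the last k-1 lines of the previous block.
    """
    if length < k or threshold < k:
        raise ValueError("image size and threshold must both be at least k")
    ranges = []
    start = 0
    while True:
        if length - start > threshold:
            # Unread part exceeds the threshold: read a threshold-sized block.
            end = start + threshold
            ranges.append((start, end))
            start = end - (k - 1)  # next block overlaps this one by k-1 lines
        else:
            # Unread part fits within the threshold: read all of it.
            ranges.append((start, length))
            return ranges


print(block_ranges(15, 10, 3))  # → [(0, 10), (8, 15)]
print(block_ranges(12, 9, 3))   # → [(0, 9), (7, 12)]
```

For a 15 × 12 pixel feature image and a 10 × 9 pixel threshold, this gives two ranges per axis and therefore four block images, which agrees with the four blocks F1 to F4 of the embodiment.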

In a specific implementation, the local matrix acquisition unit 1032 may use a multiplier array to multiply each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq; the local operation result obtaining unit 1033 may use an accumulator array to calculate the sum of the obtained local matrices.

The same and similar parts in the various embodiments in this specification may be referred to each other.

In the embodiments provided in the present invention, it should be understood that the disclosed system and method can be implemented in other ways. For example, the above-described system embodiments are merely illustrative: the division into modules and units is only a division by logical function, and other divisions are possible in actual implementation.

Although the present invention has been described in detail with reference to the drawings and in connection with the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention falls within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A CNN convolutional layer operation method is characterized by comprising the following steps:

S1, reading a convolution kernel used for performing the CNN convolutional layer operation on the feature image to be processed, and converting the read convolution kernel into a weight matrix H(h_pq), wherein the convolution kernel is a k × k convolution kernel, h_pq is the (p, q) element of the weight matrix H(h_pq), p = 0, 1, 2, ..., k-1; q = 0, 1, 2, ..., k-1;

S2, reading a block image on the feature image to be processed according to a preset image size threshold, and calculating, according to the weight matrix H(h_pq), the CNN convolutional layer local operation result corresponding to the currently read block image;

wherein, in step S2, calculating, according to the weight matrix H(h_pq), the CNN convolutional layer local operation result corresponding to the currently read block image comprises the following steps:

P1, converting the currently read block image into an image matrix A(a_ij), wherein the block image is a digital image of m × n pixels, a_ij is the (i, j) element of the image matrix A(a_ij), i = 0, 1, 2, ..., m-1; j = 0, 1, 2, ..., n-1;

P2, reading each element h_pq of the weight matrix H(h_pq), obtaining, for each element h_pq, the product matrix formed by all elements of the image matrix A(a_ij) that need to be multiplied by that element, and multiplying each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq;

in step S2, the block images of the feature image to be processed read each time are different from each other;

when the block images are read again, each read block image comprises k-1 rows or k-1 columns of pixels of each adjacent block image;

the product matrix corresponding to the element h_pq involved in step P2 includes the following cases:

when p = 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns n-k+1, n-k+2, n-k+3, ..., n-1 and rows m-k+1, m-k+2, m-k+3, ..., m-1;

when p = 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) remaining after removing columns 0, 1, 2, ..., q-1, n-k+q+1, n-k+q+2, ..., n-1 and rows m-k+1, m-k+2, m-k+3, ..., m-1;

when p ≠ 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) remaining after removing columns n-k+1, n-k+2, n-k+3, ..., n-1 and rows 0, 1, 2, ..., p-1, m-k+p+1, m-k+p+2, ..., m-1;

when p ≠ 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) remaining after removing columns 0, 1, 2, ..., q-1, n-k+q+1, n-k+q+2, n-k+q+3, ..., n-1 and rows 0, 1, 2, ..., p-1, m-k+p+1, m-k+p+2, m-k+p+3, ..., m-1.

2. The CNN convolutional layer operation method of claim 1, wherein

the weight matrix H(h_pq) obtained by conversion in step S1 is stored in a cache;

the image matrix A(a_ij) obtained by conversion in step P1 is stored in the cache.

3. The CNN convolutional layer operation method of claim 1, wherein the CNN convolutional layer operation method is implemented based on an FPGA.

4. The CNN convolutional layer operation method of claim 1, wherein in step P2 a multiplier array is used to multiply each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq.

5. A CNN convolutional layer arithmetic accelerator, comprising:

a first data pre-reading module, configured to read a convolution kernel used for performing the CNN convolutional layer operation on the feature image to be processed, and convert the read convolution kernel into a weight matrix H(h_pq), wherein the convolution kernel is a k × k convolution kernel, h_pq is the (p, q) element of the weight matrix H(h_pq), p = 0, 1, 2, ..., k-1; q = 0, 1, 2, ..., k-1;

a local operation module, configured to calculate, according to the weight matrix H(h_pq), the CNN convolutional layer local operation result corresponding to the block image currently read in by the second data pre-reading module;

wherein, the local operation module comprises:

an image matrix conversion unit, configured to convert the currently read block image into an image matrix A(a_ij), wherein the block image is a digital image of m × n pixels, a_ij is the (i, j) element of the image matrix A(a_ij), i = 0, 1, 2, ..., m-1; j = 0, 1, 2, ..., n-1;

a local matrix acquisition unit, configured to read each element h_pq of the weight matrix H(h_pq), obtain, for each element h_pq, the product matrix formed by all elements of the image matrix A(a_ij) that need to be multiplied by that element, and multiply each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq;

the second data pre-reading module reads different block images each time;

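the product matrix corresponding to the element h_pq involved in the local matrix acquisition unit includes the following cases:

when p = 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns n-k+1, n-k+2, n-k+3, ..., n-1 and rows m-k+1, m-k+2, m-k+3, ..., m-1;

when p = 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) remaining after removing columns 0, 1, 2, ..., q-1, n-k+q+1, n-k+q+2, ..., n-1 and rows m-k+1, m-k+2, m-k+3, ..., m-1;

when p ≠ 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) remaining after removing columns n-k+1, n-k+2, n-k+3, ..., n-1 and rows 0, 1, 2, ..., p-1, m-k+p+1, m-k+p+2, ..., m-1;

when p ≠ 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) remaining after removing columns 0, 1, 2, ..., q-1, n-k+q+1, n-k+q+2, n-k+q+3, ..., n-1 and rows 0, 1, 2, ..., p-1, m-k+p+1, m-k+p+2, m-k+p+3, ..., m-1.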

6. The CNN convolutional layer operation accelerator of claim 5, wherein the CNN convolutional layer operation accelerator further includes a cache;

the first data pre-reading module stores the weight matrix H(h_pq) obtained by conversion in the cache;

the image matrix conversion unit stores the image matrix A(a_ij) obtained by conversion in the cache.

7. The CNN convolutional layer arithmetic accelerator of claim 5, wherein the CNN convolutional layer arithmetic accelerator is implemented based on an FPGA.

8. The CNN convolutional layer arithmetic accelerator of claim 5, wherein the local matrix acquisition unit uses a multiplier array to multiply each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq.