CN114612309B

CN114612309B - Full-on-chip dynamic reconfigurable super-resolution device

Info

Publication number: CN114612309B
Application number: CN202210512559.7A
Authority: CN
Inventors: 常亮; 赵鑫; 周军
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-05-12
Filing date: 2022-05-12
Publication date: 2022-10-14
Anticipated expiration: 2042-05-12
Also published as: CN114612309A

Abstract

A full-on-chip dynamically reconfigurable super-resolution device, belonging to the technical field of image processing. The device comprises a preprocessing circuit, an arithmetic operation circuit, an interpolation circuit and a post-processing circuit. The preprocessing circuit comprises a weight buffer, an input buffer and an input image color space conversion circuit; the arithmetic operation circuit comprises a data redistribution circuit, a convolution calculation block, a shared addition tree circuit and an interlayer buffer; the interpolation circuit comprises a nearest neighbor interpolation circuit and a temporary buffer; and the post-processing circuit comprises an output shaping circuit and an output image color space conversion circuit. By adopting a mapping strategy of convolution compression, convolution decomposition and PE remapping, together with a convolution calculation block consisting of a plurality of dynamically reconfigurable PE calculation units, the invention greatly reduces the amount of computation required for deconvolution, improves its calculation efficiency, effectively eliminates invalid calculations and avoids unbalanced calculation load.

Description

Full-on-chip dynamic reconfigurable super-resolution device

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a full-on-chip dynamically reconfigurable super-resolution device.

Background

With the development of artificial intelligence algorithms, super-resolution networks based on deep learning achieve better image reconstruction and have good application prospects in key fields such as old photo restoration, medical detection and security monitoring. However, super-resolution networks involve a large amount of computation and a large amount of data access, so their hardware requirements are extremely high. This is mainly caused by the large amount of computation in the deconvolution process. To solve this problem, Chang et al. (Chang, Jung-Woo, Keon-Woo Kang, and Suk-Ju Kang. "An energy-efficient FPGA-based deconvolutional neural networks accelerator for single image super-resolution." IEEE Transactions on Circuits and Systems for Video Technology 30.1 (2018): 281-295.) proposed the TDC (transposed convolution conversion) method, which converts deconvolution into convolution and redistributes the computation tasks, effectively reducing the amount of computation of deconvolution. However, the problems of low PE (Processing Element) utilization and unbalanced calculation load remain, and the potential for accelerating deconvolution has not been fully exploited.

Disclosure of Invention

The invention aims to provide a full-on-chip dynamically reconfigurable super-resolution device aiming at the problems in the background technology.

In order to realize the purpose, the technical scheme adopted by the invention is as follows:

a full-on-chip dynamic reconfigurable super-resolution device comprises a preprocessing circuit, an arithmetic operation circuit, an interpolation circuit and a post-processing circuit;

wherein the pre-processing circuit comprises a weight buffer, an input buffer and an input image color space conversion circuit; the weight buffer is used for buffering weight data of the super-resolution network, the input buffer is used for buffering input image data, the input image color space conversion circuit reads the input image data in the input buffer and converts the input image data from an RGB format into a YCbCr format, Y channel data obtained after conversion are input into the data redistribution circuit, and Cb and Cr channel data are input into the nearest neighbor interpolation circuit;

the arithmetic operation circuit comprises a data redistribution circuit, a convolution calculation block, a shared addition tree circuit and an interlayer buffer; the data redistribution circuit reads the weight data in the weight buffer and the Y-channel data output by the input image color space conversion circuit, and redistributes the weight data and the Y-channel data according to a designated mapping strategy according to a scaling factor to obtain redistributed data; the convolution calculation block receives the redistributed data and performs convolution operation on the redistributed data to obtain a convolution operation result; the shared addition tree circuit receives convolution operation results and accumulates the convolution operation results to obtain output characteristic diagram data of a current layer in the super-resolution network; the interlayer buffer receives and stores output characteristic diagram data of a current layer, when the maximum number of layers of the super-resolution network is not reached, the output characteristic diagram data is used as an input characteristic diagram of a next layer of network and is input into the data redistribution circuit, and when the maximum number of layers of the super-resolution network is reached, the output characteristic diagram data is input into the output shaping circuit;

the interpolation circuit comprises a nearest neighbor interpolation circuit and a temporary buffer; the nearest neighbor interpolation circuit interpolates the received Cb and Cr channel data based on a nearest neighbor interpolation strategy to obtain an interpolated characteristic diagram; the temporary buffer receives and buffers the interpolated characteristic diagram;

the post-processing circuit comprises an output shaping circuit and an output image color space conversion circuit; the output shaping circuit reads the output characteristic diagram data of the interlayer buffer and rearranges the output characteristic diagram data to obtain the sequential output of Y-channel data; the output image color space conversion circuit reads the sequential output of the Y-channel data and the characteristic diagram after interpolation, and converts the sequential output of the Y-channel data and the characteristic diagram after interpolation into RGB format data for output.
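The data path described above can be sketched in software as follows. This is only an illustrative model, not the circuit itself: the full-range BT.601 coefficients are an assumption (the source does not name a conversion standard), and `upscale_y` is a hypothetical stand-in for the arithmetic operation circuit that processes the Y channel.

```python
def rgb_to_ycbcr(r, g, b):
    """RGB -> YCbCr with full-range BT.601 weights (assumed; not stated in the source)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 + (b - y) / 1.772
    cr = 128.0 + (r - y) / 1.402
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Exact algebraic inverse of rgb_to_ycbcr."""
    r = y + 1.402 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b


def nearest_upsample(plane, scale):
    """Nearest neighbor interpolation: repeat every sample `scale` times on both axes."""
    return [[row[x // scale] for x in range(len(row) * scale)]
            for row in plane for _ in range(scale)]


def super_resolve(rgb, scale, upscale_y):
    """rgb: 2-D list of (r, g, b) tuples; upscale_y: hypothetical Y-channel upscaler."""
    ycc = [[rgb_to_ycbcr(*px) for px in row] for row in rgb]
    y = [[p[0] for p in row] for row in ycc]
    cb = [[p[1] for p in row] for row in ycc]
    cr = [[p[2] for p in row] for row in ycc]
    y_hr = upscale_y(y, scale)              # network path (Y channel only)
    cb_hr = nearest_upsample(cb, scale)     # interpolation path (Cb)
    cr_hr = nearest_upsample(cr, scale)     # interpolation path (Cr)
    return [[ycbcr_to_rgb(a, b, c) for a, b, c in zip(ry, rb, rr)]
            for ry, rb, rr in zip(y_hr, cb_hr, cr_hr)]
```

Only the Y channel goes through the network; the chroma channels take the much cheaper interpolation path and are recombined at the output.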

Further, the input image data is a plurality of sub-images obtained by segmenting the original image.

Further, the weight data is obtained by segmenting the original image into a plurality of sub-images and then training the super-resolution network with the segmented sub-images as the training data set.

Further, the mapping strategy comprises convolution compression, convolution decomposition and PE remapping processes, wherein the convolution compression compresses the weight data and the Y-channel data according to a scaling factor, and removes a 0 value in the Y-channel data and the weight data corresponding to the 0 value; the convolution decomposition decomposes the compressed data into convolutions of different lengths; PE remapping combines convolutions of different lengths into convolutions of fixed length and inputs them to a convolution computation block.
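The effect of convolution compression can be illustrated in one dimension. The sketch below is an assumption-laden software analogy, not the patented circuit: `deconv_zero_insert` computes a strided deconvolution the naive way (inserting zeros and running a dense convolution), while `deconv_compressed` skips every multiplication against an inserted zero by using only the weights `w[p::s]` that can meet a real sample at output phase `p`. Both function names are hypothetical.

```python
def deconv_zero_insert(x, w, s):
    """Reference deconvolution: insert s-1 zeros between samples, then convolve densely."""
    K = len(w)
    up = []
    for i, v in enumerate(x):
        up.append(v)
        if i < len(x) - 1:
            up.extend([0] * (s - 1))
    padded = [0] * (K - 1) + up + [0] * (K - 1)
    out_len = len(up) + K - 1
    # Every output uses all K taps, most of them multiplied by inserted zeros.
    return [sum(padded[o + K - 1 - k] * w[k] for k in range(K))
            for o in range(out_len)]


def deconv_compressed(x, w, s):
    """Same result using only the non-zero taps of each output phase."""
    K, N = len(w), len(x)
    out = []
    for o in range((N - 1) * s + K):
        q, p = divmod(o, s)
        sub = w[p::s]  # compressed kernel for output phase p
        out.append(sum(x[q - j] * wj for j, wj in enumerate(sub)
                       if 0 <= q - j < N))
    return out
```

For a 9-tap kernel, the compressed kernels have lengths 5 and 4 at scaling factor 2, lengths 3, 3 and 3 at scaling factor 3, and lengths 3, 2, 2 and 2 at scaling factor 4, which is where the two-dimensional sizes 5 × 5, 5 × 4, 3 × 3, 3 × 2 and so on come from.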

Further, when the scaling factor is 2, the device can realize […] parallel deconvolution operations of size 9 × 9, wherein the processing of the data redistribution circuit and the convolution calculation block is specifically as follows:

1) Convolution compression:

compressing the input Y-channel data and the weight data to obtain […] convolutions of size 5 × 5, […] convolutions of size 5 × 4, […] convolutions of size 4 × 5 and […] convolutions of size 4 × 4, wherein m and n are positive integers;

2) Convolution decomposition:

decomposing the […] convolutions of size 5 × 5 into […] convolutions of length 9 and […] convolutions of length 7;

decomposing the […] convolutions of size 5 × 4 into […] convolutions of length 9 and […] convolutions of length 2;

decomposing the […] convolutions of size 4 × 5 into […] convolutions of length 9 and […] convolutions of length 2;

decomposing the […] convolutions of size 4 × 4 into […] convolutions of length 9 and […] convolutions of length 7;

3) PE remapping:

combining the convolutions of length 7 with the convolutions of length 2 to obtain […] convolutions of length 9, and then inputting them, together with the remaining […] convolutions of length 9, into the convolution calculation block;

4) Deconvolution operation:

the input mn convolutions are sent into the convolution calculation block for convolution calculation to obtain mn convolution operation results.
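The arithmetic behind these steps can be checked for a single 9 × 9 kernel at scaling factor 2. The sketch below is a bookkeeping aid under stated assumptions: it tracks one kernel only (the general counts in terms of m and n are not derived here), and the helper names `subkernel_shapes` and `split_fixed` are hypothetical.

```python
def subkernel_shapes(k, s):
    """Shapes of the s*s compressed kernels obtained from one k x k deconvolution kernel."""
    lens = [len(range(p, k, s)) for p in range(s)]
    return [(r, c) for r in lens for c in lens]


def split_fixed(taps, length=9):
    """Cut a flattened kernel of `taps` weights into full chunks plus one remainder."""
    full, rest = divmod(taps, length)
    return [length] * full + ([rest] if rest else [])


shapes = subkernel_shapes(9, 2)   # 5x5, 5x4, 4x5 and 4x4
pieces = [p for r, c in shapes for p in split_fixed(r * c)]

# PE remapping: each length-7 remainder pairs with a length-2 remainder (7 + 2 = 9).
sevens, twos = pieces.count(7), pieces.count(2)
pe_slots = pieces.count(9) + min(sevens, twos)
```

Under these assumptions the 81 weights of one kernel fill exactly nine length-9 slots, with no slot left partly empty.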

Further, when the scaling factor is 3, the device can realize mn parallel deconvolution operations of size 9 × 9, wherein the processing of the data redistribution circuit and the convolution calculation block is specifically as follows:

1) Convolution compression:

compressing the input Y-channel data and the weight data to obtain mn convolutions of size 3 × 3;

2) Convolution decomposition:

decomposing the mn convolutions of size 3 × 3 into mn convolutions of length 9;

3) PE remapping:

inputting the mn convolutions of length 9 into the convolution calculation block;

4) Deconvolution operation:

the input mn convolutions are sent into the convolution calculation block for convolution calculation to obtain mn convolution operation results.

Further, when the scaling factor is 4, the device can realize […] parallel deconvolution operations of size 9 × 9, wherein the processing of the data redistribution circuit and the convolution calculation block is specifically as follows:

1) Convolution compression:

compressing the input Y-channel data and the weight data to obtain […] convolutions of size 3 × 3, […] convolutions of size 3 × 2, […] convolutions of size 2 × 3 and mn convolutions of size 2 × 2;

2) Convolution decomposition:

decomposing the […] convolutions of size 3 × 3 into […] convolutions of length 9;

decomposing the […] convolutions of size 3 × 2 into […] convolutions of length 3;

decomposing the […] convolutions of size 2 × 3 into […] convolutions of length 3 and […] convolutions of length 2;

decomposing the mn convolutions of size 2 × 2 into […] convolutions of length 4 and […] convolutions of length 2;

3) PE remapping:

combining the convolutions of length 4, the convolutions of length 3 and the convolutions of length 2 to obtain […] convolutions of length 9, and then inputting them, together with the remaining […] convolutions of length 9, into the convolution calculation block;

4) Deconvolution operation:

the input mn convolutions are sent into the convolution calculation block for convolution calculation to obtain mn convolution operation results.

Preferably, the device has the highest calculation efficiency when m × n is a multiple of 9.
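One way to see why a multiple of 9 is favorable is sketched below. It rests on an assumption that is not stated in this form in the source: that one 9 × 9 kernel (81 weights) always occupies nine length-9 PE slots and that only whole kernels are scheduled. The function name `pe_utilization` is hypothetical.

```python
def pe_utilization(m, n, slots_per_kernel=9):
    """Return (whole kernels that fit, fraction of PEs kept busy) for an m x n PE array."""
    total = m * n
    kernels = total // slots_per_kernel
    busy = kernels * slots_per_kernel
    return kernels, busy / total
```

Under this assumption a 3 × 3 or 6 × 6 array is fully used, whereas a 4 × 4 array would leave 7 of its 16 PEs idle.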

Wherein the convolution calculation block comprises m × n dynamically reconfigurable PE computing units, and each dynamically reconfigurable PE computing unit comprises 1st to 9th pixel points, 1st to 9th weight data points, 1st to 9th multipliers, 1st to 8th adders, a first data selector and a second data selector; the product of the 1st pixel point A₁ and the 1st weight data point W₁ and the product of the 2nd pixel point A₂ and the 2nd weight data point W₂ are added in the 1st adder to obtain 1st data; the product of the 3rd pixel point A₃ and the 3rd weight data point W₃ and the product of the 4th pixel point A₄ and the 4th weight data point W₄ are added in the 3rd adder to obtain 2nd data; the 1st data and the 2nd data are added in the 2nd adder to obtain 3rd data, the 3rd data is input into the input end of the first data selector, one output end of the first data selector is connected with the first input end of the 4th adder, and the other output end serves as the first output of the dynamically reconfigurable PE computing unit; the product of the 6th pixel point A₆ and the 6th weight data point W₆ and the product of the 7th pixel point A₇ and the 7th weight data point W₇ are added in the 6th adder to obtain 4th data; the product of the 5th pixel point A₅ and the 5th weight data point W₅ and the 4th data are added in the 5th adder to obtain 5th data; the 5th data and the data output by the first data selector are added in the 4th adder to obtain 6th data; the product of the 8th pixel point A₈ and the 8th weight data point W₈ and the product of the 9th pixel point A₉ and the 9th weight data point W₉ are added in the 8th adder to obtain 7th data; the 7th data is input into the input end of the second data selector, and the data at one output end of the second data selector and the 6th data are added in the 7th adder to obtain the second output of the dynamically reconfigurable PE computing unit; and the data at the other output end of the second data selector serves as the third output of the dynamically reconfigurable PE computing unit.

Further, each dynamically reconfigurable PE computing unit may implement 3 operating modes:

mode 0: outputting 1 convolution operation result with the length of 9;

mode 1: outputting 1 convolution operation result with the length of 7 and 1 convolution operation result with the length of 2;

mode 2: outputting 1 convolution operation result with the length of 4, 1 convolution operation result with the length of 3 and 1 convolution operation result with the length of 2.

The result of the first output is 1 convolution operation result with the length of 4 in the mode 2; the result of the second output is 1 convolution operation result with the length of 9 of the mode 0, or 1 convolution operation result with the length of 7 of the mode 1, or 1 convolution operation result with the length of 3 of the mode 2; the result of the third output is 1 convolution operation result with length of 2 in mode 1 or 1 convolution operation result with length of 2 in mode 2.
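The adder network and the three operating modes can be modeled behaviorally as below. This is a sketch under assumptions: the function name `pe_compute` is hypothetical, a data selector is assumed to feed zero into the adder it is not routed to, and an output that is unused in a given mode is reported as `None`.

```python
def pe_compute(a, w, mode):
    """a, w: nine pixels and nine weights (index 0 holds A1/W1). Returns (out1, out2, out3)."""
    if len(a) != 9 or len(w) != 9 or mode not in (0, 1, 2):
        raise ValueError("expected nine pixels, nine weights and mode 0, 1 or 2")
    p = [ai * wi for ai, wi in zip(a, w)]          # multipliers 1..9
    d1 = p[0] + p[1]                               # adder 1
    d2 = p[2] + p[3]                               # adder 3
    d3 = d1 + d2                                   # adder 2: 4-tap partial sum
    d4 = p[5] + p[6]                               # adder 6
    d5 = p[4] + d4                                 # adder 5: 3-tap partial sum
    d7 = p[7] + p[8]                               # adder 8: 2-tap partial sum
    sel1_to_chain = d3 if mode in (0, 1) else 0    # first data selector (zero when diverted: assumed)
    sel2_to_chain = d7 if mode == 0 else 0         # second data selector (zero when diverted: assumed)
    d6 = d5 + sel1_to_chain                        # adder 4
    out2 = d6 + sel2_to_chain                      # adder 7: length 9, 7 or 3
    out1 = d3 if mode == 2 else None               # length-4 result, mode 2 only
    out3 = d7 if mode in (1, 2) else None          # length-2 result, modes 1 and 2
    return out1, out2, out3
```

With pixels 1 to 9 and unit weights, mode 0 yields the full sum 45 on the second output, mode 1 yields 28 and 17, and mode 2 yields 10, 18 and 17, matching the 9, 7 + 2 and 4 + 3 + 2 splits.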

Compared with the prior art, the invention has the beneficial effects that:

1. the full-on-chip dynamic reconfigurable super-resolution device provided by the invention adopts a mapping strategy of convolution compression, convolution decomposition and PE remapping and a convolution calculation block consisting of a plurality of dynamic reconfigurable PE calculation units, so that the calculation amount of deconvolution calculation is greatly reduced, the calculation efficiency of deconvolution calculation is improved, invalid calculation is effectively eliminated, and the problem of unbalanced calculation load is avoided.

2. According to the full-on-chip dynamic reconfigurable super-resolution device provided by the invention, the input image data and the weight data are obtained by segmenting the original image into a plurality of sub-images and training, so that the data volume between layers is greatly reduced, the communication between an intermediate network layer and an off-chip memory is avoided, the full-on-chip storage is realized, and the throughput of the device is improved.

Drawings

Fig. 1 is a schematic structural diagram of a full-on-chip dynamically reconfigurable super-resolution device provided by the invention;

fig. 2 is a schematic structural diagram of a dynamic reconfigurable PE computing unit in the full-on-chip dynamic reconfigurable super-resolution device provided by the invention.

Detailed Description

The technical scheme of the invention is detailed below by combining the accompanying drawings and the embodiment.

Examples

Fig. 1 is a schematic structural diagram of a full on-chip dynamically reconfigurable super-resolution device provided by the present invention; the device comprises a preprocessing circuit, an arithmetic operation circuit, an interpolation circuit and a post-processing circuit;

wherein the pre-processing circuit comprises a weight buffer, an input buffer and an input image color space conversion circuit; the weight buffer is used for caching weight data of the super-resolution network, the input buffer is used for caching input image data, the input image color space conversion circuit reads the input image data in the input buffer and converts the input image data from an RGB format into a YCbCr format, Y-channel data obtained after conversion are input into the data redistribution circuit, and Cb and Cr channel data are input into the nearest neighbor interpolation circuit;

the interpolation circuit comprises a nearest neighbor interpolation circuit and a temporary buffer; the nearest neighbor interpolation circuit interpolates the received Cb and Cr channel data based on a nearest neighbor interpolation strategy to obtain an interpolated characteristic diagram; the temporary buffer receives and buffers the characteristic diagram after interpolation;

The input image data is a plurality of 54 × 36 RGB format images obtained by segmenting an original image in RGB format having a size of 1080 × 720.
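The segmentation in this embodiment can be sketched as a simple tiling. The sketch assumes non-overlapping tiles and takes 1080 and 54 as widths and 720 and 36 as heights; neither point is spelled out in the source, and the function name `tile_origins` is hypothetical.

```python
def tile_origins(width, height, tile_w, tile_h):
    """Top-left corners of non-overlapping tiles, in row-major order."""
    if width % tile_w or height % tile_h:
        raise ValueError("image size must be a multiple of the tile size")
    return [(x, y)
            for y in range(0, height, tile_h)
            for x in range(0, width, tile_w)]


tiles = tile_origins(1080, 720, 54, 36)
samples_per_tile = 54 * 36   # Y-channel samples held on chip for one sub-image
```

Under these assumptions the original image splits into a 20 × 20 grid of sub-images, so each sub-image carries 1/400 of the pixels of the full frame.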

The weight data is obtained by segmenting an original image into a plurality of sub-images and then training the segmented sub-images.

Wherein the convolution calculation block comprises 3 × 3 dynamically reconfigurable PE computing units, and each dynamically reconfigurable PE computing unit comprises 1st to 9th pixel points, 1st to 9th weight data points, 1st to 9th multipliers, 1st to 8th adders, a first data selector and a second data selector, as shown in Fig. 2; the product of the 1st pixel point A₁ and the 1st weight data point W₁ and the product of the 2nd pixel point A₂ and the 2nd weight data point W₂ are added in the 1st adder to obtain 1st data; the product of the 3rd pixel point A₃ and the 3rd weight data point W₃ and the product of the 4th pixel point A₄ and the 4th weight data point W₄ are added in the 3rd adder to obtain 2nd data; the 1st data and the 2nd data are added in the 2nd adder to obtain 3rd data, the 3rd data is input into the input end of the first data selector, one output end of the first data selector is connected with the first input end of the 4th adder, and the other output end serves as the first output of the dynamically reconfigurable PE computing unit; the product of the 6th pixel point A₆ and the 6th weight data point W₆ and the product of the 7th pixel point A₇ and the 7th weight data point W₇ are added in the 6th adder to obtain 4th data; the product of the 5th pixel point A₅ and the 5th weight data point W₅ and the 4th data are added in the 5th adder to obtain 5th data; the 5th data and the data output by the first data selector are added in the 4th adder to obtain 6th data; the product of the 8th pixel point A₈ and the 8th weight data point W₈ and the product of the 9th pixel point A₉ and the 9th weight data point W₉ are added in the 8th adder to obtain 7th data; the 7th data is input into the input end of the second data selector, and the data at one output end of the second data selector and the 6th data are added in the 7th adder to obtain the second output of the dynamically reconfigurable PE computing unit; and the data at the other output end of the second data selector serves as the third output of the dynamically reconfigurable PE computing unit.

The scaling factor is set to 4, so that 16 parallel deconvolution operations of size 9 × 9 can be realized, and the processing of the data redistribution circuit and the convolution calculation block is specifically as follows:

1) Convolution compression:

compressing the input Y-channel data and the weight data to obtain 1 convolution of size 3 × 3, 3 convolutions of size 3 × 2, 3 convolutions of size 2 × 3 and 9 convolutions of size 2 × 2;

2) Convolution decomposition:

decomposing the 1 convolution of size 3 × 3 into 1 convolution of length 9; decomposing the 3 convolutions of size 3 × 2 into 6 convolutions of length 3; decomposing the 3 convolutions of size 2 × 3 into 2 convolutions of length 3 and 6 convolutions of length 2; and decomposing the 9 convolutions of size 2 × 2 into 8 convolutions of length 4 and 2 convolutions of length 2;

3) PE remapping:

combining the convolutions of length 4, the convolutions of length 3 and the convolutions of length 2 to obtain 8 convolutions of length 9, and then inputting them, together with the remaining 1 convolution of length 9, into the convolution calculation block;

4) Deconvolution operation:

the 9 input convolutions are sent into the convolution calculation block for convolution calculation to obtain 9 convolution operation results; the convolution calculation block comprises 9 dynamically reconfigurable PE calculation units arranged in a 3 × 3 array.
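The counts given in this embodiment can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers listed above; the dictionary names are hypothetical, and grouping one length-4, one length-3 and one length-2 piece per PE is read from the description of mode 2.

```python
# (rows, cols) -> count, as listed in the convolution compression step
compressed = {(3, 3): 1, (3, 2): 3, (2, 3): 3, (2, 2): 9}

# piece length -> count, as listed in the convolution decomposition step
pieces = {9: 1, 4: 8, 3: 6 + 2, 2: 6 + 2}

taps_in = sum(r * c * k for (r, c), k in compressed.items())
taps_out = sum(length * k for length, k in pieces.items())

# Each group holds one piece of each short length: 4 + 3 + 2 = 9 taps (PE mode 2).
groups = min(pieces[4], pieces[3], pieces[2])
pe_slots = pieces[9] + groups
```

Both totals equal the 81 weights of a 9 × 9 kernel, and the 8 combined groups plus the single length-9 convolution fill the 9 PE units exactly.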

Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A full-on-chip dynamic reconfigurable super-resolution device is characterized by comprising a preprocessing circuit, an arithmetic operation circuit, an interpolation circuit and a post-processing circuit;

the arithmetic operation circuit comprises a data redistribution circuit, a convolution calculation block, a shared addition tree circuit and an interlayer buffer; the data redistribution circuit reads the weight data in the weight buffer and the Y-channel data output by the input image color space conversion circuit, and redistributes the weight data and the Y-channel data according to a designated mapping strategy according to a scaling factor to obtain redistributed data; the convolution calculation block receives the redistributed data and performs convolution operation on the redistributed data to obtain a convolution operation result; the shared addition tree circuit receives convolution operation results and accumulates the convolution operation results to obtain output characteristic map data of a current layer in the super-resolution network; the interlayer buffer receives and stores output characteristic diagram data of a current layer, when the maximum number of layers of the super-resolution network is not reached, the output characteristic diagram data is used as an input characteristic diagram of a next layer of network and is input into the data redistribution circuit, and when the maximum number of layers of the super-resolution network is reached, the output characteristic diagram data is input into the output shaping circuit;

the post-processing circuit comprises an output shaping circuit and an output image color space conversion circuit; the output shaping circuit reads the output characteristic diagram data of the interlayer buffer and rearranges the output characteristic diagram data to obtain the sequential output of Y-channel data; the output image color space conversion circuit reads the sequential output of the Y-channel data and the feature map after interpolation, converts the sequential output of the Y-channel data and the feature map after interpolation into RGB format data and outputs the data;

the mapping strategy comprises convolution compression, convolution decomposition and PE remapping processes, wherein the convolution compression compresses weight data and Y-channel data according to a scaling factor, and removes a 0 value in the Y-channel data and the weight data corresponding to the 0 value; the convolution decomposition decomposes the compressed data into convolutions of different lengths; the PE remapping combines convolutions of different lengths into convolutions of fixed length and inputs the convolutions into a convolution calculating block;

the convolution calculation block comprises m × n dynamically reconfigurable PE computing units, and each dynamically reconfigurable PE computing unit comprises 1st to 9th pixel points, 1st to 9th weight data points, 1st to 9th multipliers, 1st to 8th adders, a first data selector and a second data selector; the product of the 1st pixel point A₁ and the 1st weight data point W₁ and the product of the 2nd pixel point A₂ and the 2nd weight data point W₂ are added in the 1st adder to obtain 1st data; the product of the 3rd pixel point A₃ and the 3rd weight data point W₃ and the product of the 4th pixel point A₄ and the 4th weight data point W₄ are added in the 3rd adder to obtain 2nd data; the 1st data and the 2nd data are added in the 2nd adder to obtain 3rd data, the 3rd data is input into the input end of the first data selector, one output end of the first data selector is connected with the first input end of the 4th adder, and the other output end of the first data selector serves as the first output of the dynamically reconfigurable PE computing unit; the product of the 6th pixel point A₆ and the 6th weight data point W₆ and the product of the 7th pixel point A₇ and the 7th weight data point W₇ are added in the 6th adder to obtain 4th data; the product of the 5th pixel point A₅ and the 5th weight data point W₅ and the 4th data are added in the 5th adder to obtain 5th data; the 5th data and the data output by the first data selector are added in the 4th adder to obtain 6th data; the product of the 8th pixel point A₈ and the 8th weight data point W₈ and the product of the 9th pixel point A₉ and the 9th weight data point W₉ are added in the 8th adder to obtain 7th data; the 7th data is input into the input end of the second data selector, and the data at one output end of the second data selector and the 6th data are added in the 7th adder to obtain the second output of the dynamically reconfigurable PE computing unit; and the data at the other output end of the second data selector serves as the third output of the dynamically reconfigurable PE computing unit.

2. The full on-chip dynamic reconfigurable super-resolution device according to claim 1, wherein when the scaling factor is 2, the processing procedures of the data redistribution circuit and the convolution calculation block are as follows:

1) Convolution compression:

compressing the input Y-channel data and the weight data to obtain […] convolutions of size 5 × 5, […] convolutions of size 5 × 4, […] convolutions of size 4 × 5 and […] convolutions of size 4 × 4, wherein m and n are positive integers;

2) Convolution decomposition:

decomposing the […] convolutions of size 5 × 5 into […] convolutions of length 9 and […] convolutions of length 7;

decomposing the […] convolutions of size 5 × 4 into […] convolutions of length 9 and […] convolutions of length 2;

decomposing the […] convolutions of size 4 × 5 into […] convolutions of length 9 and […] convolutions of length 2;

decomposing the […] convolutions of size 4 × 4 into […] convolutions of length 9 and […] convolutions of length 7;

3) PE remapping:

combining the convolutions of length 7 with the convolutions of length 2 to obtain […] convolutions of length 9, and then inputting them, together with the remaining […] convolutions of length 9, into the convolution calculation block;

4) Deconvolution operation:

the input mn convolutions are sent into the convolution calculation block for convolution calculation to obtain mn convolution operation results.

3. The full on-chip dynamic reconfigurable super-resolution device according to claim 1, wherein when the scaling factor is 3, the processing procedures of the data redistribution circuit and the convolution calculation block are as follows:

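1) Convolution compression:

compressing the input Y-channel data and the weight data to obtain mn convolutions of size 3 × 3;

2) Convolution decomposition:

decomposing the mn convolutions of size 3 × 3 into mn convolutions of length 9;

3) PE remapping:

inputting the mn convolutions of length 9 into the convolution calculation block;

4) Deconvolution operation:

the input mn convolutions are sent into the convolution calculation block for convolution calculation to obtain mn convolution operation results.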

4. The full on-chip dynamic reconfigurable super-resolution device according to claim 1, wherein when the scaling factor is 4, the processing procedures of the data redistribution circuit and the convolution calculation block are as follows:

1) Convolution compression:

compressing the input Y-channel data and the weight data to obtain