CN115965528A - Super-resolution system and method for high-speed image acquisition - Google Patents


Info

Publication number: CN115965528A
Application number: CN202211651686.1A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Prior art keywords: module, interpolation, cal, dimensional interpolation, rows

Inventors: 杨晨, 张喆, 孟依烁, 刘新
Current Assignee: Xian Jiaotong University
Original Assignee: Xian Jiaotong University

Application filed by Xian Jiaotong University
Priority to CN202211651686.1A
Publication of CN115965528A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Processing (AREA)

Abstract

The invention provides a super-resolution system and method for high-speed image acquisition, comprising a Padding module, a Linebuffer module, a Bicubic top-layer calculation module and a shift register. The Padding module pads the original image and outputs the padded image to the Linebuffer module. The Linebuffer module buffers the padded image by lines and synchronously outputs four lines of data in sequence. The Bicubic top-layer calculation module receives the four lines of data output by the Linebuffer module, computes one-dimensional interpolation intermediate results row by row through a longitudinal window and temporarily stores them in the shift register; after the four lines have been processed, a transverse window computes a one-dimensional interpolation of those intermediate results row by row to obtain and output the interpolation-point pixel values. The invention improves processing efficiency and reduces resource consumption.

Description

Super-resolution system and method for high-speed image acquisition
Technical Field
The invention belongs to the technical field of hardware acceleration of algorithms on FPGA (field-programmable gate array) platforms, and particularly relates to a super-resolution system and method for high-speed image acquisition.
Background
When a fast-moving unmanned aerial vehicle (UAV) monitors the ground, it typically shoots with a high-frame-rate onboard camera. However, because of the limited transmission bandwidth and the limited computing power of the onboard computer, the resolution of the camera carried on the UAV is low, so the images transmitted back to the ground station are blurry and lack detail, making it difficult to meet the requirements of high-precision monitoring.
Super-resolution, also known as upsampling, is a class of algorithms for improving the resolution and quality of videos or images. It is widely used in video editing and image processing, for example surveillance video processing, high-resolution photography and film restoration, and it can convert the low-resolution images received by a ground station into high-resolution images.
Among super-resolution algorithms, traditional interpolation algorithms still have great research value and are well suited to hardware implementation thanks to their regular formulas and ease of realization in hardware. The conventional interpolation algorithms mainly include nearest-neighbor interpolation, bilinear interpolation and bicubic interpolation. Nearest-neighbor interpolation is simple and fast, but the quality of the interpolated output image is low. Bilinear interpolation behaves like a low-pass filter, which weakens the high-frequency components of the image and blurs the edges of the interpolated output. Bicubic interpolation uses a 4 × 4 lattice (16 points in total) around the interpolation point as a reference to obtain the interpolation-point pixel value; it avoids the staircase artifacts of nearest-neighbor interpolation, and its interpolation of image edges is superior to bilinear interpolation.
The bicubic interpolation algorithm mainly has two implementations, the 16-point ordinary bicubic interpolation method and the 16-point convolution bicubic interpolation method, which differ mainly in how the interpolation kernel is obtained. The former calculates the interpolation coefficients from the bicubic polynomial of the convolution kernel; the latter decomposes the two-dimensional interpolation kernel into the x and y directions and computes the interpolation coefficients successively in each one-dimensional direction. The 16-point convolution bicubic interpolation method has a simple calculation formula, and the hardware circuits for the x and y directions can be reused.
However, in existing methods the horizontal and vertical windows are not separated, so 16 original pixels need to be cached, which increases hardware consumption and latency.
In addition, the 16-point convolution bicubic interpolation method first performs coefficient calculation in the x and y directions. Existing methods usually choose α = -0.5 or α = -0.75 as the convolution-kernel coefficient: a general-purpose computer uses a floating-point ALU (arithmetic logic unit), so a fractional coefficient incurs no extra computational cost, and the very large dynamic range of floating point represents the extra fractional bits exactly without introducing errors. For an FPGA platform, however, floating-point operations are slow and consume a large amount of resources, while fixed-point arithmetic units are very sensitive to data bit width. Using α = -0.5 or α = -0.75, as in existing schemes, multiplies the bit width of the data involved in the operation and consumes extra resources, and if low-order bits are discarded to keep the bit width down, the quality of the interpolated image deteriorates. Meanwhile, current 16-point convolution bicubic interpolation methods do not specially optimize 4× upsampling: with direct interpolation, the number of points inserted between two adjacent reference points is not fixed, so the number of points to be inserted between each pair of reference points must be counted, introducing extra computational complexity. Moreover, because the relative distances are irregular, infinite decimals must be approximated by finite decimals in fixed-point calculation, which introduces rounding error. In addition, conventional schemes use different relative distances to calculate the interpolation coefficients and reduce the number of additions and multiplications through simplified formulas, but they do not consider the accumulated error caused by a multi-stage pipeline.
The optimal architecture is the best trade-off between accuracy and hardware cost, so output image quality, hardware resource consumption and throughput must all be studied carefully. The bicubic interpolation architecture proposed by G. Mahale et al. can generate high-quality up-sampled pictures, but it requires a very large amount of resources and consumes a great deal of energy. Nuno-Magand, K. Gribbon, F. Sabetzadeh et al. implemented bicubic interpolation with architectures that store the entire image in external memory and therefore require significant external memory; such external memory increases the overall cost of the system, reduces performance and increases power consumption. Meanwhile, most current FPGA implementations of image bicubic interpolation use a floating-point unit, which incurs a large area overhead, generates high power consumption and hurts the overall performance of the system.
In summary, although there are many designs for super-resolution hardware systems based on 16-point convolution bicubic interpolation, a method that balances hardware resource consumption, output image quality and throughput is still lacking.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a super-resolution system and a method for high-speed image acquisition, which improve the processing efficiency and reduce the resource consumption.
The invention is realized by the following technical scheme:
a super-resolution system for high-speed image acquisition, comprising: the device comprises a Padding module, a Linebuffer module, a Bicubic top layer calculation module and a shift register;
the Padding module is used for padding the original image and outputting the padded image to the Linebuffer module;
the Linebuffer module is used for caching the lines of the supplementary image and synchronously outputting four lines of data in sequence;
and the Bicubic top-layer calculation module is used for receiving four rows of data output by the Linebuffer module, calculating one-dimensional interpolation intermediate results in rows through a longitudinal window and temporarily storing the one-dimensional interpolation intermediate results in the shift register, and after the four rows of data are calculated, calculating one-dimensional interpolation of the one-dimensional interpolation intermediate results in rows through a transverse window to obtain interpolation point pixel values and outputting the interpolation point pixel values.
Preferably, the Bicubic top layer calculation module comprises three single-channel modules which are an R channel module, a G channel module and a B channel module respectively; each single-channel module comprises a Bicubic module;
the Bicubic top-layer calculation module is used for receiving four rows of data output by the Linebuffer module and respectively sending R, G and B pixel values of the four rows of data to the R channel module, the G channel module and the B channel module; the Bicubic module of each single-channel module calculates one-dimensional interpolation intermediate results in a row by a longitudinal window and temporarily stores the one-dimensional interpolation intermediate results in a shift register, after four rows of calculation are completed, one-dimensional interpolation of the one-dimensional interpolation intermediate results is calculated in a row by a transverse window, R, G and B pixel values of interpolation points are respectively obtained, and the R, G and B pixel values of the interpolation points are spliced to obtain the pixel values of the interpolation points and output.
Further, the Linebuffer module provides an enable signal that starts the computation of the Bicubic module.
Further, the Bicubic module comprises a cal_B module, a cal_q module, a shift_reg module, a cal_B_post module and a cal_q_post module;
the cal_B module is used for receiving the four rows of data and sequentially performing multiplication, subtraction and addition to obtain four intermediate coefficients, which are output to the cal_q module;
the cal_q module is used for receiving the four intermediate coefficients output by the cal_B module and sequentially performing multiplication and a two-stage addition tree to obtain a one-dimensional interpolation intermediate result, which is output to the shift_reg module;
the shift_reg module is used for receiving the one-dimensional interpolation intermediate results output by the cal_q module, buffering them in a shift register, and outputting four one-dimensional interpolation intermediate results in parallel to the cal_B_post module;
the cal_B_post module is used for receiving the four one-dimensional interpolation intermediate results output by the shift_reg module and sequentially performing multiplication, subtraction and addition to obtain four transverse intermediate coefficients, which are output to the cal_q_post module;
and the cal_q_post module is used for receiving the four transverse intermediate coefficients output by the cal_B_post module and sequentially performing multiplication and a two-stage addition tree to obtain the two-dimensional interpolation result, namely the R, G or B pixel value of the interpolation point.
Furthermore, the cal_q module operates on the received four intermediate coefficients and pre-stored parameters: the first stage completes the multiplication, and the one-dimensional interpolation intermediate result is then obtained through a two-stage addition tree.
Further, the cal_q_post module operates on the received four transverse intermediate coefficients and pre-stored parameters: the first stage completes the multiplication, and the two-dimensional interpolation result is then obtained through a two-stage addition tree;
there are four cal_q_post modules, each with different pre-stored parameters; the four cal_q_post modules calculate in parallel and obtain the two-dimensional interpolation results of four adjacent points in one clock cycle.
A super-resolution method for high-speed image acquisition comprises the following steps:
s1, supplementing an original image to obtain a supplemented image;
s2, performing line caching on the supplementary image to obtain four lines of data;
and S3, performing one-dimensional interpolation on the four rows of data according to the rows through a longitudinal window to obtain a one-dimensional interpolation intermediate result, temporarily storing the one-dimensional interpolation intermediate result, and performing one-dimensional interpolation on the one-dimensional interpolation intermediate result according to the rows through a transverse window after four rows of calculation are completed to obtain an interpolation point pixel value.
Preferably, S1 is in particular: and assigning the pixel value of the edge of the original image to the pixel point of the original image to be supplemented to obtain a supplemented image.
Preferably, in S3, when the one-dimensional interpolation is performed, the interpolation is performed according to a rule of inserting interpolation points at fixed intervals.
Preferably, in S3, the convolution kernel used for one-dimensional interpolation is as follows:
β(d) = |d|³ - 2|d|² + 1,  for |d| ≤ 1
β(d) = -|d|³ + 5|d|² - 8|d| + 4,  for 1 < |d| < 2
β(d) = 0,  otherwise
where |d| is the distance from the interpolation point to the reference point.
Compared with the prior art, the invention has the following beneficial effects:
the super-resolution system is based on a 16-point convolution Bicubic interpolation algorithm, a Bicubic top layer calculation module calculates one-dimensional interpolation intermediate results in a row mode through a longitudinal window, the calculated one-dimensional interpolation intermediate results are temporarily stored in a shift register, and the one-dimensional interpolation intermediate results slide to the right once every period; after 4 cycles, the transverse window calculates the one-dimensional interpolation of the one-dimensional interpolation intermediate result according to the line, the shift register is adopted for caching, the parallel output of the 4-point one-dimensional interpolation intermediate result value is realized, the intermediate result in the operation process can be efficiently multiplexed, the output data can be kept unchanged while the parallelism of the longitudinal window is reduced to 1, the hardware resource is saved, and the resource consumption is reduced to a certain extent. The system realizes the function of converting low-resolution images received by a ground station into high-resolution images, and can meet the high-precision monitoring requirement of an unmanned aerial vehicle system under high-speed movement.
Furthermore, constants used in the computation (such as the distance d between an interpolation point and a reference point) are stored in hardware in advance, avoiding a large number of redundant operations.
The super-resolution method is based on the 16-point convolution bicubic interpolation algorithm: one-dimensional interpolation intermediate results are computed row by row through a longitudinal window, temporarily stored, and the window slides one step to the right every cycle; after 4 cycles, the transverse window computes a one-dimensional interpolation of those intermediate results row by row. Buffering allows the four one-dimensional intermediate results to be output in parallel, so intermediate results are efficiently reused, the output data remain unchanged while the parallelism of the longitudinal window is reduced to 1, hardware resources are saved, and resource utilization is improved to a certain extent.
Furthermore, in the invention, the pixel values of the outermost ring, i.e. the edge of the original image, are assigned to the values that need to be filled, so that the interpolation points are distorted as little as possible.
Furthermore, considering that direct 4× upsampling would insert different numbers of points between reference points and thus cause extra resource consumption and precision loss, the invention uses Padding so that the points to be interpolated are uniformly distributed between the reference points and the relative distance between a point to be interpolated and a reference point is a fixed number, i.e. |d| is fixed, which reduces hardware consumption.
Furthermore, in the parameter selection process, the errors and extra computation that existing schemes may introduce are considered, and α = -1 is chosen as the convolution-kernel parameter. This avoids fractional values in the kernel parameters and the pipeline-delay degradation that a divider execution unit would introduce, and improves the pipeline throughput of the convolution-kernel calculation.
Drawings
FIG. 1 is a schematic diagram of bicubic interpolation;
FIG. 2 is a schematic diagram comparing different Padding schemes;
FIG. 3 is a schematic diagram of the distance between an interpolation point and a reference point;
FIG. 4 is a schematic diagram of an interpolation reduction scheme analysis;
FIG. 5 is a schematic diagram of a pixel point of an original image to be supplemented;
FIG. 6 is a schematic diagram of the padding process of the present invention;
FIG. 7 is a schematic flow chart of the algorithm of the present invention;
FIG. 8 is a schematic diagram of a Bicubic module;
FIG. 9 is a schematic structural diagram of a Bicubic top-level computing module;
FIG. 10 is a schematic diagram of a single channel module architecture;
FIG. 11 is a schematic diagram of the cal_B module structure;
FIG. 12 is a schematic diagram of the cal_q module structure;
FIG. 13 is a schematic diagram of the transverse window and the longitudinal window;
FIG. 14 is a schematic structural diagram of the shift_reg module;
FIG. 15 is a block diagram of the cal_B_post module;
FIG. 16 is a block diagram of the cal_q_post module;
FIG. 17 is a block diagram of the overall system of the present invention;
FIG. 18 is a schematic diagram of a test vector process;
FIG. 19 is a comparison of the C code result and the post-implementation simulation result at the head of the first line of the up-sampled test image;
FIG. 20 is a comparison of the C code result and the post-implementation simulation result at the end of the first line of the up-sampled test image;
FIG. 21 is a comparison of the C code result and the post-implementation simulation result at the head of the fifth line of the up-sampled test image;
FIG. 22 is a comparison of the C code result and the post-implementation simulation result at the end of the fifth line of the up-sampled test image;
FIG. 23 shows the results of 4K images obtained by the present invention;
FIG. 24 illustrates the resource usage and resource utilization obtained by the present invention;
FIG. 25 shows the clock constraints of the system;
FIG. 26 is an analysis of the setup and hold time margins after placement and routing of the Bicubic IP.
Detailed Description
For a further understanding of the invention, reference will now be made to the following examples, which are provided to illustrate further features and advantages of the invention, and are not intended to limit the scope of the invention as set forth in the following claims.
The invention comprises three parts: an algorithm part, a hardware part and a system part.
Algorithm part: the one-dimensional bicubic interpolation algorithm is improved; the number of multiplications in the final improved algorithm is 75% and 47% of that of the original one-dimensional and two-dimensional bicubic interpolation algorithms, respectively.
Hardware part: constants (such as the distance d between two interpolation points) used in the operation are stored in hardware in advance, so that a large amount of redundant operation is avoided; in addition, the shift register is used for efficiently multiplexing intermediate results in the operation process, so that the resource utilization rate is reduced to a certain extent. And finally, a multi-stage pipelining technology is used in the implementation of bicubic interpolation, so that the operation efficiency is improved.
System part: when the system is built on the Zynq7020 development platform, the PL side, where the computing modules reside, and the PS side, where the DDR resides, are interconnected through VDMA; the high bandwidth provided by VDMA lays the foundation for subsequent video-stream processing.
The three sections are described below.
1. The algorithm part adopts the following technical scheme:
the invention adopts a 16-point convolution bicubic interpolation algorithm to realize subsequent hardware.
The bicubic interpolation algorithm can be seen as a two-dimensional extension of the one-dimensional interpolation function. The one-dimensional interpolation function is shown in equation (1):
g(x) = Σ_{k=0}^{3} A_k · β(d_k),  d_k = |x - x_k|    (1)
where k is 0, 1, 2, 3; A_k is the pixel value of the k-th of the four reference points adjacent to the point to be solved; x_k is the abscissa of each reference point; x is the abscissa of the point to be solved; d_k is the distance between the two points; β is the convolution kernel; and g(x) is the interpolation output.
In the invention, the convolution kernel given by R. Keys is selected, and the interpolation-point pixel value is calculated with the adjacent 4 points as reference. The convolution kernel is:
β(d) = (α+2)|d|³ - (α+3)|d|² + 1,  for |d| ≤ 1
β(d) = α|d|³ - 5α|d|² + 8α|d| - 4α,  for 1 < |d| < 2
β(d) = 0,  otherwise    (2)
where d is the distance between the reference point and the point to be solved. The parameter α is usually set to -0.5 or -0.75; in the present invention α = -1 is used, i.e. the convolution kernel is a piecewise cubic polynomial.
The bicubic interpolation algorithm performs interpolation again on the basis of the one-dimensional interpolation results. As shown in FIG. 1, the bicubic interpolation algorithm selects 16 adjacent reference points whose positions depend on the coordinates of the interpolation point, such that the interpolation point g(x, y) falls inside the square region enclosed by the reference points A_11, A_21, A_12 and A_22, and interpolation-point pixel values are calculated successively in the vertical and horizontal directions. The interpolation-point pixel value is given by:
g(x, y) = Σ_{j=0}^{3} β(|y - y_j|) · [ Σ_{i=0}^{3} A_ij · β(|x - x_i|) ]    (3)
where i and j are 0, 1, 2, 3; A_ij are the pixel values of the 16 reference points; x_i and y_j are the abscissas and ordinates of the reference points; and x and y are the abscissa and ordinate of the point to be solved.
That is, interpolation is performed successively in the y and x directions. Since equation (3) is a linear operation, changing the order of calculation in the two dimensions does not affect the final result; in practice the calculation order can therefore be chosen according to the space complexity, time complexity and other metrics of the implementation.
Among the four arithmetic operations on an FPGA (field-programmable gate array), division costs more in both time and area than the other three, so the hardware design reduces the use of division. From the viewpoint of interpolating the whole image, the 16 reference points required for each interpolation point must be substituted into equation (3) in turn. In the invention the parameter α = -1 is chosen, which avoids fractional values in the convolution-kernel parameters, avoids the pipeline-delay degradation that a divider execution unit would introduce, and improves the pipeline throughput of the β(d) calculation. The convolution kernel β(d) with α = -1 is given by equation (4):
β(d) = |d|³ - 2|d|² + 1,  for |d| ≤ 1
β(d) = -|d|³ + 5|d|² - 8|d| + 4,  for 1 < |d| < 2
β(d) = 0,  otherwise    (4)
where |d| is the distance from the interpolation point to the reference point.
Hardware implementations of the bicubic interpolation algorithm fall into two categories: non-uniform structures and uniform structures. Non-uniform structures are used for image rotation, continuous (stepless) scaling and similar applications; the uniform structure is essentially a special case of the non-uniform structure and applies only when the interpolation-point positions are fixed. The invention uses a fixed magnification factor and aims to reduce hardware consumption, so a uniform structure is chosen. Taking one-dimensional 4× interpolation of 8 points as an example, the specific schemes are compared in FIG. 2. Without padding, one-dimensional 4× uniform interpolation inserts 6 or 7 points between two reference points, and the relative distances between the interpolation points and the reference points differ; the distance of every interpolation point from its reference point would then have to be computed in hardware, and because the number of interpolation points between adjacent reference points varies, an extra counter would be needed to control how many times the next-stage module computes, introducing additional resource consumption. The situation with padding of 1 is similar. The invention adopts padding of 2. As the figure shows, the points to be interpolated are then uniformly distributed between the reference points: exactly 4 points are inserted between two reference points at fixed intervals. Assuming the distance between two reference points is 1 and |d| is the distance from a point to be interpolated to its left neighbouring reference point, |d| takes only the values 1/8, 3/8, 5/8 and 7/8. Because |d| is fixed, |d|, |d|² and |d|³ are computed in advance and assigned as parameters (pre-stored parameters) in the Verilog code, avoiding repeated calculation. Compared with fixed-point arithmetic, the strict timing-constraint requirements and the lower speed of FPGA floating-point arithmetic have long led FPGA designers to use fixed-point arithmetic wherever possible; the invention likewise uses fixed-point arithmetic. To keep the precision loss of the fixed-point calculation within a tolerable range, the values of |d|, |d|² and |d|³ are shifted left by 18 bits before being substituted into the calculation, and the final pixel-value result is shifted right by 18 bits to restore the correct data.
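As a software illustration of these pre-stored parameters (a sketch under the stated assumptions, not the patent's Verilog), the following C fragment computes the 2^18-scaled values of |d|, |d|² and |d|³ for the four fixed offsets; the constant and variable names are illustrative.

```c
/* Sketch: computing the pre-stored fixed-point constants |d|, |d|^2, |d|^3
 * for the four fixed offsets 1/8, 3/8, 5/8, 7/8, scaled by 2^18 as described
 * above.  The values are computed once, offline, so no divider or
 * floating-point unit is needed at run time. */
#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define FRAC_BITS 18                 /* left-shift amount from the text */
#define SCALE     (1 << FRAC_BITS)   /* 2^18                            */

int main(void)
{
    const double d[4] = { 1.0 / 8, 3.0 / 8, 5.0 / 8, 7.0 / 8 };

    for (int i = 0; i < 4; ++i) {
        int32_t d1 = (int32_t)lround(d[i] * SCALE);               /* |d|   */
        int32_t d2 = (int32_t)lround(d[i] * d[i] * SCALE);        /* |d|^2 */
        int32_t d3 = (int32_t)lround(d[i] * d[i] * d[i] * SCALE); /* |d|^3 */
        printf("d=%d/8: d1=%d d2=%d d3=%d\n", 2 * i + 1,
               (int)d1, (int)d2, (int)d3);
    }
    /* At the output stage the pixel result is shifted right by FRAC_BITS
     * (with rounding) to restore the 8-bit pixel range. */
    return 0;
}
```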
The invention reduces the use of DSP (hardware multiplier) as much as possible by simplifying the interpolation algorithm.
As shown in FIG. 3, assume that the distance between reference points is 1 and that the horizontal distance |d| between the interpolation point g(x, y) and the reference point A_11 is u. The horizontal distances between g(x, y) and the reference points A_10, A_12, A_13 are then (1+u), (1-u) and (2-u), respectively. According to equations (4) and (3):
I_c0 = A_10 × β(1+u) = A_10 × (-u³ + 2u² - u)
I_c1 = A_11 × β(u)   = A_11 × (u³ - 2u² + 1)
I_c2 = A_12 × β(1-u) = A_12 × (-u³ + u² + u)
I_c3 = A_13 × β(2-u) = A_13 × (u³ - u²)    (5)
Here I_c0, I_c1, I_c2, I_c3 are the component results obtained by substituting the horizontal distances between g(x_0, y_0) and the reference points A_10, A_11, A_12, A_13 into the inner terms of equation (3). Adding the components and simplifying gives:
g(u) = u × (u × (u × B_4 + B_3) + B_2) + B_1    (6)
Further expanding gives:
g(u) = u³ × B_4 + u² × B_3 + u × B_2 + B_1    (7)
where:
B_1 = A_11
B_2 = A_12 - A_10
B_3 = 2A_10 - 2A_11 + A_12 - A_13
B_4 = A_11 - A_10 + A_13 - A_12    (8)
for the original one-dimensional bicubic interpolation algorithm, the calculation formula of the interpolation point pixel value is given by formula (7), the hardware structure diagram is shown in fig. 4 (a), and it is easy to see that in the structure diagram, 5 fixed point multipliers and 5 fixed point multipliers are needed to be used3 adders. Formula (6) is obtained by extracting the formula (7) and optimizing the formula, and fig. 4 (b) is an optimized structural diagram, and the multiplication amount of the optimized structural diagram is 75% of the original multiplication amount. Furthermore, because the padding mode of the invention ensures that the distance between the interpolation point and the reference point is only 4 fixed values, the characteristics of the invention can be utilized to carry out u, u 2 ,u 3 Prestoring and then reusing equation (7) results in the hardware structure shown in fig. 4 (c), which has the advantage of avoiding the accumulated error introduced by using multiple stages of fixed-point multipliers. Meanwhile, compared with the hardware shown in fig. 4 (b), the structure utilizes two stages of addition trees to reduce the number of original pipeline stages from six stages to three stages, and compared with the structure shown in fig. 4 (a), the optimization scheme also reduces the number of multiplication operations to 75%.
The two-dimensional bicubic interpolation algorithm is an extension of one-dimensional interpolation to a second spatial dimension; the final formula of the algorithm is given by equation (9):
g(v) = v³ × C_4 + v² × C_3 + v × C_2 + C_1    (9)
where:
C_1 = g_1(u)
C_2 = g_2(u) - g_0(u)
C_3 = 2g_0(u) - 2g_1(u) + g_2(u) - g_3(u)
C_4 = g_1(u) - g_0(u) + g_3(u) - g_2(u)    (10)
g_i(u) = u³ × B_i4 + u² × B_i3 + u × B_i2 + B_i1    (11)
B_i1 = A_i1
B_i2 = A_i2 - A_i0
B_i3 = 2A_i0 - 2A_i1 + A_i2 - A_i3
B_i4 = A_i1 - A_i0 + A_i3 - A_i2    (12)
u and v take the values of 1/8, 3/8, 5/8 and 7/8 respectively.
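Building on the previous sketch, a minimal C reference for the separable two-dimensional evaluation of equations (9)-(12) might look as follows; the row/column orientation of the 4 × 4 window is an assumption for illustration, and by linearity the hardware's vertical-first order gives the same result.

```c
/* Sketch of the separable 2-D evaluation of equations (9)-(12), reusing the
 * cal_B/cal_g helpers above.  a[j][i] is the 4x4 reference window (row j,
 * column i); u and v are each one of 1/8, 3/8, 5/8, 7/8. */
static double bicubic_4x4(const int a[4][4], double u, double v)
{
    double g[4];
    for (int j = 0; j < 4; ++j) {
        /* Interpolate each row at offset u: g_j(u), equation (11). */
        coeffs_t cb = cal_B(a[j][0], a[j][1], a[j][2], a[j][3]);
        g[j] = cal_g(cb, u);
    }
    /* Combine the four row results at offset v, equations (9)-(10).
     * The hardware applies the vertical window first, which yields the same
     * result because the operation is linear. */
    double c1 = g[1];
    double c2 = g[2] - g[0];
    double c3 = 2 * g[0] - 2 * g[1] + g[2] - g[3];
    double c4 = g[1] - g[0] + g[3] - g[2];
    return v * (v * (v * c4 + c3) + c2) + c1;
}
```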
When performing the interpolation calculation, the position where interpolation starts is shown in FIG. 5, where the circular points are pixels of the original image. When pix_d(0, 0) is calculated, the reference point on the original image and the 4 × 4 pixel window used for interpolation are pix_re(0, 0) and pix_s4×4, respectively; it can be seen that some pixels (the square points in FIG. 5) must be supplemented to complete the interpolation. This supplementing process is called padding.
In the present invention, the padding process is as shown in FIG. 6: instead of simply padding with 0, the pixel values of the outermost ring, i.e. the edge of the original image, are assigned to the values to be padded, so that the interpolation points are distorted as little as possible.
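A minimal C sketch of this edge-replication padding, assuming a single-channel, row-major image buffer (the function and parameter names are illustrative):

```c
/* Sketch of edge-replication padding: out-of-range reference coordinates are
 * clamped to the nearest valid pixel, which is equivalent to copying the
 * outermost row/column outward. */
#include <stdint.h>

static inline int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Read pixel (x, y) of a W x H single-channel image as if it had been padded
 * by replicating its edges. */
static uint8_t padded_pixel(const uint8_t *img, int W, int H, int x, int y)
{
    return img[clampi(y, 0, H - 1) * W + clampi(x, 0, W - 1)];
}
```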
The algorithm flow diagram of the invention is shown in fig. 7.
The algorithm comprises the following steps:
s1, padding is carried out on an original image, and specifically: supplementing pixel points around the original image, preferably assigning the pixel values of the edge of the original image to the pixel points of the original image to be supplemented, and obtaining a supplemented image;
s2, synchronously outputting four lines of data in sequence through the supplemented image subjected to Padding through a line buffer (Linebuffer);
and S3, synchronously outputting four lines of data and parallelly entering a Bicubic top-layer computing module. The Bicubic top layer calculation module is obtained by instantiating the Bicubic module for three times and processes pixel values of three channels of R, G and B in parallel. The Bicubic module is divided into a longitudinal window and a transverse window; the longitudinal window receives four rows of data sent by the Linebuffer in parallel, one-dimensional interpolation intermediate results are calculated according to rows, the calculated one-dimensional interpolation intermediate results are temporarily stored in the shift register, and the one-dimensional interpolation intermediate results slide to the right once every period; after 4 periods, the transverse window calculates one-dimensional interpolation of the one-dimensional interpolation intermediate result according to rows and outputs the one-dimensional interpolation, the value is the single-channel pixel value of the final interpolation point, and the transverse window slides to the right once every period. And the Bicubic top layer calculation module splices and outputs the pixel values of the R, G and B channels of the interpolation point according to the positions.
And S4, judging whether all the image interpolation points are calculated, if so, entering the next step, and if not, waiting for completion.
And S5, integrating image data, adding a BMP file header and the like.
2. The hardware part adopts the following technical scheme:
the hardware of the Bicubic interpolation algorithm comprises a Linebuffer module and a Bicubic top layer calculation module, wherein the Bicubic top layer calculation module comprises three single-channel modules which are respectively an R channel module, a G channel module and a B channel module, and the single-channel module comprises a Bicubic module.
And the Linebuffer module is used for caching the supplemented image after Padding and synchronously outputting four lines of data to the Bicubic top-layer calculation module in sequence.
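A minimal C model of this four-line buffering behaviour (an illustrative sketch, not the hardware Linebuffer; the line-width value and single-channel format are assumptions):

```c
/* Minimal C model of the Linebuffer behaviour: keep the last four image
 * lines and present them simultaneously, one column per step, so the
 * downstream vertical window sees a 4x1 column of pixels every cycle. */
#include <stdint.h>
#include <string.h>

#define W 964   /* padded line width; illustrative value */

typedef struct {
    uint8_t line[4][W];  /* four buffered lines, oldest first */
    int     lines_held;
} linebuffer_t;

/* Push one new padded line; once four lines are held, column c of the four
 * synchronized lines can be read as line0_pixel..line3_pixel. */
static void linebuffer_push(linebuffer_t *lb, const uint8_t *new_line)
{
    memmove(lb->line[0], lb->line[1], sizeof(lb->line[0]) * 3);
    memcpy(lb->line[3], new_line, W);
    if (lb->lines_held < 4)
        lb->lines_held++;
}

static void linebuffer_column(const linebuffer_t *lb, int c, uint8_t out[4])
{
    for (int r = 0; r < 4; ++r)
        out[r] = lb->line[r][c];
}
```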
As shown in FIG. 8, the Bicubic module consists of a longitudinal window and a transverse window, which are essentially the same hardware structure used in different dimensions (vertical and horizontal). The square points represent pixels that are being output or are about to be output by the Linebuffer module, the round points represent points output by the Linebuffer module in previous clock cycles, the triangular points represent the one-dimensional interpolation intermediate results output by the longitudinal window, and the diamond points represent the finally calculated interpolation-point pixel values.
It is worth mentioning that, because the horizontal window is separated from the vertical window, the data output by the Linebuffer module can directly enter the vertical window for processing, avoiding the caching of 16 original pixels and reducing hardware consumption and latency.
As shown in FIG. 9, the Bicubic top-layer calculation module is started and stopped, and its calculation mode selected, by the module above it; it takes the RGB pixel values of four lines of data as input, sends them to the corresponding single-channel modules (R channel module, G channel module and B channel module) for calculation, and after the calculation outputs the RGB pixel values of four points, spliced together, to the next stage.
The clock and reset signals provide a synchronous clock and reset for the Bicubic top-layer calculation module; the enable signal is provided by the Linebuffer module of the previous stage and starts the computation of the Bicubic module; line0_pixel to line3_pixel are the four lines of RGB data of the image to be interpolated. To reduce the amount of computation the processor on the FPGA (field-programmable gate array) development board needs to recover the data from the DDR3 memory, the design accesses the original image in DDR3 in four passes, and the computation results are output in the line order 1, 5, …, 2157, then 2, 6, …, 2158, then 3, 7, …, 2159, then 4, 8, …, 2160, outputting four adjacent interpolation points of one line at a time. The sel signal selects the internal parameter data, i.e. which row is calculated and output. vld_out provides the next-stage module with the enable needed to correctly receive valid data.
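The four-pass output line order described above can be illustrated with the following small C sketch; the 2160-line figure corresponds to 4× up-sampling of a 540-line input, which is an assumption made here for illustration.

```c
/* Illustration of the output line order: the up-sampled image is produced
 * in four passes, pass p emitting lines p+1, p+5, p+9, ...  Line numbers
 * are 1-based, matching the text. */
#include <stdio.h>

int main(void)
{
    const int total_lines = 2160;   /* lines of the up-sampled image */
    for (int pass = 0; pass < 4; ++pass)
        for (int line = pass + 1; line <= total_lines; line += 4)
            printf("pass %d -> line %d\n", pass, line);
    return 0;
}
```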
The single-channel module consists of 8 sub-modules and calculates the pixel values of four adjacent points of a row. As shown in FIG. 10, it includes a cal_B module, a cal_q module, a shift_reg module, a cal_B_post module and four cal_q_post modules.
Description of the cal_B module:
The structure of the module is shown in FIG. 11: the pixel values of the original image are input, multiplication by two is realized through a shift operation, subtraction and addition are then performed in the second- and third-stage pipelines, and the intermediate coefficients are finally obtained. The specific definition of each signal of the cal_B module is given in Table 1.
Table 1. cal_B module signal definitions

Port name | Width | Type | Description
clk | 1 | unsigned | Clock signal
en | 1 | unsigned | Input enable signal
rst | 1 | unsigned | Reset signal
line0_pixel | 8 | unsigned | First line of raw data provided by the Linebuffer
line1_pixel | 8 | unsigned | Second line of raw data provided by the Linebuffer
line2_pixel | 8 | unsigned | Third line of raw data provided by the Linebuffer
line3_pixel | 8 | unsigned | Fourth line of raw data provided by the Linebuffer
vld_out | 1 | unsigned | Output enable signal
b1 | 8 | unsigned | Output intermediate coefficient b1
b2 | 8 | unsigned | Output intermediate coefficient b2
b3 | 8 | unsigned | Output intermediate coefficient b3
b4 | 8 | unsigned | Output intermediate coefficient b4
Description of the cal_q module:
The structure of the cal_q module is shown in FIG. 12: the four intermediate coefficients are input, a data selector chooses the pre-stored parameters for the operation, the first stage completes the multiplication, and the one-dimensional interpolation intermediate result q is then obtained through a two-stage addition tree. The specific definition of each signal of the cal_q module is given in Table 2.
Table 2. cal_q module signal definitions

Port name | Width | Type | Description
clk | 1 | unsigned | Clock signal
en | 1 | unsigned | Input enable signal
rst | 1 | unsigned | Reset signal
sel | 2 | unsigned | Selection signal for choosing the parameters involved in the calculation
b1 | 8 | unsigned | Intermediate coefficient b1
b2 | 8 | unsigned | Intermediate coefficient b2
b3 | 8 | unsigned | Intermediate coefficient b3
b4 | 8 | unsigned | Intermediate coefficient b4
vld_out | 1 | unsigned | Output enable signal
q | 23 | signed | Intermediate result q, 13-bit integer part, 9-bit fractional part
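A software sketch of the cal_q datapath is shown below. The 2^18-scaled pre-stored powers match the earlier sketch; rescaling the result to 9 fractional bits (the q format in Table 2) is an assumption, since the exact internal scaling of each hardware stage is not spelled out here.

```c
/* Sketch of the cal_q datapath: the four intermediate coefficients are
 * multiplied by pre-stored fixed-point powers of u selected by sel, and the
 * products are summed with a two-stage addition tree.  Bit widths and
 * scaling are illustrative, not the exact hardware values. */
#include <stdint.h>

#define IN_FRAC   18   /* pre-stored constants: scaled by 2^18 (assumption) */
#define OUT_FRAC  9    /* q: 9 fractional bits, per Table 2                 */

/* Pre-stored {u, u^2, u^3} for u = 1/8, 3/8, 5/8, 7/8, scaled by 2^18. */
static const int32_t U_TABLE[4][3] = {
    {  32768,   4096,    512 },   /* 1/8 */
    {  98304,  36864,  13824 },   /* 3/8 */
    { 163840, 102400,  64000 },   /* 5/8 */
    { 229376, 200704, 175616 },   /* 7/8 */
};

static int32_t cal_q(int sel, int32_t b1, int32_t b2, int32_t b3, int32_t b4)
{
    /* Stage 1: multiplications with the selected pre-stored powers. */
    int64_t p1 = (int64_t)b1 << IN_FRAC;          /* b1 * 1   */
    int64_t p2 = (int64_t)b2 * U_TABLE[sel][0];   /* b2 * u   */
    int64_t p3 = (int64_t)b3 * U_TABLE[sel][1];   /* b3 * u^2 */
    int64_t p4 = (int64_t)b4 * U_TABLE[sel][2];   /* b4 * u^3 */

    /* Stages 2-3: two-stage addition tree. */
    int64_t s0 = p1 + p2;
    int64_t s1 = p3 + p4;
    return (int32_t)((s0 + s1) >> (IN_FRAC - OUT_FRAC));
}
```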
Description of the shift_reg module:
The shift_reg module implements the horizontal sliding-window function. As shown in FIG. 13, x0-x7 are the 8 points obtained by up-sampling a given row, and q0-q4 are the vertical-window calculation results; q0-q3 in horizontal window 1 are used to calculate the up-sampled points x0-x3, and q1-q4 in horizontal window 2 are used to calculate the up-sampled points x4-x7. The vertical-window result q can therefore be reused when calculating pixel values at different positions (horizontal windows 1 and 2 in the figure), so a shift register is used for buffering and the four one-dimensional interpolation intermediate results q are output in parallel, providing the input for the next-stage cal_B_post module. In this way the output data remain unchanged while the parallelism of the vertical window is reduced to 1, saving hardware resources.
The structural diagram of the shift_reg module and its signal definitions are given in FIG. 14 and Table 3.

Table 3. shift_reg module signal definitions

Port name | Width | Type | Description
clk | 1 | unsigned | Clock signal
en | 1 | unsigned | Input enable signal
rst | 1 | unsigned | Reset signal
q | 23 | signed | Input intermediate result q, 13-bit integer part, 9-bit fractional part
q0 | 8 | signed | Output intermediate result q0, 13-bit integer part, 9-bit fractional part
q1 | 8 | signed | Output intermediate result q1, 13-bit integer part, 9-bit fractional part
q2 | 8 | signed | Output intermediate result q2, 13-bit integer part, 9-bit fractional part
q3 | 8 | signed | Output intermediate result q3, 13-bit integer part, 9-bit fractional part
vld_out | 1 | signed | Output enable signal
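A software model of this reuse might look as follows (a sketch, not the Verilog implementation): each new vertical-window result is shifted in, and once four results are held the horizontal window produces four adjacent output pixels from them. The ordering of the stored values and the helper names are illustrative.

```c
/* Software model of the shift_reg reuse: each new vertical-window result q
 * is shifted in; once four results are present they are reused by the
 * horizontal window to produce four output pixels (x0..x3, then x4..x7 from
 * the overlapping window, and so on). */
#define TAPS 4

typedef struct {
    double q[TAPS];   /* q[0] is the newest value, q[TAPS-1] the oldest */
    int    filled;
} shift_reg_t;

/* Horizontal 1-D interpolation over four intermediate results at offset v;
 * same coefficient form as equations (9)-(10). */
static double eval_horizontal(const double g[TAPS], double v)
{
    double c1 = g[1];
    double c2 = g[2] - g[0];
    double c3 = 2 * g[0] - 2 * g[1] + g[2] - g[3];
    double c4 = g[1] - g[0] + g[3] - g[2];
    return v * (v * (v * c4 + c3) + c2) + c1;
}

/* Push one new vertical-window result; when the register is full, emit the
 * four adjacent output pixels produced by this horizontal window. */
static int shift_reg_push(shift_reg_t *sr, double q_new, double out[4])
{
    static const double v_off[4] = { 1.0 / 8, 3.0 / 8, 5.0 / 8, 7.0 / 8 };

    for (int i = TAPS - 1; i > 0; --i)     /* slide the window right */
        sr->q[i] = sr->q[i - 1];
    sr->q[0] = q_new;
    if (sr->filled < TAPS)
        sr->filled++;
    if (sr->filled < TAPS)
        return 0;

    /* Oldest-to-newest ordering for the horizontal window. */
    double g[TAPS] = { sr->q[3], sr->q[2], sr->q[1], sr->q[0] };
    for (int k = 0; k < 4; ++k)
        out[k] = eval_horizontal(g, v_off[k]);
    return 4;                              /* four pixels per clock cycle */
}
```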
Description of the cal_B_post module:
The cal_B_post module calculates the transverse intermediate coefficients. It takes as input the four one-dimensional interpolation intermediate results q output by the shift_reg module, realizes multiplication by two through a shift operation, then performs subtraction and addition in the second- and third-stage pipelines, and finally outputs four transverse intermediate coefficients. The block diagram and signal definitions are given in FIG. 15 and Table 4.
Table 4. cal_B_post module signal definitions
Description of the cal_q_post module:
The structure of the module is shown in FIG. 16: the four transverse intermediate coefficients are input, the pre-stored parameters are used for the operation, the first stage completes the multiplication, and the two-dimensional interpolation result q is then obtained through a two-stage addition tree.
It should be noted that there are four such cal_q_post modules in total, each with different pre-stored parameters; through parallel computation the interpolation results of four adjacent points are obtained in one clock cycle. Because the software algorithm can produce results that are negative or greater than 255 (in software, values greater than 255 are set to 255 and values less than 0 are set to 0), the hardware must also perform judgment and rounding. First, the sign bit is checked: if the result is negative, the output is set to zero. Then the bit above the most significant integer bit is checked for overflow: if overflow occurs, the output is set to 255. For an ordinary result, because the calculation contains fractional bits, the hardware must round to obtain the final pixel value: when the integer part is less than 255 and the fractional part is greater than 0.5, one is added to the integer part as the final result; otherwise the integer part is used directly.
The cal_q_post module signal definitions are given in Table 5.

Table 5. cal_q_post module signal definitions
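A C sketch of this clamping and rounding behaviour, assuming a signed fixed-point input with 9 fractional bits as listed for q in Table 2 (the exact internal bit widths of cal_q_post may differ):

```c
/* Sketch of the clamp-and-round step: negative results are clamped to 0,
 * overflow above 255 is clamped to 255, and otherwise the value is rounded
 * to the nearest integer (round half up, per the 0.5 rule above). */
#include <stdint.h>

#define FRAC 9   /* fractional bits, as for q in Table 2 (assumption) */

static uint8_t clamp_round(int32_t fixed)
{
    if (fixed < 0)
        return 0;                    /* sign bit set: output 0        */

    int32_t ipart = fixed >> FRAC;   /* integer part                  */
    if (ipart > 255)
        return 255;                  /* overflow: saturate at 255     */

    int32_t half = 1 << (FRAC - 1);
    if (ipart < 255 && (fixed & half))
        ipart += 1;                  /* fractional part >= 0.5: round up */
    return (uint8_t)ipart;
}
```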
3. The system part adopts the following technical scheme:
the overall system block diagram of the present invention is shown in fig. 17, and is mainly divided into a data flow and a control flow.
The data stream may be divided into an up-sampling processing part and an HDMI output part:
an up-sampling processing section: reading a BMP image by the PS terminal, discarding a file header and the like, preprocessing, and storing the preprocessed BMP image into a DDR3 memory designated address; the VDMA0 enables a read channel, reads original image data from the DDR3 at the address, and sends the data stream to the Linebuffer IP core through the AXI4-stream interface. And the Linebuffer IP core caches 3 lines and synchronizes with the downlink data, and totally outputs 4 lines of data. The Bicubic IP core processes data received from the Linebuffer IP core, and outputs 1,5,9. And enabling a write channel by the VDMA1, receiving the data stream processed by the Bicubic IP core, and writing the data stream back to another specified address of the DDR. The above steps are cycled 4 times, which is different from the first time in that the Bicubic IP core outputs the 2,6,10.. Linear data of the up-sampled image, and so on. And the PS terminal reads and sets the up-sampling output data, and after BMP (bone map) such as a file header is added, the file is named and written into the SD card.
An HDMI output section: the VDMA2 enables a reading channel, reads image data after PS setting, and sends the image data to an AXI-Stream to Video Out IP core through an AXI4-Stream interface. The AXI-Stream to Video Out IP core converts AXI4-Stream format data into RGB888 format under the control of the Video Timing Controller IP core, and sends the RGB888 format data to the DVI Transmitter for driving an HDMI interface.
Control flow: the host computer interacts with the PS side through UART; the starting coordinate information of the image is sent from the host computer to the PS side and displayed, and signals indicating that the BMP image has been sent to the host computer and written to the SD card are returned from the PS side. The GP interface is interconnected with the peripheral configuration interfaces through AXI Interconnect, so that the PS side controls the peripherals on the PL side. The PS side configures the frame-buffer address and size and the read/write channels of the VDMA IP cores, initializes and configures the output timing parameters of the Video Timing Controller IP core, and controls starting or stopping the display.
Detailed description of the preferred embodiment
Hardware simulation verification:
the purpose of simulation verification is to verify whether a hardware module in the valve can correctly realize the super-resolution algorithm designed by the invention and is completely consistent with the C code operation result. Therefore, in the current test vector, the pixel points in the 1K original image are mainly used as excitation to be transmitted to a hardware module, and whether the difference exists between the output pixel points and the operation result of the C code or not is checked.
In the control-signal part of the test vector, the clk signal provides the clock for the whole hardware and therefore requires cyclic assignment; the rst signal provides the reset, so it is 0 at initialization and is pulled high 21 ns later, reducing the impact on the operation results at system start-up. en provides the enable for the whole system and is likewise 0 at initialization and pulled high 21 ns later.
In the data-signal part of the test vector, the pixel values of the 1K image are first converted to hexadecimal form using MATLAB and stored in a txt file. The pixel values in the txt file are then loaded into pixel0_mem through pixel3_mem using the $readmemh function. After the en signal is pulled high, the values in the four memories are written in turn to the data input interface of the hardware; the specific process is shown in FIG. 18.
The size of the up-sampled image is 3840 × 2160, so comparing the whole hardware output image with the C code would be cumbersome. Therefore several lines of data are selected from the up-sampling result, and the data at the head and tail of each line are compared with the C code results. If the head and tail of a line match, the calculation of the whole line can be considered correct, and comparing data from different lines essentially shows that the calculation of the whole image is error-free.
For the test image, the up-sampled image generated by the C code is read with MATLAB and converted to hexadecimal format. Meanwhile, the original test image is written into the testbench, and post-implementation simulation is performed on the hardware. The comparisons of the C code results with the post-implementation simulation results at the head and end of the first line of the up-sampled test image are shown in FIG. 19 and FIG. 20. In FIG. 19, vld_out_all is the output-data valid signal. It can be seen that after vld_out_all is pulled high there are still 3 cycles of invalid data, and in the fourth cycle the first two data are also invalid; only after that does the correct up-sampling result appear. According to the results marked by the boxes in the figure, the post-implementation simulation data at the head of the first line are consistent with the super-resolution image obtained by the C code. In FIG. 20, the two data in the last cycle of the first row of output are still invalid; apart from that, the C code results are consistent with the hardware simulation results.
The comparisons of the C code results with the post-implementation simulation results at the head and end of the fifth line of the up-sampled test image are shown in FIG. 21 and FIG. 22. After removing invalid data of the same form as in FIG. 19 and FIG. 20, the C code results are essentially consistent with the hardware simulation results.
The simulation results therefore show that the hardware implemented by the invention realizes the up-sampling function and that its output is consistent with the results of the C code.
System verification:
in the system verification process, firstly, the 1K image which needs to be super-resolved is copied into the SD card, and then the SD card is inserted into the development board. After the development board is powered on, the program is downloaded into the development board, and the super-resolution hardware implementation is started. And the calculated pixel points of the 4K images are transmitted to the PS end from the PL end where the calculation module is located through the VDMA to carry out SD card writing and HDMI display. The displayed result is shown in fig. 23, the image displayed on the left display is the super-resolution 4K image, and one 4K image is displayed in four times due to the limitation of the display resolution.
To evaluate the algorithm design and hardware implementation more comprehensively, the performance evaluation is divided into two parts: image-quality evaluation and hardware evaluation. The image-quality evaluation assesses the quality of the super-resolved images using three metrics: PSNR, SSIM and the L2 distance obtained from the LPIPS model. The hardware evaluation covers resource usage, operating frequency and system latency.
Image quality evaluation:
the image quality evaluation indexes selected by the invention are as follows: l obtained by PSNR, SSIM and LPIPS models 2 distance. The PSNR (Peak Signal Noise Ratio) is a Peak Signal-to-Noise Ratio, and is used to estimate a mean square error between the original image and the processed image. SSIM (Structural Similarity) is a Structural Similarity, which is an index for measuring the Similarity between two images. The LPIPS (required Perceptial Image Patch Similarity) performs feature extraction through deep learning, so as to determine whether the two images are similar. The three indexes are widely applied to image similarity evaluation.
For a more intuitive evaluation of the super-resolved image quality, a number of 4K images are selected and compressed to 1K by averaging, the 1K images are then super-resolved, and the processed 4K images are compared with the originals to judge the quality of the images produced by the algorithm.
Comparing the up-sampling results of the C code implementation with the original 4K images in terms of PSNR, SSIM and L2 distance, the average performance achieved by the algorithm of the invention is 31.76, 0.867 and 0.268, respectively.
Hardware evaluation:
for the hardware evaluation process, synthesis and instantiation are carried out on the realized hardware system, so that the use condition of logic resources is obtained. Next, the master clock constraint is performed, and the highest frequency of the system operation is obtained when the time margin, the hold time margin, and the VDMA read/write bandwidth satisfy the requirements, and the results of the clock constraint and the time margin are shown in fig. 24 and fig. 25. Finally, the delay of the system for sampling on a single image can be obtained according to the system running frequency and the running period of the complete image.
After implementation, the resource usage of the complete system, including IPs such as VDMA and the Video Timing Controller, is as shown in FIG. 26.
Taking a single picture as an example, the total latency required for up-sampling can be derived from equation (8), where Latency_single is the up-sampling latency of a single picture, and N_process and f_max are, respectively, the number of cycles required for a single image and the maximum operating frequency of the system.
Latency_single = N_process / f_max    (8)
According to the design of the hardware modules, the number of cycles the hardware system needs to process a single image is:
N_process = N_row × Num_row + t_linebuffer = 964 × 2160 + 12 × 960 = 2,093,760    (9)
where N_row is the number of cycles required to up-sample one line, Num_row is the number of lines in the up-sampled image, and t_linebuffer is the time required for line buffering.
The total latency for up-sampling a single picture is therefore:
Latency_single = N_process / f_max = 2,093,760 / f_max ≈ 0.013 s
therefore, the total delay of the system is 0.013s, the theoretical frame rate is 76FPS, and the method can be applied to unmanned aerial vehicle high-speed image acquisition.
Comparing the present invention (Bicubic IP) with other mainstream hardware implementations gives Table 6:
Table 6. Comprehensive comparison of the invention with other mainstream hardware implementations
It is easy to see that, while achieving a higher clock frequency, the invention also has certain advantages in output image quality and in the occupation of hardware logic resources compared with other implementations.
The invention realizes the following functions: based on a Xilinx Zynq7020 development platform, a single low-resolution 1K image is processed by the super-resolution algorithm to obtain a high-resolution 4K image, the 4K image is displayed over HDMI, and the resulting 4K output is simultaneously written to an SD card.
In the invention, the convolution kernel parameter α = -1 is first selected during parameter selection, taking into account the error and the amount of computation that existing schemes may introduce. Secondly, since direct 4× up-sampling would insert different numbers of points between the reference points, causing extra resource consumption and precision loss, the invention applies a Padding operation to the image: through Padding, the points to be interpolated are uniformly distributed between the reference points, and their relative distances to the reference points take only four fixed, finite decimal values. Because there are only four distance parameters, the interpolation formula can be further simplified, which reduces the number of pipeline stages and therefore the accumulated error and the required data bit width.

In view of the increased area overhead, increased power consumption and reduced calculation rate caused by the floating-point arithmetic used in existing schemes, the invention performs all operations in fixed point; through the parameter selection, the image Padding and the formula simplification, fixed-point operations with a relatively narrow bit width yield interpolation results very close to those of floating-point operations. To address the excessive dependence of existing schemes on external storage, the invention is implemented with a pipelined parallel architecture: the Bicubic IP contains no additional internal memory, its input is supplied by the 4×1 longitudinal window generated by the Linebuffer, and it receives four pixel values per clock cycle. The invention further exploits data reuse inside the IP: since the distance parameters are fixed, the intermediate result that would otherwise be computed four times per clock cycle is computed only once, with a shift register generating the transverse window, thereby avoiding the overhead of repeated calculation. In addition, by reasonably arranging the connection order of the multipliers and adders, the invention reduces the number of pipeline stages and the required data bit width, achieving a balance between computation precision and resource consumption.
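To make the role of the kernel parameter and of the fixed fractional distances concrete, the following C sketch evaluates the standard bicubic convolution kernel at α = -1 for one fractional offset and quantizes the four resulting weights to a narrow fixed-point format. The 0.375 offset and the 10-bit fraction width are illustrative assumptions only, not values fixed by the invention.

```c
#include <math.h>
#include <stdio.h>

/* Standard bicubic convolution kernel W(d) with parameter a. */
static double bicubic_w(double d, double a)
{
    d = fabs(d);
    if (d <= 1.0)
        return (a + 2.0) * d * d * d - (a + 3.0) * d * d + 1.0;
    if (d < 2.0)
        return a * d * d * d - 5.0 * a * d * d + 8.0 * a * d - 4.0 * a;
    return 0.0;
}

int main(void)
{
    double a     = -1.0;    /* kernel parameter chosen in the description        */
    double frac  = 0.375;   /* ASSUMED fixed fractional offset (illustrative)    */
    int    qbits = 10;      /* ASSUMED fixed-point fraction width (illustrative) */

    /* Distances from the interpolation point to its four reference samples. */
    double d[4] = { 1.0 + frac, frac, 1.0 - frac, 2.0 - frac };

    for (int i = 0; i < 4; ++i) {
        double w = bicubic_w(d[i], a);
        long   q = lround(w * (double)(1 << qbits));    /* quantized weight */
        printf("w[%d] = %+.6f  ->  %+ld / %d\n", i, w, q, 1 << qbits);
    }
    return 0;
}
```

Because the fractional offsets are fixed, the four weights for each offset are constants, which is consistent with the pre-stored parameters used by the cal_q and cal_q_post stages described in the claims below.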

Claims (10)

1. A super-resolution system for high-speed image acquisition, comprising: a Padding module, a Linebuffer module, a Bicubic top layer calculation module and a shift register;
the Padding module is used for supplementing the original image and outputting the supplemented image to the Linebuffer module;
the Linebuffer module is used for performing line caching on the supplementary image and synchronously outputting four rows of data in sequence;
and the Bicubic top-layer calculation module is used for receiving four rows of data output by the Linebuffer module, calculating one-dimensional interpolation intermediate results in rows through a longitudinal window and temporarily storing the one-dimensional interpolation intermediate results in the shift register, and after the four rows of data are calculated, calculating one-dimensional interpolation of the one-dimensional interpolation intermediate results in rows through a transverse window to obtain interpolation point pixel values and outputting the interpolation point pixel values.
2. The super-resolution system for high-speed image acquisition according to claim 1, wherein the Bicubic top layer calculation module comprises three single-channel modules, namely an R channel module, a G channel module and a B channel module; each single-channel module comprises a Bicubic module;
the Bicubic top-layer calculation module is used for receiving four rows of data output by the Linebuffer module and respectively sending R, G and B pixel values of the four rows of data to the R channel module, the G channel module and the B channel module; the Bicubic module of each single-channel module calculates one-dimensional interpolation intermediate results in a row by a longitudinal window and temporarily stores the results in a shift register, after four rows of calculation are finished, one-dimensional interpolation of the one-dimensional interpolation intermediate results is calculated in a row by a transverse window, R, G and B pixel values of interpolation points are respectively obtained, and the R, G and B pixel values of the interpolation points are spliced to obtain the pixel values of the interpolation points and output.
3. The super-resolution system for high-speed image acquisition according to claim 2, wherein the Linebuffer module provides an enable signal for turning on the Bicubic module for operation.
4. The super-resolution system for high-speed image acquisition according to claim 2, wherein the Bicubic module comprises a cal_B module, a cal_q module, a shift_reg module, a cal_B_post module and a cal_q_post module;
the cal_B module is used for receiving four rows of data, sequentially performing multiplication, subtraction and addition operations to obtain four intermediate coefficients, and outputting the four intermediate coefficients to the cal_q module;
the cal_q module is used for receiving the four intermediate coefficients output by the cal_B module, obtaining a one-dimensional interpolation intermediate result sequentially through a multiplication operation and a two-stage addition tree, and outputting the one-dimensional interpolation intermediate result to the shift_reg module;
the shift_reg module is used for receiving the one-dimensional interpolation intermediate results output by the cal_q module and caching them through a shift register, so that the one-dimensional interpolation intermediate results of four points are output to the cal_B_post module in parallel;
the cal_B_post module is used for receiving the four-point one-dimensional interpolation intermediate results output by the shift_reg module, sequentially performing multiplication, subtraction and addition operations to obtain four transverse intermediate coefficients, and outputting the four transverse intermediate coefficients to the cal_q_post module;
and the cal_q_post module is used for receiving the four transverse intermediate coefficients output by the cal_B_post module and obtaining a two-dimensional interpolation result, namely the R, G or B pixel value of an interpolation point, sequentially through a multiplication operation and a two-stage addition tree.
5. The super-resolution system for high-speed image acquisition according to claim 4, wherein the cal_q module operates on the received four intermediate coefficients with pre-stored parameters: the first stage performs the multiplication operations, and the one-dimensional interpolation intermediate result is then obtained through the two-stage addition tree.
6. The super-resolution system for high-speed image acquisition according to claim 4, wherein the cal_q_post module operates on the received four transverse intermediate coefficients with pre-stored parameters: the first stage performs the multiplication operations, and a two-dimensional interpolation result is then obtained through the two-stage addition tree;
there are four cal_q_post modules, and the pre-stored parameters of each cal_q_post module are different; the four cal_q_post modules perform parallel calculation, and the two-dimensional interpolation results of four adjacent points are obtained in one clock cycle.
7. A super-resolution method for high-speed image acquisition is characterized by comprising the following steps:
S1, supplementing an original image to obtain a supplemented image;
S2, performing line caching on the supplemented image to obtain four rows of data;
and S3, performing one-dimensional interpolation on the four rows of data according to the rows through a longitudinal window to obtain a one-dimensional interpolation intermediate result, temporarily storing the one-dimensional interpolation intermediate result, and performing one-dimensional interpolation on the one-dimensional interpolation intermediate result according to the rows through a transverse window after four rows of calculation are completed to obtain an interpolation point pixel value.
8. The super-resolution method for high-speed image acquisition according to claim 7, wherein S1 specifically comprises: assigning the pixel values at the edge of the original image to the supplemented pixel points to obtain the supplemented image.
9. The super-resolution method for high-speed image acquisition according to claim 7, wherein in step S3, the interpolation is performed according to a rule of inserting interpolation points at fixed intervals.
10. The super-resolution method for high-speed image acquisition according to claim 7, wherein in S3, the convolution kernel used in the one-dimensional interpolation is as follows:
W(d) = |d|^3 - 2|d|^2 + 1,           for |d| ≤ 1
W(d) = -|d|^3 + 5|d|^2 - 8|d| + 4,   for 1 < |d| < 2
W(d) = 0,                            otherwise
where | d | is the distance from the interpolation point to the reference point.
CN202211651686.1A 2022-12-21 2022-12-21 Super-resolution system and method for high-speed image acquisition Pending CN115965528A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211651686.1A CN115965528A (en) 2022-12-21 2022-12-21 Super-resolution system and method for high-speed image acquisition

Publications (1)

Publication Number Publication Date
CN115965528A true CN115965528A (en) 2023-04-14

Family

ID=87357369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211651686.1A Pending CN115965528A (en) 2022-12-21 2022-12-21 Super-resolution system and method for high-speed image acquisition

Country Status (1)

Country Link
CN (1) CN115965528A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination