CN112581366B - Portable image super-resolution system and system construction method - Google Patents


Info

Publication number
CN112581366B
Authority
CN
China
Prior art keywords
dpu
resolution
neural network
image
reasoning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011376766.1A
Other languages
Chinese (zh)
Other versions
CN112581366A (en)
Inventor
刘明亮
王晓航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Heilongjiang University
Original Assignee
Heilongjiang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Heilongjiang University filed Critical Heilongjiang University
Priority to CN202011376766.1A priority Critical patent/CN112581366B/en
Publication of CN112581366A publication Critical patent/CN112581366A/en
Application granted granted Critical
Publication of CN112581366B publication Critical patent/CN112581366B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4053Super resolution, i.e. output image resolution higher than sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4046Scaling the whole image or part thereof using neural networks

Abstract

The invention discloses a portable image super-resolution system and a system construction method. The PL hardware layer is responsible for constructing a DPU IP core, which performs neural network inference and accelerates the inference process; the PS embedded Linux system layer is responsible for reading and storing pictures, scheduling DPU tasks, performing the sub-pixel convolution operation on the neural network output, and communicating with the upper computer; the client application layer faces the user, who through simple operations specifies the pictures to be super-resolved, obtains the output high-resolution images, and modifies parameters. The invention deploys an image super-resolution convolutional neural network on the ZYNQ embedded platform, successfully accelerates network inference, and obtains good output results.

Description

Portable image super-resolution system and system construction method
Technical Field
The invention belongs to the field of image processing, and in particular relates to a portable image super-resolution system and a system construction method.
Background
The image super-resolution (SR) problem is a classical problem in computer vision; its aim is to reconstruct a high-resolution (HR) image from a low-resolution (LR) image. Various SR methods have been widely used in industry, security and medicine, and also show great promise in social entertainment. The problem has therefore attracted the attention of many excellent scholars in computer vision, and many excellent image super-resolution algorithms have been proposed.
Early SR methods were mainly based on image interpolation, such as nearest-neighbor, bilinear and bicubic interpolation. Image interpolation algorithms generate a high-resolution image by inserting new pixels into the low-resolution image, where each new pixel is obtained as a weighted average of neighboring pixel values of the low-resolution image. Some more effective methods perform super-resolution using statistical image priors, which recover more image detail but also require more prior knowledge.
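As background only (not part of the claimed method), a minimal sketch of the classical interpolation-based upscaling mentioned above, using OpenCV; the file name and the 3x scale factor are illustrative assumptions.

```python
import cv2  # OpenCV (cv2) is assumed to be available

# Hypothetical input file and 3x scale factor, for illustration only.
lr = cv2.imread("input_lr.png")              # low-resolution image, HxWx3
h, w = lr.shape[:2]

nearest  = cv2.resize(lr, (w * 3, h * 3), interpolation=cv2.INTER_NEAREST)
bilinear = cv2.resize(lr, (w * 3, h * 3), interpolation=cv2.INTER_LINEAR)
bicubic  = cv2.resize(lr, (w * 3, h * 3), interpolation=cv2.INTER_CUBIC)

cv2.imwrite("bicubic_x3.png", bicubic)       # baseline pseudo-HR image
```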
In recent years, many machine-learning-based super-resolution methods have been proposed. SRCNN was the first to successfully introduce neural network technology into the super-resolution problem; it uses a lightweight network structure yet achieves higher output quality than the most advanced methods of the time. FSRCNN uses 1x1 convolutions to expand and shrink the number of feature maps and, unlike SRCNN, does not feed an interpolated pseudo-high-resolution image into the network; instead it enlarges the image with a deconvolution layer, greatly reducing the time cost of training and inference. ESPCN proposes sub-pixel convolution as a new up-sampling method, which further reduces the time cost of training and inference and produces higher-quality output.
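A minimal numpy sketch of the sub-pixel convolution (pixel shuffle) rearrangement that ESPCN introduced and that this system performs on the PS side; the tensor shapes, channel ordering and scale factor r are illustrative assumptions.

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange an (H, W, C*r*r) tensor into (H*r, W*r, C)."""
    h, w, crr = x.shape
    c = crr // (r * r)
    x = x.reshape(h, w, r, r, c)          # split the channel dim into (r, r, C) blocks
    x = x.transpose(0, 2, 1, 3, 4)        # interleave: (H, r, W, r, C)
    return x.reshape(h * r, w * r, c)

# Example: a 360x640 feature map with 3*3*3 = 27 channels becomes a 1080x1920x3 image.
lr_features = np.random.rand(360, 640, 27).astype(np.float32)
hr = pixel_shuffle(lr_features, r=3)
print(hr.shape)  # (1080, 1920, 3)
```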
Although existing machine-learning-based methods achieve excellent results, the following defects remain: (1) the parameter count and computation of traditional network models are huge, so the time and power they consume are far from ideal, which also limits their application on embedded platforms; (2) to obtain better output quality, traditional networks generally tend to increase network depth and width, which makes them difficult to train and deploy; (3) the network training stage requires high data precision and therefore uses the float32 or float64 data format, yet using high-precision data in the inference stage yields only a very weak improvement in output quality.
Disclosure of Invention
The invention provides a portable image super-resolution system and a system construction method, which realize the deployment of a convolutional neural network of image super-resolution on a ZYNQ embedded platform, successfully accelerate network reasoning and obtain a good output result.
The invention is realized by the following technical scheme:
a portable image super-resolution system comprises a PL hardware layer, a PS embedded Linux system layer and a client application layer, wherein the PL hardware layer is responsible for constructing a DPU IP core, is used for reasoning of a neural network and accelerates the reasoning process; the PS embedded Linux system layer is responsible for reading and storing pictures, scheduling tasks of the DPUs, performing sub-pixel convolution operation on neural network output and communicating with an upper computer; the client application layer faces a user, and the user simply operates the pictures needing super resolution, acquires the output high-resolution images and modifies parameters.
A system construction method of a portable image super-resolution system comprises a PL (personal information Unit) end logic construction step, an embedded Linux customization and transplantation step, a convolutional neural network model design training and deployment step, a lower computer control construction step and a DPU network accelerated reasoning construction step.
Further, the step of constructing the PL end logic is specifically,
step S2.1: integration and connection of the DPU IP core; the DPU IP core adopts the low-RAM-usage mode, the DSP slice usage is set to high, and depthwise convolution is not used; the working frequency of the DSP slices must be fixed at twice the working frequency of the DPU;
step S2.2: configuring the ZYNQ core according to the development board schematic diagram.
Further, the step of customizing and transplanting the embedded Linux specifically comprises
step S3.1: customizing the U-Boot of the embedded Linux, i.e. directly using the SD card boot mode;
step S3.2: customizing the Rootfs of the embedded Linux on the basis of step S3.1; the boot path specified in the previous step is made to contain the files required for system startup and operation, and the root file system of the installation environment is configured accordingly;
step S3.3: customizing the kernel of the embedded Linux on the basis of step S3.2; modifying the device tree file of the system;
step S3.4: compiling and transplanting the embedded Linux on the basis of step S3.3; the build function of Petalinux is run to compile; after compilation a U-Boot file is generated; the generated U-Boot file and the .ub file are copied to the FAT32 partition of the SD card, and the rootfs archive is decompressed to the EXT4 partition of the SD card.
Further, the step of designing, training and deploying the convolutional neural network model is specifically,
step S4.1: constructing the convolutional neural network model using two techniques, 5 residual blocks and sub-pixel convolution;
step S4.2: training the convolutional neural network model of step S4.1; training uses 100 pictures of 1920x1080x3, comprising 50 landscape photographs and 50 paintings, with training data and validation data split 9:1; the pictures are downscaled to 640x360x3 to obtain the training inputs of the convolutional neural network model, these inputs are fed into the model constructed in step S4.1, the original 1920x1080x3 images are used as the labels of the network, and the finally trained convolutional neural network model is saved;
step S4.3: deploying the convolutional neural network model trained in step S4.2 to realize the image super-resolution function.
Further, the control of the lower computer is constructed specifically as follows: the addresses of all pictures in the specified directory are acquired and the task amount is reported; two processes then handle the tasks according to the task amount;
process one reads its picture, then performs the DPU inference of process one and obtains the result, applies the sub-pixel convolution to the result of process one to form the super-resolved image of process one, stores the super-resolved image of process one and reports the time consumed by the task;
process two reads its picture, then performs the DPU inference of process two and obtains the result, applies the sub-pixel convolution to the result of process two to form the super-resolved image of process two, stores the super-resolved image of process two and reports the time consumed by the task.
Further, the step of constructing DPU network accelerated inference specifically includes the following steps:
step S6.1: opening the DPU device;
step S6.2: loading the network model with the DPU device;
step S6.3: reading the image address list;
step S6.4: feeding the low-resolution image of step S6.3 into the DPU;
step S6.5: starting the DPU task;
step S6.6: acquiring the output tensor of step S6.5;
step S6.7: processing the output tensor of step S6.6 with the sub-pixel convolution at the PS end;
step S6.8: judging whether the image address list has reached its upper limit; if so, performing step S6.9, and if not, performing step S6.3;
step S6.9: end.
The invention has the beneficial effects that:
Under the condition of maintaining the quality of the output pictures, fast image super-resolution processing is achieved through DPU-accelerated network inference; meanwhile, deploying the algorithm on an embedded platform greatly reduces power consumption compared with deploying it on a PC platform, and gives the system portability.
Drawings
FIG. 1 is a diagram of the program architecture of the system of the present invention.
Fig. 2 is a functional distribution diagram of the system of the present invention.
Fig. 3 is an architecture diagram of the hardware system of the present invention.
FIG. 4 is a schematic diagram of xc7z020 according to the present invention.
FIG. 5 is a schematic diagram of the JTAG download circuit of the present invention.
Fig. 6 is a schematic diagram of a USB interface circuit of the present invention.
FIG. 7 is a schematic diagram of a UART interface circuit according to the present invention.
Fig. 8 is a schematic diagram of the QSPI FLASH circuit of the present invention.
Fig. 9 is a schematic diagram of a gigabit port circuit of the present invention.
FIG. 10 is a schematic diagram of the SD card circuit of the present invention.
Fig. 11 is a schematic diagram of a power supply circuit of the present invention, in which, (a) a 0V power supply circuit, (b) an 8V power supply circuit, (c) a 5V power supply circuit, and (d) a 3V power supply circuit.
FIG. 12 is a diagram illustrating the configuration of the DPU IP core parameters according to the present invention.
FIG. 13 is a schematic diagram of the DPU IP core operating frequency configuration of the present invention.
FIG. 14 is a PL-terminal global wiring diagram of the present invention.
FIG. 15 is a diagram of the DPU bus address assignment of the present invention.
FIG. 16 is a schematic diagram of a ZYNQ core configuration of the present invention.
Fig. 17 is a schematic diagram of the configuration of the Petalinux start-up mode according to the present invention.
FIG. 18 is a schematic diagram of disabling the automatic generation of boot arguments according to the present invention.
FIG. 19 is a schematic diagram of the Linux boot path and CMA space being manually configured according to the present invention.
FIG. 20 is a schematic diagram of an OpenCV environment cross-compilation configuration in accordance with the present invention.
FIG. 21 is a schematic diagram of a Python environment cross-compilation configuration of the present invention.
FIG. 22 is a schematic diagram of a device tree file configuration according to the present invention.
Fig. 23 is a schematic diagram of the residual block structure of the present invention.
FIG. 24 is a schematic diagram of the convolutional neural network structure of the present invention.
FIG. 25 is a flow chart of a control method of the present invention.
FIG. 26 is a flow chart of the DPU accelerated neural network inference method of the present invention.
Fig. 27 is a natural image comparison of the present invention, in which (a) is the original image and (b) is the super-resolution result.
Fig. 28 is an artificial drawing comparison of the present invention, in which (a) is the original drawing and (b) is the super-resolution result.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
A portable image super-resolution system comprises a PL hardware layer, a PS embedded Linux system layer and a client application layer. The PL hardware layer is responsible for constructing a DPU IP core, which performs neural network inference and accelerates the inference process; the PS embedded Linux system layer is responsible for reading and storing pictures, scheduling DPU tasks, performing the sub-pixel convolution operation on the neural network output, and communicating with the upper computer; the client application layer faces the user, who through simple operations specifies the pictures to be super-resolved, obtains the output high-resolution images, and modifies parameters.
A system construction method of a portable image super-resolution system comprises a PL (programmable logic) end logic construction step, an embedded Linux customization and transplantation step, a convolutional neural network model design, training and deployment step, a lower computer control construction step, and a DPU network accelerated inference construction step.
Further, the step of constructing the PL end logic is specifically,
step S2.1: integration and connection of the DPU IP core; the DPU IP core adopts the low-RAM-usage mode, the DSP slice usage is set to high, and depthwise convolution is not used; the working frequency of the DSP slices must be fixed at twice the working frequency of the DPU;
step S2.2: configuring the ZYNQ core according to the development board schematic diagram.
Further, the step of customizing and transplanting the embedded Linux specifically comprises
step S3.1: customizing the U-Boot of the embedded Linux, i.e. directly using the SD card boot mode;
step S3.2: customizing the Rootfs of the embedded Linux on the basis of step S3.1; the boot path specified in the previous step is made to contain the files required for system startup and operation, and the root file system of the installation environment is configured accordingly;
step S3.3: customizing the kernel of the embedded Linux on the basis of step S3.2; modifying the device tree file of the system;
step S3.4: compiling and transplanting the embedded Linux on the basis of step S3.3; the build function of Petalinux is run to compile; after compilation a U-Boot file is generated; the generated U-Boot file and the .ub file are copied to the FAT32 partition of the SD card, and the rootfs archive is decompressed to the EXT4 partition of the SD card.
Further, the step of designing, training and deploying the convolutional neural network model is specifically,
step S4.1: constructing the convolutional neural network model using two techniques, 5 residual blocks and sub-pixel convolution (a model sketch under illustrative assumptions is given after these steps);
step S4.2: training the convolutional neural network model of step S4.1; training uses 100 pictures of 1920x1080x3, comprising 50 landscape photographs and 50 paintings, with training data and validation data split 9:1; the pictures are downscaled to 640x360x3 to obtain the training inputs of the convolutional neural network model, these inputs are fed into the model constructed in step S4.1, the original 1920x1080x3 images are used as the labels of the network, and the finally trained convolutional neural network model is saved;
step S4.3: deploying the convolutional neural network model trained in step S4.2 to realize the image super-resolution function.
Further, step S4.3 specifically includes performing model freezing, model quantization and model compilation on the convolutional neural network model trained in step S4.2.
Furthermore, model freezing specifically means that the definition of the model computation graph and the model weights are merged into the same file, which facilitates deployment of the model.
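A minimal TensorFlow 1.x-style sketch of such a freezing step, merging the graph definition and the trained weights into one .pb file; the checkpoint path and output node name are illustrative assumptions.

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

CKPT = "./train/sr_model.ckpt"        # assumed checkpoint produced in step S4.2
OUTPUT_NODE = "conv_out"              # assumed name of the network output node

with tf.Session() as sess:
    saver = tf.train.import_meta_graph(CKPT + ".meta")
    saver.restore(sess, CKPT)
    # Replace variables with constants so graph definition and weights live in one file.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, [OUTPUT_NODE])
    with tf.gfile.GFile("frozen_sr_model.pb", "wb") as f:
        f.write(frozen.SerializeToString())
```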
Further, model quantization means that the convolutional neural network model with float32 parameters undergoes an int8 quantization operation; the quantized model is calibrated with a quantization data set (so that the adverse effect of the precision loss caused by quantization is reduced to a minimum), and the model can be conveniently quantized by calling, through a script, the decent quantization tool provided with the DNNDK toolkit.
Further, model compilation means that the quantized model is compiled into a model that the DPU can read. The DNNC tool provided with the DNNDK toolkit can be called through a script to compile the model into an ELF file, which is then packaged into a .o file to facilitate calling it from the DNNDK Python API.
The following is an analysis of results rather than a method step. PSNR (peak signal-to-noise ratio) and SSIM (structural similarity) are the evaluation indexes used to assess the quality of the pictures output by the neural network: the higher they are, the better the output quality; and the smaller the parameter count and training time, the better.
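A minimal numpy sketch of the PSNR computation used as an evaluation index (SSIM is more involved and is typically taken from an image-processing library such as scikit-image); the 8-bit data range is an assumption.

```python
import numpy as np

def psnr(reference: np.ndarray, output: np.ndarray, data_range: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference image and a network output."""
    mse = np.mean((reference.astype(np.float64) - output.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((data_range ** 2) / mse)

# Example: two 8-bit images (in practice, the original HR image and the SR result).
a = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
b = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
print(f"PSNR = {psnr(a, b):.2f} dB")
```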
Due to the use of the sub-pixel convolution technique, the images flowing through the network are low-resolution images; the final effects are that the speed of the network inference stage is greatly improved, the memory consumption of the network inference stage is reduced, and the final network output is improved.
If the task is 4x super-resolution, the low-resolution image size is 100 x 100 and the original high-resolution image size is 400 x 400. If sub-pixel convolution is not used, the network input is a picture pre-magnified by the bicubic interpolation algorithm, i.e. 400 x 400; if sub-pixel convolution is used, the input does not need to be pre-magnified, i.e. it is 100 x 100. This results in the following (an arithmetic sketch follows this list):
The speed of the network inference stage is greatly improved.
By calculation, compared with not using sub-pixel convolution and instead using an ordinary convolution outputting 3 channels, the computation amount is reduced by about 160 times, so the inference time decreases proportionally.
The network inference stage consumes less memory.
By calculation, compared with not using sub-pixel convolution and instead using an ordinary convolution outputting 3 channels, the memory consumption is reduced by about 16 times.
The final network output is improved.
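A minimal arithmetic sketch of the memory point above, using the 4x example: keeping the feature maps at 100 x 100 instead of 400 x 400 shrinks every intermediate activation by the square of the scale factor. The channel count of 64 is an illustrative assumption.

```python
# Per-layer activation size for one feature-map stack (float32, 64 channels assumed).
scale = 4
channels = 64
bytes_per_value = 4  # float32

lr_activation = 100 * 100 * channels * bytes_per_value                       # with sub-pixel convolution
hr_activation = (100 * scale) * (100 * scale) * channels * bytes_per_value   # pre-magnified input

print(lr_activation, hr_activation, hr_activation / lr_activation)
# 2560000 40960000 16.0 -> each intermediate layer is ~16x smaller in memory
```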
Example 2
The experimental conditions are as follows: network A (sub-pixel convolution) and network B (ordinary convolution), using the same training set, trained for the same number of iterations, and evaluated on the same test set, give the following results:
            PSNR     SSIM
Network A   32.10    0.8958
Network B   31.23    0.8901
Compared with network B, network A improves the PSNR (peak signal-to-noise ratio) by 2.79% and the SSIM (structural similarity) by 0.64%, so the final inference result of the network is improved.
Due to the use of the residual learning technique, the vanishing-gradient phenomenon of deep networks in the training stage is alleviated; the final effect is that the convergence speed of the network in the training stage is greatly improved.
example 3
The experimental conditions are as follows: network A (using residual learning) and network B (not using residual learning); other conditions are the same.
the results of the experiment are shown in FIG. 3
As is apparent from fig. 3, the network a using the residual learning technique (represented by a green line) has a significantly improved network convergence rate compared to the network B not using residual learning (represented by an orange line), which means that in the model training phase, the network a using the residual learning technique can obtain a better result with a smaller number of training iterations, and the time and labor cost consumed by training the network are reduced.
Due to the lightweight design of the network structure, the network model is smaller and more compact; the final effects are that the network inference speed is increased and memory consumption is reduced, at the cost of a slight degradation in output quality.
The experimental environment is as follows: network A uses 5 residual blocks with 64 convolution kernels per convolutional layer; network B uses 10 residual blocks with 128 convolution kernels per convolutional layer; other conditions are the same.
            PSNR     SSIM     Parameters   Training time
Network A   32.10    0.8958   101179       2657 s
Network B   32.81    0.9039   755931       8684 s
As can be seen from the above table, although the 10-residual-block design improves the quality of the network output, the improvement comes with a large increase in parameters: at the cost of reducing PSNR and SSIM by only 2.16% and 0.89% respectively, the parameter count and training time are reduced by 86.61% and 69.40% respectively. A large parameter count and long training time are very disadvantageous for deploying the network model on an embedded platform, since they substantially increase model inference time and memory consumption, and the higher demands on the hardware platform raise its cost. Multiple tests show that the structure given in the invention strikes a good balance between performance and quality.
Example 4
Due to the use of the model quantization technique, which converts the network weights from the Float32 type to the Int8 type, the final effects are that the speed of the network inference stage is greatly improved, the memory consumption of the network inference stage is reduced, and the network inference output indexes show no obvious decrease.
The speed of the network reasoning phase is greatly improved.
The data type before quantization is Float32 (32-bit floating point), whose computation consumes a large amount of DSP slice resources in the FPGA; because these resources are very limited, a batch of data has to be divided into many sub-batches and computed separately. Computation with Int8 (8-bit integer) consumes far fewer resources, so a batch of data only needs to be divided into a few sub-batches, and the computation time is greatly reduced.
The network reasoning phase consumes less memory.
Since the weights are changed to Int8, the feature maps computed in the network are also of type Int8; compared with the Float32 type, using the Int8 type directly reduces memory consumption by a factor of 4.
The decrease in the network inference output indexes is not obvious.
The experimental conditions are as follows: network A uses the quantization technique and network B does not.
            PSNR     SSIM
Network A   32.10    0.8958
Network B   32.06    0.8957
From the data, the evaluation indexes of the quantized network's output images decrease slightly, but the amplitude is very small; the quality reduction caused by this decrease can hardly be observed by the human eye, while the inference speed is greatly improved and memory consumption is greatly reduced. Quantization is therefore very important for deploying the neural network model on a portable platform with low computing power and low power consumption.
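A minimal numpy sketch of the symmetric int8 fake-quantization idea behind this example (quantize weights to int8 with a per-tensor scale, then dequantize and measure the error); the actual DNNDK decent tool uses its own calibration procedure, so this is an illustration, not that tool's algorithm.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q, with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Illustrative float32 weight tensor (e.g. a 3x3x64x64 convolution kernel).
w = np.random.randn(3, 3, 64, 64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale

print("memory ratio:", w.nbytes / q.nbytes)              # 4.0 -> Float32 vs Int8
print("max abs error:", float(np.abs(w - w_hat).max()))  # small compared with |w|
```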
Further, the control of the lower computer is constructed specifically as follows: the addresses of all pictures in the specified directory are acquired and the task amount is reported; two processes then handle the tasks simultaneously according to the task amount (a scheduling sketch is given after this description);
process one reads its picture, then performs the DPU inference of process one and obtains the result, applies the sub-pixel convolution to the result of process one to form the super-resolved image of process one, stores the super-resolved image of process one and reports the time consumed by the task;
process two reads its picture, then performs the DPU inference of process two and obtains the result, applies the sub-pixel convolution to the result of process two to form the super-resolved image of process two, stores the super-resolved image of process two and reports the time consumed by the task.
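A minimal sketch of the two-process scheduling just described, using Python's multiprocessing; the directory names and the process_one_image helper are illustrative assumptions standing in for the DPU inference and the PS-side sub-pixel convolution.

```python
import os
import time
from multiprocessing import Process

IMAGE_DIR = "./images"        # assumed input directory
RESULT_DIR = "./result"       # assumed output directory

def process_one_image(path: str) -> None:
    """Placeholder for: read picture -> DPU inference -> sub-pixel convolution -> save."""
    # The real implementation would call the DPU runtime here (see the inference sketch below).
    pass

def worker(name: str, paths: list) -> None:
    for p in paths:
        start = time.time()
        process_one_image(p)
        print(f"{name}: {os.path.basename(p)} done in {time.time() - start:.2f}s")

if __name__ == "__main__":
    paths = sorted(os.path.join(IMAGE_DIR, f) for f in os.listdir(IMAGE_DIR))
    print(f"task amount: {len(paths)} pictures")
    # Split the task list between two processes that run simultaneously.
    p1 = Process(target=worker, args=("process one", paths[0::2]))
    p2 = Process(target=worker, args=("process two", paths[1::2]))
    p1.start(); p2.start()
    p1.join(); p2.join()
```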
Further, the step of constructing DPU network accelerated inference specifically includes the following steps (a code sketch of this flow is given after these steps):
step S6.1: opening the DPU device;
step S6.2: loading the network model with the DPU device;
step S6.3: reading the image address list;
step S6.4: feeding the low-resolution image of step S6.3 into the DPU;
step S6.5: starting the DPU task;
step S6.6: acquiring the output tensor of step S6.5;
step S6.7: processing the output tensor of step S6.6 with the sub-pixel convolution at the PS end;
step S6.8: judging whether the image address list has reached its upper limit; if so, performing step S6.9, and if not, performing step S6.3;
step S6.9: end.
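A minimal sketch of the S6.1-S6.9 loop written against the DNNDK Python bindings (module n2cube). The import path, function names (dpuOpen, dpuLoadKernel, etc.), the kernel name sr_kernel, the node names conv_in/conv_out and the output size are assumptions for illustration and must be checked against the DNNDK version actually installed on the board; pixel_shuffle is the function from the earlier sub-pixel convolution sketch.

```python
import cv2
import numpy as np
from dnndk import n2cube   # assumed DNNDK Python binding; import path may differ by version

KERNEL = "sr_kernel"                         # assumed kernel name produced by the DNNDK compiler
IN_NODE, OUT_NODE = "conv_in", "conv_out"    # assumed boundary node names

def run_super_resolution(image_paths, scale=3):
    n2cube.dpuOpen()                                   # S6.1: open the DPU device
    kernel = n2cube.dpuLoadKernel(KERNEL)              # S6.2: load the network model
    task = n2cube.dpuCreateTask(kernel, 0)
    for path in image_paths:                           # S6.3: walk the image address list
        lr = cv2.imread(path).astype(np.float32)       # S6.4: feed the LR image to the DPU
        n2cube.dpuSetInputTensorInHWCFP32(task, IN_NODE, lr, lr.size)
        n2cube.dpuRunTask(task)                        # S6.5: start the DPU task
        out_size = lr.shape[0] * lr.shape[1] * 3 * scale * scale
        out = n2cube.dpuGetOutputTensorInHWCFP32(task, OUT_NODE, out_size)  # S6.6
        out = np.array(out).reshape(lr.shape[0], lr.shape[1], 3 * scale * scale)
        hr = pixel_shuffle(out, scale)                 # S6.7: sub-pixel convolution on the PS
        cv2.imwrite(path.replace("images", "result"), hr.clip(0, 255).astype(np.uint8))
    n2cube.dpuDestroyTask(task)                        # S6.8/S6.9: list exhausted, clean up
    n2cube.dpuDestroyKernel(kernel)
    n2cube.dpuClose()
```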
Example 5
The test is carried out in debug mode, which makes it convenient for the program to output information and to verify it. The design and a computer are connected to the same router, and the upper computer logs into the embedded Linux of the development board via SSH to execute the program.
From the output, the program's running process and the network inference time can be observed. After the program runs, the super-resolution images output by the program can be found in the result directory. Comparing (a) and (b) of fig. 27 and (a) and (b) of fig. 28 shows that the super-resolution tasks for both the natural image and the artificial drawing are completed well: the resolution of the images is successfully enlarged by the specified magnification, and the quality of the output images is greatly improved compared with the input images.
The test task is super-resolving a single 640x360x3 image to 1920x1080x3. The network inference speed comparison is shown in Table 5-1 below.
TABLE 5-1 network inference speed comparison
In terms of time, as can be seen from Table 5-1, the network inference time is about 2.2 s, the time for a single process to handle one picture is about 44.8 s, and thanks to the two-process design the effective processing time per picture is about 22.43 s. The power consumption of the GPU and of the CPU were both measured with the AIDA64 software and do not include the power consumption of other parts of those systems; the power consumption of this design is estimated at 3.7 W in Vivado, and the maximum power of the platform's power supply is 10 W, so even including the other parts of the platform the total power consumption does not exceed 10 W.

Claims (4)

1. A system construction method of a portable image super-resolution system, characterized in that the system comprises a PL hardware layer, a PS embedded Linux system layer and a client application layer, wherein the PL hardware layer is responsible for constructing a DPU IP core, which performs neural network inference and accelerates the inference process; the PS embedded Linux system layer is responsible for reading and storing pictures, scheduling DPU tasks, performing the sub-pixel convolution operation on the neural network output and communicating with the upper computer; the client application layer faces the user, who through simple operations specifies the pictures to be super-resolved, obtains the output high-resolution images and modifies parameters;
the system construction method comprises the steps of constructing the PL end logic, customizing and transplanting the embedded Linux, designing, training and deploying a convolutional neural network model, constructing the control of the lower computer, and constructing DPU network accelerated inference;
the step of designing, training and deploying the convolutional neural network model is specifically:
step S4.1: constructing the convolutional neural network model using two techniques, 5 residual blocks and sub-pixel convolution;
step S4.2: training the convolutional neural network model of step S4.1; training uses 100 pictures of 1920x1080x3, comprising 50 landscape photographs and 50 paintings, with training data and validation data split 9:1; the pictures are downscaled to 640x360x3 to obtain the training inputs of the convolutional neural network model, these inputs are fed into the model constructed in step S4.1, the original 1920x1080x3 images are used as the labels of the network, and the finally trained convolutional neural network model is saved;
step S4.3: deploying the convolutional neural network model trained in step S4.2 to realize the image super-resolution function;
the step of constructing DPU network accelerated inference specifically comprises the following steps:
step S6.1: opening the DPU device;
step S6.2: loading the network model with the DPU device;
step S6.3: reading the image address list;
step S6.4: sending the low-resolution image of step S6.3 into the DPU;
step S6.5: starting the DPU task;
step S6.6: acquiring the output tensor of step S6.5;
step S6.7: processing the output tensor of step S6.6 with the sub-pixel convolution at the PS end;
step S6.8: judging whether the image address list has reached its upper limit; if so, performing step S6.9, and if not, performing step S6.3;
step S6.9: end.
2. The system construction method according to claim 1, wherein the step of constructing the PL end logic is specifically:
step S2.1: integration and connection of the DPU IP core; the DPU IP core adopts the low-RAM-usage mode, the DSP slice usage is set to high, and depthwise convolution is not used; the working frequency of the DSP slices must be fixed at twice the working frequency of the DPU;
step S2.2: configuring the ZYNQ core according to the development board schematic diagram.
3. The system construction method according to claim 1, wherein the step of customizing and transplanting the embedded Linux is specifically:
step S3.1: customizing the U-Boot of the embedded Linux, i.e. directly using the SD card boot mode;
step S3.2: customizing the Rootfs of the embedded Linux on the basis of step S3.1; the boot path specified in the previous step is made to contain the files required for system startup and operation, and the root file system of the installation environment is configured accordingly;
step S3.3: customizing the kernel of the embedded Linux on the basis of step S3.2; modifying the device tree file of the system;
step S3.4: compiling and transplanting the embedded Linux on the basis of step S3.3; the build function of Petalinux is run to compile; after compilation a U-Boot file is generated; the generated U-Boot file and the .ub file are copied to the FAT32 partition of the SD card, and the rootfs archive is decompressed to the EXT4 partition of the SD card.
4. The system construction method according to claim 1, wherein the control of the lower computer is constructed specifically as follows: the addresses of all pictures in the specified directory are acquired and the task amount is reported; two processes then handle the tasks simultaneously according to the task amount;
process one reads its picture, then performs the DPU inference of process one and obtains the result, applies the sub-pixel convolution to the result of process one to form the super-resolved image of process one, stores the super-resolved image of process one and reports the time consumed by the task;
process two reads its picture, then performs the DPU inference of process two and obtains the result, applies the sub-pixel convolution to the result of process two to form the super-resolved image of process two, stores the super-resolved image of process two and reports the time consumed by the task.
CN202011376766.1A 2020-11-30 2020-11-30 Portable image super-resolution system and system construction method Active CN112581366B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011376766.1A CN112581366B (en) 2020-11-30 2020-11-30 Portable image super-resolution system and system construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011376766.1A CN112581366B (en) 2020-11-30 2020-11-30 Portable image super-resolution system and system construction method

Publications (2)

Publication Number Publication Date
CN112581366A CN112581366A (en) 2021-03-30
CN112581366B true CN112581366B (en) 2022-05-20

Family

ID=75128080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011376766.1A Active CN112581366B (en) 2020-11-30 2020-11-30 Portable image super-resolution system and system construction method

Country Status (1)

Country Link
CN (1) CN112581366B (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2823409A4 (en) * 2012-03-04 2015-12-02 Adam Jeffries Data systems processing
CN105205782B (en) * 2015-09-06 2019-08-16 京东方科技集团股份有限公司 Supersolution is as method and system, server, user equipment and its method
CN107679621B (en) * 2017-04-19 2020-12-08 赛灵思公司 Artificial neural network processing device
US10885607B2 (en) * 2017-06-01 2021-01-05 Qualcomm Incorporated Storage for foveated rendering
US10951875B2 (en) * 2018-07-03 2021-03-16 Raxium, Inc. Display processing circuitry
WO2020183059A1 (en) * 2019-03-14 2020-09-17 Nokia Technologies Oy An apparatus, a method and a computer program for training a neural network
CN111242314B (en) * 2020-01-08 2023-03-21 中国信息通信研究院 Deep learning accelerator benchmark test method and device
CN111325327B (en) * 2020-03-06 2022-03-08 四川九洲电器集团有限责任公司 Universal convolution neural network operation architecture based on embedded platform and use method
CN111754403B (en) * 2020-06-15 2022-08-12 南京邮电大学 Image super-resolution reconstruction method based on residual learning
CN111787321A (en) * 2020-07-06 2020-10-16 济南浪潮高新科技投资发展有限公司 Image compression and decompression method and system for edge end based on deep learning

Also Published As

Publication number Publication date
CN112581366A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
US7765500B2 (en) Automated generation of theoretical performance analysis based upon workload and design configuration
US8351654B2 (en) Image processing using geodesic forests
CN110097609B (en) Sample domain-based refined embroidery texture migration method
CN109544662B (en) Method and system for coloring cartoon style draft based on SRUnet
CN108711182A (en) Render processing method, device and mobile terminal device
CN115409755B (en) Map processing method and device, storage medium and electronic equipment
CN111813686B (en) Game testing method and device, testing terminal and storage medium
JP2005518032A (en) Spatial optimization texture map
US11189060B2 (en) Generating procedural materials from digital images
US10922852B2 (en) Oil painting stroke simulation using neural network
WO2022262660A1 (en) Pruning and quantization compression method and system for super-resolution network, and medium
KR20200132682A (en) Image optimization method, apparatus, device and storage medium
CN113781308A (en) Image super-resolution reconstruction method and device, storage medium and electronic equipment
CN111369430A (en) Mobile terminal portrait intelligent background replacement method based on mobile deep learning engine
US20230033319A1 (en) Method, apparatus and device for processing shadow texture, computer-readable storage medium, and program product
CN112364744A (en) TensorRT-based accelerated deep learning image recognition method, device and medium
CN112581366B (en) Portable image super-resolution system and system construction method
CN109993701A (en) A method of the depth map super-resolution rebuilding based on pyramid structure
Silva et al. Efficient algorithm for convolutional dictionary learning via accelerated proximal gradient consensus
Mlakar et al. Subdivision‐specialized linear algebra kernels for static and dynamic mesh connectivity on the gpu
Liu et al. A fast and accurate super-resolution network using progressive residual learning
Vecchio et al. Matfuse: Controllable material generation with diffusion models
CN111274145A (en) Relationship structure chart generation method and device, computer equipment and storage medium
KR20220130498A (en) Method and apparatus for image outpainting based on deep-neural network
JP7208314B1 (en) LEARNING DEVICE, LEARNING METHOD AND LEARNING PROGRAM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant