WO2022242029A1

WO2022242029A1 - Generation method, system and apparatus capable of visual resolution enhancement, and storage medium

Info

Publication number: WO2022242029A1
Application number: PCT/CN2021/126019
Authority: WO
Inventors: 金龙存; 卢盛林
Original assignee: 广东奥普特科技股份有限公司
Priority date: 2021-05-18
Filing date: 2021-10-25
Publication date: 2022-11-24
Also published as: CN113139907B; CN113139907A

Abstract

A generation method, system and apparatus capable of visual resolution enhancement, and a storage medium. An acquired low-resolution single image to be processed is subjected to resolution processing by means of a single-image training model established by using training samples and a preset loss function, which training samples contain a high-resolution single image sample, a low-resolution single image sample and corresponding image description information samples thereof, such that the effect of restoring the low-resolution single image to a high-resolution single image can be accurately and efficiently realized. A single image having a higher resolution can be acquired on the basis of specific image description information prior, and high-frequency information of the single image can be restored, such that an output high-resolution single image contains more texture structure details, and the definition of the single image is thus improved. The present invention can be widely applied to the technical field of image processing.

Description

Generation method, system, device and storage medium for visual resolution enhancement

This application claims the priority of the Chinese patent application submitted to the China Patent Office on May 18, 2021, with the application number 202110541939.9, and the title of the invention is "Method, system, device and storage medium for visual resolution enhancement", the entire content of which Incorporated in this application by reference.

technical field

The invention belongs to the technical field of image processing, and in particular relates to a generation method, system, device and storage medium of visual resolution enhancement.

Background technique

In recent years, due to the limitations of volume, weight and cost of digital image acquisition equipment, the resolution of the collected images is low, which greatly reduces the clarity of the images. At the same time, people's demand for high-definition images is increasing, and how to improve image and video quality has become an increasingly important issue. Image super-resolution aims to repair low-resolution images, so that the image contains more detailed information and improves the clarity of the image. This technology has important practical significance. For example, in the field of security monitoring, surveillance video acquisition equipment acquires video frames that lack effective information due to cost constraints, and security monitoring relies heavily on high-resolution images with clear information. Using image super-resolution technology, it can increase the details of video frames. The supplement of these information can provide effective evidence for fighting crime. At present, using image super-resolution as a pre-processing technology can effectively improve the accuracy of tasks such as target detection, face recognition, and abnormal warning in the security field.

Previous methods used for image super-resolution are interpolation-based or reconstruction-based methods. Image super-resolution based on interpolation is the first algorithm applied in the field of super-resolution. This type of algorithm is based on a fixed polynomial calculation mode, and the pixel value at the interpolation position is calculated from the existing pixel value, such as bilinear interpolation, Bicubic interpolation and Lanczos scaling. Reconstruction-based methods use strict prior knowledge as a constraint to find a suitable reconstruction function in the constrained space, thereby reconstructing high-resolution images with detailed information. These algorithms usually fall into the problem of the image being too smooth and cannot recover the texture details of the image well.

In recent years, with the development of deep learning and convolutional neural networks, image super-resolution technology has made great breakthroughs. The convolutional neural network uses external data sets to learn the mapping model between low-resolution images and high-resolution images, and uses the learned mapping model to reconstruct high-resolution images from low-resolution images. When the input low-resolution image lacks effective information, it is difficult for the neural network to learn the mapping relationship comprehensively. Using this mapping model with incomplete learning will lead to serious blurring of the reconstructed image, and it is difficult to obtain the content information in the image.

Contents of the invention

In order to solve the problem of lack of effective information in low-resolution images in image super-resolution tasks, the present invention provides a more accurate and efficient method, system, device and storage medium for visual resolution enhancement, which can be combined with image description information , to build a deeper network, so that it can acquire a single image with high definition a priori based on specific image description information.

The present invention adopts following technical scheme:

In the first aspect, a method for generating visual resolution enhancement is provided, including:

Obtain the low-resolution single image to be processed and its corresponding image description information;

Processing the low-resolution single image and its corresponding image description information through a single image super-resolution model, and outputting a high-resolution single image;

The training method of the single image super-resolution model includes:

Collecting training samples, the training samples include high-resolution single image samples, low-resolution single image samples and corresponding image description information samples;

According to the collected training samples, a single image super-resolution model is established based on the preset loss function and high-resolution single image samples.

Optionally, the collecting training samples, the training samples include high-resolution single image samples, low-resolution single image samples and their corresponding image description information samples, including:

Use the publicly available large-scale CUB bird image dataset to obtain high-resolution single image samples and back them up; image samples and back them up;

The built-in bicube downsampling function of Matlab is used to degenerate the high-resolution single image sample into a low-resolution single image sample of the scaling factor; optionally, functions of other software or other functions can also be used to degenerate the single image sample The high-resolution single image sample is a low-resolution single image sample of the scaling factor;

Using the English sentence information used to describe at least one of the feather color, body characteristics, movement posture and environmental performance of the bird in the image in the large-scale CUB bird image data set, the corresponding image description information sample is obtained. Therefore, for image data sets using other targets, English sentence information used to describe at least one of the color, body features, motion posture and environmental performance of the target in the image data set can also be used to obtain the corresponding image Describe a sample of information.

Optionally, the establishment of a single image super-resolution model based on a preset loss function and a high-resolution single image sample according to the collected training samples includes:

Obtain low-resolution single image samples and corresponding image description information;

Based on single-layer convolution, the low-resolution single image is extracted with shallow features, and the input low-resolution single image is converted from RGB color space to feature space;

Using an adaptive adjustment block to encode the image description information to obtain a description variable with the same dimension as the image feature;

Concatenate the description variables and image features, and use a layer of convolution to perform channel compression on the concatenated features;

Using multi-scale sub-networks to extract deep features from shallow features;

Use the upsampling module to scale up the deep features;

Use two layers of convolution to reconstruct and output a high-resolution single image of RGB channels;

Based on the preset loss function, the reconstructed high-resolution single image and the backup high-resolution single image sample, the positive sample combined with matching description information and the negative sample combined with mismatched description information are reversely converged to establish a single image super resolution model.

Optionally, the encoding processing of the image description information by using an adaptive adjustment block to obtain description variables with the same dimensions as image features includes:

The self-adaptive regulation block is made up of two branches, wherein one branch is made up of a fully connected layer, outputs a description encoding vector, and the other branch is made up of a fully connected layer and a sigmoid activation function, and outputs a weight vector;

The vectors output by the two branches are multiplied by corresponding position element values, and transformed into description variables with the same dimensions as the image features; the description information is adjusted through the weight vector, and the description coding features are adaptively scaled, so that the image description Redundant information is eliminated, and effective information for image reconstruction is obtained.

Optionally, the multi-scale sub-network is used to perform deep feature extraction on shallow features, including:

The shallow features are down-sampled into small-scale feature maps through bilinear interpolation, and their scale is reduced to half of the original;

Use this scale as the input of the first layer of sub-network, and gradually increase the large-scale sub-network in stages;

The output of different sub-networks in the previous stage is scaled up through nearest neighbor interpolation, and then fused into the input of large-scale sub-networks; where the sub-networks are connected in series by a certain number of attention residual densely connected blocks at each stage Composition, the number of attention residual dense connection blocks used by sub-networks of different scales from top to bottom is 5, 7, and 3 respectively;

The adaptive fusion module based on the channel attention mechanism is used to fuse the information of different frequencies extracted by sub-networks at different scales.

Optionally, the use of the upsampling module to scale up the deep features includes:

The feature scale is enlarged using the nearest neighbor interpolation algorithm.

Optionally, the attention residual densely connected block is composed of three spatial attention residual densely connected units and one that connects the input of the attention residual densely connected block with the output of the last spatial attention residual densely connected unit local skip connections.

Optionally, the spatial attention residual densely connected unit includes densely connected groups of five convolutional layers, a spatial attention convolution group, and the input of the spatial attention residual densely connected unit and the spatial attention convolution The output of the group is connected by a skip connection.

Optionally, the self-adaptive fusion module based on the channel attention mechanism fuses information of different frequencies extracted by sub-networks at different scales, including:

Interpolate the small-scale feature map to generate a feature map of the same size as the large-scale feature map;

The interpolated feature maps are passed to the global average pooling layer, channel compression convolution layer and channel expansion convolution layer respectively;

Concatenate the obtained vectors of the three scales and process them with the softmax layer on the same channel to generate the corresponding weight matrix;

Divide the weight matrix into three weight components corresponding to three sub-networks, and multiply the feature maps after each scale interpolation with the corresponding weight components;

The obtained three feature maps are weighted and summed to obtain the fused output.

Optionally, processing the low-resolution single image and its corresponding image description information through the single-image super-resolution model to output a high-resolution single image includes:

Input a low-resolution single image into the shallow feature extraction module to obtain shallow image features;

Input the corresponding image description information into the adaptive adjustment block to obtain the description variable with the same dimension as the image feature, connect the description variable and image feature in series, input the subsequent single image super-resolution model, and output a high-resolution single image.

In the second aspect, a generation system for visual resolution enhancement is provided, including:

An acquisition module, configured to acquire a low-resolution single image to be processed and its corresponding image description information;

An output module, configured to process the low-resolution single image and its corresponding image description information through a single-image super-resolution model, and output a high-resolution single image;

Training module, for training described single image super-resolution model, described training module comprises:

The sampling sub-module is used to collect training samples, and the training samples include high-resolution single image samples, low-resolution single image samples and corresponding image description information samples;

The model establishment sub-module is used to establish a single image super-resolution model based on a preset loss function and a high-resolution single image sample according to the collected training samples.

Optionally, the sampling submodule includes:

The first sampling unit is used to use the public large-scale CUB bird image data set to obtain a high-resolution single image sample and back it up;

The second sampling unit is used to degenerate the high-resolution single image sample into a low-resolution single image sample of the scaling factor by adopting the built-in bicube down-sampling function of Matlab;

The third sampling unit is used to obtain the corresponding image description information sample by using the English sentence information used to describe the feather color, body characteristics, motion posture and environmental performance of the bird in the image in the large-scale CUB bird image data set .

Optionally, the model building submodule includes:

an acquisition unit, configured to acquire a low-resolution single image and corresponding image description information;

The extraction unit is used to extract shallow features from the low-resolution single image based on single-layer convolution, and convert the input low-resolution single image from RGB color space to feature space;

An encoding processing unit, configured to use an adaptive adjustment block to encode the image description information to obtain description variables having the same dimensions as image features;

The compression unit is used to concatenate the description variables and image features, and use a layer of convolution to perform channel compression on the concatenated features;

A deep feature extraction unit for performing deep feature extraction on shallow features using a multi-scale sub-network;

The upsampling unit is used to scale up the deep features by using the upsampling module;

The reconstruction unit is used to reconstruct and output a high-resolution single image of RGB channel by adopting two layers of convolution;

The model building unit is used to reversely converge the reconstructed high-resolution single image and backup high-resolution single image samples, positive samples combined with matching description information, and negative samples combined with mismatched description information based on a preset loss function , to build a single image super-resolution model.

Optionally, the encoding processing unit includes:

The encoding subunit consists of a layer of fully connected layers for outputting description encoding vectors;

The first weight subunit consists of a fully connected layer and a sigmoid activation function for outputting a weight vector;

The transformation subunit is used to multiply the value of the corresponding position element of the vector output by the encoding subunit and the weight subunit, and transform it into a description variable having the same dimension as the image feature.

Optionally, the multi-scale sub-network includes:

The scaling unit is used to downsample the shallow features into a small-scale feature map through bilinear interpolation, and its scale is reduced to half of the original;

Adding units to use this scale as the input of the first layer of sub-networks, and gradually increase the large-scale sub-networks in stages;

The input unit is used to amplify the output of different sub-networks in the previous stage through nearest neighbor interpolation, and fuse them into the input of large-scale sub-networks; wherein, the sub-networks are composed of a certain number of attention residuals in each stage. The difference densely connected blocks are connected in series, and the number of attention residual densely connected blocks used by sub-networks of different scales from top to bottom are 5, 7, and 3 respectively;

The fusion unit is used to fuse information of different frequencies extracted by sub-networks at different scales by using an adaptive fusion module based on the channel attention mechanism.

Optionally, the upsampling unit includes:

The enlargement subunit is used to enlarge the feature scale using the nearest neighbor interpolation algorithm.

Optionally, the attention residual dense connection block includes:

The first component unit is used to compose three spatial attention residual densely connected units and a local skip connection connecting the input of the attention residual densely connected block and the output of the last spatial attention residual densely connected unit.

Optionally, the fusion unit includes:

The mapping subunit is used to interpolate the small-scale feature map to generate a feature map with the same size as the large-scale feature map;

The transfer subunit is used to transfer the interpolated feature map to the global average pooling layer, channel compression convolution layer and channel expansion convolution layer respectively;

The second weight subunit is used to concatenate the obtained vectors of the three scales, and use the softmax layer on the same channel for processing to generate a corresponding weight matrix;

Multiply subunits. It is used to divide the weight matrix into three weight components corresponding to three sub-networks, and multiply the feature maps after each scale interpolation with the corresponding weight components;

The output subunit performs a weighted summation operation on the obtained three feature maps to obtain a fused output.

Optionally, the spatial attention residual dense connection unit includes:

The second component unit is used to connect a densely connected group of five convolutional layers, a spatial attention convolutional group, and a skip link connecting the input of the spatial attention residual densely connected unit and the output of the spatial attention convolutional group composition.

Optionally, the output module includes:

The extraction sub-module is used to input a low-resolution single image into the shallow feature extraction module to obtain shallow image features;

The output sub-module inputs the corresponding image description information into the adaptive adjustment block to obtain the description variable with the same dimension as the image feature, connects the description variable and the image feature in series, inputs the subsequent single image super-resolution model, and outputs a high-resolution single image images.

In a third aspect, a device is provided, including:

memory for storing at least one program;

A processor, configured to execute the at least one program to implement the method as described above.

In a fourth aspect, a storage medium is provided, which stores an executable program, and when the executable program is executed by a processor, the above method is implemented.

Compared with the prior art, the embodiments of the present invention have the following beneficial effects:

The low-resolution single image to be processed is obtained by using a single image training model established by using training samples including high-resolution single image samples, low-resolution single image samples and their corresponding image description information samples, and a preset loss function. The resolution processing of each image can accurately and efficiently realize the effect of restoring a low-resolution single image to a high-resolution single image, and can obtain a priori a single image with higher definition based on specific image description information.

Description of drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the accompanying drawings in the following description are only These are some embodiments of the present invention. Those skilled in the art can also obtain other drawings based on these drawings without creative work.

The structures, proportions, sizes, etc. shown in this manual are only used to cooperate with the content disclosed in the manual, so that people familiar with this technology can understand and read, and are not used to limit the conditions for the implementation of the present invention, so there is no technical In the substantive meaning above, any modification of structure, change of proportional relationship or adjustment of size shall still fall within the scope of the technical content disclosed in the present invention without affecting the effect and purpose of the present invention. within the range that can be covered.

FIG. 1 is a schematic flow chart of steps in a method for generating visual resolution enhancement provided by an embodiment of the present invention;

FIG. 2 is a structural block diagram of a generation system for visual resolution enhancement provided by an embodiment of the present invention;

Fig. 3 is a schematic flow chart of a single image super-resolution model in an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a multi-scale sub-network in an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an attention residual densely connected block in an embodiment of the present invention;

Fig. 6 is a schematic diagram of the operation details of the adaptive fusion module in the embodiment of the present invention.

Detailed ways

In order to make the purpose, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the following description The embodiments are only some of the embodiments of the present invention, but not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

Example 1

As shown in Figure 1, this embodiment provides a method for generating visual resolution enhancement, including the following steps:

S1. Obtain the low-resolution single image to be processed and its corresponding image description information;

S2. Process the low-resolution single image and its corresponding image description information through a single-image super-resolution model, and output a high-resolution single image;

The training process of the single image super-resolution model comprises the following steps:

S3. Collect training samples, where the training samples include high-resolution single image samples, low-resolution single image samples, and corresponding image description information samples;

S4. According to the collected training samples, a single image super-resolution model is established based on a preset loss function and high-resolution single image samples.

Optionally, the step S3 includes:

S31. Collect a high-resolution single image sample, use the public large-scale CUB bird image data set to obtain a high-resolution single image sample and back it up;

S32. Using the built-in bicubic downsampling function of Matlab to degenerate the high-resolution single image sample into a low-resolution single image sample with a scaling factor of ×4;

S33. Using the English sentence information in the above data set for describing the feather color, body characteristics, movement posture and environmental performance of the bird in the image;

Therefore, through steps S31-S33, high-resolution single image samples, low-resolution single image samples and corresponding image description information samples can be obtained, so as to establish training samples.

Optionally, the step S4 includes:

S41. Obtain a low-resolution single image and corresponding image description information;

S42. Extract shallow features from the low-resolution single image based on single-layer convolution, and convert the input low-resolution single image from RGB color space to feature space;

S43. Using an adaptive adjustment block to encode the description information of the image to obtain a description variable having the same dimension as the image feature;

S44. Concatenating the descriptive variables and image features, and performing channel compression on the concatenated features by using a layer of convolution;

S45. Using a multi-scale sub-network to extract deep features from shallow features;

S46. Using an upsampling module to scale up deep features;

S47. Reconstruct and output a high-resolution single image of RGB channels by using two layers of convolution;

S48. Reversely converge the reconstructed high-resolution single image and backup high-resolution single image samples, positive samples combined with matching description information, and negative samples combined with mismatched description information based on the preset loss function, to establish a single image Image super-resolution model.

Optionally, the step S43 includes:

S431. The adaptive adjustment block is composed of two branches, one of which is composed of a fully connected layer, and outputs a description encoding vector, and the other branch is composed of a fully connected layer and a sigmoid activation function, and outputs a weight vector;

S432. Multiply the element values corresponding to the positions of the vectors output by the two branches, and transform it into a description variable having the same dimension as the image feature; specifically, adjust the description information through the weight vector, and adaptively scale the description encoding feature, thereby The redundant information in the image description is eliminated, and the effective information for image reconstruction is obtained.

Optionally, the step S45 includes:

S451. Downsample the shallow features into a small-scale feature map through bilinear interpolation, and the scale is reduced to half of the original;

S452. Using the scale as the input of the first-layer sub-network, gradually increase the large-scale sub-network in stages;

S453. The input of the large-scale sub-network is fused by the output of different sub-networks in the previous stage through the nearest neighbor interpolation for scaling up; After the scale is enlarged by interpolation, it is fused to become the input of the large-scale sub-network; where the sub-network is composed of a certain number of attention residual densely connected blocks in series at each stage, and is composed of sub-networks of different scales from top to bottom. The number of attention residual densely connected blocks used is 5, 7, 3 respectively;

S454. Use an adaptive fusion module based on a channel attention mechanism to fuse information of different frequencies extracted by sub-networks at different scales.

Optionally, the step S454 includes:

S4541. Interpolate the small-scale feature map to generate a feature map with the same size as the large-scale feature map;

S4542, the interpolated feature map is passed to the global average pooling layer, the channel compression convolution layer and the channel expansion convolution layer respectively;

S4543. Concatenate the obtained vectors of three scales, and use a softmax layer on the same channel to process, and generate a corresponding weight matrix;

S4544. Divide the weight matrix into three weight components corresponding to the three sub-networks, and multiply the interpolated feature maps of each scale by the corresponding weight components;

S4545. Perform a weighted summation operation on the obtained three feature maps to obtain a fused output.

Example 2

As shown in Figure 2, this embodiment provides a generation system for visual resolution enhancement, which includes:

An acquisition module, configured to acquire low-resolution single images and image description information to be processed;

An output module, configured to process the low-resolution single image and image description information variables through a single image super-resolution model, and output a reconstructed high-resolution single image, wherein the single image is super-resolved The rate model is based on high-resolution single image samples and low-resolution single image samples.

Training modules include:

Optionally, the sampling submodule includes:

The first sampling unit is used to collect high-resolution single image samples, using the public large-scale CUB bird image data set to obtain high-resolution single image samples and back them up;

The second sampling unit is used to degenerate the high-resolution single image sample into the low-resolution single image sample of the scaling factor by adopting the built-in bicube down-sampling function of Matlab;

The third sampling unit is used to use the English sentence information used to describe the feather color, body characteristics, motion posture and environmental performance of the bird in the image in the above data set.

Therefore, the above-mentioned sampling unit can acquire high-resolution single image samples, low-resolution single image samples and corresponding image description information samples, so as to establish training samples.

Optionally, the model building submodule includes:

An acquisition unit, configured to acquire a low-resolution single image sample and corresponding image description information;

The extraction unit is used to extract shallow features from the low-resolution single image based on single-layer convolution, and converts the input low-resolution single image from RGB color space to feature space;

An encoding processing unit, configured to encode the description information of the image by using an adaptive adjustment block to obtain a description variable having the same dimension as the image feature;

Optionally, the encoding processing unit includes:

The encoding subunit consists of a fully connected layer and outputs a description encoding vector;

The first weight subunit consists of a fully connected layer and a sigmoid activation function, and outputs a weight vector;

The transformation subunit multiplies the vector corresponding position element value output by the encoding subunit and the weight subunit, and transforms it into a descriptive variable identical to the dimension of the image feature.

Optionally, the multi-scale sub-network includes:

The input unit is used to scale up the output of different sub-networks in the previous stage through nearest neighbor interpolation, and fuse them into the input of the large-scale sub-network; that is to say, the input of the large-scale sub-network is obtained from the previous stage The outputs of different sub-networks are fused by nearest neighbor interpolation and then scaled up; among them, the sub-networks are composed of a certain number of attention residual densely connected blocks in series at each stage, from top to bottom of different scales The number of attention residual densely connected blocks used by the sub-network is 5, 7, 3 respectively;

The fusion unit is used to fuse information of different frequencies extracted by sub-networks at different scales by using an adaptive fusion module based on a channel attention mechanism.

Optionally, the upsampling unit includes:

Optionally, the attention residual dense connection block includes:

Optionally, the fusion unit includes:

Optionally, the spatial attention residual dense connection unit includes:

Optionally, the output module includes:

Example 3

The present embodiment provides a kind of device, and this device comprises:

at least one processor;

at least one memory for storing at least one program;

When at least one program is executed by at least one processor, at least one processor is made to implement the steps of the method for generating visual resolution enhancement as in Embodiment 1 above.

Example 4

This embodiment provides a storage medium, which stores a processor-executable program, and the executable program is used to execute the method for generating visual resolution enhancement as described in Embodiment 1 when executed by the processor. step.

Example 5

Referring to Fig. 3 to Fig. 6, this embodiment provides a flow chart of a method for generating visual resolution enhancement, which can be used as a specific implementation of Embodiment 1, and Embodiment 2 can also implement the method of this embodiment. Include the following steps:

A. Collect training samples, the training samples include high-resolution single image samples, low-resolution single image samples and their corresponding image description information samples;

B. Establish a single image super-resolution model based on the collected training samples;

C. Obtain the low-resolution single image to be processed and its corresponding image description information;

D. Process the low-resolution single image to be processed and its corresponding image description information through the single image super-resolution model, and output a high-resolution single image.

Wherein, the concrete implementation scheme of step A is:

A1. Obtain the public large-scale CUB bird data set as the training data set. The data set can be divided into 200 categories, with a total of 11,788 images. Each image has ten English sentences to describe the feather color, body characteristics, movement posture and environmental performance of the bird in the image. The training set and test set are divided according to the ratio of 8855:2913. The proportion of training data set and test data set in each category is balanced, and there will be no problem of unbalanced distribution of training set and test set samples. The corresponding image description information adopts the CNN-RNN encoding method to encode the description prior information.

A2. Use the "imresize" function of MATLAB to perform 4-fold bicube downsampling on a high-resolution single image to obtain the corresponding low-resolution single image, which constitutes a ternary matching of {I _HR , I _LR ,c} data set. The negative sample description information used in the positive and negative sample matching loss is to randomly select one from the rest of the image descriptions as a mismatch description through random numbers, and obtain a dataset of {I _LR , I _HR , neg_c} triple negative sample descriptions; Horizontal or vertical flipping, 90° rotation, and random cropping of image blocks are used as data enhancement methods.

The specific implementation scheme of step B is:

B1. A low-resolution image block with a size of 30×30 randomly cut from a low-resolution single image is used as an input, which is denoted as I _LR .

B2. In the shallow feature extraction module, the single-layer convolutional layer converts the input low-resolution image from the RGB color space to the feature space. The obtained feature contains 64 channels, and the size is the same as the input image. The convolutional layer consists of a 3×3 convolutional layer and an activation function. At the same time, the adaptive adjustment block encodes the description of the image to obtain the same description variable as the dimension of the image feature. The adaptive adjustment block is composed of two branches, one of which is composed of a fully connected layer, which outputs a description encoding vector, and the other branch is composed of a fully connected layer and a sigmoid activation function, which outputs a weight vector, and the vectors output by the two branches correspond to The positional element values are multiplied to obtain the description variable. Then, the descriptive variables and image features are concatenated, and the channel compression is performed through a layer of 3×3 convolutional layers, so as to obtain the shallow feature F _S . This process can be expressed as:

B3. After obtaining the shallow feature F _S , input it into the deep feature extraction module composed of multi-scale sub-networks, and generate an effective deep feature F _d through parallel multiple sub-networks. Finally obtained deep feature F _d ∈ R ^2W×2H×C , it can be found that the scale of the deep feature is doubled on the scale of the shallow feature. characteristic information. In order to construct feature maps of different scales, the shallow features are first sampled into small-scale feature maps by bilinear interpolation, and the scale is reduced to half of the original

The deep feature extraction module takes this scale as the input of the first layer sub-network, and gradually increases the large-scale sub-network in stages. The input of the large-scale sub-network is fused by the output of different sub-networks in the previous stage through nearest neighbor interpolation for scaling up. The sub-network consists of a certain number of attention residual densely connected blocks concatenated at each stage. For sub-networks of different scales, the number of attention residual densely connected blocks connected in series is also different. The number of attention residual densely connected blocks used by sub-networks of different scales from top to bottom is 5, 7, and 3 respectively. . The subsequent adaptive fusion module based on the channel attention mechanism effectively fuses information of different frequencies extracted by sub-networks at different scales. This module can be expressed as:

F _d =MARDN(F _s )

B4. After obtaining the deep feature F _d , input it into the up-sampling module, and the up-sampling module uses the nearest neighbor interpolation algorithm to enlarge the feature scale. This module can be expressed as:

F _up ＝Inter(F _d ,s)↑

Among them, F _up is the feature after upsampling, Inter( ) represents the nearest neighbor interpolation function, and s represents the amplification factor.

B5. Finally, the super-resolution image I _SR of RGB channels is reconstructed through two layers of 3×3 convolutional layers. This process can be expressed as:

I _SR ＝Conv(Conv(F _up ))

B6. The discriminator uses a VGG network composed of convolutions with a step size. The input image is a generated image and a real image. After several convolution layers with a step size, the input image features are dimensionally changed, and the feature map is reduced. Finally, The output feature map is concatenated with the image description coding vector c, and the true and false logical value of the judgment is obtained through the binary classifier. This process can be expressed as:

Var＝Net _D ({I _SR ,I _HR },c)

B7. Using a loss function to reversely converge the reconstructed high-resolution single image _ISR and the backup high-resolution single image sample, and establish a single image super-resolution model. During the training process, the loss function of the generator consists of three parts: the reconstruction loss L _rec , the perceptual loss L _VGG and the confrontation loss L _adv :

L _G ＝λ ₁ *L _rec +λ ₂ *L _VGG +λ ₃ *L _adv

λ ₁ , λ ₂ and λ ₃ correspond to the weights of these three losses, respectively. In order to ensure that the reconstructed image is as similar as possible to the real image in terms of image content, pixel-by-pixel constraints are performed in the image space through the reconstruction loss. Here, the reconstruction loss uses the L ₁ loss function:

Among them, N=H×W×C represents the total pixels of the image, and W, H, and C represent the width, height and number of channels of a high-resolution single image, respectively.

At the same time, in order to increase the texture information of the image, the feature information extracted by the fixed classification network VGG of the reconstructed image should be similar to the real image, and the perceptual loss is used to constrain it. The perceptual loss is defined as follows:

Here, M=H×W×C denotes the size of the specified feature map.

In addition, in order to ensure the mutual game between the generator and the discriminator, it is necessary to use an adversarial loss function to train the generator and the discriminator. The purpose of the generator's adversarial loss is to make the distribution of the reconstructed image and the real image as close as possible, which is defined as follows:

L _adv ＝log(1-Net _D (Net _G (I _LR ,c)))

Wherein, var represents attribute information of a single frame.

Different from the adversarial loss of the generator, the goal of the adversarial loss of the discriminator is to distinguish the reconstructed image from the real image as much as possible in the image distribution. Compared with SRGAN and ESRGAN, which only calculate the adversarial loss between images, this embodiment adds positive and negative sample adversarial loss constraints. Among them, the positive sample refers to the combination of matched description information c, and the negative sample refers to the discriminator loss neg_c combined with mismatched description information. The adversarial loss for the discriminator is defined as follows:

Denotes the adversarial loss between the generated image and the real image;

It is a joint judgment of adding matching description coding, the purpose is to make the discriminator judge the fidelity of the generated image and at the same time distinguish whether it corresponds to the description;

It is worth noting that whether it is a real image or a generated image, if it is jointly input into the discriminator with the mismatch description information, the result obtained is a "false" judgment. These three losses are defined as follows:

Set the learning rate, backpropagate the gradient by minimizing the loss function error, update the network parameters, and iterate until the network is trained to converge.

During the reverse convergence training, the batch size is set to 16, the initial learning rate is set to 10 ^-4 , and the description code is 1024 hidden variables. In order to construct batch data, a low-resolution image with a size of 30×30 is randomly cut from the low-resolution image. The high-resolution image block is paired with a 120×120 high-resolution image block. During the iterative training process, according to the convergence of the network, when the total number of training iterations reaches {5×10 ⁴ , 1×10 ⁵ ,2×10 ⁵ ,3×10 ⁵ }, the learning rate is attenuated by half. First use the reconstruction loss L _rec to train the generator to avoid the problem that the discriminator can easily distinguish the gradient from the generated image from the real image. In this embodiment, the ADAM optimizer is used to perform reverse gradient propagation on the model, where the parameters of ADAM are set as β ₁ =0.9, β ₂ =0.999 and ε=10 ⁻⁸ . Use the L1 loss function to ensure that the reconstructed image is as similar as possible to the real image in terms of image content, use perceptual loss to ensure that the image texture information is as similar as possible to the real image, and use confrontation loss to make the distribution of the reconstructed image and the real image as close as possible to distinguish Find out whether it corresponds to the description, set the coefficients of these loss functions, and update the network parameters by back propagation by minimizing these error sums, and iterate until the network is trained to convergence.

The scheme of step C is specifically:

Obtain the pre-divided CUB test data set, which contains a variety of low-resolution single images and their corresponding image description information variables.

The scheme of step D is specifically:

Input the low-resolution single image of the CUB test data set to be restored into the trained single image super-resolution model, and perform step B on the input CUB test data set single image through the single image super-resolution model In an embodiment, the adaptive adjustment block encodes the description corresponding to the image to obtain a description variable with the same dimension as the image feature, and then converts the description variable and the input low-resolution image from the RGB color space to the feature through a single-layer convolution layer The spatial image features are concatenated, and finally channel compression is performed through a layer of convolution to obtain shallow features, and then processed by the subsequent network to output a high-resolution single image.

To sum up, the single image training model established by using the training samples including high-resolution single image samples, low-resolution single image samples and their corresponding image description information samples, and the preset loss function is used to obtain the to-be-processed The resolution processing of a low-resolution single image can accurately and efficiently realize the effect of restoring a low-resolution single image to a high-resolution single image, and can obtain higher definition based on specific image description information a priori of a single image.

As mentioned above, the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them; Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: it can still understand the foregoing The technical solutions recorded in each embodiment are modified, or some of the technical features are replaced equivalently; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the various embodiments of the present invention.

Claims

A method for generating visual resolution enhancement, comprising:

Obtain the low-resolution single image to be processed and its corresponding image description information;

Processing the low-resolution single image and its corresponding image description information through a single image super-resolution model, and outputting a high-resolution single image;

The training method of the single image super-resolution model includes:

Collecting training samples, the training samples include high-resolution single image samples, low-resolution single image samples and corresponding image description information samples;

According to the collected training samples, a single image super-resolution model is established based on the preset loss function and high-resolution single image samples.
The generation method of visual resolution enhancement according to claim 1, characterized in that, the collection of training samples, the training samples include high-resolution single image samples, low-resolution single image samples and their corresponding images Sample descriptive information, including:

Use the image data set of the preset target to obtain a high-resolution single image sample and back it up;

degenerating the high resolution single image sample to a low resolution single image sample of a scaling factor;

The corresponding image description information samples are obtained by using the English sentence information in the image data set used to describe at least one of the color, body characteristics, motion posture and environmental performance of the preset target in the image.
The generation method of visual resolution enhancement according to claim 1, wherein, according to the collected training samples, a single image super-resolution model is established based on a preset loss function and a high-resolution single image sample, including :

Obtain low-resolution single image samples and corresponding image description information;

Based on single-layer convolution, the low-resolution single image is extracted with shallow features, and the input low-resolution single image is converted from RGB color space to feature space;

Using an adaptive adjustment block to encode the image description information to obtain a description variable with the same dimension as the image feature;

Concatenate the description variables and image features, and use a layer of convolution to perform channel compression on the concatenated features;

Using multi-scale sub-networks to extract deep features from shallow features;

Use the upsampling module to scale up the deep features;

Use two layers of convolution to reconstruct and output a high-resolution single image of RGB channels;

Based on the preset loss function, the reconstructed high-resolution single image and the backup high-resolution single image sample, the positive sample combined with matching description information and the negative sample combined with mismatched description information are reversely converged to establish a single image super resolution model.
The generation method of visual resolution enhancement according to claim 3, characterized in that, said image description information is encoded using an adaptive adjustment block to obtain description variables that are the same as the dimensions of image features, including:

The self-adaptive regulation block is made up of two branches, wherein one branch is made up of a fully connected layer, outputs a description encoding vector, and the other branch is made up of a fully connected layer and a sigmoid activation function, and outputs a weight vector;

The vectors output by the two branches are multiplied by corresponding position element values, and transformed into a description variable with the same dimension as the image feature.
The generation method of visual resolution enhancement according to claim 3, characterized in that, the use of multi-scale sub-networks to perform deep feature extraction on shallow features includes:

The shallow features are down-sampled into small-scale feature maps through bilinear interpolation, and their scale is reduced to half of the original;

Use this scale as the input of the first layer of sub-network, and gradually increase the large-scale sub-network in stages;

The output of different sub-networks in the previous stage is scaled up through nearest neighbor interpolation, and then fused into the input of large-scale sub-networks; where the sub-networks are connected in series by a certain number of attention residual densely connected blocks at each stage Composition, the number of attention residual dense connection blocks used by sub-networks of different scales from top to bottom is 5, 7, and 3 respectively;

The adaptive fusion module based on the channel attention mechanism is used to fuse the information of different frequencies extracted by sub-networks at different scales.
The generation method of visual resolution enhancement according to claim 3, wherein said using an upsampling module to scale up deep features includes:

The feature scale is enlarged using the nearest neighbor interpolation algorithm.
The method for generating visual resolution enhancement according to claim 5, wherein the attention residual densely connected block consists of three spatial attention residual densely connected units and one attention residual densely connected block It consists of local skip connections connecting the input to the output of the last spatial attention residual densely connected unit.
The method for generating visual resolution enhancement according to claim 5, wherein said adaptive fusion module based on a channel attention mechanism fuses information of different frequencies extracted by sub-networks at different scales, including:

Interpolate the small-scale feature map to generate a feature map of the same size as the large-scale feature map;

The interpolated feature maps are passed to the global average pooling layer, channel compression convolution layer and channel expansion convolution layer respectively;

Concatenate the obtained vectors of the three scales and process them with the softmax layer on the same channel to generate the corresponding weight matrix;

Divide the weight matrix into three weight components corresponding to three sub-networks, and multiply the feature maps after each scale interpolation with the corresponding weight components;

The obtained three feature maps are weighted and summed to obtain the fused output.
The generation method of visual resolution enhancement according to claim 7, wherein the spatial attention residual densely connected unit comprises a densely connected group of five convolutional layers, a spatial attention convolution group, and a spatial Skip connections connecting the input of the attention residual densely connected unit and the output of the spatial attention convolution group.
The method for generating visual resolution enhancement according to any one of claims 1 to 9, wherein the low-resolution single image and its corresponding image description information are described by the single image super-resolution model Process and output a high-resolution single image, including:

Input a low-resolution single image into the shallow feature extraction module to obtain shallow image features;

Input the corresponding image description information into the adaptive adjustment block to obtain the description variable with the same dimension as the image feature, connect the description variable and image feature in series, input the subsequent single image super-resolution model, and output a high-resolution single image.
A generation system for visual resolution enhancement, characterized by comprising:

An acquisition module, configured to acquire a low-resolution single image to be processed and its corresponding image description information;

An output module, configured to process the low-resolution single image and its corresponding image description information through a single-image super-resolution model, and output a high-resolution single image;

Training module, for training described single image super-resolution model, described training module comprises:

The sampling sub-module is used to collect training samples, and the training samples include high-resolution single image samples, low-resolution single image samples and corresponding image description information samples;

The model establishment sub-module is used to establish a single image super-resolution model based on a preset loss function and a high-resolution single image sample according to the collected training samples.
A device, characterized in that it comprises:

memory for storing at least one program;

A processor, configured to execute the at least one program to implement the method according to any one of claims 1-10.
A storage medium, characterized in that an executable program is stored, and when the executable program is executed by a processor, the method according to any one of claims 1 to 10 is implemented.