CN113554058A - Method, system, device and storage medium for enhancing resolution of visual target image

Info

Publication number: CN113554058A
Authority: CN (China)
Prior art keywords: resolution, face image, samples, face, low
Application number: CN202110698882.3A
Other languages: Chinese (zh)
Inventors: 金龙存, 卢盛林
Current assignee: Guangdong OPT Machine Vision Co Ltd
Original assignee: Guangdong OPT Machine Vision Co Ltd
Application filed by Guangdong OPT Machine Vision Co Ltd
Priority to CN202110698882.3A
Publication of CN113554058A
Legal status: Pending (assumed status; not a legal conclusion)

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 Computing arrangements based on biological models
                    • G06N3/02 Neural networks
                        • G06N3/04 Architecture, e.g. interconnection topology
                            • G06N3/045 Combinations of networks
                        • G06N3/08 Learning methods
                            • G06N3/084 Backpropagation, e.g. using gradient descent
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F18/00 Pattern recognition
                    • G06F18/20 Analysing
                        • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                            • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T3/00 Geometric image transformations in the plane of the image
                    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
                        • G06T3/4023 Scaling of whole images or parts thereof based on decimating pixels or lines of pixels, or on inserting pixels or lines of pixels
                        • G06T3/4053 Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution

Abstract

The invention discloses a method, a system, a device and a storage medium for enhancing the resolution of a visual target image. The method processes a low-resolution face image to be processed, together with its corresponding face attributes, using a pre-trained face image super-resolution model, and outputs a high-resolution face image. The model is trained as follows: acquire training samples comprising high-resolution face image samples, low-resolution face image samples, and a preset number of face attribute samples corresponding to both; then, from the training samples, establish the face image super-resolution model based on a preset loss function and the high-resolution face image samples. The trained model uses the face attributes as a prior to guide the enhancement of the low-resolution face image and recovers its high-frequency information, so that the output high-resolution face image contains more facial structure detail and is sharper.

Description

Method, system, device and storage medium for enhancing resolution of visual target image
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a method, a system, a device and a storage medium for enhancing the resolution of a visual target image.
Background
In recent years, as demand for image and video quality has grown, improving that quality has become an increasingly important problem. Image super-resolution aims to restore a low-resolution image so that it contains more detail and appears sharper. The technology has real practical value: in security surveillance, for example, cost constraints mean that capture devices often record video frames lacking useful information, while effective surveillance depends heavily on high-resolution images with clearly legible content. Image super-resolution can restore detail to such video frames, and the recovered information can provide usable evidence for fighting crime. Used as a pre-processing step, image super-resolution can also effectively improve the accuracy of downstream security tasks such as object detection, face recognition, and anomaly warning.
Earlier image super-resolution methods were either interpolation-based or reconstruction-based. Interpolation-based super-resolution was the first approach applied in the field: a fixed polynomial computes the pixel value at each interpolated position from the existing pixel values, as in bilinear interpolation, bicubic interpolation, and Lanczos resampling. Reconstruction-based methods impose strict prior knowledge as a constraint and search the constrained space for a suitable reconstruction function that rebuilds a high-resolution image with detail information. Both families tend to produce overly smooth images and fail to recover fine texture detail.
More recently, with the development of deep learning and convolutional neural networks, image super-resolution has made great strides, and face image super-resolution in particular has drawn growing attention from researchers. However, existing work that incorporates face attributes typically concatenates the attribute information onto shallow features, even though a face image is a special, highly structured and symmetric image whose structural information is better represented by deep features. Moreover, existing models are built from structures such as hourglass networks, residual networks, or autoencoders, which are not sufficient to support deep model construction and remain limited to simple feature-extraction architectures.
Disclosure of Invention
The invention provides a method, a system, a device and a storage medium for enhancing the resolution of a visual target image. By further combining attribute information with deep facial structure features to construct a deeper network, the network can produce a high-definition face image guided by a specific face-attribute prior, thereby solving, or at least partially solving, the technical problems above.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, a method for enhancing the resolution of an image of a visual target is provided, which includes:
processing the low-resolution face image to be processed and the corresponding face attribute by adopting a pre-trained face image super-resolution model, and outputting a high-resolution face image; the training method of the face image super-resolution model comprises the following steps:
acquiring training samples, wherein the training samples comprise high-resolution face image samples, low-resolution face image samples and a preset number of face attribute samples corresponding to the high-resolution face image samples and the low-resolution face image samples;
and establishing a face image super-resolution model based on a preset loss function and the high-resolution face image sample according to the training sample.
Optionally, the acquiring a training sample, where the training sample includes a high-resolution face image sample, a low-resolution face image sample, and a preset number of face attribute samples corresponding to the high-resolution face image sample, includes:
collecting high-resolution face image samples from the face image data set CelebA and backing them up;
adopting an image scaling algorithm to carry out down-sampling on the high-resolution face image sample to generate a low-resolution face image sample;
and selecting, as the corresponding face attribute samples, a preset number of attributes in the face image data set CelebA that benefit face image super-resolution.
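The paired-sample generation above can be sketched as follows. The patent does not name the scaling algorithm, so a simple block-average down-sampler (a NumPy stand-in for a bicubic-style kernel; the helper name `downsample` is ours) illustrates how low-resolution samples are derived from the backed-up high-resolution ones:

```python
import numpy as np

def downsample(hr: np.ndarray, scale: int = 4) -> np.ndarray:
    """Down-sample an HxWxC image by block averaging, a simple stand-in
    for the unspecified image scaling algorithm in the patent."""
    h, w, c = hr.shape
    assert h % scale == 0 and w % scale == 0
    return hr.reshape(h // scale, scale, w // scale, scale, c).mean(axis=(1, 3))

hr = np.random.rand(128, 128, 3)   # a backed-up high-resolution sample
lr = downsample(hr)                # the paired low-resolution sample
print(lr.shape)                    # (32, 32, 3)
```

Each high-resolution sample, its down-sampled counterpart, and the selected attribute vector together form one training sample.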
Optionally, the establishing a face image super-resolution model based on a preset loss function and a high-resolution face image sample according to the training sample includes:
acquiring low-resolution face image samples and a corresponding preset number of face attribute samples;
extracting shallow features of the low-resolution face image based on a convolutional neural network to generate low-resolution face image features;
obtaining the structural features of the face image by encoding (compressing) and decoding (restoring) the low-resolution face image features with an autoencoder;
extracting deep features from the serial connection of the structural features of the face images and the corresponding preset number of face attributes by adopting a double residual dense connection network;
enlarging the feature scale by pixel-rearrangement (pixel shuffle) upsampling and reconstructing a high-resolution face image;
and reversely converging the reconstructed high-resolution face image and the backed-up high-resolution face image sample based on a preset loss function, and establishing a face image super-resolution model.
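The final step converges the reconstruction toward the backed-up sample under a preset loss function; the drawings distinguish a reconstruction-loss model (Fig. 3) from an adversarial-loss model (Fig. 4). As one plausible instance, assuming an L1 reconstruction loss (the patent does not fix the choice), the minimized quantity would be:

```python
import numpy as np

def l1_reconstruction_loss(sr: np.ndarray, hr: np.ndarray) -> float:
    """Mean absolute error between the reconstructed image and the
    backed-up high-resolution sample; an assumed instance of the
    preset loss driving back-propagation."""
    return float(np.abs(sr - hr).mean())

hr = np.full((128, 128, 3), 0.8)   # backed-up high-resolution sample
sr = np.full((128, 128, 3), 0.5)   # reconstructed network output
loss = l1_reconstruction_loss(sr, hr)
print(round(loss, 3))  # 0.3
```

In training, this scalar would be back-propagated through the network until the model converges.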
Optionally, the obtaining, in combination with an automatic encoder, the structural features of the face image by performing operations of encoding compression and decoding restoration on the low-resolution face image features includes:
the encoder downsamples with a max-pooling operation and expands or compresses the number of feature channels with 2 convolutional layers of kernel size 3;
the decoder upsamples with deconvolution (transposed convolution) and likewise expands or compresses the number of feature channels with 2 convolutional layers of kernel size 3.
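A shape-level sketch of the two halves (NumPy; max pooling for the encoder's down-sampling, nearest-neighbour repetition standing in for the decoder's deconvolution, and the kernel-size-3 convolutions omitted for brevity):

```python
import numpy as np

def max_pool2(x: np.ndarray) -> np.ndarray:
    """Encoder down-sampling: 2x2 max pooling over a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2(x: np.ndarray) -> np.ndarray:
    """Decoder up-sampling: nearest-neighbour repeat, a stand-in for the
    learned deconvolution layer specified in the patent."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

feat = np.random.rand(64, 32, 32)  # shallow features of the LR face image
code = max_pool2(feat)             # encoded: (64, 16, 16)
rec = upsample2(code)              # decoded back to (64, 32, 32)
```

The round trip through this bottleneck is what forces the features to capture the global structure of the face.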
Optionally, the extracting deep features of the series connection of the structural features of the face image and the corresponding preset number of face attributes by using a dual residual dense connection network includes:
carrying out size transformation on feature maps input by a preset number of face attribute variables;
connecting the structural features of the face image and the face attribute variables in series in a channel dimension;
compressing the number of channels of the characteristic graphs connected in series through a convolution layer with the convolution kernel size of 3;
and extracting deep features from the features with the compressed channel number through a double residual dense connection network.
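The concatenation and channel compression in the first three steps can be sketched as follows (NumPy; the helper `fuse_attributes` is hypothetical, and a random 1x1 channel mix stands in for the learned kernel-size-3 compression convolution):

```python
import numpy as np

def fuse_attributes(feat: np.ndarray, attrs: np.ndarray,
                    out_ch: int = 64, seed: int = 0) -> np.ndarray:
    """Tile the attribute vector to the feature-map size, concatenate on
    the channel axis, then mix the channels back down to out_ch."""
    c, h, w = feat.shape
    # size transformation: one constant map per attribute variable
    attr_maps = np.broadcast_to(attrs[:, None, None], (attrs.size, h, w))
    stacked = np.concatenate([feat, attr_maps], axis=0)  # (c + n_attr, h, w)
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((out_ch, stacked.shape[0]))
    return np.tensordot(weights, stacked, axes=1)        # (out_ch, h, w)

feat = np.random.rand(64, 32, 32)                        # face structural features
attrs = np.array([1., 0, 1, 0, 0, 1, 0, 0, 1, 0])        # ten binary face attributes
fused = fuse_attributes(feat, attrs)
print(fused.shape)  # (64, 32, 32)
```

The fused features would then pass into the dual residual dense connection network for deep feature extraction.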
Optionally, the reconstructing a high-resolution face image by enlarging a feature scale through pixel rearrangement and upsampling includes:
based on pixel rearrangement, one convolution layer first expands the number of channels to 4 times the original; the channels are then rearranged by position, doubling the feature scale and reducing the channel count back to that of the input. In the model implementation, for the ×4 magnification model, two such pixel-rearrangement upsampling modules are cascaded to obtain output features enlarged 4 times;
and reconstructing a high-resolution face image by adopting 2 convolution layers with convolution kernel size of 3.
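The pixel rearrangement itself is a deterministic depth-to-space shuffle, sketched here in NumPy (for the ×4 model, two such modules, each preceded by its own channel-expanding convolution, would be cascaded):

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r): each group of
    r*r channels fills an r-by-r spatial block at its pixel position."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

feat = np.random.rand(64 * 4, 16, 16)  # convolution has expanded channels 4x
up1 = pixel_shuffle(feat)              # (64, 32, 32): feature scale doubled
```

Because the shuffle only moves values, all the information produced by the preceding convolution is preserved in the enlarged feature map.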
Optionally, the processing the low-resolution face image to be processed and the face attribute corresponding to the low-resolution face image by using the pre-trained face image super-resolution model and outputting the high-resolution face image includes:
inputting the low-resolution face image into a face structural feature extraction model to obtain face structural features;
and connecting the human face structural features and the human face attribute variables in series, inputting the human face structural features and the human face attribute variables into a subsequent human face image super-resolution model, and outputting a high-resolution human face image.
In a second aspect, a system for visual target image resolution enhancement is provided, comprising:
the output module is used for processing the low-resolution face image to be processed and the face attribute corresponding to the low-resolution face image to be processed by adopting a pre-trained face image super-resolution model and outputting a high-resolution face image; the training module of the face image super-resolution model comprises:
the acquisition submodule is used for acquiring training samples, and the training samples comprise high-resolution face image samples, low-resolution face image samples and a preset number of face attribute samples corresponding to the high-resolution face image samples;
and the model establishing submodule is used for establishing a face image super-resolution model based on a preset loss function and a high-resolution face image sample according to the training sample.
Optionally, the acquisition sub-module comprises:
the acquisition unit is used for acquiring a high-resolution face image sample, and obtaining and backing up the high-resolution face image sample by adopting a public large-scale face image data set CelebA;
the sampling unit is used for carrying out downsampling on the high-resolution face image sample by adopting an image scaling algorithm to generate a low-resolution face image sample;
and the selecting unit is used for selecting a preset number of face attribute samples, taking the attributes in the face image data set CelebA that benefit face image super-resolution as the corresponding face attribute samples; the preset number may be ten.
Optionally, the model building submodule includes:
the acquisition unit is used for acquiring low-resolution face image samples and corresponding preset number of face attribute samples;
the first extraction unit is used for extracting shallow features from the low-resolution face image based on a convolutional neural network to generate low-resolution face image features;
the coding and decoding unit is used for combining an automatic coder and obtaining the structural characteristics of the human face image by the operations of coding compression and decoding reduction on the low-resolution human face image characteristics;
the second extraction unit is used for extracting deep features from the serial connection of the structural features of the face images and the corresponding preset number of face attributes by adopting a double residual dense connection network;
the reconstruction unit is used for enlarging the characteristic scale by adopting pixel rearrangement upsampling and reconstructing a high-resolution face image;
and the model establishing unit is used for reversely converging the reconstructed high-resolution face image and the backed-up high-resolution face image sample based on a preset loss function and establishing a face image super-resolution model.
Optionally, the automatic encoder is provided with an encoder and a decoder, and the encoding and decoding unit includes:
the coding subunit is used for performing down-sampling by using the maximum pooling operation after shallow feature extraction, and then expanding the number of feature channels by using 2 convolutional layers with the convolutional kernel size of 3;
and the decoding subunit is used for performing up-sampling by using deconvolution before the human face attribute variable is input, and then expanding the number of characteristic channels by using 2 convolution layers with convolution kernel size of 3.
Optionally, the second extraction unit includes:
the transformation subunit is used for carrying out size transformation on the feature maps input by the preset number of human face attribute variables;
the concatenation subunit is used for concatenating the structural features of the face image and the face attribute variables in a channel dimension;
the compression subunit is used for compressing the channel number of the characteristic graphs which are connected in series through a convolution layer with the convolution kernel size of 3;
and the extraction subunit is used for extracting the deep features from the features with the compressed channel number through a double residual dense connection network.
Optionally, the reconstruction unit comprises:
the rearrangement subunit is used for expanding the number of channels to 4 times the original with one convolution layer based on pixel rearrangement, then rearranging the channels by position to obtain features whose scale is doubled and whose channel count is reduced back to that of the input; in the model implementation, for the ×4 magnification model, two pixel-rearrangement upsampling modules are cascaded to obtain output features enlarged 4 times;
and the reconstruction subunit is used for reconstructing a high-resolution face image by adopting 2 convolution layers with the convolution kernel size of 3.
Optionally, the output module includes:
the extraction submodule is used for inputting the low-resolution face image into the face structure feature extraction model to obtain the face structure feature;
and the output sub-module is used for connecting the face structure characteristics with the face attribute variables in series, inputting the follow-up face image super-resolution model and outputting a high-resolution face image.
In a third aspect, an apparatus is provided that includes a memory for storing at least one program and a processor for loading the at least one program to perform the method for visual target image resolution enhancement as described above.
In a fourth aspect, a storage medium is provided that stores a processor-executable program which, when executed by a processor, is configured to perform the method of visual target image resolution enhancement as described above.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides a method, a system, a device and a storage medium for enhancing the resolution of a visual target image, which adopt training samples comprising high-resolution face image samples, low-resolution face image samples and a preset number of corresponding face attribute samples, and accurately and efficiently realize the effect of restoring a low-resolution face image into a high-resolution face image by performing resolution processing on the acquired low-resolution face image to be processed based on a face image super-resolution model established by a preset loss function, and can acquire the face image with higher definition based on a specific face attribute prior.
Drawings
To illustrate the embodiments of the invention or the prior-art solutions more clearly, the drawings needed for their description are briefly introduced below. The drawings described here are clearly only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
The structures, ratios, and sizes shown in this specification are provided only to accompany the disclosed content so that those skilled in the art can understand and read it; they do not limit the conditions under which the invention can be implemented and therefore carry no limiting technical significance. Any structural modification, change of proportion, or adjustment of size that does not affect the functions and purposes of the invention still falls within the scope covered by the disclosed content.
FIG. 1 is a flowchart illustrating steps of a method for enhancing the resolution of an image of a visual target according to an embodiment of the present invention;
FIG. 2 is a block diagram of a system for enhancing the resolution of an image of a visual target according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a super-resolution model of a face image based on reconstruction loss in an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a face image super-resolution model based on countermeasure loss in the embodiment of the invention;
FIG. 5 is a schematic diagram illustrating the details of the operation of the rearrangement sub-unit in the embodiment of the present invention;
fig. 6 shows the screened attributes that contribute to super-resolution of face images in the embodiment of the present invention.
Detailed Description
To make the objects, features, and advantages of the invention more apparent and understandable, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The embodiments described are obviously only some, not all, of the embodiments of the invention; all other embodiments that a person skilled in the art can derive from them without creative effort fall within the protection scope of the invention.
Example 1
As shown in fig. 1, the present embodiment provides a method for enhancing the resolution of an image of a visual target, the method comprising the following steps:
s3, processing the low-resolution face image to be processed and the face attribute corresponding to the low-resolution face image to be processed by adopting a pre-trained face image super-resolution model, and outputting a high-resolution face image; the training method of the face image super-resolution model comprises the following steps:
s1, collecting training samples, wherein the training samples comprise high-resolution face image samples, low-resolution face image samples and a preset number of face attribute samples corresponding to the high-resolution face image samples and the low-resolution face image samples; optionally, the preset number is ten;
and S2, establishing a face image super-resolution model based on the preset loss function and the high-resolution face image sample according to the training sample.
As an alternative implementation manner of this embodiment, step S1 includes:
s11, collecting high-resolution face image samples, and obtaining and backing up the high-resolution face image samples by adopting a face image data set CelebA;
s12, down-sampling the high-resolution face image sample by adopting an image scaling algorithm to generate a low-resolution face image sample;
and S13, selecting the attributes which are beneficial to super resolution of the face image in the face image data set CelebA as corresponding face attribute samples, and selecting a preset number of face attribute samples.
Therefore, training samples can be established according to the high-resolution face image samples, the low-resolution face image samples and the corresponding preset number of face attribute samples.
As an alternative implementation manner of this embodiment, step S2 includes:
s21, acquiring low-resolution face image samples and corresponding preset number of face attribute samples;
s22, extracting shallow features of the low-resolution face image based on the convolutional neural network to generate low-resolution face image features;
s23, combining with an automatic encoder, obtaining the structural characteristics of the face image by the operations of encoding compression and decoding reduction of the low-resolution face image characteristics;
s24, extracting deep features of the structural features of the face images and the corresponding series connection of the preset number of face attributes by adopting a double residual dense connection network;
s25, enlarging the characteristic scale by adopting pixel rearrangement and up-sampling to reconstruct a high-resolution face image;
and S26, reversely converging the reconstructed high-resolution face image and the backed high-resolution face image sample based on a preset loss function, and establishing a face image super-resolution model.
As an alternative implementation manner of this embodiment, the automatic encoder is provided with an encoder and a decoder, and step S23 includes:
s231, the encoder performs downsampling by using the maximum pooling operation, and expands or compresses the number of characteristic channels by using 2 convolutional layers with the convolutional kernel size of 3;
and S232, the decoder performs up-sampling by using deconvolution, and expands or compresses the number of characteristic channels by using 2 convolutional layers with the convolutional kernel size of 3.
As an alternative implementation manner of this embodiment, step S24 includes:
s241, carrying out size transformation on feature graphs input by the preset number of face attribute variables;
s242, connecting the structural features of the face image and the face attribute variables in series in a channel dimension;
s243, compressing the channel number of the characteristic images which are connected in series through a convolution layer with the convolution kernel size of 3;
and S244, extracting deep features from the features with the compressed channel number through a double residual dense connection network.
As an alternative implementation manner of this embodiment, step S25 includes:
S251, based on pixel rearrangement, expanding the number of channels to 4 times the original with one convolution layer, then rearranging the channels by position to obtain features whose scale is doubled and whose channel count is reduced back to that of the input; in the model implementation, for the ×4 magnification model, two pixel-rearrangement upsampling modules are cascaded to obtain output features enlarged 4 times;
and S252, reconstructing a high-resolution face image by adopting 2 convolution layers with convolution kernel size of 3.
As an alternative implementation manner of this embodiment, step S3 includes:
s31, inputting the low-resolution face image into a face structure feature extraction model to obtain face structure features;
and S32, connecting the face structure characteristics with the face attribute variables in series, inputting a subsequent face image super-resolution model, and outputting a high-resolution face image.
Therefore, the method for enhancing the resolution of a visual target image provided by this embodiment uses training samples comprising high-resolution face image samples, low-resolution face image samples, and a corresponding preset number of face attribute samples, and applies a face image super-resolution model established on a preset loss function to the acquired low-resolution face image to be processed. It thereby accurately and efficiently restores the low-resolution face image to a high-resolution one, and can produce a sharper face image guided by a specific face-attribute prior.
Example 2
As shown in fig. 2, the present embodiment provides a system for enhancing the resolution of an image of a visual target, which can be used to implement the method provided in embodiment 1, and the system includes:
the output module 20 is configured to process the low-resolution face image to be processed and the face attribute corresponding to the low-resolution face image to be processed by using a pre-trained face image super-resolution model, and output a high-resolution face image; the training module 10 of the face image super-resolution model comprises:
the acquisition submodule 11 is configured to acquire a training sample, where the training sample includes a high-resolution face image sample, a low-resolution face image sample, and a preset number of face attribute samples corresponding to the high-resolution face image sample and the low-resolution face image sample;
and the model establishing submodule 12 is used for establishing a face image super-resolution model based on a preset loss function and the high-resolution face image sample according to the training sample.
As an optional implementation manner of this embodiment, the acquisition sub-module 11 includes:
the acquisition unit is used for acquiring a high-resolution face image sample, and obtaining and backing up the high-resolution face image sample by adopting a public large-scale face image data set CelebA;
the sampling unit is used for carrying out down-sampling on the high-resolution face image sample by adopting an image scaling algorithm to generate a low-resolution face image sample;
and the selecting unit is used for selecting, as the corresponding face attribute samples, a preset number of attributes in the face image data set CelebA that are beneficial to face image super-resolution. The preset number may be ten.
As an optional implementation manner of this embodiment, the model building sub-module 12 includes:
the acquisition unit is used for acquiring low-resolution face image samples and corresponding preset number of face attribute samples;
the first extraction unit is used for extracting shallow features from the low-resolution face image based on a convolutional neural network to generate low-resolution face image features;
the encoding and decoding unit is used for acquiring the structural features of the face image, in combination with an autoencoder, by performing encoding compression and decoding restoration on the low-resolution face image features;
the second extraction unit is used for extracting deep features from the serial connection of the structural features of the face image and the corresponding preset number of face attributes by adopting a double residual dense connection network;
the reconstruction unit is used for enlarging the characteristic scale by adopting pixel rearrangement upsampling and reconstructing a high-resolution face image;
and the model establishing unit is used for reversely converging the reconstructed high-resolution face image and the backed-up high-resolution face image sample based on a preset loss function and establishing a face image super-resolution model.
As an alternative implementation of this embodiment, the automatic encoder is provided with an encoder and a decoder, and the encoding and decoding unit includes:
the coding subunit is used for performing down-sampling by using the maximum pooling operation after shallow feature extraction, and then expanding the number of feature channels by using 2 convolutional layers with the convolutional kernel size of 3;
and the decoding subunit is used for performing up-sampling by using deconvolution before the human face attribute variable is input, and expanding the number of characteristic channels by using 2 convolution layers with convolution kernel size of 3.
As an optional implementation manner of this embodiment, the second extraction unit includes:
the transformation subunit is used for carrying out size transformation on the feature maps input by the preset number of human face attribute variables;
the series subunit is used for connecting the structural features of the face image and the face attribute variables in series in a channel dimension;
the compression subunit is used for compressing the channel number of the characteristic graphs which are connected in series through a convolution layer with the convolution kernel size of 3;
and the extraction subunit is used for extracting the deep features from the features with the compressed channel number through a double residual dense connection network.
As an optional implementation manner of this embodiment, the reconstruction unit includes:
the rearrangement subunit is used for expanding the number of channels to 4 times the original number through one convolution layer based on pixel rearrangement, and then rearranging the channels to their specific spatial positions to obtain a result whose feature scale is enlarged 2 times and whose channel number is reduced back to that of the input; in the implementation of the model, for the ×4 magnification model, two pixel-rearrangement upsampling modules are cascaded to obtain output features magnified 4 times;
and the reconstruction subunit is used for reconstructing a high-resolution face image by adopting 2 convolution layers with the convolution kernel size of 3.
As an optional implementation manner of this embodiment, the output module 20 includes:
the extraction submodule is used for inputting the low-resolution face image into the face structure feature extraction model to obtain the face structure feature;
and the output sub-module is used for connecting the face structure characteristics with the face attribute variables in series, inputting the follow-up face image super-resolution model and outputting a high-resolution face image.
Therefore, in the system for enhancing the resolution of a visual target image provided by this embodiment, training samples comprising high-resolution face image samples, low-resolution face image samples, and a corresponding preset number of face attribute samples are adopted, and a face image super-resolution model established based on a preset loss function performs resolution processing on the acquired low-resolution face image to be processed. In this way the low-resolution face image can be accurately and efficiently restored to a high-resolution face image, and a face image of higher definition can be obtained based on the prior provided by specific face attributes.
Example 3
The present embodiment provides an apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor may be configured to perform the steps of a method for visual target image resolution enhancement as described in embodiment 1 above.
Example 4
A storage medium having stored therein a program executable by a processor, the program executable by the processor being for performing the method steps of visual target image resolution enhancement as in embodiment 1.
Example 5
Referring to fig. 3 to 6, a flow chart of a method for enhancing the resolution of an image of a visual target is provided, which specifically includes the following steps:
A. collecting training samples, wherein the training samples comprise high-resolution face image samples, low-resolution face image samples and ten face attribute samples corresponding to the high-resolution face image samples and the low-resolution face image samples;
B. establishing a face image super-resolution model according to the acquired training samples;
C. acquiring a low-resolution face image to be processed and a face attribute corresponding to the low-resolution face image;
D. and processing the low-resolution face image to be processed and the corresponding face attribute thereof through the face image super-resolution model, and outputting the high-resolution face image.
Wherein, the specific implementation scheme of the step A is as follows:
a1, acquiring a public large-scale face image data set CelebA as a training data set. The data set is composed of 202599 human face images, the data set comprises an aligned subdata set and a non-aligned subdata set, the data sets respectively contain artificial labeling information, the aligned subset comprises 5 feature points including eyes, noses and mouths, and each piece of data comprises 40 attribute labels. And selecting the aligned face data subsets as a training set and a test set, and selecting 10 attributes shown in the figure 4 by screening the attributes which are beneficial to super-resolution of the face image. Because a large number of background areas exist in the original face pictures, in order to better extract the feature information of the face, firstly, the face images with the length and width of 120 are cut out from the pictures in a center cutting mode to be used as target high-resolution images.
A2. MATLAB's "imresize" function is used to perform 4× bicubic down-sampling on the high-resolution face images, yielding corresponding low-resolution face images of size 30 × 30 and forming triples {I_HR, I_LR, var}. Horizontal or vertical flipping, 90° rotation, and random cropping of image blocks are adopted for data enhancement.
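The data preparation of steps A1–A2 can be sketched as follows. This is a minimal illustration: block averaging stands in for MATLAB's bicubic `imresize`, and the function names, the per-sample augmentation choice, and the random attribute vector are hypothetical.

```python
import numpy as np

def downsample_x4(hr):
    """4x down-sampling by block averaging -- a simple stand-in for the
    bicubic `imresize` used in the patent (not the same interpolation)."""
    h, w, c = hr.shape
    return hr.reshape(h // 4, 4, w // 4, 4, c).mean(axis=(1, 3))

def make_triple(hr, attrs, rng):
    """Form one {I_HR, I_LR, var} training triple with random flip /
    rotation augmentation as described in step A2."""
    k = rng.integers(4)
    if k == 1:
        hr = hr[:, ::-1]        # horizontal flip
    elif k == 2:
        hr = hr[::-1, :]        # vertical flip
    elif k == 3:
        hr = np.rot90(hr)       # 90-degree rotation
    return hr, downsample_x4(hr), attrs

rng = np.random.default_rng(0)
hr = rng.random((120, 120, 3))        # center-cropped 120x120 HR face
attrs = rng.integers(0, 2, size=10)   # ten binary face attributes
i_hr, i_lr, var = make_triple(hr, attrs, rng)
print(i_hr.shape, i_lr.shape, var.shape)
```

Random cropping of image blocks, also mentioned in A2, is omitted here for brevity.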
The specific embodiment of the step B is as follows:
B1. The low-resolution face image is taken as the network input and denoted I_LR.
B2. A shallow feature extraction module Net_fea performs shallow feature extraction on the input image; the resulting features have 64 channels and the same spatial size as the input. The shallow feature extraction module consists of a 3 × 3 convolutional layer and an activation function, and its output can be expressed as:
F_LR = Net_fea(I_LR)
B3. A face structure feature extraction module inputs the shallow feature F_LR into an autoencoder and obtains the face image structural feature F_stc through encoding compression and decoding restoration. The encoder down-samples with max pooling, the decoder up-samples with deconvolution, and both use 2 convolutional layers with kernel size 3 to expand or compress the number of feature channels. This module can be expressed as:
F_stc = Net_De(Net_En(F_LR))
Here Net_En and Net_De are the encoder and decoder networks, whose structures mirror each other for consistency.
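The encoder–decoder of step B3 can be sketched in PyTorch as below. Only the max-pooling down-sampling, the deconvolution up-sampling, and the 2 kernel-3 convolutions on each side are stated in the text; the intermediate channel width and the ReLU activations are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FaceStructureAutoencoder(nn.Module):
    """Sketch of Net_De(Net_En(.)): encode-compress then decode-restore."""
    def __init__(self, ch=64):
        super().__init__()
        # Encoder: max-pooling down-sampling, then 2 conv layers (kernel 3)
        self.encoder = nn.Sequential(
            nn.MaxPool2d(2),
            nn.Conv2d(ch, ch * 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch * 2, ch * 2, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder mirrors the encoder: deconvolution up-sampling,
        # then 2 conv layers (kernel 3) compressing channels back
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch * 2, ch * 2, 2, stride=2),
            nn.Conv2d(ch * 2, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, f_lr):
        return self.decoder(self.encoder(f_lr))

f_lr = torch.randn(1, 64, 30, 30)   # shallow features of a 30x30 LR face
f_stc = FaceStructureAutoencoder()(f_lr)
print(f_stc.shape)
```

The mirrored structure keeps F_stc at the same 64-channel, 30 × 30 size as F_LR, as the text requires.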
B4. A deep feature extraction module processes the guiding attribute variable var and the face image structural feature F_stc concatenated along the channel dimension to obtain the deep feature F_deep. The deep feature extraction module consists of one 1 × 1 convolutional layer and a dual residual dense connection network. The 1 × 1 convolutional layer compresses the number of feature map channels. The dual residual dense connection network comprises a plurality of cascaded residual dense blocks, with skip connections added around each dense block and around the whole cascade. The residual dense blocks use multi-level feature information to improve network performance, allow a deep network to be constructed, and avoid the vanishing-gradient problem. The convolution kernels used in the deep feature extraction module (RRDB) are all 3 × 3, with a default channel number of 64. This process can be expressed as:
F_deep = Net_RRDB(conv(F_stc, var))
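A minimal PyTorch sketch of this B4 structure follows. The growth rate, the number of layers per dense block, the number of cascaded blocks, and the LeakyReLU slope are illustrative assumptions; the text fixes only the 3 × 3 kernels, the 64-channel default, the 1 × 1 compression convolution, and the skip-connection pattern.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """One residual dense block: densely connected 3x3 convs
    plus a local skip connection."""
    def __init__(self, ch=64, growth=32, n_layers=4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(ch + i * growth, growth, 3, padding=1)
            for i in range(n_layers))
        self.fuse = nn.Conv2d(ch + n_layers * growth, ch, 1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(self.act(conv(torch.cat(feats, dim=1))))
        return x + self.fuse(torch.cat(feats, dim=1))  # local residual

class DualResidualDenseNet(nn.Module):
    """Cascaded residual dense blocks with an extra outer skip
    connection around the whole cascade ('dual' residual)."""
    def __init__(self, ch=64, n_blocks=3):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualDenseBlock(ch)
                                      for _ in range(n_blocks)])

    def forward(self, x):
        return x + self.blocks(x)  # outer residual over the cascade

# The 1x1 conv compresses the (F_stc, var) concatenation back to 64 channels
compress = nn.Conv2d(64 + 10, 64, 1)
f_stc = torch.randn(1, 64, 30, 30)
var = torch.ones(1, 10, 30, 30)   # attribute maps after size transformation
f_deep = DualResidualDenseNet()(compress(torch.cat([f_stc, var], dim=1)))
print(f_deep.shape)
```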
B5. An upsampling module reconstructs the enlarged feature F_SR. After the deep feature F_deep is obtained, one convolution layer expands the number of channels to 4 times the original; the channels are then rearranged to their specific spatial positions, yielding a result whose feature scale is enlarged 2 times and whose channel number is reduced back to that of the input. In the implementation of the model, for the ×4 magnification model, two pixel-rearrangement upsampling modules are cascaded to obtain output features magnified 4 times. This module can be expressed as:
F_SR = Net_up(F_deep)
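The pixel-rearrangement step of B5 can be written directly in NumPy; the channel-expanding convolutions that precede each rearrangement are simulated here by simply replicating channels.

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Rearrange an (r*r*C, H, W) feature map into (C, r*H, r*W) by
    moving channel groups to their specific spatial positions, as the
    upsampling module does after its channel-expanding convolution."""
    c, h, w = x.shape
    oc = c // (r * r)
    x = x.reshape(oc, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)       # (oc, h, r, w, r)
    return x.reshape(oc, h * r, w * r)

# A convolution first expands channels 4x (64 -> 256, simulated here),
# then rearrangement gives 2x scale with channels back to 64.
f = np.random.default_rng(0).random((256, 30, 30))
up1 = pixel_shuffle(f)                               # (64, 60, 60)
# For the x4 model, two such modules are cascaded (conv omitted):
up2 = pixel_shuffle(np.concatenate([up1] * 4, axis=0))  # (64, 120, 120)
print(up1.shape, up2.shape)
```

Each rearrangement doubles the feature scale and divides the channel count by 4, so the cascade of two modules yields the ×4 output.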
B6. The amplified feature F_SR is reconstructed by a reconstruction module into the output super-resolution image I_SR with three RGB channels. The reconstruction module consists of two 3 × 3 convolutional layers and an activation function, and can be expressed as:
I_SR = Net_rec(F_SR)
B7. Using a loss function, the reconstructed high-resolution face image I_SR and the backed-up high-resolution face image samples are converged through back-propagation to establish the face image super-resolution model. Two cases are considered: a model based on reconstruction loss and a model based on adversarial loss.
For the model based on reconstruction loss, the loss function is the L1 loss, which computes the error between the network-generated high-resolution face image I_SR and the real high-resolution face image I_HR in the sample. This loss constrains the generated image to be closer to the real image. The L1 loss is:
L_1 = (1/N) Σ_{i=1}^{N} |I_SR(i) − I_HR(i)|
Here N denotes the total number of pixels; since the data set uses the sRGB color space, N = H × W × C, where W, H, and C are the width, height, and number of channels of the high-resolution face image. A learning rate is set, the gradient is back-propagated by minimizing this loss function error, network parameters are updated, and iteration continues until the network converges.
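As a numeric check of the L1 loss described above, with every pixel of a generated 120 × 120 × 3 image off by 0.5 the loss is exactly 0.5:

```python
import numpy as np

def l1_loss(i_sr, i_hr):
    """L1 loss: sum of absolute errors divided by N = H * W * C."""
    n = i_sr.size                       # N = H * W * C total pixel values
    return np.abs(i_sr - i_hr).sum() / n

i_hr = np.zeros((120, 120, 3))
i_sr = np.full((120, 120, 3), 0.5)     # every pixel value off by 0.5
print(l1_loss(i_sr, i_hr))
```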
For the model based on adversarial loss, a discriminator structure is added on top of the reconstruction-loss model. The generated super-resolution image and the real image are each fed into a discriminator network of VGG structure. Its feature extraction block consists of a convolutional layer with kernel size 3 and stride 1, a batch normalization layer, and a LeakyReLU activation layer; its scale compression block consists of a convolutional layer with kernel size 4 and stride 2, a batch normalization layer, and a LeakyReLU activation layer. After passing through the feature extraction and scale compression blocks several times, features of size 4 × 4 × c are obtained, where c is a hyperparameter setting the channel number. The corresponding attribute variables are then concatenated with the features extracted by the discriminator, and a binary real/fake judgment is output after a fully connected layer, expressed as:
{V_real, V_fake} = Net_D({I_SR, I_HR}, var)
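A sketch of this conditional discriminator follows. The channel count c, the block depth, and the assumption that inputs are resized to 128 × 128 (so that repeated stride-2 compression lands exactly on 4 × 4) are all illustrative; the text fixes only the two block types and the attribute concatenation before the fully connected layer.

```python
import torch
import torch.nn as nn

def feat_block(cin, cout):
    # feature extraction block: conv k3 s1 + BN + LeakyReLU
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2))

def compress_block(ch):
    # scale compression block: conv k4 s2 + BN + LeakyReLU
    return nn.Sequential(nn.Conv2d(ch, ch, 4, 2, 1),
                         nn.BatchNorm2d(ch), nn.LeakyReLU(0.2))

class AttributeDiscriminator(nn.Module):
    """VGG-style conditional discriminator sketch: image features are
    compressed to 4 x 4 x c, concatenated with the attribute vector,
    and judged real/fake through a fully connected layer."""
    def __init__(self, c=64, n_attr=10, in_size=128):
        super().__init__()
        layers, size = [feat_block(3, c)], in_size
        while size > 4:                  # compress until 4 x 4 spatial size
            layers += [feat_block(c, c), compress_block(c)]
            size //= 2
        self.features = nn.Sequential(*layers)
        self.fc = nn.Linear(4 * 4 * c + n_attr, 1)

    def forward(self, img, var):
        f = self.features(img).flatten(1)
        return torch.sigmoid(self.fc(torch.cat([f, var], dim=1)))

d = AttributeDiscriminator()
v = d(torch.randn(1, 3, 128, 128), torch.ones(1, 10))
print(v.shape)
```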
after joining the arbiter network, the model becomes an attribute-guided conditional generator arbiter network whose generator loss is made up of three parts:
LG=λ1*Lrec2*LVGG3*Ladv
λ1, λ2, and λ3 are the weights of the three losses. To ensure that the reconstructed image is as similar as possible to the real image in content, the image space is constrained pixel by pixel through the reconstruction loss, which uses the L1 loss:
L_rec = (1/N) Σ_{i=1}^{N} |I_SR(i) − I_HR(i)|
where N = H × W × C is the total number of pixels, and W, H, and C are the width, height, and number of channels of the high-resolution face image.
Meanwhile, to enrich the texture information of the image, the feature information extracted from the reconstructed image by a fixed classification network (VGG) should be similar to that of the real image; the perceptual loss used to constrain the reconstructed image is defined as:
L_VGG = (1/M) Σ_{i=1}^{M} |φ(I_SR)(i) − φ(I_HR)(i)|
where φ denotes the fixed VGG feature extractor and M = H × W × C is the size of the chosen feature map.
The adversarial loss aims to make the reconstructed image and the real image as close as possible in distribution, and is defined as:
L_adv = log(1 − Net_D(Net_G(I_LR, var)))
where var represents the attribute information of the face.
The discriminator loss is the matching loss after adding the attribute variables, constrained using the positive-sample loss:
L_D = log(1 − Net_D(I_HR, var)) + log(Net_D(Net_G(I_LR, var), var))
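The generator and discriminator losses can be evaluated numerically as below. The discriminator outputs, the reconstruction/perceptual loss values, and the weights λ1, λ2, λ3 are all placeholder numbers, not values given by the patent.

```python
import math

# Hypothetical discriminator outputs (probabilities) for one sample
d_real = 0.9      # Net_D(I_HR, var)
d_fake = 0.2      # Net_D(Net_G(I_LR, var), var)
l_rec, l_vgg = 0.05, 0.10              # placeholder loss values

# Generator loss: weighted sum of reconstruction, perceptual, and
# adversarial terms (weights are illustrative only)
lam1, lam2, lam3 = 1.0, 0.1, 0.01
l_adv = math.log(1 - d_fake)
l_g = lam1 * l_rec + lam2 * l_vgg + lam3 * l_adv

# Discriminator matching loss with attribute variables
l_d = math.log(1 - d_real) + math.log(d_fake)
print(l_g, l_d)
```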
and setting a learning rate, reversely propagating the gradient by minimizing the loss function error, updating network parameters, and continuously iterating until the network is trained to be convergent.
In the back-propagation training, the batch size is set to 16 and the initial learning rate to 10^-4. During iterative training, according to the convergence of the network, on the model based on reconstruction loss the learning rate is halved when the total number of training iterations reaches {2×10^5, 4×10^5, 6×10^5, 8×10^5}; on the model based on adversarial loss it is halved when the total reaches {5×10^4, 1×10^5, 2×10^5, 3×10^5}. This example uses an ADAM optimizer for gradient back-propagation, with parameters β1 = 0.9, β2 = 0.999, and ε = 10^-8. The model based on reconstruction loss uses the L1 loss to compute the error between the network-generated high-resolution face image I_SR and the original high-resolution face image I_HR; the network parameters are updated by back-propagation to minimize this error, iterating until the network converges. The model based on adversarial loss uses the L1 loss to keep the reconstructed image as similar as possible to the real image in content, the perceptual loss to keep the image texture information as similar as possible to the real image, and the adversarial loss to bring the distributions of the reconstructed and real images as close as possible; the coefficients of these loss functions are set accordingly, the network parameters are updated by back-propagating to minimize the sum of these errors, and iteration continues until the network converges.
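The step-decay schedule for the reconstruction-loss model can be expressed as a small helper (the function name is illustrative; the milestones and initial rate are those stated above):

```python
def lr_at(iteration, base=1e-4, milestones=(2e5, 4e5, 6e5, 8e5)):
    """Learning rate at a given iteration: the base rate of 1e-4 is
    halved at each milestone the iteration count has reached."""
    halvings = sum(iteration >= m for m in milestones)
    return base / (2 ** halvings)

print(lr_at(0), lr_at(2 * 10**5), lr_at(9 * 10**5))
```

For the adversarial-loss model the same schedule applies with milestones {5×10^4, 1×10^5, 2×10^5, 3×10^5}.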
The scheme of the step C is specifically as follows:
and acquiring a pre-divided CelebA test data set, wherein the data set comprises various low-resolution face images and face attribute variables corresponding to the low-resolution face images.
The scheme of the step D is specifically as follows:
inputting the low-resolution face image of the CelebA test data set to be restored into a trained face image super-resolution model, performing the implementation scheme of the step B on the face image of the CelebA test data set through the face image super-resolution model, connecting the extracted structural features of the face image and the face attribute variables in series before a deep feature extraction module, and outputting the high-resolution face image through subsequent network processing.
In summary, the method, system, apparatus, and storage medium for enhancing the resolution of a visual target image provided by the embodiments of the present invention adopt training samples comprising high-resolution face image samples, low-resolution face image samples, and a corresponding preset number of face attribute samples, and perform resolution processing on the acquired low-resolution face image to be processed through a face image super-resolution model established based on a preset loss function. The low-resolution face image can thus be accurately and efficiently restored to a high-resolution face image, and a face image of higher definition can be obtained based on the prior provided by specific face attributes.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for resolution enhancement of an image of a visual target, comprising:
processing the low-resolution face image to be processed and the corresponding face attribute by adopting a pre-trained face image super-resolution model, and outputting a high-resolution face image; the training method of the face image super-resolution model comprises the following steps:
acquiring training samples, wherein the training samples comprise high-resolution face image samples, low-resolution face image samples and a preset number of face attribute samples corresponding to the high-resolution face image samples and the low-resolution face image samples;
and establishing a face image super-resolution model based on a preset loss function and the high-resolution face image sample according to the training sample.
2. The method of claim 1, wherein the acquiring training samples comprising high resolution face image samples, low resolution face image samples and a predetermined number of face attribute samples comprises:
collecting a high-resolution face image sample, and obtaining the high-resolution face image sample by adopting a face image data set CelebA and backing up the high-resolution face image sample;
adopting an image scaling algorithm to carry out down-sampling on the high-resolution face image sample to generate a low-resolution face image sample;
and selecting attributes which are beneficial to super-resolution of the face image in the face image data set CelebA as corresponding face attribute samples, and selecting a preset number of face attribute samples.
3. The method for resolution enhancement of visual target images according to claim 2, wherein said building a super-resolution model of facial images based on a pre-determined loss function and high resolution facial image samples from said training samples comprises:
acquiring low-resolution face image samples and a corresponding preset number of face attribute samples;
extracting shallow features of the low-resolution face image based on a convolutional neural network to generate low-resolution face image features;
combining with an automatic encoder, and acquiring the structural characteristics of the face image by carrying out operations of encoding compression and decoding reduction on the low-resolution face image characteristics;
extracting deep features from the serial connection of the structural features of the face images and the corresponding preset number of face attributes by adopting a double residual dense connection network;
pixel rearrangement upsampling is adopted to enlarge the characteristic dimension and reconstruct a high-resolution face image;
and reversely converging the reconstructed high-resolution face image and the backed-up high-resolution face image sample based on a preset loss function, and establishing a face image super-resolution model.
4. The method for resolution enhancement of visual target images according to claim 3, wherein said combining with an automatic encoder, obtaining structural features of the facial image by operations of encoding compression and decoding restoration on said low resolution facial image features, comprises:
the encoder downsamples using a maximum pooling operation, expands or compresses the number of characteristic channels using 2 convolutional layers with a convolutional kernel size of 3;
the decoder performs upsampling using deconvolution, expanding or compressing the number of feature channels using 2 convolutional layers of convolution kernel size 3.
5. The method of claim 3, wherein the extracting deep features from the concatenation of the structural features of the face image and the corresponding predetermined number of face attributes using a dual residual dense connectivity network comprises:
carrying out size transformation on feature maps input by a preset number of face attribute variables;
connecting the structural features of the face image and the face attribute variables in series in a channel dimension;
compressing the number of channels of the characteristic graphs connected in series through a convolution layer with the convolution kernel size of 3;
and extracting deep features from the features with the compressed channel number through a double residual dense connection network.
6. The method of enhancing the resolution of a visual target image according to claim 3, wherein said reconstructing a high resolution face image by upscaling the feature size using pixel rebinning upsampling comprises:
based on pixel rearrangement, the number of channels is expanded to 4 times of the original number by one layer of convolution operation, and then after rearrangement is carried out by utilizing specific positions of the channels, a result that the characteristic scale is expanded by 2 times and the number of the channels is reduced to be consistent with the input is obtained; in the implementation of the model, aiming at a multiplied by 4 amplification factor model, two pixel rearrangement upsampling modules are cascaded to obtain output characteristics amplified by 4 times;
and reconstructing a high-resolution face image by adopting 2 convolution layers with convolution kernel size of 3.
7. The method for enhancing the resolution of a visual target image according to claim 1, wherein the processing the low-resolution face image to be processed and the corresponding face attribute thereof by using the pre-trained super-resolution model of the face image and outputting the high-resolution face image comprises:
inputting the low-resolution face image into a face structural feature extraction model to obtain face structural features;
and connecting the human face structural features and the human face attribute variables in series, inputting the human face structural features and the human face attribute variables into a subsequent human face image super-resolution model, and outputting a high-resolution human face image.
8. A system for visual target image resolution enhancement, comprising:
the output module is used for processing the low-resolution face image to be processed and the face attribute corresponding to the low-resolution face image to be processed by adopting a pre-trained face image super-resolution model and outputting a high-resolution face image; the training module of the face image super-resolution model comprises:
the acquisition submodule is used for acquiring training samples, and the training samples comprise high-resolution face image samples, low-resolution face image samples and a preset number of face attribute samples corresponding to the high-resolution face image samples;
and the model establishing submodule is used for establishing a face image super-resolution model based on a preset loss function and a high-resolution face image sample according to the training sample.
9. An apparatus comprising a memory for storing at least one program and a processor for loading the at least one program to perform the method of any one of claims 1-7.
10. A storage medium storing a program executable by a processor, the program executable by the processor being configured to perform the method of any one of claims 1-7 when executed by the processor.
CN202110698882.3A 2021-06-23 2021-06-23 Method, system, device and storage medium for enhancing resolution of visual target image Pending CN113554058A (en)

Publications (1)

Publication Number Publication Date
CN113554058A true CN113554058A (en) 2021-10-26
