CN112598579B - Monitoring scene-oriented image super-resolution method, device and storage medium - Google Patents
- Publication number: CN112598579B (application CN202011579005.6A)
- Authority: CN (China)
- Prior art keywords: image, resolution, low-resolution image, feature
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T3/4053 — Scaling of whole images or parts thereof, e.g. expanding or contracting, based on super-resolution, i.e. the output image resolution being higher than the sensor resolution (G06T — image data processing or generation, in general)
- G06T3/4076 — Scaling based on super-resolution using the original low-resolution images to iteratively correct the high-resolution images
- G06F18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting (G06F — electric digital data processing)
- G06F18/253 — Pattern recognition: fusion techniques of extracted features
- G06N3/045 — Neural networks: combinations of networks (G06N — computing arrangements based on specific computational models)
- G06N3/08 — Neural networks: learning methods
- Y02T10/40 — Engine management systems (Y02T — climate change mitigation technologies related to transportation)
Abstract
The application relates to a monitoring scene-oriented image super-resolution method, device and storage medium, belonging to the technical field of image processing. The method comprises: inputting a target low-resolution image into a pre-trained feature mapping network to obtain high-dimensional features in a target feature space; and inputting the target low-resolution image and the high-dimensional features into a pre-trained image reconstruction network to obtain a high-resolution image. The method addresses the problem that low-resolution images synthesized by existing deep-learning-based image super-resolution methods differ from real low-resolution images and generalize poorly. Because the feature mapping network learns the feature mapping relationship in advance, and the image reconstruction network is trained on the outputs of the feature mapping network, the domain difference between synthesized and real low-resolution images is further reduced and the image reconstruction effect is improved.
Description
Technical Field
The application relates to a monitoring scene-oriented image super-resolution method, device and storage medium, and belongs to the technical field of image processing.
Background
The image super-resolution (Image Super Resolution) technique refers to a technique of restoring a low-resolution image to a high-resolution image.
With the development of convolutional neural networks (Convolutional Neural Network, CNN) and generative adversarial networks (Generative Adversarial Network, GAN) in the field of pixel-level image processing, learning-based image super-resolution methods have emerged in large numbers, for example: image super-resolution using the Super-Resolution Convolutional Network (SRCNN); image super-resolution using the Super-Resolution Generative Adversarial Network (SRGAN); or image super-resolution using Enhanced Deep Residual Networks for Single Image Super-Resolution (EDSR), among others.
In the learning-based image super-resolution methods, a network model must be trained with image pairs, each consisting of a low-resolution image and a high-resolution image, where the low-resolution image in each pair is obtained by bilinear downsampling of the high-resolution image in that pair.
However, low-resolution images obtained in this way generalize poorly and differ from low-resolution images acquired in actual scenes, so a network model trained with such images reconstructs high-resolution images poorly.
Disclosure of Invention
The application provides a monitoring scene-oriented image super-resolution method, device and storage medium, which can solve the problem that low-resolution images synthesized by existing deep-learning-based image super-resolution methods differ from real low-resolution images and generalize poorly. The application provides the following technical solutions:
In a first aspect, a method for super resolution of an image for a monitoring scene is provided, the method comprising:
acquiring a target low-resolution image to be restored;
inputting the target low-resolution image into a pre-trained feature mapping network to obtain high-dimensional features in a target feature space; the feature mapping network is obtained by training with first training data; the first training data includes a first low-resolution image and a second low-resolution image; the first low-resolution image is a low-resolution image synthesized based on a high-resolution image; the second low-resolution image is a low-resolution image acquired from an actual scene;
inputting the target low-resolution image and the high-dimensional features into a pre-trained image reconstruction network to obtain a high-resolution image corresponding to the target low-resolution image; the image reconstruction network is trained with second training data, where the second training data comprises a high-resolution image, a first low-resolution image corresponding to the high-resolution image, and the output result obtained after the first low-resolution image is input into the feature mapping network.
Optionally, the training process of the feature mapping network includes:
Inputting the first training data into a preset initial network model; the initial network model is used for learning the spatial representation of the first training data;
Training the initial network model by using a first loss function, and restricting the feature space of the image features of the first training data so that the image features of each image in the first training data are mapped to the target feature space to obtain the feature mapping network.
Optionally, the initial network model is a variational autoencoder model with shared parameters.
Optionally, the first loss function includes an L1 loss function and an adversarial loss function; the adversarial loss function is used to constrain the feature space of the image features, and the L1 loss function is used to reduce the difference between the model estimation result and the real result.
Optionally, the image reconstruction network includes a feature fusion layer connected to the feature mapping network, where the feature fusion layer is configured to fuse the image feature with the high-dimensional feature.
Optionally, the fusing the image feature with the high-dimensional feature includes:
Splicing the image features and the high-dimensional features to obtain spliced features;
and fusing the spliced features through a convolution layer with a preset size to obtain the fused features.
Optionally, the image reconstruction network further includes a shallow feature extraction layer located before the feature fusion layer, a depth feature extraction layer located after the feature fusion layer, an upsampling layer, and a reconstruction layer;
the shallow feature extraction layer is used for extracting shallow features of the target low-resolution image;
The feature fusion layer is used for fusing the shallow features and the high-dimensional features to obtain the fusion features;
the depth feature extraction layer is used for extracting depth features of the fusion features;
the up-sampling layer is used for improving the resolution of the depth features to obtain a high-resolution feature map;
and the reconstruction layer is used for recovering the high-resolution characteristic map to obtain the high-resolution image.
Optionally, the depth feature extraction layer assigns different weights to each channel in the fusion feature based on an attention mechanism in order to extract high-resolution features.
Optionally, the training process of the image reconstruction network includes:
inputting the second training data into a preset super-resolution network model;
training the super-resolution network model with a second loss function to obtain the image reconstruction network; the second loss function comprises an L1 loss function and a perceptual loss function, where the perceptual loss function is used to improve the semantic similarity between the model estimation result and the real result, and the L1 loss function is used to reduce the difference between the model estimation result and the real result.
Optionally, the first low-resolution image is synthesized from the corresponding high-resolution image and pre-extracted low-quality features; the low-quality features are image features extracted from a plurality of second low-resolution images.
Optionally, the low quality features include blur kernels and/or noise.
Optionally, the blur kernel is extracted from the second low-resolution image using a pre-trained generative adversarial network, the generative adversarial network comprising a generator for modeling the blur kernel;
The noise is obtained by patch extraction of the second low-resolution image.
In a second aspect, a monitoring scene-oriented image super-resolution device is provided, the device comprising a processor and a memory; the memory stores a program, and the program is loaded and executed by the processor to implement the monitoring scene-oriented image super-resolution method according to the first aspect.
In a third aspect, a computer-readable storage medium is provided, in which a program is stored; the program is loaded and executed by a processor to implement the monitoring scene-oriented image super-resolution method according to the first aspect.
The application has the following beneficial effects: the target low-resolution image is input into a pre-trained feature mapping network to obtain high-dimensional features in the target feature space, and the target low-resolution image and the high-dimensional features are then input into a pre-trained image reconstruction network to obtain the corresponding high-resolution image. This addresses the problem that low-resolution images synthesized by existing deep-learning-based image super-resolution methods differ from real low-resolution images and generalize poorly. The image reconstruction network is trained with high-resolution images, the corresponding first low-resolution images, and the outputs obtained after the first low-resolution images are passed through the feature mapping network; the feature mapping network is trained with the first and second low-resolution images. In this way, the feature mapping network learns the feature mapping relationship in advance, so that the image features of the first and second low-resolution images can be mapped into the same feature space (namely, the target feature space); and training the image reconstruction network on the outputs of the feature mapping network further reduces the domain difference between synthesized and real low-resolution images and improves the image reconstruction effect. Here the first low-resolution image is a low-resolution image synthesized based on a high-resolution image, and the second low-resolution image is a low-resolution image acquired from an actual scene.
In addition, the second loss function is tailored to the image super-resolution setting: intermediate layers of the VGG network are extracted as a perceptual loss, and the super-resolution network model is jointly optimized together with the pixel-level L1 loss, which improves the perceptual quality of the high-resolution image.
In addition, by introducing an attention mechanism in the image reconstruction network, the image characteristics of different channels can be treated differently, and the characterization capability of the network is improved.
In addition, the low-quality features are extracted from the second low-resolution images, and the first low-resolution image is obtained by combining the low-quality features with the high-resolution image, so the first low-resolution image is closer to a really acquired low-resolution image, and the generalization performance of the network model trained with the first low-resolution image is improved.
In addition, since a generative adversarial network can learn properties similar to those of real images, training the generative adversarial network in advance and using it to extract the blur kernel among the low-quality features improves the accuracy of blur kernel extraction.
In addition, by implementing the initial network model as a parameter-sharing VAE, the input image is not compressed into a latent-space code but converted into the two most common statistical distribution parameters, namely the mean and the standard deviation; a normal distribution in the latent space can then be defined from the mean and standard deviation, which improves the accuracy of the feature mapping.
In addition, the feature mapping network is trained with the first loss function, which includes an adversarial loss function; since the adversarial loss is based on a generative adversarial network, the output result approximates the real result more closely and the training effect is improved.
The foregoing is only an overview of the technical solutions of the present application. To provide a better understanding, the application is described in detail below with reference to its preferred embodiments and the accompanying drawings.
Drawings
Fig. 1 is a schematic diagram of a network structure of an RCAN according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for super resolution of images for a monitoring scene according to an embodiment of the present application;
FIG. 3 is a schematic view of a feature spatial distribution of a first low resolution image before feature mapping and after feature mapping according to an embodiment of the present application;
FIG. 4 is a flow chart of a training method for a feature mapping network provided by one embodiment of the present application;
FIG. 5 is a schematic diagram of a feature mapping process of a feature mapping network provided by one embodiment of the present application;
FIG. 6 is a schematic diagram of an image reconstruction network according to one embodiment of the present application;
FIG. 7 is a flow chart of a training method for an image reconstruction network according to one embodiment of the present application;
FIG. 8 is a schematic diagram of a monitoring scene oriented image super resolution process provided by an embodiment of the present application;
FIG. 9 is a flow chart of a blur kernel extraction method provided by one embodiment of the present application;
FIG. 10 is a schematic diagram of a blur kernel provided by one embodiment of the present application;
FIG. 11 is a flow chart of a noise extraction method provided by one embodiment of the present application;
FIG. 12 is a schematic diagram of noise provided by one embodiment of the present application;
FIG. 13 is a block diagram of a monitoring scene oriented image super-resolution device according to one embodiment of the present application;
fig. 14 is a block diagram of an image super-resolution device for a monitoring scene according to an embodiment of the present application.
Detailed Description
The following describes in further detail the embodiments of the present application with reference to the drawings and examples. The following examples are illustrative of the application and are not intended to limit the scope of the application.
First, several terms related to the present application will be described.
Image resolution refers to the amount of information stored in an image. The higher the resolution of the image, the clearer the corresponding image, and the higher the quality; the lower the image resolution the more blurred the corresponding image and the lower the quality.
Super-Resolution (SR) refers to reconstructing a corresponding high Resolution image from an observed low Resolution image.
Residual channel attention network (Residual Channel Attention Network, RCAN): for adaptively learning features in different channels in deeper networks.
Because a low-resolution (LR) image contains a large amount of low-frequency information, while the features in each channel of a generic convolutional neural network are treated equally, the ability to learn discriminatively across feature channels is lacking, the representational capacity of a deep network is limited, and the super-resolution requirement of recovering as much high-frequency information as possible is not met. The RCAN therefore treats different channels differently and improves the representational capacity of the network.
The RCAN includes: a shallow feature extraction layer, a depth feature extraction layer, an upsampling layer and a reconstruction layer. Each module is described separately below.
1. The shallow feature extraction layer is used for extracting shallow features from the input image.
In one example, the shallow feature extraction layer uses a convolution layer (conv) to extract the shallow feature F_0 from the input image I_LR:
F_0 = H_SF(I_LR)
where H_SF denotes the convolution operation.
2. The depth feature extraction layer is used for carrying out depth feature extraction based on the output of the shallow feature extraction layer.
In one example, the depth feature extraction layer extracts the depth feature F_DF using residual-in-residual (Residual in Residual, RIR) blocks:
F_DF = H_RIR(F_0)
where H_RIR denotes the RIR module, which comprises G residual groups.
Each RIR module comprises a long skip connection and short skip connections, which helps to transmit rich low-frequency information so that the main network can learn more effective information.
In addition, the RIR module introduces a channel attention mechanism (Channel Attention, CA). Traditional convolutional neural network (CNN) based SR approaches treat the channel features of the low-resolution (LR) input equally. To make the network focus on more informative features, the interdependence between feature channels is exploited to form a channel attention mechanism.
In the feature maps of the network, different channels capture different features, and they contribute differently to recovering the high-frequency details required by the super-resolution task; therefore, giving different weights to the channels of the feature map with a channel attention mechanism increases the discriminability between channels.
3. The up-sampling layer is used for improving the resolution ratio of the depth features and obtaining a high-resolution feature map.
In one example, the upsampling layer is a deconvolution layer (also referred to as transposed convolution), a combination of nearest-neighbor upsampling and convolution, or an efficient sub-pixel convolutional neural network (Efficient Sub-Pixel Convolutional Neural Network, ESPCN); this embodiment does not limit the network structure of the upsampling layer.
The upsampling process of the upsampling layer is represented by the following formula:
F_UP = H_UP(F_DF)
where H_UP denotes the upsampling operation and F_UP denotes the feature map obtained after upsampling.
4. The reconstruction layer is used for recovering the high-resolution feature map to obtain a high-resolution image.
In one example, the reconstruction layer reconstructs the features of the high-resolution feature map through one conv layer; the reconstruction process is represented by:
I_SR = H_REC(F_UP) = H_RCAN(I_LR)
where I_SR denotes the resulting high-resolution image, H_REC denotes the reconstruction convolution module, and H_RCAN denotes the whole RCAN network.
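To make the four modules above concrete, the following is a minimal PyTorch-style sketch of an RCAN-like forward pass (shallow conv, residual groups, sub-pixel upsampling, reconstruction conv). It is provided for illustration only; the class name, channel counts, number of residual groups and use of plain residual blocks instead of full channel-attention groups are assumptions, not the exact configuration described above.

```python
import torch
import torch.nn as nn

class SimpleRCAN(nn.Module):
    """Illustrative RCAN-style pipeline: H_SF -> H_RIR -> H_UP -> H_REC."""
    def __init__(self, n_colors=3, n_feats=64, n_groups=4, scale=4):
        super().__init__()
        # Shallow feature extraction H_SF: a single conv layer
        self.head = nn.Conv2d(n_colors, n_feats, 3, padding=1)
        # Depth feature extraction H_RIR: G residual groups (plain residual blocks here)
        self.body = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(n_feats, n_feats, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(n_feats, n_feats, 3, padding=1),
            ) for _ in range(n_groups)
        ])
        # Upsampling H_UP: sub-pixel convolution (ESPCN-style PixelShuffle)
        self.upsample = nn.Sequential(
            nn.Conv2d(n_feats, n_feats * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        # Reconstruction H_REC: map features back to an RGB image
        self.tail = nn.Conv2d(n_feats, n_colors, 3, padding=1)

    def forward(self, i_lr):
        f0 = self.head(i_lr)              # F_0 = H_SF(I_LR)
        f = f0
        for group in self.body:
            f = f + group(f)              # short skip connections
        f_df = f + f0                     # long skip connection of the RIR structure
        f_up = self.upsample(f_df)        # F_UP = H_UP(F_DF)
        return self.tail(f_up)            # I_SR = H_REC(F_UP)

# Usage: a 64x64 LR patch becomes a 256x256 SR image at scale 4.
sr = SimpleRCAN()(torch.randn(1, 3, 64, 64))
print(sr.shape)  # torch.Size([1, 3, 256, 256])
```

Sub-pixel (PixelShuffle) upsampling is chosen here only because it is one of the options listed for the upsampling layer; a transposed convolution or nearest-neighbor-plus-convolution variant would fit the same slot.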
The loss functions used by the RCAN network during training include, but are not limited to, at least one of the following: an L1 loss function, an L2 loss function, a generative adversarial network (GAN) loss function, a perceptual loss function (perceptual loss), etc.
The L1 loss function is used to reduce the difference between the model estimation result and the real result. In other words, the L1 loss function is used to minimize the sum of absolute differences of the target value and the estimated value.
The L2 loss function is used to minimize the sum of squares of the difference between the target value and the estimated value.
The GAN loss function is implemented with neural networks; through the interplay between two networks, it drives the estimate toward the distribution of the true values.
The perceptual loss function is used for improving semantic similarity between the model estimation result and the real result.
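As a quick illustration of the pixel-level terms listed above, the sketch below computes the L1 term (sum of absolute differences) and the L2 term (sum of squared differences) between an estimated image and its target; the tensor names are hypothetical placeholders.

```python
import torch

def l1_loss(estimate, target):
    # Sum of absolute differences between estimate and target
    return (estimate - target).abs().sum()

def l2_loss(estimate, target):
    # Sum of squared differences between estimate and target
    return ((estimate - target) ** 2).sum()

i_sr = torch.rand(1, 3, 64, 64)   # model estimate
i_hr = torch.rand(1, 3, 64, 64)   # ground-truth high-resolution image
print(l1_loss(i_sr, i_hr).item(), l2_loss(i_sr, i_hr).item())
```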
Autoencoder (AE): a neural network trained with back-propagation to make the output value equal to the input value; it compresses the input into a latent-space representation and then reconstructs the output from that representation. In essence, an autoencoder is a data compression algorithm in which both compression and decompression are implemented by neural networks. Building an autoencoder requires two parts: an encoder (Encoder) and a decoder (Decoder). The encoder compresses the input into the latent-space representation, which can be expressed as a function f(x); the decoder reconstructs the output from the latent-space representation, which can be expressed as a function g(x). Both the encoding function f(x) and the decoding function g(x) are neural network models.
Variational autoencoder (Variational Autoencoder, VAE): a conventional autoencoder is a lossy data compression algorithm, so it does not necessarily yield the best reconstruction or a well-structured latent representation. A VAE does not compress the input image into a latent-space code; instead, it converts the image into the two most common statistical distribution parameters, namely a mean and a standard deviation. The VAE also includes an encoder and a decoder. The encoder converts the input image into the two parameters that describe the latent space, the mean and the variance, which define a normal distribution in the latent space; random samples are then drawn from this normal distribution. The decoder maps the sampled points in the latent space back to the original input image, achieving reconstruction.
Suppose that for the original data samples {X_1, ..., X_n}, the variable X describes the population of these samples. If the distribution p(X) of X is known, p(X) can be sampled directly. When p(X) is unknown, X has to be obtained by transforming p(X), which is what the VAE achieves. Specifically, p(X) is expressed as
p(X) = Σ_Z p(X|Z) p(Z)
where p(X|Z) is the model that generates X from Z, and Z obeys the standard normal distribution, that is, p(Z) = N(0, 1). Z is sampled from the standard normal distribution and X is then computed from Z, which yields the generative model.
Referring to the schematic diagram of the VAE shown in fig. 1, the VAE uses a neural network to characterize the original data in latent space by a mean and a variance, describes the original data as a normal distribution, samples from that normal distribution, feeds the samples into the decoding structure, and finally generates the target image.
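The following is a minimal sketch of the VAE idea just described: the encoder predicts a mean and a log-variance, a latent vector is sampled from the resulting normal distribution via the reparameterization trick, and the decoder maps the sample back to the input space. The dimensions, layer choices and loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim=64 * 64, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)       # mean of the latent normal
        self.to_logvar = nn.Linear(256, latent_dim)   # log-variance of the latent normal
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)          # sample Z ~ N(mu, std^2)
        return self.decoder(z), mu, logvar

x = torch.rand(8, 64 * 64)                            # 8 flattened 64x64 images
recon, mu, logvar = TinyVAE()(x)
# KL term pulling the latent distribution toward the standard normal p(Z) = N(0, 1)
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = nn.functional.mse_loss(recon, x, reduction="sum") + kl
```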
Optionally, the embodiments of the present application are described with an electronic device as the execution body. The electronic device may be a terminal or a server, and the terminal may be a computer, a mobile phone, a tablet computer, etc.; this embodiment does not limit the device type of the electronic device.
In addition, the application scenes of the monitoring scene-oriented image super-resolution method provided by the application comprise, but are not limited to, the following:
First scenario: reconstructing the low-resolution face image into a high-resolution face image. Such as: personnel monitoring scenes, personnel attendance scenes and the like.
The second scenario: the low-resolution vehicle image is reconstructed into a high-resolution vehicle image, for example in a checkpoint (bayonet) monitoring scene, a traffic violation monitoring scene, etc.
Of course, in other embodiments, the image super-resolution method facing the monitoring scene may also be applied to other scenes, which are not listed here.
Fig. 2 is a flowchart of an image super-resolution method for a monitoring scene according to an embodiment of the present application. The method at least comprises the following steps:
in step 201, a target low resolution image to be restored is acquired.
The target low-resolution image is obtained by image acquisition of a real scene by using an image acquisition device. The target low resolution image may be a frame of image in a video stream; or as a single image; the target low-resolution image may be sent by other devices or acquired by the electronic device through a camera, and the source of the target low-resolution image is not limited in this embodiment.
Optionally, after the electronic device acquires a target image, detecting the image resolution of the target image; when the image resolution is less than the preset threshold, determining that the target image is a target low resolution image, and executing step 202; when the image resolution is greater than or equal to a preset threshold, determining that the target image is not the target low-resolution image, and carrying out resolution detection on the next target image until all target images are traversed, and ending the process.
Ways of detecting the image resolution include, but are not limited to: reading it from the image information of the target image, or calling an image editing program to read the image attribute information; this embodiment does not limit the way the image resolution is detected.
Or after the electronic equipment acquires a target image, determining whether the image acquisition equipment for shooting the target image is blacklist equipment; if yes, determining the target image as a target low resolution image, and executing step 202; if not, determining that the target image is not the target low-resolution image, and determining whether the image acquisition device of the next target image is a blacklist device or not until all target images are traversed, and ending the flow.
Wherein the blacklist device comprises at least one device identification, each device identification being for indicating an image capturing device for capturing a low resolution image. Alternatively, the device identifier may be a device number, a number, or the like of the image capturing device, and the implementation manner of the device identifier is not limited in this embodiment.
In other embodiments, the electronic device may determine whether the target image is a target low resolution image in other manners, which are not further listed herein; or the electronic device takes each acquired target image as a target low-resolution image, and directly executes step 202 after acquiring the target image (i.e., the target low-resolution image).
Optionally, after acquiring the target low resolution image, the electronic device determines whether the target low resolution image has performed a high resolution restoration process; if yes, ending the flow; if not, step 202 is performed.
Determining whether the target low-resolution image has already gone through the high-resolution restoration process includes, but is not limited to: calculating digest information of the target low-resolution image and comparing it with historical digest information; if they match, it is determined that the target low-resolution image has already been restored, and if not, it is determined that it has not. The historical digest information is the digest information of historical low-resolution images, i.e., low-resolution images on which the high-resolution restoration process has already been performed. Alternatively, the similarity between the target low-resolution image and the historical low-resolution images is compared; if the similarity is greater than a similarity threshold, it is determined that the target low-resolution image has already been restored, and if the similarity is less than or equal to the threshold, it is determined that it has not. In other embodiments, the electronic device may determine in other ways whether the target low-resolution image has already been restored, which are not listed one by one here.
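A minimal sketch of the digest-based check described above, assuming for illustration that the digest is an MD5 hash of the image bytes and that previously processed digests are kept in an in-memory set (both are assumptions, not requirements of the method):

```python
import hashlib

processed_digests = set()  # digests of low-resolution images already restored

def should_restore(image_bytes: bytes) -> bool:
    """Return True if this image has not been through the restoration process yet."""
    digest = hashlib.md5(image_bytes).hexdigest()
    if digest in processed_digests:
        return False          # an identical image was already restored; skip it
    processed_digests.add(digest)
    return True
```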
Step 202, inputting a target low-resolution image into a pre-trained feature mapping network to obtain high-dimensional features in a target feature space; the feature mapping network is obtained by training the first training data; the first training data includes a first low resolution image and a second low resolution image; the first low-resolution image is a low-resolution image synthesized based on the high-resolution image; the second low-resolution image is a low-resolution image acquired from the actual scene.
The feature mapping network is pre-stored in the electronic equipment, and the feature mapping network can be obtained by training in the electronic equipment; or may be stored in the electronic device after training in other devices.
The feature mapping network is used for extracting high-dimensional features of the target low-resolution image, and mapping the high-dimensional features to the target feature space, so that the high-dimensional features in the target feature space are obtained.
For a first low-resolution image and a second low-resolution image corresponding to the same subject, there may be a domain difference between the images. Based on this, in this embodiment the feature mapping network is obtained by pre-training so that the high-dimensional features of the first low-resolution image and those of the second low-resolution image are mapped into the same feature space before the super-resolution network model is trained; the trained image reconstruction network then performs image reconstruction using the high-dimensional features in the target feature space, which reduces the domain difference between synthesized and real low-resolution images and improves the image reconstruction effect.
For a first low-resolution image and a second low-resolution image corresponding to the same photographic subject, suppose their high-dimensional features are visualized with a high-dimensional feature visualization tool (e.g., a t-distributed stochastic neighbor embedding (t-SNE) tool); the visualization result is shown in fig. 3. In the feature space shown on the left side of fig. 3, the high-dimensional features of the first low-resolution image cannot cover the high-dimensional features of the second low-resolution image, as shown by the circled portion on the left of fig. 3. After the first and second low-resolution images are passed through the feature mapping network, the feature space shown on the right side of fig. 3 is obtained, in which the high-dimensional features of the first low-resolution image can cover those of the second low-resolution image; in other words, the two sets of high-dimensional features lie in a unified feature space. In fig. 3, dark dots represent the high-dimensional features of the first low-resolution image and light dots represent those of the second low-resolution image.
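A feature-space comparison of the kind shown in fig. 3 can be produced with a standard t-SNE projection of the high-dimensional features of the two image sets. The sketch below uses scikit-learn and matplotlib; the feature arrays are random placeholders standing in for features extracted from the two kinds of low-resolution images.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder feature matrices: one row per image, one column per feature dimension
feats_synthetic = np.random.randn(200, 512)   # features of first (synthesized) LR images
feats_real = np.random.randn(200, 512)        # features of second (real) LR images

embedded = TSNE(n_components=2, init="pca").fit_transform(
    np.vstack([feats_synthetic, feats_real]))

plt.scatter(embedded[:200, 0], embedded[:200, 1], s=8, label="first LR (synthesized)")
plt.scatter(embedded[200:, 0], embedded[200:, 1], s=8, label="second LR (real)")
plt.legend()
plt.show()
```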
Referring to fig. 4, the training process of the feature mapping network comprises at least steps 41 and 42:
step 41, inputting the first training data into a preset initial network model; the initial network model is used to learn a spatial characterization of the first training data.
The initial network model is an autoencoder that includes an encoder and a decoder. To enhance the feature mapping effect, in one example the initial network model is a variational autoencoder model with shared parameters.
The first low-resolution image in the first training data may be obtained by bilinear interpolation downsampling of the corresponding high-resolution image, or by combining the corresponding high-resolution image with pre-extracted low-quality features, in the manner described in the examples below.
Wherein the low quality features are image features extracted from a plurality of second low resolution images.
And step 42, training the initial network model by using the first loss function, and performing feature space constraint on the image features of the first training data so that the image features of all the images in the first training data are mapped to the target feature space to obtain a feature mapping network.
During training of the feature mapping network, the initial network model learns the self-encoding of the first low resolution image and the second low resolution image.
Optionally, the first loss function includes an L 1 loss function and an antagonistic loss function.
The contrast loss function is used to constrain the feature space of the image features.
In one example, the counterloss function is represented by:
wherein, Is the output result of the counterloss function; x represents samples in the first low-pixel image and the second low-pixel image; e represents the expected value of the random variable; d denotes a arbiter in the generation countermeasure network.
The L1 loss function is used to reduce the difference between the model estimation result and the real result.
In one example, the L1 loss is represented by:
L_1 = E[||x − x̂||_1] + E[||y − ŷ||_1]
where L_1 is the output of the L1 loss function; x represents a sample from the first low-resolution images and x̂ its estimate; y represents a sample from the second low-resolution images and ŷ its estimate; ||n||_1 denotes the 1-norm of a vector n; and E represents the expected value of a random variable.
The first loss function may be the sum of the L1 loss function and the adversarial loss function, or a weighted sum in which each type of loss function is given a preset weight.
Such as: the first loss function L total is represented by:
In other embodiments, the first loss function may also include other types of loss functions, or include only an L1 loss function, or include only an adversarial loss function; this embodiment does not limit the implementation of the first loss function.
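A rough sketch of the training loop for steps 41 and 42 follows. For brevity a plain shared autoencoder stands in for the parameter-sharing VAE, binary cross-entropy GAN losses are used, and the layer sizes, optimizers and the weight lambda_adv are assumptions rather than the configuration described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared encoder/decoder applied to both kinds of low-resolution images (shapes illustrative)
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 64, 3, padding=1))
decoder = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 3, 3, padding=1))
# Discriminator D constrains the feature space shared by the two LR domains
disc = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(64, 1, 3, padding=1))

opt_g = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
lambda_adv = 0.1  # assumed weight of the adversarial term

def train_step(x_syn, x_real):
    """x_syn: batch of first (synthesized) LR images; x_real: batch of second (real) LR images."""
    f_syn, f_real = encoder(x_syn), encoder(x_real)

    # Discriminator update: distinguish the features of the two domains
    opt_d.zero_grad()
    real_logits = disc(f_real.detach())
    syn_logits = disc(f_syn.detach())
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) + \
             F.binary_cross_entropy_with_logits(syn_logits, torch.zeros_like(syn_logits))
    d_loss.backward()
    opt_d.step()

    # Encoder/decoder update: L1 reconstruction term plus adversarial term on the features
    opt_g.zero_grad()
    rec_loss = F.l1_loss(decoder(f_syn), x_syn) + F.l1_loss(decoder(f_real), x_real)
    g_logits = disc(f_syn)
    adv_loss = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
    (rec_loss + lambda_adv * adv_loss).backward()
    opt_g.step()
    return rec_loss.item(), adv_loss.item()

print(train_step(torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32)))
```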
After training to obtain the feature mapping network, referring to the feature mapping process of the feature mapping network shown in fig. 5, after the target low-resolution image is input into the feature mapping network, the high-dimensional feature of the target low-resolution image is calculated by the encoder 51. Wherein the feature mapping network comprises an encoder 51 and a decoder 52.
Step 203, inputting the target low-resolution image and the high-dimensional features into a pre-trained image reconstruction network to obtain a high-resolution image corresponding to the target low-resolution image; the image reconstruction network is trained by using second training data, wherein the second training data comprises a high-resolution image, a first low-resolution image corresponding to the high-resolution image, and an output result obtained after the first low-resolution image is input into the feature mapping network.
In this embodiment, in order to align the domain spaces of the high-resolution image and the target low-resolution image, the image reconstruction network fuses the high-dimensional features in the target feature space with the image features of the target low-resolution image. Accordingly, the image reconstruction network includes a feature fusion layer coupled to the feature mapping network, which is used to fuse the image features with the high-dimensional features.
In one example, the feature fusion layer fuses image features with high-dimensional features, including: splicing the image features and the high-dimensional features to obtain spliced features; and fusing the spliced features through a convolution layer with a preset size to obtain fusion features.
The convolution layer with the preset size can be a convolution layer of 1×1, and the implementation manner of the convolution layer used in feature fusion is not limited in this embodiment.
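A minimal sketch of the concatenate-then-1×1-convolution fusion just described (the channel counts and class name are assumptions):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse the image (shallow) features with the high-dimensional mapped features."""
    def __init__(self, image_ch=64, mapped_ch=64, out_ch=64):
        super().__init__()
        # 1x1 convolution applied to the spliced (concatenated) features
        self.fuse = nn.Conv2d(image_ch + mapped_ch, out_ch, kernel_size=1)

    def forward(self, image_feat, mapped_feat):
        spliced = torch.cat([image_feat, mapped_feat], dim=1)  # splice along the channel axis
        return self.fuse(spliced)                              # fused feature

f_fu = FeatureFusion()(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(f_fu.shape)  # torch.Size([1, 64, 32, 32])
```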
Because a low-resolution image contains a large amount of low-frequency information, while the features in each channel of an ordinary convolutional neural network are treated equally, the ability to learn discriminatively across feature channels is lacking, the representational capacity of a deep network is limited, and the super-resolution requirement of recovering as much high-frequency information as possible is not met. The image reconstruction network provided in this embodiment therefore treats different channels differently and improves the representational capacity of the network.
In one example, referring to fig. 6, the image reconstruction network is an improvement over the RCAN. The image reconstruction network includes a shallow feature extraction layer 62 located before the feature fusion layer 61, a depth feature extraction layer 63 located after the feature fusion layer 61, an upsampling layer 64, and a reconstruction layer 65.
The shallow feature extraction layer is used for extracting shallow features of the target low-resolution image.
In one example, the shallow feature extraction layer uses a convolution layer (conv) to extract the shallow feature F_0 from the input image I_LR:
F_0 = H_SF(I_LR)
where H_SF denotes the convolution operation.
The feature fusion layer is used to fuse the shallow features with the high-dimensional features to obtain the fusion features, as in the following formula:
F_FU = H_FU(F_0 ⊕ F_LR)
where F_FU represents the fusion feature; H_FU represents a 1×1 convolution operation; F_LR represents the high-dimensional features output by the feature mapping network; and ⊕ represents feature concatenation. The concatenation may be performed with a concat operation; in other embodiments, other stitching methods such as join or merge may be used, and this embodiment does not limit the feature concatenation manner.
The depth feature extraction layer is used for extracting depth features from the fusion features, as in the following formula:
F_DF = H_RIR(F_FU)
The upsampling layer is used for increasing the resolution of the depth features to obtain a high-resolution feature map, as in the following formula:
F_UP = H_UP(F_DF)
The reconstruction layer is used for restoring the high-resolution feature map to obtain the high-resolution image, as in the following formula:
I_SR = H_REC(F_UP) = H_RCAN(I_LR)
The depth feature extraction layer assigns different weights to the channels of the fusion feature based on an attention mechanism in order to extract high-resolution features. In one example, the depth feature extraction layer includes a plurality of RIR modules.
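Below is a sketch of a channel-attention block of the kind used inside RIR modules: global spatial information is squeezed per channel and each channel is then reweighted. The squeeze-and-reweight layout and the reduction ratio follow the standard RCAN channel attention and are assumptions rather than the exact block used here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Give each channel of the fusion feature its own weight."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze H x W to 1 x 1 per channel
        self.weight = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                            # per-channel weight in (0, 1)
        )

    def forward(self, x):
        return x * self.weight(self.pool(x))         # reweight the channels of x

out = ChannelAttention()(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```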
Referring to fig. 7, the training process of the image reconstruction network includes at least steps 71 and 72:
Step 71, inputting the second training data into a preset super-resolution network model.
The model structure of the super-resolution network model is the same as that of the image reconstruction network (see fig. 6) and is not repeated here.
The first low-resolution image in the second training data may be obtained by bilinear interpolation downsampling of the corresponding high-resolution image, or by combining the corresponding high-resolution image with pre-extracted low-quality features, in the manner described in the examples below.
The high-resolution image in the second training data is acquired from the real environment. Optionally, the real environment is related to the environment of the application scene of the image reconstruction network; for example, if the image reconstruction network is used to reconstruct low-quality face images captured at a monitoring checkpoint, the real environment for acquiring the high-resolution images is the acquisition environment of that monitoring checkpoint.
Optionally, after acquiring a large number of real images (i.e., images acquired from the real environment), the electronic device may determine whether each real image is a high-resolution image; if so, the real image is used as a high-resolution image in the second training data; if not, the real image is discarded or used as a second low-resolution image from which the electronic device can extract low-quality features, as described below.
Ways of determining whether each real image is a high resolution image include, but are not limited to: detecting an image resolution of the real image; when the image resolution is greater than or equal to a preset threshold, determining that the real image is a high-resolution image; when the image resolution is less than a preset threshold, it is determined that the real image is not a high resolution image. Or determining whether the image acquisition device which shoots the real image is a white list device; if yes, determining the real image as a high-resolution image; if not, it is determined that the real image is not a high resolution image.
Wherein the whitelist device comprises at least one device identification, each device identification being for indicating an image capturing device capturing a high resolution image.
In other embodiments, the electronic device may determine whether the real image is a high resolution image in other ways, which are not further illustrated herein.
And step 72, training the super-resolution network model by using a second loss function to obtain an image reconstruction network.
Optionally, the second loss function includes an L1 loss function and a perceptual loss function.
In one example, the perceptual loss sums, over the n feature extraction layers, the distance between the features φ_i(I_SR) and φ_i(I_HR), where i represents the i-th layer of the feature extraction network, φ_i represents the ResNet-based VGG-Face network used for feature extraction, n represents the total number of extraction layers, I_SR represents the image output by the reconstruction network, and I_HR represents the high-resolution image corresponding to I_LR.
It should be noted that the image reconstruction network may also be used to reconstruct other types of high-resolution images, such as vehicle images or merchandise images; in that case, the objects recognized by the pre-trained VGG network are replaced with the corresponding object categories instead of faces only.
The L1 loss function is used to reduce the difference between the model estimation result and the real result. In one example, the pixel-level L1 loss is represented by:
L_1 = Σ_{i=1..n} |I_SR^i − I_HR^i|
where I_SR represents the image output by the reconstruction network, I_HR represents the high-resolution image corresponding to I_LR, i indexes the i-th pixel of the image, and n represents the total number of pixels (i takes every positive integer less than or equal to n).
In other embodiments, the second loss function may also include other types of loss functions, or include only an L1 loss function, or include only a perceptual loss function; this embodiment does not limit the implementation of the second loss function.
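A sketch of a combined pixel-level L1 plus perceptual loss follows. The feature extractor here is torchvision's ImageNet-pretrained VGG19, used only as a stand-in for the ResNet-based VGG-Face network mentioned above; the tap layer indices and the weight w_perc are also assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Stand-in feature extractor (the text above uses a ResNet-based VGG-Face network instead);
# downloads ImageNet weights on first use.
features = vgg19(weights="IMAGENET1K_V1").features.eval()
for p in features.parameters():
    p.requires_grad_(False)
taps = {3, 8, 17}  # assumed intermediate layers used for the perceptual term

def second_loss(i_sr, i_hr, w_perc=0.1):
    # Pixel-level L1 term: sum of absolute pixel differences
    pixel_l1 = F.l1_loss(i_sr, i_hr, reduction="sum")
    # Perceptual term: distance between intermediate features of I_SR and I_HR
    perc, x, y = 0.0, i_sr, i_hr
    for idx, layer in enumerate(features):
        x, y = layer(x), layer(y)
        if idx in taps:
            perc = perc + F.l1_loss(x, y)
        if idx == max(taps):
            break
    return pixel_l1 + w_perc * perc

loss = second_loss(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
```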
Optionally, after training to obtain the image reconstruction network, the image reconstruction network may also be tested using the test data to improve the performance of the image reconstruction network. The test data comprises a high-resolution image, a first low-resolution image corresponding to the high-resolution image, and an output result obtained after the first low-resolution image is input into the feature mapping network. The data content of the test data is different from the data content of the second training data.
After training to obtain an image reconstruction network, referring to the generation process of the high resolution image shown in fig. 8, the target low resolution image is input into a feature mapping network 81 and an image reconstruction network 82 respectively; the feature mapping network 81 outputs high-dimensional features located in the target feature space; the image reconstruction network 82 fuses the high-dimensional features with the extracted shallow features, followed by depth feature extraction, upsampling, and image reconstruction to obtain a high resolution image.
In summary, in the monitoring scene-oriented image super-resolution method provided by this embodiment, the target low-resolution image is input into a pre-trained feature mapping network to obtain high-dimensional features in the target feature space, and the target low-resolution image and the high-dimensional features are then input into a pre-trained image reconstruction network to obtain the corresponding high-resolution image. This addresses the problem that low-resolution images synthesized by existing deep-learning-based image super-resolution methods differ from real low-resolution images and generalize poorly. The image reconstruction network is trained with high-resolution images, the corresponding first low-resolution images, and the outputs obtained after the first low-resolution images are passed through the feature mapping network; the feature mapping network is trained with the first and second low-resolution images. In this way, the feature mapping network learns the feature mapping relationship in advance, so that the image features of the first and second low-resolution images can be mapped into the same feature space (i.e., the target feature space); and training the image reconstruction network on the outputs of the feature mapping network further reduces the domain difference between synthesized and real low-resolution images and improves the image reconstruction effect. Here the first low-resolution image is a low-resolution image synthesized based on a high-resolution image, and the second low-resolution image is a low-resolution image acquired from an actual scene.
In addition, the second loss function is tailored to the image super-resolution setting: intermediate layers of the VGG network are extracted as a perceptual loss, and the super-resolution network model is jointly optimized together with the pixel-level L1 loss, which improves the perceptual quality of the high-resolution image.
In addition, by introducing an attention mechanism in the image reconstruction network, the image characteristics of different channels can be treated differently, and the characterization capability of the network is improved.
In addition, by implementing the initial network model as a parameter-sharing VAE, the input image is not compressed into a latent-space code but converted into the two most common statistical distribution parameters, namely the mean and the standard deviation; a normal distribution in the latent space can then be defined from the mean and standard deviation, which improves the accuracy of the feature mapping.
In addition, the feature mapping network is trained with the first loss function, which includes an adversarial loss function; since the adversarial loss is based on a generative adversarial network, the output result approximates the real result more closely and the training effect is improved.
The conventional way of synthesizing the first low-resolution image is to perform bilinear downsampling on a really acquired high-resolution image. However, when images are acquired from the real world, the low-quality characteristics of the resulting real low-resolution images (i.e., the second low-resolution images) are diverse. Therefore, obtaining the first low-resolution image only by bilinear downsampling and using it to train the image reconstruction network and/or the feature mapping network can lead to a network model with poor generalization performance.
Optionally, based on the foregoing embodiment, the present embodiment provides a method for synthesizing a first low-resolution image, where the first low-resolution image is obtained by synthesizing a high-resolution image with pre-extracted low-quality features; the low quality features are extracted from a plurality of second low resolution images.
The low quality features are used to indicate various low quality factors of the second low resolution image. In one example, the low quality features include blur kernels and/or noise.
A blur kernel is a type of convolution kernel, essentially a matrix: convolving a high-resolution image with it makes the image blurred, hence the name blur kernel. The image convolution operation here refers to matrix (two-dimensional) convolution.
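Putting the two kinds of low-quality features together, the synthesis of a first low-resolution image from a high-resolution image can be sketched as follows; the function name, the scale factor of 2, and the depthwise-convolution implementation of the blur are assumptions for illustration, not details specified by the embodiment.

```python
# Sketch: convolve the HR image with an extracted blur kernel, downsample by
# striding, then add an extracted zero-mean noise patch to get the first LR image.
import torch
import torch.nn.functional as F

def synthesize_lr(hr: torch.Tensor, kernel: torch.Tensor,
                  noise_patch: torch.Tensor, scale: int = 2) -> torch.Tensor:
    # hr: (N, C, H, W); kernel: (k, k), odd-sized; noise_patch: (N, C, H//scale, W//scale)
    c = hr.shape[1]
    k = kernel.shape[-1]
    weight = kernel.view(1, 1, k, k).repeat(c, 1, 1, 1)   # same kernel applied per channel
    blurred = F.conv2d(F.pad(hr, [k // 2] * 4, mode="reflect"),
                       weight, groups=c)                  # blur via convolution
    lr = blurred[..., ::scale, ::scale]                   # downsample by striding
    return lr + noise_patch                               # add zero-mean real noise
```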
Because real images have an internal cross-scale recurrence property (Cross-Scale Recurrence Property), the correct super-resolution blur kernel is the one that maximizes the similarity of image patches across scales within the low-resolution image. Based on this, the present embodiment proposes a way of extracting the blur kernel, in which the blur kernel is extracted from a second low-resolution image using a pre-trained generative adversarial network. Referring to the blur kernel extraction method shown in FIG. 9, the method includes at least steps 91-94:
Step 91, acquiring a plurality of second low resolution images.
The second low-resolution image is a low-resolution image acquired from the real environment.
Step 92, for each of the plurality of second low-resolution images, inputting the second low-resolution image into an initial generative adversarial network to obtain a network output result.
The initial generative adversarial network includes a generator and a discriminator.
The generator is used to model the blur kernel. In one example, the generator is a linear network model formed by stacking convolutions with kernel sizes of 7×7, 5×5 and 3×3 followed by three 1×1 convolutions. The stride of the first-layer 7×7 convolution is 2, the strides of the remaining convolutions are 1, and the receptive field of the whole network model is 13×13.
Assuming that the second low-resolution image input to the initial generative adversarial network has a size of 64×64, a 32×32 image is obtained after passing through the generator.
In other embodiments, the receptive field of the generator may be of other dimensions, and the network structure of the generator is not limited in this embodiment.
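Following the layer description above (7×7, 5×5 and 3×3 convolutions followed by three 1×1 convolutions, stride 2 on the first layer, no non-linearities in the linear model), such a generator could be sketched as below; the channel width of 64 and the 3-channel input are assumptions.

```python
# Sketch of the linear generator: stacked convolutions only, no activations.
# With stride 2 on the first layer, a 64x64 input yields a 32x32 output.
import torch.nn as nn

def make_kernel_generator(channels: int = 64) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(3, channels, kernel_size=7, stride=2, padding=3),
        nn.Conv2d(channels, channels, kernel_size=5, padding=2),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.Conv2d(channels, channels, kernel_size=1),
        nn.Conv2d(channels, channels, kernel_size=1),
        nn.Conv2d(channels, 3, kernel_size=1),
    )
```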
The discriminator is used to learn the distribution of pixels within image blocks so as to distinguish whether an input image comes from the real image distribution. In one example, the discriminator is a fully convolutional patch discriminator formed by stacking one 7×7 convolution and six 1×1 convolutions, so the receptive field of the discriminator network is 7×7. That is, the discriminator outputs a probability map in which each point indicates the probability that the corresponding 7×7 pixel block of the input image belongs to the real image distribution.
In other embodiments, the receptive field of the discriminator may be of other sizes, and the network structure of the discriminator is not limited in this embodiment.
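Similarly, the patch discriminator (one 7×7 convolution followed by six 1×1 convolutions, a 7×7 receptive field, and a per-patch probability output) could be sketched as below; the channel width and the ReLU/Sigmoid activations are assumptions.

```python
# Sketch of the fully convolutional patch discriminator: every output pixel is
# the probability that the corresponding 7x7 block of the input is "real".
import torch.nn as nn

def make_patch_discriminator(channels: int = 64) -> nn.Sequential:
    layers = [nn.Conv2d(3, channels, kernel_size=7, padding=3), nn.ReLU(inplace=True)]
    for _ in range(5):                                   # five hidden 1x1 convolutions
        layers += [nn.Conv2d(channels, channels, kernel_size=1), nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid()]  # sixth 1x1 + probability map
    return nn.Sequential(*layers)
```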
Step 93, training the initial generative adversarial network by using the third loss function and the network output result to obtain a final generative adversarial network corresponding to the current second low-resolution image.
Optionally, the electronic device trains the initial generative adversarial network in an alternating manner, updating the weights of the discriminator and the generator in turn during training until the model converges.
In one example, the third loss function is an adversarial objective over the generator G and the discriminator D, where E denotes the expectation, patch(I_LR) denotes the distribution of image patches drawn from the second low-resolution image, G denotes the generator network, D denotes the discriminator network, and R denotes the blur-kernel regularization term, which constrains the extracted kernel to be more consistent with human priors about blur kernels.
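A KernelGAN-style objective that is consistent with the symbols described above can be written as follows; this is an assumed reconstruction rather than the exact expression used in the embodiment:

```latex
\min_{G}\max_{D}\;
\mathbb{E}_{p \sim \mathrm{patch}(I_{LR})}
\Big[\, \big|D(p) - 1\big| + \big|D\big(G(I_{LR})\big)\big| \,\Big] + R
```

Here the first term drives the discriminator to label real patches of the second low-resolution image as real, the second term penalizes it on patches produced by the generator, and R is the blur-kernel regularization term described above.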
Step 94, inputting a preset image into each generative adversarial network to obtain the blur kernel corresponding to each generative adversarial network.
The preset image is an image of a preset size whose central pixel value is 1 and whose remaining pixel values are 0. The preset size may be 25×25; the value of the preset size is not limited in this embodiment.
In this embodiment, the preset image is input into each generative adversarial network, and the generator of each network outputs an image of the size of its receptive field, which is the blur kernel. In this way, the electronic device does not need to store the parameters of every trained generative adversarial network, which saves storage resources.
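Under the assumptions of the generator sketch above, step 94 amounts to measuring the generator's impulse response; the delta-image size of 25×25 follows the text, while the channel averaging and the sum-to-one normalization are assumed conventions.

```python
# Sketch of step 94: feed a delta image (center pixel 1, all others 0) through a
# trained generator; its response to the impulse is read off as the blur kernel.
import torch

@torch.no_grad()
def extract_blur_kernel(generator: torch.nn.Module, size: int = 25) -> torch.Tensor:
    delta = torch.zeros(1, 3, size, size)
    delta[:, :, size // 2, size // 2] = 1.0              # centred unit impulse
    kernel = generator(delta).mean(dim=1, keepdim=True)  # average over output channels (assumed)
    kernel = kernel / kernel.sum().clamp_min(1e-8)       # normalize to sum to 1 (assumed convention)
    return kernel.squeeze()
```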
In the present embodiment, because the generative adversarial network has learned a large number of blur characteristics of the second low-resolution images in advance, extracting the blur kernel with it improves the accuracy of the blur kernel. In one example, a blur kernel extracted using the generative adversarial network is shown in FIG. 10.
In other embodiments, the electronic device may instead directly input the high-resolution image into any one of the trained generative adversarial networks, and the generator outputs a blurred low-resolution image downsampled by a factor of 2, without explicitly extracting the blur kernel; that is, step 94 is not performed after step 93. The blurring process applied to the high-resolution image is not limited in this embodiment.
Noise refers to unnecessary or redundant interference information in the image data. The presence of noise can affect the quality of the image.
Optionally, the noise is assumed to have a mean value of 0; based on this assumption, noise is obtained by patch extraction from the second low-resolution images. Referring to the noise extraction method shown in FIG. 11, the method includes at least steps 1101-1104:
Step 1101, for each second low-resolution image, sliding a rectangular frame of the target resolution over the whole image with a preset step size.
In this embodiment, the target resolution may be 56×48 and the preset step size may be 8; in other embodiments, the target resolution and the preset step size may take other values, which are not limited in this embodiment.
Step 1102, calculating the variance within each rectangular frame region.
Step 1103, comparing the minimum variance among the rectangular frames with a preset variance threshold; if the minimum variance is greater than the variance threshold, discarding the second low-resolution image and returning to step 1101; if the minimum variance is less than or equal to the variance threshold, performing step 1104.
Step 1104, subtracting the mean value of the selected rectangular frame region from its pixel values to obtain the noise data, and ending the process.
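A minimal NumPy sketch of steps 1101-1104 is given below; the 56×48 window and the step size of 8 are the example values from the text, while the window orientation, the variance threshold value, and the single-channel input are assumptions.

```python
# Sketch of noise extraction: slide a window over the image, keep the window
# with the smallest variance if it is below the threshold, and subtract its
# mean so the extracted noise has zero mean.
import numpy as np

def extract_noise_patch(lr: np.ndarray, win=(48, 56), stride=8, var_thresh=20.0):
    # lr: single-channel float image (H, W); win is (height, width); var_thresh is illustrative
    h, w = lr.shape
    best_var, best_patch = None, None
    for y in range(0, h - win[0] + 1, stride):
        for x in range(0, w - win[1] + 1, stride):
            patch = lr[y:y + win[0], x:x + win[1]]
            v = patch.var()
            if best_var is None or v < best_var:
                best_var, best_patch = v, patch
    if best_var is None or best_var > var_thresh:
        return None                        # discard this second low-resolution image
    return best_patch - best_patch.mean()  # zero-mean noise data
```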
In one example, noise obtained by patch extraction from a second low-resolution image is shown in FIG. 12.
In summary, in this embodiment, low-quality features are extracted from the second low-resolution images and used, together with the high-resolution image, to synthesize the first low-resolution image carrying those low-quality features. The first low-resolution image is therefore closer to an actually acquired low-resolution image, which improves the generalization performance of the network model trained with it.
In addition, because the generative adversarial network can learn the self-similarity properties of real images, training such a network in advance and using it to extract the blur kernel among the low-quality features improves the accuracy of blur kernel extraction.
Fig. 13 is a block diagram of an image super-resolution device for a monitoring scene according to an embodiment of the present application. The device at least comprises the following modules: an image acquisition module 1310, a feature mapping module 1320, and an image generation module 1330.
An image acquisition module 1310, configured to acquire a target low resolution image to be restored;
A feature mapping module 1320, configured to input the target low-resolution image into a pre-trained feature mapping network to obtain high-dimensional features located in a target feature space; the feature mapping network is obtained by training with first training data; the first training data includes a first low-resolution image and a second low-resolution image; the first low-resolution image is a low-resolution image synthesized based on the high-resolution image; the second low-resolution image is a low-resolution image acquired from an actual scene;
The image generation module 1330 is configured to input the target low-resolution image and the high-dimensional feature into a pre-trained image reconstruction network to obtain a high-resolution image corresponding to the target low-resolution image; the image reconstruction network is trained by using second training data, wherein the second training data comprises a high-resolution image, a first low-resolution image corresponding to the high-resolution image and an output result obtained after the first low-resolution image is input into the feature mapping network.
For relevant details reference is made to the method embodiments described above.
It should be noted that the division of the above functional modules is merely illustrative of how the monitoring scene-oriented image super-resolution device provided in the above embodiment performs image super-resolution for a monitoring scene. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the monitoring scene-oriented image super-resolution device provided in the above embodiment and the embodiments of the monitoring scene-oriented image super-resolution method belong to the same concept; for the detailed implementation process, reference is made to the method embodiments, which is not repeated here.
Fig. 14 is a block diagram of an image super-resolution device for a monitoring scene according to an embodiment of the present application. The device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a server, or the like, and may also be referred to as user equipment, a portable terminal, a laptop terminal, a desktop terminal, a control terminal, or the like; the type of the device is not limited in this embodiment. The device includes at least a processor 1401 and a memory 1402.
The processor 1401 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 1401 may be implemented in at least one hardware form among a DSP (Digital Signal Processing) device, an FPGA (Field-Programmable Gate Array) and a PLA (Programmable Logic Array). The processor 1401 may also include a main processor and a coprocessor; the main processor, also referred to as a CPU (Central Processing Unit), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1401 may be integrated with a GPU (Graphics Processing Unit) responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1401 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
Memory 1402 may include one or more computer-readable storage media, which may be non-transitory. Memory 1402 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1402 is used to store at least one instruction for execution by processor 1401 to implement the monitoring scene oriented image super resolution method provided by the method embodiments of the present application.
In some embodiments, the monitoring scene-oriented image super-resolution device further optionally includes: a peripheral interface and at least one peripheral. The processor 1401, memory 1402, and peripheral interfaces may be connected by buses or signal lines. The individual peripheral devices may be connected to the peripheral device interface via buses, signal lines or circuit boards. Illustratively, peripheral devices include, but are not limited to: radio frequency circuitry, touch display screens, audio circuitry, and power supplies, among others.
Of course, the image super-resolution device facing the monitoring scene may further include fewer or more components, which is not limited in this embodiment.
Optionally, the present application further provides a computer-readable storage medium in which a program is stored, where the program is loaded and executed by a processor to implement the monitoring scene-oriented image super-resolution method of the above method embodiments.
Optionally, the present application further provides a computer program product including a computer-readable storage medium in which a program is stored, where the program is loaded and executed by a processor to implement the monitoring scene-oriented image super-resolution method of the above method embodiments.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (11)
1. A monitoring scene-oriented image super-resolution method, the method comprising:
acquiring a target low-resolution image to be restored;
Inputting the target low-resolution image into a pre-trained feature mapping network to obtain high-dimensional features in a target feature space; the feature mapping network is obtained by training with first training data; the first training data includes a first low-resolution image and a second low-resolution image; the first low-resolution image is a low-resolution image synthesized based on the high-resolution image; the second low-resolution image is a low-resolution image acquired from an actual scene;
The training process of the feature mapping network comprises the following steps: inputting the first training data into a preset initial network model; the initial network model is used for learning the spatial representation of the first training data; training the initial network model by using a first loss function, and restricting the feature space of the image features of the first training data so that the image features of each image in the first training data are mapped to the target feature space to obtain the feature mapping network;
Inputting the target low-resolution image and the high-dimensional features into a pre-trained image reconstruction network to obtain a high-resolution image corresponding to the target low-resolution image; the image reconstruction network is trained by using second training data, wherein the second training data comprises a high-resolution image, a first low-resolution image corresponding to the high-resolution image and an output result obtained after the first low-resolution image is input into the feature mapping network.
2. The method of claim 1, wherein the initial network model is a variational autoencoder model with shared parameters.
3. The method of claim 1, wherein the first loss function comprises an L1 loss function and an adversarial loss function, the adversarial loss function being used to constrain the feature space of the image features; the L1 loss function is used to reduce the difference between the model estimation result and the real result.
4. The method of claim 1, wherein the image reconstruction network includes a feature fusion layer coupled to the feature mapping network, the feature fusion layer configured to fuse image features with the high-dimensional features.
5. The method of claim 4, wherein the fusing the image features with the high-dimensional features comprises:
Splicing the image features and the high-dimensional features to obtain spliced features;
And fusing the spliced features through a convolution layer with a preset size to obtain fusion features.
6. The method of claim 4, wherein the image reconstruction network further comprises a shallow feature extraction layer located before the feature fusion layer, a depth feature extraction layer located after the feature fusion layer, an upsampling layer, and a reconstruction layer;
the shallow feature extraction layer is used for extracting shallow features of the target low-resolution image;
the feature fusion layer is used for fusing the shallow features and the high-dimensional features to obtain fusion features;
the depth feature extraction layer is used for extracting depth features of the fusion features;
the up-sampling layer is used for improving the resolution of the depth features to obtain a high-resolution feature map;
and the reconstruction layer is used for recovering the high-resolution characteristic map to obtain the high-resolution image.
7. The method of claim 6, wherein the depth feature extraction layer weights each channel in the fused feature differently based on an Attention mechanism to extract the high resolution feature.
8. The method of claim 1, wherein the training process of the image reconstruction network comprises:
inputting the second training data into a preset super-resolution network model;
Training the super-resolution network model by using a second loss function to obtain the image reconstruction network; the second loss function comprises an L1 loss function and a perceptual loss function, the perceptual loss function being used to improve the semantic similarity between the model estimation result and the real result; the L1 loss function is used to reduce the difference between the model estimation result and the real result.
9. The method of claim 1, wherein the first low-resolution image is synthesized from a corresponding high-resolution image and pre-extracted low-quality features; the low-quality features are image features extracted from a plurality of second low-resolution images.
10. The monitoring scene-oriented image super-resolution device is characterized by comprising a processor and a memory; stored in the memory is a program that is loaded and executed by the processor to implement the monitoring scene oriented image super resolution method of any one of claims 1 to 9.
11. A computer-readable storage medium, characterized in that the storage medium has stored therein a program which, when executed by a processor, is adapted to implement the monitoring scene oriented image super resolution method according to any one of claims 1 to 9.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011579005.6A CN112598579B (en) | 2020-12-28 | 2020-12-28 | Monitoring scene-oriented image super-resolution method, device and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112598579A (en) | 2021-04-02 |
| CN112598579B (en) | 2024-08-27 |
Family
ID=75203700
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011579005.6A CN112598579B (en), Active | Monitoring scene-oriented image super-resolution method, device and storage medium | 2020-12-28 | 2020-12-28 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112598579B (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |