WO2023010701A1

WO2023010701A1 - Image generation method, apparatus, and electronic device

Info

Publication number: WO2023010701A1
Application number: PCT/CN2021/128518
Authority: WO
Inventors: Yapeng LI; Ningbo WANG
Original assignee: Zhejiang Dahua Technology Co., Ltd.
Priority date: 2021-08-02
Filing date: 2021-11-03
Publication date: 2023-02-09
Also published as: CN113344792A; CN113344792B

Abstract

Disclosed are an image generation method, an apparatus, and an electronic device. The method includes: inputting N frames of low-resolution face images sequentially into a first network model by iterative means, training the first network model, restricting the training process with the output loss of the first network model until a training result of the first network model is convergent, marking the trained first network model as a second network model. The second network model performs super-resolution processing on multi-frame low-resolution face images of any target, thereby obtaining super-resolution face images of which identity is consistent with identity of the multi-frame low-resolution face images. Based on the above method, it is possible to solve the problem that super-resolution processing based on single-frame low-resolution face images cannot guarantee that the identity information in the obtained super-resolution face images is consistent with the identity information of the single-frame low-resolution face images.

Description

IMAGE GENERATION METHOD, APPARATUS, AND ELECTRONIC DEVICE

CROSS REFERENCE

The present application claims foreign priority of China Patent Application No. 202110879082.1 filed on August 02, 2021, in the China National Intellectual Property Administration, the entire contents of which are hereby incorporated by reference.

TECHNICAL FIELD

The present disclosure relates to the field of face recognition technologies, and in particular to an image generation method, an apparatus, and an electronic device.

BACKGROUND

With the rapid development of science and technology and the advent of big data era, information security has become more and more important. Face recognition, as a safe, non-contact, convenient, and efficient way of identity information authentication, has been widely used in all aspects of social life. However, in relatively large surveillance scenes, the size of a face that appears in the video is usually small, and the image definition is low, which is difficult to meet the needs of face recognition. Therefore, a face super-resolution technology becomes more and more important. The face super-resolution technology essentially adds high-frequency features to low-resolution face images to generate high-resolution face images.

The prior art is usually based on a single-frame low-resolution face image and obtains a super-resolution face image through super-resolution processing. The super-resolution face image obtained in this way has missing face information, and cannot guarantee that the identity information of the super-resolution face image is consistent with the identity information of the single-frame low-resolution face image.

SUMMARY OF THE DISCLOSURE

The present disclosure provides an image generation method, an apparatus, and an electronic device, to realize super-resolution processing of multiple low-resolution face images to obtain super-resolution face images of which the identity information is consistent with the identity information of the low-resolution face images.

In a first aspect, the present disclosure provides an image generation method, comprising:

obtaining N frames of low-resolution face images of a first target, wherein the N is a positive integer greater than or equal to 2;

training the first network model according to the N frames of low-resolution face images to obtain a second network model, wherein the first network model is configured to perform super-resolution processing a low-resolution face image;

performing super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images; and

taking a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.

By virtue of the above method, super-resolution processing of multiple low-resolution face images can be realized to obtain super-resolution face images of which the identity information is consistent with the identity information of the low-resolution face images.

In some embodiments, the training the first network model according to the N frames of low-resolution face images to obtain the second network model comprises:

calculating and obtaining an output loss of the first network model according to the N frames of low-resolution face images, wherein the output loss is configured to restrict a training process of the first network model;

determining whether a training result of the first network model being convergent according to the output loss;

in response to the training result of the first network model not being convergent, adjusting parameters of the first network model and continuing to train the first network model until the training result is convergent; and

in response to the training result of the first network model being convergent, taking the trained first network model as the second network model.

By virtue of the above method, the output loss of the first network model is configured to restrict the training process of the first network model. The second network model is obtained after the training process, configured to realize super-resolution processing of multiple low-resolution face images of any target to obtain super-resolution face images of which the identity information is consistent with the identity information of the low-resolution face images.

In some embodiments, the calculating and obtaining the output loss of the first network model according to the N frames of low-resolution face images comprise:

obtaining N random variables and a super-resolution face image set based on the N frames of low-resolution face images through a first network model; wherein the number of frames of super-resolution face images in the super-resolution face image set is N;

inputting the super-resolution face images in the super-resolution face image set into a recognition network in sequence, and extracting to obtain N face feature values; and

inputting the N random variables and the N face feature values into a loss function, and calculating to obtain the output loss of the first network model.

By virtue of the above method, the obtained output loss of the first network model is configured to restrict the training process of the first network model to cause the training result to be convergent.

In some embodiments, the obtaining the N random variables and the super-resolution face image set based on the N frames of low-resolution face images through the first network model comprises:

determining a frame of low-resolution face image among the N frames of low-resolution face image as a first reference frame;

inputting a first super-resolution face image and a first low-resolution face image into the first network model to obtain a random variable corresponding to the first low-resolution face image; wherein the first super-resolution face image is a real high-resolution face image of the first target, and the first low-resolution face image is a next-frame image of the first reference frame;

inputting the random variable and the first reference frame into the first network model to obtain a second super-resolution face image; and

replacing the first super-resolution face image with the second super-resolution face image, replacing the first low-resolution face image with a next-frame face image of the first low-resolution face image, and continuing to perform training on the first network model to generate other super-resolution face images in sequence forming another super-resolution face image set.

By virtue of the above method, the obtained super-resolution face images of the super-resolution face image set are configured to abstract face feature values. The face values and the N random variables are configured to calculate the output loss of the first network model.

In some embodiments, the inputting the N random variables and the N face feature values into the loss function, and calculating to obtain the output loss of the first network model comprise:

inputting the N random variables into a negative log-likelihood loss function, and calculating to obtain a negative log-likelihood loss; wherein the negative log-likelihood loss is configured to restrict the first network model such that random variables output by the first network model obey a standard positive distribution;

inputting the N face feature values into a cosine loss function, and calculating to obtain a cosine loss; wherein the cosine loss is configured to indicate a degree of difference between super-resolution face features and real face features;

inputting the cosine loss into a cosine comparison loss function, and calculating to obtain a cosine comparison loss; wherein the cosine comparison loss is configured to restrict the first network model, such that a similarity between a super-resolution face image generated each time and the real high-resolution face image is greater than a similarity between a super-resolution face image generated last time and the real high-resolution face image; and

inputting the negative log-likelihood loss, the cosine loss and the cosine comparison loss into the loss function, and calculating to obtain the output loss of the first network model; wherein the output loss is configured to restrict the training process of the first network model.

By virtue of the above method, the output loss of the first network model is obtained. The training process of the first network model is restricted through the output loss, which can achieve that the random variables encoded by the first network model obey the standard positive distribution and that the similarity between the super-resolution face image and the real high-resolution face image generated by the first network model each time is greater than the similarity between the super-resolution face image and the real high-resolution generated by the first network model last time.

In some embodiments, the performing super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain the N frames of super-resolution face images comprises:

randomly sampling a first random variable among random variables that obey a standard positive distribution generated in the training process, and determining a second reference frame among the N frames of low-resolution face images;

inputting the first random variable and the second reference frame into the second network model to obtain a super-resolution face image corresponding to the first random variable;

inputting the super-resolution face image and the second low-resolution face image into the second network model to obtain a second random variable; wherein the second low-resolution face image is a next-frame image of the second reference frame; and

replacing the first random variable with the second random variable, replacing the second low-resolution face image with a next-frame image of the second low-resolution face image, and continuing to perform super-resolution processing on the replaced second low-resolution face image to obtain the N frames of super-resolution face images in sequence.

By virtue of the above method, the second network model is configured to perform super-resolution processing on the N frames of low-resolution face images, and the super-resolution face image generated each time additionally has detail features of one frame of low-resolution face image than the super-resolution face image generated in the previous time. Therefore, the last generated super-resolution face image contains the detailed features of N frames of low-resolution face image, that is, the identity information of the last generated super-resolution face image is consistent with the identity information of the N frames of low-resolution face images.

In a second aspect, the present disclosure provides an image generation apparatus, comprising:

an obtaining module, configured to obtain N frames of low-resolution face images of a first target, wherein the N is a positive integer greater than or equal to 2;

a training module, configured to train the first network model according to the N frames of low-resolution face images to obtain a second network model, wherein the first network model is configured to perform super-resolution processing a low-resolution face image;

a processing module, configured to perform super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images; and

a selection module, configured to take a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.

In some embodiments, the training module comprises:

a calculation unit, configured to calculate and obtain an output loss of the first network model according to the N frames of low-resolution face images, wherein the output loss is configured to restrict a training process of the first network model;

a determining unit, configured to determine whether a training result of the first network model is convergent according to the output loss;

an adjustment unit, configured to, in response to the training result of the first network model not being convergent, adjust parameters of the first network model and continuing to train the first network model until the training result is convergent; and

a marking unit, configured to, in response to the training result of the first network model being convergent, record the trained first network model as the second network model.

In some embodiments, the calculation unit is specifically configured to:

obtain N random variables and a super-resolution face image set from the N frames of low-resolution face images through a first network model; wherein the number of frames of super-resolution face images in the super-resolution face image set is N;

input the super-resolution face images in the super-resolution face image set into a recognition network in sequence, and extracting to obtain N face feature values; and

input the N random variables and the N face feature values into a loss function, and calculating to obtain the output loss of the first network model.

In some embodiments, the calculation unit is further configured to:

determine a frame of low-resolution face image among the N frames of low-resolution face image as a first reference frame;

input a first super-resolution face image and a first low-resolution face image into the first network model to obtain a random variable corresponding to the first low-resolution face image; wherein the first super-resolution face image is a real high-resolution face image of the first target, and the first low-resolution face image is a next-frame image of the first reference frame;

input the random variable and the first reference frame into the first network model to obtain a second super-resolution face image; and

replace the first super-resolution face image with the second super-resolution face image, and replace the first low-resolution face image with a next-frame face image of the first low-resolution face image; continue to perform training on the first network model to generate super-resolution face images in sequence forming a super-resolution face image set.

In some embodiments, the calculation unit is further configured to:

input the N random variables into a negative log-likelihood loss function, and calculate to obtain a negative log-likelihood loss; wherein the negative log-likelihood loss is configured to restrict the first network model such that the random variables output by the first network model obey the standard positive distribution;

input the N face feature values into a cosine loss function, and calculate to obtain a cosine loss; wherein the cosine loss is configured to indicate a degree of difference between super-resolution face features and real face features;

input the cosine loss into a cosine comparison loss function, and calculate to obtain a cosine comparison loss; wherein the cosine comparison loss is configured to restrict the first network model, such that a similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than a similarity between the super-resolution face image generated last time and the real high-resolution face image; and

input the negative log-likelihood loss, the cosine loss and the cosine comparison loss into the loss function, and calculate to obtain the output loss of the first network model; wherein the output loss is configured to restrict the training process of the first network model.

In some embodiments, the processing module comprises:

an obtaining unit, configured to randomly sample a first random variable among random variables that obey a standard positive distribution generated in the training process, and determine a second reference frame among the N frames of low-resolution face images;

a processing unit, configured to input the first random variable and the second reference frame into the second network model to obtain a super-resolution face image corresponding to the first random variable;

an encoding unit, configured to input the super-resolution face image and the second low-resolution face image into the second network model to obtain a second random variable; wherein the second low-resolution face image is a next-frame image of the second reference frame; and

an updating unit, configured to replace the first random variable with the second random variable, replace the second low-resolution face image with a next-frame image of the second low-resolution face image, and continue to perform super-resolution processing on the replaced second low-resolution face image to obtain the N frames of super-resolution face images in sequence.

In a third aspect, the present disclosure provides an electronic device, comprising:

a memory, configured to store a computer program; and

a processor, configured to execute the computer program stored in the memory to perform the method as described above.

In a fourth aspect, the present disclosure provides a storage medium, storing a computer program; wherein the computer program is configured to perform the method as described above when executed by a processor.

Based on the method provided by the present disclosure, super-resolution processing is performed on the N frames of low-resolution face images of the first target to train the first network model. In the training process, the last generated super-resolution face image contains the detailed features of N frames of low-resolution face image. Therefore, after super-resolution processing is performed on the N frames of low-resolution face images of the first target based on the second network model, the identity information of the last generated super-resolution face image is consistent with the identity information of the first target.

Of course, the second network model may not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain the super-resolution face image of which identity information is consistent with the identity information of the first target, the second network model may also perform super-resolution processing on N frames of low-resolution face images of a second target to obtain a super-resolution face image of which identity information is consistent with the identity information of the second target.

For the technical effects that can be achieved with respect to each of the second to fourth aspects, reference may be made to the above description of the technical effects that can be achieved with respect to the first aspect or the various possible solutions in the first aspect, which will not be repeated here.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of an image generation method according to an embodiment of the present disclosure.

FIG. 2 is a flowchart of a method for training a first network model according to an embodiment of the present disclosure.

FIG. 3 is a flowchart of a method for obtaining N random variables and a super-resolution face image set based on a first network model according to an embodiment of the present disclosure.

FIG. 4 is a flowchart of a method for calculating an output loss of a first network model according to an embodiment of the present disclosure.

FIG. 5 is a flowchart of a method for obtaining N frames of super-resolution face images based on a second network model according to an embodiment of the present disclosure.

FIG. 6 is a schematic view of a method for training a first network model according to an embodiment of the present disclosure.

FIG. 7 is a flowchart of a method for performing super-resolution processing on N frames of low-resolution face images based on a second network model according to an embodiment of the present disclosure.

FIG. 8 is a structural schematic view of an image generation apparatus according to an embodiment of the present disclosure.

FIG. 9 is a structural schematic view of a training module according to an embodiment of the present disclosure.

FIG. 10 a structural schematic view of a processing module according to an embodiment of the present disclosure.

FIG. 11 is a structural schematic view of an electronic device according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

In order to make the objectives, technical solutions, and advantages of the present disclosure clearer, the present disclosure will be further described in detail below with reference to the accompanying drawings. The specific operation method in method embodiments may also be applied to apparatus embodiments or system embodiments. It should be noted that in the description of the present disclosure, “a plurality of” is understood as “at least two” . “And/or” describes the association relationship of associated object, indicating that there can be three types of relationships, for example, A and/or B, which can mean: A alone exists, A and B exist at the same time, and B exists alone. “A connected to B” can mean: A and B are directly connected; and/or A and B are connected through C. In addition, in the description of the present disclosure, words such as “first” and “second” are only intended for a purpose of distinguishing description, and cannot be understood as indicating or implying relative importance, nor as indicating or implying order.

The present disclosure will be further described in detail below in conjunction with the accompanying drawings.

The image generation method provided by the embodiments of the present disclosure can solve the problem of being unable to ensure that identity information of an obtained super-resolution face image is consistent with identity information of a single-frame low-resolution face image while performing super-resolution processing based on the single-frame low-resolution face image. The method and apparatus described in the embodiments of the present disclosure are based on a same technical concept. Since the principles of the method and apparatus to solve the problem are similar, the embodiments of the apparatus and the method can be referred to each other, and repetition will not be repeated.

Face super-resolution technology is essentially to add high-frequency features to low-resolution face images to generate high-resolution face images. In the field of face super-resolution technologies, a SRFlow network model is often used. The SRFlow network model is reversible and can learn a conditional distribution of super-resolution images with respect to low-resolution images. High-resolution image and low-resolution image are input into the SRFlow network model to obtain random variables that meet a specific distribution. Low-resolution image and random variables that meet the specific distribution are input into the SRFlow network model to generate the super-resolution face image.

In the prior art, super-resolution processing is usually performed on a single-frame low-resolution face image based on the SRFlow network model to obtain a super-resolution face image. However, due to the lack of detailed information of the single-frame low-resolution face image, and the detailed information is usually key information to distinguish face identity, thus it cannot be ensured that the identity information of the obtained super-resolution face image is consistent with the identity information of the low-resolution face image.

In order to solve the problem of being unable to ensure that identity information of an obtained super-resolution face image is consistent with identity information of a single-frame low-resolution face image while performing super-resolution processing based on the single-frame low-resolution face image, the present disclosure proposes a solution: based on a first network model, sequentially inputting multi-frame low-resolution face images of a first target into the first network model in an iterative manner; training the first network model; restricting a training process according to an output loss of the first network; in response to a training result of the first network model being convergent, recording the trained first network model as a second network model; performing super-resolution processing on the multi-frame low-resolution face images of the first target or multi-frame low-resolution face images of a second target with the second network model, and obtaining a last frame of super-resolution face image. The generated last super-resolution face image has detailed features of the multi-frame low-resolution face images, and thus the identity information thereof is consistent with the identity information of the low-resolution face image.

Specifically, as shown in FIG. 1, FIG. 1 is a flowchart of an image generation method according to an embodiment of the present disclosure.

At block S11. obtaining N frames of low-resolution face images of a first target, wherein the N is a positive integer greater than or equal to 2.

At block S12: training the first network model according to the N frames of low-resolution face images to obtain a second network model.

In the embodiment of the present disclosure, the first network model may be a SRFlow network model. The N frames of low-resolution face images are sequentially input into the first network model in an iterative manner, and the first network model is trained. The training process is restricted according to an output loss of the first network. When the training result of the first network model is convergent, the trained first network model is recorded as the second network model.

At block S13: performing super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images.

In the embodiment of the present disclosure, super-resolution processing is performed on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images. In the process of the super-resolution processing, the super-resolution face image generated each time additionally has detail features of one frame of low-resolution face image than the super-resolution face image generated in the previous time.

At block S14: taking a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.

By inputting the N frames of low-resolution face images into the second network model, the super-resolution face image generated each time additionally has detail features of one frame of low-resolution face image than the super-resolution face image generated in the previous time. Therefore, the last generated super-resolution face image contains the detailed features of N frames of low-resolution face images, that is, the identity information of the last generated super-resolution face image is consistent with the identity information of the N frames of low-resolution face images.

In order to further explain how the second network model is obtained, the method of training the first network model described in step S12 is required to be described in detail, as shown in FIG. 2, FIG. 2 is a flowchart of a method for training a first network model according to an embodiment of the present disclosure.

At block S21: obtaining N random variables and a super-resolution face image set from the N frames of low-resolution face images through a first network model.

In the embodiment of the present disclosure, the super-resolution face image set stores a frame of real high-resolution face image of the first target, and the super-resolution face images generated each time. The frame of real high-resolution face image is recorded as a first super-resolution face image, and the total number of super-resolution face images in the super-resolution face image set is N.

The obtaining the N random variables and the super-resolution face image set can be implemented by inputting the N frames of low-resolution face images into the first network model in an iterative manner. The specific process is shown in FIG. 3.

At block S31: determining a frame of low-resolution face image among the N frames of low-resolution face image as a first reference frame.

In the embodiment of the present disclosure, the first reference frame may be the first frame in the N frames of low-resolution face images, or may be the second frame, the third frame, the fourth frame, etc. In the present disclosure, the first frame of low-resolution face image is selected as an example.

At block S32: putting the frame of real high-resolution face image as a first super-resolution face image into the super-resolution face image set, and taking a next-frame image of the first reference frame as a first low-resolution face image.

At block S33: inputting the first super-resolution face image and the first low-resolution face image into the first network model to obtain a random variable corresponding to the first low-resolution face image.

At block S34: inputting the random variable and the first reference frame into the first network model to obtain a second super-resolution face image.

At block S35: putting the second super-resolution face image into the super-resolution face image set, and determining whether the number of image frames in the super-resolution face image set is N.

In the embodiment of the present disclosure, when the number of image frames in the super-resolution face image set is not N, step S36 is executed; when the number of image frames in the super-resolution face image set is N, step S37 is executed.

At block S36: in response to the number of image frames in the super-resolution face image set being not N, replacing the first super-resolution face image with the second super-resolution face image, and replacing the first low-resolution face image with a next-frame face image of the first low-resolution face image.

When the number of image frames is not N, the first super-resolution face image is replaced with the second super-resolution face image, and the first low-resolution face image is replaced with a next-frame face image of the first low-resolution face image. Then step S33 is executed.

At block S37: in response to the number of image frames in the super-resolution face image set being N, obtaining the N random variables and the super-resolution face image set with N image frames.

Based on the above steps, the super-resolution face images in the super-resolution face image set are configured to extract face feature values, and the face feature values and the N random variables are configured to calculate the output loss of the first network model.

At block S22: inputting the super-resolution face images in the super-resolution face image set into a recognition network in sequence, and extracting to obtain N face feature values.

At block S23: inputting the N random variables and the N face feature values into a loss function, and calculating to obtain an output loss of the first network model.

The training process of the first network model is restricted by the output loss, such that the random variables output by the first network model obey a standard positive distribution, and the similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated last time and the real high-resolution face image.

At block S24. determining whether a training result of the first network model being convergent according to the output loss.

When the output loss of the first network model is convergent, it is indicated that the training result of the first network model is convergent, and step S25 is executed; when the output loss of the first network model is not convergent, step 26 is executed.

At block S25: in response to the training result of the first network model being convergent, recording the trained first network model as the second network model.

When the training result is convergent, it is indicated that after the first network model performs super-resolution processing on the multiple-frame low-resolution face images, the identity information of the last generated super-resolution face image is consistent with the identity information of the low-resolution face image. The trained first network model is recorded as the second network model. The second network model can perform super-resolution processing on multiple-frame low-resolution face images of any target, and the identity information of the last generated super-resolution face image is consistent with the identity information of the low-resolution face image.

At block S26: in response to the training result of the first network model not being convergent, adjusting parameters of the first network model and continuing to train the first network model until the training result is convergent.

When the training result is not convergent, the parameters of the first network model are adjusted, N frames of low-resolution face images of another target are continually obtained, step S11 is executed, and the first network model is continually trained until the training result is convergent.

Based on the above steps, the N frames of low-resolution face images are input to the first network model, the first network model is trained, and the training is completed to obtain the second network model. The second network model can perform super-resolution processing on multiple-frame low-resolution face images of any target, and the identity information of the last generated super-resolution face image is consistent with the identity information of the low-resolution face image.

In the training process of obtaining the second network model, the output loss of the first network model is configured to restrict the first network model such that the random variables output by the first network model obey the standard positive distribution, and the similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated last time and the real high-resolution face image.

In order to further explain the calculation method of the output loss, the output loss of the first network model calculated in step S23 is required to be explained in detail. The specific calculation process of the output loss is shown in FIG. 4.

At block S41: inputting the N random variables into a negative log-likelihood loss function, and calculating to obtain a negative log-likelihood loss.

In the embodiment of the present disclosure, the negative log-likelihood loss is configured to restrict the first network model such that the random variables output by the first network model obey the standard positive distribution, where the negative log-likelihood loss can be calculated by formula (1) :

where the formula (1) is the negative log-likelihood loss function, LR is a low-resolution face image, SR is a super-resolution face image, θ is a distribution parameter, N is the number of frames of the low-resolution face image, and LR _1i indicates that the i-th frame of low-resolution face image input to the first network model, p _Z (z _1i) represents a spatial distribution of random variables, Z _1i represents a random variable obtained by inputting the i-th frame of low-resolution face image into the first network model, and f _θ is the first network model. The first network model f _θ is decomposed into M reversible layer sequences:

At block S42: inputting the N face feature values into a cosine loss function, and calculating to obtain a cosine loss.

In the embodiment of the present disclosure, the cosine loss indicates the degree of difference between super-resolution face features and real face features, where the cosine loss can be calculated by formula (2) :

where the formula (2) is the cosine loss function, Similarity _i is a cosine similarity between a super-resolution face image super-divided by the first network model for the i-th time and a real high-resolution face image, and the cosine similarity is in a value range of (-1, 1) . The greater the cosine similarity is, the higher the similarity between the super-resolution face image and the real high-resolution face image is. The cosine similarity can be calculated by formula (3) :

where Similarity _i represents the cosine similarity generated for the i-th time, formula (3) is the cosine similarity function, F _i is a face feature value extracted after the super-resolution face image generated by the first network model for the i-th time is input into the recognition network, and F ₀ is a face feature value extracted after the real high-resolution face image is input into the recognition network.

At block S43: inputting the cosine loss into a cosine comparison loss function, and calculating to obtain a cosine comparison loss.

In the embodiment of the present disclosure, the cosine comparison loss is configured to restrict the first network model, such that the similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated last time and the real high-resolution face image, that is, Similarity _i+1 is greater than Similarity _i, and the cosine comparison loss can be calculated by formula (4) :

where formula (4) is the cosine comparison loss function, e is the base of the natural logarithm, and α is a comparison coefficient.

At block S44: inputting the negative log-likelihood loss, the cosine loss and the cosine comparison loss into the loss function, and calculating to obtain the output loss of the first network model.

In the embodiment of the present disclosure, the output loss is configured to restrict the training process of the first network model, which can make the random variables encoded by the first network model obey the standard positive distribution, and can also make the similarity between the super-resolution face image generated each time and the real high-resolution face image greater than the similarity between the super-resolution face image generated last time and the real high-resolution face image. The output loss can be calculated by formula (5) :

where formula (5) is the loss function.

Based on the above steps, the output loss of the first network model is calculated. When the output loss is not convergent, the parameters of the first network model are adjusted, and the training of the first network model is continued until the output loss is convergent.

When the output loss is convergent, it is indicated that the second network model obtained after the training can perform super-resolution processing on the N frames of low-resolution face images to obtain the N frames of super-resolution face images, and in the process of super-resolution processing, the similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated last time and the real high-resolution face image.

In order to further explain how the second network model performs super-resolution processing on the N frames of low-resolution face images, step S13 is required to be described in detail. Specifically, as shown in FIG. 5, FIG. 5 is a flowchart of a method for obtaining N frames of super-resolution face images based on a second network model according to an embodiment of the present disclosure.

At block S51: randomly sampling a first random variable from the random variables that obey the standard positive distribution generated in the training process, and determining a second reference frame from the N frames of low-resolution face images.

At block S52: taking a next-frame image of the second reference frame as a second low-resolution face image.

At block S53: inputting the first random variable and the second reference frame into the second network model to obtain the super-resolution face image corresponding to the first random variable.

At block S54: counting the super-resolution face image generated each time, and determining whether a total number of frames of the super-resolution face images is N.

The purpose of counting the super-resolution face images generated each time is to determine whether the super-resolution processing has been performed on all the N frames of low-resolution face images. When the total number of frames of the super-resolution face image is N, it is indicated that the super-resolution is completed, and step S55 is executed. When the total number of frames of the super-resolution face image is not N, step S56 is executed.

S55: in response to the total number of frames of the super-resolution face image being N, which indicates that the super-resolution processing is completed, and obtaining the N frames of super-resolution face image.

At block S56: in response to the total number of frames of the super-resolution face image being not N, inputting the super-resolution face images and the second low-resolution face image into the second network model to obtain a second random variable.

At block S57: replacing the first random variable with the second random variable, replacing the second low-resolution face image with a next-frame image of the second low-resolution face image, and continuing to perform super-resolution processing on the replaced second low-resolution face image.

After replacing the first random variable with the second random variable, and replacing the second low-resolution face image with the next frame of the second low-resolution face image, step S53 is executed to continually perform super-resolution processing on the replaced second low-resolution face image.

Based on the above method, the second network model is configured to perform super-resolution processing on the N frames of low-resolution face images, and the super-resolution face image generated each time additionally has detail features of one frame of low-resolution face image than the super-resolution face image generated in the previous time. Therefore, the last generated super-resolution face image contains the detailed features of N frames of low-resolution face image, that is, the identity information of the last generated super-resolution face image is consistent with the identity information of the N frames of low-resolution face images.

Of course, using the second network model based on the above steps can not only perform super-resolution processing on N frames of low-resolution face images of the first target, but also perform super-resolution processing on N frames of low-resolution face images of the second target. The generated identity information of the last frame of super-resolution face image of the second target is consistent with the identity information of the N frames of low-resolution face image of the second target.

Further, in order to elaborate on an image generation method provided by the present disclosure, the method provided by the present disclosure will be described in detail below through specific application scenarios.

Before the image is generated, the first network model is required to be trained. Referring to FIG. 6, the N frames of low-resolution face images of the first target are sorted according to an obtaining order of an image obtaining device, and are recorded as a first frame of low-resolution face image, a second frame of low-resolution face image, ..., and an Nth frame of low-resolution face image. The first frame of low-resolution face image is taken as the reference frame LR ₁₁, and the real high-resolution face image HR of the first target is input into the recognition network to obtain a first face feature value F ₀, where HR is recorded as SR ₀.

For a first training, HR and the second frame of low-resolution face image LR ₁₂ are input into the first network model to obtain the first random variable Z ₁₁; Z ₁₁ and LR ₁₁ are input into the first network model to generate the first frame of super-resolution face image SR ₁₁; SR ₁₁ is input into the recognition network to obtain a second face feature value F ₂;

For a second training, SR ₁₁ and the third frame of low-resolution face image LR ₁₂ are input into the first network model to obtain the second random variable Z ₁₂; Z ₁₂ and LR ₁₁ are input into the first network model to generate the second frame of super-resolution face image SR ₁₂ of the first object; SR ₁₂ is input into the recognition network to obtain a third face feature value F ₃;

In an i-th (i>1) training, the super-resolution face image SR ₁ _(i-1) generated by the first network model for an (i-1) -th time and the (i+1) -th frame of low-resolution face image LR ₁ _(i+1) are input into the first network model to obtain the i-th random variable Z _1i; Z _1i and LR ₁₁ are input into the first network model to generate the i-th frame of super-resolution face image SR _1i of the first object, SR _1i is input into the recognition network to obtain the (i+1) -th face feature value F _i.

The generated face feature values {Fi, i= 0, 1, ..., N-1} are input into the loss function, calculation is performed to obtain the output loss of the first network model, and it is determined whether the output loss is convergent. When the output loss is convergent, it is indicated that the training result of the first network model is convergent, and the trained first network model is recorded as the second network model. When the output loss is not convergent, the parameters of the first network model are adjusted, and the training of the first network model is continued to be performed until the training result is convergent.

The second network model obtained based on the above training method can not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain super-resolution face images of which the identity information is consistent with the identity information of the first target, and also can perform super-resolution processing on the N frames of low-resolution face images of the second target to obtain super-resolution face images of which the identity information consistent with the identity information of the second target.

After completing the training of the first network model and obtaining the second network model, the multi-frame low-resolution face image of any target can be super-resolution processed through the second network to obtain super-resolution face images of which the identity information is consistent with the low-resolution face images. Taking the first target as an example and referring to FIG. 7 to describe the specific process.

In the first super-resolution, a random variable Z ₂₁ and a reference frame LR ₂₁ are randomly sampled in the random variable distribution space that meets the standard positive distribution generated during the training process and simultaneously input into the second network model, to generate the second frame of super-resolution face image SR ₂₁ of the second object. The first frame of low-resolution face image of N frames of low-resolution face images is determined as the reference frame LR ₂₁.

For the second super-resolution, SR ₂₁ and the second frame of low-resolution face image LR ₂₂ are simultaneously input into the second network model to obtain a second random variable Z ₂₂; Z ₂₂ and LR ₂₁ are input into the second network model to generate the second frame super-resolution face image SR ₂₂ of the second object.

For the i-th super-resolution, the (i-1) -th frame of super-resolution face image SR ₂ _(i-1) and the i-th frame of low-resolution face image LR _2i generated by the second network model are simultaneously input to the second network model to generate the i-th super-resolution face image SR _2i of the second network model.

After the last frame of low-resolution face image is input into the second network model, the last frame of super-resolution face image generated is taken as the final super-resolution result.

Based on the above process, N frames of low-resolution face images of the first target are sequentially input to the first network model, the first network model is trained, and the output loss of the first network model is configured to restrict the training process of the first network model to cause the training result of the first network model to converge, and the trained first network model is recorded as the second network model. Because the last-frame super-resolution face image obtained in the training process contains detailed information of multiple frames of low-resolution face images, after the second network model is used to perform super-resolution processing on the N frames of low-resolution face images of the first target, the identity information of the last-frame super-resolution face image obtained is consistent with the identity information of the first target.

Of course, the second network model can not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain super-resolution face images of which the identity information is consistent with the identity information of the first target, and also can perform super-resolution processing on the N frames of low-resolution face images of the second target to obtain super-resolution face images of which the identity information consistent with the identity information of the second target.

Based on the same inventive concept, an image generation apparatus is also provided in an embodiment of the present disclosure. As shown in FIG. 8, FIG. 8 is a structural schematic view of an image generation apparatus according to an embodiment of the present disclosure. The apparatus includes:

an obtaining module 81, configured to obtain N frames of low-resolution face images of a first target, wherein the N is a positive integer greater than or equal to 2;

a training module 82, configured to train the first network model according to the N frames of low-resolution face images to obtain a second network model, wherein the first network model is capable of perform super-resolution processing on a low-resolution face images;

a processing module 83, configured to perform super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images.

a selection module 84, configured to take a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.

In a possible design, as shown in FIG. 9, the training module includes:

a calculation unit 91, configured to calculate and obtain an output loss of the first network model according to the N frames of low-resolution face images, wherein the output loss is configured to restrict a training process of the first network model;

a determining unit 92, configured to determine whether a training result of the first network model is convergent according to the output loss;

an adjustment unit 93, configured to, in response to the training result of the first network model not being convergent, adjust parameters of the first network model and continuing to train the first network model until the training result is convergent;

a marking unit 94, configured to, in response to the training result of the first network model being convergent, record the trained first network model as the second network model.

In a possible design, the calculation unit is specifically configured to:

In a possible design, the calculation unit is also configured to:

In a possible design, as shown in FIG. 10, the processing module includes:

an obtaining unit 101, configured to randomly sample a first random variable from the random variables that obey the standard positive distribution generated in the training process, and determine a second reference frame from the N frames of low-resolution face images;

a processing unit 102, configured to input the first random variable and the second reference frame into the second network model to obtain the super-resolution face image corresponding to the first random variable;

an encoding unit 103, configured to input the super-resolution face image and the second low-resolution face image into the second network model to obtain a second random variable; wherein the second low-resolution face image is a next-frame image of the second reference frame;

an updating unit 104, configured to replace the first random variable with the second random variable, replace the second low-resolution face image with a next-frame image of the second low-resolution face image, and continue to perform super-resolution processing on the replaced second low-resolution face image to obtain the N frames of super-resolution face images in sequence.

Based on the above image generation apparatus, N frames of low-resolution face images of the first target are sequentially input to the first network model, the first network model is trained, and the output loss of the first network model is configured to restrict the training process of the first network model to cause the training result of the first network model to converge, and the trained first network model is recorded as the second network model. Because the last-frame super-resolution face image obtained in the training process contains detailed information of multiple frames of low-resolution face images, after the second network model is used to perform super-resolution processing on the N frames of low-resolution face images of the first target, the identity information of the last-frame super-resolution face image obtained is consistent with the identity information of the first target.

Based on a same inventive concept, an embodiment of the present disclosure also provide an electronic device, which can realize the function of the above image generation apparatus. Referring to FIG. 11, the electronic device includes:

at least one processor 111 and a memory 112 connected to the at least one processor 111. The specific connection medium between the processor 111 and the memory 112 is not limited in the embodiment of the present disclosure. Taking the bus 110 as an example, the bus 110 is represented by a thick line in FIG. 11, and the connection mode between other components is only for schematic illustration, and is not to be taken as a limitation. The bus 110 may be divided into an address bus, a data bus, a control bus, etc. For ease of presentation, only a thick line is used in FIG. 11 to represent it, but it does not mean that there is only one bus or one type of bus. Alternatively, the processor 111 may also be called a controller, and there is no restriction on the name.

In the embodiment of the present disclosure, the memory 112 stores instructions that can be executed by at least one processor 111, and the at least one processor 111 can execute the image generation method discussed above by executing the instructions stored in the memory 112. The processor 111 can implement the functions of each module in the apparatus shown in FIG. 6.

Among them, the processor 111 is a control center of the device and can connect various parts of the entire such control device using various interfaces and lines to monitor the device as a whole by running or executing the instructions stored in the memory 112 and calling the data stored in the memory 112, the various functions and processing data of the device.

In a possible design, the processor 111 may include one or more processing units, and the processor 111 may integrate an application processor and a modem processor, wherein the application processor primarily handles the operating system, user interface, and applications, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the above modem processor may also not be integrated into processor 111. In some embodiments, processor 111 and memory 112 may be implemented on the same chip, and in some embodiments, they may also be implemented separately on separate chips.

The processor 111 may be a general purpose processor, such as a central processing unit (CPU) , a digital signal processor, a specialized integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component that can implement or perform each of the methods, steps, and logic block diagrams disclosed in embodiments of the present disclosure. The general purpose processor may be a microprocessor or any conventional processor, etc. The steps of the image generation method disclosed in conjunction with the embodiments of the present disclosure can be directly embodied as performed by the hardware processor or performed with a combination of hardware and software modules in the processor.

The memory 112 serves as a non-volatile computer readable storage medium that can be configured to store non-volatile software programs, non-volatile computer executable programs, and modules. The memory 112 may include at least one type of storage medium, which may include, for example, flash memory, hard disk, multimedia card, card-type memory, random access memory (RAM) , static random access memory (SRAM) , programmable read-only memory (PROM) , Read Only Memory (ROM) , Electrically Erasable Programmable Read-Only Memory (EEPROM) , magnetic memory, disk, CD-ROM, etc. Magnetic memory, disk, CD-ROM, etc. The memory 112 is any other medium capable of being used to carry or store desired program code in the form of instructions or data structures and capable of being accessed by a computer, but is not limited thereto. The memory 112 in the embodiments of the present disclosure may also be a circuit or any other device capable of performing storage functions for storing program instructions and/or data.

By designing and programming the processor 111, the code corresponding to the image generation method introduced in the above embodiments can be solidified into the chip, such that the chip can execute the steps of the image generation method of the embodiments shown in FIG. 1 when the chip is running. The way of designing and programing the processor 111 is a technology well known to those skilled in the art, and will not be repeated here.

Based on a same inventive concept, an embodiment of the present disclosure also provide a storage medium that stores computer instructions, and when the computer instructions run on the computer, the computer executes the image generation method discussed above.

In some possible implementation manners, various aspects of the image generation method provided in the present disclosure can also be implemented in the form of a program product, which includes program code. When the program product runs on a device, the program code is configured to control the device to execute the steps in the image generation method according to various exemplary embodiments of the present disclosure described above in this specification.

Those skilled in the art should understand that the embodiments of the present disclosure can be provided as a method, a system, or a computer program product. Therefore, the present disclosure may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present disclosure may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc. ) containing computer-usable program codes.

The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems) , and computer program products according to embodiments of the present disclosure. It should be understood that each process and/or block in the flowchart and/or block diagram, and the combination of processes and/or blocks in the flowchart and/or block diagram can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, a specialized computer, an embedded processor, or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more processes of a flowchart and/or one or more boxes of a block diagram.

These computer program instructions may also be stored in a computer readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in such computer readable memory produce an article of manufacture comprising an instruction device that implements a function specified in one or more processes of a flowchart and/or one or more boxes of a block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in the flowchart one process or a plurality of processes and/or the block diagram one block or a plurality of blocks.

Obviously, those skilled in the art can make various changes and modifications to the present disclosure without departing from the spirit and scope of the present disclosure. In this way, if these modifications and variations of the present disclosure fall within the scope of the claims of the present disclosure and their equivalent technologies, then the present disclosure is also intended to include these modifications and variations.

Claims

An image generation method, comprising:

obtaining N frames of low-resolution face images of a first target, wherein the N is a positive integer greater than or equal to 2;

training the first network model according to the N frames of low-resolution face images to obtain a second network model, wherein the first network model is configured to perform super-resolution processing a low-resolution face image;

performing super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images; and

taking a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.
The method according to claim 1, wherein the training the first network model according to the N frames of low-resolution face images to obtain the second network model comprises:

calculating and obtaining an output loss of the first network model according to the N frames of low-resolution face images, wherein the output loss is configured to restrict a training process of the first network model;

determining whether a training result of the first network model being convergent according to the output loss;

in response to the training result of the first network model not being convergent, adjusting parameters of the first network model and continuing to train the first network model until the training result is convergent; and

in response to the training result of the first network model being convergent, taking the trained first network model as the second network model.
The method according to claim 2, wherein the calculating and obtaining the output loss of the first network model according to the N frames of low-resolution face images comprise:

obtaining N random variables and a super-resolution face image set based on the N frames of low-resolution face images through a first network model; wherein the number of frames of super-resolution face images in the super-resolution face image set is N;

inputting the super-resolution face images in the super-resolution face image set into a recognition network in sequence, and extracting to obtain N face feature values; and

inputting the N random variables and the N face feature values into a loss function, and calculating to obtain the output loss of the first network model.
The method according to claim 3, wherein the obtaining the N random variables and the super-resolution face image set based on the N frames of low-resolution face images through the first network model comprises:

determining a frame of low-resolution face image among the N frames of low-resolution face image as a first reference frame;

inputting a first super-resolution face image and a first low-resolution face image into the first network model to obtain a random variable corresponding to the first low-resolution face image; wherein the first super-resolution face image is a real high-resolution face image of the first target, and the first low-resolution face image is a next-frame image of the first reference frame;

inputting the random variable and the first reference frame into the first network model to obtain a second super-resolution face image; and

replacing the first super-resolution face image with the second super-resolution face image, replacing the first low-resolution face image with a next-frame face image of the first low-resolution face image, and continuing to perform training on the first network model to generate other super-resolution face images in sequence forming another super-resolution face image set.
The method according to claim 3, wherein the inputting the N random variables and the N face feature values into the loss function, and calculating to obtain the output loss of the first network model comprise:

inputting the N random variables into a negative log-likelihood loss function, and calculating to obtain a negative log-likelihood loss; wherein the negative log-likelihood loss is configured to restrict the first network model such that random variables output by the first network model obey a standard positive distribution;

inputting the N face feature values into a cosine loss function, and calculating to obtain a cosine loss; wherein the cosine loss is configured to indicate a degree of difference between super-resolution face features and real face features;

inputting the cosine loss into a cosine comparison loss function, and calculating to obtain a cosine comparison loss; wherein the cosine comparison loss is configured to restrict the first network model, such that a similarity between a super-resolution face image generated each time and the real high-resolution face image is greater than a similarity between a super-resolution face image generated last time and the real high-resolution face image; and

inputting the negative log-likelihood loss, the cosine loss and the cosine comparison loss into the loss function, and calculating to obtain the output loss of the first network model; wherein the output loss is configured to restrict the training process of the first network model.
The method according to claim 1, wherein the performing super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain the N frames of super-resolution face images comprises:

randomly sampling a first random variable among random variables that obey a standard positive distribution generated in the training process, and determining a second reference frame among the N frames of low-resolution face images;

inputting the first random variable and the second reference frame into the second network model to obtain a super-resolution face image corresponding to the first random variable;

inputting the super-resolution face image and the second low-resolution face image into the second network model to obtain a second random variable; wherein the second low-resolution face image is a next-frame image of the second reference frame; and

replacing the first random variable with the second random variable, replacing the second low-resolution face image with a next-frame image of the second low-resolution face image, and continuing to perform super-resolution processing on the replaced second low-resolution face image to obtain the N frames of super-resolution face images in sequence.
An image generation apparatus, comprising:

an obtaining module, configured to obtain N frames of low-resolution face images of a first target, wherein the N is a positive integer greater than or equal to 2;

a training module, configured to train the first network model according to the N frames of low-resolution face images to obtain a second network model, wherein the first network model is configured to perform super-resolution processing a low-resolution face image;

a processing module, configured to perform super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images; and

a selection module, configured to take a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.
The apparatus according to claim 7, wherein the processing module comprises:

an obtaining unit, configured to randomly sample a first random variable among random variables that obey a standard positive distribution generated in the training process, and determine a second reference frame among the N frames of low-resolution face images;

a processing unit, configured to input the first random variable and the second reference frame into the second network model to obtain a super-resolution face image corresponding to the first random variable;

an encoding unit, configured to input the super-resolution face image and the second low-resolution face image into the second network model to obtain a second random variable; wherein the second low-resolution face image is a next-frame image of the second reference frame; and

an updating unit, configured to replace the first random variable with the second random variable, replace the second low-resolution face image with a next-frame image of the second low-resolution face image, and continue to perform super-resolution processing on the replaced second low-resolution face image to obtain the N frames of super-resolution face images in sequence.
An electronic device, comprising:

a memory, configured to store a computer program; and

a processor, configured to execute the computer program stored in the memory to perform the method according to any one of claims 1-6.
A storage medium, storing a computer program; wherein the computer program is configured to perform the method according to any one of claims 1-6 when executed by a processor.