WO2023010701A1 - Image generation method, apparatus, and electronic device - Google Patents

Image generation method, apparatus, and electronic device

Info

Publication number
WO2023010701A1
WO2023010701A1 (PCT/CN2021/128518; CN2021128518W)
Authority
WO
WIPO (PCT)
Prior art keywords
resolution face
super
network model
face image
resolution
Prior art date
Application number
PCT/CN2021/128518
Other languages
French (fr)
Inventor
Yapeng LI
Ningbo WANG
Original Assignee
Zhejiang Dahua Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co., Ltd.
Publication of WO2023010701A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformations in the plane of the image
    • G06T 3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053: Scaling of whole images or parts thereof, e.g. expanding or contracting, based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformations in the plane of the image
    • G06T 3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4046: Scaling of whole images or parts thereof, e.g. expanding or contracting, using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/778: Active pattern-learning, e.g. online learning of image or video features
    • G06V 10/7796: Active pattern-learning, e.g. online learning of image or video features, based on specific statistical tests
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172: Classification, e.g. identification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Definitions

  • the present disclosure relates to the field of face recognition technologies, and in particular to an image generation method, an apparatus, and an electronic device.
  • Face recognition, as a safe, non-contact, convenient, and efficient way of identity information authentication, has been widely used in all aspects of social life.
  • in relatively large surveillance scenes, the size of a face that appears in the video is usually small, and the image definition is low, making it difficult to meet the needs of face recognition. Therefore, face super-resolution technology becomes more and more important.
  • the face super-resolution technology essentially adds high-frequency features to low-resolution face images to generate high-resolution face images.
  • the prior art is usually based on a single-frame low-resolution face image and obtains a super-resolution face image through super-resolution processing.
  • the super-resolution face image obtained in this way has missing face information, and cannot guarantee that the identity information of the super-resolution face image is consistent with the identity information of the single-frame low-resolution face image.
  • the present disclosure provides an image generation method, an apparatus, and an electronic device, to realize super-resolution processing of multiple low-resolution face images to obtain super-resolution face images of which the identity information is consistent with the identity information of the low-resolution face images.
  • the present disclosure provides an image generation method, comprising:
  • N is a positive integer greater than or equal to 2;
  • the training the first network model according to the N frames of low-resolution face images to obtain the second network model comprises:
  • the output loss of the first network model is configured to restrict the training process of the first network model.
  • the second network model obtained after the training process is configured to realize super-resolution processing of multiple low-resolution face images of any target to obtain super-resolution face images of which the identity information is consistent with the identity information of the low-resolution face images.
  • the calculating and obtaining the output loss of the first network model according to the N frames of low-resolution face images comprise:
  • the obtained output loss of the first network model is configured to restrict the training process of the first network model to cause the training result to be convergent.
  • the obtaining the N random variables and the super-resolution face image set based on the N frames of low-resolution face images through the first network model comprises:
  • the first super-resolution face image is a real high-resolution face image of the first target
  • the first low-resolution face image is a next-frame image of the first reference frame
  • the obtained super-resolution face images of the super-resolution face image set are configured to extract face feature values.
  • the face feature values and the N random variables are configured to calculate the output loss of the first network model.
  • the inputting the N random variables and the N face feature values into the loss function, and calculating to obtain the output loss of the first network model comprise:
  • the cosine loss is configured to indicate a degree of difference between super-resolution face features and real face features
  • the cosine comparison loss is configured to restrict the first network model, such that a similarity between a super-resolution face image generated each time and the real high-resolution face image is greater than a similarity between a super-resolution face image generated last time and the real high-resolution face image;
  • the output loss is configured to restrict the training process of the first network model.
  • the output loss of the first network model is obtained.
  • the training process of the first network model is restricted through the output loss, which can achieve that the random variables encoded by the first network model obey the standard normal distribution and that the similarity between the super-resolution face image generated by the first network model each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated by the first network model last time and the real high-resolution face image.
  • the performing super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain the N frames of super-resolution face images comprises:
  • the second network model is configured to perform super-resolution processing on the N frames of low-resolution face images, and the super-resolution face image generated each time additionally has the detail features of one more frame of low-resolution face image than the super-resolution face image generated in the previous time. Therefore, the last generated super-resolution face image contains the detailed features of the N frames of low-resolution face images, that is, the identity information of the last generated super-resolution face image is consistent with the identity information of the N frames of low-resolution face images.
  • an image generation apparatus comprising:
  • an obtaining module configured to obtain N frames of low-resolution face images of a first target, wherein the N is a positive integer greater than or equal to 2;
  • a training module configured to train the first network model according to the N frames of low-resolution face images to obtain a second network model, wherein the first network model is configured to perform super-resolution processing on a low-resolution face image;
  • a processing module configured to perform super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images
  • a selection module configured to take a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.
  • the training module comprises:
  • a calculation unit configured to calculate and obtain an output loss of the first network model according to the N frames of low-resolution face images, wherein the output loss is configured to restrict a training process of the first network model
  • a determining unit configured to determine whether a training result of the first network model is convergent according to the output loss
  • an adjustment unit configured to, in response to the training result of the first network model not being convergent, adjust parameters of the first network model and continue to train the first network model until the training result is convergent;
  • a marking unit configured to, in response to the training result of the first network model being convergent, record the trained first network model as the second network model.
  • the calculation unit is specifically configured to:
  • obtain N random variables and a super-resolution face image set from the N frames of low-resolution face images through the first network model; wherein the number of frames of super-resolution face images in the super-resolution face image set is N;
  • the calculation unit is further configured to:
  • the first super-resolution face image is a real high-resolution face image of the first target
  • the first low-resolution face image is a next-frame image of the first reference frame
  • the calculation unit is further configured to:
  • the negative log-likelihood loss is configured to restrict the first network model such that the random variables output by the first network model obey the standard normal distribution
  • input the N face feature values into a cosine loss function, and calculate to obtain a cosine loss; wherein the cosine loss is configured to indicate a degree of difference between super-resolution face features and real face features;
  • the cosine comparison loss is configured to restrict the first network model, such that a similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than a similarity between the super-resolution face image generated last time and the real high-resolution face image;
  • input the negative log-likelihood loss, the cosine loss and the cosine comparison loss into the loss function, and calculate to obtain the output loss of the first network model; wherein the output loss is configured to restrict the training process of the first network model.
  • the processing module comprises:
  • an obtaining unit configured to randomly sample a first random variable among random variables that obey a standard normal distribution generated in the training process, and determine a second reference frame among the N frames of low-resolution face images;
  • a processing unit configured to input the first random variable and the second reference frame into the second network model to obtain a super-resolution face image corresponding to the first random variable
  • an encoding unit configured to input the super-resolution face image and the second low-resolution face image into the second network model to obtain a second random variable; wherein the second low-resolution face image is a next-frame image of the second reference frame;
  • an updating unit configured to replace the first random variable with the second random variable, replace the second low-resolution face image with a next-frame image of the second low-resolution face image, and continue to perform super-resolution processing on the replaced second low-resolution face image to obtain the N frames of super-resolution face images in sequence.
  • the present disclosure provides an electronic device, comprising:
  • a memory configured to store a computer program
  • a processor configured to execute the computer program stored in the memory to perform the method as described above.
  • the present disclosure provides a storage medium, storing a computer program; wherein the computer program is configured to perform the method as described above when executed by a processor.
  • super-resolution processing is performed on the N frames of low-resolution face images of the first target to train the first network model.
  • the last generated super-resolution face image contains the detailed features of the N frames of low-resolution face images. Therefore, after super-resolution processing is performed on the N frames of low-resolution face images of the first target based on the second network model, the identity information of the last generated super-resolution face image is consistent with the identity information of the first target.
  • the second network model may not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain the super-resolution face image of which identity information is consistent with the identity information of the first target, but may also perform super-resolution processing on N frames of low-resolution face images of a second target to obtain a super-resolution face image of which identity information is consistent with the identity information of the second target.
  • FIG. 1 is a flowchart of an image generation method according to an embodiment of the present disclosure.
  • FIG. 2 is a flowchart of a method for training a first network model according to an embodiment of the present disclosure.
  • FIG. 3 is a flowchart of a method for obtaining N random variables and a super-resolution face image set based on a first network model according to an embodiment of the present disclosure.
  • FIG. 4 is a flowchart of a method for calculating an output loss of a first network model according to an embodiment of the present disclosure.
  • FIG. 5 is a flowchart of a method for obtaining N frames of super-resolution face images based on a second network model according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic view of a method for training a first network model according to an embodiment of the present disclosure.
  • FIG. 7 is a flowchart of a method for performing super-resolution processing on N frames of low-resolution face images based on a second network model according to an embodiment of the present disclosure.
  • FIG. 8 is a structural schematic view of an image generation apparatus according to an embodiment of the present disclosure.
  • FIG. 9 is a structural schematic view of a training module according to an embodiment of the present disclosure.
  • FIG. 10 is a structural schematic view of a processing module according to an embodiment of the present disclosure.
  • FIG. 11 is a structural schematic view of an electronic device according to an embodiment of the present disclosure.
  • the image generation method provided by the embodiments of the present disclosure can solve the problem of being unable to ensure that identity information of an obtained super-resolution face image is consistent with identity information of a single-frame low-resolution face image while performing super-resolution processing based on the single-frame low-resolution face image.
  • the method and apparatus described in the embodiments of the present disclosure are based on a same technical concept. Since the principles of the method and apparatus to solve the problem are similar, the embodiments of the apparatus and the method can be referred to each other, and repeated description will be omitted.
  • Face super-resolution technology essentially adds high-frequency features to low-resolution face images to generate high-resolution face images.
  • an SRFlow network model is often used.
  • the SRFlow network model is reversible and can learn a conditional distribution of super-resolution images with respect to low-resolution images.
  • A high-resolution image and a low-resolution image are input into the SRFlow network model to obtain random variables that meet a specific distribution.
  • A low-resolution image and random variables that meet the specific distribution are input into the SRFlow network model to generate the super-resolution face image.
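To make the two directions concrete, the following minimal Python sketch shows how an invertible conditional model of this kind is typically called; the `flow` object and its `cond`/`reverse` keywords are hypothetical stand-ins for illustration, not the actual SRFlow API.

```python
def encode(flow, hr, lr):
    # Forward direction: a high-resolution image, conditioned on its
    # low-resolution counterpart, is mapped to a latent variable z that
    # should follow a known (standard normal) distribution.
    z, log_det = flow(hr, cond=lr, reverse=False)
    return z

def generate(flow, z, lr):
    # Inverse direction: a latent variable z, conditioned on a
    # low-resolution image, is mapped back to a super-resolution image.
    sr, _ = flow(z, cond=lr, reverse=True)
    return sr
```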
  • super-resolution processing is usually performed on a single-frame low-resolution face image based on the SRFlow network model to obtain a super-resolution face image.
  • however, a single frame carries limited detailed information, and such detailed information is usually key to distinguishing face identity; thus it cannot be ensured that the identity information of the obtained super-resolution face image is consistent with the identity information of the low-resolution face image.
  • the present disclosure proposes a solution: based on a first network model, sequentially inputting multi-frame low-resolution face images of a first target into the first network model in an iterative manner; training the first network model; restricting a training process according to an output loss of the first network model; in response to a training result of the first network model being convergent, recording the trained first network model as a second network model; performing super-resolution processing on the multi-frame low-resolution face images of the first target or multi-frame low-resolution face images of a second target with the second network model, and obtaining a last frame of super-resolution face image.
  • the generated last super-resolution face image has detailed features of the multi-frame low-resolution face images, and thus the identity information thereof is consistent with the identity information of the multi-frame low-resolution face images.
  • FIG. 1 is a flowchart of an image generation method according to an embodiment of the present disclosure.
  • the first network model may be a SRFlow network model.
  • the N frames of low-resolution face images are sequentially input into the first network model in an iterative manner, and the first network model is trained.
  • the training process is restricted according to an output loss of the first network model.
  • the trained first network model is recorded as the second network model.
  • super-resolution processing is performed on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images.
  • the super-resolution face image generated each time additionally has the detail features of one more frame of low-resolution face image than the super-resolution face image generated in the previous time. Therefore, the last generated super-resolution face image contains the detailed features of the N frames of low-resolution face images, that is, the identity information of the last generated super-resolution face image is consistent with the identity information of the N frames of low-resolution face images.
  • the second network model may not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain the super-resolution face image of which identity information is consistent with the identity information of the first target, but may also perform super-resolution processing on N frames of low-resolution face images of a second target to obtain a super-resolution face image of which identity information is consistent with the identity information of the second target.
  • FIG. 2 is a flowchart of a method for training a first network model according to an embodiment of the present disclosure.
  • the super-resolution face image set stores a frame of real high-resolution face image of the first target, and the super-resolution face images generated each time.
  • the frame of real high-resolution face image is recorded as a first super-resolution face image, and the total number of super-resolution face images in the super-resolution face image set is N.
  • the obtaining the N random variables and the super-resolution face image set can be implemented by inputting the N frames of low-resolution face images into the first network model in an iterative manner.
  • the specific process is shown in FIG. 3.
  • the first reference frame may be the first frame in the N frames of low-resolution face images, or may be the second frame, the third frame, the fourth frame, etc.
  • the first frame of low-resolution face image is selected as an example.
  • step S36 when the number of image frames in the super-resolution face image set is not N, step S36 is executed; when the number of image frames in the super-resolution face image set is N, step S37 is executed.
  • step S33 is executed.
  • the super-resolution face images in the super-resolution face image set are configured to extract face feature values, and the face feature values and the N random variables are configured to calculate the output loss of the first network model.
  • the training process of the first network model is restricted by the output loss, such that the random variables output by the first network model obey a standard normal distribution, and the similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated last time and the real high-resolution face image.
  • step S25 is executed; when the output loss of the first network model is not convergent, step S26 is executed.
  • when the training result is convergent, it is indicated that after the first network model performs super-resolution processing on the multiple-frame low-resolution face images, the identity information of the last generated super-resolution face image is consistent with the identity information of the low-resolution face images.
  • the trained first network model is recorded as the second network model.
  • the second network model can perform super-resolution processing on multiple-frame low-resolution face images of any target, and the identity information of the last generated super-resolution face image is consistent with the identity information of the low-resolution face image.
  • the parameters of the first network model are adjusted, N frames of low-resolution face images of another target are continually obtained, step S11 is executed, and the first network model is continually trained until the training result is convergent.
  • the N frames of low-resolution face images are input to the first network model, the first network model is trained, and the training is completed to obtain the second network model.
  • the second network model can perform super-resolution processing on multiple-frame low-resolution face images of any target, and the identity information of the last generated super-resolution face image is consistent with the identity information of the low-resolution face image.
  • the output loss of the first network model is configured to restrict the first network model such that the random variables output by the first network model obey the standard normal distribution, and the similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated last time and the real high-resolution face image.
  • the output loss of the first network model calculated in step S23 is required to be explained in detail.
  • the specific calculation process of the output loss is shown in FIG. 4.
  • the negative log-likelihood loss is configured to restrict the first network model such that the random variables output by the first network model obey the standard normal distribution, where the negative log-likelihood loss can be calculated by formula (1):
  • LR is a low-resolution face image
  • SR is a super-resolution face image
  • θ is a distribution parameter
  • N is the number of frames of the low-resolution face images
  • LR_1i indicates the i-th frame of low-resolution face image input to the first network model
  • p_Z(z_1i) represents a spatial distribution of random variables
  • z_1i represents a random variable obtained by inputting the i-th frame of low-resolution face image into the first network model
  • f_θ is the first network model.
  • the first network model f_θ is decomposed into M reversible layer sequences:
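The display equations were lost in this extraction. The following is a hedged reconstruction of formula (1) from the symbol definitions above, assuming the standard conditional-flow negative log-likelihood used by SRFlow-style models; the patent's exact form may differ.

```latex
% Reconstructed formula (1): negative log-likelihood of a conditional flow,
% where z_{1i} = f_\theta(SR_{1i}; LR_{1i}) is the latent for the i-th frame.
L_{\mathrm{nll}} = -\frac{1}{N}\sum_{i=1}^{N}
  \left[\log p_Z(z_{1i})
  + \log\left|\det\frac{\partial f_\theta}{\partial SR_{1i}}\right|\right]
% With f_\theta decomposed into M reversible layers
% (h^0 = SR, \; h^m = f_\theta^m(h^{m-1}; LR)):
f_\theta = f_\theta^{M}\circ\cdots\circ f_\theta^{1},
\qquad
\log\left|\det\frac{\partial f_\theta}{\partial SR}\right|
  = \sum_{m=1}^{M}\log\left|\det\frac{\partial f_\theta^{m}}{\partial h^{m-1}}\right|
```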
  • the cosine loss indicates the degree of difference between super-resolution face features and real face features, where the cosine loss can be calculated by formula (2):
  • Similarity_i is a cosine similarity between a super-resolution face image super-resolved by the first network model for the i-th time and a real high-resolution face image, and the cosine similarity is in a value range of [-1, 1].
  • the cosine similarity can be calculated by formula (3):
  • Similarity_i represents the cosine similarity generated for the i-th time
  • formula (3) is the cosine similarity function
  • F_i is a face feature value extracted after the super-resolution face image generated by the first network model for the i-th time is input into the recognition network
  • F_0 is a face feature value extracted after the real high-resolution face image is input into the recognition network.
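Formulas (2) and (3) are likewise missing from this text. Formula (3) is the standard cosine similarity between the feature vectors defined above; the form of formula (2) below is an assumption consistent with the description (a loss that shrinks as super-resolution features approach the real features):

```latex
% Formula (3): cosine similarity between the i-th super-resolution feature F_i
% and the real high-resolution feature F_0.
\mathrm{Similarity}_i = \frac{F_i \cdot F_0}{\lVert F_i \rVert \, \lVert F_0 \rVert}
% Formula (2), assumed form: average cosine distance over the N generated frames.
L_{\cos} = \frac{1}{N}\sum_{i=1}^{N}\bigl(1 - \mathrm{Similarity}_i\bigr)
```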
  • the cosine comparison loss is configured to restrict the first network model, such that the similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated last time and the real high-resolution face image, that is, Similarity_{i+1} is greater than Similarity_i, and the cosine comparison loss can be calculated by formula (4):
  • formula (4) is the cosine comparison loss function, e is the base of the natural logarithm, and ⁇ is a comparison coefficient.
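The body of formula (4) is also missing. One form consistent with the description, with e as the base of the natural logarithm and β as the comparison coefficient, penalizes any iteration whose similarity fails to exceed the previous one; this is an assumption, not the patent's verified formula:

```latex
% Formula (4), assumed form: exponential penalty whenever
% Similarity_{i+1} fails to exceed Similarity_i.
L_{\mathrm{comp}} = \sum_{i=1}^{N-1}
  e^{\,\beta\,(\mathrm{Similarity}_i - \mathrm{Similarity}_{i+1})}
```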
  • the output loss is configured to restrict the training process of the first network model, which can make the random variables encoded by the first network model obey the standard normal distribution, and can also make the similarity between the super-resolution face image generated each time and the real high-resolution face image greater than the similarity between the super-resolution face image generated last time and the real high-resolution face image.
  • the output loss can be calculated by formula (5):
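The body of formula (5) is missing as well; the natural reading is a weighted sum of the three losses above, with λ₁ and λ₂ as assumed balancing weights:

```latex
% Formula (5), assumed form: total output loss of the first network model.
L_{\mathrm{out}} = L_{\mathrm{nll}} + \lambda_1 L_{\cos} + \lambda_2 L_{\mathrm{comp}}
```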
  • the output loss of the first network model is calculated.
  • the parameters of the first network model are adjusted, and the training of the first network model is continued until the output loss is convergent.
  • the second network model obtained after the training can perform super-resolution processing on the N frames of low-resolution face images to obtain the N frames of super-resolution face images, and in the process of super-resolution processing, the similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated last time and the real high-resolution face image.
  • FIG. 5 is a flowchart of a method for obtaining N frames of super-resolution face images based on a second network model according to an embodiment of the present disclosure.
  • the purpose of counting the super-resolution face images generated each time is to determine whether the super-resolution processing has been performed on all the N frames of low-resolution face images.
  • step S55 is executed.
  • step S56 is executed.
  • step S53 is executed to continually perform super-resolution processing on the replaced second low-resolution face image.
  • the second network model is configured to perform super-resolution processing on the N frames of low-resolution face images, and the super-resolution face image generated each time additionally has the detail features of one more frame of low-resolution face image than the super-resolution face image generated in the previous time. Therefore, the last generated super-resolution face image contains the detailed features of the N frames of low-resolution face images, that is, the identity information of the last generated super-resolution face image is consistent with the identity information of the N frames of low-resolution face images.
  • the second network model obtained based on the above steps can perform super-resolution processing not only on N frames of low-resolution face images of the first target, but also on N frames of low-resolution face images of the second target.
  • the identity information of the last generated frame of super-resolution face image of the second target is consistent with the identity information of the N frames of low-resolution face images of the second target.
  • the first network model is required to be trained.
  • the N frames of low-resolution face images of the first target are sorted according to an obtaining order of an image obtaining device, and are recorded as a first frame of low-resolution face image, a second frame of low-resolution face image, ..., and an Nth frame of low-resolution face image.
  • the first frame of low-resolution face image is taken as the reference frame LR_11
  • the real high-resolution face image HR of the first target is input into the recognition network to obtain a first face feature value F_0, where HR is recorded as SR_0.
  • HR and the second frame of low-resolution face image LR_12 are input into the first network model to obtain the first random variable Z_11;
  • Z_11 and LR_11 are input into the first network model to generate the first frame of super-resolution face image SR_11;
  • SR_11 is input into the recognition network to obtain a second face feature value F_1;
  • SR_11 and the third frame of low-resolution face image LR_13 are input into the first network model to obtain the second random variable Z_12;
  • Z_12 and LR_11 are input into the first network model to generate the second frame of super-resolution face image SR_12 of the first target;
  • SR_12 is input into the recognition network to obtain a third face feature value F_2;
  • the super-resolution face image SR_1(i-1) generated by the first network model for an (i-1)-th time and the (i+1)-th frame of low-resolution face image LR_1(i+1) are input into the first network model to obtain the i-th random variable Z_1i;
  • Z_1i and LR_11 are input into the first network model to generate the i-th frame of super-resolution face image SR_1i of the first target;
  • SR_1i is input into the recognition network to obtain the (i+1)-th face feature value F_i.
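Condensing the walkthrough above into code, one training pass over the frames might look like the sketch below; `model.encode`, `model.generate`, and `recognizer` are assumed interfaces standing in for the first network model and the recognition network, and frame indexing follows the description above.

```python
def training_pass(model, recognizer, hr, lr_frames):
    # hr: real high-resolution face image of the first target (recorded as SR_0).
    # lr_frames: the low-resolution face images; lr_frames[0] is the reference LR_11.
    ref = lr_frames[0]
    feature_values = [recognizer(hr)]            # F_0 from the real image
    random_variables, sr_images = [], []
    sr_prev = hr                                 # SR_0 = HR
    for lr_next in lr_frames[1:]:
        z = model.encode(sr_prev, lr_next)       # previous SR + next LR frame -> Z_1i
        sr = model.generate(z, ref)              # Z_1i + reference frame -> SR_1i
        feature_values.append(recognizer(sr))    # F_i from SR_1i
        random_variables.append(z)
        sr_images.append(sr)
        sr_prev = sr
    # The random variables and face feature values are then fed to the
    # loss of formula (5) to compute the output loss.
    return random_variables, sr_images, feature_values
```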
  • when the output loss is convergent, it is indicated that the training result of the first network model is convergent, and the trained first network model is recorded as the second network model.
  • when the output loss is not convergent, the parameters of the first network model are adjusted, and the training of the first network model continues until the training result is convergent.
  • the second network model obtained based on the above training method can not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain super-resolution face images of which the identity information is consistent with the identity information of the first target, but can also perform super-resolution processing on the N frames of low-resolution face images of the second target to obtain super-resolution face images of which the identity information is consistent with the identity information of the second target.
  • the multi-frame low-resolution face images of any target can be super-resolution processed through the second network model to obtain super-resolution face images of which the identity information is consistent with the identity information of the low-resolution face images.
  • Take the second target as an example, and refer to FIG. 7 to describe the specific process.
  • a random variable Z_21 is randomly sampled from the random variable distribution space that meets the standard normal distribution generated during the training process; Z_21 and the reference frame LR_21 are simultaneously input into the second network model to generate the first frame of super-resolution face image SR_21 of the second target.
  • the first frame of low-resolution face image among the N frames of low-resolution face images is determined as the reference frame LR_21.
  • SR_21 and the second frame of low-resolution face image LR_22 are simultaneously input into the second network model to obtain a second random variable Z_22;
  • Z_22 and LR_21 are input into the second network model to generate the second frame of super-resolution face image SR_22 of the second target.
  • the (i-1)-th frame of super-resolution face image SR_2(i-1) generated by the second network model and the i-th frame of low-resolution face image LR_2i are simultaneously input into the second network model to generate the i-th frame of super-resolution face image SR_2i of the second target.
  • this process is repeated until the last frame of low-resolution face image is input into the second network model, and
  • the last frame of super-resolution face image generated is taken as the final super-resolution result.
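The inference procedure of FIG. 7 admits the same compact sketch; `model.encode` and `model.generate` are again assumed interfaces, `z_shape` is a hypothetical latent shape, and the initial latent is drawn from the standard normal distribution learned during training.

```python
import torch

def super_resolve(model, lr_frames, z_shape):
    # lr_frames: the N low-resolution face images of the target;
    # lr_frames[0] is taken as the reference frame LR_21.
    ref = lr_frames[0]
    z = torch.randn(z_shape)         # Z_21 sampled from the standard normal distribution
    sr = model.generate(z, ref)      # first super-resolution frame SR_21
    for lr in lr_frames[1:]:
        z = model.encode(sr, lr)     # previous SR + next LR frame -> new random variable
        sr = model.generate(z, ref)  # new random variable + reference frame -> next SR
    return sr                        # the last generated frame is the final result
```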
  • N frames of low-resolution face images of the first target are sequentially input to the first network model, the first network model is trained, and the output loss of the first network model is configured to restrict the training process of the first network model to cause the training result of the first network model to converge, and the trained first network model is recorded as the second network model. Because the last-frame super-resolution face image obtained in the training process contains detailed information of multiple frames of low-resolution face images, after the second network model is used to perform super-resolution processing on the N frames of low-resolution face images of the first target, the identity information of the last-frame super-resolution face image obtained is consistent with the identity information of the first target.
  • the second network model can not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain super-resolution face images of which the identity information is consistent with the identity information of the first target, but can also perform super-resolution processing on the N frames of low-resolution face images of the second target to obtain super-resolution face images of which the identity information is consistent with the identity information of the second target.
  • FIG. 8 is a structural schematic view of an image generation apparatus according to an embodiment of the present disclosure.
  • the apparatus includes:
  • an obtaining module 81 configured to obtain N frames of low-resolution face images of a first target, wherein the N is a positive integer greater than or equal to 2;
  • a training module 82 configured to train the first network model according to the N frames of low-resolution face images to obtain a second network model, wherein the first network model is capable of performing super-resolution processing on a low-resolution face image;
  • a processing module 83 configured to perform super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images.
  • a selection module 84 configured to take a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.
  • the training module includes:
  • a calculation unit 91 configured to calculate and obtain an output loss of the first network model according to the N frames of low-resolution face images, wherein the output loss is configured to restrict a training process of the first network model;
  • a determining unit 92 configured to determine whether a training result of the first network model is convergent according to the output loss
  • an adjustment unit 93 configured to, in response to the training result of the first network model not being convergent, adjust parameters of the first network model and continue to train the first network model until the training result is convergent;
  • a marking unit 94 configured to, in response to the training result of the first network model being convergent, record the trained first network model as the second network model.
  • the calculation unit is specifically configured to:
  • obtain N random variables and a super-resolution face image set from the N frames of low-resolution face images through the first network model; wherein the number of frames of super-resolution face images in the super-resolution face image set is N;
  • the calculation unit is also configured to:
  • the negative log-likelihood loss is configured to restrict the first network model such that the random variables output by the first network model obey the standard normal distribution
  • input the N face feature values into a cosine loss function, and calculate to obtain a cosine loss; wherein the cosine loss is configured to indicate a degree of difference between super-resolution face features and real face features;
  • the cosine comparison loss is configured to restrict the first network model, such that a similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than a similarity between the super-resolution face image generated last time and the real high-resolution face image;
  • input the negative log-likelihood loss, the cosine loss and the cosine comparison loss into the loss function, and calculate to obtain the output loss of the first network model; wherein the output loss is configured to restrict the training process of the first network model.
  • the processing module includes:
  • an obtaining unit 101 configured to randomly sample a first random variable from the random variables that obey the standard normal distribution generated in the training process, and determine a second reference frame from the N frames of low-resolution face images;
  • a processing unit 102 configured to input the first random variable and the second reference frame into the second network model to obtain the super-resolution face image corresponding to the first random variable
  • an encoding unit 103 configured to input the super-resolution face image and the second low-resolution face image into the second network model to obtain a second random variable; wherein the second low-resolution face image is a next-frame image of the second reference frame;
  • an updating unit 104 configured to replace the first random variable with the second random variable, replace the second low-resolution face image with a next-frame image of the second low-resolution face image, and continue to perform super-resolution processing on the replaced second low-resolution face image to obtain the N frames of super-resolution face images in sequence.
  • N frames of low-resolution face images of the first target are sequentially input to the first network model, the first network model is trained, and the output loss of the first network model is configured to restrict the training process of the first network model to cause the training result of the first network model to converge, and the trained first network model is recorded as the second network model. Because the last-frame super-resolution face image obtained in the training process contains detailed information of multiple frames of low-resolution face images, after the second network model is used to perform super-resolution processing on the N frames of low-resolution face images of the first target, the identity information of the last-frame super-resolution face image obtained is consistent with the identity information of the first target.
  • the second network model can not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain super-resolution face images of which the identity information is consistent with the identity information of the first target, but can also perform super-resolution processing on the N frames of low-resolution face images of the second target to obtain super-resolution face images of which the identity information is consistent with the identity information of the second target.
  • an embodiment of the present disclosure also provides an electronic device, which can realize the functions of the above image generation apparatus.
  • the electronic device includes:
  • the bus 110 is represented by a thick line in FIG. 11; the connection mode between other components is only for schematic illustration and is not to be taken as a limitation.
  • the bus 110 may be divided into an address bus, a data bus, a control bus, etc. For ease of presentation, only a thick line is used in FIG. 11 to represent it, but it does not mean that there is only one bus or one type of bus.
  • the processor 111 may also be called a controller, and there is no restriction on the name.
  • the memory 112 stores instructions that can be executed by at least one processor 111, and the at least one processor 111 can execute the image generation method discussed above by executing the instructions stored in the memory 112.
  • the processor 111 can implement the functions of each module in the apparatus shown in FIG. 8.
  • the processor 111 is a control center of the device; it connects various parts of the entire device through various interfaces and lines, and monitors the device as a whole by running or executing the instructions stored in the memory 112, calling the data stored in the memory 112, and performing the various functions and data processing of the device.
  • the processor 111 may include one or more processing units, and the processor 111 may integrate an application processor and a modem processor, wherein the application processor primarily handles the operating system, user interface, and applications, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the above modem processor may also not be integrated into processor 111. In some embodiments, processor 111 and memory 112 may be implemented on the same chip, and in some embodiments, they may also be implemented separately on separate chips.
  • the processor 111 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component that can implement or perform each of the methods, steps, and logic block diagrams disclosed in embodiments of the present disclosure.
  • the general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the steps of the image generation method disclosed in conjunction with the embodiments of the present disclosure can be directly embodied as performed by the hardware processor or performed with a combination of hardware and software modules in the processor.
  • the memory 112 serves as a non-volatile computer readable storage medium that can be configured to store non-volatile software programs, non-volatile computer executable programs, and modules.
  • the memory 112 may include at least one type of storage medium, which may include, for example, flash memory, hard disk, multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, disk, CD-ROM, etc.
  • the memory 112 may also be any other medium capable of being used to carry or store desired program code in the form of instructions or data structures and capable of being accessed by a computer, but is not limited thereto.
  • the memory 112 in the embodiments of the present disclosure may also be a circuit or any other device capable of performing storage functions for storing program instructions and/or data.
  • the code corresponding to the image generation method introduced in the above embodiments can be solidified into the chip, such that the chip can execute the steps of the image generation method of the embodiments shown in FIG. 1 when the chip is running.
  • the way of designing and programing the processor 111 is a technology well known to those skilled in the art, and will not be repeated here.
  • an embodiment of the present disclosure also provides a storage medium that stores computer instructions; when the computer instructions are run on a computer, the computer executes the image generation method discussed above.
  • various aspects of the image generation method provided in the present disclosure can also be implemented in the form of a program product, which includes program code.
  • when the program product runs on a device,
  • the program code is configured to control the device to execute the steps in the image generation method according to various exemplary embodiments of the present disclosure described above in this specification.
  • the embodiments of the present disclosure can be provided as a method, a system, or a computer program product. Therefore, the present disclosure may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present disclosure may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions may also be stored in a computer readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in such computer readable memory produce an article of manufacture comprising an instruction device that implements a function specified in one or more processes of a flowchart and/or one or more boxes of a block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are an image generation method, an apparatus, and an electronic device. The method includes: inputting N frames of low-resolution face images sequentially into a first network model in an iterative manner, training the first network model, restricting the training process with the output loss of the first network model until a training result of the first network model is convergent, and marking the trained first network model as a second network model. The second network model performs super-resolution processing on multi-frame low-resolution face images of any target, thereby obtaining super-resolution face images of which the identity information is consistent with the identity information of the multi-frame low-resolution face images. Based on the above method, it is possible to solve the problem that super-resolution processing based on single-frame low-resolution face images cannot guarantee that the identity information in the obtained super-resolution face images is consistent with the identity information of the single-frame low-resolution face images.

Description

IMAGE GENERATION METHOD, APPARATUS, AND ELECTRONIC DEVICE
CROSS REFERENCE
The present application claims foreign priority of China Patent Application No. 202110879082.1 filed on August 02, 2021, in the China National Intellectual Property Administration, the entire contents of which are hereby incorporated by reference.
TECHNICAL FIELD
The present disclosure relates to the field of face recognition technologies, and in particular to an image generation method, an apparatus, and an electronic device.
BACKGROUND
With the rapid development of science and technology and the advent of the big data era, information security has become more and more important. Face recognition, as a safe, non-contact, convenient, and efficient way of identity information authentication, has been widely used in all aspects of social life. However, in relatively large surveillance scenes, the size of a face that appears in the video is usually small, and the image definition is low, making it difficult to meet the needs of face recognition. Therefore, face super-resolution technology becomes more and more important. The face super-resolution technology essentially adds high-frequency features to low-resolution face images to generate high-resolution face images.
The prior art is usually based on a single-frame low-resolution face image and obtains a super-resolution face image through super-resolution processing. The super-resolution face image obtained in this way has missing face information, and cannot guarantee that the identity information of the super-resolution face image is consistent with the identity information of the single-frame low-resolution face image.
SUMMARY OF THE DISCLOSURE
The present disclosure provides an image generation method, an apparatus, and an electronic device, to realize super-resolution processing of multiple low-resolution face images to obtain super-resolution face images of which the identity information is consistent with the identity information of the low-resolution face images.
In a first aspect, the present disclosure provides an image generation method, comprising:
obtaining N frames of low-resolution face images of a first target, wherein the N is a positive integer greater than or equal to 2;
training the first network model according to the N frames of low-resolution face images to obtain a second network model, wherein the first network model is configured to perform super-resolution processing on a low-resolution face image;
performing super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images; and
taking a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.
By virtue of the above method, super-resolution processing of multiple low-resolution face images can be realized to obtain super-resolution face images of which the identity information is consistent with the identity information of the low-resolution face images.
In some embodiments, the training the first network model according to the N frames of low-resolution face images to obtain the second network model comprises:
calculating and obtaining an output loss of the first network model according to the N frames of low-resolution face images, wherein the output loss is configured to restrict a training process of the first network model;
determining whether a training result of the first network model is convergent according to the output loss;
in response to the training result of the first network model not being convergent, adjusting parameters of the first network model and continuing to train the first network model until the training result is convergent; and
in response to the training result of the first network model being convergent, taking the trained first network model as the second network model.
By virtue of the above method, the output loss of the first network model is configured to restrict the training process of the first network model. The second network model obtained after the training process is configured to realize super-resolution processing of multiple low-resolution face images of any target to obtain super-resolution face images of which the identity information is consistent with the identity information of the low-resolution face images.
In some embodiments, the calculating and obtaining the output loss of the first network model according to the N frames of low-resolution face images comprise:
obtaining N random variables and a super-resolution face image set based on the N frames of low-resolution face images through the first network model; wherein the number of frames of super-resolution face images in the super-resolution face image set is N;
inputting the super-resolution face images in the super-resolution face image set into a recognition network in sequence, and extracting to obtain N face feature values; and
inputting the N random variables and the N face feature values into a loss function, and calculating to obtain the output loss of the first network model.
By virtue of the above method, the obtained output loss of the first network model is configured to restrict the training process of the first network model to cause the training result to be convergent.
In some embodiments, the obtaining the N random variables and the super-resolution face image set based on the N frames of low-resolution face images through the first network model comprises:
determining a frame of low-resolution face image among the N frames of low-resolution face images as a first reference frame;
inputting a first super-resolution face image and a first low-resolution face image into the first network model to obtain a random variable corresponding to the first low-resolution face image; wherein the first super-resolution face image is a real high-resolution face image of the first target, and the first low-resolution face image is a next-frame image of the first reference frame;
inputting the random variable and the first reference frame into the first network model to obtain a second super-resolution face image; and
replacing the first super-resolution face image with the second super-resolution face image, replacing the first low-resolution face image with a next-frame face image of the first low-resolution face image, and continuing to train the first network model to generate the remaining super-resolution face images in sequence, forming the super-resolution face image set.
By virtue of the above method, the obtained super-resolution face images of the super-resolution face image set are configured to extract face feature values. The face feature values and the N random variables are configured to calculate the output loss of the first network model.
In some embodiments, the inputting the N random variables and the N face feature values into the loss function, and calculating to obtain the output loss of the first network model comprise:
inputting the N random variables into a negative log-likelihood loss function, and calculating to obtain a negative log-likelihood loss; wherein the negative log-likelihood loss is configured to restrict the first network model such that random variables output by the first network model obey a standard normal distribution;
inputting the N face feature values into a cosine loss function, and calculating to obtain a cosine loss; wherein the cosine loss is configured to indicate a degree of difference between super-resolution face features and real face features;
inputting the cosine loss into a cosine comparison loss function, and calculating to obtain a cosine comparison loss; wherein the cosine comparison loss is configured to restrict the first network model, such that a similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than a similarity between the super-resolution face image generated the previous time and the real high-resolution face image; and
inputting the negative log-likelihood loss, the cosine loss and the cosine comparison loss into the loss function, and calculating to obtain the output loss of the first network model; wherein the output loss is configured to restrict the training process of the first network model.
By virtue of the above method, the output loss of the first network model is obtained. The training process of the first network model is restricted through the output loss, which can ensure that the random variables encoded by the first network model obey the standard normal distribution and that the similarity between the super-resolution face image generated by the first network model each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated the previous time and the real high-resolution face image.
In some embodiments, the performing super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain the N frames of super-resolution face images comprises:
randomly sampling a first random variable among random variables that obey a standard normal distribution generated in the training process, and determining a second reference frame among the N frames of low-resolution face images;
inputting the first random variable and the second reference frame into the second network model to obtain a super-resolution face image corresponding to the first random variable;
inputting the super-resolution face image and the second low-resolution face image into the second network model to obtain a second random variable; wherein the second low-resolution face image is a next-frame image of the second reference frame; and
replacing the first random variable with the second random variable, replacing the second low-resolution face image with a next-frame image of the second low-resolution face image, and continuing to perform super-resolution processing on the replaced second low-resolution face image to obtain the N frames of super-resolution face images in sequence.
By virtue of the above method, the second network model is configured to perform super-resolution processing on the N frames of low-resolution face images, and the super-resolution face image generated each time has the detail features of one more frame of low-resolution face image than the super-resolution face image generated the previous time. Therefore, the last generated super-resolution face image contains the detail features of the N frames of low-resolution face images, that is, the identity information of the last generated super-resolution face image is consistent with the identity information of the N frames of low-resolution face images.
In a second aspect, the present disclosure provides an image generation apparatus, comprising:
an obtaining module, configured to obtain N frames of low-resolution face images of a first target,  wherein the N is a positive integer greater than or equal to 2;
a training module, configured to train a first network model according to the N frames of low-resolution face images to obtain a second network model, wherein the first network model is configured to perform super-resolution processing on a low-resolution face image;
a processing module, configured to perform super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images; and
a selection module, configured to take a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.
In some embodiments, the training module comprises:
a calculation unit, configured to calculate and obtain an output loss of the first network model according to the N frames of low-resolution face images, wherein the output loss is configured to restrict a training process of the first network model;
a determining unit, configured to determine whether a training result of the first network model is convergent according to the output loss;
an adjustment unit, configured to, in response to the training result of the first network model not being convergent, adjust parameters of the first network model and continue to train the first network model until the training result is convergent; and
a marking unit, configured to, in response to the training result of the first network model being convergent, record the trained first network model as the second network model.
In some embodiments, the calculation unit is specifically configured to:
obtain N random variables and a super-resolution face image set from the N frames of low-resolution face images through the first network model; wherein the number of frames of super-resolution face images in the super-resolution face image set is N;
input the super-resolution face images in the super-resolution face image set into a recognition network in sequence, and extract to obtain N face feature values; and
input the N random variables and the N face feature values into a loss function, and calculate to obtain the output loss of the first network model.
In some embodiments, the calculation unit is further configured to:
determine a frame of low-resolution face image among the N frames of low-resolution face images as a first reference frame;
input a first super-resolution face image and a first low-resolution face image into the first network model to obtain a random variable corresponding to the first low-resolution face image; wherein the first super-resolution face image is a real high-resolution face image of the first target, and the first low-resolution face image is a next-frame image of the first reference frame;
input the random variable and the first reference frame into the first network model to obtain a second super-resolution face image; and
replace the first super-resolution face image with the second super-resolution face image, and replace the first low-resolution face image with a next-frame face image of the first low-resolution face image; and continue to train the first network model to generate the remaining super-resolution face images in sequence, forming the super-resolution face image set.
In some embodiments, the calculation unit is further configured to:
input the N random variables into a negative log-likelihood loss function, and calculate to obtain a negative log-likelihood loss; wherein the negative log-likelihood loss is configured to restrict the first network model such that the random variables output by the first network model obey the standard normal distribution;
input the N face feature values into a cosine loss function, and calculate to obtain a cosine loss; wherein the cosine loss is configured to indicate a degree of difference between super-resolution face features and real face features;
input the cosine loss into a cosine comparison loss function, and calculate to obtain a cosine comparison loss; wherein the cosine comparison loss is configured to restrict the first network model, such that a similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than a similarity between the super-resolution face image generated the previous time and the real high-resolution face image; and
input the negative log-likelihood loss, the cosine loss and the cosine comparison loss into the loss function, and calculate to obtain the output loss of the first network model; wherein the output loss is configured to restrict the training process of the first network model.
In some embodiments, the processing module comprises:
an obtaining unit, configured to randomly sample a first random variable among random variables that obey a standard normal distribution generated in the training process, and determine a second reference frame among the N frames of low-resolution face images;
a processing unit, configured to input the first random variable and the second reference frame into the second network model to obtain a super-resolution face image corresponding to the first random variable;
an encoding unit, configured to input the super-resolution face image and the second low-resolution face image into the second network model to obtain a second random variable; wherein the second low-resolution face image is a next-frame image of the second reference frame; and
an updating unit, configured to replace the first random variable with the second random variable, replace the second low-resolution face image with a next-frame image of the second low-resolution face image, and continue to perform super-resolution processing on the replaced second low-resolution face image to obtain the N frames of super-resolution face images in sequence.
In a third aspect, the present disclosure provides an electronic device, comprising:
a memory, configured to store a computer program; and
a processor, configured to execute the computer program stored in the memory to perform the method as described above.
In a fourth aspect, the present disclosure provides a storage medium, storing a computer program; wherein the computer program is configured to perform the method as described above when executed by a processor.
Based on the method provided by the present disclosure, super-resolution processing is performed on the N frames of low-resolution face images of the first target to train the first network model. In the training process, the last generated super-resolution face image contains the detailed features of N frames of low-resolution face image. Therefore, after super-resolution processing is performed on the N frames of low-resolution face images of the first target based on the second network model, the identity information of the last generated super-resolution face image is consistent with the identity information of the first target.
Of course, the second network model may not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain a super-resolution face image of which the identity information is consistent with the identity information of the first target, but may also perform super-resolution processing on N frames of low-resolution face images of a second target to obtain a super-resolution face image of which the identity information is consistent with the identity information of the second target.
For the technical effects that can be achieved with respect to each of the second to fourth aspects, reference may be made to the above description of the technical effects that can be achieved with respect to the  first aspect or the various possible solutions in the first aspect, which will not be repeated here.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flowchart of an image generation method according to an embodiment of the present disclosure.
FIG. 2 is a flowchart of a method for training a first network model according to an embodiment of the present disclosure.
FIG. 3 is a flowchart of a method for obtaining N random variables and a super-resolution face image set based on a first network model according to an embodiment of the present disclosure.
FIG. 4 is a flowchart of a method for calculating an output loss of a first network model according to an embodiment of the present disclosure.
FIG. 5 is a flowchart of a method for obtaining N frames of super-resolution face images based on a second network model according to an embodiment of the present disclosure.
FIG. 6 is a schematic view of a method for training a first network model according to an embodiment of the present disclosure.
FIG. 7 is a flowchart of a method for performing super-resolution processing on N frames of low-resolution face images based on a second network model according to an embodiment of the present disclosure.
FIG. 8 is a structural schematic view of an image generation apparatus according to an embodiment of the present disclosure.
FIG. 9 is a structural schematic view of a training module according to an embodiment of the present disclosure.
FIG. 10 is a structural schematic view of a processing module according to an embodiment of the present disclosure.
FIG. 11 is a structural schematic view of an electronic device according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
In order to make the objectives, technical solutions, and advantages of the present disclosure clearer, the present disclosure will be further described in detail below with reference to the accompanying drawings. The specific operation methods in the method embodiments may also be applied to the apparatus embodiments or system embodiments. It should be noted that in the description of the present disclosure, “a plurality of” is understood as “at least two”. “And/or” describes the association relationship of associated objects, indicating that there can be three types of relationships; for example, A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone. “A connected to B” can mean: A and B are directly connected; and/or A and B are connected through C. In addition, in the description of the present disclosure, words such as “first” and “second” are only intended for the purpose of distinguishing the description, and cannot be understood as indicating or implying relative importance, nor as indicating or implying order.
The present disclosure will be further described in detail below in conjunction with the accompanying drawings.
The image generation method provided by the embodiments of the present disclosure can solve the problem that, when super-resolution processing is performed based on a single-frame low-resolution face image, it cannot be ensured that the identity information of the obtained super-resolution face image is consistent with the identity information of the single-frame low-resolution face image. The method and apparatus described in the embodiments of the present disclosure are based on the same technical concept. Since the principles of the method and the apparatus for solving the problem are similar, the embodiments of the apparatus and the method can be referred to each other, and repeated description is omitted.
Face super-resolution technology essentially adds high-frequency features to low-resolution face images to generate high-resolution face images. In the field of face super-resolution technologies, an SRFlow network model is often used. The SRFlow network model is reversible and can learn a conditional distribution of super-resolution images with respect to low-resolution images. A high-resolution image and a low-resolution image are input into the SRFlow network model to obtain random variables that obey a specific distribution; a low-resolution image and random variables that obey the specific distribution are input into the SRFlow network model to generate a super-resolution face image.
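To make this reversibility concrete, the following toy Python sketch shows a single conditional affine "flow step" whose parameters depend on the low-resolution image. A real SRFlow model stacks many learned invertible layers, so this is an illustration of the encode/decode duality only; every name and shape in it is an assumption made for the sketch.

```python
import torch

def forward_flow(hr: torch.Tensor, lr: torch.Tensor) -> torch.Tensor:
    """Encode: (HR image, LR condition) -> latent z, via an invertible affine map."""
    scale, shift = lr.abs() + 1.0, lr       # toy conditioning on the LR image (scale > 0)
    return (hr - shift) / scale

def inverse_flow(z: torch.Tensor, lr: torch.Tensor) -> torch.Tensor:
    """Decode: (latent z, LR condition) -> SR image, exactly inverting forward_flow."""
    scale, shift = lr.abs() + 1.0, lr
    return z * scale + shift

hr = torch.randn(1, 3, 64, 64)              # stand-in high-resolution image
lr_up = torch.randn(1, 3, 64, 64)           # stand-in LR image, upsampled to the same size
z = forward_flow(hr, lr_up)                 # HR + LR in, random variable out
assert torch.allclose(inverse_flow(z, lr_up), hr, atol=1e-5)  # z + LR in, image back out
```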
In the prior art, super-resolution processing is usually performed on a single-frame low-resolution face image based on the SRFlow network model to obtain a super-resolution face image. However, a single-frame low-resolution face image lacks detailed information, and such detailed information is usually the key to distinguishing face identities; thus it cannot be ensured that the identity information of the obtained super-resolution face image is consistent with the identity information of the low-resolution face image.
In order to solve the problem that it cannot be ensured that the identity information of an obtained super-resolution face image is consistent with the identity information of a single-frame low-resolution face image when super-resolution processing is performed based on the single-frame low-resolution face image, the present disclosure proposes a solution: sequentially inputting multi-frame low-resolution face images of a first target into a first network model in an iterative manner; training the first network model; restricting the training process according to an output loss of the first network model; in response to a training result of the first network model being convergent, recording the trained first network model as a second network model; and performing super-resolution processing on the multi-frame low-resolution face images of the first target, or on multi-frame low-resolution face images of a second target, with the second network model to obtain a last frame of super-resolution face image. The last generated super-resolution face image has the detailed features of the multi-frame low-resolution face images, and thus its identity information is consistent with the identity information of the low-resolution face images.
Specifically, as shown in FIG. 1, FIG. 1 is a flowchart of an image generation method according to an embodiment of the present disclosure.
At block S11: obtaining N frames of low-resolution face images of a first target, wherein the N is a positive integer greater than or equal to 2.
At block S12: training the first network model according to the N frames of low-resolution face images to obtain a second network model.
In the embodiment of the present disclosure, the first network model may be an SRFlow network model. The N frames of low-resolution face images are sequentially input into the first network model in an iterative manner, and the first network model is trained. The training process is restricted according to the output loss of the first network model. When the training result of the first network model is convergent, the trained first network model is recorded as the second network model.
At block S13: performing super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images.
In the embodiment of the present disclosure, super-resolution processing is performed on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images. In the process of the super-resolution processing, the super-resolution face image generated each time has the detail features of one more frame of low-resolution face image than the super-resolution face image generated the previous time.
At block S14: taking a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.
By inputting the N frames of low-resolution face images into the second network model, the super-resolution face image generated each time has the detail features of one more frame of low-resolution face image than the super-resolution face image generated the previous time. Therefore, the last generated super-resolution face image contains the detail features of the N frames of low-resolution face images, that is, the identity information of the last generated super-resolution face image is consistent with the identity information of the N frames of low-resolution face images.
Of course, the second network model may not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain a super-resolution face image of which the identity information is consistent with the identity information of the first target, but may also perform super-resolution processing on N frames of low-resolution face images of a second target to obtain a super-resolution face image of which the identity information is consistent with the identity information of the second target.
In order to further explain how the second network model is obtained, the method of training the first network model described in step S12 is described in detail below with reference to FIG. 2, which is a flowchart of a method for training a first network model according to an embodiment of the present disclosure.
At block S21: obtaining N random variables and a super-resolution face image set from the N frames of low-resolution face images through a first network model.
In the embodiment of the present disclosure, the super-resolution face image set stores a frame of real high-resolution face image of the first target, and the super-resolution face images generated each time. The frame of real high-resolution face image is recorded as a first super-resolution face image, and the total number of super-resolution face images in the super-resolution face image set is N.
The obtaining the N random variables and the super-resolution face image set can be implemented by inputting the N frames of low-resolution face images into the first network model in an iterative manner. The specific process is shown in FIG. 3.
At block S31: determining a frame of low-resolution face image among the N frames of low-resolution face images as a first reference frame.
In the embodiment of the present disclosure, the first reference frame may be the first frame in the N frames of low-resolution face images, or may be the second frame, the third frame, the fourth frame, etc. In the present disclosure, the first frame of low-resolution face image is selected as an example.
At block S32: putting the frame of real high-resolution face image as a first super-resolution face image into the super-resolution face image set, and taking a next-frame image of the first reference frame as a first low-resolution face image.
At block S33: inputting the first super-resolution face image and the first low-resolution face image into the first network model to obtain a random variable corresponding to the first low-resolution face image.
At block S34: inputting the random variable and the first reference frame into the first network model to obtain a second super-resolution face image.
At block S35: putting the second super-resolution face image into the super-resolution face image set, and determining whether the number of image frames in the super-resolution face image set is N.
In the embodiment of the present disclosure, when the number of image frames in the super-resolution face image set is not N, step S36 is executed; when the number of image frames in the super-resolution face image set is N, step S37 is executed.
At block S36: in response to the number of image frames in the super-resolution face image set not being N, replacing the first super-resolution face image with the second super-resolution face image, and replacing the first low-resolution face image with a next-frame face image of the first low-resolution face image.
When the number of image frames is not N, the first super-resolution face image is replaced with the second super-resolution face image, and the first low-resolution face image is replaced with a next-frame face image of the first low-resolution face image. Then step S33 is executed.
At block S37: in response to the number of image frames in the super-resolution face image set being N, obtaining the N random variables and the super-resolution face image set with N image frames.
Based on the above steps, the super-resolution face images in the super-resolution face image set are configured to extract face feature values, and the face feature values and the N random variables are configured to calculate the output loss of the first network model.
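For illustration only, the loop of blocks S31–S37 can be sketched in Python as follows. The `encode` and `decode` callables stand in for the two directions of the invertible first network model (an image pair in, a latent out; a latent plus the reference frame in, an image out); this interface, like every name below, is an assumption made for the sketch, not the disclosed implementation. The loop yields one latent per generated super-resolution image, and the image set grows until it holds N frames.

```python
from typing import Callable, List, Tuple
import torch

Encode = Callable[[torch.Tensor, torch.Tensor], torch.Tensor]  # (SR, next LR frame) -> z
Decode = Callable[[torch.Tensor, torch.Tensor], torch.Tensor]  # (z, reference LR frame) -> SR

def collect_latents_and_sr_set(encode: Encode, decode: Decode,
                               lr_frames: List[torch.Tensor],
                               hr_real: torch.Tensor
                               ) -> Tuple[List[torch.Tensor], List[torch.Tensor]]:
    """Blocks S31-S37: iteratively build the random variables and the SR face image set."""
    lr_ref = lr_frames[0]        # block S31: take the first frame as the first reference frame
    sr_set = [hr_real]           # block S32: the real HR image enters the set as SR_0
    latents: List[torch.Tensor] = []
    sr_prev = hr_real
    for lr_next in lr_frames[1:]:            # remaining N-1 frames, in order
        z = encode(sr_prev, lr_next)         # block S33: latent for the current LR frame
        sr_prev = decode(z, lr_ref)          # block S34: regenerate against the reference frame
        latents.append(z)
        sr_set.append(sr_prev)               # block S35: the set grows until it holds N frames
    return latents, sr_set                   # block S37: latents and the N-frame SR image set
```

Each regenerated image folds in the detail of one more low-resolution frame, which is what the identity-consistency argument above relies on.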
At block S22: inputting the super-resolution face images in the super-resolution face image set into a recognition network in sequence, and extracting to obtain N face feature values.
At block S23: inputting the N random variables and the N face feature values into a loss function, and calculating to obtain an output loss of the first network model.
The training process of the first network model is restricted by the output loss, such that the random variables output by the first network model obey a standard normal distribution, and the similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated the previous time and the real high-resolution face image.
At block S24: determining, according to the output loss, whether the training result of the first network model is convergent.
When the output loss of the first network model is convergent, it is indicated that the training result of the first network model is convergent, and step S25 is executed; when the output loss of the first network model is not convergent, step S26 is executed.
At block S25: in response to the training result of the first network model being convergent, recording the trained first network model as the second network model.
When the training result is convergent, it is indicated that after the first network model performs super-resolution processing on the multiple-frame low-resolution face images, the identity information of the last generated super-resolution face image is consistent with the identity information of the low-resolution face image. The trained first network model is recorded as the second network model. The second network model can perform super-resolution processing on multiple-frame low-resolution face images of any target, and the identity information of the last generated super-resolution face image is consistent with the identity information of the low-resolution face image.
At block S26: in response to the training result of the first network model not being convergent, adjusting parameters of the first network model and continuing to train the first network model until the training result is convergent.
When the training result is not convergent, the parameters of the first network model are adjusted, N frames of low-resolution face images of another target are further obtained, step S11 is executed, and the first network model continues to be trained until the training result is convergent.
Based on the above steps, the N frames of low-resolution face images are input to the first network model, the first network model is trained, and the training is completed to obtain the second network model. The second network model can perform super-resolution processing on multiple-frame low-resolution face images of any target, and the identity information of the last generated super-resolution face image is consistent with the identity information of the low-resolution face image.
In the training process of obtaining the second network model, the output loss of the first network model is configured to restrict the first network model such that the random variables output by the first network model obey the standard normal distribution, and the similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated the previous time and the real high-resolution face image.
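Wrapping the above collection in the convergence test of blocks S24–S26 gives a conventional training loop. A minimal sketch, assuming a standard PyTorch optimizer and a `compute_output_loss` helper in the spirit of the loss described next; the plateau-based convergence test and all hyperparameters are illustrative choices, not the disclosed criteria.

```python
import torch

def train_first_network(model: torch.nn.Module,
                        batches,                # iterable of (lr_frames, hr_real) training pairs
                        compute_output_loss,    # (model, lr_frames, hr_real) -> scalar loss tensor
                        max_steps: int = 10000,
                        tol: float = 1e-4) -> torch.nn.Module:
    """Blocks S21-S26: restrict training by the output loss until the result converges."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    prev_loss = float("inf")
    for step, (lr_frames, hr_real) in enumerate(batches):
        loss = compute_output_loss(model, lr_frames, hr_real)   # blocks S21-S23
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                        # block S26: adjust parameters
        if abs(prev_loss - loss.item()) < tol or step >= max_steps:
            break                                               # block S24: convergence test
        prev_loss = loss.item()
    return model                                                # block S25: the second network model
```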
In order to further explain the calculation method of the output loss, the output loss of the first network model calculated in step S23 is explained in detail below. The specific calculation process of the output loss is shown in FIG. 4.
At block S41: inputting the N random variables into a negative log-likelihood loss function, and calculating to obtain a negative log-likelihood loss.
In the embodiment of the present disclosure, the negative log-likelihood loss is configured to restrict the first network model such that the random variables output by the first network model obey the standard normal distribution, where the negative log-likelihood loss can be calculated by formula (1):
$$L_{nll} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log p_Z\left(z_{1i}\right) + \log\left|\det\frac{\partial f_\theta}{\partial SR}\left(SR;\,LR_{1i}\right)\right|\right] \qquad (1)$$
where formula (1) is the negative log-likelihood loss function, $LR$ is a low-resolution face image, $SR$ is a super-resolution face image, $\theta$ is a distribution parameter, $N$ is the number of frames of low-resolution face images, $LR_{1i}$ indicates the $i$-th frame of low-resolution face image input into the first network model, $p_Z(z_{1i})$ represents the spatial distribution of the random variables, $z_{1i}$ represents the random variable obtained by inputting the $i$-th frame of low-resolution face image into the first network model, and $f_\theta$ is the first network model. The first network model $f_\theta$ is decomposed into a sequence of $M$ reversible layers:
$$f_\theta = f_\theta^{M} \circ f_\theta^{M-1} \circ \cdots \circ f_\theta^{1}$$
At block S42: inputting the N face feature values into a cosine loss function, and calculating to obtain a cosine loss.
In the embodiment of the present disclosure, the cosine loss indicates the degree of difference between super-resolution face features and real face features, where the cosine loss can be calculated by formula (2) :
$$L_{cos} = \frac{1}{N-1}\sum_{i=1}^{N-1}\left(1 - Similarity_i\right) \qquad (2)$$
where formula (2) is the cosine loss function, $Similarity_i$ is the cosine similarity between the super-resolution face image generated by the first network model for the $i$-th time and the real high-resolution face image, and the cosine similarity is in a value range of $[-1, 1]$. The greater the cosine similarity is, the higher the similarity between the super-resolution face image and the real high-resolution face image is. The cosine similarity can be calculated by formula (3):
$$Similarity_i = \frac{F_i \cdot F_0}{\left\| F_i \right\| \left\| F_0 \right\|} \qquad (3)$$
where $Similarity_i$ represents the cosine similarity generated for the $i$-th time, formula (3) is the cosine similarity function, $F_i$ is the face feature value extracted after the super-resolution face image generated by the first network model for the $i$-th time is input into the recognition network, and $F_0$ is the face feature value extracted after the real high-resolution face image is input into the recognition network.
At block S43: inputting the cosine loss into a cosine comparison loss function, and calculating to obtain a cosine comparison loss.
In the embodiment of the present disclosure, the cosine comparison loss is configured to restrict the first network model, such that the similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated the previous time and the real high-resolution face image, that is, $Similarity_{i+1}$ is greater than $Similarity_i$. The cosine comparison loss can be calculated by formula (4):
$$L_{cmp} = \sum_{i=1}^{N-2} e^{\,\alpha\left(Similarity_i - Similarity_{i+1}\right)} \qquad (4)$$
where formula (4) is the cosine comparison loss function, e is the base of the natural logarithm, and α is a comparison coefficient.
At block S44: inputting the negative log-likelihood loss, the cosine loss and the cosine comparison loss into the loss function, and calculating to obtain the output loss of the first network model.
In the embodiment of the present disclosure, the output loss is configured to restrict the training process of the first network model, which can make the random variables encoded by the first network model obey the standard normal distribution, and can also make the similarity between the super-resolution face image generated each time and the real high-resolution face image greater than the similarity between the super-resolution face image generated the previous time and the real high-resolution face image. The output loss can be calculated by formula (5):
$$L = L_{nll} + L_{cos} + L_{cmp} \qquad (5)$$
where formula (5) is the loss function, and $L_{nll}$, $L_{cos}$, and $L_{cmp}$ denote the negative log-likelihood loss, the cosine loss, and the cosine comparison loss, respectively.
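For illustration, the three terms of formulas (1), (2), and (4) and their combination in formula (5) can be sketched in PyTorch as follows. The latent term omits the log-determinant part of formula (1), which depends on the flow's internal layers, and the exponential margin form of the comparison loss is an assumption consistent with, but not uniquely determined by, the description above; `alpha` matches the comparison coefficient α.

```python
from typing import List
import torch
import torch.nn.functional as F

def output_loss(latents: List[torch.Tensor],
                features: List[torch.Tensor],   # [F_0, ..., F_{N-1}]; F_0 from the real HR image
                alpha: float = 1.0) -> torch.Tensor:
    """Sketch of formulas (1)-(5); the flow's log-det Jacobian term is omitted."""
    # Latent part of formula (1): push each z toward the standard normal distribution.
    nll = torch.stack([0.5 * (z ** 2).mean() for z in latents]).mean()

    # Formula (3): cosine similarity between each SR feature F_i and the real-HR feature F_0.
    sims = [F.cosine_similarity(f.flatten(), features[0].flatten(), dim=0)
            for f in features[1:]]

    # Formula (2): cosine loss, the degree of difference from the real face features.
    cos_loss = torch.stack([1.0 - s for s in sims]).mean()

    # Formula (4): comparison loss, penalizing Similarity_{i+1} <= Similarity_i.
    cmp_terms = [torch.exp(alpha * (sims[i] - sims[i + 1])) for i in range(len(sims) - 1)]
    cmp_loss = torch.stack(cmp_terms).mean() if cmp_terms else torch.zeros(())

    # Formula (5): the combined output loss restricting the training process.
    return nll + cos_loss + cmp_loss
```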
Based on the above steps, the output loss of the first network model is calculated. When the output loss is not convergent, the parameters of the first network model are adjusted, and the training of the first network model is continued until the output loss is convergent.
When the output loss is convergent, it is indicated that the second network model obtained after the training can perform super-resolution processing on the N frames of low-resolution face images to obtain the N frames of super-resolution face images, and in the process of super-resolution processing, the similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than the similarity between the super-resolution face image generated the previous time and the real high-resolution face image.
In order to further explain how the second network model performs super-resolution processing on the N frames of low-resolution face images, step S13 is required to be described in detail. Specifically, as shown in FIG. 5, FIG. 5 is a flowchart of a method for obtaining N frames of super-resolution face images based on a second network model according to an embodiment of the present disclosure.
At block S51: randomly sampling a first random variable from the random variables that obey the standard normal distribution generated in the training process, and determining a second reference frame from the N frames of low-resolution face images.
At block S52: taking a next-frame image of the second reference frame as a second low-resolution face image.
At block S53: inputting the first random variable and the second reference frame into the second network model to obtain the super-resolution face image corresponding to the first random variable.
At block S54: counting the super-resolution face images generated each time, and determining whether the total number of frames of the super-resolution face images is N.
The purpose of counting the super-resolution face images generated each time is to determine whether super-resolution processing has been performed on all the N frames of low-resolution face images. When the total number of frames of the super-resolution face images is N, it is indicated that the super-resolution processing is completed, and step S55 is executed. When the total number of frames of the super-resolution face images is not N, step S56 is executed.
At block S55: in response to the total number of frames of the super-resolution face images being N, the super-resolution processing is completed, and the N frames of super-resolution face images are obtained.
At block S56: in response to the total number of frames of the super-resolution face images not being N, inputting the super-resolution face image and the second low-resolution face image into the second network model to obtain a second random variable.
At block S57: replacing the first random variable with the second random variable, replacing the second low-resolution face image with a next-frame image of the second low-resolution face image, and continuing to perform super-resolution processing on the replaced second low-resolution face image.
After the first random variable is replaced with the second random variable and the second low-resolution face image is replaced with the next frame of the second low-resolution face image, step S53 is executed to continue performing super-resolution processing on the replaced second low-resolution face image.
Based on the above method, the second network model is configured to perform super-resolution processing on the N frames of low-resolution face images, and the super-resolution face image generated each time has the detail features of one more frame of low-resolution face image than the super-resolution face image generated the previous time. Therefore, the last generated super-resolution face image contains the detail features of the N frames of low-resolution face images, that is, the identity information of the last generated super-resolution face image is consistent with the identity information of the N frames of low-resolution face images.
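The inference loop of blocks S51–S57 can be sketched in the same style. As before, `encode` and `decode` stand in for the two directions of the (now trained) second network model, and all names and shapes are illustrative assumptions.

```python
from typing import Callable, List
import torch

Encode = Callable[[torch.Tensor, torch.Tensor], torch.Tensor]  # (SR, next LR frame) -> z
Decode = Callable[[torch.Tensor, torch.Tensor], torch.Tensor]  # (z, reference LR frame) -> SR

def super_resolve_sequence(encode: Encode, decode: Decode,
                           lr_frames: List[torch.Tensor],
                           z_shape: torch.Size) -> List[torch.Tensor]:
    """Blocks S51-S57: N LR frames in, N SR frames out; the last SR frame is the final result."""
    lr_ref = lr_frames[0]                 # block S51: the second reference frame
    z = torch.randn(z_shape)              # block S51: sample z from the standard normal distribution
    sr_frames: List[torch.Tensor] = []
    sr = decode(z, lr_ref)                # block S53: SR image for the sampled random variable
    sr_frames.append(sr)
    for lr_next in lr_frames[1:]:         # blocks S54/S56: until N SR frames have been counted
        z = encode(sr, lr_next)           # block S56: re-encode with the next LR frame
        sr = decode(z, lr_ref)            # block S57: regenerate against the reference frame
        sr_frames.append(sr)
    return sr_frames                      # block S55: the N frames of SR face images
```

Taking `sr_frames[-1]` then realizes the selection of the final face image described above.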
Of course, based on the above steps, the second network model can not only perform super-resolution processing on N frames of low-resolution face images of the first target, but also perform super-resolution processing on N frames of low-resolution face images of the second target. The identity information of the generated last frame of super-resolution face image of the second target is consistent with the identity information of the N frames of low-resolution face images of the second target.
Further, in order to elaborate on an image generation method provided by the present disclosure, the method provided by the present disclosure will be described in detail below through specific application scenarios.
Before the image is generated, the first network model is required to be trained. Referring to FIG. 6, the N frames of low-resolution face images of the first target are sorted according to the obtaining order of an image obtaining device, and are recorded as a first frame of low-resolution face image, a second frame of low-resolution face image, ..., and an N-th frame of low-resolution face image. The first frame of low-resolution face image is taken as the reference frame $LR_{11}$, and the real high-resolution face image $HR$ of the first target is input into the recognition network to obtain the first face feature value $F_0$, where $HR$ is recorded as $SR_0$.
For the first training, $HR$ and the second frame of low-resolution face image $LR_{12}$ are input into the first network model to obtain the first random variable $Z_{11}$; $Z_{11}$ and $LR_{11}$ are input into the first network model to generate the first frame of super-resolution face image $SR_{11}$; $SR_{11}$ is input into the recognition network to obtain the second face feature value $F_1$.
For the second training, $SR_{11}$ and the third frame of low-resolution face image $LR_{13}$ are input into the first network model to obtain the second random variable $Z_{12}$; $Z_{12}$ and $LR_{11}$ are input into the first network model to generate the second frame of super-resolution face image $SR_{12}$ of the first target; $SR_{12}$ is input into the recognition network to obtain the third face feature value $F_2$.
For the $i$-th ($i > 1$) training, the super-resolution face image $SR_{1(i-1)}$ generated by the first network model for the $(i-1)$-th time and the $(i+1)$-th frame of low-resolution face image $LR_{1(i+1)}$ are input into the first network model to obtain the $i$-th random variable $Z_{1i}$; $Z_{1i}$ and $LR_{11}$ are input into the first network model to generate the $i$-th frame of super-resolution face image $SR_{1i}$ of the first target; $SR_{1i}$ is input into the recognition network to obtain the $(i+1)$-th face feature value $F_i$.
The generated face feature values $\{F_i,\ i = 0, 1, \ldots, N-1\}$ are input into the loss function together with the random variables, calculation is performed to obtain the output loss of the first network model, and it is determined whether the output loss is convergent. When the output loss is convergent, it is indicated that the training result of the first network model is convergent, and the trained first network model is recorded as the second network model. When the output loss is not convergent, the parameters of the first network model are adjusted, and the training of the first network model continues until the training result is convergent.
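In code, the feature bookkeeping of this walkthrough reduces to passing $SR_0$ (= $HR$) through $SR_{1(N-1)}$ through the recognition network in sequence and handing the results, together with the latents, to the loss. A minimal sketch, where `recognition_net` is any fixed face-embedding network and `loss_fn` is a loss in the spirit of formula (5); both are assumptions for illustration.

```python
from typing import Callable, List
import torch

def walkthrough_loss(recognition_net: Callable[[torch.Tensor], torch.Tensor],
                     loss_fn: Callable[[List[torch.Tensor], List[torch.Tensor]], torch.Tensor],
                     latents: List[torch.Tensor],
                     sr_set: List[torch.Tensor]) -> torch.Tensor:
    """FIG. 6: extract {F_i, i = 0, ..., N-1} in sequence, then evaluate the output loss."""
    features = [recognition_net(sr) for sr in sr_set]   # F_0 from HR; F_i from SR_1i
    return loss_fn(latents, features)                   # e.g. the output_loss sketch above
```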
The second network model obtained based on the above training method can not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain super-resolution face images of which the identity information is consistent with the identity information of the first target, but also can perform super-resolution processing on the N frames of low-resolution face images of the second target to obtain super-resolution face images of which the identity information is consistent with the identity information of the second target.
After the training of the first network model is completed and the second network model is obtained, the multi-frame low-resolution face images of any target can be super-resolution processed through the second network model to obtain super-resolution face images of which the identity information is consistent with that of the low-resolution face images. The specific process is described below by taking the first target as an example and referring to FIG. 7.
For the first super-resolution, the first frame of the N frames of low-resolution face images is determined as the reference frame $LR_{21}$; a random variable $Z_{21}$ is randomly sampled from the random variable distribution space that obeys the standard normal distribution generated during the training process; $Z_{21}$ and $LR_{21}$ are simultaneously input into the second network model to generate the first frame of super-resolution face image $SR_{21}$ of the first target.
For the second super-resolution, $SR_{21}$ and the second frame of low-resolution face image $LR_{22}$ are simultaneously input into the second network model to obtain the second random variable $Z_{22}$; $Z_{22}$ and $LR_{21}$ are input into the second network model to generate the second frame of super-resolution face image $SR_{22}$ of the first target.
For the $i$-th super-resolution, the $(i-1)$-th frame of super-resolution face image $SR_{2(i-1)}$ generated by the second network model and the $i$-th frame of low-resolution face image $LR_{2i}$ are simultaneously input into the second network model to obtain the $i$-th random variable $Z_{2i}$; $Z_{2i}$ and $LR_{21}$ are input into the second network model to generate the $i$-th frame of super-resolution face image $SR_{2i}$ of the first target.
After the last frame of low-resolution face image is input into the second network model, the last generated frame of super-resolution face image is taken as the final super-resolution result.
Based on the above process, N frames of low-resolution face images of the first target are sequentially input into the first network model, and the first network model is trained. The output loss of the first network model is configured to restrict the training process of the first network model to cause the training result of the first network model to converge, and the trained first network model is recorded as the second network model. Because the last-frame super-resolution face image obtained in the training process contains the detailed information of multiple frames of low-resolution face images, after the second network model is used to perform super-resolution processing on the N frames of low-resolution face images of the first target, the identity information of the obtained last-frame super-resolution face image is consistent with the identity information of the first target.
Of course, the second network model can not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain super-resolution face images of which the identity information is consistent with the identity information of the first target, but also can perform super-resolution processing on the N frames of low-resolution face images of the second target to obtain super-resolution face images of which the identity information is consistent with the identity information of the second target.
Based on the same inventive concept, an image generation apparatus is also provided in an embodiment of the present disclosure. As shown in FIG. 8, FIG. 8 is a structural schematic view of an image generation apparatus according to an embodiment of the present disclosure. The apparatus includes:
an obtaining module 81, configured to obtain N frames of low-resolution face images of a first target, wherein the N is a positive integer greater than or equal to 2;
a training module 82, configured to train a first network model according to the N frames of low-resolution face images to obtain a second network model, wherein the first network model is capable of performing super-resolution processing on a low-resolution face image;
a processing module 83, configured to perform super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images; and
a selection module 84, configured to take a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.
In a possible design, as shown in FIG. 9, the training module includes:
a calculation unit 91, configured to calculate and obtain an output loss of the first network model according to the N frames of low-resolution face images, wherein the output loss is configured to restrict a training process of the first network model;
a determining unit 92, configured to determine whether a training result of the first network model is convergent according to the output loss;
an adjustment unit 93, configured to, in response to the training result of the first network model not being convergent, adjust parameters of the first network model and continue to train the first network model until the training result is convergent; and
a marking unit 94, configured to, in response to the training result of the first network model being convergent, record the trained first network model as the second network model.
In a possible design, the calculation unit is specifically configured to:
obtain N random variables and a super-resolution face image set from the N frames of low-resolution face images through the first network model; wherein the number of frames of super-resolution face images in the super-resolution face image set is N;
input the super-resolution face images in the super-resolution face image set into a recognition network in sequence, and extract to obtain N face feature values; and
input the N random variables and the N face feature values into a loss function, and calculate to obtain the output loss of the first network model.
In a possible design, the calculation unit is also configured to:
determine a frame of low-resolution face image among the N frames of low-resolution face images as a first reference frame;
input a first super-resolution face image and a first low-resolution face image into the first network model to obtain a random variable corresponding to the first low-resolution face image; wherein the first super-resolution face image is a real high-resolution face image of the first target, and the first low-resolution face image is a next-frame image of the first reference frame;
input the random variable and the first reference frame into the first network model to obtain a second super-resolution face image; and
replace the first super-resolution face image with the second super-resolution face image, and replace the first low-resolution face image with a next-frame face image of the first low-resolution face image; and continue to train the first network model to generate the remaining super-resolution face images in sequence, forming the super-resolution face image set.
In a possible design, the calculation unit is also configured to:
input the N random variables into a negative log-likelihood loss function, and calculate to obtain a negative log-likelihood loss; wherein the negative log-likelihood loss is configured to restrict the first network model such that the random variables output by the first network model obey the standard normal distribution;
input the N face feature values into a cosine loss function, and calculate to obtain a cosine loss; wherein the cosine loss is configured to indicate a degree of difference between super-resolution face features and real face features;
input the cosine loss into a cosine comparison loss function, and calculate to obtain a cosine comparison loss; wherein the cosine comparison loss is configured to restrict the first network model, such that a similarity between the super-resolution face image generated each time and the real high-resolution face image is greater than a similarity between the super-resolution face image generated the previous time and the real high-resolution face image; and
input the negative log-likelihood loss, the cosine loss and the cosine comparison loss into the loss function, and calculate to obtain the output loss of the first network model; wherein the output loss is configured to restrict the training process of the first network model.
In a possible design, as shown in FIG. 10, the processing module includes:
an obtaining unit 101, configured to randomly sample a first random variable from the random variables that obey the standard normal distribution generated in the training process, and determine a second reference frame from the N frames of low-resolution face images;
a processing unit 102, configured to input the first random variable and the second reference frame into the second network model to obtain the super-resolution face image corresponding to the first random variable;
an encoding unit 103, configured to input the super-resolution face image and the second low-resolution face image into the second network model to obtain a second random variable; wherein the second low-resolution face image is a next-frame image of the second reference frame; and
an updating unit 104, configured to replace the first random variable with the second random variable, replace the second low-resolution face image with a next-frame image of the second low-resolution face image, and continue to perform super-resolution processing on the replaced second low-resolution face image to obtain the N frames of super-resolution face images in sequence.
Based on the above image generation apparatus, N frames of low-resolution face images of the first target are sequentially input into the first network model, and the first network model is trained. The output loss of the first network model is configured to restrict the training process of the first network model to cause the training result of the first network model to converge, and the trained first network model is recorded as the second network model. Because the last-frame super-resolution face image obtained in the training process contains the detailed information of multiple frames of low-resolution face images, after the second network model is used to perform super-resolution processing on the N frames of low-resolution face images of the first target, the identity information of the obtained last-frame super-resolution face image is consistent with the identity information of the first target.
Of course, the second network model can not only perform super-resolution processing on the N frames of low-resolution face images of the first target to obtain super-resolution face images of which the identity information is consistent with the identity information of the first target, but also can perform super-resolution processing on the N frames of low-resolution face images of the second target to obtain super-resolution face images of which the identity information is consistent with the identity information of the second target.
Based on the same inventive concept, an embodiment of the present disclosure also provides an electronic device, which can realize the functions of the above image generation apparatus. Referring to FIG. 11, the electronic device includes:
at least one processor 111 and a memory 112 connected to the at least one processor 111. The specific connection medium between the processor 111 and the memory 112 is not limited in the embodiment of the present disclosure. In FIG. 11, the processor 111 and the memory 112 are connected through a bus 110, which is represented by a thick line; the connection modes between other components are only for schematic illustration and are not to be taken as limitations. The bus 110 may be divided into an address bus, a data bus, a control bus, etc. For ease of presentation, only one thick line is used in FIG. 11, but this does not mean that there is only one bus or one type of bus. Alternatively, the processor 111 may also be called a controller, and there is no restriction on the name.
In the embodiment of the present disclosure, the memory 112 stores instructions that can be executed by the at least one processor 111, and the at least one processor 111 can execute the image generation method discussed above by executing the instructions stored in the memory 112. The processor 111 can implement the functions of each module in the apparatus shown in FIG. 8.
The processor 111 is the control center of the device. It connects various parts of the entire device using various interfaces and lines, and performs the various functions of the device and processes data by running or executing the instructions stored in the memory 112 and calling the data stored in the memory 112, thereby monitoring the device as a whole.
In a possible design, the processor 111 may include one or more processing units, and the processor 111 may integrate an application processor and a modem processor, wherein the application processor primarily handles the operating system, user interface, and applications, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the above modem processor may also not be integrated into processor 111. In some embodiments, processor 111 and memory 112 may be implemented on the same chip, and in some embodiments, they may also be implemented separately on separate chips.
The processor 111 may be a general purpose processor, such as a central processing unit (CPU) , a digital signal processor, a specialized integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component that can implement or perform each of the methods, steps, and logic block diagrams disclosed in embodiments of the present disclosure. The general purpose processor may be a microprocessor or any conventional processor, etc. The steps of the image generation method disclosed in conjunction with the embodiments of the present disclosure can be directly embodied as performed by the hardware processor or performed with a combination of hardware and software modules in the processor.
The memory 112, as a non-volatile computer-readable storage medium, can be configured to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 112 may include at least one type of storage medium, for example, flash memory, a hard disk, a multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, a CD-ROM, etc. The memory 112 may also be any other medium capable of carrying or storing desired program code in the form of instructions or data structures and capable of being accessed by a computer, but is not limited thereto. The memory 112 in the embodiments of the present disclosure may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
By designing and programming the processor 111, the code corresponding to the image generation method introduced in the above embodiments can be solidified into a chip, such that the chip can execute the steps of the image generation method of the embodiment shown in FIG. 1 when running. How to design and program the processor 111 is a technique well known to those skilled in the art and will not be repeated here.
Based on the same inventive concept, an embodiment of the present disclosure also provides a storage medium storing computer instructions; when the computer instructions are run on a computer, the computer is caused to execute the image generation method discussed above.
In some possible implementations, various aspects of the image generation method provided in the present disclosure may also be implemented in the form of a program product including program code. When the program product runs on a device, the program code is configured to cause the device to execute the steps of the image generation method according to the various exemplary embodiments of the present disclosure described above in this specification.
Those skilled in the art should understand that the embodiments of the present disclosure can be provided as a method, a system, or a computer program product. Therefore, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more processes of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction device that implements the functions specified in one or more processes of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
Obviously, those skilled in the art can make various changes and modifications to the present disclosure without departing from the spirit and scope of the present disclosure. Thus, if these modifications and variations of the present disclosure fall within the scope of the claims of the present disclosure and their technical equivalents, the present disclosure is also intended to include these modifications and variations.

Claims (10)

  1. An image generation method, comprising:
    obtaining N frames of low-resolution face images of a first target, wherein N is a positive integer greater than or equal to 2;
    training a first network model according to the N frames of low-resolution face images to obtain a second network model, wherein the first network model is configured to perform super-resolution processing on a low-resolution face image;
    performing super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images; and
    taking a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.
  2. The method according to claim 1, wherein the training the first network model according to the N frames of low-resolution face images to obtain the second network model comprises:
    calculating and obtaining an output loss of the first network model according to the N frames of low-resolution face images, wherein the output loss is configured to restrict a training process of the first network model;
    determining, according to the output loss, whether a training result of the first network model is convergent;
    in response to the training result of the first network model not being convergent, adjusting parameters of the first network model and continuing to train the first network model until the training result is convergent; and
    in response to the training result of the first network model being convergent, taking the trained first network model as the second network model.
  3. The method according to claim 2, wherein the calculating and obtaining the output loss of the first network model according to the N frames of low-resolution face images comprise:
    obtaining N random variables and a super-resolution face image set based on the N frames of low-resolution face images through the first network model; wherein the number of frames of super-resolution face images in the super-resolution face image set is N;
    inputting the super-resolution face images in the super-resolution face image set into a recognition network in sequence, and extracting N face feature values; and
    inputting the N random variables and the N face feature values into a loss function, and calculating to obtain the output loss of the first network model.
  4. The method according to claim 3, wherein the obtaining the N random variables and the super-resolution face image set based on the N frames of low-resolution face images through the first network model comprises:
    determining a frame of low-resolution face image among the N frames of low-resolution face images as a first reference frame;
    inputting a first super-resolution face image and a first low-resolution face image into the first network model to obtain a random variable corresponding to the first low-resolution face image; wherein the first super-resolution face image is a real high-resolution face image of the first target, and the first low-resolution face image is a next-frame image of the first reference frame;
    inputting the random variable and the first reference frame into the first network model to obtain a second super-resolution face image; and
    replacing the first super-resolution face image with the second super-resolution face image, replacing the first low-resolution face image with a next-frame face image of the first low-resolution face image, and continuing to train the first network model to generate the remaining super-resolution face images in sequence, forming the super-resolution face image set.
  5. The method according to claim 3, wherein the inputting the N random variables and the N face feature values into the loss function, and calculating to obtain the output loss of the first network model comprise:
    inputting the N random variables into a negative log-likelihood loss function, and calculating to obtain a negative log-likelihood loss; wherein the negative log-likelihood loss is configured to restrict the first network model such that random variables output by the first network model obey a standard normal distribution;
    inputting the N face feature values into a cosine loss function, and calculating to obtain a cosine loss; wherein the cosine loss is configured to indicate a degree of difference between super-resolution face features and real face features;
    inputting the cosine loss into a cosine comparison loss function, and calculating to obtain a cosine comparison loss; wherein the cosine comparison loss is configured to restrict the first network model, such that a similarity between a super-resolution face image generated each time and the real high-resolution face image is greater than a similarity between a super-resolution face image generated last time and the real high-resolution face image; and
    inputting the negative log-likelihood loss, the cosine loss and the cosine comparison loss into the loss function, and calculating to obtain the output loss of the first network model; wherein the output loss is configured to restrict the training process of the first network model.
  6. The method according to claim 1, wherein the performing super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain the N frames of super-resolution face images comprises:
    randomly sampling a first random variable among the random variables that obey a standard normal distribution generated in the training process, and determining a second reference frame among the N frames of low-resolution face images;
    inputting the first random variable and the second reference frame into the second network model to obtain a super-resolution face image corresponding to the first random variable;
    inputting the super-resolution face image and a second low-resolution face image into the second network model to obtain a second random variable; wherein the second low-resolution face image is a next-frame image of the second reference frame; and
    replacing the first random variable with the second random variable, replacing the second low-resolution face image with a next-frame image of the second low-resolution face image, and continuing to perform super-resolution processing on the replaced second low-resolution face image to obtain the N frames of super-resolution face images in sequence.
  7. An image generation apparatus, comprising:
    an obtaining module, configured to obtain N frames of low-resolution face images of a first target, wherein the N is a positive integer greater than or equal to 2;
    a training module, configured to train a first network model according to the N frames of low-resolution face images to obtain a second network model, wherein the first network model is configured to perform super-resolution processing on a low-resolution face image;
    a processing module, configured to perform super-resolution processing on the N frames of low-resolution face images in sequence based on the second network model to obtain N frames of super-resolution face images; and
    a selection module, configured to take a last-frame super-resolution face image among the N frames of super-resolution face images as a final face image.
  8. The apparatus according to claim 7, wherein the processing module comprises:
    an obtaining unit, configured to randomly sample a first random variable among the random variables that obey a standard normal distribution generated in the training process, and determine a second reference frame among the N frames of low-resolution face images;
    a processing unit, configured to input the first random variable and the second reference frame into the second network model to obtain a super-resolution face image corresponding to the first random variable;
    an encoding unit, configured to input the super-resolution face image and a second low-resolution face image into the second network model to obtain a second random variable; wherein the second low-resolution face image is a next-frame image of the second reference frame; and
    an updating unit, configured to replace the first random variable with the second random variable, replace the second low-resolution face image with a next-frame image of the second low-resolution face image, and continue to perform super-resolution processing on the replaced second low-resolution face image to obtain the N frames of super-resolution face images in sequence.
  9. An electronic device, comprising:
    a memory, configured to store a computer program; and
    a processor, configured to execute the computer program stored in the memory to perform the method according to any one of claims 1-6.
  10. A storage medium, storing a computer program; wherein the computer program is configured to perform the method according to any one of claims 1-6 when executed by a processor.
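
For readers who wish to prototype the training objective described in claims 2, 3, and 5, the following is one plausible PyTorch sketch. It is a hedged illustration, not the patented formulation: the standard-normal negative log-likelihood (written here without any flow log-determinant term), the hinge form of the cosine comparison loss, the equal loss weights, and the helper names (output_loss, model.rollout, and so on) are all assumptions introduced for this example.

import torch
import torch.nn.functional as F

def nll_loss(zs):
    """Negative log-likelihood of the N random variables under a
    standard normal prior; pushes encoder outputs toward N(0, I).
    Any flow log-determinant term is omitted in this sketch."""
    z = torch.stack([v.flatten() for v in zs])                # (N, D)
    d = z.shape[1]
    log_prob = -0.5 * z.pow(2).sum(dim=1) \
               - 0.5 * d * torch.log(torch.tensor(2.0 * torch.pi))
    return -log_prob.mean()

def cosine_terms(sr_feats, real_feat):
    """Cosine similarity between each super-resolution face feature
    and the real high-resolution face feature; 1 - similarity
    measures the degree of identity difference."""
    sims = torch.stack([F.cosine_similarity(f, real_feat, dim=0)
                        for f in sr_feats])                   # (N,)
    return (1.0 - sims).mean(), sims

def cosine_comparison_loss(sims):
    """Hinge penalty (an assumed form) encouraging each generated
    image to be at least as similar to the real face as the one
    generated before it."""
    return F.relu(sims[:-1] - sims[1:]).mean()

def output_loss(zs, sr_feats, real_feat, w=(1.0, 1.0, 1.0)):
    # Equal weighting is an assumption; the claims fix no weights.
    nll = nll_loss(zs)
    cos, sims = cosine_terms(sr_feats, real_feat)
    return w[0] * nll + w[1] * cos + w[2] * cosine_comparison_loss(sims)

def train_to_convergence(model, lr_frames, real_hr, recog_net,
                         optimizer, tol=1e-4, max_steps=10000):
    """Claim-2-style loop: adjust parameters until the output loss
    stops improving. model.rollout (yielding the N random variables
    and N super-resolution images), tol, and max_steps are
    hypothetical."""
    prev = float("inf")
    for _ in range(max_steps):
        zs, sr_images = model.rollout(lr_frames, real_hr)
        sr_feats = [recog_net(img) for img in sr_images]
        loss = output_loss(zs, sr_feats, recog_net(real_hr))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if abs(prev - loss.item()) < tol:  # simple convergence test
            break
        prev = loss.item()
    return model  # the converged model plays the role of the second network model

The hinge form of the comparison loss directly encodes the monotonicity requirement of claim 5: it is zero exactly when each similarity is no smaller than the one before it.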
PCT/CN2021/128518 2021-08-02 2021-11-03 Image generation method, apparatus, and electronic device WO2023010701A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110879082.1 2021-08-02
CN202110879082.1A CN113344792B (en) 2021-08-02 2021-08-02 Image generation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
WO2023010701A1

Family

Family ID: 77480653

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/128518 WO2023010701A1 (en) 2021-08-02 2021-11-03 Image generation method, apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN113344792B (en)
WO (1) WO2023010701A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344792B (en) * 2021-08-02 2022-07-05 浙江大华技术股份有限公司 Image generation method and device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112508782B (en) * 2020-09-10 2024-04-26 浙江大华技术股份有限公司 Training method of network model, and super-resolution reconstruction method and device of face image

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110222724A1 (en) * 2010-03-15 2011-09-15 Nec Laboratories America, Inc. Systems and methods for determining personal characteristics
CN107423701A (en) * 2017-07-17 2017-12-01 北京智慧眼科技股份有限公司 The non-supervisory feature learning method and device of face based on production confrontation network
CN110889895A (en) * 2019-11-11 2020-03-17 南昌大学 Face video super-resolution reconstruction method fusing single-frame reconstruction network
CN111062867A (en) * 2019-11-21 2020-04-24 浙江大华技术股份有限公司 Video super-resolution reconstruction method
CN112507617A (en) * 2020-12-03 2021-03-16 青岛海纳云科技控股有限公司 Training method of SRFlow super-resolution model and face recognition method
CN113344792A (en) * 2021-08-02 2021-09-03 浙江大华技术股份有限公司 Image generation method and device and electronic equipment

Also Published As

Publication number Publication date
CN113344792A (en) 2021-09-03
CN113344792B (en) 2022-07-05

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 21952564
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE