WO2019227479A1 - Method and device for generating a face rotation image - Google Patents

Method and device for generating a face rotation image

Info

Publication number
WO2019227479A1
WO2019227479A1 (PCT/CN2018/089611; CN2018089611W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
face
loss
network
rotation
Prior art date
Application number
PCT/CN2018/089611
Other languages
English (en)
French (fr)
Inventor
饶强
遇冰
冯柏岚
胡一博
吴翔
赫然
孙哲南
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to CN201880090767.4A (published as CN111819568A)
Priority to PCT/CN2018/089611 (published as WO2019227479A1)
Publication of WO2019227479A1
Priority to US17/038,208 (published as US11232286B2)



Classifications

    • G06T3/18
    • G06T5/70
    • G PHYSICS — G06 COMPUTING; CALCULATING OR COUNTING
      • G06F ELECTRIC DIGITAL DATA PROCESSING
        • G06F18/00 Pattern recognition
          • G06F18/20 Analysing
            • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
              • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
      • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
        • G06T9/00 Image coding
          • G06T9/002 Image coding using neural networks
        • G06T2207/00 Indexing scheme for image analysis or image enhancement
          • G06T2207/20 Special algorithmic details
            • G06T2207/20081 Training; Learning
            • G06T2207/20084 Artificial neural networks [ANN]
          • G06T2207/30 Subject of image; Context of image processing
            • G06T2207/30196 Human being; Person
              • G06T2207/30201 Face
      • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
        • G06V10/00 Arrangements for image or video recognition or understanding
          • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
            • G06V10/764 Arrangements using classification, e.g. of video objects
            • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
              • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
            • G06V10/82 Arrangements using neural networks
        • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
          • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
            • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
              • G06V40/161 Detection; Localisation; Normalisation
                • G06V40/165 Detection; Localisation; Normalisation using facial parts and geometric relationships
              • G06V40/168 Feature extraction; Face representation
                • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
              • G06V40/172 Classification, e.g. identification

Definitions

  • Embodiments of the present invention relate to the field of computer vision, and in particular, to a method and device for generating a face rotation image.
  • Computer vision is an integral part of various intelligent/autonomous systems in application areas such as manufacturing, inspection, document analysis, medical diagnostics, and the military. It studies how to use cameras/video cameras and computers to obtain the data and information of a photographed subject that we need. Figuratively speaking, it gives the computer eyes (a camera/camcorder) and a brain (algorithms) so that it can identify, track, and measure targets in place of human eyes, enabling the computer to perceive the environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make artificial systems "perceive" from images or multidimensional data.
  • In general, computer vision uses various imaging systems in place of the visual organs to obtain input information, and the computer then processes and interprets this input information in place of the brain.
  • the ultimate research goal of computer vision is to enable computers to observe and understand the world visually like humans and have the ability to adapt to the environment autonomously.
  • Face rotation refers to using computer-vision methods such as image processing, machine learning, and computer graphics to generate, from a given 2D face image, realistic 2D face images at other rotation angles, based on the geometric mapping of the face in three-dimensional space. Face rotation mainly addresses the problem of inaccurate recognition of profile faces caused by large rotation angles in face recognition; it can also be used for face data augmentation to alleviate insufficient face data when training face recognition models.
  • Existing face rotation methods mainly fall into two categories: 3D/2D models and sparse subspace generation.
  • In the 3D/2D model approach, the 2D face image is mapped onto a 3D face model, the 3D model parameters of the face are estimated, and the rotated face projection is then obtained through a perspective transformation, thereby obtaining the rotated 2D face image.
  • The 3D face model can in theory handle face rotation in any pose, but at present 3D face computation is expensive and the achievable accuracy is not high enough.
  • The sparse-subspace method uses sparse and low-rank learning to treat different poses of the same face as linear subspaces, and then applies low-rank constraints to solve for the lowest-rank frontal face image.
  • This method mainly handles rotating a face from profile to frontal view, which is only a special case of face rotation.
  • the prior art uses a generative adversarial network and a one-dimensional one-hot pose encoder to guide the pose encoding of the face image and generate different pose features of the face image.
  • the generative adversarial network is a method of training a face generation model through a deep learning model and an adversarial manner.
  • the one-hot pose coding method used in the prior art is not accurate enough to express the pose, and the method does not have continuity.
  • In addition, the structure of the adversarial discrimination network in the existing generative adversarial network makes the discrimination insufficiently robust, so the rotation images it generates are not of good quality.
  • the embodiments of the present application disclose a method and a device for generating a face rotation image, which can improve the efficiency of generating a face rotation image and obtain better image quality.
  • an embodiment of the present application provides a method for generating a face rotation image, including:
  • Multiple first training pictures are obtained from the training data set according to the face rotation angle; each of the multiple first training pictures includes a human face, and the rotation angles of the faces presented in the multiple first training pictures are all the face rotation angle;
  • the to-be-input signal is input into a face rotation image generation model to obtain a face rotation image.
  • The method for generating a face rotation image obtains the input signal of the face rotation image generation model by performing pose encoding on the face image and the target face image, and then uses the face rotation image generation model to generate a face rotation image. Since the pose encoding describes the face pose more accurately and robustly, the generated face rotation image is also more accurate.
  • The rotation angles of the faces presented in the multiple first training pictures are the same, and this rotation angle may be an angle preset by the user;
  • that is, the rotation angles of the faces presented in the multiple first training pictures are all the preset angle.
  • the method for generating a face rotation image provided in the embodiment of the present application has no limitation on the angle of face rotation, that is, face rotation at various angles can be realized.
  • The plurality of first training pictures are obtained from the training data set according to the face rotation angle, and the multiple first training pictures all include human faces.
  • These human faces do not need to be the same face as the face in the face image.
  • The face image may be a real-time face image to be rotated that is input by a user,
  • whereas the plurality of first training pictures come from the training data set maintained by the database, so the faces included in the plurality of first training pictures and the face contained in the face image may be considered to have no direct relationship; of course, the face contained in the face image may also appear in the database and be used as training data.
  • The generating of the to-be-input signal according to the face image, the pose encoding map of the face image, and the pose encoding map of the target face image may specifically be:
  • obtaining the to-be-input signal by fusing the face image, the pose encoding map of the face image, and the pose encoding map of the target face image by means of feature fusion.
  • Feature fusion organically combines features that are distinctive and complementary into a unified feature in a certain way.
  • Feature fusion is a commonly used technology in the field of biometrics, and features can be integrated in a variety of ways. The information contained in the fused features is more accurate and richer.
  • Since the to-be-input signal contains information that is more accurate and richer than any one of the face image, the pose encoding map of the face image, or the pose encoding map of the target face image alone, using the to-be-input signal to generate the face rotation image can make the generated face rotation image more accurate.
  • In an application scenario such as face recognition, a more accurate face rotation image can be used to improve the accuracy of face recognition.
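  • As an illustration of the feature-fusion step described above, the following sketch fuses the face image with the two pose encoding maps by simple channel-wise concatenation; this is only one possible fusion scheme, and the array shapes and helper names are assumptions, not taken from this publication.

```python
import numpy as np

def fuse_inputs(face_image, face_pose_map, target_pose_map):
    """Fuse the face image with the two pose encoding maps by channel-wise
    concatenation (one simple form of feature fusion).

    face_image:      H x W x 3 array
    face_pose_map:   H x W x N array (N Gaussian-blurred key-point maps)
    target_pose_map: H x W x M array (M Gaussian-blurred key-point maps)
    Returns an H x W x (3 + N + M) array used as the to-be-input signal.
    """
    return np.concatenate([face_image, face_pose_map, target_pose_map], axis=-1)

# Example with random data: a 128 x 128 face image and 5-key-point pose maps.
signal = fuse_inputs(np.random.rand(128, 128, 3),
                     np.random.rand(128, 128, 5),
                     np.random.rand(128, 128, 5))
print(signal.shape)  # (128, 128, 13)
```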
  • performing posture encoding on the face image according to two or more key points in the face image, and obtaining a posture encoding map of the face image includes:
  • Gaussian blur processing is performed, one by one and centered on each of the N key points, on the first image corresponding to that key point, to obtain N first Gaussian blur maps;
  • the N first Gaussian blur maps are the pose encoding map of the face image.
  • Constructing the N first images having the same size as the face image includes: constructing N all-zero matrices, each corresponding to one key point; mapping the position of each key point in the face image to the corresponding position in its all-zero matrix and changing the value at that position from 0 to 1; the N one-hot matrices thus generated are the N first images.
  • N first images are related to position coordinates of the key points in the face image.
  • performing Gaussian blur processing on the first image corresponding to each of the key points one by one with each of the N key points as the center includes:
  • Gaussian blur processing is performed on the one-hot code matrix with a point having a value of 1 in each one-hot code matrix as a center.
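  • A minimal sketch of this pose encoding, assuming scipy's Gaussian filter and an arbitrary blur radius (the key-point coordinates and sigma below are illustrative assumptions, not values from this publication):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pose_encode(keypoints, height, width, sigma=3.0):
    """Build one Gaussian-blurred map per key point.

    keypoints: list of (row, col) coordinates of the N key points.
    Returns an array of shape (N, height, width): each channel is an all-zero
    matrix with a 1 at the key-point position, blurred with a Gaussian kernel
    centred on that point.
    """
    maps = []
    for (r, c) in keypoints:
        m = np.zeros((height, width), dtype=np.float32)  # all-zero matrix
        m[int(r), int(c)] = 1.0                           # one-hot at the key point
        maps.append(gaussian_filter(m, sigma=sigma))      # Gaussian blur around it
    return np.stack(maps, axis=0)

# Example: five face key points (eyes, nose tip, mouth corners) on a 128x128 image.
kps = [(48, 44), (48, 84), (64, 64), (88, 50), (88, 78)]
pose_map = pose_encode(kps, 128, 128)
print(pose_map.shape)  # (5, 128, 128)
```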
  • performing posture encoding on the target face image according to two or more key points in the target face image, and obtaining a posture encoding map of the target face image includes:
  • Gaussian blur processing is performed, centered on each of the M key points, on the second image corresponding to that key point, to obtain M second Gaussian blur maps; the M second Gaussian blur maps are the pose encoding map of the target face image.
  • The manner of constructing the M second images having the same size as the target face image is the same as the manner of constructing the N first images of the face image described above, and is not repeated here.
  • the target face image is obtained according to the multiple first training pictures, including:
  • the target face image is obtained according to an average value of the pose encoding pictures of the plurality of first training pictures.
  • posture encoding maps of the plurality of first training pictures herein may also be obtained according to the foregoing posture encoding manner, and are not expanded here.
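  • A minimal sketch of obtaining the target pose information, assuming it is taken as the element-wise mean of the pose encoding maps of the first training pictures (the shapes and function name are assumptions):

```python
import numpy as np

def target_pose_map(training_pose_maps):
    """Average the pose encoding maps of the first training pictures.

    training_pose_maps: array of shape (K, N, H, W), the pose encoding maps of
    K first training pictures that all show the desired rotation angle.
    Returns the mean map of shape (N, H, W) used as the target pose encoding.
    """
    return np.mean(np.asarray(training_pose_maps), axis=0)

print(target_pose_map(np.random.rand(10, 5, 128, 128)).shape)  # (5, 128, 128)
```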
  • The face rotation image generation model is obtained by training a generative adversarial network;
  • the generative adversarial network includes at least one face generating network and at least two discriminant networks.
  • The at least two discrimination networks form a coupled adversarial pair for generating an adversarial loss, and the adversarial loss is used to update the at least one face generation network and the at least two discrimination networks;
  • the updated at least one face generation network is the face rotation image generation model.
  • Different discrimination networks can discriminate, under different conditions, the predicted face rotation image generated by the face generation network, and the discrimination results obtained by the different discrimination networks all influence the generative adversarial network, so that the generative adversarial network can adjust and capture different aspects of the face image according to those different conditions, thereby outputting a more accurate face rotation image.
  • an embodiment of the present application provides a training method for a generative adversarial network.
  • The generative adversarial network includes a face generation network and a plurality of coupled adversarial discrimination networks.
  • The coupled adversarial discrimination networks include at least a first discrimination network and a second discrimination network, and the method includes:
  • the face image and the face rotation image are images before and after the same face rotation
  • the trained face generation network is output.
  • The face image and the face rotation image are images of the same face before and after rotation; it is not required that a frontal face be rotated to obtain a profile face, or that a profile face be rotated to obtain a frontal face. It should be understood that there is a certain rotation angle between the face before and after rotation; this rotation angle may be preset and is not described again here.
  • The training method of the generative adversarial network obtains the pose encoding map of the face image and the pose encoding map of the face rotation image by performing pose encoding on the face image and the face rotation image, and generates a predicted face rotation image through the face generation network in the generative adversarial network; further, the predicted face rotation image is discriminated by at least the first discrimination network and the second discrimination network to obtain a first loss and a second loss, respectively.
  • The first loss and the second loss are weighted and summed to obtain the total loss of the generative adversarial network, and the total loss is used to update the face generation network and the first and second discrimination networks in the generative adversarial network.
  • Because the pose encoding describes the face pose more accurately and robustly, the predicted face rotation image obtained by the face generation network and the discrimination networks through the pose encoding maps is also closer to the real face rotation image.
  • In addition, since this training method imposes no restriction on the rotation angle of the training data (the face image and the face rotation image), the network obtained by training can also adapt to face rotation at various angles, thereby improving the operability and user experience of face rotation.
  • Different discrimination networks discriminate, under different conditions, the predicted face rotation image generated by the face generation network; the discrimination results obtained by the different discrimination networks all influence the generative adversarial network, so that it can adjust and capture different aspects of the face image according to those conditions and output a more accurate face rotation image.
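  • The following PyTorch sketch illustrates, under many simplifying assumptions (toy network sizes, arbitrary loss weights, random stand-in data), how one training step could couple a face generation network with two conditioned discrimination networks whose losses are weighted and summed into a total loss; it is an illustrative sketch, not the patented architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def disc_net(in_ch):                      # toy stand-in for a real discrimination network
    return nn.Sequential(nn.Conv2d(in_ch, 16, 4, 2, 1), nn.LeakyReLU(0.2),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(16, 1), nn.Sigmoid())

G    = nn.Sequential(nn.Conv2d(3 + 5 + 5, 16, 3, 1, 1), nn.ReLU(),
                     nn.Conv2d(16, 3, 3, 1, 1), nn.Tanh())   # face generation network
D_ii = disc_net(3 + 3)   # conditioned on the face image I_a
D_ip = disc_net(3 + 5)   # conditioned on the pose encoding map P_b
bce  = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(list(D_ii.parameters()) + list(D_ip.parameters()), lr=2e-4)

# Hypothetical batch: face image I_a, its pose map P_a, target pose map P_b, real rotated face I_b.
I_a, I_b = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
P_a, P_b = torch.rand(2, 5, 64, 64), torch.rand(2, 5, 64, 64)
real, fake = torch.ones(2, 1), torch.zeros(2, 1)

# --- update the two discrimination networks (first loss + second loss) ---
I_hat = G(torch.cat([I_a, P_a, P_b], dim=1)).detach()        # predicted face rotation image
loss_ii = bce(D_ii(torch.cat([I_b, I_a], 1)), real) + bce(D_ii(torch.cat([I_hat, I_a], 1)), fake)
loss_ip = bce(D_ip(torch.cat([I_b, P_b], 1)), real) + bce(D_ip(torch.cat([I_hat, P_b], 1)), fake)
d_total = 1.0 * loss_ii + 1.0 * loss_ip                       # weighted sum (weights assumed)
opt_d.zero_grad(); d_total.backward(); opt_d.step()

# --- update the face generation network (fool both discriminators + a pixel term) ---
I_hat = G(torch.cat([I_a, P_a, P_b], dim=1))
g_adv = bce(D_ii(torch.cat([I_hat, I_a], 1)), real) + bce(D_ip(torch.cat([I_hat, P_b], 1)), real)
g_pix = F.l1_loss(I_hat, I_b)                                 # pixel loss term
g_total = 1.0 * g_adv + 10.0 * g_pix                          # weights assumed for illustration
opt_g.zero_grad(); g_total.backward(); opt_g.step()
```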
  • Before the updating of the face generation network, the first discrimination network, and the second discrimination network according to the total loss of the generative adversarial network, the method further includes:
  • obtaining a real-image loss, where the real-image loss includes at least one of a pixel loss, a total variation loss, and an identity feature loss; correspondingly, the total loss of the generative adversarial network is obtained from a weighted sum of the at least one real-image loss, the first loss, and the second loss.
  • a possible implementation manner of the embodiment of the present application is to consider not only the first loss and the second loss, but also the true loss of the image, such as pixel loss, total variation loss, and identity feature loss.
  • When the real-image loss includes the pixel loss, the total loss of the generative adversarial network is obtained from a weighted sum of the first loss, the second loss, and the pixel loss; the introduction of the pixel loss constrains the generated image at the pixel level.
  • When the real-image loss includes the total variation loss, the total loss of the generative adversarial network is obtained from a weighted sum of the first loss, the second loss, and the total variation loss. The total variation loss has the effect of preventing local image gradients from becoming too large, so the generated predicted face image is kept from producing overly large local gradients that would cause local defects.
  • When the real-image loss includes the identity feature loss, the total loss of the generative adversarial network is obtained from a weighted sum of the first loss, the second loss, and the identity feature loss; the identity recognition feature is used to ensure that the identity information of the face in the generated predicted face rotation image remains the same as that of the face contained in the input training data (the face image and the face rotation image). With the identity feature loss, the trained generative adversarial network can generate rotated images with more accurate identity information.
  • When the real-image loss includes two or three of the three losses mentioned above, the corresponding effects are taken into account together.
  • The step of performing pose encoding on the face image according to two or more key points in the face image to obtain the pose encoding map of the face image includes:
  • using a key point detection algorithm to detect the face image and obtain the position coordinates of the N key points of the face image; constructing N first images having the same size as the face image, the N first images corresponding one-to-one to the N key points; and performing Gaussian blur processing, centered on each of the N key points, on the first image corresponding to that key point
  • to obtain N first Gaussian blur maps, where the N first Gaussian blur maps are the pose encoding map of the face image and N is a positive integer greater than 1.
  • Constructing the N first images having the same size as the face image includes: constructing N all-zero matrices, each corresponding to one key point; mapping the position of each key point in the face image to the corresponding position in its all-zero matrix and changing the value at that position from 0 to 1; the N one-hot matrices thus generated are the N first images.
  • This way of performing pose encoding through Gaussian blurring around key points describes the face pose more accurately and robustly, and with a more accurate and robust description of the face pose, the predicted face rotation image produced by the face generation network is closer to the real face rotation image.
  • N first images are related to position coordinates of the key points in the face image.
  • performing Gaussian blur processing on the first image corresponding to each of the key points one by one with each of the N key points as the center includes:
  • Gaussian blur processing is performed on the one-hot code matrix with a point having a value of 1 in each one-hot code matrix as a center.
  • the posture encoding is performed on the face rotation image according to two or more key points in the face rotation image to obtain a posture encoding map of the face rotation image.
  • The M second images are in one-to-one correspondence with the M key points, and Gaussian blur processing, centered on each of the M key points, is performed on the second image corresponding to that key point to obtain M second Gaussian blur maps, where the M second Gaussian blur maps are the pose encoding map of the face rotation image and M is a positive integer greater than 1.
  • The manner of constructing the M second images with the same size as the face rotation image is the same as the manner of constructing the N first images of the face image described above, and is not repeated here.
  • the inputting the face image, the face rotation image, and the predicted face rotation image into the first discrimination network to obtain a first loss includes:
  • The first discrimination network includes a two-class discriminator, and the two-class discriminator is used to judge whether its input is true or false.
  • The face image is used as the discrimination condition of the first discrimination network; the authenticity of the face rotation image and of the predicted face rotation image is determined according to the first discrimination network, and the first loss is generated according to the determination results. The first loss may take the form of a conditional adversarial loss: $L_{ii} = \mathbb{E}_{I_b \sim H(I_b)}\big[\log D_{ii}(I_b \mid I_a)\big] + \mathbb{E}_{\hat{I}_b}\big[\log\big(1 - D_{ii}(\hat{I}_b \mid I_a)\big)\big]$, where $L_{ii}$ is the first loss, $I_a$ is the face image, $I_b$ is the face rotation image, $\hat{I}_b$ is the predicted face rotation image, $I_b \sim H(I_b)$ indicates taking the expectation over the distribution $H(I_b)$ of the face rotation image (i.e. the probability that $I_b$ is judged true), $D_{ii}$ is the first discrimination network conditioned on the face image $I_a$, $\theta_{ii}$ is the parameter of the first discrimination network, and $(I_b, I_a)$ or $(\hat{I}_b, I_a)$ is the input of the first discrimination network.
  • Inputting the face rotation image, the pose encoding map of the face rotation image, and the predicted face rotation image into the second discrimination network to obtain the second loss includes:
  • using the pose encoding map of the face rotation image as the discrimination condition of the second discrimination network, determining the authenticity of the face rotation image and of the predicted face rotation image according to the second discrimination network, and generating the second loss according to the discrimination results; wherein the second discrimination network includes a two-class discriminator, and the two-class discriminator is used to judge whether its input is true or false.
  • the first discrimination network uses the face image as a discrimination condition
  • the second discrimination network uses the posture coded image of the face rotation image as a discrimination condition.
  • The discrimination results obtained by the two networks, i.e. the first loss and the second loss, are combined into the total loss of the generative adversarial network through a weighted sum, and the total loss is used to update the generative adversarial network (including the face generation network, the first discrimination network, and the second discrimination network).
  • The generative adversarial network trained in this way can capture both the apparent authenticity of the face and the face pose information very well.
  • Since the first discrimination network uses the face image as its discrimination condition, it can be understood that the apparent authenticity of the face is captured by the first discrimination network;
  • since the second discrimination network uses the pose encoding map of the face rotation image as its discrimination condition, it can be understood that the face pose is captured by the second discrimination network.
  • The pose encoding map of the face rotation image is used as the discrimination condition of the second discrimination network; the authenticity of the face rotation image and of the predicted face rotation image is determined according to the second discrimination network, and the second loss is generated according to the determination results. The second loss may take the analogous form $L_{ip} = \mathbb{E}_{I_b \sim H(I_b)}\big[\log D_{ip}(I_b \mid P_b)\big] + \mathbb{E}_{\hat{I}_b}\big[\log\big(1 - D_{ip}(\hat{I}_b \mid P_b)\big)\big]$, where $L_{ip}$ is the second loss, $I_b$ is the face rotation image, $\hat{I}_b$ is the predicted face rotation image, $P_b$ is the pose encoding map of the face rotation image, $D_{ip}$ is the second discrimination network conditioned on $P_b$, $\theta_{ip}$ is the parameter of the second discrimination network, and $(I_b, P_b)$ or $(\hat{I}_b, P_b)$ is the input of the second discrimination network.
  • the obtaining the real image loss according to the face rotation image and the predicted face rotation image includes:
  • The pixel loss may be written as a multi-scale pixel difference, for example $L_{pix} = \frac{1}{S}\sum_{s=1}^{S}\frac{1}{W_s H_s C}\big\|\hat{I}_b^{(s)} - I_b^{(s)}\big\|_1$, where $L_{pix}$ is the pixel loss, $S$ is the number of scales, $\hat{I}_b$ is the predicted face rotation image, $I_b$ is the face rotation image, and $W_s$, $H_s$, and $C$ denote the width, height, and number of channels at scale $s$.
  • The pixel difference here indicates the difference between pixels at corresponding positions of the predicted face rotation image and the face rotation image.
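  • A minimal sketch of such a multi-scale pixel loss, assuming average-pooling is used to build the scales (the scale construction and the use of an L1 difference are assumptions):

```python
import torch
import torch.nn.functional as F

def pixel_loss(pred, target, scales=3):
    """Multi-scale pixel loss: mean absolute pixel difference between the
    predicted face rotation image and the real face rotation image, averaged
    over several downsampled scales (the scale construction is an assumption).
    pred, target: tensors of shape (B, C, H, W)."""
    loss = 0.0
    for s in range(scales):
        if s > 0:
            pred = F.avg_pool2d(pred, 2)
            target = F.avg_pool2d(target, 2)
        loss = loss + (pred - target).abs().mean()
    return loss / scales

print(pixel_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)))
```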
  • the obtaining the real image loss according to the face rotation image and the predicted face rotation image includes:
  • The total variation loss of the predicted face rotation image $\hat{I}_b$ may be written as $L_{tv} = \sum_{c=1}^{C}\sum_{w=1}^{W}\sum_{h=1}^{H}\big(\big|\hat{I}_{w+1,h,c} - \hat{I}_{w,h,c}\big| + \big|\hat{I}_{w,h+1,c} - \hat{I}_{w,h,c}\big|\big)$, where $L_{tv}$ is the total variation loss, $W$ is the width of the predicted face rotation image, $H$ is its height, and $C$ is its number of channels.
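  • A minimal sketch of a total variation loss of this kind (the mean reduction is an assumption):

```python
import torch

def total_variation_loss(pred):
    """Total variation loss of the predicted face rotation image (B, C, H, W):
    penalises large differences between horizontally and vertically adjacent
    pixels, which discourages overly large local gradients."""
    dh = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs().mean()
    dw = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs().mean()
    return dh + dw

print(total_variation_loss(torch.rand(1, 3, 64, 64)))
```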
  • the obtaining a real image loss according to the face rotation image and the predicted face rotation image includes:
  • The identity recognition feature is used to ensure that the identity information between the predicted face rotation image and the face image remains unchanged. The identity feature loss may be written as $L_{id} = \big\|f_{pool}(I_b) - f_{pool}(\hat{I}_b)\big\|_2^2 + \big\|f_{fc}(I_b) - f_{fc}(\hat{I}_b)\big\|_2^2$, where $f$ is a pre-trained face recognition model; the face recognition model $f$ is a deep neural network containing at least one pooling layer and at least one fully connected layer, $f_{pool}(\cdot)$ denotes the output of the last pooling layer of $f$, and $f_{fc}(\cdot)$ denotes the output of the last fully connected layer of $f$.
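  • A minimal sketch of an identity feature loss, assuming a hypothetical pre-trained face recognition model `face_model` that returns its last pooling-layer and last fully-connected-layer outputs (this interface is a placeholder, not a specific library API):

```python
import torch

def identity_loss(face_model, pred, real):
    """Identity feature loss: compare features of the predicted and the real face
    rotation image in a pre-trained face recognition model `face_model`, which is
    assumed to return (last_pooling_output, last_fc_output)."""
    pool_p, fc_p = face_model(pred)
    pool_r, fc_r = face_model(real)
    return ((pool_p - pool_r) ** 2).mean() + ((fc_p - fc_r) ** 2).mean()

# Toy stand-in for a real recognition model: global-average "pooling" + a fixed projection.
toy_model = lambda x: (x.mean(dim=(2, 3)), x.mean(dim=(2, 3)) @ torch.ones(3, 8))
print(identity_loss(toy_model, torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)))
```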
  • the updating the face generation network, the first discrimination network, and the second discrimination network according to the total loss of the generative adversarial network includes:
  • an embodiment of the present application provides a method for generating a rotation image of a human body, including:
  • A plurality of second training pictures are obtained from the training data set according to the human body rotation angle; the plurality of second training pictures all include a human body, and the rotation angles of the human bodies included in the plurality of second training pictures are all the human body rotation angle;
  • the to-be-input signal is input into the human body rotation image generation model to obtain a human body rotation image.
  • The generating of the to-be-input signal according to the human body image, the pose encoding map of the human body image, and the pose encoding map of the target human body image may specifically be: fusing the human body image, the pose encoding map of the human body image, and the pose encoding map of the target human body image by means of feature fusion to obtain the to-be-input signal.
  • Feature fusion organically combines features that are distinctive and complementary into a unified feature in a certain way.
  • Feature fusion is a commonly used technology in the field of biometrics, and features can be integrated in a variety of ways. The information contained in the fused features is more accurate and richer.
  • Since the to-be-input signal contains information that is more accurate and richer than any one of the human body image, the pose encoding map of the human body image, or the pose encoding map of the target human body image alone, using the to-be-input signal to generate the human body rotation image can make the generated human body rotation image more accurate.
  • When the human body rotation method is applied to person positioning or recognition in a surveillance system, a more accurate human body rotation image can be used to improve the accuracy of positioning and recognition.
  • performing posture encoding on the human body image to obtain a posture encoding map of the human body image includes:
  • Gaussian blur processing is performed, centered on each of the W key points, on the third image corresponding to that key point, to obtain W third Gaussian blur maps; the W third Gaussian blur maps are the pose encoding map of the human body image.
  • W key points of the human body image are first determined.
  • The W key points here may include the N key points described above; however, the N key points above are only key points on the human face, such as the center of the left eyeball, the center of the right eyeball, the tip of the nose, the corners of the mouth, or points on the facial contour, whereas the W key points may also include points corresponding to key parts of the human body, such as the left elbow joint, the right elbow joint, the center of the left knee, the center of the right knee, and so on.
  • the third image corresponding to the key point is subjected to Gaussian blur processing with each key point as the center.
  • This way of performing pose encoding by Gaussian blurring around key points describes the posture of the human body more accurately and robustly, resulting in higher-quality rotated images of the human body.
  • the W third images having the same size as the human body image include:
  • constructing W all-zero matrices, each corresponding to one key point; mapping the position of each key point in the human body image to the corresponding position in its all-zero matrix and changing the value at that position from 0 to 1; the W one-hot matrices thus generated are the W third images.
  • W third images are related to position coordinates of the key point in the human body image.
  • the step of performing Gaussian blur processing on the third image corresponding to each of the key points one by one with each of the W key points as a center includes:
  • Gaussian blur processing is performed on the one-hot code matrix with a point having a value of 1 in each one-hot code matrix as a center.
  • As for the pose encoding of the target human body image, it is the same as the pose encoding of the human body image described above; the number of key points may differ, but the implementation process is the same and is therefore not described again.
  • an embodiment of the present application provides a training method for a generative adversarial network.
  • The generative adversarial network includes a human body image generation network and a plurality of coupled adversarial discrimination networks, including at least a third discrimination network and a fourth discrimination network; the method includes:
  • the human body image and the human body rotation image are images before and after the same human body is rotated;
  • The training method of the generative adversarial network obtains the pose encoding map of the human body image and the pose encoding map of the human body rotation image by performing pose encoding on the human body image and the human body rotation image, and generates a predicted human body rotation image through the human body image generation network in the generative adversarial network; further, at least two discrimination networks, such as the third discrimination network and the fourth discrimination network, discriminate the predicted human body rotation image to obtain a third loss and a fourth loss, respectively.
  • The third loss and the fourth loss are weighted and summed to obtain the total loss of the generative adversarial network, and the total loss is used to update the human body image generation network and the third and fourth discrimination networks. Because the pose encoding method described above describes the posture of the human body more accurately and robustly, the predicted human body rotation image obtained by the human body image generation network and the discrimination networks through the pose encoding maps is also closer to the real human body rotation image. In addition, since this training method imposes no restriction on the rotation angle of the training data (the human body image and the human body rotation image), the network obtained by training can also adapt to human body rotation at various angles, thereby improving the operability and user experience of human body rotation.
  • Different discrimination networks discriminate, under different conditions, the predicted human body rotation image generated by the human body image generation network; the discrimination results obtained by the different discrimination networks all influence the generative adversarial network, so that it can adjust and capture different aspects of the human body image according to those conditions and output a more accurate human body rotation image.
  • The pose encoding method here is the same as the pose encoding method provided in the third aspect; the specific key points may differ, but the operation is the same, so it is not repeated here.
  • an embodiment of the present application provides a device for generating a face rotation image, and the device includes a module for executing the first aspect or a method in any possible implementation manner of the first aspect.
  • an embodiment of the present application provides a training device for a generative adversarial network, where the device includes a module for executing the second aspect or a method in any possible implementation manner of the second aspect.
  • An embodiment of the present application provides a device for generating a face rotation image, including a processor and a memory, where the memory is used to store program instructions and the processor is used to call the program instructions to execute the method provided in the first aspect or any possible implementation of the first aspect.
  • An embodiment of the present application provides a device for training a generative adversarial network, including a processor and a memory, where the memory is used to store program instructions and the processor is used to call the program instructions to execute the method provided in the second aspect or any possible implementation of the second aspect.
  • An embodiment of the present application provides a computer-readable storage medium that stores program instructions; when the program instructions are run by a processor, the method provided in the first aspect or any possible implementation of the first aspect is executed.
  • An embodiment of the present application provides a computer-readable storage medium that stores program instructions; when the program instructions are run by a processor, the method provided in the second aspect or any possible implementation of the second aspect is executed.
  • A chip includes a processor and a data interface; the processor reads, through the data interface, instructions stored in a memory, and executes the method in the first aspect or any possible implementation of the first aspect.
  • The chip may further include a memory that stores instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to execute the method in the first aspect or any possible implementation of the first aspect.
  • A chip includes a processor and a data interface; the processor reads, through the data interface, instructions stored in a memory, and executes the method in the second aspect or any possible implementation of the second aspect.
  • The chip may further include a memory that stores instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to execute the method in the second aspect or any possible implementation of the second aspect.
  • FIG. 1 is a schematic structural diagram of a system architecture according to an embodiment of the present application.
  • FIG. 2 is a logic diagram of a convolutional neural network provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a training method for a generative adversarial network according to an embodiment of the present application
  • FIG. 5 is a schematic flowchart of a method for generating a face rotation image according to an embodiment of the present application
  • FIG. 6 is a schematic block diagram of a training device for a generative adversarial network according to an embodiment of the present application
  • FIG. 7 is a schematic block diagram of a device for generating a face rotation image according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a hardware structure of a training device for a generative adversarial network according to an embodiment of the present application
  • FIG. 9 is a schematic diagram of a hardware structure of a face rotation image generating device according to an embodiment of the present application.
  • the training method of the generative adversarial network provided in the embodiment of the present application relates to the processing of computer vision, and can be specifically applied to data processing methods such as data training, machine learning, and deep learning.
  • Symbolic and formalized intelligent information modeling, extraction, preprocessing, and training are performed on the training data (such as the face image and the face rotation image in this application), and a trained generative adversarial network is finally obtained. The method for generating a face rotation image provided in the embodiments of the present application can then use this trained generative adversarial network: input data (such as the face image in this application) is fed into the trained network to obtain output data (such as the face rotation image in this application).
  • The training method of the generative adversarial network and the method for generating a face rotation image provided by the embodiments of the present application are inventions based on the same idea, and can also be understood as two parts of one system, or as two phases of an overall process: the model training phase and the model application phase.
  • The method and device provided in the embodiments of the present application can be applied to face recognition. For example, when only a profile face is available during face recognition, the method for generating a face rotation image provided in the embodiments of the present application can be used to first generate a frontal face from the profile face, and then perform face recognition based on the frontal face.
  • the method and device provided in the embodiments of the present application can also be used for face restoration.
  • Using the method provided in the embodiments of the present application, a complete face image can be obtained from a given image; face images at other angles can also be obtained from a frontal face image or a certain profile face image, thereby enriching the face information at various angles and making the acquisition of the monitored object more accurate.
  • The method and device provided in the embodiments of the present application can also be used to expand the training database. As shown in FIG. 1, the I/O interface 112 of the execution device 110 can send the image processed by the execution device (such as the obtained face rotation image) and the face image input by the user to the database 130 together as a training data pair, so that the training data maintained by the database 130 is richer, thereby providing richer training data for the training work of the training device 120.
  • In the model training phase, the face image and the face rotation image are used as training data and provided to the initial model for training; in the model application phase, the face image to be processed in actual application (the processing here being face rotation processing) is subjected to the relevant data processing and then input into the deep neural network to obtain the output data: a face rotation image.
  • The expressions "face image" and "face rotation image" are used in both the training phase and the application phase, but the face image and face rotation image in the training phase should not be assumed to be the same images as the face image and the face rotation image in the application phase.
  • the training database can be expanded by face rotation.
  • Input data: face image → face rotation processing → output data: face rotation image.
  • The face image is rotated from one pose angle to another, and the corresponding rotated image is obtained.
  • a neural network can be composed of a neural unit.
  • A neural unit may refer to an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit may be $h_{W,b}(x) = f\big(\sum_{s} W_s x_s + b\big)$, where $W_s$ is the weight of $x_s$ and $b$ is the bias of the neural unit; $f$ is the activation function of the neural unit, which is used to introduce non-linear characteristics into the neural network and convert the input signal of the neural unit into an output signal, and the output signal of the activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
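  • A minimal sketch of such a neural unit with a sigmoid activation (the example weights and inputs are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    """Single neural unit: weighted sum of the inputs x_s plus the bias b,
    passed through the sigmoid activation function f."""
    return sigmoid(np.dot(w, x) + b)

print(neural_unit(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.3]), 0.2))
```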
  • a neural network is a network formed by combining many of the above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receiving domain of the previous layer to extract the features of the local receiving domain.
  • the local receiving domain can be a region composed of several neural units.
  • A deep neural network (DNN, Deep Neural Network) is also called a multilayer neural network.
  • The layers inside the DNN can be divided into three categories: input layer, hidden layers, and output layer.
  • Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers.
  • The layers are fully connected; that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer.
  • Although the DNN looks very complicated, the work of each layer is actually not complicated.
  • The definition of these parameters in the DNN is as follows, taking the coefficient $W$ as an example. Assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the 2nd layer to the 2nd neuron of the 3rd layer is defined as $W^{3}_{24}$: the superscript 3 represents the layer in which the coefficient $W$ is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the k-th neuron of layer $L-1$ to the j-th neuron of layer $L$ is defined as $W^{L}_{jk}$. It should be noted that the input layer has no $W$ parameters. In deep neural networks, more hidden layers allow the network to better characterize complex situations in the real world.
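  • As a small illustration of this indexing convention (the layer sizes below are arbitrary assumptions), the coefficient $W^{3}_{24}$ is simply one entry of the weight matrix between layer 2 and layer 3:

```python
import numpy as np

# A small DNN: layer 1 (input) has 5 neurons, layer 2 has 6, layer 3 has 4.
sizes = {1: 5, 2: 6, 3: 4}
rng = np.random.default_rng(0)

# W[L] has shape (neurons in layer L, neurons in layer L-1), so W[L][j-1, k-1]
# is the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L.
W = {L: rng.standard_normal((sizes[L], sizes[L - 1])) for L in (2, 3)}

w_3_24 = W[3][2 - 1, 4 - 1]   # the coefficient written above as W^3_{24}
print(w_3_24)
```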
  • Training a deep neural network is also a process of learning a weight matrix.
  • the ultimate goal is to obtain the weight matrices (weight matrices formed by vectors W of many layers) of all layers of the trained deep neural network.
  • Convolutional neural network (CNN, Convolutional Neuron Network) is a deep neural network with a convolutional structure.
  • Convolutional neural networks include a feature extractor consisting of a convolutional layer and a sub-sampling layer.
  • the feature extractor can be regarded as a filter, and the convolution process can be regarded as a convolution using a trainable filter and an input image or a convolution feature map.
  • a convolution layer is a neuron layer in a convolutional neural network that performs convolution processing on input signals. In the convolutional layer of a convolutional neural network, a neuron can only be connected to some of the neighboring layer neurons.
  • a convolution layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • the convolution kernel can be initialized in the form of a matrix of random size. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • Taking the loss function as an example: the higher the output value (loss) of the loss function, the greater the difference between the prediction and the target, so training the deep neural network becomes a process of minimizing this loss as much as possible.
  • Convolutional neural networks can use the backpropagation (BP) algorithm to modify the size of the parameters in the initial super-resolution model during the training process, which makes the reconstruction error loss of the super-resolution model smaller and smaller.
  • The input signal is propagated forward until the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges.
  • the back-propagation algorithm is a back-propagation motion dominated by error loss, and aims to obtain the optimal parameters of the super-resolution model, such as the weight matrix.
  • A generative adversarial network (GAN, Generative Adversarial Networks) includes at least two modules: one module is a generative model and the other module is a discriminative model; the two modules learn through a game with each other to produce better output.
  • Both the generation model and the discriminant model can be neural networks, specifically deep neural networks or convolutional neural networks.
  • The basic principle of GAN is as follows. Take a GAN that generates pictures as an example: suppose there are two networks, G (Generator) and D (Discriminator), where G is a network that generates pictures; it receives a random noise z and generates a picture from this noise, denoted G(z). D is a discrimination network used to judge whether a picture is "real".
  • the pixel value of the image can be a red-green-blue (RGB) color value, and the pixel value can be a long integer representing the color.
  • For example, the pixel value may be 256*Red + 100*Green + 76*Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component.
  • the pixel value may be a grayscale value.
  • an embodiment of the present invention provides a system architecture 100.
  • the data acquisition device 160 is configured to collect training data.
  • the training data includes a face image and a face rotation image, where the face image is an image before the face is rotated.
  • the face rotation image is an image obtained by rotating the face in the face image;
  • training data is stored in the database 130, and the training device 120 obtains the target model / rule 101 based on the training data maintained in the database 130.
  • the first embodiment will describe in more detail how the training device 120 obtains a target model / rule 101 based on the training data.
  • The target model/rule 101 can be used to implement the method for generating a face rotation image provided by the embodiments of the present application; that is, after relevant preprocessing, the face image is input into the target model/rule 101, and a face rotation image can be obtained.
  • The target model/rule 101 in the embodiments of the present application may specifically be a face generation network, and the face generation network is obtained by training a generative adversarial network.
  • the training data maintained in the database 130 does not necessarily come from the data collection device 160, but may also be received from other devices.
  • It should be noted that the training device 120 does not necessarily perform the training of the target model/rule 101 entirely based on the training data maintained by the database 130; it is also possible to obtain training data from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of the present application.
  • The target model/rule 101 trained by the training device 120 may be applied to different systems or devices, such as the execution device 110 shown in FIG. 1; the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an AR/VR device, or an in-vehicle terminal, and may also be a server or the cloud.
  • The execution device 110 is configured with an I/O interface 112 for data interaction with external devices; a user can input data to the I/O interface 112 through the client device 140. In the embodiments of the present application, the input data
  • may include: a face image input by the user, and a plurality of first training pictures from the database, where each of the plurality of first training pictures includes a human face (which does not have to be the same face as the one in the face image), and the rotation angles of the faces presented in the plurality of first training pictures are all the angle α; the angle α may be preset, so that, for example, the face rotation image output by the execution device 110 is an image rotated by α degrees relative to the face image.
  • The preprocessing module 113 is configured to perform preprocessing according to the input data (such as the face image) received by the I/O interface 112. In the embodiments of the present application, the preprocessing module 113 may be used to perform pose encoding on the face image according to two or more key points in the face image to obtain a pose encoding map of the face image.
  • The preprocessing module 114 is configured to perform preprocessing according to the input data received by the I/O interface 112 (such as the plurality of first training pictures).
  • In the embodiments of the present application, the preprocessing module 114 may be configured to obtain a target face image from the plurality of first training pictures, and the preprocessing module 113 performs pose encoding on the target face image to obtain a pose encoding map of the target face image.
  • The preprocessing module 113 may further generate the to-be-input signal according to the face image, the pose encoding map of the face image, and the pose encoding map of the target face image; the to-be-input signal is input to the calculation module 111, and the calculation module 111 performs calculations according to the target model/rule 101 and the to-be-input signal, finally obtaining a face rotation image.
  • the execution device 110 may call data, codes, etc. in the data storage system 150 for corresponding processing
  • the data, instructions, etc. obtained by corresponding processing may also be stored in the data storage system 150.
  • the I / O interface 112 returns the processing result, such as the face rotation image obtained above, to the client device 140, so as to provide it to the user.
  • The training device 120 can generate different target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.
  • The user can manually provide the input data, and this manual operation can be performed through an interface provided by the I/O interface 112.
  • the client device 140 may automatically send the input data to the I / O interface 112. If the client device 140 is required to automatically send the input data to obtain the authorization of the user, the user may set corresponding permissions in the client device 140.
  • the user may view the result output by the execution device 110 on the client device 140, and the specific presentation form may be a specific manner such as display, sound, and action.
  • the client device 140 may also be used as a data acquisition terminal, collecting input data of the input I / O interface 112 and output results of the output I / O interface 112 as new sample data and storing them in the database 130.
• Of course, the data may instead not be collected by the client device 140 but be stored directly in the database 130 as new sample data by the I/O interface 112.
  • FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship among the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory with respect to the execution device 110. In other cases, the data storage system 150 may also be placed in the execution device 110.
  • a target model / rule 101 is obtained by training according to the training device 120.
• The target model/rule 101 may be a face rotation image generation model obtained by training a generative adversarial network (GAN).
• The generative adversarial network provided in the embodiment of the present application may include at least one face generation network and at least two discrimination networks; the at least two discrimination networks form a coupled adversarial network for generating an adversarial loss, the adversarial loss is used to update the at least one face generation network and the at least two discrimination networks, and the updated at least one face generation network serves as the face rotation image generation model.
  • the at least one face generation network and the at least two discriminant networks may be specifically convolutional neural networks.
  • the convolutional neural network is a deep neural network with a convolution structure. It is a deep learning architecture.
• The deep learning architecture refers to learning at multiple levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network, and each neuron in the feed-forward artificial neural network can respond to the image input into it.
  • the convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer / pooling layer 220 (where the pooling layer is optional), and a neural network layer 230.
  • the convolution layer / pooling layer 220 may include the layers 221-226 as an example.
• In one example, the layer 221 is a convolutional layer, the layer 222 is a pooling layer, the layer 223 is a convolutional layer, the layer 224 is a pooling layer, the layer 225 is a convolutional layer, and the layer 226 is a pooling layer. In another example, the layers 221 and 222 are convolutional layers, the layer 223 is a pooling layer, the layers 224 and 225 are convolutional layers, and the layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolutional layer 221 is taken as an example to introduce the inner working principle of a convolutional layer.
  • the convolution layer 221 may include many convolution operators.
  • the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
• The convolution operator can essentially be a weight matrix, and this weight matrix is usually defined in advance. In the process of convolving an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to extract a specific feature from the image.
• The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the depth dimension can be understood as being determined by the "multiple" described above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to remove unwanted noise in the image.
• The multiple weight matrices have the same size (rows × columns), so the feature maps extracted by these weight matrices of the same size also have the same size, and the multiple extracted feature maps of the same size are then combined to form the output of the convolution operation.
• In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training. Each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.
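• To make the above concrete, the following is a minimal sketch (in Python with NumPy, not part of the patent) of how several weight matrices of the same size slide over an input image with a given stride and how their single-depth outputs are stacked to form the depth dimension of the feature map; the image size, kernel count, and random values are purely illustrative.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide one weight matrix over an (H, W, C) input; at every position the
    element-wise products over rows, columns and the full input depth are
    summed, producing an output with a single depth dimension."""
    kh, kw, _ = kernel.shape
    h, w, _ = image.shape
    out = np.zeros(((h - kh) // stride + 1, (w - kw) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw, :]
            out[i, j] = np.sum(patch * kernel)
    return out

def conv2d(image, kernels, stride=1):
    """Apply several weight matrices of the same size and stack their outputs
    to form the depth dimension of the convolved feature map."""
    return np.stack([conv2d_single(image, k, stride) for k in kernels], axis=-1)

image = np.random.rand(32, 32, 3)                        # made-up 32 x 32 RGB input
kernels = [np.random.randn(3, 3, 3) for _ in range(8)]   # 8 "trained" weight matrices
print(conv2d(image, kernels).shape)   # (30, 30, 8): depth = number of weight matrices
```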
• When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layer (for example, 221) tends to extract general, low-level features; as the depth of the convolutional neural network 200 increases, the features extracted by subsequent convolutional layers become more and more complex, such as high-level semantic features.
• The layers 221-226 shown in 220 in FIG. 2 may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
• The sole purpose of the pooling layer is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and / or a maximum pooling operator for sampling the input image to obtain a smaller-sized image.
  • the average pooling operator can calculate the pixel values in the image within a specific range to produce an average value as the result of the average pooling.
  • the maximum pooling operator can take the pixel with the largest value in the range in a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image processed by the pooling layer may be smaller than the size of the image of the input pooling layer.
  • Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding subregion of the image of the input pooling layer.
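• As a small illustration of the pooling operators described above (a sketch with assumed NumPy inputs, not part of the patent), each output pixel below is the maximum or the average of the corresponding sub-region of the input feature map, so the spatial size of the output is smaller than that of the input:

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Each output pixel is the maximum or the average of the corresponding
    sub-region of the input feature map, so the output is spatially smaller."""
    h, w = feature_map.shape
    out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

fmap = np.random.rand(30, 30)
print(pool2d(fmap, mode="max").shape)   # (15, 15)
print(pool2d(fmap, mode="avg").shape)   # (15, 15)
```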
• After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network 200 needs to use the neural network layer 230 to generate the output of one or a group of required classes. Therefore, the neural network layer 230 may include multiple hidden layers (231, 232 to 23n shown in FIG. 2) and an output layer 240. The parameters contained in the multiple hidden layers may be obtained by pre-training based on relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
• After the multiple hidden layers in the neural network layer 230, that is, as the last layer of the entire convolutional neural network 200, there is an output layer 240, which has a loss function similar to the classification cross-entropy and is specifically used to calculate the prediction error.
• Once the forward propagation of the entire convolutional neural network 200 (the propagation from 210 to 240 in FIG. 2) is completed, back propagation (the propagation from 240 to 210 in FIG. 2) starts to update the weight values and biases of the layers mentioned above, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
  • the convolutional neural network 200 shown in FIG. 2 is only an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models.
  • FIG. 3 is a chip hardware structure according to an embodiment of the present invention.
  • the chip includes a neural network processor 30.
  • the chip can be set in the execution device 110 shown in FIG. 1 to complete the calculation work of the calculation module 111.
  • the chip can also be set in the training device 120 shown in FIG. 1 to complete the training work of the training device 120 and output the target model / rule 101.
  • the algorithms of each layer in the convolutional neural network shown in FIG. 2 can be implemented in the chip shown in FIG. 3.
  • the neural network processor 30 may be an NPU, a TPU, or a GPU and other processors suitable for large-scale XOR processing. Take the NPU as an example: The NPU can be mounted as a coprocessor to the host CPU, and the main CPU assigns tasks to it. The core part of the NPU is an arithmetic circuit 303. The controller 304 controls the arithmetic circuit 303 to extract matrix data in the memories (301 and 302) and perform multiplication and addition operations.
  • the computing circuit 303 includes multiple processing units (Process Engines, PEs).
• The arithmetic circuit 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • the arithmetic circuit 303 is a general-purpose matrix processor.
  • the arithmetic circuit 303 takes the weight data of the matrix B from the weight memory 302, and buffers the data on each PE in the arithmetic circuit 303.
  • the arithmetic circuit 303 takes the input data of the matrix A from the input memory 301, performs a matrix operation based on the input data of the matrix A and the weight data of the matrix B, and the partial result or the final result of the obtained matrix is stored in an accumulator 308 .
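• The multiply-and-accumulate flow described for the arithmetic circuit 303 and the accumulator 308 can be sketched as follows (a NumPy toy model of the data flow only, not of the actual hardware; the tile size and matrix shapes are assumptions):

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Multiply A (m x k) by B (k x n) by streaming the k dimension in tiles;
    each partial product is added to an accumulator, mirroring how partial
    results of the matrix operation are collected in the accumulator 308."""
    m, k = A.shape
    _, n = B.shape
    acc = np.zeros((m, n))                     # stands in for the accumulator
    for start in range(0, k, tile):
        a_tile = A[:, start:start + tile]      # input data of matrix A
        b_tile = B[start:start + tile, :]      # cached weight data of matrix B
        acc += a_tile @ b_tile                 # multiply-and-add, partial result
    return acc

A = np.random.rand(8, 16)
B = np.random.rand(16, 8)
print(np.allclose(tiled_matmul(A, B), A @ B))  # True
```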
  • the unified memory 306 is used to store input data and output data.
  • the weight data is directly transferred to a weight memory 302 through a storage unit access controller (DMAC, Direct Memory Access Controller) 305.
  • the input data is also transferred to the unified memory 306 through the DMAC.
• A bus interface unit (BIU) 310 is used for the interaction between the DMAC and the instruction fetch buffer 309; the bus interface unit 310 is also used for the instruction fetch memory 309 to obtain instructions from an external memory, and for the storage unit access controller 305 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 306, or to transfer weight data to the weight memory 302, or to transfer input data to the input memory 301.
• The vector calculation unit 307 includes a plurality of operation processing units and, if necessary, further processes the output of the arithmetic circuit 303, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, or magnitude comparison.
• The vector calculation unit 307 is mainly used for the calculation of network layers other than the convolutional and fully connected (FC) layers in a neural network; specifically, it can process calculations such as pooling and normalization.
  • the vector calculation unit 307 may apply a non-linear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 307 generates a normalized value, a merged value, or both.
  • the vector calculation unit 307 stores the processed vectors to the unified memory 306.
  • the vector processed by the vector calculation unit 307 can be used as an activation input of the arithmetic circuit 303, for example, for use in subsequent layers in the neural network, as shown in FIG. 2, if the current processing layer is a hidden layer 1 (231), the vector processed by the vector calculation unit 307 can also be used for calculation in the hidden layer 2 (232).
  • An instruction fetch memory 309 connected to the controller 304 is used to store instructions used by the controller 304;
  • the unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch memory 309 are all On-Chip memories.
  • the external memory is independent of the NPU hardware architecture.
  • each layer in the convolutional neural network shown in FIG. 2 may be performed by an operation circuit 303 or a vector calculation unit 307.
  • FIG. 4 is a training method 400 of a generative adversarial network according to Embodiment 1 of the present invention.
• The generative adversarial network includes a face generation network and a plurality of coupled adversarial discrimination networks, and the coupled adversarial discrimination networks include at least a first discrimination network and a second discrimination network. The method 400 includes:
  • S401 Receive a face image and a face rotation image; the face image and the face rotation image are images before and after the same face rotation;
  • S402. Perform posture encoding on the face image according to two or more key points in the face image to obtain a posture encoding map of the face image.
• The face generation network, the first discrimination network, and the second discrimination network updated at the current time in S407 are used to perform the next round of S404 to S406, and the iteration proceeds in this way until the total loss of the generative adversarial network converges, at which point the training ends.
  • the training of the generative adversarial network outputs a trained face generation network.
• In the training method of the generative adversarial network provided in this embodiment, pose encoding is performed on a face image and a face rotation image to obtain a pose encoding map of the face image and a pose encoding map of the face rotation image, and a predicted face rotation image is generated through the face generation network in the generative adversarial network; further, the predicted face rotation image is discriminated through at least a first discrimination network and a second discrimination network to obtain a first loss and a second loss respectively, the first loss and the second loss are weighted and summed to obtain the total loss of the generative adversarial network, and the face generation network, the first discrimination network, and the second discrimination network in the generative adversarial network are updated according to the total loss.
• Because the pose encoding map describes the face pose more accurately and robustly, the predicted face rotation image obtained by the face generation network and the discrimination networks through the above pose encoding maps is also closer to the real face rotation image.
• In addition, since the rotation angle between the face image and the face rotation image used as training data is not limited, the network obtained by the training can be adapted to face rotation at various angles, thereby improving the operability and user experience of face rotation.
• Moreover, multiple discrimination networks discriminate the predicted face rotation image based on different discrimination conditions, and the discrimination results obtained by the different discrimination networks all influence the generative adversarial network, so that the generative adversarial network can adjust and grasp different aspects of the face image according to the different conditions described above, thereby outputting a more accurate face rotation image.
  • the "coupling" is reflected in that the losses obtained by the multiple discrimination networks collectively affect the total loss, such as the method 400 provided in the embodiment of the present application.
  • the first loss and the second loss are weighted and summed to obtain the total loss, thereby reflecting the coupling relationship between the first discrimination network and the second discrimination network.
• The "coupling" here may also be called "coordination" or "jointness", and its essential meaning is that the losses obtained by multiple discrimination networks are combined in a certain relationship and collectively affect the total loss. The "confrontation" is reflected in the fact that the multiple discrimination networks and the face generation network are in an adversarial relationship; this adversarial relationship, that is, the "game" between generation and discrimination, is described in detail in point (7) of the concept introduction above and is not repeated here.
• The face image and the face rotation image in Embodiment 1 are essentially training data used for training the generative adversarial network, and they are a pair of images of the same face before and after rotation.
• The method and device provided in the embodiments of the present application do not require that the face rotation be a rotation from a frontal face to a profile face, or from a profile face to a frontal face, so the training data pair is not necessarily required to consist of one frontal face image and one profile face image. It should be understood that there is a certain rotation angle between the face before rotation and the face after rotation, and this angle can be preset.
• The rotation angle (also referred to as the face rotation angle in this document) may be defined, for example, as follows: a frontal face is 0 degrees, rotation toward the right profile is a positive angle, and rotation toward the left profile is a negative angle; alternatively, clockwise rotation may be defined as a positive angle and counterclockwise rotation as a negative angle.
  • the method 400 may be specifically performed by the training device 120 shown in FIG. 1.
• The face image and the face rotation image in the method 400 may be training data maintained in the database 130 shown in FIG. 1. Optionally, S402 and S403 of the method 400 may be executed in the training device 120, or may be executed in advance by another functional module before the training device 120; that is, the training data received or obtained from the database 130 is first preprocessed, for example through the pose encoding processing described in S402 and S403, to obtain the pose encoding map of the face image and the pose encoding map of the face rotation image as inputs of the training device 120, and the training device 120 then executes S404 to S408.
  • the method 400 may be processed by a CPU, or may be jointly processed by a CPU and a GPU, or a processor suitable for neural network computing may be used without the GPU, which is not limited in this application.
  • the training device 120 may be specifically configured to train the generative adversarial network provided by the embodiment of the present application.
• The generative adversarial network provided by the embodiment of the present application includes a face generation network and multiple coupled adversarial discrimination networks. It should be noted that, although only two discrimination networks, the first discrimination network and the second discrimination network, are given as an example, the embodiment of the present application does not limit the specific number of discrimination networks; there may be three, four, or even more discrimination networks. These different discrimination networks can discriminate the prediction images generated by the face generation network based on different discrimination conditions, so that the generative adversarial network can adjust and grasp different aspects of the face image according to these different conditions and thus output a more accurate face rotation image.
  • the method 400 may further include:
• A real image loss is obtained according to the face rotation image and the predicted face rotation image, and the real image loss includes at least one of a pixel loss, a total variation loss, and an identity feature loss. Correspondingly, the total loss of the generative adversarial network is obtained according to a weighted sum of at least one of the real image losses with the first loss and the second loss.
• Specifically, when the real image loss includes the pixel loss, the total loss is the result of a weighted sum of the pixel loss, the first loss, and the second loss; when the real image loss includes the total variation loss, the total loss is the result of a weighted sum of the total variation loss, the first loss, and the second loss; when the real image loss includes the identity feature loss, the total loss is the result of a weighted sum of the identity feature loss, the first loss, and the second loss; and when the real image loss includes the pixel loss, the total variation loss, and the identity feature loss, the total loss is the result of a weighted sum of these three losses with the first loss and the second loss. When the real image loss includes any two of the pixel loss, the total variation loss, and the identity feature loss, the total loss is likewise obtained by weighting and summing those two losses with the first loss and the second loss; details are not repeated here.
• In this implementation, not only the first loss and the second loss are considered, but also the real image loss, such as the pixel loss, the total variation loss, and the identity feature loss described above. By introducing the pixel loss, the stability of training is further improved and the convergence of the training of the generative adversarial network can be accelerated. The introduction of the total variation loss, in addition to enriching the training conditions of the generative adversarial network, prevents the generated predicted face image from producing excessively large local gradients and hence local defects. Because the identity recognition feature is used to ensure that the identity information of the face contained in the generated predicted face rotation image is consistent with the identity information of the face contained in the input training data (the face image and the face rotation image), the introduction of the identity feature loss enables the trained generative adversarial network to generate a rotated image with more accurate identity information. When the real image loss includes two or three of the above three losses, the corresponding advantages are obtained in combination.
  • the S402 specifically includes:
• A key point detection algorithm is used to detect the face image to obtain position coordinates corresponding to the N key points of the face image, and N first images having the same size as the face image are constructed, where the N first images correspond one-to-one to the N key points; Gaussian blur processing is performed on the first image corresponding to each key point, with that key point as the center, to obtain N first Gaussian blur maps, where the N first Gaussian blur maps are the pose encoding map of the face image, and N is a positive integer greater than 1.
• Constructing the N first images having the same size as the face image includes: constructing N all-zero matrices, each all-zero matrix corresponding to one key point; mapping the position of each key point in the face image to the corresponding position in its all-zero matrix, and changing the value at that corresponding position in the all-zero matrix from 0 to 1, thereby generating N one-hot code matrices; the N one-hot code matrices are the N first images.
• The one-hot code matrix described in this application document refers to a matrix in which only one element has the value 1 and all remaining elements are 0. This is not described in detail again below.
  • performing Gaussian blur processing on the first image corresponding to each of the key points centered on each of the N key points includes:
  • Gaussian blur processing is performed on the one-hot code matrix with a point having a value of 1 in each one-hot code matrix as a center.
• In this way, the predicted face rotation image generated by the face generation network can be closer to the real face rotation image.
• It should be noted that the size of the N all-zero matrices (that is, the number of rows and columns) is the same as the size of the face image. Because the N one-hot code matrices are the N first images, the value 1 in each one-hot code matrix corresponds to the position of one key point in the face image. For example, when N is equal to 5, suppose the 5 key points correspond to five key positions of the face, such as the center of the left eyeball, the center of the right eyeball, the tip of the nose, the left mouth corner, and the right mouth corner. Then the value at the position of the nose tip in the one-hot code matrix corresponding to the nose tip is 1, and the remaining positions are still 0. Taking the left mouth corner as another example, if the left mouth corner is at the coordinate position (x, y) of the face image, the one-hot code matrix corresponding to the left mouth corner has the value 1 at its (x, y) coordinate position, and the values of the remaining positions are still 0.
• The size here can be understood as rows × columns.
  • the face image and the face rotation image described in S401 have the same size.
  • the matrix here can also be a tensor, and the tensor can be understood as a matrix with depth.
• A conventional matrix is X × Y, where X is the number of rows of the matrix and Y is the number of columns, while a tensor is X × Y × Z, where Z is the depth of the matrix.
  • the face image, the face rotation image, the pose encoding map of the face image, and the pose encoding map of the face rotating image may all have the same size, or be called a homogeneous matrix.
  • the generated predicted face rotation image may also have the same size as the above-mentioned image or pose-coded image. Since the image size is not changed during the face rotation, it can be understood that the input and output image data of the generative adversarial network have the same size.
  • the key point detection algorithm is used to detect the face image to obtain position coordinates corresponding to the N key points of the face image.
• As exemplified above, the number N of key points may be 5; of course, it may also be 10 or another larger or smaller number, which is not limited in this solution. The specific value of N may depend on the key point detection algorithm, that is, the number of key points can be designed in advance according to requirements, and details are not repeated here.
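• The pose encoding procedure of S402 can be sketched as follows (Python with NumPy and SciPy; the key-point coordinates, image size, and Gaussian sigma are illustrative assumptions, and the real key points would come from a key point detection algorithm):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pose_encode(keypoints, height, width, sigma=3.0):
    """For each key point, build an all-zero matrix of the same size as the
    image, set the key-point position to 1 (a one-hot code matrix), and apply
    Gaussian blur centred on that point; the N blurred maps, stacked along
    the last axis, form the pose encoding map of the image."""
    maps = []
    for (x, y) in keypoints:                    # (x, y) position in the image
        one_hot = np.zeros((height, width))
        one_hot[int(y), int(x)] = 1.0           # value at the key point becomes 1
        maps.append(gaussian_filter(one_hot, sigma=sigma))
    return np.stack(maps, axis=-1)              # H x W x N pose encoding map

# hypothetical 5 key points on a 128 x 128 face image: left eyeball centre,
# right eyeball centre, nose tip, left mouth corner, right mouth corner
keypoints = [(44, 52), (84, 52), (64, 74), (50, 96), (78, 96)]
pose_map = pose_encode(keypoints, 128, 128)
print(pose_map.shape)                           # (128, 128, 5)
```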
  • the S403 specifically includes:
• A key point detection algorithm is used to detect the face rotation image to obtain position coordinates corresponding to the M key points of the face rotation image, and M second images having the same size as the face rotation image are constructed; the M second images correspond one-to-one to the M key points, and Gaussian blur processing is performed on the second image corresponding to each key point, with that key point as the center, to obtain M second Gaussian blur maps, where the M second Gaussian blur maps are the pose encoding map of the face rotation image, and M is a positive integer greater than 1.
• The M second images having the same size as the face rotation image are constructed in the same way as the N first images having the same size as the face image described above, and details are not repeated here.
  • the S405 specifically includes:
• The face image is used as the discrimination condition of the first discrimination network, the authenticity of the face rotation image and the predicted face rotation image is determined according to the first discrimination network, and the first loss is generated according to the determination result. The first discrimination network includes a binary classification discriminator, and the binary classification discriminator is used to judge whether its input is true or false. Generating the first loss according to the determination result includes performing the following calculation:
• L_{ii} = \mathbb{E}_{I_b \sim H(I_b)}\left[\log D_{ii}(I_a, I_b)\right] + \mathbb{E}\left[\log\left(1 - D_{ii}(I_a, \hat{I}_b)\right)\right]
• where L_{ii} is the first loss, I_a is the face image, I_b is the face rotation image, \hat{I}_b is the predicted face rotation image, \mathbb{E}_{I_b \sim H(I_b)} denotes the expectation over the distribution H(I_b) of the real face rotation image I_b, D_{ii}(I_a, I_b) is the probability that the face rotation image I_b is judged to be true, D_{ii}(I_a, \hat{I}_b) is the probability that the predicted face rotation image \hat{I}_b is judged to be true by the first discrimination network conditioned on the face image, \theta_{ii} is a parameter of the first discrimination network, and (I_a, I_b) or (I_a, \hat{I}_b) is the input of the first discrimination network.
• The S406 specifically includes: using the pose encoding map of the face rotation image as the discrimination condition of the second discrimination network, judging the authenticity of the face rotation image and the predicted face rotation image according to the second discrimination network, and generating the second loss according to the discrimination result; the second discrimination network includes a binary classification discriminator, and the binary classification discriminator is used to judge whether its input is true or false. Generating the second loss according to the discrimination result includes performing the following calculation:
• L_{ip} = \mathbb{E}_{I_b \sim H(I_b)}\left[\log D_{ip}(P_b, I_b)\right] + \mathbb{E}\left[\log\left(1 - D_{ip}(P_b, \hat{I}_b)\right)\right]
• where L_{ip} is the second loss, I_b is the face rotation image, \hat{I}_b is the predicted face rotation image, P_b is the pose encoding map of the face rotation image, D_{ip}(P_b, \hat{I}_b) is the probability that the predicted face rotation image \hat{I}_b is judged to be true by the second discrimination network conditioned on the pose encoding map of the face rotation image, \theta_{ip} is a parameter of the second discrimination network, and (P_b, I_b) or (P_b, \hat{I}_b) is the input of the second discrimination network.
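• Assuming the reconstructed forms of the first and second losses above, the following sketch (NumPy, not part of the patent) shows how the two losses would be computed from the probabilities output by the conditional binary classification discriminators for a small batch; the probability values are made up for illustration:

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-8):
    """Conditional adversarial loss of the form
    E[log D(condition, real)] + E[log(1 - D(condition, predicted))],
    computed from the discriminator's output probabilities."""
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# made-up discriminator probabilities for a mini-batch of 4 samples
d_ii_real = np.array([0.90, 0.80, 0.95, 0.85])   # D_ii(I_a, I_b)
d_ii_fake = np.array([0.20, 0.30, 0.10, 0.25])   # D_ii(I_a, predicted I_b)
d_ip_real = np.array([0.92, 0.88, 0.90, 0.86])   # D_ip(P_b, I_b)
d_ip_fake = np.array([0.15, 0.20, 0.30, 0.10])   # D_ip(P_b, predicted I_b)

L_ii = adversarial_loss(d_ii_real, d_ii_fake)    # first loss (condition: face image)
L_ip = adversarial_loss(d_ip_real, d_ip_fake)    # second loss (condition: pose map)
print(L_ii, L_ip)
```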
  • the S406a may specifically include performing the following calculations:
• L_{pix} = \frac{1}{S}\sum_{s=1}^{S}\frac{1}{W_s H_s C}\left\lVert \hat{I}_b^{\,s} - I_b^{\,s} \right\rVert_{1}
• where L_{pix} is the pixel loss, S is the number of scales, \hat{I}_b^{\,s} is the predicted face rotation image at scale s, and I_b^{\,s} is the face rotation image at scale s.
  • the S406a may specifically include performing the following calculations:
• L_{tv} = \frac{1}{W H C}\sum_{c=1}^{C}\sum_{w=1}^{W}\sum_{h=1}^{H}\left(\left|\hat{I}_b^{\,w+1,h,c} - \hat{I}_b^{\,w,h,c}\right| + \left|\hat{I}_b^{\,w,h+1,c} - \hat{I}_b^{\,w,h,c}\right|\right)
• where L_{tv} is the total variation loss of the predicted face rotation image \hat{I}_b, W denotes the width of the predicted face rotation image, H denotes its height, and C denotes its number of channels.
  • the S406a may specifically include performing the following calculations:
  • the identity recognition feature is used to ensure that the identity information between the predicted face rotation image and the face image remains unchanged
• L_{ip} = \left\lVert f_{pool}(\hat{I}_b) - f_{pool}(I_a) \right\rVert_2^2 + \left\lVert f_{fc}(\hat{I}_b) - f_{fc}(I_a) \right\rVert_2^2
• where L_{ip} denotes the identity recognition feature loss, f is a pre-trained face recognition model, the face recognition model f is a deep neural network including at least one pooling layer and at least one fully connected layer, f_{pool}(\cdot) denotes the output of the last pooling layer of the face recognition model f, and f_{fc}(\cdot) denotes the output of the last fully connected layer of the face recognition model f.
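• The three real image losses can be sketched as follows (NumPy; the use of an L1 pixel distance, a single scale, and a squared L2 distance for the identity features are assumptions, and the identity features would in practice come from the pre-trained face recognition model f):

```python
import numpy as np

def pixel_loss(pred, real):
    """Mean absolute (L1) difference between the predicted and the real
    face rotation image (a single-scale version of the pixel loss)."""
    return np.mean(np.abs(pred - real))

def total_variation_loss(pred):
    """Average absolute difference between neighbouring pixels of the
    predicted image (H x W x C), penalising large local gradients."""
    dh = np.abs(pred[1:, :, :] - pred[:-1, :, :]).mean()
    dw = np.abs(pred[:, 1:, :] - pred[:, :-1, :]).mean()
    return dh + dw

def identity_loss(pool_pred, pool_in, fc_pred, fc_in):
    """Squared L2 distance between identity features of the predicted image
    and the input face image, taken from the last pooling layer and the last
    fully connected layer of a pre-trained face recognition model."""
    return np.sum((pool_pred - pool_in) ** 2) + np.sum((fc_pred - fc_in) ** 2)

pred = np.random.rand(128, 128, 3)   # predicted face rotation image (made up)
real = np.random.rand(128, 128, 3)   # real face rotation image (made up)
print(pixel_loss(pred, real), total_variation_loss(pred))

# identity features would come from the recognition model f; random here
print(identity_loss(np.random.rand(256), np.random.rand(256),
                    np.random.rand(128), np.random.rand(128)))
```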
  • the S407 may specifically include:
  • the discrimination network here includes the first discrimination network and the second discrimination network.
• The purpose of updating the discrimination networks is to make them recognize, as far as possible, the predicted face rotation image generated by the face generation network, that is, to correctly discriminate whether an input is true or false. Updating the face generation network to minimize its error means making the predicted face rotation image generated by the face generation network be recognized as true by the discrimination networks as much as possible, or recognized as false as little as possible.
• The first discrimination network and the second discrimination network are updated here so as to maximize the values of the first loss and the second loss, and the specific implementation is as follows: the first discrimination network is updated so that the value of the first loss is maximized, the purpose being to make the first discrimination network recognize, as far as possible, the difference between the predicted image and the real image.
  • the predicted image mentioned above is the predicted face rotation image
  • the real image mentioned here is the received face rotation image.
• In other words, the face generation network and the discrimination networks are in an adversarial relationship, or a "game" process: the face generation network strives to generate predicted images that are difficult to distinguish from real images, while the discrimination networks strive to identify the difference between the predicted image and the real image. This dynamic "game" is reflected in the parameter updates, which continue until the updated parameters bring the two sides into dynamic balance, that is, the overall optimal state is reached; at that point the updating stops, the training of the generative adversarial network is stopped, and the trained face generation network is output.
  • the first discrimination network uses the face image as a discrimination condition
  • the second discrimination network uses the posture coded image of the face rotation image as a discrimination condition.
• The final discrimination results obtained by the two are the first loss and the second loss respectively; the first loss and the second loss are weighted and summed, and the weighted sum result is used as the total loss of the generative adversarial network. The total loss is used to update the entire generative adversarial network (including the face generation network, the first discrimination network, and the second discrimination network), and the above steps are iterated until the entire generative adversarial network reaches dynamic equilibrium, that is, the global optimum; then the updating stops and the trained face generation network is output.
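• The alternating update described above can be sketched as follows (PyTorch, with tiny linear stand-ins for the face generation network and the two discrimination networks, made-up tensors for the face image, pose encoding map, and real rotation image, and hypothetical weighting coefficients; the real image losses are omitted for brevity):

```python
import torch
import torch.nn as nn

# Tiny linear stand-ins for the face generation network and the two coupled
# discrimination networks; the real networks are convolutional, these
# placeholders only illustrate the alternating update and the weighted sum.
G = nn.Sequential(nn.Linear(16, 16))                   # face generation network
D_ii = nn.Sequential(nn.Linear(32, 1), nn.Sigmoid())   # condition: face image
D_ip = nn.Sequential(nn.Linear(32, 1), nn.Sigmoid())   # condition: pose encoding map

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(list(D_ii.parameters()) + list(D_ip.parameters()), lr=1e-4)
bce = nn.BCELoss()
w_ii, w_ip = 0.5, 0.5          # hypothetical weighting coefficients of the two losses

# made-up stand-ins for the face image, its pose map, and the real rotation image
face, pose, real_rot = torch.rand(8, 16), torch.rand(8, 16), torch.rand(8, 16)

for step in range(100):
    fake_rot = G(face)         # predicted face rotation image

    # 1) update the discrimination networks: drive them to judge the real
    #    rotation image as true and the predicted rotation image as false
    d_loss = (bce(D_ii(torch.cat([face, real_rot], 1)), torch.ones(8, 1)) +
              bce(D_ii(torch.cat([face, fake_rot.detach()], 1)), torch.zeros(8, 1)) +
              bce(D_ip(torch.cat([pose, real_rot], 1)), torch.ones(8, 1)) +
              bce(D_ip(torch.cat([pose, fake_rot.detach()], 1)), torch.zeros(8, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) update the face generation network: drive both discrimination
    #    networks to accept the predicted rotation image as true
    g_loss = (w_ii * bce(D_ii(torch.cat([face, fake_rot], 1)), torch.ones(8, 1)) +
              w_ip * bce(D_ip(torch.cat([pose, fake_rot], 1)), torch.ones(8, 1)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```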
  • the face generation network thus trained can have a very good grasp of both the apparent authenticity of the face and the information about the face posture.
• Because the first discrimination network uses the face image as its discrimination condition, it can be understood that the apparent authenticity of the face is grasped by the first discrimination network; because the second discrimination network uses the pose encoding map of the face rotation image as its discrimination condition, it can be understood that the face pose is grasped by the second discrimination network.
  • FIG. 5 is a method 500 for generating a face rotation image provided by Embodiment 2 of the present invention, including:
• Multiple first training pictures are obtained from a training data set according to a face rotation angle, where the multiple first training pictures each include a human face, and the rotation angles of the faces presented in the multiple first training pictures are all the face rotation angle;
• S505. Generate a signal to be input according to the face image, the pose encoding map of the face image, and the pose encoding map of the target face image, wherein the size of the face image, the size of the pose encoding map of the face image, and the size of the pose encoding map of the target face image are the same;
  • S506 Input the to-be-input signal into a face rotation image generation model to obtain a face rotation image.
• In the method for generating a face rotation image provided in this embodiment, pose encoding is performed on a face image and a target face image to obtain the input signal of a face rotation image generation model, and a face rotation image is then generated by the face rotation image generation model. Since the pose encoding method describes the face pose more accurately and robustly, the generated face rotation image is also more accurate.
• In addition, the rotation angles of the faces presented in the multiple first training pictures are the same, and this rotation angle may be preset, for example by the user; that is, the rotation angles of the faces presented in the multiple first training pictures are all the preset angle.
  • the method for generating a face rotation image provided in the embodiment of the present application has no limitation on the angle of face rotation, that is, face rotation at various angles can be realized.
  • the method 500 may be specifically executed by the execution device 110 shown in FIG. 1, and the face image in the method 500 may be input data given by the client device 140 shown in FIG. 1.
• The preprocessing module 113 can be used to execute the pose encoding processing described in S502 and S504 of the method 500, and the preprocessing module 114 in the execution device 110 can be used to execute S503 of the method 500.
  • the pre-processing module 113 may also be used to execute the S505, and the calculation module 111 in the execution device 110 may be used to execute the S506.
• The execution device 110 may be specifically configured to execute the method for generating a face rotation image provided by the embodiment of the present application.
  • the plurality of first training pictures are obtained from the training data set according to the rotation angle of the face, and the multiple first training pictures all include human faces.
• The faces included in the multiple first training pictures and the face in the face image are not required to be the same face. Since the face image may be a face to be rotated that is input by the user in real time, while the multiple first training pictures come from a training data set maintained in the database, the faces included in the multiple first training pictures and the face included in the face image may be considered not directly related; of course, the face included in the face image may also appear in the database and be used as training data.
  • the method 500 may be processed by a CPU, or may be jointly processed by a CPU and a GPU, or a processor suitable for neural network computing may be used without the GPU, which is not limited in this application.
• The generating of the signal to be input according to the face image, the pose encoding map of the face image, and the pose encoding map of the target face image may specifically be performed by a feature fusion method, which fuses the face image, the pose encoding map of the face image, and the pose encoding map of the target face image to obtain the to-be-input signal.
• Feature fusion means organically combining features that are discriminative and complementary into a unified feature in some way.
  • Feature fusion is a commonly used technology in the field of biometrics, and features can be integrated in a variety of ways. The information contained in the fused features is more accurate and richer.
• Therefore, the information contained in the to-be-input signal is more accurate and richer than that of any single one of the face image, the pose encoding map of the face image, and the pose encoding map of the target face image, and using the to-be-input signal to generate a face rotation image can therefore make the generated face rotation image more accurate. A more accurate face rotation image can in turn be used to improve the accuracy of face recognition.
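• One simple realization of such feature fusion (an assumption, since the embodiment does not fix a particular fusion operator) is to concatenate the three same-sized inputs along the channel axis, as sketched below in NumPy:

```python
import numpy as np

face_image = np.random.rand(128, 128, 3)    # face image (RGB, made up)
pose_face = np.random.rand(128, 128, 5)     # pose encoding map of the face image
pose_target = np.random.rand(128, 128, 5)   # pose encoding map of the target face image

# Because the three inputs have the same spatial size, they can be fused by
# concatenating them along the channel axis into one richer feature map.
signal_to_input = np.concatenate([face_image, pose_face, pose_target], axis=-1)
print(signal_to_input.shape)                # (128, 128, 13)
```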
  • the S502 may specifically include:
• A key point detection algorithm is used to detect the face image to obtain position coordinates corresponding to N key points of the face image, N first images having the same size as the face image are constructed, and Gaussian blur processing is performed on the first image corresponding to each of the N key points, with that key point as the center, to obtain N first Gaussian blur maps; the N first Gaussian blur maps are the pose encoding map of the face image.
• Constructing the N first images having the same size as the face image includes: constructing N all-zero matrices, each all-zero matrix corresponding to one key point; mapping the position of each key point in the face image to the corresponding position in its all-zero matrix, and changing the value at that corresponding position in the all-zero matrix from 0 to 1, thereby generating N one-hot code matrices; the N one-hot code matrices are the N first images.
  • performing Gaussian blur processing on the first image corresponding to each of the key points with each of the N key points as the center includes:
  • Gaussian blur processing is performed on the one-hot code matrix with a point having a value of 1 in each one-hot code matrix as a center.
  • the S504 specifically includes:
• A key point detection algorithm is used to detect the target face image to obtain position coordinates corresponding to M key points of the target face image, M second images having the same size as the target face image are constructed, and Gaussian blur processing is performed on the second image corresponding to each of the M key points, with that key point as the center, to obtain M second Gaussian blur maps; the M second Gaussian blur maps are the pose encoding map of the target face image.
  • the M second images having the same size as the target face image are constructed in the same manner as the N first images having the same size as the face image described above, and therefore are not described herein again.
  • a possible implementation manner is: the target face image is obtained according to the multiple first training pictures, including:
  • the target face image is obtained according to an average value of the pose encoding pictures of the plurality of first training pictures.
• The pose encoding maps of the plurality of first training pictures can be obtained by using the same pose encoding method as in S502 and S504; that is, for each first training picture, a key point detection algorithm is first used to detect the first training picture and obtain the position coordinates corresponding to the N key points of the face, then N one-hot code matrices corresponding one-to-one to the N key points are generated based on those position coordinates, and Gaussian blur is performed around the point with the value 1 in each one-hot code matrix to obtain N Gaussian blur maps.
• The specific averaging method may be to add the pixel values at corresponding positions of all the Gaussian blur maps and then take the average.
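• The averaging of the pose encoding maps of the first training pictures can be sketched as follows (NumPy; the number of training pictures, the image size, and the number of key points are illustrative):

```python
import numpy as np

# pose encoding maps of the multiple first training pictures, each H x W x N
pose_maps = [np.random.rand(128, 128, 5) for _ in range(10)]

# Add the pixel values at corresponding positions of all Gaussian blur maps
# and divide by the number of pictures to obtain the averaged pose encoding.
average_pose_map = np.mean(np.stack(pose_maps, axis=0), axis=0)
print(average_pose_map.shape)   # (128, 128, 5)
```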
• A possible implementation manner is: the face rotation image generation model is obtained by training a generative adversarial network, where the generative adversarial network includes at least one face generation network and at least two discrimination networks, the at least two discrimination networks form a coupled adversarial relationship for generating an adversarial loss, the adversarial loss is used to update the at least one face generation network and the at least two discrimination networks, and the updated at least one face generation network serves as the face rotation image generation model.
  • the face rotation image generation model here may be a face generation network obtained by training in the first embodiment.
• Embodiment 1 can be understood as the training phase of the face generation network (the phase performed by the training device 120 shown in FIG. 1), in which specific training is performed using the generative adversarial network provided in Embodiment 1 and any possible implementation based on Embodiment 1; Embodiment 2 can be understood as the application phase of the face generation network (the phase performed by the execution device 110 shown in FIG. 1), in which the face generation network trained in Embodiment 1 is used to generate a face rotation image according to the face image to be rotated input by the user, thereby obtaining the output image, that is, the face rotation image in Embodiment 2.
• Before that, the corresponding pre-processing is performed: the corresponding pose encoding maps are obtained through the pose encoding processing described in S502 and S504 of Embodiment 2, and feature fusion is performed, using the feature fusion method described above, on the pose encoding map of the face image output by S502, the pose encoding map of the target face image output by S504, and the face image received in S501, to obtain a richer feature map, namely the to-be-input signal, which integrates the features of S501, S502, and S504. As a result, the face rotation image obtained by the face rotation image generation model based on the to-be-input signal is of better quality, that is, closer to the real face rotation image.
• In the network training phase of Embodiment 1 and the network application phase of Embodiment 2, the expressions "face image" and "face rotation image" are both used for simplicity and intuitiveness of expression; however, because the two embodiments represent different stages, the face images in the two embodiments should not be understood as the same image. The face image in Embodiment 1 is training data and can be a real image, or it can be a virtual image obtained by interpolation.
  • the face image in the second embodiment is usually a real face image input by the user.
• Similarly, the face rotation image in Embodiment 1 is also a training image, whereas in Embodiment 2 the image generated by the face rotation image generation model should theoretically be as close as possible to the real face rotation image, although how close it is depends on the capability of the face rotation image generation method itself.
  • the face rotation image generation method and the training method of the generative adversarial network according to the embodiments of the present application are described in detail above with reference to FIGS. 1 to 5.
  • the following describes the face rotation image generation device and the training device of the generative adversarial network in the embodiments of the present application with reference to FIGS. 6 to 9.
• The apparatuses shown in FIG. 6 to FIG. 9 may specifically be monitoring devices, terminal devices, network servers, or network cloud platforms with image processing capabilities.
  • the apparatus shown in FIG. 6 to FIG. 9 can execute each step of the corresponding method according to the embodiment of the present application. For brevity, repeated description is appropriately omitted below.
  • FIG. 6 is a schematic block diagram of a training apparatus 600 for a generative adversarial network according to an embodiment of the present application.
  • the generative adversarial network includes a face generation network and a plurality of coupled adversarial discrimination networks.
  • the coupled adversarial discrimination network includes at least a first discrimination network and a second discrimination network.
  • the apparatus 600 includes:
  • the receiving unit 601 is configured to receive a face image and a face rotation image; the face image and the face rotation image are images before and after the same face rotation;
  • a pose encoding unit 602 configured to perform pose encoding on the face image according to two or more key points in the face image to obtain a pose encoding map of the face image;
• The pose encoding unit 602 is further configured to perform pose encoding on the face rotation image according to two or more key points in the face rotation image to obtain a pose encoding map of the face rotation image;
  • a face generating unit 604 is configured to input the face image, the attitude encoding map of the face image, and the attitude encoding map of the face rotation image into the face generation network to generate a predicted face rotation image. ;
  • a first determination unit 605, configured to input the face image, the face rotation image, and the predicted face rotation image into the first determination network to obtain a first loss;
  • a second determination unit 606, configured to input the face rotation image, the posture coded image of the face rotation image, and the predicted face rotation image into the second determination network to obtain a second loss
  • a back propagation unit 607 is configured to update the face generation network, the first discrimination network, and the second discrimination network according to the total loss of the generative confrontation network, and the total loss of the generative confrontation network is based on The weighted summation of the first loss and the second loss is obtained;
• The output unit 608 is configured to output the trained face generation network once the total loss of the generative adversarial network converges.
• The training device for the generative adversarial network provided in this embodiment performs pose encoding on a face image and a face rotation image to obtain a pose encoding map of the face image and a pose encoding map of the face rotation image, and generates a predicted face rotation image through the face generation network in the generative adversarial network; further, the predicted face rotation image is discriminated through at least a first discrimination network and a second discrimination network to obtain a first loss and a second loss respectively, the first loss and the second loss are weighted and summed to obtain the total loss of the generative adversarial network, and the face generation network, the first discrimination network, and the second discrimination network in the generative adversarial network are updated according to the total loss.
• Because the pose encoding map describes the face pose more accurately and robustly, the predicted face rotation image obtained by the face generation network and the discrimination networks through the above pose encoding maps is also closer to the real face rotation image.
• In addition, since the rotation angle between the face image and the face rotation image used as training data is not limited, the network obtained by the training can be adapted to face rotation at various angles, thereby improving the operability and user experience of face rotation.
• Moreover, multiple discrimination networks discriminate the predicted face rotation image based on different discrimination conditions, and the discrimination results obtained by the different discrimination networks all influence the generative adversarial network, so that the generative adversarial network can adjust and grasp different aspects of the face image according to the different conditions described above, thereby outputting a more accurate face rotation image.
  • FIG. 7 is a schematic block diagram of a face rotation image generating apparatus 700 according to an embodiment of the present application.
  • the apparatus 700 includes:
  • a pose encoding unit 702 configured to perform pose encoding on the face image according to two or more key points in the face image to obtain a pose encoding map of the face image;
• An obtaining unit 703 is configured to obtain multiple first training pictures from a training data set according to a face rotation angle, the multiple first training pictures each including a human face, where the rotation angles of the faces presented in the multiple first training pictures are all the face rotation angle;
  • the posture encoding unit 702 is further configured to perform posture encoding on the target facial image according to two or more key points in the target facial image to obtain a posture encoded map of the target facial image.
  • the target face image is obtained according to the plurality of first training pictures;
  • a signal generating unit 704 is configured to generate a signal to be input according to the face image, a pose coded map of the face image, and a pose coded map of the target face image, wherein the size of the face image, the The size of the pose encoding map of the face image is the same as the size of the pose encoding map of the target face image;
  • the image generating unit 705 is configured to input the signal to be input into a face rotation image generation model to obtain a face rotation image.
• The apparatus for generating a face rotation image provided in this embodiment performs pose encoding on a face image and a target face image to obtain the input signal of a face rotation image generation model, and then generates a face rotation image by using the face rotation image generation model. Since the pose encoding method describes the face pose more accurately and robustly, the generated face rotation image is also more accurate.
• In addition, the rotation angles of the faces presented in the multiple first training pictures are the same, and this rotation angle may be preset, for example by the user; that is, the rotation angles of the faces presented in the multiple first training pictures are all the preset angle.
  • the method for generating a face rotation image provided in the embodiment of the present application has no limitation on the angle of face rotation, that is, face rotation at various angles can be realized.
  • FIG. 8 is a schematic diagram of a hardware structure of a training device for a generative adversarial network according to an embodiment of the present application.
  • the training apparatus 800 (the apparatus 800 may be a computer device) of the generative adversarial network shown in FIG. 8 includes a memory 801, a processor 802, a communication interface 803, and a bus 804.
  • the memory 801, the processor 802, and the communication interface 803 implement a communication connection between each other through a bus 804.
  • the memory 801 may be a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 801 may store a program. When the program stored in the memory 801 is executed by the processor 802, the processor 802 and the communication interface 803 are configured to perform each step of the training method of the generative adversarial network in the embodiment of the present application.
• The processor 802 may be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute a related program to implement the functions required to be performed by the units in the training apparatus of the generative adversarial network in the embodiment of the present application, or to execute the training method of the generative adversarial network in the method embodiment of the present application.
  • the processor 802 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the training method of the generative adversarial network of the present application may be completed by an integrated logic circuit of hardware in the processor 802 or an instruction in the form of software.
• The above processor 802 may also be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • a software module may be located in a mature storage medium such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, or an electrically erasable programmable memory, a register, and the like.
• The storage medium is located in the memory 801, and the processor 802 reads the information in the memory 801 and, in combination with its hardware, completes the functions required to be performed by the units included in the training apparatus of the generative adversarial network in the embodiment of the present application, or performs the training method of the generative adversarial network in the method embodiment of the present application.
  • the communication interface 803 uses a transceiving device such as, but not limited to, a transceiver to implement communication between the device 800 and other devices or a communication network.
  • training data (such as a face image and a face rotation image described in Embodiment 1 of the present application) may be acquired through the communication interface 803.
  • the bus 804 may include a path for transmitting information between various components of the device 800 (for example, the memory 801, the processor 802, and the communication interface 803).
  • the receiving unit 601 in the training device 600 of the generative adversarial network is equivalent to the communication interface 803 in the training device 800 of the generative adversarial network
  • the pose encoding unit 602, the face generation unit 604, the first discrimination unit 605, the second discrimination unit 606, and the back propagation unit 607 may be equivalent to the processor 802.
  • FIG. 9 is a schematic diagram of a hardware structure of a device for generating a face rotation image according to an embodiment of the present application.
  • a device 900 for generating a face rotation image shown in FIG. 9 includes a memory 901, a processor 902, a communication interface 903, and a bus 904.
  • the memory 901, the processor 902, and the communication interface 903 implement a communication connection between each other through a bus 904.
  • the memory 901 may be a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 901 may store a program. When the program stored in the memory 901 is executed by the processor 902, the processor 902 and the communication interface 903 are configured to execute each step of the method for generating a face rotation image in the embodiment of the present application.
  • the processor 902 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute a related program to implement the functions to be performed by the units in the device for generating a face rotation image in the embodiments of the present application, or to perform the method for generating a face rotation image in the method embodiments of the present application.
  • the processor 902 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the method for generating a face rotation image of the present application may be completed by an integrated logic circuit of hardware in the processor 902 or an instruction in a form of software.
  • the processor 902 may alternatively be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in combination with the embodiments of the present application may be directly implemented by a hardware decoding processor, or may be performed by using a combination of hardware and software modules in the decoding processor.
  • a software module may be located in a mature storage medium such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, or an electrically erasable programmable memory, a register, and the like.
  • the storage medium is located in the memory 901; the processor 902 reads the information in the memory 901 and, in combination with its hardware, completes the functions to be performed by the units included in the device for generating a face rotation image in the embodiments of the present application, or performs the method for generating a face rotation image in the method embodiments of the present application.
  • the communication interface 903 uses a transceiving device such as but not limited to a transceiver to implement communication between the device 900 and other devices or a communication network.
  • training data (such as a face image described in Embodiment 2 of the present application) may be acquired through the communication interface 903.
  • the bus 904 may include a path for transmitting information between various components of the device 900 (for example, the memory 901, the processor 902, and the communication interface 903).
  • the receiving unit 701 and the obtaining unit 703 in the face rotation image generating device 700 are equivalent to the communication interface 903 in the face rotation image generating device 900; the pose encoding unit 702, the signal generating unit 704, and the image generating unit 705 in the face rotation image generating device 700 may correspond to the processor 902.
  • although the apparatuses 800 and 900 shown in FIG. 8 and FIG. 9 show only the memory, the processor, and the communication interface, in a specific implementation process those skilled in the art should understand that the apparatuses 800 and 900 further include other devices necessary for normal operation. Likewise, according to specific needs, those skilled in the art should understand that the apparatuses 800 and 900 may further include hardware devices implementing other additional functions. In addition, those skilled in the art should understand that the apparatuses 800 and 900 may include only the components necessary to implement the embodiments of the present application, and not necessarily all the components shown in FIG. 8 or FIG. 9.
  • the device 800 is equivalent to the training device 120 in FIG. 1
  • the device 900 is equivalent to the execution device 110 in FIG. 1.
  • Those of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or by software depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
  • the disclosed systems, devices, and methods may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the unit is only a logical function division.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, which may be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • when the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • the technical solution of the present application essentially, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the foregoing storage media include any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

This application provides a method and an apparatus for generating a face rotation image, relating to the field of artificial intelligence and, in particular, to the field of computer vision. The method includes: performing pose encoding on an acquired face image according to two or more key points in the face image to obtain a pose-encoding map; acquiring, from a training data set, multiple training pictures each containing a face, where the faces contained in the multiple training pictures all present the same rotation angle; performing, in a similar manner, pose encoding on a target face image according to two or more key points in the target face image to obtain a pose-encoding map of the target face image, where the target face image is obtained from the multiple training pictures; generating a to-be-input signal according to the face image and the two foregoing pose-encoding maps; and inputting the to-be-input signal into a face rotation image generation model to obtain a face rotation image. The method improves the continuity and accuracy of pose encoding, thereby improving the efficiency of generating face rotation images.

Description

人脸旋转图像的生成方法及装置 技术领域
本发明实施例涉及计算机视觉领域,尤其涉及一种人脸旋转图像的生成方法及装置。
背景技术
计算机视觉是各个应用领域,如制造业、检验、文档分析、医疗诊断,和军事等领域中各种智能/自主系统中不可分割的一部分,它是一门关于如何运用照相机/摄像机和计算机来获取我们所需的,被拍摄对象的数据与信息的学问。形象地说,就是给计算机安装上眼睛(照相机/摄像机)和大脑(算法)用来代替人眼对目标进行识别、跟踪和测量等,从而使计算机能够感知环境。因为感知可以看作是从感官信号中提取信息,所以计算机视觉也可以看作是研究如何使人工系统从图像或多维数据中“感知”的科学。总的来说,计算机视觉就是用各种成象系统代替视觉器官获取输入信息,再由计算机来代替大脑对这些输入信息完成处理和解释。计算机视觉的最终研究目标就是使计算机能像人那样通过视觉观察和理解世界,具有自主适应环境的能力。
人脸旋转(Face Rotation)是指对一张给定的2D人脸图像,利用计算机视觉的相关方法,如图像处理、机器学习、计算机图形学等获得该人脸旋转后的真实化的符合人脸在三维空间中几何映射原理的人脸图像。人脸旋转主要为解决人脸识别中,因人脸大角度旋转导致侧脸识别不准的问题;另外,通过人脸旋转还可以解决人脸识别模型训练中人脸数据不足的问题,即可用于人脸数据的扩充。
用于解决侧脸识别问题的方法很多,人脸旋转技术是其中一个。就人脸旋转技术来说,常用的方法有:3D/2D模型和生成稀疏子空间。其中,3D/2D模型是通过将2D人脸图像映射到3D人脸模型上,估计出人脸的3D模型参数,然后再通过视角变换得到旋转后人脸的投影图像,从而得到旋转后的2D人脸图像。通过3D人脸模型,理论上可以解决任意姿态的人脸图像旋转问题,但是目前来讲,3D人脸计算量较大,并且真实化的精度还不太高。生成稀疏子空间方法是通过稀疏和低秩学习的方法,将同一人脸的不同姿态看成线性子空间,然后通过低秩约束求解出具有最低秩的人脸正脸图像。这种方法主要是解决将人脸从侧脸转动到正脸的技术,是人脸旋转的一种特殊情况。
为了克服上述问题,现有技术利用生成式对抗网络和一个一维度one-hot姿态编码器 指导人脸图像的姿态编码,并生成人脸图像的不同姿态特征。其中,所述生成式对抗网络是通过深度学习模型和生成对抗的方式训练人脸生成模型的方法。但是,现有技术中采用的one-hot姿态编码的方式对姿态的表达不够准确,且该方式不具有连续性;另外,现有方案中的生成式对抗网络中的对抗判别网络的结构使得对抗判别不够鲁棒,从而使得通过其生成的旋转图像的效果不佳。
发明内容
本申请实施例公开了一种人脸旋转图像的生成方法及装置,可以提高生成人脸旋转图像的效率,获得更好的图像质量。
第一方面,本申请实施例提供一种人脸旋转图像的生成方法,包括:
接收人脸图像;
根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图;
根据人脸旋转角度从训练数据集中获取多张第一训练图片,所述多张第一训练图片均包含人脸,且所述多张第一训练图片中包含的人脸呈现的旋转角度均为所述人脸旋转角度;
根据目标人脸图像中的两个或两个以上关键点对所述目标人脸图像进行姿态编码,得到所述目标人脸图像的姿态编码图;其中,所述目标人脸图像是根据所述多张第一训练图片得到的;
根据所述人脸图像、所述人脸图像的姿态编码图和所述目标人脸图像的姿态编码图生成待输入信号,其中所述人脸图像的尺寸、所述人脸图像的姿态编码图的尺寸和所述目标人脸图像的姿态编码图的尺寸相同;
将所述待输入信号输入人脸旋转图像生成模型得到人脸旋转图像。
本申请实施例提供的人脸旋转图像的生成方法,通过对人脸图像和目标人脸图像进行姿态编码,得到人脸旋转图像生成模型的输入信号,并进一步通过所述人脸旋转图像生成模型生成人脸旋转图像,由于所述姿态编码方式对人脸姿态的描述更加精确和鲁棒,因此生成的人脸旋转图像也更加准确。另外,由于该方法提供的目标人脸图像是根据所述多张第一训练图片得到的,所述多张第一训练图片中包含的人脸呈现的旋转角度相同,这里的旋转角度可以是用户预设的,如,用户输入人脸图像,并指示图像生成设备生成预设角度的人脸旋转图像,则上述多张第一训练图片中包含的人脸呈现的旋转角度均为该预设角度;通过这样的设置,本申请实施例提供的人脸旋转图像的生成方法对人脸旋 转的角度没有限制,即,可以实现各种不同角度的人脸旋转。
值得说明的是,在第一方面提供的方法中,所述根据人脸旋转角度从训练数据集中获取多张第一训练图片,所述多张第一训练图片均包含人脸,这里的人脸和所述人脸图像中的人脸,不要求是同一张人脸,事实上,所述人脸图像在第一方面提供的方法中,可以是用户输入的实时的待旋转的人脸,而所述多张第一训练图片是数据库维护的训练数据集,因此所述多张第一训练图片中包含的人脸与所述人脸图像包含的人脸可以认为没有直接关系,当然人脸图像中包含的人脸也可以出现在数据库中被当做训练数据进行使用。
需要说明的是,在第一方面提供的方法中,所述根据所述人脸图像、所述人脸图像的姿态编码图和所述目标人脸图像的姿态编码图生成待输入信号,具体可以是通过特征融合的方式融合所述人脸图像、所述人脸图像的姿态编码图和所述目标人脸图像的姿态编码图得到所述待输入信号。特征融合是将有区分意义并且具有互补作用的特征,通过某种方式有机地结合在一起作为统一的特征。特征融合是生物识别技术领域常用的一种技术手段,可以采用多种方式实现特征的融合。融合后的特征包含的信息更加准确、更加丰富。可以理解,所述待输入信号相比于所述人脸图像、所述人脸图像的姿态编码图和所述目标人脸图像的姿态编码图中的任一图像或姿态编码图包含的信息更加准确、也更加丰富。因此,利用所述待输入信号进行人脸旋转图像的生成,可以使生成的人脸旋转图像更准确,当该人脸旋转方法被应用到人脸识别的应用场景中时,更准确的人脸旋转图像可以用于提高人脸识别的准确度。
在一种可能的实现方式中,所述根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图包括:
利用关键点检测算法对所述人脸图像进行检测,得到所述人脸图像的N个关键点分别对应的位置坐标,N为大于1的整数;
构造与所述人脸图像尺寸相同的N张第一图像,所述N张第一图像与所述N个关键点一一对应;
分别以所述N个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第一图像进行高斯模糊处理,得到N张第一高斯模糊图,所述N张第一高斯模糊图为所述人脸图像的姿态编码图。
在这种可能的实现方式中,先确定人脸图像的N个关键点,再以每个关键点为中心对该关键点对应的第一图像进行高斯模糊处理,这种通过关键点进行高斯模糊从而实现图像姿态编码的方式,对人脸姿态的描述更加精确和鲁棒,进而得到更高质量的人脸旋 转图像。
在一种可能的实现方式中,所述构造与所述人脸图像尺寸相同的N张第一图像包括:
生成N个全0矩阵,每一个全0矩阵对应一个关键点;将该关键点在所述人脸图像中的位置映射到该全0矩阵中的相应位置,并将该全0矩阵中的相应位置的值由0改为1;由此生成N个独热码矩阵,所述N个独热码矩阵为所述N张第一图像。
需要说明的是,所述N张第一图像与所述关键点在人脸图像中的位置坐标有关。
在一种可能的实现方式中,所述分别以所述N个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第一图像进行高斯模糊处理,包括:
分别以每个独热码矩阵中值为1的点为中心,对所述独热码矩阵进行高斯模糊处理。
在一种可能的实现方式中,根据目标人脸图像中的两个或两个以上关键点对所述目标人脸图像进行姿态编码,得到所述目标人脸图像的姿态编码图包括:
利用关键点检测算法对所述目标人脸图像进行检测,得到所述目标人脸图像的M个关键点分别对应的位置坐标,M为大于1的整数;
构造与所述目标人脸图像尺寸相同的M张第二图像,所述M张第二图像与所述M个关键点一一对应;
分别以所述M个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第二图像进行高斯模糊处理,得到M张第二高斯模糊图,所述M张第二高斯模糊图为所述目标人脸图像的姿态编码图。
需要说明的是,所述构造与所述目标人脸图像尺寸相同的M张第二图像,与上面所述的一种可能的实现方式中的所述构造与所述人脸图像尺寸相同的N张第一图像的方式相同,此处不再赘述。
在一种可能的实现方式中,所述目标人脸图像是根据所述多张第一训练图片得到的,包括:
所述目标人脸图像是根据所述多张第一训练图片的姿态编码图的平均值得到的。
需要说明的是,此处的多张第一训练图片的姿态编码图也可以是根据上述姿态编码的方式获取的,此处不再展开。
在一种可能的实现方式中,所述人脸旋转图像生成模型是根据训练生成式对抗网络得到的,所述生成式对抗网络包括至少一个人脸生成网络以及至少两个判别网络,所述至少两个判别网络形成耦合对抗,用于产生对抗损失,所述对抗损失用于更新所述至少一个人脸生成网络以及所述至少两个判别网络,所述更新后的至少一个人脸生成网络为 所述人脸旋转图像生成模型。
通过使用至少两个形成耦合对抗关系的判别网络,使得不同的判别网络可以通过不同的条件对所述人脸生成网络生成的预测人脸旋转图像进行判别,且不同判别网络得到判别结果都会对所述生成式对抗网络产生影响,从而使得所述生成式对抗网络能够根据上述不同的条件对人脸图像的不同方面进行调整和把握,从而输出更加准确的人脸旋转图像。
第二方面,本申请实施例提供一种生成式对抗网络的训练方法,所述生成式对抗网络包括人脸生成网络,以及多个耦合对抗的判别网络,所述耦合对抗的判别网络至少包括第一判别网络和第二判别网络,所述方法包括:
接收人脸图像,以及人脸旋转图像;所述人脸图像和所述人脸旋转图像为同一张人脸旋转前和旋转后的图像;
根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图;
根据所述人脸旋转图像中的两个或两个以上关键点对所述人脸旋转图像进行姿态编码,得到所述人脸旋转图像的姿态编码图;
将所述人脸图像、所述人脸图像的姿态编码图以及所述人脸旋转图像的姿态编码图输入所述人脸生成网络,以生成预测人脸旋转图像;
将所述人脸图像、所述人脸旋转图像和所述预测人脸旋转图像输入所述第一判别网络以得到第一损失;
将所述人脸旋转图像、所述人脸旋转图像的姿态编码图以及所述预测人脸旋转图像输入所述第二判别网络以得到第二损失;
根据所述生成式对抗网络的总损失更新所述人脸生成网络、所述第一判别网络以及所述第二判别网络,所述生成式对抗网络的总损失根据所述第一损失和第二损失加权求和得到;
直到所述生成式对抗网络的总损失收敛,输出训练后的人脸生成网络。
需要说明的,所述人脸图像和所述人脸旋转图像为同一张人脸旋转前和旋转后的图像,这里并不限定其必须是正脸(frontal face)旋转得到侧脸(profile face),或侧脸旋转得到正脸,应该理解为,旋转前的人脸和旋转后的人脸之间具有一定大小的旋转角度,该旋转角度可以是预设的,此处不再赘述。
本申请实施例提供的生成式对抗网络的训练方法,通过对对人脸图像和人脸旋转图 像进行姿态编码,得到人脸图像的姿态编码图和人脸旋转图像的姿态编码图,并通过所述生成式对抗网络中的人脸生成网络生成预测人脸旋转图像;进而通过至少第一判别网络和第二判别网络分别对预测人脸旋转图像进行判别得到第一损失和第二损失,将所述第一损失和第二损失进行加权求和得到生成式对抗网络的总损失,并由该总损失更新所述生成式对抗网络中的人脸生成网络以及第一判别网络和第二判别网络。由于上述的姿态编码方式对人脸姿态的描述更加精确和鲁棒,使得人脸生成网络或者判别网络通过上述的姿态编码图得到的预测人脸旋转图像也更加接近真实的人脸旋转图像。另外,在该训练方法中,由于对训练数据(人脸图像和人脸旋转图像)的旋转角度并没有限制,因此该训练得到的网络也可以适应于各种不同角度的人脸旋转,由此提升人脸旋转的可操作性以及用户体验。另外,通过使用第一判别网络和第二判别网络,且所述第一判别网络和第二判别网络耦合对抗,使得不同的判别网络可以通过不同的条件对所述人脸生成网络生成的预测人脸旋转图像进行判别,且不同判别网络得到判别结果都会对所述生成式对抗网络产生影响,从而使得所述生成式对抗网络能够根据上述不同的条件对人脸图像的不同方面进行调整和把握,从而输出更加准确的人脸旋转图像。
在一种可能的实现方式中,在所述根据所述生成式对抗网络的总损失更新所述人脸生成网络、所述第一判别网络以及所述第二判别网络之前,所述方法还包括:
根据所述人脸旋转图像及所述预测人脸旋转图像得到真实图像损失,所述真实图像损失包括像素损失、全变分损失及身份识别特征损失中的至少一个损失;对应的,所述生成式对抗网络的总损失根据所述真实图像损失中的至少一个损失、所述第一损失和第二损失加权求和得到。
本申请实施例可能的实现方式,不仅考虑第一损失和第二损失,还考虑到图像真实损失,如像素损失、全变分损失及身份识别特征损失。当所述图像真实损失包括所述像素损失时,所述生成式对抗网络的总损失根据所述第一损失、所述第二损失和所述像素损失的加权求和得到,通过像素损失的引入,在丰富所述生成式对抗网络的训练条件的基础上进一步考虑了训练的稳定性并可以加速收敛所述生成式对抗网络的训练;当所述图像真实损失包括所述全变分损失时,所述生成式对抗网络的总损失根据所述第一损失、所述第二损失和所述全变分损失加权求和得到,由于全变分损失具有防止生成图像局部梯度过大的作用,因此通过全变分损失的引入,在丰富所述生成式对抗网络的训练条件的基础上防止了生成的预测人脸图像出现局部梯度过大从而产生局部瑕疵;当所述图像真实损失包括所述身份识别特征损失时,所述生成式对抗网络的总损失根据所述第一损 失、所述第二损失和所述身份识别特征损失的加权求和得到,所述身份识别特征用于保证生成的所述预测人脸旋转图像中包含的人脸与输入的训练数据(人脸图像和人脸旋转图像)中包含的人脸的身份信息保持不变。通过身份识别特征损失的引入,使得所述训练得到的生成式对抗网络可以生成具有更加准确的身份信息的旋转后图像;当所述图像真实损失包括上述三个损失中的两个或三个均包含时,对应的效果将被考虑到。
在一种可能的实现方式中,所述根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图,包括:
利用关键点检测算法对所述人脸图像进行检测,得到所述人脸图像的N个关键点分别对应的位置坐标,构造与所述人脸图像尺寸相同的N张第一图像,所述N张第一图像与所述N个关键点一一对应,分别以所述N个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第一图像进行高斯模糊处理,得到N张第一高斯模糊图,所述N张第一高斯模糊图为所述人脸图像的姿态编码图,N为大于1的正整数。
在一种可能的实现方式中,所述构造与所述人脸图像尺寸相同的N张第一图像包括:
生成N个全0矩阵,每一个全0矩阵对应一个关键点;将该关键点在所述人脸图像中的位置映射到该全0矩阵中的相应位置,并将该全0矩阵中的相应位置的值由0改为1;由此生成N个独热码矩阵,所述N个独热码矩阵为所述N张第一图像。
通过关键点进行高斯模糊从而实现图像姿态编码的方式,对人脸姿态的描述更加精确和鲁棒,通过更加精确和鲁棒的人脸姿态描述,可以使得所述人脸生成网络生成的预测人脸旋转图像更加接近真实的所述人脸旋转图像。
需要说明的是,所述N张第一图像与所述关键点在人脸图像中的位置坐标有关。
在一种可能的实现方式中,所述分别以所述N个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第一图像进行高斯模糊处理,包括:
分别以每个独热码矩阵中值为1的点为中心,对所述独热码矩阵进行高斯模糊处理。
在一种可能的实现方式中,所述根据所述人脸旋转图像中的两个或两个以上关键点对所述人脸旋转图像进行姿态编码,得到所述人脸旋转图像的姿态编码图,包括:
利用关键点检测算法对所述人脸旋转图像进行检测,得到所述人脸旋转图像的M个关键点分别对应的位置坐标,构造与所述人脸旋转图像尺寸相同的M张第二图像,所述M张第二图像与所述M个关键点一一对应,分别以所述M个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第二图像进行高斯模糊处理,得到M张第二高斯模糊图,所述M张第二高斯模糊图为所述人脸旋转图像的姿态编码图,M为大于1的正 整数。
需要说明的是,所述构造与所述人脸旋转图像尺寸相同的M张第二图像,与上面所述的一种可能的实现方式中的所述构造与所述人脸图像尺寸相同的N张第一图像的方式相同,此处不再赘述。
在一种可能的实现方式中,所述将所述人脸图像,所述人脸旋转图像和所述预测人脸旋转图像输入所述第一判别网络得到第一损失,包括:
以所述人脸图像作为所述第一判别网络的判别条件,根据所述第一判别网络判断所述人脸旋转图像和所述预测人脸旋转图像的真假性,并根据判别结果生成所述第一损失;其中,所述第一判断网络包括二分类判别器,所述二分类判别器用于判断为真或判断为假。
在一种可能的实现方式中,以所述人脸图像作为所述第一判别网络的判别条件,根据所述第一判别网络判断所述人脸旋转图像和所述预测人脸旋转图像的真假性,并根据判别结果生成所述第一损失,包括:
L_{ii} = \mathbb{E}_{I_b\sim H(I_b)}\left[\log D_{\theta_{ii}}(I_a, I_b)\right] + \mathbb{E}_{\hat{I}_b\sim H(\hat{I}_b)}\left[\log\left(1 - D_{\theta_{ii}}(I_a, \hat{I}_b)\right)\right]
其中，L_{ii}为所述第一损失，I_a为所述人脸图像，I_b为所述人脸旋转图像，\hat{I}_b为所述预测人脸旋转图像；\mathbb{E}_{I_b\sim H(I_b)}[\cdot]表示在所述人脸旋转图像I_b的分布H(I_b)上求期望，即所述人脸旋转图像I_b为真的概率；\log D_{\theta_{ii}}(\cdot)表示所述第一判别网络的损失函数；\mathbb{E}_{\hat{I}_b\sim H(\hat{I}_b)}[\cdot]表示在所述预测人脸旋转图像\hat{I}_b的分布H(\hat{I}_b)上的期望，即所述预测人脸旋转图像\hat{I}_b为真的概率；D_{\theta_{ii}}(\cdot,\cdot)为以所述人脸图像为条件的所述第一判别网络，\theta_{ii}为所述第一判别网络的参数，(I_a, I_b)和(I_a, \hat{I}_b)为所述第一判别网络的输入。
在一种可能的实现方式中,所述将所述人脸旋转图像,所述人脸旋转图像的姿态编码图以及所述预测人脸旋转图像输入所述第二判别网络得到第二损失,包括:
以所述人脸旋转图像的姿态编码图作为所述第二判别网络的判别条件,根据所述第二判别网络判断所述人脸旋转图像和所述预测人脸旋转图像的真假性,并根据判别结果生成所述第二损失;其中,所述第二判断网络包括二分类判别器,所述二分类判别器用于判断为真或判断为假。
在上述的生成式对抗网络中,第一判别网络以所述人脸图像作为判别条件,第二判别网络以所述人脸旋转图像的姿态编码图作为判别条件,二者最后得到的判别结果:第一损失和第二损失,通过加权求和作为所述生成式对抗网络的总损失,该总损失用于更新所述生成式对抗网络(包括所述人脸生成网络、所述第一判别网络和所述第二判别网络),由此训练得到的生成式对抗网络对于人脸的表观真实性以及人脸姿态两方面的信息都能有非常好的把握。综上所述:由于所述第一判别网络以所述人脸图像作为判别条件,因此可以理解人脸的表观真实性由所述第一判别网络把握,由于所述第二判别网络 以所述人脸旋转图像的姿态编码图作为判别条件,因此可以理解人脸姿态由所述第二判别网络把握。
在一种可能的实现方式中,以所述人脸旋转图像的姿态编码图作为所述第二判别网络的判别条件,根据所述第二判别网络判断所述人脸旋转图像和所述预测人脸旋转图像的真假性,并根据判别结果生成所述第二损失,包括:
L_{ip} = \mathbb{E}_{I_b\sim H(I_b)}\left[\log D_{\theta_{ip}}(P_b, I_b)\right] + \mathbb{E}_{\hat{I}_b\sim H(\hat{I}_b)}\left[\log\left(1 - D_{\theta_{ip}}(P_b, \hat{I}_b)\right)\right]
其中，L_{ip}为所述第二损失，I_b为所述人脸旋转图像，\hat{I}_b为所述预测人脸旋转图像，P_b为所述人脸旋转图像的姿态编码图；\mathbb{E}_{I_b\sim H(I_b)}[\cdot]表示在所述人脸旋转图像I_b的分布H(I_b)上求期望，即所述人脸旋转图像I_b为真的概率；\log D_{\theta_{ip}}(\cdot)表示所述第二判别网络的损失函数；\mathbb{E}_{\hat{I}_b\sim H(\hat{I}_b)}[\cdot]表示在所述预测人脸旋转图像\hat{I}_b的分布H(\hat{I}_b)上的期望，即所述预测人脸旋转图像\hat{I}_b为真的概率；D_{\theta_{ip}}(\cdot,\cdot)为以所述人脸旋转图像的姿态编码图为条件的所述第二判别网络，\theta_{ip}为所述第二判别网络的参数，(P_b, I_b)和(P_b, \hat{I}_b)为所述第二判别网络的输入。
在一种可能的实现方式中,当所述真实图像损失包括像素损失,所述根据所述人脸旋转图像及所述预测人脸旋转图像得到真实图像损失,包括:
L_{pix} = \sum_{s=1}^{S}\left\|\hat{I}_b^{\,s} - I_b^{\,s}\right\|_{1}
其中，L_{pix}是所述像素损失，S是尺度量，\hat{I}_b为所述预测人脸旋转图像，I_b为所述人脸旋转图像，\hat{I}_b^{\,s}和I_b^{\,s}分别为二者缩放到第s个尺度后的图像；\left\|\hat{I}_b^{\,s} - I_b^{\,s}\right\|_{1}表示将所述预测人脸旋转图像和所述人脸旋转图像进行缩放到S尺度量时计算像素差值的1范数损失。
需要说明的是,这里的像素差值表示所述预测人脸旋转图像与所述人脸旋转图像对应位置的像素之间的差值。
在一种可能的实现方式中,当所述真实图像损失包括全变分损失,所述根据所述人脸旋转图像及所述预测人脸旋转图像得到真实图像损失,包括:
L_{tv} = \sum_{c=1}^{C}\sum_{w=1}^{W-1}\sum_{h=1}^{H-1}\left(\left|\hat{I}_b(w+1,h,c)-\hat{I}_b(w,h,c)\right| + \left|\hat{I}_b(w,h+1,c)-\hat{I}_b(w,h,c)\right|\right)
其中，L_{tv}是所述全变分损失，即所述预测人脸旋转图像\hat{I}_b在横向和纵向两个方向一阶梯度绝对值的和，其中，W表示所述预测人脸旋转图像的宽，H表示所述预测人脸旋转图像的高，C表示所述预测人脸旋转图像通道数。
在一种可能的实现方式中,当所述真实图像损失包括身份识别特征损失,所述根据 所述人脸旋转图像及所述预测人脸旋转图像得到真实图像损失,包括:
L_{ip} = \left\|f_{pool}(\hat{I}_b) - f_{pool}(I_b)\right\|_2^2 + \left\|f_{fc}(\hat{I}_b) - f_{fc}(I_b)\right\|_2^2
其中，身份识别特征用来保证所述预测人脸旋转图像和所述人脸图像之间的身份信息保持不变，L_{ip}表示所述身份识别特征损失，f为预先训练好的人脸识别模型，所述人脸识别模型f为深度神经网络，所述深度神经网络包括至少一个池化层和至少一个全连接层，其中，f_{pool}(\cdot)表示所述人脸识别模型f的最后一个池化层的输出，f_{fc}(\cdot)表示所述人脸识别模型f最后一个全连接层的输出。
在一种可能的实现方式中,所述根据所述生成式对抗网络的总损失更新所述人脸生成网络、所述第一判别网络以及所述第二判别网络,包括:
更新所述人脸生成网络,以使得所述人脸生成网络生成的误差最小;
更新所述第一判别网络和所述第二判别网络,以使得所述第一损失和所述第二损失的值最大;
交替迭代上述更新直到所述生成式对抗网络达到收敛。
第三方面,本申请实施例提供了一种人体旋转图像的生成方法,包括:
接收人体图像;
对所述人体图像进行姿态编码,得到所述人体图像的姿态编码图;
根据人体旋转角度从训练数据集中获取多张第二训练图片,所述多张第二训练图片均包含人体,且所述多张第二训练图片中包含的人体呈现的旋转角度均为所述人体旋转角度;
对目标人体图像进行姿态编码,得到所述目标人体图像的姿态编码图;其中,所述目标人体图像是根据所述多张第二训练图片得到的;
根据所述人体图像、所述人体图像的姿态编码图和所述目标人体图像的姿态编码图生成待输入信号,其中所述人体图像的尺寸、所述人体图像的姿态编码图的尺寸和所述目标人体图像的姿态编码图的尺寸相同;
将所述待输入信号输入人体旋转图像生成模型得到人体旋转图像。
需要说明的是,在第三方面提供的方法中,所述根据所述人体图像、所述人体图像的姿态编码图和所述目标人体图像的姿态编码图生成待输入信号,具体可以是通过特征融合的方式融合所述人体图像、所述人体图像的姿态编码图和所述目标人体图像的姿态编码图得到所述待输入信号。特征融合是将有区分意义并且具有互补作用的特征,通过 某种方式有机地结合在一起作为统一的特征。特征融合是生物识别技术领域常用的一种技术手段,可以采用多种方式实现特征的融合。融合后的特征包含的信息更加准确、更加丰富。可以理解,所述待输入信号相比于所述人体图像、所述人体图像的姿态编码图和所述目标人体图像的姿态编码图中的任一图像或姿态编码图包含的信息更加准确、也更加丰富。因此,利用所述待输入信号进行人体旋转图像的生成,可以使生成的人体旋转图像更准确,当该人体旋转方法被应用到监控系统的人物定位或识别的应用场景中时,更准确的人体旋转图像可以用于提高定位和识别的准确度。
在一种可能的实现方式中,所述对所述人体图像进行姿态编码,得到所述人体图像的姿态编码图包括:
利用关键点检测算法对所述人体图像进行检测,得到所述人体图像的W个关键点分别对应的位置坐标,W为大于1的整数;
构造与所述人体图像尺寸相同的W张第三图像,所述W张第三图像与所述W个关键点一一对应;
分别以所述W个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第三图像进行高斯模糊处理,得到W张第三高斯模糊图,所述W张第三高斯模糊图为所述人体图像的姿态编码图。
在这种可能的实现方式中,先确定人体图像的W个关键点,当所述人体图像和第一方面的人脸图像中包含的人为同一个人时,这里的W个关键点可以包括上述第一方面的N个关键点,上述N个关键点仅为人脸上的关键点,如左眼球中心、右眼球中心、鼻尖、左嘴角和右嘴角,或者还可以包括脸部轮廓的点等;而所述的W个关键点还可以包括人体关键部位对应的点,如左胳膊肘节点、右胳膊肘节点、左膝盖中心点、右膝盖中心点等;
确定W个关键点之后,再以每个关键点为中心对该关键点对应的第三图像进行高斯模糊处理,这种通过关键点进行高斯模糊从而实现图像姿态编码的方式,对人体姿态的描述更加精确和鲁棒,进而得到更高质量的人体旋转图像。
在一种可能的实现方式中,所述构造与所述人体图像尺寸相同的W张第三图像包括:
生成W个全0矩阵,每一个全0矩阵对应一个关键点;将该关键点在所述人体图像中的位置映射到该全0矩阵中的相应位置,并将该全0矩阵中的相应位置的值由0改为1;由此生成W个独热码矩阵,所述W个独热码矩阵为所述W张第三图像。
需要说明的是,所述W张第三图像与所述关键点在人体图像中的位置坐标有关。
在一种可能的实现方式中,所述分别以所述W个关键点中的每个关键点为中心,对 与所述每个关键点一一对应的第三图像进行高斯模糊处理,包括:
分别以每个独热码矩阵中值为1的点为中心,对所述独热码矩阵进行高斯模糊处理。
关于所述目标人体图像的姿态编码方式,与上述人体图像的姿态编码方式相同,在关键点的数目上可以不同,但实现的过程是相同的,因此不再赘述。
第四方面,本申请实施例提供一种生成式对抗网络的训练方法,所述生成式对抗网络包括人体图像生成网络,以及多个耦合对抗的判别网络,所述耦合对抗的判别网络至少包括第三判别网络和第四判别网络,所述方法包括:
接收人体图像,以及人体旋转图像;所述人体图像和所述人体旋转图像为同一个人体旋转前和旋转后的图像;
对所述人体图像进行姿态编码,得到所述人体图像的姿态编码图;
对所述人体旋转图像进行姿态编码,得到所述人体旋转图像的姿态编码图;
将所述人体图像、所述人体图像的姿态编码图以及所述人体旋转图像的姿态编码图输入所述人体图像生成网络,以生成预测人体旋转图像;
将所述人体图像、所述人体旋转图像和所述预测人体旋转图像输入所述第三判别网络以得到第三损失;
将所述人体旋转图像、所述人体旋转图像的姿态编码图以及所述预测人体旋转图像输入所述第四判别网络以得到第四损失;
根据所述生成式对抗网络的总损失更新所述人体图像生成网络、所述第三判别网络以及所述第四判别网络,所述生成式对抗网络的总损失根据所述第三损失和第四损失加权求和得到;
直到所述生成式对抗网络的总损失收敛,输出训练后的人体图像生成网络。
本申请实施例提供的生成式对抗网络的训练方法,通过对对人体图像和人体旋转图像进行姿态编码,得到人体图像的姿态编码图和人体旋转图像的姿态编码图,并通过所述生成式对抗网络中的人体图像生成网络生成预测人体旋转图像;进而通过至少两个判别网络如第三判别网络和第四判别网络,分别对预测人体旋转图像进行判别得到第三损失和第四损失,将所述第三损失和第四损失进行加权求和得到生成式对抗网络的总损失,并由该总损失更新所述生成式对抗网络中的人体图像生成网络以及第三判别网络和第四判别网络。由于上述的姿态编码方式对人体姿态的描述更加精确和鲁棒,使得人体图像生成网络或者判别网络通过上述的姿态编码图得到的预测人体旋转图像也更加接近真实的人体旋转图像。另外,在该训练方法中,由于对训练数据(人体图像和人体旋转图像) 的旋转角度并没有限制,因此该训练得到的网络也可以适应于各种不同角度的人体旋转,由此提升人体旋转的可操作性以及用户体验。另外,通过使用第三判别网络和第四判别网络,且所述第三判别网络和第四判别网络耦合对抗,使得不同的判别网络可以通过不同的条件对所述人体图像生成网络生成的预测人体旋转图像进行判别,且不同判别网络得到判别结果都会对所述生成式对抗网络产生影响,从而使得所述生成式对抗网络能够根据上述不同的条件对人体图像的不同方面进行调整和把握,从而输出更加准确的人体旋转图像。
关于所述人体图像的姿态编码方式以及所述人体旋转图像姿态编码方式,与第三方面提供的姿态编码方式相同,虽然在具体的关键点取值上可以有所不同,但操作方式上是相同的,因此此处不再赘述。具体的得到第三损失和第四损失的方式可以参考第二方面得到第一损失和第二损失的方式,此处不再赘述。
第五方面,本申请实施例提供了一种人脸旋转图像的生成装置,所述装置包括用于执行所述第一方面或者第一方面的任一可能的实现方式中的方法的模块。
第六方面,本申请实施例提供一种生成式对抗网络的训练装置,所述装置包括用于执行所述第二方面或者第二方面的任一可能的实现方式中的方法的模块。
第七方面,本申请实施例提供一种人脸旋转图像的生成设备,包括处理器和存储器,所述存储器用于存储程序指令,所述处理器用于调用所述程序指令来执行第一方面及第一方面的任意一种可能的实现方式所提供的方法。
第八方面,本申请实施例提供一种训练生成式对抗网络的设备,包括处理器和存储器,所述存储器用于存储程序指令,所述处理器用于调用所述程序指令来执行第二方面及第二方面的任意一种可能的实现方式所提供的方法。
第九方面,本申请实施例提供一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有程序指令,当所述程序指令由处理器运行时,实现第一方面及第一方面的任意一种可能的实现方式所提供的方法。
第十方面,本申请实施例提供一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有程序指令,当所述程序指令由处理器运行时,实现第二方面及第二方面的任意一种可能的实现方式所提供的方法。
第十一方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行第一方面或第一方面的任一可能的实现方式中的方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器 用于执行第一方面或第一方面的任一可能的实现方式中的方法。
第十二方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行第二方面或第二方面的任一可能的实现方式中的方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行第二方面或第二方面的任一可能的实现方式中的方法。
附图说明
下面对本申请实施例用到的附图进行介绍。
图1是本申请实施例提供的系统架构的结构示意图;
图2是本申请实施例提供的卷积神经网络的逻辑示意图;
图3是本申请实施例提供的一种芯片硬件结构示意图;
图4是本申请实施例提供的一种生成式对抗网络的训练方法流程示意图;
图5是本申请实施例提供的一种人脸旋转图像的生成方法流程示意图;
图6是本申请实施例提供的一种生成式对抗网络的训练装置的示意性框图;
图7是本申请实施例提供的一种人脸旋转图像的生成装置的示意性框图;
图8是本申请实施例提供的一种生成式对抗网络的训练装置的硬件结构示意图;
图9是本申请实施例提供的一种人脸旋转图像的生成装置的硬件结构示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
本申请实施例提供的生成式对抗网络的训练方法,涉及计算机视觉的处理,具体可以应用于数据训练、机器学习、深度学习等数据处理方法,对训练数据(如本申请中的人脸图像和人脸旋转图像)进行符号化和形式化的智能信息建模、抽取、预处理、训练等,最终得到训练好的生成式对抗网络;并且,本申请实施例提供的人脸旋转图像的生成方法可以运用上述训练好的生成式对抗网络,将输入数据(如本申请中的人脸图像)输入到所述训练好的生成式对抗网络中,得到输出数据(如本申请中的人脸旋转图像)。需要说明的是,本申请实施例提供的生成式对抗网络的训练方法和人脸旋转图像的生成方法是基于同一个构思产生的发明,也可以理解为一个系统中的两个部分,或一个整体流程的两个阶段:如模型训练阶段和模型应用阶段。本申请实施例提供的方法和装置可以应用到人脸识别中,比如,当人脸识别过程中只有侧脸时,可以运用本申请实施例提供的人脸旋转图像生成方法,将该侧脸先生成为正脸,再基于正脸进行人脸识别,由于正脸的人脸识别通常会比侧脸更为准确,因此,该方法可以帮助提升人脸识别的准确度。另外,本申请实施例提供的方法和装置还可以用于人脸恢复,例如在公安系统的安防监控中,本申请实施例提供的方法可以实现从任意角度的图像得到完整的人脸正脸图像, 也可以根据人脸正脸图像或某一个侧脸图像得到其他各个角度的人脸图像,从而丰富人脸图像各个角度的信息,使得对监控对象的获取更加准确。本申请实施例提供的方法和装置还可以用于扩充训练数据库,如图1所示执行设备110的I/O接口112可以将经执行设备处理过的图像(如得到的人脸旋转图像)和用户输入的人脸图像一起作为训练数据对发送给数据库130,以使得数据库130维护的训练数据更加丰富,从而为训练设备120的训练工作提供更丰富的训练数据。
另外需要说明的是,在模型训练阶段,人脸图像和人脸旋转图像作为训练数据,提供给初始模型进行训练;在模型应用阶段,人脸图像作为实际应用中待处理(此处的处理为人脸旋转处理)的数据进行相关的数据处理后输入深度神经网络得到输出数据:人脸旋转图像。在文字描述上为了简洁直观起见,在训练阶段和应用阶段都用了人脸图像和人脸旋转图像的表述,但是不应该认为训练阶段的人脸图像和人脸旋转图像与应用阶段的人脸图像和人脸旋转图像必然是相同图像。如上所述,当人脸图像和人脸旋转图像的表述出现在训练阶段,应理解其为训练数据;当人脸图像和人脸旋转图像的表述出现在应用阶段,应理解其分别为实际应用中的输入和输出,后文不再就此赘述。当然,如背景技术中提到的通过人脸旋转可以扩充训练数据库,本申请实施例在应用阶段对输入数据(人脸图像)进行人脸旋转处理后得到输出数据(人脸旋转图像),这里的输入数据和输出数据可以作为新的训练数据添加到训练数据库中以用于扩充训练数据库。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)人脸旋转
利用图像处理和机器学习、计算机图形学等相关方法,将人脸图像从一个姿态(pose)角度旋转到另一个姿态角度并得到相应的旋转后图像。
(2)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距1为输入的运算单元,该运算单元的输出可以为:
h_{W,b}(x) = f\left(W^{T}x\right) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(3)深度神经网络
深度神经网络(Deep Neural Network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说 第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
\vec{y} = \alpha\left(W\vec{x} + \vec{b}\right)
其中，\vec{x}是输入向量，\vec{y}是输出向量，\vec{b}是偏移向量，W是权重矩阵（也称系数），\alpha(\cdot)是激活函数。每一层仅仅是对输入向量\vec{x}经过如此简单的操作得到输出向量\vec{y}。由于DNN层数多，则系数W和偏移向量\vec{b}的数量也就很多了。这些参数在DNN中的定义如下所述：以系数W为例：假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为W_{24}^{3}，上标3代表系数W所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是：第L-1层的第k个神经元到第L层的第j个神经元的系数定义为W_{jk}^{L}。
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
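As a quick numeric illustration of the per-layer relation above, the following minimal sketch (an illustrative example only, not part of the patent; the ReLU activation and all numbers are assumptions) computes one layer's output:

```python
# Minimal numeric example of one DNN layer: y = alpha(W x + b), with alpha assumed to be ReLU
import numpy as np

W = np.array([[0.5, -0.2],
              [0.3,  0.8]])          # weight matrix (coefficients)
x = np.array([1.0, 2.0])             # input vector
b = np.array([0.1, -0.1])            # bias (offset) vector
y = np.maximum(W @ x + b, 0.0)       # activation function alpha = ReLU
print(y)                             # -> [0.2 1.8]
```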
(4)卷积神经网络
卷积神经网络(CNN,Convolutional Neuron Network)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(5)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例, 损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(6)反向传播算法
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的超分辨率模型中参数的大小,使得超分辨率模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的超分辨率模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的超分辨率模型的参数,例如权重矩阵。
(7)生成式对抗网络
生成式对抗网络(GAN,Generative Adversarial Networks)是一种深度学习模型。该模型中至少包括两个模块:一个模块是生成模型(Generative Model),另一个模块是判别模型(Discriminative Model),通过这两个模块互相博弈学习,从而产生更好的输出。生成模型和判别模型都可以是神经网络,具体可以是深度神经网络,或者卷积神经网络。GAN的基本原理如下:以生成图片的GAN为例,假设有两个网络,G(Generator)和D(Discriminator),其中G是一个生成图片的网络,它接收一个随机的噪声z,通过这个噪声生成图片,记做G(z);D是一个判别网络,用于判别一张图片是不是“真实的”。它的输入参数是x,x代表一张图片,输出D(x)代表x为真实图片的概率,如果为1,就代表100%是真实的图片,如果为0,就代表不可能是真实的图片。在对该生成式对抗网络进行训练的过程中,生成网络G的目标就是尽可能生成真实的图片去欺骗判别网络D,而判别网络D的目标就是尽量把G生成的图片和真实的图片区分开来。这样,G和D就构成了一个动态的“博弈”过程,也即“生成式对抗网络”中的“对抗”。最后博弈的结果,在理想的状态下,G可以生成足以“以假乱真”的图片G(z),而D难以判定G生成的图片究竟是不是真实的,即D(G(z))=0.5。这样就得到了一个优异的生成模型G,它可以用来生成图片。
(8)像素值
图像的像素值可以是一个红绿蓝(RGB)颜色值,像素值可以是表示颜色的长整数。例如,像素值为256*Red+100*Green+76Blue,其中,Blue代表蓝色分量,Green代表绿色分量,Red代表红色分量。各个颜色分量中,数值越小,亮度越低,数值越大,亮度越高。对于灰度图像来说,像素值可以是灰度值。
下面介绍本申请实施例提供的系统架构。
参见附图1,本发明实施例提供了一种系统架构100。如所述系统架构100所示,数据采集设备160用于采集训练数据,本申请实施例中训练数据包括:人脸图像和人脸旋转图像,其中该人脸图像为人脸旋转前的图像,该人脸旋转图像为该人脸图像中的人脸进行旋转后得到的图像;并将训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。下面将以实施例一更详细地描述训练设备120如何基于训练数据得到目标模型/规则101,该目标模型/规则101能够用于实现本申请实施例提供的人脸旋转图像的生成方法,即,将人脸图像通过相关预处理后输入该目标模型/规则101,即可得到人脸旋转图像。本申请实施例中的目标模型/规则101具体 可以为人脸生成网络,在本申请提供的实施例中,该人脸生成网络是通过训练生成式对抗网络得到的。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图1所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,AR/VR,车载终端等,还可以是服务器或者云端等。在附图1中,执行设备110配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:来自用户输入的人脸图像,来自数据库的多张第一训练图片,其中,所述多张第一训练图片均包含人脸(该人脸与人脸图像中包含的人脸不一定相同),且所述多张第一训练图片中包含的人脸呈现的旋转角度均为角度θ,该角度θ可以是预设的,如希望所述执行设备110输出的人脸旋转图像是在所述人脸图像的基础上旋转了θ度的图像。
预处理模块113用于根据I/O接口112接收到的输入数据(如所述人脸图像)进行预处理,在本申请实施例中,预处理模块113可以用于根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图。
预处理模块114用于根据I/O接口112接收到的输入数据,如(所述多张第一训练图片)进行预处理,在本申请实施例中,预处理模块114可以用于根据所述多张第一训练图片得到目标人脸图像,并由所述预处理模块113对所述目标人脸图像进行姿态编码,得到所述目标人脸图像的姿态编码图;所述预处理模块113还可以根据所述人脸图像、所述人脸图像的姿态编码图和所述目标人脸图像的姿态编码图生成待输入信号,并将所述待输入信号输入到所述计算模块111,由所述计算模块111根据所述目标模型/规则101以及所述待输入信号进行计算,最终得到人脸旋转图像。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果,如上述得到的人脸旋转图像返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在附图1中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,附图1仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图1中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
如图1所示,根据训练设备120训练得到目标模型/规则101,该目标模型/规则101在本申请实施例中可以是根据训练生成式对抗网络(GAN,Generative Adversarial Networks)得到的人脸旋转图像生成模型,具体的,本申请实施例提供的生成式对抗网络可以包括:至少一个人脸生成网络以及至少两个判别网络,所述至少两个判别网络形成耦合对抗,用于产生对抗损失,所述对抗损失用于更新所述至少一个人脸生成网络以及所述至少两个判别网络,所述更新后的至少一个人脸生成网络为所述人脸旋转图像生成模型。在本申请实施例提供的生成式对抗网络中,所述至少一个人脸生成网络以及所述至少两个判别网络具体都可以是卷积神经网络。
如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图2所示,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220 (其中池化层为可选的),以及神经网络层230。
卷积层/池化层220:
卷积层:
如图2所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特 征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图2中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
神经网络层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用神经网络层230来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层230中可以包括多层隐含层(如图2所示的231、232至23n)以及输出层240,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等……
在神经网络层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图2由210至240方向的传播为前向传播)完成,反向传播(如图2由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图2所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在。
下面介绍本申请实施例提供的一种芯片硬件结构。
图3为本发明实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器30。该 芯片可以被设置在如图1所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图1所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。如图2所示的卷积神经网络中各层的算法均可在如图3所示的芯片中得以实现。
神经网络处理器30可以是NPU,TPU,或者GPU等一切适合用于大规模异或运算处理的处理器。以NPU为例:NPU可以作为协处理器挂载到主CPU(Host CPU)上,由主CPU为其分配任务。NPU的核心部分为运算电路303,通过控制器304控制运算电路303提取存储器(301和302)中的矩阵数据并进行乘加运算。
在一些实现中,运算电路303内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路303是二维脉动阵列。运算电路303还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路303是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路303从权重存储器302中取矩阵B的权重数据,并缓存在运算电路303中的每一个PE上。运算电路303从输入存储器301中取矩阵A的输入数据,根据矩阵A的输入数据与矩阵B的权重数据进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)308中。
统一存储器306用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(DMAC,Direct Memory Access Controller)305,被搬运到权重存储器302中。输入数据也通过DMAC被搬运到统一存储器306中。
总线接口单元(BIU,Bus Interface Unit)310,用于DMAC和取指存储器(Instruction Fetch Buffer)309的交互;总线接口单元301还用于取指存储器309从外部存储器获取指令;总线接口单元301还用于存储单元访问控制器305从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器306中,或将权重数据搬运到权重存储器302中,或将输入数据搬运到输入存储器301中。
向量计算单元307多个运算处理单元,在需要的情况下,对运算电路303的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。向量计算单元307主要用于神经网络中非卷积层,或全连接层(FC,fully connected layers)的计算,具体可以处理:Pooling(池化),Normalization(归一化)等的计算。例如,向量计算单元307 可以将非线性函数应用到运算电路303的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元307生成归一化的值、合并值,或二者均有。
在一些实现中,向量计算单元307将经处理的向量存储到统一存储器306。在一些实现中,经向量计算单元307处理过的向量能够用作运算电路303的激活输入,例如用于神经网络中后续层中的使用,如图2所示,若当前处理层是隐含层1(231),则经向量计算单元307处理过的向量还可以被用到隐含层2(232)中的计算。
控制器304连接的取指存储器(instruction fetch buffer)309,用于存储控制器304使用的指令;
统一存储器306,输入存储器301,权重存储器302以及取指存储器309均为On-Chip存储器。外部存储器独立于该NPU硬件架构。
其中,图2所示的卷积神经网络中各层的运算可以由运算电路303或向量计算单元307执行。
实施例一:
图4为本发明实施例一提供的一种生成式对抗网络的训练方法400,所述生成式对抗网络包括人脸生成网络,以及多个耦合对抗的判别网络,所述耦合对抗的判别网络至少包括第一判别网络和第二判别网络,所述方法400包括:
S401,接收人脸图像,以及人脸旋转图像;所述人脸图像和所述人脸旋转图像为同一张人脸旋转前和旋转后的图像;
S402,根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图;
S403,根据所述人脸旋转图像中的两个或两个以上关键点对所述人脸旋转图像进行姿态编码,得到所述人脸旋转图像的姿态编码图;
S404,将所述人脸图像、所述人脸图像的姿态编码图以及所述人脸旋转图像的姿态编码图输入所述人脸生成网络,以生成预测人脸旋转图像;
S405,将所述人脸图像、所述人脸旋转图像和所述预测人脸旋转图像输入所述第一判别网络以得到第一损失;
S406,将所述人脸旋转图像、所述人脸旋转图像的姿态编码图以及所述预测人脸旋转图像输入所述第二判别网络以得到第二损失;
S407,根据所述生成式对抗网络的总损失更新所述人脸生成网络、所述第一判别网络以及所述第二判别网络,所述生成式对抗网络的总损失根据所述第一损失和第二损失 加权求和得到;
重复执行S404至S407,直到所述生成式对抗网络的总损失收敛,则执行
S408,输出训练后的人脸生成网络。
其中,通过S407当次更新的人脸生成网络、第一判别网络和第二判别网络用于执行下一次的S404至S406的动作,依次迭代,直到所述生成式对抗网络的总损失收敛,结束对所述生成式对抗网络的训练,输出训练后的人脸生成网络。
本申请实施例提供的生成式对抗网络的训练方法,通过对对人脸图像和人脸旋转图像进行姿态编码(Pose Encoding),得到人脸图像的姿态编码图和人脸旋转图像的姿态编码图,并通过所述生成式对抗网络中的人脸生成网络生成预测人脸旋转图像;进而通过至少第一判别网络和第二判别网络分别对预测人脸旋转图像进行判别得到第一损失和第二损失,将所述第一损失和第二损失进行加权求和得到生成式对抗网络的总损失,并由该总损失更新所述生成式对抗网络中的人脸生成网络以及第一判别网络和第二判别网络。由于上述的姿态编码方式对人脸姿态的描述更加精确和鲁棒,使得人脸生成网络或者判别网络通过上述的姿态编码图得到的预测人脸旋转图像也更加接近真实的人脸旋转图像。另外,在该训练方法中,由于对训练数据(人脸图像和人脸旋转图像)的旋转角度并没有限制,因此该训练得到的网络也可以适应于各种不同角度的人脸旋转,由此提升人脸旋转的可操作性以及用户体验。并且,通过使用第一判别网络和第二判别网络,且所述第一判别网络和第二判别网络耦合对抗,使得不同的判别网络可以通过不同的条件对所述人脸生成网络生成的预测人脸旋转图像进行判别,且不同判别网络得到判别结果都会对所述生成式对抗网络产生影响,从而使得所述生成式对抗网络能够根据上述不同的条件对人脸图像的不同方面进行调整和把握,从而输出更加准确的人脸旋转图像。
需要说明的是,所述的多个耦合对抗的判别网络,其中,所述“耦合”体现在:该多个判别网络分别得到的损失共同影响总损失,如在本申请实施例提供的方法400中,所述第一损失和所述第二损失加权求和得到所述总损失,由此体现所述第一判别网络和所述第二判别网络之间的耦合关系,这里的“耦合”也可以称为“协同”,或者“联合”等,其本质的含义在于多个判别网络得到的损失以某种关系结合并共同影响所述总损失;其中,所述“对抗”体现在:所述多个判别网络与所述人脸生成网络之间是对抗的关系,这种对抗的关系在前文的概念介绍中的第(7)点有详细介绍,即生成与判别之间的“博弈”,此处不再赘述。
需要说明的,在实施例一中的所述人脸图像和所述人脸旋转图像本质上是训练数据, 用于训练所述生成式对抗网络,作为一对训练数据对,它们分别是同一张人脸旋转前和旋转后的图像。这里需要注意的是,本申请实施例提供的方法和装置并不限定人脸旋转必须是正脸旋转得到侧脸,或侧脸旋转得到正脸,因此对训练数据对的要求也不会是必须一张为正脸另一张为侧脸,应该理解的是,此处表述的旋转前的人脸和旋转后的人脸之间具有一定大小的旋转角度,该旋转角度可以是预设的。
另外需要说明的是,关于旋转角度(本文也称人脸旋转角度),应理解为:以正脸为0度,往右旋转人脸为正角度,往左旋转人脸为负角度。从俯视图来看,顺时针旋转为正角度,逆时针旋转为负角度。
所述方法400具体可以由如图1所示的训练设备120执行,所述方法400中的人脸图像和人脸旋转图像可以是如图1所示的数据库130中维护的训练数据,可选的,所述方法400的S402和S403可以在训练设备120中执行,也可以在训练设备120之前由其他功能模块预先执行,即先对从所述数据库130中接收或者获取到的训练数据进行预处理,如S402和S403所述的姿态编码过程,得到人脸图像的姿态编码图和人脸旋转图像的姿态编码图,作为所述训练设备120的输入,并由所述训练设备120执行S404至S408。
可选的,所述方法400可以由CPU处理,也可以由CPU和GPU共同处理,也可以不用GPU,而使用其他适合用于神经网络计算的处理器,本申请不做限制。
所述训练设备120具体可以用于训练本申请实施例提供的生成式对抗网络,如前所述,本申请实施例提供的生成式对抗网络包括人脸生成网络以及多个耦合对抗的判别网络,需要说明的是,在本申请实施例中虽然仅给出第一判别网络和第二判别网络两个判别网络的举例,但是本申请实施例并不限定判别网络的具体个数,如三个判别网络或者四个判别网络甚至更多,这些不同的判别网络可以基于不同的判别条件对所述人脸生成网络生成的预测图像进行判别,从而可以使得所述生成式对抗网络能够根据这些不同的条件对人脸图像的不同方面进行调整和把握,从而输出更加准确的人脸旋转图像。
在实施例一提供的方法400的基础上,一种可能的实现方式为,在S407之前,所述方法400还可以包括:
S406a,根据所述人脸旋转图像及所述预测人脸旋转图像得到真实图像损失,所述真实图像损失包括像素损失、全变分损失(Total Variation Regularization)及身份识别特征损失中的至少一个损失;对应的,所述生成式对抗网络的总损失根据所述真实图像损失中的至少一个损失、所述第一损失和第二损失加权求和得到。
当所述真实图像损失包括像素损失时,所述总损失等于像素损失、第一损失和第二 损失的加权求和得到的结果;当所述真实图像损失包括全变分损失时,所述总损失等于全变分损失、第一损失和第二损失的加权求和得到的结果;当所述真实图像损失包括身份识别特征损失时,所述总损失等于身份识别特征损失、第一损失和第二损失的加权求和得到的结果;当真实图像损失包括像素损失、全变分损失和身份识别特征损失三者时,所述总损失为这三个损失与第一损失、第二损失的加权求和得到的结果。当所述真实图像损失包括像素损失、全变分损失和身份识别特征损失三者中的任两者时,由该任两者与第一损失、第二损失进行加权求和得到所述总损失,此处不再赘述。
在这种实现方式中,不仅考虑第一损失和第二损失,还考虑到图像真实损失,如上所述的像素损失、全变分损失及身份识别特征损失。其中,通过像素损失的引入,在丰富所述生成式对抗网络的训练条件的基础上进一步考虑了训练的稳定性并可以加速收敛所述生成式对抗网络的训练;通过全变分损失的引入,在丰富所述生成式对抗网络的训练条件的基础上防止了生成的预测人脸图像出现局部梯度过大从而产生局部瑕疵;而由于所述身份识别特征用于保证生成的所述预测人脸旋转图像中包含的人脸与输入的训练数据(人脸图像和人脸旋转图像)中包含的人脸的身份信息保持不变。因此,通过身份识别特征损失的引入,使得所述训练得到的生成式对抗网络可以生成具有更加准确的身份信息的旋转后图像;当所述图像真实损失包括上述三个损失中的两个或三个均包含时,对应的效果将都被考虑到。
在实施例一提供的方法400以及其可能的实现方式的基础上,所述S402具体包括:
利用关键点检测算法对所述人脸图像进行检测,得到所述人脸图像的N个关键点分别对应的位置坐标,构造与所述人脸图像尺寸相同的N张第一图像,所述N张第一图像与所述N个关键点一一对应,分别以所述N个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第一图像进行高斯模糊(gaussian blur)处理,得到N张第一高斯模糊图,所述N张第一高斯模糊图为所述人脸图像的姿态编码图,N为大于1的正整数。
其中,所述构造与所述人脸图像尺寸相同的N张第一图像包括:
生成N个全0矩阵,每一个全0矩阵对应一个关键点;将该关键点在所述人脸图像中的位置映射到该全0矩阵中的相应位置,并将该全0矩阵中的相应位置的值由0改为1;由此生成N个独热码(one-hot code)矩阵,所述N个独热码矩阵为所述N张第一图像。
在本申请文件中所述的独热码矩阵是指:只有一个向量值为1,其余向量值全为零的矩阵。下文对此不再赘述。
其中,所述分别以所述N个关键点中的每个关键点为中心,对与所述每个关键点一 一对应的第一图像进行高斯模糊处理,包括:
分别以每个独热码矩阵中值为1的点为中心,对所述独热码矩阵进行高斯模糊处理。
这种通过关键点进行高斯模糊从而实现图像姿态编码的方式,对人脸姿态的描述更加精确和鲁棒,通过更加精确和鲁棒的人脸姿态描述,可以使得所述人脸生成网络生成的预测人脸旋转图像更加接近真实的所述人脸旋转图像。
需要说明的是,所述N个全0矩阵的尺寸(即行数和列数)与所述人脸图像的尺寸相同,因为所述N个独热码矩阵为所述N张第一图像,每个独热码矩阵中为1的值是对应一个关键点在人脸图像中的位置的,举例来说,当N等于5的时候,假设这5个关键点为人脸的五个关键位置对应的点,如左眼球中心,右眼球中心,鼻尖,左嘴角,右嘴角。以鼻尖这个关键点为例,假设鼻尖在人脸图像的正中央位置,则鼻尖对应的独热码矩阵中正中央位置的值为1,其余位置的值仍为0,再以左嘴角为例,假设左嘴角在人脸图像的坐标位置(x,y)时,则左嘴角对应的独热码矩阵在其(x,y)的坐标位置处的值为1,其余位置的值仍为0。
另外,在本申请中出现图像尺寸,矩阵尺寸,姿态编码图尺寸等表述,其中尺寸均可以理解为行×列,例如,S401中所述的人脸图像和人脸旋转图像即具有相同的尺寸,意思是所述人脸图像和所述人脸旋转图像在进入神经网络之后均以相同尺寸的矩阵形式呈现,当然此处的矩阵还可以是张量,张量可以理解为具有纵深的矩阵,如常规的矩阵是X×Y,其中X为矩阵的行,Y为矩阵的列,张量则为X×Y×Z,其中Z则为矩阵的纵深。可以理解的是,在上述方法400中,所述人脸图像,人脸旋转图像,人脸图像的姿态编码图,人脸旋转图像的姿态编码图均可以具有相同尺寸,或者叫做同型矩阵。所述生成预测人脸旋转图像也可以与上述图像或姿态编码图具有相同的尺寸。由于在做人脸旋转的过程中,图像尺寸是不被改变的,因此可以理解为所述生成式对抗网络的输入和输出的图像数据均具有相同尺寸。
另外,还需要说明的是,所述利用关键点检测算法对所述人脸图像进行检测,得到所述人脸图像的N个关键点分别对应的位置坐标,这里的N个关键点如上举例可以是5个,当然也可以是10个或者其他更多或更少的数目,本方案不对此进行限定,具体N等于多少可以取决于所述关键点检测算法,即可以预先根据需求设计好关键点的数目,此处不再赘述。
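The pose-encoding procedure described above (one all-zero one-hot matrix per keypoint, followed by Gaussian blurring centred on the keypoint) can be sketched as follows. This is only an illustrative example under assumed inputs, not the patent's reference implementation; the landmark coordinates, the image size and the blur radius `sigma` are all placeholders:

```python
# Illustrative sketch of pose encoding: build an all-zero (one-hot) map per keypoint,
# set the keypoint position to 1, then Gaussian-blur each map around that point.
import numpy as np
from scipy.ndimage import gaussian_filter

def pose_encode(keypoints, height, width, sigma=2.0):
    """keypoints: list of (x, y) pixel coordinates of the N detected facial landmarks."""
    maps = []
    for (x, y) in keypoints:
        one_hot = np.zeros((height, width), dtype=np.float32)  # all-zero matrix, same size as the face image
        one_hot[y, x] = 1.0                                    # map the landmark position into the matrix
        maps.append(gaussian_filter(one_hot, sigma=sigma))     # Gaussian blur centred on the landmark
    return np.stack(maps, axis=0)                              # N first Gaussian-blur maps = pose-encoding map

# Example with 5 assumed landmarks (eye centres, nose tip, mouth corners) on a 128x128 image
landmarks = [(40, 52), (88, 52), (64, 72), (48, 96), (80, 96)]
pose_map = pose_encode(landmarks, height=128, width=128)
print(pose_map.shape)  # (5, 128, 128)
```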
在实施例一提供的方法400以及其可能的实现方式的基础上,所述S403具体包括:
利用关键点检测算法对所述人脸旋转图像进行检测,得到所述人脸旋转图像的M个 关键点分别对应的位置坐标,构造与所述人脸旋转图像尺寸相同的M张第二图像,所述M张第二图像与所述M个关键点一一对应,分别以所述M个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第二图像进行高斯模糊处理,得到M张第二高斯模糊图,所述M张第二高斯模糊图为所述人脸旋转图像的姿态编码图,M为大于1的正整数。
这里的所述构造与所述人脸旋转图像尺寸相同的M张第二图像,可以理解为与上面所述构造与所述人脸图像尺寸相同的N张第一图像的方式相同,因此此处不再赘述。
在实施例一提供的方法400以及其可能的实现方式的基础上,所述S405具体包括:
以所述人脸图像作为所述第一判别网络的判别条件,根据所述第一判别网络判断所述人脸旋转图像和所述预测人脸旋转图像的真假性,并根据判别结果生成所述第一损失;其中,所述第一判断网络包括二分类判别器,所述二分类判别器用于判断为真或判断为假。
具体的,所述以所述人脸图像作为所述第一判别网络的判别条件,根据所述第一判别网络判断所述人脸旋转图像和所述预测人脸旋转图像的真假性,并根据判别结果生成所述第一损失,包括:
L_{ii} = \mathbb{E}_{I_b\sim H(I_b)}\left[\log D_{\theta_{ii}}(I_a, I_b)\right] + \mathbb{E}_{\hat{I}_b\sim H(\hat{I}_b)}\left[\log\left(1 - D_{\theta_{ii}}(I_a, \hat{I}_b)\right)\right]
其中，L_{ii}为所述第一损失，I_a为所述人脸图像，I_b为所述人脸旋转图像，\hat{I}_b为所述预测人脸旋转图像；\mathbb{E}_{I_b\sim H(I_b)}[\cdot]表示在所述人脸旋转图像I_b的分布H(I_b)上求期望，即所述人脸旋转图像I_b为真的概率；\log D_{\theta_{ii}}(\cdot)表示所述第一判别网络的损失函数；\mathbb{E}_{\hat{I}_b\sim H(\hat{I}_b)}[\cdot]表示在所述预测人脸旋转图像\hat{I}_b的分布H(\hat{I}_b)上的期望，即所述预测人脸旋转图像\hat{I}_b为真的概率；D_{\theta_{ii}}(\cdot,\cdot)为以所述人脸图像为条件的所述第一判别网络，\theta_{ii}为所述第一判别网络的参数，(I_a, I_b)和(I_a, \hat{I}_b)为所述第一判别网络的输入。
在实施例一提供的方法400以及其可能的实现方式的基础上,所述S406具体包括:以所述人脸旋转图像的姿态编码图作为所述第二判别网络的判别条件,根据所述第二判别网络判断所述人脸旋转图像和所述预测人脸旋转图像的真假性,并根据判别结果生成所述第二损失;其中,所述第二判断网络包括二分类判别器,所述二分类判别器用于判断为真或判断为假。
具体的,以所述人脸旋转图像的姿态编码图作为所述第二判别网络的判别条件,根据所述第二判别网络判断所述人脸旋转图像和所述预测人脸旋转图像的真假性,并根据判别结果生成所述第二损失,包括:
L_{ip} = \mathbb{E}_{I_b\sim H(I_b)}\left[\log D_{\theta_{ip}}(P_b, I_b)\right] + \mathbb{E}_{\hat{I}_b\sim H(\hat{I}_b)}\left[\log\left(1 - D_{\theta_{ip}}(P_b, \hat{I}_b)\right)\right]
其中，L_{ip}为所述第二损失，I_b为所述人脸旋转图像，\hat{I}_b为所述预测人脸旋转图像，P_b为所述人脸旋转图像的姿态编码图；\mathbb{E}_{I_b\sim H(I_b)}[\cdot]表示在所述人脸旋转图像I_b的分布H(I_b)上求期望，即所述人脸旋转图像I_b为真的概率；\log D_{\theta_{ip}}(\cdot)表示所述第二判别网络的损失函数；\mathbb{E}_{\hat{I}_b\sim H(\hat{I}_b)}[\cdot]表示在所述预测人脸旋转图像\hat{I}_b的分布H(\hat{I}_b)上的期望，即所述预测人脸旋转图像\hat{I}_b为真的概率；D_{\theta_{ip}}(\cdot,\cdot)为以所述人脸旋转图像的姿态编码图为条件的所述第二判别网络，\theta_{ip}为所述第二判别网络的参数，(P_b, I_b)和(P_b, \hat{I}_b)为所述第二判别网络的输入。
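For illustration only, the two coupled adversarial losses above can be computed as in the following sketch, assuming `d_ii_real` / `d_ii_fake` and `d_ip_real` / `d_ip_fake` stand for the probability scores produced by the first and second discrimination networks on a batch of real and generated pairs (all names and numbers here are placeholders, not from the patent):

```python
# Illustrative sketch of L_ii / L_ip: E[log D(real)] + E[log(1 - D(fake))] over a batch.
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-8):
    # d_real: discriminator scores on (condition, real image) pairs
    # d_fake: discriminator scores on (condition, predicted image) pairs
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# Dummy scores standing in for D_theta_ii(I_a, I_b) and D_theta_ii(I_a, predicted I_b)
d_ii_real = np.array([0.90, 0.80, 0.85, 0.95])
d_ii_fake = np.array([0.20, 0.30, 0.25, 0.15])
L_ii = adversarial_loss(d_ii_real, d_ii_fake)

# Dummy scores standing in for D_theta_ip(P_b, I_b) and D_theta_ip(P_b, predicted I_b)
d_ip_real = np.array([0.70, 0.90, 0.80, 0.85])
d_ip_fake = np.array([0.10, 0.35, 0.20, 0.30])
L_ip = adversarial_loss(d_ip_real, d_ip_fake)
print(L_ii, L_ip)
```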
在实施例一提供的方法400以及其可能的实现方式的基础上,当所述真实图像损失包括像素损失,所述S406a具体可以包括执行如下计算:
L_{pix} = \sum_{s=1}^{S}\left\|\hat{I}_b^{\,s} - I_b^{\,s}\right\|_{1}
其中，L_{pix}是所述像素损失，S是尺度量，\hat{I}_b为所述预测人脸旋转图像，I_b为所述人脸旋转图像，\hat{I}_b^{\,s}和I_b^{\,s}分别为二者缩放到第s个尺度后的图像；\left\|\hat{I}_b^{\,s} - I_b^{\,s}\right\|_{1}表示将所述预测人脸旋转图像和所述人脸旋转图像进行缩放到S尺度量时计算像素差值的1范数损失。
在实施例一提供的方法400以及其可能的实现方式的基础上,当所述真实图像损失包括全变分损失,所述S406a具体可以包括执行如下计算:
L_{tv} = \sum_{c=1}^{C}\sum_{w=1}^{W-1}\sum_{h=1}^{H-1}\left(\left|\hat{I}_b(w+1,h,c)-\hat{I}_b(w,h,c)\right| + \left|\hat{I}_b(w,h+1,c)-\hat{I}_b(w,h,c)\right|\right)
其中，L_{tv}是所述全变分损失，即所述预测人脸旋转图像\hat{I}_b在横向和纵向两个方向一阶梯度绝对值的和，其中，W表示所述预测人脸旋转图像的宽，H表示所述预测人脸旋转图像的高，C表示所述预测人脸旋转图像通道数。
在实施例一提供的方法400以及其可能的实现方式的基础上,当所述真实图像损失包括身份识别特征损失,所述S406a具体可以包括执行如下计算:
L_{ip} = \left\|f_{pool}(\hat{I}_b) - f_{pool}(I_b)\right\|_2^2 + \left\|f_{fc}(\hat{I}_b) - f_{fc}(I_b)\right\|_2^2
其中，身份识别特征用来保证所述预测人脸旋转图像和所述人脸图像之间的身份信息保持不变，L_{ip}表示所述身份识别特征损失，f为预先训练好的人脸识别模型，所述人脸识别模型f为深度神经网络，所述深度神经网络包括至少一个池化层和至少一个全连接层，其中，f_{pool}(\cdot)表示所述人脸识别模型f的最后一个池化层的输出，f_{fc}(\cdot)表示所述人脸识别模型f最后一个全连接层的输出。
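A minimal sketch of the three real-image losses (pixel loss, total-variation loss and identity-feature loss) is given below for illustration only; the scale set, the identity feature arrays and all tensor shapes are assumptions rather than the patent's reference implementation:

```python
# Illustrative sketch of the real-image losses on (C, H, W) image arrays.
import numpy as np

def pixel_loss(pred, gt, scales=(1, 2, 4)):
    # mean L1 pixel difference computed after down-sampling to several scales
    return sum(np.mean(np.abs(pred[:, ::s, ::s] - gt[:, ::s, ::s])) for s in scales) / len(scales)

def tv_loss(pred):
    # sum of absolute first-order gradients along the horizontal and vertical directions
    return (np.abs(pred[:, 1:, :] - pred[:, :-1, :]).sum()
            + np.abs(pred[:, :, 1:] - pred[:, :, :-1]).sum())

def identity_loss(feat_pool_pred, feat_pool_gt, feat_fc_pred, feat_fc_gt):
    # squared L2 distance between identity features (last pooling / last FC layer outputs)
    return np.sum((feat_pool_pred - feat_pool_gt) ** 2) + np.sum((feat_fc_pred - feat_fc_gt) ** 2)

pred = np.random.rand(3, 64, 64)   # dummy predicted face rotation image
gt = np.random.rand(3, 64, 64)     # dummy real face rotation image
print(pixel_loss(pred, gt), tv_loss(pred))
```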
在实施例一提供的方法400以及其可能的实现方式的基础上,所述S407具体可以包括:
更新所述人脸生成网络,以使得所述人脸生成网络的误差最小;
更新所述第一判别网络和所述第二判别网络,以使得所述第一损失和所述第二损失的值最大;
交替迭代上述更新直到所述生成式对抗网络达到收敛。
具体的,上述更新过程可以理解为:
根据所述总损失更新所述人脸生成网络,以使其生成的预测人脸旋转图像尽可能的混淆所述第一判别网络和所述第二判别网络,换句话说,更新人脸生成网络的目的就是要使它尽可能的生成让判别网络难以识别真假的预测人脸旋转图像,这里的判别网络包括第一判别网络和第二判别网络,下同,不再赘述。
根据所述总损失更新所述第一判别网络和所述第二判别网络,以使得所述第一损失和所述第二损失的值最大,形象的说,更新第一判别网络和第二判别网络的目的是使其尽可能的对人脸生成网络生成的预测人脸旋转图像做出识别,即识别出其为真或为假。
如上所述,更新人脸生成网络为了混淆判别网络,更新判别网络为了使其不被混淆,二者相互对抗,形成博弈,最终达到动态平衡,即交替迭代上述更新直到所述生成式对抗网络达到收敛
需要说明的是,此处的更新所述人脸生成网络,以使得所述人脸生成网络的误差最小,是指所述人脸生成网络生成的预测人脸旋转图像被判别网络识别出来为真或者为假的可能性尽可能的小。
还需要说明的是,此处的更新所述第一判别网络和所述第二判别网络,以使得所述第一损失和所述第二损失的值最大,具体的实现方式如下所述:
以更新所述第一判别网络为例,可以理解的是,所述第一判别网络作为一个二分类判别器,可以有两个节点分别进行输出,其中一个节点用于输出判别为真的概率,如0.3,另一个节点则用于输出判别为假的概率,显然,若为真的概率是0.3,则为假的概率是1-0.3=0.7。此时第一损失的取值为0.7,即第一损失取两个节点输出的值中较大的那一个。再举例说,若判别为真的概率是0.9,则判别为假的概率为0.1,此时第一损失的取值为0.9。因此,更新所述第一判别网络以使得所述第一损失的值最大,目的在于更新所述第一判别网络,使所述第一判别网络尽可能的识别出所述人脸生成网络生成的预测图像和 真实图像之间的区别。上面说的预测图像即所述预测人脸旋转图像,这里说的真实图像即接收的所述人脸旋转图像。
可以看出,在所述的生成式对抗网络的训练中,所述人脸生成网络与判别网络(包括所述第一判别网络和所述第二判别网络)之间是一种对抗的关系,或称“博弈”的过程。人脸生成网络要努力生成难以被识别的预测图像,而判别网络要努力识别出预测图像与真实图像之间的区别,这种动态的“博弈”具体体现在参数的更新上,直到更新后的参数使得两者动态平衡,即达到了整体最优的状态,停止更新,或者说停止对所述生成式对抗网络的训练,输出训练后的人脸生成网络。
在上述的生成式对抗网络中,第一判别网络以所述人脸图像作为判别条件,第二判别网络以所述人脸旋转图像的姿态编码图作为判别条件,二者最后得到的判别结果:第一损失和第二损失,并通过对所述第一损失和所述第二损失加权求和,得到加权求和的结果作为所述生成式对抗网络的总损失,该总损失用于更新所述生成式对抗网络(包括所述人脸生成网络、所述第一判别网络和所述第二判别网络),迭代上述步骤直至整个生成式对抗网络达到动态平衡或全局最优,则停止更新,输出训练后的人脸生成网络。由此训练得到的人脸生成网络对于人脸的表观真实性以及人脸姿态两方面的信息都能有非常好的把握。综上所述:由于所述第一判别网络以所述人脸图像作为判别条件,因此可以理解人脸的表观真实性由所述第一判别网络把握,由于所述第二判别网络以所述人脸旋转图像的姿态编码图作为判别条件,因此可以理解人脸姿态由所述第二判别网络把握。
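The alternating update described above can be summarised by the following sketch. It is only a schematic outline: the loss weights are hypothetical values, and in practice the face generation network and the two discrimination networks would each be updated by back-propagation with their own optimiser:

```python
# Illustrative sketch of the GAN's total loss as a weighted sum of the individual losses.
def total_loss(l_ii, l_ip, l_pix=0.0, l_tv=0.0, l_id=0.0,
               weights=(1.0, 1.0, 10.0, 1e-4, 0.02)):   # hypothetical weights
    w1, w2, w3, w4, w5 = weights
    return w1 * l_ii + w2 * l_ip + w3 * l_pix + w4 * l_tv + w5 * l_id

# Training alternates two updates until this total loss converges:
#   (a) update the first and second discrimination networks so that L_ii and L_ip are maximised;
#   (b) update the face generation network so that the total loss is minimised;
# after convergence, the trained face generation network is output.
print(total_loss(l_ii=-0.4, l_ip=-0.5, l_pix=0.08, l_tv=120.0, l_id=3.0))
```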
实施例二:
图5为本发明实施例二提供的一种人脸旋转图像的生成方法500,包括:
S501,接收人脸图像;
S502,根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图;
S503,根据人脸旋转角度从训练数据集中获取多张第一训练图片,所述多张第一训练图片均包含人脸,且所述多张第一训练图片中包含的人脸呈现的旋转角度均为所述人脸旋转角度;
S504,根据目标人脸图像中的两个或两个以上关键点对所述目标人脸图像进行姿态编码,得到所述目标人脸图像的姿态编码图;其中,所述目标人脸图像是根据所述多张第一训练图片得到的;
S505,根据所述人脸图像、所述人脸图像的姿态编码图和所述目标人脸图像的姿态编码图生成待输入信号,其中所述人脸图像的尺寸、所述人脸图像的姿态编码图的尺寸和所述目标人脸图像的姿态编码图的尺寸相同;
S506,将所述待输入信号输入人脸旋转图像生成模型得到人脸旋转图像。
本申请实施例提供的人脸旋转图像的生成方法,通过对人脸图像和目标人脸图像进行姿态编码,得到人脸旋转图像生成模型的输入信号,并进一步通过所述人脸旋转图像生成模型生成人脸旋转图像,由于所述姿态编码方式对人脸姿态的描述更加精确和鲁棒,因此生成的人脸旋转图像也更加准确。另外,由于该方法提供的目标人脸图像是根据所述多张第一训练图片得到的,所述多张第一训练图片中包含的人脸呈现的旋转角度相同,这里的旋转角度可以是用户预设的,如,用户输入人脸图像,并指示图像生成设备生成预设角度的人脸旋转图像,则上述多张第一训练图片中包含的人脸呈现的旋转角度均为该预设角度;通过这样的设置,本申请实施例提供的人脸旋转图像的生成方法对人脸旋转的角度没有限制,即,可以实现各种不同角度的人脸旋转。
所述方法500具体可以由如图1所示的执行设备110执行,所述方法500中的人脸图像可以是如图1所示的客户设备140给出的输入数据,所述执行设备110中的预处理模块113可以用来执行所述方法500中S502和S504所述的姿态编码过程,所述执行设备110中的预处理模块114可以用来执行所述方法500中的S503。所述预处理模块113还可以用于执行所述S505,所述执行设备110中的计算模块111可以用于执行所述S506。
所述执行设备110具体可以用于训练本申请实施例提供的生成式对抗网络,
值得说明的是,在所述方法500中,所述根据人脸旋转角度从训练数据集中获取多张第一训练图片,所述多张第一训练图片均包含人脸,这里的人脸和所述人脸图像中的人脸,不要求是同一张人脸,事实上,所述人脸图像在所述方法500中,可以是用户输入的实时的待旋转的人脸,而所述多张第一训练图片是数据库维护的训练数据集,因此所述多张第一训练图片中包含的人脸与所述人脸图像包含的人脸可以认为没有直接关系,当然人脸图像中包含的人脸也可以出现在数据库中被当做训练数据进行使用。
可选的,所述方法500可以由CPU处理,也可以由CPU和GPU共同处理,也可以不用GPU,而使用其他适合用于神经网络计算的处理器,本申请不做限制。
需要说明的是,在所述方法500中,所述根据所述人脸图像、所述人脸图像的姿态编码图和所述目标人脸图像的姿态编码图生成待输入信号,具体可以是通过特征融合的方式融合所述人脸图像、所述人脸图像的姿态编码图和所述目标人脸图像的姿态编码图 得到所述待输入信号。特征融合是将有区分意义并且具有互补作用的特征,通过某种方式有机地结合在一起作为统一的特征。特征融合是生物识别技术领域常用的一种技术手段,可以采用多种方式实现特征的融合。融合后的特征包含的信息更加准确、更加丰富。可以理解,所述待输入信号相比于所述人脸图像、所述人脸图像的姿态编码图和所述目标人脸图像的姿态编码图中的任一图像或姿态编码图包含的信息更加准确、也更加丰富。因此,利用所述待输入信号进行人脸旋转图像的生成,可以使生成的人脸旋转图像更准确,当该人脸旋转方法被应用到人脸识别的应用场景中时,更准确的人脸旋转图像可以用于提高人脸识别的准确度。
在实施例二提供的方法500的基础上,一种可能的实现方式为,所述S502具体可以包括:
利用关键点检测算法对所述人脸图像进行检测,得到所述人脸图像的N个关键点分别对应的位置坐标,N为大于1的整数;
构造与所述人脸图像尺寸相同的N张第一图像,所述N张第一图像与所述N个关键点一一对应;
分别以所述N个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第一图像进行高斯模糊处理,得到N张第一高斯模糊图,所述N张第一高斯模糊图为所述人脸图像的姿态编码图。
其中,所述构造与所述人脸图像尺寸相同的N张第一图像包括:
生成N个全0矩阵,每一个全0矩阵对应一个关键点;将该关键点在所述人脸图像中的位置映射到该全0矩阵中的相应位置,并将该全0矩阵中的相应位置的值由0改为1;由此生成N个独热码矩阵,所述N个独热码矩阵为所述N张第一图像。
其中,所述分别以所述N个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第一图像进行高斯模糊处理,包括:
分别以每个独热码矩阵中值为1的点为中心,对所述独热码矩阵进行高斯模糊处理。
这种通过关键点进行高斯模糊从而实现图像姿态编码的方式,对人脸姿态的描述更加精确和鲁棒,通过更加精确和鲁棒的人脸姿态描述,可以使得所述人脸旋转图像生成模型生成的人脸旋转图像更加接近真实的人脸旋转图像。
在实施例二提供的方法500以及其可能的实现方式的基础上,所述S504具体包括:
利用关键点检测算法对所述目标人脸图像进行检测,得到所述目标人脸图像的M个关键点分别对应的位置坐标,M为大于1的整数;
构造与所述目标人脸图像尺寸相同的M张第二图像,所述M张第二图像与所述M个关键点一一对应;
分别以所述M个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第二图像进行高斯模糊处理,得到M张第二高斯模糊图,所述M张第二高斯模糊图为所述目标人脸图像的姿态编码图。
所述构造与所述目标人脸图像尺寸相同的M张第二图像,与上面所述构造与所述人脸图像尺寸相同的N张第一图像的方式相同,因此此处不再赘述。
在实施例二提供的方法500以及其可能的实现方式的基础上,一种可能的实现方式为:所述目标人脸图像是根据所述多张第一训练图片得到的,包括:
所述目标人脸图像是根据所述多张第一训练图片的姿态编码图的平均值得到的。
需要说明的是:所述多张第一训练图片的姿态编码图可以使用与所述S502和S504的姿态编码方法相同的姿态编码方法获得,即针对每一张第一训练图片,先利用关键点检测算法对这张第一训练图片进行检测,得到其中的N个人脸关键点(facial landmark)分别对应的位置坐标,然后根据这N个关键点分别对应的位置坐标,生成N个与这N个关键点一一对应的独热码,再以每个独热码中值为1的点为中心进行高斯模糊得到N张高斯模糊图,这样对每一张第一训练图片都做完姿态编码之后,再进行求平均,具体的求平均的方式,可以是对所有高斯模糊图对应的位置像素值进行相加再求平均。
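As an illustration of the averaging step, assuming a stack of pose-encoding maps of the first training pictures (for example produced by the `pose_encode` sketch given earlier), the target pose-encoding map could be obtained as follows; this is a hypothetical sketch, not the patent's reference implementation:

```python
# Illustrative sketch: average the pose-encoding maps of the first training pictures
# (all presenting the same face rotation angle) to obtain the target pose-encoding map.
import numpy as np

def target_pose_map(pose_maps):
    """pose_maps: array of shape (num_pictures, N, H, W), one pose-encoding map per training picture."""
    return np.mean(pose_maps, axis=0)   # element-wise average over the pictures

dummy_maps = np.random.rand(8, 5, 128, 128)   # 8 training pictures, 5 keypoint channels each
print(target_pose_map(dummy_maps).shape)      # (5, 128, 128)
```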
在实施例二提供的方法500以及其可能的实现方式的基础上,一种可能的实现方式为:所述人脸旋转图像生成模型是根据训练生成式对抗网络得到的,所述生成式对抗网络包括至少一个人脸生成网络以及至少两个判别网络,所述至少两个判别网络形成耦合对抗,用于产生对抗损失,所述对抗损失用于更新所述至少一个人脸生成网络以及所述至少两个判别网络,所述更新后的至少一个人脸生成网络为所述人脸旋转图像生成模型。
此处的人脸旋转图像生成模型可以是上述实施例一训练得到的人脸生成网络。
可以理解实施例一为该人脸生成网络的训练阶段(如图1所示的训练设备120执行的阶段),具体训练是采用由实施例一以及实施例一基础上任意一种可能的实现方式中提供的生成式对抗网络进行的;而实施例二则可以理解为是该人脸生成网络的应用阶段(如图1所示的执行设备110执行的阶段),具体可以体现为采用由实施例一训练得到的人脸生成网络,并根据用户输入的待旋转的人脸图像,在实施例二中也称人脸旋转图像,从而得到输出图像,即实施例二中的人脸旋转图像。当然由于在实施例二中,待输入信号在被输入所述人脸旋转图像生成模型之前,经过了相应的预处理,如实施例二的 S502和S504所述的姿态编码过程,得到相应的姿态编码图,并可以通过上面所述的特征融合的方式,对S502输出的人脸图像的姿态编码图和S504输出的目标人脸图像的姿态编码图以及S501接收的人脸图像进行特征融合,得到更为丰富的特征图即所述待输入信号,该待输入信号综合了S501,S502和S504的特征,使得所述人脸旋转图像生成模型基于该待输入信号得到的人脸旋转图像质量更好,即更接近真实的人脸旋转图像。
如前面所述,虽然在实施例一的网络训练阶段和实施例二的网络应用阶段,为了表达的简洁与直观,均使用了人脸图像和人脸旋转图像的表述,但是由于两个实施例分别属于不同的实施例用于表示不用的阶段,因此不应该将两个实施例中的人脸图像理解为相同图像,事实上,实施例一中的人脸图像为训练数据,可以表示真实图像,也可以是经插值操作得到的虚拟图像,而实施例二中的人脸图像通常是用户输入的真实人脸图像;同样的,在实施例一中的人脸旋转图像也是训练图像,其可以是真实的图像,也可以是经插值操作得到的虚拟图像;与实施例一中的人脸图像形成训练数据对,对所述生成式对抗网络进行训练;而实施例二中的人脸旋转图像为由所述人脸旋转图像生成模型生成的图像,该图像理论上应尽可能的与真实的人脸旋转图像相同,但是具体要根据人脸旋转图像生成方法的能力而定。
上文结合图1至图5对本申请实施例的人脸旋转图像生成方法及生成式对抗网络的训练方法进行了详细的描述。下文结合图6至图9对本申请实施例的人脸旋转图像生成装置以及生成式对抗网络的训练装置进行描述,应理解,图6至图9所示的动作识别装置具体可以是监控设备、终端设备、网络服务器以及网络云平台等具有图片处理功能的设备。图6至图9所示的装置可以执行本申请实施例的对应方法的各个步骤,为了简洁,下面适当省略重复的描述。
图6是本申请实施例提供的一种生成式对抗网络的训练装置600的示意性框图。所述生成式对抗网络包括人脸生成网络,以及多个耦合对抗的判别网络,所述耦合对抗的判别网络至少包括第一判别网络和第二判别网络,所述装置600包括:
接收单元601,用于接收人脸图像,以及人脸旋转图像;所述人脸图像和所述人脸旋转图像为同一张人脸旋转前和旋转后的图像;
姿态编码单元602,用于根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图;
所述姿态编码单元603,还用于根据所述人脸旋转图像中的两个或两个以上关键点对所述人脸旋转图像进行姿态编码,得到所述人脸旋转图像的姿态编码图;
人脸生成单元604,用于将所述人脸图像、所述人脸图像的姿态编码图以及所述人脸旋转图像的姿态编码图输入所述人脸生成网络,以生成预测人脸旋转图像;
第一判别单元605,用于将所述人脸图像、所述人脸旋转图像和所述预测人脸旋转图像输入所述第一判别网络以得到第一损失;
第二判别单元606,用于将所述人脸旋转图像、所述人脸旋转图像的姿态编码图以及所述预测人脸旋转图像输入所述第二判别网络以得到第二损失;
反向传播单元607,用于根据所述生成式对抗网络的总损失更新所述人脸生成网络、所述第一判别网络以及所述第二判别网络,所述生成式对抗网络的总损失根据所述第一损失和第二损失加权求和得到;
输出单元608,直到所述生成式对抗网络的总损失收敛,用于输出训练后的人脸生成网络。
本申请实施例提供的生成式对抗网络的训练装置,通过对对人脸图像和人脸旋转图像进行姿态编码,得到人脸图像的姿态编码图和人脸旋转图像的姿态编码图,并通过所述生成式对抗网络中的人脸生成网络生成预测人脸旋转图像;进而通过至少第一判别网络和第二判别网络分别对预测人脸旋转图像进行判别得到第一损失和第二损失,将所述第一损失和第二损失进行加权求和得到生成式对抗网络的总损失,并由该总损失更新所述生成式对抗网络中的人脸生成网络以及第一判别网络和第二判别网络。由于上述的姿态编码方式对人脸姿态的描述更加精确和鲁棒,使得人脸生成网络或者判别网络通过上述的姿态编码图得到的预测人脸旋转图像也更加接近真实的人脸旋转图像。另外,在该训练装置中,由于对训练数据(人脸图像和人脸旋转图像)的旋转角度并没有限制,因此该训练得到的网络也可以适应于各种不同角度的人脸旋转,由此提升人脸旋转的可操作性以及用户体验。并且,通过使用第一判别网络和第二判别网络,且所述第一判别网络和第二判别网络耦合对抗,使得不同的判别网络可以通过不同的条件对所述人脸生成网络生成的预测人脸旋转图像进行判别,且不同判别网络得到判别结果都会对所述生成式对抗网络产生影响,从而使得所述生成式对抗网络能够根据上述不同的条件对人脸图像的不同方面进行调整和把握,从而输出更加准确的人脸旋转图像。
图7是本申请实施例提供的一种人脸旋转图像的生成装置700的示意性框图。所述装置700包括:
接收单元701,用于接收人脸图像;
姿态编码单元702,用于根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图;
获取单元703,用于根据人脸旋转角度从训练数据集中获取多张第一训练图片,所述多张第一训练图片均包含人脸,且所述多张第一训练图片中包含的人脸呈现的旋转角度均为所述人脸旋转角度;
所述姿态编码单元702,还用于根据目标人脸图像中的两个或两个以上关键点对所述目标人脸图像进行姿态编码,得到所述目标人脸图像的姿态编码图;其中,所述目标人 脸图像是根据所述多张第一训练图片得到的;
信号生成单元704,用于根据所述人脸图像、所述人脸图像的姿态编码图和所述目标人脸图像的姿态编码图生成待输入信号,其中所述人脸图像的尺寸、所述人脸图像的姿态编码图的尺寸和所述目标人脸图像的姿态编码图的尺寸相同;
图像生成单元705,用于将所述待输入信号输入人脸旋转图像生成模型得到人脸旋转图像。
本申请实施例提供的人脸旋转图像的生成方法,通过对人脸图像和目标人脸图像进行姿态编码,得到人脸旋转图像生成模型的输入信号,并进一步通过所述人脸旋转图像生成模型生成人脸旋转图像,由于所述姿态编码方式对人脸姿态的描述更加精确和鲁棒,因此生成的人脸旋转图像也更加准确。另外,由于该方法提供的目标人脸图像是根据所述多张第一训练图片得到的,所述多张第一训练图片中包含的人脸呈现的旋转角度相同,这里的旋转角度可以是用户预设的,如,用户输入人脸图像,并指示图像生成设备生成预设角度的人脸旋转图像,则上述多张第一训练图片中包含的人脸呈现的旋转角度均为该预设角度;通过这样的设置,本申请实施例提供的人脸旋转图像的生成方法对人脸旋转的角度没有限制,即,可以实现各种不同角度的人脸旋转。
图8是本申请实施例提供的一种生成式对抗网络的训练装置的硬件结构示意图。图8所示的生成式对抗网络的训练装置800(该装置800具体可以是一种计算机设备)包括存储器801、处理器802、通信接口803以及总线804。其中,存储器801、处理器802、通信接口803通过总线804实现彼此之间的通信连接。
存储器801可以是只读存储器(Read Only Memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(Random Access Memory,RAM)。存储器801可以存储程序,当存储器801中存储的程序被处理器802执行时,处理器802和通信接口803用于执行本申请实施例的生成式对抗网络的训练方法的各个步骤。
处理器802可以采用通用的中央处理器(Central Processing Unit,CPU),微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的生成式对抗网络的训练装置中的单元所需执行的功能,或者执行本申请方法实施例的生成式对抗网络的训练方法。
处理器802还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的生成式对抗网络的训练方法的各个步骤可以通过处理器802中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器802还可以是通用处理器、数字信号处理 器(Digital Signal Processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器801,处理器802读取存储器801中的信息,结合其硬件完成本申请实施例的生成式对抗网络的训练装置中包括的单元所需执行的功能,或者执行本申请方法实施例的生成式对抗网络的训练方法。
通信接口803使用例如但不限于收发器一类的收发装置,来实现装置800与其他设备或通信网络之间的通信。例如,可以通过通信接口803获取训练数据(如本申请实施例一所述的人脸图像和人脸旋转图像)。
总线804可包括在装置800各个部件(例如,存储器801、处理器802、通信接口803)之间传送信息的通路。
应理解,生成式对抗网络的训练装置600中的接收单元601相当于生成式对抗网络的训练装置800中的通信接口803,姿态编码单元602、人脸生成单元604、第一判别单元605、第二判别单元606以及反向传播单元607可以相当于处理器802。
图9是本申请实施例提供的人脸旋转图像的生成装置的硬件结构示意图。图9所示的人脸旋转图像的生成装置900(该装置900具体可以是一种计算机设备)包括存储器901、处理器902、通信接口903以及总线904。其中,存储器901、处理器902、通信接口903通过总线904实现彼此之间的通信连接。
存储器901可以是只读存储器(Read Only Memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(Random Access Memory,RAM)。存储器901可以存储程序,当存储器901中存储的程序被处理器902执行时,处理器902和通信接口903用于执行本申请实施例的人脸旋转图像的生成方法的各个步骤。
处理器902可以采用通用的中央处理器(Central Processing Unit,CPU),微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的人脸旋转图像的生成装置中的单元所需执行的功能,或者执行本申请方法实施例 的人脸旋转图像的生成方法。
处理器902还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的人脸旋转图像的生成方法的各个步骤可以通过处理器902中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器902还可以是通用处理器、数字信号处理器(Digital Signal Processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器901,处理器902读取存储器901中的信息,结合其硬件完成本申请实施例的人脸旋转图像的生成装置中包括的单元所需执行的功能,或者执行本申请方法实施例的人脸旋转图像的生成方法。
通信接口903使用例如但不限于收发器一类的收发装置,来实现装置900与其他设备或通信网络之间的通信。例如,可以通过通信接口903获取训练数据(如本申请实施例二所述的人脸图像)。
总线904可包括在装置900各个部件(例如,存储器901、处理器902、通信接口903)之间传送信息的通路。
应理解,人脸旋转图像的生成装置700中的接收单元701,获取单元703相当于人脸旋转图像的生成装置900中的通信接口903;人脸旋转图像的生成装置700中的姿态编码单元702、信号生成单元704、图像生成单元705可以相当于处理器902。
应注意,尽管图8和图9所示的装置800和900仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,装置800和900还包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置800和900还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置800和900也可仅仅包括实现本申请实施例所必须的器件,而不必包括图8或图9中所示的全部器件。
可以理解,所述装置800相当于图1中的所述训练设备120,所述装置900相当于图1中的所述执行设备110。本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合 来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (36)

  1. 一种人脸旋转图像的生成方法,其特征在于,包括:
    接收人脸图像;
    根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图;
    根据人脸旋转角度从训练数据集中获取多张第一训练图片,所述多张第一训练图片均包含人脸,且所述多张第一训练图片中包含的人脸呈现的旋转角度均为所述人脸旋转角度;
    根据目标人脸图像中的两个或两个以上关键点对所述目标人脸图像进行姿态编码,得到所述目标人脸图像的姿态编码图;其中,所述目标人脸图像是根据所述多张第一训练图片得到的;
    根据所述人脸图像、所述人脸图像的姿态编码图和所述目标人脸图像的姿态编码图生成待输入信号,其中所述人脸图像的尺寸、所述人脸图像的姿态编码图的尺寸和所述目标人脸图像的姿态编码图的尺寸相同;
    将所述待输入信号输入人脸旋转图像生成模型得到人脸旋转图像。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图包括:
    利用关键点检测算法对所述人脸图像进行检测,得到所述人脸图像的N个关键点分别对应的位置坐标,N为大于1的整数;
    构造与所述人脸图像尺寸相同的N张第一图像,所述N张第一图像与所述N个关键点一一对应;
    分别以所述N个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第一图像进行高斯模糊处理,得到N张第一高斯模糊图,所述N张第一高斯模糊图为所述人脸图像的姿态编码图。
  3. 根据权利要求1或2所述的方法,其特征在于,根据目标人脸图像中的两个或两个以上关键点对所述目标人脸图像进行姿态编码,得到所述目标人脸图像的姿态编码图包括:
    利用关键点检测算法对所述目标人脸图像进行检测,得到所述目标人脸图像的M个关键点分别对应的位置坐标,M为大于1的整数;
    构造与所述目标人脸图像尺寸相同的M张第二图像,所述M张第二图像与所述M个关键点一一对应;
    分别以所述M个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第二图像进行高斯模糊处理,得到M张第二高斯模糊图,所述M张第二高斯模糊图为所述目标人脸图像的姿态编码图。
  4. 根据权利要求1至3任一项所述的方法,其特征在于,所述目标人脸图像是根据所述多张第一训练图片得到的,包括:
    所述目标人脸图像是根据所述多张第一训练图片的姿态编码图的平均值得到的。
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述人脸旋转图像生成模型是根据训练生成式对抗网络得到的,所述生成式对抗网络包括至少一个人脸生成网络以及至少两个判别网络,所述至少两个判别网络形成耦合对抗,用于产生对抗损失,所述对抗损失用于更新所述至少一个人脸生成网络以及所述至少两个判别网络,所述更新后的至少一个人脸生成网络为所述人脸旋转图像生成模型。
  6. 一种生成式对抗网络的训练方法,其特征在于,所述生成式对抗网络包括人脸生成网络,以及多个耦合对抗的判别网络,所述耦合对抗的判别网络至少包括第一判别网络和第二判别网络,所述方法包括:
    接收人脸图像,以及人脸旋转图像;所述人脸图像和所述人脸旋转图像为同一张人脸旋转前和旋转后的图像;
    根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图;
    根据所述人脸旋转图像中的两个或两个以上关键点对所述人脸旋转图像进行姿态编码,得到所述人脸旋转图像的姿态编码图;
    将所述人脸图像、所述人脸图像的姿态编码图以及所述人脸旋转图像的姿态编码图输入所述人脸生成网络,以生成预测人脸旋转图像;
    将所述人脸图像、所述人脸旋转图像和所述预测人脸旋转图像输入所述第一判别网络以得到第一损失;
    将所述人脸旋转图像、所述人脸旋转图像的姿态编码图以及所述预测人脸旋转图像输入所述第二判别网络以得到第二损失;
    根据所述生成式对抗网络的总损失更新所述人脸生成网络、所述第一判别网络以及所述第二判别网络,所述生成式对抗网络的总损失根据所述第一损失和第二损失加权求 和得到;
    直到所述生成式对抗网络的总损失收敛,输出训练后的人脸生成网络。
  7. 根据权利要求6所述的方法,其特征在于,在所述根据所述生成式对抗网络的总损失更新所述人脸生成网络、所述第一判别网络以及所述第二判别网络之前,所述方法还包括:
    根据所述人脸旋转图像及所述预测人脸旋转图像得到真实图像损失,所述真实图像损失包括像素损失、全变分损失及身份识别特征损失中的至少一个损失;对应的,所述生成式对抗网络的总损失根据所述真实图像损失中的至少一个损失、所述第一损失和第二损失加权求和得到。
  8. 根据权利要求6或7所述的方法,其特征在于,所述根据所述人脸图像中的两个或两个以上关键点对所述人脸图像进行姿态编码,得到所述人脸图像的姿态编码图,包括:
    利用关键点检测算法对所述人脸图像进行检测,得到所述人脸图像的N个关键点分别对应的位置坐标,构造与所述人脸图像尺寸相同的N张第一图像,所述N张第一图像与所述N个关键点一一对应,分别以所述N个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第一图像进行高斯模糊处理,得到N张第一高斯模糊图,所述N张第一高斯模糊图为所述人脸图像的姿态编码图,N为大于1的正整数。
  9. 根据权利要求6至8任一项所述的方法,其特征在于,所述根据所述人脸旋转图像中的两个或两个以上关键点对所述人脸旋转图像进行姿态编码,得到所述人脸旋转图像的姿态编码图,包括:
    利用关键点检测算法对所述人脸旋转图像进行检测,得到所述人脸旋转图像的M个关键点分别对应的位置坐标,构造与所述人脸旋转图像尺寸相同的M张第二图像,所述M张第二图像与所述M个关键点一一对应,分别以所述M个关键点中的每个关键点为中心,对与所述每个关键点一一对应的第二图像进行高斯模糊处理,得到M张第二高斯模糊图,所述M张第二高斯模糊图为所述人脸旋转图像的姿态编码图,M为大于1的正整数。
  10. 根据权利要求6至9任一项所述的方法,其特征在于,所述将所述人脸图像,所述人脸旋转图像和所述预测人脸旋转图像输入所述第一判别网络得到第一损失,包括:
    以所述人脸图像作为所述第一判别网络的判别条件,根据所述第一判别网络判断所述人脸旋转图像和所述预测人脸旋转图像的真假性,并根据判别结果生成所述第一损失;其中,所述第一判断网络包括二分类判别器,所述二分类判别器用于判断为真或判断为假。
  11. 根据权利要求10所述的方法,其特征在于:以所述人脸图像作为所述第一判别网络的判别条件,根据所述第一判别网络判断所述人脸旋转图像和所述预测人脸旋转图像的真假性,并根据判别结果生成所述第一损失,包括:
    L_{ii} = \mathbb{E}_{I_b\sim H(I_b)}\left[\log D_{\theta_{ii}}(I_a, I_b)\right] + \mathbb{E}_{\hat{I}_b\sim H(\hat{I}_b)}\left[\log\left(1 - D_{\theta_{ii}}(I_a, \hat{I}_b)\right)\right]
    其中，L_{ii}为所述第一损失，I_a为所述人脸图像，I_b为所述人脸旋转图像，\hat{I}_b为所述预测人脸旋转图像；\mathbb{E}_{I_b\sim H(I_b)}[\cdot]表示在所述人脸旋转图像I_b的分布H(I_b)上求期望，即所述人脸旋转图像I_b为真的概率；\log D_{\theta_{ii}}(\cdot)表示所述第一判别网络的损失函数；\mathbb{E}_{\hat{I}_b\sim H(\hat{I}_b)}[\cdot]表示在所述预测人脸旋转图像\hat{I}_b的分布H(\hat{I}_b)上的期望，即所述预测人脸旋转图像\hat{I}_b为真的概率；D_{\theta_{ii}}(\cdot,\cdot)为以所述人脸图像为条件的所述第一判别网络，\theta_{ii}为所述第一判别网络的参数，(I_a, I_b)和(I_a, \hat{I}_b)为所述第一判别网络的输入。
  12. 根据权利要求6至11任一项所述的方法,其特征在于,所述将所述人脸旋转图像,所述人脸旋转图像的姿态编码图以及所述预测人脸旋转图像输入所述第二判别网络得到第二损失,包括:
    以所述人脸旋转图像的姿态编码图作为所述第二判别网络的判别条件,根据所述第二判别网络判断所述人脸旋转图像和所述预测人脸旋转图像的真假性,并根据判别结果生成所述第二损失;其中,所述第二判断网络包括二分类判别器,所述二分类判别器用于判断为真或判断为假。
  13. 根据权利要求12所述的方法,其特征在于,以所述人脸旋转图像的姿态编码图作为所述第二判别网络的判别条件,根据所述第二判别网络判断所述人脸旋转图像和所述预测人脸旋转图像的真假性,并根据判别结果生成所述第二损失,包括:
    L_{ip} = \mathbb{E}_{I_b\sim H(I_b)}\left[\log D_{\theta_{ip}}(P_b, I_b)\right] + \mathbb{E}_{\hat{I}_b\sim H(\hat{I}_b)}\left[\log\left(1 - D_{\theta_{ip}}(P_b, \hat{I}_b)\right)\right]
    其中，L_{ip}为所述第二损失，I_b为所述人脸旋转图像，\hat{I}_b为所述预测人脸旋转图像，P_b为所述人脸旋转图像的姿态编码图；\mathbb{E}_{I_b\sim H(I_b)}[\cdot]表示在所述人脸旋转图像I_b的分布H(I_b)上求期望，即所述人脸旋转图像I_b为真的概率；\log D_{\theta_{ip}}(\cdot)表示所述第二判别网络的损失函数；\mathbb{E}_{\hat{I}_b\sim H(\hat{I}_b)}[\cdot]表示在所述预测人脸旋转图像\hat{I}_b的分布H(\hat{I}_b)上的期望，即所述预测人脸旋转图像\hat{I}_b为真的概率；D_{\theta_{ip}}(\cdot,\cdot)为以所述人脸旋转图像的姿态编码图为条件的所述第二判别网络，\theta_{ip}为所述第二判别网络的参数，(P_b, I_b)和(P_b, \hat{I}_b)为所述第二判别网络的输入。
  14. The method according to any one of claims 7 to 13, wherein when the real-image loss comprises the pixel loss, the obtaining the real-image loss from the face rotation image and the predicted face rotation image comprises:
    L_pix = (1/S) * Σ_{s=1..S} ||Î_b^s - I_b^s||_1
    where L_pix is the pixel loss, S is the number of scales, Î_b is the predicted face rotation image, I_b is the face rotation image, and ||Î_b^s - I_b^s||_1 denotes the 1-norm loss of the pixel differences computed when the predicted face rotation image and the face rotation image are both scaled to scale s (s = 1, ..., S).
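For illustration only, a sketch of the multi-scale pixel loss of claim 14 follows; the bilinear resizing mode and the particular set of scales are assumptions, since the claim fixes only the 1-norm of pixel differences computed at S scales.

```python
import torch
import torch.nn.functional as F

def pixel_loss(pred, target, scales=(1.0, 0.5, 0.25)):
    """Mean absolute (1-norm) pixel difference between the predicted and the real
    face rotation image, accumulated over the scales and averaged."""
    loss = 0.0
    for s in scales:
        if s == 1.0:
            p, t = pred, target
        else:
            p = F.interpolate(pred, scale_factor=s, mode="bilinear", align_corners=False)
            t = F.interpolate(target, scale_factor=s, mode="bilinear", align_corners=False)
        loss = loss + (p - t).abs().mean()
    return loss / len(scales)
```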
  15. The method according to any one of claims 7 to 14, wherein when the real-image loss comprises the total variation loss, the obtaining the real-image loss from the face rotation image and the predicted face rotation image comprises:
    L_tv = Σ_{c=1..C} Σ_{w=1..W} Σ_{h=1..H} ( |Î_b^{w+1,h,c} - Î_b^{w,h,c}| + |Î_b^{w,h+1,c} - Î_b^{w,h,c}| )
    where L_tv is the total variation loss, i.e. the sum of the absolute values of the first-order gradients of the predicted face rotation image Î_b in the horizontal and vertical directions, W denotes the width of the predicted face rotation image, H denotes the height of the predicted face rotation image, and C denotes the number of channels of the predicted face rotation image.
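For illustration only, a direct implementation of the total variation term of claim 15 is sketched below; the division by the number of elements is an assumed normalisation, since the claim itself only defines the sum of absolute first-order gradients.

```python
import torch

def total_variation_loss(pred):
    """Sum of absolute horizontal and vertical first-order gradients of the
    predicted face rotation image. pred: tensor of shape (B, C, H, W)."""
    dh = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs().sum()  # vertical gradients
    dw = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs().sum()  # horizontal gradients
    b, c, h, w = pred.shape
    return (dh + dw) / (b * c * h * w)
```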
  16. The method according to any one of claims 7 to 15, wherein when the real-image loss comprises the identity-preserving feature loss, the obtaining the real-image loss from the face rotation image and the predicted face rotation image comprises:
    L_ip = ||f_pool(Î_b) - f_pool(I_b)||_2^2 + ||f_fc(Î_b) - f_fc(I_b)||_2^2
    where the identity-preserving feature is used to ensure that the identity information between the predicted face rotation image and the face image remains unchanged, L_ip denotes the identity-preserving feature loss, f is a pre-trained face recognition model, the face recognition model f is a deep neural network, and the deep neural network comprises at least one pooling layer and at least one fully connected layer; f_pool(·) denotes the output of the last pooling layer of the face recognition model f, and f_fc(·) denotes the output of the last fully connected layer of the face recognition model f.
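For illustration only, a sketch of the identity-preserving feature loss of claim 16 follows. The pre-trained face recognition model and the hooks exposing its last-pooling-layer and last-fully-connected-layer outputs are assumed to exist, and the squared L2 distance is likewise an assumption, since the claim does not fix the norm.

```python
import torch

def identity_loss(last_pool, last_fc, pred, target):
    """last_pool / last_fc: assumed callables returning the last-pooling-layer and
    last-fully-connected-layer features of a pre-trained face recognition model f.
    Compares the predicted face rotation image with the real one so that
    identity information is preserved."""
    loss_pool = (last_pool(pred) - last_pool(target)).pow(2).mean()
    loss_fc = (last_fc(pred) - last_fc(target)).pow(2).mean()
    return loss_pool + loss_fc
```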
  17. An apparatus for generating a face rotation image, comprising:
    a receiving unit, configured to receive a face image;
    a pose encoding unit, configured to perform pose encoding on the face image according to two or more keypoints in the face image, to obtain a pose encoding map of the face image;
    an obtaining unit, configured to obtain a plurality of first training pictures from a training data set according to a face rotation angle, where each of the plurality of first training pictures contains a face, and the faces contained in the plurality of first training pictures all present the face rotation angle;
    the pose encoding unit is further configured to perform pose encoding on a target face image according to two or more keypoints in the target face image, to obtain a pose encoding map of the target face image, where the target face image is obtained from the plurality of first training pictures;
    a signal generation unit, configured to generate a to-be-input signal from the face image, the pose encoding map of the face image, and the pose encoding map of the target face image, where the size of the face image, the size of the pose encoding map of the face image, and the size of the pose encoding map of the target face image are the same;
    an image generation unit, configured to input the to-be-input signal into a face rotation image generation model, to obtain a face rotation image.
  18. The apparatus according to claim 17, wherein the pose encoding unit is specifically configured to:
    detect the face image by using a keypoint detection algorithm, to obtain position coordinates corresponding to each of N keypoints of the face image, where N is an integer greater than 1;
    construct N first images with the same size as the face image, the N first images being in one-to-one correspondence with the N keypoints;
    perform Gaussian blurring on each of the N first images centered on its corresponding keypoint, to obtain N first Gaussian-blurred maps, the N first Gaussian-blurred maps being the pose encoding maps of the face image.
  19. The apparatus according to claim 17 or 18, wherein the pose encoding unit is specifically configured to:
    detect the target face image by using a keypoint detection algorithm, to obtain position coordinates corresponding to each of M keypoints of the target face image, where M is an integer greater than 1;
    construct M second images with the same size as the target face image, the M second images being in one-to-one correspondence with the M keypoints;
    perform Gaussian blurring on each of the M second images centered on its corresponding keypoint, to obtain M second Gaussian-blurred maps, the M second Gaussian-blurred maps being the pose encoding maps of the target face image.
  20. The apparatus according to any one of claims 17 to 19, wherein that the target face image is obtained from the plurality of first training pictures comprises:
    the target face image is obtained from the average of the pose encoding maps of the plurality of first training pictures.
  21. The apparatus according to any one of claims 17 to 20, wherein the face rotation image generation model is obtained by training a generative adversarial network, the generative adversarial network comprises at least one face generation network and at least two discrimination networks, the at least two discrimination networks form a coupled adversarial pair and are configured to produce an adversarial loss, the adversarial loss is used to update the at least one face generation network and the at least two discrimination networks, and the updated at least one face generation network is the face rotation image generation model.
  22. An apparatus for training a generative adversarial network, wherein the generative adversarial network comprises a face generation network and a plurality of coupled adversarial discrimination networks, the coupled adversarial discrimination networks comprise at least a first discrimination network and a second discrimination network, and the apparatus comprises:
    a receiving unit, configured to receive a face image and a face rotation image, where the face image and the face rotation image are images of a same face before and after rotation;
    a pose encoding unit, configured to perform pose encoding on the face image according to two or more keypoints in the face image, to obtain a pose encoding map of the face image;
    the pose encoding unit is further configured to perform pose encoding on the face rotation image according to two or more keypoints in the face rotation image, to obtain a pose encoding map of the face rotation image;
    a face generation unit, configured to input the face image, the pose encoding map of the face image, and the pose encoding map of the face rotation image into the face generation network, to generate a predicted face rotation image;
    a first discrimination unit, configured to input the face image, the face rotation image, and the predicted face rotation image into the first discrimination network, to obtain a first loss;
    a second discrimination unit, configured to input the face rotation image, the pose encoding map of the face rotation image, and the predicted face rotation image into the second discrimination network, to obtain a second loss;
    a back-propagation unit, configured to update the face generation network, the first discrimination network, and the second discrimination network according to a total loss of the generative adversarial network, where the total loss of the generative adversarial network is obtained by a weighted sum of the first loss and the second loss;
    an output unit, configured to output the trained face generation network when the total loss of the generative adversarial network converges.
  23. The apparatus according to claim 22, wherein the apparatus further comprises a real-image loss calculation unit, configured to:
    obtain a real-image loss from the face rotation image and the predicted face rotation image, where the real-image loss comprises at least one of a pixel loss, a total variation loss, and an identity-preserving feature loss; correspondingly, the total loss of the generative adversarial network is obtained by a weighted sum of the at least one loss of the real-image loss, the first loss, and the second loss.
  24. The apparatus according to claim 22 or 23, wherein the pose encoding unit is specifically configured to:
    detect the face image by using a keypoint detection algorithm to obtain position coordinates corresponding to each of N keypoints of the face image, construct N first images with the same size as the face image, the N first images being in one-to-one correspondence with the N keypoints, and perform Gaussian blurring on each of the N first images centered on its corresponding keypoint to obtain N first Gaussian-blurred maps, the N first Gaussian-blurred maps being the pose encoding maps of the face image, where N is a positive integer greater than 1.
  25. The apparatus according to any one of claims 22 to 24, wherein the pose encoding unit is specifically configured to:
    detect the face rotation image by using a keypoint detection algorithm to obtain position coordinates corresponding to each of M keypoints of the face rotation image, construct M second images with the same size as the face rotation image, the M second images being in one-to-one correspondence with the M keypoints, and perform Gaussian blurring on each of the M second images centered on its corresponding keypoint to obtain M second Gaussian-blurred maps, the M second Gaussian-blurred maps being the pose encoding maps of the face rotation image, where M is a positive integer greater than 1.
  26. The apparatus according to any one of claims 22 to 25, wherein the first discrimination unit is configured to:
    use the face image as the discrimination condition of the first discrimination network, judge, by the first discrimination network, whether the face rotation image and the predicted face rotation image are real or fake, and generate the first loss according to the discrimination result; wherein the first discrimination network comprises a binary discriminator, and the binary discriminator is configured to judge real or fake.
  27. The apparatus according to claim 26, wherein the first discrimination unit is specifically configured to perform the following calculation:
    L_ii = E_{I_b ~ H(I_b)}[log D_ii(I_a, I_b; θ_ii)] + E_{Î_b ~ H(Î_b)}[log(1 - D_ii(I_a, Î_b; θ_ii))]
    where L_ii is the first loss, I_a is the face image, I_b is the face rotation image, and Î_b is the predicted face rotation image; E_{I_b ~ H(I_b)}[·] denotes the expectation over the distribution H(I_b) of the face rotation image I_b, i.e. the probability that the face rotation image I_b is real; log D_ii(I_a, I_b; θ_ii) denotes the loss function of the first discrimination network; E_{Î_b ~ H(Î_b)}[·] denotes the expectation over the distribution H(Î_b) of the predicted face rotation image Î_b, i.e. the probability that the predicted face rotation image Î_b is real; D_ii(·, ·; θ_ii) is the first discrimination network conditioned on the face image, θ_ii is the parameter of the first discrimination network, and (I_a, Î_b) is the input of the first discrimination network.
  28. The apparatus according to any one of claims 22 to 27, wherein the second discrimination unit is configured to:
    use the pose encoding map of the face rotation image as the discrimination condition of the second discrimination network, judge, by the second discrimination network, whether the face rotation image and the predicted face rotation image are real or fake, and generate the second loss according to the discrimination result; wherein the second discrimination network comprises a binary discriminator, and the binary discriminator is configured to judge real or fake.
  29. The apparatus according to claim 28, wherein the second discrimination unit is specifically configured to perform the following calculation:
    L_ip = E_{I_b ~ H(I_b)}[log D_ip(P_b, I_b; θ_ip)] + E_{Î_b ~ H(Î_b)}[log(1 - D_ip(P_b, Î_b; θ_ip))]
    where L_ip is the second loss, I_b is the face rotation image, Î_b is the predicted face rotation image, and P_b is the pose encoding map of the face rotation image; E_{I_b ~ H(I_b)}[·] denotes the expectation over the distribution H(I_b) of the face rotation image I_b, i.e. the probability that the face rotation image I_b is real; log D_ip(P_b, I_b; θ_ip) denotes the loss function of the second discrimination network; E_{Î_b ~ H(Î_b)}[·] denotes the expectation over the distribution H(Î_b) of the predicted face rotation image Î_b, i.e. the probability that the predicted face rotation image Î_b is real; D_ip(·, ·; θ_ip) is the second discrimination network conditioned on the pose encoding map of the face rotation image, θ_ip is the parameter of the second discrimination network, and (P_b, Î_b) is the input of the second discrimination network.
  30. The apparatus according to any one of claims 23 to 29, wherein when the real-image loss comprises the pixel loss, the real-image loss calculation unit is configured to perform the following calculation:
    L_pix = (1/S) * Σ_{s=1..S} ||Î_b^s - I_b^s||_1
    where L_pix is the pixel loss, S is the number of scales, Î_b is the predicted face rotation image, I_b is the face rotation image, and ||Î_b^s - I_b^s||_1 denotes the 1-norm loss of the pixel differences computed when the predicted face rotation image and the face rotation image are both scaled to scale s (s = 1, ..., S).
  31. The apparatus according to any one of claims 23 to 30, wherein when the real-image loss comprises the total variation loss, the real-image loss calculation unit is configured to perform the following calculation:
    L_tv = Σ_{c=1..C} Σ_{w=1..W} Σ_{h=1..H} ( |Î_b^{w+1,h,c} - Î_b^{w,h,c}| + |Î_b^{w,h+1,c} - Î_b^{w,h,c}| )
    where L_tv is the total variation loss, i.e. the sum of the absolute values of the first-order gradients of the predicted face rotation image Î_b in the horizontal and vertical directions, W denotes the width of the predicted face rotation image, H denotes the height of the predicted face rotation image, and C denotes the number of channels of the predicted face rotation image.
  32. The apparatus according to any one of claims 23 to 31, wherein when the real-image loss comprises the identity-preserving feature loss, the real-image loss calculation unit is configured to perform the following calculation:
    L_ip = ||f_pool(Î_b) - f_pool(I_b)||_2^2 + ||f_fc(Î_b) - f_fc(I_b)||_2^2
    where the identity-preserving feature is used to ensure that the identity information between the predicted face rotation image and the face image remains unchanged, L_ip denotes the identity-preserving feature loss, f is a pre-trained face recognition model, the face recognition model f is a deep neural network, and the deep neural network comprises at least one pooling layer and at least one fully connected layer; f_pool(·) denotes the output of the last pooling layer of the face recognition model f, and f_fc(·) denotes the output of the last fully connected layer of the face recognition model f.
  33. A device for generating a face rotation image, comprising a processor and a memory, wherein the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1 to 5.
  34. A device for training a generative adversarial network, comprising a processor and a memory, wherein the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 6 to 16.
  35. A computer-readable storage medium, wherein the computer-readable storage medium stores program instructions, and when the program instructions are run by a processor, the method according to any one of claims 1 to 5 is implemented.
  36. A computer-readable storage medium, wherein the computer-readable storage medium stores program instructions, and when the program instructions are run by a processor, the method according to any one of claims 6 to 16 is implemented.
PCT/CN2018/089611 2018-06-01 2018-06-01 Method and apparatus for generating face rotation image WO2019227479A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201880090767.4A CN111819568A (zh) 2018-06-01 2018-06-01 Method and apparatus for generating face rotation image
PCT/CN2018/089611 WO2019227479A1 (zh) 2018-06-01 2018-06-01 Method and apparatus for generating face rotation image
US17/038,208 US11232286B2 (en) 2018-06-01 2020-09-30 Method and apparatus for generating face rotation image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/089611 WO2019227479A1 (zh) 2018-06-01 2018-06-01 Method and apparatus for generating face rotation image

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/038,208 Continuation US11232286B2 (en) 2018-06-01 2020-09-30 Method and apparatus for generating face rotation image

Publications (1)

Publication Number Publication Date
WO2019227479A1 true WO2019227479A1 (zh) 2019-12-05

Family

ID=68697775

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/089611 WO2019227479A1 (zh) 2018-06-01 2018-06-01 Method and apparatus for generating face rotation image

Country Status (3)

Country Link
US (1) US11232286B2 (zh)
CN (1) CN111819568A (zh)
WO (1) WO2019227479A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583099A (zh) * 2020-04-14 2020-08-25 上海联影智能医疗科技有限公司 Image straightening method, computer device, and storage medium
CN111847147A (zh) * 2020-06-18 2020-10-30 闽江学院 Contactless eye-movement-based elevator floor input method and apparatus
CN112070888A (zh) * 2020-09-08 2020-12-11 北京字节跳动网络技术有限公司 Image generation method, apparatus, device, and computer-readable medium
CN112418344A (zh) * 2020-12-07 2021-02-26 汇纳科技股份有限公司 Training method, object detection method, medium, and electronic device
CN113326934A (zh) * 2021-05-31 2021-08-31 上海哔哩哔哩科技有限公司 Neural network training method, and method and apparatus for generating images and videos

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7107441B2 (ja) * 2018-08-31 2022-07-27 日本電気株式会社 Information processing apparatus, method, and program
CN109800294B (zh) * 2019-01-08 2020-10-13 中国科学院自动化研究所 Autonomously evolving intelligent dialogue method, system, and apparatus based on physical-environment games
US11188740B2 (en) * 2019-12-18 2021-11-30 Qualcomm Incorporated Two-pass omni-directional object detection
CN112800898A (zh) * 2021-01-18 2021-05-14 深圳市网联安瑞网络科技有限公司 Pedestrian re-identification dataset augmentation method, system, terminal, camera, and medium
CN112837211B (zh) * 2021-01-28 2023-07-18 北京奇艺世纪科技有限公司 Picture processing method and apparatus, electronic device, and readable storage medium
TWI768913B (zh) * 2021-05-20 2022-06-21 國立中正大學 Eye center localization method and localization system
CN113222144B (zh) * 2021-05-31 2022-12-27 北京有竹居网络技术有限公司 Training method for an image inpainting model, and image inpainting method, apparatus, and device
US11900534B2 (en) * 2021-07-30 2024-02-13 The Boeing Company Systems and methods for synthetic image generation
CN116310659B (zh) * 2023-05-17 2023-08-08 中数元宇数字科技(上海)有限公司 Method and device for generating training datasets

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103065360A (zh) * 2013-01-16 2013-04-24 重庆绿色智能技术研究院 Method and system for generating hairstyle effect images
CN106251294A (zh) * 2016-08-11 2016-12-21 西安理工大学 Virtual multi-pose generation method from a single frontal face image
CN107506717A (zh) * 2017-08-17 2017-12-22 南京东方网信网络科技有限公司 Face recognition method based on deep transformation learning in unconstrained scenes
CN107871107A (zh) * 2016-09-26 2018-04-03 北京眼神科技有限公司 Face authentication method and apparatus

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8090160B2 (en) * 2007-10-12 2012-01-03 The University Of Houston System Automated method for human face modeling and relighting with application to face recognition
CN103646244B (zh) * 2013-12-16 2018-01-09 北京天诚盛业科技有限公司 Face feature extraction and authentication method and apparatus
CN105740758A (zh) 2015-12-31 2016-07-06 上海极链网络科技有限公司 Internet video face recognition method based on deep learning
US10474881B2 (en) * 2017-03-15 2019-11-12 Nec Corporation Video retrieval system based on larger pose face frontalization
CN107122705B (zh) * 2017-03-17 2020-05-19 中国科学院自动化研究所 Face keypoint detection method based on a three-dimensional face model
US10878612B2 (en) * 2017-04-04 2020-12-29 Intel Corporation Facial image replacement using 3-dimensional modelling techniques
CN107292813B (zh) * 2017-05-17 2019-10-22 浙江大学 Multi-pose face generation method based on generative adversarial networks
CN107437077A (zh) * 2017-08-04 2017-12-05 深圳市唯特视科技有限公司 Method for rotated face representation learning based on generative adversarial networks


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"How to Rotate a Face in an Image?", CVPR, 25 May 2018 (2018-05-25), Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/37305160> *
"Large-scale Face Image Editing Theory, Method and Application", INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES HE, RAN, 4 May 2018 (2018-05-04), Retrieved from the Internet <URL:https://baijiahao.baidu.com/s?id=1599526577293416704&wfr=spider&for=pc> *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583099A (zh) * 2020-04-14 2020-08-25 上海联影智能医疗科技有限公司 Image straightening method, computer device, and storage medium
CN111847147A (zh) * 2020-06-18 2020-10-30 闽江学院 Contactless eye-movement-based elevator floor input method and apparatus
CN111847147B (zh) * 2020-06-18 2023-04-18 闽江学院 Contactless eye-movement-based elevator floor input method and apparatus
CN112070888A (zh) * 2020-09-08 2020-12-11 北京字节跳动网络技术有限公司 Image generation method, apparatus, device, and computer-readable medium
CN112070888B (zh) * 2020-09-08 2024-04-05 抖音视界有限公司 Image generation method, apparatus, device, and computer-readable medium
CN112418344A (zh) * 2020-12-07 2021-02-26 汇纳科技股份有限公司 Training method, object detection method, medium, and electronic device
CN112418344B (zh) * 2020-12-07 2023-11-21 汇纳科技股份有限公司 Training method, object detection method, medium, and electronic device
CN113326934A (zh) * 2021-05-31 2021-08-31 上海哔哩哔哩科技有限公司 Neural network training method, and method and apparatus for generating images and videos
CN113326934B (zh) * 2021-05-31 2024-03-29 上海哔哩哔哩科技有限公司 Neural network training method, and method and apparatus for generating images and videos

Also Published As

Publication number Publication date
US20210012093A1 (en) 2021-01-14
CN111819568A (zh) 2020-10-23
US11232286B2 (en) 2022-01-25

Similar Documents

Publication Publication Date Title
WO2019227479A1 (zh) Method and apparatus for generating face rotation image
CN110532871B (zh) Image processing method and apparatus
US20210319258A1 (en) Method and apparatus for training classification task model, device, and storage medium
WO2021043168A1 (zh) Pedestrian re-identification network training method, and pedestrian re-identification method and apparatus
WO2019228358A1 (zh) Deep neural network training method and apparatus
WO2021143101A1 (zh) Face recognition method and face recognition apparatus
WO2021175050A1 (zh) Three-dimensional reconstruction method and three-dimensional reconstruction apparatus
WO2021018163A1 (zh) Neural network search method and apparatus
CN111783748B (zh) Face recognition method and apparatus, electronic device, and storage medium
CN111914997B (zh) Method for training a neural network, image processing method, and apparatus
CN109684969B (zh) Gaze position estimation method, computer device, and storage medium
CN113705769A (zh) Neural network training method and apparatus
WO2021218238A1 (zh) Image processing method and image processing apparatus
CN110222718B (zh) Image processing method and apparatus
CN111832592B (zh) RGBD saliency detection method and related apparatus
CN113807183A (zh) Model training method and related device
WO2022052782A1 (zh) Image processing method and related device
WO2021190433A1 (zh) Method and apparatus for updating an object recognition model
CN113536970A (zh) Training method for a video classification model and related apparatus
Das et al. A fusion of appearance based CNNs and temporal evolution of skeleton with LSTM for daily living action recognition
CN114492634A (zh) Fine-grained equipment image classification and recognition method and system
WO2023142886A1 (zh) Expression transfer method, model training method, and apparatus
CN113313133A (zh) Generative adversarial network training method and animated image generation method
CN116758212A (zh) 3D reconstruction method, apparatus, device, and medium based on an adaptive denoising algorithm
WO2022179606A1 (zh) Image processing method and related apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18920213

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18920213

Country of ref document: EP

Kind code of ref document: A1