WO2021073417A1 - Expression generation method, apparatus, device and storage medium (表情生成方法、装置、设备及存储介质) - Google Patents

Expression generation method, apparatus, device and storage medium (表情生成方法、装置、设备及存储介质)

Info

Publication number
WO2021073417A1
WO2021073417A1 (PCT/CN2020/118388)
Authority
WO
WIPO (PCT)
Prior art keywords
expression
face
facial
photo
target domain
Prior art date
Application number
PCT/CN2020/118388
Other languages
English (en)
French (fr)
Inventor
王健宗
王义文
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021073417A1 publication Critical patent/WO2021073417A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/001 Texturing; Colouring; Generation of texture or colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/165 Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to an expression generation method, device, equipment and storage medium based on a generative adversarial network (the StarGAN network).
  • StarGAN network: a generative adversarial network for multi-domain image-to-image conversion.
  • With the continuous development of intelligent image technology, especially image-to-image conversion, this technology has become an effective and fast way to process and analyse images and a new field of computer vision: an image defined in the original image space is converted into another space in a certain form, the special properties of that space are used to process the image more conveniently, and the result is finally converted back to the original image space to achieve the desired effect.
  • This technology is currently applied mainly on mobile electronic terminals to recognize and classify human faces with eigenfaces. Specifically, the target facial image information is first extracted, the face to be recognized is then projected into a new multi-dimensional face space, and the recognition and classification of the face can be completed with simple classification methods.
  • In addition, with the development of image processing technologies such as photo beautification, this technology is also widely used in entertainment software, for example to change the emotional expression of a face, converting a happy expression into anger, fury, sadness or other expressions. However, in existing image conversion, one image model can only realize the conversion of one image domain; multi-domain image conversion tasks are handled only by scheduling different models, which is neither efficient nor effective and shows considerable limitations in scalability and robustness. An image processing method that can convert back and forth among multiple image domains is therefore urgently needed.
  • The main purpose of this application is to provide an expression generation method, device, equipment and storage medium based on the StarGAN network, aiming to solve the technical problem that existing image conversion techniques have difficulty realizing one-to-many-domain image conversion, which results in low image processing efficiency.
  • To achieve the above purpose, this application provides an expression generation method based on the StarGAN network. The expression generation method includes the following steps: acquiring a photo to be processed, recognizing all face portraits in the photo based on face recognition technology, and drawing the face bounding box of each face portrait on the photo; using face key point detection technology to mark the positions of the facial features of the face portrait within the face bounding box; obtaining the target domain label to be converted, where the target domain label indicates the expression data set into which the photo is to be converted; inputting the photo carrying the facial feature position marks, together with the target domain label, into a pre-trained facial expression modification model, and querying the expression data set corresponding to the target domain label from an expression material library, where the facial expression modification model is an expression replacement model that, based on a single network, can obtain expression materials across multiple expression data sets of different expression domains, and determines the expression data set corresponding to the target domain label from the expression material library according to the mapping relationship between domain labels and the expression data sets of multiple different expression domains, the expression material library containing at least two expression data sets of different expression domains; and filling the determined expression data sets of the different expressions, through the facial expression modification model, into the corresponding facial feature positions of the face portrait in turn, and performing expression image synthesis to obtain the target expression photo.
  • To achieve the above purpose, this application also provides an expression generation device based on the StarGAN network. The expression generation device includes: a face recognition module, used to obtain the photo to be processed, recognize all face portraits in the photo based on face recognition technology, draw the face bounding box of each face portrait on the photo, and use face key point detection technology to mark the facial feature positions of the face portrait within the face bounding box; an acquisition module, used to obtain the target domain label to be converted, where the target domain label indicates the expression data set into which the photo is to be converted; a query module, used to input the photo carrying the facial feature position marks and the target domain label into the pre-trained facial expression modification model and to query the expression data set corresponding to the target domain label from the expression material library, where the facial expression modification model is an expression replacement model that, based on a single network, can obtain expression materials across multiple expression data sets of different expression domains, and determines the expression data set corresponding to the target domain label from the expression material library according to the mapping relationship between domain labels and the expression data sets of multiple different expression domains, the expression material library containing at least two expression data sets of different expression domains; and a synthesis module, used by the facial expression modification model to fill the determined expression data sets of the different expressions into the corresponding facial feature positions of the face portrait in turn and to perform expression image synthesis, obtaining the target expression photo.
  • To achieve the above purpose, this application also provides expression generation equipment based on the StarGAN network. The equipment includes a memory, a processor, and a StarGAN-based expression generation program that is stored in the memory and can run on the processor; when the expression generation program is executed by the processor, the following steps of the StarGAN-based expression generation method are implemented: acquiring the photo to be processed, recognizing all face portraits in the photo based on face recognition technology, and drawing the face bounding box of each face portrait on the photo; using face key point detection technology to mark the positions of the facial features of the face portrait within the face bounding box; obtaining the target domain label to be converted, where the target domain label indicates the expression data set into which the photo is to be converted; inputting the photo carrying the facial feature position marks and the target domain label into the pre-trained facial expression modification model, and querying the expression data set corresponding to the target domain label from the expression material library, where the facial expression modification model is an expression replacement model that, based on a single network, can obtain expression materials across multiple expression data sets of different expression domains, and determines the expression data set corresponding to the target domain label from the expression material library according to the mapping relationship between domain labels and the expression data sets of different expression domains, the expression material library containing at least two expression data sets of different expression domains; and filling the determined expression data sets of the different expressions, through the facial expression modification model, into the corresponding facial feature positions of the face portrait in turn, and performing expression image synthesis to obtain the target expression photo.
  • To achieve the above purpose, this application also provides a computer-readable storage medium that stores an expression generation program based on the StarGAN network. When the expression generation program based on the StarGAN network is executed by a processor, the steps of the StarGAN-based expression generation method described above are implemented: the photo to be processed is acquired and the face portraits and facial feature positions are marked; the target domain label to be converted is obtained, where the target domain label indicates the expression data set into which the photo is to be converted; the photo carrying the facial feature position marks and the target domain label are input into the pre-trained facial expression modification model, and the expression data set corresponding to the target domain label is queried from the expression material library, the facial expression modification model being an expression replacement model that, based on a single network, can obtain expression materials across multiple expression data sets of different expression domains; and the determined expression data is filled into the corresponding facial feature positions and synthesized to obtain the target expression photo.
  • In the expression generation method provided by this application, the photo and the target domain label are taken as input information and fed into a facial expression modification model that, based on a single network, can obtain expression materials across multiple expression data sets of different expression domains; the expression data set corresponding to the target domain label is queried, and expression synthesis is performed on the photo according to that expression data set to obtain the target expression photo, thereby realizing one-to-many-domain image conversion and improving the conversion efficiency and accuracy of the images.
  • FIG. 1 is a schematic structural diagram of an operating environment of a mobile terminal involved in a solution according to an embodiment of the application;
  • FIG. 2 is a schematic flowchart of a first embodiment of a method for generating expressions based on the StarGAN network provided by this application;
  • FIG. 3 is a schematic structural diagram of the facial expression modification model provided by this application;
  • FIG. 4 is a schematic diagram of a face image in a photo after expression conversion, provided by this application;
  • FIG. 5 is a schematic diagram of the training process of the facial expression modification model provided by this application;
  • FIG. 6 is a schematic diagram of the functional modules of the expression generation device based on the StarGAN network provided by this application.
  • This application provides an expression generation device based on the StarGAN network. The device may be a plug-in in a mobile terminal for executing the expression generation method provided in the embodiments of this application. FIG. 1 is a schematic diagram of the structure of the mobile terminal operating environment involved in the solution of the embodiments of this application.
  • the mobile terminal includes: a processor 101, such as a CPU, a communication bus 102, a user interface 103, a network interface 104, and a memory 105.
  • the communication bus 102 is used to implement connection and communication between these components.
  • the user interface 103 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and the network interface 104 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • The memory 105 may be a high-speed RAM memory, or a stable non-volatile memory such as a magnetic disk memory.
  • the memory 105 may also be a storage system independent of the aforementioned processor 101.
  • The hardware structure of the mobile terminal shown in FIG. 1 does not constitute a limitation on the expression generation device based on the StarGAN network; it may include more or fewer components than shown in the figure, combine certain components, or use a different arrangement of components.
  • the memory 105 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a StarGAN network-based expression generation program for realizing expression generation.
  • The operating system is a program that manages and controls the StarGAN-based expression generation device and the invocation of software resources in the memory, and supports the running of the StarGAN-based expression generation program and other software and/or programs.
  • In the hardware structure of the StarGAN-based expression generation device shown in FIG. 1, the network interface 104 is mainly used to access a network; the user interface 103 is mainly used to detect the user's expression synthesis control operation on the mobile terminal, that is, the touch operation by which the user selects the label information of the desired expression domain; and the processor 101 may be used to call the StarGAN-based expression generation program stored in the memory 105 and execute the operations of the following embodiments of the StarGAN-based expression generation method.
  • FIG. 2 is a flowchart of the expression generation method based on the StarGAN network provided by an embodiment of the application.
  • the expression generation method based on the StarGAN network specifically includes the following steps:
  • Step S210: Obtain a photo to be processed, recognize all face portraits in the photo based on face recognition technology, and draw a face bounding box of each face portrait on the photo;
  • The photos to be processed mainly refer to old-style photos, such as backup copies of photos developed from film, or the film negative itself; they can, of course, also be existing electronic photos. The photo contains at least one face image.
  • The delineation of the face bounding box is implemented as follows: based on image recognition technology, the input photo is first separated into background and portrait regions. Preferably, a background-blurring operation can be added during this separation to further highlight the portrait; the face part of the portrait is then extracted through face detection and the outline of the head is drawn, thereby marking the face. The marking can be done with dots or with lines, or, after the position of the head has been identified, the region where the portrait is located can simply be covered with a translucent mask.
  • this step further includes determining the face portrait in the photo that needs to be subjected to expression processing according to the actual facial expression modification request of the user, and marking the facial portrait.
  • Step S220: Use face key point detection technology to mark, within the face bounding box, the positions of the facial features of the face portrait;
  • Face key point detection technology refers to technology that detects local regions of the human face; it is mainly used to mark the feature regions of the facial contour, eyes, eyebrows, mouth, nose and other organs, as well as the geometric position relationships between them. Specifically, detection and identification can be performed according to the morphology of each organ.
  • Further, when marking the facial features in this step, it is possible to mark only the facial features of the faces that need expression processing, or to mark all facial features while highlighting the parts that need expression processing.
  • Step S230: Obtain the target domain label to be converted, where the target domain label indicates the expression data set into which the photo is to be converted;
  • In this step, after the target domain label is obtained, it can also be added to the corresponding facial feature regions on the face portrait; during the expression synthesis operation, the facial expression modification model then determines the expression material to be obtained by identifying the target domain label on the photo.
  • Step S240: Input the photo carrying the facial feature position marks and the target domain label into the pre-trained facial expression modification model, and query the expression data set corresponding to the target domain label from the expression material library, where the facial expression modification model is an expression replacement model that, based on a single network, can obtain expression materials across multiple expression data sets of different expression domains; it determines, according to the mapping relationship between domain labels and the expression data sets of multiple different expression domains, the expression data set corresponding to the target domain label from the expression material library, the expression material library containing at least two expression data sets of different expression domains;
  • The target domain label to be converted refers to the annotation information corresponding to the expression that the user expects, and this annotation information indicates the expression data set, within the expression material library, of the expression that the user expects.
  • In practical applications, to facilitate the expression modification of photos on the terminal, a database classified by facial expression type is generally pre-stored on the terminal or on the expression processing device. This database can be called a collection of target domains, covering expressions such as angry, contemptuous, disgusted, afraid, happy, sad and surprised. Each expression is provided with multiple images in different forms, which together form the expression material library, and the classification labels angry, contempt, disgust, fear, happiness, sadness or surprise in the expression material library are the target domain labels. When the user inputs a photo, the target domain label determines which type of expression the photo is converted into, and a suitable image is selected from the corresponding expression data set to replace and synthesize the expression in the input photo.
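  • As a hedged illustration only (not part of the patent), the expression material library and its target domain labels described above could be organized in code roughly as follows; the directory layout, the English label names and the helper functions are assumptions made for the example.

```python
import os
from typing import Dict, List

# Assumed target domain labels, one per expression domain in the material library.
EXPRESSIONS: List[str] = ["angry", "contempt", "disgust", "fear", "happy", "sad", "surprised"]

def load_material_library(root: str) -> Dict[str, List[str]]:
    """Map each target domain label to the image files of its expression data set
    (hypothetical layout: one sub-folder of material images per expression domain)."""
    library: Dict[str, List[str]] = {}
    for label in EXPRESSIONS:
        folder = os.path.join(root, label)
        files = sorted(os.listdir(folder)) if os.path.isdir(folder) else []
        library[label] = [os.path.join(folder, f) for f in files]
    return library

def target_domain_one_hot(label: str) -> List[int]:
    """Encode the requested target domain label as a one-hot vector for the model input."""
    vec = [0] * len(EXPRESSIONS)
    vec[EXPRESSIONS.index(label)] = 1
    return vec

if __name__ == "__main__":
    library = load_material_library("expression_material_library")
    print(target_domain_one_hot("happy"))   # [0, 0, 0, 0, 1, 0, 0]
```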
  • The facial expression modification model is obtained through pre-training, and the training process mainly trains the mapping relationships between the model and different target domains. Once these mapping relationships have been obtained, the model can, when modifying the expression of a face image, obtain expression materials from different domains to generate the expression photo.
  • The mapping relationship training is realized as follows:
  • First, expression data are selected from expression data whose classification is already known to serve as the training data of the model; these training data are chosen from different expression domains, and every selected sample carries a known domain label.
  • Then, the training data are input into the model, and the model extracts two kinds of data for learning and training: one is the expression data and the other is the domain label. For the expression data, the current practice of training one model on one domain of data can be used. For the domain labels, the model first summarizes the domain labels of all input training data and then classifies them on the basis of this summary, where classification means grouping identical domain labels into one class and annotating it with a target domain label; the model then learns and trains according to the classified target domain labels, finally yielding a facial expression modification model that can obtain data across different expression domains.
  • the specific learning can be done in the following ways:
  • It should be noted that a target domain label is equivalent to a link index set on the model, while a domain label is a link index set on the corresponding expression domain. When the model summarizes the domain labels and annotates the target domain labels, it is in effect establishing the mapping relationship between target domain labels and domain labels, and training on the target domain labels means training the model to walk through the mapping from a target domain label to the corresponding domain label;
  • First, the model learns a summary table of the target domain labels and stores the learned result in the model;
  • Then, by way of examples, the model is continuously trained on jumping through the mapping between target domain labels and domain labels: an angry target domain label is input, the model queries the domain label of the corresponding expression domain according to the angry label, and then retrieves the expression data under that domain label;
  • Finally, the output expression data are checked against the target domain label. For example, if an angry target domain label is input and, after the model queries and retrieves data, the output expression data show an angry expression, the training process is correct; if instead an expression that does not correspond to the label is output, the mapping relationship is faulty and must be trained again until the output data are correct.
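  • The mapping-relationship training and its correctness check described above can be pictured as simple bookkeeping between target domain labels, domain labels and expression data; the sketch below is only an illustration of that idea with hypothetical names, not the patent's training code.

```python
from typing import Dict

# Hypothetical learned mapping: target domain label (on the model) -> domain label (on the expression domain).
target_to_domain: Dict[str, str] = {
    "angry'": "angry",     # e.g. "expression library 1'" -> "expression library 1"
    "happy'": "happy",
}

# Hypothetical expression domains: domain label -> a representative expression material.
expression_domains: Dict[str, str] = {
    "angry": "angry_material_001.png",
    "happy": "happy_material_001.png",
}

def retrieve(target_label: str) -> str:
    """Follow target domain label -> domain label -> expression data, as in the text."""
    return expression_domains[target_to_domain[target_label]]

def mapping_is_correct(target_label: str, retrieved: str) -> bool:
    """Crude check: the retrieved material should carry the requested expression."""
    return retrieved.startswith(target_to_domain[target_label])

material = retrieve("angry'")
assert mapping_is_correct("angry'", material), "mapping is wrong: retrain until the output is correct"
```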
  • Step S250: Through the facial expression modification model, the determined expression data sets of the different expressions are sequentially filled into the positions of the corresponding facial features in the face portrait, and expression image synthesis is performed to obtain the target expression photo.
  • The facial expression modification model refers to a training model that has learned the mapping relationships of multiple target domains at the same time during training. It includes a discriminator, two generators and an auxiliary discriminator, where the discriminator is used to detect the authenticity of the input photo; authenticity here means judging whether the classified target domain is consistent with the expression in the photo, or whether the photo carries a modification flag (that is, whether it can be understood as a photo whose expression has already been modified).
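  • For illustration, a compact PyTorch sketch of the two kinds of component just mentioned is given below: a generator conditioned on the target domain label and a discriminator with a real/fake head and a domain-classification head, in the spirit of the published StarGAN architecture; the layer sizes are simplified assumptions and not the patent's exact network.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps an image plus a spatially broadcast target domain label to an image of that domain."""
    def __init__(self, num_domains: int, ch: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + num_domains, ch, 7, 1, 3), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 7, 1, 3), nn.Tanh(),
        )

    def forward(self, x, c):
        # Replicate the domain label over the spatial grid and concatenate it to the image channels.
        c = c.view(c.size(0), c.size(1), 1, 1).expand(-1, -1, x.size(2), x.size(3))
        return self.net(torch.cat([x, c], dim=1))

class Discriminator(nn.Module):
    """Outputs a real/fake score map and a domain classification for the input image."""
    def __init__(self, img_size: int, num_domains: int, ch: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, ch, 4, 2, 1), nn.LeakyReLU(0.01),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.LeakyReLU(0.01),
        )
        self.src = nn.Conv2d(ch * 2, 1, 3, 1, 1)                              # real / fake head
        self.cls = nn.Conv2d(ch * 2, num_domains, img_size // 4, bias=False)  # domain head

    def forward(self, x):
        h = self.features(x)
        return self.src(h), self.cls(h).view(x.size(0), -1)

G = Generator(num_domains=7)
D = Discriminator(img_size=128, num_domains=7)
fake = G(torch.randn(1, 3, 128, 128), torch.eye(7)[[4]])   # e.g. a "happy" one-hot label
out_src, out_cls = D(fake)
```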
  • In practical applications, the above solution can also synthesize and modify the photo directly through the facial expression modification model once the photo to be processed has been obtained. Because the model simultaneously contains the mapping relationships of multiple target domains, after receiving the input photo it directly retrieves, according to the learned mapping relationships, the expression data in multiple target domains, finds matching images for synthesis, and outputs image data for multiple target domains; finally, the output image data of the multiple target domains are filtered according to the target domain label the user wants to convert to, yielding the final composite image.
  • In this embodiment, besides being annotated by categories such as angry or happy, the target domain labels can also be set according to the numbering of the expression materials; that is, one target domain may contain expression materials of several emotions. For this type of target domain, the domain label is no longer annotated as a category such as angry but is instead set to names such as expression library 1, expression library 2 and so on, and the expression libraries carrying these names come from different expression domains; correspondingly, the target domain labels obtained after training the model may be expression library 1' and expression library 2'. In this case, when the user inputs a photo together with the target domain label expression library 1', the model, having learned through training that it can access expression materials in different expression domains, directly queries, based on expression library 1' and the mapping relationship obtained by training, the expression materials in the expression library 1 of the corresponding domain, and then replaces the content at the corresponding expression positions on the photo with these materials one by one, thereby synthesizing the required expression photo.
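  • One way the queried expression material could be filled into the marked facial-feature region and fused with the surrounding face is Poisson (seamless) blending, sketched below with OpenCV; the file names and the simple rectangular mask are illustrative assumptions rather than the patent's exact fusion procedure.

```python
import cv2
import numpy as np

def fill_expression_material(photo_path: str, material_path: str, box: tuple) -> np.ndarray:
    """Paste an expression material patch into a facial-feature bounding box (x, y, w, h)
    and blend its edges with the original photo."""
    photo = cv2.imread(photo_path)
    material = cv2.imread(material_path)
    x, y, w, h = box
    patch = cv2.resize(material, (w, h))
    mask = np.full(patch.shape[:2], 255, dtype=np.uint8)   # blend the whole patch
    center = (x + w // 2, y + h // 2)
    # Poisson blending smooths the transition between the filled patch and the face.
    return cv2.seamlessClone(patch, photo, mask, center, cv2.NORMAL_CLONE)

# Hypothetical usage: fill a "happy" mouth material into the detected mouth region.
# result = fill_expression_material("old_photo.jpg", "happy_mouth.png", (120, 200, 80, 40))
# cv2.imwrite("target_expression_photo.jpg", result)
```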
  • In this embodiment, when recognizing the face images in a photo, face image processing technology is used to judge whether the photo contains a face image, specifically whether it contains a face image whose expression needs to be exchanged. In practice a photo may contain several face images, that is, several persons; the user then marks the face images whose expressions need to be replaced, preferably by ticking their approximate positions on the mobile terminal.
  • If a face image (that is, a face image to be replaced) is detected in the photo, the face in the photo is extracted with the face detection algorithm of the face detection library Dlib running on an open-source computer vision library platform, and the face bounding box is drawn at the position of the face;
  • A facial expression tracking algorithm is then used to mark expression tracking points on the face image inside the face bounding box.
  • In practical applications, the input photo may or may not contain a portrait. The input photo is checked for face images, and only when a face image is detected is the face detection algorithm of Dlib on the OpenCV (open-source computer vision library) platform executed to find the Bounding Box of the face position. Face detection determines whether a face exists in a face image, locates all faces in the image, and obtains a high-precision face calibration frame.
  • After recognition, the border of the face is marked by identifying the key tracking points of the face; preferably, only the face contour and the eyebrows are roughly marked.
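  • A minimal face-detection sketch with the Dlib frontal face detector, plus OpenCV for drawing the bounding boxes, is shown below in line with the description above; the file name and drawing style are assumptions for the example.

```python
import cv2
import dlib

def detect_face_bounding_boxes(photo_path: str):
    """Return the Dlib face rectangles and a copy of the photo with the boxes drawn on it."""
    detector = dlib.get_frontal_face_detector()      # Dlib's built-in HOG-based face detector
    image = cv2.imread(photo_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    rects = detector(gray, 1)                         # upsample once to catch small faces
    annotated = image.copy()
    for rect in rects:
        cv2.rectangle(annotated, (rect.left(), rect.top()), (rect.right(), rect.bottom()),
                      color=(0, 255, 0), thickness=2)
    return rects, annotated

# rects, boxed = detect_face_bounding_boxes("old_photo.jpg")
# print(f"{len(rects)} face(s) found")
```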
  • In this embodiment, the facial features of a face can be recognized by a facial feature recognition model that is trained on the shapes the features take under various expressions. For example, to train an eye recognition model, the eye shapes under angry, contemptuous, disgusted, afraid, happy, sad and surprised expressions are first obtained separately, the overall shape of the eye is then specified, and, based on this specified overall shape, the model architecture learns the detailed appearance of the eyes under the different expressions; from these learned details an accurate eye recognition model is constructed. Finally, in use, once a person is detected in the photo, the eye recognition model is called directly to locate the eyes within the bounding box of the portrait and to mark the tracking points, thereby recognizing the eyes. The other facial features are recognized in the same way as the eyes.
  • Further, after the rough contour of the face has been marked as above, the method proceeds to step S220 to recognize the facial features of the marked face image. Specifically, a pre-trained face key point detector performs facial feature detection on the face image inside the face bounding box, and the detected facial features are boundary-marked with a 106-tracking-point marking scheme, so as to describe the shape and position information of every facial feature on the face image.
  • In practical applications, the face key point detector can be trained directly on the expression data in the already uniformly classified target domains; that is, the expressions in the target domains are recognized by the face detection algorithm and the 106 tracking points are distributed evenly over the facial feature regions, so that they accurately reflect the shapes and areas of the facial features as well as the relative orientations and shapes between them.
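  • A hedged sketch of marking facial-feature key points with a pre-trained Dlib shape predictor follows. Note that Dlib's publicly distributed predictor outputs 68 landmarks, so the 106-tracking-point scheme described here would require a custom-trained predictor; the model file name below is an assumption.

```python
import dlib

def mark_facial_feature_points(image_path: str,
                               predictor_path: str = "shape_predictor_68_face_landmarks.dat"):
    """Detect faces and return, for each face, the (x, y) key points outlining its facial features."""
    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(predictor_path)   # pre-trained face key point detector
    image = dlib.load_rgb_image(image_path)
    points_per_face = []
    for rect in detector(image, 1):
        shape = predictor(image, rect)                 # landmarks inside the face bounding box
        points_per_face.append([(shape.part(i).x, shape.part(i).y)
                                for i in range(shape.num_parts)])
    return points_per_face

# landmarks = mark_facial_feature_points("old_photo.jpg")
```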
  • In this embodiment, before the expression data sets in multiple domains are queried with the facial expression modification model, the model has to be trained in advance. Specifically, the training of this model is implemented as follows:
  • Obtain target domains in at least two different expression domains and the domain label corresponding to each target domain, the target domains containing facial expression feature images that have already been categorized manually;
  • According to the target domains, train the discriminator in the preset model framework of the adversarial network to distinguish between real and fake photos, so as to obtain a discriminator that can recognize expression categories;
  • According to the target domains, train the generator in the preset model framework of the adversarial network on the conversion from the input photo to the target domain to be converted, and on restoring and reconstructing the converted image of the input photo;
  • The preset model framework of the adversarial network records the domain labels corresponding to the manually categorized facial expression feature images, generates target domain labels in the model framework according to the domain labels, and establishes the mapping relationship between the target domain labels and the domain labels;
  • Based on the trained discriminator, the trained generator and the mapping relationship, the facial expression modification model that satisfies expression synthesis is constructed.
  • In this embodiment, the discriminator and the generator are trained one target-domain classification at a time. For example, all the expression materials of the target domain of the angry expression are input into the discriminator and the generator one by one, and the true/false results output by the discriminator are judged; if the input is a real angry expression and is consistent with the domain label of the target domain, an expression material from a target domain other than angry is input next and judged in turn according to its output, and the training proceeds with the inputs alternated in this way. The generator is trained with a similar method.
  • Further, while these expression materials are being learned, the corresponding target domain labels are also annotated on them, and the mapping relationship between the target domain labels and the original domain labels is established and stored in the discriminator, so that the mapping relationship is learned and the final model can achieve cross-domain access.
  • In this embodiment, in step S240, training the facial expression modification model also includes learning the mapping relationships between different domains, and expression materials of multiple domains are obtained according to the mapping relationship between the target domain label and the different domains; that is, one target domain label corresponds to expression data sets of multiple domains, one material is selected from each expression data set for use, and multiple expression photos are synthesized and output from one face portrait in a photo.
  • In this embodiment, during training a target domain label is generated at random, and the model is trained to flexibly convert the input image into the target domain; by doing so, the domain label can be controlled and, in the test phase, the image can be converted into any desired domain.
  • For the unified training model framework based on the StarGAN network, a concrete model is described below with reference to FIG. 3: model training is performed by acquiring expression data of a known target domain, and the training process specifically includes:
  • True/false discrimination training of the discriminator, which mainly trains the probability of distinguishing real from fake so that the discrimination probability of the discriminator comes closer to reality;
  • Training of the conversion of the original image to the target domain image, and of restoring the converted target domain image back to the original image; this is mainly the unified learning of the mapping relationships between the generator and multiple target domains, so that the generator can flexibly switch its mapping among multiple domains and achieve arbitrary conversion;
  • Re-identification of authenticity according to the target domain image, with the training result of the generator adjusted on the basis of this re-identification, so as to re-classify the image or to reduce the output loss of the model.
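  • The alternating procedure just listed (train the discriminator on real/fake and domain classification, then train the generator on translation plus reconstruction, drawing a random target domain label each step) can be sketched roughly as below. G, D and the optimizers are assumed to be defined as in the earlier sketch, and the gradient-penalty term of the adversarial loss given later in the text is omitted here for brevity; this is an outline, not the patent's exact training procedure.

```python
import torch
import torch.nn.functional as F

def classification_loss(cls_logits, domain_one_hot):
    """Cross-entropy between the discriminator's domain logits and the (one-hot) domain label."""
    return F.cross_entropy(cls_logits, domain_one_hot.argmax(dim=1))

def train_step(G, D, g_optim, d_optim, x_real, label_org, lambda_cls=1.0, lambda_rec=10.0):
    """One alternating update: discriminator first, then generator (StarGAN-style)."""
    # Randomly draw a target domain label for this batch by permuting the original labels.
    label_trg = label_org[torch.randperm(label_org.size(0))]

    # ---- Discriminator: real/fake score + domain classification on real images ----
    out_src_real, out_cls_real = D(x_real)
    x_fake = G(x_real, label_trg).detach()
    out_src_fake, _ = D(x_fake)
    d_loss = (-out_src_real.mean() + out_src_fake.mean()
              + lambda_cls * classification_loss(out_cls_real, label_org))
    d_optim.zero_grad(); d_loss.backward(); d_optim.step()

    # ---- Generator: fool D, be classified into the target domain, reconstruct the input ----
    x_fake = G(x_real, label_trg)
    out_src_fake, out_cls_fake = D(x_fake)
    x_rec = G(x_fake, label_org)
    g_loss = (-out_src_fake.mean()
              + lambda_cls * classification_loss(out_cls_fake, label_trg)
              + lambda_rec * torch.mean(torch.abs(x_real - x_rec)))
    g_optim.zero_grad(); g_loss.backward(); g_optim.step()
    return d_loss.item(), g_loss.item()
```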
  • First is the training of the discriminator, which proceeds as follows:
  • The expression materials of the target domains in the at least two different expression domains are taken in turn as the input photos, the domain labels of these target domains are used as the target domain labels of the target conversion domain, and both are input into the discriminator;
  • After the input photo passes through the discriminator, an adversarial loss function is used to discriminate whether the input photo is real or fake, and the result of the discrimination is judged to determine whether the target domain label is correct; the adversarial loss function is:
  • $\mathcal{L}_{adv} = \mathbb{E}_{x}[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1 - D_{src}(G(x,c)))]$
  • where x is the input photo, c is the label of the target conversion domain (the target domain label), and G(x, c) is the output image of the first generator G in the facial expression modification model.
  • In practical applications, judging the result of the discrimination to determine whether the target domain label is correct means determining, from the shape of the expression in the photo output after discrimination, whether it belongs to the expression shape corresponding to the target domain label. If it does, the photo is judged to be real, that is, a synthesis result belonging to the target domain label; otherwise it is not, and it must be re-input to train the discriminator.
  • Next is the training of the generator G; the output of this training can also be used to optimize the classification loss of the discriminator D, specifically:
  • A target domain label is generated with a random algorithm, and the original target domain label of the input photo is obtained;
  • The input photo and the generated target domain label are input into the first generator to synthesize a fake image, and the fake image is output separately to the second generator and to the auxiliary discriminator;
  • The second generator performs image restoration and reconstruction according to the received fake image and the original target domain label, obtains a recombined image, and outputs it to the first generator for depth-concatenated cyclic synthesis;
  • The auxiliary discriminator discriminates whether the fake image is real or fake and performs re-classification according to the result of the discrimination;
  • According to the results of the cyclic synthesis and the re-classification, the optimization losses of the model are calculated to obtain the final adversarial loss; the optimization losses include the training optimization losses of the discriminator and the generator, the restoration-reconstruction loss, the overall training loss, and the adversarial loss after training, calculated as follows:
  • Training optimization loss of the discriminator (domain classification loss on real images): $\mathcal{L}_{cls}^{r} = \mathbb{E}_{x,c'}[-\log D_{cls}(c'\,|\,x)]$
  • Training optimization loss of the generator (domain classification loss on fake images): $\mathcal{L}_{cls}^{f} = \mathbb{E}_{x,c}[-\log D_{cls}(c\,|\,G(x,c))]$
  • Restoration-reconstruction loss: $\mathcal{L}_{rec} = \mathbb{E}_{x,c,c'}[\lVert x - G(G(x,c),c')\rVert_{1}]$
  • Overall training losses: $\mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls}^{r}$ and $\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls}^{f} + \lambda_{rec}\mathcal{L}_{rec}$
  • Adversarial loss after training (with gradient penalty): $\mathcal{L}_{adv} = \mathbb{E}_{x}[D_{src}(x)] - \mathbb{E}_{x,c}[D_{src}(G(x,c))] - \lambda_{gp}\mathbb{E}_{\hat{x}}[(\lVert\nabla_{\hat{x}}D_{src}(\hat{x})\rVert_{2} - 1)^{2}]$
  • where $D_{cls}(c'\,|\,x)$ represents the probability distribution over domain labels computed by the discriminator on the real image; c' is the original target domain label; $\lambda_{cls}$ and $\lambda_{rec}$ are hyperparameters used to adjust the importance of the domain re-classification loss and of the restoration-reconstruction loss relative to the adversarial loss; $\hat{x}$ is sampled uniformly along straight lines between real and generated images; and $\lambda_{gp}$ is fixed at 10.
  • In practical applications, the restoration-reconstruction loss works as follows: by minimizing the adversarial loss and the classification loss, G tries hard to generate realistic pictures in the target domain, but this alone does not guarantee that the learned conversion changes only the domain-related information of the input image without changing the image content. The cycle-consistency loss is therefore added: G(x, c) and the original label c' of the image x are fed into G again, and the 1-norm difference between the generated picture and x is computed.
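  • For concreteness, the losses described above (the gradient-penalty adversarial loss, the domain classification losses and the L1 restoration/cycle-consistency loss) can be written in code as below; this follows the published StarGAN formulation that the text paraphrases and has not been checked against the applicant's implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(D, x_real, x_fake, lambda_gp=10.0):
    """Wasserstein-style adversarial terms with a gradient penalty on interpolated samples."""
    d_real, _ = D(x_real)
    d_fake, _ = D(x_fake)
    # x_hat is sampled uniformly along straight lines between real and generated images.
    alpha = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    d_hat, _ = D(x_hat)
    grad = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat, create_graph=True)[0]
    grad_norm = grad.view(grad.size(0), -1).norm(2, dim=1)
    gradient_penalty = ((grad_norm - 1) ** 2).mean()
    d_adv = -d_real.mean() + d_fake.mean() + lambda_gp * gradient_penalty   # minimized by D
    g_adv = -d_fake.mean()                                                  # minimized by G
    return d_adv, g_adv

def domain_classification_loss(cls_logits, domain_one_hot):
    """-log D_cls(c | x): how confidently the discriminator assigns the image to domain c."""
    return F.cross_entropy(cls_logits, domain_one_hot.argmax(dim=1))

def reconstruction_loss(x_real, x_reconstructed):
    """L1 cycle-consistency loss ||x - G(G(x, c), c')||_1."""
    return torch.mean(torch.abs(x_real - x_reconstructed))
```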
  • After the facial expression modification model has been trained in the manner above, when step S240 is executed and the target domain label has been recognized on the photo, the corresponding expression data set is matched from the expression material library according to the target domain label; one or more expression materials are then selected according to actual requirements and filled into the corresponding facial feature parts, and finally the filled region and the edge region of those facial features are fused to obtain the complete target expression photo.
  • Further, besides the training on the target-domain mapping relationships described above, the model can also be trained by combining multiple data sets, the multiple data sets being recorded as a mutual collection of multiple target domains. When training jointly on multiple data sets, a mask vector m is introduced to control which data set is input for training; the mask vector m allows the unified StarGAN network to ignore unknown labels and to use the known labels from a specific data set. In the unified StarGAN network, an n-dimensional vector is used to represent the mask vector m, and the specific implementation steps are:
  • All the acquired, already categorized facial expression feature images are formed into data sets, and a mask vector m is set on the basis of these data sets; the mask vector is used to control the unified format of the domain labels in the data sets. The unified domain label format is $\tilde{c} = [c_{1}, c_{2}, \ldots, c_{n}, m]$, where $c_{n}$ is the domain label of the n-th data set;
  • The mask vector is input to the generator, and, based on the unified domain label format, the data sets are controlled to perform unified training of the model.
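  • A small sketch of assembling the unified label with the mask vector m for joint training over several data sets is given below; it mirrors the n-dimensional one-hot mask described above, and the concrete data-set sizes in the example are assumptions.

```python
import torch

def unified_label(dataset_labels, active_index: int) -> torch.Tensor:
    """Concatenate the per-data-set domain labels with a one-hot mask vector m: [c1, ..., cn, m].

    dataset_labels: a list of n 1-D label tensors, one per data set; the labels of the
    data sets whose annotation is unknown are zero vectors and are ignored via the mask."""
    n = len(dataset_labels)
    mask = torch.zeros(n)
    mask[active_index] = 1.0                        # m: one-hot of length n
    return torch.cat(dataset_labels + [mask])

# Example: data set 1 has 7 expression classes, data set 2 has 3 attributes (assumed sizes).
c1 = torch.eye(7)[2]        # known label from data set 1 (e.g. "disgust")
c2 = torch.zeros(3)         # unknown label of data set 2, set to zero
print(unified_label([c1, c2], active_index=0))      # length 7 + 3 + 2 = 12
```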
  • Further, for step S230, when images are synthesized, the input photo is synthesized through multi-domain conversion on the basis of the learned facial expression modification model: an arbitrary photo to be processed is input, the model synthesizes one image for each target domain according to the mappings it has learned between the multiple domains, and one of these images is then selected as the image of the target conversion domain.
  • The following example uses a target domain in which one expression domain contains expression materials of several different emotions, as shown in FIG. 4:
  • The first picture on the left represents the input picture, and the corresponding target domain label is a label requiring conversion into 7 expression images. According to the mapping relationship between the target domain label and the multiple domains, the facial expression modification model accesses and queries the expression domain carrying the 7 kinds of expression material, and replaces the original input picture with these materials one by one. The following 7 pictures represent the composite images generated according to the attributes of the expression materials; from left to right the expressions are: angry, contempt, disgust, fear, happy, sad, surprised.
  • In this way, when a person in an old photo has closed eyes or is not smiling, this network can be used to repair the photo and generate a new picture, which can be widely applied in camera apps on mobile phones.
  • In this case, after step S250, the method further includes re-learning and classifying the synthesized target expression photo into a target domain according to the facial expression modification model, implemented as follows:
  • The target expression photo is input into the facial expression modification model, the discriminator in the model judges whether it is real or fake, the domain label of the target expression photo is determined according to the result of this judgment, and the target expression photo is classified into the corresponding expression data set according to that domain label, thereby enlarging the expression materials in the expression data set. This not only realizes the acquisition of expression materials across multiple domains, but also realizes dynamic expansion of the expression data sets.
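  • The dynamic expansion step described above can be pictured as classifying the synthesized photo with the discriminator's domain head and filing it into the matching expression data set; the sketch below reuses the earlier illustrative assumptions about the discriminator interface and the material library.

```python
import torch

EXPRESSIONS = ["angry", "contempt", "disgust", "fear", "happy", "sad", "surprised"]

def expand_material_library(D, synthesized_photo: torch.Tensor, library: dict) -> str:
    """File a synthesized expression photo into the expression data set chosen by the
    discriminator's domain-classification head, growing the expression material library."""
    with torch.no_grad():
        _, cls_logits = D(synthesized_photo.unsqueeze(0))   # (1, num_domains)
    label = EXPRESSIONS[int(cls_logits.argmax(dim=1))]
    library.setdefault(label, []).append(synthesized_photo)
    return label
```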
  • By generating expressions through conversion with the method above, and by training one model to learn the mapping relationships of multiple domains, multiple data sets of different domains can be trained simultaneously within a single network, and image-to-image conversion can be performed for multiple domains with only one model. This overcomes the limitations in scalability and robustness that arise when more than two domains are handled, as well as the low efficiency and poor results of existing methods in multi-domain image conversion tasks.
  • The training process of the facial expression modification model used in the expression generation method provided by the embodiments of this application is now described in detail with reference to the example of FIG. 4. The input photo in the figure involves only one face image, so the face image in the photo is the face avatar the user wants to modify; in practical applications, when the input photo contains several face avatars, step S210 above needs to be executed to extract the image to be processed. For a photo with one face image, the specific implementation process is as follows:
  • Step S510: The input picture x and the target generation domain c are combined and fed to the generation network to synthesize a fake picture.
  • In this embodiment, the generation network G refers to the facial expression modification model described above, and this model is trained with the training algorithm of the StarGAN network. As shown in FIG. 3, the model includes two discriminators D and two generators G.
  • Step S520: The fake picture and the real picture are each fed to the discriminator D; D has to judge both whether the picture is real and which domain it comes from.
  • Step S530: The generated fake picture and the domain information c' of the original picture are combined and fed to the generator G, which is required to reconstruct and output the original input picture x.
  • In this embodiment, for a given input picture x and target domain label c, the goal of the network is to convert x into an output picture y that can be classified into the target domain c. The losses of the individual parts are analysed as follows:
  • (1) The common GAN adversarial loss: $\mathcal{L}_{adv} = \mathbb{E}_{x}[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1 - D_{src}(G(x,c)))]$
  • (2) The domain classification loss on real pictures, used to optimize D. $D_{cls}(c'\,|\,x)$ represents the probability distribution over domain labels computed by D on the real picture; this learning objective enables D to recognize the input image x as its corresponding domain c', where (x, c') is given by the training set. The loss for optimizing D is: $\mathcal{L}_{cls}^{r} = \mathbb{E}_{x,c'}[-\log D_{cls}(c'\,|\,x)]$
  • (3) The domain classification loss on fake pictures, used to optimize G, makes G try to generate pictures that D classifies into the target domain c: $\mathcal{L}_{cls}^{f} = \mathbb{E}_{x,c}[-\log D_{cls}(c\,|\,G(x,c))]$
  • (4) The reconstruction loss: by minimizing the adversarial loss and the classification loss, G tries hard to generate realistic pictures in the target domain, but this does not guarantee that the learned conversion changes only the domain-related information of the input picture without changing its content. A cycle-consistency loss is therefore added: G(x, c) and the original label c' of picture x are fed into G again, and the 1-norm difference between the generated picture and x is computed: $\mathcal{L}_{rec} = \mathbb{E}_{x,c,c'}[\lVert x - G(G(x,c),c')\rVert_{1}]$
  • (5) The overall losses: $\mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls}^{r}$ and $\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls}^{f} + \lambda_{rec}\mathcal{L}_{rec}$
  • (6) The adversarial loss obtained with improved training, where $\hat{x}$ is sampled uniformly along straight lines between real and generated images and $\lambda_{gp}$ is fixed at 10: $\mathcal{L}_{adv} = \mathbb{E}_{x}[D_{src}(x)] - \mathbb{E}_{x,c}[D_{src}(G(x,c))] - \lambda_{gp}\mathbb{E}_{\hat{x}}[(\lVert\nabla_{\hat{x}}D_{src}(\hat{x})\rVert_{2} - 1)^{2}]$
  • (7) When training jointly on multiple data sets, the mask vector is also input to the generator. Here $c_{i}$ represents the label of the i-th data set; if the known label $c_{i}$ is a binary attribute it can be expressed as a binary vector, and if it is a categorical attribute it is expressed as a one-hot vector; the labels of the remaining n-1 data sets are set to 0. m is a one-hot code of length n, and the unified label is $\tilde{c} = [c_{1}, \ldots, c_{n}, m]$.
  • (8) The result of the expression modification is similar to the following: the first picture on the left represents the input picture, and the subsequent 7 pictures represent the composite pictures generated according to the expression attributes; from left to right the expressions are: angry, contempt, disgust, fear, happy, sad, surprised.
  • FIG. 6 is a schematic diagram of the functional modules of the expression generation device based on the StarGAN network provided by an embodiment of the application.
  • In this embodiment, the device includes: a face recognition module 61, an acquisition module 62, a query module 63 and a synthesis module 64;
  • The face recognition module 61 is used to obtain a photo to be processed, recognize all face portraits in the photo based on face recognition technology, draw a face bounding box of each face portrait on the photo, and use face key point detection technology to mark the positions of the facial features of the face portrait within the face bounding box;
  • the acquisition module 62 is configured to acquire a target domain label to be converted, where the target domain label is an expression data set used to indicate that the photo is to be converted;
  • The query module 63 is configured to input the photo carrying the facial feature position marks and the target domain label into the pre-trained facial expression modification model, and to query the expression data set corresponding to the target domain label from the expression material library, where the facial expression modification model is an expression replacement model that, based on a single network, can obtain expression materials across multiple expression data sets of different expression domains, and is used to determine, according to the mapping relationships between the target domain labels, the domain labels and the expression data sets of multiple different expression domains, the expression data set corresponding to the target domain label from the expression material library, the expression material library containing at least two expression data sets of different expression domains;
  • The synthesis module 64 is used by the facial expression modification model to fill the determined expression data sets of the different expressions sequentially into the positions of the corresponding facial features in the face portrait, and to perform expression image synthesis to obtain the target expression photo.
  • the application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium stores an expression generation program based on the StarGAN network; when the expression generation program based on the StarGAN network is executed by a processor, the steps of the StarGAN-based expression generation method described in any of the embodiments above are implemented.
  • For the method implemented when the StarGAN-based expression generation program is executed by the processor, reference can be made to the various embodiments of the StarGAN-based expression generation method of this application, and it is therefore not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

本申请涉及人工智能技术领域,公开了一种基于StarGAN网络的表情生成方法,表情生成方法是通过将相片和目标域标签作为输入信息,输入到基于单一网络可以跨越不同表情域的多个表情数据集获取表情素材的表情替换的人脸表情修改模型,查询与所述目标域标签对应的表情数据集,并根据所述表情数据集对相片进行表情合成处理,得到目标表情相片;本申请还提供了一种基于StarGAN网络的表情生成装置、设备及计算机可读存储介质,基于仅使用一个模型为多个域执行图像到图像转换,从而实现一对多域的图像转换,提高了图像的转换效率和准确率。

Description

表情生成方法、装置、设备及存储介质
本申请要求于2019年10月18日提交中国专利局、申请号为201910990781.6、发明名称为“表情生成方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在申请中。
技术领域
本申请涉及人工智能技术领域,尤其涉及一种基于对抗生成网络(StarGAN网络)的表情生成方法、装置、设备及存储介质。
背景技术
随着智能图像技术的不断发展,尤其是图像到图像转换技术,该技术是一门有效、快速地对图像进行处理和分析的技术,也是目前计算机视觉的新型领域,其将定义在原图像空间的图像以某种形式转换到另外的空间,利用空间的特有性质更方便地对图像进行一定的加工和处理,最后再转换回原图像空间以达到所需的效果。该技术目前主要是应用在移动电子终端上,用于实现人类特征脸的识别与分类,具体是首先提取出目标的脸部图像信息,然后将待识别的人脸投影到新的多维人脸空间,通过简单的分类手段可以完成对人脸的识别与分类。
此外,随着美图美颜等图像处理技术的发展,该技术也被广泛应用于众多的娱乐软件上,比如改变人脸情绪表情,可以将开心的表情转换成为生气、愤怒、哀伤等其他的表情。但是发明人意识到,现有的图像转换中,其图像模型只能实现一个模型对应一个图像域的转换,在多域的图像转换任务中,也只是通过调度不同的模型进行转换,该种转换方式的效率并不高,并且其最终的效果也不佳,可见,其在可扩展性和鲁棒性方面还是存在比较大的局限性,因此,亟需提供一种可以同时在多个图像域之间来回相互转换的图像处理方式来解决上述的问题。
发明内容
本申请的主要目的在于提供一种基于StarGAN网络的表情生成方法、装置、设备及存储介质,旨在解决现有的图像转换技术中,难以实现一对多域的图像转换,而导致图像处理效率较低的技术问题。
为实现上述目的,本申请提供一种基于StarGAN网络的表情生成方法,所述表情生成方法包括以下步骤:获取待处理的相片,并基于人脸识别技术识别出所述相片中所有人脸画像,并在所述相片上勾画出每个人脸画像的人脸边界框;利用人脸关键点检测技术从所述人脸边界框中标记出所述人脸画像中的人脸五官位置;获取待转换的目标域标签,其中,所述目标域标签为用于指示所述相片待转换的表情数据集;将带有人脸五官位置标记的所述相片和所述目标域标签输入至预先训练得到的人脸表情修改模型中,并从表情素材库中查询与所述目标域标签对应的表情数据集,其中,所述人脸表情修改模型为基于单一网络可跨越不同表情域的多个表情数据集获取表情素材的表情替换模型,其用于根据域标签与多个不同表情域的表情数据集之间的对映射关系,从表情素材库中确定与所述目标域标签对应的表情数据集,所述表情素材库包括有至少两个不同表情域的表情数据集;通过所述人脸表情修改模型将确定的不同表情的表情数据集依次填充至所述人脸画像中对应的人脸五官位置中,并进行表情图像合成处理,得到目标表情相片。
此外,为实现上述目的,本申请还提供一种基于StarGAN网络的表情生成装置,所述表情生成装置包括:
人脸识别模块,用于获取待处理的相片,并基于人脸识别技术识别出所述相片中所有 的人脸画像,并在所述相片上勾画出每个人脸画像的人脸边界框;利用人脸关键点检测技术从所述人脸边界框中标记出所述人脸画像中的人脸五官位置;采集模块,用于获取待转换的目标域标签,其中,所述目标域标签为用于指示所述相片待转换的表情数据集;查询模块,用于将带有人脸五官位置标记的所述相片和所述目标域标签输入至预先训练得到的人脸表情修改模型中,并从表情素材库中查询与所述目标域标签对应的表情数据集,其中,所述人脸表情修改模型为基于单一网络可以跨越不同表情域的多个表情数据集获取表情素材的表情替换模型,其用于根据域标签与多个不同表情域的表情数据集之间的对映射关系,从表情素材库中确定与所述目标域标签对应的表情数据集,所述表情素材库包括有至少两个不同表情域的表情数据集;合成模块,用于通过所述人脸表情修改模型将确定的不同表情的表情数据集依次填充至所述人脸画像中对应的人脸五官位置中,并进行表情图像的合成处理,得到目标表情相片。
此外,为实现上述目的,本申请还提供一种基于StarGAN网络的表情生成设备,所述基于StarGAN网络的表情生成设备包括:存储器、处理器以及存储在所述存储器上并可在所述处理器上运行的基于StarGAN网络的表情生成程序,所述基于StarGAN网络的表情生成程序被所述处理器执行时实现如下所述的基于StarGAN网络的表情生成方法的步骤:获取待处理的相片,并基于人脸识别技术识别出所述相片中所有人脸画像,并在所述相片上勾画出每个人脸画像的人脸边界框;利用人脸关键点检测技术从所述人脸边界框中标记出所述人脸画像中的人脸五官位置;获取待转换的目标域标签,其中,所述目标域标签为用于指示所述相片待转换的表情数据集;将带有人脸五官位置标记的所述相片和所述目标域标签输入至预先训练得到的人脸表情修改模型中,并从表情素材库中查询与所述目标域标签对应的表情数据集,其中,所述人脸表情修改模型为基于单一网络可跨越不同表情域的多个表情数据集获取表情素材的表情替换模型,其用于根据域标签与多个不同表情域的表情数据集之间的对映射关系,从表情素材库中确定与所述目标域标签对应的表情数据集,所述表情素材库包括有至少两个不同表情域的表情数据集;通过所述人脸表情修改模型将确定的不同表情的表情数据集依次填充至所述人脸画像中对应的人脸五官位置中,并进行表情图像合成处理,得到目标表情相片。
此外,为实现上述目的,本申请还提供一种计算机可读存储介质,所述计算机可读存储介质上存储有基于StarGAN网络的表情生成程序,所述基于StarGAN网络的表情生成程序被处理器执行时实现如下所述的基于StarGAN网络的表情生成方法的步骤:获取待处理的相片,并基于人脸识别技术识别出所述相片中所有人脸画像,并在所述相片上勾画出每个人脸画像的人脸边界框;利用人脸关键点检测技术从所述人脸边界框中标记出所述人脸画像中的人脸五官位置;获取待转换的目标域标签,其中,所述目标域标签为用于指示所述相片待转换的表情数据集;将带有人脸五官位置标记的所述相片和所述目标域标签输入至预先训练得到的人脸表情修改模型中,并从表情素材库中查询与所述目标域标签对应的表情数据集,其中,所述人脸表情修改模型为基于单一网络可跨越不同表情域的多个表情数据集获取表情素材的表情替换模型,其用于根据域标签与多个不同表情域的表情数据集之间的对映射关系,从表情素材库中确定与所述目标域标签对应的表情数据集,所述表情素材库包括有至少两个不同表情域的表情数据集;通过所述人脸表情修改模型将确定的不同表情的表情数据集依次填充至所述人脸画像中对应的人脸五官位置中,并进行表情图像合成处理,得到目标表情相片。
本申请提供的表情生成方法是通过将相片和目标域标签作为输入信息,输入到基于单一网络可以跨越不同表情域的多个表情数据集获取表情素材的表情替换的人脸表情修改模型,查询与所述目标域标签对应的表情数据集,并根据所述表情数据集对相片进行表情合成处理,得到目标表情相片,从而实现一对多域的图像转换,提高了图像的转换效率和准 确率。
附图说明
图1为本申请实施例方案涉及的移动终端的运行环境的结构示意图;
图2为本申请提供的基于StarGAN网络的表情生成方法第一实施例的流程示意图;
图3为本申请提供的人脸表情修改模型的结构示意图;
图4为本申请提供相片中人脸图像的表情转换后的示意图;
图5为本申请提供的人脸表情修改模型训练的流程示意图;
图6为本申请提供基于StarGAN网络的表情生成装置的功能模块示意图。
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。
具体实施方式
应当理解,此处所描述的具体实施例仅用以解释本申请,并不用于限定本申请。
本申请提供一种基于StarGAN网络的表情生成装置,该装置可以是移动终端中的一个插件,用于执行本申请实施例提供的表情生成方法,如图1所示,图1为本申请实施例方案涉及的移动终端运行环境的结构示意图。
如图1所示,该移动终端包括:处理器101,例如CPU,通信总线102、用户接口103,网络接口104,存储器105。其中,通信总线102用于实现这些组件之间的连接通信。用户接口103可以包括显示屏(Display)、输入单元比如键盘(Keyboard),网络接口104可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器105可以是高速RAM存储器,也可以是稳定的存储器(non-volatile memory),例如磁盘存储器。存储器105可选的还可以是独立于前述处理器101的存储系统。
本领域技术人员可以理解,图1中示出的移动终端的硬件结构并不构成对基于StarGAN网络的表情生成设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
如图1所示,作为一种计算机可读存储介质的存储器105中可以包括操作系统、网络通信模块、用户接口模块以及用于实现表情生成的基于StarGAN网络的表情生成程序。其中,操作系统是管理和控制基于StarGAN网络的表情生成装置、存储器中的软件资源调用的程序,支持基于StarGAN网络的表情生成程序以及其它软件和/或程序的运行。
在图1所示的基于StarGAN网络的表情生成装置的硬件结构中,网络接口104主要用于接入网络;用户接口103主要用于检测用户在移动终端上的表情合成控制操作,即是选择的表情域的标签信息的触控操作,而处理器101可以用于调用存储器105中存储的基于StarGAN网络的表情生成程序,并执行以下基于StarGAN网络的表情生成方法的各实施例的操作。
基于上述投诉举报系统的硬件结构,本申请提出了一种基于StarGAN网络的表情生成方法,参照图2,图2为本申请实施例提供的基于StarGAN网络的表情生成方法的流程图。在本实施例中,所述基于StarGAN网络的表情生成方法具体包括以下步骤:
步骤S210,获取待处理的相片,并基于人脸的识别技术识别出所述相片中所有的人脸画像,并在所述相片上勾画出每个人脸画像的人脸边界框;
在该步骤中,所述待处理的相片主要指的是老式相片,例如通过胶片洗出来的照片的备份版照片,或者是胶片底片本身,当然其也可以是现有的电子版照片,其相片上包含有至少一个人脸图像。
在本实施例中,对于人脸边界框的勾画具体实现为,基于图像识别技术,先对输入的相片进行背景和人像图像的区分,优选的,在进行背景和人像图像的区分过程中,还可以 增加对背景的虚化操作,将人像图像做进一步的凸显处理,然后在通过人脸的检测对人像图像中的人脸部分提取出来,并且进行头像轮廓的描画,从而实现对人脸的标记,其标记具体可以是以描点的方式进行标记,也可以是使用线条的方式标记,甚至在识别出头像位置后,直接对人像所在的区域进行半透明的蒙蔽处理。
在该步骤中,还包括根据用户的实际表情修改请求确定所述相片中需要进行表情处理的人脸画像,并对人脸画像进行标注。
步骤S220,利用人脸关键点检测技术从所述人脸边界框中标记出人脸五官在所述人脸画像中的位置;
在本实施例中,人脸关键点检测技术指的是对人脸的面部局部区域进行检测的技术,其主要是用来对人脸上的脸轮廓、眼睛、眉毛、嘴巴、鼻子等器官的特征的区域标记,以及相互之间的几何位置关系;具体可以根据各个器官的形态进行检测识别。
进一步的,该步骤中,在进行人脸五官标记时,具体可以仅标记需要进行表情处理的人脸上的人脸五官;或者是全部标记,但是对需要进行表情处理的部分进行高亮设置。
步骤S230,获取待转换的目标域标签,其中,所述目标域标签为用于指示所述相片待转换的表情数据集;
在该步骤中,在获取到的目标域标签后,还可以是将目标域标签添加到人脸画像上对应的人脸五官的区域上,而在进行表情合成操作时,人脸表情修改模型通过识别相片上的目标域标签来确定待获取的表情素材。
步骤S240,将带有人脸五官位置标记的所述相片和所述目标域标签输入至预先训练得到的人脸表情修改模型中,并从表情素材库中查询与所述目标域标签对应的表情数据集,其中,所述人脸表情修改模型为基于单一网络可以跨越不同表情域的多个表情数据集获取表情素材的表情替换模型,其用于根据以及域标签与多个不同表情域的表情数据集之间的对映射关系,从表情素材库中确定与所述目标域标签对应的表情数据集,所述表情素材库包括有至少两个不同表情域的表情数据集;
在本实施例中,所述待转换的目标域标签指的是用户期待的表情对应的标注信息,而该标注信息用于指示用户期待的表情在所述表情素材库中的表情数据集;在实际应用中,为了便于终端对相片的表情修改处理,一般情况下都会预先在终端或者是表情处理装置上存储有一个案表情类型分类的数据库,该数据库可以叫做目标域的集合,例如生气、轻蔑、厌恶、害怕、高兴、悲伤、惊讶等表情,每种表情都设置有多个不同的表现形式的图像,从而形成表情素材库,而表情素材库中对于生气、轻蔑、厌恶、害怕、高兴、悲伤或惊讶的分类标注则为目标域标签。当用户输入一张相片时,通过该设置目标域标签来确定选择转换为哪个类型的表情,并从对应的表情数据集选择合适的图像对输入相片的表情进行替换合成处理。
在本实施例中,所述人脸表情修改模型是通过预训练得到的,而该训练的过程主要是训练模型与不同目标域之间的映射关系,通过获取到该映射关系后,再使用该模型进行人脸图像的表情进行修改时,可以在不同域中获取表情素材进行生成表情相片,对于映射关系训练的实现为:
首先,从已知分类的表情数据中选择表情数据作为模型的训练数据,其中,在选择该模型的训练数据时,需要从不同表情域中挑选,并且挑选出来的数据都是已知域标签的数据;
然后,将训练数据输入到模型中,模型分别提取两个数据进行学习训练,一个数据是表情数据,另一个是域标签,而在对表情数据进行学习训练时采用当前的一个模型训练一个域数据的方式进行训练即可;对于域标签的学习训练,则是模型先对所有输入的训练数据的域标签进行汇总,然后基于汇总后的域标签进行分类,其中分类指的是将相同域标签 的划分类一类并进行目标域标签的标注,然后模型再根据分类后的目标域标签进行学习训练,从而得到最终的可以跨越不同表情域获取数据的人脸表情修改模型。
对于根据分类后的目标域标签进行学习训练,具体可以通过以下方式来学习:
这里需要说明的是目标域标签相当于设置在模型上的链接索引,而域标签则是设置在对应的表情域上的链接索引,而模型在对域标签进行汇总并标记目标域标签的过程中就相当于建立了目标域标签与域标签之间的映射关系,而模型在对目标域标签的训练中,就是训练模型在目标域标签到域标签之间的映射关系的走通;
首先,模型学习目标域标签的汇总表格,并将学习到的结果存储到模型中;
然后,通过举例的方式不断训练模型中目标域标签与域标签之间的映射关系的跳转,即是输入一个生气的目标域标签,模型根据生气的标签查询对应的表情域的域标签,然后根据域标签调取域标签下的表情数据;
最后,根据输出的表情数据来判断是否与目标域标签对应,例如,输入的是生气的目标域标签,而模型通过查询调取后,输出的表情数据是生气的表情,则说明该训练过程是对的,反之若输出的是不开心的表情,则说明该映射关系存在问题,需要再次训练,直到输出的数据正确为止。
步骤S250,通过所述人脸表情修改模型将确定的不同表情的表情数据集依次填充至所述人脸画像中对应的人脸五官的位置中,并进行表情图像的合成处理,得到目标表情相片。
在本实施例中,所述人脸表情修改模型指的是在训练时,同时学习了多个目标域中的映射关系的训练模型,其包括鉴别器、两个生成器和辅助鉴别器,其中鉴别器是用于对输入的相片的真假性的检测,这里的真假可以立即判断分类的目标域与相片中的表情是否一致,或者是是否带修改标识的相片(即是可以理解为是进行表情修改过的相片)。
在实际应用中,上述的方案还可以是通过获取到待处理的相片后,直接通过人脸表情修改模型进行合成修改,由于该模型是同时包含了多个目标域的映射关系,因此,当该模型接收到输入的相片后,直接根据学习的映射关系调取多个目标域中的表情数据找到相符合的图像进行合成,输出多个目标域的图像数据,最后根据用户输入的期待转换的目标域标签对输出的多个目标域的图像数据进行筛选,得到最终的合成图像。
在本实施例中,所述目标域标签除了是按照生气、开心等分类标注之外,还可以是根据表情素材标号的方式进行设置,也即是说一个目标域可以包含有多种情绪的表情素材(人脸表情特征图像),对于这类的目标域的域标签就标注为生气之类的域标签了,而是设置为表情库1、表情库2等等的名称,且这些名称的表情库都是来自于不同表情域的;对应的,在模型上训练后得到的目标域标签可以是表情库1’、表情库2’。这时,当用户输入一个相片和目标域标签为表情库1’时,其模型由于通过训练学习到了可以访问不同表情域上的表情素材,所以模型会根据表情库1’以及训练得到的映射关系,直接查询到与表情库1’对应域的表情库1中的表情素材,然后将这些素材依次替换相片上对应表情位置的内容,从而合成形成所需要的表情相片。
在本实施例中,在识别相片中的人脸图像时,具体是通过人脸图像处理技术判断所述相片中是否存在人脸图像,具体是识别相片中是否存在需要进行表情调换的人脸图像。
在实际应用中,一个相片可能会存在多个人脸图像,也即是多个人物时,用户需要对其中的人脸图像进行标注,标注需要进行表情替换的部分人脸图像,优选的可以通过在移动终端上通过勾选的方式标记人脸图像的大概位置。
若检测到相片中存在人脸图像(即是需要替换的人脸图像)时,则根据人脸检测函数库Dlib在开放源代码计算机视觉类库平台上的人脸检测算法提取所述相片中的人脸,并在所述人脸所在的位置描画出人脸边界框;
利用表情的跟踪算法对所述人脸边界框内的人脸图像进行表情跟踪点的标记。
在实际应用中,对于输入的相片其可能带有人像,也可能不带有人像,而对于输入的相片中进行是否含有人脸的图像的检测,只有在检测到人脸图像时,执行使用Dlib在OpenCv(开放源代码计算机视觉类库)平台上的人脸检测算法找到人脸位置的Bounding Box(边界框)。人脸检测为检测人脸图像是否存在人脸,并定位人脸图像中的所有人脸,获取高精度的人脸标定框。
基于识别出来后,通过人脸关键跟踪点的标识方式对人脸的边框进行标记,优选的,只对人脸的脸型轮廓以及眉毛进行大概的标记。
在本实施例中,对于人脸五官的识别可以通过五官识别模型来识别,该五官识别模型是根据各种表情下产生的五官形状来训练得到的,例如训练一个眼睛的识别模型,首先分别获取到生气、轻蔑、厌恶、害怕、高兴、悲伤、惊讶表情下的眼睛形态,然后规定眼睛的整体形状,然后基于规定的整体形状,通过模型构架学习不同表情下的眼睛的表示细节,基于这些学习到的细节构建出准确的眼睛识别模型,最后在使用时,当检测到相片中存在人像后,直接调用眼睛识别模型在人像的边界框内识别出眼睛的所在部位,并进行跟踪点的标记,从而实现眼睛的识别。同理,对于其他五官的识别与眼睛的相同。
进一步的,基于上述对人脸的大概轮廓标记后,跳转至步骤S220,对标注的人脸图像进行人脸五官的识别,具体的通过预先训练好的人脸关键点检测器对所述人脸边界框中的人脸图像进行人脸五官检测,并采用106追踪点的标记方式对检测到的人脸五官进行边界标记,以描绘出所述人脸图像上的每个人脸五官的形状和位置信息。
在实际应用中,对于人脸关键点检测器的训练具体可以直接使用一致分类好的目标域中的表情数据来进行训练,即是通过人脸检测算法对目标域中的表情进行识别,并且将106个跟踪点均匀设置在五官区域中,使得其能精准反应五官的形状和区域,同时还反应出五官之间的方位和形状。
在本实施例中,在根据人脸表情修改模型查询多域中的表情数据集之前,还需要预先对人脸表情修改模型的训练,具体的该模型的训练通过以下方式实现:
获取至少两个不同表情域中的目标域以及每个目标域对应的域标签,所述目标域中包含有已经通过人工归类的方式归类完成的人脸表情特征图像;
根据所述目标域,对预设的对抗网络的模型构架中的鉴别器进行真假相片的区分训练,以得到可识别出表情类别的鉴别器;
根据所述目标域,对预设的对抗网络的模型构架中的生成器进行输入相片到待转换的目标域之间的转换训练以及对输入相片转换后的图像进行还原重建的训练;
所述预设的对抗网络的模型构架记录根据人工归类的方式归类完成的人脸表情特征图像对应的域标签,并根据域标签在模型构架中生成目标域标签,建立目标域标签与域标签之间的映射关系;
基于训练后的鉴别器、生成器和所述映射关系,构建满足表情合成的所述人脸表情修改模型。
在本实施例中,在训练鉴别器和生成器的过程中,是按照目标域的分类进行逐个训练,例如,将生气表情的目标域的所有表情素材逐一输入到鉴别器和生成器中,然后对鉴别器输出的真假结果进行判断,若输入的是真的生气表情且于目标域的域标签是一致的,则在输入一个非生气表情的目标域的表情素材进行判断,根据输出的结果再进行判断,这样交换的输入进行训练。同理对于生成器的训练也是采用相似的方法实现。
进一步的,在学习这些表情素材的同时,还对这些表情素材标注上对应的目标域标签,并将目标域标签与原来的域标签建立映射关系存储在鉴别器中,从而实现映射关系的学习,使得最终的模型可以实现跨域的访问。
在本实施例中,在步骤S240中,在训练所述人脸表情修改模型时,还包括学习不同域 之间的映射关系,根据目标域标签与不同域之间的映射关系获取多个域的表情素材,即是说一个目标域标签对应有多个域的表情数据集,并从每个表情数据集中选择一个素材来使用,对一个相片的中的一个人脸画像合成输出多个表情相片。
在本实施例中,在训练过程中,我们随机生成一个目标域标签,训练模型以灵活地将输入图像转换为目标域。通过这样做,我们可以控制域标签,并在测试阶段将图像转换成任何期望的域。
对于所述基于StarGAN网络的统一的训练模型构架,下面结合具体的模型进行说明,如图3所示,通过获取已知的目标域的表情数据来进行模型训练,其训练过程具体包括:
鉴别器的真假区分训练,该训练过程主要是对区分真假的概率的训练,使得鉴别器的分辨概率更加接近真实;
原始图像转换到目标域图像,以及对转换后的目标域图像还原原始图像的训练,该训练过程主要是对生成器与多个目标域之间的映射关系的统一学习,使得生成器可以灵活地在多个域之间的切换映射,达到任意转换的目的;
根据目标域图像进行真假的再次鉴别,基于再次鉴别的结果来调整上述生成器的训练结果,从而实现对图像的在分类或者减低模型的输出损失。
进一步的,对于上述的模型训练,具体基于已知的目标域中的图像数据x以及对应的目标域标签c作为输入,将x转换成输出图片y,输出图片y能够被归类成目标域c,分析各个部分的损失:
首先是鉴别器的训练,该鉴别器的训练具体如下:
分别将所述至少两个不同表情域中的目标域中的表情素材作为输入相片,以所述至少两个不同表情域中的目标域的域标签作为所述目标转换域的目标域标签,输入至所述鉴别器中;
在所述输入相片经过所述鉴别器采用对抗损失函数公式对所述输入相片的真假进行鉴别,判断鉴别的结果来判定所述目标域标签是否正确;其中,所述对抗损失函数公式为:
$$\mathcal{L}_{adv} = \mathbb{E}_{x}[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1 - D_{src}(G(x,c)))]$$
其中,所述X为输入相片,所述C为目标转换域的标签或者目标域标签,所述G(x,c)为所述人脸表情修改模型中的第一生成器G的输出图像。
在实际应用中,判断鉴别的结果来判定所述目标域标签是否正确具体是通过对鉴别后输出的相片中的表情的形状来确定是否属于目标域标签对应的表情形状,若是,则判断是真的相片,即是属于目标域标签的合成结果,反之,则不是,需要重新输入训练鉴别器。
然后是生成器G的训练,对于生成器G的训练的输出还可以对鉴别器D分类损失进行优化,具体为:
基于随机算法生成一个目标域标签,以及获取所述输入相片的原始目标域标签;
将所述输入相片和目标域标签输入至第一生成器中合成假图像,并将所述假图像分别输出至第二生成器和辅助鉴别器;
所述第二生成器根据接收到的所述假图像和原始目标域标签进行图像的还原重建处理,得到重组图像,并输出给所述第一生成器进行深度串联的循环合成;
所述辅助鉴别器对所述假图像进行真假图像的鉴别,并根据鉴别的结果进行再分类处理;
根据所述循环合成和再分类处理的结果对模型进行优化损失的计算,得到最终的对抗损失;其中,所述优化损失包括鉴别器、生成器的训练优化损失,还原重建损失,训练总体损失,以及训练后的对抗损失,其计算公式如下:
所述鉴别器的训练优化损失:
$$\mathcal{L}_{cls}^{r} = \mathbb{E}_{x,c'}[-\log D_{cls}(c'\,|\,x)]$$
所述生成器的训练优化损失:
$$\mathcal{L}_{cls}^{f} = \mathbb{E}_{x,c}[-\log D_{cls}(c\,|\,G(x,c))]$$
所述还原重建损失:
$$\mathcal{L}_{rec} = \mathbb{E}_{x,c,c'}[\lVert x - G(G(x,c),c')\rVert_{1}]$$
所述训练总体损失:
$$\mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls}^{r}, \qquad \mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls}^{f} + \lambda_{rec}\mathcal{L}_{rec}$$
所述训练后的对抗损失:
$$\mathcal{L}_{adv} = \mathbb{E}_{x}[D_{src}(x)] - \mathbb{E}_{x,c}[D_{src}(G(x,c))] - \lambda_{gp}\mathbb{E}_{\hat{x}}[(\lVert\nabla_{\hat{x}}D_{src}(\hat{x})\rVert_{2} - 1)^{2}]$$
其中,Dcls(c'|x)代表鉴别器对真实图像计算得到的域标签概率分布;c'为原始目标域标签;λ cls和λ rec是超参数,分别用来调整域再分类处理的损失和还原重建损失相比于对抗损失的重要程度;
$\hat{x}$
是沿着直线均匀采样真正的和生成的图像;λ gp为固定值10。
在实际应用中,对于还原重建的损失计算具体是通过最小化对抗损失与分类损失,G努力尝试做到生成目标域中的现实图片。但是这无法保证学习到的转换只会改变输入图片的域相关的信息而不改变图片内容,加上了周期一致性损失,是将G(x,c)和图片x的原始标签c'结合喂入到G中,将生成的图片和x计算1范数差异。
在本实施例中,根据上述的方式训练得到了人脸表情修改模型后,在执行步骤S40时,通过在相片上识别到目标域标签后,根据目标域标签从表情素材库中匹配到对应的表情数据集,然后再根据实际要求选择一个或者多个表情素材填充到对应的五官部位上,最后对填充后的区域以及该五官部位的边沿区域进行融合处理,得到完整的目标表情相片。
进一步的,对于模型的训练除了上述的注意目标域的映射关系进行训练之外,还可以通过联合多个数据集的方式训练,而多个数据集记为多个目标域的相互集合,在联合多个数据集训练时,通过引入一个掩码矢量(maskvector)控制数据集进行输入训练,通过掩码矢量m控制允许统一基于StarGAN网络忽略来自于特定数据集的未知标签以及已知标签;在统一基于StarGAN网络中,使用一个n维数据来表征掩码矢量m,其具体的实现步骤为:
将获取到的所有已归类的所述人脸表情特征图像形成数据集,并基于所述数据集设置一个掩码矢量m,所述掩码矢量用于对所述数据集中的域标签进行统一格式控制;其中,所述域标签统一格式为:
$$\tilde{c} = [c_{1}, c_{2}, \ldots, c_{n}, m]$$
Cn为第n个数据集的域标签;
将所述掩码矢量输入值所述生成器,基于所述域标签统一格式控制所述数据集进行模型的统一训练。
进一步的,对于步骤S230,在进行图像的合成时,基于学习好的人脸表情修改模型对输入的相片进行多域转换的合成,通过输入一张任意的待处理的相片,模型基于多域之间的映射学习将该相片针对每个目标域都合成一张图像,然后再从中选择一张作为目标转换域的图像,下面以一个表情域存在多个不同情绪的表情素材的目标域为例进行举例说明,具体如图4所示:
图左第一张图片代表input输入图片,而对应输入的目标域标签是需要转换成7个表情图像的标签,人脸表情修改模型根据目标域标签和多域的映射关系,访问查询到携带有7中表情素材的表情域,将这些素材一一替换原输入的图片,随后7张图片代表根据表情素材属性生成的合成图,从左至右,表情分别表示:生气、轻蔑、厌恶、害怕、高兴、悲伤、惊讶。
这样当老相片中人闭眼睛或者没有笑,就可以利用此网络进行修复生成新的图片,在手机端相机APP会应用广泛。
在本案中,在步骤S250之后,还包括根据所述人脸表情修改模型对合成后的目标表情相片进行目标域的在学习归类,具体的实现为:
将所述目标表情相片输入到所述人脸表情修改模型中,通过模型中的鉴别器进行真假的判断,根据判断的结果确定所述目标表情相片的域标签,根据所述与域标签将所述目标表情相片归类到对应的表情数据集中,从而扩大表情数据集中的表情素材,这样不仅实现了多域之间的表情素材获取,还实现了表情数据集的动态扩展。
在本申请通过上述的方法进行表情的转换生成,通过一个训练一个模型学习多域的映射关系,实现了允许在一个单一的网络内同时训练不同域的多个数据集,可以仅使用一个模型为多个域执行图像到图像转换,从而解决了在处理两个以上的域时存在伸缩性以及鲁棒性方面的局限性,以及这些方法在多域图像转换的任务中效率不高且效果不佳的问题。
下面以结合图4中的例子,对本申请实施例提供的表情生成方法中的人脸表情修改模型的训练过程做详细的说明,在图中输入的相片中仅涉及一个人脸图像,所以相片中的人脸图像即是用户待修改的人脸头像,当然在实际应用中,当输入的相片中存在多个人脸头像时,这需要执行上述的步骤S210提取其中的待处理图像,基于一个人脸图像的相片来说,其具体实现过程如下:
步骤S510,将输入图片x和目标生成域c结合喂入到生成网络来合成fake图片。
在本实施例中,生成网络G指的是上述的人脸表情修改模型,而该模型是基于StarGAN网络的训练算法训练得到的,如图2所示,该模型包括两个鉴别器D和两个生成器G。
步骤S520,将fake图片和真实图片分别喂入到鉴别器D,D需要判断图片是否真实,还需要判断它来自哪个域。
步骤S530,将生成的fake图片和原始图片的域信息c'结合起来喂入到生成器G要求能输出重建出原始输入图片x。
在本实施例中,对于给定的输入图片x和目标域标签c,网络的目标是将x转换成输出图片y,输出图片y能够被归类成目标域c,分析各个部分的损失:
(1)GAN常见的对抗损失:
$$\mathcal{L}_{adv} = \mathbb{E}_{x}[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1 - D_{src}(G(x,c)))]$$
(2)真实图片的域分类损失用来优化D,Dcls(c'|x)代表D对真实图片计算得到的域标签概率分布。这一学习目标将会使得D能够将输入图片x识别为对应的域c',这里的(x,c')是训练集给定的,如下是优化D的损失:
$$\mathcal{L}_{cls}^{r} = \mathbb{E}_{x,c'}[-\log D_{cls}(c'\,|\,x)]$$
(3)fake图片的域分类损失来优化G,让G尽力去生成图片让它能够被D分类成目标域c,如下是优化G的损失:
$$\mathcal{L}_{cls}^{f} = \mathbb{E}_{x,c}[-\log D_{cls}(c\,|\,G(x,c))]$$
(4)重建损失:通过最小化对抗损失与分类损失,G努力尝试做到生成目标域中的现实图片。但是这无法保证学习到的转换只会改变输入图片的域相关的信息而不改变图片内容,加上了周期一致性损失,是将G(x,c)和图片x的原始标签c'结合喂入到G中,将生成的图片和x计算1范数差异:
$$\mathcal{L}_{rec} = \mathbb{E}_{x,c,c'}[\lVert x - G(G(x,c),c')\rVert_{1}]$$
(5)总体损失:
$$\mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls}^{r}$$
$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls}^{f} + \lambda_{rec}\mathcal{L}_{rec}$$
(6)提升训练获得的对抗损失,
$\hat{x}$
是沿着直线均匀采样真正的和生成的图像,λ gp为固定值10:
$$\mathcal{L}_{adv} = \mathbb{E}_{x}[D_{src}(x)] - \mathbb{E}_{x,c}[D_{src}(G(x,c))] - \lambda_{gp}\mathbb{E}_{\hat{x}}[(\lVert\nabla_{\hat{x}}D_{src}(\hat{x})\rVert_{2} - 1)^{2}]$$
(7)联合多个数据集训练时把mask向量也输入到生成器,联合多个数据集训练时把mask向量也输入到生成器,ci代表第i个数据集的标签,已知标签ci如果是二进制属性则可以表示为二进制向量,如果为类别属性表示一个onehot。剩下的n-1个则指定为0。m则是一个长度为n的onehot编码:
$$\tilde{c} = [c_{1}, \ldots, c_{n}, m]$$
(8)其表情修改结果类似如下,图左第一张图片代表input输入图片,随后7张图片代表根据表情属性生成地合成图,从左至右,表情分别表示:生气、轻蔑、厌恶、害怕、高兴、悲伤、惊讶。
To solve the above problems, an embodiment of this application further provides an expression generation apparatus based on the StarGAN network. Referring to FIG. 6, FIG. 6 is a schematic diagram of the functional modules of the StarGAN-based expression generation apparatus provided by an embodiment of this application. In this embodiment, the apparatus includes: a face recognition module 61, an acquisition module 62, a query module 63, and a synthesis module 64.
The face recognition module 61 is configured to acquire a photo to be processed, recognize all face portraits in the photo based on face recognition technology, outline a face bounding box for each face portrait on the photo, and mark, from within the face bounding box, the positions of the facial features in the face portrait using facial keypoint detection technology.
The acquisition module 62 is configured to acquire a target domain label to be converted to, wherein the target domain label indicates the expression data set into which the photo is to be converted.
The query module 63 is configured to input the photo marked with the facial features and the target domain label into a pre-trained facial expression modification model, and to query, from an expression material library, the expression data set corresponding to the target domain label, wherein the facial expression modification model is an expression replacement model capable of obtaining expression materials across multiple expression data sets of different expression domains within a single network, and is configured to determine, from the expression material library, the expression data set corresponding to the target domain label according to the target domain label and the mapping relationships between domain labels and the expression data sets of the multiple different expression domains, the expression material library comprising expression data sets of at least two different expression domains.
The synthesis module 64 is configured to have the facial expression modification model fill the determined expression data sets of the different expressions, in turn, into the corresponding facial-feature positions in the face portrait, and to perform expression image synthesis to obtain the target expression photo.
Since the description of this embodiment is based on the same content as the embodiments of the StarGAN-based expression generation method of this application described above, the embodiment of the StarGAN-based expression generation apparatus is not described in further detail here.
This application also provides a computer-readable storage medium.
In this embodiment, the computer-readable storage medium may be non-volatile or volatile. A StarGAN-based expression generation program is stored on the computer-readable storage medium, and when executed by a processor the program implements the steps of the StarGAN-based expression generation method described in any of the above embodiments. For the method implemented when the StarGAN-based expression generation program is executed by the processor, reference may be made to the embodiments of the StarGAN-based expression generation method of this application, so the details are not repeated here.
From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing over the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM) and includes instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.
The embodiments of this application have been described above with reference to the accompanying drawings, but this application is not limited to the specific implementations described above, which are merely illustrative rather than restrictive. Under the teaching of this application, those of ordinary skill in the art may devise many further forms without departing from the purpose of this application and the scope protected by the claims; any equivalent structures or equivalent process transformations made using the contents of the specification and drawings of this application, or their direct or indirect application in other related technical fields, fall within the protection of this application.

Claims (20)

  1. An expression generation method based on a StarGAN network, wherein the expression generation method comprises the following steps:
    acquiring a photo to be processed, recognizing all face portraits in the photo based on face recognition technology, and outlining a face bounding box for each face portrait on the photo;
    marking, from within the face bounding box, the positions of the facial features in the face portrait using facial keypoint detection technology;
    acquiring a target domain label to be converted to, wherein the target domain label indicates the expression data set into which the photo is to be converted;
    inputting the photo marked with the facial-feature positions and the target domain label into a pre-trained facial expression modification model, and querying, from an expression material library, the expression data set corresponding to the target domain label, wherein the facial expression modification model is an expression replacement model capable of obtaining expression materials across multiple expression data sets of different expression domains within a single network, and is configured to determine, from the expression material library, the expression data set corresponding to the target domain label according to the mapping relationships between domain labels and the expression data sets of the multiple different expression domains, the expression material library comprising expression data sets of at least two different expression domains;
    filling, with the facial expression modification model, the determined expression data sets of the different expressions, in turn, into the corresponding facial-feature positions in the face portrait, and performing expression image synthesis to obtain a target expression photo.
  2. The StarGAN-based expression generation method according to claim 1, wherein the step of recognizing all face portraits in the photo based on face recognition technology and outlining a face bounding box for each face portrait on the photo comprises:
    identifying whether a face image is present in the photo;
    if so, extracting the face in the photo according to a preset face detection algorithm, and drawing a face bounding box at the position of the face;
    marking expression tracking points on the face image within the face bounding box using an expression tracking algorithm.
  3. The StarGAN-based expression generation method according to claim 2, wherein the step of marking, from within the face bounding box, the positions of the facial features in the face portrait using facial keypoint detection technology comprises:
    performing facial-feature detection on the face image within the face bounding box with a pre-trained facial keypoint detector, and marking the boundaries of the detected facial features using a 106-tracking-point marking scheme, so as to delineate the shape and position information of each facial feature on the face image.
  4. The StarGAN-based expression generation method according to any one of claims 1 to 3, wherein the pre-trained facial expression modification model is trained in the following manner:
    acquiring target domains in at least two different expression domains and the domain label corresponding to each target domain, the target domains containing facial expression feature images that have already been classified by manual classification;
    training, according to the target domains, the discriminator in a preset adversarial-network model architecture to distinguish real photos from fake photos, so as to obtain a discriminator capable of recognizing expression categories;
    training, according to the target domains, the generator in the preset adversarial-network model architecture to convert input photos into the target domain to be converted to, and to restore and reconstruct the converted images of the input photos;
    recording, in the preset adversarial-network model architecture, the domain labels corresponding to the manually classified facial expression feature images, generating target domain labels in the model architecture according to the domain labels, and establishing the mapping relationships between the target domain labels and the domain labels;
    constructing, based on the trained discriminator, the trained generator, and the mapping relationships, the facial expression modification model for expression synthesis.
  5. The StarGAN-based expression generation method according to claim 4, wherein training, according to the target domains, the discriminator in the preset adversarial-network model architecture to distinguish real photos from fake photos, so as to obtain a discriminator capable of recognizing expression categories, comprises:
    taking the expression materials in the target domains of the at least two different expression domains as input photos, taking the domain labels of the target domains of the at least two different expression domains as the target domain labels of the target conversion domain, and inputting them into the discriminator;
    discriminating, by the discriminator using an adversarial loss function, whether the input photos are real or fake, and determining from the discrimination result whether the target domain label is correct.
  6. The StarGAN-based expression generation method according to claim 5, wherein training, according to the target domains, the generator in the preset adversarial-network model architecture to convert input photos into the target domain to be converted to, and to restore and reconstruct the converted images of the input photos, comprises:
    generating a target domain label with a random algorithm, and acquiring the original target domain label of the input photo;
    inputting the input photo and the target domain label into a first generator to synthesize a fake image, and outputting the fake image to a second generator and to an auxiliary discriminator respectively;
    restoring and reconstructing, by the second generator, the image according to the received fake image and the original target domain label to obtain a recombined image, and outputting it to the first generator for depth-concatenated cyclic synthesis;
    discriminating, by the auxiliary discriminator, whether the fake image is real or fake, and performing re-classification according to the discrimination result;
    computing the optimization losses of the model according to the results of the cyclic synthesis and of the re-classification to obtain a final adversarial loss, wherein the optimization losses comprise the training optimization losses of the discriminator and of the generator, the reconstruction loss, the overall training losses, and the post-training adversarial loss.
  7. The StarGAN-based expression generation method according to claim 6, wherein after acquiring the target domains in the at least two different expression domains and the domain label corresponding to each target domain, the method further comprises:
    forming all of the acquired classified facial expression feature images into data sets, and setting a mask vector m based on the data sets, the mask vector being used to put the domain labels of the data sets into a unified format, the unified domain-label format being:

    $$\tilde{c} = [c_1, \ldots, c_n, m]$$

    wherein c_n is the domain label of the n-th data set;
    inputting the mask vector into the generator, and controlling, based on the unified domain-label format, the unified training of the model over the data sets.
  8. An expression generation apparatus based on a StarGAN network, wherein the expression generation apparatus comprises:
    a face recognition module, configured to acquire a photo to be processed, recognize all face portraits in the photo based on face recognition technology, outline a face bounding box for each face portrait on the photo, and mark, from within the face bounding box, the positions of the facial features in the face portrait using facial keypoint detection technology;
    an acquisition module, configured to acquire a target domain label to be converted to, wherein the target domain label indicates the expression data set into which the photo is to be converted;
    a query module, configured to input the photo marked with the facial-feature positions and the target domain label into a pre-trained facial expression modification model, and to query, from an expression material library, the expression data set corresponding to the target domain label, wherein the facial expression modification model is an expression replacement model capable of obtaining expression materials across multiple expression data sets of different expression domains within a single network, and is configured to determine, from the expression material library, the expression data set corresponding to the target domain label according to the mapping relationships between domain labels and the expression data sets of the multiple different expression domains, the expression material library comprising expression data sets of at least two different expression domains;
    a synthesis module, configured to fill, with the facial expression modification model, the determined expression data sets of the different expressions, in turn, into the corresponding facial-feature positions in the face portrait, and to perform expression image synthesis to obtain a target expression photo.
  9. An expression generation device based on a StarGAN network, wherein the expression generation device comprises: a memory, a processor, and a StarGAN-based expression generation program stored on the memory and executable on the processor, the StarGAN-based expression generation program, when executed by the processor, implementing the following steps of the StarGAN-based expression generation method:
    acquiring a photo to be processed, recognizing all face portraits in the photo based on face recognition technology, and outlining a face bounding box for each face portrait on the photo;
    marking, from within the face bounding box, the positions of the facial features in the face portrait using facial keypoint detection technology;
    acquiring a target domain label to be converted to, wherein the target domain label indicates the expression data set into which the photo is to be converted;
    inputting the photo marked with the facial-feature positions and the target domain label into a pre-trained facial expression modification model, and querying, from an expression material library, the expression data set corresponding to the target domain label, wherein the facial expression modification model is an expression replacement model capable of obtaining expression materials across multiple expression data sets of different expression domains within a single network, and is configured to determine, from the expression material library, the expression data set corresponding to the target domain label according to the mapping relationships between domain labels and the expression data sets of the multiple different expression domains, the expression material library comprising expression data sets of at least two different expression domains;
    filling, with the facial expression modification model, the determined expression data sets of the different expressions, in turn, into the corresponding facial-feature positions in the face portrait, and performing expression image synthesis to obtain a target expression photo.
  10. The StarGAN-based expression generation device according to claim 9, wherein when the StarGAN-based expression generation program is executed by the processor to perform the step of recognizing all face portraits in the photo based on face recognition technology and outlining a face bounding box for each face portrait on the photo, the following steps are included:
    identifying whether a face image is present in the photo;
    if so, extracting the face in the photo according to a preset face detection algorithm, and drawing a face bounding box at the position of the face;
    marking expression tracking points on the face image within the face bounding box using an expression tracking algorithm.
  11. The StarGAN-based expression generation device according to claim 10, wherein when the StarGAN-based expression generation program is executed by the processor to perform the step of marking, from within the face bounding box, the positions of the facial features in the face portrait using facial keypoint detection technology, the following step is included:
    performing facial-feature detection on the face image within the face bounding box with a pre-trained facial keypoint detector, and marking the boundaries of the detected facial features using a 106-tracking-point marking scheme, so as to delineate the shape and position information of each facial feature on the face image.
  12. The StarGAN-based expression generation device according to any one of claims 9 to 11, wherein when the StarGAN-based expression generation program is executed by the processor to perform the step of obtaining the pre-trained facial expression modification model, the following steps are included:
    acquiring target domains in at least two different expression domains and the domain label corresponding to each target domain, the target domains containing facial expression feature images that have already been classified by manual classification;
    training, according to the target domains, the discriminator in a preset adversarial-network model architecture to distinguish real photos from fake photos, so as to obtain a discriminator capable of recognizing expression categories;
    training, according to the target domains, the generator in the preset adversarial-network model architecture to convert input photos into the target domain to be converted to, and to restore and reconstruct the converted images of the input photos;
    recording, in the preset adversarial-network model architecture, the domain labels corresponding to the manually classified facial expression feature images, generating target domain labels in the model architecture according to the domain labels, and establishing the mapping relationships between the target domain labels and the domain labels;
    constructing, based on the trained discriminator, the trained generator, and the mapping relationships, the facial expression modification model for expression synthesis.
  13. The StarGAN-based expression generation device according to claim 12, wherein when the StarGAN-based expression generation program is executed by the processor to perform the step of training, according to the target domains, the discriminator in the preset adversarial-network model architecture to distinguish real photos from fake photos, so as to obtain a discriminator capable of recognizing expression categories, the following steps are included:
    taking the expression materials in the target domains of the at least two different expression domains as input photos, taking the domain labels of the target domains of the at least two different expression domains as the target domain labels of the target conversion domain, and inputting them into the discriminator;
    discriminating, by the discriminator using an adversarial loss function, whether the input photos are real or fake, and determining from the discrimination result whether the target domain label is correct.
  14. The StarGAN-based expression generation device according to claim 13, wherein when the StarGAN-based expression generation program is executed by the processor to perform the step of training, according to the target domains, the generator in the preset adversarial-network model architecture to convert input photos into the target domain to be converted to, and to restore and reconstruct the converted images of the input photos, the following steps are included:
    generating a target domain label with a random algorithm, and acquiring the original target domain label of the input photo;
    inputting the input photo and the target domain label into a first generator to synthesize a fake image, and outputting the fake image to a second generator and to an auxiliary discriminator respectively;
    restoring and reconstructing, by the second generator, the image according to the received fake image and the original target domain label to obtain a recombined image, and outputting it to the first generator for depth-concatenated cyclic synthesis;
    discriminating, by the auxiliary discriminator, whether the fake image is real or fake, and performing re-classification according to the discrimination result;
    computing the optimization losses of the model according to the results of the cyclic synthesis and of the re-classification to obtain a final adversarial loss, wherein the optimization losses comprise the training optimization losses of the discriminator and of the generator, the reconstruction loss, the overall training losses, and the post-training adversarial loss.
  15. The StarGAN-based expression generation device according to claim 14, wherein after the StarGAN-based expression generation program is executed by the processor to perform the step of acquiring the target domains in the at least two different expression domains and the domain label corresponding to each target domain, the following steps are further included:
    forming all of the acquired classified facial expression feature images into data sets, and setting a mask vector m based on the data sets, the mask vector being used to put the domain labels of the data sets into a unified format, the unified domain-label format being:

    $$\tilde{c} = [c_1, \ldots, c_n, m]$$

    wherein c_n is the domain label of the n-th data set;
    inputting the mask vector into the generator, and controlling, based on the unified domain-label format, the unified training of the model over the data sets.
  16. A computer-readable storage medium, wherein a StarGAN-based expression generation program is stored on the computer-readable storage medium, and when executed by a processor the StarGAN-based expression generation program implements the following steps of the StarGAN-based expression generation method:
    acquiring a photo to be processed, recognizing all face portraits in the photo based on face recognition technology, and outlining a face bounding box for each face portrait on the photo;
    marking, from within the face bounding box, the positions of the facial features in the face portrait using facial keypoint detection technology;
    acquiring a target domain label to be converted to, wherein the target domain label indicates the expression data set into which the photo is to be converted;
    inputting the photo marked with the facial-feature positions and the target domain label into a pre-trained facial expression modification model, and querying, from an expression material library, the expression data set corresponding to the target domain label, wherein the facial expression modification model is an expression replacement model capable of obtaining expression materials across multiple expression data sets of different expression domains within a single network, and is configured to determine, from the expression material library, the expression data set corresponding to the target domain label according to the mapping relationships between domain labels and the expression data sets of the multiple different expression domains, the expression material library comprising expression data sets of at least two different expression domains;
    filling, with the facial expression modification model, the determined expression data sets of the different expressions, in turn, into the corresponding facial-feature positions in the face portrait, and performing expression image synthesis to obtain a target expression photo.
  17. The computer-readable storage medium according to claim 16, wherein when the StarGAN-based expression generation program is executed by the processor to perform the step of recognizing all face portraits in the photo based on face recognition technology and outlining a face bounding box for each face portrait on the photo, the following steps are included:
    identifying whether a face image is present in the photo;
    if so, extracting the face in the photo according to a preset face detection algorithm, and drawing a face bounding box at the position of the face;
    marking expression tracking points on the face image within the face bounding box using an expression tracking algorithm.
  18. The computer-readable storage medium according to claim 17, wherein when the StarGAN-based expression generation program is executed by the processor to perform the step of marking, from within the face bounding box, the positions of the facial features in the face portrait using facial keypoint detection technology, the following step is included:
    performing facial-feature detection on the face image within the face bounding box with a pre-trained facial keypoint detector, and marking the boundaries of the detected facial features using a 106-tracking-point marking scheme, so as to delineate the shape and position information of each facial feature on the face image.
  19. The computer-readable storage medium according to any one of claims 16 to 18, wherein when the StarGAN-based expression generation program is executed by the processor to perform the step of obtaining the pre-trained facial expression modification model, the following steps are included:
    acquiring target domains in at least two different expression domains and the domain label corresponding to each target domain, the target domains containing facial expression feature images that have already been classified by manual classification;
    training, according to the target domains, the discriminator in a preset adversarial-network model architecture to distinguish real photos from fake photos, so as to obtain a discriminator capable of recognizing expression categories;
    training, according to the target domains, the generator in the preset adversarial-network model architecture to convert input photos into the target domain to be converted to, and to restore and reconstruct the converted images of the input photos;
    recording, in the preset adversarial-network model architecture, the domain labels corresponding to the manually classified facial expression feature images, generating target domain labels in the model architecture according to the domain labels, and establishing the mapping relationships between the target domain labels and the domain labels;
    constructing, based on the trained discriminator, the trained generator, and the mapping relationships, the facial expression modification model for expression synthesis.
  20. The computer-readable storage medium according to claim 19, wherein when the StarGAN-based expression generation program is executed by the processor to perform the step of training, according to the target domains, the discriminator in the preset adversarial-network model architecture to distinguish real photos from fake photos, so as to obtain a discriminator capable of recognizing expression categories, the following steps are included:
    taking the expression materials in the target domains of the at least two different expression domains as input photos, taking the domain labels of the target domains of the at least two different expression domains as the target domain labels of the target conversion domain, and inputting them into the discriminator;
    discriminating, by the discriminator using an adversarial loss function, whether the input photos are real or fake, and determining from the discrimination result whether the target domain label is correct.
PCT/CN2020/118388 2019-10-18 2020-09-28 表情生成方法、装置、设备及存储介质 WO2021073417A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910990781.6A CN111028305A (zh) 2019-10-18 2019-10-18 表情生成方法、装置、设备及存储介质
CN201910990781.6 2019-10-18

Publications (1)

Publication Number Publication Date
WO2021073417A1 true WO2021073417A1 (zh) 2021-04-22

Family

ID=70205440

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118388 WO2021073417A1 (zh) 2019-10-18 2020-09-28 表情生成方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN111028305A (zh)
WO (1) WO2021073417A1 (zh)


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111028305A (zh) * 2019-10-18 2020-04-17 平安科技(深圳)有限公司 表情生成方法、装置、设备及存储介质
CN111860380A (zh) * 2020-07-27 2020-10-30 平安科技(深圳)有限公司 人脸图像生成方法、装置、服务器及存储介质
CN111968029A (zh) * 2020-08-19 2020-11-20 北京字节跳动网络技术有限公司 表情变换方法、装置、电子设备和计算机可读介质
CN112766105A (zh) * 2021-01-07 2021-05-07 北京码牛科技有限公司 一种应用于图码联采系统的图像转换方法及装置
CN113096206B (zh) * 2021-03-15 2022-09-23 中山大学 基于注意力机制网络的人脸生成方法、装置、设备及介质
WO2022205416A1 (zh) * 2021-04-02 2022-10-06 深圳先进技术研究院 一种基于生成式对抗网络的人脸表情生成方法
CN113239977B (zh) * 2021-04-22 2023-03-24 武汉大学 多域图像转换模型的训练方法、装置、设备及存储介质
CN113269700B (zh) * 2021-04-29 2023-12-12 北京达佳互联信息技术有限公司 视频生成方法、装置、电子设备及存储介质
CN114373034A (zh) * 2022-01-10 2022-04-19 腾讯科技(深圳)有限公司 图像处理方法、装置、设备、存储介质及计算机程序
CN114596615B (zh) * 2022-03-04 2023-05-05 湖南中科助英智能科技研究院有限公司 基于对抗学习的人脸活体检测方法、装置、设备及介质


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108776983A (zh) * 2018-05-31 2018-11-09 北京市商汤科技开发有限公司 基于重建网络的人脸重建方法和装置、设备、介质、产品

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334904A (zh) * 2018-02-07 2018-07-27 深圳市唯特视科技有限公司 一种基于统一生成对抗网络的多域图像转换技术
US20190304104A1 (en) * 2018-04-03 2019-10-03 Sri International Applying artificial intelligence to generate motion information
CN109509242A (zh) * 2018-11-05 2019-03-22 网易(杭州)网络有限公司 虚拟对象面部表情生成方法及装置、存储介质、电子设备
CN110084121A (zh) * 2019-03-27 2019-08-02 南京邮电大学 基于谱归一化的循环生成式对抗网络的人脸表情迁移的实现方法
CN111028305A (zh) * 2019-10-18 2020-04-17 平安科技(深圳)有限公司 表情生成方法、装置、设备及存储介质

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239833A (zh) * 2021-05-20 2021-08-10 厦门大学 一种基于双分支干扰分离网络的人脸表情识别方法
CN113239833B (zh) * 2021-05-20 2023-08-29 厦门大学 一种基于双分支干扰分离网络的人脸表情识别方法
CN113269145B (zh) * 2021-06-22 2023-07-25 中国平安人寿保险股份有限公司 表情识别模型的训练方法、装置、设备及存储介质
CN113269145A (zh) * 2021-06-22 2021-08-17 中国平安人寿保险股份有限公司 表情识别模型的训练方法、装置、设备及存储介质
CN113887357A (zh) * 2021-09-23 2022-01-04 华南理工大学 一种人脸表示攻击检测方法、系统、装置及介质
CN113887357B (zh) * 2021-09-23 2024-04-12 华南理工大学 一种人脸表示攻击检测方法、系统、装置及介质
CN114387482A (zh) * 2022-01-05 2022-04-22 齐鲁工业大学 基于人脸图像的数据增强方法、模型训练方法及分析方法
CN114387482B (zh) * 2022-01-05 2024-04-16 刘磊 基于人脸图像的数据增强方法、模型训练方法及分析方法
CN114581569A (zh) * 2022-02-21 2022-06-03 华南理工大学 基于代理引导域适应的外观保持的人脸图像动漫化方法
WO2023169023A1 (zh) * 2022-03-11 2023-09-14 腾讯科技(深圳)有限公司 表情模型的生成方法、装置、设备及介质
CN114821602A (zh) * 2022-06-28 2022-07-29 北京汉仪创新科技股份有限公司 训练对抗神经网络生成字库的方法、系统、设备和介质
CN116631042A (zh) * 2023-07-25 2023-08-22 数据空间研究院 表情图像生成、表情识别模型、方法、系统和存储器
CN116631042B (zh) * 2023-07-25 2023-10-13 数据空间研究院 表情图像生成、表情识别模型、方法、系统和存储器

Also Published As

Publication number Publication date
CN111028305A (zh) 2020-04-17


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20877799

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20877799

Country of ref document: EP

Kind code of ref document: A1