WO2023142886A1 - Expression transfer method, model training method and device

Expression transfer method, model training method and device

Info

Publication number
WO2023142886A1
Authority
WO
WIPO (PCT)
Prior art keywords
expression
neural network
network model
training
target object
Prior art date
Application number
PCT/CN2022/143944
Other languages
English (en)
French (fr)
Inventor
陈刚
贺敬武
梁芊荟
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2023142886A1

Classifications

    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N 3/0442: Recurrent networks, e.g. Hopfield networks, characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/045: Combinations of networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06T 19/20: Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
Definitions

  • the embodiments of the present application relate to the technical field of image processing, and more specifically, relate to an expression transfer method, a model training method and a device.
  • the facial expression capture system can simulate human facial expressions in real time by capturing the movement of key points on the user's face, and can endow digital images (that is, virtual avatars, such as virtual characters or anthropomorphic animals) with human facial expressions.
  • An expression can be represented by superposition of multiple partial action expressions, each partial action expression is an expression base, and the linear combination weight of a set of expression bases is the expression base coefficient.
  • the expression base coefficients of human facial expressions can be obtained first, and then the expression base coefficients can be combined with the expression bases of the digital image, thereby driving the digital image to make an expression corresponding to the facial expression.
  • Embodiments of the present application provide an expression transfer method, a model training method, and a device, which can improve the accuracy of user expression information transmission, thereby improving the efficiency of expression transfer.
  • In a first aspect, a method for expression transfer is provided, including: obtaining expression data to be processed, where the expression data to be processed includes a plurality of first video frames and a plurality of second video frames, and both the first video frames and the second video frames include facial images of a source object; obtaining a second expression base coefficient based on a first neural network model and a first expression base coefficient, where the first expression base coefficient is obtained based on the first video frame and corresponds to the source object; and driving a target object according to the second expression base coefficient, to transfer the expression of the source object in the first video frame to the target object, where the second expression base coefficient corresponds to the target object;
  • when a preset condition is met, a fourth expression base coefficient is obtained based on a second neural network model and a third expression base coefficient, where the third expression base coefficient is obtained based on the second video frame and corresponds to the source object;
  • the initial model parameters of the second neural network model are the same as the initial model parameters of the first neural network model, and during the application of the first neural network model, the model parameters of the second neural network model are adjusted in a training process based on first training data, where the first training data is obtained from at least some of the plurality of first video frames;
  • the preset condition includes: the expression loss generated when the expression base coefficient output by the second neural network model is used to drive the target object is smaller than the expression loss generated when the expression base coefficient output by the first neural network model is used to drive the target object.
  • the first neural network model or the second neural network model may be used for processing to obtain the expression coefficients corresponding to the target object.
  • the expression base coefficient corresponding to the target object to drive the target object can make the target object make the same expression as the source object, and realize the accurate transmission of facial expression.
  • the second neural network model can be further trained by using the expression data to be processed, so as to continuously update the model parameters of the second neural network model and improve the precision of the second neural network model.
  • the training samples of the second neural network model can be enriched, the applicable scenarios of the second neural network model can be expanded, and the problem of inaccurate transmission of extreme expressions can be avoided.
  • when the expression loss corresponding to the second neural network model is smaller than the expression loss corresponding to the first neural network model, the first neural network model can be replaced by the second neural network model, and the second neural network model is then used to continue the processing, thereby obtaining a better-matching expression base coefficient, so that the expression displayed by the target object is consistent with the expression of the source object, realizing accurate transmission and expression of the expression information of the source object.
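  • As an illustrative sketch only (not the claimed implementation), the preset-condition check can be pictured in Python as follows; drive_target, expression_loss, and the two coefficient-mapping models are hypothetical placeholders assumed for the example:

```python
def maybe_swap_models(model_1, model_2, source_coeffs, source_frame,
                      drive_target, expression_loss):
    """Replace model_1 with model_2 when model_2 drives the target
    object with a smaller expression loss (the preset condition)."""
    # Map the source-object coefficients with both models.
    coeffs_1 = model_1(source_coeffs)
    coeffs_2 = model_2(source_coeffs)

    # Drive the target object with each result and measure the loss
    # against the expression shown by the source object in the frame.
    loss_1 = expression_loss(drive_target(coeffs_1), source_frame)
    loss_2 = expression_loss(drive_target(coeffs_2), source_frame)

    # Preset condition: the shadow model's loss is smaller.
    return model_2 if loss_2 < loss_1 else model_1
```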
  • the first neural network model is associated with the source object and the target object; the second neural network model is also associated with the source object and the target object.
  • before the fourth expression base coefficient is obtained based on the second neural network model and the third expression base coefficient, the method further includes: training the second neural network model based on the first training data and a first loss function, where the first loss function is used for gradient backpropagation to adjust the model parameters of the second neural network model.
  • the first loss function may be an L1 norm loss function (also called an absolute value loss function), an L2 norm loss function, or an L1 norm or L2 norm loss function with an additional regularization constraint term.
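  • A minimal NumPy sketch of these loss choices, assuming the predictions and targets are coefficient vectors; the regularization weight reg_weight is an assumed hyperparameter, not a value given in the application:

```python
import numpy as np

def l1_loss(pred, target):
    # L1 norm loss (absolute value loss).
    return np.abs(pred - target).sum()

def l2_loss(pred, target):
    # L2 norm loss.
    return np.square(pred - target).sum()

def regularized_loss(pred, target, params, reg_weight=1e-3, use_l1=True):
    # L1 or L2 loss with an additional regularization (constraint) term
    # on the model parameters.
    base = l1_loss(pred, target) if use_l1 else l2_loss(pred, target)
    return base + reg_weight * np.square(params).sum()
```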
  • the first training data includes a first output result set, a second output result set, a first expression recognition result set, and a second expression recognition result set;
  • the first output result set includes: the adjusted expression base coefficients output by the first neural network model after processing the expression base coefficients obtained from each first video frame in the at least part of the first video frames;
  • the second output result set includes: the adjusted expression base coefficients output by the second neural network model after processing the expression base coefficients obtained from each first video frame in the at least part of the first video frames;
  • the first expression recognition result set includes: the results of performing expression recognition on each first video frame in the at least part of the first video frames; the second expression recognition result set includes: for each output result in the second output result set, the result of performing expression recognition on the digital image frame obtained when that output result is used to drive the target object.
  • the first training data includes not only the expression data related to the source object, but also the expression data of the target object.
  • in this way, expression constraints and associations between the source object and the target object can be established. After the trained second neural network model adjusts the expression base coefficients, the expression displayed by the target object can be kept consistent with that of the source object, realizing accurate transmission and expression of the user's expression information.
  • the first loss function is used to characterize the difference between the adjusted expression base coefficients corresponding to the same video frame in the first output result set and the second output result set, and the difference between the expression recognition results corresponding to the same video frame in the first expression recognition result set and the second expression recognition result set.
  • model parameters of the second neural network model can be optimized, and the deviation between the first neural network model and the second neural network model will not be large.
  • the at least part of the first video frame is obtained by sampling from the plurality of first video frames.
  • sampling involved here may be random sampling, or sampling according to a certain sampling frequency, such as sampling every five video frames, or continuous sampling, which is not limited in this embodiment of the present application.
  • because the number of first video frames used for training the second neural network model is smaller than the total number of first video frames, the amount of calculation in the training process can be reduced, thereby reducing the computing resources required.
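  • The sampling strategies mentioned above (random sampling, sampling at a fixed interval such as every five frames, or continuous sampling) can be sketched as follows; the function name and default values are illustrative assumptions:

```python
import random

def sample_frames(frames, strategy="interval", interval=5, num_random=32):
    """Select a subset of the first video frames for training."""
    if strategy == "random":
        k = min(num_random, len(frames))
        return random.sample(frames, k)          # random sampling
    if strategy == "interval":
        return frames[::interval]                # e.g. every five frames
    return list(frames)                          # continuous sampling
```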
  • the method further includes: updating the model parameters of the first neural network model to be consistent with the model parameters of the second neural network model; during the application of the second neural network model, the model parameters of the first neural network model are adjusted in a training process based on second training data, where the second training data is obtained based on at least part of the second video frames among the plurality of second video frames.
  • that is, at least some of the plurality of second video frames may also be used to train the first neural network model. This can enrich the training samples and improve the training accuracy of the neural network model.
  • In a second aspect, an expression transfer method is provided, including: obtaining expression data to be processed, where the expression data to be processed includes a plurality of first video frames and a plurality of second video frames, and both the first video frames and the second video frames include facial images of a source object; obtaining a second expression base coefficient based on a first neural network model and a first expression base coefficient, where the first expression base coefficient is obtained based on the first video frame and corresponds to the source object; and driving a target object according to the second expression base coefficient, to transfer the expression of the source object in the first video frame to the target object, where the second expression base coefficient corresponds to the target object;
  • the model parameters of the first neural network model are updated to be consistent with the parameters of the second neural network model to obtain an adjusted first neural network model; a fourth expression base coefficient is obtained based on the adjusted first neural network model and a third expression base coefficient, where the third expression base coefficient is obtained based on the second video frame and corresponds to the source object; and the target object is driven according to the fourth expression base coefficient, to transfer the expression of the source object in the second video frame to the target object, where the fourth expression base coefficient corresponds to the target object;
  • the initial model parameters of the second neural network model are the same as the initial model parameters of the first neural network model, and during the application of the first neural network model, the model parameters of the second neural network model are adjusted in a training process based on first training data, where the first training data is obtained from at least some of the plurality of first video frames;
  • the preset condition includes: the expression loss generated when the expression base coefficient output by the second neural network model is used to drive the target object is smaller than the expression loss generated when the expression base coefficient output by the first neural network model is used to drive the target object.
  • the first neural network model may be used for processing to obtain the expression coefficients corresponding to the target object.
  • the expression base coefficient corresponding to the target object to drive the target object can make the target object make the same expression as the source object, and realize the accurate transmission of facial expression.
  • the second neural network model can be further trained by using the expression data to be processed, so as to continuously update the model parameters of the second neural network model.
  • the model parameters of the first neural network model are updated to be consistent with those of the second neural network model. In this way, the training samples of the first neural network model can be enriched, the applicable scenarios of the first neural network model can be expanded, the problem of inaccurate transmission of extreme expressions can be avoided, and a better-matching expression base coefficient can be obtained according to the updated first neural network model, so that the expression displayed by the target object is consistent with the expression of the source object, realizing accurate transmission and expression of the expression information of the source object.
  • the first neural network model is associated with the source object and the target object; the second neural network model is also associated with the source object and the target object.
  • before the model parameters of the first neural network model are updated to be consistent with the parameters of the second neural network model, the method further includes: training the second neural network model based on the first training data and the first loss function, where the first loss function is used for gradient backpropagation to adjust the model parameters of the second neural network model.
  • the first training data includes a first output result set, a second output result set, a first expression recognition result set, and a second expression recognition result set;
  • the first output result set includes: the adjusted expression base coefficients output by the first neural network model after processing the expression base coefficients obtained from each first video frame in the at least part of the first video frames;
  • the second output result set includes: the adjusted expression base coefficients output by the second neural network model after processing the expression base coefficients obtained from each first video frame in the at least part of the first video frames;
  • the first expression recognition result set includes: the results of performing expression recognition on each first video frame in the at least part of the first video frames; the second expression recognition result set includes: for each output result in the second output result set, the result of performing expression recognition on the digital image frame obtained when that output result is used to drive the target object.
  • the first loss function is used to characterize the difference between the adjusted expression base coefficients corresponding to the same video frame in the first output result set and the second output result set, and the difference between the expression recognition results corresponding to the same video frame in the first expression recognition result set and the second expression recognition result set.
  • the at least part of the first video frame is obtained by sampling from the plurality of first video frames.
  • the method further includes: during the application process of the adjusted first neural network model, adjusting the first neural network model during the training process based on the second training data.
  • at least some of the plurality of second video frames may be used to train the second neural network model. This can enrich the training samples and improve the training accuracy of the neural network model.
  • In a third aspect, a model training method is provided, including:
  • obtaining a first training frame including a facial image of a source object;
  • obtaining the original expression base coefficient and the head pose parameter corresponding to the facial image of the source object;
  • the second training frame includes a frontal facial image of the source object;
  • adjusting the parameters of the original neural network model according to the difference between the expression recognition result corresponding to the first training frame and the expression recognition result corresponding to the third training frame, and/or the difference between the expression recognition result corresponding to the second training frame and the expression recognition result corresponding to the fourth training frame, to obtain the target neural network model.
  • expression constraints between the source object and the target object are established, so that the association between the source object and the target object can be established.
  • the expression constraint between the source object and the target object can make the expression of the source object accurately transferred to the target object
  • the difference between the expression recognition result corresponding to the first training frame and the expression recognition result corresponding to the third training frame includes: the Manhattan distance or the Euclidean distance between the expression recognition result corresponding to the first training frame and the expression recognition result corresponding to the third training frame; and/or, the difference between the expression recognition result corresponding to the second training frame and the expression recognition result corresponding to the fourth training frame includes: the Manhattan distance or the Euclidean distance between the expression recognition result corresponding to the second training frame and the expression recognition result corresponding to the fourth training frame.
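  • Assuming, for illustration, that an expression recognition result is represented as a feature vector, the Manhattan and Euclidean distances referred to above can be computed as in this sketch:

```python
import numpy as np

def manhattan_distance(r1, r2):
    # Sum of absolute coordinate differences (L1 distance).
    return float(np.abs(np.asarray(r1) - np.asarray(r2)).sum())

def euclidean_distance(r1, r2):
    # Square root of the sum of squared differences (L2 distance).
    return float(np.linalg.norm(np.asarray(r1) - np.asarray(r2)))
```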
  • In a fourth aspect, an expression transfer device is provided, including units/modules for performing the expression transfer method in the first aspect or the second aspect.
  • In a fifth aspect, a model training device is provided, including units/modules for executing the model training method in the third aspect.
  • In a sixth aspect, an expression transfer device is provided, including: a memory for storing a program; and a processor for executing the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to implement the expression transfer method in the first aspect or the second aspect.
  • In a seventh aspect, a model training device is provided, including: a memory for storing a program; and a processor for executing the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to execute the model training method in the third aspect.
  • In an eighth aspect, an electronic device is provided, including the expression transfer device in the fourth aspect or the sixth aspect.
  • the electronic device may specifically be a mobile terminal (for example, a smart phone), a tablet computer, a notebook computer, an augmented reality/virtual reality device, a vehicle-mounted terminal device, and the like.
  • In a ninth aspect, a computer device is provided, including the model training apparatus in the fifth aspect or the seventh aspect.
  • the computer device may specifically be a server or a cloud device or the like.
  • In a tenth aspect, a computer-readable storage medium is provided, where the computer-readable storage medium stores program code for execution by a device, and when the program code is executed by the device, the device is configured to execute the method in the first aspect, the second aspect, or the third aspect.
  • In an eleventh aspect, a computer program product containing instructions is provided, and when the computer program product is run on a computer, the computer is caused to execute the method in the first aspect, the second aspect, or the third aspect.
  • In a twelfth aspect, a chip is provided, including a processor and a data interface, where the processor reads, through the data interface, instructions stored in a memory to execute the method in the first aspect, the second aspect, or the third aspect.
  • the chip may further include a memory, the memory stores instructions, the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the The processor is configured to execute the method in the first aspect, the second aspect or the third aspect above.
  • the aforementioned chip may specifically be a field programmable gate array or an application specific integrated circuit.
  • It should be understood that the method in the first aspect may specifically refer to the method in the first aspect or any of the implementations of the first aspect; the method in the second aspect may specifically refer to the method in the second aspect or any of the implementations of the second aspect; and the method in the third aspect may specifically refer to the method in the third aspect or any of the implementations of the third aspect.
  • FIG. 1 is a schematic structural diagram of a system architecture provided by an embodiment of the present application.
  • Fig. 2 is a schematic diagram of a convolutional neural network model provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a scene applicable to an embodiment of the present application.
  • Fig. 6 is a schematic flowchart of an expression transfer method provided by an embodiment of the present application.
  • Fig. 7 is a schematic flow chart of another expression transfer method provided by the embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a model training method provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a model training method provided by an embodiment of the present application.
  • Fig. 10 is a schematic flowchart of an expression transfer method provided by an embodiment of the present application.
  • Fig. 11 is a schematic structural diagram of a device provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a hardware structure of a device provided by an embodiment of the present application.
  • the neural network may be composed of neural units, where a neural unit may refer to an operation unit that takes x_s and an intercept b as inputs, and the output of the operation unit may be: h_{W,b}(x) = f(∑_s W_s·x_s + b), where W_s is the weight of x_s and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting multiple above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
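  • A minimal NumPy sketch of one such neural unit, assuming a sigmoid activation as mentioned above (the weights and inputs are arbitrary example values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    # Output of one neural unit: f(sum_s W_s * x_s + b), with f = sigmoid.
    return sigmoid(np.dot(w, x) + b)

# Example: three inputs, arbitrary weights and bias.
y = neural_unit(np.array([0.2, 0.5, 0.1]), np.array([0.4, -0.3, 0.8]), b=0.1)
```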
  • a deep neural network (DNN) is also known as a multi-layer neural network or a multi-layer perceptron (MLP).
  • a DNN is divided according to the positions of different layers: the layers inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in the middle are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • although a DNN looks complicated, the work of each layer is actually not complicated.
  • in simple terms, each layer performs the following linear relationship expression: y = α(W·x + b), where x is the input vector, y is the output vector, b is the offset (bias) vector, W is the weight matrix (also called the coefficient), and α(·) is the activation function. Each layer simply performs such an operation on the input vector x to obtain the output vector y. Because a DNN has a large number of layers, the numbers of coefficient matrices W and offset vectors b are also large.
  • the definition of these parameters in a DNN is as follows. Taking the coefficient W as an example, suppose that in a three-layer DNN the linear coefficient from the fourth neuron of the second layer to the second neuron of the third layer is defined as W^3_24, where the superscript 3 represents the layer number of the coefficient W, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W^L_jk.
  • the input layer has no W parameter.
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically speaking, a model with more parameters has a higher complexity and a greater "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
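  • The per-layer operation y = α(W·x + b) and the layer-by-layer forward computation can be sketched as follows; the layer sizes, the ReLU activation, and the random parameters are arbitrary assumptions for the example:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dnn_forward(x, weights, biases, activation=relu):
    """Forward pass of a fully connected DNN: repeatedly apply
    y = activation(W @ x + b), layer by layer."""
    h = x
    for W, b in zip(weights, biases):
        h = activation(W @ h + b)
    return h

# Example: a 4 -> 8 -> 3 network with random parameters.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((8, 4)), rng.standard_normal((3, 8))]
bs = [np.zeros(8), np.zeros(3)]
out = dnn_forward(rng.standard_normal(4), Ws, bs)
```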
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a subsampling layer, which can be regarded as a filter.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can only be connected to some adjacent neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane may be composed of some rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights here are the convolution kernel. Weight sharing can be understood as meaning that the manner of extracting data information is independent of location.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
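  • The idea that one convolution kernel (a shared weight matrix) is slid over the whole input, so that feature extraction is independent of location, can be illustrated with a simple sketch of a valid 2-D convolution (kernel and image sizes are arbitrary):

```python
import numpy as np

def conv2d_single(image, kernel):
    """Valid 2-D convolution with a single shared kernel: the same
    weights are applied at every spatial position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A randomly initialized 3x3 kernel, later refined by training.
feature_map = conv2d_single(np.random.rand(8, 8), np.random.randn(3, 3))
```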
  • Recurrent neural networks are mainly used to process sequence data.
  • in a traditional neural network model, the layers are fully connected, while the nodes within each layer are unconnected.
  • in a recurrent neural network, the network remembers the previous information and applies it to the calculation of the current output; that is, the nodes in the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
  • RNN is designed to allow machines to have the ability to remember like humans. Therefore, the output of RNN needs to rely on current input information and historical memory information.
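  • A minimal sketch of this recurrence, in which the hidden state at the current moment depends on both the current input and the hidden state at the previous moment (the tanh activation and all sizes are assumptions for the example):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # Hidden state depends on the current input and the previous hidden state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a short sequence, carrying the hidden "memory" forward.
rng = np.random.default_rng(1)
W_xh, W_hh, b_h = rng.standard_normal((5, 3)), rng.standard_normal((5, 5)), np.zeros(5)
h = np.zeros(5)
for x_t in rng.standard_normal((4, 3)):   # 4 time steps, 3 features each
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
```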
  • the forward propagation algorithm is an algorithm that performs calculations from front to back: starting from the input layer, it computes layer by layer until the output layer is reached and the output result is obtained. In other words, the forward propagation algorithm obtains the result of the output layer through layer-by-layer operations from front to back.
  • the neural network can use the error back propagation (back propagation, BP) algorithm to correct the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, passing the input signal forward until the output will generate an error loss, and updating the parameters in the initial neural network model by backpropagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.
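  • Forward propagation followed by backpropagation of the error loss and a parameter update can be sketched on a single linear layer with a squared-error loss; the learning rate is an arbitrary illustrative value:

```python
import numpy as np

def train_step(W, b, x, target, lr=0.01):
    # Forward propagation: compute the output of the layer.
    y = W @ x + b
    error = y - target                      # reconstruction error

    # Backpropagation of the squared-error loss: dL/dW and dL/db.
    grad_W = np.outer(error, x)
    grad_b = error

    # Update parameters so that the error loss decreases.
    return W - lr * grad_W, b - lr * grad_b
```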
  • the embodiment of the present application provides a system architecture 100 .
  • the data collection device 160 is used to collect training data.
  • the training data may include human face video frames, digital image video frames, expression base coefficients, and the like.
  • after collecting the training data, the data collection device 160 stores the training data in the database 130, and the training device 120 obtains the target model/rule 101 based on the training data maintained in the database 130.
  • the following embodiments will describe the training process of the target model/rule 101 in more detail with reference to the accompanying drawings, which will not be described in detail here.
  • the above target model/rule 101 can be used to implement the expression transfer method of the embodiment of the present application.
  • the target model/rule 101 in the embodiment of the present application may specifically be a neural network.
  • the target model/rule 101 is obtained by training the original processing model.
  • the training data maintained in the database 130 may not all be collected by the data collection device 160, but may also be received from other devices.
  • the training device 120 does not necessarily perform the training of the target model/rule 101 entirely based on the training data maintained in the database 130; it may also obtain training data from the cloud or other devices for model training. The above description should not be construed as a limitation on the embodiments of the present application.
  • the target model/rule 101 obtained through training by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1. The execution device 110 may be a terminal, such as a notebook computer, an augmented reality (AR) terminal, a virtual reality (VR) terminal, or a vehicle-mounted terminal, and may also be a server, a cloud device, or the like.
  • an execution device 110 is configured with an input/output (input/output, I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through a client device 140.
  • the input data in this embodiment of the application may include: the expression data to be processed input by the client device, such as human face video frames.
  • the preprocessing module 113 and the preprocessing module 114 are used to perform preprocessing according to the input data (such as data to be processed) received by the I/O interface 112.
  • when the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call the data, code, and the like in the data storage system 150 for corresponding processing, and the data and instructions obtained through the corresponding processing may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result to the client device 140, thereby providing it to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above-mentioned goals or complete above tasks, thereby providing the desired result to the user.
  • the user can manually specify the input data, and the manual specification can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the client device 140 is required to automatically send the input data to obtain the user's authorization, the user can set the corresponding authority in the client device 140 .
  • the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be specific ways such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal, collecting the input data input to the I/O interface 112 as shown in the figure and the output results of the output I/O interface 112 as new sample data, and storing them in the database 130 .
  • alternatively, the I/O interface 112 may directly store, as new sample data, the input data that is input to the I/O interface 112 and the output results of the I/O interface 112 shown in the figure into the database 130.
  • FIG. 1 is only a schematic diagram of a system architecture provided by the embodiment of the present application, and the positional relationship between devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • for example, in FIG. 1, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
  • the target model/rule 101 is obtained through training by the training device 120. The target model/rule 101 in this embodiment of the present application may be the neural network in the present application; specifically, the neural network used in this embodiment of the present application may be a CNN, a DNN, a deep convolutional neural network (DCNN), a recurrent neural network (RNN), or the like.
  • CNN is a very common neural network
  • a convolutional neural network is a deep neural network with a convolutional structure and a deep learning architecture, in which learning is performed at multiple levels of abstraction.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to data input into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230 .
  • the input layer 210 can obtain the data to be processed, and pass the obtained data to be processed by the convolutional layer/pooling layer 220 and the subsequent neural network layer 230 to obtain the processing result of the data.
  • the internal layer structure of the CNN 200 in FIG. 2 will be described in detail below.
  • the convolutional layer/pooling layer 220 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 may include many convolution operators, which are also called kernels, and their role in data processing is equivalent to a filter for extracting specific information from the input data matrix.
  • a convolution operator can essentially be a weight matrix, which is usually predefined.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained through training can be used to extract information from the input data, so that the convolutional neural network 200 can make correct predictions .
  • the initial convolutional layer (such as 221) often extracts more general features, which can also be referred to as low-level features;
  • the features extracted by the later convolutional layers (such as 226) become more and more complex, such as features such as high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
  • a pooling layer often needs to be periodically introduced after a convolutional layer: one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • the sole purpose of the pooling layer is to reduce the size of the data.
  • after the processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as mentioned earlier, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input data. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the neural network layer 230 to generate one output or a group of outputs whose number equals the number of required classes. Therefore, the neural network layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 2) and an output layer 240, and the parameters contained in the multiple hidden layers may be obtained by pre-training based on relevant training data.
  • after the multiple hidden layers in the neural network layer 230 comes the output layer 240, which has a loss function similar to categorical cross-entropy and is specifically used to calculate the prediction error.
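  • A sketch of a network laid out like CNN 200 (alternating convolutional/pooling layers 221-226, hidden fully connected layers, and output layer 240), written here with PyTorch purely for illustration; the channel counts, the assumed 64x64 input size, and the number of output classes are not specified by the application:

```python
import torch.nn as nn

cnn_200 = nn.Sequential(
    # Convolutional layer / pooling layer 220 (layers 221-226), assuming 3x64x64 input.
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # 221: convolution
    nn.MaxPool2d(2),                                         # 222: pooling
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # 223: convolution
    nn.MaxPool2d(2),                                         # 224: pooling
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # 225: convolution
    nn.MaxPool2d(2),                                         # 226: pooling
    # Neural network layer 230: hidden layers 231..23n and output layer 240.
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128), nn.ReLU(),                   # hidden layer
    nn.Linear(128, 10),                                      # output layer 240
)
loss_fn = nn.CrossEntropyLoss()  # loss similar to categorical cross-entropy
```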
  • the convolutional neural network shown in FIG. 2 is only an example of a possible convolutional neural network for the expression transfer method of the embodiment of the present application; in a specific application, the convolutional neural network adopted by the expression transfer method of the embodiment of the present application may also exist in the form of other network models.
  • FIG. 3 is a hardware structure of a chip provided by an embodiment of the present application, and the chip includes a neural network processor 30 .
  • the chip can be set in the execution device 110 shown in FIG. 1 to complete the computing work of the computing module 111 .
  • the chip can also be set in the training device 120 shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101 .
  • the algorithms of each layer in the convolutional neural network shown in Figure 2 can be implemented in the chip shown in Figure 3 .
  • the neural network processor NPU 30 is mounted on the main central processing unit (central processing unit, CPU) (host CPU) as a coprocessor, and tasks are assigned by the main CPU.
  • the core part of the NPU is the operation circuit 303, and the controller 304 controls the operation circuit 303 to extract data in the memory (weight memory or input memory) and perform operations.
  • the operation circuit 303 includes multiple processing units (process engine, PE).
  • arithmetic circuit 303 is a two-dimensional systolic array.
  • the arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 303 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 302, and caches it in each PE in the operation circuit.
  • the operation circuit fetches the data of matrix A from the input memory 301 and performs matrix operation with matrix B, and the obtained partial results or final results of the matrix are stored in the accumulator (accumulator) 308 .
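  • The accumulation of partial matrix results can be illustrated with a NumPy sketch that splits the shared dimension into tiles and adds each partial product into an accumulator (the tile size is an arbitrary assumption):

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Multiply A @ B by accumulating partial products over tiles of the
    shared dimension, mimicking partial results gathered in an accumulator."""
    acc = np.zeros((A.shape[0], B.shape[1]))      # accumulator
    for k in range(0, A.shape[1], tile):
        acc += A[:, k:k + tile] @ B[k:k + tile, :]  # partial result
    return acc
```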
  • the vector computing unit 307 can perform further processing on the output of the computing circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector calculation unit 307 can be used for network calculations of non-convolution/non-FC layers in neural networks, such as pooling (pooling), batch normalization (batch normalization), local response normalization (local response normalization), etc. .
  • the vector computation unit 307 can store the processed output vector into the unified buffer 306.
  • the vector computing unit 307 may apply a non-linear function to the output of the computing circuit 303, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 307 generates normalized values, binned values, or both.
  • the vector of processed outputs can be used as an activation input to the arithmetic circuit 303, for example for use in a subsequent layer in a neural network.
  • the unified memory 306 is used to store input data and output data.
  • the direct memory access controller (DMAC) 305 transfers the input data in the external memory to the input memory 301 and/or the unified memory 306, stores the weight data in the external memory into the weight memory 302, and stores the data in the unified memory 306 into the external memory.
  • a bus interface unit (bus interface unit, BIU) 310 is configured to implement interaction between the main CPU, DMAC and instruction fetch memory 309 through the bus.
  • An instruction fetch buffer 309 connected to the controller 304 is used to store instructions used by the controller 304.
  • the controller 304 is configured to call the instruction cached in the instruction fetch memory 309 to control the operation process of the operation accelerator.
  • the unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch memory 309 are all on-chip memories, and the external memory is a memory outside the NPU.
  • the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • the operations of each layer in the convolutional neural network shown in FIG. 2 may be performed by the operation circuit 303 or the vector calculation unit 307.
  • the execution device 110 in FIG. 1 introduced above can execute the steps of the expression transfer method of the embodiment of the present application, and the CNN model shown in FIG. 2 and the chip shown in FIG. 3 can also be used to execute the steps of the expression transfer method of the embodiment of the present application.
  • the model training method of the embodiment of the present application and the expression transfer method of the embodiment of the present application will be described in detail below with reference to the accompanying drawings.
  • the embodiment of the present application provides a system architecture 400 .
  • the system architecture includes a local device 401, a local device 402, an execution device 110, and a data storage system 150, wherein the local device 401 and the local device 402 are connected to the execution device 110 through a communication network.
  • Execution device 110 may be implemented by one or more servers.
  • the execution device 110 may be used in cooperation with other computing devices, such as data storage, routers, load balancers and other devices.
  • Execution device 110 may be arranged on one physical site, or distributed on multiple physical sites.
  • the execution device 110 can use the data in the data storage system 150, or call the program code in the data storage system 150 to implement the expression migration method of the embodiment of the present application.
  • Each local device can represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, game console, etc.
  • each user's local device can interact with the execution device 110 through a communication network of any communication mechanism/communication standard, and the communication network can be a wide area network, a local area network, a point-to-point connection, or any combination thereof.
  • the local device 401 and the local device 402 obtain the relevant parameters of the target neural network from the execution device 110, deploy the target neural network on the local device 401 and the local device 402, and use the target neural network to obtain the expression base coefficients for driving the digital image, and so on.
  • the target neural network can also be directly deployed on the execution device 110: the execution device 110 obtains the expression data to be processed from the local device 401 or the local device 402, processes the expression data to be processed with the target neural network to obtain the expression base coefficients for driving the digital image, and transmits the expression base coefficients to the local device 401 or the local device 402.
  • the aforementioned execution device 110 may also be a cloud device. In this case, the execution device 110 may be deployed on the cloud; or, the aforementioned execution device 110 may also be a terminal device. In this case, the execution device 110 may be deployed on the user terminal side. This is not limited.
  • the neural network model provided in the embodiment of the present application is mainly used in the process of transferring the expression of the source object to the target object.
  • the source object may be a human or other animal, and the human and other animal here may be real or virtual.
  • the source object can also be an avatar of an inanimate object.
  • the target object can be an avatar of a person, an animal, a plant, or an inanimate object, etc.
  • the neural network model provided by the embodiment of the present application can be applied to the process of transferring human facial expressions to digital images.
  • Expression transfer refers to the method of using digital images to simulate the expressions of characters and express the emotions of characters.
  • the digital image can also be referred to as an avatar or an avatar, which is the carrier of expression animation and the target of expression transfer.
  • the digital image can be a virtual character (such as a virtual anchor) or an anthropomorphic animal (such as a monkey, a puppy, a giraffe) and the like.
  • Facial expressions can be decomposed into multiple partial muscle movements, and multiple partial muscle movements can be combined to form various facial expressions. Therefore, an expression can be represented by superposition of multiple partial action expressions, where each partial action expression can be called an expression base, a set of basic expressions used to express an expression is called an expression base set, and a set of linear combination weights of an expression base is called is the expression base coefficient.
  • for example, the expression F can be expressed as: F = α1·A1 + α2·A2 + α3·A3 + ⋯ + αn·An  (1)
  • where A1, A2, A3 ⋯ An are expression bases, such as opening the mouth, raising the corners of the mouth, and squinting; α1, α2, α3 ⋯ αn are the corresponding expression base coefficients.
  • the expression F can also be expressed as: F = B_id·α_id + A_exp·α_exp  (2)
  • where B_id is a facial identity base group, such as a high nose bridge, big eyes, or an oval face; α_id is the identity coefficient; A_exp is the expression base group; and α_exp is the expression base coefficient.
  • B_id, α_id, A_exp, and α_exp may be in the form of matrices or multidimensional vectors.
  • a user-specific expression base can be obtained, which can also be called a personalized expression base in the embodiment of the present application.
  • the expression F can also be expressed as: F = B_id·α_id + (A_exp + ΔA_exp)·α_exp  (3)
  • where A_exp is a set of standard expression bases, that is, the standard expression base set; ΔA_exp is the fine-tuned expression base component; and A_exp + ΔA_exp is a set of personalized expression bases, that is, the personalized expression base set.
  • the rest of the parameters in formula (3) have the same meanings as the corresponding parameters in formula (2), and will not be repeated here.
  • the "expression base” mentioned in the following embodiments of the present application may refer to a standard expression base or a personalized expression base, which is not limited in this embodiment of the present application.
  • Different roles (such as human and giraffe, male anchor and female anchor) can correspond to a set of standard expression bases respectively. It can be understood that whether it is a human expression or an expression of a digital image, it can be expressed by a corresponding standard expression base set, or can be expressed by a corresponding personalized expression base set, which is not limited in this embodiment of the present application.
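  • Formulas (1) to (3) amount to linear combinations of expression (and identity) bases. The NumPy sketch below illustrates them; all array shapes (number of mesh vertices, numbers of bases) and the random values are arbitrary assumptions for the example:

```python
import numpy as np

num_vertices, n_id, n_exp = 5000, 50, 52                  # assumed sizes
B_id = np.random.rand(num_vertices * 3, n_id)             # identity base group
A_exp = np.random.rand(num_vertices * 3, n_exp)           # standard expression bases
dA_exp = 0.05 * np.random.rand(num_vertices * 3, n_exp)   # fine-tuned base component

alpha_id = np.random.rand(n_id)                           # identity coefficients
alpha_exp = np.random.rand(n_exp)                         # expression base coefficients

F1 = A_exp @ alpha_exp                                    # formula (1): sum_i alpha_i * A_i
F2 = B_id @ alpha_id + A_exp @ alpha_exp                  # formula (2)
F3 = B_id @ alpha_id + (A_exp + dA_exp) @ alpha_exp       # formula (3): personalized bases
```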
  • the facial expression capture system can simulate the facial expression in real time by capturing the movement of the key points of the user's face, and can endow the facial expression on the digital image.
  • in the facial expression transfer scene shown in FIG. 5, the digital image 502 follows the human face in real time to make vivid expressions.
  • one expression transfer method is as follows: first obtain the expression base coefficients of the human facial expression through face model reconstruction, and then combine the expression base coefficients with the expression base set of the digital image, so as to drive the digital image to make the same expression as the human facial expression.
  • however, directly transferring the expression base coefficients of the human facial expression to the digital image cannot ensure accurate transmission of the user's expression information; for example, sad expressions and the like may be transmitted inaccurately, resulting in low expression transfer efficiency and affecting the user experience when applications are used for expression transfer.
  • the embodiment of the present application provides a model training method and an expression transfer method, which can improve the accuracy of expression transfer and improve the efficiency of expression transfer, thereby improving user experience in expression transfer.
  • the neural network model trained by the model training method can be used to process the expression basis coefficients of human facial expressions to obtain the adjusted expression basis coefficients, and use the adjusted expression basis coefficients to drive the digital image to realize the expression Migrate or pass on.
  • the neural network model establishes the expression constraint and association between the user and the digital image during the training process, so that when the expression base coefficient obtained by the neural network model is used to drive the digital image, the digital image can make facial expressions Consistent expressions for accurate delivery of facial expressions.
  • the neural network model can be further trained by using the expression data to be processed, so as to continuously update the model parameters of the neural network model and improve the accuracy of the neural network model.
  • Fig. 6 shows a schematic flowchart of an expression transfer method provided by an embodiment of the present application.
  • the expression transfer method 600 shown in FIG. 6 may be executed by the execution device 110 in FIG. 1, or jointly executed by the execution device 110 and the training device 120.
  • the method 600 may include steps S610-S650.
  • the expression data to be processed may be video data (ie, video stream).
  • the plurality of first video frames and the plurality of second video frames may be extracted frame by frame from the video data.
  • the multiple first video frames may belong to the same video stream, or may belong to different video streams.
  • the multiple second video frames may belong to the same video stream, or may belong to different video streams.
  • the first video frame and the second video frame may belong to the same video stream, or may belong to different video streams.
  • the embodiments of the present application are not limited to this.
  • the expression data to be processed may be a collection of pictures.
  • the plurality of first video frames and the plurality of second video frames are pictures in the picture set.
  • the facial expression data to be processed may be obtained by using a single camera, or may be obtained by using multiple cameras.
  • the camera may be an RGB camera, a depth camera, an infrared camera, and the like.
  • the camera can be a mobile phone camera, a camera lens, a computer camera, a surveillance camera, and the like.
  • each first video frame of the plurality of first video frames and each second video frame of the plurality of second video frames includes a facial image of the source object. That is to say, the expression data to be processed corresponds to the same object.
  • the expression data to be processed may be acquired at one time, or acquired in multiple batches.
  • the embodiment of the present application does not limit the acquisition order and acquisition time of the multiple first video frames and the multiple second video frames, as long as a first video frame is acquired before that first video frame is processed and a second video frame is acquired before that second video frame is processed. Therefore, the step numbers involved in the embodiments of the present application are only used to distinguish different steps, and do not limit the execution sequence of the steps.
  • the input and output of the first neural network model are both expression base coefficients: its input is the expression base coefficient corresponding to the source object, that is, the first expression base coefficient; its output is the expression base coefficient corresponding to the target object, that is, the second expression base coefficient.
  • the first neural network model is used to process the basic expression coefficients corresponding to the source object to obtain the basic expression coefficients corresponding to the target object.
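  • For illustration only (not the patent's architecture), the sketch below shows such a coefficient-to-coefficient network as a small multilayer perceptron; the dimension K, the hidden layer sizes, and the Sigmoid output range are assumptions.
```python
# Illustration only, not the patent's architecture: a small MLP that maps a
# K-dimensional expression base coefficient vector of the source object to a
# K-dimensional coefficient vector for the target object. K, the hidden sizes
# and the Sigmoid output range are assumptions.
import torch
import torch.nn as nn

K = 52  # assumed number of expression bases

class CoefficientMapper(nn.Module):
    def __init__(self, k: int = K, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(k, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, k),
            nn.Sigmoid(),  # keeps coefficients in [0, 1]; an assumption
        )

    def forward(self, source_coeffs: torch.Tensor) -> torch.Tensor:
        return self.net(source_coeffs)

first_model = CoefficientMapper()
first_coeffs = torch.rand(1, K)            # first expression base coefficients (source object)
second_coeffs = first_model(first_coeffs)  # second expression base coefficients (target object)
```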
  • the first video frame can be input into a pre-trained neural network model, and the neural network model can directly output the first expression base coefficient.
  • a series of processing operations can be performed on the first video frame, such as face positioning, key point detection, three-dimensional face reconstruction, pose estimation, expression recognition, etc., to obtain the first expression base coefficient, where each processing operation also has various possible implementations.
  • a standard expression basis set or a personalized expression basis set of the source object may be used to obtain the first expression basis coefficient.
  • the expression represented by the combination of the first expression base coefficient and the standard expression base set or the personalized expression base set of the source object is consistent with the expression presented by the source object in the first video frame.
  • the first neural network model is pre-trained, and its training process can adopt the model training method described in the following embodiments, which will be described below in conjunction with the accompanying drawings, and will not be described in detail here.
  • S630 Drive the target object according to the second expression base coefficient to transfer the expression of the source object in the first video frame to the target object, wherein the second expression base coefficient corresponds to the target object.
  • the target object is driven according to the second expression basis coefficient, that is, the combination of the second expression basis coefficient and the expression basis set of the target object is used to make the target object show the same expression as the source object in the first video frame.
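  • This driving step can be read as a linear blendshape combination; the sketch below is illustrative only and assumes the expression basis set is stored as per-vertex offsets, which is an assumption rather than a detail stated in the text.
```python
# Illustration only: drive the target object by linearly combining its expression
# basis set with the second expression base coefficients. Shapes are assumptions:
# neutral mesh (V, 3), per-expression vertex offsets (k, V, 3), coefficients (k,).
import numpy as np

def drive_target(neutral: np.ndarray, expression_basis: np.ndarray,
                 coeffs: np.ndarray) -> np.ndarray:
    """Return the deformed target mesh for one video frame."""
    # Weighted sum of the per-expression offsets added to the neutral face.
    return neutral + np.tensordot(coeffs, expression_basis, axes=1)

V, k = 5000, 52
neutral = np.zeros((V, 3))
basis = np.random.randn(k, V, 3) * 0.01
coeffs = np.random.rand(k)
driven_mesh = drive_target(neutral, basis, coeffs)  # same expression as the source frame
```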
  • the first video frame involved in steps S620 and S630 is any video frame among the plurality of acquired first video frames. Therefore, for each of the plurality of first video frames, steps S620 and S630 are performed. Correspondingly, according to each first video frame, the corresponding first expression base coefficient can be obtained.
  • the input and output of the second neural network model are both expression base coefficients: its input is the expression base coefficient corresponding to the source object, that is, the third expression base coefficient; its output is the expression base coefficient corresponding to the target object, that is, the fourth expression base coefficient.
  • the second neural network model is used to process the basic expression coefficients corresponding to the source object to obtain the basic expression coefficients corresponding to the target object.
  • the second neural network model has the same function as the first neural network model; the difference is that the first neural network model is used to process the expression base coefficients corresponding to the source object in the first video frame, while the second neural network model is used to process the expression base coefficients corresponding to the source object in the second video frame.
  • the manner of obtaining the third basic expression coefficient according to the second video frame may be the same as the manner of obtaining the first basic expression coefficient according to the first video frame.
  • a standard expression basis set or a personalized expression basis set of the source object may be used to obtain the third expression basis coefficient.
  • the expression represented by the combination of the third expression base coefficient and the standard expression base set or the personalized expression base set of the source object is consistent with the expression presented by the source object in the second video frame.
  • the initial model parameters of the second neural network model are the same as the initial model parameters of the first neural network model. It can be considered that before method 600 is performed, the first neural network model is the same as the second neural network model.
  • the target object is driven according to the fourth expression basis coefficient, that is, the combination of the fourth expression basis coefficient and the expression basis set of the target object is used to make the target object show the same expression as the source object in the second video frame.
  • the second video frame involved in steps S640 and S650 is any video frame in the plurality of acquired second video frames. Therefore, for each of the plurality of second video frames, steps S640 and S650 are performed. Correspondingly, according to each second video frame, the corresponding third expression base coefficient can be obtained.
  • the first neural network model is associated with the source object and the target object.
  • a second neural network model is associated with the source object and the target object. For different combinations of source objects and target objects, the first neural network model is different and the second neural network model is different.
  • the model parameters of the second neural network model are adjusted during the training process based on the first training data, where the first training data is obtained based on at least part of the first video frames in the plurality of first video frames.
  • that is, at least part of the first video frames in the plurality of first video frames can also be used to train the second neural network model.
  • the first neural network model can be replaced with the second neural network model, and then the second neural network model can be used to continue processing.
  • the related video frame that the second neural network model continues to process is referred to as the second video frame.
  • the at least part of the first video frame is obtained by sampling from a plurality of first video frames.
  • the sampling involved here may be random sampling, or sampling according to a certain sampling frequency, such as sampling every five video frames, or continuous sampling, which is not limited in this embodiment of the present application.
  • since the number of first video frames used for training the second neural network model is less than the total number of first video frames, the amount of calculation in the training process can be reduced, thereby reducing the resources required for the calculation.
  • depending on the data types included in the first training data, the process of obtaining the first training data based on at least some of the first video frames in the plurality of first video frames differs.
  • the first training data may include the at least part of the first video frames, may include the expression base coefficients of the source object obtained according to the first video frames, may include the expression base coefficients output after the source object's expression base coefficients are processed by the second neural network model, may include the expression base coefficients output after the source object's expression base coefficients are processed by the first neural network model, and may also include combinations of the aforementioned data types, and so on.
  • how to obtain the first training data may be determined according to the constructed training optimization function.
  • the preset condition may include: the expression loss generated when the expression base coefficient output by the second neural network model is used to drive the target object is smaller than the expression loss generated when the expression base coefficient output by the first neural network model is used to drive the target object.
  • to make this comparison, the input of the first neural network model and the second neural network model should be the same, and the video frames whose expressions are to be transferred should be the same. That is to say, for a certain first video frame, the first expression base coefficient corresponding to that first video frame is input into the first neural network model and the second neural network model respectively, so that the expression base coefficient output by the first neural network model and the expression base coefficient output by the second neural network model can each be obtained.
  • by using the expression base coefficient output by the first neural network model to drive the target object, the first facial image of the target object can be obtained; by using the expression base coefficient output by the second neural network model to drive the target object, the second facial image of the target object can be obtained.
  • the expression base coefficient output by the first neural network model is used for the expression loss generated when driving the target object, and can be determined according to the expression difference between the first facial image and the first video frame.
  • the expression base coefficient output by the second neural network model is used for the expression loss generated when driving the target object, and can be determined according to the expression difference between the second facial image and the first video frame.
  • the expression loss corresponding to the second neural network model is smaller than the expression loss corresponding to the first neural network model, it can be considered that the accuracy of the second neural network model is higher.
  • the expression base coefficient output by the second neural network model matches the target object better, and when it is used to drive the target object, the expressions made by the target object are more consistent with the expressions made by the source object in the video frame.
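  • The replacement rule implied by this preset condition can be sketched as below; `expression_loss` and `choose_model` are hypothetical helper names used only for illustration.
```python
# Sketch of the replacement rule implied by the preset condition; the helper
# names are hypothetical and the loss is a stand-in for the expression difference
# between the driven face and the source video frame.
import copy

def expression_loss(driven_expr, source_expr) -> float:
    # Placeholder: e.g. a squared distance between expression recognition vectors.
    return float(sum((a - b) ** 2 for a, b in zip(driven_expr, source_expr)))

def choose_model(first_model, second_model, loss_first: float, loss_second: float):
    """Keep the first model unless the second model drives the target with a smaller loss."""
    if loss_second < loss_first:
        return copy.deepcopy(second_model)  # replace the first model with the second
    return first_model
```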
  • the first neural network model or the second neural network model may be used for processing to obtain the expression coefficients corresponding to the target object.
  • using the expression base coefficient corresponding to the target object to drive the target object can make the target object make the same expression as the source object, realizing accurate transmission of the facial expression.
  • the second neural network model can be further trained by using the expression data to be processed, so as to continuously update the model parameters of the second neural network model and improve the performance of the second neural network model. precision.
  • the training samples of the second neural network model can be enriched, the applicable scenarios of the second neural network model can be expanded, and the problem of inaccurate transmission of extreme expressions can be avoided.
  • when the expression loss corresponding to the second neural network model is smaller than the expression loss corresponding to the first neural network model, the first neural network model can be replaced by the second neural network model, and the second neural network model is then used to continue processing, thereby obtaining expression base coefficients that better match the target object, making the expression displayed by the target object consistent with the expression of the source object and realizing accurate transmission and expression of the source object's expression information.
  • the correlation between the expression base coefficients and the identity coefficient can be reduced, or in other words, the expression base coefficients can be decoupled from the identity coefficient of the source object, which reduces the interference of the source object's identity coefficient in the process of driving the target object with the expression base coefficients. For example, if the source object in the video frame has bulging cheeks, when performing face reconstruction, a thin face model with bulging cheeks may be created, or a fat face model without bulging cheeks may be created; the expression base coefficients obtained therefore differ depending on which face reconstruction model is used.
  • the first neural network model or the second neural network model may restrict the expression base coefficients, so that the correlation between the expression base coefficients and the identity coefficients is reduced.
  • the method 600 further includes: training the second neural network model based on the first training data and the first loss function, wherein the first loss function is used for gradient backpropagation to adjust the second neural network model Model parameters for the network model.
  • the “gradient” involved here refers to the gradient of the model parameter size in the second neural network model.
  • the gradient of the parameter size of the model is obtained by calculating the partial derivative of the first loss function with respect to the corresponding parameter.
  • the first training data includes data used to be substituted into the first loss function to calculate the loss value.
  • the first loss function can be an L1 norm loss function (also called an absolute value loss function), or an L2 norm loss function, or an L1 norm loss function with an additional regular constraint term, or an L2 norm loss function with an additional regular constraint term.
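  • The loss families listed above can be sketched as below; the text only names the loss types, so the choice of regularization target (the model parameters) and the weight are assumptions.
```python
# Hedged sketch of the loss choices named above: an L1 or L2 norm loss,
# optionally with an additional regular constraint term. The choice of
# regularization target (model parameters) and weight is an assumption.
import numpy as np

def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.abs(pred - target).sum())

def l2_loss(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.square(pred - target).sum())

def loss_with_regularizer(pred, target, params, reg_weight=1e-4, use_l1=False):
    base = l1_loss(pred, target) if use_l1 else l2_loss(pred, target)
    return base + reg_weight * float(np.square(params).sum())  # regular constraint term
```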
  • the first training data includes a first output result set, a second output result set, a first expression recognition result set, and a second expression recognition result set.
  • the first output result set includes: the adjusted expression base coefficients obtained after the expression base coefficients of each first video frame in the at least part of the first video frames are processed by the first neural network model.
  • that is, for each first video frame in the at least part of the first video frames, the corresponding expression base coefficients can be obtained, and after these expression base coefficients are input into the first neural network model, the corresponding adjusted expression base coefficients can be obtained.
  • the first output result set includes these adjusted expression base coefficients output by the first neural network model.
  • the second output result set includes: the adjusted expression base coefficients obtained after the expression base coefficients of each first video frame in the at least part of the first video frames are processed by the second neural network model.
  • that is, for each first video frame in the at least part of the first video frames, the corresponding expression base coefficients can be obtained, and after these expression base coefficients are input into the second neural network model, the corresponding adjusted expression base coefficients can be obtained.
  • the second output result set includes these adjusted expression base coefficients output by the second neural network model.
  • the first set of expression recognition results includes: results of expression recognition performed on each first video frame in at least part of the first video frames.
  • the second expression recognition result set includes: the result of performing expression recognition on the digital image frame obtained when each output result in the second output result set drives the target object. That is to say, the target object is driven by the adjusted expression base coefficient output by the second neural network model, and a digital image frame containing the expression of the target object can be obtained.
  • the second facial expression recognition result set includes the results of recognizing the facial expressions of the target object.
  • the first training data includes not only the expression data related to the source object, but also the expression data of the target object.
  • in this way, expression constraints and associations between the source object and the target object can be established.
  • after the trained second neural network model adjusts the expression base coefficients, the expression displayed by the target object can be consistent with that of the source object, and accurate transmission and expression of the user's expression information can be realized.
  • the user can also be provided with a function of amplifying (exaggerating) expressions.
  • the first loss function is used to characterize the difference between the adjusted expression base coefficients corresponding to the same video frame in the first output result set and the second output result set, and the difference between the expression recognition results corresponding to the same video frame in the first expression recognition result set and the second expression recognition result set.
  • in other words, the first loss function is determined based on the difference between the output results of the first neural network model and the second neural network model for the same video frame, and based on the difference between the expression made by the target object when driven by the output of the second neural network model and the expression of the source object.
  • in this way, the model parameters of the second neural network model can be optimized without the deviation between the first neural network model and the second neural network model becoming too large.
  • the method 600 further includes: updating the model parameters of the first neural network model to be consistent with the model parameters of the second neural network model; and, during the application process of the second neural network model, adjusting the model parameters of the first neural network model during the training process based on the second training data, wherein the second training data is obtained according to at least part of the second video frames in the plurality of second video frames.
  • that is, at least part of the second video frames in the plurality of second video frames can also be used to train the first neural network model. This can enrich the training samples and improve the training accuracy of the neural network model.
  • the process of training the first neural network model is similar to the aforementioned process of training the second neural network model, except that the training data is different.
  • Fig. 7 shows a schematic flow chart of another expression transfer method provided by the embodiment of the present application.
  • the method 700 may include steps S710-S760.
  • Acquire expression data to be processed where the expression data to be processed includes a plurality of first video frames and a plurality of second video frames, wherein the first video frames and the second video frames include facial images of a source object.
  • Step S710 is the same as step S610 in method 600, and for details, reference may be made to related descriptions about S610.
  • Step S720 is the same as step S620 in method 600, and for details, reference may be made to related descriptions about S620.
  • Step S730 is the same as step S630 in method 600, and for details, reference may be made to related descriptions about S630.
  • the preset condition includes: the expression loss generated when the expression base coefficient output by the second neural network model is used to drive the target object is smaller than the expression loss generated when the expression base coefficient output by the first neural network model is used to drive the target object.
  • Step S750 is similar to step S640 in method 600, except that in S640 the second neural network model is used instead, whereas in S750 the adjusted first neural network model continues to be used. In fact, the adjusted first neural network model is the same as the second neural network model. For specific content, refer to the related descriptions about S640.
  • Step S760 is the same as step S650 in method 600, and for details, reference may be made to related descriptions about S650.
  • the method 700 further includes: during the application process of the adjusted first neural network model, adjusting the model parameters of the second neural network model during the training process based on the second training data, wherein The second training data is obtained according to at least part of the second video frames in the plurality of second video frames.
  • At least part of the second video frames in the plurality of second video frames may be used to train the second neural network model. This can enrich the training samples and improve the training accuracy of the neural network model.
  • different from method 600, in which the first neural network model and the second neural network model are alternately used to process the expression base coefficients of the expression data to be processed, in method 700 the first neural network model is always used to process the expression base coefficients of the expression data to be processed, but its model parameters can be updated according to the training results of the second neural network model.
  • FIG. 8 shows a schematic flowchart of a model training method provided by an embodiment of the present application.
  • the method 800 shown in FIG. 8 can be executed by the training device 120 in FIG. 1 .
  • the method 800 may include steps S810-S870.
  • the first training frame is a video frame or a picture.
  • the first training frame is input into the neural network model, and the original expression base coefficients and head pose parameters are directly output by the neural network model.
  • a series of processing operations can be performed on the first training frame, such as face positioning, key point detection, 3D face reconstruction, pose estimation, expression recognition, etc., to obtain the original expression base coefficients and head pose parameters, where each processing operation also has multiple possible implementations.
  • the original expression basis coefficients can be combined with the expression basis set of the source object to obtain the frontal image of the source object.
  • the frontal images involved here can be considered as images without head pose parameters.
  • S840 Process the original expression base coefficients according to the original neural network model to obtain adjusted expression base coefficients.
  • the adjusted expression basis coefficients can be combined with the expression basis set of the target object, and brought into the head pose parameters to obtain the face image of the target object under the head pose parameters.
  • the degree of head rotation and translation of the target object is consistent with the degree of head rotation and translation of the source object.
  • the adjusted expression basis coefficients may be combined with the expression basis set of the target object to obtain a frontal image of the target object.
  • expression constraints between the source object and the target object are established, so that the association between the source object and the target object can be established.
  • the expression constraint between the source object and the target object can make the expression of the source object be accurately transferred to the target object, for example, the expression of a human face is transferred to a digital image.
  • the expression constraints between different digital images can establish the association between digital images and realize the indirect transmission of human facial expressions.
  • the difference between the expression recognition result corresponding to the first training frame and the expression recognition result corresponding to the third training frame includes: the Manhattan distance or Euclidean distance between the expression recognition result corresponding to the first training frame and the expression recognition result corresponding to the third training frame.
  • the difference between the expression recognition result corresponding to the second training frame and the expression recognition result corresponding to the fourth training frame includes: the Manhattan distance or Euclidean distance between the expression recognition result corresponding to the second training frame and the expression recognition result corresponding to the fourth training frame.
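  • These two distances can be computed directly on the expression recognition result vectors, as sketched below; the vectors are assumed to be fixed-length probability or feature vectors.
```python
# Manhattan and Euclidean distances between two expression recognition result
# vectors (assumed to be fixed-length probability or feature vectors).
import numpy as np

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.abs(a - b).sum())

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

expr_first_frame = np.array([0.1, 0.7, 0.2])  # e.g. result for the first training frame
expr_third_frame = np.array([0.2, 0.6, 0.2])  # e.g. result for the third training frame
d_manhattan = manhattan_distance(expr_first_frame, expr_third_frame)
d_euclidean = euclidean_distance(expr_first_frame, expr_third_frame)
```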
  • model training method and expression transfer method provided in the embodiment of the present application will be introduced below with reference to more specific examples. It can be understood that, the following embodiments are described by taking the scene of transferring human facial expression to a digital image as an example, that is, the source object is a person, and the target object is a digital image. However, the embodiments of the present application are also applicable to other combinations of the source object and the target object.
  • FIG. 9 shows a schematic flowchart of a model training method provided by an embodiment of the present application.
  • the model training method 900 shown in FIG. 9 may be a specific example of the method 800 .
  • the method 900 may specifically include the following steps.
  • a human face video frame refers to a video frame in video data with a human face image, and may also be referred to as a video image.
  • the face images in the video frames belong to the same user.
  • each video frame in the video data can be regarded as a picture, so optionally, multiple pictures with human face images can also be obtained in this step. That is to say, in this step S901, it only needs to acquire a two-dimensional image (that is, a facial image sample) containing facial expressions.
  • the video data with a face image may be single-view video data acquired by a single camera, or multi-view video data obtained by shooting at multiple angles using multiple cameras, which is not limited in this embodiment of the present application .
  • the video data with human face images should include facial expressions that are as rich as possible.
  • any frame in the face video frames {f0, f1, f2, ..., fn} can be considered as a specific example of the first training frame in the aforementioned method 800, where n is a positive integer.
  • the original expression base coefficient W is a multidimensional coefficient vector matching the number of expression bases in the first expression base group, for example, a k-dimensional coefficient vector, where k is a positive integer.
  • the first expression basis set refers to the expression basis set combined with the original expression basis coefficient W to express human facial expressions.
  • the first expression base group may be a preset sample expression base set, or may be a personalized expression base set including facial identity features.
  • the original expression base coefficient W can be obtained based on deep learning.
  • the first training frame fi can be input into a pre-trained neural network model, and the output of the neural network model is the original expression base coefficient W.
  • the neural network model involved here is a model that outputs expression base coefficients according to human face images.
  • the original expression base coefficient W may be obtained by using face positioning technology, face key point detection technology, 3D face reconstruction technology, face pose estimation technology, and facial expression recognition technology.
  • Face location technology refers to the determination of the position of a face in a picture through face detection technology.
  • Face key point detection technology refers to locating some points with special semantic information on the face image, such as eyes, nose tip, lips, etc.
  • 3D face reconstruction technology refers to the technology of generating a 3D model of a human face.
  • 3D reconstruction of a human face can be realized through human face video frames.
  • the reconstructed face model should have the same topological structure as the face in the video frame, that is, the key points of the reconstructed face model projected into the 2D image still correspond to the key points of the face in the video frame.
  • the accuracy of face model reconstruction affects the accuracy of subsequent calculation of expression basis coefficients.
  • the face pose estimation technology refers to the technology to realize the transformation estimation of the face in the world coordinate system, mainly to estimate the head pose parameters corresponding to the face in the current picture.
  • the projection process from a 3D model to a 2D image will undergo rotation and translation transformations.
  • the head pose parameters include the rotation angle of the head and the translation amount of the head.
  • the rotation angle of the head can be represented by the rotation matrix R
  • the translation amount of the head can be represented by the translation vector T.
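  • As an illustration, applying R and T to the reconstructed face vertices is a rigid transform, as sketched below; the orthographic projection to 2D is an assumption standing in for the actual camera model.
```python
# Illustrative rigid transform: apply the head pose (rotation matrix R and
# translation vector T) to reconstructed face vertices; the orthographic
# projection to 2D below is an assumption standing in for the camera model.
import numpy as np

def apply_head_pose(vertices: np.ndarray, R: np.ndarray, T: np.ndarray) -> np.ndarray:
    """vertices: (V, 3), R: (3, 3), T: (3,) -> posed vertices of shape (V, 3)."""
    return vertices @ R.T + T

def orthographic_project(vertices: np.ndarray) -> np.ndarray:
    """Drop the depth coordinate to obtain 2D image-plane points."""
    return vertices[:, :2]

verts = np.random.randn(4, 3)
R = np.eye(3)                      # no rotation
T = np.array([0.0, 0.0, 5.0])      # translate along the optical axis
projected = orthographic_project(apply_head_pose(verts, R, T))
```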
  • the accuracy of transformation estimation determines the accuracy of the head pose of the driven digital avatar.
  • Facial expression recognition technology refers to classifying the facial expressions in the input face image and identifying which expression the facial expression belongs to.
  • the reconstructed face model can adopt a formula similar to the above formula (2), namely:

    $F_{face} = \bar{F} + \alpha_{id1} B_{id1} + \beta_{exp1} A_{exp1}$  (4)

  • where $F_{face}$ is the reconstructed face model; $\bar{F}$ is the average face model, which is generally preset and can be considered as a neutral face without expression; $B_{id1}$ is the facial identity feature, generally the identity basis set of principal component analysis (PCA); $\alpha_{id1}$ is the identity coefficient; $A_{exp1}$ is the expression basis group; $\beta_{exp1}$ is the expression basis coefficient.
  • $B_{id1}$, $\alpha_{id1}$, $A_{exp1}$, and $\beta_{exp1}$ may be in the form of a matrix or a multidimensional vector.
  • adding $\alpha_{id1} B_{id1}$ to the average face model can restore the 3D face shape of the user without expression, and further adding $\beta_{exp1} A_{exp1}$ can restore the user's 3D face with expression, so as to realize the reconstruction of the face.
  • $B_{id1}$, $\alpha_{id1}$ and $\beta_{exp1}$ in formula (4) can be solved during or before the process of 3D face reconstruction.
  • the accuracy of face reconstruction depends on identity information and expression information, and the accuracy of identity information directly affects the accuracy of the facial expression base coefficient (ie, the original expression base coefficient W). Therefore, in some embodiments, the identity coefficients of the same user under different expressions can be constrained to improve the accuracy of identity information. Specifically, the identity coefficients of the same user under different expressions should be consistent, that is, they should be the same, so the exchange of identity coefficients of the same user in different video frames should not affect the generated face model.
  • a face model with an expression 1 can be reconstructed according to the first video frame, specifically, an expression base coefficient 1 and an identity coefficient 1 can be obtained.
  • a face model with an expression 2 can be reconstructed according to the second video frame, specifically, an expression base coefficient 2 and an identity coefficient 2 can be obtained.
  • if identity coefficients 1 and 2 are exchanged, the expression obtained by combining expression base coefficient 1 with identity coefficient 2 should be consistent with expression 1, and the expression obtained by combining expression base coefficient 2 with identity coefficient 1 should be consistent with expression 2. Therefore, the user's face can be reconstructed by combining different identity coefficients with the same expression base coefficient, which can improve the accuracy of face reconstruction.
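  • A sketch of this identity-swap consistency constraint, built on the reconstruction of formula (4), is shown below; array shapes and the squared-error penalty are assumptions, and the penalty could be added to the face-fitting objective.
```python
# Sketch of the identity-swap consistency constraint described above, built on
# the reconstruction of formula (4). Array shapes and the squared-error penalty
# are assumptions; the penalty could be added to the face-fitting objective.
import numpy as np

def reconstruct(mean_face, B_id, alpha_id, A_exp, beta_exp):
    """F_face = mean + B_id @ alpha_id + A_exp @ beta_exp (vertices flattened to 3V)."""
    return mean_face + B_id @ alpha_id + A_exp @ beta_exp

def identity_swap_penalty(mean_face, B_id, A_exp, alpha1, beta1, alpha2, beta2):
    # Swapping the identity coefficients of the same user across two frames
    # should not change the reconstructed expressive faces.
    f1 = reconstruct(mean_face, B_id, alpha1, A_exp, beta1)
    f1_swapped = reconstruct(mean_face, B_id, alpha2, A_exp, beta1)
    f2 = reconstruct(mean_face, B_id, alpha2, A_exp, beta2)
    f2_swapped = reconstruct(mean_face, B_id, alpha1, A_exp, beta2)
    return float(np.square(f1 - f1_swapped).sum() + np.square(f2 - f2_swapped).sum())
```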
  • in step S902, as long as the original expression base coefficient W and the head posture parameters can be acquired according to the face image, the specific acquisition method is not limited in this embodiment of the application.
  • the neural network model MLP fine-tunes the original expression base coefficient W.
  • Step S905: drive the digital avatar according to the adjusted expression base coefficient W' obtained in step S904 and the head posture parameters obtained in step S902, and obtain the digital image frame f_i' corresponding to the first training frame f_i by performing differentiable rendering on the digital avatar; the digital image frame f_i' is a specific example of the aforementioned third training frame.
  • the process of driving the digital avatar is to combine the adjusted expression base coefficient W' with the expression basis set of the digital avatar so that the digital avatar makes the corresponding expression. In order to distinguish it from the first expression base set, the expression base set corresponding to the digital image is referred to as the second expression base set in the embodiment of the present application.
  • by executing step S905 multiple times, the digital image frame corresponding to each video frame in the face video frames {f0, f1, f2, ..., fn} can be obtained, so all the digital image frames corresponding to the face video frames {f0, f1, f2, ..., fn} can be expressed as {f0', f1', f2', ..., fn'}.
  • the positive digital avatar is the frontal face of the digital avatar.
  • the adjusted expression base coefficient W' and the model of the digital avatar can be used to generate the front face of the digital avatar.
  • the frontal digital avatar can also be understood as a digital avatar frame without a head posture obtained by driving the digital avatar with the adjusted expression base coefficient W'.
  • the frontal digital image F i ′ can be understood as a specific example of the aforementioned fourth training frame.
  • by executing step S906 multiple times, the frontal digital image corresponding to each video frame in the face video frames {f0, f1, f2, ..., fn} can be obtained, so all the frontal digital images corresponding to the face video frames {f0, f1, f2, ..., fn} can be expressed as {F0', F1', F2', ..., Fn'}.
  • the frontal face can be understood as the recovered face without head pose parameters.
  • the original expression base coefficient W and the recovered face model can be used to generate a frontal face.
  • the front face F i can be understood as a specific example of the aforementioned second training frame.
  • by executing step S907 multiple times, the frontal face corresponding to each video frame in the face video frames {f0, f1, f2, ..., fn} can be obtained, so all the frontal faces corresponding to the face video frames {f0, f1, f2, ..., fn} can be expressed as {F0, F1, F2, ..., Fn}.
  • the loss function L1 in the embodiment of the present application is used to characterize the expression difference between the human face video frame and the digital avatar image, and its optimization purpose is to make the facial expression of the input video frame consistent with the facial expression of the driven digital avatar.
  • the “gradient” involved here refers to the gradient of the model parameter size in the neural network model MLP.
  • the gradient of the parameter size of the model can be obtained by calculating the partial derivative of the loss function L 1 with respect to the corresponding parameters.
  • the loss function L1 can be:

    $L_1 = \| \exp(f_i) - \exp(f_i') \|_2 + \| \exp(F_i) - \exp(F_i') \|_2$  (5)
  • where exp(f_i) is the expression recognized according to the first training frame f_i in the face video frames; exp(f_i') is the expression recognized according to the corresponding frame f_i' in the digital image frames (i.e. the third training frame); exp(F_i) is the expression recognized according to the frontal face F_i (i.e. the second training frame); exp(F_i') is the expression recognized according to the frontal digital image F_i' (i.e. the fourth training frame).
  • exp(f i ), exp(f′ i ), exp(F i ), exp(F′ i ) can be represented by multidimensional vectors or matrices.
  • optimizing the output value of formula (5) aims at making the expressions of the human face in the face video frame and of the digital image consistent both for the frontal face and under the actual posture (such as a profile face). Therefore, the smaller the output value of the loss function L1, the better.
  • facial expressions can be recognized through a neural network model; for example, the first training frame f_i in the face video frames is input into a pre-trained neural network model, and the neural network model processes the first training frame f_i, recognizes the facial expression, and outputs the corresponding vector representation.
  • the expression loss constraint between the human face video frame and the digital image frame is established by using the L2 norm.
  • the loss function L1 can also be:

    $L_1 = \| \exp(f_i) - \exp(f_i') \|_2$  (6)

  • the loss function L1 can also be:

    $L_1 = \| \exp(F_i) - \exp(F_i') \|_2$  (7)
  • Formula (6) includes the first half of formula (5), which is used to characterize the difference in expression between the human face and the digital image in the human face video frame in the actual posture presented.
  • the formula (7) includes the second half of the formula (5), which is used to characterize the expression difference between the frontal face of the human face video frame and the frontal digital image.
  • if the loss function L1 shown in formula (6) is adopted, step S906 and step S907 can be omitted in method 900. If the loss function L1 shown in formula (7) is adopted, the method 900 can omit step S905.
  • the loss function L 1 may also be constructed in other ways, for example, by using the L1 norm, which is not limited in this embodiment of the present application.
  • the L1 norm is the 1-norm of a vector, which represents the sum of the absolute values of the non-zero elements of the vector, for example $\|x\|_1 = \sum_i |x_i|$.
  • the L2 norm is the 2-norm of a vector, that is, the square root of the sum of the squares of its elements, for example $\|x\|_2 = \sqrt{\sum_i x_i^2}$.
  • in step S908, the model parameters of the neural network model MLP can be adjusted according to the result of the loss function L1. So far, one training pass of the neural network model MLP is completed. For the next frame after the first training frame f_i, steps S901-S908 are also performed, except that the model parameters of the neural network model used for the later video frame have already been adjusted relative to the model parameters of the neural network model used for the previous frame.
  • as the training proceeds, the loss function L1 gradually converges.
  • when the result of the loss function L1 is less than a preset threshold or no longer decreases, it can be considered that the training of the neural network model MLP is completed.
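  • A hedged sketch of one such gradient back-propagation pass is shown below; the differentiable `expression_of` function is an assumption standing in for the render-and-recognize pipeline of steps S905-S907.
```python
# Hedged sketch of one gradient back-propagation pass like the one in step S908:
# the partial derivatives of the loss with respect to the MLP parameters are
# computed by autograd and used to adjust the parameters. `expression_of` is a
# differentiable stand-in for the render-and-recognize pipeline (an assumption).
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(52, 128), nn.ReLU(), nn.Linear(128, 52))
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)

def expression_of(coeffs: torch.Tensor) -> torch.Tensor:
    # Placeholder for "drive and render the face, then recognize its expression".
    return torch.tanh(coeffs)

original_coeffs = torch.rand(1, 52)                  # original expression base coefficient W
target_expression = expression_of(original_coeffs)   # expression recognized from the video frame

adjusted_coeffs = mlp(original_coeffs)               # adjusted expression base coefficient W'
loss = torch.norm(expression_of(adjusted_coeffs) - target_expression, p=2)

optimizer.zero_grad()
loss.backward()   # gradients of the loss w.r.t. the MLP parameters
optimizer.step()  # adjust the model parameters
```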
  • the method shown in FIG. 9 is introduced using the neural network model MLP as an example.
  • the neural network model MLP in FIG. 9 can be replaced by any one of the aforementioned neural networks, such as convolutional neural networks, recurrent neural networks, deep neural networks, and so on.
  • the method shown in FIG. 9 is described by taking the transfer of facial expressions to digital images as an example, and the training data used include human face video frames and digital image frames.
  • the human face video frame can also be replaced with images of other digital avatars, and the neural network model trained in this way is used to transfer expressions between the digital avatars.
  • the neural network model is trained by constructing the expression loss function L1 of the face video frame and the digital avatar, and the trained neural network model is associated with the user in the face video frame and the driven digital avatar.
  • FIG. 10 shows a schematic flowchart of an expression transfer method provided by an embodiment of the present application.
  • the method 1000 shown in FIG. 10 may be a specific example of the method 600 or the method 700 .
  • the method 1000 shown in FIG. 10 may include the following steps.
  • the face video frames acquired in this step are video frames to be processed, including expression data to be processed, where t is a positive integer.
  • the human face in the human face video frame in step S1001 and the human face in the human face video frame in step S901 should belong to the same user.
  • the first expression base coefficient M is a k-dimensional coefficient vector, and k is a positive integer.
  • the method of obtaining the first basic expression coefficient M may be the same as the method of obtaining the original basic expression coefficient W in the method 900. For details, please refer to the relevant description above, which will not be repeated here.
  • the first MLP is the neural network model MLP trained by the method 900 .
  • the digital image frame g′ i corresponding to the face video frame g i can be obtained by performing differentiable rendering on the digital image.
  • the digital avatar involved in step S1005 is the same digital avatar involved in the method 900.
  • by executing step S1005 multiple times, the digital image frame corresponding to each video frame in the face video frames {g0, g1, g2, ..., gt} can be obtained, so all the digital image frames corresponding to the face video frames {g0, g1, g2, ..., gt} can be expressed as {g0', g1', g2', ..., gt'}.
  • in step S1006, during the process of executing step S1002 for each frame in the face video frames {g0, g1, g2, ..., gt}, data sampling is performed on the first expression base coefficients obtained in step S1002, and the first expression base coefficients obtained by sampling are input into the second MLP.
  • the first expression base coefficient M corresponding to any frame g i is input into the second MLP.
  • the data sampling frequency may be every P frames, where P may be 1, 2, 4, 7, or another positive integer greater than 1.
  • random sampling or continuous sampling may also be used, which is not limited in this embodiment of the present application.
  • the second MLP is also a neural network model MLP trained by the method 900, wherein the initial model parameters of the second MLP are the same as those of the first MLP.
  • in step S1008, according to the fifth expression base coefficient M" obtained in step S1007, the digital image frame g"_i corresponding to the sampled face video frame can be obtained.
  • by executing step S1008 multiple times, the digital image frame corresponding to each sampled video frame can be obtained, so all the digital image frames corresponding to the sampled video frames can be expressed as {g"0, g"1, g"2, ..., g"t}.
  • when the model parameters of the first MLP and the second MLP are the same, after the first expression base coefficient M is input into the first MLP and the second MLP respectively, the obtained second expression base coefficient M' and fifth expression base coefficient M" are identical.
  • at this time, the expression presented by combining the second expression base coefficient M' with the expression basis set of the digital image should be the same as the expression presented by combining the fifth expression base coefficient M" with the expression basis set of the digital image.
  • the first loss function L2 is used to characterize the expression difference between the sampled video frame and the digital avatar image, and its optimization purpose is to make the facial expression of the sampled video frame consistent with the facial expression of the driven digital avatar.
  • the first loss function L2 may be:

    $L_2 = \| \exp(g_i) - \exp(g_i'') \|_2 + \lambda \, \| M'' - M' \|_2$  (8)
  • where exp(g_i) is the expression recognized according to the sampled video frame g_i; exp(g"_i) is the expression recognized according to the corresponding digital image frame g"_i; λ is a hyperparameter, that is, a parameter set manually rather than obtained through training.
  • exp(g i ), exp(g′′ i ) can be represented by a multidimensional vector or matrix.
  • $\| M'' - M' \|_2$ is the distance between the fifth expression base coefficient M" and the second expression base coefficient M'; the smaller this distance, the smaller the difference between the fifth expression base coefficient M" and the second expression base coefficient M'.
  • the optimization goal is that the expression of the face in the sampled video frame is consistent with the expression of the digital image, and that the gap between the first MLP and the second MLP is not too large. Therefore, the smaller the first loss function L2, the better.
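  • A direct sketch of the reconstructed first loss function L2 of formula (8) is shown below; the default value of λ is an assumption.
```python
# Direct sketch of the reconstructed first loss function L2 of formula (8):
# an expression term plus a lambda-weighted distance between the fifth and
# second expression base coefficients. The default value of lam is an assumption.
import numpy as np

def first_loss_l2(expr_frame: np.ndarray, expr_digital: np.ndarray,
                  coeffs_fifth: np.ndarray, coeffs_second: np.ndarray,
                  lam: float = 0.1) -> float:
    expression_term = np.linalg.norm(expr_frame - expr_digital)
    coefficient_term = np.linalg.norm(coeffs_fifth - coeffs_second)
    return float(expression_term + lam * coefficient_term)
```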
  • the expression loss constraint between the sampled video frame and the digital image frame is established by using the L2 norm.
  • the first loss function L 2 may also be constructed in other ways, for example, by using the L1 norm, which is not limited in this embodiment of the present application.
  • the model parameters of the second MLP can be adjusted according to the result of the first loss function L2.
  • Steps S1006-S1009 can be regarded as the completion of a training process for the second MLP.
  • for each subsequent sampled video frame, steps S1006-S1009 are also performed, except that the model parameters of the second MLP used for the later sampled video frame have changed compared with the model parameters of the second MLP used for the previous sampled video frame.
  • the model parameters of the first MLP may be replaced with the model parameters of the second MLP, that is, the model parameters of the second MLP are assigned to the first MLP.
  • the expression base coefficients can be output in real time by the trained first MLP, and at the same time, the data can be sampled in real time and the second MLP can be adjusted.
  • the second MLP is used to replace the first MLP.
  • the second MLP continues to output expression base coefficients in real time, and the first MLP continues to train in real time.
  • the training process of the second MLP is asynchronous with the training process of the first MLP, and the training samples are enriched by using the expression data to be processed, so as to avoid the situation that extreme expressions cannot be accurately transmitted.
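  • A hedged sketch of this asynchronous serve/sample/train/swap flow is shown below; the helpers `extract_coeffs` and `train_step` and the sampling period P are assumptions standing in for steps S1002 and S1006-S1009, so only the control flow is illustrated.
```python
# Hedged sketch of the asynchronous serve/sample/train/swap flow of steps
# S1001-S1009; `extract_coeffs`, `train_step` and the sampling period P are
# assumptions standing in for the detailed steps, only the control flow is shown.
import copy
import torch
import torch.nn as nn

def make_mlp(k: int = 52) -> nn.Module:
    return nn.Sequential(nn.Linear(k, 128), nn.ReLU(), nn.Linear(128, k))

def extract_coeffs(frame) -> torch.Tensor:
    return torch.rand(1, 52)          # placeholder for step S1002

def train_step(model: nn.Module, coeffs: torch.Tensor) -> None:
    pass                              # placeholder for steps S1007-S1009

first_mlp = make_mlp()
second_mlp = copy.deepcopy(first_mlp)  # same initial model parameters
P = 5                                  # assumed sampling period

for i, frame in enumerate(range(100)):          # stand-in for the video stream
    coeffs = extract_coeffs(frame)              # first expression base coefficient M
    with torch.no_grad():
        adjusted = first_mlp(coeffs)            # drives the digital image in real time
    if i % P == 0:                              # sample frames for online training
        train_step(second_mlp, coeffs)

# Once the second MLP satisfies the preset condition, its parameters replace the first MLP's:
first_mlp.load_state_dict(second_mlp.state_dict())
```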
  • the expression transfer method provided in the embodiment of the present application can be applied on the server side, and can also be applied on the user terminal.
  • the user terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like.
  • the server end may be implemented by a computer having a processing function, and the embodiment of the present application is not limited thereto.
  • FIG. 11 shows a schematic structural diagram of a device provided by an embodiment of the present application.
  • the apparatus 1100 shown in FIG. 11 includes an acquisition unit 1101 and a processing unit 1102 .
  • the apparatus 1100 may be located in the executing device 110 shown in FIG. 1 or other devices.
  • the apparatus 1100 can be used to implement the expression transfer method shown in FIG. 6 or FIG. 7 , and can also be used to implement the embodiment shown in FIG. 10 .
  • the acquiring unit 1101 may be configured to execute steps S610 , S620 , and S640 in method 600 , or execute steps S710 , S720 , and S750 in method 700 .
  • the processing unit 1102 may be configured to execute steps S630, S650 in the method 600, or execute steps S730, S740, S760 in the method 700.
  • the apparatus 1100 may be located in the training device 120 shown in FIG. 1 or other devices.
  • the apparatus 1100 can be used to execute the model training method shown in FIG. 8 , and can also be used to execute the embodiment shown in FIG. 9 .
  • the acquiring unit 1101 may be used to execute steps S810-S860 in the method 800.
  • the processing unit 1102 may be used to execute step S870 in the method 800 .
  • FIG. 12 shows a schematic diagram of a hardware structure of a device provided by an embodiment of the present application.
  • the device 1200 shown in FIG. 12 includes a memory 1201 , a processor 1202 , a communication interface 1203 and a bus 1204 .
  • the memory 1201 , the processor 1202 , and the communication interface 1203 are connected to each other through a bus 1204 .
  • the memory 1201 may be a read only memory (read only memory, ROM), a static storage device, a dynamic storage device or a random access memory (random access memory, RAM).
  • the memory 1201 may store programs, and when the programs stored in the memory 1201 are executed by the processor 1202, the processor 1202 and the communication interface 1203 are used to execute each step of the model training method of the embodiment of the present application.
  • the processor 1202 may be a general-purpose central processing unit (central processing unit, CPU), a microprocessor, an application specific integrated circuit (application specific integrated circuit, ASIC), a graphics processing unit (graphics processing unit, GPU), or one or more integrated circuits, and is used to execute related programs to realize the functions required by the units in the model training device of the embodiment of the present application, or to execute the model training method of the embodiment of the present application.
  • the processor 1202 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the model training method of the present application may be completed by an integrated logic circuit of hardware in the processor 1202 or instructions in the form of software.
  • the above-mentioned processor 1202 can also be a general-purpose processor, a digital signal processor (digital signal processing, DSP), an application-specific integrated circuit, a field programmable gate array (field programmable gate array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 1201, and the processor 1202 reads the information in the memory 1201, and combines its hardware to complete the functions required by the units included in the model training device of the embodiment of the present application, or execute the model training method of the embodiment of the present application .
  • the communication interface 1203 implements communication between the apparatus 1200 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver.
  • the bus 1204 may include a pathway for transferring information between various components of the device 1200 (eg, memory 1201 , processor 1202 , communication interface 1203 ).
  • the embodiment of the present application also provides a schematic diagram of the hardware structure of an expression transfer device, the structure of which is the same as that of the device for model training in FIG. 12 , so the expression transfer device will be described with reference to FIG. 12 .
  • the device 1200 includes a memory 1201 , a processor 1202 , a communication interface 1203 and a bus 1204 .
  • the memory 1201 , the processor 1202 , and the communication interface 1203 are connected to each other through a bus 1204 .
  • the memory 1201 may be a read only memory (read only memory, ROM), a static storage device, a dynamic storage device or a random access memory (random access memory, RAM).
  • the memory 1201 may store a program. When the program stored in the memory 1201 is executed by the processor 1202, the processor 1202 and the communication interface 1203 are used to execute each step of the expression transfer method in the embodiment of the present application.
  • the processor 1202 may be a general-purpose central processing unit (central processing unit, CPU), a microprocessor, an application specific integrated circuit (application specific integrated circuit, ASIC), a graphics processing unit (graphics processing unit, GPU), or one or more integrated circuits, and is used to execute related programs to realize the functions required by the units in the expression transfer device of the embodiment of the present application, or to execute the expression transfer method of the embodiment of the present application.
  • the processor 1202 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the expression transfer method of the present application may be completed by an integrated logic circuit of hardware in the processor 1202 or instructions in the form of software.
  • the above-mentioned processor 1202 can also be a general-purpose processor, a digital signal processor (digital signal processing, DSP), an application-specific integrated circuit, a field programmable gate array (field programmable gate array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 1201, and the processor 1202 reads the information in the memory 1201, and combines its hardware to complete the functions required by the units included in the expression transfer device of the embodiment of the application, or execute the expression transfer method of the embodiment of the application .
  • the communication interface 1203 implements communication between the apparatus 1200 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver.
  • the bus 1204 may include a pathway for transferring information between various components of the device 1200 (e.g., the memory 1201, the processor 1202, and the communication interface 1203).
  • the embodiment of the present application also provides a computer program storage medium that stores program instructions; when the program instructions are executed, directly or indirectly, the model training method or the expression transfer method described above is implemented.
  • the embodiment of the present application also provides a chip system that includes at least one processor; when program instructions are executed in the at least one processor, the model training method or the expression transfer method described above is implemented.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other division manners.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions described above are realized in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • the technical solution of the present application essentially, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or another medium that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Architecture (AREA)
  • Multimedia (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

本申请实施例提供了一种表情迁移方法、模型训练方法和装置。在表情迁移方法中,利用预先训练好的第一神经网络模型对源对象对应的表情基系数进行处理,得到与目标对象对应的表情基系数,并利用其驱动目标对象,以将源对象的表情迁移至目标对象。同时,基于待处理的表情数据对第二神经网络模型进行训练,以调整第二神经网络模型的模型参数。当第二神经网络模型输出的表情基系数用于驱动目标对象时产生的表情损失小于第一神经网络模型输出的表情基系数用于驱动目标对象时产生的表情损失时,继续利用第二神经网络模型对源对象对应的表情基系数进行处理。上述技术方案能够提高表情信息传递的准确性,从而提高表情迁移效率。

Description

表情迁移方法、模型训练方法和装置
本申请要求于2022年01月28日提交中国专利局、申请号为202210105861.0、申请名称为“表情迁移方法、模型训练方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及图像处理技术领域,并且更具体地,涉及一种表情迁移方法、模型训练方法和装置。
背景技术
面部表情捕捉系统的出现,实现了人类面部表情的捕捉与迁移。具体地,面部表情捕捉系统通过对用户面部关键点的运动进行捕获,可以实时模拟出人脸表情,并可以将人脸表情赋予到数字形象(即虚拟角色,如虚拟人物或拟人动物)上。
表情可以由多个局部动作表情叠加表示,每个局部动作表情为一个表情基,一组表情基的线性组合权重为表情基系数。在表情迁移过程中,可以先获得人脸表情的表情基系数,再将该表情基系数与数字形象的表情基组合,从而驱动数字形象做出与人脸表情对应的表情。
然而,由于人脸模型的表情基与数字形象的表情基之间通常会存在差异,直接将人脸表情的表情基系数迁移应用到数字形象中并不能确保用户表情信息的准确传递,导致表情迁移效率不高。
发明内容
本申请实施例提供一种表情迁移方法、模型训练方法和装置,能够提高用户表情信息传递的准确性,从而提高表情迁移的效率。
第一方面,提供了一种表情迁移方法,包括:获取待处理表情数据,所述待处理表情数据包括多个第一视频帧和多个第二视频帧,其中所述第一视频帧和所述第二视频帧包括源对象的面部图像;基于第一神经网络模型和第一表情基系数,获取第二表情基系数,其中所述第一表情基系数基于所述第一视频帧得到且与所述源对象对应;根据所述第二表情基系数驱动目标对象,以将所述第一视频帧中的源对象的表情迁移至所述目标对象,其中所述第二表情基系数与所述目标对象对应;
当满足预设条件时,基于第二神经网络模型和第三表情基系数,获取第四表情基系数,其中所述第三表情基系数基于所述第二视频帧得到且与所述源对象对应;根据所述第四表情基系数驱动所述目标对象,以将所述第二视频帧中的源对象的表情迁移至所述目标对象,其中所述第四表情基系数与所述目标对象对应;
其中,所述第二神经网络模型的初始模型参数与所述第一神经网络模型的初始模型参 数相同,且在所述第一神经网络模型的应用过程中,所述第二神经网络模型的模型参数在基于第一训练数据的训练过程中被调整,所述第一训练数据是根据所述多个第一视频帧中的至少部分第一视频帧获得的;
所述预设条件包括:所述第二神经网络模型输出的表情基系数用于驱动所述目标对象时产生的表情损失小于所述第一神经网络模型输出的表情基系数用于驱动所述目标对象时产生的表情损失。
本申请实施例中,在获取与源对象对应的表情基系数后,可以利用第一神经网络模型或第二神经网络模型进行处理,得到与目标对象对应的表情系数。利用与目标对象对应的表情基系数驱动目标对象,可以使目标对象做出与源对象一致的表情,实现面部表情的准确传递。另外,在第一神经网络模型应用于表情迁移过程的同时,可以利用待处理表情数据对第二神经网络模型进一步训练,以不断更新第二神经网络模型的模型参数,提高第二神经网络模型的精度。这样,可以丰富第二神经网络模型的训练样本,拓展第二神经网络模型的适用场景,避免出现极端表情传递不准确的问题。当第二神经网络模型对应的表情损失小于第一神经网络模型对应的表情损失时,可以将第一神经网络模型替换为第二神经网络模型,然后利用第二神经网络模型继续处理,从而得到更匹配的表情基系数使得目标对象所展现的表情与源对象的表情一致,实现源对象表情信息的准确传递与表达。
结合第一方面,在一种可能的实现方式中,所述第一神经网络模型与所述源对象和所述目标对象相关联;所述第二神经网络模型与所述源对象和所述目标对象相关联。
结合第一方面,在一种可能的实现方式中,在所述基于第二神经网络模型和第三表情基系数,获取第四表情基系数之前,所述方法还包括:基于所述第一训练数据和第一损失函数,对所述第二神经网络模型进行训练,其中所述第一损失函数用于梯度回传,以调整所述第二神经网络模型的模型参数。
可选地,第一损失函数可以为L1范数损失函数(也称绝对值损失函数),也可以为L2范数损失函数,还可以为附加正则约束项的L1范数损失函数或附加正则约束项的L2范数损失函数。
结合第一方面,在一种可能的实现方式中,所述第一训练数据包括第一输出结果集合、第二输出结果集合、第一表情识别结果集合和第二表情识别结果集合;其中,所述第一输出结果集合包括:基于所述至少部分第一视频帧中的每个第一视频帧得到的表情基系数经所述第一神经网络模型处理后输出的调整后表情基系数;所述第二输出结果集合包括:基于所述至少部分第一视频帧中的每个第一视频帧得到的表情基系数经所述第二神经网络模型处理后输出的调整后表情基系数;所述第一表情识别结果集合包括:对所述至少部分第一视频帧中的每个第一视频帧进行表情识别的结果;所述第二表情识别结果集合包括:对所述第二输出结果集合中的每个输出结果在驱动所述目标对象时所得到的数字形象帧进行表情识别的结果。
本申请实施例中,第一训练数据既包括源对象相关的表情数据,还包括目标对象的表情数据,在用于第二神经网络模型的训练时,可以建立源对象与目标对象之间的表情约束和关联。由训练好的第二神经网络模型调整表情基系数后,可以使目标对象与源对象所展现的表情一致,实现用户表情信息的准确传递与表达。
结合第一方面,在一种可能的实现方式中,所述第一损失函数用于表征所述第一输出 结果集合与所述第二输出结果集合中对应于同一视频帧的调整后表情基系数之间的差异,以及所述第一表情识别结果集合与所述第二表情识别结果集合中对应于所述同一视频帧的表情识别结果之间的差异。
这样,既可以优化第二神经网络模型的模型参数,也不会使第一神经网络模型与第二神经网络模型偏差较大。
结合第一方面,在一种可能的实现方式中,所述至少部分第一视频帧是从所述多个第一视频帧中采样得到。
这里所涉及的采样,可以是随机采样,也可以是按照一定采样频率采样,例如每隔5个视频帧采样一次等,还可以是连续采样,本申请实施例对此不作限定。
当用于训练第二神经网络模型的第一视频帧的数量小于第一视频帧的总数量时,可以减少训练过程中的计算量,从而减少计算所需资源。
结合第一方面,在一种可能的实现方式中,所述方法还包括:将所述第一神经网络模型的模型参数更新为与所述第二神经网络模型的模型参数一致;在所述第二神经网络模型的应用过程中,在基于第二训练数据的训练过程中调整所述第一神经网络模型的模型参数,其中所述第二训练数据是根据所述多个第二视频帧中的至少部分第二视频帧获得的。
在利用第二神经网络模型对基于第二视频帧得到的第三表情基系数进行处理的过程中,还可以利用第二视频帧中的至少部分第二视频帧对第一神经网络模型进行训练。这样可以丰富训练样本,提高神经网络模型的训练精度。
第二方面,提供了一种表情迁移方法,包括:获取待处理表情数据,所述待处理表情数据包括多个第一视频帧和多个第二视频帧,其中所述第一视频帧和所述第二视频帧包括源对象的面部图像;基于第一神经网络模型和第一表情基系数,获取第二表情基系数,其中所述第一表情基系数基于所述第一视频帧得到且与所述源对象对应;根据所述第二表情基系数驱动目标对象,以将所述第一视频帧中的源对象的表情迁移至所述目标对象,其中所述第二表情基系数与所述目标对象对应;
当满足预设条件时,将所述第一神经网络模型的模型参数更新为与第二神经网络模型的参数一致,得到调整后的第一神经网络模型;基于所述调整后的第一神经网络模型和第三表情基系数,获取第四表情基系数,其中所述第三表情基系数基于所述第二视频帧得到且与所述源对象对应;根据所述第四表情基系数驱动所述目标对象,以将所述第二视频帧中的源对象的表情迁移至所述目标对象,其中所述第四表情基系数与所述目标对象对应;
其中,所述第二神经网络模型的初始模型参数与所述第一神经网络模型的初始模型参数相同,且在所述第一神经网络模型的应用过程中,所述第二神经网络模型的模型参数在基于第一训练数据的训练过程中被调整,所述第一训练数据是根据所述多个第一视频帧中的至少部分第一视频帧获得的;
所述预设条件包括:所述第二神经网络模型输出的表情基系数用于驱动所述目标对象时产生的表情损失小于所述第一神经网络模型输出的表情基系数用于驱动所述目标对象时产生的表情损失。
本申请实施例中,在获取与源对象对应的表情基系数后,可以利用第一神经网络模型进行处理,得到与目标对象对应的表情系数。利用与目标对象对应的表情基系数驱动目标对象,可以使目标对象做出与源对象一致的表情,实现面部表情的准确传递。
另外,在第一神经网络模型应用于表情迁移过程的同时,可以利用待处理表情数据对第二神经网络模型进一步训练,以不断更新第二神经网络模型的模型参数。当第二神经网络模型对应的表情损失小于第一神经网络模型对应的表情损失时,将第一神经网络模型的模型参数更新为与第二神经网络模型一致。这样,可以丰富第一神经网络模型的训练样本,拓展第一神经网络模型的适用场景,避免出现极端表情传递不准确的问题,还能够根据更新后的第一神经网络模型得到更匹配的表情基系数使得目标对象所展现的表情与源对象的表情一致,实现源对象表情信息的准确传递与表达。
结合第二方面,在一种可能的实现方式中,所述第一神经网络模型与所述源对象和所述目标对象相关联;所述第二神经网络模型与所述源对象和所述目标对象相关联。
结合第二方面,在一种可能的实现方式中,在所述将所述第一神经网络模型的模型参数更新为与第二神经网络模型的参数一致之前,所述方法还包括:基于所述第一训练数据和第一损失函数,对所述第二神经网络模型进行训练,其中所述第一损失函数用于梯度回传,以调整所述第二神经网络模型的模型参数。
结合第二方面,在一种可能的实现方式中,所述第一训练数据包括第一输出结果集合、第二输出结果集合、第一表情识别结果集合和第二表情识别结果集合;其中,所述第一输出结果集合包括:基于所述至少部分第一视频帧中的每个第一视频帧得到的表情基系数经所述第一神经网络模型处理后输出的调整后表情基系数;所述第二输出结果集合包括:基于所述至少部分第一视频帧中的每个第一视频帧得到的表情基系数经所述第二神经网络模型处理后输出的调整后表情基系数;所述第一表情识别结果集合包括:对所述至少部分第一视频帧中的每个第一视频帧进行表情识别的结果;所述第二表情识别结果集合包括:对所述第二输出结果集合中的每个输出结果在驱动所述目标对象时所得到的数字形象帧进行表情识别的结果。
结合第二方面,在一种可能的实现方式中,所述第一损失函数用于表征所述第一输出结果集合与所述第二输出结果集合中对应于同一视频帧的调整后表情基系数之间的差异,以及所述第一表情识别结果集合与所述第二表情识别结果集合中对应于所述同一视频帧的表情识别结果之间的差异。
结合第二方面,在一种可能的实现方式中,所述至少部分第一视频帧是从所述多个第一视频帧中采样得到。
结合第二方面,在一种可能的实现方式中,所述方法还包括:在所述调整后的第一神经网络模型的应用过程中,在基于第二训练数据的训练过程中调整所述第二神经网络模型的模型参数,其中所述第二训练数据是根据所述多个第二视频帧中的至少部分第二视频帧获得的。
在利用第一神经网络模型对基于第二视频帧得到的第三表情基系数进行处理的过程中,还可以利用第二视频帧中的至少部分第二视频帧对第二神经网络模型进行训练。这样可以丰富训练样本,提高神经网络模型的训练精度。
第三方面,提供了一种模型训练方法,包括:
获取第一训练帧,所述第一训练帧包括源对象的面部图像;
基于所述第一训练帧,获取与所述源对象的面部图像对应的原始表情基系数和头部姿态参数;
根据所述原始表情基系数,获取第二训练帧,所述第二训练帧包括所述源对象位于正面的面部图像;
根据原始神经网络模型对所述原始表情基系数进行处理,得到调整后的表情基系数;
根据所述调整后的表情基系数和所述头部姿态参数驱动目标对象,获取第三训练帧,所述第三训练帧包括所述目标对象在所述头部姿态参数下的面部图像;
根据所述调整后的表情基系数驱动所述目标对象,获取第四训练帧,所述第四训练帧包括所述目标对象位于正面的面部图像;
根据所述第一训练帧对应的表情识别结果与所述第三训练帧对应的表情识别结果之间的差异和/或所述第二训练帧对应的表情识别结果与所述第四训练帧对应的表情识别结果之间的差异,调整所述原始神经网络模型的参数,以获取目标神经网络模型。
本申请实施例中,在训练神经网络模型的过程中,建立了源对象与目标对象之间的表情约束,从而可以建立源对象与目标对象之间的关联。源对象与目标对象之间的表情约束可以使源对象的表情准确地传递到目标对象上。
结合第三方面,在一种可能的实现方式中,所述第一训练帧对应的表情识别结果与所述第三训练帧对应的表情识别结果之间的差异包括:所述第一训练帧对应的表情识别结果与所述第三训练帧对应的表情识别结果之间的曼哈顿距离或欧氏距离;和/或,所述第二训练帧对应的表情识别结果与所述第四训练帧对应的表情识别结果之间的差异包括:所述第二训练帧对应的表情识别结果与所述第四训练帧对应的表情识别结果之间曼哈顿距离或欧氏距离。
第四方面,提供了一种表情迁移装置,包括用于执行上述第一方面或第二方面中的表情迁移方法的各个单元/模块。
第五方面,提供了一种模型训练装置,包括用于执行上述第三方面中的模型训练方法的各个单元/模块。
第六方面,提供了一种表情迁移装置,该装置包括:存储器,用于存储程序;处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行上述第一方面或第二方面中的表情迁移方法。
第七方面,提供了一种模型训练装置,该装置包括:存储器,用于存储程序;处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行上述第三方面中的模型训练方法。
第八方面,提供了一种电子设备,该电子设备包括上述第四方面或第六方面的表情迁移装置。
在上述第八方面中,电子设备具体可以是移动终端(例如,智能手机),平板电脑,笔记本电脑,增强现实/虚拟现实设备以及车载终端设备等等。
第九方面,提供了一种计算机设备,该计算机设备包括上述第五方面或第七方面中的模型训练装置。
在上述第九方面中,该计算机设备具体可以是服务器或者云端设备等等。
第十方面,提供一种计算机可读存储介质,所述计算机可读介质存储用于设备执行的程序代码,所述程序代码被所述设备执行时,所述设备用于执行第一方面、第二方面或第三方面的方法。
第十一方面,提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述第一方面、第二方面或第三方面中的方法。
第十二方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行上述第一方面、第二方面或者第三方面中的方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行上述第一方面、第二方面或者第三方面中的方法。
上述芯片具体可以是现场可编程门阵列或者专用集成电路。
应理解,本申请中,第一方面的方法具体可以是指第一方面以及第一方面中各种实现方式中的任意一种实现方式中的方法,第二方面的方法具体可以是指第二方面以及第二方面中各种实现方式中的任意一种实现方式中的方法,第三方面的方法具体可以是指第三方面以及第三方面中各种实现方式中的任意一种实现方式中的方法。
附图说明
图1是本申请实施例提供的一种系统架构的结构示意图。
图2是本申请实施例提供的一种卷积神经网络模型的示意图。
图3是本申请实施例提供的一种芯片硬件结构示意图。
图4是本申请实施例提供的一种系统架构的示意图。
图5是本申请实施例适用的场景示意图。
图6是本申请实施例提供的一种表情迁移方法的示意性流程图。
图7是本申请实施例提供的另一种表情迁移方法的示意性流程图。
图8是本申请实施例提供的一种模型训练方法的示意性流程图。
图9是本申请实施例提供的一种模型训练方法的示意性流程图。
图10是本申请实施例提供的一种表情迁移方法的示意性流程图。
图11是本申请实施例提供的一种装置的示意性结构图。
图12是本申请实施例提供的一种装置的硬件结构示意图。
具体实施方式
下面将结合附图,对本申请实施例中的技术方案进行描述。
由于本申请实施例涉及大量神经网络模型的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络模型等相关概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x_s和截距b为输入的运算单元,该运算单元的输出可以为:

$$h_{W,b}(x)=f(W^{T}x)=f\Big(\sum_{s=1}^{n}W_{s}x_{s}+b\Big)$$

其中,s=1、2、……n,n为大于1的自然数,W_s为x_s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
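作为一个纯示意性的补充(并非本申请原始公开内容),下面用一小段Python代码给出上述单个神经单元运算的草图,其中函数名neuron_output、输入维度以及选用sigmoid作为激活函数均为示例性假设:

```python
import numpy as np

def neuron_output(x, w, b):
    """Single neuron: weighted sum of the inputs plus a bias, passed through
    a sigmoid activation, matching the formula above (illustrative only)."""
    z = np.dot(w, x) + b                 # sum_s W_s * x_s + b
    return 1.0 / (1.0 + np.exp(-z))      # activation f, here a sigmoid

# toy usage with three inputs
x = np.array([0.2, -0.5, 1.0])
w = np.array([0.4, 0.3, -0.1])
print(neuron_output(x, w, b=0.05))
```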
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络或多层感知机(multi-layer perceptron,MLP),可以理解为具有多层隐含层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。
虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
$$\vec{y}=\alpha(W\vec{x}+\vec{b})$$

其中,$\vec{x}$是输入向量,$\vec{y}$是输出向量,$\vec{b}$是偏移向量,W是权重矩阵(也称系数),α()是激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。由于DNN层数多,系数W和偏移向量$\vec{b}$的数量也比较多。这些参数在DNN中的定义如下所述:以系数W为例,假设在一个三层的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$,上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。

综上,第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。

需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的过程也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
(3)卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取数据信息的方式与位置无关。卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)循环神经网络
循环神经网络(recurrent neural networks,RNN)主要用来处理序列数据。在传统的神经网络模型中,是从输入层到隐含层再到输出层,层与层之间是全连接的,而对于每一层层内之间的各个节点是无连接的。而在RNN中,网络会对前面的信息进行记忆并应用于当前输出的计算中,即隐含层本层之间的节点不再无连接而是有连接的,并且隐含层的输入不仅包括输入层的输出还包括上一时刻隐含层的输出。RNN旨在让机器像人一样拥有记忆的能力,因此,RNN的输出需要依赖当前的输入信息和历史的记忆信息。
(5)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数)。比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断地调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(6)前向传播算法
前向传播算法,也可以称为正向传播算法,是由前向后进行计算的算法。利用前向传播算法,从输入层开始,一层层向后计算,直到运算到输出层,得到输出结果。前向传播算法通过一层层从前向后的运算,得到输出层结果。
(7)反向传播算法
神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的神经网络模型中参数的大小,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的参数,例如权重矩阵。
如图1所示,本申请实施例提供了一种系统架构100。在图1中,数据采集设备160用于采集训练数据。本申请实施例中,训练数据可以包括人脸视频帧、数字形象视频帧、表情基系数等。
在采集到训练数据之后,数据采集设备160将这些训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。以下实施例将结合附图更详细地描述目标模型/规则101的训练过程,在此暂不详述。
上述目标模型/规则101能够用于实现本申请实施例的表情迁移方法。本申请实施例中的目标模型/规则101具体可以为神经网络。在本申请实施例中,该目标模型/规则101是通过训练原始处理模型得到的。需要说明的是,在实际的应用中,数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他设备获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图1所示的执行设备110,所述执行设备110可以是终端,如手机终端、平板电脑、笔记本电脑、增强现实(augmented reality,AR)终端、虚拟现实(virtual reality,VR)终端、车载终端等,还可以是服务器或者云端等。在图1中,执行设备110配置输入/输出(input/output,I/O)接口112,用于与外部设备进行数据交互,用户可以通过客户设备 140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:客户设备输入的待处理表情数据,如人脸视频帧。
预处理模块113和预处理模块114用于根据I/O接口112接收到的输入数据(如待处理数据)进行预处理,在本申请实施例中,也可以没有预处理模块113和预处理模块114(也可以只有其中的一个预处理模块),而直接采用计算模块111对输入数据进行处理。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在图1所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,图1仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图1中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
如图1所示,根据训练设备120训练得到目标模型/规则101,该目标模型/规则101在本申请实施例中可以是本申请中的神经网络,具体地,本申请实施例使用的神经网络可以为CNN、DNN、深度卷积神经网络(deep convolutional neural networks,DCNN)、循环神经网络(recurrent neural network,RNN)等等。
由于CNN是一种非常常见的神经网络,下面结合图2重点对CNN的结构进行详细的介绍。如上文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的数据作出响应。
如图2所示,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及神经网络层230。其中,输入层210可以获取待处理数据,并将获取到的待处理数据交由卷积层/池化层220以及后面的神经网络层230进行处理,可以得到数据的处理结果。下面对图2中的CNN 200中内部的层结构进行详细的介绍。
卷积层/池化层220:
卷积层:
如图2所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在数据处理中的作用相当于一个从输入数据矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入数据中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图2中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在数据处理过程中,池化层的唯一目的就是减少数据的空间大小。
神经网络层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入数据带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用神经网络层230来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层230中可以包括多层隐含层(如图2所示的231、232至23n)以及输出层240,该多层隐含层中所包含的参数可以根据训练数据进行预先训练得到。
在神经网络层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图2由210至240方向的传播为前向传播)完成,反向传播(如图2由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
可选地,在其他一些实施例中,图2所示的卷积层/池化层220中可以有多个卷积层/池化层并行,该多个并行的卷积层/池化层分别提取特征,并将提取的特征均输入给全神经网络层230进行处理。
需要说明的是,图2所示的卷积神经网络仅作为一种本申请实施例的表情迁移方法的一种可能的卷积神经网络的示例,在具体的应用中,本申请实施例的表情迁移方法所采用的卷积神经网络还可以以其他网络模型的形式存在。
图3为本申请实施例提供的一种芯片的硬件结构,该芯片包括神经网络处理器30。该芯片可以被设置在如图1所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图1所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。如图2所示的卷积神经网络中各层的算法均可在如图3所示的芯片中得以实现。
神经网络处理器NPU 30作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路303,控制器304控制运算电路303提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路303内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路303是二维脉动阵列。运算电路303还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路303是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器302中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器301中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)308中。
向量计算单元307可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元307可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling)、批归一化(batch normalization)、局部响应归一化(local response normalization)等。
在一些实现中,向量计算单元能307将经处理的输出的向量存储到统一缓存器306。例如,向量计算单元307可以将非线性函数应用到运算电路303的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元307生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路303的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器306用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器(direct memory access controller,DMAC)305将外部存储器中的输入数据搬运到输入存储器301和/或统一存储器306、将外部存储器中的权重数据存入权重存储器302,以及将统一存储器306中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)310,用于通过总线实现主CPU、DMAC和取指存储器309之间进行交互。
与控制器304连接的取指存储器(instruction fetch buffer)309,用于存储控制器304使用的指令。
控制器304,用于调用取指存储器309中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器306、输入存储器301、权重存储器302以及取指存储器309均 为片上(on-chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
其中,图2所示的卷积神经网络中各层的运算可以由运算电路303或向量计算单元307执行。
上文中介绍的图1中的执行设备110能够执行本申请实施例的表情迁移方法的各个步骤,图2所示的CNN模型和图3所示的芯片也可以用于执行本申请实施例的表情迁移方法的各个步骤。下面结合附图对本申请实施例的模型训练方法和本申请实施例的表情迁移方法进行详细的介绍。
如图4所示,本申请实施例提供了一种系统架构400。该系统架构包括本地设备401、本地设备402以及执行设备110和数据存储系统150,其中,本地设备401和本地设备402通过通信网络与执行设备110连接。
执行设备110可以由一个或多个服务器实现。可选地,执行设备110可以与其它计算设备配合使用,例如:数据存储器、路由器、负载均衡器等设备。执行设备110可以布置在一个物理站点上,或者分布在多个物理站点上。执行设备110可以使用数据存储系统150中的数据,或者调用数据存储系统150中的程序代码来实现本申请实施例的表情迁移方法。
用户可以操作各自的用户设备(例如本地设备401和本地设备402)与执行设备110进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与执行设备110进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。
在一种实现方式中,本地设备401、本地设备402从执行设备110获取到目标神经网络的相关参数,将目标神经网络部署在本地设备401、本地设备402上,利用该目标神经网络得到用于驱动数字形象的表情基系数等等。
在另一种实现方式中,执行设备110上可以直接部署目标神经网络,执行设备110通过从本地设备401或本地设备402获取待处理表情数据,并根据目标神经网络对待处理表情数据进行处理得到用于驱动数字形象的表情基系数,并将该表情基系数传送给本地设备401或本地设备402。
上述执行设备110也可以为云端设备,此时,执行设备110可以部署在云端;或者,上述执行设备110也可以为终端设备,此时,执行设备110可以部署在用户终端侧,本申请实施例对此并不限定。
目前,神经网络模型广泛应用于图像、视频、语音等多个领域,展现出超越传统方法的能力。本申请实施例提供的神经网络模型主要用于将源对象的表情迁移到目标对象的过程。本申请实施例中,源对象可以是人或其他动物,这里的人和其他动物可以是真实的,也可以是虚拟形象。源对象也可以为没有生命的物体的虚拟形象。目标对象可以是人、动物、植物或没有生命的物体等的虚拟形象。示例性的,本申请实施例提供的神经网络模型 可以应用于将人脸表情迁移到数字形象的过程。
表情迁移,是指使用数字形象模拟人物表情、表现人物情感的方法。这里数字形象也可以称为虚拟形象或虚拟角色,其为表情动画的载体、表情迁移的目标。数字形象可以为虚拟人物(如虚拟主播)或拟人动物(如猴子、小狗、长颈鹿)等。
人脸的表情可以分解为多个局部的肌肉动作,多个局部的肌肉动作可以组合形成各种人脸表情。因此,表情可以由多个局部动作表情叠加表示,其中每个局部动作表情可以称为一个表情基,用于表达表情的一组基础表情称为表情基组,一组表情基的线性组合权重称为表情基系数。
例如表情F可以表示:

$$F=\bar{F}+\sum_{i=1}^{n}\alpha_{i}A_{i}\qquad(1)$$

其中,$\bar{F}$为中性表情(或称中性脸、平均脸,即没有表情的脸);A_1、A_2、A_3···A_n为表情基,如分别表示张嘴、嘴角上扬、眯眼等等;α_1、α_2、α_3···α_n为表情基系数。

每个人都有自己的面部身份特征,例如高鼻梁、大眼睛等,因此一个表情的表达方式中还可以考虑用户的面部身份特征,如表情F可以表示为:

$$F=\bar{F}+\beta_{id}B_{id}+\alpha_{exp}A_{exp}\qquad(2)$$

其中,$\bar{F}$为中性表情(或称中性脸、平均脸);B_id为面部身份基组,例如包括高鼻梁、大眼睛、鹅蛋脸等等;β_id为身份系数;A_exp为表情基组;α_exp为表情基系数。这里,B_id、β_id、A_exp、α_exp可以为矩阵或多维向量形式。

面部身份特征不同的人,做同一个表情时可能存在差异,例如大眼睛的人和小眼睛的人在大笑时,眼睛眯合的程度有所不同。因此对通用表情基(或称标准表情基、预设表情基)进行微调使其包含用户的面部身份特征,可以得到用户特定的表情基,本申请实施例中也可以称为个性化表情基。这样,表情F还可以表示为:

$$F=\bar{F}+\beta_{id}B_{id}+\alpha_{exp}(A_{exp}+\Delta A_{exp})\qquad(3)$$

其中,A_exp为一组标准表情基,即标准表情基组;ΔA_exp为微调表情基的分量;A_exp+ΔA_exp为一组个性化表情基,即个性化表情基组。公式(3)中的其余参数与公式(2)中对应参数的含义相同,不再赘述。
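作为对公式(2)和公式(3)的一个示意性说明(并非本申请原始公开内容),下面的Python/NumPy草图展示了如何由中性脸、身份基组与表情基组(含个性化微调分量ΔA_exp)组合得到一张人脸;其中函数名compose_face、各数组的形状与变量名均为示例性假设:

```python
import numpy as np

def compose_face(neutral, B_id, beta_id, A_exp, alpha_exp, dA_exp=None):
    """Blendshape composition following formulas (2)/(3).

    neutral : (V, 3) neutral ("average") face vertices
    B_id    : (K_id, V, 3) identity bases,   beta_id  : (K_id,)
    A_exp   : (K_exp, V, 3) expression bases, alpha_exp: (K_exp,)
    dA_exp  : optional (K_exp, V, 3) per-user corrective component, giving
              the personalized bases A_exp + dA_exp of formula (3).
    """
    bases = A_exp if dA_exp is None else A_exp + dA_exp
    return (neutral
            + np.tensordot(beta_id, B_id, axes=1)      # identity deformation
            + np.tensordot(alpha_exp, bases, axes=1))  # expression deformation

# toy usage: 4 vertices, 2 identity bases, 3 expression bases
V = 4
face = compose_face(
    neutral=np.zeros((V, 3)),
    B_id=np.random.randn(2, V, 3), beta_id=np.array([0.5, -0.2]),
    A_exp=np.random.randn(3, V, 3), alpha_exp=np.array([0.1, 0.0, 0.7]),
)
print(face.shape)  # (4, 3)
```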
需要说明的是,本申请以下实施例中所提及的“表情基”,可以指标准表情基,也可以指个性化表情基,本申请实施例对此不作限定。不同角色(如人和长颈鹿、男主播和女主播)可以分别对应一组标准表情基。可以理解的是,不论是人的表情还是数字形象的表情,其可以用对应的标准表情基组表达,也可以用对应的个性化表情基组表达,本申请实施例对此不作限定。
面部表情捕捉系统的出现,实现了人类面部表情的捕捉与迁移。具体地,面部表情捕捉系统通过对用户面部关键点的运动进行捕获,可以实时模拟出人脸表情,并可以将人脸表情赋予到数字形象上。示例性的,如图5所示的表情迁移场景中,用户可以使用电子设备上的应用程序识别用户501自己的表情,并将用户501的表情(如微笑)迁移到数字形象502上,使数字形象502实时跟随人脸做出活灵活现的表情。
一种表情迁移方式为:先通过重建人脸模型获得人脸表情的表情基系数,再将该表情基系数与数字形象的表情基组组合,从而驱动数字形象做出与人脸表情相同的表情。然而,由于人脸模型的表情基与数字形象的表情基之间通常会存在差异,直接将人脸表情的表情 基系数迁移应用到数字形象中并不能确保用户表情信息的准确传递,例如对用户悲伤的表情传递不准确等,从而导致表情迁移的效率不高,影响用户在使用应用程序进行表情迁移时的体验。
鉴于此,本申请实施例提供了一种模型训练方法和表情迁移方法,能够提高表情传递的准确性,提高表情迁移的效率,从而提高用户在表情传递时的体验。具体地,可以利用模型训练方法训练好的神经网络模型,对人脸表情的表情基系数进行处理,得到调整后的表情基系数,并利用调整后的表情基系数驱动数字形象,以实现表情的迁移或传递。其中,神经网络模型在训练过程中建立了用户与数字形象之间的表情约束和关联,使得经神经网络模型处理得到的表情基系数在驱动该数字形象时,数字形象可以做出与人脸表情一致的表情,实现面部表情的准确传递。另外,在神经网络模型应用于表情迁移过程的同时,可以利用待处理表情数据对神经网络模型进一步训练,以不断更新神经网络模型的模型参数,提高神经网络模型的精度。
下面结合附图对本申请实施例提供的模型训练方法和表情迁移方法进行详细介绍。
图6示出了本申请实施例提供的一种表情迁移方法的示意性流程图。图6所示的表情方法600可以由图1中的执行设备110执行,或者由执行设备110和训练设备120联合执行。参考图6,该方法600可以包括步骤S610-S650。
S610,获取待处理表情数据,该待处理表情数据包括多个第一视频帧和多个第二视频帧,其中该第一视频帧和该第二视频帧包括源对象的面部图像。
可选地,该待处理表情数据可以为视频数据(即视频流)。相应地,该多个第一视频帧和该多个第二视频帧可以是从该视频数据中一帧帧提取的。该多个第一视频帧可以属于同一视频流,也可以属于不同视频流。该多个第二视频帧可以属于同一视频流,也可以属于不同视频流。第一视频帧和第二视频帧可以属于同一视频流,也可以属于不同视频流。本申请实施例对此均不作限定。
可选地,该待处理表情数据可以为图片集合。相应地,该多个第一视频帧和该多个第二视频帧即为该图片集合中的一张张图片。
可选地,该待处理表情数据可以是利用单个摄像头拍摄得到的,也可以是利用多个摄像头拍摄得到的。摄像头可以为RGB摄像头、深度相机、红外摄像头等。摄像头可以为手机摄像头、相机镜头、电脑摄像头、监控摄像头等。
本申请实施例中,该多个第一视频帧中的每个第一视频帧和该多个第二视频帧的每个第二视频帧包括源对象的面部图像。也就是说,该待处理表情数据对应于同一对象。
可选地,该待处理表情数据可以是一次性获取的,也可以是分多次获取的。当分多次获取待处理表情数据时,本申请实施例对多个第一视频帧和多个第二视频帧的获取顺序和获取时间不作限定,只要在处理第一视频帧之前获取第一视频帧、在处理第二视频帧之前获取第二视频帧即可。因此,本申请实施例中所涉及的步骤序号仅用于区分不同的步骤,并不对该步骤的执行顺序造成限定。
S620,基于第一神经网络模型和第一表情基系数,获取第二表情基系数,其中第一表情基系数基于第一视频帧得到且与源对象对应。
本申请实施例中,第一神经网络模型的输入和输出均为表情基系数,其输入为与源对象对应的表情基系数,即第一表情基系数;其输出为与目标对象对应的表情基系数,即第 二表情基系数。换言之,第一神经网络模型用于将与源对象对应的表情基系数进行处理,得到与目标对象对应的表情基系数。
本申请实施例中,根据第一视频帧获取第一表情基系数的实现方式有很多种,在此不作具体限定。例如,可以将第一视频帧输入预先训练好的神经网络模型中,该神经网络模型可以直接输出第一表情基系数。又如,可以对第一视频帧进行一系列处理操作,如人脸定位、关键点检测、三维人脸重建、姿态估计、表情识别等,而得到第一表情基系数,其中每种处理操作还可以有多种实现方式。
可选地,在申请实施例中,可以利用标准表情基组或源对象的个性化表情基组获取第一表情基系数。相应地,第一表情基系数与标准表情基组或源对象的个性化表情基组组合表示的表情与源对象在第一视频帧中所呈现的表情一致。
这里,第一神经网络模型为预先训练好的,其训练过程可以采用以下实施例中所描述的模型训练方法,下文将结合附图进行说明,在此暂不详述。
S630,根据第二表情基系数驱动目标对象,以将第一视频帧中的源对象的表情迁移至目标对象,其中第二表情基系数与目标对象对应。
这里,根据第二表情基系数驱动目标对象,即利用第二表情基系数与目标对象的表情基组组合,使目标对象展现与第一视频帧中的源对象一致的表情。
应理解,步骤S620和S630所涉及的第一视频帧为所获取的多个第一视频帧中的任意一个视频帧。因此,对于多个第一视频帧中的每一帧,均执行步骤S620和S630。相应地,根据每个第一视频帧都可以得到与之对应的第一表情基系数。
S640,当满足预设条件时,基于第二神经网络模型和第三表情基系数,获取第四表情基系数,其中第三表情基系数基于第二视频帧得到且与源对象对应。
本申请实施例中,第二神经网络模型的输入和输出均为表情基系数,其输入为与源对象对应的表情基系数,即第三表情基系数;其输出为与目标对象对应的表情基系数,即第四表情基系数。换言之,第二神经网络模型用于将与源对象对应的表情基系数进行处理,得到与目标对象对应的表情基系数。
本申请实施例中,第二神经网络模型与第一神经网络模型的作用相同,区别在于第一神经网络模型用于处理与第一视频帧中的源对象对应的表情基系数,第二神经网络模型用于处理与第二视频帧中的源对象对应的表情基系数。这里,根据第二视频帧获取第三表情基系数的方式可以与根据第一视频帧获取第一表情基系数的方式相同。
可选地,在申请实施例中,可以利用标准表情基组或源对象的个性化表情基组获取第三表情基系数。相应地,第三表情基系数与标准表情基组或源对象的个性化表情基组组合表示的表情与源对象在第二视频帧中所呈现的表情一致。
本申请实施例中,第二神经网络模型的初始模型参数与第一神经网络模型的初始模型参数相同。可以认为,在执行方法600之前,第一神经网络模型与第二神经网络模型相同。
S650,根据第四表情基系数驱动目标对象,以将第二视频帧中的源对象的表情迁移至目标对象,其中第四表情基系数与目标对象对应。
这里,根据第四表情基系数驱动目标对象,即利用第四表情基系数与目标对象的表情基组组合,使目标对象展现与第二视频帧中的源对象一致的表情。
应理解,步骤S640和S650所涉及的第二视频帧为所获取的多个第二视频帧中的任意 一个视频帧。因此,对于多个第二视频帧中的每一帧,均执行步骤S640和S650。相应地,根据每个第二视频帧都可以得到与之对应的第三表情基系数。
本申请实施例中,第一神经网络模型与源对象和目标对象相关联。第二神经网络模型与源对象和目标对象相关联。对于不同的源对象和目标对象的组合,第一神经网络模型不同,第二神经网络模型不同。
本申请实施例中,在第一神经网络模型的应用过程中,第二神经网络模型的模型参数在基于第一训练数据的训练过程中被调整,第一训练数据是根据多个第一视频帧中的至少部分第一视频帧获得的。换言之,在利用第一神经网络模型对基于第一视频帧得到的第一表情基系数进行处理的过程中,还可以利用第一视频帧中的至少部分第一视频帧对第二神经网络模型进行训练。这样,当满足预设条件时,可以将第一神经网络模型替换为第二神经网络模型,然后利用第二神经网络模型继续处理。为方便区分,本申请实施例中将第二神经网络模型继续处理的相关视频帧称为第二视频帧。
可选地,在一些实施例中,该至少部分第一视频帧是从多个第一视频帧中采样得到。这里所涉及的采样,可以是随机采样,也可以是按照一定采样频率采样,例如每隔5个视频帧采样一次等,还可以是连续采样,本申请实施例对此不作限定。
当用于训练第二神经网络模型的第一视频帧的数量小于第一视频帧的总数量时,可以减少训练过程中的计算量,从而减少计算所需资源。
本申请实施例中,第一训练数据所包括的数据类型不同,基于多个第一视频帧中的至少部分第一视频帧获得第一训练数据的过程不同。例如,该第一训练数据可以包括该至少部分第一视频帧,可以包括根据第一视频帧得到的源对象的表情基系数,可以包括源对象表情基系数经第二神经网络模型处理后输出的表情基系数,可以包括源对象表情基系数经第一神经网络模型处理后输出的表情基系数,还可以包括前述几种数据类型的组合,等等。在实际应用中,关于如何获取第一训练数据,可以根据所构建的训练优化函数确定。
可选地,预设条件可以包括:第二神经网络模型输出的表情基系数用于驱动目标对象时产生的表情损失小于第一神经网络模型输出的表情基系数用于驱动目标对象时产生的表情损失。
可以理解的是,在比较第二神经网络模型对应的表情损失与第一神经网络模型对应的表情损失时,第一神经网络模型与第二神经网络模型的输入应相同,所涉及的待迁移表情的视频帧相同。也就是说,针对某一第一视频帧而言,将该第一视频帧对应的第一表情基系数分别输入到第一神经网络模型和第二神经网络模型中,可以分别得到第一神经网络模型输出的表情基系数和第二神经网络模型输出的表情基系数。利用第一神经网络模型输出的表情基系数驱动目标对象,可以得到目标对象的第一面部图像;利用第二神经网络模型输出的表情基系数驱动目标对象,可以得到目标对象的第二面部图像。第一神经网络模型输出的表情基系数用于驱动目标对象时产生的表情损失,可以根据第一面部图像与该第一视频帧之间的表情差异确定。第二神经网络模型输出的表情基系数用于驱动目标对象时产生的表情损失,可以根据第二面部图像与该第一视频帧之间的表情差异确定。
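下面给出上述比较过程的一个非权威的代码草图,仅用于说明“分别驱动目标对象并比较表情损失”的思路;其中drive_target(由调整后表情基系数渲染数字形象)与recognize_expression(输出表情识别向量)均为假设的辅助函数,采用L2距离也仅为示例:

```python
import numpy as np

def expression_loss(model, source_coeffs, source_frame,
                    drive_target, recognize_expression):
    """Expression loss produced when `model`'s output drives the target object:
    distance, in expression space, between the expression recognized on the
    source frame and the one recognized on the rendered avatar frame."""
    adjusted = model(source_coeffs)           # adjusted coefficients from this model
    avatar_frame = drive_target(adjusted)     # rendered digital-avatar frame
    e_src = recognize_expression(source_frame)
    e_avt = recognize_expression(avatar_frame)
    return float(np.linalg.norm(e_src - e_avt))

def should_switch(model_1, model_2, source_coeffs, source_frame,
                  drive_target, recognize_expression):
    """Preset condition: switch when the second model's loss is smaller."""
    loss_1 = expression_loss(model_1, source_coeffs, source_frame,
                             drive_target, recognize_expression)
    loss_2 = expression_loss(model_2, source_coeffs, source_frame,
                             drive_target, recognize_expression)
    return loss_2 < loss_1
```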
当第二神经网络模型对应的表情损失小于第一神经网络模型对应的表情损失时,可以认为第二神经网络模型的精度更高。将同一表情基系数分别输入到第一神经网络模型和第二神经网络模型中时,第二神经网络模型输出的表情基系数与目标对象更匹配,其用于驱 动目标对象时,目标对象做出的表情与视频帧中源对象所做出的表情更一致。
本申请实施例中,在获取与源对象对应的表情基系数后,可以利用第一神经网络模型或第二神经网络模型进行处理,得到与目标对象对应的表情系数。利用与目标对象对应的表情基系数驱动目标对象,可以使目标对象做出与源对象一致的表情,实现面部表情的准确传递。另外,在第一神经网络模型应用于表情迁移过程的同时,可以利用待处理表情数据对第二神经网络模型进一步训练,以不断更新第二神经网络模型的模型参数,提高第二神经网络模型的精度。这样,可以丰富第二神经网络模型的训练样本,拓展第二神经网络模型的适用场景,避免出现极端表情传递不准确的问题。当第二神经网络模型对应的表情损失小于第一神经网络模型对应的表情损失时,可以将第一神经网络模型替换为第二神经网络模型,然后利用第二神经网络模型继续处理,从而得到更匹配的表情基系数使得目标对象所展现的表情与源对象的表情一致,实现源对象表情信息的准确传递与表达。
本申请实施例中,与源对象对应的表情基系数经过第一神经网络模型或第二神经网络模型处理后,可以降低表情基系数与身份系数的相关性,或将表情基系数与源对象的身份系数解耦合,减小了源对象的身份系数在表情基系数驱动目标对象过程中的干扰。举例说明,若视频帧中的源对象鼓腮,则在进行人脸重建时,可能会建立一个比较瘦但处于鼓腮的模型,还可能建立一个比较胖但没有鼓腮的模型,根据不同的人脸重建模型得到的表情基系数有不同的结果。直接使用与源对象对应的表情基系数驱动目标对象,可能会得到相差较大的表情。这里,第一神经网络模型或第二神经网络模型可以对表情基系数进行约束,使得表情基系数与身份系数的相关性减小。
可选地,在步骤S640之前,方法600还包括:基于第一训练数据和第一损失函数,对第二神经网络模型进行训练,其中第一损失函数用于梯度回传,以调整第二神经网络模型的模型参数。
也就是说,通过调整第二神经网络模型的模型参数,优化第一损失函数的值,使得第一损失函数收敛至预设值,或者使第一损失函数的损失值变化率小于预设阈值,从而完成第二神经网络模型的训练。这里所涉及的“梯度”是指第二神经网络模型中模型参数大小的梯度。该模型参数大小的梯度通过计算第一损失函数对相应参数的偏导得到。
可选地,第一训练数据包括用于代入第一损失函数中计算损失值的数据。
可选地,第一损失函数可以为L1范数损失函数(也称绝对值损失函数),也可以为L2范数损失函数,还可以为附加正则约束项的L1范数损失函数或附加正则约束项的L2范数损失函数。
可选地,在一些实施例中,第一训练数据包括第一输出结果集合、第二输出结果集合、第一表情识别结果集合和第二表情识别结果集合。
第一输出结果集合包括:基于至少部分第一视频帧中的每个第一视频帧得到的表情基系数经第一神经网络模型处理后输出的调整后表情基系数。换言之,至少部分第一视频帧中的每个第一视频帧都可以得到对应的表情基系数,将这些对应的表情基系数分别输入到第一神经网络模型后,可以得到对应的调整后表情基系数。第一输出结果集合即包括这些由第一神经网络模型输出的调整后表情基系数。
第二输出结果集合包括:基于至少部分第一视频帧中的每个第一视频帧得到的表情基系数经第二神经网络模型处理后输出的调整后表情基系数。换言之,至少部分第一视频帧 中的每个第一视频帧都可以得到对应的表情基系数,将这些对应的表情基系数分别输入到第二神经网络模型后,可以得到对应的调整后表情基系数。第二输出结果集合即包括这些由第二神经网络模型输出的调整后表情基系数。
第一表情识别结果集合包括:对至少部分第一视频帧中的每个第一视频帧进行表情识别的结果。
第二表情识别结果集合包括:对第二输出结果集合中的每个输出结果在驱动目标对象时所得到的数字形象帧进行表情识别的结果。也就是说,用第二神经网络模型输出的调整后表情基系数驱动目标对象,可以得到包含目标对象表情的数字形象帧。第二表情识别结果集合即包括对目标对象的表情进行识别的结果。
本申请实施例中,第一训练数据既包括源对象相关的表情数据,还包括目标对象的表情数据,在用于第二神经网络模型的训练时,可以建立源对象与目标对象之间的表情约束和关联。由训练好的第二神经网络模型调整表情基系数后,可以使目标对象与源对象所展现的表情一致,实现用户表情信息的准确传递与表达。另外,基于表情约束还可以为用户提供表情的放大功能。
可选地,在一些实施例中,第一损失函数用于表征第一输出结果集合与第二输出结果集合中对应于同一视频帧的调整后表情基系数之间的差异,以及第一表情识别结果集合与第二表情识别结果集合中对应于同一视频帧的表情识别结果之间的差异。
也就是说,在计算第一损失函数的输出值时,可以根据第一神经网络模型和第二神经网络模型针对同一视频帧的输出结果之间的差异,以及第二神经网络模型的输出驱动目标对象所做出的表情与源对象的表情之间的差异确定。这样,既可以优化第二神经网络模型的模型参数,也不会使第一神经网络模型与第二神经网络模型偏差较大。
可选地,在一些实施例中,方法600还包括:将第一神经网络模型的模型参数更新为与第二神经网络模型的模型参数一致;在第二神经网络模型的应用过程中,在基于第二训练数据的训练过程中调整第一神经网络模型的模型参数,其中第二训练数据是根据多个第二视频帧中的至少部分第二视频帧获得的。
也就是说,在利用第二神经网络模型对基于第二视频帧得到的第三表情基系数进行处理的过程中,还可以利用第二视频帧中的至少部分第二视频帧对第一神经网络模型进行训练。这样可以丰富训练样本,提高神经网络模型的训练精度。
在该实施例中,对第一神经网络模型进行训练的过程与前述对第二神经网络模型进行训练的过程类似,只是训练数据有所不同。具体可参考前文描述,为简洁,在此不再赘述。
图7示出了本申请实施例提供的另一种表情迁移方法的示意性流程图。参考图7,该方法700可以包括步骤S710-S760。
S710,获取待处理表情数据,待处理表情数据包括多个第一视频帧和多个第二视频帧,其中第一视频帧和第二视频帧包括源对象的面部图像。
步骤S710与方法600中的步骤S610相同,具体可参考关于S610的相关描述。
S720,基于第一神经网络模型和第一表情基系数,获取第二表情基系数,其中第一表情基系数基于第一视频帧得到且与源对象对应。
步骤S720与方法600中的步骤S620相同,具体可参考关于S620的相关描述。
S730,根据第二表情基系数驱动目标对象,以将第一视频帧中的源对象的表情迁移至 目标对象,其中第二表情基系数与目标对象对应。
步骤S730与方法600中的步骤S630相同,具体可参考关于S630的相关描述。
S740,当满足预设条件时,将第一神经网络模型的模型参数更新为与第二神经网络模型的参数一致,得到调整后的第一神经网络模型。
该预设条件包括:第二神经网络模型输出的表情基系数用于驱动目标对象时产生的表情损失小于第一神经网络模型输出的表情基系数用于驱动目标对象时产生的表情损失。
S750,基于调整后的第一神经网络模型和第三表情基系数,获取第四表情基系数,其中第三表情基系数基于第二视频帧得到且与源对象对应。
步骤S750与方法600中的步骤S640类似,区别在于:S640中替换使用第二神经网络模型,S750继续使用调整后的第一神经网络模型。但实际上,调整后的第一神经网络模型与第二神经网络模型是相同的。具体内容可参考关于S640的相关描述。
S760,根据第四表情基系数驱动目标对象,以将第二视频帧中的源对象的表情迁移至目标对象,其中第四表情基系数与目标对象对应。
步骤S760与方法600中的步骤S650相同,具体可参考关于S650的相关描述。
可选地,在一些实施例中,方法700还包括:在调整后的第一神经网络模型的应用过程中,在基于第二训练数据的训练过程中调整第二神经网络模型的模型参数,其中第二训练数据是根据多个第二视频帧中的至少部分第二视频帧获得的。
在利用第一神经网络模型对基于第二视频帧得到的第三表情基系数进行处理的过程中,还可以利用第二视频帧中的至少部分第二视频帧对第二神经网络模型进行训练。这样可以丰富训练样本,提高神经网络模型的训练精度。
需要说明的是,适用于方法600的其他实施例,同样可以结合方法700使用,相关描述可参考关于图6的说明,为简洁,在此不再赘述。
综上,根据方法600和方法700的描述可知,方法600中是交替使用第一神经网络模型和第二神经网络模型对待处理表情数据的表情基系数进行处理,方法600中始终使用第一神经网络模型对待处理表情数据的表情基系数进行处理,但其模型参数可以根据第二神经网络模型的训练结果进行更新。
图8示出了本申请实施例提供的一种模型训练方法的示意性流程图。图8所示的方法800可以由图1中的训练设备120执行。参考图8,该方法800可以包括步骤S810-S870。
S810,获取第一训练帧,第一训练帧包括源对象的面部图像。
该第一训练帧为视频帧或图片。
S820,基于第一训练帧,获取与源对象的面部图像对应的原始表情基系数和头部姿态参数。
实现该步骤的方式可以有多种,本申请实施例对此不作限定。例如,将第一训练帧输入神经网络模型中,由神经网络模型直接输出原始表情基系数和头部姿态参数。再如,可以对第一训练帧进行一系列处理操作,如人脸定位、关键点检测、三维人脸重建、姿态估计、表情识别等,而得到原始表情基系数和头部姿态参数,其中每种处理操作还可以有多种实现方式。
S830,根据原始表情基系数,获取第二训练帧,第二训练帧包括源对象位于正面的面部图像。
可选地,可以将原始表情基系数与源对象的表情基组组合,得到源对象的正面图像。这里所涉及的正面图像可以认为是不带头部姿态参数的图像。
S840,根据原始神经网络模型对原始表情基系数进行处理,得到调整后的表情基系数。
S850,根据调整后的表情基系数和头部姿态参数驱动目标对象,获取第三训练帧,第三训练帧包括目标对象在头部姿态参数下的面部图像。
可选地,可以将调整后的表情基系数与目标对象的表情基组组合,并带入头部姿态参数,可以得到目标对象在头部姿态参数下的面部图像。这里,目标对象的头部旋转、平移程度与源对象的头部旋转、平移程度一致。
S860,根据调整后的表情基系数驱动目标对象,获取第四训练帧,第四训练帧包括目标对象位于正面的面部图像。
可选地,可以将调整后的表情基系数与目标对象的表情基组组合,得到目标对象的正面图像。
S870,根据第一训练帧对应的表情识别结果与第三训练帧对应的表情识别结果之间的差异和/或第二训练帧对应的表情识别结果与第四训练帧对应的表情识别结果之间的差异,调整原始神经网络模型的参数,以获取目标神经网络模型。
本申请实施例中,在训练神经网络模型的过程中,建立了源对象与目标对象之间的表情约束,从而可以建立源对象与目标对象之间的关联。源对象与目标对象之间的表情约束可以使源对象的表情准确的传递到目标对象上,例如将人脸表情传递到数字形象上。不同数字形象之间的表情约束可以建立数字形象之间的关联,实现人脸表情的间接传递。
可选地,在一些实施例中,第一训练帧对应的表情识别结果与第三训练帧对应的表情识别结果之间的差异包括:第一训练帧对应的表情识别结果与第三训练帧对应的表情识别结果之间的曼哈顿距离或欧氏距离。
可选地,在一些实施例中,第二训练帧对应的表情识别结果与第四训练帧对应的表情识别结果之间的差异包括:第二训练帧对应的表情识别结果与第四训练帧对应的表情识别结果之间曼哈顿距离或欧氏距离。
下面结合更为具体的例子对本申请实施例提供的模型训练方法和表情迁移方法进行介绍。可以理解的是,以下实施例是以将人脸表情迁移到数字形象上的场景为例进行描述,即源对象为人,目标对象为数字形象。但对于源对象和目标对象为其他形式的组合,本申请实施例同样适用。
图9示出了本申请实施例提供的一种模型训练方法的示意性流程图。图9所示的模型训练方法900可以是方法800的一个具体的例子。该方法900具体可以包括如下步骤。
S901,获取人脸视频帧{f_0, f_1, f_2, ..., f_n}。
人脸视频帧指的是带有人脸图像的视频数据中的视频帧,也可以称为视频图像。其中视频帧中的人脸图像是同一用户的。实际上,视频数据中的每个视频帧可以认为是一张图片,因此可选地,在该步骤中也可以获取多张带有人脸图像的图片。也就是说,在该步骤S901中,只要获取到包含面部表情的二维图像(即面部图像样本)即可。
可选地,带有人脸图像的视频数据可以是利用单个摄像头获取的单视角视频数据,也可以是利用多个摄像头在多个角度拍摄得到的多视角视频数据,本申请实施例对此不作限定。但应理解,在该带有人脸图像的视频数据中,应尽可能包括较为丰富的面部表情动作。
可以理解的是,人脸视频帧{f 0,f 1,f 2,...,f n}中的任意一帧可以认为是前述方法800中的第一训练帧的一个具体示例,其中n为正整数。
S902,对于人脸视频帧{f_0, f_1, f_2, ..., f_n}中的第一训练帧f_i(1≤i≤n,i为正整数),获取该第一训练帧f_i中的人脸图像对应的原始表情基系数W={w_0, w_1, w_2, ..., w_k}和头部姿态参数。
原始表情基系数W是一个与第一表情基组中表情基数量匹配的多维系数向量,如为k维系数向量,其中k为正整数。这里第一表情基组指的是与原始表情基系数W组合以表达人脸表情的表情基组。例如,该第一表情基组可以是预设的样本表情基组,也可以是包含面部身份特征的个性化表情基组。
本申请实施例中,对于获取原始表情基系数W的方式不作任何限定,这里可以采用现有的任意一种获取表情基系数的方式。
作为一个示例,可以基于深度学习的方式获取原始表情基系数W。例如,可以将第一训练帧f i输入到预先训练好的神经网络模型中,该神经网络模型的输出即为原始表情基系数W。应理解,此处涉及的神经网络模型是根据人脸图像输出表情基系数的模型。
作为另一个示例,可以采用人脸定位技术、人脸关键点检测技术、三维人脸重建技术、人脸姿态估计技术、人脸表情识别技术等获取原始表情基系数W。其中每种技术还可以有多种实现方案,这些方案均可适用于此。下面仅对各个技术进行简要说明。
人脸定位技术,是指通过人脸检测技术,确定人脸在图片中的位置。
人脸关键点检测技术,是指在人脸图像上定位出一些具有特殊语义信息的点,例如眼睛、鼻尖、嘴唇等。
三维人脸重建技术,是指生成人脸三维模型的技术。例如,可以通过人脸视频帧实现人脸的三维重建。重建出的人脸模型应与视频帧中的人脸具有相同的拓扑结构,即重建出的人脸模型投影到2D图像中的关键点与视频帧中的人脸关键点依然相互对应。人脸模型重建的精度影响着后续表情基系数求解的准确性。
人脸姿态估计技术,是指实现人脸在世界坐标系下的变换估计的技术,主要是要估计出当前图片中的人脸所对应的头部姿态参数。一般来说,三维模型到二维图像的投影过程会经历旋转和平移变换,相应地,本申请实施例中,头部姿态参数包括头部的旋转角度和头部的平移量。头部的旋转角度可以旋转矩阵R表示,头部的平移量可以平移向量T表示。在人脸姿态估计技术中,变换估计的准确性决定着所驱动的数字形象的头部姿态的准确性。
人脸表情识别技术,是指对输入的人脸图像中的面部表情进行分类,识别出该面部表情属于哪种表情。
在三维人脸重建的过程中,重建的人脸模型可以采用类似上述公式(2)的公式,即:
$$F_{人脸}=\bar{F}_{1}+\beta_{id1}B_{id1}+\alpha_{exp1}A_{exp1}\qquad(4)$$

这里,F_人脸为重建的人脸模型;$\bar{F}_{1}$为平均的人脸模型,一般为预设的,可以认为是不带表情的中性脸;B_id1为面部身份特征,一般为主成分分析(principal component analysis,PCA)的身份基组;β_id1为身份系数;A_exp1为表情基组;α_exp1为表情基系数。这里,B_id1、β_id1、A_exp1、α_exp1可以为矩阵或多维向量形式。

也就是说,在平均的人脸模型$\bar{F}_{1}$的基础上,加上β_id1·B_id1可以恢复出不带表情的用户三维人脸脸型,再加上α_exp1·A_exp1可以恢复出带有表情的用户三维人脸,从而实现人脸的重建。公式(4)中的B_id1、β_id1和α_exp1可在三维人脸重建的过程中或之前求解出来。
从公式(4)可以看出,人脸重建的准确度依赖于身份信息和表情信息,身份信息的准确与否直接影响着人脸表情基系数(即原始表情基系数W)的准确性。因此,在一些实施例中,可以约束同一个用户在不同表情下的身份系数来提高身份信息的准确性。具体来说,同一个用户在不同表情下的身份系数应当具有一致性,即应为相同的,因而同一个用户在不同视频帧中的身份系数相互交换,应不影响所生成的人脸模型。
作为示例而非限定,根据第一视频帧可以重建出带有表情1的人脸模型,具体可以得到表情基系数1和身份系数1。根据第二视频帧可以重建出带有表情2的人脸模型,具体可以得到表情基系数2和身份系数2。将身份系数1和2调换,由表情基系数1和身份系数2组合得到的表情应与表情1一致,由表情基系数2和身份系数1组合得到的表情应与表情2一致。因此,可以通过不同身份系数和同一表情基系数组合来重建用户人脸,可以提高人脸重建的准确度。
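作为上述身份系数一致性约束思路的一个示意性草图(并非本申请限定的实现方式),下面将“交换身份系数后表情应保持不变”写成一个附加损失项;其中reconstruct(由表情基系数与身份系数重建人脸)与recognize_expression均为假设的占位函数:

```python
import numpy as np

def identity_swap_loss(reconstruct, recognize_expression,
                       alpha_1, beta_1, alpha_2, beta_2):
    """Identity coefficients of the same user should be interchangeable:
    swapping beta_1 and beta_2 must not change the recognized expression."""
    e_11 = recognize_expression(reconstruct(alpha_1, beta_1))
    e_12 = recognize_expression(reconstruct(alpha_1, beta_2))  # identity swapped
    e_22 = recognize_expression(reconstruct(alpha_2, beta_2))
    e_21 = recognize_expression(reconstruct(alpha_2, beta_1))  # identity swapped
    return float(np.linalg.norm(e_11 - e_12) + np.linalg.norm(e_22 - e_21))
```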
总而言之,在步骤S902中,只要能够根据人脸图像获取到原始表情基系数W和头部姿态参数即可,具体的获取方式本申请实施例并不限定。
S903,将原始表情基系数W输入到神经网络模型MLP中。
S904,神经网络模型MLP对原始表情基系数W={w_0, w_1, w_2, ..., w_k}进行处理,输出调整后的表情基系数W′={w′_0, w′_1, w′_2, ..., w′_k}。
在该步骤中,神经网络模型MLP对原始表情基系数W进行微调。
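本申请并未逐层限定该神经网络模型MLP的具体结构,下面的PyTorch草图仅给出一种可能的形态作为示意;其中隐藏层宽度、层数、激活函数以及系数维度52均为示例性假设,并非对本申请的限定:

```python
import torch
import torch.nn as nn

class CoeffAdjuster(nn.Module):
    """MLP mapping source expression coefficients W = (w_0, ..., w_k) to
    adjusted coefficients W' of the same dimension (illustrative sizes)."""

    def __init__(self, num_coeffs: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_coeffs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_coeffs),
            nn.Sigmoid(),   # keep coefficients in [0, 1]; an assumption, not required by the text
        )

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        return self.net(w)

mlp = CoeffAdjuster(num_coeffs=52)   # 52 is an illustrative coefficient count
w = torch.rand(1, 52)                # one frame's source coefficients W
w_adj = mlp(w)                       # adjusted coefficients W'
print(w_adj.shape)                   # torch.Size([1, 52])
```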
S905,根据步骤S904中得到的调整后的表情基系数W′和步骤S902中得到的头部姿态参数驱动数字形象,通过对数字形象进行可微分渲染得到与第一训练帧f i对应的数字形象帧f i′,该数字形象帧f i′为前述第三训练帧的一个具体示例。
在该步骤中,驱动数字形象即将调整后的表情基系数W′与数字形象的表情基组合,使数字形象做出相应表情的过程。为与第一表情基组区分,本申请实施例中将数字形象对应的表情基组称为第二表情基组。
本申请实施例中,通过多次执行步骤S905,可以得到人脸视频帧{f 0,f 1,f 2,...,f n}中的每一视频帧所对应的数字形象帧,因此与人脸视频帧{f 0,f 1,f 2,...,f n}对应的全部数字形象帧可以表示为{f 0′,f 1′,f 2′,...,f n′}。
S906,根据步骤S904中得到的调整后的表情基系数W′生成对应的正面数字形象F i′。
正面数字形象即数字形象的正脸。在该步骤中,可以利用调整后的表情基系数W′和数字形象的模型生成数字形象的正脸。在一些实施例中,正面数字形象也可以理解为利用调整后的表情基系数W′驱动数字形象所得到的不带头部姿态的数字形象帧。该正面数字形象F i′可以理解为是前述第四训练帧的一个具体示例。
本申请实施例中,通过多次执行步骤S906,可以得到人脸视频帧{f 0,f 1,f 2,...,f n}中的每一视频帧所对应的正面数字形象,因此与人脸视频帧{f 0,f 1,f 2,...,f n}对应的全部正面数字形象可以表示为{F 0′,F 1′,F 2′,...,F n′}。
S907,根据步骤S902中得到的原始表情基系数W生成对应的正面人脸F i
正面人脸可以理解为恢复出的不带头部姿态参数的人脸。在该步骤中,可以利用原始表情基系数W和恢复出来的人脸模型生成正面人脸。正面人脸F i可以理解为是前述第二 训练帧的一个具体示例。
本申请实施例中,通过多次执行步骤S907,可以得到人脸视频帧{f 0,f 1,f 2,...,f n}中的每一视频帧所对应的正面人脸,因此与人脸视频帧{f 0,f 1,f 2,...,f n}对应的全部正面人脸可以表示为{F 0,F 1,F 2,...,F n}。
S908,计算损失函数$L_1$,并进行梯度回传,以调整神经网络模型MLP的模型参数。

本申请实施例中的损失函数$L_1$用于表征人脸视频帧与数字形象图像之间的表情差异,其优化目的在于使得输入视频帧的面部表情和所驱动的数字形象的面部表情一致。这里所涉及的“梯度”是指神经网络模型MLP中模型参数大小的梯度。该模型参数大小的梯度可以通过计算损失函数$L_1$对相应参数的偏导得到。

可选地,在一些实施例中,损失函数$L_1$可以为:

$$L_1=\|exp(f_i)-exp(f'_i)\|_2+\|exp(F_i)-exp(F'_i)\|_2\qquad(5)$$

其中,exp(f_i)为根据人脸视频帧中的第一训练帧f_i识别出来的表情;exp(f′_i)为根据数字形象帧中的对应帧f′_i(即第三训练帧)识别出来的表情;exp(F_i)为根据正面人脸F_i(即第二训练帧)识别出来的表情;exp(F′_i)为根据正面数字形象F′_i(即第四训练帧)识别出来的表情。可选地,exp(f_i)、exp(f′_i)、exp(F_i)、exp(F′_i)可以多维向量或矩阵表示。

$\|exp(f_i)-exp(f'_i)\|_2$为人脸视频帧中的面部表情与数字形象帧中的面部表情在表情空间上的距离,该距离越小,说明所驱动的数字形象的表情与人脸视频帧中的表情差异越小,越趋于一致。

$\|exp(F_i)-exp(F'_i)\|_2$为人脸视频帧中的面部在正面的表情与数字形象帧中的面部在正面的表情在表情空间上的距离,该距离越小,说明所驱动的数字形象的正脸表情与人的正脸表情差异越小,越趋于一致。

也就是说,优化公式(5)的输出值,目的在于使人脸视频帧的人脸和数字形象在正脸及实际所展现的姿态(如侧脸)下的表情均一致。因此,损失函数$L_1$的输出值越小越好。

在一些实施例中,可以通过神经网络模型识别表情,例如将人脸视频帧中的第一训练帧f_i输入到预先训练好的神经网络模型中,该神经网络模型对第一训练帧f_i进行处理,识别出人脸表情,并输出相应的向量表示。

上述公式(5)中利用L2范数建立了人脸视频帧和数字形象帧之间的表情损失约束。

在一些实施例中,损失函数$L_1$还可以为:

$$L_1=\|exp(f_i)-exp(f'_i)\|_2\qquad(6)$$

在另一些实施例中,损失函数$L_1$还可以为:

$$L_1=\|exp(F_i)-exp(F'_i)\|_2\qquad(7)$$

公式(6)中包括公式(5)中的前半部分,用于表征人脸视频帧的人脸和数字形象在实际所展现的姿态下的表情差异。

公式(7)中包括公式(5)中的后半部分,用于表征人脸视频帧的人脸和数字形象的正脸表情差异。

若采用公式(6)中所示的损失函数$L_1$,则方法900中可以省略步骤S906和步骤S907。若采用公式(7)中所示的损失函数$L_1$,则方法900可以省略步骤S905。

应理解,在其他一些实施例中,也可以通过其他方式构建损失函数$L_1$,例如利用L1范数构建,本申请实施例对此不作限定。可以理解的是,L1范数为向量的1范数,表示向量中非零元素的绝对值之和,如$\|X\|_1=\sum_i|x_i|$;L2范数为向量的2范数,表示向量的元素的平方和再开平方,如$\|X\|_2=\sqrt{\sum_i x_i^2}$。
在步骤S908之后,根据损失函数$L_1$的结果可以对神经网络模型MLP的模型参数进行调整。至此,完成一次对神经网络模型MLP的训练过程。对于第一训练帧f_i后的下一帧,同样执行步骤S901-S908,只不过后一视频帧所使用的神经网络模型的模型参数相比前一帧所使用的神经网络模型的模型参数已经调整。

通过不断调整神经网络模型MLP的模型参数,使损失函数$L_1$不断收敛。当损失函数$L_1$的结果小于预设阈值或者不再减小时,可以认为神经网络模型MLP训练完成。
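下面给出步骤S903至S908一次训练迭代的示意性草图(采用公式(5)的损失),仅为在若干假设下的一种可能写法:其中渲染函数render_avatar、render_avatar_frontal与表情识别函数recognize_expression均为假设的占位实现(并假定整个渲染与识别过程可微),优化器设置亦为示例:

```python
import torch

def train_step(mlp, optimizer, w_raw, pose,
               render_avatar, render_avatar_frontal,
               frontal_face, source_frame, recognize_expression):
    """One iteration of S903-S908 using the loss of formula (5); all helper
    callables are assumed placeholders and assumed to be differentiable."""
    w_adj = mlp(w_raw)                                 # S904: adjusted coefficients W'
    avatar_posed = render_avatar(w_adj, pose)          # S905: third training frame f'_i
    avatar_front = render_avatar_frontal(w_adj)        # S906: fourth training frame F'_i

    e_src = recognize_expression(source_frame)         # expression of f_i
    e_src_front = recognize_expression(frontal_face)   # expression of F_i (from S907)
    e_avt = recognize_expression(avatar_posed)         # expression of f'_i
    e_avt_front = recognize_expression(avatar_front)   # expression of F'_i

    loss = torch.norm(e_src - e_avt) + torch.norm(e_src_front - e_avt_front)  # formula (5)
    optimizer.zero_grad()
    loss.backward()                                    # S908: gradient back-propagation
    optimizer.step()
    return loss.item()
```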
需要说明的是,为方便理解,图9所示的方法中是以神经网络模型MLP为例进行介绍的,在其他一些实施例中,图9中的神经网络模型MLP可以替换为前述任意一种神经网络,例如卷积神经网络、循环神经网络、深度神经网络等等。
还需要说明的是,为方便理解,图9所示的方法中是以人脸表情迁移到数字形象为例进行的说明,所使用的训练数据包括人脸视频帧和数字形象帧。在其他一些实施例中,人脸视频帧还可以替换为其他数字形象的图像,这样训练出来的神经网络模型用于在数字形象之间传递表情。
本申请实施例中,通过构建人脸视频帧与数字形象的表情损失函数L 1来训练神经网络模型,训练好的神经网络模型与人脸视频帧中的用户以及所驱动的数字形象相关联。
图10示出了本申请实施例提供的一种表情迁移方法的示意性流程图。图10所示的方法1000可以是方法600或方法700的一个具体的例子。图10所示的方法1000可以包括如下步骤。
S1001,获取人脸视频帧{g_0, g_1, g_2, ..., g_t}。
该步骤中所获取的人脸视频帧为待处理的视频帧,包括待处理表情数据,其中t为正整数。步骤S1001中的人脸视频帧中的人脸与步骤S901中的人脸视频帧中的人脸应属于同一用户。
S1002,对于人脸视频帧{g_0, g_1, g_2, ..., g_t}中的任意帧g_i(1≤i≤t,i为正整数),获取该视频帧g_i中的人脸图像对应的第一表情基系数M={m_0, m_1, m_2, ..., m_k}。
这里,第一表情基系数M为k维系数向量,k为正整数。本申请实施例中,获取第一表情基系数M的方式可以与方法900中获取原始表情基系数W的方式相同,具体可参考上文相关描述,在此不再赘述。
S1003,将第一表情基系数M输入到第一MLP中。
这里,第一MLP为经方法900训练好的神经网络模型MLP。
S1004,第一MLP对第一表情基系数M={m_0, m_1, m_2, ..., m_k}进行处理,输出第二表情基系数M′={m′_0, m′_1, m′_2, ..., m′_k}。
S1005,根据步骤S1004中得到的第二表情基系数M′驱动数字形象。通过对数字形象进行可微分渲染可以得到与人脸视频帧g i对应的数字形象帧g′ i
在该步骤中,数字形象帧g′ i中数字形象的表情与人脸视频帧g i中的人脸表情一致。步骤S1005所驱动的数字形象与方法900中的涉及的数字形象为同一数字形象。
本申请实施例中,通过多次执行步骤S1005,可以得到人脸视频帧{g 0,g 1,g 2,...,g t}中的每一视频帧所对应的数字形象帧,因此与人脸视频帧{g 0,g 1,g 2,...,g t}对应的全部 数字形象帧可以表示为{g′ 0,g′ 1,g′ 2,...,g′ t}。
S1006,在对人脸视频帧{g 0,g 1,g 2,...,g t}中的每一帧执行步骤S1002的过程中,对步骤S1002中得到的第一表情基系数进行数据采样,并将采样得到的第一表情基系数输入到第二MLP中。例如,将任意帧g i对应的第一表情基系数M输入到第二MLP中。
当然,在其他一些实施例中,也可以对人脸视频帧{g 0,g 1,g 2,...,g t}进行采样,再根据采样得到的人脸视频帧确定对应的表情基系数后,将其输入到第二MLP中。
本申请实施例中,数据采样的频率可以为每隔P帧采样一次,这里P可以为1、2、4、7或者其他大于1的正整数。当然,也可以随机采样或连续采样,本申请实施例并不限定。
可选地,第二MLP也为经方法900训练好的神经网络模型MLP,其中第二MLP的初始模型参数与第一MLP的初始模型参数相同。
S1007,第二MLP对第一表情基系数M={m_0, m_1, m_2, ..., m_k}进行处理,输出第五表情基系数M″={m″_0, m″_1, m″_2, ..., m″_k}。
S1008,根据步骤S1007中得到的第五表情基系数M″可以得到与人脸视频帧对应的数字形象帧g″ i
本申请实施例中,通过多次执行步骤S1008,可以得到每一个采样视频帧所对应的数字形象帧,因此与采样视频帧对应的全部数字形象帧可以表示为{g″ 0,g″ 1,g″ 2,...,g″ t}。
可以理解的是,若第一MLP与第二MLP的模型参数相同,则将第一表情基系数M分别输入到第一MLP和第二MLP后,所得到的第二表情基系数M′和第五表情基系数M″相同。相应地,第二表情基系数M′与数字形象的表情基组合所呈现的表情应同第五表情基系数M″与数字形象的表情基组合所呈现的表情相同。
S1009,计算第一损失函数$L_2$,并进行梯度回传,以调整第二MLP的模型参数。

本申请实施例中,第一损失函数$L_2$用于表征采样视频帧与数字形象图像之间的表情差异,其优化目的在于使得采样视频帧的面部表情和所驱动的数字形象的面部表情一致。

可选地,在一些实施例中,第一损失函数$L_2$可以为:

$$L_2=\|exp(g_i)-exp(g''_i)\|_2+\lambda\|M''-M'\|_2\qquad(8)$$

其中,exp(g_i)为根据采样视频帧g_i识别出来的表情;exp(g″_i)为根据对应的数字形象帧g″_i识别出来的表情;λ为超参数,即人为设置的参数,而不是通过训练得到的参数。可选地,exp(g_i)、exp(g″_i)可以多维向量或矩阵表示。

$\|exp(g_i)-exp(g''_i)\|_2$为采样视频帧中的面部表情与数字形象帧中的面部表情在表情空间上的距离,该距离越小,数字形象的表情与采样视频帧中的表情差异越小,越趋于一致。

$\|M''-M'\|_2$为第五表情基系数M″与第二表情基系数M′之间的距离,该距离越小,第五表情基系数M″与第二表情基系数M′的差异越小。通过约束$\|M''-M'\|_2$,可以防止模型为了迎合训练集而过于复杂造成过拟合的情况,可以提高模型的泛化能力。

通过优化公式(8),可以使采样视频帧的人脸和数字形象的表情一致,并且约束第一MLP与第二MLP之间的差距不要过大。因此,第一损失函数$L_2$越小越好。上述公式(8)中利用L2范数建立了采样视频帧和数字形象帧之间的表情损失约束。
应理解,在其他一些实施例中,也可以通过其他方式构建第一损失函数L 2,例如利用L1范数构建,本申请实施例对此不作限定。
在步骤S1006之后,根据第一损失函数$L_2$的结果可以对第二MLP的模型参数进行调整。步骤S1006-S1009可以认为是完成了一次对第二MLP的训练过程。对于采样视频帧g_i后的下一采样视频帧,同样执行步骤S1006-S1009,只不过后一采样视频帧所使用的第二MLP的模型参数相比前一采样视频帧所使用的第二MLP的模型参数已经发生变化。
当经由第二MLP处理得到的第五表情基系数在驱动数字形象时,其表情损失小于经由第一MLP处理得到的第二表情基系数在驱动数字形象的表情损失时,可以对第一MLP的模型参数进行更新。例如,可以将第一MLP的模型参数替换为第二MLP的模型参数,即将第二MLP的模型参数赋予到第一MLP上。又如,也可以直接第一MLP替换为训练好的第二MLP,直接使用第二MLP进行表情基系数的微调,以驱动数字形象,同时将第一MLP的模型参数更新,以对第一MLP进行如步骤S1006-S1009所示的训练过程。
因此,在实际使用中,可以在训练好的第一MLP上实时输出表情基系数,同时实时采样数据并调整第二MLP。当第二MLP在采样数据集中的表情损失小于第一MLP时,使用第二MLP替换第一MLP。此时的第二MLP继续实时输出表情基系数,第一MLP继续实时训练。第二MLP的训练过程与第一MLP的训练过程异步,利用待处理表情数据进行训练,可以丰富训练样本,避免出现极端表情未能准确传递的情况。
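上述“一个MLP实时输出、另一个MLP异步训练、训练侧损失更小时进行替换”的在线流程,大致可以用下面的草图表示;其中采样周期sample_every、损失比较方式以及get_coeffs、drive_avatar、train_on_sample、expression_loss等辅助函数均为示例性假设:

```python
import copy

def online_transfer(frames, get_coeffs, mlp_serving, mlp_training,
                    drive_avatar, train_on_sample, expression_loss,
                    sample_every=5):
    """Serve expression transfer with mlp_serving while mlp_training is updated
    on sampled frames; when the training model's expression loss on the sampled
    data becomes smaller, its parameters take over (illustrative sketch)."""
    samples = []
    for i, frame in enumerate(frames):
        m = get_coeffs(frame)                  # source coefficients M (S1002)
        drive_avatar(mlp_serving(m))           # real-time output path (S1003-S1005)

        if i % sample_every == 0:              # data sampling (S1006)
            samples.append((frame, m))
            train_on_sample(mlp_training, mlp_serving, frame, m)   # one L_2 step (S1007-S1009)

            loss_serving = sum(expression_loss(mlp_serving, f, c) for f, c in samples)
            loss_training = sum(expression_loss(mlp_training, f, c) for f, c in samples)
            if loss_training < loss_serving:   # preset switching condition
                # the better model starts serving; a copy of it keeps training
                mlp_serving = mlp_training
                mlp_training = copy.deepcopy(mlp_training)
    return mlp_serving
```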
本申请实施例提供的表情迁移方法可以在服务器端应用,还可以在用户终端上应用。用户终端可以为智能手机、平板电脑、笔记本电脑、台式计算机等。服务器端可以由具有处理功能的计算机来实现,本申请实施例不限于此。
上文结合图1至图10详细的描述了本申请实施例的方法实施例,下面结合图11至图12,详细描述本申请实施例的装置实施例。应理解,方法实施例的描述与装置实施例的描述相互对应,因此,未详细描述的部分可以参见前面方法实施例。
图11示出了本申请实施例提供的一种装置的示意性结构图。图11所示的装置1100包括获取单元1101和处理单元1102。
若装置1100用于执行申请实施例提供的表情迁移方法,该装置1100可以位于图1所示的执行设备110或其他设备中。该装置1100可用于执行图6或图7所示的表情迁移方法,还可用于执行图10所示的实施例。
获取单元1101可以用于执行方法600中的步骤S610、S620、S640,或者执行方法700中的步骤S710、S720、S750。
处理单元1102可以用于执行方法600中的步骤S630、S650,或者执行方法700中的步骤S730、S740、S760。
若装置1100用于执行申请实施例提供的模型训练方法,该装置1100可以位于图1所示的训练设备120或其他设备中。该装置1100可用于执行图8所示的模型训练方法,还可用于执行图9所示的实施例。
获取单元1101可以用于执行方法800中的步骤S810-S860。
处理单元1102可以用于执行方法800中的步骤S870。
图12示出了本申请实施例提供的一种装置的硬件结构示意图。图12所示的装置1200包括存储器1201、处理器1202、通信接口1203以及总线1204。其中,存储器1201、处理器1202、通信接口1203通过总线1204实现彼此之间的通信连接。
存储器1201可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器1201可以存储程序, 当存储器1201中存储的程序被处理器1202执行时,处理器1202和通信接口1203用于执行本申请实施例的模型训练方法的各个步骤。
处理器1202可以采用通用的中央处理器(central processing unit,CPU),微处理器,应用专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的模型训练装置中的单元所需执行的功能,或者执行本申请实施例的模型训练方法。
处理器1202还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的模型训练方法的各个步骤可以通过处理器1202中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1202还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1201,处理器1202读取存储器1201中的信息,结合其硬件完成本申请实施例的模型训练装置中包括的单元所需执行的功能,或者执行本申请实施例的模型训练方法。
通信接口1203使用例如但不限于收发器一类的收发装置,来实现装置1200与其他设备或通信网络之间的通信。
总线1204可包括在装置1200各个部件(例如,存储器1201、处理器1202、通信接口1203)之间传送信息的通路。
本申请实施例还提供了一种表情迁移装置的硬件结构示意图,其结构与图12的用于模型训练的装置的结构相同,因此仍参考图12对表情迁移装置进行描述。
装置1200包括存储器1201、处理器1202、通信接口1203以及总线1204。其中,存储器1201、处理器1202、通信接口1203通过总线1204实现彼此之间的通信连接。
存储器1201可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器1201可以存储程序,当存储器1201中存储的程序被处理器1202执行时,处理器1202和通信接口1203用于执行本申请实施例的表情迁移方法的各个步骤。
处理器1202可以采用通用的中央处理器(central processing unit,CPU),微处理器,应用专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的表情迁移装置中的单元所需执行的功能,或者执行本申请实施例的表情迁移方法。
处理器1202还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的表情迁移方法的各个步骤可以通过处理器1202中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1202还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实 现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1201,处理器1202读取存储器1201中的信息,结合其硬件完成本申请实施例的表情迁移装置中包括的单元所需执行的功能,或者执行本申请实施例的表情迁移方法。
通信接口1203使用例如但不限于收发器一类的收发装置,来实现装置1200与其他设备或通信网络之间的通信。
总线1204可包括在装置1200各个部件(例如,存储器1201、处理器1202、通信接口1203)之间传送信息的通路。
本申请实施例还提供一种计算机程序存储介质,该计算机程序存储介质具有程序指令,当该程序指令被直接或者间接执行时,使得前文中的模型训练方法或表情迁移方法得以实现。
本申请实施例还提供一种芯片系统,该芯片系统包括至少一个处理器,当程序指令在该至少一个处理器中执行时,使得前文中的模型训练方法或表情迁移方法得以实现。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而 前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (20)

  1. 一种表情迁移方法,其特征在于,包括:
    获取待处理表情数据,所述待处理表情数据包括多个第一视频帧和多个第二视频帧,其中所述第一视频帧和所述第二视频帧包括源对象的面部图像;
    基于第一神经网络模型和第一表情基系数,获取第二表情基系数,其中所述第一表情基系数基于所述第一视频帧得到且与所述源对象对应;
    根据所述第二表情基系数驱动目标对象,以将所述第一视频帧中的源对象的表情迁移至所述目标对象,其中所述第二表情基系数与所述目标对象对应;
    当满足预设条件时,
    基于第二神经网络模型和第三表情基系数,获取第四表情基系数,其中所述第三表情基系数基于所述第二视频帧得到且与所述源对象对应;
    根据所述第四表情基系数驱动所述目标对象,以将所述第二视频帧中的源对象的表情迁移至所述目标对象,其中所述第四表情基系数与所述目标对象对应;
    其中,所述第二神经网络模型的初始模型参数与所述第一神经网络模型的初始模型参数相同,且在所述第一神经网络模型的应用过程中,所述第二神经网络模型的模型参数在基于第一训练数据的训练过程中被调整,所述第一训练数据是根据所述多个第一视频帧中的至少部分第一视频帧获得的;
    所述预设条件包括:
    所述第二神经网络模型输出的表情基系数用于驱动所述目标对象时产生的表情损失小于所述第一神经网络模型输出的表情基系数用于驱动所述目标对象时产生的表情损失。
  2. 根据权利要求1所述的方法,其特征在于,
    所述第一神经网络模型与所述源对象和所述目标对象相关联;
    所述第二神经网络模型与所述源对象和所述目标对象相关联。
  3. 根据权利要求1或2所述的方法,其特征在于,在所述基于第二神经网络模型和第三表情基系数,获取第四表情基系数之前,所述方法还包括:
    基于所述第一训练数据和第一损失函数,对所述第二神经网络模型进行训练,其中所述第一损失函数用于梯度回传,以调整所述第二神经网络模型的模型参数。
  4. 根据权利要求3所述的方法,其特征在于,所述第一训练数据包括第一输出结果集合、第二输出结果集合、第一表情识别结果集合和第二表情识别结果集合;其中,
    所述第一输出结果集合包括:基于所述至少部分第一视频帧中的每个第一视频帧得到的表情基系数经所述第一神经网络模型处理后输出的调整后表情基系数;
    所述第二输出结果集合包括:基于所述至少部分第一视频帧中的每个第一视频帧得到的表情基系数经所述第二神经网络模型处理后输出的调整后表情基系数;
    所述第一表情识别结果集合包括:对所述至少部分第一视频帧中的每个第一视频帧进行表情识别的结果;
    所述第二表情识别结果集合包括:对所述第二输出结果集合中的每个输出结果在驱动所述目标对象时所得到的数字形象帧进行表情识别的结果。
  5. 根据权利要求4所述的方法,其特征在于,所述第一损失函数用于表征所述第一输出结果集合与所述第二输出结果集合中对应于同一视频帧的调整后表情基系数之间的差异,以及所述第一表情识别结果集合与所述第二表情识别结果集合中对应于所述同一视频帧的表情识别结果之间的差异。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,所述至少部分第一视频帧是从所述多个第一视频帧中采样得到。
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述方法还包括:
    将所述第一神经网络模型的模型参数更新为与所述第二神经网络模型的模型参数一致;
    在所述第二神经网络模型的应用过程中,在基于第二训练数据的训练过程中调整所述第一神经网络模型的模型参数,其中所述第二训练数据是根据所述多个第二视频帧中的至少部分第二视频帧获得的。
  8. 一种表情迁移方法,其特征在于,包括:
    获取待处理表情数据,所述待处理表情数据包括多个第一视频帧和多个第二视频帧,其中所述第一视频帧和所述第二视频帧包括源对象的面部图像;
    基于第一神经网络模型和第一表情基系数,获取第二表情基系数,其中所述第一表情基系数基于所述第一视频帧得到且与所述源对象对应;
    根据所述第二表情基系数驱动目标对象,以将所述第一视频帧中的源对象的表情迁移至所述目标对象,其中所述第二表情基系数与所述目标对象对应;
    当满足预设条件时,将所述第一神经网络模型的模型参数更新为与第二神经网络模型的参数一致,得到调整后的第一神经网络模型;
    基于所述调整后的第一神经网络模型和第三表情基系数,获取第四表情基系数,其中所述第三表情基系数基于所述第二视频帧得到且与所述源对象对应;
    根据所述第四表情基系数驱动所述目标对象,以将所述第二视频帧中的源对象的表情迁移至所述目标对象,其中所述第四表情基系数与所述目标对象对应;
    其中,所述第二神经网络模型的初始模型参数与所述第一神经网络模型的初始模型参数相同,且在所述第一神经网络模型的应用过程中,所述第二神经网络模型的模型参数在基于第一训练数据的训练过程中被调整,所述第一训练数据是根据所述多个第一视频帧中的至少部分第一视频帧获得的;
    所述预设条件包括:
    所述第二神经网络模型输出的表情基系数用于驱动所述目标对象时产生的表情损失小于所述第一神经网络模型输出的表情基系数用于驱动所述目标对象时产生的表情损失。
  9. 根据权利要求8所述的方法,其特征在于,
    所述第一神经网络模型与所述源对象和所述目标对象相关联;
    所述第二神经网络模型与所述源对象和所述目标对象相关联。
  10. 根据权利要求8或9所述的方法,其特征在于,在所述将所述第一神经网络模型的模型参数更新为与第二神经网络模型的参数一致之前,所述方法还包括:
    基于所述第一训练数据和第一损失函数,对所述第二神经网络模型进行训练,其中所述第一损失函数用于梯度回传,以调整所述第二神经网络模型的模型参数。
  11. 根据权利要求10所述的方法,其特征在于,所述第一训练数据包括第一输出结果集合、第二输出结果集合、第一表情识别结果集合和第二表情识别结果集合;其中,
    所述第一输出结果集合包括:基于所述至少部分第一视频帧中的每个第一视频帧得到的表情基系数经所述第一神经网络模型处理后输出的调整后表情基系数;
    所述第二输出结果集合包括:基于所述至少部分第一视频帧中的每个第一视频帧得到的表情基系数经所述第二神经网络模型处理后输出的调整后表情基系数;
    所述第一表情识别结果集合包括:对所述至少部分第一视频帧中的每个第一视频帧进行表情识别的结果;
    所述第二表情识别结果集合包括:对所述第二输出结果集合中的每个输出结果在驱动所述目标对象时所得到的数字形象帧进行表情识别的结果。
  12. 根据权利要求11所述的方法,其特征在于,所述第一损失函数用于表征所述第一输出结果集合与所述第二输出结果集合中对应于同一视频帧的调整后表情基系数之间的差异,以及所述第一表情识别结果集合与所述第二表情识别结果集合中对应于所述同一视频帧的表情识别结果之间的差异。
  13. 根据权利要求8至12中任一项所述的方法,其特征在于,所述至少部分第一视频帧是从所述多个第一视频帧中采样得到。
  14. 根据权利要求8至13中任一项所述的方法,其特征在于,所述方法还包括:
    在所述调整后的第一神经网络模型的应用过程中,在基于第二训练数据的训练过程中调整所述第二神经网络模型的模型参数,其中所述第二训练数据是根据所述多个第二视频帧中的至少部分第二视频帧获得的。
  15. 一种模型训练方法,其特征在于,包括:
    获取第一训练帧,所述第一训练帧包括源对象的面部图像;
    基于所述第一训练帧,获取与所述源对象的面部图像对应的原始表情基系数和头部姿态参数;
    根据所述原始表情基系数,获取第二训练帧,所述第二训练帧包括所述源对象位于正面的面部图像;
    根据原始神经网络模型对所述原始表情基系数进行处理,得到调整后的表情基系数;
    根据所述调整后的表情基系数和所述头部姿态参数驱动目标对象,获取第三训练帧,所述第三训练帧包括所述目标对象在所述头部姿态参数下的面部图像;
    根据所述调整后的表情基系数驱动所述目标对象,获取第四训练帧,所述第四训练帧包括所述目标对象位于正面的面部图像;
    根据所述第一训练帧对应的表情识别结果与所述第三训练帧对应的表情识别结果之间的差异和/或所述第二训练帧对应的表情识别结果与所述第四训练帧对应的表情识别结果之间的差异,调整所述原始神经网络模型的参数,以获取目标神经网络模型。
  16. 根据权利要求15所述的方法,其特征在于,
    所述第一训练帧对应的表情识别结果与所述第三训练帧对应的表情识别结果之间的差异包括:
    所述第一训练帧对应的表情识别结果与所述第三训练帧对应的表情识别结果之间的曼哈顿距离或欧氏距离;
    和/或,
    所述第二训练帧对应的表情识别结果与所述第四训练帧对应的表情识别结果之间的差异包括:
    所述第二训练帧对应的表情识别结果与所述第四训练帧对应的表情识别结果之间曼哈顿距离或欧氏距离。
  17. 一种表情迁移装置,其特征在于,包括:
    存储器,用于存储程序;
    处理器,用于执行所述存储器存储的程序,当所述处理器执行所述存储器存储的程序时,所述处理器用于执行权利要求1至14中任一项所述的方法。
  18. 一种模型训练装置,其特征在于,包括:
    存储器,用于存储程序;
    处理器,用于执行所述存储器存储的程序,当所述处理器执行所述存储器存储的程序时,所述处理器用于执行权利要求15或16所述的方法。
  19. 一种计算机可读存储介质,其特征在于,所述计算机可读介质存储用于设备执行的程序代码,所述程序代码被所述设备执行时,所述设备执行如权利要求1至14中任一项所述的表情迁移方法或者如权利要求15或16所述的模型训练方法。
  20. 一种芯片,其特征在于,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,以执行如权利要求1至14中任一项所述的表情迁移方法或者如权利要求15或16所述的模型训练方法。
PCT/CN2022/143944 2022-01-28 2022-12-30 表情迁移方法、模型训练方法和装置 WO2023142886A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210105861.0A CN116563450A (zh) 2022-01-28 2022-01-28 表情迁移方法、模型训练方法和装置
CN202210105861.0 2022-01-28

Publications (1)

Publication Number Publication Date
WO2023142886A1 true WO2023142886A1 (zh) 2023-08-03

Family

ID=87470509

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/143944 WO2023142886A1 (zh) 2022-01-28 2022-12-30 表情迁移方法、模型训练方法和装置

Country Status (2)

Country Link
CN (1) CN116563450A (zh)
WO (1) WO2023142886A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757923A (zh) * 2023-08-16 2023-09-15 腾讯科技(深圳)有限公司 一种图像生成方法、装置、电子设备及存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229268A (zh) * 2016-12-31 2018-06-29 商汤集团有限公司 表情识别及卷积神经网络模型训练方法、装置和电子设备
US20190392202A1 (en) * 2018-10-30 2019-12-26 Baidu Online Network Technology (Beijing) Co., Ltd. Expression recognition method, apparatus, electronic device, and storage medium
CN111783620A (zh) * 2020-06-29 2020-10-16 北京百度网讯科技有限公司 表情识别方法、装置、设备及存储介质
WO2021034463A1 (en) * 2019-08-19 2021-02-25 Neon Evolution Inc. Methods and systems for image and voice processing
US20210104086A1 (en) * 2018-06-14 2021-04-08 Intel Corporation 3d facial capture and modification using image and temporal tracking neural networks
US20210133434A1 (en) * 2019-01-10 2021-05-06 Boe Technology Group Co., Ltd. Computer-implemented method of recognizing facial expression, apparatus for recognizing facial expression, method of pre-training apparatus for recognizing facial expression, computer-program product for recognizing facial expression
CN113570029A (zh) * 2020-04-29 2021-10-29 华为技术有限公司 获取神经网络模型的方法、图像处理方法及装置

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229268A (zh) * 2016-12-31 2018-06-29 商汤集团有限公司 表情识别及卷积神经网络模型训练方法、装置和电子设备
US20210104086A1 (en) * 2018-06-14 2021-04-08 Intel Corporation 3d facial capture and modification using image and temporal tracking neural networks
US20190392202A1 (en) * 2018-10-30 2019-12-26 Baidu Online Network Technology (Beijing) Co., Ltd. Expression recognition method, apparatus, electronic device, and storage medium
US20210133434A1 (en) * 2019-01-10 2021-05-06 Boe Technology Group Co., Ltd. Computer-implemented method of recognizing facial expression, apparatus for recognizing facial expression, method of pre-training apparatus for recognizing facial expression, computer-program product for recognizing facial expression
WO2021034463A1 (en) * 2019-08-19 2021-02-25 Neon Evolution Inc. Methods and systems for image and voice processing
CN113570029A (zh) * 2020-04-29 2021-10-29 华为技术有限公司 获取神经网络模型的方法、图像处理方法及装置
CN111783620A (zh) * 2020-06-29 2020-10-16 北京百度网讯科技有限公司 表情识别方法、装置、设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757923A (zh) * 2023-08-16 2023-09-15 腾讯科技(深圳)有限公司 一种图像生成方法、装置、电子设备及存储介质
CN116757923B (zh) * 2023-08-16 2023-12-08 腾讯科技(深圳)有限公司 一种图像生成方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN116563450A (zh) 2023-08-08

Similar Documents

Publication Publication Date Title
WO2021043168A1 (zh) 行人再识别网络的训练方法、行人再识别方法和装置
WO2020228525A1 (zh) 地点识别及其模型训练的方法和装置以及电子设备
WO2019200749A1 (zh) 识别人脸的方法、装置、计算机设备和存储介质
US11983850B2 (en) Image processing method and apparatus, device, and storage medium
WO2019227479A1 (zh) 人脸旋转图像的生成方法及装置
WO2020192736A1 (zh) 物体识别方法及装置
US20220245926A1 (en) Object Recognition Method and Apparatus
WO2021184933A1 (zh) 一种人体三维模型重建方法
JP2022503647A (ja) クロスドメイン画像変換
CN108363973B (zh) 一种无约束的3d表情迁移方法
WO2021018245A1 (zh) 图像分类方法及装置
CN109684969B (zh) 凝视位置估计方法、计算机设备及存储介质
WO2021013095A1 (zh) 图像分类方法、图像分类模型的训练方法及其装置
WO2022052782A1 (zh) 图像的处理方法及相关设备
WO2022001372A1 (zh) 训练神经网络的方法、图像处理方法及装置
WO2020177214A1 (zh) 一种基于文本不同特征空间的双流式视频生成方法
WO2022267036A1 (zh) 神经网络模型训练方法和装置、数据处理方法和装置
CN114339054B (zh) 拍照模式的生成方法、装置和计算机可读存储介质
WO2022165722A1 (zh) 单目深度估计方法、装置及设备
CN110222718A (zh) 图像处理的方法及装置
WO2023165361A1 (zh) 一种数据处理方法及相关设备
WO2023142886A1 (zh) 表情迁移方法、模型训练方法和装置
CN114581502A (zh) 基于单目图像的三维人体模型联合重建方法、电子设备及存储介质
WO2022179603A1 (zh) 一种增强现实方法及其相关设备
WO2022156475A1 (zh) 神经网络模型的训练方法、数据处理方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22923671

Country of ref document: EP

Kind code of ref document: A1