CN116152447B - Face modeling method and device, electronic equipment and storage medium - Google Patents

Face modeling method and device, electronic equipment and storage medium

Info

Publication number
CN116152447B
Authority
CN
China
Prior art keywords
face
image data
sample
parameter processing
processing model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310431115.5A
Other languages
Chinese (zh)
Other versions
CN116152447A (en)
Inventor
杨硕
何昊南
何山
殷兵
刘聪
周良
胡金水
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202310431115.5A
Publication of CN116152447A
Application granted
Publication of CN116152447B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The application provides a face modeling method and device, an electronic device, and a storage medium. The face modeling method includes: obtaining image data containing a target face; and performing parameterized modeling on the target face in the image data by using a pre-trained face parameter processing model to obtain face parameters of the target face. The face parameter processing model is obtained through face parameter modeling training based at least on face image data in sample video data and voice data corresponding to the face image data. By performing face parameter modeling with the face image data in sample video data and the voice data corresponding to that face image data, a face parameter processing model is trained that can obtain the face parameters of a target face from image data containing the target face, so that the face parameter processing model can learn information that is missing in the two-dimensional image but present in the speech space.

Description

Face modeling method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing, and in particular, to a face modeling method, a device, an electronic apparatus, and a storage medium.
Background
With the development of games, short video and AR/VR technology, the technology of creating face models is increasingly applied to related fields, such as: creation of a 3D avatar, recognition of a human face, virtual makeup, and the like.
Currently, most mobile devices are equipped with monocular RGB cameras, so how to create a face model based on monocular images or videos is an important research direction for those skilled in the art.
Disclosure of Invention
The application provides a face modeling method, a face modeling device, electronic equipment and a storage medium, which are used for realizing the creation of a face model from an image or a video.
According to a first aspect of an embodiment of the present application, there is provided a face modeling method, including:
obtaining image data containing a target face;
performing parameterization modeling on a target face in the image data by using a pre-trained face parameter processing model to obtain face parameters of the target face;
the face parameter processing model is obtained by carrying out face parameter modeling training at least based on face image data in sample video data and voice data corresponding to the face image data.
In an optional embodiment of the present application, the performing parametric modeling on the target face in the image data by using a pre-trained facial parameter processing model to obtain facial parameters of the target face includes:
Obtaining visual characteristics of the image data;
and carrying out parameterization processing on the visual characteristics of the image data to obtain facial parameters corresponding to the target face.
In an optional embodiment of the application, the obtaining the visual feature of the image data includes:
inputting the image data into a pre-trained facial parameter processing model, and extracting visual features of the image data through a visual feature extraction model in the facial parameter processing model.
In an optional embodiment of the present application, the facial parameter processing model is obtained by performing parameter optimization based on at least a similarity between the sample acoustic features and the sample visual features output by the facial parameter processing model;
the sample acoustic features comprise acoustic features of voice data corresponding to the face image data in the sample video data; the sample visual features comprise visual features of the face image data in the sample video data.
In an alternative embodiment of the present application, the facial parameter processing model is trained by:
performing acoustic feature extraction processing on voice data corresponding to the face image data in the sample video data to obtain the sample acoustic features;
Performing visual feature extraction processing on face image data in the sample video data by using a pre-constructed face parameter processing model to obtain the sample visual features;
the facial parameter processing model is optimized based at least on a similarity between the sample acoustic features and the sample visual features.
In an alternative embodiment of the present application, the optimizing the facial parameter processing model based at least on a similarity between the sample acoustic features and the sample visual features includes:
constructing a loss function of the facial parameter processing model according to the similarity between the sample acoustic features and the sample visual features;
optimizing the facial parameter processing model based on the loss function.
In an alternative embodiment of the present application, the optimizing the facial parameter processing model based at least on a similarity between the sample acoustic features and the sample visual features includes:
obtaining sample face parameters by processing the face image data of the sample video data with the pre-constructed face parameter processing model;
rendering a two-dimensional image of the sample face model according to the sample face parameters and the initial face model parameters;
And optimizing the facial parameter processing model according to the degree of difference between the two-dimensional image and the face image data of the sample video data and the degree of similarity between the sample acoustic features and the sample visual features.
In an optional embodiment of the present application, the degree of difference between the two-dimensional image and the face image data of the sample video data is determined by:
calculating pixel difference values between the two-dimensional image and the face image data, and determining a first difference degree between the two-dimensional image and the face image data;
and/or,
extracting face features of the two-dimensional image and the face image data to obtain a first face feature of the two-dimensional image and a second face feature of the face image data; determining a second degree of difference between the two-dimensional image and the face image data according to the first face feature and the second face feature;
and/or,
and determining a third difference degree between the preset key points of the two-dimensional image and the preset key points in the face image data.
In an optional embodiment of the present application, the extracting the face features of the two-dimensional image and the face image data to obtain a first face feature of the two-dimensional image and a second face feature of the face image data includes:
And extracting the face features of the two-dimensional image and the face image data by using a pre-trained face recognition model to obtain the first face features of the two-dimensional image and the second face features of the face image data.
In an alternative embodiment of the present application, the method further comprises:
and applying the facial parameters of the target face to a pre-established initial face model to obtain a face model corresponding to the target face.
According to a second aspect of an embodiment of the present application, there is provided a face modeling apparatus including:
a first unit for obtaining image data including a target face;
the second unit is used for carrying out parameterization modeling on the target face in the image data by utilizing a pre-trained face parameter processing model to obtain the face parameters of the target face;
the face parameter processing model is obtained by carrying out face parameter modeling training at least based on face image data in sample video data and voice data corresponding to the face image data.
According to a third aspect of an embodiment of the present application, there is provided an electronic apparatus including:
a processor;
A memory for storing the processor-executable instructions;
the processor is used for executing the face modeling method through running the instructions in the memory.
According to a fourth aspect of an embodiment of the present application, there is provided a computer storage medium, wherein the storage medium stores a computer program which, when executed by a processor, performs the above-described face modeling method.
Compared with the prior art, the application has the following advantages:
the application provides a face modeling method, a face modeling device, electronic equipment and a storage medium, wherein the face modeling method comprises the following steps: obtaining image data containing a target face; performing parameterization modeling on a target face in the image data by using a pre-trained face parameter processing model to obtain face parameters of the target face; the face parameter processing model is obtained by carrying out face parameter modeling training at least based on face image data in sample video data and voice data corresponding to the face image data.
According to the application, face parameter modeling is carried out with the face image data in sample video data and the voice data corresponding to that face image data, and a face parameter processing model capable of obtaining the face parameters of a target face from image data containing the target face is trained, so that the face parameter processing model can learn information that is missing in the two-dimensional image but present in the speech space.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of an application scenario of a face model modeling method according to an embodiment of the present application.
Fig. 2 is a flowchart of a face modeling method according to another embodiment of the present application.
Fig. 3 is a training flowchart of a facial parameter processing model according to another embodiment of the present application.
Fig. 4 is a schematic structural diagram of a face modeling apparatus according to another embodiment of the present application.
Fig. 5 is a schematic structural diagram of an electronic device according to another embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
With the development of games, short video and AR/VR technology, the technology of creating face models is increasingly applied to related fields, such as: creation of a 3D avatar, recognition of a human face, virtual makeup, and the like.
Currently, most mobile devices are equipped with monocular RGB cameras, so how to create a face model based on monocular images or videos is an important research direction for those skilled in the art.
The application provides a face modeling method, a face modeling device, an electronic device and a storage medium, so as to realize the creation of a face model from an image or a video, and the face modeling method, the device, the electronic device and the storage medium are described in detail in the following embodiments one by one.
Exemplary implementation Environment
In order to facilitate understanding of the face model modeling method, the face model modeling device, the electronic device and the storage medium provided by the embodiment of the application, firstly, an application scene of the face model modeling method is introduced.
In the embodiment of the scene of the application, the face model creation method is particularly applied to creating a 3D face model based on image data shot by a mobile phone.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a face model modeling method according to an embodiment of the present application.
Fig. 1 includes: a generation stage of a 3D face model and a training stage of a face parameter processing model;
the generation stage of the 3D face model mainly includes the following steps S101 to S103:
step S101, obtaining captured image data including a target face.
The image data containing the target face refers to a video clip, shot by a mobile phone camera, of the person for whom a 3D face model is to be created.
Step S102, performing parameterization processing on a target face in the image data by using a pre-trained face parameter processing model to obtain face parameters of the target face.
That is, the image data is input into the face parameter processing model, so that the face parameter processing model performs parameterization processing on a target face in the image data to obtain the face parameter of the target face.
Step S103, constructing a 3D face model of the target face according to the face parameters of the target face.
Specifically, the face parameters of the target face may be applied to an initial face model created in advance, so as to obtain a 3D face model corresponding to the target face.
In practical application, the execution body of steps S101 to S103 may be the mobile phone that captures the image data, or a computer or server dedicated to generating the 3D face avatar; the present application is not limited in this respect.
In the training phase of the facial parameter processing model:
in an embodiment of the present application, the data used for training the facial parameter processing model is sample audio-visual data, that is, video containing face image data together with the voice data corresponding to that image data.
In an optional embodiment of the present application, the sample audio-visual data may be, for example, a video clip of a speaker at a public conference, or a clip of a person speaking taken from other video footage.
Before training the facial parameter processing model, sample pairs for training the model are first constructed.
In an embodiment of the present application, a sample pair for the facial parameter processing model includes: the visual feature of a certain video frame in the face image data and the acoustic feature of the voice data corresponding to that video frame.
Specifically, as shown in fig. 1, fig. 1 includes: an acoustic model 104 and a facial parameter processing model 105.
Wherein the acoustic model 104 is used for extracting acoustic characteristics of voice data based on the voice data input into the model;
the facial parameter processing model 105 is used to extract visual features of a video frame based on the video frame input to the model.
Specifically, the facial parameter processing model 105 includes a visual feature extraction model, and after the video frame is input into the facial parameter processing model 105, the visual feature extraction model performs visual feature extraction processing on the video frame to obtain the visual feature.
After the acoustic feature and the visual feature are obtained, a loss function of the facial parameter processing model 105 may be constructed based on the similarity between the visual feature and the acoustic feature, and then the facial parameter processing model 105 may be optimized based on a loss value of the loss function.
It can be appreciated that the above description of the embodiments of the present application is only for better understanding the face modeling method provided by the present application, but is not used for limiting the application scenario of the face modeling method, and the face modeling method may also be applied to other scenarios, for example, scenarios for face recognition, virtual makeup, etc.
Exemplary method
In the face modeling method provided by the application, face parameter modeling is carried out with the face image data in sample audio-visual data and the voice data corresponding to that face image data, and a face parameter processing model capable of obtaining the face parameters of a target face from image data containing the target face is trained, so that the face parameter processing model can learn information that is missing in the two-dimensional image but present in the speech space.
In an alternative embodiment of the present application, the execution body of the face modeling method may be a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a game console), any combination of two or more of these data processing devices, or a server.
Referring to fig. 2, fig. 2 is a flowchart of a face modeling method according to another embodiment of the present application.
As shown in fig. 2, the face modeling method includes the following steps S201 and S202:
step S201, obtaining image data including a target face.
In the embodiment of the present application, the image data including the target face may be understood as a video shot for the target face.
In the practical application process, the image data containing the target face can be obtained through shooting by a camera of a mobile terminal such as a mobile phone, a tablet personal computer and the like, and also can be obtained through modes such as the internet. The application is not limited in this regard.
Step S202, performing parameterization modeling on a target face in the image data by using a pre-trained face parameter processing model to obtain face parameters of the target face.
The face parameter processing model is obtained by carrying out face parameter modeling training at least based on face image data in sample video data and voice data corresponding to the face image data.
The pre-trained facial parameter processing model may be understood as a convolutional neural network, and in a specific application it may be obtained by training in a machine learning (ML) manner. Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and so on; it studies how to acquire new knowledge or skills from training samples and to reorganize existing knowledge structures so as to continuously improve performance. Machine learning, a branch of artificial intelligence (AI) technology, typically covers artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like.
In the embodiment of the application, the facial parameter processing model comprises a visual feature extraction model, and in the process of processing the image data by using the facial parameter processing model, the visual feature extraction model firstly performs frame-level feature extraction processing on the image data to obtain visual features of a target face in each video frame in the image data, and then the facial parameter processing model obtains facial parameters of the target face based on the visual features.
That is, the performing parametric modeling on the target face in the image data by using a pre-trained face parameter processing model to obtain the face parameter of the target face includes:
obtaining visual characteristics of the image data;
and carrying out parameterization processing on the visual characteristics of the image data to obtain facial parameters corresponding to the target face.
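For illustration only, the following Python (PyTorch) sketch shows one possible structure for such a facial parameter processing model: a visual feature extraction backbone followed by a parameterization head. The backbone, the layer sizes, and the parameter dimensions (here 50 expression and 6 articulation values) are assumptions of the example and are not specified by the application.

import torch
import torch.nn as nn

class FaceParamProcessor(nn.Module):
    # Hypothetical structure: a visual feature extraction model followed by a
    # parameterization head that maps visual features to face parameters.
    def __init__(self, feat_dim=512, n_expr=50, n_pose=6):
        super().__init__()
        self.visual_encoder = nn.Sequential(      # visual feature extraction model
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.param_head = nn.Linear(feat_dim, n_expr + n_pose)   # parameterization

    def forward(self, frames):                      # frames: (B, 3, H, W)
        visual_feat = self.visual_encoder(frames)   # frame-level visual features
        face_params = self.param_head(visual_feat)  # face parameters per frame
        return visual_feat, face_params

model = FaceParamProcessor()
feats, params = model(torch.randn(2, 3, 224, 224))
print(feats.shape, params.shape)   # torch.Size([2, 512]) torch.Size([2, 56])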
Further, in order to facilitate understanding of the process of obtaining the facial parameters of the target face by the facial parameter processing model, the following description will first be given of the training process of the facial parameter processing model.
Referring to fig. 3, fig. 3 is a training flowchart of a facial parameter processing model according to another embodiment of the present application.
As shown in fig. 3, the face parameter processing model is trained by the following steps S301 to S303:
step S301, performing acoustic feature extraction processing on voice data corresponding to the face image data in the sample audio-visual data, so as to obtain the sample acoustic feature.
In the embodiment of the present application, the sample audio-visual data may be understood as video that contains both face images and voice data. For example, the sample audio-visual data may be the audio and video of a person speaking. The sample acoustic features may be understood as phoneme features of the speech data in the sample audio-visual data.
In an alternative embodiment of the application, the sample acoustic features may be obtained by a pre-trained speech recognition model. The voice recognition model is specifically used for performing voice recognition processing on voice data input into the voice recognition model to obtain a recognition text corresponding to the voice data.
The voice recognition model comprises an acoustic model and a feature decoder, wherein the acoustic model is used for extracting acoustic features of voice data input into the voice recognition model to obtain acoustic features of the voice data, and the feature decoder is used for decoding the acoustic features to obtain recognition texts corresponding to the voice data.
In another alternative embodiment of the present application, the sample acoustic features may also be obtained based on a pre-trained acoustic feature extraction model, which is not limiting to the present application.
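As a non-limiting illustration, the following Python sketch extracts frame-level acoustic features with a publicly available pre-trained model (wav2vec 2.0 via torchaudio). The application does not name a particular acoustic model or speech recognition model, so this choice, and the audio file name, are assumptions of the example.

import torch
import torchaudio

# wav2vec 2.0 from torchaudio as one possible pre-trained acoustic model; the
# application does not specify which acoustic model is actually used.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
acoustic_model = bundle.get_model().eval()

waveform, sr = torchaudio.load("speaker_clip.wav")   # hypothetical audio file
waveform = waveform.mean(dim=0, keepdim=True)        # mix down to mono
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # Frame-level acoustic features, roughly one feature vector every 20 ms.
    layer_feats, _ = acoustic_model.extract_features(waveform)
sample_acoustic_feats = layer_feats[-1][0]            # (T_audio, 768)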
Step S302, performing visual feature extraction processing on the face image data in the sample video data by using a pre-constructed face parameter processing model to obtain the sample visual features.
The face image data in the sample audio-visual data may be understood as the image portion of the sample audio-visual data, for example, the video frames of the sample audio-visual data.
In the embodiment of the present application, the step S302 includes: and inputting the face image data in the sample video data into the pre-constructed face parameter processing model, and extracting sample visual features in the face image data through a visual feature extraction model in the face parameter processing model.
As described above, the acoustic features of the voice data in the sample audio-visual data are the phoneme features of that voice data, and a phoneme, as a basic unit of speech, has a certain correspondence with lip shape; therefore, after the facial parameter processing model extracts the sample visual features from the face image data, those visual features also reflect various aspects of the speaker's face while speaking. The facial parameter processing model can be optimized on this basis, which is done in step S303 below.
Step S303, optimizing the facial parameter processing model at least according to the similarity between the sample acoustic features and the sample visual features.
In an optional embodiment of the present application, before performing parameter optimization on the facial parameter processing model, the visual features and the acoustic features further need to be put into correspondence with each other so as to improve the training accuracy of the facial parameter processing model.
In practical application, the sample visual features of the face image data in the sample audio-visual data are extracted frame by frame, that is, each video frame corresponds to one visual feature. The acoustic features extracted by the acoustic model or the speech recognition model, however, generally have a frequency of about 49 Hz, that is, a stride of roughly 20 ms between consecutive acoustic features, so for most sample video data the extracted visual features cannot be put into one-to-one correspondence with the acoustic features. Therefore, in the embodiment of the present application, after the acoustic features are extracted by the speech recognition model, linear interpolation may be performed based on the variation between adjacent acoustic features, so that the number of acoustic features after interpolation is twice the number of video frames.
For example, for 30 fps sample video data, after linear interpolation of its acoustic features the acoustic feature frequency becomes 60 Hz, so that each video frame has two matching acoustic features. Any video frame together with its matching acoustic features can be taken as a positive sample pair, while mismatched acoustic features and video frames form negative sample pairs; after the visual features of the video frames are extracted, these pairs are used to train the facial parameter processing model.
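The alignment described above can be sketched as follows in Python (PyTorch); the feature dimensions, the frame counts, and the indexing of matching pairs are illustrative assumptions.

import torch
import torch.nn.functional as F

def align_acoustic_to_frames(acoustic_feats, num_frames):
    # Linearly interpolate acoustic features (T_audio, D) so that their number
    # becomes twice the number of video frames, as described above.
    x = acoustic_feats.t().unsqueeze(0)                      # (1, D, T_audio)
    x = F.interpolate(x, size=2 * num_frames, mode="linear", align_corners=True)
    return x.squeeze(0).t()                                  # (2 * num_frames, D)

# Illustrative numbers only: 90 video frames (3 s at 30 fps), ~150 acoustic frames.
acoustic_feats = torch.randn(150, 768)
aligned = align_acoustic_to_frames(acoustic_feats, num_frames=90)   # (180, 768)

# One natural indexing (an assumption of the example): video frame i is matched
# with interpolated acoustic features 2*i and 2*i+1 as positive pairs, while the
# acoustic features of other frames serve as negatives during training.
positives_for_frame_0 = aligned[0:2]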
In an alternative embodiment of the present application, the optimizing the facial parameter processing model based at least on a similarity between the sample acoustic features and the sample visual features includes:
constructing a loss function of the facial parameter processing model according to the similarity between the sample acoustic features and the sample visual features;
optimizing the facial parameter processing model based on the loss function.
Specifically, the loss function of the facial parameter processing model may be constructed through the following formulas (1) to (3):

$L_{a \to v} = -\frac{1}{K}\sum_{i=1}^{K} \log \frac{\exp\left(s(v_i, a_i)/\tau\right)}{\sum_{k=1}^{K} \exp\left(s(v_k, a_i)/\tau\right)}$  (1)

$L_{v \to a} = -\frac{1}{K}\sum_{i=1}^{K} \log \frac{\exp\left(s(v_i, a_i)/\tau\right)}{\sum_{k=1}^{K} \exp\left(s(v_i, a_k)/\tau\right)}$  (2)

$L = \lambda L_{a \to v} + (1 - \lambda) L_{v \to a}$  (3)

where $a_i$ denotes an acoustic feature; $v_i$ denotes a visual feature; $s(v_i, a_i)$ denotes the similarity between the i-th visual feature and its corresponding acoustic feature; $\tau$ is an adjustable temperature coefficient; $K$ denotes the number of visual features; $L_{a \to v}$ denotes the contrastive term from acoustic features to visual features; $L_{v \to a}$ denotes the contrastive term from visual features to acoustic features; $\lambda$ denotes the weight coefficient of the contrastive terms and, in an optional embodiment of the application, may be set to 0.5; formula (3) is the finally obtained loss function, and $L$ denotes the loss value of the facial parameter processing model.
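A minimal Python (PyTorch) sketch of the loss in formulas (1) to (3) is given below. It assumes the visual and acoustic features have already been projected to a common dimension and uses cosine similarity as the similarity measure; both choices, and the temperature value, are assumptions of the example rather than requirements stated by the application.

import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(visual, acoustic, tau=0.07, lam=0.5):
    # visual, acoustic: (K, D) paired features; tau is the adjustable
    # temperature coefficient (the value here is illustrative); lam is the
    # weight coefficient lambda of the contrastive terms.
    v = F.normalize(visual, dim=-1)
    a = F.normalize(acoustic, dim=-1)
    sim = v @ a.t() / tau                  # sim[i, j] = s(v_i, a_j) / tau
    targets = torch.arange(v.size(0), device=v.device)
    loss_a2v = F.cross_entropy(sim.t(), targets)   # formula (1): acoustic -> visual
    loss_v2a = F.cross_entropy(sim, targets)       # formula (2): visual -> acoustic
    return lam * loss_a2v + (1.0 - lam) * loss_v2a # formula (3)

loss = audio_visual_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))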
In another optional embodiment of the application, the optimizing the facial parameter processing model at least according to the similarity between the sample acoustic features and the sample visual features comprises the following steps S1 to S3:
Step S1, obtaining sample face parameters by processing the face image data of the sample video data with the pre-constructed facial parameter processing model.
And step S2, rendering a two-dimensional image of the sample face model according to the sample face parameters and the initial face model parameters.
In an alternative embodiment of the present application, an initial face model may be constructed by a FLAME model, thereby obtaining the parameters of the initial face model.
The initial face model may be represented by the following formula (4):
$M(\vec{\beta}, \vec{\theta}, \vec{\psi}) \in \mathbb{R}^{3N}$  (4)

where $N$ denotes the number of vertices of the initial face model in three-dimensional space; $\vec{\beta}$ is the parameter controlling the shape of the face model; $\vec{\psi}$ is the parameter controlling the expression of the face model; $\vec{\theta}$ is the parameter controlling the articulation and overall rotation of the face model.

In practical application, after the face image data obtained from the sample video data is input into the facial parameter processing model, the model outputs the predicted expression parameters $\vec{\psi}$ and articulation parameters $\vec{\theta}$ for that face image data, i.e. $(\vec{\psi}, \vec{\theta}) = E(I)$, where $E$ denotes the facial parameter processing model and $I$ denotes the face image data input into the facial parameter processing model.
In an optional embodiment of the present application, the face parameters output by the facial parameter processing model and the initial face model parameters of the initial face model are input together into a preset differentiable renderer, so as to obtain a rendered two-dimensional image of the sample face model.
Specifically, the process of rendering the two-dimensional image of the sample face model according to the sample face parameters and the initial face model parameters in step S2 may be represented by the following formula (5):

$I_{r} = R\left(M(\vec{\beta}, \vec{\theta}, \vec{\psi}), \vec{t}, \vec{l}, \vec{c}\right)$  (5)

where $I_{r}$ denotes the two-dimensional image of the sample face model; $R$ denotes the differentiable renderer; $\vec{t}$ is the texture parameter used for rendering the two-dimensional image; $\vec{l}$ is the illumination parameter used for rendering the two-dimensional image; and $\vec{c}$ denotes the intrinsic and extrinsic camera parameters.
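The data flow of formulas (4) and (5) can be sketched as follows in Python; flame_model and differentiable_render are placeholders standing in for a real FLAME implementation and a real differentiable renderer, which the application does not further specify, and here simply return dummy tensors so the sketch runs.

import torch

def flame_model(beta, theta, psi, n_vertices=5023):
    # Placeholder for M(beta, theta, psi) of formula (4); FLAME has 5023 vertices.
    return torch.zeros(n_vertices, 3)

def differentiable_render(vertices, texture, lighting, camera, size=224):
    # Placeholder for the differentiable renderer R of formula (5).
    return torch.zeros(3, size, size)

def render_sample(face_param_model, face_image, beta, texture, lighting, camera):
    psi, theta = face_param_model(face_image)    # (psi, theta) = E(I)
    vertices = flame_model(beta, theta, psi)     # apply FLAME with fixed shape beta
    return differentiable_render(vertices, texture, lighting, camera)

# Example call with a dummy predictor returning zero expression/pose parameters.
rendered = render_sample(lambda img: (torch.zeros(50), torch.zeros(6)),
                         torch.zeros(3, 224, 224),
                         beta=torch.zeros(100), texture=None, lighting=None, camera=None)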
And step S3, optimizing the facial parameter processing model according to the degree of difference between the two-dimensional image and the face image data of the sample video data and the degree of similarity between the sample acoustic features and the sample visual features.
Specifically, the step S3 is to construct a loss function of the facial parameter processing model by combining the degree of difference between the two-dimensional image and the face image data of the sample video data on the basis of the degree of similarity between the acoustic feature and the sample visual feature, so as to optimize the facial parameter processing model.
The purpose of optimizing the facial parameter processing model in step S3 of the embodiment of the present application is to have the model learn, during face modeling, information that is missing in the two-dimensional pixel space but present in the speech space.
Further, the degree of difference between the two-dimensional image and the face image data of the sample video data may be obtained by:
calculating pixel difference values between the two-dimensional image and the face image data, and determining a first difference degree between the two-dimensional image and the face image data;
and/or,
extracting face features of the two-dimensional image and the face image data to obtain a first face feature of the two-dimensional image and a second face feature of the face image data; determining a second degree of difference between the two-dimensional image and the face image data according to the first face feature and the second face feature;
and/or,
and determining a third difference degree between the preset key points of the two-dimensional image and the preset key points in the face image data.
Namely, the difference degree between the two-dimensional image and the face image data of the sample video data comprises one or more of the first difference degree, the second difference degree and the third difference degree.
In an optional embodiment of the present application, in optimizing the facial parameter processing model according to the degree of difference between the two-dimensional image and the facial image data of the sample audio-visual data and the degree of similarity between the sample acoustic feature and the sample visual feature, the degree of difference between the two-dimensional image and the facial image data of the sample audio-visual data and the degree of similarity between the sample acoustic feature and the sample visual feature need to be adjusted to the same order of magnitude, so as to facilitate the subsequent optimization of the facial parameter processing model.
Specifically, calculating the pixel difference between the two-dimensional image and the face image data of the sample video data means computing, pixel by pixel, the difference between the two-dimensional image and the corresponding video frame in the face image data.
Further, in an optional embodiment of the present application, the process of extracting the face features of the two-dimensional image and the face image data is implemented based on a pre-trained face recognition model, that is, the face features of the two-dimensional image and the face image data are extracted by using the pre-trained face recognition model, so as to obtain a first face feature of the two-dimensional image and a second face feature of the face image data.
Further, the second degree of difference between the two-dimensional image and the face image data refers to a difference between the first face feature and the second face feature.
Further, the preset key points of the two-dimensional image refer to key locations of the face marked in the two-dimensional image, for example the corners of the eyes, the corners of the mouth, and the position of the lips; similarly, the preset key points in the face image data refer to the key locations of the face marked in the face image data.
In an optional embodiment of the present application, in order to facilitate setting of preset keypoints of the two-dimensional image, after obtaining the sample face parameters in the step S1, a sample face model may be created based on the sample face parameters, then the sample face model is projected onto a two-dimensional plane, an image obtained by projecting the two-dimensional plane is used as a rendering image, and projection points of key points/vertices of the sample face model are used as the keypoints.
Based on this, the third difference may be obtained by calculating a difference between the key point of the two-dimensional image and the key point of the face image data.
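As a non-limiting illustration, the three difference degrees and their combination with the contrastive term of formulas (1) to (3) can be sketched as follows in Python (PyTorch); the specific distance functions, the use of cosine similarity for the face features, and the weight values are assumptions of the example, not values given by the application.

import torch
import torch.nn.functional as F

def image_space_losses(rendered, target, feat_rendered, feat_target,
                       kpts_rendered, kpts_target):
    # First degree of difference: pixel-wise difference between the rendered
    # two-dimensional image and the face image data.
    pixel_loss = F.l1_loss(rendered, target)
    # Second degree: difference between the face features of the two images
    # (cosine distance here is an assumption).
    identity_loss = 1.0 - F.cosine_similarity(feat_rendered, feat_target, dim=-1).mean()
    # Third degree: difference between preset key points of the two images.
    landmark_loss = F.mse_loss(kpts_rendered, kpts_target)
    return pixel_loss, identity_loss, landmark_loss

def total_loss(pixel_loss, identity_loss, landmark_loss, contrastive_loss,
               w=(1.0, 1.0, 1.0, 1.0)):
    # The weights rescale the terms to a comparable order of magnitude; the
    # values here are placeholders, not those used in the patent.
    return (w[0] * pixel_loss + w[1] * identity_loss
            + w[2] * landmark_loss + w[3] * contrastive_loss)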
After the face parameter processing model is obtained through training in the above manner, the face parameters of the target face can be obtained based on the above steps S201 and S202.
In an alternative embodiment of the present application, the method further comprises:
and applying the facial parameters of the target face to a pre-established initial face model to obtain a face model corresponding to the target face.
That is, the face parameters of the target face obtained by the face parameter processing model are used for constructing a face model corresponding to the target face.
In another alternative embodiment of the present application, a two-dimensional image of the target face may be rendered based on a manner similar to the above steps S1 and S2, and then a face model corresponding to the target face may be constructed based on the two-dimensional image of the target face. The application is not limited in this regard.
In summary, according to the face modeling method, face parameter modeling is performed through face image data in sample video data and voice data corresponding to the face image data, and a face parameter processing model capable of obtaining face parameters of a target face based on image data containing the target face is trained, so that the face parameter processing model can learn information missing in a two-dimensional image and existing in a voice space.
Exemplary apparatus
The embodiment of the application also provides a face modeling device, please refer to fig. 4, fig. 4 is a schematic structural diagram of the face modeling device according to another embodiment of the application.
As shown in fig. 4, the face modeling apparatus includes:
a first unit 401, configured to obtain image data including a target face;
a second unit 402, configured to perform parametric modeling on a target face in the image data by using a pre-trained face parameter processing model, so as to obtain a face parameter of the target face;
the face parameter processing model is obtained by carrying out face parameter modeling training at least based on face image data in sample video data and voice data corresponding to the face image data.
In an optional embodiment of the present application, the performing parametric modeling on the target face in the image data by using a pre-trained facial parameter processing model to obtain facial parameters of the target face includes:
obtaining visual characteristics of the image data;
and carrying out parameterization processing on the visual characteristics of the image data to obtain facial parameters corresponding to the target face.
In an optional embodiment of the application, the obtaining the visual feature of the image data includes:
Inputting the image data into a pre-trained facial parameter processing model, and extracting visual features of the image data through a visual feature extraction model in the facial parameter processing model.
In an optional embodiment of the present application, the facial parameter processing model is obtained by performing parameter optimization based on at least a similarity between the sample acoustic features and the sample visual features output by the facial parameter processing model;
the sample acoustic features comprise acoustic features of voice data corresponding to the face image data in the sample video data; the sample visual features comprise visual features of the face image data in the sample video data.
In an alternative embodiment of the present application, the facial parameter processing model is trained by:
performing acoustic feature extraction processing on voice data corresponding to the face image data in the sample video data to obtain the sample acoustic features;
performing visual feature extraction processing on face image data in the sample video data by using a pre-constructed face parameter processing model to obtain the sample visual features;
The facial parameter processing model is optimized based at least on a similarity between the sample acoustic features and the sample visual features.
In an alternative embodiment of the present application, the optimizing the facial parameter processing model based at least on a similarity between the sample acoustic features and the sample visual features includes:
constructing a loss function of the facial parameter processing model according to the similarity between the sample acoustic features and the sample visual features;
optimizing the facial parameter processing model based on the loss function.
In an alternative embodiment of the present application, the optimizing the facial parameter processing model based at least on a similarity between the sample acoustic features and the sample visual features includes:
obtaining sample face parameters by processing the face image data of the sample video data with the pre-constructed face parameter processing model;
rendering a two-dimensional image of the sample face model according to the sample face parameters and the initial face model parameters;
and optimizing the facial parameter processing model according to the degree of difference between the two-dimensional image and the face image data of the sample video data and the degree of similarity between the sample acoustic features and the sample visual features.
In an optional embodiment of the present application, the degree of difference between the two-dimensional image and the face image data of the sample video data is determined by:
calculating pixel difference values between the two-dimensional image and the face image data, and determining a first difference degree between the two-dimensional image and the face image data;
and/or,
extracting face features of the two-dimensional image and the face image data to obtain a first face feature of the two-dimensional image and a second face feature of the face image data; determining a second degree of difference between the two-dimensional image and the face image data according to the first face feature and the second face feature;
and/or,
and determining a third difference degree between the preset key points of the two-dimensional image and the preset key points in the face image data.
In an optional embodiment of the present application, the extracting the face features of the two-dimensional image and the face image data to obtain a first face feature of the two-dimensional image and a second face feature of the face image data includes:
and extracting the face features of the two-dimensional image and the face image data by using a pre-trained face recognition model to obtain the first face features of the two-dimensional image and the second face features of the face image data.
In an alternative embodiment of the present application, the apparatus further comprises:
and a third unit 403, configured to apply the facial parameters of the target face to a pre-created initial face model, and obtain a face model corresponding to the target face.
The face modeling device provided in this embodiment belongs to the same application conception as the face modeling method provided in the above embodiment of the present application, and may execute the face modeling method provided in any of the above embodiments of the present application, and has a functional module and beneficial effects corresponding to executing the face modeling method. Technical details not described in detail in the present embodiment may refer to specific processing content of the face modeling method provided in the foregoing embodiment of the present application, and will not be described herein.
Exemplary electronic device
In another embodiment of the present application, please refer to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to another embodiment of the present application.
As shown in fig. 5, the electronic device includes:
a memory 200 and a processor 210;
wherein the memory 200 is connected to the processor 210, and is used for storing a program;
the processor 210 is configured to implement the face modeling method disclosed in any of the foregoing embodiments by running a program stored in the memory 200.
Specifically, the electronic device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are interconnected by a bus. Wherein:
a bus may comprise a path that communicates information between components of a computer system.
The processor 210 may be a general-purpose processor such as a central processing unit (CPU) or a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in accordance with the technical solutions of the present invention; it may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Processor 210 may include a main processor, and may also include a baseband chip, modem, and the like.
The memory 200 stores the program for implementing the technical solution of the present invention, and may also store an operating system and other key services. Specifically, the program may include program code, and the program code includes computer operation instructions. More specifically, the memory 200 may include a read-only memory (ROM), other types of static storage devices that may store static information and instructions, a random access memory (RAM), other types of dynamic storage devices that may store information and instructions, disk storage, flash memory, and the like.
The input device 230 may include means for receiving data and information entered by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include means, such as a display screen, printer, speakers, etc., that allow information to be output to a user.
The communication interface 220 may include devices using any transceiver or the like for communicating with other devices or communication networks, such as ethernet, radio Access Network (RAN), wireless Local Area Network (WLAN), etc.
The processor 210 executes programs stored in the memory 200 and invokes other devices that may be used to implement the steps of any of the face modeling methods provided in the above embodiments of the present application.
Exemplary computer program product and storage Medium
In addition to the methods and apparatus described above, embodiments of the application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in a face modeling method according to various embodiments of the application described in the "exemplary methods" section of this specification.
Program code for carrying out operations of embodiments of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, an embodiment of the present application may also be a storage medium having stored thereon a computer program that is executed by a processor to perform the steps in the face modeling method according to various embodiments of the present application described in the above-described "exemplary method" section of the present specification, and specifically may implement the steps of:
step S201, obtaining image data containing a target face;
step S202, carrying out parameterization modeling on a target face in the image data by utilizing a pre-trained face parameter processing model to obtain face parameters of the target face;
The face parameter processing model is obtained by carrying out face parameter modeling training at least based on face image data in sample video data and voice data corresponding to the face image data.
For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present application is not limited by the order of acts, as some steps may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
It should be noted that, in the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described as different from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other. For the apparatus class embodiments, the description is relatively simple as it is substantially similar to the method embodiments, and reference is made to the description of the method embodiments for relevant points.
The steps in the method of each embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs, and the technical features described in each embodiment can be replaced or combined.
The modules and the submodules in the device and the terminal of the embodiments of the application can be combined, divided and deleted according to actual needs.
In the embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of modules or sub-modules is merely a logical function division, and there may be other manners of division in actual implementation, for example, multiple sub-modules or modules may be combined or integrated into another module, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules or sub-modules illustrated as separate components may or may not be physically separate, and components that are modules or sub-modules may or may not be physical modules or sub-modules, i.e., may be located in one place, or may be distributed over multiple network modules or sub-modules. Some or all of the modules or sub-modules may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated in one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated in one module. The integrated modules or sub-modules may be implemented in hardware or in software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software elements may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any actual such relationship or order between such entities or operations. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A face modeling method, comprising:
obtaining image data containing a target face;
performing parameterized modeling on the target face in the image data using a pre-trained face parameter processing model to obtain face parameters of the target face;
wherein the face parameter processing model is obtained through face parameter modeling training based at least on face image data in sample video data and voice data corresponding to the face image data, comprising: the face parameter processing model is obtained by performing parameter optimization based at least on a similarity between sample acoustic features and sample visual features output by the face parameter processing model; the sample acoustic features comprise acoustic features of the voice data corresponding to the face image data in the sample video data; and the sample visual features comprise visual features of the face image data in the sample video data.
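For concreteness, the parameterization step of claim 1 can be pictured as follows. This is a minimal, purely illustrative sketch and not part of the claim: it assumes a PyTorch runtime and a serialized model file; the claim does not prescribe any particular network architecture, framework, input resolution, or parameter dimensionality, and all identifiers below (predict_face_parameters, face_parameter_model.pt, the preprocessing pipeline) are hypothetical.

    import torch
    from PIL import Image
    from torchvision import transforms

    # Hypothetical pre-trained face parameter processing model, serialized with TorchScript.
    model = torch.jit.load("face_parameter_model.pt")
    model.eval()

    # Assumed preprocessing; the claim does not fix an input size or normalization.
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    def predict_face_parameters(image_path: str) -> torch.Tensor:
        """Parameterize the target face in the image and return its face parameters."""
        image = Image.open(image_path).convert("RGB")
        batch = preprocess(image).unsqueeze(0)   # shape [1, 3, 224, 224]
        with torch.no_grad():
            face_params = model(batch)           # shape [1, P] face parameter vector
        return face_params.squeeze(0)

    params = predict_face_parameters("target_face.jpg")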
2. The method according to claim 1, wherein performing parameterized modeling on the target face in the image data using the pre-trained face parameter processing model to obtain the face parameters of the target face comprises:
obtaining visual features of the image data;
and performing parameterization processing on the visual features of the image data to obtain the face parameters corresponding to the target face.
3. The method of claim 2, wherein obtaining the visual features of the image data comprises:
inputting the image data into the pre-trained face parameter processing model, and extracting the visual features of the image data through a visual feature extraction model in the face parameter processing model.
4. The method of claim 1, wherein the face parameter processing model is trained by:
performing acoustic feature extraction on the voice data corresponding to the face image data in the sample video data to obtain the sample acoustic features;
performing visual feature extraction on the face image data in the sample video data using a pre-constructed face parameter processing model to obtain the sample visual features;
and optimizing the face parameter processing model based at least on the similarity between the sample acoustic features and the sample visual features.
5. The method of claim 4, wherein optimizing the face parameter processing model based at least on the similarity between the sample acoustic features and the sample visual features comprises:
constructing a loss function of the face parameter processing model according to the similarity between the sample acoustic features and the sample visual features;
and optimizing the face parameter processing model based on the loss function.
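As one possible realization of the loss function in claim 5, the similarity term could be implemented as a cosine-similarity loss between matched acoustic and visual feature vectors. The sketch below is illustrative only, assuming both feature streams have already been projected to a common embedding dimension; the claim does not mandate cosine similarity or any specific weighting, and the function and variable names are hypothetical.

    import torch
    import torch.nn.functional as F

    def audio_visual_similarity_loss(acoustic_feat: torch.Tensor,
                                     visual_feat: torch.Tensor) -> torch.Tensor:
        """acoustic_feat, visual_feat: [batch, dim] features of matched audio/video frames."""
        cos_sim = F.cosine_similarity(acoustic_feat, visual_feat, dim=-1)  # [batch]
        # Higher similarity between matched features -> smaller loss.
        return (1.0 - cos_sim).mean()

    # Inside a training step (other loss terms of the model omitted):
    # loss = audio_visual_similarity_loss(sample_acoustic, sample_visual)
    # loss.backward()
    # optimizer.step()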
6. The method of claim 4, wherein optimizing the face parameter processing model based at least on the similarity between the sample acoustic features and the sample visual features comprises:
obtaining sample face parameters by processing the face image data of the sample video data with the pre-constructed face parameter processing model;
rendering a two-dimensional image of a sample face model according to the sample face parameters and initial face model parameters;
and optimizing the face parameter processing model according to a degree of difference between the two-dimensional image and the face image data of the sample video data and the similarity between the sample acoustic features and the sample visual features.
7. The method of claim 6, wherein the degree of difference between the two-dimensional image and the face image data of the sample video data is determined by:
calculating pixel difference values between the two-dimensional image and the face image data to determine a first degree of difference between the two-dimensional image and the face image data;
and/or,
extracting face features of the two-dimensional image and the face image data to obtain a first face feature of the two-dimensional image and a second face feature of the face image data, and determining a second degree of difference between the two-dimensional image and the face image data according to the first face feature and the second face feature;
and/or,
determining a third degree of difference between preset key points of the two-dimensional image and preset key points in the face image data.
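The three optional degrees of difference in claim 7 admit a straightforward reading: a photometric (pixel) error, a distance between face-recognition embeddings, and a distance between facial key points. The sketch below is illustrative only: the face recognition model and the key-point tensors are assumed inputs, and the particular norms chosen (L1, cosine, MSE) are assumptions not fixed by the claim.

    import torch
    import torch.nn.functional as F

    def pixel_difference(rendered: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
        # First degree of difference: per-pixel photometric error between images [B, 3, H, W].
        return F.l1_loss(rendered, real)

    def identity_difference(rendered: torch.Tensor, real: torch.Tensor,
                            face_recognition_model) -> torch.Tensor:
        # Second degree of difference: distance between face features produced by a
        # (pre-trained, assumed) face recognition model.
        first_face_feature = face_recognition_model(rendered)
        second_face_feature = face_recognition_model(real)
        return (1.0 - F.cosine_similarity(first_face_feature, second_face_feature, dim=-1)).mean()

    def keypoint_difference(rendered_keypoints: torch.Tensor,
                            real_keypoints: torch.Tensor) -> torch.Tensor:
        # Third degree of difference: error between preset key points, shape [B, K, 2].
        return F.mse_loss(rendered_keypoints, real_keypoints)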
8. The method of claim 7, wherein extracting the face features of the two-dimensional image and the face image data to obtain the first face feature of the two-dimensional image and the second face feature of the face image data comprises:
extracting the face features of the two-dimensional image and the face image data using a pre-trained face recognition model to obtain the first face feature of the two-dimensional image and the second face feature of the face image data.
9. The method according to claim 1, further comprising:
applying the face parameters of the target face to a pre-established initial face model to obtain a face model corresponding to the target face.
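One common way to apply face parameters to a pre-established initial face model, as recited in claim 9, is a linear (3DMM-style) deformation of the initial mesh. The claim does not require a linear model; the sketch below simply assumes one, and all array names are hypothetical.

    import numpy as np

    def apply_face_parameters(initial_vertices: np.ndarray,   # [V, 3] initial face model mesh
                              parameter_basis: np.ndarray,    # [V, 3, P] per-vertex deformation basis
                              face_params: np.ndarray) -> np.ndarray:  # [P] predicted face parameters
        """Deform the initial face model with the face parameters of the target face."""
        offset = parameter_basis @ face_params                # broadcasts to [V, 3]
        return initial_vertices + offset

    # target_face_mesh = apply_face_parameters(initial_vertices, basis, predicted_params)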
10. A face modeling apparatus, comprising:
a first unit for obtaining image data containing a target face;
a second unit for performing parameterized modeling on the target face in the image data using a pre-trained face parameter processing model to obtain face parameters of the target face;
wherein the face parameter processing model is obtained through face parameter modeling training based at least on face image data in sample video data and voice data corresponding to the face image data, comprising: the face parameter processing model is obtained by performing parameter optimization based at least on a similarity between sample acoustic features and sample visual features output by the face parameter processing model; the sample acoustic features comprise acoustic features of the voice data corresponding to the face image data in the sample video data; and the sample visual features comprise visual features of the face image data in the sample video data.
11. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the face modeling method according to any one of claims 1-9 by executing the instructions in the memory.
12. A computer storage medium, wherein the storage medium stores a computer program which, when executed by a processor, performs the face modeling method according to any one of claims 1-9.
CN202310431115.5A 2023-04-21 2023-04-21 Face modeling method and device, electronic equipment and storage medium Active CN116152447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310431115.5A CN116152447B (en) 2023-04-21 2023-04-21 Face modeling method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310431115.5A CN116152447B (en) 2023-04-21 2023-04-21 Face modeling method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116152447A CN116152447A (en) 2023-05-23
CN116152447B true CN116152447B (en) 2023-09-26

Family

ID=86356542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310431115.5A Active CN116152447B (en) 2023-04-21 2023-04-21 Face modeling method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116152447B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279970A (en) * 2013-05-10 2013-09-04 中国科学技术大学 Real-time human face animation driving method by voice
CN113269872A (en) * 2021-06-01 2021-08-17 广东工业大学 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
CN114339190A (en) * 2021-12-29 2022-04-12 中国电信股份有限公司 Communication method, device, equipment and storage medium
CN114332318A (en) * 2021-12-31 2022-04-12 科大讯飞股份有限公司 Virtual image generation method and related equipment thereof
CN115393945A (en) * 2022-10-27 2022-11-25 科大讯飞股份有限公司 Voice-based image driving method and device, electronic equipment and storage medium
WO2023035969A1 (en) * 2021-09-09 2023-03-16 马上消费金融股份有限公司 Speech and image synchronization measurement method and apparatus, and model training method and apparatus
CN115908659A (en) * 2022-11-25 2023-04-04 西安交通大学 Method and device for synthesizing speaking face based on generation countermeasure network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100343874C (en) * 2005-07-11 2007-10-17 北京中星微电子有限公司 Voice-based colored human face synthesizing method and system, coloring method and apparatus
US10586368B2 (en) * 2017-10-26 2020-03-10 Snap Inc. Joint audio-video facial animation system
WO2021248473A1 (en) * 2020-06-12 2021-12-16 Baidu.Com Times Technology (Beijing) Co., Ltd. Personalized speech-to-video with three-dimensional (3d) skeleton regularization and expressive body poses

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279970A (en) * 2013-05-10 2013-09-04 中国科学技术大学 Real-time human face animation driving method by voice
CN113269872A (en) * 2021-06-01 2021-08-17 广东工业大学 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
WO2023035969A1 (en) * 2021-09-09 2023-03-16 马上消费金融股份有限公司 Speech and image synchronization measurement method and apparatus, and model training method and apparatus
CN114339190A (en) * 2021-12-29 2022-04-12 中国电信股份有限公司 Communication method, device, equipment and storage medium
CN114332318A (en) * 2021-12-31 2022-04-12 科大讯飞股份有限公司 Virtual image generation method and related equipment thereof
CN115393945A (en) * 2022-10-27 2022-11-25 科大讯飞股份有限公司 Voice-based image driving method and device, electronic equipment and storage medium
CN115908659A (en) * 2022-11-25 2023-04-04 西安交通大学 Method and device for synthesizing speaking face based on generation countermeasure network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Nikolai Chinaev, Alexander Chigorin, Ivan Laptev. MobileFace: 3D Face Reconstruction with Efficient CNN Regression. https://arxiv.org/abs/1809.08809. 2018, 1-17. *
唐俊; 牟海明; 冷洁; 李清都; 刘娜. Cross-modal dense deep network learning method based on parameterized representations of speech and faces. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition). 2020, (05), 193-199. *
此方家的空腹. An introduction to a talking-face synthesis method jointly driven by video and audio. https://blog.csdn.net. 2023, full text. *

Also Published As

Publication number Publication date
CN116152447A (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN109727303B (en) Video display method, system, computer equipment, storage medium and terminal
WO2020237937A1 (en) Image processing method and apparatus, electronic device and storage medium
KR102488530B1 (en) Method and apparatus for generating video
US20140361974A1 (en) Karaoke avatar animation based on facial motion data
TWI255141B (en) Method and system for real-time interactive video
US11514634B2 (en) Personalized speech-to-video with three-dimensional (3D) skeleton regularization and expressive body poses
CN109410297A (en) It is a kind of for generating the method and apparatus of avatar image
JP7209851B2 (en) Image deformation control method, device and hardware device
KR20210001859A (en) 3d virtual figure mouth shape control method and device
WO2020244074A1 (en) Expression interaction method and apparatus, computer device, and readable storage medium
JP2021501416A (en) Deep reinforcement learning framework for characterizing video content
WO2022100680A1 (en) Mixed-race face image generation method, mixed-race face image generation model training method and apparatus, and device
CN103997687A (en) Techniques for adding interactive features to videos
CN113886641A (en) Digital human generation method, apparatus, device and medium
CN111737516A (en) Interactive music generation method and device, intelligent sound box and storage medium
CN110910479B (en) Video processing method, device, electronic equipment and readable storage medium
WO2022237633A1 (en) Image processing method, apparatus, and device, and storage medium
WO2022166897A1 (en) Facial shape adjustment image generation method and apparatus, model training method and apparatus, and device
JP2023549810A (en) Animal face style image generation method, model training method, device and equipment
CN109445573A (en) A kind of method and apparatus for avatar image interactive
CN116152447B (en) Face modeling method and device, electronic equipment and storage medium
WO2021155666A1 (en) Method and apparatus for generating image
CN115439614A (en) Virtual image generation method and device, electronic equipment and storage medium
JP6892557B2 (en) Learning device, image generator, learning method, image generation method and program
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant