CN116310435A - Driving method and device for three-dimensional face, electronic equipment and readable storage medium - Google Patents

Driving method and device for three-dimensional face, electronic equipment and readable storage medium

Info

Publication number
CN116310435A
Authority
CN
China
Prior art keywords
emotion
mouth shape
source
obtaining
driving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310141085.4A
Other languages
Chinese (zh)
Inventor
杜宗财
范锡睿
赵亚飞
张世昌
郭紫垣
陈毅
王志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310141085.4A priority Critical patent/CN116310435A/en
Publication of CN116310435A publication Critical patent/CN116310435A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Image Processing (AREA)

Abstract

The disclosure provides a driving method and apparatus for a three-dimensional face, an electronic device, and a readable storage medium, relating to technical fields such as computer vision, deep learning, augmented reality, and virtual reality, and applicable to scenarios such as the metaverse and virtual digital humans. The driving method of the three-dimensional face comprises the following steps: acquiring audio data and emotion information; obtaining a fused mouth shape feature and a fused emotion feature according to a source mouth shape feature extracted from the audio data and a source emotion feature extracted from the emotion information; obtaining mouth shape driving parameters according to the fused mouth shape feature, and obtaining emotion driving parameters according to the fused emotion feature; and driving the three-dimensional face according to the mouth shape driving parameters and the emotion driving parameters. The method can improve the match between the mouth shape and the emotion of the driven three-dimensional face, making the face more realistic and enhancing the driving effect.

Description

Driving method and device for three-dimensional face, electronic equipment and readable storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular to the technical fields of computer vision, deep learning, augmented reality, virtual reality and the like, which can be applied to scenarios such as the metaverse and virtual digital humans. Provided are a driving method and apparatus for a three-dimensional face, an electronic device, and a readable storage medium.
Background
Audio-driven three-dimensional face animation is an important technology in scenarios such as the metaverse and virtual digital humans; it aims to drive a three-dimensional face to present mouth shapes and emotions that match the input audio.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided a driving method of a three-dimensional face, including: acquiring audio data and emotion information; according to the source mouth shape characteristics extracted from the audio data and the source emotion characteristics extracted from the emotion information, obtaining fusion mouth shape characteristics and fusion emotion characteristics; obtaining mouth shape driving parameters according to the fusion mouth shape characteristics, and obtaining emotion driving parameters according to the fusion emotion characteristics; and driving the three-dimensional face according to the mouth shape driving parameters and the emotion driving parameters.
According to a second aspect of the present disclosure, there is provided a driving apparatus for a three-dimensional face, including: the acquisition unit is used for acquiring the audio data and the emotion information; the fusion unit is used for obtaining fusion mouth shape characteristics and fusion emotion characteristics according to the source mouth shape characteristics extracted from the audio data and the source emotion characteristics extracted from the emotion information; the processing unit is used for obtaining mouth shape driving parameters according to the fusion mouth shape characteristics and obtaining emotion driving parameters according to the fusion emotion characteristics; and the driving unit is used for driving the three-dimensional face according to the mouth shape driving parameters and the emotion driving parameters.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method as described above.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
According to the technical solution of the present disclosure, more accurate driving parameters can be obtained by fusing the source mouth shape feature and the source emotion feature, the match between the mouth shape and the emotion of the driven three-dimensional face is improved, the three-dimensional face appears more realistic, and the driving effect is enhanced.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a block diagram of an electronic device for implementing a driving method of a three-dimensional face according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in fig. 1, the driving method of the three-dimensional face of the embodiment specifically includes the following steps:
s101, acquiring audio data and emotion information;
s102, obtaining a fusion mouth shape feature and a fusion emotion feature according to the source mouth shape feature extracted from the audio data and the source emotion feature extracted from the emotion information;
s103, obtaining mouth shape driving parameters according to the fusion mouth shape characteristics, and obtaining emotion driving parameters according to the fusion emotion characteristics;
and S104, driving the three-dimensional face according to the mouth shape driving parameters and the emotion driving parameters.
According to the driving method of the three-dimensional face of this embodiment, the source mouth shape feature and the source emotion feature are first extracted and then fused to obtain the fused mouth shape feature and the fused emotion feature; the corresponding driving parameters are then obtained from the different fused features; finally, the three-dimensional face is driven according to the obtained driving parameters, for example to speak, sing, or broadcast, so that it presents the mouth shape corresponding to the audio data and the emotion corresponding to the emotion information. Because the source mouth shape feature and the source emotion feature are fused, more accurate driving parameters can be obtained, the match between the mouth shape and the emotion of the driven three-dimensional face is improved, the face appears more realistic, and the driving effect is enhanced.
In the embodiment, when S101 is executed, emotion information may be obtained according to the obtained audio data, or emotion information input by the user may be obtained; the emotion information may be text corresponding to different emotion categories, where the emotion categories include happiness, excitement, sadness, anger, neutrality, surprise, disgust, and the like.
In this embodiment, when S101 is performed to obtain emotion information from audio data, emotion recognition may be performed on the audio data (for example, using an audio emotion recognition algorithm), and the recognition result may be used as the emotion information.
In an actual scene, the audio data may or may not contain emotion; if the audio data does not contain emotion, the embodiment cannot acquire emotion information from the audio data through the audio emotion recognition algorithm.
In order to ensure that emotion information can be obtained from audio data, the present embodiment may further employ the following manner when S101 is performed: converting the audio data into text; emotion recognition is performed on the text (for example, a text emotion recognition algorithm is used), and the recognition result is used as emotion information.
It can be appreciated that in the embodiment, when S101 is executed, the audio data may be first subjected to emotion recognition by using the audio emotion recognition algorithm, and if the emotion information cannot be obtained, the text obtained by converting the audio data is then subjected to emotion recognition by using the text emotion recognition algorithm, so as to obtain the emotion information.
The present embodiment may also adopt the following manner when S101 is executed to acquire the emotion information: determining the driving scene of the three-dimensional face, such as a news broadcasting scene or a hosting scene, and acquiring the emotion information corresponding to the determined driving scene. The emotion information can be obtained through a preset correspondence between driving scenes and emotion information, which records the emotion information corresponding to each driving scene, for example neutral emotion information for a news broadcasting scene.
That is, the embodiment can obtain emotion information according to the driving scene of the three-dimensional face, which avoids both a mismatch between the obtained emotion information and the driving scene and the case where no emotion information can be obtained from the audio data; for example, the three-dimensional face is prevented from presenting an excited emotion in a news broadcasting scene, thereby further improving the driving effect of the three-dimensional face.
In addition, when S101 is executed in this embodiment, emotion information may be obtained in more than one of the above manners. If emotion information input by the user exists, it is used as the final emotion information; if no user input exists, the emotion information obtained from the driving scene is used as the final emotion information; otherwise, the emotion information obtained according to the audio data is used as the final emotion information.
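As an illustration only, the following Python sketch shows one way to realize this priority order; the helper functions and the scene-to-emotion table are hypothetical stubs, not components defined in this disclosure.

# Minimal sketch of the emotion-information priority described above
# (user input > driving scene > audio > transcript). The recognizers are
# trivial stubs standing in for real models; all names are illustrative.

SCENE_EMOTION = {"news_broadcast": "neutral", "hosting": "happiness"}

def audio_emotion_recognition(audio):
    return None          # stub: a real audio emotion recognizer goes here

def speech_to_text(audio):
    return ""            # stub: a real speech-to-text system goes here

def text_emotion_recognition(text):
    return "neutral"     # stub: a real text emotion recognizer goes here

def get_emotion_info(audio, user_emotion=None, driving_scene=None):
    if user_emotion is not None:            # 1) user input has top priority
        return user_emotion
    if driving_scene in SCENE_EMOTION:      # 2) then the preset scene mapping
        return SCENE_EMOTION[driving_scene]
    emotion = audio_emotion_recognition(audio)
    if emotion is not None:                 # 3) then emotion recognized from the audio
        return emotion
    return text_emotion_recognition(speech_to_text(audio))  # 4) transcript fallback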
In this embodiment, after S101 is executed to acquire the audio data and the emotion information, S102 is executed to obtain the fused mouth shape feature and the fused emotion feature according to the source mouth shape feature extracted from the audio data and the source emotion feature extracted from the emotion information.
In the embodiment, when executing S102, the audio data may be input to the mouth shape feature extraction network, and an output result of the mouth shape feature extraction network is used as a source mouth shape feature; the emotional information may be input into an emotional feature extraction network, and an output result of the emotional feature extraction network may be used as a source emotional feature.
The mouth shape feature extraction network is obtained by training in advance according to the sample audio data and the source mouth shape features of the sample audio data, and the emotion feature extraction network is obtained by training in advance according to the sample emotion information and the source emotion features of the sample emotion information.
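For illustration, a minimal PyTorch-style sketch of two such extraction networks is given below, assuming the audio has already been converted into per-frame acoustic features (e.g. a mel-spectrogram) and the emotion information into a category index; the architectures and layer sizes are assumptions, not the networks actually trained in this disclosure.

import torch
import torch.nn as nn

class MouthShapeFeatureExtractor(nn.Module):
    """Maps per-frame acoustic features to source mouth shape features."""
    def __init__(self, audio_dim=80, feat_dim=256):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, feat_dim, batch_first=True)

    def forward(self, audio_feats):              # (B, T, audio_dim)
        out, _ = self.encoder(audio_feats)
        return out                               # (B, T, feat_dim) source mouth shape features

class EmotionFeatureExtractor(nn.Module):
    """Maps an emotion category (or text-derived label) to a source emotion feature."""
    def __init__(self, num_emotions=8, feat_dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_emotions, feat_dim)
        self.proj = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                  nn.Linear(feat_dim, feat_dim))

    def forward(self, emotion_id):               # (B,) tensor of category indices
        return self.proj(self.embed(emotion_id)) # (B, feat_dim) source emotion features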
There is a certain correlation between audio and emotion; for example, the mouth shape is richer when the emotion is strong, and the emotion tends to be stronger when the pitch is high.
Therefore, in this embodiment, after S102 is executed to obtain the source mouth shape feature and the source emotion feature, the source mouth shape feature and the source emotion feature are fused, so as to obtain the fused mouth shape feature and the fused emotion feature, so that the fused mouth shape feature includes the source emotion feature and the fused emotion feature includes the source mouth shape feature.
In the embodiment, when S102 is executed to obtain the fused mouth shape feature according to the source mouth shape feature extracted from the audio data and the source emotion feature extracted from the emotion information, the following optional implementation may be adopted: mapping the source mouth shape feature into a mouth shape query vector, and mapping the source emotion feature into an emotion key value vector and an emotion value vector, where in this embodiment the features can be mapped into the different types of vectors through several 1×1 convolutions; and obtaining the fused mouth shape feature according to the mouth shape query vector, the emotion key value vector and the emotion value vector.
In this embodiment, when S102 is executed to obtain the fused mouth shape feature according to the mouth shape query vector, emotion key value vector and emotion value vector, the following calculation formula may be used:
M_fused = softmax(Q_M · K_E^T / √d) · V_E
In the above formula, M_fused represents the fused mouth shape feature, Q_M the mouth shape query vector, K_E the emotion key value vector, V_E the emotion value vector, and d the dimension of the vectors.
Likewise, when S102 is executed to obtain the fused emotion feature according to the source mouth shape feature extracted from the audio data and the source emotion feature extracted from the emotion information, the present embodiment may adopt the following optional implementation: mapping the source emotion feature into an emotion query vector, and mapping the source mouth shape feature into a mouth shape key value vector and a mouth shape value vector, where in this embodiment the features can be mapped into the different types of vectors through several 1×1 convolutions; and obtaining the fused emotion feature according to the emotion query vector, the mouth shape key value vector and the mouth shape value vector.
In this embodiment, when S102 is executed to obtain the fused emotion feature according to the emotion query vector, the mouth shape key value vector and the mouth shape value vector, the following calculation formula may be used:
E_fused = softmax(Q_E · K_M^T / √d) · V_M
In the above formula, E_fused represents the fused emotion feature, Q_E the emotion query vector, K_M the mouth shape key value vector, V_M the mouth shape value vector, and d the dimension of the vectors.
In addition, in this embodiment, when S102 is executed to obtain a fused mouth shape feature or a fused emotion feature according to the source mouth shape feature and the source emotion feature, the source mouth shape feature and the source emotion feature may be input into a feature fusion network, and an output result of the feature fusion network is used as the fused mouth shape feature and/or the fused emotion feature, that is, the fused mouth shape feature and the fused emotion feature in this embodiment may be the same.
The feature fusion network is trained in advance according to a sample feature pair (comprising a sample mouth shape feature and a sample emotion feature) and fusion features of the sample feature pair.
That is, in this embodiment, by fusing the source mouth shape feature and the source emotion feature, a fused mouth shape feature for obtaining the mouth shape driving parameter and a fused emotion feature for obtaining the emotion driving parameter are generated, so that the fused feature includes the source mouth shape feature and the source emotion feature, the accuracy of the obtained fused feature is improved, and the accuracy of the driving parameter obtained according to the fused feature can be correspondingly improved.
After S102 is executed to obtain the fused mouth shape feature and the fused emotion feature, S103 is executed to obtain the mouth shape driving parameters according to the fused mouth shape feature and the emotion driving parameters according to the fused emotion feature; the driving parameters obtained in this embodiment may be blendshape weights.
In the embodiment, when executing S103 to obtain a mouth shape driving parameter according to the fused mouth shape feature, the mouth shape learning network may be input with the fused mouth shape feature, and an output result of the mouth shape learning network may be used as the mouth shape driving parameter; the mouth shape learning network is trained in advance according to the mouth shape characteristics of the sample and mouth shape driving parameters of the mouth shape characteristics of the sample.
In the embodiment, when S103 is executed to obtain the emotion driving parameter according to the fused emotion feature, the fused emotion feature may be input into the emotion learning network, and the output result of the emotion learning network is used as the emotion driving parameter; the emotion learning network is trained in advance according to the sample emotion characteristics and emotion driving parameters of the sample emotion characteristics.
In addition, when S103 is executed to obtain the mouth shape driving parameters according to the fused mouth shape feature, the present embodiment may also adopt the following manner: obtaining a target mouth shape feature according to the fused mouth shape feature and the source mouth shape feature, where the sum of the fused mouth shape feature and the source mouth shape feature can be used as the target mouth shape feature; and obtaining the mouth shape driving parameters according to the target mouth shape feature, for example by inputting the target mouth shape feature into the mouth shape learning network trained in advance.
Likewise, when S103 is executed to obtain the emotion driving parameters according to the fused emotion feature, the following manner may be adopted: obtaining a target emotion feature according to the fused emotion feature and the source emotion feature, where the sum of the fused emotion feature and the source emotion feature can be used as the target emotion feature; and obtaining the emotion driving parameters according to the target emotion feature, for example by inputting the target emotion feature into the emotion learning network trained in advance.
That is, the driving parameters can be obtained from target features that combine the source features and the fused features, so that the target mouth shape feature reflects the mouth shape more accurately and the target emotion feature reflects the emotion more accurately, thereby improving the accuracy of the obtained driving parameters.
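A minimal sketch of this residual combination followed by the two learning networks might look as follows; the MLP heads and the numbers of blendshape coefficients are illustrative assumptions, not the networks described as trained in this disclosure.

import torch
import torch.nn as nn

class DrivingParameterHeads(nn.Module):
    """Turns fused + source features into mouth shape and emotion driving
    parameters (blendshape weights). All sizes are illustrative."""
    def __init__(self, dim=256, n_mouth_bs=32, n_emotion_bs=20):
        super().__init__()
        self.mouth_net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_mouth_bs))
        self.emotion_net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_emotion_bs))

    def forward(self, m_src, m_fused, e_src, e_fused):
        m_target = m_fused + m_src                   # target mouth shape feature (residual sum)
        e_target = e_fused + e_src                   # target emotion feature (residual sum)
        mouth_params = self.mouth_net(m_target)      # mouth shape driving parameters
        emotion_params = self.emotion_net(e_target)  # emotion driving parameters
        return mouth_params, emotion_params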
In this embodiment, after the step S103 is performed to obtain the mouth shape driving parameter and the emotion driving parameter, the step S104 is performed to drive the three-dimensional face according to the mouth shape driving parameter and the emotion driving parameter.
In this embodiment, when S104 is executed, the mouth shape driving parameters and the emotion driving parameters are concatenated, and the concatenated result is input into the three-dimensional face model to drive the three-dimensional face; the three-dimensional face of this embodiment may be the face of an avatar or of a digital human.
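If the driving parameters are blendshape weights, driving the three-dimensional face model can be read as a weighted sum of blendshape offsets over a neutral mesh. The sketch below shows this standard formulation with made-up sizes; it is an assumption about the face model, not a detail given in this disclosure.

import numpy as np

def drive_face_mesh(base_vertices, blendshape_deltas, mouth_params, emotion_params):
    """base_vertices:     (V, 3) neutral face mesh
       blendshape_deltas: (N, V, 3) per-blendshape vertex offsets
       mouth_params / emotion_params: per-frame blendshape weights."""
    # The concatenated driving parameters index the face model's blendshapes.
    weights = np.concatenate([mouth_params, emotion_params])                    # (N,)
    return base_vertices + np.tensordot(weights, blendshape_deltas, axes=1)     # (V, 3)

# Toy usage with random data.
V, N_MOUTH, N_EMOTION = 5000, 32, 20
base = np.zeros((V, 3))
deltas = np.random.randn(N_MOUTH + N_EMOTION, V, 3) * 0.01
frame = drive_face_mesh(base, deltas, np.random.rand(N_MOUTH), np.random.rand(N_EMOTION))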
It can be understood that, if the embodiment executes S101 to obtain the emotion information input by the user, the embodiment can achieve the purpose of manually controlling the category and/or intensity of the emotion presented by the three-dimensional face.
Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure. A flow chart of this embodiment when driving a three-dimensional face is shown in Fig. 2: inputting the audio data into a mouth shape feature extraction network to obtain source mouth shape features; inputting the emotion information into an emotion feature extraction network to obtain source emotion features; fusing the source mouth shape features and the source emotion features to obtain fused mouth shape features and fused emotion features; inputting the fused mouth shape features into a mouth shape learning network to obtain mouth shape driving parameters, and inputting the fused emotion features into an emotion learning network to obtain emotion driving parameters; and concatenating the mouth shape driving parameters and the emotion driving parameters and inputting the result into a three-dimensional face model to drive the three-dimensional face.
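Putting the pieces together, the flow of Fig. 2 can be wired up roughly as follows, reusing the illustrative sketch modules introduced above; this is an assumption about how the components connect, not the actual implementation.

import torch

def drive_3d_face(audio_feats, emotion_id, extract_m, extract_e, fusion, heads, face_model):
    """audio_feats: (1, T, 80) acoustic features; emotion_id: (1,) category index.
    extract_m/extract_e/fusion/heads are the sketch modules above; face_model
    stands in for the three-dimensional face model driven by the parameters."""
    m_src = extract_m(audio_feats)                 # source mouth shape features (1, T, D)
    e_src = extract_e(emotion_id).unsqueeze(1)     # source emotion features     (1, 1, D)
    m_fused, e_fused = fusion(m_src, e_src)        # cross-attention fusion
    mouth_params, emotion_params = heads(m_src, m_fused, e_src, e_fused)
    # Concatenate the two groups of driving parameters and feed the face model.
    driving = torch.cat([mouth_params, emotion_params.expand(-1, mouth_params.shape[1], -1)], dim=-1)
    return face_model(driving)                     # per-frame driven three-dimensional face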
Fig. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in fig. 3, the driving apparatus 300 for a three-dimensional face of the present embodiment includes:
an acquiring unit 301, configured to acquire audio data and emotion information;
a fusion unit 302, configured to obtain a fusion mouth shape feature and a fusion emotion feature according to the source mouth shape feature extracted from the audio data and the source emotion feature extracted from the emotion information;
the processing unit 303 is configured to obtain a mouth shape driving parameter according to the fused mouth shape feature, and obtain an emotion driving parameter according to the fused emotion feature;
the driving unit 304 is configured to drive a three-dimensional face according to the mouth shape driving parameter and the emotion driving parameter.
The acquiring unit 301 may acquire emotion information according to the acquired audio data, or may acquire emotion information input by a user; the emotion information may be text corresponding to different emotion categories, where the emotion categories include happiness, excitement, sadness, anger, neutrality, surprise, disgust, and the like.
The acquiring unit 301 may perform emotion recognition on the audio data when acquiring emotion information from the audio data, and use the recognition result as emotion information.
In an actual scene, the audio data may or may not contain emotion; if the audio data does not contain emotion, the embodiment cannot acquire emotion information from the audio data through the audio emotion recognition algorithm.
In order to ensure that emotion information can be acquired from audio data, the acquisition unit 301 may also employ the following means: converting the audio data into text; and carrying out emotion recognition on the text, and taking a recognition result as emotion information.
It may be appreciated that the acquiring unit 301 may first perform emotion recognition on the audio data using an audio emotion recognition algorithm, and if emotion information cannot be acquired, then perform emotion recognition on the text obtained by converting the audio data using a text emotion recognition algorithm to acquire the emotion information.
The acquiring unit 301 may also adopt the following manner when acquiring the emotion information: determining the driving scene of the three-dimensional face and acquiring the emotion information corresponding to the determined driving scene; the acquiring unit 301 may obtain the emotion information through a preset correspondence between driving scenes and emotion information, which records the emotion information corresponding to each driving scene.
That is, the acquiring unit 301 can acquire the emotion information according to the driving scene of the three-dimensional face, which avoids both a mismatch between the acquired emotion information and the driving scene and the case where no emotion information can be acquired from the audio data; for example, the three-dimensional face is prevented from presenting an excited emotion in a news broadcasting scene, thereby further improving the driving effect of the three-dimensional face.
In addition, the acquiring unit 301 may acquire emotion information in more than one of the above manners. If emotion information input by the user exists, it is taken as the final emotion information; if no user input exists, the emotion information acquired from the driving scene is taken as the final emotion information; otherwise, the emotion information acquired according to the audio data is taken as the final emotion information.
In this embodiment, after the obtaining unit 301 obtains the audio data and the emotion information, the fusion unit 302 obtains the fused mouth shape feature and the fused emotion feature according to the source mouth shape feature extracted from the audio data and the source emotion feature extracted from the emotion information.
The fusion unit 302 may input the audio data into the mouth shape feature extraction network, and take the output result of the mouth shape feature extraction network as the source mouth shape feature; the emotional information may be input into an emotional feature extraction network, and an output result of the emotional feature extraction network may be used as a source emotional feature.
There is a certain correlation between audio and emotion; for example, the mouth shape is richer when the emotion is strong, and the emotion tends to be stronger when the pitch is high.
Therefore, after obtaining the source mouth shape feature and the source emotion feature, the fusion unit 302 fuses the source mouth shape feature and the source emotion feature, thereby obtaining the fusion mouth shape feature and the fusion emotion feature, so that the fusion mouth shape feature includes the source emotion feature, and the fusion emotion feature includes the source mouth shape feature.
The fusion unit 302 may adopt the following alternative implementation manners when obtaining the fusion mouth shape feature according to the source mouth shape feature extracted from the audio data and the source emotion feature extracted from the emotion information: mapping the source mouth shape characteristics into mouth shape query vectors, and mapping the source emotion characteristics into emotion key value vectors and emotion value vectors; and obtaining the fused mouth shape characteristic according to the mouth shape query vector, the emotion key value vector and the emotion value vector.
The fusion unit 302 may use the following calculation formula when obtaining the fused mouth shape feature according to the mouth shape query vector, emotion key value vector and emotion value vector:
M_fused = softmax(Q_M · K_E^T / √d) · V_E
In the above formula, M_fused represents the fused mouth shape feature, Q_M the mouth shape query vector, K_E the emotion key value vector, V_E the emotion value vector, and d the dimension of the vectors.
Likewise, when the fusion unit 302 obtains the fusion emotion feature according to the source mouth shape feature extracted from the audio data and the source emotion feature extracted from the emotion information, alternative implementation manners may be: mapping the source emotion characteristics into emotion query vectors, and mapping the source mouth shape characteristics into mouth shape key value vectors and mouth shape value vectors; and obtaining the fusion emotion characteristics according to the emotion inquiry vector, the mouth shape key value vector and the mouth shape value vector.
The fusion unit 302 may use the following calculation formula when obtaining the fused emotion feature according to the emotion query vector, the mouth shape key value vector and the mouth shape value vector:
E_fused = softmax(Q_E · K_M^T / √d) · V_M
In the above formula, E_fused represents the fused emotion feature, Q_E the emotion query vector, K_M the mouth shape key value vector, V_M the mouth shape value vector, and d the dimension of the vectors.
In addition, when the fusion unit 302 obtains the fusion mouth shape feature or the fusion emotion feature according to the source mouth shape feature and the source emotion feature, the source mouth shape feature and the source emotion feature may be input into the feature fusion network, and the output result of the feature fusion network is used as the fusion mouth shape feature and/or the fusion emotion feature, that is, the fusion mouth shape feature and the fusion emotion feature in this embodiment may be the same.
The feature fusion network is trained in advance according to a sample feature pair (comprising a sample mouth shape feature and a sample emotion feature) and fusion features of the sample feature pair.
That is, the fusion unit 302 generates the fusion mouth shape feature for obtaining the mouth shape driving parameter and the fusion emotion feature for obtaining the emotion driving parameter by fusing the source mouth shape feature and the source emotion feature, so that the fusion feature comprises the source mouth shape feature and the source emotion feature, the accuracy of the obtained fusion feature is improved, and the accuracy of the driving parameter obtained according to the fusion feature can be correspondingly improved.
In this embodiment, after the fusion unit 302 obtains the fusion mouth shape feature and the fusion emotion feature, the processing unit 303 obtains the mouth shape driving parameters according to the fusion mouth shape feature, and obtains the emotion driving parameters according to the fusion emotion feature.
When obtaining the mouth shape driving parameters according to the fused mouth shape characteristics, the processing unit 303 may input the fused mouth shape characteristics into the mouth shape learning network, and take the output result of the mouth shape learning network as the mouth shape driving parameters; the mouth shape learning network is trained in advance according to the mouth shape characteristics of the sample and mouth shape driving parameters of the mouth shape characteristics of the sample.
When obtaining the emotion driving parameters according to the fused emotion characteristics, the processing unit 303 may input the fused emotion characteristics into an emotion learning network, and take the output result of the emotion learning network as the emotion driving parameters; the emotion learning network is trained in advance according to the sample emotion characteristics and emotion driving parameters of the sample emotion characteristics.
In addition, when obtaining the mouth shape driving parameters according to the fused mouth shape characteristics, the processing unit 303 may further adopt the following manner: obtaining a target mouth shape characteristic according to the fused mouth shape characteristic and the source mouth shape characteristic; and obtaining the mouth shape driving parameters according to the target mouth shape characteristic.
The processing unit 303 may further use the following manner when obtaining the emotion driving parameter according to the fused emotion characteristics: obtaining target emotion characteristics according to the fused emotion characteristics and the source emotion characteristics; and obtaining emotion driving parameters according to the target emotion characteristics.
That is, the processing unit 303 may obtain the driving parameters according to the target features obtained by the source features and the fusion features, so that the target mouth shape features can more accurately reflect the mouth shape, and the target emotion features can more accurately reflect the emotion, thereby improving the accuracy of the obtained driving parameters.
In this embodiment, after the processing unit 303 obtains the mouth shape driving parameters and the emotion driving parameters, the driving unit 304 drives the three-dimensional face according to the mouth shape driving parameters and the emotion driving parameters.
The driving unit 304 may concatenate the mouth shape driving parameters and the emotion driving parameters, and then input the concatenated result into the three-dimensional face model to drive the three-dimensional face; the three-dimensional face of this embodiment may be the face of an avatar or of a digital human.
In the technical solution of the present disclosure, the acquisition, storage, and application of the user personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 4 shows a block diagram of an electronic device for implementing the driving method of a three-dimensional face according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the apparatus 400 includes a computing unit 401 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In RAM403, various programs and data required for the operation of device 400 may also be stored. The computing unit 401, ROM402, and RAM403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
Various components in device 400 are connected to I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, etc.; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408, such as a magnetic disk, optical disk, etc.; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 401 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 401 performs the respective methods and processes described above, for example, a driving method of a three-dimensional face. For example, in some embodiments, the method of driving a three-dimensional face may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 408.
In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM402 and/or the communication unit 409. When the computer program is loaded into the RAM403 and executed by the computing unit 401, one or more steps of the driving method of the three-dimensional face described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the driving method of the three-dimensional face by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here can be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable three-dimensional face-driven device such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a presentation device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for presenting information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (17)

1. A driving method of a three-dimensional human face comprises the following steps:
acquiring audio data and emotion information;
according to the source mouth shape characteristics extracted from the audio data and the source emotion characteristics extracted from the emotion information, obtaining fusion mouth shape characteristics and fusion emotion characteristics;
obtaining mouth shape driving parameters according to the fusion mouth shape characteristics, and obtaining emotion driving parameters according to the fusion emotion characteristics;
and driving the three-dimensional face according to the mouth shape driving parameters and the emotion driving parameters.
2. The method of claim 1, wherein the obtaining emotion information comprises:
determining a driving scene of the three-dimensional face;
and acquiring emotion information corresponding to the driving scene.
3. The method of claim 1, wherein the obtaining emotion information comprises:
converting the audio data into text;
and carrying out emotion recognition on the text, and taking a recognition result as the emotion information.
4. The method of claim 1, wherein the deriving a fused mouth shape feature from the source mouth shape feature extracted from the audio data and the source emotion feature extracted from the emotion information comprises:
mapping the source mouth shape characteristics into mouth shape query vectors, and mapping the source emotion characteristics into emotion key value vectors and emotion value vectors;
and obtaining the fusion mouth shape feature according to the mouth shape query vector, the emotion key value vector and the emotion value vector.
5. The method of claim 1, wherein the deriving a fused emotional feature from the source mouth shape feature extracted from the audio data and the source emotional feature extracted from the emotional information comprises:
mapping the source emotion characteristics into emotion query vectors, and mapping the source mouth shape characteristics into mouth shape key value vectors and mouth shape value vectors;
and obtaining the fused emotion characteristics according to the emotion inquiry vector, the mouth shape key value vector and the mouth shape value vector.
6. The method of claim 1, wherein the obtaining mouth shape driving parameters according to the fusion mouth shape characteristics comprises:
obtaining a target mouth shape characteristic according to the fusion mouth shape characteristic and the source mouth shape characteristic;
and obtaining the mouth shape driving parameters according to the target mouth shape characteristics.
7. The method of claim 1, wherein the obtaining emotion driving parameters according to the fusion emotion characteristics comprises:
obtaining target emotion characteristics according to the fused emotion characteristics and the source emotion characteristics;
and obtaining the emotion driving parameters according to the target emotion characteristics.
8. A driving apparatus of a three-dimensional face, comprising:
the acquisition unit is used for acquiring the audio data and the emotion information;
the fusion unit is used for obtaining fusion mouth shape characteristics and fusion emotion characteristics according to the source mouth shape characteristics extracted from the audio data and the source emotion characteristics extracted from the emotion information;
the processing unit is used for obtaining mouth shape driving parameters according to the fusion mouth shape characteristics and obtaining emotion driving parameters according to the fusion emotion characteristics;
and the driving unit is used for driving the three-dimensional face according to the mouth shape driving parameters and the emotion driving parameters.
9. The apparatus of claim 8, wherein the obtaining unit, when obtaining the emotion information, specifically performs:
determining a driving scene of the three-dimensional face;
and acquiring emotion information corresponding to the driving scene.
10. The apparatus of claim 8, wherein the obtaining unit, when obtaining the emotion information, specifically performs:
converting the audio data into text;
and carrying out emotion recognition on the text, and taking a recognition result as the emotion information.
11. The apparatus of claim 8, wherein the fusing unit, when obtaining a fused mouth shape feature from a source mouth shape feature extracted from the audio data and a source emotion feature extracted from the emotion information, specifically performs:
mapping the source mouth shape characteristics into mouth shape query vectors, and mapping the source emotion characteristics into emotion key value vectors and emotion value vectors;
and obtaining the fusion mouth shape feature according to the mouth shape query vector, the emotion key value vector and the emotion value vector.
12. The apparatus of claim 8, wherein the fusing unit, when obtaining a fused emotion feature from a source mouth shape feature extracted from the audio data and a source emotion feature extracted from the emotion information, specifically performs:
mapping the source emotion characteristics into emotion query vectors, and mapping the source mouth shape characteristics into mouth shape key value vectors and mouth shape value vectors;
and obtaining the fused emotion characteristics according to the emotion inquiry vector, the mouth shape key value vector and the mouth shape value vector.
13. The apparatus according to claim 8, wherein the processing unit, when deriving the mouth shape driving parameters from the fused mouth shape characteristics, specifically performs:
obtaining a target mouth shape characteristic according to the fusion mouth shape characteristic and the source mouth shape characteristic;
and obtaining the mouth shape driving parameters according to the target mouth shape characteristics.
14. The apparatus of claim 8, wherein the processing unit, when deriving emotion-driven parameters from the fused emotion features, specifically performs:
obtaining target emotion characteristics according to the fused emotion characteristics and the source emotion characteristics;
and obtaining the emotion driving parameters according to the target emotion characteristics.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202310141085.4A 2023-02-16 2023-02-16 Driving method and device for three-dimensional face, electronic equipment and readable storage medium Pending CN116310435A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310141085.4A CN116310435A (en) 2023-02-16 2023-02-16 Driving method and device for three-dimensional face, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310141085.4A CN116310435A (en) 2023-02-16 2023-02-16 Driving method and device for three-dimensional face, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN116310435A true CN116310435A (en) 2023-06-23

Family

ID=86780744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310141085.4A Pending CN116310435A (en) 2023-02-16 2023-02-16 Driving method and device for three-dimensional face, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116310435A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117453932A (en) * 2023-10-25 2024-01-26 深圳麦风科技有限公司 Virtual person driving parameter generation method, device and storage medium
CN117453932B (en) * 2023-10-25 2024-08-30 深圳麦风科技有限公司 Virtual person driving parameter generation method, device and storage medium

Similar Documents

Publication Publication Date Title
CN113407850B (en) Method and device for determining and acquiring virtual image and electronic equipment
CN114020950B (en) Training method, device, equipment and storage medium for image retrieval model
CN113052962B (en) Model training method, information output method, device, equipment and storage medium
CN113836278B (en) Training and dialogue generation method and device for universal dialogue model
CN113361363A (en) Training method, device and equipment for face image recognition model and storage medium
CN114549710A (en) Virtual image generation method and device, electronic equipment and storage medium
CN114723888B (en) Three-dimensional hair model generation method, device, equipment, storage medium and product
CN113641829B (en) Training and knowledge graph completion method and device for graph neural network
CN113870399B (en) Expression driving method and device, electronic equipment and storage medium
CN116310435A (en) Driving method and device for three-dimensional face, electronic equipment and readable storage medium
CN115170703A (en) Virtual image driving method, device, electronic equipment and storage medium
CN113962845B (en) Image processing method, image processing apparatus, electronic device, and storage medium
US11610396B2 (en) Logo picture processing method, apparatus, device and medium
CN113177466A (en) Identity recognition method and device based on face image, electronic equipment and medium
CN116778040A (en) Face image generation method based on mouth shape, training method and device of model
CN117171310A (en) Digital person interaction method and device, electronic equipment and storage medium
CN116402914A (en) Method, device and product for determining stylized image generation model
CN114969195B (en) Dialogue content mining method and dialogue content evaluation model generation method
CN113408298B (en) Semantic analysis method, semantic analysis device, electronic equipment and storage medium
CN116257611A (en) Question-answering model training method, question-answering processing device and storage medium
CN113808572B (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN115292467A (en) Information processing and model training method, apparatus, device, medium, and program product
CN114648601A (en) Virtual image generation method, electronic device, program product and user terminal
CN113327311A (en) Virtual character based display method, device, equipment and storage medium
CN113704256A (en) Data identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination