WO2023246163A9 - Virtual digital human driving method, apparatus, device, and medium - Google Patents

Virtual digital human driving method, apparatus, device, and medium

Info

Publication number
WO2023246163A9
Authority
WO
WIPO (PCT)
Prior art keywords
style
information
model
user
linear combination
Prior art date
Application number
PCT/CN2023/079026
Other languages
French (fr)
Chinese (zh)
Other versions
WO2023246163A1 (en)
Inventor
杨善松
成刚
刘韶
李绪送
付爱国
Original Assignee
海信视像科技股份有限公司
Priority date
Filing date
Publication date
Priority claimed from CN202210714001.7A external-priority patent/CN115270922A/en
Priority claimed from CN202210751784.6A external-priority patent/CN117370605A/en
Application filed by 海信视像科技股份有限公司
Publication of WO2023246163A1 publication Critical patent/WO2023246163A1/en
Publication of WO2023246163A9 publication Critical patent/WO2023246163A9/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present disclosure relates to the field of virtual digital human technology, and in particular, to a virtual digital human driving method, device, equipment and medium.
  • Virtual digital human is a multi-modal intelligent human-computer interaction technology that integrates computer vision, speech recognition, speech synthesis, natural language processing, terminal display and other technologies to create a highly anthropomorphic virtual image that interacts and communicates with people like a real person.
  • the present disclosure provides a virtual digital human driving method, including:
  • Obtain user information which includes voice information and image information
  • the physical movements of the virtual digital human are determined according to the reply text, and the emotional expression of the virtual digital human is determined according to the reply emotion.
  • the present disclosure also provides a computer device, including:
  • one or more processors;
  • a memory used to store one or more programs;
  • When the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described in any one of the first aspects.
  • the present disclosure also provides a computer-readable non-volatile storage medium on which a computer program is stored.
  • When the program is executed by a processor, the method described in any one of the first aspects is implemented.
  • Figure 1A is a schematic diagram of an application scenario of a virtual digital human driving process according to some embodiments
  • Figure 1B is a schematic structural diagram of a virtual digital human according to some embodiments.
  • Figure 2A is a hardware configuration block diagram of a computer device according to some embodiments.
  • Figure 2B is a schematic diagram of a software configuration of a computer device according to some embodiments.
  • Figure 2C is a schematic diagram showing an icon control interface of an application included in a smart device according to some embodiments.
  • Figure 3A is a schematic flowchart of a virtual digital human driving method according to some embodiments.
  • Figure 3B is a schematic diagram of the principle of a virtual digital human driving method according to some embodiments.
  • Figure 4A is a schematic flowchart of another virtual digital human driving method according to some embodiments.
  • Figure 4B is a schematic diagram of the principle of a virtual digital human driving method according to some embodiments.
  • Figure 4C is a schematic flowchart of yet another virtual digital human driving method according to some embodiments.
  • Figure 4D is a schematic flowchart of yet another virtual digital human driving method according to some embodiments.
  • Figure 5 is a schematic flowchart of a virtual digital human driving method according to some embodiments.
  • Figure 6 is a schematic flowchart of yet another virtual digital human driving method according to some embodiments.
  • Figure 7 is a schematic flowchart of yet another virtual digital human driving method according to some embodiments.
  • Figure 8 is a schematic diagram of a virtual digital human according to some embodiments.
  • Figure 9 is a schematic diagram of a virtual digital human according to some embodiments.
  • Figure 10 is a schematic diagram of the principle of generating a new speaking style according to some embodiments.
  • Figure 11 is a schematic diagram of a human-computer interaction scenario according to some embodiments.
  • Figure 12 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 13 is a schematic diagram of facial topology data divided into regions according to some embodiments.
  • Figure 14 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 15 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 16 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 17 is a schematic structural diagram of a framework of a speaking style model according to some embodiments.
  • Figure 18 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 19 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments.
  • Figure 20 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 21 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments.
  • Figure 22 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments.
  • Figure 23 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 24 is a schematic structural diagram of a computer device according to some embodiments.
  • the system design of virtual digital people usually consists of five modules: character image, speech generation, dynamic image generation, audio and video synthesis display, and interactive modeling.
  • Character images can be divided into two categories, 2D and 3D, and in terms of appearance into cartoon, anthropomorphic, realistic, hyper-realistic and other styles;
  • the speech generation module can generate corresponding character voices based on text;
  • the animation generation module can generate the dynamic image of a specific character based on speech or text;
  • the audio and video synthesis display module synthesizes speech and dynamic images into a video, and finally displays it to the user;
  • the interactive module enables the digital human to have interactive functions, that is, it recognizes the user's intention through intelligent technologies such as speech and semantic recognition, determines the digital human's subsequent voice and actions based on the user's current intention, and drives the character to start the next round of interaction.
  • the embodiments of the present disclosure first obtain user information, which includes voice information and image information; then determine the user intention and user emotion based on the user information; and finally determine the body movements of the virtual digital human based on the user intention and determine the emotional expression of the virtual digital human based on the user emotion. That is, the acquired user voice information and user image information are processed to determine the user's intention and emotion, the body movements of the virtual digital human are then determined from the user's intention, and the emotional expression of the virtual digital human is determined from the user's emotion, allowing the virtual digital human to faithfully restore the user's intention and emotion and improving the fidelity and naturalness of expression of the virtual digital human.
  • FIG. 1A is a schematic diagram of an application scenario of a virtual digital human driving process in an embodiment of the present disclosure.
  • the virtual digital human driving process can be used in the interaction scenario between users and smart terminals.
  • the smart terminals in this scenario include smart blackboards, smart large screens, smart speakers, smart phones, etc.
  • Examples of virtual digital people include virtual teachers, virtual brand images, virtual assistants, virtual shopping guides, virtual anchors, etc.
  • when the user wants to interact with the smart terminal, a voice command is issued.
  • the smart terminal collects the user's voice information and collects the user's image information.
  • the virtual digital human driving method provided by the embodiments of the present disclosure can be implemented based on computer equipment, or functional modules or functional entities in the computer equipment.
  • the computer equipment can be a personal computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a mainframe computer, etc., which are not specifically limited in the embodiments of the present disclosure.
  • FIG. 2A is a hardware configuration block diagram of a computer device according to one or more embodiments of the present disclosure.
  • the computer equipment includes at least one of: a tuner and demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280.
  • the controller 250 includes a central processing unit, a video processor, an audio processor, a graphics processor, a RAM, a ROM, and first to nth interfaces for input/output.
  • the display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen.
  • the tuner and demodulator 210 receives broadcast television signals through wired or wireless reception methods, and demodulates audio and video signals, such as EPG audio and video data signals, from multiple wireless or wired broadcast and television signals.
  • the communicator 220 is a component for communicating with external devices or servers according to various communication protocol types.
  • the communicator may include at least one of a Wifi module, a Bluetooth module, a wired Ethernet module, other network communication protocol chips or near field communication protocol chips, and an infrared receiver.
  • the computer device can establish the transmission and reception of control signals and data signals with the server or local control device through the communicator 220 .
  • the detector 230 is used to collect signals from the external environment or interactions with the outside.
  • the controller 250 controls the overall operation of the computer device and responds to user operations through various software control programs stored in the memory.
  • the user may input a user command into a graphical user interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the graphical user interface (GUI).
  • the user can input a user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
  • FIG. 2B is a schematic diagram of the software configuration of a computer device according to one or more embodiments of the present disclosure. As shown in Figure 2B, the system is divided into four layers. From top to bottom, they are the Applications layer (the "application layer"), the Application Framework layer (the "framework layer"), the Android runtime and system library layer (the "system runtime library layer"), and the kernel layer.
  • FIG. 2C is a schematic diagram showing the icon control interface of an application included in a smart terminal (mainly a smart playback device, such as a smart TV, a digital cinema system or an audio-visual server, etc.) according to one or more embodiments of the present disclosure, as shown in Figure 2C
  • the application layer contains at least one application that can display a corresponding icon control on the display, such as: a Live TV application icon control, a Video on Demand (VOD) application icon control, a Media Center application icon control, an Application Center icon control, game application icon controls, etc.
  • Live TV app that provides live TV from different sources.
  • Video on demand VOD application that can provide videos from different storage sources.
  • video on demand offers the display of video from certain storage sources.
  • Media center application can provide various multimedia content playback applications.
  • the application center can provide storage for various applications.
  • FIG. 3A is a schematic flowchart of a virtual digital human driving method provided by an embodiment of the present disclosure
  • FIG. 3B is a schematic principle diagram of a virtual digital human driving method provided by an embodiment of the present disclosure.
  • This embodiment can be applied to the situation of driving a virtual digital human.
  • the method of this embodiment can be executed by an intelligent terminal, which can be implemented in hardware and/or software and can be configured in a computer device.
  • the method specifically includes the following steps:
  • the smart terminal includes a sound sensor and a visual sensor. The sound sensor can be, for example, a microphone array; the visual sensor includes 2D and 3D visual sensors and can be, for example, a camera.
  • the smart terminal collects voice information through sound sensors and image information through visual sensors.
  • the voice information includes semantic information and acoustic information
  • the image information includes scene information and user image information.
  • after the terminal device collects voice information based on the sound sensor, it can determine the user's intention based on the semantic information included in the voice information, that is, how the user expects to drive the virtual digital human to act. After collecting the image information based on the visual sensor, it can determine the facial expression of the user who sent the voice message from the collected image information, and based on the user's facial expression in the collected image information, determine the emotion the user expects the virtual digital human to express.
  • the virtual digital person's reply text can be determined based on the user intention, such as the text corresponding to the virtual digital person's reply voice
  • the virtual digital person's reply emotion can be determined based on the user intention and user emotion, that is,
  • the emotional expression required for the virtual digital person's reply is determined according to the user's intention, and the emotion required for the virtual digital person's reply is determined based on the emotion expressed by the user.
  • the emotion expressed by the user is a sad emotion
  • the emotion that the virtual digital person needs to express in reply is also a sad emotion.
  • the reply text of the virtual digital person is determined based on the user intention
  • the reply emotion of the virtual digital person is determined based on the user intention and user emotion
  • the body movements of the virtual digital person are determined based on the reply text
  • Determine the emotional expression of the virtual digital human based on the reply emotion. That is, multi-modal human-computer interaction information perception capabilities for speech recognition and image recognition are first established, and the user intention and user emotion are then determined from the acquired voice information and image information.
  • the reply text of the virtual digital person is determined based on the user's intention, and the reply emotion of the virtual digital person is determined based on the user's intention and user emotion.
  • emotional expressions and body movements are generated to realize the synthesis of the virtual digital person's voice, expressions, movements, etc.
  • the virtual digital human driving method provided by the embodiment of the present disclosure first obtains user information, that is, voice information and image information; then determines the user intention and user emotion based on the user information; determines the virtual digital human's reply text based on the user intention and the virtual digital human's reply emotion based on the user intention and user emotion; and finally determines the virtual digital human's body movements based on the reply text and the virtual digital human's emotional expression based on the reply emotion. This achieves a natural, anthropomorphic virtual human interaction state and improves the fidelity and naturalness of expression of the virtual digital human.
  • FIG. 4A is a schematic flowchart of another virtual digital human driving method provided by an embodiment of the present disclosure.
  • FIG. 4B is a schematic diagram of another virtual digital human driving method provided by the present disclosure.
  • step S20 includes:
  • S201 Process the voice information and determine the text information and voice emotion information corresponding to the voice information.
  • step S201 includes:
  • the voice recognition module performs text transcription processing on the acquired voice information, that is, the voice information is converted into text information corresponding to the voice information.
  • the terminal device can input voice information into an automatic speech recognition (Automatic Speech Recognition, ASR) engine set offline to obtain text information output by the ASR engine.
  • the terminal device may continue to wait for the user to input voice. If the start of a human voice is recognized based on Voice Activity Detection (VAD), recording will continue. If the end of the human voice is recognized based on VAD, the recording will stop. The terminal device can use the recorded audio as user voice information. The terminal device can then input the user's voice information into the ASR engine to obtain text information corresponding to the user's voice information.
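  • As a rough illustration of the VAD-gated recording and ASR transcription described above, a minimal Python sketch follows. The webrtcvad package is used only as a convenient stand-in VAD, and the transcribe() function is a hypothetical placeholder, since the disclosure does not name a specific VAD implementation or ASR engine; the 16 kHz, 16-bit mono PCM format and 30 ms frame size are also assumptions.

```python
# Sketch only: keep audio from the first detected speech frame until ~600 ms of
# silence, then hand it to an (unspecified) offline ASR engine.
import wave

import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples -> 2 bytes each


def pcm_frames(path):
    """Yield fixed-size 30 ms PCM frames from a 16 kHz, 16-bit mono WAV file."""
    with wave.open(path, "rb") as wav:
        data = wav.readframes(wav.getnframes())
    for i in range(0, len(data) - FRAME_BYTES + 1, FRAME_BYTES):
        yield data[i:i + FRAME_BYTES]


def record_utterance(path, max_trailing_silence=20):
    """Record from the start of speech until max_trailing_silence silent frames."""
    vad = webrtcvad.Vad(2)                          # aggressiveness 0 (lenient) .. 3 (strict)
    recorded, started, silent = bytearray(), False, 0
    for frame in pcm_frames(path):
        if vad.is_speech(frame, SAMPLE_RATE):
            started, silent = True, 0
            recorded.extend(frame)
        elif started:
            silent += 1
            if silent > max_trailing_silence:       # end of the human voice detected
                break
    return bytes(recorded)


def transcribe(audio_bytes):
    """Hypothetical placeholder for the offline ASR engine mentioned in the text."""
    raise NotImplementedError("plug in an ASR engine here")
```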
  • Voiceprint features are the sound wave spectrum carrying speech information displayed by electroacoustic instruments. Voiceprint features represent the different wavelengths, frequencies, intensities, and rhythms of different sounds, that is, the pitch, intensity, length, and timbre of the user's voice; different users have different voiceprint characteristics. By extracting voiceprint features from the voice information, the emotional information expressed by the user corresponding to the voice information, that is, the voice emotion information, can be obtained.
  • S202 Process the image information and determine the scene information and image emotion information corresponding to the image information.
  • step S202 includes:
  • S2021. Preprocess the image information to determine the scene key point information and user key point information included in the image information.
  • the scene key point information refers to the key points of the scene in which the user is located in the image information in addition to the user information.
  • the user key point information refers to the key points of the user's limbs or facial features in the image information.
  • the image information collected by the terminal device shows a teacher standing in front of a blackboard, that is, the scene key point information included in the image information is the blackboard, and the user key point information included in the image is the user's eyes, mouth, arms, legs, etc.
  • the scene information of the terminal device can be determined, that is, in which scene the terminal device is applied.
  • a scene recognition model is constructed based on algorithms such as entity recognition, entity linking, and entity alignment, and then the image information of different application scenarios in the knowledge base is preprocessed.
  • After obtaining the scene key point information corresponding to the image information of different application scenarios, the scene key point information is input into the scene recognition model to train the scene recognition model until it converges, and the target scene recognition model is determined.
  • graph mapping, information extraction and other methods are used to preprocess the acquired image information to obtain the scene key point information corresponding to the image.
  • the scene key point information obtained after preprocessing is input into the target scene recognition model to perform scene recognition, which ensures the accuracy of the scene recognition results.
  • the user's emotions collected by the terminal device can be determined, that is, the emotions expressed by the user included in the image information collected by the terminal device.
  • S203 Determine user intention based on text information and scene information.
  • the body movements that the user expects to drive the virtual digital human to perform can be determined based on the text information and then combined with the determined scene information, which further ensures the coordination accuracy of the body movements with which the terminal device drives the virtual digital human based on the text information.
  • S204 Determine user emotion based on text information, voice emotion information and image emotion information.
  • the emotion expressed by the user can be roughly determined based on the text information, and then by fusing the voice emotion information and the image emotion information, the virtual digital human can be accurately driven to express the user's emotion and improve the fidelity of the virtual digital human.
  • the virtual digital human determination method first determines the text information and voice emotion information corresponding to the voice information by processing the voice information, determines the scene information and image emotion information corresponding to the image information by processing the image information, then determines the user's intention based on the text information and scene information, and determines the user's emotion based on the text information, voice emotion information, and image emotion information. That is, the body movements the user expects to drive can be determined from the text information and refined with the scene information to ensure the coordination accuracy of the driven body movements, while the user's emotion can be roughly determined from the text information and then refined by fusing the voice emotion information and image emotion information, so that the virtual digital human accurately expresses the user's emotion and its fidelity is improved.
  • Figure 5 is a schematic flow chart of another virtual digital human driving method provided by an embodiment of the present disclosure.
  • the embodiment of the present disclosure is based on the embodiment corresponding to Figure 4C. As shown in Figure 5, before step S2012, it also includes:
  • By constructing highly robust voiceprint recognition and voiceprint clustering technologies, automatic login of multiple users is achieved through the voice modality, while paralinguistic information such as gender and accent is extracted to establish basic user information. To address the difficulty of clustering speech features when the number of target classes is uncertain, and the impact of speech channel interference on classification and clustering, a noisy density-space unsupervised clustering technique is combined with stochastic linear discriminant analysis to achieve highly reliable voiceprint classification and clustering and to reduce the impact of channel interference on voiceprint recognition. That is, in this disclosure, a speech recognition model is constructed that can adapt to different paralinguistic information, and the accuracy of speech recognition is high.
  • the voice feature vector in the voice information is first extracted.
  • the voice feature vector includes: an accent feature vector, a gender feature vector, an age feature vector, etc.
  • the speech recognition model includes an acoustic model and a language model.
  • the acoustic model includes a convolutional neural network model with an attention mechanism
  • the language model includes a deep neural network model.
  • the speech recognition model constructed in this disclosure jointly models the acoustic model and the language model.
  • Column convolution and attention mechanism are used to build the acoustic model, and the speech feature vector is added as a condition in the convolution layer of the convolutional neural network model to adapt to different speech features.
  • a deep-neural-network-based model structure that can be quickly adjusted and configured is implemented, and the voice characteristics of different paralinguistic information are adapted to through user-specific voiceprints to improve the accuracy of speech recognition.
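  • As one hedged sketch of how a convolution layer can be conditioned on a speech feature vector carrying paralinguistic information, the PyTorch fragment below adds a learned, feature-dependent bias to each convolution block; the layer sizes, the additive conditioning scheme, and the attention placement are illustrative assumptions, not the architecture claimed by the disclosure.

```python
# Illustrative only: a small conv + attention acoustic model whose conv blocks
# take a paralinguistic condition vector (accent / gender / age features).
import torch
import torch.nn as nn


class ConditionedConvBlock(nn.Module):
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.cond = nn.Linear(cond_dim, channels)   # project condition to a channel bias

    def forward(self, x, cond):                     # x: (B, C, T), cond: (B, cond_dim)
        bias = self.cond(cond).unsqueeze(-1)        # (B, C, 1), broadcast over time
        return torch.relu(self.conv(x) + bias)


class AcousticModel(nn.Module):
    def __init__(self, feat_dim=80, channels=256, cond_dim=16, vocab=5000):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, channels, kernel_size=1)
        self.blocks = nn.ModuleList(
            ConditionedConvBlock(channels, cond_dim) for _ in range(4))
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.out = nn.Linear(channels, vocab)

    def forward(self, feats, cond):                 # feats: (B, T, feat_dim)
        x = self.proj(feats.transpose(1, 2))
        for block in self.blocks:
            x = block(x, cond)
        x = x.transpose(1, 2)                       # (B, T, C)
        x, _ = self.attn(x, x, x)
        return self.out(x)                          # per-frame token logits
```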
  • step S2012 may include:
  • the speech information can be transcribed into text based on the speech recognition model to improve the accuracy of the speech recognition results.
  • Figure 6 is a schematic flow chart of another virtual digital human driving method provided by an embodiment of the present disclosure.
  • the embodiment of the present disclosure is based on the corresponding embodiment of Figure 4A.
  • the specific implementation of step S40 includes:
  • action identifiers include: lifting, stretching, blinking, opening, etc.
  • key point driving includes speech content separation, content key point driving, speaker key point driving, a key-point-based image generation module, a key-point-based image stretching module, etc. Therefore, first, based on parsing the text information transcribed from the speech information, the action identifiers and key point identifiers included in the text information are obtained.
  • the body movements of the virtual digital human are selected from the preset action database corresponding to the scene information, such as raising the head, raising the leg, etc.
  • the preset action database includes action type definition, action arrangement, action connection, etc.
  • S403. Determine the emotional expression of key points of the virtual digital human based on the voice emotional information and the image emotional information.
  • examples of emotional expressions for determining the key points of the virtual digital human can be smiling with the mouth, clapping with both hands, etc.
  • the deep learning method is used to learn the mapping of virtual human key points and voice feature information, as well as the mapping of human face key points and voice emotion information and image emotion information.
  • the virtual digital human driving method provided by the embodiments of the present disclosure realizes the generation of voice-driven virtual digital human animation with controllable expressions by integrating emotional key point templates.
  • Figure 7 is a schematic flow chart of another virtual digital human driving method provided by an embodiment of the present disclosure.
  • the embodiment of the present disclosure is based on the embodiment corresponding to Figure 6. As shown in Figure 7, before step S401, it also includes:
  • different features are extracted from speech to drive head movements, facial movements, and body movements respectively, forming a more vivid speech driving method.
  • the image of the virtual digital human is driven based on the deep neural network method, and a generative adversarial network is applied for high-fidelity real-time generation.
  • the image generation of the virtual digital human is divided into action driving and image library production.
  • the hair library, clothing library, and tooth model of the digital human image are produced offline, and the image can be produced in a targeted manner according to different application scenarios.
  • the motion driver module of the virtual digital human is processed on the server side, and then the topological vertex data is encapsulated and transmitted, and texture mapping, rendering output, etc. are performed on the device side.
  • the key point driving technology based on the adversarial network, the feature point geometric stretching method, and the image transformation and generation technology based on the Encoder-Decoder method are used to realize the driving of virtual digital humans.
  • using the emotional key point template method, the corresponding relationship between the user key points and the preset user emotional key points is established to realize the emotional expression of virtual digital people.
  • 3D face driving technology based on deep coding and decoding technology to realize semantic mapping of speech features and vertex three-dimensional motion features
  • rhythmic head driving technology based on a deep codec nested temporal network, with the ability to discriminatively control head movement and facial activity.
  • Embodiments of the present disclosure provide a computer device, including: one or more processors; and a memory for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any one of the embodiments of the present disclosure.
  • the smart device shown in Figure 8 displays the expression and mouth shape of the virtual digital person when speaking while playing the voice response information, and as shown in Figure 9, the user can not only hear the voice of the virtual digital person but also see the expression of the virtual digital person when speaking, giving the user the experience of talking to a person.
  • based on the basic speaking styles, as shown in Figure 10, a new speaking style can be generated. Since retraining the speaking style model requires a lot of time to collect training samples and process a large amount of data, generating a new speaking style takes a long time, making speaking style generation relatively inefficient.
  • the present disclosure determines the fitting coefficient of each style feature attribute by fitting the target style feature attribute based on multiple style feature attributes; determines the target style feature vector according to the fitting coefficient of each style feature attribute and multiple style feature vectors, where the multiple style feature vectors correspond to the multiple style feature attributes one-to-one; inputs the target style feature vector into the speaking style model and outputs the target speaking style parameters, where the speaking style model is obtained by training the framework of the speaking style model based on the multiple style feature vectors; and generates the target speaking style based on the target speaking style parameters. In this way, the target style feature vector can be fitted from the multiple style feature vectors. Since the speaking style model is trained based on the multiple style feature vectors, inputting the target style feature vector fitted from the multiple style feature vectors into the speaking style model can directly yield the corresponding new speaking style. There is no need to retrain the speaking style model, which enables rapid transfer of speaking styles and improves the efficiency of speaking style generation.
  • FIG 11 is a schematic diagram of a human-computer interaction scenario according to some embodiments.
  • smart devices may include smart refrigerator 110, smart washing machine 120, smart display device 130, etc.
  • a user wants to control a smart device, he or she needs to issue a voice command first.
  • after the smart device receives the voice command, it needs to perform semantic understanding of the voice command, determine the semantic understanding result corresponding to the voice command, and execute the corresponding control instructions according to the semantic understanding result to meet the user's needs.
  • the smart devices in this scenario all include a display screen, which can be a touch screen or a non-touch screen.
  • for terminal devices with touch screens, users can perform interactive operations with the terminal device through gestures, fingers, or touch tools (such as stylus pens).
  • interactive operations with the terminal device can be implemented through external devices (for example, a mouse or a keyboard, etc.).
  • the display screen can display a three-dimensional virtual person, and the user can see the three-dimensional virtual person and his or her expression when speaking through the display screen, thereby realizing dialogue and interaction with the three-dimensional virtual person.
  • the speaking style generation method provided by the embodiments of the present disclosure can be implemented based on a computer device, or a functional module or functional entity in the computer device.
  • the computer device may be a personal computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a large computer, etc., which are not specifically limited in the embodiments of the present disclosure.
  • Figure 12 is a schematic flowchart of a speaking style generation method according to some embodiments. As shown in Figure 12, the method specifically includes the following steps:
  • each frame of facial topology data corresponds to a dynamic face topology
  • the face topology includes multiple vertices.
  • Each vertex in the dynamic face topology structure corresponds to a vertex coordinate (x, y, z).
  • the vertex coordinates of each vertex in the static face topology are (x', y', z').
  • based on the vertex coordinates of corresponding vertices in the dynamic and static face topology, the vertex offset of each vertex, (Δx, Δy, Δz) = (x - x', y - y', z - z'), and in turn the average vertex offset of each vertex in the dynamic face topology, can be determined.
  • FIG 13 is a schematic diagram of facial topology data divided into regions according to an embodiment of the present disclosure.
  • facial topology data can be divided into multiple regions.
  • facial topology data can be divided into three regions, S1, S2 and S3, where S1 is the entire facial area above the lower edge of the eyes, S2 is the facial area from the lower edge of the eyes to the upper edge of the upper lip, and S3 is the facial area from the upper edge of the upper lip to the chin.
  • the average vertex offset of all vertices of the dynamic face topology within area S1 (denoted ΔS1), the average vertex offset of all vertices within area S2 (denoted ΔS2), and the average vertex offset of all vertices within area S3 (denoted ΔS3) can be determined.
  • splicing these per-region averages yields the style feature attribute, that is, P = [ΔS1, ΔS2, ΔS3]. To sum up, one style feature attribute can be obtained for one user, and multiple style feature attributes can be obtained based on multiple users.
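  • The per-region averaging and splicing described above can be sketched as follows; the three-region split and the frame dimension follow the text, while the array shapes and function names are illustrative assumptions.

```python
# Sketch: average the per-vertex offsets between dynamic and static topology over
# all frames and all vertices of each region, then splice the region averages.
import numpy as np


def style_feature_attribute(dynamic_frames, static_vertices, region_masks):
    """dynamic_frames: (F, V, 3) vertex coordinates per frame
    static_vertices: (V, 3) neutral (static) face topology
    region_masks: boolean arrays of shape (V,) for S1, S2, S3, in the preset order
    returns the spliced style feature attribute, shape (3 * 3,)"""
    offsets = dynamic_frames - static_vertices[None, :, :]    # (F, V, 3) vertex offsets
    parts = [offsets[:, mask, :].mean(axis=(0, 1)) for mask in region_masks]
    return np.concatenate(parts)                              # [avg(S1), avg(S2), avg(S3)]
```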
  • a new style feature attribute can be fitted and formed, that is, the target style feature attribute.
  • the target style feature attribute can be obtained by fitting based on the following formula: P_target = a1*P1 + a2*P2 + ... + an*Pn, where Pi denotes the style feature attribute of user i.
  • a1 is the fitting coefficient of the style feature attribute of user 1
  • a2 is the fitting coefficient of the style feature attribute of user 2
  • an is the fitting coefficient of the style feature attribute of user n
  • optimization methods can be used, such as gradient descent method, Gauss-Newton method, etc., to obtain the fitting coefficient of each style characteristic attribute.
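  • A minimal sketch of obtaining the fitting coefficients: stacking the n style feature attributes as columns and solving a least-squares problem is one convenient choice; the disclosure itself only names gradient descent and Gauss-Newton as example optimization methods.

```python
# Fit the target style feature attribute as a1*P1 + a2*P2 + ... + an*Pn.
import numpy as np


def fit_coefficients(basic_attributes, target_attribute):
    """basic_attributes: (n, d) array, one style feature attribute per row
    target_attribute: (d,) target style feature attribute
    returns (n,) fitting coefficients a1..an"""
    A = basic_attributes.T                                    # columns are P1..Pn
    coeffs, *_ = np.linalg.lstsq(A, target_attribute, rcond=None)
    return coeffs
```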
  • this embodiment only takes dividing facial topological structure data into three areas as an example for illustration, and does not serve as a specific restriction on dividing facial topological structure data into areas.
  • S1302 Determine a target style feature vector based on the fitting coefficient of each style feature attribute and multiple style feature vectors.
  • the plurality of style feature vectors correspond to the plurality of style feature attributes in one-to-one correspondence.
  • the style feature vector is a representation of style
  • the embedding obtained by training a classification task model can be used as the style feature vector, or a one-hot feature vector can be directly designed as the style feature vector.
  • the 3 style feature vectors can be [1;0;0], [0;1;0] and [0;0;1].
  • the style feature attributes of n users with different speaking styles are obtained.
  • the style feature vectors of n users can be obtained.
  • These n style feature attributes correspond to the n style feature vectors one-to-one; the n style feature attributes and their corresponding style feature vectors form the basic style feature base.
  • the respective fitting coefficients are multiplied by the corresponding style feature vectors, and the target style feature vector can be expressed in the form of the basic style feature base, as shown in the following formula: p = a1*F1 + a2*F2 + ... + an*Fn.
  • F1 is the style feature vector of user 1
  • F2 is the style feature vector of user 2
  • Fn is the style feature vector of user n
  • p is the target style feature vector.
  • the style feature vector is a one-hot feature vector
  • the target style feature vector p can be expressed as: p = [a1; a2; ...; an].
  • the speaking style model is obtained by training a speaking style model framework based on the plurality of style feature vectors.
  • the framework of the speaking style model is trained based on multiple style feature vectors in the basic style feature base to obtain the framework of the trained speaking style model, that is, the speaking style model.
  • Inputting the target style feature vector into the speaking style model can be understood as inputting the linear combination of the multiple style feature vectors weighted by their respective fitting coefficients into the speaking style model, which has the same form as the training samples input when training the framework of the speaking style model. Therefore, based on the speaking style model, using the target style feature vector as input, the target speaking style parameters can be directly output.
  • the target speaking style parameter can be the vertex offset between each vertex in the dynamic face topology structure and the corresponding vertex in the static face topology structure; or it can be the coefficient of the expression basis of the dynamic face topology structure, or it can be other parameters, this disclosure does not impose specific restrictions on this.
  • S1304 Generate a target speaking style based on the target speaking style parameters.
  • the target speaking style parameter is the vertex offset of each vertex in the dynamic face topology structure and the corresponding vertex in the static face topology structure.
  • the offset drives each vertex of the static face topology to move to the corresponding position, and the target speaking style can be obtained.
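  • Putting steps S1302 to S1304 together, the hedged sketch below builds the target style feature vector from the fitting coefficients and the one-hot basic style feature vectors, asks an already trained speaking style model (treated here as an opaque callable, which is an assumption) for the per-vertex offsets, and displaces the static topology accordingly.

```python
# Sketch of S1302-S1304: target style vector -> target speaking style parameters
# (vertex offsets) -> driven face topology.
import numpy as np


def target_style_vector(coeffs):
    """With one-hot basic style vectors F1..Fn, the weighted sum reduces to the
    coefficient vector itself: p = a1*F1 + ... + an*Fn = [a1; ...; an]."""
    basis = np.eye(len(coeffs))
    return basis.T @ coeffs


def drive_static_topology(static_vertices, speaking_style_model, coeffs):
    """speaking_style_model is assumed to map a style vector to (V, 3) offsets."""
    p = target_style_vector(coeffs)
    offsets = speaking_style_model(p)          # target speaking style parameters
    return static_vertices + offsets           # move each vertex to its target position
```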
  • In the above method, by fitting the target style feature attribute based on multiple style feature attributes, the fitting coefficient of each style feature attribute is determined; the target style feature vector is determined based on the fitting coefficient of each style feature attribute and the multiple style feature vectors, which correspond to the multiple style feature attributes one-to-one; the target style feature vector is input into the speaking style model, which outputs the target speaking style parameters, the speaking style model being obtained by training the framework of the speaking style model based on the multiple style feature vectors; and the target speaking style is generated based on the target speaking style parameters. In this way, the target style feature vector can be fitted from the multiple style feature vectors. Since the speaking style model is trained based on the multiple style feature vectors, inputting the target style feature vector fitted from the multiple style feature vectors into the speaking style model can directly yield the corresponding new speaking style. There is no need to retrain the speaking style model, which enables rapid transfer of speaking styles and improves the efficiency of speaking style generation.
  • Figure 14 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 14 is based on the embodiment shown in Figure 12. Before executing S1301, it also includes:
  • S1501 Collect multi-frame facial topological structure data when multiple preset users read multiple speech segments.
  • users with different speaking styles are selected as preset users, and multiple speech segments are also selected.
  • when each preset user reads each segment of speech, multi-frame facial topological structure data of the preset user is collected. For example, the duration of speech segment 1 is t1, and the frequency of collecting facial topology data is 30 frames/second; in this way, after preset user 1 reads speech segment 1, t1*30 frames of facial topology data can be collected.
  • t1*30*m frames of facial topology data can be collected.
  • the vertex offsets (Δx, Δy, Δz) between each vertex of the dynamic face topology structure and the corresponding vertex of the static face topology structure in each frame of facial topology data can be used as the speaking style parameters of that frame of facial topology data.
  • in this way, the average of the speaking style parameters of the multi-frame facial topology data within each divided area, that is, the average vertex offset of all vertices of the dynamic face topology within that area, can be obtained.
  • for example, when the facial topology data is divided into three areas, the average vertex offset of all vertices of the dynamic face topology within area S1 is denoted ΔS1, the average within area S2 is denoted ΔS2, and the average within area S3 is denoted ΔS3.
  • S1503 Splice the average value of the speaking style parameters of the multi-frame facial topological structure data in each divided area in a preset order to obtain the style feature attributes of each preset user.
  • the preset order may be a top-to-bottom order as shown in FIG. 13 , or may be a bottom-to-top order as shown in FIG. 13 , and the present disclosure does not specifically limit this.
  • when the preset order is the top-to-bottom order shown in Figure 13, based on the above embodiment, the per-area averages can be spliced in the order of areas S1, S2 and S3. In this way, the style feature attribute of preset user 1 can be obtained, that is, P1 = [ΔS1, ΔS2, ΔS3].
  • in the same way, multiple style feature attributes can be obtained for the multiple preset users.
  • Figure 15 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 15 is based on the embodiment shown in Figure 14. Before executing S1301, it also includes:
  • S1601 Collect multi-frame target facial topological structure data when the target user reads the multiple segments of speech.
  • the target user and the plurality of preset users are different users.
  • to generate the target speaking style, multi-frame target facial topological structure data is collected while the target user corresponding to that speaking style reads multiple segments of speech, and the content of the segments read by the target user is the same as the content of the segments read by the multiple preset users. For example, after the target user reads m segments of speech each with a duration of t1, t1*30*m frames of target facial topological structure data can be obtained.
  • S1602. Determine the average of the speaking style parameters of the multi-frame target facial topological structure data in each divided area, based on the respective speaking style parameters of the multi-frame target facial topological structure data corresponding to the multiple segments of speech and the divided areas of the facial topological structure data.
  • the vertex offsets (Δx', Δy', Δz') between each vertex of the dynamic face topology and the corresponding vertex of the static face topology in each frame of target facial topology data can be used as the speaking style parameters of that frame of target facial topology data.
  • based on the vertex offsets (Δx', Δy', Δz') of all vertices of the dynamic face topology in the target user's t1*30*m frames of target facial topology data, the average vertex offset of each vertex of the dynamic face topology in the target user's target facial topology data can be determined.
  • in this way, the average of the speaking style parameters of the multi-frame target facial topology data within each divided area, that is, the average vertex offset of all vertices of the dynamic face topology within that area, can be obtained.
  • for example, when the facial topology data is divided into three areas, the average vertex offset of all vertices of the dynamic face topology in the target facial topology data within area S1 is denoted ΔS1', the average within area S2 is denoted ΔS2', and the average within area S3 is denoted ΔS3'.
  • S1603 Splice the average value of the speaking style parameters of the multi-frame target facial topological structure data in each divided area according to the preset order to obtain the target style feature attribute.
  • the per-area averages of the vertex offsets of the dynamic face topology in the target facial topology data are spliced in the preset order, for example, the top-to-bottom order shown in Figure 13.
  • that is, the averages corresponding to the respective areas can be spliced in the order of areas S1, S2 and S3, and the target style feature attribute of the target user, P_target = [ΔS1', ΔS2', ΔS3'], can be obtained.
  • S1501-S1503 shown in Figure 14 can be executed first, and then S1601-S1603 shown in Figure 15 can be executed; or, S1601-S1603 shown in Figure 15 can be executed first, and then S1501-S1503 shown in Figure 14 can be executed.
  • this disclosure does not impose specific limitations on this.
  • Figure 16 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 16 is based on the embodiments shown in Figures 14 and 13. Before executing S1303, it also includes:
  • the training sample set includes an input sample set and an output sample set.
  • the input sample includes voice features and the plurality of corresponding style feature vectors, and the output sample includes the speaking style parameters.
  • the speech Melp feature can be extracted as the speech feature, or a speech feature extraction model commonly used in the industry can be used to extract speech features, or speech features can be extracted based on a designed deep network model.
  • the speech feature sequence can be extracted. If the contents of the multiple speech segments read by the multiple preset users are exactly the same, the same speech feature sequence can be extracted for the different preset users. In this way, for the same speech feature in the speech feature sequence, there are multiple style feature vectors corresponding to the multiple preset users. One speech feature and its corresponding multiple style feature vectors can be used as an input sample, and based on all the speech features in the speech feature sequence, multiple input samples, that is, an input sample set, can be obtained.
  • corresponding facial topology data can be collected. Based on the respective vertex coordinates of all vertices of the dynamic face topology in the facial topology data, the respective vertex offsets of all vertices of the dynamic face topology in the facial topology data can be obtained.
  • the respective vertex offsets of all vertices of the dynamic face topology in the facial topology data are used as a set of speaking style parameters, and a set of speaking style parameters is an output sample.
  • based on the multi-frame facial topology structure data, multiple output samples, that is, an output sample set, can be obtained.
  • the input sample set and the output sample set constitute the training sample set for training the speaking style generation model.
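  • A sketch of assembling the training sample set under the assumptions above: each input pairs a speech feature with the one-hot style feature vector of the preset user who spoke it, and each output is the flattened set of vertex offsets of the aligned facial topology frame; feature extraction itself is left outside the sketch since the disclosure allows several options.

```python
# Sketch: build (input, output) pairs for training the speaking style model framework.
import numpy as np


def build_training_set(speech_features, facial_frames, static_vertices, n_users):
    """speech_features: dict user_id -> (T, d) feature sequence
    facial_frames: dict user_id -> (T, V, 3) dynamic topology aligned with the features
    returns a list of (feature, style_vector) inputs and a list of (V*3,) outputs"""
    inputs, outputs = [], []
    for user_id in range(n_users):
        style_vec = np.eye(n_users)[user_id]               # one-hot style feature vector
        feats, frames = speech_features[user_id], facial_frames[user_id]
        for t in range(len(feats)):
            inputs.append((feats[t], style_vec))
            offsets = frames[t] - static_vertices          # speaking style parameters
            outputs.append(offsets.reshape(-1))
    return inputs, outputs
```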
  • the framework of the speaking style model includes a linear combination unit and a network model.
  • the linear combination unit is used to generate a linear combination style feature vector of the plurality of style feature vectors and generate a linear combination output sample of a plurality of output samples.
  • the input samples correspond to the output samples one-to-one;
  • the network model is used to generate corresponding predicted output samples according to the linear combination style feature vector.
  • Figure 17 is a schematic structural diagram of the framework of the speaking style model according to some embodiments.
  • the framework of the speaking style model includes a linear combination unit 310 and a network model 320.
  • the input end of the linear combination unit 310 is used to receive training samples.
  • the output end of the linear combination unit 310 is connected to the input end of the network model 320
  • the output end of the network model 320 is the output end of the framework 300 of the speaking style model.
  • the training samples include input samples and output samples, where the input samples include speech features and their corresponding multiple style feature vectors.
  • the linear combination unit 310 can linearly combine the multiple style feature vectors to obtain a linear combination style feature vector, and the speaking style parameters corresponding to the multiple style feature vectors can also be linearly combined to obtain a linear combination output sample.
  • the linear combination unit 310 can output speech features and their corresponding linear combination style feature vectors, that is, linear combination input samples, and can also output corresponding linear combination output samples.
  • the linear combination training samples are input to the network model 320.
  • the linear combination training samples include linear combination input samples and linear combination output samples. Based on the linear combination training samples, the network model 320 is trained.
  • the training samples in the training sample set are input to the framework of the speaking style model.
  • the framework of the speaking style model can output predicted output samples.
  • the loss function is used to determine the loss value between the predicted output sample and the output sample. Based on the loss value, the model parameters of the framework of the speaking style model are adjusted in the direction that reduces the loss value, completing one iteration of training. In this way, after multiple iterations of training the framework of the speaking style model, a well-trained framework, that is, the speaking style model, can be obtained.
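  • A rough PyTorch sketch of one training step of the framework in Figure 17: the linear combination of style feature vectors and of their output samples plays the role of the linear combination unit, and a small fully connected network stands in for the network model; the layer sizes, the MSE loss, and the optimizer choice are assumptions.

```python
# Illustrative training step for the speaking style model framework.
import torch
import torch.nn as nn


class StyleNetwork(nn.Module):
    def __init__(self, feat_dim, n_styles, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + n_styles, 512), nn.ReLU(),
            nn.Linear(512, out_dim))                       # predicted speaking style parameters

    def forward(self, speech_feat, style_vec):
        return self.net(torch.cat([speech_feat, style_vec], dim=-1))


def train_step(model, optimizer, speech_feat, style_vecs, output_samples, weights):
    """style_vecs: (n, n) one-hot style vectors; output_samples: (n, out_dim);
    weights: (n,) linear combination weights (all torch tensors)."""
    mixed_style = weights @ style_vecs                     # linear combination style feature vector
    mixed_target = weights @ output_samples                # linear combination output sample
    pred = model(speech_feat, mixed_style)
    loss = nn.functional.mse_loss(pred, mixed_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```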
  • a training sample set is obtained.
  • the training sample set includes an input sample set and an output sample set.
  • the input sample includes speech features and their corresponding multiple style feature vectors, and the output sample includes speaking style parameters; the framework of the speaking style model is then defined.
  • the framework of the speaking style model includes a linear combination unit and a network model.
  • the linear combination unit is used to generate a linear combination style feature vector of multiple style feature vectors, and generate a linear combination output sample of multiple output samples.
  • the input samples correspond to the output samples one-to-one.
  • the network model is used to generate corresponding predicted output samples based on linear combination of style feature vectors; based on the training sample set and loss function, the framework of the speaking style model is trained to obtain the speaking style model.
  • the speaking style model is essentially obtained by training the network model on linear combinations of the multiple style feature vectors, which can improve the diversity of the network model's training samples and improve the versatility of the speaking style model.
  • Figure 18 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 18 is a detailed description of an implementation method when performing S1703 based on the embodiment shown in Figure 16, as follows:
  • the sum of the weight values of the plurality of style feature vectors is 1.
  • multiple style feature vectors can be assigned weight values respectively, and the sum of the weight values of the multiple style feature vectors is 1.
  • by adding the products of the weight values and the corresponding style feature vectors, a linear combination style feature vector can be obtained.
  • Each style feature vector corresponds to an output sample, and the linear combination output sample can be obtained by adding the product of the weight value of multiple style feature vectors and the corresponding output sample.
  • different linear combination style feature vectors and different linear combination output samples can be obtained.
  • a linear combination input sample set can be obtained.
  • a linear combination of output samples corresponding to multiple speech features can be obtained.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination input sample includes the speech feature and its corresponding linear combination style feature vector.
  • the linear combination training samples are input to the network model.
  • a predicted output sample can be obtained; based on the loss function, the model parameters of the network model are adjusted, completing one iteration of training the network model. In this way, after multiple iterations of training the network model, a well-trained framework for the speaking style model, that is, the speaking style model, can be obtained.
  • a linear combination style feature vector is generated based on multiple style feature vectors and their respective weight values, and a linear combination output sample is generated based on the respective weight values of the multiple style feature vectors and the multiple output samples, where the sum of the weight values of the multiple style feature vectors is 1; the network model is trained according to the loss function and the linear combination training sample set to obtain the speaking style model.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination input sample includes a speech feature and its corresponding linear combination style feature vector.
  • the linearly combined training samples can be used as training samples for the network model, which can increase the number and diversity of the training samples of the network model and improve the versatility and accuracy of the speaking style model. A minimal training-iteration sketch is given below.
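  • One training step on linearly combined samples could look roughly like the following PyTorch sketch; the network size, feature dimensions, optimizer settings and MSE loss are assumptions for illustration rather than the model actually used in the disclosure.

```python
import torch
import torch.nn as nn

# Minimal sketch of one training iteration on linearly combined samples.
style_net = nn.Sequential(nn.Linear(80 + 16, 128), nn.ReLU(), nn.Linear(128, 52))
optimizer = torch.optim.Adam(style_net.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(speech_feat, style_vectors, output_samples):
    # speech_feat: (batch, 80); style_vectors: (num_styles, 16); output_samples: (num_styles, 52)
    weights = torch.distributions.Dirichlet(torch.ones(style_vectors.size(0))).sample()
    combined_style = weights @ style_vectors            # linear combination style feature vector
    combined_target = weights @ output_samples          # linear combination output sample

    net_in = torch.cat([speech_feat, combined_style.expand(speech_feat.size(0), -1)], dim=1)
    predicted = style_net(net_in)                       # predicted speaking style parameters
    loss = loss_fn(predicted, combined_target.expand_as(predicted))

    optimizer.zero_grad()
    loss.backward()          # adjust parameters in the direction that reduces the loss
    optimizer.step()
    return loss.item()
```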
  • Figure 19 is a schematic structural diagram of the framework of another speaking style generation model provided by the embodiment of the present disclosure.
  • the framework of the speaking style model also includes a scaling unit 330.
  • the input end of the scaling unit 330 is used to receive training samples
  • the output end of the scaling unit 330 is connected to the input end of the linear combination unit 310
  • the scaling unit 330 is used to scale the multiple style feature vectors and the multiple output samples based on randomly generated scaling factors, to obtain multiple scaled style feature vectors and multiple scaled output samples, and to output scaled training samples.
  • the scaled training samples include the multiple scaled style feature vectors and their respective corresponding scaled output samples.
  • the scaling factor can range from 0.5 to 2 and is accurate to one decimal place.
  • the scaled training samples are input to the linear combination unit 310. Based on the linear combination unit 310, the multiple scaled style feature vectors can be linearly combined to obtain a linear combination style feature vector, and the scaled output samples corresponding to the multiple scaled style feature vectors can also be linearly combined to obtain a linear combination output sample.
  • the linear combination unit 310 can output speech features and their corresponding linear combination style feature vectors, that is, linear combination input samples, and can also output corresponding linear combination output samples.
  • the linear combination training samples are input to the network model 320.
  • the linear combination training samples include linear combination input samples and linear combination output samples. Based on the linear combination training samples, the network model 320 is trained.
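  • A minimal sketch of the scaling unit, assuming NumPy arrays and the 0.5–2 scaling-factor range with one decimal place mentioned above (all other details are illustrative):

```python
import numpy as np

def scaling_unit(style_vectors, output_samples, rng=None):
    """Sketch of the scaling unit: each style and its output sample share one factor."""
    rng = rng or np.random.default_rng()
    num_styles = style_vectors.shape[0]

    # Random factors in [0.5, 2.0], rounded to one decimal place as described above.
    factors = np.round(rng.uniform(0.5, 2.0, size=num_styles), 1)

    scaled_styles = style_vectors * factors[:, None]
    scaled_outputs = output_samples * factors[:, None]
    return scaled_styles, scaled_outputs, factors
```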
  • Figure 20 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 20 is a detailed description of another possible implementation when performing S1703 based on the embodiment shown in Figure 16, as follows:
  • multiple style feature vectors can be scaled separately with random scaling factors, and multiple scaled style feature vectors can be obtained.
  • Each style feature vector corresponds to an output sample, and multiple scaled output samples can be obtained by scaling the corresponding output samples based on the respective scaling factors of multiple style feature vectors.
  • a scaled input sample set can be obtained based on multiple voice features and their corresponding multiple scaled style feature vectors, and a scaled output sample set can be obtained based on the scaled output samples corresponding to the multiple voice features.
  • S5012: Input the multiple scaled style feature vectors and the multiple scaled output samples to the linear combination unit, and generate the linear combination style feature vector based on the multiple scaled style feature vectors and their respective weight values.
  • the scaled training sample set includes a scaled input sample set and a scaled output sample set.
  • the scaled training sample set is input to the linear combination unit.
  • multiple scaled style feature vectors can each be assigned a weight value, and the sum of the weight values of the multiple scaled style feature vectors is 1.
  • by summing the products of each scaled style feature vector and its weight value, a linear combination style feature vector can be obtained.
  • Each scaling style feature vector corresponds to a scaling output sample, and the linear combination output sample can be obtained by adding the product of the respective weight values of multiple scaling style feature vectors and the corresponding scaling output samples. In this way, based on different weight values, different linear combination style feature vectors and different linear combination output samples can be obtained.
  • a linear combination input sample set can be obtained.
  • based on the scaled output samples corresponding to the multiple speech features, a linear combination output sample set can be obtained.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination input sample includes the speech feature and its corresponding linear combination style feature vector.
  • the linear combination training samples are input to the network model.
  • a predicted output sample can be obtained. Based on the loss function, the loss value of the predicted output sample and the corresponding linear combination output sample can be determined.
  • the model parameters of the network model are adjusted in the direction in which the loss value decreases, completing one training iteration of the network model. In this way, after multiple training iterations of the network model, a well-trained framework, that is, the speaking style model, can be obtained.
  • the framework of the speaking style model also includes a scaling unit; the training sample set is input to the scaling unit, multiple scaled style feature vectors are generated based on the scaling factor and the multiple style feature vectors, and multiple scaled output samples are generated based on the scaling factor and the multiple output samples; the multiple scaled style feature vectors and the multiple scaled output samples are input to the linear combination unit, and a linear combination style feature vector is generated based on the multiple scaled style feature vectors and their respective weight values.
  • a linear combination output sample is generated based on the respective weight values of the multiple scaled style feature vectors and the multiple scaled output samples.
  • the sum of the weight values of the multiple scaled style feature vectors is 1; according to the loss function and the linear combination training sample set, the network model is trained to obtain the speaking style model.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination input sample includes speech features and their corresponding linear combination style feature vectors.
  • the multiple scaled style feature vectors serve as training samples for the network model, which can increase the number and diversity of the training samples of the network model, thereby improving the versatility and accuracy of the speaking style model.
  • Figure 21 is a schematic structural diagram of the framework of the speaking style generation model according to some embodiments
  • Figure 22 is a schematic structural diagram of the framework of the speaking style generation model according to some embodiments
  • As shown in Figure 21, the network model 320 includes a first-level network model 321, a second-level network model 322 and a superposition unit 323.
  • the output end of the first-level network model 321 and the output end of the second-level network model 322 are both connected to the input end of the superposition unit 323, and the output end of the superposition unit 323 is used to output the predicted output sample.
  • the loss function includes a first loss function and a second loss function.
  • the linear combination training samples are respectively input to the first-level network model 321 and the second-level network model 322.
  • the first-level prediction output sample can be output based on the first-level network model 321, and the second-level prediction output sample can be output based on the second-level network model 322.
  • the first-level predicted output sample and the second-level predicted output sample are input to the superposition unit 323. Based on the superposition unit 323, the first-level predicted output sample and the second-level predicted output sample are superimposed to obtain the predicted output sample.
  • the first-level network model 321 may include a convolutional network and a fully connected network, which are used to extract the single-frame correspondence between speech and facial topological structure data.
  • the second-level network model 322 can be a sequence-to-sequence (seq2seq) network model, for example, a long short-term memory (LSTM) network model, a gate recurrent unit (GRU) network model, or a Transformer network model.
  • the loss function is L = b1*L1 + b2*L2, where L1 is the first loss function, used to determine the loss value between the first-level predicted output sample and the linear combination output sample; L2 is the second loss function, used to determine the loss value between the second-level predicted output sample and the linear combination output sample; b1 is the weight of the first loss function; b2 is the weight of the second loss function; and b1 and b2 are adjustable. A sketch of this two-branch model and combined loss is given below.
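  • The two-branch structure and the combined loss L = b1*L1 + b2*L2 could be sketched as follows in PyTorch; the layer sizes, the use of an LSTM for the second-level branch and the MSE losses are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TwoLevelStyleNet(nn.Module):
    """Illustrative two-branch model: the outputs of both branches are superimposed."""
    def __init__(self, in_dim=96, out_dim=52):
        super().__init__()
        # First-level branch: convolution + fully connected layers.
        self.first = nn.Sequential(
            nn.Conv1d(in_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, out_dim))
        # Second-level branch: a seq2seq-style recurrent model (LSTM here).
        self.second = nn.LSTM(in_dim, 64, batch_first=True)
        self.second_head = nn.Linear(64, out_dim)

    def forward(self, x):                      # x: (batch, frames, in_dim)
        y1 = self.first(x.transpose(1, 2))     # first-level predicted output sample
        h, _ = self.second(x)
        y2 = self.second_head(h[:, -1])        # second-level predicted output sample
        return y1 + y2, y1, y2                 # superposition unit: sum of both branches

def combined_loss(y1, y2, target, b1=1.0, b2=1.0):
    mse = nn.MSELoss()
    return b1 * mse(y1, target) + b2 * mse(y2, target)   # L = b1*L1 + b2*L2
```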
  • Figure 23 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 23 is a detailed description of an implementation method when performing S502 based on the embodiment shown in Figure 18 or Figure 20, as follows:
  • S5021 Train the first-level network model according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model.
  • the intermediate speaking style model includes the second-level network model and the trained first-level network model.
  • the weight b2 of the second loss function is set to approach 0.
  • the loss function of the current network model can be understood as the first loss function, and the linear combination training samples are input separately to the first-level network model and the second-level network model.
  • based on the first-level predicted output sample, the first loss function and the corresponding linear combination output sample, the first loss value can be obtained, and the model parameters of the first-level network model are adjusted in the direction in which the first loss value decreases until the first loss value converges, obtaining the trained first-level network model.
  • the framework of the speaking style model trained in the first stage is the intermediate speaking style model.
  • the model parameters of the trained first-level network model need to be fixed first.
  • S5023 Train the secondary network model in the intermediate speaking style model according to the linear combination training sample set and the second loss function to obtain the speaking style model.
  • the speaking style model includes the trained first-level network and the trained second-level network.
  • the loss function of the current network model can be understood as the second loss function.
  • the linear combination training samples are input to the second-level network model and the trained first-level network model. Based on the predicted output sample output by the superposition unit, the second loss function and the corresponding linear combination output sample, the second loss value can be obtained, and the model parameters of the second-level network model are adjusted in the direction in which the second loss value decreases until the second loss value converges, obtaining the trained second-level network model.
  • the framework of the speaking style model trained in the second stage is the speaking style model.
  • the network model includes a first-level network model, a second-level network model and a superposition unit.
  • the output end of the first-level network model and the output end of the second-level network model are both connected to the input end of the superposition unit.
  • the output end of the superposition unit is used to output the predicted output sample;
  • the loss function includes the first loss function and the second loss function. According to the linear combination training sample set and the first loss function, the first-level network model is trained to obtain an intermediate speaking style model, which includes the second-level network model and the trained first-level network model; the model parameters of the trained first-level network model are fixed; the second-level network model in the intermediate speaking style model is then trained according to the linear combination training sample set and the second loss function to obtain the speaking style model, which includes the trained first-level network and the trained second-level network.
  • the network model can be trained in stages, which can improve the convergence speed of the network model, that is, shorten the training time of the network model. A staged-training sketch is given below.
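  • A staged-training sketch for the two-branch model outlined above (the interfaces, optimizers and epoch counts are assumptions; the disclosure only specifies the two stages, not a concrete schedule):

```python
import torch

def train_two_stage(model, loader, loss_fn, epochs=(10, 10)):
    """Sketch of staged training for the two-branch model sketched earlier."""
    # Stage 1: train only the first-level branch (weight of the second loss ~ 0).
    opt1 = torch.optim.Adam(model.first.parameters(), lr=1e-4)
    for _ in range(epochs[0]):
        for x, target in loader:
            _, y1, _ = model(x)
            loss1 = loss_fn(y1, target)            # first loss function
            opt1.zero_grad(); loss1.backward(); opt1.step()

    # Fix the model parameters of the trained first-level branch.
    for p in model.first.parameters():
        p.requires_grad_(False)

    # Stage 2: train the second-level branch against the superimposed prediction.
    second_params = list(model.second.parameters()) + list(model.second_head.parameters())
    opt2 = torch.optim.Adam(second_params, lr=1e-4)
    for _ in range(epochs[1]):
        for x, target in loader:
            y, _, _ = model(x)                     # output of the superposition unit
            loss2 = loss_fn(y, target)             # second loss function
            opt2.zero_grad(); loss2.backward(); opt2.step()
    return model
```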
  • Figure 24 is a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
  • the computer device includes a processor 910 and a memory 920; the number of processors 910 in the computer device can be one or more.
  • in Figure 24, one processor 910 is taken as an example; the processor 910 and the memory 920 in the computer device may be connected through a bus or other means, and in Figure 24 the connection through a bus is taken as an example.
  • the memory 920 can be used to store software programs, computer-executable programs and modules, such as program instructions/modules corresponding to the semantic understanding model training method in the embodiments of the present disclosure, or program instructions/modules corresponding to the semantic understanding method in the embodiments of the present disclosure.
  • the processor 910 executes various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the memory 920, that is, implementing the semantic understanding model training method or the short video recall method provided by the embodiments of the present disclosure.
  • the memory 920 may mainly include a stored program area and a stored data area, where the stored program area may store an operating system and an application program required for at least one function; the stored data area may store data created according to the use of the terminal, etc.
  • the memory 920 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • memory 920 may further include memory located remotely relative to processor 910, and these remote memories may be connected to the computer device through a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure relates to a virtual digital human driving method, an apparatus, a device and a medium. The method comprises: acquiring user information, the user information comprising voice information and image information; according to the user information, determining a user intention and a user emotion; according to the user intention, determining a reply text of a virtual digital human, and, according to the user intention and the user emotion, determining a reply emotion of the virtual digital human; according to the reply text, determining limb actions of the virtual digital human, and, according to the reply emotion, determining an emotion expression mode of the virtual digital human. The present disclosure achieves a natural and anthropomorphic interaction state of a virtual human, thereby improving the simulation effect and expression naturalness of a virtual digital human.

Description

A virtual digital human driving method, apparatus, device and medium
Cross-reference to related applications
This disclosure claims priority to the Chinese patent application No. 202210714001.7 filed with the China Patent Office on June 22, 2022, and to the Chinese patent application No. 202210751784.6 filed with the China Patent Office on June 28, 2022, the entire contents of which are incorporated into this disclosure by reference.
Technical field
The present disclosure relates to the field of virtual digital human technology, and in particular, to a virtual digital human driving method, apparatus, device and medium.
Background
A virtual digital human is a multi-modal intelligent human-computer interaction technology that integrates computer vision, speech recognition, speech synthesis, natural language processing, terminal display and other technologies to create a highly anthropomorphic virtual image that can interact and communicate with people like a real person.
Although virtual digital humans already have a small number of demonstration-level applications, their expressive capability is still limited. First, in terms of perception, in real and complex acoustic scenes, differences in channel, environment and speaker significantly increase the difficulty of recognition. Second, current intelligent interaction systems find it difficult to accurately recognize users' true intentions and emotional states in different complex natural interaction scenes, and therefore have difficulty outputting matching system actions. Finally, limited by the virtual human's visual image, form-driven technology and speech synthesis technology, the fidelity and naturalness of expression of virtual humans are still relatively stiff.
Summary of the invention
The present disclosure provides a virtual digital human driving method, including:
obtaining user information, where the user information includes voice information and image information;
determining a user intention and a user emotion according to the user information;
determining a reply text of the virtual digital human according to the user intention, and determining a reply emotion of the virtual digital human according to the user intention and the user emotion;
determining body movements of the virtual digital human according to the reply text, and determining an emotional expression mode of the virtual digital human according to the reply emotion.
The present disclosure also provides a computer device, including:
one or more processors;
a memory, used to store one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any one of the first aspect.
The present disclosure also provides a computer-readable non-volatile storage medium on which a computer program is stored, where the program, when executed by a processor, implements the method as described in any one of the first aspect.
Description of the drawings
Figure 1A is a schematic diagram of an application scenario of a virtual digital human driving process according to some embodiments;
Figure 1B is a schematic structural diagram of a virtual digital human according to some embodiments;
Figure 2A is a hardware configuration block diagram of a computer device according to some embodiments;
Figure 2B is a schematic diagram of a software configuration of a computer device according to some embodiments;
Figure 2C is a schematic diagram showing an icon control interface of an application included in a smart device according to some embodiments;
Figure 3A is a schematic flowchart of a virtual digital human driving method according to some embodiments;
Figure 3B is a schematic diagram of the principle of a virtual digital human driving method according to some embodiments;
Figure 4A is a schematic flowchart of another virtual digital human driving method according to some embodiments;
Figure 4B is a schematic diagram of the principle of a virtual digital human driving method according to some embodiments;
Figure 4C is a schematic flowchart of yet another virtual digital human driving method according to some embodiments;
Figure 4D is a schematic flowchart of yet another virtual digital human driving method according to some embodiments;
Figure 5 is a schematic flowchart of a virtual digital human driving method according to some embodiments;
Figure 6 is a schematic flowchart of yet another virtual digital human driving method according to some embodiments;
Figure 7 is a schematic flowchart of yet another virtual digital human driving method according to some embodiments;
Figure 8 is a schematic diagram of a virtual digital human according to some embodiments;
Figure 9 is a schematic diagram of a virtual digital human according to some embodiments;
Figure 10 is a schematic diagram of the principle of generating a new speaking style according to some embodiments;
Figure 11 is a schematic diagram of a human-computer interaction scenario according to some embodiments;
Figure 12 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 13 is a schematic diagram of facial topology data divided into regions according to some embodiments;
Figure 14 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 15 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 16 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 17 is a schematic structural diagram of the framework of a speaking style model according to some embodiments;
Figure 18 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 19 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments;
Figure 20 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 21 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments;
Figure 22 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments;
Figure 23 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 24 is a schematic structural diagram of a computer device according to some embodiments.
Detailed description
In order to understand the above objects, features and advantages of the present disclosure more clearly, the solutions of the present disclosure are further described below. It should be noted that, as long as there is no conflict, the embodiments of the present disclosure and the features in the embodiments can be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of the present disclosure, but the present disclosure can also be implemented in ways other than those described here; obviously, the embodiments in the description are only some, rather than all, of the embodiments of the present disclosure.
As a new generation of human-computer interaction, a virtual digital human system is usually designed with five modules: character image, speech generation, dynamic image generation, audio-video synthesis and display, and interaction modeling. Character images can be divided into 2D and 3D categories according to the dimension of the character image resources, and in appearance can be further divided into cartoon, anthropomorphic, realistic, hyper-realistic and other styles. The speech generation module can generate the corresponding character voice from text; the animation generation module can generate dynamic images of a specific character from speech or text; the audio-video synthesis and display module synthesizes the speech and dynamic images into a video that is finally displayed to the user; and the interaction module gives the digital human interactive capability, that is, it recognizes the user's intention through intelligent technologies such as speech and semantic recognition, decides the digital human's subsequent speech and actions according to the user's current intention, and drives the character to start the next round of interaction.
Although virtual digital humans already have a small number of demonstration-level applications, their expressive capability is still limited. First, in terms of perception, in real and complex acoustic scenes, differences in channel, environment and speaker significantly increase the difficulty of recognition. Second, current intelligent interaction systems find it difficult to accurately recognize users' true intentions and emotional states in different complex natural interaction scenes, and therefore have difficulty outputting matching system actions. Finally, limited by the virtual human's visual image, form-driven technology and speech synthesis technology, the fidelity and naturalness of expression of virtual humans are still relatively stiff.
In view of the shortcomings of the above technical problems, the embodiments of the present disclosure first obtain user information, which includes voice information and image information; then determine the user intention and the user emotion according to the user information; and finally determine the body movements of the virtual digital human according to the user intention and the emotional expression mode of the virtual digital human according to the user emotion. That is, on the basis of the acquired user voice information and user image information, the user voice information and user image information are processed to determine the user intention and the user emotion; the body movements of the virtual digital human are then determined according to the user intention and the emotional expression mode of the virtual digital human is determined according to the user emotion, so that the virtual digital human truly restores the user's intention and emotion, improving the fidelity and naturalness of expression of the virtual digital human.
Figure 1A is a schematic diagram of an application scenario of a virtual digital human driving process in an embodiment of the present disclosure. As shown in Figure 1A, the virtual digital human driving process can be used in interaction scenarios between users and smart terminals. Assume that the smart terminals in this scenario include smart blackboards, smart large screens, smart speakers, smart phones and the like, and that the smart terminal displays a virtual digital human; examples of virtual digital humans include virtual teachers, virtual brand images, virtual assistants, virtual shopping guides and virtual anchors, as shown in Figure 1B. When the user wants to control the virtual digital human displayed on the smart terminal in this scenario, the user first issues a voice command. The smart terminal collects the user's voice information and image information, determines the user's intention and emotion by processing the voice information and image information, and then determines the body movements and emotional expression mode of the virtual digital human according to the parsed user instruction and user emotion, so that the virtual digital human truly restores the user's intention and emotion, improving its fidelity and naturalness of expression.
The virtual digital human driving method provided by the embodiments of the present disclosure can be implemented based on a computer device, or on functional modules or functional entities in the computer device.
The computer device may be a personal computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a mainframe computer, etc., which is not specifically limited in the embodiments of the present disclosure.
Exemplarily, Figure 2A is a hardware configuration block diagram of a computer device according to one or more embodiments of the present disclosure. As shown in Figure 2A, the computer device includes at least one of: a tuner-demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply and a user interface 280. The controller 250 includes a central processing unit, a video processor, an audio processor, a graphics processor, RAM, ROM, and first to n-th interfaces for input/output. The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display and a projection display, and may also be a projection device and a projection screen. The tuner-demodulator 210 receives broadcast television signals by wired or wireless reception and demodulates audio and video signals, such as EPG audio and video data signals, from multiple wireless or wired broadcast television signals. The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example, the communicator may include at least one of a Wi-Fi module, a Bluetooth module, a wired Ethernet module, other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The computer device can establish the transmission and reception of control signals and data signals with a server or a local control device through the communicator 220. The detector 230 is used to collect signals from the external environment or signals of interaction with the outside.
In some embodiments, the controller 250 controls the operation of the computer device and responds to user operations through various software control programs stored in the memory. The controller 250 controls the overall operation of the computer device. The user may input a user command through a graphical user interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the GUI. Alternatively, the user may input a user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through a sensor to receive the user input command.
Figure 2B is a schematic diagram of the software configuration of a computer device according to one or more embodiments of the present disclosure. As shown in Figure 2B, the system is divided into four layers, which are, from top to bottom, the applications layer (referred to as the "application layer"), the application framework layer (referred to as the "framework layer"), the Android runtime and system library layer (referred to as the "system runtime library layer"), and the kernel layer.
Figure 2C is a schematic diagram showing the icon control interface of applications included in a smart terminal (mainly a smart playback device, such as a smart TV, a digital cinema system or an audio-visual server) according to one or more embodiments of the present disclosure. As shown in Figure 2C, the application layer contains at least one application whose corresponding icon control can be displayed on the display, such as a live TV application icon control, a video-on-demand (VOD) application icon control, a media center application icon control, an application center icon control, a game application icon control, and so on. A live TV application can provide live TV from different signal sources. A VOD application can provide videos from different storage sources; unlike live TV applications, video on demand provides video display from certain storage sources. A media center application can provide playback of various multimedia content. An application center can provide storage for various applications.
In order to explain the virtual digital human driving method in more detail, it is described below in an exemplary manner in conjunction with Figure 3A. It can be understood that the steps involved in Figure 3A may include more or fewer steps in actual implementation, and the order between these steps may also be different, as long as the virtual digital human driving method provided in the embodiments of the present disclosure can be implemented.
Figure 3A is a schematic flowchart of a virtual digital human driving method provided by an embodiment of the present disclosure; Figure 3B is a schematic diagram of the principle of a virtual digital human driving method provided by an embodiment of the present disclosure. This embodiment is applicable to the case of driving a virtual digital human. The method of this embodiment can be executed by a smart terminal, which can be implemented in hardware and/or software and can be configured in a computer device.
As shown in Figure 3A, the method specifically includes the following steps:
S10. Obtain user information, where the user information includes voice information and image information.
In a specific implementation, the smart terminal includes a sound sensor and a visual sensor, where the sound sensor may be, for example, a microphone array, and the visual sensor includes a 2D visual sensor and a 3D visual sensor and may be, for example, a camera.
The smart terminal collects voice information through the sound sensor and image information through the visual sensor, where the voice information includes semantic information and acoustic information, and the image information includes scene information and user image information.
S20. Determine the user intention and the user emotion according to the user information.
After the terminal device collects the voice information through the sound sensor, it can determine the user's intention based on the semantic information included in the voice information, that is, the way in which the user expects to drive the virtual digital human to act. After collecting the image information through the visual sensor, it can determine, based on the collected image information, the facial expression of the user who issued the voice information, and determine, according to that facial expression, the emotion the user expects the virtual digital human to express.
S30. Determine the reply text of the virtual digital human according to the user intention, and determine the reply emotion of the virtual digital human according to the user intention and the user emotion.
After the user intention and the user emotion are determined according to the user information, the reply text of the virtual digital human, for example the text corresponding to the virtual digital human's reply voice, can be determined based on the user intention, and the reply emotion of the virtual digital human can be determined based on the user intention and the user emotion; that is, the emotional expression required for the virtual digital human's reply is determined according to the user intention, and the emotion that the virtual digital human's reply needs to express is determined according to the emotion expressed by the user. In a specific implementation, when the emotion expressed by the user is sadness, the emotion the virtual digital human needs to express in its reply is also sadness.
S40. Determine the body movements of the virtual digital human according to the reply text, and determine the emotional expression mode of the virtual digital human according to the reply emotion.
After the user intention and the user emotion are determined according to the user information, the reply text of the virtual digital human is determined based on the user intention, and the reply emotion of the virtual digital human is determined based on the user intention and the user emotion; then the body movements of the virtual digital human are determined according to the reply text, and the emotional expression mode of the virtual digital human is determined according to the reply emotion. That is, multi-modal human-computer interaction information perception is first established for speech recognition and image recognition; then the user intention and user emotion are determined from the acquired voice information and image information, the reply text of the virtual digital human is determined according to the user intention, and the reply emotion of the virtual digital human is determined according to the user intention and the user emotion; finally, emotional expression and body movement generation are performed for the virtual digital human, realizing the synthesis of the virtual digital human's speech, expressions, actions and so on.
In the virtual digital human driving method provided by the embodiments of the present disclosure, user information, that is, voice information and image information, is first obtained; then the user intention and the user emotion are determined according to the user information, the reply text of the virtual digital human is determined according to the user intention, and the reply emotion of the virtual digital human is determined according to the user intention and the user emotion; finally, the body movements of the virtual digital human are determined according to the reply text, and the emotional expression mode of the virtual digital human is determined according to the reply emotion, achieving a natural, anthropomorphic virtual human interaction state and improving the fidelity and naturalness of expression of the virtual digital human.
Figure 4A is a schematic flowchart of another virtual digital human driving method provided by an embodiment of the present disclosure, and Figure 4B is a schematic diagram of the principle of another virtual digital human driving method provided by an embodiment of the present disclosure. The embodiment of the present disclosure is based on the above embodiment. As shown in Figures 4A and 4B, a specific implementation of step S20 includes:
S201. Process the voice information and determine the text information and voice emotion information corresponding to the voice information.
Optionally, as shown in Figure 4C, step S201 includes:
S2012. Perform text transcription processing on the voice information and determine the text information corresponding to the voice information.
In a specific implementation, after the voice information is acquired, the voice recognition module performs text transcription processing on the acquired voice information, that is, the voice information is converted into text information corresponding to the voice information.
Specifically, the terminal device can input the voice information into an offline automatic speech recognition (ASR) engine to obtain the text information output by the ASR engine.
In the embodiments of the present disclosure, after completing the text transcription of the voice information, the terminal device can continue to wait for the user to input voice. If the start of a human voice is recognized based on voice activity detection (VAD), recording continues; if the end of the human voice is recognized based on VAD, recording stops. The terminal device can use the recorded audio as the user voice information, and then input the user voice information into the ASR engine to obtain the corresponding text information. A schematic example of this VAD-plus-ASR flow is given below.
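The VAD-plus-ASR flow described above can be pictured with the following minimal sketch. It uses a simple energy threshold as a stand-in for a real VAD, and `asr_engine.transcribe` is a hypothetical placeholder for whatever offline ASR engine is actually used; none of these names come from the disclosure.

```python
import numpy as np

FRAME_MS, SAMPLE_RATE = 30, 16000
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000

def is_speech(frame, threshold=0.01):
    """Very simple energy-based voice activity detection (illustrative only)."""
    return float(np.sqrt(np.mean(frame ** 2))) > threshold

def record_utterance(audio_stream):
    """Collect frames from the first detected speech frame until speech ends."""
    recorded, started = [], False
    for frame in audio_stream:                 # frames of FRAME_LEN float32 samples
        if is_speech(frame):
            started = True
            recorded.append(frame)
        elif started:
            break                              # end of speech detected, stop recording
    return np.concatenate(recorded) if recorded else np.zeros(0, dtype=np.float32)

# text = asr_engine.transcribe(record_utterance(mic_frames))   # placeholder ASR call
```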
S2013. Extract voiceprint features from the voice information and determine the voice emotion information corresponding to the voice information.
Voiceprint features are the sound-wave spectrum, displayed by electroacoustic instruments, that carries speech information. Voiceprint features reflect the different wavelengths, frequencies, intensities and rhythms of different voices, that is, the pitch, intensity, duration and timbre of the user's speech, and different users have different voiceprint features. By extracting voiceprint features from the voice information, the emotional information expressed by the user who issued the voice information, that is, the voice emotion information, can be obtained.
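As an illustration of the kind of voiceprint features that can carry emotion cues (pitch, intensity, duration, timbre), the following sketch uses the librosa library to compute simple pitch, energy and MFCC statistics; the feature set and dimensions are assumptions, not the features specified by the disclosure.

```python
import numpy as np
import librosa

def voiceprint_features(wav_path):
    """Sketch: pitch, energy and timbre statistics as a simple voice-emotion descriptor."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)         # pitch contour
    rms = librosa.feature.rms(y=y)[0]                     # loudness / intensity
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # timbre
    return np.concatenate([
        [np.nanmean(f0), np.nanstd(f0)],                  # pitch statistics
        [rms.mean(), rms.std()],                          # energy statistics
        mfcc.mean(axis=1), mfcc.std(axis=1)])             # timbre statistics
```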
S202. Process the image information and determine the scene information and image emotion information corresponding to the image information.
Optionally, as shown in Figure 4D, step S202 includes:
S2021. Preprocess the image information to determine the scene key point information and user key point information included in the image information.
The scene key point information refers to key points, contained in the image information in addition to the user information, of the scene in which the user is located; the user key point information refers to key points of the user's limbs or facial features in the image information. For example, if the image information collected by the terminal device shows a teacher standing in front of a blackboard, the scene key point information included in the image information is the blackboard, and the user key point information included in the image is the user's eyes, mouth, arms, legs, etc.
S2022. Determine the scene information corresponding to the image information according to the scene key point information.
After the scene key point information is obtained by preprocessing the image, the scene information of the terminal device, that is, the scene in which the terminal device is applied, can be determined.
In a specific implementation, a knowledge base of different application scenarios of the virtual digital human is constructed, and a scene recognition model is built based on algorithms such as entity recognition, entity linking and entity alignment. The image information of different application scenarios in the knowledge base is preprocessed to obtain the corresponding scene key point information, which is then input into the scene recognition model to train it until it converges, yielding the target scene recognition model. Then, methods such as graph mapping and information extraction are used to preprocess the acquired image information to obtain its scene key point information, and the preprocessed scene key point information is input into the target scene recognition model for scene recognition, ensuring the accuracy of the scene recognition result.
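A highly simplified stand-in for the target scene recognition model is sketched below: a nearest-neighbour classifier over scene key-point descriptors. The file names, label set and classifier choice are hypothetical; the disclosure's model is built from a knowledge base with entity recognition, linking and alignment, which is not reproduced here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data exported from the application-scenario knowledge base.
train_X = np.load("scene_keypoint_features.npy")   # (num_images, feat_dim)
train_y = np.load("scene_labels.npy")              # e.g. 0 = classroom, 1 = living room, ...

scene_model = KNeighborsClassifier(n_neighbors=5).fit(train_X, train_y)

def recognize_scene(keypoint_feature):
    """Predict the scene label for one preprocessed scene key-point descriptor."""
    return int(scene_model.predict(keypoint_feature.reshape(1, -1))[0])
```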
S2023. Determine the image emotion information according to the correspondence between the user key point information and preset user emotion key points.
After the user key point information is obtained by preprocessing the image, the user's emotion captured by the terminal device, that is, the emotion expressed by the user in the image information collected by the terminal device, can be determined.
S203. Determine the user intention according to the text information and the scene information.
After the text information corresponding to the voice information and the scene information corresponding to the image information are obtained, the body movements that the user expects to drive the virtual digital human to perform can be determined based on the text information, and then combined with the determined scene information to further ensure the coordination accuracy with which the terminal device drives the virtual digital human's body movements based on the text information.
S204. Determine the user emotion according to the text information, the voice emotion information and the image emotion information.
After the text information corresponding to the voice information is obtained, the emotion expressed by the user can be roughly determined from the text information; then, by fusing the voice emotion information and the image emotion information, the virtual digital human can be accurately driven to express the user's emotion, improving its fidelity.
In the virtual digital human determination method provided by the embodiments of the present disclosure, the voice information is first processed to determine the corresponding text information and voice emotion information, and the image information is processed to determine the corresponding scene information and image emotion information; then the user intention is determined based on the text information and the scene information, and the user emotion is determined based on the text information, the voice emotion information and the image emotion information. That is, the body movements that the user expects to drive the virtual digital human to perform can be determined from the text information and combined with the determined scene information to further ensure the coordination accuracy of the driven body movements; the emotion expressed by the user can be roughly determined from the text information, and then, by fusing the voice emotion information and the image emotion information, the virtual digital human is accurately driven to express the user's emotion, improving its fidelity.
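One common way to combine the three emotion cues is a late fusion of per-modality probabilities, as in the sketch below; the label set, weights and fusion rule are assumptions for illustration and are not specified by the disclosure.

```python
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry"]   # assumed label set

def fuse_emotions(text_probs, voice_probs, image_probs, weights=(0.4, 0.3, 0.3)):
    """Late-fusion sketch: weighted average of per-modality emotion probabilities."""
    fused = (weights[0] * np.asarray(text_probs)
             + weights[1] * np.asarray(voice_probs)
             + weights[2] * np.asarray(image_probs))
    return EMOTIONS[int(np.argmax(fused))]
```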
Figure 5 is a schematic flowchart of yet another virtual digital human driving method provided by an embodiment of the present disclosure. The embodiment of the present disclosure is based on the embodiment corresponding to Figure 4C. As shown in Figure 5, before step S2012, the method also includes:
S2010. Extract the speech feature vector of the voice information.
In the embodiments of the present disclosure, highly robust voiceprint recognition and voiceprint clustering technologies are constructed to realize automatic login of multiple users through the voice modality, while paralinguistic information such as gender and accent is extracted to establish basic user information. To address the difficulty of clustering speech features when the number of target classes is uncertain, and the impact of speech channel interference on classification and clustering, unsupervised clustering in a noisy density space is combined with stochastic linear discriminant analysis to achieve highly reliable voiceprint classification and clustering and to reduce the impact of channel interference on voiceprint recognition. That is, in the present disclosure, a speech recognition model is constructed that can adapt to different paralinguistic information, giving a high speech recognition accuracy.
In a specific implementation, before the text transcription processing is performed on the voice information to determine the corresponding text information, the speech feature vector in the voice information is first extracted. Specifically, the speech feature vector includes an accent feature vector, a gender feature vector, an age feature vector, etc.
S2011. Add the speech feature vector to the convolutional layer of the speech recognition model.
The speech recognition model includes an acoustic model and a language model; the acoustic model includes a convolutional neural network model with an attention mechanism, and the language model includes a deep neural network model.
The speech recognition model constructed in the present disclosure is a joint modeling of the acoustic model and the language model. The acoustic model is built using deep time-series convolution and an attention mechanism, and the speech feature vector is added as a condition in the convolutional layer of the convolutional neural network model to adapt to different speech characteristics. At the language model level, a deep-neural-network-based model structure that can be quickly intervened and configured is implemented, and the speech characteristics of different paralinguistic information are adapted through the user-specific voiceprint, improving the accuracy of speech recognition.
The specific implementation of step S2012 may include:
S20120. Perform text transcription processing on the voice information based on the speech recognition model, and determine the text information corresponding to the voice information.
After the speech recognition model is built, the voice information can be transcribed into text based on the speech recognition model, improving the accuracy of the speech recognition results.
Figure 6 is a schematic flowchart of yet another virtual digital human driving method provided by an embodiment of the present disclosure. This embodiment is based on the embodiment corresponding to Figure 4A. As shown in Figure 6, the specific implementation of step 40 includes:
S401、获取回复文本中包括的动作标识。S401. Obtain the action identifier included in the reply text.
动作标识示例性包括:抬、伸、眨、张等。Examples of action identifiers include: lifting, stretching, blinking, opening, etc.
在对虚拟数字人进行驱动的过程中,关键点驱动包括语音内容分离、内容关键点驱动、说话人关键点驱动、基于关键点的图像生成模块、基于关键点的图像拉伸模块等。因此,首先基于对语音信息的转录处的文本信息进行解析,获取文本信息中包括的动作标识以及关键点标识。In the process of driving virtual digital humans, key point driving includes speech content separation, content key point driving, speaker key point driving, key point-based image generation module, key point-based image stretching module, etc. Therefore, first, based on parsing the text information at the transcription point of the speech information, the action identifiers and key point identifiers included in the text information are obtained.
S402、根据动作标识,从场景信息对应的预设动作数据库中选择虚拟数字人的肢体动作。S402. Select the body movements of the virtual digital human from the preset action database corresponding to the scene information according to the action identification.
具体的,若动作标识为抬,则从场景信息对应的预设动作数据库中选择虚拟数字人的肢体动作为抬头、抬腿等。Specifically, if the action identifier is lifting, then the body movements of the virtual digital human are selected from the preset action database corresponding to the scene information, such as raising the head, raising the leg, etc.
其中,预设动作数据库包含动作类型定义、动作编排、动作衔接等。Among them, the preset action database includes action type definition, action arrangement, action connection, etc.
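As a simplified illustration, and not the disclosed implementation, the selection of body movements from a scene-specific preset action database according to a parsed action identifier could be sketched as follows; the database contents and names are hypothetical.

```python
# Hypothetical preset action database keyed by scene and action identifier.
PRESET_ACTION_DB = {
    "living_room": {
        "lift": ["raise_head", "raise_leg"],
        "blink": ["blink_eyes"],
        "open": ["open_mouth", "spread_arms"],
    },
}

def select_body_actions(scene, action_id):
    """Return candidate body movements for the given scene and action identifier."""
    return PRESET_ACTION_DB.get(scene, {}).get(action_id, [])

print(select_body_actions("living_room", "lift"))  # ['raise_head', 'raise_leg']
```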
S403、根据语音情感信息和图像情感信息,确定虚拟数字人的关键点的情感表达方式。S403. Determine the emotional expression of key points of the virtual digital human based on the voice emotional information and the image emotional information.
具体的,若获取的语音情感信息和图像情感信息为开心的情感信息,则确定虚拟数字人的关键点的情感表达方式示例性可以为嘴巴笑,双手鼓掌等。Specifically, if the obtained voice emotion information and image emotion information are happy emotion information, examples of emotional expressions for determining the key points of the virtual digital human can be smiling with the mouth, clapping with both hands, etc.
在具体的实施方式中,通过深度学习的方法学习虚拟人关键点与语音特征信息的映射,以及人脸关键点与语音情感信息和图像情感信息的映射。In a specific implementation, the deep learning method is used to learn the mapping of virtual human key points and voice feature information, as well as the mapping of human face key points and voice emotion information and image emotion information.
本公开实施例提供的虚拟数字人驱动方法,在该方法中通过融合情绪关键点模板方式,实现表情可控的语音驱动虚拟数字人动画生成。The virtual digital human driving method provided by the embodiments of the present disclosure realizes the generation of voice-driven virtual digital human animation with controllable expressions by integrating emotional key point templates.
图7是本公开实施例提供的又一种虚拟数字人驱动方法的流程示意图,本公开实施例是在图6对应的实施例的基础上,如图7所示,步骤S401之前还包括:Figure 7 is a schematic flow chart of another virtual digital human driving method provided by an embodiment of the present disclosure. The embodiment of the present disclosure is based on the embodiment corresponding to Figure 6. As shown in Figure 7, before step S401, it also includes:
S301、确定虚拟数字人的形象。S301. Determine the image of the virtual digital person.
在具体的实施方式中,通过从语音中提取不同的特征,分别驱动头部运动、面部活动、肢体动作,综合形成更加生动的语音驱动方式。In a specific implementation, different features are extracted from speech to drive head movements, facial movements, and body movements respectively, forming a more vivid speech driving method.
The image of the virtual digital human is driven on the basis of deep neural network methods, and a generative adversarial network is applied for high-fidelity real-time generation. The image generation of the virtual digital human is divided into action driving and image library production. The hair library, clothing library and tooth model of the virtual digital human image are produced offline, and the image can be produced in a targeted manner according to different application scenarios. The action driving module of the virtual digital human is processed on the server side, after which the topological vertex data is encapsulated and transmitted, and texture mapping, rendering output and the like are performed on the device side.

As a specific implementation, with the user key points as the core, the driving of the virtual digital human is realized by key-point driving technology based on adversarial networks, a feature-point geometric stretching method, and image transformation and generation technology based on the Encoder-Decoder method. At the same time, by fusing emotion key-point templates, the correspondence between the user key points and preset user emotion key points is established, realizing the emotional expression of the virtual digital human.

As another specific implementation, a 3D face driving technology based on deep encoding-decoding technology realizes the semantic mapping between speech features and three-dimensional vertex motion features, and a prosodic head driving technology based on a deep codec nested with a temporal network provides the ability to control head movement and facial activity separately.
本公开实施例提供一种计算机设备,包括:一个或多个处理器;存储器,用于存储一个或多个程序,当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现本公开实施例中的任一种所述的方法。Embodiments of the present disclosure provide a computer device, including: one or more processors; and a memory for storing one or more programs. When the one or more programs are executed by the one or more processors, such that The one or more processors implement the method described in any one of the embodiments of the present disclosure.
In addition, it is considered that the current efficiency of generating speaking styles for virtual digital humans is relatively low. For example, the smart device shown in Figure 8 synchronously displays the expression and mouth shape of the virtual digital human while playing the voice response information; as shown in Figure 9, the user can both hear the voice of the virtual digital human and see its expression when speaking, giving the user the experience of conversing with a person.
通常,人们说话的时候,不同的人具有不同的状态,例如,有的人说话时口型准确、表情丰富,有的人说话时口型偏小、表情严肃等。也就是说,不同的人具有不同的说话风格。如此,可以设计出不同说话风格的虚拟数字人,即不同说话风格的虚拟数字人的口型、表情不同,用户可以与不同说话风格的三维虚拟人进行对话,从而能够提升用户的体验。每设计一种新的三维虚拟人的说话风格,均需要先获取相应的训练样本,基于相应的训练样本重新训练说话风格模型,使得基于重新训练后的说话风格模型可以生成新的说话风格参数,并基于说话风格参数驱动基本说话风格,如图10所示,可以生成新的说话风格。由于重新训练说话风格模型需要花费大量的时间来采集训练样本和处理大量的数据,因此,每生成一个新的说话风格需要花费比较多的时间,使得说话风格的生成效率比较低下。Usually, when people speak, different people have different states. For example, some people have accurate mouth shapes and rich expressions when they speak, while some people have small mouth shapes and serious expressions when they speak. That is, different people have different speaking styles. In this way, virtual digital people with different speaking styles can be designed, that is, virtual digital people with different speaking styles have different mouth shapes and expressions. Users can have conversations with three-dimensional virtual people with different speaking styles, thereby improving the user experience. Every time a new three-dimensional virtual human speaking style is designed, corresponding training samples need to be obtained first, and the speaking style model is retrained based on the corresponding training samples, so that new speaking style parameters can be generated based on the retrained speaking style model. And based on the speaking style parameters, the basic speaking style is driven, as shown in Figure 10, and a new speaking style can be generated. Since retraining the speaking style model requires a lot of time to collect training samples and process a large amount of data, it takes a lot of time to generate a new speaking style, making the speaking style generation efficiency relatively low.
To solve the above problem, the present disclosure fits a target style feature attribute based on multiple style feature attributes to determine the fitting coefficient of each style feature attribute; determines a target style feature vector according to the fitting coefficients of the style feature attributes and multiple style feature vectors, where the multiple style feature vectors correspond one-to-one to the multiple style feature attributes; inputs the target style feature vector into a speaking style model and outputs target speaking style parameters, where the speaking style model is obtained by training the framework of the speaking style model based on the multiple style feature vectors; and generates the target speaking style based on the target speaking style parameters. In this way, the target style feature vector can be fitted with multiple style feature vectors. Since the speaking style model is trained based on the multiple style feature vectors, inputting the target style feature vector fitted from them into the model directly yields the corresponding new speaking style without retraining the speaking style model, which enables fast transfer of speaking styles and improves the efficiency of speaking style generation.
Figure 11 is a schematic diagram of a human-computer interaction scenario according to some embodiments. As shown in Figure 11, in a voice interaction scenario between a user and a smart home, the smart devices may include a smart refrigerator 110, a smart washing machine 120, a smart display device 130, and so on. When the user wants to control a smart device, the user first needs to issue a voice instruction. Upon receiving the voice instruction, the smart device needs to perform semantic understanding on it, determine the semantic understanding result corresponding to the voice instruction, and execute the corresponding control instruction according to the result to meet the user's needs. The smart devices in this scenario all include a display screen, which may be a touch screen or a non-touch screen. For a terminal device with a touch screen, the user can interact with the terminal device through gestures, fingers, or a touch tool (for example, a stylus). For a non-touch-screen terminal device, interaction with the terminal device can be implemented through an external device (for example, a mouse or a keyboard). The display screen can display a three-dimensional virtual human, and through the display screen the user can see the three-dimensional virtual human and its expression when speaking, realizing dialogue interaction with the three-dimensional virtual human.
本公开实施例提供的说话风格生成方法,可以基于计算机设备,或者计算机设备中的功能模块或者功能实体实现。其中,计算机设备可以为个人计算机(personal computer,PC)、服务器、手机、平板电脑、笔记本电脑、大型计算机等,本公开实施例对此不作具体限定。The speaking style generation method provided by the embodiments of the present disclosure can be implemented based on a computer device, or a functional module or functional entity in the computer device. The computer device may be a personal computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a large computer, etc., which are not specifically limited in the embodiments of the present disclosure.
To describe the speaking style generation method in more detail, an exemplary description is given below with reference to Figure 12. It can be understood that the steps involved in Figure 12 may include more or fewer steps in actual implementation, and the order of these steps may also differ, as long as the speaking style generation method provided in the embodiments of the present disclosure can be implemented.
图12为根据一些实施例的说话风格生成方法的流程示意图,如图12所示,该方法具体包括如下步骤:Figure 12 is a schematic flowchart of a speaking style generation method according to some embodiments. As shown in Figure 12, the method specifically includes the following steps:
S1301,基于多个风格特征属性拟合目标风格特征属性,确定各风格特征属性的拟合系数。S1301. Fit the target style feature attributes based on multiple style feature attributes, and determine the fitting coefficient of each style feature attribute.
Exemplarily, a facial topology data sequence is collected while the user speaks during a time period Δt. In the facial topology data sequence, each frame of facial topology data corresponds to a dynamic face topology; the face topology includes multiple vertices, and each vertex in the dynamic face topology corresponds to a vertex coordinate (x, y, z). When the user is not speaking, a preset static face topology is used, in which each vertex has vertex coordinates (x', y', z'). Thus, based on the difference between the coordinates of the same vertex in the dynamic face topology and in the static face topology, the vertex offset (Δx, Δy, Δz) of each vertex in each dynamic face topology can be determined, i.e., Δx = x − x', Δy = y − y', Δz = z − z'. Based on the vertex offsets (Δx, Δy, Δz) of each vertex in all dynamic face topologies corresponding to the facial topology data sequence, the average vertex offset of each vertex of the dynamic face topology over the sequence can be determined.
Figure 13 is a schematic diagram of dividing facial topology data into regions provided by an embodiment of the present disclosure. As shown in Figure 13, the facial topology data can be divided into multiple regions, for example into three regions S1, S2 and S3, where S1 is the entire facial region above the lower edge of the eyes, S2 is the facial region from the lower edge of the eyes to the upper edge of the upper lip, and S3 is the facial region from the upper edge of the upper lip to the chin. On the basis of the above embodiment, the mean of the average vertex offsets of all vertices of the dynamic face topology within region S1, the mean within region S2, and the mean within region S3 can be determined, and from these the style feature attribute is obtained. In summary, one style feature attribute can be obtained for one user, and thus multiple style feature attributes can be obtained from multiple users.
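The following NumPy sketch, offered only as an illustration under the assumptions above, computes one user's style feature attribute: per-vertex offsets are averaged over all frames, then averaged per region S1/S2/S3 and concatenated. The `region_ids` array assigning each vertex to a region is a hypothetical input, and the sizes are toy values.

```python
import numpy as np

def style_feature_attribute(dynamic_frames, static_vertices, region_ids, num_regions=3):
    # dynamic_frames: (frames, vertices, 3); static_vertices: (vertices, 3)
    offsets = dynamic_frames - static_vertices[None, :, :]      # (dx, dy, dz) per frame
    mean_offsets = offsets.mean(axis=0)                          # per-vertex average offset
    attr = []
    for r in range(num_regions):
        attr.append(mean_offsets[region_ids == r].mean(axis=0))  # region-wise mean offset
    return np.concatenate(attr)                                  # spliced in region order

frames = np.random.randn(300, 500, 3)      # 300 frames, 500 vertices (toy sizes)
static = np.random.randn(500, 3)
regions = np.random.randint(0, 3, size=500)
print(style_feature_attribute(frames, static, regions).shape)    # (9,)
```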
According to the multiple acquired style feature attributes, a new style feature attribute, i.e., the target style feature attribute, can be formed by fitting. For example, the target style feature attribute can be fitted based on the following formula:

target style feature attribute = a1 × (style feature attribute of user 1) + a2 × (style feature attribute of user 2) + … + an × (style feature attribute of user n)      (1)

where a1 is the fitting coefficient of the style feature attribute of user 1, a2 is the fitting coefficient of the style feature attribute of user 2, and an is the fitting coefficient of the style feature attribute of user n; n is the number of users, and a1 + a2 + … + an = 1.
Based on the above formula, an optimization method, such as gradient descent or the Gauss-Newton method, can be used to obtain the fitting coefficient of each style feature attribute.
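A rough gradient-descent sketch of this fitting step is shown below; it is not the disclosed procedure. The coefficients are renormalized after each step as a simple way of keeping their sum equal to 1, which is an assumption on my part since the disclosure only requires "an optimization method" here.

```python
import numpy as np

def fit_coefficients(user_attrs, target_attr, steps=5000):
    A = np.stack(user_attrs, axis=1)                 # (dim, n): one column per user attribute
    a = np.full(A.shape[1], 1.0 / A.shape[1])        # start from uniform coefficients
    lr = 1.0 / (np.linalg.norm(A, 2) ** 2)           # step size from the spectral norm
    for _ in range(steps):
        grad = A.T @ (A @ a - target_attr)           # gradient of 0.5 * ||A a - target||^2
        a -= lr * grad
        a /= a.sum()                                 # re-impose a1 + ... + an = 1
    return a

users = [np.random.randn(9) for _ in range(4)]
target = 0.1 * users[0] + 0.2 * users[1] + 0.3 * users[2] + 0.4 * users[3]
print(np.round(fit_coefficients(users, target), 3))  # approximately [0.1, 0.2, 0.3, 0.4]
```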
需要说明的是,本实施例仅以将面部拓扑结构数据划分为三个区域为例进行实例性说明,并不作为对面部拓扑结构数据区域划分的具体限制。It should be noted that this embodiment only takes dividing facial topological structure data into three areas as an example for illustration, and does not serve as a specific restriction on dividing facial topological structure data into areas.
S1302,根据所述各风格特征属性的拟合系数和多个风格特征向量,确定目标风格特征向量。S1302: Determine a target style feature vector based on the fitting coefficient of each style feature attribute and multiple style feature vectors.
所述多个风格特征向量与所述多个风格特征属性一一对应。The plurality of style feature vectors correspond to the plurality of style feature attributes in one-to-one correspondence.
示例性的,风格特征向量为风格的表征,可以基于分类任务模型,将训练分类任务模型得到的embedding作为风格特征向量,或者,可以直接设计one-hot特征向量为风格特征向量。例如,3个用户对应3个风格特征属性为one-hot特征向量,则3个风格特征向量可以是[1;0;0]、[0;1;0]和[0;0;1]。For example, the style feature vector is a representation of style, and the embedding obtained by training the classification task model can be used as the style feature vector based on the classification task model, or the one-hot feature vector can be directly designed as the style feature vector. For example, if 3 users have 3 style feature attributes corresponding to one-hot feature vectors, then the 3 style feature vectors can be [1;0;0], [0;1;0] and [0;0;1].
On the basis of the above embodiment, the style feature attributes of n users with different speaking styles are obtained, and correspondingly the style feature vectors of the n users can be obtained. These n style feature attributes correspond one-to-one to the n style feature vectors, and the n style feature attributes together with their corresponding style feature vectors form the basic style feature basis. By multiplying the respective fitting coefficients of the n style feature attributes by the corresponding style feature vectors, the target style feature vector can be expressed in terms of the basic style feature basis, as in the following formula:

p = a1×F1 + a2×F2 + … + an×Fn      (2)
where F1 is the style feature vector of user 1, F2 is the style feature vector of user 2, Fn is the style feature vector of user n, and p is the target style feature vector.

For example, when the style feature vectors are one-hot feature vectors, the target style feature vector p can be expressed as:

p = a1×[1;0;0] + a2×[0;1;0] + a3×[0;0;1] = [a1; a2; a3]
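As a small illustration, not taken from the original disclosure and using the three-user one-hot example above, the target style feature vector reduces to the coefficient vector itself:

```python
import numpy as np

coeffs = np.array([0.2, 0.5, 0.3])                 # fitting coefficients a1, a2, a3 (sum to 1)
style_vectors = np.eye(3)                          # one-hot style feature vectors F1, F2, F3
p = sum(a * F for a, F in zip(coeffs, style_vectors))
print(p)                                           # [0.2 0.5 0.3]
```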
S1303,将所述目标风格特征向量输入至说话风格模型中,输出目标说话风格参数。S1303. Input the target style feature vector into the speaking style model and output the target speaking style parameters.
所述说话风格模型是基于所述多个风格特征向量训练说话风格模型的框架得到的。The speaking style model is obtained by training a speaking style model framework based on the plurality of style feature vectors.
示例性的,根据风格基本特征基中的多个风格特征向量,训练说话风格模型的框架,得到训练好的说话风格模型的框架,即说话风格模型。将目标风格特征向量输入至说话风格模型中,可以理解为将多个风格特征向量和各自拟合系数的乘积输入至说话风格模型中,这与训练说话风格模型的框架时输入的训练样本相同。故而,基于说话风格模型,将目标风格特征向量作为输入,可以直接输出得到目标说话风格参数。For example, the framework of the speaking style model is trained based on multiple style feature vectors in the basic style feature base to obtain the framework of the trained speaking style model, that is, the speaking style model. Inputting the target style feature vector into the speaking style model can be understood as inputting the product of multiple style feature vectors and respective fitting coefficients into the speaking style model, which is the same as the training sample input when training the framework of the speaking style model. Therefore, based on the speaking style model, using the target style feature vector as input, the target speaking style parameters can be directly output.
目标说话风格参数可以是动态人脸拓扑结构中的各顶点与静态人脸拓扑结构中对应顶点的顶点偏移量;或者,可以是动态人脸拓扑结构的表情基的系数,或者还可以是其他参数,本公开对此不作具体限制。The target speaking style parameter can be the vertex offset between each vertex in the dynamic face topology structure and the corresponding vertex in the static face topology structure; or it can be the coefficient of the expression basis of the dynamic face topology structure, or it can be other parameters, this disclosure does not impose specific restrictions on this.
S1304,基于所述目标说话风格参数,生成目标说话风格。S1304: Generate a target speaking style based on the target speaking style parameters.
Exemplarily, the target speaking style parameters are the vertex offsets between each vertex in the dynamic face topology and the corresponding vertex in the static face topology. Thus, on the basis of the static face topology, each vertex of the static face topology is driven to move to the corresponding position according to its vertex offset, and the target speaking style is obtained.
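A minimal sketch of this last step, assuming the parameters are per-vertex offsets relative to the static topology (toy sizes, not the disclosed implementation):

```python
import numpy as np

def apply_speaking_style(static_vertices, vertex_offsets):
    # static_vertices, vertex_offsets: (num_vertices, 3); result is the driven topology.
    return static_vertices + vertex_offsets

static = np.zeros((500, 3))
offsets = 0.01 * np.random.randn(500, 3)   # toy target speaking style parameters
driven = apply_speaking_style(static, offsets)
print(driven.shape)                        # (500, 3)
```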
In the embodiments of the present disclosure, the target style feature attribute is fitted based on multiple style feature attributes to determine the fitting coefficient of each style feature attribute; the target style feature vector is determined according to the fitting coefficients of the style feature attributes and multiple style feature vectors, where the multiple style feature vectors correspond one-to-one to the multiple style feature attributes; the target style feature vector is input into the speaking style model to output the target speaking style parameters, where the speaking style model is obtained by training the framework of the speaking style model based on the multiple style feature vectors; and the target speaking style is generated based on the target speaking style parameters. In this way, the target style feature vector can be fitted with multiple style feature vectors. Since the speaking style model is trained based on the multiple style feature vectors, inputting the target style feature vector fitted from them into the model directly yields the corresponding new speaking style without retraining the speaking style model, which enables fast transfer of speaking styles and improves the efficiency of speaking style generation.
图14为根据一些实施例的说话风格生成方法的流程示意图,图14为如图12所示实施例的基础上,执行S1301之前还包括:Figure 14 is a schematic flowchart of a speaking style generation method according to some embodiments. Figure 14 is based on the embodiment shown in Figure 12. Before executing S1301, it also includes:
S1501,采集多个预设用户朗读多段语音时的多帧面部拓扑结构数据。S1501: Collect multi-frame facial topological structure data when multiple preset users read multiple speech segments.
Exemplarily, users with different speaking styles are selected as preset users, and multiple segments of speech are also selected. While each preset user reads each segment of speech aloud, multiple frames of facial topology data of that preset user are collected. For example, if the duration of speech segment 1 is t1 and the frequency of collecting facial topology data is 30 frames per second, then after preset user 1 finishes reading speech segment 1, t1*30 frames of facial topology data can be collected.
S1502,针对每个预设用户:根据所述多段语音对应的所述多帧面部拓扑结构数据各自的说话风格参数和面部拓扑结构数据的划分区域,确定各划分区域内的所述多帧面部拓扑结构数据的所述说话风格参数的平均值。S1502, for each preset user: determine the multi-frame facial topology in each divided area based on the respective speaking style parameters and divided areas of the multi-frame facial topological structure data corresponding to the multiple segments of speech. The average of the speaking style parameters of the structural data.
Exemplarily, based on the above embodiment, for preset user 1, after preset user 1 finishes reading the m segments of speech, t1*30*m frames of facial topology data can be collected. The vertex offsets (Δx, Δy, Δz) between each vertex of the dynamic face topology and the corresponding vertex of the static face topology in each frame of facial topology data can be used as the speaking style parameters of that frame. Based on the vertex offsets (Δx, Δy, Δz) of every vertex of all dynamic face topologies corresponding to the t1*30*m frames of facial topology data of preset user 1, the average vertex offset of each vertex of the dynamic face topology in the facial topology data of preset user 1 can be determined.

Based on the divided regions of the facial topology data, for each divided region of preset user 1, the mean of the average vertex offsets of all vertices of the dynamic face topology within that region can be obtained. For example, the facial topology data is divided into three regions, and the means of the average vertex offsets of all vertices of the dynamic face topology in the facial topology data within regions S1, S2 and S3 are obtained respectively.
S1503,将所述各划分区域内的所述多帧面部拓扑结构数据的所述说话风格参数的平均值按照预设顺序拼接,得到所述每个预设用户的风格特征属性。S1503: Splice the average value of the speaking style parameters of the multi-frame facial topological structure data in each divided area in a preset order to obtain the style feature attributes of each preset user.
Exemplarily, the preset order may be the top-to-bottom order shown in Figure 13, or the bottom-to-top order shown in Figure 13, which is not specifically limited in the present disclosure. If the preset order is the top-to-bottom order shown in Figure 13, based on the above embodiment, the means of the average vertex offsets of all vertices of the dynamic face topology in the facial topology data corresponding to the respective regions can be spliced in the order of regions S1, S2 and S3, thereby obtaining the style feature attribute of preset user 1.

In summary, the style feature attribute can be obtained for preset user 1; in this way, multiple style feature attributes can be obtained for multiple preset users.
图15为根据一些实施例的说话风格生成方法的流程示意图,图15为如图14所示实施例的基础上,执行S1301之前还包括:Figure 15 is a schematic flowchart of a speaking style generation method according to some embodiments. Figure 15 is based on the embodiment shown in Figure 14. Before executing S1301, it also includes:
S1601,采集目标用户朗读所述多段语音时的多帧目标面部拓扑结构数据。S1601: Collect multi-frame target facial topological structure data when the target user reads the multiple segments of speech.
所述目标用户与所述多个预设用户为不同的用户。The target user and the plurality of preset users are different users.
Exemplarily, when a target speaking style different from the speaking styles of the multiple preset users currently needs to be generated, multiple frames of target facial topology data are collected while the target user corresponding to the target speaking style reads the multiple segments of speech aloud, and the content of the multiple segments of speech read by the target user is the same as the content of the multiple segments of speech read by the multiple preset users. For example, after the target user reads m segments of speech each of duration t1, t1*30*m frames of target facial topology data can be obtained.
S1602,根据所述多段语音对应的所述多帧目标面部拓扑结构数据各自的说话风格参数和所述面部拓扑结构数据的划分区域,确定所述各划分区域内的所述多帧目标面部拓扑结构数据的所述说话风格参数的平均值。S1602: Determine the multi-frame target facial topology in each divided area based on the respective speaking style parameters of the multi-frame target facial topological structure data corresponding to the multiple segments of speech and the divided areas of the facial topological structure data. The average of the speaking style parameters of the data.
可以将每帧目标面部拓扑结构数据中的动态人脸拓扑结构的各顶点和静态人脸拓扑结构的各顶点的顶点偏移量(Δx’,Δy’,Δz’),作为每帧目标面部拓扑结构数据的说话风格参数,基于目标用户的t1*30*m帧目标面部拓扑结构数据中的所有动态人脸拓扑结构的每个顶点的顶点偏移量(Δx’,Δy’,Δz’),可以确定目标用户的目标面部拓扑结构数据中动态人脸拓扑结构的每个顶点的平均顶点偏移量 The vertex offsets (Δx', Δy', Δz') of each vertex of the dynamic face topology and each vertex of the static face topology in the target facial topology data of each frame can be used as the target facial topology of each frame. The speaking style parameters of the structural data are based on the vertex offset (Δx', Δy', Δz') of each vertex of all dynamic face topology structures in the target user's t1*30*m frame target facial topology data, The average vertex offset of each vertex of the dynamic face topology in the target facial topology data of the target user can be determined
基于上述面部拓扑结构数据的划分区域,针对目标用户的每个划分区域,可以得到划分区域内的目标面部拓扑结构数据中动态人脸拓扑结构的所有顶点的平均顶点偏移量 的平均值。例如,面部拓扑结构数据划分为三个区域,其中,区域S1内的目标面部拓扑结构数据中动态人脸拓扑结构的所有顶点的平均顶点偏移量的平均值为 区域S2内的目标面部拓扑结构数据中动态人脸拓扑结构的所有顶点的平均顶点偏移量的平均值为区域S3内的目标面部拓扑结构数据中动态人脸拓扑结构的所有顶点的平均顶点偏移量的平均值为 Based on the above divided areas of facial topology data, for each divided area of the target user, the average vertex offset of all vertices of the dynamic face topology in the target facial topological structure data within the divided area can be obtained average of. For example, the facial topology data is divided into three areas, where the average value of the average vertex offset of all vertices of the dynamic face topology in the target facial topology data in area S1 is The average value of the average vertex offset of all vertices of the dynamic face topology in the target facial topology data in area S2 is The average value of the average vertex offset of all vertices of the dynamic face topology in the target facial topology data in area S3 is
S1603,将所述各划分区域内的所述多帧目标面部拓扑结构数据的所述说话风格参数的平均值按照所述预设顺序拼接,得到所述目标风格特征属性。S1603: Splice the average value of the speaking style parameters of the multi-frame target facial topological structure data in each divided area according to the preset order to obtain the target style feature attribute.
Exemplarily, the means of the average vertex offsets of all vertices of the dynamic face topology in the target facial topology data are spliced based on the same preset order as in the above embodiment. For example, based on the top-to-bottom order shown in Figure 13, the means of the average vertex offsets of all vertices of the dynamic face topology in the target facial topology data corresponding to the respective regions can be spliced in the order of regions S1, S2 and S3, thereby obtaining the target style feature attribute of the target user.
It should be noted that S1501-S1503 shown in Figure 14 can be executed first and then S1601-S1603 shown in Figure 15, or S1601-S1603 shown in Figure 15 can be executed first and then S1501-S1503 shown in Figure 14; this is not specifically limited in the present disclosure.
图16为根据一些实施例的说话风格生成方法的流程示意图,图16为如图14和图13所示实施例的基础上,执行S1303之前还包括:Figure 16 is a schematic flowchart of a speaking style generation method according to some embodiments. Figure 16 is based on the embodiments shown in Figures 14 and 13. Before executing S1303, it also includes:
S1701,获取训练样本集。S1701, obtain the training sample set.
所述训练样本集包括输入样本集和输出样本集,输入样本包括语音特征及其对应的所述多个风格特征向量,输出样本包括所述说话风格参数。 The training sample set includes an input sample set and an output sample set. The input sample includes voice features and the plurality of corresponding style feature vectors, and the output sample includes the speaking style parameters.
When a preset user reads the speech aloud, intrinsic features of the voice information can be extracted, mainly features that express the speech content. For example, Mel-spectrum features of the speech can be extracted as the speech features; alternatively, a speech feature extraction model commonly used in the industry can be used to extract the speech features, or the speech features can be extracted based on a designed deep network model, and so on. Based on the extraction rate of the speech features, after a preset user finishes reading the multiple segments of speech, a speech feature sequence can be extracted. Since the contents of the multiple segments of speech read by the multiple preset users are exactly the same, the same speech feature sequence can be extracted for different preset users. Thus, for the same speech feature in the speech feature sequence, there are multiple style feature vectors corresponding to the multiple preset users; one speech feature and its corresponding multiple style feature vectors can be used as an input sample, and based on all the speech features in the speech feature sequence, multiple input samples, i.e., the input sample set, can be obtained.

Exemplarily, while each speech feature is extracted, the corresponding facial topology data can be collected. Based on the respective vertex coordinates of all vertices of the dynamic face topology in the facial topology data, the respective vertex offsets of all vertices of the dynamic face topology in the facial topology data can be obtained. The vertex offsets of all vertices of the dynamic face topology in one frame of facial topology data are taken as one set of speaking style parameters, and one set of speaking style parameters is one output sample. Thus, based on the multiple frames of facial topology data corresponding to the speech feature sequence, multiple output samples, i.e., the output sample set, can be obtained. The input sample set and the output sample set constitute the training sample set for training the speaking style generation model.
S1702,定义所述说话风格模型的框架。S1702. Define the framework of the speaking style model.
所述说话风格模型的框架包括线性组合单元和网络模型,所述线性组合单元用于生成所述多个风格特征向量的线性组合风格特征向量,生成多个输出样本的线性组合输出样本,所述输入样本与所述输出样本一一对应;所述网络模型用于根据所述线性组合风格特征向量,生成对应的预测输出样本。The framework of the speaking style model includes a linear combination unit and a network model. The linear combination unit is used to generate a linear combination style feature vector of the plurality of style feature vectors and generate a linear combination output sample of a plurality of output samples. The input samples correspond to the output samples one-to-one; the network model is used to generate corresponding predicted output samples according to the linear combination style feature vector.
图17为根据一些实施例的说话风格模型的框架的结构示意图,如图17所示,说话风格模型的框架包括线性组合单元310和网络模型320,线性组合单元310的输入端用于接收训练样本,线性组合单元310的输出端与网络模型320的输入端连接,网络模型320的输出端即为说话风格模型的框架300的输出端。Figure 17 is a schematic structural diagram of the framework of the speaking style model according to some embodiments. As shown in Figure 17, the framework of the speaking style model includes a linear combination unit 310 and a network model 320. The input end of the linear combination unit 310 is used to receive training samples. , the output end of the linear combination unit 310 is connected to the input end of the network model 320, and the output end of the network model 320 is the output end of the framework 300 of the speaking style model.
训练样本输入至线性组合单元310后,训练样本包括输入样本和输出样本,其中,输入样本包括语音特征及其对应的多个风格特征向量,线性组合单元310可以将多个风格特征向量进行线性组合,得到线性组合风格特征向量,还可以将多个风格特征向量各自对应的说话风格参数的进行线性组合,得到线性组合输出样本。线性组合单元310可以输出语音特征及其对应的线性组合风格特征向量,即线性组合输入样本,同时还可以输出相应的线性组合输出样本。将线性组合训练样本输入至网络模型320,线性组合训练样本包括线性组合输入样本和线性组合输出样本,基于线性组合训练样本,对网络模型320进行训练。After the training samples are input to the linear combination unit 310, the training samples include input samples and output samples, where the input samples include speech features and their corresponding multiple style feature vectors. The linear combination unit 310 can linearly combine the multiple style feature vectors. , to obtain a linear combination of style feature vectors, and the corresponding speaking style parameters of multiple style feature vectors can also be linearly combined to obtain a linear combination output sample. The linear combination unit 310 can output speech features and their corresponding linear combination style feature vectors, that is, linear combination input samples, and can also output corresponding linear combination output samples. The linear combination training samples are input to the network model 320. The linear combination training samples include linear combination input samples and linear combination output samples. Based on the linear combination training samples, the network model 320 is trained.
S1703,根据所述训练样本集和损失函数,训练所述说话风格模型的框架,得到所述说话风格模型。S1703: Train the framework of the speaking style model according to the training sample set and the loss function to obtain the speaking style model.
Based on the above embodiment, the training samples in the training sample set are input into the framework of the speaking style model, and the framework of the speaking style model can output predicted output samples. The loss function is used to determine the loss value between the predicted output samples and the output samples, and the model parameters of the framework of the speaking style model are adjusted in the direction that reduces the loss value, thereby completing one iteration of training. In this way, by training the framework of the speaking style model over multiple iterations, the trained framework of the speaking style model, i.e., the speaking style model, can be obtained.
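As a rough illustration of one such training iteration, the following PyTorch sketch uses a hypothetical stand-in network and toy dimensions; it is not the disclosed architecture, only the generic predict-loss-backpropagate loop described above.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32 + 3, 64), nn.ReLU(), nn.Linear(64, 1500))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

speech_feat = torch.randn(8, 32)              # batch of speech features
style_vec = torch.randn(8, 3).softmax(-1)     # linear-combination style vectors (weights sum to 1)
target = torch.randn(8, 1500)                 # linear-combination output samples (vertex offsets)

for _ in range(5):                            # a few training iterations
    pred = model(torch.cat([speech_feat, style_vec], dim=-1))
    loss = loss_fn(pred, target)              # loss between predicted and output samples
    optimizer.zero_grad()
    loss.backward()                           # adjust parameters in the loss-reducing direction
    optimizer.step()
```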
In this embodiment, a training sample set is obtained, where the training sample set includes an input sample set and an output sample set, an input sample includes a speech feature and its corresponding multiple style feature vectors, and an output sample includes the speaking style parameters; the framework of the speaking style model is defined, where the framework includes a linear combination unit and a network model, the linear combination unit is used to generate a linear-combination style feature vector of the multiple style feature vectors and a linear-combination output sample of multiple output samples, the input samples correspond one-to-one to the output samples, and the network model is used to generate the corresponding predicted output sample according to the linear-combination style feature vector; the framework of the speaking style model is then trained according to the training sample set and the loss function to obtain the speaking style model. Since the speaking style model is essentially obtained by training the network model on linear combinations of the multiple style feature vectors, the diversity of the training samples of the network model is increased and the generality of the speaking style model is improved.
图18为根据一些实施例的说话风格生成方法的流程示意图,图18为图16所示实施例的基础上,执行S1703时的一种实现方式的具体描述,如下:Figure 18 is a schematic flowchart of a speaking style generation method according to some embodiments. Figure 18 is a detailed description of an implementation method when performing S1703 based on the embodiment shown in Figure 16, as follows:
S501,将所述训练样本集输入至所述线性组合单元,基于所述多个风格特征向量及其各自的权重值,生成所述线性组合风格特征向量,基于所述多个风格特征向量各自的权重值和所述多个输出样本,生成所述线性组合输出样本。S501. Input the training sample set to the linear combination unit, and generate the linear combination style feature vector based on the multiple style feature vectors and their respective weight values. Based on the respective weight values of the multiple style feature vectors, The weight value and the multiple output samples are used to generate the linear combination output sample.
所述多个风格特征向量各自的权重值的和值为1。The sum of the weight values of the plurality of style feature vectors is 1.
Exemplarily, after the training samples are input to the linear combination unit, weight values can be assigned to the multiple style feature vectors respectively based on the linear combination unit, and the sum of the weight values of the multiple style feature vectors is 1. By summing the products of each style feature vector and its corresponding weight value, the linear-combination style feature vector can be obtained. Each style feature vector corresponds to one output sample, and by summing the products of the respective weight values of the multiple style feature vectors and the corresponding output samples, the linear-combination output sample can be obtained. In this way, based on different weight values, different linear-combination style feature vectors and different linear-combination output samples can be obtained; based on the multiple speech features and their respective corresponding linear-combination style feature vectors, the linear-combination input sample set can be obtained, and based on the output samples corresponding to the multiple speech features, the linear-combination output sample set can be obtained.
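A minimal sketch of this linear combination step is given below, assuming NumPy and toy sizes; the random-weight draw is one simple way to produce weights summing to 1 and is an assumption, not the disclosed scheme.

```python
import numpy as np

def linear_combine(style_vectors, output_samples, rng=np.random.default_rng()):
    n = len(style_vectors)
    w = rng.random(n)
    w /= w.sum()                                        # weight values sum to 1
    combo_style = sum(wi * f for wi, f in zip(w, style_vectors))
    combo_output = sum(wi * y for wi, y in zip(w, output_samples))
    return combo_style, combo_output

styles = [np.eye(3)[i] for i in range(3)]               # one-hot style feature vectors
outputs = [np.random.randn(1500) for _ in range(3)]     # per-style speaking style parameters
cs, co = linear_combine(styles, outputs)
print(cs, co.shape)                                     # e.g. [0.2 0.5 0.3] (1500,)
```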
S502,根据所述损失函数和线性组合训练样本集,训练所述网络模型,得到所述说话风格模型。S502: Train the network model according to the loss function and the linear combination training sample set to obtain the speaking style model.
所述线性组合训练样本集包括线性组合输入样本集和线性组合输出样本集,线性组合输入样本包括所述语音特征及其对应的所述线性组合风格特征向量。The linear combination training sample set includes a linear combination input sample set and a linear combination output sample set. The linear combination input sample includes the speech feature and its corresponding linear combination style feature vector.
示例性的,线性组合训练样本集包括线性组合输入样本集和线性组合输出样本集,将线性组合训练样本输入至网络模型,基于网络模型和线性组合输入样本,可以得到预测输出样本,基于损失函数的损失值减小的方向,调整网络模型的模型参数,自此完成一次网络模型的迭代训练。如此,基于网络模型的多次迭代训练,可以得到训练好的训练说话风格模型的框架,即说话风格模型。Exemplarily, the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set. The linear combination training samples are input to the network model. Based on the network model and the linear combination input sample, a predicted output sample can be obtained based on the loss function. In the direction in which the loss value decreases, the model parameters of the network model are adjusted, and an iterative training of the network model is completed. In this way, based on multiple iterative trainings of the network model, a well-trained framework for training the speaking style model, that is, the speaking style model, can be obtained.
In this embodiment, the training sample set is input to the linear combination unit, the linear-combination style feature vector is generated based on the multiple style feature vectors and their respective weight values, and the linear-combination output sample is generated based on the respective weight values of the multiple style feature vectors and the multiple output samples, where the sum of the weight values of the multiple style feature vectors is 1; the network model is trained according to the loss function and the linear-combination training sample set to obtain the speaking style model, where the linear-combination training sample set includes the linear-combination input sample set and the linear-combination output sample set, and a linear-combination input sample includes the speech feature and its corresponding linear-combination style feature vector. The linearly combined training samples can thus be used as training samples of the network model, which increases the number and diversity of the training samples of the network model and improves the generality and accuracy of the speaking style model.
In some embodiments of the present disclosure, Figure 19 is a schematic structural diagram of the framework of another speaking style generation model provided by an embodiment of the present disclosure. As shown in Figure 19, on the basis of the embodiment shown in Figure 17, the framework of the speaking style model further includes a scaling unit 330. The input end of the scaling unit 330 is used to receive the training samples, and the output end of the scaling unit 330 is connected to the input end of the linear combination unit 310. The scaling unit 330 is used to scale the multiple style feature vectors and the multiple output samples based on a randomly generated scaling factor to obtain multiple scaled style feature vectors and multiple scaled output samples, and to output scaled training samples, where the scaled training samples include the multiple scaled style feature vectors and their respective corresponding scaled output samples. The scaling factor can be 0.5-2, accurate to one decimal place.
缩放训练样本输入至线性组合单元310,基于线性组合单元310可以将多个缩放风格特征向量进行线性组合,得到线性组合风格特征向量,还可以将多个缩放风格特征向量各自对应的缩放输出样本的进行线性组合,得到线性组合输出样本。线性组合单元310可以输出语音特征及其对应的线性组合风格特征向量,即线性组合输入样本,同时还可以输出相应的线性组合输出样本。将线性组合训练样本输入至网络模型320,线性组合训练样本包括线性组合输入样本和线性组合输出样本,基于线性组合训练样本,对网络模型320进行训练。The scaled training samples are input to the linear combination unit 310. Based on the linear combination unit 310, multiple scaled style feature vectors can be linearly combined to obtain a linear combination style feature vector. The scaled output samples corresponding to the multiple scaled style feature vectors can also be Perform linear combination to obtain linear combination output samples. The linear combination unit 310 can output speech features and their corresponding linear combination style feature vectors, that is, linear combination input samples, and can also output corresponding linear combination output samples. The linear combination training samples are input to the network model 320. The linear combination training samples include linear combination input samples and linear combination output samples. Based on the linear combination training samples, the network model 320 is trained.
图20为根据一些实施例的说话风格生成方法的流程示意图,图20为图16所示实施例的基础上,执行S1703时的另一种可能的实现方式的具体描述,如下:Figure 20 is a schematic flowchart of a speaking style generation method according to some embodiments. Figure 20 is a detailed description of another possible implementation when performing S1703 based on the embodiment shown in Figure 16, as follows:
S5011,将所述训练样本集输入至所述缩放单元,基于缩放因子和所述多个风格特征向量,生成多个缩放风格特征向量,基于所述缩放因子和所述多个输出样本,生成多个缩放输出样本。S5011. Input the training sample set to the scaling unit, generate multiple scaling style feature vectors based on the scaling factor and the multiple style feature vectors, and generate multiple scaling style feature vectors based on the scaling factor and the multiple output samples. scaled output samples.
示例性的,训练样本输入至缩放单元后,基于缩放单元,能够以随机缩放因子对多个风格特征向量分别进行缩放处理,可以得到多个缩放风格特征向量。每个风格特征向量对应一个输出样本,基于多个风格特征向量各自的缩放因子缩放对应的输出样本,可以得到多个缩放输出样本。如此,基于多个语音特征及其各自对应的多个缩放风格特征向量,可以得到缩放输入样本集,基于多个语音特征各自对应的缩放输出样本,可以得到缩放输出样本集。For example, after the training samples are input to the scaling unit, based on the scaling unit, multiple style feature vectors can be scaled separately with random scaling factors, and multiple scaled style feature vectors can be obtained. Each style feature vector corresponds to an output sample, and multiple scaled output samples can be obtained by scaling the corresponding output samples based on the respective scaling factors of multiple style feature vectors. In this way, a scaled input sample set can be obtained based on multiple voice features and their corresponding multiple scaled style feature vectors, and a scaled output sample set can be obtained based on the scaled output samples corresponding to the multiple voice features.
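The scaling step described above could look roughly like the following sketch, which assumes NumPy, toy sizes, and a per-sample random factor in [0.5, 2] rounded to one decimal place; it is an illustrative assumption rather than the disclosed implementation.

```python
import numpy as np

def scale_samples(style_vectors, output_samples, rng=np.random.default_rng()):
    scaled_styles, scaled_outputs = [], []
    for f, y in zip(style_vectors, output_samples):
        s = round(float(rng.uniform(0.5, 2.0)), 1)   # random scaling factor, one decimal place
        scaled_styles.append(s * f)                  # scaled style feature vector
        scaled_outputs.append(s * y)                 # scaled output sample
    return scaled_styles, scaled_outputs

styles = [np.eye(3)[i] for i in range(3)]
outputs = [np.random.randn(1500) for _ in range(3)]
ss, so = scale_samples(styles, outputs)
print(ss[0], so[0].shape)
```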
S5012,将所述多个缩放风格特征向量和所述多个缩放输出样本输入至所述线性组合单元,基于所述多个缩放风格特征向量及其各自的权重值,生成所述线性组合风格特征向量,基于所述多个缩放风格特征向量各自的权重值和所述多个缩放输出样本,生成所述线性组合输出样本。S5012, input the multiple scaling style feature vectors and the multiple scaling output samples to the linear combination unit, and generate the linear combination style feature based on the multiple scaling style feature vectors and their respective weight values. vector, generating the linear combination output sample based on respective weight values of the plurality of scaled style feature vectors and the plurality of scaled output samples.
所述多个缩放风格特征向量各自的权重值的和值为1。The sum of the weight values of the multiple scaling style feature vectors is 1.
Exemplarily, the scaled training sample set includes a scaled input sample set and a scaled output sample set. The scaled training sample set is input to the linear combination unit, and based on the linear combination unit, weight values can be assigned to the multiple scaled style feature vectors respectively, with the sum of the weight values of the multiple scaled style feature vectors being 1. By summing the products of each scaled style feature vector and its corresponding weight value, the linear-combination style feature vector can be obtained. Each scaled style feature vector corresponds to one scaled output sample, and by summing the products of the respective weight values of the multiple scaled style feature vectors and the corresponding scaled output samples, the linear-combination output sample can be obtained. In this way, based on different weight values, different linear-combination style feature vectors and different linear-combination output samples can be obtained; based on the multiple speech features and their respective corresponding linear-combination style feature vectors, the linear-combination input sample set can be obtained, and based on the scaled output samples corresponding to the multiple speech features, the linear-combination output sample set can be obtained.
S502,根据所述损失函数和线性组合训练样本集,训练所述网络模型,得到所述说话风格模型。S502: Train the network model according to the loss function and the linear combination training sample set to obtain the speaking style model.
所述线性组合训练样本集包括线性组合输入样本集和线性组合输出样本集,线性组合输入样本包括所述语音特征及其对应的所述线性组合风格特征向量。The linear combination training sample set includes a linear combination input sample set and a linear combination output sample set. The linear combination input sample includes the speech feature and its corresponding linear combination style feature vector.
示例性的,线性组合训练样本集包括线性组合输入样本集和线性组合输出样本集,将线性组合训练样本输入至网络模型,基于网络模型和线性组合输入样本,可以得到预测输出样本,基于损失函数的损失值减小的方向,调整网络模型的模型参数,自此完成一次网络模型的迭代训练。如此,基于网络模型的多次迭代训练,可以得到训练好的训练说话风格模型的框架,即说话风格模型。Exemplarily, the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set. The linear combination training samples are input to the network model. Based on the network model and the linear combination input sample, a predicted output sample can be obtained based on the loss function. In the direction in which the loss value decreases, the model parameters of the network model are adjusted, and an iterative training of the network model is completed. In this way, based on multiple iterative trainings of the network model, a well-trained framework for training the speaking style model, that is, the speaking style model, can be obtained.
In this embodiment, the framework of the speaking style model further includes a scaling unit; the training sample set is input to the scaling unit, multiple scaled style feature vectors are generated based on the scaling factor and the multiple style feature vectors, and multiple scaled output samples are generated based on the scaling factor and the multiple output samples; the multiple scaled style feature vectors and the multiple scaled output samples are input to the linear combination unit, the linear-combination style feature vector is generated based on the multiple scaled style feature vectors and their respective weight values, and the linear-combination output sample is generated based on the respective weight values of the multiple scaled style feature vectors and the multiple scaled output samples, where the sum of the weight values of the multiple scaled style feature vectors is 1; the network model is trained according to the loss function and the linear-combination training sample set to obtain the speaking style model, where the linear-combination training sample set includes the linear-combination input sample set and the linear-combination output sample set, and a linear-combination input sample includes the speech feature and its corresponding linear-combination style feature vector. In this way, the scaled multiple style feature vectors are used as training samples of the network model, which increases the number and diversity of the training samples of the network model, thereby improving the generality and accuracy of the speaking style model.
In some embodiments of the present disclosure, Figure 21 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments, and Figure 22 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments. Figure 21 builds on the embodiment shown in Figure 17, and Figure 22 builds on the embodiment shown in Figure 19. The network model 320 includes a first-level network model 321, a second-level network model 322, and a superposition unit 323. The output ends of the first-level network model 321 and the second-level network model 322 are both connected to the input end of the superposition unit 323, and the output end of the superposition unit 323 is used to output the predicted output sample. The loss function includes a first loss function and a second loss function.
The linear combination training samples are input to the first-level network model 321 and the second-level network model 322 respectively. The first-level network model 321 outputs a first-level predicted output sample, and the second-level network model 322 outputs a second-level predicted output sample. Both are input to the superposition unit 323, which superimposes the first-level predicted output sample and the second-level predicted output sample to obtain the predicted output sample. The first-level network model 321 may include a convolutional network and a fully connected network, whose role is to extract the single-frame correspondence between speech and facial topological structure data. The second-level network model 322 may be a sequence-to-sequence (seq2seq) network model, for example, a long short-term memory (LSTM) network model, a gated recurrent unit (GRU) network model, or a Transformer network model, whose role is to enhance the continuity between speech features and facial expressions and the subtlety of the speaking style.
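The sketch below shows one plausible arrangement of the two branches and the superposition unit. The layer sizes, the choice of an LSTM for the second-level model, and the class name TwoLevelStyleNet are assumptions consistent with the examples mentioned above, not the implementation of the disclosure.

```python
import torch
from torch import nn

class TwoLevelStyleNet(nn.Module):
    def __init__(self, in_dim=24, out_dim=4, hidden=64):
        super().__init__()
        # First-level model: a convolution followed by a 1x1 convolution (acting as a
        # per-frame fully connected layer), capturing the per-frame correspondence
        # between speech features and facial topological structure data.
        self.level1 = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, out_dim, kernel_size=1),
        )
        # Second-level model: a seq2seq-style recurrent network (an LSTM here) that
        # smooths the sequence so expressions stay continuous across frames.
        self.level2 = nn.LSTM(in_dim, hidden, batch_first=True)
        self.level2_head = nn.Linear(hidden, out_dim)

    def forward(self, x):                                      # x: (batch, frames, in_dim)
        y1 = self.level1(x.transpose(1, 2)).transpose(1, 2)    # (batch, frames, out_dim)
        h, _ = self.level2(x)
        y2 = self.level2_head(h)                               # (batch, frames, out_dim)
        return y1 + y2, y1, y2                                 # superposition unit: element-wise addition

net = TwoLevelStyleNet()
pred, pred_l1, pred_l2 = net(torch.randn(2, 10, 24))
```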
Exemplarily, the loss function is L = b1*L1 + b2*L2, where L1 is the first loss function, used to determine the loss value between the first-level predicted output sample and the linear combination output sample; L2 is the second loss function, used to determine the loss value between the second-level predicted output sample and the linear combination output sample; b1 is the weight of the first loss function, b2 is the weight of the second loss function, and both b1 and b2 are adjustable. By setting b2 close to 0, the first-level network model 321 can be trained; by setting b1 close to 0, the second-level network model 322 can be trained. In this way, the first-level network model and the second-level network model can be trained separately in stages, which improves the convergence speed of network model training and saves training time, thereby improving the efficiency of speaking style generation.
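A weighted loss of this form can be written directly. The sketch below assumes mean-squared error for both terms and reuses the hypothetical two-branch network from the previous example; the default values of b1 and b2 are placeholders.

```python
import torch

mse = torch.nn.MSELoss()

def combined_loss(pred_l1, pred_l2, target, b1=1.0, b2=0.0):
    # L = b1 * L1 + b2 * L2, where L1 compares the first-level prediction and
    # L2 the second-level prediction against the linear combination output sample.
    l1 = mse(pred_l1, target)
    l2 = mse(pred_l2, target)
    return b1 * l1 + b2 * l2

# Setting b2 close to 0 trains mainly the first-level model;
# setting b1 close to 0 trains mainly the second-level model.
loss = combined_loss(pred_l1, pred_l2, torch.randn(2, 10, 4), b1=1.0, b2=1e-3)
```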
Figure 23 is a schematic flowchart of a speaking style generation method according to some embodiments. Based on the embodiment shown in Figure 18 or Figure 20, Figure 23 describes one implementation of S502 in detail, as follows:
S5021: Train the first-level network model according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model.
The intermediate speaking style model includes the second-level network model and the trained first-level network model.
Exemplarily, based on the above embodiment, in the first stage the weight b2 of the second loss function is set close to 0, so the loss function of the current network model can be understood as the first loss function. The linear combination training samples are input to the first-level network model and the second-level network model respectively. A first loss value is obtained based on the predicted output sample output by the superposition unit, the first loss function, and the corresponding linear combination output sample, and the model parameters of the first-level network model are adjusted in the direction that reduces the first loss value until the first loss value converges, yielding the trained first-level network model. The framework of the speaking style model trained in the first stage is the intermediate speaking style model.
S5022: Fix the model parameters of the trained first-level network model.
Exemplarily, after the first-level network model is trained, the second stage begins; the model parameters of the trained first-level network model must first be fixed.
S5023: Train the second-level network model in the intermediate speaking style model according to the linear combination training sample set and the second loss function to obtain the speaking style model.
The speaking style model includes the trained first-level network and the trained second-level network.
Next, the weight b1 of the first loss function is set close to 0, so the loss function of the current network model can be understood as the second loss function. The linear combination training samples are input to the second-level network model and the trained first-level network model. A second loss value is obtained based on the predicted output sample output by the superposition unit, the second loss function, and the corresponding linear combination output sample, and the model parameters of the second-level network model are adjusted in the direction that reduces the second loss value until the second loss value converges, yielding the trained second-level network model. The framework of the speaking style model trained in the second stage is the speaking style model.
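One way to realize this two-stage schedule is sketched below, reusing the hypothetical TwoLevelStyleNet and combined_loss from the earlier examples; the epoch count, learning rate, and randomly generated stand-in data are assumptions for illustration only.

```python
import torch

net = TwoLevelStyleNet()

# Stand-in data: 16 mini-batches of 2 sequences, 10 frames each, matching the earlier shapes.
data = [(torch.randn(2, 10, 24), torch.randn(2, 10, 4)) for _ in range(16)]

def run_stage(net, batches, b1, b2, epochs=10, lr=1e-3):
    # Only parameters that are not frozen are handed to the optimizer.
    opt = torch.optim.Adam([p for p in net.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        for x, target in batches:
            _, pred_l1, pred_l2 = net(x)
            loss = combined_loss(pred_l1, pred_l2, target, b1=b1, b2=b2)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Stage 1: b2 close to 0, so the first loss function dominates and the first-level model is trained.
run_stage(net, data, b1=1.0, b2=1e-3)

# Fix the model parameters of the trained first-level network model.
for p in net.level1.parameters():
    p.requires_grad = False

# Stage 2: b1 close to 0, so the second loss function dominates and the second-level model is trained.
run_stage(net, data, b1=1e-3, b2=1.0)
```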
In this embodiment, the network model includes a first-level network model, a second-level network model, and a superposition unit. The output ends of the first-level network model and the second-level network model are both connected to the input end of the superposition unit, and the output end of the superposition unit is used to output the predicted output sample. The loss function includes a first loss function and a second loss function. The first-level network model is trained according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model, which includes the second-level network model and the trained first-level network model. The model parameters of the trained first-level network model are then fixed, and the second-level network model in the intermediate speaking style model is trained according to the linear combination training sample set and the second loss function to obtain the speaking style model, which includes the trained first-level network and the trained second-level network. In this way, the network model can be trained in stages, which improves its convergence speed, that is, shortens its training time, thereby improving the efficiency of speaking style generation.
Figure 24 is a schematic structural diagram of a computer device provided by an embodiment of the present disclosure. As shown in Figure 24, the computer device includes a processor 910 and a memory 920. The number of processors 910 in the computer device may be one or more; one processor 910 is taken as an example in Figure 24. The processor 910 and the memory 920 in the computer device may be connected by a bus or in other ways; connection by a bus is taken as an example in Figure 24.
As a computer-readable non-volatile storage medium, the memory 920 may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the semantic understanding model training method in the embodiments of the present disclosure, or the program instructions/modules corresponding to the semantic understanding method in the embodiments of the present disclosure. By running the software programs, instructions, and modules stored in the memory 920, the processor 910 executes the various functional applications and data processing of the computer device, that is, implements the semantic understanding model training method or the short video recall method provided by the embodiments of the present disclosure.
The memory 920 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and application programs required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory 920 may include high-speed random access memory, and may further include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 920 may further include memory located remotely relative to the processor 910, and such remote memory may be connected to the computer device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Claims (16)

  1. A virtual digital human driving method, comprising:
    obtaining user information, wherein the user information comprises voice information and image information;
    determining a user intention and a user emotion according to the user information;
    determining a reply text of the virtual digital human according to the user intention, and determining a reply emotion of the virtual digital human according to the user intention and the user emotion; and
    determining body movements of the virtual digital human according to the reply text, and determining an emotional expression of the virtual digital human according to the reply emotion.
  2. The method according to claim 1, wherein determining the user intention and the user emotion according to the user information comprises:
    processing the voice information to determine text information and voice emotion information corresponding to the voice information;
    processing the image information to determine scene information and image emotion information corresponding to the image information;
    determining the user intention according to the text information and the scene information; and
    determining the user emotion according to the text information, the voice emotion information, and the image emotion information.
  3. The method according to claim 2, wherein processing the voice information to determine the text information and the voice emotion information corresponding to the voice information comprises:
    performing text transcription processing on the voice information to determine the text information corresponding to the voice information; and
    performing voiceprint feature extraction on the voice information to determine the voice emotion information corresponding to the voice information.
  4. The method according to claim 3, before performing the text transcription processing on the voice information to determine the text information corresponding to the voice information, further comprising:
    extracting a speech feature vector of the voice information; and
    adding the speech feature vector to a convolutional layer of a speech recognition model, wherein the speech recognition model comprises an acoustic model and a language model, the acoustic model comprises a convolutional neural network model with an attention mechanism, and the language model comprises a deep neural network model;
    wherein performing the text transcription processing on the voice information to determine the text information corresponding to the voice information comprises:
    performing text transcription processing on the voice information based on the speech recognition model to determine the text information corresponding to the voice information.
  5. The method according to claim 2, wherein processing the image information to determine the scene information and the image emotion information corresponding to the image information comprises:
    determining scene key point information and user key point information included in the image information;
    determining the scene information corresponding to the image information according to the scene key point information; and
    determining the image emotion information according to a correspondence between the user key point information and preset user emotion key points.
  6. The method according to claim 2, wherein determining the body movements of the virtual digital human according to the reply text, and determining the emotional expression of the virtual digital human according to the reply text and the reply emotion, comprises:
    obtaining an action identifier included in the reply text;
    selecting the body movements of the virtual digital human from a preset action database corresponding to the scene information according to the action identifier; and
    determining the emotional expression of key points of the virtual digital human according to the voice emotion information and the image emotion information.
  7. The method according to claim 6, before determining the body movements of the virtual digital human according to the reply text and determining the emotional expression of the virtual digital human according to the reply emotion, further comprising:
    determining an image of the virtual digital human.
  8. The method according to any one of claims 1-7, further comprising:
    fitting a target style feature attribute based on multiple style feature attributes, and determining a fitting coefficient of each style feature attribute, wherein the style feature attributes are determined according to face images of users reading speech aloud;
    determining a target style feature vector according to the fitting coefficients of the style feature attributes and multiple style feature vectors, wherein the multiple style feature vectors correspond to the multiple style feature attributes one to one;
    inputting the target style feature vector into a speaking style model and outputting target speaking style parameters, wherein the speaking style model is obtained by training a framework of the speaking style model based on the multiple style feature vectors; and
    generating a target speaking style for the virtual digital human based on the target speaking style parameters.
  9. The method according to claim 8, before fitting the target style feature attribute based on the multiple style feature attributes and determining the fitting coefficient of each style feature attribute, further comprising:
    collecting multiple frames of facial topological structure data while multiple preset users read multiple segments of speech aloud;
    for each preset user, determining an average value of speaking style parameters of the multiple frames of facial topological structure data within each divided area according to the respective speaking style parameters of the multiple frames of facial topological structure data corresponding to the multiple segments of speech and divided areas of the facial topological structure data; and
    splicing the average values of the speaking style parameters of the multiple frames of facial topological structure data in the divided areas in a preset order to obtain the style feature attribute of each preset user.
  10. The method according to claim 9, further comprising:
    collecting multiple frames of target facial topological structure data while a target user reads the multiple segments of speech aloud, wherein the target user is different from the multiple preset users;
    determining the average value of the speaking style parameters of the multiple frames of target facial topological structure data within each divided area according to the respective speaking style parameters of the multiple frames of target facial topological structure data corresponding to the multiple segments of speech and the divided areas of the facial topological structure data; and
    splicing the average values of the speaking style parameters of the multiple frames of target facial topological structure data in the divided areas in the preset order to obtain the target style feature attribute.
  11. The method according to claim 9, before inputting the target style feature vector into the speaking style model and outputting the target speaking style parameters, further comprising:
    obtaining a training sample set, wherein the training sample set comprises an input sample set and an output sample set, each input sample comprises a speech feature and its corresponding multiple style feature vectors, and each output sample comprises the speaking style parameters;
    defining a framework of the speaking style model, wherein the framework of the speaking style model comprises a linear combination unit and a network model, the linear combination unit is configured to generate a linear combination style feature vector of the multiple style feature vectors and to generate a linear combination output sample of multiple output samples, the input samples correspond to the output samples one to one, and the network model is configured to generate a corresponding predicted output sample according to the linear combination style feature vector; and
    training the framework of the speaking style model according to the training sample set and a loss function to obtain the speaking style model.
  12. The method according to claim 11, wherein training the framework of the speaking style model according to the training sample set and the loss function to obtain the speaking style model comprises:
    inputting the training sample set to the linear combination unit, generating the linear combination style feature vector based on the multiple style feature vectors and their respective weight values, and generating the linear combination output sample based on the respective weight values of the multiple style feature vectors and the multiple output samples, wherein a sum of the respective weight values of the multiple style feature vectors is 1; and
    training the network model according to the loss function and a linear combination training sample set to obtain the speaking style model, wherein the linear combination training sample set comprises a linear combination input sample set and a linear combination output sample set, and each linear combination input sample comprises the speech feature and its corresponding linear combination style feature vector.
  13. The method according to claim 11, wherein the framework of the speaking style model further comprises a scaling unit; and
    training the framework of the speaking style model according to the training sample set and the loss function to obtain the speaking style model comprises:
    inputting the training sample set to the scaling unit, generating multiple scaled style feature vectors based on a scaling factor and the multiple style feature vectors, and generating multiple scaled output samples based on the scaling factor and the multiple output samples;
    inputting the multiple scaled style feature vectors and the multiple scaled output samples to the linear combination unit, generating the linear combination style feature vector based on the multiple scaled style feature vectors and their respective weight values, and generating the linear combination output sample based on the respective weight values of the multiple scaled style feature vectors and the multiple scaled output samples, wherein a sum of the respective weight values of the multiple scaled style feature vectors is 1; and
    training the network model according to the loss function and a linear combination training sample set to obtain the speaking style model, wherein the linear combination training sample set comprises a linear combination input sample set and a linear combination output sample set, and each linear combination input sample comprises the speech feature and its corresponding linear combination style feature vector.
  14. The method according to claim 12 or 13, wherein the network model comprises a first-level network model, a second-level network model, and a superposition unit, output ends of the first-level network model and the second-level network model are both connected to an input end of the superposition unit, an output end of the superposition unit is used to output the predicted output sample, and the loss function comprises a first loss function and a second loss function; and
    training the network model according to the loss function and the linear combination training sample set to obtain the speaking style model comprises:
    training the first-level network model according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model, wherein the intermediate speaking style model comprises the second-level network model and the trained first-level network model;
    fixing model parameters of the trained first-level network model; and
    training the second-level network model in the intermediate speaking style model according to the linear combination training sample set and the second loss function to obtain the speaking style model, wherein the speaking style model comprises the trained first-level network and the trained second-level network.
  15. A computer device, comprising:
    one or more processors; and
    a memory, configured to store one or more programs,
    wherein when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-14.
  16. A computer-readable non-volatile storage medium, having a computer program stored thereon, wherein when the program is executed by a processor, the method according to any one of claims 1-14 is implemented.
PCT/CN2023/079026 2022-06-22 2023-03-01 Virtual digital human driving method, apparatus, device, and medium WO2023246163A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210714001.7A CN115270922A (en) 2022-06-22 2022-06-22 Speaking style generation method and device, electronic equipment and storage medium
CN202210714001.7 2022-06-22
CN202210751784.6 2022-06-28
CN202210751784.6A CN117370605A (en) 2022-06-28 2022-06-28 Virtual digital person driving method, device, equipment and medium

Publications (2)

Publication Number Publication Date
WO2023246163A1 WO2023246163A1 (en) 2023-12-28
WO2023246163A9 true WO2023246163A9 (en) 2024-02-29

Family

ID=89379111

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/079026 WO2023246163A1 (en) 2022-06-22 2023-03-01 Virtual digital human driving method, apparatus, device, and medium

Country Status (1)

Country Link
WO (1) WO2023246163A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690416B (en) * 2024-02-02 2024-04-12 江西科技学院 Artificial intelligence interaction method and artificial intelligence interaction system
CN117828320B (en) * 2024-03-05 2024-05-07 元创者(厦门)数字科技有限公司 Virtual digital person construction method and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108363706B (en) * 2017-01-25 2023-07-18 北京搜狗科技发展有限公司 Method and device for man-machine dialogue interaction
CN109271018A (en) * 2018-08-21 2019-01-25 北京光年无限科技有限公司 Exchange method and system based on visual human's behavioral standard
KR20190118539A (en) * 2019-09-30 2019-10-18 엘지전자 주식회사 Artificial intelligence apparatus and method for recognizing speech in consideration of utterance style
CN112396693A (en) * 2020-11-25 2021-02-23 上海商汤智能科技有限公司 Face information processing method and device, electronic equipment and storage medium
CN114357135A (en) * 2021-12-31 2022-04-15 科大讯飞股份有限公司 Interaction method, interaction device, electronic equipment and storage medium
CN115270922A (en) * 2022-06-22 2022-11-01 海信视像科技股份有限公司 Speaking style generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023246163A1 (en) 2023-12-28

Similar Documents

Publication Publication Date Title
US20230316643A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
WO2021043053A1 (en) Animation image driving method based on artificial intelligence, and related device
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
TWI778477B (en) Interaction methods, apparatuses thereof, electronic devices and computer readable storage media
WO2022116977A1 (en) Action driving method and apparatus for target object, and device, storage medium, and computer program product
WO2023246163A9 (en) Virtual digital human driving method, apparatus, device, and medium
Wu et al. Survey on audiovisual emotion recognition: databases, features, and data fusion strategies
US8725507B2 (en) Systems and methods for synthesis of motion for animation of virtual heads/characters via voice processing in portable devices
Hong et al. Real-time speech-driven face animation with expressions using neural networks
KR101604593B1 (en) Method for modifying a representation based upon a user instruction
CN113454708A (en) Linguistic style matching agent
CN110286756A (en) Method for processing video frequency, device, system, terminal device and storage medium
KR101306221B1 (en) Method and apparatus for providing moving picture using 3d user avatar
JP2018014094A (en) Virtual robot interaction method, system, and robot
CN110874137B (en) Interaction method and device
JP7227395B2 (en) Interactive object driving method, apparatus, device, and storage medium
WO2023284435A1 (en) Method and apparatus for generating animation
CN110148406B (en) Data processing method and device for data processing
WO2021232876A1 (en) Method and apparatus for driving virtual human in real time, and electronic device and medium
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
JP2023552854A (en) Human-computer interaction methods, devices, systems, electronic devices, computer-readable media and programs
CN112652041A (en) Virtual image generation method and device, storage medium and electronic equipment
WO2021232877A1 (en) Method and apparatus for driving virtual human in real time, and electronic device, and medium
CN117809679A (en) Server, display equipment and digital human interaction method
CN117370605A (en) Virtual digital person driving method, device, equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23825821

Country of ref document: EP

Kind code of ref document: A1