WO2023246163A9 - Virtual digital human driving method, apparatus, device, and medium - Google Patents

Virtual digital human driving method, apparatus, device, and medium

Info

Publication number
WO2023246163A9
Authority
WO
WIPO (PCT)
Prior art keywords
style
information
model
user
linear combination
Prior art date
Application number
PCT/CN2023/079026
Other languages
French (fr)
Chinese (zh)
Other versions
WO2023246163A1 (en)
Inventor
杨善松
成刚
刘韶
李绪送
付爱国
Original Assignee
海信视像科技股份有限公司
Priority date
Filing date
Publication date
Priority claimed from CN202210714001.7A external-priority patent/CN115270922A/en
Priority claimed from CN202210751784.6A external-priority patent/CN117370605A/en
Application filed by 海信视像科技股份有限公司
Publication of WO2023246163A1 publication Critical patent/WO2023246163A1/en
Publication of WO2023246163A9 publication Critical patent/WO2023246163A9/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present disclosure relates to the field of virtual digital human technology, and in particular, to a virtual digital human driving method, device, equipment and medium.
  • Virtual digital human is a multi-modal intelligent human-computer interaction technology that integrates computer vision, speech recognition, speech synthesis, natural language processing, terminal display and other technologies to create a highly anthropomorphic virtual image that interacts and communicates with people like a real person.
  • the present disclosure provides a virtual digital human driving method, including:
  • Obtain user information which includes voice information and image information
  • the physical movements of the virtual digital human are determined according to the reply text, and the emotional expression of the virtual digital human is determined according to the reply emotion.
  • the present disclosure also provides a computer device, including:
  • one or more processors;
  • a memory used to store one or more programs;
  • When the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described in any one of the first aspects.
  • the present disclosure also provides a computer-readable non-volatile storage medium on which a computer program is stored.
  • When the program is executed by a processor, the method described in any one of the first aspects is implemented.
  • Figure 1A is a schematic diagram of an application scenario of a virtual digital human driving process according to some embodiments
  • Figure 1B is a schematic structural diagram of a virtual digital human according to some embodiments.
  • Figure 2A is a hardware configuration block diagram of a computer device according to some embodiments.
  • Figure 2B is a schematic diagram of a software configuration of a computer device according to some embodiments.
  • Figure 2C is a schematic diagram showing an icon control interface of an application included in a smart device according to some embodiments.
  • Figure 3A is a schematic flowchart of a virtual digital human driving method according to some embodiments.
  • Figure 3B is a schematic diagram of the principle of a virtual digital human driving method according to some embodiments.
  • Figure 4A is a schematic flowchart of another virtual digital human driving method according to some embodiments.
  • Figure 4B is a schematic diagram of the principle of a virtual digital human driving method according to some embodiments.
  • Figure 4C is a schematic flowchart of yet another virtual digital human driving method according to some embodiments.
  • Figure 4D is a schematic flowchart of yet another virtual digital human driving method according to some embodiments.
  • Figure 5 is a schematic flowchart of a virtual digital human driving method according to some embodiments.
  • Figure 6 is a schematic flowchart of yet another virtual digital human driving method according to some embodiments.
  • Figure 7 is a schematic flowchart of yet another virtual digital human driving method according to some embodiments.
  • Figure 8 is a schematic diagram of a virtual digital human according to some embodiments.
  • Figure 9 is a schematic diagram of a virtual digital human according to some embodiments.
  • Figure 10 is a schematic diagram of the principle of generating a new speaking style according to some embodiments.
  • Figure 11 is a schematic diagram of a human-computer interaction scenario according to some embodiments.
  • Figure 12 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 13 is a schematic diagram of facial topology data divided into regions according to some embodiments.
  • Figure 14 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 15 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 16 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 17 is a schematic structural diagram of a framework of a speaking style model according to some embodiments.
  • Figure 18 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 19 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments.
  • Figure 20 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 21 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments.
  • Figure 22 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments.
  • Figure 23 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 24 is a schematic structural diagram of a computer device according to some embodiments.
  • the system design of virtual digital people usually consists of five modules: character image, speech generation, dynamic image generation, audio and video synthesis display, and interactive modeling.
  • Character images can be divided into two categories, 2D and 3D, and in terms of appearance into cartoon, anthropomorphic, realistic, hyper-realistic and other styles;
  • the speech generation module can generate corresponding character voices based on text;
  • the animation generation module can generate the dynamic image of a specific character based on speech or text;
  • the audio and video synthesis display module synthesizes speech and dynamic images into a video, and finally displays it to the user;
  • the interactive module enables the digital human to have interactive functions, that is, it recognizes the user's intention through intelligent technologies such as speech and semantic recognition, determines the digital human's subsequent voice and actions based on the user's current intention, and drives the character to start the next round of interaction.
  • the embodiments of the present disclosure first obtain user information, which includes voice information and image information; then determine the user intention and user emotion based on the user information; and finally determine the body movements of the virtual digital human based on the user intention and determine the emotional expression of the virtual digital human based on the user emotion. That is, the acquired user voice information and user image information are processed to determine the user's intention and emotion, the body movements of the virtual digital human are then determined from the user's intention, and the emotional expression of the virtual digital human is determined from the user's emotion, allowing the virtual digital human to faithfully restore the user's intention and emotion and improving the fidelity and naturalness of expression of the virtual digital human.
  • FIG. 1A is a schematic diagram of an application scenario of a virtual digital human driving process in an embodiment of the present disclosure.
  • the virtual digital human driving process can be used in the interaction scenario between users and smart terminals.
  • the smart terminals in this scenario include smart blackboards, smart large screens, smart speakers, smart phones, etc.
  • Examples of virtual digital people include virtual teachers, virtual brand images, virtual assistants, virtual shopping guides, virtual anchors, etc.
  • when the user wants to interact with the smart terminal, a voice command is issued.
  • the smart terminal collects the user's voice information and collects the user's image information.
  • the virtual digital human driving method provided by the embodiments of the present disclosure can be implemented based on computer equipment, or functional modules or functional entities in the computer equipment.
  • the computer equipment can be a personal computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a mainframe computer, etc., which are not specifically limited in the embodiments of the present disclosure.
  • FIG. 2A is a hardware configuration block diagram of a computer device according to one or more embodiments of the present disclosure.
  • the computer equipment includes at least one of: a tuner and demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280.
  • the controller 250 includes a central processing unit, a video processor, an audio processor, a graphics processor, a RAM, a ROM, and first to nth interfaces for input/output.
  • the display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen.
  • the tuner and demodulator 210 receives broadcast television signals through wired or wireless reception methods, and demodulates audio and video signals, such as EPG audio and video data signals, from multiple wireless or wired broadcast and television signals.
  • the communicator 220 is a component for communicating with external devices or servers according to various communication protocol types.
  • the communicator may include at least one of a Wifi module, a Bluetooth module, a wired Ethernet module, other network communication protocol chips or near field communication protocol chips, and an infrared receiver.
  • the computer device can establish the transmission and reception of control signals and data signals with the server or local control device through the communicator 220 .
  • the detector 230 is used to collect signals from the external environment or interactions with the outside.
  • the controller 250 controls the overall operation of the computer device and responds to user operations through various software control programs stored in the memory.
  • the user may input a user command into a graphical user interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the graphical user interface (GUI).
  • the user can input a user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
  • FIG. 2B is a schematic diagram of the software configuration of a computer device according to one or more embodiments of the present disclosure. As shown in Figure 2B, the system is divided into four layers. From top to bottom, they are the Applications layer (the "application layer"), the Application Framework layer (the "framework layer"), the Android runtime and system library layer (the "system runtime library layer"), and the kernel layer.
  • FIG. 2C is a schematic diagram showing the icon control interface of an application included in a smart terminal (mainly a smart playback device, such as a smart TV, a digital cinema system or an audio-visual server, etc.) according to one or more embodiments of the present disclosure, as shown in Figure 2C
  • the application layer contains at least one application that can display a corresponding icon control on the display, such as: a Live TV application icon control, a Video on Demand (VOD) application icon control, a Media Center application icon control, an Application Center icon control, game application icon controls, etc.
  • Live TV app that provides live TV from different sources.
  • Video on demand VOD application that can provide videos from different storage sources.
  • video on demand offers the display of video from certain storage sources.
  • Media center application can provide various multimedia content playback applications.
  • the application center can provide storage for various applications.
  • FIG. 3A is a schematic flowchart of a virtual digital human driving method provided by an embodiment of the present disclosure
  • FIG. 3B is a schematic principle diagram of a virtual digital human driving method provided by an embodiment of the present disclosure.
  • This embodiment can be applied to the situation of driving a virtual digital human.
  • the method of this embodiment can be executed by an intelligent terminal, which can be implemented in hardware and/or software and can be configured in a computer device.
  • the method specifically includes the following steps:
  • the smart terminal includes a sound sensor and a visual sensor. The sound sensor can be, for example, a microphone array; the visual sensor includes 2D and 3D visual sensors and can be, for example, a camera.
  • the smart terminal collects voice information through sound sensors and image information through visual sensors.
  • the voice information includes semantic information and acoustic information
  • the image information includes scene information and user image information.
  • after the terminal device collects voice information based on the sound sensor, it can determine the user's intention based on the semantic information included in the voice information, that is, how the user expects to drive the virtual digital human to act. After collecting the image information based on the visual sensor, it can determine the facial expression of the user who sent the voice message from the collected image information, and based on the user's facial expression in the collected image information, determine the emotion the user expects the virtual digital human to express.
  • the virtual digital person's reply text can be determined based on the user intention, such as the text corresponding to the virtual digital person's reply voice
  • the virtual digital person's reply emotion can be determined based on the user intention and user emotion, that is,
  • the emotional expression required for the virtual digital person's reply is determined according to the user's intention, and the emotion required for the virtual digital person's reply is determined based on the emotion expressed by the user.
  • the emotion expressed by the user is a sad emotion
  • the emotion that the virtual digital person needs to express in reply is also a sad emotion.
  • the reply text of the virtual digital person is determined based on the user intention
  • the reply emotion of the virtual digital person is determined based on the user intention and user emotion
  • the body movements of the virtual digital person are determined based on the reply text
  • Determine the emotional expression of the virtual digital human based on the reply emotion. That is, multi-modal human-computer interaction information perception capabilities for speech recognition and image recognition are first established, and the user intention and user emotion are then determined from the acquired voice information and image information.
  • the reply text of the virtual digital person is determined based on the user's intention, and the reply emotion of the virtual digital person is determined based on the user's intention and user emotion.
  • emotional expressions and body movements are generated to realize the synthesis of the virtual digital person's voice, expressions, movements, etc.
  • the virtual digital human driving method provided by the embodiment of the present disclosure first obtains user information, that is, voice information and image information; then determines the user intention and user emotion based on the user information; determines the virtual digital human's reply text based on the user intention and the virtual digital human's reply emotion based on the user intention and user emotion; and finally determines the virtual digital human's body movements based on the reply text and the virtual digital human's emotional expression based on the reply emotion. This achieves a natural, anthropomorphic virtual human interaction state and improves the fidelity and naturalness of expression of the virtual digital human.
  • FIG. 4A is a schematic flowchart of another virtual digital human driving method provided by an embodiment of the present disclosure.
  • FIG. 4B is a schematic diagram of another virtual digital human driving method provided by the present disclosure.
  • step S20 includes:
  • S201 Process the voice information and determine the text information and voice emotion information corresponding to the voice information.
  • step S201 includes:
  • the voice recognition module performs text transcription processing on the acquired voice information, that is, the voice information is converted into text information corresponding to the voice information.
  • the terminal device can input voice information into an automatic speech recognition (Automatic Speech Recognition, ASR) engine set offline to obtain text information output by the ASR engine.
  • the terminal device may continue to wait for the user to input voice. If the start of a human voice is recognized based on Voice Activity Detection (VAD), recording will continue. If the end of the human voice is recognized based on VAD, the recording will stop. The terminal device can use the recorded audio as user voice information. The terminal device can then input the user's voice information into the ASR engine to obtain text information corresponding to the user's voice information.
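  • As a rough illustration of the VAD-gated recording and ASR transcription described above, a minimal Python sketch follows. The webrtcvad package is used only as a convenient stand-in VAD, and the transcribe() function is a hypothetical placeholder, since the disclosure does not name a specific VAD implementation or ASR engine; the 16 kHz, 16-bit mono PCM format and 30 ms frame size are also assumptions.

```python
# Sketch only: keep audio from the first detected speech frame until ~600 ms of
# silence, then hand it to an (unspecified) offline ASR engine.
import wave

import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples -> 2 bytes each


def pcm_frames(path):
    """Yield fixed-size 30 ms PCM frames from a 16 kHz, 16-bit mono WAV file."""
    with wave.open(path, "rb") as wav:
        data = wav.readframes(wav.getnframes())
    for i in range(0, len(data) - FRAME_BYTES + 1, FRAME_BYTES):
        yield data[i:i + FRAME_BYTES]


def record_utterance(path, max_trailing_silence=20):
    """Record from the start of speech until max_trailing_silence silent frames."""
    vad = webrtcvad.Vad(2)                          # aggressiveness 0 (lenient) .. 3 (strict)
    recorded, started, silent = bytearray(), False, 0
    for frame in pcm_frames(path):
        if vad.is_speech(frame, SAMPLE_RATE):
            started, silent = True, 0
            recorded.extend(frame)
        elif started:
            silent += 1
            if silent > max_trailing_silence:       # end of the human voice detected
                break
    return bytes(recorded)


def transcribe(audio_bytes):
    """Hypothetical placeholder for the offline ASR engine mentioned in the text."""
    raise NotImplementedError("plug in an ASR engine here")
```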
  • Voiceprint features are the sound wave spectrum carrying speech information displayed by electroacoustic instruments. Voiceprint features represent the different wavelengths, frequencies, intensities, and rhythms of different sounds, that is, the pitch, intensity, length, and timbre of the user's voice; different users have different voiceprint characteristics. By extracting voiceprint features from the voice information, the emotional information expressed by the user corresponding to the voice information, that is, the voice emotion information, can be obtained.
  • S202 Process the image information and determine the scene information and image emotion information corresponding to the image information.
  • step S202 includes:
  • S2021. Preprocess the image information to determine the scene key point information and user key point information included in the image information.
  • the scene key point information refers to the key points of the scene in which the user is located in the image information in addition to the user information.
  • the user key point information refers to the key points of the user's limbs or facial features in the image information.
  • the image information collected by the terminal device shows a teacher standing in front of a blackboard, that is, the scene key point information included in the image information is the blackboard, and the user key point information included in the image is the user's eyes, mouth, arms, legs, etc.
  • the scene information of the terminal device can be determined, that is, in which scene the terminal device is applied.
  • a scene recognition model is constructed based on algorithms such as entity recognition, entity linking, and entity alignment, and then the image information of different application scenarios in the knowledge base is preprocessed.
  • After obtaining the scene key point information corresponding to the image information of different application scenarios, the scene key point information is input into the scene recognition model to train the scene recognition model until it converges, and the target scene recognition model is determined.
  • graph mapping, information extraction and other methods are used to preprocess the acquired image information to obtain the scene key point information corresponding to the image.
  • the scene key point information obtained after preprocessing is input into the target scene recognition model to perform scene recognition, which ensures the accuracy of the scene recognition results.
  • the user's emotions collected by the terminal device can be determined, that is, the emotions expressed by the user included in the image information collected by the terminal device.
  • S203 Determine user intention based on text information and scene information.
  • the body movements that the user expects to drive the virtual digital human to perform can be determined based on the text information and then combined with the determined scene information, which further ensures the coordination accuracy of the body movements with which the terminal device drives the virtual digital human based on the text information.
  • S204 Determine user emotion based on text information, voice emotion information and image emotion information.
  • the emotion expressed by the user can be roughly determined based on the text information, and then by fusing the voice emotion information and the image emotion information, the virtual digital human can be accurately driven to express the user's emotion and improve the fidelity of the virtual digital human.
  • the virtual digital human determination method first determines the text information and voice emotion information corresponding to the voice information by processing the voice information, determines the scene information and image emotion information corresponding to the image information by processing the image information, then determines the user's intention based on the text information and scene information, and determines the user's emotion based on the text information, voice emotion information, and image emotion information. That is, the body movements the user expects to drive can be determined from the text information and refined with the scene information to ensure the coordination accuracy of the driven body movements, while the user's emotion can be roughly determined from the text information and then refined by fusing the voice emotion information and image emotion information, so that the virtual digital human accurately expresses the user's emotion and its fidelity is improved.
  • Figure 5 is a schematic flow chart of another virtual digital human driving method provided by an embodiment of the present disclosure.
  • the embodiment of the present disclosure is based on the embodiment corresponding to Figure 4C. As shown in Figure 5, before step S2012, it also includes:
  • By constructing highly robust voiceprint recognition and voiceprint clustering technologies, automatic login of multiple users is achieved through the voice modality, while paralinguistic information such as gender and accent is extracted to establish basic user information. To address the difficulty of clustering speech features when the number of target classes is uncertain, and the impact of speech channel interference on classification and clustering, a noisy density-space unsupervised clustering technique is combined with stochastic linear discriminant analysis to achieve highly reliable voiceprint classification and clustering and to reduce the impact of channel interference on voiceprint recognition. That is, in this disclosure, a speech recognition model is constructed that can adapt to different paralinguistic information, and the accuracy of speech recognition is high.
  • the voice feature vector in the voice information is first extracted.
  • the voice feature vector includes: an accent feature vector, a gender feature vector, an age feature vector, etc.
  • the speech recognition model includes an acoustic model and a language model.
  • the acoustic model includes a convolutional neural network model with an attention mechanism
  • the language model includes a deep neural network model.
  • the speech recognition model constructed in this disclosure jointly models the acoustic model and the language model.
  • Column convolution and attention mechanism are used to build the acoustic model, and the speech feature vector is added as a condition in the convolution layer of the convolutional neural network model to adapt to different speech features.
  • a deep-neural-network-based model structure that can be quickly adjusted and configured is implemented, and the voice characteristics of different paralinguistic information are adapted to through user-specific voiceprints to improve the accuracy of speech recognition.
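  • As one hedged sketch of how a convolution layer can be conditioned on a speech feature vector carrying paralinguistic information, the PyTorch fragment below adds a learned, feature-dependent bias to each convolution block; the layer sizes, the additive conditioning scheme, and the attention placement are illustrative assumptions, not the architecture claimed by the disclosure.

```python
# Illustrative only: a small conv + attention acoustic model whose conv blocks
# take a paralinguistic condition vector (accent / gender / age features).
import torch
import torch.nn as nn


class ConditionedConvBlock(nn.Module):
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.cond = nn.Linear(cond_dim, channels)   # project condition to a channel bias

    def forward(self, x, cond):                     # x: (B, C, T), cond: (B, cond_dim)
        bias = self.cond(cond).unsqueeze(-1)        # (B, C, 1), broadcast over time
        return torch.relu(self.conv(x) + bias)


class AcousticModel(nn.Module):
    def __init__(self, feat_dim=80, channels=256, cond_dim=16, vocab=5000):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, channels, kernel_size=1)
        self.blocks = nn.ModuleList(
            ConditionedConvBlock(channels, cond_dim) for _ in range(4))
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.out = nn.Linear(channels, vocab)

    def forward(self, feats, cond):                 # feats: (B, T, feat_dim)
        x = self.proj(feats.transpose(1, 2))
        for block in self.blocks:
            x = block(x, cond)
        x = x.transpose(1, 2)                       # (B, T, C)
        x, _ = self.attn(x, x, x)
        return self.out(x)                          # per-frame token logits
```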
  • step S2012 may include:
  • the speech information can be transcribed into text based on the speech recognition model to improve the accuracy of the speech recognition results.
  • Figure 6 is a schematic flow chart of another virtual digital human driving method provided by an embodiment of the present disclosure.
  • the embodiment of the present disclosure is based on the corresponding embodiment of Figure 4A.
  • the specific implementation of step S40 includes:
  • action identifiers include: lifting, stretching, blinking, opening, etc.
  • key point driving includes speech content separation, content key point driving, speaker key point driving, a key-point-based image generation module, a key-point-based image stretching module, etc. Therefore, first, based on parsing the text information transcribed from the speech information, the action identifiers and key point identifiers included in the text information are obtained.
  • the body movements of the virtual digital human are selected from the preset action database corresponding to the scene information, such as raising the head, raising the leg, etc.
  • the preset action database includes action type definition, action arrangement, action connection, etc.
  • S403. Determine the emotional expression of key points of the virtual digital human based on the voice emotional information and the image emotional information.
  • examples of emotional expressions for determining the key points of the virtual digital human can be smiling with the mouth, clapping with both hands, etc.
  • the deep learning method is used to learn the mapping of virtual human key points and voice feature information, as well as the mapping of human face key points and voice emotion information and image emotion information.
  • the virtual digital human driving method provided by the embodiments of the present disclosure realizes the generation of voice-driven virtual digital human animation with controllable expressions by integrating emotional key point templates.
  • Figure 7 is a schematic flow chart of another virtual digital human driving method provided by an embodiment of the present disclosure.
  • the embodiment of the present disclosure is based on the embodiment corresponding to Figure 6. As shown in Figure 7, before step S401, it also includes:
  • different features are extracted from speech to drive head movements, facial movements, and body movements respectively, forming a more vivid speech driving method.
  • the image of the virtual digital human is driven based on the deep neural network method, and a generative adversarial network is applied for high-fidelity real-time generation.
  • the image generation of the virtual digital human is divided into action driving and image library production.
  • the hair library, clothing library, and tooth model of the digital human image are produced offline, and the image can be produced in a targeted manner according to different application scenarios.
  • the motion driver module of the virtual digital human is processed on the server side, and then the topological vertex data is encapsulated and transmitted, and texture mapping, rendering output, etc. are performed on the device side.
  • the key point driving technology based on the adversarial network, the feature point geometric stretching method, and the image transformation and generation technology based on the Encoder-Decoder method are used to realize the driving of virtual digital humans.
  • using the emotional key point template method, the corresponding relationship between the user key points and the preset user emotional key points is established to realize the emotional expression of virtual digital people.
  • 3D face driving technology based on deep coding and decoding technology to realize semantic mapping of speech features and vertex three-dimensional motion features
  • rhythmic head driving technology based on a deep codec nested temporal network, with the ability to discriminatively control head movement and facial activity.
  • Embodiments of the present disclosure provide a computer device, including: one or more processors; and a memory for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any one of the embodiments of the present disclosure.
  • the smart device shown in Figure 8 displays the expression and mouth shape of the virtual digital person when speaking while playing the voice response information, and as shown in Figure 9, the user can not only hear the voice of the virtual digital person but also see the expression of the virtual digital person when speaking, giving the user the experience of talking to a person.
  • based on the basic speaking styles, as shown in Figure 10, a new speaking style can be generated. Since retraining the speaking style model requires a lot of time to collect training samples and process a large amount of data, generating a new speaking style takes a long time, making speaking style generation relatively inefficient.
  • the present disclosure determines the fitting coefficient of each style feature attribute by fitting the target style feature attribute based on multiple style feature attributes; determines the target style feature vector according to the fitting coefficient of each style feature attribute and multiple style feature vectors, where the multiple style feature vectors correspond to the multiple style feature attributes one-to-one; inputs the target style feature vector into the speaking style model and outputs the target speaking style parameters, where the speaking style model is obtained by training the framework of the speaking style model based on the multiple style feature vectors; and generates the target speaking style based on the target speaking style parameters. In this way, the target style feature vector can be fitted from the multiple style feature vectors. Since the speaking style model is trained based on the multiple style feature vectors, inputting the target style feature vector fitted from the multiple style feature vectors into the speaking style model can directly yield the corresponding new speaking style. There is no need to retrain the speaking style model, which enables rapid transfer of speaking styles and improves the efficiency of speaking style generation.
  • FIG 11 is a schematic diagram of a human-computer interaction scenario according to some embodiments.
  • smart devices may include smart refrigerator 110, smart washing machine 120, smart display device 130, etc.
  • a user wants to control a smart device, he or she needs to issue a voice command first.
  • after the smart device receives the voice command, it needs to perform semantic understanding of the voice command, determine the semantic understanding result corresponding to the voice command, and execute the corresponding control instructions according to the semantic understanding result to meet the user's needs.
  • the smart devices in this scenario all include a display screen, which can be a touch screen or a non-touch screen.
  • for terminal devices with touch screens, users can perform interactive operations with the terminal device through gestures, fingers, or touch tools (such as stylus pens).
  • interactive operations with the terminal device can be implemented through external devices (for example, a mouse or a keyboard, etc.).
  • the display screen can display a three-dimensional virtual person, and the user can see the three-dimensional virtual person and his or her expression when speaking through the display screen, thereby realizing dialogue and interaction with the three-dimensional virtual person.
  • the speaking style generation method provided by the embodiments of the present disclosure can be implemented based on a computer device, or a functional module or functional entity in the computer device.
  • the computer device may be a personal computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a large computer, etc., which are not specifically limited in the embodiments of the present disclosure.
  • Figure 12 is a schematic flowchart of a speaking style generation method according to some embodiments. As shown in Figure 12, the method specifically includes the following steps:
  • each frame of facial topology data corresponds to a dynamic face topology
  • the face topology includes multiple vertices.
  • Each vertex in the dynamic face topology structure corresponds to a vertex coordinate (x, y, z).
  • the vertex coordinates of each vertex in the static face topology are (x', y', z').
  • based on the vertex coordinates of corresponding vertices in the dynamic and static face topology, the vertex offset of each vertex, (Δx, Δy, Δz) = (x - x', y - y', z - z'), and in turn the average vertex offset of each vertex in the dynamic face topology, can be determined.
  • FIG 13 is a schematic diagram of facial topology data divided into regions according to an embodiment of the present disclosure.
  • facial topology data can be divided into multiple regions.
  • facial topology data can be divided into three regions, S1, S2 and S3, where S1 is the entire facial area above the lower edge of the eyes, S2 is the facial area from the lower edge of the eyes to the upper edge of the upper lip, and S3 is the facial area from the upper edge of the upper lip to the chin.
  • the average vertex offset of all vertices of the dynamic face topology within area S1 (denoted ΔS1), the average vertex offset of all vertices within area S2 (denoted ΔS2), and the average vertex offset of all vertices within area S3 (denoted ΔS3) can be determined.
  • splicing these per-region averages yields the style feature attribute, that is, P = [ΔS1, ΔS2, ΔS3]. To sum up, one style feature attribute can be obtained for one user, and multiple style feature attributes can be obtained based on multiple users.
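  • The per-region averaging and splicing described above can be sketched as follows; the three-region split and the frame dimension follow the text, while the array shapes and function names are illustrative assumptions.

```python
# Sketch: average the per-vertex offsets between dynamic and static topology over
# all frames and all vertices of each region, then splice the region averages.
import numpy as np


def style_feature_attribute(dynamic_frames, static_vertices, region_masks):
    """dynamic_frames: (F, V, 3) vertex coordinates per frame
    static_vertices: (V, 3) neutral (static) face topology
    region_masks: boolean arrays of shape (V,) for S1, S2, S3, in the preset order
    returns the spliced style feature attribute, shape (3 * 3,)"""
    offsets = dynamic_frames - static_vertices[None, :, :]    # (F, V, 3) vertex offsets
    parts = [offsets[:, mask, :].mean(axis=(0, 1)) for mask in region_masks]
    return np.concatenate(parts)                              # [avg(S1), avg(S2), avg(S3)]
```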
  • a new style feature attribute can be fitted and formed, that is, the target style feature attribute.
  • the target style feature attribute can be obtained by fitting based on the following formula: P_target = a1*P1 + a2*P2 + ... + an*Pn, where Pi denotes the style feature attribute of user i.
  • a1 is the fitting coefficient of the style feature attribute of user 1
  • a2 is the fitting coefficient of the style feature attribute of user 2
  • an is the fitting coefficient of the style feature attribute of user n
  • optimization methods can be used, such as gradient descent method, Gauss-Newton method, etc., to obtain the fitting coefficient of each style characteristic attribute.
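  • A minimal sketch of obtaining the fitting coefficients: stacking the n style feature attributes as columns and solving a least-squares problem is one convenient choice; the disclosure itself only names gradient descent and Gauss-Newton as example optimization methods.

```python
# Fit the target style feature attribute as a1*P1 + a2*P2 + ... + an*Pn.
import numpy as np


def fit_coefficients(basic_attributes, target_attribute):
    """basic_attributes: (n, d) array, one style feature attribute per row
    target_attribute: (d,) target style feature attribute
    returns (n,) fitting coefficients a1..an"""
    A = basic_attributes.T                                    # columns are P1..Pn
    coeffs, *_ = np.linalg.lstsq(A, target_attribute, rcond=None)
    return coeffs
```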
  • this embodiment only takes dividing facial topological structure data into three areas as an example for illustration, and does not serve as a specific restriction on dividing facial topological structure data into areas.
  • S1302 Determine a target style feature vector based on the fitting coefficient of each style feature attribute and multiple style feature vectors.
  • the plurality of style feature vectors correspond to the plurality of style feature attributes in one-to-one correspondence.
  • the style feature vector is a representation of style
  • the embedding obtained by training a classification task model can be used as the style feature vector, or a one-hot feature vector can be directly designed as the style feature vector.
  • the 3 style feature vectors can be [1;0;0], [0;1;0] and [0;0;1].
  • the style feature attributes of n users with different speaking styles are obtained.
  • the style feature vectors of n users can be obtained.
  • These n style feature attributes correspond to the n style feature vectors one-to-one; the n style feature attributes and their corresponding style feature vectors form the basic style feature base.
  • the respective fitting coefficients are multiplied by the corresponding style feature vectors, and the target style feature vector can be expressed in the form of the basic style feature base, as shown in the following formula: p = a1*F1 + a2*F2 + ... + an*Fn.
  • F1 is the style feature vector of user 1
  • F2 is the style feature vector of user 2
  • Fn is the style feature vector of user n
  • p is the target style feature vector.
  • the style feature vector is a one-hot feature vector
  • the target style feature vector p can be expressed as: p = [a1; a2; ...; an].
  • the speaking style model is obtained by training a speaking style model framework based on the plurality of style feature vectors.
  • the framework of the speaking style model is trained based on multiple style feature vectors in the basic style feature base to obtain the framework of the trained speaking style model, that is, the speaking style model.
  • Inputting the target style feature vector into the speaking style model can be understood as inputting the linear combination of the multiple style feature vectors weighted by their respective fitting coefficients into the speaking style model, which has the same form as the training samples input when training the framework of the speaking style model. Therefore, based on the speaking style model, using the target style feature vector as input, the target speaking style parameters can be directly output.
  • the target speaking style parameter can be the vertex offset between each vertex in the dynamic face topology structure and the corresponding vertex in the static face topology structure; or it can be the coefficient of the expression basis of the dynamic face topology structure, or it can be other parameters, this disclosure does not impose specific restrictions on this.
  • S1304 Generate a target speaking style based on the target speaking style parameters.
  • the target speaking style parameter is the vertex offset of each vertex in the dynamic face topology structure and the corresponding vertex in the static face topology structure.
  • the offset drives each vertex of the static face topology to move to the corresponding position, and the target speaking style can be obtained.
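  • Putting steps S1302 to S1304 together, the hedged sketch below builds the target style feature vector from the fitting coefficients and the one-hot basic style feature vectors, asks an already trained speaking style model (treated here as an opaque callable, which is an assumption) for the per-vertex offsets, and displaces the static topology accordingly.

```python
# Sketch of S1302-S1304: target style vector -> target speaking style parameters
# (vertex offsets) -> driven face topology.
import numpy as np


def target_style_vector(coeffs):
    """With one-hot basic style vectors F1..Fn, the weighted sum reduces to the
    coefficient vector itself: p = a1*F1 + ... + an*Fn = [a1; ...; an]."""
    basis = np.eye(len(coeffs))
    return basis.T @ coeffs


def drive_static_topology(static_vertices, speaking_style_model, coeffs):
    """speaking_style_model is assumed to map a style vector to (V, 3) offsets."""
    p = target_style_vector(coeffs)
    offsets = speaking_style_model(p)          # target speaking style parameters
    return static_vertices + offsets           # move each vertex to its target position
```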
  • In the above method, by fitting the target style feature attribute based on multiple style feature attributes, the fitting coefficient of each style feature attribute is determined; the target style feature vector is determined based on the fitting coefficient of each style feature attribute and the multiple style feature vectors, which correspond to the multiple style feature attributes one-to-one; the target style feature vector is input into the speaking style model, which outputs the target speaking style parameters, the speaking style model being obtained by training the framework of the speaking style model based on the multiple style feature vectors; and the target speaking style is generated based on the target speaking style parameters. In this way, the target style feature vector can be fitted from the multiple style feature vectors. Since the speaking style model is trained based on the multiple style feature vectors, inputting the target style feature vector fitted from the multiple style feature vectors into the speaking style model can directly yield the corresponding new speaking style. There is no need to retrain the speaking style model, which enables rapid transfer of speaking styles and improves the efficiency of speaking style generation.
  • Figure 14 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 14 is based on the embodiment shown in Figure 12. Before executing S1301, it also includes:
  • S1501 Collect multi-frame facial topological structure data when multiple preset users read multiple speech segments.
  • users with different speaking styles are selected as preset users, and multiple speech segments are also selected.
  • when each preset user reads each segment of speech, multi-frame facial topological structure data of the preset user is collected. For example, the duration of speech segment 1 is t1, and the frequency of collecting facial topology data is 30 frames/second; in this way, after preset user 1 reads speech segment 1, t1*30 frames of facial topology data can be collected.
  • t1*30*m frames of facial topology data can be collected.
  • the vertex offsets (Δx, Δy, Δz) between each vertex of the dynamic face topology structure and the corresponding vertex of the static face topology structure in each frame of facial topology data can be used as the speaking style parameters of that frame of facial topology data.
  • in this way, the average of the speaking style parameters of the multi-frame facial topology data within each divided area, that is, the average vertex offset of all vertices of the dynamic face topology within that area, can be obtained.
  • for example, when the facial topology data is divided into three areas, the average vertex offset of all vertices of the dynamic face topology within area S1 is denoted ΔS1, the average within area S2 is denoted ΔS2, and the average within area S3 is denoted ΔS3.
  • S1503 Splice the average value of the speaking style parameters of the multi-frame facial topological structure data in each divided area in a preset order to obtain the style feature attributes of each preset user.
  • the preset order may be a top-to-bottom order as shown in FIG. 13 , or may be a bottom-to-top order as shown in FIG. 13 , and the present disclosure does not specifically limit this.
  • when the preset order is the top-to-bottom order shown in Figure 13, based on the above embodiment, the per-area averages can be spliced in the order of areas S1, S2 and S3. In this way, the style feature attribute of preset user 1 can be obtained, that is, P1 = [ΔS1, ΔS2, ΔS3].
  • in the same way, multiple style feature attributes can be obtained for the multiple preset users.
  • Figure 15 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 15 is based on the embodiment shown in Figure 14. Before executing S1301, it also includes:
  • S1601 Collect multi-frame target facial topological structure data when the target user reads the multiple segments of speech.
  • the target user and the plurality of preset users are different users.
  • to generate the target speaking style, multi-frame target facial topological structure data is collected while the target user corresponding to that speaking style reads multiple segments of speech, and the content of the segments read by the target user is the same as the content of the segments read by the multiple preset users. For example, after the target user reads m segments of speech each with a duration of t1, t1*30*m frames of target facial topological structure data can be obtained.
  • S1602. Determine the average of the speaking style parameters of the multi-frame target facial topological structure data in each divided area, based on the respective speaking style parameters of the multi-frame target facial topological structure data corresponding to the multiple segments of speech and the divided areas of the facial topological structure data.
  • the vertex offsets (Δx', Δy', Δz') between each vertex of the dynamic face topology and the corresponding vertex of the static face topology in each frame of target facial topology data can be used as the speaking style parameters of that frame of target facial topology data.
  • based on the vertex offsets (Δx', Δy', Δz') of all vertices of the dynamic face topology in the target user's t1*30*m frames of target facial topology data, the average vertex offset of each vertex of the dynamic face topology in the target user's target facial topology data can be determined.
  • in this way, the average of the speaking style parameters of the multi-frame target facial topology data within each divided area, that is, the average vertex offset of all vertices of the dynamic face topology within that area, can be obtained.
  • for example, when the facial topology data is divided into three areas, the average vertex offset of all vertices of the dynamic face topology in the target facial topology data within area S1 is denoted ΔS1', the average within area S2 is denoted ΔS2', and the average within area S3 is denoted ΔS3'.
  • S1603 Splice the average value of the speaking style parameters of the multi-frame target facial topological structure data in each divided area according to the preset order to obtain the target style feature attribute.
  • the per-area averages of the vertex offsets of the dynamic face topology in the target facial topology data are spliced in the preset order, for example, the top-to-bottom order shown in Figure 13.
  • that is, the averages corresponding to the respective areas can be spliced in the order of areas S1, S2 and S3, and the target style feature attribute of the target user, P_target = [ΔS1', ΔS2', ΔS3'], can be obtained.
  • S1501-S1503 shown in Figure 14 can be executed first, and then S1601-S1603 shown in Figure 15 can be executed; or, S1601-S1603 shown in Figure 15 can be executed first, and then S1501-S1503 shown in Figure 14 can be executed.
  • this disclosure does not impose specific limitations on this.
  • Figure 16 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 16 is based on the embodiments shown in Figures 14 and 13. Before executing S1303, it also includes:
  • the training sample set includes an input sample set and an output sample set.
  • the input sample includes voice features and the plurality of corresponding style feature vectors, and the output sample includes the speaking style parameters.
  • the speech Melp feature can be extracted as the speech feature, or a speech feature extraction model commonly used in the industry can be used to extract speech features, or speech features can be extracted based on a designed deep network model.
  • the speech feature sequence can be extracted. If the contents of the multiple speech segments read by the multiple preset users are exactly the same, the same speech feature sequence can be extracted for the different preset users. In this way, for the same speech feature in the speech feature sequence, there are multiple style feature vectors corresponding to the multiple preset users. One speech feature and its corresponding multiple style feature vectors can be used as an input sample, and based on all the speech features in the speech feature sequence, multiple input samples, that is, an input sample set, can be obtained.
  • corresponding facial topology data can be collected. Based on the respective vertex coordinates of all vertices of the dynamic face topology in the facial topology data, the respective vertex offsets of all vertices of the dynamic face topology in the facial topology data can be obtained.
  • the respective vertex offsets of all vertices of the dynamic face topology in the facial topology data are used as a set of speaking style parameters, and a set of speaking style parameters is an output sample.
  • based on the multi-frame facial topology structure data, multiple output samples, that is, an output sample set, can be obtained.
  • the input sample set and the output sample set constitute the training sample set for training the speaking style generation model.
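  • A sketch of assembling the training sample set under the assumptions above: each input pairs a speech feature with the one-hot style feature vector of the preset user who spoke it, and each output is the flattened set of vertex offsets of the aligned facial topology frame; feature extraction itself is left outside the sketch since the disclosure allows several options.

```python
# Sketch: build (input, output) pairs for training the speaking style model framework.
import numpy as np


def build_training_set(speech_features, facial_frames, static_vertices, n_users):
    """speech_features: dict user_id -> (T, d) feature sequence
    facial_frames: dict user_id -> (T, V, 3) dynamic topology aligned with the features
    returns a list of (feature, style_vector) inputs and a list of (V*3,) outputs"""
    inputs, outputs = [], []
    for user_id in range(n_users):
        style_vec = np.eye(n_users)[user_id]               # one-hot style feature vector
        feats, frames = speech_features[user_id], facial_frames[user_id]
        for t in range(len(feats)):
            inputs.append((feats[t], style_vec))
            offsets = frames[t] - static_vertices          # speaking style parameters
            outputs.append(offsets.reshape(-1))
    return inputs, outputs
```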
  • the framework of the speaking style model includes a linear combination unit and a network model.
  • the linear combination unit is used to generate a linear combination style feature vector of the plurality of style feature vectors and generate a linear combination output sample of a plurality of output samples.
  • the input samples correspond to the output samples one-to-one;
  • the network model is used to generate corresponding predicted output samples according to the linear combination style feature vector.
  • Figure 17 is a schematic structural diagram of the framework of the speaking style model according to some embodiments.
  • the framework of the speaking style model includes a linear combination unit 310 and a network model 320.
  • the input end of the linear combination unit 310 is used to receive training samples.
  • the output end of the linear combination unit 310 is connected to the input end of the network model 320
  • the output end of the network model 320 is the output end of the framework 300 of the speaking style model.
  • the training samples include input samples and output samples, where the input samples include speech features and their corresponding multiple style feature vectors.
  • the linear combination unit 310 can linearly combine the multiple style feature vectors to obtain a linear combination style feature vector, and the speaking style parameters corresponding to the multiple style feature vectors can also be linearly combined to obtain a linear combination output sample.
  • the linear combination unit 310 can output speech features and their corresponding linear combination style feature vectors, that is, linear combination input samples, and can also output corresponding linear combination output samples.
  • the linear combination training samples are input to the network model 320.
  • the linear combination training samples include linear combination input samples and linear combination output samples. Based on the linear combination training samples, the network model 320 is trained.
  • the training samples in the training sample set are input to the framework of the speaking style model.
  • the framework of the speaking style model can output predicted output samples.
  • the loss function is used to determine the loss value between the predicted output sample and the output sample. Based on the loss value, the model parameters of the framework of the speaking style model are adjusted in the direction that reduces the loss value, completing one iteration of training. In this way, after multiple iterations of training the framework of the speaking style model, a well-trained framework, that is, the speaking style model, can be obtained.
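  • A rough PyTorch sketch of one training step of the framework in Figure 17: the linear combination of style feature vectors and of their output samples plays the role of the linear combination unit, and a small fully connected network stands in for the network model; the layer sizes, the MSE loss, and the optimizer choice are assumptions.

```python
# Illustrative training step for the speaking style model framework.
import torch
import torch.nn as nn


class StyleNetwork(nn.Module):
    def __init__(self, feat_dim, n_styles, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + n_styles, 512), nn.ReLU(),
            nn.Linear(512, out_dim))                       # predicted speaking style parameters

    def forward(self, speech_feat, style_vec):
        return self.net(torch.cat([speech_feat, style_vec], dim=-1))


def train_step(model, optimizer, speech_feat, style_vecs, output_samples, weights):
    """style_vecs: (n, n) one-hot style vectors; output_samples: (n, out_dim);
    weights: (n,) linear combination weights (all torch tensors)."""
    mixed_style = weights @ style_vecs                     # linear combination style feature vector
    mixed_target = weights @ output_samples                # linear combination output sample
    pred = model(speech_feat, mixed_style)
    loss = nn.functional.mse_loss(pred, mixed_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```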
  • a training sample set is obtained.
  • the training sample set includes an input sample set and an output sample set.
  • the input sample includes speech features and their corresponding multiple style feature vectors, and the output sample includes speaking style parameters; the framework of the speaking style model is then defined.
  • the framework of the speaking style model includes a linear combination unit and a network model.
  • the linear combination unit is used to generate a linear combination style feature vector of multiple style feature vectors, and generate a linear combination output sample of multiple output samples.
  • the input samples correspond to the output samples one-to-one.
  • the network model is used to generate corresponding predicted output samples based on linear combination of style feature vectors; based on the training sample set and loss function, the framework of the speaking style model is trained to obtain the speaking style model.
  • the speaking style model is essentially obtained by training the network model on linear combinations of the multiple style feature vectors, which can improve the diversity of the network model's training samples and improve the versatility of the speaking style model.
  • Figure 18 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 18 is a detailed description of an implementation method when performing S1703 based on the embodiment shown in Figure 16, as follows:
  • the sum of the weight values of the plurality of style feature vectors is 1.
  • multiple style feature vectors can be assigned weight values respectively, and the sum of the weight values of the multiple style feature vectors is 1.
  • by adding the products of the weight values and the corresponding style feature vectors, a linear combination style feature vector can be obtained.
  • Each style feature vector corresponds to an output sample, and the linear combination output sample can be obtained by adding the product of the weight value of multiple style feature vectors and the corresponding output sample.
  • different linear combination style feature vectors and different linear combination output samples can be obtained.
  • a linear combination input sample set can be obtained.
  • a linear combination of output samples corresponding to multiple speech features can be obtained.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination input sample includes the speech feature and its corresponding linear combination style feature vector.
  • the linear combination training samples are input to the network model.
  • a predicted output sample can be obtained; based on the loss function, the model parameters of the network model are adjusted, completing one iteration of training the network model. In this way, after multiple iterations of training the network model, a well-trained framework for the speaking style model, that is, the speaking style model, can be obtained.
  • a linear combination style feature vector is generated based on multiple style feature vectors and their respective weight values, and a linear combination output sample is generated based on the respective weight values of the multiple style feature vectors and the multiple output samples, where the sum of the weight values of the multiple style feature vectors is 1; the network model is trained according to the loss function and the linear combination training sample set to obtain the speaking style model.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination input sample includes a speech feature and its corresponding linear combination style feature vector.
  • the linearly combined training samples can be used as training samples for the network model, which can increase the number and diversity of the training samples of the network model and improve the versatility and accuracy of the speaking style model. A minimal training-iteration sketch is given below.
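  • One training step on linearly combined samples could look roughly like the following PyTorch sketch; the network size, feature dimensions, optimizer settings and MSE loss are assumptions for illustration rather than the model actually used in the disclosure.

```python
import torch
import torch.nn as nn

# Minimal sketch of one training iteration on linearly combined samples.
style_net = nn.Sequential(nn.Linear(80 + 16, 128), nn.ReLU(), nn.Linear(128, 52))
optimizer = torch.optim.Adam(style_net.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(speech_feat, style_vectors, output_samples):
    # speech_feat: (batch, 80); style_vectors: (num_styles, 16); output_samples: (num_styles, 52)
    weights = torch.distributions.Dirichlet(torch.ones(style_vectors.size(0))).sample()
    combined_style = weights @ style_vectors            # linear combination style feature vector
    combined_target = weights @ output_samples          # linear combination output sample

    net_in = torch.cat([speech_feat, combined_style.expand(speech_feat.size(0), -1)], dim=1)
    predicted = style_net(net_in)                       # predicted speaking style parameters
    loss = loss_fn(predicted, combined_target.expand_as(predicted))

    optimizer.zero_grad()
    loss.backward()          # adjust parameters in the direction that reduces the loss
    optimizer.step()
    return loss.item()
```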
  • Figure 19 is a schematic structural diagram of the framework of another speaking style generation model provided by the embodiment of the present disclosure.
  • the framework of the speaking style model also includes a scaling unit 330.
  • the input end of the scaling unit 330 is used to receive training samples
  • the output end of the scaling unit 330 is connected to the input end of the linear combination unit 310
  • the scaling unit 330 is used to scale the multiple style feature vectors and the multiple output samples based on randomly generated scaling factors, to obtain multiple scaled style feature vectors and multiple scaled output samples, and to output scaled training samples.
  • the scaled training samples include the multiple scaled style feature vectors and their respective corresponding scaled output samples.
  • the scaling factor can range from 0.5 to 2 and is accurate to one decimal place.
  • the scaled training samples are input to the linear combination unit 310. Based on the linear combination unit 310, the multiple scaled style feature vectors can be linearly combined to obtain a linear combination style feature vector, and the scaled output samples corresponding to the multiple scaled style feature vectors can also be linearly combined to obtain a linear combination output sample.
  • the linear combination unit 310 can output speech features and their corresponding linear combination style feature vectors, that is, linear combination input samples, and can also output corresponding linear combination output samples.
  • the linear combination training samples are input to the network model 320.
  • the linear combination training samples include linear combination input samples and linear combination output samples. Based on the linear combination training samples, the network model 320 is trained.
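  • A minimal sketch of the scaling unit, assuming NumPy arrays and the 0.5–2 scaling-factor range with one decimal place mentioned above (all other details are illustrative):

```python
import numpy as np

def scaling_unit(style_vectors, output_samples, rng=None):
    """Sketch of the scaling unit: each style and its output sample share one factor."""
    rng = rng or np.random.default_rng()
    num_styles = style_vectors.shape[0]

    # Random factors in [0.5, 2.0], rounded to one decimal place as described above.
    factors = np.round(rng.uniform(0.5, 2.0, size=num_styles), 1)

    scaled_styles = style_vectors * factors[:, None]
    scaled_outputs = output_samples * factors[:, None]
    return scaled_styles, scaled_outputs, factors
```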
  • Figure 20 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 20 is a detailed description of another possible implementation when performing S1703 based on the embodiment shown in Figure 16, as follows:
  • multiple style feature vectors can be scaled separately with random scaling factors, and multiple scaled style feature vectors can be obtained.
  • Each style feature vector corresponds to an output sample, and multiple scaled output samples can be obtained by scaling the corresponding output samples based on the respective scaling factors of multiple style feature vectors.
  • a scaled input sample set can be obtained based on multiple voice features and their corresponding multiple scaled style feature vectors, and a scaled output sample set can be obtained based on the scaled output samples corresponding to the multiple voice features.
  • S5012: Input the multiple scaled style feature vectors and the multiple scaled output samples to the linear combination unit, and generate the linear combination style feature vector based on the multiple scaled style feature vectors and their respective weight values.
  • the scaled training sample set includes a scaled input sample set and a scaled output sample set.
  • the scaled training sample set is input to the linear combination unit.
  • multiple scaled style feature vectors can each be assigned a weight value, and the sum of the weight values of the multiple scaled style feature vectors is 1.
  • by summing the products of each scaled style feature vector and its weight value, a linear combination style feature vector can be obtained.
  • Each scaling style feature vector corresponds to a scaling output sample, and the linear combination output sample can be obtained by adding the product of the respective weight values of multiple scaling style feature vectors and the corresponding scaling output samples. In this way, based on different weight values, different linear combination style feature vectors and different linear combination output samples can be obtained.
  • a linear combination input sample set can be obtained.
  • based on the scaled output samples corresponding to the multiple speech features, a linear combination output sample set can be obtained.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination input sample includes the speech feature and its corresponding linear combination style feature vector.
  • the linear combination training samples are input to the network model.
  • a predicted output sample can be obtained. Based on the loss function, the loss value of the predicted output sample and the corresponding linear combination output sample can be determined.
  • the model parameters of the network model are adjusted in the direction in which the loss value decreases, completing one training iteration of the network model. In this way, after multiple training iterations of the network model, a well-trained framework, that is, the speaking style model, can be obtained.
  • the framework of the speaking style model also includes a scaling unit; the training sample set is input to the scaling unit, multiple scaled style feature vectors are generated based on the scaling factor and the multiple style feature vectors, and multiple scaled output samples are generated based on the scaling factor and the multiple output samples; the multiple scaled style feature vectors and the multiple scaled output samples are input to the linear combination unit, and a linear combination style feature vector is generated based on the multiple scaled style feature vectors and their respective weight values.
  • a linear combination output sample is generated based on the respective weight values of the multiple scaled style feature vectors and the multiple scaled output samples.
  • the sum of the weight values of the multiple scaled style feature vectors is 1; according to the loss function and the linear combination training sample set, the network model is trained to obtain the speaking style model.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination input sample includes speech features and their corresponding linear combination style feature vectors.
  • the multiple scaled style feature vectors serve as training samples for the network model, which can increase the number and diversity of the training samples of the network model, thereby improving the versatility and accuracy of the speaking style model.
  • Figure 21 is a schematic structural diagram of the framework of the speaking style generation model according to some embodiments
  • Figure 22 is a schematic structural diagram of the framework of the speaking style generation model according to some embodiments
  • As shown in Figure 21, the network model 320 includes a first-level network model 321, a second-level network model 322 and a superposition unit 323.
  • the output end of the first-level network model 321 and the output end of the second-level network model 322 are both connected to the input end of the superposition unit 323, and the output end of the superposition unit 323 is used to output the predicted output sample.
  • the loss function includes a first loss function and a second loss function.
  • the linear combination training samples are respectively input to the first-level network model 321 and the second-level network model 322.
  • the first-level prediction output sample can be output based on the first-level network model 321, and the second-level prediction output sample can be output based on the second-level network model 322.
  • the first-level predicted output sample and the second-level predicted output sample are input to the superposition unit 323. Based on the superposition unit 323, the first-level predicted output sample and the second-level predicted output sample are superimposed to obtain the predicted output sample.
  • the first-level network model 321 may include a convolutional network and a fully connected network, which are used to extract the single-frame correspondence between speech and facial topological structure data.
  • the second-level network model 322 can be a sequence-to-sequence (seq2seq) network model, for example, a long short-term memory (LSTM) network model, a gate recurrent unit (GRU) network model, or a Transformer network model.
  • the loss function is L = b1*L1 + b2*L2, where L1 is the first loss function, used to determine the loss value between the first-level predicted output sample and the linear combination output sample; L2 is the second loss function, used to determine the loss value between the second-level predicted output sample and the linear combination output sample; b1 is the weight of the first loss function; b2 is the weight of the second loss function; and b1 and b2 are adjustable. A sketch of this two-branch model and combined loss is given below.
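  • The two-branch structure and the combined loss L = b1*L1 + b2*L2 could be sketched as follows in PyTorch; the layer sizes, the use of an LSTM for the second-level branch and the MSE losses are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TwoLevelStyleNet(nn.Module):
    """Illustrative two-branch model: the outputs of both branches are superimposed."""
    def __init__(self, in_dim=96, out_dim=52):
        super().__init__()
        # First-level branch: convolution + fully connected layers.
        self.first = nn.Sequential(
            nn.Conv1d(in_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, out_dim))
        # Second-level branch: a seq2seq-style recurrent model (LSTM here).
        self.second = nn.LSTM(in_dim, 64, batch_first=True)
        self.second_head = nn.Linear(64, out_dim)

    def forward(self, x):                      # x: (batch, frames, in_dim)
        y1 = self.first(x.transpose(1, 2))     # first-level predicted output sample
        h, _ = self.second(x)
        y2 = self.second_head(h[:, -1])        # second-level predicted output sample
        return y1 + y2, y1, y2                 # superposition unit: sum of both branches

def combined_loss(y1, y2, target, b1=1.0, b2=1.0):
    mse = nn.MSELoss()
    return b1 * mse(y1, target) + b2 * mse(y2, target)   # L = b1*L1 + b2*L2
```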
  • Figure 23 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 23 is a detailed description of an implementation method when performing S502 based on the embodiment shown in Figure 18 or Figure 20, as follows:
  • S5021 Train the first-level network model according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model.
  • the intermediate speaking style model includes the second-level network model and the trained first-level network model.
  • the weight b2 of the second loss function is set to approach 0.
  • the loss function of the current network model can be understood as the first loss function, and the linear combination training samples are input separately to the first-level network model and the second-level network model.
  • based on the first-level predicted output sample, the first loss function and the corresponding linear combination output sample, the first loss value can be obtained, and the model parameters of the first-level network model are adjusted in the direction in which the first loss value decreases until the first loss value converges, obtaining the trained first-level network model.
  • the framework of the speaking style model trained in the first stage is the intermediate speaking style model.
  • the model parameters of the trained first-level network model need to be fixed first.
  • S5023 Train the secondary network model in the intermediate speaking style model according to the linear combination training sample set and the second loss function to obtain the speaking style model.
  • the speaking style model includes the trained first-level network and the trained second-level network.
  • the loss function of the current network model can be understood as the second loss function.
  • the linear combination training samples are input to the second-level network model and the trained first-level network model. Based on the predicted output sample output by the superposition unit, the second loss function and the corresponding linear combination output sample, the second loss value can be obtained, and the model parameters of the second-level network model are adjusted in the direction in which the second loss value decreases until the second loss value converges, obtaining the trained second-level network model.
  • the framework of the speaking style model trained in the second stage is the speaking style model.
  • the network model includes a first-level network model, a second-level network model and a superposition unit.
  • the output end of the first-level network model and the output end of the second-level network model are both connected to the input end of the superposition unit.
  • the output end of the superposition unit is used to output the predicted output sample;
  • the loss function includes the first loss function and the second loss function. According to the linear combination training sample set and the first loss function, the first-level network model is trained to obtain an intermediate speaking style model, which includes the second-level network model and the trained first-level network model; the model parameters of the trained first-level network model are fixed; the second-level network model in the intermediate speaking style model is then trained according to the linear combination training sample set and the second loss function to obtain the speaking style model, which includes the trained first-level network and the trained second-level network.
  • the network model can be trained in stages, which can improve the convergence speed of the network model, that is, shorten the training time of the network model. A staged-training sketch is given below.
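  • A staged-training sketch for the two-branch model outlined above (the interfaces, optimizers and epoch counts are assumptions; the disclosure only specifies the two stages, not a concrete schedule):

```python
import torch

def train_two_stage(model, loader, loss_fn, epochs=(10, 10)):
    """Sketch of staged training for the two-branch model sketched earlier."""
    # Stage 1: train only the first-level branch (weight of the second loss ~ 0).
    opt1 = torch.optim.Adam(model.first.parameters(), lr=1e-4)
    for _ in range(epochs[0]):
        for x, target in loader:
            _, y1, _ = model(x)
            loss1 = loss_fn(y1, target)            # first loss function
            opt1.zero_grad(); loss1.backward(); opt1.step()

    # Fix the model parameters of the trained first-level branch.
    for p in model.first.parameters():
        p.requires_grad_(False)

    # Stage 2: train the second-level branch against the superimposed prediction.
    second_params = list(model.second.parameters()) + list(model.second_head.parameters())
    opt2 = torch.optim.Adam(second_params, lr=1e-4)
    for _ in range(epochs[1]):
        for x, target in loader:
            y, _, _ = model(x)                     # output of the superposition unit
            loss2 = loss_fn(y, target)             # second loss function
            opt2.zero_grad(); loss2.backward(); opt2.step()
    return model
```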
  • Figure 24 is a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
  • the computer device includes a processor 910 and a memory 920; the number of processors 910 in the computer device can be one or more.
  • in Figure 24, one processor 910 is taken as an example; the processor 910 and the memory 920 in the computer device may be connected through a bus or other means, and in Figure 24 the connection through a bus is taken as an example.
  • the memory 920 can be used to store software programs, computer-executable programs and modules, such as program instructions/modules corresponding to the semantic understanding model training method in the embodiments of the present disclosure, or program instructions/modules corresponding to the semantic understanding method in the embodiments of the present disclosure.
  • the processor 910 executes various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the memory 920, that is, implementing the semantic understanding model training method or the short video recall method provided by the embodiments of the present disclosure.
  • the memory 920 may mainly include a stored program area and a stored data area, where the stored program area may store an operating system and an application program required for at least one function; the stored data area may store data created according to the use of the terminal, etc.
  • the memory 920 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • memory 920 may further include memory located remotely relative to processor 910, and these remote memories may be connected to the computer device through a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure relates to a virtual digital human driving method, an apparatus, a device and a medium. The method comprises: acquiring user information, the user information comprising voice information and image information; according to the user information, determining a user intention and a user emotion; according to the user intention, determining a reply text of a virtual digital human, and, according to the user intention and the user emotion, determining a reply emotion of the virtual digital human; according to the reply text, determining limb actions of the virtual digital human, and, according to the reply emotion, determining an emotion expression mode of the virtual digital human. The present disclosure achieves a natural and anthropomorphic interaction state of a virtual human, thereby improving the simulation effect and expression naturalness of a virtual digital human.

Description

A virtual digital human driving method, apparatus, device and medium
Cross-reference to related applications
This disclosure claims priority to the Chinese patent application No. 202210714001.7 filed with the China Patent Office on June 22, 2022, and to the Chinese patent application No. 202210751784.6 filed with the China Patent Office on June 28, 2022, the entire contents of which are incorporated into this disclosure by reference.
Technical field
The present disclosure relates to the field of virtual digital human technology, and in particular, to a virtual digital human driving method, apparatus, device and medium.
Background
A virtual digital human is a multi-modal intelligent human-computer interaction technology that integrates computer vision, speech recognition, speech synthesis, natural language processing, terminal display and other technologies to create a highly anthropomorphic virtual image that can interact and communicate with people like a real person.
Although virtual digital humans already have a small number of demonstration-level applications, their expressive capability is still limited. First, in terms of perception, in real and complex acoustic scenes, differences in channel, environment and speaker significantly increase the difficulty of recognition. Second, current intelligent interaction systems find it difficult to accurately recognize users' true intentions and emotional states in different complex natural interaction scenes, and therefore have difficulty outputting matching system actions. Finally, limited by the virtual human's visual image, form-driven technology and speech synthesis technology, the fidelity and naturalness of expression of virtual humans are still relatively stiff.
Summary of the invention
The present disclosure provides a virtual digital human driving method, including:
obtaining user information, where the user information includes voice information and image information;
determining a user intention and a user emotion according to the user information;
determining a reply text of the virtual digital human according to the user intention, and determining a reply emotion of the virtual digital human according to the user intention and the user emotion;
determining body movements of the virtual digital human according to the reply text, and determining an emotional expression mode of the virtual digital human according to the reply emotion.
The present disclosure also provides a computer device, including:
one or more processors;
a memory, used to store one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any one of the first aspect.
The present disclosure also provides a computer-readable non-volatile storage medium on which a computer program is stored, where the program, when executed by a processor, implements the method as described in any one of the first aspect.
Description of the drawings
Figure 1A is a schematic diagram of an application scenario of a virtual digital human driving process according to some embodiments;
Figure 1B is a schematic structural diagram of a virtual digital human according to some embodiments;
Figure 2A is a hardware configuration block diagram of a computer device according to some embodiments;
Figure 2B is a schematic diagram of a software configuration of a computer device according to some embodiments;
Figure 2C is a schematic diagram showing an icon control interface of an application included in a smart device according to some embodiments;
Figure 3A is a schematic flowchart of a virtual digital human driving method according to some embodiments;
Figure 3B is a schematic diagram of the principle of a virtual digital human driving method according to some embodiments;
Figure 4A is a schematic flowchart of another virtual digital human driving method according to some embodiments;
Figure 4B is a schematic diagram of the principle of a virtual digital human driving method according to some embodiments;
Figure 4C is a schematic flowchart of yet another virtual digital human driving method according to some embodiments;
Figure 4D is a schematic flowchart of yet another virtual digital human driving method according to some embodiments;
Figure 5 is a schematic flowchart of a virtual digital human driving method according to some embodiments;
Figure 6 is a schematic flowchart of yet another virtual digital human driving method according to some embodiments;
Figure 7 is a schematic flowchart of yet another virtual digital human driving method according to some embodiments;
Figure 8 is a schematic diagram of a virtual digital human according to some embodiments;
Figure 9 is a schematic diagram of a virtual digital human according to some embodiments;
Figure 10 is a schematic diagram of the principle of generating a new speaking style according to some embodiments;
Figure 11 is a schematic diagram of a human-computer interaction scenario according to some embodiments;
Figure 12 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 13 is a schematic diagram of facial topology data divided into regions according to some embodiments;
Figure 14 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 15 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 16 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 17 is a schematic structural diagram of the framework of a speaking style model according to some embodiments;
Figure 18 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 19 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments;
Figure 20 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 21 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments;
Figure 22 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments;
Figure 23 is a schematic flowchart of a speaking style generation method according to some embodiments;
Figure 24 is a schematic structural diagram of a computer device according to some embodiments.
Detailed description
In order to understand the above objects, features and advantages of the present disclosure more clearly, the solutions of the present disclosure are further described below. It should be noted that, as long as there is no conflict, the embodiments of the present disclosure and the features in the embodiments can be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of the present disclosure, but the present disclosure can also be implemented in ways other than those described here; obviously, the embodiments in the description are only some, rather than all, of the embodiments of the present disclosure.
As a new generation of human-computer interaction, a virtual digital human system is usually designed with five modules: character image, speech generation, dynamic image generation, audio-video synthesis and display, and interaction modeling. Character images can be divided into 2D and 3D categories according to the dimension of the character image resources, and in appearance can be further divided into cartoon, anthropomorphic, realistic, hyper-realistic and other styles. The speech generation module can generate the corresponding character voice from text; the animation generation module can generate dynamic images of a specific character from speech or text; the audio-video synthesis and display module synthesizes the speech and dynamic images into a video that is finally displayed to the user; and the interaction module gives the digital human interactive capability, that is, it recognizes the user's intention through intelligent technologies such as speech and semantic recognition, decides the digital human's subsequent speech and actions according to the user's current intention, and drives the character to start the next round of interaction.
Although virtual digital humans already have a small number of demonstration-level applications, their expressive capability is still limited. First, in terms of perception, in real and complex acoustic scenes, differences in channel, environment and speaker significantly increase the difficulty of recognition. Second, current intelligent interaction systems find it difficult to accurately recognize users' true intentions and emotional states in different complex natural interaction scenes, and therefore have difficulty outputting matching system actions. Finally, limited by the virtual human's visual image, form-driven technology and speech synthesis technology, the fidelity and naturalness of expression of virtual humans are still relatively stiff.
In view of the shortcomings of the above technical problems, the embodiments of the present disclosure first obtain user information, which includes voice information and image information; then determine the user intention and the user emotion according to the user information; and finally determine the body movements of the virtual digital human according to the user intention and the emotional expression mode of the virtual digital human according to the user emotion. That is, on the basis of the acquired user voice information and user image information, the user voice information and user image information are processed to determine the user intention and the user emotion; the body movements of the virtual digital human are then determined according to the user intention and the emotional expression mode of the virtual digital human is determined according to the user emotion, so that the virtual digital human truly restores the user's intention and emotion, improving the fidelity and naturalness of expression of the virtual digital human.
Figure 1A is a schematic diagram of an application scenario of a virtual digital human driving process in an embodiment of the present disclosure. As shown in Figure 1A, the virtual digital human driving process can be used in interaction scenarios between users and smart terminals. Assume that the smart terminals in this scenario include smart blackboards, smart large screens, smart speakers, smart phones and the like, and that the smart terminal displays a virtual digital human; examples of virtual digital humans include virtual teachers, virtual brand images, virtual assistants, virtual shopping guides and virtual anchors, as shown in Figure 1B. When the user wants to control the virtual digital human displayed on the smart terminal in this scenario, the user first issues a voice command. The smart terminal collects the user's voice information and image information, determines the user's intention and emotion by processing the voice information and image information, and then determines the body movements and emotional expression mode of the virtual digital human according to the parsed user instruction and user emotion, so that the virtual digital human truly restores the user's intention and emotion, improving its fidelity and naturalness of expression.
The virtual digital human driving method provided by the embodiments of the present disclosure can be implemented based on a computer device, or on functional modules or functional entities in the computer device.
The computer device may be a personal computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a mainframe computer, etc., which is not specifically limited in the embodiments of the present disclosure.
Exemplarily, Figure 2A is a hardware configuration block diagram of a computer device according to one or more embodiments of the present disclosure. As shown in Figure 2A, the computer device includes at least one of: a tuner-demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply and a user interface 280. The controller 250 includes a central processing unit, a video processor, an audio processor, a graphics processor, RAM, ROM, and first to n-th interfaces for input/output. The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display and a projection display, and may also be a projection device and a projection screen. The tuner-demodulator 210 receives broadcast television signals by wired or wireless reception and demodulates audio and video signals, such as EPG audio and video data signals, from multiple wireless or wired broadcast television signals. The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example, the communicator may include at least one of a Wi-Fi module, a Bluetooth module, a wired Ethernet module, other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The computer device can establish the transmission and reception of control signals and data signals with a server or a local control device through the communicator 220. The detector 230 is used to collect signals from the external environment or signals of interaction with the outside.
In some embodiments, the controller 250 controls the operation of the computer device and responds to user operations through various software control programs stored in the memory. The controller 250 controls the overall operation of the computer device. The user may input a user command through a graphical user interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the GUI. Alternatively, the user may input a user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through a sensor to receive the user input command.
Figure 2B is a schematic diagram of the software configuration of a computer device according to one or more embodiments of the present disclosure. As shown in Figure 2B, the system is divided into four layers, which are, from top to bottom, the applications layer (referred to as the "application layer"), the application framework layer (referred to as the "framework layer"), the Android runtime and system library layer (referred to as the "system runtime library layer"), and the kernel layer.
Figure 2C is a schematic diagram showing the icon control interface of applications included in a smart terminal (mainly a smart playback device, such as a smart TV, a digital cinema system or an audio-visual server) according to one or more embodiments of the present disclosure. As shown in Figure 2C, the application layer contains at least one application whose corresponding icon control can be displayed on the display, such as a live TV application icon control, a video-on-demand (VOD) application icon control, a media center application icon control, an application center icon control, a game application icon control, and so on. A live TV application can provide live TV from different signal sources. A VOD application can provide videos from different storage sources; unlike live TV applications, video on demand provides video display from certain storage sources. A media center application can provide playback of various multimedia content. An application center can provide storage for various applications.
In order to explain the virtual digital human driving method in more detail, it is described below in an exemplary manner in conjunction with Figure 3A. It can be understood that the steps involved in Figure 3A may include more or fewer steps in actual implementation, and the order between these steps may also be different, as long as the virtual digital human driving method provided in the embodiments of the present disclosure can be implemented.
Figure 3A is a schematic flowchart of a virtual digital human driving method provided by an embodiment of the present disclosure; Figure 3B is a schematic diagram of the principle of a virtual digital human driving method provided by an embodiment of the present disclosure. This embodiment is applicable to the case of driving a virtual digital human. The method of this embodiment can be executed by a smart terminal, which can be implemented in hardware and/or software and can be configured in a computer device.
As shown in Figure 3A, the method specifically includes the following steps:
S10. Obtain user information, where the user information includes voice information and image information.
In a specific implementation, the smart terminal includes a sound sensor and a visual sensor, where the sound sensor may be, for example, a microphone array, and the visual sensor includes a 2D visual sensor and a 3D visual sensor and may be, for example, a camera.
The smart terminal collects voice information through the sound sensor and image information through the visual sensor, where the voice information includes semantic information and acoustic information, and the image information includes scene information and user image information.
S20. Determine the user intention and the user emotion according to the user information.
After the terminal device collects the voice information through the sound sensor, it can determine the user's intention based on the semantic information included in the voice information, that is, the way in which the user expects to drive the virtual digital human to act. After collecting the image information through the visual sensor, it can determine, based on the collected image information, the facial expression of the user who issued the voice information, and determine, according to that facial expression, the emotion the user expects the virtual digital human to express.
S30. Determine the reply text of the virtual digital human according to the user intention, and determine the reply emotion of the virtual digital human according to the user intention and the user emotion.
After the user intention and the user emotion are determined according to the user information, the reply text of the virtual digital human, for example the text corresponding to the virtual digital human's reply voice, can be determined based on the user intention, and the reply emotion of the virtual digital human can be determined based on the user intention and the user emotion; that is, the emotional expression required for the virtual digital human's reply is determined according to the user intention, and the emotion that the virtual digital human's reply needs to express is determined according to the emotion expressed by the user. In a specific implementation, when the emotion expressed by the user is sadness, the emotion the virtual digital human needs to express in its reply is also sadness.
S40. Determine the body movements of the virtual digital human according to the reply text, and determine the emotional expression mode of the virtual digital human according to the reply emotion.
After the user intention and the user emotion are determined according to the user information, the reply text of the virtual digital human is determined based on the user intention, and the reply emotion of the virtual digital human is determined based on the user intention and the user emotion; then the body movements of the virtual digital human are determined according to the reply text, and the emotional expression mode of the virtual digital human is determined according to the reply emotion. That is, multi-modal human-computer interaction information perception is first established for speech recognition and image recognition; then the user intention and user emotion are determined from the acquired voice information and image information, the reply text of the virtual digital human is determined according to the user intention, and the reply emotion of the virtual digital human is determined according to the user intention and the user emotion; finally, emotional expression and body movement generation are performed for the virtual digital human, realizing the synthesis of the virtual digital human's speech, expressions, actions and so on.
In the virtual digital human driving method provided by the embodiments of the present disclosure, user information, that is, voice information and image information, is first obtained; then the user intention and the user emotion are determined according to the user information, the reply text of the virtual digital human is determined according to the user intention, and the reply emotion of the virtual digital human is determined according to the user intention and the user emotion; finally, the body movements of the virtual digital human are determined according to the reply text, and the emotional expression mode of the virtual digital human is determined according to the reply emotion, achieving a natural, anthropomorphic virtual human interaction state and improving the fidelity and naturalness of expression of the virtual digital human.
Figure 4A is a schematic flowchart of another virtual digital human driving method provided by an embodiment of the present disclosure, and Figure 4B is a schematic diagram of the principle of another virtual digital human driving method provided by an embodiment of the present disclosure. The embodiment of the present disclosure is based on the above embodiment. As shown in Figures 4A and 4B, a specific implementation of step S20 includes:
S201. Process the voice information and determine the text information and voice emotion information corresponding to the voice information.
Optionally, as shown in Figure 4C, step S201 includes:
S2012. Perform text transcription processing on the voice information and determine the text information corresponding to the voice information.
In a specific implementation, after the voice information is acquired, the voice recognition module performs text transcription processing on the acquired voice information, that is, the voice information is converted into text information corresponding to the voice information.
Specifically, the terminal device can input the voice information into an offline automatic speech recognition (ASR) engine to obtain the text information output by the ASR engine.
In the embodiments of the present disclosure, after completing the text transcription of the voice information, the terminal device can continue to wait for the user to input voice. If the start of a human voice is recognized based on voice activity detection (VAD), recording continues; if the end of the human voice is recognized based on VAD, recording stops. The terminal device can use the recorded audio as the user voice information, and then input the user voice information into the ASR engine to obtain the corresponding text information. A schematic example of this VAD-plus-ASR flow is given below.
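The VAD-plus-ASR flow described above can be pictured with the following minimal sketch. It uses a simple energy threshold as a stand-in for a real VAD, and `asr_engine.transcribe` is a hypothetical placeholder for whatever offline ASR engine is actually used; none of these names come from the disclosure.

```python
import numpy as np

FRAME_MS, SAMPLE_RATE = 30, 16000
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000

def is_speech(frame, threshold=0.01):
    """Very simple energy-based voice activity detection (illustrative only)."""
    return float(np.sqrt(np.mean(frame ** 2))) > threshold

def record_utterance(audio_stream):
    """Collect frames from the first detected speech frame until speech ends."""
    recorded, started = [], False
    for frame in audio_stream:                 # frames of FRAME_LEN float32 samples
        if is_speech(frame):
            started = True
            recorded.append(frame)
        elif started:
            break                              # end of speech detected, stop recording
    return np.concatenate(recorded) if recorded else np.zeros(0, dtype=np.float32)

# text = asr_engine.transcribe(record_utterance(mic_frames))   # placeholder ASR call
```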
S2013. Extract voiceprint features from the voice information and determine the voice emotion information corresponding to the voice information.
Voiceprint features are the sound-wave spectrum, displayed by electroacoustic instruments, that carries speech information. Voiceprint features reflect the different wavelengths, frequencies, intensities and rhythms of different voices, that is, the pitch, intensity, duration and timbre of the user's speech, and different users have different voiceprint features. By extracting voiceprint features from the voice information, the emotional information expressed by the user who issued the voice information, that is, the voice emotion information, can be obtained.
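As an illustration of the kind of voiceprint features that can carry emotion cues (pitch, intensity, duration, timbre), the following sketch uses the librosa library to compute simple pitch, energy and MFCC statistics; the feature set and dimensions are assumptions, not the features specified by the disclosure.

```python
import numpy as np
import librosa

def voiceprint_features(wav_path):
    """Sketch: pitch, energy and timbre statistics as a simple voice-emotion descriptor."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)         # pitch contour
    rms = librosa.feature.rms(y=y)[0]                     # loudness / intensity
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # timbre
    return np.concatenate([
        [np.nanmean(f0), np.nanstd(f0)],                  # pitch statistics
        [rms.mean(), rms.std()],                          # energy statistics
        mfcc.mean(axis=1), mfcc.std(axis=1)])             # timbre statistics
```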
S202. Process the image information and determine the scene information and image emotion information corresponding to the image information.
Optionally, as shown in Figure 4D, step S202 includes:
S2021. Preprocess the image information to determine the scene key point information and user key point information included in the image information.
The scene key point information refers to key points, contained in the image information in addition to the user information, of the scene in which the user is located; the user key point information refers to key points of the user's limbs or facial features in the image information. For example, if the image information collected by the terminal device shows a teacher standing in front of a blackboard, the scene key point information included in the image information is the blackboard, and the user key point information included in the image is the user's eyes, mouth, arms, legs, etc.
S2022. Determine the scene information corresponding to the image information according to the scene key point information.
After the scene key point information is obtained by preprocessing the image, the scene information of the terminal device, that is, the scene in which the terminal device is applied, can be determined.
In a specific implementation, a knowledge base of different application scenarios of the virtual digital human is constructed, and a scene recognition model is built based on algorithms such as entity recognition, entity linking and entity alignment. The image information of different application scenarios in the knowledge base is preprocessed to obtain the corresponding scene key point information, which is then input into the scene recognition model to train it until it converges, yielding the target scene recognition model. Then, methods such as graph mapping and information extraction are used to preprocess the acquired image information to obtain its scene key point information, and the preprocessed scene key point information is input into the target scene recognition model for scene recognition, ensuring the accuracy of the scene recognition result.
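A highly simplified stand-in for the target scene recognition model is sketched below: a nearest-neighbour classifier over scene key-point descriptors. The file names, label set and classifier choice are hypothetical; the disclosure's model is built from a knowledge base with entity recognition, linking and alignment, which is not reproduced here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data exported from the application-scenario knowledge base.
train_X = np.load("scene_keypoint_features.npy")   # (num_images, feat_dim)
train_y = np.load("scene_labels.npy")              # e.g. 0 = classroom, 1 = living room, ...

scene_model = KNeighborsClassifier(n_neighbors=5).fit(train_X, train_y)

def recognize_scene(keypoint_feature):
    """Predict the scene label for one preprocessed scene key-point descriptor."""
    return int(scene_model.predict(keypoint_feature.reshape(1, -1))[0])
```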
S2023. Determine the image emotion information according to the correspondence between the user key point information and preset user emotion key points.
After the user key point information is obtained by preprocessing the image, the user's emotion captured by the terminal device, that is, the emotion expressed by the user in the image information collected by the terminal device, can be determined.
S203. Determine the user intention according to the text information and the scene information.
After the text information corresponding to the voice information and the scene information corresponding to the image information are obtained, the body movements that the user expects to drive the virtual digital human to perform can be determined based on the text information, and then combined with the determined scene information to further ensure the coordination accuracy with which the terminal device drives the virtual digital human's body movements based on the text information.
S204. Determine the user emotion according to the text information, the voice emotion information and the image emotion information.
After the text information corresponding to the voice information is obtained, the emotion expressed by the user can be roughly determined from the text information; then, by fusing the voice emotion information and the image emotion information, the virtual digital human can be accurately driven to express the user's emotion, improving its fidelity.
In the virtual digital human determination method provided by the embodiments of the present disclosure, the voice information is first processed to determine the corresponding text information and voice emotion information, and the image information is processed to determine the corresponding scene information and image emotion information; then the user intention is determined based on the text information and the scene information, and the user emotion is determined based on the text information, the voice emotion information and the image emotion information. That is, the body movements that the user expects to drive the virtual digital human to perform can be determined from the text information and combined with the determined scene information to further ensure the coordination accuracy of the driven body movements; the emotion expressed by the user can be roughly determined from the text information, and then, by fusing the voice emotion information and the image emotion information, the virtual digital human is accurately driven to express the user's emotion, improving its fidelity.
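One common way to combine the three emotion cues is a late fusion of per-modality probabilities, as in the sketch below; the label set, weights and fusion rule are assumptions for illustration and are not specified by the disclosure.

```python
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry"]   # assumed label set

def fuse_emotions(text_probs, voice_probs, image_probs, weights=(0.4, 0.3, 0.3)):
    """Late-fusion sketch: weighted average of per-modality emotion probabilities."""
    fused = (weights[0] * np.asarray(text_probs)
             + weights[1] * np.asarray(voice_probs)
             + weights[2] * np.asarray(image_probs))
    return EMOTIONS[int(np.argmax(fused))]
```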
Figure 5 is a schematic flowchart of yet another virtual digital human driving method provided by an embodiment of the present disclosure. The embodiment of the present disclosure is based on the embodiment corresponding to Figure 4C. As shown in Figure 5, before step S2012, the method also includes:
S2010. Extract the speech feature vector of the voice information.
In the embodiments of the present disclosure, highly robust voiceprint recognition and voiceprint clustering technologies are constructed to realize automatic login of multiple users through the voice modality, while paralinguistic information such as gender and accent is extracted to establish basic user information. To address the difficulty of clustering speech features when the number of target classes is uncertain, and the impact of speech channel interference on classification and clustering, unsupervised clustering in a noisy density space is combined with stochastic linear discriminant analysis to achieve highly reliable voiceprint classification and clustering and to reduce the impact of channel interference on voiceprint recognition. That is, in the present disclosure, a speech recognition model is constructed that can adapt to different paralinguistic information, giving a high speech recognition accuracy.
In a specific implementation, before the text transcription processing is performed on the voice information to determine the corresponding text information, the speech feature vector in the voice information is first extracted. Specifically, the speech feature vector includes an accent feature vector, a gender feature vector, an age feature vector, etc.
S2011. Add the speech feature vector to the convolutional layer of the speech recognition model.
The speech recognition model includes an acoustic model and a language model; the acoustic model includes a convolutional neural network model with an attention mechanism, and the language model includes a deep neural network model.
The speech recognition model constructed in the present disclosure is a joint modeling of the acoustic model and the language model. The acoustic model is built using deep time-series convolution and an attention mechanism, and the speech feature vector is added as a condition in the convolutional layer of the convolutional neural network model to adapt to different speech characteristics. At the language model level, a deep-neural-network-based model structure that can be quickly intervened and configured is implemented, and the speech characteristics of different paralinguistic information are adapted through the user-specific voiceprint, improving the accuracy of speech recognition.
The specific implementation of step S2012 may include:
S20120. Perform text transcription processing on the voice information based on the speech recognition model, and determine the text information corresponding to the voice information.
After the speech recognition model is built, the voice information can be transcribed into text based on the speech recognition model, improving the accuracy of the speech recognition results.
Figure 6 is a schematic flowchart of yet another virtual digital human driving method provided by an embodiment of the present disclosure. This embodiment is based on the embodiment corresponding to Figure 4A. As shown in Figure 6, the specific implementation of step 40 includes:
S401、获取回复文本中包括的动作标识。S401. Obtain the action identifier included in the reply text.
动作标识示例性包括:抬、伸、眨、张等。Examples of action identifiers include: lifting, stretching, blinking, opening, etc.
在对虚拟数字人进行驱动的过程中,关键点驱动包括语音内容分离、内容关键点驱动、说话人关键点驱动、基于关键点的图像生成模块、基于关键点的图像拉伸模块等。因此,首先基于对语音信息的转录处的文本信息进行解析,获取文本信息中包括的动作标识以及关键点标识。In the process of driving virtual digital humans, key point driving includes speech content separation, content key point driving, speaker key point driving, key point-based image generation module, key point-based image stretching module, etc. Therefore, first, based on parsing the text information at the transcription point of the speech information, the action identifiers and key point identifiers included in the text information are obtained.
S402、根据动作标识,从场景信息对应的预设动作数据库中选择虚拟数字人的肢体动作。S402. Select the body movements of the virtual digital human from the preset action database corresponding to the scene information according to the action identification.
具体的,若动作标识为抬,则从场景信息对应的预设动作数据库中选择虚拟数字人的肢体动作为抬头、抬腿等。Specifically, if the action identifier is lifting, then the body movements of the virtual digital human are selected from the preset action database corresponding to the scene information, such as raising the head, raising the leg, etc.
其中,预设动作数据库包含动作类型定义、动作编排、动作衔接等。Among them, the preset action database includes action type definition, action arrangement, action connection, etc.
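As a simplified illustration, and not the disclosed implementation, the selection of body movements from a scene-specific preset action database according to a parsed action identifier could be sketched as follows; the database contents and names are hypothetical.

```python
# Hypothetical preset action database keyed by scene and action identifier.
PRESET_ACTION_DB = {
    "living_room": {
        "lift": ["raise_head", "raise_leg"],
        "blink": ["blink_eyes"],
        "open": ["open_mouth", "spread_arms"],
    },
}

def select_body_actions(scene, action_id):
    """Return candidate body movements for the given scene and action identifier."""
    return PRESET_ACTION_DB.get(scene, {}).get(action_id, [])

print(select_body_actions("living_room", "lift"))  # ['raise_head', 'raise_leg']
```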
S403、根据语音情感信息和图像情感信息,确定虚拟数字人的关键点的情感表达方式。S403. Determine the emotional expression of key points of the virtual digital human based on the voice emotional information and the image emotional information.
具体的,若获取的语音情感信息和图像情感信息为开心的情感信息,则确定虚拟数字人的关键点的情感表达方式示例性可以为嘴巴笑,双手鼓掌等。Specifically, if the obtained voice emotion information and image emotion information are happy emotion information, examples of emotional expressions for determining the key points of the virtual digital human can be smiling with the mouth, clapping with both hands, etc.
在具体的实施方式中,通过深度学习的方法学习虚拟人关键点与语音特征信息的映射,以及人脸关键点与语音情感信息和图像情感信息的映射。In a specific implementation, the deep learning method is used to learn the mapping of virtual human key points and voice feature information, as well as the mapping of human face key points and voice emotion information and image emotion information.
本公开实施例提供的虚拟数字人驱动方法,在该方法中通过融合情绪关键点模板方式,实现表情可控的语音驱动虚拟数字人动画生成。The virtual digital human driving method provided by the embodiments of the present disclosure realizes the generation of voice-driven virtual digital human animation with controllable expressions by integrating emotional key point templates.
图7是本公开实施例提供的又一种虚拟数字人驱动方法的流程示意图,本公开实施例是在图6对应的实施例的基础上,如图7所示,步骤S401之前还包括:Figure 7 is a schematic flow chart of another virtual digital human driving method provided by an embodiment of the present disclosure. The embodiment of the present disclosure is based on the embodiment corresponding to Figure 6. As shown in Figure 7, before step S401, it also includes:
S301、确定虚拟数字人的形象。S301. Determine the image of the virtual digital person.
在具体的实施方式中,通过从语音中提取不同的特征,分别驱动头部运动、面部活动、肢体动作,综合形成更加生动的语音驱动方式。In a specific implementation, different features are extracted from speech to drive head movements, facial movements, and body movements respectively, forming a more vivid speech driving method.
The image of the virtual digital human is driven on the basis of deep neural network methods, and a generative adversarial network is applied for high-fidelity real-time generation. The image generation of the virtual digital human is divided into action driving and image library production. The hair library, clothing library and tooth model of the virtual digital human image are produced offline, and the image can be produced in a targeted manner according to different application scenarios. The action driving module of the virtual digital human is processed on the server side, after which the topological vertex data is encapsulated and transmitted, and texture mapping, rendering output and the like are performed on the device side.

As a specific implementation, with the user key points as the core, the driving of the virtual digital human is realized by key-point driving technology based on adversarial networks, a feature-point geometric stretching method, and image transformation and generation technology based on the Encoder-Decoder method. At the same time, by fusing emotion key-point templates, the correspondence between the user key points and preset user emotion key points is established, realizing the emotional expression of the virtual digital human.

As another specific implementation, a 3D face driving technology based on deep encoding-decoding technology realizes the semantic mapping between speech features and three-dimensional vertex motion features, and a prosodic head driving technology based on a deep codec nested with a temporal network provides the ability to control head movement and facial activity separately.
本公开实施例提供一种计算机设备,包括:一个或多个处理器;存储器,用于存储一个或多个程序,当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现本公开实施例中的任一种所述的方法。Embodiments of the present disclosure provide a computer device, including: one or more processors; and a memory for storing one or more programs. When the one or more programs are executed by the one or more processors, such that The one or more processors implement the method described in any one of the embodiments of the present disclosure.
In addition, it is considered that the current efficiency of generating speaking styles for virtual digital humans is relatively low. For example, the smart device shown in Figure 8 synchronously displays the expression and mouth shape of the virtual digital human while playing the voice response information; as shown in Figure 9, the user can both hear the voice of the virtual digital human and see its expression when speaking, giving the user the experience of conversing with a person.
通常,人们说话的时候,不同的人具有不同的状态,例如,有的人说话时口型准确、表情丰富,有的人说话时口型偏小、表情严肃等。也就是说,不同的人具有不同的说话风格。如此,可以设计出不同说话风格的虚拟数字人,即不同说话风格的虚拟数字人的口型、表情不同,用户可以与不同说话风格的三维虚拟人进行对话,从而能够提升用户的体验。每设计一种新的三维虚拟人的说话风格,均需要先获取相应的训练样本,基于相应的训练样本重新训练说话风格模型,使得基于重新训练后的说话风格模型可以生成新的说话风格参数,并基于说话风格参数驱动基本说话风格,如图10所示,可以生成新的说话风格。由于重新训练说话风格模型需要花费大量的时间来采集训练样本和处理大量的数据,因此,每生成一个新的说话风格需要花费比较多的时间,使得说话风格的生成效率比较低下。Usually, when people speak, different people have different states. For example, some people have accurate mouth shapes and rich expressions when they speak, while some people have small mouth shapes and serious expressions when they speak. That is, different people have different speaking styles. In this way, virtual digital people with different speaking styles can be designed, that is, virtual digital people with different speaking styles have different mouth shapes and expressions. Users can have conversations with three-dimensional virtual people with different speaking styles, thereby improving the user experience. Every time a new three-dimensional virtual human speaking style is designed, corresponding training samples need to be obtained first, and the speaking style model is retrained based on the corresponding training samples, so that new speaking style parameters can be generated based on the retrained speaking style model. And based on the speaking style parameters, the basic speaking style is driven, as shown in Figure 10, and a new speaking style can be generated. Since retraining the speaking style model requires a lot of time to collect training samples and process a large amount of data, it takes a lot of time to generate a new speaking style, making the speaking style generation efficiency relatively low.
To solve the above problem, the present disclosure fits a target style feature attribute based on multiple style feature attributes to determine the fitting coefficient of each style feature attribute; determines a target style feature vector according to the fitting coefficients of the style feature attributes and multiple style feature vectors, where the multiple style feature vectors correspond one-to-one to the multiple style feature attributes; inputs the target style feature vector into a speaking style model and outputs target speaking style parameters, where the speaking style model is obtained by training the framework of the speaking style model based on the multiple style feature vectors; and generates the target speaking style based on the target speaking style parameters. In this way, the target style feature vector can be fitted with multiple style feature vectors. Since the speaking style model is trained based on the multiple style feature vectors, inputting the target style feature vector fitted from them into the model directly yields the corresponding new speaking style without retraining the speaking style model, which enables fast transfer of speaking styles and improves the efficiency of speaking style generation.
Figure 11 is a schematic diagram of a human-computer interaction scenario according to some embodiments. As shown in Figure 11, in a voice interaction scenario between a user and a smart home, the smart devices may include a smart refrigerator 110, a smart washing machine 120, a smart display device 130, and so on. When the user wants to control a smart device, the user first needs to issue a voice instruction. Upon receiving the voice instruction, the smart device needs to perform semantic understanding on it, determine the semantic understanding result corresponding to the voice instruction, and execute the corresponding control instruction according to the result to meet the user's needs. The smart devices in this scenario all include a display screen, which may be a touch screen or a non-touch screen. For a terminal device with a touch screen, the user can interact with the terminal device through gestures, fingers, or a touch tool (for example, a stylus). For a non-touch-screen terminal device, interaction with the terminal device can be implemented through an external device (for example, a mouse or a keyboard). The display screen can display a three-dimensional virtual human, and through the display screen the user can see the three-dimensional virtual human and its expression when speaking, realizing dialogue interaction with the three-dimensional virtual human.
本公开实施例提供的说话风格生成方法,可以基于计算机设备,或者计算机设备中的功能模块或者功能实体实现。其中,计算机设备可以为个人计算机(personal computer,PC)、服务器、手机、平板电脑、笔记本电脑、大型计算机等,本公开实施例对此不作具体限定。The speaking style generation method provided by the embodiments of the present disclosure can be implemented based on a computer device, or a functional module or functional entity in the computer device. The computer device may be a personal computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a large computer, etc., which are not specifically limited in the embodiments of the present disclosure.
To describe the speaking style generation method in more detail, an exemplary description is given below with reference to Figure 12. It can be understood that the steps involved in Figure 12 may include more or fewer steps in actual implementation, and the order of these steps may also differ, as long as the speaking style generation method provided in the embodiments of the present disclosure can be implemented.
图12为根据一些实施例的说话风格生成方法的流程示意图,如图12所示,该方法具体包括如下步骤:Figure 12 is a schematic flowchart of a speaking style generation method according to some embodiments. As shown in Figure 12, the method specifically includes the following steps:
S1301,基于多个风格特征属性拟合目标风格特征属性,确定各风格特征属性的拟合系数。S1301. Fit the target style feature attributes based on multiple style feature attributes, and determine the fitting coefficient of each style feature attribute.
Exemplarily, a facial topology data sequence is collected while the user speaks during a time period Δt. In the facial topology data sequence, each frame of facial topology data corresponds to a dynamic face topology; the face topology includes multiple vertices, and each vertex in the dynamic face topology corresponds to a vertex coordinate (x, y, z). When the user is not speaking, a preset static face topology is used, in which each vertex has vertex coordinates (x', y', z'). Thus, based on the difference between the coordinates of the same vertex in the dynamic face topology and in the static face topology, the vertex offset (Δx, Δy, Δz) of each vertex in each dynamic face topology can be determined, i.e., Δx = x − x', Δy = y − y', Δz = z − z'. Based on the vertex offsets (Δx, Δy, Δz) of each vertex in all dynamic face topologies corresponding to the facial topology data sequence, the average vertex offset of each vertex of the dynamic face topology over the sequence can be determined.
Figure 13 is a schematic diagram of dividing facial topology data into regions provided by an embodiment of the present disclosure. As shown in Figure 13, the facial topology data can be divided into multiple regions, for example into three regions S1, S2 and S3, where S1 is the entire facial region above the lower edge of the eyes, S2 is the facial region from the lower edge of the eyes to the upper edge of the upper lip, and S3 is the facial region from the upper edge of the upper lip to the chin. On the basis of the above embodiment, the mean of the average vertex offsets of all vertices of the dynamic face topology within region S1, the mean within region S2, and the mean within region S3 can be determined, and from these the style feature attribute is obtained. In summary, one style feature attribute can be obtained for one user, and thus multiple style feature attributes can be obtained from multiple users.
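The following NumPy sketch, offered only as an illustration under the assumptions above, computes one user's style feature attribute: per-vertex offsets are averaged over all frames, then averaged per region S1/S2/S3 and concatenated. The `region_ids` array assigning each vertex to a region is a hypothetical input, and the sizes are toy values.

```python
import numpy as np

def style_feature_attribute(dynamic_frames, static_vertices, region_ids, num_regions=3):
    # dynamic_frames: (frames, vertices, 3); static_vertices: (vertices, 3)
    offsets = dynamic_frames - static_vertices[None, :, :]      # (dx, dy, dz) per frame
    mean_offsets = offsets.mean(axis=0)                          # per-vertex average offset
    attr = []
    for r in range(num_regions):
        attr.append(mean_offsets[region_ids == r].mean(axis=0))  # region-wise mean offset
    return np.concatenate(attr)                                  # spliced in region order

frames = np.random.randn(300, 500, 3)      # 300 frames, 500 vertices (toy sizes)
static = np.random.randn(500, 3)
regions = np.random.randint(0, 3, size=500)
print(style_feature_attribute(frames, static, regions).shape)    # (9,)
```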
According to the multiple acquired style feature attributes, a new style feature attribute, i.e., the target style feature attribute, can be formed by fitting. For example, the target style feature attribute can be fitted based on the following formula:

target style feature attribute = a1 × (style feature attribute of user 1) + a2 × (style feature attribute of user 2) + … + an × (style feature attribute of user n)      (1)

where a1 is the fitting coefficient of the style feature attribute of user 1, a2 is the fitting coefficient of the style feature attribute of user 2, and an is the fitting coefficient of the style feature attribute of user n; n is the number of users, and a1 + a2 + … + an = 1.
Based on the above formula, an optimization method, such as gradient descent or the Gauss-Newton method, can be used to obtain the fitting coefficient of each style feature attribute.
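A rough gradient-descent sketch of this fitting step is shown below; it is not the disclosed procedure. The coefficients are renormalized after each step as a simple way of keeping their sum equal to 1, which is an assumption on my part since the disclosure only requires "an optimization method" here.

```python
import numpy as np

def fit_coefficients(user_attrs, target_attr, steps=5000):
    A = np.stack(user_attrs, axis=1)                 # (dim, n): one column per user attribute
    a = np.full(A.shape[1], 1.0 / A.shape[1])        # start from uniform coefficients
    lr = 1.0 / (np.linalg.norm(A, 2) ** 2)           # step size from the spectral norm
    for _ in range(steps):
        grad = A.T @ (A @ a - target_attr)           # gradient of 0.5 * ||A a - target||^2
        a -= lr * grad
        a /= a.sum()                                 # re-impose a1 + ... + an = 1
    return a

users = [np.random.randn(9) for _ in range(4)]
target = 0.1 * users[0] + 0.2 * users[1] + 0.3 * users[2] + 0.4 * users[3]
print(np.round(fit_coefficients(users, target), 3))  # approximately [0.1, 0.2, 0.3, 0.4]
```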
需要说明的是,本实施例仅以将面部拓扑结构数据划分为三个区域为例进行实例性说明,并不作为对面部拓扑结构数据区域划分的具体限制。It should be noted that this embodiment only takes dividing facial topological structure data into three areas as an example for illustration, and does not serve as a specific restriction on dividing facial topological structure data into areas.
S1302,根据所述各风格特征属性的拟合系数和多个风格特征向量,确定目标风格特征向量。S1302: Determine a target style feature vector based on the fitting coefficient of each style feature attribute and multiple style feature vectors.
所述多个风格特征向量与所述多个风格特征属性一一对应。The plurality of style feature vectors correspond to the plurality of style feature attributes in one-to-one correspondence.
示例性的,风格特征向量为风格的表征,可以基于分类任务模型,将训练分类任务模型得到的embedding作为风格特征向量,或者,可以直接设计one-hot特征向量为风格特征向量。例如,3个用户对应3个风格特征属性为one-hot特征向量,则3个风格特征向量可以是[1;0;0]、[0;1;0]和[0;0;1]。For example, the style feature vector is a representation of style, and the embedding obtained by training the classification task model can be used as the style feature vector based on the classification task model, or the one-hot feature vector can be directly designed as the style feature vector. For example, if 3 users have 3 style feature attributes corresponding to one-hot feature vectors, then the 3 style feature vectors can be [1;0;0], [0;1;0] and [0;0;1].
On the basis of the above embodiment, the style feature attributes of n users with different speaking styles are obtained, and correspondingly the style feature vectors of the n users can be obtained. These n style feature attributes correspond one-to-one to the n style feature vectors, and the n style feature attributes together with their corresponding style feature vectors form the basic style feature basis. By multiplying the respective fitting coefficients of the n style feature attributes by the corresponding style feature vectors, the target style feature vector can be expressed in terms of the basic style feature basis, as in the following formula:

p = a1×F1 + a2×F2 + … + an×Fn      (2)
where F1 is the style feature vector of user 1, F2 is the style feature vector of user 2, Fn is the style feature vector of user n, and p is the target style feature vector.

For example, when the style feature vectors are one-hot feature vectors, the target style feature vector p can be expressed as:

p = a1×[1;0;0] + a2×[0;1;0] + a3×[0;0;1] = [a1; a2; a3]
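As a small illustration, not taken from the original disclosure and using the three-user one-hot example above, the target style feature vector reduces to the coefficient vector itself:

```python
import numpy as np

coeffs = np.array([0.2, 0.5, 0.3])                 # fitting coefficients a1, a2, a3 (sum to 1)
style_vectors = np.eye(3)                          # one-hot style feature vectors F1, F2, F3
p = sum(a * F for a, F in zip(coeffs, style_vectors))
print(p)                                           # [0.2 0.5 0.3]
```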
S1303,将所述目标风格特征向量输入至说话风格模型中,输出目标说话风格参数。S1303. Input the target style feature vector into the speaking style model and output the target speaking style parameters.
所述说话风格模型是基于所述多个风格特征向量训练说话风格模型的框架得到的。The speaking style model is obtained by training a speaking style model framework based on the plurality of style feature vectors.
示例性的,根据风格基本特征基中的多个风格特征向量,训练说话风格模型的框架,得到训练好的说话风格模型的框架,即说话风格模型。将目标风格特征向量输入至说话风格模型中,可以理解为将多个风格特征向量和各自拟合系数的乘积输入至说话风格模型中,这与训练说话风格模型的框架时输入的训练样本相同。故而,基于说话风格模型,将目标风格特征向量作为输入,可以直接输出得到目标说话风格参数。For example, the framework of the speaking style model is trained based on multiple style feature vectors in the basic style feature base to obtain the framework of the trained speaking style model, that is, the speaking style model. Inputting the target style feature vector into the speaking style model can be understood as inputting the product of multiple style feature vectors and respective fitting coefficients into the speaking style model, which is the same as the training sample input when training the framework of the speaking style model. Therefore, based on the speaking style model, using the target style feature vector as input, the target speaking style parameters can be directly output.
目标说话风格参数可以是动态人脸拓扑结构中的各顶点与静态人脸拓扑结构中对应顶点的顶点偏移量;或者,可以是动态人脸拓扑结构的表情基的系数,或者还可以是其他参数,本公开对此不作具体限制。The target speaking style parameter can be the vertex offset between each vertex in the dynamic face topology structure and the corresponding vertex in the static face topology structure; or it can be the coefficient of the expression basis of the dynamic face topology structure, or it can be other parameters, this disclosure does not impose specific restrictions on this.
S1304,基于所述目标说话风格参数,生成目标说话风格。S1304: Generate a target speaking style based on the target speaking style parameters.
Exemplarily, the target speaking style parameters are the vertex offsets between each vertex in the dynamic face topology and the corresponding vertex in the static face topology. Thus, on the basis of the static face topology, each vertex of the static face topology is driven to move to the corresponding position according to its vertex offset, and the target speaking style is obtained.
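A minimal sketch of this last step, assuming the parameters are per-vertex offsets relative to the static topology (toy sizes, not the disclosed implementation):

```python
import numpy as np

def apply_speaking_style(static_vertices, vertex_offsets):
    # static_vertices, vertex_offsets: (num_vertices, 3); result is the driven topology.
    return static_vertices + vertex_offsets

static = np.zeros((500, 3))
offsets = 0.01 * np.random.randn(500, 3)   # toy target speaking style parameters
driven = apply_speaking_style(static, offsets)
print(driven.shape)                        # (500, 3)
```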
In the embodiments of the present disclosure, the target style feature attribute is fitted based on multiple style feature attributes to determine the fitting coefficient of each style feature attribute; the target style feature vector is determined according to the fitting coefficients of the style feature attributes and multiple style feature vectors, where the multiple style feature vectors correspond one-to-one to the multiple style feature attributes; the target style feature vector is input into the speaking style model to output the target speaking style parameters, where the speaking style model is obtained by training the framework of the speaking style model based on the multiple style feature vectors; and the target speaking style is generated based on the target speaking style parameters. In this way, the target style feature vector can be fitted with multiple style feature vectors. Since the speaking style model is trained based on the multiple style feature vectors, inputting the target style feature vector fitted from them into the model directly yields the corresponding new speaking style without retraining the speaking style model, which enables fast transfer of speaking styles and improves the efficiency of speaking style generation.
图14为根据一些实施例的说话风格生成方法的流程示意图,图14为如图12所示实施例的基础上,执行S1301之前还包括:Figure 14 is a schematic flowchart of a speaking style generation method according to some embodiments. Figure 14 is based on the embodiment shown in Figure 12. Before executing S1301, it also includes:
S1501,采集多个预设用户朗读多段语音时的多帧面部拓扑结构数据。S1501: Collect multi-frame facial topological structure data when multiple preset users read multiple speech segments.
Exemplarily, users with different speaking styles are selected as preset users, and multiple segments of speech are also selected. While each preset user reads each segment of speech aloud, multiple frames of facial topology data of that preset user are collected. For example, if the duration of speech segment 1 is t1 and the frequency of collecting facial topology data is 30 frames per second, then after preset user 1 finishes reading speech segment 1, t1*30 frames of facial topology data can be collected.
S1502,针对每个预设用户:根据所述多段语音对应的所述多帧面部拓扑结构数据各自的说话风格参数和面部拓扑结构数据的划分区域,确定各划分区域内的所述多帧面部拓扑结构数据的所述说话风格参数的平均值。S1502, for each preset user: determine the multi-frame facial topology in each divided area based on the respective speaking style parameters and divided areas of the multi-frame facial topological structure data corresponding to the multiple segments of speech. The average of the speaking style parameters of the structural data.
Exemplarily, based on the above embodiment, for preset user 1, after preset user 1 finishes reading the m segments of speech, t1*30*m frames of facial topology data can be collected. The vertex offsets (Δx, Δy, Δz) between each vertex of the dynamic face topology and the corresponding vertex of the static face topology in each frame of facial topology data can be used as the speaking style parameters of that frame. Based on the vertex offsets (Δx, Δy, Δz) of every vertex of all dynamic face topologies corresponding to the t1*30*m frames of facial topology data of preset user 1, the average vertex offset of each vertex of the dynamic face topology in the facial topology data of preset user 1 can be determined.

Based on the divided regions of the facial topology data, for each divided region of preset user 1, the mean of the average vertex offsets of all vertices of the dynamic face topology within that region can be obtained. For example, the facial topology data is divided into three regions, and the means of the average vertex offsets of all vertices of the dynamic face topology in the facial topology data within regions S1, S2 and S3 are obtained respectively.
S1503,将所述各划分区域内的所述多帧面部拓扑结构数据的所述说话风格参数的平均值按照预设顺序拼接,得到所述每个预设用户的风格特征属性。S1503: Splice the average value of the speaking style parameters of the multi-frame facial topological structure data in each divided area in a preset order to obtain the style feature attributes of each preset user.
Exemplarily, the preset order may be the top-to-bottom order shown in Figure 13, or the bottom-to-top order shown in Figure 13, which is not specifically limited in the present disclosure. If the preset order is the top-to-bottom order shown in Figure 13, based on the above embodiment, the means of the average vertex offsets of all vertices of the dynamic face topology in the facial topology data corresponding to the respective regions can be spliced in the order of regions S1, S2 and S3, thereby obtaining the style feature attribute of preset user 1.

In summary, the style feature attribute can be obtained for preset user 1; in this way, multiple style feature attributes can be obtained for multiple preset users.
图15为根据一些实施例的说话风格生成方法的流程示意图,图15为如图14所示实施例的基础上,执行S1301之前还包括:Figure 15 is a schematic flowchart of a speaking style generation method according to some embodiments. Figure 15 is based on the embodiment shown in Figure 14. Before executing S1301, it also includes:
S1601,采集目标用户朗读所述多段语音时的多帧目标面部拓扑结构数据。S1601: Collect multi-frame target facial topological structure data when the target user reads the multiple segments of speech.
所述目标用户与所述多个预设用户为不同的用户。The target user and the plurality of preset users are different users.
Exemplarily, when a target speaking style different from the speaking styles of the multiple preset users currently needs to be generated, multiple frames of target facial topology data are collected while the target user corresponding to the target speaking style reads the multiple segments of speech aloud, and the content of the multiple segments of speech read by the target user is the same as the content of the multiple segments of speech read by the multiple preset users. For example, after the target user reads m segments of speech each of duration t1, t1*30*m frames of target facial topology data can be obtained.
S1602,根据所述多段语音对应的所述多帧目标面部拓扑结构数据各自的说话风格参数和所述面部拓扑结构数据的划分区域,确定所述各划分区域内的所述多帧目标面部拓扑结构数据的所述说话风格参数的平均值。S1602: Determine the multi-frame target facial topology in each divided area based on the respective speaking style parameters of the multi-frame target facial topological structure data corresponding to the multiple segments of speech and the divided areas of the facial topological structure data. The average of the speaking style parameters of the data.
可以将每帧目标面部拓扑结构数据中的动态人脸拓扑结构的各顶点和静态人脸拓扑结构的各顶点的顶点偏移量(Δx’,Δy’,Δz’),作为每帧目标面部拓扑结构数据的说话风格参数,基于目标用户的t1*30*m帧目标面部拓扑结构数据中的所有动态人脸拓扑结构的每个顶点的顶点偏移量(Δx’,Δy’,Δz’),可以确定目标用户的目标面部拓扑结构数据中动态人脸拓扑结构的每个顶点的平均顶点偏移量 The vertex offsets (Δx', Δy', Δz') of each vertex of the dynamic face topology and each vertex of the static face topology in the target facial topology data of each frame can be used as the target facial topology of each frame. The speaking style parameters of the structural data are based on the vertex offset (Δx', Δy', Δz') of each vertex of all dynamic face topology structures in the target user's t1*30*m frame target facial topology data, The average vertex offset of each vertex of the dynamic face topology in the target facial topology data of the target user can be determined
基于上述面部拓扑结构数据的划分区域,针对目标用户的每个划分区域,可以得到划分区域内的目标面部拓扑结构数据中动态人脸拓扑结构的所有顶点的平均顶点偏移量 的平均值。例如,面部拓扑结构数据划分为三个区域,其中,区域S1内的目标面部拓扑结构数据中动态人脸拓扑结构的所有顶点的平均顶点偏移量的平均值为 区域S2内的目标面部拓扑结构数据中动态人脸拓扑结构的所有顶点的平均顶点偏移量的平均值为区域S3内的目标面部拓扑结构数据中动态人脸拓扑结构的所有顶点的平均顶点偏移量的平均值为 Based on the above divided areas of facial topology data, for each divided area of the target user, the average vertex offset of all vertices of the dynamic face topology in the target facial topological structure data within the divided area can be obtained average of. For example, the facial topology data is divided into three areas, where the average value of the average vertex offset of all vertices of the dynamic face topology in the target facial topology data in area S1 is The average value of the average vertex offset of all vertices of the dynamic face topology in the target facial topology data in area S2 is The average value of the average vertex offset of all vertices of the dynamic face topology in the target facial topology data in area S3 is
S1603,将所述各划分区域内的所述多帧目标面部拓扑结构数据的所述说话风格参数的平均值按照所述预设顺序拼接,得到所述目标风格特征属性。S1603: Splice the average value of the speaking style parameters of the multi-frame target facial topological structure data in each divided area according to the preset order to obtain the target style feature attribute.
Exemplarily, the means of the average vertex offsets of all vertices of the dynamic face topology in the target facial topology data are spliced based on the same preset order as in the above embodiment. For example, based on the top-to-bottom order shown in Figure 13, the means of the average vertex offsets of all vertices of the dynamic face topology in the target facial topology data corresponding to the respective regions can be spliced in the order of regions S1, S2 and S3, thereby obtaining the target style feature attribute of the target user.
It should be noted that S1501-S1503 shown in Figure 14 can be executed first and then S1601-S1603 shown in Figure 15, or S1601-S1603 shown in Figure 15 can be executed first and then S1501-S1503 shown in Figure 14; this is not specifically limited in the present disclosure.
图16为根据一些实施例的说话风格生成方法的流程示意图,图16为如图14和图13所示实施例的基础上,执行S1303之前还包括:Figure 16 is a schematic flowchart of a speaking style generation method according to some embodiments. Figure 16 is based on the embodiments shown in Figures 14 and 13. Before executing S1303, it also includes:
S1701,获取训练样本集。S1701, obtain the training sample set.
所述训练样本集包括输入样本集和输出样本集,输入样本包括语音特征及其对应的所述多个风格特征向量,输出样本包括所述说话风格参数。 The training sample set includes an input sample set and an output sample set. The input sample includes voice features and the plurality of corresponding style feature vectors, and the output sample includes the speaking style parameters.
When a preset user reads the speech aloud, intrinsic features of the voice information can be extracted, mainly features that express the speech content. For example, Mel-spectrum features of the speech can be extracted as the speech features; alternatively, a speech feature extraction model commonly used in the industry can be used to extract the speech features, or the speech features can be extracted based on a designed deep network model, and so on. Based on the extraction rate of the speech features, after a preset user finishes reading the multiple segments of speech, a speech feature sequence can be extracted. Since the contents of the multiple segments of speech read by the multiple preset users are exactly the same, the same speech feature sequence can be extracted for different preset users. Thus, for the same speech feature in the speech feature sequence, there are multiple style feature vectors corresponding to the multiple preset users; one speech feature and its corresponding multiple style feature vectors can be used as an input sample, and based on all the speech features in the speech feature sequence, multiple input samples, i.e., the input sample set, can be obtained.

Exemplarily, while each speech feature is extracted, the corresponding facial topology data can be collected. Based on the respective vertex coordinates of all vertices of the dynamic face topology in the facial topology data, the respective vertex offsets of all vertices of the dynamic face topology in the facial topology data can be obtained. The vertex offsets of all vertices of the dynamic face topology in one frame of facial topology data are taken as one set of speaking style parameters, and one set of speaking style parameters is one output sample. Thus, based on the multiple frames of facial topology data corresponding to the speech feature sequence, multiple output samples, i.e., the output sample set, can be obtained. The input sample set and the output sample set constitute the training sample set for training the speaking style generation model.
S1702,定义所述说话风格模型的框架。S1702. Define the framework of the speaking style model.
所述说话风格模型的框架包括线性组合单元和网络模型,所述线性组合单元用于生成所述多个风格特征向量的线性组合风格特征向量,生成多个输出样本的线性组合输出样本,所述输入样本与所述输出样本一一对应;所述网络模型用于根据所述线性组合风格特征向量,生成对应的预测输出样本。The framework of the speaking style model includes a linear combination unit and a network model. The linear combination unit is used to generate a linear combination style feature vector of the plurality of style feature vectors and generate a linear combination output sample of a plurality of output samples. The input samples correspond to the output samples one-to-one; the network model is used to generate corresponding predicted output samples according to the linear combination style feature vector.
图17为根据一些实施例的说话风格模型的框架的结构示意图,如图17所示,说话风格模型的框架包括线性组合单元310和网络模型320,线性组合单元310的输入端用于接收训练样本,线性组合单元310的输出端与网络模型320的输入端连接,网络模型320的输出端即为说话风格模型的框架300的输出端。Figure 17 is a schematic structural diagram of the framework of the speaking style model according to some embodiments. As shown in Figure 17, the framework of the speaking style model includes a linear combination unit 310 and a network model 320. The input end of the linear combination unit 310 is used to receive training samples. , the output end of the linear combination unit 310 is connected to the input end of the network model 320, and the output end of the network model 320 is the output end of the framework 300 of the speaking style model.
训练样本输入至线性组合单元310后,训练样本包括输入样本和输出样本,其中,输入样本包括语音特征及其对应的多个风格特征向量,线性组合单元310可以将多个风格特征向量进行线性组合,得到线性组合风格特征向量,还可以将多个风格特征向量各自对应的说话风格参数的进行线性组合,得到线性组合输出样本。线性组合单元310可以输出语音特征及其对应的线性组合风格特征向量,即线性组合输入样本,同时还可以输出相应的线性组合输出样本。将线性组合训练样本输入至网络模型320,线性组合训练样本包括线性组合输入样本和线性组合输出样本,基于线性组合训练样本,对网络模型320进行训练。After the training samples are input to the linear combination unit 310, the training samples include input samples and output samples, where the input samples include speech features and their corresponding multiple style feature vectors. The linear combination unit 310 can linearly combine the multiple style feature vectors. , to obtain a linear combination of style feature vectors, and the corresponding speaking style parameters of multiple style feature vectors can also be linearly combined to obtain a linear combination output sample. The linear combination unit 310 can output speech features and their corresponding linear combination style feature vectors, that is, linear combination input samples, and can also output corresponding linear combination output samples. The linear combination training samples are input to the network model 320. The linear combination training samples include linear combination input samples and linear combination output samples. Based on the linear combination training samples, the network model 320 is trained.
S1703,根据所述训练样本集和损失函数,训练所述说话风格模型的框架,得到所述说话风格模型。S1703: Train the framework of the speaking style model according to the training sample set and the loss function to obtain the speaking style model.
Based on the above embodiment, the training samples in the training sample set are input into the framework of the speaking style model, and the framework of the speaking style model can output predicted output samples. The loss function is used to determine the loss value between the predicted output samples and the output samples, and the model parameters of the framework of the speaking style model are adjusted in the direction that reduces the loss value, thereby completing one iteration of training. In this way, by training the framework of the speaking style model over multiple iterations, the trained framework of the speaking style model, i.e., the speaking style model, can be obtained.
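As a rough illustration of one such training iteration, the following PyTorch sketch uses a hypothetical stand-in network and toy dimensions; it is not the disclosed architecture, only the generic predict-loss-backpropagate loop described above.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32 + 3, 64), nn.ReLU(), nn.Linear(64, 1500))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

speech_feat = torch.randn(8, 32)              # batch of speech features
style_vec = torch.randn(8, 3).softmax(-1)     # linear-combination style vectors (weights sum to 1)
target = torch.randn(8, 1500)                 # linear-combination output samples (vertex offsets)

for _ in range(5):                            # a few training iterations
    pred = model(torch.cat([speech_feat, style_vec], dim=-1))
    loss = loss_fn(pred, target)              # loss between predicted and output samples
    optimizer.zero_grad()
    loss.backward()                           # adjust parameters in the loss-reducing direction
    optimizer.step()
```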
In this embodiment, a training sample set is obtained, where the training sample set includes an input sample set and an output sample set, an input sample includes a speech feature and its corresponding multiple style feature vectors, and an output sample includes the speaking style parameters; the framework of the speaking style model is defined, where the framework includes a linear combination unit and a network model, the linear combination unit is used to generate a linear-combination style feature vector of the multiple style feature vectors and a linear-combination output sample of multiple output samples, the input samples correspond one-to-one to the output samples, and the network model is used to generate the corresponding predicted output sample according to the linear-combination style feature vector; the framework of the speaking style model is then trained according to the training sample set and the loss function to obtain the speaking style model. Since the speaking style model is essentially obtained by training the network model on linear combinations of the multiple style feature vectors, the diversity of the training samples of the network model is increased and the generality of the speaking style model is improved.
图18为根据一些实施例的说话风格生成方法的流程示意图,图18为图16所示实施例的基础上,执行S1703时的一种实现方式的具体描述,如下:Figure 18 is a schematic flowchart of a speaking style generation method according to some embodiments. Figure 18 is a detailed description of an implementation method when performing S1703 based on the embodiment shown in Figure 16, as follows:
S501,将所述训练样本集输入至所述线性组合单元,基于所述多个风格特征向量及其各自的权重值,生成所述线性组合风格特征向量,基于所述多个风格特征向量各自的权重值和所述多个输出样本,生成所述线性组合输出样本。S501. Input the training sample set to the linear combination unit, and generate the linear combination style feature vector based on the multiple style feature vectors and their respective weight values. Based on the respective weight values of the multiple style feature vectors, The weight value and the multiple output samples are used to generate the linear combination output sample.
所述多个风格特征向量各自的权重值的和值为1。The sum of the weight values of the plurality of style feature vectors is 1.
Exemplarily, after the training samples are input to the linear combination unit, weight values can be assigned to the multiple style feature vectors respectively based on the linear combination unit, and the sum of the weight values of the multiple style feature vectors is 1. By summing the products of each style feature vector and its corresponding weight value, the linear-combination style feature vector can be obtained. Each style feature vector corresponds to one output sample, and by summing the products of the respective weight values of the multiple style feature vectors and the corresponding output samples, the linear-combination output sample can be obtained. In this way, based on different weight values, different linear-combination style feature vectors and different linear-combination output samples can be obtained; based on the multiple speech features and their respective corresponding linear-combination style feature vectors, the linear-combination input sample set can be obtained, and based on the output samples corresponding to the multiple speech features, the linear-combination output sample set can be obtained.
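A minimal sketch of this linear combination step is given below, assuming NumPy and toy sizes; the random-weight draw is one simple way to produce weights summing to 1 and is an assumption, not the disclosed scheme.

```python
import numpy as np

def linear_combine(style_vectors, output_samples, rng=np.random.default_rng()):
    n = len(style_vectors)
    w = rng.random(n)
    w /= w.sum()                                        # weight values sum to 1
    combo_style = sum(wi * f for wi, f in zip(w, style_vectors))
    combo_output = sum(wi * y for wi, y in zip(w, output_samples))
    return combo_style, combo_output

styles = [np.eye(3)[i] for i in range(3)]               # one-hot style feature vectors
outputs = [np.random.randn(1500) for _ in range(3)]     # per-style speaking style parameters
cs, co = linear_combine(styles, outputs)
print(cs, co.shape)                                     # e.g. [0.2 0.5 0.3] (1500,)
```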
S502,根据所述损失函数和线性组合训练样本集,训练所述网络模型,得到所述说话风格模型。S502: Train the network model according to the loss function and the linear combination training sample set to obtain the speaking style model.
所述线性组合训练样本集包括线性组合输入样本集和线性组合输出样本集,线性组合输入样本包括所述语音特征及其对应的所述线性组合风格特征向量。The linear combination training sample set includes a linear combination input sample set and a linear combination output sample set. The linear combination input sample includes the speech feature and its corresponding linear combination style feature vector.
示例性的,线性组合训练样本集包括线性组合输入样本集和线性组合输出样本集,将线性组合训练样本输入至网络模型,基于网络模型和线性组合输入样本,可以得到预测输出样本,基于损失函数的损失值减小的方向,调整网络模型的模型参数,自此完成一次网络模型的迭代训练。如此,基于网络模型的多次迭代训练,可以得到训练好的训练说话风格模型的框架,即说话风格模型。Exemplarily, the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set. The linear combination training samples are input to the network model. Based on the network model and the linear combination input sample, a predicted output sample can be obtained based on the loss function. In the direction in which the loss value decreases, the model parameters of the network model are adjusted, and an iterative training of the network model is completed. In this way, based on multiple iterative trainings of the network model, a well-trained framework for training the speaking style model, that is, the speaking style model, can be obtained.
In this embodiment, the training sample set is input to the linear combination unit, the linear-combination style feature vector is generated based on the multiple style feature vectors and their respective weight values, and the linear-combination output sample is generated based on the respective weight values of the multiple style feature vectors and the multiple output samples, where the sum of the weight values of the multiple style feature vectors is 1; the network model is trained according to the loss function and the linear-combination training sample set to obtain the speaking style model, where the linear-combination training sample set includes the linear-combination input sample set and the linear-combination output sample set, and a linear-combination input sample includes the speech feature and its corresponding linear-combination style feature vector. The linearly combined training samples can thus be used as training samples of the network model, which increases the number and diversity of the training samples of the network model and improves the generality and accuracy of the speaking style model.
In some embodiments of the present disclosure, Figure 19 is a schematic structural diagram of the framework of another speaking style generation model provided by an embodiment of the present disclosure. As shown in Figure 19, on the basis of the embodiment shown in Figure 17, the framework of the speaking style model further includes a scaling unit 330. The input end of the scaling unit 330 is used to receive the training samples, and the output end of the scaling unit 330 is connected to the input end of the linear combination unit 310. The scaling unit 330 is used to scale the multiple style feature vectors and the multiple output samples based on a randomly generated scaling factor to obtain multiple scaled style feature vectors and multiple scaled output samples, and to output scaled training samples, where the scaled training samples include the multiple scaled style feature vectors and their respective corresponding scaled output samples. The scaling factor can be 0.5-2, accurate to one decimal place.
缩放训练样本输入至线性组合单元310,基于线性组合单元310可以将多个缩放风格特征向量进行线性组合,得到线性组合风格特征向量,还可以将多个缩放风格特征向量各自对应的缩放输出样本的进行线性组合,得到线性组合输出样本。线性组合单元310可以输出语音特征及其对应的线性组合风格特征向量,即线性组合输入样本,同时还可以输出相应的线性组合输出样本。将线性组合训练样本输入至网络模型320,线性组合训练样本包括线性组合输入样本和线性组合输出样本,基于线性组合训练样本,对网络模型320进行训练。The scaled training samples are input to the linear combination unit 310. Based on the linear combination unit 310, multiple scaled style feature vectors can be linearly combined to obtain a linear combination style feature vector. The scaled output samples corresponding to the multiple scaled style feature vectors can also be Perform linear combination to obtain linear combination output samples. The linear combination unit 310 can output speech features and their corresponding linear combination style feature vectors, that is, linear combination input samples, and can also output corresponding linear combination output samples. The linear combination training samples are input to the network model 320. The linear combination training samples include linear combination input samples and linear combination output samples. Based on the linear combination training samples, the network model 320 is trained.
图20为根据一些实施例的说话风格生成方法的流程示意图,图20为图16所示实施例的基础上,执行S1703时的另一种可能的实现方式的具体描述,如下:Figure 20 is a schematic flowchart of a speaking style generation method according to some embodiments. Figure 20 is a detailed description of another possible implementation when performing S1703 based on the embodiment shown in Figure 16, as follows:
S5011,将所述训练样本集输入至所述缩放单元,基于缩放因子和所述多个风格特征向量,生成多个缩放风格特征向量,基于所述缩放因子和所述多个输出样本,生成多个缩放输出样本。S5011. Input the training sample set to the scaling unit, generate multiple scaling style feature vectors based on the scaling factor and the multiple style feature vectors, and generate multiple scaling style feature vectors based on the scaling factor and the multiple output samples. scaled output samples.
示例性的,训练样本输入至缩放单元后,基于缩放单元,能够以随机缩放因子对多个风格特征向量分别进行缩放处理,可以得到多个缩放风格特征向量。每个风格特征向量对应一个输出样本,基于多个风格特征向量各自的缩放因子缩放对应的输出样本,可以得到多个缩放输出样本。如此,基于多个语音特征及其各自对应的多个缩放风格特征向量,可以得到缩放输入样本集,基于多个语音特征各自对应的缩放输出样本,可以得到缩放输出样本集。For example, after the training samples are input to the scaling unit, based on the scaling unit, multiple style feature vectors can be scaled separately with random scaling factors, and multiple scaled style feature vectors can be obtained. Each style feature vector corresponds to an output sample, and multiple scaled output samples can be obtained by scaling the corresponding output samples based on the respective scaling factors of multiple style feature vectors. In this way, a scaled input sample set can be obtained based on multiple voice features and their corresponding multiple scaled style feature vectors, and a scaled output sample set can be obtained based on the scaled output samples corresponding to the multiple voice features.
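The scaling step described above could look roughly like the following sketch, which assumes NumPy, toy sizes, and a per-sample random factor in [0.5, 2] rounded to one decimal place; it is an illustrative assumption rather than the disclosed implementation.

```python
import numpy as np

def scale_samples(style_vectors, output_samples, rng=np.random.default_rng()):
    scaled_styles, scaled_outputs = [], []
    for f, y in zip(style_vectors, output_samples):
        s = round(float(rng.uniform(0.5, 2.0)), 1)   # random scaling factor, one decimal place
        scaled_styles.append(s * f)                  # scaled style feature vector
        scaled_outputs.append(s * y)                 # scaled output sample
    return scaled_styles, scaled_outputs

styles = [np.eye(3)[i] for i in range(3)]
outputs = [np.random.randn(1500) for _ in range(3)]
ss, so = scale_samples(styles, outputs)
print(ss[0], so[0].shape)
```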
S5012,将所述多个缩放风格特征向量和所述多个缩放输出样本输入至所述线性组合单元,基于所述多个缩放风格特征向量及其各自的权重值,生成所述线性组合风格特征向量,基于所述多个缩放风格特征向量各自的权重值和所述多个缩放输出样本,生成所述线性组合输出样本。S5012, input the multiple scaling style feature vectors and the multiple scaling output samples to the linear combination unit, and generate the linear combination style feature based on the multiple scaling style feature vectors and their respective weight values. vector, generating the linear combination output sample based on respective weight values of the plurality of scaled style feature vectors and the plurality of scaled output samples.
所述多个缩放风格特征向量各自的权重值的和值为1。The sum of the weight values of the multiple scaling style feature vectors is 1.
Exemplarily, the scaled training sample set includes a scaled input sample set and a scaled output sample set. The scaled training sample set is input to the linear combination unit, and based on the linear combination unit, weight values can be assigned to the multiple scaled style feature vectors respectively, with the sum of the weight values of the multiple scaled style feature vectors being 1. By summing the products of each scaled style feature vector and its corresponding weight value, the linear-combination style feature vector can be obtained. Each scaled style feature vector corresponds to one scaled output sample, and by summing the products of the respective weight values of the multiple scaled style feature vectors and the corresponding scaled output samples, the linear-combination output sample can be obtained. In this way, based on different weight values, different linear-combination style feature vectors and different linear-combination output samples can be obtained; based on the multiple speech features and their respective corresponding linear-combination style feature vectors, the linear-combination input sample set can be obtained, and based on the scaled output samples corresponding to the multiple speech features, the linear-combination output sample set can be obtained.
S502,根据所述损失函数和线性组合训练样本集,训练所述网络模型,得到所述说话风格模型。S502: Train the network model according to the loss function and the linear combination training sample set to obtain the speaking style model.
所述线性组合训练样本集包括线性组合输入样本集和线性组合输出样本集,线性组合输入样本包括所述语音特征及其对应的所述线性组合风格特征向量。The linear combination training sample set includes a linear combination input sample set and a linear combination output sample set. The linear combination input sample includes the speech feature and its corresponding linear combination style feature vector.
示例性的,线性组合训练样本集包括线性组合输入样本集和线性组合输出样本集,将线性组合训练样本输入至网络模型,基于网络模型和线性组合输入样本,可以得到预测输出样本,基于损失函数的损失值减小的方向,调整网络模型的模型参数,自此完成一次网络模型的迭代训练。如此,基于网络模型的多次迭代训练,可以得到训练好的训练说话风格模型的框架,即说话风格模型。Exemplarily, the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set. The linear combination training samples are input to the network model. Based on the network model and the linear combination input sample, a predicted output sample can be obtained based on the loss function. In the direction in which the loss value decreases, the model parameters of the network model are adjusted, and an iterative training of the network model is completed. In this way, based on multiple iterative trainings of the network model, a well-trained framework for training the speaking style model, that is, the speaking style model, can be obtained.
In this embodiment, the framework of the speaking style model further includes a scaling unit; the training sample set is input to the scaling unit, multiple scaled style feature vectors are generated based on the scaling factor and the multiple style feature vectors, and multiple scaled output samples are generated based on the scaling factor and the multiple output samples; the multiple scaled style feature vectors and the multiple scaled output samples are input to the linear combination unit, the linear-combination style feature vector is generated based on the multiple scaled style feature vectors and their respective weight values, and the linear-combination output sample is generated based on the respective weight values of the multiple scaled style feature vectors and the multiple scaled output samples, where the sum of the weight values of the multiple scaled style feature vectors is 1; the network model is trained according to the loss function and the linear-combination training sample set to obtain the speaking style model, where the linear-combination training sample set includes the linear-combination input sample set and the linear-combination output sample set, and a linear-combination input sample includes the speech feature and its corresponding linear-combination style feature vector. In this way, the scaled multiple style feature vectors are used as training samples of the network model, which increases the number and diversity of the training samples of the network model, thereby improving the generality and accuracy of the speaking style model.
In some embodiments of the present disclosure, Figure 21 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments, and Figure 22 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments. Figure 21 builds on the embodiment shown in Figure 17, and Figure 22 builds on the embodiment shown in Figure 19. The network model 320 includes a first-level network model 321, a second-level network model 322, and a superposition unit 323. The output ends of the first-level network model 321 and the second-level network model 322 are both connected to the input end of the superposition unit 323, and the output end of the superposition unit 323 is used to output the predicted output sample. The loss function includes a first loss function and a second loss function.
The linear combination training samples are input to the first-level network model 321 and the second-level network model 322 respectively. The first-level network model 321 outputs a first-level predicted output sample, and the second-level network model 322 outputs a second-level predicted output sample. Both are input to the superposition unit 323, which superimposes the first-level predicted output sample and the second-level predicted output sample to obtain the predicted output sample. The first-level network model 321 may include a convolutional network and a fully connected network, whose role is to extract the single-frame correspondence between speech and facial topological structure data. The second-level network model 322 may be a sequence-to-sequence (seq2seq) network model, for example, a long short-term memory (LSTM) network model, a gated recurrent unit (GRU) network model, or a Transformer network model, whose role is to enhance the continuity between speech features and facial expressions and the subtlety of the speaking style.
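The sketch below shows one plausible arrangement of the two branches and the superposition unit. The layer sizes, the choice of an LSTM for the second-level model, and the class name TwoLevelStyleNet are assumptions consistent with the examples mentioned above, not the implementation of the disclosure.

```python
import torch
from torch import nn

class TwoLevelStyleNet(nn.Module):
    def __init__(self, in_dim=24, out_dim=4, hidden=64):
        super().__init__()
        # First-level model: a convolution followed by a 1x1 convolution (acting as a
        # per-frame fully connected layer), capturing the per-frame correspondence
        # between speech features and facial topological structure data.
        self.level1 = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, out_dim, kernel_size=1),
        )
        # Second-level model: a seq2seq-style recurrent network (an LSTM here) that
        # smooths the sequence so expressions stay continuous across frames.
        self.level2 = nn.LSTM(in_dim, hidden, batch_first=True)
        self.level2_head = nn.Linear(hidden, out_dim)

    def forward(self, x):                                      # x: (batch, frames, in_dim)
        y1 = self.level1(x.transpose(1, 2)).transpose(1, 2)    # (batch, frames, out_dim)
        h, _ = self.level2(x)
        y2 = self.level2_head(h)                               # (batch, frames, out_dim)
        return y1 + y2, y1, y2                                 # superposition unit: element-wise addition

net = TwoLevelStyleNet()
pred, pred_l1, pred_l2 = net(torch.randn(2, 10, 24))
```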
Exemplarily, the loss function is L = b1*L1 + b2*L2, where L1 is the first loss function, used to determine the loss value between the first-level predicted output sample and the linear combination output sample; L2 is the second loss function, used to determine the loss value between the second-level predicted output sample and the linear combination output sample; b1 is the weight of the first loss function, b2 is the weight of the second loss function, and both b1 and b2 are adjustable. By setting b2 close to 0, the first-level network model 321 can be trained; by setting b1 close to 0, the second-level network model 322 can be trained. In this way, the first-level network model and the second-level network model can be trained separately in stages, which improves the convergence speed of network model training and saves training time, thereby improving the efficiency of speaking style generation.
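A weighted loss of this form can be written directly. The sketch below assumes mean-squared error for both terms and reuses the hypothetical two-branch network from the previous example; the default values of b1 and b2 are placeholders.

```python
import torch

mse = torch.nn.MSELoss()

def combined_loss(pred_l1, pred_l2, target, b1=1.0, b2=0.0):
    # L = b1 * L1 + b2 * L2, where L1 compares the first-level prediction and
    # L2 the second-level prediction against the linear combination output sample.
    l1 = mse(pred_l1, target)
    l2 = mse(pred_l2, target)
    return b1 * l1 + b2 * l2

# Setting b2 close to 0 trains mainly the first-level model;
# setting b1 close to 0 trains mainly the second-level model.
loss = combined_loss(pred_l1, pred_l2, torch.randn(2, 10, 4), b1=1.0, b2=1e-3)
```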
Figure 23 is a schematic flowchart of a speaking style generation method according to some embodiments. Based on the embodiment shown in Figure 18 or Figure 20, Figure 23 describes one implementation of S502 in detail, as follows:
S5021: Train the first-level network model according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model.
The intermediate speaking style model includes the second-level network model and the trained first-level network model.
Exemplarily, based on the above embodiment, in the first stage the weight b2 of the second loss function is set close to 0, so the loss function of the current network model can be understood as the first loss function. The linear combination training samples are input to the first-level network model and the second-level network model respectively. A first loss value is obtained based on the predicted output sample output by the superposition unit, the first loss function, and the corresponding linear combination output sample, and the model parameters of the first-level network model are adjusted in the direction that reduces the first loss value until the first loss value converges, yielding the trained first-level network model. The framework of the speaking style model trained in the first stage is the intermediate speaking style model.
S5022: Fix the model parameters of the trained first-level network model.
Exemplarily, after the first-level network model is trained, the second stage begins; the model parameters of the trained first-level network model must first be fixed.
S5023: Train the second-level network model in the intermediate speaking style model according to the linear combination training sample set and the second loss function to obtain the speaking style model.
The speaking style model includes the trained first-level network and the trained second-level network.
Next, the weight b1 of the first loss function is set close to 0, so the loss function of the current network model can be understood as the second loss function. The linear combination training samples are input to the second-level network model and the trained first-level network model. A second loss value is obtained based on the predicted output sample output by the superposition unit, the second loss function, and the corresponding linear combination output sample, and the model parameters of the second-level network model are adjusted in the direction that reduces the second loss value until the second loss value converges, yielding the trained second-level network model. The framework of the speaking style model trained in the second stage is the speaking style model.
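One way to realize this two-stage schedule is sketched below, reusing the hypothetical TwoLevelStyleNet and combined_loss from the earlier examples; the epoch count, learning rate, and randomly generated stand-in data are assumptions for illustration only.

```python
import torch

net = TwoLevelStyleNet()

# Stand-in data: 16 mini-batches of 2 sequences, 10 frames each, matching the earlier shapes.
data = [(torch.randn(2, 10, 24), torch.randn(2, 10, 4)) for _ in range(16)]

def run_stage(net, batches, b1, b2, epochs=10, lr=1e-3):
    # Only parameters that are not frozen are handed to the optimizer.
    opt = torch.optim.Adam([p for p in net.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        for x, target in batches:
            _, pred_l1, pred_l2 = net(x)
            loss = combined_loss(pred_l1, pred_l2, target, b1=b1, b2=b2)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Stage 1: b2 close to 0, so the first loss function dominates and the first-level model is trained.
run_stage(net, data, b1=1.0, b2=1e-3)

# Fix the model parameters of the trained first-level network model.
for p in net.level1.parameters():
    p.requires_grad = False

# Stage 2: b1 close to 0, so the second loss function dominates and the second-level model is trained.
run_stage(net, data, b1=1e-3, b2=1.0)
```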
In this embodiment, the network model includes a first-level network model, a second-level network model, and a superposition unit. The output ends of the first-level network model and the second-level network model are both connected to the input end of the superposition unit, and the output end of the superposition unit is used to output the predicted output sample. The loss function includes a first loss function and a second loss function. The first-level network model is trained according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model, which includes the second-level network model and the trained first-level network model. The model parameters of the trained first-level network model are then fixed, and the second-level network model in the intermediate speaking style model is trained according to the linear combination training sample set and the second loss function to obtain the speaking style model, which includes the trained first-level network and the trained second-level network. In this way, the network model can be trained in stages, which improves its convergence speed, that is, shortens its training time, thereby improving the efficiency of speaking style generation.
Figure 24 is a schematic structural diagram of a computer device provided by an embodiment of the present disclosure. As shown in Figure 24, the computer device includes a processor 910 and a memory 920. The number of processors 910 in the computer device may be one or more; one processor 910 is taken as an example in Figure 24. The processor 910 and the memory 920 in the computer device may be connected by a bus or in other ways; connection by a bus is taken as an example in Figure 24.
As a computer-readable non-volatile storage medium, the memory 920 may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the semantic understanding model training method in the embodiments of the present disclosure, or the program instructions/modules corresponding to the semantic understanding method in the embodiments of the present disclosure. By running the software programs, instructions, and modules stored in the memory 920, the processor 910 executes the various functional applications and data processing of the computer device, that is, implements the semantic understanding model training method or the short video recall method provided by the embodiments of the present disclosure.
The memory 920 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and application programs required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory 920 may include high-speed random access memory, and may further include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 920 may further include memory located remotely relative to the processor 910, and such remote memory may be connected to the computer device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Claims (16)

  1. A virtual digital human driving method, comprising:
    obtaining user information, wherein the user information comprises voice information and image information;
    determining a user intention and a user emotion according to the user information;
    determining a reply text of the virtual digital human according to the user intention, and determining a reply emotion of the virtual digital human according to the user intention and the user emotion; and
    determining body movements of the virtual digital human according to the reply text, and determining an emotional expression of the virtual digital human according to the reply emotion.
  2. The method according to claim 1, wherein determining the user intention and the user emotion according to the user information comprises:
    processing the voice information to determine text information and voice emotion information corresponding to the voice information;
    processing the image information to determine scene information and image emotion information corresponding to the image information;
    determining the user intention according to the text information and the scene information; and
    determining the user emotion according to the text information, the voice emotion information, and the image emotion information.
  3. The method according to claim 2, wherein processing the voice information to determine the text information and the voice emotion information corresponding to the voice information comprises:
    performing text transcription processing on the voice information to determine the text information corresponding to the voice information; and
    performing voiceprint feature extraction on the voice information to determine the voice emotion information corresponding to the voice information.
  4. The method according to claim 3, before performing the text transcription processing on the voice information to determine the text information corresponding to the voice information, further comprising:
    extracting a speech feature vector of the voice information; and
    adding the speech feature vector to a convolutional layer of a speech recognition model, wherein the speech recognition model comprises an acoustic model and a language model, the acoustic model comprises a convolutional neural network model with an attention mechanism, and the language model comprises a deep neural network model;
    wherein performing the text transcription processing on the voice information to determine the text information corresponding to the voice information comprises:
    performing text transcription processing on the voice information based on the speech recognition model to determine the text information corresponding to the voice information.
  5. The method according to claim 2, wherein processing the image information to determine the scene information and the image emotion information corresponding to the image information comprises:
    determining scene key point information and user key point information included in the image information;
    determining the scene information corresponding to the image information according to the scene key point information; and
    determining the image emotion information according to a correspondence between the user key point information and preset user emotion key points.
  6. The method according to claim 2, wherein determining the body movements of the virtual digital human according to the reply text, and determining the emotional expression of the virtual digital human according to the reply text and the reply emotion, comprises:
    obtaining an action identifier included in the reply text;
    selecting the body movements of the virtual digital human from a preset action database corresponding to the scene information according to the action identifier; and
    determining the emotional expression of key points of the virtual digital human according to the voice emotion information and the image emotion information.
  7. The method according to claim 6, before determining the body movements of the virtual digital human according to the reply text and determining the emotional expression of the virtual digital human according to the reply emotion, further comprising:
    determining an image of the virtual digital human.
  8. The method according to any one of claims 1-7, further comprising:
    fitting a target style feature attribute based on multiple style feature attributes, and determining a fitting coefficient of each style feature attribute, wherein the style feature attributes are determined according to face images of users reading speech aloud;
    determining a target style feature vector according to the fitting coefficients of the style feature attributes and multiple style feature vectors, wherein the multiple style feature vectors correspond to the multiple style feature attributes one to one;
    inputting the target style feature vector into a speaking style model and outputting target speaking style parameters, wherein the speaking style model is obtained by training a framework of the speaking style model based on the multiple style feature vectors; and
    generating a target speaking style for the virtual digital human based on the target speaking style parameters.
  9. The method according to claim 8, before fitting the target style feature attribute based on the multiple style feature attributes and determining the fitting coefficient of each style feature attribute, further comprising:
    collecting multiple frames of facial topological structure data while multiple preset users read multiple segments of speech aloud;
    for each preset user, determining an average value of speaking style parameters of the multiple frames of facial topological structure data within each divided area according to the respective speaking style parameters of the multiple frames of facial topological structure data corresponding to the multiple segments of speech and divided areas of the facial topological structure data; and
    splicing the average values of the speaking style parameters of the multiple frames of facial topological structure data in the divided areas in a preset order to obtain the style feature attribute of each preset user.
  10. The method according to claim 9, further comprising:
    collecting multiple frames of target facial topological structure data while a target user reads the multiple segments of speech aloud, wherein the target user is different from the multiple preset users;
    determining the average value of the speaking style parameters of the multiple frames of target facial topological structure data within each divided area according to the respective speaking style parameters of the multiple frames of target facial topological structure data corresponding to the multiple segments of speech and the divided areas of the facial topological structure data; and
    splicing the average values of the speaking style parameters of the multiple frames of target facial topological structure data in the divided areas in the preset order to obtain the target style feature attribute.
  11. The method according to claim 9, before inputting the target style feature vector into the speaking style model and outputting the target speaking style parameters, further comprising:
    obtaining a training sample set, wherein the training sample set comprises an input sample set and an output sample set, each input sample comprises a speech feature and its corresponding multiple style feature vectors, and each output sample comprises the speaking style parameters;
    defining a framework of the speaking style model, wherein the framework of the speaking style model comprises a linear combination unit and a network model, the linear combination unit is configured to generate a linear combination style feature vector of the multiple style feature vectors and to generate a linear combination output sample of multiple output samples, the input samples correspond to the output samples one to one, and the network model is configured to generate a corresponding predicted output sample according to the linear combination style feature vector; and
    training the framework of the speaking style model according to the training sample set and a loss function to obtain the speaking style model.
  12. The method according to claim 11, wherein training the framework of the speaking style model according to the training sample set and the loss function to obtain the speaking style model comprises:
    inputting the training sample set to the linear combination unit, generating the linear combination style feature vector based on the multiple style feature vectors and their respective weight values, and generating the linear combination output sample based on the respective weight values of the multiple style feature vectors and the multiple output samples, wherein a sum of the respective weight values of the multiple style feature vectors is 1; and
    training the network model according to the loss function and a linear combination training sample set to obtain the speaking style model, wherein the linear combination training sample set comprises a linear combination input sample set and a linear combination output sample set, and each linear combination input sample comprises the speech feature and its corresponding linear combination style feature vector.
  13. The method according to claim 11, wherein the framework of the speaking style model further comprises a scaling unit; and
    training the framework of the speaking style model according to the training sample set and the loss function to obtain the speaking style model comprises:
    inputting the training sample set to the scaling unit, generating multiple scaled style feature vectors based on a scaling factor and the multiple style feature vectors, and generating multiple scaled output samples based on the scaling factor and the multiple output samples;
    inputting the multiple scaled style feature vectors and the multiple scaled output samples to the linear combination unit, generating the linear combination style feature vector based on the multiple scaled style feature vectors and their respective weight values, and generating the linear combination output sample based on the respective weight values of the multiple scaled style feature vectors and the multiple scaled output samples, wherein a sum of the respective weight values of the multiple scaled style feature vectors is 1; and
    training the network model according to the loss function and a linear combination training sample set to obtain the speaking style model, wherein the linear combination training sample set comprises a linear combination input sample set and a linear combination output sample set, and each linear combination input sample comprises the speech feature and its corresponding linear combination style feature vector.
  14. The method according to claim 12 or 13, wherein the network model comprises a first-level network model, a second-level network model, and a superposition unit, output ends of the first-level network model and the second-level network model are both connected to an input end of the superposition unit, an output end of the superposition unit is used to output the predicted output sample, and the loss function comprises a first loss function and a second loss function; and
    training the network model according to the loss function and the linear combination training sample set to obtain the speaking style model comprises:
    training the first-level network model according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model, wherein the intermediate speaking style model comprises the second-level network model and the trained first-level network model;
    fixing model parameters of the trained first-level network model; and
    training the second-level network model in the intermediate speaking style model according to the linear combination training sample set and the second loss function to obtain the speaking style model, wherein the speaking style model comprises the trained first-level network and the trained second-level network.
  15. A computer device, comprising:
    one or more processors; and
    a memory, configured to store one or more programs,
    wherein when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-14.
  16. A computer-readable non-volatile storage medium, having a computer program stored thereon, wherein when the program is executed by a processor, the method according to any one of claims 1-14 is implemented.
PCT/CN2023/079026 2022-06-22 2023-03-01 Virtual digital human driving method, apparatus, device, and medium WO2023246163A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210714001.7A CN115270922A (en) 2022-06-22 2022-06-22 Speaking style generation method and device, electronic equipment and storage medium
CN202210714001.7 2022-06-22
CN202210751784.6 2022-06-28
CN202210751784.6A CN117370605A (en) 2022-06-28 2022-06-28 Virtual digital person driving method, device, equipment and medium

Publications (2)

Publication Number Publication Date
WO2023246163A1 WO2023246163A1 (en) 2023-12-28
WO2023246163A9 true WO2023246163A9 (en) 2024-02-29

Family

ID=89379111

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/079026 WO2023246163A1 (en) 2022-06-22 2023-03-01 Virtual digital human driving method, apparatus, device, and medium

Country Status (1)

Country Link
WO (1) WO2023246163A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690416B (en) * 2024-02-02 2024-04-12 江西科技学院 Artificial intelligence interaction method and artificial intelligence interaction system
CN117828320B (en) * 2024-03-05 2024-05-07 元创者(厦门)数字科技有限公司 Virtual digital person construction method and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108363706B (en) * 2017-01-25 2023-07-18 北京搜狗科技发展有限公司 Method and device for man-machine dialogue interaction
CN109271018A (en) * 2018-08-21 2019-01-25 北京光年无限科技有限公司 Exchange method and system based on visual human's behavioral standard
KR20190118539A (en) * 2019-09-30 2019-10-18 엘지전자 주식회사 Artificial intelligence apparatus and method for recognizing speech in consideration of utterance style
CN112396693A (en) * 2020-11-25 2021-02-23 上海商汤智能科技有限公司 Face information processing method and device, electronic equipment and storage medium
CN114357135A (en) * 2021-12-31 2022-04-15 科大讯飞股份有限公司 Interaction method, interaction device, electronic equipment and storage medium
CN115270922A (en) * 2022-06-22 2022-11-01 海信视像科技股份有限公司 Speaking style generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023246163A1 (en) 2023-12-28

Similar Documents

Publication Publication Date Title
US20230316643A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
WO2021043053A1 (en) Animation image driving method based on artificial intelligence, and related device
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
TWI778477B (en) Interaction methods, apparatuses thereof, electronic devices and computer readable storage media
WO2022116977A1 (en) Action driving method and apparatus for target object, and device, storage medium, and computer program product
WO2023246163A9 (en) Virtual digital human driving method, apparatus, device, and medium
Wu et al. Survey on audiovisual emotion recognition: databases, features, and data fusion strategies
US8725507B2 (en) Systems and methods for synthesis of motion for animation of virtual heads/characters via voice processing in portable devices
Hong et al. Real-time speech-driven face animation with expressions using neural networks
KR101604593B1 (en) Method for modifying a representation based upon a user instruction
CN113454708A (en) Linguistic style matching agent
CN110286756A (en) Method for processing video frequency, device, system, terminal device and storage medium
KR101306221B1 (en) Method and apparatus for providing moving picture using 3d user avatar
JP2018014094A (en) Virtual robot interaction method, system, and robot
CN110874137B (en) Interaction method and device
JP7227395B2 (en) Interactive object driving method, apparatus, device, and storage medium
WO2023284435A1 (en) Method and apparatus for generating animation
CN110148406B (en) Data processing method and device for data processing
WO2021232876A1 (en) Method and apparatus for driving virtual human in real time, and electronic device and medium
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
JP2023552854A (en) Human-computer interaction methods, devices, systems, electronic devices, computer-readable media and programs
CN112652041A (en) Virtual image generation method and device, storage medium and electronic equipment
WO2021232877A1 (en) Method and apparatus for driving virtual human in real time, and electronic device, and medium
CN117809679A (en) Server, display equipment and digital human interaction method
CN117370605A (en) Virtual digital person driving method, device, equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23825821

Country of ref document: EP

Kind code of ref document: A1