WO2023246163A1 - Method for controlling a virtual digital human, apparatus, device and medium - Google Patents

Method for controlling a virtual digital human, apparatus, device and medium Download PDF

Info

Publication number
WO2023246163A1
Authority
WO
WIPO (PCT)
Prior art keywords
style
information
model
user
linear combination
Prior art date
Application number
PCT/CN2023/079026
Other languages
English (en)
Chinese (zh)
Other versions
WO2023246163A9 (fr)
Inventor
杨善松
成刚
刘韶
李绪送
付爱国
Original Assignee
海信视像科技股份有限公司 (Hisense Visual Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202210714001.7A external-priority patent/CN115270922A/zh
Priority claimed from CN202210751784.6A external-priority patent/CN117370605A/zh
Application filed by 海信视像科技股份有限公司 (Hisense Visual Technology Co., Ltd.)
Publication of WO2023246163A1 publication Critical patent/WO2023246163A1/fr
Publication of WO2023246163A9 publication Critical patent/WO2023246163A9/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present disclosure relates to the field of virtual digital human technology, and in particular, to a virtual digital human driving method, device, equipment and medium.
  • A virtual digital human is a multi-modal intelligent human-computer interaction technology that integrates computer vision, speech recognition, speech synthesis, natural language processing, terminal display and other technologies to create a highly anthropomorphic virtual image that can interact and communicate with people like a real person.
  • the present disclosure provides a virtual digital human driving method, including:
  • Obtain user information which includes voice information and image information
  • the physical movements of the virtual digital human are determined according to the reply text, and the emotional expression of the virtual digital human is determined according to the reply emotion.
  • the present disclosure also provides a computer device, including:
  • one or more processors;
  • Memory used to store one or more programs
  • When the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described in any one of the first aspects.
  • the present disclosure also provides a computer-readable non-volatile storage medium on which a computer program is stored.
  • When the program is executed by a processor, the method described in any one of the first aspects is implemented.
  • Figure 1A is a schematic diagram of an application scenario of a virtual digital human driving process according to some embodiments
  • Figure 1B is a schematic structural diagram of a virtual digital human according to some embodiments.
  • Figure 2A is a hardware configuration block diagram of a computer device according to some embodiments.
  • Figure 2B is a schematic diagram of a software configuration of a computer device according to some embodiments.
  • Figure 2C is a schematic diagram showing an icon control interface of an application included in a smart device according to some embodiments.
  • Figure 3A is a schematic flowchart of a virtual digital human driving method according to some embodiments.
  • Figure 3B is a schematic diagram of the principle of a virtual digital human driving method according to some embodiments.
  • Figure 4A is a schematic flowchart of another virtual digital human driving method according to some embodiments.
  • Figure 4B is a schematic diagram of the principle of a virtual digital human driving method according to some embodiments.
  • Figure 4C is a schematic flowchart of yet another virtual digital human driving method according to some embodiments.
  • Figure 4D is a schematic flowchart of yet another virtual digital human driving method according to some embodiments.
  • Figure 5 is a schematic flowchart of a virtual digital human driving method according to some embodiments.
  • Figure 6 is a schematic flowchart of yet another virtual digital human driving method according to some embodiments.
  • Figure 7 is a schematic flowchart of yet another virtual digital human driving method according to some embodiments.
  • Figure 8 is a schematic diagram of a virtual digital human according to some embodiments.
  • Figure 9 is a schematic diagram of a virtual digital human according to some embodiments.
  • Figure 10 is a schematic diagram of the principle of generating a new speaking style according to some embodiments.
  • Figure 11 is a schematic diagram of a human-computer interaction scenario according to some embodiments.
  • Figure 12 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 13 is a schematic diagram of facial topology data divided into regions according to some embodiments.
  • Figure 14 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 15 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 16 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 17 is a schematic structural diagram of a framework of a speaking style model according to some embodiments.
  • Figure 18 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 19 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments.
  • Figure 20 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 21 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments.
  • Figure 22 is a schematic structural diagram of the framework of a speaking style generation model according to some embodiments.
  • Figure 23 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 24 is a schematic structural diagram of a computer device according to some embodiments.
  • the system design of virtual digital people usually consists of five modules: character image, speech generation, dynamic image generation, audio and video synthesis display, and interactive modeling.
  • Character images can be divided into two categories, 2D and 3D, and in terms of appearance can be divided into cartoon, anthropomorphic, realistic, hyper-realistic and other styles;
  • the speech generation module can generate corresponding character voices based on text;
  • The animation generation module can generate the dynamic image of a specific character based on speech or text;
  • the audio and video synthesis display module synthesizes speech and dynamic images into a video, and finally displays it to the user;
  • The interactive module enables the digital human to have interactive functions: it recognizes the user's intention through intelligent technologies such as speech and semantic recognition, determines the digital human's subsequent voice and actions based on the user's current intention, and drives the character to start the next round of interaction.
  • The embodiments of the present disclosure first obtain user information, which includes voice information and image information; then determine the user intention and user emotion based on the user information; and finally determine the body movements of the virtual digital human based on the user intention and determine the emotional expression of the virtual digital human based on the user emotion. That is, the acquired user voice information and user image information are processed to determine the user intention and user emotion, the body movements of the virtual digital human are then determined based on the user intention, and the emotional expression of the virtual digital human is determined based on the user emotion. This allows the virtual digital human to truly restore the user's intention and emotion, improving the fidelity and naturalness of expression of the virtual digital human.
  • FIG. 1A is a schematic diagram of an application scenario of a virtual digital human driving process in an embodiment of the present disclosure.
  • the virtual digital human driving process can be used in the interaction scenario between users and smart terminals.
  • the smart terminals in this scenario include smart blackboards, smart large screens, smart speakers, smart phones, etc.
  • Examples of virtual digital people include virtual teachers, virtual brand images, virtual assistants, virtual shopping guides, virtual anchors, etc.
  • To interact with the smart terminal, the user issues a voice command.
  • the smart terminal collects the user's voice information and collects the user's image information.
  • the virtual digital human driving method provided by the embodiments of the present disclosure can be implemented based on computer equipment, or functional modules or functional entities in the computer equipment.
  • The computer equipment can be a personal computer (PC), server, mobile phone, tablet computer, notebook computer, mainframe, etc., which is not specifically limited in the embodiments of the present disclosure.
  • FIG. 2A is a hardware configuration block diagram of a computer device according to one or more embodiments of the present disclosure.
  • The computer equipment includes at least one of: a tuner and demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280.
  • the controller 250 includes a central processing unit, a video processor, an audio processor, a graphics processor, a RAM, a ROM, and first to nth interfaces for input/output.
  • the display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen.
  • the tuner and demodulator 210 receives broadcast television signals through wired or wireless reception methods, and demodulates audio and video signals, such as EPG audio and video data signals, from multiple wireless or wired broadcast and television signals.
  • the communicator 220 is a component for communicating with external devices or servers according to various communication protocol types.
  • the communicator may include at least one of a Wifi module, a Bluetooth module, a wired Ethernet module, other network communication protocol chips or near field communication protocol chips, and an infrared receiver.
  • the computer device can establish the transmission and reception of control signals and data signals with the server or local control device through the communicator 220 .
  • the detector 230 is used to collect signals from the external environment or interactions with the outside.
  • The controller 250 controls the overall operation of the computer device and responds to user operations through various software control programs stored in the memory.
  • the user may input a user command into a graphical user interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the graphical user interface (GUI).
  • the user can input a user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
  • FIG. 2B is a schematic diagram of the software configuration of a computer device according to one or more embodiments of the present disclosure. As shown in Figure 2B, the system is divided into four layers, which from top to bottom are the Applications layer (the "application layer"), the Application Framework layer (the "framework layer"), the Android runtime and system library layer (the "system runtime library layer"), and the kernel layer.
  • FIG. 2C is a schematic diagram showing the icon control interface of an application included in a smart terminal (mainly a smart playback device, such as a smart TV, a digital cinema system or an audio-visual server, etc.) according to one or more embodiments of the present disclosure, as shown in Figure 2C
  • The application layer contains at least one application that can display a corresponding icon control on the display, such as: a live TV application icon control, a video on demand (VOD) application icon control, a media center application icon control, an application center icon control, game application icon controls, etc.
  • Live TV app that provides live TV from different sources.
  • Video on demand VOD application that can provide videos from different storage sources.
  • video on demand offers the display of video from certain storage sources.
  • Media center application can provide various multimedia content playback applications.
  • the application center can provide storage for various applications.
  • FIG. 3A is a schematic flowchart of a virtual digital human driving method provided by an embodiment of the present disclosure
  • FIG. 3B is a schematic principle diagram of a virtual digital human driving method provided by an embodiment of the present disclosure.
  • This embodiment can be applied to scenarios in which a virtual digital human is driven.
  • The method of this embodiment can be executed by an intelligent terminal, which can be implemented by hardware and/or software and can be configured in a computer device.
  • the method specifically includes the following steps:
  • The smart terminal includes a sound sensor and a visual sensor, where the sound sensor can be, for example, a microphone array, and the visual sensor, which includes a 2D visual sensor and a 3D visual sensor, can be, for example, a camera.
  • the smart terminal collects voice information through sound sensors and image information through visual sensors.
  • the voice information includes semantic information and acoustic information
  • the image information includes scene information and user image information.
  • After the terminal device collects voice information through the sound sensor, it can determine the user intention based on the semantic information included in the voice information, that is, how the user expects to drive the virtual digital human to act. After collecting image information through the visual sensor, the terminal device can determine, from the collected image information, the facial expression of the user who issued the voice information, and based on that facial expression determine the emotion the user expects the virtual digital human to express.
  • the virtual digital person's reply text can be determined based on the user intention, such as the text corresponding to the virtual digital person's reply voice
  • The virtual digital person's reply emotion can be determined based on the user intention and user emotion; that is, the emotional expression required for the virtual digital person's reply is determined according to the user's intention, and the emotion required for the reply is determined based on the emotion expressed by the user.
  • For example, if the emotion expressed by the user is sadness, the emotion that the virtual digital person needs to express in its reply is also sadness.
  • the reply text of the virtual digital person is determined based on the user intention
  • the reply emotion of the virtual digital person is determined based on the user intention and user emotion
  • the body movements of the virtual digital person are determined based on the reply text
  • Determine the emotional expression of the virtual digital human based on the reply emotion. That is, multi-modal human-computer interaction information perception capabilities for speech recognition and image recognition are first established, and the user intention and user emotion are determined from the acquired voice information and image information.
  • The reply text of the virtual digital person is determined based on the user intention, and the reply emotion of the virtual digital person is determined based on the user intention and user emotion.
  • Finally, emotional expressions and body movements are generated to realize the synthesis of the virtual digital person's voice, expressions, movements, etc.
  • The virtual digital human driving method provided by the embodiment of the present disclosure first obtains user information, that is, voice information and image information; then determines the user intention and user emotion based on the user information, determines the virtual digital human's reply text based on the user intention, and determines the virtual digital human's reply emotion based on the user intention and user emotion; finally, the virtual digital human's body movements are determined based on the reply text, and the virtual digital human's emotional expression is determined based on the reply emotion. This achieves a natural, anthropomorphic virtual human interaction state and improves the fidelity and naturalness of expression of the virtual digital human.
  • FIG. 4A is a schematic flowchart of another virtual digital human driving method provided by an embodiment of the present disclosure.
  • FIG. 4B is a schematic diagram of another virtual digital human driving method provided by the present disclosure.
  • step S20 includes:
  • S201 Process the voice information and determine the text information and voice emotion information corresponding to the voice information.
  • step S201 includes:
  • the voice recognition module performs text transcription processing on the acquired voice information, that is, the voice information is converted into text information corresponding to the voice information.
  • the terminal device can input voice information into an automatic speech recognition (Automatic Speech Recognition, ASR) engine set offline to obtain text information output by the ASR engine.
  • ASR Automatic Speech Recognition
  • the terminal device may continue to wait for the user to input voice. If the start of a human voice is recognized based on Voice Activity Detection (VAD), recording will continue. If the end of the human voice is recognized based on VAD, the recording will stop. The terminal device can use the recorded audio as user voice information. The terminal device can then input the user's voice information into the ASR engine to obtain text information corresponding to the user's voice information.
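  • The following is a minimal sketch of such a VAD-gated recording loop, assuming 16 kHz 16-bit mono PCM frames of 10/20/30 ms delivered by some frame source and using the open-source webrtcvad package as a stand-in for the unspecified VAD engine; the ASR call at the end is hypothetical.

```python
# Minimal sketch of a VAD-gated recording loop (assumptions: 16 kHz 16-bit mono
# PCM frames of 10/20/30 ms from a frame_source iterable, and the open-source
# `webrtcvad` package standing in for the unspecified VAD engine).
import webrtcvad

def record_utterance(frame_source, sample_rate=16000, max_silence_frames=30):
    """Start recording at the first detected speech frame and stop after a run
    of silent frames; return the recorded PCM bytes for the ASR engine."""
    vad = webrtcvad.Vad(2)                   # aggressiveness: 0 (lenient) .. 3 (strict)
    recording = bytearray()
    speech_started, silent_count = False, 0
    for frame in frame_source:               # each frame: raw PCM bytes
        is_speech = vad.is_speech(frame, sample_rate)
        if not speech_started:
            if is_speech:                    # start of a human voice -> keep recording
                speech_started = True
                recording.extend(frame)
            continue
        recording.extend(frame)
        silent_count = 0 if is_speech else silent_count + 1
        if silent_count >= max_silence_frames:   # end of the human voice -> stop
            break
    return bytes(recording)

# Hypothetical usage: text = asr_engine.recognize(record_utterance(mic_frames))
```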
  • VAD Voice Activity Detection
  • Voiceprint features are the sound wave spectrum, carrying speech information, that can be displayed by electroacoustic instruments. Voiceprint features represent the different wavelengths, frequencies, intensities, and rhythms of different sounds, that is, the pitch, intensity, length, and timbre of the user's voice; different users have different voiceprint features. By extracting voiceprint features from the voice information, the emotional information expressed by the user corresponding to the voice information, that is, the voice emotion information, can be obtained.
  • S202 Process the image information and determine the scene information and image emotion information corresponding to the image information.
  • step S202 includes:
  • S2021. Preprocess the image information to determine the scene key point information and user key point information included in the image information.
  • the scene key point information refers to the key points of the scene in which the user is located in the image information in addition to the user information.
  • the user key point information refers to the key points of the user's limbs or facial features in the image information.
  • the image information collected by the terminal device shows a teacher standing in front of a blackboard, that is, the scene key point information included in the image information is the blackboard, and the user key point information included in the image is the user's eyes, mouth, arms, legs, etc.
  • the scene information of the terminal device can be determined, that is, in which scene the terminal device is applied.
  • a scene recognition model is constructed based on algorithms such as entity recognition, entity linking, and entity alignment, and then the image information of different application scenarios in the knowledge base is preprocessed.
  • After the scene key point information corresponding to the image information of different application scenarios is obtained, the scene key point information is input into the scene recognition model to train it until the scene recognition model converges, thereby obtaining the target scene recognition model.
  • graph mapping, information extraction and other methods are used to preprocess the acquired image information to obtain the scene key point information corresponding to the image.
  • The scene key point information obtained after preprocessing is input into the target scene recognition model for scene recognition, so as to ensure the accuracy of the scene recognition results.
  • the user's emotions collected by the terminal device can be determined, that is, the emotions expressed by the user included in the image information collected by the terminal device.
  • S203 Determine user intention based on text information and scene information.
  • The body movements that the user expects to drive the virtual digital human to perform can be determined based on the text information, and then combined with the determined scene information to further ensure the coordination accuracy of the body movements with which the terminal device drives the virtual digital human based on the text information.
  • S204 Determine user emotion based on text information, voice emotion information and image emotion information.
  • the emotion expressed by the user can be roughly determined based on the text information, and then by fusing the voice emotion information and image emotion information, the virtual digital human can be accurately driven to express the user's emotion and improve the fidelity of the virtual digital human.
  • The virtual digital human driving method first determines the text information and voice emotion information corresponding to the voice information by processing the voice information, and determines the scene information and image emotion information corresponding to the image information by processing the image information; it then determines the user intention based on the text information and scene information, and determines the user emotion based on the text information, voice emotion information, and image emotion information. That is, the body movements the user expects to drive can be determined from the text information and refined with the determined scene information to ensure the coordination accuracy of the driven body movements; the emotion expressed by the user can be roughly determined from the text information and then refined by fusing the voice emotion information and image emotion information, so that the virtual digital human is accurately driven to express the user's emotion and its fidelity is improved.
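  • As an illustration of the fusion step above, the following is a minimal sketch of weighted late fusion of the three per-modality emotion estimates. The emotion label set, the per-modality probability vectors, and the fusion weights are illustrative assumptions; the disclosure does not specify a concrete fusion rule.

```python
# Minimal, hypothetical sketch of late fusion of the three emotion estimates
# (text, voice, image). Labels, probability vectors and weights are illustrative.
import numpy as np

EMOTIONS = ["happy", "sad", "angry", "neutral"]

def fuse_emotions(p_text, p_voice, p_image, weights=(0.4, 0.3, 0.3)):
    """Weighted average of per-modality emotion probability vectors."""
    stacked = np.stack([p_text, p_voice, p_image])        # (3, n_emotions)
    fused = np.average(stacked, axis=0, weights=weights)  # fuse the modalities
    return EMOTIONS[int(np.argmax(fused))], fused

# Text alone looks neutral, but voice and image both indicate sadness:
label, dist = fuse_emotions(np.array([0.1, 0.3, 0.1, 0.5]),
                            np.array([0.1, 0.6, 0.1, 0.2]),
                            np.array([0.1, 0.7, 0.1, 0.1]))
print(label)   # -> "sad"
```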
  • Figure 5 is a schematic flow chart of another virtual digital human driving method provided by an embodiment of the present disclosure.
  • the embodiment of the present disclosure is based on the embodiment corresponding to Figure 4C. As shown in Figure 5, before step S2012, it also includes:
  • By constructing highly robust voiceprint recognition and voiceprint clustering technologies, automatic login of multiple users is achieved through the voice modality, while paralinguistic information such as gender and accent is extracted to establish basic user information. To address the difficulty of clustering speech features when the number of target classes is uncertain, and the impact of speech channel interference on classification and clustering, a noisy density-space unsupervised clustering technique is used, combined with stochastic linear discriminant analysis, to achieve highly reliable voiceprint classification and clustering and to reduce the impact of channel interference on voiceprint recognition. That is, in this disclosure, a speech recognition model is constructed that can adapt to different paralinguistic information, and the accuracy of speech recognition is high.
  • the voice feature vector in the voice information is first extracted.
  • The voice feature vector includes: an accent feature vector, a gender feature vector, an age feature vector, etc.
  • the speech recognition model includes an acoustic model and a language model.
  • the acoustic model includes a convolutional neural network model with an attention mechanism
  • the language model includes a deep neural network model.
  • the speech recognition model constructed in this disclosure is a joint modeling of acoustic model and language model.
  • Column convolution and an attention mechanism are used to build the acoustic model, and the speech feature vector is added as a condition in the convolution layer of the convolutional neural network model to adapt to different speech features.
  • A deep neural network based model structure that can be quickly adapted and configured is implemented, and the voice characteristics of different paralinguistic information are accommodated through user-specific voiceprints to improve the accuracy of speech recognition.
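  • As a concrete illustration of injecting a speech feature vector as a condition into a convolution layer, the following is a minimal PyTorch sketch. The FiLM-style scale-and-shift conditioning, the layer sizes, and the tensor shapes are assumptions for illustration only, not the acoustic model structure claimed in the disclosure.

```python
# Minimal PyTorch sketch of conditioning a convolution layer on a speech feature
# vector (accent / gender / age embedding). All sizes are illustrative.
import torch
import torch.nn as nn

class ConditionedConvBlock(nn.Module):
    def __init__(self, channels=128, cond_dim=64, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.to_scale = nn.Linear(cond_dim, channels)   # condition -> per-channel scale
        self.to_shift = nn.Linear(cond_dim, channels)   # condition -> per-channel shift

    def forward(self, x, cond):
        # x: (batch, channels, time) acoustic features; cond: (batch, cond_dim)
        h = self.conv(x)
        scale = self.to_scale(cond).unsqueeze(-1)       # broadcast over the time axis
        shift = self.to_shift(cond).unsqueeze(-1)
        return torch.relu(h * (1 + scale) + shift)

block = ConditionedConvBlock()
out = block(torch.randn(2, 128, 100), torch.randn(2, 64))   # -> (2, 128, 100)
```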
  • step S2012 may include:
  • the speech information can be transcribed into text based on the speech recognition model to improve the accuracy of the speech recognition results.
  • Figure 6 is a schematic flow chart of another virtual digital human driving method provided by an embodiment of the present disclosure.
  • the embodiment of the present disclosure is based on the corresponding embodiment of Figure 4A.
  • The specific implementation of step S40 includes:
  • action identifiers include: lifting, stretching, blinking, opening, etc.
  • Key point driving includes speech-content separation, content key point driving, speaker key point driving, a key point based image generation module, a key point based image stretching module, etc. Therefore, the text information transcribed from the voice information is first parsed to obtain the action identifiers and key point identifiers included in the text information.
  • the body movements of the virtual digital human are selected from the preset action database corresponding to the scene information, such as raising the head, raising the leg, etc.
  • the preset action database includes action type definition, action arrangement, action connection, etc.
  • S403. Determine the emotional expression of key points of the virtual digital human based on the voice emotional information and the image emotional information.
  • examples of emotional expressions for determining the key points of the virtual digital human can be smiling with the mouth, clapping with both hands, etc.
  • A deep learning method is used to learn the mapping between virtual human key points and voice feature information, as well as the mapping between face key points and the voice emotion information and image emotion information.
  • the virtual digital human driving method provided by the embodiments of the present disclosure realizes the generation of voice-driven virtual digital human animation with controllable expressions by integrating emotional key point templates.
  • Figure 7 is a schematic flow chart of another virtual digital human driving method provided by an embodiment of the present disclosure.
  • the embodiment of the present disclosure is based on the embodiment corresponding to Figure 6. As shown in Figure 7, before step S401, it also includes:
  • different features are extracted from speech to drive head movements, facial movements, and body movements respectively, forming a more vivid speech driving method.
  • the image of the virtual digital human is driven based on the deep neural network method, and a generative adversarial network is applied for high-fidelity real-time generation.
  • The image generation of the virtual digital human is divided into action driving and image library production.
  • the hair library, clothing library, and tooth model of the digital human image are produced offline, and the image can be produced in a targeted manner according to different application scenarios.
  • the motion driver module of the virtual digital human is processed on the server side, and then the topological vertex data is encapsulated and transmitted, and texture mapping, rendering output, etc. are performed on the device side.
  • The key point driving technology based on the adversarial network, the feature point geometric stretching method, and the image transformation and generation technology based on the Encoder-Decoder method are used to realize the driving of virtual digital humans.
  • Based on the emotional key point template method, the correspondence between the user key points and preset user emotional key points is established to realize the emotional expression of virtual digital people.
  • 3D face driving technology based on deep codec technology achieves semantic mapping between speech features and three-dimensional vertex motion features; rhythmic head driving technology based on a deep codec nested temporal network provides discriminative control of head movement and facial activity.
  • Embodiments of the present disclosure provide a computer device, including: one or more processors; and a memory for storing one or more programs, where when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any one of the embodiments of the present disclosure.
  • As shown in Figure 8, the smart device displays the expression and mouth shape of the virtual digital person when speaking while simultaneously playing the voice response information, and as shown in Figure 9, the user can not only hear the voice of the virtual digital person but also see its expression when speaking, giving the user the experience of talking to a real person.
  • Based on driving basic speaking styles, as shown in Figure 10, a new speaking style can be generated. However, since retraining the speaking style model requires a lot of time to collect training samples and to process a large amount of data, generating a new speaking style takes a long time, making speaking style generation relatively inefficient.
  • The present disclosure determines the fitting coefficient of each style feature attribute by fitting the target style feature attribute based on multiple style feature attributes; determines the target style feature vector according to the fitting coefficient of each style feature attribute and multiple style feature vectors, where the multiple style feature vectors correspond to the multiple style feature attributes one-to-one; inputs the target style feature vector into the speaking style model and outputs the target speaking style parameters, where the speaking style model is obtained by training the framework of the speaking style model based on the multiple style feature vectors; and generates the target speaking style based on the target speaking style parameters. In this way, the target style feature vector can be fitted from the multiple style feature vectors, and because the speaking style model is trained based on the multiple style feature vectors, inputting the fitted target style feature vector into the speaking style model directly yields the corresponding new speaking style. There is no need to retrain the speaking style model, which enables rapid transfer of speaking styles and improves the efficiency of speaking style generation.
  • FIG 11 is a schematic diagram of a human-computer interaction scenario according to some embodiments.
  • smart devices may include smart refrigerator 110, smart washing machine 120, smart display device 130, etc.
  • a user wants to control a smart device, he or she needs to issue a voice command first.
  • After the smart device receives the voice command, it needs to perform semantic understanding on the voice command, determine the semantic understanding result corresponding to the voice command, and execute the corresponding control instruction according to the semantic understanding result to meet the user's needs.
  • the smart devices in this scenario all include a display screen, which can be a touch screen or a non-touch screen.
  • For terminal devices with touch screens, users can perform interactive operations with the terminal device through gestures, fingers, or touch tools (such as stylus pens).
  • interactive operations with the terminal device can be implemented through external devices (for example, a mouse or a keyboard, etc.).
  • the display screen can display a three-dimensional virtual person, and the user can see the three-dimensional virtual person and his or her expression when speaking through the display screen, thereby realizing dialogue and interaction with the three-dimensional virtual person.
  • the speaking style generation method provided by the embodiments of the present disclosure can be implemented based on a computer device, or a functional module or functional entity in the computer device.
  • the computer device may be a personal computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a large computer, etc., which are not specifically limited in the embodiments of the present disclosure.
  • Figure 12 is a schematic flowchart of a speaking style generation method according to some embodiments. As shown in Figure 12, the method specifically includes the following steps:
  • each frame of facial topology data corresponds to a dynamic face topology
  • the face topology includes multiple vertices.
  • Each vertex in the dynamic face topology structure corresponds to a vertex coordinate (x, y, z).
  • the vertex coordinates of each vertex in the static face topology are (x', y', z').
  • the average vertex offset of each vertex in the dynamic face topology can be determined
  • FIG 13 is a schematic diagram of facial topology data divided into regions according to an embodiment of the present disclosure.
  • facial topology data can be divided into multiple regions.
  • For example, the facial topology data can be divided into three regions, S1, S2 and S3, where S1 is the facial area above the lower edge of the eyes, S2 is the facial area from the lower edge of the eyes to the upper edge of the upper lip, and S3 is the facial area from the upper edge of the upper lip to the chin.
  • The average of the vertex offsets of all vertices of the dynamic face topology within region S1, the average within region S2, and the average within region S3 can then be determined, and by splicing these per-region averages the style feature attribute can be obtained. To sum up, one style feature attribute can be obtained for one user, and multiple style feature attributes can be obtained based on multiple users.
  • a new style feature attribute can be fitted and formed, that is, the target style feature attribute.
  • The target style feature attribute S_target can be obtained by fitting based on the following formula: S_target = a1·S1 + a2·S2 + … + an·Sn, where Si is the style feature attribute of user i, and:
  • a1 is the fitting coefficient of the style feature attribute of user 1
  • a2 is the fitting coefficient of the style feature attribute of user 2
  • an is the fitting coefficient of the style feature attribute of user n
  • optimization methods can be used, such as gradient descent method, Gauss-Newton method, etc., to obtain the fitting coefficient of each style characteristic attribute.
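  • The following is a minimal sketch of this fitting step, assuming the per-region averages have already been spliced into attribute vectors; it uses ordinary least squares as one possible optimization method, and the array shapes are illustrative.

```python
# Minimal sketch: solve for coefficients a1..an so that the weighted sum of the
# n basis style feature attributes approximates the target style feature
# attribute. Shapes are illustrative (3 regions x (dx, dy, dz) = 9).
import numpy as np

def fit_style_coefficients(basis_attrs, target_attr):
    """basis_attrs: (n_users, attr_dim); target_attr: (attr_dim,).
    Returns fitting coefficients a with shape (n_users,)."""
    A = np.asarray(basis_attrs, dtype=float).T           # columns = basis attributes
    a, *_ = np.linalg.lstsq(A, np.asarray(target_attr, dtype=float), rcond=None)
    return a

basis = np.random.rand(3, 9)                             # style attributes of 3 preset users
target = 0.5 * basis[0] + 0.3 * basis[1] + 0.2 * basis[2]
print(fit_style_coefficients(basis, target))             # ~ [0.5, 0.3, 0.2]
```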
  • this embodiment only takes dividing facial topological structure data into three areas as an example for illustration, and does not serve as a specific restriction on dividing facial topological structure data into areas.
  • S1302 Determine a target style feature vector based on the fitting coefficient of each style feature attribute and multiple style feature vectors.
  • the plurality of style feature vectors correspond to the plurality of style feature attributes in one-to-one correspondence.
  • the style feature vector is a representation of style
  • the embedding obtained by training the classification task model can be used as the style feature vector based on the classification task model, or the one-hot feature vector can be directly designed as the style feature vector.
  • the 3 style feature vectors can be [1;0;0], [0;1;0] and [0;0;1].
  • the style feature attributes of n users with different speaking styles are obtained.
  • the style feature vectors of n users can be obtained.
  • These n style feature attributes correspond to n style feature vectors one-to-one, n style feature attributes and their corresponding style feature vectors form the basic style feature base.
  • Each fitting coefficient is multiplied by the corresponding style feature vector, and the target style feature vector can be expressed in the form of the basic style feature base, as shown in the following formula: p = a1·F1 + a2·F2 + … + an·Fn, where:
  • F1 is the style feature vector of user 1
  • F2 is the style feature vector of user 2
  • Fn is the style feature vector of user n
  • p is the target style feature vector.
  • When the style feature vectors are one-hot feature vectors, the target style feature vector p can be expressed as p = [a1; a2; …; an], that is, the vector formed by the fitting coefficients.
  • the speaking style model is obtained by training a speaking style model framework based on the plurality of style feature vectors.
  • the framework of the speaking style model is trained based on multiple style feature vectors in the basic style feature base to obtain the framework of the trained speaking style model, that is, the speaking style model.
  • Inputting the target style feature vector into the speaking style model can be understood as inputting the product of multiple style feature vectors and respective fitting coefficients into the speaking style model, which is the same as the training sample input when training the framework of the speaking style model. Therefore, based on the speaking style model, using the target style feature vector as input, the target speaking style parameters can be directly output.
  • the target speaking style parameter can be the vertex offset between each vertex in the dynamic face topology structure and the corresponding vertex in the static face topology structure; or it can be the coefficient of the expression basis of the dynamic face topology structure, or it can be other parameters, this disclosure does not impose specific restrictions on this.
  • S1304 Generate a target speaking style based on the target speaking style parameters.
  • the target speaking style parameter is the vertex offset of each vertex in the dynamic face topology structure and the corresponding vertex in the static face topology structure.
  • the offset drives each vertex of the static face topology to move to the corresponding position, and the target speaking style can be obtained.
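  • As an illustration of these last two steps, the following is a minimal sketch under the assumption that the target speaking style parameters are per-vertex offsets: a trained speaking style model, represented here by a stand-in callable, returns offsets for each frame, and adding them to the static face topology drives the mesh. The vertex count and the dummy model are hypothetical.

```python
# Minimal sketch of generating the target speaking style from per-vertex offsets.
import numpy as np

def drive_face(static_vertices, speaking_style_model, speech_features, target_style_vec):
    """static_vertices: (n_vertices, 3). Returns animated vertices per frame."""
    frames = []
    for feat in speech_features:                                 # one speech feature per frame
        offsets = speaking_style_model(feat, target_style_vec)   # (n_vertices, 3) offsets
        frames.append(static_vertices + offsets)                 # move each vertex to its target
    return np.stack(frames)                                      # (n_frames, n_vertices, 3)

dummy_model = lambda feat, style: np.zeros((90, 3))              # stand-in returning zero offsets
anim = drive_face(np.random.rand(90, 3), dummy_model, [None] * 10, None)
print(anim.shape)                                                # (10, 90, 3)
```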
  • By fitting the target style feature attribute based on the multiple style feature attributes, the fitting coefficient of each style feature attribute is determined; the target style feature vector is determined based on the fitting coefficient of each style feature attribute and the multiple style feature vectors, where the multiple style feature vectors correspond to the multiple style feature attributes one-to-one; the target style feature vector is input into the speaking style model and the target speaking style parameters are output, where the speaking style model is obtained by training the framework of the speaking style model based on the multiple style feature vectors; and the target speaking style is generated based on the target speaking style parameters. In this way, the target style feature vector can be fitted from the multiple style feature vectors, and because the speaking style model is trained based on the multiple style feature vectors, inputting the fitted target style feature vector into the speaking style model directly yields the corresponding new speaking style. There is no need to retrain the speaking style model, which enables rapid transfer of speaking styles and improves the efficiency of speaking style generation.
  • Figure 14 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 14 is based on the embodiment shown in Figure 12. Before executing S1301, it also includes:
  • S1501 Collect multi-frame facial topological structure data when multiple preset users read multiple speech segments.
  • Users with different speaking styles are selected as preset users, and multiple speech segments are also selected.
  • While each preset user reads each segment of speech, multi-frame facial topology data of that preset user is collected. For example, if the duration of speech segment 1 is t1 and facial topology data is collected at 30 frames per second, then after preset user 1 reads speech segment 1, t1*30 frames of facial topology data can be collected.
  • t1*30*m frames of facial topology data can be collected.
  • The vertex offsets (Δx, Δy, Δz) between each vertex of the dynamic face topology and the corresponding vertex of the static face topology in each frame of facial topology data can be used as the speaking style parameters of that frame of facial topology data.
  • Based on the divided areas, the average of the vertex offsets of all vertices of the dynamic face topology in the facial topology data within each divided area can be obtained.
  • For example, when the facial topology data is divided into three areas, the averages of the vertex offsets of all vertices of the dynamic face topology in the facial topology data within areas S1, S2 and S3 are obtained respectively.
  • S1503 Splice the average value of the speaking style parameters of the multi-frame facial topological structure data in each divided area in a preset order to obtain the style feature attributes of each preset user.
  • the preset order may be a top-to-bottom order as shown in FIG. 13 , or may be a bottom-to-top order as shown in FIG. 13 , and the present disclosure does not specifically limit this.
  • When the preset order is the top-to-bottom order shown in Figure 13, based on the above embodiment, the averages of the vertex offsets of all vertices of the dynamic face topology in the facial topology data corresponding to the respective areas can be spliced in the order of areas S1, S2 and S3.
  • In this way, the style feature attribute of preset user 1 can be obtained, and similarly, multiple style feature attributes can be obtained for multiple preset users.
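  • The following is a minimal sketch of the per-region averaging and splicing described above. The vertex counts and the region index arrays are assumptions standing in for the S1/S2/S3 division of Figure 13.

```python
# Minimal sketch: average the per-frame vertex offsets of the dynamic face
# topology inside each divided region, then splice the region averages in a
# fixed top-to-bottom order to form the style feature attribute.
import numpy as np

def style_feature_attribute(dynamic_frames, static_vertices, regions):
    """dynamic_frames: (n_frames, n_vertices, 3); static_vertices: (n_vertices, 3);
    regions: list of vertex-index arrays ordered S1, S2, S3."""
    offsets = dynamic_frames - static_vertices[None, :, :]            # per-frame vertex offsets
    means = [offsets[:, idx, :].mean(axis=(0, 1)) for idx in regions] # mean (dx, dy, dz) per region
    return np.concatenate(means)                                      # spliced in the preset order

frames = np.random.rand(300, 90, 3)      # e.g. 10 s of data at 30 frames/s, 90 vertices
static = np.random.rand(90, 3)
regions = [np.arange(0, 30), np.arange(30, 60), np.arange(60, 90)]    # stand-ins for S1, S2, S3
print(style_feature_attribute(frames, static, regions).shape)         # (9,)
```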
  • Figure 15 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 15 is based on the embodiment shown in Figure 14. Before executing S1301, it also includes:
  • S1601 Collect multi-frame target facial topological structure data when the target user reads the multiple segments of speech.
  • the target user and the plurality of preset users are different users.
  • To generate the target speaking style, multi-frame target facial topology data is collected while the target user corresponding to the target speaking style reads the multiple segments of speech, and the content of the speech segments read by the target user is the same as the content of the speech segments read by the multiple preset users. For example, after the target user reads m segments of speech each with a duration of t1, t1*30*m frames of target facial topology data can be obtained.
  • S1602 Determine, based on the respective speaking style parameters of the multi-frame target facial topology data corresponding to the multiple segments of speech and the divided areas of the facial topology data, the average of the speaking style parameters of the multi-frame target facial topology data in each divided area.
  • The vertex offsets (Δx', Δy', Δz') between each vertex of the dynamic face topology and the corresponding vertex of the static face topology in each frame of target facial topology data can be used as the speaking style parameters of that frame of target facial topology data.
  • Based on the vertex offsets (Δx', Δy', Δz') of all vertices of the dynamic face topology in the target user's t1*30*m frames of target facial topology data, the average vertex offset of each vertex of the dynamic face topology in the target user's target facial topology data can be determined.
  • Based on the divided areas, the average of the vertex offsets of all vertices of the dynamic face topology in the target facial topology data within each divided area can be obtained.
  • For example, when the facial topology data is divided into three areas, the averages of the vertex offsets of all vertices of the dynamic face topology in the target facial topology data within areas S1, S2 and S3 are obtained respectively.
  • S1603 Splice the average value of the speaking style parameters of the multi-frame target facial topological structure data in each divided area according to the preset order to obtain the target style feature attributes.
  • The averages of the vertex offsets of all vertices of the dynamic face topology in the target facial topology data are spliced in the preset order. For example, based on the top-to-bottom order shown in Figure 13, the averages corresponding to the respective areas can be spliced in the order of areas S1, S2 and S3, and the target style feature attribute of the target user can be obtained.
  • S1501-S1503 shown in Figure 14 can be executed first, and then S1601-S1603 shown in Figure 15 can be executed; or, S1601-S1603 shown in Figure 15 can be executed first, and then S1501-S1503 shown in Figure 14 can be executed.
  • this disclosure does not impose specific limitations on this.
  • Figure 16 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 16 is based on the embodiments shown in Figures 14 and 15. Before executing S1303, the method also includes:
  • the training sample set includes an input sample set and an output sample set.
  • the input sample includes voice features and the plurality of corresponding style feature vectors, and the output sample includes the speaking style parameters.
  • The Melp feature of the speech can be extracted as the speech feature, or a speech feature extraction model commonly used in the industry can be used to extract speech features, or speech features can also be extracted based on a designed deep network model.
  • For each segment of speech, a speech feature sequence can be extracted. If the contents of the multiple speech segments read by the multiple preset users are exactly the same, the same speech feature sequence can be used for the different preset users. In this way, for the same speech feature in the speech feature sequence, there are multiple style feature vectors corresponding to the multiple preset users. One speech feature and its corresponding multiple style feature vectors can be used as an input sample, and based on all the speech features in the speech feature sequences, multiple input samples can be obtained, that is, the input sample set is obtained.
  • For each preset user reading each segment of speech, corresponding facial topology data can be collected. Based on the vertex coordinates of all vertices of the dynamic face topology in the facial topology data, the respective vertex offsets of all vertices of the dynamic face topology relative to the static face topology can be obtained.
  • The respective vertex offsets of all vertices of the dynamic face topology in one frame of facial topology data are used as a set of speaking style parameters, and a set of speaking style parameters is an output sample.
  • Based on the multi-frame facial topology data, multiple output samples can be obtained, that is, the output sample set.
  • the input sample set and the output sample set constitute the training sample set for training the speaking style generation model.
  • the framework of the speaking style model includes a linear combination unit and a network model.
  • the linear combination unit is used to generate a linear combination style feature vector of the plurality of style feature vectors and generate a linear combination output sample of a plurality of output samples.
  • the input samples correspond to the output samples one-to-one;
  • the network model is used to generate corresponding predicted output samples according to the linear combination style feature vector.
  • Figure 17 is a schematic structural diagram of the framework of the speaking style model according to some embodiments.
  • the framework of the speaking style model includes a linear combination unit 310 and a network model 320.
  • the input end of the linear combination unit 310 is used to receive training samples.
  • the output end of the linear combination unit 310 is connected to the input end of the network model 320
  • the output end of the network model 320 is the output end of the framework 300 of the speaking style model.
  • the training samples include input samples and output samples, where the input samples include speech features and their corresponding multiple style feature vectors.
  • The linear combination unit 310 can linearly combine the multiple style feature vectors to obtain a linear combination style feature vector, and can also linearly combine the speaking style parameters corresponding to the multiple style feature vectors to obtain a linear combination output sample.
  • the linear combination unit 310 can output speech features and their corresponding linear combination style feature vectors, that is, linear combination input samples, and can also output corresponding linear combination output samples.
  • the linear combination training samples are input to the network model 320.
  • the linear combination training samples include linear combination input samples and linear combination output samples. Based on the linear combination training samples, the network model 320 is trained.
  • the training samples in the training sample set are input to the framework of the speaking style model.
  • the framework of the speaking style model can output predicted output samples.
  • The loss function is used to determine the loss value between the predicted output sample and the output sample. The model parameters of the framework of the speaking style model are adjusted in the direction that reduces the loss value, completing one iteration of training. In this way, after multiple iterations of training the framework of the speaking style model, a trained framework of the speaking style model, that is, the speaking style model, can be obtained.
  • a training sample set is obtained.
  • the training sample set includes an input sample set and an output sample set.
  • The input sample includes speech features and their corresponding multiple style feature vectors, and the output sample includes speaking style parameters.
  • The framework of the speaking style model is defined; the framework includes a linear combination unit and a network model.
  • The linear combination unit is used to generate a linear combination style feature vector of multiple style feature vectors and to generate a linear combination output sample of multiple output samples, where the input samples correspond to the output samples one-to-one.
  • The network model is used to generate corresponding predicted output samples based on the linear combination style feature vector; based on the training sample set and the loss function, the framework of the speaking style model is trained to obtain the speaking style model.
  • Because the speaking style model is essentially obtained by training the network model based on linear combinations of multiple style feature vectors, the diversity of the network model's training samples is improved, which improves the versatility of the speaking style model.
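  • The following is a minimal PyTorch sketch of this training framework: a linear combination unit that mixes the style feature vectors with weights summing to 1 and mixes the corresponding output samples with the same weights, followed by a small network trained with an MSE loss. The network architecture, layer sizes, and the Dirichlet weight sampler are illustrative assumptions, not the framework defined in the disclosure.

```python
# Minimal sketch of training on linearly combined style vectors and output samples.
import torch
import torch.nn as nn

class StyleNetwork(nn.Module):
    def __init__(self, speech_dim=80, style_dim=3, out_dim=270):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(speech_dim + style_dim, 256),
                                 nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, speech_feat, style_vec):
        return self.net(torch.cat([speech_feat, style_vec], dim=-1))

def train_step(model, optimizer, speech_feat, style_vecs, output_samples):
    # style_vecs: (n_styles, style_dim); output_samples: (n_styles, out_dim)
    w = torch.distributions.Dirichlet(torch.ones(style_vecs.shape[0])).sample()  # weights sum to 1
    combo_style = w @ style_vecs            # linear combination style feature vector
    combo_target = w @ output_samples       # linear combination output sample
    pred = model(speech_feat, combo_style)  # predicted output sample
    loss = nn.functional.mse_loss(pred, combo_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = StyleNetwork()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = train_step(model, opt, torch.randn(80), torch.eye(3), torch.randn(3, 270))
```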
  • Figure 18 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 18 is a detailed description of an implementation method when performing S1703 based on the embodiment shown in Figure 16, as follows:
  • the sum of the weight values of the plurality of style feature vectors is 1.
  • The multiple style feature vectors can each be assigned a weight value, with the sum of the weight values of the multiple style feature vectors equal to 1.
  • By multiplying each style feature vector by its weight value and adding the products, a linear combination style feature vector can be obtained.
  • Each style feature vector corresponds to an output sample, and the linear combination output sample can be obtained by adding the product of the weight value of multiple style feature vectors and the corresponding output sample.
  • different linear combination style feature vectors and different linear combination output samples can be obtained.
  • a linear combination input sample set can be obtained.
  • a linear combination of output samples corresponding to multiple speech features can be obtained.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination input sample includes the speech feature and its corresponding linear combination style feature vector.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination training samples are input to the network model.
  • A predicted output sample can be obtained, and based on the loss function, the loss value between the predicted output sample and the linear combination output sample is determined.
  • In the direction of reducing the loss value, the model parameters of the network model are adjusted, and one iteration of training of the network model is completed. In this way, after multiple training iterations of the network model, a trained framework of the speaking style model, that is, the speaking style model, can be obtained.
  • a linear combination style feature vector is generated based on the multiple style feature vectors and their respective weight values, and a linear combination output sample is generated based on the respective weight values of the multiple style feature vectors and the multiple output samples, where the sum of the weight values of the multiple style feature vectors is 1; the network model is trained according to the loss function and the linear combination training sample set to obtain the speaking style model.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination input sample includes speech features and their corresponding linear combination style feature vectors. The linearly combined training samples can be used as training samples for the network model, which can increase the number and diversity of the network model's training samples and improve the versatility and accuracy of the speaking style model.
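  • a minimal sketch of the linear combination step described above, assuming NumPy arrays; drawing the weight values from a Dirichlet distribution is one convenient way (an assumption, not stated in the disclosure) to obtain random non-negative weights that sum to 1.

```python
import numpy as np

def linear_combination(style_vecs, output_samples, rng=None):
    """Combine K style feature vectors and their K output samples with weights that sum to 1.

    style_vecs:     shape (K, style_dim), one style feature vector per style
    output_samples: shape (K, out_dim), the output sample corresponding to each style
    """
    rng = rng if rng is not None else np.random.default_rng()
    k = style_vecs.shape[0]
    weights = rng.dirichlet(np.ones(k))          # random weight values, sum == 1
    combined_style = weights @ style_vecs        # sum_i w_i * style_vec_i
    combined_output = weights @ output_samples   # sum_i w_i * output_sample_i
    return combined_style, combined_output
```

Each call with fresh weight values yields a different linear combination input/output pair for the same speech feature, which is how the linear combination training sample set grows.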
  • Figure 19 is a schematic structural diagram of the framework of another speaking style generation model provided by the embodiment of the present disclosure.
  • the framework of the speaking style model also includes a scaling unit 330.
  • the input end of the scaling unit 330 is used to receive training samples
  • the output end of the scaling unit 330 is connected to the input end of the linear combination unit 310
  • the scaling unit 330 is used to scale the multiple style feature vectors and the multiple output samples based on randomly generated scaling factors, to obtain multiple scaled style feature vectors and multiple scaled output samples, and to output scaled training samples.
  • the scaled training samples include the multiple scaled style feature vectors and their respective corresponding scaled output samples.
  • the scaling factor can range from 0.5 to 2 and is accurate to one decimal place.
  • the scaled training samples are input to the linear combination unit 310. Based on the linear combination unit 310, the multiple scaled style feature vectors can be linearly combined to obtain a linear combination style feature vector, and the scaled output samples corresponding to the multiple scaled style feature vectors can also be linearly combined to obtain a linear combination output sample.
  • the linear combination unit 310 can output speech features and their corresponding linear combination style feature vectors, that is, linear combination input samples, and can also output corresponding linear combination output samples.
  • the linear combination training samples are input to the network model 320.
  • the linear combination training samples include linear combination input samples and linear combination output samples. Based on the linear combination training samples, the network model 320 is trained.
  • Figure 20 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 20 is a detailed description of another possible implementation when performing S1703 based on the embodiment shown in Figure 16, as follows:
  • multiple style feature vectors can be scaled separately with random scaling factors, and multiple scaled style feature vectors can be obtained.
  • Each style feature vector corresponds to an output sample, and multiple scaled output samples can be obtained by scaling the corresponding output samples based on the respective scaling factors of multiple style feature vectors.
  • a scaled input sample set can be obtained based on multiple voice features and their corresponding multiple scaled style feature vectors, and a scaled output sample set can be obtained based on the scaled output samples corresponding to the multiple voice features.
  • S5012 input the multiple scaled style feature vectors and the multiple scaled output samples to the linear combination unit, and generate the linear combination style feature vector based on the multiple scaled style feature vectors and their respective weight values.
  • the scaled training sample set includes a scaled input sample set and a scaled output sample set.
  • the scaled training sample set is input to the linear combination unit.
  • multiple scaled style feature vectors can each be assigned a weight value, and the sum of the weight values of the multiple scaled style feature vectors is 1.
  • by adding the products of the weight values and the corresponding scaled style feature vectors, a linear combination style feature vector can be obtained.
  • each scaled style feature vector corresponds to a scaled output sample, and the linear combination output sample can be obtained by adding the products of the respective weight values of the multiple scaled style feature vectors and the corresponding scaled output samples. In this way, based on different weight values, different linear combination style feature vectors and different linear combination output samples can be obtained.
  • based on multiple speech features and their corresponding linear combination style feature vectors, a linear combination input sample set can be obtained.
  • based on the linear combination output samples corresponding to the multiple speech features, a linear combination output sample set can be obtained.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination input sample includes the speech feature and its corresponding linear combination style feature vector.
  • the linear combination training samples are input to the network model.
  • a predicted output sample can be obtained; based on the loss function, the predicted output sample and the corresponding linear combination output sample, a loss value can be obtained.
  • the model parameters of the network model are adjusted in the direction in which the loss value decreases, which completes one iteration of training of the network model. In this way, after multiple iterations of training of the network model, a well-trained framework of the speaking style model, that is, the speaking style model, can be obtained.
  • the framework of the speaking style model also includes a scaling unit; the training sample set is input to the scaling unit, multiple scaled style feature vectors are generated based on the scaling factor and the multiple style feature vectors, and multiple scaled output samples are generated based on the scaling factor and the multiple output samples; the multiple scaled style feature vectors and the multiple scaled output samples are input to the linear combination unit, and a linear combination style feature vector is generated based on the multiple scaled style feature vectors and their respective weight values.
  • a linear combination output sample is generated based on the respective weight values of the multiple scaled style feature vectors and the multiple scaled output samples.
  • the sum of the weight values of the multiple scaled style feature vectors is 1; the network model is trained according to the loss function and the linear combination training sample set to obtain the speaking style model.
  • the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set.
  • the linear combination input sample includes speech features and their corresponding linear combination style feature vectors.
  • the multiple scaled style feature vectors serve as training samples for the network model, which can increase the number and diversity of the network model's training samples, thereby improving the versatility and accuracy of the speaking style model.
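  • the scaling unit followed by the linear combination unit can be sketched as below, assuming NumPy; the factor range of 0.5 to 2 in steps of 0.1 follows the example given above, while the function and variable names are illustrative.

```python
import numpy as np

def scale_samples(style_vecs, output_samples, rng=None):
    """Scale each style feature vector and its corresponding output sample by a random factor.

    One factor per sample is drawn from {0.5, 0.6, ..., 2.0}, i.e. 0.5-2 to one decimal place.
    """
    rng = rng if rng is not None else np.random.default_rng()
    factors = rng.choice(np.round(np.arange(0.5, 2.05, 0.1), 1), size=style_vecs.shape[0])
    return style_vecs * factors[:, None], output_samples * factors[:, None]

def scaled_linear_combination(style_vecs, output_samples, rng=None):
    """Scaling unit output fed into the linear combination unit (weight values sum to 1)."""
    rng = rng if rng is not None else np.random.default_rng()
    scaled_styles, scaled_outputs = scale_samples(style_vecs, output_samples, rng)
    weights = rng.dirichlet(np.ones(scaled_styles.shape[0]))
    return weights @ scaled_styles, weights @ scaled_outputs
```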
  • Figure 21 is a schematic structural diagram of the framework of the speaking style generation model according to some embodiments
  • Figure 22 is a schematic structural diagram of the framework of the speaking style generation model according to some embodiments
  • the network model 320 includes a first-level network model 321, a second-level network model 322 and an overlay unit 323.
  • the output end of the first-level network model 321 and the output end of the second-level network model 322 are both connected to the input end of the superposition unit 323, and the output end of the superposition unit 323 is used to output the predicted output sample.
  • the loss function includes a first loss function and a second loss function.
  • the linear combination training samples are respectively input to the first-level network model 321 and the second-level network model 322.
  • the first-level prediction output sample can be output based on the first-level network model 321, and the second-level prediction output sample can be output based on the second-level network model 322.
  • the first-level prediction output sample and the second-level prediction output sample are input to the superposition unit 323. Based on the superposition unit 323, the first-level prediction output sample and the second-level prediction output sample are superimposed to obtain the predicted output sample.
  • the first-level network model 321 may include a convolutional network and a fully connected network, which are used to extract the single-frame correspondence between speech and facial topological structure data.
  • the second-level network model 322 can be a sequence-to-sequence (seq2seq) network model, for example, a long short-term memory (LSTM) network model, a gate recurrent unit (GRU) network model, or a Transformer network model.
  • the loss function is L = b1*L1 + b2*L2, where L1 is the first loss function, used to determine the loss value between the first-level prediction output sample and the linear combination output sample; L2 is the second loss function, used to determine the loss value between the second-level prediction output sample and the linear combination output sample; b1 is the weight of the first loss function, b2 is the weight of the second loss function, and b1 and b2 are adjustable.
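  • the two-level structure and the weighted loss L = b1*L1 + b2*L2 can be sketched as follows in PyTorch; the feature dimensions, the specific convolutional and fully connected layers of the first level, the choice of an LSTM for the second level (a GRU or Transformer would also match the description), and the use of mean-squared error for both loss functions are assumptions made for the example.

```python
import torch
import torch.nn as nn

class FirstLevelNet(nn.Module):
    """Convolutional + fully connected layers: per-frame mapping to speaking style parameters."""
    def __init__(self, in_dim=144, out_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, 256, kernel_size=3, padding=1)
        self.fc = nn.Linear(256, out_dim)

    def forward(self, x):                        # x: (batch, frames, in_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.fc(torch.relu(h))            # (batch, frames, out_dim)

class SecondLevelNet(nn.Module):
    """Sequence-to-sequence model capturing temporal context across frames."""
    def __init__(self, in_dim=144, out_dim=64):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, 256, batch_first=True)
        self.fc = nn.Linear(256, out_dim)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.fc(h)

first_level, second_level = FirstLevelNet(), SecondLevelNet()
l1_fn = l2_fn = nn.MSELoss()                     # assumed forms of the first and second loss functions
b1, b2 = 1.0, 1.0                                # adjustable weights of the two loss functions

def forward_and_loss(input_seq, target_seq):
    """input_seq: speech features concatenated per frame with the linear combination style feature vector (an assumption)."""
    out1 = first_level(input_seq)
    out2 = second_level(input_seq)
    pred = out1 + out2                           # superposition unit: add the two predictions
    loss = b1 * l1_fn(out1, target_seq) + b2 * l2_fn(out2, target_seq)
    return loss, pred
```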
  • Figure 23 is a schematic flowchart of a speaking style generation method according to some embodiments.
  • Figure 23 is a detailed description of an implementation method when performing S502 based on the embodiment shown in Figure 18 or Figure 20, as follows:
  • S5021 Train the first-level network model according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model.
  • the intermediate speaking style model includes the second-level network model and the trained first-level network model.
  • the weight b2 of the second loss function is set to approach 0.
  • the loss function of the current network model can be understood as the first loss function, and the linear combination training samples are input separately to the first-level network model and the second-level network model.
  • based on the first-level prediction output sample, the first loss function and the corresponding linear combination output sample, the first loss value can be obtained, and the model parameters of the first-level network model are adjusted in the direction in which the first loss value decreases until the first loss value converges, and the trained first-level network model is obtained.
  • the framework of the speaking style model trained in the first stage is the intermediate speaking style model.
  • the model parameters of the trained first-level network model need to be fixed first.
  • S5023 Train the secondary network model in the intermediate speaking style model according to the linear combination training sample set and the second loss function to obtain the speaking style model.
  • the speaking style model includes the trained first-level network and the trained second-level network.
  • the loss function of the current network model can be understood as the second loss function.
  • the linear combination training samples are input to the second-level network model and the trained first-level network model. Based on the predicted output sample output by the superposition unit, the second loss function and the corresponding linear combination output sample, the second loss value can be obtained, and the model parameters of the second-level network model are adjusted in the direction in which the second loss value decreases until the second loss value converges, and the trained second-level network model is obtained.
  • the framework of the speaking style model trained in the second stage is the speaking style model.
  • the network model includes a first-level network model, a second-level network model and a superposition unit.
  • the output end of the first-level network model and the output end of the second-level network model are both connected to the input end of the superposition unit.
  • the output end of the superposition unit is used to output the predicted output sample;
  • the loss function includes the first loss function and the second loss function; the first-level network model is trained according to the linear combination training sample set and the first loss function to obtain the intermediate speaking style model, and the intermediate speaking style model includes the second-level network model and the trained first-level network model; the model parameters of the trained first-level network model are fixed; the second-level network model in the intermediate speaking style model is trained according to the linear combination training sample set and the second loss function to obtain the speaking style model, and the speaking style model includes the trained first-level network and the trained second-level network.
  • the network model can be trained in stages, which can improve the convergence speed of the network model, that is, shorten the training time of the network model
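  • a hedged sketch of the two-stage training described above, reusing first_level, second_level, l1_fn and l2_fn from the previous sketch; train_loader is a hypothetical source of linear combination training samples (a tiny synthetic stand-in is built here), and the optimizer, epoch count and fixed learning rate stand in for the convergence test described in the text.

```python
import torch

# train_loader: hypothetical batches of (input_seq, linear combination output sample) pairs;
# synthetic data with the dimensions assumed in the previous sketch keeps the example runnable.
train_loader = [(torch.randn(4, 50, 144), torch.randn(4, 50, 64)) for _ in range(8)]

def train_stage(model_to_update, stage_loss, loader, epochs=10, lr=1e-3):
    """Run one training stage; only `model_to_update` receives gradient updates."""
    optimizer = torch.optim.Adam(model_to_update.parameters(), lr=lr)
    for _ in range(epochs):
        for input_seq, target_seq in loader:      # linear combination training samples
            out1 = first_level(input_seq)
            out2 = second_level(input_seq)
            pred = out1 + out2                    # superposition unit
            loss = stage_loss(out1, out2, pred, target_seq)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: b2 approaches 0, so only the first loss matters; train the first-level network.
train_stage(first_level, lambda o1, o2, p, t: l1_fn(o1, t), loader=train_loader)

# Fix the model parameters of the trained first-level network before the second stage.
for param in first_level.parameters():
    param.requires_grad_(False)

# Stage 2: train the second-level network with the second loss on the superimposed prediction.
train_stage(second_level, lambda o1, o2, p, t: l2_fn(p, t), loader=train_loader)
```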
  • Figure 24 is a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
  • the computer device includes a processor 910 and a memory 920; the number of processors 910 in the computer device can be one or more.
  • one processor 910 is taken as an example; the processor 910 and the memory 920 in the computer device may be connected through a bus or other means. In Figure 24, the connection through a bus is taken as an example.
  • the memory 920 can be used to store software programs, computer-executable programs and modules, such as program instructions/modules corresponding to the speaking style generation method in the embodiments of the present disclosure, or program instructions/modules corresponding to the virtual digital human driving method in the embodiments of the present disclosure.
  • the processor 910 executes various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the memory 920, that is, implements the speaking style generation method or the virtual digital human driving method provided by the embodiments of the present disclosure.
  • the memory 920 may mainly include a stored program area and a stored data area, where the stored program area may store an operating system and an application program required for at least one function; the stored data area may store data created according to the use of the terminal, etc.
  • the memory 920 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • memory 920 may further include memory located remotely relative to processor 910, and these remote memories may be connected to the computer device through a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure relates to a virtual digital human driving method, apparatus, device and medium. The method includes: acquiring user information, the user information including voice information and image information; determining a user intention and a user emotion according to the user information; determining a reply text of a virtual digital human according to the user intention, and determining a reply emotion of the virtual digital human according to the user intention and the user emotion; and determining limb actions of the virtual digital human according to the user intention, and determining an emotion expression mode of the virtual digital human according to the reply emotion. The present disclosure achieves a natural and anthropomorphic interaction state of the virtual human, thereby improving the simulation effect and the naturalness of expression of the virtual digital human.
PCT/CN2023/079026 2022-06-22 2023-03-01 Procédé de commande d'être humain numérique virtuel, appareil, dispositif et support WO2023246163A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210714001.7A CN115270922A (zh) 2022-06-22 2022-06-22 说话风格生成方法、装置、电子设备和存储介质
CN202210714001.7 2022-06-22
CN202210751784.6A CN117370605A (zh) 2022-06-28 2022-06-28 一种虚拟数字人驱动方法、装置、设备和介质
CN202210751784.6 2022-06-28

Publications (2)

Publication Number Publication Date
WO2023246163A1 true WO2023246163A1 (fr) 2023-12-28
WO2023246163A9 WO2023246163A9 (fr) 2024-02-29

Family

ID=89379111

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/079026 WO2023246163A1 (fr) 2022-06-22 2023-03-01 Procédé de commande d'être humain numérique virtuel, appareil, dispositif et support

Country Status (1)

Country Link
WO (1) WO2023246163A1 (fr)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108363706A (zh) * 2017-01-25 2018-08-03 北京搜狗科技发展有限公司 人机对话交互的方法和装置、用于人机对话交互的装置
CN109271018A (zh) * 2018-08-21 2019-01-25 北京光年无限科技有限公司 基于虚拟人行为标准的交互方法及系统
US20200111482A1 (en) * 2019-09-30 2020-04-09 Lg Electronics Inc. Artificial intelligence apparatus and method for recognizing speech in consideration of utterance style
CN112396693A (zh) * 2020-11-25 2021-02-23 上海商汤智能科技有限公司 一种面部信息的处理方法、装置、电子设备及存储介质
CN114357135A (zh) * 2021-12-31 2022-04-15 科大讯飞股份有限公司 交互方法、交互装置、电子设备以及存储介质
CN115270922A (zh) * 2022-06-22 2022-11-01 海信视像科技股份有限公司 说话风格生成方法、装置、电子设备和存储介质

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690416A (zh) * 2024-02-02 2024-03-12 江西科技学院 一种人工智能交互方法及人工智能交互系统
CN117690416B (zh) * 2024-02-02 2024-04-12 江西科技学院 一种人工智能交互方法及人工智能交互系统
CN117828320A (zh) * 2024-03-05 2024-04-05 元创者(厦门)数字科技有限公司 一种虚拟数字人构建方法及其系统
CN117828320B (zh) * 2024-03-05 2024-05-07 元创者(厦门)数字科技有限公司 一种虚拟数字人构建方法及其系统

Also Published As

Publication number Publication date
WO2023246163A9 (fr) 2024-02-29

Similar Documents

Publication Publication Date Title
US20230316643A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
WO2021043053A1 (fr) Procédé de commande d'images d'animation basé sur l'intelligence artificielle et dispositif associé
CN110688911B (zh) 视频处理方法、装置、系统、终端设备及存储介质
WO2022116977A1 (fr) Procédé et appareil de commande d'action destinés à un objet cible, ainsi que dispositif, support de stockage et produit programme informatique
WO2023246163A1 (fr) Procédé de commande d'être humain numérique virtuel, appareil, dispositif et support
Wu et al. Survey on audiovisual emotion recognition: databases, features, and data fusion strategies
Hong et al. Real-time speech-driven face animation with expressions using neural networks
US8725507B2 (en) Systems and methods for synthesis of motion for animation of virtual heads/characters via voice processing in portable devices
KR101604593B1 (ko) 이용자 명령에 기초하여 리프리젠테이션을 수정하기 위한 방법
CN110286756A (zh) 视频处理方法、装置、系统、终端设备及存储介质
CN113454708A (zh) 语言学风格匹配代理
KR101306221B1 (ko) 3차원 사용자 아바타를 이용한 동영상 제작장치 및 방법
Scherer et al. A generic framework for the inference of user states in human computer interaction: How patterns of low level behavioral cues support complex user states in HCI
JP2018014094A (ja) 仮想ロボットのインタラクション方法、システム及びロボット
CN110874137B (zh) 一种交互方法以及装置
WO2023284435A1 (fr) Procédé et appareil permettant de générer une animation
CN110148406B (zh) 一种数据处理方法和装置、一种用于数据处理的装置
WO2021232876A1 (fr) Procédé et appareil d'entraînement d'humain virtuel en temps réel, dispositif électronique et support d'enregistrement
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
JP2023552854A (ja) ヒューマンコンピュータインタラクション方法、装置、システム、電子機器、コンピュータ可読媒体及びプログラム
CN110794964A (zh) 虚拟机器人的交互方法、装置、电子设备及存储介质
CN112652041A (zh) 虚拟形象的生成方法、装置、存储介质及电子设备
CN113205569B (zh) 图像绘制方法及装置、计算机可读介质和电子设备
WO2021232877A1 (fr) Procédé et appareil de pilotage d'un être humain virtuel en temps réel, dispositif électronique, et support
CN117809679A (zh) 一种服务器、显示设备及数字人交互方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23825821

Country of ref document: EP

Kind code of ref document: A1