WO2023216765A1 - Multi-modal interaction method and device - Google Patents

Multi-modal interaction method and device

Info

Publication number
WO2023216765A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
interaction
user
modal
virtual character
Prior art date
Application number
PCT/CN2023/085827
Other languages
English (en)
French (fr)
Inventor
朱鹏程
马远凯
罗智凌
周伟
李禹�
钱景
Original Assignee
阿里巴巴(中国)有限公司
Priority date
Filing date
Publication date
Application filed by 阿里巴巴(中国)有限公司
Publication of WO2023216765A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/205 - 3D [Three Dimensional] animation driven by audio data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 - 3D [Three Dimensional] image rendering
    • G06T 15/005 - General purpose rendering architectures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 - Manipulating 3D models or images for computer graphics
    • G06T 19/006 - Mixed reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2203/00 - Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F 2203/01 - Indexing scheme relating to G06F3/01
    • G06F 2203/011 - Emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns

Definitions

  • the embodiments of this specification relate to the field of computer technology, and in particular to a multi-modal interaction method for virtual characters.
  • embodiments of this specification provide a multi-modal interaction method.
  • One or more embodiments of this specification simultaneously relate to a multi-modal interaction device, a computing device, a computer-readable storage medium, and a computer program, so as to address technical deficiencies in the prior art.
  • a multi-modal interaction method is provided, which is applied to a virtual character interactive control system, including:
  • Receive multi-modal data, wherein the multi-modal data includes voice data and video data;
  • the three-dimensional rendering model is used to generate an image of the virtual character including the action interaction strategy, so as to drive the virtual character to perform multi-modal interaction.
  • a multi-modal interaction device is provided, which is applied to a virtual character interactive control system, including:
  • a data receiving module configured to receive multi-modal data, wherein the multi-modal data includes voice data and video data;
  • a data recognition module configured to identify the multi-modal data and obtain user intention data and/or user gesture data, wherein the user gesture data includes user emotion data and user action data;
  • a strategy determination module configured to determine an avatar interaction strategy based on the user intention data and/or user gesture data, wherein the avatar interaction strategy includes a text interaction strategy and/or an action interaction strategy;
  • a rendering model acquisition module configured to acquire a three-dimensional rendering model of the virtual character
  • the interaction driving module is configured to use the three-dimensional rendering model to generate an image of the virtual character including the action interaction strategy based on the virtual character interaction strategy, so as to drive the virtual character to perform multi-modal interaction.
  • a computing device including:
  • the memory is used to store computer-executable instructions
  • the processor is used to execute the computer-executable instructions, and when the processor executes the computer-executable instructions, the steps of the multi-modal interaction method are implemented.
  • a computer-readable storage medium is provided, which stores computer-executable instructions; when the instructions are executed by a processor, the steps of the above-mentioned multi-modal interaction method are implemented.
  • a computer program is provided, wherein when the computer program is executed in a computer, the computer is caused to perform the steps of the above-mentioned multi-modal interaction method.
  • One embodiment of this specification provides a multi-modal interaction method applied to a virtual character interaction control system: receiving multi-modal data, wherein the multi-modal data includes voice data and video data; identifying the multi-modal data to obtain user intention data and/or user posture data, wherein the user posture data includes user emotion data and user action data; determining a virtual character interaction strategy based on the user intention data and/or user posture data, wherein the virtual character interaction strategy includes a text interaction strategy and/or an action interaction strategy; obtaining a three-dimensional rendering model of the virtual character; and, based on the virtual character interaction strategy, using the three-dimensional rendering model to generate an image of the virtual character that includes the action interaction strategy, so as to drive the virtual character to perform multi-modal interaction.
  • Based on the received voice data and video data, the user's communication intention and/or the user's corresponding posture are determined; then, according to the user's communication intention and/or corresponding posture, the specific interaction strategy between the virtual character and the user is determined, and the virtual character is driven to complete the interaction process with the user according to the determined interaction strategy.
  • This method can not only detect and identify the user's emotions and actions, but also take the virtual character's response to the user's emotions and actions into account when deciding the virtual character interaction strategy, so that the virtual character has corresponding response strategies for the emotions and/or actions expressed by the user. This not only lowers the latency of the entire interaction process, but also makes the interaction process between the user and the virtual character smoother, giving the user a better interactive experience.
  • Figure 1 is a schematic system structure diagram of a multi-modal interaction method applied to a virtual character interaction control system according to an embodiment of this specification;
  • Figure 2 is a flow chart of a multi-modal interaction method provided by an embodiment of this specification
  • Figure 3 is a system architecture diagram of a virtual character interactive control system provided by an embodiment of this specification.
  • Figure 4 is a schematic diagram of the processing process of a multi-modal interaction method provided by an embodiment of this specification
  • Figure 5 is a schematic structural diagram of a multi-modal interaction device provided by an embodiment of this specification.
  • Figure 6 is a structural block diagram of a computing device provided by an embodiment of this specification.
  • Although the terms first, second, etc. may be used to describe various information in one or more embodiments of this specification, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • the first may also be called the second, and similarly, the second may also be called the first.
  • the word "if” as used herein may be interpreted as "when” or “when” or “in response to determining.”
  • Multi-modal interaction Users can communicate with digital humans through voice, text, expressions, movements, and gestures, and digital humans can also respond to users through voice, text, expressions, movements, and gestures.
  • Duplex interaction An interactive method that enables real-time, two-way communication. Users and digital humans can interrupt or reply to each other at any time.
  • Non-exclusive dialogue The two parties in the dialogue can conduct two-way communication, and the user and the digital person can interrupt or reply to each other at any time.
  • VAD: Voice Activity Detection, also known as voice endpoint detection or voice boundary detection.
  • TTS: Text To Speech, speech synthesis technology.
  • Digital human refers to a virtual character with a digital appearance that can be used to interact with real people in virtual reality applications.
  • the traditional interaction method is an exclusive question-and-answer format using voice as the carrier.
  • The current intelligent dialogue control systems for virtual character interaction may only support voice duplex capabilities, lack video understanding and visual duplex state decision-making capabilities, and cannot perceive multi-modal information such as the user's expressions, actions and environment. Some dialogue systems even only support basic question-and-answer capabilities, have no duplex capabilities (active/passive interruption, acceptance), video understanding capabilities or visual duplex state decision-making capabilities, and likewise cannot perceive multi-modal information such as the user's expressions, actions and environment.
  • a multi-modal interaction method provided by the embodiments of this specification is applied to the virtual character interaction control system.
  • By setting up a multi-modal control module, a multi-modal duplex state module and a basic dialogue module, the interaction process between the virtual character and real users can, on the basis of completing basic dialogue tasks, also identify and process multi-modal data, so that the virtual character can actively take over or interrupt user conversations and the interaction delay of the system is shortened, while multi-modal information such as the user's expressions, actions and gestures is perceived. This can be used in various complex application scenarios, such as identity verification, fault damage assessment and item verification, with good application effects.
  • The function of the multi-modal control module is to control the input and output of the video stream and voice stream in the interactive system. At the input end, this module segments and understands the input voice stream and video stream and controls whether the multi-modal duplex system is triggered, reducing transmission cost while improving the processing efficiency of the system. At the output end, it is responsible for rendering the results of the system into a digital human video stream for output.
  • the function of the multi-modal duplex status management module is to manage the status of the current conversation and determine the duplex status.
  • The current duplex status includes: 1) duplex active/passive interruption, 2) duplex active acceptance, 3) calling the basic dialogue system or business logic, 4) no feedback; a minimal illustrative sketch of these states is given below.
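  • For illustration only, the four duplex states above could be represented as an enumeration with a toy decision function; the names and the decision order below are assumptions, not part of the published claims.

```python
from enum import Enum, auto

class DuplexState(Enum):
    """Duplex states maintained by the multi-modal duplex state management module."""
    ACTIVE_OR_PASSIVE_INTERRUPTION = auto()   # 1) duplex active/passive interruption
    ACTIVE_ACCEPTANCE = auto()                # 2) duplex active acceptance
    CALL_BASIC_DIALOGUE_OR_BUSINESS = auto()  # 3) call basic dialogue system or business logic
    NO_FEEDBACK = auto()                      # 4) no feedback

def decide_duplex_state(has_interruption_intent: bool,
                        needs_acceptance: bool,
                        needs_answer: bool) -> DuplexState:
    """Toy decision order; the real priority logic is not specified in the text."""
    if has_interruption_intent:
        return DuplexState.ACTIVE_OR_PASSIVE_INTERRUPTION
    if needs_acceptance:
        return DuplexState.ACTIVE_ACCEPTANCE
    if needs_answer:
        return DuplexState.CALL_BASIC_DIALOGUE_OR_BUSINESS
    return DuplexState.NO_FEEDBACK
```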
  • The basic dialogue module provides basic business logic and dialogue question-and-answer capabilities.
  • a multi-modal interaction method is provided.
  • This specification also relates to a multi-modal interaction device, a computing device, a computer-readable storage medium, and a computer program, which are explained in detail one by one in the following embodiments.
  • Figure 1 shows a schematic structural diagram of a system in which a multi-modal interaction method is applied to a virtual character interaction control system according to an embodiment of this specification.
  • Figure 1 shows an avatar interactive control system 100, and the avatar interactive control system 100 includes a multi-modal control module 102 and a multi-modal duplex state management module 104.
  • The multi-modal control module 102 in the virtual character interactive control system 100 can serve as the input for video streams and voice streams, and can also serve as the output for the virtual character's interactive video stream; the multi-modal input part includes video stream input and voice stream input.
  • the multi-modal control module 102 performs emotion detection and gesture detection on the video stream
  • the multi-modal control module 102 performs speech detection on the voice stream
  • The interaction strategy of the virtual character is determined, wherein the interaction strategy can be mainly divided into action acceptance and copy + action acceptance.
  • the multi-modal duplex state management module 104 can also render the virtual character according to the determined virtual character interaction strategy, thereby outputting the rendered video stream of the virtual character through the multi-modal control module 102 .
  • the multi-modal interaction method enables the virtual character to interact with the user by visually understanding the user in the video stream, sensing the user's emotions, actions, etc., and providing a way for the virtual character to take over or interrupt.
  • the interaction process becomes a non-exclusive conversation, and the virtual character can also provide multi-modal interaction of emotion, movement and/or voice.
  • Figure 2 shows a flow chart of a multi-modal interaction method provided by an embodiment of this specification, which specifically includes the following steps.
  • The multi-modal interaction method provided by the embodiments of this specification is applied to the virtual character interaction control system, so that the interaction between the virtual character and the user can achieve a small interaction delay and smooth communication, simulating the process of human-to-human interaction.
  • Step 202 Receive multi-modal data, where the multi-modal data includes voice data and video data.
  • the virtual character interactive control system can receive multi-modal data.
  • the multi-modal data is voice data and video data corresponding to the user.
  • the voice data can be understood as the voice data of the user communicating with the virtual character.
  • the voice data expressed by the user to the avatar, "Can I inquire about placing an insurance order?"
  • The video data can be understood as the expression, movement, mouth shape and surrounding environment of the user when the user expresses the above voice data to the avatar. Following the above example, when the user expresses the above voice data, the expression shown in the video data is confusion, the action is a hand-spreading action, and the mouth shape is the mouth shape corresponding to the above voice data.
  • In order to achieve simulated human interaction between the virtual character and the user, the virtual character needs to respond immediately to the user's voice data and video data to reduce the delay generated during the interaction process, and it also needs to support functions such as interaction, interruption and acceptance between both parties.
  • Step 204 Identify the multi-modal data and obtain user intention data and/or user gesture data, where the user gesture data includes user emotion data and user action data.
  • the user intention data can be understood as the intention of the voice data expressed by the user.
  • the purpose of the voice data of "Can I inquire about the insurance order?" is to ask the virtual character whether it can help inquire about the user's previous insurance order.
  • User posture data can be understood as the posture data expressed by the user in the video data, including user emotion data and user action data.
  • the user's face expresses the "doubtful” emotion
  • the user's hands display the "hand-spreading" movement.
  • the virtual character interactive control system can identify multi-modal data and separately identify voice data and video data in multi-modal data.
  • user intention data is obtained by recognizing voice data
  • user gesture data is obtained by recognizing video data
  • The user gesture data may include user emotion data and user action data. It should be noted that in different application scenarios, for the user's multi-modal data, only user intention data, only user posture data, or both user intention data and user posture data may be identified; the expression "and/or" used in this embodiment does not impose any limitation on this.
  • the virtual character interactive control system can recognize voice data and video data respectively to determine the user's intentions, emotions, actions, postures and other information.
  • Identifying the multi-modal data and obtaining user intention data and/or user gesture data includes: performing text conversion on the voice data in the multi-modal data, identifying the converted text data, and obtaining user intention data; and/or performing emotion recognition on the video data and/or voice data in the multi-modal data to obtain user emotion data; performing gesture recognition on the video data in the multi-modal data to obtain user action data; and determining user gesture data based on the user emotion data and the user action data.
  • the virtual character interactive control system can first perform text conversion on the voice data in the multi-modal data, and then identify the converted text data to obtain user intention data.
  • The specific method of converting speech to text includes but is not limited to using ASR technology; this embodiment does not specifically limit the conversion method. It should be noted that, in order to ensure that the interactive system can provide instant feedback even while the user is speaking, the system can segment the voice stream according to a VAD interval of 200 ms, divide it into small speech units, and then input each speech unit into the ASR module to convert it into text, facilitating subsequent identification of user intention data (a minimal sketch of this segmentation is given below).
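  • As a rough illustration of the 200 ms VAD segmentation described above, the sketch below splits an incoming audio stream into small speech units at silent gaps of roughly 200 ms and hands each unit to an ASR callable; the frame length, `is_speech` detector and `asr_to_text` function are hypothetical placeholders, not APIs defined by this specification.

```python
from typing import Callable, Iterable, List

FRAME_MS = 20          # assumed frame length of the incoming audio stream
VAD_SILENCE_MS = 200   # segmentation threshold mentioned in the text

def segment_by_vad(frames: Iterable[bytes],
                   is_speech: Callable[[bytes], bool]) -> Iterable[List[bytes]]:
    """Yield small speech units whenever ~200 ms of continuous silence is seen."""
    unit: List[bytes] = []
    silence_ms = 0
    for frame in frames:
        if is_speech(frame):
            unit.append(frame)
            silence_ms = 0
        else:
            silence_ms += FRAME_MS
            if unit and silence_ms >= VAD_SILENCE_MS:
                yield unit          # a complete speech unit, ready for ASR
                unit = []
    if unit:
        yield unit

def transcribe_stream(frames, is_speech, asr_to_text):
    """Feed each speech unit to ASR so intent recognition can start mid-utterance."""
    for unit in segment_by_vad(frames, is_speech):
        yield asr_to_text(b"".join(unit))
```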
  • the virtual character interactive control system can then perform user emotion recognition and gesture recognition based on the video data.
  • user emotion recognition can be recognized not only based on video data, but also based on voice data, or based on both video data and voice data.
  • emotion recognition can be performed based on the user's facial expression changes (eyes, lip twitches) or head shaking movements in video data.
  • Another example is to perform emotion recognition based on the volume and tone of voice data.
  • the virtual character interactive control system can also recognize the actions displayed by the user based on the video data, such as recognizing the user's gestures. When the user makes a hand gesture, the user's action data can be obtained.
  • the virtual character interactive control system can comprehensively determine the user's posture data based on the user's emotional data and user action data.
  • In the multi-modal interaction method provided by the embodiments of this specification, the virtual character interactive control system can identify and perceive minor changes in the user's voice data and video data, so as to accurately capture the user's intentions and dynamics and facilitate subsequent decisions on the strategies and methods the virtual character should use to achieve multi-modal interaction with the user.
  • the virtual character interactive control system can adopt a two-stage identification method, that is, first conduct rough detection of emotions, and then classify the emotions to obtain the target emotion.
  • the step of performing emotion recognition on the video data in the multi-modal data and obtaining user emotion data includes:
  • Emotion detection is performed on the video data in the multi-modal data.
  • the target emotion in the video data is classified to obtain user emotion data.
  • The target emotion can be understood as a user emotion preset by the system, such as anger, displeasure, neutrality, happiness, surprise, etc.
  • the virtual character interactive control system can first perform emotion detection on the video data in the multi-modal data.
  • the target emotion can be classified and determined to obtain user emotion data.
  • That is, the system adopts a two-stage recognition method: it first recognizes the user's expression in the video stream, and when it detects that the video stream contains a target emotion, the target emotion is classified into an emotion category to determine the final user emotion data.
  • the virtual character interactive control system can be configured with a target emotion rough recall module and an emotion classification module.
  • the target emotion coarse recall module can conduct coarse-grained detection of the video stream, and the emotion classification module can perform emotion classification of the target emotion on the video stream to determine whether the user's emotion data is angry, unhappy, neutral, happy, or surprised.
  • the target emotion rough recall module can use the ResNet18 model, and the emotion classification module can use the sequential Transformer model, but this embodiment is not limited to the use of these two model types.
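  • The two-stage design (a lightweight coarse-recall detector gating a heavier sequence classifier) might be organized as in the sketch below; the injected callables stand in for the ResNet18-style recall model and the temporal Transformer classifier named above, and the gating threshold is an assumed value.

```python
from typing import Callable, List, Optional, Sequence

EMOTIONS = ["angry", "unhappy", "neutral", "happy", "surprised"]
RECALL_THRESHOLD = 0.5   # assumed gating threshold, not given in the text

def recognize_emotion(frames: Sequence,                          # frames of the video stream
                      coarse_recall: Callable[[object], float],     # e.g. a ResNet18-style scorer
                      classify: Callable[[Sequence], List[float]]   # e.g. a temporal Transformer
                      ) -> Optional[str]:
    """Stage 1 gates the video; stage 2 runs only when a target emotion is present."""
    # Stage 1: coarse-grained detection per frame; skip downstream work if nothing is found.
    if max((coarse_recall(f) for f in frames), default=0.0) < RECALL_THRESHOLD:
        return None            # no specified emotion: the video stream is not forwarded
    # Stage 2: fine-grained classification over the frame sequence.
    scores = classify(frames)
    return EMOTIONS[scores.index(max(scores))]
```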
  • When the virtual character interactive control system does not detect that the user has a specified emotion, it does not transmit the video stream downstream, which not only reduces the transmission cost of the system but also speeds up its recognition efficiency.
  • When the virtual character interactive control system performs gesture recognition on the user's video data, it can also use a two-stage recognition method. Specifically, the step of performing gesture recognition on the video data in the multi-modal data and obtaining the user action data includes:
  • Gesture detection is performed on the video data in the multi-modal data.
  • the target gesture in the video data is classified to obtain user action data.
  • The target gesture can be understood as a gesture type preset by the system, such as gestures with clear meanings (such as OK, numbers, or left and right swipes), unsafe gestures (such as the middle finger and little finger), and customized special gestures.
  • the virtual character interactive control system can perform gesture detection on video data in multi-modal data.
  • the target gesture can be classified to obtain user action data.
  • The gesture recognition process can also use a target gesture coarse recall module and a gesture classification module, that is, first achieve coarse-grained recognition of user gestures in the video stream, and then classify the target gesture to determine whether the user action data is a gesture with clear meaning (such as OK, numbers, or swiping left and right), an unsafe gesture (such as the middle finger or little finger), or a customized special gesture.
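  • Gesture recognition can follow the same two-stage pattern; in the sketch below, the gesture labels are taken from the examples in the text, while the detector and classifier callables and the threshold are assumed stand-ins.

```python
from typing import Callable, List, Optional, Sequence

GESTURES = ["ok", "number", "swipe_left", "swipe_right",   # clear-meaning gestures
            "middle_finger", "little_finger",               # unsafe gestures
            "custom_special"]                               # customized special gestures
GESTURE_THRESHOLD = 0.5   # assumed gating threshold

def recognize_gesture(frames: Sequence,
                      coarse_detect: Callable[[object], float],
                      classify: Callable[[Sequence], List[float]]) -> Optional[str]:
    """Coarse detection first; classify the target gesture only when one is present."""
    if max((coarse_detect(f) for f in frames), default=0.0) < GESTURE_THRESHOLD:
        return None
    scores = classify(frames)
    return GESTURES[scores.index(max(scores))]
```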
  • the multi-modal interaction method uses a two-stage recognition method to identify user emotions and user actions, which can not only quickly complete the recognition process, but also reduce the transmission cost of the system and improve the recognition efficiency of the system.
  • the virtual character interaction control system can first call the pre-stored basic dialogue data to support the basic interaction process that can be realized. Specifically, after the step of identifying the multi-modal data and obtaining user intention data and/or user gesture data, the step further includes:
  • Call pre-stored basic dialogue data, wherein the basic dialogue data includes basic voice data and/or basic action data; render the output video stream of the virtual character based on the basic dialogue data, and drive the virtual character to display the output video stream.
  • Basic dialogue data can be understood as voice and/or action data pre-stored in the system that can drive virtual characters to achieve basic interaction.
  • For example, the basic dialogue data includes basic communication voice data stored in the database, including but not limited to "Hello", "Thank you", "Do you have any questions", etc.
  • It also includes action data for basic communication, including but not limited to a "love" action, a "shaking head" action, a "nodding" action, etc.
  • During specific implementation, the virtual character interactive control system can also search, based on the user intention data and/or user posture data, for basic dialogue data that matches the user intention data and/or user posture data from the basic dialogue data pre-stored in the system, and call it. Since the basic dialogue data includes basic voice data and/or basic action data, the virtual character interaction control system can render the output video stream corresponding to the virtual character based on the basic voice data and/or basic action data, so as to drive the virtual character to display the output video stream.
  • the basic conversation data may also include basic business data completed by a virtual character preset by the system, such as providing basic business services to users, etc., which is not specifically limited in this embodiment.
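  • One way to picture the pre-stored basic dialogue data is as a simple lookup keyed by a recognized intent or gesture, returning basic voice data and a basic action tag; the table contents and matching rule here are purely illustrative.

```python
from typing import Optional, Tuple

# Hypothetical store: intent/gesture key -> (basic voice text, basic action tag)
BASIC_DIALOGUE_DATA = {
    "greeting":  ("Hello", "nod"),
    "thanks":    ("Thank you", "love"),
    "uncertain": ("Do you have any questions?", "shake_head"),
}

def call_basic_dialogue(intent: Optional[str],
                        gesture: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return matching basic voice/action data used to render the output video stream."""
    for key in (intent, gesture):
        if key in BASIC_DIALOGUE_DATA:
            return BASIC_DIALOGUE_DATA[key]
    return None  # no match: fall through to the duplex decision / basic dialogue system
```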
  • Through recognition based on multi-modal data, the virtual character interactive control system can clarify the user's intentions and the emotions, actions, gestures and other multi-modal information expressed by the user, so that the virtual character can produce human-like interactive expressions targeted at the user's emotion data and posture data.
  • In the embodiments of this specification, the virtual character interaction control system also provides multi-modal capabilities, so that the virtual character can interact with the user like a human being and can realize interaction states such as duplex active acceptance and duplex active/passive interruption.
  • the duplex state decision-making module is used to determine the virtual character interaction strategy and realize the acceptance/interruption of multi-modal duplex.
  • the virtual character interactive control system can be designed with three interactive modules, as shown in Figure 3.
  • Figure 3 shows a system architecture diagram of the virtual character interactive control system provided by the embodiment of this specification.
  • Figure 3 includes three modules: multimodal control module, multimodal duplex state management module and basic dialogue module.
  • The above three modules can also be regarded as subsystems, namely the multi-modal control system, the multi-modal duplex state management system and the basic dialogue system.
  • the multi-modal control system controls the input and output of video streams and voice streams in the interactive system. On the input side, this module segments and understands the input voice stream and video stream.
  • the core includes the processing functions of voice stream, streaming video expression and streaming video action. At the output end, it is responsible for rendering the results of the system into a digital human video stream output.
  • the multimodal duplex state management system is responsible for managing the status of the current conversation and deciding the current duplex strategy.
  • The current duplex strategy includes duplex active/passive interruption, duplex active acceptance, calling the basic dialogue system or business logic, and no feedback.
  • the basic dialogue system includes basic business logic and dialogue question and answer capabilities, and has basic question and answer interaction capabilities; that is, the user's question is input, and the system outputs the answer to the question.
  • NLU: Natural Language Understanding
  • DM: Dialogue Management
  • NLG: Natural Language Generation
  • the specific implementation process of the multi-modal duplex state management module can be described in detail to clarify how to provide virtual characters with the ability to accept and interrupt each other in the virtual character interactive control system.
  • Step 206 Determine an avatar interaction strategy based on the user intention data and/or user gesture data, where the avatar interaction strategy includes a text interaction strategy and/or an action interaction strategy.
  • the virtual character interaction strategy can be understood as the copywriting decision-making, action decision-making or the combination of copywriting decision-making and action decision-making undertaken between the virtual character and the user, that is, text interaction strategy and/or action interaction strategy.
  • the text interaction strategy can be understood as the interactive text corresponding to the user's voice data by the virtual character, and whether the interactive text needs to be interrupted in the middle of the sentence or continued at the end of the sentence in the voice text expressed by the user.
  • the action interaction strategy can be understood as the interaction posture corresponding to the user's posture data of the virtual character, and whether the interaction posture needs to be interrupted in the middle of the sentence in the voice text expressed by the user, or continued at the end of the sentence.
  • the virtual character interaction control system can determine the textual content of the virtual character based on the user's intention data, whether it is interrupted in the middle of the user's sentence or continued at the end of the user's sentence, that is, the text interaction strategy.
  • the avatar interaction control system can also determine the avatar's gesture acceptance content based on the user's gesture data, whether it is interrupting the gesture in the middle of the user's sentence or performing gesture acceptance at the end of the user's sentence, that is, the action interaction strategy.
  • It should be noted that the avatar does not necessarily have both a text interaction strategy and an action interaction strategy; that is, the text interaction strategy and the action interaction strategy may also have an "and/or" relationship.
  • In addition, the virtual character can not only take over or interrupt the user's interaction, but also support giving no feedback: when the user's VAD time has not reached 800 ms and there is no need to call the basic dialogue system or business logic to answer, the system does not give any feedback.
  • The step of determining the virtual character interaction strategy based on the user intention data and/or user gesture data includes: performing fusion processing on the video data in the multi-modal data based on the user intention data and/or user gesture data to determine the user's target intention text and/or target gesture action; and determining the virtual character interaction strategy based on the target intention text and/or the target gesture action.
  • During specific implementation, the virtual character interaction control system can also perform fusion and alignment processing on the text, video stream and voice stream to comprehensively determine the user's target intention text and/or target gesture action. The specific virtual character interaction strategy can then be determined based on the target intention text and/or the target gesture action.
  • For example, the emotion classification module may have recognized the user's facial expression as a smile, while the user is actually expressing a helpless, bitter smile. To solve this problem, the virtual character interactive control system can make multi-modal judgments based on the user's voice and the currently spoken text to achieve better results. During specific implementation, the system can use a multi-modal classification model to make more refined emotional judgments. Finally, the module outputs the current interaction state, which can include three state slots, namely text, user gesture and user emotion, and the multi-modal duplex state management module uses it to make duplex state decisions (a minimal sketch of such a state record follows).
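  • The interaction state with its three slots (text, user gesture, user emotion) could be carried as a small record like the one below before being handed to the duplex state decision; the `refine_emotion` callable stands in for the multi-modal classification model mentioned above and is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class InteractionState:
    """Three state slots output by the fusion & alignment step."""
    text: Optional[str]      # ASR text of the current speech unit
    gesture: Optional[str]   # recognized user gesture, if any
    emotion: Optional[str]   # recognized user emotion, if any

def fuse_modalities(text: Optional[str],
                    gesture: Optional[str],
                    emotion: Optional[str],
                    refine_emotion: Callable[[Optional[str], Optional[str]], Optional[str]]
                    ) -> InteractionState:
    """Combine speech and video results; refine the emotion using voice plus text
    (e.g. a smile accompanied by a complaint may actually be a bitter smile)."""
    return InteractionState(text=text,
                            gesture=gesture,
                            emotion=refine_emotion(emotion, text))
```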
  • The multi-modal interaction method provided by the embodiments of this specification accurately clarifies the user's interaction purpose by further comprehensively judging the user's intention data and/or user gesture data, avoiding subsequent ineffective communication by the avatar caused by a misidentified interaction purpose, which would reduce the avatar's perceived intelligence.
  • Determining the virtual character interaction strategy based on the target intention text and/or the target gesture action includes: determining the text interaction strategy of the virtual character based on the target intention text; and/or determining the action interaction strategy of the virtual character based on the target gesture action.
  • During specific implementation, the virtual character interaction control system determines the text interaction strategy between the virtual character and the user based on the target intention text. For example, if the user's target intention text is "Check the status of insurance orders", then the avatar's text interaction strategy can take over from the end of the user's sentence, that is, the avatar can express "Wait a moment, I will check for you...". If the user's target intention text is "Why are you so slow, haven't you found it yet?", then the avatar's text interaction strategy can interrupt from the middle of the intention text, that is, right after the user says "Why are you so slow", the avatar can immediately express "Don't worry." In this way, real-time communication between the virtual character and the user can be realized, achieving the effect of simulated human communication.
  • Correspondingly, the avatar interaction control system can also determine the action interaction strategy between the avatar and the user based on the target gesture action. For example, if the user's target gesture is an "OK" gesture, then the avatar's action interaction strategy can also display the "OK" gesture. If the user's target gesture is a "middle finger" gesture, the avatar can respond without any action and reply only with text content, such as "Are you dissatisfied with anything?", or reply only with a "shake head and cry" action.
  • different text interaction strategies and/or action interaction strategies can be determined for different target intention texts and/or target posture actions. For example, if there is only target intention text, it can be determined that the virtual character only responds to the text interaction strategy, or only responds to the action interaction strategy, or a combination of the text interaction strategy and the action interaction strategy. If there are only target posture actions, it can be determined that the virtual character only responds to text interaction strategies, or only action interaction strategies, or a combination of text interaction strategies and action interaction strategies. If both the target intention text and the target gesture action are present, it can be determined that the virtual character only responds to the text interaction strategy, or only responds to the action interaction strategy, or a combination of the text interaction strategy and the action interaction strategy. Furthermore, the embodiments of this specification cannot provide exhaustive examples of all situations, but the avatar interaction control system in this embodiment can support determining different avatar interaction strategies according to different interaction states.
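  • A minimal sketch of mapping the recognized intent text, gesture and emotion to a text and/or action interaction strategy, loosely following the examples above; the individual rules are illustrative assumptions, not the decision logic of the claims.

```python
from typing import Optional, Tuple

def decide_strategy(intent_text: Optional[str],
                    gesture: Optional[str],
                    emotion: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (text interaction strategy, action interaction strategy); either may be None."""
    text_strategy: Optional[str] = None
    action_strategy: Optional[str] = None
    if gesture == "ok":
        action_strategy = "ok"                      # mirror a clear-meaning gesture
    elif gesture == "middle_finger":
        text_strategy = "Are you dissatisfied with anything?"
        action_strategy = "shake_head_and_cry"      # or respond with text only
    if emotion == "unhappy":
        text_strategy = "What are you dissatisfied with?"
        action_strategy = action_strategy or "comfort"
    if intent_text and text_strategy is None:
        text_strategy = "answer:" + intent_text     # defer to the basic dialogue system
    return text_strategy, action_strategy
```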
  • Step 208 Obtain the three-dimensional rendering model of the virtual character.
  • the virtual character interaction control system can obtain the three-dimensional rendering model of the virtual character, so as to subsequently generate the interactive video stream of the virtual character based on the three-dimensional rendering model to complete multi-modal interaction with the user.
  • the virtual character may be composed of a cartoon or computer graphics image, or may be composed of a simulated human image, which is not specifically limited in this embodiment.
  • Step 210 Based on the virtual character interaction strategy, use the three-dimensional rendering model to generate an image of the virtual character including the action interaction strategy, so as to drive the virtual character to perform multi-modal interaction.
  • the avatar interaction control system can use a three-dimensional rendering model to generate an image of the avatar that includes the action interaction strategy of the avatar based on the determined avatar interaction strategy. For example, the head movements, facial expressions, and gestures corresponding to the virtual character drive the rendered virtual character image to achieve multi-modal interaction with the user.
  • the avatar interaction control system can specifically determine the text acceptance position and/or the action acceptance position corresponding to the avatar according to the text interaction strategy and the action interaction strategy, so as to realize a duplex active acceptance process.
  • the step of using the three-dimensional rendering model to generate an image of the virtual character including the action interaction strategy based on the virtual character interaction strategy to drive the virtual character to perform multi-modal interaction includes:
  • Determine the text acceptance position of the virtual character's text interaction based on the text interaction strategy, wherein the text acceptance position is the acceptance position corresponding to the voice data; determine the action acceptance position of the virtual character's action interaction based on the action interaction strategy, wherein the action acceptance position is the acceptance position corresponding to the video data; and, based on the text acceptance position and/or the action acceptance position, use the three-dimensional rendering model to generate an image of the virtual character that includes the action interaction strategy, so as to drive the virtual character to perform multi-modal interaction.
  • The text acceptance position can be understood as the position at which the virtual character's interactive text takes over the voice text expressed by the user, which can be divided into acceptance in the middle of the sentence and acceptance at the end of the sentence. The action acceptance position can be understood as the position at which the virtual character's interactive action takes over the voice text expressed by the user, which can likewise be divided into action acceptance in the middle of the sentence and action acceptance at the end of the sentence.
  • That is, the avatar interaction control system determines the text acceptance position of the virtual character's text interaction and the action acceptance position of its action interaction, and then, based on the text acceptance position and/or the action acceptance position, uses the three-dimensional rendering model to generate a virtual character image containing the action interaction strategy, thereby determining the multi-modal interaction process of the virtual character.
  • Action-only acceptance means that the digital person does not respond verbally but only responds to the user with actions. For example, if the user suddenly waves to greet the digital person during the conversation, the virtual character only needs to reply with a greeting action, without affecting the current conversation state.
  • Action + copy acceptance means that the digital person not only needs to respond to the user with actions, but also needs to respond verbally. This kind of acceptance has a certain impact on the current conversation process, but it also gives the experience a greater sense of intelligence. For example, when it is detected that the user is unhappy during the conversation, the avatar needs to interrupt the current conversation state, actively ask the user what they are dissatisfied with, and make comforting actions at the same time (a toy sketch of this decision follows).
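  • Putting the two dimensions of acceptance together (action-only versus action + copy, and mid-sentence versus end-of-sentence), a toy decision could look like the following; the trigger conditions are assumptions chosen to mirror the two examples above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Acceptance:
    kind: str       # "action_only" or "action_plus_copy"
    position: str   # "mid_sentence" or "end_of_sentence"
    action: str
    copy: str = ""

def plan_acceptance(gesture: Optional[str], emotion: Optional[str]) -> Optional[Acceptance]:
    # A wave during the conversation: reply with a greeting action only,
    # without affecting the current conversation state.
    if gesture == "wave":
        return Acceptance("action_only", "mid_sentence", action="wave_back")
    # An unhappy user: interrupt mid-sentence, ask what is wrong, and comfort.
    if emotion == "unhappy":
        return Acceptance("action_plus_copy", "mid_sentence",
                          action="comfort", copy="What are you dissatisfied with?")
    return None
```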
  • the virtual character interactive control system can also provide a duplex active/passive interruption process.
  • the step of using the three-dimensional rendering model to generate an image of the virtual character including the action interaction strategy based on the virtual character interaction strategy to drive the virtual character to perform multi-modal interaction includes:
  • If it is determined that the user has interruption intention data, the avatar's current multi-modal interaction is suspended; the interruption acceptance interaction data corresponding to the virtual character is determined based on the interruption intention data, and, based on the interruption acceptance interaction data, the three-dimensional rendering model is used to generate an image of the virtual character including the action interaction strategy, so as to drive the virtual character to continue the multi-modal interaction.
  • Interruption intention data can be understood as data that the user has explicitly refused to communicate with the virtual character. For example, the user makes a "shut up” gesture, or explicitly states “pause communication” and other statements.
  • Interruption and acceptance interaction data can be understood as the corresponding acceptance text statement or action data when the virtual character determines that the user has the intention to interrupt.
  • That is, if the avatar interaction control system determines, based on the user intention data and/or user posture data of the avatar interaction strategy, that the user has the intention to interrupt, it can suspend the avatar's current interactive text or interactive action and determine the corresponding interruption acceptance interaction data based on the interruption intention.
  • The three-dimensional rendering model is then used to generate a virtual character image that contains the above action interaction strategy, so as to drive the virtual character to continue to complete the multi-modal interaction based on the interruption acceptance interaction data.
  • This interruption intention can be an interruption intention explicitly displayed by the user, such as a negative expression or negative emotion shown during the digital human's speech, etc. It can also be an implicit interruption intention of the user, such as the user suddenly disappearing or no longer being in a communicating state. Under this strategy, the digital human will interrupt its current speaking state, wait for the user to speak, or actively ask the other party the reason for the interruption (a sketch of this flow follows).
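  • The interruption flow described above might be sketched as follows; the interruption cues are taken from the examples in the text, while the helper callables for pausing output and rendering a reply are hypothetical stand-ins for modules of the control system.

```python
from typing import Callable, Optional

EXPLICIT_INTERRUPT_PHRASES = ("pause communication", "shut up")  # examples from the text

def detect_interruption(text: Optional[str], gesture: Optional[str],
                        user_present: bool) -> Optional[str]:
    """Return a reason string if an explicit or implicit interruption intent is found."""
    if text and any(p in text.lower() for p in EXPLICIT_INTERRUPT_PHRASES):
        return "explicit_request"
    if gesture == "shut_up":
        return "explicit_gesture"
    if not user_present:
        return "user_absent"          # implicit: user disappeared / not communicating
    return None

def handle_interruption(reason: str,
                        pause_current_output: Callable[[], None],
                        render_acceptance: Callable[[str, str], None]) -> None:
    """Pause the avatar, then either wait for the user or actively ask why."""
    pause_current_output()
    if reason == "user_absent":
        render_acceptance("", "wait")                              # wait for the user to speak
    else:
        render_acceptance("May I ask why you stopped me?", "nod")  # ask the reason
```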
  • the virtual character interaction control system can also provide an output rendering function, which fuses the audio data stream and video data stream that determine the virtual character interaction, and then pushes them out.
  • The step of using the three-dimensional rendering model to generate an image of the virtual character including the action interaction strategy based on the virtual character interaction strategy, so as to drive the virtual character to perform multi-modal interaction, includes: determining the audio data stream of the virtual character's text interaction based on the text interaction strategy; determining the video data stream of the virtual character's action interaction based on the action interaction strategy; fusing the audio data stream and the video data stream to render the multi-modal interaction data stream of the virtual character; and, based on the multi-modal interaction data stream, using the three-dimensional rendering model to generate an image of the virtual character including the action interaction strategy, so as to drive the virtual character to perform multi-modal interaction.
  • During specific implementation, the output rendering of the virtual character interactive control system combines the results into a video stream and pushes it out; it consists of three parts.
  • the streaming TTS part synthesizes the text output of the system into an audio stream.
  • the driving part includes two sub-modules, the facial driving module and the action driving module.
  • the facial driving module drives the digital human to output accurate mouth shapes based on the voice stream.
  • the action driver module drives the digital human to output accurate actions based on the action tags output by the system.
  • the rendering and synthesis part is responsible for rendering and synthesizing the output of the driver part, TTS and other modules into the digital human video stream.
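  • The three output parts (streaming TTS, face/action driving, and rendering plus compositing) can be pictured as the small pipeline below; every callable is an assumed placeholder for the corresponding module, not an actual interface of the system.

```python
from typing import Callable, Iterable

def render_output(reply_text: str,
                  action_tags: Iterable[str],
                  tts: Callable[[str], Iterable[bytes]],             # streaming TTS: text -> audio chunks
                  drive_face: Callable[[bytes], object],             # audio chunk -> mouth-shape params
                  drive_action: Callable[[str], object],             # action tag -> body-motion params
                  compose: Callable[[bytes, object, object], bytes]  # render + merge into a video frame
                  ) -> Iterable[bytes]:
    """Synthesize speech, drive face and body, and composite the digital human video stream."""
    actions = list(action_tags)
    for i, audio_chunk in enumerate(tts(reply_text)):
        face_params = drive_face(audio_chunk)                    # accurate mouth shapes from audio
        action_params = drive_action(actions[i % len(actions)]) if actions else None
        yield compose(audio_chunk, face_params, action_params)   # pushed out as the video stream
```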
  • the multi-modal interaction method provided by the embodiments of this specification, by adding a video stream and a corresponding visual understanding module, can not only perceive the user's facial expressions, but also perceive the user's actions.
  • similar methods can be used to expand new visual processing modules to allow virtual characters to perceive more multi-modal information, such as environmental information.
  • the system can support real-time perception of five facial expressions of the user: angry, unhappy, neutral, happy, and annoyed. It can sense three major types of actions in real time: actions with clear meaning (such as OK, numbers, and swiping left and right, etc.), unsafe gestures (such as middle finger and little finger) and customized special actions.
  • this method changes the dialogue form from an exclusive dialogue form of one question and one answer to a non-exclusive dialogue form that can be taken over or interrupted at any time.
  • In terms of implementation, the multi-modal control module divides the conversation into smaller decision-making units and no longer uses the complete user question as the trigger condition for a reply, so that the system can take over or interrupt at any time, even while the user is speaking.
  • Among them, the voice stream segmentation strategy is to segment the voice stream with a VAD time of 200 ms; generally speaking, the breathing pause during a person's speech is about 200 ms.
  • the video stream adopts a detection triggering strategy.
  • the multi-modal duplex status management module is the core of solving this problem, because it not only maintains the current duplex conversation status, but also can decide the current reply strategy.
  • The duplex strategy includes duplex active acceptance, duplex active/passive interruption, calling the basic dialogue system or business logic, and no feedback.
  • This solution divides the dialogue into smaller units and uses this unit as the granularity of digital human decision-making and reply, so that the dialogue is no longer an exclusive dialogue form of one question and one answer.
  • the system is already processing the user's input information and calculating the result of the reply.
  • the system no longer needs to calculate again from the beginning, and can directly play the following words. This greatly reduces the interaction delay.
  • the system's dialogue delay can be reduced from 1.5 seconds to about 800ms.
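  • The latency gain comes from computing the reply on partial input instead of waiting for the complete question; a minimal sketch of that idea, with hypothetical helper callables, is shown below.

```python
from typing import Callable, Iterable, Optional

def streaming_reply(speech_units: Iterable[str],
                    update_reply: Callable[[str], Optional[str]],
                    is_final: Callable[[str], bool]) -> Optional[str]:
    """Refresh a candidate reply after every small speech unit; when the user
    finishes (end of utterance detected by VAD), the latest candidate can be
    played at once instead of being computed from scratch."""
    candidate: Optional[str] = None
    for unit in speech_units:
        candidate = update_reply(unit) or candidate   # incremental processing per unit
        if is_final(unit):
            return candidate                          # played immediately -> lower delay
    return candidate
```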
  • Figure 4 shows a schematic diagram of the processing process of a multi-modal interaction method provided by an embodiment of this description.
  • the embodiment in Figure 4 can be divided into multi-modal control system-input, multi-modal duplex state management system, basic dialogue system, and multi-modal control system-output.
  • the above systems can be understood as applications of multi-modal interaction methods.
  • Four subsystems of the virtual character interactive control system can be divided into multi-modal control system-input, multi-modal duplex state management system, basic dialogue system, and multi-modal control system-output.
  • the user's video stream and voice stream enter from the multi-modal control system-input.
  • For the video stream, it first passes through the target emotion detection coarse recall module and the target gesture detection coarse recall module, then emotion classification and gesture classification are performed, and the final emotion recognition results and gesture recognition results are input into the multimodal data & alignment module.
  • For the speech stream, it is first segmented, then converted into text through ASR, and finally input into the multimodal data & alignment module.
  • The multimodal data & alignment module combines the speech recognition results with the emotion and gesture recognition results from the video to determine the target user intention and target action data, and inputs them into the duplex state decision-making module of the multi-modal duplex state management system.
  • the multi-modal duplex state decision-making system in Figure 4 can make duplex strategy decisions and determine two acceptance methods, one is action + copywriting acceptance, and the other is action-only acceptance.
  • For action + copy acceptance, the system judges whether to take over in the middle of the sentence or at the end of the sentence, and the process is then divided into two branches to realize the acceptance.
  • The copy and the action used for acceptance are first determined based on intent recognition.
  • For action-only acceptance, it is enough to decide on the specific acceptance action. Finally, the virtual character's acceptance strategy is input into the multi-modal control system-output to determine the streaming video stream and streaming audio stream.
  • In addition, the multi-modal duplex state decision-making system also includes multi-modal interruption intention judgment, which can be combined with the business to implement specific interruption functions.
  • The multi-modal control system-output can determine the face-driving data and action-driving data according to the virtual character's streaming video stream and streaming audio stream, complete the rendering of the virtual character and the merging of the streaming media, and output the digital human video stream.
  • In addition, the basic dialogue system can also provide basic dialogue data for the virtual character's interaction, as well as basic business logic and action matching, jointly completing the generation of the digital human video stream.
  • the multi-modal interaction method has the effects of multi-modal perception, multi-modal duplexing and short interaction delay.
  • embodiments of this specification propose a system that can perceive user voice and video information. Compared with traditional dialogue systems based on voice streams, this solution can not only process the user's voice information, but also identify and detect the user's emotions and actions, greatly improving the intelligence of digital human perception.
  • The embodiments of this specification propose an interactive system that can take over immediately and be interrupted at any time. Compared with the traditional one-question-one-answer, single-turn dialogue system, this system can give the user feedback and replies in real time during the user's speech, such as simple acknowledgement sounds.
  • the duplex interaction system improves the fluency of interaction, thereby giving users a better interaction experience.
  • Short interaction delay: when the user has not yet finished expressing themselves, the system is already processing the user's input information in a streaming manner and calculating the reply result. When the user finishes expressing, the system no longer needs to calculate from scratch and can directly play the take-over reply, which greatly shortens the interaction delay. In terms of perceived experience, the system's dialogue delay can be reduced from 1.5 seconds to about 800 ms.
  • Figure 5 shows a schematic structural diagram of a multi-modal interaction device provided by an embodiment of this specification.
  • the device used in the virtual character interactive control system includes:
  • the data receiving module 502 is configured to receive multi-modal data, wherein the multi-modal data includes voice data and video data; the data recognition module 504 is configured to identify the multi-modal data, obtain user intention data and/or Or user gesture data, wherein the user gesture data includes user emotion data and user action data; the strategy determination module 506 is configured to determine an avatar interaction strategy based on the user intention data and/or user gesture data, wherein the The virtual character interaction strategy includes text interaction strategy and/or action interaction strategy; the rendering model acquisition module 508 is configured to obtain the three-dimensional rendering model of the virtual character; the interaction driving module 510 is configured to based on the virtual character interaction strategy , using the three-dimensional rendering model to generate an image of the virtual character including the action interaction strategy, so as to drive the virtual character to perform multi-modal interaction.
  • The data identification module 504 is further configured to: perform text conversion on the voice data in the multi-modal data, identify the converted text data, and obtain user intention data; and/or perform emotion recognition on the video data and/or voice data in the multi-modal data to obtain user emotion data; perform gesture recognition on the video data in the multi-modal data to obtain user action data; and determine the user posture data based on the user emotion data and the user action data.
  • The data identification module 504 is further configured to: perform emotion detection on the video data in the multi-modal data, and if it is detected that the video data contains a target emotion, classify the target emotion in the video data to obtain user emotion data.
  • The data identification module 504 is further configured to: perform gesture detection on the video data in the multi-modal data, and if it is detected that the video data contains a target gesture, classify the target gesture in the video data to obtain user action data.
  • the policy determination module 506 is further configured to: perform fusion processing on the video data in the multi-modal data based on the user intention data and/or user gesture data, and determine the user's target intention text and/or or target gesture action; determine the virtual character interaction strategy based on the target intention text and/or the target gesture action.
  • the strategy determination module 506 is further configured to: determine a text interaction strategy of the avatar based on the target intention text; and/or determine an action interaction strategy of the avatar based on the target gesture action.
  • The interaction driving module 510 is further configured to: determine the text acceptance position of the virtual character's text interaction based on the text interaction strategy, wherein the text acceptance position is the acceptance position corresponding to the voice data; determine the action acceptance position of the virtual character's action interaction based on the action interaction strategy, wherein the action acceptance position is the acceptance position corresponding to the video data; and, based on the text acceptance position and/or the action acceptance position, use the three-dimensional rendering model to generate an image of the virtual character including the action interaction strategy, so as to drive the virtual character to perform multi-modal interaction.
  • The interaction driving module 510 is further configured to: pause the avatar's current multi-modal interaction if it is determined that the user intention data and/or user gesture data of the avatar interaction strategy contains interruption intention data; determine the interruption acceptance interaction data corresponding to the virtual character based on the interruption intention data; and, based on the interruption acceptance interaction data, use the three-dimensional rendering model to generate an image of the virtual character including the action interaction strategy, so as to drive the virtual character to continue multi-modal interaction.
  • the device further includes: a video stream output module configured to call pre-stored basic dialogue data based on the user intention data and/or the user gesture data, wherein the basic dialogue data includes basic voice data and/or basic action data; render an output video stream of the virtual character based on the basic dialogue data; and drive the virtual character to display the output video stream.
  • the interaction driving module 510 is further configured to: determine an audio data stream for the virtual character's text interaction based on the text interaction strategy; determine a video data stream for the virtual character's action interaction based on the action interaction strategy; fuse the audio data stream and the video data stream to render a multi-modal interaction data stream of the virtual character; and, based on the multi-modal interaction data stream, use the three-dimensional rendering model to generate an image of the virtual character that embodies the action interaction strategy, so as to drive the virtual character to perform multi-modal interaction.
  • the multi-modal interaction device provided by the embodiments of this specification receives the user's voice data and video data and performs intention recognition and gesture recognition to determine the user's communication intention and/or the user's corresponding posture; it then determines the specific interaction strategy between the virtual character and the user according to that communication intention and/or posture, and drives the virtual character to complete the interaction process with the user according to the determined strategy. This approach not only detects and recognizes the user's emotions and actions, but also takes the virtual character's response to those emotions and actions into account when deciding the interaction strategy, so that the virtual character has a corresponding response for each emotion and/or action the user expresses. As a result, the latency of the whole interaction process is low, the interaction between the user and the virtual character is smoother, and the user is given a better interactive experience.
  • the above is a schematic solution of a multi-modal interaction device in this embodiment. It should be noted that the technical solution of the multi-modal interaction device and the technical solution of the above-mentioned multi-modal interaction method belong to the same concept. For details that are not described in the technical solution of the multi-modal interaction device, please refer to the description of the technical solution of the above-mentioned multi-modal interaction method.
  • Figure 6 shows a structural block diagram of a computing device 600 provided according to an embodiment of this specification.
  • Components of the computing device 600 include, but are not limited to, memory 610 and processor 620.
  • the processor 620 and the memory 610 are connected through a bus 630, and the database 650 is used to save data.
  • Computing device 600 also includes an access device 640 that enables computing device 600 to communicate via one or more networks 660.
  • Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • Access device 640 may include one or more of any type of network interface, wired or wireless (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, or a Near Field Communication (NFC) interface.
  • the above-mentioned components of the computing device 600 and other components not shown in FIG. 6 may also be connected to each other, such as through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 6 is for illustrative purposes only and does not limit the scope of this description. Those skilled in the art can add or replace other components as needed.
  • Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, personal digital assistant, laptop computer, notebook computer, or netbook), a mobile telephone (e.g., a smartphone), a wearable computing device (e.g., a smart watch or smart glasses) or another type of mobile device, or a stationary computing device such as a desktop computer or PC.
  • Computing device 600 may also be a mobile or stationary server.
  • the processor 620 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the above-mentioned multi-modal interaction method.
  • the above is a schematic solution of a computing device in this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned multi-modal interaction method belong to the same concept. For details that are not described in the technical solution of the computing device, please refer to the description of the technical solution of the above-mentioned multi-modal interaction method.
  • An embodiment of this specification also provides a computer-readable storage medium that stores computer-executable instructions which, when executed by a processor, implement the steps of the above multi-modal interaction method.
  • An embodiment of the present specification also provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to perform the steps of the above-mentioned multi-modal interaction method.
  • the computer instructions include computer program code, which may be in the form of source code, object code, executable file or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like.
  • the content contained in the computer-readable medium can be appropriately added or reduced according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of this specification provide a multi-modal interaction method and apparatus, wherein the multi-modal interaction method is applied to a virtual character interaction control system and includes: receiving multi-modal data, wherein the multi-modal data includes voice data and video data; recognizing the multi-modal data to obtain user intention data and/or user gesture data, wherein the user gesture data includes user emotion data and user action data; determining a virtual character interaction strategy based on the user intention data and/or the user gesture data, wherein the virtual character interaction strategy includes a text interaction strategy and/or an action interaction strategy; obtaining a three-dimensional rendering model of the virtual character; and, based on the virtual character interaction strategy, using the three-dimensional rendering model to generate an image of the virtual character that embodies the action interaction strategy, so as to drive the virtual character to perform multi-modal interaction, so that the latency of the whole interaction process is low and the user is given a better interactive experience.

Description

多模态交互方法以及装置
本申请要求于2022年05月09日提交中国专利局、申请号为202210499890.X、申请名称为“多模态交互方法以及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本说明书实施例涉及计算机技术领域,特别涉及一种虚拟人物的多模态交互方法。
背景技术
随着虚拟人物技术的发展，智能数字人产品已经越来越多地深入到人们生活的各个方面。目前虚拟人物的需求被进一步扩展，还要求其作为与用户能够进行语言、动作以及情感等多模态交互的伙伴。而目前虚拟人物交互系统较为呆板，智能性较差，只能根据系统中预先设置的指令动作以及文本内容，并通过系统中的交互组件，实现模式单一的交互过程，整个交互过程不仅会有较长的时间延迟，还较大地影响了虚拟人物与用户交互的流畅性，给用户较差的交互体验。
发明内容
有鉴于此,本说明书实施例提供了一种多模态交互方法。本说明书一个或者多个实施例同时涉及一种多模态交互装置,一种计算设备,一种计算机可读存储介质以及一种计算机程序,以解决现有技术中存在的技术缺陷。
根据本说明书实施例的第一方面,提供了一种多模态交互方法,应用于虚拟人物交互控制系统,包括:
接收多模态数据,其中,所述多模态数据包括语音数据和视频数据;
识别所述多模态数据,获得用户意图数据和/或用户姿态数据,其中,所述用户姿态数据包括用户情绪数据以及用户动作数据;
基于所述用户意图数据和/或用户姿态数据确定虚拟人物交互策略,其中,所述虚拟人物交互策略包括文本交互策略和/或动作交互策略;
获取所述虚拟人物的三维渲染模型;
基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
根据本说明书实施例的第二方面,提供了一种多模态交互装置,应用于虚拟人物交互控制系统,包括:
数据接收模块,被配置为接收多模态数据,其中,所述多模态数据包括语音数据和视频数据;
数据识别模块,被配置为识别所述多模态数据,获得用户意图数据和/或用户姿态数据,其中,所述用户姿态数据包括用户情绪数据以及用户动作数据;
策略确定模块,被配置为基于所述用户意图数据和/或用户姿态数据确定虚拟人物交互策略,其中,所述虚拟人物交互策略包括文本交互策略和/或动作交互策略;
渲染模型获取模块,被配置为获取所述虚拟人物的三维渲染模型;
交互驱动模块,被配置为基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
根据本说明书实施例的第三方面,提供了一种计算设备,包括:
存储器和处理器;
所述存储器用于存储计算机可执行指令,所述处理器用于执行所述计算机可执行指令,该计算机可执行指令被处理器执行时实现上述多模态交互方法的步骤。
根据本说明书实施例的第四方面,提供了一种计算机可读存储介质,其存储有计算机可执行指令,该指令被处理器执行时实现上述多模态交互方法的步骤。
根据本说明书实施例的第五方面,提供了一种计算机程序,其中,当所述计算机程序在计算机中执行时,令计算机执行上述多模态交互方法的步骤。
本说明书一个实施例提供了一种多模态交互方法,应用于虚拟人物交互控制系统,通过接收多模态数据,其中,所述多模态数据包括语音数据和视频数据;识别所述多模态数据,获得用户意图数据和/或用户姿态数据,其中,所述用户姿态数据包括用户情绪数据以及用户动作数据;基于所述用户意图数据和/或用户姿态数据确定虚拟人物交互策略,其中,所述虚拟人物交互策略包括文本交互策略和/或动作交互策略;获取所述虚拟人物的三维渲染模型;基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
具体的,通过接收到用户的语音数据和音频数据,并进行意图识别和姿态识别,以确定用户的交流意图和/或用户对应的姿态,进而,根据用户的交流意图和/或用户对应的姿态确定虚拟人物与用户具体的交互策略,再驱动虚拟人物根据确定的交互策略,完成与用户的交互过程,该种方式不仅能够检测和识别出用户的情绪和动作,在决策虚拟人物交互策略时,还会考虑到虚拟人物应对用户的情绪和动作,使得虚拟人物对用户的情绪和/或动作的表达,均有相应的应对策略,这样不仅使得整个交互过程的时延较低,还会使得整个用户与虚拟人物的交互过程更加流畅,给用户较好的交互体验。
附图说明
图1是本说明书一个实施例提供的一种多模态交互方法应用于虚拟人物交互控制系统的系统结构示意图；
图2是本说明书一个实施例提供的一种多模态交互方法的流程图;
图3是本说明书一个实施例提供的虚拟人物交互控制系统的系统架构图;
图4是本说明书一个实施例提供的一种多模态交互方法的处理过程示意图;
图5是本说明书一个实施例提供的一种多模态交互装置的结构示意图;
图6是本说明书一个实施例提供的一种计算设备的结构框图。
具体实施方式
在下面的描述中阐述了很多具体细节以便于充分理解本说明书。但是本说明书能够以很多不同于在此描述的其它方式来实施,本领域技术人员可以在不违背本说明书内涵的情况下做类似推广,因此本说明书不受下面公开的具体实施的限制。
在本说明书一个或多个实施例中使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本说明书一个或多个实施例。在本说明书一个或多个实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本说明书一个或多个实施例中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。
应当理解,尽管在本说明书一个或多个实施例中可能采用术语第一、第二等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本说明书一个或多个实施例范围的情况下,第一也可以被称为第二,类似地,第二也可以被称为第一。取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。
首先,对本说明书一个或多个实施例涉及的名词术语进行解释。
多模态交互:用户可通过语音、文字、表情、动作和手势等方式与数字人进行交流,数字人也可以通过语音、文字、表情、动作和手势等方式对用户进行回复。
双工交互:可以进行实时的、双向通信的交互方式,用户和数字人相互可以随时打断或回复对方。
非独占式对话:对话双方可进行双向通信,用户和数字人相互可以随时打断或回复对方。
VAD(Voice Activity Detection,语音活动检测):又称语音端点检测,语音边界检测。
TTS(Text To Speech,语音合成技术):将文本转化为声音。
数字人:指的是具有数字化外形的虚拟人物,可使用在虚拟现实应用中与真人交互。在与数字人的交流过程中,传统的交互方式为以语音为载体的独占式问答形式。
目前,虚拟人物与用户的交互过程中,可能会出现以下问题:
1)在沟通的流畅度上,基于独占式的交流形式,用户既不能主动打断数字人的对话,数字人也不能在与用户对话的过程中,进行即时的承接回复,导致用户与数字人的交流不智能。
2)在感知能力的多样性上,以语音为载体的沟通形式,数字人既无法感知用户的面部变化,如表情和用户的对话状态等,也无法感知用户的肢体动作,如手势和身体姿态等。这些信息的缺失,会使得用户在与数字人的沟通过程中,不能即时对用户的状态进行反馈,导致对话过程十分呆板。
3)在对话的响应时间上,由于ASR和系统的时延等因素,一般的对话时延约为1.2~1.5s,而人无感的对话时延约为600~800ms,过长的对话时延会导致对话卡顿感严重,用户体验较差。
另外,目前对于虚拟人物交互的智能对话控制系统,可能仅支持语音的双工能力,缺乏视频的理解能力和视觉的双工状态决策能力,不能感知到用户的表情、动作和环境等多模态信息;甚至某些对话系统,仅支持基本的问答能力,并没有双工能力(主动/被动打断、承接),也没有视频的理解能力和视觉的双工状态决策能力,不能感知用户的表情、动作和环境等多模态信息。
基于此,本说明书实施例提供的一种多模态交互方法,应用于虚拟人物交互控制系统,通过设置多模态控制模块、多模态双工状态模块以及基础对话模块,能够实现虚拟人物与真实用户的交互过程,在完成基本对话任务的基础上,还可对多模态数据进行识别处理,以实现虚拟人物能够主动承接、打断用户对话,缩短系统的交互时延,同时感知到用户的表情、动作和手势等多模态信息,以适用于各种不同的应用场景,比如身份核验、故障定损和物品核验等复杂的应用场景,都将具有不错的应用效果。
需要说明的是,多模态控制模块的功能为:控制着交互系统里视频流和语音流的输入输出。在输入端,该模块将输入的语音流和视频流进行切分和理解,控制着多模态双工系统的触发与否,降低系统传输的成本时同时加快系统的处理效率。在输出端,负责将系统的结果渲染成数字人的视频流输出。多模态双工状态管理模块的功能为:管理当前对话的状态,并决策双工状态。当前的双工状态包含1)双工主动\被动打断、2)双工主动承接、3)调用基础对话系统或业务逻辑4)无反馈。基础对话模块的功能为:包含基本的业务逻辑和对话问答能力。
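Read as software interfaces, the three modules described above could be sketched roughly as follows in Python; the method names, argument types and string labels are assumptions made for illustration only and do not reflect the actual implementation of the system.

```python
# Interface sketch for the three modules named above; method names, argument
# types and the string labels are assumptions, not the system's actual API.
from typing import Iterable, Optional, Protocol


class MultimodalController(Protocol):
    def split_and_understand(self, audio_chunk: bytes, video_chunk: bytes) -> dict:
        """Segment the input streams and return text / emotion / gesture results."""

    def render_output(self, reply: dict) -> Iterable[bytes]:
        """Render the decided reply into the digital human's output video stream."""


class DuplexStateManager(Protocol):
    def decide(self, observation: dict) -> str:
        """Return one of: 'active/passive interrupt', 'proactive take-over',
        'call basic dialogue / business logic', or 'no feedback'."""


class BasicDialogue(Protocol):
    def answer(self, question: str) -> Optional[str]:
        """Basic business logic plus question answering for a finished user turn."""
```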
进而,下述实施例中将会对本说明书实施例提供的一种多模态交互方法,各个模块的具体处理方式进行详细介绍。
基于此,在本说明书中,提供了一种多模态交互方法,本说明书同时涉及一种多模态交互装置,一种计算设备,以及一种计算机可读存储介质,在下面的实施例中逐一进行详细说明。
参见图1,图1示出了根据本说明书一个实施例提供的一种多模态交互方法应用于虚拟人物交互控制系统的系统结构示意图。
图1为虚拟人物交互控制系统100,且该虚拟人物交互控制系统100中包括多模态控制模块102和多模态双工状态管理模块104。
实际应用中,虚拟人物交互控制系统100中的多模态控制模块102可作为视频流、语音流的输入,也可作为虚拟人物交互视频流的输出;其中,多模态输入的部分包括视频流输入、语音流输入。同时,多模态控制模块102对视频流进行情绪检测以及手势检测,多模态控制模块102对语音流进行语音检测,并将对视频流的检测结果和/或音频流的检测结果输入至多模态双工状态管理模块104中的双工状态决策中,以确定虚拟人物的交互策略,其中,该交互策略主要可分为动作承接、文案+动作承接。进一步地,多模态双工状态管理模块104还可根据确定的虚拟人物交互策略渲染虚拟人物,进而实现将虚拟人物渲染后的视频流,通过多模态控制模块102进行输出。
本说明书实施例提供的多模态交互方法,通过对视频流中的用户进行视觉理解,感知到用户的情绪、动作等,并为虚拟人物提供的承接或打断的方式,使得虚拟人物与用户的交互过程变为非独占式对话,且虚拟人物也能够提供情绪、动作和/或语音的多模态交互。
下述结合附图2,图2示出了本说明书一个实施例提供的一种多模态交互方法的流程图,具体包括以下步骤。
需要说明的是,本说明书实施例提供的多模态交互方法应用于虚拟人物交互控制系统,通过该虚拟人物交互控制系统,可支持虚拟人物与用户实现互动延时较小,沟通流畅,且仿真人交互的过程。
步骤202:接收多模态数据,其中,所述多模态数据包括语音数据和视频数据。
实际应用中,虚拟人物交互控制系统可接收多模态数据,该多模态数据为对应于用户的语音数据和视频数据,其中,语音数据可以理解为用户与虚拟人物交流的语音数据。比如,用户向虚拟人物表达的“请问可以查询下投保订单么”这句语音数据;视频数据可以理解为用户与虚拟人物表达上述语音数据时,用户所表现的表情、动作、口型,用户所处环境的视频数据。沿用上例,用户在表达上述语音数据时,可在视频数据中展现的表情为疑惑,动作为摊手的动作,口型为表达上述语音数据对应的口型。
需要说明的是,虚拟人物与用户的交互为了实现仿真人交互,则需要虚拟人物针对用户的语音数据和视频数据做出即时的反应,减少交互过程中所产生的延时。同时,还需支持双方的交互、打断、承接等功能。
步骤204:识别所述多模态数据,获得用户意图数据和/或用户姿态数据,其中,所述用户姿态数据包括用户情绪数据以及用户动作数据。
其中,用户意图数据可以理解为用户所表达的语音数据的意图。比如上例中,“请问可以查询下投保订单么”这句语音数据的意图,为询问虚拟人物是否能够帮忙查询下该用户之前的投保订单。
用户姿态数据可以理解为用户在视频数据中所表达的姿态数据,包括用户情绪数据和用户动作数据。比如上例中,用户面部所表达的“疑惑”的情绪,用户手部所展示的“摊手”的动作。
实际应用中,虚拟人物交互控制系统可对多模态数据进行识别,分别识别多模态数据中的语音数据和视频数据。进而,通过识别语音数据以获得用户意图数据,通过识别视频数据以获得用户姿态数据,且该用户姿态数据可包括用户情绪数据以及用户动作数据。需要说明的是,在不同的应用场景下,针对用户的多模态数据,可能会仅识别出用户意图数据,或者是用户姿态数据,或者是既识别出用户意图数据,又识别出用户姿态数据。本实施例中通过“和/或”的表达方式,对此不作任何限定。
进一步地,虚拟人物交互控制系统可分别对语音数据以及视频数据进行识别,进而确定用户的意图、情绪、动作、姿态等信息。具体的,所述识别所述多模态数据,获得用户意图数据和/或用户姿态数据,包括:对所述多模态数据中的语音数据进行文本转换,识别转换后的文本数据,获得用户意图数据;和/或对所述多模态数据中的视频数据和/或语音数据进行情绪识别,获得用户情绪数据;对所述多模态数据中的视频数据进行手势识别,获得用户动作数据;基于所述用户情绪数据以及所述用户动作数据确定用户姿态数据。
实际应用中,虚拟人物交互控制系统可对多模态数据中的语音数据先进行文本转换,进而,识别转换后的文本数据,以获得用户意图数据。其中,具体语音转换文本的方式包括但不限于采用ASR技术,本实施例对具体转换方式不作具体限定。需要说明的是,为了保证该交互系统即使在用户说话的过程中,虚拟人物也能够进行即时的反馈,该系统可将语音流按照200ms的VAD时间对语音流进行切分,分成一个一个小的语音单元,再将每一个语音单元输入至ASR模块,将其转换成文本,便于后续识别出用户意图数据。
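As a rough illustration of the 200 ms VAD-based segmentation described above, the following Python sketch uses the open-source webrtcvad package to cut a 16 kHz mono PCM stream into small speech units and hands each unit to a placeholder ASR callback; the asr_transcribe function, the frame size and the exact endpoint threshold are assumptions for illustration, not the system's real interface.

```python
# A minimal sketch of VAD-based segmentation feeding an ASR callback.
# Assumes 16 kHz, 16-bit mono PCM and a hypothetical asr_transcribe() function.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 20                                       # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2    # 16-bit samples
ENDPOINT_MS = 200                                   # silence gap treated as a unit boundary


def asr_transcribe(pcm: bytes) -> str:
    """Placeholder for the ASR module that turns one speech unit into text."""
    return "<transcript>"


def segment_and_transcribe(pcm_stream: bytes):
    vad = webrtcvad.Vad(2)                          # aggressiveness 0-3
    unit, silence_ms, texts = bytearray(), 0, []
    for i in range(0, len(pcm_stream) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm_stream[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            unit.extend(frame)
            silence_ms = 0
        else:
            silence_ms += FRAME_MS
            if unit and silence_ms >= ENDPOINT_MS:
                texts.append(asr_transcribe(bytes(unit)))   # flush one speech unit
                unit.clear()
    if unit:
        texts.append(asr_transcribe(bytes(unit)))
    return texts
```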
进一步地,虚拟人物交互控制系统在确定了用户意图数据之后,即可再根据视频数据进行用户情绪识别、手势识别。需要说明的是,用户的情绪识别不仅可根据视频数据识别,还可根据语音数据识别,或者是根据视频数据以及语音数据一起来进行识别。比如,根据视频数据中用户的面部表情变化(眼神、嘴唇抽动)或者摇头动作等进行情绪识别。又例如,根据语音数据的音量强弱、口气来进行情绪识别。另外,虚拟人物交互控制系统还可根据视频数据对用户所展示的动作进行识别,比如对用户的手势进行识别,用户在摆出摊手的手势时,即可获得了该用户的动作数据。最后,虚拟人物交互控制系统可根据用户情绪数据以及用户动作数据,以综合确定用户的姿态数据。
需要说明的是,虚拟人物交互控制系统对用户的语音数据以及视频数据中发生的微小变化,均可进行识别感知,以实现精准地捕获用户的意图以及动态,便于后续决策虚拟人物以何种策略和方式与用户实现多模态交互。
更进一步地,虚拟人物交互控制系统为了能够尽快获知用户情绪数据,该系统可采用两阶段的识别方式,即先进行情绪粗召检测,再对情绪进行分类,以获得目标情绪。具体的,所述对所述多模态数据中的视频数据进行情绪识别,获得用户情绪数据的步骤包括:
对所述多模态数据中的视频数据进行情绪检测，在检测到所述视频数据中包含目标情绪的情况下，对所述视频数据中的目标情绪进行分类，获得用户情绪数据。其中，目标情绪可以理解为系统预先设置的用户情绪，比如生气、不悦、中性、开心和惊讶等。
具体实施时,虚拟人物交互控制系统可先对多模态数据中的视频数据进行情绪检测。当检测到视频流中包含了系统预先设置的目标情绪时,即可对目标情绪进行分类确定,以获得用户情绪数据。实际应用中,为了保证系统的识别速度和识别准确率都有较好的效果,系统将采用两阶段的识别方式,可对视频流中的用户的表情先进行识别,但检测到视频流中具有用户的目标情绪时,对该目标情绪进行情感类别分类,以确定最终的用户情绪数据。
需要说明的是,虚拟人物交互控制系统中可配置目标情绪粗召模块以及情绪分类模块。目标情绪粗召模块可对视频流进行粗粒度检测,情绪分类模块可对视频流进行目标情绪的情感分类,以确定用户情绪数据为生气、不悦、中性、开心或者是惊讶。目标情绪粗召模块可采用ResNet18模型,情绪分类模块可采用时序的Transformer模型,但本实施例不限定于使用这两个模型类型。
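The two-stage emotion pipeline mentioned above (a coarse-recall gate followed by a temporal classifier) can be sketched in PyTorch roughly as follows; the layer sizes, the English label names and the gating threshold are assumptions, and the description itself only names ResNet18 and a temporal Transformer as one possible model choice.

```python
# Sketch of the two-stage emotion pipeline: a ResNet18 coarse gate per frame,
# then a temporal Transformer classifier over frame features. Sizes are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18

EMOTIONS = ["angry", "displeased", "neutral", "happy", "surprised"]


class CoarseEmotionGate(nn.Module):
    """Stage 1: binary 'target emotion present?' check on single frames."""
    def __init__(self):
        super().__init__()
        self.backbone = resnet18(weights=None)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 2)

    def forward(self, frames):                            # (T, 3, 224, 224)
        return self.backbone(frames).softmax(-1)[:, 1]    # P(target emotion) per frame


class TemporalEmotionClassifier(nn.Module):
    """Stage 2: classify the gated clip into one of the five emotions."""
    def __init__(self, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        self.backbone = resnet18(weights=None)
        self.backbone.fc = nn.Identity()                  # 512-d frame features
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, len(EMOTIONS))

    def forward(self, frames):                            # (T, 3, 224, 224)
        feats = self.backbone(frames).unsqueeze(0)        # (1, T, 512)
        return self.head(self.encoder(feats).mean(dim=1)) # (1, 5)


def recognize_emotion(frames, gate, classifier, threshold=0.5):
    with torch.no_grad():
        if gate(frames).max() < threshold:                # nothing detected: drop the clip
            return None
        return EMOTIONS[int(classifier(frames).argmax(dim=-1))]
```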
而当虚拟人物交互控制系统未发现用户有指定情绪的时候,则不向后传输视频流,在降低系统的传输成本的同时也加快了系统的识别效率。
同样地,虚拟人物交互控制系统对用户的视频数据进行手势识别时,也可采用两阶段识别方式。具体的,所述对所述多模态数据中的视频数据进行手势识别,获得用户动作数据的步骤包括:
对所述多模态数据中的视频数据进行手势检测,在检测到所述视频数据中包括目标手势的情况下,对所述视频数据中的目标手势进行分类,获得用户动作数据。
其中,目标手势可以理解为系统预先设置的手势类型,比如具有明确含义的手势(如ok、数字或者左滑右滑)、不安全的手势(如竖中指和比小拇指等)、自定义的特殊手势。
具体实施时,虚拟人物交互控制系统可对多模态数据中的视频数据进行手势检测。当检测到视频流中包含了系统预先设置的目标手势时,即可对该目标手势进行分类,以获得用户动作数据。实际应用中,手势识别的过程也可采用目标手势粗召模块以及手势分类模块,即实现对视频流中用户手势的粗粒度识别,进而再对目标手势进行分类识别,确定用户动作数据是否为明确含义的手势(如ok、数字或者左滑右滑)、不安全的手势(如竖中指和比小拇指等)、自定义的特殊手势中的某一种。
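The same coarse-recall-then-classify pattern can be written independently of any particular model: only clips in which a target gesture is detected are classified and forwarded downstream, which is the transmission-saving behaviour just described. The detector and classifier callables and the three gesture groups below are assumptions for illustration.

```python
# Generic two-stage gate: only clips in which a target gesture is detected are
# classified and forwarded; everything else is dropped to save transmission.
from typing import Callable, Optional

MEANINGFUL = {"ok", "digit", "swipe_left", "swipe_right"}
UNSAFE = {"middle_finger", "pinky"}
CUSTOM = {"custom_special"}


def gated_gesture_recognition(
    clip,                                    # video frames of one decision unit
    detect: Callable[[object], bool],        # stage 1: coarse "target present?"
    classify: Callable[[object], str],       # stage 2: fine-grained label
) -> Optional[dict]:
    if not detect(clip):
        return None                          # no target gesture: do not forward
    label = classify(clip)
    if label in UNSAFE:
        group = "unsafe"
    elif label in MEANINGFUL:
        group = "meaningful"
    elif label in CUSTOM:
        group = "custom"
    else:
        return None
    return {"gesture": label, "group": group}
```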
本说明书实施例提供的多模态交互方法,通过采用两阶段识别方式,对用户情绪以及用户动作进行识别,不仅能够快速地完成识别过程,还能降低系统的传输成本提高系统的识别效率。
虚拟人物交互控制系统在确定多模态数据中用户意图数据和/或用户姿态数据之后,即可先调用预先存储的基础对话数据,以支持能够实现的基础交互过程。具体的,所述识别所述多模态数据,获得用户意图数据和/或用户姿态数据的步骤之后还包括:
基于所述用户意图数据和/或所述用户姿态数据，调用预先存储的基础对话数据，其中，所述基础对话数据包括基础语音数据和/或基础动作数据；基于所述基础对话数据渲染所述虚拟人物的输出视频流，并驱动所述虚拟人物对所述输出视频流进行展示。
基础对话数据可以理解为系统中预先存储的可驱动虚拟人物实现基础交互的语音和/或动作数据。比如,该对话数据包括存储在数据库中的基础交流语音数据,包括但不限定于“您好”、“谢谢”、“还有什么问题么”等。基础交流的动作数据,包括但不限定于“比爱心”动作、“摇头”动作、“点头”动作等。
实际应用中,虚拟人物交互控制系统还可根据用户意图数据和/或用户姿态数据,从预先存储在系统中的基础对话数据中,查找与用户意图数据和/或用户姿态数据较为匹配的基础对话数据,并进行调用。由于基础对话数据包括基础语音数据和/或基础动作数据,虚拟人物交互控制系统即可根据基础语音数据和/或基础动作数据渲染虚拟人物对应的输出视频流,以驱动虚拟人物对输出视频流进行展示。
需要说明的是,基础对话数据中还可包括系统预先设置的虚拟人物所完成的基础业务数据,比如为用户提供基础的业务服务等,本实施例中对此不作具体限定。
综上,虚拟人物交互控制系统能够实现根据多模态数据的识别,以明确用户的意图、所表达的情绪、动作、手势等多模态数据,以便于虚拟人物能够针对用户的情绪数据、姿态数据做出仿真人似的交互表达。
另外,虚拟人物交互控制系统为了虚拟人物能够与用户实现仿真人似的交互,以及能够实现双工主动承接、双工主动/被动打断等交互状态,本说明书实施例中还可提供多模态双工状态决策模块,以实现确定虚拟人物交互策略,实现多模态双工的承接/打断。
基于此,虚拟人物交互控制系统可设计三个交互模块,可参见图3,图3示出了本说明书实施例提供的虚拟人物交互控制系统的系统架构图。
图3中包括多模态控制模块、多模态双工状态管理模块以及基础对话模块这三个模块,也可将上述三个模块看作子系统,即多模态控制系统、多模态双工状态管理系统和基础对话系统。其中,多模态控制系统控制着交互系统里视频流和语音流的输入输出。在输入端,该模块将输入的语音流和视频流进行切分和理解,核心包含语音流、流式视频表情和流式视频动作的处理功能。在输出端,负责将系统的结果渲染成数字人的视频流输出。多模态双工状态管理系统负责管理当前对话的状态,并决策当前的双工策略。当前的双工策略包含双工主动\被动打断、双工主动承接、调用基础对话系统或业务逻辑和无反馈。基础对话系统包含基本的业务逻辑和对话问答能力,具备基本的问答交互能力;也即输入用户的问题,系统输出该问题的答案,一般来说包含三个子模块。1)NLU(自然语言理解)模块:对文本信息进行识别理解,转换成计算机可理解的结构化语义表示或者意图标签。2)DM(对话管理)模块:维护和更新当前的对话状态,并决策下一步系统动作。3)NLG(自然语言生成)模块:将系统输出的状态转换成可理解的自然语言文本。
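To make the three sub-modules of the basic dialogue system concrete, here is a minimal Python sketch of an NLU → DM → NLG pipeline; the intent labels, state fields and reply templates are invented for illustration and are not part of the described system.

```python
# Minimal sketch of the basic dialogue system's three sub-modules; the intent
# labels, state fields and reply templates are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    history: list = field(default_factory=list)
    last_intent: str = "unknown"


def nlu(text: str) -> str:
    """NLU: map free text to an intent label (a rule stub stands in for the model)."""
    if "投保订单" in text:
        return "query_policy_order"
    return "chitchat"


def dm(state: DialogueState, intent: str) -> str:
    """DM: update the dialogue state and pick the next system action."""
    state.last_intent = intent
    state.history.append(intent)
    return "do_query" if intent == "query_policy_order" else "do_chitchat"


def nlg(action: str) -> str:
    """NLG: turn the system action into natural-language text."""
    replies = {
        "do_query": "您稍等，我为您查询...",
        "do_chitchat": "还有什么可以帮您？",
    }
    return replies[action]


def answer(state: DialogueState, user_text: str) -> str:
    return nlg(dm(state, nlu(user_text)))
```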
下述实施例中可针对多模态双工状态管理模块的具体实现过程进行详细说明,以明确虚拟人物交互控制系统中如何为虚拟人物提供相互承接、相互打断的能力。
步骤206:基于所述用户意图数据和/或用户姿态数据确定虚拟人物交互策略,其中,所述虚拟人物交互策略包括文本交互策略和/或动作交互策略。
虚拟人物交互策略可以理解为虚拟人物与用户之间所承接的文案决策、动作决策或者是文案决策及动作决策的结合,即文本交互策略和/或动作交互策略。文本交互策略可以理解为虚拟人物针对用户的语音数据对应的交互文本,以及该交互文本需要在用户所表达的语音文本中的句中打断、还是句尾承接。动作交互策略可以理解为虚拟人物针对用户的姿态数据所对应的交互姿态,以及该交互姿态需要在用户所表达的语音文本中的句中打断、还是句尾承接。
实际应用中,虚拟人物交互控制系统可根据用户意图数据确定虚拟人物的文本承接内容,无论是在用户的句中打断还是在用户的句尾承接,即文本交互策略。虚拟人物交互控制系统还可根据用户姿态数据确定虚拟人物的姿态承接内容,无论是在用户的句中进行姿态打断还是在用户的句尾进行姿态承接,即动作交互策略。需要说明是,对于用户的某一意图数据和/或姿态数据,虚拟人物不一定均具有文本交互策略和动作交互策略,即文本交互策略与动作交互策略也可以是“和/或”的关系。
另外,虚拟人物不仅能够对用户的交互进行承接或打断,还可支持不作任何反馈的功能,即当用户VAD时间未达到800ms时,不需要调用基础对话系统或业务逻辑进行回答时,系统不作任何反馈。
具体的,所述基于所述用户意图数据和/或用户姿态数据确定虚拟人物交互策略的步骤包括:
基于所述用户意图数据和/或用户姿态数据对所述多模态数据中的视频数据进行融合处理,确定用户的目标意图文本和/或目标姿态动作;基于所述目标意图文本和/或所述目标姿态动作,确定虚拟人物交互策略。
实际应用中,虚拟人物交互控制系统在确定了用户意图数据和/或用户姿态数据之后,还可对文本、视频流和语音流进行融合对齐处理,综合判断用户的目标意图文本和/或目标姿态动作。进而,后续可根据目标意图文本和/或目标姿态动作确定具体的虚拟人物交互策略。
以情绪识别为例,情绪分类模块从面部已经识别出了用户的表情,如微笑,但是用户有可能是在表达一种无奈的苦笑。因此为了解决这种问题,虚拟人物交互控制系统可从用户的语音和当前说的文本话术进行多模态的判断,从而达到更好的效果。具体实施时,该系统可采用一个多模态的分类模型进行更加精细地情绪判断,最终该模块会输出当前的交互状态,可包含三个状态槽位,即文本、用户手势动作、用户情绪,以为多模态双工状态管理模块进行双工状态的决策。
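The three state slots produced by this fusion step (text, user gesture, user emotion) might be represented as in the sketch below; the rule that resolves conflicting cues is only a stand-in for the multimodal classification model mentioned above.

```python
# Sketch of the fused interaction state handed to the duplex state manager:
# three slots (text, user gesture, user emotion). The fusion rule below is a
# placeholder for the multimodal classifier described in the text.
from dataclasses import dataclass
from typing import Optional


@dataclass
class InteractionState:
    text: str                       # ASR transcript of the current speech unit
    gesture: Optional[str] = None   # e.g. "ok", "wave", "middle_finger"
    emotion: Optional[str] = None   # e.g. "happy", "displeased", "neutral"


def fuse(text: str, face_emotion: Optional[str], voice_emotion: Optional[str],
         gesture: Optional[str]) -> InteractionState:
    # Assumption: when the face and the voice/text channels disagree (e.g. a
    # smile over a frustrated utterance), prefer the voice/text signal.
    emotion = voice_emotion or face_emotion
    return InteractionState(text=text, gesture=gesture, emotion=emotion)
```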
本说明书实施例提供的多模态交互方法，通过对用户意图数据和/或用户姿态数据进一步综合判断，以精准地明确用户的交互目的，避免了由于交互目的判断错误而导致后续虚拟人物展示无效沟通、降低虚拟人物智能度的情况。
虚拟人物交互控制系统在精确地获知用户的目标意图文本和/或目标姿态动作之后,即可分别精准地确定虚拟人物的文本交互策略和/动作交互策略。具体的,所述基于所述目标意图文本和/或所述目标姿态动作,确定虚拟人物交互策略,包括:
基于所述目标意图文本确定虚拟人物的文本交互策略;和/或
基于所述目标姿态动作确定虚拟人物的动作交互策略。
实际应用中,虚拟人物交互控制系统根据目标意图文本确定虚拟人物与用户的文本交互策略。比如,若用户的目标意图文本为“查询投保订单状态”,那么虚拟人物的文本交互策略可从该用户语音文本的句尾开始承接,即虚拟人物可表达“您稍等,我为您查询...”。若用户的目标意图文本为“你怎么这么慢,还没有查询到么”,那么虚拟人物的文本交互策略可从该意图文本的中间可打断承接,即在用户说完“你怎么这么慢”这句话时,虚拟人物即可马上表达“不要着急哦”。这样便可实现虚拟人物与用户的即时性的交流,以达到仿真人交流的效果。
进一步地,虚拟人物交互控制系统还可根据目标姿态动作确定虚拟人物与用户的动作交互策略。比如,若用户的目标姿态动作为“ok”的手势,那么虚拟人物的动作交互策略可同样展示“ok”的手势。若用户的目标姿态动作为“比中指”的手势,那么虚拟人物也可不做任何动作回应,可以仅回复文本内容,比如“有什么不满意的么”,或者是仅回复一个“摇头哭”的动作。
需要说明的是,上述针对不同的目标意图文本和/或目标姿态动作,均可确定不同的文本交互策略和/或动作交互策略。举例说明,若仅有目标意图文本时,可确定虚拟人物仅应对文本交互策略、或者是仅应对动作交互策略、或者是文本交互策略与动作交互策略的结合。若仅有目标姿态动作时,可确定虚拟人物仅应对文本交互策略、或者是仅应对动作交互策略、或者是文本交互策略与动作交互策略的结合。若目标意图文本和目标姿态动作均具有时,可确定虚拟人物仅应对文本交互策略、或者是仅应对动作交互策略、或者是文本交互策略与动作交互策略的结合。进而,本说明书实施例中并不能将所有情况穷尽举例,但本实施例中的虚拟人物交互控制系统可支持根据不同的交互状态,确定不同的虚拟人物交互策略。
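A compact way to summarize the decision space discussed above is a single function mapping the fused observation to one of the four duplex outcomes; the trigger rules below are illustrative assumptions, with only the 800 ms VAD figure taken from this description.

```python
# Decision sketch over the four duplex outcomes; the concrete trigger rules are
# assumptions, only the 800 ms VAD gate is taken from the surrounding text.
from enum import Enum, auto
from typing import Optional


class DuplexAction(Enum):
    INTERRUPT = auto()        # active/passive interruption
    TAKE_OVER = auto()        # proactive take-over (action and/or text)
    CALL_DIALOGUE = auto()    # hand over to the basic dialogue / business logic
    NO_FEEDBACK = auto()


def decide(text: str, emotion: Optional[str], gesture: Optional[str],
           vad_silence_ms: int) -> DuplexAction:
    if emotion in {"angry", "displeased"} or "闭嘴" in text:
        return DuplexAction.INTERRUPT            # negative signal while the avatar talks
    if gesture in {"wave", "ok"} or emotion in {"happy", "surprised"}:
        return DuplexAction.TAKE_OVER            # respond in the middle of the user's turn
    if vad_silence_ms >= 800 and text.strip():
        return DuplexAction.CALL_DIALOGUE        # the user has finished a question
    return DuplexAction.NO_FEEDBACK
```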
步骤208:获取所述虚拟人物的三维渲染模型。
实际应用中,虚拟人物交互控制系统可获取虚拟人物的三维渲染模型,便于后续根据该三维渲染模型生成虚拟人物的交互视频流,完成与用户的多模态交互。需要说明的是,虚拟人物可由卡通或电脑绘图形象构成,也可由仿真人形象构成,本实施例中对此不作具体限定。
步骤210:基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
实际应用中,虚拟人物交互控制系统可根据确定的虚拟人物交互策略,并利用三维渲染模型生成包含上述虚拟人物的动作交互策略的虚拟人物的形象。比如该虚拟人物对应的头部动作、面部表情以及手势动作等,进而,驱动渲染后的虚拟人物形象与用户实现多模态交互。
进一步地,虚拟人物交互控制系统即可根据文本交互策略以及动作交互策略,具体确定虚拟人物对应的文本承接位置和/或动作承接位置,以实现双工主动承接的过程。具体的,所述基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互的步骤包括:
基于所述文本交互策略确定所述虚拟人物文本交互的文本承接位置,其中,所述文本承接位置为针对所述语音数据对应的承接位置;基于所述动作交互策略确定所述虚拟人物动作交互的动作承接位置,其中,所述动作承接位置为针对所述视频数据对应的承接位置;基于所述文本承接位置和/或所述动作承接位置,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
文本承接位置可以理解为虚拟人物的交互文本针对用户表达的语音文本对应的承接位置,可分为句中承接、句尾承接。动作承接位置可以理解为虚拟人物的交互动作,所针对用户表达的语音文本对应的承接位置,可分为在句中进行动作承接或者是在句尾进行动作承接。
实际应用中,虚拟人物交互控制系统可根据在确定虚拟人物文本交互的文本承接位置、以及虚拟人物动作交互的动作承接位置之后,即可根据文本承接位置和/或动作承接位置,利用三维渲染模型生成包含动作交互策略的虚拟人物形象,以确定虚拟人物的多模态交互过程。
需要说明的是,虚拟人物交互控制系统判断需要承接用户的对话或动作时,会触发当前的承接策略。承接的方式一共包含两种,一种是仅动作承接,一种是动作+文案承接。仅动作承接指的是数字人不做口头的承接回复,仅做动作响应用户,如在对话的过程中用户突然摇了摇手向数字人打招呼,虚拟人物仅需要回复一个打招呼的动作即可,不需要影响当前其它的对话状态。动作+文案承接指的是数字人不仅要做动作响应用户,还需要做口头的承接回复,这种承接会对当前的对话流程产生一定影响,但也会在体验上给人智能的感觉。如当检测到用户在对话过程中出现了不开心的情绪,虚拟人物需要打断当前的对话状态,主动询问用户“有什么事情不满意”,同时给出安慰动作。
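The two take-over modes could be applied roughly as sketched below, where an action-only take-over plays a gesture without touching the dialogue state and an action-plus-text take-over may pause the current reply first; the AvatarDriver interface and the sample take-over line are placeholder assumptions.

```python
# Sketch of applying the two take-over modes: action-only leaves the dialogue
# state untouched, action+text may interrupt the current reply first.
# The AvatarDriver methods are placeholders, not the system's actual interface.
from dataclasses import dataclass
from typing import Optional


class AvatarDriver:
    """Placeholder driver; a real system would push rendered frames instead."""
    def play_gesture(self, name: str) -> None: print(f"[gesture] {name}")
    def say(self, text: str) -> None: print(f"[tts] {text}")
    def pause_current_reply(self) -> None: print("[pause current reply]")


@dataclass
class TakeOver:
    position: str                 # "mid_sentence" or "sentence_end"
    action: Optional[str] = None  # gesture tag to play
    text: Optional[str] = None    # take-over line to speak, if any


def apply_take_over(driver: AvatarDriver, t: TakeOver) -> None:
    if t.text is None:                    # action-only: respond with a gesture
        if t.action:
            driver.play_gesture(t.action)
        return
    if t.position == "mid_sentence":      # action+text may cut into the reply
        driver.pause_current_reply()
    if t.action:
        driver.play_gesture(t.action)
    driver.say(t.text)


# e.g. the user looks unhappy: interrupt and comfort
apply_take_over(AvatarDriver(),
                TakeOver("mid_sentence", action="comfort", text="有什么事情不满意吗？"))
```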
另外,虚拟人物交互控制系统还可提供双工主动/被动打断的过程。具体的,所述基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互的步骤包括:
在所述虚拟人物交互策略的用户意图数据和/或用户姿态数据中,确定用户具有打断意图数据的情况下,暂停所述虚拟人物当前的多模态交互;基于所述打断意图数据确定所 述虚拟人物对应的打断承接交互数据,并基于所述打断承接交互数据,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物继续进行多模态交互。
打断意图数据可以理解为用户具有明确拒绝与虚拟人物交流的数据。比如,用户做出“闭嘴”手势,或者是明确说明“暂停沟通吧”等语句。
打断承接交互数据可以理解为虚拟人物确定用户具有打断意图时,所对应的承接文本语句或者是承接的动作数据等。
实际应用中,虚拟人物交互控制系统的虚拟人物交互策略中,如果根据用户意图数据和/或用户姿态数据,确定用户具有打断意图的情况下,即可暂停虚拟人物当前的交互文本或者交互动作,根据该打断意图确定对应的打断承接交互数据。再利用三维渲染模型生成包含上述动作交互策略的虚拟人物形象,以驱动虚拟人物继续根据打断承接交互数据继续完成多模态交互。例如,当数字人发现用户有打断意图时会主动打断当前对话,这种打断意图可以是用户显示的打断意图,如在数字人讲话过程中,用户否定性的表达或者负向情绪等。也可以是用户隐式的打断意图,比如用户突然消失,或者不在一个沟通的状态。在当前策略下,数字人会打断当前的说话状态,等待用户说话,或者主动询问对方打断的原因。
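The explicit and implicit interruption cases described here might be detected and answered as in the following sketch; the trigger keywords, the "user visible" signal and the follow-up prompts are assumptions for illustration.

```python
# Sketch of the explicit/implicit interruption check; the keyword list, the
# "user visible" signal and the follow-up prompts are illustrative assumptions.
from typing import Optional


def detect_interruption(text: str, emotion: Optional[str],
                        user_visible: bool) -> Optional[str]:
    if any(k in text for k in ("暂停沟通", "别说了", "stop")):
        return "explicit"                 # user explicitly cuts the avatar off
    if emotion in {"angry", "displeased"}:
        return "explicit"                 # negative expression while the avatar talks
    if not user_visible:
        return "implicit"                 # user left the frame / stopped engaging
    return None


def interruption_reply(kind: str) -> Optional[str]:
    """After pausing the current reply upstream, speak one of these follow-ups."""
    if kind == "explicit":
        return "好的，您请讲。"            # yield the floor and wait for the user
    if kind == "implicit":
        return "请问您还在吗？"            # ask why the dialogue stopped
    return None
```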
最后,虚拟人物交互控制系统还可提供输出渲染功能,将确定虚拟人物交互的音频数据流以及视频数据流进行融合,再推送出去。具体的,所述基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互的步骤包括:
基于所述文本交互策略确定所述虚拟人物文本交互的音频数据流;基于所述动作交互策略确定所述虚拟人物的动作交互的视频数据流;将所述音频数据流和所述视频数据流进行融合处理,渲染所述虚拟人物的多模态交互数据流,并基于所述多模态交互数据流,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
实际应用中,虚拟人物交互控制系统的输出渲染合成视频流推送出去,一共包含3个部分。1)流式TTS部分,将系统的文本输出合成音频流。2)驱动部分,包含两个子模块,面部驱动模块和动作驱动模块。面部驱动模块根据语音流,驱动数字人输出准确的口型。动作驱动模块根据系统输出的动作标签,驱动数字人输出准确的动作。3)渲染合成部分,负责将驱动部分、TTS等模块的输出渲染合成数字人的视频流。
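The three output-side parts (streaming TTS, the face and action drivers, and the render/composite step) can be lined up as a small pipeline, as in the sketch below; every component class is a placeholder stub standing in for the real TTS, lip-sync and rendering modules.

```python
# Sketch of the output side: streaming TTS, a face driver for lip shapes, an
# action driver for gesture tags, and a compositor that merges everything into
# the avatar video stream. All component classes are placeholder stubs.
from typing import Iterable, Iterator


class StreamingTTS:
    def synthesize(self, text: str) -> Iterator[bytes]:
        yield b"<audio-chunk>"               # placeholder audio chunks


class FaceDriver:
    def lip_frames(self, audio_chunk: bytes) -> bytes:
        return b"<mouth-shape-frame>"        # lip shapes derived from the audio


class ActionDriver:
    def action_frames(self, action_tag: str) -> bytes:
        return b"<body-motion-frame>"        # motion derived from action tags


class Compositor:
    def render(self, audio: bytes, face: bytes, body: bytes) -> bytes:
        return b"<avatar-video-frame>"       # fused avatar video frame


def render_reply(text: str, action_tags: Iterable[str]) -> Iterator[bytes]:
    tts, face, act, comp = StreamingTTS(), FaceDriver(), ActionDriver(), Compositor()
    actions = list(action_tags)
    for i, chunk in enumerate(tts.synthesize(text)):
        body = act.action_frames(actions[i % len(actions)]) if actions else b""
        yield comp.render(chunk, face.lip_frames(chunk), body)
```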
综上,本说明书实施例提供的多模态交互方法,通过加入视频流和对应视觉理解模块,不仅可以感知到用户的面部表情,而且可以感知用户的动作。此外可以通过类似的方法,扩展新的视觉处理模块,让虚拟人物感知更多的多模态信息,如环境信息等。在本说明书实施例中,本系统可以支持实时感知用户的生气、不悦、中性、开心和惊讶五种面部表情, 可以实时感知有明确意义的动作(如OK、数字和左滑右滑等)、不安全手势(如竖中指和比小拇指等)和自定义特殊动作三大类动作。
另外，本方法通过增加多模态控制模块和多模态双工状态管理模块，让对话从一问一答的独占式对话形式，变成可以随时承接或打断的非独占式对话形式。能解决这个问题主要有以下几个原因：1)多模态控制模块将对话切分成更小的决策单元，不再以完整的用户问题作为回复用户的触发条件，从而做到即使在对话过程中，也可以随时承接，随时打断。其中语音流的切分策略是以200ms的VAD时间将语音流进行切分，一般来说，人说话过程的换气间隔约200ms。视频流采用检测触发的策略，当检测到指定动作、表情或者目标物体时，再进行双工状态的决策。2)多模态双工状态管理模块是解决这个问题的核心，因为其不仅维护了当前的双工对话状态，而且可以决策当前的回复策略，双工策略包含双工主动承接、双工主动\被动打断、调用基础对话系统或业务逻辑和无反馈4种状态。通过在这4种状态之间决策，该系统可以实现随时承接，随时打断和基本问答的能力。3)本方案将对话切分成更小的单元，并将该单元作为数字人决策和回复的粒度，让对话不再是一问一答的独占式对话形式。因此即使用户还未表达完整，该系统已经在处理用户的输入信息并计算回复的结果了，当用户表达完时，系统已经不需要从头再进行计算了，直接播放承接话术即可，从而大大缩短了交互的时延。在体感上，系统的对话时延可以由1.5秒钟降低到800ms左右。
参见图4,图4示出了本说明一个实施例提供的一种多模态交互方法的处理过程示意图。
图4实施例中可分为多模态控制系统-输入、多模态双工状态管理系统、基础对话系统、以及多模态控制系统-输出,上述系统可以理解为多模态交互方法所应用于虚拟人物交互控制系统的4个子系统。
实际应用中,用户的视频流和语音流从多模态控制系统-输入进入。对于视频流,先经过目标情绪检测粗召模块以及目标手势检测粗召模块,进而再进行情绪分类以及手势分类,并将最后的情绪识别结果以及手势识别结果输入至多模态数据&对齐模块。对于语音流,先进行切分,再通过ASR进行文本转换,最后输入至多模态数据&对齐模块。进一步地,多模态数据&对齐模块综合语音识别结果和视频中的情绪以及手势识别结果,确定出目标用户意图与目标动作数据,并输入至多模态双工状态管理系统中的多模态双工状态决策模块中。
进一步地,图4中的多模态双工状态决策系统可进行双工策略决策,确定两种承接方式,一种是动作+文案承接,另一种是仅动作承接。在动作+文案承接的过程中,可通过在句中承接还是在句尾承接的判断,进而可分为两个支路实现承接过程。具体的,在句尾承接中,先根据意图识别确定承接文案决策以及承接动作决策,在句中承接中,可确定承接文案决策以及承接动作决策。另外,在动作承接中,决策出具体的承接动作即可,最后再 将虚拟人物的承接策略输入至多模态控制系统-输出中,确定流式视频流与流式音频流。
需要说明的是,多模态双工状态决策系统中还包括多模态打断意图判断,可结合业务实现具体的承接打断功能。
更进一步地,多模态控制系统-输出可根据虚拟人物的流式视频流与流式音频流确定面部驱动数据以及动作驱动数据,以完成对虚拟人物的渲染+流媒体的合流处理,以输出数字人视频流。
另外,在多模态控制系统-输出可根据虚拟人物的流式视频流与流式音频流中,基础对话系统还可为虚拟人物的交互提供基础对话数据,以及基础的业务逻辑和动作相匹配,共同完成数字人视频流的生成。
综上,本说明书实施例提供的一种多模态交互方法,具有多模态感知、多模态双工以及交互时延短的效果。具体的,针对多模态感知,本说明书实施例提出了一种可以感知用户语音和视频信息的系统。与传统的基于语音流的对话系统相比,本方案不仅可以处理用户的语音信息,而且可以识别和检测用户的情绪和动作,大大提高了数字人感知的智能性。针对多模态双工,本说明书实施例提出了一种可以即时承接和随时打断的交互系统。与传统的一问一答的单轮式对话系统相比,该系统可以在用户说话的过程中,即时给与用户一些反馈和回复,如简单的语气承接。此外,当用户不在接听状态或者用户有明显的打断对话的意图时,可以随时打断当前的对话流程。双工交互系统提高了交互的流畅性,从而可以给用户更好的交互体验。交互时延短:当用户还未完整的表达完整,该系统已经在流式的处理用户的输入信息并计算回复的结果了,当用户表达完时,系统已经不需要从头进行计算了,直接播放承接话术即可,大大缩短了交互的时延。在体感上,系统的对话时延可以由1.5秒钟降低到800ms左右。
与上述方法实施例相对应,本说明书还提供了多模态交互装置实施例,图5示出了本说明书一个实施例提供的一种多模态交互装置的结构示意图。如图5所示,该装置应用于虚拟人物交互控制系统包括:
数据接收模块502被配置为接收多模态数据,其中,所述多模态数据包括语音数据和视频数据;数据识别模块504,被配置为识别所述多模态数据,获得用户意图数据和/或用户姿态数据,其中,所述用户姿态数据包括用户情绪数据以及用户动作数据;策略确定模块506,被配置为基于所述用户意图数据和/或用户姿态数据确定虚拟人物交互策略,其中,所述虚拟人物交互策略包括文本交互策略和/或动作交互策略;渲染模型获取模块508,被配置为获取所述虚拟人物的三维渲染模型;交互驱动模块510,被配置为基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
可选地,所述数据识别模块504进一步被配置为:对所述多模态数据中的语音数据进行文本转换,识别转换后的文本数据,获得用户意图数据;和/或对所述多模态数据中的 视频数据和/或语音数据进行情绪识别,获得用户情绪数据;对所述多模态数据中的视频数据进行手势识别,获得用户动作数据;基于所述用户情绪数据以及所述用户动作数据确定用户姿态数据。
可选地,所述数据识别模块504进一步被配置为:对所述多模态数据中的视频数据进行情绪检测,在检测到所述视频数据中包含目标情绪的情况下,对所述视频数据中的目标情绪进行分类,获得用户情绪数据。
可选地,所述数据识别模块504进一步被配置为:对所述多模态数据中的视频数据进行手势检测,在检测到所述视频数据中包括目标手势的情况下,对所述视频数据中的目标手势进行分类,获得用户动作数据。
可选地,所述策略确定模块506进一步被配置为:基于所述用户意图数据和/或用户姿态数据对所述多模态数据中的视频数据进行融合处理,确定用户的目标意图文本和/或目标姿态动作;基于所述目标意图文本和/或所述目标姿态动作,确定虚拟人物交互策略。
可选地,所述策略确定模块506进一步被配置为:基于所述目标意图文本确定虚拟人物的文本交互策略;和/或基于所述目标姿态动作确定虚拟人物的动作交互策略。
可选地,所述交互驱动模块510进一步被配置为:基于所述文本交互策略确定所述虚拟人物文本交互的文本承接位置,其中,所述文本承接位置为针对所述语音数据对应的承接位置;基于所述动作交互策略确定所述虚拟人物动作交互的动作承接位置,其中,所述动作承接位置为针对所述视频数据对应的承接位置;基于所述文本承接位置和/或所述动作承接位置,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
可选地,所述交互驱动模块510进一步被配置为:在所述虚拟人物交互策略的用户意图数据和/或用户姿态数据中,确定用户具有打断意图数据的情况下,暂停所述虚拟人物当前的多模态交互;基于所述打断意图数据确定所述虚拟人物对应的打断承接交互数据,并基于所述打断承接交互数据,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物继续进行多模态交互。
可选地,所述装置还包括:视频流输出模块,被配置为基于所述用户意图数据和/或所述用户姿态数据,调用预先存储的基础对话数据,其中,所述基础对话数据包括基础语音数据和/或基础动作数据;基于所述基础对话数据渲染所述虚拟人物的输出视频流,并驱动所述虚拟人物对所述输出视频流进行展示。
可选地,所述交互驱动模块510进一步被配置为:基于所述文本交互策略确定所述虚拟人物文本交互的音频数据流;基于所述动作交互策略确定所述虚拟人物的动作交互的视频数据流;将所述音频数据流和所述视频数据流进行融合处理,渲染所述虚拟人物的多模态交互数据流,并基于所述多模态交互数据流,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
本说明书实施例提供的多模态交互装置,通过接收到用户的语音数据和音频数据,并进行意图识别和姿态识别,以确定用户的交流意图和/或用户对应的姿态,进而,根据用户的交流意图和/或用户对应的姿态确定虚拟人物与用户具体的交互策略,再驱动虚拟人物根据确定的交互策略,完成与用户的交互过程,该种方式不仅能够检测和识别出用户的情绪和动作,在决策虚拟人物交互策略时,还会考虑到虚拟人物应对用户的情绪和动作,使得虚拟人物对用户的情绪和/或动作的表达,均有相应的应对策略,这样不仅使得整个交互过程的时延较低,还会使得整个用户与虚拟人物的交互过程更加流畅,给用户较好的交互体验。
上述为本实施例的一种多模态交互装置的示意性方案。需要说明的是,该多模态交互装置的技术方案与上述的多模态交互方法的技术方案属于同一构思,多模态交互装置的技术方案未详细描述的细节内容,均可以参见上述多模态交互方法的技术方案的描述。
图6示出了根据本说明书一个实施例提供的一种计算设备600的结构框图。该计算设备600的部件包括但不限于存储器610和处理器620。处理器620与存储器610通过总线630相连接,数据库650用于保存数据。
计算设备600还包括接入设备640，接入设备640使得计算设备600能够经由一个或多个网络660通信。这些网络的示例包括公用交换电话网（PSTN）、局域网（LAN）、广域网（WAN）、个域网（PAN）或诸如因特网的通信网络的组合。接入设备640可以包括有线或无线的任何类型的网络接口（例如，网络接口卡（NIC））中的一个或多个，诸如IEEE 802.11无线局域网（WLAN）无线接口、全球微波互联接入（Wi-MAX）接口、以太网接口、通用串行总线（USB）接口、蜂窝网络接口、蓝牙接口、近场通信（NFC）接口，等等。
在本说明书的一个实施例中,计算设备600的上述部件以及图6中未示出的其他部件也可以彼此相连接,例如通过总线。应当理解,图6所示的计算设备结构框图仅仅是出于示例的目的,而不是对本说明书范围的限制。本领域技术人员可以根据需要,增添或替换其他部件。
计算设备600可以是任何类型的静止或移动计算设备,包括移动计算机或移动计算设备(例如,平板计算机、个人数字助理、膝上型计算机、笔记本计算机、上网本等)、移动电话(例如,智能手机)、可佩戴的计算设备(例如,智能手表、智能眼镜等)或其他类型的移动设备,或者诸如台式计算机或PC的静止计算设备。计算设备600还可以是移动式或静止式的服务器。
其中,处理器620用于执行如下计算机可执行指令,该计算机可执行指令被处理器执行时实现上述多模态交互方法的步骤。
上述为本实施例的一种计算设备的示意性方案。需要说明的是,该计算设备的技术方案与上述的多模态交互方法的技术方案属于同一构思,计算设备的技术方案未详细描述的细节内容,均可以参见上述多模态交互方法的技术方案的描述。
本说明书一实施例还提供一种计算机可读存储介质,其存储有计算机可执行指令,该计算机可执行指令被处理器执行时实现上述多模态交互方法的步骤。
上述为本实施例的一种计算机可读存储介质的示意性方案。需要说明的是,该存储介质的技术方案与上述的多模态交互方法的技术方案属于同一构思,存储介质的技术方案未详细描述的细节内容,均可以参见上述多模态交互方法的技术方案的描述。
本说明书一实施例还提供一种计算机程序,其中,当所述计算机程序在计算机中执行时,令计算机执行上述多模态交互方法的步骤。
上述为本实施例的一种计算机程序的示意性方案。需要说明的是,该计算机程序的技术方案与上述的多模态交互方法的技术方案属于同一构思,计算机程序的技术方案未详细描述的细节内容,均可以参见上述多模态交互方法的技术方案的描述。
上述对本说明书特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。
所述计算机指令包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质等。需要说明的是,所述计算机可读介质包含的内容可以根据司法管辖区内立法和专利实践的要求进行适当的增减,例如在某些司法管辖区,根据立法和专利实践,计算机可读介质不包括电载波信号和电信信号。
需要说明的是,对于前述的各方法实施例,为了简便描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本说明书实施例并不受所描述的动作顺序的限制,因为依据本说明书实施例,某些步骤可以采用其它顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定都是本说明书实施例所必须的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其它实施例的相关描述。
以上公开的本说明书优选实施例只是用于帮助阐述本说明书。可选实施例并没有详尽叙述所有的细节,也不限制该发明仅为所述的具体实施方式。显然,根据本说明书实施例的内容,可作很多的修改和变化。本说明书选取并具体描述这些实施例,是为了更好地解释本说明书实施例的原理和实际应用,从而使所属技术领域技术人员能很好地理解和利用本说明书。本说明书仅受权利要求书及其全部范围和等效物的限制。

Claims (13)

  1. 一种多模态交互方法，应用于虚拟人物交互控制系统，包括：
    接收多模态数据,其中,所述多模态数据包括语音数据和视频数据;
    识别所述多模态数据,获得用户意图数据和/或用户姿态数据,其中,所述用户姿态数据包括用户情绪数据以及用户动作数据;
    基于所述用户意图数据和/或用户姿态数据确定虚拟人物交互策略,其中,所述虚拟人物交互策略包括文本交互策略和/或动作交互策略;
    获取所述虚拟人物的三维渲染模型;
    基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
  2. 根据权利要求1所述的多模态交互方法,所述识别所述多模态数据,获得用户意图数据和/或用户姿态数据,包括:
    对所述多模态数据中的语音数据进行文本转换,识别转换后的文本数据,获得用户意图数据;和/或
    对所述多模态数据中的视频数据和/或语音数据进行情绪识别,获得用户情绪数据;
    对所述多模态数据中的视频数据进行手势识别,获得用户动作数据;
    基于所述用户情绪数据以及所述用户动作数据确定用户姿态数据。
  3. 根据权利要求2所述的多模态交互方法,所述对所述多模态数据中的视频数据进行情绪识别,获得用户情绪数据,包括:
    对所述多模态数据中的视频数据进行情绪检测,在检测到所述视频数据中包含目标情绪的情况下,对所述视频数据中的目标情绪进行分类,获得用户情绪数据。
  4. 根据权利要求2所述的多模态交互方法,所述对所述多模态数据中的视频数据进行手势识别,获得用户动作数据,包括:
    对所述多模态数据中的视频数据进行手势检测,在检测到所述视频数据中包括目标手势的情况下,对所述视频数据中的目标手势进行分类,获得用户动作数据。
  5. 根据权利要求1所述的多模态交互方法,所述基于所述用户意图数据和/或用户姿态数据确定虚拟人物交互策略,包括:
    基于所述用户意图数据和/或用户姿态数据对所述多模态数据中的视频数据进行融合处理,确定用户的目标意图文本和/或目标姿态动作;
    基于所述目标意图文本和/或所述目标姿态动作,确定虚拟人物交互策略。
  6. 根据权利要求5所述的多模态交互方法,所述基于所述目标意图文本和/或所述目标姿态动作,确定虚拟人物交互策略,包括:
    基于所述目标意图文本确定虚拟人物的文本交互策略;和/或
    基于所述目标姿态动作确定虚拟人物的动作交互策略。
  7. 根据权利要求6所述的多模态交互方法,所述基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互,包括:
    基于所述文本交互策略确定所述虚拟人物文本交互的文本承接位置,其中,所述文本承接位置为针对所述语音数据对应的承接位置;
    基于所述动作交互策略确定所述虚拟人物动作交互的动作承接位置,其中,所述动作承接位置为针对所述视频数据对应的承接位置;
    基于所述文本承接位置和/或所述动作承接位置,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
  8. 根据权利要求1所述的多模态交互方法,所述基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互,包括:
    在所述虚拟人物交互策略的用户意图数据和/或用户姿态数据中,确定用户具有打断意图数据的情况下,暂停所述虚拟人物当前的多模态交互;
    基于所述打断意图数据确定所述虚拟人物对应的打断承接交互数据,并基于所述打断承接交互数据,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物继续进行多模态交互。
  9. 根据权利要求1所述的多模态交互方法,所述识别所述多模态数据,获得用户意图数据和/或用户姿态数据之后,还包括:
    基于所述用户意图数据和/或所述用户姿态数据,调用预先存储的基础对话数据,其中,所述基础对话数据包括基础语音数据和/或基础动作数据;
    基于所述基础对话数据渲染所述虚拟人物的输出视频流,并驱动所述虚拟人物对所述输出视频流进行展示。
  10. 根据权利要求1所述的多模态交互方法,所述基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互,包括:
    基于所述文本交互策略确定所述虚拟人物文本交互的音频数据流;
    基于所述动作交互策略确定所述虚拟人物的动作交互的视频数据流;
    将所述音频数据流和所述视频数据流进行融合处理,渲染所述虚拟人物的多模态交互数据流,并基于所述多模态交互数据流,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
  11. 一种多模态交互装置,应用于虚拟人物交互控制系统,包括:
    数据接收模块,被配置为接收多模态数据,其中,所述多模态数据包括语音数据和视频数据;
    数据识别模块,被配置为识别所述多模态数据,获得用户意图数据和/或用户姿态数据,其中,所述用户姿态数据包括用户情绪数据以及用户动作数据;
    策略确定模块,被配置为基于所述用户意图数据和/或用户姿态数据确定虚拟人物交互策略,其中,所述虚拟人物交互策略包括文本交互策略和/或动作交互策略;
    渲染模型获取模块,被配置为获取所述虚拟人物的三维渲染模型;
    交互驱动模块,被配置为基于所述虚拟人物交互策略,利用所述三维渲染模型生成包含所述动作交互策略的所述虚拟人物的形象,以驱动所述虚拟人物进行多模态交互。
  12. 一种计算设备,包括:
    存储器和处理器;
    所述存储器用于存储计算机可执行指令,所述处理器用于执行所述计算机可执行指令,该计算机可执行指令被处理器执行时实现权利要求1至10任意一项所述多模态交互方法的步骤。
  13. 一种计算机可读存储介质,其存储有计算机可执行指令,该计算机可执行指令被处理器执行时实现权利要求1至10任意一项所述多模态交互方法的步骤。
PCT/CN2023/085827 2022-05-09 2023-04-03 多模态交互方法以及装置 WO2023216765A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210499890.X 2022-05-09
CN202210499890.XA CN114995636A (zh) 2022-05-09 2022-05-09 多模态交互方法以及装置

Publications (1)

Publication Number Publication Date
WO2023216765A1 true WO2023216765A1 (zh) 2023-11-16

Family

ID=83024526

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/085827 WO2023216765A1 (zh) 2022-05-09 2023-04-03 多模态交互方法以及装置

Country Status (2)

Country Link
CN (1) CN114995636A (zh)
WO (1) WO2023216765A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114995636A (zh) * 2022-05-09 2022-09-02 阿里巴巴(中国)有限公司 多模态交互方法以及装置
CN115914366B (zh) * 2023-01-10 2023-06-30 北京红棉小冰科技有限公司 虚拟人物物语推送方法、系统和电子设备
CN116798427A (zh) * 2023-06-21 2023-09-22 支付宝(杭州)信息技术有限公司 基于多模态的人机交互方法及数字人系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416420A (zh) * 2018-02-11 2018-08-17 北京光年无限科技有限公司 基于虚拟人的肢体交互方法及系统
CN109032328A (zh) * 2018-05-28 2018-12-18 北京光年无限科技有限公司 一种基于虚拟人的交互方法及系统
CN109271018A (zh) * 2018-08-21 2019-01-25 北京光年无限科技有限公司 基于虚拟人行为标准的交互方法及系统
CN109324688A (zh) * 2018-08-21 2019-02-12 北京光年无限科技有限公司 基于虚拟人行为标准的交互方法及系统
CN114995636A (zh) * 2022-05-09 2022-09-02 阿里巴巴(中国)有限公司 多模态交互方法以及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416420A (zh) * 2018-02-11 2018-08-17 北京光年无限科技有限公司 基于虚拟人的肢体交互方法及系统
CN109032328A (zh) * 2018-05-28 2018-12-18 北京光年无限科技有限公司 一种基于虚拟人的交互方法及系统
CN109271018A (zh) * 2018-08-21 2019-01-25 北京光年无限科技有限公司 基于虚拟人行为标准的交互方法及系统
CN109324688A (zh) * 2018-08-21 2019-02-12 北京光年无限科技有限公司 基于虚拟人行为标准的交互方法及系统
CN114995636A (zh) * 2022-05-09 2022-09-02 阿里巴巴(中国)有限公司 多模态交互方法以及装置

Also Published As

Publication number Publication date
CN114995636A (zh) 2022-09-02

Similar Documents

Publication Publication Date Title
US20230316643A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
WO2023216765A1 (zh) 多模态交互方法以及装置
US20200279553A1 (en) Linguistic style matching agent
CN106653052B (zh) 虚拟人脸动画的生成方法及装置
US11430438B2 (en) Electronic device providing response corresponding to user conversation style and emotion and method of operating same
Rossi et al. An extensible architecture for robust multimodal human-robot communication
CN110400251A (zh) 视频处理方法、装置、终端设备及存储介质
US20140129207A1 (en) Augmented Reality Language Translation
WO2017200074A1 (ja) 対話方法、対話システム、対話装置、及びプログラム
CN109086860B (zh) 一种基于虚拟人的交互方法及系统
US20220335079A1 (en) Method for generating virtual image, device and storage medium
WO2023226914A1 (zh) 基于多模态数据的虚拟人物驱动方法、系统及设备
KR102174922B1 (ko) 사용자의 감정 또는 의도를 반영한 대화형 수어-음성 번역 장치 및 음성-수어 번역 장치
US20230046658A1 (en) Synthesized speech audio data generated on behalf of human participant in conversation
CN110737335B (zh) 机器人的交互方法、装置、电子设备及存储介质
CN113793398A (zh) 基于语音交互的绘画方法与装置、存储介质和电子设备
KR20200059112A (ko) 로봇 상호작용 시스템 및 그를 위한 프로그램
CN113689879A (zh) 实时驱动虚拟人的方法、装置、电子设备及介质
CN116009692A (zh) 虚拟人物交互策略确定方法以及装置
US20230343324A1 (en) Dynamically adapting given assistant output based on a given persona assigned to an automated assistant
JP2023099309A (ja) アバターを通じて映像の音声を手話に通訳する方法、コンピュータ装置、およびコンピュータプログラム
WO2017200077A1 (ja) 対話方法、対話システム、対話装置、及びプログラム
JP7423490B2 (ja) ユーザの感情に応じたキャラクタの傾聴感を表現する対話プログラム、装置及び方法
JP2023120130A (ja) 抽出質問応答を利用する会話型aiプラットフォーム
Babu et al. Marve: a prototype virtual human interface framework for studying human-virtual human interaction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23802541

Country of ref document: EP

Kind code of ref document: A1