CN114995636A - Multi-modal interaction method and device - Google Patents

Multi-modal interaction method and device

Info

Publication number
CN114995636A
Authority
CN
China
Prior art keywords
data
interaction
user
virtual character
modal
Prior art date
Legal status
Pending
Application number
CN202210499890.XA
Other languages
Chinese (zh)
Inventor
朱鹏程
马远凯
罗智凌
周伟
李禹�
钱景
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority: CN202210499890.XA
Publication: CN114995636A
PCT application: PCT/CN2023/085827 (published as WO2023216765A1)
Legal status: Pending

Classifications

    • G06F 3/011 — Input arrangements or combined input and output arrangements for interaction between user and computer; arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06T 13/205 — 3D [Three Dimensional] animation driven by audio data
    • G06T 13/40 — 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T 15/005 — 3D [Three Dimensional] image rendering; general purpose rendering architectures
    • G06T 19/006 — Manipulating 3D models or images for computer graphics; mixed reality
    • G06F 2203/011 — Indexing scheme relating to G06F 3/01; emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns

Abstract

Embodiments of the present specification provide a multi-modal interaction method and device. The multi-modal interaction method is applied to a virtual character interaction control system and comprises the following steps: receiving multi-modal data, wherein the multi-modal data comprise voice data and video data; recognizing the multi-modal data to obtain user intention data and/or user posture data, wherein the user posture data comprise user emotion data and user action data; determining a virtual character interaction policy based on the user intention data and/or the user posture data, wherein the virtual character interaction policy comprises a text interaction policy and/or an action interaction policy; acquiring a three-dimensional rendering model of the virtual character; and, based on the virtual character interaction policy, generating an image of the virtual character containing the action interaction policy by using the three-dimensional rendering model, so as to drive the virtual character to perform multi-modal interaction. The latency of the whole interaction process is kept low, providing a better interactive experience for users.

Description

Multi-modal interaction method and device
Technical Field
The embodiment of the specification relates to the technical field of computers, in particular to a multi-modal interaction method for virtual characters.
Background
With the development of virtual character technology, intelligent digital human products have increasingly penetrated many aspects of daily life. Demand for virtual characters has expanded further: they are now expected to act as companions capable of multi-modal interaction with users through language, motion, and emotion. Existing virtual character interaction systems, however, are rigid and lack intelligence; they can only use the instruction actions and text content preset in the system, and can only realize a single interaction procedure through the interaction components built into the system.
Disclosure of Invention
In view of this, the embodiments of the present specification provide a multi-modal interaction method. One or more embodiments of the present specification also relate to a multimodal interaction apparatus, a computing device, a computer-readable storage medium, and a computer program, so as to solve the technical deficiencies of the prior art.
According to a first aspect of embodiments of the present specification, there is provided a multi-modal interaction method applied to a virtual character interaction control system, including:
receiving multimodal data, wherein the multimodal data comprises voice data and video data;
identifying the multi-modal data, and obtaining user intention data and/or user posture data, wherein the user posture data comprises user emotion data and user action data;
determining a virtual character interaction policy based on the user intent data and/or user pose data, wherein the virtual character interaction policy comprises a text interaction policy and/or an action interaction policy;
acquiring a three-dimensional rendering model of the virtual character;
and generating an image of the virtual character containing the action interaction strategy by utilizing the three-dimensional rendering model based on the virtual character interaction strategy so as to drive the virtual character to carry out multi-modal interaction.
According to a second aspect of embodiments of the present specification, there is provided a multimodal interaction apparatus applied to a virtual character interaction control system, including:
a data receiving module configured to receive multimodal data, wherein the multimodal data comprises voice data and video data;
a data recognition module configured to recognize the multimodal data, obtain user intention data and/or user posture data, wherein the user posture data comprises user emotion data and user action data;
a policy determination module configured to determine a virtual character interaction policy based on the user intent data and/or user pose data, wherein the virtual character interaction policy comprises a text interaction policy and/or an action interaction policy;
a rendering model obtaining module configured to obtain a three-dimensional rendering model of the virtual character;
and the interaction driving module is configured to generate an avatar of the virtual character containing the action interaction strategy by using the three-dimensional rendering model based on the virtual character interaction strategy so as to drive the virtual character to perform multi-modal interaction.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, which when executed by the processor, implement the steps of the multi-modal interaction method described above.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the above-described multimodal interaction method.
According to a fifth aspect of embodiments herein, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above-mentioned multi-modal interaction method.
One embodiment of the present specification provides a multi-modal interaction method, which is applied to a virtual character interaction control system, and receives multi-modal data, wherein the multi-modal data comprises voice data and video data; identifying the multi-modal data, and obtaining user intention data and/or user posture data, wherein the user posture data comprises user emotion data and user action data; determining a virtual character interaction policy based on the user intent data and/or user pose data, wherein the virtual character interaction policy comprises a text interaction policy and/or an action interaction policy; acquiring a three-dimensional rendering model of the virtual character; and based on the virtual character interaction strategy, generating an image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model so as to drive the virtual character to carry out multi-mode interaction.
Specifically, the voice data and the video data of the user are received, and intention recognition and gesture recognition are performed to determine the user's communication intention and/or the user's corresponding posture. The specific interaction strategy between the virtual character and the user is then determined according to that communication intention and/or posture, and the virtual character is driven to complete the interaction with the user according to the determined strategy. In this way the system not only detects and recognizes the user's emotion and actions, but also takes them into account when deciding the virtual character's interaction strategy, so that the virtual character responds to the user with corresponding emotion and/or actions. The latency of the whole interaction process is therefore lower, the interaction between the user and the virtual character is smoother, and a better interactive experience is provided for the user.
Drawings
FIG. 1 is a schematic diagram of a system structure of a multi-modal interaction method applied to a virtual character interaction control system according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method for multimodal interaction provided by one embodiment of the present specification;
FIG. 3 is a system architecture diagram of a virtual character interaction control system provided by one embodiment of the present description;
FIG. 4 is a process diagram of a multi-modal interaction method according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a multimodal interaction apparatus provided in an embodiment of the present specification;
FIG. 6 is a block diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. This description may be implemented in many ways other than those specifically set forth herein, and those skilled in the art will appreciate that the present description is susceptible to similar generalizations without departing from the scope of the description, and thus is not limited to the specific implementations disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if," as used herein, may be interpreted as "at … …" or "when … …" or "in response to a determination," depending on the context.
First, the noun terms to which one or more embodiments of the present specification relate are explained.
Multi-modal interaction: the user can communicate with the digital person through modes of voice, characters, expressions, actions, gestures and the like, and the digital person can reply to the user through modes of voice, characters, expressions, actions, gestures and the like.
Duplex interaction: an interaction mode supporting real-time, two-way communication, in which the user and the digital person can interrupt or reply to each other at any time.
Non-exclusive dialog: a dialog in which neither party exclusively holds the floor; both parties communicate bidirectionally and either one can interrupt or take over at any time.
VAD (Voice Activity Detection): also known as voice endpoint detection, voice boundary detection.
TTS (Text To Speech, Speech synthesis technology): the text is converted to sound.
A digital person: refers to a virtual character having a digitized appearance that can be used in a virtual reality application to interact with a real person. In the process of communication with digital people, the traditional interactive mode is an exclusive question-and-answer mode taking voice as a carrier.
At present, the following problems may occur in the interaction process of the virtual character and the user:
1) In terms of communication fluency: with an exclusive, turn-taking form of communication, the user cannot actively interrupt the digital person's speech, and the digital person cannot give an immediate take-over reply while the user is talking, so the communication between user and digital person feels unintelligent.
2) In terms of the diversity of perception: in a communication mode that uses voice as the only carrier, the digital person cannot perceive changes in the user's face, such as expressions and conversational state, nor the user's body movements, such as gestures and body posture. Missing this information means the digital person cannot give immediate feedback on the user's state during the communication, which makes the conversation rather rigid.
3) In terms of response time: due to the latency of ASR, the dialog system, and other factors, typical dialog latency is about 1.2-1.5 s, whereas latency that feels imperceptible to the user is about 600-800 ms. Excessive dialog latency causes the conversation to stutter noticeably and degrades the user experience.
In addition, current intelligent dialog control systems for virtual character interaction support duplex capability only for voice; they lack video understanding and visual duplex state decision capability, and cannot perceive multi-modal information such as the user's expressions, actions, and environment. Some dialog systems support only basic question answering, with no duplex capability (active/passive interruption, take-over), no video understanding capability, and no visual duplex state decision capability, and likewise cannot perceive the user's expressions, actions, or environment.
Based on this, the multi-modal interaction method provided in the embodiments of the present specification is applied to a virtual character interaction control system. By providing a multi-modal control module, a multi-modal duplex state management module, and a basic dialog module, the system realizes the interaction between a virtual character and a real user. On top of the basic dialog task, multi-modal data can be recognized and processed, so that the virtual character can actively take over or interrupt the user's speech, the interaction latency of the system is shortened, and multi-modal information such as the user's expressions, actions, and gestures is perceived. The method is therefore suitable for a variety of application scenarios, including complex ones such as identity verification, loss assessment, and article verification, with good practical results.
It should be noted that the multi-modal control module works as follows: it controls the input and output of the video stream and the voice stream in the interaction system. On the input side, the module segments and understands the incoming voice stream and video stream and controls whether the multi-modal duplex system is triggered, which reduces the transmission cost of the system while speeding up its processing. On the output side, it renders the system's results into the digital person's output video stream. The multi-modal duplex state management module manages the state of the current dialog and decides the duplex state. The duplex states include: 1) duplex active/passive interruption; 2) duplex active take-over; 3) invoking the basic dialog system or business logic; 4) no feedback. The basic dialog module contains the basic business logic and the dialog question-answering capability.
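To make this division of responsibilities concrete, the following minimal Python sketch enumerates the four duplex states described above and stubs out a decision routine for the multi-modal duplex state management module. All names (DuplexState, InteractionState, decide_duplex_state) and the toy rules are illustrative assumptions, not identifiers or logic taken from the patent.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class DuplexState(Enum):
    """The four duplex states decided by the duplex state management module."""
    ACTIVE_OR_PASSIVE_INTERRUPT = auto()  # 1) duplex active/passive interruption
    ACTIVE_TAKE_OVER = auto()             # 2) duplex active take-over
    CALL_BASIC_DIALOG = auto()            # 3) invoke basic dialog system / business logic
    NO_FEEDBACK = auto()                  # 4) no feedback


@dataclass
class InteractionState:
    """Fused per-unit interaction state (text, gesture, emotion slots)."""
    text: Optional[str] = None     # ASR text of the current voice unit, if any
    gesture: Optional[str] = None  # e.g. "ok", "wave", "unsafe"
    emotion: Optional[str] = None  # e.g. "angry", "happy", "surprised"
    vad_silence_ms: int = 0        # trailing silence measured by VAD


def decide_duplex_state(state: InteractionState) -> DuplexState:
    """Toy decision logic; a real module would also consult the dialog context."""
    if state.emotion in {"angry", "displeased"} or state.gesture == "unsafe":
        return DuplexState.ACTIVE_OR_PASSIVE_INTERRUPT
    if state.gesture in {"wave", "ok"}:
        return DuplexState.ACTIVE_TAKE_OVER
    if state.text and state.vad_silence_ms >= 800:
        return DuplexState.CALL_BASIC_DIALOG
    return DuplexState.NO_FEEDBACK
```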
Furthermore, the following embodiments will describe in detail a specific processing manner of each module of the multimodal interaction method provided in the embodiments of the present specification.
In view of the above, in the present specification, a multimodal interaction method is provided, and the present specification relates to a multimodal interaction apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating a system structure of a multi-modal interaction method applied to a virtual character interaction control system according to an embodiment of the present specification.
Fig. 1 shows a virtual character interaction control system 100, and the virtual character interaction control system 100 includes a multi-modal control module 102 and a multi-modal duplex status management module 104.
In practical applications, the multi-modal control module 102 in the virtual character interaction control system 100 serves as the input for the video stream and the voice stream and also as the output for the virtual character's interaction video stream; the multi-modal input part comprises the video stream input and the voice stream input. The multi-modal control module 102 performs emotion detection and gesture detection on the video stream and voice detection on the voice stream, and feeds the detection result of the video stream and/or the detection result of the voice stream into the duplex state decision in the multi-modal duplex state management module 104 to determine the interaction policy of the virtual character, where the interaction policy mainly covers action-only take-over and text-plus-action take-over. The multi-modal duplex state management module 104 then renders the virtual character according to the determined virtual character interaction policy, producing the rendered video stream of the virtual character, which is output through the multi-modal control module 102.
According to the multi-modal interaction method provided by the embodiments of the present specification, the user in the video stream is understood visually, the user's emotion and actions are perceived, and take-over and interruption capabilities are provided for the virtual character, so that the interaction between the virtual character and the user becomes a non-exclusive dialog in which the virtual character can also respond with multi-modal emotion, actions, and/or voice.
FIG. 2 shows a flowchart of a multi-modal interaction method provided in an embodiment of the present specification, which specifically includes the following steps.
It should be noted that the multimodal interaction method provided in the embodiments of the present specification is applied to a virtual character interaction control system; with this system, the interaction between the virtual character and the user has low latency and smooth communication, approximating the way humans interact with each other.
Step 202: multimodal data is received, wherein the multimodal data includes voice data and video data.
In practice, the virtual character interaction control system receives multi-modal data, i.e., the voice data and video data corresponding to the user. The voice data can be understood as the speech with which the user communicates with the virtual character; for example, the user may ask the virtual character, "Can you check how to place a guarantee order?" The video data can be understood as the user's expression, actions, and mouth shape, together with the environment the user is in, while the user speaks to the virtual character. Following the example above, while the user utters the voice data, the expression shown in the video data may be puzzlement, the action may be a spread-hands (shrug) gesture, and the mouth shape corresponds to the voice data.
It should be noted that, to achieve human-like interaction, the virtual character needs to respond immediately to the user's voice data and video data so as to reduce the latency of the interaction. At the same time, two-way interaction, interruption, take-over, and similar capabilities need to be supported.
Step 204: and recognizing the multi-modal data, and obtaining user intention data and/or user posture data, wherein the user posture data comprises user emotion data and user action data.
The user intention data can be understood as the intention of the speech expressed by the user. For example, in the example above, the intention of the voice data "Can you check how to place a guarantee order?" is to ask whether the virtual character can help look up the user's previous guarantee order.
The user posture data can be understood as the posture information expressed by the user in the video data, and includes user emotion data and user action data. For example, in the example above, the "puzzled" emotion expressed by the user's face and the "spread-hands" action made with the user's hands.
In practical applications, the virtual character interaction control system recognizes the multi-modal data, processing the voice data and the video data separately: user intention data is obtained by recognizing the voice data, and user posture data, which may include user emotion data as well as user action data, is obtained by recognizing the video data. It should be noted that, in different application scenarios, only the user intention data, only the user posture data, or both may be recognized from the user's multi-modal data; this is why the expression "and/or" is used in this embodiment, and no limitation is imposed here.
Further, the virtual character interaction control system can respectively identify the voice data and the video data so as to determine the intention, emotion, action, posture and other information of the user. Specifically, the recognizing the multi-modal data and obtaining the user intention data and/or the user posture data includes: performing text conversion on voice data in the multi-modal data, and identifying the converted text data to obtain user intention data; and/or performing emotion recognition on video data and/or voice data in the multi-modal data to obtain user emotion data; performing gesture recognition on video data in the multi-modal data to obtain user action data; determining user gesture data based on the user mood data and the user action data.
In practical applications, the virtual character interaction control system first performs text conversion on the voice data in the multi-modal data, and then recognizes the converted text data to obtain the user intention data. The specific way of converting speech to text includes, but is not limited to, ASR technology; this embodiment does not limit the conversion method. It should be noted that, to ensure that the interaction system can give immediate feedback even while the user is still speaking, the system can segment the voice stream with a VAD window of 200 ms, splitting it into small voice units, and feed each voice unit into the ASR module to convert it into text, which facilitates the subsequent recognition of the user intention data.
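A minimal sketch of this segmentation step is shown below, assuming generic VAD and ASR interfaces; SILENCE_MS, vad_is_silence, and asr_transcribe are illustrative placeholders rather than components named in the patent.

```python
from typing import Iterable, Iterator, List

SILENCE_MS = 200   # VAD window used to cut the stream into voice units
FRAME_MS = 20      # assumed audio frame length


def segment_voice_stream(frames: Iterable[bytes],
                         vad_is_silence) -> Iterator[List[bytes]]:
    """Yield a voice unit whenever ~200 ms of continuous silence is observed."""
    unit: List[bytes] = []
    silence_ms = 0
    for frame in frames:
        if vad_is_silence(frame):
            silence_ms += FRAME_MS
            if unit and silence_ms >= SILENCE_MS:
                yield unit                 # close the current voice unit
                unit, silence_ms = [], 0
        else:
            silence_ms = 0
            unit.append(frame)
    if unit:
        yield unit                         # flush the trailing unit


def transcribe_units(frames, vad_is_silence, asr_transcribe) -> Iterator[str]:
    """Feed each voice unit to ASR so intent recognition can start early."""
    for unit in segment_voice_stream(frames, vad_is_silence):
        yield asr_transcribe(b"".join(unit))
```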
Further, after determining the user intention data, the virtual character interaction control system can perform emotion recognition and gesture recognition on the user according to the video data. It should be noted that emotion recognition can be performed not only on the video data but also on the voice data, or on both. For example, emotion can be recognized from changes in the user's facial expression (eye movement, lip movement) or from a head-shaking motion in the video data; it can also be recognized from the volume and breathing in the voice data. In addition, the virtual character interaction control system can recognize the actions the user displays in the video data, for example the user's gestures: when the user makes a spread-hands gesture, the corresponding user action data is obtained. Finally, the system determines the user posture data from the user emotion data and the user action data together.
It should be noted that the virtual character interaction control system can recognize and perceive subtle changes in the user's voice data and video data, so as to accurately capture the user's intention and dynamics, which facilitates the subsequent decision of which strategy and which modality the virtual character should use to interact with the user.
Furthermore, in order to obtain the user emotion data as quickly as possible, the virtual character interaction control system can adopt a two-stage recognition scheme: coarse-recall emotion detection first, followed by emotion classification to obtain the target emotion. Specifically, the step of performing emotion recognition on the video data in the multi-modal data to obtain user emotion data includes:
performing emotion detection on the video data in the multi-modal data, and, when the video data is detected to contain a target emotion, classifying the target emotion in the video data to obtain the user emotion data. A target emotion can be understood as a user emotion preset in the system, such as anger, displeasure, neutrality, happiness, or surprise.
In a specific implementation, the virtual character interaction control system performs emotion detection on the video data in the multi-modal data. When the video stream is detected to contain a target emotion preset by the system, the target emotion is classified to obtain the user emotion data. In practice, to achieve both good recognition speed and good recognition accuracy, the system uses this two-stage scheme: it first detects the user's expression in the video stream, and only when a target emotion is detected does it classify that emotion by type to determine the final user emotion data.
It should be noted that the virtual character interaction control system can be configured with a coarse-recall emotion detection module and an emotion classification module. The coarse-recall module performs coarse-grained detection on the video stream, and the emotion classification module classifies the target emotion in the video stream to determine whether the user emotion data is angry, displeased, neutral, happy, or surprised. The coarse-recall module may adopt a ResNet18 model, and the emotion classification module may adopt a temporal Transformer model, although this embodiment is not limited to these two model types.
When the virtual character interaction control system does not detect any of the specified emotions, the video stream is not forwarded further, which reduces the transmission cost of the system and speeds up recognition.
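The two-stage gating described above might look roughly like the following sketch; the detector/classifier wrappers and the confidence threshold are assumptions made for illustration, not the patent's actual implementation.

```python
from typing import List, Optional, Sequence

EMOTIONS = ("angry", "displeased", "neutral", "happy", "surprised")


class TwoStageEmotionRecognizer:
    """Stage 1: coarse-recall detector (e.g. a ResNet18-style frame model).
    Stage 2: temporal classifier (e.g. a Transformer over a frame window).
    Frame windows that never trigger stage 1 are dropped, not forwarded."""

    def __init__(self, coarse_detector, temporal_classifier, threshold: float = 0.5):
        self.coarse_detector = coarse_detector          # frame -> float score
        self.temporal_classifier = temporal_classifier  # frames -> logits per emotion
        self.threshold = threshold

    def recognize(self, frames: Sequence) -> Optional[str]:
        if not frames:
            return None
        # Stage 1: cheap per-frame check; bail out early if no target emotion.
        if max(self.coarse_detector(f) for f in frames) < self.threshold:
            return None  # nothing forwarded downstream, saving bandwidth
        # Stage 2: fine-grained classification over the whole frame window.
        logits: List[float] = self.temporal_classifier(frames)
        return EMOTIONS[max(range(len(EMOTIONS)), key=lambda i: logits[i])]
```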
Similarly, when the virtual character interaction control system performs gesture recognition on the video data of the user, a two-stage recognition mode can also be adopted. Specifically, the step of performing gesture recognition on the video data in the multimodal data to obtain the user action data includes:
and carrying out gesture detection on the video data in the multi-modal data, and classifying the target gesture in the video data to obtain user action data under the condition that the video data is detected to comprise the target gesture.
The target gesture can be understood as a gesture type preset by the system, such as a gesture with a definite meaning (such as ok, numbers or right and left sliding), an unsafe gesture (such as middle and small vertical fingers and the like), and a customized special gesture.
In specific implementation, the virtual character interaction control system can perform gesture detection on video data in the multi-modal data. When the video stream is detected to contain a target gesture preset by the system, the target gesture can be classified to obtain user action data. In practical application, the gesture recognition process can also adopt a target gesture rough calling module and a gesture classification module, namely, rough-grained recognition of user gestures in a video stream is realized, then classification recognition is carried out on the target gestures, and whether user action data are gestures with clear meanings (such as ok, numbers or left-right sliding), unsafe gestures (such as vertical middle fingers and little fingers) or customized special gestures is determined.
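As a sketch, and under the same kind of assumptions as the emotion example above, the gesture branch could reuse the coarse-recall gating and then map the classified label to one of the three gesture categories; the label set and the mapping below are invented for illustration only.

```python
from typing import Optional

# Illustrative label-to-category mapping; the real gesture vocabulary is system-defined.
EXPLICIT_GESTURES = {"ok", "number", "swipe_left", "swipe_right"}
UNSAFE_GESTURES = {"middle_finger", "little_finger"}
CUSTOM_GESTURES = {"brand_salute"}


def categorize_gesture(label: Optional[str]) -> Optional[str]:
    """Map a classified gesture label to the category used by the duplex decision."""
    if label is None:
        return None                  # stage 1 found no target gesture
    if label in EXPLICIT_GESTURES:
        return "explicit"            # gestures with a definite meaning
    if label in UNSAFE_GESTURES:
        return "unsafe"              # should suppress mimicking responses
    if label in CUSTOM_GESTURES:
        return "custom"              # customized special gestures
    return None
```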
The multi-modal interaction method provided by the embodiment of the specification identifies the emotion and the action of the user by adopting a two-stage identification mode, can quickly complete the identification process, and can reduce the transmission cost of the system and improve the identification efficiency of the system.
After determining the user intention data and/or the user posture data in the multi-modal data, the virtual character interaction control system can invoke pre-stored basic dialogue data to support the basic interaction procedure. Specifically, after the step of recognizing the multi-modal data and obtaining the user intention data and/or the user posture data, the method further comprises:
based on the user intention data and/or the user posture data, invoking pre-stored basic dialogue data, wherein the basic dialogue data comprises basic voice data and/or basic action data; rendering an output video stream of the virtual character based on the basic dialogue data, and driving the virtual character to display the output video stream.
The basic dialogue data can be understood as pre-stored voice and/or action data in the system, which can drive the virtual character to realize basic interaction. For example, the dialogue data includes basic communication voice data stored in a database, including but not limited to "hello", "thank you", "what questions there are", and the like. The basic communication motion data includes, but is not limited to, a "love heart" motion, a "head shaking" motion, a "head nodding" motion, and the like.
In practical application, the virtual character interaction control system can also search basic dialogue data which is matched with the user intention data and/or the user posture data from basic dialogue data which are stored in the system in advance according to the user intention data and/or the user posture data, and call the basic dialogue data. Because the basic dialogue data comprises basic voice data and/or basic action data, the virtual character interaction control system can render the output video stream corresponding to the virtual character according to the basic voice data and/or the basic action data so as to drive the virtual character to display the output video stream.
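A minimal sketch of matching the recognized intention/posture to pre-stored basic dialogue data might look as follows; the lookup table, keys, and the BasicDialogue structure are assumptions made for illustration rather than the patent's data model.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class BasicDialogue:
    """Pre-stored basic dialogue data: reply text and/or a canned action tag."""
    speech: Optional[str] = None   # basic voice data (rendered via TTS)
    action: Optional[str] = None   # basic action data (e.g. "nod", "wave")


# Illustrative store keyed by (intent, emotion); a real system would use richer matching.
BASIC_DIALOGUE_STORE: Dict[Tuple[str, str], BasicDialogue] = {
    ("greeting", "neutral"): BasicDialogue(speech="Hello! What can I do for you?",
                                           action="wave"),
    ("thanks", "happy"): BasicDialogue(speech="You're welcome!", action="nod"),
}


def lookup_basic_dialogue(intent: str, emotion: str) -> Optional[BasicDialogue]:
    """Return matching basic dialogue data, or None if the basic module has no match."""
    return BASIC_DIALOGUE_STORE.get((intent, emotion))
```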
It should be noted that the basic dialogue data may further include basic business data completed by a virtual character preset by the system, for example, basic business services provided for the user, and the like, which is not specifically limited in this embodiment.
In conclusion, the virtual character interaction control system can recognize the multi-modal data to clarify the user's intention and the expressed emotion, actions, and gestures, so that the virtual character can produce a human-like interactive expression based on the user's emotion data and posture data.
In addition, in order to enable the virtual character to interact with the user in a human-like way and to reach interaction states such as duplex active take-over and duplex active/passive interruption, the virtual character interaction control system in the embodiments of the present specification can further provide a multi-modal duplex state decision module to determine the virtual character interaction policy and realize multi-modal duplex take-over and interruption.
Based on this, the virtual character interaction control system can design three interaction modules, and referring to fig. 3, fig. 3 shows a system architecture diagram of the virtual character interaction control system provided by the embodiment of the present specification.
FIG. 3 includes three modules, namely the multi-modal control module, the multi-modal duplex state management module, and the basic dialog module, which can also be regarded as subsystems: the multi-modal control system, the multi-modal duplex state management system, and the basic dialog system. The multi-modal control system controls the input and output of the video stream and the voice stream in the interaction system. At the input end, it segments and understands the incoming voice stream and video stream; its core covers the processing of the voice stream, streaming video expressions, and streaming video actions. At the output end, it is responsible for rendering the system's results into the digital person's output video stream. The multi-modal duplex state management system is responsible for managing the state of the current dialog and deciding the current duplex strategy. The duplex strategies include duplex active/passive interruption, duplex active take-over, invoking the basic dialog system or business logic, and no feedback. The basic dialog system contains the basic business logic and the dialog question-answering capability, i.e., it takes the user's question as input and outputs the system's answer, and it typically comprises three sub-modules. 1) An NLU (natural language understanding) module: recognizes and understands the text information and converts it into a structured semantic representation or intent label that the computer can understand. 2) A DM (dialog management) module: maintains and updates the current dialog state and decides the next system action. 3) An NLG (natural language generation) module: converts the system's output state into understandable natural-language text.
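The basic dialog system's NLU → DM → NLG flow can be sketched as below; the keyword-based NLU, the action names, and the templates are hypothetical stand-ins, not the patent's concrete implementation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DialogState:
    """Maintained by the DM module across turns."""
    slots: Dict[str, str] = field(default_factory=dict)
    last_intent: str = ""


def nlu(text: str) -> str:
    """NLU: map text to an intent label (toy keyword matcher for illustration)."""
    return "query_order" if "order" in text.lower() else "chitchat"


def dm(state: DialogState, intent: str) -> str:
    """DM: update the dialog state and decide the next system action."""
    state.last_intent = intent
    return "lookup_order" if intent == "query_order" else "smalltalk_reply"


def nlg(action: str) -> str:
    """NLG: turn the chosen action into natural-language text."""
    templates = {
        "lookup_order": "Please wait a moment, I will check that order for you.",
        "smalltalk_reply": "I see. Is there anything else I can help with?",
    }
    return templates[action]


def basic_dialog_turn(state: DialogState, user_text: str) -> str:
    return nlg(dm(state, nlu(user_text)))
```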
The following embodiments may describe a specific implementation process of the multi-modal duplex status management module in detail to clarify how to provide the virtual character with the capability of mutual accepting and mutual interrupting in the virtual character interaction control system.
Step 206: determining a virtual character interaction policy based on the user intent data and/or user pose data, wherein the virtual character interaction policy comprises a text interaction policy and/or an action interaction policy.
The virtual character interaction strategy can be understood as a scripted-text decision, an action decision, or a combination of the two made by the virtual character for the user, i.e., a text interaction strategy and/or an action interaction strategy. The text interaction strategy can be understood as the interactive text the virtual character produces for the user's voice data, together with whether that text should interrupt within a sentence of the user's speech or take over at the end of the sentence. The action interaction strategy can be understood as the interactive gesture the virtual character produces for the user's posture data, together with whether that gesture should interrupt within a sentence of the user's speech or take over at the end of the sentence.
In practical applications, the virtual character interaction control system can determine, from the user intention data, the text content with which the virtual character takes over and whether it interrupts within the user's sentence or takes over at the end of the sentence, i.e., the text interaction strategy. The system can likewise determine, from the user posture data, the gesture content with which the virtual character takes over and whether that gesture interrupts within the user's sentence or takes over at the end of the sentence, i.e., the action interaction strategy. It should be noted that, for a given piece of intention data and/or posture data, the virtual character does not necessarily have both a text interaction strategy and an action interaction strategy; the two strategies can also be in an "and/or" relationship.
In addition, the virtual character can not only take over or interrupt the user's interaction but also refrain from any feedback: when the user's VAD silence has not reached 800 ms and the basic dialog system or business logic does not need to be invoked to answer, the system gives no feedback.
Specifically, the step of determining the virtual character interaction strategy based on the user intention data and/or the user posture data comprises:
performing fusion processing on video data in the multi-modal data based on the user intention data and/or the user posture data, and determining a target intention text and/or a target posture action of the user; determining a virtual character interaction strategy based on the target intention text and/or the target gesture action.
In practical applications, after determining the user intention data and/or the user posture data, the virtual character interaction control system can also fuse and align the text, the video stream, and the voice stream, and comprehensively judge the user's target intention text and/or target posture action. A specific virtual character interaction strategy can subsequently be determined based on the target intention text and/or the target posture action.
For example, the emotion classification module may have recognized an expression such as a smile from the user's face, while the smile may not reflect the user's real sentiment (for example, a wry smile). To handle such cases, the virtual character interaction control system can make a multi-modal judgment that also takes the user's voice and the grammar of the text currently being spoken into account, achieving a better result. In a specific implementation, the system can adopt a multi-modal classification model to make a finer-grained emotion judgment; the module finally outputs the current interaction state, which may comprise three state slots, namely the text, the user's gesture action, and the user's emotion, for the multi-modal duplex state management module to use in the duplex state decision.
By further cross-checking the user intention data and/or user posture data in this way, the multi-modal interaction method provided in the embodiments of the present specification pins down the user's interaction purpose accurately, avoiding the ineffective communication that the virtual character would otherwise exhibit when the interaction purpose is misjudged and that would make the virtual character appear less intelligent.
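As an illustration of the fusion step, a simple rule combining the three signals into the three state slots might look like this; the thresholds, word lists, and re-labelling rule are invented for illustration and would be replaced by the multi-modal classification model in a real system.

```python
from typing import Dict, Optional


def fuse_modalities(asr_text: Optional[str],
                    face_emotion: Optional[str],
                    gesture: Optional[str],
                    voice_arousal: float) -> Dict[str, Optional[str]]:
    """Combine the text, facial-emotion and gesture signals into the three state
    slots (text / gesture / emotion) passed to the duplex state decision."""
    emotion = face_emotion
    # Example cross-check mirroring the wry-smile case above: a "happy" face paired
    # with negative wording and a raised voice is re-labelled as displeased.
    negative_words = ("slow", "bad", "annoyed", "unsatisfied")
    if (face_emotion == "happy" and voice_arousal > 0.8 and asr_text
            and any(w in asr_text.lower() for w in negative_words)):
        emotion = "displeased"
    return {"text": asr_text, "gesture": gesture, "emotion": emotion}
```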
After the virtual character interaction control system accurately learns the target intention text and/or the target posture action of the user, the text interaction strategy and/or the action interaction strategy of the virtual character can be respectively and accurately determined. Specifically, the determining a virtual character interaction strategy based on the target intention text and/or the target posture action includes:
determining a text interaction strategy of the virtual character based on the target intention text; and/or
And determining an action interaction strategy of the virtual character based on the target posture action.
In practical applications, the virtual character interaction control system determines the text interaction strategy between the virtual character and the user according to the target intention text. For example, if the user's target intention text is "check the status of the order I placed", the virtual character's text interaction strategy may be to take over at the end of the user's sentence, i.e., the virtual character replies along the lines of "Please wait a moment, I will check that for you right away". If the user's target intention text is "Why are you so slow, you still haven't found it", the virtual character's text interaction strategy may be to interrupt and take over in the middle of the user's sentence, i.e., as soon as the user has said "Why are you so slow", the virtual character immediately replies with something like "It's almost done". In this way, real-time communication between the virtual character and the user is realized, approximating communication between humans.
Further, the virtual character interaction control system can also determine the action interaction strategy between the virtual character and the user according to the target gesture action. For example, if the user's target gesture is an "OK" gesture, the virtual character's action interaction strategy may likewise be to show an "OK" gesture. If the user's target gesture is a raised middle finger, the virtual character may give no action response at all and respond only with text, such as "Is there anything you are unsatisfied with?", or respond only with a head-shaking gesture.
It should be noted that different text interaction strategies and/or action interaction strategies may be determined for different target intention texts and/or target gesture actions. For example, if only a target intention text exists, the virtual character may be determined to respond with only a text interaction strategy, only an action interaction strategy, or a combination of both; the same holds if only a target gesture action exists, or if both a target intention text and a target gesture action exist. The embodiments of the present specification cannot enumerate every case, but the virtual character interaction control system in this embodiment supports determining different virtual character interaction strategies according to different interaction states.
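A compact sketch of this strategy selection is given below; the policy structure and the rules, which mirror the examples above, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InteractionPolicy:
    reply_text: Optional[str] = None     # text interaction strategy (None = no text)
    reply_action: Optional[str] = None   # action interaction strategy (None = no action)
    position: str = "end_of_sentence"    # "end_of_sentence" take-over or "in_sentence" interrupt


def decide_policy(target_text: Optional[str],
                  target_gesture: Optional[str]) -> InteractionPolicy:
    """Toy rules mirroring the examples in the description."""
    if target_gesture == "middle_finger":
        return InteractionPolicy(reply_text="Is there anything you are unsatisfied with?",
                                 reply_action=None)
    if target_gesture == "ok":
        return InteractionPolicy(reply_action="ok")
    if target_text and "slow" in target_text.lower():
        return InteractionPolicy(reply_text="It's almost done.",
                                 position="in_sentence")   # interrupt mid-sentence
    if target_text:
        return InteractionPolicy(reply_text="Please wait a moment, I will check that for you.")
    return InteractionPolicy()   # no feedback
```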
Step 208: and acquiring a three-dimensional rendering model of the virtual character.
In practical applications, the virtual character interaction control system can obtain a three-dimensional rendering model of the virtual character, so that the virtual character's interactive video stream can be generated from this rendering model and the multi-modal interaction with the user can be completed. It should be noted that the virtual character may be a cartoon or computer-drawn figure, or a photorealistic human-like figure; this embodiment does not limit it.
Step 210: and generating an image of the virtual character containing the action interaction strategy by utilizing the three-dimensional rendering model based on the virtual character interaction strategy so as to drive the virtual character to carry out multi-modal interaction.
In practical applications, the virtual character interaction control system generates, from the determined virtual character interaction strategy and by means of the three-dimensional rendering model, an image of the virtual character containing the action interaction strategy, such as the corresponding head movements, facial expressions, and gesture movements, and then drives the rendered virtual character image to carry out the multi-modal interaction with the user.
Further, the virtual character interaction control system can determine, according to the text interaction strategy and the action interaction strategy, the text take-over position and/or the action take-over position for the virtual character, so as to realize the duplex active take-over process. Specifically, the step of generating an image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model based on the virtual character interaction strategy, so as to drive the virtual character to perform multi-modal interaction, includes:
determining a text take-over position for the virtual character's text interaction based on the text interaction strategy, wherein the text take-over position is the take-over position corresponding to the voice data; determining an action take-over position for the virtual character's action interaction based on the action interaction strategy, wherein the action take-over position is the take-over position corresponding to the video data; and, based on the text take-over position and/or the action take-over position, generating an image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model, so as to drive the virtual character to perform multi-modal interaction.
The text take-over position can be understood as the position, relative to the speech uttered by the user, at which the virtual character delivers its interactive text; it can be within the sentence or at the end of the sentence. The action take-over position can be understood as the position, relative to the speech uttered by the user, at which the virtual character delivers its interactive action; likewise, the action can be delivered within the sentence or at the end of the sentence.
In practical applications, after determining the text take-over position of the virtual character's text interaction and the action take-over position of its action interaction, the virtual character interaction control system generates the image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model according to the text take-over position and/or the action take-over position, thereby determining the virtual character's multi-modal interaction process.
It should be noted that the current take-over strategy is triggered when the virtual character interaction control system determines that the user's speech or action needs to be taken over. There are two kinds of take-over: action-only take-over, and action-plus-text take-over. Action-only take-over means the digital person gives no spoken reply and responds to the user only with an action; for example, if the user suddenly waves to greet the digital person during the conversation, the virtual character only needs to wave back, without affecting the rest of the current dialog state. Action-plus-text take-over means the digital person responds to the user with both an action and a spoken reply; this kind of take-over has some impact on the current dialog flow, but makes the experience feel more intelligent. For example, if the user is detected to show a dissatisfied emotion during the conversation, the virtual character needs to interrupt the current dialog state, actively ask the user what they are unsatisfied with, and give a comforting gesture.
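The two take-over modes could be expressed as follows; the trigger conditions and labels are simplified assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TakeOver:
    action: str                      # e.g. "wave_back", "comfort"
    speech: Optional[str] = None     # None -> action-only take-over
    interrupts_dialog: bool = False  # whether the current dialog state is interrupted


def plan_take_over(gesture: Optional[str], emotion: Optional[str]) -> Optional[TakeOver]:
    """Decide between action-only and action-plus-text take-over."""
    if gesture == "wave":
        # Action-only: wave back, current dialog state untouched.
        return TakeOver(action="wave_back")
    if emotion == "displeased":
        # Action + text: interrupt, comfort, and ask what went wrong.
        return TakeOver(action="comfort",
                        speech="Is there anything you are unsatisfied with?",
                        interrupts_dialog=True)
    return None                      # nothing to take over
```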
In addition, the avatar interaction control system may also provide a process of duplex active/passive interruption. Specifically, the step of generating an avatar of the virtual character including the action interaction policy by using the three-dimensional rendering model based on the virtual character interaction policy to drive the virtual character to perform multi-modal interaction includes:
suspending the virtual character's current multi-modal interaction when it is determined, from the user intention data and/or user posture data used in the virtual character interaction strategy, that the user has interruption intention data; and determining the interruption take-over interaction data corresponding to the virtual character based on the interruption intention data, and, based on that interruption take-over interaction data, generating an image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model, so as to drive the virtual character to continue the multi-modal interaction.
Interruption intention data can be understood as data indicating that the user explicitly declines to continue communicating with the virtual character; for example, the user makes a "close your mouth" gesture or explicitly says "let's pause this conversation".
The interruption take-over interaction data can be understood as the corresponding take-over sentence or take-over action data used by the virtual character once it determines that the user intends to interrupt.
In practical applications, within the virtual character interaction strategy of the virtual character interaction control system, if it is determined from the user intention data and/or user posture data that the user intends to interrupt, the virtual character's current interactive text or interactive action can be suspended, and the corresponding interruption take-over interaction data is determined according to the interruption intention. The three-dimensional rendering model is then used to generate an image of the virtual character containing the action interaction strategy, so as to drive the virtual character to continue the multi-modal interaction according to the interruption take-over interaction data. For example, the digital person may actively interrupt the current dialog when it finds that the user intends to interrupt, whether the intent is explicit, such as a negative expression or negative emotion shown while the digital person is speaking, or implicit, such as the user suddenly disappearing from view or no longer being in a communicative state. Under this strategy, the digital person can break off its current speech, wait for the user to speak, or actively ask the other party why the interruption occurred.
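A sketch of the active/passive interruption branch, with hypothetical signal names and replies, might be:

```python
from typing import Optional


def handle_interruption(explicit_signal: Optional[str],
                        user_visible: bool,
                        dp_speaking: bool) -> Optional[str]:
    """Return the digital person's next move when an interruption intent is detected,
    or None if there is no interruption. Signals and moves are illustrative."""
    if explicit_signal in {"close_mouth_gesture", "negative_expression", "pause_request"}:
        # Explicit interruption: stop speaking and hand the floor back to the user,
        # or ask why the user wants to interrupt if the digital person was silent.
        return "stop_speaking_and_wait" if dp_speaking else "ask_reason"
    if not user_visible:
        # Implicit interruption: user left the frame or stopped engaging.
        return "pause_and_ask_if_still_there"
    return None
```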
Finally, the virtual character interaction control system can also provide an output rendering function: the audio data stream and the video data stream determined for the virtual character's interaction are fused and then pushed out. Specifically, the step of generating an image of the virtual character containing the action interaction policy by using the three-dimensional rendering model based on the virtual character interaction policy, so as to drive the virtual character to perform multi-modal interaction, includes:
determining an audio data stream for the virtual character's text interaction based on the text interaction policy; determining a video data stream for the virtual character's action interaction based on the action interaction policy; and fusing the audio data stream and the video data stream, rendering the virtual character's multi-modal interaction data stream, and, based on that multi-modal interaction data stream, generating an image of the virtual character containing the action interaction policy by using the three-dimensional rendering model, so as to drive the virtual character to perform multi-modal interaction.
In practical applications, the output rendering of the virtual character interaction control system composites the video stream and pushes it out, and consists of three parts. 1) A streaming TTS part, which synthesizes the system's text output into an audio stream. 2) A driving part, which comprises two sub-modules, a face driving module and an action driving module: the face driving module drives the digital person to output an accurate mouth shape according to the voice stream, and the action driving module drives the digital person to output accurate actions according to the action labels output by the system. 3) A rendering and compositing part, which is responsible for rendering and compositing the outputs of the driving part, the TTS, and the other modules into the digital person's video stream.
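The three output stages can be sketched as a simple pipeline; the streaming interfaces (tts_stream, drive_face, drive_action, composite) are hypothetical placeholders passed in as callables, not components named in the patent.

```python
from typing import Callable, Iterable, Iterator, Tuple

AudioChunk = bytes
FramePose = dict    # e.g. {"mouth": ..., "gesture": ...}
VideoFrame = bytes


def render_output(reply_text: str,
                  action_label: str,
                  tts_stream: Callable[[str], Iterable[AudioChunk]],
                  drive_face: Callable[[AudioChunk], FramePose],
                  drive_action: Callable[[str], FramePose],
                  composite: Callable[[FramePose, FramePose], VideoFrame]
                  ) -> Iterator[Tuple[AudioChunk, VideoFrame]]:
    """1) streaming TTS -> audio chunks; 2) face/action driving -> poses;
    3) rendering/compositing -> output video frames pushed with the audio."""
    for audio_chunk in tts_stream(reply_text):
        mouth_pose = drive_face(audio_chunk)     # lip sync from the voice stream
        body_pose = drive_action(action_label)   # gesture from the action label
        yield audio_chunk, composite(mouth_pose, body_pose)
```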
In summary, the multimodal interaction method provided in the embodiments of the present specification can not only perceive the facial expression of the user, but also perceive the motion of the user by adding the video stream and the corresponding visual understanding module. In addition, a new visual processing module can be expanded through a similar method, so that the virtual character can perceive more multi-modal information, such as environment information and the like. In the embodiment of the specification, the system can support real-time perception of five facial expressions of anger, displeasure, neutrality, distraction and surprise of a user, and can perceive three major actions of definite actions (such as OK, numbers, left-sliding, right-sliding and the like), unsafe gestures (such as vertical middle fingers, little fingers and the like) and customized special actions in real time.
In addition, by adding the multi-modal control module and the multi-modal duplex state management module, the method changes the exclusive, one-question-one-answer conversation form into a non-exclusive conversation form in which the dialogue can be taken over or interrupted at any time. The main reasons why this is possible are as follows: 1) The multi-modal control module divides the conversation into smaller decision units and no longer takes a complete user question as the trigger condition for a reply, so the conversation can be taken over or interrupted at any time, even while the user is still speaking. The voice stream segmentation strategy segments the stream using a VAD silence threshold of 200 ms, since the breathing pauses in a person's natural speech are generally around 200 ms. The video stream adopts a detection-triggered strategy: a duplex state decision is made when a specified action, expression or target object is detected. 2) The multi-modal duplex state management module is the core of the solution, because it not only maintains the current duplex dialogue state but also decides the current reply strategy, which comprises four states: duplex active take-over, duplex active/passive interruption, invocation of the basic dialogue system or service logic, and no feedback. By deciding among these four states, the system achieves the ability to take over, interrupt, and answer basic questions at any time. 3) The method cuts the dialogue into smaller units and uses these units as the granularity at which the digital person decides and replies, so the dialogue is no longer an exclusive question-and-answer form. Therefore, even before the user has finished expressing himself, the system has already processed the user's input and computed a reply result; when the user finishes speaking, the system does not need to compute from scratch and can directly play the prepared take-over response, greatly shortening the interaction latency. In terms of perceived experience, the conversation latency of the system can be reduced from about 1.5 seconds to about 800 ms.
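The four duplex decision states named above can be sketched minimally as below; the decision rules themselves are simplified, hypothetical illustrations rather than the actual policy.

```python
from enum import Enum, auto

class DuplexDecision(Enum):
    ACTIVE_TAKE_OVER = auto()       # duplex active take-over while the user speaks
    INTERRUPT = auto()              # duplex active / passive interruption
    CALL_BASE_DIALOG = auto()       # call the basic dialogue system or service logic
    NO_FEEDBACK = auto()

def decide(unit_text: str, user_finished: bool, interrupt_intent: bool) -> DuplexDecision:
    """Simplified decision over one small dialogue unit."""
    if interrupt_intent:
        return DuplexDecision.INTERRUPT
    if user_finished:
        return DuplexDecision.CALL_BASE_DIALOG   # a complete question: answer it
    if unit_text.strip():
        return DuplexDecision.ACTIVE_TAKE_OVER   # mid-speech: give a brief acknowledgment
    return DuplexDecision.NO_FEEDBACK
```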
Referring to fig. 4, fig. 4 is a schematic diagram illustrating a processing procedure of a multi-modal interaction method according to an embodiment of the present disclosure.
The embodiment of fig. 4 can be divided into a multi-modal control system-input, a multi-modal duplex state management system, a basic dialogue system, and a multi-modal control system-output, which can be understood as four subsystems of the virtual character interaction control system to which the multi-modal interaction method is applied.
In practical application, the user's video stream and voice stream enter through the multi-modal control system-input. The video stream first passes through a target emotion detection coarse recall module and a target gesture detection coarse recall module, emotion classification and gesture classification are then performed, and the final emotion recognition result and gesture recognition result are input to the multi-modal data & alignment module. The voice stream is first segmented, then converted into text through ASR, and finally input to the multi-modal data & alignment module. Further, the multi-modal data & alignment module integrates the speech recognition result with the emotion and gesture recognition results from the video, determines the target user intention and the target action data, and inputs them into the multi-modal duplex state decision module in the multi-modal duplex state management system.
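A simplified sketch of the alignment step described above is given below; the field names and the 0.5-second alignment window are illustrative assumptions, and the ASR and detection results are taken as already-computed inputs.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AlignedObservation:
    t: float                        # timestamp in seconds
    text: Optional[str] = None      # ASR result of a speech segment
    emotion: Optional[str] = None
    gesture: Optional[str] = None

def align(speech_segments: List[dict], frame_results: List[dict],
          window: float = 0.5) -> List[AlignedObservation]:
    """Merge speech and video results whose timestamps fall within `window` seconds."""
    merged = []
    for seg in speech_segments:
        obs = AlignedObservation(t=seg["t"], text=seg["text"])
        for fr in frame_results:
            if abs(fr["t"] - seg["t"]) <= window:
                obs.emotion = obs.emotion or fr.get("emotion")
                obs.gesture = obs.gesture or fr.get("gesture")
        merged.append(obs)
    return merged
```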
Further, the multi-modal duplex state decision system in fig. 4 can perform duplex policy decisions and determine two types of take-over: action plus text take-over, and action-only take-over. For action plus text take-over, the process is further divided into two branches according to whether the take-over occurs within a sentence or at the end of a sentence. Specifically, for end-of-sentence take-over, the take-over text decision and the take-over action decision are determined according to intention recognition, and for in-sentence take-over, the take-over text decision and the take-over action decision are determined as well. For action-only take-over, a specific take-over action is determined. Finally, the take-over strategy of the virtual character is input into the multi-modal control system-output to determine the streaming video stream and the streaming audio stream.
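The branch structure of this take-over decision might be sketched as follows; the concrete action label and response texts are invented examples, and only the branching mirrors the description.

```python
def takeover_policy(intent: str, at_sentence_end: bool, with_text: bool) -> dict:
    """Choose between action-only and action-plus-text take-over, split by position."""
    if not with_text:
        return {"action": "nod"}                                        # action-only take-over
    if at_sentence_end:
        return {"action": "nod", "text": f"Got it, about {intent}."}    # end-of-sentence take-over
    return {"action": "nod", "text": "Mm-hm."}                          # brief in-sentence take-over
```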
It should be noted that the multi-modal duplex state decision system further includes multi-modal interruption intention judgment, and can implement a specific interruption receiving function in combination with the service logic.
Further, the multi-modal control system-output can determine face driving data and motion driving data from the streaming video stream and the streaming audio stream of the virtual character, so as to complete the rendering of the virtual character and the pushing of the streaming media and output the digital human video stream.
In addition, besides the streaming video stream and the streaming audio stream of the virtual character used in the multi-modal control system-output, the basic dialogue system can also provide basic dialogue data for the interaction of the virtual character and match the basic business logic and actions, so as to jointly complete the generation of the digital human video stream.
In summary, the multi-modal interaction method provided in the embodiments of the present specification achieves multi-modal perception, multi-modal duplexing, and short interaction latency. Specifically, for multi-modal perception, the embodiments of the present specification propose a system that can perceive both the user's voice and video information. Compared with a traditional dialogue system based only on the voice stream, this scheme can not only process the user's voice information but also recognize and detect the user's emotions and actions, which greatly improves the perceptual intelligence of the digital person. For multi-modal duplexing, the embodiments of the present specification provide an interactive system that can take over instantly and be interrupted at any time. Compared with a traditional turn-based, one-question-one-answer dialogue system, this system can give the user real-time feedback and responses, such as simple verbal acknowledgments, while the user is speaking. In addition, when the user is not in a listening state, or the user obviously intends to interrupt the conversation, the current dialogue process can be interrupted at any time. Such a duplex interactive system improves the fluency of interaction and thus gives users a better interactive experience. As for the short interaction latency: before the user has finished expressing himself, the system processes the user's input in a streaming manner and computes the reply result; when the user finishes speaking, the system does not need to compute from scratch and can directly play the prepared take-over response, which greatly shortens the interaction latency. In terms of perceived experience, the conversation latency of the system can be reduced from about 1.5 seconds to about 800 ms.
Corresponding to the above method embodiment, the present specification further provides a multi-modal interaction apparatus embodiment, and fig. 5 shows a schematic structural diagram of a multi-modal interaction apparatus provided in an embodiment of the present specification. As shown in fig. 5, the apparatus, which is applied to the virtual character interaction control system, includes:
the data receiving module 502 is configured to receive multimodal data, wherein the multimodal data includes voice data and video data; a data recognition module 504 configured to recognize the multimodal data, obtain user intent data and/or user gesture data, wherein the user gesture data includes user emotion data and user motion data; a policy determination module 506 configured to determine a virtual character interaction policy based on the user intent data and/or user pose data, wherein the virtual character interaction policy comprises a text interaction policy and/or an action interaction policy; a rendering model obtaining module 508 configured to obtain a three-dimensional rendering model of the virtual character; an interaction driving module 510 configured to generate an image of the virtual character including the action interaction policy using the three-dimensional rendering model based on the virtual character interaction policy to drive the virtual character to perform multi-modal interaction.
Optionally, the data identification module 504 is further configured to: performing text conversion on voice data in the multi-modal data, and identifying the converted text data to obtain user intention data; and/or performing emotion recognition on video data and/or voice data in the multi-modal data to obtain user emotion data; performing gesture recognition on video data in the multi-modal data to obtain user action data; determining user gesture data based on the user mood data and the user action data.
Optionally, the data identification module 504 is further configured to: and carrying out emotion detection on the video data in the multi-modal data, and classifying target emotions in the video data to obtain user emotion data under the condition that the video data contains the target emotions.
Optionally, the data identification module 504 is further configured to: and carrying out gesture detection on the video data in the multi-modal data, and classifying the target gesture in the video data to obtain user action data under the condition that the video data is detected to comprise the target gesture.
Optionally, the policy determination module 506 is further configured to: performing fusion processing on video data in the multi-modal data based on the user intention data and/or the user posture data, and determining a target intention text and/or a target posture action of the user; determining a virtual character interaction strategy based on the target intention text and/or the target gesture action.
Optionally, the policy determination module 506 is further configured to: determining a text interaction strategy of the virtual character based on the target intention text; and/or determining an action interaction strategy of the virtual character based on the target gesture action.
Optionally, the interaction driver module 510 is further configured to: determining a text carrying position of the virtual character text interaction based on the text interaction strategy, wherein the text carrying position is a carrying position corresponding to the voice data; determining an action carrying position of the virtual character action interaction based on the action interaction strategy, wherein the action carrying position is a carrying position corresponding to the video data; and generating an image of the virtual character containing the action interaction strategy by utilizing the three-dimensional rendering model based on the text carrying position and/or the action carrying position so as to drive the virtual character to carry out multi-modal interaction.
Optionally, the interaction driver module 510 is further configured to: suspending the current multi-modal interaction of the virtual character in case that the user is determined to have interrupting intention data in the user intention data and/or the user posture data of the virtual character interaction strategy; and determining interruption receiving interaction data corresponding to the virtual character based on the interruption intention data, and generating an image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model based on the interruption receiving interaction data so as to drive the virtual character to continue to perform multi-mode interaction.
Optionally, the apparatus further comprises: a video stream output module configured to invoke pre-stored basic dialogue data based on the user intention data and/or the user gesture data, wherein the basic dialogue data comprises basic voice data and/or basic action data; rendering an output video stream of the virtual character based on the basic dialogue data, and driving the virtual character to display the output video stream.
Optionally, the interaction driver module 510 is further configured to: determining an audio data stream of the virtual character text interaction based on the text interaction policy; determining a video data stream of action interactions of the virtual character based on the action interaction policy; and fusing the audio data stream and the video data stream, rendering a multi-modal interactive data stream of the virtual character, and generating an image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model based on the multi-modal interactive data stream so as to drive the virtual character to carry out multi-modal interaction.
The multi-modal interaction device provided by the embodiments of the present specification receives the user's voice data and video data, performs intention recognition and gesture recognition to determine the user's communication intention and/or the user's corresponding gesture, further determines the specific interaction strategy between the virtual character and the user according to that intention and/or gesture, and then drives the virtual character to complete the interaction process with the user according to the determined interaction strategy. The device can therefore not only detect and recognize the user's emotions and actions, but also take them into account when deciding the virtual character's interaction strategy, so that the emotions and/or actions expressed by the user receive a corresponding response from the virtual character. This keeps the latency of the whole interaction process low and makes the interaction between the user and the virtual character smoother, giving the user a better interactive experience.
The foregoing is a schematic diagram of a multimodal interaction apparatus of the present embodiment. It should be noted that the technical solution of the multi-modal interaction apparatus belongs to the same concept as the technical solution of the multi-modal interaction method described above, and for details that are not described in detail in the technical solution of the multi-modal interaction apparatus, reference may be made to the description of the technical solution of the multi-modal interaction method described above.
FIG. 6 illustrates a block diagram of a computing device 600 provided in accordance with one embodiment of the present description. The components of the computing device 600 include, but are not limited to, a memory 610 and a processor 620. The processor 620 is coupled to the memory 610 via a bus 630 and a database 650 is used to store data.
Computing device 600 also includes an access device 640 that enables computing device 600 to communicate via one or more networks 660. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 640 may include one or more of any type of wired or wireless network interface (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so on.
In one embodiment of the present description, the above-described components of computing device 600, as well as other components not shown in FIG. 6, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device structure shown in FIG. 6 is for illustration purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 600 may also be a mobile or stationary server.
Wherein the processor 620 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the multimodal interaction method described above.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the multi-modal interaction method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the multi-modal interaction method.
An embodiment of the present specification further provides a computer-readable storage medium storing computer-executable instructions, which when executed by a processor, implement the steps of the multimodal interaction method described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the multi-modal interaction method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the multi-modal interaction method.
An embodiment of the present specification further provides a computer program, wherein when the computer program is executed in a computer, the computer program causes the computer to execute the steps of the multimodal interaction method.
The above is an illustrative scheme of a computer program of the present embodiment. It should be noted that the technical solution of the computer program belongs to the same concept as the technical solution of the above multi-modal interaction method, and for details that are not described in detail in the technical solution of the computer program, reference may be made to the description of the technical solution of the above multi-modal interaction method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals, in accordance with legislation and patent practice.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts, but those skilled in the art should understand that the present embodiment is not limited by the described acts, because some steps may be performed in other sequences or simultaneously according to the present embodiment. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for an embodiment of the specification.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, and to thereby enable others skilled in the art to best understand the specification and utilize the specification. The specification is limited only by the claims and their full scope and equivalents.

Claims (13)

1. A multi-modal interaction method is applied to a virtual character interaction control system and comprises the following steps:
receiving multimodal data, wherein the multimodal data comprises voice data and video data;
identifying the multi-modal data, and obtaining user intention data and/or user posture data, wherein the user posture data comprises user emotion data and user action data;
determining a virtual character interaction policy based on the user intent data and/or user pose data, wherein the virtual character interaction policy comprises a text interaction policy and/or an action interaction policy;
acquiring a three-dimensional rendering model of the virtual character;
and based on the virtual character interaction strategy, generating an image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model so as to drive the virtual character to carry out multi-mode interaction.
2. The multi-modal interaction method of claim 1, the identifying the multi-modal data, obtaining user intent data and/or user gesture data, comprising:
performing text conversion on voice data in the multi-modal data, and identifying the converted text data to obtain user intention data; and/or
Performing emotion recognition on video data and/or voice data in the multi-modal data to obtain user emotion data;
performing gesture recognition on video data in the multi-modal data to obtain user action data;
determining user gesture data based on the user emotion data and the user action data.
3. The method of claim 2, wherein the performing emotion recognition on the video data in the multimodal data to obtain user emotion data comprises:
and carrying out emotion detection on the video data in the multi-modal data, and classifying target emotions in the video data to obtain user emotion data under the condition that the video data contains the target emotions.
4. The multi-modal interaction method of claim 2, wherein the gesture recognition of the video data in the multi-modal data to obtain the user action data comprises:
and carrying out gesture detection on the video data in the multi-modal data, and classifying the target gesture in the video data to obtain user action data under the condition that the video data is detected to comprise the target gesture.
5. The multi-modal interaction method of claim 1, the determining a virtual character interaction policy based on the user intent data and/or user pose data, comprising:
performing fusion processing on video data in the multi-modal data based on the user intention data and/or the user posture data, and determining a target intention text and/or a target posture action of the user;
determining a virtual character interaction strategy based on the target intention text and/or the target gesture action.
6. The multi-modal interaction method of claim 5, the determining a virtual character interaction strategy based on the target intent text and/or the target gesture action, comprising:
determining a text interaction strategy of the virtual character based on the target intention text; and/or
And determining an action interaction strategy of the virtual character based on the target posture action.
7. The multi-modal interaction method of claim 6, wherein the generating an image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model based on the virtual character interaction strategy to drive the virtual character to perform multi-modal interaction comprises:
determining a text carrying position of the virtual character text interaction based on the text interaction strategy, wherein the text carrying position is a carrying position corresponding to the voice data;
determining an action carrying position of the virtual character action interaction based on the action interaction strategy, wherein the action carrying position is a carrying position corresponding to the video data;
and generating an image of the virtual character containing the action interaction strategy by utilizing the three-dimensional rendering model based on the text carrying position and/or the action carrying position so as to drive the virtual character to carry out multi-modal interaction.
8. The multi-modal interaction method of claim 1, wherein the generating an image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model based on the virtual character interaction strategy to drive the virtual character to perform multi-modal interaction comprises:
suspending the current multi-modal interaction of the virtual character if it is determined that the user has interrupting intention data in the user intention data and/or the user posture data of the virtual character interaction strategy;
and determining interruption receiving interaction data corresponding to the virtual character based on the interruption intention data, and generating an image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model based on the interruption receiving interaction data so as to drive the virtual character to continue to perform multi-mode interaction.
9. The multi-modal interaction method of claim 1, the identifying the multi-modal data, after obtaining user intent data and/or user gesture data, further comprising:
based on the user intention data and/or the user posture data, invoking pre-stored basic dialogue data, wherein the basic dialogue data comprises basic voice data and/or basic action data;
rendering an output video stream of the virtual character based on the basic dialogue data, and driving the virtual character to display the output video stream.
10. The multi-modal interaction method of claim 1, wherein the generating an image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model based on the virtual character interaction strategy to drive the virtual character to perform multi-modal interaction comprises:
determining an audio data stream of the virtual character text interaction based on the text interaction policy;
determining a video data stream of action interactions of the virtual character based on the action interaction policy;
and fusing the audio data stream and the video data stream, rendering a multi-modal interactive data stream of the virtual character, and generating an image of the virtual character containing the action interaction strategy by using the three-dimensional rendering model based on the multi-modal interactive data stream so as to drive the virtual character to carry out multi-modal interaction.
11. A multi-modal interaction device is applied to a virtual character interaction control system and comprises:
a data receiving module configured to receive multimodal data, wherein the multimodal data comprises voice data and video data;
a data recognition module configured to recognize the multi-modal data, obtain user intention data and/or user posture data, wherein the user posture data includes user emotion data and user action data;
a policy determination module configured to determine a virtual character interaction policy based on the user intent data and/or user pose data, wherein the virtual character interaction policy comprises a text interaction policy and/or an action interaction policy;
a rendering model obtaining module configured to obtain a three-dimensional rendering model of the virtual character;
and the interaction driving module is configured to generate an avatar of the virtual character containing the action interaction strategy by using the three-dimensional rendering model based on the virtual character interaction strategy so as to drive the virtual character to perform multi-modal interaction.
12. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, which when executed by the processor, perform the steps of the multi-modal interaction method of any of claims 1 to 10.
13. A computer-readable storage medium storing computer-executable instructions that, when executed by a processor, perform the steps of the multi-modal interaction method of any of claims 1 to 10.
CN202210499890.XA 2022-05-09 2022-05-09 Multi-modal interaction method and device Pending CN114995636A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210499890.XA CN114995636A (en) 2022-05-09 2022-05-09 Multi-modal interaction method and device
PCT/CN2023/085827 WO2023216765A1 (en) 2022-05-09 2023-04-03 Multi-modal interaction method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210499890.XA CN114995636A (en) 2022-05-09 2022-05-09 Multi-modal interaction method and device

Publications (1)

Publication Number Publication Date
CN114995636A true CN114995636A (en) 2022-09-02

Family

ID=83024526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210499890.XA Pending CN114995636A (en) 2022-05-09 2022-05-09 Multi-modal interaction method and device

Country Status (2)

Country Link
CN (1) CN114995636A (en)
WO (1) WO2023216765A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416420A (en) * 2018-02-11 2018-08-17 北京光年无限科技有限公司 Limbs exchange method based on visual human and system
CN109032328A (en) * 2018-05-28 2018-12-18 北京光年无限科技有限公司 A kind of exchange method and system based on visual human
CN109324688A (en) * 2018-08-21 2019-02-12 北京光年无限科技有限公司 Exchange method and system based on visual human's behavioral standard
CN109271018A (en) * 2018-08-21 2019-01-25 北京光年无限科技有限公司 Exchange method and system based on visual human's behavioral standard
CN114995636A (en) * 2022-05-09 2022-09-02 阿里巴巴(中国)有限公司 Multi-modal interaction method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023216765A1 (en) * 2022-05-09 2023-11-16 阿里巴巴(中国)有限公司 Multi-modal interaction method and apparatus
CN115914366A (en) * 2023-01-10 2023-04-04 北京红棉小冰科技有限公司 Virtual character object language pushing method and system and electronic equipment
CN115914366B (en) * 2023-01-10 2023-06-30 北京红棉小冰科技有限公司 Virtual character object language pushing method, system and electronic equipment
CN116798427A (en) * 2023-06-21 2023-09-22 支付宝(杭州)信息技术有限公司 Man-machine interaction method based on multiple modes and digital man system

Also Published As

Publication number Publication date
WO2023216765A1 (en) 2023-11-16

Similar Documents

Publication Publication Date Title
US20230316643A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
US11222632B2 (en) System and method for intelligent initiation of a man-machine dialogue based on multi-modal sensory inputs
CN108000526B (en) Dialogue interaction method and system for intelligent robot
CN114995636A (en) Multi-modal interaction method and device
EP3931822A1 (en) Linguistic style matching agent
KR101604593B1 (en) Method for modifying a representation based upon a user instruction
US20190206407A1 (en) System and method for personalizing dialogue based on user's appearances
CN108877336A (en) Teaching method, cloud service platform and tutoring system based on augmented reality
JP6719739B2 (en) Dialogue method, dialogue system, dialogue device, and program
WO2019161241A1 (en) System and method for identifying a point of interest based on intersecting visual trajectories
WO2023226914A1 (en) Virtual character driving method and system based on multimodal data, and device
KR102174922B1 (en) Interactive sign language-voice translation apparatus and voice-sign language translation apparatus reflecting user emotion and intention
Matsusaka et al. Conversation robot participating in group conversation
CN114995657B (en) Multimode fusion natural interaction method, system and medium for intelligent robot
EP3752959A1 (en) System and method for inferring scenes based on visual context-free grammar model
KR102222911B1 (en) System for Providing User-Robot Interaction and Computer Program Therefore
WO2019161229A1 (en) System and method for reconstructing unoccupied 3d space
WO2019161246A1 (en) System and method for visual rendering based on sparse samples with predicted motion
CN114821744A (en) Expression recognition-based virtual character driving method, device and equipment
Wan et al. Midoriko chatbot: LSTM-based emotional 3D avatar
JP6551793B2 (en) Dialogue method, dialogue system, dialogue apparatus, and program
Ritschel et al. Multimodal joke generation and paralinguistic personalization for a socially-aware robot
CN116009692A (en) Virtual character interaction strategy determination method and device
JP2003108502A (en) Physical media communication system
JP6647636B2 (en) Dialogue method, dialogue system, dialogue device, and program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination