WO2023226914A1 - Virtual character driving method and system based on multimodal data, and device - Google Patents


Info

Publication number
WO2023226914A1
WO2023226914A1 · PCT/CN2023/095449 · CN2023095449W
Authority
WO
WIPO (PCT)
Prior art keywords
user
gesture
virtual character
data
information
Prior art date
Application number
PCT/CN2023/095449
Other languages
French (fr)
Chinese (zh)
Inventor
朱鹏程
马远凯
冷海涛
张昆才
罗智凌
周伟
李禹�
钱景
Original Assignee
阿里巴巴(中国)有限公司 (Alibaba (China) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 阿里巴巴(中国)有限公司 (Alibaba (China) Co., Ltd.)
Publication of WO2023226914A1 publication Critical patent/WO2023226914A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Definitions

  • This application relates to artificial intelligence, deep learning, machine learning, virtual reality and other fields in computer technology, and in particular to a virtual character driving method, system and device based on multi-modal data.
  • This application provides a virtual character driving method, system and device based on multi-modal data, to address the low degree of anthropomorphism of virtual characters and the lack of smoothness and intelligence in the interaction between virtual characters and people.
  • In a first aspect, this application provides a virtual character driving method based on multi-modal data, including:
  • obtaining, in real time, the voice data input by the user and the user's image data;
  • when it is detected that the silence duration of the voice data input by the user is greater than or equal to a preset duration, and it is determined that the voice input has not ended, converting the voice data input by the user in the previous period into corresponding text information, where the previous period is from the time when the silence duration was last greater than or equal to the preset duration to the current moment;
  • identifying the user's gesture information based on the user's image data in the previous period, and determining the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and the text information;
  • determining the corresponding driving data according to the gesture intention classification and the current dialogue state; and
  • driving the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
  • In a second aspect, this application provides a virtual character driving method based on multi-modal data, including:
  • obtaining, in real time, the voice data input by the user and the user's image data;
  • when it is detected that the silence duration of the voice input is greater than or equal to a preset duration, and it is determined that the voice input has not ended, identifying the user's gesture information based on the user's image data in the previous period, where the previous period is from the time when the silence duration was last greater than or equal to the preset duration to the current moment;
  • according to the user's gesture information, if it is determined that the user has made a gesture that needs to be accepted, determining the driving data of the virtual character; and
  • driving the virtual character to perform the acceptance response behavior corresponding to the user's gesture information.
  • this application provides a virtual character driving system based on multi-modal data, including:
  • the drive control module is used to obtain the three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to the user;
  • the multi-modal input module is used to obtain the voice data input by the user and the image data of the user in real time during a round of dialogue between the virtual character and the user;
  • a voice processing module, configured to, when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration and it is determined that the voice input has not ended, convert the voice data input by the user in the previous period into the corresponding text information, where the previous period is from the time when the silence duration was last greater than or equal to the preset duration to the current moment;
  • an image processing module, configured to identify the user's gesture information based on the user's image data in the previous period, and to determine the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and the text information;
  • the drive control module is also used to determine the corresponding driving data according to the gesture intention classification corresponding to the user's gesture information and the current conversation state, and to drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
  • this application provides a virtual character driving system based on multi-modal data, including:
  • the decision-driven module is used to obtain the three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to users;
  • the multi-modal input module is used to obtain the voice data input by the user and the image data of the user in real time during a conversation between the virtual character and the user;
  • a sensing module, configured to, when it is detected that the silence duration of the voice input is greater than or equal to the preset duration and it is determined that the voice input has not ended, identify the user's gesture information according to the user's image data in the previous period, where the previous period is from the time when the silence duration was last greater than or equal to the preset duration to the current moment;
  • the decision-driven module is also used to determine the driving data of the virtual character according to the user's gesture information if it is determined that the user has made a gesture that needs to be accepted, and to drive the virtual character to perform the acceptance response behavior corresponding to the user's gesture information according to the driving data and the three-dimensional image rendering model of the virtual character.
  • this application provides an electronic device, including: a processor, and a memory communicatively connected to the processor;
  • the memory stores computer-executable instructions;
  • the processor executes the computer-executable instructions stored in the memory to implement the method described in the first aspect or the second aspect.
  • the present application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, are used to implement the method described in the first or second aspect above.
  • The virtual character driving method, system and device based on multi-modal data obtain the voice data input by the user and the user's image data in real time during a round of dialogue between the virtual character and the user. When it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and it is determined that the voice input has not ended, the voice data input by the user in the previous period is converted into corresponding text information, and the user's gesture information is identified based on the user's image data in the previous period.
  • The gesture intention classification corresponding to the user's gesture information is then determined based on the user's gesture information and the text information, so that the gesture intention of the user's gesture can be recognized in real time; based on that gesture intention and the current conversation state, the virtual character is driven to perform the corresponding response behavior.
  • This causes the virtual character in the output video stream to perform the corresponding response behavior, increases the real-time recognition capability for the user's gestures, and drives the virtual character to respond to the user's gesture intention in a timely manner.
  • Figure 1 is a framework diagram of an exemplary virtual character-human interaction system provided by this application.
  • Figure 2 is a flow chart of a virtual character driving method based on multi-modal data provided by an embodiment of the present application
  • Figure 3 is a flow chart of a method for driving a virtual character to accept users provided by an embodiment of the present application
  • Figure 4 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by an exemplary embodiment of the present application
  • Figure 5 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by another exemplary embodiment of the present application.
  • Figure 6 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by another exemplary embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
  • Multi-modal interaction: users can communicate with virtual characters through text, voice, expressions, etc.
  • The virtual characters can understand the user's text, voice, expressions and other information, and can in turn communicate with users through text, voice, expressions, etc.
  • Gesture interaction: users can communicate with virtual characters through gestures, and virtual characters can also reply to users through gestures and other methods.
  • Duplex interaction: a real-time, two-way interaction method.
  • The user can interrupt the virtual character at any time, and the virtual character can also interrupt its own speech when necessary.
  • The avatar can provide instant feedback to the user, such as nodding, smiling, and softly responding, without interrupting the user's input, so as to guide the subsequent conversation process.
  • VAD: Voice Activity Detection.
  • TTS (Text To Speech): a technology that converts text into speech.
  • the virtual character driving method based on multi-modal data involves artificial intelligence, deep learning, machine learning, virtual reality and other fields in computer technology, and can be specifically applied to scenarios in which virtual characters interact with people.
  • this application provides a virtual character driving method based on multi-modal data.
  • During a round of dialogue between the virtual character and the user, the voice data input by the user and the user's image data are acquired in real time.
  • the voice data input by the user in this round of dialogue is converted into corresponding text information.
  • the gesture intention classification corresponding to the user's gesture information is determined, thereby accurately and real-time identifying the user's gesture intention.
  • The corresponding driving data is determined based on the gesture intention classification corresponding to the user's gesture information and the current conversation state, and the virtual character is driven to perform the corresponding response behavior based on the driving data and the three-dimensional image rendering model of the virtual character. This enables the virtual character to respond promptly to the user's gestures and voice input, giving it multi-modal interaction capabilities, improving the degree of anthropomorphism of the virtual character, and making the communication process between the virtual character and people smoother and more intelligent.
  • Figure 1 is a framework diagram of an exemplary interactive system between virtual characters and people.
  • the interactive system between virtual characters and people includes the following subsystems: perception system, multi-modal duplex state management system, drive control system and basic dialogue system.
  • The perception system is responsible for receiving the input of multi-modal information such as voice and images, processing the input voice and image data (for example, segmentation and recognition), obtaining recognition results, and providing the recognition results to the multi-modal duplex state management system.
  • The multi-modal duplex state management system is responsible for managing the state of the current conversation, performing decision-making processing of the duplex response state based on the recognition results and the state of the current conversation, and obtaining a decision result that includes a response strategy.
  • the drive control system is responsible for performing virtual character driving, rendering and other processing based on the decision-making results of the multi-modal duplex state management system, generating the virtual character's video stream, and outputting the video stream.
  • the basic dialogue system is responsible for realizing basic human-machine dialogue capabilities, that is, generating corresponding reply information based on questions input by the user.
  • the perception system is responsible for controlling the input of video streams and voice streams in the interactive system, and implementing functions such as segmenting and identifying the input voice streams and video streams.
  • The perception system segments the voice stream based on a short preset silence duration (such as 200 ms, i.e. the VAD time).
  • The voice stream is segmented to generate a speech unit for the previous period, and the speech unit is input to the Automatic Speech Recognition (ASR) module, which converts the speech unit into text information; the text information is finally input to the multi-modal processing and alignment module.
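  • As a concrete illustration only, the following sketch shows one way such VAD-based segmentation could feed an ASR step; the frame format, the helper names and the asr callable are assumptions, not part of this application.
```python
# Minimal sketch of the perception system's segmentation step (assumed interfaces):
# audio frames labeled speech/silence are grouped into speech units whenever roughly
# 200 ms of silence accumulates, and each unit is handed to an ASR callable.
def segment_speech_units(frames, frame_ms=20, segment_silence_ms=200):
    """frames: iterable of (audio_bytes, is_speech) pairs -> yields speech units."""
    unit, silence_ms, has_speech = [], 0, False
    for chunk, is_speech in frames:
        unit.append(chunk)
        has_speech = has_speech or is_speech
        silence_ms = 0 if is_speech else silence_ms + frame_ms
        if silence_ms >= segment_silence_ms and has_speech:
            yield b"".join(unit)          # one speech unit covering the "previous period"
            unit, silence_ms, has_speech = [], 0, False

def transcribe_units(frames, asr):
    # asr: any callable mapping audio bytes to text (e.g. a wrapped ASR service)
    return [asr(unit) for unit in segment_speech_units(frames)]
```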
  • ASR: Automatic Speech Recognition.
  • the perception system recognizes the user's gesture information based on the user's image data in this round of dialogue, and inputs the gesture information into the multi-modal processing and alignment module.
  • The multi-modal processing and alignment module combines the user's gesture information with the text information of the speech unit to determine the gesture intention classification corresponding to the gesture information.
  • Gestures that need to be fed back to the user can include three major categories: gestures with clear meanings (such as OK, numbers, left and right swipes, etc.), unsafe gestures (such as the middle finger and little finger), and customized special gestures.
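  • To make the data flow concrete, the toy sketch below pairs a recognized gesture label with the concurrent text to pick an intent class; the application itself uses a trained multi-modal classification model, and all labels and class names here are illustrative assumptions.
```python
# Hypothetical stand-in for the multi-modal processing and alignment step: a rule
# table over (gesture label, text) instead of the trained classifier described above.
CLEAR_MEANING = {"ok", "number", "swipe_left", "swipe_right"}
UNSAFE = {"middle_finger", "little_finger"}

def classify_gesture_intent(gesture_label, text):
    if gesture_label in UNSAFE:
        return "unsafe"
    if gesture_label in CLEAR_MEANING:
        # the accompanying speech can disambiguate the same gesture across scenarios,
        # e.g. a swipe during "next page, please" vs. a swipe used as a greeting
        return "page_turn" if "page" in text else "clear_meaning"
    return "no_feedback"
```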
  • Based on the user's gesture intention classification recognized in real time and the current dialogue state, the multi-modal duplex state management system makes a decision on the duplex response state and determines the duplex response state corresponding to the gesture intention classification; since different duplex response states correspond to different response strategies, it then determines the response strategy corresponding to the gesture intention classification and obtains the decision result.
  • The duplex response state can include four states: duplex active/passive interruption, duplex active acceptance, calling the basic dialogue system, and no feedback. These respectively correspond to the interruption strategy in which the virtual character actively or passively interrupts the current processing, the acceptance strategy in which the virtual character actively takes over from the user, starting a new round of dialogue (that is, calling the basic dialogue system), and no feedback.
  • Duplex active acceptance: when it is judged necessary to take over from the user's dialogue or action, the corresponding acceptance strategy is triggered.
  • One mode is 'action takeover only', in which the virtual character does not make a verbal takeover reply but only responds to the user by making takeover actions, without affecting other conversation states; the other is 'action + copy', in which the virtual character not only performs the takeover action but also broadcasts the copy to respond to the user.
  • Duplex active/passive interruption: during the virtual character's broadcast, when it is judged that the user intends to interrupt, for example the user makes an unsafe gesture or a stop gesture, or inputs voice with the intention of stopping, the current conversation is actively interrupted immediately. Under the interruption strategy, the avatar interrupts its current speaking state and waits for the user to speak, or actively asks the user the reason for the interruption. If the user inputs voice data with definite semantics, a new round of dialogue is started; if the user does not input voice data with definite semantics after a period of time, the current dialogue continues and the virtual character resumes its broadcast.
  • Calling the basic dialogue system: when the silent time (VAD time) of the user's voice input reaches the silence duration threshold (that is, the VAD threshold), the user's voice input is considered to have ended and the basic dialogue system is called to reply to the user. The silence duration threshold is usually 800 ms and can be configured and adjusted according to the actual application scenario.
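  • The decision itself is a mapping from (gesture intention classification, current dialogue state) to one of the four duplex response states; the sketch below is a hypothetical illustration of that mapping, and the intent-class and state names are assumed, not defined by this application.
```python
# Illustrative duplex-state decision over assumed intent classes and dialogue states.
def decide_duplex_state(intent_class, dialogue_state):
    if dialogue_state == "user_speaking":            # user inputs, avatar receives
        if intent_class in ("praise", "greeting", "clear_meaning"):
            return "active_acceptance"
        if intent_class == "new_question":
            return "call_basic_dialogue"
        return "no_feedback"
    if dialogue_state == "avatar_speaking":          # avatar outputs, user receives
        if intent_class in ("unsafe", "stop"):
            return "interruption"
        if intent_class == "new_question":
            return "call_basic_dialogue"
        return "no_feedback"
    return "no_feedback"
```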
  • The drive control system specifically includes the following three parts. 1) The streaming TTS (Text To Speech) part, which synthesizes the text output in the decision result into an audio stream.
  • 2) The driving part, which includes two sub-modules: the face driving module and the action driving module. The face driving module drives the virtual character to output an accurate mouth shape according to the voice stream to be output in the decision result and generates face driving data; the action driving module drives the virtual character to make accurate actions according to the action tag to be output in the decision result and generates action driving data, such as data for an action blendshape-driven model.
  • 3) The rendering and synthesis part, which renders the output of the driving part, the streaming TTS and other parts and synthesizes the virtual character's video stream.
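  • The sketch below traces that three-part flow end to end; the tts, face_driver, action_driver and renderer objects are assumed interfaces used only to make the pipeline explicit, not components specified by this application.
```python
# Hypothetical drive-and-render pipeline: streaming TTS -> face (lip-sync) driving ->
# action driving from an action tag -> rendering into the output video stream.
def drive_and_render(decision, tts, face_driver, action_driver, renderer):
    audio_stream = tts.synthesize(decision["text"])              # streaming TTS
    face_data = face_driver.from_audio(audio_stream)             # mouth-shape / blendshape data
    action_data = action_driver.from_tag(decision["action_tag"]) # e.g. "nod", "wave"
    return renderer.compose(audio_stream, face_data, action_data)
```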
  • The basic dialogue system contains basic business logic and has basic dialogue interaction capabilities: the user's question is input, and the basic dialogue system outputs the answer to the question.
  • basic dialogue systems usually include: NLU (Natural Language Understanding) module, DM (Dialog Management) module and NLG (Natural Language Generation) module.
  • The business logic is the query logic that obtains the data content required in the reply information based on the question entered by the user. For example, if the user's question is "My height is 160 cm, what size should I get?", the answer is "You should wear size M"; the "size M" in the answer is obtained by querying the business logic based on the height of 160 cm.
  • the NLU module is used to identify and understand text information and convert it into a computer-understandable structured semantic representation or intent label.
  • the DM module is used to maintain and update the current dialogue status and decide on the next system action.
  • the NLG module is used to convert the status output by the system into understandable natural language text.
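  • One turn through such a system might look like the sketch below, using the size-recommendation example above; the nlu, dm, nlg and business_logic objects are assumed interfaces, not a prescribed implementation.
```python
# Hypothetical single turn through the basic dialogue system (NLU -> DM -> NLG).
def basic_dialogue_turn(user_text, nlu, dm, nlg, business_logic):
    semantics = nlu.parse(user_text)       # e.g. {"intent": "size_query", "height_cm": 160}
    action = dm.next_action(semantics)     # decide the next system action
    data = business_logic.query(action)    # e.g. height 160 cm -> size "M"
    return nlg.realize(action, data)       # e.g. "You should wear size M."
```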
  • In the multi-modal duplex state management system, by adding video streams and corresponding visual understanding modules, users can interact with virtual characters through gestures.
  • Such gestures include actions with clear meaning (such as likes, left swipes, right swipes, etc.) and unsafe gestures (such as middle finger gestures).
  • the dialogue becomes a dialogue form that can take over or interrupt the current dialogue at any time based on user gestures.
  • The current duplex response state includes four states: duplex active acceptance, duplex active/passive interruption, calling the basic dialogue system, and no feedback. These respectively correspond to four types of response strategies: the interruption strategy in which the virtual character actively or passively interrupts the current processing, the acceptance strategy in which the virtual character actively takes over from the user, starting a new round of dialogue (that is, calling the basic dialogue system), and no feedback.
  • By deciding among these four types of response strategies, the system can understand the user's gestures and take over, interrupt, or answer questions based on them, which gives virtual characters multi-modal (voice and gesture) interaction capabilities, improves their degree of anthropomorphism, and makes the communication process between virtual characters and people smoother and more intelligent.
  • Figure 2 is a flow chart of a virtual character driving method based on multi-modal data provided by an embodiment of the present application.
  • the virtual character driving method based on multi-modal data provided in this embodiment can be specifically applied to electronic devices that have the function of using virtual characters to interact with humans.
  • The electronic device can be a conversation robot, a terminal or a server, etc. In other embodiments, the electronic device can also be implemented using other devices, and this embodiment is not specifically limited here.
  • Step S201 Obtain a three-dimensional image rendering model of the virtual character to use the virtual character to provide interactive services to the user.
  • the three-dimensional image rendering model of the virtual character includes the rendering data required to realize the rendering of the virtual character.
  • the three-dimensional image rendering model based on the virtual character can render the skeletal data of the virtual character into the three-dimensional image of the virtual character displayed to the user.
  • the method provided in this embodiment can be applied in scenarios where virtual characters interact with people, using virtual characters with three-dimensional images to realize real-time interaction functions between machines and people, so as to provide intelligent services to people.
  • Step S202 During a round of dialogue between the virtual character and the user, the voice data input by the user and the user's image data are obtained in real time.
  • The input voice stream is obtained in real time to obtain the voice data input by the user; the video stream from the user can also be monitored in real time, and the video frames are sampled at a preset frequency to obtain the user's image data.
  • the user's image data includes image data in the video frame in which the user appears, including the user's face image and images of the arms and part of the body appearing in the video frame.
  • The input voice stream and video stream can be acquired in real time by the perception system in the interactive system framework shown in Figure 1 above.
  • Step S203 When it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, the voice data input by the user in the previous period is converted into corresponding text information. The previous period is from the time when the silence duration was last greater than or equal to the preset duration to the current moment.
  • the preset duration is a shorter duration that is less than the silence duration threshold.
  • The silence duration threshold is the silence duration used to determine whether the user's current round of input has ended: when the silence duration of the user's voice input reaches the silence duration threshold, the user's current round of voice input is determined to have ended.
  • the silent duration threshold can be 800ms
  • the preset duration can be 200ms. The preset duration can be set and adjusted according to the needs of the actual application scenario, and is not specifically limited here.
  • The voice data input by the user in the previous period is input into the ASR module, and the voice data is converted into the corresponding text information by the ASR module.
  • The perception system divides the speech stream into small speech units one by one according to a silence time (that is, VAD time) of a preset length (such as 200 ms), and one speech unit corresponds to the voice data between two adjacent silences.
  • Step S204 Identify the user's gesture information based on the user's image data in the previous period, and determine the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and text information.
  • the user's image data in the previous period is acquired, gesture recognition is performed on the user's image data in the previous period, and the user's gesture information is identified.
  • gestures can have different meanings in different scenarios, that is, a gesture represents different user intentions in different scenarios.
  • The user's gesture information in the previous period and the text information of the user's input voice data are combined to perform multi-modal classification to determine the gesture intention classification corresponding to the user's gesture information in the previous period, thereby accurately identifying the meaning of the user's gesture.
  • gesture intention categories that require duplex response can be pre-configured, and a response strategy corresponding to each gesture intention category can be configured.
  • Compared with configuring response strategies directly according to gestures, configuring corresponding response strategies according to different gesture intention categories enables the virtual character to respond more accurately to the user's actual gesture intention, which improves the degree of anthropomorphism of the virtual character.
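  • A pre-configured mapping of this kind could be as simple as the illustrative table below; the intent-category names and strategy contents are assumptions chosen for the example, not values fixed by this application.
```python
# Illustrative pre-configured response strategy per gesture-intent category.
RESPONSE_STRATEGIES = {
    "praise":   {"type": "acceptance",   "action": "smile", "copy": "Thank you"},
    "stop":     {"type": "interruption", "action": "pause_broadcast"},
    "unsafe":   {"type": "interruption", "action": "pause_broadcast",
                 "copy": "Do you have any questions?"},
    "question": {"type": "new_dialogue"},
    "other":    {"type": "no_feedback"},
}

def strategy_for(intent_class):
    return RESPONSE_STRATEGIES.get(intent_class, RESPONSE_STRATEGIES["other"])
```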
  • Step S205 Determine corresponding driving data according to the gesture intention classification corresponding to the user's gesture information and the current dialogue state.
  • The current dialogue state includes the following two types: the state in which the user inputs and the avatar receives, and the state in which the avatar outputs and the user receives.
  • the response strategies may include the following four categories: an interruption strategy in which the avatar actively or passively interrupts the current processing, an acceptance strategy in which the avatar actively accepts the user, starts a new round of dialogue, and no feedback.
  • Each type of response strategy includes one or more response strategies, and each response strategy includes a corresponding gesture meaning.
  • the specific content of each response strategy can be configured according to the needs of actual application scenarios, and is not specifically limited here.
  • In the dialogue state in which the user inputs and the avatar receives, the response strategy adopted can be one of the acceptance strategy, starting a new round of dialogue, or no feedback. Considering that actual application scenarios usually have no requirement for the avatar to interrupt the user's input, the interruption strategy is usually not used to respond in this state.
  • When the avatar outputs and the user receives, the avatar does not need to take over from the user, and the user can interrupt the avatar's current output, that is, interrupt the current dialogue state, so that the user can get the information he or she needs faster. Therefore, in this state the response strategy adopted can be one of the interruption strategy, starting a new round of dialogue, or no feedback, but the acceptance strategy is usually not used.
  • The current response strategy is determined based on the gesture intention classification corresponding to the user's gesture information combined with the current dialogue state, and the driving data of the avatar is generated based on the current response strategy.
  • the driving data includes all the driving parameters required to drive the virtual character to execute the response strategy corresponding to the gesture intention classification, thereby realizing the facial driving and action driving of the virtual character.
  • For example, if the response strategy corresponding to the gesture intention classification includes the virtual character making a prescribed expression, the driving data includes expression driving parameters; if it includes the virtual character making a prescribed gesture action, the driving data includes action driving parameters; if it includes the virtual character broadcasting prescribed words, the driving data includes voice driving parameters; and if it includes multiple response modes among expressions, words and actions, the driving data includes the corresponding multiple driving parameters, which can drive the virtual character to perform the response behaviors corresponding to the response strategy.
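  • Assembling such driving data could look like the sketch below, where each response mode present in the chosen strategy contributes its own driving parameters; the field names, parameter banks and tts interface are assumptions used only for illustration.
```python
# Hypothetical assembly of driving data from a chosen response strategy.
def build_driving_data(strategy, tts, expression_bank, action_bank):
    driving_data = {}
    if "expression" in strategy:                      # prescribed expression
        driving_data["expression_params"] = expression_bank[strategy["expression"]]
    if "action" in strategy:                          # prescribed gesture action
        driving_data["action_params"] = action_bank[strategy["action"]]
    if "copy" in strategy:                            # words to broadcast
        driving_data["voice_params"] = tts.synthesize(strategy["copy"])
    return driving_data
```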
  • Step S206 Drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
  • The skeletal model of the virtual character is driven according to the driving data to obtain the skeletal data corresponding to the response behavior, and the skeletal data is rendered according to the three-dimensional image rendering model of the virtual character to obtain the virtual character image data corresponding to the response behavior.
  • This embodiment obtains the voice data input by the user and the user's image data in real time during a round of dialogue between the virtual character and the user; when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and it is determined that the voice input has not ended, the voice data input by the user in the previous period is converted into corresponding text information, the user's gesture information is identified based on the user's image data in the previous period, and the gesture intention classification corresponding to the user's gesture information is determined based on the user's gesture information and the text information.
  • In this way the gesture intention of the user's gesture can be recognized in real time, and, based on that gesture intention and the current conversation state, the virtual character is driven to perform the corresponding response behavior, so that the virtual character in the output video stream performs the corresponding response behavior. This increases the real-time recognition capability for the user's gestures and drives the virtual character to respond to the user's gesture intention in a timely manner, which improves the degree of anthropomorphism of the virtual character and makes the interaction between virtual characters and people smoother and more intelligent.
  • The above step S204 can use a multi-modal classification model to perform multi-modal alignment and classification processing on the user's gesture information and the text information of the user's input voice, and determine the user's gesture intention classification, so as to accurately identify the intent of the user's gesture.
  • the text information and the user's image data in the previous period are input into the trained multi-modal classification model.
  • In the multi-modal classification model, the user's gesture information is recognized based on the user's image data in the previous period, the semantic features of the text information are extracted, and multi-modal classification processing is performed based on the user's gesture information and the semantic features of the text information, thereby determining the gesture intention classification corresponding to the user's gesture information.
  • gestures can have different meanings in different scenarios, that is, a gesture represents different user intentions in different scenarios.
  • the action of "swipe up” can express the gesture intention of "turning the page up” in one scenario, and can express the gesture intention of "hello” in another scenario.
  • the multi-modal classification model accurately identifies the gesture intention corresponding to the user's gesture information by fusing the semantic features of the text information of the user's input speech with the user's gesture information.
  • the multimodal classification model can be implemented using any existing multimodal image classification model, or other multimodal alignment models can be used to implement the function of correcting image classification results based on text information.
  • the user's gesture information can be identified based on the user's image data in the previous period.
  • For example, a time-series convolutional neural network can be used to perform feature extraction and gesture classification on the user's image data in multiple video frames, so as to identify in real time the gestures made by the user.
  • identifying the user's gesture information based on the user's image data in the previous period can be implemented using any existing gesture recognition algorithm that realizes the function of identifying the user's gesture based on the user's image data, which will not be described again here.
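  • One possible shape of such a time-series network is sketched below in PyTorch; it is only an assumed example of a temporal convolution over per-frame features, since this application does not prescribe a specific architecture.
```python
# Hypothetical temporal-convolution gesture classifier over per-frame image features.
import torch.nn as nn

class TemporalGestureClassifier(nn.Module):
    def __init__(self, feat_dim=256, num_gestures=10):
        super().__init__()
        self.temporal = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1),  # convolve over time
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                              # pool the time axis away
        )
        self.head = nn.Linear(128, num_gestures)

    def forward(self, frame_feats):        # frame_feats: (batch, time, feat_dim)
        x = frame_feats.transpose(1, 2)    # -> (batch, feat_dim, time) for Conv1d
        x = self.temporal(x).squeeze(-1)   # -> (batch, 128)
        return self.head(x)                # gesture class logits
```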
  • the sensing system can recognize the gestures shown in Table 1 below:
  • the interactive system can provide a front-end configuration page through which response strategies can be configured to flexibly configure one or more response strategies based on the needs of different specific application scenarios.
  • At least one of the following types of response policies is configured:
  • The four types are: the interruption strategy, the acceptance strategy, starting a new round of dialogue, and no feedback.
  • The first category is the interruption strategy: a strategy that interrupts the avatar's current processing in the dialogue state in which the avatar outputs and the user receives, including one or more strategies in which the avatar actively interrupts the current processing and one or more strategies in which the avatar passively interrupts the current processing.
  • In each interruption strategy, the virtual character can be configured to perform at least one of the following interruption response behaviors: broadcasting an interruption copy and making an interruption action.
  • the interrupting action includes at least one of hand action and facial action.
  • The avatar determines that the user has the intention to interrupt based on the user's gestures. For example, if the user makes an unsafe gesture or a stop gesture, or inputs voice with the intention of stopping, the avatar will immediately interrupt the current conversation and trigger the corresponding interruption strategy. The interruption strategies corresponding to different gestures can be different, and their interruption response behaviors can also be different.
  • Under each interruption strategy, the virtual character will interrupt the current speaking state, wait for the user to speak, and make the corresponding response behaviors according to the specific response mode of the interruption strategy. If the user inputs voice data with definite semantics, a new round of dialogue is started; if the user does not input voice data with definite semantics after a period of time, the current dialogue continues and the virtual character resumes its broadcast.
  • The second category is the acceptance strategy: in the dialogue state in which the user inputs and the avatar receives, the avatar actively responds to the user's gestures to assist the dialogue, but does not affect the user's input.
  • the virtual character can be configured to perform at least one acceptance response behavior of making an acceptance action and broadcasting an acceptance copy, where the acceptance action includes at least one of hand movements and facial movements.
  • When it is judged necessary to take over from the user's dialogue or action, the corresponding takeover strategy is triggered.
  • the other is "action + copywriting", that is, the virtual character not only performs the following actions, but also broadcasts the copywriting to respond to the user.
  • action + copywriting that is, the virtual character not only performs the following actions, but also broadcasts the copywriting to respond to the user.
  • the undertaking actions include facial expressions, gestures, etc.
  • multiple acceptance strategies can be configured, and the acceptance response behaviors of different acceptance strategies can be different.
  • the third category is the strategy of starting a new round of dialogue, that is, calling the basic dialogue system.
  • When the silent time (VAD time) of the user's voice input reaches the silence duration threshold (that is, the VAD threshold), the user's voice input ends and the basic dialogue system is called to reply directly to the user.
  • the silent duration threshold is usually 800ms, which can be configured and adjusted according to the actual application scenario.
  • the fourth category is a no-feedback strategy: no feedback and maintain the current state.
  • Each type of response strategy includes one or more response strategies, and each response strategy includes corresponding gesture intention classification, response time and response mode.
  • four types of response strategies can be flexibly configured through the front-end page: the interruption strategy of the avatar actively or passively interrupting the current processing, the avatar's acceptance strategy of actively accepting the user, starting a new round of dialogue, and no feedback.
  • Decision-making among these four types of response strategies can realize duplex interaction capabilities of timely acceptance and interruption based on user gestures, giving virtual characters multi-modal (voice and gesture) interaction capabilities, improving the degree of anthropomorphism of virtual characters, and making the communication process between virtual characters and people smoother and more intelligent.
  • the first target strategy corresponding to the gesture intention classification is determined according to the gesture intention classification corresponding to the user's gesture information.
  • The first target strategy is one of the acceptance strategy, starting a new round of dialogue, or no feedback; according to the first target strategy corresponding to the gesture intention classification, the corresponding driving data is determined, and the driving data is used to drive the virtual character to perform the response behavior corresponding to the first target strategy.
  • the types and specific contents of the acceptance response behaviors included in the acceptance strategies corresponding to different gesture intention categories may be different.
  • the interruption strategy is usually not used to respond.
  • The response strategy adopted can be one of the acceptance strategy, starting a new round of dialogue, or no feedback, instead of the interruption strategy, which avoids affecting the user's normal input; the user's gesture intention can also be handled in a timely manner, which improves the degree of anthropomorphism of the virtual character, increases the user's enthusiasm for continued interaction, and improves the smoothness and intelligence of the interaction between the virtual character and the user.
  • The first driving data is determined according to the acceptance strategy, and the first driving data is used to drive the virtual character to perform at least one acceptance response behavior of making an acceptance action and broadcasting an acceptance copy, wherein the acceptance action includes at least one of hand actions and facial actions.
  • the acceptance strategy may include at least one of the following acceptance response behaviors: making an acceptance action and broadcasting the acceptance copy.
  • When it is judged necessary to take over from the user, the corresponding takeover strategy is triggered. One mode is 'action takeover only', in which the virtual character responds only with takeover actions; the other is 'action + copy', in which the virtual character not only performs the takeover action but also broadcasts the copy to respond to the user.
  • the execution timing of various types of acceptance response behaviors can also be configured in the acceptance strategy.
  • The execution timing of any acceptance response behavior may include the following: execution immediately, execution after a specified period of time, or execution after the user's input is completed.
  • different takeover response behaviors in the same takeover strategy can be configured with different execution timings.
  • the corresponding acceptance strategy can be: immediately make a "smile” expression and a gesture indicating happiness, and broadcast it after the user input is completed. "I'm very happy to receive your compliment” copywriting.
  • the corresponding acceptance strategy can be: immediately make a "smile” expression and a gesture indicating happiness, and immediately broadcast the message "thank you” Undertake copywriting.
  • the acceptance copy to be broadcast immediately is usually set to short content, such as "Uh-huh”, “Yes”, “Yes”, “Hmm”, “Oh”, etc.
  • The acceptance copy will not affect the user's normal speech input.
  • Acceptance response processing such as making the acceptance action and broadcasting the acceptance copy is performed in a timely manner, which improves the degree of anthropomorphism of the virtual character, can increase the user's enthusiasm for continued interaction, and improves the smoothness and intelligence of the interaction between the virtual character and the user.
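  • The timing rules above could be scheduled as in the sketch below; the behavior strings, timing labels and scheduler interface are assumptions made for the example, with the 'compliment' case taken from the text.
```python
# Hypothetical scheduling of acceptance response behaviors with configured timings.
ACCEPT_PRAISE = [
    {"behavior": "expression:smile", "timing": "immediate"},
    {"behavior": "gesture:happy",    "timing": "immediate"},
    {"behavior": "copy:I'm very happy to receive your compliment",
     "timing": "after_input_end"},
]

def schedule_acceptance(behaviors, scheduler):
    for item in behaviors:
        if item["timing"] == "immediate":
            scheduler.run_now(item["behavior"])
        elif item["timing"] == "after_input_end":
            scheduler.run_when_input_ends(item["behavior"])
        else:                                    # e.g. a delay in milliseconds
            scheduler.run_after(item["timing"], item["behavior"])
```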
  • the second target strategy corresponding to the gesture intention classification is determined according to the gesture intention classification corresponding to the user's gesture information.
  • The second target strategy is one of the interruption strategy, starting a new round of dialogue, or no feedback; according to the second target strategy corresponding to the gesture intention classification, the second driving data is determined, and the second driving data is used to drive the virtual character to perform the response behavior corresponding to the second target strategy.
  • When the avatar outputs and the user receives, the avatar does not need to take over from the user, and the user can interrupt the avatar's current output, that is, interrupt the current conversation state, so that the user can get the information he or she needs faster.
  • the response strategy adopted may be one of interruption strategy, starting a new round of dialogue, or no feedback, but usually the acceptance strategy is not adopted.
  • The interruption strategy may include at least one interruption response behavior of making an interruption action and broadcasting an interruption copy, wherein the interruption action includes at least one of hand movements and facial movements.
  • the execution timing of various interrupt response behaviors can also be configured in the interrupt policy.
  • the execution timing of any interrupt response behavior may include the following: execution immediately, execution after a specified period of time, or execution after user input is completed.
  • different interrupt response behaviors in the same interrupt policy can be configured with different execution timings.
  • The corresponding interruption strategy can be: the avatar immediately interrupts the current broadcast, immediately makes an expression and gesture indicating confusion, and immediately broadcasts the interruption copy 'Do you have any questions?'
  • During the avatar's broadcast, when it is determined that the user has the intention to interrupt, for example the user makes an unsafe gesture or a stop gesture, or inputs voice with the intention of stopping, the avatar will interrupt the current speaking state, wait for the user to speak, or actively ask the other party for the reason for the interruption, make some actions, etc., which achieves duplex capabilities.
  • The virtual character has the ability to actively or passively interrupt its current broadcast and to make post-interruption response behaviors in response to the user's gestures, so as to guide the subsequent conversation process, making the interaction between virtual characters and users smoother and more intelligent.
  • The next round of dialogue will then be started, and dialogue processing will be performed based on the semantic information of the user's voice input.
  • the first duration is generally set to a short duration so that the user will not feel a long pause.
  • The first duration can be set and adjusted according to the needs of the actual application scenario, such as a few hundred milliseconds, one second, or even a few seconds, and is not specifically limited here.
  • The interrupted virtual character's current output can be resumed after a pause of a second duration, so as to give the user enough time for voice input.
  • the second duration can be hundreds of milliseconds, 1 second, or even several seconds, etc., and can be set and adjusted according to the needs of actual application scenarios, and is not specifically limited here.
  • If voice input with semantic information from the user is received, a new round of dialogue is started; if no such voice input is received, the system can pause for a certain period of time and then continue the avatar's previous broadcast, so as to prevent the interruption response behavior from affecting the normal interaction between the avatar and the user, and to improve the smoothness and intelligence of the interaction between the avatar and the user.
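  • The wait-then-resume logic described above might be expressed as in the sketch below; the durations and the callable names are illustrative assumptions, not values specified by this application.
```python
# Hypothetical post-interruption flow: wait briefly for meaningful user speech,
# start a new dialogue round if it arrives, otherwise pause and resume the broadcast.
FIRST_DURATION_S = 1.0    # how long to wait for the user to speak after interrupting
SECOND_DURATION_S = 1.0   # extra pause before resuming the interrupted broadcast

def after_interruption(wait_for_user_speech, start_new_dialogue, resume_broadcast, sleep):
    text = wait_for_user_speech(timeout=FIRST_DURATION_S)
    if text:                          # voice input with definite semantics received
        start_new_dialogue(text)
    else:
        sleep(SECOND_DURATION_S)      # give the user a little more time
        resume_broadcast()            # continue the avatar's previous output
```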
  • FIG. 3 is a flow chart of a method for driving a virtual character to accept users according to an embodiment of the present application.
  • the user's gestures can be recognized in real time, and the virtual character is driven to perform response processing based on the user's gestures.
  • the specific steps of this method are as follows:
  • Step S301 Obtain a three-dimensional image rendering model of the virtual character to provide interactive services to users using the virtual character.
  • the three-dimensional image rendering model of the virtual character includes the rendering data required to realize the rendering of the virtual character.
  • the three-dimensional image rendering model based on the virtual character can render the skeletal data of the virtual character into the three-dimensional image of the virtual character displayed to the user.
  • the method provided in this embodiment can be applied in scenarios where virtual characters interact with people, using virtual characters with three-dimensional images to realize real-time interaction functions between machines and people, so as to provide intelligent services to people.
  • Step S302 In a round of dialogue between the virtual character and the user, during the user's voice input process, the voice data input by the user and the user's image data are obtained in real time.
  • the input voice stream is obtained in real time and the voice data input by the user is obtained; the video stream from the user can also be monitored in real time, and the video frames are sampled according to the preset frequency. Get the user's image data.
  • the user's image data includes image data in the video frame in which the user appears, including the user's face image and images of the arms and part of the body appearing in the video frame.
  • Step S303 When it is detected that the silence duration of the voice input is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, the user's gesture information is identified based on the user's image data in the previous period.
  • The previous period is from the time when the silence duration was last greater than or equal to the preset duration to the current moment.
  • When the silence duration of the voice data input by the user is greater than or equal to the preset duration but less than the silence duration threshold, it means that a relatively long silence has occurred during the user's input, but this voice input has not yet ended.
  • A duplex response process is performed based on the user's image data in the previous period, so that the virtual character can respond to the user's gestures in a timely manner to guide the subsequent conversation process, making the interaction between the virtual character and the user smoother and more intelligent.
  • the preset duration is a shorter duration that is less than the silence duration threshold.
  • The silence duration threshold is the silence duration used to determine whether the user's current round of input has ended: when the silence duration of the user's voice input reaches the silence duration threshold, the user's current round of voice input is determined to have ended.
  • the silent duration threshold can be 800ms
  • the preset duration can be 200ms. The preset duration can be set and adjusted according to the needs of the actual application scenario, and is not specifically limited here.
  • The user's gesture information is recognized based on the user's image data in the previous period, thereby identifying in real time the current gesture made by the user.
  • the user's gesture information is identified based on the user's image data in the previous period.
  • The user's gesture information can be identified through a time-series convolutional neural network, which performs feature extraction and gesture classification on the user's image data in multiple video frames to identify in real time the gestures made by the user.
  • identifying the user's gesture information based on the user's image data in the previous period can be implemented using any existing gesture recognition algorithm that realizes the function of identifying the user's gesture based on the user's image data, which will not be described again here.
  • the sensing system may recognize the gestures shown in Table 1 above.
  • Step S304 According to the user's gesture information, if it is determined that the user has made a gesture that needs to be accepted, the driving data of the virtual character is determined.
  • It is determined whether the user's current gesture belongs to a gesture that needs to be accepted; if it does, corresponding driving data is generated according to the acceptance response strategy corresponding to the user's current gesture.
  • the driving data includes all the driving parameters required to drive the virtual character to execute the response strategy corresponding to the user's current gesture, so as to realize the facial driving and action driving of the virtual character.
  • For example, if the acceptance response strategy corresponding to the user's current gesture includes the virtual character making a prescribed expression, the driving data includes expression driving parameters; if it includes the virtual character making a prescribed action, the driving data includes action driving parameters; if it includes the virtual character broadcasting prescribed words, the driving data includes voice driving parameters; and if it includes multiple response modes among expressions, words and actions, the driving data includes the corresponding multiple driving parameters, which can drive the virtual character to perform the response behaviors corresponding to the response strategy.
  • For gestures that require an acceptance response, that is, gestures that need to be accepted, corresponding acceptance response strategies can be pre-configured.
  • the virtual character can be configured to perform at least one acceptance response behavior of making an acceptance action and broadcasting an acceptance copy, where the acceptance action includes at least one of hand movements and facial movements.
  • the other is "action + copywriting", that is, the virtual character not only performs the following actions, but also broadcasts the copywriting to respond to the user.
  • the undertaking actions include facial expressions, gestures, etc.
  • multiple acceptance strategies can be configured, and the acceptance response behaviors of different acceptance strategies can be different.
  • the execution timing of various types of acceptance response behaviors can also be configured in the acceptance policy.
  • The execution timing of any acceptance response behavior may include the following: execution immediately, execution after a specified period of time, or execution after the user's input is completed.
  • different takeover response behaviors in the same takeover strategy can be configured with different execution timings.
  • The corresponding acceptance strategy can be: immediately make a 'smile' expression and a gesture indicating happiness, and broadcast the copy 'I'm very happy to receive your compliment' after the user's input is completed.
  • The corresponding acceptance strategy can be: immediately make a 'smile' expression and a gesture indicating happiness, and immediately broadcast the acceptance copy 'Thank you'.
  • the acceptance copy to be broadcast immediately is usually set to short content, such as "Uh-huh”, “Yes”, “Yes”, “Hmm”, “Oh”, etc.
  • The acceptance copy will not affect the user's normal speech input.
  • Step S305 According to the driving data and the three-dimensional image rendering model of the virtual character, drive the virtual character to perform an acceptance response behavior corresponding to the user's gesture information.
  • The skeletal model of the virtual character is driven according to the driving data to obtain the skeletal data corresponding to the response behavior, and the skeletal data is rendered according to the three-dimensional image rendering model of the virtual character to obtain the virtual character image data corresponding to the acceptance response behavior.
  • the voice data input by the user and the user's image data are obtained in real time during the user's voice input; when it is detected that the silence duration of the voice input is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the user's gesture information will be recognized in real time based on the user's image data in the previous period.
  • based on the user's gesture information, if it is determined that the user has made a gesture that needs to be accepted, the virtual character is driven to respond to the user's current gesture, so that the virtual character in the output video stream performs the corresponding acceptance response behavior; this adds real-time recognition of the user's gestures and drives the virtual character to respond to them in a timely manner, improving the degree of anthropomorphism of the virtual character and making the interaction between the virtual character and people smoother and more intelligent.
  • step S304 can be implemented by using the following steps:
  • Step S3041 Convert the voice data input by the user in the previous period into corresponding text information.
  • the voice data input by the user in the previous period is input into the ASR module, and the voice data is converted into corresponding text information through the ASR module.
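  • a rough sketch of how the silence-based segmentation and ASR conversion might be wired together is given below; vad.silence_ms, asr.transcribe and audio_buffer.flush are assumed interfaces standing in for the actual voice-activity-detection and speech-recognition components, and the 200 ms / 800 ms values merely echo the example durations mentioned in this application.

      PRESET_SILENCE_MS = 200   # short cut point, as in the 200 ms example
      END_OF_TURN_MS = 800      # end-of-turn threshold, as in the 800 ms example

      def on_audio_tick(vad, asr, audio_buffer):
          # Called periodically while the user is speaking.
          silence = vad.silence_ms(audio_buffer)            # assumed VAD interface
          if silence < PRESET_SILENCE_MS:
              return None                                   # keep accumulating speech
          text = asr.transcribe(audio_buffer.flush())       # assumed ASR interface
          # Either the turn has ended, or a speech unit is cut while input continues.
          return {"end_of_turn": silence >= END_OF_TURN_MS, "text": text}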
  • Step S3042 Determine the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and text information.
  • a gesture can have different intentions in different scenarios.
  • the user's gesture information in the previous period and the text information of the user's input voice data are combined to perform multi-modal classification and determine the gesture intention classification corresponding to the user's gesture information in the previous period.
  • gesture intention categories that require duplex response can be pre-configured, and a response strategy corresponding to each gesture intention category can be configured.
  • the text information and the user's image data in the previous period are input into the trained multi-modal classification model.
  • through the multi-modal classification model, the user's gesture information is recognized based on the user's image data in the previous period and the semantic features of the text information are extracted.
  • multi-modal classification processing is performed to determine the gesture intention classification corresponding to the user's gesture information.
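  • as one hedged illustration of such a multi-modal classifier, the PyTorch-style sketch below concatenates a text-semantic feature vector with a gesture feature vector before classification; the layer sizes, feature dimensions and number of intention classes are arbitrary assumptions, and any existing multi-modal classification model could be substituted.

      import torch
      import torch.nn as nn

      class GestureIntentClassifier(nn.Module):
          # Toy fusion classifier: text semantics + gesture features -> intention class.
          def __init__(self, text_dim=256, gesture_dim=128, num_intents=10):
              super().__init__()
              self.fuse = nn.Sequential(
                  nn.Linear(text_dim + gesture_dim, 256), nn.ReLU(),
                  nn.Linear(256, num_intents),
              )

          def forward(self, text_feat, gesture_feat):
              # Concatenate the two modalities, then classify the gesture intention.
              return self.fuse(torch.cat([text_feat, gesture_feat], dim=-1))

      # Usage sketch with random tensors standing in for real text / gesture encoders.
      model = GestureIntentClassifier()
      logits = model(torch.randn(1, 256), torch.randn(1, 128))
      intent_id = logits.argmax(dim=-1)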
  • gestures can have different meanings in different scenarios, that is, a gesture represents different user intentions in different scenarios.
  • the action of "swipe up” can express the gesture intention of "turning the page up” in one scenario, and can express the gesture intention of "hello” in another scenario.
  • the multi-modal classification model accurately identifies the gesture intention corresponding to the user's gesture information by fusing the semantic features of the text information of the user's input speech with the user's gesture information.
  • the multimodal classification model can be implemented using any existing multimodal image classification model, or other multimodal alignment models can be used to implement the function of correcting image classification results based on text information.
  • the user's gesture information can be identified based on the user's image data in the previous period.
  • a time-series convolutional neural network can be used to perform feature extraction and gesture classification on the user's image data in multiple video frames, so as to identify in real time the gestures made by the user.
  • identifying the user's gesture information based on the user's image data in the previous period can be implemented using any existing gesture recognition algorithm capable of identifying the user's gestures from image data, which will not be described again here.
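  • for the time-series convolutional network mentioned above, a minimal sketch could look as follows; the input is a short clip of per-frame features, and the feature dimension, clip length and number of gesture classes are illustrative assumptions rather than the concrete network used.

      import torch
      import torch.nn as nn

      class TemporalGestureNet(nn.Module):
          # Toy temporal CNN over a clip of per-frame features (batch, frames, feat_dim).
          def __init__(self, feat_dim=64, num_gestures=8):
              super().__init__()
              self.temporal = nn.Sequential(
                  nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool1d(1),        # pool over the time axis
              )
              self.head = nn.Linear(128, num_gestures)

          def forward(self, frame_feats):
              x = frame_feats.transpose(1, 2)     # -> (batch, feat_dim, frames)
              x = self.temporal(x).squeeze(-1)    # -> (batch, 128)
              return self.head(x)                 # gesture class logits

      clip = torch.randn(1, 16, 64)               # 16 sampled video frames of features
      gesture_logits = TemporalGestureNet()(clip)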
  • Step S3043 If the gesture intention classification corresponding to the user's gesture information belongs to the gesture intention classification that needs to be accepted, it is determined that the user has made the gesture that needs to be accepted, and the driving data of the virtual character is determined according to the gesture intention classification corresponding to the user's gesture information.
  • the driving data is used to drive the virtual character to perform the acceptance response behavior corresponding to the gesture intention classification corresponding to the user's gesture information.
  • the manner of determining the driving data is consistent with step S205, in which the driving data of the virtual character is determined according to the gesture intention classification corresponding to the user's gesture information in the dialogue state where the user inputs and the virtual character receives, and will not be described again in this embodiment.
  • the voice data input by the user in the previous period is converted into corresponding text information, and the text information of the user's input voice is integrated so that the gesture intention corresponding to the user's gesture information is identified accurately and in real time, and the virtual character is driven to perform the corresponding acceptance response behavior based on the user's gesture intention to guide the subsequent dialogue process, making the interaction between the virtual character and the user smoother and more intelligent.
  • Figure 4 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by an exemplary embodiment of the present application.
  • the multi-modal data-based virtual character driving system provided by the embodiments of the present application can execute the processing flow provided by the multi-modal data-based virtual character driving method embodiment.
  • the virtual character driving system 40 based on multi-modal data includes: a multi-modal input module 41, a voice processing module 42, an image processing module 43 and a driving control module 44.
  • the drive control module 44 is used to obtain a three-dimensional image rendering model of the virtual character so as to use the virtual character to provide interactive services to the user.
  • the multi-modal input module 41 is used to obtain the voice data input by the user and the user's image data in real time during a round of dialogue between the virtual character and the user.
  • the voice processing module 42 is configured to, when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, convert the voice data input by the user in the previous period into corresponding text information,
  • the previous period is from the time when the last silence duration was greater than or equal to the preset duration to the current moment;
  • the image processing module 43 identifies the user's gesture information based on the user's image data in the previous period, and determines the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and the text information.
  • the drive control module 44 is also used to determine the corresponding driving data according to the gesture intention classification corresponding to the user's gesture information and the current dialogue state, and to drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
  • when identifying the user's gesture information based on the user's image data in the previous period, and determining the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and text information, the image processing module 43 is further used to:
  • the user's gesture information is recognized based on the user's image data in the previous period and the semantic features of the text information are extracted.
  • perform multi-modal classification processing to determine the gesture intention classification corresponding to the user's gesture information.
  • the virtual character driving system 40 based on multi-modal data further includes: a strategy configuration module 45.
  • the strategy configuration module 45 is configured to, in response to a response strategy configuration operation, configure at least one of the following types of response strategies: an interruption strategy in which the virtual character actively or passively interrupts the current processing, an acceptance strategy in which the virtual character actively accepts the user, starting a new round of dialogue, and no feedback.
  • Each type of response strategy includes one or more response strategies, and each response strategy includes corresponding gesture intention classification, response time and response mode.
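  • purely for illustration, a strategy table of this kind might be stored as below, with one entry per gesture intention classification giving its strategy category, response time and response mode; all labels are hypothetical.

      # Hypothetical response-strategy table; one entry per gesture intention classification.
      RESPONSE_STRATEGIES = [
          {"intent": "stop_gesture", "category": "interrupt", "time": "immediate",
           "mode": ["stop_broadcast", "ask_reason"]},
          {"intent": "praise", "category": "accept", "time": "after_user_input",
           "mode": ["smile", "broadcast_copy"]},
          {"intent": "page_up", "category": "new_dialogue", "time": "immediate",
           "mode": ["call_basic_dialogue_system"]},
          {"intent": "unknown", "category": "no_feedback", "time": None, "mode": []},
      ]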
  • when determining the corresponding driving data according to the gesture intention classification corresponding to the user's gesture information and the current dialogue state, as shown in FIG. 5, the driving control module 44 includes: a response decision unit 441 and a drive control unit 442, where the response decision unit 441 is used to determine, if the current dialogue state is a state in which the user inputs and the virtual character receives, the first target strategy corresponding to the gesture intention classification according to the gesture intention classification corresponding to the user's gesture information; the first target strategy is one of the acceptance strategy, starting a new round of dialogue, or no feedback.
  • the drive control unit 442 is used to determine the corresponding drive data according to the first target strategy corresponding to the gesture intention classification, and the drive data is used to drive the virtual character to perform the response behavior corresponding to the first target strategy.
  • when determining the corresponding driving data according to the first target strategy corresponding to the gesture intention classification, the drive control unit 442 is further configured to: if the first target strategy is an acceptance strategy, determine first driving data according to the acceptance strategy, where the first driving data is used to drive the virtual character to perform at least one acceptance response behavior of making an acceptance action and broadcasting an acceptance copy, and the acceptance action includes at least one of hand movements and facial movements.
  • the drive control unit is further configured to determine, if the current dialogue state is a state in which the virtual character outputs and the user receives, the second target strategy corresponding to the gesture intention classification according to the gesture intention classification corresponding to the user's gesture information.
  • the second target strategy is one of the interruption strategy, starting a new round of dialogue, or no feedback; second driving data is determined according to the second target strategy corresponding to the gesture intention classification, and the second driving data is used to drive the virtual character to perform the response behavior corresponding to the second target strategy.
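  • a condensed sketch of this decision flow is given below; the dialogue-state names, strategy categories and helper structure are assumptions for illustration only.

      def decide_strategy(dialogue_state, gesture_intent, strategies):
          # Pick the configured strategy for the recognised gesture intention, then
          # filter it by the current dialogue state (illustrative rules only).
          match = next((s for s in strategies if s["intent"] == gesture_intent), None)
          if match is None:
              return {"category": "no_feedback"}
          if dialogue_state == "user_speaking":      # user inputs, virtual character receives
              allowed = {"accept", "new_dialogue", "no_feedback"}
          else:                                      # virtual character outputs, user receives
              allowed = {"interrupt", "new_dialogue", "no_feedback"}
          return match if match["category"] in allowed else {"category": "no_feedback"}

      # Usage sketch with a one-entry table.
      decide_strategy("user_speaking", "praise",
                      [{"intent": "praise", "category": "accept"}])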
  • Figure 6 is a schematic architectural diagram of a virtual character driving system based on multi-modal data provided by another exemplary embodiment of the present application.
  • the multi-modal data-based virtual character driving system provided by the embodiments of the present application can execute the processing flow provided by the multi-modal data-based virtual character driving method embodiment.
  • the virtual character driving system 60 based on multimodal data includes: a multimodal input module 61, a perception module 62 and a decision driving module 63.
  • the decision-making driving module 63 is used to obtain a three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to the user.
  • the multi-modal input module 61 is used to obtain the voice data input by the user and the image data of the user in real time during a conversation between the virtual character and the user.
  • the sensing module 62 is configured to, when it is detected that the silence duration of the voice input is greater than or equal to the preset duration and it is determined that the voice input has not ended, identify the user's gesture information based on the user's image data in the previous period, where the previous period is from the time when the last silence duration was greater than or equal to the preset duration to the current moment.
  • the decision-making driving module 63 is used to determine the driving data of the virtual character according to the user's gesture information if it is determined that the user has made a gesture that needs to be accepted, and to drive the virtual character to perform the acceptance response behavior corresponding to the user's gesture information based on the driving data and the three-dimensional image rendering model of the virtual character.
  • when determining the driving data of the virtual character based on the user's gesture information if it is determined that the user has made a gesture that needs to be accepted, the sensing module 62 is further used to: convert the voice data input by the user in the previous period into corresponding text information; and determine the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and the text information.
  • the decision-making driving module 63 is also used to determine that the user has made a gesture that needs to be accepted if the gesture intention classification corresponding to the user's gesture information belongs to the gesture intention classifications that need to be accepted, and to determine the driving data of the virtual character according to the gesture intention classification corresponding to the user's gesture information.
  • the driving data is used to drive the virtual character to perform the acceptance response behavior corresponding to the gesture intention classification corresponding to the user's gesture information.
  • FIG. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
  • the electronic device 70 includes: a processor 701, and a memory 702 communicatively connected to the processor 701.
  • the memory 702 stores computer execution instructions.
  • the processor executes the computer execution instructions stored in the memory to implement the solution provided by any of the above method embodiments.
  • the specific functions and the technical effects that can be achieved will not be described again here.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores computer execution instructions; when the computer execution instructions are executed by a processor, they are used to implement the solution provided by any of the above method embodiments. The specific functions and the technical effects that can be achieved will not be described again here.
  • Embodiments of the present application also provide a computer program product.
  • the computer program product includes: a computer program.
  • the computer program is stored in a readable storage medium.
  • At least one processor of the electronic device can read the computer program from the readable storage medium.
  • the at least one processor executes the computer program so that the electronic device executes the solution provided by any of the above method embodiments. The specific functions and technical effects that can be achieved will not be described again here.

Abstract

The present application relates to the fields of artificial intelligence, deep learning, machine learning, virtual reality, etc. in the computer technology, and provides a virtual character driving method and system based on multimodal data, and a device. The method of the present application comprises: during a round of dialogue between a virtual character and a user, obtaining voice data input by the user and image data of the user in real time; when it is detected that a silence duration of the voice data input by the user is greater than or equal to a preset duration and a voice input is not finished, converting voice data in the previous time period into corresponding text information; recognizing gesture information of the user according to image data of the user in the previous time period, and determining, according to the gesture information of the user and the text information, corresponding gesture intention classification to recognize a gesture intention of the user in real time; and driving, on the basis of the gesture intention of the user, the virtual character to perform a corresponding response behavior in time. Therefore, the degree of personification of the virtual character is improved, and the interaction between the virtual character and a person is smoother and more intelligent.

Description

基于多模态数据的虚拟人物驱动方法、系统及设备Virtual character driving method, system and device based on multi-modal data
本申请要求于2022年05月23日提交中国专利局、申请号为202210567637.3、申请名称为“基于多模态数据的虚拟人物驱动方法、系统及设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application requests the priority of the Chinese patent application submitted to the China Patent Office on May 23, 2022, with the application number 202210567637.3 and the application name "Virtual character driving method, system and device based on multi-modal data", and its entire content incorporated herein by reference.
技术领域Technical field
本申请涉及计算机技术中的人工智能、深度学习、机器学习、虚拟现实等领域,尤其涉及一种基于多模态数据的虚拟人物驱动方法、系统及设备。This application relates to artificial intelligence, deep learning, machine learning, virtual reality and other fields in computer technology, and in particular to a virtual character driving method, system and device based on multi-modal data.
背景技术Background technique
随着虚拟现实技术的发展,基于虚拟人物的人机交互在人们生活中的应用越来越普及,如可以广泛应用于智能客服、虚拟导师、智能家庭医生、虚拟主播等场景中。在真人面对面进行对话时,是可以基于对方的手势动作等信息进行及时地反馈的。现有的虚拟人物与人的交互中,用户(真人)大多通过终端触控屏显示的系统页面,通过文字输入或触发页面控件等方式进行留言、点赞、翻页等与虚拟人物的交互,缺乏虚拟人物与真人进行“面对面”交互的技术手段,导致虚拟人物的拟人化程度低,虚拟人物与人的交互过程不顺畅、不智能。With the development of virtual reality technology, human-computer interaction based on virtual characters is becoming more and more popular in people's lives. It can be widely used in scenarios such as smart customer service, virtual tutors, smart family doctors, and virtual anchors. When real people have face-to-face conversations, timely feedback can be provided based on the other party's gestures and movements and other information. In the existing interaction between virtual characters and people, users (real people) mostly interact with the virtual characters through text input or triggering page controls through the system page displayed on the terminal touch screen, such as leaving messages, liking, turning pages, etc. The lack of technical means for "face-to-face" interaction between virtual characters and real people results in a low degree of anthropomorphism of virtual characters, and the interaction process between virtual characters and people is not smooth and intelligent.
发明内容Contents of the invention
本申请提供一种基于多模态数据的虚拟人物驱动方法、系统及设备,用以解决虚拟人物拟人化程度低,虚拟人物与人的交互过程不顺畅、不智能的问题。This application provides a virtual character driving method, system and device based on multi-modal data to solve the problem of low anthropomorphism of virtual characters and unsmooth and unintelligent interaction between virtual characters and people.
第一方面,本申请提供一种基于多模态数据的虚拟人物驱动方法,包括:In the first aspect, this application provides a virtual character driving method based on multi-modal data, including:
获取虚拟人物的三维形象渲染模型以利用虚拟人物提供对用户的交互服务;Obtain the three-dimensional image rendering model of the virtual character to use the virtual character to provide interactive services to users;
在虚拟人物与用户的一轮对话过程中,实时获取用户输入的语音数据和所述用户的图像数据;During a round of dialogue between the virtual character and the user, the voice data input by the user and the user's image data are obtained in real time;
当检测到所述用户输入的语音数据的静默时长大于或等于预设时长时,若确定所述语音输入未结束,则将上一时段内所述用户输入的语音数据转换为对应的文本信息,所述上一时段为自上一次静默时长大于或等于预设时长的时刻至当前时刻;When it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the voice data input by the user in the previous period is converted into corresponding text information, The previous period is from the time when the last silence duration was greater than or equal to the preset duration to the current time;
根据所述上一时段内所述用户的图像数据识别所述用户的手势信息,并根据所述用户的手势信息和所述文本信息,确定所述用户的手势信息对应的手势意图分类;Identify the user's gesture information according to the user's image data in the previous period, and determine the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and the text information;
根据所述用户的手势信息对应的手势意图分类,以及当前的对话状态,确定对应的驱 动数据;According to the gesture intention classification corresponding to the user's gesture information and the current conversation status, the corresponding driver is determined. moving data;
根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为。According to the driving data and the three-dimensional image rendering model of the virtual character, the virtual character is driven to perform the corresponding response behavior.
第二方面,本申请提供一种基于多模态数据的虚拟人物驱方法,包括:In the second aspect, this application provides a virtual character driving method based on multi-modal data, including:
获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务;Obtain the three-dimensional image rendering model of the virtual character to use the virtual character to provide interactive services to users;
在虚拟人物与用户的一轮对话中,在用户进行语音输入的过程中,实时获取用户输入的语音数据和所述用户的图像数据;In a round of dialogue between the virtual character and the user, during the user's voice input process, the voice data input by the user and the image data of the user are obtained in real time;
当检测到所述语音输入的静默时长大于或等于预设时长时,若确定所述语音输入未结束,则根据上一时段内所述用户的图像数据识别所述用户的手势信息,所述上一时段为自上一次静默时长大于或等于预设时长的时刻至当前时刻;When it is detected that the silence duration of the voice input is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the user's gesture information is identified based on the user's image data in the previous period. A period of time is from the time when the last silence duration was greater than or equal to the preset duration to the current moment;
根据所述用户的手势信息,若确定所述用户做出了需承接的手势,则确定虚拟人物的驱动数据;According to the user's gesture information, if it is determined that the user has made a gesture that needs to be accepted, then determine the driving data of the virtual character;
根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行所述用户的手势信息对应的承接响应行为。According to the driving data and the three-dimensional image rendering model of the virtual character, the virtual character is driven to perform the acceptance response behavior corresponding to the user's gesture information.
第三方面,本申请提供一种基于多模态数据的虚拟人物驱动系统,包括:In the third aspect, this application provides a virtual character driving system based on multi-modal data, including:
驱动控制模块,用于获取虚拟人物的三维形象渲染模型以利用虚拟人物提供对用户的交互服务;The driver control module is used to obtain the three-dimensional image rendering model of the virtual character to provide interactive services to the user using the virtual character;
多模态输入模块,用于在虚拟人物与用户的一轮对话过程中,实时获取用户输入的语音数据和所述用户的图像数据;The multi-modal input module is used to obtain the voice data input by the user and the image data of the user in real time during a round of dialogue between the virtual character and the user;
语音处理模块,用于当检测到所述用户输入的语音数据的静默时长大于或等于预设时长时,若确定所述语音输入未结束,则将上一时段内所述用户输入的语音数据转换为对应的文本信息,所述上一时段为自上一次静默时长大于或等于预设时长的时刻至当前时刻;A voice processing module configured to convert the voice data input by the user in the previous period when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration and if it is determined that the voice input has not ended. is the corresponding text information, and the previous period is from the time when the last silence duration was greater than or equal to the preset duration to the current time;
图像处理模块,用于根据所述上一时段内所述用户的图像数据识别所述用户的手势信息,并根据所述用户的手势信息和所述文本信息,确定所述用户的手势信息对应的手势意图分类;An image processing module, configured to identify the user's gesture information based on the user's image data in the previous period, and determine the user's gesture information corresponding to the user's gesture information based on the user's gesture information and the text information. Gesture intent classification;
所述驱动控制模块还用于根据所述用户的手势信息对应的手势意图分类,以及当前的对话状态,确定对应的驱动数据;根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为。The drive control module is also used to determine the corresponding drive data according to the gesture intention classification corresponding to the user's gesture information and the current conversation state; and drive the virtual character according to the drive data and the three-dimensional image rendering model of the virtual character. Execute the corresponding response behavior.
第四方面,本申请提供一种基于多模态数据的虚拟人物驱动系统,包括:In the fourth aspect, this application provides a virtual character driving system based on multi-modal data, including:
决策驱动模块,用于获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务;The decision-driven module is used to obtain the three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to users;
多模态输入模块,用于在虚拟人物与用户的一轮对话中,在用户进行语音输入的过程中,实时获取用户输入的语音数据和所述用户的图像数据;The multi-modal input module is used to obtain the voice data input by the user and the image data of the user in real time during a conversation between the virtual character and the user;
感知模块,用于当检测到所述语音输入的静默时长大于或等于预设时长时,若确定所述语音输入未结束,则根据上一时段内所述用户的图像数据识别所述用户的手势信息,所 述上一时段为自上一次静默时长大于或等于预设时长的时刻至当前时刻;A sensing module configured to, when it is detected that the silence duration of the voice input is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, identify the user's gesture according to the user's image data in the previous period. information, all The above-mentioned previous period is from the time when the last silence duration was greater than or equal to the preset duration to the current moment;
所述决策驱动模块还用于根据所述用户的手势信息,若确定所述用户做出了需承接的手势,则确定虚拟人物的驱动数据;根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行所述用户的手势信息对应的承接响应行为。The decision-making driving module is also used to determine the driving data of the virtual character according to the user's gesture information, if it is determined that the user has made a gesture that needs to be accepted; according to the driving data and the three-dimensional image rendering model of the virtual character , driving the virtual character to perform the acceptance response behavior corresponding to the user's gesture information.
第五方面,本申请提供一种电子设备,包括:处理器,以及与所述处理器通信连接的存储器;In a fifth aspect, this application provides an electronic device, including: a processor, and a memory communicatively connected to the processor;
所述存储器存储计算机执行指令;The memory stores computer execution instructions;
所述处理器执行所述存储器存储的计算机执行指令,以实现上述第一方面或第二方面所述的方法。The processor executes computer execution instructions stored in the memory to implement the method described in the first aspect or the second aspect.
第六方面,本申请提供一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机执行指令,所述计算机执行指令被处理器执行时用于实现上述第一方面或第二方面所述的方法。In a sixth aspect, the present application provides a computer-readable storage medium that stores computer-executable instructions, which when executed by a processor are used to implement the above-mentioned first or second aspect. the method described.
本申请提供的基于多模态数据的虚拟人物驱动方法、系统及设备,通过在虚拟人物与用户的一轮对话过程中,实时获取用户输入的语音数据和用户的图像数据;当检测到用户输入的语音数据的静默时长大于或等于预设时长时,若确定语音输入未结束,则将上一时段内用户输入的语音数据转换为对应的文本信息,根据上一时段内用户的图像数据识别用户的手势信息,并根据用户的手势信息和文本信息,确定用户的手势信息对应的手势意图分类,从而能够实时地识别用户手势的手势意图,并基于用户手势的手势意图和当前的对话状态,驱动虚拟人物执行对应的响应行为,使得输出视频流中虚拟人物做出对应的响应行为,增加用户手势的实时识别能力,并且驱动虚拟人物针对用户的手势意图做出及时地响应,实现虚拟人物与真人用户进行“面对面”交互,提高了虚拟人物拟人化程度,使得虚拟人物与人的交互更顺畅、更智能。The virtual character driving method, system and device based on multi-modal data provided by this application obtain the voice data input by the user and the image data of the user in real time during a round of dialogue between the virtual character and the user; when user input is detected When the silence duration of the voice data is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the voice data input by the user in the previous period will be converted into corresponding text information, and the user will be identified based on the user's image data in the previous period The gesture information of the user is determined, and the gesture intention classification corresponding to the user's gesture information is determined based on the user's gesture information and text information, so that the gesture intention of the user's gesture can be recognized in real time, and based on the gesture intention of the user's gesture and the current conversation state, the driver The virtual character performs the corresponding response behavior, causing the virtual character in the output video stream to perform the corresponding response behavior, increasing the real-time recognition ability of the user's gestures, and driving the virtual character to respond promptly to the user's gesture intention, realizing the virtual character's interaction with the real person Users conduct "face-to-face" interactions, which improves the degree of anthropomorphism of virtual characters and makes the interaction between virtual characters and people smoother and more intelligent.
附图说明Description of the drawings
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本申请的实施例,并与说明书一起用于解释本申请的原理。The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
图1为本申请提供的一示例性的虚拟人物与人的交互系统的框架图;Figure 1 is a framework diagram of an exemplary virtual character-human interaction system provided by this application;
图2为本申请一实施例提供的基于多模态数据的虚拟人物驱动方法流程图;Figure 2 is a flow chart of a virtual character driving method based on multi-modal data provided by an embodiment of the present application;
图3为本申请一实施例提供的实现驱动虚拟人物承接用户的方法流程图;Figure 3 is a flow chart of a method for driving a virtual character to accept users provided by an embodiment of the present application;
图4为本申请一示例性实施例提供的基于多模态数据的虚拟人物驱动系统的结构示意图;Figure 4 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by an exemplary embodiment of the present application;
图5为本申请另一示例性实施例提供的基于多模态数据的虚拟人物驱动系统的结构示意图;Figure 5 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by another exemplary embodiment of the present application;
图6为本申请另一示例性实施例提供的基于多模态数据的虚拟人物驱动系统的结构示意图; Figure 6 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by another exemplary embodiment of the present application;
图7为本申请一示例实施例提供的电子设备的结构示意图。FIG. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
通过上述附图,已示出本申请明确的实施例,后文中将有更详细的描述。这些附图和文字描述并不是为了通过任何方式限制本申请构思的范围,而是通过参考特定实施例为本领域技术人员说明本申请的概念。Through the above-mentioned drawings, clear embodiments of the present application have been shown, which will be described in more detail below. These drawings and text descriptions are not intended to limit the scope of the present application's concepts in any way, but are intended to illustrate the application's concepts for those skilled in the art with reference to specific embodiments.
具体实施方式Detailed ways
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的系统和方法的例子。Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of systems and methods consistent with aspects of the present application as detailed in the appended claims.
首先对本申请所涉及的名词进行解释:First, the terms involved in this application will be explained:
多模态交互:用户可通过文字、语音、表情等方式与虚拟人物交流,虚拟人物可以理解用户文字、语音、表情等信息,并可以反过来通过文字、语音、表情等方式与用户进行交流。Multi-modal interaction: Users can communicate with virtual characters through text, voice, expressions, etc. The virtual characters can understand user text, voice, expressions and other information, and can in turn communicate with users through text, voice, expressions, etc.
手势交互:用户可通过手势与虚拟人物进行交流,虚拟人物也可以通过手势等方式对用户进行回复。Gesture interaction: Users can communicate with virtual characters through gestures, and virtual characters can also reply to users through gestures and other methods.
双工交互:实时的、双向的交互方式,用户可以随时打断虚拟人物,虚拟人物也可以在必要的时候打断正在说话的自己。Duplex interaction: A real-time, two-way interaction method. The user can interrupt the virtual character at any time, and the virtual character can also interrupt himself who is speaking when necessary.
承接:虚拟人物在与人类用户的对话过程,在用户输入虚拟人物接收的对话状态下,虚拟人物可对用户进行即时反馈,如点头、微笑和轻声应和等,但又不打断用户的输入,以引导后续的对话流程。Undertake: During the conversation between the avatar and the human user, when the user inputs the avatar to receive the dialogue state, the avatar can provide instant feedback to the user, such as nodding, smiling, and softly responding, without interrupting the user's input. , to guide the subsequent conversation process.
打断:虚拟人物在与人类用户的对话过程中,一方可以随时中止另一方的对话,开起新一轮的交互。Interruption: During a conversation between a virtual character and a human user, one party can interrupt the other party's conversation at any time and start a new round of interaction.
语音活动检测(Voice Activity Detection,简称VAD),又称语音端点检测,语音边界检测,是一种检测语音输入的静默时长的技术。Voice Activity Detection (VAD), also known as voice endpoint detection and voice boundary detection, is a technology that detects the duration of silence in voice input.
语音合成(Text To Speech,简称TTS):是一种将文本转化为语音的技术。Text To Speech (TTS): It is a technology that converts text into speech.
本申请提供的基于多模态数据的虚拟人物驱动方法,涉及计算机技术中的人工智能、深度学习、机器学习、虚拟现实等领域,具体可以应用于虚拟人物与人的交互的场景中。The virtual character driving method based on multi-modal data provided by this application involves artificial intelligence, deep learning, machine learning, virtual reality and other fields in computer technology, and can be specifically applied to scenarios in which virtual characters interact with people.
示例性地,常见的虚拟人物与人类交互的场景包括:智能客服、政务咨询、生活服务、智慧交通、虚拟陪伴人、虚拟主播、虚拟教师、网络游戏等等。For example, common scenarios in which virtual characters interact with humans include: intelligent customer service, government consultation, life services, smart transportation, virtual companions, virtual anchors, virtual teachers, online games, etc.
针对现有虚拟人物拟人化程度低,导致沟通过程不顺畅、不智能的问题,本申请提供一种基于多模态数据的虚拟人物驱动方法,通过在虚拟人物与用户的一轮对话过程中,实时获取用户输入的语音数据和用户的图像数据。当检测到用户输入的语音数据的静默时长大于或等于预设时长时,若确定语音输入未结束,则将本轮对话中用户输入的语音数据转换为对应的文本信息。根据本轮对话中用户的图像数据识别用户的手势信息,并根 据用户的手势信息和文本信息,确定用户的手势信息对应的手势意图分类,从而精准地、实时地识别用户的手势意图。并在确定用户手势的手势意图分类后,基于用户的手势信息对应的手势意图分类,以及当前的对话状态,确定对应的驱动数据,根据驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为。从而使得虚拟人物能够针对用户的手势和语音输入及时地做出相应的响应,使其具备多模态交互能力,提高了虚拟人物拟人化程度,使得虚拟人物与人的沟通过程更顺畅、更智能。In order to solve the problem of low anthropomorphism of existing virtual characters, resulting in unsmooth and unintelligent communication processes, this application provides a virtual character driving method based on multi-modal data. During a round of dialogue between the virtual character and the user, Acquire the voice data input by the user and the user's image data in real time. When it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, the voice data input by the user in this round of dialogue is converted into corresponding text information. Recognize the user's gesture information based on the user's image data in this round of dialogue, and based on According to the user's gesture information and text information, the gesture intention classification corresponding to the user's gesture information is determined, thereby accurately and real-time identifying the user's gesture intention. After determining the gesture intention classification of the user's gesture, the corresponding driving data is determined based on the gesture intention classification corresponding to the user's gesture information and the current conversation state, and the virtual character is driven to execute based on the driving data and the three-dimensional image rendering model of the virtual character. Corresponding response behavior. This enables the virtual character to respond promptly to the user's gestures and voice input, giving it multi-modal interaction capabilities, improving the degree of anthropomorphism of the virtual character, and making the communication process between the virtual character and people smoother and more intelligent. .
本申请提供的基于多模态数据的虚拟人物驱动方法可以应用于虚拟人物与人的交互系统中。图1为一示例性的虚拟人物与人的交互系统的框架图,如图1所示,虚拟人物与人的交互系统包括以下子系统:感知系统、多模态双工状态管理系统、驱动控制系统和基础对话系统。The virtual character driving method based on multi-modal data provided by this application can be applied to the interactive system between virtual characters and people. Figure 1 is a framework diagram of an exemplary interactive system between virtual characters and people. As shown in Figure 1, the interactive system between virtual characters and people includes the following subsystems: perception system, multi-modal duplex state management system, drive control system and basic dialogue system.
其中,感知系统负责接收语音、图像等多模态信息的输入,以及对输入的语音、图像数据的切分、识别等数据处理,得到识别结果,并将识别结果提供给多模态双工状态管理系统。多模态双工状态管理系统负责管理当前对话的状态,基于识别结果和当前对话的状态进行双工响应状态的决策处理,得到包含响应策略的决策结果。驱动控制系统负责基于多模态双工状态管理系统的决策结果进行虚拟人物驱动、渲染等处理,生成虚拟人物的视频流,并输出视频流。基础对话系统负责实现基础的人机对话能力,也即根据用户输入的问题生成对应的答复信息。Among them, the perception system is responsible for receiving the input of multi-modal information such as voice and images, and processing the data such as segmentation and recognition of the input voice and image data, obtaining the recognition results, and providing the recognition results to the multi-modal duplex state. management system. The multi-modal duplex state management system is responsible for managing the state of the current conversation, performing decision-making processing of the duplex response state based on the recognition result and the state of the current conversation, and obtaining a decision result including a response strategy. The drive control system is responsible for performing virtual character driving, rendering and other processing based on the decision-making results of the multi-modal duplex state management system, generating the virtual character's video stream, and outputting the video stream. The basic dialogue system is responsible for realizing basic human-machine dialogue capabilities, that is, generating corresponding reply information based on questions input by the user.
具体地,感知系统作为交互系统的输入端,负责控制交互系统中视频流和语音流的输入,实现对输入的语音流和视频流进行切分和识别等功能。具体地,对于语音流,为了保证该交互系统即使在用户说话的时候,虚拟人物也可以进行一些及时的反馈,感知系统会基于一个较短的预设时长(如200ms)的静默时间(也即VAD时间)对语音流进行切分,在用户一次语音输入中,每当产生大于或等于预设时长的静默时间时,进行语音流的切分,生成上一时段内的一个语音单元,将语音单元输入到自动语音识别(Automatic Speech Recognition,简称ASR)模块,通过ASR模块将语音单元将其转换成文本信息,最终输入到多模态处理和对齐模块。对于视频流,感知系统根据本轮对话中用户的图像数据识别用户的手势信息,将手势信息输入到多模态处理和对齐模块,多模态处理和对齐模块结合用户的手势信息和语音单元的文本信息,确定手势信息对应的手势意图分类。Specifically, as the input end of the interactive system, the perception system is responsible for controlling the input of video streams and voice streams in the interactive system, and implementing functions such as segmenting and identifying the input voice streams and video streams. Specifically, for the voice stream, in order to ensure that the interactive system can provide some timely feedback to the virtual character even when the user is speaking, the perception system will be based on a short preset duration (such as 200ms) of silence time (i.e. VAD time) to segment the voice stream. In a user's voice input, whenever a silence time greater than or equal to the preset duration occurs, the voice stream is segmented to generate a voice unit in the previous period, and the voice The unit is input to the Automatic Speech Recognition (ASR) module, which converts the speech unit into text information through the ASR module, and is finally input to the multi-modal processing and alignment module. For video streaming, the perception system recognizes the user's gesture information based on the user's image data in this round of dialogue, and inputs the gesture information into the multi-modal processing and alignment module. The multi-modal processing and alignment module combines the user's gesture information and the speech unit's Text information, determine the gesture intention classification corresponding to the gesture information.
示例性地,需要向用户反馈的手势可以包含三大类:分别为有明确意义的手势动作(如OK、数字和左滑右滑等)、不安全手势(如竖中指和比小拇指等)、自定义特殊动作。For example, gestures that need to be fed back to the user can include three major categories: gestures with clear meanings (such as OK, numbers, left and right swipes, etc.), unsafe gestures (such as middle finger and little thumb), Customize special moves.
多模态双工状态管理系统基于实时识别的用户的手势意图分类,并结合当前的对话状态进行双工响应状态的决策,确定手势意图分类对应的双工响应状态,不同的双工响应状态对应不同的响应策略,从而确定手势意图分类对应的响应策略,得到决策结果。The multi-modal duplex state management system is based on the real-time recognition of the user's gesture intention classification, and combines the current dialogue status to make decisions on the duplex response status, determine the duplex response status corresponding to the gesture intention classification, and determine the duplex response status corresponding to different duplex response statuses. Different response strategies are used to determine the response strategy corresponding to the gesture intention classification and obtain the decision-making result.
示例性地,双工响应状态可以包括双工主动\被动打断、双工主动承接、调用基础对话系统和无反馈这4种状态,分别对应于虚拟人物主动或被动打断当前处理的打断策略、虚拟人物主动承接用户的承接策略、开启新一轮对话(也即调用基础对话系统)、和无反馈 这4类响应策略。For example, the duplex response state can include four states: duplex active/passive interruption, duplex active acceptance, calling the basic dialogue system, and no feedback, which respectively correspond to the virtual character actively or passively interrupting the current processing. strategy, virtual characters actively take over the user's strategy, start a new round of dialogue (that is, call the basic dialogue system), and no feedback These four types of response strategies.
其中,双工主动承接:当判断需要承接用户的对话或动作时,触发对应的承接策略。承接的方式至少包含如下两种:一种是仅“动作承接”,指的是虚拟人物不做口头的承接回复,仅做出承接动作响应用户,不影响其它的对话状态。另一种是“动作+文案承接”,也即虚拟人物不仅做出承接动作,而且播报承接文案来响应用户。Among them, duplex active acceptance: when it is judged that it is necessary to accept the user's dialogue or action, the corresponding acceptance strategy is triggered. There are at least two ways to take over: one is "action takeover only", which means that the virtual character does not make a verbal takeover reply, but only responds to the user by making takeover actions, without affecting other conversation states. The other is "action + copywriting", that is, the virtual character not only performs the following actions, but also broadcasts the copywriting to respond to the user.
双工主动\被动打断:在虚拟人物播报过程中,当判断用户有打断意图,如用户做出不安全手势、停止手势,输入具有停止意图的语音等,会立即主动打断当前对话。在打断策略下,虚拟人物会打断当前的说话状态,等待用户说话,或者主动询问对方打断的原因。如果用户输入具有确定语义的语音数据,则开启新一轮对话;如果一段时间后用户没有输入具有确定语义的语音数据,则继续当前对话,虚拟人物继续播报。Duplex active\passive interruption: During the virtual character broadcasting process, when it is judged that the user has the intention to interrupt, such as the user making unsafe gestures, stop gestures, inputting voice with the intention of stopping, etc., the current conversation will be actively interrupted immediately. Under the interruption strategy, the avatar will interrupt the current speaking state, wait for the user to speak, or actively ask the other party the reason for the interruption. If the user inputs voice data with definite semantics, a new round of dialogue will be started; if the user does not input voice data with definite semantics after a period of time, the current dialogue will be continued and the virtual character will continue to broadcast.
调用基础对话系统:当用户语音输入的静默时间(VAD时间)达到静默时长阈值(也即VAD阈值)时,用户语音输入接收,调用基础对话系统直接回复用户。这是虚拟人物与人的交互系统的基本功能,此处不再赘述。其中,静默时长阈值(也即VAD阈值)通常为800ms,可以根据实际应用场景进行配置和调整。Call the basic dialogue system: When the silent time (VAD time) of the user's voice input reaches the silence duration threshold (that is, the VAD threshold), the user's voice input is received and the basic dialogue system is called to directly reply to the user. This is the basic function of the interaction system between virtual characters and people, and will not be described in detail here. Among them, the silent duration threshold (that is, the VAD threshold) is usually 800ms, which can be configured and adjusted according to the actual application scenario.
无反馈:不作任何反馈,维持当前状态。No feedback: No feedback is given and the current status is maintained.
驱动控制系统具体包括以下三部分:1)流式TTS(Text To Speech,从文本到语音)部分,将决策结果中的文本输出合成音频流。2)驱动部分,包含两个子模块,面部驱动模块和动作驱动模块,其中,面部驱动模块根据决策结果中待输出的语音流驱动虚拟人物输出准确的口型,并生成面部驱动数据;动作驱动模块根据决策结果中待输出的动作标签,驱动虚拟人物做出准确的动作,并生成动作驱动数据,如动作混合形状(blendshape)驱动模型。3)渲染合成部分,将驱动、流式TTS等部分的输出进行渲染并合成虚拟人物的视频流。The drive control system specifically includes the following three parts: 1) The streaming TTS (Text To Speech) part, which synthesizes the text output in the decision results into an audio stream. 2) The driving part includes two sub-modules, the face driving module and the action driving module. Among them, the face driving module drives the virtual character to output accurate mouth shape according to the voice stream to be output in the decision result, and generates face driving data; the action driving module According to the action tag to be output in the decision result, the virtual character is driven to make accurate actions, and action-driven data is generated, such as an action blend shape (blendshape) driven model. 3) The rendering and synthesis part renders the output of the driver, streaming TTS and other parts and synthesizes the video stream of the virtual character.
基础对话系统:包含基本的业务逻辑,具备基本的对话交互能力,也即输入用户的问题,基础对话系统输出该问题的答案。具体地,基础对话系统通常包括:NLU(Natural Language Understanding,自然语言理解)模块、DM(Dialog Management,对话管理)模块和NLG(Natural Language Generation,自然语言生成)模块。其中,业务逻辑是基于用户输入的问题查询获取到答复信息中所需的数据内容的查询逻辑。例如,用户问题是“我的身高是160cm,我应该传什么尺码”,答案信息为“您应该穿M码”,答案信息中的“M码”是基于身高160cm查询业务逻辑得到的。Basic dialogue system: Contains basic business logic and has basic dialogue interaction capabilities, that is, input the user's question, and the basic dialogue system outputs the answer to the question. Specifically, basic dialogue systems usually include: NLU (Natural Language Understanding) module, DM (Dialog Management) module and NLG (Natural Language Generation) module. Among them, the business logic is the query logic that obtains the data content required in the reply information based on the question query entered by the user. For example, the user question is "My height is 160cm, what size should I send?" and the answer information is "You should wear M size". The "M size" in the answer information is obtained by querying the business logic based on the height of 160cm.
其中,NLU模块用于对文本信息进行识别理解,转换成计算机可理解的结构化语义表示或者意图标签。DM模块用于维护和更新当前的对话状态,并决策下一步系统动作。NLG模块用于将系统输出的状态转换成可理解的自然语言文本。Among them, the NLU module is used to identify and understand text information and convert it into a computer-understandable structured semantic representation or intent label. The DM module is used to maintain and update the current dialogue status and decide on the next system action. The NLG module is used to convert the status output by the system into understandable natural language text.
基于上述的感知系统、多模态双工状态管理系统、驱动控制系统和基础对话系统,通过加入视频流和对应视觉理解模块,让用户可以通过手势与虚拟人物进行交互。在本方案中,可以实时感知有明确意义的动作(如点赞和左滑、右滑等)、不安全手势(如竖中指 和比小拇指等)和自定义特殊动作三大类动作。此外,通过增加感知系统、多模态双工状态管理系统和驱动控制系统,让对话变成基于用户手势可随时承接或打断当前对话的对话形式。当前的双工响应状态包含双工主动承接、双工主动\被动打断、调用基础对话系统和无反馈4种状态,分别对应虚拟人物主动或被动打断当前处理的打断策略、虚拟人物主动承接用户的承接策略、开启新一轮对话(也即调用基础对话系统)、和无反馈这4类响应策略,通过在这4类响应策略之间决策,能够实现对用户手势的理解,实现基于用户手势进行承接、打断和基本问答的能力,使得虚拟人物具备多模态(语音和手势)交互能力,提高了虚拟人物拟人化程度,使得虚拟人物与人的沟通过程更顺畅、更智能。Based on the above-mentioned perception system, multi-modal duplex state management system, drive control system and basic dialogue system, by adding video streams and corresponding visual understanding modules, users can interact with virtual characters through gestures. In this solution, actions with clear meaning (such as likes, left swipes, right swipes, etc.), unsafe gestures (such as middle finger gestures) can be sensed in real time. (such as the little finger, etc.) and customized special actions. In addition, by adding a perception system, a multi-modal duplex state management system and a drive control system, the dialogue becomes a dialogue form that can take over or interrupt the current dialogue at any time based on user gestures. The current duplex response state includes four states: duplex active acceptance, duplex active\passive interruption, calling the basic dialogue system, and no feedback, which respectively correspond to the interruption strategy of the virtual character actively or passively interrupting the current processing, and the virtual character actively interrupting. Taking over the user's acceptance strategy, starting a new round of dialogue (that is, calling the basic dialogue system), and no feedback are four types of response strategies. By deciding between these four types of response strategies, the understanding of the user's gestures can be achieved and the implementation based on The ability of user gestures to undertake, interrupt and basic question and answer enables virtual characters to have multi-modal (voice and gesture) interaction capabilities, improves the degree of anthropomorphism of virtual characters, and makes the communication process between virtual characters and people smoother and more intelligent.
下面以具体地实施例对本申请的技术方案以及本申请的技术方案如何解决上述技术问题进行详细说明。下面这几个具体的实施例可以相互结合,对于相同或相似的概念或过程可能在某些实施例中不再赘述。下面将结合附图,对本申请的实施例进行描述。The technical solution of the present application and how the technical solution of the present application solves the above technical problems will be described in detail below with specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of the present application will be described below with reference to the accompanying drawings.
图2为本申请一实施例提供的基于多模态数据的虚拟人物驱动方法流程图。本实施例提供的基于多模态数据的虚拟人物驱动方法具体可以应用于具有使用虚拟人物实现与人类交互功能的电子设备,该电子设备可以是对话机器人、终端或服务器等,在其他实施例中,电子设备还可以采用其他设备实现,本实施例此处不做具体限定。Figure 2 is a flow chart of a virtual character driving method based on multi-modal data provided by an embodiment of the present application. The virtual character driving method based on multi-modal data provided in this embodiment can be specifically applied to electronic devices that have the function of using virtual characters to interact with humans. The electronic device can be a conversation robot, a terminal or a server, etc. In other embodiments , the electronic device can also be implemented using other devices, and this embodiment is not specifically limited here.
如图2所示,该方法具体步骤如下:As shown in Figure 2, the specific steps of this method are as follows:
步骤S201、获取虚拟人物的三维形象渲染模型以利用虚拟人物提供对用户的交互服务。Step S201: Obtain a three-dimensional image rendering model of the virtual character to use the virtual character to provide interactive services to the user.
其中,虚拟人物的三维形象渲染模型包括实现虚拟人物渲染所需的渲染数据,基于虚拟人物的三维形象渲染模型可以将虚拟人物的骨骼数据渲染成呈现给用户时展示的虚拟人物的三维形象。Among them, the three-dimensional image rendering model of the virtual character includes the rendering data required to realize the rendering of the virtual character. The three-dimensional image rendering model based on the virtual character can render the skeletal data of the virtual character into the three-dimensional image of the virtual character displayed to the user.
本实施例提供的方法,可以应用于虚拟人物与人交互的场景中,利用具有三维形象的虚拟人物,实现机器与人的实时交互功能,以向人提供智能服务。The method provided in this embodiment can be applied in scenarios where virtual characters interact with people, using virtual characters with three-dimensional images to realize real-time interaction functions between machines and people, so as to provide intelligent services to people.
步骤S202、在虚拟人物与用户的一轮对话过程中,实时获取用户输入的语音数据和用户的图像数据。Step S202: During a round of dialogue between the virtual character and the user, the voice data input by the user and the user's image data are obtained in real time.
本实施例中,在虚拟人物与用户的一轮对话过程中,实时获取输入的语音流,得到用户输入的语音数据;还可以实时地监测来自用户的视频流,按照预设频率采样视频帧,得到用户的图像数据。In this embodiment, during a round of dialogue between the virtual character and the user, the input voice stream is obtained in real time and the voice data input by the user is obtained; the video stream from the user can also be monitored in real time, and the video frames are sampled according to the preset frequency. Get the user's image data.
其中,用户的图像数据中包含用户出现视频帧内的图像数据,包括用户的人脸图像以及出现在视频帧内的手臂及部分躯体的图像。Among them, the user's image data includes image data in the video frame in which the user appears, including the user's face image and images of the arms and part of the body appearing in the video frame.
示例性地,可以由上述图1所示的交互系统框架中的手势感知系统来实时获取输入语音流和视频流。For example, the input voice stream and video stream can be acquired in real time by the gesture sensing system in the interactive system framework shown in Figure 1 above.
步骤S203、当检测到用户输入的语音数据的静默时长大于或等于预设时长时,若确定语音输入未结束,则将上一时段内用户输入的语音数据转换为对应的文本信息,上一时段为自上一次静默时长大于或等于预设时长的时刻至当前时刻。Step S203: When it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, the voice data input by the user in the previous period is converted into corresponding text information. It is the time since the last silence duration was greater than or equal to the preset duration to the current moment.
对于获取到的用户的语音数据实时进行语音活动检测,确定语音数据的静默时长。当 用户输入的语音数据的静默时长大于或等于预设时长时,若语音数据的静默时长小于静默时长阈值,说明用户输入过程中产生较长时间的静默,但是本次的语音输入尚未结束。这种情况下基于上一时段内的语音数据和用户的图像数据进行一次双工响应处理,使得虚拟人物针对上一时段内用户做出的手势及时地做出响应行为,以引导后续的对话流程,使得虚拟人物与用户的交互更加流畅、更加智能。Perform real-time voice activity detection on the acquired user's voice data to determine the length of silence of the voice data. when When the silence duration of the voice data input by the user is greater than or equal to the preset duration, if the silence duration of the voice data is less than the silence duration threshold, it means that a long period of silence occurs during the user input process, but this voice input has not yet ended. In this case, a duplex response process is performed based on the voice data and the user's image data in the previous period, so that the virtual character can respond in time to the gestures made by the user in the previous period to guide the subsequent dialogue process. , making the interaction between virtual characters and users smoother and more intelligent.
其中,预设时长为一个小于静默时长阈值的较短时长,静默时长阈值为判断用户本轮输入是否结束的静默时长,当用户语音输入的静默时长达到静默时长阈值,则确定用户本轮语音输入结束。例如静默时长阈值可以为800ms,预设时长可以为200ms。预设时长可以根据实际应用场景的需要进行设置和调整,此处不做具体限定。Among them, the preset duration is a shorter duration that is less than the silence duration threshold. The silence duration threshold is the silence duration used to determine whether the user's current round of input has ended. When the silence duration of the user's voice input reaches the silence duration threshold, the user's current round of voice input is determined. Finish. For example, the silent duration threshold can be 800ms, and the preset duration can be 200ms. The preset duration can be set and adjusted according to the needs of the actual application scenario, and is not specifically limited here.
示例性地,当检测到用户输入的语音数据的静默时长大于或等于预设时长时,若确定语音输入未结束,则将上一时段内用户输入的语音数据输入ASR模块,通过ASR模块将语音数据转换为对应的文本信息。For example, when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, the voice data input by the user in the previous period is input into the ASR module, and the voice data is transferred to the ASR module through the ASR module. The data is converted into corresponding text information.
示例性地,感知系统将语音流按照一个预设时长(如200ms)的静默时间(也即VAD时间)对语音流进行切分,分成一个一个的小的语音单元,一个语音单元为相邻对应两次静默时长达到预设时长的时刻之间的语音数据,将每一个语音单元输入到自动语音识别ASR模块,通过ASR模块将语音单元将其转换成文本信息。For example, the perception system divides the speech stream into small speech units one by one according to a silence time (that is, VAD time) of a preset length (such as 200ms), and one speech unit corresponds to the adjacent one. For the voice data between two moments when the silence duration reaches the preset duration, each voice unit is input to the automatic speech recognition ASR module, and the voice unit is converted into text information through the ASR module.
Step S204: Identify the user's gesture information from the user's image data in the previous period, and determine the gesture intention category corresponding to the user's gesture information based on the gesture information and the text information.
In this embodiment, the user's image data in the previous period is acquired and gesture recognition is performed on it to identify the user's gesture information.
In practical applications, a gesture may carry different meanings in different scenarios; that is, the same gesture can express different user intentions depending on the scenario. In this step, the user's gesture information from the previous period is combined with the text information of the user's voice input for multimodal classification, and the gesture intention category corresponding to the user's gesture information in the previous period is determined, so that the meaning of the user's gesture is identified accurately.
Illustratively, the gesture intention categories that require a duplex response can be configured in advance in this embodiment, together with a response strategy for each gesture intention category. Configuring response strategies per gesture intention category, rather than per raw gesture, allows the virtual character to respond more precisely to the user's actual intention and increases its degree of anthropomorphism.
Step S205: Determine the corresponding driving data according to the gesture intention category corresponding to the user's gesture information and the current dialogue state.
The current dialogue state is one of the following two states: the state in which the user provides input and the virtual character receives it, and the state in which the virtual character produces output and the user receives it.
In this embodiment, the response strategies may include the following four categories: interruption strategies in which the virtual character actively or passively interrupts its current processing, takeover strategies in which the virtual character actively takes over from the user, starting a new round of dialogue, and no feedback.
Each category includes one or more response strategies, and each response strategy includes a corresponding gesture intention category, a response time, and a response mode. The specific content of each response strategy can be configured according to the needs of the actual application scenario and is not specifically limited here.
In the dialogue state where the user provides input and the virtual character receives it, the response strategy adopted may be a takeover strategy, starting a new round of dialogue, or no feedback. Since actual application scenarios usually have no need for the virtual character to interrupt the user's input, an interruption strategy is generally not used in this state.
In the dialogue state where the virtual character produces output and the user receives it, the virtual character does not need to take over from the user, while the user may interrupt the virtual character's current output, that is, interrupt the current dialogue state, so that the user can obtain the needed information more quickly. Therefore, in this state the response strategy adopted may be an interruption strategy, starting a new round of dialogue, or no feedback, while a takeover strategy is generally not used.
After the gesture intention category of the user's gesture information in the previous period has been identified in real time, this step determines the current response strategy from that gesture intention category combined with the current dialogue state, and generates the driving data of the virtual character according to the current response strategy. The driving data includes all the driving parameters required to drive the virtual character to execute the response strategy corresponding to the gesture intention category, realizing both facial driving and action driving of the virtual character.
Illustratively, if the response strategy corresponding to the gesture intention category includes the virtual character making a prescribed expression, the driving data includes expression driving parameters; if it includes the virtual character making a prescribed gesture action, the driving data includes action driving parameters; if it includes the virtual character broadcasting prescribed speech, the driving data includes voice driving parameters; and if it combines several response modes among expressions, speech, and actions, the driving data includes the corresponding multiple driving parameters, which can drive the virtual character to perform the response behavior corresponding to the response strategy.
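As a minimal sketch of how such driving data might be organized, the structure below simply groups the expression, action, and voice driving parameters named above; the field names and the build_driving_data() helper are assumptions made for illustration only.

```python
# Illustrative grouping of the driving parameters a response strategy can imply.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DrivingData:
    expression_params: Optional[dict] = None  # facial expression driving
    action_params: Optional[dict] = None      # gesture/body action driving
    voice_params: Optional[dict] = None       # phrase or speech to broadcast

def build_driving_data(strategy: dict) -> DrivingData:
    """Assemble driving parameters from a configured response strategy."""
    return DrivingData(
        expression_params=strategy.get("expression"),
        action_params=strategy.get("action"),
        voice_params=strategy.get("speech"),
    )
```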
Step S206: Drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
After the corresponding driving data has been determined from the gesture intention category of the user's gesture information, the skeleton model of the virtual character is driven according to the driving data to obtain the skeleton data corresponding to the response behavior, and the skeleton data is rendered with the three-dimensional detailed rendering model of the virtual character to obtain the virtual character image data corresponding to the response behavior. By rendering the virtual character image data into the output video stream, the virtual character in the output video stream performs the corresponding response behavior, realizing a multimodal duplex interaction capability in which the virtual character responds to the user's gestures in a timely manner.
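A rough sketch of this drive-and-render step is shown below; the skeleton_model and render_model interfaces and the push_frame() method are hypothetical stand-ins for whatever animation and rendering stack is actually used.

```python
# Sketch of the drive-and-render pipeline: driving data -> skeleton data ->
# rendered character images pushed into the output video stream.
def apply_response(driving_data, skeleton_model, render_model, output_stream):
    # 1. Drive the skeleton: driving parameters -> per-frame skeleton data.
    skeleton_frames = skeleton_model.drive(driving_data)
    # 2. Render each skeleton frame with the 3D character rendering model.
    for skeleton in skeleton_frames:
        image = render_model.render(skeleton)
        # 3. Write the rendered character image into the output video stream.
        output_stream.push_frame(image)
```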
In this embodiment, during a round of dialogue between the virtual character and the user, the voice data input by the user and the user's image data are acquired in real time. When it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the voice data input by the user in the previous period is converted into corresponding text information, the user's gesture information is identified from the user's image data in the previous period, and the gesture intention category corresponding to the user's gesture information is determined from the gesture information and the text information. In this way the gesture intention of the user's gesture can be recognized in real time, and, based on that intention and the current dialogue state, the virtual character is driven to perform the corresponding response behavior, so that the virtual character in the output video stream responds accordingly. This increases the real-time recognition capability for user gestures and drives the virtual character to respond promptly to the user's gesture intention, improving the degree of anthropomorphism of the virtual character and making the interaction between the virtual character and the user smoother and more intelligent.
In an optional embodiment, the above step S204 may use a multimodal classification model to perform multimodal alignment and classification on the user's gesture information and the text information of the user's voice input, determining the user's gesture intention category so that the intention of the user's gesture is identified accurately.
Specifically, the text information and the user's image data from the previous period are input into a trained multimodal classification model. The model identifies the user's gesture information from the image data of the previous period, extracts the semantic features of the text information, and performs multimodal classification based on the gesture information and the semantic features of the text, thereby determining the gesture intention category corresponding to the user's gesture information.
In practical applications, a gesture may carry different meanings in different scenarios; that is, the same gesture expresses different user intentions in different scenarios. For example, a "swipe up" motion may express the gesture intention "turn the page up" in one scenario and the gesture intention "hello" in another.
In this embodiment, the multimodal classification model accurately identifies the gesture intention corresponding to the user's gesture information by fusing the semantic features of the text information of the user's voice input with the user's gesture information.
The multimodal classification model may be implemented with any existing multimodal image classification model, or another multimodal alignment model may be used to correct the image classification result based on the text information.
Illustratively, the user's gesture information may be identified from the user's image data in the previous period by a temporal convolutional neural network that performs feature extraction and gesture classification on the user image data in multiple video frames, recognizing the user's gesture in real time. Alternatively, this identification may be implemented with any existing gesture recognition algorithm capable of recognizing the gesture made by the user from the user's image data, which is not described again here.
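The following is a minimal PyTorch sketch of this idea, fusing a temporal convolution over per-frame gesture features with a semantic embedding of the recognized text; the dimensions, layer choices, and number of intent categories are illustrative assumptions, not the model actually used.

```python
# Sketch of a multimodal gesture-intent classifier: temporal CNN over gesture
# features fused with a text feature vector, followed by a linear classifier.
import torch
import torch.nn as nn

class GestureIntentClassifier(nn.Module):
    def __init__(self, gesture_dim=64, text_dim=128, num_intents=10):
        super().__init__()
        # Temporal convolution over a sequence of per-frame gesture features.
        self.gesture_encoder = nn.Sequential(
            nn.Conv1d(gesture_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, 128), nn.ReLU())
        self.classifier = nn.Linear(128 + 128, num_intents)

    def forward(self, gesture_seq, text_feat):
        # gesture_seq: (batch, frames, gesture_dim); text_feat: (batch, text_dim)
        g = self.gesture_encoder(gesture_seq.transpose(1, 2)).squeeze(-1)
        t = self.text_encoder(text_feat)
        return self.classifier(torch.cat([g, t], dim=-1))  # intent logits
```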
Illustratively, the sensing system can recognize the gestures shown in Table 1 below:
Table 1
In an optional embodiment, the interaction system may provide a front-end configuration page through which response strategies can be configured, so that one or more response strategies can be flexibly set up for the needs of different application scenarios.
Specifically, in response to a response strategy configuration operation, at least one of the following categories of response strategies is configured:
an interruption strategy, a takeover strategy, starting a new round of dialogue, and no feedback.
Specifically, the first category is the interruption strategy: in the dialogue state where the virtual character produces output and the user receives it, a strategy that interrupts the virtual character's current processing, including one or more strategies in which the virtual character actively interrupts its current processing and one or more strategies in which it is passively interrupted.
In each interruption strategy, the virtual character can be configured to perform at least one of the following interruption response behaviors: broadcasting an interruption phrase and making an interruption action, where the interruption action includes at least one of a hand action and a facial action.
Illustratively, during the virtual character's broadcast, when the user has given no voice instruction to interrupt but the virtual character judges from the user's gestures that the user intends to interrupt, for example the user makes an unsafe gesture or a stop gesture, or inputs speech with a stopping intention, the virtual character immediately and actively interrupts the current dialogue and triggers the corresponding interruption strategy; different gestures may map to different interruption strategies and different interruption response behaviors.
Under each interruption strategy, the virtual character interrupts its current speaking state, waits for the user to speak, and performs the corresponding response behavior according to the specific response mode of that strategy. If the user inputs voice data with definite semantics, a new round of dialogue is started; if the user does not input voice data with definite semantics within a period of time, the current dialogue continues and the virtual character resumes its broadcast.
The second category is the takeover strategy: in the dialogue state where the user provides input and the virtual character receives it, the virtual character actively performs takeover response behaviors in reaction to the user's gestures to assist the dialogue without affecting the user's input. A takeover strategy can configure the virtual character to perform at least one takeover response behavior of making a takeover action and broadcasting a takeover phrase, where the takeover action includes at least one of a hand action and a facial action.
Specifically, in the dialogue state where the user provides input and the virtual character receives it, when it is judged that the user's dialogue or action needs to be taken over, the corresponding takeover strategy is triggered. There are at least the following two takeover modes. One is action-only takeover, in which the virtual character gives no verbal reply and responds to the user only with a takeover action, without affecting the rest of the dialogue state. The other is action-plus-phrase takeover, in which the virtual character both makes a takeover action and broadcasts a takeover phrase in response to the user. In some scenarios, a takeover strategy may also be configured to broadcast a takeover phrase only. Takeover actions include facial expressions, gesture actions, and so on.
In this embodiment, multiple takeover strategies can be configured, and different takeover strategies can have different takeover response behaviors.
The third category is starting a new round of dialogue, that is, invoking the basic dialogue system: when the silence time (VAD time) of the user's voice input reaches the silence duration threshold (the VAD threshold), the user's voice input ends and the basic dialogue system is invoked to reply to the user directly. This is the basic function of the interaction system between the virtual character and the user and is not described again here. The silence duration threshold (VAD threshold) is usually 800 ms and can be configured and adjusted according to the actual application scenario.
The fourth category is the no-feedback strategy: no feedback is given and the current state is maintained.
Each category includes one or more response strategies, and each response strategy includes a corresponding gesture intention category, a response time, and a response mode.
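As one hedged illustration of what such a configuration might look like, the list below encodes a few strategies with the fields named above (gesture intention category, response time, response mode); the category names, intent labels, and timing values are assumptions, not the patent's actual configuration.

```python
# Illustrative response-strategy configuration covering the four categories.
RESPONSE_STRATEGIES = [
    {   # interruption strategy: only used while the character is speaking
        "category": "interrupt",
        "gesture_intent": "stop",
        "response_time": "immediate",
        "response": {"action": "puzzled_look", "speech": "Do you have a question?"},
    },
    {   # takeover strategy: only used while the user is speaking
        "category": "takeover",
        "gesture_intent": "thumbs_up",
        "response_time": "after_user_input",
        "response": {"expression": "smile", "speech": "I'm glad you like it."},
    },
    {"category": "new_dialogue", "gesture_intent": None, "response_time": "vad_end", "response": {}},
    {"category": "no_feedback", "gesture_intent": "other", "response_time": None, "response": {}},
]
```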
In this embodiment, the four categories of response strategies, namely the interruption strategies in which the virtual character actively or passively interrupts its current processing, the takeover strategies in which the virtual character actively takes over from the user, starting a new round of dialogue, and no feedback, can be flexibly configured through the front-end page. By deciding among these four categories, a duplex interaction capability of taking over and interrupting in a timely manner based on the user's gestures can be achieved, giving the virtual character multimodal (voice and gesture) interaction capabilities, improving its degree of anthropomorphism, and making the communication between the virtual character and the user smoother and more intelligent.
In an optional embodiment, when step S205 is performed, if the current dialogue state is the state in which the user provides input and the virtual character receives it, a first target strategy corresponding to the gesture intention category is determined according to the gesture intention category corresponding to the user's gesture information, where the first target strategy is one of a takeover strategy, starting a new round of dialogue, or no feedback; the corresponding driving data is then determined according to the first target strategy, and this driving data is used to drive the virtual character to perform the response behavior corresponding to the first target strategy.
The types and specific content of the takeover response behaviors included in the takeover strategies corresponding to different gesture intention categories may differ.
Since actual application scenarios usually have no need for the virtual character to interrupt the user's input, an interruption strategy is generally not used to respond in the dialogue state where the user provides input and the virtual character receives it. In this embodiment, the response strategy adopted in that state is a takeover strategy, starting a new round of dialogue, or no feedback rather than an interruption strategy, which avoids disturbing the user's normal input while still allowing timely takeover processing based on the gesture intention of the user's gesture. This improves the degree of anthropomorphism of the virtual character, increases the user's willingness to keep interacting, and improves the smoothness and intelligence of the interaction between the virtual character and the user.
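A minimal sketch of this decision rule is given below, reusing the kind of strategy entries shown earlier; the state names and lookup logic are assumptions for illustration.

```python
# Sketch of selecting a response strategy from the gesture intent and the
# current dialogue state, following the rules described above.
USER_SPEAKING, CHARACTER_SPEAKING = "user_input", "character_output"

def select_strategy(intent: str, dialogue_state: str, strategies: list) -> dict:
    """Pick the first configured strategy matching the intent and state."""
    # Takeover is only allowed while the user is speaking; interruption is
    # only allowed while the character is speaking.
    allowed = ({"takeover", "new_dialogue", "no_feedback"}
               if dialogue_state == USER_SPEAKING
               else {"interrupt", "new_dialogue", "no_feedback"})
    for strategy in strategies:
        if strategy["category"] in allowed and strategy.get("gesture_intent") == intent:
            return strategy
    return {"category": "no_feedback", "response": {}}
```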
Further, if the first target strategy is a takeover strategy, first driving data is determined according to the takeover strategy, where the first driving data is used to drive the virtual character to perform at least one takeover response behavior of making a takeover action and broadcasting a takeover phrase, and the takeover action includes at least one of a hand action and a facial action.
In this embodiment, a takeover strategy may include at least one of the following takeover response behaviors: making a takeover action and broadcasting a takeover phrase. When it is judged that the user's dialogue or action needs to be taken over, the corresponding takeover strategy is triggered. There are at least the following two takeover modes: action-only takeover, in which the virtual character gives no verbal reply and responds to the user only with a takeover action, without affecting the rest of the dialogue state; and action-plus-phrase takeover, in which the virtual character both makes a takeover action and broadcasts a takeover phrase in response to the user.
Optionally, the execution timing of each takeover response behavior can also be configured in the takeover strategy. Illustratively, the execution timing of any takeover response behavior may be one of the following: executed immediately, executed after a specified period of time, or executed after the user's input ends. Different takeover response behaviors in the same takeover strategy can be configured with different execution timings.
For example, during the user's input, if the gesture intention of the user's gesture is "thumbs up", the corresponding takeover strategy may be: immediately make a "smile" expression and a gesture expressing happiness, and, after the user's input ends, broadcast the takeover phrase "I'm very glad to receive your compliment."
For example, during the user's input, if the gesture intention of the user's gesture is "thumbs up", the corresponding takeover strategy may instead be: immediately make a "smile" expression and a gesture expressing happiness, and immediately broadcast the takeover phrase "Thank you."
It should be noted that a takeover phrase broadcast immediately is usually kept short, for example "mm-hmm", "yes", "right", "hmm", or "oh", so that broadcasting it does not affect the user's normal voice input.
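The sketch below shows how the differently timed takeover behaviors of the "thumbs up" examples might be scheduled; the scheduler interface, the event object, and the timing labels are illustrative assumptions.

```python
# Sketch of scheduling takeover behaviors at their configured timings.
def schedule_takeover(strategy, avatar, user_input_finished):
    """Run each configured takeover behavior at its configured timing."""
    for behavior in strategy["behaviors"]:
        if behavior["timing"] == "immediate":
            avatar.perform(behavior["response"])              # e.g. smile now
        elif behavior["timing"] == "after_user_input":
            user_input_finished.add_listener(                 # defer until input ends
                lambda b=behavior: avatar.perform(b["response"]))

THUMBS_UP_TAKEOVER = {
    "gesture_intent": "thumbs_up",
    "behaviors": [
        {"timing": "immediate",
         "response": {"expression": "smile", "action": "happy_gesture"}},
        {"timing": "after_user_input",
         "response": {"speech": "I'm very glad to receive your compliment."}},
    ],
}
```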
In this embodiment, by making takeover actions and broadcasting takeover phrases in a timely manner based on the gesture intention of the user's gesture, the degree of anthropomorphism of the virtual character is improved, the user's willingness to keep interacting is increased, and the smoothness and intelligence of the interaction between the virtual character and the user are improved.
In an optional embodiment, when step S205 is performed, if the current dialogue state is the state in which the virtual character produces output and the user receives it, a second target strategy corresponding to the gesture intention category is determined according to the gesture intention category corresponding to the user's gesture information, where the second target strategy is one of an interruption strategy, starting a new round of dialogue, or no feedback; second driving data is then determined according to the second target strategy, and this second driving data is used to drive the virtual character to perform the response behavior corresponding to the second target strategy.
In practical applications, during the interaction between the virtual character and the user, in the dialogue state where the virtual character produces output and the user receives it, the virtual character does not need to take over from the user, and the user can interrupt the virtual character's current output, that is, interrupt the current dialogue state, so that the user can obtain the needed information more quickly. In this embodiment, the response strategy adopted in that state may be an interruption strategy, starting a new round of dialogue, or no feedback, while a takeover strategy is generally not adopted.
When an interruption strategy is executed, the virtual character's current processing is interrupted and the virtual character is driven to perform the interruption response behavior corresponding to that strategy. The interruption response behavior may include at least one of making an interruption action and broadcasting an interruption phrase, where the interruption action includes at least one of a hand action and a facial action.
Optionally, the execution timing of each interruption response behavior can also be configured in the interruption strategy. Illustratively, the execution timing of any interruption response behavior may be one of the following: executed immediately, executed after a specified period of time, or executed after the user's input ends. Different interruption response behaviors in the same interruption strategy can be configured with different execution timings.
For example, during the virtual character's broadcast, if the gesture intention of the user's gesture is "stop", the corresponding interruption strategy may be: the virtual character immediately interrupts the current broadcast, immediately makes an expression and gesture of puzzlement, and immediately broadcasts the interruption phrase "Do you have any questions?"
In this embodiment, under an interruption strategy, during the virtual character's broadcast, when it is judged that the user intends to interrupt, for example the user makes an unsafe gesture or a stop gesture, or inputs speech with a stopping intention, the virtual character interrupts its current speaking state, waits for the user to speak, or actively asks the user why it was interrupted, makes certain actions, and so on. This provides duplex capability: the virtual character can actively or passively interrupt its own current broadcast and perform a post-interruption response behavior in reaction to the user's gesture, guiding the subsequent dialogue flow and making the interaction between the virtual character and the user smoother and more intelligent.
Optionally, after the virtual character's current processing has been interrupted, if the user's voice input is received within a first duration and the semantic information of that voice input is recognized, the next round of dialogue is started and dialogue processing is performed according to the semantic information of the user's voice input.
The first duration is generally set to a relatively short duration so that the user does not perceive a long pause. It can be set and adjusted according to the needs of the actual application scenario, for example several hundred milliseconds, one second, or even several seconds, and is not specifically limited here.
After the virtual character's current processing has been interrupted, if the user's voice input is not received within the first duration, or the semantic information of the user's voice input cannot be recognized, the interrupted output of the virtual character is resumed.
Optionally, in that case the interrupted output of the virtual character may be resumed only after pausing for a second duration, so as to leave the user enough time for voice input.
The second duration may be several hundred milliseconds, one second, or even several seconds; it can be set and adjusted according to the needs of the actual application scenario and is not specifically limited here.
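A minimal sketch of this resume-after-interruption logic follows; the durations and the listen(), has_semantics(), start_new_dialogue(), and resume() helpers are illustrative assumptions.

```python
# Sketch: after an interruption, either open a new dialogue round or resume
# the interrupted broadcast after an extra pause.
import time

FIRST_DURATION_S = 1.0   # how long to wait for meaningful user speech
SECOND_DURATION_S = 1.0  # extra pause before resuming the broadcast

def handle_interruption(listen, has_semantics, start_new_dialogue, resume):
    """Decide what to do once the virtual character has been interrupted."""
    utterance = listen(timeout=FIRST_DURATION_S)
    if utterance is not None and has_semantics(utterance):
        start_new_dialogue(utterance)   # user said something meaningful
        return
    time.sleep(SECOND_DURATION_S)       # give the user a little more time
    resume()                            # continue the interrupted broadcast
```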
In this embodiment, after the virtual character has been driven to perform the interruption response behavior, a new round of dialogue is started if voice input with semantic information is received from the user within the first duration; if no such input is received, the virtual character's previous broadcast is resumed after a pause of a certain duration. In this way the interruption response behavior does not disturb the normal interaction between the virtual character and the user, improving the smoothness and intelligence of the interaction.
Figure 3 is a flowchart of a method for driving a virtual character to take over from the user, provided by an embodiment of this application. In this embodiment, the user's gestures can be recognized in real time, and the virtual character is driven to perform takeover response processing based on those gestures. As shown in Figure 3, the method includes the following steps.
Step S301: Obtain a three-dimensional image rendering model of the virtual character, so as to provide interactive services to the user through the virtual character.
The three-dimensional image rendering model of the virtual character includes the rendering data required to render the virtual character; based on this model, the skeleton data of the virtual character can be rendered into the three-dimensional image of the virtual character presented to the user.
The method provided in this embodiment can be applied in scenarios where a virtual character interacts with a person, using a virtual character with a three-dimensional image to realize real-time machine-person interaction and provide intelligent services.
Step S302: In a round of dialogue between the virtual character and the user, while the user is providing voice input, acquire the voice data input by the user and the user's image data in real time.
In this embodiment, during a round of dialogue between the virtual character and the user, the input voice stream is acquired in real time to obtain the voice data input by the user; the video stream from the user can also be monitored in real time and video frames sampled at a preset frequency to obtain the user's image data.
The user's image data consists of the image data of the video frames in which the user appears, including the user's face image and the images of the arms and parts of the body visible in those frames.
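As a small illustration of sampling user image data at a preset frequency, the sketch below reads frames from a video stream and keeps roughly SAMPLE_FPS frames per second; OpenCV is used here purely as an assumed capture backend, and the sampling rate is an assumption.

```python
# Sketch of sampling video frames from the user's stream at a preset frequency.
import cv2

SAMPLE_FPS = 5  # assumed preset sampling frequency (frames per second)

def sample_user_frames(source=0):
    """Yield frames from the user's video stream at roughly SAMPLE_FPS."""
    capture = cv2.VideoCapture(source)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or 30
    step = max(int(native_fps // SAMPLE_FPS), 1)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            yield frame  # one sampled frame of user image data
        index += 1
    capture.release()
```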
Step S303: When it is detected that the silence duration of the voice input is greater than or equal to the preset duration, if it is determined that the voice input has not ended, identify the user's gesture information from the user's image data in the previous period, where the previous period runs from the last moment at which the silence duration reached the preset duration to the current moment.
Voice activity detection is performed in real time on the acquired voice data to determine its silence duration. When the silence duration of the voice data input by the user is greater than or equal to the preset duration but still less than the silence duration threshold, a relatively long pause has occurred during the user's input, yet this round of voice input has not ended. In this case, one duplex response pass is performed based on the user's image data from the previous period, so that the virtual character responds in time to the user's gestures, guiding the subsequent dialogue flow and making the interaction between the virtual character and the user smoother and more intelligent.
The preset duration is a relatively short duration smaller than the silence duration threshold, and the silence duration threshold is the silence duration used to decide whether the user's current round of input has ended: when the silence duration of the user's voice input reaches the silence duration threshold, it is determined that the current round of voice input is over. For example, the silence duration threshold may be 800 ms and the preset duration may be 200 ms. The preset duration may be set and adjusted according to the needs of the actual application scenario and is not specifically limited here.
In this step, when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the user's gesture information is identified from the user's image data in the previous period, so that the gesture currently made by the user is recognized in real time.
Illustratively, the user's gesture information may be identified from the user's image data in the previous period by a temporal convolutional neural network that performs feature extraction and gesture classification on the user image data in multiple video frames, recognizing the user's gesture in real time.
Alternatively, this identification may be implemented with any existing gesture recognition algorithm capable of recognizing the gesture made by the user from the user's image data, which is not described again here.
Illustratively, the sensing system can recognize the gestures shown in Table 1 above.
Step S304: According to the user's gesture information, if it is determined that the user has made a gesture that needs to be taken over, determine the driving data of the virtual character.
After the user's gesture information has been identified, it is determined whether the user's current gesture is a gesture that needs to be taken over. If it is, corresponding driving data is generated according to the takeover response strategy corresponding to the user's current gesture. The driving data includes all the driving parameters required to drive the virtual character to execute that takeover response strategy, realizing both facial driving and action driving of the virtual character.
Illustratively, if the takeover response strategy corresponding to the user's current gesture includes the virtual character making a prescribed expression, the driving data includes expression driving parameters; if it includes the virtual character making a prescribed action, the driving data includes action driving parameters; if it includes the virtual character broadcasting prescribed speech, the driving data includes voice driving parameters; and if it combines several response modes among expressions, speech, and actions, the driving data includes the corresponding multiple driving parameters, which can drive the virtual character to perform the response behavior corresponding to the response strategy.
In this embodiment, the gestures that require a takeover response (that is, the gestures to be taken over) and the corresponding takeover response strategies can be configured in advance.
When it is judged that the user's dialogue or action needs to be taken over, the corresponding takeover strategy is triggered. A takeover strategy can configure the virtual character to perform at least one takeover response behavior of making a takeover action and broadcasting a takeover phrase, where the takeover action includes at least one of a hand action and a facial action. There are at least the following two takeover modes: action-only takeover, in which the virtual character gives no verbal reply and responds to the user only with a takeover action, without affecting the rest of the dialogue state; and action-plus-phrase takeover, in which the virtual character both makes a takeover action and broadcasts a takeover phrase in response to the user. In some scenarios, a takeover strategy may also be configured to broadcast a takeover phrase only. Takeover actions include facial expressions, gesture actions, and so on.
In this embodiment, multiple takeover strategies can be configured, and different takeover strategies can have different takeover response behaviors.
Optionally, the execution timing of each takeover response behavior can also be configured in the takeover strategy. Illustratively, the execution timing of any takeover response behavior may be one of the following: executed immediately, executed after a specified period of time, or executed after the user's input ends. Different takeover response behaviors in the same takeover strategy can be configured with different execution timings.
For example, during the user's input, if the gesture intention of the user's gesture is "thumbs up", the corresponding takeover strategy may be: immediately make a "smile" expression and a gesture expressing happiness, and, after the user's input ends, broadcast the takeover phrase "I'm very glad to receive your compliment."
For example, during the user's input, if the gesture intention of the user's gesture is "thumbs up", the corresponding takeover strategy may instead be: immediately make a "smile" expression and a gesture expressing happiness, and immediately broadcast the takeover phrase "Thank you."
It should be noted that a takeover phrase broadcast immediately is usually kept short, for example "mm-hmm", "yes", "right", "hmm", or "oh", so that broadcasting it does not affect the user's normal voice input.
Step S305: According to the driving data and the three-dimensional image rendering model of the virtual character, drive the virtual character to perform the takeover response behavior corresponding to the user's gesture information.
After the corresponding driving data has been determined from the takeover response strategy corresponding to the user's current gesture, the skeleton model of the virtual character is driven according to the driving data to obtain the skeleton data corresponding to the response behavior, and the skeleton data is rendered with the three-dimensional detailed rendering model of the virtual character to obtain the virtual character image data corresponding to the response behavior. By rendering the virtual character image data into the output video stream, the virtual character in the output video stream performs the corresponding takeover response behavior, realizing a multimodal duplex interaction capability in which the virtual character responds to the user's gestures in a timely manner.
In this embodiment, during a round of dialogue between the virtual character and the user, while the user is providing voice input, the voice data input by the user and the user's image data are acquired in real time. When it is detected that the silence duration of the voice input is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the user's gesture information is identified in real time from the user's image data in the previous period; if it is determined from that gesture information that the user has made a gesture that needs to be taken over, the virtual character is driven to perform takeover response processing for the user's current gesture, so that the virtual character in the output video stream performs the corresponding takeover response behavior. This increases the real-time recognition capability for user gestures and drives the virtual character to respond to the user's gestures in a timely manner, improving the degree of anthropomorphism of the virtual character and making the interaction between the virtual character and the user smoother and more intelligent.
In an optional embodiment, the above step S304 can be implemented with the following steps.
Step S3041: Convert the voice data input by the user in the previous period into corresponding text information.
In this step, the voice data input by the user in the previous period is fed into the ASR module, which converts it into corresponding text information.
Step S3042: Determine the gesture intention category corresponding to the user's gesture information based on the user's gesture information and the text information.
In practical applications, a gesture may express different intentions in different scenarios. In this step, the user's gesture information from the previous period is combined with the text information of the user's voice input for multimodal classification, determining the gesture intention category corresponding to the user's gesture information in the previous period.
Illustratively, the gesture intention categories that require a duplex response can be configured in advance in this embodiment, together with a response strategy for each gesture intention category.
Specifically, the text information and the user's image data from the previous period are input into a trained multimodal classification model. The model identifies the user's gesture information from the image data of the previous period, extracts the semantic features of the text information, and performs multimodal classification based on the gesture information and the semantic features of the text, thereby determining the gesture intention category corresponding to the user's gesture information.
In practical applications, a gesture may carry different meanings in different scenarios; that is, the same gesture expresses different user intentions in different scenarios. For example, a "swipe up" motion may express the gesture intention "turn the page up" in one scenario and the gesture intention "hello" in another.
In this embodiment, the multimodal classification model accurately identifies the gesture intention corresponding to the user's gesture information by fusing the semantic features of the text information of the user's voice input with the user's gesture information.
The multimodal classification model may be implemented with any existing multimodal image classification model, or another multimodal alignment model may be used to correct the image classification result based on the text information.
Illustratively, the user's gesture information may be identified from the user's image data in the previous period by a temporal convolutional neural network that performs feature extraction and gesture classification on the user image data in multiple video frames, recognizing the user's gesture in real time. Alternatively, this identification may be implemented with any existing gesture recognition algorithm capable of recognizing the gesture made by the user from the user's image data, which is not described again here.
Step S3043: If the gesture intention category corresponding to the user's gesture information belongs to the gesture intention categories that need to be taken over, determine that the user has made a gesture that needs to be taken over, and determine the driving data of the virtual character according to the gesture intention category corresponding to the user's gesture information.
The driving data is used to drive the virtual character to perform the takeover response behavior corresponding to the gesture intention category of the user's gesture information.
The specific implementation of this step is the same as the implementation of the above step S205 for determining the driving data of the virtual character according to the gesture intention category corresponding to the user's gesture information in the dialogue state where the user provides input and the virtual character receives it, and is not described again here.
In this embodiment, when the driving data of the virtual character is determined after it is found from the user's gesture information that the user has made a gesture that needs to be taken over, the voice data input by the user in the previous period is converted into corresponding text information, and the text of the user's voice input is fused in to accurately identify the gesture intention corresponding to the user's gesture information. The user's gesture intention can thus be recognized precisely and in real time, and the virtual character is driven on that basis to perform the corresponding takeover response behavior, guiding the subsequent dialogue flow and making the interaction between the virtual character and the user smoother and more intelligent.
Figure 4 is a schematic structural diagram of a virtual character driving system based on multimodal data provided by an exemplary embodiment of this application. The virtual character driving system based on multimodal data provided by the embodiments of this application can execute the processing flow provided by the embodiments of the virtual character driving method based on multimodal data. As shown in Figure 4, the virtual character driving system 40 based on multimodal data includes: a multimodal input module 41, a speech processing module 42, an image processing module 43, and a drive control module 44.
The drive control module 44 is configured to obtain a three-dimensional image rendering model of the virtual character, so as to provide interactive services to the user through the virtual character.
The multimodal input module 41 is configured to acquire, in real time, the voice data input by the user and the user's image data during a round of dialogue between the virtual character and the user.
The speech processing module 42 is configured to, when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration and it is determined that the voice input has not ended, convert the voice data input by the user in the previous period into corresponding text information, where the previous period runs from the last moment at which the silence duration reached the preset duration to the current moment.
The image processing module 43 is configured to identify the user's gesture information from the user's image data in the previous period and to determine, based on the user's gesture information and the text information, the gesture intention category corresponding to the user's gesture information.
The drive control module 44 is further configured to determine the corresponding driving data according to the gesture intention category corresponding to the user's gesture information and the current dialogue state, and to drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
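A rough sketch of how these modules might be wired together for one duplex response pass is given below; the class and method names mirror the module names above but are illustrative assumptions rather than the actual implementation of the disclosed system.

```python
# Sketch of the system in Figure 4: modules 41-44 cooperating on one pass.
class MultimodalDataAvatarDriver:
    """System 40: multimodal input, speech, image, and drive-control modules."""

    def __init__(self, multimodal_input, speech_module, image_module, drive_control):
        self.multimodal_input = multimodal_input   # module 41
        self.speech_module = speech_module         # module 42
        self.image_module = image_module           # module 43
        self.drive_control = drive_control         # module 44

    def on_silence_detected(self, dialogue_state):
        """One duplex response pass, triggered when the preset silence elapses."""
        voice, frames = self.multimodal_input.last_period()
        text = self.speech_module.to_text(voice)                   # ASR
        intent = self.image_module.classify_intent(frames, text)   # multimodal
        driving_data = self.drive_control.decide(intent, dialogue_state)
        self.drive_control.render_response(driving_data)
```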
本申请实施例提供的系统可以具体用于执行上述图2对应方法实施例所提供的方案,具体功能和所能实现的技术效果此处不再赘述。The system provided by the embodiment of the present application can be specifically used to execute the solution provided by the method embodiment corresponding to Figure 2 above. The specific functions and the technical effects that can be achieved will not be described again here.
一种可选地实施例中,在根据上一时段内用户的图像数据识别用户的手势信息,并根据用户的手势信息和文本信息,确定用户的手势信息对应的手势意图分类时,图像处理模块43还用于:In an optional embodiment, when identifying the user's gesture information based on the user's image data in the previous period, and determining the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and text information, the image processing module 43 is also used for:
将文本信息与上一时段内用户的图像数据输入训练好的多模态分类模型,通过多模态分类模型,根据上一时段内用户的图像数据识别用户的手势信息,提取文本信息的语义特征,根据用户的手势信息和文本信息的语义特征,进行多模态分类处理,确定用户的手势信息对应的手势意图分类。Input the text information and the user's image data in the previous period into the trained multi-modal classification model. Through the multi-modal classification model, the user's gesture information is recognized based on the user's image data in the previous period and the semantic features of the text information are extracted. , based on the semantic features of the user's gesture information and text information, perform multi-modal classification processing to determine the gesture intention classification corresponding to the user's gesture information.
In an optional embodiment, as shown in Figure 5, the virtual character driving system 40 based on multimodal data further includes a strategy configuration module 45.
The strategy configuration module 45 is configured to, in response to a response strategy configuration operation, configure at least one of the following types of response strategies: an interruption strategy in which the virtual character actively or passively interrupts the current processing, an acceptance strategy in which the virtual character actively acknowledges the user, starting a new round of dialogue, and no feedback.
Each type of response strategy includes one or more response strategies, and each response strategy includes a corresponding gesture intention classification, a response time, and a response mode.
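A response strategy table of this shape might be represented as follows; the field names and the concrete entries are illustrative assumptions rather than a prescribed configuration format.

```python
from dataclasses import dataclass
from enum import Enum, auto

class StrategyType(Enum):
    INTERRUPT = auto()      # virtual character interrupts the current processing
    ACCEPT = auto()         # virtual character acknowledges the user's gesture
    NEW_DIALOGUE = auto()   # start a new round of dialogue
    NO_FEEDBACK = auto()

@dataclass
class ResponseStrategy:
    gesture_intent: str          # gesture intention classification this strategy applies to
    strategy_type: StrategyType
    response_time_ms: int        # when to respond once the intent has been recognized
    response_mode: str           # e.g. "action", "speech", "action+speech"

# Illustrative entries only; a deployment would configure these through the strategy configuration module.
STRATEGY_TABLE = [
    ResponseStrategy("raise_hand", StrategyType.ACCEPT, 200, "action+speech"),
    ResponseStrategy("stop_palm", StrategyType.INTERRUPT, 0, "action"),
    ResponseStrategy("wave_goodbye", StrategyType.NEW_DIALOGUE, 500, "speech"),
]
```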
In an optional embodiment, when determining the corresponding drive data according to the gesture intention classification corresponding to the gesture information of the user and the current dialogue state, as shown in Figure 5, the drive control module 44 includes a response decision unit 441 and a drive control unit 442. The response decision unit 441 is configured to, if the current dialogue state is a state in which the user provides input and the virtual character receives it, determine, according to the gesture intention classification corresponding to the gesture information of the user, a first target strategy corresponding to the gesture intention classification, where the first target strategy is one of the acceptance strategy, starting a new round of dialogue, and no feedback.
The drive control unit 442 is configured to determine the corresponding drive data according to the first target strategy corresponding to the gesture intention classification, where the drive data is used to drive the virtual character to perform the response behavior corresponding to the first target strategy.
In an optional embodiment, when determining the corresponding drive data according to the first target strategy corresponding to the gesture intention classification, the drive control unit 442 is further configured to: if the first target strategy is the acceptance strategy, determine first drive data according to the acceptance strategy, where the first drive data is used to drive the virtual character to perform at least one acceptance response behavior of making an acceptance action and broadcasting an acceptance script, and the acceptance action includes at least one of a hand action and a facial action.
In an optional embodiment, the drive control unit is further configured to: if the current dialogue state is a state in which the virtual character produces output and the user receives it, determine, according to the gesture intention classification corresponding to the gesture information of the user, a second target strategy corresponding to the gesture intention classification, where the second target strategy is one of the interruption strategy, starting a new round of dialogue, and no feedback; and determine second drive data according to the second target strategy corresponding to the gesture intention classification, where the second drive data is used to drive the virtual character to perform the response behavior corresponding to the second target strategy.
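Taken together, the dialogue-state-dependent decision described in the preceding paragraphs could be sketched like this; the gesture intent names, the table format, and the returned drive-data fields are placeholders chosen for the example.

```python
from enum import Enum, auto

class DialogueState(Enum):
    USER_SPEAKING = auto()       # user inputs, virtual character receives
    CHARACTER_SPEAKING = auto()  # virtual character outputs, user receives

# Target strategies allowed in each dialogue state, per the optional embodiments above.
ALLOWED_STRATEGIES = {
    DialogueState.USER_SPEAKING: {"accept", "new_dialogue", "no_feedback"},
    DialogueState.CHARACTER_SPEAKING: {"interrupt", "new_dialogue", "no_feedback"},
}

def decide_drive_data(gesture_intent, state, strategy_table):
    """Map a recognized gesture intent and the current dialogue state to drive data (or None)."""
    for entry in strategy_table:
        if entry["gesture_intent"] != gesture_intent:
            continue
        if entry["type"] not in ALLOWED_STRATEGIES[state]:
            continue
        if entry["type"] == "no_feedback":
            return None
        # The drive data stands in for the animation and speech parameters
        # consumed together with the three-dimensional image rendering model.
        return {
            "target_strategy": entry["type"],
            "response_mode": entry["response_mode"],
            "delay_ms": entry["response_time_ms"],
        }
    return None

# Example: a "stop" palm gesture while the character is speaking selects the interruption strategy.
example_table = [{"gesture_intent": "stop_palm", "type": "interrupt",
                  "response_mode": "action", "response_time_ms": 0}]
print(decide_drive_data("stop_palm", DialogueState.CHARACTER_SPEAKING, example_table))
```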
The system provided by this embodiment of the present application can be specifically used to execute the solution provided by any optional method embodiment based on the method embodiment corresponding to Figure 2 above; its specific functions and achievable technical effects are not described again here.
Figure 6 is a schematic architectural diagram of a virtual character driving system based on multimodal data provided by another exemplary embodiment of the present application. The virtual character driving system based on multimodal data provided by the embodiments of the present application can execute the processing flow provided by the embodiments of the virtual character driving method based on multimodal data. As shown in Figure 6, the virtual character driving system 60 based on multimodal data includes a multimodal input module 61, a perception module 62, and a decision-driving module 63.
The decision-driving module 63 is configured to obtain a three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to the user.
The multimodal input module 61 is configured to obtain, in real time, the voice data input by the user and the image data of the user while the user is performing voice input during a round of dialogue between the virtual character and the user.
The perception module 62 is configured to, when it is detected that the silence duration of the voice input is greater than or equal to a preset duration, and it is determined that the voice input has not ended, identify the gesture information of the user according to the image data of the user in the previous period, where the previous period is from the moment when the silence duration was last greater than or equal to the preset duration to the current moment.
The decision-driving module 63 is configured to determine drive data of the virtual character if it is determined, according to the gesture information of the user, that the user has made a gesture that requires acceptance, and to drive the virtual character to perform the acceptance response behavior corresponding to the gesture information of the user according to the drive data and the three-dimensional image rendering model of the virtual character.
The system provided by this embodiment of the present application can be specifically used to execute the solution provided by the method embodiment corresponding to Figure 3 above; its specific functions and achievable technical effects are not described again here.
In an optional embodiment, when determining the drive data of the virtual character if it is determined, according to the gesture information of the user, that the user has made a gesture that requires acceptance, the perception module 62 is further configured to: convert the voice data input by the user in the previous period into corresponding text information, and determine, according to the gesture information of the user and the text information, the gesture intention classification corresponding to the gesture information of the user.
The decision-driving module 63 is further configured to: if the gesture intention classification corresponding to the gesture information of the user belongs to the gesture intention classifications that require acceptance, determine that the user has made a gesture that requires acceptance, and determine the drive data of the virtual character according to the gesture intention classification corresponding to the gesture information of the user, where the drive data is used to drive the virtual character to perform the acceptance response behavior corresponding to that gesture intention classification.
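For the second system, the check for gestures that require acceptance might look as simple as the following; the intent names and the drive-data fields are invented for the illustration.

```python
# Gesture intention classifications treated as requiring acceptance in this sketch;
# the real set would come from the configured response strategies.
ACCEPTANCE_INTENTS = {"raise_hand", "point_at_item", "beckon"}

def drive_data_for_acceptance(gesture_intent):
    """Return drive data for an acceptance response, or None if no acceptance is needed."""
    if gesture_intent not in ACCEPTANCE_INTENTS:
        return None
    # Pairs an acceptance action with a short spoken acknowledgement; the field names
    # are placeholders for whatever the rendering pipeline actually consumes.
    return {
        "animation": "nod_and_open_palm",    # hand or facial acceptance action
        "tts_text": "I see, please go on.",  # acceptance script to broadcast
    }
```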
The system provided by this embodiment of the present application can be specifically used to execute the solution provided by any optional method embodiment based on the method embodiment corresponding to Figure 3 above; its specific functions and achievable technical effects are not described again here.
Figure 7 is a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present application. As shown in Figure 7, the electronic device 70 includes a processor 701 and a memory 702 communicatively connected to the processor 701, where the memory 702 stores computer-executable instructions.
The processor executes the computer-executable instructions stored in the memory to implement the solution provided by any of the above method embodiments; the specific functions and achievable technical effects are not described again here.
Embodiments of the present application further provide a computer-readable storage medium in which computer-executable instructions are stored; when executed by a processor, the computer-executable instructions are used to implement the solution provided by any of the above method embodiments, and the specific functions and achievable technical effects are not described again here.
Embodiments of the present application further provide a computer program product, which includes a computer program stored in a readable storage medium. At least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device executes the solution provided by any of the above method embodiments; the specific functions and achievable technical effects are not described again here.
In addition, some of the processes described in the above embodiments and the accompanying drawings include multiple operations that appear in a specific order, but it should be clearly understood that these operations need not be performed in the order in which they appear herein and may be performed in parallel; the sequence numbers are only used to distinguish the different operations and do not represent any execution order. These processes may also include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they neither represent a sequence nor limit "first" and "second" to being of different types. "Multiple" means two or more, unless otherwise clearly and specifically limited.
Other embodiments of the present application will readily occur to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed herein. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present application are indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (12)

1. A virtual character driving method based on multimodal data, characterized by comprising:
    obtaining a three-dimensional image rendering model of a virtual character, so as to use the virtual character to provide interactive services to a user;
    during a round of dialogue between the virtual character and the user, obtaining, in real time, voice data input by the user and image data of the user;
    when it is detected that a silence duration of the voice data input by the user is greater than or equal to a preset duration, if it is determined that the voice input has not ended, converting the voice data input by the user in a previous period into corresponding text information, wherein the previous period is from the moment when the silence duration was last greater than or equal to the preset duration to the current moment;
    identifying gesture information of the user according to the image data of the user in the previous period, and determining, according to the gesture information of the user and the text information, a gesture intention classification corresponding to the gesture information of the user;
    determining corresponding drive data according to the gesture intention classification corresponding to the gesture information of the user and a current dialogue state; and
    driving the virtual character to perform a corresponding response behavior according to the drive data and the three-dimensional image rendering model of the virtual character.
2. The method according to claim 1, characterized in that identifying the gesture information of the user according to the image data of the user in the previous period, and determining, according to the gesture information of the user and the text information, the gesture intention classification corresponding to the gesture information of the user, comprises:
    inputting the text information and the image data of the user in the previous period into a trained multimodal classification model, and, through the multimodal classification model, identifying the gesture information of the user according to the image data of the user in the previous period, extracting semantic features of the text information, and performing multimodal classification processing according to the gesture information of the user and the semantic features of the text information to determine the gesture intention classification corresponding to the gesture information of the user.
3. The method according to claim 1 or 2, characterized by further comprising:
    in response to a response strategy configuration operation, configuring at least one of the following types of response strategies:
    an interruption strategy, an acceptance strategy, starting a new round of dialogue, and no feedback;
    wherein each type of response strategy includes one or more response strategies, and each response strategy includes a corresponding gesture intention classification, a response time, and a response mode.
4. The method according to claim 3, characterized in that determining the corresponding drive data according to the gesture intention classification corresponding to the gesture information of the user and the current dialogue state comprises:
    if the current dialogue state is a state in which the user provides input and the virtual character receives it, determining, according to the gesture intention classification corresponding to the gesture information of the user, a first target strategy corresponding to the gesture intention classification, wherein the first target strategy is one of the acceptance strategy, starting a new round of dialogue, and no feedback; and
    determining the corresponding drive data according to the first target strategy corresponding to the gesture intention classification, wherein the drive data is used to drive the virtual character to perform a response behavior corresponding to the first target strategy.
5. The method according to claim 4, characterized in that determining the corresponding drive data according to the first target strategy corresponding to the gesture intention classification comprises:
    if the first target strategy is the acceptance strategy, determining first drive data according to the acceptance strategy, wherein the first drive data is used to drive the virtual character to perform at least one acceptance response behavior of making an acceptance action and broadcasting an acceptance script, and the acceptance action includes at least one of a hand action and a facial action.
6. The method according to claim 3, characterized in that determining the corresponding drive data according to the gesture intention classification corresponding to the gesture information of the user and the current dialogue state comprises:
    if the current dialogue state is a state in which the virtual character produces output and the user receives it, determining, according to the gesture intention classification corresponding to the gesture information of the user, a second target strategy corresponding to the gesture intention classification, wherein the second target strategy is one of the interruption strategy, starting a new round of dialogue, and no feedback; and
    determining second drive data according to the second target strategy corresponding to the gesture intention classification, wherein the second drive data is used to drive the virtual character to perform a response behavior corresponding to the second target strategy.
7. A virtual character driving method based on multimodal data, characterized by comprising:
    obtaining a three-dimensional image rendering model of a virtual character, so as to use the virtual character to provide interactive services to a user;
    during a round of dialogue between the virtual character and the user, while the user is performing voice input, obtaining, in real time, voice data input by the user and image data of the user;
    when it is detected that a silence duration of the voice input is greater than or equal to a preset duration, if it is determined that the voice input has not ended, identifying gesture information of the user according to the image data of the user in a previous period, wherein the previous period is from the moment when the silence duration was last greater than or equal to the preset duration to the current moment;
    if it is determined, according to the gesture information of the user, that the user has made a gesture that requires acceptance, determining drive data of the virtual character; and
    driving the virtual character to perform an acceptance response behavior corresponding to the gesture information of the user according to the drive data and the three-dimensional image rendering model of the virtual character.
8. The method according to claim 7, characterized in that, if it is determined, according to the gesture information of the user, that the user has made a gesture that requires acceptance, determining the drive data of the virtual character comprises:
    converting the voice data input by the user in the previous period into corresponding text information;
    determining, according to the gesture information of the user and the text information, a gesture intention classification corresponding to the gesture information of the user; and
    if the gesture intention classification corresponding to the gesture information of the user belongs to gesture intention classifications that require acceptance, determining that the user has made a gesture that requires acceptance, and determining the drive data of the virtual character according to the gesture intention classification corresponding to the gesture information of the user, wherein the drive data is used to drive the virtual character to perform an acceptance response behavior corresponding to the gesture intention classification corresponding to the gesture information of the user.
9. A virtual character driving system based on multimodal data, characterized by comprising:
    a drive control module, configured to obtain a three-dimensional image rendering model of a virtual character, so as to use the virtual character to provide interactive services to a user;
    a multimodal input module, configured to obtain, in real time, voice data input by the user and image data of the user during a round of dialogue between the virtual character and the user;
    a voice processing module, configured to, when it is detected that a silence duration of the voice data input by the user is greater than or equal to a preset duration, and it is determined that the voice input has not ended, convert the voice data input by the user in a previous period into corresponding text information, wherein the previous period is from the moment when the silence duration was last greater than or equal to the preset duration to the current moment; and
    an image processing module, configured to identify gesture information of the user according to the image data of the user in the previous period, and to determine, according to the gesture information of the user and the text information, a gesture intention classification corresponding to the gesture information of the user;
    wherein the drive control module is further configured to determine corresponding drive data according to the gesture intention classification corresponding to the gesture information of the user and a current dialogue state, and to drive the virtual character to perform a corresponding response behavior according to the drive data and the three-dimensional image rendering model of the virtual character.
10. A virtual character driving system based on multimodal data, characterized by comprising:
    a decision-driving module, configured to obtain a three-dimensional image rendering model of a virtual character, so as to use the virtual character to provide interactive services to a user;
    a multimodal input module, configured to obtain, in real time, voice data input by the user and image data of the user while the user is performing voice input during a round of dialogue between the virtual character and the user;
    a perception module, configured to, when it is detected that a silence duration of the voice input is greater than or equal to a preset duration, and it is determined that the voice input has not ended, identify gesture information of the user according to the image data of the user in a previous period, wherein the previous period is from the moment when the silence duration was last greater than or equal to the preset duration to the current moment;
    wherein the decision-driving module is further configured to determine drive data of the virtual character if it is determined, according to the gesture information of the user, that the user has made a gesture that requires acceptance, and to drive the virtual character to perform an acceptance response behavior corresponding to the gesture information of the user according to the drive data and the three-dimensional image rendering model of the virtual character.
11. An electronic device, characterized by comprising: a processor, and a memory communicatively connected to the processor;
    wherein the memory stores computer-executable instructions; and
    the processor executes the computer-executable instructions stored in the memory to implement the method according to any one of claims 1 to 8.
12. A computer-readable storage medium, characterized in that computer-executable instructions are stored in the computer-readable storage medium, and the computer-executable instructions, when executed by a processor, are used to implement the method according to any one of claims 1 to 8.
PCT/CN2023/095449 2022-05-23 2023-05-22 Virtual character driving method and system based on multimodal data, and device WO2023226914A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210567637.3 2022-05-23
CN202210567637.3A CN114840090A (en) 2022-05-23 2022-05-23 Virtual character driving method, system and equipment based on multi-modal data

Publications (1)

Publication Number Publication Date
WO2023226914A1 (en) 2023-11-30

Family

ID=82572222

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/095449 WO2023226914A1 (en) 2022-05-23 2023-05-22 Virtual character driving method and system based on multimodal data, and device

Country Status (2)

Country Link
CN (1) CN114840090A (en)
WO (1) WO2023226914A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114840090A (en) * 2022-05-23 2022-08-02 阿里巴巴(中国)有限公司 Virtual character driving method, system and equipment based on multi-modal data
CN115356953B (en) * 2022-10-21 2023-02-03 北京红棉小冰科技有限公司 Virtual robot decision method, system and electronic equipment
CN115509366A (en) * 2022-11-21 2022-12-23 科大讯飞股份有限公司 Intelligent cabin multi-modal man-machine interaction control method and device and electronic equipment
CN116798427A (en) * 2023-06-21 2023-09-22 支付宝(杭州)信息技术有限公司 Man-machine interaction method based on multiple modes and digital man system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105070290A (en) * 2015-07-08 2015-11-18 苏州思必驰信息科技有限公司 Man-machine voice interaction method and system
CN108415561A (en) * 2018-02-11 2018-08-17 北京光年无限科技有限公司 Gesture interaction method based on visual human and system
CN114840090A (en) * 2022-05-23 2022-08-02 阿里巴巴(中国)有限公司 Virtual character driving method, system and equipment based on multi-modal data

Also Published As

Publication number Publication date
CN114840090A (en) 2022-08-02

Similar Documents

Publication Publication Date Title
WO2023226914A1 (en) Virtual character driving method and system based on multimodal data, and device
US11749265B2 (en) Techniques for incremental computer-based natural language understanding
CN107894833B (en) Multi-modal interaction processing method and system based on virtual human
US11551804B2 (en) Assisting psychological cure in automated chatting
CN108000526B (en) Dialogue interaction method and system for intelligent robot
CN107423809B (en) Virtual robot multi-mode interaction method and system applied to video live broadcast platform
WO2023226913A1 (en) Virtual character drive method, apparatus, and device based on expression recognition
CN106985137B (en) Multi-modal exchange method and system for intelligent robot
CN106503786B (en) Multi-modal interaction method and device for intelligent robot
JP2023089115A (en) Hot-word free adaptation of automated assistant function
KR102448382B1 (en) Electronic device for providing image related with text and operation method thereof
CN110299152A (en) Interactive output control method, device, electronic equipment and storage medium
CN107315742A (en) The Interpreter's method and system that personalize with good in interactive function
JP2004206704A (en) Dialog management method and device between user and agent
JP2001229392A (en) Rational architecture for executing conversational character with communication of small number of messages
WO2023216765A1 (en) Multi-modal interaction method and apparatus
CN107704612A (en) Dialogue exchange method and system for intelligent robot
EP3635513B1 (en) Selective detection of visual cues for automated assistants
CN106502382B (en) Active interaction method and system for intelligent robot
KR20200036089A (en) Apparatus and method for interaction
TW201937344A (en) Smart robot and man-machine interaction method
KR102222911B1 (en) System for Providing User-Robot Interaction and Computer Program Therefore
CN110737335B (en) Interaction method and device of robot, electronic equipment and storage medium
TW201947427A (en) Man-machine dialog method, client, electronic device and storage medium
KR20220011083A (en) Information processing method, device, electronic equipment and storage medium in user dialogue

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23810985

Country of ref document: EP

Kind code of ref document: A1