WO2023226914A1 - Virtual character driving method and system based on multimodal data, and device - Google Patents


Info

Publication number
WO2023226914A1
WO2023226914A1 · PCT/CN2023/095449 · CN2023095449W
Authority
WO
WIPO (PCT)
Prior art keywords
user
gesture
virtual character
data
information
Prior art date
Application number
PCT/CN2023/095449
Other languages
French (fr)
Chinese (zh)
Inventor
朱鹏程
马远凯
冷海涛
张昆才
罗智凌
周伟
李禹�
钱景
Original Assignee
阿里巴巴(中国)有限公司 (Alibaba (China) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 阿里巴巴(中国)有限公司 (Alibaba (China) Co., Ltd.)
Publication of WO2023226914A1 publication Critical patent/WO2023226914A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Definitions

  • This application relates to artificial intelligence, deep learning, machine learning, virtual reality and other fields in computer technology, and in particular to a virtual character driving method, system and device based on multi-modal data.
  • This application provides a virtual character driving method, system and device based on multi-modal data, to address the low degree of anthropomorphism of virtual characters and the lack of smoothness and intelligence in the interaction between virtual characters and people.
  • In a first aspect, this application provides a virtual character driving method based on multi-modal data, including:
  • obtaining, in real time, the voice data input by the user and the user's image data;
  • when it is detected that the silence duration of the voice data input by the user is greater than or equal to a preset duration, and it is determined that the voice input has not ended, converting the voice data input by the user in the previous period into corresponding text information, where the previous period is from the time when the silence duration was last greater than or equal to the preset duration to the current moment;
  • identifying the user's gesture information based on the user's image data in the previous period, and determining the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and the text information;
  • determining the corresponding driving data according to the gesture intention classification and the current dialogue state; and
  • driving the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
  • In a second aspect, this application provides a virtual character driving method based on multi-modal data, including:
  • obtaining, in real time, the voice data input by the user and the user's image data;
  • when it is detected that the silence duration of the voice input is greater than or equal to a preset duration, and it is determined that the voice input has not ended, identifying the user's gesture information based on the user's image data in the previous period, where the previous period is from the time when the silence duration was last greater than or equal to the preset duration to the current moment;
  • according to the user's gesture information, if it is determined that the user has made a gesture that needs to be accepted, determining the driving data of the virtual character; and
  • driving the virtual character to perform the acceptance response behavior corresponding to the user's gesture information.
  • this application provides a virtual character driving system based on multi-modal data, including:
  • the drive control module is used to obtain the three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to the user;
  • the multi-modal input module is used to obtain the voice data input by the user and the image data of the user in real time during a round of dialogue between the virtual character and the user;
  • a voice processing module, configured to, when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration and it is determined that the voice input has not ended, convert the voice data input by the user in the previous period into the corresponding text information, where the previous period is from the time when the silence duration was last greater than or equal to the preset duration to the current moment;
  • an image processing module, configured to identify the user's gesture information based on the user's image data in the previous period, and to determine the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and the text information;
  • the drive control module is also used to determine the corresponding driving data according to the gesture intention classification corresponding to the user's gesture information and the current conversation state, and to drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
  • this application provides a virtual character driving system based on multi-modal data, including:
  • the decision-driven module is used to obtain the three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to users;
  • the multi-modal input module is used to obtain the voice data input by the user and the image data of the user in real time during a conversation between the virtual character and the user;
  • a sensing module, configured to, when it is detected that the silence duration of the voice input is greater than or equal to the preset duration and it is determined that the voice input has not ended, identify the user's gesture information according to the user's image data in the previous period, where the previous period is from the time when the silence duration was last greater than or equal to the preset duration to the current moment;
  • the decision-driven module is also used to determine the driving data of the virtual character according to the user's gesture information if it is determined that the user has made a gesture that needs to be accepted, and to drive the virtual character to perform the acceptance response behavior corresponding to the user's gesture information according to the driving data and the three-dimensional image rendering model of the virtual character.
  • this application provides an electronic device, including: a processor, and a memory communicatively connected to the processor;
  • the memory stores computer-executable instructions;
  • the processor executes the computer-executable instructions stored in the memory to implement the method described in the first aspect or the second aspect.
  • the present application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, are used to implement the method described in the first or second aspect above.
  • The virtual character driving method, system and device based on multi-modal data obtain the voice data input by the user and the user's image data in real time during a round of dialogue between the virtual character and the user. When it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and it is determined that the voice input has not ended, the voice data input by the user in the previous period is converted into corresponding text information, and the user's gesture information is identified based on the user's image data in the previous period.
  • The gesture intention classification corresponding to the user's gesture information is then determined based on the user's gesture information and the text information, so that the gesture intention of the user's gesture can be recognized in real time; based on that gesture intention and the current conversation state, the virtual character is driven to perform the corresponding response behavior.
  • This causes the virtual character in the output video stream to perform the corresponding response behavior, increases the real-time recognition capability for the user's gestures, and drives the virtual character to respond to the user's gesture intention in a timely manner.
  • Figure 1 is a framework diagram of an exemplary virtual character-human interaction system provided by this application.
  • Figure 2 is a flow chart of a virtual character driving method based on multi-modal data provided by an embodiment of the present application
  • Figure 3 is a flow chart of a method for driving a virtual character to accept users provided by an embodiment of the present application
  • Figure 4 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by an exemplary embodiment of the present application
  • Figure 5 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by another exemplary embodiment of the present application.
  • Figure 6 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by another exemplary embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
  • Multi-modal interaction: users can communicate with virtual characters through text, voice, expressions, etc.
  • The virtual characters can understand the user's text, voice, expressions and other information, and can in turn communicate with users through text, voice, expressions, etc.
  • Gesture interaction: users can communicate with virtual characters through gestures, and virtual characters can also reply to users through gestures and other methods.
  • Duplex interaction: a real-time, two-way interaction method.
  • The user can interrupt the virtual character at any time, and the virtual character can also interrupt its own speech when necessary.
  • The avatar can provide instant feedback to the user, such as nodding, smiling, and softly responding, without interrupting the user's input, so as to guide the subsequent conversation process.
  • VAD: Voice Activity Detection.
  • TTS (Text To Speech): a technology that converts text into speech.
  • the virtual character driving method based on multi-modal data involves artificial intelligence, deep learning, machine learning, virtual reality and other fields in computer technology, and can be specifically applied to scenarios in which virtual characters interact with people.
  • this application provides a virtual character driving method based on multi-modal data.
  • During a round of dialogue between the virtual character and the user, the voice data input by the user and the user's image data are acquired in real time.
  • the voice data input by the user in this round of dialogue is converted into corresponding text information.
  • the gesture intention classification corresponding to the user's gesture information is determined, thereby accurately and real-time identifying the user's gesture intention.
  • The corresponding driving data is determined based on the gesture intention classification corresponding to the user's gesture information and the current conversation state, and the virtual character is driven to perform the corresponding response behavior based on the driving data and the three-dimensional image rendering model of the virtual character. This enables the virtual character to respond promptly to the user's gestures and voice input, giving it multi-modal interaction capabilities, improving the degree of anthropomorphism of the virtual character, and making the communication process between the virtual character and people smoother and more intelligent.
  • Figure 1 is a framework diagram of an exemplary interactive system between virtual characters and people.
  • the interactive system between virtual characters and people includes the following subsystems: perception system, multi-modal duplex state management system, drive control system and basic dialogue system.
  • The perception system is responsible for receiving the input of multi-modal information such as voice and images, processing the input voice and image data (for example, segmentation and recognition), obtaining recognition results, and providing the recognition results to the multi-modal duplex state management system.
  • The multi-modal duplex state management system is responsible for managing the state of the current conversation, performing decision-making processing of the duplex response state based on the recognition results and the state of the current conversation, and obtaining a decision result that includes a response strategy.
  • the drive control system is responsible for performing virtual character driving, rendering and other processing based on the decision-making results of the multi-modal duplex state management system, generating the virtual character's video stream, and outputting the video stream.
  • the basic dialogue system is responsible for realizing basic human-machine dialogue capabilities, that is, generating corresponding reply information based on questions input by the user.
  • the perception system is responsible for controlling the input of video streams and voice streams in the interactive system, and implementing functions such as segmenting and identifying the input voice streams and video streams.
  • The perception system segments the voice stream based on a short preset silence duration (such as 200 ms, i.e. the VAD time).
  • The voice stream is segmented to generate a speech unit for the previous period, and the speech unit is input to the Automatic Speech Recognition (ASR) module, which converts the speech unit into text information; the text information is finally input to the multi-modal processing and alignment module.
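  • As a concrete illustration only, the following sketch shows one way such VAD-based segmentation could feed an ASR step; the frame format, the helper names and the asr callable are assumptions, not part of this application.
```python
# Minimal sketch of the perception system's segmentation step (assumed interfaces):
# audio frames labeled speech/silence are grouped into speech units whenever roughly
# 200 ms of silence accumulates, and each unit is handed to an ASR callable.
def segment_speech_units(frames, frame_ms=20, segment_silence_ms=200):
    """frames: iterable of (audio_bytes, is_speech) pairs -> yields speech units."""
    unit, silence_ms, has_speech = [], 0, False
    for chunk, is_speech in frames:
        unit.append(chunk)
        has_speech = has_speech or is_speech
        silence_ms = 0 if is_speech else silence_ms + frame_ms
        if silence_ms >= segment_silence_ms and has_speech:
            yield b"".join(unit)          # one speech unit covering the "previous period"
            unit, silence_ms, has_speech = [], 0, False

def transcribe_units(frames, asr):
    # asr: any callable mapping audio bytes to text (e.g. a wrapped ASR service)
    return [asr(unit) for unit in segment_speech_units(frames)]
```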
  • ASR: Automatic Speech Recognition.
  • the perception system recognizes the user's gesture information based on the user's image data in this round of dialogue, and inputs the gesture information into the multi-modal processing and alignment module.
  • The multi-modal processing and alignment module combines the user's gesture information with the text information of the speech unit to determine the gesture intention classification corresponding to the gesture information.
  • Gestures that need to be fed back to the user can include three major categories: gestures with clear meanings (such as OK, numbers, left and right swipes, etc.), unsafe gestures (such as the middle finger and little finger), and customized special gestures.
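  • To make the data flow concrete, the toy sketch below pairs a recognized gesture label with the concurrent text to pick an intent class; the application itself uses a trained multi-modal classification model, and all labels and class names here are illustrative assumptions.
```python
# Hypothetical stand-in for the multi-modal processing and alignment step: a rule
# table over (gesture label, text) instead of the trained classifier described above.
CLEAR_MEANING = {"ok", "number", "swipe_left", "swipe_right"}
UNSAFE = {"middle_finger", "little_finger"}

def classify_gesture_intent(gesture_label, text):
    if gesture_label in UNSAFE:
        return "unsafe"
    if gesture_label in CLEAR_MEANING:
        # the accompanying speech can disambiguate the same gesture across scenarios,
        # e.g. a swipe during "next page, please" vs. a swipe used as a greeting
        return "page_turn" if "page" in text else "clear_meaning"
    return "no_feedback"
```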
  • Based on the user's gesture intention classification recognized in real time and the current dialogue state, the multi-modal duplex state management system makes a decision on the duplex response state and determines the duplex response state corresponding to the gesture intention classification; since different duplex response states correspond to different response strategies, it then determines the response strategy corresponding to the gesture intention classification and obtains the decision result.
  • The duplex response state can include four states: duplex active/passive interruption, duplex active acceptance, calling the basic dialogue system, and no feedback. These respectively correspond to the interruption strategy in which the virtual character actively or passively interrupts the current processing, the acceptance strategy in which the virtual character actively takes over from the user, starting a new round of dialogue (that is, calling the basic dialogue system), and no feedback.
  • Duplex active acceptance: when it is judged necessary to take over from the user's dialogue or action, the corresponding acceptance strategy is triggered.
  • One mode is 'action takeover only', in which the virtual character does not make a verbal takeover reply but only responds to the user by making takeover actions, without affecting other conversation states; the other is 'action + copy', in which the virtual character not only performs the takeover action but also broadcasts the copy to respond to the user.
  • Duplex active/passive interruption: during the virtual character's broadcast, when it is judged that the user intends to interrupt, for example the user makes an unsafe gesture or a stop gesture, or inputs voice with the intention of stopping, the current conversation is actively interrupted immediately. Under the interruption strategy, the avatar interrupts its current speaking state and waits for the user to speak, or actively asks the user the reason for the interruption. If the user inputs voice data with definite semantics, a new round of dialogue is started; if the user does not input voice data with definite semantics after a period of time, the current dialogue continues and the virtual character resumes its broadcast.
  • Calling the basic dialogue system: when the silent time (VAD time) of the user's voice input reaches the silence duration threshold (that is, the VAD threshold), the user's voice input is considered to have ended and the basic dialogue system is called to reply to the user. The silence duration threshold is usually 800 ms and can be configured and adjusted according to the actual application scenario.
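  • The decision itself is a mapping from (gesture intention classification, current dialogue state) to one of the four duplex response states; the sketch below is a hypothetical illustration of that mapping, and the intent-class and state names are assumed, not defined by this application.
```python
# Illustrative duplex-state decision over assumed intent classes and dialogue states.
def decide_duplex_state(intent_class, dialogue_state):
    if dialogue_state == "user_speaking":            # user inputs, avatar receives
        if intent_class in ("praise", "greeting", "clear_meaning"):
            return "active_acceptance"
        if intent_class == "new_question":
            return "call_basic_dialogue"
        return "no_feedback"
    if dialogue_state == "avatar_speaking":          # avatar outputs, user receives
        if intent_class in ("unsafe", "stop"):
            return "interruption"
        if intent_class == "new_question":
            return "call_basic_dialogue"
        return "no_feedback"
    return "no_feedback"
```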
  • The drive control system specifically includes the following three parts. 1) The streaming TTS (Text To Speech) part, which synthesizes the text output in the decision result into an audio stream.
  • 2) The driving part, which includes two sub-modules: the face driving module and the action driving module. The face driving module drives the virtual character to output an accurate mouth shape according to the voice stream to be output in the decision result and generates face driving data; the action driving module drives the virtual character to make accurate actions according to the action tag to be output in the decision result and generates action driving data, such as data for an action blendshape-driven model.
  • 3) The rendering and synthesis part, which renders the output of the driving part, the streaming TTS and other parts and synthesizes the virtual character's video stream.
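  • The sketch below traces that three-part flow end to end; the tts, face_driver, action_driver and renderer objects are assumed interfaces used only to make the pipeline explicit, not components specified by this application.
```python
# Hypothetical drive-and-render pipeline: streaming TTS -> face (lip-sync) driving ->
# action driving from an action tag -> rendering into the output video stream.
def drive_and_render(decision, tts, face_driver, action_driver, renderer):
    audio_stream = tts.synthesize(decision["text"])              # streaming TTS
    face_data = face_driver.from_audio(audio_stream)             # mouth-shape / blendshape data
    action_data = action_driver.from_tag(decision["action_tag"]) # e.g. "nod", "wave"
    return renderer.compose(audio_stream, face_data, action_data)
```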
  • The basic dialogue system contains basic business logic and has basic dialogue interaction capabilities: the user's question is input, and the basic dialogue system outputs the answer to the question.
  • basic dialogue systems usually include: NLU (Natural Language Understanding) module, DM (Dialog Management) module and NLG (Natural Language Generation) module.
  • The business logic is the query logic that obtains the data content required in the reply information based on the question entered by the user. For example, if the user's question is "My height is 160 cm, what size should I get?", the answer is "You should wear size M"; the "size M" in the answer is obtained by querying the business logic based on the height of 160 cm.
  • the NLU module is used to identify and understand text information and convert it into a computer-understandable structured semantic representation or intent label.
  • the DM module is used to maintain and update the current dialogue status and decide on the next system action.
  • the NLG module is used to convert the status output by the system into understandable natural language text.
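  • One turn through such a system might look like the sketch below, using the size-recommendation example above; the nlu, dm, nlg and business_logic objects are assumed interfaces, not a prescribed implementation.
```python
# Hypothetical single turn through the basic dialogue system (NLU -> DM -> NLG).
def basic_dialogue_turn(user_text, nlu, dm, nlg, business_logic):
    semantics = nlu.parse(user_text)       # e.g. {"intent": "size_query", "height_cm": 160}
    action = dm.next_action(semantics)     # decide the next system action
    data = business_logic.query(action)    # e.g. height 160 cm -> size "M"
    return nlg.realize(action, data)       # e.g. "You should wear size M."
```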
  • In the multi-modal duplex state management system, by adding video streams and corresponding visual understanding modules, users can interact with virtual characters through gestures.
  • Such gestures include actions with clear meaning (such as likes, left swipes, right swipes, etc.) and unsafe gestures (such as middle finger gestures).
  • the dialogue becomes a dialogue form that can take over or interrupt the current dialogue at any time based on user gestures.
  • The current duplex response state includes four states: duplex active acceptance, duplex active/passive interruption, calling the basic dialogue system, and no feedback. These respectively correspond to four types of response strategies: the interruption strategy in which the virtual character actively or passively interrupts the current processing, the acceptance strategy in which the virtual character actively takes over from the user, starting a new round of dialogue (that is, calling the basic dialogue system), and no feedback.
  • By deciding among these four types of response strategies, the system can understand the user's gestures and take over, interrupt, or answer questions based on them, which gives virtual characters multi-modal (voice and gesture) interaction capabilities, improves their degree of anthropomorphism, and makes the communication process between virtual characters and people smoother and more intelligent.
  • Figure 2 is a flow chart of a virtual character driving method based on multi-modal data provided by an embodiment of the present application.
  • the virtual character driving method based on multi-modal data provided in this embodiment can be specifically applied to electronic devices that have the function of using virtual characters to interact with humans.
  • The electronic device can be a conversation robot, a terminal or a server, etc. In other embodiments, the electronic device can also be implemented using other devices, and this embodiment is not specifically limited here.
  • Step S201 Obtain a three-dimensional image rendering model of the virtual character to use the virtual character to provide interactive services to the user.
  • the three-dimensional image rendering model of the virtual character includes the rendering data required to realize the rendering of the virtual character.
  • the three-dimensional image rendering model based on the virtual character can render the skeletal data of the virtual character into the three-dimensional image of the virtual character displayed to the user.
  • the method provided in this embodiment can be applied in scenarios where virtual characters interact with people, using virtual characters with three-dimensional images to realize real-time interaction functions between machines and people, so as to provide intelligent services to people.
  • Step S202 During a round of dialogue between the virtual character and the user, the voice data input by the user and the user's image data are obtained in real time.
  • The input voice stream is obtained in real time to obtain the voice data input by the user; the video stream from the user can also be monitored in real time, and the video frames are sampled at a preset frequency to obtain the user's image data.
  • the user's image data includes image data in the video frame in which the user appears, including the user's face image and images of the arms and part of the body appearing in the video frame.
  • The input voice stream and video stream can be acquired in real time by the perception system in the interactive system framework shown in Figure 1 above.
  • Step S203 When it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, the voice data input by the user in the previous period is converted into corresponding text information. The previous period is from the time when the silence duration was last greater than or equal to the preset duration to the current moment.
  • the preset duration is a shorter duration that is less than the silence duration threshold.
  • The silence duration threshold is the silence duration used to determine whether the user's current round of input has ended: when the silence duration of the user's voice input reaches the silence duration threshold, the user's current round of voice input is determined to have ended.
  • the silent duration threshold can be 800ms
  • the preset duration can be 200ms. The preset duration can be set and adjusted according to the needs of the actual application scenario, and is not specifically limited here.
  • The voice data input by the user in the previous period is input into the ASR module, and the voice data is converted into the corresponding text information by the ASR module.
  • The perception system divides the speech stream into small speech units one by one according to a silence time (that is, VAD time) of a preset length (such as 200 ms), and one speech unit corresponds to the voice data between two adjacent silences.
  • Step S204 Identify the user's gesture information based on the user's image data in the previous period, and determine the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and text information.
  • the user's image data in the previous period is acquired, gesture recognition is performed on the user's image data in the previous period, and the user's gesture information is identified.
  • gestures can have different meanings in different scenarios, that is, a gesture represents different user intentions in different scenarios.
  • The user's gesture information in the previous period and the text information of the user's input voice data are combined to perform multi-modal classification to determine the gesture intention classification corresponding to the user's gesture information in the previous period, thereby accurately identifying the meaning of the user's gesture.
  • gesture intention categories that require duplex response can be pre-configured, and a response strategy corresponding to each gesture intention category can be configured.
  • Compared with configuring response strategies directly according to gestures, configuring corresponding response strategies according to different gesture intention categories enables the virtual character to respond more accurately to the user's actual gesture intention, which improves the degree of anthropomorphism of the virtual character.
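  • A pre-configured mapping of this kind could be as simple as the illustrative table below; the intent-category names and strategy contents are assumptions chosen for the example, not values fixed by this application.
```python
# Illustrative pre-configured response strategy per gesture-intent category.
RESPONSE_STRATEGIES = {
    "praise":   {"type": "acceptance",   "action": "smile", "copy": "Thank you"},
    "stop":     {"type": "interruption", "action": "pause_broadcast"},
    "unsafe":   {"type": "interruption", "action": "pause_broadcast",
                 "copy": "Do you have any questions?"},
    "question": {"type": "new_dialogue"},
    "other":    {"type": "no_feedback"},
}

def strategy_for(intent_class):
    return RESPONSE_STRATEGIES.get(intent_class, RESPONSE_STRATEGIES["other"])
```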
  • Step S205 Determine corresponding driving data according to the gesture intention classification corresponding to the user's gesture information and the current dialogue state.
  • The current dialogue state includes the following two types: the state in which the user inputs and the avatar receives, and the state in which the avatar outputs and the user receives.
  • the response strategies may include the following four categories: an interruption strategy in which the avatar actively or passively interrupts the current processing, an acceptance strategy in which the avatar actively accepts the user, starts a new round of dialogue, and no feedback.
  • Each type of response strategy includes one or more response strategies, and each response strategy includes a corresponding gesture meaning.
  • the specific content of each response strategy can be configured according to the needs of actual application scenarios, and is not specifically limited here.
  • In the dialogue state in which the user inputs and the avatar receives, the response strategy adopted can be one of the acceptance strategy, starting a new round of dialogue, or no feedback. Considering that actual application scenarios usually have no requirement for the avatar to interrupt the user's input, the interruption strategy is usually not used to respond in this state.
  • When the avatar outputs and the user receives, the avatar does not need to take over from the user, and the user can interrupt the avatar's current output, that is, interrupt the current dialogue state, so that the user can get the information he or she needs faster. Therefore, in this state the response strategy adopted can be one of the interruption strategy, starting a new round of dialogue, or no feedback, but the acceptance strategy is usually not used.
  • The current response strategy is determined based on the gesture intention classification corresponding to the user's gesture information combined with the current dialogue state, and the driving data of the avatar is generated based on the current response strategy.
  • the driving data includes all the driving parameters required to drive the virtual character to execute the response strategy corresponding to the gesture intention classification, thereby realizing the facial driving and action driving of the virtual character.
  • For example, if the response strategy corresponding to the gesture intention classification includes the virtual character making a prescribed expression, the driving data includes expression driving parameters; if it includes the virtual character making a prescribed gesture action, the driving data includes action driving parameters; if it includes the virtual character broadcasting prescribed words, the driving data includes voice driving parameters; and if it includes multiple response modes among expressions, words and actions, the driving data includes the corresponding multiple driving parameters, which can drive the virtual character to perform the response behaviors corresponding to the response strategy.
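  • Assembling such driving data could look like the sketch below, where each response mode present in the chosen strategy contributes its own driving parameters; the field names, parameter banks and tts interface are assumptions used only for illustration.
```python
# Hypothetical assembly of driving data from a chosen response strategy.
def build_driving_data(strategy, tts, expression_bank, action_bank):
    driving_data = {}
    if "expression" in strategy:                      # prescribed expression
        driving_data["expression_params"] = expression_bank[strategy["expression"]]
    if "action" in strategy:                          # prescribed gesture action
        driving_data["action_params"] = action_bank[strategy["action"]]
    if "copy" in strategy:                            # words to broadcast
        driving_data["voice_params"] = tts.synthesize(strategy["copy"])
    return driving_data
```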
  • Step S206 Drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
  • The skeletal model of the virtual character is driven according to the driving data to obtain the skeletal data corresponding to the response behavior, and the skeletal data is rendered according to the three-dimensional image rendering model of the virtual character to obtain the virtual character image data corresponding to the response behavior.
  • This embodiment obtains the voice data input by the user and the user's image data in real time during a round of dialogue between the virtual character and the user; when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and it is determined that the voice input has not ended, the voice data input by the user in the previous period is converted into corresponding text information, the user's gesture information is identified based on the user's image data in the previous period, and the gesture intention classification corresponding to the user's gesture information is determined based on the user's gesture information and the text information.
  • In this way the gesture intention of the user's gesture can be recognized in real time, and, based on that gesture intention and the current conversation state, the virtual character is driven to perform the corresponding response behavior, so that the virtual character in the output video stream performs the corresponding response behavior. This increases the real-time recognition capability for the user's gestures and drives the virtual character to respond to the user's gesture intention in a timely manner, which improves the degree of anthropomorphism of the virtual character and makes the interaction between virtual characters and people smoother and more intelligent.
  • The above step S204 can use a multi-modal classification model to perform multi-modal alignment and classification processing on the user's gesture information and the text information of the user's input voice, and determine the user's gesture intention classification, so as to accurately identify the intent of the user's gesture.
  • the text information and the user's image data in the previous period are input into the trained multi-modal classification model.
  • In the multi-modal classification model, the user's gesture information is recognized based on the user's image data in the previous period, the semantic features of the text information are extracted, and multi-modal classification processing is performed based on the user's gesture information and the semantic features of the text information, thereby determining the gesture intention classification corresponding to the user's gesture information.
  • gestures can have different meanings in different scenarios, that is, a gesture represents different user intentions in different scenarios.
  • the action of "swipe up” can express the gesture intention of "turning the page up” in one scenario, and can express the gesture intention of "hello” in another scenario.
  • the multi-modal classification model accurately identifies the gesture intention corresponding to the user's gesture information by fusing the semantic features of the text information of the user's input speech with the user's gesture information.
  • the multimodal classification model can be implemented using any existing multimodal image classification model, or other multimodal alignment models can be used to implement the function of correcting image classification results based on text information.
  • the user's gesture information can be identified based on the user's image data in the previous period.
  • For example, a time-series convolutional neural network can be used to perform feature extraction and gesture classification on the user's image data in multiple video frames, so as to identify in real time the gestures made by the user.
  • identifying the user's gesture information based on the user's image data in the previous period can be implemented using any existing gesture recognition algorithm that realizes the function of identifying the user's gesture based on the user's image data, which will not be described again here.
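  • One possible shape of such a time-series network is sketched below in PyTorch; it is only an assumed example of a temporal convolution over per-frame features, since this application does not prescribe a specific architecture.
```python
# Hypothetical temporal-convolution gesture classifier over per-frame image features.
import torch.nn as nn

class TemporalGestureClassifier(nn.Module):
    def __init__(self, feat_dim=256, num_gestures=10):
        super().__init__()
        self.temporal = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1),  # convolve over time
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                              # pool the time axis away
        )
        self.head = nn.Linear(128, num_gestures)

    def forward(self, frame_feats):        # frame_feats: (batch, time, feat_dim)
        x = frame_feats.transpose(1, 2)    # -> (batch, feat_dim, time) for Conv1d
        x = self.temporal(x).squeeze(-1)   # -> (batch, 128)
        return self.head(x)                # gesture class logits
```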
  • the sensing system can recognize the gestures shown in Table 1 below:
  • the interactive system can provide a front-end configuration page through which response strategies can be configured to flexibly configure one or more response strategies based on the needs of different specific application scenarios.
  • At least one of the following types of response policies is configured:
  • The four types are: the interruption strategy, the acceptance strategy, starting a new round of dialogue, and no feedback.
  • The first category is the interruption strategy: a strategy that interrupts the avatar's current processing in the dialogue state in which the avatar outputs and the user receives, including one or more strategies in which the avatar actively interrupts the current processing and one or more strategies in which the avatar passively interrupts the current processing.
  • In each interruption strategy, the virtual character can be configured to perform at least one of the following interruption response behaviors: broadcasting an interruption copy and making an interruption action.
  • the interrupting action includes at least one of hand action and facial action.
  • The avatar determines that the user has the intention to interrupt based on the user's gestures. For example, if the user makes an unsafe gesture or a stop gesture, or inputs voice with the intention of stopping, the avatar will immediately interrupt the current conversation and trigger the corresponding interruption strategy. The interruption strategies corresponding to different gestures can be different, and their interruption response behaviors can also be different.
  • Under each interruption strategy, the virtual character will interrupt the current speaking state, wait for the user to speak, and make the corresponding response behaviors according to the specific response mode of the interruption strategy. If the user inputs voice data with definite semantics, a new round of dialogue is started; if the user does not input voice data with definite semantics after a period of time, the current dialogue continues and the virtual character resumes its broadcast.
  • The second category is the acceptance strategy: in the dialogue state in which the user inputs and the avatar receives, the avatar actively responds to the user's gestures to assist the dialogue, but does not affect the user's input.
  • the virtual character can be configured to perform at least one acceptance response behavior of making an acceptance action and broadcasting an acceptance copy, where the acceptance action includes at least one of hand movements and facial movements.
  • When it is judged necessary to take over from the user's dialogue or action, the corresponding takeover strategy is triggered.
  • the other is "action + copywriting", that is, the virtual character not only performs the following actions, but also broadcasts the copywriting to respond to the user.
  • action + copywriting that is, the virtual character not only performs the following actions, but also broadcasts the copywriting to respond to the user.
  • the undertaking actions include facial expressions, gestures, etc.
  • multiple acceptance strategies can be configured, and the acceptance response behaviors of different acceptance strategies can be different.
  • the third category is the strategy of starting a new round of dialogue, that is, calling the basic dialogue system.
  • When the silent time (VAD time) of the user's voice input reaches the silence duration threshold (that is, the VAD threshold), the user's voice input ends and the basic dialogue system is called to reply directly to the user.
  • the silent duration threshold is usually 800ms, which can be configured and adjusted according to the actual application scenario.
  • the fourth category is a no-feedback strategy: no feedback and maintain the current state.
  • Each type of response strategy includes one or more response strategies, and each response strategy includes corresponding gesture intention classification, response time and response mode.
  • four types of response strategies can be flexibly configured through the front-end page: the interruption strategy of the avatar actively or passively interrupting the current processing, the avatar's acceptance strategy of actively accepting the user, starting a new round of dialogue, and no feedback.
  • Decision-making among these four types of response strategies can realize duplex interaction capabilities of timely acceptance and interruption based on user gestures, giving virtual characters multi-modal (voice and gesture) interaction capabilities, improving the degree of anthropomorphism of virtual characters, and making the communication process between virtual characters and people smoother and more intelligent.
  • the first target strategy corresponding to the gesture intention classification is determined according to the gesture intention classification corresponding to the user's gesture information.
  • The first target strategy is one of the acceptance strategy, starting a new round of dialogue, or no feedback; according to the first target strategy corresponding to the gesture intention classification, the corresponding driving data is determined, and the driving data is used to drive the virtual character to perform the response behavior corresponding to the first target strategy.
  • the types and specific contents of the acceptance response behaviors included in the acceptance strategies corresponding to different gesture intention categories may be different.
  • the interruption strategy is usually not used to respond.
  • The response strategy adopted can be one of the acceptance strategy, starting a new round of dialogue, or no feedback, instead of the interruption strategy, which avoids affecting the user's normal input; the user's gesture intention can also be handled in a timely manner, which improves the degree of anthropomorphism of the virtual character, increases the user's enthusiasm for continued interaction, and improves the smoothness and intelligence of the interaction between the virtual character and the user.
  • The first driving data is determined according to the acceptance strategy, and the first driving data is used to drive the virtual character to perform at least one acceptance response behavior of making an acceptance action and broadcasting an acceptance copy, wherein the acceptance action includes at least one of hand actions and facial actions.
  • the acceptance strategy may include at least one of the following acceptance response behaviors: making an acceptance action and broadcasting the acceptance copy.
  • When it is judged necessary to take over from the user, the corresponding takeover strategy is triggered. One mode is 'action takeover only', in which the virtual character responds only with takeover actions; the other is 'action + copy', in which the virtual character not only performs the takeover action but also broadcasts the copy to respond to the user.
  • the execution timing of various types of acceptance response behaviors can also be configured in the acceptance strategy.
  • The execution timing of any acceptance response behavior may include the following: execution immediately, execution after a specified period of time, or execution after the user's input is completed.
  • different takeover response behaviors in the same takeover strategy can be configured with different execution timings.
  • the corresponding acceptance strategy can be: immediately make a "smile” expression and a gesture indicating happiness, and broadcast it after the user input is completed. "I'm very happy to receive your compliment” copywriting.
  • the corresponding acceptance strategy can be: immediately make a "smile” expression and a gesture indicating happiness, and immediately broadcast the message "thank you” Undertake copywriting.
  • the acceptance copy to be broadcast immediately is usually set to short content, such as "Uh-huh”, “Yes”, “Yes”, “Hmm”, “Oh”, etc.
  • The acceptance copy will not affect the user's normal speech input.
  • Acceptance response processing such as making the acceptance action and broadcasting the acceptance copy is performed in a timely manner, which improves the degree of anthropomorphism of the virtual character, can increase the user's enthusiasm for continued interaction, and improves the smoothness and intelligence of the interaction between the virtual character and the user.
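  • The timing rules above could be scheduled as in the sketch below; the behavior strings, timing labels and scheduler interface are assumptions made for the example, with the 'compliment' case taken from the text.
```python
# Hypothetical scheduling of acceptance response behaviors with configured timings.
ACCEPT_PRAISE = [
    {"behavior": "expression:smile", "timing": "immediate"},
    {"behavior": "gesture:happy",    "timing": "immediate"},
    {"behavior": "copy:I'm very happy to receive your compliment",
     "timing": "after_input_end"},
]

def schedule_acceptance(behaviors, scheduler):
    for item in behaviors:
        if item["timing"] == "immediate":
            scheduler.run_now(item["behavior"])
        elif item["timing"] == "after_input_end":
            scheduler.run_when_input_ends(item["behavior"])
        else:                                    # e.g. a delay in milliseconds
            scheduler.run_after(item["timing"], item["behavior"])
```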
  • the second target strategy corresponding to the gesture intention classification is determined according to the gesture intention classification corresponding to the user's gesture information.
  • The second target strategy is one of the interruption strategy, starting a new round of dialogue, or no feedback; according to the second target strategy corresponding to the gesture intention classification, the second driving data is determined, and the second driving data is used to drive the virtual character to perform the response behavior corresponding to the second target strategy.
  • When the avatar outputs and the user receives, the avatar does not need to take over from the user, and the user can interrupt the avatar's current output, that is, interrupt the current conversation state, so that the user can get the information he or she needs faster.
  • the response strategy adopted may be one of interruption strategy, starting a new round of dialogue, or no feedback, but usually the acceptance strategy is not adopted.
  • The interruption strategy may include at least one interruption response behavior of making an interruption action and broadcasting an interruption copy, wherein the interruption action includes at least one of hand movements and facial movements.
  • the execution timing of various interrupt response behaviors can also be configured in the interrupt policy.
  • the execution timing of any interrupt response behavior may include the following: execution immediately, execution after a specified period of time, or execution after user input is completed.
  • different interrupt response behaviors in the same interrupt policy can be configured with different execution timings.
  • The corresponding interruption strategy can be: the avatar immediately interrupts the current broadcast, immediately makes an expression and gesture indicating confusion, and immediately broadcasts the interruption copy 'Do you have any questions?'
  • During the avatar's broadcast, when it is determined that the user has the intention to interrupt, for example the user makes an unsafe gesture or a stop gesture, or inputs voice with the intention of stopping, the avatar will interrupt the current speaking state, wait for the user to speak, or actively ask the other party for the reason for the interruption, make some actions, etc., which achieves duplex capabilities.
  • The virtual character has the ability to actively or passively interrupt its current broadcast and to make post-interruption response behaviors in response to the user's gestures, so as to guide the subsequent conversation process, making the interaction between virtual characters and users smoother and more intelligent.
  • The next round of dialogue will then be started, and dialogue processing will be performed based on the semantic information of the user's voice input.
  • the first duration is generally set to a short duration so that the user will not feel a long pause.
  • The first duration can be set and adjusted according to the needs of the actual application scenario, such as a few hundred milliseconds, one second, or even a few seconds, and is not specifically limited here.
  • The interrupted virtual character's current output can be resumed after a pause of a second duration, so as to give the user enough time for voice input.
  • the second duration can be hundreds of milliseconds, 1 second, or even several seconds, etc., and can be set and adjusted according to the needs of actual application scenarios, and is not specifically limited here.
  • If voice input with semantic information from the user is received, a new round of dialogue is started; if no such voice input is received, the system can pause for a certain period of time and then continue the avatar's previous broadcast, so as to prevent the interruption response behavior from affecting the normal interaction between the avatar and the user, and to improve the smoothness and intelligence of the interaction between the avatar and the user.
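  • The wait-then-resume logic described above might be expressed as in the sketch below; the durations and the callable names are illustrative assumptions, not values specified by this application.
```python
# Hypothetical post-interruption flow: wait briefly for meaningful user speech,
# start a new dialogue round if it arrives, otherwise pause and resume the broadcast.
FIRST_DURATION_S = 1.0    # how long to wait for the user to speak after interrupting
SECOND_DURATION_S = 1.0   # extra pause before resuming the interrupted broadcast

def after_interruption(wait_for_user_speech, start_new_dialogue, resume_broadcast, sleep):
    text = wait_for_user_speech(timeout=FIRST_DURATION_S)
    if text:                          # voice input with definite semantics received
        start_new_dialogue(text)
    else:
        sleep(SECOND_DURATION_S)      # give the user a little more time
        resume_broadcast()            # continue the avatar's previous output
```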
  • FIG. 3 is a flow chart of a method for driving a virtual character to accept users according to an embodiment of the present application.
  • the user's gestures can be recognized in real time, and the virtual character is driven to perform response processing based on the user's gestures.
  • the specific steps of this method are as follows:
  • Step S301 Obtain a three-dimensional image rendering model of the virtual character to provide interactive services to users using the virtual character.
  • the three-dimensional image rendering model of the virtual character includes the rendering data required to realize the rendering of the virtual character.
  • the three-dimensional image rendering model based on the virtual character can render the skeletal data of the virtual character into the three-dimensional image of the virtual character displayed to the user.
  • the method provided in this embodiment can be applied in scenarios where virtual characters interact with people, using virtual characters with three-dimensional images to realize real-time interaction functions between machines and people, so as to provide intelligent services to people.
  • Step S302 In a round of dialogue between the virtual character and the user, during the user's voice input process, the voice data input by the user and the user's image data are obtained in real time.
  • the input voice stream is obtained in real time and the voice data input by the user is obtained; the video stream from the user can also be monitored in real time, and the video frames are sampled according to the preset frequency. Get the user's image data.
  • the user's image data includes image data in the video frame in which the user appears, including the user's face image and images of the arms and part of the body appearing in the video frame.
  • Step S303 When it is detected that the silence duration of the voice input is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, the user's gesture information is identified based on the user's image data in the previous period.
  • The previous period is from the time when the silence duration was last greater than or equal to the preset duration to the current moment.
  • When the silence duration of the voice data input by the user is greater than or equal to the preset duration but less than the silence duration threshold, it means that a relatively long silence has occurred during the user's input, but this voice input has not yet ended.
  • A duplex response process is performed based on the user's image data in the previous period, so that the virtual character can respond to the user's gestures in a timely manner to guide the subsequent conversation process, making the interaction between the virtual character and the user smoother and more intelligent.
  • the preset duration is a shorter duration that is less than the silence duration threshold.
  • The silence duration threshold is the silence duration used to determine whether the user's current round of input has ended: when the silence duration of the user's voice input reaches the silence duration threshold, the user's current round of voice input is determined to have ended.
  • the silent duration threshold can be 800ms
  • the preset duration can be 200ms. The preset duration can be set and adjusted according to the needs of the actual application scenario, and is not specifically limited here.
  • The user's gesture information is recognized based on the user's image data in the previous period, thereby identifying in real time the current gesture made by the user.
  • the user's gesture information is identified based on the user's image data in the previous period.
  • The user's gesture information can be identified through a time-series convolutional neural network, which performs feature extraction and gesture classification on the user's image data in multiple video frames to identify in real time the gestures made by the user.
  • identifying the user's gesture information based on the user's image data in the previous period can be implemented using any existing gesture recognition algorithm that realizes the function of identifying the user's gesture based on the user's image data, which will not be described again here.
  • the sensing system may recognize the gestures shown in Table 1 above.
  • Step S304 According to the user's gesture information, if it is determined that the user has made a gesture that needs to be accepted, the driving data of the virtual character is determined.
  • It is determined whether the user's current gesture belongs to a gesture that needs to be accepted; if it does, corresponding driving data is generated according to the acceptance response strategy corresponding to the user's current gesture.
  • the driving data includes all the driving parameters required to drive the virtual character to execute the response strategy corresponding to the user's current gesture, so as to realize the facial driving and action driving of the virtual character.
  • For example, if the acceptance response strategy corresponding to the user's current gesture includes the virtual character making a prescribed expression, the driving data includes expression driving parameters; if it includes the virtual character making a prescribed action, the driving data includes action driving parameters; if it includes the virtual character broadcasting prescribed words, the driving data includes voice driving parameters; and if it includes multiple response modes among expressions, words and actions, the driving data includes the corresponding multiple driving parameters, which can drive the virtual character to perform the response behaviors corresponding to the response strategy.
  • For gestures that require an acceptance response, that is, gestures that need to be accepted, corresponding acceptance response strategies can be pre-configured.
  • the virtual character can be configured to perform at least one acceptance response behavior of making an acceptance action and broadcasting an acceptance copy, where the acceptance action includes at least one of hand movements and facial movements.
  • the other is "action + copywriting", that is, the virtual character not only performs the following actions, but also broadcasts the copywriting to respond to the user.
  • the undertaking actions include facial expressions, gestures, etc.
  • multiple acceptance strategies can be configured, and the acceptance response behaviors of different acceptance strategies can be different.
  • the execution timing of various types of acceptance response behaviors can also be configured in the acceptance policy.
  • The execution timing of any acceptance response behavior may include the following: execution immediately, execution after a specified period of time, or execution after the user's input is completed.
  • different takeover response behaviors in the same takeover strategy can be configured with different execution timings.
  • The corresponding acceptance strategy can be: immediately make a 'smile' expression and a gesture indicating happiness, and broadcast the copy 'I'm very happy to receive your compliment' after the user's input is completed.
  • The corresponding acceptance strategy can be: immediately make a 'smile' expression and a gesture indicating happiness, and immediately broadcast the acceptance copy 'Thank you'.
  • the acceptance copy to be broadcast immediately is usually set to short content, such as "Uh-huh”, “Yes”, “Yes”, “Hmm”, “Oh”, etc.
  • The acceptance copy will not affect the user's normal speech input.
  • Step S305 According to the driving data and the three-dimensional image rendering model of the virtual character, drive the virtual character to perform an acceptance response behavior corresponding to the user's gesture information.
  • The skeletal model of the virtual character is driven according to the driving data to obtain the skeletal data corresponding to the response behavior, and the skeletal data is rendered according to the three-dimensional image rendering model of the virtual character to obtain the virtual character image data corresponding to the acceptance response behavior.
  • the voice data input by the user and the user's image data are obtained in real time during the user's voice input; when it is detected that the silence duration of the voice input is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the user's gesture information will be recognized in real time based on the user's image data in the previous period.
  • based on the user's gesture information, if it is determined that the user has made a gesture that needs to be accepted, the virtual character is driven to respond to the user's current gesture, so that the virtual character in the output video stream performs the corresponding acceptance response behavior; this adds real-time recognition of the user's gestures and drives the virtual character to respond to them in a timely manner, improving the degree of anthropomorphism of the virtual character and making the interaction between the virtual character and people smoother and more intelligent.
  • step S304 can be implemented by using the following steps:
  • Step S3041 Convert the voice data input by the user in the previous period into corresponding text information.
  • the voice data input by the user in the previous period is input into the ASR module, and the voice data is converted into corresponding text information through the ASR module.
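  • a rough sketch of how the silence-based segmentation and ASR conversion might be wired together is given below; vad.silence_ms, asr.transcribe and audio_buffer.flush are assumed interfaces standing in for the actual voice-activity-detection and speech-recognition components, and the 200 ms / 800 ms values merely echo the example durations mentioned in this application.

      PRESET_SILENCE_MS = 200   # short cut point, as in the 200 ms example
      END_OF_TURN_MS = 800      # end-of-turn threshold, as in the 800 ms example

      def on_audio_tick(vad, asr, audio_buffer):
          # Called periodically while the user is speaking.
          silence = vad.silence_ms(audio_buffer)            # assumed VAD interface
          if silence < PRESET_SILENCE_MS:
              return None                                   # keep accumulating speech
          text = asr.transcribe(audio_buffer.flush())       # assumed ASR interface
          # Either the turn has ended, or a speech unit is cut while input continues.
          return {"end_of_turn": silence >= END_OF_TURN_MS, "text": text}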
  • Step S3042 Determine the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and text information.
  • a gesture can have different intentions in different scenarios.
  • the user's gesture information in the previous period and the text information of the user's input voice data are combined to perform multi-modal classification and determine the gesture intention classification corresponding to the user's gesture information in the previous period.
  • gesture intention categories that require duplex response can be pre-configured, and a response strategy corresponding to each gesture intention category can be configured.
  • the text information and the user's image data in the previous period are input into the trained multi-modal classification model.
  • through the multi-modal classification model, the user's gesture information is recognized based on the user's image data in the previous period and the semantic features of the text information are extracted.
  • multi-modal classification processing is performed to determine the gesture intention classification corresponding to the user's gesture information.
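  • as one hedged illustration of such a multi-modal classifier, the PyTorch-style sketch below concatenates a text-semantic feature vector with a gesture feature vector before classification; the layer sizes, feature dimensions and number of intention classes are arbitrary assumptions, and any existing multi-modal classification model could be substituted.

      import torch
      import torch.nn as nn

      class GestureIntentClassifier(nn.Module):
          # Toy fusion classifier: text semantics + gesture features -> intention class.
          def __init__(self, text_dim=256, gesture_dim=128, num_intents=10):
              super().__init__()
              self.fuse = nn.Sequential(
                  nn.Linear(text_dim + gesture_dim, 256), nn.ReLU(),
                  nn.Linear(256, num_intents),
              )

          def forward(self, text_feat, gesture_feat):
              # Concatenate the two modalities, then classify the gesture intention.
              return self.fuse(torch.cat([text_feat, gesture_feat], dim=-1))

      # Usage sketch with random tensors standing in for real text / gesture encoders.
      model = GestureIntentClassifier()
      logits = model(torch.randn(1, 256), torch.randn(1, 128))
      intent_id = logits.argmax(dim=-1)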
  • gestures can have different meanings in different scenarios, that is, a gesture represents different user intentions in different scenarios.
  • the action of "swipe up” can express the gesture intention of "turning the page up” in one scenario, and can express the gesture intention of "hello” in another scenario.
  • the multi-modal classification model accurately identifies the gesture intention corresponding to the user's gesture information by fusing the semantic features of the text information of the user's input speech with the user's gesture information.
  • the multimodal classification model can be implemented using any existing multimodal image classification model, or other multimodal alignment models can be used to implement the function of correcting image classification results based on text information.
  • the user's gesture information can be identified based on the user's image data in the previous period.
  • a time-series convolutional neural network can be used to perform feature extraction and gesture classification on the user's image data in multiple video frames, so as to identify in real time the gestures made by the user.
  • identifying the user's gesture information based on the user's image data in the previous period can be implemented using any existing gesture recognition algorithm capable of identifying the user's gestures from image data, which will not be described again here.
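  • for the time-series convolutional network mentioned above, a minimal sketch could look as follows; the input is a short clip of per-frame features, and the feature dimension, clip length and number of gesture classes are illustrative assumptions rather than the concrete network used.

      import torch
      import torch.nn as nn

      class TemporalGestureNet(nn.Module):
          # Toy temporal CNN over a clip of per-frame features (batch, frames, feat_dim).
          def __init__(self, feat_dim=64, num_gestures=8):
              super().__init__()
              self.temporal = nn.Sequential(
                  nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool1d(1),        # pool over the time axis
              )
              self.head = nn.Linear(128, num_gestures)

          def forward(self, frame_feats):
              x = frame_feats.transpose(1, 2)     # -> (batch, feat_dim, frames)
              x = self.temporal(x).squeeze(-1)    # -> (batch, 128)
              return self.head(x)                 # gesture class logits

      clip = torch.randn(1, 16, 64)               # 16 sampled video frames of features
      gesture_logits = TemporalGestureNet()(clip)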
  • Step S3043 If the gesture intention classification corresponding to the user's gesture information belongs to the gesture intention classification that needs to be accepted, it is determined that the user has made the gesture that needs to be accepted, and the driving data of the virtual character is determined according to the gesture intention classification corresponding to the user's gesture information.
  • the driving data is used to drive the virtual character to perform the acceptance response behavior corresponding to the gesture intention classification corresponding to the user's gesture information.
  • the manner of determining the driving data is consistent with step S205, in which the driving data of the virtual character is determined according to the gesture intention classification corresponding to the user's gesture information in the dialogue state where the user inputs and the virtual character receives, and will not be described again in this embodiment.
  • the voice data input by the user in the previous period is converted into corresponding text information, and the text information of the user's input voice is integrated so that the gesture intention corresponding to the user's gesture information is identified accurately and in real time, and the virtual character is driven to perform the corresponding acceptance response behavior based on the user's gesture intention to guide the subsequent dialogue process, making the interaction between the virtual character and the user smoother and more intelligent.
  • Figure 4 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by an exemplary embodiment of the present application.
  • the multi-modal data-based virtual character driving system provided by the embodiments of the present application can execute the processing flow provided by the multi-modal data-based virtual character driving method embodiment.
  • the virtual character driving system 40 based on multi-modal data includes: a multi-modal input module 41, a voice processing module 42, an image processing module 43 and a driving control module 44.
  • the drive control module 44 is used to obtain a three-dimensional image rendering model of the virtual character so as to use the virtual character to provide interactive services to the user.
  • the multi-modal input module 41 is used to obtain the voice data input by the user and the user's image data in real time during a round of dialogue between the virtual character and the user.
  • the voice processing module 42 is configured to, when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, convert the voice data input by the user in the previous period into corresponding text information,
  • the previous period is from the time when the last silence duration was greater than or equal to the preset duration to the current moment;
  • the image processing module 43 identifies the user's gesture information based on the user's image data in the previous period, and determines the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and the text information.
  • the drive control module 44 is also used to determine the corresponding driving data according to the gesture intention classification corresponding to the user's gesture information and the current dialogue state, and to drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
  • when identifying the user's gesture information based on the user's image data in the previous period, and determining the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and text information, the image processing module 43 is further used to:
  • the user's gesture information is recognized based on the user's image data in the previous period and the semantic features of the text information are extracted.
  • perform multi-modal classification processing to determine the gesture intention classification corresponding to the user's gesture information.
  • the virtual character driving system 40 based on multi-modal data further includes: a strategy configuration module 45.
  • the strategy configuration module 45 is configured to, in response to a response strategy configuration operation, configure at least one of the following types of response strategies: an interruption strategy in which the virtual character actively or passively interrupts the current processing, an acceptance strategy in which the virtual character actively accepts the user, starting a new round of dialogue, and no feedback.
  • Each type of response strategy includes one or more response strategies, and each response strategy includes corresponding gesture intention classification, response time and response mode.
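  • purely for illustration, a strategy table of this kind might be stored as below, with one entry per gesture intention classification giving its strategy category, response time and response mode; all labels are hypothetical.

      # Hypothetical response-strategy table; one entry per gesture intention classification.
      RESPONSE_STRATEGIES = [
          {"intent": "stop_gesture", "category": "interrupt", "time": "immediate",
           "mode": ["stop_broadcast", "ask_reason"]},
          {"intent": "praise", "category": "accept", "time": "after_user_input",
           "mode": ["smile", "broadcast_copy"]},
          {"intent": "page_up", "category": "new_dialogue", "time": "immediate",
           "mode": ["call_basic_dialogue_system"]},
          {"intent": "unknown", "category": "no_feedback", "time": None, "mode": []},
      ]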
  • when determining the corresponding driving data according to the gesture intention classification corresponding to the user's gesture information and the current dialogue state, as shown in FIG. 5, the driving control module 44 includes: a response decision unit 441 and a drive control unit 442, where the response decision unit 441 is used to determine, if the current dialogue state is a state in which the user inputs and the virtual character receives, the first target strategy corresponding to the gesture intention classification according to the gesture intention classification corresponding to the user's gesture information; the first target strategy is one of the acceptance strategy, starting a new round of dialogue, or no feedback.
  • the drive control unit 442 is used to determine the corresponding drive data according to the first target strategy corresponding to the gesture intention classification, and the drive data is used to drive the virtual character to perform the response behavior corresponding to the first target strategy.
  • when determining the corresponding driving data according to the first target strategy corresponding to the gesture intention classification, the drive control unit 442 is further configured to: if the first target strategy is an acceptance strategy, determine first driving data according to the acceptance strategy, where the first driving data is used to drive the virtual character to perform at least one acceptance response behavior of making an acceptance action and broadcasting an acceptance copy, and the acceptance action includes at least one of hand movements and facial movements.
  • the drive control unit is further configured to determine, if the current dialogue state is a state in which the virtual character outputs and the user receives, the second target strategy corresponding to the gesture intention classification according to the gesture intention classification corresponding to the user's gesture information.
  • the second target strategy is one of the interruption strategy, starting a new round of dialogue, or no feedback; second driving data is determined according to the second target strategy corresponding to the gesture intention classification, and the second driving data is used to drive the virtual character to perform the response behavior corresponding to the second target strategy.
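  • a condensed sketch of this decision flow is given below; the dialogue-state names, strategy categories and helper structure are assumptions for illustration only.

      def decide_strategy(dialogue_state, gesture_intent, strategies):
          # Pick the configured strategy for the recognised gesture intention, then
          # filter it by the current dialogue state (illustrative rules only).
          match = next((s for s in strategies if s["intent"] == gesture_intent), None)
          if match is None:
              return {"category": "no_feedback"}
          if dialogue_state == "user_speaking":      # user inputs, virtual character receives
              allowed = {"accept", "new_dialogue", "no_feedback"}
          else:                                      # virtual character outputs, user receives
              allowed = {"interrupt", "new_dialogue", "no_feedback"}
          return match if match["category"] in allowed else {"category": "no_feedback"}

      # Usage sketch with a one-entry table.
      decide_strategy("user_speaking", "praise",
                      [{"intent": "praise", "category": "accept"}])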
  • Figure 6 is a schematic architectural diagram of a virtual character driving system based on multi-modal data provided by another exemplary embodiment of the present application.
  • the multi-modal data-based virtual character driving system provided by the embodiments of the present application can execute the processing flow provided by the multi-modal data-based virtual character driving method embodiment.
  • the virtual character driving system 60 based on multimodal data includes: a multimodal input module 61, a perception module 62 and a decision driving module 63.
  • the decision-making driving module 63 is used to obtain a three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to the user.
  • the multi-modal input module 61 is used to obtain the voice data input by the user and the image data of the user in real time during a conversation between the virtual character and the user.
  • the sensing module 62 is configured to, when it is detected that the silence duration of the voice input is greater than or equal to the preset duration and it is determined that the voice input has not ended, identify the user's gesture information based on the user's image data in the previous period, where the previous period is from the time when the last silence duration was greater than or equal to the preset duration to the current moment.
  • the decision-making driving module 63 is used to determine the driving data of the virtual character according to the user's gesture information if it is determined that the user has made a gesture that needs to be accepted, and to drive the virtual character to perform the acceptance response behavior corresponding to the user's gesture information based on the driving data and the three-dimensional image rendering model of the virtual character.
  • when determining the driving data of the virtual character based on the user's gesture information if it is determined that the user has made a gesture that needs to be accepted, the sensing module 62 is further used to: convert the voice data input by the user in the previous period into corresponding text information; and determine the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and the text information.
  • the decision-making driving module 63 is also used to determine that the user has made a gesture that needs to be accepted if the gesture intention classification corresponding to the user's gesture information belongs to the gesture intention classifications that need to be accepted, and to determine the driving data of the virtual character according to the gesture intention classification corresponding to the user's gesture information.
  • the driving data is used to drive the virtual character to perform the acceptance response behavior corresponding to the gesture intention classification corresponding to the user's gesture information.
  • FIG. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
  • the electronic device 70 includes: a processor 701, and a memory 702 communicatively connected to the processor 701.
  • the memory 702 stores computer execution instructions.
  • the processor executes the computer execution instructions stored in the memory to implement the solution provided by any of the above method embodiments.
  • the specific functions and the technical effects that can be achieved will not be described again here.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores computer execution instructions; when the computer execution instructions are executed by a processor, they are used to implement the solution provided by any of the above method embodiments. The specific functions and the technical effects that can be achieved will not be described again here.
  • Embodiments of the present application also provide a computer program product.
  • the computer program product includes: a computer program.
  • the computer program is stored in a readable storage medium.
  • At least one processor of the electronic device can read the computer program from the readable storage medium.
  • the at least one processor executes the computer program so that the electronic device executes the solution provided by any of the above method embodiments. The specific functions and technical effects that can be achieved will not be described again here.

Abstract

The present application relates to the fields of artificial intelligence, deep learning, machine learning, virtual reality, etc. in the computer technology, and provides a virtual character driving method and system based on multimodal data, and a device. The method of the present application comprises: during a round of dialogue between a virtual character and a user, obtaining voice data input by the user and image data of the user in real time; when it is detected that a silence duration of the voice data input by the user is greater than or equal to a preset duration and a voice input is not finished, converting voice data in the previous time period into corresponding text information; recognizing gesture information of the user according to image data of the user in the previous time period, and determining, according to the gesture information of the user and the text information, corresponding gesture intention classification to recognize a gesture intention of the user in real time; and driving, on the basis of the gesture intention of the user, the virtual character to perform a corresponding response behavior in time. Therefore, the degree of personification of the virtual character is improved, and the interaction between the virtual character and a person is smoother and more intelligent.

Description

基于多模态数据的虚拟人物驱动方法、系统及设备Virtual character driving method, system and device based on multi-modal data
本申请要求于2022年05月23日提交中国专利局、申请号为202210567637.3、申请名称为“基于多模态数据的虚拟人物驱动方法、系统及设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application requests the priority of the Chinese patent application submitted to the China Patent Office on May 23, 2022, with the application number 202210567637.3 and the application name "Virtual character driving method, system and device based on multi-modal data", and its entire content incorporated herein by reference.
技术领域Technical field
本申请涉及计算机技术中的人工智能、深度学习、机器学习、虚拟现实等领域,尤其涉及一种基于多模态数据的虚拟人物驱动方法、系统及设备。This application relates to artificial intelligence, deep learning, machine learning, virtual reality and other fields in computer technology, and in particular to a virtual character driving method, system and device based on multi-modal data.
背景技术Background technique
随着虚拟现实技术的发展,基于虚拟人物的人机交互在人们生活中的应用越来越普及,如可以广泛应用于智能客服、虚拟导师、智能家庭医生、虚拟主播等场景中。在真人面对面进行对话时,是可以基于对方的手势动作等信息进行及时地反馈的。现有的虚拟人物与人的交互中,用户(真人)大多通过终端触控屏显示的系统页面,通过文字输入或触发页面控件等方式进行留言、点赞、翻页等与虚拟人物的交互,缺乏虚拟人物与真人进行“面对面”交互的技术手段,导致虚拟人物的拟人化程度低,虚拟人物与人的交互过程不顺畅、不智能。With the development of virtual reality technology, human-computer interaction based on virtual characters is becoming more and more popular in people's lives. It can be widely used in scenarios such as smart customer service, virtual tutors, smart family doctors, and virtual anchors. When real people have face-to-face conversations, timely feedback can be provided based on the other party's gestures and movements and other information. In the existing interaction between virtual characters and people, users (real people) mostly interact with the virtual characters through text input or triggering page controls through the system page displayed on the terminal touch screen, such as leaving messages, liking, turning pages, etc. The lack of technical means for "face-to-face" interaction between virtual characters and real people results in a low degree of anthropomorphism of virtual characters, and the interaction process between virtual characters and people is not smooth and intelligent.
发明内容Contents of the invention
本申请提供一种基于多模态数据的虚拟人物驱动方法、系统及设备,用以解决虚拟人物拟人化程度低,虚拟人物与人的交互过程不顺畅、不智能的问题。This application provides a virtual character driving method, system and device based on multi-modal data to solve the problem of low anthropomorphism of virtual characters and unsmooth and unintelligent interaction between virtual characters and people.
第一方面,本申请提供一种基于多模态数据的虚拟人物驱动方法,包括:In the first aspect, this application provides a virtual character driving method based on multi-modal data, including:
获取虚拟人物的三维形象渲染模型以利用虚拟人物提供对用户的交互服务;Obtain the three-dimensional image rendering model of the virtual character to use the virtual character to provide interactive services to users;
在虚拟人物与用户的一轮对话过程中,实时获取用户输入的语音数据和所述用户的图像数据;During a round of dialogue between the virtual character and the user, the voice data input by the user and the user's image data are obtained in real time;
当检测到所述用户输入的语音数据的静默时长大于或等于预设时长时,若确定所述语音输入未结束,则将上一时段内所述用户输入的语音数据转换为对应的文本信息,所述上一时段为自上一次静默时长大于或等于预设时长的时刻至当前时刻;When it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the voice data input by the user in the previous period is converted into corresponding text information, The previous period is from the time when the last silence duration was greater than or equal to the preset duration to the current time;
根据所述上一时段内所述用户的图像数据识别所述用户的手势信息,并根据所述用户的手势信息和所述文本信息,确定所述用户的手势信息对应的手势意图分类;Identify the user's gesture information according to the user's image data in the previous period, and determine the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and the text information;
根据所述用户的手势信息对应的手势意图分类,以及当前的对话状态,确定对应的驱 动数据;According to the gesture intention classification corresponding to the user's gesture information and the current conversation status, the corresponding driver is determined. moving data;
根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为。According to the driving data and the three-dimensional image rendering model of the virtual character, the virtual character is driven to perform the corresponding response behavior.
第二方面,本申请提供一种基于多模态数据的虚拟人物驱方法,包括:In the second aspect, this application provides a virtual character driving method based on multi-modal data, including:
获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务;Obtain the three-dimensional image rendering model of the virtual character to use the virtual character to provide interactive services to users;
在虚拟人物与用户的一轮对话中,在用户进行语音输入的过程中,实时获取用户输入的语音数据和所述用户的图像数据;In a round of dialogue between the virtual character and the user, during the user's voice input process, the voice data input by the user and the image data of the user are obtained in real time;
当检测到所述语音输入的静默时长大于或等于预设时长时,若确定所述语音输入未结束,则根据上一时段内所述用户的图像数据识别所述用户的手势信息,所述上一时段为自上一次静默时长大于或等于预设时长的时刻至当前时刻;When it is detected that the silence duration of the voice input is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the user's gesture information is identified based on the user's image data in the previous period. A period of time is from the time when the last silence duration was greater than or equal to the preset duration to the current moment;
根据所述用户的手势信息,若确定所述用户做出了需承接的手势,则确定虚拟人物的驱动数据;According to the user's gesture information, if it is determined that the user has made a gesture that needs to be accepted, then determine the driving data of the virtual character;
根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行所述用户的手势信息对应的承接响应行为。According to the driving data and the three-dimensional image rendering model of the virtual character, the virtual character is driven to perform the acceptance response behavior corresponding to the user's gesture information.
第三方面,本申请提供一种基于多模态数据的虚拟人物驱动系统,包括:In the third aspect, this application provides a virtual character driving system based on multi-modal data, including:
驱动控制模块,用于获取虚拟人物的三维形象渲染模型以利用虚拟人物提供对用户的交互服务;The driver control module is used to obtain the three-dimensional image rendering model of the virtual character to provide interactive services to the user using the virtual character;
多模态输入模块,用于在虚拟人物与用户的一轮对话过程中,实时获取用户输入的语音数据和所述用户的图像数据;The multi-modal input module is used to obtain the voice data input by the user and the image data of the user in real time during a round of dialogue between the virtual character and the user;
语音处理模块,用于当检测到所述用户输入的语音数据的静默时长大于或等于预设时长时,若确定所述语音输入未结束,则将上一时段内所述用户输入的语音数据转换为对应的文本信息,所述上一时段为自上一次静默时长大于或等于预设时长的时刻至当前时刻;A voice processing module configured to convert the voice data input by the user in the previous period when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration and if it is determined that the voice input has not ended. is the corresponding text information, and the previous period is from the time when the last silence duration was greater than or equal to the preset duration to the current time;
图像处理模块,用于根据所述上一时段内所述用户的图像数据识别所述用户的手势信息,并根据所述用户的手势信息和所述文本信息,确定所述用户的手势信息对应的手势意图分类;An image processing module, configured to identify the user's gesture information based on the user's image data in the previous period, and determine the user's gesture information corresponding to the user's gesture information based on the user's gesture information and the text information. Gesture intent classification;
所述驱动控制模块还用于根据所述用户的手势信息对应的手势意图分类,以及当前的对话状态,确定对应的驱动数据;根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为。The drive control module is also used to determine the corresponding drive data according to the gesture intention classification corresponding to the user's gesture information and the current conversation state; and drive the virtual character according to the drive data and the three-dimensional image rendering model of the virtual character. Execute the corresponding response behavior.
第四方面,本申请提供一种基于多模态数据的虚拟人物驱动系统,包括:In the fourth aspect, this application provides a virtual character driving system based on multi-modal data, including:
决策驱动模块,用于获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务;The decision-driven module is used to obtain the three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to users;
多模态输入模块,用于在虚拟人物与用户的一轮对话中,在用户进行语音输入的过程中,实时获取用户输入的语音数据和所述用户的图像数据;The multi-modal input module is used to obtain the voice data input by the user and the image data of the user in real time during a conversation between the virtual character and the user;
感知模块,用于当检测到所述语音输入的静默时长大于或等于预设时长时,若确定所述语音输入未结束,则根据上一时段内所述用户的图像数据识别所述用户的手势信息,所 述上一时段为自上一次静默时长大于或等于预设时长的时刻至当前时刻;A sensing module configured to, when it is detected that the silence duration of the voice input is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, identify the user's gesture according to the user's image data in the previous period. information, all The above-mentioned previous period is from the time when the last silence duration was greater than or equal to the preset duration to the current moment;
所述决策驱动模块还用于根据所述用户的手势信息,若确定所述用户做出了需承接的手势,则确定虚拟人物的驱动数据;根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行所述用户的手势信息对应的承接响应行为。The decision-making driving module is also used to determine the driving data of the virtual character according to the user's gesture information, if it is determined that the user has made a gesture that needs to be accepted; according to the driving data and the three-dimensional image rendering model of the virtual character , driving the virtual character to perform the acceptance response behavior corresponding to the user's gesture information.
第五方面,本申请提供一种电子设备,包括:处理器,以及与所述处理器通信连接的存储器;In a fifth aspect, this application provides an electronic device, including: a processor, and a memory communicatively connected to the processor;
所述存储器存储计算机执行指令;The memory stores computer execution instructions;
所述处理器执行所述存储器存储的计算机执行指令,以实现上述第一方面或第二方面所述的方法。The processor executes computer execution instructions stored in the memory to implement the method described in the first aspect or the second aspect.
第六方面,本申请提供一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机执行指令,所述计算机执行指令被处理器执行时用于实现上述第一方面或第二方面所述的方法。In a sixth aspect, the present application provides a computer-readable storage medium that stores computer-executable instructions, which when executed by a processor are used to implement the above-mentioned first or second aspect. the method described.
本申请提供的基于多模态数据的虚拟人物驱动方法、系统及设备,通过在虚拟人物与用户的一轮对话过程中,实时获取用户输入的语音数据和用户的图像数据;当检测到用户输入的语音数据的静默时长大于或等于预设时长时,若确定语音输入未结束,则将上一时段内用户输入的语音数据转换为对应的文本信息,根据上一时段内用户的图像数据识别用户的手势信息,并根据用户的手势信息和文本信息,确定用户的手势信息对应的手势意图分类,从而能够实时地识别用户手势的手势意图,并基于用户手势的手势意图和当前的对话状态,驱动虚拟人物执行对应的响应行为,使得输出视频流中虚拟人物做出对应的响应行为,增加用户手势的实时识别能力,并且驱动虚拟人物针对用户的手势意图做出及时地响应,实现虚拟人物与真人用户进行“面对面”交互,提高了虚拟人物拟人化程度,使得虚拟人物与人的交互更顺畅、更智能。The virtual character driving method, system and device based on multi-modal data provided by this application obtain the voice data input by the user and the image data of the user in real time during a round of dialogue between the virtual character and the user; when user input is detected When the silence duration of the voice data is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the voice data input by the user in the previous period will be converted into corresponding text information, and the user will be identified based on the user's image data in the previous period The gesture information of the user is determined, and the gesture intention classification corresponding to the user's gesture information is determined based on the user's gesture information and text information, so that the gesture intention of the user's gesture can be recognized in real time, and based on the gesture intention of the user's gesture and the current conversation state, the driver The virtual character performs the corresponding response behavior, causing the virtual character in the output video stream to perform the corresponding response behavior, increasing the real-time recognition ability of the user's gestures, and driving the virtual character to respond promptly to the user's gesture intention, realizing the virtual character's interaction with the real person Users conduct "face-to-face" interactions, which improves the degree of anthropomorphism of virtual characters and makes the interaction between virtual characters and people smoother and more intelligent.
附图说明Description of the drawings
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本申请的实施例,并与说明书一起用于解释本申请的原理。The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
图1为本申请提供的一示例性的虚拟人物与人的交互系统的框架图;Figure 1 is a framework diagram of an exemplary virtual character-human interaction system provided by this application;
图2为本申请一实施例提供的基于多模态数据的虚拟人物驱动方法流程图;Figure 2 is a flow chart of a virtual character driving method based on multi-modal data provided by an embodiment of the present application;
图3为本申请一实施例提供的实现驱动虚拟人物承接用户的方法流程图;Figure 3 is a flow chart of a method for driving a virtual character to accept users provided by an embodiment of the present application;
图4为本申请一示例性实施例提供的基于多模态数据的虚拟人物驱动系统的结构示意图;Figure 4 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by an exemplary embodiment of the present application;
图5为本申请另一示例性实施例提供的基于多模态数据的虚拟人物驱动系统的结构示意图;Figure 5 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by another exemplary embodiment of the present application;
图6为本申请另一示例性实施例提供的基于多模态数据的虚拟人物驱动系统的结构示意图; Figure 6 is a schematic structural diagram of a virtual character driving system based on multi-modal data provided by another exemplary embodiment of the present application;
图7为本申请一示例实施例提供的电子设备的结构示意图。FIG. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
通过上述附图,已示出本申请明确的实施例,后文中将有更详细的描述。这些附图和文字描述并不是为了通过任何方式限制本申请构思的范围,而是通过参考特定实施例为本领域技术人员说明本申请的概念。Through the above-mentioned drawings, clear embodiments of the present application have been shown, which will be described in more detail below. These drawings and text descriptions are not intended to limit the scope of the present application's concepts in any way, but are intended to illustrate the application's concepts for those skilled in the art with reference to specific embodiments.
具体实施方式Detailed ways
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的系统和方法的例子。Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of systems and methods consistent with aspects of the present application as detailed in the appended claims.
首先对本申请所涉及的名词进行解释:First, the terms involved in this application will be explained:
多模态交互:用户可通过文字、语音、表情等方式与虚拟人物交流,虚拟人物可以理解用户文字、语音、表情等信息,并可以反过来通过文字、语音、表情等方式与用户进行交流。Multi-modal interaction: Users can communicate with virtual characters through text, voice, expressions, etc. The virtual characters can understand user text, voice, expressions and other information, and can in turn communicate with users through text, voice, expressions, etc.
手势交互:用户可通过手势与虚拟人物进行交流,虚拟人物也可以通过手势等方式对用户进行回复。Gesture interaction: Users can communicate with virtual characters through gestures, and virtual characters can also reply to users through gestures and other methods.
双工交互:实时的、双向的交互方式,用户可以随时打断虚拟人物,虚拟人物也可以在必要的时候打断正在说话的自己。Duplex interaction: A real-time, two-way interaction method. The user can interrupt the virtual character at any time, and the virtual character can also interrupt himself who is speaking when necessary.
承接:虚拟人物在与人类用户的对话过程,在用户输入虚拟人物接收的对话状态下,虚拟人物可对用户进行即时反馈,如点头、微笑和轻声应和等,但又不打断用户的输入,以引导后续的对话流程。Undertake: During the conversation between the avatar and the human user, when the user inputs the avatar to receive the dialogue state, the avatar can provide instant feedback to the user, such as nodding, smiling, and softly responding, without interrupting the user's input. , to guide the subsequent conversation process.
打断:虚拟人物在与人类用户的对话过程中,一方可以随时中止另一方的对话,开起新一轮的交互。Interruption: During a conversation between a virtual character and a human user, one party can interrupt the other party's conversation at any time and start a new round of interaction.
语音活动检测(Voice Activity Detection,简称VAD),又称语音端点检测,语音边界检测,是一种检测语音输入的静默时长的技术。Voice Activity Detection (VAD), also known as voice endpoint detection and voice boundary detection, is a technology that detects the duration of silence in voice input.
语音合成(Text To Speech,简称TTS):是一种将文本转化为语音的技术。Text To Speech (TTS): It is a technology that converts text into speech.
本申请提供的基于多模态数据的虚拟人物驱动方法,涉及计算机技术中的人工智能、深度学习、机器学习、虚拟现实等领域,具体可以应用于虚拟人物与人的交互的场景中。The virtual character driving method based on multi-modal data provided by this application involves artificial intelligence, deep learning, machine learning, virtual reality and other fields in computer technology, and can be specifically applied to scenarios in which virtual characters interact with people.
示例性地,常见的虚拟人物与人类交互的场景包括:智能客服、政务咨询、生活服务、智慧交通、虚拟陪伴人、虚拟主播、虚拟教师、网络游戏等等。For example, common scenarios in which virtual characters interact with humans include: intelligent customer service, government consultation, life services, smart transportation, virtual companions, virtual anchors, virtual teachers, online games, etc.
针对现有虚拟人物拟人化程度低,导致沟通过程不顺畅、不智能的问题,本申请提供一种基于多模态数据的虚拟人物驱动方法,通过在虚拟人物与用户的一轮对话过程中,实时获取用户输入的语音数据和用户的图像数据。当检测到用户输入的语音数据的静默时长大于或等于预设时长时,若确定语音输入未结束,则将本轮对话中用户输入的语音数据转换为对应的文本信息。根据本轮对话中用户的图像数据识别用户的手势信息,并根 据用户的手势信息和文本信息,确定用户的手势信息对应的手势意图分类,从而精准地、实时地识别用户的手势意图。并在确定用户手势的手势意图分类后,基于用户的手势信息对应的手势意图分类,以及当前的对话状态,确定对应的驱动数据,根据驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为。从而使得虚拟人物能够针对用户的手势和语音输入及时地做出相应的响应,使其具备多模态交互能力,提高了虚拟人物拟人化程度,使得虚拟人物与人的沟通过程更顺畅、更智能。In order to solve the problem of low anthropomorphism of existing virtual characters, resulting in unsmooth and unintelligent communication processes, this application provides a virtual character driving method based on multi-modal data. During a round of dialogue between the virtual character and the user, Acquire the voice data input by the user and the user's image data in real time. When it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, the voice data input by the user in this round of dialogue is converted into corresponding text information. Recognize the user's gesture information based on the user's image data in this round of dialogue, and based on According to the user's gesture information and text information, the gesture intention classification corresponding to the user's gesture information is determined, thereby accurately and real-time identifying the user's gesture intention. After determining the gesture intention classification of the user's gesture, the corresponding driving data is determined based on the gesture intention classification corresponding to the user's gesture information and the current conversation state, and the virtual character is driven to execute based on the driving data and the three-dimensional image rendering model of the virtual character. Corresponding response behavior. This enables the virtual character to respond promptly to the user's gestures and voice input, giving it multi-modal interaction capabilities, improving the degree of anthropomorphism of the virtual character, and making the communication process between the virtual character and people smoother and more intelligent. .
本申请提供的基于多模态数据的虚拟人物驱动方法可以应用于虚拟人物与人的交互系统中。图1为一示例性的虚拟人物与人的交互系统的框架图,如图1所示,虚拟人物与人的交互系统包括以下子系统:感知系统、多模态双工状态管理系统、驱动控制系统和基础对话系统。The virtual character driving method based on multi-modal data provided by this application can be applied to the interactive system between virtual characters and people. Figure 1 is a framework diagram of an exemplary interactive system between virtual characters and people. As shown in Figure 1, the interactive system between virtual characters and people includes the following subsystems: perception system, multi-modal duplex state management system, drive control system and basic dialogue system.
其中,感知系统负责接收语音、图像等多模态信息的输入,以及对输入的语音、图像数据的切分、识别等数据处理,得到识别结果,并将识别结果提供给多模态双工状态管理系统。多模态双工状态管理系统负责管理当前对话的状态,基于识别结果和当前对话的状态进行双工响应状态的决策处理,得到包含响应策略的决策结果。驱动控制系统负责基于多模态双工状态管理系统的决策结果进行虚拟人物驱动、渲染等处理,生成虚拟人物的视频流,并输出视频流。基础对话系统负责实现基础的人机对话能力,也即根据用户输入的问题生成对应的答复信息。Among them, the perception system is responsible for receiving the input of multi-modal information such as voice and images, and processing the data such as segmentation and recognition of the input voice and image data, obtaining the recognition results, and providing the recognition results to the multi-modal duplex state. management system. The multi-modal duplex state management system is responsible for managing the state of the current conversation, performing decision-making processing of the duplex response state based on the recognition result and the state of the current conversation, and obtaining a decision result including a response strategy. The drive control system is responsible for performing virtual character driving, rendering and other processing based on the decision-making results of the multi-modal duplex state management system, generating the virtual character's video stream, and outputting the video stream. The basic dialogue system is responsible for realizing basic human-machine dialogue capabilities, that is, generating corresponding reply information based on questions input by the user.
具体地,感知系统作为交互系统的输入端,负责控制交互系统中视频流和语音流的输入,实现对输入的语音流和视频流进行切分和识别等功能。具体地,对于语音流,为了保证该交互系统即使在用户说话的时候,虚拟人物也可以进行一些及时的反馈,感知系统会基于一个较短的预设时长(如200ms)的静默时间(也即VAD时间)对语音流进行切分,在用户一次语音输入中,每当产生大于或等于预设时长的静默时间时,进行语音流的切分,生成上一时段内的一个语音单元,将语音单元输入到自动语音识别(Automatic Speech Recognition,简称ASR)模块,通过ASR模块将语音单元将其转换成文本信息,最终输入到多模态处理和对齐模块。对于视频流,感知系统根据本轮对话中用户的图像数据识别用户的手势信息,将手势信息输入到多模态处理和对齐模块,多模态处理和对齐模块结合用户的手势信息和语音单元的文本信息,确定手势信息对应的手势意图分类。Specifically, as the input end of the interactive system, the perception system is responsible for controlling the input of video streams and voice streams in the interactive system, and implementing functions such as segmenting and identifying the input voice streams and video streams. Specifically, for the voice stream, in order to ensure that the interactive system can provide some timely feedback to the virtual character even when the user is speaking, the perception system will be based on a short preset duration (such as 200ms) of silence time (i.e. VAD time) to segment the voice stream. In a user's voice input, whenever a silence time greater than or equal to the preset duration occurs, the voice stream is segmented to generate a voice unit in the previous period, and the voice The unit is input to the Automatic Speech Recognition (ASR) module, which converts the speech unit into text information through the ASR module, and is finally input to the multi-modal processing and alignment module. For video streaming, the perception system recognizes the user's gesture information based on the user's image data in this round of dialogue, and inputs the gesture information into the multi-modal processing and alignment module. The multi-modal processing and alignment module combines the user's gesture information and the speech unit's Text information, determine the gesture intention classification corresponding to the gesture information.
示例性地,需要向用户反馈的手势可以包含三大类:分别为有明确意义的手势动作(如OK、数字和左滑右滑等)、不安全手势(如竖中指和比小拇指等)、自定义特殊动作。For example, gestures that need to be fed back to the user can include three major categories: gestures with clear meanings (such as OK, numbers, left and right swipes, etc.), unsafe gestures (such as middle finger and little thumb), Customize special moves.
多模态双工状态管理系统基于实时识别的用户的手势意图分类,并结合当前的对话状态进行双工响应状态的决策,确定手势意图分类对应的双工响应状态,不同的双工响应状态对应不同的响应策略,从而确定手势意图分类对应的响应策略,得到决策结果。The multi-modal duplex state management system is based on the real-time recognition of the user's gesture intention classification, and combines the current dialogue status to make decisions on the duplex response status, determine the duplex response status corresponding to the gesture intention classification, and determine the duplex response status corresponding to different duplex response statuses. Different response strategies are used to determine the response strategy corresponding to the gesture intention classification and obtain the decision-making result.
示例性地,双工响应状态可以包括双工主动\被动打断、双工主动承接、调用基础对话系统和无反馈这4种状态,分别对应于虚拟人物主动或被动打断当前处理的打断策略、虚拟人物主动承接用户的承接策略、开启新一轮对话(也即调用基础对话系统)、和无反馈 这4类响应策略。For example, the duplex response state can include four states: duplex active/passive interruption, duplex active acceptance, calling the basic dialogue system, and no feedback, which respectively correspond to the virtual character actively or passively interrupting the current processing. strategy, virtual characters actively take over the user's strategy, start a new round of dialogue (that is, call the basic dialogue system), and no feedback These four types of response strategies.
其中,双工主动承接:当判断需要承接用户的对话或动作时,触发对应的承接策略。承接的方式至少包含如下两种:一种是仅“动作承接”,指的是虚拟人物不做口头的承接回复,仅做出承接动作响应用户,不影响其它的对话状态。另一种是“动作+文案承接”,也即虚拟人物不仅做出承接动作,而且播报承接文案来响应用户。Among them, duplex active acceptance: when it is judged that it is necessary to accept the user's dialogue or action, the corresponding acceptance strategy is triggered. There are at least two ways to take over: one is "action takeover only", which means that the virtual character does not make a verbal takeover reply, but only responds to the user by making takeover actions, without affecting other conversation states. The other is "action + copywriting", that is, the virtual character not only performs the following actions, but also broadcasts the copywriting to respond to the user.
双工主动\被动打断:在虚拟人物播报过程中,当判断用户有打断意图,如用户做出不安全手势、停止手势,输入具有停止意图的语音等,会立即主动打断当前对话。在打断策略下,虚拟人物会打断当前的说话状态,等待用户说话,或者主动询问对方打断的原因。如果用户输入具有确定语义的语音数据,则开启新一轮对话;如果一段时间后用户没有输入具有确定语义的语音数据,则继续当前对话,虚拟人物继续播报。Duplex active\passive interruption: During the virtual character broadcasting process, when it is judged that the user has the intention to interrupt, such as the user making unsafe gestures, stop gestures, inputting voice with the intention of stopping, etc., the current conversation will be actively interrupted immediately. Under the interruption strategy, the avatar will interrupt the current speaking state, wait for the user to speak, or actively ask the other party the reason for the interruption. If the user inputs voice data with definite semantics, a new round of dialogue will be started; if the user does not input voice data with definite semantics after a period of time, the current dialogue will be continued and the virtual character will continue to broadcast.
调用基础对话系统:当用户语音输入的静默时间(VAD时间)达到静默时长阈值(也即VAD阈值)时,用户语音输入接收,调用基础对话系统直接回复用户。这是虚拟人物与人的交互系统的基本功能,此处不再赘述。其中,静默时长阈值(也即VAD阈值)通常为800ms,可以根据实际应用场景进行配置和调整。Call the basic dialogue system: When the silent time (VAD time) of the user's voice input reaches the silence duration threshold (that is, the VAD threshold), the user's voice input is received and the basic dialogue system is called to directly reply to the user. This is the basic function of the interaction system between virtual characters and people, and will not be described in detail here. Among them, the silent duration threshold (that is, the VAD threshold) is usually 800ms, which can be configured and adjusted according to the actual application scenario.
无反馈:不作任何反馈,维持当前状态。No feedback: No feedback is given and the current status is maintained.
驱动控制系统具体包括以下三部分:1)流式TTS(Text To Speech,从文本到语音)部分,将决策结果中的文本输出合成音频流。2)驱动部分,包含两个子模块,面部驱动模块和动作驱动模块,其中,面部驱动模块根据决策结果中待输出的语音流驱动虚拟人物输出准确的口型,并生成面部驱动数据;动作驱动模块根据决策结果中待输出的动作标签,驱动虚拟人物做出准确的动作,并生成动作驱动数据,如动作混合形状(blendshape)驱动模型。3)渲染合成部分,将驱动、流式TTS等部分的输出进行渲染并合成虚拟人物的视频流。The drive control system specifically includes the following three parts: 1) The streaming TTS (Text To Speech) part, which synthesizes the text output in the decision results into an audio stream. 2) The driving part includes two sub-modules, the face driving module and the action driving module. Among them, the face driving module drives the virtual character to output accurate mouth shape according to the voice stream to be output in the decision result, and generates face driving data; the action driving module According to the action tag to be output in the decision result, the virtual character is driven to make accurate actions, and action-driven data is generated, such as an action blend shape (blendshape) driven model. 3) The rendering and synthesis part renders the output of the driver, streaming TTS and other parts and synthesizes the video stream of the virtual character.
基础对话系统:包含基本的业务逻辑,具备基本的对话交互能力,也即输入用户的问题,基础对话系统输出该问题的答案。具体地,基础对话系统通常包括:NLU(Natural Language Understanding,自然语言理解)模块、DM(Dialog Management,对话管理)模块和NLG(Natural Language Generation,自然语言生成)模块。其中,业务逻辑是基于用户输入的问题查询获取到答复信息中所需的数据内容的查询逻辑。例如,用户问题是“我的身高是160cm,我应该传什么尺码”,答案信息为“您应该穿M码”,答案信息中的“M码”是基于身高160cm查询业务逻辑得到的。Basic dialogue system: Contains basic business logic and has basic dialogue interaction capabilities, that is, input the user's question, and the basic dialogue system outputs the answer to the question. Specifically, basic dialogue systems usually include: NLU (Natural Language Understanding) module, DM (Dialog Management) module and NLG (Natural Language Generation) module. Among them, the business logic is the query logic that obtains the data content required in the reply information based on the question query entered by the user. For example, the user question is "My height is 160cm, what size should I send?" and the answer information is "You should wear M size". The "M size" in the answer information is obtained by querying the business logic based on the height of 160cm.
其中,NLU模块用于对文本信息进行识别理解,转换成计算机可理解的结构化语义表示或者意图标签。DM模块用于维护和更新当前的对话状态,并决策下一步系统动作。NLG模块用于将系统输出的状态转换成可理解的自然语言文本。Among them, the NLU module is used to identify and understand text information and convert it into a computer-understandable structured semantic representation or intent label. The DM module is used to maintain and update the current dialogue status and decide on the next system action. The NLG module is used to convert the status output by the system into understandable natural language text.
基于上述的感知系统、多模态双工状态管理系统、驱动控制系统和基础对话系统,通过加入视频流和对应视觉理解模块,让用户可以通过手势与虚拟人物进行交互。在本方案中,可以实时感知有明确意义的动作(如点赞和左滑、右滑等)、不安全手势(如竖中指 和比小拇指等)和自定义特殊动作三大类动作。此外,通过增加感知系统、多模态双工状态管理系统和驱动控制系统,让对话变成基于用户手势可随时承接或打断当前对话的对话形式。当前的双工响应状态包含双工主动承接、双工主动\被动打断、调用基础对话系统和无反馈4种状态,分别对应虚拟人物主动或被动打断当前处理的打断策略、虚拟人物主动承接用户的承接策略、开启新一轮对话(也即调用基础对话系统)、和无反馈这4类响应策略,通过在这4类响应策略之间决策,能够实现对用户手势的理解,实现基于用户手势进行承接、打断和基本问答的能力,使得虚拟人物具备多模态(语音和手势)交互能力,提高了虚拟人物拟人化程度,使得虚拟人物与人的沟通过程更顺畅、更智能。Based on the above-mentioned perception system, multi-modal duplex state management system, drive control system and basic dialogue system, by adding video streams and corresponding visual understanding modules, users can interact with virtual characters through gestures. In this solution, actions with clear meaning (such as likes, left swipes, right swipes, etc.), unsafe gestures (such as middle finger gestures) can be sensed in real time. (such as the little finger, etc.) and customized special actions. In addition, by adding a perception system, a multi-modal duplex state management system and a drive control system, the dialogue becomes a dialogue form that can take over or interrupt the current dialogue at any time based on user gestures. The current duplex response state includes four states: duplex active acceptance, duplex active\passive interruption, calling the basic dialogue system, and no feedback, which respectively correspond to the interruption strategy of the virtual character actively or passively interrupting the current processing, and the virtual character actively interrupting. Taking over the user's acceptance strategy, starting a new round of dialogue (that is, calling the basic dialogue system), and no feedback are four types of response strategies. By deciding between these four types of response strategies, the understanding of the user's gestures can be achieved and the implementation based on The ability of user gestures to undertake, interrupt and basic question and answer enables virtual characters to have multi-modal (voice and gesture) interaction capabilities, improves the degree of anthropomorphism of virtual characters, and makes the communication process between virtual characters and people smoother and more intelligent.
下面以具体地实施例对本申请的技术方案以及本申请的技术方案如何解决上述技术问题进行详细说明。下面这几个具体的实施例可以相互结合,对于相同或相似的概念或过程可能在某些实施例中不再赘述。下面将结合附图,对本申请的实施例进行描述。The technical solution of the present application and how the technical solution of the present application solves the above technical problems will be described in detail below with specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of the present application will be described below with reference to the accompanying drawings.
图2为本申请一实施例提供的基于多模态数据的虚拟人物驱动方法流程图。本实施例提供的基于多模态数据的虚拟人物驱动方法具体可以应用于具有使用虚拟人物实现与人类交互功能的电子设备,该电子设备可以是对话机器人、终端或服务器等,在其他实施例中,电子设备还可以采用其他设备实现,本实施例此处不做具体限定。Figure 2 is a flow chart of a virtual character driving method based on multi-modal data provided by an embodiment of the present application. The virtual character driving method based on multi-modal data provided in this embodiment can be specifically applied to electronic devices that have the function of using virtual characters to interact with humans. The electronic device can be a conversation robot, a terminal or a server, etc. In other embodiments , the electronic device can also be implemented using other devices, and this embodiment is not specifically limited here.
如图2所示,该方法具体步骤如下:As shown in Figure 2, the specific steps of this method are as follows:
步骤S201、获取虚拟人物的三维形象渲染模型以利用虚拟人物提供对用户的交互服务。Step S201: Obtain a three-dimensional image rendering model of the virtual character to use the virtual character to provide interactive services to the user.
其中,虚拟人物的三维形象渲染模型包括实现虚拟人物渲染所需的渲染数据,基于虚拟人物的三维形象渲染模型可以将虚拟人物的骨骼数据渲染成呈现给用户时展示的虚拟人物的三维形象。Among them, the three-dimensional image rendering model of the virtual character includes the rendering data required to realize the rendering of the virtual character. The three-dimensional image rendering model based on the virtual character can render the skeletal data of the virtual character into the three-dimensional image of the virtual character displayed to the user.
本实施例提供的方法,可以应用于虚拟人物与人交互的场景中,利用具有三维形象的虚拟人物,实现机器与人的实时交互功能,以向人提供智能服务。The method provided in this embodiment can be applied in scenarios where virtual characters interact with people, using virtual characters with three-dimensional images to realize real-time interaction functions between machines and people, so as to provide intelligent services to people.
步骤S202、在虚拟人物与用户的一轮对话过程中,实时获取用户输入的语音数据和用户的图像数据。Step S202: During a round of dialogue between the virtual character and the user, the voice data input by the user and the user's image data are obtained in real time.
本实施例中,在虚拟人物与用户的一轮对话过程中,实时获取输入的语音流,得到用户输入的语音数据;还可以实时地监测来自用户的视频流,按照预设频率采样视频帧,得到用户的图像数据。In this embodiment, during a round of dialogue between the virtual character and the user, the input voice stream is obtained in real time and the voice data input by the user is obtained; the video stream from the user can also be monitored in real time, and the video frames are sampled according to the preset frequency. Get the user's image data.
其中,用户的图像数据中包含用户出现视频帧内的图像数据,包括用户的人脸图像以及出现在视频帧内的手臂及部分躯体的图像。Among them, the user's image data includes image data in the video frame in which the user appears, including the user's face image and images of the arms and part of the body appearing in the video frame.
示例性地,可以由上述图1所示的交互系统框架中的手势感知系统来实时获取输入语音流和视频流。For example, the input voice stream and video stream can be acquired in real time by the gesture sensing system in the interactive system framework shown in Figure 1 above.
步骤S203、当检测到用户输入的语音数据的静默时长大于或等于预设时长时,若确定语音输入未结束,则将上一时段内用户输入的语音数据转换为对应的文本信息,上一时段为自上一次静默时长大于或等于预设时长的时刻至当前时刻。Step S203: When it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, the voice data input by the user in the previous period is converted into corresponding text information. It is the time since the last silence duration was greater than or equal to the preset duration to the current moment.
对于获取到的用户的语音数据实时进行语音活动检测,确定语音数据的静默时长。当 用户输入的语音数据的静默时长大于或等于预设时长时,若语音数据的静默时长小于静默时长阈值,说明用户输入过程中产生较长时间的静默,但是本次的语音输入尚未结束。这种情况下基于上一时段内的语音数据和用户的图像数据进行一次双工响应处理,使得虚拟人物针对上一时段内用户做出的手势及时地做出响应行为,以引导后续的对话流程,使得虚拟人物与用户的交互更加流畅、更加智能。Perform real-time voice activity detection on the acquired user's voice data to determine the length of silence of the voice data. when When the silence duration of the voice data input by the user is greater than or equal to the preset duration, if the silence duration of the voice data is less than the silence duration threshold, it means that a long period of silence occurs during the user input process, but this voice input has not yet ended. In this case, a duplex response process is performed based on the voice data and the user's image data in the previous period, so that the virtual character can respond in time to the gestures made by the user in the previous period to guide the subsequent dialogue process. , making the interaction between virtual characters and users smoother and more intelligent.
其中,预设时长为一个小于静默时长阈值的较短时长,静默时长阈值为判断用户本轮输入是否结束的静默时长,当用户语音输入的静默时长达到静默时长阈值,则确定用户本轮语音输入结束。例如静默时长阈值可以为800ms,预设时长可以为200ms。预设时长可以根据实际应用场景的需要进行设置和调整,此处不做具体限定。Among them, the preset duration is a shorter duration that is less than the silence duration threshold. The silence duration threshold is the silence duration used to determine whether the user's current round of input has ended. When the silence duration of the user's voice input reaches the silence duration threshold, the user's current round of voice input is determined. Finish. For example, the silent duration threshold can be 800ms, and the preset duration can be 200ms. The preset duration can be set and adjusted according to the needs of the actual application scenario, and is not specifically limited here.
示例性地,当检测到用户输入的语音数据的静默时长大于或等于预设时长时,若确定语音输入未结束,则将上一时段内用户输入的语音数据输入ASR模块,通过ASR模块将语音数据转换为对应的文本信息。For example, when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, and if it is determined that the voice input has not ended, the voice data input by the user in the previous period is input into the ASR module, and the voice data is transferred to the ASR module through the ASR module. The data is converted into corresponding text information.
示例性地,感知系统将语音流按照一个预设时长(如200ms)的静默时间(也即VAD时间)对语音流进行切分,分成一个一个的小的语音单元,一个语音单元为相邻对应两次静默时长达到预设时长的时刻之间的语音数据,将每一个语音单元输入到自动语音识别ASR模块,通过ASR模块将语音单元将其转换成文本信息。For example, the perception system divides the speech stream into small speech units one by one according to a silence time (that is, VAD time) of a preset length (such as 200ms), and one speech unit corresponds to the adjacent one. For the voice data between two moments when the silence duration reaches the preset duration, each voice unit is input to the automatic speech recognition ASR module, and the voice unit is converted into text information through the ASR module.
Step S204: Identify the user's gesture information from the user's image data in the previous period, and determine the gesture intention category corresponding to the user's gesture information based on the gesture information and the text information.
In this embodiment, the user's image data in the previous period is acquired and gesture recognition is performed on it to identify the user's gesture information.
In practical applications, a gesture may carry different meanings in different scenarios; that is, the same gesture can express different user intentions depending on the scenario. In this step, the user's gesture information from the previous period is combined with the text information of the user's voice input for multimodal classification, and the gesture intention category corresponding to the user's gesture information in the previous period is determined, so that the meaning of the user's gesture is identified accurately.
Illustratively, the gesture intention categories that require a duplex response can be configured in advance in this embodiment, together with a response strategy for each gesture intention category. Configuring response strategies per gesture intention category, rather than per raw gesture, allows the virtual character to respond more precisely to the user's actual intention and increases its degree of anthropomorphism.
Step S205: Determine the corresponding driving data according to the gesture intention category corresponding to the user's gesture information and the current dialogue state.
The current dialogue state is one of the following two states: the state in which the user provides input and the virtual character receives it, and the state in which the virtual character produces output and the user receives it.
In this embodiment, the response strategies may include the following four categories: interruption strategies in which the virtual character actively or passively interrupts its current processing, takeover strategies in which the virtual character actively takes over from the user, starting a new round of dialogue, and no feedback.
Each category includes one or more response strategies, and each response strategy includes a corresponding gesture intention category, a response time, and a response mode. The specific content of each response strategy can be configured according to the needs of the actual application scenario and is not specifically limited here.
In the dialogue state where the user provides input and the virtual character receives it, the response strategy adopted may be a takeover strategy, starting a new round of dialogue, or no feedback. Since actual application scenarios usually have no need for the virtual character to interrupt the user's input, an interruption strategy is generally not used in this state.
In the dialogue state where the virtual character produces output and the user receives it, the virtual character does not need to take over from the user, while the user may interrupt the virtual character's current output, that is, interrupt the current dialogue state, so that the user can obtain the needed information more quickly. Therefore, in this state the response strategy adopted may be an interruption strategy, starting a new round of dialogue, or no feedback, while a takeover strategy is generally not used.
After the gesture intention category of the user's gesture information in the previous period has been identified in real time, this step determines the current response strategy from that gesture intention category combined with the current dialogue state, and generates the driving data of the virtual character according to the current response strategy. The driving data includes all the driving parameters required to drive the virtual character to execute the response strategy corresponding to the gesture intention category, realizing both facial driving and action driving of the virtual character.
Illustratively, if the response strategy corresponding to the gesture intention category includes the virtual character making a prescribed expression, the driving data includes expression driving parameters; if it includes the virtual character making a prescribed gesture action, the driving data includes action driving parameters; if it includes the virtual character broadcasting prescribed speech, the driving data includes voice driving parameters; and if it combines several response modes among expressions, speech, and actions, the driving data includes the corresponding multiple driving parameters, which can drive the virtual character to perform the response behavior corresponding to the response strategy.
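As a minimal sketch of how such driving data might be organized, the structure below simply groups the expression, action, and voice driving parameters named above; the field names and the build_driving_data() helper are assumptions made for illustration only.

```python
# Illustrative grouping of the driving parameters a response strategy can imply.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DrivingData:
    expression_params: Optional[dict] = None  # facial expression driving
    action_params: Optional[dict] = None      # gesture/body action driving
    voice_params: Optional[dict] = None       # phrase or speech to broadcast

def build_driving_data(strategy: dict) -> DrivingData:
    """Assemble driving parameters from a configured response strategy."""
    return DrivingData(
        expression_params=strategy.get("expression"),
        action_params=strategy.get("action"),
        voice_params=strategy.get("speech"),
    )
```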
Step S206: Drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
After the corresponding driving data has been determined from the gesture intention category of the user's gesture information, the skeleton model of the virtual character is driven according to the driving data to obtain the skeleton data corresponding to the response behavior, and the skeleton data is rendered with the three-dimensional detailed rendering model of the virtual character to obtain the virtual character image data corresponding to the response behavior. By rendering the virtual character image data into the output video stream, the virtual character in the output video stream performs the corresponding response behavior, realizing a multimodal duplex interaction capability in which the virtual character responds to the user's gestures in a timely manner.
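A rough sketch of this drive-and-render step is shown below; the skeleton_model and render_model interfaces and the push_frame() method are hypothetical stand-ins for whatever animation and rendering stack is actually used.

```python
# Sketch of the drive-and-render pipeline: driving data -> skeleton data ->
# rendered character images pushed into the output video stream.
def apply_response(driving_data, skeleton_model, render_model, output_stream):
    # 1. Drive the skeleton: driving parameters -> per-frame skeleton data.
    skeleton_frames = skeleton_model.drive(driving_data)
    # 2. Render each skeleton frame with the 3D character rendering model.
    for skeleton in skeleton_frames:
        image = render_model.render(skeleton)
        # 3. Write the rendered character image into the output video stream.
        output_stream.push_frame(image)
```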
In this embodiment, during a round of dialogue between the virtual character and the user, the voice data input by the user and the user's image data are acquired in real time. When it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the voice data input by the user in the previous period is converted into corresponding text information, the user's gesture information is identified from the user's image data in the previous period, and the gesture intention category corresponding to the user's gesture information is determined from the gesture information and the text information. In this way the gesture intention of the user's gesture can be recognized in real time, and, based on that intention and the current dialogue state, the virtual character is driven to perform the corresponding response behavior, so that the virtual character in the output video stream responds accordingly. This increases the real-time recognition capability for user gestures and drives the virtual character to respond promptly to the user's gesture intention, improving the degree of anthropomorphism of the virtual character and making the interaction between the virtual character and the user smoother and more intelligent.
In an optional embodiment, the above step S204 may use a multimodal classification model to perform multimodal alignment and classification on the user's gesture information and the text information of the user's voice input, determining the user's gesture intention category so that the intention of the user's gesture is identified accurately.
Specifically, the text information and the user's image data from the previous period are input into a trained multimodal classification model. The model identifies the user's gesture information from the image data of the previous period, extracts the semantic features of the text information, and performs multimodal classification based on the gesture information and the semantic features of the text, thereby determining the gesture intention category corresponding to the user's gesture information.
In practical applications, a gesture may carry different meanings in different scenarios; that is, the same gesture expresses different user intentions in different scenarios. For example, a "swipe up" motion may express the gesture intention "turn the page up" in one scenario and the gesture intention "hello" in another.
In this embodiment, the multimodal classification model accurately identifies the gesture intention corresponding to the user's gesture information by fusing the semantic features of the text information of the user's voice input with the user's gesture information.
The multimodal classification model may be implemented with any existing multimodal image classification model, or another multimodal alignment model may be used to correct the image classification result based on the text information.
Illustratively, the user's gesture information may be identified from the user's image data in the previous period by a temporal convolutional neural network that performs feature extraction and gesture classification on the user image data in multiple video frames, recognizing the user's gesture in real time. Alternatively, this identification may be implemented with any existing gesture recognition algorithm capable of recognizing the gesture made by the user from the user's image data, which is not described again here.
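The following is a minimal PyTorch sketch of this idea, fusing a temporal convolution over per-frame gesture features with a semantic embedding of the recognized text; the dimensions, layer choices, and number of intent categories are illustrative assumptions, not the model actually used.

```python
# Sketch of a multimodal gesture-intent classifier: temporal CNN over gesture
# features fused with a text feature vector, followed by a linear classifier.
import torch
import torch.nn as nn

class GestureIntentClassifier(nn.Module):
    def __init__(self, gesture_dim=64, text_dim=128, num_intents=10):
        super().__init__()
        # Temporal convolution over a sequence of per-frame gesture features.
        self.gesture_encoder = nn.Sequential(
            nn.Conv1d(gesture_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, 128), nn.ReLU())
        self.classifier = nn.Linear(128 + 128, num_intents)

    def forward(self, gesture_seq, text_feat):
        # gesture_seq: (batch, frames, gesture_dim); text_feat: (batch, text_dim)
        g = self.gesture_encoder(gesture_seq.transpose(1, 2)).squeeze(-1)
        t = self.text_encoder(text_feat)
        return self.classifier(torch.cat([g, t], dim=-1))  # intent logits
```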
Illustratively, the sensing system can recognize the gestures shown in Table 1 below:
Table 1
In an optional embodiment, the interaction system may provide a front-end configuration page through which response strategies can be configured, so that one or more response strategies can be flexibly set up for the needs of different application scenarios.
Specifically, in response to a response strategy configuration operation, at least one of the following categories of response strategies is configured:
an interruption strategy, a takeover strategy, starting a new round of dialogue, and no feedback.
Specifically, the first category is the interruption strategy: in the dialogue state where the virtual character produces output and the user receives it, a strategy that interrupts the virtual character's current processing, including one or more strategies in which the virtual character actively interrupts its current processing and one or more strategies in which it is passively interrupted.
In each interruption strategy, the virtual character can be configured to perform at least one of the following interruption response behaviors: broadcasting an interruption phrase and making an interruption action, where the interruption action includes at least one of a hand action and a facial action.
Illustratively, during the virtual character's broadcast, when the user has given no voice instruction to interrupt but the virtual character judges from the user's gestures that the user intends to interrupt, for example the user makes an unsafe gesture or a stop gesture, or inputs speech with a stopping intention, the virtual character immediately and actively interrupts the current dialogue and triggers the corresponding interruption strategy; different gestures may map to different interruption strategies and different interruption response behaviors.
Under each interruption strategy, the virtual character interrupts its current speaking state, waits for the user to speak, and performs the corresponding response behavior according to the specific response mode of that strategy. If the user inputs voice data with definite semantics, a new round of dialogue is started; if the user does not input voice data with definite semantics within a period of time, the current dialogue continues and the virtual character resumes its broadcast.
The second category is the takeover strategy: in the dialogue state where the user provides input and the virtual character receives it, the virtual character actively performs takeover response behaviors in reaction to the user's gestures to assist the dialogue without affecting the user's input. A takeover strategy can configure the virtual character to perform at least one takeover response behavior of making a takeover action and broadcasting a takeover phrase, where the takeover action includes at least one of a hand action and a facial action.
Specifically, in the dialogue state where the user provides input and the virtual character receives it, when it is judged that the user's dialogue or action needs to be taken over, the corresponding takeover strategy is triggered. There are at least the following two takeover modes. One is action-only takeover, in which the virtual character gives no verbal reply and responds to the user only with a takeover action, without affecting the rest of the dialogue state. The other is action-plus-phrase takeover, in which the virtual character both makes a takeover action and broadcasts a takeover phrase in response to the user. In some scenarios, a takeover strategy may also be configured to broadcast a takeover phrase only. Takeover actions include facial expressions, gesture actions, and so on.
In this embodiment, multiple takeover strategies can be configured, and different takeover strategies can have different takeover response behaviors.
The third category is starting a new round of dialogue, that is, invoking the basic dialogue system: when the silence time (VAD time) of the user's voice input reaches the silence duration threshold (the VAD threshold), the user's voice input ends and the basic dialogue system is invoked to reply to the user directly. This is the basic function of the interaction system between the virtual character and the user and is not described again here. The silence duration threshold (VAD threshold) is usually 800 ms and can be configured and adjusted according to the actual application scenario.
The fourth category is the no-feedback strategy: no feedback is given and the current state is maintained.
Each category includes one or more response strategies, and each response strategy includes a corresponding gesture intention category, a response time, and a response mode.
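As one hedged illustration of what such a configuration might look like, the list below encodes a few strategies with the fields named above (gesture intention category, response time, response mode); the category names, intent labels, and timing values are assumptions, not the patent's actual configuration.

```python
# Illustrative response-strategy configuration covering the four categories.
RESPONSE_STRATEGIES = [
    {   # interruption strategy: only used while the character is speaking
        "category": "interrupt",
        "gesture_intent": "stop",
        "response_time": "immediate",
        "response": {"action": "puzzled_look", "speech": "Do you have a question?"},
    },
    {   # takeover strategy: only used while the user is speaking
        "category": "takeover",
        "gesture_intent": "thumbs_up",
        "response_time": "after_user_input",
        "response": {"expression": "smile", "speech": "I'm glad you like it."},
    },
    {"category": "new_dialogue", "gesture_intent": None, "response_time": "vad_end", "response": {}},
    {"category": "no_feedback", "gesture_intent": "other", "response_time": None, "response": {}},
]
```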
In this embodiment, the four categories of response strategies, namely the interruption strategies in which the virtual character actively or passively interrupts its current processing, the takeover strategies in which the virtual character actively takes over from the user, starting a new round of dialogue, and no feedback, can be flexibly configured through the front-end page. By deciding among these four categories, a duplex interaction capability of taking over and interrupting in a timely manner based on the user's gestures can be achieved, giving the virtual character multimodal (voice and gesture) interaction capabilities, improving its degree of anthropomorphism, and making the communication between the virtual character and the user smoother and more intelligent.
In an optional embodiment, when step S205 is performed, if the current dialogue state is the state in which the user provides input and the virtual character receives it, a first target strategy corresponding to the gesture intention category is determined according to the gesture intention category corresponding to the user's gesture information, where the first target strategy is one of a takeover strategy, starting a new round of dialogue, or no feedback; the corresponding driving data is then determined according to the first target strategy, and this driving data is used to drive the virtual character to perform the response behavior corresponding to the first target strategy.
The types and specific content of the takeover response behaviors included in the takeover strategies corresponding to different gesture intention categories may differ.
Since actual application scenarios usually have no need for the virtual character to interrupt the user's input, an interruption strategy is generally not used to respond in the dialogue state where the user provides input and the virtual character receives it. In this embodiment, the response strategy adopted in that state is a takeover strategy, starting a new round of dialogue, or no feedback rather than an interruption strategy, which avoids disturbing the user's normal input while still allowing timely takeover processing based on the gesture intention of the user's gesture. This improves the degree of anthropomorphism of the virtual character, increases the user's willingness to keep interacting, and improves the smoothness and intelligence of the interaction between the virtual character and the user.
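A minimal sketch of this decision rule is given below, reusing the kind of strategy entries shown earlier; the state names and lookup logic are assumptions for illustration.

```python
# Sketch of selecting a response strategy from the gesture intent and the
# current dialogue state, following the rules described above.
USER_SPEAKING, CHARACTER_SPEAKING = "user_input", "character_output"

def select_strategy(intent: str, dialogue_state: str, strategies: list) -> dict:
    """Pick the first configured strategy matching the intent and state."""
    # Takeover is only allowed while the user is speaking; interruption is
    # only allowed while the character is speaking.
    allowed = ({"takeover", "new_dialogue", "no_feedback"}
               if dialogue_state == USER_SPEAKING
               else {"interrupt", "new_dialogue", "no_feedback"})
    for strategy in strategies:
        if strategy["category"] in allowed and strategy.get("gesture_intent") == intent:
            return strategy
    return {"category": "no_feedback", "response": {}}
```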
Further, if the first target strategy is a takeover strategy, first driving data is determined according to the takeover strategy, where the first driving data is used to drive the virtual character to perform at least one takeover response behavior of making a takeover action and broadcasting a takeover phrase, and the takeover action includes at least one of a hand action and a facial action.
In this embodiment, a takeover strategy may include at least one of the following takeover response behaviors: making a takeover action and broadcasting a takeover phrase. When it is judged that the user's dialogue or action needs to be taken over, the corresponding takeover strategy is triggered. There are at least the following two takeover modes: action-only takeover, in which the virtual character gives no verbal reply and responds to the user only with a takeover action, without affecting the rest of the dialogue state; and action-plus-phrase takeover, in which the virtual character both makes a takeover action and broadcasts a takeover phrase in response to the user.
Optionally, the execution timing of each takeover response behavior can also be configured in the takeover strategy. Illustratively, the execution timing of any takeover response behavior may be one of the following: executed immediately, executed after a specified period of time, or executed after the user's input ends. Different takeover response behaviors in the same takeover strategy can be configured with different execution timings.
For example, during the user's input, if the gesture intention of the user's gesture is "thumbs up", the corresponding takeover strategy may be: immediately make a "smile" expression and a gesture expressing happiness, and, after the user's input ends, broadcast the takeover phrase "I'm very glad to receive your compliment."
For example, during the user's input, if the gesture intention of the user's gesture is "thumbs up", the corresponding takeover strategy may instead be: immediately make a "smile" expression and a gesture expressing happiness, and immediately broadcast the takeover phrase "Thank you."
It should be noted that a takeover phrase broadcast immediately is usually kept short, for example "mm-hmm", "yes", "right", "hmm", or "oh", so that broadcasting it does not affect the user's normal voice input.
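The sketch below shows how the differently timed takeover behaviors of the "thumbs up" examples might be scheduled; the scheduler interface, the event object, and the timing labels are illustrative assumptions.

```python
# Sketch of scheduling takeover behaviors at their configured timings.
def schedule_takeover(strategy, avatar, user_input_finished):
    """Run each configured takeover behavior at its configured timing."""
    for behavior in strategy["behaviors"]:
        if behavior["timing"] == "immediate":
            avatar.perform(behavior["response"])              # e.g. smile now
        elif behavior["timing"] == "after_user_input":
            user_input_finished.add_listener(                 # defer until input ends
                lambda b=behavior: avatar.perform(b["response"]))

THUMBS_UP_TAKEOVER = {
    "gesture_intent": "thumbs_up",
    "behaviors": [
        {"timing": "immediate",
         "response": {"expression": "smile", "action": "happy_gesture"}},
        {"timing": "after_user_input",
         "response": {"speech": "I'm very glad to receive your compliment."}},
    ],
}
```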
In this embodiment, by making takeover actions and broadcasting takeover phrases in a timely manner based on the gesture intention of the user's gesture, the degree of anthropomorphism of the virtual character is improved, the user's willingness to keep interacting is increased, and the smoothness and intelligence of the interaction between the virtual character and the user are improved.
In an optional embodiment, when step S205 is performed, if the current dialogue state is the state in which the virtual character produces output and the user receives it, a second target strategy corresponding to the gesture intention category is determined according to the gesture intention category corresponding to the user's gesture information, where the second target strategy is one of an interruption strategy, starting a new round of dialogue, or no feedback; second driving data is then determined according to the second target strategy, and this second driving data is used to drive the virtual character to perform the response behavior corresponding to the second target strategy.
In practical applications, during the interaction between the virtual character and the user, in the dialogue state where the virtual character produces output and the user receives it, the virtual character does not need to take over from the user, and the user can interrupt the virtual character's current output, that is, interrupt the current dialogue state, so that the user can obtain the needed information more quickly. In this embodiment, the response strategy adopted in that state may be an interruption strategy, starting a new round of dialogue, or no feedback, while a takeover strategy is generally not adopted.
When an interruption strategy is executed, the virtual character's current processing is interrupted and the virtual character is driven to perform the interruption response behavior corresponding to that strategy. The interruption response behavior may include at least one of making an interruption action and broadcasting an interruption phrase, where the interruption action includes at least one of a hand action and a facial action.
Optionally, the execution timing of each interruption response behavior can also be configured in the interruption strategy. Illustratively, the execution timing of any interruption response behavior may be one of the following: executed immediately, executed after a specified period of time, or executed after the user's input ends. Different interruption response behaviors in the same interruption strategy can be configured with different execution timings.
For example, during the virtual character's broadcast, if the gesture intention of the user's gesture is "stop", the corresponding interruption strategy may be: the virtual character immediately interrupts the current broadcast, immediately makes an expression and gesture of puzzlement, and immediately broadcasts the interruption phrase "Do you have any questions?"
In this embodiment, under an interruption strategy, during the virtual character's broadcast, when it is judged that the user intends to interrupt, for example the user makes an unsafe gesture or a stop gesture, or inputs speech with a stopping intention, the virtual character interrupts its current speaking state, waits for the user to speak, or actively asks the user why it was interrupted, makes certain actions, and so on. This provides duplex capability: the virtual character can actively or passively interrupt its own current broadcast and perform a post-interruption response behavior in reaction to the user's gesture, guiding the subsequent dialogue flow and making the interaction between the virtual character and the user smoother and more intelligent.
Optionally, after the virtual character's current processing has been interrupted, if the user's voice input is received within a first duration and the semantic information of that voice input is recognized, the next round of dialogue is started and dialogue processing is performed according to the semantic information of the user's voice input.
The first duration is generally set to a relatively short duration so that the user does not perceive a long pause. It can be set and adjusted according to the needs of the actual application scenario, for example several hundred milliseconds, one second, or even several seconds, and is not specifically limited here.
After the virtual character's current processing has been interrupted, if the user's voice input is not received within the first duration, or the semantic information of the user's voice input cannot be recognized, the interrupted output of the virtual character is resumed.
Optionally, in that case the interrupted output of the virtual character may be resumed only after pausing for a second duration, so as to leave the user enough time for voice input.
The second duration may be several hundred milliseconds, one second, or even several seconds; it can be set and adjusted according to the needs of the actual application scenario and is not specifically limited here.
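A minimal sketch of this resume-after-interruption logic follows; the durations and the listen(), has_semantics(), start_new_dialogue(), and resume() helpers are illustrative assumptions.

```python
# Sketch: after an interruption, either open a new dialogue round or resume
# the interrupted broadcast after an extra pause.
import time

FIRST_DURATION_S = 1.0   # how long to wait for meaningful user speech
SECOND_DURATION_S = 1.0  # extra pause before resuming the broadcast

def handle_interruption(listen, has_semantics, start_new_dialogue, resume):
    """Decide what to do once the virtual character has been interrupted."""
    utterance = listen(timeout=FIRST_DURATION_S)
    if utterance is not None and has_semantics(utterance):
        start_new_dialogue(utterance)   # user said something meaningful
        return
    time.sleep(SECOND_DURATION_S)       # give the user a little more time
    resume()                            # continue the interrupted broadcast
```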
In this embodiment, after the virtual character has been driven to perform the interruption response behavior, a new round of dialogue is started if voice input with semantic information is received from the user within the first duration; if no such input is received, the virtual character's previous broadcast is resumed after a pause of a certain duration. In this way the interruption response behavior does not disturb the normal interaction between the virtual character and the user, improving the smoothness and intelligence of the interaction.
Figure 3 is a flowchart of a method for driving a virtual character to take over from the user, provided by an embodiment of this application. In this embodiment, the user's gestures can be recognized in real time, and the virtual character is driven to perform takeover response processing based on those gestures. As shown in Figure 3, the method includes the following steps.
Step S301: Obtain a three-dimensional image rendering model of the virtual character, so as to provide interactive services to the user through the virtual character.
The three-dimensional image rendering model of the virtual character includes the rendering data required to render the virtual character; based on this model, the skeleton data of the virtual character can be rendered into the three-dimensional image of the virtual character presented to the user.
The method provided in this embodiment can be applied in scenarios where a virtual character interacts with a person, using a virtual character with a three-dimensional image to realize real-time machine-person interaction and provide intelligent services.
Step S302: In a round of dialogue between the virtual character and the user, while the user is providing voice input, acquire the voice data input by the user and the user's image data in real time.
In this embodiment, during a round of dialogue between the virtual character and the user, the input voice stream is acquired in real time to obtain the voice data input by the user; the video stream from the user can also be monitored in real time and video frames sampled at a preset frequency to obtain the user's image data.
The user's image data consists of the image data of the video frames in which the user appears, including the user's face image and the images of the arms and parts of the body visible in those frames.
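As a small illustration of sampling user image data at a preset frequency, the sketch below reads frames from a video stream and keeps roughly SAMPLE_FPS frames per second; OpenCV is used here purely as an assumed capture backend, and the sampling rate is an assumption.

```python
# Sketch of sampling video frames from the user's stream at a preset frequency.
import cv2

SAMPLE_FPS = 5  # assumed preset sampling frequency (frames per second)

def sample_user_frames(source=0):
    """Yield frames from the user's video stream at roughly SAMPLE_FPS."""
    capture = cv2.VideoCapture(source)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or 30
    step = max(int(native_fps // SAMPLE_FPS), 1)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            yield frame  # one sampled frame of user image data
        index += 1
    capture.release()
```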
Step S303: When it is detected that the silence duration of the voice input is greater than or equal to the preset duration, if it is determined that the voice input has not ended, identify the user's gesture information from the user's image data in the previous period, where the previous period runs from the last moment at which the silence duration reached the preset duration to the current moment.
Voice activity detection is performed in real time on the acquired voice data to determine its silence duration. When the silence duration of the voice data input by the user is greater than or equal to the preset duration but still less than the silence duration threshold, a relatively long pause has occurred during the user's input, yet this round of voice input has not ended. In this case, one duplex response pass is performed based on the user's image data from the previous period, so that the virtual character responds in time to the user's gestures, guiding the subsequent dialogue flow and making the interaction between the virtual character and the user smoother and more intelligent.
The preset duration is a relatively short duration smaller than the silence duration threshold, and the silence duration threshold is the silence duration used to decide whether the user's current round of input has ended: when the silence duration of the user's voice input reaches the silence duration threshold, it is determined that the current round of voice input is over. For example, the silence duration threshold may be 800 ms and the preset duration may be 200 ms. The preset duration may be set and adjusted according to the needs of the actual application scenario and is not specifically limited here.
In this step, when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the user's gesture information is identified from the user's image data in the previous period, so that the gesture currently made by the user is recognized in real time.
Illustratively, the user's gesture information may be identified from the user's image data in the previous period by a temporal convolutional neural network that performs feature extraction and gesture classification on the user image data in multiple video frames, recognizing the user's gesture in real time.
Alternatively, this identification may be implemented with any existing gesture recognition algorithm capable of recognizing the gesture made by the user from the user's image data, which is not described again here.
Illustratively, the sensing system can recognize the gestures shown in Table 1 above.
Step S304: According to the user's gesture information, if it is determined that the user has made a gesture that needs to be taken over, determine the driving data of the virtual character.
After the user's gesture information has been identified, it is determined whether the user's current gesture is a gesture that needs to be taken over. If it is, corresponding driving data is generated according to the takeover response strategy corresponding to the user's current gesture. The driving data includes all the driving parameters required to drive the virtual character to execute that takeover response strategy, realizing both facial driving and action driving of the virtual character.
Illustratively, if the takeover response strategy corresponding to the user's current gesture includes the virtual character making a prescribed expression, the driving data includes expression driving parameters; if it includes the virtual character making a prescribed action, the driving data includes action driving parameters; if it includes the virtual character broadcasting prescribed speech, the driving data includes voice driving parameters; and if it combines several response modes among expressions, speech, and actions, the driving data includes the corresponding multiple driving parameters, which can drive the virtual character to perform the response behavior corresponding to the response strategy.
In this embodiment, the gestures that require a takeover response (that is, the gestures to be taken over) and the corresponding takeover response strategies can be configured in advance.
When it is judged that the user's dialogue or action needs to be taken over, the corresponding takeover strategy is triggered. A takeover strategy can configure the virtual character to perform at least one takeover response behavior of making a takeover action and broadcasting a takeover phrase, where the takeover action includes at least one of a hand action and a facial action. There are at least the following two takeover modes: action-only takeover, in which the virtual character gives no verbal reply and responds to the user only with a takeover action, without affecting the rest of the dialogue state; and action-plus-phrase takeover, in which the virtual character both makes a takeover action and broadcasts a takeover phrase in response to the user. In some scenarios, a takeover strategy may also be configured to broadcast a takeover phrase only. Takeover actions include facial expressions, gesture actions, and so on.
In this embodiment, multiple takeover strategies can be configured, and different takeover strategies can have different takeover response behaviors.
Optionally, the execution timing of each takeover response behavior can also be configured in the takeover strategy. Illustratively, the execution timing of any takeover response behavior may be one of the following: executed immediately, executed after a specified period of time, or executed after the user's input ends. Different takeover response behaviors in the same takeover strategy can be configured with different execution timings.
For example, during the user's input, if the gesture intention of the user's gesture is "thumbs up", the corresponding takeover strategy may be: immediately make a "smile" expression and a gesture expressing happiness, and, after the user's input ends, broadcast the takeover phrase "I'm very glad to receive your compliment."
For example, during the user's input, if the gesture intention of the user's gesture is "thumbs up", the corresponding takeover strategy may instead be: immediately make a "smile" expression and a gesture expressing happiness, and immediately broadcast the takeover phrase "Thank you."
It should be noted that a takeover phrase broadcast immediately is usually kept short, for example "mm-hmm", "yes", "right", "hmm", or "oh", so that broadcasting it does not affect the user's normal voice input.
Step S305: According to the driving data and the three-dimensional image rendering model of the virtual character, drive the virtual character to perform the takeover response behavior corresponding to the user's gesture information.
After the corresponding driving data has been determined from the takeover response strategy corresponding to the user's current gesture, the skeleton model of the virtual character is driven according to the driving data to obtain the skeleton data corresponding to the response behavior, and the skeleton data is rendered with the three-dimensional detailed rendering model of the virtual character to obtain the virtual character image data corresponding to the response behavior. By rendering the virtual character image data into the output video stream, the virtual character in the output video stream performs the corresponding takeover response behavior, realizing a multimodal duplex interaction capability in which the virtual character responds to the user's gestures in a timely manner.
In this embodiment, during a round of dialogue between the virtual character and the user, while the user is providing voice input, the voice data input by the user and the user's image data are acquired in real time. When it is detected that the silence duration of the voice input is greater than or equal to the preset duration, if it is determined that the voice input has not ended, the user's gesture information is identified in real time from the user's image data in the previous period; if it is determined from that gesture information that the user has made a gesture that needs to be taken over, the virtual character is driven to perform takeover response processing for the user's current gesture, so that the virtual character in the output video stream performs the corresponding takeover response behavior. This increases the real-time recognition capability for user gestures and drives the virtual character to respond to the user's gestures in a timely manner, improving the degree of anthropomorphism of the virtual character and making the interaction between the virtual character and the user smoother and more intelligent.
In an optional embodiment, the above step S304 can be implemented with the following steps.
Step S3041: Convert the voice data input by the user in the previous period into corresponding text information.
In this step, the voice data input by the user in the previous period is fed into the ASR module, which converts it into corresponding text information.
Step S3042: Determine the gesture intention category corresponding to the user's gesture information based on the user's gesture information and the text information.
In practical applications, a gesture may express different intentions in different scenarios. In this step, the user's gesture information from the previous period is combined with the text information of the user's voice input for multimodal classification, determining the gesture intention category corresponding to the user's gesture information in the previous period.
Illustratively, the gesture intention categories that require a duplex response can be configured in advance in this embodiment, together with a response strategy for each gesture intention category.
Specifically, the text information and the user's image data from the previous period are input into a trained multimodal classification model. The model identifies the user's gesture information from the image data of the previous period, extracts the semantic features of the text information, and performs multimodal classification based on the gesture information and the semantic features of the text, thereby determining the gesture intention category corresponding to the user's gesture information.
In practical applications, a gesture may carry different meanings in different scenarios; that is, the same gesture expresses different user intentions in different scenarios. For example, a "swipe up" motion may express the gesture intention "turn the page up" in one scenario and the gesture intention "hello" in another.
In this embodiment, the multimodal classification model accurately identifies the gesture intention corresponding to the user's gesture information by fusing the semantic features of the text information of the user's voice input with the user's gesture information.
The multimodal classification model may be implemented with any existing multimodal image classification model, or another multimodal alignment model may be used to correct the image classification result based on the text information.
Illustratively, the user's gesture information may be identified from the user's image data in the previous period by a temporal convolutional neural network that performs feature extraction and gesture classification on the user image data in multiple video frames, recognizing the user's gesture in real time. Alternatively, this identification may be implemented with any existing gesture recognition algorithm capable of recognizing the gesture made by the user from the user's image data, which is not described again here.
Step S3043: If the gesture intention category corresponding to the user's gesture information belongs to the gesture intention categories that need to be taken over, determine that the user has made a gesture that needs to be taken over, and determine the driving data of the virtual character according to the gesture intention category corresponding to the user's gesture information.
The driving data is used to drive the virtual character to perform the takeover response behavior corresponding to the gesture intention category of the user's gesture information.
The specific implementation of this step is the same as the implementation of the above step S205 for determining the driving data of the virtual character according to the gesture intention category corresponding to the user's gesture information in the dialogue state where the user provides input and the virtual character receives it, and is not described again here.
In this embodiment, when the driving data of the virtual character is determined after it is found from the user's gesture information that the user has made a gesture that needs to be taken over, the voice data input by the user in the previous period is converted into corresponding text information, and the text of the user's voice input is fused in to accurately identify the gesture intention corresponding to the user's gesture information. The user's gesture intention can thus be recognized precisely and in real time, and the virtual character is driven on that basis to perform the corresponding takeover response behavior, guiding the subsequent dialogue flow and making the interaction between the virtual character and the user smoother and more intelligent.
Figure 4 is a schematic structural diagram of a virtual character driving system based on multimodal data provided by an exemplary embodiment of this application. The virtual character driving system based on multimodal data provided by the embodiments of this application can execute the processing flow provided by the embodiments of the virtual character driving method based on multimodal data. As shown in Figure 4, the virtual character driving system 40 based on multimodal data includes: a multimodal input module 41, a speech processing module 42, an image processing module 43, and a drive control module 44.
The drive control module 44 is configured to obtain a three-dimensional image rendering model of the virtual character, so as to provide interactive services to the user through the virtual character.
The multimodal input module 41 is configured to acquire, in real time, the voice data input by the user and the user's image data during a round of dialogue between the virtual character and the user.
The speech processing module 42 is configured to, when it is detected that the silence duration of the voice data input by the user is greater than or equal to the preset duration and it is determined that the voice input has not ended, convert the voice data input by the user in the previous period into corresponding text information, where the previous period runs from the last moment at which the silence duration reached the preset duration to the current moment.
The image processing module 43 is configured to identify the user's gesture information from the user's image data in the previous period and to determine, based on the user's gesture information and the text information, the gesture intention category corresponding to the user's gesture information.
The drive control module 44 is further configured to determine the corresponding driving data according to the gesture intention category corresponding to the user's gesture information and the current dialogue state, and to drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
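A rough sketch of how these modules might be wired together for one duplex response pass is given below; the class and method names mirror the module names above but are illustrative assumptions rather than the actual implementation of the disclosed system.

```python
# Sketch of the system in Figure 4: modules 41-44 cooperating on one pass.
class MultimodalDataAvatarDriver:
    """System 40: multimodal input, speech, image, and drive-control modules."""

    def __init__(self, multimodal_input, speech_module, image_module, drive_control):
        self.multimodal_input = multimodal_input   # module 41
        self.speech_module = speech_module         # module 42
        self.image_module = image_module           # module 43
        self.drive_control = drive_control         # module 44

    def on_silence_detected(self, dialogue_state):
        """One duplex response pass, triggered when the preset silence elapses."""
        voice, frames = self.multimodal_input.last_period()
        text = self.speech_module.to_text(voice)                   # ASR
        intent = self.image_module.classify_intent(frames, text)   # multimodal
        driving_data = self.drive_control.decide(intent, dialogue_state)
        self.drive_control.render_response(driving_data)
```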
本申请实施例提供的系统可以具体用于执行上述图2对应方法实施例所提供的方案,具体功能和所能实现的技术效果此处不再赘述。The system provided by the embodiment of the present application can be specifically used to execute the solution provided by the method embodiment corresponding to Figure 2 above. The specific functions and the technical effects that can be achieved will not be described again here.
一种可选地实施例中,在根据上一时段内用户的图像数据识别用户的手势信息,并根据用户的手势信息和文本信息,确定用户的手势信息对应的手势意图分类时,图像处理模块43还用于:In an optional embodiment, when identifying the user's gesture information based on the user's image data in the previous period, and determining the gesture intention classification corresponding to the user's gesture information based on the user's gesture information and text information, the image processing module 43 is also used for:
将文本信息与上一时段内用户的图像数据输入训练好的多模态分类模型,通过多模态分类模型,根据上一时段内用户的图像数据识别用户的手势信息,提取文本信息的语义特征,根据用户的手势信息和文本信息的语义特征,进行多模态分类处理,确定用户的手势信息对应的手势意图分类。Input the text information and the user's image data in the previous period into the trained multi-modal classification model. Through the multi-modal classification model, the user's gesture information is recognized based on the user's image data in the previous period and the semantic features of the text information are extracted. , based on the semantic features of the user's gesture information and text information, perform multi-modal classification processing to determine the gesture intention classification corresponding to the user's gesture information.
In an optional embodiment, as shown in Figure 5, the virtual character driving system 40 based on multimodal data further includes a strategy configuration module 45.
The strategy configuration module 45 is configured to, in response to a response strategy configuration operation, configure at least one of the following types of response strategies: an interruption strategy in which the virtual character actively or passively interrupts the current processing, an acceptance strategy in which the virtual character actively acknowledges the user, starting a new round of dialogue, and no feedback.
Each type of response strategy includes one or more response strategies, and each response strategy includes a corresponding gesture intention classification, a response time, and a response mode.
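A response strategy table of this shape might be represented as follows; the field names and the concrete entries are illustrative assumptions rather than a prescribed configuration format.

```python
from dataclasses import dataclass
from enum import Enum, auto

class StrategyType(Enum):
    INTERRUPT = auto()      # virtual character interrupts the current processing
    ACCEPT = auto()         # virtual character acknowledges the user's gesture
    NEW_DIALOGUE = auto()   # start a new round of dialogue
    NO_FEEDBACK = auto()

@dataclass
class ResponseStrategy:
    gesture_intent: str          # gesture intention classification this strategy applies to
    strategy_type: StrategyType
    response_time_ms: int        # when to respond once the intent has been recognized
    response_mode: str           # e.g. "action", "speech", "action+speech"

# Illustrative entries only; a deployment would configure these through the strategy configuration module.
STRATEGY_TABLE = [
    ResponseStrategy("raise_hand", StrategyType.ACCEPT, 200, "action+speech"),
    ResponseStrategy("stop_palm", StrategyType.INTERRUPT, 0, "action"),
    ResponseStrategy("wave_goodbye", StrategyType.NEW_DIALOGUE, 500, "speech"),
]
```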
In an optional embodiment, when determining the corresponding drive data according to the gesture intention classification corresponding to the gesture information of the user and the current dialogue state, as shown in Figure 5, the drive control module 44 includes a response decision unit 441 and a drive control unit 442. The response decision unit 441 is configured to, if the current dialogue state is a state in which the user provides input and the virtual character receives it, determine, according to the gesture intention classification corresponding to the gesture information of the user, a first target strategy corresponding to the gesture intention classification, where the first target strategy is one of the acceptance strategy, starting a new round of dialogue, and no feedback.
The drive control unit 442 is configured to determine the corresponding drive data according to the first target strategy corresponding to the gesture intention classification, where the drive data is used to drive the virtual character to perform the response behavior corresponding to the first target strategy.
In an optional embodiment, when determining the corresponding drive data according to the first target strategy corresponding to the gesture intention classification, the drive control unit 442 is further configured to: if the first target strategy is the acceptance strategy, determine first drive data according to the acceptance strategy, where the first drive data is used to drive the virtual character to perform at least one acceptance response behavior of making an acceptance action and broadcasting an acceptance script, and the acceptance action includes at least one of a hand action and a facial action.
In an optional embodiment, the drive control unit is further configured to: if the current dialogue state is a state in which the virtual character produces output and the user receives it, determine, according to the gesture intention classification corresponding to the gesture information of the user, a second target strategy corresponding to the gesture intention classification, where the second target strategy is one of the interruption strategy, starting a new round of dialogue, and no feedback; and determine second drive data according to the second target strategy corresponding to the gesture intention classification, where the second drive data is used to drive the virtual character to perform the response behavior corresponding to the second target strategy.
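Taken together, the dialogue-state-dependent decision described in the preceding paragraphs could be sketched like this; the gesture intent names, the table format, and the returned drive-data fields are placeholders chosen for the example.

```python
from enum import Enum, auto

class DialogueState(Enum):
    USER_SPEAKING = auto()       # user inputs, virtual character receives
    CHARACTER_SPEAKING = auto()  # virtual character outputs, user receives

# Target strategies allowed in each dialogue state, per the optional embodiments above.
ALLOWED_STRATEGIES = {
    DialogueState.USER_SPEAKING: {"accept", "new_dialogue", "no_feedback"},
    DialogueState.CHARACTER_SPEAKING: {"interrupt", "new_dialogue", "no_feedback"},
}

def decide_drive_data(gesture_intent, state, strategy_table):
    """Map a recognized gesture intent and the current dialogue state to drive data (or None)."""
    for entry in strategy_table:
        if entry["gesture_intent"] != gesture_intent:
            continue
        if entry["type"] not in ALLOWED_STRATEGIES[state]:
            continue
        if entry["type"] == "no_feedback":
            return None
        # The drive data stands in for the animation and speech parameters
        # consumed together with the three-dimensional image rendering model.
        return {
            "target_strategy": entry["type"],
            "response_mode": entry["response_mode"],
            "delay_ms": entry["response_time_ms"],
        }
    return None

# Example: a "stop" palm gesture while the character is speaking selects the interruption strategy.
example_table = [{"gesture_intent": "stop_palm", "type": "interrupt",
                  "response_mode": "action", "response_time_ms": 0}]
print(decide_drive_data("stop_palm", DialogueState.CHARACTER_SPEAKING, example_table))
```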
The system provided by this embodiment of the present application can be specifically used to execute the solution provided by any optional method embodiment based on the method embodiment corresponding to Figure 2 above; its specific functions and achievable technical effects are not described again here.
Figure 6 is a schematic architectural diagram of a virtual character driving system based on multimodal data provided by another exemplary embodiment of the present application. The virtual character driving system based on multimodal data provided by the embodiments of the present application can execute the processing flow provided by the embodiments of the virtual character driving method based on multimodal data. As shown in Figure 6, the virtual character driving system 60 based on multimodal data includes a multimodal input module 61, a perception module 62, and a decision-driving module 63.
The decision-driving module 63 is configured to obtain a three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to the user.
The multimodal input module 61 is configured to obtain, in real time, the voice data input by the user and the image data of the user while the user is performing voice input during a round of dialogue between the virtual character and the user.
The perception module 62 is configured to, when it is detected that the silence duration of the voice input is greater than or equal to a preset duration, and it is determined that the voice input has not ended, identify the gesture information of the user according to the image data of the user in the previous period, where the previous period is from the moment when the silence duration was last greater than or equal to the preset duration to the current moment.
The decision-driving module 63 is configured to determine drive data of the virtual character if it is determined, according to the gesture information of the user, that the user has made a gesture that requires acceptance, and to drive the virtual character to perform the acceptance response behavior corresponding to the gesture information of the user according to the drive data and the three-dimensional image rendering model of the virtual character.
The system provided by this embodiment of the present application can be specifically used to execute the solution provided by the method embodiment corresponding to Figure 3 above; its specific functions and achievable technical effects are not described again here.
In an optional embodiment, when determining the drive data of the virtual character if it is determined, according to the gesture information of the user, that the user has made a gesture that requires acceptance, the perception module 62 is further configured to: convert the voice data input by the user in the previous period into corresponding text information, and determine, according to the gesture information of the user and the text information, the gesture intention classification corresponding to the gesture information of the user.
The decision-driving module 63 is further configured to: if the gesture intention classification corresponding to the gesture information of the user belongs to the gesture intention classifications that require acceptance, determine that the user has made a gesture that requires acceptance, and determine the drive data of the virtual character according to the gesture intention classification corresponding to the gesture information of the user, where the drive data is used to drive the virtual character to perform the acceptance response behavior corresponding to that gesture intention classification.
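For the second system, the check for gestures that require acceptance might look as simple as the following; the intent names and the drive-data fields are invented for the illustration.

```python
# Gesture intention classifications treated as requiring acceptance in this sketch;
# the real set would come from the configured response strategies.
ACCEPTANCE_INTENTS = {"raise_hand", "point_at_item", "beckon"}

def drive_data_for_acceptance(gesture_intent):
    """Return drive data for an acceptance response, or None if no acceptance is needed."""
    if gesture_intent not in ACCEPTANCE_INTENTS:
        return None
    # Pairs an acceptance action with a short spoken acknowledgement; the field names
    # are placeholders for whatever the rendering pipeline actually consumes.
    return {
        "animation": "nod_and_open_palm",    # hand or facial acceptance action
        "tts_text": "I see, please go on.",  # acceptance script to broadcast
    }
```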
The system provided by this embodiment of the present application can be specifically used to execute the solution provided by any optional method embodiment based on the method embodiment corresponding to Figure 3 above; its specific functions and achievable technical effects are not described again here.
Figure 7 is a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present application. As shown in Figure 7, the electronic device 70 includes a processor 701 and a memory 702 communicatively connected to the processor 701, where the memory 702 stores computer-executable instructions.
The processor executes the computer-executable instructions stored in the memory to implement the solution provided by any of the above method embodiments; the specific functions and achievable technical effects are not described again here.
Embodiments of the present application further provide a computer-readable storage medium in which computer-executable instructions are stored; when executed by a processor, the computer-executable instructions are used to implement the solution provided by any of the above method embodiments, and the specific functions and achievable technical effects are not described again here.
Embodiments of the present application further provide a computer program product, which includes a computer program stored in a readable storage medium. At least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device executes the solution provided by any of the above method embodiments; the specific functions and achievable technical effects are not described again here.
In addition, some of the processes described in the above embodiments and the accompanying drawings include multiple operations that appear in a specific order, but it should be clearly understood that these operations need not be performed in the order in which they appear herein and may be performed in parallel; the sequence numbers are only used to distinguish the different operations and do not represent any execution order. These processes may also include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they neither represent a sequence nor limit "first" and "second" to being of different types. "Multiple" means two or more, unless otherwise clearly and specifically limited.
Other embodiments of the present application will readily occur to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed herein. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present application are indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (12)

1. A virtual character driving method based on multimodal data, characterized by comprising:
    obtaining a three-dimensional image rendering model of a virtual character, so as to use the virtual character to provide interactive services to a user;
    during a round of dialogue between the virtual character and the user, obtaining, in real time, voice data input by the user and image data of the user;
    when it is detected that a silence duration of the voice data input by the user is greater than or equal to a preset duration, if it is determined that the voice input has not ended, converting the voice data input by the user in a previous period into corresponding text information, wherein the previous period is from the moment when the silence duration was last greater than or equal to the preset duration to the current moment;
    identifying gesture information of the user according to the image data of the user in the previous period, and determining, according to the gesture information of the user and the text information, a gesture intention classification corresponding to the gesture information of the user;
    determining corresponding drive data according to the gesture intention classification corresponding to the gesture information of the user and a current dialogue state; and
    driving the virtual character to perform a corresponding response behavior according to the drive data and the three-dimensional image rendering model of the virtual character.
2. The method according to claim 1, characterized in that identifying the gesture information of the user according to the image data of the user in the previous period, and determining, according to the gesture information of the user and the text information, the gesture intention classification corresponding to the gesture information of the user, comprises:
    inputting the text information and the image data of the user in the previous period into a trained multimodal classification model, and, through the multimodal classification model, identifying the gesture information of the user according to the image data of the user in the previous period, extracting semantic features of the text information, and performing multimodal classification processing according to the gesture information of the user and the semantic features of the text information to determine the gesture intention classification corresponding to the gesture information of the user.
3. The method according to claim 1 or 2, characterized by further comprising:
    in response to a response strategy configuration operation, configuring at least one of the following types of response strategies:
    an interruption strategy, an acceptance strategy, starting a new round of dialogue, and no feedback;
    wherein each type of response strategy includes one or more response strategies, and each response strategy includes a corresponding gesture intention classification, a response time, and a response mode.
4. The method according to claim 3, characterized in that determining the corresponding drive data according to the gesture intention classification corresponding to the gesture information of the user and the current dialogue state comprises:
    if the current dialogue state is a state in which the user provides input and the virtual character receives it, determining, according to the gesture intention classification corresponding to the gesture information of the user, a first target strategy corresponding to the gesture intention classification, wherein the first target strategy is one of the acceptance strategy, starting a new round of dialogue, and no feedback; and
    determining the corresponding drive data according to the first target strategy corresponding to the gesture intention classification, wherein the drive data is used to drive the virtual character to perform a response behavior corresponding to the first target strategy.
5. The method according to claim 4, characterized in that determining the corresponding drive data according to the first target strategy corresponding to the gesture intention classification comprises:
    if the first target strategy is the acceptance strategy, determining first drive data according to the acceptance strategy, wherein the first drive data is used to drive the virtual character to perform at least one acceptance response behavior of making an acceptance action and broadcasting an acceptance script, and the acceptance action includes at least one of a hand action and a facial action.
6. The method according to claim 3, characterized in that determining the corresponding drive data according to the gesture intention classification corresponding to the gesture information of the user and the current dialogue state comprises:
    if the current dialogue state is a state in which the virtual character produces output and the user receives it, determining, according to the gesture intention classification corresponding to the gesture information of the user, a second target strategy corresponding to the gesture intention classification, wherein the second target strategy is one of the interruption strategy, starting a new round of dialogue, and no feedback; and
    determining second drive data according to the second target strategy corresponding to the gesture intention classification, wherein the second drive data is used to drive the virtual character to perform a response behavior corresponding to the second target strategy.
7. A virtual character driving method based on multimodal data, characterized by comprising:
    obtaining a three-dimensional image rendering model of a virtual character, so as to use the virtual character to provide interactive services to a user;
    during a round of dialogue between the virtual character and the user, while the user is performing voice input, obtaining, in real time, voice data input by the user and image data of the user;
    when it is detected that a silence duration of the voice input is greater than or equal to a preset duration, if it is determined that the voice input has not ended, identifying gesture information of the user according to the image data of the user in a previous period, wherein the previous period is from the moment when the silence duration was last greater than or equal to the preset duration to the current moment;
    if it is determined, according to the gesture information of the user, that the user has made a gesture that requires acceptance, determining drive data of the virtual character; and
    driving the virtual character to perform an acceptance response behavior corresponding to the gesture information of the user according to the drive data and the three-dimensional image rendering model of the virtual character.
8. The method according to claim 7, characterized in that, if it is determined, according to the gesture information of the user, that the user has made a gesture that requires acceptance, determining the drive data of the virtual character comprises:
    converting the voice data input by the user in the previous period into corresponding text information;
    determining, according to the gesture information of the user and the text information, a gesture intention classification corresponding to the gesture information of the user; and
    if the gesture intention classification corresponding to the gesture information of the user belongs to gesture intention classifications that require acceptance, determining that the user has made a gesture that requires acceptance, and determining the drive data of the virtual character according to the gesture intention classification corresponding to the gesture information of the user, wherein the drive data is used to drive the virtual character to perform an acceptance response behavior corresponding to the gesture intention classification corresponding to the gesture information of the user.
9. A virtual character driving system based on multimodal data, characterized by comprising:
    a drive control module, configured to obtain a three-dimensional image rendering model of a virtual character, so as to use the virtual character to provide interactive services to a user;
    a multimodal input module, configured to obtain, in real time, voice data input by the user and image data of the user during a round of dialogue between the virtual character and the user;
    a voice processing module, configured to, when it is detected that a silence duration of the voice data input by the user is greater than or equal to a preset duration, and it is determined that the voice input has not ended, convert the voice data input by the user in a previous period into corresponding text information, wherein the previous period is from the moment when the silence duration was last greater than or equal to the preset duration to the current moment; and
    an image processing module, configured to identify gesture information of the user according to the image data of the user in the previous period, and to determine, according to the gesture information of the user and the text information, a gesture intention classification corresponding to the gesture information of the user;
    wherein the drive control module is further configured to determine corresponding drive data according to the gesture intention classification corresponding to the gesture information of the user and a current dialogue state, and to drive the virtual character to perform a corresponding response behavior according to the drive data and the three-dimensional image rendering model of the virtual character.
10. A virtual character driving system based on multimodal data, characterized by comprising:
    a decision-driving module, configured to obtain a three-dimensional image rendering model of a virtual character, so as to use the virtual character to provide interactive services to a user;
    a multimodal input module, configured to obtain, in real time, voice data input by the user and image data of the user while the user is performing voice input during a round of dialogue between the virtual character and the user;
    a perception module, configured to, when it is detected that a silence duration of the voice input is greater than or equal to a preset duration, and it is determined that the voice input has not ended, identify gesture information of the user according to the image data of the user in a previous period, wherein the previous period is from the moment when the silence duration was last greater than or equal to the preset duration to the current moment;
    wherein the decision-driving module is further configured to determine drive data of the virtual character if it is determined, according to the gesture information of the user, that the user has made a gesture that requires acceptance, and to drive the virtual character to perform an acceptance response behavior corresponding to the gesture information of the user according to the drive data and the three-dimensional image rendering model of the virtual character.
11. An electronic device, characterized by comprising: a processor, and a memory communicatively connected to the processor;
    wherein the memory stores computer-executable instructions; and
    the processor executes the computer-executable instructions stored in the memory to implement the method according to any one of claims 1 to 8.
12. A computer-readable storage medium, characterized in that computer-executable instructions are stored in the computer-readable storage medium, and the computer-executable instructions, when executed by a processor, are used to implement the method according to any one of claims 1 to 8.
PCT/CN2023/095449 2022-05-23 2023-05-22 Virtual character driving method and system based on multimodal data, and device WO2023226914A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210567637.3 2022-05-23
CN202210567637.3A CN114840090A (en) 2022-05-23 2022-05-23 Virtual character driving method, system and equipment based on multi-modal data

Publications (1)

Publication Number Publication Date
WO2023226914A1 (en) 2023-11-30

Family

ID=82572222

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/095449 WO2023226914A1 (en) 2022-05-23 2023-05-22 Virtual character driving method and system based on multimodal data, and device

Country Status (2)

Country Link
CN (1) CN114840090A (en)
WO (1) WO2023226914A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114840090A (en) * 2022-05-23 2022-08-02 阿里巴巴(中国)有限公司 Virtual character driving method, system and equipment based on multi-modal data
CN115356953B (en) * 2022-10-21 2023-02-03 北京红棉小冰科技有限公司 Virtual robot decision method, system and electronic equipment
CN115509366A (en) * 2022-11-21 2022-12-23 科大讯飞股份有限公司 Intelligent cabin multi-modal man-machine interaction control method and device and electronic equipment
CN116798427A (en) * 2023-06-21 2023-09-22 支付宝(杭州)信息技术有限公司 Man-machine interaction method based on multiple modes and digital man system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105070290A (en) * 2015-07-08 2015-11-18 苏州思必驰信息科技有限公司 Man-machine voice interaction method and system
CN108415561A (en) * 2018-02-11 2018-08-17 北京光年无限科技有限公司 Gesture interaction method based on visual human and system
CN114840090A (en) * 2022-05-23 2022-08-02 阿里巴巴(中国)有限公司 Virtual character driving method, system and equipment based on multi-modal data

Also Published As

Publication number Publication date
CN114840090A (en) 2022-08-02

Similar Documents

Publication Publication Date Title
WO2023226914A1 (en) Virtual character driving method and system based on multimodal data, and device
US11749265B2 (en) Techniques for incremental computer-based natural language understanding
CN107894833B (en) Multi-modal interaction processing method and system based on virtual human
US11551804B2 (en) Assisting psychological cure in automated chatting
CN108000526B (en) Dialogue interaction method and system for intelligent robot
CN107423809B (en) Virtual robot multi-mode interaction method and system applied to video live broadcast platform
WO2023226913A1 (en) Virtual character drive method, apparatus, and device based on expression recognition
CN106985137B (en) Multi-modal exchange method and system for intelligent robot
CN106503786B (en) Multi-modal interaction method and device for intelligent robot
JP2023089115A (en) Hot-word free adaptation of automated assistant function
KR102448382B1 (en) Electronic device for providing image related with text and operation method thereof
CN110299152A (en) Interactive output control method, device, electronic equipment and storage medium
CN107315742A (en) The Interpreter's method and system that personalize with good in interactive function
JP2004206704A (en) Dialog management method and device between user and agent
JP2001229392A (en) Rational architecture for executing conversational character with communication of small number of messages
WO2023216765A1 (en) Multi-modal interaction method and apparatus
CN107704612A (en) Dialogue exchange method and system for intelligent robot
EP3635513B1 (en) Selective detection of visual cues for automated assistants
CN106502382B (en) Active interaction method and system for intelligent robot
KR20200036089A (en) Apparatus and method for interaction
TW201937344A (en) Smart robot and man-machine interaction method
KR102222911B1 (en) System for Providing User-Robot Interaction and Computer Program Therefore
CN110737335B (en) Interaction method and device of robot, electronic equipment and storage medium
TW201947427A (en) Man-machine dialog method, client, electronic device and storage medium
KR20220011083A (en) Information processing method, device, electronic equipment and storage medium in user dialogue

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23810985

Country of ref document: EP

Kind code of ref document: A1