CN107340865B - Multi-modal virtual robot interaction method and system - Google Patents

Multi-modal virtual robot interaction method and system

Info

Publication number
CN107340865B
Authority
CN
China
Prior art keywords
virtual robot
modal
interaction
robot
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710519314.6A
Other languages
Chinese (zh)
Other versions
CN107340865A (en)
Inventor
尚小维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Virtual Point Technology Co Ltd
Original Assignee
Beijing Guangnian Wuxian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guangnian Wuxian Technology Co Ltd filed Critical Beijing Guangnian Wuxian Technology Co Ltd
Priority to CN201710519314.6A priority Critical patent/CN107340865B/en
Publication of CN107340865A publication Critical patent/CN107340865A/en
Application granted granted Critical
Publication of CN107340865B publication Critical patent/CN107340865B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/903 Querying
    • G06F 16/9032 Query formulation
    • G06F 16/90332 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/008 Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2203/00 Indexing scheme relating to G06F 3/00 - G06F 3/048
    • G06F 2203/01 Indexing scheme relating to G06F 3/01
    • G06F 2203/011 Emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns

Abstract

The invention provides a multi-modal virtual robot interaction method comprising the following steps: starting a virtual robot and displaying its avatar in a preset display area, wherein the virtual robot has a set character and a background story; acquiring a single-modal and/or multi-modal interaction instruction issued by a user; calling a robot capability interface to parse the single-modal and/or multi-modal interaction instruction and obtain the intent of the instruction; screening and generating multi-modal response data associated with the set character and the background story according to the current application scene and the set character; and outputting the multi-modal response data through the avatar of the virtual robot. The invention employs a virtual robot with a background story and set personality attributes for conversational interaction, so that the user seems to be conversing with a person rather than a machine. In addition, the virtual robot provided by the invention screens and generates multi-modal response data according to the set character attributes and the background story, giving it a degree of selectivity and initiative over the interaction content.

Description

Multi-modal virtual robot interaction method and system
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a multi-modal virtual robot interaction method and system.
Background
The development of robotic chat interaction systems has long been directed at mimicking human conversation. Early well-known chatbot applications, such as the Xiaoi chatbot and the Siri assistant on Apple phones, process received input (text or speech) and respond in context in an attempt to mimic human responses.
However, these existing intelligent robots fall far short of giving a virtual robot human-like characteristics, let alone fully imitating human conversation, and therefore do little to enrich the user's interactive experience.
Disclosure of Invention
In order to solve the above problems, the present invention provides a multimodal virtual robot interaction method, including the following steps:
starting a virtual robot to display an image of the virtual robot in a preset display area, wherein the virtual robot has a set character and a background story;
acquiring a single-mode and/or multi-mode interaction instruction sent by a user;
calling a robot capability interface to parse the single-mode and/or multi-mode interaction instruction and obtain the intent of the interaction instruction;
filtering and generating multi-modal response data associated with the set character and the background story according to the current application scene and the set character;
and outputting the multi-modal response data through the image of the virtual robot.
According to one embodiment of the invention, the condition that triggers the event that enables the virtual robot comprises:
detecting a particular biometric input;
or, starting hardware loaded with a virtual robot program package;
or, a specified system, application, specified function loaded by the hardware is started.
According to one embodiment of the invention, the step of calling a robot capability interface to analyze the single-mode and/or multi-mode interactive instruction and acquiring the intention of the interactive instruction comprises the following steps:
and calling voice recognition, visual recognition, semantic understanding, emotion calculation, cognitive calculation, expression control and action control interfaces which are adaptive to the set background story and the set character.
According to an embodiment of the present invention, in the step of filtering to generate multi-modal response data associated with the set character and the backstory, the method further comprises:
judging whether the single-mode and/or multi-mode interaction instruction conforms to the set character;
when the intention of the interaction instruction does not point to (conform with) the set character, response data representing rejection is output; the response data may be multi-modal response data.
According to one embodiment of the invention, the single-modal and/or multi-modal interaction instructions comprise interaction instructions issued under entertainment, companion, and assistant application scenes.
According to an embodiment of the present invention, the virtual robot may exist in, without being limited to, any of the following forms:
a system service, a platform function, an in-application function, an individual application, or a text robot paired with a virtual avatar.
According to another aspect of the invention, there is also provided a storage medium having stored thereon program code executable to perform the method steps of any of the above.
According to another aspect of the present invention, there is also provided a multi-modal virtual robot interaction apparatus, the apparatus including:
a start-up display unit, configured to start a virtual robot so as to display the avatar of the virtual robot in a preset display area, wherein the virtual robot has a set character and a background story;
the acquisition unit is used for acquiring a single-mode and/or multi-mode interaction instruction sent by a user;
the calling unit is used for calling a robot capability interface to analyze the single-mode and/or multi-mode interactive instruction and obtain the intention of the interactive instruction;
a generating unit, which is used for filtering and generating multi-modal response data associated with the setting character and the background story according to the current application scene and the setting character;
an output unit for outputting the multi-modal response data through an avatar of the virtual robot.
According to one embodiment of the present invention, the start display unit includes:
a detection subunit for detecting a specific biometric input, or that hardware loaded with a virtual robot program package is activated;
or, a specified system, an application and a specified function loaded by the hardware are started;
a display subunit, configured to display the avatar of the virtual robot in a preset display area.
According to one embodiment of the invention, the apparatus comprises:
and the voice recognition subunit, the visual recognition subunit, the semantic understanding subunit, the emotion calculation subunit, the cognition calculation subunit, the expression control subunit and the action control subunit are adapted to the set background story and the set character.
According to an embodiment of the invention, the generating unit further comprises:
the judging subunit is used for judging whether the single-mode and/or multi-mode interaction instruction conforms to the set character;
and the rejecting subunit is used for outputting response data representing rejection when the intention direction of the interactive instruction does not accord with the set character direction, wherein the response data can be multi-modal response data.
According to one embodiment of the invention, the device comprises a scene selection unit for selecting an application scene, wherein the application scene comprises an entertainment application scene, a companion application scene and a helper application scene.
According to one embodiment of the invention, the apparatus comprises components that support multi-modal interaction with the virtual robot in any of the following presence forms, without limitation:
system services, platform functions, in-application functions, individual applications, and a text robot paired with a virtual avatar.
According to another aspect of the present invention, there is also provided a multi-modal virtual robot interaction system, the system including:
the target hardware device is used for displaying the virtual robot avatar having the set background story and the set character, and has the capability of receiving single-mode and/or multi-mode interaction instructions sent by a user and the capability of outputting multi-modal response data;
a cloud server in communication with the target hardware device and providing a multimodal robot capability interface for performing the following steps:
calling a robot capability interface to parse the single-mode and/or multi-mode interaction instruction and obtain the intent of the interaction instruction;
and screening and generating multi-modal response data associated with the set character and the background story according to the current application scene and the set character.
According to one embodiment of the invention, the target hardware device of the system comprises:
a biometric detection module for detecting whether a specific biometric input is made, whether hardware loaded with a virtual robot program package is started, or whether a specified system, application, or specified function loaded on the hardware is started.
By adopting a virtual robot with a set background story and set character attributes for conversational interaction, the virtual robot's image becomes fuller and closer to a real human, the user's interaction experience is enriched, the user seems to be conversing with a person rather than a machine, and the user's space for imagination is enhanced. In addition, the virtual robot of the invention screens and generates multi-modal response data according to the set character attributes and the background story, giving it a degree of selectivity and initiative over the interaction content.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 shows a schematic diagram of multimodal interaction using a virtual robot, according to an embodiment of the invention;
FIG. 2 shows a block diagram of a multi-modal virtual robot interaction, in accordance with one embodiment of the present invention;
FIG. 3 shows a system block diagram of multi-modal virtual robot interaction, in accordance with one embodiment of the present invention;
FIG. 4 shows a robot capability interface diagram of a system for multi-modal virtual robot interaction, in accordance with one embodiment of the invention;
FIG. 5 shows a block workflow diagram of a multi-modal virtual robot interaction method according to one embodiment of the invention;
FIG. 6 shows a schematic diagram of the relationship between a set character and a background story, according to one embodiment of the invention;
FIG. 7 shows a flow diagram for multimodal interaction in accordance with an embodiment of the invention;
FIG. 8 shows a detailed flow diagram for multimodal interaction in accordance with one embodiment of the invention;
FIG. 9 shows another flow diagram for multimodal interaction in accordance with an embodiment of the invention; and
fig. 10 shows a flowchart of communication among three parties, namely, a user, a target hardware device with a virtual robot installed, and a cloud server, according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
FIG. 1 shows a schematic diagram of multi-modal interaction with a virtual robot according to the present invention. For clarity, the following explanations are given before the embodiments are described:
The virtual robot is a multi-modal interactive robot that becomes a participant in the interaction process, so that the user can ask questions, chat, and play games with it. The virtual image (avatar) is the carrier of the multi-modal interactive robot and represents its multi-modal output. The virtual robot (with the avatar as its carrier) is thus the union of the multi-modal interactive robot and the avatar that carries it, namely: a designed, determined UI image serves as the carrier; based on multi-modal human-computer interaction, it has AI capabilities such as semantics, emotion, and cognition; and it lets the user enjoy a personalized, intelligent service robot with a smooth experience. In this embodiment, the virtual robot includes a virtual robot image rendered as a high-poly 3D animation.
The cloud server is the terminal that provides the multi-modal interactive robot with the processing capabilities for analyzing the user's interaction requirements (speech recognition, visual recognition, semantic understanding, emotion computation, cognitive computation, expression control, and action control), thereby realizing the interaction with the user.
As shown in fig. 1, the system includes a user 101, a target hardware device 102, a virtual robot 103, and a cloud server 104.
The user 101 may be a single person, another virtual robot, or a physical robot; all of these objects can interact with the virtual robot 103. The target hardware device 102 includes a display area 1021 and a hardware device 1022. The display area 1021 is used to display the avatar of the virtual robot 103; the hardware device 1022 works in cooperation with the cloud server 104 and is used for instruction parsing and data processing during the multi-modal interaction, and an intelligent robot operating system can be embedded in the hardware device 1022. Since the avatar of the virtual robot 103 needs a screen carrier for presentation, the display area 1021 may be a PC screen, a projector, a television, a multimedia display screen, a holographic projection, or a VR or AR display. Generally, a PC with a host is selected as the hardware device 1022; in Fig. 1 the display area 1021 is a PC screen.
The process of multi-modal interaction of the user 101 with the virtual robot 103 shown in fig. 1 may be:
First, the user 101 initiates the interaction. Before the interaction, the virtual robot 103 needs to be woken up. The means for waking up the virtual robot may be a biometric feature such as a voiceprint or an iris, or a touch, a key press, a remote controller, a specific limb action, a gesture, and so on. The virtual robot 103 may also be started together with the hardware loaded with the virtual robot program package, or when a designated system, application, or designated function loaded on the hardware is started. After the virtual robot 103 is woken up, its avatar is displayed in the display area 1021, and the woken-up virtual robot 103 has a set character and a background story.
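By way of illustration only (not part of the patent disclosure), the following Python sketch shows one way the wake-up conditions just described could be checked; all names such as WakeEvent and should_wake are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical event record; the patent only names the trigger categories.
@dataclass
class WakeEvent:
    kind: str                  # "biometric", "hardware_boot", or "app_start"
    detail: Optional[str] = None

REGISTERED_BIOMETRICS = {"voiceprint:user-101", "iris:user-101", "touch:front-panel"}
DESIGNATED_APPS = {"assistant", "companion", "entertainment"}

def should_wake(event: WakeEvent) -> bool:
    """Return True if the event matches one of the wake-up conditions."""
    if event.kind == "biometric":
        return event.detail in REGISTERED_BIOMETRICS
    if event.kind == "hardware_boot":
        return True            # robot package starts together with the hardware
    if event.kind == "app_start":
        return event.detail in DESIGNATED_APPS
    return False

if __name__ == "__main__":
    print(should_wake(WakeEvent("biometric", "voiceprint:user-101")))  # True
    print(should_wake(WakeEvent("app_start", "calculator")))           # False
```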
It should be noted that the avatar of the virtual robot 103 is not limited to a single fixed appearance. The avatar is generally a high-poly 3D animated image, and these images can be matched to the set character and background story, in particular with various clothes and accessories suited to the virtual robot's current scene. The user 101 may select from the provided dressings of the virtual robot 103, which may be classified by occupation and occasion. These dressings can be invoked from the cloud server 104 or from local storage on the PC 102; however, the PC 102 generally stores only the virtual robot image data that occupies little space, and most of the data is stored on the cloud server 104. In addition, the virtual robot 103 provided by the present invention is closer to a human being because it has a set character and a background story.
Then, the successfully woken virtual robot 103 waits for single-modal and/or multi-modal interaction instructions from the user 101. After the user 101 issues an instruction, the PC 102 acquires it: generally speaking, the PC 102 can collect the user's audio through its microphone, collect the user's image and video information through a camera, and collect the user's touch input through a touch device.
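As a rough sketch only, the following code illustrates how such single-modal or multi-modal input could be bundled into one interaction instruction before parsing; the class and function names are assumptions, not part of the patent.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class InteractionInstruction:
    """Container for a single- or multi-modal interaction instruction."""
    modalities: Dict[str, Any] = field(default_factory=dict)

    def add(self, modality: str, payload: Any) -> None:
        self.modalities[modality] = payload

    @property
    def is_multimodal(self) -> bool:
        return len(self.modalities) > 1

def collect_instruction(audio=None, image=None, video=None, text=None, touch=None):
    """Bundle whatever the microphone, camera, keyboard and touch device captured."""
    instruction = InteractionInstruction()
    for name, payload in (("audio", audio), ("image", image),
                          ("video", video), ("text", text), ("touch", touch)):
        if payload is not None:
            instruction.add(name, payload)
    return instruction

if __name__ == "__main__":
    inst = collect_instruction(audio=b"\x00\x01", text="I want to sing")
    print(inst.is_multimodal, sorted(inst.modalities))  # True ['audio', 'text']
```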
After the single-mode and/or multi-mode interaction instruction sent by the user 101 is obtained, the virtual robot 103 calls the robot capability interface to analyze the obtained single-mode and/or multi-mode interaction instruction sent by the user 101, and obtains the intention of the interaction instruction. The robot capability interface may include voice recognition, visual recognition, semantic understanding, emotion calculation, cognitive calculation, expression control, and motion control interfaces adapted to the set story line and the set character.
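A minimal sketch of this parsing step follows, with stand-in implementations of the capability interfaces named above (speech recognition, visual recognition, semantic understanding, emotion computation); in the actual system these would be calls to the cloud server, and every function here is hypothetical.

```python
def speech_recognition(audio_bytes: bytes) -> str:
    """Stand-in for the speech recognition interface: audio -> recognized text."""
    return "I want to sing a Chinese song"

def visual_recognition(image_bytes: bytes) -> dict:
    """Stand-in for the visual recognition interface: image -> identity and expression cues."""
    return {"identity": "user-101", "expression": "smiling"}

def emotion_computation(text: str, cues: dict) -> str:
    """Stand-in for the emotion computation interface."""
    return "happy" if cues.get("expression") == "smiling" else "neutral"

def semantic_understanding(text: str) -> dict:
    """Stand-in for the semantic understanding interface: text -> intent and slots."""
    if "sing" in text:
        return {"intent": "request_song", "slots": {"language": "Chinese"}}
    return {"intent": "chitchat", "slots": {}}

def parse_instruction(instruction: dict) -> dict:
    """Call the capability interfaces in turn and merge their results into one intent."""
    text = instruction.get("text") or speech_recognition(instruction.get("audio", b""))
    cues = visual_recognition(instruction["image"]) if "image" in instruction else {}
    intent = semantic_understanding(text)
    intent["user_emotion"] = emotion_computation(text, cues)
    return intent

if __name__ == "__main__":
    print(parse_instruction({"audio": b"...", "image": b"..."}))
```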
Then, the virtual robot 103, in cooperation with the cloud server 104, screens and generates multi-modal response data associated with the set character and the background story according to the current application scene and the set character. The application scenes generally include an entertainment scene, a companion scene, and an assistant scene, and the user 101 may select the application scene for interacting with the virtual robot 103 as needed. Because the virtual robot 103 has a set character and background story, it generates the corresponding multi-modal response data according to that preset character and story. The virtual robot 103 therefore has a degree of autonomy and selectivity: for a user 101 interaction instruction that does not conform to its preset character and background story, the virtual robot 103 outputs response data representing rejection, and this response data may be single-modal or multi-modal.
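The sketch below (illustrative only, with hypothetical names) shows one way this screening step could check an intent against the set character and either generate scene-appropriate response data or a rejection response.

```python
PERSONA = {
    "character": {"positive", "knowledgeable", "gentle", "elegant"},
    "background_story": "a modern woman with profound insight who loves reading",
}

def conforms_to_character(intent: dict, persona: dict) -> bool:
    """Rough check: the traits the intent points at must be covered by the set character."""
    requested = set(intent.get("traits", []))
    return requested <= persona["character"]

def generate_response(intent: dict, scene: str, persona: dict) -> dict:
    """Screen and generate multi-modal response data tied to the set character and story."""
    if not conforms_to_character(intent, persona):
        return {"text": "Unfortunately, I do not dance like that.",
                "expression": "regretful", "action": "shake_head"}
    if scene == "entertainment" and intent.get("intent") == "request_song":
        return {"text": "What song would you like to sing?",
                "expression": "smiling", "action": "tilt_head"}
    return {"text": "I see. Tell me more.", "expression": "neutral", "action": "idle"}

if __name__ == "__main__":
    print(generate_response({"intent": "request_dance", "traits": ["provocative"]},
                            "entertainment", PERSONA))     # rejection response
    print(generate_response({"intent": "request_song"}, "entertainment", PERSONA))
```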
Finally, the multi-modal response data generated in the above steps is output through the image of the virtual robot 103. The avatar of the virtual robot 103 on the screen of the PC102 can output multimodal response data by facial expressions, mouth shapes, and limb movements. The virtual robot 103 has a function of multi-modal output, and therefore, the interaction is more diversified.
In brief, the above interaction steps are to start the virtual robot 103 to display the image of the virtual robot 103 in the preset display area, and the virtual robot 103 has the set character and the background story. Then, the virtual robot 103 acquires the single-modality and/or multi-modality interactive instruction sent by the user 101. And then, calling a robot capability interface to analyze the single-mode and/or multi-mode interactive instruction, and acquiring the intention of the interactive instruction. Then, multi-modal response data associated with the set character and the background story are generated by screening according to the current application scene and the set character. And finally, outputting multi-modal response data through the image of the virtual robot 103.
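In sketch form (illustrative only, with the intermediate steps stubbed out), the whole round trip could be organized along the lines below; none of the names are taken from the patent.

```python
def interact_once(user_input: dict, scene: str, persona: dict) -> dict:
    """One interaction round following the five steps summarized above (stubbed logic)."""
    # Step 2: acquire the single-/multi-modal interaction instruction.
    instruction = {k: v for k, v in user_input.items() if v is not None}
    # Step 3: call the capability interfaces to obtain the intent (stubbed here).
    text = instruction.get("text", "")
    intent = ({"intent": "request_song", "traits": []} if "sing" in text
              else {"intent": "chitchat", "traits": []})
    # Step 4: screen response data against the set character, background story and scene.
    if not set(intent["traits"]) <= persona["character"]:
        response = {"text": "Unfortunately, that is not my style.", "expression": "regretful"}
    elif intent["intent"] == "request_song" and scene == "entertainment":
        response = {"text": "What song would you like to sing?", "expression": "smiling"}
    else:
        response = {"text": "I see. Tell me more.", "expression": "neutral"}
    # Step 5: output through the avatar (represented here by a print).
    print(f"[{response['expression']}] {response['text']}")
    return response

if __name__ == "__main__":
    interact_once({"text": "I want to sing"}, scene="entertainment",
                  persona={"character": {"gentle", "knowledgeable"}, "story": "loves reading"})
```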
An example scenario
In the present invention, the target hardware device 102 actually serves as the display carrier for the virtual robot 103 and for the interactive dialogue content, while the cloud server 104 is the carrier of the virtual robot data. An interactive dialogue between the virtual robot 103 and the user 101 is described below as an example. In the entertainment application scenario, the background story given to the virtual robot is that of a modern woman with profound insight and wholesome hobbies, and the virtual robot 103, with its elegant character, chats with the user 101 as follows.
The virtual robot 103 says: hi (smiling), do you have a help (call)?
The user 101 says: sometime, i want to sing.
The virtual robot 103 says: what song do you want to sing (smile)?
The user 101 says: did i think at any time, did you have a good-hearing song recently?
The virtual robot 103 says: is you smiling about how much you like a chinese song or english song or other foreign language song (question)?
The user 101 says: the Chinese song bar.
The virtual robot 103 says: good (interesting), i understand that do you like songs of what genre (question)?
The user 101 says: joyful and explosive, such as Wang Fei in Xiao Jingteng, so that people dance with the song with a sexuality.
When the character of the virtual robot is set to be positive, known and gentle, the intention of the user points to the character which is wonderful and beautifully beautiful, and the character does not point to the set character of the virtual robot, and response data which represents rejection is output, namely:
the virtual robot 103 says: unfortunately, i do not jump such a dance.
In the above dialogue, the virtual robot 103 changes its own emotion while giving a response and while waiting for the other party to respond. The content in parentheses is the expressive response made by the virtual robot 103. Besides expressions, the virtual robot 103 can express its current emotion by lowering or raising its tone of voice. Beyond expressions and tone, the virtual robot 103 can also express its emotion through limb actions, such as nodding, waving, sitting down, standing up, walking, and running.
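As an illustration of how a single reply could carry expression, tone, and limb-action cues at the same time, the mapping below is a hypothetical sketch; the emotion labels and channel names are assumptions, not part of the patent.

```python
# Hypothetical mapping from the robot's current emotion to the three output channels
# mentioned above: facial expression, voice tone, and limb action.
EMOTION_TO_OUTPUT = {
    "happy":     {"expression": "smile",   "tone_shift": +2, "action": "nod"},
    "regretful": {"expression": "frown",   "tone_shift": -2, "action": "shake_head"},
    "neutral":   {"expression": "relaxed", "tone_shift": 0,  "action": "idle"},
}

def embody(text: str, emotion: str) -> dict:
    """Attach expression, tone, and limb-action cues to a textual reply."""
    channels = EMOTION_TO_OUTPUT.get(emotion, EMOTION_TO_OUTPUT["neutral"])
    return {"text": text, **channels}

if __name__ == "__main__":
    print(embody("Unfortunately, I do not dance like that.", "regretful"))
```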
By judging changes in the interactive partner's emotion, the virtual robot 103 can make corresponding changes in expression, tone, and limb movement. When the program stalls or network problems occur, the virtual robot 103 can also cover the resulting break in the interaction with dancing or other performances. In addition, this kind of interactive output can improve the dialogue experience for users with a slight deficit in certain recognition abilities.
Most importantly, because the virtual robot 103 has a preset character and a preset background story, it refuses to output multi-modal response data that is inconsistent with its character. In this way, the virtual robot 103 feels closer to a human during the interaction, making the interaction content richer and more interesting.
FIG. 2 shows a block diagram of a multi-modal virtual robot interaction, in accordance with one embodiment of the present invention. As shown in fig. 2, the system includes a user 101, a target hardware device 102, and a cloud server 104. The user 101 includes three different types, which are a human, a virtual robot, and a physical robot. The target hardware device 102 includes a wake-up detection module 201, an input acquisition module 202, and a display area 1021.
It should be noted that the wake-up detection module 201 is configured to wake up and start the virtual robot 103, and the wake-up detection unit 201 detects that a specific biometric input is inputted, and then starts the virtual robot 103. Generally, the biometric input includes a touch action of the user, i.e. the user touches a touch area on a specific position of the target hardware device 102 by a finger, and the virtual robot 103 is woken up and then started. In addition, the wake-up detection module 201 may be removed under certain conditions, and the certain conditions mentioned herein may be that the virtual robot 103 is started together with the hardware loaded with the virtual robot program package, and at this time, the target hardware device 102 does not need to be loaded with the wake-up detection module 201. The conditions for waking up the virtual robot include, but are not limited to, the following:
having a particular biometric input;
or
Starting hardware loaded with a virtual robot program package;
or a specific system, application, or specific function loaded by the hardware is activated.
Also included in the target hardware device 102 is an input acquisition module 202, where the input acquisition module 202 is configured to acquire a single-modality and/or multi-modality interaction instruction sent by a user. The input acquisition module 202 may include a keyboard, a microphone, and a camera. The keyboard may obtain text information input by the user 101, the microphone may obtain audio information input by the user 101, and the camera may obtain images and video information input by the user 101. Other devices that can obtain the interaction instruction of the user 101 can also be applied to the interaction of the present invention, and the present invention is not limited thereto.
FIG. 3 shows a system block diagram of multi-modal virtual robot interaction, in accordance with one embodiment of the present invention. As shown in fig. 3, the system includes a wake detection module 201, an input acquisition module 202, an input parsing module 203, a screening processing module 204, and a data output module 205. The target hardware device 102 with the virtual robot 103 installed therein includes a wake-up detection module 201, an input acquisition module 202, an input analysis module 203, a screening processing module 204, and a data output module 205. Cloud server 104 includes an input parsing module 203 and a filtering processing module 204.
In the multi-modal virtual robot interaction system provided by the invention, communication is established between the target hardware device 102 provided with the virtual robot 103 and the cloud server 104, and tasks of analyzing and screening single-modal and/or multi-modal reply data sent by the user 101 are cooperatively completed. Therefore, the target hardware device 102 and the cloud server 104, which are installed with the virtual robot 103, both include the input analysis module 203 and the screening processing module 204.
As shown in fig. 3, the multi-modal virtual robot interaction system provided by the present invention includes a wake-up detection module 201, configured to receive start-up information for starting the virtual robot 103 sent by the user 101, and wake up the virtual robot 103. Generally, the wake-up detection module 201 is capable of detecting an input of a specific biometric feature, such as fingerprint information or voiceprint information of the user 101, or other preset biometric features, and waking up the virtual robot 103 according to specific information included in the biometric feature.
However, in addition to being woken up by the wake-up detection module 201, the virtual robot 103 may also be started together with the hardware loaded with the virtual robot program package, or simultaneously with a designated system, application, or designated function loaded on that hardware. This approach saves hardware installation space for the interactive system, but the user 101 cannot control when the virtual robot 103 starts. The designer of the interactive system may select an appropriate wake-up mode for the virtual robot 103 according to the actual situation. It should also be noted that the ways of waking up the virtual robot 103 are not limited to the two mentioned above; other ways of waking up the virtual robot 103 can also be applied to the interactive system provided by the present invention, and the invention is not limited in this respect.
In addition, the interactive system further includes an input acquisition module 202, which is used to acquire the single-modal and/or multi-modal interaction instructions sent by the user 101. These interaction instructions may include text, audio, image, and video information entered by the user 101. In order to capture this multi-modal information, the input acquisition module 202 is equipped with a text collection unit 2021, an audio collection unit 2022, an image collection unit 2023, and a video collection unit 2024. The text collection unit 2021 may be any physical or virtual keyboard. The audio collection unit 2022 may be a microphone or any other device that can collect the user 101's audio information.
The image capturing unit 2023 and the video capturing unit 2024 may be cameras, and the cameras may capture image information of one user 101 at intervals and then select suitable image information of the user 101. The interval time can be 1 minute or any other time, and the interval time parameter is set when the interactive system is designed and can be modified in subsequent use.
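A small sketch of the interval-based capture described above, with the interval as a configurable parameter, follows; the class name and the capture callback are hypothetical.

```python
import time

class IntervalImageSampler:
    """Capture one frame every `interval_seconds`; the interval is a design-time
    parameter that can be modified later, as described above."""

    def __init__(self, capture_fn, interval_seconds: float = 60.0):
        self.capture_fn = capture_fn          # e.g. a camera read callback
        self.interval_seconds = interval_seconds
        self._last_capture = 0.0

    def maybe_capture(self):
        """Return a frame if enough time has passed since the last capture, else None."""
        now = time.monotonic()
        if now - self._last_capture >= self.interval_seconds:
            self._last_capture = now
            return self.capture_fn()
        return None

if __name__ == "__main__":
    sampler = IntervalImageSampler(lambda: b"fake-frame", interval_seconds=0.1)
    print(sampler.maybe_capture())   # b'fake-frame'
    print(sampler.maybe_capture())   # None (called again too soon)
```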
In addition, examples of user input multimodal information devices include a keyboard, a cursor control device (mouse), a microphone for voice operation, a scanner, touch functionality (e.g., capacitive sensors to detect physical touch), a camera (detecting motion not involving touch using visible or invisible wavelengths), and so forth.
The interactive system further comprises an input analysis module 203, which is used for calling the robot capability interface to analyze the single-mode and/or multi-mode interactive instruction and obtain the intention of the interactive instruction. Generally, the input parsing module 203 included in the target hardware device 102 establishes a communication relationship with the cloud server 104, and sends information for invoking the robot capability interface to the cloud server 104. The cloud server 104 provides the robot capability to analyze the single-mode and/or multi-mode interactive instruction, then obtains the intention of the interactive instruction according to the analyzed result, and guides the generation of the response data according to the intention of the interactive instruction.
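The request/response exchange between the input parsing module and the cloud server could look roughly like the sketch below; the endpoint URL and the JSON shape are assumptions made purely for illustration, not an actual interface of the system.

```python
import json
import urllib.request

CLOUD_PARSE_URL = "https://cloud.example.com/robot/parse"   # hypothetical endpoint

def request_intent(instruction: dict, timeout_s: float = 5.0) -> dict:
    """Send the instruction to the cloud capability interface and return the parsed intent."""
    body = json.dumps({"instruction": instruction}).encode("utf-8")
    req = urllib.request.Request(CLOUD_PARSE_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        # The cloud server is assumed to reply with the intent as JSON.
        return json.loads(resp.read().decode("utf-8"))
```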
In addition, the interactive system further comprises a screening processing module 204 for screening and generating multi-modal response data associated with the set character and the background story according to the current application scene and the set character. It should be noted that before the interaction starts, the user 101 may select an application scene; in the present invention the application scenes include an entertainment scene, a companion scene, and an assistant scene. After the application scene is selected, the interaction formally starts, the user 101 interacts with the virtual robot 103 within that scene, and the virtual robot 103 screens and generates multi-modal response data associated with the set character and background story according to the current scene and the set character. When the interaction instructions of the user 101 do not match the character and background story set for the virtual robot 103, the virtual robot 103 may output multi-modal response data representing rejection, e.g. "That doesn't suit me, I won't do that" or "Let me read you a chapter instead".
Finally, the interactive system further comprises a data output module 205 for outputting the multi-modal response data through the avatar of the virtual robot. The multimodal response data includes text response data, audio response data, image response data, and video response data. The avatar of the virtual robot 103 outputs multi-modal response data through facial expressions, tones of speech, body movements, and the like. Output devices include, for example, display screens, speakers, haptic response devices, and the like. The communication capabilities of mobile devices include both wired and wireless communications. Examples include: one or more Wi-Fi antennas, GPS antennas, cellular antennas, NFC antennas, Bluetooth antennas.
FIG. 4 shows a robot capability interface diagram of a system for multi-modal virtual robot interaction, in accordance with one embodiment of the present invention. As shown in fig. 4, the robot capability interfaces include a voice recognition capability interface, a visual recognition capability interface, a semantic understanding capability interface, an emotion calculating capability interface, a cognitive control capability interface, and an expression control capability interface. The interaction system calls the robot capability interface after acquiring the single-mode and/or multi-mode interaction instruction sent by the user 101, analyzes the acquired interaction instruction, and acquires the intention of the interaction instruction.
The speech recognition capability interface is used to recognize audio interaction instructions sent by the user 101: it first identifies the language of the audio instruction and, after confirming the language, recognizes the textual content of the instruction. The instruction is then passed to the semantic understanding capability interface, which identifies the semantic information it contains and parses the intent of the interaction instruction sent by the user 101. The visual recognition capability interface can be used to recognize the identity of the interactive partner and the user's expressions and body movements, and, together with the speech recognition capability interface, to resolve the intent of the interaction instruction sent by the user 101.
In addition, the emotion computation capability interface is used to recognize and analyze the emotional state of the user 101 during the interaction, and the intent of the interaction instruction is analyzed according to that emotional state in cooperation with the speech recognition, visual recognition, and semantic understanding capability interfaces described above. The cognitive computation capability interface and the cognitive control capability interface are used to perform tasks related to the cognitive aspects of the virtual robot.
The robot capability interface can be called when analyzing the interactive instruction intention, and can also be called when generating response data, so as to screen and generate single-mode and/or multi-mode response data.
FIG. 5 shows a block workflow diagram of a multi-modal virtual robot interaction method according to one embodiment of the invention. As shown in fig. 5, the interactive system includes a wake-up detection module 201, an input acquisition module 202, an input parsing module 203, a screening processing module 204, and a data output module 205. Wherein, the wake-up detection module 201 includes a wake-up unit; the input acquisition module 202 includes an audio acquisition unit, a text acquisition unit, an image acquisition unit, and a video acquisition unit. The input parsing module 203 includes voice recognition capabilities, visual recognition capabilities, semantic understanding capabilities, emotion calculation capabilities, cognitive calculation capabilities, expression control capabilities, and cognitive control capabilities. The screening processing module 204 includes a screening unit and a processing unit.
First, the virtual robot 103 is started when the wake-up unit in the wake-up detection module 201 receives the input of the specific biological information sent by the user 101, and then the audio acquisition unit, the text acquisition unit, the image acquisition unit, and the video acquisition unit in the input acquisition module 202 acquire the single-mode and/or multi-mode interaction instruction sent by the user. The input analysis module 203 calls the voice recognition capability, the visual recognition capability, the semantic understanding capability, the emotion calculation capability, the cognitive calculation capability, the expression control capability and the cognitive control capability to analyze the single-mode and/or multi-mode interactive instruction and acquire the intention of the interactive instruction. The screening unit in the screening processing module 204 screens the multi-modal response data associated with the setting character and the background story according to the current application scene and the setting character, and the processing unit generates the multi-modal response data which needs to be output. Finally, the data output module 205 outputs the multi-modal response data through the avatar of the virtual robot.
Fig. 6 shows a schematic diagram of the relationship between the set character and the background stories according to an embodiment of the invention. As shown in Fig. 6, background story A, background story B, background story C, background story D, background story E, and background story F can all be associated with the preset character, and each of them influences the preset character of the virtual robot 103. After each interaction, the virtual robot 103 records the whole interaction process, and these interactions also affect the character of the virtual robot 103 to some extent.
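A minimal data-model sketch of this relationship in Fig. 6, under the assumption (hypothetical names throughout) that each background story contributes some weight to the preset character and that logged interactions nudge it over time:

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Persona:
    """Set character plus the background stories and interaction history bound to it."""
    traits: Counter = field(default_factory=Counter)
    background_stories: List[str] = field(default_factory=list)
    interaction_log: List[str] = field(default_factory=list)

    def bind_story(self, story: str, trait_weights: Dict[str, int]) -> None:
        """Each background story contributes some weight to the preset character."""
        self.background_stories.append(story)
        self.traits.update(trait_weights)

    def record_interaction(self, summary: str, trait_weights: Dict[str, int]) -> None:
        """Interactions are logged and also influence the character to some extent."""
        self.interaction_log.append(summary)
        self.traits.update(trait_weights)

if __name__ == "__main__":
    p = Persona()
    p.bind_story("modern woman who loves reading", {"gentle": 3, "knowledgeable": 2})
    p.record_interaction("helped the user pick a Chinese song", {"helpful": 1})
    print(p.traits.most_common(2))
```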
The above forms of association are not limiting. It should be noted that:
the virtual robot 103 can have an independent, lasting, and stable persona that is associated with a fixed background story and identity setting, and it carries out human-computer interaction under this complete setting. This makes the virtual robot 103 closer to a human being and brings the user 101 a more comfortable experience during the interaction.
FIG. 7 shows a flow diagram for conducting multimodal interactions, in accordance with one embodiment of the present invention. As shown in fig. 7, in step S701, the virtual robot 103 is first enabled to display the avatar of the virtual robot 103 in the preset display area, and the virtual robot 103 has a set character and a background story. Next, in step S702, a single-modality and/or multi-modality interaction instruction sent by the user is acquired. Then, in step S703, the robot capability interface is called to analyze the single-mode and/or multi-mode interactive instruction, and an intention of the interactive instruction is obtained. Next, in step S704, multi-modal response data associated with the setting character and the background story is generated by filtering according to the current application scene and the setting character. Finally, in step S705, the multi-modal response data is output through the avatar of the virtual robot.
FIG. 8 shows a detailed flow diagram for multimodal interaction in accordance with one embodiment of the invention. As shown in the figure, in step S801, when the target hardware device 102 detects that a specific biometric input is input, or hardware loaded with a virtual robot program package is started, the virtual robot 103 is woken up, and then, in step S802, after the virtual robot 103 is woken up, the avatar of the virtual robot 103 is displayed in a preset area, and the displayed virtual robot avatar has a set character and a background story. At this time, the preparatory work before the interaction is finished, and the interaction formally starts. Next, in step S803, the virtual robot 103 obtains a single-mode and/or multi-mode interactive instruction sent by the user 101, and then transmits the interactive instruction to the next link, and in step S804, the virtual robot 103 invokes a voice recognition, a visual recognition, a semantic understanding, an emotion calculation, a cognitive calculation, an expression control, and an action control capability interface adapted to a preset story and a setting character to parse the single-mode and/or multi-mode interactive instruction, and obtains an intention of the interactive instruction.
Then, in step S805, the virtual robot 103 determines whether the interactive command matches the set character, and outputs response data representing rejection if the direction of the interactive command does not match the set character, where the response data may be multi-modal response data. Next, in step S806, when the interactive instruction is intended to conform to the set character, multimodal response data associated with the set character and the background story is generated. Finally, in step S807, the multi-modal response data is output through the avatar of the virtual robot 103.
FIG. 9 shows another flow diagram for multi-modal interaction according to an embodiment of the invention. As shown, in step S901, the target hardware device 102 sends the interactive content to the cloud server 104. The target hardware device 102 then waits for the cloud server 104 to complete its part of the task. While waiting, the target hardware device 102 times how long it takes for the return data to arrive. If the return data is not received within a predetermined time, for example more than 5 s, the target hardware device 102 can choose to reply locally and generate common local response data. The virtual robot avatar then outputs an animation matched to the common local reply and calls the speech playback device to play the corresponding speech.
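A sketch of this wait-and-fall-back behaviour, assuming a 5 s threshold as in the example above; the function names are hypothetical and the cloud round trip is simulated with a sleep.

```python
import concurrent.futures
import time

CLOUD_TIMEOUT_S = 5.0                       # threshold used in the example above
LOCAL_COMMON_REPLY = {"text": "Sorry, could you say that again?", "source": "local"}

def slow_cloud_reply(instruction: dict) -> dict:
    """Simulated cloud round trip that deliberately exceeds the timeout."""
    time.sleep(6.0)
    return {"text": "cloud reply", "source": "cloud"}

def reply_with_fallback(instruction: dict) -> dict:
    """Wait for the cloud result; after the timeout, fall back to a local common reply."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(slow_cloud_reply, instruction)
    try:
        return future.result(timeout=CLOUD_TIMEOUT_S)
    except concurrent.futures.TimeoutError:
        return LOCAL_COMMON_REPLY            # e.g. played together with a matching animation
    finally:
        pool.shutdown(wait=False)

if __name__ == "__main__":
    print(reply_with_fallback({"text": "hello"})["source"])   # prints "local" after ~5 s
```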
Fig. 10 shows a flowchart of communication among three parties, namely, the user 101, the target hardware device 102 with the installed virtual robot 103, and the cloud server 104, according to an embodiment of the present invention.
As shown in fig. 10, at the beginning of the interaction, the user 101 activates the virtual robot 103, the image of the virtual robot 103 is displayed on the display area 1021 of the target hardware device 102, the virtual robot 103 activated by the user 101 has the set character and background, and the user 101 selects an application scene. At this point, the interaction is about to begin.
After the interaction starts, the virtual robot 103 obtains a single-mode and/or multi-mode interaction instruction sent by the user, and then the virtual robot 103 on the target hardware device 102 calls a robot capability interface to analyze the single-mode and/or multi-mode interaction instruction, so as to obtain an intention of the interaction instruction. Next, the virtual robot 103 filters the multi-modal response data pre-generated in relation to the set character and the background story according to the application scenario and the set character selected by the current user 101. If the interactive instruction does not conform to the set character of the virtual robot 103, the virtual robot 103 outputs multi-modal response data representing rejection. Finally, the virtual robot 103 outputs the generated multi-modal output data through the avatar.
It is to be understood that the disclosed embodiments of the invention are not limited to the particular structures, process steps, or materials disclosed herein but are extended to equivalents thereof as would be understood by those ordinarily skilled in the relevant arts. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A multi-modal virtual robot interaction method, the method comprising the steps of:
enabling a virtual robot to display the image of the virtual robot in a preset display area, wherein the virtual robot has a set character and a background story, and the condition for triggering the event for enabling the virtual robot comprises the following conditions: detecting that specific biological characteristics are input or hardware loaded with a virtual robot program package is started or a specified system, application and specified function loaded by the hardware are started;
acquiring a single-mode and/or multi-mode interaction instruction sent by a user;
calling a robot capability interface to parse the single-mode and/or multi-mode interaction instruction and obtain the intent of the interaction instruction;
screening and generating multi-modal response data associated with the set character and the background story according to the current application scene and the set character, wherein whether the single-modal and/or multi-modal interaction instruction conforms to the set character is judged, and when the intention direction of the interaction instruction does not conform to the set character direction, response data representing rejection is output, and the response data can be multi-modal response data;
and outputting the multi-modal response data through the image of the virtual robot.
2. The multi-modal virtual robot interaction method of claim 1, wherein invoking a robot capability interface to parse the single-modal and/or multi-modal interaction instructions, the step of obtaining the intent of the interaction instructions comprises:
and calling voice recognition, visual recognition, semantic understanding, emotion calculation, cognitive calculation, expression control and action control interfaces which are adaptive to the set background story and the set character.
3. The method of claim 1, wherein the single-modality and/or multi-modality interaction instructions comprise interaction instructions issued under entertainment, companion, assistant application scenarios.
4. The multi-modal virtual robot interaction method of claim 1, wherein the existence form of the virtual robot is not limited to any one of the following ways:
system services, platform functions, in-application functions, individual applications, text robot matching avatars.
5. A storage medium having stored thereon program code executable to perform the method steps of any of claims 1-4.
6. A multi-modal virtual robotic interaction device, the device comprising:
a starting display unit for starting a virtual robot to display an avatar of the virtual robot in a preset display area, the virtual robot having a set character and a background story, comprising: a detection subunit for detecting a specific biometric input or that hardware loaded with a virtual robot program package is started or that a designated system, an application, a designated function loaded by the hardware is started, a display subunit for displaying an avatar of the virtual robot in a preset display area;
the acquisition unit is used for acquiring a single-mode and/or multi-mode interaction instruction sent by a user;
the calling unit is used for calling a robot capability interface to analyze the single-mode and/or multi-mode interactive instruction and obtain the intention of the interactive instruction;
a generating unit for filtering and generating multi-modal response data associated with the setting character and the background story according to a current application scene and the setting character, comprising: the judging subunit is used for judging whether the single-mode and/or multi-mode interactive instruction conforms to the set character, and the rejecting subunit is used for outputting response data representing rejection when the intention direction of the interactive instruction does not conform to the set character direction, wherein the response data can be multi-mode response data;
an output unit for outputting the multi-modal response data through an avatar of the virtual robot.
7. The multi-modal virtual robot interaction device of claim 6, wherein the device comprises:
and the voice recognition subunit, the visual recognition subunit, the semantic understanding subunit, the emotion calculation subunit, the cognition calculation subunit, the expression control subunit and the action control subunit are adapted to the set background story and the set character.
8. The multi-modal virtual robot interaction device of claim 6, wherein the device comprises a scenario selection unit to select an application scenario, wherein an application scenario comprises an entertainment application scenario, a companion application scenario, and a helper application scenario.
9. The multi-modal virtual robot interaction device of any of claims 6-8, wherein the device comprises components that support multi-modal interaction of the virtual robot's presence modality without limiting any of the following:
system services, platform functions, in-application functions, individual applications, text robot matching avatars.
10. A multi-modal virtual robot interaction system, the system comprising:
a target hardware device for displaying a virtual robot image having a set story and a set character, and having a capability of receiving a single-modality and/or multi-modality interactive command transmitted from a user and a capability of outputting multi-modality response data, comprising: the system comprises a biological characteristic detection module, a virtual robot program package generation module and a virtual robot program package generation module, wherein the biological characteristic detection module is used for detecting whether specific biological characteristics are input or not, and detecting that hardware loaded with the virtual robot program package is started or a specified system, an application and a specified function loaded by the hardware are started;
a cloud server in communication with the target hardware device and providing a multimodal robot capability interface for performing the following steps:
calling a robot capability interface to parse the single-mode and/or multi-mode interaction instruction and obtain the intent of the interaction instruction;
and screening and generating multi-modal response data associated with the set character and the background story according to the current application scene and the set character, wherein whether the single-modal and/or multi-modal interaction instruction conforms to the set character is judged, and when the intention direction of the interaction instruction does not conform to the set character direction, response data representing rejection is output, and the response data can be multi-modal response data.
CN201710519314.6A 2017-06-29 2017-06-29 Multi-modal virtual robot interaction method and system Active CN107340865B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710519314.6A CN107340865B (en) 2017-06-29 2017-06-29 Multi-modal virtual robot interaction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710519314.6A CN107340865B (en) 2017-06-29 2017-06-29 Multi-modal virtual robot interaction method and system

Publications (2)

Publication Number Publication Date
CN107340865A CN107340865A (en) 2017-11-10
CN107340865B true CN107340865B (en) 2020-12-11

Family

ID=60218122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710519314.6A Active CN107340865B (en) 2017-06-29 2017-06-29 Multi-modal virtual robot interaction method and system

Country Status (1)

Country Link
CN (1) CN107340865B (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI734867B (en) * 2017-11-20 2021-08-01 達明機器人股份有限公司 Teaching system and method for operation path of robot arm
CN108037905B (en) * 2017-11-21 2021-12-21 北京光年无限科技有限公司 Interactive output method for intelligent robot and intelligent robot
CN108000526B (en) * 2017-11-21 2021-04-23 北京光年无限科技有限公司 Dialogue interaction method and system for intelligent robot
CN107908385B (en) * 2017-12-01 2022-03-15 北京光年无限科技有限公司 Holographic-based multi-mode interaction system and method
CN108037825A (en) * 2017-12-06 2018-05-15 北京光年无限科技有限公司 The method and system that a kind of virtual idol technical ability is opened and deduced
CN107861626A (en) * 2017-12-06 2018-03-30 北京光年无限科技有限公司 The method and system that a kind of virtual image is waken up
US10783329B2 (en) 2017-12-07 2020-09-22 Shanghai Xiaoi Robot Technology Co., Ltd. Method, device and computer readable storage medium for presenting emotion
CN107943299B (en) * 2017-12-07 2022-05-06 上海智臻智能网络科技股份有限公司 Emotion presenting method and device, computer equipment and computer readable storage medium
CN108037829B (en) * 2017-12-13 2021-10-19 北京光年无限科技有限公司 Multi-mode interaction method and system based on holographic equipment
CN108229642A (en) * 2017-12-28 2018-06-29 北京光年无限科技有限公司 Visual human's emotional ability shows output method and system
CN108388399B (en) * 2018-01-12 2021-04-06 北京光年无限科技有限公司 Virtual idol state management method and system
CN108470206A (en) * 2018-02-11 2018-08-31 北京光年无限科技有限公司 Head exchange method based on visual human and system
CN108470205A (en) * 2018-02-11 2018-08-31 北京光年无限科技有限公司 Head exchange method based on visual human and system
CN110166497B (en) * 2018-02-11 2022-07-12 深圳市玖胜云智联科技有限公司 Information pushing method, intelligent terminal and robot
CN108804698A (en) * 2018-03-30 2018-11-13 深圳狗尾草智能科技有限公司 Man-machine interaction method, system, medium based on personage IP and equipment
CN109032328A (en) * 2018-05-28 2018-12-18 北京光年无限科技有限公司 A kind of exchange method and system based on visual human
CN108942919B (en) * 2018-05-28 2021-03-30 北京光年无限科技有限公司 Interaction method and system based on virtual human
CN109033336B (en) * 2018-07-20 2021-12-10 重庆百事得大牛机器人有限公司 Legal consultation robot based on artificial intelligence and business model
CN109460548B (en) * 2018-09-30 2022-03-15 北京光年无限科技有限公司 Intelligent robot-oriented story data processing method and system
CN109262617A (en) * 2018-11-29 2019-01-25 北京猎户星空科技有限公司 Robot control method, device, equipment and storage medium
CN111291151A (en) * 2018-12-06 2020-06-16 阿里巴巴集团控股有限公司 Interaction method and device and computer equipment
CN112809709B (en) * 2019-01-25 2022-12-02 北京妙趣伙伴科技有限公司 Robot, robot operating system, robot control device, robot control method, and storage medium
CN110430553B (en) * 2019-07-31 2022-08-16 广州小鹏汽车科技有限公司 Interaction method and device between vehicles, storage medium and control terminal
CN110688911B (en) * 2019-09-05 2021-04-02 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium
CN110609620B (en) * 2019-09-05 2020-11-17 深圳追一科技有限公司 Human-computer interaction method and device based on virtual image and electronic equipment
CN110569352B (en) * 2019-09-17 2022-03-04 尹浩 Design system and method of virtual assistant capable of customizing appearance and character
CN111309886B (en) * 2020-02-18 2023-03-21 腾讯科技(深圳)有限公司 Information interaction method and device and computer readable storage medium
CN111966212A (en) * 2020-06-29 2020-11-20 百度在线网络技术(北京)有限公司 Multi-mode-based interaction method and device, storage medium and smart screen device
CN112034989A (en) * 2020-09-04 2020-12-04 华人运通(上海)云计算科技有限公司 Intelligent interaction system
CN112163078A (en) * 2020-09-29 2021-01-01 彩讯科技股份有限公司 Intelligent response method, device, server and storage medium
CN114979029B (en) * 2022-05-16 2023-11-24 百果园技术(新加坡)有限公司 Control method, device, equipment and storage medium of virtual robot
CN117245644A (en) * 2022-12-12 2023-12-19 北京小米机器人技术有限公司 Robot, mechanical character control method and device thereof, terminal and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9035955B2 (en) * 2012-05-16 2015-05-19 Microsoft Technology Licensing, Llc Synchronizing virtual actor's performances to a speaker's voice
JP6320237B2 (en) * 2014-08-08 2018-05-09 株式会社東芝 Virtual try-on device, virtual try-on method, and program
CN104504576A (en) * 2015-01-10 2015-04-08 闽江学院 Method for achieving three-dimensional display interaction of clothes and platform
CN105528077A (en) * 2015-12-11 2016-04-27 小米科技有限责任公司 Theme setting method and device
CN106462256A (en) * 2016-07-07 2017-02-22 深圳狗尾草智能科技有限公司 A function recommendation method, system and robot based on positive wakeup
CN106200959B (en) * 2016-07-08 2019-01-22 北京光年无限科技有限公司 Information processing method and system towards intelligent robot
CN106503043B (en) * 2016-09-21 2019-11-08 北京光年无限科技有限公司 A kind of interaction data processing method for intelligent robot
CN106503156B (en) * 2016-10-24 2019-09-03 北京百度网讯科技有限公司 Man-machine interaction method and device based on artificial intelligence

Also Published As

Publication number Publication date
CN107340865A (en) 2017-11-10

Similar Documents

Publication Publication Date Title
CN107340865B (en) Multi-modal virtual robot interaction method and system
US20220254343A1 (en) System and method for intelligent initiation of a man-machine dialogue based on multi-modal sensory inputs
US10832674B2 (en) Voice data processing method and electronic device supporting the same
US11468894B2 (en) System and method for personalizing dialogue based on user's appearances
CN105843381B (en) Data processing method for realizing multi-modal interaction and multi-modal interaction system
CN107632706B (en) Application data processing method and system of multi-modal virtual human
CN107704169B (en) Virtual human state management method and system
KR101604593B1 (en) Method for modifying a representation based upon a user instruction
CN112162628A (en) Multi-mode interaction method, device and system based on virtual role, storage medium and terminal
CN109086860B (en) Interaction method and system based on virtual human
CN108942919B (en) Interaction method and system based on virtual human
CN111368609A (en) Voice interaction method based on emotion engine technology, intelligent terminal and storage medium
KR20190022109A (en) Method for activating voice recognition servive and electronic device for the same
CN106503786B (en) Multi-modal interaction method and device for intelligent robot
EP3686724A1 (en) Robot interaction method and device
CN107480766B (en) Method and system for content generation for multi-modal virtual robots
CN106774845B (en) intelligent interaction method, device and terminal equipment
CN106502382B (en) Active interaction method and system for intelligent robot
CN109542389B (en) Sound effect control method and system for multi-mode story content output
CN107817799B (en) Method and system for intelligent interaction by combining virtual maze
CN110825164A (en) Interaction method and system based on wearable intelligent equipment special for children
CN113703585A (en) Interaction method, interaction device, electronic equipment and storage medium
KR102387400B1 (en) Method and system for recognizing emotions during a call and utilizing the recognized emotions
CN108388399B (en) Virtual idol state management method and system
CN108037829B (en) Multi-mode interaction method and system based on holographic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230919

Address after: 100000 6198, Floor 6, Building 4, Yard 49, Badachu Road, Shijingshan District, Beijing

Patentee after: Beijing Virtual Dynamic Technology Co.,Ltd.

Address before: 100000 Fourth Floor Ivy League Youth Venture Studio No. 193, Yuquan Building, No. 3 Shijingshan Road, Shijingshan District, Beijing

Patentee before: Beijing Guangnian Infinite Technology Co.,Ltd.
