CN110471531A - Multi-modal interactive system and method in virtual reality - Google Patents

Multi-modal interactive system and method in virtual reality Download PDF

Info

Publication number
CN110471531A
Authority
CN
China
Prior art keywords
user
information
dialog
virtual reality
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910749937.1A
Other languages
Chinese (zh)
Inventor
王鑫
许昭慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yixue Education Technology Co Ltd
Original Assignee
Shanghai Yixue Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yixue Education Technology Co Ltd filed Critical Shanghai Yixue Education Technology Co Ltd
Priority to CN201910749937.1A priority Critical patent/CN110471531A/en
Publication of CN110471531A publication Critical patent/CN110471531A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present invention provides a multi-modal human-machine dialogue learning system, method and apparatus in virtual reality, with which a user can interact intuitively with the computer through multiple interaction modes. In three-dimensional space the system changes dynamically with the user's operations and intentions, interprets the various senses and interaction modes, and achieves better situational understanding through data fusion. On this basis, anthropomorphic interlocutor feedback is given, so as to achieve a better practice effect for human-machine natural-language foreign-language learning.

Description

System and method for multi-modal man-machine conversation in virtual reality
Technical Field
The invention relates to the technical field of data processing, and in particular to a system, method and device for multi-modal human-machine dialogue learning in virtual reality (VR).
Background
The invention is motivated mainly by two backgrounds. First, real foreign-teacher resources are scarce or expensive: the pain point of the current one-to-one North American foreign-teacher model is that it does not scale economically, and many companies' losses grow as they scale up; the online language-education industry is therefore trying to overcome the bottlenecks of high teacher cost and scarce supply through teaching by AI virtual teachers. Second, mainstream foreign-language human-machine dialogue learning systems require the user to follow and repeat standard sentences, which are then scored by a speech evaluation system; such learning modes only train understanding of what others say, improving listening and spoken pronunciation, but do not support subjective expression well.
Disclosure of Invention
In recent years, with the development of speech recognition, spoken dialogue systems, speech evaluation models, speech synthesis and virtual reality technologies, natural dialogue between humans and computers has improved greatly: a computer can understand a user's weather inquiries, answer shopping questions, look up ticket information and the like, follow up on utterances that speech recognition could not capture accurately, and even answer in different tones according to a persona as required. However, through long-term observation and experiments the inventors found that this technology, applied to online foreign-language teaching, has at least the following disadvantages:
First, on a mobile phone or Pad terminal, a realistic language context can be created through rich text, but this mode of foreign teaching, pure pronunciation matched with pictures and sound effects, still cannot produce a satisfactory sense of immersion.
Second, VR is well suited to creating an immersive foreign-language learning environment; however, existing online-teaching VR systems are limited to delivering pre-programmed VR video data in a fixed pattern. It should be understood that a person's language ability means not only being able to form sentences that conform to grammar rules, but also being able to use language appropriately and vividly, and this ability requires cultivating a feel for the language through high-frequency natural interactive dialogue to improve comprehensive foreign-language competence; a VR system lacking a human-machine dialogue function is therefore of limited help to the user's foreign-language ability.
Third, single-channel speech dominates human-machine interaction in foreign-language teaching; however, a person's head movements, gestures, body movements, emotion changes and the like while speaking are important information feedback in natural conversation, and academic multi-modal human-machine dialogue technology has made only preliminary explorations of them.
In view of the above defects in the prior art, the present invention provides a system, method and device for multi-modal human-machine dialogue learning in virtual reality, so that a student user can interact intuitively with the computer through multiple interaction modes; the system changes dynamically in three-dimensional space with the student user's operations and intentions, interprets the various senses and interaction modes, and achieves better situational understanding through data fusion. On this basis, anthropomorphic interlocutor feedback is given, thereby achieving a better practice effect for human-machine natural-language foreign-language learning.
In one aspect, the present invention provides a multimodal human-machine dialog system, comprising: a virtual reality device configured to create a virtual space and manipulate a virtual image; an information acquisition module configured to acquire and receive user information from a user through the virtual reality device; an information processing module configured to fuse the received user information to generate multi-modal collaborative dialog content; and an information output module configured to output the multi-modal collaborative dialog content to the virtual reality device so as to correspondingly manipulate the virtual image in the virtual space.
In some embodiments, optionally, the virtual reality device includes one or more of the following: a virtual reality head-mounted display, a virtual reality base station, a handheld controller, a motion-sensing device, a host computer, a display screen, a sensor, an audio acquisition device and an audio playing device; and the user information includes one or more of the following channel signals: user speech, emotion analysis, head tracking, gaze interaction, spatial localization, handle and gesture signals, and somatosensory signals.
In some embodiments, optionally, the information processing module is further configured to fuse different channel signals in the user information at different timings.
In some embodiments, optionally, the information processing module is further configured to fuse different channel signals in the user information according to different timing relationships and/or constraint relationships between other features and the speech to obtain the multi-modal collaborative dialog content.
In some embodiments, optionally, the constraint relationship comprises one or more of the following: an alternation relationship, a complementary relationship and an enhancement relationship. An alternation relationship means that the semantic information represented by different channel signals is similar and/or interchangeable; a complementary relationship means that the speech content in the conversation needs other channel signals as a supplement to form complete semantics; and an enhancement relationship means that the semantic information represented by different channel signals is relatively independent and/or can enhance the expressive effect of the other channel signals.
In some embodiments, optionally, when the different channel signals are semantically ambiguous, the information processing module is further configured to: if the information among the channels is in an alternation relationship, make a dialogue response according to the speech content; if the information is in a complementary relationship, combine the information of the complementary channels to eliminate the ambiguity; if the information is in an enhancement relationship, make a dialogue response according to the speech content and the emotional intensity; and if the ambiguity cannot be eliminated, make a suggestive inquiry according to the context of the current conversation so as to obtain further feedback from the user.
In some embodiments, optionally, the information processing module is further configured to switch the dialog control right between the user and the system according to the questions or counter-questions they raise during their conversation.
In another aspect, the present application further provides a multimodal human-machine dialog method, comprising the following steps: receiving user information from a user, the user information comprising multi-channel signals; fusing the multi-channel signals according to a timing relationship and/or constraint relationship to generate multi-modal collaborative dialog content; and outputting the multi-modal collaborative dialog content to a virtual reality device so as to correspondingly manipulate the virtual image in the virtual space.
In some embodiments, optionally, the fusing step comprises: if the multi-channel signal comprises only voice information, making a dialogue response according to the speech content; if the multi-channel signal is semantically ambiguous, fusing the multi-channel signals according to the constraint relationship; and if the ambiguity cannot be eliminated, making a suggestive inquiry according to the context of the current conversation so as to obtain further feedback from the user.
In some embodiments, optionally, during the conversation between the user and the system, the dialog control right changes between them according to the questions or counter-questions raised by either side.
Compared with the prior art, the beneficial effects of the invention include at least the following:
First, the invention combines multi-modal interaction technology with virtual reality technology; the resulting novel human-machine dialogue system lets student users carry out learning activities more immersively, and the whole interaction process is vivid and lifelike, which greatly improves students' interest in learning and achieves the goal of learning transfer.
Second, the technique of the present invention helps improve the naturalness of human-machine dialogue. A multi-modal dialogue system built on the fused processing of multi-channel signals such as speech, head pose, gestures and emotion gives the computer more informative input, so that student users have a more natural experience throughout the dialogue.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
The present invention will become more readily understood from the following detailed description when read in conjunction with the accompanying drawings, wherein like reference numerals designate like parts throughout the figures, and in which:
fig. 1 is a schematic structural diagram of functional modules according to an embodiment of the present invention.
FIG. 2 is a block diagram of program modules according to an embodiment of the present invention.
FIG. 3 is a conversation policy logic diagram of one embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below; obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The present invention may be embodied in many different forms, and its scope is not limited to the embodiments set forth herein. All other embodiments obtained by a person skilled in the art without inventive effort on the basis of the embodiments of the present invention shall fall within the protection scope of the present invention.
The multi-modal human-machine dialogue learning system and device can include a contextual classroom and the multi-modal human-machine dialogue learning system. Virtual reality equipment such as a VR head-mounted display, a VR base station, a handheld controller, a somatosensory device and a host computer is arranged in the contextual classroom. The VR head-mounted display can further include a display screen, sensors, an audio acquisition device and an audio playing device. The VR base station and handheld controller can turn a room into a three-dimensional space, allowing the user to move around in the virtual world and manipulate the virtual image through motion tracking with the handheld controller. The host computer runs the multi-modal human-machine dialogue learning system, which is connected to the VR head-mounted display and the somatosensory device respectively.
Fig. 1 is a schematic structural diagram of functional modules according to an embodiment of the present invention. As shown in fig. 1, the multi-modal human-machine dialog system provided by the present invention may include, in addition to the virtual reality device, an information acquisition module, an information processing module and an information output module. The virtual reality device can create a virtual space and manipulate a virtual image, and may include one or more of the following devices: a virtual reality head-mounted display, a virtual reality base station, a handheld controller, a motion-sensing device, a host computer, a display screen, a sensor, an audio acquisition device and an audio playing device. The information acquisition module can acquire and receive user information from the user through the virtual reality device; the user information can include one or more of the following channel signals: user speech, emotion analysis, head tracking, gaze interaction, spatial localization, handle and gesture signals, and somatosensory signals. The information processing module can fuse the received user information to generate multi-modal collaborative dialog content, and the information output module can output that content to the virtual reality device so as to correspondingly manipulate the virtual image in the virtual space.
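As a rough sketch of how these four parts could fit together in code, the Python below models the acquisition → processing → output pipeline. It is an illustration only: all class, field and method names (ChannelSignal, UserInformation, collect, fuse, render, and so on) are assumptions of this sketch, not identifiers from the patent.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class ChannelSignal:
    """One input channel signal, e.g. speech, head tracking, handle, emotion."""
    channel: str       # e.g. "speech", "head", "handle", "emotion"
    payload: Any       # recognized content carried by the signal
    timestamp: float   # capture time, for timing-based fusion

@dataclass
class UserInformation:
    """Multi-channel user information as gathered by the acquisition module."""
    signals: List[ChannelSignal] = field(default_factory=list)

class InformationAcquisitionModule:
    def collect(self, raw: List[ChannelSignal]) -> UserInformation:
        # A real system would read these from the VR head-mounted display,
        # base station, handheld controller and somatosensory device.
        return UserInformation(signals=list(raw))

class InformationProcessingModule:
    def fuse(self, info: UserInformation) -> str:
        # Stand-in for the timing/constraint-based fusion described later:
        # here we simply pass through the speech channel if one is present.
        speech = [s for s in info.signals if s.channel == "speech"]
        return str(speech[0].payload) if speech else ""

class InformationOutputModule:
    def render(self, dialog_content: str) -> None:
        # Would drive the virtual character's voice, expression and motion
        # in the VR scene; printing stands in for that here.
        print(f"[virtual character] {dialog_content}")
```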
In some embodiments, the information acquisition module receives information of different channels, such as speech, speech emotion, head tracking, spatial orientation, handle signals and somatosensory signals, from the user through the input devices of the VR head-mounted display, and the information processing module then generates the multi-modal collaborative dialog content. The information output module synchronously outputs the virtual character's voice response in the virtual reality scene, together with the matching visual and motion interaction, to the VR head-mounted display.
FIG. 2 is a block diagram of program modules according to an embodiment of the present invention. With this technical scheme, the student user wears the VR helmet in the contextual classroom and enters the multi-modal human-machine dialogue learning system shown in FIG. 2. Through the novice-guidance module the student user becomes familiar with operations such as voice recording, head and spatial positioning, and handle buttons; the system then enters the learning-target module, where the student user learns the background of the story and the learning task to be completed. The system then displays the virtual scene with scene sequence label 1 together with the virtual characters to be shown in it, and the system's virtual characters conduct a human-machine dialogue with the student user according to the storyline of the functional script; a sketch of this staged flow follows below.
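A minimal sketch of that staged flow, assuming the stages run strictly in the order novice guidance → learning target → scene dialogue; the Stage enum and run_lesson function are hypothetical names, not from the patent.

```python
from enum import Enum, auto

class Stage(Enum):
    NOVICE_GUIDANCE = auto()   # practice recording, positioning, handle buttons
    LEARNING_TARGET = auto()   # story background and learning task
    SCENE_DIALOG = auto()      # scripted man-machine dialogue in the scene

def run_lesson(scene_scripts):
    """Walk the stages in the order described for FIG. 2 (illustrative only)."""
    for stage in Stage:        # Enum iteration preserves definition order
        print(f"entering {stage.name}")
        if stage is Stage.SCENE_DIALOG:
            for label, script in enumerate(scene_scripts, start=1):
                print(f"  scene {label}: dialogue driven by script: {script}")

run_lesson(["detective story, act 1"])
```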
During the conversation between the user and the system, the information processing module can also switch the dialog control right between the user and the system according to the questions or counter-questions they raise. In some embodiments of the multi-modal human-machine dialogue learning system, the storyline of the functional script adopts a mixed-initiative mode: both the user and the system can hold control of the dialogue, both can ask or counter-ask questions, and control changes hands as the conversation proceeds, which makes the multi-modal human-machine dialogue more like the interaction pattern of a real human conversation. A minimal sketch of such initiative tracking is given below.
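The sketch assumes the simplest possible trigger, a detected question, for handing over the initiative; the Party and InitiativeTracker names and the trailing-question-mark heuristic are illustrative assumptions, not the patent's mechanism.

```python
from enum import Enum

class Party(Enum):
    USER = "user"
    SYSTEM = "system"

class InitiativeTracker:
    """Mixed-initiative control: whoever asks a question takes the lead."""

    def __init__(self):
        self.holder = Party.SYSTEM   # the system opens the scene

    def on_utterance(self, speaker: Party, text: str) -> None:
        # Naive question detection; a real dialogue manager would use the
        # parser's dialogue-act label rather than a trailing question mark.
        if text.strip().endswith("?"):
            self.holder = speaker

tracker = InitiativeTracker()
tracker.on_utterance(Party.USER, "May I ask you something?")
assert tracker.holder is Party.USER   # control passes to the user
```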
With this technical scheme, human-machine dialogue must handle the fusion of different channel signals at different timings during the conversation. In some embodiments, for example in an English-teaching scenario, the technical solution of the present invention may take speech as the primary channel and perform fused feedback on the dialogue-leading and dialogue-control strategies according to the different timing relationships and constraint relationships between the other features and the speech; this is the key to improving the naturalness of multi-modal human-machine dialogue.
In some embodiments, the information processing module can fuse different channel signals in the user information at different timings, according to the different timing relationships and/or constraint relationships between the other features and the speech, to obtain the multi-modal collaborative dialog content.
The constraint relationships may include one or more of the following: an alternation relationship, a complementary relationship and an enhancement relationship. An alternation relationship means that the semantic information represented by different channel signals is similar and/or interchangeable; a complementary relationship means that the speech content in the conversation needs other channel signals as a supplement to form complete semantics; an enhancement relationship means that the semantic information represented by different channel signals is relatively independent and/or can enhance the expressive effect of the other channel signals.
In some embodiments, in the multimodal human-machine dialog learning system, the different channel signals may include user speech, emotion analysis, head tracking, gaze interaction, spatial localization, handle and gesture signals, and somatosensory signals. Different channels influence the voice interaction differently, and the multi-modal collaborative dialog content can be fused according to relations such as information alternation, information complementarity and information enhancement.
With this technical scheme, the dialogue management module considers the semantic relevance of the different channel signals, performs fusion at different levels, and formulates various dialogue strategies according to the teaching objectives. This can effectively improve the naturalness of human-machine dialogue and enables student users to learn appropriate language expressions in the scene; the dialogue information is finally conveyed through the virtual character by the speech synthesis module.
FIG. 3 is a dialogue-policy logic diagram of one embodiment of the present invention. As shown in FIG. 3, after multi-channel signal input is received, when multiple channel signals are present and the semantic understanding is ambiguous: if the information among the channels is in an alternation relationship, the word senses are interchangeable, so a dialogue response is made according to the speech content; if the information is in a complementary relationship, the speech content needs the other channel information to form complete semantics, so the information of the complementary channels is combined to eliminate the ambiguity; if the information is in an enhancement relationship, it can strengthen the expressive effect of the speech channel, so a dialogue response is made according to the speech content and the emotional intensity; and if the ambiguity cannot be resolved, suggestive inquiries are made according to the context of the current conversation, asking the user for further feedback.
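The branch structure of FIG. 3 can be captured in a few lines of code. The sketch below is a minimal Python illustration, assuming ambiguity detection and relation classification happen upstream; the names Relation and fuse_channels and the "pointing"/"emotion" keys are hypothetical, not identifiers from the patent.

```python
from enum import Enum, auto
from typing import Optional

class Relation(Enum):
    ALTERNATION = auto()    # channels carry interchangeable semantics
    COMPLEMENTARY = auto()  # speech needs other channels for full meaning
    ENHANCEMENT = auto()    # other channels reinforce the speech channel

def fuse_channels(speech: str,
                  relation: Optional[Relation],
                  others: dict,
                  ambiguous: bool,
                  context_prompt: str) -> str:
    """Dialogue policy following the FIG. 3 branches (names illustrative)."""
    if not ambiguous or relation is Relation.ALTERNATION:
        # Word senses are interchangeable: answer from the speech alone.
        return f"respond to: {speech}"
    if relation is Relation.COMPLEMENTARY:
        # Bind the under-specified part of the speech (e.g. a pronoun) to
        # the object supplied by the complementary channel, if any.
        referent = others.get("pointing")
        if referent is not None:
            return f"respond to: {speech} [referent -> {referent}]"
    if relation is Relation.ENHANCEMENT:
        # Reply to the speech content, modulated by emotional intensity.
        emotion = others.get("emotion", "neutral")
        return f"respond to: {speech} [tone adapted to {emotion}]"
    # Ambiguity survives fusion: ask a suggestive follow-up question.
    return f"prompt user: {context_prompt}"
```

Note that a complementary signal that fails to supply a referent falls through to the suggestive-inquiry branch, matching the "ambiguity cannot be resolved" arm of FIG. 3.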
The invention also provides a multi-modal human-machine dialogue method, comprising the following steps: receiving user information from a user, the user information comprising multi-channel signals; fusing the multi-channel signals according to a timing relationship and/or constraint relationship to generate multi-modal collaborative dialog content; and outputting the multi-modal collaborative dialog content to a virtual reality device so as to correspondingly manipulate the virtual image in the virtual space.
In some embodiments, if the multi-channel signal includes only voice information, a dialogue response is made according to the speech content; if the multi-channel signal is semantically ambiguous, the multi-channel signals are fused according to the constraint relationship; and if the ambiguity cannot be eliminated, a suggestive inquiry is made according to the context of the current conversation so as to obtain further feedback from the user. During the conversation between the user and the system, the dialog control right changes between them according to the questions or counter-questions raised by either side.
In some embodiments, a multi-modal human-machine dialogue learning system and device comprises a contextual classroom and the multi-modal human-machine dialogue learning system. In practice, the working process is as follows:
The student user wears the VR helmet in the contextual classroom and enters the multi-modal human-machine dialogue learning system shown in the program-module diagram of FIG. 2. First, through the novice-guidance module the student user becomes familiar with operations such as voice recording, head and spatial positioning and handle buttons; the system then enters the learning-target module, where the student user learns the story background and the learning task to be completed, for example expressing What- and Why-type questions in a detective-story context. The system then displays the virtual scene with scene sequence label 1 together with the virtual characters to be shown in it, and the system's virtual characters conduct a human-machine dialogue with the student user according to the storyline of the functional script. The storyline may be that the student user role-plays a spy who assists the police in handling a case and converses with a virtual police character to find out who the suspect is.
When the multi-modal information fusion module receives a speech signal and a head-tracking signal at the same time, for example when the student user answers the question "Are you for?" while the head-tracking signal registers a nod, the dialogue management module treats the answer as a positive reply according to the dialogue strategy for information in an alternation relationship. To let the student user learn an appropriate language expression in the scene, for example when the student user needs to introduce himself actively for the first time, the virtual character can ask "May I know your name?" through the speech synthesis module.
In some examples, the multi-modal information fusion module receives a speech signal and a handle signal at the same time: the student user answers the question "Who is the suspect?" with "I think he is the suspect" while pointing at a photo with the handle. Following the dialogue strategy for information in a complementary relationship, the dialogue management module combines the information of the complementary channels, understands that the "he" in the sentence refers to the person in the photo pointed at by the handle, and so eliminates the ambiguity by fusing the multi-channel information.
In some examples, the multi-modal information fusion module receives a speech signal and a spatial positioning signal at the same time: the student user says "What? Is this person dead?" while stepping back several steps, and the emotion analysis registers rising surprise and fear. Following the dialogue strategy for information in an enhancement relationship, the dialogue management module adds the emotional intensity to the speech content when making the dialogue response, understands the student user's surprise and fear about the matter, and has the virtual character say some soothing words.
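The three examples above map directly onto the three branches of the policy sketch given after the FIG. 3 discussion. The usage sketch below assumes the Relation enum and fuse_channels function from that sketch are in scope; the concrete strings are illustrative.

```python
# Usage sketch for the three examples above, assuming Relation and
# fuse_channels() from the earlier policy sketch are in scope.

# Alternation: speech plus a nod carry the same positive meaning.
print(fuse_channels("Yes, I am.", Relation.ALTERNATION,
                    {"head": "nod"}, ambiguous=True,
                    context_prompt="May I know your name?"))

# Complementarity: the pronoun is bound to the photo the handle points at.
print(fuse_channels("I think he is the suspect.", Relation.COMPLEMENTARY,
                    {"pointing": "person in the photo"}, ambiguous=True,
                    context_prompt="Which person do you mean?"))

# Enhancement: stepping back plus fear raises the emotional intensity,
# so the virtual character responds with soothing words.
print(fuse_channels("Is this person dead?", Relation.ENHANCEMENT,
                    {"emotion": "surprised/afraid"}, ambiguous=True,
                    context_prompt="It is all right, take your time."))
```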
In some embodiments, the various methods, processes, modules, apparatuses, devices or systems described above may be implemented or performed by one or more processing devices (e.g., digital processors, analog processors, digital circuits designed to process information, analog circuits designed to process information, state machines, computing devices, computers and/or other mechanisms for electronically processing information). The one or more processing devices may include one or more devices that perform some or all of the operations of a method in response to instructions stored electronically on an electronic storage medium, and may include one or more devices configured through hardware, firmware and/or software to be specifically designed to perform one or more operations of a method. The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent modifications or variations of the technical solutions and inventive concepts of the present invention that a person skilled in the art can readily conceive within the technical scope disclosed herein shall fall within the protection scope of the present invention.
Embodiments of the invention may be implemented in hardware, firmware, software, or various combinations thereof. The invention may also be implemented as instructions stored on a machine-readable medium, which may be read and executed using one or more processing devices. In one implementation, a machine-readable medium may include various mechanisms for storing and/or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable storage medium may include read-only memory, random-access memory, magnetic disk storage media, optical storage media, flash-memory devices, and other media for storing information, and a machine-readable transmission medium may include various forms of propagated signals (including carrier waves, infrared signals, digital signals), and other media for transmitting information. While firmware, software, routines, or instructions may be described in the above disclosure in terms of performing certain exemplary aspects and embodiments of certain actions, it will be apparent that such descriptions are merely for convenience and that such actions in fact result from a machine device, computing device, processing device, processor, controller, or other device or machine executing the firmware, software, routines, or instructions.
This written description uses examples to disclose the invention, one or more examples of which are described or illustrated in the specification and drawings. Each example is provided by way of explanation of the invention, not limitation of the invention. In fact, it will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the scope or spirit of the invention. For instance, features illustrated or described as part of one embodiment, can be used with another embodiment to yield a still further embodiment. It is therefore intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents. The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. The protection scope of the present invention is subject to the protection scope of the claims.

Claims (10)

1. A multimodal human-machine dialog system, comprising:
a virtual reality device configured to be able to create a virtual space and manipulate a virtual image;
an information acquisition module configured to be capable of acquiring and receiving user information from a user through the virtual reality device;
an information processing module configured to be capable of fusing the received user information to generate multi-modal collaborative dialog content; and
an information output module configured to be capable of outputting the multi-modal collaborative dialog content to the virtual reality device so as to correspondingly manipulate the virtual image in the virtual space.
2. The multimodal human-machine dialog system of claim 1, characterized in that:
the virtual reality device includes one or more of the following: a virtual reality head-mounted display, a virtual reality base station, a handheld controller, a motion-sensing device, a host computer, a display screen, a sensor, an audio acquisition device and an audio playing device; and
the user information comprises one or more of the following channel signals: user speech, emotion analysis, head tracking, gaze interaction, spatial localization, handle and gesture signals, and somatosensory signals.
3. The multimodal human-machine dialog system as claimed in any one of the preceding claims, characterized in that:
the information processing module is further configured to be capable of fusing different channel signals in the user information at different timings.
4. The multimodal human-machine dialog system as claimed in any one of the preceding claims, characterized in that:
the information processing module is further configured to be capable of fusing different channel signals in the user information according to different timing relationships and/or constraint relationships between other features and the speech to obtain the multi-modal collaborative dialog content.
5. The multimodal human-machine dialog system as claimed in any one of the preceding claims, characterized in that:
the constraint relationship comprises one or more of the following relationships: an alternation relationship, a complementary relationship, an enhancement relationship; wherein,
the alternation relationship means that the semantic information represented by different channel signals is similar and/or interchangeable;
the complementary relationship means that the speech content in the conversation needs other channel signals as a supplement to form complete semantics; and
the enhancement relationship means that the semantic information represented by different channel signals is relatively independent and/or can enhance the expressive effect of the other channel signals.
6. The multimodal human-machine dialog system as claimed in any one of the preceding claims, characterized in that:
when the different channel signals are semantically ambiguous, the information processing module is further configured to be capable of:
if the information among the channels is in the alternation relationship, making a dialogue response according to the speech content;
if the information among the channels is in the complementary relationship, combining the information of the complementary channels to eliminate the ambiguity;
if the information among the channels is in the enhancement relationship, making a dialogue response according to the speech content and the emotional intensity; and
if the ambiguity cannot be resolved, making a suggestive inquiry according to the context of the current conversation so as to obtain further feedback from the user.
7. The multimodal human-machine dialog system as claimed in any one of the preceding claims, characterized in that:
the information processing module is further configured to switch the dialog control right between the user and the system according to the questions or counter-questions raised by the user and the system during their conversation.
8. A method of multimodal human-machine dialog, comprising the steps of:
receiving user information from a user, the user information comprising a multi-channel signal;
fusing the multi-channel signals according to a time sequence relation and/or a constraint relation to generate multi-modal collaborative dialog content; and
outputting the multi-modal collaborative dialog content to a virtual reality device so as to correspondingly manipulate the virtual image in the virtual space.
9. The method of claim 8, wherein the fusing step comprises:
if the multi-channel signal comprises only voice information, making a dialogue response according to the voice content;
if the multi-channel signal is semantically ambiguous, fusing the multi-channel signal according to the constraint relationship; and
if the ambiguity cannot be resolved, making a suggestive inquiry according to the context of the current conversation so as to obtain further feedback from the user.
10. The multimodal human-machine dialog method as claimed in any one of the preceding claims, characterized in that:
during the conversation between the user and the system, the dialog control right is switched between the user and the system according to the questions or counter-questions raised by the user and the system.
CN201910749937.1A 2019-08-14 2019-08-14 Multi-modal interactive system and method in virtual reality Pending CN110471531A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910749937.1A CN110471531A (en) 2019-08-14 2019-08-14 Multi-modal interactive system and method in virtual reality

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910749937.1A CN110471531A (en) 2019-08-14 2019-08-14 Multi-modal interactive system and method in virtual reality

Publications (1)

Publication Number Publication Date
CN110471531A true CN110471531A (en) 2019-11-19

Family

ID=68511178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910749937.1A Pending CN110471531A (en) 2019-08-14 2019-08-14 Multi-modal interactive system and method in virtual reality

Country Status (1)

Country Link
CN (1) CN110471531A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130325482A1 (en) * 2012-05-29 2013-12-05 GM Global Technology Operations LLC Estimating congnitive-load in human-machine interaction
CN103793060A (en) * 2014-02-14 2014-05-14 杨智 User interaction system and method
CN104965592A (en) * 2015-07-08 2015-10-07 苏州思必驰信息科技有限公司 Voice and gesture recognition based multimodal non-touch human-machine interaction method and system
CN106420254A (en) * 2016-09-14 2017-02-22 中国科学院苏州生物医学工程技术研究所 Multi-person interactive virtual reality rehabilitation training and evaluation system
CN106569613A (en) * 2016-11-14 2017-04-19 中国电子科技集团公司第二十八研究所 Multi-modal man-machine interaction system and control method thereof
CN108399427A (en) * 2018-02-09 2018-08-14 华南理工大学 Natural interactive method based on multimodal information fusion
CN108334199A (en) * 2018-02-12 2018-07-27 华南理工大学 The multi-modal exchange method of movable type based on augmented reality and device
CN108942919A (en) * 2018-05-28 2018-12-07 北京光年无限科技有限公司 A kind of exchange method and system based on visual human
CN109933272A (en) * 2019-01-31 2019-06-25 西南电子技术研究所(中国电子科技集团公司第十研究所) The multi-modal airborne cockpit man-machine interaction method of depth integration
CN110070944A (en) * 2019-05-17 2019-07-30 段新 Training system is assessed based on virtual environment and the social function of virtual role

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨明浩, 陶建华, et al., "面向自然交互的多通道人机对话系统" ("A Multi-channel Human-Machine Dialogue System for Natural Interaction"), 《计算机科学》 (Computer Science) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956142A (en) * 2019-12-03 2020-04-03 中国太平洋保险(集团)股份有限公司 Intelligent interactive training system
CN112905754A (en) * 2019-12-16 2021-06-04 腾讯科技(深圳)有限公司 Visual conversation method and device based on artificial intelligence and electronic equipment
CN112905754B (en) * 2019-12-16 2024-09-06 腾讯科技(深圳)有限公司 Visual dialogue method and device based on artificial intelligence and electronic equipment
CN111968470A (en) * 2020-09-02 2020-11-20 济南大学 Pass-through interactive experimental method and system for virtual-real fusion
CN111968470B (en) * 2020-09-02 2022-05-17 济南大学 Pass-through interactive experimental method and system for virtual-real fusion
CN118445759A (en) * 2024-07-01 2024-08-06 南京维赛客网络科技有限公司 Method, system and storage medium for recognizing user intention in VR device
CN118445759B (en) * 2024-07-01 2024-08-27 南京维赛客网络科技有限公司 Method, system and storage medium for recognizing user intention in VR device

Similar Documents

Publication Publication Date Title
Dalim et al. TeachAR: An interactive augmented reality tool for teaching basic English to non-native children
Kahn et al. AI programming by children using snap! Block programming in a developing country
Martins et al. Accessible options for deaf people in e-learning platforms: technology solutions for sign language translation
CN110471531A (en) Multi-modal interactive system and method in virtual reality
TWI713000B (en) Online learning assistance method, system, equipment and computer readable recording medium
Oliveira et al. Automatic sign language translation to improve communication
Arsan et al. Sign language converter
WO2022196880A1 (en) Avatar-based interaction service method and device
KR20220129989A (en) Avatar-based interaction service method and apparatus
Divekar et al. Interaction challenges in ai equipped environments built to teach foreign languages through dialogue and task-completion
De Wit et al. The design and observed effects of robot-performed manual gestures: A systematic review
KR20190130774A (en) Subtitle processing method for language education and apparatus thereof
Pan et al. Application of virtual reality in English teaching
Alvarado et al. Inclusive learning through immersive virtual reality and semantic embodied conversational agent: a case study in children with autism
CN109272983A (en) Bilingual switching device for child-parent education
CN110070869B (en) Voice teaching interaction generation method, device, equipment and medium
CN113253838A (en) AR-based video teaching method and electronic equipment
Divekar AI enabled foreign language immersion: Technology and method to acquire foreign languages with AI in immersive virtual worlds
CN107154184B (en) Virtual reality equipment system and method for language learning
Soares et al. Sign language learning using the hangman videogame
US11605390B2 (en) Systems, methods, and apparatus for language acquisition using socio-neuorocognitive techniques
Doumanis Evaluating humanoid embodied conversational agents in mobile guide applications
TWM467143U (en) Language self-learning system
Divekar et al. Building human-scale intelligent immersive spaces for foreign language learning
Rauf et al. Urdu language learning aid based on lip syncing and sign language for hearing impaired children

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 10th Floor, Building B7, Huaxin Tiandi, 188 Yizhou Road, Xuhui District, Shanghai, 2003

Applicant after: Shanghai squirrel classroom Artificial Intelligence Technology Co.,Ltd.

Address before: 10th Floor, Building B7, Huaxin Tiandi, 188 Yizhou Road, Xuhui District, Shanghai, 2003

Applicant before: SHANGHAI YIXUE EDUCATION TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20191119