CN118092629A - Multi-mode interaction method, device, electronic equipment and storage medium

Multi-mode interaction method, device, electronic equipment and storage medium

Info

Publication number
CN118092629A
Authority
CN
China
Prior art keywords
information, user, augmented reality, action, voice information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211431749.2A
Other languages
Chinese (zh)
Inventor
程林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202211431749.2A
Publication of CN118092629A
Legal status: Pending

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides a multi-modal interaction method, a multi-modal interaction device, an electronic device and a storage medium. The multi-modal interaction method comprises the following steps: acquiring voice information and action information of a user; determining a corresponding control instruction according to the voice information and the action information; and executing the operation corresponding to the control instruction. The method of the present disclosure enables interaction that combines the user's speech and actions in an augmented reality environment.

Description

Multi-mode interaction method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of intelligent terminals, and in particular relates to a multi-mode interaction method, a multi-mode interaction device, electronic equipment and a storage medium.
Background
Extended reality (XR) combines the real and the virtual through a computer to create a virtual environment capable of human-machine interaction. In an augmented reality environment, the accuracy with which the augmented reality device recognizes a voice instruction input by the user is not high.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The disclosure provides a multi-modal interaction method, a multi-modal interaction device, electronic equipment and a storage medium.
The present disclosure adopts the following technical solutions.
In some embodiments, the present disclosure provides a multi-modal interaction method comprising:
Acquiring voice information and action information of a user;
determining a corresponding control instruction according to the voice information and the action information;
And executing the operation corresponding to the control instruction.
In some embodiments, the present disclosure provides a multi-modal interaction device comprising:
The acquisition module is used for acquiring voice information and action information of a user;
the first processing module is used for determining a corresponding control instruction according to the voice information and the action information;
and the second processing module is used for executing the operation corresponding to the control instruction.
In some embodiments, the present disclosure provides an electronic device comprising: at least one memory and at least one processor;
The memory is used for storing program codes, and the processor is used for calling the program codes stored in the memory to execute the method.
In some embodiments, the present disclosure provides a computer readable storage medium for storing program code which, when executed by a processor, causes the processor to perform the above-described method.
In some embodiments, the present disclosure provides a computer program product comprising instructions that, when executed by a computer device, cause the computer device to perform the above-described method.
The multi-modal interaction method provided by the embodiments of the present disclosure is applied to an augmented reality device: a corresponding control instruction is determined according to the voice information and action information of a user, and the operation corresponding to the control instruction is executed. Because the user's action information is combined with the voice information for a joint judgment during voice interaction, the embodiments of the present disclosure improve the accuracy with which the augmented reality device recognizes voice commands input by the user, and have high practicability.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a flow chart of a multi-modal interaction method of an embodiment of the present disclosure.
FIG. 2 is a schematic structural diagram of a multi-modal interaction device of an embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below. The term "responsive to" and related terms mean that one signal or event is affected to some extent by another signal or event, but not necessarily completely or directly. If event x occurs "in response to" event y, x may be directly or indirectly in response to y. For example, the occurrence of y may ultimately lead to the occurrence of x, but other intermediate events and/or conditions may exist. In other cases, y may not necessarily result in the occurrence of x, and x may occur even though y has not yet occurred. Furthermore, the term "responsive to" may also mean "at least partially responsive to".
The term "determining" broadly encompasses a wide variety of actions, which may include obtaining, calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like, and may also include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like, as well as parsing, selecting, choosing, establishing and the like. Related definitions of other terms will be given in the description below. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" or "an" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be construed as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The following describes in detail the schemes provided by the embodiments of the present disclosure with reference to the accompanying drawings.
As shown in fig. 1, fig. 1 is a flowchart of a multi-modal interaction method according to an embodiment of the disclosure, including the following steps.
Step S01: acquiring voice information and action information of a user;
In some embodiments, it is noted that in current augmented reality environments, voice command input is an important part of multi-modal interaction. The augmented reality device collects the voice information input by a user through a microphone of the wearable display component, recognizes the voice information, and then outputs feedback to the user. However, the directivity and accuracy of the voice information input by the user are sometimes low, so the augmented reality device cannot determine the control instruction corresponding to that voice information.
In some embodiments, the present disclosure addresses the above problem by combining the user's voice information and action information to perform a joint interaction judgment. Specifically, the voice information input by the user and the action information of the user are first collected through the augmented reality device. For example, when voice information input by the user is detected, the user's current action information is collected synchronously.
Step S02: determining a corresponding control instruction according to the voice information and the action information;
In some embodiments, after the user's voice information and action information are acquired, the voice information and the action information are analyzed and processed separately to extract audio features and action features; multi-modal feature fusion is then performed on the audio features and the action features to generate combined interaction information, and a control instruction is finally determined according to the combined interaction information.
Step S03: and executing the operation corresponding to the control instruction.
In some embodiments, after the matching control instruction is acquired, the operation corresponding to the control instruction is performed. For example, a preset picture is displayed in the display component of the augmented reality device, preset audio is output through the audio component of the augmented reality device, or vibration feedback is output through the augmented reality device.
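As a non-limiting illustration of how such feedback dispatch could be organized, the following Python sketch maps a matched control instruction to a feedback handler; the ControlInstruction fields, handler names, and payloads are assumptions introduced for this example and are not defined by the present disclosure.

```python
# Illustrative sketch only: a minimal feedback dispatcher with stand-in handlers.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ControlInstruction:
    name: str
    feedback: str          # "display", "audio" or "vibration"
    payload: str           # e.g. a picture id, an audio clip id, a haptics pattern

def execute(instruction: ControlInstruction,
            handlers: Dict[str, Callable[[str], None]]) -> None:
    """Run the operation corresponding to the matched control instruction."""
    handler = handlers.get(instruction.feedback)
    if handler is None:
        raise ValueError(f"no handler for feedback type {instruction.feedback!r}")
    handler(instruction.payload)

# Example wiring; a real device would call its display/audio/haptics SDK here.
handlers = {
    "display":   lambda p: print(f"show preset picture {p} on the display component"),
    "audio":     lambda p: print(f"play preset audio {p} through the audio component"),
    "vibration": lambda p: print(f"send vibration pattern {p} to the controller"),
}
execute(ControlInstruction("fire", "display", "muzzle_flash"), handlers)
```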
The multi-modal interaction method provided by the embodiments of the present disclosure is applied to an augmented reality device: a corresponding control instruction is determined according to the voice information and action information of a user, and the operation corresponding to the control instruction is executed. Because the user's action information is combined with the voice information for a joint judgment during voice interaction, the embodiments of the present disclosure improve the accuracy with which the augmented reality device recognizes voice commands input by the user, and have high practicability.
In some embodiments, the acquiring the voice information and the action information of the user includes:
Acquiring voice information input by a user through a sound receiving component of an augmented reality device, and
And acquiring action information of the user through an action detection part of the augmented reality device.
In some embodiments, the augmented reality device includes a wearable display component and an operating component. The wearable display component in turn includes a sound receiving component for collecting the voice information input by the user. The operating component includes a controller, such as a handheld handle device or a wearable controller, on which a motion detection component is provided for tracking the user's movements; the user's action information is acquired through this motion detection component.
In some embodiments, the sound pickup assembly includes a microelectromechanical system microphone, and/or a sound pickup sensor.
In some embodiments, the sound pickup assembly includes a high acoustic overload point (AOP) micro-electromechanical system (MEMS) microphone, and/or a voice pick-up sensor (VPU), for collecting the voice information input by the user.
In some embodiments, the motion detection component comprises at least one of a depth camera, an inertial measurement unit, and a gaze point tracking sensor.
In some embodiments, the principle of a time-of-flight (TOF) depth camera is to continuously send light pulses toward the target and receive, with a sensor, the light returned from the object; the distance to the target is obtained from the round-trip flight time of the light pulses, and simple user actions or gestures can then be recognized by a corresponding algorithm.
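As a non-limiting illustration of this round-trip relationship, the following Python sketch converts a measured round-trip time into a distance; it is a simplification for illustration, not a description of any specific camera.

```python
# Illustrative sketch: target distance from the round-trip time of a light pulse,
# as measured per pixel by a TOF depth camera.
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_distance(round_trip_time_s: float) -> float:
    """Distance = c * t / 2, since the pulse travels out to the object and back."""
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0

# A pulse returning after 10 nanoseconds corresponds to roughly 1.5 m.
print(tof_distance(10e-9))  # ~1.499 m
```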
In some embodiments, an inertial measurement unit (IMU) is used to measure the motion of the user's key body parts. It integrates gyroscope, accelerometer and magnetometer sensor modules, offers a high update rate, and is independent of field of view and lighting conditions.
In some embodiments, the gaze point tracking sensor is used for eye movement tracking: by capturing the user's eye movement track in the augmented reality scene, it can accurately calculate where the user's gaze falls and in which area of the device screen it dwells, so as to control operations such as scrolling the screen, browsing web pages, and playing games.
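A minimal sketch of mapping a gaze point to a screen region is shown below; the normalized coordinates, region names, and dwell-to-scroll behaviour are assumptions for illustration only.

```python
# Illustrative sketch: the eye tracker is assumed to report a normalized gaze
# point (x, y) in [0, 1] x [0, 1] on the virtual screen.
def gaze_region(x: float, y: float) -> str:
    if y < 0.15:
        return "scroll_up"      # dwelling near the top edge scrolls the page up
    if y > 0.85:
        return "scroll_down"    # dwelling near the bottom edge scrolls it down
    return "content"            # otherwise the gaze selects content under it

print(gaze_region(0.5, 0.05))   # -> "scroll_up"
print(gaze_region(0.4, 0.50))   # -> "content"
```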
In some embodiments, the motion information of the user includes at least one of user gesture information, user head motion information, motion information for a controller of the augmented reality device, and user eye movement information.
In some embodiments, the user's movements are tracked by the motion detection component of the augmented reality device, including but not limited to the user's current gesture movement and orientation, the current movement and orientation of the handle, the user's current eye movement and gaze direction, the posture and orientation of the wearable display component the user is currently wearing, and so on.
In some embodiments, the inertial measurement unit includes at least one of an acceleration sensor, a gyroscope, a magnetic force sensor, and a six degree of freedom sensor.
In some embodiments, an acceleration sensor is used to measure acceleration; unlike remote sensing devices, it measures the movement of the user's own body.
In some embodiments, the gyroscope is an angular-motion detection device that uses the moment of momentum of a high-speed rotor, sensed relative to an inertial reference about one or two axes orthogonal to the spin axis. Applying it makes it possible to measure the tilt angle and tilt direction of the augmented reality device and thus capture motion better. The gyroscope provides the orientation information that lets the scene shown on the augmented reality display change in real time as the user's head moves. For example, when a user wearing an augmented reality device looks up, the display must show the sky of the virtual world in real time, and when the user turns around, the display must show the scene behind them to simulate a real turn.
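A minimal sketch of this idea is shown below, assuming simple Euler-angle integration of gyroscope readings; real head trackers fuse several sensors and use quaternions to limit drift, so the axis conventions and update rule here are illustrative assumptions.

```python
# Illustrative sketch: integrate angular velocity over time so the rendered
# scene can follow the user's head rotation.
def update_orientation(orientation, angular_velocity, dt):
    """orientation and angular_velocity are (yaw, pitch, roll) tuples, rad and rad/s."""
    return tuple(o + w * dt for o, w in zip(orientation, angular_velocity))

orientation = (0.0, 0.0, 0.0)
# Head turning left at 0.5 rad/s for 0.2 s shifts the rendered view accordingly.
orientation = update_orientation(orientation, (0.5, 0.0, 0.0), 0.2)
print(orientation)  # (0.1, 0.0, 0.0)
```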
In some embodiments, the magnetic sensor is used to measure the strength and direction of the magnetic field and to locate the orientation of the device. Its principle is similar to that of a compass, and it can measure the angle between the current augmented reality device and the four cardinal directions (east, south, west, and north). A magnetometer measures the magnetic field using the principle of Hall elements. Since a Hall element can only measure the magnetic field along one direction, a magnetometer that measures the magnetic field in space uses Hall sensors along three orthogonal directions; the three measurements form a combined field vector, and comparing the direction of this combined field with the direction of the geomagnetic field yields the direction and attitude of the sensor.
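A minimal sketch of a heading computed from the horizontal magnetometer components is shown below; the axis conventions and the assumption that the device is held level are illustrative simplifications, and real devices tilt-compensate with the accelerometer first.

```python
# Illustrative sketch: heading from a three-axis magnetometer, using only the
# two horizontal components under the assumed axis conventions.
import math

def heading_degrees(mx: float, my: float) -> float:
    """Angle between the device forward axis and magnetic north, in degrees [0, 360)."""
    return math.degrees(math.atan2(my, mx)) % 360.0

print(heading_degrees(0.0, 1.0))   # 90.0 under these conventions
```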
In some embodiments, six degrees of freedom (6DoF) adds up-and-down, left-and-right, and back-and-forth movements of the body to the three rotational degrees of freedom. In addition to changes in viewing angle caused by rotation of the user's head, a six-degree-of-freedom augmented reality device can detect changes in the user's vertical, front-to-back, and left-to-right displacement caused by body movement. The six degrees of freedom are typically sensed with one triaxial accelerometer and one triaxial gyroscope.
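The following Python sketch only illustrates what a six-degree-of-freedom pose contains (three rotations plus three translations); the field names and the simple update method are assumptions for this example.

```python
# Illustrative sketch: a 6DoF pose = 3 rotational + 3 translational degrees.
from dataclasses import dataclass

@dataclass
class Pose6DoF:
    yaw: float = 0.0    # rotation, rad
    pitch: float = 0.0
    roll: float = 0.0
    x: float = 0.0      # translation, m (left/right)
    y: float = 0.0      # up/down
    z: float = 0.0      # forward/back

    def translate(self, dx: float, dy: float, dz: float) -> None:
        self.x += dx
        self.y += dy
        self.z += dz

pose = Pose6DoF()
pose.translate(0.0, -0.3, 0.5)   # user crouches slightly and steps forward
print(pose)
```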
In some embodiments, the speech information includes semantic information or does not include semantic information.
In some embodiments, the speech input by the user includes not only semantic speech but also speech without semantics, such as whistling or imitated sounds. It can be understood that when a voice interaction scheme of the related art receives user speech without semantics, it cannot obtain explicit directivity information and therefore cannot determine the voice instruction, resulting in a poor user experience.
In some embodiments, the determining the corresponding control instruction according to the voice information and the action information includes:
generating combined interaction information according to the action information and the voice information;
and acquiring a control instruction which is stored in the current augmented reality scene and is matched with the combined interaction information.
In some embodiments, after the action information and the voice information are acquired, the augmented reality device sends them to the processor module, and the processor module matches the superposed action information and voice information against the preset interactions to obtain the control instruction that is stored in the current augmented reality scene and matches the combined interaction information. For example, when the user makes a pistol gesture and simultaneously utters a firing sound, visual feedback of a fired bullet is given, assuming that the current scene has preset this combination of action and speech. As another example, in a magic-themed game the user may simulate a magic wand by raising the handle controller or making a gesture while reciting the corresponding spell, whereupon the preset magic corresponding to that voice is cast. As another example, in a game the user may whistle at or beckon toward an object they are gazing at, and the object gives correspondingly different feedback.
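As a non-limiting sketch of such superposition matching, the table below maps hypothetical (action, voice) combinations to control instructions; the keys and instruction names are illustrative assumptions, not definitions from the disclosure.

```python
# Illustrative sketch: a scene keeps preset (action, voice) combinations and
# looks up the control instruction matching the combined interaction.
from typing import Dict, Optional, Tuple

SCENE_COMBINATIONS: Dict[Tuple[str, str], str] = {
    ("pistol_gesture", "bang"):        "fire_bullet_with_visual_feedback",
    ("raise_wand",     "spell_lumos"): "cast_preset_magic",
    ("gaze_at_object", "whistle"):     "object_reacts_to_whistle",
}

def match_instruction(action: str, voice: str) -> Optional[str]:
    """Return the preset control instruction for the combined interaction, if any."""
    return SCENE_COMBINATIONS.get((action, voice))

print(match_instruction("pistol_gesture", "bang"))   # fire_bullet_with_visual_feedback
print(match_instruction("pistol_gesture", "hello"))  # None -> no preset combination
```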
In some embodiments, the generating the combined interaction information according to the action information and the voice information includes:
Extracting user action characteristics from the action information and extracting audio characteristics from the voice information;
and carrying out multi-mode feature fusion on the extracted user action features and the audio features to generate combined interaction information.
In some embodiments, since a single voice modality generally cannot contain all the effective information required to produce an accurate prediction result, the present disclosure combines information from the voice modality and the action modality, so that the modalities supplement each other, the coverage of the information contained in the input data is widened, and the accuracy of the subsequent control-instruction matching is improved. The embodiments of the present disclosure therefore provide a new multi-modal interaction approach that can address the insufficient accuracy of speech recognition, gesture tracking, and eye tracking systems in current augmented reality environments. Specifically, the user's action information is combined with the voice information for a joint judgment during voice interaction, which improves both the accuracy with which the augmented reality device recognizes the voice instructions input by the user and the sense of integration between the augmented reality device and its environment.
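A minimal sketch of feature-level fusion is shown below, assuming fixed-length audio and action feature vectors; the stand-in feature extractors and the simple concatenation are assumptions for illustration and do not represent the disclosure's actual models.

```python
# Illustrative sketch: extract audio and action features, then fuse them into
# one combined interaction feature vector by concatenation.
import numpy as np

def extract_audio_features(waveform: np.ndarray) -> np.ndarray:
    # Stand-in: a few simple statistics instead of a real acoustic front end.
    return np.array([waveform.mean(), waveform.std(), np.abs(waveform).max()])

def extract_action_features(samples: np.ndarray) -> np.ndarray:
    # Stand-in: per-axis mean and variance of motion samples (N x 3).
    return np.concatenate([samples.mean(axis=0), samples.var(axis=0)])

def fuse(audio_feat: np.ndarray, action_feat: np.ndarray) -> np.ndarray:
    # Simple concatenation; a learned fusion network could replace this step.
    return np.concatenate([audio_feat, action_feat])

waveform = np.random.randn(16000)   # 1 s of audio at 16 kHz (synthetic)
motion = np.random.randn(200, 3)    # 200 hand/head samples on 3 axes (synthetic)
combined = fuse(extract_audio_features(waveform), extract_action_features(motion))
print(combined.shape)               # (9,) -> combined interaction feature vector
```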
In some embodiments, the performing an operation corresponding to the control instruction includes:
and displaying a preset picture in a display component of the augmented reality device, and/or outputting preset audio through an audio component of the augmented reality device, and/or outputting vibration feedback through the augmented reality device.
In some embodiments, the output feedback effect includes one or more of the following: displaying a preset picture in a display component of the augmented reality device, outputting preset audio through an audio component of the augmented reality device, outputting vibration feedback through the augmented reality device, and the like.
As shown in fig. 2, an embodiment of the present disclosure further provides a multi-modal interaction device, including:
the acquisition module 1 is used for acquiring voice information and action information of a user;
The first processing module 2 is used for determining a corresponding control instruction according to the voice information and the action information;
and the second processing module 3 is used for executing the operation corresponding to the control instruction.
In some embodiments, the obtaining module is specifically configured to:
Acquiring voice information input by a user through a sound receiving component of an augmented reality device, and
And acquiring action information of the user through an action detection part of the augmented reality device.
In some embodiments, the sound pickup assembly includes a microelectromechanical system microphone, and/or a sound pickup sensor.
In some embodiments, the motion detection component comprises at least one of a depth camera, an inertial measurement unit, and a gaze point tracking sensor.
In some embodiments, the motion information of the user includes at least one of user gesture information, user head motion information, motion information for a controller of the augmented reality device, and user eye movement information.
In some embodiments, the inertial measurement unit includes at least one of an acceleration sensor, a gyroscope, a magnetic force sensor, and a six degree of freedom sensor.
In some embodiments, the speech information includes semantic information or does not include semantic information.
In some embodiments, the first processing module is specifically configured to:
generating combined interaction information according to the action information and the voice information;
and acquiring a control instruction which is stored in the current augmented reality scene and is matched with the combined interaction information.
In some embodiments, the first processing module is further specifically configured to:
Extracting user action characteristics from the action information and extracting audio characteristics from the voice information;
and carrying out multi-mode feature fusion on the extracted user action features and the audio features to generate combined interaction information.
In some embodiments, the second processing module is specifically configured to:
and displaying a preset picture in a display component of the augmented reality device, and/or outputting preset audio through an audio component of the augmented reality device, and/or outputting vibration feedback through the augmented reality device.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative; the modules illustrated as separate modules may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, and those of ordinary skill in the art can understand and implement it without undue effort.
The method and apparatus of the present disclosure are described above based on the embodiments and applications. In addition, the present disclosure also provides an electronic device and a computer-readable storage medium, which are described below.
Referring now to fig. 3, a schematic diagram of an electronic device (e.g., a terminal device or server) 800 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in the drawings is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
The electronic device 800 may include a processing means (e.g., a central processor, a graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with programs stored in a Read Only Memory (ROM) 802 or loaded from a storage device 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 are also stored. The processing device 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
In general, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, etc.; storage 808 including, for example, magnetic tape, hard disk, etc.; communication means 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While an electronic device 800 having various means is shown, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 809, or installed from storage device 808, or installed from ROM 802. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 801.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods of the present disclosure described above.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a multi-modal interaction method, including:
Acquiring voice information and action information of a user;
determining a corresponding control instruction according to the voice information and the action information;
And executing the operation corresponding to the control instruction.
According to one or more embodiments of the present disclosure, there is provided a method of acquiring voice information and motion information of a user, including:
Acquiring voice information input by a user through a sound receiving component of an augmented reality device, and
And acquiring action information of the user through an action detection part of the augmented reality device.
In accordance with one or more embodiments of the present disclosure, a method is provided, the acoustic assembly including a microelectromechanical system microphone, and/or a pickup sensor.
In accordance with one or more embodiments of the present disclosure, a method is provided, the motion detection component comprising at least one of a depth camera, an inertial measurement unit, and a gaze point tracking sensor.
According to one or more embodiments of the present disclosure, there is provided a method, the motion information of the user including at least one of user gesture information, user head motion information, motion information for a controller of the augmented reality device, and user eye movement information.
According to one or more embodiments of the present disclosure, there is provided a method, the inertial measurement unit including at least one of an acceleration sensor, a gyroscope, a magnetic force sensor, and a six degree of freedom sensor.
According to one or more embodiments of the present disclosure, a method is provided, the speech information including semantic information or not including semantic information.
According to one or more embodiments of the present disclosure, there is provided a method of determining a corresponding control instruction according to the voice information and the motion information, including:
generating combined interaction information according to the action information and the voice information;
and acquiring a control instruction which is stored in the current augmented reality scene and is matched with the combined interaction information.
According to one or more embodiments of the present disclosure, there is provided a method of generating combined interaction information from the action information and the voice information, including:
Extracting user action characteristics from the action information and extracting audio characteristics from the voice information;
and carrying out multi-mode feature fusion on the extracted user action features and the audio features to generate combined interaction information.
According to one or more embodiments of the present disclosure, there is provided a method of performing operations corresponding to the control instructions, including:
and displaying a preset picture in a display component of the augmented reality device, and/or outputting preset audio through an audio component of the augmented reality device, and/or outputting vibration feedback through the augmented reality device.
According to one or more embodiments of the present disclosure, there is provided a multi-modal interaction apparatus, comprising:
The acquisition module is used for acquiring voice information and action information of a user;
the first processing module is used for determining a corresponding control instruction according to the voice information and the action information;
and the second processing module is used for executing the operation corresponding to the control instruction.
According to one or more embodiments of the present disclosure, there is provided an electronic device including: at least one memory and at least one processor;
Wherein the at least one memory is configured to store program code, and the at least one processor is configured to invoke the program code stored by the at least one memory to perform any of the methods described above.
According to one or more embodiments of the present disclosure, a computer-readable storage medium is provided for storing program code which, when executed by a processor, causes the processor to perform the above-described method.
According to one or more embodiments of the present disclosure, a computer program product is provided, characterized in that the computer program product comprises instructions which, when executed by a computer device, cause the computer device to perform the above-mentioned method.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (14)

1. A multi-modal interaction method, applied to an augmented reality device, comprising:
Acquiring voice information and action information of a user;
determining a corresponding control instruction according to the voice information and the action information;
And executing the operation corresponding to the control instruction.
2. The method of claim 1, wherein the obtaining voice information and motion information of the user comprises:
Acquiring voice information input by a user through a sound receiving component of an augmented reality device, and
And acquiring action information of the user through an action detection part of the augmented reality device.
3. The method of claim 2, wherein the sound pickup assembly comprises a microelectromechanical system microphone, and/or a sound pickup sensor.
4. The method of claim 2, wherein the motion detection component comprises at least one of a depth camera, an inertial measurement unit, and a gaze point tracking sensor.
5. The method of claim 4, wherein the motion information of the user comprises at least one of user gesture information, user head motion information, motion information for a controller of the augmented reality device, and user eye movement information.
6. The method of claim 4, wherein the inertial measurement unit comprises at least one of an acceleration sensor, a gyroscope, a magnetic force sensor, and a six degree of freedom sensor.
7. The method of claim 2, wherein the speech information includes semantic information or does not include semantic information.
8. The method of claim 1, wherein said determining a corresponding control command based on said voice information and said action information comprises:
generating combined interaction information according to the action information and the voice information;
and acquiring a control instruction which is stored in the current augmented reality scene and is matched with the combined interaction information.
9. The method of claim 8, wherein generating combined interaction information from the action information and the voice information comprises:
Extracting user action characteristics from the action information and extracting audio characteristics from the voice information;
and carrying out multi-mode feature fusion on the extracted user action features and the audio features to generate combined interaction information.
10. The method of claim 1, wherein the performing an operation corresponding to the control instruction comprises:
and displaying a preset picture in a display component of the augmented reality device, and/or outputting preset audio through an audio component of the augmented reality device, and/or outputting vibration feedback through the augmented reality device.
11. A multi-modal interaction device, comprising:
The acquisition module is used for acquiring voice information and action information of a user;
the first processing module is used for determining a corresponding control instruction according to the voice information and the action information;
and the second processing module is used for executing the operation corresponding to the control instruction.
12. An electronic device, comprising:
At least one memory and at least one processor;
wherein the at least one memory is configured to store program code, and the at least one processor is configured to invoke the program code stored by the at least one memory to perform the method of any of claims 1 to 10.
13. A computer readable storage medium for storing program code which, when run by a computer device, causes the computer device to perform the method of any one of claims 1 to 10.
14. A computer program product, characterized in that it comprises instructions which, when executed by a computer device, cause the computer device to perform the method according to any of claims 1 to 10.
CN202211431749.2A 2022-11-16 2022-11-16 Multi-mode interaction method, device, electronic equipment and storage medium Pending CN118092629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211431749.2A CN118092629A (en) 2022-11-16 2022-11-16 Multi-mode interaction method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211431749.2A CN118092629A (en) 2022-11-16 2022-11-16 Multi-mode interaction method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118092629A true CN118092629A (en) 2024-05-28

Family

ID=91160322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211431749.2A Pending CN118092629A (en) 2022-11-16 2022-11-16 Multi-mode interaction method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118092629A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination