CN113126765A - Multi-modal input interaction method and device, robot and storage medium - Google Patents

Multi-modal input interaction method and device, robot and storage medium

Info

Publication number
CN113126765A
CN113126765A (application number CN202110439619.2A)
Authority
CN
China
Prior art keywords
information
input information
input
intention
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110439619.2A
Other languages
Chinese (zh)
Inventor
张献涛
暴筱
林小俊
支涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yunji Technology Co Ltd
Original Assignee
Beijing Yunji Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yunji Technology Co Ltd filed Critical Beijing Yunji Technology Co Ltd
Priority to CN202110439619.2A priority Critical patent/CN113126765A/en
Publication of CN113126765A publication Critical patent/CN113126765A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0487 Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
    • G06F 3/0488 Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the application provides a multi-modal input interaction method and device, a robot, and a storage medium. The multi-modal input interaction method comprises the following steps: acquiring at least one piece of input information; performing intention identification on the at least one piece of input information to obtain a target intention; acquiring interaction information according to the target intention and page information corresponding to the target intention, wherein the page information is obtained according to all the input information used for identifying the target intention; and outputting the interaction information. According to the multi-modal input interaction method provided by some embodiments of the application, by combining the target intention with the corresponding page information, different user requirements can be identified from input information with the same content, thereby improving the user experience.

Description

Multi-modal input interaction method and device, robot and storage medium
Technical Field
The application relates to the technical field of computer applications, and in particular to a multi-modal input interaction method and device, a robot, and a storage medium.
Background
With the deepening development of digitization and intelligence technologies in various fields, an increasing number of intelligent devices play a role in daily life. Currently, many intelligent terminal devices support both touch-screen input and voice input.
Therefore, how to improve interaction with intelligent devices through multi-modal input such as the touch screen or voice has become an urgent technical problem to be solved.
Disclosure of Invention
By combining the target intention obtained after intention identification with the corresponding page information to generate interaction information, some embodiments of the application can identify different requirements behind input information with the same content from a user, improving the human-computer interaction effect and the user experience.
In a first aspect, some embodiments of the present application provide a multimodal input interaction method, including: acquiring at least one piece of input information; performing intention identification according to the at least one piece of input information to obtain a target intention; acquiring interaction information according to the target intention and page information corresponding to the target intention, wherein the page information is obtained according to all the input information for identifying the target intention; and outputting the interactive information.
According to the method and the device, the target intention obtained after intention identification is combined with the corresponding page information to generate the interaction information, so that different requirements can be identified for input information with the same content from the user, and the user experience is improved.
In some embodiments, the performing intent recognition according to the at least one piece of input information to obtain a target intent includes: identifying interfering input information in the at least one input information; filtering the interference input information from the at least one piece of input information to obtain effective input information; and identifying the intention according to the effective input information to obtain the target intention.
According to these embodiments of the application, the interference information is identified and filtered out of the input information, which eliminates the interference of irrelevant information and improves the accuracy of the target intention.
In some embodiments, the identifying interfering input information in the at least one input information comprises: and identifying the interference input information according to the input time of each piece of input information in the at least one piece of input information.
These embodiments of the application judge whether a piece of input information is interference information according to the input time of each piece of input information in the at least one piece of input information, providing an effective method for identifying interference information.
In some embodiments, the performing intent recognition according to the at least one piece of input information to obtain a target intent includes: acquiring a difference value between the input time of the first input information and the input time of the second input information; confirming that the difference is greater than a set threshold; and performing intention recognition at least according to the second input information to obtain the target intention.
These embodiments of the application provide a method for judging interference information according to the difference between the input times of two adjacent pieces of input information, namely by determining whether the time interval between them exceeds a set threshold, so that interference information can be effectively filtered out and the accuracy of intention identification is improved.
In some embodiments, said performing intent recognition based on said at least one input information comprises: and acquiring a target format file corresponding to the input information according to the at least one piece of input information, and performing intention identification on the target format file.
According to these embodiments, a target format file corresponding to the information is generated from the input information, and intention identification is then performed on the target format file. This makes it convenient for the same intention identification model to handle input information of various modalities: after the input information of each modality is converted into a target format file, it is fed into the trained intention identification model for intention identification, improving the efficiency of intention identification.
In some embodiments, the input information comprises voice information; the performing intent recognition according to the at least one piece of input information includes: performing voice recognition according to the voice information to obtain a voice recognition result; obtaining a text format file according to the voice recognition result; and identifying the intention according to the text format file.
According to these embodiments, voice input information undergoes voice recognition, and intention identification is performed on the text format file obtained from the voice recognition result. An existing speech recognition model can thus be used to recognize the voice and convert it into a text file, improving the processing speed of target intention identification.
In some embodiments, the input information comprises touch screen information; the performing intent recognition according to the at least one piece of input information includes: obtaining a first target format file corresponding to the touch screen information according to the information of the button corresponding to the touch screen information in the page to which the button belongs; and identifying the intention according to the first target format file.
According to these embodiments, the first target format file corresponding to the touch screen information is obtained according to the information of the button corresponding to the touch screen information in the page to which the button belongs, and intention identification is performed on the first target format file. An existing deep learning model can thus be used to perform intention identification on the first target format file, reducing the difficulty of intention identification.
In a second aspect, some embodiments of the present application provide a multimodal input interaction apparatus, including: an input module configured to obtain at least one piece of input information; an identification module configured to perform intent identification according to the at least one piece of input information, so as to obtain a target intent; an obtaining module configured to obtain interaction information according to the target intention and page information corresponding to the target intention, wherein the page information is obtained according to all the input information for identifying the target intention; an output module configured to output the interaction information.
In a third aspect, some embodiments of the present application provide a robot comprising: an input device configured to obtain at least one piece of input information; an output device configured to output the interaction information or a page corresponding to the target intention; a memory configured to store a program of computer-readable instructions; a processor configured to implement the method of the first aspect or any possible implementation manner of the first aspect according to the at least one piece of input information.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, can implement the method described in the first aspect or any possible implementation manner of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a schematic view of a usage scenario of a multi-modal input interaction method according to an embodiment of the present application;
FIG. 2 is a flowchart of a multi-modal input interaction method according to an embodiment of the present application;
FIG. 3 is a second flowchart of a multi-modal input interaction method according to an embodiment of the present application;
FIG. 4 is a third flowchart of a multi-modal input interaction method according to an embodiment of the present application;
FIG. 5 is a fourth flowchart of a multi-modal input interaction method according to an embodiment of the present application;
FIG. 6 is a block diagram of a multimodal input interaction apparatus according to an embodiment of the present application;
FIG. 7 is a block diagram illustrating a second embodiment of a multi-modal input interaction apparatus;
fig. 8 is a block diagram of a robot according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
A brief description of a usage scenario of some embodiments of the present application is provided below in conjunction with fig. 1.
Fig. 1 provides a schematic diagram of a usage scenario of a multi-modal input interaction method. The application scenario of fig. 1 involves an intelligent terminal device 130 and a user 140. As an example, the intelligent terminal device 130 of fig. 1 includes a voice interaction apparatus 110, a touch display screen 120, or other types of interaction apparatuses, and it is understood that the intelligent terminal device 130 of fig. 1 further includes a memory (not shown in the figure) and a processor (not shown in the figure). The user 140 of fig. 1 may interact with the intelligent terminal device 130 through voice, touch screen, video or picture input, and the like. For example, the user 140 may interact with the intelligent terminal device 130 through the voice interaction apparatus 110 by means of voice input, may interact with the intelligent terminal device 130 through the touch display screen 120 by clicking, sliding a button, or entering characters via an input method, or may interact with the intelligent terminal device 130 through a camera device (not shown) by means of video or picture input.
The multi-modal input interaction method provided by some embodiments of the present application, which can be performed by the smart terminal device 130 of fig. 1, is exemplarily set forth below in connection with fig. 2.
As shown in fig. 2, the multi-modal input interaction method of some embodiments of the present application includes: s210, acquiring at least one piece of input information; s220, performing intention identification according to the at least one piece of input information to obtain a target intention; s230, acquiring interaction information according to the target intention and page information corresponding to the target intention, wherein the page information is obtained according to all the input information for identifying the target intention; s240 outputs the interactive information.
The above steps are exemplarily set forth below.
In some embodiments of the present application, acquiring the input information in S210 includes acquiring only one piece of input information. For example, in the application scenario shown in fig. 1, the user 140 says "take me to the toilet" to the intelligent terminal device 130; the intelligent terminal device 130 can analyze the voice input of the user 140 to identify the target intention, generate interaction information according to the target intention (the interaction information may include a confirmation such as "OK"), and finally guide the user 140 to the toilet. It will be appreciated that S210 in this example includes only one piece of input information: "take me to the toilet".
In some embodiments of the present application, the input information of S210 includes a plurality of pieces. For example, if the user wants to buy Anta products, the user first clicks the "shopping mall" button on the home page of the intelligent terminal device (i.e., a first piece of input information is obtained) to enter a first page, which contains an "Anta" button. When the user then inputs "Anta" by clicking the button or by voice (i.e., a second piece of input information is obtained), the intelligent terminal device introduces Anta products according to the content of the first page (i.e., the target intention is identified according to the first and second pieces of input information, and the resulting interaction information is an introduction of Anta products). That is, the input information used for identifying the target intention in this example includes the following two pieces: first, clicking the "shopping mall" button; second, clicking or speaking "Anta".
In some embodiments of the present application, for better management of the multi-modal input information and to facilitate subsequent processing, the input information obtained in S210 includes: the request time Time_i of the input; the page information PageId_i, whose value identifies the page on which the user is currently interacting with the intelligent terminal device; the input type RequestType_i of the request (including "voice", "touch screen", "input method", "video" or "picture" input, etc.); the requested input content Content_i, which depends on the input type (for voice input the content is audio; for a touch-screen click the content is the id of the clicked page module, such as a button id; for input-method input the content is text); and the unique user identification GuestID_i, whose value can be obtained through authenticated login, face recognition, fingerprint recognition, and the like. Here i denotes the index of a piece of input information. In addition, for security and permission management, the input information may optionally carry information such as the unique device id of the intelligent terminal device and the current network address.
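A minimal Python sketch of how such a piece of multi-modal input information might be represented; the field names mirror Time_i, PageId_i, RequestType_i, Content_i and GuestID_i from the description above, while the class name, enum values and example values are illustrative assumptions rather than the patent's implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional
import time


class RequestType(Enum):
    VOICE = "voice"
    TOUCH_SCREEN = "touch_screen"
    INPUT_METHOD = "input_method"
    VIDEO = "video"
    PICTURE = "picture"


@dataclass
class InputRecord:
    """One piece of multi-modal input information, as described for S210."""
    time: float                  # Time_i: request timestamp (seconds)
    page_id: str                 # PageId_i: id of the page the user is on
    request_type: RequestType    # RequestType_i: input modality
    content: Any                 # Content_i: audio, button id, or text
    guest_id: str                # GuestID_i: unique user identification
    device_id: Optional[str] = None      # optional: unique device id
    network_addr: Optional[str] = None   # optional: current network address


# Example: a touch-screen click on the "shopping mall" button of the home page.
record = InputRecord(
    time=time.time(),
    page_id="home",
    request_type=RequestType.TOUCH_SCREEN,
    content="btn_shopping_mall",
    guest_id="guest_001",
)
```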
For input information of different modalities, in order to facilitate subsequent parsing and development, the multi-modal input interaction method of some embodiments of the present application includes: S220, obtaining a target format file corresponding to the information according to the at least one piece of input information, and then performing intention identification on the target format file. The target format file may be text, a picture, or the like.
The process of identifying a target intent from input information of various modalities is schematically set forth below in connection with a number of examples.
In some embodiments of the present application, the input information of S210 is voice information; the corresponding S220 includes: performing voice recognition according to the voice information to obtain a voice recognition result; obtaining a text format file according to the voice recognition result; and identifying the intention according to the text format file. It should be noted that, in some embodiments of the present application, a picture format file may also be obtained according to the voice recognition result, and then the intention recognition is performed according to the picture format file.
In some embodiments of the present application, the input information of S210 is touch screen information; the corresponding S220 includes: obtaining a first target format file corresponding to the information according to the information of the button corresponding to the touch screen information in the page to which the information belongs; and identifying the intention according to the first target format file. Wherein, the first target format file can be text, picture, etc.
In some embodiments of the present application, the S210 input information is input method input information; the corresponding S220 includes: obtaining a second target format file corresponding to the information according to the information input by the input method; and identifying the intention according to the second target format file. Wherein, the second target format file can be text, picture, etc.
In some embodiments of the present application, the input information of S210 is video or picture input information; the corresponding S220 includes: obtaining a third target format file corresponding to the information according to the video or picture input information; and identifying the intention according to the third target format file. Wherein, the third target format file can be text, picture, etc.
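Building on the InputRecord sketch above, the following is one possible (assumed) way to turn a piece of input information into a text target format file for S220; the PAGE_BUTTON_LABELS table and the handling of voice, video and picture inputs are simplifying assumptions, since the patent does not fix a concrete recognizer.

```python
# Illustrative mapping from (page id, button id) to the button's label on that page.
PAGE_BUTTON_LABELS = {
    ("home", "btn_shopping_mall"): "shopping mall",
    ("first_page", "btn_anta"): "Anta",
}


def to_target_format(record: InputRecord) -> str:
    """Convert one piece of input information into a text target format file (S220).

    Touch-screen clicks are mapped to the label of the clicked button on the page
    it belongs to; input-method input is already text. Voice, video and picture
    inputs are assumed to have been converted to text beforehand by a recognizer,
    which is outside the scope of this sketch.
    """
    if record.request_type is RequestType.TOUCH_SCREEN:
        # First target format file: the clicked button's label on its page.
        return PAGE_BUTTON_LABELS[(record.page_id, record.content)]
    if record.request_type is RequestType.INPUT_METHOD:
        # Second target format file: the typed text itself.
        return record.content
    # Voice / video / picture: content is assumed to already hold recognised text.
    return str(record.content)
```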
S220 may perform intention identification on the input information by multiple methods, for example: maximum entropy, support vector machines, machine learning, and the like. A machine learning method comprises the following steps: sorting the existing intentions and performing classification training on them to finally obtain an intention recognition model. For example, for the multi-modal input information "exit", "confirm" and "take me to the restroom", intention recognition may be achieved using a trained deep learning model based on semantic information. As one example, a method of training an intention recognition model includes:
In the first step, user input information texts are collected in advance (for example, 10,000 groups of input information texts are collected).
In the second step, the intention category of each input information text is labeled ("exit", "confirm", "go to toilet", etc.).
In the third step, after manual cleaning and confirmation, a training set is constructed from the input information texts and the intention categories.
In the fourth step, after word segmentation of the text corpus in the training set, a word vector model (such as word2vec) is used to encode the text and obtain d_text.
In the fifth step, an intention recognition model based on a long short-term memory (LSTM) neural network is trained using the intention categories and d_text.
Finally, given an input information text, the trained intention recognition model extracts semantic feature information and uses the LSTM network to obtain the target intention corresponding to the input information.
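A minimal PyTorch sketch of the fourth and fifth training steps, assuming the segmented texts have already been encoded with a word vector model (e.g., word2vec) into d_text tensors of shape (batch, sequence length, embedding dimension); the class name, dimensions and training-loop details are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn


class IntentRecognitionModel(nn.Module):
    """Intention recognition model based on a long short-term memory (LSTM) network."""

    def __init__(self, embed_dim: int, hidden_dim: int, num_intents: int):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_intents)

    def forward(self, d_text: torch.Tensor) -> torch.Tensor:
        # d_text: (batch, seq_len, embed_dim) word2vec encodings of the segmented text
        _, (h_n, _) = self.lstm(d_text)
        return self.classifier(h_n[-1])  # logits over the intention categories


# Training sketch over a labelled set of (d_text, intention category) pairs.
model = IntentRecognitionModel(embed_dim=100, hidden_dim=128, num_intents=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

training_batches = []  # placeholder: batches of (d_text tensor, intent label tensor)
for d_text_batch, intent_batch in training_batches:
    optimizer.zero_grad()
    loss = criterion(model(d_text_batch), intent_batch)
    loss.backward()
    optimizer.step()
```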
It should be noted that, in some embodiments of the present application, the at least one piece of input information acquired by the intelligent terminal device may further include interference information input by other users or by the user, and such interference information can disturb the result of intention identification and reduce the accuracy of the target intention. To reduce the impact of interference information on intention identification, S220 in some embodiments of the present application includes: identifying interfering input information in the at least one piece of input information; filtering the interfering input information from the at least one piece of input information to obtain effective input information; and performing intention identification according to the effective input information to obtain the target intention.
The implementation process of the multi-modal input interaction method including the recognition of the interference information provided by some embodiments of the present application is exemplarily set forth below with reference to fig. 3, and as shown in fig. 3, the multi-modal input interaction method in some embodiments of the present application includes:
s310, acquiring at least one piece of input information; s320 identifying interfering input information in the at least one input information; s330, filtering the interference input information from the at least one piece of input information to obtain effective input information; s340, performing intention identification according to the effective input information to obtain the target intention; s350, acquiring interaction information according to the target intention and page information corresponding to the target intention, wherein the page information is obtained according to all the input information for identifying the target intention; s360, outputting the interactive information.
In some embodiments of the present application, the interference information involved in S320 includes consecutive mis-inputs by the user. In order to identify whether a piece of input information is interference information, in some embodiments of the present application, S320 includes: identifying the interfering input information according to the input time of each piece of input information in the at least one piece of input information. For example, when the user speaks the command "take me to the toilet" to the intelligent terminal device and accidentally touches a "go to the shop" button, whether interference information exists can be judged from the input time characteristics of the two pieces of input information "take me to the toilet" and "go to the shop" (for example, the two are triggered simultaneously or the time interval between them is very short). By performing S320, it is possible to recognize that the input information "go to the shop" is interference information and needs to be filtered out.
In some embodiments of the present application, to identify the interference information, S220 includes: acquiring the difference between the input time of the first input information and the input time of the second input information; confirming whether the difference is greater than a set threshold; and performing intention identification at least according to the second input information to obtain the target intention. For example, after the user inputs "take me to the toilet" by clicking the touch screen or by voice, "go to the shop" is accidentally clicked. In some embodiments of the present application, the threshold is set to 200 ms; if the difference between the input times of the two pieces of input information "take me to the toilet" and "go to the shop" is less than 200 ms, the input "go to the shop" is considered mistaken-input interference information and is filtered out. The threshold can be set according to usage experience and scenario requirements.
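A minimal sketch of this time-threshold filtering, reusing the InputRecord sketch above; the 200 ms value comes from the example, while treating the later of two near-simultaneous inputs as the interference (consistent with the "go to the shop" example) is an assumption, since either choice could be made in practice.

```python
THRESHOLD_SECONDS = 0.2  # the 200 ms threshold from the example; tunable per scenario


def filter_interference(inputs: list) -> list:
    """Drop near-simultaneous follow-up inputs, treating them as accidental triggers.

    `inputs` is a list of InputRecord objects (see the earlier sketch). If a piece
    of input arrives within THRESHOLD_SECONDS of the previously accepted input, it
    is treated as mistaken-input interference and filtered out, as in the
    "take me to the toilet" / "go to the shop" example above.
    """
    effective = []
    for record in sorted(inputs, key=lambda r: r.time):
        if effective and record.time - effective[-1].time < THRESHOLD_SECONDS:
            continue  # interference: too close to the previous input
        effective.append(record)
    return effective
```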
When a user performs multi-modal input interaction with the intelligent terminal device, the actual requirements behind the same input information may differ. For example, during interaction with an intelligent terminal device in a shopping mall, the same voice input "Anta" may correspond to two different requirements: first, introducing Anta products to the user; second, leading the user to the Anta store. In order to accurately identify the actual intention of the user and resolve this ambiguity, the multi-modal input interaction method of some embodiments of the present application includes: S230, acquiring the interaction information according to the target intention and the page information corresponding to the target intention, wherein the page information is obtained according to all the input information used for identifying the target intention.
In some embodiments of the present application, the implementation processes for obtaining two different types of interaction information for the same input information "Anta" are exemplarily described below with reference to fig. 4 and fig. 5.
First, with reference to fig. 4, the implementation process of some embodiments of the present application for obtaining interaction information that introduces Anta products to the user from the input information "Anta" is exemplarily described, including:
S410, the user clicks the "shopping mall" button on the home page (i.e., a first piece of input information) to enter a first page, which contains an "Anta" button; S420, the user clicks the "Anta" button or speaks "Anta" by voice (i.e., a second piece of input information); S430, intention identification is performed according to the input information "Anta" to obtain the target intention corresponding to "Anta"; S440, according to the target intention and the first page information corresponding to the target intention, an interaction page introducing Anta products is obtained; S450, the interaction page is presented. That is, the interaction information introducing Anta products to the user is obtained according to the target intention corresponding to "Anta" and the first page information.
Referring to fig. 5, the implementation process of some embodiments of the present application for obtaining interaction information that leads the user to the Anta store from the input information "Anta" includes:
S510, the user clicks the "lead the way" button on the home page (i.e., a first piece of input information) to enter a second page, which contains an "Anta" button; S520, the user clicks the "Anta" button or speaks "Anta" by voice (i.e., a second piece of input information); S530, intention identification is performed according to the input information "Anta" to obtain the target intention corresponding to "Anta"; S540, the interaction information "lead the user to the Anta store" is acquired according to the target intention and the second page information corresponding to the target intention; S550, the interaction information is executed, i.e., the user is taken to the store. That is, the interaction information leading the user to the store is obtained according to the target intention corresponding to "Anta" and the second page information.
It should be noted that, in some embodiments of the present application, the method for obtaining the interaction information according to the target intention and the page information corresponding to the target intention includes: looking up a pre-stored interaction mapping table, generating the interaction information, and the like, wherein the methods for generating the interaction information include end-to-end approaches in deep learning.
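A minimal sketch of the pre-stored interaction mapping table approach for S230: the same target intention maps to different interaction information depending on the page information. All keys, page identifiers and payload values are illustrative assumptions, not values specified by the patent.

```python
from typing import Optional

# (target intention, page id) -> interaction information. Illustrative values only.
INTERACTION_MAP = {
    ("anta", "first_page"): {"action": "show_page", "payload": "anta_product_intro"},
    ("anta", "second_page"): {"action": "lead_route", "payload": "anta_store"},
    ("go_to_toilet", "home"): {"action": "lead_route", "payload": "nearest_restroom"},
}


def get_interaction_info(target_intent: str, page_id: str) -> Optional[dict]:
    """S230: disambiguate the same intention by combining it with the page information."""
    return INTERACTION_MAP.get((target_intent, page_id))


# The same "anta" intention yields different interaction information on different pages.
print(get_interaction_info("anta", "first_page"))   # product introduction page
print(get_interaction_info("anta", "second_page"))  # route guidance to the store
```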
In some embodiments of the present application, the manner of outputting the interaction information includes text, voice, and action guidance, such as voice answers, text replies, screen lighting, button or page jumps, music playback, and route guidance.
A multimodal input interaction apparatus 600 provided by some embodiments of the present application is illustrated below in conjunction with fig. 6. It should be understood that the apparatus corresponds to the method embodiment of fig. 2, and can perform the steps related to the method embodiment, the specific functions of the apparatus can be referred to the description above, and the detailed description is appropriately omitted here to avoid redundancy. The device comprises at least one software functional module which can be stored in a memory in the form of software or firmware or solidified in an operating system of the device, and the device comprises: an input module 610 configured to obtain at least one piece of input information; an identification module 620 configured to perform intent identification according to the at least one input information to obtain a target intent; an obtaining module 630, configured to obtain interaction information according to the target intent and page information corresponding to the target intent, where the page information is obtained according to all the input information for identifying the target intent; an output module 640 configured to output the interaction information.
In some embodiments of the present application, a filtering module 720 is added to the apparatus shown in fig. 6, and a multimodal input interaction apparatus 700 provided in some embodiments of the present application is exemplarily described below with reference to fig. 7. Wherein the filtering module 720 is configured to identify interfering input information of the at least one input information; filtering the interference input information from the at least one piece of input information to obtain effective input information; the identification module 730 is configured to perform intent identification according to the valid input information, so as to obtain the target intent. It should be understood that the apparatus 700 corresponds to the method embodiment of fig. 3, and can perform the steps related to the method embodiment, the specific functions of the apparatus can be referred to the description above, and the detailed description is appropriately omitted here to avoid redundancy.
The robot provided by some embodiments of the present application is exemplarily set forth in the following with reference to fig. 8, and it should be understood that the robot corresponds to the above-mentioned method embodiment for multimodal input interaction performed on a smart terminal device, and can perform the various steps involved in the above-mentioned method embodiment, and the specific functions of the robot can be referred to the above description, and in order to avoid repetition, the detailed description is appropriately omitted here. The robot includes at least one software function module that can be stored in memory in the form of software or firmware or solidified in the operating system of the robot. The robot 800 of fig. 8, comprising: an input device 810 configured to obtain at least one piece of input information; an output device 820 configured to output the interaction information or the page corresponding to the target intention; a memory 830 configured to store a program of computer-readable instructions; a processor 840 configured to read the program from the memory 830 to implement the multi-modal input interaction method described in fig. 2 and 3.
The input device 810 includes a voice interaction device, a touch display device, a camera device, or the like, where the information input by the touch screen includes a sliding mode, a clicking mode, an input method input mode, or the like.
Processor 840 may process digital signals and may include various computing architectures, such as a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture implementing a combination of instruction sets. In some examples, processor 840 may be a microprocessor.
Memory 830 may be used to store instructions that are executed by processor 840 or data related to the execution of instructions. The instructions and/or data may include code for performing some or all of the functions of one or more of the modules described in embodiments of the application. The processor 840 of embodiments of the present disclosure may be used to execute instructions in the memory 830 to implement the methods shown in fig. 2-3. Memory 830 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
Some embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program, which when executed by a processor, can implement the above-described multi-modal input interaction method performed on a robot.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A multi-modal input interaction method, the method comprising:
acquiring at least one piece of input information;
performing intention identification according to the at least one piece of input information to obtain a target intention;
acquiring interaction information according to the target intention and page information corresponding to the target intention, wherein the page information is obtained according to all the input information for identifying the target intention;
and outputting the interactive information.
2. The method of claim 1, wherein the identifying the intention according to the at least one input information to obtain the target intention comprises:
identifying interfering input information in the at least one input information;
filtering the interference input information from the at least one piece of input information to obtain effective input information;
and identifying the intention according to the effective input information to obtain the target intention.
3. The method of claim 2,
the identifying interfering input information in the at least one input information comprises: and identifying the interference input information according to the input time of each piece of input information in the at least one piece of input information.
4. The method of claim 1, wherein the identifying the intention according to the at least one input information to obtain the target intention comprises:
acquiring a difference value between the input time of the first input information and the input time of the second input information;
confirming that the difference is greater than a set threshold;
and performing intention recognition at least according to the second input information to obtain the target intention.
5. The method of claim 1, wherein the identifying intent from the at least one piece of input information comprises:
and acquiring a target format file corresponding to the input information according to the at least one piece of input information, and performing intention identification on the target format file.
6. The method of claim 1,
the input information comprises voice information;
the performing intent recognition according to the at least one piece of input information includes:
performing voice recognition according to the voice information to obtain a voice recognition result;
obtaining a text format file according to the voice recognition result;
and identifying the intention according to the text format file.
7. The method of claim 1,
the input information comprises touch screen information;
the performing intent recognition according to the at least one piece of input information includes:
obtaining a first target format file corresponding to the touch screen information according to the information of the button corresponding to the touch screen information in the page to which the button belongs;
and identifying the intention according to the first target format file.
8. A multimodal input interaction apparatus, comprising:
an input module configured to obtain at least one piece of input information;
an identification module configured to perform intent identification according to the at least one piece of input information, so as to obtain a target intent;
an obtaining module configured to obtain interaction information according to the target intention and page information corresponding to the target intention, wherein the page information is obtained according to all the input information for identifying the target intention;
an output module configured to output the interaction information.
9. A robot, comprising:
an input device configured to obtain at least one piece of input information;
an output device configured to output the interaction information or a page corresponding to the target intention;
a memory configured to store a program of computer-readable instructions;
a processor configured to implement the method of any one of the above claims 1-7 according to the at least one piece of input information.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 7.
CN202110439619.2A 2021-04-22 2021-04-22 Multi-modal input interaction method and device, robot and storage medium Pending CN113126765A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110439619.2A CN113126765A (en) 2021-04-22 2021-04-22 Multi-modal input interaction method and device, robot and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110439619.2A CN113126765A (en) 2021-04-22 2021-04-22 Multi-modal input interaction method and device, robot and storage medium

Publications (1)

Publication Number Publication Date
CN113126765A true CN113126765A (en) 2021-07-16

Family

ID=76779574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110439619.2A Pending CN113126765A (en) 2021-04-22 2021-04-22 Multi-modal input interaction method and device, robot and storage medium

Country Status (1)

Country Link
CN (1) CN113126765A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105373213A (en) * 2014-08-25 2016-03-02 乐视致新电子科技(天津)有限公司 Operation identification method and device in human-computer interaction
CN110998719A (en) * 2017-08-09 2020-04-10 索尼公司 Information processing apparatus, information processing method, and computer program
CN109726387A (en) * 2017-10-31 2019-05-07 科沃斯商用机器人有限公司 Man-machine interaction method and system
US20200142548A1 (en) * 2018-11-06 2020-05-07 Apple Inc. Devices, Methods, and Graphical User Interfaces for Interacting with User Interface Objects and Providing Feedback
CN111611358A (en) * 2019-02-25 2020-09-01 北京嘀嘀无限科技发展有限公司 Information interaction method and device, electronic equipment and storage medium
CN110038305A (en) * 2019-04-12 2019-07-23 网易(杭州)网络有限公司 Information processing method and device, electronic equipment, storage medium
US20200401218A1 (en) * 2019-06-18 2020-12-24 Synaptics Incorporated Combined gaze and touch input for device operation
CN110427462A (en) * 2019-08-06 2019-11-08 北京云迹科技有限公司 With method, apparatus, storage medium and the service robot of user interaction
CN111367459A (en) * 2020-03-13 2020-07-03 清华大学 Text input method using pressure touch pad and intelligent electronic device
CN111611468A (en) * 2020-04-29 2020-09-01 百度在线网络技术(北京)有限公司 Page interaction method and device and electronic equipment
KR20210038860A (en) * 2020-06-29 2021-04-08 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Intent recommendation method, apparatus, device and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115512705A (en) * 2022-11-22 2022-12-23 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Similar Documents

Publication Publication Date Title
Labatut et al. Extraction and analysis of fictional character networks: A survey
US9740677B2 (en) Methods and systems for analyzing communication situation based on dialogue act information
CN107357787B (en) Semantic interaction method and device and electronic equipment
CN113761377B (en) False information detection method and device based on attention mechanism multi-feature fusion, electronic equipment and storage medium
CN115357704B (en) Processing method and related device for heterogeneous plot nodes in voice interaction novel
US20180122369A1 (en) Information processing system, information processing apparatus, and information processing method
CN109979450A (en) Information processing method, device and electronic equipment
CN115713797A (en) Method for training emotion recognition model, emotion recognition method and device
CN111611358A (en) Information interaction method and device, electronic equipment and storage medium
CN114974253A (en) Natural language interpretation method and device based on character image and storage medium
CN111385188A (en) Recommendation method and device for dialog elements, electronic equipment and medium
CN113126765A (en) Multi-modal input interaction method and device, robot and storage medium
KR102122918B1 (en) Interactive question-anwering apparatus and method thereof
CN113609865A (en) Text emotion recognition method and device, electronic equipment and readable storage medium
CN112861510A (en) Summary processing method, apparatus, device and storage medium
CN112699671A (en) Language marking method and device, computer equipment and storage medium
CN115512692B (en) Voice recognition method, device, equipment and storage medium
CN109388695B (en) User intention recognition method, apparatus and computer-readable storage medium
KR102072708B1 (en) A method and computer program for inferring genre of a text contents
WO2023239477A1 (en) Video recording processing
CN116978028A (en) Video processing method, device, electronic equipment and storage medium
CN116186244A (en) Method for generating text abstract, method and device for training abstract generation model
CN116186255A (en) Method for training unknown intention detection model, unknown intention detection method and device
CN115759048A (en) Script text processing method and device
CN110010131B (en) Voice information processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: Room 702, 7 / F, 67 North Fourth Ring Road West, Haidian District, Beijing

Applicant after: Beijing Yunji Technology Co.,Ltd.

Address before: Room 702, 7 / F, 67 North Fourth Ring Road West, Haidian District, Beijing

Applicant before: BEIJING YUNJI TECHNOLOGY Co.,Ltd.