CN115062131A - Multi-mode-based man-machine interaction method and device - Google Patents

Multi-mode-based man-machine interaction method and device

Info

Publication number
CN115062131A
CN115062131A (application number CN202210753297.3A)
Authority
CN
China
Prior art keywords
user
instruction
information
processor
determining
Legal status
Pending
Application number
CN202210753297.3A
Other languages
Chinese (zh)
Inventor
何锐颖
杨晓龙
张志强
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210753297.3A
Publication of CN115062131A
Legal status: Pending (Current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/338: Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of this specification provide a multi-modal-based human-computer interaction method and device. Provided that the terminal is capable of collecting information in multiple modalities, the terminal can, while a user interacts with it, collect the multi-modal information input by the user, including video information, voice information, language text, and event text generated by the user's input operations, and transmit this multi-modal information to the server. The server can extract a user instruction and a user language from the multi-modal information, input the user instruction into an instruction processor to obtain first output content, and input the user language into a natural language processor to obtain second output content. The server may then determine the response content for the user based on a fusion of the first output content and the second output content.

Description

Multi-mode-based man-machine interaction method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of human-computer interaction technologies, and in particular, to a human-computer interaction method and apparatus based on multiple modalities.
Background
With the development of science and technology, terminal products built on computing and processing have become increasingly intelligent. Such products now offer multiple information acquisition modes, such as video capture, voice capture, and touch screens, and users can operate them through the corresponding interaction modes. The development of these human-computer interaction modes gives users more ways to express themselves and makes their daily lives more convenient.
Accordingly, improved solutions are desired that can increase the level of intelligence during human-computer interaction.
Disclosure of Invention
One or more embodiments of the present specification describe a multi-modal based human-computer interaction method and apparatus to improve the level of intelligence in the human-computer interaction process. The specific technical scheme is as follows.
In a first aspect, an embodiment provides a multi-modal-based human-computer interaction method, including:
acquiring multi-modal information input by a user through a terminal;
extracting user instructions and user language from the multimodal information;
determining, by an instruction processor, first output content for the user instruction;
determining, by a natural language processor, second output content for the user language;
determining responsive content for the user based on the fusion of the first output content and the second output content.
In one embodiment, the multimodal information includes at least one of the following categories: voice information, video information, text information; the text information includes language text and event text input by the user through an input operation.
In one embodiment, the step of extracting user instructions and user language from the multimodal information comprises:
extracting a user instruction from any kind of information contained in the multi-modal information based on a preset instruction;
and extracting the user language from the voice information and the language text contained in the multi-modal information.
In one embodiment, the method further comprises:
extracting user features from the multimodal information;
the step of determining, by the instruction processor, first output content for the user instruction includes:
determining, by an instruction processor, first output content for the user instruction based on the user characteristic;
the step of determining, by the natural language processor, second output content for the user language includes:
determining, by a natural language processor, second output content for the user language based on the user characteristic.
In one embodiment, the step of determining, by the instruction processor, first output content for the user instruction comprises:
determining, by an instruction processor, first output content for the user instruction based on the second intermediate output; wherein the second intermediate output is an intermediate output determined by the natural language processor for the user language.
In one embodiment, the step of determining, by the natural language processor, the second output content for the user language includes:
determining, by the natural language processor, second output content for the user language based on the first intermediate output; wherein the first intermediate output is an intermediate output determined by the instruction processor for the user instruction.
In one embodiment, the step of determining, by the instruction processor, first output content for the user instruction comprises:
determining first output content for the user instruction through a standardized operation of a standard operating procedure (SOP) in the instruction processor.
In one embodiment, the step of determining the response content for the user comprises:
and fusing the first output content and the second output content based on a preset fusion decision to obtain response content for the user.
In one embodiment, the step of determining the response content for the user comprises:
determining a result of fusing the first output content and the second output content based on a trained fusion model, and determining response content for the user based on the fused result.
In one embodiment, after determining the response content for the user, the method further comprises:
classifying the response content based on a multi-modal category contained in the response content;
and outputting the classified response contents through the terminal by adopting a corresponding modal output mode.
In one embodiment, the method is performed by a server or the method is performed by the terminal.
In a second aspect, an embodiment provides a multi-modal based human-computer interaction device, including:
the acquisition module is configured to acquire multi-modal information input by a user through the terminal;
an extraction module configured to extract a user instruction and a user language from the multimodal information;
a first determining module configured to determine, by an instruction processor, first output content for the user instruction;
a second determination module configured to determine, by the natural language processor, a second output content for the user language;
a fusion module configured to determine responsive content for the user based on a fusion of the first output content and the second output content.
In one embodiment, the extraction module is further configured to extract user features from the multi-modal information;
the first determining module is specifically configured to:
determining, by an instruction processor, first output content for the user instruction based on the user characteristic;
the second determining module is specifically configured to:
determining, by a natural language processor, second output content for the user language based on the user characteristic.
In one embodiment, the first determining module is specifically configured to:
determining, by an instruction processor, first output content for the user instruction based on the second intermediate output; wherein the second intermediate output is an intermediate output determined by the natural language processor for the user language.
In a third aspect, embodiments provide a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any of the first aspect.
In a fourth aspect, an embodiment provides a computing device, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method of any one of the first aspect.
In the method and device provided by the embodiments of this specification, after the terminal collects multi-modal information during a human-computer interaction, a user instruction and a user language can be extracted from that information; first output content for the user instruction is determined by the instruction processor, second output content for the user language is determined by the natural language processor, and response content for the user is then determined based on a fusion of the two output contents. Because the instruction processor and the natural language processor each handle different modal information within the user's input, information entered through different input modes can be taken into account at the same time, which increases the diversity of the machine's processing and responses and improves the level of intelligence of the human-computer interaction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 is a schematic flowchart of a multi-modal based human-computer interaction method according to an embodiment;
fig. 3 is a system structure diagram of multi-modal human-computer interaction provided in this embodiment;
fig. 4 is a schematic block diagram of a multi-modal based human-computer interaction device according to an embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. As the user interacts with the terminal, the terminal collects the multi-modal information input by the user, and the server processes this information, including processing by the instruction processor and by the natural language processor. From the processing of the two processors, the server obtains the response content, which is output to the user through the terminal, thereby responding to the user. Fig. 1 is only a schematic illustration; in practical applications, when the terminal has sufficient processing capability, the processing of the instruction processor and the natural language processor may also be executed on the terminal.
Currently, more and more terminal products offer multiple information acquisition modes, including video capture, voice capture, and touch screens. Users can interact with such products through video interaction, voice interaction, or touch-screen interaction. For example, smartphones, in-vehicle terminals, and smart speakers can all support multi-modal interaction.
Video interaction refers to an interaction mode in which the terminal captures video or images containing the user's body actions, such as gestures and gaze, extracts the user's intention from the video or images, and then responds to the user. Voice interaction refers to an interaction mode in which the terminal collects the user's voice information and responds to the user based on its understanding of that information. Touch-screen interaction refers to an interaction mode in which the terminal collects the user's input operations on the screen, such as clicks and slides, and responds to the user based on those operations. Touch-screen interaction may also include key interaction, where the keys may be on-screen or physical keys outside the screen. Multi-modal information refers to information in multiple modalities, including images, video, audio, and text; in different interaction modes, the terminal acquires information of different modalities.
Different human-computer interaction modes have different advantages and suit different application scenarios. For example, with the development of natural language recognition technology, voice interaction has developed rapidly and is increasingly favored by users: it is closer to a human-to-human conversation, and the user can express more information naturally. The advantages of touch-screen interaction are a clear flow and accurate responses. However, whether voice interaction or touch-screen interaction, any single interaction mode has its shortcomings.
For example, when a user says "I am hungry and want to eat" to a mobile phone, the voice assistant on the phone collects the voice information, processes it, pops up an overlay, and displays the retrieved restaurant list on the overlay as images and text for the user to choose from. At that point the voice interaction ends; subsequently, the user can only select a suitable restaurant by clicking on the page, that is, by touch-screen interaction.
Analyzing how the machine handles information in different interaction modes shows that the processing may involve an instruction processing procedure or a natural language processing procedure. Either way, relying on a single procedure limits the diversity of the machine's responses during human-computer interaction.
To improve the intelligence of the human-computer interaction process and increase the diversity of machine processing and responses, this embodiment provides a multi-modal-based human-computer interaction method comprising the following steps: step S210, obtaining multi-modal information input by a user through a terminal; step S220, extracting a user instruction and a user language from the multi-modal information; step S230, determining, by the instruction processor, first output content for the user instruction; step S240, determining, by the natural language processor, second output content for the user language; step S250, determining response content for the user based on a fusion of the first output content and the second output content. In this embodiment, once the multi-modal information of the interaction has been collected, information of different modalities can be processed by the instruction processor and the natural language processor respectively, so that input entered through different input modes is handled in a correspondingly appropriate way, making the interaction more diverse and more intelligent.
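For illustration only, the five steps S210 to S250 can be sketched as plain Python functions as below; every function name and dictionary field in this sketch is an assumption made for the example and is not defined by this specification.

```python
# Minimal, illustrative sketch of the S210-S250 flow; the processors are stubbed out.

def extract_user_instruction(info: dict) -> str:
    # S220 (part 1): e.g. map a recognized gesture or click event to a preset instruction.
    return info.get("instruction", "")

def extract_user_language(info: dict) -> str:
    # S220 (part 2): e.g. ASR text plus typed language text.
    return info.get("language", "")

def instruction_processor(instruction: str) -> dict:
    # S230: placeholder for the SOP-based instruction processing; returns E1.
    return {"action": f"execute:{instruction}"}

def natural_language_processor(language: str) -> str:
    # S240: placeholder for the NLU/DM/NLG processing; returns E2.
    return f"Reply to: {language}"

def fuse(e1: dict, e2: str) -> dict:
    # S250: combine both outputs into the response content for the user.
    return {"display": e1, "speech": e2}

if __name__ == "__main__":
    multimodal_info = {"instruction": "search", "language": "I am hungry"}  # S210
    e1 = instruction_processor(extract_user_instruction(multimodal_info))
    e2 = natural_language_processor(extract_user_language(multimodal_info))
    print(fuse(e1, e2))
```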
The following describes the embodiments of the present specification in detail with reference to the embodiment of fig. 2.
Fig. 2 is a schematic flowchart of a multi-modal-based human-computer interaction method according to an embodiment. The method may be performed by a server or by a terminal. The terminal is used to acquire the multi-modal information of the human-computer interaction; in particular, a client on the terminal can collect that information. The human-computer interaction process mentioned in this specification is not limited to interaction between natural persons and the terminal: the object interacting with the terminal device may be a natural person, a robot, or a trained animal. In general, the "human" in "human-computer interaction" is any object with the ability to interact with a machine; in the following description such an object is referred to as a user, i.e. a user of the terminal. The server may be implemented by any device, apparatus, platform, or device cluster with computing and processing capabilities, and the terminal may be any terminal with computing and processing capabilities. The method comprises the following steps.
In step S210, multimodal information input by a user through a terminal is acquired.
The multi-modal information includes at least one of the following categories: voice information, video information, and text information. The text information includes language text and event text produced by the user's input operations. For example, the characters a user types in a text box on the terminal are language text, while a click on a link in a page provided by the terminal causes the terminal to collect an event text. Whenever an interaction event such as a click, slide, or drag occurs between the user and the terminal, the terminal can collect a corresponding event text.
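As a rough illustration, the collected multi-modal information could be carried in a structure such as the one below; the field names are assumptions made for this example.

```python
# One possible container for the multi-modal information collected by the terminal.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MultimodalInfo:
    voice: Optional[bytes] = None        # audio stream captured by the microphone
    video: Optional[bytes] = None        # video stream captured by the camera
    language_text: str = ""              # characters typed by the user, e.g. in a text box
    event_texts: List[str] = field(default_factory=list)  # e.g. ["click:restaurant_link", "slide:down"]
```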
The terminal is a device with multi-modal information acquisition capability. For example, the terminal should have at least two of the following information acquisition capabilities: the ability to capture video information or images, the ability to capture audio information, and the ability to capture text information. The voice information collected by the terminal may be in the form of a voice stream, and the collected video information may be in the form of a video stream.
In a certain time period, the user interacts with the terminal, and the terminal can acquire at least one of voice information, video information and text information. The interaction between the user and the terminal is also actually the interaction between the user and the operating system of the terminal. When the method is executed by the server, the terminal can send the collected multi-modal information to the server.
In step S220, a user instruction and a user language are extracted from the multimodal information.
The user instruction may be understood as an operation instruction. For example, the user instructions in a music playing scene may include play, stop, previous, next, random play, sequential play, and the like; in other scenarios the user instructions may include search, refresh, rollback, collect, copy, paste, and the like. In this embodiment an instruction set may be predefined, containing a plurality of preset instructions, for example: common_1, common_2, …, common_n.
When this step is executed, the user instruction can be extracted from any kind of information contained in the multi-modal information based on the plurality of preset instructions. For example, a user instruction may be extracted from the video information: by detecting the images in the video it may be found that the user performs a turning-left action, and the corresponding preset instruction may be "turn left". User instructions may also be extracted from the audio information, the language text, or the event text. Specifically, it is detected whether the multi-modal information contains any preset instruction, and if so, the detected preset instruction is taken as the user instruction. The user instruction extracted from the multi-modal information can therefore be understood as one of the preset instructions; that is, the multi-modal information has been converted into a corresponding preset instruction.
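A hypothetical sketch of this extraction step follows; the preset instruction set and the gesture and event mappings are invented for the example.

```python
# Return the first preset instruction detected in any modality, or None if there is none.
PRESET_INSTRUCTIONS = {"play", "stop", "previous", "next", "search", "refresh", "turn_left", "slide_down"}

GESTURE_TO_INSTRUCTION = {"head_turn_left": "turn_left"}                        # from video frames
EVENT_TO_INSTRUCTION = {"slide:down": "slide_down", "click:search": "search"}   # from event text

def extract_user_instruction(gestures, event_texts, asr_text):
    candidates = [GESTURE_TO_INSTRUCTION.get(g) for g in gestures]
    candidates += [EVENT_TO_INSTRUCTION.get(e) for e in event_texts]
    candidates += [w for w in asr_text.lower().split() if w in PRESET_INSTRUCTIONS]
    for candidate in candidates:
        if candidate in PRESET_INSTRUCTIONS:
            return candidate
    return None

print(extract_user_instruction(["head_turn_left"], ["slide:down"], "please refresh the page"))
# -> 'turn_left': the multi-modal information has been converted into a preset instruction
```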
The user language can be understood as the natural language of the user, and the source of the user language can be the voice input by the user or the characters input by the user. In extracting the user language, the user language may be extracted from the speech information and the language text included in the multimodal information. For example, the text extracted from the audio information may be directly determined as the user language, or the user language may be obtained by performing a preset process on the text extracted from the audio information. The language text can also be directly determined as the user language, or the user language can be obtained after the language text is subjected to preset processing.
When extracting text from voice information, an Automatic Speech Recognition (ASR) technique may be used to obtain the recognized text.
In step S230, the first output content E1 for the user instruction is determined by the instruction processor.
In step S240, the second output content E2 for the user language is determined by the natural language processor.
The instruction processor is used to determine output content for user instructions; the data objects it processes are instructions. The instruction processor may determine the first output content E1 through the standardized operation of a Standard Operating Procedure (SOP). The standardized operation of the SOP can be understood as operating on a plurality of directed instruction trees: when there are M preset instructions, there are M corresponding instruction trees. Each preset instruction serves as a root node that points to different next nodes under different conditions, and each next node represents the next operation to perform for that preset instruction. The instruction processor may determine the first output content E1 using the SOP and the instruction tree corresponding to the preset instruction. The resulting first output content E1 may be an operation, or page information obtained through an operation.
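The directed-instruction-tree idea could be sketched as follows; the node class, the conditions, and the toy tree for a "search" instruction are all assumptions made for illustration.

```python
# Each preset instruction is the root of a small directed tree; conditions select the next node.
class InstructionNode:
    def __init__(self, operation, children=None):
        self.operation = operation      # the operation this node represents
        self.children = children or []  # list of (condition_fn, next InstructionNode)

    def step(self, context):
        # Return the next node whose condition holds for the current context, if any.
        for condition, child in self.children:
            if condition(context):
                return child
        return None

def run_sop(root, context):
    # Walk the tree from the root instruction and collect the operations performed (E1).
    node, operations = root, []
    while node is not None:
        operations.append(node.operation)
        node = node.step(context)
    return operations

show_results = InstructionNode("render_result_page")
apply_filters = InstructionNode("apply_filters", [(lambda c: True, show_results)])
search_root = InstructionNode("search", [
    (lambda c: bool(c.get("filters")), apply_filters),  # condition: filters were supplied
    (lambda c: True, show_results),
])
print(run_sop(search_root, {"filters": {"score": "top10"}}))
# ['search', 'apply_filters', 'render_result_page']
```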
The natural language processor is used to determine output content for the user language; the data objects it processes are natural language. The natural language processor may be a dialogue robot for processing natural language. Its processing of the user language may include Natural Language Understanding (NLU), Dialog Management (DM), and Natural Language Generation (NLG). The resulting second output content E2 may be natural language in text form.
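The NLU, DM, and NLG stages might be wired together as in the following sketch; the toy rules inside each stage are placeholders standing in for trained models.

```python
def nlu(user_language: str) -> dict:
    # Natural Language Understanding: turn text into an intent and slots.
    intent = "find_restaurant" if "restaurant" in user_language.lower() else "chitchat"
    slots = {"rank": "top10"} if "top 10" in user_language.lower() else {}
    return {"intent": intent, "slots": slots}

def dm(semantics: dict, context: dict) -> dict:
    # Dialog Management: decide the system action from the semantics and the dialogue context.
    if semantics["intent"] == "find_restaurant":
        return {"action": "inform_results", "slots": semantics["slots"]}
    return {"action": "ask_clarification", "slots": {}}

def nlg(system_action: dict) -> str:
    # Natural Language Generation: render the action as text, giving the second output content E2.
    if system_action["action"] == "inform_results":
        return "I have found the top-10 restaurants for you."
    return "Could you tell me more about what you need?"

second_output = nlg(dm(nlu("I want to see restaurants whose scores are in the top 10"), context={}))
print(second_output)
```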
The instruction processor and the natural language processor may be modules configured in the execution body, for example, modules configured in a server or a terminal; or may be a module configured in another device.
The first output content E1 is the output result obtained by the instruction processor from the user instruction and is content that responds to the user instruction. The second output content E2 is the output result obtained by the natural language processor from the user language and is content that responds to the user language. In step S230, the user instruction may be input directly into the instruction processor to obtain the first output content E1; in step S240, the user language may be input directly into the natural language processor to obtain the second output content E2. Steps S230 and S240 may be executed one after the other, in either order, or simultaneously; this application does not limit their execution order.
In order to improve the processing effect of the instruction processor and the natural language processor, user characteristics may be extracted from the multi-modal information, and the determined first output content E1 and second output content E2 may be more accurate in combination with the user characteristics. The user characteristics are characteristic information related to the user, including characteristic attributes of the user, behavior characteristics of the user, environmental characteristics of the user, and the like.
When the multi-modal information contains video information and/or voice information, user features may be extracted from them. For example, the user's appearance, emotional features, and the like may be extracted from the images in the video information, and voiceprint features, user portrait features, and the like may be extracted from the voice information. When extracting user features from video information, the features in the images may be extracted with a set of image models such as a person matching model, an emotion model, and Optical Character Recognition (OCR). When extracting user features from voice information, the features may be extracted with a set of acoustic models such as a voiceprint model and a user portrait model.
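A minimal sketch of this feature extraction, assuming placeholder model objects that stand in for the image model set and the acoustic model set:

```python
def extract_user_features(video_frames, audio, image_models, acoustic_models):
    # Collect user features from whichever modalities are present.
    features = {}
    if video_frames:
        features["identity"] = image_models["person_matching"](video_frames)
        features["emotion"] = image_models["emotion"](video_frames)
        features["ocr_text"] = image_models["ocr"](video_frames)
    if audio:
        features["voiceprint"] = acoustic_models["voiceprint"](audio)
        features["profile"] = acoustic_models["user_portrait"](audio)
    return features

# Demo with trivial stand-in "models"; a real system would plug in trained models here.
demo_image_models = {"person_matching": lambda v: "user_42",
                     "emotion": lambda v: "happy",
                     "ocr": lambda v: ""}
demo_acoustic_models = {"voiceprint": lambda a: "vp_123",
                        "user_portrait": lambda a: {"region": "Sichuan"}}
print(extract_user_features([b"frame"], b"audio", demo_image_models, demo_acoustic_models))
```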
After the user characteristics are extracted, the first output content E1 for the user instruction may be determined by the instruction processor based on the user characteristics when step S230 is executed. In particular, user characteristics and user instructions may be input to the instruction processor, resulting in first output content E1. For example, the instruction processor may determine, in conjunction with the user characteristics, the next node to point from a node of the instruction tree.
In performing step S240, the second output content E2 for the user language may be determined by the natural language processor based on the user characteristics. In particular, the user characteristics and the user language may be input into the natural language processor, resulting in the second output content E2. For example, the natural language processor may perform NLU, DM, and NLG processing in conjunction with user features.
The multi-modal information is man-machine interaction information in the same time period, so that the processing process of the instruction processor and the processing process of the natural language processor can be mutually influenced. The mutual influence can improve the accuracy of the processing results of the two processors. Thus, in one embodiment, the instruction processor and the natural language processor may communicate contextually.
That is, when performing step S230, the first output content E1 for the user instruction may be determined by the instruction processor based on the intermediate output determined by the natural language processor for the user language (referred to as the second intermediate output). The second intermediate output and the user instruction serve as input data for the instruction processor, which then determines the first output content E1; the second intermediate output may, for example, serve as an input for deciding which next node a node of the instruction tree points to.
Likewise, when performing step S240, the second output content for the user language may be determined by the natural language processor based on the intermediate output determined by the instruction processor for the user instruction (referred to as the first intermediate output). The first intermediate output and the user language serve as input data for the natural language processor, which then determines the second output content E2; the natural language processor may perform NLU, DM, and NLG in conjunction with the first intermediate output.
The intermediate outputs may be stored in, and read from, a cache. In a concrete implementation, during processing the instruction processor and the natural language processor each first read the cache; if the other side's intermediate output is found, subsequent operations are carried out based on it, and if not, the original processing flow is executed. At the same time, whenever an intermediate output is produced it is stored in the cache for the other side to read. When the processing of the instruction processor or the natural language processor consists of several steps, the cache may be read before executing any step, and the resulting intermediate output may be stored to the cache when the step completes. There may therefore be one or more intermediate outputs from the instruction processor and one or more from the natural language processor.
In practice, a context manager may be employed to store intermediate outputs to a cache and to read intermediate outputs from the cache when needed. The intermediate outputs include a first intermediate output and a second intermediate output.
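One possible shape for such a context manager is sketched below; the shared dictionary, the lock, and the method names are assumptions added for the example.

```python
import threading
from collections import defaultdict

class ContextManager:
    """Shared cache through which the two processors exchange intermediate outputs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = defaultdict(list)   # producer name -> list of intermediate outputs

    def put(self, producer: str, intermediate_output) -> None:
        # Called by a processor whenever it finishes a step and has an intermediate output.
        with self._lock:
            self._cache[producer].append(intermediate_output)

    def get(self, producer: str) -> list:
        # Called by the other processor; returns whatever has been written so far (possibly empty).
        with self._lock:
            return list(self._cache[producer])

ctx = ContextManager()
ctx.put("instruction_processor", {"search_result_size": 10})  # a first intermediate output
print(ctx.get("instruction_processor"))                        # read by the natural language processor
```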
By exchanging intermediate outputs in this way, the instruction processor and the natural language processor determine the first output content E1 and the second output content E2, respectively. The intermediate outputs involved may come from the instruction processor, from the natural language processor, or from both.
In step S250, the response content for the user is determined based on a fusion of the first output content E1 and the second output content E2. This step may be implemented in a variety of ways.
For example, the first output content E1 and the second output content E2 may be fused according to preset fusion decisions to obtain the response content for the user. A large number of fusion decisions may be preset; a fusion decision may consist of a combination of first output content E1 and second output content E2, certain conditions, and the corresponding decision result. The fusion decisions may also be organized as a decision tree, with the response content for the user obtained by evaluating that tree.
For another example, a result of fusing the first output content E1 and the second output content E2 may be determined based on a trained fusion model, and the response content for the user may be determined based on the fused result. The first output content E1 and the second output content E2 may be input into a fusion model, resulting in a fused result. The fused result can be directly used as the response content for the user, and certain preset adjustment can be performed on the basis of the fused result to obtain the response content for the user.
Whichever fusion method is used, the general principle of the processing is as follows: if the two output contents (the first output content E1 and the second output content E2) are inconsistent, one of them is selected as the response content according to a preset rule; if there is no conflict between them, they may be ordered for output, combined, and so on.
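A small rule-based fusion along these lines is sketched below; the conflict test and the preference for the instruction processor's output are assumptions made for illustration.

```python
def fuse_outputs(e1: dict, e2: str) -> dict:
    """e1: structured output of the instruction processor; e2: text produced by the NLG."""
    conflict = e1.get("action") == "ask_clarification" and "found" in e2.lower()
    if conflict:
        # Preset rule for this example: trust the instruction processor when the two disagree.
        return {"display": e1, "speech": ""}
    # No conflict: order the two contents for output (display first, then speech).
    return {"display": e1, "speech": e2}

response = fuse_outputs({"action": "inform_results", "items": ["restaurant A"]},
                        "I have found the top-10 restaurants for you.")
print(response)
```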
After the response content for the user is determined, the response content can be classified based on the multi-modal types contained in the response content, and the classified response content is output through the terminal by adopting a corresponding modal output mode. The multi-modal categories may include categories of video information, text information, voice information, and the like.
When the execution subject is the server, the server classifies the response content according to the multi-modal categories it contains and sends the classified response content to the terminal. The terminal receives the classified response content and outputs it in the corresponding modal output modes for the user to view.
When the execution subject is the terminal, the terminal itself classifies the response content according to the multi-modal categories it contains and outputs the classified response content in the corresponding modal output modes for the user to view.
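The classification and per-modality output could look roughly like the sketch below; the channel names and the terminal methods are assumptions made for the example.

```python
def dispatch_response(response: dict, terminal) -> None:
    # Split the response content by modality and hand each part to the matching output channel.
    if response.get("display"):
        terminal.render_text_or_list(response["display"])   # text / graphic channel
    if response.get("speech"):
        terminal.play_tts(response["speech"])                # voice channel (TTS)
    if response.get("video"):
        terminal.play_video(response["video"])               # video channel

class DemoTerminal:
    def render_text_or_list(self, content): print("display:", content)
    def play_tts(self, text): print("speak:", text)
    def play_video(self, stream): print("play video")

dispatch_response({"display": ["restaurant A"],
                   "speech": "I have found the top-10 restaurants for you."}, DemoTerminal())
```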
Fig. 3 is a system structure diagram of multi-modal human-computer interaction provided in this embodiment. The terminal collects multi-mode information such as video stream, text information and voice stream, and inputs the multi-mode information into the multi-mode information processor.
In the multi-modal information processor, the video stream is input into an image model set, and user features, user instructions, and the like are extracted from the images through models such as a person matching model, an emotion model, and OCR (optical character recognition). The language text and event text contained in the text information are input into a text and instruction processor, which extracts the user input information, including the user language and user instructions, by processing the events and the textual content; this covers the events and text obtained through the GUI (graphical user interface), the text obtained through ASR, and so on. The voice stream is input into a voice model set, and acoustic model features, the user language, user instructions, and the like are extracted from it through models such as ASR (automatic speech recognition), a voiceprint model, and a user portrait model.
The extracted information is then input into a multi-information fusion device, which classifies and integrates it to obtain the user features (feature_1 to feature_n), the user instructions (common_1 to common_m), and the user language (i.e. the user language in text form). The user instructions may come from the interface protocol agreed for touch-screen interaction in the text information, from the classification result of the user's audio input, from the classification of gesture content in the image frames, and so on.
The user instructions and the user features are fed into the instruction processor under the coordination of the context manager, and the first output content E1 is obtained through SOP processing in the instruction processor. The user language and the user features are fed into the natural language processor, and the second output content E2 is obtained through the NLU, DM, and NLG processing in the natural language processor.
During processing, the instruction processor and the natural language processor share context with each other. Their final outputs are fused into the response content, which is classified and output to the terminal through the multi-information output device. The language text to be output is converted into speech by Text To Speech (TTS) processing and broadcast to the user through the terminal, while the video stream is played and the text information is displayed on the terminal.
The present embodiment will be described below with reference to specific examples.
When a user says "I am hungry and want to eat" to the mobile phone, the service assistant on the phone collects the voice information and sends it to the server for processing. The server processes this multi-modal information with the method of this embodiment and returns the response content to the phone, which pops up an overlay and displays the retrieved restaurant list on the overlay as images and text.
After the restaurant list is shown on the page, the user can browse more restaurants by sliding down on the phone screen. The user finds that there are about 100 nearby restaurants and has difficulty deciding which one to choose. The user therefore says "I want to see the restaurants with many visitors whose scores are in the top 10" to narrow down the target restaurants. At this point the service assistant on the phone collects the following multi-modal information: the slide-down operation performed by the user on the screen, and the voice information input by the user. The service assistant sends the collected multi-modal information to the server, and the server performs the following operations.
The multi-modal information processor in the server processes the audio to obtain the user language "I want to see the restaurants with many visitors whose scores are in the top 10", determines the user's current emotion (for example, a happy mood) from the audio, and infers the user portrait from the audio: the user is located in Sichuan. The multi-modal information processor also recognizes the slide-down operation as an event.
Through the classification performed by the multi-information fusion device, the following three key pieces of information are obtained:
user characteristics: the emotion is joyful, and the region is Sichuan;
user instruction: a slide-down operation, a conditional search;
user language: I want to see the restaurants with many visitors whose scores are in the top 10.
The user features and the user instruction enter the instruction processor for processing, and the user features and the user language enter the natural language processor for processing. The two can run simultaneously, exchanging context during processing and making use of each other's intermediate outputs.
In the instruction processor, the user's selection enters the search flow. According to the programmed SOP, the restaurant list is further narrowed with the search conditions "number of visitors > 1000" and "top 10" to obtain the search result, which is stored in the cache through the context manager.
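A hypothetical version of this SOP search step is shown below; the field names follow the example in the text but are otherwise invented, as is the sample data.

```python
def search_restaurants(restaurants, min_visitors=1000, top_n=10):
    # Keep restaurants with more than min_visitors visitors, then take the top_n by score.
    candidates = [r for r in restaurants if r["visitors"] > min_visitors]
    return sorted(candidates, key=lambda r: r["score"], reverse=True)[:top_n]

restaurants = [{"name": "A", "visitors": 2400, "score": 4.8},
               {"name": "B", "visitors": 300, "score": 4.9},
               {"name": "C", "visitors": 1500, "score": 4.5}]
print(search_restaurants(restaurants))
# -> A then C; B is filtered out by the visitor-count condition.
# The result would then be stored in the shared cache via the context manager.
```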
In the natural language processor, the NLU stage may process the input in conjunction with the user features. Assuming that the processing result of the instruction processor can be obtained from the cache during DM processing, the DM combines the NLU output with the instruction processor's output, passes the result to the NLG, and the NLG generates natural language such as "I have found the top-10 restaurants for you".
When the output content of the instruction processor and the output content of the natural language processor are obtained, the two are reintegrated and subjected to decision-making processing, and response content for a user is obtained.
The multi-information output device classifies and arranges the response content and places it into the three channels defined by the mobile phone terminal. The service assistant on the phone obtains the following two items:
the list data to be displayed;
a voice message to be broadcast at the same time, guiding the user's next action.
The service assistant then presents the list of the 10 top-scoring restaurants on the phone and broadcasts the voice message "I have found the top-10 restaurants for you".
Through the processing of the method, the user can input multi-modal information at the same time, and obtain diversified responses aiming at the multi-modal information instead of a processing result of single-modal information.
In this embodiment, given a terminal with multiple information collection capabilities, processing of information in multiple modalities (video, audio, operations, etc.), model-based recognition, and instruction-set classification are added. Conventional instruction behavior is handled through the standardized flow defined in the instruction processor, while the voice robot focuses on natural language understanding and processing; combining the two increases the diversity of the machine's processing and responses.
In this specification, the words "first" in the first output content, first intermediate output, etc., and the words "second" in the description are used for convenience of distinction and description only and are not intended to have any limiting meaning.
The foregoing describes certain embodiments of the present specification, and other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily have to be in the particular order shown, or in sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Fig. 4 is a schematic block diagram of a multi-modal based human-computer interaction device according to an embodiment. The apparatus 400 may be deployed in a server or in a terminal. The server may be implemented by any device, equipment, platform, device cluster, etc. having computing and processing capabilities, and the terminal may be implemented by any terminal having computing and processing capabilities. This embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2. The apparatus 400 comprises:
an obtaining module 410 configured to obtain multimodal information input by a user through a terminal;
an extraction module 420 configured to extract a user instruction and a user language from the multimodal information;
a first determining module 430 configured to determine, by an instruction processor, first output content for the user instruction;
a second determination module 440 configured to determine, by the natural language processor, a second output content for the user language;
a fusion module 450 configured to determine responsive content for the user based on a fusion of the first output content and the second output content.
In one embodiment, the multimodal information includes at least one of the following categories: voice information, video information, text information; the text information includes language text and event text input by the user through an input operation.
In one embodiment, the extraction module 420 comprises:
a first extraction sub-module (not shown in the figure) configured to extract a user instruction from any kind of information included in the multi-modal information based on a preset instruction;
a second extraction sub-module (not shown in the figure) configured to extract the user language from the speech information and the language text included in the multimodal information.
In one embodiment, the extraction module 420 is further configured to extract user features from the multimodal information;
the first determining module 430 is specifically configured to:
determining, by an instruction processor, first output content for the user instruction based on the user characteristic;
the second determining module 440 is specifically configured to:
determining, by a natural language processor, second output content for the user language based on the user characteristic.
In an embodiment, the first determining module 430 is specifically configured to:
determining, by an instruction processor, first output content for the user instruction based on the second intermediate output; wherein the second intermediate output is an intermediate output determined by the natural language processor for the user language.
In an embodiment, the second determining module 440 is specifically configured to:
determining, by the natural language processor, second output content for the user language based on the first intermediate output; wherein the first intermediate output is an intermediate output determined by the instruction processor for the user instruction.
In an embodiment, the first determining module 430 is specifically configured to:
determining first output content for the user instruction through a standardized operation of a standard operating procedure (SOP) in the instruction processor.
In one embodiment, the fusion module 450 is specifically configured to:
and fusing the first output content and the second output content based on a preset fusion decision to obtain response content for the user.
In one embodiment, the fusion module 450 is specifically configured to:
determining a result of fusing the first output content and the second output content based on a trained fusion model, and determining response content for the user based on the fused result.
In one embodiment, the apparatus 400 further comprises:
a classification module (not shown in the figure) configured to classify the response content for the user based on a multi-modal category contained in the response content after determining the response content;
and an output module (not shown in the figure) configured to output the classified response content through the terminal by using a corresponding modal output mode.
The above device embodiments correspond to the method embodiments, and specific descriptions may refer to descriptions of the method embodiments, which are not repeated herein. The device embodiment is obtained based on the corresponding method embodiment, has the same technical effect as the corresponding method embodiment, and for the specific description, reference may be made to the corresponding method embodiment.
Embodiments of the present specification also provide a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any one of fig. 1 to 3.
The embodiment of the present specification further provides a computing device, which includes a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method described in any one of fig. 1 to 3.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the storage medium and the computing device embodiments, since they are substantially similar to the method embodiments, they are described relatively simply, and reference may be made to some descriptions of the method embodiments for relevant points.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments further describe the objects, technical solutions and advantages of the embodiments of the present invention in detail. It should be understood that the above description is only exemplary of the embodiments of the present invention, and is not intended to limit the scope of the present invention, and any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (16)

1. A multi-modal-based human-computer interaction method comprises the following steps:
acquiring multi-modal information input by a user through a terminal;
extracting user instructions and user language from the multimodal information;
determining, by an instruction processor, first output content for the user instruction;
determining, by a natural language processor, second output content for the user language;
determining responsive content for the user based on the fusion of the first output content and the second output content.
2. The method of claim 1, the multi-modal information comprising at least one of the following categories: voice information, video information, text information; the text information includes language text and event text input by the user through an input operation.
3. The method of claim 2, the step of extracting user instructions and user language from the multimodal information comprising:
extracting a user instruction from any kind of information contained in the multi-modal information based on a preset instruction;
and extracting the user language from the voice information and the language text contained in the multi-modal information.
4. The method of claim 1, further comprising:
extracting user features from the multimodal information;
the step of determining, by the instruction processor, first output content for the user instruction includes:
determining, by an instruction processor, first output content for the user instruction based on the user characteristic;
the step of determining, by the natural language processor, second output content for the user language includes:
determining, by a natural language processor, second output content for the user language based on the user characteristic.
5. The method of claim 1, the step of determining, by an instruction processor, first output content for the user instruction, comprising:
determining, by an instruction processor, first output content for the user instruction based on the second intermediate output; wherein the second intermediate output is an intermediate output determined by the natural language processor for the user language.
6. The method of claim 1, the step of determining, by a natural language processor, second output content for the user language comprising:
determining, by the natural language processor, second output content for the user language based on the first intermediate output; wherein the first intermediate output is an intermediate output determined by the instruction processor for the user instruction.
7. The method of claim 1, the step of determining, by an instruction processor, first output content for the user instruction, comprising:
determining first output content for the user instruction through a standardized operation of a standard operating procedure (SOP) in the instruction processor.
8. The method of claim 1, the step of determining the responsive content for the user comprising:
and fusing the first output content and the second output content based on a preset fusion decision to obtain response content for the user.
9. The method of claim 1, the step of determining the responsive content for the user comprising:
determining a result of fusing the first output content and the second output content based on a trained fusion model, and determining response content for the user based on the fused result.
10. The method of claim 1, after determining the responsive content for the user, further comprising:
classifying the response content based on a multi-modal category contained in the response content;
and outputting the classified response contents through the terminal by adopting a corresponding modal output mode.
11. The method of claim 1, performed by a server, or performed by the terminal.
12. A multi-modality based human-computer interaction device, comprising:
the acquisition module is configured to acquire multi-modal information input by a user through the terminal;
an extraction module configured to extract a user instruction and a user language from the multimodal information;
a first determination module configured to determine, by an instruction processor, first output content for the user instruction;
a second determination module configured to determine, by the natural language processor, a second output content for the user language;
a fusion module configured to determine responsive content for the user based on a fusion of the first output content and the second output content.
13. The apparatus of claim 12, the extraction module further configured to extract user features from the multimodal information;
the first determining module is specifically configured to:
determining, by an instruction processor, first output content for the user instruction based on the user characteristic;
the second determining module is specifically configured to:
determining, by a natural language processor, second output content for the user language based on the user characteristic.
14. The apparatus of claim 12, the first determination module being specifically configured to:
determining, by an instruction processor, first output content for the user instruction based on the second intermediate output; wherein the second intermediate output is an intermediate output determined by the natural language processor for the user language.
15. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-11.
16. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-11.
CN202210753297.3A 2022-06-29 2022-06-29 Multi-mode-based man-machine interaction method and device Pending CN115062131A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210753297.3A CN115062131A (en) 2022-06-29 2022-06-29 Multi-mode-based man-machine interaction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210753297.3A CN115062131A (en) 2022-06-29 2022-06-29 Multi-mode-based man-machine interaction method and device

Publications (1)

Publication Number Publication Date
CN115062131A true CN115062131A (en) 2022-09-16

Family

ID=83204195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210753297.3A Pending CN115062131A (en) 2022-06-29 2022-06-29 Multi-mode-based man-machine interaction method and device

Country Status (1)

Country Link
CN (1) CN115062131A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090090613A (en) * 2008-02-21 2009-08-26 주식회사 케이티 System and method for multimodal conversational mode image management
US20200110864A1 (en) * 2018-10-08 2020-04-09 Google Llc Enrollment with an automated assistant
CN109522835A (en) * 2018-11-13 2019-03-26 北京光年无限科技有限公司 Children's book based on intelligent robot is read and exchange method and system
US20200356829A1 (en) * 2019-05-08 2020-11-12 Accenture Global Solutions Limited Multi-modal visual question answering system
CN111966320A (en) * 2020-08-05 2020-11-20 湖北亿咖通科技有限公司 Multimodal interaction method for vehicle, storage medium, and electronic device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383620A (en) * 2023-03-29 2023-07-04 北京鹅厂科技有限公司 Method and device for applying multi-mode artificial intelligence
CN116383620B (en) * 2023-03-29 2023-10-20 北京鹅厂科技有限公司 Method and device for applying multi-mode artificial intelligence

Similar Documents

Publication Publication Date Title
US11503155B2 (en) Interactive voice-control method and apparatus, device and medium
CN111583907B (en) Information processing method, device and storage medium
US10521723B2 (en) Electronic apparatus, method of providing guide and non-transitory computer readable recording medium
KR102288249B1 (en) Information processing method, terminal, and computer storage medium
US20190027147A1 (en) Automatic integration of image capture and recognition in a voice-based query to understand intent
US11825278B2 (en) Device and method for auto audio and video focusing
CN106294774A (en) User individual data processing method based on dialogue service and device
KR102429583B1 (en) Electronic apparatus, method for providing guide ui of thereof, and non-transitory computer readable recording medium
CN112528004B (en) Voice interaction method, voice interaction device, electronic equipment, medium and computer program product
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
US11769492B2 (en) Voice conversation analysis method and apparatus using artificial intelligence
CN110765294B (en) Image searching method and device, terminal equipment and storage medium
CN111583919B (en) Information processing method, device and storage medium
KR102596841B1 (en) Electronic device and method for providing one or more items responding to speech of user
CN112434139A (en) Information interaction method and device, electronic equipment and storage medium
CN113572889B (en) Simplifying user interface generation
CN110019777A (en) A kind of method and apparatus of information classification
US20180151178A1 (en) Interactive question-answering apparatus and method thereof
CN115410572A (en) Voice interaction method, device, terminal, storage medium and program product
CN115062131A (en) Multi-mode-based man-machine interaction method and device
CN111400443B (en) Information processing method, device and storage medium
KR20180059347A (en) Interactive question-anwering apparatus and method thereof
CN115309882A (en) Interactive information generation method, system and storage medium based on multi-modal characteristics
CN115019788A (en) Voice interaction method, system, terminal equipment and storage medium
CN115062691B (en) Attribute identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination