CN111966212A - Multi-mode-based interaction method and device, storage medium and smart screen device - Google Patents

Multi-mode-based interaction method and device, storage medium and smart screen device

Info

Publication number
CN111966212A
CN111966212A
Authority
CN
China
Prior art keywords
interaction
user
environment
information
interactive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010605417.6A
Other languages
Chinese (zh)
Inventor
牛禹
曹玉树
姜威
郭伟宇
司庆
卢宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Shanghai Xiaodu Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010605417.6A priority Critical patent/CN111966212A/en
Publication of CN111966212A publication Critical patent/CN111966212A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/016 Input arrangements with force or tactile feedback as computer generated output to the user
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/04812 Interaction techniques based on cursor appearance or behaviour, e.g. being affected by the presence of displayed objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range

Abstract

The application discloses a multi-mode-based interaction method and device, a storage medium and a smart screen device, and relates to the technical fields of artificial intelligence and multi-modal interaction. The specific implementation scheme is as follows: recognizing user characteristics based on a multi-modal interaction technology, and acquiring interaction scene information of the user based on the multi-modal interaction technology; analyzing the interaction intention of the user according to the user characteristics in combination with the interaction scene information; and acquiring a target interaction instruction corresponding to the interaction intention, so as to perform interaction control on the smart screen device by adopting the target interaction instruction.

Description

Multi-mode-based interaction method and device, storage medium and smart screen device
Technical Field
The application relates to the technical field of computers, in particular to the technical field of artificial intelligence and multi-mode interaction, and particularly relates to an interaction method, an interaction device, a storage medium and an intelligent screen device based on multiple modes.
Background
Multi-modal intelligent interaction is the main form of future intelligent interaction; with the progress of artificial intelligence technology, interaction between people and intelligent devices, and between people and machines, tends to become more intelligent and natural. For example, through interaction with electronic devices such as smart speakers and smart robots, a user may obtain various resources and services.
Disclosure of Invention
The multi-mode-based interaction method, the multi-mode-based interaction device, the storage medium and the smart screen device provided by the application can determine the interaction intention of the user by combining the user characteristics and the interaction environment information of the user, thereby effectively improving the multi-modal intelligent interaction effect of the smart screen device.
According to a first aspect, a multimodal-based interaction method is provided, which is applied to a smart screen device, and includes: recognizing user characteristics based on a multi-modal interaction technology, and acquiring interaction scene information of a user based on the multi-modal interaction technology; analyzing the interaction intention of the user according to the user characteristics and the interaction scene information; and acquiring a target interaction instruction corresponding to the interaction intention, and thus, performing interaction control on the intelligent screen equipment by adopting the target interaction instruction.
According to the multi-mode-based interaction method, the user characteristics are identified based on the multi-mode interaction technology, the interaction scene information of the user is obtained based on the multi-mode interaction technology, the interaction intention of the user is analyzed according to the user characteristics and the interaction scene information, and the target interaction instruction corresponding to the interaction intention is obtained, so that the interaction control is performed on the intelligent screen device through the target interaction instruction, the interaction intention of the user can be determined by combining the user characteristics and the interaction environment information of the user, and the multi-mode intelligent interaction effect of the intelligent screen device is effectively improved.
According to a second aspect, there is provided a multimodal-based interaction apparatus applied to a smart screen device, including: the recognition module is used for recognizing the user characteristics based on a multi-modal interaction technology; the first acquisition module is used for acquiring interaction scene information of a user based on the multi-modal interaction technology; the analysis module is used for analyzing the interaction intention of the user according to the user characteristics and the interaction scene information; and the second acquisition module is used for acquiring a target interaction instruction corresponding to the interaction intention so as to carry out interaction control on the intelligent screen equipment by adopting the target interaction instruction.
According to the multi-mode-based interaction device, the user characteristics are identified based on the multi-mode interaction technology, the interaction scene information of the user is obtained based on the multi-mode interaction technology, the interaction intention of the user is analyzed according to the combination of the user characteristics and the interaction scene information, the target interaction instruction corresponding to the interaction intention is obtained, interaction control is conducted on the intelligent screen device through the target interaction instruction, the interaction intention of the user can be determined jointly through the combination of the user characteristics and the interaction environment information of the user, and the multi-mode intelligent interaction effect of the intelligent screen device is effectively improved.
According to a third aspect, there is provided a smart screen device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the multi-modal based interaction method of the embodiments of the present application.
According to the intelligent screen device, the user characteristics are identified based on the multi-mode interaction technology, the interaction scene information of the user is obtained based on the multi-mode interaction technology, the interaction intention of the user is analyzed according to the combination of the user characteristics and the interaction scene information, the target interaction instruction corresponding to the interaction intention is obtained, interaction control is conducted on the intelligent screen device through the target interaction instruction, the interaction intention of the user can be determined jointly by combining the user characteristics and the interaction environment information of the user, and the multi-mode intelligent interaction effect of the intelligent screen device is effectively improved.
According to a fourth aspect, a non-transitory computer-readable storage medium is presented having stored thereon computer instructions for causing a computer to perform a multimodal based interaction method as disclosed in embodiments of the present application.
The technology of the application solves the technical problem in the related art that simple visual perception or voice perception alone cannot meet the requirements of current human-machine intelligent interaction: the interaction intention of the user can be determined by combining the user characteristics and the interaction environment information of the user, and the multi-modal intelligent interaction effect of the smart screen device is effectively improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram according to a second embodiment of the present application;
FIG. 3 is a schematic structural diagram of an interaction device based on multiple modalities in an embodiment of the present application;
FIG. 4 is a schematic illustration according to a third embodiment of the present application;
FIG. 5 is a schematic illustration according to a fourth embodiment of the present application;
FIG. 6 is a block diagram of a smart screen device for implementing a multimodal based interaction method of an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present application. It should be noted that the execution subject of the multi-modal-based interaction method of this embodiment is a multi-modal-based interaction apparatus. The apparatus can be implemented in software and/or hardware and can be configured in a smart screen device, where a smart screen device can be understood as a hardware device having intelligent functional elements such as a human-computer interaction screen, a voice collector, a voice player and an image collector.
The embodiments of the application relate to the technical fields of artificial intelligence and multi-modal interaction. Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Multi-modal interaction is a form of interaction that integrates visual, voice and other modal perception, and has emerged as hardware devices develop toward system intelligence and humanized interaction.
As shown in fig. 1, the multi-modal based interaction method may include:
s101: the user characteristics are recognized based on the multi-modal interaction technology, and interaction scene information of the user is obtained based on the multi-modal interaction technology.
Each source or form of information may be referred to as a modality. For example, humans have the senses of touch, hearing, vision and smell; there are information media such as voice, video and text; and there are a wide variety of sensors, such as radar, infrared and accelerometers. Each of these may be referred to as a modality. Two different languages may be considered two modalities, and even data sets acquired under two different conditions may be considered two modalities.
Therefore, the user characteristics can be recognized based on the multi-modal interaction technology in the multi-modal interaction technology field, and the interaction scene information of the user is obtained based on the multi-modal interaction technology, so that the dimensionality of the interaction information can be enriched as much as possible, the interaction between the user and the intelligent screen device can effectively simulate the interaction mode between people, and the confirmation of the subsequent interaction intention is effectively assisted.
In some embodiments, the face features, posture features and gesture features of the user may be obtained based on the multi-modal interaction technology and used together as the user characteristics, where each kind of feature may be regarded as a "modality". Of course, the user characteristics are not limited to the foregoing and may also include fingerprint features, voiceprint features and the like. By obtaining user characteristics of multiple dimensions, the interaction between the user and the smart screen device is not limited to voice interaction and gesture interaction, and the accuracy of the subsequent confirmation of the interaction intention is ensured.
For example, a face recognition technology may be used to capture an image of the user's face, and the captured image is then analyzed to obtain the face features; the gesture features and the posture features may be recognized in a similar manner, which is not limited thereto.
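As an illustrative sketch only (not the patent's own implementation), the following Python fragment shows how face, posture and gesture features from separate recognizers might be gathered into one user-characteristic record; the recognizer functions and field names are hypothetical placeholders.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class UserFeatures:
    # One container per feature "modality" named in the text.
    face: Dict[str, float] = field(default_factory=dict)
    posture: Dict[str, float] = field(default_factory=dict)
    gesture: Dict[str, float] = field(default_factory=dict)

# Placeholder extractors so the sketch runs end to end; a real system would
# call dedicated face/posture/gesture recognition models on the captured frame.
def extract_face_features(frame) -> Dict[str, float]:
    return {"face_detected": 1.0}

def extract_posture_features(frame) -> Dict[str, float]:
    return {"standing": 1.0}

def extract_gesture_features(frame) -> Dict[str, float]:
    return {"wave": 0.0}

def recognize_user_features(frame) -> UserFeatures:
    """Combine the per-modality recognizers into a single user-feature record."""
    return UserFeatures(
        face=extract_face_features(frame),
        posture=extract_posture_features(frame),
        gesture=extract_gesture_features(frame),
    )

print(recognize_user_features(frame=None))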
After the user features are identified based on the multi-modal interaction technology, the interaction scenario information of the user may be obtained based on the multi-modal interaction technology, where the interaction scenario information is used to describe interaction environment information of an environment where the user currently interacts with the smart screen device, or describe interaction information between the user and the smart screen device, and may also describe control information of the user on an environment state, and the like.
Optionally, in some embodiments, referring to fig. 2, fig. 2 is a schematic diagram according to a second embodiment of the present application, where the obtaining of interaction scenario information of a user based on a multi-modal interaction technology includes:
s201: acquiring multi-dimensional interactive environment information of the environment where the user is located, and taking the multi-dimensional interactive environment information as interactive scene information.
Optionally, in some embodiments, image recognition may be performed on an environment where the user is located, so as to obtain interaction environment information of a visual dimension of the environment according to the recognized image; and/or performing voice recognition on the environment where the user is located to acquire interaction environment information of the voice dimension of the environment according to the recognized voice, so that the interaction scene information of the environment dimension is acquired, and the modal dimension of the multi-modal interaction logic is enriched.
As an example, when the interaction environment information of the visual dimension is recognized, the visual scene currently faced by the smart screen device may be continuously sent to the visual analysis module to recognize the image, and when the interaction environment information of the voice dimension is recognized, the voice content currently faced by the smart screen device may be continuously sent to the voice analysis module to recognize the voice.
In some embodiments, when the interactive environment information of the visual dimension of the environment is acquired according to the recognized image, specifically, the recognized image may be subjected to image analysis, and characteristics of the content, time, brightness, and the like of the image are analyzed, so that the characteristics are used as the recognized interactive environment information.
In the embodiments of the present application, the environment-related features described in the images and the audio features present in the environment may be recognized as the interactive environment information. Specifically, the recognized images may be analyzed to obtain the environmental features of the environment, and the environmental features are used as the interactive environment information of the visual dimension; the recognized voice may be analyzed to obtain the audio features in the environment, and the audio features are used as the interactive environment information of the voice dimension. This is simple and convenient to implement, and yields more accurate environmental features and audio features, thereby assisting the subsequent output of interaction instructions adapted to the environmental features and the audio features.
The environmental feature is at least one of the following: a light characteristic of the environment, a distance between a calibration object in the environment and the user, and a time characteristic of the environment. The audio feature is at least one of the following: a sound characteristic in the environment and a noise characteristic in the environment. The interaction intention of the user can thus be determined in combination with these environmental and audio features, so as to assist in outputting an interaction instruction adapted to the environmental features and the audio features.
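A minimal sketch, under assumed data formats, of how the visual-dimension and voice-dimension environment features listed above (light, time, ambient sound and noise) might be computed; the threshold values and feature names are illustrative assumptions rather than values taken from the patent.

import datetime
import numpy as np

def visual_environment_features(image: np.ndarray) -> dict:
    # image: H x W x 3 array with pixel values in 0..255
    brightness = float(image.mean()) / 255.0            # light characteristic of the environment
    hour = datetime.datetime.now().hour                 # time characteristic of the environment
    return {"brightness": brightness, "is_dark": brightness < 0.3, "hour": hour}

def audio_environment_features(samples: np.ndarray) -> dict:
    # samples: mono audio buffer with float values in [-1, 1]
    rms = float(np.sqrt(np.mean(np.square(samples))))          # overall sound level
    noise_floor = float(np.percentile(np.abs(samples), 10))    # crude noise estimate
    return {"rms": rms, "noise_floor": noise_floor, "is_noisy": rms > 0.2}

frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
audio = np.random.uniform(-0.1, 0.1, size=16000).astype(np.float32)
print(visual_environment_features(frame))
print(audio_environment_features(audio))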
S202: and acquiring interactive information between the user and the intelligent screen equipment, and taking the interactive information as interactive scene information.
In some embodiments, the smart screen device is connected to an external input device, which includes but is not limited to a keyboard, a mouse, a remote controller and the like. The interaction information between the user and the smart screen device may be interaction information directly input by the user via the smart screen device, and/or interaction information input to the smart screen device by the user through the external input device. The interaction intention of the user is thus determined by combining the user characteristics with the interaction information between the user and the smart screen device, and the "modalities" in the multi-modal interaction are enriched from the dimension of the user and the device, so as to assist the subsequent output of an interaction instruction adapted to that interaction information. A sketch of merging these two sources of input appears below.
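For illustration only, the fragment below merges direct screen input with optional external-device input into one list of interaction information; the event structure and source names are assumptions made for this sketch.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InteractionEvent:
    source: str   # "screen" for direct input, or "keyboard"/"mouse"/"remote" for peripherals
    payload: str  # e.g. a touched control id, a key code or a remote-control button

def collect_interaction_info(screen_events: List[InteractionEvent],
                             peripheral_events: Optional[List[InteractionEvent]] = None) -> List[InteractionEvent]:
    # Merge direct screen input with external-device input (present only when a peripheral
    # is connected) into one list that joins the interaction scene information.
    events = list(screen_events)
    if peripheral_events:
        events.extend(peripheral_events)
    return events

print(collect_interaction_info(
    [InteractionEvent("screen", "tap:play_button")],
    [InteractionEvent("remote", "key:volume_up")],
))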
S203: and acquiring control information of the user on the environment state, and taking the control information as interactive scene information.
In some embodiments, the interaction scene information may also be control information of the user on the environment state. In this case, the control information of the user on the environment state may be obtained via a third-party device that has established a communication connection with the smart screen device. The interaction intention of the user is then determined by combining the user characteristics with the control information of the user on the environment state, and the "modalities" in the multi-modal interaction are enriched from the dimension of the user's control over the environment state, so as to assist the subsequent output of an interaction instruction adapted to that control information.
As an example, the third party device may be, for example, an air conditioner in the environment, and when the user adjusts the temperature and humidity of the environment through the air conditioner, the adjustment information of the temperature and humidity of the environment by the user may be used as the control information, which is not limited in this respect.
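For illustration only, the fragment below turns such an adjustment report from a connected third-party device into control information that can join the interaction scene information; the message format and field names are assumptions, not part of the patent.

from dataclasses import dataclass

@dataclass
class EnvironmentControlInfo:
    device: str     # the third-party device that reported the adjustment, e.g. "air_conditioner"
    parameter: str  # the adjusted environment state, e.g. "temperature" or "humidity"
    value: float    # the value the user set

def control_info_from_report(msg: dict) -> EnvironmentControlInfo:
    # Convert a hypothetical report pushed by the third-party device into control information.
    return EnvironmentControlInfo(device=msg["device"], parameter=msg["parameter"], value=float(msg["value"]))

# Example: the user lowered the room temperature to 24 degrees via the air conditioner.
print(control_info_from_report({"device": "air_conditioner", "parameter": "temperature", "value": 24}))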
Through the above, the multi-dimensional interaction environment information of the environment where the user is located is obtained and used as interaction scene information; the interaction information between the user and the smart screen device is obtained and used as interaction scene information; and the control information of the user on the environment state is obtained and used as interaction scene information. Multi-modal collection of interaction environment information is thus achieved and multi-dimensional interaction environment information is obtained, so that the interaction intention can be determined not only with the assistance of the user characteristics but also in combination with the multi-dimensional interaction environment information, which effectively improves the multi-modal intelligent interaction effect of the smart screen device.
S102: The interaction intention of the user is analyzed according to the user characteristics in combination with the interaction scene information.
After the user characteristics of the multiple modalities and the interaction scene information of the user are determined, the interaction intention of the user can be analyzed by adopting a preconfigured analysis algorithm according to the combination of the user characteristics and the interaction scene information.
For example, feature keywords corresponding to the user characteristics and scene keywords corresponding to the interaction scene information may be analyzed, and a corresponding interaction intention is then matched from a preconfigured relationship table according to the feature keywords and the scene keywords and used as the interaction intention of the user. The relationship table may be obtained in advance based on big-data analysis and records sample feature keywords, sample scene keywords and the sample interaction intentions corresponding to them, so that the interaction intention of the user can be analyzed directly from the user characteristics in combination with the interaction scene information, as sketched below.
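A minimal sketch of the keyword-and-relationship-table matching described above; all keywords and intentions in the table are invented examples, and the patent describes the table itself as being built in advance from big-data analysis.

# Relationship table: (feature keyword, scene keyword) -> interaction intention.
RELATION_TABLE = {
    ("wave_gesture", "bright_room"): "pause_playback",
    ("face_close", "quiet_room"): "lower_volume",
    ("wake_word", "noisy_room"): "raise_volume",
}

def analyze_intention(feature_keywords, scene_keywords, table=RELATION_TABLE):
    # Match every user-feature keyword against every scene keyword until an entry is found.
    for f in feature_keywords:
        for s in scene_keywords:
            intent = table.get((f, s))
            if intent is not None:
                return intent
    return None  # no matching intention; other analysis strategies could be tried

print(analyze_intention(["face_close"], ["quiet_room", "evening"]))  # -> lower_volume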
Of course, any other possible manner may be adopted to analyze the interaction intention of the user according to the user characteristics in combination with the interaction scenario information, such as a modeling manner, an engineering manner, and the like, which is not limited thereto.
S103: A target interaction instruction corresponding to the interaction intention is acquired, so that interaction control is performed on the smart screen device by adopting the target interaction instruction.
With respect to the description of the foregoing embodiments, the present application also provides a specific example, in which the multi-modal-based interaction method may be applied to a multi-modal-based interaction apparatus, which may be as shown in fig. 3; fig. 3 is a schematic structural diagram of the multi-modal-based interaction apparatus in the embodiments of the present application. The multi-modal-based interaction apparatus includes a multi-modal cloud 301 and a multi-modal terminal 302, where the multi-modal terminal 302 is arranged in the smart screen device and the multi-modal cloud 301 can be arranged on the cloud server side; that is, the multi-modal cloud 301 is used for determining the interaction intention and for the processing logic that generates the interaction instruction, thereby reducing the hardware resource consumption of the smart screen device and improving the interaction response efficiency of the smart screen device.
The multi-modal terminal 302 of fig. 3 may include a multi-modal analysis module 3021, a multi-modal interaction decision 3022, a multi-modal input 3023 and a multi-modal output 3024. The multi-modal input 3023 captures user-characteristic input, speech input, visual input, external-device input, screen input of the smart screen device, and the like, and transmits these multi-modal inputs to the multi-modal analysis module 3021. The multi-modal analysis module 3021 analyzes the user characteristics and the interaction environment information. The multi-modal interaction decision 3022 matches the interaction intention of the user according to the analysis result of the multi-modal analysis module 3021 and sends the interaction intention to the multi-modal cloud 301. The multi-modal cloud 301 obtains a target interaction instruction corresponding to the interaction intention of the user and sends the target interaction instruction to the multi-modal output 3024, so that the target interaction instruction is used to perform interaction control on the smart screen device.
In conjunction with the above embodiment and fig. 3, the present application may be described in detail as follows:
multimodal input 3023:
the method mainly comprises four input modes of visual input, voice input, peripheral input and touch input. The method is used for acquiring the current visual scene, voice scene, other peripheral input scene and screen touch input scene of the user respectively.
In the visual input scene, the visual image currently faced by the device is input into the visual analysis module; in the voice input scene, the voice audio currently faced by the smart screen device is input into the voice analysis module; the screen-triggered input supports the interaction scene between the user and the smart screen device; and the peripheral input scene is used to enhance the input efficiency of the smart screen device, with peripherals including but not limited to a keyboard, a mouse, a remote controller and the like.
The multi-modal analysis module 3021:
the multi-modal analysis module 3021 is configured to perform multi-dimensional analysis on currently acquired visual information, voice information, and sensor peripheral information, and perform one-time effective user interaction with multi-dimensional comprehensive assistance.
The visual analysis module analyzes the visual input data in multiple dimensions, including light detection and distance detection, and outputs the visual discrimination conditions of the current interaction environment; it also performs face feature recognition, posture analysis, gesture recognition and the like based on the multi-modal recognition technology to form the user characteristics.
The voice analysis module performs multi-dimensional analysis on the voice input data, such as environmental sound detection and noise detection, and outputs the voice judgment conditions of the current interaction environment; it also recognizes the voiceprint features of the user based on the multi-modal recognition technique to form the user characteristics.
The embedded peripherals, similar to screen triggering, are used for responding to input from user-specific application scenarios (fingerprint recognition, game scenarios, etc.) and general usage scenarios.
The multi-modal interaction decision 3022 determines the interaction scene and the user identity from the interaction environment information (light brightness, interaction time, user-device distance, interaction environment sound, noise intensity, etc.) and the user characteristics (face recognition, gesture recognition, body-state recognition, voiceprint recognition, specific wake-up command recognition, etc.) generated by the multi-modal analysis module. According to the analysis result, it converts the visual interaction information, voice interaction information or other forms of input information into a target interaction instruction, obtains the corresponding information results for the user from the interaction response module, and produces the corresponding interaction control effect on the interaction output module.
The multi-modal interaction response module of the multi-modal cloud 301 provides interaction services including, but not limited to, life services, home control, conversational communication, child education, multimedia entertainment, office applications, and the like according to the target interaction instruction provided by the multi-modal interaction decision module, which is not limited thereto.
The multi-modal output 3024 is used for interactively displaying the result of the multi-modal response, and the display form is controlled by the multi-modal decision module. Forms of interaction output include but are not limited to the screen, sound, embedded peripherals and the like. The brightness of the output screen, the volume of the sound, the current output display form and the like are automatically adjusted according to the decision results (user characteristics, interaction environment information, etc.) of the multi-modal decision module.
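To summarize the data flow of fig. 3, here is a toy end-to-end sketch wiring input, analysis, decision, cloud and output together; every class, callable and value below is an illustrative stand-in rather than the patent's code.

class MultimodalPipeline:
    # Mirrors fig. 3: input -> analysis (3021) -> decision (3022) -> cloud (301) -> output (3024).
    def __init__(self, analysis, decision, cloud, output):
        self.analysis = analysis
        self.decision = decision
        self.cloud = cloud
        self.output = output

    def handle(self, raw_inputs: dict) -> None:
        user_features, scene_info = self.analysis(raw_inputs)   # multi-modal analysis
        intention = self.decision(user_features, scene_info)    # interaction decision
        instruction = self.cloud(intention)                     # intention -> target instruction
        self.output(instruction, scene_info)                    # adapt brightness, volume, display form

# Wiring the pipeline with trivial stand-in callables:
pipeline = MultimodalPipeline(
    analysis=lambda raw: ({"gesture": "wave"}, {"brightness": 0.2, "hour": 22}),
    decision=lambda feats, scene: "dim_and_pause" if scene["hour"] >= 22 else "no_op",
    cloud=lambda intent: {"action": intent},
    output=lambda instr, scene: print("apply", instr, "dim screen:", scene["brightness"] < 0.3),
)
pipeline.handle({"video": None, "audio": None})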
In the embodiment, the user characteristics are identified based on the multi-mode interaction technology, the interaction scene information of the user is acquired based on the multi-mode interaction technology, the interaction intention of the user is analyzed according to the user characteristics in combination with the interaction scene information, and the target interaction instruction corresponding to the interaction intention is acquired, so that the interaction control is performed on the intelligent screen device by adopting the target interaction instruction, the interaction intention of the user can be determined in combination with the user characteristics and the interaction environment information of the user, and the multi-mode intelligent interaction effect of the intelligent screen device is effectively improved.
Fig. 4 is a schematic diagram according to a fourth embodiment of the present application.
The multimodal based interaction apparatus 400 is applied to a smart screen device.
As shown in fig. 4, the multi-modality based interaction apparatus 400 includes:
the recognition module 401 is configured to recognize user features based on a multi-modal interaction technology;
a first obtaining module 402, configured to obtain interaction scene information of a user based on a multi-modal interaction technology;
an analysis module 403, configured to analyze an interaction intention of the user according to the user characteristics in combination with the interaction scenario information;
and a second obtaining module 404, configured to obtain a target interaction instruction corresponding to the interaction intention, so as to perform interaction control on the smart screen device by using the target interaction instruction.
In an embodiment of the present application, referring to fig. 5, where fig. 5 is a schematic diagram according to a fifth embodiment of the present application, the first obtaining module 402 includes:
the first obtaining sub-module 4021 is configured to obtain multi-dimensional interaction environment information of an environment where a user is located, and use the multi-dimensional interaction environment information as interaction scene information; and/or the presence of a gas in the gas,
the second obtaining sub-module 4022 is configured to obtain interaction information between the user and the smart screen device, and use the interaction information as interaction scene information; and/or the presence of a gas in the gas,
the third obtaining sub-module 4023 is configured to obtain control information of the user on the environment state, and use the control information as interaction scene information.
In an embodiment of the present application, the first obtaining sub-module 4021 is specifically configured to:
carrying out image recognition on the environment where the user is located so as to acquire interactive environment information of visual dimensions of the environment according to the recognized image; and/or,
and performing voice recognition on the environment where the user is located so as to acquire the interaction environment information of the voice dimension of the environment according to the recognized voice.
In an embodiment of the present application, the first obtaining sub-module 4021 is further configured to:
and analyzing the identified image to obtain the environmental characteristics of the environment, and taking the environmental characteristics as the interactive environment information of the visual dimension.
In one embodiment of the present application, the environmental characteristic is at least one of: light characteristics of the environment, a distance between a calibration object within the environment and the user, and time characteristics of the environment.
In an embodiment of the application, the first obtaining sub-module 4021 is further configured to:
and analyzing the recognized voice to obtain audio features in the environment, and taking the audio features as the interactive environment information of the voice dimension.
In one embodiment of the application, the audio characteristic is at least one of: sound characteristics in the environment and noise characteristics in the environment.
In an embodiment of the present application, the smart screen device is connected to an external input device, where the second obtaining sub-module 4022 is specifically configured to:
acquiring interactive information directly input by the user through the smart screen device; and/or,
and acquiring interactive information input to the intelligent screen device by a user through an external input device.
In an embodiment of the present application, the third obtaining sub-module 4023 is specifically configured to:
and acquiring control information of the user on the environment state by adopting third-party equipment, wherein the third-party equipment establishes communication connection with the intelligent screen equipment.
In an embodiment of the present application, the identifying module 401 is specifically configured to:
the method comprises the steps of obtaining face features, posture features and gesture features of a user based on a multi-modal interaction technology, and taking the face features, the posture features and the gesture features as user features.
It should be noted that the foregoing explanation of the multi-modal based interaction method is also applicable to the multi-modal based interaction apparatus of the present embodiment, and is not repeated herein.
In the embodiment, the user characteristics are identified based on the multi-mode interaction technology, the interaction scene information of the user is acquired based on the multi-mode interaction technology, the interaction intention of the user is analyzed according to the user characteristics in combination with the interaction scene information, and the target interaction instruction corresponding to the interaction intention is acquired, so that the interaction control is performed on the intelligent screen device by adopting the target interaction instruction, the interaction intention of the user can be determined in combination with the user characteristics and the interaction environment information of the user, and the multi-mode intelligent interaction effect of the intelligent screen device is effectively improved.
According to an embodiment of the present application, the present application also provides a smart screen device and a readable storage medium.
Fig. 6 is a block diagram of a smart screen device for implementing the multi-modal-based interaction method according to an embodiment of the present application. Smart screen devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes and other suitable computers. Smart screen devices may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the smart screen device includes: one or more processors 601, a memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the smart screen device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple smart screen devices may be connected, with each device providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the multimodal based interaction method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the multimodal-based interaction method provided herein.
The memory 602, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the recognition module 401, the first obtaining module 402, the analysis module 403, and the second obtaining module 404 shown in fig. 4) corresponding to the multi-modal based interaction method in the embodiments of the present application. The processor 601 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 602, that is, implements the multi-modal based interaction method in the above method embodiments.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a smart screen device that performs a multimodal-based interaction method, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, and these remote memories may be connected over a network to a smart screen device that performs a multimodal based interaction method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The smart screen device performing the multimodal-based interaction method may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of a smart screen apparatus performing the multi-modal based interaction method, such as an input device of a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, etc. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (22)

1. A multi-mode-based interaction method is applied to a smart screen device, and comprises the following steps:
recognizing user characteristics based on a multi-modal interaction technology, and acquiring interaction scene information of a user based on the multi-modal interaction technology;
analyzing the interaction intention of the user according to the user characteristics and the interaction scene information;
and acquiring a target interaction instruction corresponding to the interaction intention, and thus, performing interaction control on the intelligent screen equipment by adopting the target interaction instruction.
2. The multi-modal-based interaction method of claim 1, wherein the obtaining interaction scenario information of a user based on the multi-modal interaction technology comprises:
acquiring interactive environment information of multiple dimensions of the environment where the user is located, and taking the interactive environment information of the multiple dimensions as the interactive scene information; and/or,
acquiring interaction information between the user and the intelligent screen equipment, and taking the interaction information as the interaction scene information; and/or,
and acquiring control information of the user on the environment state, and taking the control information as the interactive scene information.
3. The multimodal-based interaction method of claim 2, wherein the obtaining interaction environment information for multiple dimensions of the environment in which the user is located comprises:
carrying out image recognition on the environment where the user is located so as to obtain interactive environment information of visual dimensions of the environment according to the recognized image; and/or,
and performing voice recognition on the environment where the user is located so as to acquire the interaction environment information of the voice dimension of the environment according to the recognized voice.
4. The multimodal-based interaction method of claim 3, wherein the obtaining interaction environment information in a visual dimension of the environment from the identified images comprises:
and analyzing the identified image to obtain the environmental characteristics of the environment, and taking the environmental characteristics as the interactive environment information of the visual dimension.
5. The multimodal-based interaction method of claim 4, wherein the environmental characteristic is at least one of: light characteristics of the environment, a distance between a calibration object within the environment and the user, and time characteristics of the environment.
6. The multimodal interaction method according to claim 3, wherein the obtaining interaction environment information of the speech dimension of the environment from the recognized speech comprises:
and analyzing the recognized voice to obtain audio features in the environment, and taking the audio features as the interactive environment information of the voice dimension.
7. The multimodal interaction method according to claim 6, wherein the audio features are at least one of: sound characteristics in the environment and noise characteristics in the environment.
8. The multi-modality based interaction method of claim 2, wherein the smart screen device is connected with an external input device, and wherein the obtaining interaction information between the user and the smart screen device comprises:
acquiring interactive information directly input by the user through the intelligent screen equipment; and/or,
and acquiring interactive information input to the intelligent screen device by the user through the external input device.
9. The multimodal-based interaction method of claim 2, wherein the obtaining control information of the user over the environmental state comprises:
and acquiring control information of the user on the environment state by adopting third-party equipment, wherein the third-party equipment establishes communication connection with the intelligent screen equipment.
10. The multimodal-based interaction method of claim 2, wherein the multimodal-based interaction technique identifies user features, comprising:
and acquiring the face features, the posture features and the gesture features of the user based on the multi-modal interaction technology, and taking the face features, the posture features and the gesture features as the user features.
11. A multi-mode-based interaction apparatus, applied to a smart screen device, the apparatus comprising:
the recognition module is used for recognizing the user characteristics based on a multi-modal interaction technology;
the first acquisition module is used for acquiring interaction scene information of a user based on the multi-modal interaction technology;
the analysis module is used for analyzing the interaction intention of the user according to the user characteristics and the interaction scene information;
and the second acquisition module is used for acquiring a target interaction instruction corresponding to the interaction intention so as to carry out interaction control on the intelligent screen equipment by adopting the target interaction instruction.
12. The multimodal-based interaction apparatus of claim 11, wherein the first retrieving module comprises:
the first obtaining submodule is used for obtaining interactive environment information of multiple dimensions of the environment where the user is located and taking the interactive environment information of the multiple dimensions as the interactive scene information; and/or,
the second obtaining submodule is used for obtaining interaction information between the user and the intelligent screen device and taking the interaction information as the interaction scene information; and/or,
and the third obtaining submodule is used for obtaining the control information of the user on the environment state and taking the control information as the interactive scene information.
13. The multimodal-based interaction apparatus of claim 12, wherein the first retrieving submodule is specifically configured to:
carrying out image recognition on the environment where the user is located so as to obtain interactive environment information of visual dimensions of the environment according to the recognized image; and/or,
and performing voice recognition on the environment where the user is located so as to acquire the interaction environment information of the voice dimension of the environment according to the recognized voice.
14. The multimodal-based interaction apparatus of claim 13, wherein the first retrieving submodule is further configured to:
and analyzing the identified image to obtain the environmental characteristics of the environment, and taking the environmental characteristics as the interactive environment information of the visual dimension.
15. The multimodal-based interaction apparatus of claim 14, wherein the environmental characteristic is at least one of: light characteristics of the environment, a distance between a calibration object within the environment and the user, and time characteristics of the environment.
16. The multimodal-based interaction apparatus of claim 13, wherein the first retrieving submodule is further configured to:
and analyzing the recognized voice to obtain audio features in the environment, and taking the audio features as the interactive environment information of the voice dimension.
17. The multimodal-based interaction apparatus of claim 16, wherein the audio features are at least one of: sound characteristics in the environment and noise characteristics in the environment.
18. The multi-modality-based interaction apparatus of claim 12, wherein the smart screen device is connected to an external input device, and wherein the second retrieving sub-module is specifically configured to:
acquiring interactive information directly input by the user through the intelligent screen equipment; and/or,
and acquiring interactive information input to the intelligent screen device by the user through the external input device.
19. The multimodal-based interaction apparatus of claim 12, wherein the third obtaining submodule is specifically configured to:
acquire, by using a third-party device, control information of the user on the environment state, wherein the third-party device establishes a communication connection with the smart screen device.
20. The multimodal-based interaction apparatus of claim 12, wherein the recognition module is specifically configured to:
acquire face features, posture features and gesture features of the user based on a multi-modal interaction technology, and take the face features, the posture features and the gesture features as the user features.
21. A smart screen device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-10.
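
The following is a minimal sketch, in Python with NumPy, of how the multi-dimensional environment acquisition of claims 12-17 could be structured: a camera frame yields visual-dimension features (a light characteristic, a distance estimate for a detected face, a time characteristic), a microphone buffer yields voice-dimension features (overall sound level and background noise level), and the results are bundled as interaction scene information. Every class, function and parameter name below is hypothetical, and the simple heuristics only stand in for whatever the smart screen device's camera and audio front end actually provide.

```python
# Hypothetical sketch of the first obtaining submodule (claims 12-17):
# visual-dimension features from a camera frame, voice-dimension features
# from a microphone buffer, bundled as interaction scene information.
from dataclasses import dataclass
from datetime import datetime

import numpy as np


@dataclass
class InteractionSceneInfo:
    light_level: float       # visual dimension: mean brightness, 0..1
    user_distance_m: float   # visual dimension: estimated user distance in metres
    time_of_day: str         # visual dimension: "day" or "night"
    sound_level_db: float    # voice dimension: overall level, dBFS
    noise_level_db: float    # voice dimension: background noise level, dBFS


def visual_dimension(frame: np.ndarray, face_height_px: float,
                     focal_px: float = 600.0, face_height_m: float = 0.22):
    """Derive light, distance and time characteristics from one greyscale frame."""
    light_level = float(frame.mean()) / 255.0
    # Pinhole-camera estimate: distance ~ focal length * real size / pixel size.
    user_distance_m = focal_px * face_height_m / max(face_height_px, 1e-6)
    time_of_day = "day" if 7 <= datetime.now().hour < 20 else "night"
    return light_level, user_distance_m, time_of_day


def voice_dimension(samples: np.ndarray, window: int = 160):
    """Derive overall sound level and background noise level from raw samples."""
    x = samples.astype(np.float64)
    sound_db = 20.0 * float(np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12))
    n = (len(x) // window) * window
    if n == 0:
        return sound_db, sound_db
    win_rms = np.sqrt(np.mean(x[:n].reshape(-1, window) ** 2, axis=1)) + 1e-12
    # Treat the quietest tenth of the short windows as background noise.
    noise_db = 20.0 * float(np.log10(np.percentile(win_rms, 10)))
    return sound_db, noise_db


def acquire_scene_info(frame, face_height_px, samples) -> InteractionSceneInfo:
    light, dist, tod = visual_dimension(frame, face_height_px)
    sound_db, noise_db = voice_dimension(samples)
    return InteractionSceneInfo(light, dist, tod, sound_db, noise_db)
```

For example, `acquire_scene_info(gray_frame, detected_face_height_px, mic_buffer)` would produce a scene record whose fields correspond to the characteristics listed in claims 15 and 17.

Along the same lines, a minimal sketch of how the analysis module and second acquisition module of claim 11 might combine the user features of claim 20 with the scene information above to pick a target interaction instruction. It reuses `InteractionSceneInfo` from the previous sketch; the rule set and instruction names are invented purely for illustration, since the claims do not prescribe any particular mapping.

```python
# Hypothetical sketch of the analysis module and second acquisition module
# (claims 11 and 20): user features + scene information -> target instruction.
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserFeatures:
    face_id: str             # face features (claim 20)
    gesture: Optional[str]   # gesture features, e.g. "palm_up"
    facing_screen: bool      # posture features


def analyse_intention(user: UserFeatures, scene: InteractionSceneInfo) -> str:
    """Map multi-modal inputs to the name of a target interaction instruction."""
    if user.gesture == "palm_up" and user.facing_screen:
        return "pause_playback"            # explicit gesture intent, no speech needed
    if scene.time_of_day == "night" and scene.noise_level_db < -60.0:
        return "reply_with_lower_volume"   # quiet night-time environment
    if scene.user_distance_m > 2.5 or scene.light_level < 0.15:
        return "enlarge_ui_and_speak"      # far away or dark room: favour voice output
    return "default_voice_reply"


# The returned name would then be looked up in an instruction table and
# dispatched to the smart screen device's control layer.
```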
CN202010605417.6A 2020-06-29 2020-06-29 Multi-mode-based interaction method and device, storage medium and smart screen device Pending CN111966212A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010605417.6A CN111966212A (en) 2020-06-29 2020-06-29 Multi-mode-based interaction method and device, storage medium and smart screen device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010605417.6A CN111966212A (en) 2020-06-29 2020-06-29 Multi-mode-based interaction method and device, storage medium and smart screen device

Publications (1)

Publication Number Publication Date
CN111966212A 2020-11-20

Family

ID=73360783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010605417.6A Pending CN111966212A (en) 2020-06-29 2020-06-29 Multi-mode-based interaction method and device, storage medium and smart screen device

Country Status (1)

Country Link
CN (1) CN111966212A (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100217604A1 (en) * 2009-02-20 2010-08-26 Voicebox Technologies, Inc. System and method for processing multi-modal device interactions in a natural language voice services environment
US20130241834A1 (en) * 2010-11-16 2013-09-19 Hewlett-Packard Development Company, L.P. System and method for using information from intuitive multimodal interactions for media tagging
US20160378861A1 (en) * 2012-09-28 2016-12-29 Sri International Real-time human-machine collaboration using big data driven augmented reality technologies
CN106845624A (en) * 2016-12-16 2017-06-13 北京光年无限科技有限公司 The multi-modal exchange method relevant with the application program of intelligent robot and system
CN107340865A (en) * 2017-06-29 2017-11-10 北京光年无限科技有限公司 Multi-modal virtual robot exchange method and system
CN107728780A (en) * 2017-09-18 2018-02-23 北京光年无限科技有限公司 A kind of man-machine interaction method and device based on virtual robot
CN107783650A (en) * 2017-09-18 2018-03-09 北京光年无限科技有限公司 A kind of man-machine interaction method and device based on virtual robot
CN107635147A (en) * 2017-09-30 2018-01-26 上海交通大学 Health information management TV based on multi-modal man-machine interaction
CN107765852A (en) * 2017-10-11 2018-03-06 北京光年无限科技有限公司 Multi-modal interaction processing method and system based on visual human
CN107797663A (en) * 2017-10-26 2018-03-13 北京光年无限科技有限公司 Multi-modal interaction processing method and system based on visual human
CN108595012A (en) * 2018-05-10 2018-09-28 北京光年无限科技有限公司 Visual interactive method and system based on visual human
CN108681398A (en) * 2018-05-10 2018-10-19 北京光年无限科技有限公司 Visual interactive method and system based on visual human
CN110825164A (en) * 2019-09-19 2020-02-21 北京光年无限科技有限公司 Interaction method and system based on wearable intelligent equipment special for children
CN110727346A (en) * 2019-09-24 2020-01-24 中国第一汽车股份有限公司 Man-machine interaction method and device, vehicle and storage medium
CN111103982A (en) * 2019-12-26 2020-05-05 上海纸上绝知智能科技有限公司 Data processing method, device and system based on somatosensory interaction

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113133829A (en) * 2021-04-01 2021-07-20 上海复拓知达医疗科技有限公司 Surgical navigation system, method, electronic device and readable storage medium
CN113133829B (en) * 2021-04-01 2022-11-01 上海复拓知达医疗科技有限公司 Surgical navigation system, method, electronic device and readable storage medium
CN115476366A (en) * 2021-06-15 2022-12-16 北京小米移动软件有限公司 Control method, device, control equipment and storage medium for foot type robot
CN115476366B (en) * 2021-06-15 2024-01-09 北京小米移动软件有限公司 Control method, device, control equipment and storage medium for foot robot
CN114290333A (en) * 2021-12-29 2022-04-08 中国电信股份有限公司 Ubiquitous robot system, construction method, construction device, equipment and medium
CN114290333B (en) * 2021-12-29 2024-02-27 中国电信股份有限公司 Ubiquitous robot system, construction method and device, equipment and medium
CN114398175A (en) * 2021-12-30 2022-04-26 中国电信股份有限公司 Intelligent interaction system and method, electronic device and storage medium
WO2023124393A1 (en) * 2021-12-30 2023-07-06 中国电信股份有限公司 Intelligent interaction system and method, and electronic device and storage medium
CN116540873A (en) * 2023-05-04 2023-08-04 北京果枝众合科技有限公司 Multi-mode interaction realization method, device and system and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN111966212A (en) Multi-mode-based interaction method and device, storage medium and smart screen device
CN110991427B (en) Emotion recognition method and device for video and computer equipment
CN111221984A (en) Multimodal content processing method, device, equipment and storage medium
CN105122353A (en) Natural human-computer interaction for virtual personal assistant systems
CN112667068A (en) Virtual character driving method, device, equipment and storage medium
JP6986187B2 (en) Person identification methods, devices, electronic devices, storage media, and programs
CN112434139A (en) Information interaction method and device, electronic equipment and storage medium
CN111968631B (en) Interaction method, device, equipment and storage medium of intelligent equipment
CN111144108A (en) Emotion tendency analysis model modeling method and device and electronic equipment
CN111640426A (en) Method and apparatus for outputting information
CN111709362B (en) Method, device, equipment and storage medium for determining important learning content
JP2022020574A (en) Information processing method and apparatus in user dialogue, electronic device, and storage media
US20210072818A1 (en) Interaction method, device, system, electronic device and storage medium
JP7267379B2 (en) Image processing method, pre-trained model training method, device and electronic equipment
CN112382294B (en) Speech recognition method, device, electronic equipment and storage medium
CN112382291B (en) Voice interaction processing method and device, electronic equipment and storage medium
WO2016206647A1 (en) System for controlling machine apparatus to generate action
CN111918073B (en) Live broadcast room management method and device
CN111243585B (en) Control method, device and equipment under multi-user scene and storage medium
CN112382279A (en) Voice recognition method and device, electronic equipment and storage medium
CN111783600A (en) Face recognition model training method, device, equipment and medium
CN112581981B (en) Man-machine interaction method, device, computer equipment and storage medium
CN110633357A (en) Voice interaction method, device, equipment and medium
CN112509569B (en) Voice data processing method and device, electronic equipment and storage medium
CN112527105B (en) Man-machine interaction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210429

Address after: 100085 3rd Floor, Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing

Applicant after: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

Applicant after: Shanghai Xiaodu Technology Co.,Ltd.

Address before: 100085 3rd Floor, Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.