WO2021212388A1 - Interactive communication implementation method and device, and storage medium - Google Patents
Interactive communication implementation method and device, and storage medium
- Publication number
- WO2021212388A1 (PCT/CN2020/086222)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- interactive
- interaction
- interactive object
- candidate
- state
- Prior art date
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J11/00—Manipulators not otherwise provided for
- B25J11/0005—Manipulators having means for high-level communication with users, e.g. speech generator, face recognition means
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Definitions
- The present invention relates to the technical field of human-computer interaction, and in particular to a method, device, and storage medium for realizing interactive communication.
- Trigger operations such as speaking a "wake-up word" or performing a touch input are currently the main ways to trigger a robot or smart device to begin human-computer interaction.
- The problem with using the above methods in a multi-person scenario is that, to switch to a new interactive object midway while the robot or smart device is in the awakened state, each person participating in the interaction must perform the above trigger operation, so every user must understand and master the trigger operations of different robots or smart devices.
- Such an interaction process is not only mechanical but also disrupts the rhythm of switching between multiple participants; in multi-user interaction scenarios, the device cannot communicate with multiple users in real time, intelligently, and effectively.
- The purpose of the present invention is to provide a method, device, and storage medium for realizing interactive communication, so that interactive objects can be switched naturally, flexibly, and intelligently in multi-user interaction scenarios, enabling timely, efficient, and humanized interactive communication with multiple objects.
- To this end, the present invention provides a method for realizing interactive communication, which includes the following steps:
- when the current interactive object stops interacting in the awakened state, a candidate object participating in the interaction is determined as the new interactive object by collecting image data and voice signals;
- when the current interactive object does not stop interacting, detection continues while the required service type of the current interactive object is responded to;
- when the duration for which there is no interactive object reaches the first preset duration, the device enters the dormant state;
- if a wake-up signal is received, the device switches from the dormant state to the awakened state, and the target object that triggered the awakening is determined as the current interactive object;
- determining a candidate object participating in the interaction as the new interactive object by collecting image data and voice signals includes the following steps:
- when there are at least two candidate objects, one candidate object is determined as the new interactive object according to the image recognition result and/or the sound source localization result.
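The claimed control flow above can be sketched as a small state machine. Everything below is illustrative: the class, method, and field names are not taken from the patent, and the sensing itself (image recognition, sound source localization) is assumed to happen elsewhere and feed in events.

```python
from enum import Enum, auto

class State(Enum):
    DORMANT = auto()
    AWAKE = auto()

class InteractionManager:
    """Minimal sketch of the claimed control flow (names are assumptions)."""

    def __init__(self):
        self.state = State.DORMANT
        self.current_object = None

    def on_wake_signal(self, target):
        # The target object that triggered the awakening becomes
        # the current interactive object.
        self.state = State.AWAKE
        self.current_object = target

    def on_interaction_stopped(self, candidates):
        # candidates: list of (start_time, name) pairs found via image
        # recognition and/or sound source localization. One candidate is
        # chosen as the new interactive object; with several candidates,
        # the earliest participant wins.
        if self.state is not State.AWAKE or not candidates:
            return
        self.current_object = min(candidates)[1]

    def on_idle_timeout(self):
        # No interactive object for the first preset duration: sleep.
        self.state = State.DORMANT
        self.current_object = None
```

A wake-up event makes the triggering object current; once awake, switching needs no further wake-up words, only the sensed candidate events.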
- The present invention also provides an interactive communication realization device, including:
- an image collection module, used to collect face images;
- an audio collection module, used to collect voice signals;
- a detection module, used to detect whether the current interactive object stops interacting;
- a processing module, configured to determine a candidate object participating in the interaction as the new interactive object by collecting image data and voice signals when the current interactive object stops interacting in the awakened state;
- an execution module, configured to respond to the required service type of the current interactive object while continuing detection when the current interactive object does not stop interacting in the awakened state;
- the processing module is also configured to enter the dormant state when the duration for which there is no interactive object reaches the first preset duration in the awakened state;
- the detection module is also used to determine whether a wake-up signal is received when the device is in the dormant state;
- the processing module is further configured to switch from the dormant state to the awakened state if a wake-up signal is received, and to determine the target object that triggered the awakening as the current interactive object.
- The processing module includes:
- a searching unit, which searches for candidate objects participating in the interaction through image recognition and/or sound source localization when the duration for which the current interactive object has stopped interacting reaches the second preset duration;
- an object switching unit, configured to determine the candidate object as the new interactive object if there is one candidate object, and to determine one candidate object as the new interactive object according to the image recognition result and/or the sound source localization result if there are at least two candidate objects.
- the present invention also provides a storage medium in which at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the operations performed by the interactive communication implementation method.
- With the present invention, interactive objects can be switched naturally, flexibly, and intelligently in multi-user interaction scenarios, achieving the goal of timely, efficient, and humanized interactive communication with multiple objects.
- FIG. 1 is a flowchart of an embodiment of a method for implementing interactive communication of the present invention
- FIG. 2 is a flowchart of another embodiment of a method for implementing interactive communication of the present invention.
- FIG. 3 is a flowchart of another embodiment of a method for implementing interactive communication of the present invention.
- FIG. 4 is a flowchart of another embodiment of a method for implementing interactive communication of the present invention.
- FIG. 5 is a flowchart of another embodiment of a method for implementing interactive communication of the present invention.
- FIG. 6 is a schematic diagram of the interaction of the emotional companion robot (Robot) of the present invention in a multi-user interaction scenario.
- FIG. 7 is a schematic diagram of the human-computer interaction process when the robot of the present invention faces multiple people.
- FIG. 8 is a schematic structural diagram of an embodiment of an interactive communication realization device of the present invention.
- In the present invention, the terminal that implements object switching includes, but is not limited to, personal virtual assistants and robots such as housework robots (e.g., sweeping robots), children's educational robots, elderly care robots, emotional companion robots, airport service robots, and shopping service robots. It also includes smart devices such as smartphones, smart speakers, and smart voice elevators, which are usually used in public places such as shopping malls, subway stations, and railway stations.
- a method for implementing interactive communication includes:
- The robot or smart device can collect image data (including but not limited to face images and gesture images) within its field of view through an image collection module such as a camera or camera array, and can obtain input voice signals within the effective acquisition range through an audio collection module such as a microphone or microphone array.
- the types of interaction between the robot or smart device and the current interactive object include, but are not limited to, voice dialogue interaction and gesture dialogue interaction.
- The robot or smart device can judge whether the current interactive object has input a voice signal according to the collected image data and/or voice signal, and can likewise determine whether the current interactive object has input a gesture based on the image data.
- Since the processor of the robot or smart device executes the tasks it receives, it can also inspect its own processes to determine whether there is a voice interaction task obtained by voice recognition or a gesture interaction task obtained by image recognition, and judge from this whether the current interactive object has stopped interacting.
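The task-queue check described above might look like the following sketch; the parameter names and the timeout value are illustrative assumptions, not details from the patent.

```python
def interaction_stopped(pending_tasks, last_task_time, now, stop_timeout=3.0):
    """Judge whether the current object has stopped interacting.

    pending_tasks: task kinds currently in the device's own process queue,
    e.g. "voice" (from voice recognition) or "gesture" (from image
    recognition). If no interaction task is pending and none has arrived
    for `stop_timeout` seconds (an assumed value), the current interactive
    object is judged to have stopped interacting.
    """
    if any(t in ("voice", "gesture") for t in pending_tasks):
        return False
    return now - last_task_time >= stop_timeout
```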
- the microphone array in the embodiment of the present invention may be an array formed by a group of acoustic sensors located at different positions in space and regularly arranged according to a certain shape, and is a device for spatially sampling voice signals propagating in space.
- the voice signal processing method of the embodiment of the present invention does not specifically limit the specific form of the microphone array used.
- the camera array in the embodiment of the present invention may be an array in which a group of image sensors located at different positions in space are regularly arranged according to a certain shape to collect image data from multiple viewing angles.
- The microphone array or camera array may be a horizontal array, a T-shaped array, an L-shaped array, a polyhedral array, a spherical array, and so on.
- a candidate object participating in the interaction is determined as a new interactive object by collecting image data and voice signals.
- The robot or smart device can determine, based on image data and/or voice signals, whether the currently tracked interactive object (which may be a person, another smart device, or another robot) has stopped interacting with it while in the awakened state. If the current interactive object has stopped interacting, the robot or smart device collects face images and voice signals to replace it with one of the candidate objects participating in the interaction (candidate objects include other people, other smart devices, or other robots) as the new current interactive object.
- For example, suppose robot A is the detection subject and user A is the current interactive object. If robot A detects, by collecting image data and/or voice signals, that user B is participating in the interaction, it determines user B as the new interactive object according to that image data and voice signal.
- a method for implementing interactive communication includes:
- a candidate object participating in the interaction is determined as a new interactive object by collecting image data and voice signals;
- When the robot or smart device is in the awakened state and detects that the current interactive object has not stopped interacting, it continues to detect in real time whether the current interactive object has stopped interacting. At the same time, during detection it performs voice recognition (or gesture recognition) on the voice signal (or gesture signal) obtained from the current interactive object to determine the required service type, and performs the corresponding operation to respond to the current interactive object. Performing voice recognition (or gesture recognition) on a voice signal (or gesture signal) to obtain the required service type is existing technology and will not be repeated here.
- For example, a robot or smart device is the detection subject and user A is the current interactive object. If the robot or smart device obtains the result "play nursery rhymes" by performing voice recognition on the voice signal input by user A, it will query its music library and play nursery rhymes.
- TTS: Text To Speech. Some TTS-enabled devices provide the TTS function and no other services.
- a method for implementing interactive communication includes:
- S310: when in the dormant state, judge whether a wake-up signal is received.
- The wake-up mechanism includes, but is not limited to, triggering the wake-up signal by speaking a wake-up word; a mechanical button or touch button can also be preset on the robot or smart device so that a wake-up signal is generated by touch or press; a wake-up signal can also be generated after receiving an input gesture that matches a preset wake-up gesture. Other ways in which the wake-up mechanism generates the wake-up signal also fall within the protection scope of the present invention.
- S320: if a wake-up signal is received, switch from the dormant state to the awakened state, and determine the target object that triggered the awakening as the current interactive object.
- When the robot or smart device receives a wake-up signal in the dormant state, it automatically switches from the dormant state to the awakened state and determines the target object that triggered the wake-up as the initial current interactive object for this awakened period. The target object can be a person with normal language ability, or a person who uses a TTS device to send out voice signals.
- a candidate object participating in the interaction is determined as a new interactive object by collecting image data and voice signals;
- a method for implementing interactive communication includes:
- S410: detect whether the current interactive object stops interacting.
- a candidate object participating in the interaction is determined as a new interactive object by collecting image data and voice signals;
- S440: enter the dormant state when the duration for which there is no interactive object reaches the first preset duration in the awakened state.
- When the robot or smart device is in the awakened state, if the current interactive object stops interacting with it and no new interactive object is detected for the first preset duration, this indicates that during that period no interactive object has interacted with the robot or smart device.
- Similarly, when there is no interactive object within the effective acquisition range of the audio and image collection modules of the robot or smart device and the duration reaches the first preset duration, this also indicates that during that period no interactive object has interacted with the robot or smart device.
- In this case, the robot or smart device automatically enters the dormant state, which prevents it from remaining awake for a long time, saves power consumption, and increases its standby time.
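The first-preset-duration timeout described above can be sketched as a small idle timer. The 30-second default and the injectable clock are assumptions made for testability, not values from the patent.

```python
import time

class IdleTimer:
    """Tracks how long no interactive object has been observed; once the
    elapsed time reaches the first preset duration, the device should
    enter the dormant state to save power."""

    def __init__(self, first_preset=30.0, clock=time.monotonic):
        self.first_preset = first_preset  # assumed default of 30 s
        self.clock = clock
        self.last_seen = clock()

    def object_detected(self):
        # Any detected interaction resets the idle clock.
        self.last_seen = self.clock()

    def should_sleep(self):
        return self.clock() - self.last_seen >= self.first_preset
```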
- S450: when in the dormant state, judge whether a wake-up signal is received.
- Once the robot or smart device switches from the dormant state to the awakened state, there is no need during the subsequent awake period to frequently voice-input wake-up words, as in the prior art, to switch to a new interactive object midway, and users are not forced to understand and master the trigger operations of different robots or smart devices. New interactive objects are switched in real time and intelligently in multi-user interaction scenarios based only on the collected image data and voice signals. This is not only more in line with daily communication habits, but also helps achieve effective communication and increases the personification of human-machine communication, thereby achieving effective interactive communication between the robot or smart device and multiple objects.
- a method for implementing interactive communication includes:
- S510: detect whether the current interactive object stops interacting.
- the second preset duration is less than the first preset duration.
- When the robot or smart device meets the trigger condition for searching for and switching to a new interactive object, only one candidate object is determined as the newly found interactive object after each search.
- the robot or smart device can be responsible for sound collection through the audio collection module to realize the auditory function of the robot or smart device.
- After the voice signal is collected, it is processed by framing and windowing, and audio processing is used to determine the number of sound sources; the number of candidate objects is then determined from the number of sound sources. Sound source localization is prior art and will not be repeated here. If the number of candidate objects determined this way is one, that candidate object is directly determined as the new interactive object. If the number of candidate objects is at least two, the candidate user corresponding to the earliest acquired voice signal is determined as the new interactive object for this switch, according to the time sequence of the acquired voice signals.
- For example, the robot or smart device collects voice signals in real time through the audio collection module, obtains the number of sound sources through sound source localization, and determines the candidate user with the earliest voice signal as the new interactive object for this switch.
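The framing-and-windowing preprocessing mentioned above is standard audio practice. A minimal sketch follows; the 400-sample frame and 160-sample hop (25 ms / 10 ms at 16 kHz) and the Hann window are typical choices, not values from the patent.

```python
import math

def frame_signal(samples, frame_len=400, hop=160):
    """Split a voice signal into overlapping frames and apply a Hann
    window, the usual preprocessing before estimating the number and
    position of sound sources."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n]
                       for n in range(frame_len)])
    return frames
```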
- the robot or smart device can also be responsible for the collection of image data through the image acquisition module to realize the vision function of the robot or smart device.
- Alternatively, the number of candidate objects can be determined from the image recognition result. If the number of candidate objects is one, that candidate object is directly determined as the new interactive object. If the number of candidate objects is at least two, the candidate user who participated in the interaction earliest, according to the time sequence of each candidate object's participation obtained by image recognition, is determined as the new interactive object for this switch.
- For example, the robot captures image data in real time through the image collection module and performs face recognition on it; the candidate user A who performed the mouth-opening action first is determined as the new interactive object for this switch.
- The robot or smart device can also collect image data through the image collection module while the audio collection module collects sound.
- In this case, image recognition and sound source localization can be combined to determine the number of candidate objects. If the number of candidate objects is one, that candidate object is directly determined as the new interactive object. If the number of candidate objects is at least two, the mouth-opening actions and voice signals of the candidate objects are analyzed comprehensively according to the image recognition result and/or the sound source localization result to find, among the candidates participating in the interaction, the candidate user who participated earliest, and that candidate user is determined as the new interactive object for this switch.
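The combined image-and-audio selection rule can be sketched as a simple fusion of timestamps from the two modalities. This particular rule (earliest observation across either modality wins) is one illustrative reading of the text, not the patent's definitive algorithm, and the dictionary field names are assumptions.

```python
def fuse_candidates(mouth_events, voice_events):
    """mouth_events / voice_events: {candidate_id: first_observed_time}
    from image recognition (mouth-opening actions) and sound source
    localization respectively. A candidate's interaction start is taken
    as the earliest time either modality observed them; the earliest
    starter becomes the new interactive object. Returns None when no
    candidate was observed at all."""
    starts = {}
    for events in (mouth_events, voice_events):
        for cid, t in events.items():
            starts[cid] = min(t, starts.get(cid, float("inf")))
    if not starts:
        return None
    return min(starts, key=starts.get)
```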
- S560: enter the dormant state when the duration for which there is no interactive object reaches the first preset duration in the awakened state.
- S570: when in the dormant state, judge whether a wake-up signal is received.
- The present invention preferably uses both image data and voice signals as judgment factors to detect candidate objects and determine one of them as the new interactive object, so as to avoid mistakenly identifying as new interactive objects candidates who emit meaningless voice signals within the effective collection range of the audio and image collection modules (such as babies) or users who have no intention of interacting.
- Moreover, combining image recognition and sound source localization achieves precise localization of the direction and position of candidate objects and improves the accuracy of searching for and determining new interactive objects.
- The robot or smart device automatically switches to a new interactive object to continue the interaction while awake, which improves the efficiency of switching between the robot or smart device and multiple interactive objects and shortens the time it takes to turn to the next interactive object.
- As shown in FIG. 6, the use scenario of the emotional companion robot includes Robot, User1, User2, and User3. User1, User2, and User3 in the figure are not specific persons, but are only used to distinguish different users.
- User1 comes to Robot and wakes it up with the wake-up word; Robot then turns to User1 and interacts with it. During the interaction, Robot must determine in real time whether User1 is still interacting with it. When Robot judges through sound source localization and facial feature recognition that User1 has stopped interacting, Robot automatically turns to User2, who is speaking. The same strategy applies when there are more than two users.
- the process of human-computer interaction when the robot faces multiple people is shown in Figure 7 and includes the following steps:
- Step 0: initial state: one Robot (in the dormant state) and two or more users who can interact with the Robot.
- Step 1: User1 approaches Robot and wakes it up; Robot is awakened from the dormant state and switches to the awakened state; go to Step 2.
- Step 2: Robot turns to User1 and interacts with User1; go to Step 3.
- Step 3: during the interaction between Robot and User1, Robot judges through sound source localization and facial feature recognition whether User1 is still interacting with it. The judgment results are divided into the following four types:
- An embodiment of the present invention, an interactive communication realization device, as shown in FIG. 8, includes:
- the image collection module 10 is used to collect face images;
- the audio collection module 20 is used to collect voice signals;
- the detection module 30 is used to detect whether the current interactive object stops interacting;
- the processing module 40 is configured to determine a candidate object participating in the interaction as the new interactive object by collecting image data and voice signals when the current interactive object stops interacting in the awakened state.
- this embodiment is a device embodiment corresponding to the foregoing method embodiment, and for specific effects, refer to the foregoing method embodiment, which will not be repeated here.
- the detection module 30 is also used for judging whether a wake-up signal is received when it is in a dormant state
- the processing module 40 is further configured to switch from the dormant state to the awakened state if a wake-up signal is received, and determine that the target object that triggers the awakening of itself is the current interactive object.
- this embodiment is a device embodiment corresponding to the foregoing method embodiment, and for specific effects, refer to the foregoing method embodiment, which will not be repeated here.
- the execution module is used to respond to the required service type of the current interactive object while continuing to detect when the current interactive object does not stop interacting in the awakened state;
- the processing module 40 is also configured to enter the dormant state when the duration for which there is no interactive object reaches the first preset duration in the awake state.
- this embodiment is a device embodiment corresponding to the foregoing method embodiment, and for specific effects, refer to the foregoing method embodiment, which will not be repeated here.
- the processing module 40 includes:
- the searching unit searches for candidate objects participating in the interaction through image recognition and/or sound source localization when the duration for which the current interactive object has stopped interacting reaches the second preset duration;
- the object switching unit is used to determine the candidate object as the new interactive object if there is one candidate object, and to determine one candidate object as the new interactive object according to the image recognition result and/or the sound source localization result if there are at least two candidate objects.
- this embodiment is a device embodiment corresponding to the above method embodiment.
- A smart device includes a processor and a memory, where the memory is used to store a computer program and the processor is used to execute the computer program stored in the memory to implement the interactive communication implementation method in the above method embodiment.
- the smart device may be a desktop computer, a notebook, a palmtop computer, a tablet computer, a mobile phone, a human-computer interaction screen and other devices.
- the smart device may include, but is not limited to, a processor and a memory.
- Smart devices may also include input/output interfaces, display devices, network access devices, communication buses, communication interfaces, and so on.
- The smart device may also include an input/output interface, a communication interface, and a communication bus, where the processor, the memory, the input/output interface, and the communication interface communicate with each other through the communication bus.
- the memory stores a computer program, and the processor is used to execute the computer program stored on the memory to implement the interactive communication implementation method in the foregoing method embodiment.
- The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
- the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
- the memory may be an internal storage unit of the smart device, such as a hard disk or memory of the smart device.
- the memory may also be an external storage device of the smart device, for example: a plug-in hard disk equipped on the smart device, a smart media card (SMC), a secure digital (SD) card, Flash Card, etc.
- the memory may also include both an internal storage unit of the smart device and an external storage device.
- the memory is used to store the computer program and other programs and data required by the smart device.
- the memory can also be used to temporarily store data that has been output or will be output.
- the communication bus is a circuit that connects the described elements and carries transmission between these elements.
- the processor receives commands from other elements through the communication bus, decrypts the received commands, and performs calculations or data processing according to the decrypted commands.
- the memory may include program modules, such as a kernel (kernel), middleware (middleware), application programming interface (Application Programming Interface, API), and applications.
- the program module can be composed of software, firmware, hardware, or a combination of at least two of them.
- the input/output interface forwards commands or data entered by the user through an input device (such as a sensor, a keyboard, or a touch screen).
- the communication interface connects the smart device with other network devices, user equipment, and the network.
- the communication interface may connect to the network by wire or wirelessly in order to reach other external network equipment or user equipment.
- the wireless communication may include at least one of the following: wireless fidelity (WiFi), Bluetooth (BT), near-field communication (NFC), the Global Positioning System (GPS), cellular communication, and so on.
- wired communication may include at least one of the following: universal serial bus (USB), high-definition multimedia interface (HDMI), the RS-232 serial interface standard, and so on.
- the network can be a telecommunication network or a communication network; the communication network can be a computer network, the Internet, the Internet of Things, or a telephone network.
- smart devices can connect to the network through the communication interface, and the protocol used by the smart device to communicate with other network devices can be supported by at least one of the application, the application programming interface (API), the middleware, the kernel, and the communication interface.
- An embodiment of the present invention is a storage medium in which at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the operations performed by the corresponding embodiment of the foregoing interactive communication implementation method.
- the computer-readable storage medium may be read-only memory (ROM), random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
- the disclosed device/smart device and method may be implemented in other ways.
- the device/smart device embodiments described above are only illustrative.
- the division of the modules or units is only a logical function division; in actual implementation there may be other division methods.
- multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
- if the integrated module/unit is implemented in the form of a software functional unit and is sold or used as an independent product, it can be stored in a computer-readable storage medium.
- all or part of the processes in the above-mentioned embodiment methods of the present invention may also be implemented by a computer program instructing the relevant hardware.
- the computer program can be stored in a computer-readable storage medium. When the program is executed by the processor, it can implement the steps of the foregoing method embodiments.
- the computer program includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form.
- the computer-readable storage medium may include any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, or a software distribution medium.
- the content contained in the computer-readable storage medium can be appropriately increased or decreased in accordance with the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- General Health & Medical Sciences (AREA)
- Robotics (AREA)
- Mechanical Engineering (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Description
Claims (10)
- 1. A method for implementing interactive communication, characterized in that it comprises the steps of: detecting whether the current interactive object has stopped interacting; and, when the current interactive object stops interacting in the awake state, determining one candidate object participating in the interaction as the new interactive object based on collected image data and voice signals.
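The handover decision in claim 1 can be sketched as a single step. This is an illustrative Python sketch, not code from the patent; the function name, the boolean flags, and the list-based candidate representation are all assumptions:

```python
# Illustrative sketch of claim 1's handover decision. All names
# (handover, candidates, etc.) are invented for illustration and are
# not taken from the patent.

def handover(current_stopped, awake, candidates):
    """Return the new interactive object for one detection cycle, or None.

    `candidates` stands in for the objects found to be participating in
    the interaction via the collected image data and voice signals.
    """
    if not (awake and current_stopped):
        # Not in the awake state, or the current object is still
        # interacting: no handover takes place.
        return None
    if not candidates:
        # Nobody else is participating in the interaction.
        return None
    # Promote one candidate to be the new interactive object.
    return candidates[0]
```

The point of the sketch is that a handover is gated on two conditions, the awake state and the current object having stopped, before any candidate is considered.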
- 2. The method for implementing interactive communication according to claim 1, characterized in that it further comprises the step of: when the current interactive object has not stopped interacting in the awake state, continuing the detection while responding to the required service type of the current interactive object.
- 3. The method for implementing interactive communication according to claim 1, characterized in that it further comprises the step of: entering the dormant state when, in the awake state, the duration for which no interactive object is present reaches a first preset duration.
- 4. The method for implementing interactive communication according to claim 1, characterized in that it further comprises the steps of: judging, while in the dormant state, whether a wake-up signal is received; and, if a wake-up signal is received, switching from the dormant state to the awake state and determining the target object that triggered the wake-up as the current interactive object.
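Claims 3 and 4 together describe a two-state machine: a wake-up signal moves the device from dormant to awake and fixes the current interactive object, and an idle period reaching the "first preset duration" moves it back to dormant. A minimal sketch, assuming an explicit clock value is passed in; the class and method names are invented for illustration:

```python
# Illustrative state machine for the dormant/awake behaviour in claims 3
# and 4. The class, method names, and explicit clock argument are
# assumptions for the sketch, not details from the patent.

class InteractionStateMachine:
    SLEEP, AWAKE = "sleep", "awake"

    def __init__(self, idle_timeout):
        self.state = self.SLEEP
        self.idle_timeout = idle_timeout  # the "first preset duration"
        self.current_object = None
        self._idle_since = None

    def on_wake_signal(self, target_object):
        # Claim 4: on a wake-up signal while dormant, switch to the awake
        # state; the object that triggered the wake-up becomes the
        # current interactive object.
        if self.state == self.SLEEP:
            self.state = self.AWAKE
            self.current_object = target_object
            self._idle_since = None

    def tick(self, has_interactive_object, now):
        # Claim 3: in the awake state, enter the dormant state once no
        # interactive object has been present for the first preset duration.
        if self.state != self.AWAKE:
            return
        if has_interactive_object:
            self._idle_since = None
        elif self._idle_since is None:
            self._idle_since = now
        elif now - self._idle_since >= self.idle_timeout:
            self.state = self.SLEEP
            self.current_object = None
```

Passing the clock in as `now` rather than reading it inside the class keeps the sketch deterministic and easy to exercise.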
- 5. The method for implementing interactive communication according to any one of claims 1 to 4, characterized in that determining one candidate object participating in the interaction as the new interactive object based on collected image data and voice signals when the current interactive object stops interacting in the awake state comprises the steps of: when the duration for which the current interactive object has stopped interacting reaches a second preset duration, searching for candidate objects participating in the interaction through image recognition and/or sound source localization; if there is one candidate object, determining that candidate object as the new interactive object; and, if there are at least two candidate objects, determining one candidate object as the new interactive object according to the image recognition result and/or the sound source localization result.
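The candidate-selection step in claim 5 can be sketched as follows. The claim does not specify how the image recognition and sound source localization results are combined when several candidates exist, so the scoring function passed in below is an assumption, as are all the names:

```python
# Illustrative sketch of the candidate-selection step in claim 5. The
# combined scoring of image recognition / sound source localization
# results is an assumption; the claim only says the results are used.

def pick_new_object(stop_duration, second_preset, candidates, score):
    """Return the new interactive object, or None if none is chosen yet."""
    if stop_duration < second_preset:
        # The current object has not stopped for long enough; keep waiting.
        return None
    if not candidates:
        return None
    if len(candidates) == 1:
        # A single candidate is promoted directly.
        return candidates[0]
    # At least two candidates: rank them by the combined image
    # recognition / sound source localization score.
    return max(candidates, key=score)
```

For example, one might score each candidate as the sum of a face-recognition confidence and a sound-direction match, both hypothetical quantities here.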
- 6. A device for implementing interactive communication, characterized in that it comprises: an image collection module, used to collect face images; an audio collection module, used to collect voice signals; a detection module, used to detect whether the current interactive object has stopped interacting; and a processing module, used to determine one candidate object participating in the interaction as the new interactive object based on collected image data and voice signals when the current interactive object stops interacting in the awake state.
- 7. The device for implementing interactive communication according to claim 6, characterized in that it further comprises: an execution module, used to continue the detection while responding to the required service type of the current interactive object when the current interactive object has not stopped interacting in the awake state; the processing module is further configured to enter the dormant state when, in the awake state, the duration for which no interactive object is present reaches the first preset duration.
- 8. The device for implementing interactive communication according to claim 6, characterized in that: the detection module is further configured to judge, while the device is in the dormant state, whether a wake-up signal is received; and the processing module is further configured to switch from the dormant state to the awake state if a wake-up signal is received, and to determine the target object that triggered the wake-up as the current interactive object.
- 9. The device for implementing interactive communication according to any one of claims 6 to 8, characterized in that the processing module comprises: a searching unit, used to search for candidate objects participating in the interaction through image recognition and/or sound source localization when the duration for which the current interactive object has stopped interacting reaches the second preset duration; and an object switching unit, used to determine a candidate object as the new interactive object if there is one candidate object, and to determine one candidate object as the new interactive object according to the image recognition result and/or the sound source localization result if there are at least two candidate objects.
- 10. A storage medium, characterized in that at least one instruction is stored in the storage medium, and the instruction is loaded and executed by a processor to implement the operations performed by the method for implementing interactive communication according to any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/086222 WO2021212388A1 (en) | 2020-04-22 | 2020-04-22 | Interactive communication implementation method and device, and storage medium |
CN202080004243.6A CN112739507B (en) | 2020-04-22 | 2020-04-22 | Interactive communication realization method, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/086222 WO2021212388A1 (en) | 2020-04-22 | 2020-04-22 | Interactive communication implementation method and device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021212388A1 true WO2021212388A1 (en) | 2021-10-28 |
Family
ID=75609496
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/086222 WO2021212388A1 (en) | 2020-04-22 | 2020-04-22 | Interactive communication implementation method and device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112739507B (en) |
WO (1) | WO2021212388A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114193477A (en) * | 2021-12-24 | 2022-03-18 | 上海擎朗智能科技有限公司 | Position leading method, device, robot and storage medium |
CN116363566A (en) * | 2023-06-02 | 2023-06-30 | 华东交通大学 | Target interaction relation recognition method based on relation knowledge graph |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116978372A (en) * | 2022-04-22 | 2023-10-31 | 华为技术有限公司 | Voice interaction method, electronic equipment and storage medium |
CN114715175A (en) * | 2022-05-06 | 2022-07-08 | Oppo广东移动通信有限公司 | Target object determination method and device, electronic equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105881548A (en) * | 2016-04-29 | 2016-08-24 | 北京快乐智慧科技有限责任公司 | Method for waking up intelligent interactive robot and intelligent interactive robot |
CN108733420A (en) * | 2018-03-21 | 2018-11-02 | 北京猎户星空科技有限公司 | Awakening method, device, smart machine and the storage medium of smart machine |
CN109683610A (en) * | 2018-12-14 | 2019-04-26 | 北京猎户星空科技有限公司 | Smart machine control method, device and storage medium |
CN110111789A (en) * | 2019-05-07 | 2019-08-09 | 百度国际科技(深圳)有限公司 | Voice interactive method, calculates equipment and computer-readable medium at device |
US20190371342A1 (en) * | 2018-06-05 | 2019-12-05 | Samsung Electronics Co., Ltd. | Methods and systems for passive wakeup of a user interaction device |
CN110730115A (en) * | 2019-09-11 | 2020-01-24 | 北京小米移动软件有限公司 | Voice control method and device, terminal and storage medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106354255A (en) * | 2016-08-26 | 2017-01-25 | 北京光年无限科技有限公司 | Man-machine interactive method and equipment facing robot product |
CN110290096B (en) * | 2018-03-19 | 2022-06-24 | 阿里巴巴集团控股有限公司 | Man-machine interaction method and terminal |
CN109166575A (en) * | 2018-07-27 | 2019-01-08 | 百度在线网络技术(北京)有限公司 | Exchange method, device, smart machine and the storage medium of smart machine |
CN109461448A (en) * | 2018-12-11 | 2019-03-12 | 百度在线网络技术(北京)有限公司 | Voice interactive method and device |
CN110689889B (en) * | 2019-10-11 | 2021-08-17 | 深圳追一科技有限公司 | Man-machine interaction method and device, electronic equipment and storage medium |
-
2020
- 2020-04-22 WO PCT/CN2020/086222 patent/WO2021212388A1/en active Application Filing
- 2020-04-22 CN CN202080004243.6A patent/CN112739507B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105881548A (en) * | 2016-04-29 | 2016-08-24 | 北京快乐智慧科技有限责任公司 | Method for waking up intelligent interactive robot and intelligent interactive robot |
CN108733420A (en) * | 2018-03-21 | 2018-11-02 | 北京猎户星空科技有限公司 | Awakening method, device, smart machine and the storage medium of smart machine |
US20190371342A1 (en) * | 2018-06-05 | 2019-12-05 | Samsung Electronics Co., Ltd. | Methods and systems for passive wakeup of a user interaction device |
CN109683610A (en) * | 2018-12-14 | 2019-04-26 | 北京猎户星空科技有限公司 | Smart machine control method, device and storage medium |
CN110111789A (en) * | 2019-05-07 | 2019-08-09 | 百度国际科技(深圳)有限公司 | Voice interactive method, calculates equipment and computer-readable medium at device |
CN110730115A (en) * | 2019-09-11 | 2020-01-24 | 北京小米移动软件有限公司 | Voice control method and device, terminal and storage medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114193477A (en) * | 2021-12-24 | 2022-03-18 | 上海擎朗智能科技有限公司 | Position leading method, device, robot and storage medium |
CN116363566A (en) * | 2023-06-02 | 2023-06-30 | 华东交通大学 | Target interaction relation recognition method based on relation knowledge graph |
CN116363566B (en) * | 2023-06-02 | 2023-10-17 | 华东交通大学 | Target interaction relation recognition method based on relation knowledge graph |
Also Published As
Publication number | Publication date |
---|---|
CN112739507B (en) | 2023-05-09 |
CN112739507A (en) | 2021-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021212388A1 (en) | Interactive communication implementation method and device, and storage medium | |
US11620984B2 (en) | Human-computer interaction method, and electronic device and storage medium thereof | |
CN108735209B (en) | Wake-up word binding method, intelligent device and storage medium | |
KR101726945B1 (en) | Reducing the need for manual start/end-pointing and trigger phrases | |
WO2021036714A1 (en) | Voice-controlled split-screen display method and electronic device | |
CN110263131B (en) | Reply information generation method, device and storage medium | |
CN112860169B (en) | Interaction method and device, computer readable medium and electronic equipment | |
CN108766438A (en) | Man-machine interaction method, device, storage medium and intelligent terminal | |
EP3933570A1 (en) | Method and apparatus for controlling a voice assistant, and computer-readable storage medium | |
EP4184506A1 (en) | Audio processing | |
CN111063354B (en) | Man-machine interaction method and device | |
CN109032554B (en) | Audio processing method and electronic equipment | |
WO2022042274A1 (en) | Voice interaction method and electronic device | |
CN112634895A (en) | Voice interaction wake-up-free method and device | |
US20230048330A1 (en) | In-Vehicle Speech Interaction Method and Device | |
WO2024103926A1 (en) | Voice control methods and apparatuses, storage medium, and electronic device | |
CN112233676A (en) | Intelligent device awakening method and device, electronic device and storage medium | |
WO2022227507A1 (en) | Wake-up degree recognition model training method and speech wake-up degree acquisition method | |
CN106683668A (en) | Method of awakening control of intelligent device and system | |
WO2023006033A1 (en) | Speech interaction method, electronic device, and medium | |
US11929081B2 (en) | Electronic apparatus and controlling method thereof | |
WO2024103893A1 (en) | Method for waking up application program, and electronic device | |
CN109119075A (en) | Speech recognition scene awakening method and device | |
CN110989963B (en) | Wake-up word recommendation method and device and storage medium | |
WO2024055831A1 (en) | Voice interaction method and apparatus, and terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20931847 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20931847 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04.05.2023) |
|