WO2017031860A1 - Artificial intelligence-based control method and system for intelligent interaction device

Info

Publication number: WO2017031860A1
Application number: PCT/CN2015/096587
Authority: WO
Inventors: 葛行飞; 李峥; 林汉权
Original assignee: 百度在线网络技术（北京）有限公司
Priority date: 2015-08-24
Filing date: 2015-12-07
Publication date: 2017-03-02
Also published as: CN105159111A; CN105159111B

Abstract

An artificial intelligence-based control method and system for an intelligent interaction device, and an intelligent interaction device. The method comprises: receiving multi-modal input signals, the multi-modal input signals comprising an image signal, a sound signal, and/or a distance signal input by a user (S101); performing human face detection according to the image signal, and acquiring, when a human face is detected, a human face image and human face information (S102); performing lip region detection according to the human face image, to determine the motion condition of the lip region (S103); positioning a sound source according to the sound signal, to obtain information about the sound source (S104); determining the interaction intention of the user and the intensity degree of the interaction intention according to the human face information, the motion condition of the lip region, the information about the sound source, and/or the distance signal (S105); and controlling, according to the interaction intention of the user and the intensity degree of the interaction intention, an intelligent interaction device to perform a corresponding interaction response (S106). By means of the method, the interaction experience of a user during interaction with an intelligent interaction device is improved, and the intelligence of the intelligent interaction device is improved.

Description

Intelligent interactive device control method and system based on artificial intelligence

Cross-reference to related applications

This application claims priority to Chinese patent application No. 201510523179.3, filed by Baidu Online Network Technology (Beijing) Co., Ltd. on August 24, 2015 and entitled "Intelligent interactive device control method and system based on artificial intelligence".

Technical field

The present invention relates to the field of intelligent terminal technologies, and in particular to an artificial intelligence (AI)-based intelligent interaction device control method, a control system, and an intelligent interaction device.

Background

Today's smart interactive devices, such as televisions and household appliances, usually perform their actions under remote control or according to a preset program. Such devices have the following disadvantages:

Interaction with people is limited to a single mode and is poor in quality. A remote control offers only limited functions, and the device cannot perform any action beyond those functions. Similarly, a device that runs a preset program cannot perform any action outside that program, and so cannot respond differently to different user needs. In addition, every interaction takes place only after the user operates the remote control or presses a function button, so the interaction is entirely passive.

Although some video conferencing tracking systems can turn a camera or similar component toward a speaker according to the speaker's voice, they cannot accurately determine whether the speaker is willing to interact, nor respond appropriately according to that willingness.

Summary of the invention

An object of the present invention is to address at least one of the above technical drawbacks.

To this end, the first object of the present invention is to propose an intelligent interactive device control method based on artificial intelligence. The method can improve the interaction experience between the user and the smart interaction device, and improve the intelligence of the smart interaction device.

A second object of the present invention is to provide an intelligent interactive device control system based on artificial intelligence.

A third object of the present invention is to provide an intelligent interactive device.

A fourth object of the invention is to propose an apparatus.

A fifth object of the present invention is to provide a non-volatile computer storage medium.

To achieve the above objects, an embodiment of the first aspect of the present invention discloses an artificial intelligence-based intelligent interactive device control method, including the steps of: receiving a multi-modal input signal, the multi-modal input signal including an image signal, a sound signal, and/or a distance signal input by a user; performing face detection according to the image signal, and acquiring a face image and face information when a human face is detected; performing lip region detection according to the face image to determine a lip motion condition; performing sound source localization according to the sound signal to obtain sound source information; determining the user's willingness to interact and the strength of that willingness according to the face information, the lip motion condition, the sound source information, and/or the distance signal; and controlling the smart interaction device to perform a corresponding interactive response according to the user's willingness to interact and the strength of that willingness.

The artificial intelligence-based intelligent interactive device control method according to the embodiment of the present invention can collect the user's sound signal, image signal, and/or distance signal in real time and, after analysis by artificial intelligence, determine whether the user is willing to interact and how strong that willingness is. The smart interaction device is then controlled autonomously to perform corresponding actions, actively interacting with the user and enriching the means of interaction, thereby improving the user experience.

An embodiment of the second aspect of the present invention discloses an artificial intelligence-based intelligent interactive device control system, including: a receiving module, configured to receive a multi-modal input signal, where the multi-modal input signal includes an image signal, a sound signal, and/or a distance signal input by a user; a face detection module, configured to perform face detection according to the image signal, and acquire a face image and face information when a human face is detected; a lip detection module, configured to perform lip region detection according to the face image to determine a lip motion condition; a sound source positioning module, configured to perform sound source localization according to the sound signal to obtain sound source information; a decision module, configured to determine the user's willingness to interact and the strength of that willingness according to the face information, the lip motion condition, the sound source information, and/or the distance signal; and a composite output control module, configured to control the smart interaction device to perform a corresponding interactive response according to the user's willingness to interact and the strength of that willingness.

The artificial intelligence-based intelligent interactive device control system according to the embodiment of the present invention can collect the user's sound signal, image signal, and/or distance signal in real time and, after analysis by artificial intelligence, determine whether the user is willing to interact and how strong that willingness is. The smart interaction device is then controlled autonomously to perform corresponding actions, actively interacting with the user and enriching the means of interaction, thereby improving the user experience.

An embodiment of the third aspect of the present invention discloses an intelligent interaction device, comprising the artificial intelligence-based intelligent interactive device control system according to the embodiment of the second aspect. The intelligent interaction device can collect the user's sound signal, image signal, and/or distance signal in real time and, after analysis by artificial intelligence, determine whether the user is willing to interact and how strong that willingness is. It then autonomously performs corresponding actions, actively interacting with the user and enriching the means of interaction, thereby improving the user experience.

An embodiment of the fourth aspect of the present invention provides an apparatus comprising: one or more processors; a memory; and one or more programs stored in the memory which, when executed by the one or more processors, perform the artificial intelligence-based intelligent interactive device control method of the first aspect of the present invention.

An embodiment of the fifth aspect of the present invention provides a non-volatile computer storage medium storing one or more programs which, when executed by a device, cause the device to perform the artificial intelligence-based intelligent interactive device control method of the first aspect of the present invention.

The additional aspects and advantages of the invention will be set forth in part in the description which follows.

Brief description of the drawings

The aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments of the invention taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flowchart of an artificial intelligence-based intelligent interactive device control method according to an embodiment of the present invention;

FIG. 2 is a structural block diagram of an artificial intelligence-based intelligent interactive device control system according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an artificial intelligence-based intelligent interactive device control system according to one embodiment of the present invention.

Detailed description

The embodiments of the present invention are described in detail below, and the examples of the embodiments are illustrated in the drawings, wherein the same or similar reference numerals are used to refer to the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are intended to be illustrative of the invention and are not to be construed as limiting.

In the description of the present invention, it should be noted that, unless otherwise specified and defined, the terms "installed", "connected", and "coupled" are to be understood broadly. For example, a connection may be mechanical or electrical, may be internal communication between two components, and may be direct or indirect through an intermediate medium. Those skilled in the art can understand the specific meanings of these terms according to the specific situation.

To solve the problem in the related art that intelligent interactive devices have poor intelligence and cannot interact well with people, the present invention provides an artificial intelligence-based intelligent interactive device control method, control system, and intelligent interaction device with high intelligence and a good human interaction experience. Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. It is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that respond in a manner similar to human intelligence. Research in this area includes robotics, speech recognition, image recognition, natural language processing, and expert systems.

Artificial intelligence is a simulation of the information processes of human consciousness and thinking. Artificial intelligence is not human intelligence, but it can think like a human and may even exceed human intelligence. It is a very broad science made up of different fields, such as machine learning and computer vision. In general, one of the main goals of artificial intelligence research is to enable machines to perform complex tasks that typically require human intelligence.

An artificial intelligence-based intelligent interactive device control method, a control system, and an intelligent interaction device according to an embodiment of the present invention are described below with reference to the accompanying drawings.

FIG. 1 is a flowchart of an artificial intelligence-based intelligent interactive device control method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:

S101: Receive a multi-modal input signal, where the multi-modal input signal includes an image signal, a sound signal, and/or a distance signal input by a user.

Specifically, the sound signal input by the user may be input by the user through a microphone; the image signal may be collected by a camera; the distance signal may be acquired by an infrared distance sensor.

S102: Perform face detection according to the image signal, and acquire a face image and face information when a human face is detected. The face information includes, but is not limited to, face area information and the degree to which the face is frontal.

Specifically, for an image captured by a camera, a face detection means may be used to detect whether the image contains a face, the area the face occupies in the image, whether the face is facing the smart interaction device, and so on.

After a face is detected in the image, the face image can be cropped from the image and the face information saved.

S103: Perform lip region detection according to the face image to determine the lip motion condition.

Specifically, when it is detected in step S102 that a human face exists in the image, the lip motion condition may be detected from the cropped face image by a lip region detecting means. For example, the detection result is either that the lip region is moving or that it is not.

In one embodiment of the invention, the lip motion condition can be determined from the difference in lip shape between multiple frames of face images. For example, if the lip region in the previous frame shows the upper and lower lips closed and the lip region in the following frame shows them open, it can be determined that the user's lip region is moving and that the user may be opening the mouth to speak.

It should be noted that, even when the user is not speaking, the upper and lower lips may move at a given moment, for example when yawning. In that case the lip region should not be regarded as performing a speech-related action. Therefore, to avoid misjudgment, the lip regions of consecutive multi-frame images can be compared to determine whether the upper and lower lips are producing an action, that is, whether the user is speaking. Alternatively, whether the user is speaking can be determined by voice activity detection on the sound signal, for example by determining whether the sound signal contains the user's speech. This can be implemented with the speech recognition function of artificial intelligence: when the speaker's speech is recognized in the sound signal, it can be determined that the user is speaking. This also avoids the misjudgment described above.
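The multi-frame comparison described above can be illustrated with a minimal sketch. It assumes that an upstream detector (not specified in this description) yields one hypothetical "lip openness" value per frame, and it treats repeated open/close transitions as speech while a single slow opening, such as a yawn, is ignored. The function name, the threshold of 0.3, and the minimum of three transitions are illustrative assumptions, not values from the embodiment.

```python
from typing import Sequence


def lips_are_speaking(openness: Sequence[float],
                      open_threshold: float = 0.3,
                      min_transitions: int = 3) -> bool:
    """Return True if the lips open and close repeatedly across the frames.

    `openness` holds one hypothetical lip-opening ratio per consecutive frame.
    """
    # Classify each frame as "lips open" or "lips closed".
    states = [value > open_threshold for value in openness]
    # Count how often the state flips between neighbouring frames.
    transitions = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return transitions >= min_transitions


yawn = [0.1, 0.1, 0.6, 0.8, 0.8, 0.7, 0.1, 0.1]    # one slow open/close
speech = [0.1, 0.5, 0.1, 0.6, 0.2, 0.5, 0.1, 0.4]  # rapid open/close
print(lips_are_speaking(yawn), lips_are_speaking(speech))
```

A single open/close cycle produces only two transitions, so the yawn sequence is rejected while the speech sequence is accepted.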

S104: Perform sound source localization according to the sound signal to obtain sound source information. The sound source information includes, but is not limited to, sound source orientation information and sound intensity information.

Specifically, for example, for a multi-directional sound signal received through a microphone array, a sound source localization means can perform sound source localization to determine sound source orientation information (i.e., sound source angle information) and sound intensity information.

It should be noted that a sound signal usually contains multiple sounds, such as speech and other noise. Therefore, to locate the speaker's speech accurately, the sound signal can be denoised before sound source localization is performed on it, filtering out interference from other noise and improving the localization accuracy for the speaker's speech. Specifically, it is determined whether the sound signal contains the user's speech; if so, the user's speech in the sound signal is retained and the other interfering noise is filtered out. As in the above example, the speech recognition function of artificial intelligence can be used to recognize the speaker's speech contained in the sound signal, so that the other noise can be filtered out.
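The description does not specify how the sound source localization means computes the source angle. One common approach, shown here purely as an assumed illustration, is time difference of arrival (TDOA) between two microphones: the lag that maximises the cross-correlation of the two channels is converted to an angle using the microphone spacing. The function names, the two-microphone geometry, and the speed-of-sound constant are assumptions of this sketch.

```python
import math
from typing import Sequence

SPEED_OF_SOUND = 343.0  # metres per second, assumed room-temperature value


def estimate_delay(left: Sequence[float], right: Sequence[float],
                   max_lag: int) -> int:
    """Lag in samples of `right` relative to `left` with the highest correlation."""
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def source_angle_deg(lag: int, sample_rate: float, mic_spacing: float) -> float:
    """Convert a sample lag to an angle from the array's broadside direction."""
    delay = lag / sample_rate
    # Clamp so that measurement noise cannot push asin outside its domain.
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * delay / mic_spacing))
    return math.degrees(math.asin(ratio))


# Synthetic example: the same pulse reaches the right microphone 3 samples later.
left = [0.0] * 32
right = [0.0] * 32
left[10] = 1.0
right[13] = 1.0
lag = estimate_delay(left, right, max_lag=8)
angle = source_angle_deg(lag, sample_rate=16000.0, mic_spacing=0.2)
print(lag, round(angle, 1))
```

A positive lag means the sound reached the left microphone first, so the source lies toward the left side of the array. A real array would use more microphones and denoised input, as the paragraph above notes.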

S105: Determine the user's willingness to interact and the strength of that willingness according to the face information, the lip motion condition, the sound source information, and/or the distance signal.

It can be understood that the user's willingness to interact and the strength of that willingness may be determined from any one of the face information, the lip motion condition, the sound source information, and the distance signal, or from several or all of them together. Compared with a determination based on one or a few pieces of information, a determination based on several or all of them is more accurate and reliable.

As described below:

1. When it is determined that the user is facing the smart interaction device, the user's lips are not moving, the user utters a sound whose intensity is greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance, the user is determined to have a weak willingness to interact. The predetermined intensity can be determined empirically; its purpose is to distinguish high-intensity sound from relatively low-intensity sound. For example, the predetermined intensity may be expressed in decibels, such as 50 decibels: a sound below 50 decibels is considered a low-intensity sound source, and otherwise it is considered a high-intensity sound source. Of course, in other examples of the present invention, the sound intensity can be replaced by a voice activity index. The preset distance can also be determined empirically, for example, 1 meter. That is to say, if it is determined that the user faces the smart interaction device, the distance is close (such as within 1 meter), the lips are not moving, and there is no high-intensity sound source, it is determined that the user is interested in the smart interaction device and has a weak willingness to interact.

2. When it is determined that the user is facing the smart interaction device, the user's lips are moving, the user utters a sound whose intensity is less than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance, the user is determined to have a suspected willingness to interact. The predetermined intensity can be determined empirically; its purpose is to distinguish high-intensity sound from relatively low-intensity sound. For example, the predetermined intensity may be expressed in decibels, such as 50 decibels: a sound below 50 decibels is considered a low-intensity sound source, and otherwise it is considered a high-intensity sound source. The preset distance can also be determined empirically, for example, 1 meter. That is to say, if the user faces the smart interaction device, the distance is close (such as within 1 meter), the lips are moving, and there is no high-intensity sound source, it is determined that the user has a suspected willingness to interact.

3. When it is determined that the user is facing the smart interaction device, the user's lips are moving, the user utters a sound whose intensity is greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance, the user is determined to have a strong willingness to interact. The predetermined intensity can be determined empirically; its purpose is to distinguish high-intensity sound from relatively low-intensity sound. For example, the predetermined intensity may be expressed in decibels, such as 50 decibels: a sound below 50 decibels is considered a low-intensity sound source, and otherwise it is considered a high-intensity sound source. Of course, in other examples of the present invention, the sound intensity can be replaced by a voice activity index. The preset distance can also be determined empirically, for example, 1 meter. That is to say, if the user faces the smart interaction device, the distance is close (such as within 1 meter), the lips are moving, and there is a high-intensity sound source, it is determined that the user has a strong willingness to interact.

4. When it is determined that the user's side face is toward the smart interaction device, the user utters a sound whose intensity is greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance, the user is determined to have a willingness to interact. The predetermined intensity can be determined empirically; its purpose is to distinguish high-intensity sound from relatively low-intensity sound. For example, the predetermined intensity may be expressed in decibels, such as 50 decibels: a sound below 50 decibels is considered a low-intensity sound source, and otherwise it is considered a high-intensity sound source. Of course, in other examples of the present invention, the sound intensity can be replaced by a voice activity index. The preset distance can also be determined empirically, for example, 1 meter. That is to say, if the user's side face is toward the device, the distance is close (such as within 1 meter), and there is a high-intensity sound source, it is determined that the user has a willingness to interact.

5. When no face image is detected, the user utters a sound whose intensity is greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance, the user is determined to have a strong suspected willingness to interact. The predetermined intensity can be determined empirically; its purpose is to distinguish high-intensity sound from relatively low-intensity sound. For example, the predetermined intensity may be expressed in decibels, such as 50 decibels: a sound below 50 decibels is considered a low-intensity sound source, and otherwise it is considered a high-intensity sound source. Of course, in other examples of the present invention, the sound intensity can be replaced by a voice activity index. The preset distance can also be determined empirically, for example, 1 meter. That is to say, if there is a high-intensity sound source, the camera cannot detect a face, and the distance is close (such as within 1 meter), it is determined that the user has a strong suspected willingness to interact.

6. When no face image is detected, the user utters a sound whose intensity is greater than the predetermined intensity, and the distance between the user and the smart interaction device is greater than the preset distance, the user is determined to have a weak suspected willingness to interact. The predetermined intensity can be determined empirically; its purpose is to distinguish high-intensity sound from relatively low-intensity sound. For example, the predetermined intensity may be expressed in decibels, such as 50 decibels: a sound below 50 decibels is considered a low-intensity sound source, and otherwise it is considered a high-intensity sound source. Of course, in other examples of the present invention, the sound intensity can be replaced by a voice activity index. The preset distance can also be determined empirically, for example, 1 meter. That is to say, if there is a high-intensity sound source, no face can be detected, and the distance is far (such as more than 1 meter), it is determined that the user has a weak suspected willingness to interact.

7. The above are only example cases. In general, a multi-class classifier over the possible interaction intentions is constructed from the multiple independent input features, and a comprehensive judgment is made according to the values of the multi-modal input signal, so that the interaction intention is determined accurately and a corresponding response is made.
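The six example cases can be sketched as a small rule-based classifier. This is only an illustration under stated assumptions: the names `Face`, `Intent`, `Observation`, and `classify` are hypothetical, the thresholds are the example values of 50 decibels and 1 meter, the rules follow the "that is to say" summary of each case, a frontal face with still lips is treated as weak willingness regardless of loudness, and a distance exactly equal to the preset distance is treated as far.

```python
from dataclasses import dataclass
from enum import Enum


class Face(Enum):
    NONE = "none"        # no face detected
    FRONTAL = "frontal"  # face toward the device
    SIDE = "side"        # side face toward the device


class Intent(Enum):
    NONE = "none"
    WEAK = "weak"
    SUSPECTED = "suspected"
    STRONG = "strong"
    PRESENT = "present"                    # case 4: willingness to interact
    STRONG_SUSPECTED = "strong_suspected"
    WEAK_SUSPECTED = "weak_suspected"


@dataclass
class Observation:
    face: Face
    lips_moving: bool
    sound_db: float
    distance_m: float


def classify(obs: Observation, loud_db: float = 50.0,
             near_m: float = 1.0) -> Intent:
    """Map one observation to an intent using the example cases 1 to 6."""
    loud = obs.sound_db >= loud_db   # below the threshold counts as low intensity
    near = obs.distance_m < near_m   # assumption: equal distance counts as far
    if obs.face is Face.FRONTAL and near:
        if not obs.lips_moving:
            return Intent.WEAK                                # case 1
        return Intent.STRONG if loud else Intent.SUSPECTED    # cases 3 and 2
    if obs.face is Face.SIDE and near and loud:
        return Intent.PRESENT                                 # case 4
    if obs.face is Face.NONE and loud:
        # cases 5 and 6: sound heard but no face in view
        return Intent.STRONG_SUSPECTED if near else Intent.WEAK_SUSPECTED
    return Intent.NONE


print(classify(Observation(Face.FRONTAL, True, 62.0, 0.6)).value)
```

Combinations that none of the six cases covers fall through to `Intent.NONE` in this sketch; a trained multi-class classifier, as item 7 suggests, would replace these hand-written rules.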

S106: Control the smart interaction device to perform a corresponding interactive response according to the user's willingness to interact and the strength of that willingness.

For example, when it is determined in the above steps that there is a weak willingness to interact, the smart interaction device can be controlled to perform a silent response, such as displaying different expressions or making simple mechanical movements without producing sound.

When it is determined in the above steps that there is a suspected willingness to interact, the smart interaction device may be controlled to perform a volume prompt response, such as issuing a prompt asking the user to increase the volume.

When it is determined in the above steps that there is a strong willingness to interact, the smart interaction device may be controlled to perform a formal interactive response, that is, to formally interact with the user.

When it is determined in the above steps that there is a willingness to interact, the smart interaction device may be controlled to perform a voice/chat interaction response, that is, to interact mainly by voice or chat.

When it is determined in the above steps that there is a strong suspected willingness to interact, the smart interaction device may be controlled to turn toward the sound source and issue a prompt, for example, turning the microphone toward the sound source and prompting the user.

When it is determined in the above steps that there is a weak suspected willingness to interact, the smart interaction device may be controlled only to turn toward the sound source, for example, turning the microphone toward the sound source without issuing a prompt.
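The correspondence between each determined willingness and its response can be sketched as a lookup table. The intent labels and action names below are hypothetical placeholders chosen for this illustration; a real device would bind them to its own display, motor, and speech functions.

```python
from typing import Dict, List

# Hypothetical intent labels mapped to the responses described above.
RESPONSES: Dict[str, List[str]] = {
    "weak": ["show_expression", "simple_mechanical_action"],  # silent response
    "suspected": ["prompt_user_to_increase_volume"],          # volume prompt
    "strong": ["start_formal_interaction"],
    "present": ["start_voice_chat_interaction"],
    "strong_suspected": ["turn_to_sound_source", "prompt_user"],
    "weak_suspected": ["turn_to_sound_source"],               # no prompt
}


def plan_response(intent: str) -> List[str]:
    """Return the ordered actions for an intent, or nothing if it is unknown."""
    return list(RESPONSES.get(intent, []))


print(plan_response("strong_suspected"))
```

Keeping the mapping in data rather than in branching code makes it straightforward to add responses for further intent classes.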

In addition, to determine the user's willingness to interact and its strength more accurately and to avoid misjudgment, in one embodiment of the present invention, before the user's willingness to interact and its strength are determined according to the face information, the lip motion condition, the sound source information, and/or the distance signal, it is first determined whether the face information, the lip motion condition, the sound source information, and/or the distance signal satisfy a predetermined condition. Only if the predetermined condition is met are the user's willingness to interact and its strength determined.

Specifically, the above condition can be checked with a timer. For example, when a frontal face toward the smart interaction device is detected, a timer is started, and only after the face has remained frontal toward the device for longer than a specific time (such as 3 seconds) is it determined that the user is indeed facing the device. This avoids misjudgment: if the user merely turns his or her head, the face may be toward the smart interaction device at some moment, and the timing check allows such a momentary frontal face to be ignored, which reduces or even eliminates the probability of misjudgment.
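The timer check can be sketched as a small dwell filter. The class name `DwellTimer` and its interface are assumptions of this illustration; the 3-second hold time is the example value given above. Timestamps are passed in explicitly so that the logic does not depend on a particular clock.

```python
from typing import Optional


class DwellTimer:
    """Report a condition as confirmed only after it has held long enough."""

    def __init__(self, hold_seconds: float = 3.0) -> None:
        self.hold_seconds = hold_seconds
        self._since: Optional[float] = None  # time the condition first became true

    def update(self, condition: bool, now: float) -> bool:
        if not condition:
            # The face turned away: restart timing from scratch.
            self._since = None
            return False
        if self._since is None:
            self._since = now
        # Confirm only once the hold time has been exceeded.
        return now - self._since > self.hold_seconds


timer = DwellTimer()
readings = [(True, 0.0), (True, 2.0), (True, 3.5), (False, 4.0), (True, 4.5)]
results = [timer.update(facing, t) for facing, t in readings]
print(results)
```

A glance that lasts less than the hold time never returns `True`, and looking away resets the timer, so a brief head movement is ignored.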

In addition, to further improve the accuracy of determining the user's willingness to interact and its strength, the face information and the lip motion condition can be quantified before the determination is made according to the face information, the lip motion condition, the sound source information, and/or the distance signal. For example: the face is 30% frontal toward the smart interaction device, or the face is 50% frontal toward the smart interaction device. Quantification provides a unified standard for determining the user's willingness to interact and its strength, thereby improving the accuracy of the determination.

In an embodiment of the present invention, the method further includes: adjusting the weights of the face information, the lip motion condition, the sound source information, and/or the distance signal, where the weights influence the result of determining the user's willingness to interact and its strength. Determining the user's willingness to interact and its strength then further includes: making the determination according to the weights of the face information, the lip motion condition, the sound source information, and/or the distance signal. Specifically, the sensitivity (i.e., weight) of each input signal can be adjusted. For example, by increasing the weights of the frontal-face signal and the lip motion and reducing the weight of the sound source intensity, a willingness to interact can be determined even when the user only moves the lips without actually uttering a sound. In this way, different interaction behaviors can be produced for different scenarios, improving the interactive experience of the smart interaction device.
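The effect of adjusting the weights can be illustrated with a weighted average over quantified features. This is a sketch under assumptions: each feature is taken to be already quantified to the range 0 to 1, the feature names and the decision threshold of 0.8 are hypothetical, and a weighted average is only one possible way to combine the signals.

```python
from typing import Dict


def interaction_score(features: Dict[str, float],
                      weights: Dict[str, float]) -> float:
    """Weighted average of quantified features, each assumed to lie in [0, 1]."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * features.get(name, 0.0) for name, w in weights.items())
    return weighted / total


# The user faces the device nearby and moves the lips but makes no sound.
silent_lips = {"frontal": 1.0, "lips": 1.0, "sound": 0.0, "proximity": 1.0}

equal = {"frontal": 1.0, "lips": 1.0, "sound": 1.0, "proximity": 1.0}
lip_heavy = {"frontal": 2.0, "lips": 2.0, "sound": 0.5, "proximity": 1.0}

THRESHOLD = 0.8  # hypothetical decision threshold
equal_score = interaction_score(silent_lips, equal)
lip_score = interaction_score(silent_lips, lip_heavy)
print(equal_score >= THRESHOLD, lip_score >= THRESHOLD)
```

With equal weights the silent lip movement scores 0.75 and falls below the threshold; raising the face and lip weights and lowering the sound weight lifts the same observation above it.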

It should be noted that the smart interaction device can be an ordinary household appliance, an information appliance (such as a computer, a television, etc.), a video conference system, or an intelligent robot.

FIG. 2 is a structural block diagram of an artificial intelligence-based intelligent interactive device control system according to an embodiment of the present invention.

As shown in FIG. 2, in conjunction with FIG. 3, an artificial intelligence-based intelligent interactive device control system 200 according to an embodiment of the present invention includes: a receiving module 210 (such as a camera, an infrared distance sensor, and a microphone array), a face detection module 220, a lip detection module 230, a sound source positioning module 240, a decision module 250 (i.e., the decision center), and a composite output control module 260.

The receiving module 210 is configured to receive a multi-modal input signal, where the multi-modal input signal includes an image signal, a sound signal, and/or a distance signal input by a user. The face detection module 220 is configured to perform face detection according to the image signal, and to acquire a face image and face information when a human face is detected. The lip detection module 230 is configured to perform lip detection based on the face image to determine the lip motion. The sound source positioning module 240 is configured to perform sound source localization according to the sound signal to obtain sound source information. The decision module 250 is configured to determine the user's willingness to interact and the strength of that willingness according to the face information, the lip motion, the sound source information, and/or the distance signal. The composite output control module 260 is configured to control the smart interaction device to perform a corresponding interaction response according to the user's willingness to interact and the strength of that willingness.

In an embodiment of the present invention, the system further includes a voice activity detection module (not shown in FIG. 2), configured to determine, before the sound source positioning module 240 performs sound source localization according to the sound signal to obtain the sound source information, whether the sound signal contains the voice of the user speaking, and if so, to retain the user's voice in the sound signal and filter out other interference noise from the sound signal.

Specifically, a sound signal generally contains several sounds, such as a human voice and other noise. Therefore, to localize the speaker's voice accurately, the sound signal can be denoised before sound source localization is performed on it to obtain the sound source information. Filtering out other noise interference improves the accuracy with which the speaker's voice is localized. Specifically, it is determined whether the sound signal contains the voice of the user speaking; if so, the user's voice in the sound signal is retained, and other interference noise is filtered out from the sound signal. In the above example, the speech recognition function of artificial intelligence can be used to recognize the speaker's voice contained in the sound signal, so that other noise is filtered out and the accuracy of localizing the speaker's voice is improved.
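The gating step before sound source localization can be sketched as follows. The text relies on speech recognition to pick out the speaker's voice; the sketch below swaps in a much simpler stand-in, a per-frame energy threshold, purely to show where the filtering sits in the pipeline. The function names, the frame representation (lists of samples), and the threshold value are assumptions.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def keep_voiced_frames(frames: Sequence[Sequence[float]],
                       energy_threshold: float = 0.01) -> List[Sequence[float]]:
    """Keep only frames whose energy exceeds the threshold.

    Simplified stand-in for voice activity detection; not the speech
    recognition based approach named in the text.
    """
    return [frame for frame in frames if frame_energy(frame) > energy_threshold]
```

If no frame survives the filter, the signal can be treated as containing no user voice and localization can be skipped; otherwise only the retained frames are passed on.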

In an embodiment of the present invention, the decision module 250 is further configured to: before determining the user's willingness to interact and the strength of that willingness according to the face information, the lip motion, the sound source information, and/or the distance signal, determine whether the face information, the lip motion, the sound source information, and/or the distance signal satisfy a predetermined condition; and, if the predetermined condition is satisfied, determine the user's willingness to interact and the strength of that willingness.

In an embodiment of the present invention, the decision module 250 is further configured to: quantify the face information and the lip motion before determining the user's willingness to interact and the strength of that willingness according to the face information, the lip motion, the sound source information, and/or the distance signal.

In an embodiment of the present invention, the decision module 250 is further configured to: adjust the weights of the face information, the lip motion, the sound source information, and/or the distance signal, wherein the weights are used to influence the judgment result of the user's willingness to interact and the strength of that willingness. Determining the user's willingness to interact and the strength of that willingness then includes: determining them according to the weights of the face information, the lip motion, the sound source information, and/or the distance signal.

In an embodiment of the present invention, the face information includes face area information and a frontal-face degree, and the sound source information includes sound source orientation information and sound intensity information.

In an embodiment of the present invention, the decision module 250 is configured to: determine that the user has a weak willingness to interact when it is determined that the user is facing the smart interaction device, the user's lips are not moving, the user utters sound with a sound intensity greater than a predetermined intensity, and the distance between the user and the smart interaction device is less than a preset distance. The composite output control module 260 is configured to: control the smart interaction device to perform a silent response.

In an embodiment of the present invention, the decision module 250 is configured to: determine that the user has a suspected willingness to interact when it is determined that the user is facing the smart interaction device, the user's lips are moving, the user utters sound with a sound intensity less than a predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance. The composite output control module 260 is configured to: control the smart interaction device to perform a volume prompt response.

In an embodiment of the present invention, the decision module 250 is configured to: determine that the user has a strong willingness to interact when it is determined that the user is facing the smart interaction device, the user's lips are moving, the user utters sound with a sound intensity greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance. The composite output control module 260 is configured to: control the smart interaction device to perform a formal interaction response.

In an embodiment of the present invention, the decision module 250 is configured to: determine that the user has a willingness to interact when it is determined that the user's side face is toward the smart interaction device, the user utters sound with a sound intensity greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance. The composite output control module 260 is configured to: control the smart interaction device to perform a voice/chat interaction response.

In an embodiment of the present invention, the decision module 250 is configured to: determine that the user has a strong suspected willingness to interact when no face image is detected, the user utters sound with a sound intensity greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance. The composite output control module 260 is configured to: control the smart interaction device to turn toward the sound source direction and perform a prompt response.

In an embodiment of the present invention, the decision module 250 is configured to: determine that the user has a weak suspected willingness to interact when no face image is detected, the user utters sound with a sound intensity greater than the predetermined intensity, and the distance between the user and the smart interaction device is greater than the preset distance. The composite output control module 260 is configured to: control the smart interaction device to respond to the sound source.
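The six decision rules of the preceding embodiments can be collected into one lookup, sketched below. The `Face` enum, the function `decide`, its parameter names, and the returned label strings are all hypothetical. The side-face case follows the wording of the corresponding system claim, and combinations not covered by any embodiment simply return `None`.

```python
from enum import Enum
from typing import Optional, Tuple


class Face(Enum):
    FRONT = "front"   # frontal face toward the device
    SIDE = "side"     # side face toward the device (assumed reading)
    NONE = "none"     # no face image detected


def decide(face: Face, lips_moving: bool, sound_intensity: float, distance: float,
           min_intensity: float, max_distance: float) -> Optional[Tuple[str, str]]:
    """Return (willingness, response) for one observation, or None if no rule applies."""
    loud = sound_intensity > min_intensity
    quiet = 0.0 < sound_intensity < min_intensity   # voiced but below the threshold
    near = distance < max_distance
    far = distance > max_distance
    if face is Face.FRONT and near:
        if lips_moving and loud:
            return ("strong", "formal interaction response")
        if lips_moving and quiet:
            return ("suspected", "volume prompt response")
        if not lips_moving and loud:
            return ("weak", "silent response")
    if face is Face.SIDE and loud and near:
        return ("present", "voice/chat interaction response")
    if face is Face.NONE and loud:
        if near:
            return ("strong suspected", "turn to sound source and prompt")
        if far:
            return ("weak suspected", "respond to sound source")
    return None
```

Keeping the rules in one function makes gaps visible: for example, a frontal face with loud sound at a distance beyond the preset distance matches none of the listed embodiments.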

In an embodiment of the present invention, the lip detection module 230 is configured to determine the lip motion according to the difference in lip shape between multiple frames of the face image.
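A minimal sketch of a frame-difference check on lip shape is given below. The source does not say how lip shape is represented, so the per-frame (width, height) pair, the height-to-width ratio as the shape measure, the function name `lips_moving`, and the threshold are all assumptions.

```python
from typing import Sequence, Tuple


def lips_moving(lip_shapes: Sequence[Tuple[float, float]], threshold: float = 0.1) -> bool:
    """Report motion if the lip height/width ratio changes by more than
    threshold between any two consecutive frames.

    lip_shapes holds one hypothetical (width, height) measurement per frame.
    """
    ratios = [height / width for width, height in lip_shapes if width > 0]
    return any(abs(later - earlier) > threshold
               for earlier, later in zip(ratios, ratios[1:]))
```

Using a ratio rather than raw pixel sizes keeps the measure roughly independent of how far the face is from the camera.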

It should be noted that the specific implementation of the artificial intelligence-based intelligent interactive device control system in the embodiment of the present invention is similar to the specific implementation of the artificial intelligence-based intelligent interactive device control method in the embodiment of the present invention. For details, refer to the description of the method part; to reduce redundancy, they are not repeated here.

Further, an embodiment of the present invention discloses an intelligent interaction device, including: the artificial intelligence-based intelligent interactive device control system according to any one of the above embodiments. The smart interaction device can collect the user's sound signal, image signal, and/or distance signal in real time and, after analysis by artificial intelligence, determine whether the user has a willingness to interact and how strong that willingness is. It then autonomously controls the smart interaction device to perform corresponding actions and actively interact with the user, which enriches the means of interaction and thereby improves the user experience.

In the description of the present invention, it is to be understood that the orientation or positional relationships indicated by the terms "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inside", "outside", "clockwise", "counterclockwise", "axial", "radial", "circumferential" and the like are based on the orientation or positional relationships shown in the drawings. They are used merely for convenience and simplicity of description, do not indicate or imply that the device or component referred to must have a specific orientation or be constructed and operated in a specific orientation, and are therefore not to be construed as limiting the invention.

Moreover, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, features defining "first" or "second" may include at least one of the features, either explicitly or implicitly. In the description of the present invention, the meaning of "a plurality" is at least two, such as two, three, etc., unless specifically defined otherwise.

In the description of the present specification, a description with reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" and the like means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In the present specification, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in the specification, as well as their features, may be combined.

Any process or method description in the flowcharts or otherwise described herein may be understood to represent a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a particular logical function or process. The scope of the preferred embodiments of the invention also includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the present invention pertain.

The logic and/or steps represented in the flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (e.g., a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any apparatus that can contain, store, communicate, propagate, or transport a program for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection (electronic device) having one or more wires, a portable computer disk cartridge (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in computer memory.

It should be understood that portions of the invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), etc.

One of ordinary skill in the art can understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing related hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module. The above integrated modules can be implemented in the form of hardware or in the form of software functional modules. The integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.

The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like. Although the embodiments of the present invention have been shown and described, it is to be understood that the above embodiments are illustrative and are not to be construed as limiting the scope of the invention. The embodiments are subject to changes, modifications, substitutions, and variations.

Claims

An artificial intelligence-based intelligent interactive device control method, comprising the following steps:

Receiving a multi-modal input signal, the multi-modal input signal including an image signal, a sound signal, and/or a distance signal input by a user;

Performing face detection according to the image signal, and acquiring the face image and face information when a human face is detected;

Performing lip detection based on the face image to determine lip motion;

Performing sound source localization according to the sound signal to obtain sound source information;

Determining the user's willingness to interact and the strength of that willingness according to the face information, the lip motion, the sound source information, and/or the distance signal;

Controlling the smart interaction device to perform a corresponding interaction response according to the user's willingness to interact and the strength of that willingness.
The artificial intelligence-based intelligent interactive device control method according to claim 1, wherein before the sound source localization is performed according to the sound signal to obtain the sound source information, the method further includes:

Determining whether the sound signal includes the voice of the user speaking;

If so, retaining the voice of the user speaking in the sound signal and filtering out other interference noise from the sound signal.
The artificial intelligence-based intelligent interactive device control method according to claim 1 or 2, wherein before the user's willingness to interact and the strength of that willingness are determined according to the face information, the lip motion, the sound source information, and/or the distance signal, the method further includes:

Determining whether the face information, the lip motion, the sound source information, and/or the distance signal satisfy a predetermined condition;

If the predetermined condition is satisfied, determining the user's willingness to interact and the strength of that willingness.
The artificial intelligence-based intelligent interactive device control method according to any one of claims 1 to 3, wherein before the user's willingness to interact and the strength of that willingness are determined according to the face information, the lip motion, the sound source information, and/or the distance signal, the method further includes: quantifying the face information and the lip motion.
The artificial intelligence-based intelligent interactive device control method according to any one of claims 1 to 4, further comprising:

Adjusting the weights of the face information, the lip motion, the sound source information, and/or the distance signal, wherein the weights are used to influence the judgment result of the user's willingness to interact and the strength of that willingness;

The determining of the user's willingness to interact and the strength of that willingness further includes:

Determining the user's willingness to interact and the strength of that willingness according to the weights of the face information, the lip motion, the sound source information, and/or the distance signal.
The artificial intelligence-based intelligent interactive device control method according to any one of claims 1 to 5, wherein the face information includes face area information and a frontal-face degree, and the sound source information includes sound source orientation information and sound intensity information.
The artificial intelligence-based intelligent interactive device control method according to any one of claims 1 to 6, wherein the determining of the user's willingness to interact and the strength of that willingness according to the face information, the lip motion, the sound source information, and/or the distance signal includes: determining that the user has a weak willingness to interact when it is determined that the user is facing the smart interaction device, the user's lips are not moving, the user utters sound with a sound intensity greater than a predetermined intensity, and the distance between the user and the smart interaction device is less than a preset distance;

The controlling of the smart interaction device to perform a corresponding interaction response according to the user's willingness to interact and the strength of that willingness includes: controlling the smart interaction device to perform a silent response.
The artificial intelligence-based intelligent interactive device control method according to any one of claims 1 to 6, wherein the determining of the user's willingness to interact and the strength of that willingness according to the face information, the lip motion, the sound source information, and/or the distance signal includes: determining that the user has a suspected willingness to interact when it is determined that the user is facing the smart interaction device, the user's lips are moving, the user utters sound with a sound intensity less than a predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance;

The controlling of the smart interaction device to perform a corresponding interaction response according to the user's willingness to interact and the strength of that willingness includes: controlling the smart interaction device to perform a volume prompt response.
The artificial intelligence-based intelligent interactive device control method according to any one of claims 1 to 6, wherein the determining of the user's willingness to interact and the strength of that willingness according to the face information, the lip motion, the sound source information, and/or the distance signal includes: determining that the user has a strong willingness to interact when it is determined that the user is facing the smart interaction device, the user's lips are moving, the user utters sound with a sound intensity greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance;

The controlling of the smart interaction device to perform a corresponding interaction response according to the user's willingness to interact and the strength of that willingness includes: controlling the smart interaction device to perform a formal interaction response.
The artificial intelligence-based intelligent interactive device control method according to any one of claims 1 to 6, wherein the determining of the user's willingness to interact and the strength of that willingness according to the face information, the lip motion, the sound source information, and/or the distance signal includes: determining that the user has a willingness to interact when it is determined that the user's side face is toward the smart interaction device, the user utters sound with a sound intensity greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance;

The controlling of the smart interaction device to perform a corresponding interaction response according to the user's willingness to interact and the strength of that willingness includes: controlling the smart interaction device to perform a voice/chat interaction response.
The artificial intelligence-based intelligent interactive device control method according to any one of claims 1 to 6, wherein the determining of the user's willingness to interact and the strength of that willingness according to the face information, the lip motion, the sound source information, and/or the distance signal includes: determining that the user has a strong suspected willingness to interact when no face image is detected, the user utters sound with a sound intensity greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance;

The controlling of the smart interaction device to perform a corresponding interaction response according to the user's willingness to interact and the strength of that willingness includes: controlling the smart interaction device to turn toward the sound source direction and perform a prompt response.
The artificial intelligence-based intelligent interactive device control method according to any one of claims 1 to 6, wherein the determining of the user's willingness to interact and the strength of that willingness according to the face information, the lip motion, the sound source information, and/or the distance signal includes: determining that the user has a weak suspected willingness to interact when no face image is detected, the user utters sound with a sound intensity greater than the predetermined intensity, and the distance between the user and the smart interaction device is greater than the preset distance;

The controlling of the smart interaction device to perform a corresponding interaction response according to the user's willingness to interact and the strength of that willingness includes: controlling the smart interaction device to respond to the sound source.
The artificial intelligence-based intelligent interactive device control method according to any one of claims 1 to 12, wherein the performing of lip detection according to the face image to determine the lip motion comprises: determining the lip motion according to the difference in lip shape between multiple frames of the face image.
An intelligent interactive device control system based on artificial intelligence, comprising:

a receiving module, configured to receive a multi-modal input signal, where the multi-modal input signal includes an image signal, a sound signal, and/or a distance signal input by a user;

a face detection module, configured to perform face detection according to the image signal, and acquire the face image and face information when a human face is detected;

a lip detection module, configured to perform lip detection according to the facial image to determine a lip motion;

a sound source positioning module, configured to perform sound source localization according to the sound signal to obtain sound source information;

a decision module, configured to determine, according to the face information, the lip motion, the sound source information, and/or the distance signal, the user's willingness to interact and the strength of that willingness;

a composite output control module, configured to control the smart interaction device to perform a corresponding interaction response according to the user's willingness to interact and the strength of that willingness.
The artificial intelligence-based intelligent interactive device control system according to claim 14, further comprising:

a voice activity detection module, configured to determine, before the sound source positioning module performs sound source localization according to the sound signal to obtain the sound source information, whether the sound signal includes the voice of the user speaking, and if so, to retain the voice of the user speaking in the sound signal and filter out other interference noise from the sound signal.
The artificial intelligence-based intelligent interactive device control system according to claim 14 or 15, wherein the decision module is further configured to: before determining the user's willingness to interact and the strength of that willingness according to the face information, the lip motion, the sound source information, and/or the distance signal, determine whether the face information, the lip motion, the sound source information, and/or the distance signal satisfy a predetermined condition; and, if the predetermined condition is satisfied, determine the user's willingness to interact and the strength of that willingness.
The artificial intelligence-based intelligent interactive device control system according to any one of claims 14-16, wherein the decision module is further configured to: quantify the face information and the lip motion before determining the user's willingness to interact and the strength of that willingness according to the face information, the lip motion, the sound source information, and/or the distance signal.
The artificial intelligence-based intelligent interactive device control system according to any one of claims 14-17, wherein the decision module is further configured to:

Adjust the weights of the face information, the lip motion, the sound source information, and/or the distance signal, wherein the weights are used to influence the judgment result of the user's willingness to interact and the strength of that willingness;

The determining of the user's willingness to interact and the strength of that willingness includes:

Determining the user's willingness to interact and the strength of that willingness according to the weights of the face information, the lip motion, the sound source information, and/or the distance signal.
The artificial intelligence-based intelligent interactive device control system according to any one of claims 14 to 18, wherein the face information includes face area information and a frontal-face degree, and the sound source information includes sound source orientation information and sound intensity information.
The artificial intelligence-based intelligent interactive device control system according to any one of claims 14 to 19, wherein the decision module is configured to: determine that the user has a weak willingness to interact when it is determined that the user is facing the smart interaction device, the user's lips are not moving, the user utters sound with a sound intensity greater than a predetermined intensity, and the distance between the user and the smart interaction device is less than a preset distance;

The composite output control module is configured to: control the smart interaction device to perform a silent response.
The artificial intelligence-based intelligent interactive device control system according to any one of claims 14 to 19, wherein the decision module is configured to: determine that the user has a suspected willingness to interact when it is determined that the user is facing the smart interaction device, the user's lips are moving, the user utters sound with a sound intensity less than a predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance;

The composite output control module is configured to: control the smart interaction device to perform a volume prompt response.
The artificial intelligence-based intelligent interactive device control system according to any one of claims 14 to 19, wherein the decision module is configured to: determine that the user has a strong willingness to interact when it is determined that the user is facing the smart interaction device, the user's lips are moving, the user utters sound with a sound intensity greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance;

The composite output control module is configured to: control the smart interaction device to perform a formal interaction response.
The artificial intelligence-based intelligent interactive device control system according to any one of claims 14 to 19, wherein the decision module is configured to: determine that the user has a willingness to interact when it is determined that the user's side face is toward the smart interaction device, the user's sound intensity is greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance; and

the composite output control module is configured to: control the smart interaction device to perform a voice/chat interaction response.
The artificial intelligence-based intelligent interactive device control system according to any one of claims 14 to 19, wherein the decision module is configured to: determine that the user has a strong suspected willingness to interact when no face image is detected, the user utters sound, the sound intensity is greater than the predetermined intensity, and the distance between the user and the smart interaction device is less than the preset distance; and

the composite output control module is configured to: control the smart interaction device to turn toward the sound source direction and perform a prompt response.
The artificial intelligence-based intelligent interactive device control system according to any one of claims 14 to 19, wherein the decision module is configured to: determine that the user has a weak suspected willingness to interact when no face image is detected, the user utters sound, the sound intensity is greater than the predetermined intensity, and the distance between the user and the smart interaction device is greater than the preset distance; and

the composite output control module is configured to: control the smart interaction device to respond to the sound source.
The artificial intelligence-based intelligent interactive device control system according to any one of claims 14-25, wherein the lip region detecting module is configured to: determine the motion of the lip region according to a lip shape difference between multiple frames of face images.
An intelligent interaction device, comprising: the artificial intelligence-based intelligent interactive device control system according to any one of claims 14-26.
An apparatus, comprising:

one or more processors;

a memory; and

one or more programs stored in the memory which, when executed by the one or more processors, perform the artificial intelligence-based intelligent interactive device control method according to any one of claims 1-13.
A non-volatile computer storage medium, characterized in that the computer storage medium stores one or more programs which, when executed by a device, cause the device to perform the artificial intelligence-based intelligent interactive device control method according to any one of claims 1-13.
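
The decision rules recited in the system claims above can be read as a small lookup over four observations: face orientation, lip motion, sound intensity, and distance. The following Python sketch is one illustrative reading of those rules, not the applicant's implementation. All names (`Facing`, `Intent`, `Observation`, `decide`, `lips_moving`) and the numeric thresholds are hypothetical, and the lip-motion helper assumes that "lip shape difference" can be approximated by the change in a per-frame lip-opening measurement, which the claims do not specify.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Facing(Enum):
    FRONT = "front"  # user is facing the device
    SIDE = "side"    # user's side face is toward the device
    NONE = "none"    # no face image detected


class Intent(Enum):
    # Each value names the response the claims pair with that intent.
    WEAK = "silent response"
    SUSPECTED = "volume prompt response"
    STRONG = "formal interaction response"
    PRESENT = "voice/chat interaction response"
    STRONG_SUSPECTED = "turn to sound source and prompt"
    WEAK_SUSPECTED = "respond to sound source"


@dataclass
class Observation:
    facing: Facing
    lips_moving: bool
    vocalizing: bool
    sound_intensity: float  # units are an assumption (e.g. dB)
    distance: float         # units are an assumption (e.g. metres)


def lips_moving(lip_openings: Sequence[float], threshold: float) -> bool:
    """Assumed proxy for lip shape difference: report motion when the lip
    opening changes by more than `threshold` between consecutive frames."""
    return any(abs(b - a) > threshold
               for a, b in zip(lip_openings, lip_openings[1:]))


def decide(obs: Observation, min_intensity: float,
           max_distance: float) -> Optional[Intent]:
    """Map one observation to an intent; None when no claimed rule applies."""
    loud = obs.sound_intensity > min_intensity
    quiet = obs.sound_intensity < min_intensity
    near = obs.distance < max_distance
    far = obs.distance > max_distance

    if obs.facing is Facing.FRONT and obs.vocalizing and near:
        if not obs.lips_moving and loud:
            return Intent.WEAK
        if obs.lips_moving and quiet:
            return Intent.SUSPECTED
        if obs.lips_moving and loud:
            return Intent.STRONG
    elif obs.facing is Facing.SIDE and loud and near:
        return Intent.PRESENT
    elif obs.facing is Facing.NONE and obs.vocalizing and loud:
        if near:
            return Intent.STRONG_SUSPECTED
        if far:
            return Intent.WEAK_SUSPECTED
    return None


if __name__ == "__main__":
    obs = Observation(Facing.FRONT, lips_moving([1.0, 1.1, 3.0], 0.5),
                      True, 70.0, 1.0)
    print(decide(obs, min_intensity=60.0, max_distance=2.0).value)
```

Combinations that the claims do not recite (for example, a user facing the device from beyond the preset distance) return `None` in this sketch rather than guessing a response.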