CN108766438B - Man-machine interaction method and device, storage medium and intelligent terminal


Info

Publication number: CN108766438B
Application number: CN201810645687.2A
Authority: CN (China)
Prior art keywords: voice signal, camera, voice, sound source, terminal
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN108766438A
Inventor: 陈彪
Current Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201810645687.2A
Publication of CN108766438A; application granted; publication of CN108766438B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • G10L 17/00: Speaker identification or verification
    • G10L 17/22: Interactive procedures; Man-machine interfaces
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 5/00: Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S 5/18: Position-fixing using ultrasonic, sonic, or infrasonic waves
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/013: Eye tracking input arrangements
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60: Control of cameras or camera modules

Abstract

The embodiment of the application discloses a man-machine interaction method, a man-machine interaction device, a storage medium and an intelligent terminal. The method comprises the following steps: when a first voice signal is detected, positioning a first sound source corresponding to the first voice signal; if the positioning result of the first sound source meets a preset requirement, starting a camera, and detecting whether human eyes are aligned with the terminal through the camera; and if it is detected that the human eyes are aligned with the terminal, starting a man-machine interaction mode and responding to a voice instruction corresponding to the first voice signal. With this technical scheme, the intelligent terminal can judge, by positioning the first sound source when the first voice signal is detected, whether to start the camera to check the relation between the human eyes and the terminal, and can start the human-computer interaction mode and respond to the related voice command when it detects that the human eyes are aligned with the terminal. This avoids the cumbersome operation of waking the human-computer interaction mode with a keyword, simplifies the operation of human-computer interaction, and improves human-computer interaction efficiency.

Description

Man-machine interaction method and device, storage medium and intelligent terminal
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a man-machine interaction method, a man-machine interaction device, a storage medium and an intelligent terminal.
Background
With the development of artificial intelligence technology, man-machine interaction gradually becomes the standard configuration of most intelligent terminals, and users can control the intelligent terminals by performing man-machine interaction with the intelligent terminals.
At present, when a user wants to use the man-machine interaction function of an intelligent terminal, the user must first wake the intelligent terminal with a keyword before the man-machine interaction mode can be started. When the user interacts with the intelligent terminal continuously, a keyword must be entered to wake the terminal before every interaction. The operation is cumbersome and the man-machine interaction process is inefficient, so improvement is urgently needed.
Disclosure of Invention
The embodiment of the invention provides a man-machine interaction method, a man-machine interaction device, a storage medium and an intelligent terminal, which can automatically wake the intelligent terminal and start the man-machine interaction mode without using keywords, thereby optimizing the man-machine interaction function of the intelligent terminal.
In a first aspect, an embodiment of the present invention provides a human-computer interaction method, including:
when a first voice signal is detected, positioning a first sound source corresponding to the first voice signal;
if the positioning result of the first sound source meets a preset requirement, starting a camera, and detecting whether human eyes are aligned with the terminal through the camera;
and if the human eyes are detected to be aligned with the terminal, starting a man-machine interaction mode and responding to a voice instruction corresponding to the first voice signal.
In a second aspect, an embodiment of the present invention provides a human-computer interaction device, including:
the sound source positioning module is used for positioning a first sound source corresponding to a first voice signal when the first voice signal is detected;
the human eye alignment detection module is used for starting a camera if the positioning result of the first sound source meets a preset requirement and detecting whether human eyes are aligned with the terminal through the camera;
and the human-computer interaction response module is used for starting a human-computer interaction mode and responding to the voice instruction corresponding to the first voice signal if the human eye is detected to be aligned with the terminal.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a human-computer interaction method according to an embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides an intelligent terminal, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the human-computer interaction method according to the embodiment of the present invention.
According to the human-computer interaction scheme provided by the embodiment of the invention, when the first voice signal is detected, the first sound source corresponding to the first voice signal is located; if the positioning result of the first sound source meets the preset requirement, the camera is started and whether human eyes are aligned with the terminal is detected through the camera; and if it is detected that the human eyes are aligned with the terminal, the human-computer interaction mode is started and the voice instruction corresponding to the first voice signal is responded to. With this technical scheme, the intelligent terminal can judge, by positioning the first sound source when the first voice signal is detected, whether to start the camera to check the relation between the human eyes and the terminal, and can start the human-computer interaction mode and respond to the related voice command when it detects that the human eyes are aligned with the terminal. This avoids the cumbersome operation of waking the human-computer interaction mode with a keyword, simplifies the operating process of human-computer interaction, and also improves human-computer interaction efficiency.
Drawings
FIG. 1 is a schematic flowchart of a human-computer interaction method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of another human-computer interaction method according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of another human-computer interaction method according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of another human-computer interaction method according to an embodiment of the present invention;
FIG. 5 is a block diagram of a human-computer interaction device according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an intelligent terminal according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of another intelligent terminal according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further explained below through specific embodiments in conjunction with the accompanying drawings. It is to be understood that the specific embodiments described herein merely illustrate the invention and do not limit it. It should further be noted that, for convenience of description, the drawings show only the structures related to the present invention rather than all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Fig. 1 is a schematic flowchart of a human-computer interaction method according to an embodiment of the present invention, where the method may be executed by a human-computer interaction device, where the device may be implemented by software and/or hardware, and may be generally integrated in an intelligent terminal. As shown in fig. 1, the method includes:
step 101, when the first voice signal is detected, a first sound source corresponding to the first voice signal is located.
Illustratively, the intelligent terminal in the embodiment of the present invention may include a terminal device with a human-computer interaction function, such as a smart speaker, a mobile phone, a tablet computer, and a media player.
For example, the intelligent terminal may detect sound signals through a detection device such as a sound sensor or a microphone; when it determines that a voice signal is contained in the detected sound, the first voice signal is considered detected. The first voice signal may be the sound signal corresponding to words spoken by any user, for example the audio signal produced when a user issues an instruction to the intelligent terminal. Optionally, the detection device in the intelligent terminal may stay in a normally-on state to detect the first voice signal in the environment in real time, so that no voice instruction sent by the user to the intelligent terminal is missed. The first sound source corresponding to the first voice signal may be the object that produces the first voice signal. For example, if the first voice signal originates from user A, the first sound source is user A, or the mouth of user A.
Illustratively, the positioning of the first sound source corresponding to the first voice signal includes determining information such as a distance and a direction from the intelligent terminal to the sound source corresponding to the first voice signal, so as to lock a source of the first voice signal. Specifically, locating the first sound source corresponding to the first voice signal may include: and determining the distance and the direction of a first sound source corresponding to the first voice signal relative to the terminal by using a sound positioning technology.
The sound localization technique may be any technique that determines the direction and distance of a sound source from the received sound, and specific methods may include, but are not limited to, beamforming, high-resolution spectral estimation, or time difference of arrival. In this embodiment of the application, when the detection device detects that the first voice signal exists in the environment, a microphone-array sound localization technique may be used to localize the first sound source corresponding to the first voice signal and determine its distance and direction relative to the intelligent terminal.
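To make the time-difference approach concrete, the following is a minimal illustrative sketch (not part of the patent text) of estimating the bearing of a sound source from a two-microphone array; the sampling rate, microphone spacing, and speed of sound are assumed values, and a real array would typically use more microphones so that distance can be recovered as well.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s at room temperature (assumed)
MIC_SPACING = 0.10       # m between the two microphones (assumed)
SAMPLE_RATE = 16000      # Hz (assumed)

def estimate_bearing(left: np.ndarray, right: np.ndarray) -> float:
    """Bearing of the sound source in degrees (0 = straight ahead),
    from the time difference of arrival between the two channels."""
    # Cross-correlate the channels to find the lag (in samples)
    # at which they line up best.
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)
    tau = lag / SAMPLE_RATE                      # delay in seconds
    # Far-field model: tau = d * sin(theta) / c
    sin_theta = np.clip(SPEED_OF_SOUND * tau / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```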
And 102, if the positioning result of the first sound source meets the preset requirement, starting a camera, and detecting whether human eyes are aligned with the terminal through the camera.
Illustratively, the positioning result of the first sound source includes the distance and direction of the sound source corresponding to the first voice signal relative to the intelligent terminal. The preset requirement may be that the distance between the user and the intelligent terminal is within a preset distance threshold and that the direction is within the shooting range of the camera. The preset distance threshold may be the farthest distance, set automatically by the system, at which complete sound information can still be detected, or the farthest distance from the intelligent terminal at which the user wishes to use the human-computer interaction function, set according to the user's own needs.
For example, when detecting whether the eyes are aligned with the terminal, the camera may first detect a face region in the captured image, then detect the region where the eyes are located within the face region, and finally identify and judge the detected eye region to determine whether the eyes are aligned with the intelligent terminal. Specifically, the judgment may be whether the eye region in the acquired image is complete and the eyeballs are at the center of the eye region; if so, the eyes are aligned with the intelligent terminal.
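As an illustration of this face-then-eyes pipeline, the sketch below uses OpenCV's stock Haar cascades and treats "a face with both eye regions detected" as a rough stand-in for alignment; the stricter criterion above, the eyeballs sitting at the center of the eye regions, would additionally require pupil localization, which is omitted here.

```python
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def eyes_facing_terminal(frame) -> bool:
    """Return True when a face with two visible eye regions is found
    in the captured frame (rough proxy for 'eyes aligned')."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
        face_roi = gray[y:y + h, x:x + w]   # search for eyes inside the face
        eyes = eye_cascade.detectMultiScale(face_roi)
        if len(eyes) >= 2:
            return True
    return False
```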
For example, the camera in the embodiment of the present application may be a fixed-view camera, a 360° rotatable camera, or a camera group composed of several fixed-view cameras, which is not limited here. In addition, the camera may be a component integrated in the intelligent terminal or an external component of the intelligent terminal, which is likewise not limited here.
In this embodiment of the application, starting the camera if the positioning result of the first sound source meets the preset requirement, and detecting through the camera whether human eyes are aligned with the terminal, may include: if the distance between the first sound source and the terminal is smaller than the preset distance threshold, starting the camera according to the direction of the first sound source relative to the terminal, and detecting through the camera whether human eyes are aligned with the terminal.
Specifically, when the camera of the intelligent terminal is a single fixed-view camera: if the distance of the first voice signal relative to the terminal device is smaller than the preset distance and the direction of the corresponding sound source is within the shooting range of the fixed-view camera, this camera is started, an image of the sound source is captured through it, and whether the human eyes are aligned with the intelligent terminal is judged.
When the camera of the intelligent terminal is a 360° rotatable camera: if the distance of the first voice signal relative to the terminal device is smaller than the preset distance, the rotatable camera is started according to the detected direction of the first sound source relative to the terminal, rotated toward the direction of the first sound source, an image of the sound source in that direction is captured, and whether the human eyes are aligned with the intelligent terminal is judged.
When the camera of the intelligent terminal is a camera group composed of several fixed-view cameras: if the distance of the first voice signal relative to the terminal device is smaller than the preset distance, it is judged whether the direction of the first sound source relative to the terminal falls within the shooting field of any camera of the group; if so, the camera with the best shooting angle is started to capture an image of the sound source, and whether the human eyes are aligned with the intelligent terminal is judged.
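The three camera configurations share one decision rule: reject distant sources, then pick a camera that can see the source direction. The consolidated sketch below is illustrative only; Camera, field_of_view, and the "best shooting angle" heuristic are hypothetical names, not terms from the patent.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass
class Camera:
    field_of_view: Tuple[float, float]   # (min_deg, max_deg) it can see
    rotatable: bool = False              # True for the 360-degree camera

def select_camera(cameras: Sequence[Camera], source_dist: float,
                  source_dir: float, max_dist: float) -> Optional[Camera]:
    """Return the camera to start, or None if the preset check fails."""
    if source_dist >= max_dist:
        return None                      # too far away: keep cameras off
    for cam in cameras:
        if cam.rotatable:
            return cam                   # will be rotated toward source_dir
    # Fixed-view camera(s): the source must lie inside some field of view;
    # with a group, prefer the view whose centre is nearest the source
    # (the "best shooting angle" in the text above).
    in_view = [c for c in cameras
               if c.field_of_view[0] <= source_dir <= c.field_of_view[1]]
    if not in_view:
        return None
    return min(in_view,
               key=lambda c: abs(sum(c.field_of_view) / 2 - source_dir))
```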
And 103, if the human eyes are detected to be aligned with the terminal, starting a man-machine interaction mode and responding to a voice instruction corresponding to the first voice signal.
For example, in the human-computer interaction mode the user may converse with the intelligent terminal and send it voice instructions, and the intelligent terminal responds to the instruction and, where necessary, feeds the execution result back to the user. A voice instruction may be an instruction initiated by the user to the intelligent terminal in voice form. For example, if the intelligent terminal is a smart speaker, the user saying "start the smart speaker, play music in the default list" is a voice instruction telling the smart speaker to work.
In the embodiment of the application, if it is detected that the human eyes are aligned with the intelligent terminal, the user is speaking toward the intelligent terminal, which indicates that the user wants to send it a voice instruction through the man-machine interaction mode. The man-machine interaction mode of the intelligent terminal is therefore started, and the voice instruction corresponding to the first voice signal is responded to. Optionally, to prevent a voice instruction sent by the user from being missed while the sound source is located and eye alignment with the terminal is checked, the voice content corresponding to the first voice signal may be acquired at the same time as the first voice signal is detected and its first sound source is located. When responding to the voice instruction corresponding to the first voice signal, a voice instruction is then generated from this voice content and responded to.
Specifically, after the first voice signal is detected, its corresponding voice content is acquired regardless of whether it turns out to be a voice instruction sent by the user. When it is determined that the voice signal is an instruction sent by the user, that is, when responding to the voice instruction corresponding to the first voice signal, the acquired voice content can be recognized to obtain the corresponding voice instruction, and the instruction is responded to. If the voice signal is determined not to be a voice instruction sent by the user, the acquired voice content is deleted so that redundant data does not occupy memory.
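The buffer-then-recognize behavior just described can be sketched as follows; locate, eyes_aligned, recognize, and respond are hypothetical helpers standing in for the modules the patent leaves unspecified.

```python
def handle_first_voice_signal(audio, locate, eyes_aligned, recognize, respond):
    """Cache the utterance before any checks run, so no instruction is
    lost; discard the cache if the signal turns out not to be a command."""
    buffered = bytes(audio)            # keep the voice content up front
    if not locate(audio):              # positioning fails the preset check
        buffered = None                # delete: avoid holding stale data
        return
    if not eyes_aligned():
        buffered = None
        return
    command = recognize(buffered)      # speech recognition on the cache
    if command is not None:
        respond(command)
```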
It should be noted that, during human-computer interaction, the first utterance spoken by the user may not contain a specific control instruction. In that case, recognition of the first voice signal finds no voice instruction in it, and it is necessary to continue to detect the second voice signal corresponding to the first sound source, recognize the voice instruction contained in that second voice signal, and respond to it.
According to the man-machine interaction method provided by the embodiment of the application, when the first voice signal is detected, the first sound source corresponding to the first voice signal is located; if the positioning result of the first sound source meets the preset requirement, the camera is started and whether human eyes are aligned with the terminal is detected through the camera; and if it is detected that the human eyes are aligned with the terminal, the man-machine interaction mode is started and the voice instruction corresponding to the first voice signal is responded to. With this technical scheme, the intelligent terminal can judge, by positioning the first sound source when the first voice signal is detected, whether to start the camera to check the relation between the human eyes and the terminal, and can start the human-computer interaction mode and respond to the related voice command when it detects that the human eyes are aligned with the terminal. This avoids the cumbersome operation of waking the human-computer interaction mode with a keyword, simplifies the operating process of human-computer interaction, and also improves human-computer interaction efficiency.
In some embodiments, locating the first sound source corresponding to the first voice signal includes: determining, through a sound localization technique, the distance and direction of the first sound source corresponding to the first voice signal relative to the terminal. Correspondingly, starting the camera if the positioning result of the first sound source meets the preset requirement, and detecting through the camera whether human eyes are aligned with the terminal, includes: if the distance between the first sound source and the terminal is smaller than a preset distance threshold, starting the camera according to the direction of the first sound source relative to the terminal, and detecting through the camera whether human eyes are aligned with the terminal. The advantage of this arrangement is that whether and how to start the camera can be judged from the distance and direction of the sound source corresponding to the voice signal relative to the terminal, which improves the accuracy of deciding whether to start the man-machine interaction mode.
In some embodiments, when the first voice signal is detected, locating the first sound source corresponding to the first voice signal further includes: acquiring the voice content corresponding to the first voice signal when the first voice signal is detected. Correspondingly, responding to the voice instruction corresponding to the first voice signal includes: generating a voice instruction from the voice content and responding to that voice instruction. The advantage of this arrangement is that a voice instruction sent by the user cannot be missed because of the sound-source localization and eye-alignment detection. For example, if the user says "play default list" to the intelligent terminal and the terminal uses this voice signal only for localization and eye-alignment detection without recording its content, then even if the signal is later judged to be a control instruction sent by the user, the terminal cannot learn its content and respond to it, and the user's voice instruction is lost.
In some embodiments, after starting the human-computer interaction mode and responding to the voice instruction corresponding to the first voice signal, the method further includes: recording first voiceprint information corresponding to the first voice signal, and closing the camera; when a second voice signal corresponding to the first voiceprint information is detected, determining the moving speed of the first sound source corresponding to the first voiceprint information; and if the moving speed is smaller than a preset speed threshold, responding to the voice instruction corresponding to the second voice signal. The advantage of this arrangement is that the camera can be turned off once it is confirmed that the user is interacting with the intelligent terminal, and whether to keep the human-computer interaction mode is judged through the voice signal detection device alone, which reduces the power consumption of the intelligent terminal while improving human-computer interaction efficiency.
In some embodiments, determining the moving speed of the first sound source corresponding to the first voiceprint information includes: acquiring the time interval between a first time and a second time, where the first time is the time at which the first voice signal was detected and the second time is the time at which the second voice signal was detected; acquiring the difference between the distances of the first sound source relative to the terminal at the first time and at the second time; and calculating the moving speed of the first sound source corresponding to the first voiceprint information from the time interval and the distance difference. The advantage of this arrangement is that the moving speed of the first sound source can be analyzed more accurately, so that whether the human-computer interaction mode of the intelligent terminal needs to be kept can be judged accurately.
In some embodiments, after starting the human-computer interaction mode and responding to the voice instruction corresponding to the first voice signal, the method further includes: recording first voiceprint information corresponding to the first voice signal, and closing the camera; when a third voice signal corresponding to the first voiceprint information is detected, if the time interval between the current time and the third time is judged to be greater than the valid duration of the first voiceprint information, locating the third voice signal, and if the positioning result meets the preset requirement, starting the camera and performing eye detection again through the camera. The third time is the time at which a voice signal corresponding to the first voiceprint information was last detected, and the valid duration of the first voiceprint information is the time interval between the last two detections of a voice signal corresponding to the first voiceprint information. The advantage of this arrangement is that, while the valid duration of the first voiceprint information has not elapsed, the voice instructions of further voice signals matching the first voiceprint information are responded to directly; once the valid duration is exceeded, localization and eye-alignment detection are performed again. This reduces the power consumption of the intelligent terminal while guaranteeing the timeliness and accuracy of voice instructions.
In some embodiments, after starting the human-computer interaction mode and responding to the voice instruction corresponding to the first voice signal, the method further includes: controlling the camera to capture the face image corresponding to the first voice signal as the target face image and recording the target face image; if the detected voiceprint information of a fourth voice signal is second voiceprint information and the camera detects that human eyes are aligned with the terminal device, controlling the camera to capture the face image corresponding to the fourth voice signal and matching it against the recorded target face image; and if the two do not match, responding to the voice instruction corresponding to the fourth voice signal and taking the face image corresponding to the fourth voice signal as the new target face information. The advantage of this arrangement is that voice signals sent by different users can be distinguished, which avoids responding to the wrong voice instruction because the voice signal and the identity of the recognized user do not correspond.
Fig. 2 is a schematic flow chart of another human-computer interaction method according to an embodiment of the present invention, where the method includes the following steps:
step 201, when the first voice signal is detected, a first sound source corresponding to the first voice signal is located.
Step 202, if the positioning result of the first sound source meets the preset requirement, starting a camera, and detecting whether human eyes are aligned with the terminal through the camera.
And 203, if the human eyes are detected to be aligned with the terminal, starting a man-machine interaction mode and responding to a voice instruction corresponding to the first voice signal.
And 204, recording first voiceprint information corresponding to the first voice signal, and closing the camera.
For example, the voiceprint information may be sound wave spectrum information carrying speech information detected by a device or a model integrated in the intelligent terminal, and each person has unique voiceprint information, so that the voiceprint information can distinguish voices of different persons or judge whether the voices are voices of the same person. The first voiceprint information may be voiceprint information of a sound source corresponding to the first voice signal.
Illustratively, when it is detected that the first voice signal is a voice instruction sent by the user to the intelligent terminal, after responding to the instruction, the first voiceprint information corresponding to the first voice signal can be recorded, and the first voiceprint information is used to identify whether a subsequent voice signal comes from the same user, so as to decide whether to respond to the next voice instruction. In this way, the operations of steps 201 to 204 do not have to be repeated for every subsequent voice signal, and the camera can be turned off at this point, which reduces the power consumption of the intelligent terminal and improves the efficiency of man-machine interaction.
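The patent does not fix how two voice signals are matched to the same voiceprint; one common realization, shown here purely as an assumption, is to compare fixed-length voiceprint embeddings by cosine similarity against a tuned threshold.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.75   # tuning parameter (assumed)

def same_voiceprint(emb_a: np.ndarray, emb_b: np.ndarray) -> bool:
    """Treat two utterances as the same speaker when their voiceprint
    embeddings (extractor not specified here) are close enough."""
    cos = float(np.dot(emb_a, emb_b) /
                (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
    return cos >= SIMILARITY_THRESHOLD
```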
Step 205, when the second voice signal corresponding to the first voiceprint information is detected, determining the moving speed of the first sound source corresponding to the first voiceprint information.
For example, the second voice signal may be a voice signal corresponding to the first voiceprint information which is uttered after the first voice signal. For example, the voiceprint information of the user a is first voiceprint information, a first sentence spoken by the user a, "start the smart speaker playing default list" is a first voice signal, and after the smart speaker responds, the user a speaks "play the next" again, and at this time, the sentence is a second voice signal corresponding to the first voiceprint information.
Illustratively, by the time the second voice signal is detected, the user's position may have changed; for example, the user may have moved away from the intelligent terminal, and the second voice signal may be the user speaking with other people rather than a voice command to the intelligent terminal. Therefore, the terminal should not, immediately upon detecting a second voice signal corresponding to the same first voiceprint information, recognize whether it contains a voice command and execute it. Whether the second voice signal meets the recognition requirement must first be judged: the moving speed of the first sound source corresponding to the first voiceprint information is determined, and whether that moving speed meets the recognition requirement is judged; only if it does is concrete recognition performed.
The moving speed of the first sound source may be its moving speed in the period after the first voice signal is emitted and before the second voice signal is emitted. For example, if user A moves backward after sending the first voice signal and then sends the second voice signal, the speed at which the user moves during this period is the moving speed of the first sound source.
In this embodiment of the present application, determining a moving speed of a first sound source corresponding to the first voiceprint information may include: acquiring a time interval between a first moment and a second moment, wherein the first moment comprises a moment of detecting a first voice signal, and the second moment comprises a moment of detecting a second voice signal; obtaining the distance difference of a first sound source relative to a terminal at a first moment and a second moment; and calculating the moving speed of the first sound source corresponding to the first voiceprint information according to the time interval and the distance difference.
Specifically, a first time and a second time corresponding to the detected first voice signal and the detected second voice signal may be obtained respectively to obtain a time interval between the two times, then a difference between distances of the first sound source relative to the intelligent terminal at the first time and the second time is obtained, and the moving speed of the first sound source may be obtained according to the distance difference and the time interval. For example, the first sound source is a user a, the user a sends a first voice signal at 9:00, sends a second voice signal at 9:02, and the distance between the user a sending the first voice signal and the terminal is 0.5 m, the distance between the user a sending the second voice signal and the intelligent terminal is 1.5 m, the time interval between the first time and the second time is 2 minutes, the distance difference is 1 m, and the moving speed of the first sound source is 0.5 m/min.
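In code, the calculation is simply the distance difference divided by the time interval; this one-line sketch reproduces the 0.5 m/min figure from the example above.

```python
def moving_speed(t1_min: float, d1_m: float, t2_min: float, d2_m: float) -> float:
    """Speed of the sound source between the two utterances (m/min)."""
    return abs(d2_m - d1_m) / (t2_min - t1_min)

# 9:00 at 0.5 m from the terminal, 9:02 at 1.5 m -> 0.5 m/min
assert moving_speed(0.0, 0.5, 2.0, 1.5) == 0.5
```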
Step 206, determining whether the moving speed of the first sound source is less than a preset speed threshold, if so, executing step 207, and if not, returning to step 201.
For example, the preset speed threshold may be set in advance by the user according to the user's own needs, or set automatically, on the premise that voice commands can still be recognized accurately, by combining the relationship between sound-source moving speed and recognition accuracy with the user's usage habits. Judging whether the moving speed of the first sound source is smaller than the preset speed threshold amounts to judging whether the second voice signal can be taken as a voice instruction sent by the first sound source to the intelligent terminal. If so, step 207 is executed and the voice instruction corresponding to the second voice signal is responded to; if not, the process returns to step 201, and steps 201 to 203 are executed again for the voice signal.
And step 207, responding to the voice instruction corresponding to the second voice signal.
For example, responding to the voice instruction corresponding to the second voice signal includes recognizing whether the second voice signal contains a voice command; if so, the voice command is responded to, and if not, the second voice signal is ignored.
According to the embodiment of the invention, when the intelligent terminal detects that the positioning of the first sound source corresponding to the first voice signal meets the preset requirement and detects that the human eyes are aligned with the terminal, the intelligent terminal starts a man-machine interaction mode and responds to a related voice command, and simultaneously records the voiceprint information corresponding to the first voice signal and closes the camera. And when a second voice signal corresponding to the first voiceprint information is detected subsequently, if the moving speed of the first sound source corresponding to the first voiceprint information is smaller than a preset speed threshold, responding to a voice instruction corresponding to the second voice signal. The camera can be turned off after the user and the intelligent terminal are confirmed to perform human-computer interaction, whether the human-computer interaction mode is continuously kept or not is judged only through the voice signal detection device, power consumption of the intelligent terminal is reduced, and meanwhile human-computer interaction efficiency is improved.
Fig. 3 is a schematic flowchart of another human-computer interaction method according to an embodiment of the present invention, where the method includes the following steps:
step 301, when the first voice signal is detected, a first sound source corresponding to the first voice signal is located.
Step 302, if the positioning result of the first sound source meets the preset requirement, starting a camera, and detecting whether human eyes are aligned with the terminal through the camera.
And 303, if the human eyes are detected to be aligned with the terminal, starting a man-machine interaction mode and responding to a voice instruction corresponding to the first voice signal.
And step 304, recording first voiceprint information corresponding to the first voice signal, and closing the camera.
Step 305, when a third voice signal corresponding to the first voiceprint information is detected, judging whether the time interval between the current time and the third time is greater than the effective duration of the first voiceprint information, if so, returning to execute step 301, and if not, executing step 306.
The third time comprises the time of last detection of the voice signal corresponding to the first voiceprint information, and the effective duration of the first voiceprint information comprises the time interval of last two detections of the voice signal corresponding to the first voiceprint information. The third voice signal may be a voice signal other than the first voice signal corresponding to the first voiceprint information, and may be the same as or different from the second voice signal.
Illustratively, the valid duration of each piece of voiceprint information is not permanent but has a fixed length, namely the time interval between the last two detections of a voice signal corresponding to that voiceprint information. If the time interval between the current time and the third time exceeds the valid duration of the voiceprint information corresponding to the current voice signal, the process returns to step 301, and steps 301 to 303 are executed again for the third voice signal. If the time interval between the current time and the third time does not exceed the valid duration, step 306 is executed to respond to the voice instruction corresponding to the third voice signal. For example, the first voiceprint information is the voiceprint information of user A; the third voice signal sent by user A is detected at the current time, 9:00 a.m.; the last voice signal from user A (namely, the third time) was detected at 8:58, and the one before that at 8:55. The valid duration of user A's first voiceprint information is therefore the interval between 8:55 and 8:58, i.e., 3 minutes. The time interval between the current time and the third time is 2 minutes, which does not exceed the 3-minute valid duration, so step 306 is executed to respond to the voice instruction corresponding to the third voice signal.
It should be noted that an initial valid duration of the first voiceprint information is needed when the third voice signal at the current time is only the second voice signal corresponding to the first voiceprint information, because the voiceprint has then been detected just once before. In that case, the valid duration of the first voiceprint information may be determined by the system from its historical valid durations, for example their average value, or preset by the user according to the user's own needs.
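Putting the valid-duration rule together with the initial-duration fallback just described, here is a minimal illustrative sketch; the session bookkeeping is invented for the example, not taken from the patent.

```python
class VoiceprintSession:
    """Tracks detection times for one voiceprint; the valid duration is
    the interval between the last two detections."""

    def __init__(self, initial_valid_duration_min: float):
        self.times = []                            # detection times, minutes
        self.initial = initial_valid_duration_min  # e.g. historical average

    def needs_redetection(self, now_min: float) -> bool:
        """True when localization and eye detection must be redone."""
        if not self.times:
            return True                            # nothing recorded yet
        if len(self.times) == 1:
            valid = self.initial                   # only one prior detection
        else:
            valid = self.times[-1] - self.times[-2]
        return now_min - self.times[-1] > valid

    def record(self, t_min: float) -> None:
        self.times.append(t_min)

# Worked example above: detections at 8:55 and 8:58, current time 9:00
s = VoiceprintSession(initial_valid_duration_min=3.0)
s.record(8 * 60 + 55)
s.record(8 * 60 + 58)
assert not s.needs_redetection(now_min=9 * 60)     # 2 min gap < 3 min valid
```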
And step 306, responding to the voice instruction corresponding to the third voice signal.
According to the embodiment of the invention, when the intelligent terminal detects that the positioning of the first sound source corresponding to the first voice signal meets the preset requirement and detects that the human eyes are aligned with the terminal, the intelligent terminal starts a man-machine interaction mode and responds to a related voice command, and simultaneously records the voiceprint information corresponding to the first voice signal and closes the camera. And when a third voice signal corresponding to the first voiceprint information is detected subsequently, if the time interval between the current moment and the third moment is less than the effective duration of the first voiceprint information, responding to a voice instruction corresponding to the third voice signal. The real-time performance and the accuracy of the voice instruction are guaranteed while the power consumption of the intelligent terminal is reduced.
Fig. 4 is a schematic flowchart of another human-computer interaction method according to an embodiment of the present invention, where the method includes the following steps:
step 401, when the first voice signal is detected, a first sound source corresponding to the first voice signal is located.
And 402, if the positioning result of the first sound source meets the preset requirement, starting a camera, and detecting whether human eyes are aligned with the terminal through the camera.
And 403, if the human eyes are detected to be aligned with the terminal, starting a man-machine interaction mode and responding to a voice instruction corresponding to the first voice signal.
And step 404, controlling the camera to collect a face image corresponding to the first voice signal as a target face image and recording the target face image.
Illustratively, to ensure the accuracy of the subsequent human-computer interaction process, when it is determined that the first voice signal is a voice signal sent by the user to control the intelligent terminal, the camera is controlled to capture the face image corresponding to the first voice signal, that is, the face to which the eyes detected during the eye-alignment check belong. This face image is taken as the target face image and used subsequently to determine whether a voice signal is emitted by the same sound source.
It should be noted that the execution sequence of step 404 is not limited in this application: it may follow the order shown in this embodiment, be executed before the human-computer interaction mode is started in step 403, or be executed simultaneously with starting the human-computer interaction mode and responding to the voice instruction of the first voice signal.
And 405, if the detected voiceprint information of the fourth voice signal is second voiceprint information and the camera detects that human eyes are aligned with the terminal equipment, controlling the camera to acquire a face image corresponding to the fourth voice signal.
For example, the fourth voice signal may be any other voice signal detected after the first voice signal. Its voiceprint may be the same first voiceprint information as that of the first voice signal, or it may differ from the first voice signal, in which case it is called the second voiceprint information.
When the voiceprint information of the fourth voice signal is detected to be second voiceprint information and the camera detects that human eyes are aligned with the terminal device, the sound source of this voice signal differs from that of the first voice signal while some user's eyes are aligned with the intelligent terminal. It must then be further judged whether the user detected by the camera is the sound source corresponding to the fourth voice signal, so the camera is controlled to capture the face image corresponding to the fourth voice signal.
It should be noted that, if the voiceprint information of the fourth speech signal is still the first voiceprint information, the human-computer interaction process is performed according to the method shown in fig. 2 and/or fig. 3. If the voiceprint information of the fourth voice signal is the second voiceprint information, but it is not detected that human eyes are aligned with the terminal device at the moment, it is indicated that the fourth voice signal is not a control instruction of a user for the terminal, and a voice instruction corresponding to the fourth voice signal is not responded.
And step 406, judging whether the face image corresponding to the fourth voice signal matches the recorded target face image; if not, executing step 407, and if so, executing step 408 and not responding to the voice instruction corresponding to the fourth voice signal.
For example, if the facial image corresponding to the fourth speech signal matches the recorded target facial image corresponding to the first speech signal, it indicates that the user facing the terminal is still the target user corresponding to the first speech signal at this time, and is not the user who sent the fourth speech signal, and at this time, the voice instruction corresponding to the fourth speech signal may not be responded. If the face image corresponding to the fourth voice signal does not match the target face image corresponding to the recorded first voice signal, it indicates that the user facing the terminal is not the target user corresponding to the first voice signal but the user corresponding to the fourth voice signal, and the human eyes of the user are aligned with the terminal, at this time, step 407 is executed, and a voice instruction corresponding to the fourth voice signal is responded.
For example, the first voice signal is uttered by user A, and the target face image is accordingly the face image of user A, while the fourth voice signal is sent by user B. If the face image corresponding to the fourth voice signal matches the recorded target face image, it is user A, not user B, who is still facing the terminal device; the eyes of the user corresponding to the fourth voice signal are therefore not aligned with the terminal, and the voice instruction corresponding to the fourth voice signal need not be responded to. If the face image corresponding to the fourth voice signal does not match the recorded target face image, user B is facing the terminal device at this moment, which is why the face image does not match user A's target face image; the eyes aligned with the terminal are determined to be user B's, and the voice instruction corresponding to the fourth voice signal can be responded to.
And 407, responding to the voice instruction corresponding to the fourth voice signal, and taking the face image corresponding to the fourth voice signal as the target face information.
Illustratively, when the face image corresponding to the fourth voice signal does not match the recorded target face image, the voice instruction corresponding to the fourth voice signal is responded to, and the face image captured by the camera for the fourth voice signal replaces the previous target face information.
And step 408, not responding to the voice instruction corresponding to the fourth voice signal.
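The decision flow of steps 405 to 408 can be condensed into a short sketch; every helper here (eyes_aligned, capture_face, matches_target, respond, update_target) is a hypothetical placeholder for the modules described above.

```python
def on_fourth_voice_signal(command, eyes_aligned, capture_face,
                           matches_target, respond, update_target):
    """Handle a voice signal whose voiceprint differs from the first."""
    if not eyes_aligned():
        return                    # not a control instruction for the terminal
    face = capture_face()         # face seen by the camera right now
    if matches_target(face):
        return                    # camera still sees the earlier user, so
                                  # the new speaker is not facing the terminal
    respond(command)              # the new speaker is facing the terminal
    update_target(face)           # their face becomes the target face image
```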
According to the embodiment of the invention, when the intelligent terminal detects that the positioning of the first sound source corresponding to the first voice signal meets the preset requirement and detects that the human eyes are aligned with the terminal, the intelligent terminal starts a man-machine interaction mode and responds to a related voice instruction, and acquires the current face image as the target face image. And subsequently detecting a fourth voice signal corresponding to the second voiceprint information, acquiring a face image corresponding to the fourth voice signal to be matched with the target face image when human eyes are detected to be aligned with the terminal equipment, and responding to a voice instruction corresponding to the fourth voice signal if the face image is not matched with the target face image. The voice signals sent by different users can be distinguished, and the condition that the voice instruction response is wrong due to the fact that the voice signals are not consistent with the identity of the identified user is avoided.
Fig. 5 is a block diagram of a human-computer interaction device according to an embodiment of the present invention, where the human-computer interaction device may be implemented by software and/or hardware, and is generally integrated in a terminal, and may respond to a voice command of a user by executing a human-computer interaction method. As shown in fig. 5, the apparatus includes:
a sound source positioning module 501, configured to, when a first voice signal is detected, position a first sound source corresponding to the first voice signal;
a human eye alignment detection module 502, configured to start a camera if the positioning result of the first sound source meets a preset requirement, and detect whether human eyes are aligned with the terminal through the camera;
and a human-computer interaction response module 503, configured to start a human-computer interaction mode and respond to the voice instruction corresponding to the first voice signal if it is detected that the human eye is aligned with the terminal.
The man-machine interaction device provided in this embodiment of the application locates, when the first voice signal is detected, the first sound source corresponding to the first voice signal; starts the camera if the positioning result of the first sound source meets the preset requirement and detects through the camera whether human eyes are aligned with the terminal; and, if it is detected that the human eyes are aligned with the terminal, starts the man-machine interaction mode and responds to the voice instruction corresponding to the first voice signal. With this technical scheme, the intelligent terminal can judge, by positioning the first sound source when the first voice signal is detected, whether to start the camera to check the relation between the human eyes and the terminal, and can start the human-computer interaction mode and respond to the related voice command when it detects that the human eyes are aligned with the terminal. This avoids the cumbersome operation of waking the human-computer interaction mode with a keyword, simplifies the operating process of human-computer interaction, and also improves human-computer interaction efficiency.
Optionally, the positioning the first sound source corresponding to the first voice signal includes:
determining the distance and the direction of a first sound source corresponding to the first voice signal relative to a terminal through a sound positioning technology;
correspondingly, starting the camera if the positioning result of the first sound source meets the preset requirement, and detecting through the camera whether human eyes are aligned with the terminal, includes:
if the distance between the first sound source and the terminal is smaller than a preset distance threshold, starting the camera according to the direction of the first sound source relative to the terminal, and detecting through the camera whether human eyes are aligned with the terminal.
Optionally, the apparatus further comprises:
the voice content acquisition module is used for acquiring the voice content corresponding to the first voice signal when the first voice signal is detected;
correspondingly, the responding to the voice instruction corresponding to the first voice signal comprises:
and generating a voice instruction according to the voice content, and responding to the voice instruction.
Optionally, the apparatus further comprises:
the information recording module is used for recording first voiceprint information corresponding to the first voice signal;
the camera control module is used for closing the camera after recording first voiceprint information corresponding to the first voice signal;
the speed determining module is used for determining the moving speed of a first sound source corresponding to the first voiceprint information when a second voice signal corresponding to the first voiceprint information is detected;
the human-computer interaction response module 503 is further configured to respond to the voice instruction corresponding to the second voice signal if the moving speed is smaller than a preset speed threshold.
Optionally, the speed determining module is specifically configured to obtain a time interval between a first time and a second time, where the first time includes a time when the first voice signal is detected, and the second time includes a time when the second voice signal is detected;
obtaining a distance difference between the first sound source at the first moment and the second moment relative to the terminal;
and calculating the moving speed of the first sound source corresponding to the first voiceprint information according to the time interval and the distance difference.
Optionally, the apparatus further comprises:
an effective duration judging module, configured to, when a third voice signal corresponding to the first voiceprint information is detected and the time interval between the current moment and a third moment is judged to be greater than the effective duration of the first voiceprint information, control the sound source positioning module 501 to locate the third voice signal; if the positioning result meets the preset requirement, control the camera control module to start the camera and control the human eye alignment detection module 502 to perform human eye detection again through the camera;
the third moment is the moment at which a voice signal corresponding to the first voiceprint information was last detected, and the effective duration of the first voiceprint information is the time interval between the last two detections of a voice signal corresponding to the first voiceprint information.
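One way to realize this bookkeeping is sketched below, on the rule stated in the text: the effective duration is the interval between the last two matching detections, and a gap longer than that duration forces localization and human eye detection to run again. The class and method names are illustrative, not from the patent.

import time

class VoiceprintValidity:
    """Tracks when voice signals matching the recorded voiceprint were
    detected, and whether the voiceprint shortcut is still valid."""

    def __init__(self):
        self.last_time = None       # moment of the last matching detection
        self.valid_duration = None  # interval between the last two detections

    def needs_revalidation(self, now=None):
        """True when the gap since the last matching detection exceeds the
        effective duration, so localization and eye detection must rerun."""
        now = time.monotonic() if now is None else now
        if self.last_time is None or self.valid_duration is None:
            return True
        return (now - self.last_time) > self.valid_duration

    def record_detection(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last_time is not None:
            self.valid_duration = now - self.last_time
        self.last_time = now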
Optionally, the apparatus further comprises:
the face acquisition module is used for controlling the camera to acquire a face image corresponding to the first voice signal as a target face image;
the face image recording module is used for recording a target face image;
the face acquisition module is further used for controlling the camera to acquire a face image corresponding to a fourth voice signal if the voiceprint information detected for the fourth voice signal is second voiceprint information and the camera detects that human eyes are aligned with the terminal;
the face image matching module is used for matching the face image corresponding to the fourth voice signal with the recorded target face image;
the human-computer interaction response module 503 is further configured to respond to the voice instruction corresponding to the fourth voice signal if the face image corresponding to the fourth voice signal does not match the recorded target face image;
the face image recording module is further configured to take a face image corresponding to the fourth voice signal as target face information.
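A hedged sketch of this branch follows. Note that the patent prescribes behavior only when the face images do not match (respond, then adopt the new face as the target), so the matching branch is a no-op here; the faces_match, respond and record_target callables are placeholders, not APIs from the patent.

def handle_fourth_signal(face_image, target_face, eyes_on_terminal,
                         faces_match, respond, record_target):
    """Flow for a voice signal whose voiceprint (the second voiceprint
    information) differs from the recorded first voiceprint."""
    if not eyes_on_terminal:
        return False
    if not faces_match(face_image, target_face):
        respond()                  # answer the fourth voice instruction
        record_target(face_image)  # the new face becomes the target face
        return True
    return False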
Embodiments of the present application also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a human-computer interaction method, the method including:
when a first voice signal is detected, positioning a first sound source corresponding to the first voice signal;
if the positioning result of the first sound source meets a preset requirement, starting a camera, and detecting whether human eyes are aligned with the terminal through the camera;
and if the human eyes are detected to be aligned with the terminal, starting a man-machine interaction mode and responding to a voice instruction corresponding to the first voice signal.
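Tying the three steps together, here is a minimal orchestration sketch with injected callables; all names are assumptions of the sketch, and the preset requirement is reduced to a single distance check for brevity.

DISTANCE_THRESHOLD = 3.0  # assumed preset distance threshold, in meters

def on_voice_signal(signal, locate, start_camera, eyes_aligned, respond):
    """Locate the source, gate the camera on the positioning result, check
    eye alignment, then enter the interaction mode and respond."""
    distance_m, bearing_deg = locate(signal)
    if distance_m >= DISTANCE_THRESHOLD:  # preset requirement not met
        return False
    start_camera(bearing_deg)
    if not eyes_aligned():
        return False
    respond(signal)  # human-computer interaction mode engaged
    return True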
Storage medium: any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory, magnetic media (e.g., a hard disk), or optical storage; registers or other similar types of memory elements, etc. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in a first computer system in which the program is executed, or may be located in a different, second computer system connected to the first computer system through a network (such as the Internet). The second computer system may provide the program instructions to the first computer for execution. The term "storage medium" may include two or more storage media that may reside in different locations, such as in different computer systems connected by a network. The storage medium may store program instructions (e.g., embodied as a computer program) that are executable by one or more processors.
Of course, the storage medium provided in the embodiments of the present application and containing computer-executable instructions is not limited to the above-described human-computer interaction operation, and may also perform related operations in the human-computer interaction method provided in any embodiments of the present application.
The embodiment of the application provides an intelligent terminal, and the man-machine interaction device provided by the embodiment of the application can be integrated in the intelligent terminal. Fig. 6 is a schematic structural diagram of an intelligent terminal provided in an embodiment of the present application. The smart terminal 600 may include: a memory 601 and a processor 602, and a computer program stored on the memory and executable on the processor, wherein the processor 602 implements the human-computer interaction method according to the embodiment of the present application when executing the computer program.
When detecting the first voice signal, the intelligent terminal provided in this embodiment of the application can decide, based on the positioning of the first sound source, whether to start the camera to check the relation between the human eyes and the terminal, and can start the human-computer interaction mode and respond to the related voice instruction when the human eyes are detected to be aligned with the terminal. This avoids the cumbersome operation of waking the human-computer interaction mode with a keyword, simplifies the operation process of human-computer interaction, and also improves the efficiency of human-computer interaction.
Fig. 7 is a schematic structural diagram of another intelligent terminal provided in an embodiment of the present application, where the intelligent terminal may include: a housing (not shown), a memory 701, a Central Processing Unit (CPU) 702 (also called a processor, hereinafter referred to as CPU), a circuit board (not shown), and a power circuit (not shown). The circuit board is arranged in a space enclosed by the housing; the CPU 702 and the memory 701 are provided on the circuit board; the power circuit is used for supplying power to each circuit or device of the intelligent terminal; the memory 701 is used for storing executable program code; and the CPU 702 runs a computer program corresponding to the executable program code by reading the executable program code stored in the memory 701, so as to implement the following steps:
when a first voice signal is detected, positioning a first sound source corresponding to the first voice signal;
if the positioning result of the first sound source meets a preset requirement, starting a camera, and detecting whether human eyes are aligned with the terminal through the camera;
and if the human eyes are detected to be aligned with the terminal, starting a man-machine interaction mode and responding to a voice instruction corresponding to the first voice signal.
The intelligent terminal further comprises: a peripheral interface 703, RF (Radio Frequency) circuitry 705, audio circuitry 706, a speaker 711, a power management chip 708, an input/output (I/O) subsystem 709, other input/control devices 710, a touch screen 712, and an external port 704, which communicate via one or more communication buses or signal lines 707.
It should be understood that the illustrated smart terminal 700 is only one example of a smart terminal, and that the smart terminal 700 may have more or fewer components than shown in the figures, may combine two or more components, or may have a different configuration of components. The various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
The following describes in detail the intelligent terminal for human-computer interaction provided in this embodiment, taking a mobile phone as an example.
A memory 701, which can be accessed by the CPU 702, the peripheral interface 703, and the like. The memory 701 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.
A peripheral interface 703, which may connect input and output peripherals of the device to the CPU 702 and the memory 701.
An I/O subsystem 709, which may connect input and output peripherals on the device, such as the touch screen 712 and the other input/control devices 710, to the peripheral interface 703. The I/O subsystem 709 may include a display controller 7091 and one or more input controllers 7092 for controlling the other input/control devices 710. The one or more input controllers 7092 receive electrical signals from, or send electrical signals to, the other input/control devices 710, which may include physical buttons (push buttons, rocker buttons, etc.), dials, slide switches, joysticks, and click wheels. It is worth noting that the input controller 7092 may be connected to any one of the following: a keyboard, an infrared port, a USB interface, or a pointing device such as a mouse.
A touch screen 712, which is the input interface and the output interface between the intelligent terminal and the user, and displays visual output to the user; the visual output may include graphics, text, icons, video, and the like.
The display controller 7091 in the I/O subsystem 709 receives electrical signals from the touch screen 712 or sends electrical signals to the touch screen 712. The touch screen 712 detects a contact on the touch screen, and the display controller 7091 converts the detected contact into an interaction with a user interface object displayed on the touch screen 712, thereby implementing human-computer interaction; the user interface object displayed on the touch screen 712 may be an icon for running a game, an icon for connecting to a corresponding network, or the like. It is worth mentioning that the device may also include a light mouse, which is a touch-sensitive surface that does not display visual output, or an extension of the touch-sensitive surface formed by the touch screen.
The RF circuit 705 is mainly used to establish communication between the mobile phone and the wireless network (i.e., the network side) and to implement data reception and transmission between the mobile phone and the wireless network, such as sending and receiving short messages and e-mails. Specifically, the RF circuitry 705 receives and sends RF signals, also referred to as electromagnetic signals; the RF circuitry 705 converts electrical signals to and from electromagnetic signals and communicates with communication networks and other devices through the electromagnetic signals. The RF circuitry 705 may include known circuitry for performing these functions, including but not limited to an antenna system, an RF transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a codec (CODEC) chipset, a Subscriber Identity Module (SIM), and so forth.
The audio circuit 706 is mainly used to receive audio data from the peripheral interface 703, convert the audio data into an electric signal, and transmit the electric signal to the speaker 711.
The speaker 711 is used to convert the voice signal received by the handset from the wireless network through the RF circuit 705 into sound and play the sound to the user.
A power management chip 708, used for supplying power to, and managing the power of, the hardware connected to the CPU 702, the I/O subsystem, and the peripheral interface.
The human-computer interaction device, the storage medium and the intelligent terminal provided in the embodiments can execute the human-computer interaction method provided in any embodiment of the invention, and have corresponding functional modules and beneficial effects for executing the method. For technical details that are not described in detail in the above embodiments, reference may be made to a human-computer interaction method provided by any embodiment of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (9)

1. A human-computer interaction method, comprising:
when a first voice signal is detected, positioning a first sound source corresponding to the first voice signal;
if the positioning result of the first sound source meets a preset requirement, starting a camera, and detecting whether human eyes are aligned with the terminal through the camera;
if the human eyes are detected to be aligned with the terminal, starting a man-machine interaction mode and responding to a voice instruction corresponding to the first voice signal;
recording first voiceprint information corresponding to the first voice signal, and closing the camera;
when a second voice signal corresponding to the first voiceprint information is detected, determining the moving speed of a first sound source corresponding to the first voiceprint information;
and if the moving speed is smaller than a preset speed threshold value, responding to a voice instruction corresponding to the second voice signal.
2. The method of claim 1, wherein locating the first sound source corresponding to the first speech signal comprises:
determining the distance and the direction of a first sound source corresponding to the first voice signal relative to a terminal through a sound positioning technology;
correspondingly, starting the camera if the positioning result of the first sound source meets the preset requirement, and detecting through the camera whether human eyes are aligned with the terminal, comprises:
if the distance between the first sound source and the terminal is smaller than a preset distance threshold, starting the camera according to the direction of the first sound source relative to the terminal, and detecting through the camera whether human eyes are aligned with the terminal.
3. The method of claim 1, further comprising:
when the first voice signal is detected, acquiring voice content corresponding to the first voice signal;
correspondingly, the responding to the voice instruction corresponding to the first voice signal comprises:
and generating a voice instruction according to the voice content, and responding to the voice instruction.
4. The method of claim 1, wherein determining the moving speed of the first sound source corresponding to the first voiceprint information comprises:
acquiring a time interval between a first moment and a second moment, wherein the first moment comprises the moment of detecting the first voice signal, and the second moment comprises the moment of detecting the second voice signal;
obtaining the difference between the distances of the first sound source relative to the terminal at the first moment and at the second moment;
and calculating the moving speed of the first sound source corresponding to the first voiceprint information according to the time interval and the distance difference.
5. The method of claim 1, wherein after the initiating the human-computer interaction mode and responding to the voice command corresponding to the first voice signal, further comprising:
recording first voiceprint information corresponding to the first voice signal, and closing the camera;
when a third voice signal corresponding to the first voiceprint information is detected, if the time interval between the current moment and the third moment is judged to be greater than the effective duration of the first voiceprint information, the third voice signal is positioned, and if the positioning result meets the preset requirement, the camera is started, and human eye detection is carried out again through the camera;
wherein the third moment is the moment at which a voice signal corresponding to the first voiceprint information was last detected, and the effective duration of the first voiceprint information is the time interval between the last two detections of a voice signal corresponding to the first voiceprint information.
6. The method of claim 1, wherein after the initiating the human-computer interaction mode and responding to the voice command corresponding to the first voice signal, further comprising:
controlling a camera to acquire a face image corresponding to the first voice signal as a target face image and recording the target face image;
if the detected voiceprint information of the fourth voice signal is second voiceprint information and the camera detects that human eyes are aligned with the terminal equipment, controlling the camera to collect a face image corresponding to the fourth voice signal and matching the face image with a recorded target face image;
and if the face image corresponding to the fourth voice signal does not match the recorded target face image, responding to the voice instruction corresponding to the fourth voice signal, and taking the face image corresponding to the fourth voice signal as the target face information.
7. A human-computer interaction device, comprising:
the sound source positioning module is used for positioning a first sound source corresponding to a first voice signal when the first voice signal is detected;
the human eye alignment detection module is used for starting a camera if the positioning result of the first sound source meets a preset requirement and detecting whether human eyes are aligned with the terminal through the camera;
the human-computer interaction response module is used for starting a human-computer interaction mode and responding to a voice instruction corresponding to the first voice signal if the human eye is detected to be aligned with the terminal;
the information recording module is used for recording first voiceprint information corresponding to the first voice signal;
the camera control module is used for closing the camera after recording first voiceprint information corresponding to the first voice signal;
the speed determining module is used for determining the moving speed of a first sound source corresponding to the first voiceprint information when a second voice signal corresponding to the first voiceprint information is detected;
and the human-computer interaction response module is further used for responding to the voice instruction corresponding to the second voice signal if the moving speed is smaller than a preset speed threshold.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the human-computer interaction method according to any one of claims 1 to 6.
9. An intelligent terminal, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the human-computer interaction method according to any one of claims 1 to 6 when executing the computer program.
CN201810645687.2A 2018-06-21 2018-06-21 Man-machine interaction method and device, storage medium and intelligent terminal Active CN108766438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810645687.2A CN108766438B (en) 2018-06-21 2018-06-21 Man-machine interaction method and device, storage medium and intelligent terminal

Publications (2)

Publication Number Publication Date
CN108766438A CN108766438A (en) 2018-11-06
CN108766438B true CN108766438B (en) 2020-12-01

Family

ID=63979985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810645687.2A Active CN108766438B (en) 2018-06-21 2018-06-21 Man-machine interaction method and device, storage medium and intelligent terminal

Country Status (1)

Country Link
CN (1) CN108766438B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508687A (en) * 2018-11-26 2019-03-22 北京猎户星空科技有限公司 Man-machine interaction control method, device, storage medium and smart machine
CN111756986A (en) * 2019-03-27 2020-10-09 上海博泰悦臻电子设备制造有限公司 Camera control method, storage medium, device and electronic equipment with camera control device
CN112053696A (en) * 2019-06-05 2020-12-08 Tcl集团股份有限公司 Voice interaction method and device and terminal equipment
CN110730115B (en) * 2019-09-11 2021-11-09 北京小米移动软件有限公司 Voice control method and device, terminal and storage medium
CN110808048B (en) * 2019-11-13 2022-10-25 联想(北京)有限公司 Voice processing method, device, system and storage medium
CN111124512B (en) * 2019-12-10 2020-12-08 珠海格力电器股份有限公司 Awakening method, device, equipment and medium for intelligent equipment
CN111243583B (en) * 2019-12-31 2023-03-10 深圳市瑞讯云技术有限公司 System awakening method and device
CN111767785A (en) * 2020-05-11 2020-10-13 南京奥拓电子科技有限公司 Man-machine interaction control method and device, intelligent robot and storage medium
CN111681654A (en) * 2020-05-21 2020-09-18 北京声智科技有限公司 Voice control method and device, electronic equipment and storage medium
CN111681655A (en) * 2020-05-21 2020-09-18 北京声智科技有限公司 Voice control method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8140330B2 (en) * 2008-06-13 2012-03-20 Robert Bosch Gmbh System and method for detecting repeated patterns in dialog systems
EP2806335A1 (en) * 2013-05-23 2014-11-26 Delphi Technologies, Inc. Vehicle human machine interface with gaze direction and voice recognition
CN106653021A (en) * 2016-12-27 2017-05-10 上海智臻智能网络科技股份有限公司 Voice wake-up control method and device and terminal
CN107346661A (en) * 2017-06-01 2017-11-14 李昕 A kind of distant range iris tracking and acquisition method based on microphone array
CN107665708A (en) * 2016-07-29 2018-02-06 科大讯飞股份有限公司 Intelligent sound exchange method and system
CN108052079A (en) * 2017-12-12 2018-05-18 北京小米移动软件有限公司 Apparatus control method, device, plant control unit and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316637A (en) * 2017-05-31 2017-11-03 广东欧珀移动通信有限公司 Audio recognition method and Related product

Similar Documents

Publication Publication Date Title
CN108766438B (en) Man-machine interaction method and device, storage medium and intelligent terminal
US11393472B2 (en) Method and apparatus for executing voice command in electronic device
US11163356B2 (en) Device-facing human-computer interaction method and system
CN110574103B (en) Voice control method, wearable device and terminal
EP3951774A1 (en) Voice-based wakeup method and device
CN108537207B (en) Lip language identification method, device, storage medium and mobile terminal
CN107103905B (en) Method and product for speech recognition and information processing device
CN111063354B (en) Man-machine interaction method and device
KR102653450B1 (en) Method for response to input voice of electronic device and electronic device thereof
RU2735363C1 (en) Method and device for sound processing and data medium
WO2020088483A1 (en) Audio control method and electronic device
EP3933570A1 (en) Method and apparatus for controlling a voice assistant, and computer-readable storage medium
CN111696553A (en) Voice processing method and device and readable medium
CN111105792A (en) Voice interaction processing method and device
CN111681655A (en) Voice control method and device, electronic equipment and storage medium
WO2023006033A1 (en) Speech interaction method, electronic device, and medium
CN113744736B (en) Command word recognition method and device, electronic equipment and storage medium
CN107682579B (en) Incoming call reminding control method and device, storage medium and mobile terminal
CN111968680A (en) Voice processing method, device and storage medium
CN110989963B (en) Wake-up word recommendation method and device and storage medium
CN117953872A (en) Voice wakeup model updating method, storage medium, program product and equipment
CN115166633A (en) Sound source direction determining method, device, terminal and storage medium
CN116189718A (en) Voice activity detection method, device, equipment and storage medium
CN113380275A (en) Voice processing method and device, intelligent device and storage medium
CN112328765A (en) Dialog state quitting method, terminal device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant