CN111767785A - Man-machine interaction control method and device, intelligent robot and storage medium


Info

Publication number: CN111767785A
Application number: CN202010393351.9A
Authority: CN (China)
Prior art keywords: human, audio signal, user, voice, acquired
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 王华洋, 黄华, 周院平, 孙信中, 矫人全
Current assignee: Nanjing Aoto Electronics Co., Ltd. (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Nanjing Aoto Electronics Co., Ltd.
Application filed by: Nanjing Aoto Electronics Co., Ltd.
Priority date / filing date: 2020-05-11 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Publication date: 2020-10-13 (publication of CN111767785A)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00 Manipulators not otherwise provided for
    • B25J11/0005 Manipulators having means for high-level communication with users, e.g. speech generator, face recognition means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/19 Sensors therefor
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Abstract

The invention relates to a human-computer interaction control method and system, an intelligent robot and a storage medium. The method comprises: acquiring an audio signal; judging whether the acquired audio signal contains voice; when voice is detected, continuing to acquire the audio signal while synchronously capturing live images in a preset direction until the currently collected voice is judged to have ended; performing human eye sight detection on the captured live images to obtain human eye state data; and judging whether the human eye state data conforms to a gazing state, and if so, determining that the user to whom the eyes belong has an interaction intention. With this human-computer interaction control scheme, the interaction intention of a user can be recognized, responses to environmental noise and to users without interaction intention are avoided, and the human-computer interaction experience is improved; unnecessary data processing is also effectively reduced, lowering the system overhead.

Description

Man-machine interaction control method and device, intelligent robot and storage medium
Technical Field
The invention relates to the field of human-computer interaction, in particular to a human-computer interaction control method and device, an intelligent robot and a storage medium.
Background
With the continuous development of artificial intelligence, AI technology is adopted in more and more scenarios to interact with users, improving service efficiency, reducing waiting time, and enhancing the user experience. Face recognition and voice recognition closely match people's natural communication habits and therefore play an important role in human-computer interaction.
In the current human-computer interaction process, a robot gives feedback as soon as it recognizes a face or receives speech. That is, by default, any recognized user is treated as a user with interaction intention. In real scenarios, however, robots are usually deployed in places with heavy foot traffic, where several people may speak at the same time and other machines may be broadcasting audio. Affected by the acquisition angle and range of the image/audio and by the unpredictable behavior of users, the face recognized by the robot may belong to a passer-by or a distant user, and the received speech may be a distant voice or another machine's broadcast. Because the robot cannot determine whether the recognized user actually intends to interact, it may respond at random, which severely degrades the human-computer interaction experience.
Meanwhile, because it cannot determine whether the recognized user has an interaction intention, the robot may respond to a large number of irrelevant images and voices, performing much unnecessary data processing and increasing system overhead.
Disclosure of Invention
Therefore, it is necessary to provide a human-computer interaction control method and apparatus, an intelligent robot, and a storage medium that solve the problems of the existing human-computer interaction process: the inability to determine whether an identified user has an interaction intention, the resulting poor interaction experience, and the high system overhead.
An embodiment of the present application provides a human-computer interaction control method, including:
acquiring an audio signal;
judging whether the acquired audio signal contains voice;
when the acquired audio signal is judged to contain voice, continuing to acquire the audio signal and synchronously acquiring live images in a preset direction until the currently collected voice is judged to have ended;
performing human eye sight detection on the acquired live images to obtain human eye state data;
and judging whether the human eye state data conforms to the gazing state, and if so, determining that the user to whom the human eyes belong has an interaction intention.
In some embodiments, the step of acquiring the audio signal specifically comprises acquiring the audio signal when a user is detected in a preset area.
In some embodiments, the step of acquiring an audio signal when the user is detected in the preset area specifically includes:
collecting audio signals and images;
carrying out face detection on the acquired image;
and when the human face is detected in the acquired image, outputting the acquired audio signal.
In some embodiments, the step of acquiring an audio signal when the user is detected in the preset area specifically includes:
collecting audio signals and images to obtain the azimuth of a sound source;
carrying out face detection on the acquired image;
when a face is detected in the collected image, calculating the direction of the face;
and when the direction of the sound source is consistent with the direction of the face, outputting the collected audio signal.
In some embodiments, the method further comprises: responding to the acquired audio signal.
In some embodiments, the step of determining whether the human eye state data conforms to the gazing state, and if so, determining that the user to which the human eye belongs has an interaction intention specifically includes:
calculating the ratio of the number of live image frames whose human eye state data conforms to the gazing state to the total number of live image frames;
and when the ratio exceeds a preset threshold proportion, determining that the user to whom the human eyes belong has an interaction intention.
An embodiment of the present application further provides a human-computer interaction control apparatus, including:
an audio pickup unit, configured to acquire an audio signal;
a voice judging unit, configured to judge whether the acquired audio signal contains voice;
a synchronous acquisition unit, configured to continue acquiring the audio signal when the acquired audio signal is judged to contain voice, and to synchronously acquire live images in a preset direction until the currently collected voice is judged to have ended;
a sight detection unit, configured to perform human eye sight detection on the acquired live images to obtain human eye state data;
and an interaction intention judging unit, configured to judge whether the human eye state data conforms to the gazing state, and if so, to determine that the user to whom the human eyes belong has an interaction intention.
In some embodiments, the synchronous acquisition unit is further configured to capture live images in real time and cache them; when the voice judging unit judges that the acquired audio signal contains voice, the synchronous acquisition unit reads the live images in the preset direction from the cached data.
Another embodiment of the present application provides an intelligent robot, including the human-computer interaction control device according to any one of the foregoing embodiments.
An embodiment of the present application further provides a machine-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the human-computer interaction control method according to any one of the foregoing embodiments.
In the human-computer interaction control method provided by the embodiments of the application, voice detection is first performed on the acquired audio signal; human eye sight detection is carried out only when voice is detected, and whether the user has an interaction intention is judged from the gazing state of the eyes; the audio signal is responded to only when the user is determined to have an interaction intention. With this method, the interaction intention of a user can be recognized through eye sight detection, responses to environmental noise and to users without interaction intention are avoided, and the human-computer interaction experience is improved; unnecessary data processing is also effectively reduced, lowering the system overhead.
Drawings
Fig. 1 is a schematic flowchart of a human-computer interaction control method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating a human-computer interaction control method according to another embodiment of the present application;
fig. 3 is a schematic structural diagram of a human-computer interaction control device according to an embodiment of the present application.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments thereof which are illustrated in the appended drawings. In addition, the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
As shown in fig. 1, an embodiment of the present application discloses a human-computer interaction control method, including:
s100, acquiring an audio signal;
the man-machine interaction control method can be executed by a system or an intelligent robot. The following takes an intelligent robot as an execution subject, and a man-machine interaction control method of the embodiment is specifically described. It can be understood that the human-computer interaction control method can also be executed by a human-computer interaction control device, the system can be arranged on a front end and a back end which are mutually connected in a communication way, the front end can be an intelligent robot directly facing a user, and the back end can be a server used for processing data.
The intelligent robot may be provided with an audio pickup unit, such as a microphone, for acquiring an audio signal.
In some embodiments, the intelligent robot may have a definite interaction direction, and a user is more likely to have an interaction intention only when standing in that direction. The audio pickup unit on the intelligent robot can therefore adopt a directional microphone array that acquires only audio signals from a preset direction. To reduce interference from other directions, the audio signal from the preset direction can be enhanced while audio from other directions is suppressed.
In some embodiments, some users may not pay attention to the robot's interaction direction. To avoid missing such users, who have an interaction intention but are not in the interaction direction, the audio pickup unit can instead acquire audio signals from all directions rather than being limited to a specific one. The audio pickup unit can additionally obtain the azimuth of the sound source, so that during subsequent processing the intelligent robot can turn toward the sound source, capture a live image in that direction while facing the user, and respond to the audio signal, improving the human-computer interaction experience.
Since sound travels over long distances, a user may in some cases speak at a position far from the intelligent robot, yet the sound still reaches the robot and is collected. Because such a user is not near the robot, the user has no interaction intention. If the robot still executed step S100, acquired the audio signal, and proceeded to the subsequent steps, electric energy and system resources would be wasted. Therefore, in some embodiments, step S100 may specifically be: acquiring the audio signal when a user is detected in a preset area. This effectively filters out noise picked up when no user is nearby, reduces the amount of data processing, and lowers the system overhead.
The intelligent robot can be provided with a sensor, and the sensor can detect whether a user exists in a preset area of the robot. The sensor can be an infrared sensor, an ultrasonic sensor, a laser radar, a human body proximity sensor, a depth camera and other common sensors.
The sensor can detect not only human bodies but also non-human objects such as tables and chairs. To prevent such obstacles from causing a false judgment of whether a user is in the preset area, the detection data of the sensor can be further analyzed to judge whether the detected object is a human body. For example, the height and width of a human body fall within specific value ranges, so a judgment can be made from height and width; alternatively, since the contour of a human body has distinctive features, the data can be checked for the presence of a human contour.
Infrared sensors fall into two categories: those based on the photoelectric effect and those based on the thermal effect. In some embodiments, an infrared sensor based on the thermal effect is used. Such sensors detect a target through the infrared radiation emitted by the sensed object. Because non-human obstacles such as tables and chairs generally do not radiate infrared, a thermal-effect infrared sensor naturally filters out many non-human obstacles when detecting whether a user is in the preset area, reducing the amount of detection data and simplifying subsequent analysis and judgment.
When step S100 is triggered and an audio signal is acquired, the robot may keep acquiring the audio signal until the user stops speaking. In this way the complete speech of the user is available for subsequent voice recognition and processing, and none of the user's words are missed.
It can be understood that the audio pickup unit of the robot may also collect audio signals continuously and store them in a buffer; in step S100, the audio signal is then read from the buffer. The buffered audio can be limited to a set time span or data size, and buffered audio that is not read in time for subsequent processing is overwritten by newly acquired audio.
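As an illustration of this buffering strategy, the following Python sketch (a hypothetical example, not part of the patent) keeps only the most recent audio frames in a bounded buffer, so that unread data is automatically overwritten by newer data; the sample rate, frame length, and capacity are assumed values.

```python
from collections import deque

import numpy as np


class AudioRingBuffer:
    """Bounded audio buffer: frames older than the capacity are dropped,
    matching the overwrite behaviour described above (illustrative sketch)."""

    def __init__(self, max_seconds=5.0, sample_rate=16000, frame_len=512):
        max_frames = int(max_seconds * sample_rate / frame_len)
        self._frames = deque(maxlen=max_frames)  # old frames drop automatically

    def push(self, frame: np.ndarray) -> None:
        """Append one frame of mono samples."""
        self._frames.append(frame)

    def read_all(self) -> np.ndarray:
        """Return everything currently buffered as a single signal."""
        if not self._frames:
            return np.zeros(0, dtype=np.float32)
        return np.concatenate(list(self._frames))
```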
S200, judging whether the acquired audio signal contains voice;
Whether the acquired audio signal contains voice can be determined with a variety of existing techniques. For example, voice recognition can be performed directly on the audio signal to check whether any voice content is recognized; if no voice content is recognized, the audio signal is considered to contain no voice, and subsequent processing can be terminated without responding to the signal. Conversely, if voice content is recognized, the audio signal is judged to contain voice, and the subsequent steps are carried out for further judgment and processing.
In some embodiments, VAD (voice activity detection, also called voice endpoint detection or voice boundary detection) techniques may be used to determine whether the audio signal contains voice. VAD can be divided into two parts: feature extraction and speech/non-speech classification. The features used may be one or more of energy features (such as short-time energy and zero-crossing rate), frequency-domain features, cepstral features (such as Mel-frequency cepstral coefficients, MFCC), harmonic features, and long-term features. The speech/non-speech classification may rely on thresholds, statistical models, or machine learning. The statistical model may be, for example, a Gaussian mixture model (GMM), a Laplace distribution, a gamma distribution, or a hidden Markov model (HMM). With machine learning, training data are used to build a classification model for the scene; for example, a deep neural network (DNN), a universal background model (UBM), or a support vector machine (SVM) may be used for speech/non-speech classification. In one VAD scheme, a dual-threshold method based on short-time energy and zero-crossing rate is adopted.
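The dual-threshold idea can be illustrated with a short Python sketch. This is only an assumed, minimal frame-level implementation: it combines short-time energy with the zero-crossing rate, and all threshold values are illustrative rather than taken from the patent.

```python
import numpy as np


def short_time_energy(frame: np.ndarray) -> float:
    """Sum of squared samples of one frame."""
    return float(np.sum(frame.astype(np.float64) ** 2))


def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1
    return float(np.mean(signs[1:] != signs[:-1]))


def frame_is_speech(frame, energy_hi=1.0, energy_lo=0.1, zcr_hi=0.25):
    """Dual-threshold decision for a single frame (illustrative thresholds).

    High energy alone is taken as clear (voiced) speech; a moderate-energy
    frame is accepted only when its zero-crossing rate is high enough to
    suggest unvoiced speech such as fricatives rather than low-level noise.
    """
    e = short_time_energy(frame)
    z = zero_crossing_rate(frame)
    if e >= energy_hi:
        return True
    return e >= energy_lo and z >= zcr_hi
```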
S300, when the acquired audio signal is judged to contain voice, continuing to acquire the audio signal and synchronously acquiring live images in a preset direction until the currently collected voice is judged to have ended;
When the audio signal is judged to contain voice, a user can be considered to be speaking. However, it still needs to be determined whether the acquired audio signal was produced by a user with an interaction intention, so live images must be acquired for this subsequent judgment.
The camera on the intelligent robot acquires live images in a preset direction, which may be determined from the azimuth of the sound source. When the live image is captured, the camera or the robot can be rotated toward the sound source so that it directly faces the user, and the camera then shoots a live image facing the user. Because the direction of the sound source indicates the direction of the user who probably produced the voice, the acquired live image is likely to be an image of that user; facing the user also avoids large face angles in the subsequent eye sight detection, reducing errors caused by face angle and effectively improving the accuracy of the detection result.
It can be understood that in some cases the camera of the intelligent robot can shoot live images in real time and cache them; the multi-frame live images in the preset direction are then read from the cached data for subsequent processing only when the acquired audio signal is judged to contain voice.
In step S300, not only the live images but also the audio signal continue to be acquired, and the synchronous acquisition of live images and audio must continue until the currently collected voice is judged to have ended. To judge whether the currently collected voice has ended, the scheme of step S200, i.e., judging whether the acquired audio signal contains voice, may be reused. When it is determined that the audio signal no longer contains voice, the currently collected voice can be considered finished: the user has finished speaking and may be waiting for the robot's response.
A person may pause while speaking a complete utterance, for reasons such as thinking or breathing. Therefore, when judging whether the currently collected voice has ended, the voice may be considered finished only when no voice has been present in the acquired audio signal for a duration exceeding a pause time threshold. The pause time threshold may be set according to the practical situation, for example 0.5 s, 1 s, or 1.5 s.
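A minimal sketch of this end-of-utterance rule, assuming frame-level VAD decisions arrive one at a time, might look as follows; the 1 s default is simply one of the example pause thresholds above.

```python
import time


class EndOfSpeechDetector:
    """Declares the utterance finished once no speech has been observed for
    longer than `pause_threshold_s` (illustrative sketch)."""

    def __init__(self, pause_threshold_s=1.0):
        self.pause_threshold_s = pause_threshold_s
        self._last_speech_time = None

    def update(self, frame_has_speech, now=None):
        """Feed one VAD decision; return True when the utterance has ended."""
        now = time.monotonic() if now is None else now
        if frame_has_speech:
            self._last_speech_time = now
            return False
        if self._last_speech_time is None:
            return False  # nothing has been spoken yet
        return (now - self._last_speech_time) > self.pause_threshold_s
```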
S400, performing human eye sight detection on the acquired live images to obtain human eye state data;
After the live images are acquired, human eye sight detection can be carried out to obtain the human eye state data, which represents the sight direction of the user in each live image.
For eye sight detection, an image of the human eye region must be obtained first. This can be done by first performing face detection on the live image to obtain a face region and then cropping the eye region from it, or by detecting the eye region directly in the live image. Step S400 is described below using the latter approach as an example.
The image of the human eye region can be extracted from the live image using an existing eye detection algorithm, for example a gradient-based Hough transform, eye detection based on Haar-like features, region segmentation, edge extraction, symmetry transform, the AdaBoost algorithm, gray-scale integral projection, or template matching.
After the image of the human eye region is obtained, eye sight detection can be carried out using an existing eye sight detection algorithm.
In some embodiments, eye sight detection can be based on changes in the size of the eye contour. When the user looks down, the visible eye socket is smallest, so looking down can be identified from the eye-socket size; the ratio of the distance from the upper edge of the eye socket to the eye center over the full eye-socket height distinguishes looking up from looking down; and the distances from the pupil to the left and right edges of the eye socket distinguish looking left, looking straight ahead, and looking right. Combining the three vertical directions (up, level, down) with the three horizontal directions (left, center, right) gives 9 possible sight directions in total. By analyzing the live image in this way, the sight direction can be judged and the human eye state data of that live image frame obtained.
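A rough Python sketch of this 9-direction classification is given below. It assumes that the eye-socket bounding edges and the pupil center (in pixel coordinates) have already been obtained by one of the eye detection algorithms above; the split ratios and the minimum socket height are illustrative assumptions, not values from the patent.

```python
def classify_gaze(orbit_top, orbit_bottom, orbit_left, orbit_right,
                  pupil_x, pupil_y, min_orbit_height=8):
    """Classify gaze into one of 9 directions from eye-socket and pupil geometry.

    Coordinates are pixels with y increasing downwards; thresholds (0.4/0.6
    splits, minimum socket height) are illustrative only.
    """
    orbit_h = orbit_bottom - orbit_top
    orbit_w = orbit_right - orbit_left
    if orbit_h < min_orbit_height:            # socket nearly closed: looking down
        vertical = "down"
    else:
        v_ratio = (pupil_y - orbit_top) / orbit_h
        vertical = "up" if v_ratio < 0.4 else "down" if v_ratio > 0.6 else "level"
    h_ratio = (pupil_x - orbit_left) / orbit_w
    horizontal = "left" if h_ratio < 0.4 else "right" if h_ratio > 0.6 else "center"
    return vertical, horizontal   # e.g. ("level", "center") ~ looking at the robot
```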
In some embodiments, the angle of eyeball rotation and thus the angle of the direction of gaze of the human eye may be calculated based on the change in the position of the pupil center relative to the orbit center.
It will be appreciated that other eye gaze detection algorithms may also be used, as long as eye state data about the gaze direction is available.
S600, judging whether the human eye state data conforms to the gazing state, and if so, determining that the user to whom the human eyes belong has an interaction intention.
The acquired live images correspond to the complete voice currently collected, so they comprise multiple frames spanning the duration of the collected voice. Human eye sight detection is performed on all acquired live image frames to obtain the human eye state data of each frame. Each frame represents one instant in time, so the human eye state data of all frames together describe how the user's sight direction changes over the duration of the collected voice.
The gazing state can be understood as the user's sight direction falling on the intelligent robot. Since the acquired live image is captured facing the sound source, it can be considered to face the user. Therefore, by comparing the height of the intelligent robot with the height of the user's face, the sight direction the user would have when gazing at the robot can be determined; that sight direction defines the gazing state.
For example, if the user's face is higher than the intelligent robot, the user looks down when gazing at the robot, so the sight direction of the gazing state is looking down. If the user's face is at the same height as the robot, the gazing state corresponds to looking straight ahead. If the user's face is lower than the robot, the gazing state corresponds to looking up.
Whether the human eye state data conforms to the gazing state is judged by checking whether the detected sight direction matches the sight direction corresponding to the gazing state. Because the live images comprise multiple frames spanning the collected voice duration, the human eye state data likewise cover multiple frames.
In some embodiments, the user is determined to have an interaction intention as long as the ratio of the number of live image frames whose human eye state data conforms to the gazing state to the total number of live image frames exceeds a preset threshold proportion. The threshold may be set as needed, for example to 10%, 50%, or 80%.
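This frame-ratio rule can be expressed in a few lines of Python; the gaze label and the 80% threshold below are assumed examples of the quantities described above.

```python
def has_interaction_intent(eye_states, gaze_label=("level", "center"),
                           threshold_ratio=0.8):
    """True when the fraction of frames in the gazing state exceeds the
    preset threshold proportion (sketch with an example 80% threshold)."""
    if not eye_states:
        return False
    gazing_frames = sum(1 for s in eye_states if s == gaze_label)
    return gazing_frames / len(eye_states) > threshold_ratio
```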
In some embodiments, the live image frames whose human eye state data conforms to the gazing state must be continuous in time, and their duration must reach a time threshold, before the user to whom the eyes belong is determined to have an interaction intention. The time threshold may be set according to the situation, for example 5 s or 10 s, or it may depend on the duration of the collected voice: a longer voice yields a longer threshold, a shorter voice a shorter one. When speaking, a user usually focuses on the interaction object and keeps that focus for at least some time. If other users stand around the speaking user, they are more likely to look at the speaker than at the intelligent robot. Requiring the gazing-state frames to be continuous and to exceed a specific duration therefore effectively identifies the user with the interaction intention and filters out surrounding users who are not speaking.
It will be appreciated that while speaking the user may briefly glance away from the intelligent robot but will immediately turn the gaze back as long as the conversation continues. To handle this case, several temporally adjacent sets of frames may be merged, where each set is a group of temporally continuous live image frames whose human eye state data conforms to the gazing state, and the duration of the merged frames is then checked against the time threshold.
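The sketch below illustrates one way to implement the continuous-duration check while tolerating brief glances away; the gap tolerance is an assumed parameter, not a value specified by the patent.

```python
def longest_gaze_duration(frame_times, frame_is_gazing, max_gap_s=0.5):
    """Longest continuous gaze span in seconds, merging gazing segments that
    are separated by glances away shorter than `max_gap_s` (assumed value).

    `frame_times` are per-frame timestamps in ascending order and
    `frame_is_gazing` the corresponding booleans from the gaze check.
    """
    gaze_times = [t for t, g in zip(frame_times, frame_is_gazing) if g]
    if not gaze_times:
        return 0.0
    longest = 0.0
    span_start = prev = gaze_times[0]
    for t in gaze_times[1:]:
        if t - prev > max_gap_s:              # gap too long: close this span
            longest = max(longest, prev - span_start)
            span_start = t
        prev = t
    return max(longest, prev - span_start)
```

The user would then be judged to have an interaction intention when, for example, the returned duration reaches the chosen time threshold.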
In some embodiments, the human eye state data of all live image frames must conform to the gazing state before the user to whom the eyes belong is determined to have an interaction intention.
In some embodiments, the human eye state data may include the angle of the eye's sight direction, and the sight direction of the gazing state may likewise be expressed as an angle. Judging whether the human eye state data conforms to the gazing state then amounts to checking whether the two angles are equal or differ by no more than a preset error range.
Once the human eye state data is judged to conform to the gazing state, the user to whom the eyes in the live image belong is very likely the person who uttered the voice, i.e., a person with an interaction intention, and the acquired audio signal needs to be responded to.
In some embodiments, as shown in fig. 2, the human-computer interaction control method further includes, S700, responding to the acquired audio signal.
Responding to the audio signal may involve performing voice recognition on it, obtaining response data according to the recognition result, and performing the corresponding response operation. For example, if the recognition result shows that the user is making a service inquiry, the service database can be searched for an answer, which is then delivered to the user, for instance by voice broadcast or on a display screen. If the recognition result shows that the user wants to transact a service, a queue ticket can be issued, the user can be guided to the corresponding service window, or the transaction can be handled directly according to the service logic.
The intelligent robot may be provided with an interactive interface, which can be its display screen or an operation panel. During interaction, users generally prefer to face the interaction object directly, which helps maintain engagement. The interactive interface of the intelligent robot, however, is usually fixed at a preset height, while users differ in height; if the interface keeps a fixed angle, such as vertical, it cannot face the faces of users of every height. Therefore, in step S700, before responding to the acquired audio signal, the pitch angle of the interactive interface may be adjusted so that the interface directly faces the user's face, improving the user experience.
When face detection is performed on the live image, the pitch deviation angle of the user's face relative to the intelligent robot can be obtained. For example, face detection yields the height of the user's face and the distance between the user and the robot; combined with the height of the robot's interactive interface, the pitch deviation angle of the face relative to the interface can be calculated. The pitch angle of the interface is then adjusted according to this deviation angle so that the interface directly faces the user's face.
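The underlying geometry is a simple right triangle, as the hedged sketch below shows; the three inputs (face height, interface height, distance) are assumed to come from face detection and a distance measurement rather than from any specific API.

```python
import math


def pitch_offset_deg(face_height_m, interface_height_m, distance_m):
    """Pitch deviation angle (degrees) from the interactive interface to the
    user's face; positive means the face is above the interface, so the
    interface should be tilted upwards by roughly this angle."""
    return math.degrees(math.atan2(face_height_m - interface_height_m,
                                   distance_m))
```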
Before responding to the audio signal, face recognition can also be performed on the user with the interaction intention to obtain the user's identity information, and the response can then take this identity information into account.
In some embodiments, the live image may still contain multiple faces and thus multiple pairs of eyes. In step S400, eye sight detection is performed on all the eyes to obtain human eye state data for each user; in step S600, whether the human eye state data of each user conforms to the gazing state is judged separately, and each user whose data conforms is determined to have an interaction intention, requiring a response to the acquired audio signal.
If step S600 determines that several users have an interaction intention, face recognition may be performed on each of them to obtain their identity information, a response priority is determined from that information, and the user with the highest priority is responded to first. While responding, the intelligent robot can turn toward the high-priority user so that it faces the user being served.
In the human-computer interaction control method provided by the embodiments of the application, voice detection is first performed on the acquired audio signal; human eye sight detection is carried out only when voice is detected, and whether the user has an interaction intention is judged from the gazing state of the eyes; the audio signal is responded to only when the user is determined to have an interaction intention. With this method, the interaction intention of a user can be recognized through eye sight detection, responses to environmental noise and to users without interaction intention are avoided, and the human-computer interaction experience is improved; unnecessary data processing is also effectively reduced, lowering the system overhead.
In some embodiments, in step S100 the audio signal may be acquired only when a user is detected in the preset area, which effectively filters out noise picked up when no user is nearby, reduces the amount of data processing, and lowers the system overhead.
Image recognition may also be used to detect whether a user is in the preset area. For example, images can be captured by a camera on the intelligent robot and face detection performed on them; if a face is detected, a user is judged to be present in the preset area, otherwise no user is judged to be present. Thus, step S100 may specifically comprise:
collecting audio signals and images;
carrying out face detection on the acquired image;
and when the human face is detected in the acquired image, outputting the acquired audio signal.
Face detection may be implemented with existing methods, such as template matching, shape-and-edge methods, texture-feature methods, color-feature methods, support vector machines, hidden Markov models, the AdaBoost algorithm, or neural networks.
For example, multiple frames of images may be collected; it suffices that a face is found in any single frame to judge that a face has been detected, and only if no face is found in any of the collected frames is it judged that no face has been detected.
Alternatively, face detection may be performed on all the collected frames, and a frame-count threshold for face presence may be preset: a face is judged to be detected only when the number of frames containing a face exceeds this threshold. In this way only faces that appear stably are used for subsequent processing, avoiding misjudgments caused by users who merely pass by.
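A trivial sketch of this frame-count rule follows; the threshold of 3 frames is an assumed example, not a value given in the patent.

```python
def face_stably_present(per_frame_has_face, min_face_frames=3):
    """True when a face appears in at least `min_face_frames` of the collected
    frames, filtering out faces that show up only momentarily."""
    return sum(1 for has_face in per_frame_has_face if has_face) >= min_face_frames
```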
To improve the precision of face detection and avoid misjudgment, face quality evaluation can be applied to the detected faces, and only a face meeting a preset quality requirement is judged to be a face in the live image. Existing face quality evaluation methods can be used, such as a weighted score of face symmetry, sharpness, brightness quality, and image resolution, a patch-based face image quality evaluation algorithm, a method using low-level features, or a method based on a convolutional neural network.
In some scenes, a display screen, poster, or painting may be placed near the intelligent robot, and the displayed pictures may inevitably contain people. When collecting images, the robot may therefore capture faces shown on the screen or on posters and paintings at the site. To avoid this interference, liveness detection is performed together with face detection on the collected images. Existing liveness detection algorithms from face recognition can be used, for example a CNN-based method such as that of Atoum and Liu (Yousef Atoum, Xiaoming Liu et al., Face Anti-Spoofing Using Patch and Depth-Based CNNs, 2017), or the method of Song et al. combining SPMT and TFBD (Xiao Song, Xu Zhao, Liangji Yang, Tianwei Lin, Discriminative Representation Combinations for Accurate Face Spoofing Detection, Pattern Recognition, 2018). The collected audio signal is output only when a face is detected and passes liveness detection.
For a speaking user, the azimuth of the collected sound source should be consistent with the azimuth of that user's face; otherwise the voice of user A might be collected while the face of user B is photographed, causing an erroneous judgment. The audio pickup unit on the intelligent robot can therefore obtain the azimuth of the sound source while acquiring the audio, and before the collected audio signal is output, whether the face azimuth is consistent with the sound-source azimuth can additionally be checked. Step S100 may thus also specifically comprise:
collecting audio signals and images to obtain the azimuth of a sound source;
carrying out face detection on the acquired image;
when a face is detected in the collected image, calculating the direction of the face;
and when the direction of the sound source is consistent with the direction of the face, outputting the collected audio signal.
The azimuth of the face is obtained by analyzing the collected image. The sound-source azimuth and the face azimuth are judged to be consistent when their difference is smaller than a preset azimuth-difference threshold, which is chosen according to the detection error of the device and may be set to 10°, for example.
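A small sketch of this consistency check, using the 10° example threshold and folding the difference to account for angle wrap-around:

```python
def azimuths_match(sound_azimuth_deg, face_azimuth_deg, max_diff_deg=10.0):
    """True when the sound-source and face azimuths agree within the preset
    tolerance; the difference is folded into [0, 180] degrees."""
    diff = abs(sound_azimuth_deg - face_azimuth_deg) % 360.0
    diff = min(diff, 360.0 - diff)
    return diff <= max_diff_deg
```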
In this way, before the audio signal is processed, it can be judged whether the signal was produced by a user with an interaction intention; audio that clearly does not come from such a user can be screened out, effectively reducing the data volume.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art will recognize that the embodiments are not limited by the order of actions described, as some steps may be performed in other orders or concurrently. Those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and not every action involved is necessarily required by the embodiments of the application.
As shown in fig. 3, an embodiment of the present application discloses a human-computer interaction control apparatus, including:
an audio pickup unit 100, configured to acquire an audio signal;
a voice judging unit 200, configured to judge whether the acquired audio signal contains voice;
a synchronous acquisition unit 300, configured to continue acquiring the audio signal when the acquired audio signal is judged to contain voice, and to synchronously acquire live images in a preset direction until the currently collected voice is judged to have ended;
a sight detection unit 400, configured to perform human eye sight detection on the acquired live images to obtain human eye state data;
and an interaction intention judging unit 600, configured to judge whether the human eye state data conforms to the gazing state, and if so, to determine that the user to whom the human eyes belong has an interaction intention.
For specific working modes and principles of the audio pickup unit 100, the voice judging unit 200, the synchronous acquisition unit 300, the line of sight detecting unit 400 and the interaction intention judging unit 600, reference may be made to the description in the foregoing method embodiments, and details are not repeated herein.
In some embodiments, the human-computer interaction control device may further include a response processing unit 700 for responding to the audio signal. The operation of the response processing unit 700 can be seen from the description of the previous method embodiment.
In some embodiments, the audio pickup unit 100 is specifically configured to acquire the audio signal when a user is detected within a preset area. The human-computer interaction control device may also be provided with a sensor for detecting whether a user is present in the preset area. This effectively filters out noise picked up when no user is nearby, reduces the amount of data processing, and lowers the system overhead.
In some embodiments, the audio pickup unit 100 may specifically include:
the signal acquisition module is used for acquiring audio signals and images;
the face detection module is used for carrying out face detection on the acquired image;
and the signal output module is used for outputting the acquired audio signal when the face is detected in the acquired image.
For the specific operation of the signal acquisition module, the face detection module, and the signal output module, refer to the description in the foregoing method embodiments. This likewise filters out noise picked up when no user is nearby, reduces the amount of data processing, and lowers the system overhead.
In some embodiments, the face detection module is further configured to perform face quality evaluation on the detected face, and when the detected face meets a preset face quality requirement, it may be determined that the face is detected in the acquired image. Therefore, the accuracy of face detection can be improved, and misjudgment is avoided.
In some embodiments, the face detection module is further configured to perform face detection and live body detection at the same time, so as to avoid interference caused by a face on a display screen, a poster, or a painting around the intelligent robot.
In some embodiments, the audio pickup unit 100 may also specifically include:
the signal acquisition positioning module is used for acquiring audio signals and images and acquiring the azimuth of a sound source;
the face detection module is used for carrying out face detection on the acquired image;
the face direction calculation module is used for calculating the direction of the face when the face is detected in the acquired image;
and the signal output module is used for outputting the collected audio signals when the position of the sound source is judged to be consistent with the position of the human face.
For the specific operation of the signal acquisition and positioning module, the face detection module, the face direction calculation module, and the signal output module, refer to the description in the foregoing method embodiments. Audio that clearly does not come from a user with an interaction intention can be screened out, the situation in which user A's voice is collected while user B's face is photographed is avoided, and the data volume is effectively reduced.
In some application scenarios, the human-computer interaction control device is distributed between an intelligent robot at the front end and a server at the back end, with the robot and the server communicatively connected; the audio pickup unit 100 and the synchronous acquisition unit 300 may be arranged on the intelligent robot, while the voice judging unit 200, the sight detection unit 400, and the interaction intention judging unit 600 may be arranged on the back-end server. The back-end server can then handle the data processing and analysis, reducing the data-processing requirements on the intelligent robot, lowering its cost, and facilitating deployment and expansion of the human-computer interaction control device.
In some embodiments, the synchronous acquisition unit 300 may capture live images in real time and cache them; only when the voice judging unit 200 judges that the acquired audio signal contains voice does the synchronous acquisition unit 300 read the live images in the preset direction from the cached data for subsequent processing.
In the human-computer interaction control scheme provided by the embodiments of the application, voice detection is first performed on the acquired audio signal; human eye sight detection is carried out only when voice is detected, and whether the user has an interaction intention is judged from the gazing state of the eyes; the audio signal is responded to only when the user is determined to have an interaction intention. With this scheme, the interaction intention of a user can be recognized through eye sight detection, responses to environmental noise and to users without interaction intention are avoided, and the human-computer interaction experience is improved; unnecessary data processing is also effectively reduced, lowering the system overhead.
An embodiment of the present application further provides an intelligent robot, which may include the aforementioned human-computer interaction control device.
An embodiment of the present application provides a machine-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the human-computer interaction control method described in any of the above embodiments.
The system/computer device integrated components/modules/units, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method according to the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, etc. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
In the several embodiments provided in the present invention, it should be understood that the disclosed system and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative, and for example, the division of the components is only one logical division, and other divisions may be realized in practice.
In addition, each functional module/component in each embodiment of the present invention may be integrated into the same processing module/component, or each module/component may exist alone physically, or two or more modules/components may be integrated into the same module/component. The integrated modules/components can be implemented in the form of hardware, or can be implemented in the form of hardware plus software functional modules/components.
It will be evident to those skilled in the art that the embodiments of the present invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention are capable of being embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. Several units, modules or means recited in the system, apparatus or terminal claims may also be implemented by one and the same unit, module or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A human-computer interaction control method is characterized by comprising the following steps:
acquiring an audio signal;
judging whether the acquired audio signal contains voice;
when the acquired audio signal is judged to contain voice, continuing to acquire the audio signal and synchronously acquiring live images in a preset direction until the currently collected voice is judged to have ended;
performing human eye sight detection on the acquired live images to obtain human eye state data;
and judging whether the human eye state data conforms to the gazing state, and if so, determining that the user to whom the human eyes belong has an interaction intention.
2. The human-computer interaction control method according to claim 1, wherein the step of acquiring the audio signal comprises acquiring the audio signal when a user is detected in a preset area.
3. The human-computer interaction control method according to claim 2, wherein the step of acquiring the audio signal when the user is detected in the preset area specifically comprises:
collecting audio signals and images;
carrying out face detection on the acquired image;
and when the human face is detected in the acquired image, outputting the acquired audio signal.
4. The human-computer interaction control method according to claim 2, wherein the step of acquiring the audio signal when the user is detected in the preset area specifically comprises:
collecting audio signals and images to obtain the azimuth of a sound source;
carrying out face detection on the acquired image;
when a face is detected in the collected image, calculating the direction of the face;
and when the direction of the sound source is consistent with the direction of the face, outputting the collected audio signal.
5. The human-computer interaction control method according to claim 1, further comprising: responding to the acquired audio signal.
6. The human-computer interaction control method according to claim 1, wherein the step of determining whether the human eye state data conforms to the gazing state, and if so, determining that the user to which the human eye belongs has the interaction intention specifically comprises:
calculating the ratio of the number of live image frames whose human eye state data conforms to the gazing state to the total number of live image frames;
and when the ratio exceeds a preset threshold proportion, determining that the user to whom the human eyes belong has an interaction intention.
7. A human-computer interaction control device, comprising:
an audio pickup unit for acquiring an audio signal;
a voice judging unit for judging whether the acquired audio signal contains voice;
a synchronous acquisition unit for continuing to acquire the audio signal when the acquired audio signal is judged to contain voice, and for synchronously acquiring live images in a preset direction until the currently collected voice is judged to have ended;
a gaze detection unit for performing human-eye gaze detection on the acquired live images to obtain human-eye state data;
and an interaction-intention judging unit for judging whether the human-eye state data conforms to a gazing state and, if so, determining that the user to whom the eyes belong has an interaction intention.
8. The human-computer interaction control device according to claim 7, wherein the synchronous acquisition unit is further configured to capture live images in real time and cache them; and when the voice judging unit judges that the acquired audio signal contains voice, the synchronous acquisition unit reads the live images in the preset direction from the cached data.
9. An intelligent robot comprising the human-computer interaction control device according to any one of claims 7 to 8.
10. A machine-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the human-computer interaction control method of any one of claims 1 to 6.
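
Purely for illustration (and not part of the claims), the overall control flow recited in claims 1 and 6 can be sketched in Python as follows. The callables get_audio_frame, has_voice, capture_live_image and eye_gaze_state, as well as the 0.6 threshold, are assumptions introduced for this sketch; the claims themselves only require "a preset threshold".

```python
GAZE_RATIO_THRESHOLD = 0.6  # assumed value; the claims only require "a preset threshold"


def detect_interaction_intent(get_audio_frame, has_voice,
                              capture_live_image, eye_gaze_state):
    """Return True if the speaking user is judged to have an interaction intention."""
    # Acquire an audio signal and check whether it contains voice.
    if not has_voice(get_audio_frame()):
        return False

    # Keep acquiring audio and synchronously grab live images in the preset
    # direction until the current utterance is judged to have ended.
    live_images = []
    while has_voice(get_audio_frame()):
        live_images.append(capture_live_image())

    if not live_images:
        return False

    # Run human-eye gaze detection on every frame and count those whose
    # eye-state data conforms to the gazing state (claim 6).
    gazing = sum(1 for img in live_images if eye_gaze_state(img) == "gazing")
    return gazing / len(live_images) > GAZE_RATIO_THRESHOLD
```

Because the hardware-facing pieces are passed in as callables, the sketch can be exercised with simple stubs before any microphone, camera or gaze estimator is wired in.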
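Claims 2 and 3 gate the audio output on the presence of a user in the preset area. A minimal sketch under the assumption that a detected face stands in for user presence; capture_audio_frame, capture_image and detect_faces are hypothetical callables, not APIs named in the patent:

```python
def gated_audio_stream(capture_audio_frame, capture_image, detect_faces):
    """Only yield the collected audio signal once a user (approximated by a
    detected face) is present in the preset area."""
    while True:
        audio_frame = capture_audio_frame()   # keep collecting audio
        image = capture_image()               # and images of the preset area
        if detect_faces(image):               # a user has been detected
            yield audio_frame                 # output the collected audio
        # otherwise the frame is dropped and no downstream speech
        # processing is triggered, which keeps system overhead low
```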
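Claim 4 additionally requires the sound-source direction to be consistent with the face direction before the audio is output. One possible way to compare the two directions, assuming a pinhole camera model and a 15-degree tolerance (both assumptions, not stated in the patent):

```python
def face_azimuth_deg(face_center_x, image_width, horizontal_fov_deg):
    """Approximate the horizontal direction of a detected face from its pixel
    position, assuming a pinhole camera with the given horizontal field of view."""
    offset = (face_center_x - image_width / 2.0) / (image_width / 2.0)  # -1 .. 1
    return offset * (horizontal_fov_deg / 2.0)


def directions_consistent(sound_azimuth_deg, face_direction_deg, tolerance_deg=15.0):
    """Treat the sound-source and face directions as consistent when they differ
    by no more than tolerance_deg (angles wrapped to the -180..180 range)."""
    diff = abs((sound_azimuth_deg - face_direction_deg + 180.0) % 360.0 - 180.0)
    return diff <= tolerance_deg


# Example: a face centred at pixel 400 in a 1280-pixel-wide image seen through
# a 90-degree lens lies at roughly -17 degrees, which is consistent with a
# sound source localised at -20 degrees.
assert directions_consistent(-20.0, face_azimuth_deg(400, 1280, 90.0))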
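Claim 8 has the synchronous acquisition unit capture live images continuously into a cache and read them back once voice is detected, so the frames from the very start of the utterance are not lost. A sketch using a bounded buffer; the buffer size and the use of monotonic timestamps are assumptions made only for the sketch:

```python
from collections import deque
import time


class LiveImageCache:
    """Continuously cache captured live images and read them back on demand."""

    def __init__(self, max_frames=150):            # ~5 s at 30 fps (assumed)
        self._frames = deque(maxlen=max_frames)    # old frames are evicted

    def push(self, image):
        """Called by the real-time capture loop for every new frame."""
        self._frames.append((time.monotonic(), image))

    def read_since(self, start_time):
        """Return the cached frames captured at or after start_time, e.g. the
        moment the voice judging unit first reported voice."""
        return [img for ts, img in self._frames if ts >= start_time]
```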
CN202010393351.9A 2020-05-11 2020-05-11 Man-machine interaction control method and device, intelligent robot and storage medium Pending CN111767785A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010393351.9A CN111767785A (en) 2020-05-11 2020-05-11 Man-machine interaction control method and device, intelligent robot and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010393351.9A CN111767785A (en) 2020-05-11 2020-05-11 Man-machine interaction control method and device, intelligent robot and storage medium

Publications (1)

Publication Number Publication Date
CN111767785A (en) 2020-10-13

Family

ID=72719169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010393351.9A Pending CN111767785A (en) 2020-05-11 2020-05-11 Man-machine interaction control method and device, intelligent robot and storage medium

Country Status (1)

Country Link
CN (1) CN111767785A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017035768A1 (en) * 2015-09-01 2017-03-09 涂悦 Voice control method based on visual wake-up
CN106203052A (en) * 2016-08-19 2016-12-07 乔中力 Intelligent LED exchange method and device
CN108766438A (en) * 2018-06-21 2018-11-06 Oppo广东移动通信有限公司 Man-machine interaction method, device, storage medium and intelligent terminal
CN109410957A (en) * 2018-11-30 2019-03-01 福建实达电脑设备有限公司 Positive human-computer interaction audio recognition method and system based on computer vision auxiliary
CN109767258A (en) * 2018-12-15 2019-05-17 深圳壹账通智能科技有限公司 Intelligent shopping guide method and device based on eyes image identification
CN109902630A (en) * 2019-03-01 2019-06-18 上海像我信息科技有限公司 A kind of attention judgment method, device, system, equipment and storage medium
CN109934187A (en) * 2019-03-19 2019-06-25 西安电子科技大学 Based on face Activity determination-eye sight line random challenge response method
CN110223690A (en) * 2019-06-10 2019-09-10 深圳永顺智信息科技有限公司 The man-machine interaction method and device merged based on image with voice
CN110335600A (en) * 2019-07-09 2019-10-15 四川长虹电器股份有限公司 The multi-modal exchange method and system of household appliance

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114727120A (en) * 2021-01-04 2022-07-08 腾讯科技(深圳)有限公司 Method and device for acquiring live broadcast audio stream, electronic equipment and storage medium
CN114727120B (en) * 2021-01-04 2023-06-09 腾讯科技(深圳)有限公司 Live audio stream acquisition method and device, electronic equipment and storage medium
CN113147779A (en) * 2021-04-29 2021-07-23 前海七剑科技(深圳)有限公司 Vehicle control method and device
CN113241073A (en) * 2021-06-29 2021-08-10 深圳市欧瑞博科技股份有限公司 Intelligent voice control method and device, electronic equipment and storage medium
CN113241073B (en) * 2021-06-29 2023-10-31 深圳市欧瑞博科技股份有限公司 Intelligent voice control method, device, electronic equipment and storage medium
WO2023272629A1 (en) * 2021-06-30 2023-01-05 华为技术有限公司 Interface control method, device, and system
CN113496697A (en) * 2021-09-08 2021-10-12 深圳市普渡科技有限公司 Robot, voice data processing method, device and storage medium
CN113496697B (en) * 2021-09-08 2021-12-28 深圳市普渡科技有限公司 Robot, voice data processing method, device and storage medium
CN113894783A (en) * 2021-10-12 2022-01-07 北京声智科技有限公司 Interaction method, device and equipment of robot and computer readable storage medium
CN113821108A (en) * 2021-11-23 2021-12-21 齐鲁工业大学 Robot remote control system and control method based on multi-mode interaction technology

Similar Documents

Publication Publication Date Title
CN111767785A (en) Man-machine interaction control method and device, intelligent robot and storage medium
JP7337699B2 (en) Systems and methods for correlating mouth images with input commands
CN107346661B (en) Microphone array-based remote iris tracking and collecting method
US11854550B2 (en) Determining input for speech processing engine
CN109410957B (en) Front human-computer interaction voice recognition method and system based on computer vision assistance
Hunke et al. Face locating and tracking for human-computer interaction
CN108470169A (en) Face identification system and method
Wimmer et al. Low-level fusion of audio and video feature for multi-modal emotion recognition
CN111341350A (en) Man-machine interaction control method and system, intelligent robot and storage medium
JPWO2005114576A1 (en) Operation content determination device
CN106157956A (en) The method and device of speech recognition
US10922570B1 (en) Entering of human face information into database
CN112016367A (en) Emotion recognition system and method and electronic equipment
US20230068798A1 (en) Active speaker detection using image data
Saitoh et al. SSSD: Speech scene database by smart device for visual speech recognition
CN208351494U (en) Face identification system
CN112286364A (en) Man-machine interaction method and device
CN110516083B (en) Album management method, storage medium and electronic device
WO2021134485A1 (en) Method and device for scoring video, storage medium and electronic device
CN110032921B (en) Adjusting device and method of face recognition equipment
CN111768785B (en) Control method of smart watch and smart watch
JP3980464B2 (en) Method for extracting nose position, program for causing computer to execute method for extracting nose position, and nose position extracting apparatus
CN116400802A (en) Virtual reality device and multi-modal emotion recognition method
CN114466179A (en) Method and device for measuring synchronism of voice and image
CN114466178A (en) Method and device for measuring synchronism of voice and image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination