CN113910217A - Audio-visually coordinated head orientation method for a humanoid robot

Audio-visually coordinated head orientation method for a humanoid robot

Info

Publication number
CN113910217A
CN113910217A (application number CN202010993992.8A)
Authority
CN
China
Prior art keywords
human
voice
faces
face
humanoid robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010993992.8A
Other languages
Chinese (zh)
Other versions
CN113910217B (English)
Inventor
王守岩 (Wang Shouyan)
李岩 (Li Yan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202010993992.8A
Priority claimed from CN202010993992.8A
Publication of CN113910217A
Application granted
Publication of CN113910217B
Status: Active

Classifications

    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B25: HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00: Programme-controlled manipulators
    • B25J 9/16: Programme controls
    • B25J 9/1602: Programme controls characterised by the control system, structure, architecture
    • B25J 9/1694: Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J 9/1697: Vision controlled systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)

Abstract

The invention provides an audio-visually coordinated head orientation method for a humanoid robot, which mainly comprises the following steps: first, voice signals are collected through a microphone array and the collection times are recorded; the signals are then analyzed with a human-voice recognition method, and the number of microphones that collected a human voice is counted; when the number of voice microphones is greater than 1, a candidate target azimuth is set according to the recorded times; the camera is then rotated and video is collected; face image frames are acquired from the video and recognized with a face recognition method, and the number of faces is counted; when the number of faces is greater than 1, the face areas are calculated and sorted; candidate target faces are set according to the ranking; lip motion is then recognized with a lip-motion detection algorithm; when more than one face shows lip motion, the target azimuth is set according to face area; finally, the head of the humanoid robot is rotated to realize the orientation. The invention enables the humanoid robot to distinguish human voices from other sounds and to interact accurately with the target speaker.

Description

Audio-visually coordinated head orientation method for a humanoid robot
Technical Field
The invention belongs to the field of intelligent robots, and particularly relates to an audio-visually coordinated head orientation method for a humanoid robot.
Background
A humanoid robot is a robot intended to imitate human appearance and behavior. With the development of computer and mechanical technology, its applications have gradually expanded from industry and agriculture to service, medical care, education, entertainment, and daily life. The appearance of humanoid robots such as dish-serving robots and childcare robots shows that they have already entered people's lives, and their use in various fields will become increasingly common as related technologies develop.
At present, a common problem in the use of humanoid robots is that the robot head cannot accurately face the target person who issues a command, so the target person does not get a good human-computer interaction experience. Therefore, how to determine the direction of the humanoid robot head according to the position of the target person (referred to as the orientation of the humanoid robot for short) is the first problem to be solved before subsequent voice communication between the target person and the robot. When the robot faces the target person, the target person can obtain a better interactive experience from the robot's reactions to commands such as voice.
Existing orientation methods for humanoid robots fall into two categories: purely auditory methods, and audio-visually coordinated methods. Both mostly rely on sound-source localization. For example, the patent "A robot sound source localization method and robot" (publication number CN108254721A) uses an auditory orientation method based on a microphone array, and the patent "A sound source self-searching and positioning method" (publication number CN101295016B) uses an audio-visually coordinated orientation method.
However, both methods still have the following problems: 1) when multiple sound sources are present, human voices cannot be distinguished from other sounds; 2) when several human voices occur close to the humanoid robot at the same time, the robot cannot decide which target person, corresponding to the target voice, to interact with.
Disclosure of Invention
To solve these problems, the invention provides a method that distinguishes human voices and determines the direction of the target person issuing a command, using the voice signals and video collected by the humanoid robot, so as to realize the orientation of the robot head. The invention adopts the following technical scheme:
the invention provides an audio-visual cooperative humanoid robot head orientation method, which is used for determining a target direction for a humanoid robot at least comprising a microphone array, a camera and a control part and realizing the orientation of the humanoid robot head, and is characterized by comprising the following steps: step S1, collecting voice signals around the humanoid robot through a plurality of microphones in the microphone array and recording the time of collecting the voice signals by all the microphones; step S2, recognizing the voice signals collected by the microphones in sequence by using a preset voice recognition method to obtain voice recognition results, and meanwhile counting the number of the microphones with voice recognition results as voice and setting the number as the number of the voice microphones; step S3, judging whether the number of the human voice microphones is more than 1; step S4, when the number of the voice microphones is more than 1, the position corresponding to the microphone which firstly collects the voice is set as a candidate target position according to time; step S5, the humanoid robot controls the camera to turn to the candidate target position through the control part, and the camera is used for collecting the video of the candidate target position; step S6, sequentially acquiring face image frames from the video, sequentially identifying the face image frames by using a preset face identification method to obtain a face identification result containing face positions, and counting the number of faces in the face image frames; step S7, judging whether the number of the human faces is more than 1; step S8, when the number of the human faces is more than 1, sequentially calculating the areas of the human faces in the human face image frame and sequencing the human faces in the sequence from big to small to obtain a human face area sequence; step S9, setting faces corresponding to a predetermined number of areas ranked at the top in the face area sequence as candidate target faces; step S10, using a preset lip motion detection algorithm to sequentially carry out lip motion recognition on a preset number of continuous face image frames corresponding to the candidate target face, and setting the candidate target face with lip motion as a lip motion face; step S11, judging whether the number of the lip-moving faces is larger than 1; step S12, when the number of lip-moving faces is larger than 1, the face position corresponding to the lip-moving face with the largest area is set as the target orientation; and step S13, the humanoid robot controls the head of the humanoid robot to turn to the target direction through the control part so as to realize orientation.
The audio-visually coordinated head orientation method for a humanoid robot provided by the invention may further have the feature that, when step S3 judges that the number of voice microphones is not greater than 1, the following steps are performed: step A1, judging whether the number of voice microphones is 1; and step A2, when the number of voice microphones is 1, setting the azimuth corresponding to the microphone that collected the human voice as the candidate target azimuth.
The method may further have the feature that, when step S7 judges that the number of faces is not greater than 1, the following steps are performed: step B1, judging whether the number of faces is 1; and step B2, when the number of faces is 1, setting the face position corresponding to that face as the target azimuth.
The method may further have the feature that, when step S11 judges that the number of lip-moving faces is not greater than 1, the following steps are performed: step C1, judging whether the number of lip-moving faces is 1; and step C2, when the number of lip-moving faces is 1, setting the face position corresponding to that lip-moving face as the target azimuth.
The method may further have the feature that the human-voice recognition method comprises the following steps: step T1, band-pass filtering the voice signal to obtain a specific frequency band signal; and step T2, judging whether the intensity of the specific frequency band signal exceeds 70% of the intensity of the voice signal; if it does, the voice recognition result is a human voice, and if not, the result is a non-human voice.
The method may further have the feature that the microphone array comprises at least 3 microphones.
Action and Effect of the Invention
According to the audio-visually coordinated head orientation method for a humanoid robot of the invention, the voice signals around the robot are collected through a microphone array, so the robot can pick up voice signals over the full 360-degree range without omission, which lays the foundation for the final accurate orientation of the robot head. Because the voice signals are analyzed, human voices are distinguished from other sounds, so the robot reacts to human voices rather than interacting meaninglessly with other noise.
Further, when the number of faces is greater than 1, the area of each face in the face image frame is calculated and the areas are sorted from largest to smallest to obtain a face area sequence, and the faces corresponding to a predetermined number of the largest areas are set as candidate target faces. This identifies the candidate target persons gathered around the robot and narrows the range for determining the target person who wants face-to-face human-computer interaction.
Because lip-motion detection is then performed on this predetermined number of large faces, it is determined which faces are actually speaking. When several of these faces show lip motion, the lip-moving face with the largest area is taken as the target person most likely to want face-to-face interaction with the robot, and the robot turns its head toward that person through the control part according to the corresponding face position. The method therefore handles not only the case where a single human voice occurs near the robot but also the case where several human voices occur near the robot at the same time, so that the robot head faces the target person as accurately as possible and the target person obtains a good interactive experience.
In short, the method can distinguish human voices from other sounds and can resolve the situation where several human voices occur near the robot at the same time, determining which target voice requires human-computer interaction and thereby providing the target person with a good human-computer interaction experience.
Drawings
FIG. 1 is a block diagram of the structure of the humanoid robot according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the microphone array according to an embodiment of the invention; and
FIG. 3 is a flowchart of the audio-visually coordinated head orientation method according to an embodiment of the invention.
Detailed Description
To make the technical means, inventive features, objects and effects of the invention easy to understand, the audio-visually coordinated head orientation method for a humanoid robot is described in detail below with reference to the embodiment and the accompanying drawings.
< example >
Fig. 1 is a block diagram of the structure of the humanoid robot according to the embodiment of the invention.
As shown in fig. 1, the humanoid robot 10 includes at least a humanoid robot head 11, a humanoid robot body 12, a steering mechanism 13, and a control unit 14.
The humanoid robot body 12 is a conventional humanoid robot body.
The humanoid robot head 11 is rotatably connected to the humanoid robot body 12; in this embodiment, the humanoid robot 10 realizes the relative rotation of the head 11 and the body 12 through the steering mechanism 13.
The humanoid robot head 11 at least comprises a microphone array 112, a camera 113 and an information processing mechanism 114.
The control part 14 is provided on the humanoid robot head 11; it is a master control chip and controls the steering mechanism 13.
Fig. 2 is a schematic diagram of a microphone array according to an embodiment of the invention.
As shown in fig. 2, the microphone array 112 includes a 1st microphone, a 2nd microphone and a 3rd microphone. The three microphones are evenly placed at the three vertices of an equilateral triangle inscribed in a circle, so that each microphone corresponds to a 120-degree sector of the same radius, with the microphone at the center of its sector. The 1st microphone faces directly ahead of the humanoid robot 10, and the default position of the camera 113 (i.e., its initial position) coincides with the azimuth of the 1st microphone.
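For illustration only, this sector geometry can be expressed as the following sketch (Python); the 0-based indexing convention and the function name are assumptions of the sketch, not part of the embodiment.

NUM_MICS = 3
SECTOR_DEG = 360 / NUM_MICS  # each microphone covers a 120-degree sector

def mic_azimuth(mic_index: int) -> float:
    """Center azimuth (degrees) of the sector served by a microphone.

    mic_index is 0-based: 0 is the 1st microphone facing straight ahead
    (0 degrees), 1 maps to 120 degrees, 2 maps to 240 degrees.
    """
    return (mic_index * SECTOR_DEG) % 360

if __name__ == "__main__":
    for i in range(NUM_MICS):
        print(f"microphone {i + 1}: sector center at {mic_azimuth(i):.0f} degrees")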
The information processing mechanism 114 processes the voice signals and video acquired by the microphone array 112 and the camera 113, and sends steering commands for controlling the humanoid robot 10 to the control part 14 according to the processing results.
When the control part 14 receives a steering command, it controls the steering mechanism 13 to rotate the humanoid robot head 11 relative to the body until the head reaches the target azimuth, thereby realizing the orientation of the humanoid robot head 11.
Fig. 3 is a flowchart of a head orientation method of an audiovisual cooperative humanoid robot according to an embodiment of the present invention.
As shown in fig. 3, the humanoid robot head orientation process includes the following steps:
In step S1, the voice signals around the humanoid robot 10 are collected by the microphones of the microphone array 112, and the time at which each microphone collects its voice signal is recorded.
In this embodiment, the information processing mechanism 114 only processes a voice signal whose volume is greater than 60 dB.
In step S2, the voice signals collected by the microphones are recognized in turn with a preset human-voice recognition method to obtain voice recognition results, and the number of microphones whose result is a human voice is counted as the number of voice microphones.
The human voice recognition method comprises the following steps:
in step T1, the voice signal is band-pass filtered to obtain a specific frequency band signal.
In this embodiment, the specific frequency band signal refers to the signal in the 50 Hz to 1 kHz band.
In step T2, it is judged whether the intensity of the specific frequency band signal exceeds 70% of the intensity of the voice signal; if it does, the voice recognition result is a human voice, and if not, the result is a non-human voice.
If the voice recognition result is a non-human voice, the process resumes from step S1.
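A minimal sketch of steps T1 and T2 follows (Python with NumPy and SciPy), under the assumption that "intensity" means signal energy, i.e. the sum of squared samples; the patent does not define the term precisely, and the filter order is likewise an arbitrary choice of this sketch.

import numpy as np
from scipy.signal import butter, filtfilt

def is_human_voice(signal, fs, band=(50.0, 1000.0), ratio_threshold=0.7):
    """Band-pass the signal to 50 Hz - 1 kHz (step T1) and test whether
    the in-band energy exceeds 70% of the total energy (step T2)."""
    x = np.asarray(signal, dtype=np.float64)
    b, a = butter(4, band, btype="bandpass", fs=fs)  # 4th-order Butterworth
    band_signal = filtfilt(b, a, x)
    total_energy = float(np.sum(x ** 2)) + 1e-12     # guard divide-by-zero
    band_energy = float(np.sum(band_signal ** 2))
    return band_energy / total_energy > ratio_threshold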
In step S3, it is determined whether the number of human voice microphones is greater than 1.
The process proceeds to step S4 when the number of human voice microphones is greater than 1, and proceeds to step a1 when the number of human voice microphones is not greater than 1.
In step S4, when the number of voice microphones is greater than 1, the azimuth corresponding to the microphone that collected the human voice first is set as the candidate target azimuth according to the recorded times.
That is, when the number of voice microphones is greater than 1, the times recorded in step S1 are used to determine which of the microphones that collected a human voice did so first; that microphone is set as the sound-source microphone, and its corresponding azimuth is set as the candidate target azimuth, i.e., the azimuth where the target person may be.
In this embodiment, it is assumed that a target person who wants face-to-face human-computer interaction with the humanoid robot 10 stands close to the robot, directly faces the humanoid robot head 11, and speaks toward the robot, which is why the microphone that collects the human voice first is set as the sound-source microphone in step S4.
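A minimal sketch of this selection, covering steps S3/S4 and A1/A2 together, is given below (Python); the argument names and data layout are assumptions of the sketch.

def candidate_target_azimuth(voice_flags, timestamps, azimuths):
    """Return the candidate target azimuth per steps S3/S4/A1/A2.

    voice_flags[i] : True if microphone i recognized a human voice
    timestamps[i]  : collection time recorded in step S1 (e.g., seconds)
    azimuths[i]    : sector center azimuth of microphone i in degrees

    Among the microphones that collected a human voice, the one that did
    so first is the sound-source microphone; returns None if no microphone
    collected a human voice (the process then restarts from step S1).
    """
    voiced = [i for i, flag in enumerate(voice_flags) if flag]
    if not voiced:
        return None
    source = min(voiced, key=lambda i: timestamps[i])
    return azimuths[source]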
In step A1, it is judged whether the number of voice microphones is 1.
The process proceeds to step a2 when the number of human voice microphones is 1, and proceeds to step S1 when the number of human voice microphones is 0.
In step A2, when the number of voice microphones is 1, the azimuth corresponding to the microphone that collected the human voice is set as the candidate target azimuth.
That is, the single microphone that collected the human voice is directly set as the sound-source microphone, and its corresponding azimuth is set as the candidate target azimuth.
In step S5, when the information processing mechanism 114 outputs the sound-source microphone and the candidate target azimuth, it instructs the control part 14 to rotate the camera 113; on receiving the instruction, the control part 14 controls the steering mechanism 13 to rotate the camera 113 so that the forward region of the camera 113 coincides with the forward region of the sound-source microphone at the candidate target azimuth, and the camera 113 then captures video of the candidate target azimuth.
In this embodiment, the camera 113 is a wide-angle camera with a field of view greater than 120 degrees; it starts capturing video after rotating to the candidate target azimuth.
In step S6, face image frames are sequentially acquired from the video to obtain a continuous sequence of face image frames; the frames are recognized in turn with a preset face recognition method to obtain face recognition results containing the face positions, and the number of faces in each face image frame is counted.
The face recognition method is a conventional one, such as Cascade CNN (Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, Gang Hua. A Convolutional Neural Network Cascade for Face Detection. 2015, Computer Vision and Pattern Recognition), DenseBox (Lichao Huang, Yi Yang, Yafeng Deng, Yinan Yu. DenseBox: Unifying Landmark Localization with End to End Object Detection. 2015, arXiv) or Faceness-Net (Shuo Yang, Ping Luo, Chen Change Loy, Xiaoou Tang. Faceness-Net: Face Detection through Deep Facial Part Responses).
In this embodiment, the face recognition result highlights each recognized face in the face image frame with a bounding box, and each box carries the corresponding coordinate information.
In step S7, it is determined whether the number of faces is greater than 1.
The process proceeds to step S8 when the number of faces in one frame of the face image frame is greater than 1, and proceeds to step B1 when the number of faces is not greater than 1.
In step S8, when the number of faces is greater than 1, the area of each face in the face image frame is calculated, and all the face areas in one face image frame are sorted from largest to smallest to obtain the face area sequence.
And step B1, judging whether the number of the human faces is 1.
The process proceeds to step B2 when the number of faces is 1, and returns to step S1 when the number of faces is 0.
In step B2, when the number of faces is 1, the face position corresponding to that face is set as the target azimuth.
In step S9, the information processing mechanism 114 sets the faces corresponding to a predetermined number of the largest areas in the face area sequence as candidate target faces.
In this embodiment, the faces corresponding to the 3 largest areas in the face area sequence are set as candidate target faces.
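Steps S8 and S9 can be sketched as follows (Python); the (x, y, w, h) box convention is an assumption borrowed from common face detectors, not specified by the embodiment.

def rank_faces_by_area(boxes, top_k=3):
    """Sort face bounding boxes by area, largest first (step S8), and keep
    the top_k largest as candidate target faces (step S9).

    boxes: list of (x, y, w, h) rectangles from a face detector.
    """
    ranked = sorted(boxes, key=lambda b: b[2] * b[3], reverse=True)
    return ranked[:top_k]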
And step S10, sequentially carrying out lip motion recognition on a preset number of continuous face image frames corresponding to the candidate target face by using a preset lip motion detection algorithm, and setting the candidate target face with lip motion as a lip motion face.
In this embodiment, several consecutive face image frames containing the 3 candidate target faces are examined in turn with a conventional lip-motion detection algorithm to determine whether each candidate target face shows lip motion, and the candidate target faces with lip motion are set as lip-moving faces.
A conventional lip-motion detection algorithm is, for example, the one described in Li M. Audio-visual talking face detection. International Conference on Multimedia & Expo, IEEE, 2003.
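The sketch below is not the cited algorithm; it is a crude illustrative stand-in that averages the absolute pixel change over the lower third of the face crop (where the mouth sits) across consecutive grayscale frames, with an arbitrary threshold.

import numpy as np

def has_lip_motion(face_frames, diff_threshold=8.0):
    """Crude lip-motion test over consecutive grayscale face crops (step S10).

    face_frames: list of 2-D numpy arrays, all cropped to the same face
    at the same size.
    """
    diffs = []
    for prev, cur in zip(face_frames, face_frames[1:]):
        h = prev.shape[0]
        mouth_prev = prev[2 * h // 3:, :].astype(np.float32)
        mouth_cur = cur[2 * h // 3:, :].astype(np.float32)
        diffs.append(float(np.mean(np.abs(mouth_cur - mouth_prev))))
    return bool(diffs) and float(np.mean(diffs)) > diff_threshold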
And step S11, judging whether the number of the lip-moving faces is more than 1.
The process proceeds to step S12 when the number of lip-moving faces is greater than 1, and proceeds to step C1 when the number of lip-moving faces is not greater than 1.
In step S12, when the number of lip-moving faces is greater than 1, the face position corresponding to the lip-moving face with the largest area is set as the target azimuth.
In this embodiment, when 2 or more candidate target faces show lip motion in one face image frame, the closest candidate target person (i.e., the one with the largest face area) is taken as the target person, and the position of that person's face in the face image frame is the target azimuth.
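The selection in step S12 reduces to picking the largest lip-moving box; a one-line sketch (Python, same (x, y, w, h) assumption as above):

def pick_target_face(lip_faces):
    """Among lip-moving face boxes, return the one with the largest area
    (step S12), or None if the list is empty."""
    return max(lip_faces, key=lambda b: b[2] * b[3]) if lip_faces else None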
And step C1, judging whether the number of the lip-moving faces is 1.
The process proceeds to step C2 when the number of lip-moving faces is 1, and returns to step S1 when the number of lip-moving faces is 0.
In step C2, when the number of lip-moving faces is 1, the face position corresponding to that lip-moving face is set as the target azimuth.
In step S13, the humanoid robot 10 controls the humanoid robot head 11 through the control part 14 to turn to the target azimuth, realizing the orientation.
In this embodiment, the camera 113 and the humanoid robot head 11 are oriented precisely according to the coordinate information of the face position corresponding to the target azimuth, so that the camera 113 and the head 11 accurately face the target person; human-computer interaction then proceeds, and the process enters its terminal state.
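How the face coordinates translate into a head rotation is not specified by the embodiment; the sketch below assumes a simple linear mapping of the horizontal image offset across the camera's field of view, which is only an approximation for a wide-angle lens.

def face_to_pan_angle(face_center_x, frame_width, fov_deg=120.0):
    """Convert the target face's horizontal pixel position to a head pan
    offset in degrees (step S13); positive means rotate to the right."""
    offset = face_center_x / frame_width - 0.5  # in [-0.5, 0.5]
    return offset * fov_deg

For example, a face centered at pixel 960 in a 1280-pixel-wide frame gives an offset of 0.25 and a pan command of 30 degrees under these assumptions.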
Action and Effect of the Embodiment
The embodiment described above achieves the same effects as stated under "Action and Effect of the Invention": the microphone array collects the voice signals around the humanoid robot over the full 360-degree range without omission; the human-voice recognition step distinguishes human voices from other sounds; the face area ranking narrows down the candidate target persons gathered around the robot; and the lip-motion detection on the largest faces determines the target person, so that the robot head accurately faces the target person both when a single human voice and when several human voices occur near the robot, giving the target person a good human-computer interaction experience.
The above embodiment merely illustrates a specific implementation of the invention, and the invention is not limited to the scope of that description.
In the above embodiment, the microphone array contains 3 microphones; in other schemes, an array of 4, 5, 6 or more microphones may also be used to collect the voice signals.
In the above embodiment, the 3 largest faces in the face area sequence are selected as candidate target faces; in other schemes, 2 or 4 faces may be selected for the subsequent lip-motion detection.
In the above embodiment, the head and body of the humanoid robot can rotate relative to each other. The invention can also orient the head of a humanoid robot whose head and body cannot rotate relative to each other; in that case the control part rotates the whole robot to realize the head orientation.
In the above embodiment, the humanoid robot controls the camera through the control part to turn to the candidate target azimuth; the invention may also turn the head of the humanoid robot together with the camera.

Claims (6)

1. An audio-visually coordinated head orientation method for a humanoid robot, used to determine a target azimuth for a humanoid robot comprising at least a microphone array, a camera and a control part and to realize the orientation of the humanoid robot head, characterized by comprising the following steps:
step S1, collecting the voice signals around the humanoid robot through the microphones of the microphone array and recording the time at which each microphone collects its voice signal;
step S2, recognizing the collected voice signals in turn with a preset human-voice recognition method to obtain voice recognition results, and counting the number of microphones whose result is a human voice as the number of voice microphones;
step S3, judging whether the number of voice microphones is greater than 1;
step S4, when the number of voice microphones is greater than 1, setting the azimuth corresponding to the microphone that collected the human voice first, determined from the recorded times, as the candidate target azimuth;
step S5, controlling the camera through the control part to turn to the candidate target azimuth and collecting video of the candidate target azimuth with the camera;
step S6, sequentially acquiring face image frames from the video, recognizing them in turn with a preset face recognition method to obtain face recognition results containing the face positions, and counting the number of faces in each face image frame;
step S7, judging whether the number of faces is greater than 1;
step S8, when the number of faces is greater than 1, calculating the area of each face in the face image frame and sorting the areas from largest to smallest to obtain a face area sequence;
step S9, setting the faces corresponding to a predetermined number of the largest areas in the face area sequence as candidate target faces;
step S10, performing lip-motion recognition on a predetermined number of consecutive face image frames for each candidate target face with a preset lip-motion detection algorithm, and setting the candidate target faces with lip motion as lip-moving faces;
step S11, judging whether the number of lip-moving faces is greater than 1;
step S12, when the number of lip-moving faces is greater than 1, setting the face position corresponding to the lip-moving face with the largest area as the target azimuth;
step S13, controlling the head of the humanoid robot through the control part to turn to the target azimuth, thereby realizing the orientation.
2. The audio-visually coordinated head orientation method for a humanoid robot according to claim 1, characterized in that:
when step S3 judges that the number of voice microphones is not greater than 1, the method comprises the following steps:
step A1, judging whether the number of voice microphones is 1;
and step A2, when the number of voice microphones is 1, setting the azimuth corresponding to the microphone that collected the human voice as the candidate target azimuth.
3. The audio-visually coordinated head orientation method for a humanoid robot according to claim 1, characterized in that:
when step S7 judges that the number of faces is not greater than 1, the method comprises the following steps:
step B1, judging whether the number of faces is 1;
and step B2, when the number of faces is 1, setting the face position corresponding to that face as the target azimuth.
4. The audio-visually coordinated head orientation method for a humanoid robot according to claim 1, characterized in that:
when step S11 judges that the number of lip-moving faces is not greater than 1, the method comprises the following steps:
step C1, judging whether the number of lip-moving faces is 1;
and step C2, when the number of lip-moving faces is 1, setting the face position corresponding to that lip-moving face as the target azimuth.
5. The audio-visually coordinated head orientation method for a humanoid robot according to claim 1, characterized in that:
the human-voice recognition method comprises the following steps:
step T1, band-pass filtering the voice signal to obtain a specific frequency band signal;
and step T2, judging whether the intensity of the specific frequency band signal exceeds 70% of the intensity of the voice signal; if it does, the voice recognition result is a human voice, and if not, the result is a non-human voice.
6. The audio-visually coordinated head orientation method for a humanoid robot according to claim 1, characterized in that:
wherein the microphone array comprises at least 3 microphones.
CN202010993992.8A, filed 2020-09-21: Head orientation method of humanoid robot with cooperative hearing and vision (granted as CN113910217B, Active)

Priority Applications (1)

Application Number: CN202010993992.8A; Filing Date: 2020-09-21; Title: Head orientation method of humanoid robot with cooperative hearing and vision (granted as CN113910217B)

Publications (2)

Publication Number and Date
CN113910217A: published 2022-01-11
CN113910217B: granted 2023-12-01

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant