CN113910217B - Head orientation method of humanoid robot with cooperative hearing and vision - Google Patents

Head orientation method of humanoid robot with cooperative hearing and vision

Info

Publication number
CN113910217B
Authority
CN
China
Prior art keywords
faces
voice
humanoid robot
face
microphones
Prior art date
Legal status
Active
Application number
CN202010993992.8A
Other languages
Chinese (zh)
Other versions
CN113910217A (en)
Inventor
王守岩
李岩
Current Assignee
Fudan University
Original Assignee
Fudan University
Filing date
Publication date
Application filed by Fudan University
Priority to CN202010993992.8A
Publication of CN113910217A
Application granted
Publication of CN113910217B


Abstract

The invention provides a head orientation method for a humanoid robot in which hearing and vision cooperate. The method mainly comprises the following steps: first, voice signals are collected through a microphone array and their arrival times are recorded; the signals are then classified by a voice recognition method, and the number of microphones that collected a human voice is counted; when that number is greater than 1, a candidate target azimuth is set according to the arrival times; the camera is then rotated toward that azimuth and video is collected; face image frames are taken from the video, a face recognition method yields face recognition results, and the number of faces is counted; when the number of faces is greater than 1, the face areas are calculated and sorted, and a predetermined number of the largest faces are set as candidate target faces; lip movement is then identified with a lip movement detection algorithm; when more than one face shows lip movement, the target azimuth is set according to face area; finally, the head of the humanoid robot is rotated to achieve orientation. The invention enables the humanoid robot to distinguish human voices from other sounds and to interact accurately with the target person.

Description

Head orientation method of humanoid robot with cooperative hearing and vision
Technical Field
The invention belongs to the field of intelligent robots, and particularly relates to a head orientation method of a humanoid robot with cooperative hearing and vision.
Background
Humanoid robots are robots intended to mimic human appearance and behavior. As computer and mechanical technologies have developed, their applications have gradually expanded from industry and agriculture to service, medical care, education, entertainment, daily life, and other fields. The emergence of dish-serving robots, child-care robots, and the like shows that robots have already entered everyday life, and their applications across various fields will only become more common as the related technologies develop.
At present, a humanoid robot in use often cannot turn its head accurately toward the target person who issues a command, so the target person does not get a good human-computer interaction experience. How to determine the direction of the humanoid robot's head (the orientation of the humanoid robot for short) from the position of the target person therefore becomes the first problem to be solved before subsequent voice communication between the target person and the robot. Only when the humanoid robot faces the target person can its reactions to the person's voice commands give the target person a satisfying interaction experience.
Existing orientation methods for humanoid robots fall into two types: those based on hearing alone and those based on audio-visual cooperation. Both are essentially sound-source localization. For example, the patent "Robot sound source localization method and robot" (publication No. CN108254721A) uses an auditory orientation method based on a microphone array, while the patent "An autonomous sound source searching and localization method" (publication No. CN101295016B) uses an audio-visual cooperative orientation method.
However, the two methods described above still have the following problems: 1) when multiple sound sources are present, they cannot distinguish human voices from other sounds; 2) when several human voices occur close to the humanoid robot, they cannot determine which speaker is the target person with whom face-to-face human-computer interaction should take place.
Disclosure of Invention
In order to solve these problems, the invention provides a method that uses the voice signals and video collected by a humanoid robot to distinguish human voices and to determine the direction of the target person issuing a command, thereby orienting the humanoid robot. The following technical scheme is adopted:
the invention provides a head orientation method of a humanoid robot with cooperative hearing and vision, which is used for determining a target orientation for the humanoid robot at least comprising a microphone array, a camera and a control part and realizing the orientation of the head of the humanoid robot, and is characterized by comprising the following steps: step S1, voice signals around the humanoid robot are collected through a plurality of microphones in a microphone array, and the time for all the microphones to collect the voice signals is recorded; step S2, a voice signal collected by the microphones is sequentially identified by a preset voice identification method to obtain a voice identification result, and meanwhile, the number of microphones with voice identification results being voice is counted and set as the number of voice microphones; step S3, judging whether the number of the voice microphones is larger than 1; step S4, setting the azimuth corresponding to the microphone which collects the voice first as a candidate target azimuth according to time when the number of the voice microphones is larger than 1; s5, controlling the camera to turn to the candidate target azimuth through the control part by the humanoid robot, and acquiring a video of the candidate target azimuth by using the camera; step S6, sequentially acquiring face image frames from the video, sequentially identifying the face image frames by using a preset face identification method to obtain a face identification result containing the face position, and counting the number of faces of the face image frames; step S7, judging whether the number of faces is larger than 1; step S8, when the number of the faces is greater than 1, the areas of the faces in the face image frames are sequentially calculated, and the sequence of the face areas is obtained by sequencing the faces from large to small; step S9, setting faces corresponding to a predetermined number of areas which are ranked at the front in the face area sequence as candidate target faces; step S10, sequentially carrying out lip movement identification on a predetermined number of continuous face image frames corresponding to the candidate target faces by using a predetermined lip movement detection algorithm, and setting the candidate target faces with lip movements as lip movement faces; step S11, judging whether the number of lip-moving faces is larger than 1; step S12, setting the face position corresponding to the lip-moving face with the largest area as a target azimuth when the number of the lip-moving faces is larger than 1; step S13, the humanoid robot controls the humanoid robot head to turn to the target azimuth through the control part so as to realize orientation.
The head orientation method of the humanoid robot with cooperative hearing and vision provided by the invention may also have the technical feature that, when step S3 judges that the number of voice microphones is not greater than 1, the method comprises the following steps: step A1, judging whether the number of voice microphones is 1; step A2, when the number of voice microphones is 1, setting the azimuth corresponding to the microphone that collected the human voice as the candidate target azimuth.
The head orientation method of the humanoid robot with cooperative hearing and vision provided by the invention may also have the technical feature that, when step S7 judges that the number of faces is not greater than 1, the method comprises the following steps: step B1, judging whether the number of faces is 1; step B2, when the number of faces is 1, setting the face position corresponding to that face as the target azimuth.
The head orientation method of the humanoid robot with cooperative hearing and vision provided by the invention may also have the technical feature that, when step S11 judges that the number of lip-moving faces is not greater than 1, the method comprises the following steps: step C1, judging whether the number of lip-moving faces is 1; step C2, when the number of lip-moving faces is 1, setting the face position corresponding to that lip-moving face as the target azimuth.
The head orientation method of the humanoid robot with cooperative hearing and vision provided by the invention may also have the technical feature that the voice recognition method comprises the following steps: step T1, band-pass filtering the voice signal to obtain a specific-frequency-band signal; step T2, judging whether the intensity of the specific-frequency-band signal exceeds 70% of the intensity of the voice signal, the voice recognition result being a human voice if it does and a non-human voice if it does not.
The head orientation method of the humanoid robot with cooperative hearing and vision provided by the invention may also have the technical feature that the microphone array comprises at least 3 microphones.
The actions and effects of the invention
According to the head orientation method of the humanoid robot with cooperative hearing and vision, voice signals around the humanoid robot are collected through a microphone array, so the robot can pick up voice signals over a full 360-degree range without omission, which lays the foundation for the final accurate orientation of the humanoid robot head. Because each voice signal is classified as human or non-human, human voices are distinguished from other sounds, so the humanoid robot interacts in response to human voices rather than reacting meaninglessly to other noises.
Face recognition is performed on the face image frames taken from the video collected by the camera to obtain face recognition results containing face positions, and the number of faces is counted. When the number of faces is greater than 1, the area of each face in the face image frame is calculated in turn, the areas are sorted from largest to smallest into a face area sequence, and the faces corresponding to a predetermined number of the largest areas are set as candidate target faces. In this way the candidate target persons gathered around the humanoid robot are determined, which narrows the range for finding the target person with whom face-to-face human-computer interaction is needed.
Lip movement detection is then performed on this predetermined number of large faces to determine which of them show lip movement. When several faces show lip movement, the face with the largest area is taken as the target person most likely to be interacting with the humanoid robot, and the robot turns its head through the control part toward the face position with the largest area. The method therefore handles not only the case where a single human voice occurs near the humanoid robot but also the case where several human voices occur close to it, so that the humanoid robot head faces the target person as accurately as possible and the target person obtains a better interaction experience.
In summary, the method can distinguish human voices from other sounds, and it can detect and judge the situation in which several human voices appear around the humanoid robot at close range, determining which speaker should be faced for human-computer interaction and thereby giving the target person a better human-computer interaction experience.
Drawings
Fig. 1 is a block diagram of a humanoid robot according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the microphone array according to an embodiment of the invention; and
Fig. 3 is a flowchart of the head orientation method of a humanoid robot with cooperative hearing and vision according to an embodiment of the present invention.
Detailed Description
To make the technical means, creative features, objectives, and effects of the invention easy to understand, the head orientation method of a humanoid robot with cooperative hearing and vision is described below with reference to the embodiment and the drawings.
<Example>
Fig. 1 is a block diagram of a humanoid robot according to an embodiment of the present invention.
As shown in fig. 1, the humanoid robot 10 includes at least a humanoid robot head 11, a humanoid robot body 12, a steering mechanism 13, and a control unit 14.
The humanoid robot body 12 is a conventional humanoid robot body.
The humanoid robot head 11 is rotatably connected to the humanoid robot body 12, and in this example, the humanoid robot 10 achieves relative rotation of the humanoid robot head 11 and the humanoid robot body 12 by a steering mechanism 13.
The humanoid robot head 11 includes at least a microphone array 112, a camera 113, and an information processing mechanism 114.
The control unit 14 is provided on the humanoid robot head 11, and is a main control chip capable of controlling the steering mechanism 13.
Fig. 2 is a schematic diagram of a microphone array according to an embodiment of the invention.
As shown in Fig. 2, the microphone array 112 includes a 1st microphone, a 2nd microphone, and a 3rd microphone. The three microphones are evenly placed at the three vertices of an equilateral triangle inscribed in a circle; each microphone covers a sector of the same radius with a central angle of 120 degrees and sits at the center of its sector. The 1st microphone faces directly forward from the humanoid robot 10, and the default position of the camera 113 (i.e., the camera's initial orientation) coincides with the 1st microphone.
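This sector layout can be expressed as a small mapping from microphone index to bearing; the clockwise-from-front angle convention in the sketch below is an assumption for illustration, not something the patent fixes.
```python
NUM_MICS = 3
SECTOR_DEG = 360 / NUM_MICS   # each microphone covers a 120-degree sector

def mic_azimuth(mic_index: int) -> float:
    """Center bearing (degrees clockwise from the robot's front, i.e. the
    1st microphone's direction) of the sector covered by microphone
    `mic_index` (0-based)."""
    return (mic_index * SECTOR_DEG) % 360
```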
The information processing mechanism 114 processes the voice signals and the video collected by the microphone array 112 and the camera 113, respectively, and issues steering instructions for controlling the humanoid robot 10 to the control unit 14 according to the processing results.
When the control section 14 receives the steering instruction, the control section 14 controls the steering mechanism 13 to perform the rotation of the humanoid robot head 11 with respect to the body until the humanoid robot head 11 is turned to the target orientation, thereby achieving the orientation of the humanoid robot head 11.
Fig. 3 is a flowchart of a head orientation method of a humanoid robot with cooperative hearing and vision according to an embodiment of the present invention.
As shown in fig. 3, the humanoid robot head orientation process includes the steps of:
In step S1, the microphones in the microphone array 112 collect the voice signals around the humanoid robot 10, and the time at which each microphone collects the voice signal is recorded.
In this embodiment, the information processing mechanism 114 processes a voice signal only when its volume is greater than 60 dB.
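A minimal sketch of this 60 dB gate follows, assuming 16-bit PCM input and a purely illustrative calibration constant that maps digital level to an approximate sound pressure level; the patent does not specify the dB reference.
```python
import numpy as np

CALIBRATION_DB = 90.0  # hypothetical dBFS-to-SPL offset; illustrative only

def loud_enough(samples: np.ndarray, threshold_db: float = 60.0) -> bool:
    """True if the RMS level of 16-bit PCM samples exceeds the threshold."""
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
    dbfs = 20.0 * np.log10(max(rms, 1e-12) / 32768.0)  # full scale = 32768
    return dbfs + CALIBRATION_DB > threshold_db
```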
Step S2, recognizing the voice signals collected by the microphones in turn with a predetermined voice recognition method to obtain voice recognition results, and counting the number of microphones whose result is a human voice, recorded as the number of voice microphones.
The voice recognition method comprises the following steps:
Step T1, band-pass filtering the voice signal to obtain a specific-frequency-band signal.
In this embodiment, the specific frequency band signal refers to a signal in the band 50 Hz–1 kHz.
Step T2, judging whether the intensity of the specific-frequency-band signal exceeds 70% of the intensity of the voice signal; if it does, the voice recognition result is a human voice, and if not, a non-human voice.
If the voice recognition result is a non-human voice, the process returns to step S1.
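A sketch of steps T1-T2 under stated assumptions: a 4th-order Butterworth band-pass for the 50 Hz-1 kHz band and signal energy as the "intensity" measure, neither of which the patent pins down.
```python
import numpy as np
from scipy.signal import butter, sosfilt

def is_human_voice(signal: np.ndarray, fs: int = 16000) -> bool:
    """T1: band-pass to 50 Hz-1 kHz; T2: human voice if the band carries
    more than 70% of the total signal energy."""
    sos = butter(4, [50.0, 1000.0], btype="bandpass", fs=fs, output="sos")
    band = sosfilt(sos, signal.astype(np.float64))
    band_energy = float(np.sum(band ** 2))
    total_energy = float(np.sum(signal.astype(np.float64) ** 2))
    return total_energy > 0 and band_energy / total_energy > 0.7
```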
And S3, judging whether the number of the voice microphones is larger than 1.
Step S4 is entered when the number of human voice microphones is greater than 1, and step A1 is entered when the number of human voice microphones is not greater than 1.
Step S4, when the number of voice microphones is greater than 1, setting the azimuth corresponding to the microphone that collected the human voice first, according to the recorded times, as the candidate target azimuth.
When the number of voice microphones is greater than 1, the times recorded in step S1 are used to determine which of the microphones that collected a human voice did so first; that microphone is set as the sound source microphone, and the azimuth corresponding to it is set as the candidate target azimuth. The candidate target azimuth is the azimuth in which the target person may be located.
In this embodiment it is assumed by default that the target person who wants to interact with the humanoid robot 10 stands close to it, faces the humanoid robot head 11, and speaks toward the robot; the microphone that first collected the human voice is therefore set as the sound source microphone in step S4.
And A1, judging whether the number of the voice microphones is 1.
Step A2 is entered when the number of human voice microphones is 1, and step S1 is entered when the number of human voice microphones is 0.
Step A2, when the number of voice microphones is 1, setting the azimuth corresponding to the microphone that collected the human voice as the candidate target azimuth.
That is, when only one microphone collected a human voice, that microphone is directly set as the sound source microphone, and the azimuth corresponding to it is set as the candidate target azimuth.
In step S5, when the information processing mechanism 114 outputs the sound source microphone and the candidate target azimuth, it issues an instruction to the control unit 14 to rotate the camera 113. After receiving the instruction, the control unit 14 controls the steering mechanism 13 to rotate the camera 113 toward the candidate target azimuth, ensuring that the area directly in front of the camera 113 coincides with the area directly in front of the sound source microphone. The camera 113 then collects video of the candidate target azimuth.
In this embodiment, the camera 113 is a wide-angle camera with a field of view greater than 120 degrees. It starts capturing video after rotating to the candidate target azimuth.
Step S6, sequentially acquiring face image frames from the video to obtain a continuous face image frame sequence, sequentially identifying the face image frames by using a preset face identification method to obtain a face identification result containing the face position, and counting the number of faces of the face image frames.
The face recognition method is a conventional one, such as Cascade CNN (Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, Gang Hua. A Convolutional Neural Network Cascade for Face Detection. 2015, Computer Vision and Pattern Recognition), DenseBox (Lichao Huang, Yi Yang, Yafeng Deng, Yinan Yu. DenseBox: Unifying Landmark Localization with End to End Object Detection. 2015, arXiv: Computer Vision and Pattern Recognition), or Faceness-Net (Shuo Yang, Ping Luo, Chen Change Loy, Xiaoou Tang. Faceness-Net: Face Detection through Deep Facial Part Responses).
In this embodiment, the face recognition result highlights each recognized face in the face image frame as a bounding box, and each box carries corresponding coordinate information.
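As a stand-in for the CNN detectors cited above, the sketch below uses OpenCV's stock Haar cascade, which likewise returns faces as (x, y, w, h) boxes; the detector choice and its parameters are assumptions, not the patent's method.
```python
import cv2

_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    """Return a list of (x, y, w, h) bounding boxes for one image frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return list(_detector.detectMultiScale(gray, scaleFactor=1.1,
                                           minNeighbors=5))
```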
And S7, judging whether the number of faces is larger than 1.
And when the number of faces in one face image frame is greater than 1, entering a step S8, and when the number of faces is not greater than 1, entering a step B1.
Step S8, when the number of faces is greater than 1, calculating the area of each face in the face image frame in turn, and sorting all the face areas in that frame from largest to smallest to obtain the face area sequence.
And B1, judging whether the number of faces is 1.
Step B2 is entered when the number of faces is 1, and step S1 is entered again when the number of faces is 0.
And B2, setting the face position corresponding to the face as a target azimuth when the number of the faces is 1.
In step S9, the information processing mechanism 114 sets faces corresponding to a predetermined number of areas in the face area sequence, which are top-ranked, as candidate target faces.
In this embodiment, the faces corresponding to the 3 largest areas in the face area sequence are set as the candidate target faces.
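Steps S8 and S9 reduce to sorting boxes by area; a minimal sketch follows, reusing the detect_faces helper from the earlier sketch.
```python
def candidate_faces(boxes, top_k: int = 3):
    """boxes: list of (x, y, w, h); return the top_k faces by area w*h,
    largest first (the face area sequence truncated to the candidates)."""
    return sorted(boxes, key=lambda b: b[2] * b[3], reverse=True)[:top_k]
```
For example, candidate_faces(detect_faces(frame)) yields at most three candidate target faces for the current frame.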
Step S10, lip movement recognition is sequentially carried out on a predetermined number of continuous face image frames corresponding to the candidate target faces by using a predetermined lip movement detection algorithm, and the candidate target faces with lip movement are set as lip movement faces.
In this embodiment, a conventional lip movement detection algorithm is applied to several consecutive frames before and after the face image frame in which the 3 candidate target faces were detected, yielding for each candidate target face a result of whether its lips are moving; the candidate target faces with lip movement are set as lip-moving faces.
A conventional lip movement detection algorithm is, for example, the one described in Li M. Audio-visual talking face detection [C]. International Conference on Multimedia & Expo. IEEE, 2003.
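The cited detector fuses audio and video; as a highly simplified visual-only stand-in, the sketch below thresholds the mean frame-to-frame pixel change in the lower third of the face box. The mouth-region choice and the threshold are assumptions for illustration only.
```python
import numpy as np

def has_lip_motion(gray_frames, box, threshold: float = 8.0) -> bool:
    """gray_frames: two or more consecutive grayscale frames (2-D arrays);
    box: (x, y, w, h) face box. True if the mouth region changes enough."""
    x, y, w, h = box
    mouth = [f[y + 2 * h // 3:y + h, x:x + w].astype(np.int16)
             for f in gray_frames]
    diffs = [np.mean(np.abs(a - b)) for a, b in zip(mouth, mouth[1:])]
    return float(np.mean(diffs)) > threshold
```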
Step S11, judging whether the number of lip moving faces is larger than 1.
Step S12 is entered when the number of lip faces is greater than 1, and step C1 is entered when the number of lip faces is not greater than 1.
And step S12, setting the face position corresponding to the lip-moving face with the largest area as the target azimuth when the number of the lip-moving faces is larger than 1.
In this embodiment, when 2 or more candidate target faces in one face image frame show lip movement, the candidate closest to the robot (i.e., the one with the largest face area) is taken as the target person, and that person's face position in the face image frame is the target azimuth.
And C1, judging whether the number of the lip moving faces is 1.
Step C2 is entered when the number of lip faces is 1, and step S1 is entered when the number of lip faces is 0.
Step C2, when the number of lip-moving faces is 1, setting the face position corresponding to that lip-moving face as the target azimuth.
In step S13, the humanoid robot 10 controls the humanoid robot head 11 to turn to the target azimuth by the control unit 14 to achieve orientation.
In this embodiment, the camera 113 and the humanoid robot head 11 are oriented precisely according to the coordinate information of the face position corresponding to the target azimuth, so that they accurately face the target person for subsequent human-computer interaction, after which the process ends.
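One way to turn the face-position coordinates into a head pan command is the pinhole-camera mapping below, assuming an undistorted image and the 120-degree horizontal field of view of this embodiment; the patent does not spell out this conversion.
```python
import math

def pan_angle(face_center_x: float, image_width: int,
              h_fov_deg: float = 120.0) -> float:
    """Signed yaw in degrees from the image center to the face center;
    positive values turn the head toward the right of the frame."""
    offset_px = face_center_x - image_width / 2.0
    focal_px = (image_width / 2.0) / math.tan(math.radians(h_fov_deg / 2.0))
    return math.degrees(math.atan2(offset_px, focal_px))
```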
Operation and effects of the embodiment
According to the head orientation method of the humanoid robot with cooperative hearing and vision provided by this embodiment, voice signals around the humanoid robot are collected through a microphone array, so the robot can pick up voice signals over a full 360-degree range without omission, which lays the foundation for the final accurate orientation of the humanoid robot head. Because each voice signal is classified as human or non-human, human voices are distinguished from other sounds, so the humanoid robot interacts in response to human voices rather than reacting meaninglessly to other noises.
Face recognition is performed on the face image frames taken from the video collected by the camera to obtain face recognition results containing face positions, and the number of faces is counted. When the number of faces is greater than 1, the area of each face in the face image frame is calculated in turn, the areas are sorted from largest to smallest into a face area sequence, and the faces corresponding to a predetermined number of the largest areas are set as candidate target faces. In this way the candidate target persons gathered around the humanoid robot are determined, which narrows the range for finding the target person with whom face-to-face human-computer interaction is needed.
Lip movement detection is then performed on this predetermined number of large faces to determine which of them show lip movement. When several faces show lip movement, the face with the largest area is taken as the target person most likely to be interacting with the humanoid robot, and the robot turns its head through the control part toward the face position with the largest area. The method therefore handles not only the case where a single human voice occurs near the humanoid robot but also the case where several human voices occur close to it, so that the humanoid robot head faces the target person as accurately as possible and the target person obtains a better interaction experience.
In summary, the method can distinguish human voices from other sounds, and it can detect and judge the situation in which several human voices appear around the humanoid robot at close range, determining which speaker should be faced for human-computer interaction and thereby giving the target person a better human-computer interaction experience.
The above embodiment is only intended to illustrate specific implementations of the present invention, and the invention is not limited to the scope described in the embodiment.
In the above embodiment, the microphone array includes 3 microphones; in other schemes, 4, 5, 6, or more microphones may form the microphone array for collecting voice signals.
In the above embodiment, the 3 largest faces in the face area sequence are selected as candidate target faces; in other schemes, 2 or 4 faces may be used for the subsequent lip movement detection.
In the above embodiment, the head and body of the humanoid robot can rotate relative to each other. The invention can also orient the head of a humanoid robot whose body and head cannot rotate relative to each other; in that case, the control part rotates the whole humanoid robot to orient its head.
In the above embodiment, the humanoid robot controls the camera through the control part to turn toward the candidate target azimuth; the invention may also rotate the head of the humanoid robot together with the camera.

Claims (6)

1. A head orientation method of a humanoid robot with cooperative hearing and vision, used to determine a target azimuth for a humanoid robot comprising at least a microphone array, a camera and a control part, and to orient the humanoid robot head, characterized by comprising the following steps:
step S1, collecting voice signals around the humanoid robot through a plurality of microphones in the microphone array and recording the time when all the microphones collect the voice signals;
step S2, recognizing the voice signals collected by the microphones in turn with a predetermined voice recognition method to obtain voice recognition results, and counting the number of microphones whose voice recognition result is a human voice, recorded as the number of voice microphones;
step S3, judging whether the number of the voice microphones is larger than 1;
step S4, when the number of the voice microphones is greater than 1, setting the azimuth corresponding to the microphone that collected the human voice first, according to the recorded times, as a candidate target azimuth;
s5, the humanoid robot controls the camera to turn to the candidate target azimuth through the control part, and the camera is used for collecting video of the candidate target azimuth;
step S6, sequentially acquiring face image frames from the video, sequentially identifying the face image frames by using a preset face identification method to obtain a face identification result containing the face position, and counting the number of faces of the face image frames;
step S7, judging whether the number of the faces is larger than 1;
step S8, when the number of the faces is greater than 1, calculating the areas of the faces in the face image frame in turn and sorting them from largest to smallest to obtain a face area sequence;
step S9, setting the faces corresponding to a predetermined number of the largest areas in the face area sequence as candidate target faces;
step S10, sequentially carrying out lip movement identification on a predetermined number of continuous face image frames corresponding to the candidate target faces by using a predetermined lip movement detection algorithm, and setting the candidate target faces with lip movement as lip movement faces;
step S11, judging whether the number of the lip-moving faces is larger than 1;
step S12, setting the face position corresponding to the lip-moving face with the largest area as a target azimuth when the number of the lip-moving faces is larger than 1;
step S13, the humanoid robot controls the humanoid robot head to turn to the target azimuth through the control part so as to achieve the orientation.
2. The head orientation method of a humanoid robot with cooperative hearing and vision according to claim 1, wherein:
wherein, when the step S3 determines that the number of the voice microphones is not greater than 1, the method includes the following steps:
a1, judging whether the number of the voice microphones is 1;
and A2, setting the azimuth corresponding to the microphone for collecting the voice as the candidate target azimuth when the number of the voice microphones is 1.
3. The head orientation method of a humanoid robot with cooperative hearing and vision according to claim 1, wherein:
wherein, when the step S7 judges that the number of faces is not greater than 1, the method includes the following steps:
step B1, judging whether the number of the faces is 1;
and B2, setting the face position corresponding to the face as the target azimuth when the number of the faces is 1.
4. The head orientation method of a humanoid robot with cooperative hearing and vision according to claim 1, wherein:
wherein when the number of the lip moving faces is not greater than 1 in the step S11, the method comprises the following steps:
step C1, judging whether the number of the lip movement faces is 1;
and C2, setting the face position corresponding to the lip moving face as a target azimuth when the number of the lip moving faces is 1.
5. The head orientation method of a humanoid robot with cooperative hearing and vision according to claim 1, wherein:
the voice recognition method comprises the following steps of:
step T1, carrying out band-pass filtering on the voice signal to obtain a specific frequency band signal;
and step T2, judging whether the intensity of the specific frequency band signal exceeds 70% of the intensity of the voice signal; if it does, the voice recognition result is a human voice, and if not, the voice recognition result is a non-human voice.
6. The head orientation method of a humanoid robot with cooperative hearing and vision according to claim 1, wherein:
wherein the microphone array comprises at least 3 microphones.
CN202010993992.8A 2020-09-21 Head orientation method of humanoid robot with cooperative hearing and vision Active CN113910217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010993992.8A CN113910217B (en) 2020-09-21 Head orientation method of humanoid robot with cooperative hearing and vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010993992.8A CN113910217B (en) 2020-09-21 Head orientation method of humanoid robot with cooperative hearing and vision

Publications (2)

Publication Number Publication Date
CN113910217A CN113910217A (en) 2022-01-11
CN113910217B 2023-12-01



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001095314A1 (en) * 2000-06-09 2001-12-13 Japan Science And Technology Corporation Robot acoustic device and robot acoustic system
CN106537901A (en) * 2014-03-26 2017-03-22 Mark W. Publicover Computerized method and system for providing customized entertainment content
WO2017147827A1 (en) * 2016-03-02 2017-09-08 武克易 Image acquisition method
GB201712791D0 (en) * 2017-08-09 2017-09-20 Emotech Ltd Robots, methods, computer programs, computer-readable media, arrays of microphones and controllers
US10755463B1 (en) * 2018-07-20 2020-08-25 Facebook Technologies, Llc Audio-based face tracking and lip syncing for natural facial animation and lip movement
CN109318243A (en) * 2018-12-11 2019-02-12 珠海市微半导体有限公司 A kind of audio source tracking system, method and the clean robot of vision robot
CN209579577U (en) * 2018-12-11 2019-11-05 珠海市一微半导体有限公司 A kind of the audio source tracking system and clean robot of vision robot
CN109506568A (en) * 2018-12-29 2019-03-22 苏州思必驰信息科技有限公司 A kind of sound localization method and device based on image recognition and speech recognition
CN111554269A (en) * 2019-10-12 2020-08-18 南京奥拓软件技术有限公司 Voice number taking method, system and storage medium
CN111241922A (en) * 2019-12-28 2020-06-05 深圳市优必选科技股份有限公司 Robot, control method thereof and computer-readable storage medium
CN111341350A (en) * 2020-01-18 2020-06-26 南京奥拓电子科技有限公司 Man-machine interaction control method and system, intelligent robot and storage medium
CN111551921A (en) * 2020-05-19 2020-08-18 北京中电慧声科技有限公司 Sound source orientation system and method based on sound image linkage

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Hongbo; Li Zhaoyu; Wu Yu. TRIZ-based innovative design of a real-time microphone-camera speaker localization and tracking system. Digital Communication, No. 01, full text. *
Li Hongbo; Li Zhaoyu; Wu Yu. TRIZ-based innovative design of a real-time microphone-camera speaker localization and tracking system. Digital Communication, 2013, No. 01, full text. *

Similar Documents

Publication Publication Date Title
US6005610A (en) Audio-visual object localization and tracking system and method therefor
US6967455B2 (en) Robot audiovisual system
US10027888B1 (en) Determining area of interest in a panoramic video or photo
CN108734733B (en) Microphone array and binocular camera-based speaker positioning and identifying method
CN111432115B (en) Face tracking method based on voice auxiliary positioning, terminal and storage device
US20090030552A1 (en) Robotics visual and auditory system
Jiang et al. Real-time vibration source tracking using high-speed vision
CN113014844A (en) Audio processing method and device, storage medium and electronic equipment
CN111551921A (en) Sound source orientation system and method based on sound image linkage
CN113910217B (en) Head orientation method of humanoid robot with cooperative hearing and vision
Berghi et al. Visually supervised speaker detection and localization via microphone array
CN109145697A (en) A kind of method of voice calling intelligent home Kang Hu robot
CN113910217A (en) Human-shaped robot head orientation method with audio-visual coordination
CN111103807A (en) Control method and device for household terminal equipment
Li et al. Multiple active speaker localization based on audio-visual fusion in two stages
Zhu et al. Speaker localization based on audio-visual bimodal fusion
Kim et al. Auditory and visual integration based localization and tracking of humans in daily-life environments
CN114422743A (en) Video stream display method, device, computer equipment and storage medium
Kim et al. Human tracking system integrating sound and face localization using an expectation-maximization algorithm in real environments
Akolkar et al. Visual-auditory saliency detection using event-driven visual sensors
Otsuka et al. Realtime meeting analysis and 3D meeting viewer based on omnidirectional multimodal sensors
Fujimori et al. Localization of flying bats from multichannel audio signals by estimating location map with convolutional neural networks
Wang et al. Real-time automated video and audio capture with multiple cameras and microphones
CN110730378A (en) Information processing method and system
CN211466417U (en) Robot control system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant