JP2008126329A - Voice recognition robot and its control method - Google Patents

Voice recognition robot and its control method

Info

Publication number
JP2008126329A
Authority
JP
Japan
Prior art keywords
voice
unit
sound
voice recognition
acquisition unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2006311482A
Other languages
Japanese (ja)
Inventor
Ryo Murakami
涼 村上
Original Assignee
Toyota Motor Corp
トヨタ自動車株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toyota Motor Corp
Priority to JP2006311482A
Publication of JP2008126329A
Application status is Pending

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition robot with improved performance in recognizing the voice uttered by a speaker, and a method of controlling such a robot.

SOLUTION: The voice recognition robot includes a sound source identification unit that identifies the direction in which a voice is uttered, a voice acquisition unit, a voice recognition unit that recognizes the content of the voice, a measurement unit that measures the sound pressure of the acquired voice, an imaging unit that images the direction in which the acquired voice was uttered, a face detection unit that detects a person's face in the captured image, an extraction unit that extracts the lips from the detected face, an identification unit that identifies the center point of the extracted lips, and a holding unit that holds the voice acquisition unit and whose posture can be changed by at least one of an extension/contraction operation and a joint angle change. The posture of the holding unit is changed so that the relative position between the center point of the speaker's lips and the voice acquisition unit is corrected on the basis of the measured sound pressure.

COPYRIGHT: (C)2008, JPO&INPIT

Description

  The present invention relates to a voice recognition robot that recognizes input voice information and responds to the contents thereof, and a control method for such a voice recognition robot.

  A speech recognition robot that recognizes the content of speech uttered by a human and responds by outputting an appropriate answer is generally known. Such a robot acquires the speech uttered by a person as speech data and performs speech recognition processing by analyzing that data, so the uttered speech must be acquired accurately.

For this purpose, techniques are used in which a noise-removing filter is provided in a sound receiving unit such as a microphone, or in which the sound receiving unit is given directivity so that only sound arriving from a sound source in a specific direction is acquired (for example, Patent Documents 1 and 2).
[Patent Document 1] JP 2006-171719 A
[Patent Document 2] JP 2006-251266 A

  However, while the noise removal processing described above can acquire voice with a certain accuracy when the relative position between the voice recognition robot and the person uttering the voice is fixed, the volume (sound pressure) of the voice acquired by the robot is not stable when that relative position is not determined. If noise removal processing is applied to speech for which sufficient sound pressure cannot be obtained, speech recognition performance may degrade.

  In addition, when all or part of the voice recognition robot is movable, the distance between the robot and the speaker must be kept above a certain value for safety, which raises the problem that sufficient sound pressure cannot be obtained.

  The present invention has been made to solve these problems, and its object is to provide a speech recognition robot with improved performance in recognizing speech uttered by a speaker, and a method of controlling such a robot.

  A voice recognition robot according to the present invention includes a sound source identification unit that identifies the direction in which a voice is uttered; a voice acquisition unit that acquires the uttered voice; a voice recognition unit that recognizes the content of the voice acquired by the voice acquisition unit; a measurement unit that measures the sound pressure of the acquired voice; an imaging unit that images the direction in which the acquired voice was uttered and creates image data of the captured image; a face detection unit that detects a human face present in the created image data; an extraction unit that extracts the lips from the detected face; an identification unit that identifies the center point of the extracted lips; and a holding unit that holds the voice acquisition unit and whose posture can be changed by at least one of an extension/contraction operation and a joint angle change. The direction identified by the sound source identification unit is imaged by the imaging unit; a human face present in the captured image is detected by the face detection unit; the center point of the lips extracted from the detected face is identified; the distance between the identified lip center point and the position of the voice acquisition unit is calculated; the relative position of the voice acquisition unit with respect to the lip center point is determined on the basis of the calculated distance and the sound pressure measured by the measurement unit; and the posture of the holding unit is changed so as to move the voice acquisition unit to a predetermined position.

  According to such a voice recognition robot, the position of the voice acquisition unit can be determined so that, given the distance between the lips of the speaker and the voice acquisition unit, the sound pressure of the acquired voice takes a predetermined appropriate value. Since the voice passed to the voice recognition unit then has a constant sound pressure, the performance of the voice recognition unit in recognizing speech uttered by the speaker can be improved.

  Further, such a speech recognition robot may store a target sound pressure, obtain the optimum distance between the lip center point and the voice acquisition unit using the difference between the measured sound pressure and the target sound pressure as a parameter, and determine the position of the voice acquisition unit accordingly. The position of the voice acquisition unit can then be determined according to the strength of the voice uttered by the speaker, and the voice can be acquired at the target sound pressure, further improving voice recognition performance.

  The voice acquisition unit may also be a directional microphone, with the posture of the holding unit changed so that the lip center point lies on the extension of the line running from the tip of the voice acquisition unit in the direction of its directivity. Since the voice acquisition unit can then be positioned in a way that exploits its directivity, sound with sufficient sound pressure can be acquired easily even when the sound uttered by the speaker has a somewhat low sound pressure.

  The sound source identification unit is preferably composed of one or more directional microphones. Such a unit can more accurately identify the relative direction, as seen from the robot, from which the voice was uttered.

  Further, the extraction unit may obtain the contour of the lips and use the center of gravity defined by that contour as the center point. That is, the lip contour is obtained in a plane, the center of gravity of the region enclosed by the contour is identified, and that position is taken as the lip center point. In this way, the center point of the lips can be obtained easily.
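
As an illustration of this centroid construction, here is a minimal sketch (not from the patent; the function name and the polygonal contour representation are assumptions) using the shoelace formula:

```python
import numpy as np

def lip_center(contour):
    """Center of gravity of the region enclosed by a closed lip contour,
    given as an (N, 2) array of (x, y) points, via the shoelace formula."""
    pts = np.asarray(contour, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    xn, yn = np.roll(x, -1), np.roll(y, -1)   # next vertex, wrapping around
    cross = x * yn - xn * y
    area = cross.sum() / 2.0                  # signed enclosed area
    cx = ((x + xn) * cross).sum() / (6.0 * area)
    cy = ((y + yn) * cross).sum() / (6.0 * area)
    return cx, cy
```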

  Further, in such a voice recognition robot, it is preferable that the imaging direction be changed so that the detected face remains positioned substantially at the center of the captured image. Since the detected speaker is then kept within the imaging area, the chance of losing sight of the speaker, and hence the need to detect a speaker again, is reduced.

  It is more preferable that such a voice recognition robot further include moving means and be movable within a predetermined area. Such a robot can move toward and away from the speaker, making it easier to acquire speech at a more appropriate sound pressure.

  Furthermore, the holding unit is preferably an arm unit with drive-controlled joints, with the voice acquisition unit held at the tip of the arm. The arm serving as the holding unit can then be driven so that the distance between the speaker and the voice acquisition unit is kept at an appropriate value.

  The present invention also provides a method for controlling a voice recognition robot. Such a method comprises the steps of: identifying the direction in which a voice was uttered; acquiring the uttered voice via the voice acquisition unit; recognizing the content of the acquired voice; measuring the sound pressure of the acquired voice; imaging the direction in which the acquired voice was uttered and creating image data of the captured image; detecting a human face present in the created image data; extracting the lips from the detected face; identifying the center point of the extracted lips; calculating the distance between the identified lip center point and the voice acquisition unit; determining the relative position of the voice acquisition unit with respect to the lip center point on the basis of the calculated distance and the sound pressure measured by the measurement unit; and moving the voice acquisition unit to the determined position.

  According to such a control method, the position of the voice acquisition unit can be determined so that, given the distance between the speaker's lips and the voice acquisition unit, the sound pressure of the acquired voice takes a predetermined appropriate value. Since the voice passed to the voice recognition unit then has a constant sound pressure, the performance in recognizing speech uttered by the speaker can be improved.

  Such a control method may further comprise a step of determining the position of the voice acquisition unit so that, the voice acquisition unit having directivity, the lip center point lies on the extension of the line running from the tip of the voice acquisition unit in the direction of that directivity. Since the voice acquisition unit can be positioned in a way that exploits its directivity, voice can be acquired with high sensitivity even when the sound uttered by the speaker has a somewhat low sound pressure.

  As described above, the present invention provides a speech recognition robot with improved performance in recognizing speech uttered by a speaker, and a method of controlling such a robot.

Embodiment 1 of the Invention
The speech recognition robot according to the first embodiment of the present invention will be described below with reference to FIGS. 1 to 7.

  FIG. 1 shows a person who is a speaker present in a room R in which the voice recognition robot 1 is placed. As shown in FIG. 2, the speech recognition robot 1 of FIG. 1 is a humanoid robot including a torso 10 fixed to the floor, and a head 11, right arm 12, and left arm 13 connected to the torso 10. Each component is described in detail below.

  The torso 10 is installed on the floor of the room R and contains a control unit 100 that controls the operation of the voice recognition robot 1 and its other functions. The control unit 100 is a control computer comprising an arithmetic processing unit, memory, and the like; it recognizes the content of the input signal from the microphone (described later) serving as the voice acquisition unit, selects appropriate response data, and outputs that response as voice. Its detailed configuration is described later.

  The head 11 includes imaging units 111 and 112 for imaging a predetermined range in front of the voice recognition robot 1, sound source identification units 113 and 114 for picking up surrounding sounds, and a speaker 115 for uttering words to the outside. The imaging units 111 and 112 are optical cameras that acquire optical information within a predetermined range, create imaging data, and output it to the control unit 100.

  The sound source identification units 113 and 114 each comprise a plurality of so-called directional microphones that acquire sound from a particular horizontal direction, making it possible to roughly identify from which direction, as seen from the voice recognition robot 1, a surrounding sound was uttered. The units 113 and 114 are provided on the left and right sides of the head 11; the direction from which a sound around the robot was uttered, relative to the robot, is identified and output to the control unit 100 described later.

  The speaker 115 outputs the response data created by the control unit 100 to the outside in a predetermined direction and at a predetermined volume, and is provided below the front surface of the head 11.

  The head 11 is connected to the torso 10 so as to be rotatable left and right in a plane horizontal to the floor, allowing the imaged region of the surrounding environment to be changed.

  The right arm 12 and the left arm 13 take a desired position and posture when an arithmetic processing unit (not shown) in the control unit 100 determines the drive amounts (joint angles) of their joints according to a predetermined drive control program. Hands 12a and 13a capable of gripping an object are provided at the tips of the right arm 12 and the left arm 13, and a directional microphone 12b is held by the hand 12a of the right arm 12, gripped with its tip inclined slightly away from the torso 10. The microphone 12b serves as the voice acquisition unit that acquires voices uttered around the voice recognition robot 1, and in the present embodiment the right arm 12 holding the microphone 12b corresponds to the holding unit. In each of the right arm 12 and the left arm 13, the upper arm member, lower arm member, and hand unit are connected by joint members that can be driven according to commands from the control unit 100, so the arms can freely take any posture the joint drive range of the joint members allows.

  As shown in FIG. 3, the control unit 100 includes: a voice recognition unit 101 that creates voice data from the voice acquired by the microphone 12b and recognizes its content; a measurement unit 102 that measures the sound pressure (sound intensity) of the voice data; a face detection unit 103 that detects the face of a person present in the images captured by the imaging units 111 and 112 provided in the head 11; a discrimination unit 104 that discriminates the position and orientation of the detected face; an extraction unit 105 that extracts the lips from the detected face; an identification unit 106 that identifies the center point of the extracted lips; a speech synthesis unit 107 that creates response data for output as voice; and a storage area 108 that stores predetermined programs and data.

  The voice recognition unit 101 converts the voice input through the microphone 12b into voice data such as a WAVE file through an A/D conversion unit (not shown), divides the voice data into syllables, and replaces them with words using the word database stored in the storage area 108. It then analyzes the words contained in the voice data and their order, and selects from the many sentences stored in the storage area 108 the sentence closest to the analyzed voice data. If the degree of approximation between the selected sentence and the voice data is at or above a predetermined value, the analyzed voice data is recognized as having the same content as the selected sentence, and a signal indicating that the acquired voice equals the selected sentence is output. If even the closest sentence falls below the predetermined degree of approximation, it is judged that no corresponding sentence is stored in the storage area, and a signal indicating that the content of the acquired voice could not be recognized is output.
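
The patent does not disclose the matching algorithm itself; the following sketch of nearest-sentence selection with an approximation threshold is an assumption for illustration (including the use of Python's difflib as the similarity measure):

```python
import difflib

def recognize_sentence(words, stored_sentences, threshold=0.75):
    """Select the stored sentence closest to the analyzed word sequence;
    return None when the best degree of approximation falls below threshold."""
    utterance = " ".join(words)
    best_sentence, best_score = None, 0.0
    for sentence in stored_sentences:
        score = difflib.SequenceMatcher(None, utterance, sentence).ratio()
        if score > best_score:
            best_sentence, best_score = sentence, score
    return best_sentence if best_score >= threshold else None
```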

  The measurement unit 102 obtains the sound pressure (strength) of the voice from the amplitude of the voice data created by the voice recognition unit 101, yielding the sound pressure of the input voice as a time series. The obtained sound pressure data are stored in the storage area 108.
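
As a rough illustration of obtaining sound pressure from amplitude in time series (a sketch under assumed names, not the patent's implementation), the per-frame RMS of the waveform can serve as the measured value:

```python
import numpy as np

def sound_pressure_series(samples, sample_rate, frame_ms=100):
    """Per-frame RMS amplitude of a mono waveform (numpy float array),
    used as a time-series proxy for sound pressure."""
    frame = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame
    return [float(np.sqrt(np.mean(samples[i * frame:(i + 1) * frame] ** 2)))
            for i in range(n_frames)]
```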

  The face detection unit 103 estimates edges corresponding to the contour of a face from the image data captured by the imaging units 111 and 112, and detects the region enclosed by the estimated contour as a human face.

  From the position of the eyes in each face detected by the face detection unit 103, that is, from their relative distance and direction as seen from the robot, it can be estimated in which direction the detected face is oriented. Specifically, as shown in FIG. 4, the center positions E10 and E11 of the eyes (right eye E1 and left eye E2) in the person's face are identified, and the midpoint M of the line segment connecting them is determined. A direction D is then obtained in a plane parallel to the floor containing that line segment, running from the midpoint M perpendicular to it; this is regarded as the line-of-sight direction of the eyes, that is, the direction in which the face containing them is oriented. The direction of each detected face is obtained in this way, and a signal for each direction is output together with the relative position, as seen from the robot, at which the face exists.

  The discrimination unit 104 discriminates, from the relative position and direction of each face in the captured image detected by the face detection unit 103, which faces are turned toward the robot. Specifically, as shown in FIG. 5, taking its own position P (for example, the center point of the head 11) as a reference, it combines the center position of each face, determined from the positions of the right and left eyes, with the direction the face is oriented, and judges whether that direction includes its own position. The direction in which each face is oriented is given a predetermined width; in detail, a small angle (for example, 5 degrees) on either side of each direction in the plane horizontal to the floor. In this way, it is determined whether each face is turned toward the robot.
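
A compact floor-plane sketch of this direction estimate and facing test (function names are assumptions; the sign of the perpendicular depends on which eye is passed first, which the patent does not fix):

```python
import numpy as np

def face_direction(right_eye, left_eye):
    """Midpoint M of the eye segment and the horizontal direction D
    perpendicular to it, all in 2-D floor-plane coordinates."""
    r = np.asarray(right_eye, dtype=float)
    l = np.asarray(left_eye, dtype=float)
    mid = (r + l) / 2.0
    seg = l - r
    normal = np.array([seg[1], -seg[0]])      # one of the two perpendiculars
    return mid, normal / np.linalg.norm(normal)

def is_facing(robot_pos, mid, direction, tol_deg=5.0):
    """True when the robot lies within tol_deg of the face's direction D."""
    to_robot = np.asarray(robot_pos, dtype=float) - mid
    cos = np.dot(direction, to_robot) / np.linalg.norm(to_robot)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))) <= tol_deg
```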

  The extraction unit 105 extracts the lips from a face in the image data detected by the face detection unit 103, chosen from among those the discrimination unit 104 has judged to be turned toward the robot. If several faces are judged to be turned toward the robot, the face closest to the robot is selected and the lips are extracted from that face.

  As a concrete extraction method, the extraction unit 105 recognizes as lips any portion of the region inside the face contour that substantially matches one of a plurality of pre-stored lip data, and extracts the recognized portion as the lips of that face.

  The identification unit 106 identifies, as the lip center point, the center of gravity defined by the lip contour extracted by the extraction unit 105. Specifically, the extracted contour is obtained in a plane, the center of gravity of the region it encloses is identified, and that position is taken as the lip center point. The lip center point identified by the identification unit 106 is stored in the storage area 108 as coordinates representing a relative position with the position of the voice recognition robot 1 as the reference point.

  The speech synthesis unit 107 reads out the most appropriate response sentence data, corresponding to the acquired speech content recognized by the speech recognition unit 101, from the large group of response sentence data stored in advance in the storage area, converts it into an audio file, and outputs it through the speaker 115. The speech recognition robot 1 configured in this way images a person located in front of it, identifies, from among the persons in the captured image, the one to whom it should respond, recognizes the content of the voice that person utters, and outputs voice corresponding to that content.

The storage area 108 stores one or more programs for operating the joint members, including the arms and neck, of the speech recognition robot 1 and multiple kinds of response sentences for voice output; in addition, it stores the length and directivity of the microphone 12b held by the right arm 12, and newly stores additional information such as the lip center point described above. The storage area 108 also stores a target sound pressure, the target value for the intensity (sound pressure) of the voice the microphone 12b acquires from the speaker, and a formula relating the measured sound pressure to the distance between the lips and the microphone. Based on this relationship, expressed by Equation 1 below, the distance between the lips and the microphone at which the measured sound pressure equals the target sound pressure is calculated.
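
Equation 1 itself is not reproduced in this record. Assuming the usual far-field inverse-distance law for sound pressure from a point source, it would take a form like the following reconstruction (an assumption, not the patent's verbatim formula):

```latex
% assumed inverse-distance relation between sound pressure and distance
P_{\mathrm{measured}} = P_{\mathrm{ref}} \cdot \frac{d_{\mathrm{ref}}}{d}
\qquad\Longrightarrow\qquad
d_{\mathrm{target}} = d_{\mathrm{current}} \cdot \frac{P_{\mathrm{measured}}}{P_{\mathrm{target}}}
```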

  The voice recognition robot 1 configured in this way uses the difference between the measured sound pressure and the target sound pressure as a parameter to specify the position of the microphone 12b at which the distance between the center point 201a of the lips 201 of the speaker 200 and the microphone 12b takes a predetermined appropriate value. It then changes its posture, including that of the right arm 12 holding the microphone 12b, taking into account both the condition determining the distance between the lip center point 201a and the microphone 12b and the orientation of the microphone's directivity with respect to the lips 201. The posture change is performed mainly by driving the joints of the right arm 12 under an arithmetic processing unit (not shown) of the control unit 100. The procedure is described with reference to the schematic diagram in FIG. 6 and the flowchart in FIG. 7. FIG. 6 shows a speaker 200 near the voice recognition robot 1 uttering a voice; as shown there, the robot adjusts the position of the tip of the microphone 12b and its inclination by changing the angle θ1 between the upper and lower arm of the right arm 12 and the angle θ2 between the lower arm and the hand 12a gripping the microphone 12b. FIG. 7 is a flowchart of the procedure from detecting the speaker's face to changing the posture, described in detail below.
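
For intuition about deriving angles such as θ1 and θ2 from a target microphone-tip position, here is a standard planar two-link inverse-kinematics sketch; it is an analogy only, since the patent measures θ1 at the elbow and θ2 at the wrist, and the link lengths l1 and l2 are assumed inputs:

```python
import numpy as np

def two_link_ik(x, y, l1, l2):
    """Joint angles placing the tip of a planar two-link arm at (x, y):
    shoulder and elbow angles for the elbow-down solution."""
    d2 = x * x + y * y
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target lies outside the arm's workspace")
    elbow = np.arccos(cos_elbow)
    shoulder = np.arctan2(y, x) - np.arctan2(l2 * np.sin(elbow),
                                             l1 + l2 * np.cos(elbow))
    return shoulder, elbow
```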

  First, when power is supplied to the voice recognition robot 1, it stands ready to acquire voices from its surroundings. When a speaker 200 nearby talks to it, the robot acquires the uttered voice with the microphone 12b serving as the voice acquisition unit, and the sound source identification units 113 and 114 identify the direction (relative to the robot) from which the voice was uttered (step 101). The voice acquired by the microphone 12b is converted into voice data by the voice recognition unit 101, and the measurement unit 102 starts measuring the sound pressure of the acquired voice from that data (step 102). At this time, the head 11 is rotated so that its front faces the direction identified by the sound source identification units 113 and 114, and imaging is started.

  The image data captured by the imaging units 111 and 112 are input to the control unit 100, and it is judged whether the face detection unit 103 can detect a human face in them (step 103). If even one face can be detected, the discrimination unit 104 discriminates the orientation of each detected face to judge whether any face is turned toward the voice recognition robot 1 (step 104). If there are such faces, their distances from the robot are obtained and the face at the closest position is selected (step 105). If no face is turned toward the robot, it is judged that no one is talking to the robot, and the robot returns to the state of waiting to acquire voice.

Next, once the closest face turned toward the voice recognition robot 1 has been determined, the portion corresponding to the lips is extracted from the area that face occupies in the image data (step 106). The center point 201a of the extracted lips is then identified and its position coordinates (x_t, y_t, z_t) are calculated, after which the distance from the robot to the point in the lips corresponding to those coordinates is obtained by the principle of triangulation, using the images captured by the imaging units 111 and 112 (step 107). The relative position of the speaker's lip center point 201a with respect to the position of the speech recognition robot 1 is thus identified.
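
For a rectified stereo pair such as the head cameras 111 and 112, triangulation reduces to depth from horizontal disparity. A minimal sketch (focal length, baseline, and pixel coordinates are assumed example inputs, not values from the patent):

```python
def triangulate_depth(x_left, x_right, focal_px, baseline_m):
    """Depth of a point seen by two rectified cameras separated by
    baseline_m, from its horizontal pixel coordinates in each image."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("expected positive disparity for a point in front")
    return focal_px * baseline_m / disparity

# e.g. a lip center 40 px apart in images from cameras 6 cm apart:
# triangulate_depth(320.0, 280.0, focal_px=700.0, baseline_m=0.06) -> 1.05 m
```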

  In parallel with the operation of continuously imaging the speaker's face and identifying the position of the lip center point, voice from the direction identified by the sound source identification units 113 and 114, that is, from the position where the speaker is present, is continuously acquired, and the voice recognition unit 101 creates voice data from the acquired voice (step 201). The sound pressure of the voice is then measured from the created voice data by the measurement unit 102, and the sound pressure measured in time series is stored (step 202).

In this way, based on Equation 1 above, a target distance between the microphone tip and the lip center point for obtaining the target sound pressure is calculated from the sound pressure of the speaker's voice obtained in time series and the relative distance from the speech recognition robot 1 to the speaker's lip center point (step 108). The target position (X_t, Y_t, Z_t) of the tip of the microphone 12b at which the acquired sound pressure becomes the target sound pressure is then determined by setting the microphone-tip-to-lip-center distance to the target distance, and the joint angles θ1 and θ2 of the right arm 12 needed for the microphone tip to reach that position are calculated (step 109).

  Further, among the pairs θ1 and θ2 satisfying this relationship, values are chosen so that the speaker's lip center point lies in the direction of the directivity of the microphone 12b as seen from its tip (step 110). Any method may be used to point the microphone tip at the lip center point; for example, the coordinates of the tip and rear end of the microphone 12b may be obtained, and the tip position determined so that the vector connecting those coordinates points from the tip of the microphone 12b toward the coordinates of the lip center point.
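
A small sketch of that vector test (assumed names; points are 3-D coordinates in the robot frame), measuring how far the microphone axis strays from the lip direction:

```python
import numpy as np

def mic_axis_error_deg(mic_rear, mic_tip, lip_center):
    """Angle in degrees between the microphone axis (rear end to tip)
    and the direction from the tip to the lip center point."""
    axis = np.asarray(mic_tip, float) - np.asarray(mic_rear, float)
    to_lip = np.asarray(lip_center, float) - np.asarray(mic_tip, float)
    cos = np.dot(axis, to_lip) / (np.linalg.norm(axis) * np.linalg.norm(to_lip))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```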

To realize the joint angles θ1 and θ2 of the right arm 12 determined in this way, the joint members of the right arm are driven so that the tip of the microphone 12b, near the speaker's lips, reaches the target position (X_t, Y_t, Z_t) (step 111). Since the target position of the microphone tip changes with changes in the strength of the speaker's voice or in the speaker's position, the flow from step 101 to step 111 is repeated until the voice acquired by the microphone 12b is interrupted for a certain time (step 112). That is, the head 11 is rotated so that the region extracted as the selected face stays positioned at the approximate center of the image data, so that the front of the head 11 (the front of the voice recognition robot 1) keeps facing the face of the person talking to the robot. As a result, the relative positional relationship between the speaker's lips and the microphone tip is kept constant despite such changes, the sound pressure of the acquired voice stays constant, and the accuracy of speech recognition in the speech recognition unit 101 improves.
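
The repeat-until-silence loop of steps 101 to 111 could be organized as follows; every method on the robot object here is a hypothetical interface named after the flowchart steps, not an API from the patent:

```python
import time

def track_speaker(robot, silence_limit_s=3.0):
    """Repeat the localize-image-measure-reach cycle (steps 101-111)
    until the acquired voice stays interrupted for silence_limit_s."""
    last_voice = time.monotonic()
    while time.monotonic() - last_voice < silence_limit_s:
        direction = robot.localize_sound()          # step 101
        face = robot.detect_facing_face(direction)  # steps 103-105
        if face is None:
            continue                                # keep waiting for a face
        lip = robot.lip_center(face)                # steps 106-107
        d_target = robot.target_distance()          # step 108, Equation 1
        robot.move_mic_tip(lip, d_target)           # steps 109-111
        if robot.voice_active():
            last_voice = time.monotonic()           # reset silence timer
```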

  When the voice acquired by the voice recognition unit 101 is interrupted for a certain time, it is further judged whether voice recognition is to be continued (step 113). If it is to be continued, the procedure from step 101 is repeated; if it is to be terminated, termination processing is performed according to a predetermined procedure.

  As described above, in this voice recognition robot and its control method, the distance between the lips of the speaker who uttered the voice and the microphone acquiring it is determined so that the sound pressure of the voice acquired by the microphone takes an appropriate value. In addition, the joints of the right arm serving as the holding unit are driven so that the lip center point lies on the extension of the line running from the microphone tip in the direction of the microphone's directivity, so the microphone can be positioned in a way that exploits its directivity.

  In the present embodiment, the position of the microphone is determined in consideration of the directivity of the microphone held by the right arm, but the present invention is not limited to this. For example, the drive amounts (joint angles) of the right arm's joint members may be determined from the distance between the microphone and the lips alone, without considering directivity. In that case, the microphone position may be specified according to the allowable joint range of the right arm's joint members, the distance between the speaker and the voice recognition robot, and so on.

  Further, when the right arm's drive amounts (joint angles) are determined from the directivity of the microphone, angles θ1 and θ2 that place the lip center point exactly in the direction of the microphone's directivity from the tip of the microphone 12b may not exist. In that case, it suffices to position the lip center point within a predetermined angular range of the direction of directivity (for example, within a predetermined angle of inclination about the axis indicating the directivity). Even when θ1 and θ2 cannot be specified strictly, the microphone 12b can then be moved to a position that takes its directivity into account to some extent, while the robot keeps the front of its head 11 (its front face) turned toward the face of the person talking to it.

  The operation of directing the front of the head 11 (the front of the voice recognition robot 1) toward the face of the person talking to the robot continues until the speaker moves to a position more than a predetermined distance from the robot. That is, the operation may be continued while it is judged that the relative distance between the center position of the selected speaker's lips and the voice recognition robot 1 is less than the predetermined distance. When the speaker is judged to have moved the predetermined distance or more away, the voice acquisition and recognition operations may be terminated and the robot returned to the standby state for the next voice acquisition.

Embodiment 2 of the Invention
Next, a voice recognition robot according to a second embodiment of the present invention will be described with reference to FIGS. 8 and 9. In the present embodiment, the same components as those described in the first embodiment are denoted by the same reference numerals, and the description thereof is omitted.

  Like the speech recognition robot of the first embodiment, the voice recognition robot 1′ shown in FIG. 8 is a humanoid robot including the torso 10, the head 11, and the right arm 12 and left arm 13 connected to the torso 10, and in addition a waist 20 connected to the torso 10 and a right leg 21 and left leg 22 connected to the waist 20. The robot 1′ walks on two legs by moving the right leg 21 and left leg 22 alternately; these legs correspond to the moving means defined in the present invention.

  The right leg 21 and left leg 22 used for bipedal walking each comprise members such as a hip joint, thigh, knee joint, lower leg, ankle joint, and toes. These members are connected via joints (not shown) and driven freely by a plurality of motors (not shown). The motors driving the joints are operated according to a predetermined control program by an arithmetic processing unit (not shown) in the control unit 100, which determines the drive angles of the respective joints, so the robot can take a desired posture and move to a predetermined position by walking on two legs.

  The voice recognition robot 1′ configured in this way turns the directivity of the microphone 12b toward the lips 201 of the speaker 200 and changes its posture so that the distance between the lips 201 and the microphone 12b takes a predetermined appropriate value; if changing posture alone cannot make that distance appropriate, it moves its own position to satisfy the required relative positional relationship between the lips 201 and the microphone 12b. The posture change is performed mainly by driving the joints of the right arm 12 holding the microphone 12b under an arithmetic processing unit (not shown) of the control unit 100, and movement is performed by driving the right leg 21 and left leg 22. As in the first embodiment, the robot 1′ adjusts the position of the tip of the microphone 12b and its inclination by changing the angle θ1 between the upper and lower arm of the right arm 12 and the angle θ2 between the lower arm and the hand 12a gripping the microphone 12b. The procedure is described in detail below with reference to the flowchart shown in FIG. 9.

  First, when power is supplied to the voice recognition robot 1′, it stands by, ready to acquire voice from its surroundings. When, in this state, a speaker 200 near the robot talks to it, the robot acquires the uttered voice with the microphone 12b serving as the voice acquisition unit, and the sound source identification units 113 and 114 identify the direction (relative to the robot) from which the voice was uttered (step 301). The voice acquired by the microphone 12b is converted into voice data by the voice recognition unit 101, and the measurement unit 102 starts measuring the sound pressure of the acquired voice from that data (step 302). At this time, the head 11 is rotated so that its front faces the direction identified by the sound source identification units 113 and 114 and imaging is started, and the legs are driven to move so that the front of the head 11 becomes the front of the body, correcting the robot's own position (step 303).

  The image data captured by the imaging units 111 and 112 are input to the control unit 100, and it is judged whether the face detection unit 103 can detect a human face in them (step 304). If even one face can be detected, the discrimination unit 104 discriminates the orientation of each detected face to judge whether any face is turned toward the voice recognition robot 1′ (step 305). If there are such faces, their distances from the robot are obtained and the face at the closest position is selected (step 306). Conversely, if no face is turned toward the robot, it is judged that no one is talking to the voice recognition robot 1′, movement by the legs is stopped, and the robot returns to the standing state in which voice can be acquired.

Next, once the closest face turned toward the voice recognition robot 1′ has been determined, the portion corresponding to the lips is extracted from the area that face occupies in the image data (step 307). The extracted lip center point 201a is then identified and its position coordinates (x_t, y_t, z_t) are calculated, after which the distance from the robot to the point in the lips corresponding to those coordinates is obtained by the principle of triangulation, using the images captured by the imaging units 111 and 112 (step 308). The relative position of the speaker's lip center point with respect to the position of the speech recognition robot 1′ is thus identified.

  From the relative position between the speech recognition robot 1′ and the speaker's lip center point, it is then judged whether the tip of the microphone 12b held on the right arm can reach a position sufficiently close to the lip center point (step 309). As an example of such a judgment, the region the microphone 12b can reach through joint driving of the right arm 12 from a specific reference (for example, the robot's standing position) is calculated, the point in that region closest to the speaker's lip center point is found, and if the distance between that point and the lip center point does not fall within a predetermined distance, it is judged that the speech recognition robot 1′ is not sufficiently close to the speaker.
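
A crude version of this reachability test (assumed names, and a two-link annulus approximation in place of the patent's full reachable-region computation):

```python
import numpy as np

def arm_can_reach(shoulder, lip_target, l1, l2, margin=0.05):
    """True when lip_target lies inside the annular workspace of a
    two-link arm anchored at shoulder, shrunk by a safety margin (m)."""
    d = np.linalg.norm(np.asarray(lip_target, float) -
                       np.asarray(shoulder, float))
    return abs(l1 - l2) + margin <= d <= (l1 + l2) - margin
```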

  If it is judged by this procedure that the voice recognition robot 1′ is not sufficiently close to the speaker, the robot changes its standing position by driving its legs and moving (step 401). The destination is chosen so that a specific part of the robot (for example, its standing position) comes within a predetermined distance in the direction of the speaker's lip center point. The process then returns to step 308 and the relative position of the lip center point with respect to the robot is identified again. These steps are repeated until it is judged that the robot can bring the microphone sufficiently close to the speaker's lip center point.

  Meanwhile, in parallel with the operation of continuously imaging the speaker's face and identifying the position of the lip center point, voice from the direction identified by the sound source identification units 113 and 114, that is, from the position where the speaker is present, is continuously acquired, and the voice recognition unit 101 creates voice data from the acquired voice (step 402). The sound pressure of the voice is then measured from the created voice data by the measurement unit 102, and the sound pressure measured in time series is stored (step 403).

When it is judged, as described above, that the tip of the microphone 12b held by the right arm of the speech recognition robot 1′ can reach a position sufficiently close to the speaker's lip center point, a target distance between the microphone tip and the lip center point for obtaining the target sound pressure is calculated, based on Equation 1 above, from the sound pressure of the speaker's voice obtained in time series and the relative distance from the robot to the speaker's lip center point (step 310). The target position (X_t, Y_t, Z_t) of the tip of the microphone 12b at which the acquired sound pressure becomes the target sound pressure is then determined by setting the microphone-tip-to-lip-center distance to the target distance, and the joint angles θ1 and θ2 of the right arm 12 needed for the microphone tip to reach that position are calculated (step 311).

  Further, it is determined whether, among the pairs θ1 and θ2 satisfying this relationship, values can be specified such that the speaker's lip center point lies in the direction of the directivity of the microphone 12b as seen from its tip (step 312).

When the values of θ1 and θ2 can be specified, the joint members of the right arm are driven to realize the determined joint angles θ1 and θ2, so that the tip of the microphone 12b reaches the target position (X_t, Y_t, Z_t) near the speaker's lips (step 313).

  On the other hand, if the values of θ1 and θ2 cannot be specified, it is judged that the speech recognition robot 1′ is too far from the speaker; its standing position is corrected (step 404), and the process returns to the step of determining whether values of θ1 and θ2 can be specified (step 312).

  Since the target position of the tip of the microphone 12b changes with changes in the strength of the speaker's voice, the speaker's position, and so on, the flow from step 301 to step 313 is repeated until the voice acquired by the microphone 12b is interrupted for a certain time (step 314). That is, after rotating the head 11 so that the region extracted as the selected face stays positioned at the approximate center of the image data, the legs are driven so that the front of the head 11 becomes the front of the body, correcting the robot's own position. As a result, the relative positional relationship between the speaker's lips and the microphone tip is kept constant despite such changes, the sound pressure of the acquired voice stays constant, and the accuracy of speech recognition in the speech recognition unit 101 improves.

  When the voice acquired by the voice recognition unit 101 is interrupted for a certain time, it is further judged whether voice recognition is to be continued (step 315). If it is to be continued, the procedure from step 301 is repeated; if it is to be terminated, termination processing is performed according to a predetermined procedure.

  As described above, in the voice recognition robot and control method of the present embodiment, the distance between the lips of the speaker who uttered the voice and the microphone acquiring it is determined so that the sound pressure of the voice acquired by the microphone takes an appropriate value. In addition, the joints of the right arm serving as the holding unit are driven so that the lip center point lies on the extension of the line running from the microphone tip in the direction of the microphone's directivity, and the robot autonomously corrects its standing position. The microphone can thus be positioned, with its directivity taken into account, at a distance appropriate for performing speech recognition on the speaker.

  In the above embodiment, legs (a right leg and a left leg) for walking were described as the moving means of the speech recognition robot, but the present invention is not limited to this. General moving means such as rotationally driven wheels may be incorporated in the robot. Moreover, when the robot's position is changed by the moving means, the robot need not recognize its own position by itself; position control may instead be performed by a position recognition station provided outside the robot.

  Further, the voice recognition robot according to the present invention is not limited to the humanoid robot as described above, and is not limited to one that performs movement or posture change by joint driving. For example, the holding unit that holds the microphone as the sound acquisition unit is not limited to the arm unit that is joint-driven as described above, and can be replaced with a simple rotation member or a telescopic member.

  As described above, according to the voice recognition robot and its control method according to the present invention, the microphone serving as the voice acquisition unit can be positioned at a relative distance from the speaker's lips suited to the target sound pressure, so more accurate speech recognition can be performed.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is an overall schematic view showing the speech recognition robot of the first embodiment of the present invention installed indoors.
FIG. 2 is a schematic view of the speech recognition robot shown in FIG. 1.
FIG. 3 is a block diagram conceptually showing the inside of the control unit provided in the voice recognition robot shown in FIGS. 1 and 2.
FIG. 4 is a diagram showing how the face detection unit of the speech recognition robot shown in FIGS. 1 and 2 obtains the direction of each person's face.
FIG. 5 is a diagram showing how the discrimination unit of the voice recognition robot shown in FIGS. 1 and 2 discriminates faces turned toward the robot.
FIG. 6 is a schematic view showing the voice recognition robot shown in FIGS. 1 and 2 pointing the microphone at the speaker.
FIG. 7 is a flowchart showing the procedure by which the speech recognition robot shown in FIG. 1 identifies the speaker to whom it should respond based on the acquired speech and brings the voice acquisition unit closer to the speaker.
FIG. 8 is a schematic view of the speech recognition robot of the second embodiment of the present invention.
FIG. 9 is a flowchart showing the procedure by which the speech recognition robot shown in FIG. 8 identifies the speaker to whom it should respond based on the acquired speech and brings the voice acquisition unit closer to the speaker.

Explanation of symbols

1, 1′ ... voice recognition robot
10 ... torso
11 ... head
12 ... right arm (holding unit)
12b ... microphone (voice acquisition unit)
13 ... left arm
21 ... right leg (moving means)
22 ... left leg (moving means)
100 ... control unit
101 ... voice recognition unit
102 ... measurement unit
103 ... face detection unit
104 ... discrimination unit
105 ... extraction unit
106 ... identification unit
107 ... speech synthesis unit
108 ... storage area
111, 112 ... imaging unit
113, 114 ... sound source identification unit
115 ... speaker (loudspeaker)
200 ... speaker (person)
201 ... speaker's lips
201a ... center point of lips

Claims (10)

  1. A voice recognition robot comprising:
    a sound source identification unit for identifying the direction in which a voice is uttered;
    An audio acquisition unit for acquiring the generated audio;
    A voice recognition unit for recognizing the content of the voice acquired by the voice acquisition unit;
    A measurement unit for measuring the sound pressure of the acquired voice;
    An imaging unit that captures an image of the direction in which the acquired sound is generated and creates image data of the captured image;
    A face detection unit for detecting a human face existing in the created image data;
    An extraction unit for extracting lips from the detected face;
    A specific part for identifying the center point of the extracted lips;
    a holding unit that holds the voice acquisition unit and whose posture can be changed by a driving operation of at least one of an extension/contraction operation and a joint angle change; wherein:
    The direction in which the sound identified by the sound source identification unit is generated is imaged by the imaging unit,
    A face of a person present in the image captured by the imaging unit is detected by the face detection unit;
    Identify the center point of the lips extracted from the detected human face,
    Calculate the distance between the identified center point of the lips and the position of the voice acquisition unit,
    Based on the calculated distance and the sound pressure of the sound measured by the measurement unit, the relative position of the sound acquisition unit with respect to the center point of the lips is determined,
    and the posture of the holding unit is changed so as to move the voice acquisition unit to a predetermined position.
  2. The voice recognition robot according to claim 1, wherein a target sound pressure is stored, and an optimum distance between the center point of the lips and the voice acquisition unit is obtained using the difference between the measured sound pressure and the target sound pressure as a parameter, whereby the position of the voice acquisition unit is determined.
  3. The voice recognition robot according to claim 1, wherein the voice acquisition unit is composed of a microphone having directivity, and the posture of the holding unit is changed so that the center point of the lips is positioned on an extension line extending from the tip of the voice acquisition unit in the direction of its directivity.
  4.   The voice recognition robot according to any one of claims 1 to 3, wherein the sound source specifying unit includes one or more microphones having directivity.
  5. The voice recognition robot according to any one of claims 1 to 4, wherein the extraction unit obtains the contour of the lips and uses the center of gravity specified by the contour as the center point.
  6.   The voice recognition robot according to claim 1, wherein the direction of imaging is changed so that the detected face continues to be located at the center of the captured image.
  7.   The voice recognition robot according to any one of claims 1 to 6, wherein the voice recognition robot further includes a moving unit, and is configured to be movable within a predetermined area.
  8. The voice recognition robot according to claim 1, wherein the holding unit is an arm unit including a joint that is driven and controlled, and the voice acquisition unit is held at the tip of the arm unit.
  9. A method for controlling a voice recognition robot, comprising the steps of:
    identifying the direction in which a voice was uttered;
    Acquiring the generated voice via the voice acquisition unit;
    Recognizing the content of the acquired audio;
    Measuring the sound pressure of the acquired voice;
    Capturing in the direction in which the acquired sound is generated and creating image data of the captured image;
    Detecting a human face present in the created image data;
    Extracting a lip from the detected face;
    Identifying the center point of the extracted lips;
    Calculating a distance between the identified center point of the lips and the voice acquisition unit;
    Based on the calculated distance and the sound pressure of the sound measured by the measurement unit, determining the relative position of the sound acquisition unit with respect to the center point of the lips;
    And a step of moving the voice acquisition unit to a predetermined position.
  10. The method for controlling a speech recognition robot according to claim 9, further comprising a step of determining the position of the voice acquisition unit so that, the voice acquisition unit having directivity, the center point of the lips is positioned on an extension line of the direction of that directivity from the tip of the voice acquisition unit.
JP2006311482A 2006-11-17 2006-11-17 Voice recognition robot and its control method Pending JP2008126329A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2006311482A JP2008126329A (en) 2006-11-17 2006-11-17 Voice recognition robot and its control method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2006311482A JP2008126329A (en) 2006-11-17 2006-11-17 Voice recognition robot and its control method

Publications (1)

Publication Number Publication Date
JP2008126329A true JP2008126329A (en) 2008-06-05

Family

ID=39552688

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2006311482A Pending JP2008126329A (en) 2006-11-17 2006-11-17 Voice recognition robot and its control method

Country Status (1)

Country Link
JP (1) JP2008126329A (en)


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010035472A1 (en) * 2008-09-26 2010-04-01 パナソニック株式会社 Line-of-sight direction determination device and line-of-sight direction determination method
JP5230748B2 (en) * 2008-09-26 2013-07-10 パナソニック株式会社 Gaze direction determination device and gaze direction determination method
US8538044B2 (en) 2008-09-26 2013-09-17 Panasonic Corporation Line-of-sight direction determination device and line-of-sight direction determination method
JP2010130144A (en) * 2008-11-26 2010-06-10 Toyota Motor Corp Robot, sound collecting apparatus, and sound processing method
TWI398853B (en) * 2010-05-10 2013-06-11 Univ Nat Cheng Kung System and method for simulating human speaking
JP2011091851A (en) * 2010-12-17 2011-05-06 Toyota Motor Corp Robot and sound collecting device
JP2011101407A (en) * 2010-12-28 2011-05-19 Toyota Motor Corp Robot, and sound collection apparatus
JP2015150620A (en) * 2014-02-10 2015-08-24 日本電信電話株式会社 robot control system and robot control program
WO2016132729A1 (en) * 2015-02-17 2016-08-25 日本電気株式会社 Robot control device, robot, robot control method and program recording medium
JPWO2016132729A1 (en) * 2015-02-17 2017-11-30 日本電気株式会社 Robot control apparatus, robot, robot control method and program
WO2017031860A1 (en) * 2015-08-24 2017-03-02 百度在线网络技术(北京)有限公司 Artificial intelligence-based control method and system for intelligent interaction device
JP2017054065A (en) * 2015-09-11 2017-03-16 株式会社Nttドコモ Interactive device and interactive program
JP2017126895A (en) * 2016-01-14 2017-07-20 トヨタ自動車株式会社 robot
WO2019200722A1 (en) * 2018-04-16 2019-10-24 深圳市沃特沃德股份有限公司 Sound source direction estimation method and apparatus
