JP5797009B2

JP5797009B2 - Voice recognition apparatus, robot, and voice recognition method

Info

Publication number: JP5797009B2
Application number: JP2011112595A
Authority: JP
Inventors: 日浦　亮太; 宮内　均; 大嶋　京子
Original assignee: Mitsubishi Heavy Industries Ltd
Current assignee: Mitsubishi Heavy Industries Ltd
Priority date: 2011-05-19
Filing date: 2011-05-19
Publication date: 2015-10-21
Anticipated expiration: 2031-05-19
Also published as: JP2012242609A

Description

The present invention relates to a voice recognition device, a robot, and a voice recognition method.

A robot with a voice recognition function that executes a specific operation command in response to a user's utterance is required to raise the success rate of voice recognition while reducing misrecognition caused by false reactions to noise other than human voice.

Patent Document 1 discloses a technique in which voice data uttered by a user is detected with a microphone, whether the user's mouth is moving is determined from a captured image of the user's face, and only the voice commands contained in the voice data acquired while the mouth is judged to be moving are issued as operation commands to control a robot device.
Patent Document 2 discloses a technique that, based on sound input from a microphone array, recognizes the voice and direction of a specific word or sentence that a speaker uses at the start of a conversation, points a camera in the detected direction, detects a person's face in the image input from the camera, and performs dialogue processing. Patent Document 2 further discloses a technique that restricts directivity to the detected speaker direction, recognizes the speaker's voice and direction, performs face detection, moves toward the detected face, and thereby further improves voice recognition accuracy.

Patent Document 1: JP 2007-190620 A
Patent Document 2: JP 2006-251266 A

However, because the technique disclosed in Patent Document 1 recognizes only voice commands issued while the mouth is moving, a delay in image recognition may greatly reduce the success rate of voice recognition at the start of an utterance. In addition, that technique merely uses the image to accept or reject a voice command, so it contributes nothing to improving the success rate of voice recognition itself.
Merely changing directivity, as in the technique disclosed in Patent Document 2, cannot prevent over-response to noise arriving from the direction of that directivity. Moreover, indoors the reverberation component of the room is very large, so changing directivity barely improves the volume ratio of target sound to noise, and the technique is likely to have no effect.

In general, utterance determination based on images is little affected by noise but is imprecise in timing (for example, because the mouth moves only slightly at the start of an utterance), whereas utterance determination based on sound is precise in timing but vulnerable to noise.

The present invention has been made in view of these circumstances, and its object is to provide a voice recognition device, a robot, and a voice recognition method that can raise the recognition rate of voice recognition while reducing over-response.

To solve the above problem, the voice recognition device, robot, and voice recognition method of the present invention employ the following means.

That is, a voice recognition device according to the present invention comprises: imaging means that images a subject and acquires image information representing the subject; sound acquisition means that acquires sound information representing the sound occurring while the imaging means is imaging; voice recognition means that recognizes human voice based on the sound information acquired by the sound acquisition means; utterance section detection means that detects, based on the image information acquired by the imaging means, an utterance section indicating the period during which a person is speaking; and sensitivity changing means that raises the sensitivity of voice recognition by the voice recognition means in the utterance section detected by the utterance section detection means, compared with when no utterance section is detected. The voice recognition means recognizes a state quantity at or above a predetermined threshold as human voice, and the sensitivity changing means raises the sensitivity of voice recognition by lowering the threshold in the detected utterance section, compared with when no utterance section is detected.

According to the present invention, the imaging means images the subject and acquires image information representing it, and the sound acquisition means acquires sound information representing the sound occurring while the imaging means is imaging.

The voice recognition means recognizes human voice based on the sound information. In doing so, however, it may misrecognize noise other than voice as voice. Such misrecognition produces over-response and lowers the voice recognition rate.
Misrecognition of this kind could be prevented by lowering the sensitivity of voice recognition, but with lower sensitivity, sounds that should be recognized as human voice may go unrecognized.

Therefore, the utterance section detection means detects, based on the image information acquired by the imaging means, an utterance section indicating the period during which a person is speaking. That is, a person's face is recognized from the image information, and an utterance section unaffected by noise is detected from the movement of the organs of the recognized face.
The sensitivity changing means then raises the sensitivity of voice recognition by the voice recognition means in the utterance section detected by the utterance section detection means, compared with when no utterance section is detected.

Accordingly, the sensitivity of voice recognition is raised in the utterance section, which is detected from the image information and is therefore unaffected by noise, so the present invention can raise the recognition rate of voice recognition while reducing over-response.
Furthermore, a state quantity at or above a predetermined threshold is recognized as human voice, and the sensitivity of voice recognition is raised by lowering that threshold in the utterance section detected from the image information, so the sensitivity can be changed simply.
In the voice recognition device of the present invention, the magnitude of the threshold in the utterance section and the magnitude of the threshold in sections where no utterance section is detected may be varied according to the surrounding environment.
The voice recognition device of the present invention may also lower the threshold in the utterance section detected by the utterance section detection means, calculate the state quantity from the sound indicated by the sound information acquired by the sound acquisition means, and detect as an utterance section the section in which the state quantity is at or above the threshold.
In the voice recognition device of the present invention, the detection of the utterance section from the sound indicated by the sound information may be performed after the threshold is changed, and voice recognition based on the sound information may be performed after that sound-based detection of the utterance section.

In the voice recognition device of the present invention, the sensitivity changing means may raise the sensitivity of voice recognition by the voice recognition means not only in the utterance section detected by the utterance section detection means but also during a predetermined time immediately before the utterance section, immediately after it, or both.

At the start of an utterance (the beginning of a word) or at its end (the end of a word), the mouth may not open widely, for example, so the beginning or end of the utterance may not be detected as part of the utterance section.

According to the present invention, the sensitivity of voice recognition is raised in the utterance section together with a predetermined time immediately before it, immediately after it, or both; that is, in an utterance section widened forward and backward. The predetermined time corresponds to the word beginnings and endings that the utterance section detection means may fail to detect; it is determined by experiment or the like and set in advance.

Accordingly, the sensitivity of voice recognition can be raised more reliably at the beginning and end of an utterance as well.

According to the present invention, a volume at or above a predetermined threshold is recognized as human voice, and the sensitivity of voice recognition is raised by lowering that threshold in the utterance section detected from the image information, so the sensitivity of voice recognition can be changed simply.

In the voice recognition device of the present invention, the utterance section detection means may detect the utterance section in which a person is speaking based on the movement of the mouth in the person's face.

According to the present invention, the utterance section is detected from the movement of the person's mouth, so it can be detected simply from the image information.
In the voice recognition device of the present invention, the utterance section detection means may also judge that the mouth is moving when it detects teeth.

In the voice recognition device of the present invention, the utterance section detection means may detect the utterance section in which a person is speaking based on the direction of the eyes in the person's face.

According to the present invention, the utterance section is detected from the direction of the person's eyes, that is, the line of sight, so the utterance section of a person speaking to equipment fitted with the voice recognition device can be detected simply.
In the voice recognition device of the present invention, the utterance section detection means may also detect the utterance section in which the person is speaking based on changes in the orientation or position of the head.

Meanwhile, a robot according to the present invention includes the voice recognition device described above.

Furthermore, a voice recognition method according to the present invention includes: a first step of imaging a subject and acquiring image information representing the subject with imaging means, and acquiring, with sound acquisition means, sound information representing the sound occurring while the imaging means is imaging; a second step of detecting, based on the image information acquired by the imaging means, an utterance section in which a person is speaking; and a third step of raising the sensitivity of human voice recognition based on the sound information in the utterance section detected in the second step, compared with when no utterance section is detected. In the first step, a state quantity at or above a predetermined threshold is recognized as human voice, and in the third step, the sensitivity of voice recognition is raised by lowering the threshold in the utterance section detected in the second step, compared with when no utterance section is detected.

The present invention has the excellent effect of raising the recognition rate of voice recognition while reducing over-response.

FIG. 1 is a front view of a robot according to an embodiment of the present invention.
FIG. 2 is a diagram used to explain conventional voice recognition.
FIG. 3 is a functional block diagram showing the functions of a voice recognition device according to the embodiment of the present invention.
FIG. 4 is a schematic diagram showing in concrete terms the processing performed by the utterance section detection units according to the embodiment of the present invention.
FIG. 5 is a schematic diagram showing the results of voice recognition by the voice recognition device according to the embodiment of the present invention.

An embodiment of the voice recognition device, robot, and voice recognition method according to the present invention is described below with reference to the drawings.

FIG. 1 is a front view of the robot 10 according to the present embodiment.
As shown in FIG. 1, the robot 10 has a head 12, a chest 14 that supports the head 12 from below, a right arm 16a provided on the right side of the chest 14, a left arm 16b provided on the left side of the chest 14, a waist 18 connected below the chest 14, a skirt 20 connected below the waist 18, and legs 22 connected below the skirt 20.

As shown in FIG. 1, a camera 30 for imaging the area in front of the robot and a microphone 32 are provided near the center of the front of the head 12.
The camera 30 images a subject and acquires image information representing it, and the microphone 32 acquires sound information representing the sound occurring while the camera 30 is imaging.

The robot 10 according to the present embodiment performs voice recognition processing in which it recognizes a person's face from the image information acquired by the camera 30 and recognizes the person's voice from the sound information acquired by the microphone 32.
That is, the robot 10 recognizes the face of a person trying to communicate with it, recognizes that person's voice, and acts according to these recognition results.

Conventional voice recognition is described here with reference to FIG. 2.
The robot 10 recognizes sound information whose volume (power) exceeds a predetermined threshold (hereinafter the "voice threshold") as voice uttered by a person.

However, as shown in FIG. 2(A), when noise (sound other than human voice) at or above the voice threshold occurs, the robot 10 misrecognizes that noise as human voice. Likewise, as shown in FIG. 2(B), when noise at or above the voice threshold overlaps a person's voice, the robot 10 not only misrecognizes the noise as human voice but also fails to recognize the person's voice correctly. Such misrecognition produces over-response and lowers the recognition rate of voice recognition.

Misrecognition of noise as shown in FIGS. 2(A) and 2(B) could be prevented by lowering the sensitivity of voice recognition, that is, by raising the voice threshold. With lower sensitivity, however, sounds that should be recognized as human voice may go unrecognized.

Therefore, the robot 10 according to the present embodiment detects an utterance section, indicating the period during which a person is speaking, from the movement of the organs of the person's face recognized from the image information acquired by the camera 30, and in the detected utterance section raises the sensitivity of voice recognition (lowers the voice threshold) compared with when no utterance section is detected. Because this utterance section is obtained from image information, it is unaffected by noise.
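The section-dependent voice threshold described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the embodiment: the function name `is_voice`, the per-frame representation, and the threshold values 0.2 and 0.6 are assumptions chosen for the example.

```python
def is_voice(volumes, in_utterance, intra_threshold=0.2, extra_threshold=0.6):
    """Flag each frame whose volume meets the threshold in force for that frame.

    volumes      -- per-frame volume values (hypothetical units)
    in_utterance -- per-frame booleans from the image-based detector
    The default thresholds are illustrative assumptions, not values from the source.
    """
    flags = []
    for volume, inside in zip(volumes, in_utterance):
        # Lower threshold (higher sensitivity) inside the visually detected section.
        threshold = intra_threshold if inside else extra_threshold
        flags.append(volume >= threshold)
    return flags
```

With these assumed values, a frame of volume 0.3 counts as voice only when the image-based detector reports an utterance section for that frame.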

FIG. 3 is a functional block diagram showing the functions of the voice recognition device 40, which performs the voice recognition processing.
In the robot 10 according to the present embodiment, a CPU (Central Processing Unit) executes a program to implement the processing of each component of the voice recognition device 40. The program may be installed in advance in a ROM (Read Only Memory) or other storage medium, provided stored on a computer-readable portable storage medium such as a CD-ROM, or distributed over a wired or wireless communication means.

The voice recognition device 40 includes an utterance section detection unit 42A that detects a person's utterance section from image information, an utterance section detection unit 42B that detects a person's utterance section from sound information, and a voice recognition unit 44 that recognizes the person's voice in the utterance section.

The utterance section detection unit 42A includes a facial organ detection unit 50, a motion amount calculation unit 52, a threshold processing unit 54, and a voice threshold change unit 56.

The facial organ detection unit 50 recognizes a person's face from the image information acquired by the camera 30 and detects a predetermined facial organ. Any conventionally known method may be used to detect the facial organ.
The facial organ detection unit 50 according to the present embodiment detects the person's mouth as the facial organ.

The motion amount calculation unit 52 calculates the movement of the facial organ detected by the facial organ detection unit 50.
Because the present embodiment detects the person's mouth as the facial organ, the degree of mouth opening, more specifically the amount of opening between the upper lip and the lower lip, is calculated as the movement of the mouth.
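One simple way to express the lip opening amount is the distance between an upper-lip point and a lower-lip point. The sketch below assumes that some face detector has already supplied those two points as (x, y) pixel coordinates; the source does not say how the opening is measured, so the landmark representation and the function name `mouth_opening` are hypothetical.

```python
import math


def mouth_opening(upper_lip, lower_lip):
    """Return the distance in pixels between two (x, y) lip landmarks.

    Both landmarks are assumed inputs from an unspecified face detector.
    """
    return math.dist(upper_lip, lower_lip)
```

In practice the value would probably need normalizing by face size so that it does not change with the person's distance from the camera; that refinement is omitted here.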

The threshold processing unit 54 determines whether the value calculated by the motion amount calculation unit 52 is at or above a predetermined threshold (hereinafter the "image threshold"), and detects the period (time) during which it is at or above the image threshold as an utterance section indicating the period during which a person is speaking.
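The threshold determination can be sketched as a scan over per-frame motion amounts that collects the runs at or above the image threshold. The function name `detect_sections` and the half-open (start, end) frame-index representation are assumptions made for this example.

```python
def detect_sections(values, threshold):
    """Return half-open (start, end) frame ranges where values >= threshold."""
    sections = []
    start = None
    for i, value in enumerate(values):
        if value >= threshold and start is None:
            start = i                      # a section opens at this frame
        elif value < threshold and start is not None:
            sections.append((start, i))    # the section closed before this frame
            start = None
    if start is not None:
        sections.append((start, len(values)))  # section runs to the end of input
    return sections
```

For the motion amounts `[0, 1, 5, 6, 2, 7, 8]` and an image threshold of 5, the sketch reports two sections, frames 2 to 3 and frames 5 to 6.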

The voice threshold change unit 56 raises the sensitivity of voice recognition by lowering the voice threshold in the utterance section detected by the threshold processing unit 54, compared with when no utterance section is detected. Voice threshold change information, which indicates the magnitude of the lowered voice threshold and the section to which it applies, is output from the voice threshold change unit 56 to the utterance section detection unit 42B.
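The voice threshold change information can be pictured as a per-frame threshold timeline: the higher threshold everywhere, replaced by the lowered one inside each image-based utterance section. The following sketch assumes that representation; the function name `threshold_timeline` and its parameters are hypothetical.

```python
def threshold_timeline(n_frames, sections, intra_threshold, extra_threshold):
    """Build the voice threshold in force at each frame.

    sections -- half-open (start, end) frame ranges from the image-based detector
    """
    timeline = [extra_threshold] * n_frames
    for start, end in sections:
        # Clamp to the valid frame range, then apply the lowered threshold.
        for i in range(max(0, start), min(n_frames, end)):
            timeline[i] = intra_threshold
    return timeline
```

A sound-based detector can then compare each frame's volume against the entry for the same frame.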

The utterance section detection unit 42B includes a volume calculation unit 60 and a threshold processing unit 62.

The volume calculation unit 60 calculates the volume from the amplitude of the waveform represented by the sound information acquired by the microphone 32.

The threshold processing unit 62 determines whether the volume calculated by the volume calculation unit 60 is at or above the voice threshold, and detects the section in which the volume is at or above the voice threshold as an utterance section. The threshold processing unit 62 according to the present embodiment detects the utterance section using the section and the lowered voice threshold value indicated by the voice threshold change information input from the voice threshold change unit 56, and outputs that utterance section to the voice recognition unit 44 as utterance section information.

The sound information is input to the utterance section detection unit 42B through a delay processing unit 70A, with a predetermined time delay.
As described above, the threshold processing unit 62 detects the utterance section using the voice threshold change information output from the voice threshold change unit 56; the delay ensures that processing by the threshold processing unit 62 starts only after the voice threshold change unit 56 of the utterance section detection unit 42A has finished outputting that information.

The voice recognition unit 44 includes a feature extraction unit 80 and a matching processing unit 82.

The feature extraction unit 80 extracts the features (feature quantities) of the sound by applying, for example, a Fourier transform to the sound information acquired by the microphone 32.
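As one illustration of a Fourier-transform-based feature, the sketch below computes the magnitude spectrum of a single frame with a direct discrete Fourier transform. The source names the Fourier transform only as an example and does not specify the feature quantity, so using the magnitude spectrum, and the function name `magnitude_spectrum`, are assumptions. The direct O(n^2) form is used for clarity; a real system would use an FFT.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Return |X[k]| for k = 0 .. n//2 using a direct DFT of the frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        # X[k] = sum over t of x[t] * exp(-2*pi*i*k*t/n)
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        spectrum.append(abs(coefficient))
    return spectrum
```

For an 8-sample frame holding exactly one cosine cycle, the energy concentrates in bin 1.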

Using recognition dictionary information that associates sound feature quantities with utterance content, the matching processing unit 82 identifies what the person said from the sound feature quantities in the utterance section indicated by the utterance section information output from the threshold processing unit 62 (matching processing), and outputs it as the voice recognition result. The recognition dictionary information is stored in advance in storage means (not shown).
The robot 10 then acts on the voice recognition result; for example, when the result indicates the utterance "good morning", it outputs the voice "good morning".

The sound information is likewise input to the voice recognition unit 44 through a delay processing unit 70B, with a predetermined time delay.
As described above, the matching processing unit 82 performs voice recognition using the utterance section information output from the threshold processing unit 62; the delay ensures that processing by the matching processing unit 82 starts only after the threshold processing unit 62 has finished outputting that information. The time delay of the delay processing unit 70B must therefore be longer than that of the delay processing unit 70A.
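The two delays can be modeled as fixed-length first-in, first-out buffers. The sketch below is illustrative only: the class name `DelayLine`, the frame-based delay unit, and the values 2 and 4 are assumptions; the source states only that the delay of 70B must be longer than that of 70A.

```python
from collections import deque


class DelayLine:
    """Delay a stream by a fixed number of frames."""

    def __init__(self, delay, fill=0.0):
        # Pre-fill so the first `delay` outputs are the fill value.
        self._buffer = deque([fill] * delay)

    def push(self, frame):
        """Store the newest frame and return the one from `delay` frames ago."""
        self._buffer.append(frame)
        return self._buffer.popleft()


DELAY_70A_FRAMES = 2  # hypothetical: covers the image-based detection
DELAY_70B_FRAMES = 4  # hypothetical: must exceed DELAY_70A_FRAMES
```

Feeding the same sound frames through both lines gives the sound-based detector the older frames first, and the recognizer older frames still, so each stage sees its input only after the stage before it has produced its result.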

FIGS. 4(A) and 4(B) are schematic diagrams showing in concrete terms the processing performed by the utterance section detection units 42A and 42B.

As shown in FIG. 4(A), the motion amount calculation unit 52 calculates the amount of mouth movement, for example, every 1/60 second or every 1/30 second. The threshold processing unit 54 determines whether the motion amount is at or above the image threshold and detects the period during which it is at or above the image threshold as an utterance section. As a result of this threshold determination, small mouth movements are not detected as utterance sections.

The threshold processing unit 54 according to the present embodiment also performs a widening process that extends the detected utterance section by a predetermined time immediately before and after it.
At the start of an utterance (the beginning of a word) or at its end (the end of a word), the mouth may not open widely, so the beginning or end of the utterance may not be detected as part of the utterance section.
Extending the utterance section forward and backward therefore ensures that the beginning and end of the utterance are included in it. The predetermined time used for the extension corresponds to the word beginnings and endings that the threshold processing unit 54 may fail to detect; it is determined by experiment or the like and set in advance.
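The extension of a detected utterance section can be sketched as padding each section by fixed margins, clamping to the recording, and merging sections that come to overlap. The function name `widen_sections`, the frame-based margins, and the merging step are assumptions; the source specifies only that the section is extended by a preset time before and after.

```python
def widen_sections(sections, before, after, total_frames):
    """Extend each half-open (start, end) section by the given frame margins.

    sections must be sorted by start; overlapping results are merged.
    """
    widened = []
    for start, end in sections:
        new_start = max(0, start - before)
        new_end = min(total_frames, end + after)
        if widened and new_start <= widened[-1][1]:
            # Touches or overlaps the previous section: merge the two.
            previous_start, previous_end = widened[-1]
            widened[-1] = (previous_start, max(previous_end, new_end))
        else:
            widened.append((new_start, new_end))
    return widened
```

With a margin of one frame on each side, the sections (2, 4) and (5, 7) grow to (1, 5) and (4, 8), which overlap and are therefore reported as the single section (1, 8).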

The voice threshold change unit 56 then lowers the voice threshold in the widened utterance section and outputs the result to the threshold processing unit 62 as voice threshold change information.

Meanwhile, as shown in FIG. 4(B), the volume calculation unit 60 calculates the volume from the sound indicated by the sound information, which is input with a time delay, for example as the average of the maximum amplitude values in each predetermined time interval.
The volume calculated by the volume calculation unit 60 is output to the threshold processing unit 62. The threshold processing unit 62 performs threshold determination using the voice threshold indicated by the voice threshold change information and detects each section in which the volume is equal to or greater than the voice threshold as an utterance section.
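The volume calculation and the threshold determination with a changeable voice threshold can be sketched as follows. The source only says the volume is, for example, an average of maximum amplitude values per predetermined time interval, so the chunking used here is one interpretation. The names `volume`, `is_voice`, `low_threshold`, and `high_threshold`, and all numeric values, are assumptions for illustration.

```python
from typing import List, Sequence


def volume(samples: Sequence[float], chunk: int) -> float:
    """Average of the peak absolute amplitude of each chunk of `chunk` samples."""
    if not samples:
        return 0.0
    peaks = [
        max(abs(s) for s in samples[i:i + chunk])
        for i in range(0, len(samples), chunk)
    ]
    return sum(peaks) / len(peaks)


def is_voice(
    volumes: Sequence[float],
    in_image_section: Sequence[bool],
    low_threshold: float,
    high_threshold: float,
) -> List[bool]:
    """Apply the lowered threshold only where the image-based section is active."""
    return [
        v >= (low_threshold if inside else high_threshold)
        for v, inside in zip(volumes, in_image_section)
    ]


# Hypothetical values: the same volume passes inside the image-based section only.
print(volume([0.25, -0.5, 0.125, 0.25], chunk=2))
print(is_voice([0.3, 0.3, 0.6], [False, True, False], low_threshold=0.2, high_threshold=0.5))
```

In this sketch a volume of 0.3 is rejected outside the image-based utterance section and accepted inside it, while a volume of 0.6 is accepted in either case.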

FIGS. 5(A) and 5(B) are schematic diagrams showing the results of speech recognition by the speech recognition apparatus 40 according to the present embodiment. The left diagrams of FIGS. 5(A) and 5(B) show the results of conventional speech recognition (see FIG. 2), and the right diagrams show the results of speech recognition according to the present embodiment.
As shown in the right diagram of FIG. 5(A), keeping the voice threshold high outside the utterance section prevents noise from being misrecognized. Because the voice threshold is lowered in the utterance section detected based on the image information, the sensitivity of speech recognition rises there, so the speech recognition apparatus 40 recognizes human voice correctly without over-responding to noise.
As shown in the right diagram of FIG. 5(B), even when noise and human voice overlap, the noise is no longer misrecognized, so the human voice is recognized correctly.

The magnitude of the voice threshold in an utterance section detected based on the image information (hereinafter, the "intra-section voice threshold") and the magnitude of the voice threshold in sections where no utterance section is detected (hereinafter, the "out-of-section voice threshold") may be varied according to the environment around the robot 10. For example, in an environment with loud noise (such as inside an amusement facility), the out-of-section voice threshold is set higher. In an environment where people tend to speak to the robot 10 quietly (such as inside a museum or archive), the intra-section voice threshold is set lower.
It is desirable to change the ratio between the intra-section voice threshold and the out-of-section voice threshold according to the environment around the robot 10 in this way, thereby lowering the rate of over-response to noise and improving the speech recognition rate.

The magnitudes and the ratio of the intra-section voice threshold and the out-of-section voice threshold may also be changed according to the language.

As described above, the speech recognition apparatus 40 according to the present embodiment uses the camera 30 to image a subject and acquire image information indicating the subject, and uses the microphone 32 to acquire sound information indicating the sound occurring while the camera 30 is imaging. The speech recognition apparatus 40 then detects, based on the image information acquired by the camera 30, an utterance section in which a person is speaking, and in the detected utterance section raises the sensitivity of human speech recognition based on the sound information, compared with when no utterance section is detected.
Because the utterance section is detected based on the image information, its detection is unaffected by noise, and the sensitivity of speech recognition is raised for that section. The speech recognition apparatus 40 according to the present embodiment can therefore increase the recognition rate of speech recognition while reducing over-response.

The speech recognition apparatus 40 according to the present embodiment also raises the sensitivity of speech recognition immediately before and after the utterance section detected based on the image information, as well as within it, so the sensitivity can be raised more reliably at the beginning and end of an utterance as well.

The speech recognition apparatus 40 according to the present embodiment recognizes a volume equal to or greater than a predetermined voice threshold as human voice, and raises the sensitivity of speech recognition by lowering the voice threshold in the utterance section detected based on the image information. The sensitivity of speech recognition can therefore be changed easily.

Because the speech recognition apparatus 40 according to the present embodiment detects the utterance section based on the movement of a person's mouth, it can detect the utterance section from the image information easily.

The present invention has been described above using the embodiment, but the technical scope of the present invention is not limited to the scope described in the embodiment. Various changes or improvements can be made to the embodiment without departing from the gist of the invention, and forms with such changes or improvements are also included in the technical scope of the present invention.

For example, the above embodiment describes a form in which the speech recognition apparatus 40 detects the utterance section in which a person is speaking based on the movement of the mouth in the person's face. The present invention is not limited to this, and the utterance section may instead be detected based on the direction of the eyes in the person's face, that is, the line of sight.
In this form, the speech recognition apparatus 40 identifies a person whose line of sight is directed toward the robot 10 and lowers the voice threshold while the identified person's line of sight is directed toward the robot 10.

The speech recognition apparatus 40 may also combine mouth movement and line-of-sight direction, lowering the voice threshold when the line of sight is directed toward the robot 10 and the mouth movement is equal to or greater than the image threshold. This makes it possible to recognize only the voice of the person speaking to the robot 10 even when several people are within the imaging range of the camera 30.
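The combined condition on line of sight and mouth movement can be sketched as a per-frame choice of voice threshold. This is an illustrative sketch only: the `FaceObservation` type, its fields, and the function `select_voice_threshold` are hypothetical names, and how gaze direction is actually estimated from the image is outside the scope of the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FaceObservation:
    """Hypothetical per-person measurements extracted from one camera frame."""
    mouth_movement: float  # mouth movement amount for this frame
    gaze_on_robot: bool    # True if the line of sight is directed at the robot


def select_voice_threshold(
    faces: Iterable[FaceObservation],
    image_threshold: float,
    low_threshold: float,
    high_threshold: float,
) -> float:
    """Return the lowered threshold only if someone both looks at the robot and moves the mouth."""
    for face in faces:
        if face.gaze_on_robot and face.mouth_movement >= image_threshold:
            return low_threshold
    return high_threshold


# Hypothetical frame with two people: one talks to someone else, one talks to the robot.
frame = [
    FaceObservation(mouth_movement=0.9, gaze_on_robot=False),
    FaceObservation(mouth_movement=0.7, gaze_on_robot=True),
]
print(select_voice_threshold(frame, image_threshold=0.5, low_threshold=0.2, high_threshold=0.5))
```

A person who is speaking but not looking at the robot does not lower the threshold, which matches the aim of responding only to people addressing the robot.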

Besides mouth movement and line of sight, a change in the direction or position of a person's face or head may be detected, for example, and the period in which the person's face is directed toward the robot 10 may be detected as an utterance section in which the person is speaking.

The above embodiment describes a form in which the utterance section detection unit 42A calculates the degree of mouth opening as the mouth movement, but the present invention is not limited to this. For example, the mouth movement (degree of mouth opening) may be frequency-analyzed and the mouth judged to be moving when the frequency is equal to or greater than a predetermined frequency, or the mouth may be judged to be moving when teeth are detected (when white is detected between the upper and lower lips).
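One way to read the frequency-analysis variant is to take the dominant frequency of the mouth-opening signal over a short window and compare it with a predetermined frequency. The sketch below uses a plain discrete Fourier transform from the Python standard library. The source does not specify the analysis method, so the functions `dominant_frequency` and `mouth_is_moving`, the `min_magnitude` guard against flat signals, and all numeric values are assumptions.

```python
import cmath
import math
from typing import Sequence, Tuple


def dominant_frequency(openings: Sequence[float], frame_rate: float) -> Tuple[float, float]:
    """Return (frequency in Hz, magnitude) of the strongest non-DC DFT component."""
    n = len(openings)
    if n < 2:
        return 0.0, 0.0
    mean = sum(openings) / n
    centered = [x - mean for x in openings]  # remove the constant (DC) part
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * frame_rate / n, best_mag


def mouth_is_moving(
    openings: Sequence[float],
    frame_rate: float,
    min_frequency: float,
    min_magnitude: float = 1e-6,  # hypothetical guard so a flat signal never counts
) -> bool:
    freq, mag = dominant_frequency(openings, frame_rate)
    return mag >= min_magnitude and freq >= min_frequency


# Hypothetical one-second window at 30 frames per second with a 5 Hz opening pattern.
window = [math.sin(2 * math.pi * 5 * t / 30) for t in range(30)]
print(mouth_is_moving(window, frame_rate=30.0, min_frequency=3.0))
```

A naive DFT is quadratic in the window length, which is acceptable for windows of a few dozen frames; a longer window would normally call for an FFT routine instead.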

The above embodiment describes a form in which the utterance section detection unit 42A performs the thickening process to extend the utterance section detected based on the image information both forward and backward, but the present invention is not limited to this, and only one of the period before and the period after the utterance section may be extended.

The above embodiment describes a form in which the speech recognition apparatus 40 is applied to the robot 10, but the present invention is not limited to this and may be applied to other devices such as a personal computer or an IC recorder.

DESCRIPTION OF SYMBOLS
10 Robot
30 Camera
32 Microphone
40 Speech recognition apparatus
42A Utterance section detection unit
44 Speech recognition unit
56 Voice threshold changing unit

Claims

A speech recognition apparatus comprising:
imaging means for imaging a subject and acquiring image information indicating the subject;
sound acquisition means for acquiring sound information indicating sound occurring while the imaging means is imaging;
voice recognition means for recognizing a person's voice based on the sound information acquired by the sound acquisition means;
utterance section detection means for detecting, based on the image information acquired by the imaging means, an utterance section indicating a period during which a person is speaking; and
sensitivity changing means for increasing, in the utterance section detected by the utterance section detection means, the sensitivity of speech recognition by the voice recognition means compared with a case where no utterance section is detected,
wherein the voice recognition means recognizes a state quantity equal to or greater than a predetermined threshold as human voice, and
the sensitivity changing means increases the sensitivity of speech recognition by the voice recognition means by lowering the threshold in the utterance section detected by the utterance section detection means compared with a case where no utterance section is detected.

The speech recognition apparatus according to claim 1, wherein a magnitude of the threshold in the utterance section and a magnitude of the threshold in a section in which the utterance section is not detected differ depending on a surrounding environment.

The speech recognition apparatus according to claim 1, wherein the threshold is lowered in the utterance section detected by the utterance section detection means,
the state quantity is calculated based on the sound indicated by the sound information acquired by the sound acquisition means, and
a section in which the state quantity is equal to or greater than the threshold is detected as an utterance section.

The speech recognition apparatus according to any one of claims 1 to 3, wherein detection of the utterance section based on the sound indicated by the sound information is performed after the threshold is changed, and
speech recognition based on the sound information is performed after detection of the utterance section based on the sound indicated by the sound information.

The speech recognition apparatus according to any one of claims 1 to 4, wherein the sensitivity changing means increases the sensitivity of speech recognition by the voice recognition means during a predetermined time before and after the utterance section, as well as during the utterance section detected by the utterance section detection means.

The speech recognition apparatus according to any one of claims 1 to 5, wherein the utterance section detection means detects the utterance section in which a person is speaking based on the movement of the mouth in the person's face.

The speech recognition apparatus according to any one of claims 1 to 6, wherein the utterance section detection means determines that the mouth is moving when teeth are detected.

The speech recognition apparatus according to any one of claims 1 to 7, wherein the utterance section detection means detects the utterance section in which a person is speaking based on the direction of the eyes in the person's face.

The speech recognition apparatus according to any one of claims 1 to 8, wherein the utterance section detection means detects the utterance section in which a person is speaking based on a change in the direction or position of the person's head.

A robot comprising the speech recognition apparatus according to any one of claims 1 to 9.

A speech recognition method comprising:
a first step of imaging a subject and acquiring image information indicating the subject by imaging means, and acquiring, by sound acquisition means, sound information indicating sound occurring while the imaging means is imaging;
a second step of detecting an utterance section in which a person is speaking based on the image information acquired by the imaging means; and
a third step of increasing the sensitivity of human speech recognition based on the sound information in the utterance section detected in the second step, compared with a case where no utterance section is detected,
wherein the first step recognizes a state quantity equal to or greater than a predetermined threshold as human voice, and
the third step increases the sensitivity of speech recognition by lowering the threshold in the utterance section detected in the second step compared with a case where no utterance section is detected.