JP4715738B2

JP4715738B2 - Utterance detection device and utterance detection method

Info

Publication number: JP4715738B2
Application number: JP2006341568A
Authority: JP
Inventors: 貴志内藤
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2006-12-19
Filing date: 2006-12-19
Publication date: 2011-07-06
Anticipated expiration: 2026-12-19
Also published as: JP2008152125A

Description

The present invention relates to an utterance detection device and an utterance detection method, and in particular to an utterance detection device and method that continuously capture images including a speaker's lips and detect the speaker's utterance sections from the degree to which the shape of the lips has deformed.

Speech recognition techniques are conventionally known in which speech uttered by a speaker is picked up by a microphone or the like and converted into character data or used to operate a computer. These techniques are affected by ambient noise: even when the speaker is not speaking, recognition may be performed on the noise, resulting in erroneous recognition.

As a way to reduce this erroneous recognition, techniques have been studied in which a camera continuously captures images of a region including the speaker's lips, and the utterance section in which the speaker is speaking is detected from the lip movement in the captured images. For example, Patent Document 1 discloses a technique that detects opening and closing of the mouth from the difference between the vertical distance of the lip contour and a reference value, or from the curvature value of the lip contour, and identifies which of several subjects is speaking. In this technique, a section is judged to be an utterance section if the detected difference or curvature value is equal to or greater than a predetermined threshold.

Non-Patent Document 1 discloses a technique that determines the utterance state by obtaining an utterance evaluation value from the difference between the current lip pattern and the lip pattern N frames earlier. The utterance is judged to have started when the current utterance evaluation value becomes at least twice the value a fixed time earlier (1 second in Non-Patent Document 1), and to have ended when it becomes half that value or less.

Patent Document 1: JP 2000-338987 A
Non-Patent Document 1: Murai et al., "Robust Utterance Detection Using Images Around the Mouth" (口周囲画像による頑強な発話検出), Spoken Language Information Processing 34-13, 2000

However, the techniques of Patent Document 1 and Non-Patent Document 1 have the problem that utterance sections cannot always be determined accurately.

Lip movement during speech varies between individuals, and how widely the mouth opens and closes can differ from speaker to speaker. When the technique of Patent Document 1 is applied, the utterance sections of a speaker with small lip movements may therefore not be determined accurately. If the threshold is instead set to suit speakers with small lip movements, small lip movements made outside speech may be misjudged as utterance sections. Moreover, even when the same speaker makes the same lip movement, the detected difference from the reference value and the detected curvature value may vary with the imaging environment, such as the brightness at the time of imaging, so utterance sections may again not be determined accurately.

In ordinary conversation, lip movement may also stop momentarily in the middle of an utterance. When the technique of Non-Patent Document 1 is applied, utterance start and utterance end are therefore judged repeatedly within a single utterance, and as a result the utterance section may not be determined accurately.
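The weakness described above can be illustrated with a minimal sketch of the ratio rule attributed to Non-Patent Document 1 (start when the evaluation value is at least twice the value one lag earlier, end when it is half or less). The function name, the `past > 0` guard, and the sample values are assumptions made for illustration; they are not taken from the cited document.

```python
def ratio_rule_events(values, fps=30, lag_s=1.0):
    """Return (frame_index, "start" | "end") events under the ratio rule.

    values: utterance evaluation values, one per frame (hypothetical data).
    The comparison is made against the value lag_s seconds earlier.
    """
    lag = int(fps * lag_s)
    events = []
    speaking = False
    for t in range(lag, len(values)):
        past, now = values[t - lag], values[t]
        # Assumed guard: ignore a zero past value so that any non-negative
        # current value does not trivially satisfy "at least twice".
        if not speaking and past > 0 and now >= 2 * past:
            speaking = True
            events.append((t, "start"))
        elif speaking and now <= past / 2:
            speaking = False
            events.append((t, "end"))
    return events


# One utterance with a brief pause in the middle (fps=1 keeps the lag at 1 frame).
events = ratio_rule_events([1, 4, 4, 1, 4, 4, 1], fps=1, lag_s=1.0)
print(events)
```

With this input the single utterance is split into two start/end pairs, which is the repeated start and end judgment the description identifies as the problem.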

The present invention has been made to solve the above problems, and its object is to provide an utterance detection device and an utterance detection method capable of detecting utterance sections accurately.

To achieve the above object, the utterance detection device according to claim 1 comprises: imaging means for continuously capturing images including a speaker's lips; speech collecting means for collecting speech uttered by the speaker; deformation amount deriving means for deriving, based on the images continuously captured by the imaging means, a deformation amount indicating the degree to which the shape of the lips has deformed; speaker state deriving means for deriving, based on the images captured by the imaging means, the distance from the imaging means to the speaker and the orientation of the speaker's face relative to the imaging means; determining means for determining, based on the deformation amount derived by the deformation amount deriving means, a threshold of the deformation amount used to discriminate utterance sections in which the speaker is speaking, when the distance derived by the speaker state deriving means is within a predetermined range, the derived face orientation is within a predetermined angle range relative to the imaging means, and the intensity of the speech collected by the speech collecting means is equal to or higher than a predetermined level; and detecting means for detecting utterance sections from the deformation amount derived by the deformation amount deriving means, using the threshold determined by the determining means.

According to the invention of claim 1, the imaging means continuously captures images including the speaker's lips, and the speech collecting means collects the speech uttered by the speaker. The deformation amount deriving means derives, from the continuously captured images, a deformation amount indicating the degree to which the shape of the lips has deformed, and the speaker state deriving means derives, from the captured images, the distance from the imaging means to the speaker and the orientation of the speaker's face relative to the imaging means.

Further, according to the present invention, when the distance derived by the speaker state deriving means is within a predetermined range, the derived face orientation is within a predetermined angle range relative to the imaging means, and the intensity of the speech collected by the speech collecting means is equal to or higher than a predetermined level, the determining means determines, based on the deformation amount derived by the deformation amount deriving means, the threshold of the deformation amount used to discriminate utterance sections in which the speaker is speaking. The detecting means then detects utterance sections from the derived deformation amount using the threshold determined by the determining means.

Thus, according to the invention of claim 1, images including the speaker's lips are continuously captured by the imaging means while the speech uttered by the speaker is collected; the deformation amount of the lips, the distance from the imaging means to the speaker, and the orientation of the speaker's face relative to the imaging means are derived from the captured images; the threshold of the deformation amount used to discriminate utterance sections is determined from the deformation amount derived when the distance is within a predetermined range, the face orientation is within a predetermined angle range relative to the imaging means, and the intensity of the collected speech is equal to or higher than a predetermined level; and utterance sections are detected from the derived deformation amount using the determined threshold. Utterance sections can therefore be detected accurately.

As in the invention of claim 2, the present invention may further comprise noise collecting means for collecting ambient noise, and the determining means may determine the threshold based on the deformation amount derived by the deformation amount deriving means when, in addition, the intensity of the noise collected by the noise collecting means is below a predetermined level.

As in the invention of claim 3, the present invention may further comprise speech recognition means that performs speech recognition on the speech collected by the speech collecting means and outputs accuracy information indicating the recognition accuracy, and the determining means may determine the threshold based on the deformation amount derived by the deformation amount deriving means when, in addition, the recognition accuracy indicated by the accuracy information output from the speech recognition means is equal to or higher than a predetermined accuracy.

As in the invention of claim 4, the speech collecting means may consist of two or more microphones, and the present invention may further comprise sound source estimating means for estimating the direction of the sound source relative to the imaging means based on the speech information collected by each microphone, and speaker direction deriving means for deriving the direction of the speaker relative to the imaging means based on the images captured by the imaging means. In that case the determining means may determine the threshold based on the deformation amount derived by the deformation amount deriving means when, in addition, the difference between the sound source direction estimated by the sound source estimating means and the speaker direction derived by the speaker direction deriving means is within a predetermined range.
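The direction check of claim 4 can be sketched as follows. The patent text does not say how the sound source direction is estimated; the far-field time-delay formula used here, the speed-of-sound constant, and all function and parameter names are assumptions chosen for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def source_angle_from_delay(delay_s, mic_spacing_m):
    """Far-field arrival angle (radians) from the delay between two microphones.

    Assumed model: sin(angle) = c * delay / spacing, with 0 meaning broadside.
    """
    s = SPEED_OF_SOUND * delay_s / mic_spacing_m
    s = max(-1.0, min(1.0, s))  # clamp small numerical overshoot
    return math.asin(s)


def directions_agree(source_angle, speaker_angle, tolerance):
    """True when the audio and image directions differ by at most tolerance."""
    return abs(source_angle - speaker_angle) <= tolerance


# Hypothetical values: zero delay means the source is straight ahead.
audio_angle = source_angle_from_delay(0.0, 0.2)
print(directions_agree(audio_angle, math.radians(3.0), math.radians(10.0)))
```

Only when the two directions agree would the deformation amount observed at that moment be used for determining the threshold, which guards against speech from someone other than the imaged speaker.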

To achieve the above object, the utterance detection method according to claim 5 continuously captures images including a speaker's lips with imaging means while collecting the speech uttered by the speaker; derives, based on the continuously captured images, a deformation amount indicating the degree to which the shape of the lips has deformed, and derives, based on the same images, the distance from the imaging means to the speaker and the orientation of the speaker's face relative to the imaging means; determines a threshold of the deformation amount used to discriminate utterance sections in which the speaker is speaking, based on the deformation amount derived when the derived distance is within a predetermined range, the derived face orientation is within a predetermined angle range relative to the imaging means, and the intensity of the collected speech is equal to or higher than a predetermined level; and detects utterance sections from the derived deformation amount using the determined threshold.

The invention of claim 5 thus operates in the same way as the invention of claim 1 and, like it, can detect utterance sections accurately.

As described above, the present invention continuously captures images including the speaker's lips with the imaging means while collecting the speech uttered by the speaker; derives the deformation amount of the lips, the distance from the imaging means to the speaker, and the orientation of the speaker's face relative to the imaging means from the captured images; determines the threshold of the deformation amount used to discriminate utterance sections from the deformation amount derived when the distance is within a predetermined range, the face orientation is within a predetermined angle range relative to the imaging means, and the intensity of the collected speech is equal to or higher than a predetermined level; and detects utterance sections from the derived deformation amount using the determined threshold. It therefore has the excellent effect that utterance sections can be detected accurately.

Embodiments of the present invention are described in detail below with reference to the drawings. The description covers the case where the present invention is applied to a speech recognition device.

FIG. 1 is a block diagram showing the configuration of a speech recognition device 10 according to the present embodiment.

As shown in the figure, the speech recognition device 10 comprises: a camera 12 that incorporates a CCD (Charge Coupled Device) image sensor and outputs image information representing the image formed on that sensor; two microphones 14 and 16 that output speech signals corresponding to the intensity of the input sound; an image processing unit 18 that performs various kinds of image processing on the image information output from the camera 12; a threshold determination unit 20 that determines the threshold used to discriminate whether the speaker is speaking; an utterance section detection unit 22 that detects, based on the information processed by the image processing unit 18, the utterance sections in which the imaged speaker is speaking; and a speech recognition unit 24 that performs speech recognition based on the speech signal output from the microphone 14.

The camera 12 is installed at a position from which it can image the speaker's face within a predetermined size range when the speaker is at a predetermined position. The microphone 14 is installed at a position where it can collect the speech uttered by a speaker at that predetermined position. The microphone 16 is installed a predetermined distance away from the predetermined position in order to collect ambient noise.

The camera 12 continuously images the face of the speaker at the predetermined position at, for example, 30 frames per second, and sequentially outputs image information representing each captured frame image to the image processing unit 18.

The microphone 14 collects the speech uttered by the speaker at the predetermined position and outputs a speech signal representing that speech to both the threshold determination unit 20 and the speech recognition unit 24.

The microphone 16 collects ambient noise and outputs a speech signal representing the noise to the threshold determination unit 20.

The image processing unit 18 sequentially derives a deformation amount indicating the degree to which the shape of the lips has deformed, based on the frame images represented by the image information sequentially input from the camera 12. As this deformation amount, the image processing unit 18 of the present embodiment derives a lip variation amount E: it identifies the speaker's lip shape in each frame image and compares the lip shape in the frame image represented by the newly input image information with the lip shapes in the frame images represented by the one or more items of image information input immediately before it. The technique for deriving the lip variation amount E has been proposed by the present applicant in Japanese Patent Application No. 2005-262751, so a detailed description is omitted here. Instead of the lip variation amount E, the image processing unit 18 may derive as the deformation amount the difference between the vertical distance of the lip contour and a reference value, or the curvature value of the lip contour, as described in Patent Document 1.

The image processing unit 18 sequentially outputs variation amount information indicating the derived variation amount E to the threshold determination unit 20 and the utterance section detection unit 22.

The image processing unit 18 also stores in advance, in a storage unit (not shown), distance information that gives an assumed distance from the camera 12 to the speaker for each size of the speaker's face region in the frame image. Based on this distance information, it derives the distance d from the camera 12 to the speaker from the size of the speaker's face region in each frame image represented by the sequentially input image information. Although in the present embodiment the distance d is derived from the face region size using distance information stored in advance, a second camera may instead be provided at a predetermined distance from the camera 12 to image the same region, with the image processing unit 18 deriving the distance d from the two cameras' images by the stereo method; alternatively, the distance d may be derived using, for example, a laser radar.
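The lookup of distance d from the face region size can be sketched as a small calibration table with linear interpolation. The table values, the use of face width in pixels as the size measure, and the interpolation itself are assumptions for illustration; the patent only states that distance information keyed by face region size is stored in advance.

```python
from bisect import bisect_left

# Hypothetical calibration table: (face width in pixels, distance in metres),
# sorted by increasing face width. A larger face means a closer speaker.
FACE_WIDTH_TO_DISTANCE = [(80, 1.5), (120, 1.0), (240, 0.5)]


def estimate_distance(face_width_px, table=FACE_WIDTH_TO_DISTANCE):
    """Estimate the camera-to-speaker distance d from the face width."""
    widths = [w for w, _ in table]
    # Outside the calibrated range, fall back to the nearest table entry.
    if face_width_px <= widths[0]:
        return table[0][1]
    if face_width_px >= widths[-1]:
        return table[-1][1]
    i = bisect_left(widths, face_width_px)
    (w0, d0), (w1, d1) = table[i - 1], table[i]
    frac = (face_width_px - w0) / (w1 - w0)
    return d0 + frac * (d1 - d0)


print(estimate_distance(100))  # halfway between the 80 px and 120 px entries
```

A table keeps the sketch independent of any particular lens model; a real system would calibrate it for the resolution and field of view of the camera 12.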

Furthermore, the image processing unit 18 performs pattern matching, such as the eigenspace method, on the speaker's face region in each frame image represented by the image information sequentially input from the camera 12. Taking the case where the speaker's face directly faces the camera 12 as the reference, it thereby derives the horizontal rotation angle θ (the pan angle) by which the face is rotated horizontally and the vertical rotation angle φ (the tilt angle) by which it is rotated vertically. Although the present embodiment derives the orientation of the speaker's face relative to the camera 12 by pattern matching with the eigenspace method, other techniques, such as other pattern matching methods, may be used.

The image processing unit 18 outputs the derived distance d, horizontal rotation angle θ, and vertical rotation angle φ to the threshold determination unit 20 as speaker state information.

The threshold determination unit 20 performs the threshold determination process described later and determines the threshold E_Th used to discriminate utterance sections based on: the distance d, horizontal rotation angle θ, vertical rotation angle φ, and variation amount E indicated by the speaker state information and variation amount information input from the image processing unit 18; the intensity i_v of the speech signal representing the speaker's speech input from the microphone 14; the intensity i_e of the speech signal representing the noise input from the microphone 16; and the likelihood p indicated by the likelihood information input from the speech recognition unit 24 described later. It outputs threshold information indicating the determined threshold E_Th to the utterance section detection unit 22.

Triggered by the input of threshold information from the threshold determination unit 20, the utterance section detection unit 22 begins judging the variation amount E indicated by the variation amount information from the image processing unit 18 against the threshold E_Th indicated by the threshold information, and starts outputting an utterance section detection signal. When the variation amount E is equal to or greater than the threshold E_Th, it outputs to the speech recognition unit 24 an utterance section detection signal indicating an utterance section; when the variation amount E is less than the threshold E_Th, it outputs an utterance section detection signal indicating a non-utterance section.
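The per-frame comparison performed by the utterance section detection unit 22 can be sketched as below. Grouping consecutive frames into (start, end) index pairs is an added convenience for illustration, and the function name and sample values are hypothetical; the description itself only specifies the frame-wise test of E against E_Th.

```python
def detect_sections(variations, threshold):
    """Group consecutive frames with E >= threshold into utterance sections.

    variations: lip variation amount E, one value per frame.
    Returns a list of (start, end) frame indices, end exclusive.
    """
    sections = []
    start = None
    for i, e in enumerate(variations):
        if e >= threshold:          # utterance section signal for this frame
            if start is None:
                start = i
        elif start is not None:     # non-utterance frame closes the section
            sections.append((start, i))
            start = None
    if start is not None:           # section still open at the end of input
        sections.append((start, len(variations)))
    return sections


print(detect_sections([0.1, 0.6, 0.7, 0.2, 0.5], threshold=0.5))
```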

While an utterance section detection signal is being input from the utterance section detection unit 22, the speech recognition unit 24 recognizes the speech represented by the speech signal from the microphone 14 and converts it into character data only while that signal indicates an utterance section. When no utterance section detection signal has yet been input from the utterance section detection unit 22, it recognizes the speech represented by the speech signal continuously and converts it into character data.

When recognizing speech and converting it into character data, the speech recognition unit 24 also derives a likelihood p indicating the recognition accuracy of the converted character data. The likelihood p is a value indicating how plausible the recognition result is. On recognizing speech, the speech recognition unit 24 of the present embodiment derives a likelihood p for each conversion candidate, for example 「私」 (watashi, "I"), 「若い」 (wakai, "young"), and 「たわし」 (tawashi, "scrubbing brush"), and converts the speech to the candidate with the highest likelihood.
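Selecting the conversion candidate with the highest likelihood can be sketched as a simple maximum over candidates. The likelihood values below are invented for illustration, and the dictionary representation and function name are assumptions rather than the recognizer's actual interface.

```python
def best_candidate(candidates):
    """Return (text, likelihood) for the candidate with the highest likelihood.

    candidates: dict mapping candidate text to its likelihood p.
    """
    text = max(candidates, key=candidates.get)
    return text, candidates[text]


# Hypothetical likelihoods for the three candidates named in the description.
candidates = {"watashi": 0.7, "wakai": 0.2, "tawashi": 0.1}
text, p = best_candidate(candidates)
print(text, p)
```

The likelihood p of the chosen candidate is the value that would be passed on to the threshold determination unit 20.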

The speech recognition unit 24 outputs the converted character data to an external device (not shown) and outputs likelihood information indicating the derived likelihood p to the threshold determination unit 20.

Next, the operation of the speech recognition device 10 according to the present embodiment is described.

The camera 12 captures images continuously at all times. When a speaker's face enters the imaging region, it sequentially outputs image information representing each frame image containing the speaker's face to the image processing unit 18.

The image processing unit 18 performs various kinds of image processing on the image information sequentially input from the camera 12 and derives the variation amount E of the speaker's lips in the frame image represented by that image information, the distance d from the camera 12 to the speaker, and the horizontal rotation angle θ and vertical rotation angle φ of the speaker's face relative to the camera 12. It outputs the variation amount E to the threshold determination unit 20 and the utterance section detection unit 22, and outputs the distance d, horizontal rotation angle θ, and vertical rotation angle φ to the threshold determination unit 20 as speaker state information.

Meanwhile, the microphones 14 and 16 collect sound at all times. The microphone 14 outputs a speech signal representing the speech uttered by the speaker to both the threshold determination unit 20 and the speech recognition unit 24, and the microphone 16 outputs a speech signal representing ambient noise to the threshold determination unit 20.

The speech recognition unit 24 recognizes the speech represented by the speech signal input from the microphone 14, converts it into character data, derives the likelihood p, and outputs likelihood information to the threshold determination unit 20.

When a speaker is imaged by the camera 12 and speaker state information and variation amount information are input from the image processing unit 18 for the first time, the threshold determination unit 20 executes the threshold determination process described below.

FIG. 2 is a flowchart showing the flow of the threshold determination process executed in the threshold determination unit 20.

In step 100 of the figure, as initial processing, a counter C and a variable ME for storing the maximum variation amount are each initialized to 0.

Next, in step 102, the process waits for input of speaker state information, variation amount information, likelihood information, and speech signals. In the following step 104, it is determined whether the distance d, horizontal rotation angle θ, and vertical rotation angle φ indicated by the input speaker state information satisfy all of the conditions in expressions (1) to (3) below, thereby determining whether the position of the speaker's face imaged by the camera 12 is within a range suitable for utterance detection. If the determination is affirmative, the process proceeds to step 106; if it is negative, the process returns to step 102.

$D_{MIN} < d < D_{MAX}$ ... (1)
$\theta_{MIN} < \theta < \theta_{MAX}$ ... (2)
$\varphi_{MIN} < \varphi < \varphi_{MAX}$ ... (3)

That is, when the distance d from the camera 12 to the speaker is large, the movement of the speaker's lips cannot be detected accurately; when the distance d is extremely small, the lips may not be identifiable by image processing because, for example, the outline of the face can no longer be captured.

Likewise, taking the case where the face squarely faces the camera 12 as the reference, the movement of the speaker's lips may not be detected accurately when the face is turned far in the horizontal or vertical direction.

For this reason, in the present embodiment, the distance limits D_MIN and D_MAX are predetermined according to the resolution, imaging range, and so on of the camera 12 so that the speaker's lips appear in the frame image at a size suitable for detection. The angle limits θ_MIN, θ_MAX and φ_MIN, φ_MAX are likewise predetermined so that the lips can be detected accurately in the frame image.
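The check of step 104 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `PoseLimits`, the function name `pose_is_usable`, and every numeric limit are assumptions, since the description only says the limits are predetermined from the camera's resolution and imaging range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoseLimits:
    # All values below are hypothetical placeholders, not taken from the patent.
    d_min: float = 0.3        # D_MIN, metres
    d_max: float = 1.2        # D_MAX, metres
    theta_min: float = -20.0  # theta_MIN, degrees (horizontal rotation)
    theta_max: float = 20.0   # theta_MAX
    phi_min: float = -15.0    # phi_MIN, degrees (vertical rotation)
    phi_max: float = 15.0     # phi_MAX


def pose_is_usable(d, theta, phi, limits=PoseLimits()):
    """Return True only when expressions (1) to (3) all hold.

    The inequalities are strict, as in the description, so a value that
    sits exactly on a limit is rejected.
    """
    return (
        limits.d_min < d < limits.d_max
        and limits.theta_min < theta < limits.theta_max
        and limits.phi_min < phi < limits.phi_max
    )
```

A frame that fails this check would simply be skipped, which corresponds to the return to step 102.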

In step 106, it is determined whether the intensity i_V of the audio signal input from the microphone 14, the intensity i_e of the audio signal input from the microphone 16, and the likelihood p indicated by the likelihood information satisfy all of the conditions shown in expressions (4) to (6) below, and thereby whether the speaker is actually speaking. If the determination is affirmative, the process proceeds to step 108; if it is negative, the process returns to step 102.

$i_V > I_V$ ... (4)
$p > P_0$ ... (5)
$i_e < I_e$ ... (6)

That is, when the intensity i_V of the voice input from the microphone 14 is low, or when the likelihood p is low, the speaker may not actually be speaking. In addition, when the intensity i_e of the noise input from the microphone 16 is high, the voice uttered by the speaker may not be collected adequately.

For this reason, in the present embodiment, the utterance determination level I_V and the utterance determination likelihood P_0 are predetermined at values from which it can be judged that the speaker is actually speaking, and the noise determination level I_e is predetermined at an intensity below which the voice uttered by the speaker can be collected adequately.
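The check of step 106 reduces to three comparisons. The sketch below assumes normalized intensities and a likelihood in the range 0 to 1; the function name `speaker_is_talking` and the default values for I_V, P_0, and I_e are hypothetical, because the description gives no concrete values.

```python
def speaker_is_talking(i_v, p, i_e, level_v=0.2, p0=0.7, level_e=0.5):
    """Return True only when expressions (4) to (6) all hold.

    i_v: intensity of the signal from the voice microphone (microphone 14)
    p:   likelihood reported with the recognition result
    i_e: intensity of the signal from the noise microphone (microphone 16)
    level_v, p0, level_e stand for I_V, P_0 and I_e; the defaults are
    illustrative assumptions only.
    """
    loud_enough = i_v > level_v   # expression (4)
    confident = p > p0            # expression (5)
    quiet_room = i_e < level_e    # expression (6)
    return loud_enough and confident and quiet_room
```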

In step 108, it is determined whether the variation amount E indicated by the input variation amount information is larger than the value of the variable ME. If the determination is affirmative, the process proceeds to step 110; if it is negative, the process proceeds to step 112.

In step 110, the value of the variation amount E is assigned to the variable ME, and in the following step 112, the value of the counter C is incremented.

In the following step 114, it is determined whether the value of the counter C has become larger than a predetermined value N (for example, 1000). If the determination is affirmative, the process proceeds to step 116; if it is negative, the process returns to step 102.

That is, by repeating the processing of steps 102 to 114 described above, the variable ME comes to hold the maximum value of the lip variation amount E observed while the speaker was actually speaking since the start of the threshold determination process.

In the following step 116, the threshold E_Th is calculated by substituting the value of the variable ME into expression (7) below, threshold information indicating the calculated threshold E_Th is output to the utterance section detection unit 22, and the threshold determination process ends.

$E_{Th} = \alpha \times ME$ ... (7)
where $0 < \alpha < 1$

In this way, the threshold determination process sets the threshold E_Th based on the maximum value of the lip variation amount E observed while the speaker was actually speaking, so the speaker's utterances can be detected accurately. In the present embodiment, α is set to 0.3.
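Steps 100 to 116 can be put together as a single loop. The following is a minimal sketch under stated assumptions: the function name `determine_threshold` and the tuple layout of `observations` are hypothetical, and the two booleans stand in for the results of the checks in steps 104 and 106. The defaults N = 1000 and α = 0.3 are the values given in the description.

```python
def determine_threshold(observations, n=1000, alpha=0.3):
    """Sketch of steps 100 to 116 of the threshold determination process.

    observations: iterable of (e, pose_ok, speech_ok) tuples, where e is the
    lip variation amount E and the booleans are the outcomes of steps 104
    and 106 (this layout is an assumption made for illustration).
    Returns alpha * ME, or None if the input ends before C exceeds n.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    c = 0      # step 100: counter C
    me = 0.0   # step 100: running maximum ME
    for e, pose_ok, speech_ok in observations:  # step 102: next input
        if not pose_ok:       # step 104 negative: back to step 102
            continue
        if not speech_ok:     # step 106 negative: back to step 102
            continue
        if e > me:            # step 108
            me = e            # step 110
        c += 1                # step 112
        if c > n:             # step 114
            return alpha * me  # step 116, expression (7)
    return None
```

Because step 114 tests whether C is larger than N, the loop as sketched accepts N + 1 qualifying observations before the threshold is computed.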

When the threshold information is input from the threshold determination unit 20, the utterance section detection unit 22 starts evaluating the variation amount E indicated by the variation amount information input from the image processing unit 18, using the threshold E_Th indicated by the threshold information. When the variation amount E is equal to or greater than the threshold E_Th, it outputs to the speech recognition unit 24 an utterance section detection signal indicating an utterance section; when the variation amount E is less than the threshold E_Th, it outputs to the speech recognition unit 24 an utterance section detection signal indicating a non-utterance section.

When the utterance section detection signal is input, the speech recognition unit 24 sequentially recognizes the speech indicated by the input audio signal only while that signal indicates an utterance section, converts it into character data, and outputs the converted character data to an external device (not shown).
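The behavior of the utterance section detection unit 22 and the gating performed by the speech recognition unit 24 can be sketched as follows. The function names and the assumption that audio frames and variation amounts arrive as aligned sequences are illustrative only; the description does not specify how the two streams are synchronized.

```python
def detect_sections(variations, e_th):
    """Return one flag per frame: True for an utterance section (E >= E_Th),
    False for a non-utterance section (E < E_Th)."""
    return [e >= e_th for e in variations]


def gate_audio(frames, variations, e_th):
    """Keep only the audio frames whose paired variation amount marks an
    utterance section; these are the frames that would be passed on to
    recognition. Assumes frames and variations are aligned one to one."""
    return [f for f, e in zip(frames, variations) if e >= e_th]
```

Note that the comparison is inclusive, matching the description: a variation amount exactly equal to the threshold counts as an utterance section.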

As described above, according to the present embodiment, the imaging means (here, the camera 12) continuously captures images including the speaker's lips, and the voice collecting means (here, the microphone 14) collects the voice uttered by the speaker. The deformation amount deriving means (here, the image processing unit 18) derives a deformation amount indicating the degree to which the shape of the lips has deformed, based on the images continuously captured by the imaging means, and the speaker state deriving means (here, the image processing unit 18) derives the distance from the imaging means to the speaker and the orientation of the speaker's face relative to the imaging means, based on the images captured by the imaging means. When the distance derived by the speaker state deriving means is within a predetermined range, the derived face orientation is within a predetermined angle range with respect to the imaging means, and the intensity of the voice collected by the voice collecting means is equal to or higher than a predetermined level, the determining means (here, the threshold determination unit 20) determines, based on the deformation amount derived by the deformation amount deriving means, the threshold of the deformation amount used to discriminate the utterance section in which the speaker is speaking. The detecting means then detects the utterance section from the deformation amount derived by the deformation amount deriving means, using the threshold determined by the determining means. The utterance section can therefore be detected accurately.

Further, according to the present embodiment, noise collecting means (here, the microphone 16) for collecting ambient noise is additionally provided, and the determining means determines the threshold based on the deformation amount derived by the deformation amount deriving means only when, in addition, the intensity of the noise collected by the noise collecting means is less than a predetermined level. The threshold can therefore be determined from periods in which the voice uttered by the speaker is collected adequately.

Furthermore, according to the present embodiment, speech recognition means (here, the speech recognition unit 24) is additionally provided that performs speech recognition on the voice collected by the voice collecting means and outputs accuracy information indicating the recognition accuracy, and the determining means determines the threshold based on the deformation amount derived by the deformation amount deriving means only when, in addition, the recognition accuracy indicated by the accuracy information output from the speech recognition means is equal to or higher than a predetermined accuracy. Because the threshold is thus determined from the variation amounts observed while the speaker produced speech that was recognized with high accuracy, the accuracy of speech recognition improves.

Incidentally, the microphone 14 may pick up sound from a source other than the voice uttered by the speaker. To address this, two or more microphones 14 may be arranged horizontally at a predetermined interval facing the speaker, and the threshold determination unit 20 may estimate the horizontal angle ψ_i of the sound source relative to the camera 12 from the difference in intensity between the audio signals collected by the respective microphones 14. In addition, the image processing unit 18 may derive, from each frame image indicated by the image information sequentially input from the camera 12, the horizontal angle ψ_S of the speaker's face region relative to the camera 12 and output it to the threshold determination unit 20 as part of the speaker state information. Then, in step 106 of the threshold determination process executed by the threshold determination unit 20, a determination of whether the condition of expression (8) below is satisfied may be added to the determinations of expressions (4) to (6) described above.

$|\psi_S - \psi_i| < \Psi_d$ ... (8)

That is, the angle threshold Ψ_d is set in advance to an angle within which the sound source can be judged to be the speaker. When the difference between the estimated angle ψ_i of the sound source and the derived angle ψ_S of the speaker's face region is within the angle threshold Ψ_d, the sound source is determined to be the speaker, and the lip variation amount E obtained while the condition is satisfied is the value assigned to the variable ME.

This prevents the threshold E_Th from being calculated from sound that was collected by the microphone 14 from a source other than the speaker's voice.

Furthermore, the image processing unit 18 can derive from each frame image, in addition to the horizontal angle ψ_S of the speaker's face region relative to the camera 12, the vertical angle ζ_S of the speaker's face region relative to the camera 12. By changing the number and placement of the microphones 14, the threshold determination unit 20 can also estimate the vertical angle ζ_i of the sound source relative to the camera 12. Accordingly, in step 106 of the threshold determination process, a determination of whether the condition of expression (9) below is satisfied may be made instead of, or in addition to, that of expression (8).

$|\zeta_S - \zeta_i| < Z_d$ ... (9)

Like the angle threshold Ψ_d, the angle threshold Ζ_d may be set to an angle within which the sound source can be judged to be the speaker.
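The direction agreement checks of expressions (8) and (9) can be sketched as below. The function name `source_matches_speaker` and its argument layout are hypothetical. This sketch covers expression (8) alone or expressions (8) and (9) together; the variant in which (9) replaces (8) is not shown. How the source angles are estimated from the microphone signals is outside the scope of the sketch.

```python
def source_matches_speaker(psi_s, psi_i, psi_d,
                           zeta_s=None, zeta_i=None, zeta_d=None):
    """Return True when the estimated sound source direction agrees with the
    direction of the speaker's face region.

    psi_s, psi_i: horizontal angles of the face region and the sound source
    psi_d:        horizontal angle threshold (expression (8))
    zeta_s, zeta_i, zeta_d: the vertical counterparts (expression (9));
    the vertical check is applied only when all three are supplied.
    """
    horizontal_ok = abs(psi_s - psi_i) < psi_d          # expression (8)
    if zeta_s is None or zeta_i is None or zeta_d is None:
        return horizontal_ok
    vertical_ok = abs(zeta_s - zeta_i) < zeta_d         # expression (9)
    return horizontal_ok and vertical_ok
```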

In the present embodiment, the case where the threshold E_Th is a predetermined fraction of the maximum value of the lip variation amount E has been described. However, the present invention is not limited to this. For example, the threshold may be the average of the variation amounts E obtained in the individual passes through the loop of steps 102 to 114 of the threshold determination process, or the minimum of the variation amounts E obtained in those passes.
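The three ways of deriving the threshold mentioned here can be compared side by side. In this sketch the function name `threshold_from_samples` and the `mode` parameter are hypothetical, and `accepted` is assumed to hold the variation amounts E that passed the checks of steps 104 and 106. The description does not say whether a scale factor is applied to the average or the minimum; the sketch applies none, which is an assumption.

```python
from statistics import fmean


def threshold_from_samples(accepted, mode="max", alpha=0.3):
    """Derive E_Th from the accepted variation amounts.

    mode="max":  alpha times the maximum, as in expression (7)
    mode="mean": the average of the accepted values (unscaled, by assumption)
    mode="min":  the minimum of the accepted values (unscaled, by assumption)
    """
    if not accepted:
        raise ValueError("at least one accepted variation amount is required")
    if mode == "max":
        return alpha * max(accepted)
    if mode == "mean":
        return fmean(accepted)
    if mode == "min":
        return min(accepted)
    raise ValueError(f"unknown mode: {mode!r}")
```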

The configuration of the speech recognition device 10 described in the present embodiment (see FIG. 1) is merely an example, and it goes without saying that it can be modified as appropriate without departing from the gist of the present invention.

The flow of the threshold determination process described in the present embodiment (see FIG. 2) is likewise an example, and it goes without saying that it can be modified as appropriate without departing from the gist of the present invention.

FIG. 1 is a block diagram showing the schematic configuration of the speech recognition device according to the embodiment.
FIG. 2 is a flowchart showing the flow of the threshold determination process according to the embodiment.

Explanation of symbols

10 Speech recognition device
12 Camera
14 Microphone
16 Microphone
18 Image processing unit
20 Threshold determination unit
22 Utterance section discrimination unit
24 Speech recognition unit

Claims

Imaging means for continuously capturing images including the lips of a speaker;
Voice collecting means for collecting voice uttered by the speaker;
Deformation amount deriving means for deriving a deformation amount indicating the degree of deformation of the shape of the lips based on the images continuously captured by the imaging means;
Speaker state deriving means for deriving the distance from the imaging means to the speaker and the orientation of the speaker's face relative to the imaging means based on an image captured by the imaging means;
Determining means for determining, based on the deformation amount derived by the deformation amount deriving means, a threshold value of the deformation amount used to discriminate an utterance section in which the speaker is speaking, when the distance derived by the speaker state deriving means is within a predetermined range, the derived face orientation is within a predetermined angle range with respect to the imaging means, and the intensity of the voice collected by the voice collecting means is equal to or higher than a predetermined level; and
Detecting means for detecting an utterance section from the deformation amount derived by the deformation amount deriving means using the threshold value determined by the determining means;
An utterance detection device comprising:

Further comprising noise collecting means for collecting ambient noise,
wherein the determining means determines the threshold value based on the deformation amount derived by the deformation amount deriving means when, in addition, the intensity of the noise collected by the noise collecting means is less than a predetermined level. The utterance detection device according to claim 1.

Further comprising voice recognition means for performing voice recognition on the voice collected by the voice collecting means and outputting accuracy information indicating the recognition accuracy,
wherein the determining means determines the threshold value based on the deformation amount derived by the deformation amount deriving means when, in addition, the recognition accuracy indicated by the accuracy information output from the voice recognition means is equal to or higher than a predetermined accuracy. The utterance detection device according to claim 1 or 2.

The voice collecting means is composed of two or more microphones,
the device further comprising sound source estimation means for estimating the direction of the sound source relative to the imaging means based on the audio information collected by each microphone, and
speaker direction deriving means for deriving the direction of the speaker relative to the imaging means based on the image captured by the imaging means,
wherein the determining means determines the threshold value based on the deformation amount derived by the deformation amount deriving means when, in addition, the difference between the direction of the sound source estimated by the sound source estimation means and the direction of the speaker derived by the speaker direction deriving means is within a predetermined range. The utterance detection device according to any one of claims 1 to 3.

Continuously capturing, by imaging means, images including the lips of a speaker, and collecting the voice uttered by the speaker;
Deriving a deformation amount indicating the degree of deformation of the shape of the lips based on the continuously captured images, and deriving, based on the images, the distance from the imaging means to the speaker and the orientation of the speaker's face relative to the imaging means;
Determining, based on the derived deformation amount, a threshold value of the deformation amount used to discriminate an utterance section in which the speaker is speaking, when the derived distance is within a predetermined range, the derived face orientation is within a predetermined angle range with respect to the imaging means, and the intensity of the collected voice is equal to or higher than a predetermined level; and
Detecting an utterance section from the derived deformation amount using the determined threshold value; an utterance detection method comprising the above.