JP5465166B2

JP5465166B2 - Utterance content recognition device and utterance content recognition method

Info

Publication number: JP5465166B2
Application number: JP2010287127A
Authority: JP
Inventors: 空悟守田
Original assignee: Kyocera Corp
Current assignee: Kyocera Corp
Priority date: 2010-12-24
Filing date: 2010-12-24
Publication date: 2014-04-09
Anticipated expiration: 2025-01-28
Also published as: JP2011070224A

Description

The present invention relates to an utterance content recognition device and an utterance content recognition method.

Speech recognition techniques for converting speech into character strings are known. In such a technique, a sound collector first picks up the speech uttered by a speaker. Next, a feature pattern is extracted from the collected speech. A character string pattern corresponding to the extracted feature pattern is then output as the recognition result, thereby converting the speech into a character string.

Patent Document 1 describes using such a speech recognition technique together with a mouth recognition technique that performs pattern recognition based on feature patterns of the speaker's lip shape.

Japanese Patent Laid-Open No. 6-311220

However, the conventional speech recognition technique described above has a problem: unless the speaker is near the sound collector, feature patterns cannot be extracted well from the speaker's speech, and the accuracy of speech recognition drops.

The present invention has been made to solve the above problem. One of its objects is to provide an utterance content recognition device and an utterance content recognition method that can suppress low-accuracy speech recognition being performed when the speaker is not near the sound collector.

An utterance content recognition device according to the present invention for solving the above problem is characterized by including: sound collection means for collecting sound; photographing means for capturing an image of a speaker who utters speech toward the sound collection means; speech recognition means for performing speech recognition based on the collected sound; and speech recognition restriction means for restricting the speech recognition means from performing speech recognition when the captured image does not include a speaker image showing at least part of the speaker.

When the captured image does not include a speaker image, the speaker is considered more likely to be far away than when it does. According to the present invention, the start of speech recognition is restricted when the acquired image does not include a speaker image, which suppresses low-accuracy speech recognition being performed while the speaker is not near the sound collector.

In the above utterance content recognition device, the speaker image may be a mouth image showing the speaker's mouth.

When the speaker's mouth is not turned toward the utterance content recognition device, the speaker's speech is considered less likely to reach the device than when it is. According to the present invention, speech recognition is restricted when the acquired image does not include a mouth image showing the speaker's mouth, which suppresses low-accuracy speech recognition being performed while the speaker's speech does not reach the device well.

In the above utterance content recognition device, the photographing means may capture the images sequentially, and the speech recognition restriction means may restrict the speech recognition means from performing speech recognition, even when the captured image includes the mouth image, if the mouth shown by the sequentially acquired mouth images is not moving.

When the speaker's mouth is not moving, the speaker is most likely not speaking. According to the present invention, speech recognition is restricted when the speaker's mouth is not moving, which suppresses low-accuracy speech recognition being performed while the speaker is silent.

The above utterance content recognition device may further include speech directivity control means for directing the directivity of the sound acquisition means toward the mouth shown by the mouth image included in the captured image.

According to the present invention, directing the directivity of the sound acquisition means toward the speaker's mouth shown by the acquired mouth image allows speech recognition to be performed with higher accuracy.

The above utterance content recognition device may further include: mouth recognition means for performing mouth recognition based on the shape of the speaker's mouth shown by the mouth image included in the captured image, or on the transition of that shape; and mouth recognition learning means for training the mouth recognition performed by the mouth recognition means, based on the recognition result of the speech recognition means for the collected speech and on the shape of the mouth, or the transition of that shape, shown by the mouth image included in the image captured while the speaker utters that speech.

According to the present invention, the shape of the mouth, or the transition of that shape, at the time speech is uttered can be acquired. The speech is also recognized by the speech recognition means. The recognition result of speech recognition can therefore be associated with the mouth shape or its transition, so mouth recognition can be trained.

The above utterance content recognition device may further include sound pickup state evaluation value acquisition means for acquiring a sound pickup state evaluation value indicating how good the pickup state of the sound collected by the sound collection means is, and the learning by the mouth recognition learning means may be performed based on the recognition result obtained by the speech recognition means when the pickup state indicated by the sound pickup state evaluation value is equal to or greater than a predetermined threshold.

According to the present invention, learning by the mouth recognition learning means is performed only when the speaker's speech is being collected in a good pickup state. In other words, mouth recognition learning is performed only while speech recognition is operating under good conditions, which reduces the possibility that mouth recognition is trained on inaccurate speech recognition results.

A speech recognition device according to the present invention is characterized by including: sound collection means for collecting sound; photographing means for capturing a mouth image showing the mouth of a speaker who utters speech toward the sound collection means; sound pickup state evaluation value acquisition means for acquiring a sound pickup state evaluation value indicating how good the pickup state of the sound collected by the sound collection means is; speech recognition means for performing speech recognition based on the collected sound; mouth recognition means for performing mouth recognition based on the shape of the speaker's mouth shown by the mouth image included in the captured image, or on the transition of that shape; and determination means for determining, according to the pickup state indicated by the sound pickup state evaluation value, whether recognition is performed by the speech recognition means or by the mouth recognition means.

According to the present invention, speech recognition and mouth recognition can be switched according to the pickup state, which suppresses low-accuracy speech recognition being performed when the pickup state is poor.

A configuration diagram of the utterance content recognition device according to an embodiment of the present invention.
A functional block diagram of the utterance content recognition device according to the embodiment of the present invention.
A processing flow diagram of the utterance content recognition device according to the embodiment of the present invention.

An embodiment of the present invention will be described with reference to the drawings.

The utterance content recognition device 10 according to the present invention is a computer such as a mobile phone and, as shown in FIG. 1, includes a CPU 12, a storage unit 14, an input unit 15, and an output unit 19.

The input unit 15 is a functional unit for inputting information from outside the utterance content recognition device 10 to the CPU 12. In the present embodiment it includes a sound collector 16 and an imaging device 18. The sound collector 16 is a device that can pick up sound, such as a directional microphone. The sound collector 16 has directivity with respect to the direction from which it picks up sound, and the CPU 12 can control this directivity. Specifically, the CPU 12 controls the directivity by controlling the orientation of the sound collector 16. The sound collector 16 converts the collected sound into an electrical signal and outputs it to the CPU 12.

The imaging device 18 is a device, such as a still camera or video camera, that can capture images sequentially. Its imaging direction can be changed under the control of the CPU 12. Specifically, the CPU 12 controls the imaging direction by controlling the orientation of the imaging device 18. The imaging device 18 outputs each captured image to the CPU 12 as a bitmap.

The CPU 12 is a processing unit that executes the programs stored in the storage unit 14, and it controls each unit of the utterance content recognition device 10.

The CPU 12 also performs the processing for speech recognition and mouth recognition. In speech recognition, the sound collector 16 first picks up the speech uttered by the speaker. Next, a feature pattern is extracted from the collected speech. More specifically, the CPU 12 determines whether the collected speech contains any of the feature patterns stored in the storage unit 14; the feature patterns determined to be contained are the extracted feature patterns. A character string pattern corresponding to each extracted feature pattern is then output as the recognition result, thereby converting the speech into a character string.

In mouth recognition, the imaging device 18 first captures a mouth image of the speaker. Next, a feature pattern is extracted from the shape of the mouth shown by the captured mouth image, or from the transition of that shape. More specifically, the CPU 12 determines whether the shape or its transition contains any of the feature patterns stored in the storage unit 14; the feature patterns determined to be contained are the extracted feature patterns. A character string pattern corresponding to each extracted feature pattern is then output as the recognition result, thereby converting the mouth shape or its movement into a character string.

The storage unit 14 stores the program for implementing the present embodiment. It also serves as the working memory of the CPU 12.

For speech recognition, the storage unit 14 stores speech feature patterns in association with character string patterns. For mouth recognition, it likewise stores feature patterns of the mouth shape, or of the transition of that shape, in association with character string patterns. A learning system such as a neural network that takes a feature pattern as input and outputs a character string pattern could be used instead. Here, the storage unit 14 is assumed to store feature patterns and character string patterns in association with each other.
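The association between feature patterns and character string patterns can be pictured as a lookup table. The following is a minimal sketch only: the three-element vectors, the `tolerance` parameter, and the nearest-match rule are illustrative assumptions, since the description says only that stored patterns are matched against the input.

```python
from math import dist

# Hypothetical stored table: feature pattern -> character string pattern.
# The vectors and strings are made up for illustration.
STORED_PATTERNS = {
    (1.0, 0.0, 0.0): "a",
    (0.0, 1.0, 0.0): "i",
    (0.0, 0.0, 1.0): "u",
}


def recognize(features, table=STORED_PATTERNS, tolerance=0.25):
    """Return the string whose stored pattern is closest to `features`.

    Returns None when no stored pattern lies within `tolerance`
    (an assumed stand-in for "the same pattern is contained").
    """
    best_text, best_distance = None, tolerance
    for pattern, text in table.items():
        d = dist(pattern, features)  # Euclidean distance
        if d <= best_distance:
            best_text, best_distance = text, d
    return best_text
```

The same table shape would serve both speech features and mouth-shape features; only the feature extraction in front of it differs.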

The output unit 19 outputs the data input from the CPU 12 through an output device, in accordance with instruction information input from the CPU 12. The output device may be, for example, a display device such as a display or an audio output device such as a speaker.

In the present embodiment, the utterance content recognition device 10 described above is designed to improve the accuracy of speech recognition. Specifically, it suppresses low-accuracy speech recognition being performed while the speaker is not near the sound collector 16. It also suppresses low-accuracy speech recognition being performed while the speaker's speech does not reach the device well, and while the speaker is not speaking at all. In addition, it directs the directivity of the sound acquisition means toward the speaker's mouth so that speech recognition is performed with higher accuracy. It trains mouth recognition based on the speech recognition result and on the shape, or transition of shape, of the mouth uttering that speech, and it reduces the possibility that mouth recognition is trained on inaccurate speech recognition results. Finally, by switching between speech recognition and mouth recognition according to the pickup state, it suppresses low-accuracy speech recognition being performed when the pickup state is poor.

FIG. 2 is a functional block diagram of the utterance content recognition device 10 for realizing the above functions. As shown in the figure, the CPU 12 of the utterance content recognition device 10 functionally includes an image acquisition unit 120, a mouth search unit 124, a mouth recognition unit 126, a mouth recognition learning unit 128, a sound acquisition unit 130, a signal level measurement unit 134, a speech recognition unit 136, a directivity control unit 140, a recognition/learning determination unit 142, and an integration processing unit 144. The mouth search unit 124, the mouth recognition unit 126, and the mouth recognition learning unit 128 together constitute a mouth recognition function unit 122, and the signal level measurement unit 134 and the speech recognition unit 136 together constitute a speech recognition function unit 132. The processing of each unit is described in detail below.

First, the image acquisition unit 120 sequentially acquires the images captured by the imaging device 18. When the speaker is speaking toward the sound collector 16, these images include a mouth image showing the shape of the speaker's mouth. When the imaging device 18 sequentially captures images that include the mouth image, the resulting series of images shows the transition of the speaker's mouth shape. The image acquisition unit 120 sequentially outputs the acquired images to the mouth search unit 124.

The image acquisition unit 120 also acquires direction information indicating the direction in which the imaging device 18 is pointing. The direction information is, for example, information indicating a direction relative to the housing of the utterance content recognition device 10. The image acquisition unit 120 sequentially outputs the acquired direction information to the mouth search unit 124 in association with the corresponding image.

The mouth search unit 124 searches the images sequentially input from the image acquisition unit 120 to determine whether they include the mouth image. Specifically, it attempts to extract from each image a feature pattern characteristic of a mouth. When such a feature pattern is extracted, it outputs mouth detection information indicating that fact to the recognition/learning determination unit 142. When no such feature pattern can be extracted, it outputs mouth non-detection information indicating that fact to the recognition/learning determination unit 142. It also outputs to the directivity control unit 140 the position, within the image, of the mouth shown by the found mouth image, together with the direction information that was input from the image acquisition unit 120 in association with the image containing that mouth image.

The mouth search unit 124 further extracts the mouth feature pattern from each image in the series and, at predetermined intervals, determines from the change in the extracted feature patterns whether the mouth is moving. When it determines that the mouth is moving, it outputs mouth movement information indicating that fact to the signal level measurement unit 134 and the recognition/learning determination unit 142. When it determines that the mouth is not moving, it outputs mouth stationary information indicating that fact to the signal level measurement unit 134 and the recognition/learning determination unit 142.
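The movement check described above could be sketched as follows. This is an assumption-laden illustration: the per-frame feature vectors (for example, normalized mouth width and height), the `threshold` value, and the rule "any component changes by more than the threshold between consecutive frames" are not specified in the description.

```python
def mouth_is_moving(frame_features, threshold=0.05):
    """Decide whether the mouth moved within one evaluation window.

    frame_features: sequence of per-frame feature vectors, e.g.
    (mouth_width, mouth_height); the feature choice is hypothetical.
    Returns True if any component changes by more than `threshold`
    between two consecutive frames.
    """
    for previous, current in zip(frame_features, frame_features[1:]):
        change = max(abs(c - p) for p, c in zip(previous, current))
        if change > threshold:
            return True
    return False
```

A window with fewer than two frames yields False, which corresponds to reporting the mouth as stationary until enough images have been collected.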

The mouth search unit 124 also passes the images sequentially input from the image acquisition unit 120 to the mouth recognition unit 126 unchanged.

Next, the directivity control unit 140 controls the directivity of the sound collector 16 based on two inputs from the mouth search unit 124: the position, within the image, of the mouth shown by the found mouth image, and the direction information that was input from the image acquisition unit 120 in association with the image containing that mouth image. More specifically, it controls the sound collector 16 so that its directivity points in the direction given by the imaging direction of the captured image combined with the position of the mouth within that image. This improves the pickup state of the speaker's speech, which is described later.
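One way to combine the imaging direction with the mouth position in the image is a pinhole camera model. The sketch below is illustrative only: the pan/tilt representation of the direction information, the field-of-view parameters, and the simple addition of angular offsets (a reasonable approximation for small tilt angles) are all assumptions not found in the description.

```python
import math


def mouth_direction(camera_pan_deg, camera_tilt_deg, mouth_xy, image_size, fov_deg):
    """Estimate the (pan, tilt) direction of the mouth relative to the housing.

    camera_pan_deg, camera_tilt_deg: assumed form of the direction information.
    mouth_xy: mouth position in pixels, origin at the top-left corner.
    image_size: (width, height) in pixels.
    fov_deg: (horizontal, vertical) field of view in degrees (hypothetical).
    """
    width, height = image_size
    hfov, vfov = fov_deg
    x, y = mouth_xy
    # Focal lengths in pixels under the pinhole model.
    fx = (width / 2) / math.tan(math.radians(hfov) / 2)
    fy = (height / 2) / math.tan(math.radians(vfov) / 2)
    pan_offset = math.degrees(math.atan((x - width / 2) / fx))
    # Image y grows downward, so a mouth above the center tilts upward.
    tilt_offset = math.degrees(math.atan((height / 2 - y) / fy))
    return camera_pan_deg + pan_offset, camera_tilt_deg + tilt_offset
```

The returned angles would then be the target orientation for the sound collector; a mouth at the image center leaves the direction equal to the imaging direction.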

Next, the sound acquisition unit 130 sequentially acquires the sound collected by the sound collector 16. This sound contains the speech uttered by the speaker as well as other noise. The sound acquisition unit 130 sequentially outputs the acquired sound to the signal level measurement unit 134 and the speech recognition unit 136.

For the sound sequentially input from the sound acquisition unit 130, the signal level measurement unit 134 sequentially acquires a sound pickup state evaluation value indicating how good the pickup state is. The signal-to-noise ratio (SNR) of the speech signal, for example, can be used as the sound pickup state evaluation value. When the SNR is used, the evaluation value is the ratio of the speech uttered by the speaker, contained in the input sound, to the other noise. The signal level measurement unit 134 outputs the acquired evaluation value to the recognition/learning determination unit 142. It also passes the sound sequentially input from the sound acquisition unit 130 to the speech recognition unit 136 unchanged.

To distinguish the speech uttered by the speaker from other noise, the signal level measurement unit 134 uses the mouth movement information or mouth stationary information input from the mouth search unit 124. When the mouth movement information indicates that the mouth is moving, the unit judges that the sound input from the sound acquisition unit 130 contains speech uttered by the speaker. In this case it extracts the feature patterns stored in the storage unit 14 from the input sound and separates the sound into a speech signal (the speech uttered by the speaker) and a noise signal (everything else). It then calculates the SNR from the strengths of the separated signals. When the mouth stationary information indicates that the mouth is not moving, the unit judges that the input sound contains no speech uttered by the speaker. In this case there is no speech signal, so the SNR is 0.

When the strength of the noise signal is not expected to vary much, the SNR may instead be calculated by taking, as the strength of the speech signal, the value obtained by subtracting the signal strength of the sound input from the sound acquisition unit 130 while the mouth stationary information indicates no movement from the signal strength of the sound input while the mouth movement information indicates movement.
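The subtraction-based variant can be sketched as below. The function names, the use of a linear (not decibel) ratio, and the clamping of a negative difference to zero are assumptions made for illustration; the description states only that the stationary-mouth strength is subtracted from the moving-mouth strength and that the SNR is 0 when the mouth is not moving.

```python
def snr_by_subtraction(strength_while_moving, strength_while_still):
    """Estimate a linear SNR from two measured signal strengths.

    strength_while_moving: strength measured while the mouth is moving
        (speech plus noise).
    strength_while_still: strength measured while the mouth is stationary
        (noise only).
    """
    if strength_while_still <= 0:
        raise ValueError("noise strength must be positive")
    # Assumed safeguard: never report a negative speech strength.
    speech_strength = max(strength_while_moving - strength_while_still, 0.0)
    return speech_strength / strength_while_still


def pickup_evaluation_value(mouth_moving, current_strength, noise_strength):
    """Return the sound pickup state evaluation value for the current sound."""
    if not mouth_moving:
        return 0.0  # no speech signal, so the SNR is 0
    return snr_by_subtraction(current_strength, noise_strength)
```

For example, a strength of 5.0 measured while the mouth moves against a noise strength of 1.0 gives a speech strength of 4.0 and an evaluation value of 4.0.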

The recognition/learning determination unit 142 controls the speech recognition function unit 132 and the mouth recognition function unit 122 based on the mouth detection information or mouth non-detection information and the mouth movement information or mouth stationary information input from the mouth search unit 124, and on the sound pickup state evaluation value input from the signal level measurement unit 134.

Specifically, the recognition/learning determination unit 142 determines, from the mouth detection information or mouth non-detection information input from the mouth search unit 124, whether the image captured by the imaging device 18 includes a mouth image showing the speaker's mouth. From the mouth movement information or mouth stationary information input from the mouth search unit 124, it also determines whether the mouth shown by that mouth image is moving. It further determines whether the pickup state of the speaker's speech at the sound collector 16 is good or poor by comparing the sound pickup state evaluation value input from the signal level measurement unit 134 with a threshold. When the evaluation value is lower than the threshold, the pickup state is classified as poor (sound pickup state level 0). When the evaluation value is higher than the threshold, it is compared with a second, separate threshold. When the evaluation value is lower than the second threshold, the pickup state is classified as good (sound pickup state level 1); when it is higher than the second threshold, the pickup state is classified as very good (sound pickup state level 2).
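The two-threshold classification can be sketched in a few lines. The threshold values here are hypothetical, and the description does not say how a value exactly equal to a threshold is treated; this sketch assumes it falls into the higher level.

```python
def classify_pickup_level(evaluation_value, threshold=2.0, higher_threshold=10.0):
    """Map a sound pickup state evaluation value to level 0, 1, or 2.

    Level 0: poor, level 1: good, level 2: very good.
    Both threshold values are illustrative assumptions.
    """
    if evaluation_value < threshold:
        return 0
    if evaluation_value < higher_threshold:
        return 1
    return 2
```

With these example thresholds, an evaluation value of 0.0 (mouth not moving) always maps to level 0.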

The recognition/learning determination unit 142 then controls the speech recognition function unit 132 and the mouth recognition function unit 122 based on the results of these determinations.

That is, when the image does not include a mouth image and the pickup state of the speaker's speech is poor, the unit restricts the speech recognition unit 136 so that speech recognition is not performed. Conversely, even when the image does not include a mouth image, if the pickup state of the speaker's speech is good (sound pickup state level 1 or 2), the unit controls the speech recognition function unit 132 so that the speech recognition unit 136 performs speech recognition.

Even when the image includes a mouth image, if the unit determines that the mouth shown by that mouth image is not moving, it restricts the speech recognition unit 136 so that speech recognition is not performed. If it determines that the mouth shown by the mouth image is moving, the processing differs according to the pickup state of the speaker's speech at the sound collector 16.

Specifically, when the sound collection state of the speaker's voice at the sound collector 16 is very good (sound collection state level 2), the recognition/learning determination unit 142 controls the speech recognition function unit 132 so that the speech recognition unit 136 performs speech recognition, and instructs the mouth recognition function unit 122 to carry out mouth recognition learning for the mouth recognition unit 126. Mouth recognition learning is described in detail later.

When the sound collection state of the speaker's voice at the sound collector 16 is good (sound collection state level 1), the recognition/learning determination unit 142 controls the mouth recognition function unit 122 and the speech recognition function unit 132 so that both mouth recognition by the mouth recognition unit 126 and speech recognition by the speech recognition unit 136 are performed. It also instructs the integration processing unit 144 to output a recognition result based on the mouth recognition by the mouth recognition unit 126 and the speech recognition by the speech recognition unit 136. As described later, the integration processing unit 144 then creates a recognition result from the mouth recognition result and the speech recognition result, and outputs it to the output unit 19.

When the sound collection state of the speaker's voice at the sound collector 16 is poor (sound collection state level 0), the recognition/learning determination unit 142 restricts the speech recognition unit 136 so that speech recognition is not performed, and has mouth recognition performed instead. In other words, whether recognition is carried out by speech recognition or by mouth recognition is decided according to the sound collection state of the speaker's voice, and when that state is poor, recognition is switched from speech recognition to mouth recognition.
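
The control decisions made by the recognition/learning determination unit 142 can be summarized as a small decision function. This is a sketch under stated assumptions: the `Control` record and the `decide` function are hypothetical names introduced for illustration, and the level argument is the sound collection state level (0, 1, or 2) defined above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Control:
    """Hypothetical record of what unit 142 enables for one utterance."""
    speech_recognition: bool  # speech recognition unit 136 runs
    mouth_recognition: bool   # mouth recognition unit 126 runs
    mouth_learning: bool      # mouth recognition learning unit 128 runs
    integrate: bool           # integration unit 144 combines both results


def decide(mouth_in_image: bool, mouth_moving: bool, level: int) -> Control:
    if level not in (0, 1, 2):
        raise ValueError("level must be 0, 1 or 2")
    if not mouth_in_image:
        # No mouth image: speech recognition only if collection is good.
        return Control(level >= 1, False, False, False)
    if not mouth_moving:
        # Mouth visible but not moving: speech recognition is restricted.
        return Control(False, False, False, False)
    if level == 2:
        # Very good: speech recognition plus mouth recognition learning.
        return Control(True, False, True, False)
    if level == 1:
        # Good: run both recognizers and integrate their results.
        return Control(True, True, False, True)
    # Poor: switch from speech recognition to mouth recognition.
    return Control(False, True, False, False)
```

Keeping the decision in one pure function makes each branch of the description easy to check in isolation.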

The speech recognition unit 136 performs speech recognition based on the voice sequentially input from the signal level measurement unit 134. When it is restricted from performing speech recognition, the speech recognition unit 136 does not perform speech recognition.

When performing speech recognition, the speech recognition unit 136 first extracts, from the sequentially input voice, a feature pattern stored in the storage unit 14. It then outputs the character string pattern stored in the storage unit 14 in association with the extracted feature pattern, as the speech recognition result, to the integration processing unit 144 and the mouth recognition learning unit 128.

The feature pattern extraction may instead be performed by the signal level measurement unit 134, with the speech recognition unit 136 receiving the feature pattern that the signal level measurement unit 134 extracted. Also, when the speech recognition unit 136 is restricted from performing speech recognition, for example because the sound collection state is poor, it may instruct the output unit 19 to produce a display or audio output that prompts the speaker to speak again. In that case the output unit 19, in response to this instruction from the speech recognition unit 136, presents instruction information telling the speaker to speak again.

The mouth recognition unit 126 performs mouth recognition based on the images sequentially input from the mouth search unit 124. Like the speech recognition unit 136, the mouth recognition unit 126 does not perform mouth recognition when it is restricted from doing so.

When performing mouth recognition, the mouth recognition unit 126 first extracts, from the sequentially input images, a feature pattern of the mouth shape, or of the transition of that shape, stored in the storage unit 14. It then outputs the character string pattern stored in the storage unit 14 in association with the extracted feature pattern, as the mouth recognition result, to the integration processing unit 144. The feature pattern extraction may instead be performed by the mouth search unit 124, with the mouth recognition unit 126 receiving the feature pattern that the mouth search unit 124 extracted. The mouth recognition unit 126 judges that a good recognition result has been obtained when it was able to output a character string pattern based on the extracted feature pattern.

The mouth recognition learning unit 128 carries out mouth recognition learning for the mouth recognition unit 126 when the recognition/learning determination unit 142 has instructed the mouth recognition function unit 122 to do so.

Specifically, for the speaker's mouth at a given time point or period, the mouth recognition learning unit 128 acquires the feature pattern of the shape, or of the transition of that shape, extracted by the mouth recognition unit 126, together with the character string pattern that is the speech recognition result for the voice the speaker uttered at that time point or period. It then stores the acquired feature pattern and character string pattern in the storage unit 14 in association with each other. By updating the feature patterns of mouth shape or shape transition and the character string patterns stored in the storage unit 14 in this way, it carries out mouth recognition learning for the mouth recognition unit 126.
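
The learning step amounts to storing pairs of mouth feature pattern and recognized character string, which mouth recognition later looks up. The sketch below is illustrative only: the `MouthPatternStore` class is a hypothetical stand-in for the storage unit 14, feature patterns are modeled as tuples of numbers, and lookup by exact match is an assumption (a real recognizer would presumably use some similarity measure, which the description does not specify).

```python
from typing import Dict, Optional, Sequence, Tuple

FeaturePattern = Tuple[float, ...]


class MouthPatternStore:
    """Hypothetical stand-in for the patterns kept in storage unit 14."""

    def __init__(self) -> None:
        self._patterns: Dict[FeaturePattern, str] = {}

    def learn(self, mouth_feature: Sequence[float], speech_result: str) -> None:
        """Associate a mouth feature pattern with a speech recognition result.

        Learning the same feature pattern again updates the stored string.
        """
        self._patterns[tuple(mouth_feature)] = speech_result

    def recognize(self, mouth_feature: Sequence[float]) -> Optional[str]:
        """Return the associated string, or None if no pattern matches.

        Assumption: exact-match lookup. A non-None return corresponds to
        the "good recognition result" judged by mouth recognition unit 126.
        """
        return self._patterns.get(tuple(mouth_feature))
```

For example, after `store.learn((0.1, 0.2), "hello")`, a later call `store.recognize((0.1, 0.2))` returns `"hello"`, while an unseen pattern returns `None`.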

In other words, the mouth recognition learning unit 128 learns mouth recognition based on the speech recognition result for the collected voice and on the mouth shape, or the transition of that shape, shown by the mouth image included in the image captured while the speaker was uttering that voice.

If the mouth recognition learning unit 128 detects that no such instruction has arrived from the recognition/learning determination unit 142 for a certain period, it may perform repeated learning (iterative learning) based on the feature patterns of mouth shape or shape transition and the character string patterns previously stored in the storage unit 14.

When the recognition/learning determination unit 142 has instructed the integration processing unit 144 to output a recognition result based on the mouth recognition result from the mouth recognition unit 126 and the speech recognition result from the speech recognition unit 136, the integration processing unit 144 generates a recognition result from both and outputs it to the output unit 19. When no such instruction has been given, it outputs the speech recognition result from the speech recognition unit 136 to the output unit 19 as the recognition result.

The processing that generates a recognition result from the mouth recognition by the mouth recognition unit 126 and the speech recognition by the speech recognition unit 136 may either select one of the mouth recognition result and the speech recognition result as the recognition result, or generate a recognition result based on both.
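
One way to realize the "select one of the two results" variant is sketched below. The description does not say how the selection is made, so the per-result confidence score used here is purely an assumption, as are the `Scored` type and the `integrate` function name.

```python
from typing import Optional, Tuple

# (recognized text, confidence). The confidence score is hypothetical;
# the description does not define how one result is preferred.
Scored = Tuple[str, float]


def integrate(speech: Optional[Scored],
              mouth: Optional[Scored],
              use_both: bool) -> Optional[str]:
    """Return the recognition result passed to the output unit 19."""
    if not use_both:
        # Not instructed to integrate: pass the speech result through.
        return speech[0] if speech is not None else None
    candidates = [r for r in (speech, mouth) if r is not None]
    if not candidates:
        return None
    # Pick the higher-confidence result; on a tie the speech result wins
    # because it is listed first.
    return max(candidates, key=lambda r: r[1])[0]
```

A variant that truly combines both results (for example, per-character voting) would fit the second option in the description but needs details the text does not give.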

The utterance content recognition accuracy improvement processing performed by the utterance content recognition device 10 described above is now explained in more detail with reference to a flowchart of the processing.

First, the image acquisition unit 120 performs image acquisition processing to sequentially acquire the images captured by the camera 18 (S100). Next, the mouth search unit 124 searches the images acquired by the image acquisition unit 120 for a mouth image of the speaker (S102). More specifically, it extracts a feature pattern of the shape of the speaker's mouth from each image, and performs this feature pattern extraction on the series of images (S102).

The mouth search unit 124 then determines whether the image contains a mouth image of the speaker (S104). If it determines that no mouth image is contained, the utterance content recognition device 10 proceeds to S106. If it determines that a mouth image is contained, the device proceeds to S116.

In S106, the voice acquisition unit 130 performs voice acquisition processing to sequentially acquire the sound collected by the sound collector 16 (S106). The signal level measurement unit 134 then sequentially measures the SNR of the sound acquired by the voice acquisition unit 130. Threshold 1 is read from the storage unit 14, and the utterance content recognition device 10 performs different processing depending on whether the measured SNR exceeds threshold 1 (S108).

If the measured SNR does not exceed threshold 1, the speech recognition function unit 132 performs processing to ask the speaker to repeat the utterance (S110). If the measured SNR exceeds threshold 1, the speech recognition function unit 132 performs speech recognition based on the speaker's voice sequentially collected by the sound collector 16, and the integration processing unit 144 acquires the speech recognition result output from the speech recognition function unit 132 and outputs it as the recognition result (S114).

In S116, the mouth search unit 124 determines whether the mouth is moving, using the mouth images obtained from the feature patterns extracted for the series of images (S116). If the mouth is judged not to be moving, the utterance content recognition device 10 performs neither speech recognition processing nor mouth recognition processing and ends the processing (S140). If the mouth is judged to be moving, the device performs the following processing.

The utterance content recognition device 10 first performs directivity control processing to aim the directivity of the sound collector 16 at the speaker's mouth (S118). The voice acquisition unit 130 then performs voice acquisition processing to sequentially acquire the sound collected by the sound collector 16 (S120), and the signal level measurement unit 134 sequentially measures the SNR of the acquired sound. The device then reads threshold 1 and threshold 2 (threshold 2 > threshold 1) from the storage unit 14 and performs different processing in each of three cases: the measured SNR does not exceed threshold 1; it exceeds threshold 1 but not threshold 2; or it exceeds threshold 2 (S122, S127).

First, if the measured SNR does not exceed threshold 1, the mouth recognition function unit 122 performs mouth recognition processing based on the mouth images included in the images sequentially captured by the camera 18 (S123). The mouth recognition unit 126 then determines whether a good recognition result was obtained (S124). If so, the integration processing unit 144 of the utterance content recognition device 10 acquires the mouth recognition result output from the mouth recognition function unit 122 and outputs it as the recognition result (S125). If a good recognition result could not be obtained, the speech recognition unit 136 performs processing to ask the speaker to repeat the utterance (S126).

Next, if the measured SNR exceeds threshold 1 but does not exceed threshold 2, the utterance content recognition device 10 performs both speech recognition processing by the speech recognition function unit 132, based on the speaker's voice sequentially collected by the sound collector 16 (S128), and mouth recognition processing by the mouth recognition function unit 122, based on the mouth images included in the images sequentially captured by the camera 18 (S130). The integration processing unit 144 then generates a recognition result from the result of the speech recognition processing and the result of the mouth recognition processing, and outputs it (S132).

If the measured SNR exceeds threshold 2, the utterance content recognition device 10 performs speech recognition processing by the speech recognition function unit 132, based on the speaker's voice sequentially collected by the sound collector 16 (S134). The integration processing unit 144 then acquires the speech recognition result output from the speech recognition function unit 132 and outputs it as the recognition result (S136). In addition, the device performs mouth recognition learning processing based on this output result and the feature patterns of the mouth images included in the images sequentially captured by the camera 18 (S138).

Finally, the device determines whether to end the utterance content recognition accuracy improvement processing executed as described above. If it is to end, the processing ends; otherwise the processing is repeated from S100 (S140).
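
The branching of the flow S100 to S140 can be traced with a short function that returns the step labels visited for one pass. This is a sketch for checking the control flow only; no recognition is performed. The function name and parameters are hypothetical, and two points are assumptions not stated in the text: that S122 is the comparison with threshold 1 and S127 the comparison with threshold 2, and that every branch finishes at S140.

```python
from typing import List


def trace_flow(mouth_in_image: bool, mouth_moving: bool, snr: float,
               threshold_1: float, threshold_2: float,
               mouth_result_good: bool = True) -> List[str]:
    """Return the flowchart step labels visited in one pass."""
    steps = ["S100", "S102", "S104"]
    if not mouth_in_image:
        steps += ["S106", "S108"]
        # Speech recognition (S114) or ask the speaker to repeat (S110).
        steps.append("S114" if snr > threshold_1 else "S110")
    else:
        steps.append("S116")
        if mouth_moving:
            steps += ["S118", "S120", "S122"]
            if snr <= threshold_1:
                # Poor SNR: mouth recognition only.
                steps += ["S123", "S124"]
                steps.append("S125" if mouth_result_good else "S126")
            else:
                steps.append("S127")
                if snr <= threshold_2:
                    # Good SNR: both recognizers, then integration.
                    steps += ["S128", "S130", "S132"]
                else:
                    # Very good SNR: speech recognition, then learning.
                    steps += ["S134", "S136", "S138"]
    steps.append("S140")
    return steps
```

For instance, with a visible but unmoving mouth the pass goes straight from S116 to S140, matching the description that neither recognizer runs in that case.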

As described above, the utterance content recognition device 10 determines whether at least part of the speaker is included in the image captured by the camera 18, and can therefore suppress low-accuracy speech recognition being performed when the speaker is not near the sound collector 16. Because it determines whether the captured image contains a mouth image, it can suppress low-accuracy speech recognition being performed when the speaker's voice does not readily reach the device. Because it also determines whether the mouth shown in the captured image is moving, it suppresses low-accuracy speech recognition being performed when the speaker is not actually speaking. In addition, by controlling the directivity of the sound collector 16 so that it is aimed at the speaker's mouth, the device can acquire the speaker's voice in a better sound collection state. Because mouth recognition is learned from the speech recognition result and the shape, or shape transition, of the mouth that uttered the voice, the accuracy of mouth recognition can be raised; and because the integration processing unit 144 generates the recognition result from both the mouth recognition result and the speech recognition result, the accuracy of the pattern recognition result can be raised. Mouth recognition learning is performed only when the sound collection state of the speaker's voice at the sound collector 16 is very good and not in other sound collection states, which reduces the chance that mouth recognition learning is driven by inaccurate speech recognition results. Finally, because the device can switch between speech recognition and mouth recognition according to the sound collection state of the speaker's voice at the sound collector 16, it can suppress low-accuracy speech recognition being performed when the sound collection state is poor.

The present invention is not limited to the above embodiment. For example, the utterance content recognition device 10 may include a plurality of sound collectors 16. In this case, the signal level measurement unit 134 may treat the sound collector 16 whose directivity is aimed at the mouth position shown by the mouth image in the image captured by the camera 18 as collecting the voice signal together with the noise signal, and may regard all sound collected by the sound collectors 16 whose directivity is aimed in other directions as noise signal.
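
With several sound collectors, an SNR can be estimated by treating the collector aimed at the mouth as carrying voice plus noise and the other collectors as carrying noise only. The sketch below is one plausible reading, not the method of the patent: the function names, the use of mean sample power, the averaging over the noise-only collectors, and the subtraction of noise power from the mouth-directed power are all assumptions.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a non-empty block of samples."""
    return sum(s * s for s in samples) / len(samples)


def estimate_snr_db(directed: Sequence[float],
                    others: Sequence[Sequence[float]]) -> float:
    """Estimate SNR in dB from a mouth-directed collector and the others.

    Assumptions: `directed` holds voice plus noise, every block in
    `others` holds noise only, and all blocks are non-empty.
    """
    noise = sum(mean_power(block) for block in others) / len(others)
    signal = max(mean_power(directed) - noise, 0.0)
    if noise == 0.0:
        return math.inf   # no measurable noise
    if signal == 0.0:
        return -math.inf  # nothing above the noise floor
    return 10.0 * math.log10(signal / noise)
```

The resulting value could serve as the sound collection state evaluation value that is compared with threshold 1 and threshold 2.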

The shooting direction of the camera 18 may also be controlled so that the camera 18 itself is aimed at the mouth position shown by the mouth image in the image it has captured. Specifically, the CPU 12 may control the shooting direction of the camera 18 according to the mouth position shown by the mouth image in the image captured by the camera 18.

Furthermore, the storage unit 14 may store, for each of a plurality of persons, feature patterns of that person's voice, or feature patterns of that person's mouth shape or its transition. In this case, when the voice feature pattern extracted from the sound collected by the sound collector 16 and the mouth shape or shape transition pattern extracted from the image captured by the camera 18 do not belong to the same person, the signal level measurement unit 134 may treat the acquired sound as noise. The directivity control unit 140 may also stop the processing that aims the directivity at the mouth position shown by the mouth image, and the recognition/learning determination unit 142 may refrain from performing the speech recognition processing, the mouth recognition processing, and the mouth recognition learning processing.

When the storage unit 14 stores, for each of a plurality of persons, feature patterns of voice or of mouth shape or its transition, it may additionally store RFID (wireless IC tag) information in association with personal information identifying each person. If the utterance content recognition device 10 is then provided with RFID reading means, detecting an RFID with the reading means makes it possible to determine whether the person using the device is carrying a stored RFID. If it is determined that a person not carrying an RFID is using the utterance content recognition device 10, the processing described above may be skipped. If it is determined that the person using the device is carrying a stored RFID, the speech recognition processing and the mouth recognition processing may use the feature patterns for the person identified by the personal information stored in association with that RFID.
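
The RFID-based gating can be pictured as a lookup from a detected tag to a per-person set of feature patterns. This is a minimal sketch under assumptions: `PersonProfile`, `select_profile`, the dictionary-based registry, and the tag strings are hypothetical names introduced for illustration, and no real RFID reader API is used.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PersonProfile:
    """Hypothetical per-person data held in storage unit 14."""
    name: str
    # Feature pattern key -> character string pattern, per modality.
    voice_patterns: Dict[str, str] = field(default_factory=dict)
    mouth_patterns: Dict[str, str] = field(default_factory=dict)


def select_profile(detected_rfid: Optional[str],
                   registry: Dict[str, PersonProfile]) -> Optional[PersonProfile]:
    """Return the profile for the detected RFID, or None.

    None means no RFID was detected or the RFID is not registered; the
    caller would then skip recognition and learning for this user.
    """
    if detected_rfid is None:
        return None
    return registry.get(detected_rfid)
```

When a profile is returned, recognition would use only that person's patterns, which is the behavior described for a user carrying a stored RFID.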

DESCRIPTION OF SYMBOLS
10 utterance content recognition device, 12 CPU, 14 storage unit, 15 input unit, 16 sound collector, 18 camera, 19 output unit, 120 image acquisition unit, 122 mouth recognition function unit, 124 mouth search unit, 126 mouth recognition unit, 128 mouth recognition learning unit, 130 voice acquisition unit, 132 speech recognition function unit, 134 signal level measurement unit, 136 speech recognition unit, 140 directivity control unit, 142 recognition/learning determination unit, 144 integration processing unit

Claims

Sound collection means;
Photographing means for capturing an image;
Speech recognition means for performing speech recognition based on the collected voice;
Speech recognition restriction means for restricting the speech recognition means from performing speech recognition when the captured image does not include the mouth of a speaker who utters voice toward the sound collection means;
Mouth recognition means for performing mouth recognition based on the shape of the mouth of the speaker included in the captured image or the transition of the shape;
Mouth recognition learning means for learning the mouth recognition performed by the mouth recognition means, based on the recognition result of the speech recognition means and the shape of the mouth of the speaker, or the transition of the shape, included in the captured image; and
Sound collection state evaluation value acquisition means for separating, based on a feature pattern extracted from the sound collected by the sound collection means, the sound including the voice uttered by the speaker into a voice signal, which is the voice uttered by the speaker, and other noise signals, and for acquiring, based on the separated voice signal and noise signals, a sound collection state evaluation value indicating how good the sound collection state is,
Wherein the learning by the mouth recognition learning means is performed based on the recognition result obtained by the speech recognition means when the sound collection state indicated by the sound collection state evaluation value is at or above a predetermined threshold.
An utterance content recognition device characterized by the above.

In the utterance content recognition device according to claim 1,
The photographing means captures the images sequentially, and
The speech recognition restriction means restricts the speech recognition means from performing speech recognition when, even though the mouth is included in the captured images, the mouth shown by the sequentially captured images is not moving.
An utterance content recognition device characterized by the above.

In the utterance content recognition device according to claim 1 or 2,
Directivity control means for aiming the directivity of the sound collection means at the mouth included in the captured image,
An utterance content recognition device, further comprising the above.

Step A of performing speech recognition based on collected voice;
Step B of restricting the speech recognition when the mouth of the speaker who utters the voice is not included in a captured image;
Step C of performing mouth recognition based on the shape of the mouth of the speaker included in the captured image or the transition of the shape;
Step D of learning the mouth recognition based on the recognition result of the speech recognition and the shape of the mouth of the speaker, or the transition of the shape, included in the captured image; and
A step of separating, based on a feature pattern extracted from the collected sound, the sound including the voice uttered by the speaker into a voice signal, which is the voice uttered by the speaker, and other noise signals, and acquiring, based on the separated voice signal and noise signals, a sound collection state evaluation value indicating how good the sound collection state is,
Wherein the learning in Step D is performed based on the recognition result of the speech recognition obtained when the sound collection state indicated by the sound collection state evaluation value is at or above a predetermined threshold.
An utterance content recognition method characterized by the above.