JP7370050B2 - Lip reading device and method - Google Patents

Lip reading device and method

Info

Publication number
JP7370050B2
JP7370050B2 (application JP2019213234A)
Authority
JP
Japan
Prior art keywords
learning
lip
evaluation
speaker
target speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
JP2019213234A
Other languages
Japanese (ja)
Other versions
JP2021086274A (en)
Inventor
剛史 齊藤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kyushu Institute of Technology NUC
Original Assignee
Kyushu Institute of Technology NUC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kyushu Institute of Technology NUC filed Critical Kyushu Institute of Technology NUC
Priority to JP2019213234A priority Critical patent/JP7370050B2/en
Publication of JP2021086274A publication Critical patent/JP2021086274A/en
Application granted granted Critical
Publication of JP7370050B2 publication Critical patent/JP7370050B2/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Description

TECHNICAL FIELD: The present invention relates to a lip reading device and a lip reading method capable of estimating utterance content with high accuracy by taking a speaker's facial expression features into account in addition to the lip features.

Conventional speech recognition technology, which converts audio information into text, achieves sufficient recognition rates in low-noise environments such as laboratories and is gradually coming into wider use. However, it is difficult to use in noisy environments such as offices and outdoors, where it is easily affected by surrounding noise, or in public places such as trains and hospitals where speaking aloud is difficult, and it therefore lacks practicality. In addition, people with speech disorders who have difficulty speaking cannot use speech recognition technology, so it also lacks versatility.
Lip reading technology, by contrast, can estimate utterance content from the movement of the speaker's lips and the like. Because it requires no vocalization (no audio information) and can estimate utterance content from video alone, it can be expected to be usable in noisy environments and public places, and it can also be used by people with speech disorders. Computer-based lip reading in particular is expected to become widespread because anyone can use it easily without special training.
For example, Patent Document 1 discloses a word-spotting lip reading device comprising: imaging means for acquiring a face image including a lip region; region extraction means for extracting the lip region from the acquired image; feature measurement means for measuring shape features from the extracted lip region; a keyword database in which the features of keyword utterance scenes measured in a registration mode are registered; determination means which, in a recognition mode, performs word-spotting lip reading that recognizes keywords in a sentence by comparing the registered keyword features with features measured from a sentence utterance scene; and display means for displaying the recognition result of the determination means.

Japanese Unexamined Patent Application Publication No. 2012-59017 (JP 2012-59017 A)

In conventional computer-based lip reading technology, including that of Patent Document 1, the registration mode (learning mode, learning time) uses machine learning to learn the movement characteristics of the region around the lips, centered on the speaker's lips during speech. However, since the content of an utterance affects the speaker's entire facial expression, learning only the movement characteristics of the lip-surrounding region places a limit on the recognition rate (the estimation accuracy of the utterance content) obtained in the recognition mode (at evaluation time). Furthermore, when building a classifier (learning model) by machine learning in the registration mode, data from various speakers of different ages and genders have conventionally been used without distinction in order to increase the amount of training data. However, even for the same utterance content, differences in the speaker's age and gender produce differences in facial expression during speech (different features appear), which may affect the recognition rate.
The present invention has been made in view of these circumstances, and its object is to provide a lip reading device and a lip reading method that can estimate utterance content with high accuracy by performing machine learning that takes facial expression features into account in addition to the speaker's lip features, and that can further improve estimation accuracy by, where necessary, also taking attributes such as the speaker's age and gender into account in learning and evaluation.

A lip reading device according to a first aspect of the invention that achieves the above object comprises: an image acquisition unit that, at learning time, reads a learning target image in which an utterance scene of a learning target speaker is recorded and, at evaluation time, reads an evaluation target image in which an utterance scene of an evaluation target speaker is recorded; an image processing unit that performs image processing on the learning target image and the evaluation target image read into the image acquisition unit to extract learning target data and evaluation target data, respectively; a learning processing unit that, at learning time, performs machine learning of lip reading based on the learning target data and constructs a learning model; a lip reading database that stores the learning model; and a recognition processing unit that, at evaluation time, estimates the utterance content of the evaluation target speaker by machine learning from the evaluation target data and the learning model stored in the lip reading database.
The image processing unit includes: face detection means for detecting the face region of the learning target speaker from the learning target image and the face region of the evaluation target speaker from the evaluation target image; facial feature point detection means for detecting facial feature points from each face region detected by the face detection means; lip region extraction means for extracting a lip region from the facial feature points of each face region detected by the facial feature point detection means; and feature extraction means for extracting, from the face region, facial feature points, and lip region of the learning target speaker, the facial expression features and lip features of the learning target speaker as the learning target data, and extracting, from the face region, facial feature points, and lip region of the evaluation target speaker, the facial expression features and lip features of the evaluation target speaker as the evaluation target data.

In the lip reading device according to the first aspect, it is preferable that the learning model constructed by the learning processing unit is constructed for each attribute corresponding to the age and/or gender of the learning target speaker, and that the recognition processing unit separately estimates the age and/or gender of the evaluation target speaker from the evaluation target data by machine learning for attribute recognition, selects the learning model of the attribute corresponding to the estimated age and/or gender of the evaluation target speaker, and uses it to estimate the utterance content.

The lip reading device according to the first aspect may include photographing means for photographing the utterance scenes of the learning target speaker and the evaluation target speaker.

The lip reading device according to the first aspect may include a recognition result output unit that outputs the utterance content of the evaluation target speaker estimated by the recognition processing unit.

In the lip reading device according to the first aspect, it is preferable that the recognition result output unit includes a display that displays the utterance content of the evaluation target speaker estimated by the recognition processing unit as text and/or a speaker that outputs it as speech.

A lip reading method according to a second aspect of the invention that achieves the above object comprises: a first learning step of reading a learning target image in which an utterance scene of a learning target speaker is recorded; a second learning step of detecting the face region of the learning target speaker from the learning target image; a third learning step of detecting facial feature points of the learning target speaker from the face region of the learning target speaker; a fourth learning step of detecting the lip region of the learning target speaker from the facial feature points of the learning target speaker; a fifth learning step of extracting, from the face region, facial feature points, and lip region of the learning target speaker, the facial expression features and lip features of the learning target speaker as learning target data; a sixth learning step of repeating the first to fifth learning steps, performing machine learning of lip reading based on the learning target data, and constructing a learning model; a seventh learning step of storing the learning model; a first evaluation step of reading the stored learning model; a second evaluation step of reading an evaluation target image in which an utterance scene of an evaluation target speaker is recorded; a third evaluation step of detecting the face region of the evaluation target speaker from the evaluation target image; a fourth evaluation step of detecting facial feature points of the evaluation target speaker from the face region of the evaluation target speaker; a fifth evaluation step of detecting the lip region of the evaluation target speaker from the facial feature points of the evaluation target speaker; a sixth evaluation step of extracting, from the face region, facial feature points, and lip region of the evaluation target speaker, the facial expression features and lip features of the evaluation target speaker as evaluation target data; and a seventh evaluation step of estimating the utterance content of the evaluation target speaker by machine learning from the evaluation target data and the learning model.

In the lip reading method according to the second aspect, the learning model constructed in the sixth learning step is preferably constructed for each attribute corresponding to the age and/or gender of the learning target speaker.

In the lip reading method according to the second aspect, in the seventh evaluation step, the age and/or gender of the evaluation target speaker may be separately estimated from the evaluation target data by machine learning for attribute recognition, and the learning model of the attribute corresponding to the estimated age and/or gender may be selected and used to estimate the utterance content.

The lip reading device according to the first aspect and the lip reading method according to the second aspect perform machine learning that takes the facial expression features of the speaker (the learning target speaker and the evaluation target speaker are collectively referred to as the speaker) into account in addition to the lip features, so that the utterance content can be estimated with high accuracy at evaluation time. In particular, when machine learning and estimation of the utterance content are performed in consideration of the speaker's age and/or gender as well, the recognition rate can be further improved.

Fig. 1 is a block diagram showing the configuration of a lip reading device according to an embodiment of the present invention.
Fig. 2 is a block diagram showing the functions of the image processing unit of the lip reading device.
Figs. 3(A) and 3(B) are explanatory diagrams each showing facial feature points detected by the facial feature point detection means of the lip reading device.
Fig. 4 is a flowchart showing the operation at learning time of a lip reading method according to an embodiment of the present invention.
Fig. 5 is a flowchart showing the operation at evaluation time of the lip reading method.

Next, an embodiment of the present invention will be described to aid in understanding the invention.
The lip reading device 10 and lip reading method according to an embodiment of the present invention shown in Fig. 1 estimate the utterance content of an evaluation target speaker with high accuracy by machine-learning the lip features, facial expression features, and the like of a learning target speaker whose utterance content is known.
As shown in Fig. 1, the lip reading device 10 includes photographing means 11 for photographing (recording) the utterance scenes of the learning target speaker and the evaluation target speaker. The lip reading device 10 includes an image acquisition unit 13 that, at learning time, reads from the photographing means 11 a learning target image in which an utterance scene of the learning target speaker is recorded and, at evaluation time, reads from the photographing means 11 an evaluation target image in which an utterance scene of the evaluation target speaker is recorded. The lip reading device 10 also includes an image processing unit 14 that performs image processing on the learning target image and the evaluation target image read into the image acquisition unit 13 to extract the learning target data and evaluation target data required for machine learning. The lip reading device 10 further includes a learning processing unit 15 that, at learning time, performs machine learning of lip reading based on the learning target data and constructs a learning model, and a lip reading database 16 that stores the learning model. The lip reading device 10 also includes a recognition processing unit 17 that, at evaluation time, estimates the utterance content of the evaluation target speaker by machine learning from the evaluation target data and the learning model stored in the lip reading database 16. As shown in Fig. 1, the lip reading device 10 comprises the image acquisition unit 13, the image processing unit 14, the learning processing unit 15, the lip reading database 16, and the recognition processing unit 17; a program that executes the lip reading method used in the lip reading device 10 is installed in a computer 18, and when the CPU of the computer 18 executes the program, the computer 18 functions as the image acquisition unit 13, the image processing unit 14, the learning processing unit 15, the lip reading database 16, and the recognition processing unit 17. A desktop or notebook computer is suitably used, but the form of the computer is not limited to these and can be selected as appropriate. Some or all of the image acquisition unit 13, image processing unit 14, learning processing unit 15, lip reading database 16, and recognition processing unit 17 may also be used over a network by cloud computing. A video camera is suitably used as the photographing means, but the lip reading device does not need its own dedicated photographing means; various photographing means that have recorded utterance scenes can be connected to the computer (image acquisition unit) to read in the learning target images or evaluation target images. A smartphone or the like equipped with a video recording function may therefore be used as the photographing means. Instead of connecting the photographing means to the computer (image acquisition unit) to read in the images, a storage device such as a memory card built into the photographing means may be removed from the photographing means and inserted into the computer (image acquisition unit) to read in the images.

The lip reading device 10 also includes a recognition result output unit 19 that outputs the utterance content of the evaluation target speaker estimated by the recognition processing unit 17. In the present embodiment, the recognition result output unit 19 includes a display 20 that displays the utterance content of the evaluation target speaker estimated by the recognition processing unit 17 as text, and a speaker 21 that outputs it as speech; depending on where and in what environment the lip reading device 10 is used, either or both of the display 20 and the speaker 21 can be selected and used as appropriate. The display and speaker may be accessories of or built into the computer, or may be separately attached (externally connected) to the computer. The recognition result output unit may also be configured to include only one of the display and the speaker.

Next, the image processing unit 14 will be described in detail with reference to Fig. 2.
The image processing unit 14 includes face detection means 22 that, at learning time, detects the face region of the learning target speaker from the learning target image and, at evaluation time, detects the face region of the evaluation target speaker from the evaluation target image. The image processing unit 14 also includes facial feature point detection means 23 that detects facial feature points from each face region detected by the face detection means 22, and lip region extraction means 24 that extracts a lip region from the facial feature points of each face region detected by the facial feature point detection means 23. The image processing unit 14 further includes feature extraction means 25 that, at learning time, extracts from the face region, facial feature points, and lip region of the learning target speaker the facial expression features and lip features of the learning target speaker as the learning target data and, at evaluation time, extracts from the face region, facial feature points, and lip region of the evaluation target speaker the facial expression features and lip features of the evaluation target speaker as the evaluation target data.
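For illustration, the face detection, facial feature point detection, and lip region extraction steps could be realized as in the following sketch. The patent does not prescribe a specific detector; dlib's frontal face detector and 68-point shape predictor are assumed here because they match the 68 feature points used in the embodiment, and the function and model file names are hypothetical.

```python
# Illustrative sketch of face detection (22), facial feature point detection (23),
# and lip region extraction (24). dlib and its 68-point shape predictor are an
# assumption; the patent does not prescribe a specific detector or model file.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file


def detect_face_and_lips(frame):
    """frame: grayscale or RGB image as a NumPy uint8 array."""
    faces = detector(frame)
    if not faces:
        return None
    shape = predictor(frame, faces[0])  # assume one speaker per frame
    points = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
    mouth = points[48:68]  # indices 48-67 outline the mouth in the 68-point convention
    x0, y0 = mouth.min(axis=0)
    x1, y1 = mouth.max(axis=0)
    lip_region = frame[y0:y1, x0:x1]  # cropped lip region
    return points, lip_region
```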

The facial feature points detected by the facial feature point detection means 23 represent, for example as shown in Figs. 3(A) and 3(B), the outline of the speaker's face and the positions and shapes of the eyebrows, eyes, nose, and mouth. In the present embodiment, the number of feature points is 68, but the number is not limited to this and can be increased or decreased as appropriate.
Conventional computer-based lip reading technology used machine learning to learn only the movement characteristics of the region around the lips, centered on the speaker's lips during speech. In the lip reading device 10, by extracting facial expression features and lip features from the speaker's face region, facial feature points, and lip region, not only the movement of the lip-surrounding region during speech but also the expression features of the speaker's entire face (for example, changes in the position, shape, and angle of the eyebrows, eyes, mouth, and so on) can be machine-learned together, and the recognition rate (the estimation accuracy of the utterance content) can be improved.
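As a purely hypothetical illustration of such whole-face expression features, simple geometric quantities (brow-eye distances, mouth width and opening) can be derived from the 68 landmarks as sketched below; the embodiment itself uses OpenFace Action Units (described later), so this is not the patented feature set.

```python
# Hypothetical geometric expression features derived from the 68 landmarks
# (brow-eye distances, mouth width and opening), normalized by face width.
# The embodiment itself uses OpenFace Action Units; this is only an illustration.
import numpy as np


def expression_features(points: np.ndarray) -> np.ndarray:
    """points: (68, 2) landmark array in the standard 68-point layout."""
    face_width = np.linalg.norm(points[16] - points[0]) + 1e-6  # jaw corner to jaw corner
    brow_l, brow_r = points[17:22], points[22:27]   # eyebrows
    eye_l, eye_r = points[36:42], points[42:48]     # eyes
    mouth = points[48:68]                           # mouth
    feats = [
        (eye_l[:, 1].mean() - brow_l[:, 1].mean()) / face_width,  # brow-eye distance (one side)
        (eye_r[:, 1].mean() - brow_r[:, 1].mean()) / face_width,  # brow-eye distance (other side)
        (mouth[:, 0].max() - mouth[:, 0].min()) / face_width,     # mouth width
        (mouth[:, 1].max() - mouth[:, 1].min()) / face_width,     # mouth opening
    ]
    return np.array(feats)
```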

At learning time, the learning processing unit 15 can perform machine learning of lip reading based on the learning target data, including the age and/or gender of the learning target speaker, and construct a learning model for each attribute corresponding to the age and/or gender of the learning target speaker. At evaluation time, the recognition processing unit 17 can separately estimate the age and/or gender of the evaluation target speaker from the evaluation target data by machine learning for attribute recognition, select the learning model of the attribute corresponding to the estimated age and/or gender, and use it to estimate the utterance content of the evaluation target speaker. By performing machine learning that also takes the speaker's age and/or gender into account at learning time and at evaluation time in this way, the influence that differences in the speaker's age and/or gender have on facial expression during speech can be removed, and the recognition rate can be further improved.
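A minimal sketch of this attribute-dependent model selection, with all object and method names assumed for illustration, might look as follows.

```python
# Hypothetical sketch of attribute-dependent model selection: an attribute
# recognizer (trained separately) estimates the evaluation speaker's gender
# and/or age group, and the matching per-attribute model is chosen from the
# lip reading database. All names are illustrative.
def recognize_with_attribute_model(evaluation_data, attribute_recognizer, models_by_attribute):
    attribute = attribute_recognizer.predict(evaluation_data)  # e.g. "male", "female_20s"
    model = models_by_attribute[attribute]                     # per-attribute learning model
    return model.predict(evaluation_data)                      # estimated utterance content
```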

Next, the operation at learning time of the lip reading method according to an embodiment of the present invention will be described with reference to Fig. 4.
First, in the first learning step, a learning target image in which an utterance scene of the learning target speaker is recorded is read into the image acquisition unit 13 (S1). Next, in the second learning step, the face detection means 22 of the image processing unit 14 detects the face region of the learning target speaker from the learning target image (S2). Subsequently, in the third learning step, the facial feature point detection means 23 of the image processing unit 14 detects the facial feature points of the learning target speaker from the face region of the learning target speaker (S3), and in the fourth learning step, the lip region extraction means 24 of the image processing unit 14 detects the lip region of the learning target speaker from the facial feature points of the learning target speaker (S4). Further, in the fifth learning step, the feature extraction means 25 of the image processing unit 14 extracts, from the face region, facial feature points, and lip region of the learning target speaker, the facial expression features and lip features of the learning target speaker as the learning target data (S5). The first to fifth learning steps are repeated as many times as there are utterance scenes to be learned. Then, in the sixth learning step, the learning processing unit 15 performs machine learning of lip reading based on the learning target data extracted from each utterance scene. At this time, by performing machine learning that also includes attribute recognition such as the age and/or gender of the learning target speaker, a learning model is constructed for each attribute corresponding to the age and/or gender of the learning target speaker (S6). Each attribute-specific learning model constructed in this way is stored in the lip reading database 16 in the seventh learning step (S7).
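Assuming the hypothetical helper functions sketched earlier (read_utterance_scene, detect_face_and_lips, expression_features), the learning-time flow S1 to S7 could be outlined as follows; the trainer object and the data layout are illustrative assumptions, not details taken from the patent.

```python
# Sketch of the learning-time flow S1-S7, reusing the hypothetical helpers
# read_utterance_scene, detect_face_and_lips and expression_features defined
# above; the trainer object and the data layout are illustrative assumptions.
import cv2


def build_attribute_models(training_scenes, trainer):
    """training_scenes: iterable of (video_path, attribute_label, utterance_label)."""
    samples_by_attribute = {}
    for video_path, attribute, utterance in training_scenes:
        frames = read_utterance_scene(video_path)                        # S1: read learning target image
        sequence = []
        for frame in frames:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            result = detect_face_and_lips(gray)                          # S2-S4: face, feature points, lips
            if result is None:
                continue
            points, lip_region = result
            sequence.append((expression_features(points), lip_region))   # S5: expression + lip features
        samples_by_attribute.setdefault(attribute, []).append((sequence, utterance))
    # S6: machine learning per attribute, S7: store the models (lip reading database 16)
    return {attr: trainer.fit(samples) for attr, samples in samples_by_attribute.items()}
```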

Next, the operation at evaluation time of the lip reading method will be described with reference to Fig. 5.
First, in the first evaluation step, each attribute-specific learning model (trained model) stored in the lip reading database 16 is read (S1). In the second evaluation step, an evaluation target image in which an utterance scene of the evaluation target speaker is recorded is read into the image acquisition unit 13 (S2). Next, in the third evaluation step, the face detection means 22 of the image processing unit 14 detects the face region of the evaluation target speaker from the evaluation target image (S3). Subsequently, in the fourth evaluation step, the facial feature point detection means 23 of the image processing unit 14 detects the facial feature points of the evaluation target speaker from the face region of the evaluation target speaker (S4), and in the fifth evaluation step, the lip region extraction means 24 of the image processing unit 14 detects the lip region of the evaluation target speaker from the facial feature points of the evaluation target speaker (S5). Further, in the sixth evaluation step, the feature extraction means 25 of the image processing unit 14 extracts, from the face region, facial feature points, and lip region of the evaluation target speaker, the facial expression features and lip features of the evaluation target speaker as the evaluation target data (S6). Then, in the seventh evaluation step, the age and/or gender of the evaluation target speaker is estimated from the evaluation target data by machine learning (attribute recognition) (S7), and the utterance content of the evaluation target speaker is estimated by machine learning (lip reading processing) from the evaluation target data and the learning model of the attribute corresponding to the estimated age and/or gender (S8). The estimated utterance content (evaluation result) is converted into text and/or speech and output from the display 20 and/or the speaker 21 of the recognition result output unit 19 (S9).
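Correspondingly, the evaluation-time flow S1 to S9 could be outlined as below, again reusing the hypothetical helpers; the attribute recognizer, per-attribute models, and text output are only indicated here and are not prescribed by the patent.

```python
# Sketch of the evaluation-time flow S1-S9 with the same hypothetical helpers;
# the attribute recognizer, per-attribute models and text output are only
# indicated, not prescribed by the patent.
import cv2


def evaluate_utterance(video_path, models_by_attribute, attribute_recognizer):
    frames = read_utterance_scene(video_path)                        # S2: read evaluation target image
    sequence = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        result = detect_face_and_lips(gray)                          # S3-S5: face, feature points, lips
        if result is None:
            continue
        points, lip_region = result
        sequence.append((expression_features(points), lip_region))   # S6: expression + lip features
    attribute = attribute_recognizer.predict(sequence)               # S7: estimate age and/or gender
    model = models_by_attribute[attribute]                           # select model loaded in S1
    utterance = model.predict(sequence)                              # S8: lip reading inference
    print(utterance)                                                 # S9: output as text (and/or speech)
    return utterance
```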

For extracting the facial expression features, it is preferable, though not essential, to use the Action Unit features of OpenFace, which is known as a facial behavior analysis tool. For the machine learning, the recognition rate (the estimation accuracy of the utterance content) can be improved by using a gated recurrent unit (GRU), a type of deep learning, and late fusion, in which the facial expression features and lip features are each learned separately and then fused (integrated); however, the method is not limited to these, and various algorithms can be used.
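A minimal sketch of such a late-fusion GRU classifier is shown below in PyTorch; the input dimensions, hidden size, and number of classes are illustrative assumptions and not values taken from the patent.

```python
# Minimal PyTorch sketch of a late-fusion GRU classifier: expression features
# (e.g. OpenFace Action Unit intensities) and lip features are encoded by
# separate GRUs, and their final hidden states are concatenated (late fusion)
# before classification. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class LateFusionGRU(nn.Module):
    def __init__(self, expr_dim=17, lip_dim=128, hidden=64, num_classes=10):
        super().__init__()
        self.expr_gru = nn.GRU(expr_dim, hidden, batch_first=True)  # expression-feature stream
        self.lip_gru = nn.GRU(lip_dim, hidden, batch_first=True)    # lip-feature stream
        self.classifier = nn.Linear(2 * hidden, num_classes)        # fused classifier

    def forward(self, expr_seq, lip_seq):
        # expr_seq: (batch, time, expr_dim), lip_seq: (batch, time, lip_dim)
        _, expr_h = self.expr_gru(expr_seq)
        _, lip_h = self.lip_gru(lip_seq)
        fused = torch.cat([expr_h[-1], lip_h[-1]], dim=1)  # late fusion of final states
        return self.classifier(fused)
```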

Next, examples carried out to confirm the effects of the present invention will be described.
(Example 1)
Using 16 male-only speakers, 16 female-only speakers, and 8 male plus 8 female speakers as the learning target speakers, the first through seventh learning steps of the lip reading method of the present invention were carried out to construct three attribute-specific learning models. Then, with 8 male-only or 8 female-only speakers as the evaluation target speakers, the first through sixth evaluation steps were carried out; in the seventh evaluation step, the utterance content was estimated using each of the three learning models regardless of the attribute (gender) of the evaluation target speakers, and the recognition rate was determined for each. The results are shown in Table 1. The ages of the learning target speakers and evaluation target speakers were not taken into account.

[Table 1]

As Table 1 shows, when the evaluation target speakers were male, the recognition rate was highest when the learning model trained on male-only speakers was used and lowest when the learning model trained on female-only speakers was used. When the evaluation target speakers were female, the recognition rate was highest when the learning model trained on female-only speakers was used and lowest when the learning model trained on male-only speakers was used. This confirmed that a high recognition rate is obtained by matching the gender of the evaluation target speaker with the gender of the learning target speakers.

(Example 2)
Learning and evaluation were performed using the lip reading method of the present invention for each of the following utterance contents, and the recognition rate was determined: the ten digits 0 to 9 uttered in English, ten greeting phrases uttered in English, and the ten digits 0 to 9 uttered in Japanese. For comparison, the recognition rate when learning and evaluation were performed using only lip features and the recognition rate when learning and evaluation were performed using only facial expression features were also determined. The results are shown in Table 2. In all cases, the learning target speakers and evaluation target speakers were a mixture of men and women, and age was not taken into account; that is, attribute recognition of the learning target speakers and evaluation target speakers was not performed here, and only the effect of combining facial expression features and lip features was confirmed.

[Table 2]

As Table 2 shows, regardless of the utterance content, the recognition rate of the lip reading method of the present invention, in which learning and evaluation were performed by combining facial expression features and lip features, was the highest, and the recognition rate of the lip reading method in which learning and evaluation were performed using only facial expression features was the lowest. This confirmed that a high recognition rate is obtained by the lip reading method of the present invention, which performs learning and evaluation by combining facial expression features and lip features.

Although the present invention has been described above with reference to an embodiment, the present invention is in no way limited to the configuration described in the above embodiment, and also includes other embodiments and modifications conceivable within the scope of the matters described in the claims.

10: lip reading device, 11: photographing means, 13: image acquisition unit, 14: image processing unit, 15: learning processing unit, 16: lip reading database, 17: recognition processing unit, 18: computer, 19: recognition result output unit, 20: display, 21: speaker, 22: face detection means, 23: facial feature point detection means, 24: lip region extraction means, 25: feature extraction means

Claims (8)

1. A lip reading device comprising: an image acquisition unit that, at learning time, reads a learning target image in which an utterance scene of a learning target speaker is recorded and, at evaluation time, reads an evaluation target image in which an utterance scene of an evaluation target speaker is recorded; an image processing unit that performs image processing on the learning target image and the evaluation target image read into the image acquisition unit to extract learning target data and evaluation target data, respectively; a learning processing unit that, at learning time, performs machine learning of lip reading based on the learning target data and constructs a learning model for each attribute corresponding to the age and/or gender of the learning target speaker; a lip reading database that stores the learning models; and a recognition processing unit that, at evaluation time, estimates the utterance content of the evaluation target speaker by machine learning from the evaluation target data and the learning models stored in the lip reading database, wherein the image processing unit includes: face detection means for detecting a face region of the learning target speaker from the learning target image and a face region of the evaluation target speaker from the evaluation target image; facial feature point detection means for detecting facial feature points from each face region detected by the face detection means; lip region extraction means for extracting a lip region from the facial feature points of each face region detected by the facial feature point detection means; and feature extraction means for extracting, from the face region, facial feature points, and lip region of the learning target speaker, lip features of the learning target speaker as the learning target data, and extracting, from the face region, facial feature points, and lip region of the evaluation target speaker, lip features of the evaluation target speaker as the evaluation target data.

2. The lip reading device according to claim 1, wherein the recognition processing unit separately estimates the age and/or gender of the evaluation target speaker from the evaluation target data by machine learning for attribute recognition, selects the learning model of the attribute corresponding to the estimated age and/or gender of the evaluation target speaker, and uses it to estimate the utterance content.

3. The lip reading device according to claim 1, wherein the feature extraction means extracts, from the face region, facial feature points, and lip region of the learning target speaker, facial expression features of the learning target speaker in addition to the lip features of the learning target speaker as the learning target data, and extracts, from the face region, facial feature points, and lip region of the evaluation target speaker, facial expression features of the evaluation target speaker in addition to the lip features of the evaluation target speaker as the evaluation target data.

4. The lip reading device according to any one of claims 1 to 3, further comprising photographing means for photographing the utterance scenes of the learning target speaker and the evaluation target speaker.

5. The lip reading device according to any one of claims 1 to 4, further comprising a recognition result output unit that outputs the utterance content of the evaluation target speaker estimated by the recognition processing unit.

6. The lip reading device according to claim 5, wherein the recognition result output unit includes a display that displays the utterance content of the evaluation target speaker estimated by the recognition processing unit as text and/or a speaker that outputs it as speech.

7. A lip reading method comprising: a first learning step of reading a learning target image in which an utterance scene of a learning target speaker is recorded; a second learning step of detecting a face region of the learning target speaker from the learning target image; a third learning step of detecting facial feature points of the learning target speaker from the face region of the learning target speaker; a fourth learning step of detecting a lip region of the learning target speaker from the facial feature points of the learning target speaker; a fifth learning step of extracting, from the face region, facial feature points, and lip region of the learning target speaker, facial expression features and lip features of the learning target speaker as learning target data; a sixth learning step of repeating the first to fifth learning steps, performing machine learning of lip reading based on the learning target data, and constructing a learning model for each attribute corresponding to the age and/or gender of the learning target speaker; a seventh learning step of storing the learning models; a first evaluation step of reading the stored learning models; a second evaluation step of reading an evaluation target image in which an utterance scene of an evaluation target speaker is recorded; a third evaluation step of detecting a face region of the evaluation target speaker from the evaluation target image; a fourth evaluation step of detecting facial feature points of the evaluation target speaker from the face region of the evaluation target speaker; a fifth evaluation step of detecting a lip region of the evaluation target speaker from the facial feature points of the evaluation target speaker; a sixth evaluation step of extracting, from the face region, facial feature points, and lip region of the evaluation target speaker, facial expression features and lip features of the evaluation target speaker as evaluation target data; and a seventh evaluation step of estimating the utterance content of the evaluation target speaker by machine learning from the evaluation target data and the learning model.

8. The lip reading method according to claim 7, wherein, in the seventh evaluation step, the age and/or gender of the evaluation target speaker is separately estimated from the evaluation target data by machine learning for attribute recognition, and the learning model of the attribute corresponding to the estimated age and/or gender is selected and used to estimate the utterance content.
JP2019213234A 2019-11-26 2019-11-26 Lip reading device and method Active JP7370050B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2019213234A JP7370050B2 (en) 2019-11-26 2019-11-26 Lip reading device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2019213234A JP7370050B2 (en) 2019-11-26 2019-11-26 Lip reading device and method

Publications (2)

Publication Number Publication Date
JP2021086274A JP2021086274A (en) 2021-06-03
JP7370050B2 true JP7370050B2 (en) 2023-10-27

Family

ID=76087706

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2019213234A Active JP7370050B2 (en) 2019-11-26 2019-11-26 Lip reading device and method

Country Status (1)

Country Link
JP (1) JP7370050B2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114466179A (en) * 2021-09-09 2022-05-10 马上消费金融股份有限公司 Method and device for measuring synchronism of voice and image
CN113869153B (en) * 2021-09-15 2024-06-07 天津大学 Lip image acquisition device, lip recognition system and online lip interaction system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013045282A (en) 2011-08-24 2013-03-04 Kyushu Institute Of Technology Communication support system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11175724A (en) * 1997-12-11 1999-07-02 Toshiba Tec Corp Person attribute identifying device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013045282A (en) 2011-08-24 2013-03-04 Kyushu Institute Of Technology Communication support system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Masaya Iwasaki, Michiko Kubokawa, Takeshi Saitoh, "Two Features Combination with Gated Recurrent Unit for Visual Speech Recognition", 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), IEEE, May 12, 2017, pp. 326-329
Kenji Mase (間瀬 健二), "Lipreading using optical flow" (オプティカルフローを用いた読唇), IEICE Transactions D-II (電子情報通信学会論文誌 D-II), Vol. J73-D-II, No. 6, June 25, 1990, pp. 796-803
Takeshi Saitoh (齊藤 剛史), "A study of face models effective for lip reading" (読唇に有効な顔モデルの検討), IEICE Technical Report, Vol. 111, No. 500, PRMU2011-275, HIP2011-103 (2012-3), March 22, 2012, pp. 217-222

Also Published As

Publication number Publication date
JP2021086274A (en) 2021-06-03

Similar Documents

Publication Publication Date Title
Agarwal et al. Detecting deep-fake videos from phoneme-viseme mismatches
US10460732B2 (en) System and method to insert visual subtitles in videos
Varghese et al. Overview on emotion recognition system
Zhang et al. Automatic speechreading with applications to human-computer interfaces
TW201201115A (en) Facial expression recognition systems and methods and computer program products thereof
US10534955B2 (en) Facial capture analysis and training system
US7257538B2 (en) Generating animation from visual and audio input
JP2005348872A (en) Feeling estimation device and feeling estimation program
JP7370050B2 (en) Lip reading device and method
KR101187600B1 (en) Speech Recognition Device and Speech Recognition Method using 3D Real-time Lip Feature Point based on Stereo Camera
JP2021015443A (en) Complement program and complement method and complementary device
Ivanko et al. Automatic lip-reading of hearing impaired people
JP5180116B2 (en) Nationality determination device, method and program
JP4775961B2 (en) Pronunciation estimation method using video
WO2020125252A1 (en) Robot conversation switching method and apparatus, and computing device
KR101621304B1 (en) Active shape model-based lip shape estimation method and system using mouth map
WO2023035969A1 (en) Speech and image synchronization measurement method and apparatus, and model training method and apparatus
JP2019152737A (en) Speaker estimation method and speaker estimation device
Narwekar et al. PRAV: A Phonetically Rich Audio Visual Corpus.
CN113822187A (en) Sign language translation, customer service, communication method, device and readable medium
Goecke et al. Validation of an automatic lip-tracking algorithm and design of a database for audio-video speech processing
Ibrahim A novel lip geometry approach for audio-visual speech recognition
JP2020038432A (en) Image analysis device, image analysis method, and program
JP2022144707A (en) Face synthesis lip reading device and face synthesis lip reading method
JP2020135424A (en) Information processor, information processing method, and program

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20221024

RD02 Notification of acceptance of power of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7422

Effective date: 20230703

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20230818

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20230822

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20230925

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20231003

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20231010

R150 Certificate of patent or registration of utility model

Ref document number: 7370050

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150