JP2022059957A

JP2022059957A - Voice recognition device

Info

Publication number: JP2022059957A
Application number: JP2020167873A
Authority: JP
Inventors: 秀樹清水; Hideki Shimizu
Original assignee: Citizen Watch Co Ltd
Current assignee: Citizen Watch Co Ltd
Priority date: 2020-10-02
Filing date: 2020-10-02
Publication date: 2022-04-14

Abstract

To provide a voice recognition device that can recognize utterances made without using the vocal cords.

SOLUTION: The voice recognition device includes: an imaging unit that acquires an image including a lip region of a speaker; a lip movement locus detection unit that detects a lip movement locus of the speaker from the image; a non-audible sound detection unit that detects a non-audible sound propagating in the air from a voice of the speaker; a frequency pattern extraction unit that analyzes frequency characteristics of the non-audible sound, and extracts a frequency pattern; a lip movement locus data storage unit that previously stores a correspondence between the lip movement locus and utterance content; a non-audible sound pattern storage unit that previously stores a correspondence between the frequency pattern of the non-audible sound and utterance content; an utterance candidate extraction unit that refers to the lip movement locus data storage unit to extract candidates for utterance content from the lip movement locus; an utterance determination unit that, in a case in which a plurality of candidates for utterance content are extracted, refers to the non-audible sound pattern storage unit to determine specific utterance content among the plurality of candidates for utterance content; and an output unit that outputs information about the determined utterance content.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice recognition device.

Many elderly people speak without using their vocal cords. In addition, when a person speaks in a low voice, the utterance may be made without using the vocal cords. As a way of learning what a person who does not use the vocal cords is saying, a method of recognizing words from mouth movement has been reported (for example, Patent Document 1).

Patent Document 1 describes a lip-reading device that recognizes words based on mouth movement. The device includes first mouth shape detecting means and recognition means. Based on mouth shape information indicating the shape of the speaker's mouth, the first mouth shape detecting means detects a first mouth shape and a second mouth shape. The first mouth shape is a shape that must be formed in advance in order to emit a given sound and that differs from the mouth shape corresponding to the vowel of that sound. The second mouth shape is the shape formed when a sound finishes being emitted. The recognition means recognizes the words spoken by the speaker based on the detected first and second mouth shapes.

A method of detecting utterance content from the non-audible sound emitted when speaking without using the vocal cords has also been reported (for example, Patent Document 2).

Patent Document 2 discloses a method in which a stethoscope-type microphone is attached to the surface of the human body to pick up the flesh-conducted vibration sound of a non-audible murmur articulated during speech that does not use regular vibration of the vocal cords.

However, Japanese has multiple utterances that produce the same lip movement, so a method that deciphers lip movement with a camera, as in Patent Document 1, can decipher only a limited set of words.

In addition, the method described in Patent Document 2 requires a dedicated microphone to be attached in advance, which limits the situations in which it can be used.

Patent Document 1: Japanese Unexamined Patent Publication No. 2008-310382
Patent Document 2: International Publication No. 2004/021738

An object of the present invention is to provide a voice recognition device that can recognize utterance content even when the speaker speaks without using the vocal cords.

A voice recognition device according to an embodiment of the present disclosure includes: an imaging unit that acquires an image including the lip region of a speaker during an utterance; a lip movement locus detection unit that detects the speaker's lip movement locus from the image; a non-audible sound detection unit that detects, from the sound produced when the speaker speaks, a non-audible sound propagating through the air; a frequency pattern extraction unit that analyzes the frequency characteristics of the non-audible sound and extracts a frequency pattern; a lip movement locus data storage unit that stores in advance the correspondence between lip movement loci and utterance content; a non-audible sound pattern storage unit that stores in advance the correspondence between frequency patterns of the non-audible sound and utterance content; an utterance candidate extraction unit that refers to the lip movement locus data storage unit and extracts candidates for the utterance content from the lip movement locus; an utterance determination unit that, when the utterance candidate extraction unit has extracted a plurality of candidates, refers to the non-audible sound pattern storage unit and determines a specific utterance content from among the candidates; and an output unit that outputs information about the utterance content determined by the utterance determination unit.

In the above voice recognition device, the non-audible sound detection unit preferably starts detecting the speaker's non-audible sound using, as a trigger, the start of the speaker's lip movement detected by the lip movement locus detection unit.

In the above voice recognition device, the non-audible sound detection unit preferably detects sound waves of 20 kHz or more and 70 kHz or less as the non-audible sound.

In the above voice recognition device, the utterance determination unit preferably determines the utterance content based on the presence or absence of a peak in the frequency pattern and the position of a peak occurring in a specific frequency band.

In the above voice recognition device, the plurality of utterance contents having substantially the same lip movement locus may include at least two of "na", "ta", and "da".

In the above voice recognition device, the plurality of utterance contents having substantially the same lip movement locus may include "shi" and "chi".

In the above voice recognition device, the plurality of utterance contents having substantially the same lip movement locus may include "a" and "ha".

According to the voice recognition device of the present invention, utterance content can be recognized even when the speaker speaks without using the vocal cords.

A block diagram of the voice recognition device according to the embodiment of the present disclosure.
(a) is an example of the contour of a face recognized by the face image recognition unit, and (b) is an example of the contour of the mouth included in the face contour of (a).
A diagram showing the lip movement loci when "na", "ta", and "da" are uttered; (a) shows the locus in the y direction and (b) shows the locus in the x direction.
(a) to (c) are the frequency spectra of the sound when "na", "ta", and "da" are uttered, respectively.
A flowchart explaining the operation procedure of the voice recognition device according to the embodiment of the present disclosure.
A diagram showing the lip movement loci when "shi" and "chi" are uttered; (a) shows the locus in the y direction and (b) shows the locus in the x direction.
(a) and (b) are the frequency spectra of the sound when "shi" and "chi" are uttered, respectively.
A diagram showing the lip movement loci when "a" and "ha" are uttered; (a) shows the locus in the y direction and (b) shows the locus in the x direction.
(a) and (b) are the frequency spectra of the sound when "a" and "ha" are uttered, respectively.
A schematic configuration diagram of a conversation system using the voice recognition device according to Example 1.
A block diagram of the voice recognition device according to Example 1.
A block diagram of the voice recognition device according to a modification of Example 1.
A schematic configuration diagram of an interpreter device using the voice recognition device according to Example 2.
A block diagram of the voice recognition device according to Example 2.
A schematic configuration diagram of a voice device operation system using the voice recognition device according to Example 3.
A block diagram of the voice recognition device according to Example 3.

Hereinafter, a voice recognition device according to the present invention will be described with reference to the drawings. Note, however, that the technical scope of the present invention is not limited to these embodiments and extends to the inventions described in the claims and their equivalents.

FIG. 1 shows a block diagram of the voice recognition device 1001 according to the embodiment of the present disclosure. The voice recognition device 1001 includes an imaging unit 1, a lip movement locus detection unit 2, a non-audible sound detection unit 3, a frequency pattern extraction unit 4, a lip movement locus data storage unit 5, a non-audible sound pattern storage unit 6, an utterance candidate extraction unit 7, an utterance determination unit 8, an output unit 9, and a face image recognition unit 10. An information terminal such as a smartphone or a tablet terminal can be used as the voice recognition device 1001. The device is not limited to such examples, however, and can also be realized as an embedded module using a single-board computer. Alternatively, the voice recognition device 1001 may be placed on a server, and the data acquired by the imaging unit 1 and the non-audible sound detection unit 3 may be transmitted to the server. The imaging unit 1 is a camera, and the non-audible sound detection unit 3 is a microphone. The lip movement locus data storage unit 5 and the non-audible sound pattern storage unit 6 are implemented with a hard disk or a semiconductor memory. The lip movement locus detection unit 2, the frequency pattern extraction unit 4, the utterance candidate extraction unit 7, the utterance determination unit 8, the output unit 9, and the face image recognition unit 10 are realized as software (programs) by a computer that is provided in the voice recognition device 1001 and includes a CPU, ROM, RAM, and the like.

The imaging unit 1 is a camera equipped with a CMOS (Complementary Metal Oxide Semiconductor) or CCD (Charge Coupled Device) image sensor. The imaging unit 1 acquires an image including the speaker's lip region during an utterance and supplies the captured image to the face image recognition unit 10 frame by frame. The camera may be one already built into an information terminal such as a smartphone or tablet terminal, or an external camera may be used.

The face image recognition unit 10 uses a built-in face recognition application program to identify the contours of the speaker's face and lips and to track them automatically. As a result, the speaker's face image can be captured even if the speaker moves within the imaging range of the imaging unit 1.

The lip movement locus detection unit 2 detects the speaker's lip movement locus from the image acquired by the imaging unit 1. FIG. 2(a) is an example of the contour of a face recognized by the face image recognition unit 10, and FIG. 2(b) is an example of the contour of the mouth included in the face contour of FIG. 2(a). As shown in FIG. 2(a), the face image recognition unit 10 can determine the positions of the contours of the face 21, the eyebrows 22, the eyes 23, the nose 24, and the mouth 25. As shown in FIG. 2(b), during an utterance the lips open and close in the vertical direction (y direction) and stretch and contract in the horizontal direction (x direction). The feature points used to describe lip movement are therefore the lower end y1 of the upper lip, the upper end y2 of the lower lip, the left end x1 of the lips, and the right end x2 of the lips. The feature values of the lip movement are the change over time of the vertical distance (Δy = y1 - y2) and the change over time of the horizontal distance (Δx = x2 - x1).
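The feature values Δy and Δx defined above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the `LipLandmarks` class, the function names, and the assumption that landmark coordinates are already converted to millimetres are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LipLandmarks:
    """Lip feature points for one video frame (hypothetical container, units assumed to be mm)."""
    y1: float  # lower end of the upper lip
    y2: float  # upper end of the lower lip
    x1: float  # left end of the lips
    x2: float  # right end of the lips


def lip_features(frame):
    """Return (dy, dx) as defined in the text: dy = y1 - y2, dx = x2 - x1."""
    return frame.y1 - frame.y2, frame.x2 - frame.x1


def lip_locus(frames, fps):
    """Return a list of (t, dy, dx) tuples, with t in seconds from the first frame."""
    return [(i / fps, *lip_features(f)) for i, f in enumerate(frames)]
```

Calling `lip_locus` on the per-frame landmarks yields the two time series (Δy over time and Δx over time) that the description treats as the lip movement locus.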

The lip movement locus data storage unit 5 stores in advance the correspondence between lip movement loci (utterance lip movement profiles) and utterance content. FIG. 3 shows the lip movement loci when "na", "ta", and "da" are uttered; FIG. 3(a) shows the locus in the y direction and FIG. 3(b) shows the locus in the x direction. The horizontal axis of FIGS. 3(a) and 3(b) is the time (seconds) from the start of lip movement. The vertical axis of FIG. 3(a) is the vertical distance Δy (mm), and the vertical axis of FIG. 3(b) is the horizontal distance Δx (mm). In FIG. 3(a), Lny, Lty, and Ldy represent the y-direction lip movement loci when "na", "ta", and "da" are uttered, respectively. In FIG. 3(b), Lnx, Ltx, and Ldx represent the x-direction lip movement loci when "na", "ta", and "da" are uttered, respectively. Besides these examples, the lip movement locus data storage unit 5 stores in advance the correspondence between lip movement locus and utterance content for various other utterances. The lip movement locus data storage unit 5 may also store a learning model, generated by machine learning using artificial intelligence (AI), that indicates which utterance content the feature values of a lip movement locus are closest to.

The utterance candidate extraction unit 7 refers to the lip movement locus data storage unit 5 and extracts candidates for the utterance content from the lip movement locus. In the example shown in FIGS. 3(a) and 3(b), the three curves representing the change over time of lip movement in the y direction when "na", "ta", and "da" are uttered (Lny, Lty, Ldy) follow almost the same locus, and the three curves representing the change over time in the x direction (Lnx, Ltx, Ldx) also follow almost the same locus. Consequently, when an utterance is detected whose y-direction curve resembles one of the three curves (Lny, Lty, Ldy) and whose x-direction curve resembles one of the three curves (Lnx, Ltx, Ldx), the utterance is known to be one of "na", "ta", and "da", but it cannot be determined which one. In such a case, three candidates for the utterance content are extracted.
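One way to picture the candidate extraction described above is a nearest-locus lookup that returns every stored utterance within a tolerance, rather than only the single best match. The sketch below is an assumption for illustration: the distance measure, the `tolerance_mm` value, and the sample loci are hypothetical, and the patent does not specify how similarity is computed.

```python
def locus_distance(a, b):
    """Mean absolute difference between two equally sampled sequences of (dy, dx) pairs."""
    n = min(len(a), len(b))
    total = sum(abs(a[i][0] - b[i][0]) + abs(a[i][1] - b[i][1]) for i in range(n))
    return total / n


def extract_candidates(observed, stored_loci, tolerance_mm=1.0):
    """Return every stored utterance whose locus lies within the tolerance of the observed one."""
    return sorted(
        label
        for label, locus in stored_loci.items()
        if locus_distance(observed, locus) <= tolerance_mm
    )
```

With stored loci for "na", "ta", and "da" that nearly coincide, an observed locus close to any of them returns all three labels, which is exactly the ambiguous case that the non-audible sound is later used to resolve.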

The non-audible sound detection unit 3 detects, from the sound produced when the speaker speaks, the non-audible sound propagating through the air. A MEMS (Micro Electro Mechanical Systems) microphone built into a smartphone or tablet terminal can be used as the non-audible sound detection unit 3. A MEMS microphone makes it possible to detect utterances including the frequency band of non-audible sound. If the MEMS microphone built into a terminal such as a smartphone has its non-audible band cut to reduce noise, that band limitation can be removed. Instead of using the microphone already provided in the smartphone or similar device, an external microphone capable of detecting non-audible sound may be attached. The non-audible sound detection unit 3 preferably detects sound waves of 20 kHz or more and 70 kHz or less as the non-audible sound.

The non-audible sound detection unit 3 preferably starts detecting the speaker's non-audible sound using, as a trigger, the start of the speaker's lip movement detected by the lip movement locus detection unit 2. Non-audible sound is produced not only when the speaker speaks but also, for example, when the speaker moves his or her body, and this becomes noise. It can therefore be difficult to detect the moment at which the speaker starts speaking from the non-audible sound alone. For this reason, the non-audible sound detection unit 3 preferably starts detecting the non-audible sound after the lip movement locus detection unit 2 has detected that the speaker's lips have started to move. This allows the non-audible sound produced by the speaker's utterance to be detected accurately.
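The lip-movement trigger can be sketched as a simple gate on the audio stream. This is only an illustration under stated assumptions: the onset criterion (deviation from the first frame's lip opening by more than `threshold_mm`) and both function names are hypothetical, since the patent does not define how the start of lip movement is detected.

```python
def lip_onset_index(dy_series, threshold_mm=0.5):
    """Index of the first frame whose lip opening departs from the resting value, or None."""
    rest = dy_series[0]
    for i, dy in enumerate(dy_series):
        if abs(dy - rest) > threshold_mm:
            return i
    return None


def gate_audio(samples, dy_series, fps, sample_rate):
    """Keep only the audio captured from the start of lip movement onward."""
    onset = lip_onset_index(dy_series)
    if onset is None:
        return []  # no lip movement, so nothing is treated as speech
    start = onset * sample_rate // fps  # integer sample index of the onset frame
    return samples[start:]
```

Audio before the onset frame, such as noise from body movement, is discarded before any frequency analysis takes place.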

The frequency pattern extraction unit 4 analyzes the frequency characteristics of the non-audible sound and extracts a frequency pattern. FIG. 4(a) is the frequency spectrum of the sound when "na" is uttered, FIG. 4(b) is the frequency spectrum when "ta" is uttered, and FIG. 4(c) is the frequency spectrum when "da" is uttered. In FIGS. 4(a) to 4(c), the horizontal axis is frequency (kHz) and the vertical axis is power (dB). Even when an utterance is made without using the vocal cords, differences appear in the frequency distribution (frequency pattern) of the non-audible region depending on how the tongue is used and how breath is released from the throat. By using these differences in frequency distribution, utterances that cannot be distinguished from the lip movement locus alone can be identified.
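As a rough sketch of the frequency analysis, the following computes a power spectrum in dB and keeps only the 20 kHz to 70 kHz band named in the description. The patent does not say which transform is used. A naive discrete Fourier transform from the standard library is substituted here for clarity, and the function names are hypothetical. Note that capturing components up to 70 kHz requires a sampling rate above 140 kHz (the Nyquist criterion).

```python
import cmath
import math


def power_spectrum_db(samples, sample_rate):
    """Return [(frequency_hz, power_db)] for bins up to Nyquist, using a naive DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))
        power = abs(acc) ** 2 / n
        # A small offset avoids log10(0) for empty bins.
        spectrum.append((k * sample_rate / n, 10 * math.log10(power + 1e-12)))
    return spectrum


def frequency_pattern(samples, sample_rate, lo_hz=20_000, hi_hz=70_000):
    """Keep only the bins inside the non-audible band."""
    return [(f, p) for f, p in power_spectrum_db(samples, sample_rate) if lo_hz <= f <= hi_hz]
```

The naive DFT costs O(n²) and is only suitable for short frames. A real-time implementation would normally use an FFT instead.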

The non-audible sound pattern storage unit 6 stores in advance the correspondence between frequency patterns of the non-audible sound and utterance content. That is, as feature points in the frequency pattern for distinguishing among utterances whose lip movement loci are substantially the same, the non-audible sound pattern storage unit 6 stores whether or not a peak occurs at a specific frequency and, if a peak does occur, information about the position of the peak within a specific frequency band. For example, the non-audible sound pattern storage unit 6 preferably stores, for each of "na", "ta", and "da", the presence or absence of a peak in the frequency pattern of the non-audible sound and the position of a peak occurring in a specific frequency band. Specifically, as shown in FIG. 4(a), when "na" is uttered, the whole tongue is pressed lightly against the upper jaw, so no clear peak appears in the frequency pattern in the range of 20 kHz to 30 kHz. As shown in FIG. 4(b), when "ta" is uttered, the tip of the tongue strikes the upper jaw sharply, as if flicking it, so a peak Pt appears in the frequency pattern in the range of 25 kHz to 30 kHz. Furthermore, as shown in FIG. 4(c), when "da" is uttered, the sound is voiced and the tip of the tongue touches the upper jaw more lightly than for "ta", so a peak Pd appears in the frequency pattern in the range of 20 kHz to 25 kHz, which is lower than for "ta". Accordingly, the non-audible sound pattern storage unit 6 stores the following: when the utterance content is "na", no clear peak appears in the frequency pattern in the range of 20 kHz to 30 kHz; when the utterance content is "ta", a peak appears in the range of 25 kHz to 30 kHz; and when the utterance content is "da", a peak appears in the range of 20 kHz to 25 kHz. In this way, for each of a plurality of utterances whose lip movement loci are substantially the same, the non-audible sound pattern storage unit 6 stores in advance whether a peak occurs at a specific frequency in the frequency pattern of the non-audible sound and, if so, in which frequency band the peak occurs. In addition to these, the non-audible sound pattern storage unit 6 also stores other combinations of utterances whose lip movement loci are substantially the same but whose non-audible sound frequency patterns differ, such as "shi" and "chi", and "a" and "ha".

When the utterance candidate extraction unit 7 has extracted a plurality of candidates for the utterance content, the utterance determination unit 8 refers to the non-audible sound pattern storage unit 6 and determines a specific utterance content from among the candidates. For example, when the utterance candidate extraction unit 7 has extracted the three candidates "na", "ta", and "da", the utterance determination unit 8 refers to the non-audible sound pattern storage unit 6 and determines a specific utterance content from among these three candidates. As described above, when the lip movement locus detected by the lip movement locus detection unit 2 forms curves similar to those in FIGS. 3(a) and 3(b), the utterance candidate extraction unit 7 refers to the lip movement locus data storage unit 5 and extracts "na", "ta", and "da" from the lip movement locus as candidates for the utterance content. Next, the utterance determination unit 8 refers to the non-audible sound pattern storage unit 6 and compares the frequency pattern of the detected non-audible sound with the frequency pattern of each of the three candidates ("na", "ta", "da"), thereby determining a specific utterance content from among the three candidates.

The utterance determination unit 8 can determine the utterance content based on the presence or absence of a peak occurring in a specific frequency band of the frequency pattern and on the position of the peak. For example, as in FIG. 4(a), when no power peak is detected at frequencies in the range of 20 kHz to 30 kHz, the detected utterance can be determined to be "na". As in FIG. 4(b), when a power peak Pt is detected at a frequency in the range of 25 kHz to 30 kHz, the detected utterance can be determined to be "ta". Alternatively, as in FIG. 4(c), when a power peak Pd is detected at a frequency in the range of 20 kHz to 25 kHz, the detected utterance can be determined to be "da". In this way, when the utterance candidate extraction unit 7 has extracted the three candidates ("na", "ta", "da"), the utterance determination unit 8 refers to the non-audible sound pattern storage unit 6 and determines one of "na", "ta", and "da" as the specific utterance content.
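The peak-based decision among "na", "ta", and "da" can be written as a small rule. The band edges (20, 25, and 30 kHz) come from the description. Everything else is an assumption for illustration: the description does not define what counts as a "clear peak", so a prominence threshold over the band median is used here, and a peak at exactly 25 kHz is arbitrarily assigned to "ta".

```python
import statistics


def find_peak(pattern, lo_hz=20_000, hi_hz=30_000, min_prominence_db=10.0):
    """Return the frequency of the strongest bin in [lo_hz, hi_hz] if it stands out, else None.

    pattern is a list of (frequency_hz, power_db) pairs.
    """
    band = [(f, p) for f, p in pattern if lo_hz <= f <= hi_hz]
    if not band:
        return None
    freq, power = max(band, key=lambda fp: fp[1])
    floor = statistics.median(p for _, p in band)  # assumed noise-floor estimate
    return freq if power - floor >= min_prominence_db else None


def decide_utterance(candidates, pattern, split_hz=25_000):
    """Pick one of the candidates from the peak rules given in the text, or None if no rule fits."""
    peak = find_peak(pattern)
    if peak is None:
        label = "na"   # no clear peak in 20-30 kHz
    elif peak >= split_hz:
        label = "ta"   # peak in 25-30 kHz
    else:
        label = "da"   # peak in 20-25 kHz
    return label if label in candidates else None
```

For other ambiguous groups such as "shi"/"chi" or "a"/"ha", the stored rules would differ. Their peak positions are not given in this part of the description, so they are not sketched here.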

When the utterance candidate extraction unit 7 has extracted only one candidate for the utterance content, the utterance determination unit 8 can determine that candidate to be the content uttered by the speaker without referring to the frequency pattern of the non-audible sound. In this case, the step of referring to the non-audible sound frequency pattern can be omitted, so the content of the speaker's utterance can be determined quickly and the processing load on the voice recognition device 1001 can be reduced. If processing capacity permits, however, the utterance determination unit 8 may still compare the frequency pattern of the detected sound with the frequency patterns stored in the non-audible sound pattern storage unit 6. This makes it possible to increase the reliability of the utterance determination.

The output unit 9 outputs information about the utterance content determined by the utterance determination unit 8. When a display device is connected to the output unit 9, the detected utterance content can be displayed on the screen of the display device as text. When an audio playback device is connected to the output unit 9, the detected utterance content can be output as audio. For example, audio output through earphones or the like may be used together with the screen display.

Next, the operation procedure of the voice recognition device according to the present embodiment will be described. FIG. 5 is a flowchart explaining the operation procedure of the voice recognition device according to the embodiment of the present disclosure. First, in step S101, the camera serving as the imaging unit 1 is activated. The camera acquires an image including the speaker's lip region during an utterance.

Next, in step S102, the face image recognition unit 10 identifies the contours of the speaker's face and lips.

Next, in step S103, the lip movement locus detection unit 2 detects the locus of the speaker's lip movement from the images captured by the camera.

Next, in step S104, the utterance candidate extraction unit 7 refers to the lip movement locus data storage unit 5, which stores in advance the correspondence between lip movement loci and utterance content, and extracts candidates for the utterance content from the lip movement locus.

Meanwhile, once the camera is operating and it is detected that the speaker's lips have started to move, this detection serves as a trigger: in step S105, the inaudible sound sensor serving as the inaudible sound detection unit 3 is activated and detects the inaudible sound that propagates through the air from the speaker's voice during the utterance.

Next, in step S106, the frequency pattern extraction unit 4 analyzes the frequency characteristics of the inaudible sound, and in step S107 it extracts the frequency pattern.
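
Steps S106 and S107 can be illustrated with a minimal sketch. The description does not state how the frequency characteristics are analyzed, so the Goertzel algorithm is used here purely as a stand-in; the function names, the 192 kHz sampling rate, and the choice of probe frequencies are all assumptions made for illustration.

```python
import math


def band_power(samples, sample_rate, freq):
    """Signal power at `freq` (Hz), computed with the Goertzel algorithm.

    Assumption: the patent does not name an analysis method; Goertzel is
    used only because it evaluates a single frequency bin cheaply.
    """
    n = len(samples)
    k = round(n * freq / sample_rate)      # nearest DFT bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def frequency_pattern(samples, sample_rate, probe_freqs):
    """Hypothetical 'frequency pattern': power at each probed frequency."""
    return {f: band_power(samples, sample_rate, f) for f in probe_freqs}


# Synthetic 40 kHz tone sampled at an assumed 192 kHz.
tone = [math.sin(2 * math.pi * 40_000 * i / 192_000) for i in range(960)]
pattern = frequency_pattern(tone, 192_000, [20_000, 40_000])
```

In this sketch the pattern is simply a mapping from frequency to power; the power at 40 kHz dominates for the synthetic tone, which is the kind of feature the later steps look for.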

Next, in step S108, it is determined whether the utterance candidate extraction unit 7 extracted multiple utterance candidates or only one in step S104. When there is only one utterance candidate, in step S109 the utterance determination unit 8 determines that single candidate to be the speaker's utterance. Vowels are an example of utterances that yield only one candidate. In this case, the utterance content can be determined from the lip movement locus alone. Because there is no need to refer to the frequency pattern of the inaudible sound when there is only one candidate, the utterance content can be determined efficiently. However, if processing capacity permits, the utterance determination unit 8 may still compare the detected frequency pattern with the frequency pattern stored in the inaudible sound pattern storage unit 6. Doing so can increase the reliability of the utterance determination.

On the other hand, when the utterance candidate extraction unit 7 has extracted multiple utterance content candidates, in step S110 the utterance determination unit 8 refers to the inaudible sound pattern storage unit 6, which stores in advance the correspondence between frequency patterns of inaudible sound and utterance content, and determines one specific utterance content from among the candidates on the basis of the frequency pattern.

Next, in step S111, the output unit 9 outputs the determined utterance content.
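
The branch made in steps S108 to S110 can be sketched as follows. This is only an illustration of the control flow: the function name, the callback, and the handling of an empty candidate list are assumptions, since the description does not say what happens when no candidate is extracted.

```python
from typing import Callable, Sequence


def decide_utterance(
    candidates: Sequence[str],
    disambiguate: Callable[[Sequence[str]], str],
) -> str:
    """Sketch of steps S108 to S110.

    `candidates` stands for the output of step S104; `disambiguate` stands
    for the frequency-pattern comparison of step S110 (both hypothetical).
    """
    if not candidates:
        # Assumption: the patent does not cover this case.
        raise ValueError("no utterance candidate was extracted")
    if len(candidates) == 1:
        # S109: a single candidate is accepted without consulting the
        # frequency pattern of the inaudible sound.
        return candidates[0]
    # S110: several candidates share a lip movement locus, so the
    # frequency pattern is used to pick one of them.
    return disambiguate(candidates)
```

For example, `decide_utterance(["a"], ...)` returns `"a"` without ever calling the callback, which corresponds to the processing-load saving described for the single-candidate case.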

In the above description, the combination of "na", "ta", and "da" was given as an example of multiple utterance candidates extracted from a lip movement locus, but the candidates are not limited to this example. That is, when the utterance contents whose lip movement loci are substantially identical are any two of "na", "ta", and "da", one utterance content may be determined from those two. Other examples of multiple utterance candidates are the combination of "shi" and "chi" and the combination of "a" and "ha"; the method of determining a specific utterance from these combinations is described below.

First, the case where the utterance candidates are the combination of "shi" and "chi" will be described. FIG. 6 shows the lip movement loci when "shi" and "chi" are uttered: FIG. 6(a) shows the loci in the y direction, and FIG. 6(b) shows the loci in the x direction. In FIG. 6(a), Lsy and Lcy represent the loci of the y-direction lip movement (Δy = y1 - y2) shown in FIG. 2(b) when "shi" and "chi" are uttered, respectively. In FIG. 6(b), Lsx and Lcx represent the loci of the x-direction lip movement (Δx = x2 - x1) when "shi" and "chi" are uttered, respectively.

The utterance candidate extraction unit 7 refers to the lip movement locus data storage unit 5 and extracts candidates for the utterance content from the lip movement locus. In the example shown in FIGS. 6(a) and 6(b), the curves Lsy and Lcy, which represent the change over time of the y-direction lip movement when "shi" and "chi" are uttered, are nearly identical, and the curves Lsx and Lcx, which represent the change over time of the x-direction lip movement, are also nearly identical. When the y-direction locus detected by the lip movement locus detection unit 2 is similar to the curves shown in FIG. 6(a) and the x-direction locus is similar to the curves shown in FIG. 6(b), the utterance candidate extraction unit 7 refers to the lip movement locus data storage unit 5 and extracts "shi" and "chi" as candidates for the utterance content. In this case, the utterance is known to be either "shi" or "chi", but which of the two it is cannot be identified from the locus alone. Two utterance content candidates are therefore extracted.
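
Candidate extraction of this kind can be sketched as a comparison between the detected loci and stored templates. The description only says that the detected loci are "similar" to the stored curves, so the RMS distance, the threshold value, the equal-length sampling, and the numeric template values below are all assumptions made for illustration.

```python
import math


def rms_distance(a, b):
    """Root-mean-square difference between two equally sampled loci."""
    if len(a) != len(b):
        raise ValueError("loci must have the same number of samples")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


def extract_candidates(dy, dx, templates, threshold=0.1):
    """Return every label whose stored (dy, dx) loci are both close to the
    detected loci. `templates` plays the role of the lip movement locus
    data storage unit 5; its layout here is hypothetical."""
    return [
        label
        for label, (t_dy, t_dx) in templates.items()
        if rms_distance(dy, t_dy) <= threshold
        and rms_distance(dx, t_dx) <= threshold
    ]


# Made-up template values: "shi" and "chi" are nearly identical, "a" is not.
templates = {
    "shi": ([0.0, 1.00, 0.0], [0.0, 0.5, 0.0]),
    "chi": ([0.0, 1.02, 0.0], [0.0, 0.5, 0.0]),
    "a": ([0.0, 3.00, 0.0], [0.0, 0.1, 0.0]),
}
candidates = extract_candidates([0.0, 1.01, 0.0], [0.0, 0.5, 0.0], templates)
```

With these made-up values both "shi" and "chi" fall within the threshold, so two candidates are returned and the frequency pattern is needed to separate them.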

The frequency pattern extraction unit 4 analyzes the frequency characteristics of the inaudible sound and extracts the frequency pattern. FIG. 7(a) is the frequency spectrum of the sound when "shi" is uttered, and FIG. 7(b) is the frequency spectrum of the sound when "chi" is uttered.

The inaudible sound pattern storage unit 6 stores in advance the correspondence between frequency patterns of inaudible sound and utterance content. For example, as shown in FIG. 7(a), when "shi" is uttered, no clear peak appears in the frequency pattern near 40 kHz. In contrast, as shown in FIG. 7(b), when "chi" is uttered, the center of the tongue is pressed against the upper jaw, and a peak Pc appears in the frequency pattern near 40 kHz. In this way, the inaudible sound pattern storage unit 6 stores in advance the correspondence between the frequency patterns of the inaudible sound for "shi" and "chi" and the utterance content.

Next, the utterance determination unit 8 refers to the inaudible sound pattern storage unit 6 and compares the detected frequency pattern of the inaudible sound with the frequency pattern of each of the two utterance content candidates ("shi" and "chi"), thereby determining one specific utterance content from the two candidates.

The utterance determination unit 8 can determine the utterance content based on the presence or absence of a peak in the frequency pattern and on the position of a peak occurring in a specific frequency band. When no power peak is detected at frequencies near 40 kHz, as in FIG. 7(a), the detected utterance can be judged to be "shi". When a power peak Pc is detected near 40 kHz, as in FIG. 7(b), the detected utterance can be judged to be "chi". In this way, when the utterance candidate extraction unit 7 has extracted two utterance content candidates ("shi" and "chi"), the utterance determination unit 8 refers to the inaudible sound pattern storage unit 6 and determines either "shi" or "chi" as the specific utterance content.
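
The peak test for "shi" versus "chi" can be sketched as below. The description does not define what counts as a "clear peak", so the criterion used here (the maximum power within ±2 kHz of the center must exceed three times the mean power outside that band) and all parameter names are assumptions.

```python
def has_peak(freqs_hz, power, center_hz, half_width_hz=2_000.0, ratio=3.0):
    """Return True if the spectrum shows a peak near `center_hz`.

    Assumed criterion: max power inside the band exceeds `ratio` times the
    mean power outside it. The patent gives no numeric definition.
    """
    in_band = [p for f, p in zip(freqs_hz, power)
               if abs(f - center_hz) <= half_width_hz]
    out_band = [p for f, p in zip(freqs_hz, power)
                if abs(f - center_hz) > half_width_hz]
    if not in_band or not out_band:
        return False
    baseline = sum(out_band) / len(out_band)
    return max(in_band) > ratio * baseline


def shi_or_chi(freqs_hz, power):
    """A peak near 40 kHz indicates "chi"; its absence indicates "shi"."""
    return "chi" if has_peak(freqs_hz, power, 40_000.0) else "shi"


freqs = [30_000.0, 35_000.0, 40_000.0, 45_000.0, 50_000.0]
flat = [1.0, 1.0, 1.0, 1.0, 1.0]       # no peak, as in FIG. 7(a)
peaked = [1.0, 1.0, 10.0, 1.0, 1.0]    # peak near 40 kHz, as in FIG. 7(b)
```

The two spectra are made-up values chosen only to exercise both branches; `shi_or_chi(freqs, flat)` selects "shi" and `shi_or_chi(freqs, peaked)` selects "chi".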

Next, the case where the utterance candidates are the combination of "a" and "ha" will be described. FIG. 8 shows the lip movement loci when "a" and "ha" are uttered: FIG. 8(a) shows the loci in the y direction, and FIG. 8(b) shows the loci in the x direction. In FIG. 8(a), Lay and Lhy represent the loci of the y-direction lip movement (Δy = y1 - y2) shown in FIG. 2(b) when "a" and "ha" are uttered, respectively. In FIG. 8(b), Lax and Lhx represent the loci of the x-direction lip movement (Δx = x2 - x1) when "a" and "ha" are uttered, respectively.

The utterance candidate extraction unit 7 refers to the lip movement locus data storage unit 5 and extracts candidates for the utterance content from the lip movement locus. In the example shown in FIGS. 8(a) and 8(b), the curves Lay and Lhy, which represent the change over time of the y-direction lip movement when "a" and "ha" are uttered, are nearly identical, and the curves Lax and Lhx, which represent the change over time of the x-direction lip movement, are also nearly identical. When the y-direction locus detected by the lip movement locus detection unit 2 is similar to the curves shown in FIG. 8(a) and the x-direction locus is similar to the curves shown in FIG. 8(b), the utterance candidate extraction unit 7 refers to the lip movement locus data storage unit 5 and extracts "a" and "ha" as candidates for the utterance content. In this case, the utterance is known to be either "a" or "ha", but which of the two it is cannot be identified from the locus alone. Two utterance content candidates are therefore extracted.

The frequency pattern extraction unit 4 analyzes the frequency characteristics of the inaudible sound and extracts the frequency pattern. FIG. 9(a) is the frequency spectrum of the sound when "a" is uttered, and FIG. 9(b) is the frequency spectrum of the sound when "ha" is uttered.

The inaudible sound pattern storage unit 6 stores in advance the correspondence between frequency patterns of inaudible sound and utterance content. For example, as shown in FIG. 9(a), when "a" is uttered, the entire tongue is pressed lightly against the upper jaw, so no clear peak appears in the frequency pattern near 20 kHz. In contrast, as shown in FIG. 9(b), when "ha" is uttered, the center of the tongue is pressed against the upper jaw, and a peak Ph appears in the frequency pattern near 20 kHz. In this way, the inaudible sound pattern storage unit 6 stores in advance the correspondence between the frequency patterns of the inaudible sound for "a" and "ha" and the utterance content.

Next, the utterance determination unit 8 refers to the inaudible sound pattern storage unit 6 and compares the detected frequency pattern of the inaudible sound with the frequency pattern of each of the two utterance content candidates ("a" and "ha"), thereby determining one specific utterance content from the two candidates.

The utterance determination unit 8 can determine the utterance content based on the presence or absence of a peak in the frequency pattern and on the position of a peak occurring in a specific frequency band. When no power peak is detected in the frequency pattern near 20 kHz, as in FIG. 9(a), the detected utterance can be judged to be "a". When a power peak is detected in the frequency pattern near 20 kHz, as in FIG. 9(b), the detected utterance can be judged to be "ha". In this way, when the utterance candidate extraction unit 7 has extracted two utterance content candidates ("a" and "ha"), the utterance determination unit 8 refers to the inaudible sound pattern storage unit 6 and determines either "a" or "ha" as the specific utterance content.

As described above, the combination of "na", "ta", and "da", the combination of "shi" and "chi", and the combination of "a" and "ha" have been given as examples of multiple utterance contents whose lip movement loci are substantially identical. The voice recognition device according to the embodiment of the present disclosure is not limited to these examples and can also be applied to other combinations of utterance contents whose lip movement loci are substantially identical.
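
Because the same comparison applies to any such combination, the stored correspondence can be pictured as a small lookup table. The sketch below covers only the two pairs whose peak behavior is stated in this section ("shi"/"chi" near 40 kHz and "a"/"ha" near 20 kHz); the table layout, the ±2 kHz tolerance, and the `None` result for an inconclusive match are assumptions.

```python
# Hypothetical contents of the inaudible sound pattern storage unit 6:
# for each utterance, the band to inspect and whether a peak is expected.
STORED_PATTERNS = {
    "shi": {"band_hz": 40_000.0, "peak": False},
    "chi": {"band_hz": 40_000.0, "peak": True},
    "a": {"band_hz": 20_000.0, "peak": False},
    "ha": {"band_hz": 20_000.0, "peak": True},
}


def determine(candidates, detected_peaks_hz, tolerance_hz=2_000.0):
    """Pick the candidate whose stored pattern agrees with the detected
    peaks. Returns None when zero or several candidates agree (assumed
    behavior; the patent does not describe this case)."""
    matches = []
    for label in candidates:
        stored = STORED_PATTERNS[label]
        observed = any(
            abs(f - stored["band_hz"]) <= tolerance_hz
            for f in detected_peaks_hz
        )
        if observed == stored["peak"]:
            matches.append(label)
    return matches[0] if len(matches) == 1 else None
```

Adding another combination would then amount to adding rows to the table rather than writing new decision logic, provided its distinguishing feature can also be expressed as the presence or absence of a peak in a band.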

As described above, the voice recognition device according to the embodiment of the present disclosure can determine, without contact, utterances made without using the vocal cords (murmurs): rough candidates for the utterance are extracted from the lip movement locus, and the speaker's utterance is then confirmed from among those candidates using the frequency pattern of the inaudible sound. Combining the pattern determination based on lip movement with the determination based on inaudible sound improves the accuracy with which the utterance content is predicted. The voice recognition device according to the embodiment of the present disclosure makes it possible to decipher the conversation of elderly people who speak without using the vocal cords. It also makes it possible to grasp the content of a call made in a whisper, without using the vocal cords, in a vehicle or other place where quiet is required. In this case, the conversation is carried by inaudible sound, so it can take place while privacy is protected and information leakage is prevented. Furthermore, because the speaker does not need to wear dedicated equipment in advance, the device can be used for a wide range of purposes.

[Example 1]
Next, the voice recognition device according to Example 1 will be described. FIG. 10 is a schematic configuration diagram of a conversation system using the voice recognition device according to Example 1. When the speakers (120, 220) are elderly people or the like, they may speak without using the vocal cords, so that neither can hear what the other says and they cannot communicate well. With the voice recognition devices (100, 200) according to Example 1, the content of what a speaker has said is displayed on the voice recognition device that the speaker carries and is shown to the other party, enabling them to communicate.

Here, the first speaker 120 carries the first voice recognition device 100, and the second speaker 220 carries the second voice recognition device 200. Information terminals such as tablet terminals can be used as the first voice recognition device 100 and the second voice recognition device 200. The first speaker 120 hangs the first voice recognition device 100 from the neck so that the display unit 113 faces the second speaker 220 and the imaging unit 101 can capture images of the lip region 220m of the second speaker 220. Similarly, the second speaker 220 hangs the second voice recognition device 200 from the neck so that the display unit 213 faces the first speaker 120 and the imaging unit 201 can capture images of the lip region 120m of the first speaker 120.

First, the procedure in which the first voice recognition device 100 analyzes the images and inaudible sound captured when the second speaker 220 speaks, and the analysis result is then displayed on the display unit 213 of the second voice recognition device 200, will be described. FIG. 11 is a block diagram of the voice recognition devices (100, 200) according to Example 1. In addition to the configuration of the voice recognition device 1001 shown in FIG. 1, the first voice recognition device 100 and the second voice recognition device 200 include receiving units (111, 211), transmitting units (112, 212), and display units (113, 213). The rest of the configuration is the same as that of the voice recognition device 1001 shown in FIG. 1.

The imaging unit 101 of the first voice recognition device 100 acquires images including the lip region 220m while the second speaker 220 says "What is today's meal?" without using the vocal cords. The inaudible sound detection unit 103 detects the inaudible sound that propagates through the air from the voice of the second speaker 220 during the utterance.

From the acquired images of the lip region 220m and the inaudible sound, the first voice recognition device 100 determines that the utterance content of the second speaker 220 is "What is today's meal?", and the output unit 109 outputs the result to the transmitting unit 112. The transmitting unit 112 transmits the information about the utterance content of the second speaker 220 to the receiving unit 211 of the second voice recognition device 200.

The receiving unit 211 of the second voice recognition device 200 receives the information about the utterance content and passes it to the display unit 213. Based on this information, the display unit 213 displays "What is today's meal?" on its screen. By looking at the display unit 213 of the second voice recognition device 200, the first speaker 120 recognizes that what the second speaker 220 said was "What is today's meal?". At this point, the display unit 213 may also indicate that the second speaker 220 has finished speaking for the moment and is waiting for a reply from the first speaker 120, which makes it easier for the first speaker 120 to time a reply.

Next, the procedure in which the second voice recognition device 200 analyzes the images and inaudible sound captured when the first speaker 120 speaks, and the analysis result is then displayed on the display unit 113 of the first voice recognition device 100, will be described. The imaging unit 201 of the second voice recognition device 200 acquires images including the lip region 120m while the first speaker 120 says "It's curry" without using the vocal cords. The inaudible sound detection unit 203 detects the inaudible sound that propagates through the air from the voice of the first speaker 120 during the utterance.

From the acquired images of the lip region 120m and the inaudible sound, the second voice recognition device 200 determines that the utterance content of the first speaker 120 is "It's curry", and the output unit 209 outputs the result to the transmitting unit 212. The transmitting unit 212 transmits the information about the utterance content of the first speaker 120 to the receiving unit 111 of the first voice recognition device 100.

The receiving unit 111 of the first voice recognition device 100 receives the information about the utterance content and passes it to the display unit 113. Based on this information, the display unit 113 displays "It's curry" on its screen. By looking at the display unit 113 of the first voice recognition device 100, the second speaker 220 recognizes that what the first speaker 120 said was "It's curry".

In this way, by using the voice recognition devices according to Example 1, the first speaker 120 and the second speaker 220 can each grasp what the other has said and can communicate even when they speak without using the vocal cords.

Next, as a modification of Example 1, a procedure will be described in which the first voice recognition device 100' acquires information about the images and inaudible sound captured when the second speaker 220 speaks, transmits the acquired information to the second voice recognition device 200', and the second voice recognition device 200' analyzes it and displays the analysis result. FIG. 12 shows a block diagram of the voice recognition devices according to the modification of Example 1.

The imaging unit 101 of the first voice recognition device 100' acquires images including the lip region 220m while the second speaker 220 says "What is today's meal?" without using the vocal cords. The inaudible sound detection unit 103 detects the inaudible sound that propagates through the air from the voice of the second speaker 220 during the utterance.

The image data acquired by the imaging unit 101 of the first voice recognition device 100' and the inaudible sound data acquired by the inaudible sound detection unit 103 are supplied to the transmitting unit 112, which transmits these data to the receiving unit 211 of the second voice recognition device 200'.

From the images of the lip region 220m and the inaudible sound received by the receiving unit 211, the second voice recognition device 200' determines that the utterance content of the second speaker 220 is "What is today's meal?" and displays the result on the display unit 213.

By looking at the display unit 213 of the second voice recognition device 200', the first speaker 120 recognizes that what the second speaker 220 said was "What is today's meal?". At this point, the display unit 213 may also indicate that the second speaker 220 has finished speaking for the moment and is waiting for a reply from the first speaker 120, which makes it easier for the first speaker 120 to time a reply.

Next, a procedure will be described in which the second voice recognition device 200' acquires what the first speaker 120 has said, transmits the acquired information to the first voice recognition device 100', and the first voice recognition device 100' analyzes and displays it.

The imaging unit 201 of the second voice recognition device 200' acquires images including the lip region 120m while the first speaker 120 says "It's curry" without using the vocal cords. The inaudible sound detection unit 203 detects the inaudible sound that propagates through the air from the voice of the first speaker 120 during the utterance.

The image data acquired by the imaging unit 201 of the second voice recognition device 200' and the inaudible sound data acquired by the inaudible sound detection unit 203 are supplied to the transmitting unit 212, which transmits these data to the receiving unit 111 of the first voice recognition device 100'.

From the images of the lip region 120m and the inaudible sound received by the receiving unit 111, the first voice recognition device 100' determines that the utterance content of the first speaker 120 is "It's curry" and displays the result on the display unit 113.

By looking at the display unit 113 of the first voice recognition device 100', the second speaker 220 recognizes that what the first speaker 120 said was "It's curry".

In this way, by using the voice recognition devices according to the modification of Example 1, the first speaker 120 and the second speaker 220 can each grasp what the other has said and can communicate even when they speak without using the vocal cords.
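
The difference between Example 1 and its modification lies in what is transmitted between the two devices: the determined utterance content in Example 1, and the unanalyzed image and inaudible sound data in the modification. A minimal sketch of the receiving side follows. The message classes, field names, and the `recognize` callback are hypothetical; the description does not specify a data format or transport.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class ResultMessage:
    """Example 1: the capturing device analyzes and sends the text."""
    text: str


@dataclass
class RawMessage:
    """Modification: the capturing device sends unanalyzed data."""
    frames: List[bytes] = field(default_factory=list)
    inaudible_samples: List[float] = field(default_factory=list)


def text_to_display(
    message: Union[ResultMessage, RawMessage],
    recognize: Callable[[List[bytes], List[float]], str],
) -> str:
    """Return the text the receiving device should show on its display."""
    if isinstance(message, ResultMessage):
        # Example 1: nothing left to analyze on the receiving side.
        return message.text
    # Modification: the receiving device performs the recognition itself.
    return recognize(message.frames, message.inaudible_samples)
```

In Example 1 the receiving device only needs a display, while in the modification it needs the full recognition pipeline; the sketch makes that trade-off visible in the `recognize` argument, which is used only for raw messages.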

[Example 2]
Next, the voice recognition device according to Example 2 will be described. FIG. 13 is a schematic configuration diagram of an interpreting device using the voice recognition device 1002 according to Example 2. A caregiver 30 tries to understand the emotions of an elderly person 40, the speaker, from what is said, but when the elderly person 40 speaks without using the vocal cords, the caregiver cannot hear what was said and may have difficulty understanding those emotions. The voice recognition device 1002 according to Example 2 displays on the display unit 13 what the elderly person 40 has uttered without using the vocal cords, thereby allowing the utterance content of the elderly person 40 to be recognized.

FIG. 14 is a block diagram of the voice recognition device 1002 according to Example 2. The voice recognition device 1002 according to Example 2 includes a display unit 13 in addition to the configuration of the voice recognition device 1001 shown in FIG. 1. The rest of the configuration is the same as that of the voice recognition device 1001.

The imaging unit 1 of the voice recognition device 1002 acquires images including the lip region 40m while the elderly person 40, the speaker, speaks without using the vocal cords. The inaudible sound detection unit 3 detects the inaudible sound that propagates through the air from the voice of the elderly person 40 during the utterance.

The voice recognition device 1002 determines the utterance content of the elderly person 40 based on the acquired image including the lip region 40m and on the detected inaudible sound, and the output unit 9 outputs data on the determined utterance content to the display unit 13. The display unit 13 displays the utterance content based on the data it receives.

According to the voice recognition device 1002 of the second embodiment, the utterance content can be displayed on the display unit 13 even when the elderly person 40 speaks without using the vocal cords, so the utterance content of the elderly person 40 can be recognized.

[Example 3]
Next, the voice recognition device according to the third embodiment will be described. FIG. 15 is a schematic configuration diagram of a voice device operation system using the voice recognition device 1003 according to the third embodiment. In recent years, devices have been developed that operate home appliances or control car navigation systems by voice. Such devices are controlled by recognizing the voice they receive. When an elderly person or other user tries to operate them by speaking without using the vocal cords, however, they cannot recognize speech produced in that way. The voice recognition device 1003 according to the third embodiment recognizes what an elderly person 50 utters without using the vocal cords, converts it into audible voice, and has a device 60, such as a home appliance, recognize that audible voice.

FIG. 16 is a block diagram of the voice recognition device 1003 according to the third embodiment. The voice recognition device 1003 according to the third embodiment includes a voice reproduction unit 14 in addition to the configuration of the voice recognition device 1001 shown in FIG. 1. The other components are the same as those of the voice recognition device 1001.

The imaging unit 1 of the voice recognition device 1003 acquires an image including the lip region 50m of the elderly person 50, who is the speaker, while the elderly person 50 is speaking without using the vocal cords. In addition, the inaudible sound detection unit 3 detects, from the sound produced when the elderly person 50 speaks, an inaudible sound propagating through the air.

The voice recognition device 1003 determines the utterance content of the elderly person 50 based on the acquired image including the lip region 50m and on the detected inaudible sound, and the output unit 9 outputs the utterance content to the voice reproduction unit 14. The voice reproduction unit 14 reproduces the utterance content of the elderly person 50 as an audible sound, and the device 60 recognizes the reproduced audible sound and executes a predetermined control.

According to the voice recognition device 1003 of the third embodiment, even when the elderly person 50 speaks without using the vocal cords, the utterance content can be converted into an audible sound and output from the voice reproduction unit 14, so the voice recognition accuracy of the device 60 can be improved.

1 Imaging unit
2 Lip movement locus detection unit
3 Inaudible sound detection unit
4 Frequency pattern extraction unit
5 Lip movement locus data storage unit
6 Inaudible sound pattern storage unit
7 Utterance candidate extraction unit
8 Utterance determination unit
9 Output unit
10 Face image recognition unit
13 Display unit
14 Voice reproduction unit

Claims

A voice recognition device comprising:
an imaging unit that acquires an image including a lip region of a speaker during an utterance motion;
a lip movement locus detection unit that detects a locus of the speaker's lip movement from the image;
an inaudible sound detection unit that detects an inaudible sound propagating through the air from the sound produced when the speaker speaks;
a frequency pattern extraction unit that analyzes frequency characteristics of the inaudible sound and extracts a frequency pattern;
a lip movement locus data storage unit that stores in advance a correspondence between lip movement loci and utterance contents;
an inaudible sound pattern storage unit that stores in advance a correspondence between frequency patterns of the inaudible sound and utterance contents;
an utterance candidate extraction unit that refers to the lip movement locus data storage unit and extracts utterance content candidates from the lip movement locus;
an utterance determination unit that, when the utterance candidate extraction unit extracts a plurality of utterance content candidates, refers to the inaudible sound pattern storage unit and determines a specific utterance content from among the plurality of utterance content candidates; and
an output unit that outputs information about the utterance content determined by the utterance determination unit.
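
The two-stage determination recited above can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the implementation disclosed in this application: the table contents, the keys `traj_A` and `pattern_1`, and the function names are all hypothetical, and a real device would match loci and frequency patterns by similarity rather than by exact key.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the lip movement locus data storage unit (5):
# one lip movement locus may correspond to several utterance contents.
LIP_LOCUS_TABLE: Dict[str, List[str]] = {
    "traj_A": ["na", "ta", "da"],
    "traj_B": ["shi", "chi"],
    "traj_C": ["ma"],
}

# Hypothetical stand-in for the inaudible sound pattern storage unit (6).
SOUND_PATTERN_TABLE: Dict[str, str] = {
    "pattern_1": "na",
    "pattern_2": "ta",
    "pattern_3": "da",
    "pattern_4": "shi",
    "pattern_5": "chi",
}


def extract_candidates(locus_id: str) -> List[str]:
    """Utterance candidate extraction: look up candidates for a locus."""
    return LIP_LOCUS_TABLE.get(locus_id, [])


def determine_utterance(candidates: List[str], pattern_id: str) -> Optional[str]:
    """Utterance determination: consult the sound pattern only when the
    lip movement locus alone leaves more than one candidate."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    matched = SOUND_PATTERN_TABLE.get(pattern_id)
    return matched if matched in candidates else None
```

In this sketch a single candidate is returned without consulting the inaudible sound pattern, which mirrors the condition that the utterance determination unit acts when a plurality of candidates is extracted.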

The voice recognition device according to claim 1, wherein the inaudible sound detection unit starts detecting the inaudible sound of the speaker, triggered by the start of the speaker's lip movement as detected by the lip movement locus detection unit.
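
One way to picture this trigger is as a gate on the incoming sound samples, sketched below in Python. The frame format (a lip movement flag paired with a sound sample) and the function name are assumptions made for illustration. The claim specifies only when detection starts, so this sketch simply keeps capturing until the input ends.

```python
from typing import Iterable, List, Tuple


def gate_on_lip_movement(frames: Iterable[Tuple[bool, float]]) -> List[float]:
    """Collect sound samples starting from the first frame in which
    lip movement is detected. Each frame is (lip_moving, sample)."""
    detecting = False
    captured: List[float] = []
    for lip_moving, sample in frames:
        if lip_moving and not detecting:
            detecting = True  # start of lip movement acts as the trigger
        if detecting:
            captured.append(sample)
    return captured
```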

The voice recognition device according to claim 1 or 2, wherein the inaudible sound detection unit detects sound waves of 20 kHz or more and 70 kHz or less as the inaudible sound.
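
As a rough illustration of this band limit, the Python sketch below keeps only the spectrum bins between 20 kHz and 70 kHz. The spectrum representation (a list of frequency and magnitude pairs) and the function names are assumptions. The sampling-rate helper reflects the general Nyquist criterion, under which the sampling rate must exceed twice the highest frequency of interest; it is not a figure taken from this application.

```python
from typing import List, Tuple

BAND_LOW_HZ = 20_000.0   # lower bound recited in the claim
BAND_HIGH_HZ = 70_000.0  # upper bound recited in the claim

# Assumed representation: (frequency in Hz, magnitude) per spectrum bin.
Spectrum = List[Tuple[float, float]]


def nyquist_bound_hz(highest_hz: float = BAND_HIGH_HZ) -> float:
    """Sampling rate that must be exceeded to capture highest_hz."""
    return 2.0 * highest_hz


def select_inaudible_band(spectrum: Spectrum) -> Spectrum:
    """Keep only bins inside the 20 kHz to 70 kHz band, bounds inclusive."""
    return [(f, m) for f, m in spectrum if BAND_LOW_HZ <= f <= BAND_HIGH_HZ]
```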

The voice recognition device according to any one of claims 1 to 3, wherein the utterance determination unit determines the utterance content based on the presence or absence of a peak in the frequency pattern and on the position of a peak occurring in a specific frequency band.
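
The idea of deciding by peak presence and peak position can be sketched in Python as follows. Everything specific here is hypothetical: the threshold, the band edges, and the rule table that ties "na", "ta", and "da" to particular peak positions are invented for illustration and are not values from this application.

```python
from typing import Dict, List, Optional, Tuple

# Assumed representation: (frequency in Hz, magnitude) per spectrum bin.
Spectrum = List[Tuple[float, float]]

# Hypothetical rules: None means "no peak expected"; a tuple gives the
# band (low, high) in Hz in which the peak is expected to fall.
PEAK_RULES: Dict[str, Optional[Tuple[float, float]]] = {
    "na": None,
    "ta": (30_000.0, 40_000.0),
    "da": (40_000.0, 50_000.0),
}


def find_peak(spectrum: Spectrum, band: Tuple[float, float],
              threshold: float) -> Optional[float]:
    """Return the frequency of the strongest bin inside band if its
    magnitude reaches threshold, otherwise None (no peak present)."""
    low, high = band
    in_band = [(f, m) for f, m in spectrum if low <= f <= high]
    if not in_band:
        return None
    freq, mag = max(in_band, key=lambda fm: fm[1])
    return freq if mag >= threshold else None


def decide_by_peak(candidates: List[str],
                   peak_hz: Optional[float]) -> Optional[str]:
    """Pick the candidate whose rule agrees with the observed peak."""
    for cand in candidates:
        if cand not in PEAK_RULES:
            continue
        expected = PEAK_RULES[cand]
        if expected is None:
            if peak_hz is None:
                return cand
        elif peak_hz is not None and expected[0] <= peak_hz < expected[1]:
            return cand
    return None
```

A caller would first obtain `peak_hz` from `find_peak` over the band of interest and then pass it, together with the candidates that share a lip movement locus, to `decide_by_peak`.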

The voice recognition device according to any one of claims 1 to 4, wherein the plurality of utterance contents having substantially the same lip movement trajectory include at least two of "na", "ta", and "da".

The voice recognition device according to any one of claims 1 to 4, wherein the plurality of utterance contents having substantially the same lip movement trajectory include "shi" and "chi".

The voice recognition device according to any one of claims 1 to 4, wherein the plurality of utterance contents having substantially the same lip movement trajectory include "a" and "ha".