JPH1091789A

JPH1091789A - Device for recognizing word

Info

Publication number: JPH1091789A
Application number: JP8262396A
Authority: JP
Inventors: Akihiro Okumura; 晃弘奥村
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1996-09-11
Filing date: 1996-09-11
Publication date: 1998-04-10

Abstract

PROBLEM TO BE SOLVED: To obtain a device that achieves sufficient word recognition even under high noise. SOLUTION: A lip tracking unit 102 tracks the movement of the upper lip and the lower lip in a face image obtained by an input unit 101. A preprocessing unit 103 converts this movement into a pattern based on the difference between the upper-lip and lower-lip positions, and applies linear interpolation and smoothing to the pattern. A matching execution unit 104 calculates the distance between pattern slopes, derived from the amount of change over an arbitrary interval, for the pattern processed by the preprocessing unit 103 and each dictionary pattern 108, and takes the word whose pattern has the shortest distance as the recognized word.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] Field of the Invention: The present invention relates to a word recognition device that recognizes words based on the movement of a speaker's lips.

[0002] Description of the Related Art: Speech recognition techniques are conventionally known in which a speaker's voice pattern is compared with dictionary patterns of words prepared in advance in order to recognize the word the speaker uttered.

[0003] Problems to Be Solved by the Invention: Conventional speech recognition can secure a sufficient recognition rate in a low-noise environment such as a laboratory, but cannot obtain sufficient recognition performance in a high-noise environment such as an actual street corner. In other words, even a system that is practical in a relatively quiet environment currently cannot deliver sufficient performance in an actual office or similar setting because of the surrounding noise.

[0004] For this reason, a word recognition device capable of sufficient word recognition even under high noise has been desired.

[0005] Means for Solving the Problems: The present invention adopts the following configurations to solve the above problems. <Configuration of Claim 1> A word recognition device comprising: an input unit that acquires a face image of a speaker; a lip tracking unit that tracks the movement of each of the upper lip and the lower lip in the face image acquired by the input unit; a preprocessing unit that converts the lip movement tracked by the lip tracking unit into a pattern for matching; a dictionary that stores a dictionary pattern corresponding to each word; and a matching execution unit that compares the pattern converted by the preprocessing unit with the dictionary patterns and recognizes the closest dictionary pattern as the word uttered by the speaker.

[0006] <Explanation of Claim 1> The invention of claim 1 tracks the movement of the speaker's lips and performs word recognition by comparing the movement pattern with dictionary patterns registered in advance. This yields a word recognition device that can be used even under high noise.

[0007] <Configuration of Claim 2> The word recognition device according to claim 1, wherein the preprocessing unit converts the movement into a pattern based on the difference between the Y coordinate of a point on the upper lip and the Y coordinate of a point on the lower lip.

[0008] <Explanation of Claim 2> The invention of claim 2 uses, in the invention of claim 1, the vertical movement of the lips as the pattern. This allows the content of the speaker's words to be extracted accurately and therefore improves recognition accuracy.

[0009] <Configuration of Claim 3> The word recognition device according to claim 1 or 2, wherein the preprocessing unit converts the pattern by linearly interpolating the coordinate values at arbitrary time intervals and by smoothing them.

[0010] <Explanation of Claim 3> In the invention of claim 3, the preprocessing unit linearly interpolates the coordinate values of the pattern and also performs smoothing. This improves the accuracy of the pattern approximation and the efficiency of matching, and consequently contributes to a higher recognition rate.

[0011] <Configuration of Claim 4> The word recognition device according to any one of claims 1 to 3, wherein the matching execution unit calculates a distance between pattern slopes based on the amount of change of the pattern over an arbitrary time interval, and recognizes the pattern with the smallest distance as the word uttered by the speaker.

[0012] <Explanation of Claim 4> Experiments confirmed that, when the mouth opening is compared across utterances of the same word, the waveforms have similar overall shapes, and that computing the distance between such patterns from the amount of change is effective. Specifically, when the change in mouth opening for utterances of the same word is plotted, the shapes of the peaks and valleys are highly similar apart from temporal shifts, but the peak heights and valley depths vary. DP matching on the waveform height therefore cannot accurately express the similarity between waveforms. If, however, the angle calculated from the amount of change is used in the matching distance calculation, this angle approximates the slope of the tangent and so approaches 0 near the apex of a peak or valley. DP matching with this angle therefore acts to align the temporal positions of the peak and valley apexes, and accurately expresses the similarity between waveforms.

[0013] The invention of claim 4 focuses on this point; it therefore improves matching accuracy and raises the recognition rate.

[0014] <Configuration of Claim 5> The word recognition device according to any one of claims 1 to 5, comprising: a dictionary having a reference representative pattern and distance statistical information indicating the mean and variance of the distance, at each time point, between the representative pattern and a plurality of patterns with the same utterance content; and a matching execution unit that performs the matching distance calculation based on a value obtained by converting the angle calculated from the amount of change into a deviation value using the distance statistical information.

[0015] <Explanation of Claim 5> Graphs of the change in mouth opening for utterances of the same word are highly similar in the shapes of their peaks and valleys. Depending on the utterance content, however, there may be portions where a peak or valley always appears regardless of how the word is spoken, and portions where a peak or valley may or may not appear depending on how it is spoken. This can be regarded as a partial (temporal) feature of the utterance content, and the distance statistical information expresses it: a portion where a peak or valley always appears has a small variance, whereas a portion where a peak or valley may or may not appear has a large variance. Using the deviation value obtained from this variance in the DP matching distance calculation reflects this partial feature in the result. The invention of claim 5 focuses on this point, and a high recognition rate can therefore be obtained regardless of the utterance content.

[0016] <Configuration of Claim 6> The word recognition device according to any one of claims 1 to 5, comprising: a dictionary having height statistical information representing the distance and variance, at each time point, between a reference representative pattern of mouth opening and a plurality of patterns of mouth opening; and a matching execution unit that converts the mouth opening at each apex of the waveform in the recognition pattern into a deviation value using the height statistical information and determines the word using the converted value.

[0017] <Explanation of Claim 6> The invention of claim 6 uses the mouth opening itself. For example, most portions where a vowel is pronounced appear as apexes (peaks or valleys) in the pattern, but the mouth opening varies considerably when vowels such as "a" and "e" are pronounced. Roughly speaking, however, the relationship "a" > "e" > "i" > "o" > "u" holds. That is, "a" < "e" may occasionally occur, but "a" < "i" practically never does. When the same word is pronounced, the range an apex can take is thus limited to some extent. By expressing this range with statistical information and using a deviation value to express how well an apex falls within it, the similarity between patterns can be expressed. This statistical information is the height statistical information.

[0018] Because the range of each apex is expressed with this statistical information and a deviation value expresses how well the apex falls within that range, the similarity between patterns can be expressed accurately.

[0019] Embodiments of the Invention: Embodiments of the present invention are described below in detail with reference to the drawings. <<Specific Example 1>> <Configuration> Fig. 1 is a configuration diagram showing Specific Example 1 of the word recognition device of the present invention. The device comprises an input unit 101, a lip tracking unit 102, a preprocessing unit 103, a matching execution unit 104, a dictionary 105, and an output unit 106.

[0020] The input unit 101 acquires a face image of the speaker and consists of an imaging device such as a television camera. The lip tracking unit 102 is a functional block that tracks the upper and lower lips in the face image supplied by the input unit 101. The preprocessing unit 103 is a functional block that converts the lip positions obtained by the lip tracking unit 102 into a pattern that the matching execution unit 104 can process. The matching execution unit 104 is a functional block that matches the pattern obtained from the preprocessing unit 103 against the patterns in the dictionary 105 and recognizes the word. Specifically, the matching execution unit 104 has a distance calculation unit 107, which recognizes the word by calculating the distance between each dictionary pattern 108 stored in the dictionary 105 and the pattern obtained by the preprocessing unit 103. The lip tracking unit 102 through the matching execution unit 104 are implemented functionally by hardware, or by a microcomputer's processor and memory together with software executed by the processor.

[0021] The dictionary 105 holds the dictionary patterns 108, which are the patterns of the words to be recognized, and the output unit 106 is an output device that outputs the word recognized by the matching execution unit 104.

[0022] <Operation> First, the face image captured by the input unit 101 is sent to the lip tracking unit 102. From this image, the lip tracking unit 102 outputs the change over time of the coordinates of a point on the upper lip and of a point on the lower lip. The lip tracking unit 102 performs, for example, the lip tracking process described in the earlier-filed Japanese Patent Application No. 7-349151, "Lip Tracking Device".

[0023] Fig. 2 shows an example of the output of the lip tracking unit 102. In this figure the horizontal axis is time (in frames) and the vertical axis is the Y coordinate in the face image; the two curves represent the movements of the upper lip and the lower lip, respectively.

[0024] The output of the lip tracking unit 102 is then sent to the preprocessing unit 103. The preprocessing unit 103 converts it into the degree of mouth opening by calculating the difference between the Y coordinate of the upper lip and the Y coordinate of the lower lip. Fig. 3 is a waveform diagram after conversion to mouth opening.

[0025] Next, the magnification of the face image is normalized. This can be done, for example, by using the distance between the eyes as a reference. Then n points are linearly interpolated between each pair of adjacent points. Fig. 4 is a waveform diagram after this linear interpolation; the example shows the waveform of Fig. 3 with four points interpolated between adjacent points. A larger n improves recognition performance but lengthens the computation time.

[0026] Finally, smoothing is performed so that these points are connected smoothly. Fig. 5 is a waveform diagram after smoothing. Smoothing can be realized, for example, by using a low-pass filter.
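The preprocessing steps in paragraphs [0024] to [0026] can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, the sign convention (lower-lip Y minus upper-lip Y, with image Y increasing downward) is an assumption, and a centered moving average stands in for the unspecified low-pass filter.

```python
def mouth_opening(upper_y, lower_y):
    """Per-frame mouth opening as the difference of lip Y coordinates.

    Assumes image Y grows downward, so lower minus upper is non-negative.
    """
    return [lo - up for up, lo in zip(upper_y, lower_y)]


def normalize_scale(opening, eye_distance):
    """Normalize image magnification using the eye distance as reference."""
    return [v / eye_distance for v in opening]


def interpolate(samples, n):
    """Insert n linearly interpolated points between each adjacent pair."""
    if not samples:
        return []
    out = []
    for a, b in zip(samples, samples[1:]):
        out.append(a)
        for k in range(1, n + 1):
            out.append(a + (b - a) * k / (n + 1))
    out.append(samples[-1])
    return out


def smooth(samples, window=5):
    """Centered moving average (a simple low-pass filter, assumed here).

    The window shrinks near the edges so the output keeps the input length.
    """
    half = window // 2
    out = []
    for i in range(len(samples)):
        segment = samples[max(0, i - half): i + half + 1]
        out.append(sum(segment) / len(segment))
    return out


def preprocess(upper_y, lower_y, eye_distance, n=4, window=5):
    """Difference, scale normalization, interpolation, then smoothing."""
    opening = mouth_opening(upper_y, lower_y)
    scaled = normalize_scale(opening, eye_distance)
    return smooth(interpolate(scaled, n), window)
```

With n = 4, as in the Fig. 4 example, a sequence of k tracked frames becomes 5(k - 1) + 1 samples before smoothing.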

[0027] The waveform preprocessed by the preprocessing unit 103 is sent to the matching execution unit 104. The matching execution unit 104 treats the received waveform as the recognition pattern and matches it against the pattern of each word recorded in the dictionary 105. The word whose pattern has the smallest distance in the matching is then output to the output unit 106 as the recognition result.

[0028] Experiments confirmed that, when the mouth opening is compared across utterances of the same word, the waveforms have similar overall shapes, and that computing the distance between such patterns from the amount of change is effective. Specifically, when the change in mouth opening for utterances of the same word is plotted, the shapes of the peaks and valleys are highly similar apart from temporal shifts, but the peak heights and valley depths vary. DP matching on the waveform height therefore cannot accurately express the similarity between waveforms. If, however, the angle calculated from the amount of change is used in the matching distance calculation, this angle approximates the slope of the tangent and so approaches 0 near the apex of a peak or valley. DP matching with this angle therefore acts to align the temporal positions of the peak and valley apexes, and accurately expresses the similarity between waveforms.

[0029] For example, let $\delta_a$ be the amount of change of the dictionary pattern 108 and $\delta_b$ the amount of change of the pattern to be recognized. If $(\tan^{-1}(\delta_a / Z) - \tan^{-1}(\delta_b / Z))^2$ is used as the distance for each portion, the distance between patterns becomes small when the same word is uttered and large when different words are uttered, which gives excellent recognition performance. The distance calculation unit 107 is therefore configured to evaluate this expression. Here $Z$ is a coefficient that varies with the speaker, the shooting environment, and so on.
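The per-portion distance of paragraph [0029] can be written directly in code. In this sketch the names `change_amounts` and `angle_distance` are hypothetical, and taking the amount of change as the difference between consecutive samples is an assumption; the text only requires a change over an arbitrary time interval.

```python
import math


def change_amounts(pattern):
    """Amount of change between consecutive samples (assumed step of one)."""
    return [b - a for a, b in zip(pattern, pattern[1:])]


def angle_distance(delta_a, delta_b, z=1.0):
    """Squared difference of the slope angles atan(delta / Z).

    z corresponds to the coefficient Z, which depends on the speaker
    and the shooting environment.
    """
    return (math.atan(delta_a / z) - math.atan(delta_b / z)) ** 2
```

Because only differences enter the calculation, a constant offset in waveform height does not affect the distance, which matches the observation that peak heights vary between utterances while the shapes stay similar.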

[0030] Because the amount of change is used, ordinary DP matching may be used for the pattern matching in the matching execution unit 104, but the technique disclosed in the concurrently filed application "Pattern Matching Processing Method and Apparatus" is even more effective.
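As an illustration of ordinary DP matching with the angle-based distance, the sketch below uses a textbook symmetric recurrence with fixed start and end points. It is an assumed stand-in: it is neither the asymmetric form cited later in the description nor the technique of the concurrently filed application, and all names are hypothetical.

```python
import math


def angle_distance(delta_a, delta_b, z=1.0):
    """Squared difference of slope angles, as in paragraph [0029]."""
    return (math.atan(delta_a / z) - math.atan(delta_b / z)) ** 2


def dp_distance(a, b, z=1.0):
    """Accumulated DP distance between two sequences of change amounts.

    Endpoints are fixed: a[0] pairs with b[0] and a[-1] with b[-1].
    """
    if not a or not b:
        raise ValueError("patterns must be non-empty")
    inf = float("inf")
    cost = [[inf] * len(b) for _ in a]
    for i in range(len(a)):
        for j in range(len(b)):
            local = angle_distance(a[i], b[j], z)
            if i == 0 and j == 0:
                cost[i][j] = local
                continue
            best = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = local + best
    return cost[-1][-1]


def recognize(pattern, dictionary, z=1.0):
    """Return the word whose dictionary pattern is nearest to the input."""
    return min(dictionary, key=lambda w: dp_distance(dictionary[w], pattern, z))
```

Here `dictionary` maps each word to its sequence of change amounts, so `recognize` mirrors the step of outputting the word with the smallest distance.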

[0031] In Specific Example 1, the dictionary pattern 108 is created by taking the output of the preprocessing unit 103 for an utterance whose content is known and extracting only the uttered portion. If the dictionary patterns are acquired in a noise-free environment, the uttered portion can be extracted easily from the presence or absence of sound.

[0032] <Effect> As described above, in Specific Example 1 the matching distance is calculated using the angle derived from the amount of change, so the similarity between waveforms can be expressed accurately and the recognition rate can therefore be improved.

[0033] In addition, because linear interpolation and smoothing are performed as preprocessing, the accuracy of the approximation improves, and because the angle changes smoothly, the efficiency of DP matching improves, which contributes further to a higher recognition rate.

[0034] <<Specific Example 2>> The matching of Specific Example 1 used only the dictionary patterns. Specific Example 2 further improves the recognition rate by adding distance statistical information to the dictionary.

[0035] As described in Specific Example 1, graphs of the change in mouth opening for utterances of the same word are highly similar in the shapes of their peaks and valleys. Depending on the utterance content, however, there may be portions where a peak or valley always appears regardless of how the word is spoken, and portions where a peak or valley may or may not appear depending on how it is spoken. This can be regarded as a partial (temporal) feature of the utterance content, and the distance statistical information expresses it: a portion where a peak or valley always appears has a small variance, whereas a portion where a peak or valley may or may not appear has a large variance. Using the deviation value obtained from this variance in the DP matching distance calculation reflects this partial feature in the result. Specific Example 2 focuses on this point.

[0036] <Configuration> Fig. 6 is a configuration diagram of Specific Example 2. The device comprises the input unit 101 through the output unit 106 and is characterized in that the dictionary 105 holds distance statistical information 109 in addition to the dictionary pattern 108, and in that the distance calculation unit 107 of the matching execution unit 104 includes a deviation value calculation unit 110. Components equivalent to those of Specific Example 1 carry the same reference numerals, and their description is omitted.

[0037] The distance statistical information 109 is the statistical information mentioned above and is described in detail later. The deviation value calculation unit 110 is a calculation unit inside the distance calculation unit 107 and, in this example, is used during the recognition operation.

[0038] First, the method of creating the dictionary 105 is described. In this example, a plurality of patterns with the same utterance content are prepared first. As when creating the dictionary in Specific Example 1, these patterns are taken from the output of the preprocessing unit 103. Next, the most standard of these patterns is selected as the single representative pattern. It can be chosen, for example, by matching all the patterns against one another and comparing the resulting inter-pattern distances.
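One way to pick the most standard pattern is to choose the pattern whose total distance to all the others is smallest (the medoid). The text only says that inter-pattern distances are compared, so this criterion is an assumption, and `choose_representative` is a hypothetical name; `distance` can be any pattern-matching distance, such as a DP matching distance.

```python
def choose_representative(patterns, distance):
    """Return the index of the pattern with the smallest summed distance
    to every other pattern (medoid criterion, assumed here)."""
    if not patterns:
        raise ValueError("at least one pattern is required")
    totals = [
        sum(distance(p, q) for j, q in enumerate(patterns) if j != i)
        for i, p in enumerate(patterns)
    ]
    return min(range(len(patterns)), key=totals.__getitem__)
```

Every pair is matched, so the cost grows quadratically with the number of training utterances, which is acceptable for offline dictionary creation.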

[0039] Once the representative pattern is determined, it is matched in turn against every other pattern using the same method as in Specific Example 1 (a pattern matched against the representative pattern is hereinafter called a comparison pattern). Matching stretches or compresses the comparison pattern so that it comes closest to the representative pattern, which amounts to normalizing the time axis. After all matching is finished, the distances between each pattern and the representative pattern are therefore aggregated at each time point, and their mean and variance are calculated. The dictionary 105 is built with these values as the distance statistical information 109 and the representative pattern as the dictionary pattern 108. All matching performed during dictionary creation is carried out with the start and end points fixed.

[0040] Next, the distance statistical information 109 is described in more detail. Its derivation is explained for the case where DP matching is performed in the asymmetric form described on p. 19 of Seiichi Nakagawa, "Speech Recognition by Probabilistic Model", Institute of Electronics, Information and Communication Engineers.

[0041] Let the representative pattern be $T$ and the comparison patterns be $C_1, C_2, \ldots, C_n, \ldots, C_N$. Since each pattern is a time series of mouth openings,

$$
\begin{aligned}
T &= a_1, a_2, \ldots, a_i, \ldots, a_I \\
C_1 &= b_{11}, b_{12}, \ldots, b_{1i}, \ldots, b_{1J_1} \\
C_2 &= b_{21}, b_{22}, \ldots, b_{2i}, \ldots, b_{2J_2} \\
&\;\;\vdots \\
C_n &= b_{n1}, b_{n2}, \ldots, b_{ni}, \ldots, b_{nJ_n} \\
&\;\;\vdots \\
C_N &= b_{N1}, b_{N2}, \ldots, b_{Ni}, \ldots, b_{NJ_N}
\end{aligned}
$$

Here $a_i$ is the $i$-th datum (mouth opening) of $T$. Similarly, $b_{ni}$ is the $i$-th datum of $C_n$, and $b_{NJ_N}$ is the $J_N$-th datum of $C_N$.

[0042] Let the time conversion function $F_n$ for the best match between $T$ and $C_n$ be $F_n = c_n(1), c_n(2), \ldots, c_n(i), \ldots, c_n(I)$. This means that matching has paired the $i$-th datum of pattern $T$ with the $c_n(i)$-th datum of pattern $C_n$; that is, $a_i$ corresponds to $b_{n c_n(i)}$. Since the start and end points are fixed, $c_n(1) = 1$ and $c_n(I) = J_n$ always hold.

[0043] The distance $D(T, C_n)$ between the dictionary pattern $T$ and the comparison pattern $C_n$ is then obtained as follows. Fig. 7 is an explanatory diagram of the calculations for the distance statistical information. The distance $D(T, C_n)$ is expressed by equation (1), which expands to equation (2). Accordingly, letting $\mu_i$ be the mean and $\sigma_i$ the variance of the distance at each time point, $\mu_i$ is given by equation (3) and $\sigma_i$ by equation (4), and the distance statistical information 109 consists of $(\mu_1, \mu_2, \ldots, \mu_I)$ and $(\sigma_1, \sigma_2, \ldots, \sigma_I)$.
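Equations (1) to (4) appear only in Fig. 7 and are not reproduced in the text, so the following is a sketch of one plausible reading rather than the patent's exact formulas. It assumes the per-time distance is the angle-based distance of Specific Example 1 between the change amounts at position i of the representative pattern and at the aligned position of each comparison pattern, and it uses the population variance. Indices are 0-based here, whereas the text counts from 1; all names are hypothetical.

```python
import math


def aligned_distances(rep, comp, warp, z=1.0):
    """Per-time distances between rep[i] and comp[warp[i]].

    rep and comp hold change amounts; warp[i] is the 0-based index in comp
    that matching paired with position i of rep.
    """
    return [
        (math.atan(rep[i] / z) - math.atan(comp[warp[i]] / z)) ** 2
        for i in range(len(rep))
    ]


def distance_statistics(per_pattern):
    """Mean and variance of the distance at each time point.

    per_pattern[n][i] is the distance of comparison pattern n at time i.
    Population variance is assumed.
    """
    count = len(per_pattern)
    length = len(per_pattern[0])
    mu = [sum(row[i] for row in per_pattern) / count for i in range(length)]
    var = [
        sum((row[i] - mu[i]) ** 2 for row in per_pattern) / count
        for i in range(length)
    ]
    return mu, var
```

Time points where every utterance produces the same peak or valley yield a small variance, and time points where the shape is optional yield a large one, as described for the distance statistical information.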

[0044] Next, the recognition method used by the matching execution unit 104 with this dictionary 105 is described. The method is basically the same as in Specific Example 1, but the distance calculation used during matching differs.

[0045] First, as in Specific Example 1, the distance calculation unit 107 calculates the distance for each portion. Next, a deviation value is calculated using the distance statistical information 109 in the dictionary 105, and this value is used as the distance. That is, letting $d(c(i))$ be the distance obtained as in Specific Example 1, the distance $dl_i$ actually used is given by equation (5).

[0046] Fig. 8 is an explanatory diagram of the calculation of the distance $dl_i$ actually used. Since a matching algorithm generally does not depend on how the distance is calculated, this approach can be applied whatever matching algorithm the matching execution unit 104 uses.
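Equation (5) is given only in Fig. 8 and is not reproduced in the text. The sketch below therefore shows just one plausible standardization, under stated assumptions: the raw distance is centered on the mean for its time point and divided by the standard deviation derived from the variance. The function name, the small variance floor, and the absence of any further scaling are all assumptions, not the patent's formula.

```python
import math


def deviation_value(distance, mean, variance, floor=1e-9):
    """Standardize a raw per-time distance with the stored statistics.

    Assumed form: (distance - mean) / sqrt(variance). The floor avoids
    division by zero where the training distances were all identical.
    """
    std = math.sqrt(max(variance, floor))
    return (distance - mean) / std
```

Under this reading, the same raw deviation counts for more at time points where the training utterances agreed closely (small variance) than at points where they varied widely.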

[0047] <Effect> As described above, Specific Example 2 uses the distance statistical information in addition to the configuration of Specific Example 1, so the partial features of each pattern are reflected in the calculation result and the recognition rate can be improved further.

【００４８】<< Specific Example 3 >> Specific examples 1 and 2 use only the amount of change in the degree of mouth opening, whereas specific example 3 is configured to use the degree of mouth opening itself as well.

【００４９】For example, most portions in which a vowel is pronounced appear as vertices, but when vowels such as "a" and "e" are pronounced, the degree of mouth opening varies considerably. Broadly, however, the relationship "a" > "e" > "i" > "o" > "u" holds. That is, "a" < "e" may occasionally occur, but "a" < "i" is practically impossible. Thus, when the same word is pronounced, the range that each vertex can take is limited to some extent. Accordingly, by expressing this range with statistical information and using a deviation value to express how well a value falls within it, the similarity between patterns can be expressed.

【００５０】<Structure> FIG. 9 is a configuration diagram of specific example 3. The illustrated device includes the input unit 101 through the output unit 106. The dictionary 105 has height statistical information 111 in addition to the distance statistical information 109 of specific example 2, and the device further includes a height score calculation unit 112 and a total score calculation unit 113.

【００５１】The height statistical information 111 is statistical data that forms part of the dictionary 105; how it is created is described later. The height score calculation unit 112 is a unit that scores the degree of mouth opening using the result of the matching execution unit 104 and the height statistical information 111. The total score calculation unit 113 is a unit that calculates a total score from the distance obtained by the matching execution unit 104 and the score obtained by the height score calculation unit 112.

【００５２】First, the method of creating the dictionary 105 will be described. In specific example 2, the representative pattern and each comparison pattern were matched to normalize the time axis, and the mean and variance, at each time, of the distances between all the patterns and the representative pattern were taken as the distance statistical information 109. In the same way, the mean and variance of the degree of mouth opening at each time are obtained and taken as the height statistical information 111. The dictionary 105 is then constructed by adding to this height statistical information 111 the dictionary pattern 108 and the distance statistical information 109 obtained by the same method as in specific example 2.

【００５３】Next, how the height statistical information 111 is obtained will be described, taking as an example the case where the asymmetric matching described above is performed.

【００５４】Let the representative pattern be T and the comparison patterns be C_1, C_2, ..., C_n, ..., C_N. Since each pattern is a time series of the degree of mouth opening,

T = a_1, a_2, ..., a_i, ..., a_I
C_1 = b_11, b_12, ..., b_1i, ..., b_1J1
C_2 = b_21, b_22, ..., b_2i, ..., b_2J2
...
C_n = b_n1, b_n2, ..., b_ni, ..., b_nJn
...
C_N = b_N1, b_N2, ..., b_Ni, ..., b_NJN

Here, a_i denotes the i-th data (degree of mouth opening) of T. Similarly, b_ni denotes the i-th data of C_n, and b_NJN denotes the JN-th data of C_N.

【００５５】Now, let the time conversion function F_n at the best match between T and C_n be F_n = c_n(1), c_n(2), ..., c_n(i), ..., c_n(I). This indicates that, as a result of the matching, the i-th data of pattern T corresponds to the c_n(i)-th data of pattern C_n. That is, a_i corresponds to b_n,c_n(i). Since the start and end points are fixed, c_n(1) = 1 and c_n(I) = J_n always hold.

【００５６】Accordingly, letting m_i be the mean and s_i the variance of the degree of mouth opening of C_1, ..., C_N corresponding to a_i in the representative pattern T, these can be expressed by equation 6. FIG. 10 is an explanatory diagram of the calculation of equation 6. The height statistical information 111 is therefore (m_1, m_2, ..., m_I) and (s_1, s_2, ..., s_I).
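The per-time mean and variance described above can be sketched in Python as follows. Equation 6 is shown only in FIG. 10, so the normalization is an assumption: the population variance (division by N) is used here. The function name `height_statistics` and the list-based data layout are hypothetical. Indices in the time conversion functions are 1-based, following the text.

```python
def height_statistics(comparisons, warps):
    """Return (m, s): per-time mean and variance of the mouth opening.

    comparisons[n] holds C_n as a list of openings b_n1 .. b_nJn.
    warps[n] holds F_n as a list c_n(1) .. c_n(I) of 1-based indices
    into C_n, i.e. a_i of the representative pattern T is matched to
    b_{n, c_n(i)}.
    Assumption: population variance (divide by N), since equation 6
    is only available in FIG. 10.
    """
    if not comparisons or len(comparisons) != len(warps):
        raise ValueError("need one time conversion function per pattern")
    length = len(warps[0])
    if any(len(w) != length for w in warps):
        raise ValueError("all time conversion functions must have length I")

    n_patterns = len(comparisons)
    means, variances = [], []
    for i in range(length):
        # Mouth opening of every C_n at the point matched to a_i.
        values = [c[w[i] - 1] for c, w in zip(comparisons, warps)]
        m = sum(values) / n_patterns
        s = sum((v - m) ** 2 for v in values) / n_patterns
        means.append(m)
        variances.append(s)
    return means, variances
```

For instance, `height_statistics([[1.0, 3.0, 5.0], [2.0, 4.0]], [[1, 3], [1, 2]])` pairs the first and last samples of each comparison pattern with a_1 and a_2 of a two-sample representative pattern.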

【００５７】Next, the recognition method using this dictionary 105 will be described. It is basically the same as in specific example 2. The difference is that, in addition to the distance obtained by the same method as in specific example 2, a score for the degree of mouth opening is obtained, and the recognition result is determined after a total score has been calculated from these values.

【００５８】First, processing proceeds in the same way as in specific example 2 until the matching execution unit 104 has obtained the distance between the recognition pattern and the dictionary pattern. Next, the height score calculation unit 112 determines, from the time conversion function F in the matching execution unit 104, which point in the recognition pattern was associated with each vertex of the dictionary pattern as a result of the matching. Then, using the mean and variance of the degree of mouth opening held in the height statistical information 111 for the corresponding vertex of the same dictionary pattern, the deviation value of that degree of mouth opening is obtained. This calculation is performed for all vertices in the dictionary pattern, and the average is taken as the height score. Here, a vertex is a peak or a valley bottom in the waveform obtained when the pattern is plotted with time on the horizontal axis and the degree of mouth opening on the vertical axis.

【００５９】FIG. 11 is an explanatory diagram of the calculation of the height score. Expressed mathematically as illustrated: letting the data numbers at which vertices exist in T be p(1), p(2), ..., p(P), and the recognition pattern R be R = e_1, e_2, ..., e_l, ..., e_L, the height score S can be expressed by the illustrated equation (7).
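The height score calculation can be sketched in Python as follows. Equation (7) is shown only in FIG. 11, so the form of the deviation value is an assumption: the absolute standardized deviation |e - m| / sqrt(s) is used here so that a smaller score means a better fit, which matches the later selection of the smallest total score. The function name `height_score` and the `eps` guard are hypothetical.

```python
import math


def height_score(recognition, warp, vertices, means, variances, eps=1e-12):
    """Average deviation of the mouth opening at the dictionary vertices.

    recognition: R = e_1 .. e_L (degree of mouth opening).
    warp: time conversion function F = c(1) .. c(I), 1-based indices
          into R for each data number of the representative pattern T.
    vertices: p(1) .. p(P), 1-based data numbers of the vertices in T.
    means, variances: m_1 .. m_I and s_1 .. s_I from the height
          statistical information 111.
    Assumption: deviation = |e - m| / sqrt(s); equation (7) itself is
    only available in FIG. 11.
    """
    if not vertices:
        raise ValueError("at least one vertex is required")
    total = 0.0
    for p in vertices:
        # Point of R matched to the vertex at data number p of T.
        e = recognition[warp[p - 1] - 1]
        total += abs(e - means[p - 1]) / math.sqrt(max(variances[p - 1], eps))
    return total / len(vertices)
```

Only the matched points at vertices contribute, which reflects the text's observation that the range a vertex can take is limited for a given word.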

【００６０】Next, the total score calculation unit 113 obtains a total score V from the distance obtained by the matching execution unit 104 and the height score obtained by the height score calculation unit 112. It is given by the following equation:

V = D(T, R) + S·K

where K is a coefficient that varies with the speaker, the imaging environment, and so on.

【００６１】In this way, the total score is calculated using the pattern of each word recorded in the dictionary 105 and the recognition pattern. The word with the smallest total score is then output to the output unit 106 as the recognition result.
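The total score and the final selection can be sketched in Python as follows. The formula V = D(T, R) + S·K and the choice of the smallest V come from the text. The function names, the dictionary layout mapping each word to its (D, S) pair, and the example values are hypothetical.

```python
def total_score(distance, height, k):
    """Total score V = D(T, R) + S * K from the description."""
    return distance + height * k


def recognize(candidates, k):
    """Return the word whose total score V is smallest.

    candidates maps each dictionary word to a pair (D, S): the matching
    distance and the height score against the recognition pattern.
    k is the coefficient K, which the text says depends on the speaker
    and the imaging environment.
    """
    if not candidates:
        raise ValueError("no candidate words")
    return min(candidates, key=lambda word: total_score(*candidates[word], k))
```

With K = 0 the selection reduces to the distance-only decision of specific example 2, so K controls how strongly the degree of mouth opening influences the result.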

【００６２】<Effects> As described above, in specific example 3, DP matching is first performed using angles to remove temporal misalignment, and only then are the degrees of mouth opening compared, so what is being compared is clearly defined.

【００６３】In addition, the range of each vertex is expressed with statistical information, and a deviation value expresses how well a value falls within this range, so the similarity between patterns can be expressed.

【００６４】Furthermore, by using a total score calculated from both the distance between patterns obtained by DP matching using angles and the score obtained from the degree of mouth opening, the similarity between patterns can be evaluated comprehensively from two viewpoints, the overall shape of the waveform and the position of each vertex, which makes it possible to improve the recognition rate further.

[Brief description of the drawings]

FIG. 1 is a configuration diagram of specific example 1 of the word recognition device of the present invention.

FIG. 2 is a diagram showing an output example of the lip tracking unit in the word recognition device of the present invention.

FIG. 3 is a waveform diagram after conversion to the degree of mouth opening by the preprocessing unit in the word recognition device of the present invention.

FIG. 4 is a waveform diagram after linear interpolation of the four points by the preprocessing unit in the word recognition device of the present invention.

FIG. 5 is a waveform diagram after smoothing by the preprocessing unit in the word recognition device of the present invention.

FIG. 6 is a configuration diagram of specific example 2 of the word recognition device of the present invention.

FIG. 7 is an explanatory diagram of the calculation of the distance statistical information in specific example 2 of the word recognition device of the present invention.

FIG. 8 is an explanatory diagram of the calculation of the distance actually used in specific example 2 of the word recognition device of the present invention.

FIG. 9 is a configuration diagram of specific example 3 of the word recognition device of the present invention.

FIG. 10 is an explanatory diagram of the calculation of the mean and variance of the degree of mouth opening in specific example 3 of the word recognition device of the present invention.

FIG. 11 is an explanatory diagram of the calculation of the height score in specific example 3 of the word recognition device of the present invention.

[Explanation of symbols]

101 input unit
102 lip tracking unit
103 preprocessing unit
104 matching execution unit
105 dictionary
106 output unit
107 distance calculation unit
108 dictionary pattern
109 distance statistical information
110 deviation value calculation unit
111 height statistical information
112 height score calculation unit
113 total score calculation unit

Claims

[Claims]

1. A word recognition device comprising: an input unit for obtaining a face image of a speaker; a lip tracking unit for tracking the movement of each of an upper lip and a lower lip from the face image obtained by the input unit; a preprocessing unit that converts the movement of the lips into a pattern for performing matching; a dictionary that stores a dictionary pattern corresponding to each word; and a matching execution unit that compares the pattern converted by the preprocessing unit with the dictionary patterns and recognizes a close dictionary pattern as the word spoken by the speaker.

2. The word recognition device according to claim 1, wherein the preprocessing unit converts the lip movement into a pattern based on the difference between the Y coordinate of a point located on the upper lip and the Y coordinate of a point located on the lower lip.

3. The word recognition device according to claim 1, wherein the preprocessing unit performs the conversion into a pattern by applying linear interpolation and smoothing to the coordinate values at arbitrary time intervals.

4. The word recognition device according to claim 1, wherein the matching execution unit calculates the distance between pattern slopes based on the amount of change over an arbitrary portion of the pattern, and recognizes the pattern with the smallest distance as the word spoken by the speaker.

5. The word recognition device according to claim 1, comprising: a dictionary having, as distance statistical information, the mean and variance of the distance at each time between a reference representative pattern and a plurality of patterns having the same utterance content as the representative pattern; and a matching execution unit that calculates the distance for matching based on a value obtained by converting the angle calculated from the amount of change into a deviation value using the distance statistical information.

6. The word recognition device according to claim 1, comprising: a dictionary having height statistical information representing the mean and variance at each time between a reference representative pattern of the degree of mouth opening and a plurality of patterns of the degree of mouth opening; and a matching execution unit that converts the height at each vertex of the waveform in the recognition pattern into a deviation value using the height statistical information and determines the word using the converted value.