JP2000206986A

JP2000206986A - Language information detector

Info

Publication number: JP2000206986A
Application number: JP11007262A
Authority: JP
Inventors: Kazumasa Murai; 和昌村井; Masaaki Harada; 正明原田; Shin Takeuchi; 伸竹内
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1999-01-14
Filing date: 1999-01-14
Publication date: 2000-07-28

Abstract

PROBLEM TO BE SOLVED: To input language information such as speech without necessarily depending on utterance, based on the correspondence between utterances and the motion of the articulatory organs. SOLUTION: Voice information input from a microphone 100 and position information of the articulatory organs input by a three-dimensional digitizer section 110 are processed as appropriate, and the results are recorded in a table section 120. Next, the motion of the articulatory organs input by the section 110 is input as a time pattern of the position of each measured site, and the table section 120 is searched for the recorded articulatory motion information with the highest similarity. The sound information associated with the input articulatory motion is then estimated. In this way, the voice information being uttered is obtained from the image of the articulatory organs.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to an apparatus for inputting or recognizing linguistic information such as speech and articulator movement. More specifically, it relates to an apparatus that records, at any time during normal use, the correspondence between information such as speech and the movement of the articulators, and that, based on this correspondence, inputs or recognizes linguistic information such as speech from at least one of the utterance and the articulator movement.

[0002]

[Prior Art] Microphones are widely used as devices for inputting linguistic information such as speech as acoustic information. Methods for recognizing linguistic information such as speech as character information, or as document information consisting of character strings, are widely known as speech recognition, and some have been put to practical use. In voice input and speech recognition, the speech uttered by a speaker is input as acoustic information through a microphone or the like. The speaker therefore necessarily has to speak aloud. Moreover, the acoustic information picked up by a microphone is output as an audio signal regardless of whether it originates from the utterance. In noisy conditions, the noise is thus output as part of the audio signal along with the desired voice information. To address this, devices that use multiple microphones and signal processing to make ambient noise less likely to appear in the audio signal have been proposed in Japanese Patent Application Laid-Open Nos. Hei 5-22392 and Hei 2-072398.

[0003] Japanese Patent Application Laid-Open No. Hei 5-22392 describes a handset-shaped portable radio telephone comprising a first microphone that picks up the transmitted voice and outputs a transmission voice signal, a second microphone that picks up ambient noise and outputs an ambient noise signal, and a noise elimination circuit that, based on the ambient noise signal output from the second microphone, performs arithmetic processing to remove the ambient noise component contained in the transmission voice signal output from the first microphone.

[0004] Japanese Patent Application Laid-Open No. Hei 2-072398 describes a method for improving the signal-to-noise ratio in noisy conditions by using two or more microphones and a neural-network-type filter that takes the signals from those microphones as input.

[0005] These methods can improve the signal-to-noise ratio under certain conditions, but in principle it remains difficult to separate the acoustic information of the utterance from the noise.

[0006] There is also a proposal to use image information in addition to the audio signal in speech recognition (Japanese Patent Application Laid-Open No. Hei 9-134194). That publication discloses an image recognition system that performs lip-reading recognition based on image information acquired by a camera housed in the transmitter housing. Using such a lip-reading system to reinforce speech recognition makes it possible to apply speech recognition to a larger vocabulary and to more people while maintaining recognition accuracy. However, the proposal does not address how to construct the reference dictionary used in image recognition, and because it assumes reinforcement of speech recognition of an utterance, it effectively presupposes that the user speaks aloud. It is therefore difficult to escape the influence of noise. In addition, recognition based on speech recognition or image recognition has the problem of relatively long computation times.

[0007] As noted above, voice input through a microphone requires the user to speak aloud. However, speaking aloud in highly public places has become a social problem. For example, the use of mobile phones is often restricted on public buses and trains. Voice input by speaking aloud may also be inappropriate in quiet offices and libraries. Speaking aloud can further be a problem in situations that require confidentiality. In such environments, voice input through a microphone, which presupposes speaking aloud, is not appropriate.

[0008] Keyboards and handwritten character recognition are used as input devices in such situations. However, these input methods occupy the hands and require a certain degree of proficiency.

[0009]

[Problems to Be Solved by the Invention] An object of the present invention is to provide an input device that, when detecting linguistic information such as speech, relies on the correspondence between utterances and articulator movement and therefore does not necessarily require the user to speak aloud. A further object is to improve accuracy by learning the correspondence between utterances and the appearance of the articulators used by this input device as the device is used and the user becomes more proficient.

[0010]

[Means for Solving the Problems] The invention of claim 1 is a linguistic information associating device that associates articulator movement with linguistic information obtained from a source other than articulator movement. It comprises: articulator form input means that measures the articulator movement occurring when a speaker speaks, from at least part of the articulators and the surrounding skin, and generates input data; means for inputting linguistic information from a source other than articulator movement; and means for associating the articulator movement during the utterance with the linguistic information input from the source other than articulator movement. In particular, the information input from a source other than articulator movement may be voice information, as in the invention of claim 2. Because it suffices that the information input from a source other than articulator movement is associated with the articulator movement, it is also possible, as in claim 3, for the device to indicate the utterance to be spoken and to recognize the articulator movement when the user speaks according to that indication. It is also possible, as in claim 4, to recognize information consisting of phonetic characters, phonetic character strings, or arbitrary symbols or symbol strings.

[0011] Here, linguistic information means information used to convey semantic content, such as utterance information, character information, symbols, and articulator movement.

[0012] Based on the correspondence between voice information and articulator movement established by the devices of claims 1 to 4, it is possible, as in claim 5, to obtain from articulator movement the information that was input from a source other than articulator movement. It is likewise possible, as in claim 6, to estimate the associated articulator movement from information input from a source other than articulator movement. To obtain the association, the means of claims 1 to 4 may be included, or previously associated information may be input. Both the articulator movement and the information input from a source other than articulator movement are recorded as time patterns, so the correspondence can be searched using a neural network that has learned the correspondence, or using dynamic programming, which is commonly used for matching time patterns.
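As an illustration of matching time patterns by dynamic programming, the following is a minimal sketch of a dynamic time warping (DTW) distance between two scalar sequences. The patent names dynamic programming only in general terms; the choice of DTW, the absolute-difference local cost, and the function name `dtw_distance` are assumptions made for this example.

```python
from math import inf


def dtw_distance(a, b):
    """DTW distance between two scalar time patterns (hypothetical helper)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Because the alignment may repeat samples, a pattern and a time-stretched copy of it score a distance of zero, which is why this kind of matching tolerates differences in speaking rate.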

[0013] When detecting speech or the like, there are two inputs: input other than articulator movement, such as acoustic input, and input from articulator movement. In situations where detection from either one alone is difficult, speech or the like can therefore be detected by integrating the other information, as in the invention of claim 7. In normal use, the dictionary can be recorded and learned continuously, and when one input is judged to be unusable, speech can be detected from the other. In this case, either one may be selected as in the invention of claim 9, or both may be combined as in the invention of claim 10. In this way, the association can be recorded during normal use. As a basis for judging whether detection is difficult, a confidence level can be calculated and used, as in the inventions of claims 10 and 11. In situations where speaking aloud is inappropriate, the most important confidence measure concerning the utterance is whether the utterance is voiced or silent. In claim 12 of the present invention, the confidence level is obtained by recognizing whether the utterance is voiced or silent.

[0014] As in the invention of claim 8, the movement of the speaker's organs may also be associated with semantic information such as gestures.

[0015] The voice information input from a source other than articulator movement may be, as in the invention of claim 2, an audio signal input through a microphone or the like, a time pattern obtained by converting the audio signal into signal intensities in suitable frequency bands, or the result of speech recognition. Voice information can also be input by explicitly entering phonetic characters from a keyboard or the like.
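One way to turn an audio signal into a time pattern of per-band signal intensities is sketched below. The patent does not specify the transform, so the non-overlapping frames, the naive discrete Fourier transform, and the names `band_intensity_pattern`, `frame_len`, and `band_edges` are all assumptions chosen to keep the example dependency-free.

```python
import cmath
import math


def band_intensity_pattern(signal, frame_len, band_edges):
    """Return one list of per-band intensities for each non-overlapping frame.

    band_edges holds DFT bin indices, e.g. [0, 4, 8] defines the two
    bands [0, 4) and [4, 8). All names here are hypothetical.
    """
    pattern = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # Magnitude of each DFT bin below the Nyquist frequency (naive O(N^2) DFT).
        mags = []
        for k in range(frame_len // 2):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            mags.append(abs(acc))
        # Sum the bin magnitudes that fall inside each band.
        pattern.append([sum(mags[lo:hi])
                        for lo, hi in zip(band_edges, band_edges[1:])])
    return pattern
```

A real implementation would more likely use windowed, overlapping frames and an FFT; the naive transform is used here only so the sketch runs with the standard library alone.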

[0016] The articulator form input means, which measures the articulator movement occurring when a speaker speaks from at least part of the articulators and the surrounding skin and generates input data, can take its input from a sensor in contact with the skin. Alternatively, as in claim 13, the appearance of the skin may be captured with a suitable sensor, for example a sensor that optically inputs a three-dimensional shape, or, as in claim 14, the movement may be detected by capturing images of the skin. Optical flow is a known method for detecting motion from images.

[0017] The voice information input from a source other than articulator movement, and the input data obtained by measuring the articulator movement occurring when a speaker speaks from at least part of the articulators and the surrounding skin, can both be regarded as time patterns that vary over time. One way to associate the voice information with the articulator movement during the utterance is therefore to record both in a table as time patterns. Alternatively, as in the inventions of claims 15 and 16, the association can be made with a neural network such as a hierarchical (layered) neural network or a recurrent neural network, for example by training with the time pattern of the motion as the input signal and the corresponding phonetic character as the teacher signal. Similarly, the association can be made by training with the phonetic character as the input signal and the time pattern of the motion as the teacher signal.
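The supervised setup described here (motion time pattern as input signal, phonetic character as teacher signal) can be sketched as follows. This is not the hierarchical or recurrent network the patent names; it is a deliberately simplified single-layer softmax classifier used as a stand-in, and the feature layout, learning rate, and all function names are assumptions.

```python
import math


def predict_proba(weights, bias, x):
    """Softmax probabilities over the phonetic-character classes."""
    scores = {c: bias[c] + sum(w * xi for w, xi in zip(weights[c], x))
              for c in weights}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def train_softmax(samples, labels, classes, epochs=50, lr=0.5):
    """Train with per-sample gradient descent on cross-entropy loss."""
    dim = len(samples[0])
    weights = {c: [0.0] * dim for c in classes}
    bias = {c: 0.0 for c in classes}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            probs = predict_proba(weights, bias, x)
            for c in classes:
                # Gradient of the loss with respect to the score of class c.
                err = probs[c] - (1.0 if c == y else 0.0)
                bias[c] -= lr * err
                for i, xi in enumerate(x):
                    weights[c][i] -= lr * err * xi
    return weights, bias


def predict(weights, bias, x):
    """Return the most probable phonetic character for a motion pattern."""
    probs = predict_proba(weights, bias, x)
    return max(probs, key=probs.get)
```

In this sketch each input vector is a flattened time pattern, for instance three samples of vertical lip opening followed by three samples of horizontal lip width. A pattern dominated by vertical opening could then be trained toward "a" and one dominated by horizontal spreading toward "i".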

[0018] The input-signal and teacher-signal data on which this learning is based can be obtained, as in claim 17, by recording the voice information input from a source other than articulator movement together with the articulator movement during the utterance. As in claim 18, the recorded data can be recorded and updated at any time, or on instruction, based on the input voice information and the articulator movement during the utterance. It is also possible, as in claim 19, to calculate the confidence level of each input and to control recording automatically according to that confidence. For example, the record may be updated only when both confidence levels exceed 90%. Further, as in claim 20, in the process of detecting speech, when the confidence of either input is at or below a predetermined value, the detection result can be output based on the other input and on the characteristics of the first that correspond to that input. When both confidence levels are at or above the predetermined value, means for controlling recording and updating according to the two confidence levels can operate while the detection result is output. According to this invention, the recognition rate can be improved while high-confidence input/output correspondences are recorded continuously during normal use. Similarly, the invention can be applied to speech recognition, as in claim 21.
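The confidence-controlled behavior described in this paragraph can be sketched as a small decision function. The function name `detect`, the `(motion, audio)` pair layout of the table, the caller-supplied `estimate_audio` lookup, and the strict "greater than 0.9" comparison are assumptions for illustration; the patent gives 90% only as an example threshold.

```python
def detect(audio, audio_conf, motion, motion_conf, table, estimate_audio,
           threshold=0.9):
    """Pick an output and decide whether to record, based on confidences.

    table is a list of (motion, audio) pairs; estimate_audio(motion, table)
    is a hypothetical lookup supplied by the caller.
    """
    audio_ok = audio_conf > threshold
    motion_ok = motion_conf > threshold
    if audio_ok and motion_ok:
        # Both inputs are trusted: record the pair and output the audio.
        table.append((motion, audio))
        return audio
    if audio_ok:
        # Only the audio is trusted: output it without updating the table.
        return audio
    if motion_ok:
        # Audio is unreliable (e.g. silent speech): estimate it from motion.
        return estimate_audio(motion, table)
    return None  # neither input is reliable enough to produce a result
```

With this split, ordinary voiced use keeps growing the table, while silent or noisy use draws on what was recorded earlier.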

[0019] The above inventions can be applied to a telephone device, as in claim 23. Telephone devices include ordinary stationary telephones as well as portable wireless telephones and videophones. Applying the invention to such devices makes it possible to use them in situations where speaking aloud is difficult or where noise is loud. In addition, audiovisual equipment that inherently handles audio and image information, typified by video cameras and video tape recorders, can readily acquire audio and image signals. Such equipment already has the means for inputting voice information from a source other than articulator movement and the means for inputting images of the articulators and surrounding skin that show the articulator movement occurring when a speaker speaks, so implementation is easy.

[0020] In a videophone in particular, articulator movement can be derived from the voice information, so transmission of the articulator movement can be omitted or reduced.

[0021] When the invention is applied to a device for inputting characters or documents consisting of character strings, the confidence of the input can be improved. Moreover, articulator movements that differ from normal speech, such as markedly left-right asymmetric movements, can be recognized as signals distinct from speech. By assigning such movements to suitable gestures, a device can be configured that inputs gestures which speech recognition alone cannot handle.

[0022] The above inventions have been described in terms of individual means, but some or all of these means may also be implemented as a computer program. These computer programs and the recorded associations can be stored on a suitable medium.

[0023]

[Operation] In this invention, the uttered acoustic information and the articulator movement are integrated in order to input voice information or to convey speech.

[0024] The acoustic information of the utterance, input by voice input means such as a microphone, is associated through a table or a neural network with the articulator motion information input by a three-dimensional digitizer or an image input device.

[0025] With this association, when one of the inputs has a problem, input is performed from the other. When both inputs are presumed to be correct, their information is used as data for building the association.

[0026] As in the invention of claim 1, a device that associates voice information with articulator movement takes the information input by the means for inputting voice information and the means for inputting the articulator movement during the utterance, processes it as needed, and either records it in a table or, as in the inventions of claims 15 and 16, uses it to train a neural network.

[0027] Based on this association, voice information is obtained from the articulator movement, as in the invention of claim 2, or the articulator movement is estimated from the voice, as in the invention of claim 3.

[0028] The voice information may be recorded as is (as an audio recording), stored as a time pattern of the frequency spectrum, or passed through speech recognition and stored as the resulting phonetic symbols, characters, character information, or document information consisting of character strings.

[0029] When captured as images, the articulator movement may be recorded as is (as video), or it may be recorded as recognized shapes and their movement using a technique such as optical flow. However, when the data are recorded in a table or neural network and the table is looked up by motion similarity, or when the data are to be used as neural network input, it may be preferable to convert both the voice information and the motion information to numerical form. For example, when the table is searched by dynamic programming for the motion information most similar to the input motion information, raw video is difficult to process, whereas the search is easy if the motion of predetermined sites is recorded as a time pattern of positions.

[0030]

[Embodiments of the Invention] The features of the present invention are described concretely below on the basis of embodiments, with reference to the drawings. [Embodiment 1] FIG. 1 shows Embodiment 1 of the present invention. Embodiment 1 is realized as a linguistic information associating device that associates voice information with articulator movement.

[0031] In FIG. 1, the linguistic information associating device comprises a microphone 100, a three-dimensional digitizer unit 110, a table unit 120, and a spectrum information calculation unit 130. The audio signal input by the microphone 100 and the position information of the articulators, including the face, input in synchronization with the audio signal from the three-dimensional digitizer unit 110 in contact with the articulators, are recorded in the table unit 120. The voice information is recorded by the spectrum information calculation unit 130 as a time pattern of spectrum intensity for each suitable frequency band. The position information of the articulators, including the face, is obtained by using the three-dimensional digitizer unit 110 to input the trajectories of one or more suitable points on the articulators (for example, both corners of the lips, the centers of the upper and lower lips, the center of the Adam's apple, and the center of the cheek), and is recorded as a time pattern of the position of each measurement site.
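A possible in-memory layout for the table unit is sketched below: each entry pairs a time pattern of per-band spectrum intensities with a time pattern of 3-D positions for each measurement site. The class and field names are hypothetical, and the requirement that both patterns have the same number of time steps is an assumption used here to model the synchronization the text describes.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    band_intensity: list  # one list of per-band intensities per time step
    positions: dict       # site name -> list of (x, y, z), one per time step


@dataclass
class AssociationTable:
    entries: list = field(default_factory=list)

    def record(self, band_intensity, positions):
        """Store one synchronized audio/motion pair and return its index."""
        steps = len(band_intensity)
        for site, track in positions.items():
            # Assumed invariant: every track is sampled in step with the audio.
            if len(track) != steps:
                raise ValueError(
                    f"track for {site!r} is not synchronized with the audio")
        self.entries.append(TableEntry(band_intensity, positions))
        return len(self.entries) - 1
```

A site name such as "lower_lip_center" would key one trajectory; adding further sites (lip corners, cheek center) only adds dictionary entries.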

[0032] The table unit 120 thus associates the input voice information with the input articulator movement during the utterance.

[0033] The entries of the table unit 120 may be reordered as appropriate to make lookup easier. In this embodiment a three-dimensional digitizer is used as the means for inputting articulator movement, but any other means capable of inputting articulator movement may be used.

[0034] [Embodiment 2] FIG. 2 shows Embodiment 2 of the present invention. Embodiment 2 is realized as a voice information estimating device that estimates voice information from articulator movement. In FIG. 2, parts corresponding to those in FIG. 1 are given the same reference numerals.

[0035] In FIG. 2, as in Embodiment 1, the voice information input by the microphone 100 and the articulator position information input by the three-dimensional digitizer unit 110 are processed as appropriate and recorded in the table unit 120.

[0036] Next, in the same way as when recording in the table unit 120, the articulator movement input by the three-dimensional digitizer unit 110 is input as a time pattern of the position of each measurement site, and the table unit 120 is searched for the recorded articulator motion information with the highest similarity. One example of such a search method is dynamic programming, which is widely used for recognizing time patterns.

[0037] The table unit 120 is searched in this way, and the acoustic information associated with the input articulator movement is estimated.
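The lookup in this embodiment can be sketched as a nearest-neighbor search: compare the query motion against every recorded motion pattern and return the audio stored with the closest one. The sketch below assumes a single measurement site tracked as (x, y, z) points, a table of `(motion, audio)` pairs, and dynamic time warping with Euclidean local cost as the dynamic-programming similarity; none of these specifics come from the patent.

```python
import math


def dtw(a, b):
    """DTW distance between two sequences of (x, y, z) points."""
    cost = [[math.inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean distance
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[-1][-1]


def estimate_audio(query_motion, table):
    """Return the audio stored with the most similar recorded motion.

    table is a non-empty list of (motion, audio) pairs (hypothetical layout).
    """
    best_motion, best_audio = min(
        table, key=lambda entry: dtw(query_motion, entry[0]))
    return best_audio
```

Swapping the roles of the two columns gives the reverse lookup, from an audio time pattern to the associated motion. Note that `math.dist` requires Python 3.8 or later.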

[0038] According to this embodiment, the voice information being uttered can be obtained from the image of the articulators without performing speech recognition. When this voice information is heard directly by a person as acoustic information, or used as input to more advanced speech recognition, some error can be corrected by taking into account context and other natural-language information. In Japanese, for example, if the vowels are nearly correct, the content of a sentence can often be understood from context even when the consonants contain some errors.

[0039] This embodiment shows an example in which the means up to recording in the table unit 120 and the means from lookup of the table unit 120 onward are implemented in the same device. However, if the necessary information can be passed between the means by communication or the like, the invention can be carried out without implementing them in a single device.

[0040] [Embodiment 3] FIG. 3 shows Embodiment 3 of the present invention. Embodiment 3 is realized as an articulator motion estimating device that estimates articulator movement from speech. In FIG. 3, parts corresponding to those in FIG. 1 are given the same reference numerals.

[0041] As in Embodiment 1, the voice information input by the microphone 100 and the articulator position information input by the three-dimensional digitizer unit 110 are processed as appropriate and recorded in the table unit 120.

[0042] Next, in the same way as when recording in the table unit 120, the audio signal input by the microphone 100 is converted by the spectrum information calculation unit 130 into a time pattern of spectrum intensity for each suitable frequency band, and the table unit 120 is searched for the recorded voice information with the highest similarity. One example of such a search method is dynamic programming, which is widely used for recognizing time patterns.

[0043] The table unit 120 is searched in this way, and the articulator movement associated with the input voice information is estimated.

[0044] According to this embodiment, articulator movement can easily be estimated from speech. The means up to recording in the table unit 120 and the means from lookup of the table unit 120 onward may be located in the same device, or they may be placed in different devices, with the information recorded in the table unit 120 transmitted between those devices.

【００４５】［実施例４］第４図は本発明の実施例４を
示す。実施例４は、テレビ電話装置として実現されてい
る。図４においても図１と対応する箇所には対応する符
号を付した。Fourth Embodiment FIG. 4 shows a fourth embodiment of the present invention. Embodiment 4 is implemented as a videophone device. In FIG. 4 as well, portions corresponding to FIG. 1 are denoted by corresponding reference numerals.

【００４６】図４において、実施例３と同様に、入力し
た音声情報と対応付けられた調音器官の動を推定する。In FIG. 4, as in the third embodiment, the motion of the articulator associated with the input speech information is estimated.

【００４７】対応付けのテーブル１２０および、入力し
た音声信号は、伝送路２２０により受信側に伝送され、
受信側では音声に対応した運動情報を、伝送されたテー
ブル１２０から推定し、運動情報を推定する。推定した
調音器官の運動に、顔表面の画像をテクスチャーマッピ
ング部２００により割り付ける。その画像を、表示部２
１０により表示する。The association table 120 and the input audio signal are transmitted to the receiving side via the transmission path 220. The receiving side estimates the motion information corresponding to the voice from the transmitted table 120. An image of the face surface is mapped onto the estimated articulator motion by the texture mapping unit 200, and the resulting image is displayed on the display unit 210.

【００４８】本実施例によれば、調音器官については、
随時の運動を伝送することなしに、顔画像のうち、調音
器官の運動を伝送することができる。According to this embodiment, the movement of the articulator within the face image can be conveyed to the receiving side without transmitting that movement moment by moment.

【００４９】［実施例５］第５図（ａ）は本発明の実施
例５を示す。実施例５は、音声情報と、調音器官の運動
とを対応付ける装置を示す。なお、図５（ａ）において
も図１と対応する箇所には対応する符号を付した。Embodiment 5 FIG. 5(a) shows Embodiment 5 of the present invention. Embodiment 5 is an apparatus for associating voice information with the movement of the articulator. In FIG. 5(a) as well, portions corresponding to FIG. 1 are denoted by corresponding reference numerals.

【００５０】図５（ａ）において、マイクロホン１００
により入力した音声信号と、調音器官に接触する３次元
デジタイザ部１１０から入力した音声信号に同期した顔
を含む調音器官の位置情報とを、テーブル部１２０に記
録する。この際、音声情報は音素認識部１３１により、
母音と子音の時間パターンとして記録する。また、顔を
含む調音器官の位置情報は、調音器官上の適宜の１つ以
上の点（例として、唇両端、上下唇の中心、のどぼとけ
の中心位置、頬中央など）の軌跡を３次元デジタイザ部
１１０により入力し、各測定部位の位置の時間パターン
として記録する。In FIG. 5(a), the audio signal input from the microphone 100 and the position information of the articulator including the face, input in synchronization with the audio signal from the three-dimensional digitizer unit 110 in contact with the articulator, are recorded in the table unit 120. At this time, the voice information is recorded by the phoneme recognition unit 131 as a time pattern of vowels and consonants. For the position information of the articulator including the face, the trajectories of one or more appropriate points on the articulator (for example, both corners of the lips, the centers of the upper and lower lips, the center of the Adam's apple, and the center of the cheek) are input by the three-dimensional digitizer unit 110 and recorded as a time pattern of the position of each measurement site.

【００５１】このテーブル部１２０により、入力した音
素情報と、入力した発話中の調音器官の運動を対応付け
ることができる。The table unit 120 makes it possible to associate the input phoneme information with the input motion of the articulator during speech.
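One possible shape for such a table is sketched below. The class names, field names, measurement-point labels, and sample values are hypothetical and only illustrate pairing a phoneme time pattern with per-point position time patterns; the patent does not specify a data layout.

```python
from dataclasses import dataclass, field

@dataclass
class TableEntry:
    phonemes: list      # [(phoneme, onset_seconds), ...], the vowel/consonant time pattern
    trajectories: dict  # measurement point name -> [(x, y, z), ...] sampled over time

@dataclass
class AssociationTable:
    entries: list = field(default_factory=list)

    def record(self, phonemes, trajectories):
        """Store one utterance: phoneme pattern plus synchronized point trajectories."""
        self.entries.append(TableEntry(list(phonemes), dict(trajectories)))

    def sort_for_lookup(self):
        """Reorder entries by phoneme sequence so that lookup is easier (one possible ordering)."""
        self.entries.sort(key=lambda e: [p for p, _ in e.phonemes])

    def lookup(self, phoneme_labels):
        """Return the trajectories recorded for an exact phoneme sequence, or None."""
        for entry in self.entries:
            if [p for p, _ in entry.phonemes] == list(phoneme_labels):
                return entry.trajectories
        return None

table = AssociationTable()
table.record([("k", 0.00), ("a", 0.08)],
             {"upper_lip_center": [(0.0, 1.0, 0.0), (0.0, 1.2, 0.1)]})
table.record([("a", 0.00)],
             {"upper_lip_center": [(0.0, 1.5, 0.0)]})
table.sort_for_lookup()
```

A linear scan is used here for brevity; any index that suits the chosen ordering would serve the same purpose.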

【００５２】このテーブル部１２０は、索引が容易なよ
うに、適宜の並べ替えを行っても良い。また、本実施例
では調音器官の運動を入力する手段として３次元デジタ
イザを用いたが、調音器官の運動を入力できる手段であ
れば、他の方法を用いても良い。The table unit 120 may be reordered as appropriate so that lookup is easy. In this embodiment, a three-dimensional digitizer is used as the means for inputting the movement of the articulator, but any other means capable of inputting that movement may be used.

【００５３】また、図５（ｂ）に示すように、本実施例
の音素認識部１３１に代えて、音声認識部１３２を用
い、音声情報を文字情報ないしは文字列からなる文書情
報として認識することも可能である。As shown in FIG. 5(b), a voice recognition unit 132 may be used in place of the phoneme recognition unit 131 of this embodiment, so that the voice information is recognized as character information or as document information consisting of character strings.

【００５４】［実施例６］第６図は本発明の実施例６を
示す。この実施例は、本発明を携帯型無線電話装置に適
用したものである。本実施例は、特開平９−１３４１９
４号公報にあるハンドセットに、本発明を実装したもの
である。図６においても図１と対応する箇所には対応す
る符号を付した。Embodiment 6 FIG. 6 shows Embodiment 6 of the present invention. In this embodiment, the present invention is applied to a portable radio telephone device. Specifically, the present invention is implemented in the handset described in Japanese Unexamined Patent Publication No. Hei 9-134194. In FIG. 6 as well, portions corresponding to FIG. 1 are denoted by corresponding reference numerals.

【００５５】携帯型無線電話装置は、通常の有線電話に
比較して、通話が可能な状況が広範囲で、例えば、公共
交通機関内でも原理的には通話が可能である。しかし、
一般的には、このような状況下で、電話機へ向けての発
話が迷惑となると考えられている。本実施例では、この
ような状況における使用を想定したものである。また、
この実施例は、携帯型無線電話装置だけではなく、他の
電話装置に適用することもできる。Compared with an ordinary wired telephone, a portable radio telephone device can be used for calls in a much wider range of situations; for example, calls are possible in principle even on public transportation. In such situations, however, speaking into a telephone is generally considered a nuisance to others. This embodiment is intended for use in such situations. It can be applied not only to portable radio telephone devices but also to other telephone devices.

【００５６】図６において、発話が可能な状況下におい
ては、マイクロホン１００により入力し、送話した音声
信号、及び、調音器官に接触する携帯型無線電話装置の
ビデオカメラ１１１から入力した音声信号に同期した顔
を含む調音器官の位置情報を、テーブル部１２０に記録
する。この際、音声情報は、入力した音声に類似した音
声を、音声合成部１４０により合成する場合のパラメー
タとなるように、音声変換部１３３により変換し、テー
ブル部１２０に記録する。この過程で、音声認識を行っ
ても良いが、必ずしも認識を行わなくても、ほぼ同等に
聞こえる音響情報を合成できるレベルのパラメータを算
出してもよい。In FIG. 6, in situations where speaking is possible, the audio signal input from the microphone 100 and transmitted, together with the position information of the articulator including the face, input in synchronization with the audio signal from the video camera 111 of the portable radio telephone device in contact with the articulator, is recorded in the table unit 120. At this time, the voice information is converted by the voice conversion unit 133 into parameters with which the voice synthesis unit 140 can synthesize a voice similar to the input voice, and is recorded in the table unit 120. Speech recognition may be performed in this process, but it is not required; it is sufficient to calculate parameters at a level from which acoustic information that sounds nearly equivalent can be synthesized.

【００５７】このテーブル部１２０により、音声合成パ
ラメータと、発話中の調音器官の運動を対応付けること
ができる。The table section 120 makes it possible to associate the speech synthesis parameters with the movements of the articulator during speech.

【００５８】このテーブル部１２０は、索引が容易なよ
うに、適宜の並べ替えを行っても良い。また、本実施例
では調音器官の運動を入力する手段として調音器官に接
触する携帯型無線電話装置のビデオカメラ１１１を用い
たが、調音器官の運動を入力できる手段であれば、画像
入力や、超音波による距離測定など、他の方法を用いて
も良い。The table unit 120 may be reordered as appropriate so that lookup is easy. In this embodiment, the video camera 111 of the portable radio telephone device in contact with the articulator is used as the means for inputting the movement of the articulator, but any other means capable of inputting that movement, such as image input or ultrasonic distance measurement, may be used.

【００５９】発話に支障のある状況下では、利用者の指
示に基づいて、下記の手順により使用する。In situations where speaking aloud is problematic, the device is used by the following procedure in response to the user's instruction.

【００６０】まず、調音器官に接触する携帯型無線電話
装置のビデオカメラ１１１から入力した顔を含む調音器
官の位置情報に最も近いテーブル部１２０に記録された
調音器官の運動を、ダイナミックプログラミング手法に
より検索する。この運動に対応する音声合成パラメータ
を索引し、音声合成部１４０により、音声を合成する。
合成した音声を、切り替え部１４１により、マイクロホ
ン１００からの出力に代えて送話する。First, the articulator movement recorded in the table unit 120 that is closest to the position information of the articulator including the face, input from the video camera 111 of the portable radio telephone device in contact with the articulator, is retrieved by the dynamic programming method. The voice synthesis parameters corresponding to this movement are looked up, and the voice synthesis unit 140 synthesizes the voice. The switching unit 141 then transmits the synthesized voice in place of the output from the microphone 100.

【００６１】本実施例では、利用者の指示に基づいてテ
ーブルへの記録と、テーブルの索引と合成を切り替えた
が、マイクロホン１００からの入力の有無に応じて上記
の切り替えを自動で行うことも可能である。また、騒音
などを検知する手段を加え、騒音の大小に応じて切り替
えることも可能である。自動で切り替えを行う場合、例
えば、マイクロホン１００からの入力が途切れ途切れに
なるような状況で、本発明により一部を合成することを
想定すると、高々母音程度が了解できるレベルでも、実
用上は通話品質を大幅に向上することができる。In this embodiment, recording to the table and table lookup with synthesis are switched according to the user's instruction, but this switching can also be performed automatically depending on whether there is input from the microphone 100. It is also possible to add means for detecting noise or the like and to switch according to the noise level. With automatic switching, for example when the input from the microphone 100 is intermittent and the missing parts are synthesized by the present invention, call quality can be greatly improved in practice even if only the vowels, at best, are intelligible.
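The automatic switching could be sketched as below. The function name, the normalized levels, and both thresholds are hypothetical, and the direction of the noise rule (loud noise selects synthesis from articulator movement) is an assumption; the text only says that switching may depend on microphone input and noise level.

```python
def choose_mode(mic_level, noise_level, speech_threshold=0.1, noise_threshold=0.5):
    """Pick 'record' (speech usable: update the table and send the microphone signal)
    or 'synthesize' (look up the table from articulator movement and send synthesized voice).

    mic_level and noise_level are assumed to be normalized to the range 0.0 to 1.0.
    """
    no_speech = mic_level < speech_threshold    # nothing usable from the microphone
    too_noisy = noise_level > noise_threshold   # assumed rule: noise masks the speech
    return "synthesize" if (no_speech or too_noisy) else "record"

# Frame-by-frame decisions for an intermittent microphone signal (hypothetical values).
frames = [(0.8, 0.1), (0.0, 0.1), (0.7, 0.9), (0.6, 0.2)]
modes = [choose_mode(mic, noise) for mic, noise in frames]
```

Deciding per frame is what lets only the interrupted parts of an utterance be replaced by synthesized voice.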

【００６２】［実施例７］第７図は本発明の実施例７を
示す。この実施例７は、本発明を音声認識装置に適用し
たものである。図７においても図１等と対応する箇所に
は対応する符号を付した。Seventh Embodiment FIG. 7 shows a seventh embodiment of the present invention. In the seventh embodiment, the present invention is applied to a speech recognition device. In FIG. 7 as well, portions corresponding to FIG. 1 and other figures are denoted by corresponding reference numerals.

【００６３】近年の音声認識技術の発展により、用途に
よっては実用に耐えるレベルの音声認識技術が実用化さ
れている。しかし、現状の音声認識技術は発話すること
が前提となっているために、例えば公共交通機関内、静
かなオフィスや図書館での使用に支障がある場合があ
る。また、車両内などにおいて、騒音が大きい場合に
は、音声だけを入力する方法に課題がある状況もある。
本実施例では、このような状況における音声認識を想定
したものである。With the recent development of speech recognition technology, speech recognition at a practically usable level has been put to practical use for some applications. However, because current speech recognition technology presupposes that the user speaks aloud, its use may be hindered, for example, on public transportation or in a quiet office or library. In addition, where noise is loud, such as inside a vehicle, methods that input only voice can be problematic. This embodiment is intended for speech recognition in such situations.

【００６４】発話のみでの音声認識が可能な状況下にお
いては、マイクロホン１００により入力し、送話した音
声信号、及び、調音器官を撮影するのビデオカメラ１１
１から入力した音声信号に同期した顔を含む調音器官の
画像から求めた位置情報を、テーブル部１２０に記録す
る。この際、音声情報は、音声認識部１３２により認識
し、テーブル部１２０に記録する。また、ビデオ画像か
ら調音器官の運動を取得する方法としては、オプティカ
ルフロー手法を用いることができる。In situations where speech recognition from the utterance alone is possible, the audio signal input from the microphone 100 and transmitted, together with the position information obtained from images of the articulator including the face, captured in synchronization with the audio signal by the video camera 111 that films the articulator, is recorded in the table unit 120. At this time, the voice information is recognized by the voice recognition unit 132 and recorded in the table unit 120. The optical flow method can be used to obtain the movement of the articulator from the video images.

【００６５】ここで、発話のみでの音声認識が可能な状
況であるか否かの判断は、利用者が指定しても良いし、
音声認識の際に、確信度を算出し、確信度が所定の値以
上であるか否かにより判断してもよい。Whether the situation permits speech recognition from the utterance alone may be specified by the user, or may be judged by calculating a certainty factor during speech recognition and checking whether it is equal to or greater than a predetermined value.

【００６６】次に、音声認識の際、以下の手順で認識を
行う。Next, at the time of voice recognition, recognition is performed according to the following procedure.

【００６７】マイクロホン１００により入力し、送話し
た音声信号、及び、調音器官のビデオ画像から入力した
音声信号に同期した顔を含む調音器官の位置情報を、同
時に入力する。この入力と認識については、テーブル部
１２０に記録する際に用いた手段を用いても良いし、結
果が同等であれば異なる手段を用いても良い。この際、
入力した音声情報の確信度併せて認識する。例えば、雑
音などを検知することにより、認識結果が正しく認識さ
れている（誤認識でない）確率などを算出する。The audio signal input from the microphone 100 and transmitted, and the position information of the articulator including the face, obtained from the video images of the articulator in synchronization with the audio signal, are input simultaneously. For this input and recognition, the means used when recording in the table unit 120 may be used, or different means may be used if the results are equivalent. At this time, the certainty factor of the input voice information is recognized as well; for example, by detecting noise or the like, the probability that the recognition result is correct (not a misrecognition) is calculated.

【００６８】この確率が所定の値以上であれば、音声認
識の結果を認識結果として採用し、そうでない場合には
調音器官の運動を認識した結果を認識結果として採用す
る。If the probability is equal to or greater than a predetermined value, the result of speech recognition is adopted as a recognition result. Otherwise, the result of recognizing the movement of the articulator is adopted as the recognition result.

【００６９】以上では、音声認識のみの確信度を用いた
が、調音器官の運動を認識した結果についても確信度を
算出し、確信度が高い認識結果を採用しても良いし、合
計の確信度が高い認識結果を採用してもよい。また、調
音器官の確信度だけを求め、それに基づいて選択ないし
は合成を制御しても良い。In the above, only the certainty factor of speech recognition is used, but a certainty factor may also be calculated for the result of recognizing the articulator movement, and either the recognition result with the higher certainty factor or the recognition result with the higher total certainty factor may be adopted. Alternatively, only the certainty factor for the articulator may be obtained, and selection or synthesis may be controlled on that basis.
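Two of the selection rules described above can be sketched as follows. Each recognition result is assumed to be a `(text, certainty)` pair; the function names, the threshold of 0.7, and the sample values are hypothetical.

```python
def select_by_speech_confidence(speech, motion, threshold=0.7):
    """Adopt the speech result if its certainty reaches the threshold,
    otherwise fall back to the result recognized from articulator movement."""
    speech_text, speech_conf = speech
    motion_text, _ = motion
    return speech_text if speech_conf >= threshold else motion_text

def select_by_higher_confidence(speech, motion):
    """Adopt whichever result has the higher certainty (speech wins ties)."""
    return max(speech, motion, key=lambda result: result[1])[0]

clean = select_by_speech_confidence(("hello", 0.9), ("hallo", 0.6))
noisy = select_by_speech_confidence(("yellow", 0.3), ("hello", 0.6))
either = select_by_higher_confidence(("yellow", 0.3), ("hello", 0.6))
```

The first rule needs only the speech certainty, which matches the basic procedure; the second requires a certainty for the articulator result as well.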

【００７０】また、子音は音声認識の結果を用い、母音
だけ双方の確信度が高いものを用い、子音と母音の認識
結果を合成しても良い。このように、選択と合成につい
ては種々の組み合わせや手法を用いることが出来る。Alternatively, the speech recognition result may be used for consonants, while for vowels alone whichever of the two results has the higher certainty factor is used, and the consonant and vowel recognition results are then combined. In this way, various combinations and methods can be used for selection and synthesis.
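A minimal sketch of this consonant/vowel combination follows. It assumes the two recognizers emit phoneme lists that are already aligned one to one, and that the speech result decides whether a position is a vowel; both are simplifying assumptions, and the vowel set and sample values are hypothetical.

```python
VOWELS = {"a", "i", "u", "e", "o"}

def combine(speech_phonemes, motion_phonemes):
    """Merge two aligned lists of (phoneme, certainty) pairs.

    Consonants always come from speech recognition; for vowels, the result
    with the higher certainty is used.
    """
    combined = []
    for (s_ph, s_conf), (m_ph, m_conf) in zip(speech_phonemes, motion_phonemes):
        if s_ph in VOWELS and m_conf > s_conf:
            combined.append(m_ph)   # vowel: the articulator result is more certain
        else:
            combined.append(s_ph)   # consonant, or the speech result is at least as certain
    return combined

speech = [("k", 0.8), ("o", 0.4), ("n", 0.3), ("i", 0.9)]
motion = [("g", 0.9), ("a", 0.7), ("m", 0.9), ("i", 0.5)]
result = combine(speech, motion)
```

Note that the third position keeps the speech consonant even though the articulator result is more certain, since only vowels are compared.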

【００７１】［実施例８］第８図は本発明の実施例８を
示す。この実施例８は、調音器官の運動から、文字情報
を対応付ける装置を示す。図８においても図１等と対応
する箇所には対応する符号を付した。[Eighth Embodiment] FIG. 8 shows an eighth embodiment of the present invention. The eighth embodiment is an apparatus that associates character information with the movement of the articulator. In FIG. 8 as well, portions corresponding to FIG. 1 and other figures are denoted by corresponding reference numerals.

【００７２】マイクロホン１００により入力し、音声認
識した文字情報と、調音器官に接触する３次元デジタイ
ザ部１１０から入力した音声信号に同期した顔を含む調
音器官の位置情報を獲得する。この際、音声の入力と認
識を行わず、利用者が発話した文字情報をキーボードな
どにより入力しても良いし、適宜の指示装置により発話
すべき文字情報を利用者に提示して、発話しても良い。
この際、発話は調音器官の運動だけで、実際に発話しな
くても良いし、発話してもよい。The character information obtained by speech recognition of the input from the microphone 100, and the position information of the articulator including the face, input in synchronization with the audio signal from the three-dimensional digitizer unit 110 in contact with the articulator, are acquired. Instead of inputting and recognizing voice, the character information of what the user utters may be entered with a keyboard or the like, or the character information to be uttered may be presented to the user by an appropriate instruction device and then uttered. In either case, the utterance may consist only of articulator movement without any actual vocalization, or it may be vocalized.

【００７３】以上により対応付けられた文字情報と、調
音器官の運動情報を用いて、ニューラルネットワーク１
２１を学習する。この際、調音器官の運動情報を、教師
信号として、文字情報をニューラルネットワーク１２１
に与える。そして、認識率をエネルギー関数として、エ
ネルギーが小さくなるようにニューラルネットワーク１
２１の諸係数を調整することにより学習を行う。ここで
は、認識率をエネルギー関数としたが、文字情報に対応
する適宜の理想出力を教師信号とし、教師信号とニュー
ラルネットワーク１２１の出力のユークリッド距離をエ
ネルギー関数としてもよい。The neural network 121 is trained using the character information and the articulator motion information associated as described above. At this time, the articulator motion information is given to the neural network 121 together with the character information as the teacher signal. Learning is performed by treating the recognition rate as an energy function and adjusting the coefficients of the neural network 121 so that the energy decreases. Although the recognition rate is used as the energy function here, an appropriate ideal output corresponding to the character information may instead be used as the teacher signal, with the Euclidean distance between the teacher signal and the output of the neural network 121 as the energy function.

【００７４】このような時間パターンの認識には、一般
的な階層型ニューラルネットワークを用いても良いし、
リカレントニューラルネットワークを用いていも良い。For recognition of such time patterns, a general hierarchical neural network may be used, or a recurrent neural network may be used.

【００７５】以上により、エネルギー関数が所定の値以
下になるか、所定の回数まで学習を繰り返す。In this way, learning is repeated until the energy function falls to or below a predetermined value or a predetermined number of iterations is reached.
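The training loop and its two stopping conditions can be sketched as follows. This is an illustration under stated assumptions, not the patent's network: it uses a single linear layer, the squared Euclidean distance between output and teacher signal as the energy, and plain gradient descent; the function names, learning rate, limits, and sample data are hypothetical.

```python
def train(samples, n_in, n_out, lr=0.5, energy_limit=1e-3, max_epochs=1000):
    """Train a single-layer linear network on (motion_vector, teacher_vector) pairs.

    Stops when the summed squared Euclidean distance (the energy) falls to or
    below energy_limit, or after max_epochs passes over the samples.
    """
    weights = [[0.0] * n_in for _ in range(n_out)]
    energy, epochs = float("inf"), 0
    while energy > energy_limit and epochs < max_epochs:
        energy = 0.0
        for x, target in samples:
            output = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
            error = [o - t for o, t in zip(output, target)]
            energy += sum(e * e for e in error)
            # Gradient step on the squared error for each coefficient.
            for row, e in zip(weights, error):
                for i, xi in enumerate(x):
                    row[i] -= lr * e * xi
        epochs += 1
    return weights, energy, epochs

def recognize(weights, x, labels):
    """Convert the network output into character information (largest output wins)."""
    output = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return labels[output.index(max(output))]

# Hypothetical data: two motion feature vectors, one-hot teacher signals for "a" and "i".
samples = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
weights, energy, epochs = train(samples, n_in=2, n_out=2)
label = recognize(weights, [1.0, 0.0], ["a", "i"])
```

A hierarchical or recurrent network would replace the single layer, but the stopping logic would stay the same.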

【００７６】学習が終了した時点で、音声認識の際、以
下の手順で認識を行う。At the time when the learning is completed, the speech recognition is performed in the following procedure.

【００７７】調音器官に接触する３次元デジタイザ部１
１０から入力した音声信号に同期した顔を含む調音器官
の位置情報を獲得する。この位置情報、ないしは位置情
報の時間パターンを、ニューラルネットワーク１２１に
入力する。必要に応じて、ニューラルネットワーク１２
１の出力を文字情報に変換することによって、認識結果
を算出することが出来る。The position information of the articulator including the face is acquired, in synchronization with the audio signal, from the three-dimensional digitizer unit 110 in contact with the articulator. This position information, or its time pattern, is input to the neural network 121. The recognition result can then be obtained by converting the output of the neural network 121 into character information as necessary.

【００７８】また、認識は、文字単位で行っても良い
し、文字列からなる文書単位で行っても良い。文字列か
らなる文書単位で認識を行う場合には、出現頻度や文脈
などを勘案し、認識率を向上させることができる。Recognition may be performed on a character basis or on a document basis consisting of character strings. In the case of performing recognition in units of documents composed of character strings, the recognition rate can be improved in consideration of the appearance frequency, context, and the like.

【００７９】また、ここではテーブル部１２０に代えて
ニューラルネットワーク１２１を用いたが、他の手段、
例えば、請求項４における、「入力した音声情報と、対
応付けられた音声情報の少なくとも一方から出力を選
択、ないしは合成する手段」に用いても良いし、請求項
５の確信度を認識する手段に用いても良い。この場合、
学習が不要であれば、必ずしも学習を行う必要はない。Although the neural network 121 is used here in place of the table unit 120, it may also be used for other means, for example the "means for selecting or synthesizing an output from at least one of the input voice information and the associated voice information" in claim 4, or the means for recognizing the certainty factor in claim 5. In that case, if learning is unnecessary, learning need not be performed.

【００８０】また、文字、ないしは文字列からなる文書
に対応しないような調音器官の運動、たとえば、左右非
対称の運動を、「ｙｅｓ」，「ｎｏ」や、ポインティン
グなどに割り当て、認識させることにより、請求項２２
のように、ジェスチャーを認識することも可能である。In addition, by assigning articulator movements that do not correspond to any character or document consisting of character strings, for example left-right asymmetric movements, to "yes", "no", pointing, and the like, and having them recognized, gestures can also be recognized as in claim 22.

【００８１】［実施例９］第９図は本発明の実施例９を
示す。実施例９は、調音器官の運動から、文字情報を対
応付ける装置として実現されている。この実施例は図８
と同様にニューラルネットワーク１２１を用いるもので
ある。図９においても図１等と対応する箇所には対応す
る符号を付した。Ninth Embodiment FIG. 9 shows a ninth embodiment of the present invention. The ninth embodiment is realized as a device that associates character information with the movement of the articulator. Like the embodiment of FIG. 8, it uses the neural network 121. In FIG. 9 as well, portions corresponding to FIG. 1 and other figures are denoted by corresponding reference numerals.

【００８２】実施例８と同様に、マイクロホン１００
と、調音器官に接触する３次元デジタイザ部１１０を用
い、適宜の処理を加えた結果を、記録手段１７０に随
時、ないしは指示に基づいて記録する。次に、この記録
を基に、実施例８のようにニューラルネットワークの学
習を行う。認識については実施例８と同様とする。As in the eighth embodiment, the microphone 100 and the three-dimensional digitizer unit 110 in contact with the articulator are used, and the results after appropriate processing are recorded in the recording means 170 at any time or in response to an instruction. Next, the neural network is trained on the basis of these records as in the eighth embodiment. Recognition is performed in the same way as in the eighth embodiment.

【００８３】この実施例によれば、利用中に随時学習を
行うことが可能であり、利用するにしたがって認識率を
向上させることが可能となる。According to this embodiment, learning can be performed at any time during use, and the recognition rate can be improved with use.

【００８４】また、認識結果を利用者に提示する提示手
段１６０を用いると、利用者は、調音器官の運動を、認
識率が高くなるように訓練することが可能になる。この
実施によれば、他に、聴覚障害者の発話訓練に適用する
ことも可能である。ここで、提示は視覚的に提示しても
良いし、音強敵に提示しても良い。Further, by using the presentation means 160 that presents the recognition result to the user, the user can practice moving the articulator so that the recognition rate increases. This embodiment can therefore also be applied to speech training for hearing-impaired persons. The presentation may be visual or acoustic.

【００８５】［実施例１０］第１０図は本発明の実施例
１０を示す。この実施例１０は音声認識装置として実現
されている。図１０においても図１等と対応する箇所に
は対応する符号を付した。[Embodiment 10] FIG. 10 shows Embodiment 10 of the present invention. The tenth embodiment is realized as a speech recognition device. In FIG. 10 as well, portions corresponding to FIG. 1 and other figures are denoted by corresponding reference numerals.

【００８６】マイクロホン１００およびビデオカメラ１
１１により入力した音声及びビデオ画像、ないしは、ビ
デオテープレコーダー１８０に記録し、再生した音声及
びビデオ画像において、発話している音声と、調音器官
の映像から、確信度の高い音声認識を行う例を示す。This is an example in which highly reliable speech recognition is performed from the spoken voice and the image of the articulator, using audio and video images input from the microphone 100 and the video camera 111, or audio and video images recorded on and played back from the video tape recorder 180.

【００８７】入力ないしは再生された音声から、音声認
識部１３２により、音声を認識する。併せて、確信度１
９１を算出する。The voice recognition unit 132 recognizes the speech from the input or reproduced audio and, at the same time, calculates the certainty factor 191.

【００８８】また、再生された画像中の調音器官の画像
から、オプティカルフロー解析部１５０により、調音器
官の運動を認識する。認識結果と併せて、確信度１９２
を算出する。Further, the optical flow analysis unit 150 recognizes the movement of the articulator from the image of the articulator in the reproduced video and, together with the recognition result, calculates the certainty factor 192.

【００８９】これらの２組の認識結果と確信度に基づ
き、認識結果判定部１９０により、いずれか確信度の高
いほうの認識結果を選択し、出力する。Based on these two sets of recognition results and certainty factors, the recognition result determination unit 190 selects and outputs the recognition result with the higher certainty factor.

【００９０】[0090]

【発明の効果】上述したように、この発明によれば、音
声情報を入力する際に、音声情報と調音器官の外見との
対応に基づき、発話を前提としない入力方法を提供する
ことができる。また、併せて、発話と調音器官の外見と
の対応を、使用や習熟に応じて学習することにより精度
を向上させることができる。As described above, according to the present invention, when voice information is input, an input method that does not presuppose speaking aloud can be provided on the basis of the correspondence between the voice information and the appearance of the articulator. In addition, accuracy can be improved by learning the correspondence between utterances and the appearance of the articulator as the device is used and the user becomes proficient.

[Brief description of the drawings]

【第１図】本発明の実施例１を示す図である。FIG. 1 is a diagram showing a first embodiment of the present invention.

【第２図】本発明の実施例２を示す図である。FIG. 2 is a diagram showing a second embodiment of the present invention.

【第３図】本発明の実施例３を示す図である。FIG. 3 is a diagram showing a third embodiment of the present invention.

【第４図】本発明の実施例４を示す図である。FIG. 4 is a diagram showing a fourth embodiment of the present invention.

【第５図】本発明の実施例５の一例を示す図である。FIG. 5 is a diagram showing an example of Embodiment 5 of the present invention.

【第６図】本発明の実施例６を示す図である。FIG. 6 is a view showing a sixth embodiment of the present invention.

【第７図】本発明の実施例７を示す図である。FIG. 7 is a diagram showing a seventh embodiment of the present invention.

【第８図】本発明の実施例８を示す図である。FIG. 8 is a diagram showing an eighth embodiment of the present invention.

【第９図】本発明の実施例９を示す図である。FIG. 9 is a view showing a ninth embodiment of the present invention.

【第１０図】本発明の実施例１０を示す図である。FIG. 10 is a diagram showing a tenth embodiment of the present invention.

[Explanation of symbols]

１００…マイクロホン (100: Microphone)
１１０…３次元デジタイザ部 (110: Three-dimensional digitizer unit)
１１１…ビデオカメラ (111: Video camera)
１２０…テーブル部 (120: Table unit)
１２１…ニューラルネットワーク (121: Neural network)
１３０…スペクトラム情報算出部 (130: Spectrum information calculation unit)
１３１…音素認識部 (131: Phoneme recognition unit)
１３２…音声認識部 (132: Voice recognition unit)
１３３…音声変換部 (133: Voice conversion unit)
１４０…音声合成部 (140: Voice synthesis unit)
１４１…切り替え部 (141: Switching unit)
１５０…オプティカルフロー解析部 (150: Optical flow analysis unit)
１６０…提示手段 (160: Presentation means)
１８０…ビデオテープレコーダー (180: Video tape recorder)
１９０…認識結果判定部 (190: Recognition result determination unit)
１９１…音声認識の確信度 (191: Certainty factor of speech recognition)
１９２…オプティカルフロー解析部の確信度 (192: Certainty factor of the optical flow analysis unit)
２００…テクスチャーマッピング部 (200: Texture mapping unit)
２１０…表示部 (210: Display unit)
２２０…伝送路 (220: Transmission path)

フロントページの続き (72)発明者竹内伸神奈川県足柄上郡中井町境430 グリーンテクなかい富士ゼロックス株式会社内Ｆターム(参考） 5D015 CC00 LL07 9A001 BB04 HZ05 KZ20 Continued from the front page: (72) Inventor: Shin Takeuchi, c/o Fuji Xerox Co., Ltd., Green Tech Nakai, 430 Sakai, Nakai-machi, Ashigarakami-gun, Kanagawa. F-terms (reference): 5D015 CC00 LL07; 9A001 BB04 HZ05 KZ20

Claims

[Claims]

1. A linguistic information associating device for associating the movement of an articulator with linguistic information other than the movement of the articulator, comprising: articulator form input means for generating input data by measuring, from at least a part of the articulator and its surrounding skin, the movement of the articulator that occurs when a speaker speaks; means for inputting linguistic information other than the movement of the articulator; and means for associating the movement of the articulator during speech with the linguistic information input from other than the movement of the articulator.

2. The linguistic information associating device according to claim 1, wherein the linguistic information other than the movement of the articulator is speech information.

3. A linguistic information associating device for associating the movement of an articulator with linguistic information other than the movement of the articulator, comprising: means for instructing a speaker to speak and indicating the content to be spoken; articulator form input means for generating input data by measuring, from at least a part of the articulator and its surrounding skin, the movement of the articulator that occurs when the speaker speaks; and means for associating the movement of the articulator during speech with the linguistic information input from other than the movement of the articulator.

4. A linguistic information associating device for associating the movement of an articulator with linguistic information other than the movement of the articulator, comprising: articulator form input means for generating input data by measuring, from at least a part of the articulator and its surrounding skin, the movement of the articulator that occurs when a speaker speaks; means for inputting information consisting of phonetic characters, phonetic character strings, arbitrary symbols, or symbol strings; and means for associating the movement of the articulator during speech with the information input from other than the movement of the articulator.

5. A linguistic information estimating device for estimating, from the movement of an articulator, linguistic information other than the movement of the articulator, comprising: association recognizing means for recognizing a correspondence between the movement of the articulator during speech and linguistic information input from other than the movement of the articulator; articulator form input means for generating input data by measuring, from at least a part of the articulator and its surrounding skin, the movement of the articulator that occurs when a speaker speaks; and means for acquiring the corresponding linguistic information from the movement of the articulator during speech, based on the correspondence recognized by the association recognizing means.

6. An articulator motion estimation device for estimating the movement of an articulator based on information other than the movement of the articulator, comprising: association recognizing means for recognizing a correspondence between the movement of the articulator during speech and information input from other than the movement of the articulator; means for inputting linguistic information other than the movement of the articulator; and means for estimating the movement of the articulator during speech from the linguistic information other than the movement of the articulator, based on the correspondence recognized by the association recognizing means.

7. A linguistic information detecting device for detecting linguistic information such as voice, comprising: association recognizing means for recognizing a correspondence between the movement of an articulator during speech and linguistic information input from other than the movement of the articulator; linguistic information input means for inputting linguistic information other than the movement of the articulator; articulator form input means for generating input data by measuring, from at least a part of the articulator and its surrounding skin, the movement of the articulator that occurs when a speaker speaks; linguistic information acquiring means for acquiring the corresponding linguistic information from the movement of the articulator during speech, based on the correspondence recognized by the association recognizing means; and linguistic information detecting means for detecting linguistic information based on the linguistic information acquired from the movement of the articulator by the linguistic information acquiring means and the linguistic information input from other than the movement of the articulator by the linguistic information input means.

8. The linguistic information detecting device according to claim 7, further comprising means for recognizing a correspondence between movement of the speaker's organs and semantic information such as a gesture, wherein the semantic information is detected using the correspondence.

9. The linguistic information detecting device according to claim 7, wherein the linguistic information detecting means includes at least one of: selecting means for selecting between the linguistic information acquired from the movement of the articulator by the linguistic information acquiring means and the linguistic information input from other than the movement of the articulator by the linguistic information input means; and synthesizing means for generating a detection result by combining the linguistic information acquired from the movement of the articulator by the linguistic information acquiring means with the linguistic information input from other than the movement of the articulator by the linguistic information input means.

10. The linguistic information detecting device according to claim 9, comprising at least one of means for recognizing the certainty factor of the linguistic information associated with the movement of the articulator and means for recognizing the certainty factor of the linguistic information input from other than the movement of the articulator, wherein, according to one or both certainty factors, either the linguistic information acquired from the movement of the articulator or the linguistic information input from other than the movement of the articulator is selected.

11. The linguistic information detecting device according to claim 9, wherein the synthesizing means includes at least one of means for recognizing the certainty factor of the linguistic information associated with the movement of the articulator and means for recognizing the certainty factor of the linguistic information input from other than the movement of the articulator, and combines, according to one or both certainty factors, the linguistic information acquired from the movement of the articulator with the linguistic information input from other than the movement of the articulator.

12. The linguistic information detecting device according to claim 10, wherein the means for recognizing the certainty factor of the linguistic information input from other than the movement of the articulator includes means for recognizing whether the utterance is voiced or silent.

13. The linguistic information detecting apparatus according to claim 7, wherein said articulator organ shape input means includes means for inputting a shape of a skin.

14. The linguistic information detecting device according to claim 13, wherein the means for inputting the shape of the outer skin includes means for inputting an image of the outer skin.

15. The linguistic information detecting device according to claim 7, wherein the association recognizing means, the linguistic information input means, and the linguistic information acquiring means include a neural network.

16. The linguistic information detecting device according to claim 15, wherein said association recognizing means performs learning of a neural network.

17. The linguistic information detecting device according to claim 7, further comprising means for recording the movement of the articulator during speech and the information input from a source other than the movement of the articulator.

18. The linguistic information detecting device according to claim 17, wherein the correspondence is recorded and updated, either at any time or in response to an instruction, based on the movement of the articulator during speech and the information input from a source other than the movement of the articulator.

19. The linguistic information detecting device according to claim 18, further comprising: means for calculating the certainty factor of the movement of the articulator during speech; means for calculating the certainty factor of the information input from a source other than the movement of the articulator; and means for controlling the recording and updating in accordance with the calculated certainty factors.

20. The linguistic information detecting device according to claim 7, further comprising: means for calculating the certainty factor of the movement of the articulator during speech; and means for calculating the certainty factor of the information input from a source other than the movement of the articulator, wherein, when the certainty factor of one of the inputs is equal to or less than a predetermined value, a detection result is output based on the other input and the characteristic corresponding to that input, and when both certainty factors are equal to or greater than the predetermined value, the correspondence is updated according to the obtained certainty factors and a detection result is output at the same time.

21. The linguistic information detecting device according to claim 7, further comprising means for presenting to a person, when speech recognition is performed, a recognition result based on the motion information of the articulator, a result of recognizing the information input from a source other than the motion of the articulator, or a result of selecting or synthesizing these.

22. A medium on which is recorded the correspondence between the movement of the articulator during speech and input information other than the movement of the articulator, as associated by the linguistic information associating device according to claim 1.

23. A telephone device comprising the linguistic information associating device according to claim 1, 2, 3 or 4, the linguistic information estimating device according to claim 5, the articulator movement estimating device according to claim 6, or the linguistic information detecting device according to claim 7.
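
As an illustration only, the certainty-based selection of claim 10, the synthesis of claim 11, and the threshold behavior of claim 20 might be sketched as follows. The claims do not specify any of this; the names `Estimate`, `select`, `synthesize`, `detect` and `THRESHOLD`, the per-label score representation, the weighted-sum rule, the treatment of a certainty factor exactly equal to the threshold, and the result returned when both inputs are uncertain are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical "predetermined value"; the claims give no number.
THRESHOLD = 0.5


@dataclass
class Estimate:
    """Per-label scores from one input channel and that channel's certainty factor."""
    scores: Dict[str, float]
    certainty: float


def best(estimate: Estimate) -> str:
    """Return the highest-scoring label (ties broken alphabetically)."""
    return max(sorted(estimate.scores), key=lambda label: estimate.scores[label])


def select(motion: Estimate, other: Estimate) -> str:
    """Selection in the sense of claim 10: use the more certain channel."""
    return best(motion if motion.certainty >= other.certainty else other)


def synthesize(motion: Estimate, other: Estimate) -> str:
    """Synthesis in the sense of claim 11, assumed here to be a
    certainty-weighted sum of the per-label scores."""
    labels = sorted(set(motion.scores) | set(other.scores))
    combined = {
        label: motion.certainty * motion.scores.get(label, 0.0)
        + other.certainty * other.scores.get(label, 0.0)
        for label in labels
    }
    return max(labels, key=lambda label: combined[label])


def detect(
    motion: Estimate, other: Estimate, threshold: float = THRESHOLD
) -> Tuple[Optional[str], bool]:
    """Threshold behavior of claim 20.

    Returns (detection result, whether the correspondence should be updated).
    Assumption: a certainty factor equal to the threshold counts as certain.
    """
    motion_ok = motion.certainty >= threshold
    other_ok = other.certainty >= threshold
    if motion_ok and other_ok:
        # Both inputs are certain: output a result and update the correspondence.
        return synthesize(motion, other), True
    if motion_ok:
        return best(motion), False  # rely on articulator movement only
    if other_ok:
        return best(other), False  # rely on the other input (e.g. voice) only
    return None, False  # both uncertain: not covered by the claim (assumption)


if __name__ == "__main__":
    lips = Estimate({"a": 0.6, "i": 0.4}, certainty=0.9)
    noisy_voice = Estimate({"a": 0.3, "i": 0.7}, certainty=0.2)
    print(detect(lips, noisy_voice))
```

In this sketch, a noisy voice input with a low certainty factor leaves the result to the articulator movement alone and suppresses the update, so that an unreliable pairing is not written into the recorded correspondence.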