JP3478269B2

JP3478269B2 - Voice input device

Info

Publication number: JP3478269B2
Application number: JP2001004698A
Authority: JP
Inventors: 謙二松井
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1992-11-02
Filing date: 2001-01-12
Publication date: 2003-12-15
Anticipated expiration: 2018-12-15
Also published as: JP2001255891A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a voice input device, and more particularly to a voice input device that supplies a stable voice signal to a device such as a voice recognition device.

[0002]

[Description of the Related Art] When voice is input to a voice recognition device, a microphone is used as the voice input device in almost all cases. Microphone types include handheld, desktop, tie-pin, close-talking (a head-mounted type used by telephone operators and the like), and handset types. Of these, close-talking and handset microphones yield relatively good voice recognition performance, because the distance between the mouth and the microphone is short.

[0003] As another kind of voice input device, devices are known that use a microphone and a television camera to capture voice together with an image of the lip portion or of the face (Japanese Laid-Open Patent Publication Nos. 60-188998 and 62-239231, among others). These devices aim to further improve voice recognition performance by using the image of the lip portion.

[0004] FIG. 5 shows the configuration of a conventional voice input device for a voice recognition device. In FIG. 5, voice uttered by the speaker is converted from sound into an electric signal by a voice input unit 110 such as a microphone, and the electric signal is output to the voice recognition device. The speaker's face image is converted into an image signal by an image input unit 120. A lip image extraction unit 130 extracts the signal corresponding to the lip portion from this image signal and outputs only that signal to the voice recognition device.

[0005]

[Problems to Be Solved by the Invention] In the conventional voice input device, the distance between the speaker's lip portion and the microphone is not kept constant, so a stable voice signal cannot be supplied to the voice recognition device. The voice recognition device therefore had to correct the input voice signal to account for variation in the distance between the speaker's lip portion and the microphone. In addition, the conventional voice input device had to extract the image of the lip portion from the speaker's face image. Performing this extraction with high accuracy is difficult, however, particularly when the speaker's face moves.

[0006] The objects of the present invention are (1) to provide a voice input device that supplies a stable voice signal to a voice recognition device, and (2) to provide a voice input device that does not require processing to extract the image of the lip portion from the speaker's face image.

[0007]

[Means for Solving the Problems] The voice input device of the present invention comprises: voice input means for converting input voice uttered by a speaker into an electric signal and outputting the electric signal; and display means that has a mirror for reflecting the speaker's lip portion and that indicates the spatial deviation between the position of the speaker's lip portion and a predetermined position determined in relation to the position of the voice input means.

[0008]

[Embodiments of the Invention] The voice input device of the present invention comprises voice input means for converting input voice uttered by a speaker into an electric signal and outputting the electric signal, and display means for displaying information indicating the spatial deviation between the position of the speaker's lip portion and a predetermined position. The voice input device may further comprise image input means for converting an input image of the speaker's lip portion into an electric signal and outputting the electric signal.

[0009] The display means may have a mirror surface for reflecting the speaker's lip portion, and the mirror surface may bear a mark that defines the predetermined position.

[0010] The predetermined position is preferably determined in relation to the position of the voice input means.

[0011] The voice input device may further comprise: image input means for converting an input image of the speaker's lip portion into an electric signal and outputting the electric signal; position specifying means for specifying the position of the speaker's lip portion based on the electric signal output from the image input means; and position comparing means for comparing the position of the speaker's lip portion specified by the position specifying means with the predetermined position and outputting a comparison result. In that case, the display means may display, based on the comparison result from the position comparing means, information indicating the spatial deviation between the position of the speaker's lip portion and the predetermined position.

[0012] The predetermined position is preferably determined in relation to the position of the voice input means and the position of the image input means.

[0013] The voice input device may be connected to a voice processing device. The voice processing device includes image input means for converting an input image of the speaker's lip portion into an electric signal and outputting the electric signal. The voice processing device also includes position specifying means for specifying the position of the speaker's lip portion based on the electric signal output from the image input means of the voice input device, and position comparing means for comparing the position of the speaker's lip portion specified by the position specifying means with the predetermined position and outputting a comparison result. The display means of the voice input device may display, based on the comparison result from the position comparing means of the voice processing device, information indicating the spatial deviation between the position of the speaker's lip portion and the predetermined position.

[0014] The predetermined position is preferably determined in relation to the position of the voice input means and the position of the image input means.

[0015] The present invention is described below with reference to embodiments.

[0016] (First Embodiment) FIG. 1(a) shows the configuration of a voice input device according to a first embodiment of the present invention. The voice input device has a voice input unit 1 and a display unit 2. The voice input unit 1 converts input voice into an electric signal and outputs the electric signal to an external device (not shown). A microphone, for example, is used as the voice input unit 1. The external device may be, for example, a voice processing device or a voice recognition device. The display unit 2 displays the spatial deviation between the position of the speaker's lip portion and a predetermined position. The predetermined position is determined in advance so that the voice input to the voice input unit 1 is obtained most stably when the position of the speaker's lip portion coincides with it. For example, the display unit 2 has a mirror surface, such as a mirror, that reflects at least the speaker's lip portion, and the mirror surface bears a mark that defines the predetermined position. The mark may have any shape that lets the speaker recognize the predetermined position with which the lip portion should be aligned, for example a cross, an ellipse, or a trapezoid. The speaker utters while keeping the lip portion positioned so that its reflection in the mirror surface coincides with the mark on the mirror surface. In this way, the speaker can always keep the lip portion at an appropriate position relative to the voice input unit 1. As a result, stable voice is always supplied to the voice input unit 1.

[0017] As shown in FIG. 1(b), the voice input device may further have an image input unit 3 that converts an input image into an electric signal and outputs the electric signal to an external device (not shown). A television camera or a CCD element, for example, is used as the image input unit 3. The input image includes at least an image of the speaker's lip portion. When the voice input device is connected to an external device such as a voice recognition device, the image of the speaker's lip portion is used, for example, to detect the interval in which voice is present on the basis of the voice signal output from the voice input unit 1. This reduces the probability that noise around the speaker is misrecognized as voice in a noisy environment, particularly one in which the noise level is non-stationary because of music, human voices, and the like. When the voice input device has the image input unit 3, it preferably also has a light source unit 4 that illuminates at least the speaker's lip portion. This ensures sufficient illuminance to obtain an accurate image of the speaker's lip portion.
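The patent does not say how the external device combines the lip image with the voice signal to find the interval in which voice is present. The following is only a minimal sketch of one plausible approach, assuming that a per-frame audio energy and a per-frame lip-motion measure are already available; the function names, thresholds, and the "both must exceed a threshold" rule are all hypothetical.

```python
from typing import List, Sequence, Tuple


def speech_frames(audio_energy: Sequence[float],
                  lip_motion: Sequence[float],
                  energy_threshold: float,
                  motion_threshold: float) -> List[bool]:
    """Flag a frame as speech only if it is loud AND the lips are moving.

    Assumption (not from the patent): requiring lip motion rejects loud
    background noise such as music or other people's voices.
    """
    return [e > energy_threshold and m > motion_threshold
            for e, m in zip(audio_energy, lip_motion)]


def to_intervals(flags: Sequence[bool]) -> List[Tuple[int, int]]:
    """Convert per-frame flags into half-open (start, end) frame intervals."""
    intervals = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # interval opens
        elif not flag and start is not None:
            intervals.append((start, i))   # interval closes
            start = None
    if start is not None:
        intervals.append((start, len(flags)))
    return intervals


# Frame 1 is loud but the lips are still, so it is treated as noise.
energy = [0.1, 0.9, 0.8, 0.9, 0.1]
motion = [0.0, 0.0, 0.5, 0.6, 0.0]
flags = speech_frames(energy, motion, energy_threshold=0.5, motion_threshold=0.3)
print(to_intervals(flags))
```

A real system would likely smooth the flags over time; that is omitted here to keep the sketch short.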

[0018] FIG. 2 shows a specific configuration example of the voice input device according to the first embodiment of the present invention. This voice input device has a housing 5 that includes a microphone 6 and a mirror 9. Voice uttered by the speaker is input to the microphone 6. The mirror 9 is arranged so that the distance between the speaker's lip portion and the microphone 6 is kept substantially constant when the speaker utters with the center of the lip portion reflected in the mirror 9 aligned with a mark 10 on the mirror 9. Experiments have confirmed that the appropriate distance between the speaker's lip portion and the microphone 6 is a substantially constant value regardless of the speaker. Nevertheless, the position and/or angle of the mirror 9 may be made adjustable so that the distance between the speaker's lip portion and the microphone 6 can be fine-tuned for each speaker.

[0019] In this example, the housing 5 is further provided with a light emitting diode 7 that illuminates the speaker's lip portion and a light receiving unit 8 for inputting at least an image of the speaker's lip portion.

[0020] Next, a method of inputting voice to the voice input device having the above configuration is described. The speaker adjusts the positional relationship between the lip portion and the housing 5 so that the lip portion is reflected in the mirror 9 and the center of the lip portion coincides with the mark 10 on the mirror 9. The housing 5 preferably has a shape such that the positional relationship between the speaker's lip portion and the housing 5 is determined almost uniquely when the speaker holds the housing 5 naturally in the hand. This keeps the degrees of freedom of the spatial deviation between the position of the speaker's lip portion and the predetermined position below a certain level. For example, a shape resembling the handset of an ordinary telephone is such a preferred shape for the housing 5. The speaker then utters while holding the housing 5 so that the center of the lip portion reflected in the mirror 9 coincides with the mark 10 on the mirror 9. During the utterance, the speaker's lip portion is illuminated by the light emitting diode 7, which ensures sufficient illuminance for the light receiving unit 8 to obtain an image of the speaker's lip portion.

[0021] With the voice input device of the first embodiment of the present invention, the distance between the speaker's lip portion and the voice input unit 1 can always be kept substantially constant, so the speaker can always keep the lip portion at an appropriate position relative to the voice input unit 1. As a result, a stable voice signal is always output from the voice input unit 1. Consequently, when the voice input device is connected to an external device such as a voice recognition device, the external device does not need to correct the input voice signal to account for variation in the distance between the speaker's lip portion and the voice input unit 1.

[0022] (Second Embodiment) FIG. 3 shows the configuration of a voice input device according to a second embodiment of the present invention. The voice input device has a voice input unit 1, an image input unit 3, and a display unit 2. The voice input unit 1 converts input voice into an electric signal and outputs the electric signal to an external device (not shown). The image input unit 3 converts the input image of the lip portion into an electric signal and outputs the electric signal to an external device (not shown). The display unit 2 displays the spatial deviation between the position of the speaker's lip portion and a predetermined position. The predetermined position is determined in advance so that, when the position of the speaker's lip portion coincides with it, the voice signal output from the voice input unit 1 is obtained most stably and only the image of the lip portion is input to the image input unit 3. A liquid crystal display or a CRT, for example, is used as the display unit 2. As in the first embodiment, the voice input device preferably also has a light source unit 4 that illuminates the speaker's lip portion, in order to ensure sufficient illuminance to obtain an accurate image of the speaker's lip portion.

[0023] In the voice input device of the second embodiment of the present invention, only the image of the speaker's lip portion is input to the image input unit 3, so it is sufficient for the image input unit 3 to have, for example, 32 × 32 photoelectric conversion elements arranged in a matrix. Conventionally, 512 × 512 photoelectric conversion elements were normally used to input an image of the speaker's entire face. Inputting only the image of the speaker's lip portion to the image input unit 3 therefore allows the number of photoelectric conversion elements to be reduced significantly. More importantly, this reduction makes the frequency band of the voice signal output from the voice input unit 1 and the frequency band of the image signal output from the image input unit 3 substantially the same. For example, when 32 × 32 pixels are driven at 100 ms per frame, about 10,000 pixels are driven per second. In this case, the frequency band of the image signal is about 10 kHz. This allows an external device connected to the voice input device, such as a voice recognition device, to process the voice signal and the image signal with the same processor.
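The pixel-rate arithmetic in this paragraph can be checked with a short calculation. This is only an illustrative sketch: the function name is hypothetical, and driving the conventional 512 × 512 sensor at the same 10 frames per second is an assumption made for comparison, not a figure from the patent.

```python
def pixels_per_second(width: int, height: int, frames_per_second: int) -> int:
    """Number of photoelectric conversion elements read out per second."""
    return width * height * frames_per_second


# 100 ms per frame corresponds to 10 frames per second.
FRAMES_PER_SECOND = 10

lip_rate = pixels_per_second(32, 32, FRAMES_PER_SECOND)     # lip-only sensor
# Assumption: the full-face sensor is driven at the same frame rate.
face_rate = pixels_per_second(512, 512, FRAMES_PER_SECOND)

print(f"lip-only sensor : {lip_rate} pixels/s (about 10 kHz)")
print(f"full-face sensor: {face_rate} pixels/s")
print(f"reduction factor: {face_rate // lip_rate}")
```

The lip-only rate of 10,240 pixels per second is the "about 10,000 pixels" quoted in the text, which is why the image signal falls in roughly the same band as the voice signal.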

[0024] The voice input device further has a position specifying unit 11 and a position comparing unit 12. These are used to display, on the display unit 2, the spatial deviation between the position of the speaker's lip portion and the predetermined position.

[0025] The position specifying unit 11 specifies the position of the speaker's lip portion based on the electric signal, output from the image input unit 3, that represents the image of the speaker's lip portion. Pattern matching is a simple and effective way for the position specifying unit 11 to specify the position of the speaker's lip portion. For example, the position specifying unit 11 uses the knowledge that the shape of the lips approximates an ellipse to specify the position of the lip portion from the grayscale information of the input image. More specifically, as shown in FIG. 4, the position specifying unit 11 estimates an ellipse function that substantially matches the outline of the lip portion, and specifies the position of the estimated ellipse function as the position of the lip portion. As shown in FIG. 5, a trapezoid function may be used instead of an ellipse function.
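The patent does not give the pattern-matching procedure itself. As a rough illustration of estimating an ellipse from grayscale information, the sketch below swaps in a simple moment-based fit: it assumes lip pixels are darker than a hypothetical threshold, takes their centroid as the lip position, and derives the semi-axes from the second moments (for a uniformly filled ellipse the variance along a semi-axis of length a is a²/4). The names `LipEllipse` and `estimate_lip_ellipse` and the threshold value are assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LipEllipse:
    cx: float          # ellipse center, x (pixels)
    cy: float          # ellipse center, y (pixels)
    semi_major: float  # estimated semi-major axis (pixels)
    semi_minor: float  # estimated semi-minor axis (pixels)


def estimate_lip_ellipse(image: List[List[int]],
                         threshold: int = 100) -> Optional[LipEllipse]:
    """Fit an ellipse to the dark pixels of a grayscale image (0..255).

    Assumption (not from the patent): pixels below `threshold` belong to
    the lip portion. Returns None if no such pixel exists.
    """
    pts = [(x, y) for y, row in enumerate(image)
           for x, value in enumerate(row) if value < threshold]
    if not pts:
        return None
    n = len(pts)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    sxx = sum((x - cx) ** 2 for x, _ in pts) / n
    syy = sum((y - cy) ** 2 for _, y in pts) / n
    sxy = sum((x - cx) * (y - cy) for x, y in pts) / n
    # Eigenvalues of the 2x2 covariance matrix give the axis variances.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)
    return LipEllipse(cx, cy, 2 * math.sqrt(lam_major), 2 * math.sqrt(lam_minor))


# Synthetic 32x32 test image: a dark ellipse (a=8, b=3) centered at (16, 20).
test_image = [[30 if ((x - 16) / 8) ** 2 + ((y - 20) / 3) ** 2 <= 1 else 200
               for x in range(32)] for y in range(32)]
fit = estimate_lip_ellipse(test_image)
print(fit)
```

A fit of this kind is sensitive to shadows and uneven lighting, which is one reason a light source aimed at the lip portion helps.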

[0026] The position comparing unit 12 compares the position of the lip portion specified by the position specifying unit 11 with the predetermined position and outputs a comparison result. The predetermined position is stored in advance in the position comparing unit 12 so that, when the position of the lip portion specified by the position specifying unit 11 coincides with it, the distance between the speaker's lip portion and the voice input unit 1 is kept substantially constant and the distance between the speaker's lip portion and the image input unit 3 is also kept substantially constant. The comparison result obtained by the position comparing unit 12 is supplied to the display unit 2. Based on the comparison result, the display unit 2 displays the spatial deviation between the position of the speaker's lip portion and the predetermined position.
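The comparison step can be pictured as a small component that stores the predetermined position and reports the offset of the detected lip position from it. This is a hedged sketch only: the class and field names, the pixel coordinate convention (x to the right, y downward), and the tolerance parameter are all assumptions rather than details from the patent.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Deviation:
    dx: float  # lip position minus target, x (pixels, right is positive)
    dy: float  # lip position minus target, y (pixels, down is positive)

    @property
    def magnitude(self) -> float:
        return math.hypot(self.dx, self.dy)


class PositionComparer:
    """Holds the predetermined position and compares lip positions with it."""

    def __init__(self, target_x: float, target_y: float, tolerance: float = 1.0):
        self.target_x = target_x
        self.target_y = target_y
        self.tolerance = tolerance  # hypothetical "close enough" radius

    def compare(self, lip_x: float, lip_y: float) -> Deviation:
        return Deviation(lip_x - self.target_x, lip_y - self.target_y)

    def is_aligned(self, lip_x: float, lip_y: float) -> bool:
        return self.compare(lip_x, lip_y).magnitude <= self.tolerance


comparer = PositionComparer(target_x=16, target_y=16, tolerance=1.5)
result = comparer.compare(19, 12)
print(result, result.magnitude, comparer.is_aligned(19, 12))
```

The `Deviation` value plays the role of the comparison result that would be passed on to the display unit.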

[0027] Next, the ways in which the display unit 2 displays the spatial deviation between the position of the speaker's lip portion and the predetermined position are described. In general, displaying a spatial deviation requires showing it three-dimensionally. In practice, however, the coordinate of the lip portion in the direction perpendicular to the plane that the lip portion touches can be regarded as substantially constant, so it is sufficient to show the spatial deviation two-dimensionally or one-dimensionally.

[0028] FIGS. 6(a) to 6(e) show examples of how the spatial deviation between the position of the speaker's lip portion and the predetermined position is displayed. FIG. 6(a) shows an example in which the spatial deviation is displayed one-dimensionally with a single indicator. For example, the indicator level may increase when the spatial deviation is small and decrease when the spatial deviation is large. As shown in FIG. 6(b), a plurality of indicators may be used so that the direction of the spatial deviation is also shown. FIG. 6(c) shows an example in which the spatial deviation is displayed two-dimensionally by the degree of overlap of two circles. In this example, the position of the speaker's lip portion is drawn with a solid line and the predetermined position with a broken line. FIG. 6(d) shows an example in which the spatial deviation is displayed two-dimensionally with an arrow indicating the direction of the deviation. In this example, an arrow pointing in one of eight directions (up, down, right, left, upper right, upper left, lower right, lower left) is displayed. For example, when the position of the speaker's lip portion is displaced upward from the predetermined position, an arrow pointing down is displayed. FIG. 6(e) shows an example in which the image of the lip portion input by the image input unit 3 is displayed together with a mark indicating the predetermined position, so that the spatial deviation between the two is displayed two-dimensionally. Those skilled in the art will understand that the same effect can be obtained by generating a warning sound that indicates the spatial deviation, instead of or in addition to displaying the spatial deviation as described above.
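Two of the display modes described here, the single indicator of FIG. 6(a) and the eight-direction arrow of FIG. 6(d), can be sketched as simple mappings from a deviation to a display value. The function names, the dead zone, the number of indicator levels, and the coordinate convention (deviation given in the speaker's frame, x to the right, y downward) are assumptions for illustration only.

```python
import math


def indicator_level(magnitude: float, max_level: int = 5,
                    full_scale: float = 10.0) -> int:
    """FIG. 6(a) style: small deviation gives a high level, large gives low."""
    clipped = min(max(magnitude, 0.0), full_scale)
    return int(round(max_level * (1.0 - clipped / full_scale)))


def correction_arrow(dx: float, dy: float, dead_zone: float = 1.0) -> str:
    """FIG. 6(d) style: arrow direction opposite to the lip deviation.

    dx, dy: lip position minus predetermined position (y grows downward),
    so a lip portion displaced upward (dy < 0) yields a "down" arrow.
    Returns "none" when the deviation is inside the dead zone.
    """
    horizontal = ""
    if dx > dead_zone:
        horizontal = "left"
    elif dx < -dead_zone:
        horizontal = "right"
    vertical = ""
    if dy < -dead_zone:
        vertical = "down"
    elif dy > dead_zone:
        vertical = "up"
    if not horizontal and not vertical:
        return "none"
    if horizontal and vertical:
        prefix = "upper" if vertical == "up" else "lower"
        return f"{prefix} {horizontal}"
    return horizontal or vertical


# Lip portion displaced upward by 3 pixels: show a downward arrow.
print(correction_arrow(0.0, -3.0), indicator_level(math.hypot(0.0, -3.0)))
```

The dead zone keeps the arrow from flickering when the lip portion is already close to the predetermined position; whether the patented device does anything similar is not stated.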

[0029] FIG. 7 shows a specific configuration example of the voice input device according to the second embodiment of the present invention. This voice input device has a housing 5 that includes a microphone 6, a light emitting diode 7, and a light receiving unit 8. A liquid crystal display 13 may be provided on the housing 5, but it is preferably separate from the housing 5 so that the speaker can easily see the spatial deviation. The position specifying unit 11 and the position comparing unit 12 are housed inside the housing 5 and are therefore not shown in FIG. 7.

[0030] Next, a method of inputting voice to the voice input device having the above configuration is described. Before uttering, the speaker adjusts the positional relationship between the lip portion and the housing 5 so that the spatial deviation between the position of the lip portion and the predetermined position, as displayed on the liquid crystal display 13, becomes substantially zero. The speaker then utters while holding the housing 5 so that the displayed spatial deviation remains substantially zero. During the utterance, the speaker's lip portion is illuminated by the light emitting diode 7, which ensures sufficient illuminance for the light receiving unit 8 to obtain an image of the speaker's lip portion.

[0031] With the voice input device of the second embodiment of the present invention, the distance between the speaker's lip portion and the voice input unit 1 and the distance between the speaker's lip portion and the image input unit 3 can both be kept substantially constant at all times, so the speaker can always keep the lip portion at an appropriate position relative to the voice input unit 1 and the image input unit 3. As a result, a stable voice signal is always output from the voice input unit 1, and a stable image signal is always output from the image input unit 3. In addition, because only the image of the lip portion is input to the image input unit 3, the voice input device does not need to extract the image of the lip portion from a face image. A highly accurate image of the lip portion is thus obtained.

[0032] Furthermore, when the voice input device is connected to an external device such as a voice recognition device, the external device does not need to correct the input voice signal to account for variation in the distance between the speaker's lip portion and the voice input unit 1. Because an image signal corresponding to the lip portion is supplied to the external device, the external device also does not need to cut out the image of the lip portion. Moreover, because the frequency band of the voice signal output from the voice input unit 1 and the frequency band of the image signal output from the image input unit 3 are substantially the same, the external device can process the voice signal and the image signal with the same processor.

[0033] As described above, the voice input device of the second embodiment of the present invention has the position specifying unit 11 and the position comparing unit 12. However, these units do not necessarily have to be included in the voice input device. Rather, the position specifying unit 11 and the position comparing unit 12 are preferably included in an external device such as a voice recognition device. Such an external device usually has a processor for processing voice signals and image signals, and that processor can perform the processing of the position specifying unit 11 and the position comparing unit 12.

[0034] FIG. 8 shows the configuration of the voice input device when the position specifying unit 11 and the position comparing unit 12 are included in an external device. The functions and operations of the parts shown in FIG. 8 are the same as those in the second embodiment, so their description is omitted.

[0035] According to the present invention, the distance between the lip portion of the speaker and the voice input unit 1 can always be kept substantially constant, so the speaker can always keep his or her lip portion at an appropriate position relative to the voice input unit 1. As a result, a stable audio signal is always output from the voice input unit 1.

[0036] Further, according to the voice input device of the second embodiment, the distance between the lip portion of the speaker and the image input unit 3 can also always be kept substantially constant, so the speaker can always keep his or her lip portion at an appropriate position relative to the image input unit 3. As a result, a stable image signal is always output from the image input unit 3. In addition, since only the image of the lip portion is input to the image input unit 3, the image input unit 3 does not need to extract the image of the lip portion from a face image. This makes it possible to obtain a highly accurate image of the lip portion. Furthermore, since the frequency band of the audio signal output from the voice input unit 1 and the frequency band of the image signal output from the image input unit 3 are made substantially the same, an external device such as a voice recognition device can process the audio signal and the image signal with the same processor.

【００３７】[0037]

[Effect of the Invention] According to the present invention, the distance between the lip portion of the speaker and the voice input unit can always be kept substantially constant, so the speaker can always keep his or her lip portion at an appropriate position relative to the voice input unit. As a result, a stable audio signal is always output from the voice input unit 1.

[Brief description of drawings]

FIG. 1A is a block diagram showing a configuration of a voice input device according to a first embodiment of the present invention, and FIG. 1B is a block diagram showing a configuration of a voice input device according to the first embodiment of the present invention.

FIG. 2 is a diagram showing a specific configuration example of the voice input device according to the first embodiment of the present invention.

FIG. 3 is a block diagram showing a configuration of a voice input device according to a second embodiment of the present invention.

FIG. 4 is a diagram illustrating a method of identifying the position of the lip portion of the speaker.

FIG. 5 is a diagram illustrating a method of identifying the position of the lip portion of the speaker.

FIGS. 6A to 6E are diagrams each showing an example of a display mode of the spatial deviation between the position of the lip portion of the speaker and the predetermined position.

FIG. 7 is a diagram showing a specific configuration example of the voice input device according to the second embodiment of the present invention.

FIG. 8 is a block diagram showing a configuration of another voice input device according to the second embodiment of the present invention.

FIG. 9 is a block diagram showing a configuration of a conventional voice input device.

[Explanation of symbols]

1 Voice input unit
2 Display unit
3 Image input unit
4 Light source unit
11 Position specifying unit
12 Position comparing unit

Continuation of front page
(51) Int.Cl.⁷, identification code, FI: H04R 1/08; G10L 3/00 571H 511
(58) Fields searched (Int.Cl.⁷, DB name): G10L 15/00 - 15/28; G06F 3/16; G06T 7/00; G06T 7/20; H04R 1/02; H04R 1/08

Claims

(57) [Claims]

1. A voice input device comprising: voice input means for converting an input voice uttered by a speaker into an electric signal and outputting the electric signal; and display means, having a mirror for reflecting the lip portion of the speaker, for indicating a spatial deviation between a predetermined position determined in relation to the position of the voice input means and the position of the lip portion of the speaker.

2. The voice input device according to claim 1, further comprising image input means for converting an input image of the lip portion of the speaker into an electric signal and outputting the electric signal.

3. The voice input device according to claim 1, wherein the display means further has a mark for defining the spatial deviation.

4. The voice input device according to claim 1, wherein the display means can change the position and/or the angle of the mirror.

5. The voice input device according to claim 1, wherein the display means is installed in the same housing as the voice input means, at a position visible to the speaker.