JP2000338987A

JP2000338987A - Utterance start monitor, speaker identification device, voice input system, speaker identification system and communication system

Info

Publication number: JP2000338987A
Application number: JP11150614A
Authority: JP
Inventors: Mieko Osuga; 美恵子大須賀; Kazumichi Tsutsumi; 和道堤; Shusuke Yamasaki; 秀典山▲さき▼; Akira Sawada; 晃澤田; Hiromi Terashita; 裕美寺下
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1999-05-28
Filing date: 1999-05-28
Publication date: 2000-12-08

Abstract

PROBLEM TO BE SOLVED: To accurately, rapidly and easily decide whether an intended person is speaking or not, without imposing an unnecessary burden on a speaker, in surroundings wherein voice of others or sound from audio equipment or the like exists. SOLUTION: A deformation degree of lips is automatically acquired from an image of an object person 1 picked up by an object person imaging part 2, while an envelope curve of a voice is automatically acquired from the voice recorded by a sound recording part 3. A pattern matching part 8 calculates a similarity between time-series change patterns of the deformation degree of the lips and the envelope curve of the voice to identify a speaker. Thereby, the speaker can be identified without imposing an unnecessary burden on the object person 1, with a small information-processing load, at a high identification rate.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention proposes a speaker identification device that, by simultaneously using information about the lip movement of a subject whose speech is to be detected and voice information, identifies the speaker with a small information-processing load and a high identification rate. It relates to an utterance start monitoring device, a speaker identification device, a voice input system, a speaker identification system, and a communication system in which this speaker identification device is applied to a voice input system that uses voice input in place of key operation, as used in in-vehicle navigation systems and the like, or to a communication system that transmits video and audio, as used in video conferencing and the like.

[0002]

2. Description of the Related Art

Until now, no technique has simultaneously used lip-movement information and voice information for the purpose of speaker identification and based its decision on the similarity of their time-series changes. A conventional technique with similar components is the kanji input device for printing disclosed in Japanese Unexamined Patent Publication No. Sho 56-126160. With the speaker already specified, this device evaluates not only the voice from a microphone but also the opening and closing of the lips, in order to raise the recognition rate of sounds that are hard to distinguish by voice alone, such as "M" and "N" or "P" and "T". It is applied only in a state where the speaker has already been specified.

[0003] In addition, the speech recognition devices disclosed in Japanese Unexamined Patent Publication Nos. Sho 60-196799 and Sho 60-243700 radiate sound waves toward the speaker's lips and use the echo produced when those waves strike the lips in addition to the voice. As with the kanji input device for printing described above, the aim is to raise the recognition rate of sounds that are hard to distinguish by voice alone. These devices are also applied only when the speaker has already been specified, and can be used only in limited situations, such as when the speaker speaks toward a microphone.

[0004] Other speaker identification devices of the kind addressed by the present invention require the speaker to perform a secondary operation besides speaking, such as pressing a switch to indicate that speech is starting and turning the switch off when the utterance ends. This is a nuisance for the speaker.

[0005]

Because conventional speaker identification devices are configured as described above, they have the following problem. In an environment containing sounds produced by someone or something other than the person who intends to operate the device, such as other people's speech or sound from audio equipment, they cannot determine accurately, quickly, and simply, and without placing an extra burden on the speaker, whether the intended person is speaking or which of several candidates is speaking.

[0006] SUMMARY OF THE INVENTION

The present invention has been made to solve the problems described above. Its object is to obtain an utterance start monitoring device, a speaker identification device, a voice input system, a speaker identification system, and a communication system that can determine which of a plurality of candidates is speaking accurately, quickly, and simply, without placing an extra burden on the speaker.

[0007]

An utterance start monitoring device according to the present invention comprises: lip image capturing means for capturing an image of the lip region of a subject whose speech is to be detected; lip-movement time-series data calculation means for obtaining time-series change data of the lip movement from the captured image; voice extraction means for extracting voice from ambient sound; voice time-series data calculation means for obtaining time-series change data of the extracted voice; time-varying similarity calculation means for obtaining a time-varying similarity between the time-series change data of the lip movement and the time-series change data of the voice; and utterance start recognition means for recognizing that an utterance is about to begin when the obtained time-series change data of the lip movement and of the voice match a stored word corresponding to utterance start and the obtained similarity is equal to or greater than a predetermined value.

[0008] An utterance start monitoring device according to the present invention comprises: lip image capturing means for capturing an image of the lip region of a subject whose speech is to be detected; lip-movement time-series data calculation means for obtaining time-series change data of the lip movement from the captured image; voice extraction means for extracting voice from ambient sound; voice time-series data calculation means for obtaining time-series change data of the extracted voice; time-varying similarity calculation means for obtaining a time-varying similarity between the time-series change data of the lip movement and the time-series change data of the voice; and operation content recognition means for recognizing that an operation has been instructed when the obtained time-series change data of the lip movement and of the voice match a stored word corresponding to that operation and the obtained similarity is equal to or greater than a predetermined value.

[0009] A speaker identification device according to the present invention comprises: lip image capturing means for capturing an image of the lip region of a subject whose speech is to be detected; lip-movement time-series data calculation means for obtaining time-series change data of the lip movement from the captured image; voice extraction means for extracting voice from ambient sound; voice time-series data calculation means for obtaining time-series change data of the extracted voice; time-varying similarity calculation means for obtaining a time-varying similarity between the time-series change data of the lip movement and the time-series change data of the voice; and speaker determination means for determining that the voice was uttered by the subject when the obtained similarity is equal to or greater than a predetermined value.

[0010] In the speaker identification device according to the present invention, the lip-movement time-series data calculation means quantifies the lip movement according to the difference between the maximum and minimum values of the subject's lip contour in the vertical direction.

[0011] In the speaker identification device according to the present invention, the lip-movement time-series data calculation means quantifies the lip movement according to the difference between the maximum and minimum values of the subject's lip contour in the horizontal direction.

[0012] In the speaker identification device according to the present invention, the lip-movement time-series data calculation means quantifies the lip movement according to both the difference between the maximum and minimum values of the subject's lip contour in the vertical direction and the difference between the maximum and minimum values of that contour in the horizontal direction.

[0013] In the speaker identification device according to the present invention, the lip-movement time-series data calculation means quantifies the lip movement according to the curvature of at least one of the subject's upper and lower lips.

[0014] In the speaker identification device according to the present invention, the voice time-series data calculation means quantifies the time-series change of the voice according to the envelope of the output of a voice detection filter for a specific frequency.

[0015] In the speaker identification device according to the present invention, the time-varying similarity calculation means obtains the cross-correlation coefficient of the time-series data after correcting the respective delay times of the time-series change data of the lip movement and of the voice, and the speaker determination means takes, as the similarity to be evaluated, the cross-correlation coefficients within a fixed time around zero delay.

[0016] A voice input system according to the present invention applies the speaker identification device and comprises voice recognition means that performs processing in accordance with the voice input when the speaker is determined to be the subject.

[0017] A speaker identification system according to the present invention takes a plurality of persons as subjects and applies the speaker identification device to lip images of each person, either captured by a plurality of cameras or cut out with a plurality of windows from a single captured image, to identify which of those persons has spoken.

[0018] A communication system according to the present invention takes the plurality of participating persons as subjects, applies the speaker identification system to identify the speaker among them, and comprises communication data control means that switches the transmitted video to video from a camera zoomed in on that speaker.

[0019] A communication system according to the present invention has a three-dimensional sound localization function, takes the plurality of participating persons as subjects, and applies the speaker identification system to identify the speaker among them. It comprises communication data control means that calculates a virtual sound source position from the position of the speaker and the position of the listener on the receiving side, and presents the speaker's voice as three-dimensional sound.

[0020]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

An embodiment of the present invention is described below.

Embodiment 1. FIG. 1 is a block diagram showing a speaker identification device according to Embodiment 1 of the present invention. In the figure, 1 is a subject, 2 is a subject imaging unit (lip image capturing means), and 3 is a sound recording unit (voice extraction means). 4 is a cropping unit (lip image capturing means) that cuts out the lip region of the subject 1 from the image captured by the subject imaging unit 2, and 5 is a lip-movement time-series data calculation unit (lip-movement time-series data calculation means) that obtains time-series change data of the lip movement from the cropped image. 6 is a voice detection filter (voice extraction means) that passes only the frequencies corresponding to voice out of the ambient sound recorded by the sound recording unit 3, and 7 is a voice time-series data calculation unit (voice time-series data calculation means) that obtains time-series change data of the voice that has passed through the voice detection filter 6.

[0021] 8 is a time-series pattern matching unit (time-varying similarity calculation means) that obtains a time-varying similarity between the time-series change data of the lip movement obtained by the lip-movement time-series data calculation unit 5 and the time-series change data of the voice obtained by the voice time-series data calculation unit 7. 9 is a speaker determination unit (speaker determination means, utterance start recognition means, operation content recognition means) that determines that the voice was uttered by the subject 1 when the similarity obtained by the time-series pattern matching unit 8 is equal to or greater than a predetermined value.

[0022] Next, the operation is described. The subject imaging unit 2, such as a camera, captures the subject 1 and outputs the captured video to the cropping unit 4. The cropping unit 4 cuts out the lip region of the subject 1 from each frame of the input video and outputs the resulting lip data to the lip-movement time-series data calculation unit 5. The cropping unit 4 uses a conventional technique, such as one based on color information or one based on facial feature points. The lip-movement time-series data calculation unit 5 obtains characteristic parameters of the lip shape from the lip data and outputs their time-series change data. FIG. 2 is an explanatory diagram showing how the characteristic parameters of the lip shape are obtained. Specifically, as shown in FIG. 2, the distance X between the left and right ends of the lips and the distance Y between the upper and lower lips are used as the characteristic parameters, and the norm of the difference between these two parameters (X, Y) and their reference values (X₀, Y₀) when the subject is not speaking is obtained as the lip deformation degree M by equation (1):

$$M(t) = \sqrt{(X(t) - X_0)^2 + (Y(t) - Y_0)^2} \qquad (1)$$

where M, X, and Y are time series sampled at the image frame rate, t = nΔt, n is an integer, and Δt is the sampling interval.
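As a minimal sketch of equation (1), the lip deformation degree can be computed per frame from the measured lip width X and lip opening Y. The function name `lip_deformation` and the pixel values in the usage note are illustrative assumptions, not part of the patent.

```python
import math


def lip_deformation(xs, ys, x0, y0):
    """Return M[n] = sqrt((X[n] - x0)^2 + (Y[n] - y0)^2) for each frame n.

    xs, ys: per-frame lip width X and lip opening Y (same length).
    x0, y0: reference values measured while the subject is not speaking.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same number of frames")
    # math.hypot computes the Euclidean norm of the two differences.
    return [math.hypot(x - x0, y - y0) for x, y in zip(xs, ys)]
```

For example, with hypothetical reference values X₀ = 40 and Y₀ = 10, a frame with X = 43 and Y = 14 gives M = 5, while a frame at the reference shape gives M = 0.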

[0023] Meanwhile, the sound recording unit 3, such as a microphone, records the ambient sound and outputs it to the voice detection filter 6. The voice detection filter 6 is a band filter for the frequencies corresponding to voice; when the gender or age of the subject is known, a band filter suited to that gender or age is provided. The voice detection filter 6 passes only the frequencies corresponding to voice out of the input ambient sound and outputs the result to the voice time-series data calculation unit 7. The voice time-series data calculation unit 7 then obtains the envelope S of the input voice and uses this envelope S as the time-series change data. The envelope S is obtained at the same sampling interval Δt as the lip deformation degree M, with its timing aligned by compensating for the delay of the image processing path for the lip movement.
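The text does not say how the envelope S is computed. One simple possibility, shown here purely as an assumption, is to rectify the band-filtered signal and average it over consecutive windows of length Δt, so that S has one value per video frame. The function name `envelope_per_frame` is hypothetical, and the input is assumed to have already passed the voice detection filter.

```python
def envelope_per_frame(samples, sample_rate, frame_interval):
    """Mean absolute amplitude over consecutive windows of frame_interval seconds.

    samples: band-filtered audio samples (assumed already filtered).
    sample_rate: audio sampling rate in Hz.
    frame_interval: the sampling interval (delta t) of the lip data, in seconds.
    """
    window = int(round(sample_rate * frame_interval))
    if window <= 0:
        raise ValueError("frame_interval is shorter than one audio sample")
    envelope = []
    # Trailing samples that do not fill a whole window are ignored.
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        envelope.append(sum(abs(v) for v in chunk) / window)
    return envelope
```

A peak detector or a low-pass filter after rectification would serve the same purpose; the choice here only keeps the sketch short.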

[0024] The time-series pattern matching unit 8 uses the data for the past fixed time T of the lip deformation degree M obtained by the lip-movement time-series data calculation unit 5 and of the voice envelope S obtained by the voice time-series data calculation unit 7, and obtains the cross-correlation coefficient R as the similarity by equation (2).

(Equation 1)

where m is the largest integer satisfying m ≤ T/Δt, and T is the time allowed for identifying the speaker. For example, if it is sufficient for the speaker to be identified one second after the utterance, T = 1 sec.
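Equation (2) itself is not reproduced in this text. The sketch below assumes a standard normalized (Pearson) cross-correlation over the most recent m samples, with the lag expressed in samples; it is one plausible reading, not necessarily the patent's exact expression. The function name `cross_correlation` and the pairing convention (M[i] with S[i + lag]) are assumptions.

```python
import math


def cross_correlation(m_series, s_series, m, lag):
    """Pearson correlation of M[i] and S[i + lag] over a window of m samples.

    The window is the most recent one for which both series have data.
    Returns 0.0 when either windowed series is constant (undefined correlation).
    """
    n = min(len(m_series), len(s_series))
    start = n - m - abs(lag)
    if m <= 0 or start < 0:
        raise ValueError("not enough samples for this window and lag")
    if lag >= 0:
        a = m_series[start:start + m]
        b = s_series[start + lag:start + lag + m]
    else:
        a = m_series[start - lag:start - lag + m]
        b = s_series[start:start + m]
    mean_a = sum(a) / m
    mean_b = sum(b) / m
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

With a video frame rate of 30 frames per second (an assumed figure) and T = 1 sec, m would be 30, so each evaluation touches only a few dozen samples per lag.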

[0025] The speaker determination unit 9 obtains the evaluation value D used for the decision by equation (3), as a weighted sum of the cross-correlation coefficients R within a fixed range of delay times around zero delay (τ = 0).

(Equation 2)

where F is a weighting function that takes its maximum at i = 0, for example F(0) = 0.5, F(±1) = 0.25, F(i < −1) = 0, and F(i > 1) = 0. When the evaluation value D is equal to or greater than a preset threshold D₀ (D ≥ D₀), the subject is determined to be the speaker. R(t, τ) takes a value between −1 and 1, and is closer to 1 the higher the similarity between M(t) and S(t) and the smaller the time delay between them. Although the delay of the processing paths has already been compensated, the timing between lip movement and voice is considered to fluctuate inherently, so a weighted sum around zero delay (τ = 0) is used.
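The decision rule described here can be sketched as follows, using the example weights given in the text (F(0) = 0.5, F(±1) = 0.25, zero elsewhere). Equation (3) is not reproduced in this text, so the form D = Σ F(i)·R(t, iΔt) is an assumption based on the prose description; the function names and the threshold values in the usage note are hypothetical.

```python
# Example weighting function from the text: maximum at lag 0, zero beyond +/-1.
WEIGHTS = {-1: 0.25, 0: 0.5, 1: 0.25}


def evaluation_value(r_by_lag, weights=WEIGHTS):
    """Weighted sum of cross-correlation coefficients around zero lag.

    r_by_lag maps a lag index i to R(t, i * delta_t); missing lags count as 0.
    """
    return sum(w * r_by_lag.get(lag, 0.0) for lag, w in weights.items())


def is_target_speaking(r_by_lag, threshold, weights=WEIGHTS):
    """Return True when D >= D0, i.e. the subject is judged to be the speaker."""
    return evaluation_value(r_by_lag, weights) >= threshold
```

For instance, coefficients of 0.4, 0.8, and 0.4 at lags −1, 0, and +1 give D ≈ 0.6, which passes an assumed threshold of 0.5 but not one of 0.7. The patent does not state a value for D₀.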

[0026] Embodiment 1 has been presented as a speaker identification device that determines which of a plurality of candidates is speaking, but it can also be applied to an utterance start monitoring device in which the instruction to start an utterance is itself given by speech. When FIG. 1 is applied to an utterance start monitoring device, the speaker determination unit (utterance start recognition means) 9 stores in advance the time-series change data of the lip movement and of the voice for words corresponding to utterance start (for example, "start", "utterance start", "operation start", or "operation instruction"). It then judges whether the time-series change data of the lip movement obtained by the lip-movement time-series data calculation unit 5 and the time-series change data of the voice obtained by the voice time-series data calculation unit 7 match a stored word corresponding to utterance start. When they match and the similarity obtained by the time-series pattern matching unit 8 is equal to or greater than a predetermined value, it recognizes that an utterance is about to begin.

With a conventional device in a noisy environment (vehicle interior noise, other people's voices, audio playback, and so on), the subject had to press a switch before starting to speak, or wait for a break in other people's conversation, so the subject was required to perform a secondary operation besides speaking. With the utterance start monitoring device described above, the subject only has to speak a pre-registered word corresponding to utterance start for the device to recognize that the subject is about to begin speaking. This relieves the subject of the nuisance of secondary operations other than speaking.
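The two-part condition for recognizing an utterance start (both series match a stored start word, and the lip/voice similarity reaches the predetermined value) can be sketched as below. The text does not say how a match against the stored word is judged; this sketch assumes a correlation against stored templates with a hypothetical cutoff `min_corr`. All names are illustrative.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def matches_template(series, template, min_corr):
    """Assumed matching rule: the latest samples correlate with the template."""
    if len(series) < len(template):
        return False
    recent = series[-len(template):]
    return pearson(recent, template) >= min_corr


def utterance_start_detected(lip, voice, lip_template, voice_template,
                             similarity, similarity_threshold, min_corr=0.8):
    """True when both series match the stored start word AND the lip/voice
    similarity is at or above the predetermined value."""
    return (matches_template(lip, lip_template, min_corr)
            and matches_template(voice, voice_template, min_corr)
            and similarity >= similarity_threshold)
```

The same structure would apply to recognizing a spoken operation command: only the stored templates change.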

[0027] In the utterance start monitoring device described above, the instruction to start an utterance is given by speech. Alternatively, the operation itself may be instructed by speech, so that the utterance start and the operation instruction are given at the same time. When FIG. 1 is applied to such an utterance start monitoring device, the speaker determination unit (operation content recognition means) 9 stores in advance the time-series change data of the lip movement and of the voice for words corresponding to operations (for example, "audio on", "audio off", "audio volume up", or "navigation system on"). It then judges whether the time-series change data of the lip movement obtained by the lip-movement time-series data calculation unit 5 and the time-series change data of the voice obtained by the voice time-series data calculation unit 7 match a stored word corresponding to an operation. When they match and the similarity obtained by the time-series pattern matching unit 8 is equal to or greater than a predetermined value, it recognizes that an utterance has begun and that the operation has been instructed.

With this configuration, the subject only has to speak a pre-registered word corresponding to an operation for the device to recognize that an utterance has begun and that the operation has been instructed. This relieves the subject of the nuisance of secondary operations other than speaking and, because the utterance start and the operation instruction are given at the same time, makes it even easier to instruct operations by speech.

In the description above, the speaker determination unit (utterance start recognition means, operation content recognition means) 9 judges whether the time-series change data of the lip movement obtained by the lip-movement time-series data calculation unit 5 and the time-series change data of the voice obtained by the voice time-series data calculation unit 7 match a stored word corresponding to utterance start or to an operation. This judgment may instead be made in the time-series pattern matching unit 8, or made separately in the lip-movement time-series data calculation unit 5 and the voice time-series data calculation unit 7.

[0028] In Embodiment 1, the contours of the subject's face and lips are approximately left-right symmetric. Taking the axis of symmetry as the vertical direction and the direction orthogonal to it as the horizontal direction, the lip-movement time-series data calculation unit 5 may quantify the lip movement according to at least one of the difference between the maximum and minimum values of the subject's lip contour in the vertical direction and the difference between the maximum and minimum values of that contour in the horizontal direction. It may also quantify the lip movement according to the curvature of at least one of the subject's upper and lower lips. The distance between the lower edge of the upper lip and the upper edge of the lower lip (the opening distance) may also be used.

The voice time-series data calculation unit 7 may instead obtain a moving average of the voice output from the voice detection filter 6. Although the frame rate of the images from the subject imaging unit 2 was used as the unit of processing, there is no need to be bound by that frame rate if the time-series change data from the lip-movement time-series data calculation unit 5 is interpolated and resampled. Although the delay time was compensated in the voice data path, the difference in delay between the two paths may instead be taken into account, when the cross-correlation is obtained, in the weighting function F of equation (3) used by the speaker determination unit 9. Finally, although the various operations were performed by digital computation, they may be implemented in an analog circuit combining delay circuits, multipliers, and integrators.
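Two of the variations mentioned here lend themselves to a short sketch: a moving average as the voice time series, and interpolation of the lip series so that it need not be tied to the camera frame rate. The patent does not specify the interpolation method; linear interpolation is assumed below, and both function names are hypothetical.

```python
def moving_average(values, width):
    """Simple moving average over `width` consecutive samples."""
    if width <= 0 or width > len(values):
        raise ValueError("width must be between 1 and len(values)")
    return [sum(values[i:i + width]) / width
            for i in range(len(values) - width + 1)]


def resample_linear(values, src_interval, dst_interval):
    """Resample a series taken every src_interval seconds onto a grid of
    dst_interval seconds using linear interpolation (an assumed method)."""
    if len(values) < 2:
        raise ValueError("need at least two samples to interpolate")
    duration = (len(values) - 1) * src_interval
    out = []
    k = 0
    # Small tolerance so the final grid point is not lost to rounding.
    while k * dst_interval <= duration + 1e-12:
        pos = k * dst_interval / src_interval
        i = min(int(pos), len(values) - 2)
        frac = pos - i
        out.append(values[i] * (1 - frac) + values[i + 1] * frac)
        k += 1
    return out
```

Resampling the lip series this way would let the lip and voice series share any convenient common interval rather than the video frame interval.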

[0029] As described above, according to Embodiment 1, the lip deformation degree of the subject 1 and the voice envelope are used simultaneously, and the speaker is identified from the similarity of their time-series change patterns. This makes it possible to identify the speaker with a small information-processing load and a high identification rate. In addition, the lip deformation degree can be obtained automatically from the lip region extracted from the image of the subject 1 captured by the subject imaging unit 2, and the voice envelope can be obtained automatically from the voice recorded by the sound recording unit 3 and passed through the voice detection filter 6. The speaker can therefore be identified without placing any extra burden on the subject 1.

[0030] Embodiment 2. FIG. 3 is a block diagram showing a voice input system according to Embodiment 2 of the present invention. In the figure, reference numeral 31 denotes a voice data buffer that temporarily holds the voice data that has passed through the voice detection filter 6; 32 denotes a voice gate that opens when the speaker determination unit 9 determines that the voice data was uttered by the subject 1; and 33 denotes a voice recognition unit (voice recognition means) that recognizes the meaning of the voice data input from the voice data buffer 31 through the voice gate 32, so that processing is performed according to that meaning. The rest of the configuration is the same as in FIG. 1, and its description is not repeated.

[0031] Next, the operation will be described. Embodiment 2 relates to a voice input system that uses voice input in place of key operation, as used in in-vehicle navigation systems and the like. The components 1 to 9 involved in speaker identification operate as in Embodiment 1. The voice data buffer 31 temporarily holds the voice data that has passed through the voice detection filter 6 and continuously outputs that data delayed by a time T_a, which is the time T required to judge an utterance plus the delay of the processing system. The voice gate 32 opens when the speaker determination unit 9 determines that the subject 1 is the speaker, and passes the voice data output from the voice data buffer 31 to the voice recognition unit 33, where it is processed as a voice operation input.
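The buffer-and-gate arrangement can be sketched as a fixed delay line followed by a switch. The sketch below assumes the delay T_a is expressed as a whole number of samples and that the speaker decision is available as a boolean for each sample; the class name `DelayedGate` and its interface are hypothetical.

```python
from collections import deque


class DelayedGate:
    """Delay line (buffer 31) followed by a gate (gate 32). Illustrative only."""

    def __init__(self, delay_samples):
        # Pre-fill with None so the first outputs are "no data yet".
        self._buf = deque([None] * delay_samples)

    def push(self, sample, subject_is_speaker):
        """Store the newest sample and return the delayed one if the gate is open."""
        self._buf.append(sample)
        delayed = self._buf.popleft()
        if delayed is not None and subject_is_speaker:
            return delayed  # forwarded to the recognizer
        return None         # gate closed, or buffer still filling
```

Because the audio is delayed by the time the decision needs, the gate can open in time to pass the beginning of the utterance rather than cutting it off.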

[0032] As described above, according to Embodiment 2, sounds produced by anyone other than the person intending to operate the system, such as another person's speech or sound from audio equipment, are prevented from being mistakenly processed as operation input.

[0033] Embodiment 3. FIG. 4 is a block diagram showing a speaker identification system according to Embodiment 3 of the present invention. In the figure, reference numeral 41 denotes a speaker determination unit that takes the subject with the largest of the evaluation values D output from the speaker determination units 9a to 9c as the utterance candidate, and determines that the candidate is the speaker when the candidate's evaluation value D is equal to or greater than a preset threshold D₁. In Embodiment 3, speaker identification devices of systems a to c are provided for three subjects 1a to 1c; the internal configuration of each is the same as in FIG. 1, and its description is not repeated.

[0034] Next, the operation will be described. Each of the speaker identification devices of systems a to c operates as in Embodiment 1. The three devices process the three subjects 1a to 1c in parallel, and the speaker determination units 9a to 9c each output an evaluation value D. The speaker determination unit 41 compares these evaluation values, takes the subject with the largest value as the utterance candidate, and determines that the candidate is the speaker when the candidate's evaluation value D is equal to or greater than the preset threshold D₁ (D ≥ D₁).
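The decision made by the speaker determination unit 41 amounts to taking the maximum and then applying a threshold. A minimal sketch, assuming the evaluation values arrive as a mapping from subject label to D; the function name and the labels are illustrative only.

```python
def select_speaker(scores, threshold):
    """Return the subject with the largest evaluation value D, or None.

    `scores` maps a subject label to its evaluation value D.
    The candidate is accepted only if D >= threshold (D1 in the text).
    """
    if not scores:
        return None
    candidate = max(scores, key=scores.get)
    return candidate if scores[candidate] >= threshold else None
```

For example, `select_speaker({"1a": 0.2, "1b": 0.7, "1c": 0.4}, 0.5)` picks subject `"1b"`, while a threshold of 0.8 yields `None`, meaning no subject is judged to be speaking.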

[0035] In Embodiment 3, subject photographing units 2a to 2c are provided one per speaker identification device. Alternatively, a single subject photographing unit 2 may be used for the subjects 1a to 1c, with the cut-out processing unit 4 cutting out and processing the lip region of each subject. Likewise, instead of providing sound recording units 3a to 3c one per speaker identification device, a single sound recording unit 3 may be shared by the subjects 1a to 1c.

[0036] As described above, according to Embodiment 3, the speaker can be identified from among a plurality of utterance candidates.

[0037] Embodiment 4. FIG. 5 is a block diagram showing a communication system according to Embodiment 4 of the present invention. In the figure, reference numeral 51 denotes a subject photographing unit for speaker identification, 52 a sound recording unit, 53 a speaker identification unit, 54 a transmission data control unit (communication data control means), 55 a transmission video photographing unit, 56 a reception data control unit (communication data control means), 57 a video presentation unit, and 58 a voice presentation unit. Components 51 to 58 form a subsystem, and the same subsystem is provided at each of two sites linked by a communication system that allows mutual communication.

[0038] Next, the operation will be described. Embodiment 4 relates to a communication system that communicates video and audio, as used in TV conferences and the like. It determines the speaker from among a plurality of subjects and controls the video and audio information to be communicated, thereby increasing the sense of presence. The speaker identification unit 53 uses the video from the subject photographing unit 51 for speaker identification and the sound data from the sound recording unit 52, operates as in Embodiment 3, and determines whether an utterance is present and who the speaker is.

[0039] When the speaker identification unit 53 determines that an utterance is present, the transmission data control unit 54 points the transmission video photographing unit 55 at the identified speaker and zooms in. When no utterance is determined to be present, it zooms the transmission video photographing unit 55 out so that the whole scene is captured, or keeps it panning. The transmission data control unit 54 also transmits the video captured by the transmission video photographing unit 55, together with sound data that depends on whether there is a speaker and on the speaker's position in the virtual space as shown in FIG. 6.

[0040] The reception data control unit 56 that receives this transmission data presents the received video on the video presentation unit 57 and processes the sound data according to whether an utterance is present. Specifically, when there is no utterance, the received sound data is sent unchanged to the voice presentation unit 58. When there is an utterance, the voice data is extracted by the voice detection filter 6 described in Embodiment 1, and the remainder is treated as environmental sound. For the voice data, a three-dimensionally localized sound is created taking into account the positional relationship between the speaker and the listener in the virtual space; this is combined with the environmental sound data and sent to the voice presentation unit 58. For example, in FIG. 6, when the speaker is identified as A1, the voice is presented to B1 so that it is heard from the front, and to B2 so that it is heard from the front right.
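The direction from which each listener should hear the speaker can be derived from their positions in the virtual space. The sketch below is a simplification under stated assumptions: positions are 2-D coordinates with +y as "forward" for a listener whose heading is 0 degrees, and the three-dimensional localization is replaced by plain constant-power stereo panning. The coordinates, function names, and the panning law are all hypothetical; the actual layout of FIG. 6 and the localization method are not given in this section.

```python
import math


def relative_azimuth_deg(listener_xy, facing_deg, speaker_xy):
    """Angle of the speaker as seen by the listener, in degrees.

    0 means straight ahead, positive means to the listener's right.
    `facing_deg` is the listener's heading, measured clockwise from the +y axis.
    """
    dx = speaker_xy[0] - listener_xy[0]
    dy = speaker_xy[1] - listener_xy[1]
    bearing = math.degrees(math.atan2(dx, dy))  # clockwise from +y
    rel = (bearing - facing_deg + 180.0) % 360.0 - 180.0
    return 180.0 if rel == -180.0 else rel


def stereo_gains(azimuth_deg):
    """Constant-power (left, right) gains; a stand-in for true 3-D localization."""
    pan = max(-1.0, min(1.0, azimuth_deg / 90.0))
    theta = (pan + 1.0) * math.pi / 4.0
    return math.cos(theta), math.sin(theta)
```

With a speaker placed at (0, 2), a listener at (0, 0) facing +y gets an azimuth of 0 degrees (heard from the front), while a listener at (-1, 0) with the same heading gets a positive azimuth (heard from the front right), which mirrors the A1, B1, B2 example in the text.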

[0041] As described above, according to Embodiment 4, the speaker is identified from among a plurality of utterance candidates and the video is switched to the speaker or zoomed in on the speaker, so a communication system with a strong sense of presence can be obtained. In addition, controlling the direction from which the speaker's voice is heard makes more realistic virtual-presence communication possible.

[0042]

[Effects of the Invention] As described above, according to the present invention, the device comprises: lip image photographing means for photographing an image of the lip region of a subject whose utterance is to be detected; lip movement time-series data calculation means for obtaining time-series change data of the lip movement from the photographed image; voice extraction means for extracting voice from environmental sound; voice time-series data calculation means for obtaining time-series change data of the extracted voice; time change similarity calculation means for obtaining a similarity, involving change over time, between the time-series change data of the lip movement and that of the voice; and utterance start recognition means for recognizing that an utterance is about to start when the obtained time-series change data of the lip movement and of the voice match a stored word corresponding to the start of an utterance and the obtained similarity is equal to or greater than a predetermined value. Therefore, when the subject speaks the pre-registered word corresponding to the start of an utterance, the device can recognize that the subject is about to start speaking, which spares the subject the nuisance of a secondary operation other than speaking.

[0043] According to the present invention, the device comprises: lip image photographing means for photographing an image of the lip region of a subject whose utterance is to be detected; lip movement time-series data calculation means for obtaining time-series change data of the lip movement from the photographed image; voice extraction means for extracting voice from environmental sound; voice time-series data calculation means for obtaining time-series change data of the extracted voice; time change similarity calculation means for obtaining a similarity, involving change over time, between the time-series change data of the lip movement and that of the voice; and operation content recognition means for recognizing that an operation has been instructed when the obtained time-series change data of the lip movement and of the voice match a stored word corresponding to that operation content and the obtained similarity is equal to or greater than a predetermined value. Therefore, when the subject speaks a pre-registered word corresponding to an operation content, the device can recognize that an utterance has started and that the operation has been instructed. This spares the subject the nuisance of a secondary operation other than speaking and, because the start of the utterance and the instruction of the operation content are given at the same time, makes it even easier to instruct operations by speech.

[0044] According to the present invention, the device comprises: lip image photographing means for photographing an image of the lip region of a subject whose utterance is to be detected; lip movement time-series data calculation means for obtaining time-series change data of the lip movement from the photographed image; voice extraction means for extracting voice from environmental sound; voice time-series data calculation means for obtaining time-series change data of the extracted voice; time change similarity calculation means for obtaining a similarity, involving change over time, between the time-series change data of the lip movement and that of the voice; and speaker determination means for determining that the voice was uttered by the subject when the obtained similarity is equal to or greater than a predetermined value. Because the time-series change data of the subject's lip movement and of the voice are used together and the speaker is identified from their similarity, the speaker can be identified with a small information-processing load and a high identification rate. In addition, the time-series change data of the lip movement is obtained automatically from the lip region extracted from the image of the subject captured by the lip image photographing means, and the time-series change data of the voice is obtained automatically from the voice extracted by the voice extraction means, so the speaker can be identified without imposing any extra burden on the subject.

[0045] According to the present invention, the lip movement time-series data calculation means quantifies the lip movement according to the difference between the maximum and minimum values of the subject's lip contour in the vertical direction, so the speaker can be identified easily, with a small information-processing load and a high identification rate.

[0046] According to the present invention, the lip movement time-series data calculation means quantifies the lip movement according to the difference between the maximum and minimum values of the subject's lip contour in the horizontal direction, so the speaker can be identified easily, with a small information-processing load and a high identification rate.

[0047] According to the present invention, the lip movement time-series data calculation means quantifies the lip movement according to both the distance between the upper and lower lips, given by the difference between the maximum and minimum values of the subject's lip contour in the vertical direction, and the difference between the maximum and minimum values of the lip contour in the horizontal direction, so the speaker can be identified easily, with a small information-processing load and a still higher identification rate.
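The vertical and horizontal measures described in the preceding paragraphs reduce to the extent of the lip contour along each axis. A minimal sketch, assuming the contour is available as a list of (x, y) pixel coordinates; the function name is illustrative, and how the two values are combined into a single deformation measure is not specified here, so the sketch simply returns both.

```python
def lip_extents(contour):
    """Return (height, width) of a lip contour.

    `contour` is a non-empty list of (x, y) points on the lip outline.
    height = max y - min y (vertical opening), width = max x - min x.
    """
    if not contour:
        raise ValueError("contour must contain at least one point")
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    return max(ys) - min(ys), max(xs) - min(xs)
```

Evaluating this once per video frame yields the lip movement time series that is later compared with the voice envelope.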

[0048] According to the present invention, the lip movement time-series data calculation means quantifies the lip movement according to the curvature of at least one of the subject's upper and lower lips, so the speaker can be identified easily, with a small information-processing load and a high identification rate.

[0049] According to the present invention, the voice time-series data calculation means quantifies the time-series change of the voice according to the envelope of the output of a voice detection filter for a specific frequency, so malfunctions caused by chance synchronization with non-voice sounds are avoided and the speaker can be identified easily, with a small information-processing load and a high identification rate.
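One common way to approximate an envelope is to rectify the filter output and smooth it with a moving average. The sketch below assumes the filter output is already available as a list of samples; the rectify-and-average approach and the name `envelope` are assumptions for illustration, not the method fixed by the specification.

```python
def envelope(samples, window):
    """Approximate the envelope by rectifying and applying a moving average.

    The first `window - 1` outputs average over the samples seen so far.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    rectified = [abs(s) for s in samples]
    out = []
    acc = 0.0
    for i, r in enumerate(rectified):
        acc += r
        if i >= window:
            acc -= rectified[i - window]  # drop the sample leaving the window
        out.append(acc / min(i + 1, window))
    return out
```

A longer window gives a smoother curve that follows syllable-level loudness changes, which is the time scale on which lip opening also varies.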

[0050] According to the present invention, the time change similarity calculation means obtains the cross-correlation coefficient of the time-series data after correcting the respective delay times of the time-series change data of the lip movement and of the voice, and the speaker determination means uses, as the similarity to be evaluated, the cross-correlation coefficient within a fixed time range around zero delay, so the speaker can be identified easily, with a small information-processing load and a high identification rate.
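The similarity measure can be sketched as a normalized cross-correlation evaluated only at small lags. The sketch below assumes both series are already delay-corrected and sampled on the same grid, and it takes the largest coefficient inside the lag window as the score; that choice is an assumption made for illustration and does not reproduce the weight function F of equation (3). All function names are hypothetical.

```python
import math


def xcorr_coeff(a, b, lag):
    """Correlation coefficient between a[t] and b[t + lag] over their overlap."""
    pairs = [(a[t], b[t + lag]) for t in range(len(a)) if 0 <= t + lag < len(b)]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_a = sum(p[0] for p in pairs) / n
    mean_b = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - mean_a) * (p[1] - mean_b) for p in pairs)
    var_a = sum((p[0] - mean_a) ** 2 for p in pairs)
    var_b = sum((p[1] - mean_b) ** 2 for p in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant series carries no timing information
    return cov / math.sqrt(var_a * var_b)


def similarity(lip, voice, max_lag):
    """Largest correlation coefficient for lags in [-max_lag, max_lag]."""
    return max(xcorr_coeff(lip, voice, k) for k in range(-max_lag, max_lag + 1))


def is_subject_speaking(lip, voice, max_lag, threshold):
    """Decide that the subject is the speaker when the similarity is high enough."""
    return similarity(lip, voice, max_lag) >= threshold
```

Restricting the search to lags near zero reflects the idea that a subject's own lip movement and voice are nearly simultaneous, whereas an unrelated sound source would correlate, if at all, at an arbitrary lag.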

[0051] According to the present invention, the speaker identification device is applied and voice recognition means is provided that processes according to the voice input when the speaker is determined to be the subject, so sounds produced by anyone other than the person intending to operate the system, such as another person's speech or sound from audio equipment, are prevented from being mistakenly processed as operation input.

[0052] According to the present invention, a plurality of persons are taken as subjects, and the speaker identification device is applied to lip images of each person photographed by a plurality of cameras, or to lip images of each person cut out from a single photographed image using a plurality of windows, to identify which of them spoke, so the speaker can be identified from among a plurality of utterance candidates.

[0053] According to the present invention, the speaker identification system is applied with the plurality of participants as subjects to identify the speaker among them, and communication data control means is provided that switches the communicated video to video from a camera zoomed in on that speaker, so a communication system with a strong sense of presence can be obtained.

[0054] According to the present invention, a three-dimensional sound localization function is provided, the speaker identification system is applied with the plurality of participants as subjects to identify the speaker among them, and communication data control means is provided that calculates a virtual sound source position from the position of the speaker and the position of the listener on the receiving side and presents the speaker's voice as three-dimensional sound. Controlling the direction from which the speaker's voice is heard therefore makes more realistic virtual-presence communication possible.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a speaker identification device according to Embodiment 1 of the present invention.

FIG. 2 is an explanatory diagram showing how the feature parameters of the lip shape are obtained.

FIG. 3 is a block diagram showing a voice input system according to Embodiment 2 of the present invention.

FIG. 4 is a block diagram showing a speaker identification system according to Embodiment 3 of the present invention.

FIG. 5 is a block diagram showing a communication system according to Embodiment 4 of the present invention.

FIG. 6 is an explanatory diagram showing the positional relationship between speakers and listeners in the virtual space.

[Explanation of symbols]

1: subject; 2: subject photographing unit (lip image photographing means); 3: sound recording unit (voice extraction means); 4: cut-out processing unit (lip image photographing means); 5: lip movement time-series data calculation unit (lip movement time-series data calculation means); 6: voice detection filter (voice extraction means); 7: voice time-series data calculation unit (voice time-series data calculation means); 8: time-series pattern matching unit (time change similarity calculation means); 9: speaker determination unit (speaker determination means, utterance start recognition means, operation content recognition means); 33: voice recognition unit (voice recognition means); 54: transmission data control unit (communication data control means); 56: reception data control unit (communication data control means).

Continuation of front page. (72) Inventor: 山▲さき▼ 秀典 (Yamasaki), c/o Mitsubishi Electric Corporation, 2-2-3 Marunouchi, Chiyoda-ku, Tokyo. (72) Inventor: Akira Sawada, c/o Mitsubishi Electric Corporation, 2-2-3 Marunouchi, Chiyoda-ku, Tokyo. (72) Inventor: Hiromi Terashita, c/o Mitsubishi Electric Corporation, 2-2-3 Marunouchi, Chiyoda-ku, Tokyo. F-terms (reference): 5D015 AA03 DD03 HH01 HH04 LL07; 9A001 GG05 HH16 HH21.

Claims

[Claims]

1. An utterance start monitoring device comprising: lip image photographing means for photographing an image of the lip region of a subject whose utterance is to be detected; lip movement time-series data calculation means for obtaining time-series change data of lip movement from the image photographed by the lip image photographing means; voice extraction means for extracting voice from environmental sound; voice time-series data calculation means for obtaining time-series change data of the voice extracted by the voice extraction means; time change similarity calculation means for obtaining a similarity, involving change over time, between the time-series change data of the lip movement obtained by the lip movement time-series data calculation means and the time-series change data of the voice obtained by the voice time-series data calculation means; and utterance start recognition means that stores time-series change data of the lip movement and of the voice for a word corresponding to the start of an utterance, and recognizes that an utterance is about to start when the time-series change data of the lip movement obtained by the lip movement time-series data calculation means and the time-series change data of the voice obtained by the voice time-series data calculation means match the stored word corresponding to the start of an utterance and the similarity obtained by the time change similarity calculation means is equal to or greater than a predetermined value.

2. An utterance start monitoring device comprising: lip image photographing means for photographing an image of the lip region of a subject whose utterance is to be detected; lip movement time-series data calculation means for obtaining time-series change data of lip movement from the image photographed by the lip image photographing means; voice extraction means for extracting voice from environmental sound; voice time-series data calculation means for obtaining time-series change data of the voice extracted by the voice extraction means; time change similarity calculation means for obtaining a similarity, involving change over time, between the time-series change data of the lip movement obtained by the lip movement time-series data calculation means and the time-series change data of the voice obtained by the voice time-series data calculation means; and operation content recognition means that stores time-series change data of the lip movement and of the voice for a word corresponding to an operation content, and recognizes that the operation content has been instructed when the time-series change data of the lip movement obtained by the lip movement time-series data calculation means and the time-series change data of the voice obtained by the voice time-series data calculation means match the stored word corresponding to the operation content and the similarity obtained by the time change similarity calculation means is equal to or greater than a predetermined value.

3. A speaker identification device comprising: lip image photographing means for photographing an image of the lip region of a subject whose utterance is to be detected; lip movement time-series data calculation means for obtaining time-series change data of lip movement from the image photographed by the lip image photographing means; voice extraction means for extracting voice from environmental sound; voice time-series data calculation means for obtaining time-series change data of the voice extracted by the voice extraction means; time change similarity calculation means for obtaining a similarity, involving change over time, between the time-series change data of the lip movement obtained by the lip movement time-series data calculation means and the time-series change data of the voice obtained by the voice time-series data calculation means; and speaker determination means for determining that the voice was uttered by the subject when the similarity obtained by the time change similarity calculation means is equal to or greater than a predetermined value.

4. The speaker identification device according to claim 3, wherein the lip movement time-series data calculation means quantifies the lip movement according to the difference between the maximum and minimum values of the subject's lip contour in the vertical direction.

5. The speaker identification device as described, wherein the lip movement time-series data calculation means quantifies the lip movement according to the difference between the maximum and minimum values of the subject's lip contour in the horizontal direction.

6. The speaker identification device according to claim 3, wherein the lip movement time-series data calculation means quantifies the lip movement according to the distance between the upper and lower lips, given by the difference between the maximum and minimum values of the subject's lip contour in the vertical direction, and the difference between the maximum and minimum values of the subject's lip contour in the horizontal direction.

7. The speaker identification device according to claim 3, wherein the lip movement time-series data calculation means quantifies the lip movement according to the curvature of at least one of the subject's upper and lower lips.

8. The speaker identification device according to claim 3, wherein the voice time-series data calculation means quantifies the time-series change of the voice according to the envelope of the output of a voice detection filter for a specific frequency.

9. The speaker identification device according to claim 3, wherein the time change similarity calculation means obtains the cross-correlation coefficient of the time-series data after correcting the respective delay times of the time-series change data of the lip movement and of the voice, and the speaker determination means uses, as the similarity to be evaluated, the cross-correlation coefficient within a fixed time range around zero delay.

10. A voice input system that uses voice input in place of key operation, wherein the speaker identification device according to any one of claims 3 to 9 is applied to determine whether the subject is the speaker, the system comprising voice recognition means that processes according to the voice input when the speaker is determined to be the subject.

11. A speaker identification system wherein a plurality of persons are taken as subjects, and the speaker identification device according to any one of claims 3 to 9 is applied to lip images of each person photographed by a plurality of cameras, or to lip images of each person cut out from a single photographed image using a plurality of windows, to identify which of the persons spoke.

12. A communication system for communicating video and audio, wherein the speaker identification system according to claim 11 is applied with a plurality of participating persons as subjects to identify the speaker among them, the communication system comprising communication data control means that switches the communicated video to video from a camera zoomed in on that speaker.

13. A communication system for communicating video and audio, having a three-dimensional sound localization function, wherein the speaker identification system according to claim 11 is applied with a plurality of participating persons as subjects to identify the speaker among them, the communication system comprising communication data control means that calculates a virtual sound source position from the position of the speaker and the position of the listener on the receiving side and presents the speaker's voice as three-dimensional sound.