JP2822897B2

JP2822897B2 - Videoconferencing system speaker identification device

Info

Publication number: JP2822897B2
Application number: JP6265852A
Authority: JP
Inventors: 昇寺澤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1994-10-28
Filing date: 1994-10-28
Publication date: 1998-11-11
Anticipated expiration: 2013-11-11
Also published as: JPH08130723A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a video conference system that connects a plurality of sites to hold a conference by television, and more particularly to a video conference system speaker identification device that is effective for identifying the speaker when a video conference is held among multiple sites.

[0002]

[Prior Art] Video conference systems, which connect video conference terminals at a plurality of sites over public or dedicated lines, are being adopted by many companies and other organizations. When such a conference spans multiple sites, it is convenient to show only the person who is currently speaking, or to show that person in a larger picture than the participants at other sites. Conventionally, therefore, the signals arriving over the audio lines from the participants at the various sites are analyzed, the speaker is identified, and only that speaker is distinguished from the other participants and highlighted as described above.

[0003] To identify the speaker among the conference participants, it has been conventional practice, as disclosed for example in Japanese Unexamined Patent Publication No. H4-150590, to measure the sound level at each terminal, remove the background sound level from it to obtain the audio level, compare the audio levels across the terminals, and identify the single speaker with the highest level. That one speaker is then highlighted on the television screen.

[0004] However, if the participant at the terminal with the highest audio level is simply selected as the speaker and the highlighted speaker is switched each time, problems arise. For example, the display changes whenever one of the participants coughs.

[0005] Japanese Unexamined Patent Publication No. H2-4095 therefore detects sound and silence for each audio input. Even when sound is detected, the source is not immediately judged to be the speaker; it is judged to be the speaker only once the sound has continued beyond a fixed time. Likewise, a person already judged to be the speaker is not immediately treated as a non-speaker when silence is detected, but only when the number of silence detections exceeds a fixed value.
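The two-sided delay described for this prior-art publication can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the name `hysteresis_speaker`, the frame-based boolean input, the thresholds `on_frames` and `off_frames`, and the choice to count consecutive silent frames are all hypothetical and are not taken from the cited publication.

```python
from typing import Iterable, List


def hysteresis_speaker(frames: Iterable[bool],
                       on_frames: int = 3,
                       off_frames: int = 2) -> List[bool]:
    """Return a per-frame speaker flag for a single audio input.

    frames: True where sound was detected, False where silence was detected.
    on_frames / off_frames: hypothetical thresholds (assumptions).
    """
    is_speaker = False
    sound_run = 0    # consecutive frames with sound
    silent_run = 0   # consecutive silent frames (assumed to be consecutive)
    flags: List[bool] = []
    for has_sound in frames:
        if has_sound:
            sound_run += 1
            silent_run = 0
            # Sound must persist beyond the threshold before it counts.
            if not is_speaker and sound_run > on_frames:
                is_speaker = True
        else:
            sound_run = 0
            silent_run += 1
            # A speaker is dropped only after enough silence is observed.
            if is_speaker and silent_run > off_frames:
                is_speaker = False
        flags.append(is_speaker)
    return flags
```

With `on_frames=2` and `off_frames=2`, a two-frame burst such as a cough never turns the flag on, and a pause of up to two frames inside an utterance does not turn it off.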

[0006]

[Problems to Be Solved by the Invention] As described above, conventional video conference system speaker identification devices identify the speaker by comparing audio levels. When two people continue to speak in parallel, a difference in their audio levels causes only one of them to be identified as the speaker, and only that person is highlighted on the television screen. Moreover, if the audio levels change with each utterance, so that the difference between the two varies or the louder of the two alternates, the speaker cannot be identified accurately. Naturally, when there are two speakers, the one judged to be a non-speaker is not highlighted, and depending on the highlighting mode that person may not appear on the television screen at all.

[0007] Furthermore, when a fixed time interval is used to newly identify a speaker, as in Japanese Unexamined Patent Publication No. H2-4095, a new speaker cannot be identified accurately when someone else interrupts while the identified speaker is still talking, or when several people are speaking in addition to the identified speaker.

[0008] An object of the present invention is therefore to provide a video conference system speaker identification device that can accurately identify each speaker when a speaker interrupts or when several people are speaking.

[0009] Another object of the present invention is to provide a video conference system speaker identification device that, when two speakers are talking, shows their dialogue on the television screen so that a video conference with a sense of presence can be held.

[0010]

[Means for Solving the Problems] In the invention of claim 1, the video conference system speaker identification device comprises:
(a) audio level detection means for detecting, terminal by terminal, the audio level of each video conference terminal participating in the video conference;
(b) speech state detection means for detecting, in units of a predetermined period, whether someone is speaking at each video conference terminal;
(c) speaker identification means for identifying the speaker in each such period on the basis of the audio levels detected by the audio level detection means;
(d) speaker change status determination means for determining the status of speaker changes using the detection results of the speech state detection means and the identification results of the speaker identification means; and
(e) period varying means for setting the period shorter than its current value when the speaker change status determination means detects a change of speaker, and making the period longer than its current value when the interval until a change is long.

[0011] That is, the invention of claim 1 determines the status of speaker changes, such as when one speaker has been talking for a long time or when another speaker starts talking by interrupting, and varies the period used for speaker identification accordingly. This prevents unwanted sounds from being mistaken for a change of speaker, while allowing each speaker to be identified accurately when a speaker interrupts or when several people are speaking.

[0012] In the invention of claim 2, the video conference system speaker identification device comprises:
(a) audio level detection means for detecting, terminal by terminal, the audio level of each video conference terminal participating in the video conference;
(b) audio level comparison means for comparing the highest and the next-highest of the per-terminal audio levels detected by the audio level detection means;
(c) dialogue determination means for determining that the two are engaged in a dialogue when the difference between the audio levels compared by the audio level comparison means is within a predetermined range; and
(d) display highlighting means for highlighting on the screen, relative to the other conference participants, the two speakers that the dialogue determination means has determined to be in dialogue.

[0013] That is, in the invention of claim 2, whether two people are speaking is determined by extracting the highest audio level and the next-highest audio level and comparing the difference between them. When there are two speakers, both are highlighted on the screen to heighten the sense of presence of the dialogue.

[0014] In the invention of claim 3, the speech state of the speakers is detected and the period that serves as the unit for switching the screen display is varied, so that the advantages of the invention of claim 1 are incorporated into the invention of claim 2.

[0015]

[0016]

[0017]

[0018]

[Embodiments] The present invention is described in detail below with reference to embodiments.

[0019] First Embodiment

[0020] FIG. 1 outlines a video conference system using the video conference system speaker identification device according to the first embodiment of the present invention. Video conference terminals 11₁, 11₂, ..., 11_N, located at sites such as the Tokyo head office and branch offices in Nagoya or Osaka, are connected through lines 12₁, 12₂, ..., 12_N, such as video lines and audio lines, to the communication interface 13 of the speaker identification device. From the communication interface 13, audio information 14₁, 14₂, ..., 14_N for each of the video conference terminals 11₁, 11₂, ..., 11_N is extracted and input to the corresponding audio level measurement units 15₁, 15₂, ..., 15_N. The audio information 14₁, 14₂, ..., 14_N is information on the sound picked up by a microphone (not shown) at each of the video conference terminals 11₁, 11₂, ..., 11_N, and contains both voice and other background sound.

[0021] The audio level measurement units 15₁, 15₂, ..., 15_N receive the measurement period control signal 17 sent from the audio level measurement period control unit 16 and, in accordance with it, lengthen or shorten the measurement period over which they analyze the audio information 14₁, 14₂, ..., 14_N of the corresponding video conference terminal. They supply the results, as audio level measurement data 18₁, 18₂, ..., 18_N, to the audio level detection units 19₁, 19₂, ..., 19_N, the background level detection units 21₁, 21₂, ..., 21_N, and the speech state detection units 22₁, 22₂, ..., 22_N, which are provided for each of the video conference terminals 11₁, 11₂, ..., 11_N. The background level detection units 21₁, 21₂, ..., 21_N detect the background level, that is, the level of sound other than voice, and supply the resulting background level measurement data 23₁, 23₂, ..., 23_N to the audio level detection units 19₁, 19₂, ..., 19_N and the speech state detection units 22₁, 22₂, ..., 22_N.

[0022] The audio level detection units 19₁, 19₂, ..., 19_N detect the difference between each pair of input audio level measurement data 18₁, 18₂, ..., 18_N and background level measurement data 23₁, 23₂, ..., 23_N, and thereby obtain an audio level from which the influence of the background level has been removed. The audio level data 25₁, 25₂, ..., 25_N representing these audio levels are sent to the speaker identification unit 26, which identifies the speaker. The identification method is described later.
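The per-terminal difference taken by the audio level detection units might be sketched as follows. This is a minimal illustration, not the circuit of the embodiment: the function name `audio_levels`, the use of plain floating-point level values, and the clamping of negative differences to zero are assumptions.

```python
from typing import List, Sequence


def audio_levels(measured: Sequence[float],
                 background: Sequence[float]) -> List[float]:
    """Remove each terminal's background level from its measured level.

    measured[i] stands for audio level measurement data 18 of terminal i,
    background[i] for background level measurement data 23 of terminal i.
    """
    if len(measured) != len(background):
        raise ValueError("one background value is needed per terminal")
    # Clamping at zero is an assumption; the text only states that the
    # difference of each pair is taken.
    return [max(m - b, 0.0) for m, b in zip(measured, background)]
```

The returned list plays the role of the audio level data 25 that is passed on to speaker identification.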

[0023] The result of speaker identification is output as speaker identification result data 27. The speaker identification result data 27 is sent to the image processing unit 28, which performs image processing to highlight the one identified speaker. The video signal 29 resulting from this image processing is sent from the communication interface 13 to each of the video conference terminals 11₁, 11₂, ..., 11_N, and the picture is displayed on each television screen.

[0024] Meanwhile, the speech state detection units 22₁, 22₂, ..., 22_N use the input audio level measurement data 18₁, 18₂, ..., 18_N and background level measurement data 23₁, 23₂, ..., 23_N to detect, in each measurement period, whether speech is present at each of the video conference terminals 11₁, 11₂, ..., 11_N. Specifically, they measure the rate of change of each of the audio level measurement data 18₁, 18₂, ..., 18_N, detect the rises and falls of the audio level at each of the video conference terminals 11₁, 11₂, ..., 11_N, and determine the start and end of speech, thereby detecting the presence or absence of speech in each period.
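Detecting the start and end of speech from the rate of change of the measured level can be sketched in a few lines of Python. The sample-to-sample difference used as the rate of change, the thresholds `rise` and `fall`, and the name `speech_edges` are assumptions introduced for illustration; the specification does not give concrete values.

```python
from typing import List, Sequence, Tuple


def speech_edges(samples: Sequence[float],
                 rise: float = 0.5,
                 fall: float = -0.5) -> List[Tuple[int, str]]:
    """Return (sample index, "start" | "end") events for one terminal.

    samples: measured audio levels within one measurement period.
    rise / fall: hypothetical thresholds on the change between samples.
    """
    events: List[Tuple[int, str]] = []
    speaking = False
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]  # rate of change per sample
        if not speaking and delta >= rise:
            speaking = True                  # level rose sharply: speech starts
            events.append((i, "start"))
        elif speaking and delta <= fall:
            speaking = False                 # level fell sharply: speech ends
            events.append((i, "end"))
    return events
```

A period that produces at least one "start" event would be treated as containing speech; a flat but loud signal produces no events, which is the case the comparison with the background level is meant to cover.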

[0025] When no change can be seen from the audio level measurement data 18₁, 18₂, ..., 18_N alone, the audio level measurement data 18₁, 18₂, ..., 18_N are compared with the background level measurement data 23₁, 23₂, ..., 23_N. A video conference terminal 11 for which there is a level difference between the two within one period is judged to be in the speaking state; if there is no level difference, it is judged that no speech is taking place.

[0026] When several rises and falls of the speech state are detected at one video conference terminal 11 within a period, it is determined whether the integrated speaking time in that period exceeds a predetermined threshold. If the threshold is exceeded, it is judged that speech occurred in that period. If the threshold is not exceeded, the period could be judged as containing no speech; however, to allow for speech that straddles a period boundary, the period is judged to contain speech when the terminal is in the speaking state at the moment the period changes over.
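The per-period decision described above, total speaking time against a threshold plus the exception for speech that straddles the period boundary, might look like this in Python. The interval representation, the parameter names, and the function name `spoke_in_period` are assumptions made for the sketch.

```python
from typing import Sequence, Tuple


def spoke_in_period(intervals: Sequence[Tuple[float, float]],
                    period_start: float,
                    period_end: float,
                    min_talk_time: float) -> bool:
    """Decide whether one terminal counts as having spoken in a period.

    intervals: (start, end) times of detected speech, in seconds.
    min_talk_time: hypothetical threshold on accumulated speaking time.
    """
    talk_time = 0.0
    speaking_at_boundary = False
    for start, end in intervals:
        # Accumulate only the part of the interval inside this period.
        overlap = min(end, period_end) - max(start, period_start)
        if overlap > 0:
            talk_time += overlap
        # Speech still in progress when the period changes over.
        if start < period_end < end:
            speaking_at_boundary = True
    return talk_time > min_talk_time or speaking_at_boundary
```

For a period from 0 s to 10 s with a 3 s threshold, two one-second bursts are rejected, while an utterance that begins at 9.5 s and continues into the next period is accepted despite its short overlap.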

[0027] The per-period speech result data 31₁, 31₂, ..., 31_N detected in this way by the speech state detection units 22₁, 22₂, ..., 22_N are all input to the audio level measurement period control unit 16. The speaker identification unit 26, for its part, uses the speech result data 31₁, 31₂, ..., 31_N obtained from the audio level measurement period control unit 16 to find the one or more video conference terminals 11 at which someone is speaking. It then compares the audio level data 25 among those terminals and identifies the one with the highest level as the speaker. As already explained, the result is output as speaker identification result data 27.
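The two-step selection, first restricting to terminals that are speaking and then taking the highest audio level, can be sketched as below. The function name `identify_speaker` and the list-based inputs are assumptions; returning `None` when nobody is speaking is also an assumption, since the text does not state how that case is encoded.

```python
from typing import Optional, Sequence


def identify_speaker(speaking: Sequence[bool],
                     levels: Sequence[float]) -> Optional[int]:
    """Return the index of the terminal identified as the speaker.

    speaking[i]: speech result data 31 for terminal i in this period.
    levels[i]:   audio level data 25 for terminal i.
    """
    # Only terminals at which speech was detected are candidates.
    candidates = [i for i, is_speaking in enumerate(speaking) if is_speaking]
    if not candidates:
        return None
    # Among the candidates, the highest audio level wins.
    return max(candidates, key=lambda i: levels[i])
```

Filtering first means that a loud non-speech noise at a terminal judged not to be speaking cannot win the comparison.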

[0028] The audio level measurement period control unit 16 also acquires and stores the speaker identification data from the speaker identification unit 26 and determines, period by period, whether the speaker is the same or has changed. (a) If it determines that the speaker has changed, it uses the measurement period control signal 17 to shorten the audio level measurement period by a predetermined time from its current value. Alternatively, a somewhat longer history may be kept, the frequency of speaker changes calculated, and the amount by which the period is shortened controlled using that frequency information. The values to be added or subtracted may also be stored as a table in a ROM (read-only memory) and read out according to the situation to set the length of one period.
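One way to picture the table-driven variant is the sketch below, which counts speaker changes in a short history and looks up how much to shorten the period. Every number in it is hypothetical: the table `SHORTEN_STEP_BY_CHANGES`, the fallback `MAX_STEP`, the floor `MIN_PERIOD`, and the unit of seconds are assumptions, as is the function name `shorten_period`.

```python
from typing import Sequence

# Hypothetical stand-in for the ROM table: number of speaker changes in the
# history -> amount (seconds) by which to shorten the measurement period.
SHORTEN_STEP_BY_CHANGES = {0: 0.0, 1: 0.5, 2: 1.0}
MAX_STEP = 1.5     # used when changes exceed the table (assumption)
MIN_PERIOD = 0.5   # lower bound so the period never reaches zero (assumption)


def shorten_period(period: float, history: Sequence[int]) -> float:
    """Return the new measurement period given recent speaker indices.

    history: identified speaker per period, oldest first.
    """
    # Count how many times the speaker differs from the previous period.
    changes = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    step = SHORTEN_STEP_BY_CHANGES.get(changes, MAX_STEP)
    return max(period - step, MIN_PERIOD)
```

A lookup table keeps the adjustment policy separate from the control logic, which matches the idea of reading the values from ROM according to the situation.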

[0029] (b) If it determines that the speaker is unchanged, it determines whether anyone other than the speaker is talking. To do so, it removes the speech result data of the terminal identified as the speaker, counts among the remaining speech result data those indicating speech starting or in progress, and judges whether anyone other than the speaker is talking. If the result is that someone is talking in addition to the speaker, it sends each of the audio level measurement units 15₁, 15₂, ..., 15_N a measurement period control signal 17 that shortens the audio level measurement period. The larger the count, that is, the more people are talking, the shorter the measurement period becomes.

[0030] If, on the other hand, it is determined that no one other than the speaker is talking, a measurement period control signal 17 that lengthens the audio level measurement period is sent to each of the audio level measurement units 15₁, 15₂, ..., 15_N. This state means that the same person is continuing as the speaker. The time for which that person has been the speaker is therefore determined from the history of the speaker identification result data 27, and the longer that time, the longer the measurement period is made. The measurement period is set using as a guide the expected length of one continuous stretch of speech by a single speaker.
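The case of an unchanged speaker can be summarized in a small sketch: other people talking shortens the period in proportion to their number, and an uninterrupted speaker lengthens it in proportion to how long that person has held the floor. The linear rules, the constants `shrink`, `grow`, `lo` and `hi`, and the name `adjust_period` are assumptions; the specification only states the direction of each adjustment.

```python
from typing import Sequence


def adjust_period(period: float,
                  speaker: int,
                  speaking: Sequence[bool],
                  held_periods: int,
                  shrink: float = 0.5,
                  grow: float = 0.25,
                  lo: float = 0.5,
                  hi: float = 10.0) -> float:
    """Return the next measurement period when the speaker is unchanged.

    speaking[i]:  whether terminal i is speaking in this period.
    held_periods: number of periods the current speaker has been the speaker.
    """
    # Exclude the identified speaker and count everyone else who is talking.
    others = sum(1 for i, s in enumerate(speaking) if s and i != speaker)
    if others > 0:
        new_period = period - shrink * others       # more talkers: shorter
    else:
        new_period = period + grow * held_periods   # longer hold: longer
    return min(max(new_period, lo), hi)
```

The clamp to the range from `lo` to `hi` is added only so the sketch stays bounded; the upper bound loosely reflects the remark that the period is guided by the expected length of one stretch of speech.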

[0031] In this way, at each of the video conference terminals 11₁, 11₂, ..., 11_N, the one person identified as the speaker is highlighted on the screen, and when the speaker changes, the highlighted speaker on the screen changes accordingly. Various known techniques can be used to highlight the speaker: displaying only the person identified as the speaker on the whole screen; providing small windows for all participants and enlarging the window of the identified speaker relative to the others; or changing the frame of the identified speaker's window to a color different from the frames of the other windows.

[0032] Second Embodiment

[0033] FIG. 2 outlines a video conference system using the video conference system speaker identification device according to the second embodiment of the present invention. Parts identical to those in FIG. 1 carry the same reference numerals, and their description is omitted where appropriate. From the communication interface 13, audio information 14₁, 14₂, ..., 14_N for each of the video conference terminals 11₁, 11₂, ..., 11_N is extracted and input to the corresponding audio level measurement units 61₁, 61₂, ..., 61_N. The audio information 14₁, 14₂, ..., 14_N is information on the sound picked up by a microphone (not shown) at each of the video conference terminals 11₁, 11₂, ..., 11_N, and contains both voice and other background sound.

[0034] The audio level measurement data 18₁, 18₂, ..., 18_N measured by the audio level measurement units 61₁, 61₂, ..., 61_N are supplied to the audio level detection units 19₁, 19₂, ..., 19_N, the background level detection units 21₁, 21₂, ..., 21_N, and the speech state detection units 22₁, 22₂, ..., 22_N, which are provided for each of the video conference terminals 11₁, 11₂, ..., 11_N. The audio level detection units 19₁, 19₂, ..., 19_N use the input audio level measurement data 18₁, 18₂, ..., 18_N and background level measurement data 23₁, 23₂, ..., 23_N to create audio level data 25₁, 25₂, ..., 25_N, which represent the audio levels with the background level removed, and supply them to the dialogue state detection unit 62 and the speaker identification unit 63. The speaker identification unit 63 uses the dialogue state data 64 detected by the dialogue state detection unit 62 and the audio level data 25₁, 25₂, ..., 25_N to identify one or more speakers, and sends the result to the image processing unit 66 as speaker identification data 65. The image processing unit 66 performs image processing to highlight the one or more identified speakers. The video signal 67 resulting from this image processing is sent from the communication interface 13 to each of the video conference terminals 11₁, 11₂, ..., 11_N, and the picture is displayed on each television screen.

[0035] In this video conference system speaker identification device, the dialogue state detection unit 62 receives the audio level data 25₁, 25₂, ..., 25_N and calculates, for each of the video conference terminals 11₁, 11₂, ..., 11_N, the integral of the audio level over each measurement period used by the audio level measurement units 61₁, 61₂, ..., 61_N. It then extracts two of these integrated audio levels: the highest and the next-highest.

[0036] Depending on the device, however, when there are two or more audio levels reasonably close to the highest audio level, all of them may be extracted. This is because, when the measurement period is set relatively long, three or more people may talk in rapid succession within one period. In such a case it is also effective, in the same spirit as the first embodiment, to adopt a configuration in which the dialogue state detection unit 62 sends a measurement period control signal 17 to each of the audio level measurement units 61₁, 61₂, ..., 61_N and thereby controls the measurement period. For simplicity, this embodiment is described only for a dialogue between two people.

[0037] The dialogue state detection unit 62 compares the integrated audio levels of the two. If the difference between them is small, within a predetermined range, the video conference in progress is judged to be in the dialogue state. Otherwise, that is, when the difference is large, the video conference in progress is judged to be in the non-dialogue state. The dialogue state detection unit 62 likewise judges the non-dialogue state when only one person is speaking or when no one is speaking.
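The dialogue test performed by the dialogue state detection unit can be sketched as follows. The rectangle-rule integration with step `dt`, the tolerance `max_gap`, and the function name `detect_dialogue` are assumptions. In particular, treating a terminal as speaking when its integrated level exceeds `min_level` is a simplification introduced here to cover the one-speaker and no-speaker cases.

```python
from typing import Sequence


def detect_dialogue(samples: Sequence[Sequence[float]],
                    dt: float,
                    max_gap: float,
                    min_level: float = 0.0) -> bool:
    """Return True when the two loudest terminals appear to be in dialogue.

    samples[i]: audio level samples of terminal i within one period.
    dt:         sampling interval used for the integration (assumption).
    max_gap:    hypothetical bound on the allowed difference.
    """
    # Integrate each terminal's audio level over the measurement period.
    integrals = [sum(s) * dt for s in samples]
    # Keep only terminals treated as speaking, loudest first.
    active = sorted((v for v in integrals if v > min_level), reverse=True)
    if len(active) < 2:
        return False  # one speaker or none: non-dialogue state
    # Dialogue when the highest and next-highest values are close.
    return active[0] - active[1] <= max_gap
```

Integrating over the period rather than comparing instantaneous levels means that two people who alternate within one period both accumulate a comparable total.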

[0038] The dialogue state data 64, which indicates whether the dialogue state exists, is input to the speaker identification unit 63 together with the audio level data 25₁, 25₂, ..., 25_N. When the dialogue state data 64 indicates the non-dialogue state, the speaker identification unit 63 compares the audio level data 25₁, 25₂, ..., 25_N and identifies the video conference terminal 11 showing the highest audio level as the "speaker A pattern".

[0039] When the dialogue state data 64 indicates the dialogue state, the speaker identification unit 63 likewise identifies the video conference terminal 11 showing the highest audio level as the "speaker A pattern", and in addition identifies the video conference terminal 11 showing the next-highest audio level, the other party to the dialogue, as the "speaker B pattern". These identification results are sent to the image processing unit 66.
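The assignment of the "speaker A pattern" and "speaker B pattern" might be sketched as below. The function name `speaker_patterns` and the tuple return value are assumptions. The `min_level` check, which reports no speaker at all when even the loudest terminal is at or below that level, is also an assumption, because the text does not say how the absence of a speaker is decided.

```python
from typing import Optional, Sequence, Tuple


def speaker_patterns(levels: Sequence[float],
                     in_dialogue: bool,
                     min_level: float = 0.0
                     ) -> Tuple[Optional[int], Optional[int]]:
    """Return (speaker A, speaker B) as terminal indices, or None if absent.

    levels[i]:   audio level data 25 for terminal i.
    in_dialogue: dialogue state data 64 for this period.
    """
    if not levels:
        return None, None
    # Rank terminals from loudest to quietest.
    ranked = sorted(range(len(levels)), key=lambda i: levels[i], reverse=True)
    speaker_a = ranked[0]
    if levels[speaker_a] <= min_level:
        return None, None  # nobody loud enough to count as a speaker
    # Speaker B exists only in the dialogue state.
    speaker_b = ranked[1] if in_dialogue and len(ranked) > 1 else None
    return speaker_a, speaker_b
```

Speaker A is chosen the same way in both states, so switching between the dialogue and non-dialogue states only adds or removes speaker B.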

[0040] The image processing unit 66 has a CPU (central processing unit), not shown, and, under a control program stored in a ROM (read-only memory), also not shown, performs screen highlighting according to whether there is a dialogue state and whether there is a speaker.

[0041] FIG. 3 shows the control performed by this image processing unit. The image processing unit 66 receives the speaker identification data 65 and checks whether a speaker exists (step S101). A speaker exists when a "speaker A pattern" exists. If no speaker exists (N), it performs video processing that displays the pictures of all participants in the video conference in separate small windows (step S102) and has the resulting video signal 67 sent to each of the video conference terminals 11₁, 11₂, ..., 11_N through the communication interface 13.

[0042] If a speaker exists and a "speaker B pattern" also exists (step S103; Y), it creates a video signal 67 that splits the screen in two and shows both speaker A and speaker B (step S104) and sends it to the communication interface 13. If speaker B does not exist (step S103; N), it creates a video signal 67 that shows only speaker A on the whole screen (step S105) and sends it to the communication interface 13.
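The branching of steps S101 to S105 can be written compactly as below. The layout labels `"grid"`, `"split"` and `"full"`, the function name `choose_layout`, and the tuple return value are assumptions made for the sketch; actual video composition is outside its scope.

```python
from typing import List, Optional, Sequence, Tuple


def choose_layout(speaker_a: Optional[int],
                  speaker_b: Optional[int],
                  participants: Sequence[int]) -> Tuple[str, List[int]]:
    """Pick the screen layout and the terminals to show in it."""
    if speaker_a is None:
        # S101: N -> S102: every participant in a small window.
        return "grid", list(participants)
    if speaker_b is not None:
        # S103: Y -> S104: screen split in two for speakers A and B.
        return "split", [speaker_a, speaker_b]
    # S103: N -> S105: speaker A alone on the whole screen.
    return "full", [speaker_a]
```

Calling this once per measurement period reproduces the per-period switching of the display mode.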

[0043] The screen display mode described above is checked and switched every measurement period. Various other highlighting modes are also possible. For example, the pictures of all participants may be displayed in small frames, with only those in dialogue or the speaker shown in a frame of a different color from the others, or with their frames displayed relatively larger than the frames of the others.

[0044] Although the first and second embodiments have been described above, the present invention is not limited to them. For example, the audio level measurement cycle control unit of the first embodiment, which changes the measurement cycle, can be incorporated into the second embodiment to realize a video conference system speaker identification device premised on dialogue.

【００４５】[0045]

[Effects of the Invention] As described above, the invention of claim 1 determines the speaker change situation, such as when one speaker speaks alone for a long time or when another speaker starts speaking by interrupting, and varies the cycle used for speaker identification accordingly. This prevents extraneous sounds from being mistaken for a change of speaker, while still allowing each speaker to be identified accurately when a speaker interrupts or when several speakers are present.
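The cycle adjustment summarized above can be sketched in Python as follows. This is a minimal sketch, not the claimed implementation: the function name `next_cycle`, the halving and doubling factor, the `long_run` threshold, and the clamping bounds `shortest` and `longest` are all hypothetical values chosen for illustration. The specification only states that the cycle is shortened when a change of speaker is detected and lengthened when the interval until a change is long.

```python
def next_cycle(current: float, speaker_changed: bool, cycles_since_change: int,
               *, long_run: int = 10, factor: float = 2.0,
               shortest: float = 0.1, longest: float = 2.0) -> float:
    """Return the identification cycle (in seconds) to use next.

    All numeric defaults are illustrative assumptions.
    """
    if speaker_changed:
        # A change of speaker was detected: react faster by shortening the
        # cycle (the lower bound is an assumption of this sketch).
        return max(shortest, current / factor)
    if cycles_since_change >= long_run:
        # One speaker has held the floor for a long time: lengthen the cycle
        # so that brief extraneous sounds are less likely to be taken as a
        # change of speaker.
        return min(longest, current * factor)
    return current
```

Clamping to a range is a design choice of the sketch; it keeps the cycle from shrinking or growing without bound over a long conference.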

[0046] In the invention of claim 2, whether two speakers are currently speaking is determined by extracting the highest audio level and the next-highest audio level and comparing the difference between them. When there are two speakers, both can therefore be highlighted on the screen, heightening the sense of presence of the dialogue.
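The comparison of the highest and next-highest audio levels can be sketched in Python as follows. The function name `detect_dialogue`, the dictionary of per-terminal levels, and the parameters `max_gap` and `speech_floor` are assumptions of this sketch. In particular, `speech_floor` is a hypothetical guard, added here so that two silent terminals are not reported as a dialogue.

```python
from typing import Dict, Optional, Tuple


def detect_dialogue(levels: Dict[int, float], max_gap: float,
                    speech_floor: float = 0.0) -> Optional[Tuple[int, int]]:
    """Return the two loudest terminals if they appear to be in dialogue.

    levels maps a terminal index to its measured audio level for the
    current cycle. The two terminals are treated as being in dialogue when
    the difference between their levels is at most max_gap.
    """
    if len(levels) < 2:
        return None
    # Rank the terminals from loudest to quietest.
    ranked = sorted(levels, key=levels.get, reverse=True)
    first, second = ranked[0], ranked[1]
    # Hypothetical guard: the second-loudest terminal must carry speech.
    if levels[second] <= speech_floor:
        return None
    if levels[first] - levels[second] <= max_gap:
        return first, second
    return None
```

For example, with levels `{1: 60.0, 2: 57.0, 3: 20.0}` and `max_gap=5.0`, terminals 1 and 2 would be reported as a dialogue pair, whereas levels `{1: 60.0, 2: 40.0}` would yield a single speaker.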

[0047] Furthermore, in the invention of claim 3, the speech state of the speakers is detected and the cycle that serves as the unit for switching the screen display is varied, so speakers can be identified appropriately even when images are displayed in dialogue form.

【００４８】[0048]

【００４９】[0049]

[Brief description of the drawings]

FIG. 1 is a block diagram showing the video conference system speaker identification device according to the first embodiment of the present invention and the video conference terminals connected to it.

FIG. 2 is a block diagram showing the video conference system speaker identification device according to the second embodiment of the present invention and the video conference terminals connected to it.

FIG. 3 is a flowchart showing how the image processing unit is controlled in the second embodiment of the present invention.

[Explanation of symbols]

11 Video conference terminal
15, 61 Audio level measurement unit
16 Audio level measurement cycle control unit
19 Audio level detection unit
21 Background level detection unit
22 Speech state detection unit
25 Audio level data
26, 63 Speaker identification unit
28, 66 Image processing unit
62 Dialogue state detection unit

Claims

(57) [Claims]

1. A video conference system speaker identification device comprising: audio level detecting means for detecting, for each terminal, the audio level of each video conference terminal participating in a video conference; speech state detecting means for detecting, in units of a predetermined cycle, whether a speaker is speaking at each of the video conference terminals; speaker identifying means for identifying a speaker in each cycle based on the audio levels detected by the audio level detecting means; speaker change situation determining means for determining the speaker change situation using the detection result of the speech state detecting means and the identification result of the speaker identifying means; and cycle varying means for setting the cycle shorter than its current value when the speaker change situation determining means detects a change of speaker, and for setting the cycle longer than its current value when the interval from that setting until a change of speaker is long.

2. A video conference system speaker identification device comprising: audio level detecting means for detecting, for each terminal, the audio level of each video conference terminal participating in a video conference; audio level comparing means for comparing the highest of the per-terminal audio levels detected by the audio level detecting means with the next-highest one; dialogue determining means for determining that the two speakers are engaged in dialogue when the difference between the audio levels compared by the audio level comparing means is within a predetermined range; and display emphasis means for emphasizing, on the screen display, the two speakers determined to be in dialogue relative to the other video conference participants.

3. The video conference system speaker identification device according to claim 2, wherein the speech state of the speaker is detected and the cycle that serves as the unit for switching the screen display is varied.