JPH0792988A

JPH0792988A - Speech detecting device and video switching device

Info

Publication number: JPH0792988A
Application number: JP5238579A
Authority: JP
Inventors: Takeshi Norimatsu; 武志則松; Yoshihisa Nakato; 良久中藤
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1993-09-27
Filing date: 1993-09-27
Publication date: 1995-04-07

Abstract

PURPOSE: To provide a speech detecting device that can decide whether an input is a speaker's speech and accurately identify the microphone corresponding to that speaker, and a video switching device that automatically switches the image to the speaker based on that identification. CONSTITUTION: A speech decision part 3 extracts spectral feature quantities from the signal input to a microphone 1 and decides whether the signal is speech according to its similarity to speech feature quantities obtained in advance. A speaker detection part 2 estimates the position of the speaker by detecting the difference between the input signals of adjacent microphones 1 and identifies the microphone 1 corresponding to the speaker. Based on the outputs of the speech decision part 3 and the speaker detection part 2, a total decision part 4 decides only the speech of the speaker corresponding to each microphone 1.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a voice detection device that identifies the position of a speaker in a video conference system or the like, and to a video switching device that switches video based on the output of that device.

[0002]

[Prior Art] In recent years, with the development of digital communication networks such as ISDN, companies have begun to make active use of video conference systems between remote locations.

[0003] In current video conference systems, realizing a more natural conference on a monitor screen of limited size requires switching the monitor screen to the speaker in real time, so that participants know who is speaking. In many current conference systems, the video has to be switched manually from a console every time the speaker changes, which hinders the natural progress of the conference. A voice detection device that automatically detects the voice of the person speaking and automatically switches to that person's video is therefore desired.

[0004] Consider an actual video conference with several participants. During the conference, various noises occur in addition to the participants' speech. Multiple microphones are installed in the conference room to pick up every participant's voice, but one speaker's voice enters not only that speaker's own microphone but also the adjacent microphones. In addition, the voice of the far-end party is played over loudspeakers and leaks into each microphone. To realize the voice detection device under these conditions, it must accurately identify the voice portions of the input signal and also correctly determine which microphone position the person who uttered the voice corresponds to.

[0005] To realize such a voice detection device, attempts have been made to calculate the power of the signal input to each microphone, judge that voice is being input to a microphone when power is detected, and automatically point the camera at the pre-stored position of the speaker corresponding to that microphone to switch the video. In this approach, an interval in which power is detected is not judged to be voice if it is shorter than a fixed time, which prevents false detection caused by sudden noise. There is also a method that selects the input with the higher power, to handle the case where one speaker's voice enters several adjacent microphones simultaneously and several microphone inputs are judged to be voice.

[0006]

[Problems to Be Solved by the Invention] Although the above configuration can reject sudden noise, it responds to any continuous signal with high power, whether voice or noise, so the video may be switched by mistake to a person who is not speaking.

[0007] Moreover, a speaker does not necessarily speak from directly in front of the microphone, and the positional relationship between the mouth and the microphone changes. A difference in power alone therefore cannot accurately determine which speaker uttered the voice.

[0008] The present invention solves the above conventional problems. Its object is to provide a voice detection device that can accurately determine whether an input signal is a voice signal, regardless of whether the signal is sudden or continuous, and can accurately determine whether the voice signal was uttered by the speaker corresponding to each microphone, and to provide a video switching device that can automatically switch to the speaker's video based on the determination result of this voice detection device.

[0009]

[Means for Solving the Problems] The voice detection device according to claim 1 comprises: a plurality of microphones that detect sound; a voice determination unit that extracts spectral feature quantities from the signals input to these microphones and determines whether each signal is voice according to whether it is similar to voice feature quantities obtained in advance; a speaker detection unit that estimates the position of the speaker who is the source of the sound by detecting the difference between the input signal of an arbitrary microphone and the input signal of a microphone adjacent to it, and identifies the microphone corresponding to that speaker; and a comprehensive determination unit that uses the outputs of the voice determination unit and the speaker detection unit to determine, based on predetermined determination conditions, only the voice of the speaker corresponding to each microphone.

[0010] The voice detection device according to claim 3 comprises: a first microphone facing the speaker; a second microphone facing away from the speaker; a front sound detection unit that detects only signals emitted from in front of the first microphone by detecting the difference between the input signals of the first and second microphones; a voice determination unit that extracts spectral feature quantities from the signal input to the first microphone and determines whether the signal is voice according to whether it is similar to voice feature quantities obtained in advance; and a comprehensive determination unit that uses the outputs of the front sound detection unit and the voice determination unit to determine only the voice of the speaker corresponding to each first microphone.

[0011] The voice detection device according to claim 4 comprises: a plurality of microphone sets, each consisting of a first microphone facing the speaker and a second microphone facing away from the speaker; a front sound detection unit that detects only signals emitted from in front of the first microphone by detecting the difference between the input signals of the first and second microphones of each set; a voice determination unit that extracts spectral feature quantities from the signal input to the first microphone of each set and determines whether the signal is voice according to whether it is similar to voice feature quantities obtained in advance; a speaker detection unit that estimates the position of the speaker by detecting the difference between the input signal of an arbitrary first microphone and the input signal of the first microphone adjacent to it, and identifies the microphone corresponding to that speaker; and a comprehensive determination unit that uses the outputs of the front sound detection unit, the voice determination unit, and the speaker detection unit to determine, based on predetermined determination conditions, only the voice of the speaker corresponding to the first microphone of each set.

[0012] The video switching device according to claim 25 comprises: the voice detection device according to claim 1; a camera control unit that stores the position of each speaker in advance and controls the output video so that the video of each speaker can be output; and a video switching control unit that identifies, based on the output of the voice detection device, the microphone to which voice is being input and outputs to the camera control unit a control signal for switching to the video of the corresponding speaker.

[0013]

[Operation] With the configuration of claim 1, the voice determination unit extracts spectral feature quantities from the signal input to a microphone and determines whether the signal is voice according to whether it is similar to voice feature quantities obtained in advance. The speaker detection unit estimates the position of the speaker by detecting the difference between the input signals of adjacent microphones and identifies the microphone corresponding to that speaker. Based on the outputs of the voice determination unit and the speaker detection unit, the comprehensive determination unit determines only the voice of the speaker corresponding to each microphone.

[0014] With the configuration of claim 3, the front sound detection unit detects the difference between the signals input to the first microphone, which faces the speaker, and the second microphone, which faces away from the speaker, and thereby detects only signals emitted from in front of the first microphone. The voice determination unit extracts spectral feature quantities from the signal input to the first microphone and determines whether the signal is voice according to whether it is similar to voice feature quantities obtained in advance. Based on the outputs of the front sound detection unit and the voice determination unit, the comprehensive determination unit determines only the voice of the speaker corresponding to each first microphone.

[0015] With the configuration of claim 4, the front sound detection unit detects the difference between the signals input to the paired first microphone, which faces the speaker, and second microphone, which faces away from the speaker, and thereby detects only signals emitted from in front of the first microphone. The voice determination unit extracts spectral feature quantities from the signal input to the first microphone of each set and determines whether the signal is voice according to whether it is similar to voice feature quantities obtained in advance. The speaker detection unit estimates the position of the speaker by detecting the difference between the input signals of adjacent first microphones and identifies the microphone corresponding to that speaker. Based on the outputs of the front sound detection unit, the voice determination unit, and the speaker detection unit, the comprehensive determination unit determines only the voice of the speaker corresponding to the first microphone of each set.

[0016] With the configuration of claim 25, based on the output of the voice detection device according to claim 1, the video switching control unit outputs to the camera control unit a control signal for switching the video to the speaker corresponding to the identified microphone. In response to this control signal, the camera control unit controls switching of the output video based on the speaker position information stored in advance.

[0017]

[Embodiments] A first embodiment of the voice detection device of the present invention is described below with reference to the drawings.

[0018] FIG. 1 is a block diagram showing the configuration of this embodiment. In FIG. 1, W denotes a speaker who utters voice; 1 denotes a microphone; 2 denotes a speaker detection unit that estimates the position of the speaker by examining the waveform similarity between the input signals of adjacent microphones; 3 denotes a voice determination unit that extracts phonemic features from the input signal of each microphone and determines whether it is a voice signal; and 4 denotes a comprehensive determination unit that, based on the results of the voice determination unit and the speaker detection unit, determines for each microphone whether the voice signal of the speaker located in front of it is being input, and outputs the determination result.
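The way the comprehensive determination unit 4 combines the two results can be sketched as follows. This is only an illustration: the source says the decision uses predetermined determination conditions without spelling them out at this point, so the logical AND below and the function name `overall_decision` are assumptions.

```python
def overall_decision(is_voice, speaker_on):
    """Combine per-microphone results (assumed rule: both must agree).

    is_voice[i]   -- voice determination unit 3 judged microphone i's input as voice
    speaker_on[i] -- speaker detection unit 2 identified microphone i
    """
    return [v and s for v, s in zip(is_voice, speaker_on)]


# Microphone 1 carries voice AND was identified by the speaker detector.
print(overall_decision([True, True, False], [False, True, True]))
```

Under this assumed rule, a loud continuous noise fails the voice test and a voice leaking into a neighboring microphone fails the speaker test, so neither triggers a switch.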

[0019] The operation of the voice detection device is described below. A typical video conference scene is assumed here, in which the speakers sit in a single horizontal row and each speaker has a microphone.

[0020] First, the acoustic signal input to each microphone 1 is analog-to-digital converted and fed to both the speaker detection unit 2 and the voice determination unit 3. The speaker detection unit 2 estimates the position of the speaker by examining the correlation between the input signals of adjacent microphones. Consider, for example, the case where speaker W2 is speaking. The voice uttered by speaker W2 enters not only microphone M2 but also the neighboring microphones M1 and M3 (it also enters the other microphones, but with lower power). Furthermore, speaker W2 is not always directly in front of microphone M2 and may be speaking while leaning toward speaker W1 or speaker W3. FIG. 2 shows these positional relationships. If the speaker is at point x, equidistant from microphones M2 and M3, the voice signal arrives at both microphones at the same time, but when the speaker shifts left or right a difference in arrival time arises. Detecting this arrival-time difference therefore makes it possible to estimate the approximate position of the speaker.

[0021] FIG. 3 is a flowchart showing the main operation of the speaker detection unit 2, which is described below with reference to it. In step 31 of FIG. 3, the cross-correlation coefficient of the input signals is first calculated by Equation 1 for each pair of adjacent microphones at fixed time intervals (hereinafter called frames).

[0022]

[Equation 1]
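Equation 1 itself is not reproduced in this text. As a rough illustration of steps 31 to 33, the sketch below assumes a normalized cross-correlation between two frames, evaluated at every lag from -m to m, where m is the preset maximum lag described in the text. The normalization, the sign convention (a positive lag means the second signal is delayed), and all function names are assumptions, not the patent's formula.

```python
import math


def normalized_xcorr(b, c, lag):
    """Correlation coefficient between b[t] and c[t + lag] over the overlap."""
    n = len(b)
    pairs = [(b[t], c[t + lag]) for t in range(n) if 0 <= t + lag < n]
    num = sum(x * y for x, y in pairs)
    den = math.sqrt(sum(x * x for x, _ in pairs) * sum(y * y for _, y in pairs))
    return num / den if den > 0 else 0.0


def best_lag(b, c, m):
    """Step 32 (sketch): largest coefficient over lags -m..m, with its lag."""
    return max((normalized_xcorr(b, c, k), k) for k in range(-m, m + 1))


def best_pair(frames, m):
    """Step 33 (sketch): choose the adjacent pair (i, i + 1) with the largest peak.

    frames -- one list of samples per microphone, in seating order
    Returns (index of the left microphone of the pair, lag, coefficient).
    """
    results = []
    for i in range(len(frames) - 1):
        coeff, lag = best_lag(frames[i], frames[i + 1], m)
        results.append((coeff, lag, i))
    coeff, lag, i = max(results)
    return i, lag, coeff
```

For a signal and a copy of it delayed by three samples, `best_lag` returns a coefficient close to 1 at lag 3, which is the quantity step 34 then converts into a left-right displacement.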

[0023] Here, b_t and C_t are the sample values at an arbitrary time t, n is the number of samples in one frame, and m is a value set in advance for detecting the left-right displacement of the speaker; it varies somewhat with the analysis conditions and the positional relationship between the microphones and the speaker. Next, in step 32, for each microphone pair, the largest of the cross-correlation coefficients from order -m to order m is found, and its value and order are stored. In step 33, the microphone pair that gives the largest of these per-pair maximum values is selected. Next, in step 34, the left-right displacement of the speaker is estimated from the order that gives the maximum correlation for the selected pair, and it is determined whether the speaker is in front of the corresponding microphone. For example, in FIG. 2, the difference T between the arrival times at microphones M2 and M3 of a voice signal uttered at the position of speaker W2 is given by Equation 2, where c is the speed of sound, l is the distance from speaker W2 to microphone M2, and k is the distance from speaker W2 to microphone M3.

[0024]

[Equation 2]

[0025] If the order that gives the maximum correlation is m1, then T corresponds to T_S × m1 (seconds), where T_S is the sampling period, and speaker W2 is found to be to the left of point x by roughly the distance corresponding to this time. A range in front of each microphone within which the voice of a speaker should be captured is therefore set in advance, and a speaker is judged to be present if the detected position falls within that range. When the sound source lies near the line through point x, roughly equidistant from microphones M2 and M3, no microphone is identified as receiving the input.
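The conversion from the peak lag m1 to a time and a distance can be sketched as follows. Equation 2 is not reproduced in this text; from the definitions of c, l, and k it presumably relates T to the path difference (k - l) divided by c, and the code relies only on T = T_S × m1, which the text states. The speed of sound value, the dead zone around the midpoint, the sign convention, and the function names are assumptions.

```python
SPEED_OF_SOUND = 340.0  # m/s; assumed value, not given in the source


def lag_to_time(m1, sampling_period):
    """T = T_S * m1: arrival-time difference in seconds for a peak at lag m1."""
    return sampling_period * m1


def path_difference(m1, sampling_period, c=SPEED_OF_SOUND):
    """Difference between the two speaker-to-microphone distances, in meters."""
    return c * lag_to_time(m1, sampling_period)


def identify_microphone(m1, dead_zone, left_mic, right_mic):
    """Pick a microphone from the sign of the lag (assumed convention).

    A positive lag is taken to mean the sound reached left_mic first.
    Lags within the dead zone mean the source is near the midpoint,
    so no microphone is identified.
    """
    if abs(m1) <= dead_zone:
        return None
    return left_mic if m1 > 0 else right_mic


# At a 10 kHz sampling rate (T_S = 0.1 ms), a peak at lag 5 is about 0.5 ms,
# or roughly 0.17 m of path difference at the assumed speed of sound.
print(lag_to_time(5, 1e-4), path_difference(5, 1e-4))
print(identify_microphone(5, 2, "M2", "M3"))
```

The dead zone is a simplification of the preset capture range described in the text; a fuller version would map the lag to a position and test it against a range per microphone.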

[0026] Finally, in step 35, the determination result is output: an ON signal for the microphone identified as the one the speaker is speaking into, and an OFF signal for microphones that were not identified. To prevent the result from switching rapidly because of false determinations, short utterances, or sudden noise, the result is turned ON only when the same determination continues for a fixed number of frames, and turned OFF only when no microphone can be identified for a fixed number of frames or more. This completes the description of the operation of the speaker detection unit 2.
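The frame-count smoothing in step 35 can be sketched as a small state machine. The class name `MicSwitch`, the two frame-count parameters, and the choice to keep the current microphone ON while a new candidate is still accumulating frames are assumptions; the source only states that ON requires the same result for a fixed number of frames and OFF requires a fixed number of frames with no identification.

```python
class MicSwitch:
    """Debounce the per-frame microphone identification (illustrative sketch)."""

    def __init__(self, on_frames, off_frames):
        self.on_frames = on_frames    # consecutive identical results needed for ON
        self.off_frames = off_frames  # consecutive empty results needed for OFF
        self.active = None            # microphone currently reported as ON
        self.candidate = None
        self.candidate_run = 0
        self.none_run = 0

    def update(self, detected):
        """detected: microphone index for this frame, or None if none was identified."""
        if detected is None:
            self.none_run += 1
            self.candidate, self.candidate_run = None, 0
            if self.none_run >= self.off_frames:
                self.active = None
        else:
            self.none_run = 0
            if detected == self.candidate:
                self.candidate_run += 1
            else:
                self.candidate, self.candidate_run = detected, 1
            if self.candidate_run >= self.on_frames:
                self.active = detected
        return self.active


sw = MicSwitch(on_frames=3, off_frames=2)
print([sw.update(d) for d in [1, 1, 1, None, 1, None, None, 2, 2, 2]])
```

With these settings a single empty frame does not drop microphone 1, while two consecutive empty frames do.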

[0027] Next, the operation of the voice determination unit 3 is described. FIG. 4 is a block diagram of the voice determination unit 3. In FIG. 4, 41 denotes a feature extraction unit that extracts a plurality of feature quantities for voice detection, calculating them for each frame. These feature quantities are used to detect voice and have properties characteristic of voice. This embodiment uses cepstrum coefficients of first and higher order. Other feature quantities may also be used, for example the autocorrelation coefficients and linear prediction coefficients obtained in linear prediction analysis, PARCOR coefficients, or mel-cepstrum coefficients. Spectral information obtained by other speech analysis, for example FFT analysis, can likewise be used because it captures the characteristics of voice in the same way. The input signal can also be divided into several frequency bands by analog or digital filters, with the energy of each band calculated and treated as one feature quantity. It is also possible to use the number of zero crossings obtained for each band as a feature quantity, to treat the mel-cepstrum coefficients obtained by FFT analysis of each band as one feature quantity, or to treat the spectrum obtained by LPC analysis of each band as one feature quantity.

[0028] Next, 42 denotes a frequency pattern creation unit that creates standard patterns of the frequency characteristics of voice, using feature quantities extracted in advance by the feature extraction unit 41 from a large amount of highly reliable training speech data. To create the standard patterns, spectral feature quantities are extracted in advance from a large amount of speech data, and a standard pattern is created for each phoneme from those feature quantities. In this embodiment, the standard pattern consists of the mean and covariance obtained by modeling the feature distribution as a multidimensional normal distribution, created for each phoneme. Other distributions, for example a gamma distribution or a Poisson distribution, may also be used. More accurate determination is also possible by using, as the standard pattern, an optimal standard pattern created for each phoneme after classifying the training speech data by phoneme, or the codes obtained by clustering the training speech data with vector quantization.

[0029] 43 denotes a likelihood determination unit. For the per-frame cepstrum coefficients of the input signal output by the feature extraction unit 41, it calculates the distance to, that is, the likelihood under, the per-phoneme feature distributions created by the frequency pattern creation unit 42, and compares it with a threshold to determine whether the frame is voice or something else.
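The likelihood test can be sketched as follows. For brevity the sketch assumes a Gaussian with a diagonal covariance (a variance per dimension) for each phoneme, whereas the embodiment uses a full covariance matrix; the threshold value, the model layout, and the function names are likewise assumptions.

```python
import math


def log_likelihood(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
        for xi, mu, v in zip(x, mean, var)
    )


def classify_frame(x, phoneme_models, threshold):
    """Return (best-matching phoneme, True if its log-likelihood passes the threshold).

    phoneme_models -- dict: phoneme -> (mean vector, variance vector)
    """
    best = max(phoneme_models, key=lambda k: log_likelihood(x, *phoneme_models[k]))
    score = log_likelihood(x, *phoneme_models[best])
    return best, score >= threshold


models = {
    "a": ([0.0, 0.0], [1.0, 1.0]),
    "s": ([3.0, 3.0], [1.0, 1.0]),
}
print(classify_frame([0.1, -0.1], models, threshold=-5.0))   # close to "a"
print(classify_frame([10.0, 10.0], models, threshold=-5.0))  # far from every model
```

A frame that lies far from every phoneme model falls below the threshold and is treated as non-voice, which is how a loud but non-speech-like signal would be rejected.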

[0030] 44 denotes a time pattern creation unit that creates time patterns expressing the temporal characteristics of voice, prepared in advance from a large amount of highly reliable training speech data. This embodiment uses the maximum and minimum duration of each phoneme, obtained from a large amount of training speech data. As another example, a duration distribution such as a normal distribution, a gamma distribution, or a Poisson distribution may be used.

[0031] 45 denotes a final determination unit that compares the portions of the input signal judged to be voice by the likelihood determination unit 43 with the time patterns created by the time pattern creation unit 44, and thereby determines whether the input signal was voice or something else. In this embodiment, the duration of each phoneme is obtained from the input signal and compared with the maximum and minimum voice durations obtained in advance from a large amount of speech; voice is judged to be detected only when the duration is smaller than the maximum and larger than the minimum. Instead of the maximum and minimum durations, it is also possible to assume that duration follows a statistical distribution, compute a probability from the duration observed in the input signal, and judge the input to be voice if the probability exceeds a threshold. Alternatively, standard speech spectrum sequences can be registered from a large amount of speech data as time patterns, and nonlinear time warping (DP matching) between the input signal and these standard patterns can detect (spot) where in the input signal each standard pattern occurs, which allows voice to be distinguished from other sounds. Hidden Markov models (HMMs) can also be created in advance as standard patterns from a large number of speech spectrum sequences; probability calculation between the input signal and the HMMs then detects (spots) where each standard pattern occurs and determines whether the input is voice. Instead of using time patterns, it is also possible to track from moment to moment the amount of change in the feature quantities obtained by speech analysis of the input signal, and to apply a threshold to that change in order to detect the phonemes in the voice and distinguish voice from noise. Furthermore, using a codebook obtained by vector-quantizing feature quantities that characterize the phonemic nature of uttered speech, or feature quantities obtained by per-band speech analysis after filtering, the quantization distortion that results when the input signal is vector-quantized can be compared with a threshold to determine whether the input is voice or noise. The input signal can also be converted into the pattern of changes in the code sequence produced by vector quantization, and whether it is voice can be determined from the frequency of occurrence of each code or the duration of each code.
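The duration test of this embodiment can be sketched as follows. The per-frame phoneme labels are assumed to come from the likelihood determination unit 43, with `None` marking frames judged non-voice. Requiring every phoneme run to lie strictly between its minimum and maximum is one reading of the text; the data layout and function names are assumptions.

```python
def phoneme_runs(labels):
    """Collapse per-frame labels into (phoneme, run length in frames) pairs."""
    runs = []
    for lab in labels:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])
    return [(p, n) for p, n in runs]


def is_voice(labels, limits):
    """True if each phoneme run lasts longer than its minimum and shorter than its maximum.

    limits -- dict: phoneme -> (min_frames, max_frames), learned from training data
    """
    runs = [(p, n) for p, n in phoneme_runs(labels) if p is not None]
    return bool(runs) and all(limits[p][0] < n < limits[p][1] for p, n in runs)


limits = {"a": (2, 20), "s": (1, 15)}
print(is_voice(["a"] * 5 + ["s"] * 3, limits))  # plausible durations
print(is_voice(["a"] * 40, limits))             # a 40-frame "a" is too long
```

A steady tone that happens to resemble a vowel spectrally would pass the likelihood test frame by frame but fail here, because no real vowel lasts that long.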

[0032] The operation of the voice determination unit 3 is described in detail below with reference to the block diagram of FIG. 4. When an acoustic signal is input through a microphone, the feature extraction unit 41 first extracts a plurality of feature quantities. In this embodiment, cepstrum coefficients are used for the determination. The K-th order autocorrelation coefficients A_i(k) are calculated at fixed time intervals, and each A_i(k) is normalized by the zeroth-order autocorrelation coefficient A_i(0). The fixed time interval is, for example, 200 samples (20 ms) at a sampling frequency of 10 kHz, and this unit of time is called a frame. The L-th order cepstrum coefficients C_i(l) for frame i are obtained by linear prediction analysis. These feature quantities are assumed to be mutually independent and are treated together as a single m-dimensional vector x.
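As a hedged illustration of the per-frame analysis described above, the following Python sketch computes the normalized autocorrelation and derives cepstrum coefficients from a linear prediction (LPC) model. The function names, the use of the Levinson-Durbin recursion, and the analysis orders are assumptions made for illustration; the text specifies only the 200-sample frame at 10 kHz, the normalization by A_i(0), and that the cepstrum is obtained by linear prediction analysis.

```python
FRAME_LEN = 200  # 20 ms at a 10 kHz sampling frequency, as stated in the text


def autocorr(frame, max_lag):
    """Autocorrelation A(k) of one frame for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k))
            for k in range(max_lag + 1)]


def normalize(acf):
    """Divide every coefficient by the zeroth-order coefficient A(0)."""
    if acf[0] == 0.0:
        return [0.0] * len(acf)
    return [a / acf[0] for a in acf]


def levinson_durbin(r, order):
    """LPC coefficients a[0..order] of A(z) = 1 + sum_j a[j] z^-j."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1 / A(z)."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]


def frame_features(frame, lpc_order=10, n_ceps=10):
    """Feature vector x for one frame (both orders are assumed values)."""
    acf = normalize(autocorr(frame, lpc_order))
    return lpc_cepstrum(levinson_durbin(acf, lpc_order), n_ceps)
```

Calling `frame_features` on each successive 200-sample frame would yield one m-dimensional vector per frame, with m equal to the assumed `n_ceps`.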

[0033] The frequency pattern creation unit 42 uses a large amount of training voice data prepared in advance, extracts for each phoneme the feature quantities obtained by the feature extraction unit 41, and creates a frequency pattern for each phoneme. Possible phonemes include vowels, unvoiced fricatives, nasals, voiced plosives, affricates, liquids, and semivowels. Here, the mean value μ_kc and the covariance matrix Σ_kc of each phoneme, obtained by the following method, are used as the frequency pattern. Here k is the phoneme number, c indicates that the value was obtained by the feature quantity distribution creation unit, μ_kc is an m-dimensional vector, and Σ_kc is an m × m matrix. As training data for a phoneme, the portions corresponding to phoneme k may, for example, be cut out of the training data of a certain standard speaker. Using voice data from a plurality of speakers makes it possible to create a standard model that is robust to variation in the speakers' utterances.
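The per-phoneme statistics described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dictionary layout of the training data, the function names, and the choice to divide by the number of vectors (rather than by one less) are all assumptions.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of m-dimensional feature vectors."""
    n = len(vectors)
    m = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(m)]


def covariance_matrix(vectors, mu):
    """m x m covariance matrix around mu (normalized by n, an assumption)."""
    n = len(vectors)
    m = len(mu)
    cov = [[0.0] * m for _ in range(m)]
    for v in vectors:
        diff = [v[d] - mu[d] for d in range(m)]
        for r in range(m):
            for c in range(m):
                cov[r][c] += diff[r] * diff[c]
    return [[cov[r][c] / n for c in range(m)] for r in range(m)]


def build_frequency_patterns(training):
    """Map each phoneme label k to its (mean, covariance) frequency pattern.

    `training` is a hypothetical dict: phoneme label -> list of feature vectors
    cut out of the training voice data for that phoneme.
    """
    patterns = {}
    for k, vectors in training.items():
        mu = mean_vector(vectors)
        patterns[k] = (mu, covariance_matrix(vectors, mu))
    return patterns
```

Pooling vectors from several speakers under the same phoneme label before calling `build_frequency_patterns` corresponds to the multi-speaker training mentioned in the text.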

[0034] The likelihood determination unit 43 calculates, for the feature quantities of the input signal output frame by frame from the feature extraction unit 41, the log-likelihood with respect to the standard pattern of each phoneme created by the frequency pattern creation unit 42. The log-likelihood here is a statistical distance measure obtained by assuming that the distribution of the feature quantities is a multidimensional normal distribution. The log-likelihood L_ik of the input vector x_i of the i-th frame with respect to the standard pattern k of a given phoneme is calculated by Equation 3.

【００３５】[0035]

【数３】 [Equation 3]

[0036] Here, x_i is an m-dimensional vector (the m-dimensional feature quantity), t denotes the transpose, and −1 denotes the inverse matrix. Then, using Equation 4, phonemes are detected by comparing the log-likelihood for each phoneme with a threshold determined in advance for that phoneme.
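The images for Equations 3 and 4 are not reproduced in this text. As a hedged sketch only: under the multidimensional normal assumption stated above, a log-likelihood of the following standard form would be consistent with the symbols defined here (mean μ_kc, covariance Σ_kc, transpose t, inverse −1), with L_kTH denoting the per-phoneme log-likelihood threshold. The additive constant C and the strict inequality in the detection rule are assumptions, not a transcription of the original equations.

```latex
% Assumed form of Equation 3: Gaussian log-likelihood of frame i for phoneme k
L_{ik} = -\frac{1}{2}\,(x_i - \mu_{kc})^{t}\,\Sigma_{kc}^{-1}\,(x_i - \mu_{kc})
         \;-\; \frac{1}{2}\ln\lvert\Sigma_{kc}\rvert \;+\; C

% Assumed form of Equation 4: phoneme k is detected in frame i when
L_{ik} > L_{kTH}
```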

【００３７】[0037]

【数４】 [Equation 4]

[0038] Here, L_kTH is the determination threshold (log-likelihood threshold) for each phoneme k. The time pattern creation unit 44 obtains in advance, from a large amount of training voice data, the maximum value Dmax and the minimum value Dmin of the duration of each phoneme, and the final determination unit 45 makes the final decision as to whether the input is voice or some other noise. First, the information on the phonemes detected by the likelihood determination unit 43 is sent to the final determination unit 45, which determines how many frames each phoneme has continued, that is, the duration Dk of each phoneme. When Dk is smaller than the maximum value and larger than the minimum value obtained for that phoneme by the time pattern creation unit 44, the phoneme is judged to have been detected, and it is finally determined whether the input signal is voice or something else.
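The duration check can be sketched as follows, assuming the per-frame phoneme decisions are available as a list of labels (with `None` where no phoneme passed the likelihood threshold). The function names, the label representation, and the final rule that one accepted phoneme suffices to declare voice are assumptions for illustration; the text does not state how accepted phonemes are combined into the final decision.

```python
def phoneme_runs(labels):
    """Collapse per-frame labels into (label, duration_in_frames) runs."""
    runs = []
    for label in labels:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1
        else:
            runs.append([label, 1])
    return [(label, duration) for label, duration in runs]


def accepted_phonemes(labels, limits):
    """Keep runs whose duration Dk satisfies Dmin < Dk < Dmax.

    `limits` is a hypothetical dict: phoneme label -> (Dmin, Dmax) in frames.
    """
    accepted = []
    for label, duration in phoneme_runs(labels):
        if label is None or label not in limits:
            continue
        d_min, d_max = limits[label]
        if d_min < duration < d_max:
            accepted.append((label, duration))
    return accepted


def is_voice(labels, limits):
    """Assumed final rule: voice if at least one phoneme passes the check."""
    return len(accepted_phonemes(labels, limits)) > 0
```

With 20 ms frames, a limit pair such as `(2, 10)` would accept a phoneme lasting between 60 ms and 180 ms; these numbers are purely illustrative.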

[0039] Further, how frequently such phonemes appear within a certain interval can also be judged by fuzzy inference. For example, a membership function for the number of appearances of each phoneme is determined in advance from a large amount of voice data, the number of appearances of each phoneme in the actual input signal is obtained by the likelihood determination unit 43, and the fuzzy output calculated from the membership functions is evaluated to decide finally whether voice or noise has been detected. This concludes the description of the operation of the voice determination unit 3.

[0040] Finally, for the input of a microphone that the speaker detection unit 2 has identified as the one whose corresponding speaker is talking, the comprehensive determination unit 4 outputs to the outside a signal indicating that the microphone is on, provided that the voice determination unit 3 has determined that a voice signal is being input.

[0041] As described above, according to this embodiment, the microphone receiving a signal from the speaker direction is identified from the correlation between adjacent microphones, and whether the input signal is voice is determined accurately by using its phonemic properties. This prevents sudden noise or continuous noise from being mistaken for voice, allows the microphone corresponding to the speaker to be identified even when the voice signal is also picked up by adjacent microphones, and further prevents false responses caused by ambient noise and the like.

[0042] Next, a second embodiment of the voice detection device of the present invention is described with reference to the drawings. FIG. 5 is a block diagram showing the configuration of the voice detection device of the second embodiment. In FIG. 5, W denotes the speakers who utter voice (for example, speakers W1, W2, and so on); 51 denotes first microphones facing the speakers (for example, microphones M11, M21, and so on); 52 denotes second microphones facing away from the speakers (for example, microphones M12, M22, and so on); 53 denotes a front sound detection unit that detects only signals from the speaker direction from the input signals of microphones 51 and 52; 54 denotes a voice determination unit that extracts spectral feature quantities from the input signal of the first microphone and determines whether it is voice; and 55 denotes a comprehensive determination unit that, from the above results, determines only voice signals from the speaker direction and outputs the determination result.

[0043] The operation of the above voice detection device is described below. Acoustic signals are input to each first microphone 51 and each second microphone 52. Both signals are sent to the front sound detection unit 53, while only the input signal of the first microphone is sent to the voice determination unit 54. It is assumed here that one first microphone and one second microphone are installed as a pair for each speaker.

[0044] The front sound detection unit 53 determines, from the difference between the input signals of microphones 51 and 52, whether a signal comes from in front of microphone 51. To estimate which speaker the voice comes from, the front sound detection unit 53 obtains the difference in power between the input signals of microphone 51 and microphone 52 for each pair, and judges the voice to come from the speaker in front of the microphone 51 for which this difference is largest. When an acoustic signal emitted from the speaker direction is input, the power of microphone 51 is naturally larger than that of microphone 52. Therefore, letting P₁ be the power of microphone 51 and P₂ the power of microphone 52 in each frame, the signal can be judged to come from the speaker direction (front sound) when the conditional expression of Equation 5 is satisfied.

【００４５】[0045]

【数５】 [Equation 5]

[0046] Here, c₁ is a preset threshold on the power difference for front sound detection. The front sound can also be determined in the same way by using the conditional expression of Equation 6.

【００４７】[0047]

【数６】 [Equation 6]

[0048] Here, c₂ is a preset threshold on the power ratio for front sound detection. To prevent the determination result obtained for each frame from switching back and forth within a short time, the front sound determination result is turned on when frames judged to be front sound continue for at least a fixed number of consecutive frames, and is turned off when frames not judged to be front sound continue for at least a fixed number of frames; this on/off information is output to the outside. The above processing makes it possible to detect only signals from the speaker direction.
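The front sound detection steps can be sketched as below. Because the images for Equations 5 and 6 are not reproduced in this text, the conditions `P1 - P2 > c1` and `P1 / P2 > c2` are assumed forms inferred from the descriptions of c₁ and c₂; the function names and the mean-square definition of frame power are likewise assumptions.

```python
def frame_power(frame):
    """Mean-square power of one frame (an assumed definition of P)."""
    return sum(s * s for s in frame) / len(frame)


def is_front_by_difference(p1, p2, c1):
    """Assumed form of Equation 5: power difference exceeds threshold c1."""
    return p1 - p2 > c1


def is_front_by_ratio(p1, p2, c2):
    """Assumed form of Equation 6, written as p1 > c2 * p2 to avoid dividing by zero."""
    return p1 > c2 * p2


def pick_speaker(pairs):
    """Index of the microphone pair (P1, P2) with the largest P1 - P2."""
    return max(range(len(pairs)), key=lambda i: pairs[i][0] - pairs[i][1])


def smooth(flags, on_frames, off_frames):
    """Turn on after on_frames consecutive front frames, off after off_frames others."""
    state = False
    run = 0
    out = []
    for flag in flags:
        if flag != state:
            run += 1
            if run >= (on_frames if flag else off_frames):
                state = flag
                run = 0
        else:
            run = 0
        out.append(state)
    return out
```

The `smooth` function is what suppresses brief flips: a single front-sound frame in the middle of silence never reaches the output, and a single dropped frame during speech does not turn the result off.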

[0049] The voice determination unit 54 determines whether the input signal of the first microphone 51 is voice. The operation of the voice determination unit 54 is the same as that of the voice determination unit 3 in the first embodiment of the voice detection device, so its description is omitted.

[0050] Based on the output results sent at fixed time intervals from the front sound detection unit 53 and the voice determination unit 54, the comprehensive determination unit 55 examines, within each microphone pair, the input signal of a first microphone judged to be receiving input from the speaker direction. If the voice determination unit 54 has determined that this input signal is a voice signal, the comprehensive determination unit 55 outputs to the outside a signal indicating that the microphone is on.

[0051] As described above, according to this embodiment, a pair of microphones facing toward and away from the speaker is used to determine, from the difference in power between their input signals, whether a signal comes from the speaker direction, and the phonemic properties of the input signal are used to determine whether it is a voice signal. This prevents erroneous determinations caused by noise and allows only voice signals emitted from the speaker direction to be detected accurately.

[0052] Next, a third embodiment of the voice detection device of the present invention is described with reference to the drawings. FIG. 6 is a block diagram of this embodiment. In FIG. 6, W denotes the speakers who utter voice (for example, speakers W1, W2, and so on); 61 denotes first microphones facing the speakers (for example, microphones M11, M21, and so on); and 62 denotes second microphones facing away from the speakers (for example, microphones M12, M22, and so on). Each first microphone 61 and second microphone 62 form one pair (for example, microphone pairs Mc1, Mc2, and so on), and a plurality of such pairs are provided. Also in FIG. 6, 63 denotes a front sound detection unit that detects only signals from the speaker direction from the difference between the input signals of the first and second microphones; 64 denotes a voice determination unit that determines whether the input signal of each first microphone is a voice signal by extracting its spectral feature quantities; 65 denotes a speaker detection unit that estimates the position of the speaker from the correlation between the input signals of each pair of adjacent first microphones and identifies the microphone corresponding to that speaker; and 66 denotes a comprehensive determination unit that, based on the output results of the front sound detection unit 63, the voice determination unit 64, and the speaker detection unit 65, finally determines for each first microphone whether a voice signal from the front is being input, and outputs the determination result.

[0053] The operation of this embodiment is described below. The acoustic signal input to each microphone is converted into a digital signal. The outputs of all microphones are sent to the front sound detection unit 63, and the output signal of each first microphone is sent to the voice determination unit 64 and the speaker detection unit 65.

[0054] The operation of the front sound detection unit 63 is the same as that of the front sound detection unit 53 of FIG. 5 in the second embodiment, and the operations of the voice determination unit 64 and the speaker detection unit 65 are the same as those of the voice determination unit 3 and the speaker detection unit 2 of FIG. 1 in the first embodiment, respectively, so their description is omitted.

[0055] When a first microphone 61 that the front sound detection unit 63 has judged to be receiving input from the speaker in front of it is also identified by the speaker detection unit 65, and the voice determination unit 64 has determined that its input signal is voice, the comprehensive determination unit 66 outputs to the outside a signal indicating that this first microphone is on.
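The combination rule of the comprehensive determination unit 66 amounts to a logical AND of the three results for each first microphone. A minimal sketch, assuming the three units deliver their results as per-microphone booleans and a set of identified microphone indices (this data layout and the function name are illustrative assumptions):

```python
def comprehensive_decision(front, identified, voice):
    """Per-microphone on/off flags for the first microphones.

    front[i]   -- front sound detection result for first microphone i
    identified -- set of indices identified by the speaker detection unit
    voice[i]   -- voice determination result for first microphone i
    """
    return [front[i] and (i in identified) and voice[i]
            for i in range(len(front))]
```

For example, with three first microphones where only microphone 1 passes all three checks, `comprehensive_decision([True, True, False], {1}, [False, True, True])` reports only microphone 1 as on.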

[0056] As described above, according to this embodiment, the front sound detection unit 63 detects only signals from the front on the basis of the difference in power between the input signals of each pair of microphones facing toward and away from the speaker; the voice determination unit 64 determines whether the signal is a voice signal on the basis of phoneme detection; and the speaker detection unit 65 identifies the microphone receiving input from the front by estimating the position of the speaker from the cross-correlation coefficient between the input signals of adjacent microphones. These results are judged together and a voice detection result is output for each microphone. As a result, various kinds of noise arriving from any direction can be reliably rejected, and the microphone corresponding to the speaker who spoke can be identified accurately even when the voice leaks into other microphones.

[0057] Next, an embodiment of the video switching device of the present invention is described with reference to the drawings. FIG. 7 is a block diagram showing the configuration of this embodiment. In FIG. 7, 71 denotes a voice detection unit that detects, from the input signal of each microphone, only the voice signal of the corresponding speaker and outputs, at fixed time intervals, information on whether a voice signal is being input for each microphone; 72 denotes a video switching control unit that sends a control signal for switching the video to the position of the microphone into which a speaker's voice is being input; and 73 denotes a camera control unit that, on receiving the output of the video switching control unit 72, controls the camera 75 and the monitor control unit 76 so that the video on the monitor 74 switches to the preset position of the speaker who is talking.

[0058] The operation of this embodiment is described below. The voice detection unit 71 may have the configuration of any of the first, second, or third embodiments of the voice detection device described above, and the description of its operation is omitted.

[0059] The voice detection unit 71 outputs, at fixed time intervals, information on the microphones in which voice has been detected. On receiving this output, the video switching control unit 72 decides the timing of video switching and sends the camera control unit 73 a control signal for switching to the video at the position of the microphone in which voice is detected. To avoid the viewing difficulty caused by frequent video switching and to cope with false voice detections, the video switching signal is sent a fixed time after voice detection starts, and an end signal is sent a fixed time after voice detection ends.
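The delayed switching can be sketched as a small state machine over the periodic detection reports for one microphone. This is an illustrative interpretation rather than the patent's stated logic: it assumes the detection must persist for the whole start delay before a switch signal is issued, and that silence must persist for the whole end delay before an end signal is issued. The function name, the event labels, and the delays measured in reporting intervals are hypothetical.

```python
def switch_events(detected, start_delay, end_delay):
    """Return (interval_index, event) pairs for one microphone.

    detected    -- per-interval booleans from the voice detection unit
    start_delay -- consecutive detected intervals required before "switch"
    end_delay   -- consecutive silent intervals required before "end"
    """
    events = []
    active = False
    on_run = 0
    off_run = 0
    for i, flag in enumerate(detected):
        if flag:
            on_run += 1
            off_run = 0
            if not active and on_run >= start_delay:
                active = True
                events.append((i, "switch"))
        else:
            off_run += 1
            on_run = 0
            if active and off_run >= end_delay:
                active = False
                events.append((i, "end"))
    return events
```

Under this reading, an isolated false detection shorter than `start_delay` never moves the camera, and a brief pause shorter than `end_delay` does not end the shot.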

[0060] Based on the switching control signal from the video switching control unit 72, the camera control unit 73 sends a movement signal to the camera 75 and changes its orientation so that the picture switches to the speaker corresponding to the determined microphone. The position of the speaker corresponding to each microphone is set in advance, and this position information is stored in the camera control unit 73.

[0061] As described above, according to this embodiment, only those of the plurality of microphones into which the voice of the corresponding speaker is being input are accurately picked out, and the video can be switched automatically to that speaker on the basis of this voice detection information. A video switching device that allows a video conference to proceed especially naturally can thus be realized.

[0062] In this embodiment, a single camera 75 is used, and the camera control unit 73, based on the switching control signal from the video switching control unit 72, sends a movement signal to the camera 75 and changes its orientation so that the picture switches to the speaker corresponding to the determined microphone. Alternatively, a plurality of cameras may be arranged so that each camera covers a suitable number of speakers, and the camera control unit 73 may, based on the switching control signal from the video switching control unit 72, switch the connection to the camera assigned to the speaker corresponding to the determined microphone, thereby switching the picture to that speaker. This improves how quickly the picture follows the speaker and copes well even with rapid changes of speaker.

【００６３】[0063]

[Effects of the Invention] According to the configuration of claim 1, the voice determination unit extracts spectral feature quantities from the signals input to the microphones and determines whether each signal is voice according to whether it is similar to voice feature quantities obtained in advance, and the speaker detection unit estimates the position of the speaker by detecting the difference between the input signals of adjacent microphones and identifies the microphone corresponding to that speaker. Based on the output results of the voice determination unit and the speaker detection unit, the comprehensive determination unit can therefore determine only the voice of the speaker corresponding to each microphone. As a result, the microphone corresponding to the speaker who is talking can be identified accurately, and highly accurate voice detection that does not mistake noise of various kinds for voice can be achieved.

[0064] According to the configuration of claim 3, the front sound detection unit detects the difference between the signals input to the first microphone facing the speaker and the second microphone facing away from the speaker, and thereby detects only signals emitted from in front of the first microphone; the voice determination unit extracts spectral feature quantities from the signal input to the first microphone and determines whether the signal is voice according to whether it is similar to voice feature quantities obtained in advance. Based on the output results of the front sound detection unit and the voice determination unit, the comprehensive determination unit can therefore determine only the voice of the speaker corresponding to each first microphone. As a result, noise and voices from the left, right, and rear can be rejected, and highly accurate voice detection that does not mistake noise of various kinds for voice can be achieved.

[0065] According to the configuration of claim 4, the front sound detection unit detects the difference between the signals input to the paired first microphone facing the speaker and second microphone facing away from the speaker, and thereby detects only signals emitted from in front of the first microphone; the voice determination unit extracts spectral feature quantities from the signal input to the first microphone of each pair and determines whether the signal is voice according to whether it is similar to voice feature quantities obtained in advance; and the speaker detection unit estimates the position of the speaker by detecting the difference between the input signals of adjacent first microphones and identifies the microphone corresponding to that speaker. Based on the output results of the front sound detection unit, the voice determination unit, and the speaker detection unit, the comprehensive determination unit can therefore determine only the voice of the speaker corresponding to the first microphone of each pair. As a result, noise and voices from the left, right, and rear can be rejected, the microphone corresponding to the speaker who is talking can be identified accurately, and highly accurate voice detection that does not mistake noise of various kinds for voice can be achieved.

[0066] According to the configuration of claim 25, based on the output of the voice detection device according to claim 1, the video switching control unit outputs to the camera control unit a control signal for switching the video to the speaker corresponding to the identified microphone, and with this control signal the camera control unit can control the switching of the output video on the basis of the speaker position information stored in advance. As a result, the video can be switched automatically to the position of the microphone that received voice input, which is accurate and easy to use and, in a video conference system in particular, allows the conference to proceed smoothly.

[Brief description of drawings]

FIG. 1 is a configuration diagram of a voice detection device according to a first embodiment of the present invention.

FIG. 2 is an explanatory diagram of the speaker identification operation of the same embodiment.

FIG. 3 is a flowchart of the speaker identification operation of the same embodiment.

FIG. 4 is a configuration diagram of the voice determination unit of the same embodiment.

FIG. 5 is a configuration diagram of a voice detection device according to a second embodiment of the present invention.

FIG. 6 is a configuration diagram of a voice detection device according to a third embodiment of the present invention.

FIG. 7 is a configuration diagram of a video switching device according to an embodiment of the present invention.

[Explanation of symbols]

1: Microphone
2, 65: Speaker detection unit
3, 54, 64: Voice determination unit
4, 55, 66: Comprehensive determination unit
51, 61: First microphone
52, 62: Second microphone
53, 63: Front sound detection unit

Claims

[Claims]

1. A voice detection device comprising: a plurality of microphones for detecting sound; a voice determination unit that extracts spectral feature quantities from the signals input to these microphones and determines whether each signal is voice according to whether it is similar to voice feature quantities obtained in advance; a speaker detection unit that detects the difference between the input signal of any one microphone and the input signal of a microphone adjacent to it, thereby estimating the position of the speaker who is the sound source, and identifies the microphone corresponding to that speaker; and a comprehensive determination unit that, using the output results of the voice determination unit and the speaker detection unit, determines only the voice of the speaker corresponding to each microphone on the basis of predetermined determination conditions.

2. The voice detection device according to claim 1, wherein the speaker detection unit uses the cross-correlation coefficient between the input signals of two adjacent microphones to detect the difference in arrival time of the input signals at the adjacent microphones, thereby estimating the position of the speaker and identifying the microphone corresponding to that speaker.

3. A voice detection device comprising: a first microphone facing a speaker; a second microphone facing away from the speaker; a front sound detection unit that detects only signals emitted from in front of the first microphone by detecting the difference between the respective input signals of the first microphone and the second microphone; a voice determination unit that extracts spectral feature quantities from the signal input to the first microphone and determines whether the signal is voice according to whether it is similar to voice feature quantities obtained in advance; and a comprehensive determination unit that, using the output results of the front sound detection unit and the voice determination unit, determines only the voice of the speaker corresponding to the first microphone on the basis of predetermined determination conditions.

4. A voice detection device comprising: a plurality of microphone sets, each consisting of a first microphone facing a speaker and a second microphone facing away from the speaker; a front sound detection unit that, for each set, detects only signals emitted from in front of the first microphone by detecting the difference between the input signals of the first microphone and the second microphone; a voice determination unit that extracts a spectral feature quantity from the signal input to the first microphone of each set and determines whether the signal is voice according to its similarity to a voice feature quantity obtained in advance; a speaker detection unit that estimates the position of the speaker by detecting the difference between the input signal of an arbitrary first microphone and the input signal of the first microphone adjacent to it, and identifies the microphone corresponding to that speaker; and a comprehensive determination unit that, using the output results of the front sound detection unit, the voice determination unit, and the speaker detection unit, determines only the voice of the speaker corresponding to the first microphone of each set on the basis of predetermined determination conditions.

5. The voice detection device according to claim 3, wherein the front sound detection unit calculates the difference between the powers of the input signals of the first microphone and the second microphone and, based on this value, determines whether the signal was emitted from in front of the first microphone.

6. The voice detection device according to claim 3, wherein the front sound detection unit calculates the ratio of the powers of the input signals of the first microphone and the second microphone and, based on this value, determines whether the signal was emitted from in front of the first microphone.
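
The power-difference test of claim 5 and the power-ratio test of claim 6 could, for example, be sketched as below. The helper names and the thresholds `min_diff` and `min_ratio` are hypothetical; the claims do not state how the decision value is compared, so a simple "greater than threshold" rule is assumed here.

```python
def mean_power(frame):
    """Mean squared amplitude of one analysis frame."""
    return sum(v * v for v in frame) / len(frame) if frame else 0.0


def is_front_sound_by_difference(front, rear, min_diff):
    """Claim 5 style: compare the power difference with an assumed threshold."""
    return mean_power(front) - mean_power(rear) > min_diff


def is_front_sound_by_ratio(front, rear, min_ratio):
    """Claim 6 style: compare the power ratio with an assumed threshold."""
    front_power = mean_power(front)
    rear_power = mean_power(rear)
    if rear_power == 0.0:
        # No energy at the rear microphone: any front energy counts as frontal.
        return front_power > 0.0
    return front_power / rear_power > min_ratio
```

Here `front` is a frame from the first microphone (facing the speaker) and `rear` a frame from the second microphone (facing away). The ratio form does not depend on the absolute input level, which is one plausible reason for offering it as an alternative to the difference form.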

7. The voice detection device according to claim 4, wherein the speaker detection unit detects the difference in arrival time of the input signals at two adjacent first microphones by using the cross-correlation coefficient between those input signals, thereby estimating the position of the speaker and identifying the first microphone corresponding to the speaker.

8. The voice detection device according to claim 1, wherein the voice determination unit obtains in advance, from a large amount of voice data, the frequency characteristics or temporal characteristics of voice signals, discriminates voice from noise by an index representing how closely the input signal resembles those frequency or temporal characteristics, and detects only voice signals having those characteristics.

9. The voice detection device according to claim 8, wherein the voice determination unit detects the frequency characteristics by comparing the linear prediction coefficients, cepstrum coefficients, or autocorrelation coefficients obtained by linear prediction analysis of the input signal with linear prediction coefficients, cepstrum coefficients, or autocorrelation coefficients for voice created in advance, discriminates voice from noise in the input signal, and detects only the voice in the input signal.

10. The voice detection device according to claim 8, wherein the voice determination unit detects the frequency characteristics by recognizing phonological properties in the voice according to how closely the spectrum of the input signal resembles the spectrum of each phoneme created in advance, discriminates voice from noise in the input signal, and detects only the voice in the input signal.

11. The voice detection device according to claim 8, wherein the voice determination unit divides the frequency axis into several bands with digital or analog filters, detects the frequency characteristics by recognizing the pattern of the per-band energies obtained from those filters, discriminates voice from noise in the input signal, and detects only the voice in the input signal.

12. The voice detection device according to claim 8, wherein the voice determination unit divides the frequency axis into several bands with digital or analog filters, obtains the zero crossings of the signal for each band obtained from those filters, detects the frequency characteristics from the number of zero crossings in each band, discriminates voice from noise in the input signal, and detects only the voice in the input signal.
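
A minimal sketch of the per-band zero-crossing count of claim 12 is given below. It assumes that the band splitting has already been performed by some filter bank and that `band_signals` (a hypothetical name) holds one sample sequence per band; the filters themselves, and the later comparison of the resulting pattern with voice data, are outside this sketch.

```python
def zero_crossings(signal):
    """Count sign changes in a sample sequence, skipping exact zeros."""
    count = 0
    prev_sign = 0
    for v in signal:
        sign = (v > 0) - (v < 0)
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                count += 1
            prev_sign = sign
    return count


def zero_crossing_pattern(band_signals):
    """Return the zero-crossing count of each band as a feature vector."""
    return [zero_crossings(band) for band in band_signals]
```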

13. The voice detection device according to claim 8, wherein the voice determination unit divides the frequency axis into several bands with digital or analog filters, detects the frequency characteristics by the first- or higher-order autocorrelation coefficients of the signal for each band obtained from those filters, discriminates voice from noise in the input signal, and detects only the voice in the input signal.
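
For reference, first- and higher-order autocorrelation coefficients of one band signal, as used in claim 13, might be computed along the following lines. Normalizing by the zeroth-order term (the signal energy) is an assumption of this sketch, as is the function name; the claim does not specify either.

```python
def autocorrelation_coefficients(signal, max_order):
    """Return normalised autocorrelation coefficients of orders 1..max_order.

    Each coefficient is divided by the zeroth-order term (signal energy),
    which is an assumed normalisation for this sketch.
    """
    energy = sum(v * v for v in signal)
    if energy == 0:
        return [0.0] * max_order
    return [
        sum(signal[i] * signal[i + k] for i in range(len(signal) - k)) / energy
        for k in range(1, max_order + 1)
    ]
```

A slowly varying band signal gives coefficients near +1 at low orders, while a rapidly alternating one gives a negative first-order coefficient, so the vector carries coarse spectral information for each band.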

14. The voice detection device according to claim 8, wherein the voice determination unit divides the frequency axis into several bands with digital or analog filters, detects the frequency characteristics by the first- or higher-order cepstrum coefficients obtained by FFT analysis of the signal for each band obtained from those filters, discriminates voice from noise in the input signal, and detects only the voice in the input signal.

15. The voice detection device according to claim 8, wherein the voice determination unit divides the frequency axis into several bands with digital or analog filters, detects the frequency characteristics by at least one feature quantity among the first- or higher-order autocorrelation coefficients and the first- or higher-order cepstrum coefficients obtained by FFT analysis of the signal for each band obtained from those filters, discriminates voice from noise in the input signal, and detects only the voice in the input signal.

16. The voice detection device according to claim 8, wherein the voice determination unit divides the frequency axis into several bands with digital or analog filters, detects the frequency characteristics by a codebook obtained by vector-quantizing the feature quantities obtained by FFT analysis of the signal for each band obtained from those filters, discriminates voice from noise in the input signal, and detects only the voice in the input signal.

17. The voice detection device according to claim 8, wherein the voice determination unit creates in advance a codebook by vector-quantizing feature quantities that characterize the phonological properties of voice uttered by speakers, vector-quantizes the input signal with the codebook, detects the frequency characteristics by the quantization distortion that results, discriminates voice from noise in the input signal, and detects only the voice in the input signal.
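
The quantization-distortion test of claim 17 can be illustrated roughly as follows. The squared Euclidean distance, the threshold `max_distortion`, and both function names are assumptions for this sketch; the codebook is taken as given (a list of code vectors trained beforehand on voice), and training it is not shown.

```python
def quantization_distortion(vector, codebook):
    """Squared Euclidean distance from a feature vector to its nearest code."""
    return min(
        sum((a - b) ** 2 for a, b in zip(vector, code)) for code in codebook
    )


def is_voice_frame(vector, codebook, max_distortion):
    """Treat a frame as voice when the voice codebook represents it well.

    max_distortion is a hypothetical threshold, not a value from the source.
    """
    return quantization_distortion(vector, codebook) <= max_distortion
```

The idea is that a codebook built from voice features reproduces voice frames with small distortion, whereas noise frames fall far from every code vector.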

18. The voice detection device according to claim 8, wherein the voice determination unit detects the temporal characteristics by recognizing phonological properties in the voice according to how the spectrum of the input signal changes from moment to moment, discriminates voice from noise in the input signal, and detects only the voice in the input signal.

19. The voice detection device according to claim 8, wherein the voice determination unit, on the basis of the maximum and minimum values of the duration of each phoneme obtained in advance from a large number of voices, detects a phoneme for each analysis frame of the input signal, detects how long each phoneme continues, detects the temporal characteristics by regarding voice as having been input only when that duration is less than the maximum value and greater than the minimum value for the phoneme, discriminates voice from noise in the input signal, and detects only the voice in the input signal.
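
The duration test of claim 19 might be sketched as below, assuming that a per-frame phoneme label has already been produced by some recognizer. `frame_labels`, `duration_limits`, and the function names are hypothetical; durations are counted in analysis frames, and the strict inequalities follow the wording of the claim.

```python
from itertools import groupby


def phoneme_runs(frame_labels):
    """Collapse per-frame phoneme labels into (label, run_length) pairs."""
    return [(label, len(list(group))) for label, group in groupby(frame_labels)]


def voiced_runs(frame_labels, duration_limits):
    """Keep only runs whose length lies strictly between the stored limits.

    duration_limits maps a phoneme label to (min_frames, max_frames),
    assumed to have been measured beforehand from a large amount of voice.
    """
    accepted = []
    for label, length in phoneme_runs(frame_labels):
        limits = duration_limits.get(label)
        if limits is None:
            continue  # label has no stored limits, so it is not accepted
        min_frames, max_frames = limits
        if min_frames < length < max_frames:
            accepted.append((label, length))
    return accepted
```

A run that is too short (a spurious single-frame match) or too long (a stationary noise that happens to resemble a phoneme) is rejected, which is the effect the claim appears to aim for.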

20. The voice detection device according to claim 8, wherein the voice determination unit prepares in advance, as a standard model, a spectrum sequence for each phoneme obtained from a large number of voices, detects the temporal characteristics by using the standard model to measure the duration indicating how long a spectrum in the input signal continues, discriminates voice from noise in the input signal, and detects only the voice in the input signal.

21. The voice detection device according to claim 8, wherein the voice determination unit detects the temporal characteristics by recognizing the pattern of change in the code string produced when the input signal is vector-quantized with a codebook obtained by vector-quantizing feature quantities that characterize the phonological properties of voice uttered by speakers, discriminates voice from noise in the input signal, and detects only the voice in the input signal.

22. The voice detection device according to claim 8, wherein the voice determination unit vector-quantizes the input signal with a codebook obtained by vector-quantizing feature quantities that characterize the phonological properties of voice uttered by speakers, detects the temporal characteristics by how long each code continues to appear, discriminates voice from noise in the input signal, and detects only the voice in the input signal.

23. The voice detection device according to claim 8, wherein the voice determination unit creates in advance an HMM for each phoneme from a large amount of voice data, uses the HMMs to detect phonological properties present in the input signal and thereby detect the frequency characteristics or temporal characteristics, discriminates voice from noise in the input signal, and detects only the voice in the input signal.

24. The voice detection device wherein the voice determination unit extracts from the input signal, for each analysis frame, a feature quantity characterizing voice, detects the temporal characteristics by fuzzy inference using a fuzzy membership function with respect to time, obtained in advance from a large amount of voice data, that expresses how long voice components continue in the input signal, discriminates voice from noise in the input signal, and detects only the voice in the input signal.
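
As a heavily simplified illustration of the fuzzy inference mentioned in claim 24, the sketch below grades a candidate segment by combining a per-frame voice score with a membership function over duration. The trapezoidal shape, its corner values, the `frame_score` input in the range 0 to 1, and the use of the minimum as the fuzzy AND are all assumptions of this sketch; the claim states only that a membership function with respect to time is obtained from voice data.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear between."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def voice_grade(frame_score, duration_frames, shape=(2, 4, 10, 20)):
    """Fuzzy AND (minimum) of the frame score and the duration membership.

    shape holds hypothetical corner points in analysis frames; a real
    system would derive them from measured voice durations.
    """
    return min(frame_score, trapezoid(duration_frames, *shape))
```

A segment would then be accepted as voice when `voice_grade` exceeds some chosen level, so that both very short bursts and implausibly long stationary segments receive a low grade.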

25. A video switching device comprising: the voice detection device according to claim 1; a camera control unit that stores the position of each speaker in advance and controls the output video so as to output the video of each speaker; and a video switching control unit that identifies the microphone to which voice is input on the basis of the output of the voice detection device and outputs to the camera control unit a control signal for switching to the video of the corresponding speaker.

26. A video switching device comprising: the voice detection device according to claim 3; a camera control unit that stores the position of each speaker in advance and controls the output video so as to output the video of each speaker; and a video switching control unit that identifies the first microphone to which voice is input on the basis of the output of the voice detection device and outputs to the camera control unit a control signal for switching to the video of the corresponding speaker.

27. A video switching device comprising: the voice detection device according to claim 4; a camera control unit that stores the position of each speaker in advance and controls the output video so as to output the video of each speaker; and a video switching control unit that identifies the first microphone to which voice is input on the basis of the output of the voice detection device and outputs to the camera control unit a control signal for switching to the video of the corresponding speaker.
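
The interaction between the video switching control unit and the camera control unit in claims 25 to 27 could be sketched, under stated assumptions, as follows. The class name, the `voice_flags` mapping, the representation of a stored speaker position as an arbitrary preset value, and the policy of holding the current video when no microphone or more than one microphone reports voice are all hypothetical; the claims do not prescribe them.

```python
class VideoSwitcher:
    """Toy model of the video switching control described in the claims."""

    def __init__(self, speaker_positions):
        # Microphone id -> stored camera preset for that speaker (assumed form).
        self.speaker_positions = dict(speaker_positions)
        self.active_microphone = None

    def update(self, voice_flags):
        """Return the preset to switch to, or None when the video is held.

        voice_flags maps a microphone id to True when the voice detection
        device reports that speaker's voice on that microphone.
        """
        speaking = [mic for mic, flag in voice_flags.items() if flag]
        if len(speaking) != 1:
            return None  # assumed policy: hold when zero or several speak
        mic = speaking[0]
        if mic == self.active_microphone:
            return None  # already showing this speaker
        self.active_microphone = mic
        return self.speaker_positions[mic]
```

For example, `VideoSwitcher({1: "preset A", 2: "preset B"})` returns `"preset A"` the first time only microphone 1 reports voice and returns `None` for as long as the same speaker continues.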