JP2003005790A

JP2003005790A - Method and device for voice separation of compound voice data, method and device for specifying speaker, computer program, and recording medium

Info

Publication number: JP2003005790A
Application number: JP2001191289A
Authority: JP
Inventors: Takayoshi Yamamoto; 隆義山本
Original assignee: Individual
Current assignee: Individual
Priority date: 2001-06-25
Filing date: 2001-06-25
Publication date: 2003-01-08
Anticipated expiration: 2021-06-25
Also published as: JP3364487B2

Abstract

PROBLEM TO BE SOLVED: To provide a method and a device for separating compound voice data where voice data of several speakers mixedly exist into the voice of every speaker and to provide a method and a device for accurately and quickly specifying the speaker of each separated voice data. SOLUTION: The method for separating compound voice data where voice data of several speakers mixedly exist into the voice data of every speaker has a step (1) where correlation elimination processing is performed to eliminate correlation between the compound voice data and a step (2) where independent component separation processing is performed to separate data subjected to the correlation elimination processing into independent components.

Description

Detailed Description of the Invention

[0001]

Technical Field of the Invention: The present invention relates to a method for separating the voices in composite voice data of a plurality of speakers, a method for identifying the speaker of each separated voice data, a device for separating the voices in composite voice data of a plurality of speakers, a device for identifying the speaker of each separated voice data, a computer program, and a recording medium.

[0002]

Prior Art: A technique is strongly desired for accurately separating, speaker by speaker, the composite voice data on a voice recording medium in which the voices of a plurality of speakers are recorded mixed together. Specifically, a technique is desired that can create meeting minutes automatically by separating the composite voice data by speaker and identifying each speaker concurrently with the voice input.

[0003] Conventionally, to create the minutes of a long meeting, the person in charge of the minutes listened again to all of the meeting audio recorded on various voice recording devices and prepared the minutes by, for example, summarizing it. This work requires repeatedly playing and pausing the recording device, which takes time and effort.

[0004] Another problem is that identifying the speakers is difficult. A person who attended the meeting may manage, but when the minutes are prepared by a person who did not attend, it was very difficult to determine which voice belongs to which speaker.

[0005] Several techniques for separating voices from mixed voice data and for identifying speakers already exist. However, performing separation and identification accurately even when the voices of several people and noise enter a single microphone mixed together, and moreover performing that separation and identification at high speed concurrently with the input of the composite voice, has been a very difficult problem because of the segmentation of temporally continuous phoneme data and because of coarticulation.

[0006] Japanese Unexamined Patent Publication No. 2001-27895 describes a signal separation method for separating acoustic signals from a plurality of signal sources and synthesizing and outputting a desired signal. That invention performs time-frequency analysis on the mixed voice and acoustic signal to be analyzed and obtains the harmonic structure of its frequency components. It decides whether harmonic frequency components come from the same signal source according to whether at least one of their rise time and fall time is common. By extracting and reconstructing those frequency components, it separates the signal from a single signal source.

[0007] Because that invention does not take into account the correlation or independence of the mixed signals, it has difficulty separating mixed signals that lie in the same frequency band or that occur in the same time period.

[0008] In the sound source signal estimation device described in Japanese Unexamined Patent Publication No. 2000-97758, when a plurality of acoustic signals are input mixed together over a plurality of channels, separation coefficient vectors corresponding to the mixing coefficient vectors are obtained by successive correction, based on a mixing process model in which each source signal is multiplied (as an inner product) by a mixing coefficient vector and added to the other source signals. The source signals are then estimated and separated using these separation coefficient vectors (the ICA approach). In doing so, the correction vector used for the successive correction of the separation coefficient vectors is normalized. It is stated that, when each signal is estimated and separated from a signal in which a voice signal and other signals are mixed, this reduces the effect of fluctuations in each signal's power on the estimation and separation, and that, because the convergence coefficient can be made larger, stable and fast signal separation becomes possible.

[0009] That invention is based on independent component analysis (ICA) and corrects the separation coefficient vectors successively, so it can reduce the effect of signal power fluctuations and achieve fast separation. However, source signals from various signal sources do not necessarily remain independent of one another. In general, even source signals from independent sources often become correlated once they are mixed, and that invention does not take this into account.

[0010] Japanese Unexamined Patent Publication No. H9-258788 describes a voice separation method and device whose object is to properly distinguish and separate mixed voices having close fundamental frequencies and to obtain high-quality separated voices without being limited by the number of sound sources. In that invention, of the voiced and unvoiced parts of the voice signals contained in the input acoustic signal, the voiced parts are extracted individually while taking into account information on the direction of the voiced sound source, and the extracted voiced parts are divided into a plurality of voiced sounds and extracted as groups of voiced sounds. The unvoiced parts of the voice signals are extracted, as the acoustic signal components corresponding to the unvoiced sounds of each voiced sound group, from the residual obtained by subtracting the voiced parts from the input acoustic signal. The object is achieved by supplementing each separately extracted group of voiced sounds with its unvoiced sounds to extract the voice signals.

[0011] That invention has a sound source localization unit that extracts information on the direction of the sound source, but separation becomes difficult when different voices are emitted from the same direction. Separation also appears to be difficult when a plurality of speakers utter the same vowel or the same voiced sound.

[0012]

Problem to Be Solved by the Invention: To solve the various problems of the prior art described above, the main object of the present invention is to provide a method and device for separating mixed voice data, in which the voice data of a plurality of speakers are mixed, into the voice of each speaker, and a method and device that can identify the speaker of each separated voice data accurately and at high speed.

[0013]

Means for Solving the Problem: To solve the above problem, the first invention of the present application is a voice separation method for separating mixed voice data, in which the voice data of a plurality of speakers are mixed, into voice data for each speaker, the method comprising: (1) a step of performing decorrelation processing to decorrelate the mixed voice data from one another; and (2) a step of performing independent component separation processing to separate the decorrelated data into independent components. According to the first invention, both the correlation and the independence of the individual voice data contained in the input mixed voice data (raw data) are taken into account, so the data can be separated accurately into the voice of each speaker even when the correlation and independence of the several voice data and of any mixed-in noise vary in time and space.
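
The two steps can be illustrated with a minimal Python sketch that whitens a two-channel mixture and then applies a symmetric FastICA iteration. The eigenvalue-based whitening, the FastICA update with a tanh nonlinearity, the synthetic sources, and the mixing matrix are all assumptions chosen for illustration; this paragraph of the specification does not prescribe these particular algorithms.

```python
import numpy as np

def whiten(x):
    """Step (1): decorrelate the channels (rows) of x to zero mean and identity covariance."""
    x = x - x.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(x @ x.T / x.shape[1])
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ x

def sym_orth(w):
    """Symmetric orthogonalisation, W <- (W W^T)^(-1/2) W, computed via the SVD."""
    u, _, vt = np.linalg.svd(w)
    return u @ vt

def fast_ica(z, n_iter=200, seed=0):
    """Step (2): rotate whitened data z towards statistically independent components."""
    n, t = z.shape
    w = sym_orth(np.random.default_rng(seed).standard_normal((n, n)))
    for _ in range(n_iter):
        g = np.tanh(w @ z)                       # nonlinearity g(u) = tanh(u)
        g_prime = (1.0 - g ** 2).mean(axis=1)    # mean of g'(u) for each component
        w = sym_orth(g @ z.T / t - np.diag(g_prime) @ w)
    return w @ z

# Two synthetic "speakers" (hypothetical stand-ins): a tone and Laplacian noise.
rng = np.random.default_rng(1)
s = np.vstack([np.sin(2 * np.pi * 5 * np.linspace(0, 1, 4000)),
               rng.laplace(size=4000)])
x = np.array([[1.0, 0.6], [0.4, 1.0]]) @ s      # hypothetical mixing at two microphones
z = whiten(x)                                    # step (1): decorrelation
y = fast_ica(z)                                  # step (2): independent component separation
```

As with any independent component analysis, the rows of `y` match the sources only up to ordering, sign, and scale.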

[0014] The second invention of the present application is the voice separation method of the first invention, wherein, when the separability of the data that has undergone the independent component separation is insufficient, the decorrelation processing and the independent component separation processing are repeated on the data that has undergone the independent component separation processing until the separability becomes sufficient. According to the second invention, the mixed voice data can be sufficiently separated into voice data for each sound source.
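
The repetition described here amounts to a simple control loop. The following Python outline is hypothetical: the callables `decorrelate`, `separate`, and `is_separated`, and the `max_rounds` safety limit, are assumptions introduced for illustration, since this paragraph does not define how separability is measured.

```python
def separate_until_sufficient(data, decorrelate, separate, is_separated, max_rounds=10):
    """Repeat decorrelation and independent component separation until the result is judged separable.

    Returns the separated data and the number of rounds performed.
    """
    rounds = 0
    while True:
        data = separate(decorrelate(data))
        rounds += 1
        if is_separated(data) or rounds >= max_rounds:
            return data, rounds

# Toy usage with stand-in callables: each round halves a residual "mixing" value.
result, rounds = separate_until_sufficient(
    8.0,
    decorrelate=lambda d: d,          # placeholder for the decorrelation step
    separate=lambda d: d / 2.0,       # placeholder for the separation step
    is_separated=lambda d: d < 1.5,   # placeholder separability criterion
)
```

The `max_rounds` limit is a design choice of the sketch that guards against a criterion that is never met.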

[0015] The third invention of the present application is the voice separation method of the first or second invention, wherein three kinds of independent component separation processing are prepared: non-Gaussian independent component separation processing for separating non-Gaussian data into independent components, non-stationary independent component separation processing for separating non-stationary data into independent components, and colored independent component separation processing for separating colored (temporally correlated) data into independent components; and one of these is performed according to the nature of the data. According to the third invention, the independent component separation processing best suited to the nature of the decorrelated data can be performed, so the mixed voice data can be separated more effectively into voice data for each sound source.
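
One way to characterise "the nature of the data" is to compute a simple statistic for each of the three properties. The Python sketch below does this under stated assumptions: the choice of excess kurtosis (non-Gaussianity), the spread of block variances (non-stationarity), and the lag-1 autocorrelation (coloredness), as well as the function name and block count, are illustrative and are not taken from the specification.

```python
import numpy as np

def signal_properties(x, n_blocks=20):
    """Return three illustrative statistics for a one-dimensional signal x."""
    x = (x - x.mean()) / x.std()
    block_var = np.array([b.var() for b in np.array_split(x, n_blocks)])
    return {
        # 0 for a Gaussian signal; far from 0 indicates non-Gaussian data
        "excess_kurtosis": float(np.mean(x ** 4) - 3.0),
        # near 0 when the variance is constant over time; larger for non-stationary data
        "variance_spread": float(block_var.std() / block_var.mean()),
        # near 0 for white data; far from 0 for colored (temporally correlated) data
        "lag1_autocorr": float(np.mean(x[:-1] * x[1:])),
    }

# Example signals (synthetic, for illustration only).
rng = np.random.default_rng(0)
white = rng.standard_normal(20000)
colored = np.convolve(white, np.ones(8) / 8, mode="valid")   # moving average adds correlation
nonstationary = white * np.linspace(0.1, 2.0, white.size)    # slowly growing amplitude
props = {name: signal_properties(sig) for name, sig in
         [("white", white), ("colored", colored), ("nonstationary", nonstationary)]}
```

A selection rule could then pick the separation variant whose statistic departs most clearly from its reference value; any thresholds for that rule would have to be tuned on real recordings.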

[0016] The fourth invention of the present application is the voice separation method of the third invention, wherein the independent component separation processing performed first is the non-Gaussian independent component separation processing for separating non-Gaussian data into independent components. Non-Gaussian independent component separation is more strongly affected by the decorrelation processing that precedes it than the other independent component separation methods are. According to the fourth invention, performing the non-Gaussian independent component separation first therefore makes it possible to evaluate effectively, through the non-Gaussian independent component separation that follows the decorrelation processing, whether the decorrelation processing was carried out successfully.

[0017] The fifth invention of the present application is the voice separation method of any of the first to fourth inventions, wherein the decorrelation processing includes at least principal component analysis and factor analysis. According to the fifth invention, the number of principal components to adopt (the order) can be determined, for example by computing the contribution rate of each principal component and taking as the order the number of components at which the cumulative contribution rate exceeds a predetermined threshold, and the decorrelation processing can then be performed effectively.
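
The order selection by cumulative contribution rate can be sketched in Python as follows. The function name `choose_order`, the 0.95 threshold, and the synthetic three-channel data are assumptions for illustration; the specification only states that the threshold is predetermined.

```python
import numpy as np

def choose_order(x, threshold=0.95):
    """Return (order, contribution rates) for channels-by-samples data x."""
    x = x - x.mean(axis=1, keepdims=True)
    eigval = np.linalg.eigvalsh(x @ x.T / x.shape[1])[::-1]   # component variances, largest first
    contribution = eigval / eigval.sum()                      # contribution rate of each component
    cumulative = np.cumsum(contribution)                      # cumulative contribution rate
    # Smallest number of components whose cumulative contribution rate exceeds the threshold.
    order = int(np.searchsorted(cumulative, threshold, side="right")) + 1
    return min(order, eigval.size), contribution

# Synthetic example: two strong components and one weak one, rotated into three channels.
rng = np.random.default_rng(0)
sources = rng.standard_normal((3, 5000)) * np.array([[3.0], [1.0], [0.1]])
rotation, _ = np.linalg.qr(rng.standard_normal((3, 3)))
order, contribution = choose_order(rotation @ sources)
```

In this example the weakest component contributes almost nothing to the total variance, so the selected order is smaller than the number of channels.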

[0018] The sixth invention of the present application is a speaker identification method for separating mixed voice data, in which the voice data of a plurality of speakers are mixed, into voice data for each speaker and identifying the speaker of each such voice data, the method comprising: (1) a step of separating the mixed voice data into voice data for each speaker by the voice separation method of any of the first to fifth inventions; (2) a step of preparing, for each speaker, identification parameters for identifying that speaker; and (3) a step of identifying the speaker of each separated voice data by referring to the identification parameters. According to the sixth invention, mixed voice data containing the voices of several speakers, noise, and the like, recorded for example in a meeting recording, is separated by sound source and the speaker of each separated voice data is identified, so that, for example, meeting record data can be created automatically.

[0019] The seventh invention of the present application is the speaker identification method of the sixth invention, wherein the identification parameter is the formant frequency observed when the speaker utters a vowel; a formant frequency is obtained for each separated voice data, and the speaker is identified by comparing the obtained formant frequency against the formant frequencies held as identification parameters. According to the seventh invention, the speaker of each separated voice data can be identified easily by using the formant frequency, a feature that can be extracted by simple processing such as a Fourier transform.
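
As a rough illustration of extracting such a feature with a Fourier transform, the Python sketch below picks the strongest peaks of a magnitude spectrum. It is a simplification under stated assumptions: the function name, the Hann window, the 8 kHz sampling rate, and the synthetic two-tone "vowel" are illustrative, and raw spectral peaks stand in for formants here, whereas a practical system would normally smooth the spectrum into an envelope first.

```python
import numpy as np

def strongest_peaks(signal, sample_rate, n_peaks=2):
    """Return the frequencies (Hz) of the n_peaks strongest local maxima of the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(signal.size)))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)
    inner = spectrum[1:-1]
    is_peak = (inner > spectrum[:-2]) & (inner >= spectrum[2:])   # local maxima only
    peak_idx = np.flatnonzero(is_peak) + 1
    top = peak_idx[np.argsort(spectrum[peak_idx])[-n_peaks:]]     # strongest peaks
    return np.sort(freqs[top])                                    # ascending frequency

# Synthetic stand-in for a vowel: two resonances at 700 Hz and 1200 Hz (hypothetical values).
fs = 8000
t = np.arange(fs) / fs
vowel = np.sin(2 * np.pi * 700 * t) + 0.6 * np.sin(2 * np.pi * 1200 * t)
formants = strongest_peaks(vowel, fs)
```

For this synthetic input the two returned frequencies lie at the two tones that were mixed in.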

[0020] The eighth invention of the present application is the speaker identification method of the seventh invention, wherein the identification parameters are the first formant frequency and the second formant frequency observed when the speaker utters a vowel; the first and second formant frequencies are obtained for each separated voice data, and the speaker is identified by comparing them against the first and second formant frequencies held as identification parameters. According to the eighth invention, identifying the speaker with two formant frequencies, namely the first and second spectral peaks, makes the identification both easy and more accurate.
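
The comparison against stored first and second formant frequencies can be sketched as a nearest-neighbour lookup. In the Python sketch below, the reference values, the speaker names, the Euclidean distance, and the 80 Hz tolerance are all hypothetical choices made for illustration; the specification does not state how the comparison is scored.

```python
import math

# Hypothetical identification parameters: (F1, F2) in Hz for one vowel per speaker.
REFERENCE = {
    "speaker_A": (730.0, 1090.0),
    "speaker_B": (850.0, 1220.0),
    "speaker_C": (660.0, 1000.0),
}

def identify_speaker(f1, f2, reference, max_distance_hz=80.0):
    """Return the speaker whose stored (F1, F2) is closest, or None if none is close enough."""
    best_name, best_dist = None, math.inf
    for name, (ref_f1, ref_f2) in reference.items():
        dist = math.hypot(f1 - ref_f1, f2 - ref_f2)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance_hz else None

match = identify_speaker(740.0, 1100.0, REFERENCE)      # close to speaker_A's stored values
no_match = identify_speaker(500.0, 2000.0, REFERENCE)   # far from every stored value
```

Returning `None` when no stored value is close enough leaves room for a further identification attempt on the same voice data.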

[0021] The ninth invention of the present application is the speaker identification method of any of the sixth to eighth inventions, wherein, when the speaker could not be identified in the step of identifying the speaker of each separated voice data by referring to the identification parameters, formant frequencies at a plurality of time points are obtained from the voice data, and the speaker is identified by comparing them against formant frequencies at a plurality of time points held as identification parameters. According to the ninth invention, the speaker can be identified more accurately by also taking into account the temporal variation of the formant frequency, which is a feature used to identify who uttered a given voice.

[0022] The tenth invention of the present application is the speaker identification method of any of the sixth to ninth inventions, wherein, when the speaker could not be identified in the step of identifying the speaker of each separated voice data by referring to the identification parameters, voiced sound data is separated from the voice data, a formant frequency is obtained for the voiced sound data, and the speaker is identified by comparing the obtained formant frequency against the formant frequencies held as identification parameters. Identification by formant frequency is effective for distinguishing voiced sounds, vowels in particular, so according to the tenth invention a wide variety of voices, including those containing unvoiced sounds, can be identified more accurately. The voice data before the decorrelation processing and the independent component separation processing is data in which the voices of several people are mixed, whereas the separated voice data obtained by these two processes is data from which one person's voice has been extracted. Voiced sounds can therefore be extracted with high accuracy from the separated voice data obtained by these two processes.

[0023] The eleventh invention of the present application is the speaker identification method of the tenth invention, wherein, when the speaker could not be identified in the step of identifying the speaker of each separated voice data by referring to the identification parameters, voiced sound data is separated from the voice data, the first and second formant frequencies are obtained for the voiced sound data, and the speaker is identified by comparing them against the first and second formant frequencies held as identification parameters. According to the eleventh invention, the speaker can be identified more accurately by applying the two formant frequencies, namely the first and second spectral peaks, to the separated voiced sound data.

[0024] The twelfth invention of the present application is the speaker identification method of the tenth or eleventh invention, wherein, when the speaker could not be identified in the step of identifying the speaker of each separated voice data by referring to the identification parameters, formant frequencies at a plurality of time points are obtained for the voiced sound data, and the speaker is identified by comparing them against formant frequencies at a plurality of time points held as identification parameters. According to the twelfth invention, the speaker can be identified more accurately by also taking into account, for the separated voiced sound data, the temporal variation of the formant frequency, the feature used for speaker identification.

[0025] The thirteenth invention of the present application is the speaker identification method of the tenth or eleventh invention, wherein, when the voiced sound data is separated from the voice data, independent component separation processing for separating the data into independent components is applied to the voice data. Voiced sounds involve vibration of the vocal cords, so according to the thirteenth invention, applying independent component separation processing to the voice data makes it possible to easily separate unvoiced sounds, which do not involve vocal cord vibration, from voiced sounds, which do.

[0026] The fourteenth invention of the present application is a minutes creation method for creating minutes from mixed voice data in which the voice data of a plurality of speakers are mixed, the method comprising: a step of identifying the speaker of each separated voice data by the speaker identification method of any of the sixth to thirteenth inventions; and a step of creating minutes by associating each identified speaker with that speaker's utterances and outputting them to a recording medium. According to the fourteenth invention, speakers are identified automatically and accurately, so the minutes of a long meeting can conveniently be created automatically.

[0027] The fifteenth invention of the present application is a voice separation device for separating mixed voice data, in which the voice data of a plurality of speakers are mixed, into voice data for each speaker, wherein decorrelation processing is performed to decorrelate the mixed voice data from one another, and independent component separation processing is performed to separate the decorrelated data into independent components. According to the fifteenth invention, a voice separation device can be realized that takes into account both the correlation and the independence of the individual voice data contained in the input mixed voice data (raw data) and can separate the data accurately into the voice of each speaker even when the correlation and independence of the several voice data and of any mixed-in noise vary in time and space. The sixteenth invention of the present application is the voice separation device of the fifteenth invention, wherein, when the separability of the data that has undergone the independent component separation is insufficient, the decorrelation processing and the independent component separation processing are repeated on the data that has undergone the independent component separation processing until the separability becomes sufficient. According to the sixteenth invention, a voice separation device can be realized that can sufficiently separate mixed voice data into voice data for each sound source.

[0028] The seventeenth invention of the present application is the voice separation device of the fifteenth or sixteenth invention, wherein, according to the nature of the data, one of the following is performed as the independent component separation processing: non-Gaussian independent component separation processing for separating non-Gaussian data into independent components, non-stationary independent component separation processing for separating non-stationary data into independent components, or colored independent component separation processing for separating colored (temporally correlated) data into independent components. According to the seventeenth invention, a voice separation device can be realized that performs the independent component separation processing best suited to the nature of the decorrelated data and can therefore separate mixed voice data more effectively into voice data for each sound source.

[0029] The eighteenth invention of the present application is the voice separation device of the seventeenth invention, wherein the independent component separation processing performed first is the non-Gaussian independent component separation processing for separating non-Gaussian data into independent components. Non-Gaussian independent component separation is more strongly affected by the decorrelation processing that precedes it than the other independent component separation methods are. According to the eighteenth invention, a voice separation device can therefore be realized in which performing the non-Gaussian independent component separation first makes it possible to evaluate effectively, through the non-Gaussian independent component separation that follows the decorrelation processing, whether the decorrelation processing was carried out successfully.

【００３０】また、本出願に係る第１９の発明は、第１
５乃至第１８の発明である音声分離装置において、前記
無相関化処理は、少なくとも主成分分析及び因子分析を
行うことを特徴とする音声分離装置である。このような
第１９の発明によれば、各主成分の寄与率を求めて累積
寄与率が所定のしきい値を越えるところの成分数を次数
とすることなどにより、採用する主成分データの数（次
数）を決定した上で、効果的に無相関化処理を行うこと
が可能な音声分離装置を実現できる。The nineteenth invention of the present application is the voice separation device according to any one of the fifteenth to eighteenth inventions, wherein the decorrelation processing performs at least principal component analysis and factor analysis. According to the nineteenth invention, the number of principal components to adopt (the order) can be determined, for example by obtaining the contribution ratio of each principal component and taking as the order the number of components at which the cumulative contribution ratio exceeds a predetermined threshold, so a voice separation device that performs decorrelation processing effectively can be realized.

【００３１】また、本出願に係る第２０の発明は、複数
発言者の音声データが混在している混在音声データを、
発言者毎の音声データに分離し、該発言者毎の音声デー
タにつき発言者を特定する発言者特定装置において、第
１５乃至第１９のいずれかの発明の音声分離装置によ
り、複数発言者の音声データが混在している混在音声デ
ータを、発言者毎の音声データに分離し、分離された前
記発言者毎の音声データにつき、発言者毎に該発言者を
特定するための特定パラメータを参照して発言者を特定
することを特徴とする発言者特定装置である。このよう
な第２０の発明によれば、例えば、会議の録音データな
どに記録された、複数発言者の音声や雑音などが含まれ
たの混在音声データを音源ごとに分離し、各分離された
の音声データの発言者を特定することによって、例え
ば、自動的に会議記録データの作成などを行うことの可
能な発言者特定装置が実現できる。The twentieth invention of the present application is a speaker identifying device that separates mixed voice data, in which voice data of a plurality of speakers are mixed, into voice data for each speaker and identifies the speaker of the voice data for each speaker, wherein the voice separation device according to any one of the fifteenth to nineteenth inventions separates the mixed voice data into voice data for each speaker, and the speaker of each separated voice data is identified by referring to a specific parameter that identifies each speaker. According to the twentieth invention, mixed voice data containing the voices of several speakers and noise, recorded for example in a recording of a conference, is separated by sound source and the speaker of each separated voice data is identified, so a speaker identifying device capable of, for example, automatically creating conference record data can be realized.

【００３２】また、本出願に係る第２１の発明は、第２
０の発明である発言者特定装置において、前記特定パラ
メータは、発言者が母音を発音した際のホルマント周波
数であり、分離された前記発言者毎の音声データにつ
き、ホルマント周波数を求め、求められたホルマント周
波数に関して、前記特定パラメータとしてのホルマント
周波数を参照して、発言者を特定することを特徴とする
発言者特定装置である。このような第２１の発明によれ
ば、フーリエ変換などの容易な処理で抽出できる特徴量
であるホルマント周波数を用いて、各分離された音声デ
ータの発言者特定を容易に行うことの可能な発言者特定
装置が実現できる。The twenty-first invention of the present application is the speaker identifying device according to the twentieth invention, wherein the specific parameter is the formant frequency produced when a speaker utters a vowel, a formant frequency is obtained for the separated voice data for each speaker, and the speaker is identified by referring, with respect to the obtained formant frequency, to the formant frequency that serves as the specific parameter. According to the twenty-first invention, the formant frequency is a feature that can be extracted by simple processing such as a Fourier transform, so a speaker identifying device that can easily identify the speaker of each separated voice data can be realized.

【００３３】また、本出願に係る第２２の発明は、第２
１の発明である発言者特定装置において、前記特定パラ
メータは、発言者が母音を発音した際の第１ホルマント
周波数及び第２ホルマント周波数であり、分離された前
記発言者毎の音声データにつき、第１ホルマント周波数
及び第２ホルマント周波数を求め、求められた第１ホル
マント周波数及び第２ホルマント周波数に関して、前記
特定パラメータとしての第１ホルマント周波数及び第２
ホルマント周波数を参照して、発言者を特定することを
特徴とする発言者特定装置である。このような第２２の
発明によれば、第１と第２のスペクトルピークである２
つのホルマント周波数を用いて発言者の特定を行うこと
によって、容易に、かつより正確に特定を行うことの可
能な発言者特定装置が実現できる。The twenty-second invention of the present application is the speaker identifying device according to the twenty-first invention, wherein the specific parameters are the first formant frequency and the second formant frequency produced when a speaker utters a vowel, the first and second formant frequencies are obtained for the separated voice data for each speaker, and the speaker is identified by referring, with respect to the obtained first and second formant frequencies, to the first and second formant frequencies that serve as the specific parameters. According to the twenty-second invention, identifying the speaker by using two formant frequencies, namely the first and second spectral peaks, makes it possible to realize a speaker identifying device that identifies speakers easily and more accurately.
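By way of illustration only, referring to stored first and second formant frequencies can be sketched in Python as a nearest-match lookup. The reference table, the formant values in it, the distance measure, and the tolerance `max_distance_hz` are all assumptions introduced for this sketch; the specification does not state how the comparison is made.

```python
import math

# Hypothetical enrolled speakers: (F1, F2) in Hz for one vowel. Illustrative values only.
REFERENCE_FORMANTS = {
    "speaker_A": (730.0, 1090.0),
    "speaker_B": (850.0, 1220.0),
}

def identify_speaker(f1, f2, references=REFERENCE_FORMANTS, max_distance_hz=80.0):
    """Return the enrolled speaker nearest to (f1, f2), or None if nobody is close enough."""
    best_name, best_distance = None, float("inf")
    for name, (ref_f1, ref_f2) in references.items():
        distance = math.hypot(f1 - ref_f1, f2 - ref_f2)  # Euclidean distance (assumption)
        if distance < best_distance:
            best_name, best_distance = name, distance
    # None models the case in which the speaker cannot be identified from these parameters.
    return best_name if best_distance <= max_distance_hz else None
```

A return value of `None` corresponds to the situation, addressed by the twenty-third and later inventions, in which the speaker cannot be identified from the specific parameter alone.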

【００３４】また、本出願に係る第２３の発明は、第２
０の発明乃至第２２の発明のいずれかに記載の発言者特
定装置において、分離された前記発言者毎の音声データ
につき、前記特定パラメータを参照して発言者を特定で
きなかった場合には、該音声データから複数の時点のホ
ルマント周波数を求め、求められた複数時点のホルマン
ト周波数に関して、前記特定パラメータとしての複数時
点のホルマント周波数を参照して、発言者を特定するこ
とを特徴とする発言者特定装置である。このような第２
３の発明によれば、ある音声の発声者を特定する上での
特徴量であるホルマント周波数の、時間的変動をも考慮
することにより、より正確に発言者の特定を行うことの
可能な発言者特定装置が実現できる。The twenty-third invention of the present application is the speaker identifying device according to any one of the twentieth to twenty-second inventions, wherein, when the speaker of the separated voice data for each speaker cannot be identified by referring to the specific parameter, formant frequencies at a plurality of time points are obtained from the voice data, and the speaker is identified by referring, with respect to the obtained formant frequencies at the plurality of time points, to the formant frequencies at a plurality of time points that serve as the specific parameter. According to the twenty-third invention, the temporal variation of the formant frequency, a feature used to identify who uttered a given voice, is also taken into account, so a speaker identifying device that identifies speakers more accurately can be realized.

【００３５】また、本出願に係る第２４の発明は、第２
０の発明乃至第２３の発明のいずれかに記載の発言者特
定装置において、分離された前記発言者毎の音声データ
につき、前記特定パラメータを参照して発言者を特定で
きなかった場合には、該音声データから有声音データを
分離し、該有声音データにつき、ホルマント周波数を求
め、求められたホルマント周波数に関して、前記特定パ
ラメータとしてのホルマント周波数を参照して、発言者
を特定することを特徴とする発言者特定装置である。ホ
ルマント周波数による発言者の特定は、有声音、特に母
音の識別に有効であるので、このような第２４の発明に
よれば、無声音を含む様々な音声をもより正確に識別す
ることができる。ここで、無相関化処理及び独立成分分
離処理がなされる前の音声データが複数人の音声が混在
しているデータであるのに対して、無相関化処理及び独
立成分分離処理という二つの処理によって分離された分
離音声データは、ある一人の音声が抽出されたデータと
なっている。よって、このような二つの処理によって分
離された分離音声データからは有声音を高い精度で抽出
することができる。The twenty-fourth invention of the present application is the speaker identifying device according to any one of the twentieth to twenty-third inventions, wherein, when the speaker of the separated voice data for each speaker cannot be identified by referring to the specific parameter, voiced sound data is separated from the voice data, a formant frequency is obtained for the voiced sound data, and the speaker is identified by referring, with respect to the obtained formant frequency, to the formant frequency that serves as the specific parameter. Speaker identification by formant frequency is effective for voiced sounds, particularly vowels, so according to the twenty-fourth invention various voices, including unvoiced sounds, can also be identified more accurately. The voice data before the decorrelation processing and the independent component separation processing is data in which the voices of several people are mixed, whereas the separated voice data produced by these two processes is data in which the voice of a single person has been extracted. Voiced sound can therefore be extracted with high accuracy from the separated voice data produced by these two processes.

【００３６】また、本出願に係る第２５の発明は、第２
４の発明の発言者特定装置において、分離された前記発
言者毎の音声データにつき、前記特定パラメータを参照
して発言者を特定できなかった場合には、該音声データ
から有声音データを分離し、該有声音データにつき、第
１ホルマント周波数及び第２ホルマント周波数を求め、
求められた第１ホルマント周波数及び第２ホルマント周
波数に関して、前記特定パラメータとしての第１ホルマ
ント周波数及び第２ホルマント周波数を参照して、発言
者を特定することを特徴とする発言者特定装置である。
このような第２５の発明によれば、分離された有声音デ
ータに対して、第１と第２のスペクトルピークである２
つのホルマント周波数を用いて発言者の特定を行うこと
によって、より正確に特定を行うことの可能な発言者特
定装置が実現できる。The twenty-fifth invention of the present application is the speaker identifying device according to the twenty-fourth invention, wherein, when the speaker of the separated voice data for each speaker cannot be identified by referring to the specific parameter, voiced sound data is separated from the voice data, the first and second formant frequencies are obtained for the voiced sound data, and the speaker is identified by referring, with respect to the obtained first and second formant frequencies, to the first and second formant frequencies that serve as the specific parameters. According to the twenty-fifth invention, identifying the speaker from the separated voiced sound data by using two formant frequencies, namely the first and second spectral peaks, makes it possible to realize a speaker identifying device that identifies speakers more accurately.

【００３７】また、本出願に係る第２６の発明は、第２
４の発明または第２５の発明の発言者特定装置におい
て、分離された前記発言者毎の音声データにつき、前記
特定パラメータを参照して発言者を特定できなかった場
合には、該有声音データにつき、複数の時点のホルマン
ト周波数を求め、求められた複数時点のホルマント周波
数に関して、前記特定パラメータとしての複数時点のホ
ルマント周波数を参照して、発言者を特定することを特
徴とする発言者特定装置である。このような第２６の発
明によれば、分離された有声音データに対して、発言者
特定上の特徴量であるホルマント周波数の時間的変動を
も考慮することにより、より正確に発言者の特定を行う
ことの可能な発言者特定装置が実現できる。The twenty-sixth invention of the present application is the speaker identifying device according to the twenty-fourth or twenty-fifth invention, wherein, when the speaker of the separated voice data for each speaker cannot be identified by referring to the specific parameter, formant frequencies at a plurality of time points are obtained for the voiced sound data, and the speaker is identified by referring, with respect to the obtained formant frequencies at the plurality of time points, to the formant frequencies at a plurality of time points that serve as the specific parameter. According to the twenty-sixth invention, the temporal variation of the formant frequency, a feature used for speaker identification, is also taken into account for the separated voiced sound data, so a speaker identifying device that identifies speakers more accurately can be realized.

【００３８】また、本出願に係る第２７の発明は、第２
４の発明又は第２５の発明の発言者特定装置において、
前記音声データから前記有声音データを分離する際に、
該音声データに対して独立成分に分離するための独立成
分分離処理が行われることを特徴とする発言者特定装置
である。有声音は声帯の振動を伴うものなので、このよ
うな第２７の発明によれば、音声データに独立成分分離
処理をかけることによって、声帯の振動を伴わない無声
音と声帯の振動を伴う有声音とを容易に分離することの
可能な発言者特定装置が実現できる。The twenty-seventh invention of the present application is the speaker identifying device according to the twenty-fourth or twenty-fifth invention, wherein, when the voiced sound data is separated from the voice data, independent component separation processing is performed on the voice data to separate it into independent components. Because voiced sound involves vibration of the vocal cords, according to the twenty-seventh invention applying independent component separation processing to the voice data makes it possible to realize a speaker identifying device that easily separates unvoiced sound, which does not involve vocal cord vibration, from voiced sound, which does.

【００３９】また、本出願に係る第２８の発明は、複数
発言者の音声データが混在している混在音声データか
ら、議事録を作成する議事録作成装置において、第２０
乃至第２７のいずれかの発明の発言者特定装置により、
分離された前記発言者毎の音声データにつき、発言者を
特定し、特定された発言者と、該発言者の発言とを対応
付けて記録媒体に出力することにより、議事録を作成す
ることを特徴とする議事録作成装置である。このような
第２８の発明によれば、発言者の特定が自動的に正確に
行われるため、長時間にわたる会議の議事録作成を自動
的に行うことの可能な議事録作成装置が実現できる。The twenty-eighth invention of the present application is a minutes creating device that creates minutes from mixed voice data in which voice data of a plurality of speakers are mixed, wherein the speaker identifying device according to any one of the twentieth to twenty-seventh inventions identifies the speaker of the separated voice data for each speaker, and the minutes are created by outputting each identified speaker, in association with that speaker's statements, to a recording medium. According to the twenty-eighth invention, speakers are identified automatically and accurately, so a minutes creating device that can automatically create the minutes of a long conference can be realized.

【００４０】また、第１乃至第５のいずれかの発明の音
声分離方法を音声分離装置に実行させるためのコンピュ
ータプログラムも実現可能である。A computer program for causing a voice separation device to execute the voice separation method of any of the first to fifth inventions can also be realized.

【００４１】また、第６乃至第１３のいずれかの発明の
発言者特定方法を発言者特定装置に実行させるためのコ
ンピュータプログラムも実現可能である。A computer program for causing the speaker identifying apparatus to execute the speaker identifying method according to any one of the sixth to thirteenth inventions can be realized.

【００４２】また、そのようなコンピュータプログラム
を記録したコンピュータ読み取り可能な記録媒体も実現
可能である。A computer-readable recording medium recording such a computer program can also be realized.

【００４３】[0043]

【発明の実施の形態】＝＝混在音声データの音声分離＝
＝以下、図面を参照しつつ、本発明のより具体的な実施形
態につき、詳細に説明する。まず、本発明の方法の前半
部分である、混在音声データの音声分離ステップについ
て説明する。BEST MODE FOR CARRYING OUT THE INVENTION == Voice Separation of Mixed Voice Data == Hereinafter, more specific embodiments of the present invention will be described in detail with reference to the drawings. First, the voice separation step for mixed voice data, which is the first half of the method of the present invention, will be described.

【００４４】本実施形態では、２人で行われたある会議
の発言内容の音声データを２本のマイク（マイク１、マ
イク２）で拾う。図１は、そのうちマイク１から入力さ
れた音声データ（生データ）Ｘの波形である。この混在
音声データには、複数の発言者の音声データが混在して
いるのみならず、音楽や、さらには雑音などが混ざって
いてもよい。２人の発声をそれぞれ音源Ｓ１、Ｓ２と呼
ぶことにする。In this embodiment, the voice data of the statements made in a conference held by two people is picked up by two microphones (microphone 1 and microphone 2). FIG. 1 shows the waveform of the voice data (raw data) X input from microphone 1. This mixed voice data may contain not only the mixed voice data of several speakers but also music and even noise. The utterances of the two people are referred to as sound sources S1 and S2, respectively.

【００４５】図２は、音声分離処理のサイクルを示す図
である。マイク１及びマイク２から入力された混在音声
データは、まず無相関化処理Ｗ１にかけられる。無相関
化処理Ｗ１に渡される音声データは、図１の[１]、[２]
のようにセグメント化されて１つずつ渡される。最も効
率がよいように、各セグメントは互いに１／２周期ずつ
オーバーラップしている。FIG. 2 shows the cycle of the voice separation processing. The mixed voice data input from microphone 1 and microphone 2 is first subjected to the decorrelation processing W1. The voice data passed to W1 is segmented as shown by [1] and [2] in FIG. 1 and passed one segment at a time. For the best efficiency, adjacent segments overlap each other by half a period.
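As a minimal sketch of the half-overlapping segmentation described above, the following Python function splits a sample sequence into segments that each start half a segment after the previous one. The function name and the parameter `segment_length` are assumptions of this sketch, as is the reading of "half a period" as half of the segment length.

```python
def split_into_overlapping_segments(samples, segment_length):
    """Split samples into segments of segment_length that overlap by half."""
    if segment_length < 2 or segment_length % 2:
        raise ValueError("segment_length must be an even number >= 2")
    hop = segment_length // 2  # each segment starts half a segment after the previous one
    return [
        samples[start:start + segment_length]
        for start in range(0, len(samples) - segment_length + 1, hop)
    ]
```

Samples at the end that do not fill a whole segment are dropped in this sketch; a real system would have to decide how to handle the final partial segment.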

【００４６】図２において、無相関化処理Ｗ１の次のス
テップであるＩＣチューナーは、独立成分解析（ＩＣ
Ａ）の手法を３種類のうちから選択するためのチューナ
ーである。その次のステップである独立成分分離処理Ｗ
２は、非ガウス性に基づく分離処理Ｗ２（α）、非定常
性に基づく分離処理Ｗ２（β）、有色性に基づく分離処
理Ｗ（γ）の３種類のうちいずれかの方式の処理を行
う。Ｗ２の後のステップの評価器Ｅでは、Ｗ２にて分離
されたデータの分離性の評価を行う。マイクから入力さ
れた混在音声データの音声分離性能が充分になるまで、
以上のＷ１→ＩＣチューナー→Ｗ２→Ｅというサイクル
を繰り返し回す。ただし、１回目のサイクルでは、独立
成分分離処理Ｗ２として、非ガウス性に基づく独立成分
分離処理Ｗ２（α）を行い、２回目以降のサイクルで
は、ＩＣチューナの選択に従って、Ｗ２（α）、Ｗ２
（β）、Ｗ２（γ）の３種類のうちから適切な方式の独
立成分分離処理を行う。In FIG. 2, the IC tuner, which is the step following the decorrelation processing W1, is a tuner that selects one of three independent component analysis (ICA) methods. The next step, the independent component separation process W2, performs one of three types of processing: separation W2(α) based on non-Gaussianity, separation W2(β) based on non-stationarity, or separation W2(γ) based on coloredness. The evaluator E, the step after W2, evaluates how well the data separated in W2 has been separated. The cycle W1 → IC tuner → W2 → E is repeated until the separation of the mixed voice data input from the microphones is sufficient. In the first cycle, however, the independent component separation process W2 is always the non-Gaussianity-based process W2(α); in the second and subsequent cycles, an appropriate process is chosen from W2(α), W2(β), and W2(γ) according to the selection made by the IC tuner.

【００４７】図３は、１回目の音声分離サイクルを示し
ている。図１における前記[１]の時間セグメントの、マ
イク１及びマイク２からの混在音声データｘ１、ｘ２
が、まず無相関化処理Ｗ１に入力される。FIG. 3 shows the first voice separation cycle. The mixed voice data x1 and x2 from microphone 1 and microphone 2 for time segment [1] in FIG. 1 is first input to the decorrelation processing W1.

【００４８】図７及び図８は、それぞれｘ１及びｘ２の
デジタル化波形図データ（縦軸は音の強さで、単位はミ
リボルト）を示す。各時点のｘ１、ｘ２データを、横軸
をｘ１の強さ、縦軸をｘ２の強さとして散布図を描くと
図９のようになる。散布図は、第１象限から第３象限に
かけて若干直線的な分布を呈し、ｘ１とｘ２のデータは
互いに相関性を有することを示している。これら生デー
タであるｘ１、ｘ２が無相関化処理Ｗ１にかけられる
と、互いに相関性を有しないデータｆ１、ｆ２に変換さ
れる。FIGS. 7 and 8 show the digitized waveform data of x1 and x2, respectively (the vertical axis is sound intensity in millivolts). Plotting the x1 and x2 data at each time point as a scatter plot, with the intensity of x1 on the horizontal axis and the intensity of x2 on the vertical axis, gives FIG. 9. The scatter plot shows a somewhat linear distribution running from the first quadrant to the third quadrant, indicating that x1 and x2 are correlated with each other. When the raw data x1 and x2 are subjected to the decorrelation processing W1, they are converted into data f1 and f2 that are uncorrelated with each other.

【００４９】ｆ１及びｆ２の散布図を図１０に示す。図
１０の横軸は因子得点Ｆの第１因子ｆ１、縦軸は因子得
点Ｆの第２因子ｆ２を示している。図９が軸に対してい
びつな平行四辺形状に分布していたのに対し、軸に対し
てまっすぐで形の整ったひし形状に分布しており、ｆ１
とｆ２はもはや互いに相関性を有していないことがわか
る。FIG. 10 shows the scatter plot of f1 and f2. The horizontal axis of FIG. 10 is the first factor f1 of the factor score F, and the vertical axis is the second factor f2. Whereas the points in FIG. 9 were distributed in a parallelogram skewed with respect to the axes, the points here are distributed in a well-formed rhombus aligned with the axes, which shows that f1 and f2 are no longer correlated with each other.

【００５０】ここで、無相関化処理の内容について説明
する。図６は、無相関化処理Ｗ１の一例のフローチャー
トを示したものである。まず、図７及び図８に示した音
声生データｘ１、ｘ２を（１）式により標準化する。標
準化の結果、平均が０、標準偏差１のデータとなる。Here, the contents of the decorrelation process will be described. FIG. 6 shows a flowchart of an example of the decorrelation process W1. First, the raw audio data x1 and x2 shown in FIGS. 7 and 8 are standardized by the equation (1). As a result of standardization, the data has an average of 0 and a standard deviation of 1.

【数１】 [Equation 1]

【００５１】生データｘ１、ｘ２の相関行列(ベクトル
Ｃ)を（２）式より求める。（２）式において（ｘ１、
ｘ２）はベクトルの内積を表す。The correlation matrix C of the raw data x1 and x2 is obtained from equation (2), in which (x1, x2) denotes the inner product of the vectors.

【数２】 [Equation 2]
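Equations (1) and (2) are not reproduced in this text. As a hedged illustration of the two steps they describe, the following Python sketch standardizes each signal to mean 0 and standard deviation 1 and then builds a 2×2 correlation matrix from inner products of the standardized signals. Dividing by the number of samples n (rather than n − 1) and the function names are assumptions of this sketch, not taken from the specification.

```python
import math

def standardize(x):
    """Rescale x to mean 0 and standard deviation 1."""
    n = len(x)
    mean = sum(x) / n
    # Population standard deviation (dividing by n) is an assumption of this sketch.
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0.0:
        raise ValueError("a constant signal cannot be standardized")
    return [(v - mean) / std for v in x]

def correlation_matrix(x1, x2):
    """2x2 correlation matrix built from inner products of the standardized signals."""
    z1, z2 = standardize(x1), standardize(x2)
    n = len(z1)

    def inner(a, b):
        # Inner product of two signals, normalized by the number of samples.
        return sum(p * q for p, q in zip(a, b)) / n

    return [[inner(z1, z1), inner(z1, z2)],
            [inner(z2, z1), inner(z2, z2)]]
```

The off-diagonal entry is the correlation coefficient between the two microphone signals; a value far from zero corresponds to the elongated scatter of FIG. 9.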

【００５２】上記相関行列に対する固有値λｉと固有ベ
クトルＡを（３）より求める。The eigenvalues λi and the eigenvectors A of the above correlation matrix are obtained from equation (3).

【００５３】[0053]

【数３】 [Equation 3]

【００５４】今、因子分析によって、互いに無相関な因
子得点を求めようとしているのだが、その際、第１番目
の因子から始めて、何番目の因子までを採用するのかが
重要な点である。ｍ番目の因子までを採用する場合を、
ｍ次元と呼ぶ。先に求めた固有ベクトルＡにより、
（４）式によって主成分Ｚが求まる。The aim here is to obtain mutually uncorrelated factor scores by factor analysis. An important question in doing so is how many factors to adopt, counting from the first. Adopting factors up to the m-th is referred to as the m-dimensional case. Using the eigenvectors A obtained above, the principal components Z are obtained from equation (4).

【数４】 [Equation 4]
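Equations (3) and (4) are likewise not reproduced here. For the two-microphone case, the eigenvalue problem and the projection onto the eigenvectors can be sketched in Python as follows. The closed-form 2×2 solution and the function names are choices made for this sketch, and it assumes that the principal components are the projections of the input signals onto the unit eigenvectors.

```python
import math

def eigen_symmetric_2x2(c):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, d = c[0][0], c[0][1], c[1][1]
    half_trace = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    eigenvalues = [half_trace + radius, half_trace - radius]
    if b == 0:
        # Already diagonal: the eigenvectors are the coordinate axes.
        eigenvectors = [(1.0, 0.0), (0.0, 1.0)] if a >= d else [(0.0, 1.0), (1.0, 0.0)]
    else:
        eigenvectors = []
        for lam in eigenvalues:
            vx, vy = b, lam - a  # satisfies (C - lam*I) v = 0 for a symmetric C
            norm = math.hypot(vx, vy)
            eigenvectors.append((vx / norm, vy / norm))
    return eigenvalues, eigenvectors

def principal_components(x1, x2, eigenvectors):
    """Project the two signals onto each eigenvector (assumed form of equation (4))."""
    return [[vx * p + vy * q for p, q in zip(x1, x2)] for vx, vy in eigenvectors]
```

The resulting component sequences are mutually uncorrelated, which is the property the decorrelation processing W1 relies on.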

【００５５】次にｍ個の因子に対して、（５）式の形の
定義式にて因子分析を実行する。（５）式におけるｅ
は、特殊因子と呼ばれるものである。Next, factor analysis is performed on the m factors using a defining equation of the form (5). The term e in equation (5) is called the specific factor.

【数５】 [Equation 5]

【００５６】この因子モデルが（６）式の表現をとる。
（６）式における因子負荷量ｂｉｊ、因子得点Ｆは、
（７）式及び（８）式によって求める。そして、図６の
フローチャートの最終ステップで、結局音声生データ
は、互いに無相関な因子得点（ベクトルＦ）に変換され
る。This factor model is expressed as equation (6). The factor loadings bij and the factor scores F in equation (6) are obtained from equations (7) and (8). In the final step of the flowchart of FIG. 6, the raw voice data is thus converted into mutually uncorrelated factor scores (vector F).

【数６】 [Equation 6]

【数７】 [Equation 7]

【数８】 [Equation 8]

【００５７】以上説明したＷ１の主な特徴は、主成分分
析と因子分析とを組み合わせている点である。その効果
は、主成分分析を実行すると各主成分の寄与率を同時に
求めることができるので、例えば、第１次主成分から第
ｍ次主成分までの累積寄与率が８０％を超えるまでの主
成分を採用するようにすることで、次数ｍを決定するこ
とにある。分離すべき音声生データは、時間的変動が大
きく、混合による相関の度合いが大きく変化するので、
何個の因子を採用するかは無相関化処理において重要な
点である。The main feature of W1 described above is that it combines principal component analysis with factor analysis. The benefit is that principal component analysis yields the contribution ratio of each principal component at the same time, so the order m can be determined, for example, by adopting principal components until the cumulative contribution ratio from the first through the m-th principal component exceeds 80%. The raw voice data to be separated varies greatly over time, and the degree of correlation introduced by mixing changes substantially, so how many factors to adopt is an important point in decorrelation processing.
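The cumulative contribution ratio referred to above can be illustrated with a short Python sketch. It assumes, as is usual in principal component analysis, that the contribution ratio of a component is its eigenvalue divided by the sum of all eigenvalues; the function names and the example eigenvalues in the usage note are hypothetical.

```python
def cumulative_contribution(eigenvalues):
    """Cumulative contribution ratios for orders 1, 2, ..., len(eigenvalues)."""
    total = sum(eigenvalues)
    ratios, running = [], 0.0
    for lam in sorted(eigenvalues, reverse=True):  # largest component first
        running += lam
        ratios.append(running / total)
    return ratios

def order_for_threshold(cumulative, threshold=0.80):
    """Smallest order m whose cumulative contribution ratio exceeds the threshold."""
    for m, rk in enumerate(cumulative, start=1):
        if rk > threshold:
            return m
    return None  # the threshold is never exceeded
```

For example, eigenvalues of 2.0, 1.0, 0.6, and 0.4 give cumulative ratios of roughly 50%, 75%, 90%, and 100%, so an 80% threshold is first exceeded at order 3.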

【００５８】発話者の人数があらかじめ判明している場
合には、次数ｍを発話者の人数に固定してしまえばよい
が、人数が不明なときは、例えば、累積寄与率が所定の
しきい値を超えたときの主成分数を次数ｍとする。次数
ｍの決定方法は、システムに応じて様々な方法を準備し
ておき、臨機応変に変化させる（チューニングする）こ
とが好ましい。次にこのチューニングの一実施例につい
て詳しく説明する。If the number of speakers is known in advance, the order m can simply be fixed to the number of speakers. If the number is unknown, the order m is set, for example, to the number of principal components at which the cumulative contribution ratio exceeds a predetermined threshold. It is preferable to prepare several methods of determining the order m suited to the system and to switch between them (tune) as circumstances require. One embodiment of this tuning is described in detail next.

【００５９】図２３は、システムに応じた方法で次数ｍ
を決定する手順を示すフローチャートである。図２３
で、ＲＫ０は累積寄与率の初期しきい値、Ｍは採用し得
る最大次数（次数の上側しきい値）、△ＲＫは累積寄与
率の変化量である。主成分分析を実行すると、図２１の
ような、次数ｍ（第ｍ主成分まで採用したということを
示す）とその累積寄与率との関係を示すグラフが得られ
る。図２１にはＡ、Ｂ、Ｃ３種類のグラフの例を描いて
いる。FIG. 23 is a flowchart showing a procedure for determining the order m by a method suited to the system. In FIG. 23, RK0 is the initial threshold of the cumulative contribution ratio, M is the maximum order that may be adopted (the upper threshold of the order), and ΔRK is the amount of change in the cumulative contribution ratio. Performing principal component analysis yields a graph such as FIG. 21, which shows the relationship between the order m (meaning that components up to the m-th principal component are adopted) and the cumulative contribution ratio. FIG. 21 shows three example graphs, A, B, and C.

【００６０】まず、第１の処理ステップとして、累積寄
与率ＲＫにしきい値ＲＫ０（この実施例では８０％）を
設定しておき、このしきい値ＲＫ０を超える次数ｍを求
める。ところが、次数があまりに大きいとその後の処理
が煩雑に過ぎるので、あらかじめ次数の上限値Ｍを決め
ておく。図２１の例では、Ｍ＝４とすると、Ａの場合は
しきい値ＲＫ０を超える次数ｍ＝２であるので、ｍ＝２
＜４＝Ｍとなって、次数ｍは２に決定される。Ｂの例で
はＲＫ０を超える次数ｍは５であるので、ｍ＝５＞４＝
Ｍとなってしまい、次数ｍはまだ決定されない。Ｃの例
でも同様に次数ｍは決定されない。First, as the first processing step, a threshold RK0 (80% in this embodiment) is set for the cumulative contribution ratio RK, and the order m at which RK exceeds RK0 is found. If the order is too large, however, the subsequent processing becomes too cumbersome, so an upper limit M on the order is fixed in advance. In the example of FIG. 21 with M = 4, the order at which graph A exceeds RK0 is m = 2, so m = 2 < 4 = M and the order is set to 2. In example B the order at which RK0 is exceeded is 5, so m = 5 > 4 = M and the order m is not yet determined. In example C the order is likewise not determined.

【００６１】そのような場合は図２２に示す、第２のス
テップを実行する。すなわち、次数ｍの増加に対する、
ＲＫの差分変化量△ＲＫを調べる。これは要するに、累
積寄与率の変化が最大となる次数ｍをもって採用すべき
次数とするという処理方法である。この実施例では、Ｂ
の例ではｍ＝２、Ｃの例ではｍ＝４において△ＲＫが最
大値をとる。この場合も次数ｍが上限値Ｍよりも下なら
ば、その次数ｍを採用とするが、Ｍを上回る場合は、そ
の処理が次のステップに送られる。In that case, the second step, shown in FIG. 22, is executed: the change ΔRK in RK for each increase in the order m is examined. In short, this method adopts the order m at which the change in the cumulative contribution ratio is largest. In this embodiment, ΔRK takes its maximum at m = 2 in example B and at m = 4 in example C. Here too, the order m is adopted if it is below the upper limit M; if it exceeds M, processing passes to the next step.

【００６２】第２のステップでも次数ｍが上限値Ｍを超
えてしまう場合であれば、次に累積寄与率のしきい値Ｒ
Ｋ０を引き下げて、例えば６０％（＝ＲＫ１）とし、上
記第１のステップと同じように比較する。新しいしきい
値ＲＫ１を超えるところの次数がＭ＝４以下であれば、
これを次数ｍとして採用とし、Ｍを超える場合は、所定
の下げ幅で順次ＲＫ２、ＲＫ３、・・・ＲＫｎの値を下
げる。ただし、累積寄与率ＲＫが５０％を下回るという
ことは、半分以上の情報が失われてしまうことを意味す
るので、ＲＫｎの下限値は５０％とする。If the order m exceeds the upper limit M in the second step as well, the threshold RK0 of the cumulative contribution ratio is lowered, for example to 60% (= RK1), and the same comparison as in the first step is made. If the order at which the new threshold RK1 is exceeded is M = 4 or less, it is adopted as the order m; if it exceeds M, the threshold is lowered successively by a predetermined decrement to RK2, RK3, ..., RKn. A cumulative contribution ratio RK below 50% would mean that more than half of the information is lost, so the lower limit of RKn is 50%.

【００６３】次数ｍがＲＫｎ＝５０％以上で、かつＭ以
下の値で発見されない場合は、再び上記第２のステップ
と同様の処理、すなわち△ＲＫが最大になる次数を求め
て、その値を次数ｍとして採用してしまう。これは、累
積寄与率が大きく変化するということは、その次数の前
後で情報がより多く保存されるということを意味するの
で、少なくともその次数までは採用したい、という考え
に基づくものである。If no order m at or below M is found with RKn at 50% or above, the same processing as in the second step is performed again, that is, the order at which ΔRK is largest is found, and this time that value is adopted as the order m regardless. The reasoning is that a large change in the cumulative contribution ratio means that much of the information is captured around that order, so components should be adopted at least up to that order.
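The four-stage procedure for choosing the order m can be summarized in the following Python sketch. It is an illustration under stated assumptions, not the flowchart of FIG. 23 itself: ΔRK at order m is taken to be RK(m) − RK(m − 1) for m ≥ 2, an order equal to M is treated as acceptable, and the decrement `step` of 10 percentage points is hypothetical (the text says only "a predetermined decrement").

```python
def select_order(cumulative, max_order=4, rk0=0.80, step=0.10, floor=0.50):
    """Choose the order m; cumulative[m - 1] is the cumulative contribution ratio RK for order m."""

    def first_order_exceeding(threshold):
        for m, rk in enumerate(cumulative, start=1):
            if rk > threshold:
                return m
        return len(cumulative)

    def order_of_largest_increase():
        # Assumed definition: delta RK at order m is RK(m) - RK(m - 1), for m >= 2.
        deltas = [cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))]
        return deltas.index(max(deltas)) + 2

    # Step 1: order at which RK exceeds the initial threshold RK0.
    m = first_order_exceeding(rk0)
    if m <= max_order:
        return m
    # Step 2: order at which the change in RK is largest.
    m = order_of_largest_increase()
    if m <= max_order:
        return m
    # Step 3: lower the threshold stepwise (RK1, RK2, ...), never below the floor.
    threshold = rk0 - step
    while threshold >= floor - 1e-12:
        m = first_order_exceeding(threshold)
        if m <= max_order:
            return m
        threshold -= step
    # Step 4: adopt the order of the largest change even if it exceeds max_order.
    return order_of_largest_increase()
```

With M = 4, a curve that passes 80% at order 2 is settled in step 1, whereas a curve that first passes 80% at order 5 but rises most steeply at order 2 is settled in step 2, matching examples A and B in the text.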

【００６４】以上のようにして、図３において、無相関
化されたデータｆ１、ｆ２は、ただちに独立成分分離処
理Ｗ２に送られる。１回目の音声分離サイクルでは、こ
れらの無相関化データｆ１、ｆ２に対し、非ガウス性に
基づく独立成分分離処理Ｗ２（α）を実行する。As described above, in FIG. 3, the decorrelated data f1 and f2 are immediately sent to the independent component separating process W2. In the first speech separation cycle, the independent component separation process W2 (α) based on non-Gaussianity is executed on these decorrelated data f1 and f2.

【００６５】以上、図３におけるＷ１及びＷ２(α)の処
理により、分離信号ａおよびｂが得られ、これらの分離
性（充分に分離されているか否か）を評価器Ｅで評価
し、分離が不十分なとき（図の＊１）はこれらａ、ｂの
データに対して、２回目のサイクルを実行する。Through the processing of W1 and W2(α) in FIG. 3, the separated signals a and b are obtained. The evaluator E evaluates their separability (whether they are sufficiently separated), and if the separation is insufficient (*1 in the figure), a second cycle is executed on the data a and b.

【００６６】２回目のサイクルの例を図４に示す。図３
に示した１回目のサイクルと似ているが、ＩＣチューナ
ーにおける処理が加わっている。独立成分分離処理Ｗ２
を行う前に、ＩＣチューナーで２回目の無相関化処理さ
れたデータｆ１´、ｆ２´の信号特性を解析し、非ガウ
ス性に基づく処理Ｗ２(α)、非定常性に基づく処理Ｗ２
(β)、有色性に基づく処理Ｗ２（γ）のいずれをＷ２と
して実行するかを選択する。この例ではＷ２（β）を実
行している。処理Ｗ２（β）の後のデータｙ１、ｙ２の
分離性は、評価器Ｅで評価され、不十分なとき（図４の
＊２）は３回目のサイクルが実行される。FIG. 4 shows an example of the second cycle. It resembles the first cycle shown in FIG. 3, but processing by the IC tuner is added. Before the independent component separation process W2 is performed, the IC tuner analyzes the signal characteristics of the data f1′ and f2′ produced by the second decorrelation processing and selects which process to execute as W2: W2(α) based on non-Gaussianity, W2(β) based on non-stationarity, or W2(γ) based on coloredness. In this example W2(β) is executed. The evaluator E evaluates the separability of the data y1 and y2 output by W2(β), and if it is insufficient (*2 in FIG. 4), a third cycle is executed.

【００６７】ここで、ＩＣチューナーの機能について説
明する。ＩＣチューナーは、次のように無相関化処理さ
れた入力データのガウス性、定常性、及び有色性を評価
し、３種のうちから最適な独立成分分離処理を選択す
る。The function of the IC tuner is now described. The IC tuner evaluates the Gaussianity, stationarity, and coloredness of the decorrelated input data as follows and selects the optimum independent component separation process from the three types.

【００６８】まず、ＩＣチューナーは、二つの入力デー
タのガウス性を評価する。詳しくは、それぞれの入力デ
ータについて、入力時系列データの頻度分布がガウス関
数（正規分布関数）型か、非ガウス関数型かを調べる。
入力データをｇｓ、ガウス関数をｇ０とすると、両者の
差分の絶対値、すなわち｜ｇｓ−ｇ０｜を、当該区間に
おいて積分した値△ｇが、所定のしきい値δｇより大き
ければ非ガウス型、小さければガウス型と評価する。無
相関化処理された入力データのいずれもが非ガウス型で
あれば、ＩＣチューナーは、独立成分分離処理Ｗ２とし
て非ガウス性に基づく処理Ｗ２(α)を選択する。First, the IC tuner evaluates the Gaussianity of the two input data. Specifically, for each input it checks whether the frequency distribution of the input time-series data is of Gaussian (normal distribution) type or non-Gaussian type. Letting gs denote the input data and g0 the Gaussian function, the absolute difference |gs − g0| is integrated over the interval concerned to give a value Δg; the data is judged non-Gaussian if Δg is larger than a predetermined threshold δg and Gaussian if it is smaller. If all of the decorrelated input data are non-Gaussian, the IC tuner selects the non-Gaussianity-based process W2(α) as the independent component separation process W2.
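The Gaussianity test can be sketched as follows. This is a minimal illustration that assumes gs is estimated as a histogram density of the standardized samples and g0 is the standard normal density; the bin count, the integration range of ±4 standard deviations, and the threshold value `delta_g` are hypothetical choices that the specification does not give.

```python
import math
from statistics import NormalDist

def gaussianity_gap(samples, bins=20, span=4.0):
    """Approximate the integral of |gs - g0| over [-span, span] for standardized samples."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in samples) / n)
    if std == 0.0:
        raise ValueError("a constant signal has no distribution to compare")
    width = 2.0 * span / bins
    counts = [0] * bins
    for v in samples:
        index = math.floor(((v - mean) / std + span) / width)
        if 0 <= index < bins:
            counts[index] += 1
    gap = 0.0
    for i, count in enumerate(counts):
        center = -span + (i + 0.5) * width
        gs = count / (n * width)                                       # empirical density
        g0 = math.exp(-0.5 * center ** 2) / math.sqrt(2.0 * math.pi)  # standard normal density
        gap += abs(gs - g0) * width
    return gap

def is_non_gaussian(samples, delta_g=0.15):
    """Judge the data non-Gaussian when the gap exceeds the (hypothetical) threshold delta_g."""
    return gaussianity_gap(samples) > delta_g

# Deterministic test signals: normal quantiles versus an evenly spaced (uniform) grid.
gaussian_like = [NormalDist().inv_cdf((i + 0.5) / 2000) for i in range(2000)]
uniform_like = [i / 1000.0 for i in range(-1000, 1001)]
```

In this sketch the normal-quantile signal is judged Gaussian and the uniformly distributed signal is judged non-Gaussian; a practical threshold would have to be tuned on real voice data.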

【００６９】無相関化処理された入力データのいずれか
がガウス型と評価された場合には、次に、ＩＣチューナ
ーは、二つの入力データの定常性を評価する。この評価
にあたっては、複数の不規則波形の集合平均をとり、こ
の集合平均の時間変化に着目する。集合平均が時間軸に
対して一定であれば、「完全定常」とする。時間的に変
動している場合は、ある時間幅における確率密度分布を
求めて分散、歪度、及び尖度から非定常性を数値化す
る。非定常性の強さは、分散の大きさ、歪度の大きさ、
尖度の大きさの順に影響を強く受けやすいため、その強
さに応じた重み付けを施した上で評価することが好まし
い。無相関化処理された入力データのいずれもが非定常
性を有すると評価された場合、ＩＣチューナーは、独立
成分分離処Ｗ２として非定常性に基づく処理Ｗ２(β)を
選択する。If any of the decorrelated input data is judged to be Gaussian, the IC tuner next evaluates the stationarity of the two input data. For this evaluation, the ensemble average of a number of irregular waveforms is taken and its variation over time is examined. If the ensemble average is constant along the time axis, the data is regarded as “completely stationary”. If it varies over time, the probability density distribution within a given time window is obtained, and the non-stationarity is quantified from its variance, skewness, and kurtosis. The strength of the non-stationarity is affected most strongly by the variance, then by the skewness, and then by the kurtosis, so it is preferable to weight these quantities accordingly in the evaluation. If all of the decorrelated input data are judged to be non-stationary, the IC tuner selects the non-stationarity-based process W2(β) as the independent component separation process W2.
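One possible way to quantify non-stationarity from the variance, skewness, and kurtosis is sketched below. The text does not give a formula, so everything specific here is an assumption: the signal is cut into non-overlapping windows, the three moments are computed per window, and the spread (maximum minus minimum) of each moment across windows is combined with weights that decrease in the order variance, skewness, kurtosis. The ensemble-average check that precedes this step in the text is not modeled.

```python
import math

def window_moments(window):
    """Variance, skewness, and kurtosis of one time window."""
    n = len(window)
    mean = sum(window) / n
    var = sum((v - mean) ** 2 for v in window) / n
    if var == 0.0:
        return 0.0, 0.0, 0.0
    std = math.sqrt(var)
    skew = sum(((v - mean) / std) ** 3 for v in window) / n
    kurt = sum(((v - mean) / std) ** 4 for v in window) / n
    return var, skew, kurt

def nonstationarity_score(signal, window_length, weights=(0.6, 0.3, 0.1)):
    """Weighted spread of the window moments; 0 means the moments do not change over time."""
    windows = [signal[s:s + window_length]
               for s in range(0, len(signal) - window_length + 1, window_length)]
    moments = [window_moments(w) for w in windows]
    score = 0.0
    for k, weight in enumerate(weights):  # hypothetical weights: variance > skewness > kurtosis
        values = [m[k] for m in moments]
        score += weight * (max(values) - min(values))
    return score

# A steady tone and a tone whose amplitude grows over time.
steady = [math.sin(2.0 * math.pi * i / 50.0) for i in range(1000)]
growing = [(1.0 + i / 250.0) * math.sin(2.0 * math.pi * i / 50.0) for i in range(1000)]
```

With a window of 200 samples, the steady tone scores essentially zero while the growing tone scores well above it, because its variance changes from window to window.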

【００７０】無相関化処理された入力データのいずれか
が定常性を有すると評価された場合には、次に、ＩＣチ
ューナーは、二つの入力データの有色性を評価する。有
色性を評価するには、不規則波形の自己相関関数を求め
る。時間のずれτの大きさについての自己相関関数のグ
ラフを求め、そのグラフの重心位置が原点（τ＝０）か
らどれだけ乖離しているかを調べる。重心位置が原点
（τ＝０）から所定値以上乖離している場合には、有色
性を有していると評価する。なお、白色雑音の場合は、
自己相関関数はτ＝０にのみ値を有する。無相関化処理
された入力データのいずれもが有色性を有すると評価さ
れた場合、ＩＣチューナーは、独立成分分離処Ｗ２とし
て有色性に基づく処理Ｗ２(γ)を選択する。If any of the decorrelated input data is judged to be stationary, the IC tuner next evaluates the coloredness of the two input data. To evaluate coloredness, the autocorrelation function of the irregular waveform is computed. The autocorrelation function is plotted against the time lag τ, and the distance of the centroid of the graph from the origin (τ = 0) is examined. If the centroid is at least a predetermined distance from the origin (τ = 0), the data is judged to be colored. For white noise, the autocorrelation function is nonzero only at τ = 0. If all of the decorrelated input data are judged to be colored, the IC tuner selects the coloredness-based process W2(γ) as the independent component separation process W2.
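The coloredness test and the overall selection order of the IC tuner can be illustrated with the following Python sketch. The centroid is computed here with the squared autocorrelation as the weight, which is a choice made for this sketch so that sampling noise at nonzero lags contributes little; the maximum lag, the threshold `min_centroid`, and the function names are likewise hypothetical. The text does not say what happens if none of the three conditions holds, so the selector returns `None` in that case.

```python
import math
import random

def autocorrelation(signal, max_lag):
    """Normalized autocorrelation for lags 0..max_lag (the value at lag 0 is 1)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]
    r0 = sum(v * v for v in centered)
    return [sum(centered[t] * centered[t + lag] for t in range(n - lag)) / r0
            for lag in range(max_lag + 1)]

def autocorrelation_centroid(signal, max_lag=20):
    """Centroid of the autocorrelation graph along the lag axis (squared values as weights)."""
    weights = [r * r for r in autocorrelation(signal, max_lag)]
    return sum(lag * w for lag, w in enumerate(weights)) / sum(weights)

def is_colored(signal, min_centroid=2.0, max_lag=20):
    """Judge the data colored when the centroid lies far enough from the origin."""
    return autocorrelation_centroid(signal, max_lag) >= min_centroid

def select_separation(all_non_gaussian, all_non_stationary, all_colored):
    """Selection order described in the text: non-Gaussianity, then non-stationarity, then coloredness."""
    if all_non_gaussian:
        return "W2(alpha)"
    if all_non_stationary:
        return "W2(beta)"
    if all_colored:
        return "W2(gamma)"
    return None  # not specified in the text

rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(2000)]
slow_sine = [math.sin(2.0 * math.pi * i / 200.0) for i in range(2000)]
```

For white noise the centroid stays close to τ = 0, whereas for a slowly varying signal it moves toward the middle of the lag range, so only the latter is judged colored.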

【００７１】図５は３回目のサイクルを示している。各
処理は２回目のサイクルと同様であるが、３回目の独立
成分分離処理は、この例では有色性に基づくＷ２(γ)を
実行している。FIG. 5 shows the third cycle. Each process is similar to the second cycle, but in the third independent component separation process, W2 (γ) based on the chromaticity is executed in this example.

【００７２】ここで、前述した３種の独立分離処理Ｗ２
（α）、Ｗ２（β）、及びＷ２（γ）の内容についてよ
り詳しく説明する。第１に、非ガウス性に基づく独立成
分分離処理Ｗ２（α）による信号源推定手順であるが、
まず、分離係数（行列）Ｗｔを適宜に仮定する（初期値
をＷ０とする）。Here, the contents of the three independent component separation processes described above, W2(α), W2(β), and W2(γ), are explained in more detail. The first is the signal source estimation procedure using the non-Gaussianity-based independent component separation process W2(α). First, a separation coefficient (matrix) Wt is assumed as appropriate (its initial value is W0).

【００７３】次に（９）式の様に無相関化処理後のデー
タＦ（ｔ）に対する信号源ｙ（ｔ）を推定する。Next, the signal source y (t) for the data F (t) after the decorrelation processing is estimated as in the equation (9).

【数９】このｙ（ｔ）と、Ｗｔを用いて、（１０）式に示す式か
ら△Ｗｔを求める。[Equation 9] Using this y (t) and Wt, ΔWt is obtained from the equation shown in equation (10).

【数１０】 [Equation 10]

【００７４】（１１）式により、次の収束計算ステップ
でのＷｔ＋１を求める。このＷｔ＋１を新たなＷｔとし
て、以上のステップを繰り返す。そして、△Ｗｔがほぼ
ゼロになった時点、すなわちＷｔが十分に収束したと考
えられる時点のｙ（ｔ）が、混在音声生データｘ（ｔ）
から求められた信号源ｓ（ｔ）の推定信号となる。Using equation (11), Wt+1 for the next convergence calculation step is obtained. The above steps are repeated with this Wt+1 as the new Wt. Then y(t) at the point when ΔWt has become almost zero, that is, when Wt is considered to have converged sufficiently, is the estimated signal of the signal source s(t) obtained from the raw mixed voice data x(t).

【数１１】 [Equation 11]
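Equations (9) to (11) are not reproduced in this text, so the following is only a sketch of what an iteration of this kind commonly looks like. It assumes the widely used natural-gradient rule y = W F, ΔW = η (I − E[φ(y) yᵀ]) W, W ← W + ΔW with φ = tanh. These formulas, the learning rate η, the tolerance, and all function names are assumptions, not the patent's own equations.

```python
import math


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def natural_gradient_step(W, F, lr=0.1):
    """One update W -> W + dW, with dW = lr * (I - E[tanh(y) y^T]) W and y = W F."""
    n, T = len(W), len(F[0])
    Y = matmul(W, F)  # estimated sources, one column per sample
    # Time average of phi(y) y^T with phi = tanh (assumed nonlinearity).
    C = [[sum(math.tanh(Y[i][t]) * Y[j][t] for t in range(T)) / T for j in range(n)]
         for i in range(n)]
    G = [[(1.0 if i == j else 0.0) - C[i][j] for j in range(n)] for i in range(n)]
    dW = [[lr * v for v in row] for row in matmul(G, W)]
    W_new = [[W[i][j] + dW[i][j] for j in range(n)] for i in range(n)]
    return W_new, dW


def separate(F, W0, lr=0.1, tol=1e-6, max_iter=1000):
    """Repeat the update until every entry of dW is below `tol` (or max_iter is hit)."""
    W = W0
    for _ in range(max_iter):
        W, dW = natural_gradient_step(W, F, lr)
        if max(abs(v) for row in dW for v in row) < tol:
            break
    return W, matmul(W, F)
```

Here `F` holds the decorrelated data with one row per channel, and the second return value of `separate` plays the role of y(t) at the point where ΔWt is regarded as almost zero.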

【００７５】第２に、非定常性に基づく独立成分分離処
理Ｗ２（β）による信号源推定手順であるが、まず、分
離係数（行列）Ｃｔと系の時定数Ｔ´のオーダーの時間
におけるｙ^２（ｔ）の移動平均Φの初期値を求める。
また、ｙ（ｔ）を（１２）式により求める。（１２）式
において、Ｉは単位行列である。Second is the signal source estimation procedure using the non-stationarity-based independent component separation process W2(β). First, initial values are obtained for the separation coefficient (matrix) Ct and for Φ, the moving average of y²(t) over a time on the order of the system time constant T′. Further, y(t) is calculated by equation (12). In equation (12), I is the identity matrix.

【数１２】 [Equation 12]

【００７６】次に（１２）式に示す微分方程式を解い
て、Φを求める。（１３）式において、Ｔ´は系の時定
数である。Next, Φ is obtained by solving the differential equation shown in equation (12). In equation (13), T′ is the time constant of the system.

【数１３】 [Equation 13]

【００７７】次に、（１２）式におけるΦ、Ｃｔ、ｙ
（ｔ）より（１４）式に示す微分方程式を用いて新たな
Ｃｔ＋１を求める。（１４）式において、Ｔは系の時定
数である。Next, from Φ, Ct, and y(t) in equation (12), a new Ct+1 is obtained using the differential equation shown in equation (14). In equation (14), T is the time constant of the system.

【数１４】求められたＣｔ＋１と、無相関化処理後データＦ（ｔ）
とから（１５）式を用いて次のステップのｙ（ｔ）を推
定する。[Equation 14] From the obtained Ct+1 and the decorrelated data F(t), y(t) for the next step is estimated using equation (15).

【数１５】このｙ（ｔ）とＣｔ＋１とを用いて、以上のステップを
繰り返す。そして、Ｃｔが十分収束したと考えられる時
点のｙ（ｔ）が混在音声生データｘ（ｔ）から求められ
た信号源ｓ（ｔ）の推定信号となる。[Equation 15] The above steps are repeated using this y (t) and Ct + 1. Then, y (t) at the time when Ct is considered to have sufficiently converged becomes an estimated signal of the signal source s (t) obtained from the mixed voice raw data x (t).

【００７８】第３に、有色性に基づく独立成分分離処理
Ｗ２（γ）による信号源推定手順であるが、まず、分離
係数行列ＣｔとΨ１、Ψ２の初期値を与える。ここで、
Ψ１、Ψ２は、ｙ（ｔ）に２種類の線形フィルタをかけ
たものｙ１（ｔ）、及びｙ２（ｔ）から作られる２つの
積（ｙ１＊ｙ１^Ｔ）、及び（ｙ２＊ｙ２^Ｔ）の時間平
均である。また、ｙ（ｔ）を無相関化処理後データＦ
（ｔ）から（１６）式を用いて推定する。Third is the signal source estimation procedure using the chromaticity-based independent component separation process W2(γ). First, initial values are given for the separation coefficient matrix Ct and for Ψ1 and Ψ2. Here, Ψ1 and Ψ2 are the time averages of the two products (y1 * y1^T) and (y2 * y2^T) formed from y1(t) and y2(t), which are obtained by applying two kinds of linear filters to y(t). In addition, y(t) is estimated from the decorrelated data F(t) using equation (16).

【数１６】このｙ（ｔ）に、２種類の線形フィルタＧ１、Ｇ２をか
けて、（１７）式によりｙ１（ｔ）、ｙ２（ｔ）を求め
る。[Equation 16] By applying two types of linear filters G1 and G2 to this y (t), y1 (t) and y2 (t) are obtained by the equation (17).
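The quantities y1(t), y2(t), Ψ1, and Ψ2 described above can be illustrated with a short Python sketch. It assumes that the linear filters G1 and G2 are causal FIR filters given by their tap coefficients; the filter type, the tap values, and the function names are hypothetical, since equation (17) is not reproduced in this text.

```python
def fir_filter(signal, taps):
    """Causal FIR filter: out[t] = sum_k taps[k] * signal[t - k], zero initial state."""
    return [sum(taps[k] * signal[t - k] for k in range(len(taps)) if t - k >= 0)
            for t in range(len(signal))]


def time_averaged_outer(Y):
    """Time average of y(t) y(t)^T for a signal stored as Y[channel][t]."""
    n, T = len(Y), len(Y[0])
    return [[sum(Y[i][t] * Y[j][t] for t in range(T)) / T for j in range(n)]
            for i in range(n)]


def filtered_covariances(Y, g1, g2):
    """Return (Psi1, Psi2): time averages of y1 y1^T and y2 y2^T after filtering."""
    Y1 = [fir_filter(channel, g1) for channel in Y]  # y1(t) = G1 applied to y(t)
    Y2 = [fir_filter(channel, g2) for channel in Y]  # y2(t) = G2 applied to y(t)
    return time_averaged_outer(Y1), time_averaged_outer(Y2)
```

For example, `g1 = [1.0]` leaves the signal unchanged and `g2 = [0.0, 1.0]` delays it by one sample, so Ψ1 and Ψ2 then capture the signal's second-order statistics through two different filters.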

【００７９】[0079]

【数１７】上記のΨ１、Ψ２の初期値、及びｙ１、ｙ２とから、
（１８）式に示す微分方程式を用いて新たにΨ１、Ψ２
を求める。[Equation 17] From the above initial values of Ψ1 and Ψ2 and from y1 and y2, new values of Ψ1 and Ψ2 are obtained using the differential equation shown in equation (18).

【００８０】[0080]

【数１８】Ｃｔ、Ψ１、Ψ２とから、（１９）式によって、新たな
Ｃｔ＋１を求める。[Equation 18] From Ct, Ψ1, Ψ2, a new Ct + 1 is obtained by the equation (19).

【００８１】[0081]

【数１９】 [Formula 19]

【００８２】このＣｔ＋１とデータＦ（ｔ）とから、前
記の（１６）式によって新たなｙ（ｔ）が求められる。
そして、このＣｔの変化、すなわちｙ(ｔ)の変化が十分
に小さくなり、収束したと考えられる時点におけるｙ
（ｔ）が、混在音声生データｘ（ｔ）から求められた信
号源ｓ（ｔ）の推定信号となる。まだ収束していない場
合は、（１７）式によりｙ１（ｔ）、ｙ２（ｔ）を求
め、以上のステップを繰り返す。From this Ct+1 and the data F(t), a new y(t) is obtained by equation (16) above. Then y(t) at the point when the change in Ct, that is, the change in y(t), has become sufficiently small and is considered to have converged is the estimated signal of the signal source s(t) obtained from the raw mixed voice data x(t). If it has not yet converged, y1(t) and y2(t) are obtained from equation (17), and the above steps are repeated.

【００８３】図５に戻って、ここでは３回目の分離サイ
クルの出力データｙ１´、ｙ２´が充分な分離性を有し
ていると評価器Ｅにて判断された。すなわち、ｙ１´、
ｙ２´がそれぞれ音源Ｓ１、Ｓ２のどちらかの音声に相
当すると思われる。これらのデータのデジタル化波形図
を図１１及び図１２に示す。振幅が一定以下の点は発話
でなくノイズとみなすことによって解析すると、ｙ１´
には「あ」（〜）、及び「か」（〜）の音声デ
ータが見られる。同様にｙ２´には「し」（〜）の
音声データが見られる。Returning to FIG. 5, here the evaluator E judged that the output data y1′ and y2′ of the third separation cycle have sufficient separability. That is, y1′ and y2′ are each considered to correspond to the voice of one of the sound sources S1 and S2. Digitized waveform diagrams of these data are shown in FIGS. 11 and 12. When points whose amplitude is below a certain level are treated as noise rather than speech, y1′ is seen to contain the voice data of "a" (〜) and "ka" (〜). Similarly, y2′ is seen to contain the voice data of "shi" (〜).

【００８４】図１３は、ｙ１´とｙ２´の大きさをそれ
ぞれ横軸、縦軸にプロットした散布図である。この図か
ら分かるように、、、、、、の点はいずれ
もｙ２´の値がほぼゼロであり、逆に、、の各点
はｙ１´の値がほぼゼロであり、２つの独立した音源か
らの音声にきっちりと分離されたことが分かる。FIG. 13 is a scatter diagram in which the magnitudes of y1′ and y2′ are plotted on the horizontal and vertical axes, respectively. As can be seen from this figure, the points of one group all have y2′ values of almost zero and, conversely, the points of the other group have y1′ values of almost zero, which shows that the data were cleanly separated into the voices of two independent sound sources.

【００８５】なお、評価器Ｅにおいて、処理Ｗ２を実行
した後のデータの分離性を評価するには、図１３のグラ
フにおけるのような点を調べればよい。つまり、散布
図の中でもっとも横軸または縦軸から乖離している点を
選び、その軸までの距離が一定値以上であれば、いまだ
分離性が不十分とし、もう一度図４、図５のような分離
サイクルを実行するのである。To evaluate in the evaluator E the separability of the data after process W2 has been executed, it suffices to examine points in the graph of FIG. 13. That is, the point in the scatter diagram that deviates most from the horizontal or vertical axis is selected, and if its distance to that axis is a certain value or more, the separability is judged to be still insufficient and a separation cycle such as those of FIGS. 4 and 5 is executed once more.
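The separability test performed by the evaluator E can be sketched in Python as follows. This is one possible reading, stated as an assumption: each scatter point (y1′, y2′) is measured by its distance to the nearer of the two axes, and the largest such distance is compared with a threshold. The function names and the threshold value are hypothetical.

```python
def worst_axis_distance(y1, y2):
    """Largest distance from any scatter point (y1[t], y2[t]) to its nearest axis."""
    return max(min(abs(a), abs(b)) for a, b in zip(y1, y2))


def is_separated(y1, y2, threshold):
    """True when every point lies closer than `threshold` to one of the axes."""
    return worst_axis_distance(y1, y2) < threshold
```

When `is_separated` returns False, another cycle of decorrelation and independent component separation would be run on the data.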

【００８６】＝＝分離音声データの発言者特定＝＝次に、本発明の後半部分である、分離された各音声デー
タの発言者を特定するステップについて説明する。図１
４は、上記音声分離ステップで得られた分離データｙ１
´の波形図と、そのフーリエ変換によるスペクトル分布
図である。ここで、スペクトル分布の求め方としては、
フィルタバンク、またはＬＰＣ法などが使用できる。== Identification of the Speaker of Separated Voice Data == Next, the latter half of the present invention, the step of identifying the speaker of each separated voice data, will be described. FIG. 14 shows a waveform diagram of the separated data y1′ obtained in the voice separation step above and a spectrum distribution diagram obtained by its Fourier transform. Here, a filter bank, the LPC method, or the like can be used to obtain the spectrum distribution.

【００８７】同様に図１５は、分離データｙ２´の波形
図と、そのフーリエ変換によるスペクトル分布図であ
る。この実施例では発話者として２人（ＡさんとＢさん
とする）を想定しているので、この２つの波形データｙ
１´、ｙ２´に分離されたが、この時点ではどちらがＡ
さんの音声で、どちらがＢさんの音声であるかはわかっ
ていない。それをこれから特定する。Similarly, FIG. 15 shows a waveform diagram of the separated data y2′ and a spectrum distribution diagram obtained by its Fourier transform. Since this embodiment assumes two speakers (Mr. A and Mr. B), the data were separated into these two waveform data y1′ and y2′, but at this point it is not known which is Mr. A's voice and which is Mr. B's. That is what is identified next.

【００８８】まず、発言者を特定するための第１の方法
として、ホルマント周波数を発言者特定パラメータとし
て利用する方法を実行する。図１４におけるｆｏ１とｆ
ｏ２が、ｙ１´データの第１ホルマント周波数と第２ホ
ルマント周波数であり、図１５におけるｇｏ１、ｇｏ２
が、ｙ２´データの第１ホルマント周波数と第２ホルマ
ント周波数である。あらかじめ、会議参加者ＡさんとＢ
さんの第１及び第２ホルマント周波数データを、発言者
特定のための特定パラメータとしてデータベースに準備
しておく。そして上記の分離データｙ１´、ｙ２´のホ
ルマント周波数と照会することによって各分離音声デー
タの発言者を特定するのである。First, as a first method for identifying the speaker, a method that uses formant frequencies as the speaker identification parameter is executed. In FIG. 14, fo1 and fo2 are the first and second formant frequencies of the y1′ data, and in FIG. 15, go1 and go2 are the first and second formant frequencies of the y2′ data. The first and second formant frequency data of conference participants Mr. A and Mr. B are prepared in advance in a database as identification parameters for speaker identification. The speaker of each separated voice data is then identified by checking the formant frequencies of the separated data y1′ and y2′ against them.

【００８９】図１６は、特定パラメータであるＡさんと
Ｂさんの５母音全てのホルマント周波数と、得られた分
離音声データであるｙ１´及びｙ２´の第１及び第２ホ
ルマント周波数をマッチングする処理の概念図である。
横軸は第１ホルマント周波数、縦軸は第２ホルマント周
波数である。まず、Ａさんの母音の発音のホルマント周
波数の広がり領域（図の実線で囲んだ領域）、及びＢさ
んの母音の発音のホルマント周波数の広がり領域（図の
点線で囲んだ領域）を示し、その上に、図１４及び図１
５の分離音声データのホルマント周波数をプロットして
いる。FIG. 16 is a conceptual diagram of the process of matching the formant frequencies of all five vowels of Mr. A and Mr. B, which are the identification parameters, against the first and second formant frequencies of the obtained separated voice data y1′ and y2′. The horizontal axis represents the first formant frequency, and the vertical axis represents the second formant frequency. The figure first shows the spread region of the formant frequencies of Mr. A's vowel pronunciation (the region enclosed by the solid line) and that of Mr. B's vowel pronunciation (the region enclosed by the dotted line), and the formant frequencies of the separated voice data of FIGS. 14 and 15 are plotted on top of them.

【００９０】ｙ１´及びｙ２´のホルマント周波数が、
ＡさんまたはＢさんのホルマント周波数領域内に収まれ
ば、これをもって発言者が特定できたとすることができ
る。しかし、ＡさんとＢさんのいずれのホルマント周波
数領域にも納まらない場合（図１６のＣ部分）、また
は、ＡさんとＢさんの領域に重なり部分Ｄに収まってし
まう場合は、この第１の方法では発言者を特定すること
ができないため、以下に説明する第２の特定方法、また
は第３の特定方法を実行する。If the formant frequencies of y1′ and y2′ fall within the formant frequency region of Mr. A or Mr. B, the speaker can be regarded as identified. However, if they fall within neither Mr. A's nor Mr. B's formant frequency region (portion C in FIG. 16), or if they fall within portion D where the regions of Mr. A and Mr. B overlap, the speaker cannot be identified by this first method, so the second or third identification method described below is executed.
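The first identification method, including the two undecidable cases C and D, can be sketched in Python. For illustration only, each speaker's region in the (first formant, second formant) plane is assumed to be an axis-aligned rectangle; the region shape, the frequency values in the usage note, and the function names are hypothetical.

```python
def in_region(point, region):
    """True when (f1, f2) lies inside the rectangle ((f1_lo, f1_hi), (f2_lo, f2_hi))."""
    f1, f2 = point
    (f1_lo, f1_hi), (f2_lo, f2_hi) = region
    return f1_lo <= f1 <= f1_hi and f2_lo <= f2 <= f2_hi


def identify_by_formants(point, regions):
    """Return the one speaker whose region contains the point, otherwise None.

    None covers both failure cases: the point lies in no region (case C)
    or in the overlap of several regions (case D).
    """
    hits = [name for name, region in regions.items() if in_region(point, region)]
    return hits[0] if len(hits) == 1 else None
```

With `regions = {"A": ((600, 900), (1000, 1400)), "B": ((800, 1100), (1300, 1700))}`, the point `(650, 1100)` is attributed to A, while `(850, 1350)` lies in the overlap and yields `None`, signalling that the second or third method is needed.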

【００９１】第２の発言者特定方法は、複数時点のホル
マント周波数を発言者特定パラメータとして用いる方法
である。図１７は、本発明の前半段階である音声分離ス
テップによって分離されたある音声データ（「あ」の音
声）を、ｎ個のサンプリング時刻に分けてフーリエ変換
し、スペクトル分解したことを示す図である。それぞれ
に対して第１及び第２ピークである第１ホルマント周波
数（ｆ１^１、ｆ１^２、・・・ｆ１^ｎ）及び第２ホルマ
ント周波数（ｆ２^１、ｆ２^２、・・・ｆ２ ^ｎ）を求
める。The second speaker identification method uses formant frequencies at a plurality of time points as the speaker identification parameter. FIG. 17 shows certain voice data (the sound "a") separated by the voice separation step, which is the first half of the present invention, divided into n sampling times, Fourier transformed, and decomposed into spectra. For each of them, the first and second peaks, namely the first formant frequencies (f1^1, f1^2, ..., f1^n) and the second formant frequencies (f2^1, f2^2, ..., f2^n), are obtained.

【００９２】次に、これらのホルマント周波数データに
対して主成分分析を実行し、主成分得点Ｚ１、Ｚ２、・
・・Ｚｎを求め、これを発言者の音声の特徴量として用
いる。従って、あらかじめデータベースに準備しておく
発言者特定パラメータとしては、会議参加者の様々な音
声（全母音など）の主成分得点Ｚ１、Ｚ２、・・・を準
備しておく。Next, principal component analysis is performed on these formant frequency data to obtain principal component scores Z1, Z2, ..., Zn, which are used as the feature quantities of the speaker's voice. Accordingly, the principal component scores Z1, Z2, ... of various sounds (all vowels and so on) of the conference participants are prepared in advance in the database as the speaker identification parameters.
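How principal component scores might be computed from formant data can be sketched in pure Python. The text does not say how the formant data are arranged or how the analysis is carried out, so this sketch assumes that each row is one feature vector of formant frequencies and uses power iteration with deflation on the covariance matrix. The function names and parameters are hypothetical.

```python
import math


def pca(rows, n_components=2, iters=200):
    """Return (mean, components) using power iteration with deflation."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(x[i] * x[j] for x in X) / (n - 1) for j in range(d)] for i in range(d)]
    components = []
    for _ in range(n_components):
        v = [1.0 / math.sqrt(d)] * d  # fixed start vector (assumed not orthogonal to the axis sought)
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:
                break
            v = [c / norm for c in w]
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        components.append(v)
        # Deflation: remove the component just found before searching for the next.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return mean, components


def scores(x, mean, components):
    """Principal component scores Z1, Z2, ... of one feature vector x."""
    return [sum(c[j] * (x[j] - mean[j]) for j in range(len(x))) for c in components]
```

In this reading, `pca` would be run on the enrolment data in the database, and `scores` would then map the formant frequencies of a separated voice onto the same score axes.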

【００９３】図１８は、第２の発言者特定方法による結
果を示すグラフである。図１８（ａ）は、比較のために
掲げた第１の発言者特定方法による結果である。図１８
（ａ）では、「あ」の音に対して５つのサンプリング時
刻における第１及び第２ホルマント周波数をプロットし
ているが、Ａさんの領域、Ｂさんの領域のどちらに属す
るかいずれとも言えない。FIG. 18 is a graph showing the result of the second speaker identification method. FIG. 18(a) shows, for comparison, the result obtained by the first speaker identification method. In FIG. 18(a), the first and second formant frequencies at five sampling times are plotted for the sound "a", but it cannot be said whether they belong to Mr. A's region or Mr. B's region.

【００９４】これに対して、図１８（ｂ）は、第２の方
法による、第１及び第２主成分得点Ｚ１、Ｚ２を２次元
の座標軸とした分布図である。まず、この図の例のよう
に、Ａさんの領域（図の実線）とＢさんの領域（図の点
線）がこの主成分得点平面では明確に離れていることが
多いので、判定が容易である。図１７の結果から求めた
分離データの主成分得点（Ｚ１、Ｚ２）をプロットする
と明らかにＡさんの領域に近いので、この場合の「あ」
はＡさんの発音であることがわかる。In contrast, FIG. 18(b) is a distribution diagram produced by the second method, with the first and second principal component scores Z1 and Z2 as the two-dimensional coordinate axes. As in the example of this figure, Mr. A's region (solid line) and Mr. B's region (dotted line) are often clearly separated on this principal component score plane, so the judgment is easy. When the principal component scores (Z1, Z2) of the separated data obtained from the results of FIG. 17 are plotted, they are clearly close to Mr. A's region, so the "a" in this case is found to be Mr. A's pronunciation.
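The judgment "close to Mr. A's region" on the score plane can be sketched in Python. As a simplification, each speaker's region is represented here by a single centroid in the (Z1, Z2) plane and the nearest centroid wins; this representation, the function name, and the numbers in the usage note are assumptions, not something the text specifies.

```python
import math


def nearest_speaker(score, centroids):
    """Return the speaker whose (Z1, Z2) centroid is closest to `score`."""
    return min(centroids, key=lambda name: math.dist(score, centroids[name]))
```

For instance, with `centroids = {"A": (1.0, 0.5), "B": (-1.0, -0.5)}`, a separated voice with scores `(0.8, 0.2)` would be attributed to A.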

【００９５】第３の発言者特定方法は、音声分離データ
から、さらに有声音データのみを分離し、そのホルマン
ト周波数を発言者特定パラメータとして調べるというも
のである。図１９は、分離された音声データｙ１´を有
声音成分Ａ１とそれ以外の成分Ａ２とに分離し、それぞ
れのスペクトル分布をフーリエ変換により求めた図であ
る。有声音とそれ以外のデータに分離するには、本発明
の前半段階の音声分離ステップで用いたＩＣチューナー
と３種類の独立成分分離処理を用いる。The third speaker identification method is to further separate only voiced sound data from the voice separation data and examine the formant frequency as a speaker identification parameter. FIG. 19 is a diagram in which the separated voice data y1 ′ is separated into the voiced sound component A1 and the other component A2, and the respective spectrum distributions are obtained by Fourier transform. In order to separate the voiced sound and the other data, the IC tuner used in the voice separation step in the first half of the present invention and the three types of independent component separation processing are used.

【００９６】図２０は、図１９におけるＡ１とＡ２それ
ぞれの第１及び第２ホルマント周波数ｆ１^（１）、ｆ
２^（１）、及びｆ１^（２）、ｆ２^（２）を、第１及
び第２ホルマント周波数平面にプロットしたグラフであ
る。グラフによると、ｙ１´の有声音成分Ａ１の第１、
第２ホルマント周波数（ｆ１^（１）、ｆ２^（１））
は、Ａさんの「あ」の領域によく収まっているが、有声
音以外の成分Ａ２のホルマント周波数（ｆ１^（２）、
ｆ２^（２））は、ＡさんＢさんいずれの領域にも納ま
っていない。このように、有声音データだけを分離して
そのホルマント周波数を調べれば、有声音以外の音声デ
ータが混在している場合よりも正確に発言者の特定がで
きる。FIG. 20 is a graph in which the first and second formant frequencies of A1 and A2 in FIG. 19, f1^(1), f2^(1) and f1^(2), f2^(2) respectively, are plotted on the plane of the first and second formant frequencies. According to the graph, the first and second formant frequencies (f1^(1), f2^(1)) of the voiced sound component A1 of y1′ fall well within Mr. A's "a" region, whereas the formant frequencies (f1^(2), f2^(2)) of the component A2 other than voiced sound fall within neither Mr. A's nor Mr. B's region. Thus, by separating out only the voiced sound data and examining its formant frequencies, the speaker can be identified more accurately than when voice data other than voiced sound are mixed in.

【００９７】上記第２及び第３の発言者特定方法、すな
わち、複数時点のホルマント周波数を特定パラメータと
して用いる方法、及び有声音データを分離する方法は、
順番を前後してもよい。つまり、第1の発言者特定方
法、すなわち単一時点のホルマント周波数を特定パラメ
ータとして用いる方法で発言者の特定ができなかったと
き、第２の方法を用いてそれでも特定不能なときに第３
の方法を用いる順番で処理してもよいし、第３の方法を
用いてそれで特定不能なときに第２の方法を用いる順番
で処理してもよい。The second and third speaker identification methods, that is, the method that uses formant frequencies at a plurality of time points as the identification parameter and the method that separates the voiced sound data, may be applied in either order. In other words, when the speaker cannot be identified by the first speaker identification method, that is, the method that uses the formant frequency at a single time point as the identification parameter, the second method may be used and then the third method if the speaker still cannot be identified, or the third method may be used and then the second method if the speaker cannot be identified by it.
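The fallback order described above amounts to trying the methods in sequence until one of them succeeds. A minimal Python sketch, assuming each method is a function that returns a speaker name or `None` when it cannot decide (this interface and the function name are hypothetical):

```python
def identify_speaker(data, methods):
    """Try each identification method in order; return the first decisive answer."""
    for method in methods:
        speaker = method(data)
        if speaker is not None:
            return speaker
    return None  # no method could identify the speaker
```

Passing `[first, second, third]` or `[first, third, second]` as `methods` gives the two orderings that the text allows.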

【００９８】以上で、図１におけるタイムセグメント
[１]の混在音声データを分離し、各分離データの発言者
を特定することができた。同様の処理をタイムセグメン
ト[２]、[３]以降についても行えば連続した混在音声デ
ータを全て発言者ごとの音声に分離・特定できる。With the above, the mixed voice data of time segment [1] in FIG. 1 has been separated and the speaker of each separated data identified. If the same processing is performed for time segments [2], [3], and onward, all of the continuous mixed voice data can be separated into, and identified as, the voice of each speaker.

【００９９】＝＝発明の変形例や具体的用途＝＝上記実施形態では、グラフを描く上での便宜上などか
ら、会議の参加者を２人としたが、参加者が３人以上の
場合であっても全く同様に音声を分離し、発言者を特定
することができる。== Variations and Specific Applications of the Invention == In the above embodiment, the number of conference participants was two, partly for convenience in drawing the graphs, but even with three or more participants the voices can be separated and the speakers identified in exactly the same way.

【０１００】また、本発明の他の実施形態として、前記
独立成分分離処理の処理方法に関して、得られた観測デ
ータが複数音声や周囲からのさまざまな音響が非線形に
混合されている場合には、非線形混合モデルに基づく独
立成分分離処理を実行することにより、本実施の形態と
同一の基本的構成に基づいて同様の効果を得ることが可
能である。Further, as another embodiment of the present invention, regarding the processing method of the above-mentioned independent component separation processing, when the obtained observation data is non-linearly mixed with a plurality of voices and various sounds from the surroundings, By executing the independent component separation process based on the non-linear mixture model, it is possible to obtain the same effect based on the same basic configuration as this embodiment.

【０１０１】本発明の具体的用途の１つとして、特定さ
れた発言者と、該発言者の発言とを対応付け、公知の各
種音声認識ソフトウェアを利用して文字データなどに変
換した上で、記録媒体に出力することによる、自動議事
録作成がある。長時間にわたる会議の議事録作成が簡便
であり、かつ発言者の特定が自動的に正確に行われる。One specific application of the present invention is automatic creation of minutes, in which each identified speaker is associated with that speaker's remarks, the remarks are converted into character data or the like using any of various known voice recognition software, and the result is output to a recording medium. Creating the minutes of a long meeting becomes simple, and the speakers are identified automatically and accurately.

【０１０２】その他にも、音質の悪い状況下での携帯電
話通話の発言者特定や、ＣＴＩ（コンピュータ・テレフ
ォニー・インテグレイティッド）における発言者特定、
騒音下の自動車の中でのカーナビや口元にマイクロフォ
ンを設置できない状況でのパソコン等への音声入力及び
発明者特定など、様々な用途への応用が考えられる。さ
らにまた、情報家電、携帯電話やＰＤＡ等の携帯端末、
及び、身につけて携帯可能なウェアラブルコンピュータ
（Wearable Computer)などへの音声入力手段への応用等
も考えられる。In addition, various other applications are conceivable, such as speaker identification for mobile phone calls under poor sound quality, speaker identification in CTI (Computer Telephony Integration), and voice input and speaker identification for car navigation in a noisy automobile or for a personal computer or the like in situations where a microphone cannot be placed near the mouth. Furthermore, application to voice input means for information appliances, mobile terminals such as mobile phones and PDAs, and wearable computers that can be worn and carried is also conceivable.

【０１０３】[0103]

【発明の効果】本発明の複合音声データの音声分離方法
及び発言者特定方法によれば、複数の発言者の音声デー
タが混在する混在音声データの、分離及び発言者特定
を、正確にかつ高速に行うことができる。According to the voice separation method and speaker identification method of the composite voice data of the present invention, the separation and the speaker identification of the mixed voice data in which the voice data of a plurality of speakers are mixed can be accurately and quickly performed. Can be done.

【０１０４】このような本発明は、音声データ入力と同
時進行的かつ自動的な、会議議事録作成、及び、実環境
下でのさまざまな音声入力インターフェースなどに応用
することができる。The present invention can thus be applied to automatic creation of conference minutes that proceeds simultaneously with voice data input, and to various voice input interfaces in real environments.

[Brief description of drawings]

【図１】マイク１から入力された音声データ（生デー
タ）Ｘの波形を示す図である。FIG. 1 is a diagram showing a waveform of audio data (raw data) X input from a microphone 1.

【図２】音声分離処理のサイクルを示す図である。FIG. 2 is a diagram showing a cycle of voice separation processing.

【図３】１回目の音声分離サイクルを示すフロー図で
ある。FIG. 3 is a flowchart showing a first voice separation cycle.

【図４】２回目の音声分離サイクルを示すフロー図で
ある。FIG. 4 is a flowchart showing a second voice separation cycle.

【図５】３回目の音声分離サイクルを示すフロー図で
ある。FIG. 5 is a flowchart showing a third voice separation cycle.

【図６】無相関化処理Ｗ１の一例のフローチャートで
ある。FIG. 6 is a flowchart of an example of decorrelation processing W1.

【図７】ｘ１のデジタル化波形図データのグラフであ
る。FIG. 7 is a graph of x1 digitized waveform diagram data.

【図８】ｘ２のデジタル化波形図データのグラフであ
る。FIG. 8 is a graph of x2 digitized waveform diagram data.

【図９】ｘ１、ｘ２データを、横軸をｘ１の強さ、縦
軸をｘ２の強さとした散布図のグラフである。FIG. 9 is a scatter graph of x1 and x2 data, where the horizontal axis represents the intensity of x1 and the vertical axis represents the intensity of x2.

【図１０】互いに相関性を有しないデータｆ１、ｆ２
の散布図のグラフである。FIG. 10 is a scatter diagram of data f1 and f2, which have no correlation with each other.

【図１１】ｙ１´のデジタル化波形図データのグラフ
である。FIG. 11 is a graph of digitized waveform diagram data of y1 ′.

【図１２】ｙ２´のデジタル化波形図データのグラフ
である。FIG. 12 is a graph of y2 ′ digitized waveform diagram data.

【図１３】ｙ１´とｙ２´の大きさをそれぞれ横軸、
縦軸にプロットした散布図である。FIG. 13 is a scatter diagram in which the magnitudes of y1′ and y2′ are plotted on the horizontal and vertical axes, respectively.

【図１４】音声分離ステップで得られた分離データｙ
１´の波形図と、そのフーリエ変換によるスペクトル分
布図である。FIG. 14 shows a waveform diagram of the separated data y1′ obtained in the voice separation step and a spectrum distribution diagram obtained by its Fourier transform.

【図１５】音声分離ステップで得られた分離データｙ
２´の波形図と、そのフーリエ変換によるスペクトル分
布図である。FIG. 15 shows a waveform diagram of the separated data y2′ obtained in the voice separation step and a spectrum distribution diagram obtained by its Fourier transform.

【図１６】特定パラメータとしてのホルマント周波数
と、分離音声データのホルマント周波数とのマッチング
処理の概念図である。FIG. 16 is a conceptual diagram of matching processing between a formant frequency as a specific parameter and a formant frequency of separated audio data.

【図１７】分離されたある音声データを、ｎ個のサン
プリング時刻に分けたスペクトル分布図である。FIG. 17 is a spectrum distribution diagram in which certain separated audio data is divided into n sampling times.

【図１８】（ａ）は、比較のために掲げた第１の発言
者特定方法によるホルマント周波数によるマッチング処
理の分布図である。（ｂ）は第２の発言者特定方法によ
る、第１及び第２主成分得点Ｚ１、Ｚ２を２次元の座標
軸とした分布図である。FIG. 18(a) is a distribution diagram of the matching process using formant frequencies according to the first speaker identification method, provided for comparison. FIG. 18(b) is a distribution diagram according to the second speaker identification method, with the first and second principal component scores Z1 and Z2 as the two-dimensional coordinate axes.

【図１９】分離データｙ１´を有声音成分Ａ１とそれ
以外の成分Ａ２とに分離したそれぞれのスペクトル分布
図である。FIG. 19 is a spectrum distribution diagram in which the separated data y1 ′ is separated into a voiced sound component A1 and other components A2.

【図２０】図１９におけるＡ１とＡ２それぞれの第１
及び第２ホルマント周波数ｆ１^（１）、ｆ２^（１）、
及びｆ１^（２）、ｆ２^（２）を、第１及び第２ホルマ
ント周波数平面にプロットしたグラフである。FIG. 20 is a graph in which the first and second formant frequencies of A1 and A2 in FIG. 19, f1^(1), f2^(1) and f1^(2), f2^(2) respectively, are plotted on the plane of the first and second formant frequencies.

【図２１】次数ｍとその累積寄与率との関係を示すグ
ラフである。FIG. 21 is a graph showing the relationship between the degree m and its cumulative contribution rate.

【図２２】次数ｍと累積寄与率の変化量との関係を示
すグラフである。FIG. 22 is a graph showing the relationship between the degree m and the amount of change in the cumulative contribution rate.

【図２３】システムに応じた方法で次数ｍを決定する
手順を示すフローチャートである。FIG. 23 is a flowchart showing a procedure for determining the order m by a method according to the system.

─────────────────────────────────────────────────────

【手続補正書】[Procedure amendment]

【提出日】平成１４年１月２５日（２００２．１．２
５）[Submission date] January 25, 2002 (2002.1.25)

【手続補正１】[Procedure Amendment 1]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】全文[Correction target item name] Full text

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【書類名】明細書[Document name] Specification

【発明の名称】複合音声データの音声分離方法、発言者
特定方法、複合音声データの音声分離装置、発言者特定
装置、コンピュータプログラム、及び、記録媒体Title: Speech separation method for composite voice data, speaker identification method, voice separation apparatus for composite voice data, speaker identification apparatus, computer program, and recording medium

【特許請求の範囲】[Claims]

【発明の詳細な説明】Detailed Description of the Invention

【０００１】[0001]

【０００２】[0002]

【００１２】[0012]

【００１３】[0013]

【課題を解決するための手段】上記の課題を解決するた
めに、本出願に係る第１の発明は、複数発言者の音声デ
ータが混在している混在音声データを、発言者毎の音声
データに分離する音声データ分離方法において、（１）
前記混在音声データを互いに無相関化するための無相関
化処理を行うステップと、（２）前記無相関化処理の行
われたデータを独立成分に分離するための独立成分分離
処理を行うステップとを有し、前記独立成分分離の行わ
れたデータの分離性が不十分な場合には、分離性が十分
になるまで、前記独立成分分離処理の行われたデータに
ついて、前記無相関化処理及び前記独立成分分離処理を
繰り返し行うことを特徴とする音声分離方法である。す
ることを特徴とする音声分離方法である。このような第
１の発明によれば、入力される混在音声データ（生デー
タ）に含まれる各音声データの相関性、及び独立性の両
性質をともに考慮し、複数の音声データや混入する雑音
などの有する相関性や独立性が、時間的・空間的に変動
する場合でも、発言者毎の音声に正確に分離することが
できる。さらに加えて、このような第１の発明によれ
ば、混在音声データを音源毎の音声データに充分に分離
させることができる。To solve the above problems, the first invention of the present application is a voice data separation method for separating mixed voice data, in which the voice data of a plurality of speakers are mixed, into voice data for each speaker, the method comprising (1) a step of performing decorrelation processing to decorrelate the mixed voice data from one another, and (2) a step of performing independent component separation processing to separate the decorrelated data into independent components, wherein, when the separability of the data subjected to the independent component separation is insufficient, the decorrelation processing and the independent component separation processing are repeatedly performed on the data subjected to the independent component separation processing until the separability becomes sufficient. According to this first invention, both the correlation and the independence of the voice data contained in the input mixed voice data (raw data) are taken into account, so that even when the correlation or independence of the plurality of voice data and of mixed-in noise varies temporally or spatially, the data can be accurately separated into the voice of each speaker. In addition, according to this first invention, the mixed voice data can be sufficiently separated into voice data for each sound source.

【００１４】また、本出願に係る第２の発明は、第１の
発明である音声分離方法において、前記独立成分分離処
理として、非ガウス性のデータを独立成分に分離するた
めの非ガウス性独立成分分離処理と、非定常性のデータ
を独立成分に分離するための非定常性独立成分分離処理
と、有色性のデータを独立成分に分離するための有色性
独立成分分離処理とを準備し、データの性質により、前
記非ガウス性独立成分分離処理、前記非定常性独立成分
分離処理、及び、前記有色性独立成分分離処理のうちの
いずれかの処理を行うことを特徴とする音声分離方法で
ある。このような第２の発明によれば、無相関化処理の
行われたデータの性質に応じて最適な独立成分分離処理
を行うことができるから、混在音声データを音源毎の音
声データにより効果的に分離させることができる。The second invention of the present application is the voice separation method of the first invention, wherein a non-Gaussian independent component separation process for separating non-Gaussian data into independent components, a non-stationary independent component separation process for separating non-stationary data into independent components, and a chromatic independent component separation process for separating chromatic (colored) data into independent components are prepared as the independent component separation processing, and one of these three processes is performed depending on the nature of the data. According to this second invention, the optimum independent component separation process can be performed according to the nature of the decorrelated data, so the mixed voice data can be separated more effectively into voice data for each sound source.
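The choice among the three processes can be sketched in Python as a simple decision chain. It follows the order used in the embodiment (non-Gaussianity first, then non-stationarity, then chromaticity) and assumes that the three properties have already been evaluated as booleans; the function name, the return labels, and the `None` result for data that are Gaussian, stationary, and white are hypothetical.

```python
def select_separation_process(is_gaussian, is_stationary, is_colored):
    """Pick which independent component separation process W2 to run."""
    if not is_gaussian:
        return "alpha"  # W2(alpha): non-Gaussianity-based process
    if not is_stationary:
        return "beta"   # W2(beta): non-stationarity-based process
    if is_colored:
        return "gamma"  # W2(gamma): chromaticity-based process
    return None         # none of the three properties is available
```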

【００１５】また、本出願に係る第３の発明は、第２の
発明である音声分離方法において、最初に行われる独立
成分分離処理は、非ガウス性のデータを独立成分に分離
するための非ガウス性独立成分分離処理であることを特
徴とする音声分離方法である。非ガウス性独立成分分離
処理は他の独立成分分処理方法に比べてその前処理とし
ての無相関化処理の影響を受けやすいから、このような
第３の発明によれば、最初に非ガウス性独立成分分離処
理を行うことにより、無相関化処理がうまく実行された
かどうかを、該無相関化処理に引き続く非ガウス性独立
成分分離処理によって効果的に評価することが可能とな
る。The third invention of the present application is the voice separation method of the second invention, wherein the independent component separation processing performed first is the non-Gaussian independent component separation process for separating non-Gaussian data into independent components. Because the non-Gaussian independent component separation process is more susceptible than the other independent component separation methods to the decorrelation processing that precedes it, performing the non-Gaussian independent component separation process first, as in this third invention, makes it possible to evaluate effectively, through the non-Gaussian independent component separation process that follows the decorrelation processing, whether the decorrelation processing was executed successfully.

【００１６】また、本出願に係る第４の発明は、第１乃
至第３の発明である音声分離方法において、前記無相関
化処理は、少なくとも主成分分析及び因子分析を行うこ
とを特徴とする音声分離方法である。このような第４の
発明によれば、各主成分の寄与率を求めて累積寄与率が
所定のしきい値を越えるところの成分数を次数とするこ
となどにより、採用する主成分データの数（次数）を決
定した上で、効果的に無相関化処理を行うことが可能と
なる。Further, a fourth invention according to the present application is characterized in that, in the speech separation method according to the first to third inventions, the decorrelation processing performs at least a principal component analysis and a factor analysis. This is a voice separation method. According to the fourth invention, the number of principal component data to be adopted is obtained by obtaining the contribution ratio of each principal component and setting the number of components where the cumulative contribution ratio exceeds a predetermined threshold as the order. It is possible to effectively perform the decorrelation process after determining the (order).
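The rule of taking as the order the number of components at which the cumulative contribution ratio exceeds a threshold can be sketched in Python. It assumes the contribution ratio of a principal component is its eigenvalue divided by the sum of all eigenvalues, which is the usual definition; the function name and the default threshold are hypothetical.

```python
def choose_order(eigenvalues, threshold=0.9):
    """Smallest m whose cumulative contribution ratio exceeds `threshold`."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for m, lam in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += lam / total  # contribution ratio of the m-th component
        if cumulative > threshold:
            return m
    return len(eigenvalues)  # threshold never exceeded: keep every component
```

With eigenvalues `[5.0, 3.0, 1.5, 0.5]` the cumulative ratios are 0.5, 0.8, 0.95, and 1.0, so a threshold of 0.9 selects an order of 3.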

【００１７】また、本出願に係る第５の発明は、複数発
言者の音声データが混在している混在音声データを、発
言者毎の音声データに分離し、該発言者毎の音声データ
につき発言者を特定する発言者特定方法において、
（１）第１乃第４のいずれかの発明の音声分離方法によ
り、複数発言者の音声データが混在している混在音声デ
ータを、発言者毎の音声データに分離するステップと、
（２）発言者毎に該発言者を特定するための特定パラメ
ータを準備するステップと、（３）分離された前記発言
者毎の音声データにつき、前記特定パラメータを参照し
て、発言者を特定するステップとを有することを特徴と
する発言者特定方法である。このような第５の発明によ
れば、例えば、会議の録音データなどに記録された、複
数発言者の音声や雑音などが含まれたの混在音声データ
を音源ごとに分離し、各分離されたの音声データの発言
者を特定することによって、例えば、自動的に会議記録
データの作成などを行うことができる。The fifth invention of the present application is a speaker identification method for separating mixed voice data, in which the voice data of a plurality of speakers are mixed, into voice data for each speaker and identifying the speaker of each such voice data, the method comprising (1) a step of separating the mixed voice data into voice data for each speaker by the voice separation method of any one of the first to fourth inventions, (2) a step of preparing, for each speaker, an identification parameter for identifying that speaker, and (3) a step of identifying the speaker of each separated voice data by referring to the identification parameter. According to this fifth invention, mixed voice data containing the voices of a plurality of speakers, noise, and the like, recorded for example in the recording of a conference, is separated by sound source and the speaker of each separated voice data is identified, so that, for example, conference record data can be created automatically.

[0018] The sixth invention according to the present application is the speaker identification method according to the fifth invention, characterized in that the identification parameter is a formant frequency measured when the speaker pronounces a vowel, and in that a formant frequency is obtained from each piece of separated voice data and compared with the formant frequencies held as identification parameters to identify the speaker. According to the sixth invention, the speaker of each piece of separated voice data can be identified easily by using the formant frequency, a feature that can be extracted by simple processing such as a Fourier transform.

[0019] The seventh invention according to the present application is the speaker identification method according to the sixth invention, characterized in that the identification parameters are the first formant frequency and the second formant frequency measured when the speaker pronounces a vowel, and in that the first and second formant frequencies are obtained from each piece of separated voice data and compared with the first and second formant frequencies held as identification parameters to identify the speaker. According to the seventh invention, using two formant frequencies, namely the first and second spectral peaks, allows the speaker to be identified easily and more accurately.
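The comparison of measured first and second formant frequencies against stored identification parameters can be sketched as a nearest-neighbor lookup. This is only an illustration: the reference table, its values, the Euclidean distance, and the `max_distance` cutoff are all assumptions, since the text does not specify how the comparison is made.

```python
import math

# Hypothetical reference table: speaker name -> (F1, F2) in Hz for one vowel.
# The values are illustrative placeholders, not measurements from the patent.
REFERENCE_FORMANTS = {
    "speaker_A": (730.0, 1090.0),
    "speaker_B": (850.0, 1220.0),
}


def identify_speaker(f1, f2, references=REFERENCE_FORMANTS, max_distance=100.0):
    """Return the closest registered speaker, or None if none is within max_distance Hz."""
    best_name, best_dist = None, math.inf
    for name, (ref_f1, ref_f2) in references.items():
        # Assumed metric: Euclidean distance in the (F1, F2) plane.
        dist = math.hypot(f1 - ref_f1, f2 - ref_f2)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None
```

Returning `None` when no stored speaker is close enough leaves room for a caller to apply a further identification step.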

[0020] The eighth invention according to the present application is the speaker identification method according to any one of the fifth to seventh inventions, characterized in that, when the speaker cannot be identified in the step of identifying the speaker of each piece of separated voice data by referring to the identification parameters, formant frequencies at a plurality of time points are obtained from the voice data and compared with formant frequencies at a plurality of time points held as identification parameters to identify the speaker. According to the eighth invention, the speaker can be identified more accurately by also taking into account the temporal variation of the formant frequency, which is the feature used to identify who uttered a given voice.

[0021] The ninth invention according to the present application is a minutes creation method for creating minutes from mixed voice data in which the voice data of a plurality of speakers are mixed, the method comprising: a step of identifying the speaker of each piece of separated voice data by the speaker identification method according to any one of the fifth to eighth inventions; and a step of creating the minutes by outputting each identified speaker and that speaker's statements, associated with each other, to a recording medium. According to the ninth invention, speakers are identified automatically and accurately, so the minutes of a long conference can conveniently be created automatically.

[0022] The tenth invention according to the present application is a voice separation device that separates mixed voice data, in which the voice data of a plurality of speakers are mixed, into voice data for each speaker, characterized in that it performs decorrelation processing to decorrelate the mixed voice data, performs independent component separation processing to separate the decorrelated data into independent components, and, when the separability of the data after independent component separation is insufficient, repeats the decorrelation processing and the independent component separation processing on that data until the separability becomes sufficient. According to the tenth invention, both the correlation and the independence of the voice data contained in the input mixed voice data (raw data) are taken into account, so a voice separation device can be realized that separates the data accurately into the voice of each speaker even when the correlation and independence of the voice data and of any mixed-in noise vary in time and space. In addition, according to the tenth invention, a voice separation device can be realized that separates the mixed voice data sufficiently into voice data for each sound source.

[0023] The eleventh invention according to the present application is the voice separation device according to the tenth invention, characterized in that, depending on the nature of the data, it performs as the independent component separation processing one of the following: non-Gaussian independent component separation processing for separating non-Gaussian data into independent components, non-stationary independent component separation processing for separating non-stationary data into independent components, and colored independent component separation processing for separating colored data into independent components. According to the eleventh invention, the most suitable independent component separation processing can be performed according to the nature of the decorrelated data, so a voice separation device can be realized that separates the mixed voice data more effectively into voice data for each sound source.

[0024] The twelfth invention according to the present application is the voice separation device according to the eleventh invention, characterized in that the independent component separation processing performed first is the non-Gaussian independent component separation processing for separating non-Gaussian data into independent components. Non-Gaussian independent component separation is more sensitive than the other independent component separation methods to the decorrelation processing that precedes it. According to the twelfth invention, therefore, performing the non-Gaussian independent component separation first makes it possible to realize a voice separation device in which the non-Gaussian separation that follows the decorrelation processing effectively evaluates whether the decorrelation processing succeeded.

[0025] The thirteenth invention according to the present application is the voice separation device according to any one of the tenth to twelfth inventions, characterized in that the decorrelation processing performs at least principal component analysis and factor analysis. According to the thirteenth invention, a voice separation device can be realized that determines the number of principal components to adopt (the order), for example by obtaining the contribution ratio of each principal component and taking as the order the number of components at which the cumulative contribution ratio exceeds a predetermined threshold, and then performs the decorrelation processing effectively.

[0026] The fourteenth invention according to the present application is a speaker identification device that separates mixed voice data, in which the voice data of a plurality of speakers are mixed, into voice data for each speaker and identifies the speaker of each piece of separated voice data, characterized in that the mixed voice data is separated into voice data for each speaker by the voice separation device according to any one of the tenth to thirteenth inventions, and the speaker of each piece of separated voice data is identified by referring to identification parameters prepared for identifying each speaker. According to the fourteenth invention, a speaker identification device can be realized that separates by sound source mixed voice data that is recorded in, for example, a recording of a conference and that contains the voices of a plurality of speakers, noise, and the like, and identifies the speaker of each piece of separated voice data, so that, for example, conference record data can be created automatically.

[0027] The fifteenth invention according to the present application is the speaker identification device according to the fourteenth invention, characterized in that the identification parameter is a formant frequency measured when the speaker pronounces a vowel, and in that a formant frequency is obtained from each piece of separated voice data and compared with the formant frequencies held as identification parameters to identify the speaker. According to the fifteenth invention, a speaker identification device can be realized that easily identifies the speaker of each piece of separated voice data by using the formant frequency, a feature that can be extracted by simple processing such as a Fourier transform.

[0028] The sixteenth invention according to the present application is the speaker identification device according to the fifteenth invention, characterized in that the identification parameters are the first formant frequency and the second formant frequency measured when the speaker pronounces a vowel, and in that the first and second formant frequencies are obtained from each piece of separated voice data and compared with the first and second formant frequencies held as identification parameters to identify the speaker. According to the sixteenth invention, a speaker identification device can be realized that identifies the speaker easily and more accurately by using two formant frequencies, namely the first and second spectral peaks.

[0029] The seventeenth invention according to the present application is the speaker identification device according to any one of the fourteenth to sixteenth inventions, characterized in that, when the speaker of a piece of separated voice data cannot be identified by referring to the identification parameters, formant frequencies at a plurality of time points are obtained from the voice data and compared with formant frequencies at a plurality of time points held as identification parameters to identify the speaker. According to the seventeenth invention, a speaker identification device can be realized that identifies the speaker more accurately by also taking into account the temporal variation of the formant frequency, which is the feature used to identify who uttered a given voice.

[0030] The eighteenth invention according to the present application is a minutes creation device for creating minutes from mixed voice data in which the voice data of a plurality of speakers are mixed, characterized in that it identifies the speaker of each piece of separated voice data by the speaker identification device according to any one of the fourteenth to seventeenth inventions and creates the minutes by outputting each identified speaker and that speaker's statements, associated with each other, to a recording medium. According to the eighteenth invention, speakers are identified automatically and accurately, so a minutes creation device can be realized that automatically creates the minutes of a long conference.

[0031] A computer program for causing a voice separation device to execute the voice separation method according to any one of the first to fourth inventions can also be realized.

[0032] A computer program for causing a speaker identification device to execute the speaker identification method according to any one of the fifth to eighth inventions can also be realized.

[0033] A computer-readable recording medium on which such a computer program is recorded can also be realized.

【００３４】[0034]

[0035] In this embodiment, the statements made in a conference held by two people are picked up as voice data by two microphones (microphone 1 and microphone 2). FIG. 1 shows the waveform of the voice data (raw data) X input from microphone 1. This mixed voice data may contain not only the voices of a plurality of speakers but also music and noise. The utterances of the two people are referred to as sound sources S1 and S2, respectively.

[0036] FIG. 2 shows the cycle of the voice separation processing. The mixed voice data input from microphone 1 and microphone 2 is first subjected to the decorrelation processing W1. The voice data is segmented as indicated by [1] and [2] in FIG. 1 and passed to W1 one segment at a time. For best efficiency, adjacent segments overlap each other by half a period.
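Segmentation with half-overlapping windows can be sketched as follows. The function name and the reading of "half a period" as half of the segment length are assumptions; the text does not give the segment length.

```python
def split_half_overlap(samples, segment_len):
    """Split samples into segments of segment_len that overlap by half a segment."""
    if segment_len < 2 or segment_len % 2:
        raise ValueError("segment_len must be an even number >= 2")
    hop = segment_len // 2  # each segment starts half a segment after the previous one
    return [
        samples[start:start + segment_len]
        for start in range(0, len(samples) - segment_len + 1, hop)
    ]
```

Trailing samples that do not fill a whole segment are dropped in this sketch; a real system would more likely pad them or carry them into the next buffer.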

[0037] In FIG. 2, the IC tuner, which is the step after the decorrelation processing W1, selects one of three independent component analysis (ICA) methods. The next step, the independent component separation processing W2, performs one of three kinds of processing: separation processing W2(α) based on non-Gaussianity, separation processing W2(β) based on non-stationarity, or separation processing W2(γ) based on coloredness. The evaluator E, the step after W2, evaluates the separability of the data separated by W2. The cycle W1 → IC tuner → W2 → E is repeated until the mixed voice data input from the microphones is sufficiently separated. In the first cycle, however, the non-Gaussianity-based processing W2(α) is always performed as W2; from the second cycle onward, the appropriate one of W2(α), W2(β), and W2(γ) is performed as selected by the IC tuner.
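The control flow of the W1 → IC tuner → W2 → E cycle can be sketched as a loop over caller-supplied stages. All function and parameter names here are hypothetical, and the `max_cycles` cap is an added safeguard that the text does not mention.

```python
def run_separation(segment, decorrelate, choose_method, methods, is_separated, max_cycles=5):
    """Repeat W1 -> IC tuner -> W2 -> E until E accepts the result or max_cycles is reached."""
    if max_cycles < 1:
        raise ValueError("max_cycles must be at least 1")
    data, used = segment, []
    for cycle in range(1, max_cycles + 1):
        data = decorrelate(data)  # W1: decorrelation
        # IC tuner: the first cycle always uses the non-Gaussianity-based method.
        method = "alpha" if cycle == 1 else choose_method(data)
        data = methods[method](data)  # W2: independent component separation
        used.append(method)
        if is_separated(data):  # E: separability evaluation
            break
    return data, used
```

Passing the stages in as callables keeps the sketch independent of any particular decorrelation or ICA implementation.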

[0038] FIG. 3 shows the first voice separation cycle. The mixed voice data x1 and x2 from microphone 1 and microphone 2 for time segment [1] in FIG. 1 are first input to the decorrelation processing W1.

[0039] FIGS. 7 and 8 show the digitized waveform data of x1 and x2, respectively (the vertical axis is sound intensity in millivolts). Plotting the x1 and x2 data at each time point as a scatter diagram, with the intensity of x1 on the horizontal axis and the intensity of x2 on the vertical axis, gives FIG. 9. The scatter diagram shows a roughly linear distribution running from the first quadrant to the third quadrant, indicating that x1 and x2 are correlated with each other. When these raw data x1 and x2 are subjected to the decorrelation processing W1, they are converted into data f1 and f2 that are uncorrelated with each other.

[0040] FIG. 10 shows the scatter diagram of f1 and f2. The horizontal axis of FIG. 10 is the first factor f1 of the factor score F, and the vertical axis is the second factor f2. Whereas the points in FIG. 9 were distributed in a parallelogram skewed relative to the axes, the points in FIG. 10 are distributed in a regular rhombus aligned with the axes, which shows that f1 and f2 are no longer correlated with each other.

[0041] The decorrelation processing is described next. FIG. 6 is a flowchart of one example of the decorrelation processing W1. First, the raw voice data x1 and x2 shown in FIGS. 7 and 8 are standardized by equation (1). Standardization yields data with a mean of 0 and a standard deviation of 1.

【数１】 [Equation 1]

[0042] The correlation matrix C of the raw data x1 and x2 is obtained from equation (2), in which (x1, x2) denotes the inner product of the vectors.

【数２】 [Equation 2]

[0043] The eigenvalues λi and the eigenvectors A of this correlation matrix are obtained from equation (3).

【数３】 [Equation 3]

[0044] The aim here is to obtain mutually uncorrelated factor scores by factor analysis, and an important point is how many factors to adopt, counting from the first factor. Adopting the factors up to the m-th is called the m-dimensional case. Using the eigenvectors A obtained above, the principal components Z are obtained from equation (4).

【数４】 [Equation 4]
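For the two-microphone case, the steps from standardization to principal components can be sketched in plain Python. This is an illustration rather than the patent's equations (1) to (4), which are not reproduced in this text: it assumes population statistics, uses the closed-form eigen decomposition of a 2x2 correlation matrix, and stops before the factor analysis step. All function names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def standardize(x):
    """Scale a sequence to mean 0 and standard deviation 1 (population statistics)."""
    mu, sigma = fmean(x), pstdev(x)
    return [(v - mu) / sigma for v in x]


def correlation(s1, s2):
    """Correlation coefficient of two standardized sequences (normalized inner product)."""
    return sum(a * b for a, b in zip(s1, s2)) / len(s1)


def principal_components(x1, x2):
    """Return (eigenvalues, z1, z2), largest eigenvalue first, for C = [[1, r], [r, 1]]."""
    s1, s2 = standardize(x1), standardize(x2)
    r = correlation(s1, s2)
    # C has eigenvalues 1 + r and 1 - r with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
    k = 1 / math.sqrt(2)
    plus = [k * (a + b) for a, b in zip(s1, s2)]   # component with eigenvalue 1 + r
    minus = [k * (a - b) for a, b in zip(s1, s2)]  # component with eigenvalue 1 - r
    if r >= 0:
        return (1 + r, 1 - r), plus, minus
    return (1 - r, 1 + r), minus, plus
```

The two returned components are uncorrelated by construction, and each has a variance equal to its eigenvalue.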

[0045] Next, factor analysis is performed on the m factors using a defining equation of the form (5). The term e in equation (5) is called the unique (specific) factor.

【数５】 [Equation 5]

[0046] This factor model is expressed as equation (6). The factor loadings bij and the factor scores F in equation (6) are obtained from equations (7) and (8). In the final step of the flowchart of FIG. 6, the raw voice data is thus converted into mutually uncorrelated factor scores (vector F).

【００４７】[0047]

【数６】 [Equation 6]

【数７】 [Equation 7]

【数８】 [Equation 8]

[0048] The main feature of W1 described above is that it combines principal component analysis with factor analysis. The benefit is that principal component analysis yields the contribution ratio of each principal component at the same time, so the order m can be determined, for example, by adopting principal components until the cumulative contribution ratio from the first to the m-th principal component exceeds 80%. Because the raw voice data to be separated varies greatly over time and the degree of correlation caused by mixing changes considerably, how many factors to adopt is an important point in the decorrelation processing.
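Choosing the order from the cumulative contribution ratio can be sketched as follows. The sketch assumes that the contribution ratio of a principal component is its eigenvalue divided by the sum of all eigenvalues, which is the usual definition; the function names are hypothetical.

```python
def cumulative_contribution(eigenvalues):
    """Cumulative contribution ratios, with the eigenvalues sorted largest first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    ratios, running = [], 0.0
    for lam in ordered:
        running += lam
        ratios.append(running / total)
    return ratios


def order_for_threshold(eigenvalues, threshold=0.80):
    """Smallest order m whose cumulative contribution ratio exceeds the threshold."""
    for m, ratio in enumerate(cumulative_contribution(eigenvalues), start=1):
        if ratio > threshold:
            return m
    return len(eigenvalues)  # threshold never exceeded: keep every component
```

With eigenvalues 2.0, 1.0, 0.6, and 0.4, the cumulative ratios are 0.5, 0.75, 0.9, and 1.0, so an 80% threshold gives order 3.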

[0049] When the number of speakers is known in advance, the order m can simply be fixed to that number. When it is unknown, the order m is set, for example, to the number of principal components at which the cumulative contribution ratio exceeds a predetermined threshold. It is preferable to prepare various methods for determining the order m to suit the system and to switch between them as circumstances require (tuning). One example of this tuning is described in detail below.

[0050] FIG. 21 is a flowchart showing a procedure for determining the order m by a method suited to the system. In FIG. 21, RK0 is the initial threshold of the cumulative contribution ratio, M is the maximum order that may be adopted (the upper threshold of the order), and ΔRK is the change in the cumulative contribution ratio. Performing principal component analysis yields a graph such as FIG. 19, which shows the relationship between the order m (meaning that principal components up to the m-th are adopted) and the cumulative contribution ratio. FIG. 19 depicts three example curves, A, B, and C.

[0051] First, as the first processing step, a threshold RK0 (80% in this example) is set for the cumulative contribution ratio RK, and the order m at which RK exceeds RK0 is obtained. If the order is too large, however, the subsequent processing becomes too complicated, so an upper limit M on the order is set in advance. In the example of FIG. 19 with M = 4, the order exceeding RK0 in case A is m = 2, so m = 2 < 4 = M and the order m is determined to be 2. In case B the order exceeding RK0 is 5, so m = 5 > 4 = M and the order m is not yet determined. Likewise, the order m is not determined in case C.

[0052] In such a case, the second step, shown in FIG. 20, is executed: the difference ΔRK in RK for each increase in the order m is examined. In short, the order m at which the change in the cumulative contribution ratio is largest is taken as the order to adopt. In this example, ΔRK is largest at m = 2 in case B and at m = 4 in case C. Here too, the order m is adopted if it is below the upper limit M; if it exceeds M, processing passes to the next step.

[0053] If the order m still exceeds the upper limit M after the second step, the cumulative contribution ratio threshold RK0 is lowered, for example to 60% (= RK1), and the comparison of the first step is repeated. If the order at which the new threshold RK1 is exceeded is M = 4 or less, it is adopted as the order m; if it exceeds M, the threshold is lowered successively to RK2, RK3, ..., RKn by a predetermined decrement. However, a cumulative contribution ratio RK below 50% means that more than half of the information is lost, so the lower limit of RKn is 50%.

[0054] If no order m of M or less is found with RKn at 50% or above, the same processing as the second step is performed again: the order at which ΔRK is largest is obtained and adopted as the order m. This rests on the reasoning that a large change in the cumulative contribution ratio means that more information is retained around that order, so components at least up to that order should be adopted.
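The four-stage decision described in paragraphs [0051] to [0054] can be sketched as follows. Several points are assumptions: the input `cum` is taken to be the cumulative contribution curve indexed by order (as in FIG. 19), ΔRK at order m is taken to be `cum[m-1] - cum[m-2]` for m ≥ 2, an order equal to M is treated as acceptable, and the 20-point decrement merely extends the 80% to 60% example in the text. The function names are hypothetical.

```python
def first_order_above(cum, threshold):
    """Smallest 1-based order m with cum[m - 1] > threshold, or None if there is none."""
    for m, rk in enumerate(cum, start=1):
        if rk > threshold:
            return m
    return None


def order_of_max_increase(cum):
    """Order m >= 2 at which the cumulative ratio rises most from the previous order."""
    return max(range(2, len(cum) + 1), key=lambda m: cum[m - 1] - cum[m - 2])


def decide_order(cum, max_order, rk0=0.80, step=0.20, floor=0.50):
    """Pick the order m following the first step, second step, and fallbacks."""
    if len(cum) < 2 or step <= 0:
        raise ValueError("need at least two orders and a positive step")
    # Step 1: first order whose cumulative ratio exceeds the initial threshold RK0.
    m = first_order_above(cum, rk0)
    if m is not None and m <= max_order:
        return m
    # Step 2: order with the largest increase in the cumulative ratio.
    m_delta = order_of_max_increase(cum)
    if m_delta <= max_order:
        return m_delta
    # Step 3: lower the threshold stepwise, but never below the floor (50%).
    threshold = rk0
    while threshold > floor:
        threshold = max(floor, threshold - step)
        m = first_order_above(cum, threshold)
        if m is not None and m <= max_order:
            return m
    # Step 4: adopt the largest-increase order even though it exceeds max_order.
    return m_delta
```

Each stage returns as soon as it finds an order no larger than `max_order`, so later stages run only when the earlier ones fail.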

[0055] After decorrelation in this way, the data f1 and f2 in FIG. 3 are sent immediately to the independent component separation processing W2. In the first voice separation cycle, the non-Gaussianity-based independent component separation processing W2(α) is executed on these decorrelated data f1 and f2.

[0056] The processing of W1 and W2(α) in FIG. 3 yields separated signals a and b. The evaluator E evaluates their separability (whether they are sufficiently separated), and if the separation is insufficient (*1 in the figure), a second cycle is executed on the data a and b.

[0057] FIG. 4 shows an example of the second cycle. It resembles the first cycle shown in FIG. 3 but adds processing by the IC tuner. Before the independent component separation processing W2 is performed, the IC tuner analyzes the signal characteristics of the data f1' and f2' produced by the second decorrelation processing and selects which of W2(α) (based on non-Gaussianity), W2(β) (based on non-stationarity), and W2(γ) (based on coloredness) to execute as W2. In this example W2(β) is executed. The separability of the data y1 and y2 after W2(β) is evaluated by the evaluator E, and if it is insufficient (*2 in FIG. 4), a third cycle is executed.

[0058] The function of the IC tuner is described next. The IC tuner evaluates the Gaussianity, stationarity, and coloredness of the decorrelated input data as follows and selects the most suitable of the three independent component separation processes.

[0059] First, the IC tuner evaluates the Gaussianity of the two input data. Specifically, for each input, it checks whether the frequency distribution of the input time-series data is of the Gaussian (normal distribution) type or the non-Gaussian type. Let gs denote the input data (its frequency distribution) and g0 the Gaussian function. The absolute value of their difference, |gs − g0|, is integrated over the interval concerned; if the resulting value Δg is larger than a predetermined threshold δg, the data is evaluated as non-Gaussian, and if it is smaller, as Gaussian. If both decorrelated input data are non-Gaussian, the IC tuner selects the process W2(α) based on non-Gaussianity as the independent component separation process W2.
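
As an illustration of the quantity Δg, the sketch below estimates gs with a histogram of the standardized samples and integrates |gs − g0| against the standard normal density. The bin count, the ±4 standard deviation span, and the function name are assumptions made for this example; the description does not specify how the distribution is estimated or what value δg takes.

```python
import math
import random

def gaussianity_gap(samples, bins=30, span=4.0):
    """Approximate the integral of |gs - g0| over [-span, span]."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in samples) / n)
    width = 2.0 * span / bins
    counts = [0] * bins
    for v in samples:
        z = (v - mean) / std                   # standardize to zero mean, unit variance
        k = math.floor((z + span) / width)
        if 0 <= k < bins:                      # samples outside the span are ignored
            counts[k] += 1
    gap = 0.0
    for k, c in enumerate(counts):
        center = -span + (k + 0.5) * width
        gs = c / (n * width)                   # histogram density estimate
        g0 = math.exp(-0.5 * center * center) / math.sqrt(2.0 * math.pi)
        gap += abs(gs - g0) * width
    return gap

# Illustrative data: one Gaussian-like series and one uniform series.
rng = random.Random(0)
gauss_like = [rng.gauss(0.0, 1.0) for _ in range(5000)]
flat = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
```

With a threshold δg chosen between the two resulting values, `gauss_like` would be classed as Gaussian and `flat` as non-Gaussian.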

[0060] If either of the decorrelated input data is evaluated as Gaussian, the IC tuner next evaluates the stationarity of the two input data. For this evaluation, the ensemble average of a plurality of irregular waveforms is taken, and attention is paid to how this ensemble average changes over time. If the ensemble average is constant along the time axis, the data is regarded as "completely stationary". If it fluctuates over time, the probability density distribution within a certain time window is obtained, and the non-stationarity is quantified from its variance, skewness, and kurtosis. Because the strength of non-stationarity is influenced most strongly by the variance, then by the skewness, and then by the kurtosis, it is preferable to apply weights according to that strength before the evaluation. If both decorrelated input data are evaluated as non-stationary, the IC tuner selects the process W2(β) based on non-stationarity as the independent component separation process W2.
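
One possible way to turn the weighted use of variance, skewness, and kurtosis into a number is sketched below. It is an assumption-laden illustration, not the method of the disclosure: the window length, the weights (3, 2, 1), the normalization by the overall variance, and the use of the spread of each statistic across windows are all choices made for this example.

```python
import math
import random

def window_moments(x):
    """Variance, skewness, and excess kurtosis of one (non-constant) window."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x) / n
    sd = math.sqrt(var)
    skew = sum(((v - m) / sd) ** 3 for v in x) / n
    kurt = sum(((v - m) / sd) ** 4 for v in x) / n - 3.0
    return var, skew, kurt

def spread(values):
    """Population standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def nonstationarity_score(x, window=250, weights=(3.0, 2.0, 1.0)):
    """Weighted spread of per-window statistics; larger means less stationary."""
    n = len(x)
    mean = sum(x) / n
    total_var = sum((v - mean) ** 2 for v in x) / n
    stats = [window_moments(x[i:i + window])
             for i in range(0, n - window + 1, window)]
    w_var, w_skew, w_kurt = weights       # variance weighted most, kurtosis least
    return (w_var * spread([s[0] / total_var for s in stats])
            + w_skew * spread([s[1] for s in stats])
            + w_kurt * spread([s[2] for s in stats]))

# Illustrative data: constant-level noise and noise whose level triples halfway.
rng = random.Random(0)
steady = [rng.gauss(0.0, 1.0) for _ in range(4000)]
bursty = ([rng.gauss(0.0, 1.0) for _ in range(2000)]
          + [rng.gauss(0.0, 3.0) for _ in range(2000)])
```

A series whose level changes over time, such as `bursty`, yields a larger score than `steady`; where to place the threshold is left open here, as it is in the description.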

[0061] If either of the decorrelated input data is evaluated as stationary, the IC tuner next evaluates the coloredness of the two input data. To evaluate coloredness, the autocorrelation function of the irregular waveform is obtained. The autocorrelation function is graphed against the magnitude of the time lag τ, and the distance of the graph's center of gravity from the origin (τ = 0) is examined. If the center of gravity deviates from the origin (τ = 0) by a predetermined value or more, the data is evaluated as colored. For white noise, the autocorrelation function has a value only at τ = 0. If both decorrelated input data are evaluated as colored, the IC tuner selects the process W2(γ) based on coloredness as the independent component separation process W2.
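
The center-of-gravity test on the autocorrelation function might be sketched as follows. Several details are assumptions for this example: the autocorrelation is normalized so that its value at τ = 0 is 1, the center of gravity is weighted by the squared autocorrelation (so that estimation noise at large lags contributes little), and the lag range and threshold are arbitrary.

```python
import random

def autocorrelation(x, max_lag):
    """Normalized sample autocorrelation for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    r0 = sum(v * v for v in d)
    return [sum(d[i] * d[i + k] for i in range(n - k)) / r0
            for k in range(max_lag + 1)]

def autocorr_centroid(x, max_lag=30):
    """Center of gravity of the squared autocorrelation along the lag axis."""
    weights = [r * r for r in autocorrelation(x, max_lag)]
    return sum(k * w for k, w in enumerate(weights)) / sum(weights)

def is_colored(x, threshold=1.0, max_lag=30):
    """Colored if the center of gravity lies far enough from lag 0."""
    return autocorr_centroid(x, max_lag) >= threshold

# Illustrative data: white noise and a first-order autoregressive series.
rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(2000)]
colored = [0.0]
for _ in range(1999):
    colored.append(0.9 * colored[-1] + rng.gauss(0.0, 1.0))
```

For `white`, nearly all of the weight sits at lag 0, so the center of gravity stays close to the origin; for `colored`, the slowly decaying autocorrelation pulls it away.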

[0062] FIG. 5 shows the third cycle. Each process is the same as in the second cycle, except that in this example the third independent component separation process is W2(γ), which is based on coloredness.

[0063] The three independent component separation processes W2(α), W2(β), and W2(γ) mentioned above are now described in more detail. The first is the signal source estimation procedure using the independent component separation process W2(α) based on non-Gaussianity. First, a separation coefficient (matrix) Wt is assumed appropriately (its initial value is W0).

[0064] Next, the signal source y(t) is estimated from the decorrelated data F(t) as in equation (9).

【数１０】 [Equation 10]

[0065] Equation (11) gives Wt+1 for the next step of the convergence calculation. Wt+1 is then taken as the new Wt, and the above steps are repeated. The y(t) obtained when ΔWt has become almost zero, that is, when Wt is considered to have converged sufficiently, is the estimate of the signal source s(t) obtained from the raw mixed voice data x(t).
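
The iterate-until-ΔW-vanishes procedure can be illustrated with a two-channel example. Because the update formula of equation (11) is not reproduced in this text, the sketch below substitutes a widely used natural-gradient rule, ΔW = η (I − E[φ(y) yᵀ]) W with φ = tanh, which suits super-Gaussian sources such as speech. That rule, the learning rate, the tolerance, and all names are assumptions, not the patent's own formula.

```python
import math
import random

def separate_nongaussian(f, lr=0.1, tol=1e-5, max_iter=600):
    """f: list of (f1, f2) decorrelated samples F(t). Returns (W, y)."""
    w = [[1.0, 0.0], [0.0, 1.0]]              # initial separation matrix W0
    n = len(f)
    for _ in range(max_iter):
        # Accumulate E[phi(y) y^T] with y(t) = W F(t) and phi = tanh.
        c00 = c01 = c10 = c11 = 0.0
        for f1, f2 in f:
            y1 = w[0][0] * f1 + w[0][1] * f2
            y2 = w[1][0] * f1 + w[1][1] * f2
            p1, p2 = math.tanh(y1), math.tanh(y2)
            c00 += p1 * y1
            c01 += p1 * y2
            c10 += p2 * y1
            c11 += p2 * y2
        g = [[1.0 - c00 / n, -c01 / n], [-c10 / n, 1.0 - c11 / n]]
        # Assumed update: delta W = lr * (I - E[phi(y) y^T]) W
        dw = [[lr * (g[i][0] * w[0][j] + g[i][1] * w[1][j]) for j in range(2)]
              for i in range(2)]
        w = [[w[i][j] + dw[i][j] for j in range(2)] for i in range(2)]
        if max(abs(v) for row in dw for v in row) < tol:
            break                             # delta W is almost zero
    y = [(w[0][0] * f1 + w[0][1] * f2, w[1][0] * f1 + w[1][1] * f2)
         for f1, f2 in f]
    return w, y

# Illustrative data: two Laplace-distributed sources mixed by a rotation,
# which leaves them uncorrelated but not independent.
rng = random.Random(1)
sources = [(rng.choice((-1.0, 1.0)) * rng.expovariate(1.0),
            rng.choice((-1.0, 1.0)) * rng.expovariate(1.0))
           for _ in range(1500)]
c, s = math.cos(0.5), math.sin(0.5)
mixed = [(c * a - s * b, s * a + c * b) for a, b in sources]
W, y = separate_nongaussian(mixed)
# Overall gain from sources to outputs; ideally close to a scaled permutation.
gain = [[W[i][0] * c + W[i][1] * s, -W[i][0] * s + W[i][1] * c]
        for i in range(2)]
```

After convergence, each row of `gain` should be dominated by a single entry, meaning each output follows one source up to scale and ordering, which independent component analysis cannot determine.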

【数１１】 [Equation 11]

[0066] The second is the signal source estimation procedure using the independent component separation process W2(β) based on non-stationarity. First, initial values are obtained for the separation coefficient (matrix) Ct and for Φ, the moving average of y2(t) over a time on the order of the system time constant T′. In addition, y(t) is obtained from equation (12), in which I is the identity matrix.

[Equation 12]

Next, Φ is obtained by solving the differential equation shown in equation (12). In equation (13), T′ is a time constant of the system.

[Equation 13]

Next, a new Ct+1 is obtained from Φ, Ct, and y(t) in equation (12), using the differential equation shown in equation (14). In equation (14), T is a time constant of the system.

[0067] The third is the signal source estimation procedure using the independent component separation process W2(γ) based on coloredness. First, initial values are given for the separation coefficient matrix Ct and for Ψ1 and Ψ2. Here, Ψ1 and Ψ2 are the time averages of the two products (y1 * y1T) and (y2 * y2T) formed from y1(t) and y2(t), which are obtained by applying two kinds of linear filter to y(t). In addition, y(t) is estimated from the decorrelated data F(t) using equation (16).

【数１７】 [Equation 17]

[0068] From the above initial values of Ψ1 and Ψ2 and from y1 and y2, new values of Ψ1 and Ψ2 are obtained using the differential equation shown in equation (18).

【００６９】[0069]

[Equation 19]

From this Ct+1 and the data F(t), a new y(t) is obtained by equation (16) above. The y(t) obtained when the change in Ct, that is, the change in y(t), has become sufficiently small and is considered to have converged is the estimate of the signal source s(t) obtained from the raw mixed voice data x(t). If convergence has not yet been reached, y1(t) and y2(t) are obtained by equation (17), and the above steps are repeated.

[0070] Returning to FIG. 5, the evaluator E judged here that the output data y1′ and y2′ of the third separation cycle are sufficiently separated. That is, y1′ and y2′ are each considered to correspond to the voice of one of the sound sources S1 and S2. Digitized waveforms of these data are shown in FIGS. 11 and 12. When points whose amplitude is below a certain level are treated as noise rather than speech, y1′ is found to contain the voice data for "a" (~) and "ka" (~). Similarly, y2′ contains the voice data for "shi" (~).

[0071] FIG. 13 is a scatter diagram in which the magnitudes of y1′ and y2′ are plotted on the horizontal and vertical axes, respectively. As the figure shows, the points in one group all have y2′ values of almost zero, and conversely the points in the other group have y1′ values of almost zero, so the data has been cleanly separated into the voices of two independent sound sources.

[0072] To evaluate the separability of the data after the process W2, the evaluator E may examine a point such as the one marked in the graph of FIG. 13. That is, the point in the scatter diagram that deviates most from the horizontal or vertical axis is selected; if its distance to that axis is equal to or greater than a certain value, the separation is judged to be still insufficient, and another separation cycle like those in FIGS. 4 and 5 is executed.
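
The evaluator's scatter-plot criterion can be sketched in a few lines. The interpretation used here is an assumption: each point's deviation is taken as its distance to the nearer of the two axes, and the largest such distance is compared with a caller-supplied threshold. The function names are hypothetical.

```python
def max_axis_distance(y1, y2):
    """Largest distance from any scatter point (y1, y2) to its nearer axis."""
    return max(min(abs(a), abs(b)) for a, b in zip(y1, y2))

def is_separated(y1, y2, threshold):
    """Sufficiently separated if every point lies close to one of the axes."""
    return max_axis_distance(y1, y2) < threshold
```

For instance, a point such as (0.5, 0.4) lies 0.4 from its nearer axis and would make the data fail a threshold of 0.1, triggering another separation cycle.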

== Identification of the Speaker of Separated Voice Data ==

[0073] Next, the latter half of the present invention, the step of identifying the speaker of each separated voice data, is described. FIG. 14 shows a waveform diagram of the separated data y1′ obtained in the above voice separation step and its spectral distribution obtained by Fourier transform. The spectral distribution can be obtained using, for example, a filter bank or the LPC method.

[0074] Similarly, FIG. 15 shows a waveform diagram of the separated data y2′ and its spectral distribution obtained by Fourier transform. Since this embodiment assumes two speakers (Mr. A and Mr. B), the data was separated into the two waveforms y1′ and y2′, but at this point it is not known which is Mr. A's voice and which is Mr. B's. This is identified next.

[0075] As the first method for identifying the speaker, formant frequencies are used as the speaker identification parameters. In FIG. 14, fo1 and fo2 are the first and second formant frequencies of the y1′ data, and in FIG. 15, go1 and go2 are the first and second formant frequencies of the y2′ data. The first and second formant frequency data of conference participants Mr. A and Mr. B are prepared in advance in a database as the identification parameters. The speaker of each separated voice data is then identified by checking the formant frequencies of the separated data y1′ and y2′ against this database.

[0076] FIG. 16 is a conceptual diagram of the process of matching the formant frequencies of all five vowels of Mr. A and Mr. B, which are the identification parameters, against the first and second formant frequencies of the separated voice data y1′ and y2′. The horizontal axis is the first formant frequency and the vertical axis is the second formant frequency. The figure shows the region over which the formant frequencies of Mr. A's vowels are spread (enclosed by the solid line) and the corresponding region for Mr. B's vowels (enclosed by the dotted line), with the formant frequencies of the separated voice data of FIGS. 14 and 15 plotted on top.

[0077] If the formant frequencies of y1′ and y2′ fall within the formant frequency region of Mr. A or Mr. B, the speaker can be regarded as identified. However, if they fall within neither region (portion C in FIG. 16), or within the portion D where the regions of Mr. A and Mr. B overlap, the speaker cannot be identified by this first method, and the second identification method described below is executed.
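
The first identification method amounts to a region lookup in the plane of first and second formant frequencies. The sketch below assumes, purely for illustration, that each speaker's region is stored as an axis-aligned rectangle; the regions in FIG. 16 are free-form outlines, and the function name and the frequency values used in any call are made up for the example.

```python
def match_speaker(f1, f2, regions):
    """Return the single speaker whose region contains (f1, f2), else None.

    regions: {name: (f1_min, f1_max, f2_min, f2_max)} in Hz (assumed layout).
    None covers both failure cases: outside every region (portion C) and
    inside an overlap of regions (portion D).
    """
    hits = [name for name, (lo1, hi1, lo2, hi2) in regions.items()
            if lo1 <= f1 <= hi1 and lo2 <= f2 <= hi2]
    return hits[0] if len(hits) == 1 else None
```

A `None` result corresponds to the cases in which the description falls back to the second identification method.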

[0078] The second speaker identification method uses formant frequencies at a plurality of time points as the speaker identification parameters. FIG. 17 shows a piece of voice data (the sound "a") separated by the voice separation step in the first half of the present invention, divided into n sampling times, Fourier transformed, and decomposed into spectra. For each sampling time, the first and second peaks are obtained, namely the first formant frequencies (f11, f12, ... f1n) and the second formant frequencies (f21, f22, ... f2n).

[0079] Next, principal component analysis is performed on these formant frequency data to obtain the principal component scores Z1, Z2, ... Zn, which are used as the feature quantities of the speaker's voice. Accordingly, the speaker identification parameters prepared in advance in the database are the principal component scores Z1, Z2, ... of various sounds (such as all vowels) uttered by the conference participants.

[0080] FIG. 18 is a graph showing the result of the second speaker identification method. FIG. 18(a) shows, for comparison, the result of the first speaker identification method. In FIG. 18(a), the first and second formant frequencies at five sampling times are plotted for the sound "a", but it cannot be said whether they belong to Mr. A's region or Mr. B's region.

[0081] In contrast, FIG. 18(b) is a distribution diagram for the second method, with the first and second principal component scores Z1 and Z2 as the two coordinate axes. As in this example, Mr. A's region (solid line) and Mr. B's region (dotted line) are often clearly separated in the principal component score plane, so the judgment is easy. When the principal component scores (Z1, Z2) of the separated data obtained from the results of FIG. 17 are plotted, they are clearly close to Mr. A's region, so the "a" in this case is found to have been uttered by Mr. A.

[0082] In this way, the mixed voice data of time segment [1] in FIG. 1 has been separated and the speaker of each separated data identified. If the same processing is applied to time segments [2], [3], and so on, the whole of the continuous mixed voice data can be separated into, and attributed to, the voice of each speaker.

== Modifications and Specific Applications of the Invention ==

[0083] In the above embodiment, the conference had two participants, partly for convenience in drawing the graphs, but the voices can be separated and the speakers identified in exactly the same way when there are three or more participants.

[0084] One specific application of the present invention is automatic creation of minutes, in which each identified speaker is associated with that speaker's statements, the statements are converted into text data or the like using any of various known speech recognition software, and the result is output to a recording medium. Minutes of long meetings can be created easily, and speakers are identified automatically and accurately.

[0085] Various other applications are conceivable, such as identifying the speaker in a mobile phone call under poor sound quality, identifying the speaker in CTI (computer telephony integration), and voice input to, and speaker identification for, a car navigation system in a noisy automobile or a personal computer or the like in situations where a microphone cannot be placed near the mouth. Application as a voice input means for information appliances, mobile terminals such as mobile phones and PDAs, wearable computers that can be worn and carried, and the like is also conceivable.

【００８６】[0086]

[0087] The present invention can thus be applied to automatic creation of conference minutes concurrently with voice data input, and to various voice input interfaces in real environments.

[Brief Description of the Drawings]

[FIG. 19] A graph showing the relationship between the order m and its cumulative contribution rate.

[FIG. 20] A graph showing the relationship between the order m and the amount of change in the cumulative contribution rate.

[FIG. 21] A flowchart showing a procedure for determining the order m by a method suited to the system.

[Procedure Amendment 2]

[Document to be amended] Drawings

[Item to be amended] FIG. 3

[Method of amendment] Change

[Content of amendment]

[FIG. 3]

[Procedure Amendment 3]

[Document to be amended] Drawings

[Item to be amended] FIG. 5

[Method of amendment] Change

[Content of amendment]

[FIG. 5]

[Procedure Amendment 4]

[Document to be amended] Drawings

[Item to be amended] FIG. 19

[Method of amendment] Change

[Content of amendment]

[FIG. 19]

[Procedure Amendment 5]

[Document to be amended] Drawings

[Item to be amended] FIG. 20

[Method of amendment] Change

[Content of amendment]

[FIG. 20]

[Procedure Amendment 6]

[Document to be amended] Drawings

[Item to be amended] FIG. 21

[Method of amendment] Change

[Content of amendment]

[FIG. 21]

[Procedure Amendment 7]

[Document to be amended] Drawings

[Item to be amended] FIG. 22

[Method of amendment] Delete

[Procedure Amendment 8]

[Document to be amended] Drawings

[Item to be amended] FIG. 23

[Method of amendment] Delete

[Written Procedure Amendment]

[Submission date] September 9, 2002

[Procedure Amendment 1]

[Document to be amended] Specification

[Item to be amended] 0013

[Method of amendment] Change

[Content of amendment]

[0013]

[Means for Solving the Problems] To solve the above problems, the first invention of the present application is a voice data separation method for separating mixed voice data, in which voice data of a plurality of speakers are mixed, into voice data for each speaker, the method comprising (1) a step of performing decorrelation processing for decorrelating the mixed voice data with each other, and (2) a step of performing independent component separation processing for separating the data subjected to the decorrelation processing into independent components, wherein, when the separability of the data subjected to the independent component separation is insufficient, the decorrelation processing and the independent component separation processing are repeated on that data until the separability becomes sufficient. According to this first invention, both the correlation and the independence of the voice data contained in the input mixed voice data (raw data) are taken into account, so that the data can be accurately separated into the voice of each speaker even when the correlation and independence of the plural voice data, intruding noise, and the like vary in time and space. In addition, according to this first invention, the mixed voice data can be sufficiently separated into voice data for each sound source.
Performing a decorrelation process for decorrelating the mixed speech data with each other; and (2) performing an independent component separation process for separating the decorrelated data into independent components. And if the separability of the data on which the independent component separation is performed is insufficient, until the separability is sufficient, the data on which the independent component separation processing is performed, the decorrelation process and In the speech separation method, the independent component separation process is repeatedly performed. According to the first aspect of the invention as described above, a plurality of voice data and mixed noise are considered in consideration of both the correlation and the independence of each voice data included in the input mixed voice data (raw data). Even if the correlation or independence of the above changes temporally or spatially, it is possible to accurately separate the voices of each speaker. In addition, according to the first aspect of the invention, the mixed voice data can be sufficiently separated into the voice data for each sound source.

[Procedure Amendment 2]

[Document to be amended] Specification

[Item to be amended] 0064

[Method of amendment] Change

[Content of amendment]

[Equation 10]

Continued from the front page: (51) Int.Cl.⁷ Identification code FI Theme code (reference) G10L 9/08 301A 3/02 301E 301C

Claims

1. A voice data separation method for separating mixed voice data, in which voice data of a plurality of speakers are mixed, into voice data for each speaker, the method comprising: (1) a step of performing decorrelation processing for decorrelating the mixed voice data with each other; and (2) a step of performing independent component separation processing for separating the data subjected to the decorrelation processing into independent components.

2. The voice separation method according to claim 1, wherein, when the separability of the data subjected to the independent component separation is insufficient, the decorrelation processing and the independent component separation processing are repeated on the data subjected to the independent component separation processing until the separability becomes sufficient.

3. The voice separation method according to claim 1, wherein a non-Gaussian independent component separation process for separating non-Gaussian data into independent components, a non-stationary independent component separation process for separating non-stationary data into independent components, and a colored independent component separation process for separating colored data into independent components are prepared as the independent component separation processing, and any one of the non-Gaussian independent component separation process, the non-stationary independent component separation process, and the colored independent component separation process is performed.

4. The voice separation method according to claim 3, wherein the independent component separation process performed first is the non-Gaussian independent component separation process for separating non-Gaussian data into independent components.

5. The speech separation method according to claim 1, wherein the decorrelation processing performs at least a principal component analysis and a factor analysis.

6. A speaker identification method for separating mixed voice data, in which voice data of a plurality of speakers are mixed, into voice data for each speaker and identifying the speaker of the voice data for each speaker, the method comprising: (1) a step of separating the mixed voice data into voice data for each speaker by the voice separation method according to any one of claims 1 to 5; (2) a step of preparing, for each speaker, an identification parameter for identifying the speaker; and (3) a step of identifying the speaker of the separated voice data for each speaker by referring to the identification parameter.

7. The speaker identification method according to claim 6, wherein the identification parameter is a formant frequency when the speaker utters a vowel, a formant frequency is obtained for each of the separated voice data for each speaker, and the speaker is identified by checking the obtained formant frequency against the formant frequency serving as the identification parameter.

8. The speaker identification method according to claim 7, wherein the identification parameters are a first formant frequency and a second formant frequency when the speaker utters a vowel, the first formant frequency and the second formant frequency are obtained for each of the separated voice data for each speaker, and the speaker is identified by checking the obtained first and second formant frequencies against the first and second formant frequencies serving as the identification parameters.

9. The speaker identification method according to claim 6, wherein, when the speaker cannot be identified in the step of identifying the speaker of the separated voice data for each speaker by referring to the identification parameter, formant frequencies at a plurality of time points are obtained from the voice data, and the speaker is identified by checking the obtained formant frequencies at the plurality of time points against formant frequencies at a plurality of time points serving as the identification parameter.

10. The speaker identification method according to claim 6, wherein, when the speaker cannot be identified in the step of identifying the speaker of the separated voice data for each speaker by referring to the identification parameter, voiced sound data is separated from the voice data, a formant frequency is obtained for the voiced sound data, and the speaker is identified by checking the obtained formant frequency against the formant frequency serving as the identification parameter.

11. The speaker identification method according to claim 10, wherein, when the speaker cannot be identified in the step of identifying the speaker of the separated voice data for each speaker by referring to the identification parameter, voiced sound data is separated from the voice data, a first formant frequency and a second formant frequency are obtained for the voiced sound data, and the speaker is identified by checking the obtained first and second formant frequencies against the first and second formant frequencies serving as the identification parameters.

12. The speaker identification method according to claim 10 or 11, wherein, when the speaker cannot be identified in the step of identifying the speaker of the separated voice data for each speaker by referring to the identification parameter, formant frequencies at a plurality of time points are obtained for the voiced sound data, and the speaker is identified by checking the obtained formant frequencies at the plurality of time points against formant frequencies at a plurality of time points serving as the identification parameter.

13. The speaker identification method according to claim 10 or 11, wherein, when the voiced sound data is separated from the voice data, independent component separation processing for separating the voice data into independent components is performed.

14. A minutes creation method for creating minutes from mixed voice data in which voice data of a plurality of speakers are mixed, the method comprising: a step of identifying the speaker of the separated voice data for each speaker by the speaker identification method according to any one of claims 6 to 13; and a step of associating the identified speaker with that speaker's statements and outputting them to a recording medium.

15. A voice data separation device for separating mixed voice data, in which voice data of a plurality of speakers are mixed, into voice data for each speaker, wherein the device performs decorrelation processing for decorrelating the mixed voice data with each other, and performs independent component separation processing for separating the data subjected to the decorrelation processing into independent components.

16. The voice separation device according to claim 15, wherein, when the separability of the data subjected to the independent component separation is insufficient, the decorrelation processing and the independent component separation processing are repeated on the data subjected to the independent component separation processing until the separability becomes sufficient.

17. The voice separation device according to claim 15 or 16, wherein, depending on the nature of the data, any one of a non-Gaussian independent component separation process for separating non-Gaussian data into independent components, a non-stationary independent component separation process for separating non-stationary data into independent components, and a colored independent component separation process for separating colored data into independent components is performed as the independent component separation processing.

18. The voice separation device according to claim 17, wherein the independent component separation process performed first is the non-Gaussian independent component separation process for separating non-Gaussian data into independent components.

19. The speech separation apparatus according to claim 15, wherein the decorrelation processing performs at least a principal component analysis and a factor analysis.

20. A speaker identifying apparatus for separating mixed voice data, in which voice data of a plurality of speakers are mixed, into voice data for each speaker and identifying a speaker for the voice data of each speaker, wherein the speech separation apparatus according to any one of claims 15 to 19 separates the mixed voice data into voice data for each speaker, and the speaker is identified by referring, for the separated voice data of each speaker, to a specific parameter for identifying each speaker.

21. The speaker identifying apparatus according to claim 20, wherein the specific parameter is a formant frequency when a speaker produces a vowel, a formant frequency is obtained for the separated voice data of each speaker, and the speaker is identified by referring, for the obtained formant frequency, to the formant frequency serving as the specific parameter.

22. The speaker identifying apparatus according to claim 21, wherein the specific parameter is a first formant frequency and a second formant frequency when a speaker produces a vowel, a first formant frequency and a second formant frequency are obtained for the separated voice data of each speaker, and the speaker is identified by referring, for the obtained first formant frequency and second formant frequency, to the first formant frequency and the second formant frequency serving as the specific parameters.

23. The speaker identifying apparatus according to claim 20, wherein, when the speaker cannot be identified by referring to the specific parameter for the separated voice data of each speaker, formant frequencies at a plurality of time points are obtained from the voice data, and the speaker is identified by referring, for the obtained formant frequencies at the plurality of time points, to formant frequencies at a plurality of time points serving as the specific parameters.

24. The speaker identifying apparatus according to claim 20, wherein, when the speaker cannot be identified by referring to the specific parameter for the separated voice data of each speaker, voiced sound data is separated from the voice data, a formant frequency is obtained for the voiced sound data, and the speaker is identified by referring, for the obtained formant frequency, to the formant frequency serving as the specific parameter.

25. The speaker identifying apparatus according to claim 24, wherein, when the speaker cannot be identified by referring to the specific parameter for the separated voice data of each speaker, voiced sound data is separated from the voice data, a first formant frequency and a second formant frequency are obtained for the voiced sound data, and the speaker is identified by referring, for the obtained first formant frequency and second formant frequency, to a first formant frequency and a second formant frequency serving as the specific parameters.

26. The speaker identifying apparatus according to claim 24 or 25, wherein, when the speaker cannot be identified by referring to the specific parameter for the separated voice data of each speaker, formant frequencies at a plurality of time points are obtained for the voiced sound data, and the speaker is identified by referring, for the obtained formant frequencies at the plurality of time points, to formant frequencies at a plurality of time points serving as the specific parameters.

27. The speaker identifying apparatus according to claim 24 or 25, wherein, when the voiced sound data is separated from the voice data, independent component separation processing is performed to separate the voice data into independent components.

28. A minutes creating device for creating minutes from mixed voice data in which voice data of a plurality of speakers are mixed, comprising the speaker identifying apparatus according to claim 20, wherein the minutes are created by identifying a speaker for the separated voice data of each speaker and outputting the identified speaker and that speaker's statement, in association with each other, to a recording medium.

29. A computer program for causing a voice separation device to execute the voice separation method according to claim 1.

30. A computer program for causing a speaker identifying device to execute the speaker identifying method according to claim 6.

31. A computer-readable recording medium having the computer program according to claim 29 or 30 recorded therein.
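
For illustration only, the two-stage separation recited in claims 15 to 19 (decorrelation followed by independent component separation) and the formant-based identification recited in claims 20 to 22 can be sketched as below. This is a minimal sketch under stated assumptions, not the procedure of the embodiment: it handles only two channels, it uses PCA whitening for the decorrelation, and it replaces the non-Gaussian separation step with a simple grid search over rotation angles that maximizes squared excess kurtosis. All function names, the mixing matrix, the registered formant values, and the tolerance are hypothetical.

```python
import math
import numpy as np

def whiten(x):
    """Decorrelate the channels (rows of x) by PCA so their covariance is the identity."""
    x = x - x.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(np.cov(x))
    # D^{-1/2} E^T x: rotate onto principal axes, then scale to unit variance.
    return (eigvec / np.sqrt(eigval)).T @ x

def excess_kurtosis(y):
    """Excess kurtosis, used here as a simple measure of non-Gaussianity."""
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

def separate_two(z, n_angles=360):
    """Assumed simplification: search the rotation of two whitened channels
    that maximizes the summed squared excess kurtosis."""
    best, best_score = z, -np.inf
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        y = np.array([[c, s], [-s, c]]) @ z
        score = excess_kurtosis(y[0]) ** 2 + excess_kurtosis(y[1]) ** 2
        if score > best_score:
            best, best_score = y, score
    return best

# Hypothetical registered parameters: speaker -> (F1, F2) in Hz for one vowel.
REGISTERED = {"speaker_a": (730.0, 1090.0), "speaker_b": (850.0, 1220.0)}

def identify_speaker(f1, f2, registered=REGISTERED, tolerance_hz=60.0):
    """Return the registered speaker nearest to (f1, f2), or None if none is close enough."""
    best_name, best_dist = None, math.inf
    for name, (ref_f1, ref_f2) in registered.items():
        dist = math.hypot(f1 - ref_f1, f2 - ref_f2)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= tolerance_hz else None

# Synthetic stand-ins for two speakers' signals (not real speech).
rng = np.random.default_rng(0)
sources = np.vstack([rng.uniform(-1, 1, 20000), rng.laplace(size=20000)])
mixed = np.array([[1.0, 0.6], [0.4, 1.0]]) @ sources  # hypothetical mixing

white = whiten(mixed)            # decorrelation processing
recovered = separate_two(white)  # independent component separation processing
```

In this sketch a return value of None from `identify_speaker` corresponds to the case in claims 23 to 26 where the speaker cannot be identified from the specific parameter alone and further processing is applied. Estimating formant frequencies from the separated voice data is outside the scope of the sketch.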