JP2011059500A

JP2011059500A - Speaker clustering device and speaker clustering method

Info

Publication number: JP2011059500A
Application number: JP2009210499A
Authority: JP
Inventors: Kenichi Iso; 健一磯
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2009-09-11
Filing date: 2009-09-11
Publication date: 2011-03-24
Anticipated expiration: 2029-09-11
Also published as: JP4960416B2

Abstract

PROBLEM TO BE SOLVED: To provide a speaker clustering device and a speaker clustering method capable of accurately identifying speakers in a voice signal.

SOLUTION: A speaker clustering device 100 includes a vector quantization means 30, an appearance frequency generation means 40, a similarity calculation means 50, and a clustering means 60. The vector quantization means 30 converts an input voice signal into codes. The appearance frequency generation means 40 generates, for each utterance, an appearance frequency vector whose components are the numbers of appearances of the individual codes. The similarity calculation means 50 calculates the cosine distance between utterances from their appearance frequency vectors and derives the similarity between utterances from the cosine distance. The clustering means 60 clusters the utterances by spectral clustering based on the similarity.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a speaker clustering apparatus and a speaker clustering method for identifying speakers in a speech signal.

Speaker clustering, which groups the utterances contained in recorded speech by speaker, is expected to be useful for improving speech recognition accuracy and for viewing support such as speaker-by-speaker skip playback.
One known clustering technique of this kind converts feature patterns extracted from the input speech into common codes and then determines which speech sections belong to the same speaker by cluster analysis of the appearance probability of each code in each speech section (see, for example, Patent Document 1).

Here, cluster analysis (clustering) is a multivariate analysis method that groups a data set given in a multidimensional space into clusters according to the similarity between individual data points. In this case, cluster analysis is performed on the similarity between speech sections (utterances).
When cluster analysis is applied to appearance probabilities, the distance between appearance probability vectors, which serves as the clustering criterion, must be calculated. The KL distance (Kullback-Leibler divergence) or the Euclidean distance is generally used for this purpose (see, for example, Non-Patent Document 1). In particular, to measure the distance between utterances in a speech signal, it is common to estimate a Gaussian distribution representing the characteristics of each utterance and to use the KL distance between these Gaussian distributions (see, for example, Non-Patent Document 2).

Patent Document 1: JP-A-6-83384

Non-Patent Document 1: IEICE Technical Report SP-92-45
Non-Patent Document 2: Ning et al., InterSpeech, pp. 2178-2181, 2006

However, the KL distance and the Euclidean distance used in Non-Patent Documents 1 and 2 measure how well entire Gaussian distributions match. When utterances are short, these measures are known to be more sensitive to differences in the appearance frequencies of the phonemes contained in the utterances than to differences between speakers. In other words, the Gaussian distribution of an utterance depends on two things: the characteristics of the speaker and the bias in the appearance frequencies of the phonemes the utterance contains. If the utterance is sufficiently long, the latter difference becomes inconspicuous and the speaker characteristics are represented accurately. The KL distance and the Euclidean distance are therefore effective for speaker clustering when utterances are sufficiently long, but for speech that contains many short utterances, such as a recorded discussion, the clustering ends up emphasizing utterance content (the phonemes contained in the utterances) rather than speaker characteristics.
In addition, when an utterance is represented not by a Gaussian distribution but by a vector of vector-quantization code appearance frequencies, as in Patent Document 1 and Non-Patent Document 2, using the KL distance causes a numerical anomaly (the logarithm of 0) wherever an appearance frequency is zero, so a heuristic workaround is required. Because the logarithm changes steeply near 0, accurate processing is considered to be inherently difficult.

An object of the present invention is to provide a speaker clustering apparatus and a speaker clustering method that can identify speakers in a speech signal with high accuracy.

The speaker clustering apparatus of the present invention identifies utterances by the same speaker in an input speech signal and comprises: vector quantization means for converting the speech signal into codes using vector quantization; appearance frequency generation means for generating, for each utterance in the converted codes, an appearance frequency vector whose components are the numbers of appearances of the codes; similarity calculation means for calculating the cosine distance between each pair of utterances based on the appearance frequency vectors and taking the cosine distance as the similarity between the utterances; and clustering means for classifying the utterances based on the similarity between utterances.

In the present invention, the vector quantization means converts the speech signal into codes, the appearance frequency generation means generates appearance frequency vectors from the codes, the similarity calculation means calculates the cosine distance between utterances from the appearance frequency vectors to obtain the similarity, and the clustering means performs clustering based on the similarity.
The similarity given by the cosine distance treats the distribution of code appearance frequencies as a vector and depends on the angle between appearance frequency vectors. Moreover, the cosine distance depends only on the frequencies of codes that appear in both utterances: a code whose appearance frequency is 0 (zero) in one of the utterances contributes nothing to the sum in the numerator of the cosine distance. Therefore, even when short utterances produce a bias in the appearance frequencies of phonemes between utterances, clustering with the similarity calculated from the cosine distance achieves speaker clustering that reflects the speaker characteristics of the utterances and is not affected by utterance length or by the bias in phoneme appearance frequencies between utterances.
Also, in calculating the cosine distance between utterances, if the appearance frequency vector of one utterance is zero, the similarity is also zero, so no heuristic treatment is required. The amount of computation can therefore be greatly reduced, and because no special processing is applied to the utterance data, more accurate speaker clustering can be performed.
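The contrast drawn above between the KL distance and the cosine distance can be illustrated with a minimal sketch. The function names and the two count vectors are hypothetical, and returning 0 for an all-zero vector is an assumption made here to mirror the statement that the similarity is then zero.

```python
import math

def cosine_similarity(f_i, f_j):
    """Cosine of the angle between two code-count vectors."""
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm_i = math.sqrt(sum(a * a for a in f_i))
    norm_j = math.sqrt(sum(b * b for b in f_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_i * norm_j)

def kl_divergence(f_i, f_j):
    """KL divergence between count vectors normalized to probabilities."""
    total_i, total_j = sum(f_i), sum(f_j)
    kl = 0.0
    for a, b in zip(f_i, f_j):
        p, q = a / total_i, b / total_j
        if p == 0.0:
            continue  # the term 0 * log(0 / q) is taken as 0
        if q == 0.0:
            return math.inf  # log(p / 0) diverges
        kl += p * math.log(p / q)
    return kl

# Two short utterances over a 4-code codebook (illustrative values only).
# The last code never occurs in the second utterance.
f_1 = [4, 0, 2, 1]
f_2 = [3, 1, 2, 0]

cos_12 = cosine_similarity(f_1, f_2)  # finite; zero entries simply drop out
kl_12 = kl_divergence(f_1, f_2)       # infinite without smoothing
```

The zero entries need no smoothing in the cosine case, whereas the KL divergence requires some workaround before it yields a finite value.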

Furthermore, in the present invention the similarity is obtained from the cosine distance calculated using the codes in the utterances, so accurate utterance characteristics can be obtained even when utterances are short, and highly accurate speaker clustering can be performed as a result.

In the speaker clustering apparatus of the present invention, the appearance frequency generation means preferably weights, in the appearance frequency vector, the number of appearances of each code within each utterance by the reciprocal of the number of utterances in which that code appears.

In this invention, the appearance frequency vector is weighted. Weighting the number of appearances of a code by the reciprocal of the number of utterances in which the same code appears means that a code characteristic of an utterance receives a larger weight when it does not appear in other utterances. That is, a feature (code) that appears in every utterance is not treated as characteristic of any particular utterance, whereas a feature (code) that does not appear in other utterances is given weight.
As a result, the features of each utterance can be captured accurately, and more accurate speaker clustering can be performed.
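The weighting described above might be sketched as follows. The function name and the sample counts are hypothetical; the only rule taken from the text is that each count is multiplied by the reciprocal of the number of utterances containing that code.

```python
def weight_by_inverse_utterance_count(count_vectors):
    """Scale each code count by 1 / (number of utterances containing that code)."""
    num_codes = len(count_vectors[0])
    # utterance_count[v]: number of utterances in which code v occurs at least once
    utterance_count = [
        sum(1 for f in count_vectors if f[v] > 0) for v in range(num_codes)
    ]
    return [
        [f[v] / utterance_count[v] if utterance_count[v] > 0 else 0.0
         for v in range(num_codes)]
        for f in count_vectors
    ]

# Three utterances over a 3-code codebook (illustrative values only).
# The first code occurs in every utterance; the other two occur in one each.
counts = [[2, 3, 0], [4, 0, 0], [6, 0, 5]]
weighted = weight_by_inverse_utterance_count(counts)
```

Here the code shared by all three utterances is scaled down by a factor of three, while the codes unique to one utterance keep their full counts.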

In the speaker clustering apparatus of the present invention, the appearance frequency generation means preferably generates an appearance frequency vector whose components are the numbers of appearances of code strings formed by combining consecutive codes in each utterance.

A combination of consecutive codes in an utterance expresses a feature of that utterance. An appearance frequency vector whose components are the numbers of appearances of such combinations therefore represents the features of the utterance more accurately, and more accurate speaker clustering can be performed as a result.
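As a minimal sketch of counting code strings, the example below counts pairs of consecutive codes. Using pairs (strings of length 2) is an assumption made for illustration, and the function name and sample codes are hypothetical.

```python
from collections import Counter

def code_pair_counts(codes):
    """Count each pair of consecutive codes in one utterance."""
    return Counter(zip(codes, codes[1:]))

# Code sequence of one utterance (illustrative values only).
utterance = [5, 5, 12, 5, 12, 7]
pairs = code_pair_counts(utterance)
```

With a codebook of V codes, a pair-based vector has up to V × V components, so a sparse representation such as `Counter` is a natural fit.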

The speaker clustering apparatus of the present invention preferably further comprises change amount vector quantization means for converting the time-series change amount of the speech signal into codes using vector quantization, and the appearance frequency generation means preferably generates, for each utterance, an appearance frequency vector whose components are the numbers of appearances of the speech signal codes and the numbers of appearances of the change amount codes.

In this invention, change amount vector quantization means is provided in addition to the vector quantization means described above. The change amount vector quantization means uses vector quantization to convert the time-series change amount of the speech signal into codes. Because the time-series change amount of an utterance represents a feature of that utterance, using it allows the features of the utterance to be expressed more accurately.
In particular, an appearance frequency vector whose components are both the numbers of appearances of the codes in the utterance and the numbers of appearances of the codes of the time-series change amount in the utterance expresses the features of the utterance more accurately. As a result, more accurate speaker clustering can be performed.
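A rough sketch of the two ingredients follows. Taking the change amount to be the frame-to-frame difference of feature vectors is an assumption; the text above does not define how the change amount is computed. The function names and sample values are hypothetical, and the quantization of the differences is assumed to be done by a second codebook that is not shown.

```python
def delta_features(frames):
    """Frame-to-frame difference of feature vectors (assumed form of the change amount)."""
    return [
        [cur_x - prev_x for prev_x, cur_x in zip(prev, cur)]
        for prev, cur in zip(frames, frames[1:])
    ]

def concatenate_counts(static_counts, delta_counts):
    """Join the counts of signal codes and change-amount codes into one vector."""
    return list(static_counts) + list(delta_counts)

# Three 2-dimensional frames (illustrative values only).
frames = [[1.0, 2.0], [1.5, 1.0], [3.0, 1.0]]
deltas = delta_features(frames)

# Counts over a 3-code signal codebook and a 2-code change-amount codebook.
combined = concatenate_counts([1, 1, 2], [1, 2])
```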

In the speaker clustering apparatus of the present invention, the clustering means preferably performs spectral clustering using the similarity between utterances given by the calculated cosine distance.

In this invention, spectral clustering is performed based on the similarity between utterances given by the cosine distance. Because this similarity expresses the features of the utterances more accurately, performing spectral clustering with it yields more accurate clustering. That is, the accuracy of determining which utterances belong to the same speaker can be improved.

The speaker clustering apparatus of the present invention preferably further comprises speaker number determination means that calculates the eigenvalues of a square matrix obtained from the similarity matrix whose elements are the similarities between utterances obtained by the similarity calculation means, and determines the number of speakers in the speech signal based on the eigenvalue at which the difference between consecutive eigenvalues is largest.

In this invention, the number of speakers in the speech signal is determined from eigenvalue differences, based on the similarity matrix whose elements are the similarities obtained by the similarity calculation means. As described above, the similarity accurately expresses the features of the utterances, so using a similarity matrix built from it allows the number of speakers to be determined more accurately.
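The eigenvalue-difference idea can be sketched with NumPy as below. The choice of the symmetric normalized Laplacian as the square matrix, the ascending ordering of the eigenvalues, and the function name are all assumptions made for illustration; the square matrix the apparatus actually uses may differ.

```python
import numpy as np

def estimate_num_speakers(W):
    """Estimate the cluster count from the largest gap between consecutive eigenvalues."""
    W = np.asarray(W, dtype=float)
    n = len(W)
    if n < 2:
        return n
    degree = W.sum(axis=1)
    # Assumption: symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    inv_sqrt = np.zeros_like(degree)
    positive = degree > 0
    inv_sqrt[positive] = 1.0 / np.sqrt(degree[positive])
    L = np.eye(n) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    eigenvalues = np.linalg.eigvalsh(L)  # ascending order for a symmetric matrix
    gaps = np.diff(eigenvalues)          # gaps[k] = eigenvalues[k + 1] - eigenvalues[k]
    return int(np.argmax(gaps)) + 1

# Similarity matrix with two clearly separated groups (illustrative values only).
W_two_groups = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
num_speakers = estimate_num_speakers(W_two_groups)
```

For a block-structured similarity matrix like this one, the Laplacian has one near-zero eigenvalue per block, so the largest gap falls right after them.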

In the speaker clustering apparatus of the present invention, the similarity calculation means preferably sets to 0 any similarity that falls below a threshold obtained by multiplying the maximum of the calculated similarities between utterances by a predetermined coefficient.

In this invention, a similarity threshold is defined, and any similarity below the threshold is set to 0 (zero). The threshold is the value obtained by multiplying the maximum of the similarities between utterances by a predetermined coefficient, and it can be adjusted as appropriate.
Thus, when the similarity between utterances is clearly small it can be set to 0, which improves the accuracy of the eigenvalue-difference estimate of the number of speakers described above.
The amount of computation can also be greatly reduced.
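A minimal sketch of this thresholding is given below. Taking the maximum separately for each utterance and excluding the self-similarity on the diagonal are assumptions made here; the function name and the sample matrix are hypothetical.

```python
def trim_by_threshold(W0, epsilon):
    """Zero every similarity below epsilon times the largest similarity in its row."""
    trimmed = []
    for i, row in enumerate(W0):
        # Assumption: the self-similarity w_ii is excluded from the maximum.
        others = [w for j, w in enumerate(row) if j != i]
        threshold = epsilon * max(others) if others else 0.0
        trimmed.append(
            [w if (j == i or w >= threshold) else 0.0 for j, w in enumerate(row)]
        )
    return trimmed

# Similarities among three utterances (illustrative values only).
W0 = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
W_trimmed = trim_by_threshold(W0, epsilon=0.5)
```

With a per-row maximum the result can become asymmetric (here the entry 0.3 survives in the third row but not in the second), so a symmetrization step may be needed before an eigenvalue computation.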

In the speaker clustering apparatus of the present invention, the similarity calculation means preferably sets to 0 all of the calculated similarities between utterances other than a predetermined number of the largest ones.

In this invention, similarities other than a predetermined number of the largest ones are set to 0 (zero). That is, small similarities are set to 0 (zero). The predetermined number of top similarities can be adjusted as appropriate.
Thus, when the similarity between utterances is clearly small it can be set to 0, which improves the accuracy of the eigenvalue-difference estimate of the number of speakers described above.
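The top-ranked variant might look like the sketch below. Selecting the top entries separately for each utterance and leaving the diagonal untouched are assumptions; the function name, the parameter `k`, and the sample matrix are hypothetical.

```python
def keep_top_k(W0, k):
    """Keep the k largest off-diagonal similarities in each row and zero the rest."""
    trimmed = []
    for i, row in enumerate(W0):
        # Indices of the other utterances, most similar first.
        ranked = sorted(
            (j for j in range(len(row)) if j != i),
            key=lambda j: row[j],
            reverse=True,
        )
        keep = set(ranked[:k])
        trimmed.append(
            [w if (j == i or j in keep) else 0.0 for j, w in enumerate(row)]
        )
    return trimmed

# Similarities among four utterances (illustrative values only).
W0_top = [
    [1.0, 0.9, 0.2, 0.4],
    [0.9, 1.0, 0.1, 0.3],
    [0.2, 0.1, 1.0, 0.7],
    [0.4, 0.3, 0.7, 1.0],
]
W_top2 = keep_top_k(W0_top, k=2)
```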

The speaker clustering method of the present invention identifies utterances by the same speaker in an input speech signal and comprises: a vector quantization step of converting the speech signal into codes using vector quantization; an appearance frequency generation step of generating, for each utterance in the converted codes, an appearance frequency vector whose components are the numbers of appearances of the codes; a similarity calculation step of obtaining the similarity between utterances based on the appearance frequency vectors; and a clustering step of classifying the utterances based on the similarity between utterances.

The present invention converts the speech signal into codes, generates appearance frequency vectors from the converted codes, calculates the cosine distance between utterances from the generated appearance frequency vectors to obtain the similarity, and performs clustering based on the calculated similarity.
The similarity given by the cosine distance treats the distribution of code appearance frequencies as a vector and depends on the angle between appearance frequency vectors. Moreover, the cosine distance depends only on the frequencies of codes that appear in both utterances: a code whose appearance frequency is 0 (zero) in one of the utterances contributes nothing to the sum in the numerator of the cosine distance. Therefore, even when short utterances produce a bias in the appearance frequencies of phonemes between utterances, clustering with the similarity calculated from the cosine distance achieves speaker clustering that reflects the speaker characteristics of the utterances and is not affected by utterance length or by the bias in phoneme appearance frequencies between utterances.
Also, in calculating the cosine distance between utterances, if the appearance frequency vector of one utterance is zero, the similarity is also zero, so no heuristic treatment is required. The amount of computation can therefore be greatly reduced, and because no special processing is applied to the utterance data, more accurate speaker clustering can be performed.

FIG. 1 is a block diagram showing the schematic configuration of a speaker clustering apparatus according to a first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the speaker clustering apparatus in the first embodiment.
FIG. 3 is an explanatory diagram showing the conversion from a speech signal to feature vectors in the first embodiment.
FIG. 4 is an explanatory diagram showing the conversion from feature vectors to the feature space in the first embodiment.
FIG. 5 is a diagram showing the code string of a speech signal obtained by vector quantization in the first embodiment.
FIG. 6 is a diagram explaining the generation of the appearance frequency vector of each utterance in the first embodiment.
FIG. 7 is a matrix diagram showing the similarity between utterances in the first embodiment.
FIG. 8 is a diagram showing the code string of a speech signal obtained by vector quantization in a third embodiment.
FIG. 9 is a diagram explaining the generation of the appearance frequency vector of each utterance in the third embodiment.
FIG. 10 is a block diagram showing the schematic configuration of a speaker clustering apparatus according to a fourth embodiment.
FIG. 11 is a diagram showing the code string of a speech signal obtained by vector quantization in the fourth embodiment.
FIG. 12 is a diagram explaining the generation of the appearance frequency vector of each utterance in the fourth embodiment.

Embodiments of the present invention are described below with reference to the drawings. The embodiments describe a speaker clustering apparatus that groups the utterances in an input speech signal by speaker.
[1. First Embodiment]
[1-1. Configuration of the speaker clustering apparatus]
As shown in FIG. 1, the speaker clustering apparatus 100 includes speech signal acquisition means 10, utterance segmentation means 20, vector quantization means 30, appearance frequency generation means 40, similarity calculation means 50, clustering means 60, and speaker number determination means 70. Although not shown, the speaker clustering apparatus 100 also includes input means through which a speech signal can be input.

The speech signal acquisition means 10 acquires the speech signal input through the input means (not shown).
The utterance segmentation means 20 divides the input speech signal into utterances. Specifically, the signal can be split at portions where no speech occurs.

The vector quantization means 30 converts the input speech signal into codes and includes a feature vector time-series conversion unit 31, a feature vector clustering unit 32, and a code string generation unit 33.
The feature vector time-series conversion unit 31 samples the speech signal at fixed time intervals, extracts a speech feature quantity (feature vector) such as the mel cepstrum from each sample, and outputs the results in time series. That is, it generates feature vectors in time series.
The feature vector clustering unit 32 places each sample in the feature space according to its feature vector, clusters the samples into the groups they form in the feature space, and assigns an identifying number to each generated cluster.
The code string generation unit 33 uses the number assigned to each cluster as a code and converts the speech signal into a time series of codes (hereinafter referred to as a code string).

The appearance frequency generation means 40 generates, for each utterance, an appearance frequency vector whose components are the numbers of appearances of the individual codes in the code string.
The similarity calculation means 50 calculates the cosine distance between utterances from the appearance frequency vectors of the utterances and takes this cosine distance as the similarity between utterances.

The clustering means 60 clusters the utterances by spectral clustering based on the calculated similarity.
Spectral clustering is explained here.
Let the data set to be clustered be C = {i | i = 1, ..., N} and let the similarity between data i and j be w_ij ≥ 0. The partition of the data set C into clusters is expressed by the following equation (1).

In equation (1), Q is the number of clusters.
The average intra-cluster similarity S_W is given by the following equation (2), and the average inter-cluster similarity S_B by the following equation (3).

Spectral clustering is a technique that clusters the data set C so as to make the average intra-cluster similarity S_W large and the average inter-cluster similarity S_B small.

The speaker number determination means 70 determines the number of speakers in the speech signal based on the result of the spectral clustering.

[1-2. Operation of the speaker clustering apparatus]
Next, the specific operation of the speaker clustering apparatus 100 is described with reference to the flowchart of FIG. 2.
First, the speech signal acquisition means 10 acquires the speech signal A input through the input means (not shown) (step 1; "step" is hereinafter abbreviated as "S").
Next, as shown in FIG. 3, the utterance segmentation means 20 divides the input speech signal A into utterances a1, a2, a3, ..., aN (S2). Here, N is the number of utterances.

Next, as shown in FIG. 3, the feature vector time-series conversion unit 31 samples the input speech signal A every 1/100 second and extracts a 30-dimensional feature vector from each sample in time series (S3). This feature vector consists of the feature parameters known as the mel cepstrum, which are commonly used in the field of speech recognition (the feature parameters of the individual dimensions are not shown).

Next, as shown in FIG. 4, the feature vector clustering unit 32 places the 30-dimensional feature vectors in a 30-dimensional feature space and clusters the samples into the groups they form in that space (S4). The clusters generated by the clustering are given mutually distinguishable numbers. The number of clusters generated here corresponds to the codebook size V used when the appearance frequency vector of each utterance is generated, as described later. The codebook size V can be adjusted as appropriate; in this embodiment, for example, 256 clusters are generated and numbered 1 to 256.
Next, as shown in FIG. 5, the code string generation unit 33 uses the number assigned to each cluster as a code and generates a code string in which the codes are arranged in time series (S5).
Through the processing of S3 to S5 above, the speech signal is converted into codes and the vector quantization is complete.
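Steps S4 and S5 can be sketched in a few lines. The text above does not name the clustering algorithm, so plain k-means (Lloyd's algorithm) is used here as an assumption, with the first frames as seeds, 2-dimensional toy frames instead of 30-dimensional mel-cepstrum vectors, and 2 codes instead of 256. All function names are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )

def train_codebook(frames, num_codes, iterations=10):
    """Plain k-means; the first num_codes frames seed the centroids."""
    centroids = [list(f) for f in frames[:num_codes]]
    for _ in range(iterations):
        members = [[] for _ in range(num_codes)]
        for f in frames:
            members[nearest(f, centroids)].append(f)
        for k, group in enumerate(members):
            if group:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(group) for dim in zip(*group)]
    return centroids

def quantize(frames, centroids):
    """Replace each frame by its cluster number (numbered from 1, as in S4)."""
    return [nearest(f, centroids) + 1 for f in frames]

# Toy feature vectors in time order (illustrative values only).
frames = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2]]
codebook = train_codebook(frames, num_codes=2)
codes = quantize(frames, codebook)
```

The resulting list `codes` plays the role of the code string of FIG. 5: one cluster number per sampled frame, in time order.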

Next, as shown in FIG. 6, the appearance frequency generation means 40 generates a codebook for each utterance ai and generates an appearance frequency vector whose components are the numbers of appearances of the individual codes in the utterance ai (S6). The codebook size V of the codebook generated here corresponds to the number of clusters generated by the feature vector clustering unit 32 described above.
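Step S6 amounts to building a histogram of codes for each utterance. A minimal sketch, with a hypothetical function name, a codebook of 4 codes instead of 256, and made-up code strings:

```python
def appearance_frequency_vector(codes, codebook_size):
    """Count how often each code 1..codebook_size occurs in one utterance."""
    counts = [0] * codebook_size
    for code in codes:
        counts[code - 1] += 1  # codes are numbered from 1
    return counts

# Code strings of two utterances (illustrative values only).
utterances = {"a1": [1, 3, 3, 2], "a2": [2, 2, 2]}
vectors = {
    name: appearance_frequency_vector(codes, codebook_size=4)
    for name, codes in utterances.items()
}
```

Every utterance yields a vector of the same length V regardless of its duration, which is what makes the pairwise comparison in the next step possible.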

Next, the similarity calculation means 50 calculates the cosine distance between utterances based on the appearance frequency vectors of the utterances ai and takes this cosine distance as the similarity (S7).
The similarity w_ij^(0) between utterance ai and utterance aj can be calculated by the following equation (4). In this embodiment, post-processing is applied to w_ij^(0) to obtain the final similarity w_ij, so w_ij^(0) is referred to as the pre-adjustment similarity.
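The image of equation (4) is not reproduced in this text. Given that the similarity is the cosine between the appearance frequency vectors and that the symbols f_iv, f_jv, v, v' and V are defined for it, the equation presumably has the following standard form; this is a reconstruction under that assumption, not a copy of the original.

```latex
w_{ij}^{(0)} =
  \frac{\sum_{v=1}^{V} f_{iv}\, f_{jv}}
       {\sqrt{\sum_{v'=1}^{V} f_{iv'}^{2}}\;\sqrt{\sum_{v'=1}^{V} f_{jv'}^{2}}}
\tag{4}
```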

In equation (4), f_iv is the appearance frequency of code v in utterance ai, f_jv is the appearance frequency of code v in utterance aj, i and j are integers from 1 to N, v and v' are integers from 1 to V, N is the number of utterances, and V is the codebook size of the appearance frequency vector of each utterance.

The pre-adjustment similarity w_ij^(0) is then adjusted (post-processed). With the maximum similarity w_i^* of utterance ai given by the following equation (5), any pre-adjustment similarity w_ij^(0) smaller than ε times the maximum similarity w_i^* is set to 0 (this processing is also referred to below as trimming), and the similarity w_ij between utterance ai and utterance aj is calculated as shown in the following equation (6). Here, ε is a constant with 0 < ε < 1 called the similarity trimming coefficient, and it can be adjusted as appropriate.
A similarity w_ij^(0) smaller than ε times the maximum similarity w_i^* is clearly a small similarity. Approximating such clearly small similarities by 0 therefore reduces the amount of computation in the clustering described later.

In equation (5), i and j are integers from 1 to N, and N is the number of utterances.

In equation (6), i and j are integers from 1 to N, and N is the number of utterances.
The calculated similarities between utterances can be expressed as a matrix, as shown in FIG. 7. In FIG. 7, the similarity between utterance a1 and utterance a3 is high at “150”, whereas the similarity between utterance a1 and utterance a2 is “0”, indicating that they are not similar.
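The trimming described for equations (5) and (6) can be sketched as follows. Because the equations are not reproduced in this text, the sketch assumes that the maximum similarity w_i^* is taken over the other utterances (j different from i); the function name trim_similarities and the sample matrix are hypothetical.

```python
def trim_similarities(w0, eps):
    """Set to 0 every entry smaller than eps times its row's maximum similarity."""
    n = len(w0)
    w = []
    for i in range(n):
        # Assumed form of equation (5): largest similarity of utterance i
        # to any other utterance.
        w_star = max((w0[i][j] for j in range(n) if j != i), default=0.0)
        # Assumed form of equation (6): keep the value unless it is below eps * w_star.
        w.append([w0[i][j] if w0[i][j] >= eps * w_star else 0.0 for j in range(n)])
    return w


w0 = [[1.0, 0.9, 0.1],
      [0.9, 1.0, 0.4],
      [0.1, 0.4, 1.0]]
print(trim_similarities(w0, eps=0.5))
# [[1.0, 0.9, 0.0], [0.9, 1.0, 0.0], [0.0, 0.4, 1.0]]
```

Because the threshold depends on the row, the trimmed matrix in this sketch need not be symmetric.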

Next, the clustering means 60 performs spectral clustering based on the similarities shown in FIG. 7 (S8). The specific method is described below.
First, from the similarity matrix W whose elements are the similarities w_ij calculated in S7, a Laplacian matrix L (an N-th order square matrix) is calculated by equation (7) below.

In equation (7), I is the identity matrix, D is a diagonal matrix, i, j, and k are integers from 1 to N, and N is the number of utterances.
The Q eigenvectors v_iq of the Laplacian matrix L are expressed by equation (8) below.
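Equation (7) is not reproduced in this text. One common form that is consistent with the symbols I and D and the indices i, j, k is the symmetric normalized Laplacian L = I - D^(-1/2) W D^(-1/2) with D_ii = Σ_k w_ik. The sketch below assumes that form; the original may use a different normalization, and the function name is hypothetical.

```python
import math


def normalized_laplacian(w):
    """Return L = I - D^(-1/2) W D^(-1/2) for a square similarity matrix W."""
    n = len(w)
    d = [sum(row) for row in w]  # D_ii = sum over k of w_ik
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if d[i] > 0 and d[j] > 0:
                scaled = w[i][j] / math.sqrt(d[i] * d[j])
            else:
                scaled = 0.0  # isolated utterance: no similarity mass to normalize
            lap[i][j] = (1.0 if i == j else 0.0) - scaled
    return lap
```

For a block-diagonal W (no similarity between groups of utterances), this matrix has one zero eigenvalue per block, which is the degenerate situation the description relies on when estimating the number of speakers.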

From the eigenvectors v_iq given by equation (8), an N-row, Q-column matrix Y whose elements are y_iq is calculated by equation (9) below.

In equation (9), i is an integer from 1 to N, q is an integer from 1 to Q, N is the number of utterances, and Q is the number of eigenvectors.
The N row vectors of the matrix Y, whose elements are the y_iq given by equation (9), are partitioned into Q clusters by the k-means clustering expressed by equation (10) below.

In equation (10), α and q are integers from 1 to Q, i is an integer from 1 to N, Q is the number of eigenvectors, and N is the number of utterances.
Utterances with similar features are grouped into each of the Q clusters obtained in this way, and the utterances belonging to one cluster can be determined to be utterances by the same speaker.
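The partitioning of the row vectors of Y into Q clusters can be sketched with a plain k-means loop. Equation (10) is not reproduced in this text, so the squared Euclidean distance, the deterministic farthest-point initialization, and the function name kmeans_rows are assumptions made for illustration.

```python
def kmeans_rows(rows, q, iters=100):
    """Partition row vectors into q clusters; returns one cluster index per row."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Assumed initialization: start from the first row, then repeatedly add
    # the row farthest from all centers chosen so far.
    centers = [list(rows[0])]
    while len(centers) < q:
        far = max(rows, key=lambda r: min(dist2(r, c) for c in centers))
        centers.append(list(far))

    labels = None
    for _ in range(iters):
        # Assignment step: each row goes to its nearest center.
        new_labels = [min(range(q), key=lambda a: dist2(r, centers[a])) for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its member rows.
        for a in range(q):
            members = [r for r, lab in zip(rows, labels) if lab == a]
            if members:
                centers[a] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical rows of Y for four utterances and Q = 2.
print(kmeans_rows([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], q=2))  # [0, 0, 1, 1]
```

Rows that receive the same index correspond to utterances attributed to the same speaker.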

Next, the speaker number determination means 70 determines the number of speakers conversing in the speech signal (S9). In spectral clustering, in the ideal case where the average inter-cluster similarity S_B given by equation (3) above is 0 (zero), it can be shown mathematically that the minimum eigenvalue 0 of the Laplacian matrix L is Q-fold degenerate. When the deviation from the ideal case is small, a perturbation analysis shows that a large gap arises in the eigenvalue difference λ_{Q+1} - λ_Q. Here, Q is the number of clusters produced by spectral clustering. Based on this finding, the number of clusters Q can be calculated by equation (11) below.

In equation (11), λ is an eigenvalue and i is an integer.
After the above processing is complete, the speaker clustering device 100 outputs the speaker identification result and the number of speakers in the input speech signal to an output means (not shown), and then ends the operation.
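The speaker-count estimate of step S9 can be illustrated as follows. Equation (11) is not reproduced in this text; from the description, Q appears to be the index i at which the gap λ_{i+1} - λ_i between successive eigenvalues, taken in ascending order, is largest. The sketch assumes that reading; the function name and the sample eigenvalues are hypothetical.

```python
def estimate_num_speakers(eigenvalues):
    """Return the 1-based index i that maximizes lambda_{i+1} - lambda_i."""
    lam = sorted(eigenvalues)  # ascending: lam[0] is the smallest eigenvalue
    gaps = [lam[i + 1] - lam[i] for i in range(len(lam) - 1)]
    # gaps[g] is the gap after the (g+1)-th smallest eigenvalue, hence the +1.
    return max(range(len(gaps)), key=gaps.__getitem__) + 1


# Three eigenvalues near 0 followed by a large gap suggest three speakers.
print(estimate_num_speakers([0.0, 0.01, 0.02, 0.8, 0.9, 1.0]))  # 3
```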

[1-3. Effects of First Embodiment]
The first embodiment described above provides the following effects.
The vector quantization means 30 converts the speech signal into codes (quantization), and the appearance frequency generation means 40 generates, for each utterance, an appearance frequency vector whose components correspond to the respective codes. The similarity calculation means 50 calculates the cosine distance using the appearance frequency vectors and takes this cosine distance as the similarity.
When the similarity between utterances with different characteristics is obtained from the appearance frequency vectors using the KL distance, processing such as replacing zero-valued components in the appearance frequency vector of one utterance with other values is required, which lowers accuracy. In this embodiment, by contrast, the similarity is obtained using the cosine distance, so when a component in the appearance frequency vector of one utterance is 0 (zero), its contribution to the similarity is also zero (see equation (1) above), and no special processing is needed. More accurate speaker clustering can therefore be performed.

Further, with conventional methods that use the KL distance or the Euclidean distance, the speaker characteristics of an utterance sometimes could not be captured sufficiently when the utterance was short. In this embodiment, by contrast, the speech signal is converted into codes, the cosine distance is calculated using appearance frequency vectors whose components are the numbers of appearances of the codes, and the similarity between utterances is obtained from it, so accurate speaker characteristics can be obtained regardless of utterance length. Even for short utterances, the speaker characteristics can therefore be obtained accurately, and as a result highly accurate speaker clustering can be performed.

Furthermore, in the above embodiment, the speaker number determination means 70 can calculate the number of speakers in the speech signal by equation (11) above, which uses the eigenvalues of the Laplacian matrix L. The number of speakers can therefore be determined quantitatively.

In addition, in the above embodiment, as shown in equation (6), similarities that are clearly small are approximated by 0, which improves the accuracy of estimating the number of speakers when clustering is performed.

[2. Second Embodiment]
Next, a second embodiment of the present invention will be described. In the second embodiment, the operation of the appearance frequency generation means differs from that of the appearance frequency generation means 40 of the first embodiment. Descriptions of configurations and operations identical to those of the first embodiment are omitted.
The appearance frequency generation means weights the number of appearances of each code in the code string and generates, for each utterance, an appearance frequency vector whose components are the weighted numbers of appearances.
As a specific weighting method, the TF/IDF (Term Frequency / Inverse Document Frequency) method is used. Specifically, the number of appearances of a code in each utterance is weighted by the reciprocal of the number of utterances in which that code appears. In this way the number of appearances of each code is weighted for each utterance, and the appearance frequency generation means generates an appearance frequency vector whose components are the weighted numbers of appearances.
Using the appearance frequency vectors generated in this way, the similarity calculation means 50 calculates the similarity in the same manner as in the first embodiment.
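The weighting described for the second embodiment can be sketched as follows. The sketch follows the text literally and divides each count by the number of utterances containing the code, without the logarithm used in many TF/IDF variants; the function name and the sample vectors are hypothetical.

```python
def weight_by_inverse_utterance_count(freq_vectors):
    """Divide each code count by the number of utterances in which that code appears."""
    n_codes = len(freq_vectors[0])
    # df[v]: number of utterances in which code v appears at least once.
    df = [sum(1 for f in freq_vectors if f[v] > 0) for v in range(n_codes)]
    return [
        [f[v] / df[v] if df[v] > 0 else 0.0 for v in range(n_codes)]
        for f in freq_vectors
    ]


# Two hypothetical utterances over three codes; the first code occurs in both.
print(weight_by_inverse_utterance_count([[4, 2, 0], [2, 0, 3]]))
# [[2.0, 2.0, 0.0], [1.0, 0.0, 3.0]]
```

The code shared by both utterances is halved, while codes unique to one utterance keep their full count.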

The second embodiment provides the following effect in addition to the effects of the first embodiment.
The number of appearances of each code in each utterance is weighted by the TF/IDF method. In this embodiment, a code that characterizes an utterance is weighted more heavily the less it appears in other utterances. That is, a feature (code) that appears in every utterance is not treated as characteristic of a given utterance, whereas features (codes) that do not appear in other utterances are given greater weight.
Therefore, the features of the speech signal converted into codes can be extracted more accurately, and more accurate speaker clustering can be performed.

[3. Third Embodiment]
Next, a third embodiment of the present invention will be described. In the third embodiment, the components of the appearance frequency vector generated by the appearance frequency generation means differ from those of the appearance frequency generation means 40 of the first embodiment. Descriptions of configurations identical to those of the first embodiment are omitted.

The appearance frequency generation means generates, for each utterance, an appearance frequency vector whose components are the numbers of appearances of code strings formed by combinations of consecutive codes in the codes converted from the speech signal.
A specific method used by the appearance frequency generation means of the third embodiment will be described with reference to FIGS. 8 and 9.
As shown in FIG. 8, the code string converted from the speech signal is treated as a string of unit codes u1, u2, u3, ..., each consisting of two consecutive codes. Identical combinations of codes are regarded as the same unit code. For example, in FIG. 8, the first unit code u1 in the code string is the combination “5, 9”. Accordingly, every occurrence of the sequence “5, 9” in the code string is the unit code u1.

As shown in FIG. 9, the appearance frequency generation means generates, for each utterance ai, an appearance frequency vector whose components are the numbers of appearances of the unit codes ui. Here, the codebook size for the utterance ai corresponds to the number of unit codes ui.
Using the appearance frequency vectors generated in this way, the similarity calculation means 50 calculates the similarity in the same manner as in the first embodiment.
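Counting unit codes can be sketched as follows. The text does not state whether consecutive pairs overlap; the sketch assumes a sliding window that advances one code at a time. The function name unit_code_counts and the sample sequence are hypothetical.

```python
from collections import Counter


def unit_code_counts(codes, n=2):
    """Count each run of n consecutive codes (a unit code) in one utterance."""
    # Assumption: overlapping windows, so [5, 9, 2] yields (5, 9) and (9, 2).
    units = [tuple(codes[k:k + n]) for k in range(len(codes) - n + 1)]
    return Counter(units)


counts = unit_code_counts([5, 9, 2, 5, 9])
print(counts[(5, 9)])  # 2
```

Passing n=3 or n=4 gives the longer unit codes mentioned among the modifications.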

The third embodiment provides the following effect in addition to the effects of the first embodiment.
The appearance frequency generation means generates an appearance frequency vector whose components are based on combinations of consecutive codes in the codes converted from the speech signal. A combination of consecutive codes represents a feature of the utterance and characterizes the utterance more distinctively than a single code does. In addition, identical combinations of codes can be judged to be the same feature. Therefore, by using an appearance frequency vector whose components are the numbers of appearances of combinations of consecutive codes, the features of an utterance can be extracted more accurately and a highly accurate similarity can be calculated. As a result, more accurate speaker clustering can be performed.

[4. Fourth Embodiment]
Next, a fourth embodiment of the present invention will be described. The fourth embodiment differs from the first embodiment in that it further includes a variation vector quantization means and uses the variation as components of the appearance frequency vector. Descriptions of configurations identical to those of the first embodiment are omitted.

FIG. 10 shows the configuration of the speaker clustering device 101 of the fourth embodiment.
As shown in FIG. 10, the speaker clustering device 101 includes a speech signal acquisition means 10, an utterance classification means 20, a vector quantization means 30, a variation vector quantization means 80, an appearance frequency generation means 41, a similarity calculation means 50, a clustering means 60, and a speaker number determination means 70.

The variation vector quantization means 80 converts the time-series variation of the speech signal into codes using vector quantization. That is, as shown in FIG. 11, a variation code string is a new code string formed from the variations (differences) between adjacent codes in the code string generated by the vector quantization means 30.

The appearance frequency generation means 41 generates an appearance frequency vector whose components are the numbers of appearances of each code in two code strings: the code string generated by the vector quantization means 30 and the variation code string generated by the variation vector quantization means 80. Specifically, for each utterance ai, an appearance frequency vector whose components are the numbers of appearances of each code in the code string is generated as in the first embodiment, and the codebook size of this vector is then extended with components holding the numbers of appearances of each variation in the variation code string. In FIG. 12, the codebook size of the appearance frequency vector whose components are the numbers of appearances of each code in the code string of the utterance ai is 256, and the components for the variation code string are generated from position 257 onward. The horizontal axis from 257 onward in the appearance frequency vector shown in FIG. 12 represents the variation. For example, the component for variation 4 is 3 for utterance a1 and 3 for utterance a3. The codebook size in this case is variable and is extended by the number of variations.
Using the appearance frequency vectors generated in this way, the similarity calculation means 50 calculates the similarity in the same manner as in the first embodiment.
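The extended appearance frequency vector of the fourth embodiment can be sketched as follows. The sketch assumes that the variation is the signed difference between adjacent codes and that all utterances share one ordered list of variation values (delta_values); both assumptions, the function name, and the small codebook size are illustrative only.

```python
from collections import Counter


def extended_frequency_vector(codes, codebook_size, delta_values):
    """Code counts (positions 1..V) followed by counts of adjacent-code differences."""
    code_counts = Counter(codes)
    # Variation code string: difference between each code and its predecessor.
    deltas = [b - a for a, b in zip(codes, codes[1:])]
    delta_counts = Counter(deltas)
    base = [code_counts.get(v, 0) for v in range(1, codebook_size + 1)]
    extension = [delta_counts.get(d, 0) for d in delta_values]
    return base + extension


# Hypothetical utterance with V = 9 and two variation values shared by all utterances.
print(extended_frequency_vector([1, 5, 9, 5], codebook_size=9, delta_values=[-4, 4]))
# [1, 0, 0, 0, 2, 0, 0, 0, 1, 1, 2]
```

With a codebook size of 256, as in FIG. 12, the variation components would start at position 257.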

The fourth embodiment provides the following effect in addition to the effects of the first embodiment.
Because the time-series variation of the speech signal represents a feature of the utterance, identical variations within utterances can be judged to be the same feature. In the fourth embodiment, the appearance frequency vector includes the numbers of appearances of the variations in addition to the numbers of appearances of each code in the code string, so the features of an utterance can be extracted more accurately and a highly accurate similarity can be calculated. As a result, more accurate speaker clustering can be performed.

[5. Modifications]
The present invention is not limited to the embodiments described above, and various modifications are possible within the scope of the gist of the present invention.
For example, in the above embodiment, the similarity w_ij is obtained by calculating the pre-adjustment similarity w_ij^(0) by equation (4) above and then applying the adjustment shown in equations (5) and (6) above; however, the pre-adjustment similarity w_ij^(0) calculated by equation (4) may be used as the similarity w_ij as is. In that case the similarity is more exact, so more accurate speaker clustering can be performed.

Also, in the above embodiment, when the adjustment (post-processing) is applied to the pre-adjustment similarity w_ij^(0), any pre-adjustment similarity smaller than ε times the maximum similarity w_i^* is set to 0 (see equation (6)); however, the method of approximating pre-adjustment similarities by 0 is not limited to this. For example, a predetermined number of the highest similarities between utterances may be retained, and all other pre-adjustment similarities approximated by 0. This reduces the amount of computation when clustering is performed.
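The alternative trimming can be sketched as follows. The text does not say whether the predetermined number applies per utterance or to the whole matrix; the sketch assumes it applies per utterance (per row) and always keeps the diagonal entry. The function name keep_top_k and the sample matrix are hypothetical.

```python
def keep_top_k(w0, k):
    """Keep the k largest off-diagonal similarities in each row; set the rest to 0."""
    n = len(w0)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the other utterances, most similar first.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: w0[i][j], reverse=True)
        for j in others[:k]:
            w[i][j] = w0[i][j]
        w[i][i] = w0[i][i]  # assumption: self-similarity is always retained
    return w


w0 = [[1.0, 0.9, 0.1],
      [0.9, 1.0, 0.4],
      [0.1, 0.4, 1.0]]
print(keep_top_k(w0, k=1))
# [[1.0, 0.9, 0.0], [0.9, 1.0, 0.0], [0.0, 0.4, 1.0]]
```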

Furthermore, in the third embodiment the unit code ui is a combination of two consecutive codes, but the number of codes is not limited to this. A sequence of three consecutive codes or of four consecutive codes may be used as the unit code. Note that as the number of codes increases, the similarity becomes more precise, but identical unit codes appear less often and the utterance characteristics may no longer be captured, so the number is adjusted as appropriate.

The present invention can be used in the field of speech recognition, for example to improve speech recognition accuracy and to support viewing through functions such as speaker-by-speaker skip playback.

DESCRIPTION OF SYMBOLS
10 ... Speech signal acquisition means
20 ... Utterance classification means
30 ... Vector quantization means
40, 41 ... Appearance frequency generation means
50 ... Similarity calculation means
60 ... Clustering means
70 ... Speaker number determination means
80 ... Variation vector quantization means
100 ... Speaker clustering device

Claims

A speaker clustering device for identifying utterances of the same speaker in an input speech signal, comprising:
vector quantization means for converting the speech signal into codes using vector quantization;
appearance frequency generation means for generating, for each utterance in the converted codes, an appearance frequency vector whose components are the numbers of appearances of the codes;
similarity calculation means for calculating a cosine distance between the utterances based on the appearance frequency vectors and taking the cosine distance as a similarity between the utterances; and
clustering means for classifying the utterances based on the similarity between the utterances.

The speaker clustering device according to claim 1,
wherein the appearance frequency generation means weights each component of the appearance frequency vector by the reciprocal of the number of utterances in which the corresponding code appears.

The speaker clustering device according to claim 1 or 2,
wherein the appearance frequency generation means generates an appearance frequency vector whose components are the numbers of appearances of code strings formed by combinations of consecutive codes in each utterance.

The speaker clustering device according to any one of claims 1 to 3,
further comprising variation vector quantization means for converting a time-series variation of the speech signal into codes using vector quantization,
wherein the appearance frequency generation means generates, for each utterance, the appearance frequency vector whose components are the numbers of appearances of the codes of the speech signal and the numbers of appearances of the codes of the variation.

The speaker clustering device according to any one of claims 1 to 4,
wherein the clustering means
performs spectral clustering using the similarity between utterances based on the calculated cosine distance.

The speaker clustering device according to any one of claims 1 to 5,
further comprising speaker number determination means for calculating the eigenvalues of a square matrix obtained from a similarity matrix whose elements are the similarities between the utterances obtained by the similarity calculation means, and for determining the number of speakers in the speech signal based on the eigenvalue at which the difference between successive eigenvalues takes its maximum value.

The speaker clustering device according to any one of claims 1 to 6,
wherein the similarity calculation means
obtains the similarity by setting to 0 any similarity below a threshold calculated by multiplying the maximum of the calculated similarities between utterances by a predetermined coefficient.

The speaker clustering device according to any one of claims 1 to 6,
wherein the similarity calculation means
obtains the similarity by setting to 0, among the calculated similarities between utterances, all similarities other than a predetermined number of the highest similarities.

A speaker clustering method for identifying utterances of the same speaker in an input speech signal, comprising:
a vector quantization step of converting the speech signal into codes using vector quantization;
an appearance frequency generation step of generating, for each utterance in the converted codes, an appearance frequency vector whose components are the numbers of appearances of the codes;
a similarity calculation step of obtaining a similarity between the utterances based on the appearance frequency vectors; and
a clustering step of classifying the utterances based on the similarity between the utterances.