JP2009020458A

JP2009020458A - Voice processing apparatus and program

Info

Publication number: JP2009020458A
Application number: JP2007184872A
Authority: JP
Inventors: Mikio Higashiyama; 三樹夫東山; Michiko Kazama; 道子風間; Yasuo Yoshioka; 靖雄吉岡
Original assignee: Waseda University; Yamaha Corp
Current assignee: Waseda University; Yamaha Corp
Priority date: 2007-07-13
Filing date: 2007-07-13
Publication date: 2009-01-29
Anticipated expiration: 2027-07-13
Also published as: JP5083951B2

Abstract

PROBLEM TO BE SOLVED: To accurately classify, for each speaker, a plurality of sections obtained by segmenting an audio signal.

SOLUTION: A feature extraction unit 41 extracts a feature amount for each of N sections B into which an audio signal S is segmented on the time axis. An index calculation unit 43 calculates, for every combination of two sections B selected from the N sections B, a similarity index value indicating the similarity of the feature amounts of the two sections B. A voice classification unit 45 classifies the N sections B into a plurality of sets on the basis of the similarity index values so that each of the N sections B and the section B whose feature amount is most similar to it belong to the same set.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a technique for classifying (clustering), for each speaker, a plurality of sections obtained by dividing an audio signal on the time axis.

If an audio signal recorded in an environment where a plurality of speakers speak at arbitrary times (for example, a conference) can be segmented and classified for each speaker, it can conveniently be used, for example, to prepare the minutes of the conference. Patent Document 1 discloses a technique in which an acoustic feature amount is extracted for each of a plurality of sections obtained by dividing an audio signal on the time axis, and a plurality of sections whose degree of matching (similarity) of feature amounts exceeds a threshold are classified as audio signals of the same speaker.
Patent Document 1: JP 2005-321530 A

However, in the technique of Patent Document 1, the threshold against which the degree of matching is compared is a fixed value, so the audio signal of each section may not be classified accurately depending on the conditions at the time of utterance (for example, the length of the utterance or the S/N ratio). The threshold could conceivably be controlled variably according to the utterance conditions, but it is difficult to set an optimum threshold for the wide variety of such conditions. In view of the above circumstances, an object of the present invention is to accurately classify, for each speaker, a plurality of sections into which an audio signal is divided.

In order to solve the above problem, a voice processing apparatus according to the present invention includes: feature extraction means for extracting a feature amount for each of a plurality of sections obtained by dividing an audio signal on the time axis; index calculation means for calculating, for a plurality of combinations each selecting two sections from the plurality of sections, a similarity index value indicating the similarity of the feature amounts of the two sections; and voice classification means for classifying the plurality of sections into a plurality of sets on the basis of the similarity index values of the sections so that each of the plurality of sections and the section whose feature amount is most similar to it belong to the same set. With this configuration, each of the plurality of sections is classified into the same set as the section whose feature amount is most similar to it, so comparison of the similarity index value with a predetermined threshold is in principle unnecessary. It is therefore possible to reduce the influence of the utterance conditions (for example, the volume of the voice or of noise) on classification accuracy and to classify the plurality of sections accurately.

In a preferred aspect of the present invention, the index calculation means calculates the cross-correlation value of the feature amounts of the two sections as the similarity index value. This aspect has the advantage that the processing load for calculating the similarity index value is reduced. The similarity index value is not, however, limited to the cross-correlation value of the feature amounts. For example, in an aspect in which the feature extraction means extracts a time series of feature vectors of the audio signal as the feature amount for each of the plurality of sections, either of the following configurations may be adopted. In one, the index calculation means calculates, as the similarity index value, the average likelihood with which the feature vectors of one of the two sections appear from a mixture model that models the distribution of the feature vectors of the other section with a plurality of probability distributions. In the other, the index calculation means calculates, as the similarity index value, the average vector quantization distortion between a codebook obtained by vector quantization of the time series of feature vectors of one of the two sections and the feature vectors of the other section.

In a preferred aspect of the present invention, the voice classification means classifies, among the plurality of sections, a section whose similarity index values with respect to all the other sections indicate dissimilarity (for example, a section whose similarity rank falls below a predetermined value, or a section whose similarity is the lowest for each of the other sections) into a set separate from all the other sections. According to this aspect, even when, for example, a speaker has spoken in only one section, that section can be classified into a set separate from the other sections.

A voice processing apparatus according to a preferred aspect of the present invention includes voice segmentation means for dividing the audio signal into a plurality of sections with each valley in the envelope of the waveform of the audio signal as a boundary. According to this aspect, the audio signal is divided into a plurality of sections at the valleys of its envelope, so even when, for example, a plurality of speakers speak one after another with almost no interval, the utterances of the respective speakers can be divided into separate sections. The accuracy of classification by the voice classification means can therefore be improved.

The voice processing apparatus according to the present invention may be realized by hardware (an electronic circuit) such as a DSP (Digital Signal Processor) dedicated to audio processing, or by cooperation between a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program. The program according to the present invention causes a computer to execute: a feature extraction process (for example, step S2 in FIG. 3) of extracting a feature amount for each of a plurality of sections obtained by dividing an audio signal on the time axis; an index calculation process (for example, step S5 in FIG. 3) of calculating, for a plurality of combinations each selecting two sections from the plurality of sections, a similarity index value indicating the similarity of the feature amounts of the two sections; and a voice classification process (for example, step S8 in FIG. 3) of classifying the plurality of sections into a plurality of sets on the basis of the similarity index values of the sections so that each of the plurality of sections and the section whose feature amount is most similar to it belong to the same set. This program provides the same operations and effects as the voice processing apparatus according to the present invention. The program of the present invention may be provided to a user in a form stored on a computer-readable recording medium and installed on a computer, or may be provided from a server device by distribution via a communication network and installed on a computer.

The present invention is also specified as a method of processing voice. A voice processing method according to one aspect of the present invention includes: a feature extraction procedure of extracting a feature amount for each of a plurality of sections obtained by dividing an audio signal on the time axis; an index calculation procedure of calculating, for a plurality of combinations each selecting two sections from the plurality of sections, a similarity index value indicating the similarity of the feature amounts of the two sections; and a voice classification procedure of classifying the plurality of sections into a plurality of sets on the basis of the similarity index values of the sections so that each of the plurality of sections and the section whose feature amount is most similar to it belong to the same set. This method provides the same operations and effects as the voice processing apparatus according to the present invention.

<A: First Embodiment>
FIG. 1 is a block diagram showing the configuration of a voice processing apparatus according to a first embodiment of the present invention. As shown in FIG. 1, the voice processing apparatus 100 is a computer system that includes a control device 10 and a storage device 20. The control device 10 is an arithmetic processing device that executes a program. The storage device 20 stores the program executed by the control device 10 and various data used by the control device 10. Any known storage medium, such as a semiconductor storage device or a magnetic storage device, may be adopted as the storage device 20. An output device 30 is connected to the control device 10. The output device 30 of this embodiment is a display device that displays various images under the control of the control device 10.

The storage device 20 stores an audio signal S representing the waveform of sound on the time axis. The sound represented by the audio signal S in this embodiment was collected with a sound collection device at a conference in which a plurality of participants speak at arbitrary times. FIG. 2 illustrates the waveform of the audio signal S on the time axis.

The control device 10 in FIG. 1 creates the minutes of a conference from the audio signal S by executing a program stored in the storage device 20. The minutes are a record of the conference in which the content of each participant's remarks is arranged in time series for each participant. As shown in FIG. 1, the control device 10 functions as a voice segmentation unit 12, a classification processing unit 14, and a voice recognition unit 16. Each function of the control device 10 in FIG. 1 may also be realized by an electronic circuit such as a DSP dedicated to audio processing. The control device 10 may also be implemented in a distributed manner across a plurality of integrated circuits.

As shown in FIG. 2, the voice segmentation unit 12 divides the audio signal S into a plurality of sections B along the time axis. One section B is a period during which a single speaker is estimated to be highly likely to have spoken continuously. In a series of human utterances (particularly remarks in a conference), the volume generally tends to increase gradually from the start of the utterance and to decrease gradually from some midpoint toward the end of the utterance. In view of this tendency, the voice segmentation unit 12 of this embodiment divides the audio signal S into a plurality of sections B with each of a plurality of valleys D appearing in the envelope E of the waveform of the audio signal S as a boundary, as shown in FIG. 2.

With the above configuration, the utterances of the respective speakers can be divided into separate sections B even when, for example, the last part of one speaker's utterance overlaps the first part of another speaker's utterance, or a plurality of speakers speak one after another without an interval. The following description assumes that the audio signal S has been divided into N sections B (N is a natural number). A unique identifier (number) is assigned to each of the N sections B.
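The valley-based segmentation described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the text does not say how the envelope E is computed, so a moving average of the absolute amplitude is assumed here, and the function names and the `window` parameter are hypothetical.

```python
def envelope(samples, window=5):
    """Crude amplitude envelope: moving average of |x| (assumed method)."""
    half = window // 2
    mags = [abs(s) for s in samples]
    env = []
    for i in range(len(mags)):
        lo = max(0, i - half)
        hi = min(len(mags), i + half + 1)
        env.append(sum(mags[lo:hi]) / (hi - lo))
    return env


def valley_indices(env):
    """Indices of local minima (valleys D); a flat valley is reported once."""
    return [i for i in range(1, len(env) - 1)
            if env[i] < env[i - 1] and env[i] <= env[i + 1]]


def split_at_valleys(samples, window=5):
    """Return (start, end) index pairs, end exclusive, one per section B."""
    cuts = [0] + valley_indices(envelope(samples, window)) + [len(samples)]
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if b > a]


# Two "utterances" whose volume rises and falls, meeting at one valley.
signal = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 2, 1, 0]
sections = split_at_valleys(signal, window=1)
```

On real recordings a much longer smoothing window would be needed so that only utterance-level valleys, not individual waveform cycles, become boundaries.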

The classification processing unit 14 in FIG. 1 is means for classifying, for each speaker (conference participant), the N sections B into which the voice segmentation unit 12 has divided the audio signal S. That is, sections B of the audio signal S that are likely to have been uttered by the same speaker are classified into a common set (cluster). The classification processing unit 14 stores the result of the classification in the storage device 20. Specifically, the classification processing unit 14 stores in the storage device 20, in association with one another, the identification code of each of the plurality of speakers, the start and end times of each section B classified into that speaker's cluster, and the audio signal S of each such section B.

The voice recognition unit 16 in FIG. 1 identifies, as text, the content of each speaker's remarks on the basis of the sections B of the audio signal S classified into each cluster. Any known speech recognition technique may be adopted for recognizing text from the audio signal S of each section B. For example, the voice recognition unit 16 first updates an initial acoustic model according to the acoustic feature amounts of the audio signal S of the sections B classified into one cluster (speaker adaptation), thereby generating an acoustic model that specifically reflects the characteristics of the speaker corresponding to that cluster. It then identifies the text spoken by the speaker by comparing the speaker-adapted acoustic model with the feature amounts extracted from the audio signal S of each section B in the cluster.

The control device 10 outputs the result of the processing by the voice recognition unit 16 to the output device 30. The output device 30 displays an image of the minutes in which the time of each remark, the identification code of the speaker (for example, the speaker's name), and the text identified by the voice recognition unit 16 for that remark are arranged in time series.

Next, the classification processing unit 14 is described in detail. As shown in FIG. 1, the classification processing unit 14 includes a feature extraction unit 41, an index calculation unit 43, and a voice classification unit 45. The feature extraction unit 41 extracts, as an acoustic feature amount, the averaged power spectrum of the audio signal S (hereinafter the "average power spectrum") for each of the N sections B. The index calculation unit 43 calculates the cross-correlation value Cor of the average power spectra of two sections B for every combination of two sections B selected from the N sections B (C(N, 2) = N(N−1)/2 combinations). The cross-correlation value Cor is a numerical value (similarity index value) that serves as an index of how similar the shapes of two average power spectra are. The voice classification unit 45 classifies (clusters) the N sections B into a plurality of clusters on the basis of the cross-correlation values Cor so that sections B whose average power spectra are similar to each other belong to the same cluster.

FIG. 3 is a flowchart showing the specific operation of the classification processing unit 14. The processing of FIG. 3 starts when the voice segmentation unit 12 divides the audio signal S into N sections B in response to an instruction to create minutes. The feature extraction unit 41 executes steps S1 to S3, the index calculation unit 43 executes steps S4 to S7, and the voice classification unit 45 executes steps S8 and S9.

As shown in FIG. 3, the feature extraction unit 41 selects one section B from among the N sections and acquires the audio signal S within that section B from the storage device 20 (step S1). The feature extraction unit 41 then extracts the average power spectrum of the section B selected in step S1 as the feature amount (step S2). Specifically, the feature extraction unit 41 calculates the power spectrum of each frame by performing frequency analysis, including FFT (Fast Fourier Transform) processing, on the audio signal S of each of a plurality of frames into which the section B is divided, and calculates the average power spectrum by averaging the power spectra over all frames in the section B. The intensity at a given frequency in the average power spectrum calculated in step S2 is the average of the intensities at that frequency in the power spectra of the frames in the section B. The processing of steps S1 and S2 is repeated for each of the N sections B (step S3). When the average power spectra (N in total) have been calculated for all N sections B, the processing proceeds to step S4.
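Step S2 can be illustrated with the sketch below, which frames one section B, computes a power spectrum per frame with an FFT, and averages the spectra bin by bin. The frame length, the absence of windowing and overlap, and the function names are assumptions made for illustration; the text only states that the frequency analysis includes FFT processing.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def average_power_spectrum(samples, frame_len=8):
    """Average |FFT|^2 over the non-overlapping frames of one section B."""
    if frame_len < 1 or frame_len & (frame_len - 1):
        raise ValueError("frame_len must be a power of two")
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        raise ValueError("section is shorter than one frame")
    n_bins = frame_len // 2 + 1          # non-negative frequencies only
    acc = [0.0] * n_bins
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        spectrum = fft(frame)
        for i in range(n_bins):
            acc[i] += abs(spectrum[i]) ** 2
    return [a / n_frames for a in acc]


# A constant signal has all of its power in bin 0.
dc_spectrum = average_power_spectrum([1.0] * 16, frame_len=8)
```

Trailing samples that do not fill a whole frame are ignored in this sketch; a practical implementation would more likely use overlapping, windowed frames.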

In step S4, the index calculation unit 43 selects one section B from among the N sections B (hereinafter specifically the "selected section B"). The index calculation unit 43 then calculates, as the similarity index value, the cross-correlation value Cor between the average power spectrum of the selected section B and the average power spectrum of each of the other (N−1) sections B (hereinafter sometimes called "comparison sections B" to distinguish them from the selected section B) (step S5). The cross-correlation value Cor between the average power spectrum SPa of the selected section B and the average power spectrum SPb of one comparison section B is calculated, for example, by equation (1).

In equation (1), SPa(i) is the intensity of the average power spectrum SPa at the frequency designated by the variable i (F1 ≤ i ≤ F2) among a plurality of frequencies (or frequency bands), and SPa_AVE is the average intensity of the average power spectrum SPa in the band from frequency F1 to frequency F2. Similarly, SPb(i) is the intensity of the average power spectrum SPb at the frequency corresponding to the variable i, and SPb_AVE is the average intensity of the average power spectrum SPb in the band from frequency F1 to frequency F2. When the average power spectrum SPa and the average power spectrum SPb match completely, the cross-correlation value Cor takes its maximum value of 1, and it decreases as the difference between the two increases. The frequencies F1 and F2 are set statistically or experimentally so as to be the lower limit (F1) and upper limit (F2) of a frequency band in which differences between speakers tend to be pronounced in the average power spectrum.
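Equation (1) itself is not reproduced in this text. One form consistent with the definitions above (mean-removed intensities summed over F1 ≤ i ≤ F2, with a maximum of 1 for identical spectra) is the normalized cross-correlation below. It is offered as an assumption, not as the published formula.

```latex
% Assumed form of the cross-correlation value Cor (normalized cross-correlation).
\mathrm{Cor} =
\frac{\displaystyle\sum_{i=F1}^{F2}
      \bigl(SPa(i) - SPa_{AVE}\bigr)\bigl(SPb(i) - SPb_{AVE}\bigr)}
     {\sqrt{\displaystyle\sum_{i=F1}^{F2}\bigl(SPa(i) - SPa_{AVE}\bigr)^{2}}\;
      \sqrt{\displaystyle\sum_{i=F1}^{F2}\bigl(SPb(i) - SPb_{AVE}\bigr)^{2}}}
```

Under this form Cor lies between −1 and 1 and is unchanged when either spectrum is multiplied by a positive constant, which fits the stated aim of reducing the influence of utterance volume.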

Next, the index calculation unit 43 sorts the identifiers of the (N−1) comparison sections B in descending order of their cross-correlation value Cor with the selected section B (that is, in descending order of similarity) (step S6). For example, if, for the selected section B with identifier "1", the cross-correlation value Cor of the section B with identifier "13" is the largest and that of the section B with identifier "16" is the smallest, the (N−1) identifiers are arranged for the selected section B with identifier "1" so that identifier "13" is at the top and identifier "16" is at the bottom, as shown in FIG. 4.

In step S7, the index calculation unit 43 determines whether the processing of steps S4 to S6 has been completed for all N sections B. If the result of step S7 is negative, the index calculation unit 43 selects a section B other than the current one as the new selected section B (step S4), and then calculates the cross-correlation values Cor with the (N−1) comparison sections B (step S5) and sorts the identifiers of the comparison sections B (step S6). Consequently, when the result of step S7 becomes affirmative (that is, when the processing of steps S4 to S6 has been completed for the N sections B), a table M (hereinafter the "similarity map") is complete in which, for each of the N sections B (N = 16 in FIG. 4), the identifiers of the other (N−1) sections B are arranged in descending order of the cross-correlation value Cor, as shown in FIG. 4.
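The loop of steps S4 to S7 can be sketched as follows. The `cross_correlation` function uses a normalized cross-correlation, which is an assumed form of the cross-correlation value Cor consistent with the definitions given for equation (1); the function names and the dictionary layout are likewise hypothetical.

```python
import math


def cross_correlation(spa, spb, f1, f2):
    """Normalized cross-correlation over bins f1..f2 inclusive (assumed form)."""
    a = spa[f1:f2 + 1]
    b = spb[f1:f2 + 1]
    a_ave = sum(a) / len(a)
    b_ave = sum(b) / len(b)
    num = sum((x - a_ave) * (y - b_ave) for x, y in zip(a, b))
    den = math.sqrt(sum((x - a_ave) ** 2 for x in a) *
                    sum((y - b_ave) ** 2 for y in b))
    return num / den if den > 0 else 0.0


def similarity_map(spectra, f1, f2):
    """Map each section id to the other ids sorted by descending Cor.

    spectra: dict of section id -> average power spectrum (list of floats).
    """
    m = {}
    for selected, spa in spectra.items():                       # step S4
        scores = [(cross_correlation(spa, spb, f1, f2), other)  # step S5
                  for other, spb in spectra.items() if other != selected]
        scores.sort(key=lambda pair: -pair[0])                  # step S6
        m[selected] = [other for _, other in scores]
    return m


spectra = {1: [1, 2, 3, 4], 2: [2, 4, 6, 8.5], 3: [4, 3, 2, 1]}
M = similarity_map(spectra, 0, 3)
```

This sketch evaluates each pair twice for simplicity; since Cor is symmetric, an implementation could compute the N(N−1)/2 values once and reuse them.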

When the similarity map M has been created, the voice classification unit 45 refers to it and classifies the N sections B into a plurality of clusters so that each of the N sections B and the section B whose cross-correlation value Cor (similarity index value) with that section is the largest belong to the same cluster (step S8). That is, the voice classification unit 45 places one section B and the section B whose identifier is at the top for that section in the similarity map M in the same cluster. For example, in the similarity map M illustrated in FIG. 4, identifier "13" is at the top for the section B with identifier "1" (that is, the average power spectrum of the section B with identifier "13" is the most similar to that of the section B with identifier "1"), identifier "13" is at the top for the section B with identifier "9", and identifier "9" is at the top for the section B with identifier "13". The voice classification unit 45 therefore classifies the three sections B with identifiers "1", "9", and "13" into the same cluster G1. Similarly, the four sections B with identifiers "2", "3", "4", and "14" are classified into cluster G2, and the two sections B with identifiers "5" and "10" are classified into cluster G3.
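Step S8 amounts to linking every section to its top-ranked neighbour and taking the connected groups as clusters. The sketch below does this with a union-find structure, which is an implementation choice assumed here rather than something the text specifies; the identifiers reuse part of the FIG. 4 example, with the lower ranks filled in arbitrarily.

```python
def cluster_by_nearest(sim_map):
    """Group sections so that each shares a cluster with its top-ranked match.

    sim_map: dict of section id -> other ids in descending order of similarity.
    Returns a sorted list of sorted clusters.
    """
    parent = {s: s for s in sim_map}

    def find(s):
        # Follow parent links to the cluster representative (path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for section, ranked in sim_map.items():
        if ranked:
            parent[find(section)] = find(ranked[0])

    clusters = {}
    for section in sim_map:
        clusters.setdefault(find(section), []).append(section)
    return sorted(sorted(c) for c in clusters.values())


# 1 -> 13, 9 -> 13, 13 -> 9 form one cluster; 5 <-> 10 form another.
example = {
    1: [13, 9, 5, 10],
    9: [13, 1, 5, 10],
    13: [9, 1, 5, 10],
    5: [10, 1, 9, 13],
    10: [5, 1, 9, 13],
}
groups = cluster_by_nearest(example)
```

No threshold appears anywhere in this step: only the identity of the top-ranked section matters, not the value of its cross-correlation.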

In step S8, each section B is classified into the same cluster as the section B whose cross-correlation value Cor with it is the largest. Consequently, even when a speaker has spoken in only one section B, that section B is classified into the same cluster as some other section B (a section B uttered by a different speaker) whose cross-correlation value Cor with it is the largest. The voice classification unit 45 therefore removes from the clusters formed in step S8 any section B, among the N sections B, whose similarity rank with respect to the other (N−1) sections B is below a predetermined value (that is, a section B with low similarity to the other sections), and classifies it into a cluster of its own (step S9).

For example, in the case of FIG. 4, identifier "12" is at the top for the section B with identifier "16" (that is, among the N sections B, the average power spectrum of the section B with identifier "12" is the most similar to that of the section B with identifier "16"), so at the stage of step S8 the section B with identifier "16" is classified into the same cluster G4 as identifier "12". However, identifier "16" is at the bottom of the similarity map M for every one of the other sections B with identifiers "1" to "15". That is, the average power spectrum of the section B with identifier "16" has a low correlation with the average power spectrum of every other section B. The voice classification unit 45 therefore removes the section B with identifier "16" from the cluster G4 formed in step S8 and classifies it into an independent cluster G6. With this configuration, the only section B uttered by a particular speaker (identifier "16") can be classified appropriately without being mixed into another speaker's cluster.

As described above, sections B whose average power spectra give the largest cross-correlation value Cor are classified into the same cluster, so comparison of the cross-correlation value Cor with a predetermined threshold is unnecessary. The audio signal S of each section B can therefore be classified accurately for each speaker regardless of the utterance conditions, and appropriate minutes can be created in which each remark in the conference is faithfully attributed to the participant who made it.

In addition, not only is the section B with the largest cross-correlation value Cor identified for each selected section B, but a similarity map M is also created that ranks the other sections by similarity to each section B. Referring to the similarity map M therefore makes it possible to improve the accuracy of the classification, for example by removing from its cluster, in step S9 of FIG. 3, a section B that ranks low in similarity to the other sections B (such as the section B with identifier "16" in FIG. 4).
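The ranking and classification described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the use of a union-find structure, and the exact isolation rule for step S9 (a section is split off only when every other section ranks it last) are assumptions made for the example.

```python
def build_similarity_map(sim):
    """sim[a][b] is the similarity index of section b seen from section a
    (larger = more similar). Returns, for each section a, the other
    section indices ordered from most to least similar."""
    n = len(sim)
    return [sorted((b for b in range(n) if b != a),
                   key=lambda b: sim[a][b], reverse=True)
            for a in range(n)]


def classify(sim):
    """Group sections so that each shares a cluster with its top-ranked
    section (step S8), then isolate sections ranked last by all others
    (assumed reading of step S9)."""
    n = len(sim)
    ranking = build_similarity_map(sim)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Step S8: merge each section with the section most similar to it.
    for a in range(n):
        root_a, root_b = find(a), find(ranking[a][0])
        if root_a != root_b:
            parent[root_a] = root_b

    # Step S9 (assumed rule): sections that every other section ranks last.
    isolated = {s for s in range(n)
                if all(ranking[a][-1] == s for a in range(n) if a != s)}

    clusters = {}
    for s in range(n):
        if s not in isolated:
            clusters.setdefault(find(s), []).append(s)
    return sorted(clusters.values()) + [[s] for s in sorted(isolated)]
```

No threshold on the similarity values appears anywhere in the sketch; only their relative order is used, which is the property the description emphasizes.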

The method of calculating the cross-correlation value Cor in the first embodiment may be changed as appropriate. For example, the sum (or weighted sum) of the cross-correlation values in each of a plurality of frequency bands obtained by dividing the average power spectrum on the frequency axis may be calculated as the cross-correlation value Cor. That is, the index calculation unit 43 calculates a cross-correlation value Cor_a for one band of the average power spectra (SPa, SPb) and a cross-correlation value Cor_b for another band, and computes their sum or weighted sum as the cross-correlation value Cor (Cor = α·Cor_a + β·Cor_b, where α and β are constants). With this configuration, the characteristics of the bands in which differences between speakers are most pronounced can be reflected in the cross-correlation value Cor in detail and to good effect.
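The band-wise weighted sum can be illustrated with a short sketch. The normalized (Pearson-style) correlation used here is only an assumed stand-in for the cross-correlation of the first embodiment, and the band boundaries and weights are hypothetical parameters.

```python
import math


def correlation(a, b):
    """Normalized cross-correlation of two equal-length sequences
    (assumed stand-in for the Cor of the first embodiment)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0


def banded_cor(sp_a, sp_b, bands, weights):
    """Cor = sum of weight * band correlation, e.g. alpha*Cor_a + beta*Cor_b.
    bands is a list of (start, stop) index ranges into the spectra."""
    return sum(w * correlation(sp_a[lo:hi], sp_b[lo:hi])
               for (lo, hi), w in zip(bands, weights))
```

Giving a larger weight to a band where speakers differ most makes that band dominate the resulting similarity ranking.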

<B: Second Embodiment>
Next, a second embodiment of the present invention will be described. The first embodiment used the cross-correlation value Cor of the average power spectra as the similarity index value of each section B. In this embodiment, by contrast, the similarity index value is the result (average likelihood) of matching a mixture model that represents the audio signal S of each section B against the feature amounts of each of the other sections B. In each of the following embodiments, elements whose operation and function are the same as in the first embodiment are given the same reference numerals as in FIG. 1, and their detailed description is omitted as appropriate.

In step S2 of FIG. 3, the feature extraction unit 41 performs frequency analysis on the audio signal S of the section B selected in step S1 and extracts, as the feature amount, a time series of MFCC (Mel Frequency Cepstral Coefficient) vectors x (hereinafter "feature vectors"), one for each frame in the section B. Also in step S2, the feature extraction unit 41 generates a mixture model λ that models the distribution of the feature vectors x in the section B as a weighted sum of M normal distributions (M is a natural number). Any known technique, such as the EM (Expectation-Maximization) algorithm, may be used to generate the mixture model λ. Repeating this processing N times yields a time series of feature vectors x and a mixture model λ for each of the N sections B defined by the voice segmentation unit 12.

The mixture model λ is expressed, for example, by the following equation (2).
λ = {p_i, μ_i, Σ_i} (i = 1 to M) ……(2)
In equation (2), p_i is the weight of the i-th normal distribution. The weights p_1 to p_M sum to 1. μ_i is the mean vector of the i-th normal distribution, and Σ_i is its covariance matrix. Note that where a symbol such as μ_i in equation (2) actually denotes a vector, this specification states so explicitly (for example, by calling it a "mean vector") and omits the vector notation (a right-pointing arrow above the letter).

In step S5 of FIG. 3, the index calculation unit 43 calculates an average likelihood L as the similarity index value for each of the (N−1) comparison sections B, based on the mixture model λ of the selected section B chosen in step S4 and the time series of feature vectors x extracted from each of the (N−1) comparison sections B. As detailed below, the average likelihood L is the probability (likelihood) that a feature vector x of a comparison section B is generated by the mixture model λ of the selected section B, averaged over the feature vectors x in that comparison section B.

Assuming that one feature vector x is a D-dimensional vector, the likelihood that the feature vector x is generated by the mixture model λ is calculated by the following equation (3).
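Equation (3) itself is not reproduced in this text. As a hedged reconstruction, the standard likelihood of a D-dimensional vector under a mixture of M normal distributions, written with the symbols of equation (2), takes the following form. The component density name b_i is introduced here for illustration only, and the original equation may differ in notation.

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(x), \qquad
b_i(x) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{\mathsf{T}} \Sigma_i^{-1} (x - \mu_i) \right)
```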

The index calculation unit 43 calculates the average likelihood L (the average probability that the feature vectors x_1 to x_K of the comparison section B are generated by the mixture model λ of the selected section B) by substituting into equation (4) the K feature vectors x (x_1 to x_K) that the feature extraction unit 41 extracted for one comparison section B other than the selected section B.

As the above description shows, the more similar the feature vectors x of the audio signal S in the selected section B and in the comparison section B are, the larger the average likelihood L becomes. Therefore, as in the first embodiment, in step S6 of FIG. 3 the index calculation unit 43 sorts the identifiers of the (N−1) comparison sections B in descending order of average likelihood L (that is, in descending order of similarity). Repeating the calculation of the average likelihood L (step S5) and the sorting of the identifiers (step S6) N times completes a similarity map M like that of the first embodiment. The operation of the voice classification unit 45 (steps S8 and S9) is the same as in the first embodiment. This embodiment provides the same effects as the first embodiment.
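The average-likelihood ranking of steps S5 and S6 can be sketched as below. Several things are assumptions made for the illustration: the mixture uses diagonal covariances (the description allows full covariance matrices), the average is taken over plain likelihoods following the wording of the text (averaging log-likelihoods is a common, numerically safer alternative), and all function names are hypothetical.

```python
import math


def gmm_likelihood(x, weights, means, variances):
    """Likelihood of vector x under a Gaussian mixture with diagonal
    covariances (an assumed simplification of the model)."""
    total = 0.0
    for p, mu, var in zip(weights, means, variances):
        log_norm = -0.5 * sum(math.log(2 * math.pi * v) for v in var)
        log_exp = -0.5 * sum((xi - m) ** 2 / v
                             for xi, m, v in zip(x, mu, var))
        total += p * math.exp(log_norm + log_exp)
    return total


def average_likelihood(vectors, model):
    """Mean likelihood of a comparison section's feature vectors under
    the selected section's model (weights, means, variances)."""
    return sum(gmm_likelihood(x, *model) for x in vectors) / len(vectors)


def rank_by_likelihood(model, comparison):
    """comparison maps identifier -> list of feature vectors.
    Returns identifiers in descending order of average likelihood."""
    scores = {k: average_likelihood(v, model) for k, v in comparison.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `rank_by_likelihood` once per selected section produces one row of the similarity map M.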

<C: Third Embodiment>
A third embodiment of the present invention will be described. In this embodiment, the similarity index value is the result (VQ (Vector Quantization) distortion) of comparing the codebook obtained by vector-quantizing the audio signal S of each section B with the feature amounts of each of the other sections B.

In step S2 of FIG. 3, the feature extraction unit 41 extracts, as the feature amount, a time series of feature vectors x (for example, MFCC) from the audio signal S of the section B selected in step S1, in the same way as in the second embodiment, and creates a codebook C^A from the time series of feature vectors x in that section B. Any known technique, such as the k-means method or the LBG algorithm, may be used for the vector quantization of the feature vectors x. Repeating step S2 N times yields a time series of feature vectors x and a codebook C^A for each of the N sections B defined by the voice segmentation unit 12.

In step S5 of FIG. 3, the index calculation unit 43 calculates a VQ distortion D as the similarity index value for each of the (N−1) comparison sections B, based on the codebook C^A of the selected section B chosen in step S4 and the feature vectors x of each of the (N−1) comparison sections B. The VQ distortion D is calculated, for example, by the following equation (5).
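Equation (5) is not reproduced in this text. Based on the definitions the description gives for its terms (codebook size |C^A|, code vectors C^A(i), feature vectors x_j, frame count n_B, Euclidean distance d), it presumably has the following form. This is a reconstruction, not a copy of the original equation.

```latex
D = \frac{1}{n_B} \sum_{j=1}^{n_B} \; \min_{1 \le i \le |C^A|} d\!\left( C^A(i),\, x_j \right)
```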

In equation (5), |C^A| is the size of the codebook C^A of the selected section B, and C^A(i) is the i-th code vector (centroid vector) in the codebook C^A. x_j is the j-th (j = 1 to n_B) of the n_B feature vectors x_1 to x_{n_B} extracted from the comparison section B, where n_B is the number of frames in the comparison section B. d(X, Y) is the Euclidean distance between vector X and vector Y. In other words, the VQ distortion D is the minimum distance (min) between a feature vector x of the comparison section B and the |C^A| centroid vectors in the codebook C^A of the selected section B, averaged over the n_B feature vectors x_1 to x_{n_B}. Consequently, the more similar the feature vectors x of the audio signal S in the selected section B and in the comparison section B are, the smaller the VQ distortion D becomes.

In step S6 of FIG. 3, the index calculation unit 43 sorts the identifiers of the (N−1) comparison sections B in ascending order of VQ distortion D (that is, in descending order of similarity). Repeating the calculation of the VQ distortion D (step S5) and the sorting of the identifiers (step S6) N times completes a similarity map M like that of the first embodiment. The operation of the voice classification unit 45 is the same as in the first embodiment. This embodiment provides the same effects as the first embodiment.
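The VQ distortion and the ascending sort described above can be sketched as follows. The function names are hypothetical, and the codebook is assumed to have been built already (for example by k-means); the sketch covers only the distortion calculation and the ranking.

```python
import math


def vq_distortion(codebook, vectors):
    """Average, over the comparison section's feature vectors, of the
    Euclidean distance to the nearest code vector of the selected section."""
    return sum(min(math.dist(c, x) for c in codebook)
               for x in vectors) / len(vectors)


def rank_by_distortion(codebook, comparison):
    """comparison maps identifier -> list of feature vectors.
    Returns identifiers in ascending order of VQ distortion
    (most similar section first)."""
    scores = {k: vq_distortion(codebook, v) for k, v in comparison.items()}
    return sorted(scores, key=scores.get)
```

Unlike the likelihood of the second embodiment, a smaller value means a closer match here, so the sort direction is reversed.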

<D: Modifications>
Various modifications can be made to each of the above embodiments. Specific modifications are illustrated below. Two or more of the following may be selected and combined as desired.

(1) Modification 1
In each of the above embodiments, all of the N sections B into which the audio signal S is divided are classified. Alternatively, the N sections B may be sorted into sounding sections and non-sounding sections (sections containing only the noise of the recording environment), and only the sounding sections classified. For example, the voice segmentation unit 12 excludes from classification, as non-sounding sections, those of the N sections B whose peak value falls below a threshold.
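A minimal sketch of this selection is shown below. It assumes each section is available as a sequence of samples and that "peak value" means the largest absolute amplitude; the function name and the threshold are hypothetical.

```python
def select_sounding_sections(sections, threshold):
    """Return the indices of sections whose peak absolute amplitude is
    at least the threshold; the rest are treated as non-sounding."""
    return [i for i, samples in enumerate(sections)
            if max((abs(v) for v in samples), default=0.0) >= threshold]
```

Only the returned indices would then be passed on to feature extraction and classification.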

(2) Modification 2
Although the above embodiments make a threshold unnecessary for classifying the sections B, this is not meant to exclude from the scope of the present invention configurations that use a threshold for that classification. For example, having selected in step S8 of FIG. 3 the section B most similar to a given section B (the section B whose identifier is ranked highest in the similarity map M), the voice classification unit 45 classifies the two sections B into the same cluster only if the similarity index value between them exceeds a threshold (that is, the similarity is high), and does not do so if the similarity index value falls below the threshold. Even in this modification, the recording conditions of the audio signal S (for example, the noise level) affect the accuracy of the classification less than in the conventional configuration, which classifies each section B solely on the result of comparing the similarity index value with a threshold.

(3) Modification 3
The feature amounts extracted by the feature extraction unit 41 are not limited to the above examples. For example, in the first embodiment, the feature extraction unit 41 may extract, instead of the average power spectrum, the average over the section B of the MFCCs extracted from each frame in that section B. In the second and third embodiments, the average or maximum intensity of the audio signal S in the section B, or its fundamental frequency, may be calculated as the feature amount.

(4) Modification 4
In each of the above embodiments, step S9 may be executed before step S8 of FIG. 3. That is, the voice classification unit 45 first classifies into a cluster of its own each section B whose similarity rank with respect to the other (N−1) sections B falls below a predetermined rank (step S9), and then executes the classification of step S8 on the remaining sections B (that is, the sections B whose similarity to some other section B ranks above the predetermined rank).

(5) Modification 5
A printing device that prints the minutes created by the voice processing device 100 may be used as the output device 30. The result of the processing by the voice processing device 100 need not, however, be output in the form of minutes (text); for example, the result of the classification by the classification processing unit 14 may be output instead. For example, a configuration that outputs, as sound waves from a sound emitting device (for example, a speaker), the audio signal S of the section B that contains a time specified by the user, among the sections B defined by the voice segmentation unit 12, effectively supports tasks such as creating meeting minutes while the user selectively listens to and checks each speaker's remarks. Also, although the above embodiments illustrate a configuration in which the voice segmentation unit 12 divides the audio signal S into a plurality of sections B, the audio signal S may be stored in the storage device 20 already divided into the sections B. As described above, the voice segmentation unit 12 and the voice recognition unit 16 are not essential elements of the voice processing device 100.

(6) Modification 6
In each of the above embodiments, the audio signal S stored in advance in the storage device 20 is processed. Alternatively, the processing may be executed in real time on an audio signal S supplied from a sound collecting device (microphone) or supplied sequentially over a communication network.

A block diagram showing the configuration of the voice processing device according to the first embodiment of the present invention.
A conceptual diagram showing the operation of the voice segmentation unit.
A flowchart showing the operation of the voice classification unit.
A conceptual diagram showing the contents of the similarity map.

Explanation of symbols

100: voice processing device, 10: control device, 12: voice segmentation unit, 14: classification processing unit, 41: feature extraction unit, 43: index calculation unit, 45: voice classification unit, 16: voice recognition unit, 20: storage device, 30: output device, S: audio signal, B: section.

Claims

A voice processing apparatus comprising:
feature extraction means for extracting a feature amount for each of a plurality of sections obtained by dividing an audio signal on a time axis;
index calculation means for calculating, for a plurality of combinations of two sections selected from the plurality of sections, a similarity index value indicating how similar the feature amounts of the two sections are; and
voice classification means for classifying the plurality of sections into a plurality of sets based on the similarity index values of the sections so that each of the plurality of sections and the section whose feature amount is most similar to it belong to the same set.

The voice processing apparatus according to claim 1, wherein the index calculation means calculates a cross-correlation value between the feature amounts of the two sections as the similarity index value.

The voice processing apparatus according to claim 1, wherein
the feature extraction means extracts, as the feature amount, a time series of feature vectors of the audio signal for each of the plurality of sections, and
the index calculation means calculates, as the similarity index value, the average of the likelihoods that the feature vectors of one of the two sections are generated by a mixture model that models the distribution of the feature vectors of the other section with a plurality of probability distributions.

The voice processing apparatus according to claim 1, wherein
the feature extraction means extracts, as the feature amount, a time series of feature vectors of the audio signal for each of the plurality of sections, and
the index calculation means calculates, as the similarity index value, the average vector quantization distortion between a codebook obtained by vector-quantizing the time series of feature vectors of one of the two sections and each feature vector of the other section.

Any of the above voice processing apparatuses, wherein
the voice classification means classifies, among the plurality of sections, a section whose similarity index values with respect to all the other sections indicate dissimilarity into a set separate from all the other sections.

The voice processing apparatus according to any one of claims 1 to 5, further comprising voice segmentation means for dividing the audio signal into the plurality of sections at each valley in the envelope of the waveform of the audio signal.

A program that causes a computer to execute:
a feature extraction process for extracting a feature amount for each of a plurality of sections obtained by dividing an audio signal on a time axis;
an index calculation process for calculating, for a plurality of combinations of two sections selected from the plurality of sections, a similarity index value indicating how similar the feature amounts of the two sections are; and
a voice classification process for classifying the plurality of sections into a plurality of sets based on the similarity index values of the sections so that each of the plurality of sections and the section whose feature amount is most similar to it belong to the same set.