JP5272141B2

JP5272141B2 - Voice processing apparatus and program

Info

Publication number: JP5272141B2
Application number: JP2009126598A
Authority: JP
Inventors: 道子風間; 靖雄吉岡
Original assignee: Waseda University; Yamaha Corp
Current assignee: Waseda University; Yamaha Corp
Priority date: 2009-05-26
Filing date: 2009-05-26
Publication date: 2013-08-28
Anticipated expiration: 2029-05-26
Also published as: JP2010276697A

Abstract

PROBLEM TO BE SOLVED: To provide a feature quantity that enables highly accurate speaker recognition. SOLUTION: A feature extraction section 32 calculates, as a feature quantity FV, an autocorrelation sequence AV of the frequency components of a voice signal V that exceed a predetermined frequency fc. The autocorrelation sequence AV is a sequence (time series) of autocorrelation values a(m) of the voice signal V. A storage section 24 stores an autocorrelation sequence AREF for reference. A speaker recognition section 34 performs speaker recognition by comparing the autocorrelation sequence AV of the voice signal V with the autocorrelation sequence AREF for reference. COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a technique for processing a speech signal.

Speaker recognition techniques that use feature quantities extracted from a speech signal have been proposed. For example, Patent Document 1 proposes a technique that classifies each of a plurality of sections, obtained by dividing a speech signal on the time axis, by speaker, using the average power spectrum or MFCCs (mel-frequency cepstral coefficients) extracted from each section.

Patent Document 1: JP 2009-20458 A

However, a configuration that uses the average power spectrum or MFCCs may not achieve sufficiently accurate speaker recognition. A new feature quantity that enables highly accurate speaker recognition has therefore long been in strong demand. In view of these circumstances, an object of the present invention is to propose a feature quantity that can achieve highly accurate speaker recognition.

To solve the above problem, a speech processing apparatus according to a first aspect of the present invention comprises feature extraction means for calculating an autocorrelation sequence of a speech signal, and speaker recognition means for executing speaker recognition using the autocorrelation sequence as a feature quantity. In this aspect, because the autocorrelation sequence of the speech signal is used as the feature quantity for speaker recognition, highly accurate speaker recognition can be achieved.

The higher the frequency of a component of the speech signal, the more the variation of the autocorrelation sequence with the content of the utterance tends to be suppressed. Therefore, from the viewpoint of achieving highly accurate speaker recognition that is unaffected by the content of the utterance, a configuration that calculates the autocorrelation sequence for the components of the speech signal above a predetermined frequency is particularly suitable.

Incidentally, whether a plurality of speech signals are similar (correlated) is determined using, for example, a correlation coefficient (cross-correlation) that indicates the correlation between the feature quantities of the speech signals. However, a difference in intensity in the distribution (a spectrum, a time waveform, or the like) represented by the numerical sequence that defines a feature quantity is not necessarily reflected in the correlation coefficient. For example, the correlation coefficient of sine waves that share a frequency but differ in amplitude takes the same value (the maximum) as the correlation coefficient of sine waves that share both frequency and amplitude (that is, whose waveforms match exactly). Speakers therefore cannot always be distinguished with high accuracy.
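The amplitude insensitivity described above can be checked numerically. The following is a minimal sketch, assuming the correlation coefficient is the ordinary Pearson coefficient; the function name `correlation`, the sequence length, and the scale factor 3.0 are illustrative choices, not taken from the patent.

```python
import math

def correlation(d1, d2):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    m1 = sum(d1) / len(d1)
    m2 = sum(d2) / len(d2)
    num = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    den = math.sqrt(sum((a - m1) ** 2 for a in d1) * sum((b - m2) ** 2 for b in d2))
    return num / den

n = 64  # illustrative length
sine = [math.sin(2 * math.pi * 4 * i / n) for i in range(n)]
loud = [3.0 * s for s in sine]  # same frequency, three times the amplitude

same = correlation(sine, sine)    # identical waveforms
scaled = correlation(sine, loud)  # waveforms differing only in amplitude
print(same, scaled)
```

Both values come out at the maximum (1 up to rounding), so the coefficient alone cannot tell the two waveforms apart.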

To solve the above problem, a speech processing apparatus according to a second aspect of the present invention comprises: feature extraction means for extracting, from a speech signal, a feature quantity represented by a sequence of a plurality of numerical values; storage means for storing a reference feature quantity represented by a sequence of a plurality of numerical values; component addition means for adding an auxiliary component containing mutually different numerical values at corresponding positions in each of the feature quantity extracted by the feature extraction means and the reference feature quantity; index calculation means for calculating a similarity index value indicating the similarity between the feature quantities after the auxiliary component has been added; and recognition processing means for executing speaker recognition using the similarity index value. In this aspect, the feature quantity of the speech signal and the reference feature quantity are compared after a common auxiliary component has been added to both. Consequently, even when the numerical sequence of the feature quantity of the speech signal is a constant multiple of the numerical sequence of the reference feature quantity, the difference between the two feature quantities becomes apparent in the similarity index value. Highly accurate speaker recognition is therefore possible.
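The effect of the auxiliary component can be illustrated with a small experiment. This is a sketch under stated assumptions: the auxiliary component is appended at the end of both sequences, its values `[0.0, 5.0]` are hypothetical, and the similarity index is taken to be the Pearson correlation coefficient. The helper names `correlation` and `add_auxiliary` are illustrative only.

```python
import math

def correlation(d1, d2):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    m1 = sum(d1) / len(d1)
    m2 = sum(d2) / len(d2)
    num = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    den = math.sqrt(sum((a - m1) ** 2 for a in d1) * sum((b - m2) ** 2 for b in d2))
    return num / den

def add_auxiliary(feature, auxiliary):
    """Append the same auxiliary values at the same position (here: the end)."""
    return list(feature) + list(auxiliary)

AUX = [0.0, 5.0]  # hypothetical auxiliary component with two different values

n = 32
ref = [math.cos(2 * math.pi * 3 * i / n) for i in range(n)]  # reference feature
obs = [2.0 * r for r in ref]                                 # constant multiple of it

before = correlation(obs, ref)
after = correlation(add_auxiliary(obs, AUX), add_auxiliary(ref, AUX))
unchanged = correlation(add_auxiliary(ref, AUX), add_auxiliary(ref, AUX))
print(before, after, unchanged)
```

Without the auxiliary component the scaled sequence is indistinguishable from the reference; with it, the coefficient drops below 1 for the scaled sequence while an exact match still scores 1.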

In the technique of Patent Document 1, the similarity of feature quantities is determined for a plurality of combinations of two sections selected from the sections into which the speech signal is divided, so it is difficult to classify each section of the speech signal by speaker in real time. A speech processing apparatus according to a third aspect of the present invention therefore comprises: speech segmentation means for dividing a speech signal into a plurality of sections; feature extraction means for extracting a feature quantity for each section of the speech signal; storage means for storing a reference feature quantity; and speaker recognition means for classifying each of the sections into a per-speaker set using the feature quantity of each section. The speaker recognition means includes: index calculation means for calculating, for the feature quantity of one section calculated by the feature extraction means, a similarity index value indicating its similarity to the reference feature quantity stored in the storage means and a similarity index value indicating its similarity to a feature quantity corresponding to the one or more sections already classified into an existing set; and recognition processing means for classifying the one section into a new set when its feature quantity is similar to the reference feature quantity, and into the existing set when its feature quantity is similar to the feature quantity of the existing set. In this aspect, each section is classified into a per-speaker set by comparing the feature quantity of a single section with the reference feature quantity and with the feature quantities of the existing sets, so classifying a section does not require all sections of the speech signal. This has the advantage that the sections can be classified in real time (that is, each section can be classified into one of the sets as soon as it is supplied).

In a specific example of the speech processing apparatus according to the third aspect, even when the feature quantity of the one section is similar to the reference feature quantity, the recognition processing means classifies the one section into an existing set if the similarity index value with respect to the feature quantity of that existing set is on the similar side of a predetermined threshold. This reduces the possibility that sections spoken by the same speaker are classified into separate sets. Likewise, even when the feature quantity of the one section is similar to the feature quantity of an existing set, the recognition processing means classifies the one section into a new set if the similarity index value with respect to the feature quantity of the existing set is on the dissimilar side of a predetermined threshold. This reduces the possibility that sections spoken by different speakers are classified into a common set.

The case where "the similarity index value is on the similar side of a predetermined threshold" covers both the case where a similarity index value defined to increase as the feature quantities become more similar exceeds the threshold and the case where a similarity index value defined to decrease as the feature quantities become more similar falls below the threshold. Similarly, the case where "the similarity index value is on the dissimilar side of a predetermined threshold" covers both the case where a similarity index value defined to increase with similarity falls below the threshold and the case where a similarity index value defined to decrease with similarity exceeds the threshold.
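The section-by-section classification of the third aspect, including the two threshold overrides, might be sketched as follows. Several points are assumptions rather than the patent's method: the similarity index is the Pearson correlation coefficient (larger means more similar), "similar to the reference" is read as the reference scoring at least as high as every existing set, each existing set is represented by the mean of its members, and the thresholds `th_join` and `th_new` and all numerical data are hypothetical.

```python
import math

def correlation(d1, d2):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    m1 = sum(d1) / len(d1)
    m2 = sum(d2) / len(d2)
    num = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    den = math.sqrt(sum((a - m1) ** 2 for a in d1) * sum((b - m2) ** 2 for b in d2))
    return num / den

def classify_section(feature, reference, clusters, th_join=0.9, th_new=0.5):
    """Assign one section to a per-speaker set and return the index of that set.

    clusters is a list of sets; each set is a list of member feature sequences.
    A new set is appended to clusters when the section does not join an existing one.
    """
    r_ref = correlation(feature, reference)
    best_idx, best_r = None, None
    for idx, members in enumerate(clusters):
        centroid = [sum(vals) / len(vals) for vals in zip(*members)]  # assumed set feature
        r = correlation(feature, centroid)
        if best_r is None or r > best_r:
            best_idx, best_r = idx, r
    if best_idx is None:
        join = False                # no existing set yet
    elif r_ref >= best_r:
        join = best_r > th_join     # closer to the reference, unless clearly similar to a set
    else:
        join = best_r >= th_new     # closer to a set, unless clearly dissimilar to it
    if join:
        clusters[best_idx].append(feature)
        return best_idx
    clusters.append([feature])
    return len(clusters) - 1

reference = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]  # hypothetical reference feature
sections = [
    [1.0, 0.2, -0.6, 0.1, 0.9, 0.3],    # speaker A
    [0.9, 0.25, -0.5, 0.15, 1.0, 0.2],  # speaker A again
    [0.1, 0.9, 0.2, -0.7, 0.3, 1.0],    # speaker B
]
clusters = []
labels = [classify_section(s, reference, clusters) for s in sections]
print(labels)
```

Each call needs only the current section, the reference, and the sets built so far, which is what allows sections to be classified as they arrive.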

Two or more aspects selected from the first, second, and third aspects may be combined as desired. For example, although the feature quantity in the second and third aspects is in principle arbitrary, the autocorrelation sequence of the first aspect can be adopted as the feature quantity in the second and third aspects. A configuration in which the component addition means of the second aspect is added to the first or third aspect may also be adopted.

In this application, "speaker recognition" is a concept that encompasses speaker authentication (speaker verification), which determines whether the speaker of the speech in a speech signal is a legitimate registrant, and speaker identification, which identifies the speaker of the speech in a speech signal. Speaker identification includes processing that determines which of a plurality of registrants the speaker corresponds to, and processing that determines which speaker each section of speech recorded in the presence of a plurality of speakers belongs to (and, further, processing that classifies the sections by speaker). The present invention (the first to third aspects) can be applied to any processing included in the concept of speaker recognition as defined above. This is not intended, however, to exclude application of the present invention to processing other than speaker recognition.

The speech processing apparatus according to each of the above aspects may be realized by hardware (an electronic circuit) such as a DSP (digital signal processor) dedicated to processing speech signals, or by cooperation between a program and a general-purpose arithmetic processing device such as a CPU (central processing unit). For example, a program according to the first aspect of the present invention causes a computer to execute feature extraction processing that calculates an autocorrelation sequence of a speech signal and speaker recognition processing that uses the autocorrelation sequence as a feature quantity.

A program according to the second aspect causes a computer to execute: feature extraction processing that extracts, from a speech signal, a feature quantity represented by a sequence of a plurality of numerical values; component addition processing that adds an auxiliary component containing mutually different numerical values at corresponding positions in each of the extracted feature quantity and a reference feature quantity represented by a sequence of a plurality of numerical values; index calculation processing that calculates a similarity index value indicating the similarity between the feature quantities after the auxiliary component has been added; and speaker recognition processing that uses the similarity index value. A program according to the third aspect causes a computer to execute: speech segmentation processing that divides a speech signal into a plurality of sections; feature extraction processing that extracts a feature quantity for each section of the speech signal; and speaker recognition processing that classifies each of the sections into a per-speaker set using the feature quantity of each section. The speaker recognition processing includes: index calculation processing that calculates, for the feature quantity of one section calculated in the feature extraction processing, a similarity index value indicating its similarity to a reference feature quantity and a similarity index value indicating its similarity to a feature quantity corresponding to the one or more sections already classified into an existing set; and recognition processing that classifies the one section into a new set when its feature quantity is similar to the reference feature quantity, and into the existing set when its feature quantity is similar to the feature quantity of the existing set.

The programs according to the above aspects provide the same operation and effects as the speech processing apparatus according to the corresponding aspects of the present invention. The program of the present invention may be provided to a user stored on a computer-readable recording medium and installed on a computer, or may be distributed from a server apparatus over a communication network and installed on a computer.

FIG. 1 is a block diagram of a speech processing apparatus according to a first embodiment of the present invention.
FIG. 2 is a graph showing, for each speaker, the autocorrelation sequence of a speech signal (low-frequency components).
FIG. 3 is a graph showing, for each speaker, the autocorrelation sequence of a speech signal (high-frequency components).
FIG. 4 is a block diagram of the feature extraction unit.
FIG. 5 is a conceptual diagram illustrating the relationship between two waveforms and the correlation coefficient.
FIG. 6 is a conceptual diagram illustrating the correlation coefficient when an auxiliary component is added to the two waveforms.
FIG. 7 is a block diagram of a speech processing apparatus according to a second embodiment.
FIG. 8 is a block diagram of a speech processing apparatus according to a third embodiment.
FIG. 9 is a conceptual diagram illustrating the division of a speech signal into sections.
FIG. 10 is a flowchart of the operation of the speaker recognition unit.

<A: First Embodiment>
FIG. 1 is a block diagram of a speech processing apparatus 100A according to a first embodiment of the present invention. As shown in FIG. 1, a signal supply device 12 and an output device 14 are connected to the speech processing apparatus 100A. The signal supply device 12 supplies the speech processing apparatus 100A with a speech signal V representing the time-domain waveform of speech. For example, a sound collection device that picks up ambient sound and generates the speech signal V, a playback device that obtains the speech signal V from a recording medium, or a communication device that receives the speech signal V from a communication network is used as the signal supply device 12.

The speech processing apparatus 100A is an apparatus (a speaker recognition apparatus) that executes speaker recognition using the speech signal V. Specifically, the speaker recognition executed by the speech processing apparatus 100A is speaker identification, which determines which of a plurality of registrants the speaker of the speech in the speech signal V corresponds to. The output device 14 outputs the result of the speaker recognition (speaker identification) by the speech processing apparatus 100A as an image or as sound. For example, a display device, a printing device, or a sound emitting device (a loudspeaker or headphones) is used as the output device 14.

As shown in FIG. 1, the speech processing apparatus 100A is a computer system comprising an arithmetic processing device 22 and a storage device 24. The storage device 24 stores a program 26 executed by the arithmetic processing device 22 and data used by the arithmetic processing device 22. Any known recording medium, such as a semiconductor recording medium or a magnetic recording medium, may be used as the storage device 24.

By executing the program 26 stored in the storage device 24, the arithmetic processing device 22 realizes a plurality of functions (a feature extraction unit 32 and a speaker recognition unit 34) for executing speaker recognition on the speech signal V. A configuration in which the elements of the arithmetic processing device 22 are distributed over a plurality of devices (integrated circuits), or in which an electronic circuit (DSP) dedicated to processing the speech signal V realizes each element, may also be adopted.

The feature extraction unit 32 in FIG. 1 extracts a feature quantity FV of the speech signal V. Specifically, the feature extraction unit 32 generates an autocorrelation sequence AV of the speech signal V as the feature quantity FV. The autocorrelation sequence AV is a sequence (time series) of autocorrelation values a(m) of the speech signal V. The symbol m is an integer indicating the amount by which the speech signal V is shifted along the time axis relative to itself (the time lag).
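To make the role of the lag m concrete, the following sketch computes autocorrelation values directly in the time domain. It assumes the common unnormalized definition, a(m) as the sum of products of the signal with a copy of itself shifted by m samples; the patent text here does not state a normalization, and the function name and test signal are illustrative.

```python
def autocorrelation(v, max_lag):
    """Unnormalized autocorrelation a(m) = sum over n of v[n] * v[n + m], for m = 0..max_lag."""
    n = len(v)
    return [sum(v[i] * v[i + m] for i in range(n - m)) for m in range(max_lag + 1)]

signal = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # alternating test signal
a = autocorrelation(signal, 3)
print(a)  # [6.0, -5.0, 4.0, -3.0]
```

The sign of a(m) alternates with the lag because the test signal repeats every two samples; a(0) is simply the signal energy.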

FIGS. 2 and 3 are graphs showing the autocorrelation sequence AV (autocorrelation function) of experimentally collected speech signals V for each of three speakers S (SA to SC). FIG. 2 shows the autocorrelation sequence AV of the low-frequency components (1.5 kHz and below) of the speech signal V, and FIG. 3 shows the autocorrelation sequence AV of the high-frequency components (2 kHz to 7.8 kHz). In each graph the horizontal axis represents the shift of the speech signal V along the time axis (the time lag relative to itself) and the vertical axis represents the autocorrelation value a(m). The graph for each speaker S plots the autocorrelation sequences AV of a plurality of speech signals V with different utterance content.

As FIGS. 2 and 3 show, the autocorrelation sequence AV exhibits features that are specific to the speaker and do not depend on the content of the utterance. The autocorrelation sequence AV can therefore be used as a feature quantity for speaker recognition on the speech signal V. However, the autocorrelation values a(m) of the low-frequency components illustrated in FIG. 2 vary with the content of the utterance more readily than those of the high-frequency components illustrated in FIG. 3. Highly accurate speaker recognition calls for a feature quantity that is independent of the content of the utterance (for example, one that reflects the resonance characteristics of the speaker's vocal tract), so the autocorrelation sequence AV of the components of the speech signal V above a predetermined frequency is particularly suitable as the feature quantity for speaker recognition.

FIG. 4 is a detailed block diagram of the feature extraction unit 32 in FIG. 1. The low-frequency suppression unit 51 in FIG. 4 is a filter (a high-pass filter) that suppresses the components of the speech signal V below a predetermined frequency fc. The frequency fc is selected experimentally or statistically so that the variation of the autocorrelation values a(m) caused by the content of the utterance is suppressed to the extent that the desired accuracy is achieved in speaker recognition on the speech signal V processed by the low-frequency suppression unit 51. Specifically, a value in the range from 1.5 kHz to 2.0 kHz inclusive is suitable as the frequency fc.
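The patent does not specify the filter design used by the low-frequency suppression unit 51. As a rough illustration only, the sketch below uses a first-order discrete RC high-pass filter; the 16 kHz sampling rate is an assumption, fc = 2.0 kHz is taken from the range given above, and a practical system would likely use a much steeper filter.

```python
import math

def high_pass(x, fs, fc):
    """First-order RC high-pass filter: y[n] = alpha * (y[n-1] + x[n] - x[n-1])."""
    rc = 1.0 / (2.0 * math.pi * fc)   # time constant for cutoff frequency fc
    dt = 1.0 / fs                     # sampling interval
    alpha = rc / (rc + dt)
    y = [x[0]]
    for n in range(1, len(x)):
        y.append(alpha * (y[-1] + x[n] - x[n - 1]))
    return y

FS = 16000.0  # assumed sampling rate in Hz
FC = 2000.0   # cutoff frequency fc in Hz

dc = [1.0] * 200                          # 0 Hz component (far below fc)
fast = [(-1.0) ** n for n in range(200)]  # component at FS / 2 (far above fc)

dc_out = high_pass(dc, FS, FC)
fast_out = high_pass(fast, FS, FC)
print(abs(dc_out[-1]), abs(fast_out[-1]))
```

The constant input decays to essentially zero, while the rapidly alternating input passes with most of its amplitude intact.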

The time-frequency conversion unit 52 generates a frequency spectrum Q for each of a plurality of frames obtained by dividing, on the time axis, the speech signal V processed by the low-frequency suppression unit 51 (for example, the 2.0 kHz to 7.8 kHz components). Any known technique, such as the fast Fourier transform, may be used to generate the frequency spectrum Q. The power calculation unit 53 calculates the square of the absolute value of the frequency spectrum (amplitude spectrum) Q as the power spectrum |Q|². The averaging unit 54 generates an average power spectrum (average frequency characteristic) P by averaging (or summing) the power spectra |Q|² calculated by the power calculation unit 53 over a plurality of frames. The number and positions of the frames whose power spectra |Q|² are used to calculate the average power spectrum P are arbitrary.
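The chain from framing to the average power spectrum P can be sketched as below. The sketch assumes non-overlapping frames with no window function and uses a naive DFT in place of an FFT so that it needs only the standard library; the frame length of 8 and the test tone are illustrative, and the function names are not from the patent.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of one frame (stand-in for an FFT)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def average_power_spectrum(signal, frame_len):
    """Average |Q|^2 over consecutive non-overlapping frames of length frame_len.

    Assumes the signal contains at least one full frame.
    """
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    frames = [signal[s:s + frame_len] for s in starts]
    total = [0.0] * frame_len
    for frame in frames:
        for k, q in enumerate(dft(frame)):
            total[k] += abs(q) ** 2   # power spectrum |Q|^2 of this frame
    return [t / len(frames) for t in total]

# Test tone that falls exactly on frequency bin 2 of an 8-point frame.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(32)]
p = average_power_spectrum(signal, 8)
print([round(v, 6) for v in p])
```

For this tone the power is concentrated in bin 2 and its mirror bin 6; all other bins are zero up to rounding.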

The frequency-time conversion unit 55 applies an inverse Fourier transform to the average power spectrum P generated by the averaging unit 54. By the Wiener-Khintchine theorem, the time-domain numerical sequence obtained by applying an inverse Fourier transform to the average power spectrum P corresponds to the autocorrelation sequence AV. Specifically, the frequency-time conversion unit 55 calculates each autocorrelation value a(m) by the operation of formula (1) (an inverse Fourier transform). The symbol k in formula (1) is an integer that designates one of a plurality of frequencies (frequency bins) set discretely on the frequency axis, and the symbol p(k) in formula (1) denotes the intensity (power) of the average power spectrum P at the frequency indicated by k.
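The relationship stated by the theorem can be verified numerically. The sketch below assumes formula (1) is the standard inverse discrete Fourier transform with 1/K normalization, a(m) = (1/K) Σ_k p(k) exp(j2πkm/K), where K is the number of frequency bins; the exact form and normalization used in the patent are not reproduced in this text. For a single frame, the result should equal the circular autocorrelation computed directly in the time domain. Function names and the test frame are illustrative.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_dft_real(p):
    """Real part of the inverse DFT of p, with 1/K normalization (assumed form of formula (1))."""
    n = len(p)
    return [sum(p[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n)).real / n
            for m in range(n)]

def circular_autocorrelation(x):
    """Direct time-domain circular autocorrelation: sum over t of x[t] * x[(t + m) mod N]."""
    n = len(x)
    return [sum(x[t] * x[(t + m) % n] for t in range(n)) for m in range(n)]

frame = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, -2.0]  # arbitrary test frame
power = [abs(q) ** 2 for q in dft(frame)]            # power spectrum |Q|^2
a_freq = inverse_dft_real(power)                     # via the frequency domain
a_time = circular_autocorrelation(frame)             # via the time domain
print(max(abs(f - t) for f, t in zip(a_freq, a_time)))
```

The two routes agree to within rounding error, which is why the feature extraction unit can obtain the autocorrelation sequence from the averaged power spectrum.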

As shown in FIG. 1, the storage device 24 stores in advance, for the speech of each of a plurality of different registrants, a reference feature quantity FREF (a dictionary) of the same kind as the feature quantity FV. Specifically, an autocorrelation sequence AREF generated for each registrant in the same way as the autocorrelation sequence AV is stored in the storage device 24 as the feature quantity FREF. The feature extraction unit 32 can be reused to generate the autocorrelation sequences AREF, but they may instead be generated for each registrant by a device separate from the speech processing apparatus 100A and then stored in the storage device 24. The autocorrelation sequence AV and the reference autocorrelation sequence AREF consist of the same number of autocorrelation values a(m).

The speaker recognition unit 34 executes speaker recognition by comparing the feature quantity FV (autocorrelation sequence AV) extracted by the feature extraction unit 32 with each feature quantity FREF (autocorrelation sequence AREF) stored in the storage device 24. As shown in FIG. 1, the speaker recognition unit 34 comprises an index calculation unit 42 and a recognition processing unit 44. The index calculation unit 42 calculates, for each of the autocorrelation sequences AREF stored in the storage device 24, a similarity index value R indicating the similarity between the autocorrelation sequence AV of the speech signal V and the reference autocorrelation sequence AREF (the correlation between the feature quantity FV and the feature quantity FREF). Specifically, the speaker recognition unit 34 calculates the correlation coefficient Cor defined by formula (2) as the similarity index value R.

The correlation coefficient Cor is a numerical value indicating the correlation between a sequence of values d1(i) and a sequence of values d2(i) (i is an integer). The symbol d1_ave in formula (2) denotes the mean of the values d1(i), and the symbol d2_ave in formula (2) denotes the mean of the values d2(i). As formula (2) shows, the more similar the sequence of values d1(i) and the sequence of values d2(i) are (that is, the more highly correlated the waveforms they represent), the larger the correlation coefficient Cor becomes; it takes its maximum value of 1 when the two match exactly.
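Formula (2) itself is not reproduced in this text. A form consistent with the definitions above (mean-subtracted sequences, maximum value 1 for an exact match) is the standard Pearson correlation coefficient, given here as an assumed reconstruction rather than the patent's exact notation:

```latex
\mathrm{Cor} =
\frac{\sum_{i}\bigl(d_1(i)-d_{1\_\mathrm{ave}}\bigr)\bigl(d_2(i)-d_{2\_\mathrm{ave}}\bigr)}
     {\sqrt{\sum_{i}\bigl(d_1(i)-d_{1\_\mathrm{ave}}\bigr)^{2}}\;
      \sqrt{\sum_{i}\bigl(d_2(i)-d_{2\_\mathrm{ave}}\bigr)^{2}}}
```

With this form, Cor lies between -1 and 1, and multiplying either sequence by a positive constant leaves Cor unchanged.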

The index calculation unit 42 in FIG. 1 calculates, as the similarity index value R, the correlation coefficient Cor obtained by substituting the autocorrelation values a(m) of the autocorrelation sequence AV for the values d1(i) in formula (2) and the autocorrelation values a(m) of the autocorrelation sequence AREF for the values d2(i) in formula (2). The similarity index value R therefore becomes larger as the autocorrelation sequence AV and the autocorrelation sequence AREF become more similar.

The recognition processing unit 44 in FIG. 1 performs speaker identification using the similarity index values R that the index calculation unit 42 has calculated for each autocorrelation sequence AREF (that is, for each registrant). Specifically, the recognition processing unit 44 identifies, among the plurality of registrants, the registrant whose autocorrelation sequence AREF yields the largest similarity index value R with respect to the autocorrelation sequence AV. In other words, the (unknown) speaker of the audio signal V is identified from among the plurality of registrants. The identification result of the recognition processing unit 44 is output from the output device 14.
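The identification step of the first embodiment can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes formula (2) is the ordinary Pearson correlation coefficient, and the function names, registrant names, and sequence values are all hypothetical.

```python
import math


def cor(d1, d2):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(d1)
    m1 = sum(d1) / n
    m2 = sum(d2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    den = math.sqrt(sum((a - m1) ** 2 for a in d1) * sum((b - m2) ** 2 for b in d2))
    return num / den  # raises ZeroDivisionError if either series is constant


def identify_speaker(av, references):
    """Return (registrant, R) for the reference sequence most similar to av.

    `references` maps a registrant name (hypothetical) to its sequence AREF.
    """
    scores = {name: cor(av, aref) for name, aref in references.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up values standing in for autocorrelation sequences.
references = {
    "registrant_1": [1.0, 0.5, -0.2, -0.6, -0.1],
    "registrant_2": [1.0, -0.4, 0.3, -0.2, 0.1],
}
av = [0.9, 0.45, -0.15, -0.5, -0.05]
print(identify_speaker(av, references))
```

Here `av` is nearly a scaled copy of the first reference, so that registrant is selected.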

In the above embodiment, the autocorrelation sequence AV of the audio signal V is used as the feature amount FV for speaker recognition, which makes highly accurate speaker recognition possible. Moreover, speaker recognition uses the autocorrelation sequence AV of the components of the audio signal V above the frequency fc (that is, the components whose autocorrelation values a(m) vary little with the content of the utterance), which yields the particular advantage that highly accurate speaker recognition is possible regardless of what is said.

<B: Second Embodiment>
Next, a second embodiment of the present invention will be described. The second embodiment differs from the first in the data on which the similarity index value R is calculated. In each of the embodiments below, elements whose operation and function are equivalent to those of the first embodiment are given the same reference signs as above, and their detailed description is omitted as appropriate.

FIGS. 5 and 6 are conceptual diagrams explaining the principle of the method for calculating the similarity index value R in the second embodiment. FIG. 5 shows two waveforms (WA, WB) that have the same frequency but different amplitudes. The correlation coefficient Cor calculated by formula (2) from the numerical sequence representing the waveform WA (the series d1(i)) and the numerical sequence representing the waveform WB (the series d2(i)) takes the maximum value of 1, which signifies a match, even though the two waveforms differ in amplitude. The correlation coefficient Cor therefore cannot distinguish the waveform WA from the waveform WB.

FIG. 6, by contrast, shows the case where a common component Wp (hereinafter the "auxiliary component") is added at corresponding positions in the waveforms WA and WB. In the example of FIG. 6, the auxiliary component Wp is appended immediately after each of the waveforms WA and WB. The auxiliary component Wp is a component whose intensity varies (a non-DC component). Specifically, a waveform component whose intensity varies periodically (for example, a sine wave component) is suitable as the auxiliary component Wp. FIG. 6 illustrates a case where the amplitude of the auxiliary component Wp is larger than that of the waveform WA and smaller than that of the waveform WB, and where the frequency of the auxiliary component Wp is higher than the frequency of the waveforms WA and WB; however, the amplitude and frequency of the auxiliary component Wp may be chosen freely.

The correlation coefficient Cor calculated by formula (2) from the numerical sequence representing the waveform WAp, obtained by adding the auxiliary component Wp to the waveform WA (the series d1(i)), and the numerical sequence representing the waveform WBp, obtained by adding the auxiliary component Wp to the waveform WB (the series d2(i)), is a value below the maximum value of 1 (for example, 0.9). That is, adding the common auxiliary component Wp makes the difference between the waveforms WA and WB (the difference in amplitude) show up in the correlation coefficient Cor. Conversely, if the waveforms WA and WB match completely, they still match after the auxiliary component Wp is added, so the correlation coefficient Cor takes its maximum value. The second embodiment uses this principle to bring out the difference between the feature amount FV (autocorrelation sequence AV) of the audio signal V and the reference feature amount FREF (autocorrelation sequence AREF).
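The effect of the auxiliary component can be checked numerically. The sketch below assumes that formula (2) is the Pearson correlation coefficient and that "adding" Wp means appending it after each waveform, as in FIG. 6; the amplitudes, periods, and lengths are arbitrary illustrative choices.

```python
import math


def cor(d1, d2):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(d1)
    m1 = sum(d1) / n
    m2 = sum(d2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    den = math.sqrt(sum((a - m1) ** 2 for a in d1) * sum((b - m2) ** 2 for b in d2))
    return num / den


def sine(amplitude, period, length):
    """Sampled sine wave with the given amplitude and period (in samples)."""
    return [amplitude * math.sin(2 * math.pi * i / period) for i in range(length)]


wa = sine(1.0, 16, 32)  # waveform WA
wb = sine(3.0, 16, 32)  # waveform WB: same frequency, larger amplitude
wp = sine(2.0, 4, 16)   # auxiliary component Wp: higher frequency, amplitude between WA and WB

r_plain = cor(wa, wb)           # amplitude difference is invisible: essentially 1
r_aux = cor(wa + wp, wb + wp)   # Wp appended to both: clearly below 1
r_same = cor(wa + wp, wa + wp)  # identical waveforms stay at 1

print(r_plain, r_aux, r_same)
```

With these particular values the second coefficient comes out at roughly 0.87, while the first and third remain at 1 up to rounding error.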

FIG. 7 is a block diagram of an audio processing device 100B according to the second embodiment. As shown in FIG. 7, the audio processing device 100B is the audio processing device 100A of the first embodiment with a component addition unit 36 added. The component addition unit 36 adds a common auxiliary component Wp to the autocorrelation sequence AV generated by the feature extraction unit 32 and to each of the plurality of autocorrelation sequences AREF stored in the storage device 24. The auxiliary component Wp is set as a series of numerical values w that includes mutually different values (for example, a time series of the intensity of a sine wave).

The index calculation unit 42 in FIG. 7 calculates, for each of the plurality of autocorrelation sequences AREF stored in the storage device 24, the similarity index value R between the autocorrelation sequence AV with the auxiliary component Wp added and the reference autocorrelation sequence AREF with the auxiliary component Wp added. Specifically, the index calculation unit 42 substitutes the autocorrelation values a(m) of the autocorrelation sequence AV together with the numerical values w of the auxiliary component Wp for the numerical values d1(i) in formula (2), and substitutes the autocorrelation values a(m) of the autocorrelation sequence AREF together with the numerical values w of the auxiliary component Wp for the numerical values d2(i), thereby calculating the similarity index value R (correlation coefficient Cor) between the autocorrelation sequences AV and AREF. The method of speaker recognition by the recognition processing unit 44 is the same as in the first embodiment.

In the second embodiment, speaker recognition is performed by comparing the autocorrelation sequences AV and AREF after the common auxiliary component Wp has been added to both, so the difference between them can be brought out more clearly than in the first embodiment, where no auxiliary component Wp is added. The second embodiment therefore has the advantage of achieving more accurate speaker recognition than the first.

<C: Third Embodiment>
Next, a third embodiment of the present invention will be described. Speaker identification, which is one form of speaker recognition, is broadly divided into processing that determines which of a plurality of registrants uttered a voice, and processing that determines, for speech recorded where a plurality of speakers are present, which speaker each section of the speech belongs to. The first and second embodiments illustrated the former kind of speaker identification; the third embodiment illustrates the latter. The third embodiment is described below on the basis of the configuration of the first embodiment, but the component addition unit 36 of the second embodiment can of course be added to the third embodiment as well.

FIG. 8 is a block diagram of an audio processing device 100C according to the third embodiment. As shown in FIG. 8, the audio processing device 100C is obtained by adding an audio segmentation unit 38 to the audio processing device 100A of the first embodiment and replacing the recognition processing unit 44 of the first embodiment with a recognition processing unit 46. An audio signal V recorded where a plurality of speakers are present (for example, a conference with a plurality of participants) is supplied from the signal supply device 12 to the audio processing device 100C.

The audio segmentation unit 38 divides the audio signal V into a plurality of sections (blocks) B on the time axis. Each section B is a period in which a single speaker is presumed likely to have spoken continuously. Each section B is given a unique identifier Ib. The audio processing device 100C determines which speaker each section B of the audio signal V belongs to.

In ordinary speech (particularly remarks made in a conference), the volume tends to rise gradually from the start of an utterance and to fall gradually from some midpoint toward the end of the utterance. The audio segmentation unit 38 therefore divides the audio signal V into the plurality of sections B using, as boundaries, the valleys D that appear in the envelope E of the waveform of the audio signal V, as shown in FIG. 9. With this configuration, the utterances of different speakers can be separated into distinct sections B even when, for example, the end of one speaker's utterance overlaps the beginning of another speaker's utterance, or when several speakers speak in succession without pauses. That said, any method may be used in the present invention to divide the audio signal V into the sections B.
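Splitting at envelope valleys can be sketched as below. The text does not specify how the envelope E or the valleys D are computed, so this sketch simply assumes an already smoothed envelope given as a list of non-negative numbers and treats every interior local minimum as a valley; the function name and sample values are hypothetical.

```python
def split_at_valleys(envelope):
    """Return (start, end) index pairs for sections cut at interior local minima.

    `envelope` is assumed to be a smoothed amplitude envelope; a real signal
    would need smoothing first so that small ripples do not count as valleys.
    """
    valleys = [
        i
        for i in range(1, len(envelope) - 1)
        if envelope[i - 1] > envelope[i] <= envelope[i + 1]
    ]
    bounds = [0] + valleys + [len(envelope)]
    return list(zip(bounds[:-1], bounds[1:]))


# Two utterances whose volume rises and falls, meeting at index 4.
envelope = [0.1, 0.6, 1.0, 0.5, 0.2, 0.7, 1.2, 0.4, 0.1]
print(split_at_valleys(envelope))  # [(0, 4), (4, 9)]
```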

The feature extraction unit 32 calculates a feature amount FV (autocorrelation sequence AV) for each of the plurality of sections B of the audio signal V. The storage device 24, meanwhile, stores as a dictionary a feature amount FREF (autocorrelation sequence AREF) of the same kind as the feature amount FV for each of a plurality of voices (samples) of representative voice qualities. That is, whereas in the first and second embodiments the feature amounts FREF of the registrants who are candidate speakers of the audio signal V are generated in advance and stored in the storage device 24, in the third embodiment the feature amounts FREF stored in the storage device 24 are not those of the actual speakers of the audio signal V.

The speaker recognition unit 34 in FIG. 8 uses the feature amount FV of each section B of the audio signal V to classify each of the sections B into a per-speaker set (cluster) CLj (j is a natural number). A section B is classified in real time each time the feature extraction unit 32 calculates its feature amount FV. As shown in FIG. 8, a storage area Mj (M1, M2, ...) is set up in the storage device 24 for each set CLj according to the classification results of the speaker recognition unit 34. The storage area Mj of the set CLj stores the identifier Ib of each section B classified into the set CLj, the audio signal V within that section B, and a feature amount FC that depends on the feature amounts FV of the sections B classified into the set CLj. The feature amount FC of the set CLj is, for example, the average of the feature amounts FV of the one or more sections B classified into the set CLj.

FIG. 10 is a flowchart of the operation of the speaker recognition unit 34. As shown in FIG. 10, upon acquiring the first section B of the audio signal V (step S1), the recognition processing unit 46 classifies that section B into a new set CL1 (step S2). That is, the recognition processing unit 46 stores the identifier Ib, the audio signal V, and the feature amount FV (which serves as the feature amount FC) of the section B acquired in step S1 in the storage area M1 of the storage device 24.

Upon acquiring the next section B (hereinafter the "target section B") (step S3), the index calculation unit 42 calculates the similarity index value R between the feature amount FV of the target section B and the reference feature amount FREF for each of the plurality of feature amounts FREF stored in the storage device 24 (step S4). The index calculation unit 42 further calculates the similarity index value R between the feature amount FV of the target section B and the feature amount FC of each existing set CLj, that is, each set into which the recognition processing unit 46 has previously classified one or more sections B (step S5).

Immediately after the processing of FIG. 10 starts, only the first section B has been classified into the set CL1, so in step S5 the similarity index value R with respect to the feature amount FV is calculated only for the feature amount FC of the set CL1. Once the processing of FIG. 10 has progressed and a plurality of sets CLj have been generated, the similarity index value R between the feature amount FC of the set CLj and the feature amount FV of the target section B is calculated in step S5 for each of the sets CLj. The method of calculating the similarity index value R in steps S4 and S5 is the same as in the first embodiment. The order of steps S4 and S5 may be reversed.

The recognition processing unit 46 determines whether the maximum among the similarity index values R calculated in steps S4 and S5 is a similarity index value R for a reference feature amount FREF (step S4) or a similarity index value R for the feature amount FC of an existing set CLj (step S5) (step S6).

When the maximum is a similarity index value R for a reference feature amount FREF (S6: YES), the recognition processing unit 46 determines whether the maximum value Rmax1 among the similarity index values R between the feature amount FV of the target section B and the feature amounts FC of the existing sets CLj (step S5) exceeds a predetermined threshold RTH1, that is, whether the two are sufficiently similar (step S7). If the result of step S7 is negative (that is, the feature amount FV of the target section B is not similar to the feature amount FC of any existing set CLj), the recognition processing unit 46 classifies the target section B into a new set CLj (step S10). That is, the recognition processing unit 46 stores the identifier Ib, the audio signal V, and the feature amount FV (as the feature amount FC) of the target section B in the storage area Mj of the new set CLj.

If, on the other hand, the result of step S7 is affirmative (that is, the feature amount FV of the target section B is similar to a reference feature amount FREF but is also sufficiently similar to the feature amount FC of an existing set CLj), the recognition processing unit 46 classifies the target section B into the existing set CLj for which the similarity index value R calculated in step S5 is largest (step S8). Specifically, the recognition processing unit 46 adds the identifier Ib and the audio signal V of the target section B to the storage area Mj of that existing set CLj and updates the feature amount FC of the set to a value that reflects the feature amount FV of the target section B (for example, the average of the feature amounts FV of the sections B previously classified into the set CLj and the feature amount FV of the target section B). As will be understood from the above, the threshold RTH1 applied in step S7 is set experimentally or statistically so that the similarity index value R exceeds the threshold RTH1 only when it can be judged with sufficiently high confidence that the speaker of the target section B and the speaker of the sections B in the existing set CLj are the same.

When it is determined in step S6 that the maximum is a similarity index value R for the feature amount FC of an existing set CLj (S6: NO), the recognition processing unit 46 determines whether the maximum value Rmax2 among the similarity index values R between the feature amount FV of the target section B and the feature amounts FC of the existing sets CLj (step S5) is below a predetermined threshold RTH2, that is, whether the two are sufficiently different (step S9). If the result of step S9 is negative (that is, the feature amount FV of the target section B is similar to the feature amount FC of the set CLj), the recognition processing unit 46 classifies the target section B into the existing set CLj for which the similarity index value R calculated in step S5 is largest (step S8). The processing of step S8 is as described above.

If, on the other hand, the result of step S9 is affirmative (that is, the feature amount FV of the target section B is similar neither to the reference feature amounts FREF nor to the feature amounts FC of the existing sets CLj), the recognition processing unit 46 classifies the target section B into a new set CLj (step S10). The processing of step S10 is as described above.

After executing step S8 or step S10, the speaker recognition unit 34 determines whether to end speaker recognition (step S11). The recognition processing unit 46 ends speaker recognition when the user gives an instruction to end it or when all sections B of the audio signal V have been classified (S11: YES). Otherwise, the speaker recognition unit 34 returns to step S3 and executes step S4 and the subsequent steps with the next acquired section B as the new target section B. Consequently, among the plurality of sections B of the audio signal V, sections B with similar feature amounts FV (that is, sections B judged to share a speaker) are classified into a common set CLj. As will be understood from the above, the threshold RTH2 applied in step S9 is set experimentally or statistically so that the similarity index value R falls below the threshold RTH2 only when it can be judged with sufficiently high confidence that the speaker of the target section B differs from the speaker of the sections B in the existing set CLj.
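The flow of steps S1 to S10 can be sketched as follows. This is an illustrative reading of the flowchart, not the patented implementation: it assumes the Pearson correlation coefficient as the similarity index value R and the element-wise mean as the feature amount FC, resolves a tie in step S6 in favor of the existing set (the text does not say how ties are handled), and uses made-up feature values and thresholds.

```python
import math


def cor(d1, d2):
    """Pearson correlation coefficient, used here as the similarity index value R."""
    n = len(d1)
    m1, m2 = sum(d1) / n, sum(d2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    den = math.sqrt(sum((a - m1) ** 2 for a in d1) * sum((b - m2) ** 2 for b in d2))
    return num / den


def choose_cluster(r_refs, r_clusters, rth1, rth2):
    """Steps S6 to S10: return the index of an existing set, or None for a new set."""
    best = max(range(len(r_clusters)), key=r_clusters.__getitem__)
    r_cluster_max = r_clusters[best]
    if max(r_refs) > r_cluster_max:  # S6: YES (a tie falls through: assumption)
        return best if r_cluster_max > rth1 else None  # S7, then S8 or S10
    return None if r_cluster_max < rth2 else best      # S9, then S10 or S8


def classify_sections(features, references, rth1, rth2):
    """Assign each section's feature FV to a set; returns one set index per section."""
    members = []    # members[j] holds the FVs classified into set j
    centroids = []  # centroids[j] is FC, the element-wise mean of members[j]
    labels = []
    for fv in features:
        if not centroids:  # S1, S2: the first section opens the first set
            j = None
        else:
            r_refs = [cor(fv, fref) for fref in references]  # S4
            r_clusters = [cor(fv, fc) for fc in centroids]   # S5
            j = choose_cluster(r_refs, r_clusters, rth1, rth2)
        if j is None:  # S10: open a new set, FC starts as FV
            members.append([fv])
            centroids.append(list(fv))
            j = len(centroids) - 1
        else:          # S8: join set j and update FC
            members[j].append(fv)
            centroids[j] = [sum(col) / len(col) for col in zip(*members[j])]
        labels.append(j)
    return labels


references = [[1, 0, -1, 0], [0, 1, 0, -1]]                  # dictionary of FREF
sections = [[1, 1, -1, -1], [2, 2, -2, -2], [1, 0, -1, 0]]   # FV per section B
print(classify_sections(sections, references, rth1=0.9, rth2=0.3))  # [0, 0, 1]
```

The second section matches the first set exactly, so it joins it. The third section matches a dictionary entry best and is not close enough to the existing set to exceed RTH1, so it opens a new set.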

The classification result of the recognition processing unit 46 is output from the output device 14. For example, the minutes of a conference are output from the output device 14 as an image. In the minutes, for each section B of the audio signal V, the identifier of the set CLj into which the section B was classified (the identifier of the speaker) and the character string obtained by speech recognition of the audio signal V within the section B (that is, the content of the remark) are arranged in chronological order.

In the above embodiment, each section B is classified into a per-speaker set CLj by comparing the feature amount FV of a single section B of the audio signal V with the reference feature amounts FREF and the feature amounts FC of the existing sets CLj, so classifying one section B does not require all the sections B of the audio signal V. This has the advantage that the sections B of the audio signal V can be classified in real time. A configuration is also conceivable (hereinafter the "comparative example") in which no reference feature amount FREF is stored in the storage device 24 and each section B is classified per speaker by comparing the similarity index value R between the feature amount FC of each existing set CLj and the feature amount FV of the section B against a threshold. In the comparative example, however, it is difficult to select an appropriate threshold to serve as the criterion for judging similarity. In the third embodiment, by contrast, each section B is classified according to the relative magnitudes of the similarity index value R for the feature amounts FC of the existing sets CLj and the similarity index value R for the reference feature amounts FREF, so the threshold-setting problem of the comparative example does not arise.

Furthermore, in the above embodiment, even when the feature amount FV of the target section B is most similar to a reference feature amount FREF, the target section B is classified into an existing set CLj if the feature amount FV and the feature amount FC of that set are similar enough that the similarity index value R exceeds the threshold RTH1. This reduces the likelihood that sections B uttered by the same speaker are classified into different sets CLj. Likewise, even when the feature amount FV of the target section B is most similar to the feature amount FC of an existing set CLj, the target section B is not classified into that set if the two differ so much that the similarity index value R falls below the threshold RTH2. This reduces the likelihood that sections B uttered by different speakers are classified into the same set CLj.

<D: Modifications>
Each of the embodiments illustrated above may be modified in various ways. Specific modifications are illustrated below. Two or more modifications freely selected from the following may be combined as appropriate.

(1) Modification 1
In the first embodiment, the autocorrelation sequence AV is calculated by the inverse Fourier transform of the average power spectrum P, but a configuration that calculates the autocorrelation sequence AV from the audio signal V by time-domain computation may also be adopted. The first embodiment, which calculates the autocorrelation sequence AV by frequency-domain computation, nevertheless has the advantage of reducing the amount of computation performed by the feature extraction unit 32. The position of the low-frequency suppression unit 51 may also be changed freely. For example, a configuration may be adopted in which the low-frequency suppression unit 51 is placed after the time-frequency conversion unit 52 and suppresses the low-frequency components of the frequency spectrum Q. The low-frequency suppression unit 51 is, moreover, not essential to the present invention. Since characteristics unique to the speaker also appear in the autocorrelation sequence AV of the low-frequency components of the audio signal V, an autocorrelation sequence AV (feature amount FV) usable for speaker recognition can be calculated even in a configuration that omits the low-frequency suppression unit 51 (that is, a configuration that calculates the autocorrelation sequence AV over the entire band of the audio signal V).
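The equivalence between the time-domain and frequency-domain routes can be illustrated with a small sketch. It relies on the standard identity that the inverse discrete Fourier transform of a power spectrum equals the circular autocorrelation of the signal; it uses a single frame and a naive DFT for clarity, whereas the first embodiment averages the power spectrum over frames and would use an FFT in practice. The function names and the sample signal are hypothetical, and the exact definition of a(m) in the first embodiment is not reproduced here.

```python
import cmath


def circular_autocorr(v):
    """Time-domain route: r[m] = sum over k of v[k] * v[(k + m) mod n]."""
    n = len(v)
    return [sum(v[k] * v[(k + m) % n] for k in range(n)) for m in range(n)]


def autocorr_via_power_spectrum(v):
    """Frequency-domain route: inverse DFT of the power spectrum |DFT(v)|^2."""
    n = len(v)
    spectrum = [
        sum(v[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]
    power = [abs(q) ** 2 for q in spectrum]
    return [
        sum(power[f] * cmath.exp(2j * cmath.pi * f * m / n) for f in range(n)).real / n
        for m in range(n)
    ]


v = [1.0, 2.0, 0.0, -1.0]
direct = circular_autocorr(v)               # [6.0, 1.0, -4.0, 1.0]
via_spectrum = autocorr_via_power_spectrum(v)
print(direct, [round(x, 9) for x in via_spectrum])
```

The two results agree up to rounding error. For a non-circular autocorrelation, the frame would have to be zero-padded before the transform.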

(2) Modification 2
Each of the above embodiments uses autocorrelation sequences (AV, AREF) as the feature amounts (FV, FREF, FC), but the type of feature amount (FV, FREF, FC) in the second and third embodiments may be changed freely. For example, the average power spectrum P of the speech (the average of the squared absolute value of the frequency spectrum Q) or the average over a plurality of frames of the cepstrum calculated from the frequency spectrum Q (the average cepstrum) can be used as the feature amounts (FV, FREF, FC).

The similarity index value R is likewise not limited to the correlation coefficient Cor of formula (2); an appropriate similarity index value R is selected according to the type of the feature amounts FV and FREF. For example, in a configuration that uses the average cepstrum as the feature amounts FV and FREF, the difference between the feature amount FV and the feature amount FREF (and, in the third embodiment, also the difference between the feature amount FV and the feature amount FC) is calculated as the similarity index value R. In a configuration that uses the average power spectrum P as the feature amounts FV and FREF, the correlation coefficient Cor of formula (2) is used as the similarity index value R. Depending on the type of the feature amounts FV and FREF, a configuration that calculates a distance or a likelihood as the similarity index value R is also suitable.

As described above, the similarity index value R may be defined in any way. The relationship between the magnitude of the similarity index value R and the degree of similarity between the feature quantity FV and the feature quantity FREF depends on how R is defined. That is, in each of the above embodiments, R is defined so that it takes a larger value the more similar the feature quantity FV and the feature quantity FREF are; however, a configuration in which R is defined so that it takes a smaller value the more similar they are (for example, a configuration in which the distance between the feature quantity FV and the feature quantity FREF is used as R) may also be adopted.
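The dependence of the comparison direction on the definition of R can be sketched as below. The sketch assumes that equation (2) is the ordinary sample (Pearson) correlation coefficient and uses a Euclidean distance as the alternative index; the function names, the `larger_is_similar` flag, the thresholds and the sample values are all hypothetical.

```python
import math

def correlation(d1, d2):
    """Sample correlation coefficient; a larger value means more similar."""
    n = len(d1)
    m1, m2 = sum(d1) / n, sum(d2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    var1 = sum((a - m1) ** 2 for a in d1)
    var2 = sum((b - m2) ** 2 for b in d2)
    return cov / math.sqrt(var1 * var2)

def euclidean_distance(d1, d2):
    """Euclidean distance; a smaller value means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def is_similar(r, threshold, larger_is_similar):
    """Compare R against a threshold in the direction implied by its definition."""
    return r > threshold if larger_is_similar else r < threshold

fv = [1.0, 2.0, 3.0, 4.0]    # hypothetical feature quantity FV
fref = [1.1, 2.1, 2.9, 4.2]  # hypothetical reference feature quantity FREF

by_correlation = is_similar(correlation(fv, fref), 0.9, larger_is_similar=True)
by_distance = is_similar(euclidean_distance(fv, fref), 0.5, larger_is_similar=False)
```

Keeping the direction as an explicit flag lets the same recognition logic be reused when the index is swapped.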

(3) Modification 3
When the average power spectrum P is adopted as the feature quantity (FV, FREF) in the second embodiment, the component adding unit 36 adds a common auxiliary component Wp to each of the average power spectrum PV of the voice signal V (feature quantity FV) and the average power spectrum PREF of each registrant's voice (feature quantity FREF). The index calculation unit 42 calculates, as the similarity index value R, the correlation coefficient Cor obtained by substituting the per-frequency intensity (power) values of the average power spectrum PV together with the values of the auxiliary component Wp for the values d1(i) of equation (2), and substituting the per-frequency intensity values of the average power spectrum PREF together with the values of the auxiliary component Wp for the values d2(i) of equation (2).
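Written out, the substitution described above might look as follows. This is a sketch under stated assumptions: equation (2) is taken to be the ordinary sample correlation coefficient, the auxiliary component is shown appended after the spectrum values, and the symbols K (number of frequency bins) and J (number of auxiliary values) are introduced here for illustration only.

```latex
\begin{aligned}
d_1 &= \bigl(P_V(1),\dots,P_V(K),\; W_p(1),\dots,W_p(J)\bigr),\\
d_2 &= \bigl(P_{REF}(1),\dots,P_{REF}(K),\; W_p(1),\dots,W_p(J)\bigr),\\
R = C_{or} &= \frac{\sum_{i=1}^{K+J}\bigl(d_1(i)-\bar{d}_1\bigr)\bigl(d_2(i)-\bar{d}_2\bigr)}
{\sqrt{\sum_{i=1}^{K+J}\bigl(d_1(i)-\bar{d}_1\bigr)^2}\;\sqrt{\sum_{i=1}^{K+J}\bigl(d_2(i)-\bar{d}_2\bigr)^2}}
\end{aligned}
```

Here \bar{d}_1 and \bar{d}_2 denote the means of the K+J values in d_1 and d_2, respectively.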

The position at which the auxiliary component Wp is added may also be changed as appropriate. For example, the same effect as in the second embodiment is obtained with a configuration in which the auxiliary component Wp is added or inserted at the head or at an intermediate position of each of the feature quantity FV (the autocorrelation sequence AV or the average power spectrum PV) and the feature quantity FREF (the autocorrelation sequence AREF or the average power spectrum PREF). That is, a configuration in which the auxiliary component Wp is added at mutually corresponding positions (the same position in both) in the feature quantity FV and the feature quantity FREF is preferred in the present invention, but the specific position of the auxiliary component Wp within the feature quantity FV and the feature quantity FREF does not matter. Furthermore, the waveform represented by the auxiliary component Wp is also arbitrary. That is, regardless of the waveform represented by the auxiliary component Wp, the same effect as in the second embodiment is obtained by adding the common auxiliary component Wp to both the feature quantity FV and the feature quantity FREF.
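Why the position is immaterial can be checked numerically: a correlation coefficient depends only on the set of value pairs, so inserting the same Wp at the same position in both sequences gives the same result wherever that position is. The sketch below assumes equation (2) is the ordinary sample correlation coefficient; the function names and sample values are hypothetical.

```python
import math

def correlation(d1, d2):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(d1)
    m1, m2 = sum(d1) / n, sum(d2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    var1 = sum((a - m1) ** 2 for a in d1)
    var2 = sum((b - m2) ** 2 for b in d2)
    return cov / math.sqrt(var1 * var2)

def insert_auxiliary(feature, wp, position):
    """Return a copy of `feature` with the sequence `wp` inserted at `position`."""
    return feature[:position] + wp + feature[position:]

fv = [0.9, 0.4, -0.1, 0.3, 0.2]    # hypothetical feature quantity FV
fref = [1.0, 0.5, 0.1, 0.2, 0.3]   # hypothetical reference feature quantity FREF
wp = [0.5, -0.5, 0.5, -0.5]        # hypothetical non-DC auxiliary component Wp

# Head, middle and tail insertion, always at the same position in both sequences.
values = [correlation(insert_auxiliary(fv, wp, p), insert_auxiliary(fref, wp, p))
          for p in (0, 2, len(fv))]
```

The three entries of `values` agree up to floating-point rounding.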

As can be understood from the above description, the component adding unit 36 in the second embodiment is comprehended as an element that adds a common auxiliary component Wp (typically a non-DC component) to the numerical sequences representing the feature quantity FV extracted by the feature extraction unit 32 and the feature quantity FREF (that is, the sets of numerical values constituting an autocorrelation sequence or an average power spectrum).

(4) Modification 4
In each of the above embodiments, speaker identification was given as an example, but the speech processing apparatus 100A of the first embodiment and the speech processing apparatus 100B of the second embodiment can also be used for speaker authentication (speaker verification), which determines whether the speaker of the voice in the voice signal V is a legitimate registrant. For example, a feature quantity FREF (for example, the autocorrelation sequence AREF) extracted from the voice of a legitimate registrant is stored in the storage device 24, and the index calculation unit 42 calculates the similarity index value R between the feature quantity FV (for example, the autocorrelation sequence AV) extracted from the voice signal V and the registrant's feature quantity FREF. The recognition processing unit 44 determines the legitimacy of the speaker of the voice in the voice signal V according to the magnitude of the similarity index value R. Specifically, the recognition processing unit 44 authenticates the speaker as legitimate when the similarity index value R exceeds a predetermined threshold (that is, when the feature quantity FV and the feature quantity FREF are similar), and denies authentication when the similarity index value R falls below the threshold.
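The threshold decision can be sketched in a few lines. The sketch assumes R is defined so that larger values mean greater similarity; the function name, the threshold value and the trial scores are hypothetical, and the handling of R exactly equal to the threshold is an assumption because the text leaves that case open.

```python
def authenticate(r, threshold):
    """Return True when the similarity index R exceeds the threshold."""
    # Assumption: R equal to the threshold is rejected; the text does not say.
    return r > threshold

THRESHOLD = 0.8  # hypothetical value; in practice it would be tuned on enrollment data

# Hypothetical similarity index values for two utterances claiming the same identity.
trials = {"claimed_speaker_take_1": 0.93, "claimed_speaker_take_2": 0.61}
decisions = {name: authenticate(r, THRESHOLD) for name, r in trials.items()}
```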

(5) Modification 5
In each of the above embodiments, a single type of feature quantity FV was used, but a configuration in which a combination of several different types of feature quantity is used as the feature quantity FV (and likewise as the feature quantity FREF) for speaker recognition is also suitable. For example, the feature extraction unit 32 extracts, as the feature quantity FV, a combination of two or more types of feature quantity selected from the autocorrelation sequence AV, the average power spectrum P, and the average cepstrum. The index calculation unit 42 calculates, for each type of feature quantity in the feature quantity FV, a similarity index value with respect to the corresponding reference feature quantity FREF, and calculates the weighted sum of these per-feature similarity index values as the similarity index value R used for speaker recognition. With this configuration, various aspects (properties) of the voice are reflected in the judgment of similarity between the feature quantity FV and the feature quantity FREF, which has the advantage that speaker recognition is more accurate than when a single type of feature quantity is used. In addition, because the weighted sum of the per-feature similarity index values is used as the similarity index value R for speaker recognition, a specific feature quantity can be given priority over the others.
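The weighted sum can be sketched as follows. The feature names, weights and per-feature index values are hypothetical, and the sketch assumes that every per-feature index points in the same direction (larger means more similar) and lies on a comparable scale, which the text does not specify.

```python
def combined_similarity(index_values, weights):
    """Weighted sum of per-feature similarity index values."""
    if index_values.keys() != weights.keys():
        raise ValueError("each feature type needs exactly one weight")
    return sum(weights[name] * index_values[name] for name in index_values)

# Hypothetical per-feature similarity index values between FV and FREF.
index_values = {
    "autocorrelation": 0.92,
    "average_power_spectrum": 0.80,
    "average_cepstrum": 0.70,
}
# Hypothetical weights; a larger weight prioritizes that feature quantity.
weights = {
    "autocorrelation": 0.5,
    "average_power_spectrum": 0.3,
    "average_cepstrum": 0.2,
}
r = combined_similarity(index_values, weights)
```

Raising the weight of one feature type relative to the others is how that feature would be prioritized in the final decision.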

100A, 100B, 100C ... speech processing apparatus, 12 ... signal supply device, 14 ... output device, 22 ... arithmetic processing device, 24 ... storage device, 26 ... program, 32 ... feature extraction unit, 34 ... speaker recognition unit, 36 ... component adding unit, 38 ... voice classification unit, 42 ... index calculation unit, 44 ... recognition processing unit, 46 ... recognition processing unit, 51 ... low-frequency suppression unit, 52 ... time-frequency conversion unit, 53 ... power calculation unit, 54 ... averaging unit, 55 ... frequency-time conversion unit.

Claims

A speech processing apparatus comprising:
feature extraction means for extracting, from a voice signal, a feature quantity represented by a series of a plurality of numerical values;
storage means for storing a reference feature quantity represented by a series of a plurality of numerical values;
component addition means for adding a common auxiliary component including different numerical values to corresponding positions in each of the feature quantity extracted by the feature extraction means and the reference feature quantity;
index calculating means for calculating a similarity index value indicating the similarity between the feature quantities after addition of the auxiliary component; and
recognition processing means for performing speaker recognition using the similarity index value.

The speech processing apparatus according to claim 1, wherein the feature extraction means extracts the feature quantity from a component of the voice signal that exceeds a predetermined frequency.

The speech processing apparatus according to claim 1, further comprising voice classification means for dividing the voice signal into a plurality of sections, wherein
the feature extraction means extracts a feature quantity for each section,
the index calculating means calculates a similarity index value for each section, and
the recognition processing means classifies each of the plurality of sections into a set for each speaker using the similarity index value of each section.

The speech processing apparatus according to claim 3, wherein
the index calculating means calculates, for the feature quantity of one section extracted by the feature extraction means, a similarity index value indicating similarity to the reference feature quantity, which is extracted from a sample of representative speech and stored in the storage means, and a similarity index value indicating similarity to a feature quantity corresponding to one or more sections already classified into an existing set, and
the recognition processing means classifies the one section into a new set when the feature quantity of the one section is similar to the reference feature quantity, and classifies the one section into the existing set when the feature quantity of the one section is similar to the feature quantity of the existing set.

The speech processing apparatus according to claim 4, wherein the recognition processing means classifies the one section into the existing set, even when the feature quantity of the one section is similar to the reference feature quantity, if the similarity index value with respect to the feature quantity of the existing set is on the similar side of a predetermined threshold.

The speech processing apparatus according to claim 4 or 5, wherein the recognition processing means classifies the one section into a new set, even when the feature quantity of the one section is similar to the feature quantity of the existing set, if the similarity index value with respect to the feature quantity of the existing set is on the dissimilar side of a predetermined threshold.

A program for causing a computer to execute:
a feature extraction process of extracting, from a voice signal, a feature quantity represented by a series of a plurality of numerical values;
a component addition process of adding a common auxiliary component including different numerical values to corresponding positions in each of the feature quantity extracted in the feature extraction process and a reference feature quantity that is represented by a series of a plurality of numerical values and stored in storage means;
an index calculation process of calculating a similarity index value indicating the similarity between the feature quantities after addition of the auxiliary component; and
a recognition process of performing speaker recognition using the similarity index value.