JP2005301022A

JP2005301022A - Voice feature extracting device, speaker recognizing device, program, and voice feature extracting method

Info

Publication number: JP2005301022A
Application number: JP2004118831A
Authority: JP
Inventors: Tomonari Kakino; 友成柿野
Original assignee: Toshiba TEC Corp
Current assignee: Toshiba TEC Corp
Priority date: 2004-04-14
Filing date: 2004-04-14
Publication date: 2005-10-27

Abstract

PROBLEM TO BE SOLVED: To provide a voice feature extracting device that can analyze individuality in more detail from the frequency spectrum of a voice, without the resolution in the frequency direction being fixed. SOLUTION: The voice feature extracting device 4 includes a first analyzing means 13 that analyzes the frequency components of an input voice to extract spectrum components, a logarithmic converting means 14 that logarithmically converts the extracted spectrum components, and a second analyzing means 15 that applies multiresolution analysis to the logarithmically converted spectrum to obtain a feature vector. The frequency-direction length of each analysis window thus shrinks as the quefrency rises, so the analysis has higher frequency resolution at higher quefrencies. Individuality can therefore be analyzed from the frequency spectrum of the voice in more detail without fixing the frequency resolution, which yields a speaker recognizing device with improved recognition precision. COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a speech feature extraction device that divides an input digital speech signal into frames of an appropriate length, applies window processing, and then sequentially outputs feature vectors containing individuality information; to a speaker recognition device that recognizes a speaker using the individuality information contained in a speech wave; to a program; and to a speech feature extraction method.

Automatically determining whose voice a speech wave belongs to, using the individuality information it contains, is called speaker recognition. Speaker recognition is divided into speaker identification and speaker verification. Speaker identification determines which of N pre-registered speakers produced the input speech. In speaker verification, the user enters an ID claiming an identity together with the input speech, and the system determines whether the speech really is the voice of the person corresponding to that ID. For speaker identification, the speaker with the highest similarity (likelihood) is selected from among the many registered speakers, and the speech is judged to be that speaker's. For speaker verification, the speech is judged to be the claimed person's if its similarity to that person's reference pattern, looked up by the ID (the likelihood under the model), exceeds a fixed threshold; otherwise it is judged to be another person's.
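The two decision rules described above can be sketched as follows. This is only an illustration: the speaker names, the score values, and the threshold are hypothetical, and the scores stand in for whatever similarity (likelihood) measure the recognizer actually computes.

```python
def identify(scores):
    """Speaker identification: pick the registered speaker with the highest score."""
    return max(scores, key=scores.get)

def verify(scores, claimed_id, threshold):
    """Speaker verification: accept only if the claimed speaker's score exceeds the threshold."""
    return scores[claimed_id] > threshold

# Hypothetical similarity scores (e.g. log-likelihoods) for three registered speakers.
scores = {"alice": -41.2, "bob": -35.7, "carol": -52.9}

identified = identify(scores)                           # highest-scoring speaker
accepted = verify(scores, "alice", threshold=-40.0)     # claimed ID is "alice"
print(identified, accepted)
```

Identification always returns one of the registered speakers, whereas verification can reject the claim outright, which is why only verification needs a threshold.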

The performance of speaker identification is evaluated by the error rate, that is, the rate at which a registered speaker other than the true speaker is selected. The task naturally becomes harder as the number of registered speakers grows, so the identification error rate increases monotonically with the number of registered speakers. It is therefore desirable to keep the increase in the identification error rate small even when the number of registered speakers increases.

In recent years, low-order cepstrum coefficients have been widely used in speaker identification as feature parameters representing individuality. The procedure for extracting low-order cepstrum coefficients by the cepstrum method is described below with reference to FIG. 7.

In FIG. 7, 501 is the input speech wave (digital speech signal); 502 is a time window processing unit that divides the speech wave into frames of an appropriate length and applies window processing such as a Hamming window; 503 is a discrete Fourier transform processing unit; 504 is a logarithmic conversion processing unit that logarithmically converts the amplitude spectrum; 505 is an inverse discrete Fourier transform processing unit; 506 is a liftering processing unit; and 507 is the output cepstrum coefficients.

The input speech wave 501 is divided into frames of an appropriate length (typically 20 to 30 ms) by the time window processing unit 502, and each frame is multiplied in turn by a window such as a Hamming window. The discrete Fourier transform processing unit 503 then extracts the amplitude spectrum, and the logarithmic conversion processing unit 504 converts it logarithmically to give the log amplitude spectrum. The coarse shape of the envelope of this log amplitude spectrum is generally said to contain the information that indicates individuality. To extract this coarse shape, the inverse discrete Fourier transform processing unit 505 applies an inverse Fourier transform to obtain the cepstrum, and the liftering processing unit 506 then removes the high-order cepstrum, leaving the low-order cepstrum coefficients (see, for example, Non-Patent Literature 1).

The inverse discrete Fourier transform is an analysis method in which the frequency resolution of the analysis windows is constant with respect to quefrency, as shown on the left of FIG. 8. The log amplitude spectrum is inverse discrete Fourier transformed for each analysis window to obtain the cepstrum coefficient corresponding to that window. The sequence of cepstrum coefficients obtained for the analysis windows forms the feature vector shown on the right of FIG. 8.
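The conventional cepstrum procedure of FIG. 7 can be sketched with NumPy as below. The frame length of 256 samples, the number of retained coefficients, the small offset inside the logarithm, and the random test frame are all assumptions made for illustration; the text specifies only the order of the processing steps.

```python
import numpy as np

def low_order_cepstrum(frame, n_coeffs=16):
    """Low-order cepstrum of one frame, following the processing order of FIG. 7."""
    windowed = frame * np.hamming(len(frame))      # time window processing (502)
    amplitude = np.abs(np.fft.fft(windowed))       # DFT amplitude spectrum (503)
    log_amplitude = np.log(amplitude + 1e-10)      # log conversion (504); offset avoids log(0)
    cepstrum = np.fft.ifft(log_amplitude).real     # inverse DFT gives the cepstrum (505)
    return cepstrum[:n_coeffs]                     # liftering: keep only low quefrencies (506)

# Hypothetical input: 256 samples at 12 kHz is about 21 ms, inside the 20 to 30 ms range.
rng = np.random.default_rng(0)
frame = rng.standard_normal(256)
coeffs = low_order_cepstrum(frame)
print(coeffs.shape)
```

Every coefficient here is computed from the whole log spectrum, which is the fixed frequency resolution that the following paragraphs identify as the limitation of this method.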

Non-Patent Literature 1: Sadaoki Furui, "Onsei Joho Shori" (Speech Information Processing), Morikita Publishing Co., Ltd., p. 25

However, in the cepstrum coefficients extracted by the conventional cepstrum method, every analysis window has the same length in the frequency direction, so the resolution in the frequency direction is fixed. If the information indicating individuality is unevenly distributed along the frequency axis, this fixed resolution can degrade speaker discrimination.

An object of the present invention is to provide a speech feature extraction device, a program, and a speech feature extraction method that can analyze individuality from the frequency spectrum of speech in more detail, without the resolution in the frequency direction being fixed.

A further object of the present invention is to provide a speaker recognition device with improved speaker recognition accuracy.

The speech feature extraction device of the present invention divides an input digital speech signal into frames of an appropriate length, applies window processing, and then sequentially outputs feature vectors containing individuality information. It comprises: first analysis means for frequency-analyzing the windowed digital speech signal and extracting spectral components; logarithmic conversion means for logarithmically converting the spectral components extracted by the first analysis means; and second analysis means for applying multiresolution analysis to the log spectrum produced by the logarithmic conversion means to obtain a feature vector.

As a result, the length of each analysis window in the frequency direction shrinks as the quefrency under analysis rises, which enables analysis with high resolution in the frequency direction.

According to the speech feature extraction device, program, and speech feature extraction method of the present invention, the resolution in the frequency direction is not fixed when individuality is analyzed from the frequency spectrum of speech, so a more detailed analysis can be performed.

According to the speaker recognition device of the present invention, the resolution in the frequency direction is not fixed when individuality is analyzed from the frequency spectrum of speech, and a more detailed analysis can be performed. A speaker recognition device with improved speaker recognition accuracy can therefore be provided.

An embodiment of the present invention is described below with reference to FIGS. 1 to 6.

FIG. 1 is a block diagram showing the configuration of the speaker recognition device 100 of this embodiment. As shown in FIG. 1, the speaker recognition device 100 comprises a microphone 1, a low-pass filter 2, an A/D conversion unit 3, a feature vector generation unit 4, a speaker selection unit 5, a speaker model generation unit 6, and a storage unit 7.

The microphone 1 converts input speech into an electrical analog signal. The low-pass filter 2 removes frequencies at or above a predetermined frequency from the input analog signal and outputs the result. The A/D conversion unit 3 converts the input analog signal into a digital signal at a predetermined sampling frequency and number of quantization bits. The microphone 1, the low-pass filter 2, and the A/D conversion unit 3 together constitute speech input means for inputting speech.

The feature vector generation unit 4 functions as the speech feature extraction device. It extracts individuality feature information from the input digital signal and sequentially outputs feature vectors, which are feature data containing individuality information.

The speaker model generation unit 6 (model creation means) creates a speaker model (individuality feature model) from the feature vectors generated by the feature vector generation unit 4. The storage unit 7 (registration means) registers the speaker model (for example, a codebook) created by the speaker model generation unit 6.

The speaker selection unit 5 (speaker selection means) compares the feature vectors generated by the feature vector generation unit 4 with the speaker models (for example, codebooks) registered in advance in the storage unit 7, selects the speaker with the highest similarity (likelihood), and outputs that speaker as the recognition result.

The processing units of the feature vector generation unit 4, which provide the characteristic function of this embodiment, are described with reference to FIG. 2. In FIG. 2, 11 is the input speech wave (digital speech signal); 12 is a time window processing unit that divides the speech wave into frames of an appropriate length and applies window processing such as a Hamming window; 13 is a discrete Fourier transform processing unit (first analysis means) that frequency-analyzes the input speech and extracts spectral components; 14 is a logarithmic conversion processing unit (logarithmic conversion means) that logarithmically converts the amplitude spectrum; 15 is an MRA processing unit (second analysis means) that applies multiresolution analysis (MRA) to the spectral components by wavelet transform to obtain a feature vector; and 16 is the feature vector (multiresolution parameters) output by this processing.

In the multiresolution analysis performed by the feature vector generation unit 4, the length of each analysis window in the frequency direction shrinks as the quefrency rises, as shown on the left of FIG. 3, so the frequency resolution of the analysis becomes higher at higher quefrencies. By performing this analysis, the feature vector generation unit 4 outputs a feature vector (multiresolution parameters) such as that shown on the right of FIG. 3.
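The FIG. 2 processing chain can be sketched as below, replacing the inverse DFT of the conventional method with a discrete wavelet decomposition of the log amplitude spectrum. The text says only that a wavelet transform is used; the Haar wavelet, the four decomposition levels, the dropped Nyquist bin, and the ordering of the output are assumptions chosen to keep the sketch short and dependency-free.

```python
import numpy as np

def haar_mra(signal, levels):
    """Haar wavelet decomposition: returns [approximation, coarsest detail, ..., finest detail]."""
    approx = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]       # length must be even at every level
        details.append((even - odd) / np.sqrt(2.0))  # detail: local differences
        approx = (even + odd) / np.sqrt(2.0)         # approximation: local averages
    return [approx] + details[::-1]

def multiresolution_features(frame, levels=4):
    """Window (12), DFT (13), log conversion (14), then MRA over the frequency axis (15)."""
    windowed = frame * np.hamming(len(frame))
    amplitude = np.abs(np.fft.rfft(windowed))[:-1]   # drop Nyquist bin: power-of-two length
    log_amplitude = np.log(amplitude + 1e-10)        # offset avoids log(0)
    return np.concatenate(haar_mra(log_amplitude, levels))

# Hypothetical input frame of 256 samples.
rng = np.random.default_rng(0)
features = multiresolution_features(rng.standard_normal(256))
print(features.shape)
```

In this sketch each finer detail band is computed from a shorter stretch of the frequency axis than the band before it, which mirrors the shrinking analysis windows on the left of FIG. 3.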

The feature vector generation unit 4 of this embodiment may also include a conventional analysis processing unit (third analysis means) that extracts cepstrum coefficients, as shown in FIG. 6. In that case, the feature vector output by the feature vector generation unit 4 is a multidimensional vector combining the low-order cepstrum coefficients and the multiresolution parameters, as shown in FIG. 4 (integration means).
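One simple reading of this integration is concatenation of the two parameter sets into a single vector. The sketch below assumes that reading; the function name and the vector lengths (16 cepstrum coefficients, 128 multiresolution parameters) are hypothetical.

```python
import numpy as np

def integrate_by_concatenation(cepstrum_coeffs, mra_params):
    """Join low-order cepstrum coefficients and multiresolution parameters into one vector."""
    return np.concatenate([np.asarray(cepstrum_coeffs, dtype=float),
                           np.asarray(mra_params, dtype=float)])

# Placeholder values standing in for the outputs of the two analysis paths.
feature = integrate_by_concatenation(np.zeros(16), np.ones(128))
print(feature.shape)
```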

When the low-order cepstrum coefficients and the multiresolution parameters are integrated into a feature vector, they need not simply be joined together. As shown in FIG. 5, one or the other may instead be selected for each quefrency band, quefrency being the variable of the cepstrum (integration means). This has an effect equivalent to adopting the optimal analysis window for each quefrency band, and enables a more nearly ideal analysis.

Next, the flow of the registration processing in the speaker recognition device 100 of this embodiment is described. Speech input to the microphone 1 is output as an electrical analog signal. The low-pass filter 2 removes from this analog signal the frequencies at or above 1/2 of the sampling frequency (for example, 12 kHz). The A/D conversion unit 3 then samples the signal at the sampling frequency and converts it into a digital signal.

The input speech converted into a digital signal by the A/D conversion unit 3 is input to the feature vector generation unit 4, which outputs the feature data containing individuality information extracted by speech analysis as feature vectors (multiresolution parameters).

The feature vectors (multiresolution parameters) output from the feature vector generation unit 4 are input to the speaker model generation unit 6, which creates a speaker model (for example, a codebook). The created speaker model is registered in the storage unit 7.
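The text names a codebook as one example of a speaker model but does not say how it is built. A minimal sketch, assuming the codebook is trained by plain k-means clustering of the enrollment feature vectors; the codebook size, iteration count, seed, and random test data are hypothetical.

```python
import numpy as np

def train_codebook(features, size=4, iterations=20, seed=0):
    """Build a codebook (a set of centroid vectors) from feature vectors with k-means."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    # Initialise the centroids from randomly chosen, distinct feature vectors.
    codebook = features[rng.choice(len(features), size=size, replace=False)].copy()
    for _ in range(iterations):
        # Distance from every feature vector to every centroid.
        distances = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = distances.argmin(axis=1)
        for k in range(size):
            members = features[nearest == k]
            if len(members) > 0:                 # an empty cluster keeps its old centroid
                codebook[k] = members.mean(axis=0)
    return codebook

# Hypothetical enrollment data: 200 feature vectors of dimension 8 for one speaker.
enrollment = np.random.default_rng(1).standard_normal((200, 8))
codebook = train_codebook(enrollment)
print(codebook.shape)
```

Registering a speaker then amounts to storing the returned array under that speaker's identifier.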

Next, the flow of the speaker recognition processing in the speaker recognition device 100 of this embodiment is described. Speech input to the microphone 1 is output as an electrical analog signal. The low-pass filter 2 removes from this analog signal the frequencies at or above 1/2 of the sampling frequency (for example, 12 kHz). The A/D conversion unit 3 then samples the signal at the sampling frequency and converts it into a digital signal.

The feature vectors (multiresolution parameters) output from the feature vector generation unit 4 are input to the speaker selection unit 5. The speaker with the highest similarity (likelihood) is selected from the speaker models (for example, codebooks) registered in advance in the storage unit 7, and that speaker is output as the recognition result.
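With codebook models, one common way to score similarity is the average quantization distortion: the lower the distortion, the higher the similarity. The sketch below assumes that measure, which the text does not specify; the speaker names and toy codebooks are hypothetical.

```python
import numpy as np

def mean_distortion(features, codebook):
    """Average distance from each feature vector to its nearest codebook entry."""
    distances = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return distances.min(axis=1).mean()

def select_speaker(features, codebooks):
    """Return the registered speaker whose codebook gives the lowest mean distortion."""
    return min(codebooks, key=lambda name: mean_distortion(features, codebooks[name]))

# Toy registered models and a test utterance lying close to speaker_b's codebook.
codebooks = {
    "speaker_a": np.zeros((2, 3)),
    "speaker_b": np.full((2, 3), 10.0),
}
utterance = np.full((5, 3), 9.5)
selected = select_speaker(utterance, codebooks)
print(selected)
```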

Thus, according to this embodiment, the length of each analysis window in the frequency direction shrinks as the quefrency rises, and the frequency resolution of the analysis becomes higher at higher quefrencies. The frequency resolution is therefore not fixed when individuality is analyzed from the frequency spectrum of speech, and a more detailed analysis can be performed, so a speaker recognition device 100 with improved speaker recognition accuracy can be provided.

The present invention is not limited to the specific hardware configuration shown in the above embodiment and can also be implemented in software. FIG. 6 is a block diagram showing a configuration example of the speaker recognition device 100 when the present invention is implemented in software. The speaker recognition device 100 includes a CPU 101 that centrally controls each part of the device. A memory 102, consisting of a ROM that stores the BIOS and the like and a RAM that stores various data rewritably, is connected to the CPU 101 by a bus, forming a microcomputer. An HDD (Hard Disk Drive) 103, a CD-ROM drive 105 that reads a CD (Compact Disc)-ROM 104 serving as a computer-readable storage medium, a communication device 106 that handles communication between the speaker recognition device 100 and the Internet or the like, a keyboard 107, a display device 108 such as a CRT or LCD, and the microphone 1 are also connected to the CPU 101 by the bus via I/O (not shown).

A computer-readable storage medium such as the CD-ROM 104 stores a program that implements the speech feature extraction function of the present invention. Installing this program on the speaker recognition device 100 enables the CPU 101 to execute that function. Speech input from the microphone 1 is temporarily stored on the HDD 103 or the like. When the program is started, the speech data temporarily stored on the HDD 103 or the like is read, the speech feature extraction processing is executed, and the extracted feature vectors are passed to the speaker recognition processing.

The storage medium is not limited to the CD-ROM 104. Media of various types can be used, including optical disks such as DVDs, magneto-optical disks, magnetic disks such as flexible disks, and semiconductor memory. The program may also be downloaded from a network such as the Internet and installed on the HDD 103. In that case, the storage device that stores the program on the sending server is also a storage medium of the present invention. The program may run on a given OS (Operating System), in which case it may have the OS take over execution of part of the various processes, or it may be included as part of a group of program files that make up given application software, such as word-processing software, or an OS.

FIG. 1 is a block diagram showing the configuration of a speaker recognition device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the processing units of the feature vector generation unit.
FIG. 3 is a schematic diagram showing the analysis windows on the frequency-quefrency plane and the feature vector (multiresolution parameters).
FIG. 4 is a schematic diagram showing a variation of the feature vector output by the processing in the feature vector generation unit.
FIG. 5 is a schematic diagram showing another variation of the feature vector output by the processing in the feature vector generation unit.
FIG. 6 is a block diagram showing a configuration example of the speaker recognition device when the present invention is implemented in software.
FIG. 7 is a block diagram showing the processing units of a conventional analysis processing unit that extracts cepstrum coefficients.
FIG. 8 is a schematic diagram showing the analysis windows on the frequency-quefrency plane and the feature vector for the conventional method.

Explanation of symbols

1, 2, 3 Speech input means
4 Speech feature extraction device
5 Speaker selection means
6 Model creation means
7 Registration means
13 First analysis means
14 Logarithmic conversion means
15 Second analysis means
100 Speaker recognition device

Claims

A speech feature extraction device that divides an input digital speech signal into frames of an appropriate length, applies window processing, and then sequentially outputs feature vectors containing individuality information, the device comprising:
first analysis means for frequency-analyzing the windowed digital speech signal and extracting spectral components;
logarithmic conversion means for logarithmically converting the spectral components extracted by the first analysis means; and
second analysis means for applying multiresolution analysis to the log spectrum produced by the logarithmic conversion means to obtain a feature vector.

The speech feature extraction device according to claim 1, further comprising:
third analysis means for extracting cepstrum coefficients from the spectral components extracted by the first analysis means to obtain a feature vector; and
integration means for integrating the feature vector obtained by the second analysis means and the feature vector obtained by the third analysis means.

The speech feature extraction device according to claim 3, wherein the vectors are integrated by selecting one or the other for each quefrency band, quefrency being the variable of the cepstrum.

A speaker recognition device that recognizes a speaker using individuality information contained in a speech wave, comprising:
speech input means for inputting a digital speech signal;
the speech feature extraction device according to claim 1, which divides the input digital speech signal into frames of an appropriate length, applies window processing, and then sequentially outputs feature vectors containing individuality information;
model creation means for creating an individuality feature model from the feature vectors input from the speech feature extraction device;
registration means for registering the individuality feature model created by the model creation means; and
speaker selection means for selecting, based on the feature vectors output from the speech feature extraction device, the speaker with the highest similarity (likelihood) from among the individuality feature models registered by the registration means.

A computer-readable program that causes a computer to execute a speech feature extraction function of dividing an input digital speech signal into frames of an appropriate length, applying window processing, and then sequentially outputting feature vectors containing individuality information, the program causing the computer to execute:
a first analysis function of frequency-analyzing the windowed digital speech signal and extracting spectral components;
a logarithmic conversion function of logarithmically converting the spectral components extracted by the first analysis function; and
a second analysis function of applying multiresolution analysis to the log spectrum produced by the logarithmic conversion function to obtain a feature vector.

The program according to claim 5, further causing the computer to execute:
a third analysis function of extracting cepstrum coefficients from the spectral components extracted by the first analysis function to obtain a feature vector; and
an integration function of integrating the feature vector obtained by the second analysis function and the feature vector obtained by the third analysis function.

The program according to claim 6, wherein the vectors are integrated by selecting one or the other for each quefrency band, quefrency being the variable of the cepstrum.

A speech feature extraction method that divides an input digital speech signal into frames of an appropriate length, applies window processing, and then sequentially outputs feature vectors containing individuality information, the method comprising:
a first analysis step of frequency-analyzing the windowed digital speech signal and extracting spectral components;
a logarithmic conversion step of logarithmically converting the spectral components extracted in the first analysis step; and
a second analysis step of applying multiresolution analysis to the log spectrum produced in the logarithmic conversion step to obtain a feature vector.

The speech feature extraction method according to claim 8, further comprising:
a third analysis step of extracting cepstrum coefficients from the spectral components extracted in the first analysis step to obtain a feature vector; and
an integration step of integrating the feature vector obtained in the second analysis step and the feature vector obtained in the third analysis step.

The speech feature extraction method according to claim 9, wherein the vectors are integrated by selecting one or the other for each quefrency band, quefrency being the variable of the cepstrum.