JP2004341340A - Speaker recognition device - Google Patents

Speaker recognition device

Info

Publication number
JP2004341340A
Authority
JP
Japan
Prior art keywords
frame
speaker
feature vector
shift interval
pitch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2003139252A
Other languages
Japanese (ja)
Other versions
JP4408205B2 (en)
Inventor
Tomonari Kakino
友成 柿野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba TEC Corp
Original Assignee
Toshiba TEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba TEC Corp filed Critical Toshiba TEC Corp
Priority to JP2003139252A priority Critical patent/JP4408205B2/en
Publication of JP2004341340A publication Critical patent/JP2004341340A/en
Application granted granted Critical
Publication of JP4408205B2 publication Critical patent/JP4408205B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Abstract

PROBLEM TO BE SOLVED: To recognize a speaker with high accuracy and to easily calculate the distance to a speaker model.

SOLUTION: A frame cutout means 21 successively cuts speech waveform data into frames of frame width L at a shift interval T and outputs them to a feature vector generating means 22. A pitch detection means 23 detects the presence or absence of a pitch in each frame cut out by the frame cutout means, and the frame cutout means changes the shift interval T and the frame width L according to the detection result: in voiced sections, where a vocal cord frequency is present, T and L are made shorter than in unvoiced sections. The feature vector generating means converts the speech waveform data of each frame into a feature vector and outputs it to a distance calculation means 24 at the next stage. The distance calculation means calculates the distance between the feature vector sequence and a speaker model stored in a speaker model storage means 25 and outputs it to a recognition means 26 at the next stage. The recognition means compares the distance data from the distance calculation means with a preset threshold to recognize the speaker.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speaker recognition device that recognizes a speaker by inputting voice.
[0002]
[Prior art]
In general, in speaker recognition, an input voice is cut out at a shift interval of about 10 msec into frames of about 30 msec in length, and various personality features obtained for each frame are extracted as a feature vector sequence over all voiced sound sections. The extracted feature vector sequence is then evaluated by a comparison method such as the VQ distortion method or the HMM method to perform speaker recognition.
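For concreteness, the conventional fixed framing just described can be sketched as follows. The 30 msec width and 10 msec shift are the approximate values from the text; the function name and the list-based representation are illustrative:

```python
def cut_frames(samples, sample_rate, frame_ms=30, shift_ms=10):
    """Cut a waveform into fixed-width, overlapping frames (conventional method)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift_len = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift_len
    return frames

# 1 second of dummy audio at 12 kHz -> frames of 360 samples every 120 samples
frames = cut_frames([0.0] * 12000, 12000)
```

Because the width and shift are constant here, voiced and unvoiced sections are analyzed at the same resolution, which is exactly the limitation the invention addresses.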
In recent years, it has been known to use not only a voiced sound section but also an unvoiced sound section in recognizing the personality of voice (for example, see Non-Patent Document 1).
[0003]
[Non-patent document 1]
Tomoko Matsui and Sadaoki Furui, IEICE Technical Report SP90-26, "Text-independent speaker recognition using sound source / vocal tract features", IEICE, June 1990, pp. 55-62
[0004]
[Problems to be solved by the invention]
It is known that weighting based on feature amounts can be applied when performing speaker recognition. When such weighting is applied to a method that uses both voiced and unvoiced sections, frames are cut out with a fixed frame length and weighting is applied not only to the voiced sections but also to the unvoiced sections, with the voiced sections weighted, for example, twice as heavily as the unvoiced sections. However, if the frame length is fixed for both unvoiced and voiced sections, the analysis of the voiced sections becomes coarse. The weighting is meant to compensate for this coarseness, but such processing does not yield sufficient speaker recognition. Furthermore, when the distance between the analysis result and a preset speaker model is calculated, the weighting must be processed each time, which is troublesome.
The present invention provides a speaker recognition device that can perform speaker recognition with high accuracy and that can easily calculate a distance from a speaker model.
[0005]
[Means for Solving the Problems]
The present invention comprises: voice input means, for example a microphone, for inputting voice; frame cutout means for cutting out the voice waveform data input by the voice input means into frames of a set frame width at a set shift interval; feature vector generating means for converting the voice waveform data of each frame cut out by the frame cutout means into a feature vector; pitch detecting means for detecting the presence or absence of a vocal cord frequency in a frame cut out by the frame cutout means; speaker model storage means storing speaker models in advance; and recognition means for performing speaker recognition from the distance between the feature vectors generated by the feature vector generating means and a speaker model stored in the speaker model storage means, wherein the frame cutout means sets the shift interval to be shorter when the pitch detecting means detects a vocal cord frequency.
[0006]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of the entire apparatus. Reference numeral 1 denotes a CPU (central processing unit) constituting the main body of a control unit; 2, a ROM (read-only memory) storing program data and the like with which the CPU 1 controls each unit; and 3, a RAM (random access memory) provided with memory used by the CPU 1 for data processing and with various buffer memories.
[0007]
Reference numeral 4 denotes an I/O port; 5, a hard disk device provided with speaker model storage means storing many speaker models in advance; 6, a keyboard controller that takes in key signals from a keyboard 7; and 8, a display controller that causes a display 9 to show information such as text and images.
The CPU 1, ROM 2, RAM 3, I / O port 4, hard disk 5, and controllers 6, 8 are electrically connected to each other via a bus line 10.
[0008]
A microphone 11 for taking in sound is also provided; it converts the captured sound into an electrical analog signal and supplies the output to a low-pass filter 12 at the next stage. The low-pass filter 12 removes frequencies at or above a predetermined frequency from the input analog signal and feeds the result to an A/D converter 13 at the next stage. The A/D converter 13 converts the input analog signal into a digital signal at a predetermined sampling frequency and number of quantization bits and supplies it to the I/O port 4 as audio waveform data. The microphone 11, the low-pass filter 12, and the A/D converter 13 constitute the voice input means.
[0009]
FIG. 2 shows functional blocks implemented by the combination of the CPU 1, the ROM 2, the RAM 3, and the hard disk device 5. The frame cutout means 21 sequentially cuts the audio waveform data input from the I/O port 4 into frames of the set frame width L at the set shift interval T, and supplies them to the feature vector generating means 22 at the next stage.
[0010]
The pitch detecting means 23 detects the presence or absence of a voice frequency, that is, of a pitch, in each frame cut out by the frame cutout means 21, and outputs the detection result to the frame cutout means 21. The frame cutout means 21 changes the shift interval T and the frame width L it sets according to the detection result from the pitch detecting means 23. That is, the shift interval T and the frame width L are made shorter in voiced sections, where a voice frequency is present, than in unvoiced sections, where it is absent.
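The patent does not specify how the pitch detecting means 23 decides whether a pitch is present; an autocorrelation detector is a common choice and is sketched below. The frequency range, threshold, and function name are assumptions for illustration:

```python
import math

def detect_pitch(frame, sample_rate, fmin=60.0, fmax=400.0, threshold=0.3):
    """Autocorrelation pitch detector: returns the pitch frequency in Hz,
    or None when the frame looks unvoiced (no strong periodicity)."""
    n = len(frame)
    energy = sum(s * s for s in frame)
    if energy == 0:
        return None  # silence: trivially unvoiced
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), n - 1)
    best_lag, best_r = None, threshold
    for lag in range(lag_min, lag_max + 1):
        # normalized autocorrelation at this candidate period
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        if r > best_r:
            best_r, best_lag = r, lag
    return sample_rate / best_lag if best_lag else None

# a 200 Hz sine over one 30 msec frame at 12 kHz is strongly periodic
sr = 12000
sine = [math.sin(2 * math.pi * 200 * t / sr) for t in range(360)]
f = detect_pitch(sine, sr)
```

The detection result is what drives the switch between the long (T0, L0) and short (T1, F(fP)) framing described above.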
[0011]
The feature vector generating means 22 converts the audio waveform data of each sequentially input frame into a feature vector, for example cepstrum coefficients, and outputs the result to the distance calculating means 24 at the next stage as a feature vector sequence. The distance calculating means 24 calculates the distance between the input feature vector sequence and a speaker model, for example a codebook, stored in the speaker model storage means 25 provided on the hard disk 5, and outputs it to the recognition means 26 at the next stage.
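The cepstrum coefficients mentioned as an example feature can be sketched as the inverse FFT of the log magnitude spectrum. The windowing and number of coefficients below are illustrative choices, not values fixed by the patent:

```python
import numpy as np

def cepstrum(frame, n_coef=12):
    """Real cepstrum of one frame: IFFT of the log magnitude spectrum.
    A generic sketch of 'cepstrum coefficient' features."""
    windowed = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed)) + 1e-10  # avoid log(0)
    ceps = np.fft.irfft(np.log(spectrum))
    return ceps[1:n_coef + 1]  # drop c0 (overall energy), keep low-quefrency terms

# one 24 msec frame at 12 kHz -> a 12-dimensional feature vector
rng = np.random.default_rng(0)
vec = cepstrum(rng.standard_normal(288))
```

Applying this per frame produces the feature vector sequence handed to the distance calculating means 24.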
The recognition means 26 performs speaker recognition by comparing the distance data from the distance calculation means 24 with a preset threshold value, and displays the result on, for example, the display 9.
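The distance computation and threshold comparison can be sketched with the VQ distortion method mentioned earlier: each feature vector is scored against its nearest codebook entry, and the average distortion is compared with the preset threshold. The tiny codebook and threshold below are hypothetical:

```python
def vq_distance(features, codebook):
    """Average squared distance from each feature vector to its nearest
    codebook entry (VQ distortion against one speaker model)."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sum(min(d2(f, c) for c in codebook) for f in features) / len(features)

def recognize(features, codebook, threshold):
    """Accept the claimed speaker when the distortion stays below the threshold."""
    return vq_distance(features, codebook) <= threshold

codebook = [[0.0, 0.0], [1.0, 1.0]]          # toy 2-entry speaker model
close = recognize([[0.1, 0.0], [0.9, 1.0]], codebook, threshold=0.5)
far = recognize([[5.0, 5.0]], codebook, threshold=0.5)
```

Because voiced sections contribute more frames under the adaptive framing, they automatically dominate this average, with no explicit per-frame weighting.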
[0012]
FIG. 3 is a flowchart showing the processing performed by the functional blocks in FIG. 2. First, in S1, the frame cutout means 21 sets the shift interval T and the frame width L to T0 and L0, which correspond to the unvoiced sound section. Subsequently, in S2, the frame cutout means 21 sequentially cuts the speech waveform data, taken in from the I/O port 4 and temporarily stored in a buffer memory of the RAM 3, into frames of the set frame width L0 at the set shift interval T0.
[0013]
Analysis processing is performed on the audio waveform data of the cut-out frame in S3, and pitch detection by the pitch detecting means 23 is performed in S4. If no pitch is detected, that is, if no voiced waveform is detected, the end of the speech is checked in S5; if the end has not been reached, the process returns to S1 and the processing is repeated.
[0014]
When the pitch detecting means 23 detects a pitch, the pitch frequency fP is extracted in S6, and in S7 the shift interval T is set to T1 and the frame width L is set to F(fP). Here T1 < T0 and F(fP) < L0. F(fP) is a function that determines the frame width L from the pitch frequency fP; that is, once a pitch has been detected, the frame width L changes according to the pitch frequency fP.
After the shift interval T is set to T1 and the frame width L to F(fP), the process returns to S2, where the speech waveform is cut out and the processing is repeated. When the end of the speech is detected in S5, recognition processing is performed in S8 on the basis of the per-frame distances previously stored in the RAM 3, the recognition result is output, and this series of processing ends.
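The S1-S8 loop can be sketched as follows. The pitch detector is injected as a parameter; T1 = T0/2 and F(fP) = two pitch periods are assumptions chosen to match the 6 to 3 msec and 24 to about 12 msec example given later in the text, since the patent does not fix these forms:

```python
def adaptive_frames(samples, sr, detect_pitch, t0_ms=6.0, l0_ms=24.0):
    """Sketch of the FIG. 3 loop: cut with (T0, L0) while unvoiced, switch to
    a shorter shift T1 and pitch-dependent width F(fP) once a pitch appears.

    detect_pitch(frame) returns a pitch frequency in Hz, or None if unvoiced.
    """
    frames = []
    t_s, l_s = t0_ms / 1000.0, l0_ms / 1000.0   # S1: unvoiced settings T0, L0
    pos = 0
    while True:
        width = int(l_s * sr)
        if pos + width > len(samples):          # S5: end of speech reached
            break
        frame = samples[pos:pos + width]        # S2: cut out one frame
        frames.append(frame)                    # S3: analysis would happen here
        fp = detect_pitch(frame)                # S4: pitch detection
        if fp is None:
            t_s, l_s = t0_ms / 1000.0, l0_ms / 1000.0  # back to T0, L0
        else:
            t_s = t0_ms / 2000.0                # T1 < T0 (assumed T0 / 2)
            l_s = 2.0 / fp                      # F(fP) < L0 (assumed 2 periods)
        pos += int(t_s * sr)                    # shift by the updated interval
    return frames

# unvoiced throughout: fixed 24 msec frames every 6 msec at 12 kHz
unvoiced = adaptive_frames([0.0] * 12000, 12000, lambda f: None)
# voiced at 200 Hz: from the second frame on, width shrinks to 2/200 s = 10 msec
voiced = adaptive_frames([0.0] * 12000, 12000, lambda f: 200.0)
```

S8 (recognition from the accumulated per-frame distances) would follow once the loop exits at the end of the speech.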
[0015]
The analysis processing in S3 is the processing shown in FIG. That is, in S31, feature vector generation processing by the feature vector generation means 22 is performed, in S32, distance calculation processing by the distance calculation means 24 is performed, and in S33, one frame calculated by the distance calculation means 24 is processed. The distance for each is stored in the RAM 3.
[0016]
In such a configuration, when a sound is input from the microphone 11, the sound is converted into an electric analog signal, and a frequency higher than a predetermined frequency is cut by the low-pass filter 12. In this case, a frequency equal to or more than 1/2 of the sampling frequency in the A / D converter 13 is cut. For example, if the sampling frequency is 12 kHz, frequencies above 6 kHz are cut off. The analog signal that has passed through the low-pass filter 12 is sampled at a sampling frequency of 12 kHz in an A / D converter 13 and converted into a digital signal.
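The low-pass filtering and A/D step above can be sketched as uniform sampling plus uniform quantization. The 16-bit depth and the test signal are illustrative; the text fixes only the 12 kHz sampling rate and the 6 kHz cutoff that follows from it:

```python
import math

def sample_and_quantize(signal, duration_s, sampling_hz=12000, bits=16):
    """Sample a continuous signal (a function of time in seconds) at uniform
    intervals and quantize each sample to a signed integer, as the A/D
    converter 13 does. Content at or above sampling_hz / 2 (6 kHz here)
    must already have been removed by the low-pass filter, or it aliases."""
    full_scale = 2 ** (bits - 1)
    samples = []
    for i in range(int(duration_s * sampling_hz)):
        x = signal(i / sampling_hz)
        # clamp to [-1, 1) then scale to the signed integer range
        q = int(max(-1.0, min(1.0 - 1.0 / full_scale, x)) * full_scale)
        samples.append(q)
    return samples

# 10 msec of a 440 Hz tone at 12 kHz -> 120 samples of audio waveform data
data = sample_and_quantize(lambda t: math.sin(2 * math.pi * 440 * t), 0.01)
```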
[0017]
The audio waveform data converted into a digital signal is sequentially cut out by the frame cutout means 21 into frames of frame width L at a shift interval T. The presence or absence of a pitch in each frame cut out by the frame cutout means 21 is detected by the pitch detecting means 23, and the frame cutout means 21 switches the shift interval T from T0 to T1 and the frame width L from L0 to F(fP) between the state where no pitch is detected and the state where a pitch is detected. For example, the shift interval T changes from 6 msec to 3 msec, and the frame width L from 24 msec to around 12 msec. The width is "around" 12 msec because it varies with the pitch frequency.
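One plausible form of F(fP) consistent with the 24 msec to about 12 msec example is a frame spanning a fixed number of pitch periods, so that higher-pitched voices get shorter frames. This form is a hypothesis for illustration; the text only says the width varies with the pitch frequency:

```python
def frame_width_ms(pitch_hz, periods=2.0):
    """Assumed F(fP): the frame covers a fixed number of pitch periods,
    giving a width around 12 msec for typical speaking pitches."""
    return periods * 1000.0 / pitch_hz

w_low = frame_width_ms(125.0)    # deeper voice  -> 16.0 msec frame
w_high = frame_width_ms(250.0)   # higher voice  ->  8.0 msec frame
```

Tying the width to the pitch period keeps each voiced frame covering the same amount of glottal activity regardless of the speaker's vocal cord frequency.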
[0018]
That is, as shown in FIG. 5, in a section of the input speech waveform 100 where no pitch is detected, frames of width 24 msec are cut out while shifting at intervals of 6 msec, as shown by frames 101 to 105. Once a pitch is detected, the frame width is changed from 24 msec to around 12 msec, as shown by frame 106, and thereafter frames of around 12 msec are cut out while shifting sequentially at intervals of 3 msec.
[0019]
The speech waveform data of the sequentially cut-out frames are then converted into a feature vector sequence by the feature vector generating means 22, the distance between the feature vector sequence and the speaker model stored in the speaker model storage means 25 is calculated by the distance calculating means 24, and the recognition means 26 compares the calculated distance data with the threshold. Speaker recognition is performed in this way, and the result is displayed on the display 9.
[0020]
As described above, when the input speech waveform is cut into frames, the shift interval T and the frame width L are large in sections where no pitch is detected and small after a pitch has been detected, so the number of frames per unit time is relatively larger in voiced sections than in unvoiced sections. That is, frames are cut out more finely in voiced sections than in unvoiced sections; voiced sections therefore carry more weight in speaker recognition, and recognition is as accurate as when explicit weighting is applied.
[0021]
Moreover, since the shift interval T and the frame width L are reduced in voiced sections so that more frames are cut out, pitch detection is more accurate than in the weighting approach, and the distance to the speaker model can be calculated easily, with no need for troublesome weighting.
[0022]
In addition, since the frame width L in voiced sections is set using the function F(fP), which varies with the pitch frequency fP, that is, with the frequency of the vocal cords, the frame width L in voiced sections can be set appropriately according to the vocal cord frequency, and pitch detection accuracy is maintained regardless of that frequency.
[0023]
[Effects of the Invention]
As described in detail above, according to the present invention, it is possible to provide a speaker recognition device that can perform speaker recognition with high accuracy and can easily calculate a distance from a speaker model.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an entire apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram functionally showing a main part configuration in the embodiment.
FIG. 3 is a flowchart showing main part processing in the embodiment.
FIG. 4 is a flowchart showing the contents of an analysis process in FIG. 3;
FIG. 5 is an exemplary view for explaining frame extraction for an input audio waveform according to the embodiment;
[Explanation of symbols]
1 ... CPU, 11 ... Microphone, 13 ... A/D converter, 21 ... Frame cutout means, 22 ... Feature vector generating means, 23 ... Pitch detecting means, 24 ... Distance calculating means, 25 ... Speaker model storage means, 26 ... Recognition means.

Claims (3)

1. A speaker recognition device comprising: voice input means for inputting voice; frame cutout means for cutting out the voice waveform data input by the voice input means into frames of a set frame width at a set shift interval; feature vector generating means for converting the voice waveform data of each frame cut out by the frame cutout means into a feature vector; pitch detecting means for detecting the presence or absence of a vocal cord frequency in a frame cut out by the frame cutout means; speaker model storage means storing a speaker model in advance; and recognition means for performing speaker recognition from the distance between the feature vectors generated by the feature vector generating means and the speaker model stored in the speaker model storage means, wherein the frame cutout means sets the shift interval to be shorter when the pitch detecting means detects a vocal cord frequency.

2. The speaker recognition device according to claim 1, wherein, when the pitch detecting means detects a vocal cord frequency, the frame cutout means sets the shift interval to be shorter and also sets the frame width to be shorter.

3. The speaker recognition device according to claim 2, wherein the frame cutout means changes the frame width to be set according to the frequency of the vocal cord frequency.
JP2003139252A 2003-05-16 2003-05-16 Speaker recognition device Expired - Fee Related JP4408205B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2003139252A JP4408205B2 (en) 2003-05-16 2003-05-16 Speaker recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2003139252A JP4408205B2 (en) 2003-05-16 2003-05-16 Speaker recognition device

Publications (2)

Publication Number Publication Date
JP2004341340A (en) 2004-12-02
JP4408205B2 JP4408205B2 (en) 2010-02-03

Family

ID=33528395

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2003139252A Expired - Fee Related JP4408205B2 (en) 2003-05-16 2003-05-16 Speaker recognition device

Country Status (1)

Country Link
JP (1) JP4408205B2 (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007000816A1 (en) * 2005-06-29 2007-01-04 Toshiba Tec Kabushiki Kaisha Speech feature extracting device, speaker recognizer, program, and speech feature extracting method
JP2009262702A (en) * 2008-04-23 2009-11-12 Fuji Heavy Ind Ltd Safe driving support system
CN103548076A (en) * 2012-05-23 2014-01-29 恩斯沃尔斯有限责任公司 Device and method for recognizing content using audio signals
WO2018163279A1 (en) * 2017-03-07 2018-09-13 日本電気株式会社 Voice processing device, voice processing method and voice processing program
US11250860B2 (en) 2017-03-07 2022-02-15 Nec Corporation Speaker recognition based on signal segments weighted by quality
US11837236B2 (en) 2017-03-07 2023-12-05 Nec Corporation Speaker recognition based on signal segments weighted by quality
CN109256138A (en) * 2018-08-13 2019-01-22 平安科技(深圳)有限公司 Auth method, terminal device and computer readable storage medium
CN109256138B (en) * 2018-08-13 2023-07-07 平安科技(深圳)有限公司 Identity verification method, terminal device and computer readable storage medium
CN110880322A (en) * 2019-11-29 2020-03-13 中核第四研究设计工程有限公司 Control method of monitoring equipment and voice control device
CN110880322B (en) * 2019-11-29 2022-05-27 中核第四研究设计工程有限公司 Control method of monitoring equipment and voice control device

Also Published As

Publication number Publication date
JP4408205B2 (en) 2010-02-03


Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20060426

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20090427

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20090512

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20090608

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20090804

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20090819

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20091104

A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20091106

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20121120

Year of fee payment: 3

LAPS Cancellation because of no payment of annual fees