JP4408205B2

JP4408205B2 - Speaker recognition device

Info

Publication number: JP4408205B2
Application number: JP2003139252A
Authority: JP
Inventors: 友成柿野
Original assignee: Toshiba TEC Corp
Current assignee: Toshiba TEC Corp
Priority date: 2003-05-16
Filing date: 2003-05-16
Publication date: 2010-02-03
Anticipated expiration: 2023-05-16
Also published as: JP2004341340A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声を入力して話者を認識する話者認識装置に関する。
【０００２】
【従来の技術】
一般に話者認識においては、入力された音声を１０ｍｓｅｃ程度のシスト間隔で長さ３０ｍｓｅｃ程度のフレームに切出し、そのフレーム毎に求まる各種個人性特徴量を全有声音区間全域にわたって特徴ベクトル列として抽出し、この抽出された特徴ベクトル列を、例えばＶＱ歪法やＨＭＭ法などの比較手法により評価して話者認識を行うようになっている。
また、最近では、音声の個人性を認識するうえで有声音区間のみでなく、無声音区間も利用するものが知られている（例えば、非特許文献１参照）。
【０００３】
【非特許文献１】
松井知子、古井貞煕著、電子情報通信学会技術報告書SP90-26、「音源・声道特徴を用いたテキスト独立形話者認識」、電子情報通信学会、June 1990,pp.55-62
【０００４】
【発明が解決しようとする課題】
ところで、話者認識を行う場合に特徴量に起因した重み付けを行うことが知られているが、このような重み付けを有声音区間と無声音区間の両方を利用するものに適用した場合、分割するフレーム長を一定にしてフレームを切出しながら有声音区間は勿論、無声音区間にも重み付けを行うが、有声音区間に対しては例えば無声音区間の２倍の重み付けを行うというような処理が施されることになる。
しかしながら、無声音区間も有声音区間も切出すフレーム長を一定にしたのでは、有声音区間における分析が粗くなる。そして、粗くなった分を、重み付けを与えて対処しているが、このような処理では十分な話者認識ができない。また、分析した結果と予め設定されている話者モデルとの距離を算出する場合に、その都度重み付けの処理を行わなければならず処理が面倒であった。
本発明は、精度の高い話者認識ができ、しかも話者モデルとの距離算出が容易にできる話者認識装置を提供する。
【０００５】
【課題を解決するための手段】
本発明は、音声を入力する、例えばマイクロホンなどからなる音声入力手段と、この音声入力手段で入力された音声波形データを、設定されたシフト間隔で設定されたフレーム幅のフレームに切出すフレーム切出し手段と、このフレーム切出し手段が切出したフレームの音声波形データを特徴ベクトルに変換する特徴ベクトル生成手段と、フレーム切出し手段が切出したフレームにおける声帯周波の有無を検出するピッチ検出手段と、予め話者モデルを記憶した話者モデル記憶手段と、特徴ベクトル生成手段から生成された特徴ベクトルと話者モデル記憶手段に記憶した話者モデルとの距離から話者認識を行う認識手段とを備え、フレーム切出し手段は、ピッチ検出手段が声帯周波を検出した時には、シフト間隔を、無声音区間に対応するＴ_０に替えて、ピッチ周波数に依存しないＴ_１（Ｔ_１＜Ｔ_０）に設定することにある。
【０００６】
【発明の実施の形態】
以下、本発明の一実施の形態を、図面を参照して説明する。
図１は装置全体の構成を示すブロック図で、１は制御部本体を構成するＣＰＵ（中央処理装置）、２はこのＣＰＵ１が各部を制御するプログラムデータ等を格納したＲＯＭ（リード・オンリー・メモリ）、３はＣＰＵ１がデータ処理に使用するメモリや各種バッファメモリが設けられたＲＡＭ（ランダム・アクセス・メモリ）である。
【０００７】
また、４はＩ／Ｏポート、５は予め多数の話者モデルを記憶する話者モデル記憶手段等を設けたハードディスク装置、６はキーボード７からキー信号を取込むキーボードコントローラ、８はディスプレイ９のテキストや画像等の情報を表示させるディスプレイコントローラである。
前記ＣＰＵ１、ＲＯＭ２、ＲＡＭ３、Ｉ／Ｏポート４、ハードディスク５、各コントローラ６，８はバスライン１０を経由して互いに電気的に接続されている。
【０００８】
また、音声を取込むためのマイクロホン１１を設け、このマイクロホン１１は取込んだ音声を電気的アナログ信号に変換して出力し、次段の低域通過フィルタ１２に供給している。前記低域通過フィルタ１２は、入力されたアナログ信号から所定の周波数以上の周波数をカットして出力し、次段のＡ／Ｄ変換部１３に入力している。前記Ａ／Ｄ変換部１３は、入力されたアナログ信号を、所定のサンプリング周波数、量子化ビット数でデジタル信号に変換し音声波形データとして前記Ｉ／Ｏポート４に供給している。前記マイクロホン１１、低域通過フィルタ１２及びＡ／Ｄ変換部１３は、音声入力手段を構成している。
【０００９】
図２は、前記ＣＰＵ１、ＲＯＭ２、ＲＡＭ３、ハードディスク装置５の複合体により構成される機能ブロックで、フレーム切出し手段２１は、前記Ｉ／Ｏポート４から入力される音声波形データを、設定されたシフト間隔Ｔで設定されたフレーム幅Ｌのフレームに順次切出し、次段の特徴ベクトル生成手段２２に供給する。
【００１０】
ピッチ検出手段２３は、記フレーム切出し手段２１から切出されたフレームにおける音声周波の有無、すなわち、ピッチの有無を検出し、検出結果を前記フレーム切出し手段２１に出力する。前記フレーム切出し手段２１は、ピッチ検出手段２３による検出結果に応じて設定するシフト間隔Ｔ及びフレーム幅Ｌを変更するようになっている。すなわち、音声周波が存在しない無声音区間に対して音声周波が存在する有声音区間のシフト間隔Ｔ及びフレーム幅Ｌが短くなるように変更する。
【００１１】
前記特徴ベクトル生成手段２２は、順次入力される各フレームの音声波形データを、例えばケプストラム係数などの特徴ベクトルに変換し特徴ベクトル系列として、次段の距離計算手段２４に出力するようになっている。前記距離計算手段２４は、入力した特徴ベクトル系列と前記ハードディスク５に設けられた話者モデル記憶手段２５に記憶されている例えばコードブックなどの話者モデルとの距離を計算し、次段の認識手段２６に出力するようになっている。
前記認識手段２６は、距離計算手段２４からの距離データと予め設定された閾値とを比較して話者認識を行い、結果を、例えば前記ディスプレイ９に表示するようになっている。
【００１２】
図３は、図２の機能ブロックによる処理を示す流れ図で、先ず、Ｓ１にて、フレーム切出し手段２１はシフト間隔Ｔ及びフレーム幅Ｌを、無声音区間に対応するＴ_０、Ｌ_０に設定する。続いて、Ｓ２にて、フレーム切出し手段２１はＩ／Ｏポート４から取込んでＲＡＭ３のバッファメモリに一時格納した音声波形データを、設定されたシフト間隔Ｔ_０で設定されたフレーム幅Ｌ_０のフレームに順次切出す。
【００１３】
切出されたフレームの音声波形データに対してＳ３にて分析処理が行われ、また、Ｓ４にてピッチ検出手段２３によるピッチ検出が行われる。そして、ピッチ検出が行わなければ、すなわち、音声波形を検出しなければ、Ｓ５にて、音声の終端をチェックし、終端でなければ、Ｓ１に戻って処理を繰り返す。
【００１４】
また、ピッチ検出手段２３がピッチ検出を行うと、Ｓ６にて、ピッチ周波数ｆ_Ｐを抽出し、Ｓ７にて、シフト間隔ＴをＴ_１に設定するとともにフレーム幅ＬをＦ（ｆ_Ｐ）に設定する。この場合、Ｔ_１＜Ｔ_０となり、また、Ｆ（ｆ_Ｐ）＜Ｌ_０となる。Ｆ（ｆ_Ｐ）はピッチ周波数ｆ_Ｐからフレーム幅Ｌを求める関数であり、ピッチ検出を行ったときのフレーム幅Ｌはピッチ周波数ｆ_Ｐに応じて変化することを示している。
シフト間隔ＴをＴ_１に設定し、フレーム幅ＬをＦ（ｆ_Ｐ）に設定すると、Ｓ２に戻って音声波形の切出しを行って処理を繰り返す。Ｓ５にて、音声の終端をチェックすると、Ｓ８にて予めＲＡＭ３に格納されているフレーム毎の距離に基づき認識処理を行い、認識結果を出力した後、この一連の処理を終了する。
【００１５】
Ｓ３の分析処理は、図４に示す処理になっている。すなわち、Ｓ３１にて、特徴ベクトル生成手段２２による特徴ベクトル生成処理を行い、Ｓ３２にて、距離計算手段２４による距離計算処理を行い、Ｓ３３にて、前記距離計算手段２４にて算出された１フレーム毎の距離をＲＡＭ３に格納する。
【００１６】
このような構成においては、マイクロホン１１から音声を入力すると、電気的アナログ信号に変換され、低域通過フィルタ１２で所定の周波数以上の周波数がカットされる。この場合、Ａ／Ｄ変換部１３におけるサンプリング周波数の１／２以上の周波数がカットされる。例えば、サンプリング周波数が１２ｋＨｚであれば、６ｋＨｚ以上の周波数がカットされる。低域通過フィルタ１２を通過したアナログ信号はＡ／Ｄ変換部１３において１２ｋＨｚのサンプリング周波数でサンプリングされデジタル信号に変換される。
【００１７】
フレーム切出し手段２１においてデジタル信号に変換された音声波形データは、シフト間隔Ｔでフレーム幅Ｌのフレームに順次切出される。そして、フレーム切出し手段２１で切出されたフレームにおけるピッチの有無がピッチ検出手段２３で検出され、フレーム切出し手段２１はピッチが検出されない状態とピッチが検出された状態とでシフト間隔ＴをＴ_０からＴ_１に、フレーム幅ＬをＬ_０からＦ（ｆ_Ｐ）に変換する。例えば、シフト間隔Ｔが６msecから３msecに変換され、フレーム幅Ｌが２４msecから１２msec前後に変換される。なお、１２msec前後としたのはピッチ周波数によって変化するためである。
【００１８】
すなわち、図５に示すように、入力音声波形１００に対して、フレーム１０１〜１０５として示すように、ピッチ検出が行われない区間においてはシフト間隔６msecでシフトしながらフレーム幅２４msecでフレームを切出し、ピッチ検出が行われると、フレーム１０６として示すようにフレーム幅２４msecが１２msec前後に変換され、以降シフト間隔３msecで順次シフトしながらフレーム幅が１２msec前後のフレームの切出しが行われる。
【００１９】
そして、順次切出されたフレームの音声波形データは特徴ベクトル生成手段２２により特徴ベクトル系列に変換され、距離計算手段２４にて特徴ベクトル系列と話者モデル記憶手段２５に記憶されている話者モデルとの距離が計算され、認識手段２６にて算出された距離データと閾値との比較が行われる。こうして話者認識が行われ、結果がディスプレイ９に表示される。
【００２０】
このように、入力音声波形をフレームに切出す場合に、ピッチを検出しない区間においてはシフト間隔Ｔ及びフレーム幅Ｌを大きくし、ピッチを検出した後はシフト間隔Ｔ及びフレーム幅Ｌを小さくするので、単位時間当たりのフレーム数は無声音区間に比べて有声音区間が相対的に多くなる。すなわち、有声音区間は無声音区間に比べて細かくフレームを切出すことになり、従って、話者認識における比重が大きくなり、重み付けを行った場合と同様、精度の高い話者認識ができる。
【００２１】
しかも、有声音区間についてはシフト間隔Ｔ及びフレーム幅Ｌを小さくして切出すフレーム数を多くして精度を高めているので、重み付けする場合に比べてより精度の高いピッチ検出ができ、しかも、面倒な重み付けを行う必要が無く話者モデルとの距離算出が容易にできる。
【００２２】
また、有声音区間におけるフレーム幅Ｌの設定を、ピッチ周波数ｆ_Ｐ、すなわち、声帯周波の周波数によって変化する関数Ｆ（ｆ_Ｐ）を使用して行っているので、有声音区間におけるフレーム幅Ｌを声帯周波数に応じて適切に設定することができ、声帯周波数に関係なくピッチ検出精度を維持できる。
【００２３】
【発明の効果】
以上詳述したように、本発明によれば、精度の高い話者認識ができ、しかも話者モデルとの距離算出が容易にできる話者認識装置を提供できる。
【図面の簡単な説明】
【図１】本発明の一実施の形態に係る装置全体の構成を示すブロック図。
【図２】同実施の形態における要部構成を機能的に示すブロック図。
【図３】同実施の形態における要部処理を示す流れ図。
【図４】図３における分析処理の内容を示す流れ図。
【図５】同実施の形態における入力音声波形に対するフレーム切出しを説明するための図。
【符号の説明】
１…ＣＰＵ、１１…マイクロホン、１３…Ａ／Ｄ変換部、２１…フレーム切出し手段、２２…特徴ベクトル生成手段、２３…ピッチ検出手段、２４…距離計算手段、２５…話者モデル記憶手段、２６…認識手段。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a speaker recognition device that recognizes a speaker from input speech.
[0002]
[Prior art]
In general, in speaker recognition, input speech is cut into frames about 30 msec long at shift intervals of about 10 msec, the various speaker-specific features obtained for each frame are extracted as a feature vector sequence over all voiced sound sections, and the extracted feature vector sequence is evaluated by a comparison method such as the VQ distortion method or the HMM method to perform speaker recognition.
Recently, methods have also become known that use not only voiced sound sections but also unvoiced sound sections to recognize the individuality of speech (see, for example, Non-Patent Document 1).
[0003]
[Non-Patent Document 1]
Tomoko Matsui and Sadaoki Furui, "Text-independent speaker recognition using sound source and vocal tract features", IEICE Technical Report SP90-26, IEICE, June 1990, pp. 55-62
[0004]
[Problems to be solved by the invention]
It is known to apply feature-dependent weighting when performing speaker recognition. When such weighting is applied to a method that uses both voiced and unvoiced sound sections, frames are cut out with a fixed frame length and weights are applied to the unvoiced sound sections as well as the voiced sound sections, with the voiced sound sections given, for example, twice the weight of the unvoiced sound sections.
However, if the frame length is fixed for both unvoiced and voiced sound sections, the analysis of the voiced sound sections becomes coarse. Weighting is used to compensate for this coarseness, but such processing does not give sufficient speaker recognition performance. In addition, the weighting must be applied every time the distance between the analysis result and a preset speaker model is calculated, which is cumbersome.
The present invention provides a speaker recognition device that can perform speaker recognition with high accuracy and can easily calculate the distance to the speaker model.
[0005]
[Means for Solving the Problems]
The present invention comprises: voice input means, such as a microphone, for inputting voice; frame cutout means for cutting the voice waveform data input by the voice input means into frames of a set frame width at a set shift interval; feature vector generation means for converting the voice waveform data of each frame cut out by the frame cutout means into a feature vector; pitch detection means for detecting the presence or absence of a vocal cord frequency in the frame cut out by the frame cutout means; speaker model storage means in which speaker models are stored in advance; and recognition means for performing speaker recognition from the distance between the feature vector generated by the feature vector generation means and a speaker model stored in the speaker model storage means. When the pitch detection means detects the vocal cord frequency, the frame cutout means sets the shift interval to T₁ (T₁ < T₀), which does not depend on the pitch frequency, in place of T₀, which corresponds to the unvoiced sound section.
[0006]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the overall configuration of the apparatus. Reference numeral 1 denotes a CPU (central processing unit) constituting the main body of the control unit, 2 denotes a ROM (read-only memory) storing program data and the like with which the CPU 1 controls each part, and 3 denotes a RAM (random access memory) providing the memory used by the CPU 1 for data processing and various buffer memories.
[0007]
Reference numeral 4 denotes an I/O port, 5 a hard disk device provided with, among other things, speaker model storage means that stores a large number of speaker models in advance, 6 a keyboard controller that takes in key signals from a keyboard 7, and 8 a display controller that causes a display 9 to display information such as text and images.
The CPU 1, ROM 2, RAM 3, I/O port 4, hard disk device 5, and controllers 6 and 8 are electrically connected to one another via a bus line 10.
[0008]
A microphone 11 is also provided for capturing voice. The microphone 11 converts the captured voice into an electrical analog signal and supplies it to the low-pass filter 12 at the next stage. The low-pass filter 12 removes frequencies at or above a predetermined frequency from the input analog signal and passes the result to the A/D converter 13 at the next stage. The A/D converter 13 converts the input analog signal into a digital signal at a predetermined sampling frequency and number of quantization bits, and supplies it to the I/O port 4 as voice waveform data. The microphone 11, the low-pass filter 12, and the A/D converter 13 constitute the voice input means.
[0009]
FIG. 2 shows the functional blocks implemented by the combination of the CPU 1, ROM 2, RAM 3, and hard disk device 5. The frame cutout means 21 sequentially cuts the voice waveform data input from the I/O port 4 into frames of a set frame width L at a set shift interval T and supplies them to the feature vector generation means 22 at the next stage.
[0010]
The pitch detection means 23 detects the presence or absence of a voice-frequency component, that is, of a pitch, in the frame cut out by the frame cutout means 21 and outputs the detection result to the frame cutout means 21. The frame cutout means 21 changes the shift interval T and frame width L that it sets according to the detection result of the pitch detection means 23. Specifically, both are made shorter in voiced sound sections, where the voice-frequency component is present, than in unvoiced sound sections, where it is absent.
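The text does not say how the pitch detection means 23 decides whether a pitch is present. As a minimal sketch only, the following assumes a normalized autocorrelation test; the function name `detect_pitch`, the search range `f_min` to `f_max`, and the `threshold` value are illustrative assumptions, not part of the described device.

```python
def detect_pitch(frame, sample_rate, f_min=60.0, f_max=400.0, threshold=0.5):
    """Return an estimated pitch frequency in Hz, or None if no pitch is found.

    Assumption: a frame counts as voiced when its autocorrelation, normalized
    by the frame energy, has a peak of at least `threshold` at some lag inside
    the plausible pitch range.
    """
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return None  # silent frame: nothing to correlate
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), n - 1)
    best_lag, best_corr = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    if best_lag is None or best_corr < threshold:
        return None
    return sample_rate / best_lag
```

A return value of None corresponds to the unvoiced case, and a frequency corresponds to the voiced case in which the frame cutout means 21 shortens T and L.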
[0011]
The feature vector generation means 22 converts the voice waveform data of each sequentially input frame into a feature vector, such as cepstrum coefficients, and outputs the result as a feature vector series to the distance calculation means 24 at the next stage. The distance calculation means 24 calculates the distance between the input feature vector series and a speaker model, such as a codebook, stored in the speaker model storage means 25 provided in the hard disk device 5, and outputs it to the recognition means 26 at the next stage.
The recognition means 26 performs speaker recognition by comparing the distance data from the distance calculation means 24 with a preset threshold value, and displays the result on, for example, the display 9.
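The distance calculation and threshold comparison described above can be sketched as follows. This is an illustration under stated assumptions: the speaker model is taken to be a VQ codebook, the per-frame distance is taken to be the Euclidean distance to the nearest codeword, and the per-frame distances are combined by a simple mean before the comparison. The text only says that a distance is compared with a threshold, so the names `vq_distance` and `accept_speaker` and the averaging step are hypothetical.

```python
import math

def vq_distance(vector, codebook):
    """Distance from one feature vector to its nearest codeword (assumed metric)."""
    return min(math.dist(vector, codeword) for codeword in codebook)

def accept_speaker(frame_distances, threshold):
    """Accept the claimed speaker if the mean per-frame distance is small enough.

    Assumption: the text does not state how per-frame distances are combined;
    a plain mean is used here for illustration.
    """
    mean_distance = sum(frame_distances) / len(frame_distances)
    return mean_distance <= threshold
```

For example, with the codebook `[(0.0, 0.0), (3.0, 4.0)]` the vector `(0.0, 1.0)` is nearest to the first codeword, so its distance is 1.0.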
[0012]
FIG. 3 is a flowchart showing the processing performed by the functional blocks of FIG. 2. First, in S1, the frame cutout means 21 sets the shift interval T and the frame width L to T₀ and L₀, the values corresponding to the unvoiced sound section. Next, in S2, the frame cutout means 21 sequentially cuts the voice waveform data, which has been taken in from the I/O port 4 and temporarily stored in the buffer memory of the RAM 3, into frames of the set frame width L₀ at the set shift interval T₀.
[0013]
In S3, analysis processing is performed on the voice waveform data of the cut-out frame, and in S4 the pitch detection means 23 performs pitch detection. If no pitch is detected, the end of the speech is checked in S5; if the end has not been reached, processing returns to S1 and repeats.
[0014]
When the pitch detection means 23 detects a pitch, the pitch frequency f_P is extracted in S6, and in S7 the shift interval T is set to T₁ and the frame width L is set to F(f_P). Here T₁ < T₀ and F(f_P) < L₀. F(f_P) is a function that gives the frame width L from the pitch frequency f_P, meaning that the frame width L used while a pitch is detected varies with the pitch frequency f_P.
Once the shift interval T has been set to T₁ and the frame width L to F(f_P), processing returns to S2, where the voice waveform is cut out again, and repeats. When the end of the speech is found in S5, recognition processing is performed in S8 based on the per-frame distances already stored in the RAM 3, the recognition result is output, and the series of processes ends.
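The loop through S1, S2, S4, S5 and S7 can be sketched as below. This is one reading of the flowchart, not the device's actual implementation. The pitch detector is passed in as a callback because the text does not specify it, and the form of F(f_P) is not given either, so `frame_width_from_pitch` (a fixed number of pitch periods, kept below L₀) is purely a hypothetical choice. The default values 6, 24 and 3 msec are the example figures from the description.

```python
def frame_width_from_pitch(f_p, l0_sec=0.024, periods=2.5):
    """Hypothetical F(f_P): a few pitch periods, kept strictly below L0."""
    return min(periods / f_p, 0.99 * l0_sec)

def cut_adaptive_frames(samples, sample_rate, detect_pitch,
                        t0_sec=0.006, l0_sec=0.024, t1_sec=0.003):
    """Return (start_index, width_in_samples, voiced) for each frame cut out.

    `detect_pitch(frame, sample_rate)` must return a frequency in Hz,
    or None when no pitch is present.
    """
    frames = []
    start = 0
    shift = round(t0_sec * sample_rate)   # S1: start with T0 and L0
    width = round(l0_sec * sample_rate)
    while start + width <= len(samples):  # S5: stop at the end of the speech
        frame = samples[start:start + width]       # S2: cut out one frame
        f_p = detect_pitch(frame, sample_rate)     # S4 (and S6 when voiced)
        frames.append((start, width, f_p is not None))
        if f_p is None:
            shift = round(t0_sec * sample_rate)    # back to S1: T0, L0
            width = round(l0_sec * sample_rate)
        else:
            shift = round(t1_sec * sample_rate)    # S7: T = T1
            width = round(frame_width_from_pitch(f_p, l0_sec) * sample_rate)
        start += shift
    return frames
```

The analysis of S3 (feature vector and distance per frame) is left out here so that the sketch shows only how T and L change between unvoiced and voiced sections.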
[0015]
The analysis processing of S3 is shown in FIG. 4. In S31, the feature vector generation means 22 generates a feature vector; in S32, the distance calculation means 24 calculates the distance; and in S33, the distance calculated for each frame by the distance calculation means 24 is stored in the RAM 3.
[0016]
In such a configuration, when voice is input from the microphone 11, it is converted into an electrical analog signal, and the low-pass filter 12 removes frequencies at or above a predetermined frequency. Here, frequencies at or above one half of the sampling frequency of the A/D converter 13 are removed. For example, if the sampling frequency is 12 kHz, frequencies of 6 kHz and above are removed. The analog signal that has passed through the low-pass filter 12 is sampled at a sampling frequency of 12 kHz in the A/D converter 13 and converted into a digital signal.
[0017]
The voice waveform data converted into a digital signal is sequentially cut by the frame cutout means 21 into frames of frame width L at shift interval T. The pitch detection means 23 detects whether a pitch is present in each frame cut out by the frame cutout means 21, and when the state changes from no pitch detected to pitch detected, the frame cutout means 21 switches the shift interval T from T₀ to T₁ and the frame width L from L₀ to F(f_P). For example, the shift interval T changes from 6 msec to 3 msec, and the frame width L changes from 24 msec to around 12 msec. The width is given as around 12 msec because it varies with the pitch frequency.
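To make the example figures concrete, the following short sketch converts them to sample counts at the 12 kHz sampling frequency used in the description. The helper name `msec_to_samples` is hypothetical, and 12 msec is used as a stand-in for the pitch-dependent width F(f_P).

```python
SAMPLE_RATE_HZ = 12_000  # example sampling frequency from the description

def msec_to_samples(msec, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate_hz * msec // 1000

# Upper band edge left by the low-pass filter: half the sampling frequency.
nyquist_hz = SAMPLE_RATE_HZ // 2

# (shift interval, frame width) in samples for each kind of section.
unvoiced = (msec_to_samples(6), msec_to_samples(24))   # T0, L0
voiced = (msec_to_samples(3), msec_to_samples(12))     # T1, nominal F(f_P)
```

With these figures an unvoiced frame holds 288 samples and advances by 72, while a voiced frame holds about 144 samples and advances by 36.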
[0018]
That is, as shown in FIG. 5, for the input speech waveform 100, frames are cut out with a frame width of 24 msec at a shift interval of 6 msec in the section where no pitch is detected, as shown by frames 101 to 105. When a pitch is detected, the frame width is changed from 24 msec to around 12 msec, as shown by frame 106, and frames with a width of around 12 msec are then cut out while shifting at a shift interval of 3 msec.
[0019]
The voice waveform data of the sequentially cut-out frames is then converted into a feature vector series by the feature vector generation means 22, the distance calculation means 24 calculates the distance between the feature vector series and the speaker model stored in the speaker model storage means 25, and the recognition means 26 compares the calculated distance data with the threshold value. Speaker recognition is performed in this way, and the result is displayed on the display 9.
[0020]
In this way, when the input speech waveform is cut into frames, the shift interval T and the frame width L are made large in sections where no pitch is detected and small once a pitch has been detected, so the number of frames per unit time is relatively greater in voiced sound sections than in unvoiced sound sections. That is, voiced sound sections are cut into frames more finely than unvoiced sound sections, so they carry more weight in speaker recognition, and speaker recognition with high accuracy can be performed just as when weighting is applied.
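The claim that denser framing acts like weighting can be checked with a small numeric sketch. It assumes, for illustration only, that the recognition score is a mean of per-frame distances; the distances 4.0 and 1.0 are made-up numbers, and the 6 msec and 3 msec shift intervals are the example values from the description.

```python
def plain_mean(distances):
    """Unweighted mean over all frames."""
    return sum(distances) / len(distances)

def weighted_mean(distances, weights):
    """Mean in which each frame's distance counts `weight` times."""
    return sum(d * w for d, w in zip(distances, weights)) / sum(weights)

# 60 msec of unvoiced speech followed by 60 msec of voiced speech.
# Adaptive framing: 6 msec shift (10 frames), then 3 msec shift (20 frames).
adaptive = [4.0] * (60 // 6) + [1.0] * (60 // 3)

# Fixed framing: 6 msec shift throughout, voiced frames weighted by T0/T1 = 2.
fixed = [4.0] * 10 + [1.0] * 10
weights = [1.0] * 10 + [2.0] * 10
```

Under these assumptions `plain_mean(adaptive)` and `weighted_mean(fixed, weights)` give the same value, which is why no explicit weighting step is needed at distance calculation time.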
[0021]
Moreover, because accuracy in voiced sound sections is raised by reducing the shift interval T and the frame width L and thereby increasing the number of frames cut out, pitch detection can be more accurate than when weighting is used. In addition, cumbersome weighting is unnecessary, so the distance to the speaker model can be calculated easily.
[0022]
In addition, the frame width L in voiced sound sections is set using the function F(f_P), which varies with the pitch frequency f_P, that is, with the vocal cord frequency. The frame width L in voiced sound sections can therefore be set appropriately for the vocal cord frequency, and pitch detection accuracy can be maintained regardless of the vocal cord frequency.
[0023]
[Effect of the Invention]
As described above in detail, according to the present invention, it is possible to provide a speaker recognition device that can perform speaker recognition with high accuracy and can easily calculate the distance from the speaker model.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an entire apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram functionally showing the main configuration of the embodiment.
FIG. 3 is a flowchart showing the main processing in the embodiment.
FIG. 4 is a flowchart showing the contents of the analysis processing in FIG. 3.
FIG. 5 is a diagram for explaining frame cutout for an input speech waveform in the embodiment.
[Explanation of symbols]
1 ... CPU, 11 ... Microphone, 13 ... A/D converter, 21 ... Frame cutout means, 22 ... Feature vector generation means, 23 ... Pitch detection means, 24 ... Distance calculation means, 25 ... Speaker model storage means, 26 ... Recognition means.

Claims

A speaker recognition device comprising: voice input means for inputting voice; frame cutout means for cutting the voice waveform data input by the voice input means into frames of a set frame width at a set shift interval; feature vector generation means for converting the voice waveform data of each frame cut out by the frame cutout means into a feature vector; pitch detection means for detecting the presence or absence of a vocal cord frequency in the frame cut out by the frame cutout means; speaker model storage means in which a speaker model is stored in advance; and recognition means for performing speaker recognition from the distance between the feature vector generated by the feature vector generation means and the speaker model stored in the speaker model storage means,
wherein, when the pitch detection means detects the vocal cord frequency, the frame cutout means sets the shift interval to T₁ (T₁ < T₀), which does not depend on the pitch frequency, in place of T₀, which corresponds to the unvoiced sound section.

The speaker recognition device according to claim 1, wherein, when the pitch detection means detects the vocal cord frequency, the frame cutout means further sets the frame width to a value smaller than L₀ that does not depend on the pitch frequency, instead of L₀ corresponding to the unvoiced sound section.