JP2004341340A - Speaker recognition device - Google Patents

Speaker recognition device

Info

Publication number
JP2004341340A
Authority
JP
Japan
Prior art keywords
frame
speaker
feature vector
shift interval
pitch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2003139252A
Other languages
Japanese (ja)
Other versions
JP4408205B2 (en)
Inventor
Tomonari Kakino
友成 柿野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba TEC Corp
Original Assignee
Toshiba TEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba TEC Corp filed Critical Toshiba TEC Corp
Priority to JP2003139252A priority Critical patent/JP4408205B2/en
Publication of JP2004341340A publication Critical patent/JP2004341340A/en
Application granted granted Critical
Publication of JP4408205B2 publication Critical patent/JP4408205B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Abstract

PROBLEM TO BE SOLVED: To recognize a speaker with high accuracy and to easily calculate the distance to a speaker model.

SOLUTION: A frame cutout means 21 successively cuts speech waveform data into frames of frame width L at a shift interval T and outputs them to a feature vector generating means 22. A pitch detection means 23 detects the presence or absence of a pitch in each frame cut out by the frame cutout means, and the frame cutout means changes the shift interval T and the frame width L according to the detection result: in voiced sections, where a vocal cord frequency is present, T and L are made shorter than in unvoiced sections. The feature vector generating means converts the speech waveform data of each frame into a feature vector and outputs it to a distance calculation means 24 at the next stage. The distance calculation means calculates the distance between the feature vector sequence and a speaker model stored in a speaker model storage means 25 and outputs it to a recognition means 26 at the next stage. The recognition means compares the distance data from the distance calculation means with a preset threshold to recognize the speaker.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speaker recognition device that recognizes a speaker by inputting voice.
[0002]
[Prior art]
In general, in speaker recognition, an input voice is cut out at a shift interval of about 10 msec into frames of about 30 msec in length, and various personality features obtained for each frame are extracted as a feature vector sequence over all voiced sound sections. The extracted feature vector sequence is then evaluated by a comparison method such as the VQ distortion method or the HMM method to perform speaker recognition.
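For concreteness, the conventional fixed framing just described can be sketched as follows. The 30 msec width and 10 msec shift are the approximate values from the text; the function name and the list-based representation are illustrative:

```python
def cut_frames(samples, sample_rate, frame_ms=30, shift_ms=10):
    """Cut a waveform into fixed-width, overlapping frames (conventional method)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift_len = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift_len
    return frames

# 1 second of dummy audio at 12 kHz -> frames of 360 samples every 120 samples
frames = cut_frames([0.0] * 12000, 12000)
```

Because the width and shift are constant here, voiced and unvoiced sections are analyzed at the same resolution, which is exactly the limitation the invention addresses.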
In recent years, it has been known to use not only a voiced sound section but also an unvoiced sound section in recognizing the personality of voice (for example, see Non-Patent Document 1).
[0003]
[Non-patent document 1]
Tomoko Matsui and Sadaoki Furui, IEICE Technical Report SP90-26, "Text-independent speaker recognition using sound source / vocal tract features", IEICE, June 1990, pp. 55-62
[0004]
[Problems to be solved by the invention]
It is known that weighting based on feature amounts can be applied when performing speaker recognition. When such weighting is applied to a method that uses both voiced and unvoiced sections, frames are cut out with a fixed frame length and weighting is applied not only to the voiced sections but also to the unvoiced sections, with the voiced sections weighted, for example, twice as heavily as the unvoiced sections. However, if the frame length is fixed for both unvoiced and voiced sections, the analysis of the voiced sections becomes coarse. The weighting is meant to compensate for this coarseness, but such processing does not yield sufficient speaker recognition. Furthermore, when the distance between the analysis result and a preset speaker model is calculated, the weighting must be processed each time, which is troublesome.
The present invention provides a speaker recognition device that can perform speaker recognition with high accuracy and that can easily calculate a distance from a speaker model.
[0005]
[Means for Solving the Problems]
The present invention comprises: voice input means, for example a microphone, for inputting voice; frame cutout means for cutting out the voice waveform data input by the voice input means into frames of a set frame width at a set shift interval; feature vector generating means for converting the voice waveform data of each frame cut out by the frame cutout means into a feature vector; pitch detecting means for detecting the presence or absence of a vocal cord frequency in a frame cut out by the frame cutout means; speaker model storage means storing speaker models in advance; and recognition means for performing speaker recognition from the distance between the feature vectors generated by the feature vector generating means and a speaker model stored in the speaker model storage means, wherein the frame cutout means sets the shift interval to be shorter when the pitch detecting means detects a vocal cord frequency.
[0006]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of the entire apparatus. Reference numeral 1 denotes a CPU (central processing unit) constituting the main body of a control unit; 2, a ROM (read-only memory) storing program data and the like with which the CPU 1 controls each unit; and 3, a RAM (random access memory) provided with memory used by the CPU 1 for data processing and with various buffer memories.
[0007]
Reference numeral 4 denotes an I/O port; 5, a hard disk device provided with speaker model storage means storing many speaker models in advance; 6, a keyboard controller that takes in key signals from a keyboard 7; and 8, a display controller that causes a display 9 to show information such as text and images.
The CPU 1, ROM 2, RAM 3, I / O port 4, hard disk 5, and controllers 6, 8 are electrically connected to each other via a bus line 10.
[0008]
A microphone 11 for taking in sound is also provided; it converts the captured sound into an electrical analog signal and supplies the output to a low-pass filter 12 at the next stage. The low-pass filter 12 removes frequencies at or above a predetermined frequency from the input analog signal and feeds the result to an A/D converter 13 at the next stage. The A/D converter 13 converts the input analog signal into a digital signal at a predetermined sampling frequency and number of quantization bits and supplies it to the I/O port 4 as audio waveform data. The microphone 11, the low-pass filter 12, and the A/D converter 13 constitute the voice input means.
[0009]
FIG. 2 shows functional blocks implemented by the combination of the CPU 1, the ROM 2, the RAM 3, and the hard disk device 5. The frame cutout means 21 sequentially cuts the audio waveform data input from the I/O port 4 into frames of the set frame width L at the set shift interval T, and supplies them to the feature vector generating means 22 at the next stage.
[0010]
The pitch detecting means 23 detects the presence or absence of a voice frequency, that is, of a pitch, in each frame cut out by the frame cutout means 21, and outputs the detection result to the frame cutout means 21. The frame cutout means 21 changes the shift interval T and the frame width L it sets according to the detection result from the pitch detecting means 23. That is, the shift interval T and the frame width L are made shorter in voiced sections, where a voice frequency is present, than in unvoiced sections, where it is absent.
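The patent does not specify how the pitch detecting means 23 decides whether a pitch is present; an autocorrelation detector is a common choice and is sketched below. The frequency range, threshold, and function name are assumptions for illustration:

```python
import math

def detect_pitch(frame, sample_rate, fmin=60.0, fmax=400.0, threshold=0.3):
    """Autocorrelation pitch detector: returns the pitch frequency in Hz,
    or None when the frame looks unvoiced (no strong periodicity)."""
    n = len(frame)
    energy = sum(s * s for s in frame)
    if energy == 0:
        return None  # silence: trivially unvoiced
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), n - 1)
    best_lag, best_r = None, threshold
    for lag in range(lag_min, lag_max + 1):
        # normalized autocorrelation at this candidate period
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        if r > best_r:
            best_r, best_lag = r, lag
    return sample_rate / best_lag if best_lag else None

# a 200 Hz sine over one 30 msec frame at 12 kHz is strongly periodic
sr = 12000
sine = [math.sin(2 * math.pi * 200 * t / sr) for t in range(360)]
f = detect_pitch(sine, sr)
```

The detection result is what drives the switch between the long (T0, L0) and short (T1, F(fP)) framing described above.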
[0011]
The feature vector generating means 22 converts the audio waveform data of each sequentially input frame into a feature vector, for example cepstrum coefficients, and outputs the result to the distance calculating means 24 at the next stage as a feature vector sequence. The distance calculating means 24 calculates the distance between the input feature vector sequence and a speaker model, for example a codebook, stored in the speaker model storage means 25 provided on the hard disk 5, and outputs it to the recognition means 26 at the next stage.
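The cepstrum coefficients mentioned as an example feature can be sketched as the inverse FFT of the log magnitude spectrum. The windowing and number of coefficients below are illustrative choices, not values fixed by the patent:

```python
import numpy as np

def cepstrum(frame, n_coef=12):
    """Real cepstrum of one frame: IFFT of the log magnitude spectrum.
    A generic sketch of 'cepstrum coefficient' features."""
    windowed = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed)) + 1e-10  # avoid log(0)
    ceps = np.fft.irfft(np.log(spectrum))
    return ceps[1:n_coef + 1]  # drop c0 (overall energy), keep low-quefrency terms

# one 24 msec frame at 12 kHz -> a 12-dimensional feature vector
rng = np.random.default_rng(0)
vec = cepstrum(rng.standard_normal(288))
```

Applying this per frame produces the feature vector sequence handed to the distance calculating means 24.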
The recognition means 26 performs speaker recognition by comparing the distance data from the distance calculation means 24 with a preset threshold value, and displays the result on, for example, the display 9.
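The distance computation and threshold comparison can be sketched with the VQ distortion method mentioned earlier: each feature vector is scored against its nearest codebook entry, and the average distortion is compared with the preset threshold. The tiny codebook and threshold below are hypothetical:

```python
def vq_distance(features, codebook):
    """Average squared distance from each feature vector to its nearest
    codebook entry (VQ distortion against one speaker model)."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sum(min(d2(f, c) for c in codebook) for f in features) / len(features)

def recognize(features, codebook, threshold):
    """Accept the claimed speaker when the distortion stays below the threshold."""
    return vq_distance(features, codebook) <= threshold

codebook = [[0.0, 0.0], [1.0, 1.0]]          # toy 2-entry speaker model
close = recognize([[0.1, 0.0], [0.9, 1.0]], codebook, threshold=0.5)
far = recognize([[5.0, 5.0]], codebook, threshold=0.5)
```

Because voiced sections contribute more frames under the adaptive framing, they automatically dominate this average, with no explicit per-frame weighting.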
[0012]
FIG. 3 is a flowchart showing the processing performed by the functional blocks in FIG. 2. First, in S1, the frame cutout means 21 sets the shift interval T and the frame width L to T0 and L0, which correspond to the unvoiced sound section. Subsequently, in S2, the frame cutout means 21 sequentially cuts the speech waveform data, taken in from the I/O port 4 and temporarily stored in a buffer memory of the RAM 3, into frames of the set frame width L0 at the set shift interval T0.
[0013]
Analysis processing is performed on the audio waveform data of the cut-out frame in S3, and pitch detection by the pitch detecting means 23 is performed in S4. If no pitch is detected, that is, if no voiced waveform is detected, the end of the speech is checked in S5; if the end has not been reached, the process returns to S1 and the processing is repeated.
[0014]
When the pitch detecting means 23 detects a pitch, the pitch frequency fP is extracted in S6, and in S7 the shift interval T is set to T1 and the frame width L is set to F(fP). Here T1 < T0 and F(fP) < L0. F(fP) is a function that determines the frame width L from the pitch frequency fP; that is, once a pitch has been detected, the frame width L changes according to the pitch frequency fP.
After the shift interval T is set to T1 and the frame width L to F(fP), the process returns to S2, where the speech waveform is cut out and the processing is repeated. When the end of the speech is detected in S5, recognition processing is performed in S8 on the basis of the per-frame distances previously stored in the RAM 3, the recognition result is output, and this series of processing ends.
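The S1-S8 loop can be sketched as follows. The pitch detector is injected as a parameter; T1 = T0/2 and F(fP) = two pitch periods are assumptions chosen to match the 6 to 3 msec and 24 to about 12 msec example given later in the text, since the patent does not fix these forms:

```python
def adaptive_frames(samples, sr, detect_pitch, t0_ms=6.0, l0_ms=24.0):
    """Sketch of the FIG. 3 loop: cut with (T0, L0) while unvoiced, switch to
    a shorter shift T1 and pitch-dependent width F(fP) once a pitch appears.

    detect_pitch(frame) returns a pitch frequency in Hz, or None if unvoiced.
    """
    frames = []
    t_s, l_s = t0_ms / 1000.0, l0_ms / 1000.0   # S1: unvoiced settings T0, L0
    pos = 0
    while True:
        width = int(l_s * sr)
        if pos + width > len(samples):          # S5: end of speech reached
            break
        frame = samples[pos:pos + width]        # S2: cut out one frame
        frames.append(frame)                    # S3: analysis would happen here
        fp = detect_pitch(frame)                # S4: pitch detection
        if fp is None:
            t_s, l_s = t0_ms / 1000.0, l0_ms / 1000.0  # back to T0, L0
        else:
            t_s = t0_ms / 2000.0                # T1 < T0 (assumed T0 / 2)
            l_s = 2.0 / fp                      # F(fP) < L0 (assumed 2 periods)
        pos += int(t_s * sr)                    # shift by the updated interval
    return frames

# unvoiced throughout: fixed 24 msec frames every 6 msec at 12 kHz
unvoiced = adaptive_frames([0.0] * 12000, 12000, lambda f: None)
# voiced at 200 Hz: from the second frame on, width shrinks to 2/200 s = 10 msec
voiced = adaptive_frames([0.0] * 12000, 12000, lambda f: 200.0)
```

S8 (recognition from the accumulated per-frame distances) would follow once the loop exits at the end of the speech.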
[0015]
The analysis processing in S3 is the processing shown in FIG. That is, in S31, feature vector generation processing by the feature vector generation means 22 is performed, in S32, distance calculation processing by the distance calculation means 24 is performed, and in S33, one frame calculated by the distance calculation means 24 is processed. The distance for each is stored in the RAM 3.
[0016]
In such a configuration, when a sound is input from the microphone 11, the sound is converted into an electric analog signal, and a frequency higher than a predetermined frequency is cut by the low-pass filter 12. In this case, a frequency equal to or more than 1/2 of the sampling frequency in the A / D converter 13 is cut. For example, if the sampling frequency is 12 kHz, frequencies above 6 kHz are cut off. The analog signal that has passed through the low-pass filter 12 is sampled at a sampling frequency of 12 kHz in an A / D converter 13 and converted into a digital signal.
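The low-pass filtering and A/D step above can be sketched as uniform sampling plus uniform quantization. The 16-bit depth and the test signal are illustrative; the text fixes only the 12 kHz sampling rate and the 6 kHz cutoff that follows from it:

```python
import math

def sample_and_quantize(signal, duration_s, sampling_hz=12000, bits=16):
    """Sample a continuous signal (a function of time in seconds) at uniform
    intervals and quantize each sample to a signed integer, as the A/D
    converter 13 does. Content at or above sampling_hz / 2 (6 kHz here)
    must already have been removed by the low-pass filter, or it aliases."""
    full_scale = 2 ** (bits - 1)
    samples = []
    for i in range(int(duration_s * sampling_hz)):
        x = signal(i / sampling_hz)
        # clamp to [-1, 1) then scale to the signed integer range
        q = int(max(-1.0, min(1.0 - 1.0 / full_scale, x)) * full_scale)
        samples.append(q)
    return samples

# 10 msec of a 440 Hz tone at 12 kHz -> 120 samples of audio waveform data
data = sample_and_quantize(lambda t: math.sin(2 * math.pi * 440 * t), 0.01)
```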
[0017]
The audio waveform data converted into a digital signal is sequentially cut out by the frame cutout means 21 into frames of frame width L at a shift interval T. The presence or absence of a pitch in each frame cut out by the frame cutout means 21 is detected by the pitch detecting means 23, and the frame cutout means 21 switches the shift interval T from T0 to T1 and the frame width L from L0 to F(fP) between the state where no pitch is detected and the state where a pitch is detected. For example, the shift interval T changes from 6 msec to 3 msec, and the frame width L from 24 msec to around 12 msec. The width is "around" 12 msec because it varies with the pitch frequency.
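One plausible form of F(fP) consistent with the 24 msec to about 12 msec example is a frame spanning a fixed number of pitch periods, so that higher-pitched voices get shorter frames. This form is a hypothesis for illustration; the text only says the width varies with the pitch frequency:

```python
def frame_width_ms(pitch_hz, periods=2.0):
    """Assumed F(fP): the frame covers a fixed number of pitch periods,
    giving a width around 12 msec for typical speaking pitches."""
    return periods * 1000.0 / pitch_hz

w_low = frame_width_ms(125.0)    # deeper voice  -> 16.0 msec frame
w_high = frame_width_ms(250.0)   # higher voice  ->  8.0 msec frame
```

Tying the width to the pitch period keeps each voiced frame covering the same amount of glottal activity regardless of the speaker's vocal cord frequency.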
[0018]
That is, as shown in FIG. 5, in a section of the input speech waveform 100 where no pitch is detected, frames of width 24 msec are cut out while shifting at intervals of 6 msec, as shown by frames 101 to 105. Once a pitch is detected, the frame width is changed from 24 msec to around 12 msec, as shown by frame 106, and thereafter frames of around 12 msec are cut out while shifting sequentially at intervals of 3 msec.
[0019]
The speech waveform data of the sequentially cut-out frames are then converted into a feature vector sequence by the feature vector generating means 22, the distance between the feature vector sequence and the speaker model stored in the speaker model storage means 25 is calculated by the distance calculating means 24, and the recognition means 26 compares the calculated distance data with the threshold. Speaker recognition is performed in this way, and the result is displayed on the display 9.
[0020]
As described above, when the input speech waveform is cut into frames, the shift interval T and the frame width L are large in sections where no pitch is detected and small after a pitch has been detected, so the number of frames per unit time is relatively larger in voiced sections than in unvoiced sections. That is, frames are cut out more finely in voiced sections than in unvoiced sections; voiced sections therefore carry more weight in speaker recognition, and recognition is as accurate as when explicit weighting is applied.
[0021]
Moreover, since the shift interval T and the frame width L are reduced in voiced sections so that more frames are cut out, pitch detection is more accurate than in the weighting approach, and the distance to the speaker model can be calculated easily, with no need for troublesome weighting.
[0022]
In addition, since the frame width L in voiced sections is set using the function F(fP), which varies with the pitch frequency fP, that is, with the frequency of the vocal cords, the frame width L in voiced sections can be set appropriately according to the vocal cord frequency, and pitch detection accuracy is maintained regardless of that frequency.
[0023]
[Effects of the Invention]
As described in detail above, according to the present invention, it is possible to provide a speaker recognition device that can perform speaker recognition with high accuracy and can easily calculate a distance from a speaker model.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an entire apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram functionally showing a main part configuration in the embodiment.
FIG. 3 is a flowchart showing main part processing in the embodiment.
FIG. 4 is a flowchart showing the contents of an analysis process in FIG. 3;
FIG. 5 is an exemplary view for explaining frame extraction for an input audio waveform according to the embodiment;
[Explanation of symbols]
1 ... CPU, 11 ... Microphone, 13 ... A/D converter, 21 ... Frame cutout means, 22 ... Feature vector generating means, 23 ... Pitch detecting means, 24 ... Distance calculating means, 25 ... Speaker model storage means, 26 ... Recognition means.

Claims (3)

1. A speaker recognition device comprising: voice input means for inputting voice; frame cutout means for cutting out the voice waveform data input by the voice input means into frames of a set frame width at a set shift interval; feature vector generating means for converting the voice waveform data of each frame cut out by the frame cutout means into a feature vector; pitch detecting means for detecting the presence or absence of a vocal cord frequency in a frame cut out by the frame cutout means; speaker model storage means storing a speaker model in advance; and recognition means for performing speaker recognition from the distance between the feature vectors generated by the feature vector generating means and the speaker model stored in the speaker model storage means, wherein the frame cutout means sets the shift interval to be shorter when the pitch detecting means detects a vocal cord frequency.

2. The speaker recognition device according to claim 1, wherein, when the pitch detecting means detects a vocal cord frequency, the frame cutout means sets the shift interval to be shorter and also sets the frame width to be shorter.

3. The speaker recognition device according to claim 2, wherein the frame cutout means changes the frame width to be set according to the frequency of the vocal cord frequency.
JP2003139252A 2003-05-16 2003-05-16 Speaker recognition device Expired - Fee Related JP4408205B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2003139252A JP4408205B2 (en) 2003-05-16 2003-05-16 Speaker recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2003139252A JP4408205B2 (en) 2003-05-16 2003-05-16 Speaker recognition device

Publications (2)

Publication Number Publication Date
JP2004341340A (en) 2004-12-02
JP4408205B2 JP4408205B2 (en) 2010-02-03

Family

ID=33528395

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2003139252A Expired - Fee Related JP4408205B2 (en) 2003-05-16 2003-05-16 Speaker recognition device

Country Status (1)

Country Link
JP (1) JP4408205B2 (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007000816A1 (en) * 2005-06-29 2007-01-04 Toshiba Tec Kabushiki Kaisha Speech feature extracting device, speaker recognizer, program, and speech feature extracting method
JP2009262702A (en) * 2008-04-23 2009-11-12 Fuji Heavy Ind Ltd Safe driving support system
CN103548076A (en) * 2012-05-23 2014-01-29 恩斯沃尔斯有限责任公司 Device and method for recognizing content using audio signals
WO2018163279A1 (en) * 2017-03-07 2018-09-13 日本電気株式会社 Voice processing device, voice processing method and voice processing program
US11250860B2 (en) 2017-03-07 2022-02-15 Nec Corporation Speaker recognition based on signal segments weighted by quality
US11837236B2 (en) 2017-03-07 2023-12-05 Nec Corporation Speaker recognition based on signal segments weighted by quality
CN109256138A (en) * 2018-08-13 2019-01-22 平安科技(深圳)有限公司 Auth method, terminal device and computer readable storage medium
CN109256138B (en) * 2018-08-13 2023-07-07 平安科技(深圳)有限公司 Identity verification method, terminal device and computer readable storage medium
CN110880322A (en) * 2019-11-29 2020-03-13 中核第四研究设计工程有限公司 Control method of monitoring equipment and voice control device
CN110880322B (en) * 2019-11-29 2022-05-27 中核第四研究设计工程有限公司 Control method of monitoring equipment and voice control device

Also Published As

Publication number Publication date
JP4408205B2 (en) 2010-02-03


Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20060426

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20090427

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20090512

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20090608

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20090804

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20090819

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20091104

A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20091106

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20121120

Year of fee payment: 3

LAPS Cancellation because of no payment of annual fees