CN102509547B

CN102509547B - Method and system for voiceprint recognition based on vector quantization

Info

Publication number: CN102509547B
Application number: CN2011104503646A
Authority: CN
Inventors: 霍春宝; 赵立辉; 崔文翀; 张彩娟; 曹景胜
Original assignee: Liaoning University of Technology
Current assignee: Liaoning University of Technology
Priority date: 2011-12-29
Filing date: 2011-12-29
Publication date: 2013-06-19
Anticipated expiration: 2031-12-29
Also published as: CN102509547A

Abstract

The invention discloses a method and a system for voiceprint recognition based on vector quantization. They offer good recognition performance and noise immunity, require little modeling data, make decisions quickly, and have low complexity. The method includes the following steps: acquiring audio signals; preprocessing the audio signals; extracting characteristic parameters of the audio signals as MFCC (mel-frequency cepstral coefficient) parameters, wherein the order of the MFCC ranges from 12 to 16; template training, namely using the LBG (Linde, Buzo and Gray) clustering algorithm to build a codebook for each speaker and storing the codebooks in an audio database as the speakers' audio templates; and voiceprint recognition, namely comparing the characteristic parameters of the audio signal to be recognized with the speaker audio templates in the audio database and deciding according to a weighted Euclidean distance measure: the speaker whose template gives the audio feature vector X of the speaker to be recognized the minimum average distance measure is taken as the recognized speaker.

Description

Voiceprint recognition method and system based on vector quantization

Technical field

The invention belongs to the field of speech processing technology, and in particular relates to a vector-quantization-based voiceprint recognition method and system that identify a speaker from the speaker's voice signal.

Background technology

In recent years, with the widespread use of information processing and artificial intelligence technology and the urgent demand for fast and effective identity authentication, traditional password authentication has gradually lost its standing, while in the field of biometric recognition, identity recognition based on a speaker's voice has won increasing favor.

Because each person's vocal organs differ physiologically, and acquired behavioral differences lead to different articulation and speaking habits, it is possible to identify a person by voice. Besides the advantages that a voice cannot be forgotten, need not be memorized, and is easy to use, voiceprint recognition has the following properties. First, the authentication mode is easy to accept: the "password" is the voice itself, so the user only has to speak. Second, the content of the recognition text can be random and is hard to steal, so security is higher. Third, the terminal device used for recognition is a microphone or a telephone, which is inexpensive and easy to combine with existing communication systems. The application prospects of voiceprint recognition are therefore broad. In economic activity, it can support bank remittance, balance inquiry, transfers, and the like. In security, a designated voice can be used to check personnel at a secure site, so that the system responds only to a specific speaker. In forensic examination, the true identity of the offender among the suspects can be judged from a recording made at the time. In biomedicine, a system can be made to respond only to the patient's commands, thereby controlling the user's prosthesis.

The key techniques of voiceprint recognition are speech feature parameter extraction and model matching. Speech feature parameters fall broadly into two classes. One class is low-level features that mainly reflect the physiological properties of the speaker's vocal organs, such as Mel-frequency cepstral coefficients (MFCC), extracted according to the human ear's sensitivity to speech at different frequencies, and linear prediction cepstral coefficients (LPCC), obtained from an all-pole model of the speech signal. The other class is high-level features that mainly reflect the speaker's word usage and pronunciation habits, such as prosodic features, which reflect the intonation of the speaker's voice, and phone features, which reflect the statistical regularities of phonemes in the speaker's habitual speech. LPCC is built on a production model of the speech signal and is easily affected by the assumed model; high-level features are used in some of the literature, but their recognition rate is not very high.

The model matching methods proposed for the various speech feature parameters mainly include dynamic time warping (DTW), vector quantization (VQ), Gaussian mixture models (GMM), and artificial neural networks (ANN). The DTW model depends on the time order of the parameters, has relatively poor real-time performance, and suits speaker recognition based on isolated words. GMM is mainly used for speaker recognition with large amounts of speech; it needs more model training data, longer training and recognition times, and more memory. In the ANN model, the training algorithm for designing the best model topology is not guaranteed to converge, and over-learning can occur. In VQ-based speaker recognition, template matching does not depend on the time order of the parameters, real-time performance is relatively good, little modeling data is needed, decisions are fast, and complexity is low. The principle of speaker recognition based on the vector quantization model is to quantize each speaker's speech feature parameters into a codebook and store it in a speech library as that speaker's speech template. During recognition, the feature vectors of the speech to be recognized are compared with the speech templates of the speakers already in the library, the overall average quantization distortion is computed for each template, and the template with the minimum distortion is taken as the recognition result. The weakness is that speech signals follow an ellipsoidal normal distribution in which the components are distributed unequally, and this is not well reflected by the Euclidean distance measure used in traditional VQ speaker recognition systems.

Summary of the invention

The technical problem to be solved by the present invention is to provide a voiceprint recognition method and system based on vector quantization that have good recognition performance and noise resistance, need little modeling data, make decisions quickly, and have low complexity.

A voiceprint recognition method based on vector quantization, with the following steps:

1. Speech signal acquisition: the telephone of a program-controlled exchange comprehensive experiment box serves as the terminal device for capturing speech, and the speech signal is acquired through a sound card;

2. Speech signal preprocessing: the computer frames and windows the captured speech signal, with 256 samples per frame, a frame shift of 128 samples, and a Hamming window as the window function; endpoint detection uses a method combining short-time energy and short-time zero-crossing rate; for pre-emphasis, the emphasis coefficient takes a value from 0.90 to 1.00;

3. Speech feature parameter extraction: MFCC parameters are used, with an MFCC order of 12 to 16;

4. Template training: the LBG clustering algorithm builds a codebook for each speaker in the system and stores it in the speech database as that speaker's speech template;

5. Voiceprint recognition: the feature parameters of the speech signal to be recognized, obtained through steps 1, 2 and 3, are compared with the speaker speech templates built in the library through steps 1, 2, 3 and 4, and a decision is made according to the weighted Euclidean distance measure; the speaker whose template gives the speech feature vector X of the speaker to be recognized the minimum average distance measure is taken as the recognized speaker.

The speech feature parameter extraction proceeds as follows:

(1) A short-time Fourier transform is applied to the preprocessed speech signal to obtain its spectrum X(k). The DFT of the speech signal is:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}, \quad 0 \le k \le N-1 \qquad (1)$$

where x(n) is the input speech signal, taken frame by frame, and N is the number of points of the Fourier transform, set to 256;

(2) The squared magnitude of the spectrum X(k), that is, the energy spectrum |X(k)|^2, is computed and then passed through the Mel frequency filters, which smooth the spectrum of the speech signal, eliminate harmonics, and highlight the formants of the original speech;

The Mel frequency filters are a bank of Q triangular band-pass filters, each with its own centre frequency; the filter index runs over 1, 2, ..., Q. The Mel filter is given by formula (2):

[Formula (2): equation image missing from the source text]

(3) The logarithm of the Mel spectrum output by the filter bank is taken. This compresses the dynamic range of the speech spectrum and converts multiplicative noise components in the frequency domain into additive components. The log Mel spectrum is given by formula (3):

[Formula (3): equation image missing from the source text]

(4) Discrete cosine transform (DCT)

The log Mel spectrum obtained from formula (3) is transformed to the time domain; the result is the Mel-frequency cepstral coefficients (MFCC). The n-th coefficient is calculated by formula (4):

[Formula (4): equation image missing from the source text]

where L is the order of the MFCC parameters and Q is the number of Mel filters; L takes a value from 12 to 16 and Q from 23 to 26;
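For orientation, the textbook forms of steps (2) to (4) can be written as below. This is a sketch under stated assumptions, not the patent's own formulas (2) to (4), whose images are not available in this text: the symbols H_m(k), f(m), s(m) and C(n), the summation limits, and the absence of normalisation constants are all assumed. Here f(m) denotes the centre of the m-th triangular filter expressed as a DFT bin.

```latex
\begin{align*}
H_m(k) &=
\begin{cases}
0, & k < f(m-1) \\
\dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\
\dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\
0, & k > f(m+1)
\end{cases} \\
s(m) &= \ln\!\left( \sum_{k=0}^{N-1} |X(k)|^{2}\, H_m(k) \right), \quad 1 \le m \le Q \\
C(n) &= \sum_{m=1}^{Q} s(m) \cos\!\left( \frac{\pi n (m - 0.5)}{Q} \right), \quad 1 \le n \le L
\end{align*}
```

With the values stated in the text, N = 256, Q lies in 23 to 26, and L lies in 12 to 16.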

The LBG clustering algorithm used in the template training proceeds as follows:

(1) Obtain all training vectors X in the input feature vector set S, and set the code word of the initial codebook by the splitting method;

(2) Using a small threshold, split each code word in two according to formula (5):

[Formula (5): equation image missing from the source text]

After splitting, the code words of the new codebook are obtained;

(3) According to the nearest-neighbour criterion, find the nearest code word of the new codebook for each training vector, so that S is finally divided into m subsets; the assignment condition is given by formula (6):

[Formula (6): equation image missing from the source text]

where M is the number of code words in the current initial codebook;

(4) Compute the centroid of the feature vectors in each subset and replace the code word of that subset with the centroid, which yields a new codebook;

(5) Iterate steps (3) and (4) to obtain the code words of the new codebook;

(6) Then repeat step (2), splitting each newly obtained code word in two, and again iterate steps (3) and (4). Continue in this way until the codebook has the required 2^r code words, where r is an integer; r rounds of this loop are needed in total until clustering is complete, at which point the centroid of each class is the required code word.

The initial codebook in the LBG clustering algorithm is initialized by the splitting method, as follows:

(1) Take the mean of the feature vectors of all extracted frames as the code word of the initial codebook;

(2) Split each code word according to formula (7) to form 2m code words:

[Formula (7): equation image missing from the source text]

where m runs from 1 up to the number of code words in the current codebook; the formula also contains a splitting parameter whose symbol and value are missing from the source text;

(3) Cluster all feature vectors according to the new code words, then compute the total distance measure D and the total distance measure of the next iteration by formula (8), which uses the distance measure between a training feature vector X and the trained codebook:

[Formula (8): equation image missing from the source text]

Compute the relative distance measure by formula (9):

[Formula (9): equation image missing from the source text]

If the relative distance measure satisfies the stopping condition (the condition itself is an equation image missing from the source text), stop iterating; the current codebook is the designed codebook. Otherwise, go to the next step.

(4) Recompute the new centroid of each region;

(5) Repeat steps (3) and (4) until the best codebook with 2m code words is formed;

(6) Repeat steps (2), (3) and (4) until a codebook with M code words is formed;
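The splitting rule and the stopping test above follow the usual pattern in the LBG literature, which can be sketched as below. The symbols are assumptions, since the corresponding formula images are not available in this text: Y_m is a code word, ε is the small splitting parameter, D and D' are the total distance measures of two successive iterations, and δ is a small stopping threshold. The choice of denominator in the relative measure varies between references and is likewise assumed.

```latex
\begin{align*}
Y_m^{+} &= Y_m (1 + \varepsilon), \qquad Y_m^{-} = Y_m (1 - \varepsilon) \\
D &= \sum_{X \in S} \min_{1 \le m \le M} d(X, Y_m) \\
\frac{|D - D'|}{D'} &\le \delta \quad \Longrightarrow \quad \text{stop iterating}
\end{align*}
```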

In the discrete cosine transform above, L = 13 and Q = 25.

A voiceprint recognition system based on vector quantization, composed of:

a speech signal acquisition module, a speech signal preprocessing module, a speech feature parameter extraction module, a speech template training module, and a voiceprint recognition module.

Compared with the prior art, the beneficial effects of the present invention are:

Speech signals are acquired through a sound card, the acquired signals are preprocessed with speech processing techniques, speech feature parameters are then extracted, and vector quantization is applied to those parameters to build speech models and thereby a speaker recognition system. MFCC parameters give good recognition performance and noise resistance and closely model auditory perception; in speaker recognition, the most useful speaker information lies between the 2nd and 16th orders of the MFCC parameters. The vector quantization (VQ) method gives good recognition performance and noise resistance, good real-time behaviour, and good recognition results; it needs little modeling data, the algorithm is simple, decisions are fast, and complexity is low.

Description of drawings

Fig. 1 is a block diagram of the system of the present invention;

Fig. 2 is the main flow chart of the present invention;

Fig. 3 is a flow chart of the LBG algorithm;

Fig. 4 is the human-computer interaction interface of the VQ-based voiceprint recognition system.

Embodiment

As shown in Fig. 1, the voiceprint recognition system based on vector quantization identifies a speaker's voice through a combination of software and hardware, and is composed of:

a speech signal acquisition module, a speech signal preprocessing module, a speech feature parameter extraction module, a speech model training module, and a voiceprint recognition module.

As shown in Figs. 2 and 3, the specific steps of the voiceprint recognition method based on vector quantization are as follows:

1. Speech signal acquisition

Speech signal acquisition converts the original analog speech signal into a digital signal, with the number of channels and the sampling frequency configured. The present invention acquires speech with an SHT-8B/PCI sound card produced by the Hangzhou San Hui company, with 2 channels (the sound card's default) and a sampling frequency of 8 kHz (the sound card's default). The recognition terminal is the telephone set of a laboratory program-controlled exchange comprehensive experiment box. The experiment box uses space-division switching, and the speech path is A-2 (there are four paths in total: A-1, A-2, B-1 and B-2; the present invention chose A-2 at random, which does not affect the experimental results).

2. Speech signal preprocessing

(1) Windowing and framing

Because speech signals are time-varying, they must be processed over short segments, so the signal is divided into frames. To ensure that framing loses no information, adjacent frames must overlap; the offset between the starts of adjacent frames is the frame shift, and the ratio of frame shift to frame length is generally between 0 and 1/2. The present invention uses a frame length of 256 samples and a frame shift of 128 samples. The window function w(n) is the Hamming window, which has good smoothing properties:

$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (10)$$

where N is the window length, 256 points in the present invention.
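The framing and windowing step can be sketched as follows. This is a minimal illustration, assuming NumPy and a hypothetical helper name `frame_signal`; the frame length and frame shift come from the text, while the zero-padding of very short signals is an assumption of this sketch.

```python
import numpy as np

FRAME_LEN = 256    # samples per frame (from the text)
FRAME_SHIFT = 128  # frame shift in samples (from the text)

def frame_signal(signal, frame_len=FRAME_LEN, frame_shift=FRAME_SHIFT):
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        # Assumption: zero-pad short signals up to one full frame.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)
```

At the 8 kHz sampling frequency given earlier, a 256-sample frame spans 32 ms and a 128-sample shift spans 16 ms, so adjacent frames overlap by half.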

(2) Endpoint detection

The present invention detects the endpoints of the speech signal with a method combining short-time energy and short-time average zero-crossing rate, thereby determining the start and end points of the speech. Short-time energy detects voiced sounds, and the zero-crossing rate detects unvoiced sounds. Given the speech signal and the Hamming window function, the short-time energy is defined by formula (11):

[Formula (11): equation image missing from the source text]

The quantity defined by formula (11) is the short-time energy obtained when the window is applied starting at the n-th sample of the speech signal.

The short-time average zero-crossing rate is defined by formula (12):

[Formula (12): equation image missing from the source text]

where N is the length of the window function and sgn is the sign function, namely

$$\operatorname{sgn}[x] = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$
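The two measures can be sketched per frame as below. This is an illustrative sketch, assuming NumPy and frames stored as rows of a 2-D array; the function names, the thresholds, and the rule that a frame counts as speech when either measure exceeds its threshold are assumptions, since the text does not specify the decision logic.

```python
import numpy as np

def short_time_energy(frames):
    """Sum of squared samples in each (already windowed) frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Half the summed |sgn(x[m]) - sgn(x[m-1])| per frame, with sgn(0) = +1."""
    signs = np.where(frames >= 0, 1, -1)
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)

def detect_endpoints(frames, energy_thresh, zcr_thresh):
    """Return (first, last) indices of frames judged to be speech, or None.

    Assumption: a frame is speech if either measure exceeds its threshold.
    """
    active = (short_time_energy(frames) > energy_thresh) | (
        zero_crossing_rate(frames) > zcr_thresh
    )
    hits = np.flatnonzero(active)
    return (int(hits[0]), int(hits[-1])) if hits.size else None
```

Combining the two measures lets high-energy voiced frames and low-energy but rapidly oscillating unvoiced frames both count as speech.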

(3) Pre-emphasis

Because the average power spectrum of the speech signal is affected by glottal excitation and mouth-nose radiation, its high-frequency end falls off at about 6 dB/octave above 8000 Hz. Pre-emphasis is therefore applied to boost the high-frequency part of the speech signal and flatten its spectrum. Pre-emphasis is implemented with a digital filter that boosts high frequencies at 6 dB/octave, generally a first-order digital filter H(z), namely

$$H(z) = 1 - u z^{-1} \qquad (13)$$

The system's recognition rate is highest when u lies between 0.90 and 1.00; the present invention takes u = 0.97.
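In the time domain, a first-order pre-emphasis filter of this kind is the difference equation y(n) = x(n) - u x(n-1). A minimal sketch, assuming NumPy, a hypothetical function name `pre_emphasis`, and that the first sample is passed through unchanged:

```python
import numpy as np

def pre_emphasis(signal, u=0.97):
    """Apply y[n] = x[n] - u * x[n-1]; the first sample is kept as is."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - u * signal[:-1])
```

A constant (zero-frequency) input is almost cancelled after the first sample, while rapid sample-to-sample changes pass through, which is the high-frequency boost the text describes.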

3. Speech feature parameter extraction

Speech feature parameter extraction extracts, from the speaker's speech signal, parameters that reflect the speaker's individuality. The procedure is as follows:

(1) A short-time Fourier transform (DFT) is applied to the preprocessed speech signal to obtain its spectrum X(k). The DFT of the speech signal is:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}, \quad 0 \le k \le N-1 \qquad (14)$$

where x(n) is the input speech signal, taken frame by frame, and N is the number of points of the Fourier transform, set to 256.

(2) The squared magnitude of the spectrum X(k), that is, the energy spectrum |X(k)|^2, is computed and then passed through the Mel filters, which smooth the spectrum of the speech signal, eliminate harmonics, and highlight the formants of the original speech.

The filter index runs over 1, 2, ..., Q, where Q is the number of triangular band-pass filters. The Mel filter is given by formula (15):

[Formula (15): equation image missing from the source text]

(3) The logarithm of the filter bank output is taken. This compresses the dynamic range of the speech spectrum and converts multiplicative noise components in the frequency domain into additive components. The resulting log Mel spectrum is given by formula (16):

[Formula (16): equation image missing from the source text]

(4) Discrete cosine transform (DCT)

The Mel spectrum obtained in the previous step is transformed to the time domain; the result is the Mel-frequency cepstral coefficients (MFCC). The n-th coefficient is calculated by formula (17):

[Formula (17): equation image missing from the source text]

where L is the order of the MFCC and Q is the number of Mel filters; both values are usually chosen experimentally. The present embodiment takes L = 13 and Q = 25, but the invention is not limited to these values.
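The four steps above can be sketched for a single frame as follows. This is an illustrative sketch rather than the embodiment's implementation: it assumes NumPy, the common mel mapping mel(f) = 2595 log10(1 + f/700), filter centres equally spaced on the mel scale between 0 Hz and half the sampling frequency, and an unnormalised DCT. The function names and the small floor inside the logarithm are also assumptions. The defaults L = 13, Q = 25, N = 256 and 8 kHz come from the text.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=25, n_fft=256, sample_rate=8000):
    """Triangular filters with centres equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):          # rising edge
            bank[m - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):         # falling edge, peak of 1 at the centre
            bank[m - 1, k] = (right - k) / (right - centre)
    return bank

def mfcc(frame, n_coeffs=13, n_filters=25, n_fft=256, sample_rate=8000):
    """Return n_coeffs cepstral coefficients for one windowed frame."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2             # energy spectrum |X(k)|^2
    mel_energy = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_mel = np.log(np.maximum(mel_energy, 1e-10))            # floor avoids log(0)
    n = np.arange(1, n_coeffs + 1)[:, None]
    m = np.arange(1, n_filters + 1)[None, :]
    dct_basis = np.cos(np.pi * n * (m - 0.5) / n_filters)      # unnormalised DCT
    return dct_basis @ log_mel
```

Applying `mfcc` to every frame of an utterance yields the feature vector sequence that the template training and recognition steps operate on.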

4. Template training

(1) Basic principle

In voiceprint recognition, the vector quantization codebook is generally used as the speaker's speech template: the speech of each speaker in the system is quantized into a codebook and stored in the speech library as that speaker's speech template. During recognition, feature parameters are extracted from the input speech as a feature vector sequence, the overall average quantization distortion of these feature parameters against each speech template is computed, and the speaker whose template gives the smallest overall average error is the recognition result.

(2) Distance measure

Let X be the K-dimensional feature vector of the unknown pattern, to be compared with a K-dimensional code word vector Y in the codebook. In terms of the components of X and Y along the same dimension, the Euclidean distance measure is given by formula (18):

[Formula (18): equation image missing from the source text]

The traditional Euclidean distance measure weights every component of the feature vector equally. This gives good recognition only when the natural distribution of the feature vectors is spherical or nearly spherical, that is, when the components of the feature vector are distributed equally. Speech signals, however, follow an ellipsoidal normal distribution in which the components are distributed unequally, which the Euclidean distance measure does not reflect well; if it is used directly to decide the speaker, the system's recognition rate suffers.

The present invention uses 13th-order MFCC. To reflect the different contributions of the components to clustering, a weighted Euclidean distance measure is used: differently distributed vectors receive different weights, with widely scattered vectors given small weights and tightly concentrated vectors given large weights. The degree of scatter is measured by the Euclidean distance from the vector to the cluster centre (the vector mean). The weighting factor is given by formula (19):

[Formula (19): equation image missing from the source text]

where K is the dimension of the feature vector. During training and recognition, the Euclidean distances obtained are sorted in descending order and then emphasized with the weighting factor. In essence this is equivalent to using the unweighted Euclidean distance during training and recognition while scaling each component of the feature vector by a scale factor. Destructive vectors that rank very high, such as isolated points or noise, thus receive very small weights, while good vectors that rank very low receive larger weights, so that each vector's contribution to recognition is well reflected.
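The general shape of a weighted Euclidean distance can be sketched as below. The weighting factor of formula (19) is not available in this text, so the weights here are only a stand-in: `inverse_spread_weights` is a hypothetical choice that gives widely scattered components small weights and concentrated components large weights, in the spirit of the description. The use of the squared distance and the normalisation of the weights are likewise assumptions.

```python
import numpy as np

def weighted_sq_distance(x, y, w):
    """Weighted squared Euclidean distance: sum_k w[k] * (x[k] - y[k])**2."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    return float(np.sum(w * (x - y) ** 2))

def inverse_spread_weights(train_vectors):
    """Hypothetical weights: inverse variance per component, scaled to sum to K."""
    var = np.var(np.asarray(train_vectors, dtype=float), axis=0)
    w = 1.0 / np.maximum(var, 1e-12)   # floor guards against zero variance
    return w * len(w) / w.sum()
```

With all weights equal to 1 the measure reduces to the ordinary squared Euclidean distance, which makes the effect of any particular weighting easy to compare.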

(3) Template training

The present invention uses the LBG algorithm based on the splitting method. The specific steps are as follows:

1) Obtain all training vectors X in the input feature vector set S, and set the code word of the initial codebook by the splitting method (a codebook is a set of vectors, that is, a set of code words);

2) Using a small threshold, split each code word in two according to formula (20):

[Formula (20): equation image missing from the source text]

After splitting, the code words of the new codebook are obtained;

3) According to the nearest-neighbour criterion, find the nearest code word of the new codebook for each training vector, so that S is finally divided into m subsets; the assignment condition is given by formula (21):

[Formula (21): equation image missing from the source text]

where M is the number of code words in the current initial codebook;

4) Compute the centroid of the feature vectors in each subset and replace the code word of that subset with the centroid, which yields a new codebook;

5) Iterate steps 3) and 4) to obtain the code words of the new codebook;

6) Then repeat step 2), splitting each newly obtained code word in two, and again iterate steps 3) and 4). Continue in this way until the codebook has the required 2^r code words (r is an integer); r rounds of this loop are needed in total until clustering is complete, at which point the centroid of each class is the required code word.

The initial codebook is built by the splitting method as follows:

① Take the mean of the feature vectors of all extracted frames as the code word of the initial codebook;

② Split each code word according to formula (22) to form 2m code words:

[Formula (22): equation image missing from the source text]

where m runs from 1 up to the number of code words in the current codebook; the formula also contains a splitting parameter, and the value the present invention assigns to it is missing from the source text;

③ Cluster all feature vectors according to the new code words, then compute the total distance measure D and the total distance measure of the next iteration by formula (23), which uses the distance measure between a training feature vector X and the trained codebook:

[Formula (23): equation image missing from the source text]

Compute the relative distance measure by formula (24):

[Formula (24): equation image missing from the source text]

If the relative distance measure satisfies the stopping condition (the condition itself is an equation image missing from the source text), stop iterating; the current codebook is the designed codebook. Otherwise, go to the next step;

④ Recompute the new centroid of each region;

⑤ Repeat ③ and ④ until the best codebook with 2m code words is formed;

⑥ Repeat ②, ③ and ④ until a codebook with M code words is formed;
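The splitting-based LBG training described above can be sketched as follows. This is an illustrative sketch, assuming NumPy, the plain (unweighted) squared Euclidean distance, a multiplicative split by (1 ± eps), and a relative-distortion stopping test. The function name and the values of `eps`, `tol` and `max_iter` are assumptions, since the corresponding values are not available in this text.

```python
import numpy as np

def lbg(vectors, codebook_size, eps=0.01, tol=1e-3, max_iter=100):
    """Train a codebook with the splitting-based LBG algorithm.

    codebook_size should be a power of two; eps and tol are assumed values.
    """
    vectors = np.asarray(vectors, dtype=float)
    # Initial codebook: a single code word, the mean of all feature vectors.
    codebook = vectors.mean(axis=0, keepdims=True)
    while len(codebook) < codebook_size:
        # Split every code word in two.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        for _ in range(max_iter):
            # Nearest-neighbour partition of the training set.
            d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            nearest = d.argmin(axis=1)
            distortion = d[np.arange(len(vectors)), nearest].sum()
            # Centroid update; an empty cell keeps its old code word.
            for m in range(len(codebook)):
                members = vectors[nearest == m]
                if len(members):
                    codebook[m] = members.mean(axis=0)
            # Stop when the relative drop in total distortion is small.
            if distortion == 0 or (prev - distortion) / distortion < tol:
                break
            prev = distortion
    return codebook
```

One design point worth noting: a purely multiplicative split leaves a code word at the origin unchanged, so an additive perturbation may be preferable when the feature mean can be zero.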

5. Voiceprint recognition

(1) Extract the feature vector sequence, of length T, of the speech signal of the speaker to be recognized. The speech library formed in the training stage holds one codebook per speaker, where N denotes the number of speakers.

(2) Compute the distance measure between the feature vectors and each speaker speech template in the library by formula (25):

[Formula (25): equation image missing from the source text]

where j indexes the frames of X, m indexes the code words of the i-th speaker, each codebook has M code words, and K is the dimension of the feature vector. The weighting factor is given by formula (26):

[Formula (26): equation image missing from the source text]

(3) Compute the average distance measure from X to the i-th codebook by formula (27):

[Formula (27): equation image missing from the source text]

(4) Repeat the computation for every codebook to obtain the average distance measures for all speakers;

(5) Find the smallest of these average distance measures; the corresponding i is the speaker sought.
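The decision rule in steps (1) to (5) can be sketched as below. This is an illustrative sketch, assuming NumPy, codebooks held in a dictionary keyed by speaker name, and a weighted squared Euclidean distance with caller-supplied weights (all ones if none are given). The function names are hypothetical, and the weights stand in for the patent's weighting factor, which is not available in this text.

```python
import numpy as np

def average_distortion(features, codebook, weights=None):
    """Mean over frames of the (weighted) squared distance to the nearest code word."""
    features = np.asarray(features, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    w = np.ones(features.shape[1]) if weights is None else np.asarray(weights, dtype=float)
    d = (w * (features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return float(d.min(axis=1).mean())

def identify(features, codebooks, weights=None):
    """Return the speaker whose codebook gives the smallest average distortion."""
    return min(codebooks, key=lambda name: average_distortion(features, codebooks[name], weights))
```

Because the rule always returns the closest enrolled speaker, it suits closed-set recognition; rejecting unknown speakers would need an additional threshold on the minimum distortion.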

The system performs closed-set recognition, that is, every speaker to be recognized belongs to the set of known speakers. The human-computer interaction interface for speaker recognition is shown in Fig. 4. In this interface, the "sound card status display" list view shows the numbers of the currently available voice channels of the sound card and the channel status; the "speech sample library" list view shows the number of speaker samples and the speakers' names in the current speech sample library. The "voiceprint recognition parameter settings" panel shows the parameters to set for speech acquisition: training duration (default 23 s), test duration (default 15 s), and number of candidates (default 1).

A specific example follows. Suppose the speech of 100 people has been stored in the speech sample library in advance; the following describes how the voice of a person XX is recognized when XX dials in.

1. If XX is not in the known speech sample library

(1) Speech signal acquisition: the telephone of the program-controlled exchange comprehensive experiment box serves as the terminal device for capturing speech, and speech is acquired through the sound card;

First, set the "training duration" parameter for the training speech to be acquired (range: 10-39 s), then enter the speaker's name "XX" in the name edit box and click the "add speaker" button. After the addition is complete, click "OK", then dial the telephone of the program-controlled exchange comprehensive experiment box (number: 8700). Once connected, the status of sound card channel 2 (the default channel) is updated to "recording", and the sound card can now acquire speech. When the acquired speech reaches the preset training duration, the telephone hangs up automatically;

(2) Speech signal preprocessing: the computer, together with VC software, frames and windows the captured speech signal, with 256 samples per frame, a frame shift of 128 samples, and a Hamming window as the window function; endpoint detection uses a method combining short-time energy and short-time zero-crossing rate; for pre-emphasis, the emphasis coefficient is 0.97;

(3) Speech feature parameter extraction: the computer, together with VC software, extracts 13th-order MFCC parameters;

(4) Template training: the codebook is initialized by the splitting method, and then the LBG clustering algorithm builds a codebook for each speaker in the system and stores it in the speech database as that speaker's speech template;

(5) Speaker identification

First, set the "test duration" parameter for the test speech to be acquired (range: 5-20 s), dial the telephone of the program-controlled exchange comprehensive experiment box (number: 8700), and acquire speech through the sound card (channel 2). When the acquired speech reaches the preset test duration, the telephone hangs up automatically;

The software then disables the "perform speaker identification" button while steps (2) and (3) are applied to the speaker's speech, and finally the features extracted from the test speaker's speech are compared with the speech templates in the library. Click the "perform speaker identification" button and select the number of candidates to display (range 1-3). The speaker whose template gives the speech feature vector X of the speaker to be recognized the minimum average distance measure is taken as the recognized speaker, and the identification result "XX" and the recognition rate are shown in the "speaker identification" list view.

If 2 XX belong to known speech samples storehouse

Speaker's identification is directly carried out in the storehouse if XX belongs to known speech samples: at first, " length of testing speech " parameter (scope: 5-20s) of the tested speech that needs collection is set, put through the phone (number: 8700), utilize sound card (passage is 2) to gather voice of programme-controlled exchange comprehensive experiment box.The voice that gather reach predetermined length of testing speech, phone meeting auto-hang up;

The software then disables the "perform speaker identification" button while steps (2) and (3) are carried out on the speaker's speech. Finally, the extracted speech features of the speaker under test are compared with the voice templates in the library. If a speaker template gives the speech feature vector X of the speaker to be identified the minimum average distance measure, that speaker is taken to be identified, and the identification result "XX" and the recognition rate are shown in the "speaker identification" view list.
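By way of non-limiting illustration, the minimum-average-distance decision and the selectable number of displayed candidates may be sketched in Python as follows. The names `identify`, `average_distortion` and `templates` are hypothetical, and the plain Euclidean distance is used here only for simplicity; it is an assumption of the example rather than the distance measure of the embodiment.

```python
import math


def euclidean(x, y):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_distortion(features, codebook):
    """Mean, over all frames, of the distance to the nearest codeword."""
    return sum(min(euclidean(x, y) for y in codebook) for x in features) / len(features)


def identify(features, templates, num_candidates=1):
    """Return the best (speaker, average distance) pairs, smallest distance first.

    templates maps a speaker name to that speaker's codebook (a list of codewords).
    """
    scored = sorted((average_distortion(features, codebook), name)
                    for name, codebook in templates.items())
    return [(name, dist) for dist, name in scored[:num_candidates]]
```

Calling `identify(features, templates, 3)` would correspond to selecting three candidates for display; the first entry is the speaker whose template yields the minimum average distance measure.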

Claims

1. A voiceprint recognition method based on vector quantization, characterized in that the specific steps are as follows:

(1) Voice signal acquisition: the telephone of a program-controlled exchange comprehensive experiment box is used as the terminal device for collecting speech, and the voice signal is collected through a sound card;

(2) Voice signal preprocessing: a computer performs framing and windowing on the acquired voice signal; during framing each frame contains 256 sampling points, the frame shift is 128 sampling points, and the window function applied is a Hamming window; endpoint detection uses an endpoint detection method that combines short-time energy with the short-time zero-crossing rate; pre-emphasis is applied with a pre-emphasis coefficient of 0.90 to 1.00;

(3) Voice signal feature parameter extraction: MFCC parameters are used, the order of the MFCC being 12 to 16;

(4) Template training: the LBG clustering algorithm is used to build a codebook for each speaker in the system, and the codebook is stored in the voice database as that speaker's voice template; the specific steps of the LBG clustering algorithm are as follows:

(4.1) take all training vectors X in the input feature vector set S, and obtain the codewords $Y_m$ of the initial codebook by the splitting codebook method;

(4.2) using a small threshold $\varepsilon$ […], split each codeword $Y_m$ in two according to the following rule:

$$Y_m^{+} = Y_m(1+\varepsilon), \qquad Y_m^{-} = Y_m(1-\varepsilon) \tag{5}$$

After splitting, the codewords $Y_m^{+}$ and $Y_m^{-}$ of the new codebook are obtained;

(4.3) according to the nearest-neighbor criterion, find for each training vector its nearest codeword in the new codebook, so that S is finally partitioned into M subsets $S_m$; that is, when $X \in S_m$,

$$d(X, Y_m) \le d(X, Y_i), \qquad i = 1, \dots, M,\ i \ne m \tag{6}$$

where M is the number of codewords in the current initial codebook;

(4.4) compute the centroid of the feature vectors in each subset and replace the codeword of that subset with this centroid, thereby obtaining a new codebook;

(4.5) iterate steps (4.3) and (4.4) to obtain the codewords of the new codebook;

(4.6) then repeat step (4.2), splitting each newly obtained codeword in two, and again iterate steps (4.3) and (4.4); continue in this way until the codebook contains the required number of codewords;

(5) Voiceprint recognition: the feature parameters of the collected voice signal to be recognized are compared with the speaker voice templates established in the library by steps (1) to (4), and the decision is made according to the weighted Euclidean distance measure; if a speaker template gives the speech feature vector X of the speaker to be recognized the minimum average distance measure, that speaker is taken to be identified.
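By way of non-limiting illustration, a decision based on a weighted Euclidean distance measure may be sketched in Python as follows. The claim does not state how the weights are chosen; the inverse-variance weighting used here, together with the names `inverse_variance_weights`, `weighted_euclidean` and `recognize`, is an assumption made solely for the example.

```python
import math


def inverse_variance_weights(vectors):
    """One weight per dimension: 1 / variance (assumed weighting, not from the claim)."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[k] for v in vectors) / n for k in range(dim)]
    variances = [sum((v[k] - means[k]) ** 2 for v in vectors) / n for k in range(dim)]
    # A constant dimension carries no spread; fall back to a neutral weight of 1.
    return [1.0 / var if var > 0 else 1.0 for var in variances]


def weighted_euclidean(x, y, weights):
    """sqrt(sum_k w_k * (x_k - y_k)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def average_weighted_distance(features, codebook, weights):
    """Mean, over all frames, of the weighted distance to the nearest codeword."""
    return sum(min(weighted_euclidean(x, y, weights) for y in codebook)
               for x in features) / len(features)


def recognize(features, templates, weights):
    """Name of the speaker whose codebook gives the minimum average distance."""
    return min(templates,
               key=lambda name: average_weighted_distance(features, templates[name], weights))
```

Weighting each MFCC dimension by the inverse of its variance keeps dimensions with a large numeric range from dominating the distance; other weightings could be substituted without changing the structure of the decision.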

2. The voiceprint recognition method based on vector quantization according to claim 1, characterized in that the initial codebook in the LBG clustering algorithm is obtained by codebook initialization using the splitting codebook method, the detailed process being as follows:

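(2) split each codeword $Y_m$ according to the following rule to form 2m codewords:

$$Y_m^{+} = Y_m(1+\varepsilon), \qquad Y_m^{-} = Y_m(1-\varepsilon) \tag{7}$$

where m ranges from 1 to the number of codewords in the current codebook, and $\varepsilon$ is the splitting parameter, taken as […];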

(3) cluster all feature vectors according to the new codewords, then compute the total distance measures $D$ and $D'$:

$$D = \sum_{X \in S} \min_{m} d(X, Y_m) \tag{8}$$

where $D'$ is the total distance measure of the next iteration and $d(X, Y_m)$ is the distance measure between a training feature vector X and a codeword $Y_m$ of the trained codebook;

compute the relative distance measure:

$$\delta = \frac{|D - D'|}{D} \tag{9}$$

if $\delta$ is below the preset threshold […], stop the iterative computation, the current codebook being the designed codebook; otherwise, go to the next step;

(4) recompute the new centroid of each region;

(5) repeat steps (3) and (4) until the best codebook of 2m codewords is formed;

(6) repeat steps (2), (3) and (4) until a codebook of M codewords is formed.
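By way of non-limiting illustration, the splitting and iteration of steps (2) to (6) may be sketched in Python as follows. Several points are assumptions of the example and are not taken from the claim: the codebook is started from the centroid of all training vectors, the squared Euclidean distance serves as the distance measure, the relative change is normalized by the earlier total distance, and the default values of `eps`, `tol` and `max_iter` are arbitrary. The function names are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def sq_dist(x, y):
    """Squared Euclidean distance (assumed distance measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def cluster(training, codebook):
    """Assign each vector to its nearest codeword; return the cells and total distance."""
    cells = [[] for _ in codebook]
    total = 0.0
    for x in training:
        dists = [sq_dist(x, y) for y in codebook]
        best = min(range(len(codebook)), key=dists.__getitem__)
        cells[best].append(x)
        total += dists[best]
    return cells, total


def train_codebook(training, size, eps=0.01, tol=1e-3, max_iter=100):
    """Grow a codebook by repeated splitting; size is assumed to be a power of two.

    A codeword that is exactly zero cannot be separated by this multiplicative split.
    """
    codebook = [centroid(training)]  # assumed starting point
    while len(codebook) < size:
        # Split every codeword Y into Y(1 + eps) and Y(1 - eps).
        codebook = [[c * (1 + sign * eps) for c in y]
                    for y in codebook for sign in (1, -1)]
        prev = None
        for _ in range(max_iter):
            cells, total = cluster(training, codebook)
            # Stop once the relative change of the total distance is small enough.
            if prev is not None and (prev == 0 or abs(prev - total) / prev <= tol):
                break
            prev = total
            # Replace each codeword by the centroid of its cell (keep it if the cell is empty).
            codebook = [centroid(cell) if cell else y
                        for cell, y in zip(cells, codebook)]
    return codebook
```

Under these assumptions, `train_codebook(mfcc_vectors, 16)` would yield a 16-codeword template for one speaker, where `mfcc_vectors` stands for that speaker's list of MFCC feature vectors.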