CN1877697A - Method for identifying speaker based on distributed structure - Google Patents


Info

Publication number
CN1877697A
CN1877697A
Authority
CN
China
Prior art keywords
speaker
template
score
frame
statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2006101036129A
Other languages
Chinese (zh)
Inventor
李毅杰
谢湘
匡镜明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CNA2006101036129A priority Critical patent/CN1877697A/en
Publication of CN1877697A publication Critical patent/CN1877697A/en
Pending legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The speaker verification system (1) based on a distributed architecture comprises: a system front end (2) that collects speech, extracts features, compresses them into a bitstream format, and sends them to (3); a data transmission channel (3) that connects (2) and (4); and a system back end (4) that decompresses the received data back into features for verification. The invention reduces the storage and computation demands on the terminal, satisfies the spoken-content decision required by random-text prompting, and helps prevent impostors from entering the system with recordings.

Description

A speaker verification method based on a distributed architecture
Technical field
The present invention relates to a speaker verification method, and more particularly to a speaker verification method that combines speaker recognition with speech recognition.
Background technology
Speaker recognition, also known as voiceprint recognition or speaker identification, is a branch of biometric identification (alongside DNA analysis, iris recognition, fingerprint recognition, skull identification, and so on). It identifies a speaker's identity automatically from speech parameters in the waveform that reflect the speaker's physiological and behavioral characteristics. It differs from speech recognition in that it exploits the speaker-specific information in the speech signal and ignores the meaning of the words, emphasizing the speaker's individuality; speech recognition, by contrast, aims to identify the spoken content regardless of who is speaking, emphasizing what speakers have in common.
Speaker recognition over telephone speech emerged in the 1980s. With the spread of telephone and mobile communication networks and the rapid growth of telecom services such as telephone banking, remote stock trading, and e-commerce, it has become a research focus of the field. Telephone-based speaker recognition, however, introduces microphone and transmission-channel effects that add noise and distortion to the speech, so improving system robustness has always been the key technical challenge and one of the basic problems that must be solved before practical deployment. Moreover, as data communication spreads across the wireless mobile world, portable devices (mobile phones, notebook computers, and the like) increasingly serve as terminals that obtain information over wireless networks, and the storage and computation requirements of a full speaker recognition system pose a serious challenge for such terminals. When speech is transmitted over a mobile network, low-rate coding and channel errors distort the reconstructed speech severely, and recognition performance degrades sharply. Because a distributed speaker recognition system reduces the terminal's storage and computation load and is more robust to channel errors, it is a likely direction for speaker recognition on the wireless mobile Internet.
Speaker verification can be divided, by spoken content, into text-independent and text-dependent verification. A text-independent system lets the user speak freely; it is easier to use and works even when the user does not cooperate with a prompted text, but its security is lower. A text-dependent system restricts the spoken content and requires the user to pronounce a prompted text, so its security is higher. To prevent an impostor from entering the system with a recording of the true speaker, random-text prompting is used: the system judges not only the speaker's identity but also the spoken content, and accepts only when both meet its requirements. The traditional approach for such random-text systems is speaker-dependent speech recognition, but that approach requires abundant training data, a condition that real application systems often cannot meet.
Aiming at the channel-mismatch problem of telephone-channel speaker recognition, the present invention proposes a speaker verification method based on a distributed architecture. In addition, for the random-text-prompting speaker verification system, it adopts a dual-threshold decision method that combines speaker recognition with speech recognition.
Summary of the invention
The present invention overcomes the defects of existing telephone-speech speaker verification technology by providing a speaker verification method based on a distributed architecture. By combining speaker recognition with speech recognition, it realizes random-text prompting and prevents an impostor from entering the system with a recording of the true speaker.
The technical solution is as follows. In this distributed random-text-prompting speaker verification method, a speaker recognition template is built for each speaker, and a speech recognition template is built as well. The system front end first extracts features from the speech, compresses them into a bitstream format, and transmits them over the data channel to the system back end. The back end decompresses the bitstream back into features and trains the speaker recognition template and the speech recognition template respectively. In the recognition phase, a dual-threshold two-stage decision combining speaker recognition with speech recognition is used: for a new utterance, both the speaker recognition template and the speech recognition template are scored; if the first-stage decision passes, the scores are normalized and a second-stage decision is made.
The beneficial effects of the invention are: the distributed architecture overcomes the limited storage and computing power of the portable terminal; random-text prompting prevents an impostor from entering the system with a recording of the true speaker; and the dual-threshold method, which couples the speaker template with the speech recognition template, lets the random-text method satisfy the speaker decision and the spoken-content decision simultaneously.
Description of drawings
Fig. 1 is the structure diagram of the distributed speaker verification system of the present invention;
Fig. 2 is the system flowchart of one embodiment of the invention;
Fig. 3 is the two-stage decision flowchart for GMM recognition;
Fig. 4 is the two-stage decision flowchart for HMM recognition.
Embodiment
The invention is further described below with reference to the drawings and an embodiment. The method consists of six steps.
The first step: front-end feature extraction
Feature extraction consists of four parts: noise reduction, waveform processing, spectrum calculation, and blind equalization.
1. Noise reduction
The noise reduction module applies a two-stage noise reduction to the input signal frame by frame; the output of the first stage serves as the input of the second. Each stage proceeds as follows:
A) estimate the linear spectrum of the input speech frame with the spectrum estimation module;
B) smooth the signal over time with the power spectral density averaging module;
C) compute the frequency-domain Wiener filter coefficients from the current frame's spectrum estimate and the noise spectrum estimate;
D) smooth the linear Wiener filter coefficients along the frequency axis with a bank of Mel filters, then apply the inverse Mel cosine transform to obtain a Wiener filter in the Mel-frequency domain;
E) filter the input signal of each stage with the resulting filter;
F) apply bias compensation to the output signal.
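The two-stage scheme above, with its Mel-domain smoothing and bias compensation, is elaborate; the core idea of each stage is a frequency-domain Wiener gain computed from signal and noise PSD estimates. The following is a minimal single-stage sketch of that core idea only (the smoothing, Mel-domain, and bias-compensation steps are omitted; all function names and the `floor` parameter are illustrative, not from the patent):

```python
import numpy as np

def wiener_gain(signal_psd, noise_psd, floor=0.01):
    """Frequency-domain Wiener filter gain from signal and noise PSD estimates."""
    snr = np.maximum(signal_psd - noise_psd, 0.0) / np.maximum(noise_psd, 1e-12)
    gain = snr / (1.0 + snr)          # classic Wiener gain: SNR / (1 + SNR)
    return np.maximum(gain, floor)    # floor avoids over-suppression artifacts

def denoise_frame(frame, noise_psd, floor=0.01):
    """Apply the Wiener gain to one frame's spectrum and reconstruct the waveform."""
    spec = np.fft.rfft(frame)
    psd = np.abs(spec) ** 2           # crude per-frame PSD estimate
    gain = wiener_gain(psd, noise_psd, floor)
    return np.fft.irfft(gain * spec, n=len(frame))
```

In a real front end the noise PSD would be tracked over non-speech frames rather than supplied directly.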
2. Waveform processing
The waveform processing module processes the output of the noise reduction module according to its signal-to-noise ratio. It comprises an energy-envelope smoothing module, a peak detection module, and a waveform SNR weighting module:
A) compute the energy of the noise reduction module's output frame by frame, and smooth it with an FIR filter;
B) find the maxima of the smoothed energy envelope corresponding to the fundamental frequency;
C) construct a weighting function w_swp(n) of length N_in and apply it to the input speech frame of the waveform processing module to obtain the output signal:
s_out(n) = 1.2·w_swp(n)·s_in(n) + 0.8·(1 − w_swp(n))·s_in(n),  0 ≤ n ≤ N_in − 1
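The weighting formula above amplifies samples where the weighting function is high (by 1.2) and attenuates the rest (by 0.8). A direct sketch of just that formula (function name is illustrative):

```python
import numpy as np

def snr_weighted_frame(s_in, w_swp):
    """Waveform SNR weighting of the patent:
    s_out(n) = 1.2*w_swp(n)*s_in(n) + 0.8*(1 - w_swp(n))*s_in(n)."""
    s_in = np.asarray(s_in, dtype=float)
    w = np.asarray(w_swp, dtype=float)
    return 1.2 * w * s_in + 0.8 * (1.0 - w) * s_in
```

With w_swp(n) = 1 a sample is scaled by 1.2; with w_swp(n) = 0 it is scaled by 0.8.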
3. Spectrum calculation
Spectrum calculation extracts Mel-frequency cepstral coefficients (MFCC) through energy measure calculation, pre-emphasis, windowing, fast Fourier transform, Mel filtering, nonlinear transformation, and cosine transform. The resulting parameters then undergo vector-quantization feature compression and bitstream frame formatting.
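The MFCC chain named above (pre-emphasis, window, FFT, Mel filterbank, log, DCT) can be sketched for a single frame as follows. The sample rate, filter count, coefficient count, FFT size, and pre-emphasis factor here are common illustrative choices, not values stated in the patent:

```python
import numpy as np

def mfcc(frame, sr=8000, n_filters=23, n_ceps=13):
    """One-frame MFCC: pre-emphasis, Hamming window, FFT power spectrum,
    triangular Mel filterbank, log, DCT-II."""
    # Pre-emphasis and windowing
    f = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    f = f * np.hamming(len(f))
    # Power spectrum
    nfft = 512
    power = np.abs(np.fft.rfft(f, nfft)) ** 2
    # Triangular Mel filterbank: equally spaced centers on the Mel scale
    mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    # Log filterbank energies (floored), then DCT-II to cepstra
    logE = np.log(np.maximum(fb @ power, 1e-10))
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ logE
```

In the distributed system, these per-frame vectors are what get compressed and sent over the channel.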
4. Blind equalization
The blind equalization module equalizes the MFCC cepstral coefficients with the LMS algorithm.
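One common form of LMS cepstral equalization tracks an additive bias (e.g. channel cepstrum) that pulls each frame toward a reference cepstrum, and subtracts it frame by frame. The patent gives no details, so the sketch below is a generic LMS bias equalizer; the step size `mu` and the reference vector are illustrative assumptions:

```python
import numpy as np

def lms_equalize(cepstra, reference, mu=0.0087):
    """Equalize a sequence of cepstral frames by LMS-tracking a bias vector
    toward a reference cepstrum and subtracting it from each frame."""
    bias = np.zeros_like(reference, dtype=float)
    out = []
    for c in cepstra:
        out.append(c - bias)
        err = (c - bias) - reference   # deviation from the reference cepstrum
        bias = bias + mu * err         # LMS update of the bias estimate
    return np.array(out)
```

Over a long utterance the bias converges to the constant channel offset, which cancels convolutive channel distortion in the cepstral domain.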
The second step: front-end feature compression
1. The input to feature compression is the Mel-frequency cepstral coefficients output by the blind equalization module; they are compressed with a split vector quantization algorithm;
2. The vector-quantized bitstream data are frame-formatted and error-protected: CRC redundancy checks, synchronization sequences, and frame header information are added, and the packed frames are sent into the channel for transmission.
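Split vector quantization partitions each feature vector into sub-vectors and codes each sub-vector against its own small codebook, emitting one index per sub-vector. A minimal encode/decode sketch (the split sizes and codebooks are illustrative; a real system trains the codebooks offline and adds the CRC/sync framing described above):

```python
import numpy as np

def split_vq_encode(vec, codebooks):
    """Code each sub-vector of `vec` against its codebook; return the indices.
    Each codebook is a (codewords, subdim) array; splits follow codebook dims."""
    idx, pos = [], 0
    for cb in codebooks:
        sub = vec[pos:pos + cb.shape[1]]
        d = np.sum((cb - sub) ** 2, axis=1)   # squared distance to each codeword
        idx.append(int(np.argmin(d)))
        pos += cb.shape[1]
    return idx

def split_vq_decode(idx, codebooks):
    """Reconstruct the feature vector from the per-sub-vector indices."""
    return np.concatenate([cb[i] for i, cb in zip(idx, codebooks)])
```

Transmitting a few codebook indices per frame instead of raw float vectors is what keeps the channel bit rate low.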
The third step: back-end feature decompression
1. The bitstream data received from the channel undergo error detection and error mitigation;
2. The checked bitstream is then decompressed back into features according to the split vector quantization codebooks.
The fourth step: back-end template training
1. Speaker recognition template training
The speaker recognition template is a Gaussian mixture model (GMM). From the training corpus speech, after feature extraction and quantization, one GMM is trained per speaker according to the speaker labels.
2. Speech recognition template training
The speech recognition template is a hidden Markov model (HMM), and speaker adaptation is applied during training. From the training corpus speech, after feature extraction and quantization, a speaker-independent HMM is trained; then, with the adaptation corpus speech, the speaker-independent HMM is adapted to obtain a speaker-dependent HMM.
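The per-speaker GMM training, and the mean per-frame scoring used in the matching step that follows, can be sketched with scikit-learn (assumed available; the mixture order, covariance type, and function names are illustrative, not the patent's values):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(features, n_components=16):
    """Fit one diagonal-covariance GMM on one speaker's feature vectors
    (rows = frames, columns = cepstral dimensions)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                          max_iter=100, random_state=0)
    gmm.fit(features)
    return gmm

def utterance_score(gmm, features):
    """S11-style score: mean per-frame log-likelihood over the utterance."""
    return float(np.mean(gmm.score_samples(features)))
```

An utterance scores higher under the GMM of the speaker whose voice it matches, which is the basis of the first-stage speaker decision.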
The fifth step: back-end template matching
1. GMM matching
A) score every frame's feature vector against the claimed speaker's GMM, and take the mean score over all frames, S11, as the utterance's speaker score;
B) score every frame's feature vector against each speaker GMM other than the current speaker's, take the N1 highest scores, and compute their arithmetic mean S1.
2. HMM matching
A) score every frame's feature vector against the speech recognition template using a fixed grammar search network built from the prompted content, and take the mean score over all frames, S21, as the utterance's content score;
B) score every frame's feature vector against the speech recognition template using a looping lexical search network, take the N2 highest scores among all search results, and compute their arithmetic mean S2.
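The cohort step in B) normalizes the claimant's score against the best competing models. A sketch of that competitive-model normalization, S12 = S11 − S1 (the `n_best` value and function name are illustrative):

```python
import numpy as np

def normalized_score(claim_score, cohort_scores, n_best=5):
    """Competitive-model normalization: subtract from the claimant's score
    the mean of the N best non-claimant (cohort) scores."""
    top = sorted(cohort_scores, reverse=True)[:n_best]   # N1 highest scores
    return claim_score - float(np.mean(top))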
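The cohort step in B) normalizes the claimant's score against the best competing models. A sketch of that competitive-model normalization, S12 = S11 − S1 (the `n_best` default and function name are illustrative):

```python
import numpy as np

def normalized_score(claim_score, cohort_scores, n_best=5):
    """Competitive-model normalization: subtract from the claimant's score
    the mean of the N best non-claimant (cohort) scores."""
    top = sorted(cohort_scores, reverse=True)[:n_best]   # N1 highest scores
    return claim_score - float(np.mean(top))
```

Subtracting the cohort mean makes the threshold comparison robust to utterance-level variation that shifts all model scores together.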
The sixth step: back-end dual-threshold decision
1. First-stage decision
Compare S11 with the speaker first-stage threshold T11, and S21 with the content first-stage threshold T21. If S11 > T11 and S21 > T21, proceed to the second-stage decision; otherwise the system rejects, judging that the utterance fails the joint speaker and content requirement.
2. Second-stage decision
From S11 and S1, and from S21 and S2, the normalized scores S12 and S22 are obtained:
S12 = S11 − S1
S22 = S21 − S2
Compare S12 with the speaker second-stage threshold T12, and S22 with the content second-stage threshold T22. If S12 > T12 and S22 > T22, the system accepts the utterance as the claimed speaker speaking the correct content; otherwise it rejects, judging that the utterance fails the joint speaker and content requirement.
Experimental example
1. Experimental database
The method was tested on the random digit-string corpus of the Modern Communications Laboratory of Beijing Institute of Technology. The corpus contains speech from 99 speakers, 45 male and 44 female; the recorded content is the ten Mandarin digits "zero" through "nine" plus "yao" (an alternative pronunciation of "one"). The material is divided into a training set, an adaptation set, and a test set. The training set contains, per speaker, 3 random strings each of three, four, and five digits, 99 × 9 = 891 utterances in total; the adaptation set contains one pronunciation per speaker of the ten digits and "yao"; the test set likewise contains 3 random strings each of three, four, and five digits per speaker, 99 × 9 = 891 utterances in total.
2. System performance evaluation index
The evaluation indices are defined as follows:
FRR = (number of genuine trials falsely rejected) / (total number of genuine trials)
FAR = (number of impostor trials falsely accepted) / (total number of impostor trials)
Performance is evaluated with the equal error rate (EER), i.e., the value at which the false rejection rate FRR equals the average false acceptance rate FAR.
3. Experimental results
The experimental results are shown in the table below.
Table 1. Random digit-string speaker verification results

Method                                     EER      FAR I    FAR II   FAR III
Speaker-dependent HMM                      15.01%   35.12%   8.12%    3.29%
GMM + speaker-independent HMM              13.33%   23.16%   10.89%   5.96%
GMM + adaptive HMM                         12.03%   23.16%   9.42%    3.51%
GMM + adaptive HMM + two-stage decision    4.09%    6.73%    4.98%    0.66%
In this experiment, with the dual-threshold two-stage decision method combining GMM and adaptive HMM, the GMM first-stage decision threshold was T11 = −35.21 and the second-stage threshold T12 = −0.53; the HMM first-stage decision threshold was T21 = −33.56 and the second-stage threshold T22 = −1.70.
The results show that, for a random-text-prompting speaker verification system, the dual-threshold two-stage decision method combining GMM and HMM improves system performance significantly over speaker-dependent HMM recognition.

Claims (7)

1. A speaker verification system based on a distributed architecture, comprising a system front end, a data transmission channel, and a system back end; the system front end collects the speaker's speech, extracts features, compresses them into a bitstream format, and sends them into the data transmission channel; the data transmission channel carries the data between the system front end and the system back end; the system back end decompresses the bitstream data back into features and performs speaker verification.
2. The system according to claim 1, characterized in that the speaker verification is a random-text-prompting speaker verification method.
3. The random-text-prompting speaker verification method according to claim 2, characterized in that it combines a speaker recognition method with a speech recognition method.
4. The random-text-prompting speaker verification method according to claim 2, characterized in that it adopts a dual-threshold decision method.
5. The distributed-architecture speaker verification method according to claim 3 or 4, characterized in that the main steps at the system front end are:
5.1) feature extraction: comprising noise reduction, waveform processing, spectrum calculation, and blind equalization;
5.2) feature compression: comprising split vector quantization and bitstream frame formatting of the compressed data.
6. The distributed-architecture random-text-prompting speaker verification method according to claim 3 or 4, characterized in that the main steps at the system back end are:
6.1) feature decompression: the compressed bitstream is restored to a feature vector sequence according to the split vector quantization codebooks;
6.2) template training: comprising the training of the speaker recognition template and the speech recognition template;
6.3) template matching: comprising the matching of the speaker recognition template and the speech recognition template; with the decompressed feature vector sequence, every frame's feature vector is scored against the speaker recognition template, and the mean score over all frames, S11, is taken as the utterance's speaker score; in addition, according to the prompted content, every frame's feature vector is scored against the speech recognition template, and the mean score over all frames, S21, is taken as the utterance's content score;
6.4) dual-threshold decision: a two-stage decision method is adopted; in the first stage, S11 is compared with the speaker first-stage threshold T11, and S21 with the content first-stage threshold T21; if S11 > T11 and S21 > T21, the second-stage decision is performed, otherwise the system rejects, judging that the utterance fails the joint speaker and content requirement; in the second stage, the normalized score S12 is obtained from S11 and the normalized score S22 from S21; S12 is compared with the speaker second-stage threshold T12, and S22 with the content second-stage threshold T22; if S12 > T12 and S22 > T22, the system accepts the utterance as the claimed speaker speaking the correct content, otherwise it rejects, judging that the utterance fails the joint speaker and content requirement.
7. The distributed-architecture random-text-prompting speaker verification method according to claim 6, characterized in that the normalization of the second-stage decision uses competitive-model normalization:
7.1) with the decompressed feature vector sequence, every frame's feature vector is scored against each speaker recognition template other than the current speaker's; the N1 highest scores are taken and their arithmetic mean S1 is computed;
7.2) with the decompressed feature vector sequence, every frame's feature vector is scored against the speech recognition template according to the lexical search network; the N2 highest scores among all search results are taken and their arithmetic mean S2 is computed;
7.3) from S11 and S1, and from S21 and S2, the normalized scores S12 and S22 are obtained:
S12 = S11 − S1
S22 = S21 − S2
CNA2006101036129A 2006-07-25 2006-07-25 Method for identifying speaker based on distributed structure Pending CN1877697A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2006101036129A CN1877697A (en) 2006-07-25 2006-07-25 Method for identifying speaker based on distributed structure

Publications (1)

Publication Number Publication Date
CN1877697A true CN1877697A (en) 2006-12-13

Family

ID=37510108

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2006101036129A Pending CN1877697A (en) 2006-07-25 2006-07-25 Method for identifying speaker based on distributed structure

Country Status (1)

Country Link
CN (1) CN1877697A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402983A (en) * 2011-11-25 2012-04-04 浪潮电子信息产业股份有限公司 Cloud data center speech recognition method
CN102457845A (en) * 2010-10-14 2012-05-16 阿里巴巴集团控股有限公司 Method, equipment and system for authenticating identity by wireless service
CN102496364A (en) * 2011-11-30 2012-06-13 苏州奇可思信息科技有限公司 Interactive speech recognition method based on cloud network
CN102543083A (en) * 2012-03-16 2012-07-04 北京海尔集成电路设计有限公司 Intelligent voice recognition method and chip, cloud equipment and cloud server
CN101211615B (en) * 2006-12-31 2012-10-03 于柏泉 System for automatic recording for specific human voice
CN101562013B (en) * 2008-04-15 2013-05-22 联芯科技有限公司 Method and device for automatically recognizing voice
CN104490570A (en) * 2014-12-31 2015-04-08 桂林电子科技大学 Embedding type voiceprint identification and finding system for blind persons
CN105632515A (en) * 2014-10-31 2016-06-01 科大讯飞股份有限公司 Pronunciation error detection method and device
CN108630207A (en) * 2017-03-23 2018-10-09 富士通株式会社 Method for identifying speaker and speaker verification's equipment
CN109313902A (en) * 2016-06-06 2019-02-05 思睿逻辑国际半导体有限公司 Voice user interface


Similar Documents

Publication Publication Date Title
CN1877697A (en) Method for identifying speaker based on distributed structure
CN106847292B (en) Method for recognizing sound-groove and device
Lu et al. An investigation of dependencies between frequency components and speaker characteristics for text-independent speaker identification
Aloufi et al. Emotionless: Privacy-preserving speech analysis for voice assistants
CN107492382A (en) Voiceprint extracting method and device based on neutral net
CN1650349A (en) On-line parametric histogram normalization for noise robust speech recognition
CN113129897B (en) Voiceprint recognition method based on attention mechanism cyclic neural network
CN103971675A (en) Automatic voice recognizing method and system
WO2012075641A1 (en) Device and method for pass-phrase modeling for speaker verification, and verification system
WO2013060079A1 (en) Record playback attack detection method and system based on channel mode noise
CN103794211B (en) A kind of audio recognition method and system
CN103794207A (en) Dual-mode voice identity recognition method
CN108597505A (en) Audio recognition method, device and terminal device
CN111341323B (en) Voiceprint recognition training data amplification method and system, mobile terminal and storage medium
CN109754790A (en) A kind of speech recognition system and method based on mixing acoustic model
CN109448702A (en) Artificial cochlea's auditory scene recognition methods
CN105845143A (en) Speaker confirmation method and speaker confirmation system based on support vector machine
Verma Multi-feature fusion for closed set text independent speaker identification
CN1787077A (en) Method for fast identifying speeking person based on comparing ordinal number of archor model space projection
CN109616124A (en) Lightweight method for recognizing sound-groove and system based on ivector
CN107093430A (en) A kind of vocal print feature extraction algorithm based on wavelet package transforms
Chowdhury et al. Text-independent distributed speaker identification and verification using GMM-UBM speaker models for mobile communications
CN113393847B (en) Voiceprint recognition method based on fusion of Fbank features and MFCC features
CN110197657A (en) A kind of dynamic speech feature extracting method based on cosine similarity
CN116312559A (en) Training method of cross-channel voiceprint recognition model, voiceprint recognition method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20061213