CN1877697A - Method for identifying speaker based on distributed structure - Google Patents


Info

Publication number
CN1877697A
CN1877697A
Authority
CN
China
Prior art keywords
speaker
template
score
frame
statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2006101036129A
Other languages
Chinese (zh)
Inventor
李毅杰
谢湘
匡镜明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CNA2006101036129A priority Critical patent/CN1877697A/en
Publication of CN1877697A publication Critical patent/CN1877697A/en
Pending legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The speaker verification system (1) based on a distributed architecture comprises: a system front end (2) that collects speech, extracts features, compresses them into a bitstream format, and sends them to (3); a data transmission channel (3) that connects (2) and (4); and a system back end (4) that decompresses the received data back into features for verification. The invention reduces the storage and computation demands on the terminal, satisfies the spoken-content decision required by random-text prompting, and helps prevent impostors from entering the system with recordings.

Description

A speaker verification method based on a distributed architecture
Technical field
The present invention relates to a speaker verification method, and more particularly to a speaker verification method that combines speaker recognition with speech recognition.
Background technology
Speaker recognition, also known as voiceprint recognition or speaker identification, is a branch of biometric identification (alongside DNA analysis, iris recognition, fingerprint recognition, skull identification, and so on). It identifies a speaker's identity automatically from speech parameters in the waveform that reflect the speaker's physiological and behavioral characteristics. It differs from speech recognition in that it exploits the speaker-specific information in the speech signal and ignores the meaning of the words, emphasizing the speaker's individuality; speech recognition, by contrast, aims to identify the spoken content regardless of who is speaking, emphasizing what speakers have in common.
Speaker recognition over telephone speech emerged in the 1980s. With the spread of telephone and mobile communication networks and the rapid growth of telecom services such as telephone banking, remote stock trading, and e-commerce, it has become a research focus of the field. Telephone-based speaker recognition, however, introduces microphone and transmission-channel effects that add noise and distortion to the speech, so improving system robustness has always been the key technical challenge and one of the basic problems that must be solved before practical deployment. Moreover, as data communication spreads across the wireless mobile world, portable devices (mobile phones, notebook computers, and the like) increasingly serve as terminals that obtain information over wireless networks, and the storage and computation requirements of a full speaker recognition system pose a serious challenge for such terminals. When speech is transmitted over a mobile network, low-rate coding and channel errors distort the reconstructed speech severely, and recognition performance degrades sharply. Because a distributed speaker recognition system reduces the terminal's storage and computation load and is more robust to channel errors, it is a likely direction for speaker recognition on the wireless mobile Internet.
Speaker verification can be divided, by spoken content, into text-independent and text-dependent verification. A text-independent system lets the user speak freely; it is easier to use and works even when the user does not cooperate with a prompted text, but its security is lower. A text-dependent system restricts the spoken content and requires the user to pronounce a prompted text, so its security is higher. To prevent an impostor from entering the system with a recording of the true speaker, random-text prompting is used: the system judges not only the speaker's identity but also the spoken content, and accepts only when both meet its requirements. The traditional approach for such random-text systems is speaker-dependent speech recognition, but that approach requires abundant training data, a condition that real application systems often cannot meet.
Aiming at the channel-mismatch problem of telephone-channel speaker recognition, the present invention proposes a speaker verification method based on a distributed architecture. In addition, for the random-text-prompting speaker verification system, it adopts a dual-threshold decision method that combines speaker recognition with speech recognition.
Summary of the invention
The present invention overcomes the defects of existing telephone-speech speaker verification technology by providing a speaker verification method based on a distributed architecture. By combining speaker recognition with speech recognition, it realizes random-text prompting and prevents an impostor from entering the system with a recording of the true speaker.
The technical solution is as follows. In this distributed random-text-prompting speaker verification method, a speaker recognition template is built for each speaker, and a speech recognition template is built as well. The system front end first extracts features from the speech, compresses them into a bitstream format, and transmits them over the data channel to the system back end. The back end decompresses the bitstream back into features and trains the speaker recognition template and the speech recognition template respectively. In the recognition phase, a dual-threshold two-stage decision combining speaker recognition with speech recognition is used: for a new utterance, both the speaker recognition template and the speech recognition template are scored; if the first-stage decision passes, the scores are normalized and a second-stage decision is made.
The beneficial effects of the invention are: the distributed architecture overcomes the limited storage and computing power of the portable terminal; random-text prompting prevents an impostor from entering the system with a recording of the true speaker; and the dual-threshold method, which couples the speaker template with the speech recognition template, lets the random-text method satisfy the speaker decision and the spoken-content decision simultaneously.
Description of drawings
Fig. 1 is the structure diagram of the distributed speaker verification system of the present invention;
Fig. 2 is the system flowchart of one embodiment of the invention;
Fig. 3 is the two-stage decision flowchart for GMM recognition;
Fig. 4 is the two-stage decision flowchart for HMM recognition.
Embodiment
The invention is further described below with reference to the drawings and an embodiment. The method consists of six steps.
The first step: front-end feature extraction
Feature extraction consists of four parts: noise reduction, waveform processing, spectrum calculation, and blind equalization.
1. Noise reduction
The noise reduction module applies a two-stage noise reduction to the input signal frame by frame; the output of the first stage serves as the input of the second. Each stage proceeds as follows:
A) estimate the linear spectrum of the input speech frame with the spectrum estimation module;
B) smooth the signal over time with the power spectral density averaging module;
C) compute the frequency-domain Wiener filter coefficients from the current frame's spectrum estimate and the noise spectrum estimate;
D) smooth the linear Wiener filter coefficients along the frequency axis with a bank of Mel filters, then apply the inverse Mel cosine transform to obtain a Wiener filter in the Mel-frequency domain;
E) filter the input signal of each stage with the resulting filter;
F) apply bias compensation to the output signal.
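The two-stage scheme above, with its Mel-domain smoothing and bias compensation, is elaborate; the core idea of each stage is a frequency-domain Wiener gain computed from signal and noise PSD estimates. The following is a minimal single-stage sketch of that core idea only (the smoothing, Mel-domain, and bias-compensation steps are omitted; all function names and the `floor` parameter are illustrative, not from the patent):

```python
import numpy as np

def wiener_gain(signal_psd, noise_psd, floor=0.01):
    """Frequency-domain Wiener filter gain from signal and noise PSD estimates."""
    snr = np.maximum(signal_psd - noise_psd, 0.0) / np.maximum(noise_psd, 1e-12)
    gain = snr / (1.0 + snr)          # classic Wiener gain: SNR / (1 + SNR)
    return np.maximum(gain, floor)    # floor avoids over-suppression artifacts

def denoise_frame(frame, noise_psd, floor=0.01):
    """Apply the Wiener gain to one frame's spectrum and reconstruct the waveform."""
    spec = np.fft.rfft(frame)
    psd = np.abs(spec) ** 2           # crude per-frame PSD estimate
    gain = wiener_gain(psd, noise_psd, floor)
    return np.fft.irfft(gain * spec, n=len(frame))
```

In a real front end the noise PSD would be tracked over non-speech frames rather than supplied directly.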
2. Waveform processing
The waveform processing module processes the output of the noise reduction module according to its signal-to-noise ratio. It comprises an energy-envelope smoothing module, a peak detection module, and a waveform SNR weighting module:
A) compute the energy of the noise reduction module's output frame by frame, and smooth it with an FIR filter;
B) find the maxima of the smoothed energy envelope corresponding to the fundamental frequency;
C) construct a weighting function w_swp(n) of length N_in and apply it to the input speech frame of the waveform processing module to obtain the output signal:
s_out(n) = 1.2·w_swp(n)·s_in(n) + 0.8·(1 − w_swp(n))·s_in(n),  0 ≤ n ≤ N_in − 1
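The weighting formula above amplifies samples where the weighting function is high (by 1.2) and attenuates the rest (by 0.8). A direct sketch of just that formula (function name is illustrative):

```python
import numpy as np

def snr_weighted_frame(s_in, w_swp):
    """Waveform SNR weighting of the patent:
    s_out(n) = 1.2*w_swp(n)*s_in(n) + 0.8*(1 - w_swp(n))*s_in(n)."""
    s_in = np.asarray(s_in, dtype=float)
    w = np.asarray(w_swp, dtype=float)
    return 1.2 * w * s_in + 0.8 * (1.0 - w) * s_in
```

With w_swp(n) = 1 a sample is scaled by 1.2; with w_swp(n) = 0 it is scaled by 0.8.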
3. Spectrum calculation
Spectrum calculation extracts Mel-frequency cepstral coefficients (MFCC) through energy measure calculation, pre-emphasis, windowing, fast Fourier transform, Mel filtering, nonlinear transformation, and cosine transform. The resulting parameters then undergo vector-quantization feature compression and bitstream frame formatting.
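The MFCC chain named above (pre-emphasis, window, FFT, Mel filterbank, log, DCT) can be sketched for a single frame as follows. The sample rate, filter count, coefficient count, FFT size, and pre-emphasis factor here are common illustrative choices, not values stated in the patent:

```python
import numpy as np

def mfcc(frame, sr=8000, n_filters=23, n_ceps=13):
    """One-frame MFCC: pre-emphasis, Hamming window, FFT power spectrum,
    triangular Mel filterbank, log, DCT-II."""
    # Pre-emphasis and windowing
    f = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    f = f * np.hamming(len(f))
    # Power spectrum
    nfft = 512
    power = np.abs(np.fft.rfft(f, nfft)) ** 2
    # Triangular Mel filterbank: equally spaced centers on the Mel scale
    mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    # Log filterbank energies (floored), then DCT-II to cepstra
    logE = np.log(np.maximum(fb @ power, 1e-10))
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ logE
```

In the distributed system, these per-frame vectors are what get compressed and sent over the channel.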
4. Blind equalization
The blind equalization module equalizes the MFCC cepstral coefficients with the LMS algorithm.
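One common form of LMS cepstral equalization tracks an additive bias (e.g. channel cepstrum) that pulls each frame toward a reference cepstrum, and subtracts it frame by frame. The patent gives no details, so the sketch below is a generic LMS bias equalizer; the step size `mu` and the reference vector are illustrative assumptions:

```python
import numpy as np

def lms_equalize(cepstra, reference, mu=0.0087):
    """Equalize a sequence of cepstral frames by LMS-tracking a bias vector
    toward a reference cepstrum and subtracting it from each frame."""
    bias = np.zeros_like(reference, dtype=float)
    out = []
    for c in cepstra:
        out.append(c - bias)
        err = (c - bias) - reference   # deviation from the reference cepstrum
        bias = bias + mu * err         # LMS update of the bias estimate
    return np.array(out)
```

Over a long utterance the bias converges to the constant channel offset, which cancels convolutive channel distortion in the cepstral domain.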
The second step: front-end feature compression
1. The input to feature compression is the Mel-frequency cepstral coefficients output by the blind equalization module; they are compressed with a split vector quantization algorithm;
2. The vector-quantized bitstream data are frame-formatted and error-protected: CRC redundancy checks, synchronization sequences, and frame header information are added, and the packed frames are sent into the channel for transmission.
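Split vector quantization partitions each feature vector into sub-vectors and codes each sub-vector against its own small codebook, emitting one index per sub-vector. A minimal encode/decode sketch (the split sizes and codebooks are illustrative; a real system trains the codebooks offline and adds the CRC/sync framing described above):

```python
import numpy as np

def split_vq_encode(vec, codebooks):
    """Code each sub-vector of `vec` against its codebook; return the indices.
    Each codebook is a (codewords, subdim) array; splits follow codebook dims."""
    idx, pos = [], 0
    for cb in codebooks:
        sub = vec[pos:pos + cb.shape[1]]
        d = np.sum((cb - sub) ** 2, axis=1)   # squared distance to each codeword
        idx.append(int(np.argmin(d)))
        pos += cb.shape[1]
    return idx

def split_vq_decode(idx, codebooks):
    """Reconstruct the feature vector from the per-sub-vector indices."""
    return np.concatenate([cb[i] for i, cb in zip(idx, codebooks)])
```

Transmitting a few codebook indices per frame instead of raw float vectors is what keeps the channel bit rate low.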
The third step: back-end feature decompression
1. The bitstream data received from the channel undergo error detection and error mitigation;
2. The checked bitstream is then decompressed back into features according to the split vector quantization codebooks.
The fourth step: back-end template training
1. Speaker recognition template training
The speaker recognition template is a Gaussian mixture model (GMM). From the training corpus speech, after feature extraction and quantization, one GMM is trained per speaker according to the speaker labels.
2. Speech recognition template training
The speech recognition template is a hidden Markov model (HMM), and speaker adaptation is applied during training. From the training corpus speech, after feature extraction and quantization, a speaker-independent HMM is trained; then, with the adaptation corpus speech, the speaker-independent HMM is adapted to obtain a speaker-dependent HMM.
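The per-speaker GMM training, and the mean per-frame scoring used in the matching step that follows, can be sketched with scikit-learn (assumed available; the mixture order, covariance type, and function names are illustrative, not the patent's values):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(features, n_components=16):
    """Fit one diagonal-covariance GMM on one speaker's feature vectors
    (rows = frames, columns = cepstral dimensions)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                          max_iter=100, random_state=0)
    gmm.fit(features)
    return gmm

def utterance_score(gmm, features):
    """S11-style score: mean per-frame log-likelihood over the utterance."""
    return float(np.mean(gmm.score_samples(features)))
```

An utterance scores higher under the GMM of the speaker whose voice it matches, which is the basis of the first-stage speaker decision.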
The fifth step: back-end template matching
1. GMM matching
A) score every frame's feature vector against the claimed speaker's GMM, and take the mean score over all frames, S11, as the utterance's speaker score;
B) score every frame's feature vector against each speaker GMM other than the current speaker's, take the N1 highest scores, and compute their arithmetic mean S1.
2. HMM matching
A) score every frame's feature vector against the speech recognition template using a fixed grammar search network built from the prompted content, and take the mean score over all frames, S21, as the utterance's content score;
B) score every frame's feature vector against the speech recognition template using a looping lexical search network, take the N2 highest scores among all search results, and compute their arithmetic mean S2.
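The cohort step in B) normalizes the claimant's score against the best competing models. A sketch of that competitive-model normalization, S12 = S11 − S1 (the `n_best` value and function name are illustrative):

```python
import numpy as np

def normalized_score(claim_score, cohort_scores, n_best=5):
    """Competitive-model normalization: subtract from the claimant's score
    the mean of the N best non-claimant (cohort) scores."""
    top = sorted(cohort_scores, reverse=True)[:n_best]   # N1 highest scores
    return claim_score - float(np.mean(top))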
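The cohort step in B) normalizes the claimant's score against the best competing models. A sketch of that competitive-model normalization, S12 = S11 − S1 (the `n_best` default and function name are illustrative):

```python
import numpy as np

def normalized_score(claim_score, cohort_scores, n_best=5):
    """Competitive-model normalization: subtract from the claimant's score
    the mean of the N best non-claimant (cohort) scores."""
    top = sorted(cohort_scores, reverse=True)[:n_best]   # N1 highest scores
    return claim_score - float(np.mean(top))
```

Subtracting the cohort mean makes the threshold comparison robust to utterance-level variation that shifts all model scores together.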
The sixth step: back-end dual-threshold decision
1. First-stage decision
Compare S11 with the speaker first-stage threshold T11, and S21 with the content first-stage threshold T21. If S11 > T11 and S21 > T21, proceed to the second-stage decision; otherwise the system rejects, judging that the utterance fails the joint speaker and content requirement.
2. Second-stage decision
From S11 and S1, and from S21 and S2, the normalized scores S12 and S22 are obtained:
S12 = S11 − S1
S22 = S21 − S2
Compare S12 with the speaker second-stage threshold T12, and S22 with the content second-stage threshold T22. If S12 > T12 and S22 > T22, the system accepts the utterance as the claimed speaker speaking the correct content; otherwise it rejects, judging that the utterance fails the joint speaker and content requirement.
Experimental example
1. Experimental database
The method was tested on the random digit-string corpus of the Modern Communications Laboratory of Beijing Institute of Technology. The corpus contains speech from 99 speakers, 45 male and 44 female; the recorded content is the ten Mandarin digits "zero" through "nine" plus "yao" (an alternative pronunciation of "one"). The material is divided into a training set, an adaptation set, and a test set. The training set contains, per speaker, 3 random strings each of three, four, and five digits, 99 × 9 = 891 utterances in total; the adaptation set contains one pronunciation per speaker of the ten digits and "yao"; the test set likewise contains 3 random strings each of three, four, and five digits per speaker, 99 × 9 = 891 utterances in total.
2. System performance evaluation index
The evaluation indices are defined as follows:
FRR = (number of genuine trials falsely rejected) / (total number of genuine trials)
FAR = (number of impostor trials falsely accepted) / (total number of impostor trials)
Performance is evaluated with the equal error rate (EER), i.e., the value at which the false rejection rate FRR equals the average false acceptance rate FAR.
3. Experimental results
The experimental results are shown in the table below.
Table 1. Random digit-string speaker verification results

Method                                     EER      FAR I    FAR II   FAR III
Speaker-dependent HMM                      15.01%   35.12%   8.12%    3.29%
GMM + speaker-independent HMM              13.33%   23.16%   10.89%   5.96%
GMM + adaptive HMM                         12.03%   23.16%   9.42%    3.51%
GMM + adaptive HMM + two-stage decision    4.09%    6.73%    4.98%    0.66%
In this experiment, with the dual-threshold two-stage decision method combining GMM and adaptive HMM, the GMM first-stage decision threshold was T11 = −35.21 and the second-stage threshold T12 = −0.53; the HMM first-stage decision threshold was T21 = −33.56 and the second-stage threshold T22 = −1.70.
The results show that, for a random-text-prompting speaker verification system, the dual-threshold two-stage decision method combining GMM and HMM improves system performance significantly over speaker-dependent HMM recognition.

Claims (7)

1. A speaker verification system based on a distributed architecture, comprising a system front end, a data transmission channel, and a system back end; the system front end collects the speaker's speech, extracts features, compresses them into a bitstream format, and sends them into the data transmission channel; the data transmission channel carries the data between the system front end and the system back end; the system back end decompresses the bitstream data back into features and performs speaker verification.
2. The system according to claim 1, characterized in that the speaker verification is a random-text-prompting speaker verification method.
3. The random-text-prompting speaker verification method according to claim 2, characterized in that it combines a speaker recognition method with a speech recognition method.
4. The random-text-prompting speaker verification method according to claim 2, characterized in that it adopts a dual-threshold decision method.
5. The distributed-architecture speaker verification method according to claim 3 or 4, characterized in that the main steps at the system front end are:
5.1) feature extraction: comprising noise reduction, waveform processing, spectrum calculation, and blind equalization;
5.2) feature compression: comprising split vector quantization and bitstream frame formatting of the compressed data.
6. The distributed-architecture random-text-prompting speaker verification method according to claim 3 or 4, characterized in that the main steps at the system back end are:
6.1) feature decompression: the compressed bitstream is restored to a feature vector sequence according to the split vector quantization codebooks;
6.2) template training: comprising the training of the speaker recognition template and the speech recognition template;
6.3) template matching: comprising the matching of the speaker recognition template and the speech recognition template; with the decompressed feature vector sequence, every frame's feature vector is scored against the speaker recognition template, and the mean score over all frames, S11, is taken as the utterance's speaker score; in addition, according to the prompted content, every frame's feature vector is scored against the speech recognition template, and the mean score over all frames, S21, is taken as the utterance's content score;
6.4) dual-threshold decision: a two-stage decision method is adopted; in the first stage, S11 is compared with the speaker first-stage threshold T11, and S21 with the content first-stage threshold T21; if S11 > T11 and S21 > T21, the second-stage decision is performed, otherwise the system rejects, judging that the utterance fails the joint speaker and content requirement; in the second stage, the normalized score S12 is obtained from S11 and the normalized score S22 from S21; S12 is compared with the speaker second-stage threshold T12, and S22 with the content second-stage threshold T22; if S12 > T12 and S22 > T22, the system accepts the utterance as the claimed speaker speaking the correct content, otherwise it rejects, judging that the utterance fails the joint speaker and content requirement.
7. The distributed-architecture random-text-prompting speaker verification method according to claim 6, characterized in that the normalization of the second-stage decision uses competitive-model normalization:
7.1) with the decompressed feature vector sequence, every frame's feature vector is scored against each speaker recognition template other than the current speaker's; the N1 highest scores are taken and their arithmetic mean S1 is computed;
7.2) with the decompressed feature vector sequence, every frame's feature vector is scored against the speech recognition template according to the lexical search network; the N2 highest scores among all search results are taken and their arithmetic mean S2 is computed;
7.3) from S11 and S1, and from S21 and S2, the normalized scores S12 and S22 are obtained:
S12 = S11 − S1
S22 = S21 − S2
CNA2006101036129A 2006-07-25 2006-07-25 Method for identifying speaker based on distributed structure Pending CN1877697A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2006101036129A CN1877697A (en) 2006-07-25 2006-07-25 Method for identifying speaker based on distributed structure

Publications (1)

Publication Number Publication Date
CN1877697A true CN1877697A (en) 2006-12-13

Family

ID=37510108

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2006101036129A Pending CN1877697A (en) 2006-07-25 2006-07-25 Method for identifying speaker based on distributed structure

Country Status (1)

Country Link
CN (1) CN1877697A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402983A (en) * 2011-11-25 2012-04-04 浪潮电子信息产业股份有限公司 Cloud data center speech recognition method
CN102457845A (en) * 2010-10-14 2012-05-16 阿里巴巴集团控股有限公司 Method, equipment and system for authenticating identity by wireless service
CN102496364A (en) * 2011-11-30 2012-06-13 苏州奇可思信息科技有限公司 Interactive speech recognition method based on cloud network
CN102543083A (en) * 2012-03-16 2012-07-04 北京海尔集成电路设计有限公司 Intelligent voice recognition method and chip, cloud equipment and cloud server
CN101211615B (en) * 2006-12-31 2012-10-03 于柏泉 System for automatic recording for specific human voice
CN101562013B (en) * 2008-04-15 2013-05-22 联芯科技有限公司 Method and device for automatically recognizing voice
CN104490570A (en) * 2014-12-31 2015-04-08 桂林电子科技大学 Embedding type voiceprint identification and finding system for blind persons
CN105632515A (en) * 2014-10-31 2016-06-01 科大讯飞股份有限公司 Pronunciation error detection method and device
CN108630207A (en) * 2017-03-23 2018-10-09 富士通株式会社 Method for identifying speaker and speaker verification's equipment
CN109313902A (en) * 2016-06-06 2019-02-05 思睿逻辑国际半导体有限公司 Voice user interface


Similar Documents

Publication Publication Date Title
CN1877697A (en) Method for identifying speaker based on distributed structure
CN106847292B (en) Method for recognizing sound-groove and device
Lu et al. An investigation of dependencies between frequency components and speaker characteristics for text-independent speaker identification
Aloufi et al. Emotionless: Privacy-preserving speech analysis for voice assistants
CN107492382A (en) Voiceprint extracting method and device based on neutral net
CN1650349A (en) On-line parametric histogram normalization for noise robust speech recognition
CN113129897B (en) Voiceprint recognition method based on attention mechanism cyclic neural network
CN103971675A (en) Automatic voice recognizing method and system
WO2012075641A1 (en) Device and method for pass-phrase modeling for speaker verification, and verification system
WO2013060079A1 (en) Record playback attack detection method and system based on channel mode noise
CN103794211B (en) A kind of audio recognition method and system
CN103794207A (en) Dual-mode voice identity recognition method
CN108597505A (en) Audio recognition method, device and terminal device
CN111341323B (en) Voiceprint recognition training data amplification method and system, mobile terminal and storage medium
CN109754790A (en) A kind of speech recognition system and method based on mixing acoustic model
CN109448702A (en) Artificial cochlea's auditory scene recognition methods
CN105845143A (en) Speaker confirmation method and speaker confirmation system based on support vector machine
Verma Multi-feature fusion for closed set text independent speaker identification
CN1787077A (en) Method for fast identifying speeking person based on comparing ordinal number of archor model space projection
CN109616124A (en) Lightweight method for recognizing sound-groove and system based on ivector
CN107093430A (en) A kind of vocal print feature extraction algorithm based on wavelet package transforms
Chowdhury et al. Text-independent distributed speaker identification and verification using GMM-UBM speaker models for mobile communications
CN113393847B (en) Voiceprint recognition method based on fusion of Fbank features and MFCC features
CN110197657A (en) A kind of dynamic speech feature extracting method based on cosine similarity
CN116312559A (en) Training method of cross-channel voiceprint recognition model, voiceprint recognition method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20061213