CN107481723A

CN107481723A - Channel matching method and device for voiceprint recognition

Info

Publication number: CN107481723A
Application number: CN201710751356.2A
Authority: CN
Inventors: 梁永立; 何亮; 吴晋
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2017-08-28
Filing date: 2017-08-28
Publication date: 2017-12-15

Abstract

The present invention proposes a channel matching method and device for voiceprint recognition, belonging to the fields of speech recognition and voice communication. The method first collects speech data and encodes it according to the communication mode to be simulated, obtaining compressed speech data. Bit errors are then applied to the compressed speech data according to the bit error rate of the simulated channel under that communication mode, obtaining channel-simulated speech data. Finally, the channel-simulated speech data is decoded, obtaining speech under the corresponding communication mode. The device comprises a voice collection and reading module, a speech coding module, a channel bit-error simulation module, a speech decoding module, and a data storage module. The invention can simulate voice communication processes such as fixed-line telephony, VoIP network telephony, WeChat calls, QQ calls, 2G, 3G, and 4G, so as to obtain training speech with the same channel conditions as the test speech. This effectively solves the channel mismatch problem and suits the application requirements of voiceprint recognition.

Description

Channel matching method and device for voiceprint recognition

Technical field

The present invention relates to the fields of speech recognition and voice communication, and specifically to a channel matching method and device for voiceprint recognition.

Technical background

Voiceprint recognition, also known as speaker recognition, is a biometric technology that uses a computer to determine a speaker's identity automatically from speech. Voiceprint recognition can be classified in several ways depending on the application scenario. According to whether the speech content is known, it can be divided into text-dependent and text-independent recognition. According to the recognition task, it can be divided into speaker identification and speaker verification. Voiceprint recognition is mainly applied in fields such as security monitoring, criminal investigation and justice, and e-commerce.

In recent years, mainstream text-independent speaker identification (hereinafter "speaker recognition") technology has been based on the Gaussian mixture model-universal background model (GMM-UBM) speaker recognition system proposed by Douglas A. Reynolds in 2000. From the standpoint of speaker recognition, the GMM-UBM system provided a theoretical framework and an implementation for measuring the similarity of two speech segments, and it is of landmark significance.

Voice communication refers to communication by voice over a transmission medium. It includes landline calls, mobile phone calls, walkie-talkie calls, and voice chat over networks, collectively referred to as voice calls. Common voice communication modes at present include fixed-line telephony, VoIP network telephony, WeChat calls, QQ calls, 2G communication, 3G communication, and 4G communication.

The Public Switched Telephone Network (PSTN) is the telephone network commonly used in daily life. PSTN is a circuit-switched network based on analog technology, and the speech coding algorithm it uses is G.711 in A-law or μ-law mode. The coding commonly used by VoIP network telephony is the International Telecommunication Union G.723 standard, specifically algebraic code-excited linear prediction (ACELP) coding. WeChat calls and QQ calls use narrowband adaptive multi-rate (AMR-NB) coding. 2G communication, i.e. second-generation mobile telecommunication technology, includes the GSM communication system and the CDMA1x communication system; the 2G networks of China Mobile and China Unicom use the GSM standard, while China Telecom's 2G uses the CDMA1x standard. GSM speech coding is regular-pulse excitation long-term prediction (RPE-LTP) coding. 3G communication includes TD-SCDMA, independently formulated by China Mobile, WCDMA of China Unicom, and CDMA2000 of China Telecom. TD-SCDMA and WCDMA both use adaptive multi-rate coding, AMR-NB or AMR-WB. China Telecom's 2G and 3G use the enhanced variable rate codec (EVRC) or QCELP coding. For 4G communication, China Mobile uses the TD-LTE (Time Division Long Term Evolution) standard, and China Unicom and China Telecom use the FDD-LTE standard. 4G communication uses high-definition voice calls over VoLTE, and the speech coding mode is adaptive multi-rate (AMR) coding.

Because digital voice communication systems are so widely used, the training speech and test speech available to a speaker recognition system in a real environment are often coded differently. Voiceprint recognition then faces a voice channel mismatch problem caused by the different codings of training and test speech, which greatly affects the performance of the speaker recognition system. Solving the channel mismatch problem is one of the keys to improving speaker recognition performance and making speaker recognition systems more practical.

To solve the channel mismatch problem in voiceprint recognition, the common approach at present is to study cross-channel voiceprint modeling algorithms. The main voiceprint modeling algorithms under this approach are nuisance attribute projection (NAP), joint factor analysis (JFA), the identity-vector (i-vector) speaker recognition modeling method, and speaker recognition modeling methods that combine speech-recognition DNN acoustic models with the i-vector model.

NAP and JFA are subspace models proposed for the channel mismatch problem. NAP directly estimates a channel subspace and then removes it from the GMM mean supervector space to reduce the interference of channel information with speaker recognition. JFA assumes that the high-dimensional space of GMM mean supervectors contains two subspaces, one holding speaker information and one holding channel information. By jointly modeling the two subspaces, speaker information and channel information in speech can be separated more effectively, improving speaker recognition performance under complex channels. Because the channel subspace in the JFA model still contains rich speaker information, modeling speaker and channel separately in JFA damages the speaker information considerably. In 2010, Dehak et al. proposed the i-vector model on the basis of JFA. The i-vector model defines only one subspace, called the total variability subspace, which contains both speaker information and channel information. Each speech segment is then represented as a low-dimensional vector in this subspace, the i-vector. Finally, channel compensation is performed at the i-vector level to weaken the influence of the channel on speaker recognition performance. Compared with the JFA model, the i-vector model is far less complex, channel compensation in the subspace is more flexible, and speaker recognition performance is better, which has made the i-vector model the most mainstream and most advanced speaker recognition modeling method. In 2014, Lei, Kenny, et al. proposed a speaker recognition modeling method combining a speech-recognition DNN acoustic model with the i-vector model: when estimating the sufficient statistics of the i-vector model, the DNN acoustic model that classifies phoneme states in speech recognition replaces the traditional UBM for computing frame posterior probabilities. This method reduces the modeling complexity of the speaker recognition system and clearly improves recognition performance.

Take the commonly used JFA voiceprint recognition method as an example. This method assumes that a given speech segment can be represented by a supervector, which can be expressed as:

$$M_h(s) = m_{ubm} + v\,y(s) + u\,x_h(s) \tag{1}$$

Here M_h(s) denotes the supervector of the h-th speech segment of a given speaker s, m_ubm denotes the UBM mean supervector, v denotes the speaker space matrix, u denotes the channel space matrix, y(s) is the speaker factor, and x_h(s) is the channel factor of the h-th speech segment of the given speaker s.

In JFA, the two matrices v and u, i.e. the speaker space matrix and the channel space matrix, must be estimated first. When training a speaker, the speaker factor y(s) and the channel factor x_h(s) are estimated for each training speech segment, so that the speaker model corresponding to that training speech segment is:

$$M(s) = m_{ubm} + v\,y(s) \tag{2}$$

During testing, only the channel factor x_h(s) of each test speech segment needs to be estimated; it is then combined with the speaker model obtained above to perform speaker recognition.
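To make formulas (1) and (2) concrete, the following is a minimal sketch in plain Python. The dimensions and numeric values are toy assumptions chosen only for illustration (a 4-dimensional supervector, two speaker factors, one channel factor); the helper names are hypothetical, and no estimation of v, u, y(s) or x_h(s) is attempted here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def add(*vectors):
    """Element-wise sum of equally long vectors."""
    return [sum(values) for values in zip(*vectors)]


def utterance_supervector(m_ubm, v, y, u, x):
    """Formula (1): M_h(s) = m_ubm + v y(s) + u x_h(s)."""
    return add(m_ubm, matvec(v, y), matvec(u, x))


def speaker_model(m_ubm, v, y):
    """Formula (2): M(s) = m_ubm + v y(s), i.e. formula (1) without the channel term."""
    return add(m_ubm, matvec(v, y))


# Toy values (assumptions for illustration only).
m_ubm = [0.0, 1.0, 2.0, 3.0]                            # UBM mean supervector
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]    # speaker space matrix
u = [[0.5], [0.25], [0.125], [1.0]]                     # channel space matrix
y = [2.0, -1.0]                                         # speaker factor y(s)
x = [8.0]                                               # channel factor x_h(s)

M_h = utterance_supervector(m_ubm, v, y, u, x)
M_s = speaker_model(m_ubm, v, y)
channel_offset = [a - b for a, b in zip(M_h, M_s)]      # equals u x_h(s)
```

The difference between the two supervectors is exactly the channel term u x_h(s), which is the part JFA tries to separate from the speaker information.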

In a JFA-based voiceprint recognition system, the required speech data fall into three parts by function: speech data for training the universal background Gaussian mixture model; speech data for training the target speakers; and speech data to be recognized.

The existing JFA-based voiceprint recognition method includes a universal-model training stage, a stage for estimating the speaker space matrix and the channel space matrix, a speaker-model training stage, and a test stage, wherein:

1) The universal-model training stage comprises the following steps:

1-a) Through speech preprocessing and feature extraction, convert the speech data for training the universal background Gaussian mixture model into spectral features;

1-b) Based on the extracted spectral features, initialize the universal background Gaussian mixture model using the K-means algorithm;

1-c) Update the universal background Gaussian mixture model initialized in step 1-b) using the expectation-maximization (EM) algorithm.

2) The stage for estimating the speaker space matrix and the channel space matrix comprises the following steps:

2-a) Compute the first-order Baum-Welch statistics of the feature vectors of all speech of each speaker relative to the Gaussian components of the universal background model, obtaining the corresponding mean supervectors;

2-b) Using the mean supervectors from step 2-a), estimate the speaker space matrix v iteratively with the EM algorithm;

2-c) Compute the first-order Baum-Welch statistics of the feature vectors of all speech under the same channel relative to the Gaussian components of the universal background model, obtaining the corresponding mean supervectors;

2-d) Using the mean supervectors from step 2-c), estimate the channel space matrix u iteratively with the EM algorithm.

3) The speaker-model training stage comprises the following steps:

3-a) Through speech preprocessing and feature extraction, convert the speech data for training the target speaker into spectral features;

3-b) Based on the spectral features from step 3-a), compute the Baum-Welch statistics, obtaining the corresponding mean supervector;

3-c) Using the mean supervector from step 3-b), estimate the joint vector of the corresponding speaker factor y(s) and channel factor x_h(s) with the E-step of the EM algorithm, and take the y(s) part;

3-d) Combining the speaker space matrix v and the channel space matrix u obtained in step 2) with y(s) from step 3-c), compute the target speaker model.

In step 3-d), the speaker model corresponding to the training speech is computed according to formula (2).

4) The test stage comprises the following steps:

4-a) Through speech preprocessing and feature extraction, convert the speech to be recognized into spectral features;

4-b) Based on the spectral features from step 4-a), compute the Baum-Welch statistics, obtaining the corresponding mean supervector;

4-c) Using the mean supervector from step 4-b), estimate the corresponding channel factor x_h(s) with the E-step of the EM algorithm;

4-d) Combining the speaker models obtained in step 3), compute the similarity score using the likelihood ratio;

4-e) Take the speaker with the largest score computed in step 4-d) as the recognition result for the test utterance.

The voiceprint recognition methods above can be used to solve the channel mismatch problem, but they have problems of their own. Taking JFA as an example, it requires a very large amount of training data, the computation at test time is also very large, and in practical applications it is often difficult to obtain good recognition performance.

Summary of the invention

The purpose of the present invention is to overcome the deficiencies of the prior art by providing a channel matching method and device for voiceprint recognition. The invention can effectively perform software channel simulation of voice communication, simulating voice communication processes such as fixed-line, 2G, 3G, and 4G, so as to obtain training speech with the same channel conditions as the test speech. This effectively solves the channel mismatch problem and suits practical application requirements.

The technical solution adopted by the present invention is as follows：

A channel matching method for voiceprint recognition, characterized in that the method comprises a data collection and reading stage, a speech coding stage, a channel bit-error simulation stage, and a speech decoding stage;

1) The data collection and reading stage comprises the following steps:

1-a) Collect and read the original speech data, where the original speech data is in WAV format;

1-b) Remove the file header of the original speech data according to the WAV format standard, obtaining a pure speech data block;
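Steps 1-a) and 1-b) can be sketched with Python's standard `wave` module, which parses the RIFF/WAV header and returns only the sample data. This is a minimal illustration, not the patented implementation; the function names `read_pcm_block` and `make_test_wav` are hypothetical, and the in-memory test file stands in for collected speech.

```python
import io
import struct
import wave


def read_pcm_block(wav_bytes):
    """Return (params, pcm): the WAV parameters and the raw sample bytes
    with the file header removed (the "pure speech data block")."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        params = wav.getparams()
        pcm = wav.readframes(wav.getnframes())
    return params, pcm


def make_test_wav(samples, rate=8000):
    """Build a small mono 16-bit WAV file in memory (illustration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 16-bit quantization
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buf.getvalue()


wav_bytes = make_test_wav([0, 1000, -1000, 32767])
params, pcm = read_pcm_block(wav_bytes)
samples = struct.unpack("<%dh" % (len(pcm) // 2), pcm)
```

Keeping `params` alongside the sample block is a design choice of this sketch: the sample rate and sample width are needed again when a header is added back after decoding.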

2) The speech coding stage comprises the following steps:

2-a) Select the voice communication mode to be simulated according to the original speech data, the voice communication mode being any one of fixed-line telephony, VoIP network telephony, WeChat calls, QQ calls, GSM, 3G, and 4G;

2-b) Encode the pure speech data block obtained in step 1-b) according to the speech coding standard corresponding to the selected voice communication mode, obtaining compressed speech data;
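As one illustration of step 2-b), the sketch below assumes the fixed-line mode with G.711 μ-law coding, which the background section names for PSTN. It follows the widely used table-free μ-law compression routine operating on 16-bit samples; it is a simplified stand-in, and the other modes (G.723, AMR, RPE-LTP, EVRC) would need their own codec implementations. The function names are hypothetical.

```python
import struct

BIAS = 0x84    # 132, added to the magnitude before the segment search
CLIP = 32635   # largest magnitude that still fits in 15 bits after adding BIAS


def linear_to_ulaw(sample):
    """Compress one signed 16-bit PCM sample into one mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit, 7 down to 0.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are stored bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def encode_block(pcm):
    """Encode a block of little-endian 16-bit PCM bytes to mu-law bytes."""
    count = len(pcm) // 2
    samples = struct.unpack("<%dh" % count, pcm[:count * 2])
    return bytes(linear_to_ulaw(s) for s in samples)


pcm_block = struct.pack("<3h", 0, 1000, -1000)
compressed = encode_block(pcm_block)   # one byte per 16-bit sample
```

The compressed block is half the size of the PCM input, which is the 2:1 compression μ-law provides for 16-bit audio.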

3) The channel bit-error simulation stage comprises the following steps:

3-a) Select the corresponding channel according to the voice communication mode selected in step 2-a), and obtain the bit error rate of speech transmission over that channel under different signal-to-noise ratio conditions;

3-b) Select a signal-to-noise ratio and apply bit errors to the compressed speech data encoded in step 2-b), obtaining the speech data after channel simulation, which serves as the channel-simulated speech data; here, the bit-error operation introduces random errors into the compressed speech data according to the bit error rate under the selected communication mode, i.e. the bit error rate corresponding to the selected signal-to-noise ratio;
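The random-error operation of step 3-b) can be sketched as independent bit flips: every bit of the compressed data is inverted with probability equal to the bit error rate. Independent, uniformly distributed errors are an assumption of this sketch (the text only says "random errors"; real channels may produce bursts), and the function name and `seed` parameter are hypothetical.

```python
import random


def inject_bit_errors(data, ber, seed=None):
    """Flip each bit of `data` independently with probability `ber`.

    Returns (corrupted_bytes, number_of_flipped_bits)."""
    rng = random.Random(seed)       # seeded generator for repeatable simulations
    out = bytearray(data)
    flipped = 0
    for i in range(len(out)):
        for bit in range(8):
            if rng.random() < ber:
                out[i] ^= 1 << bit  # invert this bit
                flipped += 1
    return bytes(out), flipped


compressed = bytes(range(256)) * 40            # stand-in for encoded speech
noisy, n_errors = inject_bit_errors(compressed, ber=0.01, seed=1)
observed_ber = n_errors / (8 * len(compressed))
```

For long inputs the observed error rate should be close to the requested one, which gives a quick sanity check on the simulation.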

4) The speech decoding stage comprises the following steps:

4-a) Select the corresponding speech decoding algorithm according to the voice communication mode selected in step 2-a);

4-b) Decode the channel-simulated speech data obtained in step 3-b) with the corresponding speech decoding algorithm;

4-c) Add a WAV file header to the decoded speech data, obtaining training speech data with the same channel conditions as the test speech.
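Steps 4-b) and 4-c) can be sketched as follows, again assuming the fixed-line mode with G.711 μ-law coding as the example codec. The decoder is the standard table-free μ-law expansion to 16-bit samples, and the header is written with Python's `wave` module. The function names and the 8 kHz default are assumptions of this sketch.

```python
import io
import struct
import wave

BIAS = 0x84  # 132, the same bias used by mu-law compression


def ulaw_to_linear(code):
    """Expand one mu-law byte into a signed 16-bit PCM sample."""
    code = ~code & 0xFF                 # undo the bit inversion
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_to_wav(ulaw_bytes, rate=8000):
    """Decode mu-law bytes and wrap the samples in a mono 16-bit WAV file."""
    samples = [ulaw_to_linear(b) for b in ulaw_bytes]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buf.getvalue()


channel_simulated = bytes([0xFF, 0xCE, 0x4E])   # stand-in for step 3-b) output
training_wav = decode_to_wav(channel_simulated)
```

Because every byte value is a valid μ-law code, bit errors in the channel-simulated data never make decoding fail; they only change the decoded sample values, which is the distortion the method aims to reproduce.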

Based on the above method, the present invention also proposes a channel matching device for voiceprint recognition, characterized in that the device comprises the following five modules:

Voice collection and reading module: collects and reads the speaker's original speech data and removes the audio file header from the acquired speech data, obtaining a pure speech data block;

Speech coding module: selects the voice communication mode to be simulated according to the original speech data and encodes the pure speech data block obtained by the voice collection and reading module, obtaining compressed speech data under the corresponding communication mode;

Channel bit-error simulation module: selects the corresponding channel according to the communication mode and simulates channel bit errors on the compressed speech data, obtaining channel-simulated speech data;

Speech decoding module: decodes the channel-simulated speech data with the decoding algorithm corresponding to the voice communication mode and adds a file header to the decoded speech, obtaining output speech with the same channel conditions as the test speech;

Data storage module: stores the original speech data, the compressed speech data, the channel-simulated speech data, and the output speech data with the same channel conditions as the test speech, and passes the corresponding data to the corresponding modules.

Features and beneficial effects of the present invention:

(1) Compared with traditional voiceprint recognition methods, the method of the invention applies channel simulation to voiceprint recognition: only the original speech data used for training needs to pass through the channel simulator to obtain training speech with the same channel conditions as the test speech, which solves the channel mismatch problem of traditional voiceprint recognition methods.

(2) Compared with cross-channel voiceprint modeling algorithms, the invention only performs channel simulation on the original training speech samples, without changing the voiceprint modeling algorithm. This reduces the complexity of the recognition algorithm, while the recognition performance is better than that of cross-channel voiceprint modeling algorithms. The method of the invention is therefore better suited to the practical application requirements of voiceprint recognition tasks.

Brief description of the drawings

Fig. 1 is a flow block diagram of the method of the present invention.

Fig. 2 is a structural block diagram of the device of the present invention.

Embodiment

The channel matching method and device for voiceprint recognition proposed by the present invention are described below with reference to the accompanying drawings.

The flow of the channel matching method for voiceprint recognition proposed by the present invention is shown in Fig. 1. The method comprises a data collection and reading stage, a speech coding stage, a channel bit-error simulation stage, and a speech decoding stage;

1) The data collection and reading stage comprises the following steps:

2) The speech coding stage comprises the following steps:

2-b) Encode the pure speech data block obtained in step 1-b) according to the speech coding standard corresponding to the selected voice communication mode, obtaining compressed speech data;

3) The channel bit-error simulation stage comprises the following steps:

3-a) Select the corresponding channel according to the voice communication mode selected in step 2-a), and obtain the bit error rate of speech transmission over that channel under different signal-to-noise ratio conditions;
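The text does not say how the bit error rate is obtained for each signal-to-noise ratio; it could come from measurements, published channel models, or link-level simulation. Purely as an illustrative assumption, the sketch below uses the textbook closed form for uncoded BPSK over an additive white Gaussian noise channel, BER = 0.5 · erfc(sqrt(Eb/N0)), to build a lookup table from SNR to BER. The modulation, the channel model, and the function name are not taken from the patent.

```python
import math


def ber_bpsk_awgn(ebn0_db):
    """Theoretical BER of uncoded BPSK over AWGN (an assumed channel model).

    ebn0_db is the energy-per-bit to noise-density ratio in decibels."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)          # convert dB to a linear ratio
    return 0.5 * math.erfc(math.sqrt(ebn0))


# Lookup table: SNR condition (dB) -> bit error rate used for error injection.
snr_points_db = (0, 2, 4, 6, 8, 10)
ber_table = {snr_db: ber_bpsk_awgn(snr_db) for snr_db in snr_points_db}
```

A table of this form is all the error-injection step needs: selecting a signal-to-noise ratio then reduces to reading one bit error rate from the table for the selected channel.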

4) The speech decoding stage comprises the following steps:

4-c) Add a WAV file header to the decoded speech data, obtaining training speech data with the same channel conditions as the test speech; this training speech data is then used for voiceprint recognition.

In step 1-a) above, the original speech data must be in WAV format, sampled at 8 kHz or 16 kHz with 16-bit quantization; whether 8 kHz or 16 kHz sampling is chosen depends on the speech coding mode.
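A small input check along these lines can catch unsuitable recordings before encoding. The sketch below assumes the usual sample rates of the codecs named in the background section (8 kHz for the narrowband codecs, 16 kHz for AMR-WB); the dictionary labels and the function name `check_wav` are hypothetical.

```python
import io
import struct
import wave

# Sample rate expected by each codec; keys are labels chosen for this sketch.
CODEC_SAMPLE_RATE = {
    "G.711": 8000,
    "G.723": 8000,
    "AMR-NB": 8000,
    "RPE-LTP": 8000,
    "AMR-WB": 16000,
}


def check_wav(wav_file, codec):
    """Return a list of problems; an empty list means the file is usable."""
    problems = []
    expected_rate = CODEC_SAMPLE_RATE[codec]
    with wave.open(wav_file, "rb") as wav:
        if wav.getsampwidth() != 2:
            problems.append("expected 16-bit samples, got %d-bit"
                            % (8 * wav.getsampwidth()))
        if wav.getframerate() != expected_rate:
            problems.append("expected %d Hz for %s, got %d Hz"
                            % (expected_rate, codec, wav.getframerate()))
    return problems


# Build a short 8 kHz, 16-bit mono file in memory to exercise the check.
_buf = io.BytesIO()
with wave.open(_buf, "wb") as _wav:
    _wav.setnchannels(1)
    _wav.setsampwidth(2)
    _wav.setframerate(8000)
    _wav.writeframes(struct.pack("<4h", 0, 1, 2, 3))
wav_bytes = _buf.getvalue()

ok_for_g711 = check_wav(io.BytesIO(wav_bytes), "G.711")
bad_for_amr_wb = check_wav(io.BytesIO(wav_bytes), "AMR-WB")
```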

After the training speech is obtained by the method of the invention, a conventional voiceprint recognition method can be used for voiceprint recognition, for example the i-vector model. The i-vector features of the training speech and the test speech are extracted, the cosine distance between them is computed, and the candidate with the largest value is taken as the recognition result.
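The scoring step can be sketched as cosine scoring over already extracted i-vectors. The sketch treats the "cosine distance" of the text as cosine similarity (larger means more similar), which is how the "largest value" rule reads; i-vector extraction itself is out of scope, and the three-dimensional vectors and speaker labels are toy values invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(test_ivector, enrolled):
    """Return (best_speaker, scores) for a dict of speaker -> i-vector."""
    scores = {name: cosine_similarity(test_ivector, vec)
              for name, vec in enrolled.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy i-vectors (hypothetical); real i-vectors typically have hundreds of dimensions.
enrolled = {
    "speaker_a": [1.0, 0.0, 0.0],
    "speaker_b": [0.0, 1.0, 0.0],
}
best, scores = identify([0.9, 0.1, 0.0], enrolled)
```

With channel-matched training speech, the enrolled i-vectors and the test i-vector carry similar channel effects, so the comparison depends more on the speaker than on the codec.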

The present invention also proposes a channel matching device for voiceprint recognition using the above method, characterized in that the device comprises the following five modules:

Speech coding module: selects the voice communication mode to be simulated according to the original speech data and encodes the obtained pure speech data block, obtaining compressed speech data under the corresponding voice communication mode;

Channel bit-error simulation module: selects the corresponding channel according to the voice communication mode and simulates channel bit errors on the compressed speech data, obtaining channel-simulated speech data;

Speech decoding module: decodes the channel-simulated speech data with the decoding algorithm corresponding to the voice communication mode and adds a file header to the decoded speech, obtaining output speech data with the same channel conditions as the test speech;

Each of the above modules can be implemented with conventional analog or digital integrated circuits.

Claims

1. A channel matching method for voiceprint recognition, characterized in that the method comprises a data collection and reading stage, a speech coding stage, a channel bit-error simulation stage, and a speech decoding stage;

1) The data collection and reading stage comprises the following steps:

1-a) Collect and read the original speech data, where the original speech data is in WAV format;

1-b) Remove the file header of the original speech data according to the WAV format standard, obtaining a pure speech data block;

2) The speech coding stage comprises the following steps:

2-a) Select the voice communication mode to be simulated according to the original speech data, the voice communication mode being any one of fixed-line telephony, VoIP network telephony, WeChat calls, QQ calls, GSM, 3G, and 4G;

2-b) Encode the pure speech data block obtained in step 1-b) according to the speech coding standard corresponding to the selected voice communication mode, obtaining compressed speech data;

3) The channel bit-error simulation stage comprises the following steps:

3-a) Select the corresponding channel according to the voice communication mode selected in step 2-a), and obtain the bit error rate of speech transmission over that channel under different signal-to-noise ratio conditions;

3-b) Select a signal-to-noise ratio and apply bit errors to the compressed speech data encoded in step 2-b), obtaining the speech data after channel simulation, which serves as the channel-simulated speech data; here, the bit-error operation introduces random errors into the compressed speech data according to the bit error rate under the selected communication mode, i.e. the bit error rate corresponding to the selected signal-to-noise ratio;

4) The speech decoding stage comprises the following steps:

4-a) Select the corresponding speech decoding algorithm according to the voice communication mode selected in step 2-a);

4-b) Decode the channel-simulated speech data obtained in step 3-b) with the corresponding speech decoding algorithm;

4-c) Add a WAV file header to the decoded speech data, obtaining training speech data with the same channel conditions as the test speech.
2. A channel matching device for voiceprint recognition using the method as claimed in claim 1, characterized in that the device comprises the following five modules:

Voice collection and reading module: collects and reads the speaker's original speech data and removes the audio file header from the acquired speech data, obtaining a pure speech data block;

Speech coding module: selects the voice communication mode to be simulated according to the original speech data and encodes the pure speech data block obtained by the voice collection and reading module, obtaining compressed speech data under the corresponding communication mode;

Channel bit-error simulation module: selects the corresponding channel according to the communication mode and simulates channel bit errors on the compressed speech data, obtaining channel-simulated speech data;

Speech decoding module: decodes the channel-simulated speech data with the decoding algorithm corresponding to the voice communication mode and adds a file header to the decoded speech, obtaining output speech with the same channel conditions as the test speech;

Data storage module: stores the original speech data, the compressed speech data, the channel-simulated speech data, and the output speech data with the same channel conditions as the test speech, and passes the corresponding data to the corresponding modules.