CN106019230A

CN106019230A - Sound source positioning method based on i-vector speaker recognition

Info

Publication number: CN106019230A
Application number: CN201610365659.6A
Authority: CN
Inventors: 万新旺; 顾晓瑜; 杨悦; 廖鹏程
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2016-05-27
Filing date: 2016-05-27
Publication date: 2016-10-12
Anticipated expiration: 2036-05-27
Also published as: CN106019230B

Abstract

The invention discloses a sound source localization method based on i-vector speaker recognition. Features are extracted from a discriminative cross-correlation function and divided into a training set and a test set, which are used to train and test the model of an i-vector speaker recognition system. The expectation-maximization (EM) algorithm is used to obtain the maximum likelihood estimate of the probability function of the development-set i-vector distribution, and a PLDA model constrained by speech duration is established so that speech recognition and sound source localization can be performed accurately. The method thereby addresses the noise and reverberation problems of conventional sound source localization.

Description

A sound source localization method based on i-vector speaker recognition

Technical field

The present invention relates to a sound source localization method based on i-vector speaker recognition, and belongs to the field of Internet information technology.

Background

Speaker recognition, a form of biometrics, is a technology that automatically determines a speaker's identity from speech parameters in the speech waveform that reflect the speaker's physiological and behavioral characteristics. It is an important branch of personal biometric identification. Compared with other biometric technologies, speaker recognition is simple, inexpensive, and easy to scale, and it can be widely applied in fields such as database access, security verification, telephone banking, and remote computer login. As an important biometric identification technology with broad applications, it has attracted many researchers. In recent years, speaker modeling based on the identity vector (i-vector) has achieved great success and has substantially improved the performance of speaker recognition systems. Subspace modeling based on the i-vector has proved to be the most effective state-of-the-art speaker modeling technique.

With the rapid development of computer technology and the information industry, sound source localization has become a hot research topic. Determining the position of a sound source in space has broad application prospects in many aspects of production and daily life. Sound source localization locates an object by measuring the sound it emits. It differs from localization by sonar, radar, or wireless communication: the signal in the former is ordinary sound, which is wideband, whereas the sources in the latter are narrowband. Various sound source localization algorithms have been proposed according to the characteristics of acoustic signals, but noise and reverberation keep the accuracy of existing algorithms relatively low.

Current sound source localization algorithms fall roughly into three classes: algorithms based on high-resolution spectral estimation, algorithms based on time delay estimation (TDE), and algorithms based on steered beamforming.

(1) Localization algorithms based on high-resolution spectral estimation

There are four main kinds of high-resolution spectral estimation methods: ARMA spectral estimation, minimum-variance spectral estimation, entropy spectral estimation, and subspace methods. ARMA spectral estimation estimates the power spectral density by fitting a model to a stationary linear signal process. Entropy spectral estimation comprises the maximum entropy method and the minimum cross-entropy method. Subspace methods include Pisarenko harmonic decomposition, the Prony method, multiple signal classification (MUSIC), and estimation of signal parameters via rotational invariance techniques (ESPRIT). Localization algorithms based on high-resolution spectral estimation use the covariance matrix of the received signals; in practice this matrix is unknown and must be estimated from the observed data. Estimating it requires assuming that the source and the noise are statistically stationary and that the parameter to be estimated (the source position) is fixed, so that averages can be taken over a time interval. Speech, however, is only short-time stationary and often fails to meet this condition. Moreover, the vast majority of these methods are designed for far-field narrowband signals, and reverberation in indoor environments seriously degrades their performance.

(2) Localization algorithms based on time delay estimation

Algorithms based on time delay estimation proceed in two steps. The first step is time delay estimation, that is, computing the difference in the arrival time of the source signal at each pair of microphones; the second step is position estimation, that is, estimating the source position from the time delays and the geometry of the microphone array. Time delay estimation (TDE) is the more critical step. The generalized cross-correlation (GCC) method estimates the time difference of arrival (TDOA) by computing the cross-correlation function between the signals received by different microphones. In real environments, however, noise and reverberation weaken the main peak of the correlation function and make peak detection difficult. GCC weights the cross-power spectrum of the two microphone signals so that the peak of the correlation function at the time delay becomes more prominent. Knapp lists five commonly used weighting functions, of which maximum likelihood weighting (GCC-ML) and phase transform weighting (GCC-PHAT) are the most typical. Low computational complexity and ease of implementation have made GCC methods widely used.

(3) Localization algorithms based on steered beamforming

Localization algorithms based on steered beamforming were first used for target localization in radar and sonar systems and were later introduced into microphone array signal processing. Microphone array beamforming has two main applications: 1) speech enhancement; 2) sound source localization. When the source position is known, the steering delay of each microphone can be adjusted so that the microphone signals are aligned in time, which steers the array toward the source position; the microphone signals are then summed to suppress noise and enhance the signal. This simplest and most practical beamformer is called the delay-and-sum beamformer.
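The delay-and-sum idea described above can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not part of the patent: the steering delays are known integers (in samples), shifts are circular (`np.roll`), and the function name `delay_and_sum`, the test tone, and the noise level are hypothetical.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Align each channel by its integer steering delay (in samples) and average.

    signals: array of shape (n_mics, n_samples)
    delays:  delay of the source signal at each microphone, in samples
    """
    signals = np.asarray(signals, dtype=float)
    # Undo each channel's delay (circular shift, an assumption of this sketch).
    aligned = [np.roll(ch, -int(d)) for ch, d in zip(signals, delays)]
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(0)
n = 4096
source = np.sin(2 * np.pi * 5 * np.arange(n) / n)   # toy source signal
delays = [0, 3, 7, 12]                               # hypothetical steering delays
mics = np.stack([np.roll(source, d) + rng.normal(0.0, 1.0, n) for d in delays])

out = delay_and_sum(mics, delays)
noise_in = np.var(mics[0] - source)    # noise power at a single microphone
noise_out = np.var(out - source)       # noise power after beamforming
```

Because the noise is independent across channels while the aligned source adds coherently, `noise_out` should come out well below `noise_in` for this four-microphone toy array.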

Conventional algorithms are severely limited in strongly reverberant environments. For example, steered beamforming based on maximum output power is sensitive to the environment and to the frequency content of the source, which limits its applications; localization methods based on high-resolution spectral estimation are computationally expensive and unsuitable for near-field localization; and the delay accuracy of methods based on time delay estimation is easily degraded by reverberation and noise.

Summary of the invention

The object of the present invention is to overcome the above deficiencies of the prior art by proposing a sound source localization algorithm based on i-vector speaker recognition. The method extracts features from a discriminative cross-correlation function, divides the features into a training set and a test set, and uses them to train and test the model of an i-vector speaker recognition system. The expectation-maximization (EM) algorithm is used to obtain the maximum likelihood estimate of the probability function of the development-set i-vector distribution, and a PLDA model constrained by speech duration is established, which enables accurate speech recognition and sound source localization. The algorithm thereby addresses the noise and reverberation problems of conventional sound source localization.

The technical solution adopted by the present invention is a sound source localization algorithm based on i-vector speaker recognition, comprising a training stage and a localization stage.

The steps of the training stage are as follows:

Step 1: The sound source is placed at each training position r_i, i = 1, 2, ..., K, and the microphone array records the signal emitted by the source at that position (the reverberant signal);

Step 2: The cross-correlation function is computed from the recorded reverberant signal;

Step 3: A feature vector y is generated from the cross-correlation function;

Step 4: For each training position r_i, the feature vectors are used to compute the mean vector μ of the cross-correlation-function PLDA model, the fixed-dimension speaker subspace, and the residual ε_ij.

The steps of the localization stage are as follows:

Step 1: The microphone array records a signal that contains the signal emitted by the source (the reverberant signal) and noise;

Step 2: The cross-correlation function is computed from the recorded signal;

Step 3: A feature vector y is generated from the cross-correlation function; if there are N frames of data, a set of feature vectors y = {y^t, t = 1, ..., N} is generated;

Step 4: The features are tested with the PLDA model to estimate the position of the sound source.

In addition, for the selection of cross-correlation features, the room impulse response simulation tool Roomsim is used to simulate a real acoustic environment. The generalized cross-correlation (GCC) function between the signals x₁(k) and x₂(k) can be computed in the frequency domain:

R_{x_1 x_2}(\tau) = \int_{-\infty}^{\infty} \Psi_{1,2}(\omega) \, X_1(\omega) X_2^{*}(\omega) \, e^{j\omega\tau} \, d\omega \qquad (1.1)

In this formula, the superscript "*" denotes the complex conjugate, X₁(ω) is the Fourier transform of x₁(t), and Ψ_{1,2}(ω) is the weighting function.

To make the cross-correlation function more robust to reverberation, the phase transform (PHAT) weighting function can be used:

\Psi_{1,2}(\omega) = \frac{1}{\left| X_1(\omega) X_2^{*}(\omega) \right|} \qquad (1.2)

Substituting formula (1.2) into formula (1.1) gives:

R_{x_1 x_2}(\tau) = \int_{-\infty}^{\infty} \frac{X_1(\omega) X_2^{*}(\omega)}{\left| X_1(\omega) X_2^{*}(\omega) \right|} \, e^{j\omega\tau} \, d\omega \qquad (1.3)
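A discrete-time counterpart of formula (1.3) can be sketched with the FFT. The following is a minimal illustration, not the patent's implementation: the function name `gcc_phat`, the zero-padding, the small constant guarding against division by zero, and the synthetic test signal are all assumptions made for this example.

```python
import numpy as np

def gcc_phat(x1, x2, max_lag):
    """PHAT-weighted generalized cross-correlation for lags -max_lag..max_lag."""
    n = len(x1) + len(x2)                        # zero-pad to avoid wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)                     # cross-power spectrum X1 X2*
    cross /= np.maximum(np.abs(cross), 1e-12)    # PHAT weighting, formula (1.2)
    r = np.fft.irfft(cross, n=n)                 # back to the lag domain
    # Reorder so that the first entry corresponds to lag -max_lag.
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, r

rng = np.random.default_rng(0)
s = rng.normal(size=1024)                        # toy wideband source
true_delay = 5                                   # x1 lags x2 by 5 samples
x2 = s
x1 = np.concatenate((np.zeros(true_delay), s[:-true_delay]))

lags, r = gcc_phat(x1, x2, max_lag=20)
estimated_delay = int(lags[np.argmax(r)])
```

With the sign convention of formula (1.1), a positive peak lag means that the first signal arrives later than the second. In the method described here the whole function r, rather than only its peak, would serve as the basis of the feature vector.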

In practice, the microphone signals x₁(t) and x₂(t) are first windowed, and X₁(ω) and X₂(ω) are then obtained by Fourier transform. If the length L of the room impulse response is much shorter than the length of the window function, the microphone signals can be expressed in the frequency domain as:

X_n(\omega) = H_n(r_s, \omega) \, S(\omega), \quad n = 1, 2 \qquad (1.4)

In this formula, S(ω) and H_n(r_s, ω) are the Fourier transforms of s(k) and h_n(r_s, k), respectively.

Substituting formula (1.4) into formula (1.3) gives:

R_{x_1 x_2}(\tau) = \int_{-\infty}^{\infty} \frac{H_1(r_s, \omega) H_2^{*}(r_s, \omega)}{\left| H_1(r_s, \omega) H_2^{*}(r_s, \omega) \right|} \, e^{j\omega\tau} \, d\omega = R_{h_1 h_2}(r_s, \tau) \qquad (1.5)

Formula (1.5) shows that the GCC between the signals x₁(k) and x₂(k) received by the microphone array equals the GCC between the room impulse responses h₁(r_s, k) and h₂(r_s, k).
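The cancellation of the source spectrum in formula (1.5) can be checked numerically. The sketch below assumes circular convolution, so that the product form of formula (1.4) holds exactly; the toy impulse responses (exponentially decaying noise) and all function names are hypothetical and are not Roomsim output.

```python
import numpy as np

def gcc_phat_circular(a, b):
    """PHAT-weighted circular cross-correlation, a discrete analogue of (1.3)."""
    cross = np.fft.fft(a) * np.conj(np.fft.fft(b))
    return np.fft.ifft(cross / np.abs(cross)).real

def circular_convolve(h, s):
    """Circular convolution, so that X(w) = H(w) S(w) holds exactly as in (1.4)."""
    return np.fft.ifft(np.fft.fft(h) * np.fft.fft(s)).real

def toy_impulse_response(rng, n, taps=32):
    """Hypothetical short impulse response: decaying noise, zero elsewhere."""
    h = np.zeros(n)
    h[:taps] = rng.normal(size=taps) * np.exp(-np.arange(taps) / 8.0)
    return h

rng = np.random.default_rng(1)
n = 512
s = rng.normal(size=n)                 # source signal s(k)
h1 = toy_impulse_response(rng, n)      # stands in for h_1(r_s, k)
h2 = toy_impulse_response(rng, n)      # stands in for h_2(r_s, k)
x1 = circular_convolve(h1, s)
x2 = circular_convolve(h2, s)

r_x = gcc_phat_circular(x1, x2)        # GCC of the microphone signals
r_h = gcc_phat_circular(h1, h2)        # GCC of the impulse responses
max_diff = np.max(np.abs(r_x - r_h))
```

Under these assumptions `max_diff` is at the level of floating-point rounding, because the factor |S(ω)|² in the numerator is cancelled by the PHAT normalization. This is why the GCC-PHAT function depends on the source position through the impulse responses rather than on the source signal.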

In practice, however, the length L of the room impulse response is much greater than the length of the window function, so the microphone signals can only be approximated in the frequency domain as:

X_n(\omega) \approx H_n(r_s, \omega) \, S(\omega), \quad n = 1, 2 \qquad (1.6)

Consequently, the GCC between the received signals x₁(k) and x₂(k) is only approximately equal to the GCC between the room impulse responses h₁(r_s, k) and h₂(r_s, k), that is:

R_{x_1 x_2}(\tau) \approx \int_{-\infty}^{\infty} \frac{H_1(r_s, \omega) H_2^{*}(r_s, \omega)}{\left| H_1(r_s, \omega) H_2^{*}(r_s, \omega) \right|} \, e^{j\omega\tau} \, d\omega = R_{h_1 h_2}(r_s, \tau) \qquad (1.7)

The features of the cross-correlation function are obtained in this way.

The present invention can be applied to speaker recognition, and to localization of the speaker, under reverberation and noise.

Beneficial effects

1. The present invention uses the features of the cross-correlation function together with PLDA modeling; using the probability distribution function of the i-vectors in the PLDA model improves the effectiveness of the PLDA model. Compared with conventional sound source localization algorithms, it lowers the error rate and improves localization accuracy, and it addresses the noise and reverberation problems of conventional sound source localization.

2. The present invention combines the feature information of the cross-correlation function of the sound source with the PLDA algorithm, and is suitable for all situations with strong noise and reverberation.

3. The present invention extracts cross-correlation features of the sound source, so data acquisition is simple and convenient, and the localization performance is good.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Fig. 2 is a schematic diagram of the equal error rate (EER) analysis for different speakers under the i-vector model.

Fig. 3 is a schematic diagram of the score analysis for different test data under the i-vector model at a signal-to-noise ratio of 10 dB.

Fig. 4 is a schematic diagram of the score analysis for different test data under the i-vector model at a signal-to-noise ratio of 20 dB.

Detailed description of the invention

The invention is described in further detail below with reference to the accompanying drawings.

As shown in Fig. 1, the present invention is a sound source localization algorithm based on i-vector speaker recognition. PLDA is a channel compensation algorithm that operates on i-vector features: because an i-vector contains both speaker information and channel information, and only the speaker information is of interest, channel compensation is needed. Four aspects are described in detail below: source feature selection, probabilistic linear discriminant analysis, model training, and scoring.

The specific implementation steps of the present invention are as follows:

Step 1: Using the Roomsim simulation environment, an environment with reverberation and noise is simulated, and the cross-correlation features of the source signal are computed. The features are processed by dimensionality reduction, speech detection, and so on, and are divided into a training set and a test set in preparation for the model training in the next step.

Step 2: Extract the i-vectors. In the PLDA framework, the generation of an i-vector can be described with a hidden variable. Different numbers of hidden variables and different prior assumptions give different PLDA models. Let the j-th i-vector of the i-th speaker be denoted w_ij; the commonly used PLDA model assumes:

w_{ij} = \mu + V y_i + z_{ij}

Here μ is the mean of all training data, the matrix V represents the speaker subspace (the eigenvoice matrix), the vector y_i is the corresponding speaker factor, which follows a standard Gaussian distribution, and z_ij is the residual, which is represented by a full matrix D.
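To make the generative reading of this model concrete, the sketch below draws synthetic i-vectors from it. It is an illustration only: the dimensions, the random parameter values, and the function name `sample_ivectors` are hypothetical, and D is interpreted here as a full covariance matrix of a zero-mean Gaussian residual, which is an assumption about the source's wording.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_factors = 10, 3                           # hypothetical sizes

mu = rng.normal(size=dim)                        # global mean
V = rng.normal(size=(dim, n_factors))            # speaker subspace (eigenvoice matrix)
A = rng.normal(size=(dim, dim))
D = 0.1 * (A @ A.T) / dim + 0.1 * np.eye(dim)    # full, positive-definite residual covariance

def sample_ivectors(n_speakers, n_per_speaker):
    """Draw w_ij = mu + V y_i + z_ij with y_i ~ N(0, I) and z_ij ~ N(0, D)."""
    y = rng.normal(size=(n_speakers, n_factors))             # one factor per speaker
    z = rng.multivariate_normal(np.zeros(dim), D,
                                size=(n_speakers, n_per_speaker))
    return mu + (y @ V.T)[:, None, :] + z                    # shape (speakers, sessions, dim)

w = sample_ivectors(2000, 4)

# Under these assumptions the marginal covariance of an i-vector is V V^T + D.
empirical_cov = np.cov(w.reshape(-1, dim), rowvar=False)
model_cov = V @ V.T + D
rel_err = np.linalg.norm(empirical_cov - model_cov) / np.linalg.norm(model_cov)
```

All i-vectors of one speaker share the same y_i, which is what makes them correlated with each other; in the localization setting each training position plays the role of a speaker.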

Step 3: Apply PLDA. The model parameters λ = (μ, V, D) are estimated on the labeled data set with the expectation-maximization (EM) method; the model is initialized with random values.

Step 4: After the model parameters have been estimated, the log-likelihood ratio of two given i-vectors w₁ and w₂ is computed by the formula below, where the hypothesis θ_tar states that they come from the same speaker and θ_non states that they come from different speakers. The score is computed as the log-likelihood ratio:

\mathrm{score} = \log \frac{p(w_1, w_2 \mid \theta_{\mathrm{tar}})}{p(w_1, w_2 \mid \theta_{\mathrm{non}})}
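For a Gaussian PLDA model of the form w_ij = μ + V y_i + z_ij, this score has a closed form, sketched below. The sketch assumes that z_ij is zero-mean Gaussian with covariance D, so that a same-speaker pair has cross-covariance V Vᵀ and a different-speaker pair has none; the function names and the synthetic parameters are hypothetical, and the patent does not spell out its scoring computation.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a multivariate Gaussian N(mean, cov) at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + quad)

def plda_llr(w1, w2, mu, V, D):
    """log p(w1, w2 | same speaker) - log p(w1, w2 | different speakers)."""
    between = V @ V.T                  # covariance shared within one speaker
    total = between + D                # marginal covariance of one i-vector
    zero = np.zeros_like(between)
    x = np.concatenate([w1, w2])
    mean = np.concatenate([mu, mu])
    cov_tar = np.block([[total, between], [between, total]])
    cov_non = np.block([[total, zero], [zero, total]])
    return log_gauss(x, mean, cov_tar) - log_gauss(x, mean, cov_non)

# Synthetic check with hypothetical parameters.
rng = np.random.default_rng(0)
dim, n_factors = 6, 2
mu = rng.normal(size=dim)
V = rng.normal(size=(dim, n_factors))
D = 0.2 * np.eye(dim)

def draw(y):
    """One synthetic i-vector for speaker factor y."""
    return mu + V @ y + rng.multivariate_normal(np.zeros(dim), D)

tar_scores, non_scores = [], []
for _ in range(200):
    ya, yb = rng.normal(size=n_factors), rng.normal(size=n_factors)
    tar_scores.append(plda_llr(draw(ya), draw(ya), mu, V, D))   # same speaker
    non_scores.append(plda_llr(draw(ya), draw(yb), mu, V, D))   # different speakers
```

On average, same-speaker pairs should score above zero and different-speaker pairs below zero, which is the property a decision threshold relies on.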

Tests were carried out both without noise and with noise, with the signal-to-noise ratio gradually reduced in the noisy case. The experiments show that the method retains good localization performance even in the presence of noise and reverberation.

The i-vector-based sound source localization algorithm of the present invention is verified below through several experiments. The experimental parameters were chosen as follows:

(1) The simulation data set is generated with Roomsim, a simulation code for reverberation in a rectangular room in which the positions of the source and the receivers can be set. The room measures 7 m × 6 m × 3 m, and the relation between the reverberation time (T₆₀) and the reflection coefficient (β) is given by the Eyring formula:

\beta = \exp\left( -13.82 / \left[ c \left( L_x^{-1} + L_y^{-1} + L_z^{-1} \right) T_{60} \right] \right)
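As a worked example, the formula can be evaluated for the 7 m × 6 m × 3 m room. The speed of sound c = 343 m/s and the reverberation time T₆₀ = 0.3 s used below are assumptions chosen for illustration; the source does not state the values it used, and the function name is hypothetical.

```python
import math

def reflection_coefficient(t60, lx, ly, lz, c=343.0):
    """Reflection coefficient beta from the Eyring-type relation above.

    t60: reverberation time in seconds; lx, ly, lz: room dimensions in meters;
    c: speed of sound in m/s (assumed value, not given in the source).
    """
    return math.exp(-13.82 / (c * (1.0 / lx + 1.0 / ly + 1.0 / lz) * t60))

beta = reflection_coefficient(0.3, 7.0, 6.0, 3.0)          # about 0.81 for these inputs
beta_longer = reflection_coefficient(0.6, 7.0, 6.0, 3.0)   # longer T60 gives a larger beta
```

A longer reverberation time corresponds to more reflective walls, so β moves toward 1 as T₆₀ grows.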

The whole data set is divided into a training set and a test set in a ratio of 8:2; the training set is used as input to the algorithm, and the test set is used to evaluate the performance of the improved algorithm.
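An 8:2 split of this kind might look as follows. The random shuffling, the fixed seed, and the function name `split_80_20` are assumptions of this sketch; the source does not say how the partition was drawn.

```python
import numpy as np

def split_80_20(features, labels, seed=0):
    """Shuffle the examples and split them 8:2 into training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    cut = (len(features) * 8) // 10            # first 80 % go to training
    train, test = order[:cut], order[cut:]
    return features[train], labels[train], features[test], labels[test]

# Toy data: 100 feature vectors of dimension 41, with 5 position labels.
features = np.arange(100 * 41, dtype=float).reshape(100, 41)
labels = np.arange(100) % 5
x_train, y_train, x_test, y_test = split_80_20(features, labels)
```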

(2) The sound source localization system uses the PLDA algorithm with parameters μ, V, y_i, and z_ij. μ is the mean of all training data, the matrix V represents the speaker subspace (the eigenvoice matrix), the vector y_i is the corresponding speaker factor, which follows a standard Gaussian distribution, and z_ij is the residual, which is represented by a full matrix D.

(3) The parameter matrix T of the i-vector model uses a single space in place of two spaces. In conventional methods, the two spaces are the speaker space, defined by the eigenvoice matrix, and the channel space, defined by the eigenchannel matrix. The new single space contains both the differences between speakers and the differences between channels.

Experiment 1: Equal error rate results for sound source localization with the i-vector model in a noise-free environment

Fig. 2 shows sound source localization for five people in a noise-free environment. Model denotes the trained models and Test denotes the test models. Each row is matched against each column; the darker the color, the higher the score. The lower the equal error rate (EER), the better the performance. As Fig. 2 shows, the EER of the algorithm in the noise-free environment is 0, so the localization performance of the model is very good.
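For readers unfamiliar with the metric, the equal error rate is the operating point at which the miss rate equals the false-alarm rate. The sketch below is one simple way to estimate it from two lists of scores; the threshold sweep, the function name, and the example scores are assumptions of this illustration, not the patent's evaluation code.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping the threshold over all observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss rate: targets scored below the threshold.
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False-alarm rate: non-targets scored at or above the threshold.
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - fa))           # point where the two rates are closest
    return 0.5 * (miss[i] + fa[i])

eer_separated = equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])   # fully separated scores
eer_overlap = equal_error_rate([0.0, 1.0], [0.0, 1.0])                # identical score sets
```

An EER of 0, as reported for Fig. 2, means that some threshold separates all matching model/test pairs from all non-matching ones.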

Experiment 2: Equal error rate results for sound source localization with the i-vector model at a signal-to-noise ratio of 15 dB

Fig. 3 shows the equal error rate results at a signal-to-noise ratio of 10 dB. As in Experiment 1, the EER at 15 dB remains 0, and the localization performance is very good.

Experiment 3: Equal error rate results for sound source localization with the i-vector model at a signal-to-noise ratio of 20 dB

Fig. 4 shows the equal error rate results at a signal-to-noise ratio of 20 dB. As in Experiment 1, the EER at 15 dB remains 0. It can therefore be concluded that the sound source localization algorithm based on i-vector speaker recognition has good localization performance.

Those skilled in the art can readily conceive of other advantages and variations from the above embodiments. The present invention is therefore not limited to the above embodiments, which merely describe one form of the invention in detail by way of example. Without departing from the spirit of the invention, technical solutions obtained by those skilled in the art through various equivalent substitutions based on the above examples shall fall within the scope of the claims of the present invention and their equivalents.

Claims

1. A sound source localization method based on i-vector speaker recognition, characterized in that the method comprises the following steps:

Step 1: The sound source is placed at each training position r_i, i = 1, 2, ..., K, and the microphone array records the signal emitted by the source at that position;

Step 3: A feature vector y is generated from the cross-correlation function;

Step 4: For each training position r_i, the feature vectors are used to compute the mean vector μ of the cross-correlation-function PLDA model, the fixed-dimension speaker subspace, and the residual ε_ij;

Step 5: The microphone array records a signal that contains the signal emitted by the source and noise;

Step 6: The cross-correlation function is computed from the recorded signal;

Step 7: A feature vector y is generated from the cross-correlation function; if there are N frames of data, a set of feature vectors y is generated;

Step 8: The features are tested with the PLDA model to estimate the position of the sound source.

2. The sound source localization algorithm based on i-vector speaker recognition according to claim 1, characterized in that, in step 2, the feature attributes are assigned different weights.

3. The sound source localization algorithm based on i-vector speaker recognition according to claim 1, characterized in that, in step 3, the sound source position feature values are computed from the item feature attributes, the computation comprising:

Step 3-1: For the selection of cross-correlation features, the room impulse response simulation tool Roomsim is used to simulate a real acoustic environment, and the generalized cross-correlation function between the signals can be computed in the frequency domain;

Step 3-2: To make the cross-correlation function more robust to reverberation, the phase transform weighting function can be used;

Step 3-3: In practice, the time-domain microphone signals are first windowed, and the frequency-domain signals are then obtained by Fourier transform; if the length of the room impulse response is much shorter than the length of the window function, the GCC between the signals received by the microphone array equals the GCC of the room impulse responses.

4. The sound source localization algorithm based on i-vector speaker recognition according to claim 1, characterized in that the method is applied to all sound source localization systems for items with feature attributes.