CN106019230B

CN106019230B - Sound source localization method based on i-vector speaker recognition

Info

Publication number: CN106019230B
Application number: CN201610365659.6A
Authority: CN
Inventors: 万新旺; 顾晓瑜; 杨悦; 廖鹏程
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2016-05-27
Filing date: 2016-05-27
Publication date: 2019-01-08
Anticipated expiration: 2036-05-27
Also published as: CN106019230A

Abstract

The invention discloses a sound source localization method based on i-vector speaker recognition. The method introduces discriminative features of the cross-correlation function to obtain a discriminative cross-correlation function. These features are divided into a training set and a test set and are used to train and test the models of an i-vector speaker recognition system. The expectation-maximization (EM) algorithm is used to obtain the maximum-likelihood estimate of the probability distribution function of the development-set i-vectors, and a PLDA model constrained by speech duration is established. Speech recognition and sound source localization can then be carried out accurately, which effectively addresses the problems that noise and reverberation cause in traditional sound source localization.

Description

Sound source localization method based on i-vector speaker recognition

Technical field

The present invention relates to a sound source localization method based on i-vector speaker recognition and belongs to the field of Internet information technology.

Background technique

Speaker recognition, a form of biometric identification, is a technique that automatically identifies a speaker from speech parameters in the speech waveform that reflect the speaker's physiological and behavioural characteristics. It is an important branch of the recognition of individual human characteristics. With the continuing development of information technology, speaker recognition offers advantages over other biometric technologies: it is simpler to use, economical, and scales well, and it can be widely applied to database access, security verification, telephone banking, remote computer login, and similar fields. As an important biometric identity technology with broad applications, it has attracted many researchers at home and abroad. In recent years, speaker modeling based on the identity vector (i-vector) has achieved great success and has greatly improved the performance of speaker recognition systems. Subspace modeling based on the i-vector has proved to be among the most effective state-of-the-art speaker modeling techniques.

With the rapid development of computer technology and the information industry, sound source localization has become an active research topic. Determining the position of a sound source in space has broad application prospects in many areas of production and daily life. Sound source localization locates an object by measuring the sound it emits. It differs from localization by sonar, radar, or wireless communication: in sound source localization the signal is ordinary sound, which is wideband, whereas the sources in those systems are narrowband. Various localization algorithms have been proposed according to the characteristics of sound signals, but noise and reverberation limit the accuracy of existing algorithms.

Current sound source localization algorithms fall broadly into three classes: algorithms based on high-resolution spectral estimation, algorithms based on time delay estimation (TDE), and algorithms based on steered beamforming.

(1) Localization algorithms based on high-resolution spectral estimation

High-resolution spectral estimation methods are mainly of four kinds: ARMA spectral estimation, minimum-variance spectral estimation, entropy spectral estimation, and subspace methods. ARMA spectral estimation estimates the power spectral density by fitting a model to a stationary linear signal process. Entropy spectral estimation includes the maximum entropy method and the minimum cross-entropy method. Subspace methods include the Pisarenko harmonic decomposition, the Prony method, multiple signal classification (MUSIC), and estimation of signal parameters via rotational invariance techniques (ESPRIT). Localization algorithms of this class rely on the covariance matrix of the received signals, which is unknown in practice and must be estimated from the observed data. Estimating it requires assuming that the source and the noise are statistically stationary and that the parameter to be estimated (the source position) is fixed, so that averages can be taken over a time interval. Speech is only short-term stationary and often fails to meet this condition. Moreover, the vast majority of current methods are designed for far-field narrowband signals, and indoor reverberation severely degrades their performance.

(2) Localization algorithms based on time delay estimation

Algorithms based on time delay estimation work in two steps. The first step is time delay estimation: computing the difference in arrival time of the source signal between each pair of microphones. The second step is position estimation: the source position is estimated from the time delays and the geometry of the microphone array. Of the two, time delay estimation (TDE) is the more critical. The generalized cross-correlation (GCC) method estimates the time difference of arrival (TDOA) by computing the cross-correlation function between the signals received by different microphones. In real environments, however, noise and reverberation weaken the main peak of the correlation function and make peak detection difficult. GCC therefore weights the cross-power spectrum of the two microphone signals so that the peak of the correlation function at the time delay becomes more prominent. Knapp lists five common weighting functions, of which the maximum-likelihood weighting (GCC-ML) and the phase transform (PHAT) weighting (GCC-PHAT) are the most typical. Low computational complexity and ease of implementation have made GCC methods widely used.

(3) Localization algorithms based on steered beamforming

Localization algorithms based on steered beamforming were first used for target localization in radar and sonar systems and were later introduced into microphone array signal processing. In speech signal processing, microphone array beamforming has two main applications: 1) speech enhancement; 2) sound source localization. When the source position is known, the steering delay of each microphone can be adjusted so that the microphone signals are aligned in time, which steers the array toward the source position. The microphone signals are then summed to suppress noise and enhance the signal. This simplest and most practical beamformer is called the delay-and-sum beamformer.
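The delay-and-sum idea described above can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not in the source: delays are whole numbers of samples, the shift is circular (`np.roll`), and the function name `delay_and_sum` and the synthetic signals are hypothetical.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Align and average microphone channels.

    signals: array of shape (M, T), one row per microphone.
    delays:  arrival delay of the source at each microphone, in samples.
    """
    signals = np.asarray(signals, dtype=float)
    out = np.zeros(signals.shape[1])
    for x, d in zip(signals, delays):
        # Advance each channel by its own delay so all channels line up.
        # np.roll is circular, which is acceptable for this toy example only.
        out += np.roll(x, -d)
    return out / len(signals)

# Synthetic check: one source, four microphones with different delays and noise.
rng = np.random.default_rng(0)
s = rng.standard_normal(4000)
delays = [0, 3, 7, 12]
mics = np.stack([np.roll(s, d) + 0.5 * rng.standard_normal(s.size) for d in delays])
enhanced = delay_and_sum(mics, delays)

print("single-mic noise power:", np.mean((mics[0] - s) ** 2))
print("beamformer noise power:", np.mean((enhanced - s) ** 2))
```

Because the noise is independent across channels while the aligned source adds coherently, the residual noise power after averaging is lower than at any single microphone.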

Traditional algorithms are severely limited in strongly reverberant environments. For example, steered beamforming based on maximum output power is sensitive to the external environment and to the frequency content of the source, which limits its application; localization based on high-resolution spectral estimation is computationally expensive and unsuited to short-range localization; and the delay accuracy of localization based on time delay estimation is easily degraded by reverberation and noise.

Summary of the invention

The object of the present invention is to overcome the above deficiencies of the prior art by proposing a sound source localization algorithm based on i-vector speaker recognition. The method introduces discriminative features of the cross-correlation function to obtain a discriminative cross-correlation function, divides these features into a training set and a test set, and uses them to train and test the models of an i-vector speaker recognition system. The expectation-maximization (EM) algorithm is used to obtain the maximum-likelihood estimate of the probability distribution function of the development-set i-vectors, and a PLDA model constrained by speech duration is established. Speech recognition and sound source localization can then be carried out accurately, which effectively addresses the problems of noise and reverberation in traditional sound source localization.

The technical solution adopted by the invention to solve the technical problem is a sound source localization algorithm based on i-vector speaker recognition, comprising a training stage and a positioning stage.

The steps of the training stage are as follows:

Step 1: the sound source is placed at each training position r_i, i = 1, 2, ..., K, and the microphone array records the signal emitted by the source at that position (the reverberant signal); K is the number of training positions of the sound source;

Step 2: the cross-correlation function is computed from the recorded reverberant signal;

Step 3: a feature vector y is generated from the cross-correlation function;

Step 4: for each training position r_i, the feature vectors are used to compute the mean vector μ of the cross-correlation-function PLDA model, the fixed-dimension speaker subspace, and the residual ε_ij.

The steps of the positioning stage are as follows:

Step 1: the microphone array records a signal, which contains the signal emitted by the sound source (the reverberant signal) and noise;

Step 2: the cross-correlation function is computed from the recorded signal;

Step 3: a feature vector y is generated from the cross-correlation function; if there are N frames of data, a set of feature vectors y = {y^t, t = 1, ..., N} is generated;

Step 4: the features are tested against the PLDA model to estimate the position of the sound source.
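Step 3 of the positioning stage produces one feature vector per frame. The source does not state the frame length, hop size, or window, so the following is only a sketch of how a recorded channel might be split into N windowed frames before the per-frame cross-correlation is computed. The function name `frame_signal`, the Hann window, and the values 512 and 256 are assumptions.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping, Hann-windowed frames.

    Returns an array of shape (N, frame_len); requires len(x) >= frame_len.
    Trailing samples that do not fill a whole frame are dropped.
    """
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row t holds the sample indices of frame t.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hanning(frame_len)

# One second of a hypothetical 16 kHz channel, 512-sample frames, 50% overlap.
x = np.random.default_rng(0).standard_normal(16000)
frames = frame_signal(x, frame_len=512, hop=256)
print(frames.shape)  # (N, 512): one row per frame t = 1, ..., N
```

Applying the same framing to both microphone channels gives frame pairs from which a feature vector y^t can be computed for each t.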

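In addition, for the selection of cross-correlation function features, the room impulse response simulation tool Roomsim is used to simulate a real acoustic environment. The generalized cross-correlation (GCC) function between the signals x₁(k) and x₂(k) can be computed in the frequency domain:

$$R_{x_1x_2}(\tau)=\int_{-\infty}^{+\infty}\Psi_{1,2}(\omega)\,X_1(\omega)\,X_2^{*}(\omega)\,e^{j\omega\tau}\,d\omega \tag{1.1}$$

where the superscript "*" denotes the complex conjugate, X₁(ω) and X₂(ω) are the Fourier transforms of x₁(k) and x₂(k), and Ψ_1,2(ω) is the weighting function.

To make the cross-correlation function more robust to reverberation, the phase transform (PHAT) weighting function can be used:

$$\Psi_{1,2}(\omega)=\frac{1}{\left|X_1(\omega)\,X_2^{*}(\omega)\right|} \tag{1.2}$$

Substituting formula (1.2) into formula (1.1) gives:

$$R_{x_1x_2}(\tau)=\int_{-\infty}^{+\infty}\frac{X_1(\omega)\,X_2^{*}(\omega)}{\left|X_1(\omega)\,X_2^{*}(\omega)\right|}\,e^{j\omega\tau}\,d\omega \tag{1.3}$$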

In practice, the microphone signals x₁(k) and x₂(k) are windowed before X₁(ω) and X₂(ω) are obtained by the Fourier transform. If the length L of the room impulse response is much shorter than the window length, the microphone signals can be expressed in the frequency domain as:

$$X_n(\omega)=H_n(r_s,\omega)\,S(\omega),\quad n=1,2 \tag{1.4}$$

where S(ω) and H_n(r_s, ω) are the Fourier transforms of s(k) and h_n(r_s, k), respectively.

Substituting formula (1.4) into formula (1.3), the factor |S(ω)|² cancels between numerator and denominator, which gives:

$$R_{x_1x_2}(\tau)=\int_{-\infty}^{+\infty}\frac{H_1(r_s,\omega)\,H_2^{*}(r_s,\omega)}{\left|H_1(r_s,\omega)\,H_2^{*}(r_s,\omega)\right|}\,e^{j\omega\tau}\,d\omega \tag{1.5}$$

Formula (1.5) shows that the GCC between the signals x₁(k) and x₂(k) received by the microphone array equals the GCC between the room impulse responses h₁(r_s, k) and h₂(r_s, k).

In practice, however, the length L of the room impulse response is much greater than the window length, so the microphone signals can only be approximated in the frequency domain as:

$$X_n(\omega)\approx H_n(r_s,\omega)\,S(\omega),\quad n=1,2 \tag{1.6}$$

Consequently, the GCC between the signals x₁(k) and x₂(k) received by the microphone array is only approximately equal to the GCC between the room impulse responses h₁(r_s, k) and h₂(r_s, k), that is:

$$R_{x_1x_2}(\tau)\approx\int_{-\infty}^{+\infty}\frac{H_1(r_s,\omega)\,H_2^{*}(r_s,\omega)}{\left|H_1(r_s,\omega)\,H_2^{*}(r_s,\omega)\right|}\,e^{j\omega\tau}\,d\omega \tag{1.7}$$

The features of the cross-correlation function are obtained in this way.
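As an illustration of the PHAT-weighted GCC derived above, the following sketch computes it with the FFT for a single pair of signals. It is not the patent's implementation: the function name `gcc_phat`, the zero-padding, the `eps` regularizer, the lag range, and the synthetic test signal are all assumptions. Taking the lag-limited output as the feature vector y is one plausible choice that the source does not confirm.

```python
import numpy as np

def gcc_phat(x1, x2, max_lag, eps=1e-12):
    """PHAT-weighted generalized cross-correlation for lags -max_lag..+max_lag.

    max_lag must be at least 1. A peak at a positive lag d means that
    x1 is delayed by d samples relative to x2.
    """
    n = len(x1) + len(x2)              # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)           # cross-power spectrum X1(w) X2*(w)
    cross /= np.abs(cross) + eps       # PHAT weighting: keep phase, drop magnitude
    r = np.fft.irfft(cross, n=n)
    # Negative lags sit at the end of the inverse FFT output; reorder them.
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, r

# Synthetic check: x1 is x2 delayed by 5 samples.
rng = np.random.default_rng(0)
fs = 16000                             # assumed sampling rate in Hz
s = rng.standard_normal(2048)
d = 5
x2 = s
x1 = np.concatenate((np.zeros(d), s[:-d]))
lags, r = gcc_phat(x1, x2, max_lag=20)
tdoa = lags[np.argmax(r)] / fs         # time difference of arrival in seconds
print(lags[np.argmax(r)], tdoa)
```

Because the PHAT weighting discards the magnitude of the cross-power spectrum, the result depends only on the phase difference between the two channels, which is why the source spectrum cancels in formula (1.5).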

The present invention can be applied to speaker recognition, and to localization of the speaker as a sound source, under reverberation and noise.

Beneficial effects

1. The invention uses features of the cross-correlation function together with PLDA modeling; exploiting the probability distribution function of the i-vectors in the PLDA model improves the effectiveness of the PLDA model. Compared with traditional sound source localization algorithms, it reduces the error rate and improves localization accuracy, and it effectively addresses the problems of noise and reverberation in traditional sound source localization.

2. The invention combines feature information from the cross-correlation function of the sound source with the PLDA algorithm and is applicable to all cases with strong noise and reverberation.

3. Because the invention extracts cross-correlation function features of the sound source, data acquisition is simple and convenient, and the localization performance is good.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Fig. 2 is a schematic diagram of the equal error rate (EER) analysis for different speakers under the i-vector model.

Fig. 3 is a schematic diagram of the score analysis for different test data under the i-vector model at a signal-to-noise ratio of 10 dB.

Fig. 4 is a schematic diagram of the score analysis for different test data under the i-vector model at a signal-to-noise ratio of 20 dB.

Specific embodiment

The invention is described in further detail below with reference to the accompanying drawings.

As shown in Fig. 1, the present invention is a sound source localization algorithm based on i-vector speaker recognition. PLDA is a channel compensation algorithm that operates on i-vector features: because an i-vector contains both speaker information and channel information, and only the speaker information is of interest, channel compensation is needed. Four aspects are described in detail below: selection of sound source features, probabilistic linear discriminant analysis, model training, and scoring.

The specific implementation steps of the invention are as follows:

Step 1: using the Roomsim simulation environment, an environment with reverberation and noise is simulated. The cross-correlation function features of the source signals are computed and processed (dimensionality reduction, speech detection, and so on), then divided into a training set and a test set in preparation for model training in the next step.

Step 2: extract the i-vectors. In the PLDA framework, the generation of an i-vector can be described with hidden variables. Different numbers of hidden variables and different prior assumptions give different PLDA models. Let the j-th i-vector of the i-th speaker be denoted w_ij; a common PLDA model assumes:

$$w_{ij}=\mu+V\,y_i+z_{ij}$$

where μ is the mean of all training data, the matrix V spans the speaker space (the eigenvoice matrix), the vector y_i is the corresponding speaker factor and follows a standard Gaussian distribution, and z_ij is the residual, whose covariance is described by a full matrix D.

Step 3: apply PLDA. The model parameters λ = (μ, V, D) are estimated on a labelled data set with the expectation-maximization (EM) algorithm; the initial model uses random values.

Step 4: once the model parameters have been estimated, the log-likelihood ratio of two given i-vectors w₁ and w₂ is computed, where the hypothesis θ_tar states that they come from the same speaker and θ_non that they come from different speakers. The log-likelihood-ratio score is:

$$\mathrm{score}(w_1,w_2)=\log\frac{p(w_1,w_2\mid\theta_{\mathrm{tar}})}{p(w_1\mid\theta_{\mathrm{non}})\,p(w_2\mid\theta_{\mathrm{non}})}$$
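Steps 2 to 4 above can be sketched end to end on synthetic data. This is a minimal illustration, not the patent's implementation: it assumes the Gaussian model w_ij = μ + V y_i + z_ij with a full residual covariance D, fixes μ to the global mean instead of re-estimating it, and uses textbook factor-analysis style EM updates. The names `plda_em`, `gauss_logpdf`, and `plda_llr`, and all sizes and seeds, are hypothetical.

```python
import numpy as np

def plda_em(groups, rank, n_iter=30, seed=0):
    """Estimate (mu, V, D) by EM. groups: list of (n_i, dim) arrays, one per class."""
    rng = np.random.default_rng(seed)
    all_w = np.vstack(groups)
    n_total, dim = all_w.shape
    mu = all_w.mean(axis=0)                     # simplification: global mean, kept fixed
    centered = [g - mu for g in groups]
    V = rng.standard_normal((dim, rank))        # random initial model, as in Step 3
    D = np.cov(all_w, rowvar=False) + 1e-6 * np.eye(dim)
    S = sum(x.T @ x for x in centered)          # total scatter of the centered data
    for _ in range(n_iter):
        DinvV = np.linalg.solve(D, V)           # D^{-1} V
        VtDinvV = V.T @ DinvV
        Eyx = np.zeros((rank, dim))             # sum_i E[y_i] f_i^T
        Eyy = np.zeros((rank, rank))            # sum_i n_i E[y_i y_i^T]
        for x in centered:                      # E-step: posterior of each shared factor
            n_i = x.shape[0]
            f = x.sum(axis=0)
            cov = np.linalg.inv(np.eye(rank) + n_i * VtDinvV)
            mean = cov @ (DinvV.T @ f)
            Eyx += np.outer(mean, f)
            Eyy += n_i * (cov + np.outer(mean, mean))
        V = np.linalg.solve(Eyy, Eyx).T         # M-step for V
        D = (S - V @ Eyx) / n_total             # M-step for D
        D = 0.5 * (D + D.T)                     # remove numerical asymmetry
    return mu, V, D

def gauss_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def plda_llr(w1, w2, mu, V, D):
    """log p(w1, w2 | same class) - log p(w1 | diff) - log p(w2 | diff)."""
    B = V @ V.T                                 # between-class covariance
    T = B + D                                   # marginal covariance of one vector
    joint_cov = np.block([[T, B], [B, T]])      # same class: shared factor couples w1, w2
    tar = gauss_logpdf(np.concatenate((w1, w2)), np.concatenate((mu, mu)), joint_cov)
    non = gauss_logpdf(w1, mu, T) + gauss_logpdf(w2, mu, T)
    return tar - non

# Synthetic data: 40 classes, 10 vectors each, drawn from the generative model.
rng = np.random.default_rng(1)
dim, rank = 5, 2
true_V = 3.0 * rng.standard_normal((dim, rank))
groups = [rng.standard_normal(rank) @ true_V.T + 0.3 * rng.standard_normal((10, dim))
          for _ in range(40)]
mu, V, D = plda_em(groups, rank)
print("same class:", plda_llr(groups[0][0], groups[0][1], mu, V, D))
print("diff class:", plda_llr(groups[0][0], groups[1][0], mu, V, D))
```

In the localization setting of this document, a class would correspond to a training position rather than a speaker, and a higher score would indicate that a test vector matches that position's training data.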

Tests were carried out both without noise and with noise, with the signal-to-noise ratio gradually reduced in the noisy case. The tests show that the method localizes well even in the presence of noise and reverberation.

The i-vector-based sound source localization algorithm of the invention is verified by comparison under different conditions below. The experimental parameters are chosen as follows:

(1) The simulated data set is generated with Roomsim, a simulator of reverberation in rectangular rooms in which the positions of the sound source and the receiver can be set. The room measures 7 m × 6 m × 3 m, and the relationship between the reverberation time (T₆₀) and the reflection coefficient (β) is determined by the Eyring formula.

The whole data set is divided into a training set and a test set in a ratio of 8:2. The training set serves as input to the algorithm, and the test set is used to test the performance of the improved algorithm.

(2) The sound source localization system uses the PLDA algorithm, with parameters μ, V, y_i, and z_ij. μ is the mean of all training data, the matrix V spans the speaker space (the eigenvoice matrix), the vector y_i is the corresponding speaker factor and follows a standard Gaussian distribution, and z_ij is the residual, whose covariance is described by a full matrix D.

(3) The parameter matrix T of the i-vector model uses a single space in place of two spaces. In the traditional approach, the two spaces are the speaker space, defined by the eigenvoice matrix, and the channel space, defined by the eigenchannel matrix. The new single space contains both the differences between speakers and the differences between channels.
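For the Eyring relation mentioned in item (1), the sketch below uses the standard textbook form of the formula, which may differ in notation from the one in the original patent. It assumes that all six walls share the same reflection coefficient β, that the energy absorption coefficient is α = 1 − β², and that the speed of sound is 343 m/s. The function names are hypothetical.

```python
import math

def eyring_t60(dims, beta, c=343.0):
    """Reverberation time (s) of a rectangular room with uniform reflection coefficient beta."""
    lx, ly, lz = dims
    volume = lx * ly * lz
    surface = 2.0 * (lx * ly + lx * lz + ly * lz)
    alpha = 1.0 - beta ** 2                      # energy absorption coefficient
    # Eyring: T60 = 24 ln(10) V / (-c S ln(1 - alpha))
    return 24.0 * math.log(10.0) * volume / (-c * surface * math.log(1.0 - alpha))

def beta_from_t60(dims, t60, c=343.0):
    """Inverse relation: reflection coefficient that yields the requested T60."""
    lx, ly, lz = dims
    volume = lx * ly * lz
    surface = 2.0 * (lx * ly + lx * lz + ly * lz)
    return math.exp(-12.0 * math.log(10.0) * volume / (c * surface * t60))

room = (7.0, 6.0, 3.0)                           # room size given in item (1)
t60 = eyring_t60(room, beta=0.8)
print(t60, beta_from_t60(room, t60))
```

The inverse function is the form that is convenient when configuring a simulator: a target reverberation time is chosen first and the wall reflection coefficient follows from it.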

Experiment 1: equal-error-rate results for sound source localization with the i-vector model in a noise-free environment

Fig. 2 shows sound source localization of five people by the invention in a noise-free environment. Model denotes the trained models and Test denotes the models under test. Each row is matched against each column, and a darker colour indicates a higher score. A lower equal error rate (EER) indicates better performance. As Fig. 2 shows, the EER of the algorithm is 0 in the noise-free environment, so the model localizes very well.
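The equal error rate used in these experiments is the operating point at which the false acceptance rate equals the false rejection rate. The source does not say how it was computed; the sketch below uses a simple threshold sweep over the observed scores, and the function name `equal_error_rate` and the example scores are hypothetical.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping the threshold over all observed scores."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate((target, nontarget)))
    # False rejection: matching trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance: non-matching trials scored at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))             # point where the two rates are closest
    return 0.5 * (far[i] + frr[i])

# Perfectly separated scores give an EER of 0, as reported for Fig. 2.
separated = equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
# One overlapping score on each side out of four gives an EER of 0.25.
overlapping = equal_error_rate([0.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, 1.0])
print(separated, overlapping)
```

An EER of 0 therefore means that some threshold separates every matching Model/Test pair from every non-matching pair.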

Experiment 2: equal-error-rate results for sound source localization with the i-vector model at a signal-to-noise ratio of 15 dB

Fig. 3 shows the equal-error-rate results at a signal-to-noise ratio of 10 dB. As in Experiment 1, the EER is still 0 at 15 dB, and the localization performance is very good.

Experiment 3: equal-error-rate results for sound source localization with the i-vector model at a signal-to-noise ratio of 20 dB

Fig. 4 shows the equal-error-rate results at a signal-to-noise ratio of 20 dB. As in Experiment 1, the EER is still 0 at 20 dB. It can therefore be concluded that the sound source localization algorithm based on i-vector speaker recognition localizes well.

Those skilled in the art can readily conceive of other advantages and variations based on the above embodiments. The invention is therefore not limited to the above examples, which merely describe one form of the invention in detail by way of example. Technical solutions that those skilled in the art obtain through various equivalent substitutions based on the above specific examples, without departing from the inventive concept, shall fall within the scope of the claims of the invention and their equivalents.

Claims

1. A sound source localization method based on i-vector speaker recognition, characterized in that the method comprises the following steps:

Step 1: the sound source is placed at each training position r_i, i = 1, 2, ..., K, and the microphone array records the signal emitted by the source at that position; K is the number of training positions of the sound source;

Step 2: the cross-correlation function is computed from the recorded signal;

Step 3: a feature vector y is generated from the cross-correlation function;

Step 4: for each training position r_i, the feature vectors are used to compute the mean vector μ of the cross-correlation-function PLDA model, the fixed-dimension speaker subspace, and the residual ε_ij;

Step 5: the microphone array records a signal, which contains the signal emitted by the sound source and noise;

Step 6: the cross-correlation function is computed from the recorded signal;

Step 7: a feature vector y is generated from the cross-correlation function; if there are N frames of data, a set of feature vectors is generated;

Step 8: the feature vectors are tested against the PLDA model to estimate the position of the sound source;

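In addition, for the selection of cross-correlation function features, the room impulse response simulation tool Roomsim is used to simulate a real acoustic environment, and the generalized cross-correlation (GCC) function between the microphone signals x₁(k) and x₂(k) is computed in the frequency domain:

$$R_{x_1x_2}(\tau)=\int_{-\infty}^{+\infty}\Psi_{1,2}(\omega)\,X_1(\omega)\,X_2^{*}(\omega)\,e^{j\omega\tau}\,d\omega \tag{1.1}$$

where the superscript "*" denotes the complex conjugate, X₁(ω) is the Fourier transform of x₁(k), X₂(ω) is the Fourier transform of x₂(k), and Ψ_1,2(ω) is the weighting function;

To make the cross-correlation function more robust to reverberation, the phase transform (PHAT) weighting function is used:

$$\Psi_{1,2}(\omega)=\frac{1}{\left|X_1(\omega)\,X_2^{*}(\omega)\right|} \tag{1.2}$$

Substituting formula (1.2) into formula (1.1) gives:

$$R_{x_1x_2}(\tau)=\int_{-\infty}^{+\infty}\frac{X_1(\omega)\,X_2^{*}(\omega)}{\left|X_1(\omega)\,X_2^{*}(\omega)\right|}\,e^{j\omega\tau}\,d\omega \tag{1.3}$$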

In practice, the microphone signals x₁(k) and x₂(k) are windowed before X₁(ω) and X₂(ω) are obtained by the Fourier transform; if the length L of the room impulse response is much shorter than the window length, the microphone signals are expressed in the frequency domain as:

$$X_n(\omega)=H_n(r_s,\omega)\,S(\omega),\quad n=1,2 \tag{1.4}$$

where S(ω) and H_n(r_s, ω) are the Fourier transforms of s(k) and h_n(r_s, k), respectively, and s(k) is the signal of the sound source at position r_s;

Substituting formula (1.4) into formula (1.3) gives:

$$R_{x_1x_2}(\tau)=\int_{-\infty}^{+\infty}\frac{H_1(r_s,\omega)\,H_2^{*}(r_s,\omega)}{\left|H_1(r_s,\omega)\,H_2^{*}(r_s,\omega)\right|}\,e^{j\omega\tau}\,d\omega \tag{1.5}$$

By formula (1.5), the GCC between the microphone signals x₁(k) and x₂(k) received by the microphone array equals the GCC between the room impulse responses h₁(r_s, k) and h₂(r_s, k);

In practice, however, the length L of the room impulse response is much greater than the window length, so the microphone signals can only be approximated in the frequency domain as:

$$X_n(\omega)\approx H_n(r_s,\omega)\,S(\omega),\quad n=1,2 \tag{1.6}$$

Consequently, the GCC between the microphone signals x₁(k) and x₂(k) received by the microphone array is only approximately equal to the GCC between the room impulse responses h₁(r_s, k) and h₂(r_s, k), that is:

$$R_{x_1x_2}(\tau)\approx\int_{-\infty}^{+\infty}\frac{H_1(r_s,\omega)\,H_2^{*}(r_s,\omega)}{\left|H_1(r_s,\omega)\,H_2^{*}(r_s,\omega)\right|}\,e^{j\omega\tau}\,d\omega \tag{1.7}$$

The features of the cross-correlation function are thus obtained.