Disclosure of Invention
To solve the problems in the prior art, the invention provides a sound source localization method based on a Gaussian mixture model and spatial power spectrum features.
The method uses U microphones as array elements to form an array, with microphone index u, u = 1, 2, …, U. Sound source signals with known directions are collected; each source has a direction vector γ, whose azimuth is determined by the horizontal angle θ and the pitch angle of the source relative to the array. The time-domain signal of direction vector γ collected by the u-th array element is used as a training signal.
The training signal is divided into frames of length N, with sample index n, 0 ≤ n < N; the number of frames is L, with frame index l, l = 1, 2, …, L. Each frame of the training signal is windowed with a Hamming window function and substituted into the formula to obtain the l-th single-frame signal of direction vector γ collected by the u-th array element.
A discrete Fourier transform (DFT) of length K, with frequency-bin index k, is applied to each single-frame signal and substituted into the formula to obtain the frequency-domain signal corresponding to the l-th frame time-domain signal of direction vector γ collected by the u-th array element, thereby converting the training signal from the time domain to the frequency domain.
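The framing, Hamming windowing, and K-point DFT steps above can be sketched as follows; the function name `frame_and_transform` and the non-overlapping framing are illustrative assumptions, since the patent's exact formulas are not reproduced here.

```python
import numpy as np

def frame_and_transform(x, frame_len=512, num_frames=None, dft_len=1024):
    """Split a 1-D signal into non-overlapping frames, apply a Hamming
    window, and take a K-point DFT of each frame (K = dft_len)."""
    if num_frames is None:
        num_frames = len(x) // frame_len
    window = np.hamming(frame_len)
    frames = x[:num_frames * frame_len].reshape(num_frames, frame_len)
    windowed = frames * window
    # Zero-padded DFT of length K for each frame
    spectra = np.fft.fft(windowed, n=dft_len, axis=1)
    return spectra  # shape (L, K)
```

Applied per microphone channel, this yields the frequency-domain single-frame signals used in the later steps.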
Assume the sound source is in the far field and propagates from a known candidate position to the array. Let c be the speed of sound in air, r the far-field candidate position, r0 the position of the array centre, ru the position of the u-th array element, τ0(r) the sound propagation delay from the candidate position to the array centre, τu(r) the sound propagation delay from the candidate position to the u-th array element, and τu0(r) the steering delay from the candidate position to the u-th array element.
The steering delay of a far-field sound source is computed by treating the source position as equivalent to a source azimuth. The direction vector ξ of a far-field candidate position is determined by the horizontal angle α and the pitch angle φ of the candidate position relative to the array, ξ = [cos φ cos α, cos φ sin α, sin φ]T; substituting into the formula yields the steering delay τu0(r) of the far-field candidate position.
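A minimal sketch of the far-field steering delay, assuming the common plane-wave convention τu0(r) = −(ru − r0)·ξ / c; the sign convention and the function name are assumptions, not the patent's exact formula.

```python
import numpy as np

C = 342.0  # speed of sound in air, m/s (value used later in the text)

def steering_delay(r_u, r0, alpha, phi=0.0, c=C):
    """Far-field steering delay tau_u0 of the u-th element at position
    r_u relative to the array centre r0, for a candidate direction
    xi = [cos(phi)cos(alpha), cos(phi)sin(alpha), sin(phi)]^T.
    The plane-wave sign convention here is an assumption."""
    xi = np.array([np.cos(phi) * np.cos(alpha),
                   np.cos(phi) * np.sin(alpha),
                   np.sin(phi)])
    return -np.dot(np.asarray(r_u) - np.asarray(r0), xi) / c
```

Because the delay depends only on geometry, it can be tabulated once per candidate direction and element.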
With signal sampling rate fs, the steering delay τu0(r) and the frequency-domain signal are substituted into the formula to compute the PHAT-weighted steered response output of the single-frame signal; this is substituted into the formula to obtain the steered response power value of the training signal, which is then substituted into the formula for normalization. With vector dimension D, the normalized power values are substituted into the formula to obtain the spatial power spectrum feature vector of the l-th frame signal of direction vector γ.
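The PHAT-weighted steered response power and its normalization into a D-dimensional feature vector might be sketched as below; this is a standard SRP-PHAT formulation under assumed steering and normalization conventions, not the patent's exact formulas.

```python
import numpy as np

def srp_phat_feature(spectra_per_mic, delays_per_dir, fs=16000):
    """PHAT-weighted steered response power for one frame.
    spectra_per_mic: (U, K) single-frame spectra X_u(k).
    delays_per_dir:  (D, U) steering delays tau_u0 for D candidate dirs.
    Returns a normalised D-dimensional spatial power spectrum vector."""
    U, K = spectra_per_mic.shape
    # PHAT weighting: discard magnitude, keep phase only
    phat = spectra_per_mic / (np.abs(spectra_per_mic) + 1e-12)
    k = np.arange(K)
    power = np.empty(len(delays_per_dir))
    for d, taus in enumerate(delays_per_dir):
        # steer each channel by its delay and sum coherently over mics
        steer = np.exp(2j * np.pi * k[None, :] * fs * taus[:, None] / K)
        power[d] = np.sum(np.abs(np.sum(phat * steer, axis=0)) ** 2)
    return power / np.sum(power)  # normalise to unit sum
```

Evaluating this for every frame of a training direction produces the feature vectors that train that direction's Gaussian mixture model.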
A Gaussian mixture model λγ is constructed from the feature vectors. The feature vectors are used as training samples and substituted into the formula, with L the number of frames of the training signal, to obtain the training set of direction vector γ. A Gaussian mixture model λγ is constructed for each training direction: the feature distribution of one class is decomposed into several Gaussian sub-distributions and approximated by the mixture-weighted average of the Gaussian components. The complete Gaussian mixture model consists of the mixture weights, mean vectors, and covariance matrices.
Let the d-th component of the feature vector be given, let λγ denote the Gaussian mixture model of direction vector γ, let I be the order of the model and i the index of the Gaussian component, and let the mixture weight, the mean vector (with its d-th component), and the covariance matrix be the parameters of the model.
The model parameters are estimated with the EM algorithm: 1/I is set as the initial value of each mixture weight, a training sample is selected at random from the training set as the initial value of each mean vector, and the identity matrix is used as the initial value of each covariance matrix; these initial values form the initial λγ.
These are substituted into the formula to compute the Gaussian probability sub-distributions, which are substituted into the formula to obtain the posterior probability of each training sample under the i-th Gaussian component. The posterior probabilities are substituted into the formulas to obtain re-estimated values of the mixture weights and mean vectors, and, for 1 ≤ d ≤ D, of the diagonal covariance matrices, from which a re-estimated λγ is constructed. This re-estimation is iterated in the same way until the model converges.
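One EM re-estimation step for a diagonal-covariance Gaussian mixture, as described above, can be sketched as follows; the updates are the standard EM formulas and the function name is illustrative.

```python
import numpy as np

def em_step(Y, w, mu, var):
    """One EM re-estimation step for a GMM with diagonal covariances.
    Y: (L, D) training samples; w: (I,) mixture weights;
    mu: (I, D) means; var: (I, D) diagonal variances."""
    L, D = Y.shape
    I = len(w)
    # E-step: posterior probability of each sample under each component
    log_p = np.empty((L, I))
    for i in range(I):
        diff = Y - mu[i]
        log_p[:, i] = (np.log(w[i])
                       - 0.5 * np.sum(np.log(2 * np.pi * var[i]))
                       - 0.5 * np.sum(diff ** 2 / var[i], axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)   # (L, I) posteriors
    # M-step: re-estimate weights, means, diagonal variances
    n_i = post.sum(axis=0)                    # effective counts
    w_new = n_i / L
    mu_new = (post.T @ Y) / n_i[:, None]
    var_new = (post.T @ (Y ** 2)) / n_i[:, None] - mu_new ** 2
    return w_new, mu_new, np.maximum(var_new, 1e-10)
```

Iterating `em_step` until the log-likelihood stops increasing yields the converged λγ for one training direction.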
A sound source with unknown azimuth is collected as a test signal, the test signal is converted into single-frame signals, the spatial power spectrum feature vector ytest(l) of the unknown-azimuth source is computed from each single-frame signal, and ytest(l) is substituted into the Gaussian mixture models.
ytest(l) is substituted into the formula to compute the Gaussian probability sub-distributions, which are substituted into the formula to obtain the likelihood p(ytest(l)|λγ) of the test sample for each direction vector γ.
p(ytest(l)|λγ) is substituted into the formula; the direction vector corresponding to the maximum of p(ytest(l)|λγ) is taken as the estimate of the direction vector γ of the l-th frame signal of the unknown-azimuth sound source.
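The maximum-likelihood decision can be sketched as below, scoring a test feature vector against the GMM of each training direction; log-likelihoods are used for numerical stability, an implementation choice not stated in the text.

```python
import numpy as np

def gmm_log_likelihood(y, w, mu, var):
    """log p(y | lambda) of one feature vector under a diagonal GMM
    with weights w (I,), means mu (I, D), variances var (I, D)."""
    diff = y - mu                                    # (I, D)
    log_comp = (np.log(w)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * np.sum(diff ** 2 / var, axis=1))
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))  # log-sum-exp

def classify(y, models):
    """Return the index of the training direction whose GMM gives the
    maximum likelihood for feature vector y."""
    scores = [gmm_log_likelihood(y, *params) for params in models]
    return int(np.argmax(scores))
```

The index returned by `classify` selects the direction vector γ used as the azimuth estimate for that frame.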
The method comprises a training stage and a testing stage. In the training stage, the spatial power spectrum of each direction is extracted as a feature vector and a Gaussian mixture model is built for each direction; in the testing stage, the Gaussian-mixture-model classifier gives the likelihood of the test signal for each azimuth, and the azimuth estimate is obtained by maximum likelihood. The spatial power spectrum feature vector contains spatial structure information closely related to the source direction and the acoustic environment. The Gaussian mixture model can effectively characterize a class, approximating its feature distribution by a mixture-weighted average of several Gaussian probability density functions. The Gaussian mixture models of the training azimuths can be computed offline and stored in memory, so only one frame of signal is needed at test time to localize a source in real time. Extracting the SRP-PHAT spatial power spectrum of the microphone array as the feature vector of the localization system makes full use of acoustic environment information, markedly improves localization performance, and strengthens noise robustness.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings.
As shown in Fig. 1, a sound source signal at a known position is collected as a training signal. The training signal is converted into single-frame signals by framing and windowing, and each single-frame signal is transformed into a frequency-domain signal by the discrete Fourier transform. The steering delays of the candidate positions are computed to obtain the steered response output, and the spatial power spectrum feature vector of each training azimuth is extracted as a training sample, from which a Gaussian mixture model is constructed. A sound source signal at an unknown position is collected as a test signal in the same way, its spatial power spectrum feature vector is extracted as a test sample, and the test sample is fed to the constructed Gaussian-mixture-model classifier. The likelihood of the test sample under the Gaussian mixture model of each training azimuth is computed, and the azimuth with the maximum likelihood is selected as the source azimuth.
A sound source is set at a known position in the far field, so the known position is equivalent to a known azimuth. The direction vector γ of the source relative to the array is determined by the horizontal angle θ and the pitch angle; each known source corresponds to one γ.
In this embodiment, 6 omnidirectional microphones serve as array elements forming a uniform circular array of radius 0.1 m, with microphone index u, u = 1, 2, …, 6. For ease of explanation, the sound sources with known directions are set in the same horizontal plane as the microphone array, so the pitch angle is zero and the source direction vector γ degenerates from three dimensions to two: γ = [cos θ, sin θ]T.
The direction straight ahead in the horizontal plane is defined as 90°. The horizontal angle θ ranges over [0°, 360°) at 10° intervals, i.e. θ = 0°, 10°, …, 350°, so the number of training-signal directions is 36. Each array element collects the source signal as a training signal; the training signal of direction vector γ collected by the u-th element is obtained accordingly.
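The array geometry and training directions of this embodiment can be written down directly; placing element 1 on the positive x-axis is an assumption, since the text does not fix the rotation of the circular array.

```python
import numpy as np

RADIUS = 0.1   # array radius in metres, as in the text
NUM_MICS = 6

# Uniform circular array in the horizontal plane, centred at the origin.
mic_angles = 2 * np.pi * np.arange(NUM_MICS) / NUM_MICS
mic_positions = np.stack([RADIUS * np.cos(mic_angles),
                          RADIUS * np.sin(mic_angles)], axis=1)  # (6, 2)

# 36 training directions: theta = 0, 10, ..., 350 degrees,
# each giving a unit direction vector gamma = [cos(theta), sin(theta)]
thetas = np.deg2rad(np.arange(0, 360, 10))
train_dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # (36, 2)
```

These positions and directions are all that is needed to tabulate the steering delays offline.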
The training signal is assumed to come from a candidate position r; the array centre is at r0, the u-th element at ru, and the speed of sound in air is 342 m/s. τ0(r) is the propagation delay from the candidate position to the array centre, τu(r) the propagation delay from the candidate position to the u-th element, and τu0(r) the steering delay from the candidate position to the u-th element.
The training signal is divided into frames of length 512 samples (32 ms) with no overlap between frames; the sample index is n, 0 ≤ n < N, the number of frames is 300, and the frame index is l, l = 1, 2, …, L.
A Hamming window function is applied and substituted into the formula to obtain the l-th frame time-domain signal of direction vector γ collected by the u-th array element.
A discrete Fourier transform of length K = 1024, with frequency-bin index k, is applied to each single-frame signal and substituted into the formula to obtain the frequency-domain signal corresponding to the l-th frame time-domain signal of direction vector γ collected by the u-th array element, converting the training signal from the time domain to the frequency domain.
The steering delay of the far-field sound source is computed, the source position being equivalent to a source azimuth. The direction vector of a candidate position is ξ, composed of the horizontal angle α and the pitch angle φ of the candidate position relative to the array: ξ = [cos φ cos α, cos φ sin α, sin φ]T, one ξ per candidate position.
Since the source and the microphone array are in the same horizontal plane, the direction vector reduces to ξ = [cos α, sin α]T; substituting into the formula yields the steering delay τu0(r) of the far-field candidate position. Both ξ = [cos α, sin α]T and τu0(r) are independent of the received signal, so they can be computed offline and stored in memory.
With signal sampling rate fs = 16 kHz, the steering delay τu0(r) and the frequency-domain signal are substituted into the formula to compute the PHAT-weighted steered response output of the single-frame signal; this is substituted into the formula to obtain the steered response power value of the training signal, which is substituted into the formula for normalization.
The vector dimension is D; since the number of training-signal directions was set to 36 above, D = 36. The normalized power values are substituted into the formula to obtain the spatial power spectrum feature vector of the l-th frame signal of direction vector γ.
A Gaussian mixture model λγ is constructed from the feature vectors. For each training azimuth, the spatial power spectrum feature vectors of all frames of that azimuth are taken as training samples; L is the number of frames of the training signal, consistent with the number of single frames set above, L = 300. Substituting into the formula yields the training sample set of direction vector γ. A Gaussian mixture model is constructed for each training direction: the feature distribution of one class is decomposed into several Gaussian sub-distributions and approximated by the mixture-weighted average of the Gaussian components; the complete model consists of the mixture weights, mean vectors, and covariance matrices.
Let the d-th component of the feature vector be given, let the order of the Gaussian mixture model be I = 8 and the Gaussian component index be i, and let the mixture weight, the mean vector (with its d-th component), the variance, and the covariance matrix be the model parameters.
The model parameters are estimated iteratively with the EM algorithm: 1/I is set as the initial value of each mixture weight, training samples are selected at random from the training set as initial mean vectors, and the identity matrix is used as the initial covariance matrix; these initial values form the initial λγ.
First, the parameters are substituted into the formula to compute the Gaussian probability sub-distributions, which are substituted into the formula to obtain the posterior probability of each training sample under the i-th Gaussian component.
Second, the posterior probabilities are substituted into the formula to obtain re-estimated values of the mixture weights and mean vectors, and, for 1 ≤ d ≤ D, re-estimated values of the diagonal covariance matrices.
Finally, the re-estimated values are substituted into the formula to construct a re-estimated λγ; the re-estimation is repeated, iteratively updating λγ until the model converges. The Gaussian mixture models of all training directions are thus constructed, forming a classifier that is stored offline in memory.
The same microphone array is used. For ease of explanation, a sound source at an unknown position is set in the far field, in the same horizontal plane as the array, so the source position is equivalent to a source azimuth and the azimuth is equivalent to a horizontal angle. A sound source signal with unknown azimuth is collected as a test signal, and the test signal is converted into single-frame signals by the same framing and windowing method as above.
In the same way as above, each single-frame test signal is converted from the time domain to the frequency domain, the steering delays of the candidate positions are computed, the steered response output of the test signal is computed from the frequency-domain signal and the steering delays, and the spatial power spectrum feature vector ytest(l) of the unknown source is obtained as a test sample.
ytest(l) is input to the Gaussian-mixture-model classifiers of the different training azimuths: ytest(l) is substituted into the formula to compute the Gaussian probability sub-distributions, which are substituted into the formula to obtain the likelihood p(ytest(l)|λγ) of the test sample for each direction vector γ.
p(ytest(l)|λγ) is substituted into the formula. Because the likelihood of a test sample is largest at its true training azimuth and the prior probabilities of the training azimuths are equal, by Bayes' formula the posterior probability of the test sample is also largest at that azimuth; therefore the direction vector with maximum likelihood p(ytest(l)|λγ) is taken as the direction-vector estimate of the l-th frame signal of the unknown azimuth, realizing sound source localization.
For reverberation times T60 of 0.3 s and 0.6 s, with signal-to-noise ratio (dB) on the abscissa and localization success rate (%) on the ordinate, the source was localized with both the present method and the conventional SRP-PHAT method; comparison of the results shows that the present method outperforms conventional SRP-PHAT.
The above embodiments do not limit the present invention; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention fall within the scope of protection of the present invention.