Disclosure of Invention
To solve the problems in the prior art, the invention provides a sound source localization method based on a Gaussian mixture model and spatial power spectrum features.
The method uses U microphones as array elements to form an array, with microphone index u, u = 1, 2, …, U. Sound source signals with known directions are collected; each source has a direction vector γ, whose azimuth is determined by the horizontal angle θ and the pitch angle of the source relative to the array. The time-domain signal of direction vector γ collected by the u-th array element is used as a training signal.
The training signal is divided into frames of length N, with sample index n, 0 ≤ n < N; the number of frames is L, with frame index l, l = 1, 2, …, L. Each frame of the training signal is windowed with a Hamming window function and substituted into the formula to obtain the l-th single-frame signal of direction vector γ collected by the u-th array element.
A discrete Fourier transform (DFT) of length K, with frequency-bin index k, is applied to each single-frame signal and substituted into the formula to obtain the frequency-domain signal corresponding to the l-th frame time-domain signal of direction vector γ collected by the u-th array element, thereby converting the training signal from the time domain to the frequency domain.
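The framing, Hamming windowing, and K-point DFT steps above can be sketched as follows; the function name `frame_and_transform` and the non-overlapping framing are illustrative assumptions, since the patent's exact formulas are not reproduced here.

```python
import numpy as np

def frame_and_transform(x, frame_len=512, num_frames=None, dft_len=1024):
    """Split a 1-D signal into non-overlapping frames, apply a Hamming
    window, and take a K-point DFT of each frame (K = dft_len)."""
    if num_frames is None:
        num_frames = len(x) // frame_len
    window = np.hamming(frame_len)
    frames = x[:num_frames * frame_len].reshape(num_frames, frame_len)
    windowed = frames * window
    # Zero-padded DFT of length K for each frame
    spectra = np.fft.fft(windowed, n=dft_len, axis=1)
    return spectra  # shape (L, K)
```

Applied per microphone channel, this yields the frequency-domain single-frame signals used in the later steps.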
Assume the sound source is in the far field and propagates from a known candidate position to the array. Let c be the speed of sound in air, r the far-field candidate position, r0 the position of the array centre, ru the position of the u-th array element, τ0(r) the sound propagation delay from the candidate position to the array centre, τu(r) the sound propagation delay from the candidate position to the u-th array element, and τu0(r) the steering delay from the candidate position to the u-th array element.
The steering delay of a far-field sound source is computed by treating the source position as equivalent to a source azimuth. The direction vector ξ of a far-field candidate position is determined by the horizontal angle α and the pitch angle φ of the candidate position relative to the array, ξ = [cos φ cos α, cos φ sin α, sin φ]T; substituting into the formula yields the steering delay τu0(r) of the far-field candidate position.
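A minimal sketch of the far-field steering delay, assuming the common plane-wave convention τu0(r) = −(ru − r0)·ξ / c; the sign convention and the function name are assumptions, not the patent's exact formula.

```python
import numpy as np

C = 342.0  # speed of sound in air, m/s (value used later in the text)

def steering_delay(r_u, r0, alpha, phi=0.0, c=C):
    """Far-field steering delay tau_u0 of the u-th element at position
    r_u relative to the array centre r0, for a candidate direction
    xi = [cos(phi)cos(alpha), cos(phi)sin(alpha), sin(phi)]^T.
    The plane-wave sign convention here is an assumption."""
    xi = np.array([np.cos(phi) * np.cos(alpha),
                   np.cos(phi) * np.sin(alpha),
                   np.sin(phi)])
    return -np.dot(np.asarray(r_u) - np.asarray(r0), xi) / c
```

Because the delay depends only on geometry, it can be tabulated once per candidate direction and element.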
With signal sampling rate fs, the steering delay τu0(r) and the frequency-domain signal are substituted into the formula to compute the PHAT-weighted steered response output of the single-frame signal; this is substituted into the formula to obtain the steered response power value of the training signal, which is then substituted into the formula for normalization. With vector dimension D, the normalized power values are substituted into the formula to obtain the spatial power spectrum feature vector of the l-th frame signal of direction vector γ.
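The PHAT-weighted steered response power and its normalization into a D-dimensional feature vector might be sketched as below; this is a standard SRP-PHAT formulation under assumed steering and normalization conventions, not the patent's exact formulas.

```python
import numpy as np

def srp_phat_feature(spectra_per_mic, delays_per_dir, fs=16000):
    """PHAT-weighted steered response power for one frame.
    spectra_per_mic: (U, K) single-frame spectra X_u(k).
    delays_per_dir:  (D, U) steering delays tau_u0 for D candidate dirs.
    Returns a normalised D-dimensional spatial power spectrum vector."""
    U, K = spectra_per_mic.shape
    # PHAT weighting: discard magnitude, keep phase only
    phat = spectra_per_mic / (np.abs(spectra_per_mic) + 1e-12)
    k = np.arange(K)
    power = np.empty(len(delays_per_dir))
    for d, taus in enumerate(delays_per_dir):
        # steer each channel by its delay and sum coherently over mics
        steer = np.exp(2j * np.pi * k[None, :] * fs * taus[:, None] / K)
        power[d] = np.sum(np.abs(np.sum(phat * steer, axis=0)) ** 2)
    return power / np.sum(power)  # normalise to unit sum
```

Evaluating this for every frame of a training direction produces the feature vectors that train that direction's Gaussian mixture model.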
A Gaussian mixture model λγ is constructed from the feature vectors. The feature vectors are used as training samples and substituted into the formula, with L the number of frames of the training signal, to obtain the training set of direction vector γ. A Gaussian mixture model λγ is constructed for each training direction: the feature distribution of one class is decomposed into several Gaussian sub-distributions and approximated by the mixture-weighted average of the Gaussian components. The complete Gaussian mixture model consists of the mixture weights, mean vectors, and covariance matrices.
Let the d-th component of the feature vector be given, let λγ denote the Gaussian mixture model of direction vector γ, let I be the order of the model and i the index of the Gaussian component, and let the mixture weight, the mean vector (with its d-th component), and the covariance matrix be the parameters of the model.
The model parameters are estimated with the EM algorithm: 1/I is set as the initial value of each mixture weight, a training sample is selected at random from the training set as the initial value of each mean vector, and the identity matrix is used as the initial value of each covariance matrix; these initial values form the initial λγ.
These are substituted into the formula to compute the Gaussian probability sub-distributions, which are substituted into the formula to obtain the posterior probability of each training sample under the i-th Gaussian component. The posterior probabilities are substituted into the formulas to obtain re-estimated values of the mixture weights and mean vectors, and, for 1 ≤ d ≤ D, of the diagonal covariance matrices, from which a re-estimated λγ is constructed. This re-estimation is iterated in the same way until the model converges.
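One EM re-estimation step for a diagonal-covariance Gaussian mixture, as described above, can be sketched as follows; the updates are the standard EM formulas and the function name is illustrative.

```python
import numpy as np

def em_step(Y, w, mu, var):
    """One EM re-estimation step for a GMM with diagonal covariances.
    Y: (L, D) training samples; w: (I,) mixture weights;
    mu: (I, D) means; var: (I, D) diagonal variances."""
    L, D = Y.shape
    I = len(w)
    # E-step: posterior probability of each sample under each component
    log_p = np.empty((L, I))
    for i in range(I):
        diff = Y - mu[i]
        log_p[:, i] = (np.log(w[i])
                       - 0.5 * np.sum(np.log(2 * np.pi * var[i]))
                       - 0.5 * np.sum(diff ** 2 / var[i], axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)   # (L, I) posteriors
    # M-step: re-estimate weights, means, diagonal variances
    n_i = post.sum(axis=0)                    # effective counts
    w_new = n_i / L
    mu_new = (post.T @ Y) / n_i[:, None]
    var_new = (post.T @ (Y ** 2)) / n_i[:, None] - mu_new ** 2
    return w_new, mu_new, np.maximum(var_new, 1e-10)
```

Iterating `em_step` until the log-likelihood stops increasing yields the converged λγ for one training direction.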
A sound source with unknown azimuth is collected as a test signal, the test signal is converted into single-frame signals, the spatial power spectrum feature vector ytest(l) of the unknown-azimuth source is computed from each single-frame signal, and ytest(l) is substituted into the Gaussian mixture models.
ytest(l) is substituted into the formula to compute the Gaussian probability sub-distributions, which are substituted into the formula to obtain the likelihood p(ytest(l)|λγ) of the test sample for each direction vector γ.
p(ytest(l)|λγ) is substituted into the formula; the direction vector corresponding to the maximum of p(ytest(l)|λγ) is taken as the estimate of the direction vector γ of the l-th frame signal of the unknown-azimuth sound source.
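The maximum-likelihood decision can be sketched as below, scoring a test feature vector against the GMM of each training direction; log-likelihoods are used for numerical stability, an implementation choice not stated in the text.

```python
import numpy as np

def gmm_log_likelihood(y, w, mu, var):
    """log p(y | lambda) of one feature vector under a diagonal GMM
    with weights w (I,), means mu (I, D), variances var (I, D)."""
    diff = y - mu                                    # (I, D)
    log_comp = (np.log(w)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * np.sum(diff ** 2 / var, axis=1))
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))  # log-sum-exp

def classify(y, models):
    """Return the index of the training direction whose GMM gives the
    maximum likelihood for feature vector y."""
    scores = [gmm_log_likelihood(y, *params) for params in models]
    return int(np.argmax(scores))
```

The index returned by `classify` selects the direction vector γ used as the azimuth estimate for that frame.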
The method comprises a training stage and a testing stage. In the training stage, the spatial power spectrum of each direction is extracted as a feature vector and a Gaussian mixture model is built for each direction; in the testing stage, the Gaussian-mixture-model classifier gives the likelihood of the test signal for each azimuth, and the azimuth estimate is obtained by maximum likelihood. The spatial power spectrum feature vector contains spatial structure information closely related to the source direction and the acoustic environment. The Gaussian mixture model can effectively characterize a class, approximating its feature distribution by a mixture-weighted average of several Gaussian probability density functions. The Gaussian mixture models of the training azimuths can be computed offline and stored in memory, so only one frame of signal is needed at test time to localize a source in real time. Extracting the SRP-PHAT spatial power spectrum of the microphone array as the feature vector of the localization system makes full use of acoustic environment information, markedly improves localization performance, and strengthens noise robustness.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings.
As shown in Fig. 1, a sound source signal at a known position is collected as a training signal. The training signal is converted into single-frame signals by framing and windowing, and each single-frame signal is transformed into a frequency-domain signal by the discrete Fourier transform. The steering delays of the candidate positions are computed to obtain the steered response output, and the spatial power spectrum feature vector of each training azimuth is extracted as a training sample, from which a Gaussian mixture model is constructed. A sound source signal at an unknown position is collected as a test signal in the same way, its spatial power spectrum feature vector is extracted as a test sample, and the test sample is fed to the constructed Gaussian-mixture-model classifier. The likelihood of the test sample under the Gaussian mixture model of each training azimuth is computed, and the azimuth with the maximum likelihood is selected as the source azimuth.
A sound source is set at a known position in the far field, so the known position is equivalent to a known azimuth. The direction vector γ of the source relative to the array is determined by the horizontal angle θ and the pitch angle; each known source corresponds to one γ.
In this embodiment, 6 omnidirectional microphones serve as array elements forming a uniform circular array of radius 0.1 m, with microphone index u, u = 1, 2, …, 6. For ease of explanation, the sound sources with known directions are set in the same horizontal plane as the microphone array, so the pitch angle is zero and the source direction vector γ degenerates from three dimensions to two: γ = [cos θ, sin θ]T.
The direction straight ahead in the horizontal plane is defined as 90°. The horizontal angle θ ranges over [0°, 360°) at 10° intervals, i.e. θ = 0°, 10°, …, 350°, so the number of training-signal directions is 36. Each array element collects the source signal as a training signal; the training signal of direction vector γ collected by the u-th element is obtained accordingly.
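The array geometry and training directions of this embodiment can be written down directly; placing element 1 on the positive x-axis is an assumption, since the text does not fix the rotation of the circular array.

```python
import numpy as np

RADIUS = 0.1   # array radius in metres, as in the text
NUM_MICS = 6

# Uniform circular array in the horizontal plane, centred at the origin.
mic_angles = 2 * np.pi * np.arange(NUM_MICS) / NUM_MICS
mic_positions = np.stack([RADIUS * np.cos(mic_angles),
                          RADIUS * np.sin(mic_angles)], axis=1)  # (6, 2)

# 36 training directions: theta = 0, 10, ..., 350 degrees,
# each giving a unit direction vector gamma = [cos(theta), sin(theta)]
thetas = np.deg2rad(np.arange(0, 360, 10))
train_dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # (36, 2)
```

These positions and directions are all that is needed to tabulate the steering delays offline.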
The training signal is assumed to come from a candidate position r; the array centre is at r0, the u-th element at ru, and the speed of sound in air is 342 m/s. τ0(r) is the propagation delay from the candidate position to the array centre, τu(r) the propagation delay from the candidate position to the u-th element, and τu0(r) the steering delay from the candidate position to the u-th element.
The training signal is divided into frames of length 512 samples (32 ms) with no overlap between frames; the sample index is n, 0 ≤ n < N, the number of frames is 300, and the frame index is l, l = 1, 2, …, L.
A Hamming window function is applied and substituted into the formula to obtain the l-th frame time-domain signal of direction vector γ collected by the u-th array element.
A discrete Fourier transform of length K = 1024, with frequency-bin index k, is applied to each single-frame signal and substituted into the formula to obtain the frequency-domain signal corresponding to the l-th frame time-domain signal of direction vector γ collected by the u-th array element, converting the training signal from the time domain to the frequency domain.
The steering delay of the far-field sound source is computed, the source position being equivalent to a source azimuth. The direction vector of a candidate position is ξ, composed of the horizontal angle α and the pitch angle φ of the candidate position relative to the array: ξ = [cos φ cos α, cos φ sin α, sin φ]T, one ξ per candidate position.
Since the source and the microphone array are in the same horizontal plane, the direction vector reduces to ξ = [cos α, sin α]T; substituting into the formula yields the steering delay τu0(r) of the far-field candidate position. Both ξ = [cos α, sin α]T and τu0(r) are independent of the received signal, so they can be computed offline and stored in memory.
With signal sampling rate fs = 16 kHz, the steering delay τu0(r) and the frequency-domain signal are substituted into the formula to compute the PHAT-weighted steered response output of the single-frame signal; this is substituted into the formula to obtain the steered response power value of the training signal, which is substituted into the formula for normalization.
The vector dimension is D; since the number of training-signal directions was set to 36 above, D = 36. The normalized power values are substituted into the formula to obtain the spatial power spectrum feature vector of the l-th frame signal of direction vector γ.
A Gaussian mixture model λγ is constructed from the feature vectors. For each training azimuth, the spatial power spectrum feature vectors of all frames of that azimuth are taken as training samples; L is the number of frames of the training signal, consistent with the number of single frames set above, L = 300. Substituting into the formula yields the training sample set of direction vector γ. A Gaussian mixture model is constructed for each training direction: the feature distribution of one class is decomposed into several Gaussian sub-distributions and approximated by the mixture-weighted average of the Gaussian components; the complete model consists of the mixture weights, mean vectors, and covariance matrices.
Let the d-th component of the feature vector be given, let the order of the Gaussian mixture model be I = 8 and the Gaussian component index be i, and let the mixture weight, the mean vector (with its d-th component), the variance, and the covariance matrix be the model parameters.
The model parameters are estimated iteratively with the EM algorithm: 1/I is set as the initial value of each mixture weight, training samples are selected at random from the training set as initial mean vectors, and the identity matrix is used as the initial covariance matrix; these initial values form the initial λγ.
First, the parameters are substituted into the formula to compute the Gaussian probability sub-distributions, which are substituted into the formula to obtain the posterior probability of each training sample under the i-th Gaussian component.
Second, the posterior probabilities are substituted into the formula to obtain re-estimated values of the mixture weights and mean vectors, and, for 1 ≤ d ≤ D, re-estimated values of the diagonal covariance matrices.
Finally, the re-estimated values are substituted into the formula to construct a re-estimated λγ; the re-estimation is repeated, iteratively updating λγ until the model converges. The Gaussian mixture models of all training directions are thus constructed, forming a classifier that is stored offline in memory.
The same microphone array is used. For ease of explanation, a sound source at an unknown position is set in the far field, in the same horizontal plane as the array, so the source position is equivalent to a source azimuth and the azimuth is equivalent to a horizontal angle. A sound source signal with unknown azimuth is collected as a test signal, and the test signal is converted into single-frame signals by the same framing and windowing method as above.
In the same way as above, each single-frame test signal is converted from the time domain to the frequency domain, the steering delays of the candidate positions are computed, the steered response output of the test signal is computed from the frequency-domain signal and the steering delays, and the spatial power spectrum feature vector ytest(l) of the unknown source is obtained as a test sample.
ytest(l) is input to the Gaussian-mixture-model classifiers of the different training azimuths: ytest(l) is substituted into the formula to compute the Gaussian probability sub-distributions, which are substituted into the formula to obtain the likelihood p(ytest(l)|λγ) of the test sample for each direction vector γ.
p(ytest(l)|λγ) is substituted into the formula. Because the likelihood of a test sample is largest at its true training azimuth and the prior probabilities of the training azimuths are equal, by Bayes' formula the posterior probability of the test sample is also largest at that azimuth; therefore the direction vector with maximum likelihood p(ytest(l)|λγ) is taken as the direction-vector estimate of the l-th frame signal of the unknown azimuth, realizing sound source localization.
For reverberation times T60 of 0.3 s and 0.6 s, with signal-to-noise ratio (dB) on the abscissa and localization success rate (%) on the ordinate, the source was localized with both the present method and the conventional SRP-PHAT method; comparison of the results shows that the present method outperforms conventional SRP-PHAT.
The above embodiments do not limit the present invention; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention fall within the scope of protection of the present invention.