Disclosure of Invention
The invention aims to provide a speech DOA estimation method based on ResNet, which can effectively solve the problem of inaccurate speech DOA estimation under strong noise and reverberation and is a DOA estimation method suitable for arbitrary array structures.
In order to achieve the above purpose, the solution of the invention is:
a speech DOA estimation method based on ResNet comprises the following steps:
step 1, a training data set is simulated with MATLAB; the data set uses a plurality of speech signals to traverse the measurement range, and the corresponding angles and speech signals are stored;
step 2, after framing each simulated signal, the GCC is calculated and phase-transformed; the result is cropped according to the array model parameters, and each speech frame is weighted and summed; the weighted features and the corresponding incidence angles are stored as a data set;
step 3, initializing ResNet by using MATConvNet and training by using a data set;
and step 4, coarse localization is performed on the signal to be measured using wideband MUSIC to obtain a coarse localization result, and according to that result the ResNet of the group whose center point is closest to the wideband MUSIC result is selected for subsequent accurate localization, giving the DOA estimation result.
In the step 1, the microphone array has M array elements with element spacing d; each element is an identical omnidirectional microphone, and the far-field signal is incident from direction θ. The noise is assumed to be white Gaussian noise independent of the incident signal, with mean 0 and variance σ²; then the output of the array at time k is:
x(k)=a(θ)s(k)+n(k)
in the formula, s (k) represents a target source complex amplitude vector at the time k; n (k) represents an M-dimensional additive noise complex vector at time k; a (theta) represents an M-dimensional array flow pattern matrix with an incident angle of theta.
The expression of the M-dimensional array flow pattern matrix a(θ) is:

\mathbf{a}(\theta) = \left[\, 1,\; e^{-j 2\pi d \sin\theta / \lambda},\; \dots,\; e^{-j 2\pi (M-1) d \sin\theta / \lambda} \,\right]^{T}

where λ is the wavelength of the speech signal, d is the array element spacing, and θ is the incidence angle of the speech signal.
In step 1, the covariance matrix of the array signal is R = E[x(k)x^H(k)]. Performing eigendecomposition on this covariance matrix, the eigenvectors corresponding to the K larger eigenvalues form the signal subspace U_s, and the remaining eigenvectors form the noise subspace U_N. From the orthogonality between the noise subspace and the signal steering vectors, the spatial spectrum function is obtained as:

P_{\mathrm{MUSIC}}(\theta) = \frac{1}{\mathbf{a}^{H}(\theta)\, \mathbf{U}_{N} \mathbf{U}_{N}^{H}\, \mathbf{a}(\theta)}

where a(θ) is the array flow pattern at incidence angle θ, U_N is the noise subspace of the signal, a^H(θ) is the conjugate transpose of a(θ), and U_N^H is the conjugate transpose of U_N; the angle θ at which P_MUSIC(θ) reaches its maximum is the MUSIC localization result.
Performing frequency division processing on x(k) yields L sub-band signals x(k, f_l), l = 1, …, L; the spatial spectrum function of wideband MUSIC localization is:

P_{\mathrm{WMUSIC}}(\theta) = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{\mathbf{a}^{H}(\theta, f_l)\, \mathbf{U}_{N}(f_l) \mathbf{U}_{N}^{H}(f_l)\, \mathbf{a}(\theta, f_l)}

where a(θ, f_l) is the array flow pattern of the sub-band signal at frequency f_l and incidence angle θ, U_N(f_l) is the noise subspace of that sub-band signal, a^H(θ, f_l) is the conjugate transpose of a(θ, f_l), and U_N^H(f_l) is the conjugate transpose of U_N(f_l); the angle θ at which P_WMUSIC(θ) reaches its maximum is the wideband MUSIC localization result.
In the step 2, let the actual signals received by the m-th and n-th array elements in the array model be x_m(k) and x_n(k), respectively; then:

x_m(k) = a_m s(k - \tau_m) + n_{em}(k) + n_{rm}(k)   (1)
x_n(k) = a_n s(k - \tau_n) + n_{en}(k) + n_{rn}(k)   (2)

where n_em(k) and n_en(k) are the additive environmental noise received by microphones m and n at time k; n_rm(k) and n_rn(k) are the multipath reflection noise received by microphones m and n at time k; a_m and a_n are the amplitude attenuation factors of the signals received by microphones m and n; τ_m and τ_n are the propagation times of the sound source signal to microphones m and n; and s(k) is the sound source signal.
Neglecting the effect of reverberation, the correlation function of x_m(k) and x_n(k) is:

R_{x_m x_n}(\tau) = E[\, x_m(k)\, x_n(k - \tau) \,]

Substituting equations (1) and (2) into this expression gives:

R_{x_m x_n}(\tau) = a_m a_n R_{ss}[\tau - (\tau_m - \tau_n)] + a_m R_{s n_{en}}(\tau) + a_n R_{n_{em} s}(\tau) + R_{n_{em} n_{en}}(\tau)   (3)

where R_ss[τ − (τ_m − τ_n)] is the correlation function of s(k − τ_m) and s(k − τ_n), R_{s n_en}(τ) is the correlation function of s(k) and n_en(k), R_{n_em s}(τ) is the correlation function of s(k) and n_em(k), and R_{n_em n_en}(τ) is the correlation function of n_em(k) and n_en(k);
assuming s(k), n_em(k) and n_en(k) are mutually uncorrelated, equation (3) is written as:

R_{x_m x_n}(\tau) = a_m a_n R_{ss}(\tau - \tau_{mn})

where τ_mn = τ_m − τ_n and R_ss(τ) is the autocorrelation function of the sound source s(k); when τ − τ_mn = 0, R_{x_m x_n}(τ) takes its maximum value, so the time delay τ_mn between the signals received by the two microphone elements can be estimated from the position of this maximum.
From the relationship between the cross-correlation function and the cross-power spectrum:

R_{x_m x_n}(\tau) = \int G_{x_m x_n}(f)\, e^{j 2\pi f \tau}\, df   (4)

The generalized cross-correlation is obtained by adding a weighting function to equation (4):

R^{(g)}_{x_m x_n}(\tau) = \int \psi(f)\, G_{x_m x_n}(f)\, e^{j 2\pi f \tau}\, df   (5)

where ψ(f) is the weighting function, G_{x_m x_n}(f) is the cross-power spectrum of the two signals, f is the signal frequency, and τ is the time delay between x_m(k) and x_n(k). The PHAT weighting function is:

\psi_{\mathrm{PHAT}}(f) = \frac{1}{\left| G_{x_m x_n}(f) \right|}
In step 3, if the desired mapping of ResNet is H(x) and the network input is x, the desired mapping of the residual structure is F(x) = H(x) − x and the final output is F(x) + x, which is realized by a feedforward neural network with an added summation unit.
In the step 3, the data set is divided into several groups by incidence angle; each group covers a range 10 degrees wide, adjacent groups advance in steps of 7.5 degrees (so neighboring groups overlap), and one ResNet is trained with the data set of each group.
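The grouping described in the step 3 can be sketched as follows; this is a minimal illustration that assumes a 0–180° localization range (the overall range is not specified in the text), with groups 10° wide advancing in 7.5° steps, and the group whose center lies nearest the coarse wideband-MUSIC estimate chosen for the accurate ResNet stage:

```python
def make_groups(lo=0.0, hi=180.0, width=10.0, step=7.5):
    """Overlapping angle groups: each spans `width` deg, advancing by `step`."""
    groups, start = [], lo
    while start + width <= hi + 1e-9:
        groups.append((start, start + width))
        start += step
    return groups

def select_group(groups, coarse_theta):
    """Index of the group whose center is closest to the coarse estimate."""
    centers = [(g[0] + g[1]) / 2 for g in groups]
    return min(range(len(groups)), key=lambda i: abs(centers[i] - coarse_theta))

groups = make_groups()
idx = select_group(groups, coarse_theta=52.0)  # e.g. a coarse MUSIC result
```

With these assumed bounds the coarse estimate 52.0° selects the group spanning 45°–55°, whose ResNet then performs the accurate localization.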
After the scheme is adopted, the method is based on a residual convolutional network (ResNet), the residual convolutional network is applied to sound source DOA estimation, MATLAB is used for simulating a large number of microphone array receiving signals containing reverberation and noise to form a data set, and ResNet learns the characteristics of the array receiving signals under the noise and reverberation conditions from the data set. In order to reduce the network complexity and accelerate the training process, the whole positioning range is divided into a plurality of groups, the traditional broadband MUSIC is adopted to roughly estimate the sound source, and then the neural network trained by the data set corresponding to the grouping range is selected according to the rough estimation result to be accurately positioned. The method of the invention has wide application in the fields of remote voice automatic identification, automatic camera steering, hearing aids, vehicle-mounted hands-free voice communication, remote video conference systems, robot hearing and the like.
Compared with the existing DOA estimation methods in the prior art, the method provided by the invention has the following advantages:
(1) No unrealistic assumptions are made about the received signal; ResNet can learn the characteristics of the environmental noise and reverberation, so localization is more accurate in strong noise and reverberation environments;
(2) Deep learning has mature GPU acceleration technology, so computation is fast in actual use and real-time performance is good;
(3) There is no limitation on the array structure; ResNet speech DOA estimation systems for different array structures can be obtained by using data sets of the corresponding array structures.
Detailed Description
As shown in FIG. 1, the invention provides a speech DOA estimation method based on ResNet, which extracts features from the generalized cross-correlation (GCC), uses ResNet to learn the nonlinear mapping between the features and the DOA from a large number of simulated microphone array signals, and uses a plurality of ResNets to perform accurate and robust DOA estimation on the basis of the coarse estimate of the traditional wideband MUSIC method.
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
Wideband MUSIC positioning
The microphone array has M array elements with element spacing d; each element is an identical omnidirectional microphone, and the far-field signal is incident at angle θ. Assume the noise is white Gaussian noise independent of the incident signal, with mean 0 and variance σ²; then the output of the array at time k is:
x(k)=a(θ)s(k)+n(k)
in the formula, s (k) represents a target source complex amplitude vector at the time k; n (k) represents an M-dimensional additive noise complex vector at time k; a (theta) represents an M-dimensional array flow pattern matrix with an incident angle of theta.
The expression of a(θ) is:

\mathbf{a}(\theta) = \left[\, 1,\; e^{-j 2\pi d \sin\theta / \lambda},\; \dots,\; e^{-j 2\pi (M-1) d \sin\theta / \lambda} \,\right]^{T}

where λ is the wavelength of the speech signal, d is the array element spacing, and θ is the incidence angle of the speech signal.
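A minimal numeric sketch of this array flow pattern, using only the Python standard library; the element count, spacing, and the 1 kHz narrowband frequency are illustrative values, not taken from the embodiment:

```python
import cmath
import math

def steering_vector(theta_deg, M=8, d=0.1, c=340.0, f=1000.0):
    """Far-field ULA array flow pattern a(theta): element m lags the
    reference element by a phase of 2*pi*m*d*sin(theta)/lambda."""
    lam = c / f                      # wavelength lambda = c / f
    theta = math.radians(theta_deg)
    return [cmath.exp(-1j * 2 * math.pi * m * d * math.sin(theta) / lam)
            for m in range(M)]

a = steering_vector(30.0)
```

Each entry has unit magnitude, and the first (reference) element is always 1, reflecting that the steering vector only encodes inter-element phase delays.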
The covariance matrix of the array signal is R = E[x(k)x^H(k)]. Performing eigendecomposition on this covariance matrix, the eigenvectors corresponding to the K larger eigenvalues form the signal subspace U_s, and the remaining eigenvectors form the noise subspace U_N. From the orthogonality between the noise subspace and the signal steering vectors, the spatial spectrum function is obtained as:

P_{\mathrm{MUSIC}}(\theta) = \frac{1}{\mathbf{a}^{H}(\theta)\, \mathbf{U}_{N} \mathbf{U}_{N}^{H}\, \mathbf{a}(\theta)}

where a(θ) is the array flow pattern at incidence angle θ, U_N is the noise subspace of the signal, a^H(θ) is the conjugate transpose of a(θ), and U_N^H is the conjugate transpose of U_N; the angle θ at which P_MUSIC(θ) reaches its maximum is the MUSIC localization result.
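The spatial spectrum can be illustrated with a small standard-library sketch. To keep it dependency-free, the eigendecomposition is shortcut: for a single noiseless source the signal subspace is span{a(θ₀)}, so the noise subspace U_N is built as its orthogonal complement by Gram–Schmidt; the array parameters and the 20° source angle are illustrative assumptions:

```python
import cmath
import math

M, d, c, f = 6, 0.1, 340.0, 1000.0
lam = c / f

def steer(theta_deg):
    th = math.radians(theta_deg)
    return [cmath.exp(-1j * 2 * math.pi * m * d * math.sin(th) / lam)
            for m in range(M)]

def hdot(u, v):
    """Hermitian inner product u^H v."""
    return sum(ui.conjugate() * vi for ui, vi in zip(u, v))

# Single source at theta0: the signal subspace is span{a(theta0)}; build the
# noise subspace U_N as its orthogonal complement via Gram-Schmidt, standing
# in for the eigendecomposition of the covariance matrix R.
theta0 = 20.0
basis = [steer(theta0)]
for k in range(M):
    e = [1.0 + 0j if i == k else 0j for i in range(M)]
    for b in basis:  # subtract projections onto the existing basis vectors
        coef = hdot(b, e) / hdot(b, b)
        e = [ei - coef * bi for ei, bi in zip(e, b)]
    if abs(hdot(e, e)) > 1e-8:
        basis.append(e)
U_N = basis[1:]

def p_music(theta_deg):
    """P(theta) = 1 / (a^H U_N U_N^H a), summed over the noise basis."""
    a = steer(theta_deg)
    denom = sum(abs(hdot(u, a)) ** 2 / abs(hdot(u, u)) for u in U_N)
    return 1.0 / denom if denom > 1e-12 else float('inf')

grid = [t / 2.0 for t in range(-180, 181)]  # scan -90 deg .. 90 deg
doa = max(grid, key=p_music)                # spectrum peak -> DOA estimate
```

Because a(θ₀) is orthogonal to every noise-subspace vector, the denominator vanishes at θ₀ and the spectrum peaks at the true direction.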
because the voice signal is a broadband signal, the frequency division processing is carried out on the received signal x (k) to obtain L self-frequency-band signals x (k, f) l ) L =1, \ 8230;, L. The spatial spectrum function for wideband MUSIC positioning is:
wherein, a (theta, f)
l ) Is a frequency of f
l From the array flow pattern, U, of the band signal with the incident angle theta
N Is a noise subspace of the signal, a
H (θ,f
l ) Is a (theta, f)
l ) The conjugate transpose of (a) is performed,
is U
N The conjugate transpose of (1); p is
WMUSIC And theta corresponding to the maximum point is the result of positioning the broadband MUSIC.
Calculating GCC and making phase transformation (GCC-PHAT)
Let the actual signals received by the m-th and n-th microphone array elements in the array model be x_m(k) and x_n(k), respectively; then:

x_m(k) = a_m s(k - \tau_m) + n_{em}(k) + n_{rm}(k)   (1)
x_n(k) = a_n s(k - \tau_n) + n_{en}(k) + n_{rn}(k)   (2)

where n_em(k) and n_en(k) are the additive environmental noise received by microphones m and n at time k; n_rm(k) and n_rn(k) are the multipath reflection noise received by microphones m and n at time k; a_m and a_n are the amplitude attenuation factors of the signals received by microphones m and n; τ_m and τ_n are the propagation times of the sound source signal to microphones m and n; and s(k) is the sound source signal.
Neglecting the effect of reverberation, the correlation function of x_m(k) and x_n(k) is:

R_{x_m x_n}(\tau) = E[\, x_m(k)\, x_n(k - \tau) \,]

Substituting equations (1) and (2) into this expression gives:

R_{x_m x_n}(\tau) = a_m a_n R_{ss}[\tau - (\tau_m - \tau_n)] + a_m R_{s n_{en}}(\tau) + a_n R_{n_{em} s}(\tau) + R_{n_{em} n_{en}}(\tau)   (3)

where R_ss[τ − (τ_m − τ_n)] is the correlation function of s(k − τ_m) and s(k − τ_n), R_{s n_en}(τ) is the correlation function of s(k) and n_en(k), R_{n_em s}(τ) is the correlation function of s(k) and n_em(k), and R_{n_em n_en}(τ) is the correlation function of n_em(k) and n_en(k);
let s (k), n em (k) And n en (k) Are not related to each other, then equation (3) can be written as:
in the formula, τ mn =τ m -τ n ,R ss (τ) is the autocorrelation function of the sound source s (t).
When tau-tau
mn When the value is not less than 0, the reaction time is not less than 0,
maximum values can be obtained and thus can be obtained from
Maximum value estimation of time delay tau of signals received by two microphone elements
mn 。
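A discrete toy sketch of this delay estimate: x_n is a 5-sample-delayed copy of x_m (so τ_m − τ_n = −5 in the notation above), and the lag maximizing the sample cross-correlation R(τ) = Σ_k x_m(k)·x_n(k − τ) recovers τ_mn. The two-tone test waveform is an arbitrary assumption:

```python
import math

def cross_corr_delay(xm, xn, max_lag):
    """Return the lag tau that maximizes R(tau) = sum_k xm[k] * xn[k - tau]."""
    def r(tau):
        return sum(xm[k] * xn[k - tau] for k in range(len(xm))
                   if 0 <= k - tau < len(xn))
    return max(range(-max_lag, max_lag + 1), key=r)

N = 200
s = [math.sin(0.3 * k) + 0.5 * math.sin(0.07 * k) for k in range(N)]
xm = s                      # microphone m: tau_m = 0
xn = [0.0] * 5 + s[:-5]     # microphone n hears the source 5 samples later
tau_mn = cross_corr_delay(xm, xn, max_lag=20)  # expect tau_m - tau_n = -5
```

The maximizing lag is negative here because, under the convention R(τ) = E[x_m(k)x_n(k − τ)], a source arriving later at microphone n gives τ_mn = τ_m − τ_n < 0.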
From the relationship between the cross-correlation function and the cross-power spectrum:

R_{x_m x_n}(\tau) = \int G_{x_m x_n}(f)\, e^{j 2\pi f \tau}\, df   (4)

The generalized cross-correlation (GCC) is obtained by adding a weighting function to equation (4):

R^{(g)}_{x_m x_n}(\tau) = \int \psi(f)\, G_{x_m x_n}(f)\, e^{j 2\pi f \tau}\, df   (5)

where ψ(f) is the weighting function, G_{x_m x_n}(f) is the cross-power spectrum of the two signals, f is the signal frequency, and τ is the time delay between x_m(k) and x_n(k). The PHAT weighting function is:

\psi_{\mathrm{PHAT}}(f) = \frac{1}{\left| G_{x_m x_n}(f) \right|}
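The GCC-PHAT of equation (5) can be sketched in discrete form with standard-library DFTs (a plain O(N²) DFT for brevity). Here x_n is a circular 7-sample delay of x_m, and the PHAT-whitened cross-spectrum yields a sharp correlation peak at the lag corresponding to τ_mn = −7; the test waveform is an arbitrary assumption:

```python
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def gcc_phat(xm, xn):
    """R^(g)(tau): inverse DFT of the PHAT-whitened cross-power spectrum."""
    Xm, Xn = dft(xm), dft(xn)
    G = [a * b.conjugate() for a, b in zip(Xm, Xn)]         # cross-power spectrum
    Gw = [g / abs(g) if abs(g) > 1e-12 else 0j for g in G]  # psi_PHAT = 1/|G|
    return [c.real for c in idft(Gw)]

N = 128
s = [math.sin(0.4 * k) + math.sin(0.11 * k + 1.0) for k in range(N)]
xm = s
xn = s[-7:] + s[:-7]                            # circular shift: xn(k) = s(k - 7)
r = gcc_phat(xm, xn)
peak = max(range(N), key=lambda t: r[t])
tau_mn = peak - N if peak > N // 2 else peak    # wrap to a signed lag
```

Because PHAT discards magnitude and keeps only phase, the whitened spectrum is a pure linear-phase term and the inverse DFT concentrates essentially all energy at the true delay, which is why PHAT is favoured in reverberant conditions.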
Feature extraction of data sets
GCC-PHAT is selected as the basis of the input features. The cropping after GCC-PHAT is explained taking an 8-element uniform linear array as an example. The array element spacing is set to 0.1 m, and the array microphones are combined in pairs, giving 28 microphone pairs.
For each microphone pair, the GCC is calculated and phase-transformed (GCC-PHAT) for every 0.1 s of signal. The maximum spacing between two microphone elements is 0.8 m, so the maximum time delay between array elements is τ = 0.8/340 ≈ 2.353 ms (taking the speed of sound as 340 m/s). With an incident-signal sampling rate of 44.1 kHz, the GCC peak representing the time delay always lies within the central n = 44100·τ ≈ 104 points, so each GCC vector is cropped to 104 points. The input matrix dimension is thus 28 × 104.
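The dimension bookkeeping above can be checked in a few lines (the 0.8 m maximum spacing is the value stated in the text):

```python
M = 8                      # array elements
fs, c = 44100, 340.0       # sampling rate (Hz), speed of sound (m/s)

pairs = M * (M - 1) // 2   # microphones combined in pairs -> 28 pairs
tau_max = 0.8 / c          # maximum inter-element delay from the 0.8 m spacing
n = round(fs * tau_max)    # central GCC points kept per pair -> ~104
input_shape = (pairs, n)   # the 28 x 104 input matrix
```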
Framing weighting to reduce the effects of mute frames
A segment of speech signal is divided into speech frames 0.1 s long. Owing to the nature of speech signals, some frames may contain no speech, so for greater robustness each frame is weighted:
where o_m is the GCC-PHAT vector of the m-th speech frame, D is the number of elements of the GCC matrix, |·| denotes the absolute value, and α is a control parameter; when α = 0 the weight is the mean value of the GCC vector. Using a large α effectively reduces the impact of silent frames on the GCC matrix.
ResNet structure
Fig. 2 is a schematic diagram of the residual structure in ResNet. Assuming the desired mapping is H(x) and the network input is x, the desired mapping of the residual structure becomes F(x) = H(x) − x, and the final output is F(x) + x. F(x) + x is realized by a feedforward neural network with an added summation unit; the summation unit introduces no extra parameters, and the whole network can still be trained with ordinary back-propagation.
Fig. 3 is a schematic diagram of the ResNet structure used in the invention. The network improves on the original ResNet by moving the batch normalization layer (BN) and activation layer (ReLU) from after the convolution to before it (pre-activation). Except for the first one, every convolution block in the figure consists of a batch normalization layer (BN), an activation layer (ReLU) and a convolution layer (Conv) connected in sequence. Each residual summation layer (Sum) is preceded by two 3 × 3 convolution blocks whose convolutions keep the feature-map size unchanged with padding of size 1. After every 4 residual summations, a convolution block with a 3 × 3 kernel and padding 1 convolves the input with stride 2, halving the feature-map size and doubling the number of channels. After shrinking three times, the feature map is reduced to 1 × 1 in a final convolution block, passed through the fully connected layer, and fed into the loss layer.
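The pre-activation residual unit can be illustrated with a toy numeric sketch: the two "convolutions" are reduced to scalar linear maps on a vector (no BN, no real 2-D convolution), but the structure, with ReLU applied before each map and a parameter-free summation unit computing F(x) + x, follows the description:

```python
def relu(x):
    return [max(0.0, xi) for xi in x]

def linear(x, w, b):
    """Stand-in for a convolution: elementwise scale-and-shift."""
    return [w * xi + b for xi in x]

def residual_block(x, w1, b1, w2, b2):
    f = linear(relu(x), w1, b1)   # pre-activation: ReLU before the map
    f = linear(relu(f), w2, b2)
    return [fi + xi for fi, xi in zip(f, x)]   # summation unit: F(x) + x

x = [1.0, -2.0, 3.0]
y = residual_block(x, 0.5, 0.1, -0.3, 0.0)
y_identity = residual_block(x, 0.0, 0.0, 0.0, 0.0)  # F == 0 -> y = x
```

The identity case shows why residual learning eases optimization: when the learned residual F collapses to zero, the block degenerates to the identity mapping rather than losing the signal.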
The effects of the present invention can be illustrated by the following examples:
fig. 4 shows simulation experimental conditions. The room size is 8m multiplied by 6m multiplied by 2.5m, the uniform array is placed at the position with the height of 1m, the array element interval of the array is 0.1m, and the number of the array elements is 8. The four walls and ceiling of the room have a reverberation reflection coefficient of 0.95 (ordinary lime walls) and the floor has a reflection coefficient of 0.90.
The data set sound-source signals are clean speech signals; the incidence angle traverses the localization range of each group in 0.05° steps, and the training and test sets are randomly drawn in a 9:1 ratio. The data set size is about 45,000 samples.
FIG. 5 shows a 5.5 m × 3.3 m × 2.3 m anechoic chamber. Loudspeakers placed at different angles relative to the reference array element serve as sound sources, and experimental data are collected with an 8-element linear microphone array with 0.1 m element spacing.
FIG. 6 shows the impact of the number of array elements and the size of the data set used to train and test ResNet on system performance. The abscissa represents the percentage of the data used relative to the total data amount, which is about 45,000 samples. FIG. 6 shows that system performance improves with more array elements and a larger data set; performance is affected more significantly by the data-set size, while adding array elements greatly increases the preprocessing computation and the complexity of the network. It is therefore preferable to train ResNet with more data to improve system performance, rather than to add array elements.
Fig. 7 shows the convergence curves of the neural networks. FIG. 7 (a) is the ResNet convergence curve of the method of the invention, and FIG. 7 (b) is the convergence curve of DOA estimation based on a multi-layer perceptron (MLP); both use the same data set and learning rate. After 40 iterations, the learning rate is halved every 5 iterations until the network no longer converges. It can be seen that ResNet converges more stably when the learning rate is relatively large, effectively preventing the overfitting that occurs with the MLP.
Figure 8 shows the effect of signal-to-noise ratio on system performance. The performance of the invention, the MLP-based DOA estimation method, and the least-squares time-delay estimation method (LS-TDOA) are compared at different signal-to-noise ratios, where the solid lines are test results at random angular positions using about 3000 simulated test signals and the dashed lines are test results at three fixed angles in the anechoic chamber using about 100 speech signals. Figure 8 shows that at high signal-to-noise ratios the differences between the methods are small, but below 10 dB the classical LS-TDOA method almost fails, while both deep-learning-based DOA estimation methods still work effectively and the invention performs better. The RMSE of all methods increases because of electrical noise in the experimental equipment and unavoidable measurement errors, but the advantage of the invention at low signal-to-noise ratios remains significant.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.