CN109490822B - Voice DOA estimation method based on ResNet - Google Patents


Info

Publication number: CN109490822B
Application number: CN201811201570.1A
Authority: CN (China)
Prior art keywords: signal, array, theta, resnet, noise
Legal status: Active (granted)
Original language: Chinese (zh)
Other versions: CN109490822A (application publication, 2019-03-19)
Inventors: 郭业才 (Guo Yecai), 张浩然 (Zhang Haoran), 顾弘毅 (Gu Hongyi)
Current and original assignee: Nanjing University of Information Science and Technology
Priority and filing date: 2018-10-16, application filed by Nanjing University of Information Science and Technology
Granted and published as CN109490822B: 2022-12-20

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S3/00Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S3/80Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received using ultrasonic, sonic or infrasonic waves
    • G01S3/802Systems for determining direction or deviation from predetermined direction

Abstract

The invention discloses a voice DOA estimation method based on ResNet, comprising the following steps. Step 1: simulate a training data set in MATLAB; the data set uses multiple voice signals to traverse the measurement range, and the corresponding angles and voice signals are stored. Step 2: after framing each simulated signal, compute the GCC and apply the phase transform; crop the result according to the array model parameters, then weight and sum the speech frames; store the weighted features and the corresponding incidence angles as a data set. Step 3: initialize ResNet with MATConvNet and train it with the data set. Step 4: coarsely localize the signal under test with wideband MUSIC, and, according to the coarse result, select the grouped ResNet whose group center is closest to the wideband MUSIC result for subsequent precise localization, yielding the DOA estimate. The method effectively solves the problem of inaccurate voice DOA estimation under strong noise and reverberation, and is a DOA estimation method applicable to any array structure.

Description

Voice DOA estimation method based on ResNet
Technical Field
The invention belongs to the technical field of microphone array DOA estimation, and particularly relates to a voice DOA estimation method based on ResNet that can accurately localize speech under strong noise and reverberation.
Background
Direction of arrival (DOA) estimation is one of the important directions of array signal processing and is widely applied in remote automatic speech recognition, teleconferencing, automatic camera steering, and the like. However, it is difficult to obtain an accurate DOA estimate when the signal is distorted by strong noise and room reverberation, so robust DOA estimation under indoor conditions is required. Conventional DOA estimation methods for noisy and reverberant environments can be divided into: (1) subspace methods, such as multiple signal classification (MUSIC) and estimation of signal parameters via rotational invariance techniques (ESPRIT); (2) methods using generalized cross-correlation and least squares (LS); (3) signal synchronization methods, e.g., the steered response power with phase transform (SRP-PHAT) method and the multi-channel cross-correlation (MCCC) method; (4) blind identification of impulse responses, such as adaptive eigenvalue decomposition (AED) and independent component analysis; (5) sparse signal representation methods based on an $\ell_1$-norm penalty; (6) model-based methods, such as maximum likelihood.
The above methods suffer in practice from, e.g., high computational cost and unrealistic assumptions about the signal and noise models.
Disclosure of Invention
The invention aims to provide a voice DOA estimation method based on ResNet, which can effectively solve the problem of inaccurate voice DOA estimation under the condition of strong noise reverberation and is a DOA estimation method suitable for any array structure.
In order to achieve the above purpose, the solution of the invention is:
a speech DOA estimation method based on ResNet comprises the following steps:
step 1, simulating a training data set with MATLAB; the data set uses multiple voice signals to traverse the measurement range, and the corresponding angles and voice signals are stored;
step 2, after framing each simulated signal, computing the GCC and applying the phase transform; cropping the result according to the array model parameters, then weighting and summing the speech frames; storing the weighted features and the corresponding incidence angles as a data set;
step 3, initializing ResNet with MATConvNet and training it with the data set;
and step 4, coarsely localizing the signal under test with wideband MUSIC to obtain a coarse result, and selecting, according to the coarse result, the grouped ResNet whose group center is closest to the wideband MUSIC result for subsequent precise localization, obtaining the DOA estimate.
In step 1, the microphone array has M array elements with spacing d; each element is an identical omnidirectional microphone, and a far-field signal is incident from direction θ. The noise is assumed to be white Gaussian noise independent of the incident signal, with mean 0 and variance σ²; the output of the array at time k is then:
x(k) = a(θ)s(k) + n(k)
where s(k) is the target-source complex amplitude vector at time k, n(k) is the M-dimensional additive-noise complex vector at time k, and a(θ) is the M-dimensional array flow pattern matrix at incidence angle θ.
The expression of the M-dimensional array flow pattern matrix a(θ) is:

$$a(\theta)=\left[1,\ e^{-j\varphi},\ e^{-j2\varphi},\ \cdots,\ e^{-j(M-1)\varphi}\right]^{T},\qquad \varphi=\frac{2\pi d\sin\theta}{\lambda}$$

where λ is the wavelength of the speech signal, d is the array element spacing, and θ is the incidence angle of the speech signal.
In step 1, the covariance matrix of the array signal is R = E[x(k)x^H(k)]. Performing eigendecomposition on R, the eigenvectors corresponding to the K largest eigenvalues form the signal subspace U_s and the remaining eigenvectors form the noise subspace U_N. From the orthogonality between the noise subspace and the signal steering vectors, the spatial spectrum function is obtained as:

$$P_{\mathrm{MUSIC}}(\theta)=\frac{1}{a^{H}(\theta)\,U_{N}U_{N}^{H}\,a(\theta)}$$

where a(θ) is the array flow pattern at incidence angle θ, U_N is the noise subspace of the signal, and a^H(θ) and U_N^H are the conjugate transposes of a(θ) and U_N; the θ at which P_MUSIC(θ) is maximal is the MUSIC localization result.

Applying frequency-division processing to x(k) yields L sub-band signals x(k, f_l), l = 1, …, L; the spatial spectrum function of wideband MUSIC localization is:

$$P_{\mathrm{WMUSIC}}(\theta)=\sum_{l=1}^{L}\frac{1}{a^{H}(\theta,f_{l})\,U_{N}(f_{l})U_{N}^{H}(f_{l})\,a(\theta,f_{l})}$$

where a(θ, f_l) is the array flow pattern of the sub-band signal at frequency f_l and incidence angle θ, U_N(f_l) is the corresponding noise subspace, and a^H(θ, f_l) and U_N^H(f_l) are their conjugate transposes; the θ at which P_WMUSIC(θ) is maximal is the wideband MUSIC localization result.
In step 2, let the actual signals received by the m-th and n-th microphone elements of the array model be x_m(k) and x_n(k), respectively; then:

$$x_{m}(k)=a_{m}s(k-\tau_{m})+n_{em}(k)+n_{rm}(k)\qquad(1)$$
$$x_{n}(k)=a_{n}s(k-\tau_{n})+n_{en}(k)+n_{rn}(k)\qquad(2)$$

where n_em(k) and n_en(k) are the additive noise in the receiving environments of microphones m and n at time k; n_rm(k) and n_rn(k) are the multipath reflection noise received by microphones m and n at time k; a_m and a_n are the amplitude attenuation factors of the signals received by microphones m and n; τ_m and τ_n are the times taken by the sound-source signal to propagate to microphones m and n; and s(k) is the sound-source signal.

Neglecting the effects of reverberation and noise, the correlation function of x_m(k) and x_n(k) is:

$$R_{x_{m}x_{n}}(\tau)=E[x_{m}(k)x_{n}(k-\tau)]$$

Substituting equations (1) and (2) into this expression gives:

$$R_{x_{m}x_{n}}(\tau)=a_{m}a_{n}R_{ss}[\tau-(\tau_{m}-\tau_{n})]+a_{m}R_{sn_{en}}(\tau-\tau_{m})+a_{n}R_{sn_{em}}(\tau+\tau_{n})+R_{n_{em}n_{en}}(\tau)\qquad(3)$$

where R_ss[τ−(τ_m−τ_n)] is the correlation function of s(k−τ_m) and s(k−τ_n), R_{s n_en} is the correlation function of s(k) and n_en(k), R_{s n_em} is the correlation function of s(k) and n_em(k), and R_{n_em n_en} is the correlation function of n_em(k) and n_en(k);

letting s(k), n_em(k) and n_en(k) be mutually uncorrelated, equation (3) is written as:

$$R_{x_{m}x_{n}}(\tau)=a_{m}a_{n}R_{ss}(\tau-\tau_{mn})$$

where τ_mn = τ_m − τ_n and R_ss(τ) is the autocorrelation function of the sound source. When τ − τ_mn = 0, R_{x_m x_n}(τ) takes its maximum value, so the delay τ_mn between the signals received by the two microphone elements can be estimated from the maximum of R_{x_m x_n}(τ).

From the relationship between the cross-correlation function and the cross-power spectrum:

$$R_{x_{m}x_{n}}(\tau)=\int_{-\infty}^{+\infty}G_{x_{m}x_{n}}(f)\,e^{j2\pi f\tau}\,df\qquad(4)$$

The generalized cross-correlation is obtained by adding a weighting function to equation (4):

$$R_{x_{m}x_{n}}^{g}(\tau)=\int_{-\infty}^{+\infty}\psi_{mn}(f)\,G_{x_{m}x_{n}}(f)\,e^{j2\pi f\tau}\,df\qquad(5)$$

where ψ_mn(f) is the weighting function, G_{x_m x_n}(f) is the cross-power spectrum of the two signals, f is the signal frequency, and τ is the delay between x_m(k) and x_n(k). The PHAT weighting function is:

$$\psi_{mn}(f)=\frac{1}{\left|G_{x_{m}x_{n}}(f)\right|}$$
In step 3, if the desired mapping of ResNet is H(x) and the network input is x, the desired mapping of the residual structure becomes F(x) = H(x) − x, and the final output is F(x) + x; F(x) + x is realized by a feedforward neural network with an added summation unit.
In step 3, the data set is divided into several groups by incidence angle: each group is 10° wide, the step between adjacent groups is 7.5°, and one ResNet is trained on the data set of each group.
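For illustration only, the following minimal Python sketch shows one way the overlapping grouping and the coarse-to-fine selection of step 4 could be implemented; the 10° width and 7.5° step come from the text, while the 0–180° measurement range, the function names, and the `resnets` list are assumptions.

```python
import numpy as np

def make_groups(range_start=0.0, range_end=180.0, width=10.0, step=7.5):
    """Overlapping angle groups: 10 deg wide, advancing 7.5 deg per group
    (width/step from the text; the 0-180 deg range is an assumption)."""
    starts = np.arange(range_start, range_end - width + step / 2, step)
    return [(s, s + width, s + width / 2.0) for s in starts]  # (low, high, center)

def select_group(groups, coarse_theta):
    """Pick the group whose center is closest to the wideband-MUSIC coarse result."""
    centers = np.array([c for (_, _, c) in groups])
    return int(np.argmin(np.abs(centers - coarse_theta)))

groups = make_groups()
idx = select_group(groups, coarse_theta=43.2)   # hypothetical coarse estimate
# theta_hat = resnets[idx].predict(features)    # fine DOA from that group's ResNet
```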
With this scheme, the method is based on a residual convolutional network (ResNet) applied to sound-source DOA estimation: MATLAB is used to simulate a large number of microphone-array signals containing reverberation and noise to form a data set, and ResNet learns from this data set the characteristics of array signals under noise and reverberation. To reduce network complexity and speed up training, the whole localization range is divided into several groups; the traditional wideband MUSIC first coarsely estimates the sound source, and the neural network trained on the data set of the corresponding group is then selected according to the coarse estimate for precise localization. The method has wide application in remote automatic speech recognition, automatic camera steering, hearing aids, in-vehicle hands-free voice communication, remote video conferencing, robot audition, and the like.
Compared with existing DOA estimation methods, the method of the invention has the following advantages:
(1) No unrealistic assumptions are made about the received signal; ResNet can learn the characteristics of environmental noise and reverberation, so localization is more accurate in strongly noisy and reverberant environments;
(2) Deep learning has mature GPU acceleration, so computation is fast in actual use and real-time performance is good;
(3) There is no restriction on the array structure; ResNet speech DOA estimation systems for different array structures can be obtained by using data sets from those structures.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the residual structure in ResNet;
FIG. 3 is a schematic diagram of a ResNet structure for use with the present invention;
FIG. 4 is a schematic diagram of simulation experiment conditions;
FIG. 5 is an anechoic chamber of 5.5 m × 3.3 m × 2.3 m;
FIG. 6 is a graph of the impact of array element number and dataset size on system performance;
FIG. 7 is a convergence curve of a neural network;
fig. 8 is a graph of the effect of signal-to-noise ratio on system performance.
Detailed Description
As shown in FIG. 1, the invention provides a ResNet-based speech DOA estimation method that extracts features from the generalized cross-correlation (GCC), uses ResNet to learn the nonlinear mapping between the features and the DOA from a large number of simulated microphone-array signals, and uses several ResNets, on top of a coarse estimate from the traditional wideband MUSIC method, to perform precise and robust DOA estimation.
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
Wideband MUSIC positioning
The microphone array has M array elements with spacing d; each element is an identical omnidirectional microphone, and a far-field signal is incident at angle θ. Assume the noise is white Gaussian noise independent of the incident signal, with mean 0 and variance σ²; the output of the array at time k is then:
x(k) = a(θ)s(k) + n(k)
where s(k) is the target-source complex amplitude vector at time k, n(k) is the M-dimensional additive-noise complex vector at time k, and a(θ) is the M-dimensional array flow pattern matrix at incidence angle θ.
$$a(\theta)=\left[1,\ e^{-j\varphi},\ e^{-j2\varphi},\ \cdots,\ e^{-j(M-1)\varphi}\right]^{T},\qquad \varphi=\frac{2\pi d\sin\theta}{\lambda}$$

where λ is the wavelength of the speech signal, d is the array element spacing, and θ is the incidence angle of the speech signal.
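As a minimal sketch of this signal model (in Python/NumPy rather than the MATLAB named in the text; all concrete numbers are illustrative assumptions), the array output x(k) = a(θ)s(k) + n(k) for a uniform linear array can be simulated as:

```python
import numpy as np

def steering_vector(theta_deg, M, d, lam):
    """ULA array flow pattern a(theta) with phase step phi = 2*pi*d*sin(theta)/lambda."""
    phi = 2.0 * np.pi * d * np.sin(np.deg2rad(theta_deg)) / lam
    return np.exp(-1j * phi * np.arange(M))              # shape (M,)

M, d, theta = 8, 0.1, 30.0                               # 8 elements, 0.1 m spacing (from the text)
lam = 340.0 / 1000.0                                     # wavelength of a 1 kHz tone at c = 340 m/s
K = 1024                                                 # number of snapshots
s = np.exp(1j * 2.0 * np.pi * 0.05 * np.arange(K))       # narrowband source samples s(k)
sigma2 = 0.1                                             # noise variance
n = np.sqrt(sigma2 / 2) * (np.random.randn(M, K) + 1j * np.random.randn(M, K))
x = steering_vector(theta, M, d, lam)[:, None] * s + n   # x(k) = a(theta) s(k) + n(k)
```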
The covariance matrix of the array signal is R = E[x(k)x^H(k)]. Performing eigendecomposition on R, the eigenvectors corresponding to the K largest eigenvalues form the signal subspace U_s and the remaining eigenvectors form the noise subspace U_N. From the orthogonality between the noise subspace and the signal steering vectors, the spatial spectrum function is obtained as:

$$P_{\mathrm{MUSIC}}(\theta)=\frac{1}{a^{H}(\theta)\,U_{N}U_{N}^{H}\,a(\theta)}$$

where a(θ) is the array flow pattern at incidence angle θ, U_N is the noise subspace of the signal, and a^H(θ) and U_N^H are the conjugate transposes of a(θ) and U_N; the θ at which P_MUSIC(θ) is maximal is the MUSIC localization result.
because the voice signal is a broadband signal, the frequency division processing is carried out on the received signal x (k) to obtain L self-frequency-band signals x (k, f) l ) L =1, \ 8230;, L. The spatial spectrum function for wideband MUSIC positioning is:
Figure BDA0001830131290000063
wherein, a (theta, f) l ) Is a frequency of f l From the array flow pattern, U, of the band signal with the incident angle theta N Is a noise subspace of the signal, a H (θ,f l ) Is a (theta, f) l ) The conjugate transpose of (a) is performed,
Figure BDA0001830131290000064
is U N The conjugate transpose of (1); p is WMUSIC And theta corresponding to the maximum point is the result of positioning the broadband MUSIC.
Computing the GCC and applying the phase transform (GCC-PHAT)
Let the actual signals received by the m-th and n-th microphone elements of the array model be x_m(k) and x_n(k), respectively; then:

$$x_{m}(k)=a_{m}s(k-\tau_{m})+n_{em}(k)+n_{rm}(k)\qquad(1)$$
$$x_{n}(k)=a_{n}s(k-\tau_{n})+n_{en}(k)+n_{rn}(k)\qquad(2)$$

where n_em(k) and n_en(k) are the additive noise in the receiving environments of microphones m and n at time k; n_rm(k) and n_rn(k) are the multipath reflection noise received by microphones m and n at time k; a_m and a_n are the amplitude attenuation factors of the signals received by microphones m and n; τ_m and τ_n are the times taken by the sound-source signal to propagate to microphones m and n; and s(k) is the sound-source signal.

Neglecting the effects of reverberation and noise, the correlation function of x_m(k) and x_n(k) is:

$$R_{x_{m}x_{n}}(\tau)=E[x_{m}(k)x_{n}(k-\tau)]$$

Substituting equations (1) and (2) into this expression gives:

$$R_{x_{m}x_{n}}(\tau)=a_{m}a_{n}R_{ss}[\tau-(\tau_{m}-\tau_{n})]+a_{m}R_{sn_{en}}(\tau-\tau_{m})+a_{n}R_{sn_{em}}(\tau+\tau_{n})+R_{n_{em}n_{en}}(\tau)\qquad(3)$$

where R_ss[τ−(τ_m−τ_n)] is the correlation function of s(k−τ_m) and s(k−τ_n), R_{s n_en} is the correlation function of s(k) and n_en(k), R_{s n_em} is the correlation function of s(k) and n_em(k), and R_{n_em n_en} is the correlation function of n_em(k) and n_en(k).

Letting s(k), n_em(k) and n_en(k) be mutually uncorrelated, equation (3) can be written as:

$$R_{x_{m}x_{n}}(\tau)=a_{m}a_{n}R_{ss}(\tau-\tau_{mn})$$

where τ_mn = τ_m − τ_n and R_ss(τ) is the autocorrelation function of the sound source. When τ − τ_mn = 0, R_{x_m x_n}(τ) takes its maximum value, so the delay τ_mn between the signals received by the two microphone elements can be estimated from the maximum of R_{x_m x_n}(τ).

From the relationship between the cross-correlation function and the cross-power spectrum:

$$R_{x_{m}x_{n}}(\tau)=\int_{-\infty}^{+\infty}G_{x_{m}x_{n}}(f)\,e^{j2\pi f\tau}\,df\qquad(4)$$

The generalized cross-correlation (GCC) is obtained by adding a weighting function to equation (4):

$$R_{x_{m}x_{n}}^{g}(\tau)=\int_{-\infty}^{+\infty}\psi_{mn}(f)\,G_{x_{m}x_{n}}(f)\,e^{j2\pi f\tau}\,df\qquad(5)$$

where ψ_mn(f) is the weighting function, G_{x_m x_n}(f) is the cross-power spectrum of the two signals, f is the signal frequency, and τ is the delay between x_m(k) and x_n(k). The PHAT weighting function is:

$$\psi_{mn}(f)=\frac{1}{\left|G_{x_{m}x_{n}}(f)\right|}$$
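A minimal frequency-domain sketch of GCC-PHAT for one microphone pair (the small constant in the denominator is an assumed numerical guard, not part of the formula):

```python
import numpy as np

def gcc_phat(xm, xn, fs=44100):
    """GCC with PHAT weighting: inverse FFT of G_mn(f) / |G_mn(f)|."""
    n_fft = 2 * max(len(xm), len(xn))              # zero-pad to avoid circular wrap-around
    Xm = np.fft.rfft(xm, n=n_fft)
    Xn = np.fft.rfft(xn, n=n_fft)
    G = Xm * np.conj(Xn)                           # cross-power spectrum G_{x_m x_n}(f)
    G /= np.abs(G) + 1e-12                         # PHAT weighting psi(f) = 1 / |G(f)|
    r = np.fft.fftshift(np.fft.irfft(G, n=n_fft))  # correlation with zero lag centered
    tau_hat = (np.argmax(r) - n_fft // 2) / fs     # delay estimate from the peak
    return r, tau_hat
```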
Feature extraction of data sets
GCC-PHAT is selected as the basis of the input features. The cropping performed after GCC-PHAT is explained using an 8-element uniform linear array as an example. The array spacing is set to 0.1 m, and the microphones are combined in pairs, giving

$$C_{8}^{2}=28$$

microphone pairs. For each pair and each 0.1 s signal, the GCC is computed and the phase transform applied (GCC-PHAT). The maximum spacing between two microphone elements is 0.8 m, so the maximum delay between array elements is τ = 0.8/340 ≈ 2.353 ms (sound speed 340 m/s); with an incident-signal sampling rate of 44.1 kHz, the GCC peak representing the delay always lies within the central n = 44100·τ ≈ 104 points. The input matrix dimension is therefore 28 × 104.
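To illustrate the cropping, a sketch (reusing `gcc_phat` from above; the function name and argument defaults are assumptions) that keeps the 104 central lags for each of the 28 pairs of the 8-element array:

```python
import numpy as np
from itertools import combinations

def gcc_feature_matrix(frame, fs=44100, d_max=0.8, c=340.0):
    """frame: (M, samples) array signals for one 0.1 s frame -> (28, 104) feature matrix."""
    M = frame.shape[0]
    n_keep = int(round(fs * d_max / c))          # 44100 * 0.8 / 340 = 104 central lags
    rows = []
    for m, k in combinations(range(M), 2):       # C(8, 2) = 28 microphone pairs
        r, _ = gcc_phat(frame[m], frame[k], fs)
        mid = len(r) // 2                        # zero-lag position after fftshift
        rows.append(r[mid - n_keep // 2: mid - n_keep // 2 + n_keep])
    return np.stack(rows)                        # shape (28, 104)
```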
Frame weighting to reduce the effect of silent frames
A speech signal is divided into 0.1 s frames. Owing to the nature of speech, some frames may contain no speech, so each frame is weighted for greater robustness:

$$w_{m}=\left(\sum_{d=1}^{D}\left|o_{m}(d)\right|\right)^{\alpha},\qquad \hat{o}=\frac{\sum_{m}w_{m}\,o_{m}}{\sum_{m}w_{m}}$$

where o_m is the GCC-PHAT of the m-th speech frame, D is the number of elements of the GCC matrix, |·| is the absolute value, and α is a control parameter; with α = 0 the result is the plain average of the GCC vectors. A large α effectively reduces the impact of silent frames on the GCC matrix.
ResNet structure
Fig. 2 is a schematic diagram of the residual structure in ResNet. Assuming the desired mapping is H(x) and the network input is x, the desired mapping of the residual structure becomes F(x) = H(x) − x, and the final output is F(x) + x. F(x) + x is realized by a feedforward neural network with an added summation unit; the summation unit adds no parameters, and the whole network can still be trained by ordinary back-propagation.
Fig. 3 is a schematic diagram of the ResNet structure used in the invention. The ResNet used here improves on the original: the batch-normalization (BN) and activation (ReLU) layers are moved from after each convolution layer to before it (pre-activation). Every convolution block except the first consists of a batch-normalization layer (BN), an activation layer (ReLU) and a convolution layer (Conv) connected in sequence. Each residual summation layer (Sum) is preceded by two 3 × 3 convolution blocks; with padding of size 1, each convolution keeps the feature-map size unchanged. After every four residual units, a convolution block with a 3 × 3 kernel and padding 1 convolves the input with stride 2, halving the feature-map size and doubling the number of channels. After three such reductions, one convolution block reduces the dimensions to 1 × 1, which passes through the fully connected layer and is fed into the loss layer.
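The patent builds and trains the network with MATConvNet; purely as an illustration of the pre-activation convolution block just described (BN → ReLU → Conv twice, then the residual summation), an equivalent block might be sketched in PyTorch as follows.

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: each convolution block is BN -> ReLU -> 3x3 Conv
    (padding 1 keeps the size), and the Sum layer adds the identity x."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        f = self.conv1(torch.relu(self.bn1(x)))     # first BN-ReLU-Conv block
        f = self.conv2(torch.relu(self.bn2(f)))     # second BN-ReLU-Conv block
        return f + x                                # Sum layer: F(x) + x
```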
The effects of the present invention can be illustrated by the following examples:
Fig. 4 shows the simulation experiment conditions. The room size is 8 m × 6 m × 2.5 m; the uniform array is placed at a height of 1 m, with element spacing 0.1 m and 8 elements. The four walls and the ceiling have a reverberation reflection coefficient of 0.95 (ordinary lime walls), and the floor has a reflection coefficient of 0.90.
The data-set sound source is a clean speech signal; the incidence angle traverses the localization range of each group in 0.05° steps, and training and test sets are randomly split in a 9:1 ratio. The data set contains about 45,000 samples.
FIG. 5 is a 5.5 m × 3.3 m × 2.3 m anechoic chamber. Loudspeakers placed at different angles relative to the reference array element serve as sound sources, and experimental data are collected with an 8-element linear microphone array with 0.1 m element spacing.
FIG. 6 shows the impact on system performance of the number of array elements and of the size of the data set used to train and test ResNet. The abscissa is the percentage of the data used relative to the total, which is about 45,000 samples. FIG. 6 shows that performance improves with more array elements and with a larger data set; performance is affected more strongly by the data size, while adding array elements greatly increases the preprocessing computation and the network complexity. It is therefore preferable to train ResNet with more data rather than to add array elements.
Fig. 7 shows the convergence curves of the neural networks: FIG. 7(a) is the ResNet convergence curve of the method of the invention, and FIG. 7(b) is the convergence curve of DOA estimation based on a multi-layer perceptron (MLP), both using the same data set and learning rate. After 40 iterations, the learning rate is halved every 5 iterations until the network converges no further. ResNet converges more stably when the learning rate is relatively large, effectively preventing the overfitting that occurs in the MLP.
Fig. 8 shows the effect of signal-to-noise ratio on system performance. The invention, the MLP-based DOA estimation method, and the least-squares time-delay estimation method (LS-TDOA) are compared at different signal-to-noise ratios; the solid lines are results at random angular positions using about 3000 simulated test signals, and the dashed lines are results at three fixed angles in the anechoic chamber using about 100 speech signals. At high signal-to-noise ratios the differences between the methods are small, but below 10 dB the classical LS-TDOA method almost fails, while both deep-learning-based DOA estimation methods still work effectively and the invention performs better. The RMSE of all methods increases owing to the electrical noise of the experimental equipment and unavoidable measurement errors, but the advantage of the invention at low signal-to-noise ratios remains quite significant.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.

Claims (1)

1. A speech DOA estimation method based on ResNet is characterized by comprising the following steps:
step 1, simulating a training data set with MATLAB; the data set uses multiple voice signals to traverse the measurement range, and the corresponding angles and voice signals are stored;
in step 1, the microphone array has M array elements with spacing d; each element is an identical omnidirectional microphone, and a far-field signal is incident from direction θ; the noise is assumed to be white Gaussian noise independent of the incident signal, with mean 0 and variance σ²; the output of the array at time k is then:
x(k) = a(θ)s(k) + n(k)
where s(k) is the target-source complex amplitude vector at time k, n(k) is the M-dimensional additive-noise complex vector at time k, and a(θ) is the M-dimensional array flow pattern matrix at incidence angle θ;
the expression of the M-dimensional array flow pattern matrix is:

$$a(\theta)=\left[1,\ e^{-j\varphi},\ e^{-j2\varphi},\ \cdots,\ e^{-j(M-1)\varphi}\right]^{T},\qquad \varphi=\frac{2\pi d\sin\theta}{\lambda}$$

where λ is the wavelength of the speech signal, d is the array element spacing, and θ is the incidence angle of the speech signal;
in step 1, the covariance matrix of the array signal is R = E[x(k)x^H(k)]; performing eigendecomposition on R, the eigenvectors corresponding to the K largest eigenvalues form the signal subspace U_s and the remaining eigenvectors form the noise subspace U_N; from the orthogonality between the noise subspace and the signal steering vectors, the spatial spectrum function is obtained as:

$$P_{\mathrm{MUSIC}}(\theta)=\frac{1}{a^{H}(\theta)\,U_{N}U_{N}^{H}\,a(\theta)}$$

where a(θ) is the array flow pattern at incidence angle θ, U_N is the noise subspace of the signal, and a^H(θ) and U_N^H are the conjugate transposes of a(θ) and U_N; the θ at which P_MUSIC(θ) is maximal is the MUSIC localization result;

applying frequency-division processing to x(k) yields L sub-band signals x(k, f_l), l = 1, …, L; the spatial spectrum function of wideband MUSIC localization is:

$$P_{\mathrm{WMUSIC}}(\theta)=\sum_{l=1}^{L}\frac{1}{a^{H}(\theta,f_{l})\,U_{N}(f_{l})U_{N}^{H}(f_{l})\,a(\theta,f_{l})}$$

where a(θ, f_l) is the array flow pattern of the sub-band signal at frequency f_l and incidence angle θ, U_N(f_l) is the corresponding noise subspace, and a^H(θ, f_l) and U_N^H(f_l) are their conjugate transposes; the θ at which P_WMUSIC(θ) is maximal is the wideband MUSIC localization result;
step 2, after framing each simulated signal, computing the GCC and applying the phase transform; cropping the result according to the array model parameters, then weighting and summing the speech frames; storing the weighted features and the corresponding incidence angles as a data set;
in step 2, let the actual signals received by the m-th and n-th microphone elements of the array model be x_m(k) and x_n(k), respectively; then:

$$x_{m}(k)=a_{m}s(k-\tau_{m})+n_{em}(k)+n_{rm}(k)\qquad(1)$$
$$x_{n}(k)=a_{n}s(k-\tau_{n})+n_{en}(k)+n_{rn}(k)\qquad(2)$$

where n_em(k) and n_en(k) are the additive noise in the receiving environments of microphones m and n at time k; n_rm(k) and n_rn(k) are the multipath reflection noise received by microphones m and n at time k; a_m and a_n are the amplitude attenuation factors of the signals received by microphones m and n; τ_m and τ_n are the times taken by the sound-source signal to propagate to microphones m and n; and s(k) is the sound-source signal;

neglecting the effects of reverberation and noise, the correlation function of x_m(k) and x_n(k) is:

$$R_{x_{m}x_{n}}(\tau)=E[x_{m}(k)x_{n}(k-\tau)]$$

substituting equations (1) and (2) into this expression gives:

$$R_{x_{m}x_{n}}(\tau)=a_{m}a_{n}R_{ss}[\tau-(\tau_{m}-\tau_{n})]+a_{m}R_{sn_{en}}(\tau-\tau_{m})+a_{n}R_{sn_{em}}(\tau+\tau_{n})+R_{n_{em}n_{en}}(\tau)\qquad(3)$$

where R_ss[τ−(τ_m−τ_n)] is the correlation function of s(k−τ_m) and s(k−τ_n), R_{s n_en} is the correlation function of s(k) and n_en(k), R_{s n_em} is the correlation function of s(k) and n_em(k), and R_{n_em n_en} is the correlation function of n_em(k) and n_en(k);

letting s(k), n_em(k) and n_en(k) be mutually uncorrelated, equation (3) is written as:

$$R_{x_{m}x_{n}}(\tau)=a_{m}a_{n}R_{ss}(\tau-\tau_{mn})$$

where τ_mn = τ_m − τ_n and R_ss(τ) is the autocorrelation function of the sound source; when τ − τ_mn = 0, R_{x_m x_n}(τ) takes its maximum value, so the delay τ_mn between the signals received by the two microphone elements is estimated from the maximum of R_{x_m x_n}(τ);

from the relationship between the cross-correlation function and the cross-power spectrum:

$$R_{x_{m}x_{n}}(\tau)=\int_{-\infty}^{+\infty}G_{x_{m}x_{n}}(f)\,e^{j2\pi f\tau}\,df\qquad(4)$$

the generalized cross-correlation is obtained by adding a weighting function to equation (4):

$$R_{x_{m}x_{n}}^{g}(\tau)=\int_{-\infty}^{+\infty}\psi_{mn}(f)\,G_{x_{m}x_{n}}(f)\,e^{j2\pi f\tau}\,df\qquad(5)$$

where ψ_mn(f) is the weighting function, G_{x_m x_n}(f) is the cross-power spectrum of the two signals, f is the signal frequency, and τ is the delay between x_m(k) and x_n(k);
step 3, initializing ResNet by using MATConvNet and training by using a data set;
in step 3, if the expected mapping of ResNet is H(x) and the network input is x, the expected mapping of the residual structure becomes F(x) = H(x) − x, and the final output is F(x) + x; F(x) + x is realized by a feedforward neural network with an added summation unit;
in step 3, the data set is divided into several groups by incidence angle: each group is 10° wide, the step between adjacent groups is 7.5°, and one ResNet is trained on the data set of each group;
and step 4, coarsely localizing the signal under test with wideband MUSIC to obtain a coarse result, and selecting, according to the coarse result, the grouped ResNet whose group center is closest to the wideband MUSIC result for subsequent precise localization, obtaining the DOA estimate.
CN201811201570.1A · Priority/filing date: 2018-10-16 · Voice DOA estimation method based on ResNet · Status: Active · Granted publication: CN109490822B (en)

Priority Applications (1)

CN201811201570.1A — priority date: 2018-10-16; filing date: 2018-10-16 — Voice DOA estimation method based on ResNet

Publications (2)

CN109490822A (en) — published 2019-03-19
CN109490822B (en) — granted 2022-12-20

Family

Family ID: 65690610

Family Applications (1): CN201811201570.1A — Active — Voice DOA estimation method based on ResNet

Country Status (1): CN — CN109490822B (en)




Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
CB02 — Change of applicant information:
    Address after: No. 69 Olympic Avenue, Jianye District, Nanjing, Jiangsu Province, 210019
    Applicant after: Nanjing University of Information Science and Technology
    Address before: Yuting Square, 59 Wangqiao Road, Liuhe District, Nanjing, Jiangsu Province, 211500
    Applicant before: Nanjing University of Information Science and Technology
CB02 — Change of applicant information:
    Address after: No. 219 Ningliu Road, Jiangbei New District, Nanjing, Jiangsu, 210032
    Applicant after: Nanjing University of Information Science and Technology
    Address before: No. 69 Olympic Avenue, Jianye District, Nanjing, Jiangsu Province, 210019
    Applicant before: Nanjing University of Information Science and Technology
GR01 — Patent grant