CN112904279B - Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum - Google Patents
Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
- Publication number: CN112904279B (application CN202110059164.1A)
- Authority: CN (China)
- Prior art keywords: SRP-PHAT, subband, frame, sound source
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G01S5/22 — Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
- G06N3/045 — Neural networks; combinations of networks
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L25/30 — Speech or voice analysis characterised by the analysis technique using neural networks
- G10L25/45 — Speech or voice analysis characterised by the type of analysis window
- G10L2021/02082 — Noise filtering, the noise being echo or reverberation of the speech
- G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166 — Microphone arrays; beamforming
Abstract
The invention discloses a sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum, comprising the following steps: a microphone array collects speech signals, and the collected signals are preprocessed by framing and windowing to obtain single-frame signals; the subband SRP-PHAT spatial spectrum matrix of each frame is calculated; the subband SRP-PHAT spatial spectrum matrices of all frames are input into a trained convolutional neural network, which outputs the probability of the speech signal belonging to each azimuth, and the azimuth with the highest probability is taken as the estimate of the sound source azimuth. The invention improves the sound source localization performance of a microphone array in complex acoustic environments and its generalization to the sound source spatial structure, reverberation and noise; the training of the convolutional neural network can be completed offline and the trained network stored in memory, so that only one frame of signal is needed at test time to achieve real-time sound source localization.
Description
Technical Field
The invention belongs to the field of sound source localization, and particularly relates to a sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum.
Background
The sound source localization technology based on microphone arrays has wide application prospects and potential economic value in the front-end processing of speech recognition, speaker recognition and emotion recognition systems, as well as in video conferencing, intelligent robots, smart homes, intelligent vehicle-mounted equipment, hearing aids and the like. Among conventional sound source localization methods, the SRP-PHAT (Steered Response Power - Phase Transform) method is the most popular and commonly used; it achieves localization by detecting the peak of a spatial spectrum, but noise and reverberation often cause the spatial spectrum to exhibit multimodal characteristics. In particular, in a strongly reverberant environment the spatial-spectrum peak produced by reflected sound may exceed the peak of the direct sound, leading to misdetection of the sound source position. In recent years, model-based sound source localization methods have been applied in complex acoustic environments; they model spatial feature parameters to build a mapping between sound source position and those parameters, but current algorithms generalize poorly to unknown environments (noise and reverberation) and their performance needs further improvement. The spatial feature parameters and the modeling method are the main factors affecting the performance of model-based sound source localization.
Disclosure of Invention
The invention aims to: in order to overcome the problems in the prior art, the invention discloses a sound source positioning method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum, which adopts the subband SRP-PHAT spatial spectrum as a spatial characteristic parameter, adopts the convolutional neural network (Convolutional Neural Network, CNN) to model the spatial characteristic parameters of directional voice data under various reverberation and noise environments, can improve the sound source positioning performance of a microphone array under a complex acoustic environment, and improves the generalization capability of the sound source spatial structure, the reverberation and the noise.
The technical scheme is as follows: in order to achieve the above purpose, the invention adopts the following technical scheme: a sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum is characterized by comprising the following steps:
s1, a microphone array collects voice signals, and the collected voice signals are subjected to framing and windowing pretreatment to obtain single-frame signals;
s2, calculating a subband SRP-PHAT spatial spectrum matrix of each frame of signal;
s3, inputting the subband SRP-PHAT space spectrum matrix of all frame signals into the convolutional neural network after training is completed, outputting the probability that the voice signal belongs to each azimuth, and taking the azimuth with the highest probability as the estimated value of the azimuth of the sound source of the voice signal.
Preferably, in step S2, calculating the subband SRP-PHAT spatial spectrum matrix of each frame signal comprises the following steps:
S21, performing a discrete Fourier transform on each frame of signal:

X_m(i,k) = DFT(x_m(i,n)), k = 0, 1, …, K−1

wherein x_m(i,n) is the i-th frame signal of the m-th microphone in the microphone array, m = 1, 2, …, M, M is the number of microphones; X_m(i,k) is the discrete Fourier transform of x_m(i,n), representing the frequency-domain signal of the i-th frame of the m-th microphone; k is the frequency bin; K is the length of the discrete Fourier transform; N is the frame length, K = 2N; DFT(·) denotes the discrete Fourier transform;
S22, designing the impulse response functions of the Gammatone filter bank:

g_j(t) = c·t^(a−1)·e^(−2πb_j·t)·cos(2πf_j·t + φ), t ≥ 0

wherein j denotes the serial number of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:

b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)
A discrete Fourier transform is performed on the impulse response function of each Gammatone filter:

G_j(k) = DFT(g_j(n/f_s)), k = 0, 1, …, K−1

wherein G_j(k) is the frequency-domain expression of the j-th Gammatone filter; k is the frequency bin; K is the length of the discrete Fourier transform; N is the frame length, K = 2N; f_s denotes the signal sampling rate; DFT(·) denotes the discrete Fourier transform;
S23, calculating the subband SRP-PHAT function of each frame of signal:

P(i,j,r) = Σ_{m=1}^{M−1} Σ_{n=m+1}^{M} Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i,k)·X_n*(i,k) / |X_m(i,k)·X_n*(i,k)|] · e^{j2πk·f_s·τ_mn(r)/K}

wherein P(i,j,r) denotes the j-th subband SRP-PHAT function of the i-th frame signal when the beam direction is r; M is the number of microphones in the microphone array; τ_mn(r) denotes the time difference of propagation of the sound wave from beam direction r to the m-th and n-th microphones, calculated as:

τ_mn(r) = (‖r − r_m‖ − ‖r − r_n‖)/c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, and c is the speed of sound in air;
S24, normalizing the subband SRP-PHAT function of each frame of signal:

P̃(i,j,r) = P(i,j,r) / max_r P(i,j,r);
S25, combining all subband SRP-PHAT functions of the same frame signal into matrix form to obtain the subband SRP-PHAT spatial spectrum matrix:

y(i) = [P̃(i,j,r_l)], j = 1, …, J, l = 1, …, L

wherein y(i) denotes the J×L subband SRP-PHAT spatial spectrum matrix of the i-th frame signal, J is the number of subbands, i.e. the number of Gammatone filters, and L is the number of beam directions.
Preferably, in step S23, when the sound source is in the same horizontal plane as the microphone array and is located in the far field of the microphone array, the equivalent calculation formula of τ_mn(r) is:

τ_mn(r) = ζ^T·(r_n − r_m)/c

wherein ζ = [cosθ, sinθ]^T and θ is the azimuth angle of the beam direction r.
Preferably, the convolutional neural network comprises an input layer, three convolutional-pooling layers, a full-connection layer and an output layer which are sequentially connected;
in the convolution-pooling layers, each convolution layer uses 3×3 convolution kernels with stride 1; the numbers of convolution kernels of the three convolution layers are 24, 48 and 96 in sequence; after each convolution operation, batch normalization is applied first, followed by ReLU activation; zero padding is used in the convolution so that the feature dimensions before and after convolution remain unchanged; the pooling layers use max pooling with pooling size 2×2 and stride 2;
after the convolution-pooling layers, the feature data is flattened into a one-dimensional feature vector;
Dropout is applied in the connection between the flattened one-dimensional feature vector and the fully connected layer;
the output layer adopts a Softmax classifier.
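As a sketch only, the architecture described above can be expressed in PyTorch roughly as follows. The 36×72 input size, 36 output classes, and the ceil-mode pooling that yields the 5×9×96 feature map are taken from the embodiment later in the text; the class name is hypothetical, and Softmax is left to the loss function or the inference step, as is conventional in PyTorch:

```python
import torch
import torch.nn as nn

class SubbandSrpPhatCNN(nn.Module):
    """Hypothetical rendering of the described CNN: three conv-pooling
    stages (24/48/96 kernels, 3x3, stride 1, zero padding, batch norm +
    ReLU, 2x2 max pooling), flatten, Dropout, fully connected output."""

    def __init__(self, n_classes=36):
        super().__init__()
        stages, in_ch = [], 1
        for out_ch in (24, 48, 96):
            stages += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(),
                # ceil_mode so a 36x72 input yields the 5x9x96 feature map
                nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*stages)
        self.classifier = nn.Sequential(
            nn.Flatten(),            # 5 * 9 * 96 = 4320 features
            nn.Dropout(p=0.5),
            nn.Linear(5 * 9 * 96, n_classes),
        )

    def forward(self, x):            # x: (batch, 1, J, L) spatial spectra
        # Softmax is applied by the loss during training, or explicitly
        # via torch.softmax at inference
        return self.classifier(self.features(x))
```

At inference, `torch.softmax(model(y), dim=1)` would give the per-azimuth probabilities used in step S3.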
Preferably, the training steps of the convolutional neural network are as follows:
S1, convolve clean speech signals with room impulse responses of different azimuth angles, and add noise and reverberation of different degrees, generating a plurality of directional speech signals of different specified azimuth angles:

x_m(t) = h_m(t) * s(t) + v_m(t), m = 1, 2, …, M

wherein x_m(t) denotes the directional speech signal of a specified azimuth angle received by the m-th microphone in the microphone array; m is the serial number of the microphone, M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth angle to the m-th microphone; v_m(t) denotes noise;
S2, perform framing and windowing preprocessing on all directional speech signals to obtain single-frame signals, and calculate the subband SRP-PHAT spatial spectrum matrix of each frame;
S3, take the subband SRP-PHAT spatial spectrum matrices of all directional speech signals as training samples, take the specified azimuth angles of the directional speech signals as the class labels of the corresponding training samples, use the training samples and class labels as the training data set, and train the convolutional neural network by minimizing the loss function with a stochastic gradient descent with momentum algorithm.
Beneficial effects: the invention has the following notable beneficial effects:
1. the invention can improve the sound source positioning performance of the microphone array in a complex acoustic environment and the generalization capability of the sound source space structure, reverberation and noise;
2. the invention adopts the subband SRP-PHAT spatial spectrum as the spatial characteristic parameter, and the parameter not only can represent the whole acoustic environment information, but also has the advantage of strong robustness; modeling spatial feature parameters of directional voice data in various reverberation and noise environments by adopting a convolutional neural network, establishing a mapping relation between azimuth and the spatial feature parameters, and converting a sound source localization problem into a multi-classification problem;
3. the invention can finish the training process of the convolutional neural network offline, store the trained convolutional neural network in the memory, and realize real-time sound source localization only by one frame of signal during testing.
Drawings
FIG. 1 is a flow chart of an algorithm of the present invention;
FIG. 2 is a diagram of a model structure of a convolutional neural network in accordance with the present invention;
FIG. 3 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the test environment and the training environment are consistent and the reverberation time is 0.5 s;
FIG. 4 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the test environment and the training environment are consistent and the reverberation time is 0.8 s;
FIG. 5 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the noise environments of the test environment and the training environment are inconsistent and the reverberation time is 0.5 s;
FIG. 6 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the noise environments of the test environment and the training environment are inconsistent and the reverberation time is 0.8 s;
FIG. 7 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the reverberation of the test environment and the training environment is inconsistent and the reverberation time of the test environment is 0.6 s;
FIG. 8 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the reverberation of the test environment and the training environment is inconsistent and the reverberation time of the test environment is 0.9 s.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
The subband SRP-PHAT spatial spectrum characterizes the spatial information of the whole acoustic environment, including sound source azimuth, room size, room reflection characteristics and the like; it is strongly robust and can serve as the spatial feature parameter of a localization system. A deep neural network can imitate the information-processing mode of the nervous system, can describe the fusion relations and structural information among spatial feature parameters, and has strong expressive and modeling capability, while requiring no assumptions on the data distribution during modeling. Among deep neural networks, the convolutional neural network is a type of neural network specialized in processing data with a grid-like structure and is applied to image or time-series data. The speech signal collected by a microphone array is exactly such time-series data.
Therefore, the invention provides a sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum, which is shown in figure 1 and comprises the following steps:
step one: convolving the clean speech signal with room impulse responses of different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of directional speech signals of different specified azimuth angles, namely microphone array signals:
x m (t)=h m (t)*s(t)+v m (t),m=1,2,...,M
wherein x_m(t) denotes the directional speech signal of a specified azimuth angle received by the m-th microphone in the microphone array; m is the serial number of the microphone, m = 1, 2, …, M, M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth angle to the m-th microphone and is related to the sound source azimuth and the room reverberation; v_m(t) denotes noise.
In this embodiment, the microphone array is set as a uniform circular array composed of 6 omnidirectional microphones with an array radius of 0.1 m. The sound source is set in the same horizontal plane as the microphone array and located in its far field. The direction directly in front of the horizontal plane is defined as 90°; the azimuth angle of the sound source ranges over [0°, 360°) at 10° intervals, and the number of training azimuths is denoted F, with F = 36. The reverberation times of the training data include 0.5 s and 0.8 s, and the Image method is used to generate the room impulse responses h_m(t) for the different azimuth angles at each reverberation time. v_m(t) is Gaussian white noise, and the signal-to-noise ratios of the training data include 0 dB, 5 dB, 10 dB, 15 dB and 20 dB.
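A minimal numpy sketch of the signal model x_m(t) = h_m(t)*s(t) + v_m(t) described above. The helper `make_directional_signal` and its per-channel noise-scaling scheme are illustrative assumptions; generating the room impulse responses themselves (e.g. with the Image method) is not shown:

```python
import numpy as np

def make_directional_signal(s, rirs, snr_db, rng):
    """Sketch of x_m(t) = h_m(t) * s(t) + v_m(t) for one specified azimuth.

    s: clean speech signal; rirs: list of M room impulse responses h_m
    (equal lengths assumed); snr_db: SNR of the added white Gaussian noise.
    """
    channels = []
    for h in rirs:
        clean = np.convolve(h, s)                        # h_m(t) * s(t)
        noise = rng.standard_normal(clean.shape)
        # scale v_m(t) so that 10*log10(P_clean / P_noise) = snr_db
        noise *= np.sqrt(np.mean(clean ** 2) /
                         (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
        channels.append(clean + noise)
    return np.stack(channels)                            # (M, T)
```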
Step two: preprocess the microphone array signals obtained in step one to obtain single-frame signals.
Preprocessing includes framing and windowing, wherein:
The framing method is as follows: with a preset frame length and frame shift, the directional speech signal x_m(t) of the specified azimuth angle at the m-th microphone is divided into a plurality of single-frame signals x_m(iN+n), where i is the frame number, n denotes the sample index within one frame, 0 ≤ n < N, and N is the frame length. In this embodiment the signal sampling rate f_s is 16 kHz, the frame length N is 512 (i.e., 32 ms), and the frame overlap is 0.
The windowing method is as follows: x_m(i,n) = w_H(n)·x_m(iN+n), where x_m(i,n) is the i-th frame signal of the m-th microphone after windowing and w_H(n) = 0.54 − 0.46·cos(2πn/(N−1)) is a Hamming window.
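The framing, windowing and (subsequent) DFT steps for one channel can be sketched with numpy as follows; `frames_to_spectra` is a hypothetical helper using the embodiment's values N = 512, K = 2N = 1024 and non-overlapping frames:

```python
import numpy as np

def frames_to_spectra(x, N=512, K=1024):
    """Split one microphone signal into non-overlapping frames x_m(iN+n),
    apply the Hamming window w_H(n), and take the K-point DFT of each frame."""
    n_frames = len(x) // N
    w = np.hamming(N)                                     # w_H(n)
    X = np.empty((n_frames, K), dtype=complex)
    for i in range(n_frames):
        X[i] = np.fft.fft(w * x[i * N:(i + 1) * N], n=K)  # X_m(i, k)
    return X
```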
And thirdly, extracting spatial characteristic parameters of the microphone array signals, namely a subband SRP-PHAT spatial spectrum matrix. The method specifically comprises the following steps:
(3-1) performing discrete fourier transform on each frame of the signal obtained in the step two, and converting the time domain signal into a frequency domain signal.
The discrete Fourier transform is calculated as:

X_m(i,k) = DFT(x_m(i,n)), k = 0, 1, …, K−1

wherein X_m(i,k) is the discrete Fourier transform of x_m(i,n), representing the frequency-domain signal of the i-th frame of the m-th microphone; k is the frequency bin; K is the length of the discrete Fourier transform, K = 2N; DFT(·) denotes the discrete Fourier transform. The length of the discrete Fourier transform is set to 1024 in this embodiment.
(3-2) designing a gammatine filter bank.
g_j(t) is the impulse response function of the j-th Gammatone filter, expressed as:

g_j(t) = c·t^(a−1)·e^(−2πb_j·t)·cos(2πf_j·t + φ), t ≥ 0

wherein j denotes the serial number of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:

b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)
In this embodiment, the order a is 4, the phase φ is set to 0, the number of Gammatone filters is 36, i.e., j = 1, 2, …, 36, and the center frequencies f_j of the Gammatone filters lie in the range [200 Hz, 8000 Hz].
A discrete Fourier transform is performed on the impulse response function of each Gammatone filter to obtain its frequency-domain expression:

G_j(k) = DFT(g_j(n/f_s)), k = 0, 1, …, K−1

wherein G_j(k) is the discrete Fourier transform of g_j(n/f_s) and denotes the frequency-domain expression of the j-th Gammatone filter; k is the frequency bin; K is the length of the discrete Fourier transform, K = 2N; DFT(·) denotes the discrete Fourier transform; f_s denotes the sampling rate. The length of the discrete Fourier transform is set to 1024 in this embodiment.
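The Gammatone filter bank above can be sketched in numpy as below, using the embodiment's values (36 filters, order a = 4, phase 0, b_j = 1.109·ERB(f_j) as given in the text). The linear center-frequency spacing over [200 Hz, 8000 Hz] and the peak-gain normalization are this sketch's assumptions — the text specifies only the frequency range:

```python
import numpy as np

def gammatone_bank(J=36, fs=16000, N=512, K=1024, a=4, f_lo=200.0, f_hi=8000.0):
    """Frequency responses G_j(k) of a J-channel Gammatone filter bank."""
    t = np.arange(N) / fs                               # sampled times n / f_s
    G = np.empty((J, K), dtype=complex)
    for j, fc in enumerate(np.linspace(f_lo, f_hi, J)):  # assumed spacing
        erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)         # ERB(f_j)
        b = 1.109 * erb                                 # attenuation factor b_j
        g = t ** (a - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        g /= np.abs(g).max()                            # gain c (assumed normalization)
        G[j] = np.fft.fft(g, n=K)                       # K-point DFT of g_j(n / f_s)
    return G
```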
(3-3) Calculate the subband SRP-PHAT function of each frame of signal:

P(i,j,r) = Σ_{m=1}^{M−1} Σ_{n=m+1}^{M} Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i,k)·X_n*(i,k) / |X_m(i,k)·X_n*(i,k)|] · e^{j2πk·f_s·τ_mn(r)/K}

wherein P(i,j,r) denotes the j-th subband SRP-PHAT function of the i-th frame signal when the beam direction of the array is r; (·)* denotes complex conjugation; τ_mn(r) denotes the time difference of propagation of the sound wave from beam direction r to the m-th and n-th microphones, calculated as:

τ_mn(r) = (‖r − r_m‖ − ‖r − r_n‖)/c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, c is the speed of sound in air (about 342 m/s at normal temperature), f_s is the signal sampling rate, and ‖·‖ denotes the 2-norm.
In this embodiment, the sound source and the microphone array are set in the same horizontal plane and the sound source is located in the far field of the microphone array, so the equivalent calculation formula of τ_mn(r) is:

τ_mn(r) = ζ^T·(r_n − r_m)/c

wherein ζ = [cosθ, sinθ]^T and θ is the azimuth angle of the beam direction r. τ_mn(r) is independent of the received signal and can be computed offline and stored in memory.
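With the embodiment's geometry (uniform circular array of 6 microphones, radius 0.1 m; 72 beam azimuths at 5° spacing), the offline τ_mn(r) table can be sketched as below. The microphone placement angles are an assumption — the text does not specify them:

```python
import numpy as np

C_SOUND = 342.0  # speed of sound in air (m/s), as in the text

def far_field_tdoas(mic_xy, azimuths_deg):
    """tau_mn(r) = zeta^T (r_n - r_m) / c for all mic pairs and beam azimuths,
    with zeta = [cos(theta), sin(theta)]^T. Returns an array tau[l, m, n]."""
    theta = np.deg2rad(np.asarray(azimuths_deg, dtype=float))
    zeta = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # (L, 2)
    diff = mic_xy[None, :, :] - mic_xy[:, None, :]            # (M, M, 2): r_n - r_m
    return np.einsum('ld,mnd->lmn', zeta, diff) / C_SOUND

# Embodiment geometry (mic placement angles assumed evenly spaced)
mic_angles = np.deg2rad(np.arange(6) * 60.0)
mic_xy = 0.1 * np.stack([np.cos(mic_angles), np.sin(mic_angles)], axis=1)
taus = far_field_tdoas(mic_xy, np.arange(0, 360, 5))          # (72, 6, 6)
```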
The subband SRP-PHAT function P(i,j,r) is normalized as follows:

P̃(i,j,r) = P(i,j,r) / max_r P(i,j,r)
(3-4) Combine all normalized subband SRP-PHAT functions of the same frame signal into matrix form to obtain the subband SRP-PHAT spatial spectrum matrix:

y(i) = [P̃(i,j,r_l)], j = 1, …, J, l = 1, …, L

where y(i) denotes the spatial feature parameter of the i-th frame signal, i.e. the J×L subband SRP-PHAT spatial spectrum matrix, and J is the number of subbands, i.e. the number of Gammatone filters; in this embodiment J = 36. The azimuth range of the beam directions of the array in this embodiment is [0°, 360°), with 90° defined as directly in front of the horizontal plane and an interval of 5°, so the number of beam directions L = 72. Taking the number of beam directions L larger than the number of training azimuths F generally improves the accuracy of the spatial feature parameters of the signal and thereby the training accuracy of the CNN model.
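Putting the pieces together, one frame's J×L spatial spectrum matrix y(i) can be sketched as follows. The |G_j(k)|² subband weighting and per-subband max normalization are this sketch's reading of the subband SRP-PHAT formulas above; inputs are the frame spectra, the Gammatone responses, and a precomputed TDOA table:

```python
import numpy as np

def subband_srp_phat_matrix(X, G, taus, fs=16000):
    """Sketch: J x L subband SRP-PHAT spatial spectrum y(i) of one frame.

    X:    (M, K) frame spectra X_m(i, k)
    G:    (J, K) Gammatone frequency responses G_j(k)
    taus: (L, M, M) TDOA table tau_mn(r_l)
    """
    M, K = X.shape
    k = np.arange(K)
    W = np.abs(G) ** 2                               # subband weights |G_j(k)|^2
    P = np.zeros((G.shape[0], taus.shape[0]))        # (J, L)
    for m in range(M):
        for n in range(m + 1, M):                    # microphone pairs m < n
            cross = X[m] * np.conj(X[n])
            cross = cross / (np.abs(cross) + 1e-12)  # PHAT weighting
            # steering phases e^{j 2 pi k f_s tau_mn(r_l) / K}, shape (L, K)
            steer = np.exp(1j * 2 * np.pi * np.outer(taus[:, m, n], k) * fs / K)
            P += np.real(W @ (cross[None, :] * steer).T)
    return P / (np.abs(P).max(axis=1, keepdims=True) + 1e-12)  # per-subband norm
```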
Step four, prepare the training set: following steps one to three, extract the spatial feature parameters of the directional speech signals under all training environments (the training-environment settings are detailed in step one) as the CNN training samples, and label each training sample with its corresponding specified azimuth angle as the class label.
Step five, construct the CNN model and train it on the training samples and class labels obtained in step four as the CNN training data set. This comprises the following steps:
(5-1) setting a CNN model structure.
The CNN structure employed in the present invention is shown in fig. 2 as comprising an input layer followed by three convolution-pooling layers, then a fully connected layer, and finally an output layer.
The input signal of the input layer is the two-dimensional J×L subband SRP-PHAT spatial spectrum matrix, i.e. a training sample; in this embodiment J = 36 and L = 72.
The input layer is followed by three convolution-pooling layers. Each convolution layer uses 3×3 convolution kernels with stride 1, and zero padding is used in the convolution so that the feature dimensions before and after convolution remain unchanged. The numbers of convolution kernels of the 1st, 2nd and 3rd convolution layers are 24, 48 and 96, respectively. After each convolution operation, batch normalization is applied first, followed by ReLU activation. The pooling layers use max pooling with pooling size 2×2 and stride 2.
Through the three convolution-pooling operations, the 36×72 two-dimensional subband SRP-PHAT spatial spectrum matrix becomes 5×9×96 feature data, which is flattened into a 4320×1 one-dimensional feature vector. Neurons in the fully connected layer are connected to all feature data in the previous layer; Dropout is added to this connection to prevent overfitting, with the Dropout rate set to 0.5.
The output layer uses a Softmax classifier: the Softmax function converts the feature data of the fully connected layer into the probability that the speech signal belongs to each azimuth, and the azimuth with the highest probability is taken as the predicted sound source direction.
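As a sanity check on the dimensions above, the following sketch propagates the feature-map shape through the three convolution-pooling stages. Obtaining a 5×9 map (rather than 4×9) from a 36×72 input requires the odd intermediate size 9 to round up at the third pooling stage, so ceil-mode (or padded) pooling is assumed here; the function name is illustrative:

```python
import math

def cnn_feature_shape(h, w, channels_per_layer=(24, 48, 96)):
    """Propagate the feature-map shape through three 'same'-padded 3x3
    convolutions (stride 1), each followed by 2x2 max pooling (stride 2).

    'Same' zero padding keeps h and w unchanged through each convolution;
    pooling halves them, rounding up when the size is odd (ceil mode).
    """
    c = 1  # single-channel J x L input spectrum
    for c_out in channels_per_layer:
        c = c_out                # conv: spatial size unchanged, channels grow
        h = math.ceil(h / 2)     # 2x2 max pool, stride 2, ceil mode
        w = math.ceil(w / 2)
    return h, w, c

h, w, c = cnn_feature_shape(36, 72)
flat = h * w * c
print(h, w, c, flat)  # 5 9 96 4320
```

With floor-mode pooling the result would instead be 4×9×96 = 3456, so the stated 5×9×96 = 4320 implies ceil-mode or padded pooling.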
(5-2) training network parameters of the CNN model.
The training process of the CNN includes two parts: forward propagation and backward propagation.

Forward propagation computes the output of the input data under the current network parameters; it is a layer-by-layer transfer of features. The forward propagation expression at position (u, v) in the d-th layer is:

S^d(u, v) = ReLU((S^{d-1} * w^d)(u, v) + β^d(u, v))

where d denotes the layer index (the d-th layer is a convolution layer), S^d denotes the output of the d-th layer, S^{d-1} denotes the output of the (d-1)-th layer, * denotes the convolution operation, w^d denotes the convolution-kernel weights of the d-th layer, β^d denotes the bias of the d-th layer, and ReLU is the activation function. The layers in the CNN structure adopted by the invention comprise the input layer, the convolution and pooling layers within the convolution-pooling layers, the fully connected layer and the output layer.
When d = D, i.e., the output layer, the expression of the output layer is:

S^D = Softmax((w^D)^T S^{D-1} + β^D)

where S^D denotes the output of the output layer, S^{D-1} denotes the output of the fully connected layer, w^D denotes the weights of the output layer, and β^D denotes the bias of the output layer.
The goal of the back-propagation stage is to minimize the cross-entropy loss function E(w, β):

E(w, β) = -Σ_{f=1}^{F} ŷ_f log(y_f)

where the subscript f denotes the f-th azimuth, ŷ_f denotes the desired output of the output layer at the f-th azimuth, and y_f denotes the actual output of the output layer at the f-th azimuth. F denotes the number of training azimuths; in this embodiment F = 36. The invention uses the Stochastic Gradient Descent with Momentum (SGDM) algorithm to minimize the loss function, with the following SGDM parameters: the momentum is set to 0.9, the L2 regularization coefficient is 0.0001, the initial learning rate is set to 0.01, the learning rate is multiplied by 0.2 every 6 epochs, and the mini-batch size is set to 200.
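A minimal numerical sketch of the loss and update rule described above, using the stated hyperparameters (momentum 0.9, L2 coefficient 0.0001, initial learning rate 0.01); the function names and the use of a one-hot desired output are assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_hat, y):
    """E = -sum_f y_hat_f * log(y_f), with one-hot desired output y_hat."""
    return -np.sum(y_hat * np.log(y + 1e-12))

def sgdm_step(w, grad, velocity, lr=0.01, momentum=0.9, l2=1e-4):
    """One stochastic-gradient-descent-with-momentum update, with the
    L2 regularization term added to the gradient."""
    velocity = momentum * velocity - lr * (grad + l2 * w)
    return w + velocity, velocity

F = 36                                 # number of training azimuths
logits = np.zeros(F)                   # untrained network: uniform output
y = softmax(logits)
y_hat = np.zeros(F); y_hat[7] = 1.0    # true azimuth class 7 (one-hot)
loss = cross_entropy(y_hat, y)
print(round(loss, 4))                  # -log(1/36) ≈ 3.5835
```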
The invention adopts a 7:3 cross-validation split during training, and iterative training is repeated until convergence. At this point, CNN model training is complete.
Step six, processing the test signal according to steps two and three to obtain the spatial characteristic parameter of the single-frame test signal, namely its subband SRP-PHAT spatial spectrum matrix, which is taken as a test sample.
Step seven, taking the test sample as the input feature of the CNN model trained in step five; the CNN outputs the probability that the test signal belongs to each azimuth angle, and the azimuth with the highest probability is taken as the estimate of the sound source azimuth of the test sample.
In contrast to the prior art, the method of the present invention comprises two stages: training and testing. In the training stage, spatial characteristic parameters are extracted from directional speech signals under various reverberation and noise environments and input into the CNN for training to obtain a CNN model. In the testing stage, the spatial characteristic parameters of the test signal are extracted and fed into the trained CNN model, and the azimuth with the highest probability is taken as the estimate of the target sound source azimuth. The CNN training process can be completed offline and the trained CNN model stored in memory, so that only a single frame of signal is needed at test time to achieve real-time sound source localization. Compared with the conventional SRP-PHAT algorithm, the proposed algorithm markedly improves localization performance in complex acoustic environments and generalizes better across sound source spatial configurations, reverberation and noise.
Figs. 3 and 4 compare the localization performance of the method of the present invention with the conventional SRP-PHAT algorithm when the test environment matches the training environment: in Fig. 3 the reverberation time of both the test and training environments is 0.5 s, and in Fig. 4 it is 0.8 s. Localization was examined at signal-to-noise ratios of 0 dB, 5 dB, 10 dB, 15 dB and 20 dB; the localization success rate of the proposed method is far higher than that of the conventional SRP-PHAT algorithm.

Figs. 5 and 6 compare the localization performance of the method of the present invention with the conventional SRP-PHAT algorithm when the signal-to-noise ratios of the test and training environments differ: in Fig. 5 the reverberation time of both environments is 0.5 s, and in Fig. 6 it is 0.8 s, with the test signal-to-noise ratio differing from that used in training. Localization was examined at test signal-to-noise ratios of -2 dB, 3 dB, 8 dB, 13 dB and 18 dB; the success rate of the proposed method is again far higher than that of the conventional SRP-PHAT algorithm.

Figs. 7 and 8 compare the localization performance of the method of the present invention with the conventional SRP-PHAT algorithm when the reverberation times of the test and training environments differ: the test reverberation time is 0.6 s in Fig. 7 and 0.9 s in Fig. 8, in both cases different from the training reverberation times. Localization was examined at signal-to-noise ratios of 0 dB, 5 dB, 10 dB, 15 dB and 20 dB; the success rate of the proposed method remains far higher than that of the conventional SRP-PHAT algorithm.

As can be seen from Figs. 5 to 8, the success rate of the proposed method remains far higher than that of the conventional SRP-PHAT algorithm even in environments not seen during training, which shows that the method has good robustness and generalization capability with respect to unknown environments.
The foregoing is only a preferred embodiment of the invention. It should be noted that various modifications and adaptations may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of the invention.
Claims (4)
1. A sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum is characterized by comprising the following steps:
S1, a microphone array collects speech signals, and the collected speech signals are subjected to framing and windowing preprocessing to obtain single-frame signals;

S2, calculating the subband SRP-PHAT spatial spectrum matrix of each frame of signal; this specifically comprises the following steps:
S21, performing a discrete Fourier transform on each frame of signal:

X_m(i, k) = DFT(x_m(i, n))

where x_m(i, n) is the i-th frame signal of the m-th microphone in the microphone array, m = 1, 2, …, M, M is the number of microphones; X_m(i, k), the discrete Fourier transform of x_m(i, n), represents the frequency-domain signal of the i-th frame of the m-th microphone; k is the frequency bin, K is the length of the discrete Fourier transform, N is the frame length, K = 2N, and DFT(·) denotes the discrete Fourier transform;
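Step S21 can be sketched as follows; the sampling rate (16 kHz), frame length (N = 512) and Hamming window are illustrative assumptions, with the zero-padded DFT length K = 2N as specified:

```python
import numpy as np

def frame_dft(frames):
    """Zero-padded DFT of each windowed frame: K = 2N frequency bins.

    frames: array of shape (num_frames, N) for one microphone.
    Returns X of shape (num_frames, K) with K = 2N.
    """
    num_frames, N = frames.shape
    K = 2 * N
    return np.fft.fft(frames, n=K, axis=1)   # n=K zero-pads each frame

fs = 16000                       # assumed sampling rate
N = 512                          # assumed frame length
t = np.arange(4 * N) / fs
x = np.sin(2 * np.pi * 1000 * t)            # 1 kHz test tone
frames = x.reshape(4, N) * np.hamming(N)    # framing + windowing
X = frame_dft(frames)
print(X.shape)                   # (4, 1024)
```

With K = 2N = 1024 bins at 16 kHz, the 1 kHz tone falls at bin k = 1000·K/fs = 64.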
S22, designing the impulse response function of the Gammatone filter bank:

g_j(t) = c · t^(a-1) · exp(-2π b_j t) · cos(2π f_j t + φ)

where j denotes the index of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; and b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:
b j =1.109ERB(f j )
ERB(f j )=24.7(4.37f j /1000+1)
A discrete Fourier transform is performed on the sampled impulse response of each Gammatone filter:

G_j(k) = DFT(g_j(n / f_s))

where G_j(k) is the frequency-domain expression of the j-th Gammatone filter, k is the frequency bin, K is the length of the discrete Fourier transform, N is the frame length, K = 2N, f_s denotes the signal sampling rate, and DFT(·) denotes the discrete Fourier transform;
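A sketch of S22, assuming the standard Gammatone impulse-response form g_j(t) = c·t^(a-1)·exp(-2π b_j t)·cos(2π f_j t + φ) with the attenuation factor b_j = 1.109·ERB(f_j) given above; the order a = 4, gain c = 1, duration and sampling rate are illustrative assumptions:

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth, as given in the text."""
    return 24.7 * (4.37 * f / 1000 + 1)

def gammatone_ir(fc, fs, duration=0.032, a=4, c=1.0, phase=0.0):
    """Sampled Gammatone impulse response (standard form assumed):
    g(t) = c * t^(a-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase),
    with attenuation factor b = 1.109 * ERB(fc) as in the text."""
    b = 1.109 * erb(fc)
    t = np.arange(int(duration * fs)) / fs
    return c * t**(a - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phase)

fs = 16000
N = 512
K = 2 * N                         # DFT length K = 2N
g = gammatone_ir(fc=1000.0, fs=fs)
G = np.fft.fft(g, n=K)            # frequency-domain expression G_j(k)
k_peak = int(np.argmax(np.abs(G[:K // 2])))
print(erb(1000.0), k_peak)        # ERB(1000) = 132.639; peak near bin 64 (1 kHz)
```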
S23, calculating the subband SRP-PHAT function of each frame of signal:

P(i, j, r) = Σ_{m=1}^{M-1} Σ_{n=m+1}^{M} Σ_{k=0}^{K-1} |G_j(k)|² · [X_m(i, k) X_n*(i, k) / |X_m(i, k) X_n*(i, k)|] · e^{i2π k f_s τ_mn(r) / K}

(here i in the exponent denotes the imaginary unit), where P(i, j, r) denotes the j-th subband SRP-PHAT function of the i-th frame signal when the beam direction is r; M is the number of microphones in the microphone array; τ_mn(r) denotes the time difference of sound-wave propagation from beam direction r to the m-th and the n-th microphones, calculated as:

τ_mn(r) = (‖r - r_m‖ - ‖r - r_n‖) / c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, and c is the speed of sound in air;
S24, carrying out normalization processing on the subband SRP-PHAT function of each frame of signal;
S25, combining all the subband SRP-PHAT functions of the same frame signal into matrix form to obtain the subband SRP-PHAT spatial spectrum matrix:

y(i) = [ P̂(i, j, r_l) ]_{J×L},  j = 1, …, J;  l = 1, …, L

where y(i) denotes the subband SRP-PHAT spatial spectrum matrix of the i-th frame signal, P̂(i, j, r_l) denotes the normalized subband SRP-PHAT function at the l-th beam direction r_l, J is the number of subbands, i.e., the number of Gammatone filters, and L is the number of beam directions;
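Steps S23-S25 can be sketched as follows for a single frame. The |G_j(k)|² subband weighting, the PHAT normalization of the cross-spectra, and the per-subband min-max normalization are assumptions based on standard SRP-PHAT practice, and the synthetic two-microphone check at the end is purely illustrative:

```python
import numpy as np

def subband_srp_phat(X, G, taus, fs):
    """Subband SRP-PHAT spatial spectrum for one frame (sketch of S23-S25).

    X    : (M, K) complex frame spectra of the M microphones
    G    : (J, K) Gammatone filter frequency responses
    taus : (L, M) candidate propagation delays per beam direction
    Returns the J x L spatial spectrum matrix y(i).
    """
    M, K = X.shape
    J, L = G.shape[0], taus.shape[0]
    k = np.arange(K)
    y = np.zeros((J, L))
    for l in range(L):
        for j in range(J):
            p = 0.0
            for m in range(M - 1):
                for n in range(m + 1, M):
                    cross = X[m] * np.conj(X[n])
                    phat = cross / (np.abs(cross) + 1e-12)   # PHAT weighting
                    tau_mn = taus[l, m] - taus[l, n]
                    steer = np.exp(2j * np.pi * k * fs * tau_mn / K)
                    p += np.real(np.sum(np.abs(G[j]) ** 2 * phat * steer))
            y[j, l] = p
    # S24: min-max normalization per subband (assumed form)
    return (y - y.min(axis=1, keepdims=True)) / (np.ptp(y, axis=1, keepdims=True) + 1e-12)

# Synthetic check: two microphones, second signal delayed by 5 samples
rng = np.random.default_rng(0)
fs, N, d = 16000, 512, 5
s = rng.standard_normal(N + 16)
x0, x1 = s[8:8 + N], s[8 - d:8 - d + N]        # x1[n] = x0[n - d]
K = 2 * N
X = np.vstack([np.fft.fft(x0, K), np.fft.fft(x1, K)])
deltas = np.arange(-8, 9)                      # candidate inter-mic delays (samples)
taus = np.stack([np.zeros(len(deltas)), deltas / fs], axis=1)
G = np.ones((2, K))                            # two trivial all-pass "subbands"
y = subband_srp_phat(X, G, taus, fs)
print(y.shape, deltas[np.argmax(y[0])])        # (2, 17) 5
```

The spectrum peaks at the candidate delay matching the true inter-microphone delay, which is exactly the property the CNN input matrix encodes across subbands and directions.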
S3, inputting the subband SRP-PHAT spatial spectrum matrices of all frame signals into the trained convolutional neural network, which outputs the probability that the speech signal belongs to each azimuth; the azimuth with the highest probability is taken as the estimate of the sound source azimuth of the speech signal.
2. The sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum according to claim 1, wherein in step S23, when the sound source is at the same height as the microphone array and is located in the far field of the microphone array, the equivalent calculation formula of τ_mn(r) is:

τ_mn(r) = ξ^T (r_n - r_m) / c

where ξ = [cos θ, sin θ]^T and θ is the azimuth angle of the beam direction r.
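A small numerical check of the far-field formula in claim 2; the microphone geometry and the sign convention τ_mn(r) = ξ^T(r_n - r_m)/c are illustrative assumptions:

```python
import numpy as np

def tdoa_far_field(theta_deg, r_m, r_n, c=343.0):
    """Far-field time difference of arrival between microphones m and n
    for a source at azimuth theta, with sound speed c in m/s
    (sign convention assumed)."""
    theta = np.deg2rad(theta_deg)
    xi = np.array([np.cos(theta), np.sin(theta)])
    return xi @ (np.asarray(r_n) - np.asarray(r_m)) / c

# Two microphones 10 cm apart on the x-axis (illustrative geometry)
r1, r2 = [0.0, 0.0], [0.1, 0.0]
print(tdoa_far_field(0.0, r1, r2))    # endfire along +x: ~0.1/343 s
print(tdoa_far_field(90.0, r1, r2))   # broadside (y-axis): ~0 s
```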
3. The sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum according to claim 1, wherein the convolutional neural network comprises an input layer, three convolution-pooling layers, a fully connected layer and an output layer, connected in sequence;
in the convolution-pooling layers, each convolution layer uses 3×3 convolution kernels with a stride of 1; the numbers of convolution kernels in the three convolution layers are 24, 48 and 96 in sequence; after each convolution operation, batch normalization is applied first, followed by ReLU activation, and zero padding is used in the convolution operation so that the feature dimensions remain unchanged before and after convolution; each pooling layer uses max pooling with a 2×2 pooling size and a stride of 2;

after the convolution-pooling layers, the feature data are flattened into one-dimensional vector feature data;

Dropout is applied to the connection between the fully connected layer and the one-dimensional vector feature data;
the output layer adopts a Softmax classifier.
4. The sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum according to claim 1, wherein the training steps of the convolutional neural network are as follows:
S1, convolving clean speech signals with room impulse responses of different azimuth angles, and adding different degrees of noise and reverberation, to generate a number of directional speech signals with different specified azimuth angles:

x_m(t) = h_m(t) * s(t) + v_m(t),  m = 1, 2, …, M

where x_m(t) denotes the directional speech signal of the specified azimuth angle received by the m-th microphone in the microphone array; m is the microphone index, m = 1, 2, …, M, M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth angle to the m-th microphone; and v_m(t) denotes noise;
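The signal model x_m(t) = h_m(t) * s(t) + v_m(t) can be sketched as follows; the toy impulse responses, white noise, and SNR-based noise scaling are illustrative assumptions (real training data would use measured or simulated room impulse responses):

```python
import numpy as np

def make_directional_signal(s, h_list, snr_db, rng):
    """Synthesize x_m(t) = h_m(t) * s(t) + v_m(t) for each microphone:
    convolve clean speech with per-microphone room impulse responses,
    then add white noise scaled to the requested SNR (a simple sketch)."""
    xs = []
    for h in h_list:
        x = np.convolve(s, h)[:len(s)]       # h_m(t) * s(t)
        noise = rng.standard_normal(len(x))
        # scale noise power to achieve the requested signal-to-noise ratio
        noise *= np.sqrt(np.mean(x**2) / (10**(snr_db / 10) * np.mean(noise**2)))
        xs.append(x + noise)
    return np.stack(xs)

rng = np.random.default_rng(1)
s = rng.standard_normal(16000)               # stand-in for clean speech
h_list = [np.r_[1.0, np.zeros(9)],           # toy impulse responses with
          np.r_[np.zeros(3), 0.8, np.zeros(6)]]   # different delays/gains
X = make_directional_signal(s, h_list, snr_db=10.0, rng=rng)
print(X.shape)                               # (2, 16000)
```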
S2, performing framing and windowing preprocessing on all directional speech signals to obtain single-frame signals, and calculating the subband SRP-PHAT spatial spectrum matrix of each frame of signal;

S3, taking the subband SRP-PHAT spatial spectrum matrices of all directional speech signals as training samples, taking the specified azimuth angles of the directional speech signals as the class labels of the corresponding training samples, using the training samples and class labels as the training data set, and training the convolutional neural network by minimizing the loss function with the stochastic gradient descent with momentum algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110059164.1A CN112904279B (en) | 2021-01-18 | 2021-01-18 | Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112904279A CN112904279A (en) | 2021-06-04 |
CN112904279B true CN112904279B (en) | 2024-01-26 |