CN112904279B - Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum

Info

Publication number: CN112904279B
Application number: CN202110059164.1A
Authority: CN (China)
Prior art keywords: SRP, subband, PHAT, frame, sound source
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN112904279A
Inventors: 赵小燕, 童莹, 芮雄丽, 陈瑞, 毛铮
Current assignee: Nanjing Institute of Technology
Original assignee: Nanjing Institute of Technology
Application filed by Nanjing Institute of Technology

Classifications

    • G01S 5/18 — Position-fixing using ultrasonic, sonic, or infrasonic waves
    • G01S 5/22 — Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • G06N 3/02 — Neural networks
    • G06N 3/045 — Combinations of networks
    • G10L 21/0208 — Speech enhancement: noise filtering
    • G10L 21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L 25/30 — Speech or voice analysis characterised by the analysis technique using neural networks
    • G10L 25/45 — Speech or voice analysis characterised by the type of analysis window
    • G10L 2021/02082 — Noise filtering, the noise being echo or reverberation of the speech
    • G10L 2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
    • G10L 2021/02166 — Microphone arrays; beamforming

Abstract

The invention discloses a sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum, comprising the following steps: a microphone array collects speech signals, and the collected signals are preprocessed by framing and windowing to obtain single-frame signals; the subband SRP-PHAT spatial spectrum matrix of each frame signal is calculated; the subband SRP-PHAT spatial spectrum matrices of all frame signals are input into a trained convolutional neural network, which outputs the probability that the speech signal belongs to each azimuth, and the azimuth with the highest probability is taken as the estimate of the sound source azimuth. The invention improves the sound source localization performance of a microphone array in complex acoustic environments and the generalization capability with respect to sound source spatial structure, reverberation and noise. The training of the convolutional neural network can be completed offline and the trained network stored in memory, so that only one frame of signal is needed at test time to achieve real-time sound source localization.

Description

Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
Technical Field
The invention belongs to the field of sound source localization, and particularly relates to a sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum.
Background
The sound source localization technology based on microphone arrays has broad application prospects and potential economic value in the front-end processing of speech recognition, speaker recognition and emotion recognition systems, as well as in video conferencing, intelligent robots, smart homes, intelligent vehicle-mounted equipment, hearing aids and the like. Among conventional sound source localization methods, the SRP-PHAT (Steered Response Power - Phase Transform) method is the most popular and widely used; it localizes the sound source by detecting the peak of the spatial spectrum. However, noise and reverberation often cause the spatial spectrum to exhibit multiple peaks, and in a strongly reverberant environment in particular, the spatial-spectrum peak produced by reflected sound may exceed the peak of the direct sound, leading to erroneous detection of the sound source position. In recent years, model-based sound source localization methods have been applied to localization in complex acoustic environments. Such methods model spatial feature parameters to build a mapping between sound source position and spatial feature parameters, thereby achieving sound source localization, but current algorithms generalize poorly to unknown environments (noise and reverberation), and their performance needs further improvement. The spatial feature parameters and the modeling method are the main factors affecting the performance of model-based sound source localization.
Disclosure of Invention
Objects of the invention: to overcome the problems in the prior art, the invention discloses a sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum. The method adopts the subband SRP-PHAT spatial spectrum as the spatial feature parameter and uses a convolutional neural network (Convolutional Neural Network, CNN) to model the spatial feature parameters of directional speech data under various reverberation and noise environments, thereby improving the sound source localization performance of a microphone array in complex acoustic environments and the generalization capability with respect to sound source spatial structure, reverberation and noise.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme. A sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum comprises the following steps:
S1, a microphone array collects speech signals, and the collected signals are preprocessed by framing and windowing to obtain single-frame signals;
S2, the subband SRP-PHAT spatial spectrum matrix of each frame signal is calculated;
S3, the subband SRP-PHAT spatial spectrum matrices of all frame signals are input into the trained convolutional neural network, which outputs the probability that the speech signal belongs to each azimuth, and the azimuth with the highest probability is taken as the estimate of the sound source azimuth of the speech signal.
Preferably, in step S2, calculating the subband SRP-PHAT spatial spectrum matrix of each frame signal comprises the following steps:
S21, performing a discrete Fourier transform on each frame signal:
X_m(i,k) = DFT{x_m(i,n)}
where x_m(i,n) is the i-th frame signal of the m-th microphone in the microphone array, m = 1, 2, …, M, M is the number of microphones; X_m(i,k), the discrete Fourier transform of x_m(i,n), represents the frequency-domain signal of the i-th frame of the m-th microphone; k is the frequency bin index; K is the length of the discrete Fourier transform; N is the frame length; K = 2N; DFT(·) denotes the discrete Fourier transform;
S22, designing the impulse response functions of the Gammatone filter bank:
g_j(t) = c·t^(a-1)·exp(-2π·b_j·t)·cos(2π·f_j·t + φ), t ≥ 0
where j denotes the index of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:
b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)
A discrete Fourier transform is performed on the impulse response function of each Gammatone filter:
G_j(k) = DFT{g_j(n/f_s)}
where G_j(k) is the frequency-domain expression of the j-th Gammatone filter; k is the frequency bin index; K is the length of the discrete Fourier transform; N is the frame length; K = 2N; f_s denotes the signal sampling rate; DFT(·) denotes the discrete Fourier transform;
S23, calculating the subband SRP-PHAT function of each frame signal:
where P(i,j,r) denotes the j-th subband SRP-PHAT function of the i-th frame signal when the beam direction is r; M is the number of microphones in the microphone array; τ_mn(r) denotes the time difference of propagation of the sound wave from the beam direction r to the m-th microphone and to the n-th microphone, calculated as:
where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, and c is the speed of sound in air;
S24, normalizing the subband SRP-PHAT function of each frame signal:
S25, combining all the normalized subband SRP-PHAT functions of the same frame signal into matrix form to obtain the subband SRP-PHAT spatial spectrum matrix:
where y(i) denotes the subband SRP-PHAT spatial spectrum matrix of the i-th frame signal, a J×L matrix; J is the number of subbands, i.e. the number of Gammatone filters, and L is the number of beam directions.
Preferably, in step S23, when the sound source is set to lie in the same horizontal plane as the microphone array and in the far field of the microphone array, the equivalent calculation formula of τ_mn(r) is:
where ζ = [cos θ, sin θ]^T and θ is the azimuth angle of the beam direction r.
Preferably, the convolutional neural network comprises an input layer, three convolutional-pooling layers, a full-connection layer and an output layer which are sequentially connected;
in the convolution-pooling layers, each convolution layer uses 3 × 3 convolution kernels with a stride of 1, the numbers of convolution kernels of the three convolution layers are 24, 48 and 96 in sequence, after each convolution operation batch normalization is performed first and a ReLU activation function is then applied, and zero padding is used in the convolution so that the feature dimensions remain unchanged before and after the convolution; the pooling layers use max pooling with a pooling size of 2 × 2 and a stride of 2;
after the convolution-pooling layers, the feature data are flattened into a one-dimensional feature vector;
Dropout is applied to the connections between the fully connected layer and the one-dimensional feature vector;
the output layer adopts a Softmax classifier.
Preferably, the training steps of the convolutional neural network are as follows:
S1, the clean speech signal is convolved with room impulse responses of different azimuth angles, and noise and reverberation of different degrees are added to generate a plurality of directional speech signals of different specified azimuth angles:
x_m(t) = h_m(t)*s(t) + v_m(t),  m = 1, 2, …, M
where x_m(t) denotes the directional speech signal of a specified azimuth angle received by the m-th microphone in the microphone array; m is the microphone index, m = 1, 2, …, M, and M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth angle to the m-th microphone; v_m(t) denotes noise;
S2, all directional speech signals are preprocessed by framing and windowing to obtain single-frame signals, and the subband SRP-PHAT spatial spectrum matrix of each frame signal is calculated;
S3, the subband SRP-PHAT spatial spectrum matrices of all directional speech signals are taken as training samples, the specified azimuth angles of the directional speech signals are taken as the class labels of the corresponding training samples, the training samples and class labels form the training data set, and the convolutional neural network is trained by minimizing the loss function with a stochastic gradient descent with momentum algorithm.
Beneficial effects: the invention has the following notable advantages:
1. the invention improves the sound source localization performance of a microphone array in complex acoustic environments and the generalization capability with respect to sound source spatial structure, reverberation and noise;
2. the invention adopts the subband SRP-PHAT spatial spectrum as the spatial feature parameter, which not only represents the overall acoustic environment information but also offers strong robustness; a convolutional neural network is used to model the spatial feature parameters of directional speech data in various reverberation and noise environments, establishing a mapping between azimuth and the spatial feature parameters and converting the sound source localization problem into a multi-class classification problem;
3. the training of the convolutional neural network can be completed offline and the trained network stored in memory, so that only one frame of signal is needed at test time to achieve real-time sound source localization.
Drawings
FIG. 1 is a flow chart of an algorithm of the present invention;
FIG. 2 is a diagram of a model structure of a convolutional neural network in accordance with the present invention;
FIG. 3 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the test environment and the training environment are consistent and the reverberation time is 0.5 s;
FIG. 4 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the test environment and the training environment are consistent and the reverberation time is 0.8 s;
FIG. 5 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the noise environments of the test environment and the training environment are inconsistent and the reverberation time is 0.5 s;
FIG. 6 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the noise environments of the test environment and the training environment are inconsistent and the reverberation time is 0.8 s;
FIG. 7 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the reverberation times of the test environment and the training environment are inconsistent and the reverberation time of the test environment is 0.6 s;
FIG. 8 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the reverberation times of the test environment and the training environment are inconsistent and the reverberation time of the test environment is 0.9 s.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
The subband SRP-PHAT spatial spectrum characterizes the spatial information of the whole acoustic environment, including sound source azimuth, room size, room reflection characteristics and the like, and is strongly robust, so it can serve as the spatial feature parameter in a localization system. Deep neural networks can emulate the way the nervous system processes information, can describe the fusion relations and structural information among spatial feature parameters, and have strong expressive and modeling capability, while requiring no assumptions about the data distribution during modeling. Among them, convolutional neural networks are a class of neural networks specialized for processing data with a grid-like structure and are applied to image or time-series data. The speech signals collected by a microphone array are precisely such time-series data.
Therefore, the invention provides a sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum, which is shown in figure 1 and comprises the following steps:
step one: convolving the clean speech signal with room impulse responses of different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of directional speech signals of different specified azimuth angles, namely microphone array signals:
x_m(t) = h_m(t)*s(t) + v_m(t),  m = 1, 2, …, M
where x_m(t) denotes the directional speech signal of a specified azimuth angle received by the m-th microphone in the microphone array; m is the microphone index, m = 1, 2, …, M, and M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth angle to the m-th microphone and depends on the sound source orientation and the room reverberation; v_m(t) denotes noise.
In this embodiment, the microphone array is a uniform circular array of 6 omni-directional microphones with a radius of 0.1 m. The sound source is set to lie in the same horizontal plane as the microphone array and in its far field. The direction directly ahead in the horizontal plane is defined as 90°; the sound source azimuth range is [0°, 360°) with an interval of 10°, and the number of training azimuths is denoted F, with F = 36. The reverberation times of the training data comprise 0.5 s and 0.8 s, and the Image method is used to generate the room impulse responses h_m(t) for the different azimuth angles under each reverberation time. v_m(t) is white Gaussian noise, and the signal-to-noise ratios of the training data include 0 dB, 5 dB, 10 dB, 15 dB and 20 dB.
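As an illustration of this step, the sketch below generates one directional (microphone-array) signal for a single specified azimuth by convolving a clean speech signal with per-microphone room impulse responses and adding white Gaussian noise at a chosen SNR. It is a minimal sketch in Python/NumPy, not the patent's reference implementation; the function name, the row-wise layout of the rirs array and the per-channel SNR definition are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def make_directional_signal(s, rirs, snr_db, rng=None):
    """Simulate x_m(t) = h_m(t) * s(t) + v_m(t) for one specified azimuth.

    s      : 1-D clean speech signal
    rirs   : (M, L_h) room impulse responses, one row per microphone,
             e.g. generated with the image method for the desired azimuth
    snr_db : signal-to-noise ratio of the added white Gaussian noise
    """
    rng = np.random.default_rng() if rng is None else rng
    # Convolve the clean speech with each microphone's room impulse response
    x = np.stack([fftconvolve(s, h)[: len(s)] for h in rirs])      # (M, T)
    # Add white Gaussian noise at the requested SNR (computed per channel)
    sig_pow = np.mean(x ** 2, axis=1, keepdims=True)
    noise_pow = sig_pow / (10.0 ** (snr_db / 10.0))
    v = rng.standard_normal(x.shape) * np.sqrt(noise_pow)
    return x + v

# Example: one 6-channel training signal at 10 dB SNR
# x = make_directional_signal(s, rirs, snr_db=10)
```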
And step two, preprocessing the microphone array signal obtained in the step one to obtain a single frame signal.
Preprocessing includes framing and windowing, wherein:
the framing method comprises the following steps: the directional voice signal x of the appointed azimuth angle of the mth microphone is processed by adopting the preset frame length and frame shift m (t) dividing into a plurality of single frame signals x m (iN+n), wherein i is a frame number, N represents a sampling number iN one frame, N is more than or equal to 0 and less than N, and N is a frame length. Signal sampling rate f in this embodiment s For 16kHz, a frame length N of 512 (i.e., 32 ms) is taken, and the frame shift is 0.
The windowing method comprises the following steps: x is x m (i,n)=w H (n)x m (iN+n), where x m (i, n) is the i-th frame signal of the m-th microphone after the windowing process,is a hamming window.
Step three: extracting the spatial feature parameters of the microphone array signals, i.e. the subband SRP-PHAT spatial spectrum matrix. This specifically comprises the following steps:
(3-1) A discrete Fourier transform is performed on each frame signal obtained in step two, converting the time-domain signal into a frequency-domain signal.
The discrete Fourier transform is calculated as:
X_m(i,k) = DFT{x_m(i,n)}
where X_m(i,k), the discrete Fourier transform of x_m(i,n), represents the frequency-domain signal of the i-th frame of the m-th microphone; k is the frequency bin index; K is the length of the discrete Fourier transform; K = 2N; DFT(·) denotes the discrete Fourier transform. The length of the discrete Fourier transform is set to 1024 in this embodiment.
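A minimal sketch of the framing, Hamming windowing and zero-padded DFT described above, assuming the embodiment's values (N = 512, frame shift 0, i.e. non-overlapping frames, K = 2N = 1024). The function and variable names are illustrative only.

```python
import numpy as np

def frames_to_spectra(x, N=512, K=1024):
    """Split a 1-D microphone signal into non-overlapping frames, apply a
    Hamming window and return the K-point DFT of each frame.

    Returns an array of shape (num_frames, K) holding X_m(i, k)."""
    num_frames = len(x) // N
    w = np.hamming(N)                          # Hamming window w_H(n)
    frames = x[: num_frames * N].reshape(num_frames, N) * w
    return np.fft.fft(frames, n=K, axis=1)     # zero-padded to K = 2N

# X[m] = frames_to_spectra(x[m]) for each microphone channel m
```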
(3-2) Designing the Gammatone filter bank.
g_j(t) is the impulse response function of the j-th Gammatone filter, expressed as:
g_j(t) = c·t^(a-1)·exp(-2π·b_j·t)·cos(2π·f_j·t + φ), t ≥ 0
where j denotes the index of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:
b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)
In this embodiment, the order a is 4, the phase φ is set to 0, the number of Gammatone filters is 36, i.e. j = 1, 2, …, 36, and the center frequencies f_j of the Gammatone filters lie in the range [200 Hz, 8000 Hz].
A discrete Fourier transform is performed on the impulse response function of each Gammatone filter to obtain the frequency-domain expression of the Gammatone filter:
G_j(k) = DFT{g_j(n/f_s)}
where G_j(k), the discrete Fourier transform of g_j(n/f_s), is the frequency-domain expression of the j-th Gammatone filter; k is the frequency bin index; K is the length of the discrete Fourier transform; K = 2N; DFT(·) denotes the discrete Fourier transform; f_s denotes the sampling rate. The length of the discrete Fourier transform is set to 1024 in this embodiment.
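The sketch below builds the frequency-domain Gammatone filter bank under the embodiment's settings (order a = 4, phase 0, J = 36 filters, center frequencies in [200 Hz, 8000 Hz], f_s = 16 kHz, K = 1024). The ERB-rate spacing of the center frequencies and the unit gain c = 1 are assumptions; the patent only states the frequency range.

```python
import numpy as np

def gammatone_bank(J=36, fs=16000, N=512, K=1024, f_lo=200.0, f_hi=8000.0,
                   a=4, c=1.0, phi=0.0):
    """Return G of shape (J, K): the DFT of each Gammatone impulse response g_j(n/fs)."""
    # ERB-rate spacing of center frequencies (assumed; the patent gives only the range)
    erb_lo = 21.4 * np.log10(4.37e-3 * f_lo + 1.0)
    erb_hi = 21.4 * np.log10(4.37e-3 * f_hi + 1.0)
    f_c = (10.0 ** (np.linspace(erb_lo, erb_hi, J) / 21.4) - 1.0) / 4.37e-3

    t = np.arange(N) / fs                               # one frame of sampled time instants
    G = np.zeros((J, K), dtype=complex)
    for j, fj in enumerate(f_c):
        b_j = 1.109 * 24.7 * (4.37 * fj / 1000.0 + 1.0)  # b_j = 1.109 * ERB(f_j)
        g = (c * t ** (a - 1) * np.exp(-2 * np.pi * b_j * t)
             * np.cos(2 * np.pi * fj * t + phi))         # impulse response g_j(n/fs)
        G[j] = np.fft.fft(g, n=K)                        # G_j(k), zero-padded to K = 2N
    return G
```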
(3-3) The subband SRP-PHAT function of each frame signal is calculated as follows:
where P(i,j,r) denotes the j-th subband SRP-PHAT function of the i-th frame signal when the beam direction of the array is r; (·)* denotes complex conjugation; τ_mn(r) denotes the time difference of propagation of the sound wave from the beam direction r to the m-th microphone and to the n-th microphone, calculated as:
τ_mn(r) = f_s·(‖r − r_m‖ − ‖r − r_n‖)/c
where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, c is the speed of sound in air (approximately 342 m/s at normal temperature), f_s is the signal sampling rate, and ‖·‖ denotes the 2-norm.
In this embodiment, the sound source and the microphone array are set to lie in the same horizontal plane, and the sound source is located in the far field of the microphone array, in which case τ_mn(r) has an equivalent calculation formula:
where ζ = [cos θ, sin θ]^T and θ is the azimuth angle of the beam direction r. τ_mn(r) is independent of the received signal and can be computed offline and stored in memory.
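Since τ_mn(r) depends only on the array geometry and the beam directions, it can be precomputed offline as stated above. The sketch below does this for the embodiment's uniform circular array (6 microphones, radius 0.1 m) and 72 beam directions at 5° intervals, using the general formula τ_mn(r) = f_s(‖r − r_m‖ − ‖r − r_n‖)/c with the beam point placed far from the array to approximate the far field; the 5 m radius of that point and the placement of the first microphone at 0° are illustrative assumptions.

```python
import numpy as np

def steering_delays(M=6, radius=0.1, L=72, fs=16000, c=342.0, far_r=5.0):
    """Precompute tau[m, n, l]: delay (in samples) between microphones m and n
    for the l-th beam direction of a uniform circular array."""
    mic_ang = 2 * np.pi * np.arange(M) / M
    mics = radius * np.stack([np.cos(mic_ang), np.sin(mic_ang)], axis=1)    # (M, 2)
    beam_ang = np.deg2rad(np.arange(L) * 5.0)                               # 0°, 5°, ..., 355°
    beams = far_r * np.stack([np.cos(beam_ang), np.sin(beam_ang)], axis=1)  # (L, 2)

    # distance from each beam point to each microphone: (L, M)
    dist = np.linalg.norm(beams[:, None, :] - mics[None, :, :], axis=2)
    # tau_mn(r) = fs * (||r - r_m|| - ||r - r_n||) / c, rearranged to shape (M, M, L)
    tau = fs * (dist[:, :, None] - dist[:, None, :]) / c
    return np.transpose(tau, (1, 2, 0))

TAU = steering_delays()   # computed once offline and stored in memory
```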
The subband SRP-PHAT function P(i,j,r) is then normalized.
(3-4) The normalized subband SRP-PHAT functions of all subbands of the same frame signal are combined into matrix form to obtain the subband SRP-PHAT spatial spectrum matrix:
where y(i) denotes the spatial feature parameter of the i-th frame signal, i.e. the J×L subband SRP-PHAT spatial spectrum matrix whose (j, l)-th entry is the normalized subband SRP-PHAT value of the j-th subband for the l-th beam direction; J is the number of subbands, i.e. the number of Gammatone filters, and in this embodiment J = 36. The azimuth range of the array beam directions in this embodiment is [0°, 360°), with 90° defined as directly ahead in the horizontal plane and an interval of 5°, so the number of beam directions is L = 72. The number of beam directions L is generally taken larger than the number of training azimuths F, which improves the accuracy of the spatial feature parameters and thus the training accuracy of the CNN model.
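Putting the pieces together, the sketch below computes the J×L subband SRP-PHAT spatial spectrum matrix y(i) for one frame from the frequency-domain microphone signals X (shape M×K), the Gammatone filter responses G (J×K) and the precomputed delays TAU (M×M×L) from the earlier sketches. The patent's exact weighting and normalization formulas are not reproduced here; this sketch uses a Gammatone-weighted GCC-PHAT steered-power sum over all microphone pairs with per-subband max normalization, which is only an interpretation of steps (3-3)-(3-4).

```python
import numpy as np

def subband_srp_phat(X, G, TAU, eps=1e-12):
    """Compute the J x L subband SRP-PHAT spatial spectrum matrix y(i) for one frame.

    X   : (M, K) frame spectra X_m(i, k)
    G   : (J, K) Gammatone filter frequency responses G_j(k)
    TAU : (M, M, L) pairwise steering delays in samples (precomputed offline)
    """
    M, K = X.shape
    Kb = K // 2 + 1                                   # positive-frequency bins only
    kbins = np.arange(Kb)
    W = np.abs(G[:, :Kb]) ** 2                        # subband weighting (assumed |G_j(k)|^2)
    J, L = W.shape[0], TAU.shape[2]

    y = np.zeros((J, L))
    for m in range(M):
        for n in range(m + 1, M):                     # all microphone pairs
            cross = X[m, :Kb] * np.conj(X[n, :Kb])
            phat = cross / (np.abs(cross) + eps)      # PHAT (phase transform) weighting
            steer = np.exp(1j * 2 * np.pi * np.outer(TAU[m, n], kbins) / K)   # (L, Kb)
            steered = phat[None, :] * steer           # (L, Kb)
            y += np.real(steered @ W.T).T             # Gammatone-weighted sum over k -> (J, L)
    # per-subband normalization over the beam directions (assumed max-normalization)
    return y / (np.max(np.abs(y), axis=1, keepdims=True) + eps)
```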
Step four: preparing the training set. Following steps one to three, the spatial feature parameters of the directional speech signals under all training environments (the training environment settings are detailed in step one) are extracted and used as the CNN training samples, and each training sample is labeled with its corresponding specified azimuth angle, which serves as the class label of that sample.
Step five: constructing the CNN model and training it, using the training samples and class labels obtained in step four as the CNN training data set. This specifically comprises the following steps:
(5-1) setting a CNN model structure.
The CNN structure employed in the present invention is shown in fig. 2 as comprising an input layer followed by three convolution-pooling layers, then a fully connected layer, and finally an output layer.
The input to the input layer is the two-dimensional J×L subband SRP-PHAT spatial spectrum matrix, i.e. a training sample; in this embodiment J = 36 and L = 72.
The input layer is followed by three convolution-pooling layers. Each convolution layer uses 3 × 3 convolution kernels with a stride of 1, and zero padding is used in the convolution so that the feature dimensions remain unchanged before and after the convolution. The numbers of convolution kernels of the 1st, 2nd and 3rd convolution layers are 24, 48 and 96, respectively. After each convolution operation, batch normalization is performed first and a ReLU activation function is then applied. The pooling layers use max pooling with a pooling size of 2 × 2 and a stride of 2.
After the three convolution-pooling operations, the 36×72 two-dimensional subband SRP-PHAT spatial spectrum matrix becomes feature maps of size 5×9×96, which are flattened into a 4320×1 one-dimensional feature vector. The neurons of the fully connected layer are connected to all feature data of the previous layer, and Dropout is applied to these connections to prevent overfitting, with the Dropout rate set to 0.5.
The output layer uses a Softmax classifier: the Softmax function converts the feature data of the fully connected layer into the probability of the speech signal belonging to each azimuth, and the azimuth with the highest probability is taken as the predicted sound source direction.
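The CNN described above can be expressed compactly, for example in PyTorch. The sketch below follows the stated structure (three 3×3 convolution-batchnorm-ReLU blocks with 24/48/96 kernels, 2×2 max pooling with stride 2, flattening to a 4320-dimensional vector, Dropout 0.5, a fully connected layer and Softmax over F = 36 azimuth classes). It is an illustrative reconstruction rather than the patent's reference implementation; in particular, ceil-mode pooling (to reach the stated 5×9 map) and the single fully connected layer mapping 4320 features directly to 36 classes are assumptions.

```python
import torch
import torch.nn as nn

class SubbandSrpPhatCNN(nn.Module):
    """CNN for the J x L = 36 x 72 subband SRP-PHAT spatial spectrum matrix."""
    def __init__(self, num_classes=36):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),  # zero padding keeps size
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True),       # 36x72 -> 18x36 -> 9x18 -> 5x9
            )
        self.features = nn.Sequential(block(1, 24), block(24, 48), block(48, 96))
        self.classifier = nn.Sequential(
            nn.Flatten(),                  # 96 x 5 x 9 = 4320 features
            nn.Dropout(p=0.5),
            nn.Linear(4320, num_classes),  # fully connected layer
        )

    def forward(self, x):                  # x: (batch, 1, 36, 72)
        return self.classifier(self.features(x))   # logits; Softmax applied in the loss / at inference

model = SubbandSrpPhatCNN()
```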
(5-2) training network parameters of the CNN model.
The training process of CNN includes two parts, forward propagation and backward propagation.
Forward propagation computes the output of the input data under the current network parameters and is a layer-by-layer transfer of features; the forward propagation expression at position (u, v) in layer d is:
S_d(u,v) = ReLU((S_{d-1} * w_d)(u,v) + β_d(u,v))
where d is the layer index and layer d is a convolution layer; S_d denotes the output of layer d; S_{d-1} denotes the output of layer d-1; * denotes the convolution operation; w_d denotes the convolution kernel weights of layer d; β_d denotes the bias of layer d; and ReLU is the activation function. The layers of the CNN structure adopted by the invention comprise the input layer, the convolution and pooling layers within the convolution-pooling layers, the fully connected layer and the output layer.
With D denoting the output layer, the expression of the output layer is:
S_D = Softmax((w_D)^T·S_{D-1} + β_D)
where S_D denotes the output of the output layer, S_{D-1} the output of the fully connected layer, w_D the weights of the output layer, and β_D the bias of the output layer.
The goal of the back propagation phase is to minimize the cross-entropy loss function E(w, β):
where the subscript f denotes the f-th azimuth, and the cross-entropy is computed between the desired output and the actual output of the output layer at the f-th azimuth angle. F denotes the number of training azimuths, F = 36 in this embodiment. The invention adopts the stochastic gradient descent with momentum (Stochastic Gradient Descent with Momentum, SGDM) algorithm to minimize the loss function, with the following SGDM parameters: the momentum is set to 0.9, the L2 regularization coefficient is 0.0001, the initial learning rate is set to 0.01 and is scaled by 0.2 every 6 epochs, and the mini-batch size is set to 200.
A 7:3 cross-validation split is used during training. Iterative training is repeated until convergence, at which point the CNN model training is complete.
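A sketch of the training configuration described above using PyTorch's SGD optimizer: momentum 0.9, weight decay (L2 regularization) 1e-4, initial learning rate 0.01 scaled by 0.2 every 6 epochs, mini-batches of 200, and cross-entropy loss. The number of epochs and the train_set object (assumed to yield pairs of a 1×36×72 feature matrix and an azimuth class index) are placeholders, not values from the patent.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

loader = DataLoader(train_set, batch_size=200, shuffle=True)       # mini-batch = 200
criterion = nn.CrossEntropyLoss()                                  # cross-entropy loss E(w, beta)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.2)  # lr x 0.2 every 6 epochs

for epoch in range(30):                                            # iterate until convergence
    for feats, labels in loader:                                   # feats: (B, 1, 36, 72)
        optimizer.zero_grad()
        loss = criterion(model(feats), labels)
        loss.backward()                                            # back propagation
        optimizer.step()                                           # SGDM update
    scheduler.step()
```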
Step six: the test signal is processed according to steps two and three to obtain the spatial feature parameter of the single-frame test signal, i.e. its subband SRP-PHAT spatial spectrum matrix, which serves as the test sample.
Step seven: the test sample is fed as the input feature to the CNN model trained in step five; the CNN outputs the probability of the test signal belonging to each azimuth angle, and the azimuth with the highest probability is taken as the estimate of the sound source azimuth of the test sample.
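At test time a single frame suffices: its subband SRP-PHAT matrix is fed through the trained network and the azimuth with the highest probability is taken as the estimate. A minimal sketch, reusing the helpers assumed in the earlier sketches (subband_srp_phat, G, TAU and model); mapping class index i to azimuth 10·i degrees assumes the training classes are ordered by azimuth at the 10° training interval.

```python
import torch

def estimate_azimuth(frame_spectra, G, TAU, model):
    """frame_spectra: (M, K) spectra of one frame from the M microphones."""
    y = subband_srp_phat(frame_spectra, G, TAU)                       # (36, 72) feature matrix
    feats = torch.from_numpy(y).float().unsqueeze(0).unsqueeze(0)     # (1, 1, 36, 72)
    with torch.no_grad():
        probs = torch.softmax(model(feats), dim=1)                    # probability of each azimuth class
    return int(probs.argmax(dim=1)) * 10                              # classes assumed 10 degrees apart

# azimuth_deg = estimate_azimuth(X_frame, G, TAU, model)
```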
In contrast to the prior art, the method of the present invention comprises two stages, training and testing. In the training stage, spatial feature parameters are extracted from directional speech signals under various reverberation and noise environments and input into the CNN for training to obtain the CNN model. In the test stage, the spatial feature parameters of the test signal are extracted and fed to the trained CNN model, and the azimuth with the highest probability is taken as the target sound source azimuth estimate. The training of the CNN can be completed offline and the trained CNN model stored in memory, so that only one frame of signal is needed at test time to achieve real-time sound source localization. Compared with the conventional SRP-PHAT algorithm, the algorithm of the invention significantly improves the localization performance in complex acoustic environments and generalizes better with respect to sound source spatial structure, reverberation and noise.
Fig. 3 and 4 compare the localization performance of the method of the present invention with the conventional SRP-PHAT algorithm when the test environment and the training environment are identical: in fig. 3 the reverberation time of the test and training environments is 0.5 s, and in fig. 4 it is 0.8 s, with results examined at signal-to-noise ratios of 0 dB, 5 dB, 10 dB, 15 dB and 20 dB. The positioning success rate of the method is far higher than that of the conventional SRP-PHAT algorithm.
Fig. 5 and 6 compare the localization performance of the method of the present invention with the conventional SRP-PHAT algorithm when the signal-to-noise ratios of the test environment and the training environment are inconsistent: in fig. 5 the reverberation time of the test and training environments is 0.5 s and in fig. 6 it is 0.8 s, while the signal-to-noise ratio of the test environment differs from that of the training environment, with test signal-to-noise ratios of -2 dB, 3 dB, 8 dB, 13 dB and 18 dB examined. The positioning success rate of the method is far higher than that of the conventional SRP-PHAT algorithm.
Fig. 7 and 8 compare the localization performance of the method of the present invention with the conventional SRP-PHAT algorithm when the reverberation times of the test environment and the training environment are inconsistent: the test-environment reverberation time is 0.6 s in fig. 7 and 0.9 s in fig. 8, with results examined at signal-to-noise ratios of 0 dB, 5 dB, 10 dB, 15 dB and 20 dB. The positioning success rate of the method is far higher than that of the conventional SRP-PHAT algorithm.
As can be seen from fig. 5 to 8, the success rate of the method of the present invention remains far higher than that of the conventional SRP-PHAT algorithm even in non-training environments, which shows that the method has good robustness and generalization capability with respect to unknown environments.
The foregoing is only a preferred embodiment of the invention. It should be noted that various modifications and adaptations can be made by those skilled in the art without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of the invention.

Claims (4)

1. A sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum is characterized by comprising the following steps:
S1, a microphone array collects speech signals, and the collected signals are preprocessed by framing and windowing to obtain single-frame signals;
s2, calculating a subband SRP-PHAT spatial spectrum matrix of each frame of signal; the method specifically comprises the following steps:
S21, performing a discrete Fourier transform on each frame signal:
X_m(i,k) = DFT{x_m(i,n)}
where x_m(i,n) is the i-th frame signal of the m-th microphone in the microphone array, m = 1, 2, …, M, M is the number of microphones; X_m(i,k), the discrete Fourier transform of x_m(i,n), represents the frequency-domain signal of the i-th frame of the m-th microphone; k is the frequency bin index; K is the length of the discrete Fourier transform; N is the frame length; K = 2N; DFT(·) denotes the discrete Fourier transform;
S22, designing the impulse response functions of the Gammatone filter bank:
g_j(t) = c·t^(a-1)·exp(-2π·b_j·t)·cos(2π·f_j·t + φ), t ≥ 0
where j denotes the index of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:
b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)
performing a discrete Fourier transform on the impulse response function of each Gammatone filter:
G_j(k) = DFT{g_j(n/f_s)}
where G_j(k) is the frequency-domain expression of the j-th Gammatone filter; k is the frequency bin index; K is the length of the discrete Fourier transform; N is the frame length; K = 2N; f_s denotes the signal sampling rate; DFT(·) denotes the discrete Fourier transform;
S23, calculating the subband SRP-PHAT function of each frame signal:
where P(i,j,r) denotes the j-th subband SRP-PHAT function of the i-th frame signal when the beam direction is r; M is the number of microphones in the microphone array; τ_mn(r) denotes the time difference of propagation of the sound wave from the beam direction r to the m-th microphone and to the n-th microphone, calculated as:
where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, and c is the speed of sound in air;
S24, normalizing the subband SRP-PHAT function of each frame signal:
S25, combining all the normalized subband SRP-PHAT functions of the same frame signal into matrix form to obtain the subband SRP-PHAT spatial spectrum matrix:
where y(i) denotes the subband SRP-PHAT spatial spectrum matrix of the i-th frame signal, a J×L matrix; J is the number of subbands, i.e. the number of Gammatone filters, and L is the number of beam directions;
S3, the subband SRP-PHAT spatial spectrum matrices of all frame signals are input into the trained convolutional neural network, which outputs the probability that the speech signal belongs to each azimuth, and the azimuth with the highest probability is taken as the estimate of the sound source azimuth of the speech signal.
2. The sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum according to claim 1, wherein in step S23, when the sound source is set to lie in the same horizontal plane as the microphone array and in the far field of the microphone array, the equivalent calculation formula of τ_mn(r) is:
where ζ = [cos θ, sin θ]^T and θ is the azimuth angle of the beam direction r.
3. The sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum according to claim 1, wherein the convolutional neural network comprises an input layer, three convolution-pooling layers, a fully connected layer and an output layer which are sequentially connected;
in the convolution-pooling layers, each convolution layer uses 3 × 3 convolution kernels with a stride of 1, the numbers of convolution kernels of the three convolution layers are 24, 48 and 96 in sequence, after each convolution operation batch normalization is performed first and a ReLU activation function is then applied, and zero padding is used in the convolution so that the feature dimensions remain unchanged before and after the convolution; the pooling layers use max pooling with a pooling size of 2 × 2 and a stride of 2;
after the convolution-pooling layers, the feature data are flattened into a one-dimensional feature vector;
Dropout is applied to the connections between the fully connected layer and the one-dimensional feature vector;
the output layer uses a Softmax classifier.
4. The sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum according to claim 1, wherein the training steps of the convolutional neural network are as follows:
S1, the clean speech signal is convolved with room impulse responses of different azimuth angles, and noise and reverberation of different degrees are added to generate a plurality of directional speech signals of different specified azimuth angles:
x_m(t) = h_m(t)*s(t) + v_m(t),  m = 1, 2, …, M
where x_m(t) denotes the directional speech signal of a specified azimuth angle received by the m-th microphone in the microphone array; m is the microphone index, m = 1, 2, …, M, and M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth angle to the m-th microphone; v_m(t) denotes noise;
S2, all directional speech signals are preprocessed by framing and windowing to obtain single-frame signals, and the subband SRP-PHAT spatial spectrum matrix of each frame signal is calculated;
S3, the subband SRP-PHAT spatial spectrum matrices of all directional speech signals are taken as training samples, the specified azimuth angles of the directional speech signals are taken as the class labels of the corresponding training samples, the training samples and class labels form the training data set, and the convolutional neural network is trained by minimizing the loss function with a stochastic gradient descent with momentum algorithm.
CN202110059164.1A (filed 2021-01-18, priority 2021-01-18) — Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum — Active — granted as CN112904279B (en)

Priority Applications (1)

Application Number: CN202110059164.1A — Priority date: 2021-01-18 — Filing date: 2021-01-18 — Title: Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum

Publications (2)

Publication Number — Publication Date
CN112904279A (en) — 2021-06-04
CN112904279B (en) — 2024-01-26

Family

ID=76114123

Family Applications (1)

Application Number: CN202110059164.1A — Priority date: 2021-01-18 — Filing date: 2021-01-18 — Status: Active — Title: Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum — CN112904279B (en)

Country Status (1)

Country Link
CN (1) CN112904279B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113655440B (en) * 2021-08-09 2023-05-30 西南科技大学 Self-adaptive compromise pre-whitened sound source positioning method
CN113589230B (en) * 2021-09-29 2022-02-22 广东省科学院智能制造研究所 Target sound source positioning method and system based on joint optimization network
CN114994608B (en) * 2022-04-21 2024-05-14 西北工业大学深圳研究院 Multi-device self-organizing microphone array sound source positioning method based on deep learning
CN114897033B (en) * 2022-07-13 2022-09-27 中国人民解放军海军工程大学 Three-dimensional convolution kernel group calculation method for multi-beam narrow-band process data
CN115201753B (en) * 2022-09-19 2022-11-29 泉州市音符算子科技有限公司 Low-power-consumption multi-spectral-resolution voice positioning method
CN115331691A (en) * 2022-10-13 2022-11-11 广州成至智能机器科技有限公司 Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101090893B1 (en) * 2010-03-15 2011-12-08 한국과학기술연구원 Sound source localization system

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020042708A1 (en) * 2018-08-31 2020-03-05 大象声科(深圳)科技有限公司 Time-frequency masking and deep neural network-based sound source direction estimation method
CN109164415A (en) * 2018-09-07 2019-01-08 东南大学 A kind of binaural sound sources localization method based on convolutional neural networks
CN109490822A (en) * 2018-10-16 2019-03-19 南京信息工程大学 Voice DOA estimation method based on ResNet
CN110133596A (en) * 2019-05-13 2019-08-16 南京林业大学 A kind of array sound source localization method based on frequency point signal-to-noise ratio and biasing soft-decision
CN110133572A (en) * 2019-05-21 2019-08-16 南京林业大学 A kind of more sound localization methods based on Gammatone filter and histogram
CN110544490A (en) * 2019-07-30 2019-12-06 南京林业大学 sound source positioning method based on Gaussian mixture model and spatial power spectrum characteristics
CN110517705A (en) * 2019-08-29 2019-11-29 北京大学深圳研究生院 A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks
CN111123202A (en) * 2020-01-06 2020-05-08 北京大学 Indoor early reflected sound positioning method and system
CN111583948A (en) * 2020-05-09 2020-08-25 南京工程学院 Improved multi-channel speech enhancement system and method
CN111707990A (en) * 2020-08-19 2020-09-25 东南大学 Binaural sound source positioning method based on dense convolutional network
CN111968677A (en) * 2020-08-21 2020-11-20 南京工程学院 Voice quality self-evaluation method for fitting-free hearing aid

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Deep and CNN fusion method for binaural sound source localization; S. Jiang, W. L., P. Yuan, Y. Sun and H. Liu; The Journal of Engineering; 511-516 *
End-to-end Binaural Sound Localisation from the Raw Waveform; Vecchiotti et al.; IEEE; 451-455 *
Sound Source Localization Based on SRP-PHAT Spatial Spectrum and Deep Neural Network; Xiaoyan Zhao et al.; Computers, Materials & Continua; pp. 253-271 *
Traffic sound event recognition method based on convolutional neural network (基于卷积神经网络的交通声音事件识别方法); 张文涛, 韩莹莹, 黎恒; Modern Electronics Technique (现代电子技术), No. 14; full text *
Research on robust binaural sound source localization based on neural networks (基于神经网络的鲁棒双耳声源定位研究); 王茜茜; China Masters' Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑); I136-129 *

Also Published As

Publication number Publication date
CN112904279A (en) 2021-06-04

Similar Documents

Publication Publication Date Title
CN112904279B (en) Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
CN107703486B (en) Sound source positioning method based on convolutional neural network CNN
CN109490822B (en) Voice DOA estimation method based on ResNet
CN112151059A (en) Microphone array-oriented channel attention weighted speech enhancement method
CN110068795A (en) A kind of indoor microphone array sound localization method based on convolutional neural networks
US20040175006A1 (en) Microphone array, method and apparatus for forming constant directivity beams using the same, and method and apparatus for estimating acoustic source direction using the same
Vesperini et al. Localizing speakers in multiple rooms by using deep neural networks
CN110610718B (en) Method and device for extracting expected sound source voice signal
CN107167770A (en) A kind of microphone array sound source locating device under the conditions of reverberation
CN107527626A (en) Audio identification system
CN110444220B (en) Multi-mode remote voice perception method and device
CN113111765B (en) Multi-voice source counting and positioning method based on deep learning
CN112180318B (en) Sound source direction of arrival estimation model training and sound source direction of arrival estimation method
Salvati et al. Two-microphone end-to-end speaker joint identification and localization via convolutional neural networks
CN112201276B (en) TC-ResNet network-based microphone array voice separation method
CN111123202B (en) Indoor early reflected sound positioning method and system
CN116559778B (en) Vehicle whistle positioning method and system based on deep learning
CN113593596A (en) Robust self-adaptive beam forming directional pickup method based on subarray division
CN112363112A (en) Sound source positioning method and device based on linear microphone array
CN111443328A (en) Sound event detection and positioning method based on deep learning
CN110838303A (en) Voice sound source positioning method using microphone array
CN114245266B (en) Area pickup method and system for small microphone array device
Wang et al. U-net based direct-path dominance test for robust direction-of-arrival estimation
Firoozabadi et al. Combination of nested microphone array and subband processing for multiple simultaneous speaker localization
CN114895245A (en) Microphone array sound source positioning method and device and storage medium

Legal Events

Code — Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant