CN112904279A - Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum - Google Patents


Info

Publication number
CN112904279A
CN112904279A (application number CN202110059164.1A)
Authority
CN
China
Prior art keywords
srp
phat
sub
band
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110059164.1A
Other languages
Chinese (zh)
Other versions
CN112904279B (en)
Inventor
赵小燕
童莹
芮雄丽
陈瑞
毛铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Technology filed Critical Nanjing Institute of Technology
Priority to CN202110059164.1A priority Critical patent/CN112904279B/en
Publication of CN112904279A publication Critical patent/CN112904279A/en
Application granted granted Critical
Publication of CN112904279B publication Critical patent/CN112904279B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L2021/02082Noise filtering the noise being echo, reverberation of the speech
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a sound source positioning method based on a convolutional neural network and a sub-band SRP-PHAT spatial spectrum, comprising the following steps: a microphone array collects speech signals, and the collected signals are preprocessed by framing and windowing to obtain single-frame signals; a sub-band SRP-PHAT spatial spectrum matrix is calculated for each frame signal; the sub-band SRP-PHAT spatial spectrum matrices of all frame signals are input into a trained convolutional neural network, which outputs the probability of the speech signal belonging to each azimuth angle, and the azimuth angle with the highest probability is taken as the sound source azimuth estimate of the speech signal. The invention improves the sound source positioning performance of a microphone array in complex acoustic environments and improves generalization to the sound source spatial structure, reverberation and noise. The training of the convolutional neural network can be completed offline and the trained network stored in memory, and only a single frame of signal is needed at test time, enabling real-time sound source positioning.

Description

Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum
Technical Field
The invention belongs to the field of sound source positioning, and particularly relates to a sound source positioning method based on a convolutional neural network and a sub-band SRP-PHAT spatial spectrum.
Background
The sound source positioning technology based on microphone arrays has broad application prospects and potential economic value in front-end processing for speech recognition, speaker recognition and emotion recognition systems, video conferencing, intelligent robots, smart homes, intelligent vehicle-mounted equipment, hearing aids, and the like. The SRP-PHAT (Steered Response Power with Phase Transform) method, the most popular and widely used of the conventional sound source positioning methods, locates the sound source by detecting the peak of a spatial spectrum. However, noise and reverberation often cause the spatial spectrum to exhibit multiple peaks, and in strongly reverberant environments in particular the spatial spectrum peak produced by reflected sound may exceed the peak of the direct sound, leading to errors in sound source position detection. In recent years, model-based sound source positioning methods have been applied in complex acoustic environments; such methods achieve sound source positioning by modeling spatial characteristic parameters and constructing a mapping between the sound source position and those parameters. At present, however, the generalization capability of such algorithms to unknown environments (noise and reverberation) is low, and their performance needs further improvement. The spatial characteristic parameters and the modeling method are the main factors affecting the performance of model-based sound source positioning.
Disclosure of Invention
The purpose of the invention is as follows: in order to solve the problems in the prior art, the invention discloses a sound source positioning method based on a Convolutional Neural Network and a sub-band SRP-PHAT space spectrum.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the following technical scheme: a sound source positioning method based on a convolutional neural network and a sub-band SRP-PHAT space spectrum is characterized by comprising the following steps:
S1. The microphone array collects speech signals, and the collected speech signals are preprocessed by framing and windowing to obtain single-frame signals;
S2. A sub-band SRP-PHAT spatial spectrum matrix is calculated for each frame signal;
S3. The sub-band SRP-PHAT spatial spectrum matrices of all frame signals are input into the trained convolutional neural network, which outputs the probability of the speech signal belonging to each azimuth angle, and the azimuth angle with the highest probability is taken as the sound source azimuth estimate of the speech signal.
Preferably, in step S2, calculating the sub-band SRP-PHAT spatial spectrum matrix of each frame signal comprises the following steps:
S21. Perform a discrete Fourier transform on each frame signal:

X_m(i,k) = DFT{x_m(i,n)} = Σ_{n=0}^{N−1} x_m(i,n)·e^{−j2πkn/K},  k = 0, 1, …, K−1

where x_m(i,n) is the i-th frame signal of the m-th microphone in the microphone array, m = 1, 2, …, M, M is the number of microphones, X_m(i,k) is the discrete Fourier transform of x_m(i,n) and denotes the frequency-domain signal of the i-th frame of the m-th microphone, k is the frequency bin index, K is the discrete Fourier transform length, N is the frame length, K = 2N, and DFT(·) denotes the discrete Fourier transform;
S22. Design the impulse response functions of the Gammatone filter bank:

g_j(t) = c·t^{a−1}·e^{−2πb_j·t}·cos(2πf_j·t + φ),  t ≥ 0

where j denotes the index of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; and b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:

b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)
Perform a discrete Fourier transform on the impulse response function of each Gammatone filter:

G_j(k) = DFT{g_j(n/f_s)} = Σ_{n=0}^{N−1} g_j(n/f_s)·e^{−j2πkn/K},  k = 0, 1, …, K−1

where G_j(k) is the frequency-domain expression of the j-th Gammatone filter, k is the frequency bin index, K is the discrete Fourier transform length, N is the frame length, K = 2N, f_s denotes the signal sampling rate, and DFT(·) denotes the discrete Fourier transform;
S23. Calculate the sub-band SRP-PHAT function of each frame signal:

P(i,j,r) = Σ_{m=1}^{M−1} Σ_{n=m+1}^{M} Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i,k)·X_n*(i,k) / |X_m(i,k)·X_n*(i,k)|] · e^{j2πk·f_s·τ_mn(r)/K}

where P(i,j,r) denotes the j-th sub-band SRP-PHAT function of the i-th frame signal when the beam direction is r; M is the number of microphones in the microphone array; (·)* denotes the conjugate; and τ_mn(r) denotes the time difference of the sound wave propagating from beam direction r to the m-th and n-th microphones, calculated as:

τ_mn(r) = (‖r − r_m‖ − ‖r − r_n‖) / c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, and c is the speed of sound in air;
S24. Normalize the sub-band SRP-PHAT function of each frame signal:

P̂(i,j,r) = P(i,j,r) / max_r P(i,j,r)
S25. Combine all sub-band SRP-PHAT functions of the same frame signal into matrix form to obtain the sub-band SRP-PHAT spatial spectrum matrix:

Y(i) = [ P̂(i,1,r_1)  P̂(i,1,r_2)  …  P̂(i,1,r_L) ;
         P̂(i,2,r_1)  P̂(i,2,r_2)  …  P̂(i,2,r_L) ;
         ⋮ ;
         P̂(i,J,r_1)  P̂(i,J,r_2)  …  P̂(i,J,r_L) ]

where Y(i) denotes the J × L sub-band SRP-PHAT spatial spectrum matrix of the i-th frame signal, J is the number of sub-bands, i.e., the number of Gammatone filters, and L is the number of beam directions.
Preferably, in step S23, when the sound source and the microphone array are set in the same horizontal plane and the sound source lies in the far field of the microphone array, the equivalent formula for τ_mn(r) is:

τ_mn(r) = ξ^T·(r_n − r_m) / c

where ξ = [cosθ, sinθ]^T and θ is the azimuth angle of the beam direction r.
Preferably, the convolutional neural network comprises an input layer, three convolution-pooling layers, a fully-connected layer and an output layer connected in sequence;
in the convolution-pooling layers, each convolution layer uses 3 × 3 convolution kernels with a stride of 1; the numbers of convolution kernels in the three convolution layers are 24, 48 and 96 in turn; after each convolution operation, batch normalization is performed first and then the ReLU activation function is applied; zero padding is used in the convolution operation so that the feature dimensions before and after convolution remain unchanged; the pooling layers use max pooling with a pooling size of 2 × 2 and a stride of 2;
after the convolution-pooling layers, the feature data are flattened into a one-dimensional feature vector;
Dropout is applied to the connection between the fully-connected layer and the one-dimensional feature vector;
the output layer uses a Softmax classifier.
Preferably, the training step of the convolutional neural network is as follows:
S1. Convolve clean speech signals with room impulse responses for different azimuth angles, and add different levels of noise and reverberation to generate a plurality of directional speech signals with different specified azimuth angles:

x_m(t) = h_m(t) * s(t) + v_m(t),  m = 1, 2, …, M

where x_m(t) denotes the directional speech signal of a specified azimuth received by the m-th microphone of the microphone array; m is the microphone index, m = 1, 2, …, M, and M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth to the m-th microphone; and v_m(t) denotes noise;
S2. Preprocess all directional speech signals by framing and windowing to obtain single-frame signals, and calculate the sub-band SRP-PHAT spatial spectrum matrix of each frame signal;
S3. Use the sub-band SRP-PHAT spatial spectrum matrices of all directional speech signals as training samples and the specified azimuth angles of the directional speech signals as the class labels of the corresponding training samples; with the training samples and class labels as the training data set, train the convolutional neural network with a stochastic gradient descent algorithm with momentum to minimize the loss function.
Beneficial effects: the invention has the following notable advantages:
1. The invention improves the sound source positioning performance of a microphone array in complex acoustic environments and improves generalization to the sound source spatial structure, reverberation and noise;
2. The invention uses the sub-band SRP-PHAT spatial spectrum as the spatial characteristic parameter, which both represents the overall acoustic environment information and offers strong robustness; a convolutional neural network models the spatial characteristic parameters of directional speech data under various reverberation and noise environments, establishing a mapping between azimuth and the spatial characteristic parameters and converting the sound source positioning problem into a multi-class classification problem;
3. The training of the convolutional neural network can be completed offline and the trained network stored in memory, and only a single frame of signal is needed at test time, enabling real-time sound source positioning.
Drawings
FIG. 1 is a flow chart of the algorithm of the present invention;
FIG. 2 is a diagram of a model architecture of a convolutional neural network of the present invention;
FIG. 3 is a graph comparing the success rate of positioning between the method of the present invention and the conventional SRP-PHAT algorithm when the testing environment and the training environment are consistent and the reverberation time is 0.5 s;
FIG. 4 is a graph comparing the success rate of positioning between the method of the present invention and the conventional SRP-PHAT algorithm when the testing environment and the training environment are consistent and the reverberation time is 0.8 s;
FIG. 5 is a comparison graph of the positioning success rate of the method of the present invention and the conventional SRP-PHAT algorithm when the noise environments of the test environment and the training environment are not consistent and the reverberation time is 0.5 s;
FIG. 6 is a comparison graph of the positioning success rate of the method of the present invention and the conventional SRP-PHAT algorithm when the noise environments of the test environment and the training environment are not consistent and the reverberation time is 0.8 s;
FIG. 7 is a comparison graph of the positioning success rate of the method of the present invention and the conventional SRP-PHAT algorithm when the reverberation environments of the test environment and the training environment are not consistent and the reverberation time of the test environment is 0.6 s;
fig. 8 is a comparison graph of the positioning success rate of the method of the present invention and the conventional SRP-PHAT algorithm when the reverberation environments of the test environment and the training environment are not consistent and the reverberation time of the test environment is 0.9 s.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
The sub-band SRP-PHAT spatial spectrum represents the spatial information of the whole acoustic environment, including the sound source direction, room size, and room reflection characteristics, and is robust, so it can serve as the spatial characteristic parameter in a positioning system. A deep neural network can emulate the information processing of a nervous system, can describe the fusion relationships and structural information among spatial characteristic parameters, has strong expressive and modeling power, and does not require assumptions about the data distribution during modeling. A convolutional neural network is a neural network specialized for processing data with a grid-like structure and is widely applied to images and time-series data. The speech signals collected by a microphone array are exactly such time-series data.
Therefore, the present invention provides a sound source positioning method based on a convolutional neural network and a sub-band SRP-PHAT spatial spectrum, as shown in FIG. 1, comprising the following steps:
the method comprises the following steps: convolving the clean speech signal with the room impulse response at different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of directional speech signals at different specified azimuth angles, namely microphone array signals:
x_m(t) = h_m(t) * s(t) + v_m(t),  m = 1, 2, …, M

where x_m(t) denotes the directional speech signal of a specified azimuth received by the m-th microphone of the microphone array; m is the microphone index, m = 1, 2, …, M, and M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth to the m-th microphone and depends on the sound source azimuth and the room reverberation; v_m(t) denotes noise.
In this embodiment, the microphone array is a uniform circular array of 6 omnidirectional microphones with an array radius of 0.1 m. The sound source and the microphone array are set in the same horizontal plane, with the sound source in the far field of the microphone array. The direction directly in front of the horizontal plane is defined as 90°; the sound source azimuth angle lies in the range [0°, 360°) with an interval of 10°, and the number of training azimuths is denoted F, so F = 36. The reverberation times of the training data include 0.5 s and 0.8 s, and the image method is used to generate the room impulse responses h_m(t) for the different azimuth angles under the different reverberation times. v_m(t) is white Gaussian noise, and the signal-to-noise ratios of the training data include 0 dB, 5 dB, 10 dB, 15 dB and 20 dB.
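As an illustration only, a minimal Python sketch of synthesizing one such microphone channel is given below; the function name and the way the room impulse response and SNR are supplied are assumptions, not part of the patent:

```python
import numpy as np
from scipy.signal import fftconvolve

def make_directional_signal(s, h_m, snr_db, rng=np.random.default_rng()):
    """x_m(t) = h_m(t) * s(t) + v_m(t): convolve clean speech s with the room
    impulse response h_m of one microphone, then add white Gaussian noise at snr_db."""
    x = fftconvolve(s, h_m)[:len(s)]                   # reverberant directional signal
    v = rng.standard_normal(len(x))
    v *= np.sqrt(np.mean(x ** 2) / (10 ** (snr_db / 10) * np.mean(v ** 2)))  # scale noise to the target SNR
    return x + v
```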
Step two: preprocess the microphone array signals obtained in step one to obtain single-frame signals.
The preprocessing includes framing and windowing, where:
Framing: using the preset frame length and frame shift, the directional speech signal x_m(t) of the m-th microphone for a specified azimuth is divided into a number of single-frame signals x_m(iN + n), where i is the frame index, n is the sample index within a frame, 0 ≤ n < N, and N is the frame length. In this embodiment the signal sampling rate f_s is 16 kHz, the frame length N is 512 samples (i.e., 32 ms), and the frames do not overlap.
Windowing: x_m(i,n) = w_H(n)·x_m(iN + n), where x_m(i,n) is the i-th frame signal of the m-th microphone after windowing and

w_H(n) = 0.54 − 0.46·cos(2πn/(N−1)),  0 ≤ n < N

is a Hamming window.
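A minimal NumPy sketch of this preprocessing, combined with the discrete Fourier transform of step (3-1) below, is given here for illustration; the function name and array layout are assumptions:

```python
import numpy as np

def frame_window_dft(x, N=512, K=1024):
    """Split one microphone channel x into non-overlapping Hamming-windowed frames
    of length N and return their zero-padded K-point DFTs, shape (frames, K)."""
    num_frames = len(x) // N
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))   # Hamming window w_H(n)
    frames = x[:num_frames * N].reshape(num_frames, N) * w          # x_m(i, n)
    return np.fft.fft(frames, n=K, axis=1)                          # X_m(i, k), K = 2N
```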
Step three: extract the spatial characteristic parameters of the microphone array signals, namely the sub-band SRP-PHAT spatial spectrum matrix. This specifically comprises the following steps:
(3-1) Perform a discrete Fourier transform on each frame signal obtained in step two, converting the time-domain signal into a frequency-domain signal.
The discrete Fourier transform is calculated as:

X_m(i,k) = DFT{x_m(i,n)} = Σ_{n=0}^{N−1} x_m(i,n)·e^{−j2πkn/K},  k = 0, 1, …, K−1

where X_m(i,k) is the discrete Fourier transform of x_m(i,n) and denotes the frequency-domain signal of the i-th frame of the m-th microphone, k is the frequency bin index, K is the discrete Fourier transform length, K = 2N, and DFT(·) denotes the discrete Fourier transform. In this embodiment, the discrete Fourier transform length is set to 1024.
(3-2) Design the Gammatone filter bank.
g_j(t) is the impulse response function of the j-th Gammatone filter, expressed as:

g_j(t) = c·t^{a−1}·e^{−2πb_j·t}·cos(2πf_j·t + φ),  t ≥ 0

where j denotes the index of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; and b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:

b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)

In this embodiment the order a is 4, the phase φ is set to 0, and the number of Gammatone filters is 36, i.e., j = 1, 2, …, 36, with the center frequencies f_j lying in the range [200 Hz, 8000 Hz].
The impulse response function of each Gammatone filter is discrete-Fourier-transformed to obtain its frequency-domain expression:

G_j(k) = DFT{g_j(n/f_s)} = Σ_{n=0}^{N−1} g_j(n/f_s)·e^{−j2πkn/K},  k = 0, 1, …, K−1

where G_j(k) is the discrete Fourier transform of g_j(n/f_s) and denotes the frequency-domain expression of the j-th Gammatone filter, k is the frequency bin index, K is the discrete Fourier transform length, K = 2N, DFT(·) denotes the discrete Fourier transform, and f_s denotes the sampling rate. In this embodiment, the discrete Fourier transform length is set to 1024.
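A NumPy sketch of this filter-bank design is given below for illustration; the ERB-rate spacing of the center frequencies is an assumption, since the embodiment only specifies the range [200 Hz, 8000 Hz] and the filter count J = 36:

```python
import numpy as np

def gammatone_bank(J=36, fs=16000, N=512, K=1024, f_lo=200.0, f_hi=8000.0, a=4, c=1.0):
    """Return (G, fj): G is a J x K matrix of Gammatone frequency responses G_j(k)."""
    erb = lambda f: 24.7 * (4.37 * f / 1000 + 1)
    # Center frequencies spaced on the ERB-rate scale (assumed; the patent gives only the range)
    lo, hi = 21.4 * np.log10(4.37e-3 * f_lo + 1), 21.4 * np.log10(4.37e-3 * f_hi + 1)
    fj = (10 ** (np.linspace(lo, hi, J) / 21.4) - 1) / 4.37e-3
    t = np.arange(N) / fs                              # impulse response sampled at n / fs
    G = np.zeros((J, K), dtype=complex)
    for j, f in enumerate(fj):
        b = 1.109 * erb(f)                             # attenuation factor b_j (value as in the patent)
        g = c * t ** (a - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t)  # phase = 0
        G[j] = np.fft.fft(g, n=K)                      # G_j(k), zero-padded to K = 2N
    return G, fj
```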
(3-3) Calculate the sub-band SRP-PHAT function of each frame signal according to:

P(i,j,r) = Σ_{m=1}^{M−1} Σ_{n=m+1}^{M} Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i,k)·X_n*(i,k) / |X_m(i,k)·X_n*(i,k)|] · e^{j2πk·f_s·τ_mn(r)/K}

where P(i,j,r) denotes the j-th sub-band SRP-PHAT function of the i-th frame signal when the beam direction of the array is r; (·)* denotes the conjugate; and τ_mn(r) denotes the time difference of the sound wave propagating from beam direction r to the m-th and n-th microphones, calculated as:

τ_mn(r) = (‖r − r_m‖ − ‖r − r_n‖) / c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, c is the speed of sound in air (about 342 m/s at room temperature), f_s is the signal sampling rate, and ‖·‖ denotes the 2-norm.
In this embodiment, with the sound source and the microphone array in the same horizontal plane and the sound source in the far field of the microphone array, the equivalent formula for τ_mn(r) is:

τ_mn(r) = ξ^T·(r_n − r_m) / c

where ξ = [cosθ, sinθ]^T and θ is the azimuth angle of the beam direction r. τ_mn(r) is independent of the received signal and can therefore be calculated offline and stored in memory.
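For example, the far-field time differences for the uniform circular array and the 72 beam directions of this embodiment could be tabulated once as in the following sketch (function name and sign convention are assumptions):

```python
import numpy as np

def far_field_tdoa_table(M=6, radius=0.1, L=72, c=342.0):
    """Return tau[m, n, l]: far-field time difference (seconds) between microphones m and n
    for beam azimuth theta_l = l * 5 degrees."""
    mic_angles = 2 * np.pi * np.arange(M) / M
    r_mic = radius * np.stack([np.cos(mic_angles), np.sin(mic_angles)], axis=1)  # (M, 2) positions
    theta = np.deg2rad(np.arange(L) * 5.0)
    xi = np.stack([np.cos(theta), np.sin(theta)], axis=1)                        # (L, 2) unit vectors
    diff = r_mic[None, :, :] - r_mic[:, None, :]                                 # r_n - r_m, shape (M, M, 2)
    return np.einsum('mnd,ld->mnl', diff, xi) / c                                # tau_mn(r) = xi^T (r_n - r_m) / c
```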
The sub-band SRP-PHAT function P(i,j,r) is normalized according to:

P̂(i,j,r) = P(i,j,r) / max_r P(i,j,r)
(3-4) All sub-band SRP-PHAT functions P̂(i,j,r) of the same frame signal are combined into matrix form to obtain the sub-band SRP-PHAT spatial spectrum matrix:

Y(i) = [ P̂(i,1,r_1)  P̂(i,1,r_2)  …  P̂(i,1,r_L) ;
         P̂(i,2,r_1)  P̂(i,2,r_2)  …  P̂(i,2,r_L) ;
         ⋮ ;
         P̂(i,J,r_1)  P̂(i,J,r_2)  …  P̂(i,J,r_L) ]

where Y(i) denotes the spatial characteristic parameter of the i-th frame signal, i.e., the J × L sub-band SRP-PHAT spatial spectrum matrix, and J is the number of sub-bands, i.e., the number of Gammatone filters; in this embodiment J = 36. The azimuth range of the array's beam directions is [0°, 360°), with 90° defined as directly in front of the horizontal plane and an interval of 5°, so the number of beam directions L = 72. In general, the number of beam directions L is larger than the number of training azimuths F, which refines the spatial characteristic parameters of the signal and improves the training accuracy of the CNN model.
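For illustration, a NumPy sketch assembling steps (3-1) to (3-4) for a single frame is given below; the use of |G_j(k)|² as the sub-band weight and the maximum-based normalization mirror the reconstruction above and are assumptions rather than the patent's exact image equations:

```python
import numpy as np

def subband_srp_phat_matrix(X, G, tau, fs=16000, K=1024):
    """X: (M, K) frame DFTs X_m(i, k); G: (J, K) Gammatone responses G_j(k);
    tau: (M, M, L) TDOA table tau_mn(r). Returns the J x L matrix Y(i)."""
    M, J, L = X.shape[0], G.shape[0], tau.shape[2]
    k = np.arange(K)
    P = np.zeros((J, L))
    for m in range(M):
        for n in range(m + 1, M):
            cross = X[m] * np.conj(X[n])
            phat = cross / (np.abs(cross) + 1e-12)                          # PHAT weighting
            steer = np.exp(1j * 2 * np.pi * np.outer(k, tau[m, n]) * fs / K)  # (K, L) steering terms
            for j in range(J):
                P[j] += np.real((np.abs(G[j]) ** 2 * phat) @ steer)         # sub-band weight (assumed form)
    return P / np.max(P, axis=1, keepdims=True)                             # normalize over beam directions (assumed)
```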
Step four: prepare the training set. Following steps one to three, the spatial characteristic parameters of the directional speech signals under all training environments (the training environments are detailed in step one) are extracted and used as CNN training samples, and the specified azimuth angle corresponding to each training sample is marked and used as the class label of that training sample.
Step five: construct the CNN model and train it with the training samples and class labels obtained in step four as the CNN training data set, thereby obtaining the trained CNN model. This specifically comprises the following steps:
(5-1) Set the CNN model structure.
The CNN architecture adopted by the invention is shown in FIG. 2 and comprises an input layer followed by three convolution-pooling layers, then a fully-connected layer, and finally an output layer.
The input to the input layer is the J × L two-dimensional sub-band SRP-PHAT spatial spectrum matrix, i.e., a training sample; in this embodiment J = 36 and L = 72.
The input layer is followed by three convolution-pooling layers. Each convolution layer uses 3 × 3 convolution kernels with a stride of 1, and zero padding is used so that the feature dimensions before and after convolution remain unchanged. The numbers of convolution kernels in the 1st, 2nd and 3rd convolution layers are 24, 48 and 96, respectively. After each convolution operation, batch normalization is performed first and then the ReLU activation function is applied. The pooling layers use max pooling with a pooling size of 2 × 2 and a stride of 2.
After the three convolution-pooling operations, the 36 × 72 two-dimensional sub-band SRP-PHAT spatial spectrum matrix becomes a 5 × 9 × 96 feature map, which is flattened into a 4320 × 1 one-dimensional feature vector. The neurons of the fully-connected layer are connected to all the feature data of the previous layer, and Dropout is added to this connection to prevent overfitting, with the Dropout rate set to 0.5.
The output layer uses a Softmax classifier; the Softmax function converts the feature data of the fully-connected layer into the probability of the speech signal belonging to each azimuth angle, and the azimuth angle with the highest probability is taken as the predicted sound source direction.
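A PyTorch sketch of this architecture is given below as an illustration only; ceil-mode pooling is assumed so that the 36 × 72 input reduces to the stated 5 × 9 × 96 feature map, and since the width of the fully-connected layer is not specified, it is folded into the Softmax output layer here:

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),   # 3x3 kernels, stride 1, zero padding
        nn.BatchNorm2d(c_out),                                        # batch normalization before ReLU
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True),        # 36x72 -> 18x36 -> 9x18 -> 5x9
    )

class SubbandSrpPhatCNN(nn.Module):
    def __init__(self, num_classes=36):
        super().__init__()
        self.features = nn.Sequential(conv_block(1, 24), conv_block(24, 48), conv_block(48, 96))
        self.classifier = nn.Sequential(
            nn.Flatten(),                      # 5 * 9 * 96 = 4320 features
            nn.Dropout(p=0.5),
            nn.Linear(4320, num_classes),      # Softmax applied by the loss / at inference
        )

    def forward(self, x):                      # x: (batch, 1, 36, 72) spectrum matrices
        return self.classifier(self.features(x))
```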
And (5-2) training network parameters of the CNN model.
The training process of CNN includes two parts, forward propagation and backward propagation.
Forward propagation computes the output of the input data under the current network parameters and is a layer-by-layer transfer of features. For a convolutional layer d, the forward propagation expression at position (u, v) is:

S_d(u,v) = ReLU((S_{d−1} * w_d)(u,v) + β_d(u,v))

where d is the layer index, S_d denotes the output of layer d, S_{d−1} denotes the output of layer d−1, * denotes the convolution operation, w_d denotes the convolution kernel weights of layer d, β_d denotes the bias of layer d, and ReLU is the activation function. The layers of the CNN structure adopted by the invention include the input layer, the convolution and pooling layers within the convolution-pooling layers, the fully-connected layer, and the output layer.
Let D denote the output layer; the expression of the output layer is:

S_D = Softmax((w_D)^T·S_{D−1} + β_D)

where S_D denotes the output of the output layer, S_{D−1} denotes the output of the fully-connected layer, w_D denotes the weights of the output layer, and β_D denotes the bias of the output layer.
The goal of the back-propagation stage is to minimize the cross-entropy loss function E(w, β):

E(w, β) = −Σ_{f=1}^{F} S̃_D(f)·log(S_D(f))

where the subscript f denotes the f-th azimuth angle, S̃_D(f) denotes the desired output of the output layer at the f-th azimuth, and S_D(f) denotes the actual output of the output layer at the f-th azimuth. F denotes the number of training azimuths; in this embodiment F = 36. The invention minimizes the loss function with a stochastic gradient descent with momentum (SGDM) algorithm, with the following SGDM parameters: the momentum is set to 0.9, the L2 regularization coefficient is 0.0001, the initial learning rate is set to 0.01, the learning rate is multiplied by 0.2 every 6 epochs, and the mini-batch size is set to 200.
A 7:3 cross-validation split is used during training, and training is iterated until convergence, at which point the CNN model training is complete.
Step six: process the test signal according to steps two and three to obtain the spatial characteristic parameter of each single-frame test signal, namely the sub-band SRP-PHAT spatial spectrum matrix, and use it as a test sample.
Step seven: use the test sample as the input feature of the CNN model trained in step five; the CNN outputs the probability of the test signal belonging to each azimuth angle, and the azimuth with the highest probability is taken as the sound source azimuth estimate of the test sample.
Compared with the prior art, the method of the invention comprises a training stage and a testing stage. In the training stage, spatial characteristic parameters are extracted from directional speech signals under various reverberation and noise environments and input into a CNN for training to obtain the CNN model. In the testing stage, the spatial characteristic parameters of the test signal are extracted and input into the trained CNN model, and the azimuth with the highest probability is taken as the azimuth estimate of the target sound source. The CNN training process can be completed offline, with the trained CNN model stored in memory, and only a single frame of signal is needed at test time to achieve real-time sound source positioning. Compared with the traditional SRP-PHAT algorithm, the proposed algorithm significantly improves positioning performance in complex acoustic environments and generalizes better to the sound source spatial structure, reverberation and noise.
FIG. 3 and FIG. 4 show the positioning performance of the method of the invention and of the traditional SRP-PHAT algorithm when the test environment matches the training environment: in FIG. 3 the reverberation time of both environments is 0.5 s, and in FIG. 4 it is 0.8 s. The positioning performance is examined at signal-to-noise ratios of 0 dB, 5 dB, 10 dB, 15 dB and 20 dB; the positioning success rate of the method of the invention is far higher than that of the traditional SRP-PHAT algorithm.
FIG. 5 and FIG. 6 show the positioning performance when the signal-to-noise ratios of the test and training environments differ: in FIG. 5 the reverberation time of both environments is 0.5 s, and in FIG. 6 it is 0.8 s, while the test-environment signal-to-noise ratio differs from that of the training environment. The positioning performance is examined at test signal-to-noise ratios of −2 dB, 3 dB, 8 dB, 13 dB and 18 dB; the positioning success rate of the method of the invention remains far higher than that of the traditional SRP-PHAT algorithm.
FIG. 7 and FIG. 8 show the positioning performance when the reverberation times of the test and training environments differ: the test-environment reverberation time is 0.6 s in FIG. 7 and 0.9 s in FIG. 8. The positioning performance is examined at signal-to-noise ratios of 0 dB, 5 dB, 10 dB, 15 dB and 20 dB; the positioning success rate of the method of the invention is again far higher than that of the traditional SRP-PHAT algorithm.
As can be seen from FIG. 5 to FIG. 8, even in untrained environments the success rate of the method of the invention remains much higher than that of the traditional SRP-PHAT algorithm, showing that the method has good robustness and generalization to unknown environments.
The above description covers only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these are also intended to fall within the scope of the invention.

Claims (5)

1. A sound source positioning method based on a convolutional neural network and a sub-band SRP-PHAT space spectrum is characterized by comprising the following steps:
S1. The microphone array collects speech signals, and the collected speech signals are preprocessed by framing and windowing to obtain single-frame signals;
S2. A sub-band SRP-PHAT spatial spectrum matrix is calculated for each frame signal;
S3. The sub-band SRP-PHAT spatial spectrum matrices of all frame signals are input into the trained convolutional neural network, which outputs the probability of the speech signal belonging to each azimuth angle, and the azimuth angle with the highest probability is taken as the sound source azimuth estimate of the speech signal.
2. The sound source localization method according to claim 1, wherein in step S2 calculating the sub-band SRP-PHAT spatial spectrum matrix of each frame signal comprises the following steps:
S21. Perform a discrete Fourier transform on each frame signal:

X_m(i,k) = DFT{x_m(i,n)} = Σ_{n=0}^{N−1} x_m(i,n)·e^{−j2πkn/K},  k = 0, 1, …, K−1

where x_m(i,n) is the i-th frame signal of the m-th microphone in the microphone array, m = 1, 2, …, M, M is the number of microphones, X_m(i,k) is the discrete Fourier transform of x_m(i,n) and denotes the frequency-domain signal of the i-th frame of the m-th microphone, k is the frequency bin index, K is the discrete Fourier transform length, N is the frame length, K = 2N, and DFT(·) denotes the discrete Fourier transform;
S22. Design the impulse response functions of the Gammatone filter bank:

g_j(t) = c·t^{a−1}·e^{−2πb_j·t}·cos(2πf_j·t + φ),  t ≥ 0

where j denotes the index of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; and b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:

b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)
Perform a discrete Fourier transform on the impulse response function of each Gammatone filter:

G_j(k) = DFT{g_j(n/f_s)} = Σ_{n=0}^{N−1} g_j(n/f_s)·e^{−j2πkn/K},  k = 0, 1, …, K−1

where G_j(k) is the frequency-domain expression of the j-th Gammatone filter, k is the frequency bin index, K is the discrete Fourier transform length, N is the frame length, K = 2N, f_s denotes the signal sampling rate, and DFT(·) denotes the discrete Fourier transform;
S23. Calculate the sub-band SRP-PHAT function of each frame signal:

P(i,j,r) = Σ_{m=1}^{M−1} Σ_{n=m+1}^{M} Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i,k)·X_n*(i,k) / |X_m(i,k)·X_n*(i,k)|] · e^{j2πk·f_s·τ_mn(r)/K}

where P(i,j,r) denotes the j-th sub-band SRP-PHAT function of the i-th frame signal when the beam direction is r; M is the number of microphones in the microphone array; (·)* denotes the conjugate; and τ_mn(r) denotes the time difference of the sound wave propagating from beam direction r to the m-th and n-th microphones, calculated as:

τ_mn(r) = (‖r − r_m‖ − ‖r − r_n‖) / c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, and c is the speed of sound in air;
S24. Normalize the sub-band SRP-PHAT function of each frame signal:

P̂(i,j,r) = P(i,j,r) / max_r P(i,j,r)
S25. Combine all sub-band SRP-PHAT functions of the same frame signal into matrix form to obtain the sub-band SRP-PHAT spatial spectrum matrix:

Y(i) = [ P̂(i,1,r_1)  P̂(i,1,r_2)  …  P̂(i,1,r_L) ;
         P̂(i,2,r_1)  P̂(i,2,r_2)  …  P̂(i,2,r_L) ;
         ⋮ ;
         P̂(i,J,r_1)  P̂(i,J,r_2)  …  P̂(i,J,r_L) ]

where Y(i) denotes the J × L sub-band SRP-PHAT spatial spectrum matrix of the i-th frame signal, J is the number of sub-bands, i.e., the number of Gammatone filters, and L is the number of beam directions.
3. The sound source localization method according to claim 2, wherein in step S23, when the sound source and the microphone array are set in the same horizontal plane and the sound source lies in the far field of the microphone array, the equivalent formula for τ_mn(r) is:

τ_mn(r) = ξ^T·(r_n − r_m) / c

where ξ = [cosθ, sinθ]^T and θ is the azimuth angle of the beam direction r.
4. The sound source localization method based on the convolutional neural network and the sub-band SRP-PHAT spatial spectrum as claimed in claim 1, wherein the convolutional neural network comprises an input layer, three convolution-pooling layers, a fully-connected layer and an output layer connected in sequence;
in the convolution-pooling layers, each convolution layer uses 3 × 3 convolution kernels with a stride of 1; the numbers of convolution kernels in the three convolution layers are 24, 48 and 96 in turn; after each convolution operation, batch normalization is performed first and then the ReLU activation function is applied; zero padding is used in the convolution operation so that the feature dimensions before and after convolution remain unchanged; the pooling layers use max pooling with a pooling size of 2 × 2 and a stride of 2;
after the convolution-pooling layers, the feature data are flattened into a one-dimensional feature vector;
Dropout is applied to the connection between the fully-connected layer and the one-dimensional feature vector;
the output layer uses a Softmax classifier.
5. The sound source localization method based on the convolutional neural network and the subband SRP-PHAT spatial spectrum as claimed in claim 1, wherein the training step of the convolutional neural network is as follows:
S1. Convolve clean speech signals with room impulse responses for different azimuth angles, and add different levels of noise and reverberation to generate a plurality of directional speech signals with different specified azimuth angles:

x_m(t) = h_m(t) * s(t) + v_m(t),  m = 1, 2, …, M

where x_m(t) denotes the directional speech signal of a specified azimuth received by the m-th microphone of the microphone array; m is the microphone index, m = 1, 2, …, M, and M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth to the m-th microphone; and v_m(t) denotes noise;
S2. Preprocess all directional speech signals by framing and windowing to obtain single-frame signals, and calculate the sub-band SRP-PHAT spatial spectrum matrix of each frame signal;
S3. Use the sub-band SRP-PHAT spatial spectrum matrices of all directional speech signals as training samples and the specified azimuth angles of the directional speech signals as the class labels of the corresponding training samples; with the training samples and class labels as the training data set, train the convolutional neural network with a stochastic gradient descent algorithm with momentum to minimize the loss function.
CN202110059164.1A 2021-01-18 2021-01-18 Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum Active CN112904279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110059164.1A CN112904279B (en) 2021-01-18 2021-01-18 Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110059164.1A CN112904279B (en) 2021-01-18 2021-01-18 Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum

Publications (2)

Publication Number Publication Date
CN112904279A true CN112904279A (en) 2021-06-04
CN112904279B CN112904279B (en) 2024-01-26

Family

ID=76114123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110059164.1A Active CN112904279B (en) 2021-01-18 2021-01-18 Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum

Country Status (1)

Country Link
CN (1) CN112904279B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113589230A (en) * 2021-09-29 2021-11-02 广东省科学院智能制造研究所 Target sound source positioning method and system based on joint optimization network
CN113655440A (en) * 2021-08-09 2021-11-16 西南科技大学 Self-adaptive compromising pre-whitening sound source positioning method
CN114897033A (en) * 2022-07-13 2022-08-12 中国人民解放军海军工程大学 Three-dimensional convolution kernel group calculation method for multi-beam narrow-band process data
CN114994608A (en) * 2022-04-21 2022-09-02 西北工业大学深圳研究院 Multi-device self-organizing microphone array sound source positioning method based on deep learning
CN115201753A (en) * 2022-09-19 2022-10-18 泉州市音符算子科技有限公司 Low-power-consumption multi-spectral-resolution voice positioning method
CN115331691A (en) * 2022-10-13 2022-11-11 广州成至智能机器科技有限公司 Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium
CN116859336A (en) * 2023-07-14 2023-10-10 苏州大学 High-precision implementation method for sound source localization

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110222707A1 (en) * 2010-03-15 2011-09-15 Do Hyung Hwang Sound source localization system and method
CN109164415A (en) * 2018-09-07 2019-01-08 东南大学 A kind of binaural sound sources localization method based on convolutional neural networks
CN109490822A (en) * 2018-10-16 2019-03-19 南京信息工程大学 Voice DOA estimation method based on ResNet
CN110133572A (en) * 2019-05-21 2019-08-16 南京林业大学 A kind of more sound localization methods based on Gammatone filter and histogram
CN110133596A (en) * 2019-05-13 2019-08-16 南京林业大学 A kind of array sound source localization method based on frequency point signal-to-noise ratio and biasing soft-decision
CN110517705A (en) * 2019-08-29 2019-11-29 北京大学深圳研究生院 A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks
CN110544490A (en) * 2019-07-30 2019-12-06 南京林业大学 sound source positioning method based on Gaussian mixture model and spatial power spectrum characteristics
WO2020042708A1 (en) * 2018-08-31 2020-03-05 大象声科(深圳)科技有限公司 Time-frequency masking and deep neural network-based sound source direction estimation method
CN111123202A (en) * 2020-01-06 2020-05-08 北京大学 Indoor early reflected sound positioning method and system
CN111583948A (en) * 2020-05-09 2020-08-25 南京工程学院 Improved multi-channel speech enhancement system and method
CN111707990A (en) * 2020-08-19 2020-09-25 东南大学 Binaural sound source positioning method based on dense convolutional network
CN111968677A (en) * 2020-08-21 2020-11-20 南京工程学院 Voice quality self-evaluation method for fitting-free hearing aid

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110222707A1 (en) * 2010-03-15 2011-09-15 Do Hyung Hwang Sound source localization system and method
WO2020042708A1 (en) * 2018-08-31 2020-03-05 大象声科(深圳)科技有限公司 Time-frequency masking and deep neural network-based sound source direction estimation method
CN109164415A (en) * 2018-09-07 2019-01-08 东南大学 A kind of binaural sound sources localization method based on convolutional neural networks
CN109490822A (en) * 2018-10-16 2019-03-19 南京信息工程大学 Voice DOA estimation method based on ResNet
CN110133596A (en) * 2019-05-13 2019-08-16 南京林业大学 A kind of array sound source localization method based on frequency point signal-to-noise ratio and biasing soft-decision
CN110133572A (en) * 2019-05-21 2019-08-16 南京林业大学 A kind of more sound localization methods based on Gammatone filter and histogram
CN110544490A (en) * 2019-07-30 2019-12-06 南京林业大学 sound source positioning method based on Gaussian mixture model and spatial power spectrum characteristics
CN110517705A (en) * 2019-08-29 2019-11-29 北京大学深圳研究生院 A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks
CN111123202A (en) * 2020-01-06 2020-05-08 北京大学 Indoor early reflected sound positioning method and system
CN111583948A (en) * 2020-05-09 2020-08-25 南京工程学院 Improved multi-channel speech enhancement system and method
CN111707990A (en) * 2020-08-19 2020-09-25 东南大学 Binaural sound source positioning method based on dense convolutional network
CN111968677A (en) * 2020-08-21 2020-11-20 南京工程学院 Voice quality self-evaluation method for fitting-free hearing aid

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
S. Jiang, W. L., P. Yuan, Y. Sun and H. Liu, "Deep and CNN fusion method for binaural sound source localization," The Journal of Engineering, p. 511 *
Vecchiotti et al., "End-to-end Binaural Sound Localisation from the Raw Waveform," IEEE, pp. 451-455 *
Xiaoyan Zhao et al., "Sound Source Localization Based on SRP-PHAT Spatial Spectrum and Deep Neural Network," Computers, Materials & Continua, pp. 253-271 *
张文涛; 韩莹莹; 黎恒, "Traffic sound event recognition method based on convolutional neural network" (基于卷积神经网络的交通声音事件识别方法), Modern Electronics Technique (现代电子技术), no. 14
王茜茜, "Research on robust binaural sound source localization based on neural networks" (基于神经网络的鲁棒双耳声源定位研究), China Master's Theses Full-text Database, Information Science and Technology, pp. 136-129 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113655440A (en) * 2021-08-09 2021-11-16 西南科技大学 Self-adaptive compromising pre-whitening sound source positioning method
CN113589230A (en) * 2021-09-29 2021-11-02 广东省科学院智能制造研究所 Target sound source positioning method and system based on joint optimization network
CN114994608A (en) * 2022-04-21 2022-09-02 西北工业大学深圳研究院 Multi-device self-organizing microphone array sound source positioning method based on deep learning
CN114994608B (en) * 2022-04-21 2024-05-14 西北工业大学深圳研究院 Multi-device self-organizing microphone array sound source positioning method based on deep learning
CN114897033A (en) * 2022-07-13 2022-08-12 中国人民解放军海军工程大学 Three-dimensional convolution kernel group calculation method for multi-beam narrow-band process data
CN114897033B (en) * 2022-07-13 2022-09-27 中国人民解放军海军工程大学 Three-dimensional convolution kernel group calculation method for multi-beam narrow-band process data
CN115201753A (en) * 2022-09-19 2022-10-18 泉州市音符算子科技有限公司 Low-power-consumption multi-spectral-resolution voice positioning method
CN115331691A (en) * 2022-10-13 2022-11-11 广州成至智能机器科技有限公司 Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium
CN116859336A (en) * 2023-07-14 2023-10-10 苏州大学 High-precision implementation method for sound source localization

Also Published As

Publication number Publication date
CN112904279B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
CN112904279B (en) Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
Diaz-Guerra et al. Robust sound source tracking using SRP-PHAT and 3D convolutional neural networks
CN107703486B (en) Sound source positioning method based on convolutional neural network CNN
CN109490822B (en) Voice DOA estimation method based on ResNet
CN110068795A (en) A kind of indoor microphone array sound localization method based on convolutional neural networks
CN112151059A (en) Microphone array-oriented channel attention weighted speech enhancement method
CN107452389A (en) A kind of general monophonic real-time noise-reducing method
US20040175006A1 (en) Microphone array, method and apparatus for forming constant directivity beams using the same, and method and apparatus for estimating acoustic source direction using the same
Vesperini et al. Localizing speakers in multiple rooms by using deep neural networks
Morito et al. Partially Shared Deep Neural Network in sound source separation and identification using a UAV-embedded microphone array
CN113111765B (en) Multi-voice source counting and positioning method based on deep learning
CN112180318B (en) Sound source direction of arrival estimation model training and sound source direction of arrival estimation method
CN110888105A (en) DOA estimation method based on convolutional neural network and received signal strength
CN111123202B (en) Indoor early reflected sound positioning method and system
CN111443328A (en) Sound event detection and positioning method based on deep learning
Salvati et al. Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features.
CN113593596B (en) Robust self-adaptive beam forming directional pickup method based on subarray division
CN112201276B (en) TC-ResNet network-based microphone array voice separation method
CN116559778B (en) Vehicle whistle positioning method and system based on deep learning
CN110838303B (en) Voice sound source positioning method using microphone array
CN112269158A (en) Method for positioning voice source by utilizing microphone array based on UNET structure
CN111948609A (en) Binaural sound source positioning method based on Soft-argmax regression device
CN116859336A (en) High-precision implementation method for sound source localization
Firoozabadi et al. Combination of nested microphone array and subband processing for multiple simultaneous speaker localization
CN114895245A (en) Microphone array sound source positioning method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant