CN112904279B - Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum - Google Patents
Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
- Publication number: CN112904279B (application CN202110059164.1A)
- Authority: CN (China)
- Prior art keywords: SRP-PHAT, subband, frame, sound source
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G01S5/22 — Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
- G06N3/045 — Neural networks; combinations of networks
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L25/30 — Speech or voice analysis characterised by the analysis technique using neural networks
- G10L25/45 — Speech or voice analysis characterised by the type of analysis window
- G10L2021/02082 — Noise filtering, the noise being echo or reverberation of the speech
- G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166 — Microphone arrays; beamforming
Abstract
The invention discloses a sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum, comprising the following steps: a microphone array collects speech signals, and the collected signals are preprocessed by framing and windowing to obtain single-frame signals; the subband SRP-PHAT spatial spectrum matrix of each frame is calculated; the subband SRP-PHAT spatial spectrum matrices of all frames are input into a trained convolutional neural network, which outputs the probability of the speech signal belonging to each azimuth, and the azimuth with the highest probability is taken as the estimate of the sound source azimuth. The invention improves the sound source localization performance of a microphone array in complex acoustic environments and its generalization to the sound source spatial structure, reverberation and noise; the training of the convolutional neural network can be completed offline and the trained network stored in memory, so that only one frame of signal is needed at test time to achieve real-time sound source localization.
Description
Technical Field
The invention belongs to the field of sound source localization, and particularly relates to a sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum.
Background
The sound source localization technology based on microphone arrays has wide application prospects and potential economic value in the front-end processing of speech recognition, speaker recognition and emotion recognition systems, as well as in video conferencing, intelligent robots, smart homes, intelligent vehicle-mounted equipment, hearing aids and the like. Among conventional sound source localization methods, the SRP-PHAT (Steered Response Power - Phase Transform) method is the most popular and commonly used; it achieves localization by detecting the peak of a spatial spectrum, but noise and reverberation often cause the spatial spectrum to exhibit multimodal characteristics. In particular, in a strongly reverberant environment the spatial-spectrum peak produced by reflected sound may exceed the peak of the direct sound, leading to misdetection of the sound source position. In recent years, model-based sound source localization methods have been applied in complex acoustic environments; they model spatial feature parameters to build a mapping between sound source position and those parameters, but current algorithms generalize poorly to unknown environments (noise and reverberation) and their performance needs further improvement. The spatial feature parameters and the modeling method are the main factors affecting the performance of model-based sound source localization.
Disclosure of Invention
The invention aims to: in order to overcome the problems in the prior art, the invention discloses a sound source positioning method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum, which adopts the subband SRP-PHAT spatial spectrum as a spatial characteristic parameter, adopts the convolutional neural network (Convolutional Neural Network, CNN) to model the spatial characteristic parameters of directional voice data under various reverberation and noise environments, can improve the sound source positioning performance of a microphone array under a complex acoustic environment, and improves the generalization capability of the sound source spatial structure, the reverberation and the noise.
The technical scheme is as follows: in order to achieve the above purpose, the invention adopts the following technical scheme: a sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum is characterized by comprising the following steps:
s1, a microphone array collects voice signals, and the collected voice signals are subjected to framing and windowing pretreatment to obtain single-frame signals;
s2, calculating a subband SRP-PHAT spatial spectrum matrix of each frame of signal;
s3, inputting the subband SRP-PHAT space spectrum matrix of all frame signals into the convolutional neural network after training is completed, outputting the probability that the voice signal belongs to each azimuth, and taking the azimuth with the highest probability as the estimated value of the azimuth of the sound source of the voice signal.
Preferably, in step S2, calculating the subband SRP-PHAT spatial spectrum matrix of each frame signal comprises the following steps:
S21, performing a discrete Fourier transform on each frame of signal:

X_m(i,k) = DFT(x_m(i,n)), k = 0, 1, …, K−1

wherein x_m(i,n) is the i-th frame signal of the m-th microphone in the microphone array, m = 1, 2, …, M, M is the number of microphones; X_m(i,k) is the discrete Fourier transform of x_m(i,n), representing the frequency-domain signal of the i-th frame of the m-th microphone; k is the frequency bin; K is the length of the discrete Fourier transform; N is the frame length, K = 2N; DFT(·) denotes the discrete Fourier transform;
S22, designing the impulse response functions of the Gammatone filter bank:

g_j(t) = c·t^(a−1)·e^(−2πb_j·t)·cos(2πf_j·t + φ), t ≥ 0

wherein j denotes the serial number of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:

b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)
A discrete Fourier transform is performed on the impulse response function of each Gammatone filter:

G_j(k) = DFT(g_j(n/f_s)), k = 0, 1, …, K−1

wherein G_j(k) is the frequency-domain expression of the j-th Gammatone filter; k is the frequency bin; K is the length of the discrete Fourier transform; N is the frame length, K = 2N; f_s denotes the signal sampling rate; DFT(·) denotes the discrete Fourier transform;
S23, calculating the subband SRP-PHAT function of each frame of signal:

P(i,j,r) = Σ_{m=1}^{M−1} Σ_{n=m+1}^{M} Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i,k)·X_n*(i,k) / |X_m(i,k)·X_n*(i,k)|] · e^{j2πk·f_s·τ_mn(r)/K}

wherein P(i,j,r) denotes the j-th subband SRP-PHAT function of the i-th frame signal when the beam direction is r; M is the number of microphones in the microphone array; τ_mn(r) denotes the time difference of propagation of the sound wave from beam direction r to the m-th and n-th microphones, calculated as:

τ_mn(r) = (‖r − r_m‖ − ‖r − r_n‖)/c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, and c is the speed of sound in air;
S24, normalizing the subband SRP-PHAT function of each frame of signal:

P̃(i,j,r) = P(i,j,r) / max_r P(i,j,r);
S25, combining all subband SRP-PHAT functions of the same frame signal into matrix form to obtain the subband SRP-PHAT spatial spectrum matrix:

y(i) = [P̃(i,j,r_l)], j = 1, …, J, l = 1, …, L

wherein y(i) denotes the J×L subband SRP-PHAT spatial spectrum matrix of the i-th frame signal, J is the number of subbands, i.e. the number of Gammatone filters, and L is the number of beam directions.
Preferably, in step S23, when the sound source is in the same horizontal plane as the microphone array and is located in the far field of the microphone array, the equivalent calculation formula of τ_mn(r) is:

τ_mn(r) = ζ^T·(r_n − r_m)/c

wherein ζ = [cosθ, sinθ]^T and θ is the azimuth angle of the beam direction r.
Preferably, the convolutional neural network comprises an input layer, three convolutional-pooling layers, a full-connection layer and an output layer which are sequentially connected;
in the convolution-pooling layers, each convolution layer uses 3×3 convolution kernels with stride 1; the numbers of convolution kernels of the three convolution layers are 24, 48 and 96 in sequence; after each convolution operation, batch normalization is applied first, followed by ReLU activation; zero padding is used in the convolution so that the feature dimensions before and after convolution remain unchanged; the pooling layers use max pooling with pooling size 2×2 and stride 2;
after the convolution-pooling layers, the feature data is flattened into a one-dimensional feature vector;
Dropout is applied in the connection between the flattened one-dimensional feature vector and the fully connected layer;
the output layer adopts a Softmax classifier.
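As a sketch only, the architecture described above can be expressed in PyTorch roughly as follows. The 36×72 input size, 36 output classes, and the ceil-mode pooling that yields the 5×9×96 feature map are taken from the embodiment later in the text; the class name is hypothetical, and Softmax is left to the loss function or the inference step, as is conventional in PyTorch:

```python
import torch
import torch.nn as nn

class SubbandSrpPhatCNN(nn.Module):
    """Hypothetical rendering of the described CNN: three conv-pooling
    stages (24/48/96 kernels, 3x3, stride 1, zero padding, batch norm +
    ReLU, 2x2 max pooling), flatten, Dropout, fully connected output."""

    def __init__(self, n_classes=36):
        super().__init__()
        stages, in_ch = [], 1
        for out_ch in (24, 48, 96):
            stages += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(),
                # ceil_mode so a 36x72 input yields the 5x9x96 feature map
                nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*stages)
        self.classifier = nn.Sequential(
            nn.Flatten(),            # 5 * 9 * 96 = 4320 features
            nn.Dropout(p=0.5),
            nn.Linear(5 * 9 * 96, n_classes),
        )

    def forward(self, x):            # x: (batch, 1, J, L) spatial spectra
        # Softmax is applied by the loss during training, or explicitly
        # via torch.softmax at inference
        return self.classifier(self.features(x))
```

At inference, `torch.softmax(model(y), dim=1)` would give the per-azimuth probabilities used in step S3.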
Preferably, the training steps of the convolutional neural network are as follows:
S1, convolve clean speech signals with room impulse responses of different azimuth angles, and add noise and reverberation of different degrees, generating a plurality of directional speech signals of different specified azimuth angles:

x_m(t) = h_m(t) * s(t) + v_m(t), m = 1, 2, …, M

wherein x_m(t) denotes the directional speech signal of a specified azimuth angle received by the m-th microphone in the microphone array; m is the serial number of the microphone, M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth angle to the m-th microphone; v_m(t) denotes noise;
S2, perform framing and windowing preprocessing on all directional speech signals to obtain single-frame signals, and calculate the subband SRP-PHAT spatial spectrum matrix of each frame;
S3, take the subband SRP-PHAT spatial spectrum matrices of all directional speech signals as training samples, take the specified azimuth angles of the directional speech signals as the class labels of the corresponding training samples, use the training samples and class labels as the training data set, and train the convolutional neural network by minimizing the loss function with a stochastic gradient descent with momentum algorithm.
Beneficial effects: the invention has the following notable beneficial effects:
1. the invention can improve the sound source positioning performance of the microphone array in a complex acoustic environment and the generalization capability of the sound source space structure, reverberation and noise;
2. the invention adopts the subband SRP-PHAT spatial spectrum as the spatial characteristic parameter, and the parameter not only can represent the whole acoustic environment information, but also has the advantage of strong robustness; modeling spatial feature parameters of directional voice data in various reverberation and noise environments by adopting a convolutional neural network, establishing a mapping relation between azimuth and the spatial feature parameters, and converting a sound source localization problem into a multi-classification problem;
3. the invention can finish the training process of the convolutional neural network offline, store the trained convolutional neural network in the memory, and realize real-time sound source localization only by one frame of signal during testing.
Drawings
FIG. 1 is a flow chart of an algorithm of the present invention;
FIG. 2 is a diagram of a model structure of a convolutional neural network in accordance with the present invention;
FIG. 3 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the test environment and the training environment are consistent and the reverberation time is 0.5 s;
FIG. 4 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the test environment and the training environment are consistent and the reverberation time is 0.8 s;
FIG. 5 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the noise environments of the test environment and the training environment are inconsistent and the reverberation time is 0.5 s;
FIG. 6 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the noise environments of the test environment and the training environment are inconsistent and the reverberation time is 0.8 s;
FIG. 7 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the reverberation of the test environment and the training environment is inconsistent and the reverberation time of the test environment is 0.6 s;
FIG. 8 is a graph comparing the positioning success rate of the method of the present invention with that of the conventional SRP-PHAT algorithm when the reverberation of the test environment and the training environment is inconsistent and the reverberation time of the test environment is 0.9 s.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
The subband SRP-PHAT spatial spectrum characterizes the spatial information of the whole acoustic environment, including sound source azimuth, room size, room reflection characteristics and the like; it is strongly robust and can serve as the spatial feature parameter of a localization system. A deep neural network can imitate the information-processing mode of the nervous system, can describe the fusion relations and structural information among spatial feature parameters, and has strong expressive and modeling capability, while requiring no assumptions on the data distribution during modeling. Among deep neural networks, the convolutional neural network is a type of neural network specialized in processing data with a grid-like structure and is applied to image or time-series data. The speech signal collected by a microphone array is exactly such time-series data.
Therefore, the invention provides a sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum, which is shown in figure 1 and comprises the following steps:
step one: convolving the clean speech signal with room impulse responses of different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of directional speech signals of different specified azimuth angles, namely microphone array signals:
x m (t)=h m (t)*s(t)+v m (t),m=1,2,...,M
wherein x_m(t) denotes the directional speech signal of a specified azimuth angle received by the m-th microphone in the microphone array; m is the serial number of the microphone, m = 1, 2, …, M, M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth angle to the m-th microphone and is related to the sound source azimuth and the room reverberation; v_m(t) denotes noise.
In this embodiment, the microphone array is set as a uniform circular array composed of 6 omnidirectional microphones with an array radius of 0.1 m. The sound source is set in the same horizontal plane as the microphone array and located in its far field. The direction directly in front of the horizontal plane is defined as 90°; the azimuth angle of the sound source ranges over [0°, 360°) at 10° intervals, and the number of training azimuths is denoted F, with F = 36. The reverberation times of the training data include 0.5 s and 0.8 s, and the Image method is used to generate the room impulse responses h_m(t) for the different azimuth angles at each reverberation time. v_m(t) is Gaussian white noise, and the signal-to-noise ratios of the training data include 0 dB, 5 dB, 10 dB, 15 dB and 20 dB.
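A minimal numpy sketch of the signal model x_m(t) = h_m(t)*s(t) + v_m(t) described above. The helper `make_directional_signal` and its per-channel noise-scaling scheme are illustrative assumptions; generating the room impulse responses themselves (e.g. with the Image method) is not shown:

```python
import numpy as np

def make_directional_signal(s, rirs, snr_db, rng):
    """Sketch of x_m(t) = h_m(t) * s(t) + v_m(t) for one specified azimuth.

    s: clean speech signal; rirs: list of M room impulse responses h_m
    (equal lengths assumed); snr_db: SNR of the added white Gaussian noise.
    """
    channels = []
    for h in rirs:
        clean = np.convolve(h, s)                        # h_m(t) * s(t)
        noise = rng.standard_normal(clean.shape)
        # scale v_m(t) so that 10*log10(P_clean / P_noise) = snr_db
        noise *= np.sqrt(np.mean(clean ** 2) /
                         (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
        channels.append(clean + noise)
    return np.stack(channels)                            # (M, T)
```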
Step two: preprocess the microphone array signals obtained in step one to obtain single-frame signals.
Preprocessing includes framing and windowing, wherein:
The framing method is as follows: with a preset frame length and frame shift, the directional speech signal x_m(t) of the specified azimuth angle at the m-th microphone is divided into a plurality of single-frame signals x_m(iN+n), where i is the frame number, n denotes the sample index within one frame, 0 ≤ n < N, and N is the frame length. In this embodiment the signal sampling rate f_s is 16 kHz, the frame length N is 512 (i.e., 32 ms), and the frame overlap is 0.
The windowing method is as follows: x_m(i,n) = w_H(n)·x_m(iN+n), where x_m(i,n) is the i-th frame signal of the m-th microphone after windowing and w_H(n) = 0.54 − 0.46·cos(2πn/(N−1)) is a Hamming window.
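The framing, windowing and (subsequent) DFT steps for one channel can be sketched with numpy as follows; `frames_to_spectra` is a hypothetical helper using the embodiment's values N = 512, K = 2N = 1024 and non-overlapping frames:

```python
import numpy as np

def frames_to_spectra(x, N=512, K=1024):
    """Split one microphone signal into non-overlapping frames x_m(iN+n),
    apply the Hamming window w_H(n), and take the K-point DFT of each frame."""
    n_frames = len(x) // N
    w = np.hamming(N)                                     # w_H(n)
    X = np.empty((n_frames, K), dtype=complex)
    for i in range(n_frames):
        X[i] = np.fft.fft(w * x[i * N:(i + 1) * N], n=K)  # X_m(i, k)
    return X
```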
And thirdly, extracting spatial characteristic parameters of the microphone array signals, namely a subband SRP-PHAT spatial spectrum matrix. The method specifically comprises the following steps:
(3-1) performing discrete fourier transform on each frame of the signal obtained in the step two, and converting the time domain signal into a frequency domain signal.
The discrete Fourier transform is calculated as:

X_m(i,k) = DFT(x_m(i,n)), k = 0, 1, …, K−1

wherein X_m(i,k) is the discrete Fourier transform of x_m(i,n), representing the frequency-domain signal of the i-th frame of the m-th microphone; k is the frequency bin; K is the length of the discrete Fourier transform, K = 2N; DFT(·) denotes the discrete Fourier transform. The length of the discrete Fourier transform is set to 1024 in this embodiment.
(3-2) designing a gammatine filter bank.
g_j(t) is the impulse response function of the j-th Gammatone filter, expressed as:

g_j(t) = c·t^(a−1)·e^(−2πb_j·t)·cos(2πf_j·t + φ), t ≥ 0

wherein j denotes the serial number of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:

b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)
In this embodiment, the order a is 4, the phase φ is set to 0, the number of Gammatone filters is 36, i.e., j = 1, 2, …, 36, and the center frequencies f_j of the Gammatone filters lie in the range [200 Hz, 8000 Hz].
A discrete Fourier transform is performed on the impulse response function of each Gammatone filter to obtain its frequency-domain expression:

G_j(k) = DFT(g_j(n/f_s)), k = 0, 1, …, K−1

wherein G_j(k) is the discrete Fourier transform of g_j(n/f_s) and denotes the frequency-domain expression of the j-th Gammatone filter; k is the frequency bin; K is the length of the discrete Fourier transform, K = 2N; DFT(·) denotes the discrete Fourier transform; f_s denotes the sampling rate. The length of the discrete Fourier transform is set to 1024 in this embodiment.
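The Gammatone filter bank above can be sketched in numpy as below, using the embodiment's values (36 filters, order a = 4, phase 0, b_j = 1.109·ERB(f_j) as given in the text). The linear center-frequency spacing over [200 Hz, 8000 Hz] and the peak-gain normalization are this sketch's assumptions — the text specifies only the frequency range:

```python
import numpy as np

def gammatone_bank(J=36, fs=16000, N=512, K=1024, a=4, f_lo=200.0, f_hi=8000.0):
    """Frequency responses G_j(k) of a J-channel Gammatone filter bank."""
    t = np.arange(N) / fs                               # sampled times n / f_s
    G = np.empty((J, K), dtype=complex)
    for j, fc in enumerate(np.linspace(f_lo, f_hi, J)):  # assumed spacing
        erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)         # ERB(f_j)
        b = 1.109 * erb                                 # attenuation factor b_j
        g = t ** (a - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        g /= np.abs(g).max()                            # gain c (assumed normalization)
        G[j] = np.fft.fft(g, n=K)                       # K-point DFT of g_j(n / f_s)
    return G
```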
(3-3) Calculate the subband SRP-PHAT function of each frame of signal:

P(i,j,r) = Σ_{m=1}^{M−1} Σ_{n=m+1}^{M} Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i,k)·X_n*(i,k) / |X_m(i,k)·X_n*(i,k)|] · e^{j2πk·f_s·τ_mn(r)/K}

wherein P(i,j,r) denotes the j-th subband SRP-PHAT function of the i-th frame signal when the beam direction of the array is r; (·)* denotes complex conjugation; τ_mn(r) denotes the time difference of propagation of the sound wave from beam direction r to the m-th and n-th microphones, calculated as:

τ_mn(r) = (‖r − r_m‖ − ‖r − r_n‖)/c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, c is the speed of sound in air (about 342 m/s at normal temperature), f_s is the signal sampling rate, and ‖·‖ denotes the 2-norm.
In this embodiment, the sound source and the microphone array are set in the same horizontal plane and the sound source is located in the far field of the microphone array, so the equivalent calculation formula of τ_mn(r) is:

τ_mn(r) = ζ^T·(r_n − r_m)/c

wherein ζ = [cosθ, sinθ]^T and θ is the azimuth angle of the beam direction r. τ_mn(r) is independent of the received signal and can be computed offline and stored in memory.
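With the embodiment's geometry (uniform circular array of 6 microphones, radius 0.1 m; 72 beam azimuths at 5° spacing), the offline τ_mn(r) table can be sketched as below. The microphone placement angles are an assumption — the text does not specify them:

```python
import numpy as np

C_SOUND = 342.0  # speed of sound in air (m/s), as in the text

def far_field_tdoas(mic_xy, azimuths_deg):
    """tau_mn(r) = zeta^T (r_n - r_m) / c for all mic pairs and beam azimuths,
    with zeta = [cos(theta), sin(theta)]^T. Returns an array tau[l, m, n]."""
    theta = np.deg2rad(np.asarray(azimuths_deg, dtype=float))
    zeta = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # (L, 2)
    diff = mic_xy[None, :, :] - mic_xy[:, None, :]            # (M, M, 2): r_n - r_m
    return np.einsum('ld,mnd->lmn', zeta, diff) / C_SOUND

# Embodiment geometry (mic placement angles assumed evenly spaced)
mic_angles = np.deg2rad(np.arange(6) * 60.0)
mic_xy = 0.1 * np.stack([np.cos(mic_angles), np.sin(mic_angles)], axis=1)
taus = far_field_tdoas(mic_xy, np.arange(0, 360, 5))          # (72, 6, 6)
```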
The subband SRP-PHAT function P(i,j,r) is normalized as follows:

P̃(i,j,r) = P(i,j,r) / max_r P(i,j,r)
(3-4) Combine all normalized subband SRP-PHAT functions of the same frame signal into matrix form to obtain the subband SRP-PHAT spatial spectrum matrix:

y(i) = [P̃(i,j,r_l)], j = 1, …, J, l = 1, …, L

where y(i) denotes the spatial feature parameter of the i-th frame signal, i.e. the J×L subband SRP-PHAT spatial spectrum matrix, and J is the number of subbands, i.e. the number of Gammatone filters; in this embodiment J = 36. The azimuth range of the beam directions of the array in this embodiment is [0°, 360°), with 90° defined as directly in front of the horizontal plane and an interval of 5°, so the number of beam directions L = 72. Taking the number of beam directions L larger than the number of training azimuths F generally improves the accuracy of the spatial feature parameters of the signal and thereby the training accuracy of the CNN model.
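Putting the pieces together, one frame's J×L spatial spectrum matrix y(i) can be sketched as follows. The |G_j(k)|² subband weighting and per-subband max normalization are this sketch's reading of the subband SRP-PHAT formulas above; inputs are the frame spectra, the Gammatone responses, and a precomputed TDOA table:

```python
import numpy as np

def subband_srp_phat_matrix(X, G, taus, fs=16000):
    """Sketch: J x L subband SRP-PHAT spatial spectrum y(i) of one frame.

    X:    (M, K) frame spectra X_m(i, k)
    G:    (J, K) Gammatone frequency responses G_j(k)
    taus: (L, M, M) TDOA table tau_mn(r_l)
    """
    M, K = X.shape
    k = np.arange(K)
    W = np.abs(G) ** 2                               # subband weights |G_j(k)|^2
    P = np.zeros((G.shape[0], taus.shape[0]))        # (J, L)
    for m in range(M):
        for n in range(m + 1, M):                    # microphone pairs m < n
            cross = X[m] * np.conj(X[n])
            cross = cross / (np.abs(cross) + 1e-12)  # PHAT weighting
            # steering phases e^{j 2 pi k f_s tau_mn(r_l) / K}, shape (L, K)
            steer = np.exp(1j * 2 * np.pi * np.outer(taus[:, m, n], k) * fs / K)
            P += np.real(W @ (cross[None, :] * steer).T)
    return P / (np.abs(P).max(axis=1, keepdims=True) + 1e-12)  # per-subband norm
```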
Step four, prepare the training set: following steps one to three, extract the spatial feature parameters of the directional speech signals under all training environments (the training-environment settings are detailed in step one) as the CNN training samples, and label each training sample with its corresponding specified azimuth angle as the class label.
Step five, construct the CNN model and train it on the training samples and class labels obtained in step four as the CNN training data set. This comprises the following steps:
(5-1) setting a CNN model structure.
The CNN structure employed in the present invention is shown in fig. 2 as comprising an input layer followed by three convolution-pooling layers, then a fully connected layer, and finally an output layer.
The input signal of the input layer is the two-dimensional J×L subband SRP-PHAT spatial spectrum matrix, i.e. a training sample; in this embodiment J = 36 and L = 72.
The input layer is followed by three convolution-pooling layers. Each convolution layer uses 3×3 convolution kernels with stride 1, and zero padding is used in the convolution so that the feature dimensions before and after convolution remain unchanged. The numbers of convolution kernels of the 1st, 2nd and 3rd convolution layers are 24, 48 and 96, respectively. After each convolution operation, batch normalization is applied first, followed by ReLU activation. The pooling layers use max pooling with pooling size 2×2 and stride 2.
Through the three convolution-pooling operations, the 36×72 two-dimensional subband SRP-PHAT spatial spectrum matrix becomes 5×9×96 feature data, which is flattened into a 4320×1 one-dimensional feature vector. Neurons in the fully connected layer are connected to all feature data in the previous layer; Dropout is added to this connection to prevent overfitting, with the Dropout rate set to 0.5.
The output layer uses a Softmax classifier: the Softmax function converts the feature data of the fully connected layer into the probability that the speech signal belongs to each azimuth, and the azimuth with the highest probability is taken as the predicted sound source direction.
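As a sanity check on the dimensions above, the following sketch propagates the feature-map shape through the three convolution-pooling stages. Obtaining a 5×9 map (rather than 4×9) from a 36×72 input requires the odd intermediate size 9 to round up at the third pooling stage, so ceil-mode (or padded) pooling is assumed here; the function name is illustrative:

```python
import math

def cnn_feature_shape(h, w, channels_per_layer=(24, 48, 96)):
    """Propagate the feature-map shape through three 'same'-padded 3x3
    convolutions (stride 1), each followed by 2x2 max pooling (stride 2).

    'Same' zero padding keeps h and w unchanged through each convolution;
    pooling halves them, rounding up when the size is odd (ceil mode).
    """
    c = 1  # single-channel J x L input spectrum
    for c_out in channels_per_layer:
        c = c_out                # conv: spatial size unchanged, channels grow
        h = math.ceil(h / 2)     # 2x2 max pool, stride 2, ceil mode
        w = math.ceil(w / 2)
    return h, w, c

h, w, c = cnn_feature_shape(36, 72)
flat = h * w * c
print(h, w, c, flat)  # 5 9 96 4320
```

With floor-mode pooling the result would instead be 4×9×96 = 3456, so the stated 5×9×96 = 4320 implies ceil-mode or padded pooling.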
(5-2) training network parameters of the CNN model.
The training process of the CNN includes two parts: forward propagation and backward propagation.

Forward propagation computes the output of the input data under the current network parameters; it is a layer-by-layer transfer of features. The forward propagation expression at position (u, v) in the d-th layer is:

S^d(u, v) = ReLU((S^{d-1} * w^d)(u, v) + β^d(u, v))

where d denotes the layer index (the d-th layer is a convolution layer), S^d denotes the output of the d-th layer, S^{d-1} denotes the output of the (d-1)-th layer, * denotes the convolution operation, w^d denotes the convolution-kernel weights of the d-th layer, β^d denotes the bias of the d-th layer, and ReLU is the activation function. The layers in the CNN structure adopted by the invention comprise the input layer, the convolution and pooling layers within the convolution-pooling layers, the fully connected layer and the output layer.
When d = D, i.e., the output layer, the expression of the output layer is:

S^D = Softmax((w^D)^T S^{D-1} + β^D)

where S^D denotes the output of the output layer, S^{D-1} denotes the output of the fully connected layer, w^D denotes the weights of the output layer, and β^D denotes the bias of the output layer.
The goal of the back-propagation stage is to minimize the cross-entropy loss function E(w, β):

E(w, β) = -Σ_{f=1}^{F} ŷ_f log(y_f)

where the subscript f denotes the f-th azimuth, ŷ_f denotes the desired output of the output layer at the f-th azimuth, and y_f denotes the actual output of the output layer at the f-th azimuth. F denotes the number of training azimuths; in this embodiment F = 36. The invention uses the Stochastic Gradient Descent with Momentum (SGDM) algorithm to minimize the loss function, with the following SGDM parameters: the momentum is set to 0.9, the L2 regularization coefficient is 0.0001, the initial learning rate is set to 0.01, the learning rate is multiplied by 0.2 every 6 epochs, and the mini-batch size is set to 200.
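A minimal numerical sketch of the loss and update rule described above, using the stated hyperparameters (momentum 0.9, L2 coefficient 0.0001, initial learning rate 0.01); the function names and the use of a one-hot desired output are assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_hat, y):
    """E = -sum_f y_hat_f * log(y_f), with one-hot desired output y_hat."""
    return -np.sum(y_hat * np.log(y + 1e-12))

def sgdm_step(w, grad, velocity, lr=0.01, momentum=0.9, l2=1e-4):
    """One stochastic-gradient-descent-with-momentum update, with the
    L2 regularization term added to the gradient."""
    velocity = momentum * velocity - lr * (grad + l2 * w)
    return w + velocity, velocity

F = 36                                 # number of training azimuths
logits = np.zeros(F)                   # untrained network: uniform output
y = softmax(logits)
y_hat = np.zeros(F); y_hat[7] = 1.0    # true azimuth class 7 (one-hot)
loss = cross_entropy(y_hat, y)
print(round(loss, 4))                  # -log(1/36) ≈ 3.5835
```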
The invention adopts a 7:3 cross-validation split during training, and iterative training is repeated until convergence. At this point, CNN model training is complete.
Step six, processing the test signal according to steps two and three to obtain the spatial characteristic parameter of the single-frame test signal, namely its subband SRP-PHAT spatial spectrum matrix, which is taken as a test sample.
Step seven, taking the test sample as the input feature of the CNN model trained in step five; the CNN outputs the probability that the test signal belongs to each azimuth angle, and the azimuth with the highest probability is taken as the estimate of the sound source azimuth of the test sample.
In contrast to the prior art, the method of the present invention comprises two stages: training and testing. In the training stage, spatial characteristic parameters are extracted from directional speech signals under various reverberation and noise environments and input into the CNN for training to obtain a CNN model. In the testing stage, the spatial characteristic parameters of the test signal are extracted and fed into the trained CNN model, and the azimuth with the highest probability is taken as the estimate of the target sound source azimuth. The CNN training process can be completed offline and the trained CNN model stored in memory, so that only a single frame of signal is needed at test time to achieve real-time sound source localization. Compared with the conventional SRP-PHAT algorithm, the proposed algorithm markedly improves localization performance in complex acoustic environments and generalizes better across sound source spatial configurations, reverberation and noise.
Figs. 3 and 4 compare the localization performance of the method of the present invention with the conventional SRP-PHAT algorithm when the test environment matches the training environment: in Fig. 3 the reverberation time of both the test and training environments is 0.5 s, and in Fig. 4 it is 0.8 s. Localization was examined at signal-to-noise ratios of 0 dB, 5 dB, 10 dB, 15 dB and 20 dB; the localization success rate of the proposed method is far higher than that of the conventional SRP-PHAT algorithm.

Figs. 5 and 6 compare the localization performance of the method of the present invention with the conventional SRP-PHAT algorithm when the signal-to-noise ratios of the test and training environments differ: in Fig. 5 the reverberation time of both environments is 0.5 s, and in Fig. 6 it is 0.8 s, with the test signal-to-noise ratio differing from that used in training. Localization was examined at test signal-to-noise ratios of -2 dB, 3 dB, 8 dB, 13 dB and 18 dB; the success rate of the proposed method is again far higher than that of the conventional SRP-PHAT algorithm.

Figs. 7 and 8 compare the localization performance of the method of the present invention with the conventional SRP-PHAT algorithm when the reverberation times of the test and training environments differ: the test reverberation time is 0.6 s in Fig. 7 and 0.9 s in Fig. 8, in both cases different from the training reverberation times. Localization was examined at signal-to-noise ratios of 0 dB, 5 dB, 10 dB, 15 dB and 20 dB; the success rate of the proposed method remains far higher than that of the conventional SRP-PHAT algorithm.

As can be seen from Figs. 5 to 8, the success rate of the proposed method remains far higher than that of the conventional SRP-PHAT algorithm even in environments not seen during training, which shows that the method has good robustness and generalization capability with respect to unknown environments.
The foregoing is only a preferred embodiment of the invention. It should be noted that various modifications and adaptations may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of the invention.
Claims (4)
1. A sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum is characterized by comprising the following steps:
S1, a microphone array collects speech signals, and the collected speech signals are subjected to framing and windowing preprocessing to obtain single-frame signals;

S2, calculating the subband SRP-PHAT spatial spectrum matrix of each frame of signal; this specifically comprises the following steps:
S21, performing a discrete Fourier transform on each frame of signal:

X_m(i, k) = DFT(x_m(i, n))

where x_m(i, n) is the i-th frame signal of the m-th microphone in the microphone array, m = 1, 2, …, M, M is the number of microphones; X_m(i, k), the discrete Fourier transform of x_m(i, n), represents the frequency-domain signal of the i-th frame of the m-th microphone; k is the frequency bin, K is the length of the discrete Fourier transform, N is the frame length, K = 2N, and DFT(·) denotes the discrete Fourier transform;
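Step S21 can be sketched as follows; the sampling rate (16 kHz), frame length (N = 512) and Hamming window are illustrative assumptions, with the zero-padded DFT length K = 2N as specified:

```python
import numpy as np

def frame_dft(frames):
    """Zero-padded DFT of each windowed frame: K = 2N frequency bins.

    frames: array of shape (num_frames, N) for one microphone.
    Returns X of shape (num_frames, K) with K = 2N.
    """
    num_frames, N = frames.shape
    K = 2 * N
    return np.fft.fft(frames, n=K, axis=1)   # n=K zero-pads each frame

fs = 16000                       # assumed sampling rate
N = 512                          # assumed frame length
t = np.arange(4 * N) / fs
x = np.sin(2 * np.pi * 1000 * t)            # 1 kHz test tone
frames = x.reshape(4, N) * np.hamming(N)    # framing + windowing
X = frame_dft(frames)
print(X.shape)                   # (4, 1024)
```

With K = 2N = 1024 bins at 16 kHz, the 1 kHz tone falls at bin k = 1000·K/fs = 64.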
S22, designing the impulse response function of the Gammatone filter bank:

g_j(t) = c · t^(a-1) · exp(-2π b_j t) · cos(2π f_j t + φ)

where j denotes the index of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; and b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:
b j =1.109ERB(f j )
ERB(f j )=24.7(4.37f j /1000+1)
A discrete Fourier transform is performed on the sampled impulse response of each Gammatone filter:

G_j(k) = DFT(g_j(n / f_s))

where G_j(k) is the frequency-domain expression of the j-th Gammatone filter, k is the frequency bin, K is the length of the discrete Fourier transform, N is the frame length, K = 2N, f_s denotes the signal sampling rate, and DFT(·) denotes the discrete Fourier transform;
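A sketch of S22, assuming the standard Gammatone impulse-response form g_j(t) = c·t^(a-1)·exp(-2π b_j t)·cos(2π f_j t + φ) with the attenuation factor b_j = 1.109·ERB(f_j) given above; the order a = 4, gain c = 1, duration and sampling rate are illustrative assumptions:

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth, as given in the text."""
    return 24.7 * (4.37 * f / 1000 + 1)

def gammatone_ir(fc, fs, duration=0.032, a=4, c=1.0, phase=0.0):
    """Sampled Gammatone impulse response (standard form assumed):
    g(t) = c * t^(a-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase),
    with attenuation factor b = 1.109 * ERB(fc) as in the text."""
    b = 1.109 * erb(fc)
    t = np.arange(int(duration * fs)) / fs
    return c * t**(a - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phase)

fs = 16000
N = 512
K = 2 * N                         # DFT length K = 2N
g = gammatone_ir(fc=1000.0, fs=fs)
G = np.fft.fft(g, n=K)            # frequency-domain expression G_j(k)
k_peak = int(np.argmax(np.abs(G[:K // 2])))
print(erb(1000.0), k_peak)        # ERB(1000) = 132.639; peak near bin 64 (1 kHz)
```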
S23, calculating the subband SRP-PHAT function of each frame of signal:

P(i, j, r) = Σ_{m=1}^{M-1} Σ_{n=m+1}^{M} Σ_{k=0}^{K-1} |G_j(k)|² · [X_m(i, k) X_n*(i, k) / |X_m(i, k) X_n*(i, k)|] · e^{i2π k f_s τ_mn(r) / K}

(here i in the exponent denotes the imaginary unit), where P(i, j, r) denotes the j-th subband SRP-PHAT function of the i-th frame signal when the beam direction is r; M is the number of microphones in the microphone array; τ_mn(r) denotes the time difference of sound-wave propagation from beam direction r to the m-th and the n-th microphones, calculated as:

τ_mn(r) = (‖r - r_m‖ - ‖r - r_n‖) / c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, and c is the speed of sound in air;
S24, carrying out normalization processing on the subband SRP-PHAT function of each frame of signal;
S25, combining all the subband SRP-PHAT functions of the same frame signal into matrix form to obtain the subband SRP-PHAT spatial spectrum matrix:

y(i) = [ P̂(i, j, r_l) ]_{J×L},  j = 1, …, J;  l = 1, …, L

where y(i) denotes the subband SRP-PHAT spatial spectrum matrix of the i-th frame signal, P̂(i, j, r_l) denotes the normalized subband SRP-PHAT function at the l-th beam direction r_l, J is the number of subbands, i.e., the number of Gammatone filters, and L is the number of beam directions;
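Steps S23-S25 can be sketched as follows for a single frame. The |G_j(k)|² subband weighting, the PHAT normalization of the cross-spectra, and the per-subband min-max normalization are assumptions based on standard SRP-PHAT practice, and the synthetic two-microphone check at the end is purely illustrative:

```python
import numpy as np

def subband_srp_phat(X, G, taus, fs):
    """Subband SRP-PHAT spatial spectrum for one frame (sketch of S23-S25).

    X    : (M, K) complex frame spectra of the M microphones
    G    : (J, K) Gammatone filter frequency responses
    taus : (L, M) candidate propagation delays per beam direction
    Returns the J x L spatial spectrum matrix y(i).
    """
    M, K = X.shape
    J, L = G.shape[0], taus.shape[0]
    k = np.arange(K)
    y = np.zeros((J, L))
    for l in range(L):
        for j in range(J):
            p = 0.0
            for m in range(M - 1):
                for n in range(m + 1, M):
                    cross = X[m] * np.conj(X[n])
                    phat = cross / (np.abs(cross) + 1e-12)   # PHAT weighting
                    tau_mn = taus[l, m] - taus[l, n]
                    steer = np.exp(2j * np.pi * k * fs * tau_mn / K)
                    p += np.real(np.sum(np.abs(G[j]) ** 2 * phat * steer))
            y[j, l] = p
    # S24: min-max normalization per subband (assumed form)
    return (y - y.min(axis=1, keepdims=True)) / (np.ptp(y, axis=1, keepdims=True) + 1e-12)

# Synthetic check: two microphones, second signal delayed by 5 samples
rng = np.random.default_rng(0)
fs, N, d = 16000, 512, 5
s = rng.standard_normal(N + 16)
x0, x1 = s[8:8 + N], s[8 - d:8 - d + N]        # x1[n] = x0[n - d]
K = 2 * N
X = np.vstack([np.fft.fft(x0, K), np.fft.fft(x1, K)])
deltas = np.arange(-8, 9)                      # candidate inter-mic delays (samples)
taus = np.stack([np.zeros(len(deltas)), deltas / fs], axis=1)
G = np.ones((2, K))                            # two trivial all-pass "subbands"
y = subband_srp_phat(X, G, taus, fs)
print(y.shape, deltas[np.argmax(y[0])])        # (2, 17) 5
```

The spectrum peaks at the candidate delay matching the true inter-microphone delay, which is exactly the property the CNN input matrix encodes across subbands and directions.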
S3, inputting the subband SRP-PHAT spatial spectrum matrices of all frame signals into the trained convolutional neural network, which outputs the probability that the speech signal belongs to each azimuth; the azimuth with the highest probability is taken as the estimate of the sound source azimuth of the speech signal.
2. The sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum according to claim 1, wherein in step S23, when the sound source is at the same height as the microphone array and is located in the far field of the microphone array, the equivalent calculation formula of τ_mn(r) is:

τ_mn(r) = ξ^T (r_n - r_m) / c

where ξ = [cos θ, sin θ]^T and θ is the azimuth angle of the beam direction r.
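A small numerical check of the far-field formula in claim 2; the microphone geometry and the sign convention τ_mn(r) = ξ^T(r_n - r_m)/c are illustrative assumptions:

```python
import numpy as np

def tdoa_far_field(theta_deg, r_m, r_n, c=343.0):
    """Far-field time difference of arrival between microphones m and n
    for a source at azimuth theta, with sound speed c in m/s
    (sign convention assumed)."""
    theta = np.deg2rad(theta_deg)
    xi = np.array([np.cos(theta), np.sin(theta)])
    return xi @ (np.asarray(r_n) - np.asarray(r_m)) / c

# Two microphones 10 cm apart on the x-axis (illustrative geometry)
r1, r2 = [0.0, 0.0], [0.1, 0.0]
print(tdoa_far_field(0.0, r1, r2))    # endfire along +x: ~0.1/343 s
print(tdoa_far_field(90.0, r1, r2))   # broadside (y-axis): ~0 s
```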
3. The sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum according to claim 1, wherein the convolutional neural network comprises an input layer, three convolution-pooling layers, a fully connected layer and an output layer, connected in sequence;
in the convolution-pooling layers, each convolution layer uses 3×3 convolution kernels with a stride of 1; the numbers of convolution kernels in the three convolution layers are 24, 48 and 96 in sequence; after each convolution operation, batch normalization is applied first, followed by ReLU activation, and zero padding is used in the convolution operation so that the feature dimensions remain unchanged before and after convolution; each pooling layer uses max pooling with a 2×2 pooling size and a stride of 2;

after the convolution-pooling layers, the feature data are flattened into one-dimensional vector feature data;

Dropout is applied to the connection between the fully connected layer and the one-dimensional vector feature data;
the output layer adopts a Softmax classifier.
4. The sound source localization method based on a convolutional neural network and a subband SRP-PHAT spatial spectrum according to claim 1, wherein the training steps of the convolutional neural network are as follows:
S1, convolving clean speech signals with room impulse responses of different azimuth angles, and adding different degrees of noise and reverberation, to generate a number of directional speech signals with different specified azimuth angles:

x_m(t) = h_m(t) * s(t) + v_m(t),  m = 1, 2, …, M

where x_m(t) denotes the directional speech signal of the specified azimuth angle received by the m-th microphone in the microphone array; m is the microphone index, m = 1, 2, …, M, M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth angle to the m-th microphone; and v_m(t) denotes noise;
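The signal model x_m(t) = h_m(t) * s(t) + v_m(t) can be sketched as follows; the toy impulse responses, white noise, and SNR-based noise scaling are illustrative assumptions (real training data would use measured or simulated room impulse responses):

```python
import numpy as np

def make_directional_signal(s, h_list, snr_db, rng):
    """Synthesize x_m(t) = h_m(t) * s(t) + v_m(t) for each microphone:
    convolve clean speech with per-microphone room impulse responses,
    then add white noise scaled to the requested SNR (a simple sketch)."""
    xs = []
    for h in h_list:
        x = np.convolve(s, h)[:len(s)]       # h_m(t) * s(t)
        noise = rng.standard_normal(len(x))
        # scale noise power to achieve the requested signal-to-noise ratio
        noise *= np.sqrt(np.mean(x**2) / (10**(snr_db / 10) * np.mean(noise**2)))
        xs.append(x + noise)
    return np.stack(xs)

rng = np.random.default_rng(1)
s = rng.standard_normal(16000)               # stand-in for clean speech
h_list = [np.r_[1.0, np.zeros(9)],           # toy impulse responses with
          np.r_[np.zeros(3), 0.8, np.zeros(6)]]   # different delays/gains
X = make_directional_signal(s, h_list, snr_db=10.0, rng=rng)
print(X.shape)                               # (2, 16000)
```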
S2, performing framing and windowing preprocessing on all directional speech signals to obtain single-frame signals, and calculating the subband SRP-PHAT spatial spectrum matrix of each frame of signal;

S3, taking the subband SRP-PHAT spatial spectrum matrices of all directional speech signals as training samples, taking the specified azimuth angles of the directional speech signals as the class labels of the corresponding training samples, using the training samples and class labels as the training data set, and training the convolutional neural network by minimizing the loss function with the stochastic gradient descent with momentum algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110059164.1A CN112904279B (en) | 2021-01-18 | 2021-01-18 | Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112904279A CN112904279A (en) | 2021-06-04 |
CN112904279B true CN112904279B (en) | 2024-01-26 |