CN112904279A - Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum - Google Patents


Info

Publication number
CN112904279A
CN112904279A (application number CN202110059164.1A)
Authority
CN
China
Prior art keywords
srp
phat
sub
band
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110059164.1A
Other languages
Chinese (zh)
Other versions
CN112904279B (en)
Inventor
赵小燕
童莹
芮雄丽
陈瑞
毛铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Technology filed Critical Nanjing Institute of Technology
Priority to CN202110059164.1A priority Critical patent/CN112904279B/en
Publication of CN112904279A publication Critical patent/CN112904279A/en
Application granted granted Critical
Publication of CN112904279B publication Critical patent/CN112904279B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L2021/02082Noise filtering the noise being echo, reverberation of the speech
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a sound source positioning method based on a convolutional neural network and a sub-band SRP-PHAT spatial spectrum, comprising the following steps: a microphone array collects speech signals, and the collected signals are preprocessed by framing and windowing to obtain single-frame signals; a sub-band SRP-PHAT spatial spectrum matrix is calculated for each frame signal; the sub-band SRP-PHAT spatial spectrum matrices of all frame signals are input into a trained convolutional neural network, which outputs the probability of the speech signal belonging to each azimuth angle, and the azimuth angle with the highest probability is taken as the sound source azimuth estimate of the speech signal. The invention improves the sound source positioning performance of a microphone array in complex acoustic environments and improves generalization to the sound source spatial structure, reverberation and noise. The training of the convolutional neural network can be completed offline and the trained network stored in memory, and only a single frame of signal is needed at test time, enabling real-time sound source positioning.

Description

Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum
Technical Field
The invention belongs to the field of sound source positioning, and particularly relates to a sound source positioning method based on a convolutional neural network and a sub-band SRP-PHAT spatial spectrum.
Background
The sound source positioning technology based on microphone arrays has broad application prospects and potential economic value in front-end processing for speech recognition, speaker recognition and emotion recognition systems, video conferencing, intelligent robots, smart homes, intelligent vehicle-mounted equipment, hearing aids, and the like. The SRP-PHAT (Steered Response Power with Phase Transform) method, the most popular and widely used of the conventional sound source positioning methods, locates the sound source by detecting the peak of a spatial spectrum. However, noise and reverberation often cause the spatial spectrum to exhibit multiple peaks, and in strongly reverberant environments in particular the spatial spectrum peak produced by reflected sound may exceed the peak of the direct sound, leading to errors in sound source position detection. In recent years, model-based sound source positioning methods have been applied in complex acoustic environments; such methods achieve sound source positioning by modeling spatial characteristic parameters and constructing a mapping between the sound source position and those parameters. At present, however, the generalization capability of such algorithms to unknown environments (noise and reverberation) is low, and their performance needs further improvement. The spatial characteristic parameters and the modeling method are the main factors affecting the performance of model-based sound source positioning.
Disclosure of Invention
The purpose of the invention is as follows: in order to solve the problems in the prior art, the invention discloses a sound source positioning method based on a Convolutional Neural Network and a sub-band SRP-PHAT space spectrum.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the following technical scheme: a sound source positioning method based on a convolutional neural network and a sub-band SRP-PHAT space spectrum is characterized by comprising the following steps:
S1. The microphone array collects speech signals, and the collected speech signals are preprocessed by framing and windowing to obtain single-frame signals;
S2. A sub-band SRP-PHAT spatial spectrum matrix is calculated for each frame signal;
S3. The sub-band SRP-PHAT spatial spectrum matrices of all frame signals are input into the trained convolutional neural network, which outputs the probability of the speech signal belonging to each azimuth angle, and the azimuth angle with the highest probability is taken as the sound source azimuth estimate of the speech signal.
Preferably, in step S2, calculating the sub-band SRP-PHAT spatial spectrum matrix of each frame signal comprises the following steps:
S21. Perform a discrete Fourier transform on each frame signal:

X_m(i,k) = DFT{x_m(i,n)} = Σ_{n=0}^{N−1} x_m(i,n)·e^{−j2πkn/K},  k = 0, 1, …, K−1

where x_m(i,n) is the i-th frame signal of the m-th microphone in the microphone array, m = 1, 2, …, M, M is the number of microphones, X_m(i,k) is the discrete Fourier transform of x_m(i,n) and denotes the frequency-domain signal of the i-th frame of the m-th microphone, k is the frequency bin index, K is the discrete Fourier transform length, N is the frame length, K = 2N, and DFT(·) denotes the discrete Fourier transform;
S22. Design the impulse response functions of the Gammatone filter bank:

g_j(t) = c·t^{a−1}·e^{−2πb_j·t}·cos(2πf_j·t + φ),  t ≥ 0

where j denotes the index of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; and b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:

b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)
Perform a discrete Fourier transform on the impulse response function of each Gammatone filter:

G_j(k) = DFT{g_j(n/f_s)} = Σ_{n=0}^{N−1} g_j(n/f_s)·e^{−j2πkn/K},  k = 0, 1, …, K−1

where G_j(k) is the frequency-domain expression of the j-th Gammatone filter, k is the frequency bin index, K is the discrete Fourier transform length, N is the frame length, K = 2N, f_s denotes the signal sampling rate, and DFT(·) denotes the discrete Fourier transform;
S23. Calculate the sub-band SRP-PHAT function of each frame signal:

P(i,j,r) = Σ_{m=1}^{M−1} Σ_{n=m+1}^{M} Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i,k)·X_n*(i,k) / |X_m(i,k)·X_n*(i,k)|] · e^{j2πk·f_s·τ_mn(r)/K}

where P(i,j,r) denotes the j-th sub-band SRP-PHAT function of the i-th frame signal when the beam direction is r; M is the number of microphones in the microphone array; (·)* denotes the conjugate; and τ_mn(r) denotes the time difference of the sound wave propagating from beam direction r to the m-th and n-th microphones, calculated as:

τ_mn(r) = (‖r − r_m‖ − ‖r − r_n‖) / c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, and c is the speed of sound in air;
S24. Normalize the sub-band SRP-PHAT function of each frame signal:

P̂(i,j,r) = P(i,j,r) / max_r P(i,j,r)
S25. Combine all sub-band SRP-PHAT functions of the same frame signal into matrix form to obtain the sub-band SRP-PHAT spatial spectrum matrix:

Y(i) = [ P̂(i,1,r_1)  P̂(i,1,r_2)  …  P̂(i,1,r_L) ;
         P̂(i,2,r_1)  P̂(i,2,r_2)  …  P̂(i,2,r_L) ;
         ⋮ ;
         P̂(i,J,r_1)  P̂(i,J,r_2)  …  P̂(i,J,r_L) ]

where Y(i) denotes the J × L sub-band SRP-PHAT spatial spectrum matrix of the i-th frame signal, J is the number of sub-bands, i.e., the number of Gammatone filters, and L is the number of beam directions.
Preferably, in step S23, when the sound source and the microphone array are set in the same horizontal plane and the sound source lies in the far field of the microphone array, the equivalent formula for τ_mn(r) is:

τ_mn(r) = ξ^T·(r_n − r_m) / c

where ξ = [cosθ, sinθ]^T and θ is the azimuth angle of the beam direction r.
Preferably, the convolutional neural network comprises an input layer, three convolution-pooling layers, a fully-connected layer and an output layer connected in sequence;
in the convolution-pooling layers, each convolution layer uses 3 × 3 convolution kernels with a stride of 1; the numbers of convolution kernels in the three convolution layers are 24, 48 and 96 in turn; after each convolution operation, batch normalization is performed first and then the ReLU activation function is applied; zero padding is used in the convolution operation so that the feature dimensions before and after convolution remain unchanged; the pooling layers use max pooling with a pooling size of 2 × 2 and a stride of 2;
after the convolution-pooling layers, the feature data are flattened into a one-dimensional feature vector;
Dropout is applied to the connection between the fully-connected layer and the one-dimensional feature vector;
the output layer uses a Softmax classifier.
Preferably, the training step of the convolutional neural network is as follows:
S1. Convolve clean speech signals with room impulse responses for different azimuth angles, and add different levels of noise and reverberation to generate a plurality of directional speech signals with different specified azimuth angles:

x_m(t) = h_m(t) * s(t) + v_m(t),  m = 1, 2, …, M

where x_m(t) denotes the directional speech signal of a specified azimuth received by the m-th microphone of the microphone array; m is the microphone index, m = 1, 2, …, M, and M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth to the m-th microphone; and v_m(t) denotes noise;
S2. Preprocess all directional speech signals by framing and windowing to obtain single-frame signals, and calculate the sub-band SRP-PHAT spatial spectrum matrix of each frame signal;
S3. Use the sub-band SRP-PHAT spatial spectrum matrices of all directional speech signals as training samples and the specified azimuth angles of the directional speech signals as the class labels of the corresponding training samples; with the training samples and class labels as the training data set, train the convolutional neural network with a stochastic gradient descent algorithm with momentum to minimize the loss function.
Beneficial effects: the invention has the following notable advantages:
1. The invention improves the sound source positioning performance of a microphone array in complex acoustic environments and improves generalization to the sound source spatial structure, reverberation and noise;
2. The invention uses the sub-band SRP-PHAT spatial spectrum as the spatial characteristic parameter, which both represents the overall acoustic environment information and offers strong robustness; a convolutional neural network models the spatial characteristic parameters of directional speech data under various reverberation and noise environments, establishing a mapping between azimuth and the spatial characteristic parameters and converting the sound source positioning problem into a multi-class classification problem;
3. The training of the convolutional neural network can be completed offline and the trained network stored in memory, and only a single frame of signal is needed at test time, enabling real-time sound source positioning.
Drawings
FIG. 1 is a flow chart of the algorithm of the present invention;
FIG. 2 is a diagram of a model architecture of a convolutional neural network of the present invention;
FIG. 3 is a graph comparing the success rate of positioning between the method of the present invention and the conventional SRP-PHAT algorithm when the testing environment and the training environment are consistent and the reverberation time is 0.5 s;
FIG. 4 is a graph comparing the success rate of positioning between the method of the present invention and the conventional SRP-PHAT algorithm when the testing environment and the training environment are consistent and the reverberation time is 0.8 s;
FIG. 5 is a comparison graph of the positioning success rate of the method of the present invention and the conventional SRP-PHAT algorithm when the noise environments of the test environment and the training environment are not consistent and the reverberation time is 0.5 s;
FIG. 6 is a comparison graph of the positioning success rate of the method of the present invention and the conventional SRP-PHAT algorithm when the noise environments of the test environment and the training environment are not consistent and the reverberation time is 0.8 s;
FIG. 7 is a comparison graph of the positioning success rate of the method of the present invention and the conventional SRP-PHAT algorithm when the reverberation environments of the test environment and the training environment are not consistent and the reverberation time of the test environment is 0.6 s;
fig. 8 is a comparison graph of the positioning success rate of the method of the present invention and the conventional SRP-PHAT algorithm when the reverberation environments of the test environment and the training environment are not consistent and the reverberation time of the test environment is 0.9 s.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
The sub-band SRP-PHAT spatial spectrum represents the spatial information of the whole acoustic environment, including the sound source direction, room size, and room reflection characteristics, and is robust, so it can serve as the spatial characteristic parameter in a positioning system. A deep neural network can emulate the information processing of a nervous system, can describe the fusion relationships and structural information among spatial characteristic parameters, has strong expressive and modeling power, and does not require assumptions about the data distribution during modeling. A convolutional neural network is a neural network specialized for processing data with a grid-like structure and is widely applied to images and time-series data. The speech signals collected by a microphone array are exactly such time-series data.
Therefore, the present invention provides a sound source positioning method based on a convolutional neural network and a sub-band SRP-PHAT spatial spectrum, as shown in FIG. 1, comprising the following steps:
the method comprises the following steps: convolving the clean speech signal with the room impulse response at different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of directional speech signals at different specified azimuth angles, namely microphone array signals:
x_m(t) = h_m(t) * s(t) + v_m(t),  m = 1, 2, …, M

where x_m(t) denotes the directional speech signal of a specified azimuth received by the m-th microphone of the microphone array; m is the microphone index, m = 1, 2, …, M, and M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth to the m-th microphone and depends on the sound source azimuth and the room reverberation; v_m(t) denotes noise.
In this embodiment, the microphone array is a uniform circular array of 6 omnidirectional microphones with an array radius of 0.1 m. The sound source and the microphone array are set in the same horizontal plane, with the sound source in the far field of the microphone array. The direction directly in front of the horizontal plane is defined as 90°; the sound source azimuth angle lies in the range [0°, 360°) with an interval of 10°, and the number of training azimuths is denoted F, so F = 36. The reverberation times of the training data include 0.5 s and 0.8 s, and the image method is used to generate the room impulse responses h_m(t) for the different azimuth angles under the different reverberation times. v_m(t) is white Gaussian noise, and the signal-to-noise ratios of the training data include 0 dB, 5 dB, 10 dB, 15 dB and 20 dB.
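As an illustration only, a minimal Python sketch of synthesizing one such microphone channel is given below; the function name and the way the room impulse response and SNR are supplied are assumptions, not part of the patent:

```python
import numpy as np
from scipy.signal import fftconvolve

def make_directional_signal(s, h_m, snr_db, rng=np.random.default_rng()):
    """x_m(t) = h_m(t) * s(t) + v_m(t): convolve clean speech s with the room
    impulse response h_m of one microphone, then add white Gaussian noise at snr_db."""
    x = fftconvolve(s, h_m)[:len(s)]                   # reverberant directional signal
    v = rng.standard_normal(len(x))
    v *= np.sqrt(np.mean(x ** 2) / (10 ** (snr_db / 10) * np.mean(v ** 2)))  # scale noise to the target SNR
    return x + v
```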
Step two: preprocess the microphone array signals obtained in step one to obtain single-frame signals.
The preprocessing includes framing and windowing, where:
Framing: using the preset frame length and frame shift, the directional speech signal x_m(t) of the m-th microphone for a specified azimuth is divided into a number of single-frame signals x_m(iN + n), where i is the frame index, n is the sample index within a frame, 0 ≤ n < N, and N is the frame length. In this embodiment the signal sampling rate f_s is 16 kHz, the frame length N is 512 samples (i.e., 32 ms), and the frames do not overlap.
Windowing: x_m(i,n) = w_H(n)·x_m(iN + n), where x_m(i,n) is the i-th frame signal of the m-th microphone after windowing and

w_H(n) = 0.54 − 0.46·cos(2πn/(N−1)),  0 ≤ n < N

is a Hamming window.
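A minimal NumPy sketch of this preprocessing, combined with the discrete Fourier transform of step (3-1) below, is given here for illustration; the function name and array layout are assumptions:

```python
import numpy as np

def frame_window_dft(x, N=512, K=1024):
    """Split one microphone channel x into non-overlapping Hamming-windowed frames
    of length N and return their zero-padded K-point DFTs, shape (frames, K)."""
    num_frames = len(x) // N
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))   # Hamming window w_H(n)
    frames = x[:num_frames * N].reshape(num_frames, N) * w          # x_m(i, n)
    return np.fft.fft(frames, n=K, axis=1)                          # X_m(i, k), K = 2N
```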
Step three: extract the spatial characteristic parameters of the microphone array signals, namely the sub-band SRP-PHAT spatial spectrum matrix. This specifically comprises the following steps:
(3-1) Perform a discrete Fourier transform on each frame signal obtained in step two, converting the time-domain signal into a frequency-domain signal.
The discrete Fourier transform is calculated as:

X_m(i,k) = DFT{x_m(i,n)} = Σ_{n=0}^{N−1} x_m(i,n)·e^{−j2πkn/K},  k = 0, 1, …, K−1

where X_m(i,k) is the discrete Fourier transform of x_m(i,n) and denotes the frequency-domain signal of the i-th frame of the m-th microphone, k is the frequency bin index, K is the discrete Fourier transform length, K = 2N, and DFT(·) denotes the discrete Fourier transform. In this embodiment, the discrete Fourier transform length is set to 1024.
(3-2) Design the Gammatone filter bank.
g_j(t) is the impulse response function of the j-th Gammatone filter, expressed as:

g_j(t) = c·t^{a−1}·e^{−2πb_j·t}·cos(2πf_j·t + φ),  t ≥ 0

where j denotes the index of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; and b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:

b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)

In this embodiment the order a is 4, the phase φ is set to 0, and the number of Gammatone filters is 36, i.e., j = 1, 2, …, 36, with the center frequencies f_j lying in the range [200 Hz, 8000 Hz].
The impulse response function of each Gammatone filter is discrete-Fourier-transformed to obtain its frequency-domain expression:

G_j(k) = DFT{g_j(n/f_s)} = Σ_{n=0}^{N−1} g_j(n/f_s)·e^{−j2πkn/K},  k = 0, 1, …, K−1

where G_j(k) is the discrete Fourier transform of g_j(n/f_s) and denotes the frequency-domain expression of the j-th Gammatone filter, k is the frequency bin index, K is the discrete Fourier transform length, K = 2N, DFT(·) denotes the discrete Fourier transform, and f_s denotes the sampling rate. In this embodiment, the discrete Fourier transform length is set to 1024.
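A NumPy sketch of this filter-bank design is given below for illustration; the ERB-rate spacing of the center frequencies is an assumption, since the embodiment only specifies the range [200 Hz, 8000 Hz] and the filter count J = 36:

```python
import numpy as np

def gammatone_bank(J=36, fs=16000, N=512, K=1024, f_lo=200.0, f_hi=8000.0, a=4, c=1.0):
    """Return (G, fj): G is a J x K matrix of Gammatone frequency responses G_j(k)."""
    erb = lambda f: 24.7 * (4.37 * f / 1000 + 1)
    # Center frequencies spaced on the ERB-rate scale (assumed; the patent gives only the range)
    lo, hi = 21.4 * np.log10(4.37e-3 * f_lo + 1), 21.4 * np.log10(4.37e-3 * f_hi + 1)
    fj = (10 ** (np.linspace(lo, hi, J) / 21.4) - 1) / 4.37e-3
    t = np.arange(N) / fs                              # impulse response sampled at n / fs
    G = np.zeros((J, K), dtype=complex)
    for j, f in enumerate(fj):
        b = 1.109 * erb(f)                             # attenuation factor b_j (value as in the patent)
        g = c * t ** (a - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t)  # phase = 0
        G[j] = np.fft.fft(g, n=K)                      # G_j(k), zero-padded to K = 2N
    return G, fj
```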
(3-3) Calculate the sub-band SRP-PHAT function of each frame signal according to:

P(i,j,r) = Σ_{m=1}^{M−1} Σ_{n=m+1}^{M} Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i,k)·X_n*(i,k) / |X_m(i,k)·X_n*(i,k)|] · e^{j2πk·f_s·τ_mn(r)/K}

where P(i,j,r) denotes the j-th sub-band SRP-PHAT function of the i-th frame signal when the beam direction of the array is r; (·)* denotes the conjugate; and τ_mn(r) denotes the time difference of the sound wave propagating from beam direction r to the m-th and n-th microphones, calculated as:

τ_mn(r) = (‖r − r_m‖ − ‖r − r_n‖) / c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, c is the speed of sound in air (about 342 m/s at room temperature), f_s is the signal sampling rate, and ‖·‖ denotes the 2-norm.
In this embodiment, with the sound source and the microphone array in the same horizontal plane and the sound source in the far field of the microphone array, the equivalent formula for τ_mn(r) is:

τ_mn(r) = ξ^T·(r_n − r_m) / c

where ξ = [cosθ, sinθ]^T and θ is the azimuth angle of the beam direction r. τ_mn(r) is independent of the received signal and can therefore be calculated offline and stored in memory.
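For example, the far-field time differences for the uniform circular array and the 72 beam directions of this embodiment could be tabulated once as in the following sketch (function name and sign convention are assumptions):

```python
import numpy as np

def far_field_tdoa_table(M=6, radius=0.1, L=72, c=342.0):
    """Return tau[m, n, l]: far-field time difference (seconds) between microphones m and n
    for beam azimuth theta_l = l * 5 degrees."""
    mic_angles = 2 * np.pi * np.arange(M) / M
    r_mic = radius * np.stack([np.cos(mic_angles), np.sin(mic_angles)], axis=1)  # (M, 2) positions
    theta = np.deg2rad(np.arange(L) * 5.0)
    xi = np.stack([np.cos(theta), np.sin(theta)], axis=1)                        # (L, 2) unit vectors
    diff = r_mic[None, :, :] - r_mic[:, None, :]                                 # r_n - r_m, shape (M, M, 2)
    return np.einsum('mnd,ld->mnl', diff, xi) / c                                # tau_mn(r) = xi^T (r_n - r_m) / c
```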
The sub-band SRP-PHAT function P(i,j,r) is normalized according to:

P̂(i,j,r) = P(i,j,r) / max_r P(i,j,r)
(3-4) All sub-band SRP-PHAT functions P̂(i,j,r) of the same frame signal are combined into matrix form to obtain the sub-band SRP-PHAT spatial spectrum matrix:

Y(i) = [ P̂(i,1,r_1)  P̂(i,1,r_2)  …  P̂(i,1,r_L) ;
         P̂(i,2,r_1)  P̂(i,2,r_2)  …  P̂(i,2,r_L) ;
         ⋮ ;
         P̂(i,J,r_1)  P̂(i,J,r_2)  …  P̂(i,J,r_L) ]

where Y(i) denotes the spatial characteristic parameter of the i-th frame signal, i.e., the J × L sub-band SRP-PHAT spatial spectrum matrix, and J is the number of sub-bands, i.e., the number of Gammatone filters; in this embodiment J = 36. The azimuth range of the array's beam directions is [0°, 360°), with 90° defined as directly in front of the horizontal plane and an interval of 5°, so the number of beam directions L = 72. In general, the number of beam directions L is larger than the number of training azimuths F, which refines the spatial characteristic parameters of the signal and improves the training accuracy of the CNN model.
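For illustration, a NumPy sketch assembling steps (3-1) to (3-4) for a single frame is given below; the use of |G_j(k)|² as the sub-band weight and the maximum-based normalization mirror the reconstruction above and are assumptions rather than the patent's exact image equations:

```python
import numpy as np

def subband_srp_phat_matrix(X, G, tau, fs=16000, K=1024):
    """X: (M, K) frame DFTs X_m(i, k); G: (J, K) Gammatone responses G_j(k);
    tau: (M, M, L) TDOA table tau_mn(r). Returns the J x L matrix Y(i)."""
    M, J, L = X.shape[0], G.shape[0], tau.shape[2]
    k = np.arange(K)
    P = np.zeros((J, L))
    for m in range(M):
        for n in range(m + 1, M):
            cross = X[m] * np.conj(X[n])
            phat = cross / (np.abs(cross) + 1e-12)                          # PHAT weighting
            steer = np.exp(1j * 2 * np.pi * np.outer(k, tau[m, n]) * fs / K)  # (K, L) steering terms
            for j in range(J):
                P[j] += np.real((np.abs(G[j]) ** 2 * phat) @ steer)         # sub-band weight (assumed form)
    return P / np.max(P, axis=1, keepdims=True)                             # normalize over beam directions (assumed)
```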
Step four: prepare the training set. Following steps one to three, the spatial characteristic parameters of the directional speech signals under all training environments (the training environments are detailed in step one) are extracted and used as CNN training samples, and the specified azimuth angle corresponding to each training sample is marked and used as the class label of that training sample.
Step five: construct the CNN model and train it with the training samples and class labels obtained in step four as the CNN training data set, thereby obtaining the trained CNN model. This specifically comprises the following steps:
(5-1) Set the CNN model structure.
The CNN architecture adopted by the invention is shown in FIG. 2 and comprises an input layer followed by three convolution-pooling layers, then a fully-connected layer, and finally an output layer.
The input to the input layer is the J × L two-dimensional sub-band SRP-PHAT spatial spectrum matrix, i.e., a training sample; in this embodiment J = 36 and L = 72.
The input layer is followed by three convolution-pooling layers. Each convolution layer uses 3 × 3 convolution kernels with a stride of 1, and zero padding is used so that the feature dimensions before and after convolution remain unchanged. The numbers of convolution kernels in the 1st, 2nd and 3rd convolution layers are 24, 48 and 96, respectively. After each convolution operation, batch normalization is performed first and then the ReLU activation function is applied. The pooling layers use max pooling with a pooling size of 2 × 2 and a stride of 2.
After the three convolution-pooling operations, the 36 × 72 two-dimensional sub-band SRP-PHAT spatial spectrum matrix becomes a 5 × 9 × 96 feature map, which is flattened into a 4320 × 1 one-dimensional feature vector. The neurons of the fully-connected layer are connected to all the feature data of the previous layer, and Dropout is added to this connection to prevent overfitting, with the Dropout rate set to 0.5.
The output layer uses a Softmax classifier; the Softmax function converts the feature data of the fully-connected layer into the probability of the speech signal belonging to each azimuth angle, and the azimuth angle with the highest probability is taken as the predicted sound source direction.
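A PyTorch sketch of this architecture is given below as an illustration only; ceil-mode pooling is assumed so that the 36 × 72 input reduces to the stated 5 × 9 × 96 feature map, and since the width of the fully-connected layer is not specified, it is folded into the Softmax output layer here:

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),   # 3x3 kernels, stride 1, zero padding
        nn.BatchNorm2d(c_out),                                        # batch normalization before ReLU
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True),        # 36x72 -> 18x36 -> 9x18 -> 5x9
    )

class SubbandSrpPhatCNN(nn.Module):
    def __init__(self, num_classes=36):
        super().__init__()
        self.features = nn.Sequential(conv_block(1, 24), conv_block(24, 48), conv_block(48, 96))
        self.classifier = nn.Sequential(
            nn.Flatten(),                      # 5 * 9 * 96 = 4320 features
            nn.Dropout(p=0.5),
            nn.Linear(4320, num_classes),      # Softmax applied by the loss / at inference
        )

    def forward(self, x):                      # x: (batch, 1, 36, 72) spectrum matrices
        return self.classifier(self.features(x))
```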
And (5-2) training network parameters of the CNN model.
The training process of CNN includes two parts, forward propagation and backward propagation.
Forward propagation computes the output of the input data under the current network parameters and is a layer-by-layer transfer of features. For a convolutional layer d, the forward propagation expression at position (u, v) is:

S_d(u,v) = ReLU((S_{d−1} * w_d)(u,v) + β_d(u,v))

where d is the layer index, S_d denotes the output of layer d, S_{d−1} denotes the output of layer d−1, * denotes the convolution operation, w_d denotes the convolution kernel weights of layer d, β_d denotes the bias of layer d, and ReLU is the activation function. The layers of the CNN structure adopted by the invention include the input layer, the convolution and pooling layers within the convolution-pooling layers, the fully-connected layer, and the output layer.
Let D denote the output layer; the expression of the output layer is:

S_D = Softmax((w_D)^T·S_{D−1} + β_D)

where S_D denotes the output of the output layer, S_{D−1} denotes the output of the fully-connected layer, w_D denotes the weights of the output layer, and β_D denotes the bias of the output layer.
The goal of the back-propagation stage is to minimize the cross-entropy loss function E(w, β):

E(w, β) = −Σ_{f=1}^{F} S̃_D(f)·log(S_D(f))

where the subscript f denotes the f-th azimuth angle, S̃_D(f) denotes the desired output of the output layer at the f-th azimuth, and S_D(f) denotes the actual output of the output layer at the f-th azimuth. F denotes the number of training azimuths; in this embodiment F = 36. The invention minimizes the loss function with a stochastic gradient descent with momentum (SGDM) algorithm, with the following SGDM parameters: the momentum is set to 0.9, the L2 regularization coefficient is 0.0001, the initial learning rate is set to 0.01, the learning rate is multiplied by 0.2 every 6 epochs, and the mini-batch size is set to 200.
A 7:3 cross-validation split is used during training, and training is iterated until convergence, at which point the CNN model training is complete.
Step six: process the test signal according to steps two and three to obtain the spatial characteristic parameter of each single-frame test signal, namely the sub-band SRP-PHAT spatial spectrum matrix, and use it as a test sample.
Step seven: use the test sample as the input feature of the CNN model trained in step five; the CNN outputs the probability of the test signal belonging to each azimuth angle, and the azimuth with the highest probability is taken as the sound source azimuth estimate of the test sample.
Compared with the prior art, the method of the invention comprises a training stage and a testing stage. In the training stage, spatial characteristic parameters are extracted from directional speech signals under various reverberation and noise environments and input into a CNN for training to obtain the CNN model. In the testing stage, the spatial characteristic parameters of the test signal are extracted and input into the trained CNN model, and the azimuth with the highest probability is taken as the azimuth estimate of the target sound source. The CNN training process can be completed offline, with the trained CNN model stored in memory, and only a single frame of signal is needed at test time to achieve real-time sound source positioning. Compared with the traditional SRP-PHAT algorithm, the proposed algorithm significantly improves positioning performance in complex acoustic environments and generalizes better to the sound source spatial structure, reverberation and noise.
FIG. 3 and FIG. 4 show the positioning performance of the method of the invention and of the traditional SRP-PHAT algorithm when the test environment matches the training environment: in FIG. 3 the reverberation time of both environments is 0.5 s, and in FIG. 4 it is 0.8 s. The positioning performance is examined at signal-to-noise ratios of 0 dB, 5 dB, 10 dB, 15 dB and 20 dB; the positioning success rate of the method of the invention is far higher than that of the traditional SRP-PHAT algorithm.
FIG. 5 and FIG. 6 show the positioning performance when the signal-to-noise ratios of the test and training environments differ: in FIG. 5 the reverberation time of both environments is 0.5 s, and in FIG. 6 it is 0.8 s, while the test-environment signal-to-noise ratio differs from that of the training environment. The positioning performance is examined at test signal-to-noise ratios of −2 dB, 3 dB, 8 dB, 13 dB and 18 dB; the positioning success rate of the method of the invention remains far higher than that of the traditional SRP-PHAT algorithm.
FIG. 7 and FIG. 8 show the positioning performance when the reverberation times of the test and training environments differ: the test-environment reverberation time is 0.6 s in FIG. 7 and 0.9 s in FIG. 8. The positioning performance is examined at signal-to-noise ratios of 0 dB, 5 dB, 10 dB, 15 dB and 20 dB; the positioning success rate of the method of the invention is again far higher than that of the traditional SRP-PHAT algorithm.
As can be seen from FIG. 5 to FIG. 8, even in untrained environments the success rate of the method of the invention remains much higher than that of the traditional SRP-PHAT algorithm, showing that the method has good robustness and generalization to unknown environments.
The above description covers only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these are also intended to fall within the scope of the invention.

Claims (5)

1. A sound source positioning method based on a convolutional neural network and a sub-band SRP-PHAT space spectrum is characterized by comprising the following steps:
S1. The microphone array collects speech signals, and the collected speech signals are preprocessed by framing and windowing to obtain single-frame signals;
S2. A sub-band SRP-PHAT spatial spectrum matrix is calculated for each frame signal;
S3. The sub-band SRP-PHAT spatial spectrum matrices of all frame signals are input into the trained convolutional neural network, which outputs the probability of the speech signal belonging to each azimuth angle, and the azimuth angle with the highest probability is taken as the sound source azimuth estimate of the speech signal.
2. The sound source localization method according to claim 1, wherein in step S2 calculating the sub-band SRP-PHAT spatial spectrum matrix of each frame signal comprises the following steps:
S21. Perform a discrete Fourier transform on each frame signal:

X_m(i,k) = DFT{x_m(i,n)} = Σ_{n=0}^{N−1} x_m(i,n)·e^{−j2πkn/K},  k = 0, 1, …, K−1

where x_m(i,n) is the i-th frame signal of the m-th microphone in the microphone array, m = 1, 2, …, M, M is the number of microphones, X_m(i,k) is the discrete Fourier transform of x_m(i,n) and denotes the frequency-domain signal of the i-th frame of the m-th microphone, k is the frequency bin index, K is the discrete Fourier transform length, N is the frame length, K = 2N, and DFT(·) denotes the discrete Fourier transform;
S22. Design the impulse response functions of the Gammatone filter bank:

g_j(t) = c·t^{a−1}·e^{−2πb_j·t}·cos(2πf_j·t + φ),  t ≥ 0

where j denotes the index of the Gammatone filter; c is the gain of the Gammatone filter; t denotes continuous time; a is the order of the Gammatone filter; φ denotes the phase; f_j denotes the center frequency of the j-th Gammatone filter; and b_j denotes the attenuation factor of the j-th Gammatone filter, calculated as:

b_j = 1.109·ERB(f_j)
ERB(f_j) = 24.7·(4.37·f_j/1000 + 1)
Perform a discrete Fourier transform on the impulse response function of each Gammatone filter:

G_j(k) = DFT{g_j(n/f_s)} = Σ_{n=0}^{N−1} g_j(n/f_s)·e^{−j2πkn/K},  k = 0, 1, …, K−1

where G_j(k) is the frequency-domain expression of the j-th Gammatone filter, k is the frequency bin index, K is the discrete Fourier transform length, N is the frame length, K = 2N, f_s denotes the signal sampling rate, and DFT(·) denotes the discrete Fourier transform;
S23. Calculate the sub-band SRP-PHAT function of each frame signal:

P(i,j,r) = Σ_{m=1}^{M−1} Σ_{n=m+1}^{M} Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i,k)·X_n*(i,k) / |X_m(i,k)·X_n*(i,k)|] · e^{j2πk·f_s·τ_mn(r)/K}

where P(i,j,r) denotes the j-th sub-band SRP-PHAT function of the i-th frame signal when the beam direction is r; M is the number of microphones in the microphone array; (·)* denotes the conjugate; and τ_mn(r) denotes the time difference of the sound wave propagating from beam direction r to the m-th and n-th microphones, calculated as:

τ_mn(r) = (‖r − r_m‖ − ‖r − r_n‖) / c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, r_n denotes the position coordinates of the n-th microphone, and c is the speed of sound in air;
S24. Normalize the sub-band SRP-PHAT function of each frame signal:

P̂(i,j,r) = P(i,j,r) / max_r P(i,j,r)
S25. Combine all sub-band SRP-PHAT functions of the same frame signal into matrix form to obtain the sub-band SRP-PHAT spatial spectrum matrix:

Y(i) = [ P̂(i,1,r_1)  P̂(i,1,r_2)  …  P̂(i,1,r_L) ;
         P̂(i,2,r_1)  P̂(i,2,r_2)  …  P̂(i,2,r_L) ;
         ⋮ ;
         P̂(i,J,r_1)  P̂(i,J,r_2)  …  P̂(i,J,r_L) ]

where Y(i) denotes the J × L sub-band SRP-PHAT spatial spectrum matrix of the i-th frame signal, J is the number of sub-bands, i.e., the number of Gammatone filters, and L is the number of beam directions.
3. The sound source localization method according to claim 2, wherein in step S23, when the sound source and the microphone array are set in the same horizontal plane and the sound source lies in the far field of the microphone array, the equivalent formula for τ_mn(r) is:

τ_mn(r) = ξ^T·(r_n − r_m) / c

where ξ = [cosθ, sinθ]^T and θ is the azimuth angle of the beam direction r.
4. The sound source localization method based on the convolutional neural network and the sub-band SRP-PHAT spatial spectrum as claimed in claim 1, wherein the convolutional neural network comprises an input layer, three convolution-pooling layers, a fully-connected layer and an output layer connected in sequence;
in the convolution-pooling layers, each convolution layer uses 3 × 3 convolution kernels with a stride of 1; the numbers of convolution kernels in the three convolution layers are 24, 48 and 96 in turn; after each convolution operation, batch normalization is performed first and then the ReLU activation function is applied; zero padding is used in the convolution operation so that the feature dimensions before and after convolution remain unchanged; the pooling layers use max pooling with a pooling size of 2 × 2 and a stride of 2;
after the convolution-pooling layers, the feature data are flattened into a one-dimensional feature vector;
Dropout is applied to the connection between the fully-connected layer and the one-dimensional feature vector;
the output layer uses a Softmax classifier.
5. The sound source localization method based on the convolutional neural network and the subband SRP-PHAT spatial spectrum as claimed in claim 1, wherein the training step of the convolutional neural network is as follows:
S1. Convolve clean speech signals with room impulse responses for different azimuth angles, and add different levels of noise and reverberation to generate a plurality of directional speech signals with different specified azimuth angles:

x_m(t) = h_m(t) * s(t) + v_m(t),  m = 1, 2, …, M

where x_m(t) denotes the directional speech signal of a specified azimuth received by the m-th microphone of the microphone array; m is the microphone index, m = 1, 2, …, M, and M is the number of microphones; s(t) is the clean speech signal; h_m(t) denotes the room impulse response from the specified azimuth to the m-th microphone; and v_m(t) denotes noise;
S2. Preprocess all directional speech signals by framing and windowing to obtain single-frame signals, and calculate the sub-band SRP-PHAT spatial spectrum matrix of each frame signal;
S3. Use the sub-band SRP-PHAT spatial spectrum matrices of all directional speech signals as training samples and the specified azimuth angles of the directional speech signals as the class labels of the corresponding training samples; with the training samples and class labels as the training data set, train the convolutional neural network with a stochastic gradient descent algorithm with momentum to minimize the loss function.
CN202110059164.1A 2021-01-18 2021-01-18 Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum Active CN112904279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110059164.1A CN112904279B (en) 2021-01-18 2021-01-18 Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110059164.1A CN112904279B (en) 2021-01-18 2021-01-18 Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum

Publications (2)

Publication Number Publication Date
CN112904279A true CN112904279A (en) 2021-06-04
CN112904279B CN112904279B (en) 2024-01-26

Family

ID=76114123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110059164.1A Active CN112904279B (en) 2021-01-18 2021-01-18 Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum

Country Status (1)

Country Link
CN (1) CN112904279B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113589230A (en) * 2021-09-29 2021-11-02 广东省科学院智能制造研究所 Target sound source positioning method and system based on joint optimization network
CN113655440A (en) * 2021-08-09 2021-11-16 西南科技大学 Self-adaptive compromising pre-whitening sound source positioning method
CN114897033A (en) * 2022-07-13 2022-08-12 中国人民解放军海军工程大学 Three-dimensional convolution kernel group calculation method for multi-beam narrow-band process data
CN114994608A (en) * 2022-04-21 2022-09-02 西北工业大学深圳研究院 Multi-device self-organizing microphone array sound source positioning method based on deep learning
CN115201753A (en) * 2022-09-19 2022-10-18 泉州市音符算子科技有限公司 Low-power-consumption multi-spectral-resolution voice positioning method
CN115331691A (en) * 2022-10-13 2022-11-11 广州成至智能机器科技有限公司 Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium
CN116859336A (en) * 2023-07-14 2023-10-10 苏州大学 High-precision implementation method for sound source localization

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110222707A1 (en) * 2010-03-15 2011-09-15 Do Hyung Hwang Sound source localization system and method
CN109164415A (en) * 2018-09-07 2019-01-08 东南大学 A kind of binaural sound sources localization method based on convolutional neural networks
CN109490822A (en) * 2018-10-16 2019-03-19 南京信息工程大学 Voice DOA estimation method based on ResNet
CN110133572A (en) * 2019-05-21 2019-08-16 南京林业大学 A kind of more sound localization methods based on Gammatone filter and histogram
CN110133596A (en) * 2019-05-13 2019-08-16 南京林业大学 A kind of array sound source localization method based on frequency point signal-to-noise ratio and biasing soft-decision
CN110517705A (en) * 2019-08-29 2019-11-29 北京大学深圳研究生院 A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks
CN110544490A (en) * 2019-07-30 2019-12-06 南京林业大学 sound source positioning method based on Gaussian mixture model and spatial power spectrum characteristics
WO2020042708A1 (en) * 2018-08-31 2020-03-05 大象声科(深圳)科技有限公司 Time-frequency masking and deep neural network-based sound source direction estimation method
CN111123202A (en) * 2020-01-06 2020-05-08 北京大学 Indoor early reflected sound positioning method and system
CN111583948A (en) * 2020-05-09 2020-08-25 南京工程学院 Improved multi-channel speech enhancement system and method
CN111707990A (en) * 2020-08-19 2020-09-25 东南大学 Binaural sound source positioning method based on dense convolutional network
CN111968677A (en) * 2020-08-21 2020-11-20 南京工程学院 Voice quality self-evaluation method for fitting-free hearing aid

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110222707A1 (en) * 2010-03-15 2011-09-15 Do Hyung Hwang Sound source localization system and method
WO2020042708A1 (en) * 2018-08-31 2020-03-05 大象声科(深圳)科技有限公司 Time-frequency masking and deep neural network-based sound source direction estimation method
CN109164415A (en) * 2018-09-07 2019-01-08 东南大学 A kind of binaural sound sources localization method based on convolutional neural networks
CN109490822A (en) * 2018-10-16 2019-03-19 南京信息工程大学 Voice DOA estimation method based on ResNet
CN110133596A (en) * 2019-05-13 2019-08-16 南京林业大学 A kind of array sound source localization method based on frequency point signal-to-noise ratio and biasing soft-decision
CN110133572A (en) * 2019-05-21 2019-08-16 南京林业大学 A kind of more sound localization methods based on Gammatone filter and histogram
CN110544490A (en) * 2019-07-30 2019-12-06 南京林业大学 sound source positioning method based on Gaussian mixture model and spatial power spectrum characteristics
CN110517705A (en) * 2019-08-29 2019-11-29 北京大学深圳研究生院 A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks
CN111123202A (en) * 2020-01-06 2020-05-08 北京大学 Indoor early reflected sound positioning method and system
CN111583948A (en) * 2020-05-09 2020-08-25 南京工程学院 Improved multi-channel speech enhancement system and method
CN111707990A (en) * 2020-08-19 2020-09-25 东南大学 Binaural sound source positioning method based on dense convolutional network
CN111968677A (en) * 2020-08-21 2020-11-20 南京工程学院 Voice quality self-evaluation method for fitting-free hearing aid

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
S. Jiang, W. L., P. Yuan, Y. Sun and H. Liu, "Deep and CNN fusion method for binaural sound source localization," The Journal of Engineering, p. 511 *
Vecchiotti et al., "End-to-end Binaural Sound Localisation from the Raw Waveform," IEEE, pp. 451-455 *
Xiaoyan Zhao et al., "Sound Source Localization Based on SRP-PHAT Spatial Spectrum and Deep Neural Network," Computers, Materials & Continua, pp. 253-271 *
张文涛; 韩莹莹; 黎恒, "Traffic sound event recognition method based on convolutional neural network" (基于卷积神经网络的交通声音事件识别方法), Modern Electronics Technique (现代电子技术), no. 14
王茜茜, "Research on robust binaural sound source localization based on neural networks" (基于神经网络的鲁棒双耳声源定位研究), China Master's Theses Full-text Database, Information Science and Technology, pp. 136-129 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113655440A (en) * 2021-08-09 2021-11-16 西南科技大学 Self-adaptive compromising pre-whitening sound source positioning method
CN113589230A (en) * 2021-09-29 2021-11-02 广东省科学院智能制造研究所 Target sound source positioning method and system based on joint optimization network
CN114994608A (en) * 2022-04-21 2022-09-02 西北工业大学深圳研究院 Multi-device self-organizing microphone array sound source positioning method based on deep learning
CN114994608B (en) * 2022-04-21 2024-05-14 西北工业大学深圳研究院 Multi-device self-organizing microphone array sound source positioning method based on deep learning
CN114897033A (en) * 2022-07-13 2022-08-12 中国人民解放军海军工程大学 Three-dimensional convolution kernel group calculation method for multi-beam narrow-band process data
CN114897033B (en) * 2022-07-13 2022-09-27 中国人民解放军海军工程大学 Three-dimensional convolution kernel group calculation method for multi-beam narrow-band process data
CN115201753A (en) * 2022-09-19 2022-10-18 泉州市音符算子科技有限公司 Low-power-consumption multi-spectral-resolution voice positioning method
CN115331691A (en) * 2022-10-13 2022-11-11 广州成至智能机器科技有限公司 Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium
CN116859336A (en) * 2023-07-14 2023-10-10 苏州大学 High-precision implementation method for sound source localization

Also Published As

Publication number Publication date
CN112904279B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
CN112904279B (en) Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
Diaz-Guerra et al. Robust sound source tracking using SRP-PHAT and 3D convolutional neural networks
CN107703486B (en) Sound source positioning method based on convolutional neural network CNN
CN109490822B (en) Voice DOA estimation method based on ResNet
CN110068795A (en) A kind of indoor microphone array sound localization method based on convolutional neural networks
CN112151059A (en) Microphone array-oriented channel attention weighted speech enhancement method
CN107452389A (en) A kind of general monophonic real-time noise-reducing method
US20040175006A1 (en) Microphone array, method and apparatus for forming constant directivity beams using the same, and method and apparatus for estimating acoustic source direction using the same
Vesperini et al. Localizing speakers in multiple rooms by using deep neural networks
Morito et al. Partially Shared Deep Neural Network in sound source separation and identification using a UAV-embedded microphone array
CN113111765B (en) Multi-voice source counting and positioning method based on deep learning
CN112180318B (en) Sound source direction of arrival estimation model training and sound source direction of arrival estimation method
CN110888105A (en) DOA estimation method based on convolutional neural network and received signal strength
CN111123202B (en) Indoor early reflected sound positioning method and system
CN111443328A (en) Sound event detection and positioning method based on deep learning
Salvati et al. Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features.
CN113593596B (en) Robust self-adaptive beam forming directional pickup method based on subarray division
CN112201276B (en) TC-ResNet network-based microphone array voice separation method
CN116559778B (en) Vehicle whistle positioning method and system based on deep learning
CN110838303B (en) Voice sound source positioning method using microphone array
CN112269158A (en) Method for positioning voice source by utilizing microphone array based on UNET structure
CN111948609A (en) Binaural sound source positioning method based on Soft-argmax regression device
CN116859336A (en) High-precision implementation method for sound source localization
Firoozabadi et al. Combination of nested microphone array and subband processing for multiple simultaneous speaker localization
CN114895245A (en) Microphone array sound source positioning method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant