CN115267672A - Method for detecting and positioning sound source


Info

Publication number
CN115267672A
CN115267672A (application CN202210786953.XA)
Authority
CN
China
Prior art keywords
sound source
gcc
signal
fbank
channel signal
Prior art date
Legal status
Pending
Application number
CN202210786953.XA
Other languages
Chinese (zh)
Inventor
颜俊
朱鸿翔
曹艳华
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202210786953.XA
Publication of CN115267672A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 5/00 - Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S 5/18 - Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Abstract

The invention provides a sound source detection and positioning method which mainly comprises the following steps: splitting the multi-channel signal into single-channel signals; performing noise reduction with the CEEMDAN noise reduction algorithm; extracting FBANK features and GCC features from each denoised single-channel signal; training a CRNN with the category labels and position labels to obtain a sound source localization and detection model; splitting online-collected samples by channel; and extracting FBANK features and GCC features from the split single-channel signals, combining them, and feeding the combination into the sound source localization and detection model as a comprehensive feature to obtain estimates of the sound source category and position. By denoising sound source signals whose noise distribution is unknown, the invention effectively reduces the influence of unknown noise on the signal; at the same time, multi-task learning of the sound source category and position significantly improves accuracy and reduces the complexity of the online prediction process.

Description

Method for detecting and positioning sound source
Technical Field
The invention relates to a sound source detection and positioning method, and belongs to the field of deep learning.
Background
In recent years, with the wide application of positioning algorithms and location information, the localization and detection of sound events has received broad attention, for example in intelligent traffic dispersion in smart cities, voice recognition in smart meeting rooms, and audio monitoring in smart homes. With the rapid development of the Internet of Things and artificial intelligence, fast and accurate algorithms for localizing and detecting sound events are urgently needed. Such algorithms are generally divided into two subtasks: Sound Source Detection (SED) and Sound Source Localization (SSL). The SED task determines the type of the sound source, while the SSL task estimates its position.
For the SED task, supervised classification is typically used to judge the class of the sound source. Available classifiers include hidden Markov models, Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Convolutional Recurrent Neural Networks (CRNNs). The SED task demands fast and accurate judgment of the sound source type, and the best results in the literature currently come from the CRNN. This network structure stacks CNN, RNN, and fully connected (FC) layers: the receptive fields of the CNN at different levels reduce the feature dimension and deepen the feature representation, while the RNN effectively models time-dependent sequences.
For the SSL task, conventional methods are based on time delay of arrival, steered beam response, or multiple signal classification. These conventional approaches differ in algorithmic complexity, in the geometric constraints placed on the microphone array, and in the model assumptions about the acoustic scene, and an end-to-end sound source localization system is difficult to realize with them. Meanwhile, with the continuous development of deep learning in recent years, more and more researchers have begun to build SSL networks on deep learning frameworks. In early SSL work, sound source direction estimation was often cast as a classification task, since classification networks were much easier to construct than regression networks in the early stages of deep learning. The direction space was therefore subjectively divided into a number of categories, but this directly limits the resolution of the SSL task. Most prior work classified the elevation and azimuth angles in a spherical coordinate system; lifting this to localization in a three-dimensional Cartesian coordinate system could require hundreds of classes, placing impractical demands on the network and on the training data. Classification-based sound source localization has therefore gradually been replaced by regression-based sound source localization.
In speech feature extraction, the most common features are Mel-Frequency Cepstral Coefficients (MFCCs). MFCCs apply a cepstral transformation to the source signal through a mel filter bank, together with the spectral compression of the mel scale. Since the first few MFCC values capture pitch-invariant audio characteristics, they are commonly used for pitch-agnostic tasks such as speaker recognition. However, recent sound event detection work has shown that MFCCs are not the optimal choice because they are sensitive to background noise. In previous work, Mel Filter Bank (FBANK) features have proven better suited to deep neural networks than MFCCs. For spatial localization with multichannel signals, the Generalized Cross-Correlation (GCC) between adjacent channel signals represents the differences between channels well and can resolve signals arriving from different directions, so GCC has been widely applied and developed and remains a common conventional solution for sound source localization.
In view of the above, it is necessary to provide a method for detecting and positioning a sound source to solve the above problems.
Disclosure of Invention
The invention aims to provide a sound source detection and positioning method, which effectively reduces the influence of unknown noise on a sound source signal.
In order to achieve the above object, the present invention provides a sound source detecting and positioning method, which mainly comprises the following steps:
step 1, splitting the sound source audio signal by channel, so that the multi-channel signal becomes single-channel signals;
step 2, performing noise reduction on each single-channel signal with the CEEMDAN noise reduction algorithm;
step 3, extracting FBANK features and GCC features from each denoised single-channel signal, combining them, and feeding the combination into the CRNN network as a comprehensive feature;
step 4, training the CRNN network with the category labels and position labels to obtain the sound source localization and detection model;
step 5, splitting the online-collected sample by channel, so that the multi-channel signal becomes single-channel signals;
and step 6, extracting FBANK features and GCC features from the single-channel signals split in step 5, combining them, and feeding the combination into the sound source localization and detection model of step 4 as a comprehensive feature to obtain estimates of the sound source category and position.
As a further improvement of the invention, the method comprises an off-line stage and an on-line stage, wherein the steps 1 to 4 are completed in the off-line stage, and the steps 5 and 6 are completed in the on-line stage.
As a further improvement of the invention, in step 1, category information and position information are used as labels to mark different sound sources; the category information is marked with a one-hot code, and the position information is converted from the spherical coordinate system to a three-dimensional Cartesian coordinate system with the formulas:
x = r·cos(ele)·cos(azi)
y = r·cos(ele)·sin(azi)
z = r·sin(ele),
where r is the distance of the speaker from the microphone, ele is the elevation angle in degrees, azi is the azimuth angle in degrees, and x, y, and z are the three-dimensional Cartesian coordinates.
As a further improvement of the present invention, step 2 specifically comprises the following steps:
step 21, adding Gaussian white noise to the single-channel signal to be decomposed to obtain a first group of new signals;
step 22, performing EMD on the first group of new signals to obtain first-order eigenmode components;
step 23, taking the ensemble average of the N generated modal components to obtain the 1st intrinsic mode component of the CEEMDAN decomposition;
step 24, calculating and removing the residual signal of the 1st intrinsic mode component, adding positive and negative paired Gaussian white noise to obtain a second group of new signals, and performing EMD on the second group of new signals as the carrier to obtain a first-order modal component;
step 25, repeating the above steps until all modal components are obtained;
step 26, for each modal component, calculating its cross-correlation coefficient with the single-channel signal to be decomposed of step 21.
As a further improvement of the present invention, in step 21, the single-channel signal to be decomposed is y(t), and Gaussian white noise is added to obtain the first group of new signals y(t) + (−1)^q·ε·v_j(t), where q = 1, 2.
As a further improvement of the present invention, step 3 specifically comprises the following steps:
step 31, performing a short-time Fourier transform on each denoised single-channel signal;
step 32, extracting the in-band features from the vector obtained by the short-time Fourier transform with a mel filter bank;
step 33, taking the logarithm of the obtained in-band features to obtain the FBANK features;
step 34, combining the different channels in pairs to obtain the different combinations;
step 35, performing a Fourier transform on each signal of each combination of step 34 and conjugating one of them to obtain two vectors;
step 36, forming the product of the two vectors with the GCC-PHAT weighting function;
step 37, performing an inverse Fourier transform on the product to obtain the GCC features between channels;
and step 38, stacking the FBANK features and the GCC features along the time axis to obtain the comprehensive feature.
As a further improvement of the present invention, in step 32, the mel filter bank comprises 64 triangular filters, and the frequency response of the m-th triangular filter is defined as:
H_m(k) = 0 for k < f(m−1),
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m),
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)) for f(m) ≤ k ≤ f(m+1),
H_m(k) = 0 for k > f(m+1),
where f(m) is the center frequency of the m-th filter, spaced uniformly on the mel scale.
as a further development of the invention, in step 33 the logarithmic operation is
Figure BDA0003729088140000053
The resulting FBANK is characterized by a dimension of 513.
As a further improvement of the invention, in step 34, the received signals of a pair of channels are respectively
x₁(t) = α₁·s(t − τ₁) + n₁(t)
x₂(t) = α₂·s(t − τ₂) + n₂(t),
where s(t) is the sound source signal, n₁(t) and n₂(t) are the environmental noise, and τ₁ and τ₂ are the times at which the array elements receive the sound source signal.
As a further development of the invention, in step 36 the GCC-PHAT weighting function is
ψ(ω) = 1 / |X₁(ω)·X₂*(ω)|,
where X₁(ω) and X₂(ω) are the Fourier transforms of the original signals and * denotes complex conjugation.
The invention has the beneficial effects that: by denoising sound source signals whose noise distribution is unknown, the influence of unknown noise on the sound source signal is effectively reduced; at the same time, multi-task learning of the sound source category and position significantly improves accuracy and reduces the complexity of the online prediction process.
Drawings
Fig. 1 is a schematic flow chart of the sound source detection and localization method of the present invention.
Fig. 2 is a cross-correlation coefficient diagram of each IMF component decomposed by the CEEMDAN noise reduction algorithm in the sound source detection and localization method of the present invention.
Fig. 3 is a diagram of the noise reduction effect achieved by using different noise reduction thresholds in the sound source detection and localization method according to the present invention.
FIG. 4 is a schematic diagram of feature extraction and fusion used in the method for detecting and locating a sound source according to the present invention.
FIG. 5 is a schematic diagram illustrating that the extracted features are affected by the sound source type and location in the sound source detection and localization method according to the present invention.
FIG. 6 is a schematic diagram of a CRNN network framework in the method for detecting and locating a sound source according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the aspects of the present invention are shown in the drawings, and other details not closely related to the present invention are omitted.
In addition, it is also to be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in fig. 1 to fig. 6, the present invention discloses a deep-learning-based method for detecting and positioning a sound source with a CRNN. The method denoises sound source signals whose noise distribution is unknown with the CEEMDAN noise reduction algorithm (Complete Ensemble Empirical Mode Decomposition with Adaptive Noise) and mainly comprises the following steps:
step 1, splitting the sound source audio signal by channel, so that the multi-channel signal becomes single-channel signals;
step 2, performing noise reduction on each single-channel signal with the CEEMDAN noise reduction algorithm;
step 3, extracting FBANK features and GCC features from each denoised single-channel signal, combining them, and feeding the combination into the CRNN network as a comprehensive feature;
step 4, training the CRNN network with the category labels and position labels to obtain the sound source localization and detection model;
step 5, splitting the online-collected sample by channel, so that the multi-channel signal becomes single-channel signals;
and step 6, extracting FBANK features and GCC features from the single-channel signals split in step 5, combining them, and feeding the combination into the sound source localization and detection model of step 4 as a comprehensive feature to obtain estimates of the sound source category and position.
The present invention comprises two stages, an offline stage and an online stage: steps 1 to 4 are completed in the offline stage and steps 5 and 6 in the online stage. Steps 1 to 6 are described in detail below.
In step 1, sound source audio signals are collected to form a data set, and each audio signal is split by channel so that the multi-channel signal becomes single-channel signals. For each sound source recording condition, the category information and position information of each sample are recorded as labels: the category information is marked with a one-hot code, and the position information is converted from the spherical coordinate system to a three-dimensional Cartesian coordinate system with the formulas:
x = r·cos(ele)·cos(azi)
y = r·cos(ele)·sin(azi)
z = r·sin(ele),
where r is the distance of the speaker from the microphone, ele is the elevation angle in degrees, azi is the azimuth angle in degrees, and x, y, and z are the three-dimensional Cartesian coordinates. To speed up network convergence for the final SSL regression, the three-dimensional coordinates are normalized so that each coordinate lies within (−1, 1). The multi-channel signal is then split into single-channel signals according to the number of microphones in the array, and the signals are resampled to 24 kHz. A minimal sketch of this label construction follows.
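The following Python sketch illustrates the label construction of step 1. The helper names (to_cartesian, normalize_xyz), the normalization constant r_max, and the 11-category example are illustrative assumptions, not values fixed by the patent.

    import numpy as np

    def to_cartesian(r, ele_deg, azi_deg):
        """Convert a spherical (r, elevation, azimuth) label to Cartesian x, y, z."""
        ele, azi = np.deg2rad(ele_deg), np.deg2rad(azi_deg)
        return (r * np.cos(ele) * np.cos(azi),
                r * np.cos(ele) * np.sin(azi),
                r * np.sin(ele))

    def normalize_xyz(xyz, r_max):
        """Scale coordinates into (-1, 1) to speed up regression convergence."""
        return np.asarray(xyz) / r_max

    # One-hot category label, e.g. class index 3 out of 11 event categories:
    one_hot = np.eye(11)[3]
    xyz = normalize_xyz(to_cartesian(2.0, 30.0, 45.0), r_max=5.0)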
In step 2, each single-channel signal is denoised with the CEEMDAN noise reduction algorithm, which improves signal quality and reduces the influence of noise. Suppose E_i(·) is the operator that yields the i-th eigenmode component of an EMD (empirical mode decomposition); the i-th intrinsic mode component produced by the CEEMDAN decomposition is denoted IMF_i(t). Here v_j is a Gaussian white noise signal obeying the standard normal distribution, j = 1, 2, 3, ..., N is the number of times white noise is added, ε is the white-noise standard value, and y(t) is the signal to be decomposed. The CEEMDAN noise reduction algorithm specifically comprises the following steps:
step 21, adding white Gaussian noise to the single-channel signal y (t) to be decomposed to obtain a first new signal y (t) + (-1)qεvj(t), wherein q =1,2.
Step 22, performing EMD on each of the first group of new signals and keeping the first-order eigenmode component
IMF_1^j(t) = E_1( y(t) + (−1)^q·ε·v_j(t) ), j = 1, 2, ..., N.
Step 23, taking the ensemble average of the N generated modal components to obtain the 1st eigenmode component of the CEEMDAN decomposition, namely
IMF_1(t) = (1/N) · Σ_{j=1}^{N} IMF_1^j(t).
Step 24, calculating and removing the residual signal of the 1st intrinsic mode component, i.e.
r_1(t) = y(t) − IMF_1(t),
then adding positive and negative paired Gaussian white noise to r_1(t) to obtain a new group of signals, and performing EMD with these signals as the carrier to obtain a first-order modal component D_1.
And step 25, repeating the above steps until all modal components are obtained, at which point
y(t) = Σ_{i=1}^{K} IMF_i(t) + r_K(t),
where r_K(t) is the final residual.
Step 26, for each modal component, calculating the cross-correlation coefficient between that component and the single-channel signal to be decomposed of step 21 (i.e., the original audio signal), and deciding from this coefficient whether to keep or discard each Intrinsic Mode Function (IMF). In conventional processing the high-frequency components are usually removed outright, but in many cases they carry useful information, and removing them directly damages the integrity of the original data. Where to truncate the IMFs decomposed from each signal depends on how the noise is distributed over the original signal; real noise, however, is very complex and its distribution is usually unknown, so the retention of each IMF is judged from the differences among the correlation coefficients.
To illustrate the algorithm, 4000 sampling points were extracted for a simulation of the CEEMDAN noise reduction algorithm; 13 IMF components were obtained and the correlation coefficient of each was calculated, as shown in fig. 2.
As can be seen from fig. 2, different IMFs correlate with the original signal to different degrees. We therefore introduce a noise reduction threshold t: IMFs whose correlation coefficient exceeds t are retained, while IMFs whose correlation coefficient falls below t are treated as noise and filtered out.
Fig. 3 compares the signals obtained after filtering noise with different noise reduction thresholds against the original signal: sub-graph (a) is the original signal, sub-graph (b) the signal with threshold t = 0, sub-graph (c) the signal with t = 0.05, and sub-graph (d) the signal with t = 0.5. The noise reduction threshold was chosen by exhaustive search.
As the circled parts of fig. 3 show, different noise reduction thresholds smooth the signal and remove part of the noise; note that filtering out IMFs may also discard part of the information in the original signal. A sketch of this correlation-thresholded reconstruction follows.
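The following sketch reproduces steps 21 to 26 with the third-party PyEMD package (pip install EMD-signal) standing in for the CEEMDAN decomposition; the PyEMD dependency, its constructor arguments, and the helper name ceemdan_denoise are assumptions for illustration, not part of the patent.

    import numpy as np
    from PyEMD import CEEMDAN  # assumed third-party CEEMDAN implementation

    def ceemdan_denoise(y, t=0.05, trials=100):
        """Keep only the IMFs whose cross-correlation with y exceeds threshold t."""
        imfs = CEEMDAN(trials=trials)(y)                 # steps 21-25: decomposition
        kept = [imf for imf in imfs
                if abs(np.corrcoef(imf, y)[0, 1]) > t]   # step 26: correlation test
        return np.sum(kept, axis=0) if kept else np.zeros_like(y)

With t = 0.05, the threshold the exhaustive search found best, the reconstruction keeps the informative IMFs while dropping those dominated by noise.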
In step 3, FBANK features and GCC features are extracted from each denoised signal and input to the CRNN network together.
Fig. 4 shows the overall block diagram of the feature extraction and fusion algorithm. The FBANK features and GCC features are extracted and fused separately, and step 3 is divided into the following sub-steps:
and 31, firstly extracting FBANK characteristics in the single-channel signal after denoising, and performing short-time Fourier transform on the single-channel signal after denoising. Using 25ms as a frame, the audio signal can be regarded as a stationary signal in a short time, and at this time, the sampling rate is 24KHz, then 1024-point fourier transform is performed, and the length of the obtained vector is 513.
Step 32, extract the in-band features from the vectors obtained by the short-time Fourier transform with a mel filter bank. A mel filter bank of 64 triangular filters is used to extract the band information. The triangular bandpass filter serves two main purposes: it smooths the spectrum, suppressing harmonics, and it highlights the formants of the original speech. The frequency response of the m-th triangular filter is defined as:
H_m(k) = 0 for k < f(m−1),
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m),
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)) for f(m) ≤ k ≤ f(m+1),
H_m(k) = 0 for k > f(m+1),
where f(m) is the center frequency of the m-th filter, spaced uniformly on the mel scale. The purpose of this filtering is to simulate the nonlinear perception of sound by the human ear, which discriminates more finely at lower frequencies than at higher ones; that is, frequency is converted to the mel scale by the formula:
mel(f) = 2595 · log10(1 + f/700).
and step 33, carrying out logarithm operation on the obtained internal characteristics to obtain FBANK characteristics. Namely, it is
Figure BDA0003729088140000104
The extracted FBANK features are 513 dimensions, and the FBANK extraction is finished.
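Steps 31 to 33 can be sketched as a conventional log-mel pipeline: 25 ms frames at 24 kHz, a 1024-point FFT, and a 64-filter mel bank. The use of librosa and the hop length are assumptions for illustration; the patent itself does not name a library.

    import numpy as np
    import librosa  # assumed helper library for the STFT and mel filter bank

    def extract_fbank(y, sr=24000, n_fft=1024, n_mels=64):
        frame = int(0.025 * sr)                        # 25 ms analysis frame
        spec = np.abs(librosa.stft(y, n_fft=n_fft, win_length=frame,
                                   hop_length=frame // 2)) ** 2   # step 31
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # step 32
        return np.log(mel_fb @ spec + 1e-10)           # step 33: log band energies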
Step 34, extract the GCC features from the denoised single-channel signals by combining the channels in pairs. For example, this experiment uses a 4-channel microphone array; combining the microphones two at a time yields 6 pairs, so the GCC feature has 6 dimensions. Suppose the received signals of a pair of microphones are respectively
x₁(t) = α₁·s(t − τ₁) + n₁(t)
x₂(t) = α₂·s(t − τ₂) + n₂(t),
where s(t) is the sound source signal, n₁(t) and n₂(t) are the environmental noise, and τ₁ and τ₂ are the times at which the array elements receive the sound source signal.
Step 35, perform a Fourier transform on each signal of each combination and conjugate one of them, obtaining two vectors.
Step 36, form the product of the two vectors with the GCC-PHAT weighting function. A GCC-based time delay estimation algorithm can introduce a weighting function to adjust the cross-power spectral density and thereby optimize the delay estimate. The generalized cross-correlation function has many variants according to the weighting function, of which the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) is the most widely used. The GCC-PHAT weighting has some resistance to noise and reverberation, which strengthens the robustness of the system. The GCC-PHAT weighting function is
ψ(ω) = 1 / |X₁(ω)·X₂*(ω)|,
where X₁(ω) and X₂(ω) are the Fourier transforms of the original signals and * denotes complex conjugation. The PHAT-weighted cross-power spectrum resembles a unit impulse response: the peak at the true delay is sharpened, reverberant noise is effectively suppressed, and the precision and accuracy of the delay estimate improve.
Step 37, perform an inverse Fourier transform on the product to obtain the GCC features between the channels, completing the GCC feature extraction; a sketch follows.
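Steps 34 to 37 for one microphone pair can be sketched as below; the helper name gcc_phat and the eps stabilizer are assumptions. The PHAT weighting divides out the cross-spectrum magnitude so that only phase (delay) information survives the inverse transform.

    import numpy as np
    from itertools import combinations

    def gcc_phat(x1, x2, n_fft=1024, eps=1e-10):
        X1 = np.fft.rfft(x1, n=n_fft)            # step 35: transform each signal
        X2 = np.fft.rfft(x2, n=n_fft)
        cross = X1 * np.conj(X2)                 # conjugate one of the pair
        cross /= np.abs(cross) + eps             # step 36: PHAT weighting
        return np.fft.irfft(cross, n=n_fft)      # step 37: back to the lag domain

    pairs = list(combinations(range(4), 2))      # 4 mics -> C(4,2) = 6 pairs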
Step 38, finally, stack the FBANK features and the GCC features along the time axis to obtain the comprehensive feature. At this point the FBANK and GCC features are each 513-dimensional, and numpy's concatenate function is used in Python to merge all extracted features into a (10, 513) comprehensive feature, as sketched below.
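A stacking sketch for step 38; the random placeholder arrays are illustrative assumptions standing in for the real feature maps.

    import numpy as np

    fbank_maps = np.random.randn(4, 1, 513)   # one FBANK map per microphone channel
    gcc_maps = np.random.randn(6, 1, 513)     # one GCC map per microphone pair
    features = np.concatenate([fbank_maps, gcc_maps], axis=0)  # (10, 1, 513) CRNN input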
FIG. 5 visualizes the features. Sub-graph (a) shows the time-domain plot and corresponding features of a telephone ringtone at position A, sub-graph (b) the same ringtone at position B, and sub-graph (c) a knock at position B. Comparing sub-graphs (a) and (b) shows that when the same sound source emits from different positions, the FBANK features barely change while the GCC features change considerably. Comparing sub-graphs (b) and (c) shows that when different sound sources emit from the same position, the FBANK features change greatly while the GCC features barely change. The sound source category and position can therefore be judged from the joint action of FBANK and GCC.
In step 4, training is carried out with the category labels and position labels of the samples to obtain the sound source localization and detection model.
Fig. 6 depicts the CRNN network framework used by the invention. Sub-graph (a) shows the framework of the whole network, which consists of 3 convolution blocks, 2 Gated Recurrent Unit (GRU) layers, and Fully Connected (FC) layers for classification and regression. The detail of a convolution block is shown in sub-graph (b); a soft attention mechanism is embedded in the 1st convolution block, and its detailed structure is shown in sub-graph (c). These are described in detail below.
First, over the whole framework, the input features are (10 × 1 × 513), and before the 1st convolution block the attention mechanism of sub-graph (c) is applied. Since convolution acts locally, many convolution layers would be needed to relate features at different positions of the feature map, whereas an attention mechanism fuses the whole feature map inside the convolution instead of being confined to the convolution kernel. The attention mechanism used by the invention borrows from Natural Language Processing (NLP): a self-attention mechanism realized with soft attention. Since the input feature has 10 channels, the feature maps of the channels are first separated; the vector of each channel is reshaped into a matrix, and the dot product of that matrix with itself is taken. The significance of this step is that coordinate (i, j) of the resulting attention map is the interaction of the i-th and j-th elements within the channel, capturing the dependency between any two elements of the whole feature map. The attention map is then normalized by softmax. Finally, the dot product of the attention map with the original CNN feature map is taken, updating the weight of each feature in the CNN. As learning deepens, each individual feature of the original map acquires the weight updated by the attention mechanism, i.e., a global dependency on every position. A sketch follows.
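A minimal PyTorch sketch of this per-channel soft self-attention under the layout described above; the module name SoftSelfAttention and the (batch, 10, 1, 513) tensor layout are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    class SoftSelfAttention(torch.nn.Module):
        """Dot-product self-attention applied separately within each channel."""
        def forward(self, x):                        # x: (batch, 10, 1, 513)
            b, c, t, f = x.shape
            v = x.reshape(b, c, t * f)               # separate per-channel vectors
            # attn[..., i, j]: influence between elements i and j of a channel
            attn = F.softmax(v.unsqueeze(-1) * v.unsqueeze(-2), dim=-1)
            out = torch.matmul(attn, v.unsqueeze(-1)).squeeze(-1)
            return out.reshape(b, c, t, f)           # re-weighted feature map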
The parameters of the convolution blocks are similar. The batch normalization layers in every convolution block normalize the activations, which speeds up convergence during training; the dropout layer uses a fixed probability of 0.2 so that training does not overfit; and the ReLU layer is the activation function at the end of each convolution block, which avoids purely linear relations between learned parameters and likewise guards against overfitting.
In the proposed structure, the comprehensive FBANK-and-GCC feature can be regarded as a 10-channel feature, each channel being a 1-dimensional vector of the feature dimension against the time dimension. To exploit its local shift-invariance, multi-layer CNN learning is used. In the three convolution blocks, the 2D convolution kernels of size 1x2 have stride 1x1, and the channel dimension expands from 10 to 32 and then from 32 to 64; the 2D pooling kernels of size 1x2 have stride 1x2. The convolution and pooling mainly reduce the feature length within a single channel, extract the locally invariant structure along the time dimension, expand the features into a higher-dimensional space, and strengthen the deep information of the features. A 0x1 edge padding is added in the 3rd convolution block so that the CNN output can be reshaped into the dimensions expected by the GRU. In short, the convolution blocks expand along the channel dimension to mine the comprehensive feature deeply while compressing the feature values along the time dimension to extract the required information; the role of the 3 convolution blocks is to synthesize the inter-channel features and match them to the subsequent GRU input dimensions. One block can be sketched as follows.
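A PyTorch sketch of one convolution block under the quoted parameters (1x2 kernel, stride 1x1, 1x2 pooling, dropout 0.2); the helper name make_conv_block and the 64-to-64 width of the third block are assumptions.

    import torch.nn as nn

    def make_conv_block(in_ch, out_ch, pad=(0, 0)):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(1, 2), stride=(1, 1), padding=pad),
            nn.BatchNorm2d(out_ch),      # normalizes activations, faster convergence
            nn.ReLU(),                   # breaks linearity between learned parameters
            nn.Dropout(p=0.2),           # fixed 0.2 probability against overfitting
            nn.MaxPool2d(kernel_size=(1, 2), stride=(1, 2)),
        )

    blocks = nn.Sequential(
        make_conv_block(10, 32),              # 10 -> 32 channels
        make_conv_block(32, 64),              # 32 -> 64 channels
        make_conv_block(64, 64, pad=(0, 1)),  # 0x1 edge padding in block 3
    )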
After the CNN output is reshaped to 128 × 64, it matches the sequence length of the GRU and is fed directly into the GRU to learn memory along the time dimension. Specifically, the recurrent unit consists of two GRU layers, specified by the num_layers parameter in PyTorch; each GRU layer has input and output sequence length 64 and hidden size 64, and each GRU output is activated with tanh. ReLU activation is not used here mainly because, for recurrent units, the ReLU activation function is prone to gradient explosion and gradient decay, so tanh is used uniformly in the RNN to avoid these phenomena. The GRUs are all bidirectional; through their learning, the temporal information of the features along the time dimension is obtained and the features are further refined. Through the GRU, the feature vector is output as a 128x256 two-dimensional vector that carries more temporal information. A sketch follows.
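A PyTorch sketch of the recurrent stage with the quoted parameters (2 bidirectional GRU layers, hidden size 64, tanh on the outputs); the batch size and the exact reshaped layout are assumptions.

    import torch
    import torch.nn as nn

    gru = nn.GRU(input_size=64, hidden_size=64, num_layers=2,
                 batch_first=True, bidirectional=True)

    x = torch.randn(8, 128, 64)   # CNN output reshaped to (batch, 128, 64)
    out, _ = gru(x)               # (8, 128, 128): 64 units per direction
    out = torch.tanh(out)         # tanh avoids the gradient issues ReLU has in RNNs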
Two branch networks follow the backbone, both built from FC layers that share weights across the time dimension; they correspond to the classification task of SED and the regression task of SSL respectively. The SED branch consists of 3 FC layers; the last one uses a sigmoid activation to realize an 11-way multi-label classification. Since the sigmoid maps each output into (0, 1), an event is judged present when its prediction exceeds the 0.5 threshold. The SSL branch consists of 4 FC layers; the last one has 3 tanh-activated outputs corresponding to the regression predictions of the event's x, y, and z coordinates. Because the label specification above maps the x, y, and z coordinates into (−1, 1), tanh guarantees the outputs stay in that range. Both heads are sketched below.
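A sketch of the two task heads; the intermediate layer widths and the 256-dimensional per-frame input (matching the GRU's 128x256 output) are assumptions, since the text fixes only the layer counts and the output activations.

    import torch.nn as nn

    sed_head = nn.Sequential(             # 3 FC layers for multi-label detection
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 11), nn.Sigmoid(),  # an event is active if its score > 0.5
    )

    ssl_head = nn.Sequential(             # 4 FC layers; tanh keeps x, y, z in (-1, 1)
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 32), nn.ReLU(),
        nn.Linear(32, 3), nn.Tanh(),
    )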
For the loss functions, a Binary Cross-Entropy loss (BCE Loss) is used between the SED branch's predicted categories and the true categories, and the SSL branch uses a Mean Squared Error loss (MSE Loss) between the predicted and true coordinates. The losses of the classification and regression tasks are not of the same order of magnitude; to balance them effectively, the magnitudes of BCE and MSE are brought to the same order, and after exhaustive parameter tuning the loss participating in backpropagation was fixed as Loss = BCE + 50 × MSE. A sketch follows.
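The stated combination Loss = BCE + 50 × MSE can be sketched in PyTorch as below; the function name combined_loss is an assumption.

    import torch.nn as nn

    bce = nn.BCELoss()        # SED branch: sigmoid outputs vs. multi-hot event labels
    mse = nn.MSELoss()        # SSL branch: predicted vs. true (x, y, z)

    def combined_loss(sed_pred, sed_true, xyz_pred, xyz_true):
        return bce(sed_pred, sed_true) + 50.0 * mse(xyz_pred, xyz_true)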
In summary, in the convolutional-recurrent sound source detection and localization method provided by the invention, the SED task and the SSL task are defined as a classification task and a regression task after sharing the same backbone network, and this multi-task learning effectively improves the estimation performance. Fusing FBANK, which is fixed by the sound source type, with GCC, which changes with the sound source position, extracts in one step a comprehensive feature for both sound source identification and localization; this serves as the input feature for training the whole neural network to obtain a model, from which predictions are then obtained. The neural network framework is end-to-end and solves the SED and SSL problems simultaneously with effectively improved precision; apart from the microphone array used in data acquisition, no extra hardware support is needed, making it an extremely convenient end-to-end sound source localization and detection system.
Meanwhile, denoising signals carrying noise of unknown distribution with the CEEMDAN noise reduction algorithm effectively reduces the influence of that noise on the sound source signal, and the end-to-end model handles both tasks with a single model, markedly improving precision and reducing the complexity of the online prediction process compared with other work of the same type. The CEEMDAN noise reduction algorithm performs an eigenmode decomposition of the original signal to obtain the IMF components; the cross-correlation of each IMF with the original signal represents that component's contribution, so noise can be removed effectively with the noise reduction threshold t. Repeated experiments show that the most effective threshold in this invention is 0.05.
In addition, the invention provides a new deep learning network framework for sound source detection and localization. Different frameworks yield different results for this algorithm; the framework proposed here is simple to implement and performs better than the alternatives.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (10)

1. A method for detecting and positioning a sound source is characterized by mainly comprising the following steps:
step 1, splitting the sound source audio signal by channel, so that the multi-channel signal becomes single-channel signals;
step 2, performing noise reduction on each single-channel signal with the CEEMDAN noise reduction algorithm;
step 3, extracting FBANK features and GCC features from each denoised single-channel signal, combining them, and feeding the combination into a CRNN network as a comprehensive feature;
step 4, training the CRNN network with the category labels and position labels to obtain a sound source localization and detection model;
step 5, splitting the online-collected sample by channel, so that the multi-channel signal becomes single-channel signals;
and step 6, extracting FBANK features and GCC features from the single-channel signals split in step 5, combining them, and feeding the combination into the sound source localization and detection model of step 4 as a comprehensive feature to obtain estimates of the sound source category and position.
2. The method of sound source detection and localization according to claim 1, wherein: the method comprises an offline stage and an online stage, steps 1 to 4 being completed in the offline stage and steps 5 and 6 in the online stage.
3. The method of sound source detection and localization according to claim 1, wherein: in step 1, category information and position information are used as labels to mark different sound sources; the category information is marked with a one-hot code, and the position information is converted from the spherical coordinate system to a three-dimensional Cartesian coordinate system with the formulas:
x = r·cos(ele)·cos(azi)
y = r·cos(ele)·sin(azi)
z = r·sin(ele),
where r is the distance of the speaker from the microphone, ele is the elevation angle in degrees, azi is the azimuth angle in degrees, and x, y, and z are the three-dimensional Cartesian coordinates.
4. The method for sound source detection and localization according to claim 1, wherein step 2 comprises the following steps:
step 21, adding Gaussian white noise to the single-channel signal to be decomposed to obtain a first group of new signals;
step 22, performing EMD on the first group of new signals to obtain first-order eigenmode components;
step 23, taking the ensemble average of the N generated modal components to obtain the 1st intrinsic mode component of the CEEMDAN decomposition;
step 24, calculating and removing the residual signal of the 1st intrinsic mode component, adding positive and negative paired Gaussian white noise to obtain a second group of new signals, and performing EMD on the second group of new signals as the carrier to obtain a first-order modal component;
step 25, repeating the above steps until all modal components are obtained;
and step 26, for each modal component, calculating its cross-correlation coefficient with the single-channel signal to be decomposed of step 21.
5. The method of sound source detection and localization according to claim 4, wherein: in step 21, the single-channel signal to be decomposed is y(t), and Gaussian white noise is added to obtain the first group of new signals y(t) + (−1)^q·ε·v_j(t), where q = 1, 2.
6. The method of sound source detection and localization according to claim 1, wherein step 3 specifically comprises the following steps:
step 31, performing a short-time Fourier transform on each denoised single-channel signal;
step 32, extracting the in-band features from the vector obtained by the short-time Fourier transform with a mel filter bank;
step 33, taking the logarithm of the obtained in-band features to obtain the FBANK features;
step 34, combining the different channels in pairs to obtain the different combinations;
step 35, performing a Fourier transform on each signal of each combination of step 34 and conjugating one of them to obtain two vectors;
step 36, forming the product of the two vectors with the GCC-PHAT weighting function;
step 37, performing an inverse Fourier transform on the product to obtain the GCC features between channels;
and step 38, stacking the FBANK features and the GCC features along the time axis to obtain the comprehensive feature.
7. The method of sound source detection and localization according to claim 6, wherein: in step 32, the mel filter bank comprises 64 triangular filters, and the frequency response of the m-th triangular filter is defined as:
H_m(k) = 0 for k < f(m−1),
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m),
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)) for f(m) ≤ k ≤ f(m+1),
H_m(k) = 0 for k > f(m+1),
where f(m) is the center frequency of the m-th filter, spaced uniformly on the mel scale.
8. The method of sound source detection and localization according to claim 6, wherein: in step 33, the logarithmic operation is
F(m) = log( Σ_k |X(k)|²·H_m(k) ),
where X(k) is the short-time Fourier transform of the frame, and the resulting FBANK feature has dimension 513.
9. The method of sound source detection and localization according to claim 6, wherein: in step 34, the received signals of a pair of channels are respectively
x₁(t) = α₁·s(t − τ₁) + n₁(t)
x₂(t) = α₂·s(t − τ₂) + n₂(t),
where s(t) is the sound source signal, n₁(t) and n₂(t) are the environmental noise, and τ₁ and τ₂ are the times at which the array elements receive the sound source signal.
10. The method of sound source detection and localization according to claim 6, wherein: in step 36, the GCC-PHAT weighting function is
ψ(ω) = 1 / |X₁(ω)·X₂*(ω)|,
where X₁(ω) and X₂(ω) are the Fourier transforms of the original signals and * denotes complex conjugation.
CN202210786953.XA 2022-07-04 2022-07-04 Method for detecting and positioning sound source Pending CN115267672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210786953.XA CN115267672A (en) 2022-07-04 2022-07-04 Method for detecting and positioning sound source


Publications (1)

Publication Number Publication Date
CN115267672A true CN115267672A (en) 2022-11-01

Family

ID=83763631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210786953.XA Pending CN115267672A (en) 2022-07-04 2022-07-04 Method for detecting and positioning sound source

Country Status (1)

Country Link
CN (1) CN115267672A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117214821A (en) * 2023-09-18 2023-12-12 咸亨国际(杭州)电气制造有限公司 Sound source imaging method and device based on sparse matrix compressed storage
CN117214821B (en) * 2023-09-18 2024-04-12 咸亨国际(杭州)电气制造有限公司 Sound source imaging method and device based on sparse matrix compressed storage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination