CN109782231B - End-to-end sound source positioning method and system based on multi-task learning


Info

Publication number: CN109782231B
Authority: CN (China)
Prior art keywords: sound source, microphone, signal, delay, model
Legal status: Active (granted)
Application number: CN201910043338.8A
Other languages: Chinese (zh)
Other versions: CN109782231A
Inventors: 曲天书, 吴玺宏, 黄炎坤
Assignee (original and current): Peking University
Application filed by Peking University; priority date: 2019-01-17
Publication of CN109782231A: 2019-05-21
Publication of CN109782231B (grant): 2020-11-20

Classifications

  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention discloses an end-to-end sound source localization method and system based on multi-task learning. The method comprises the following steps: 1) for each sound source position to be scanned, calculating the time delay of the sound signal transmitted from that position to each microphone position; 2) applying the corresponding delay compensation to the multi-channel frame-level time-domain signals collected by the microphones during each scan of the microphone array; 3) inputting each delay-compensated time-domain signal into the corresponding CNN model for feature extraction and feeding the extracted features into a deep neural network; 4) the deep neural network estimates the multi-channel sound source signal of each scanning position from the features extracted by the CNN models; 5) for each scanning position, computing the cross-correlation coefficient sum of the corresponding multi-channel sound source signals and selecting the position with the maximum sum as the sound source position. The method automatically extracts suitable features, introduces a multi-task learning mechanism, and improves the localization performance of the model.

Description

End-to-end sound source positioning method and system based on multi-task learning
Technical Field
The invention belongs to the technical field of array signal processing, relates to a microphone array and a sound source positioning method, and particularly relates to an end-to-end sound source positioning method and system based on multi-task learning.
Background
With the development of artificial intelligence technology, machine hearing has received a great deal of attention, and many techniques and research fields related to it have emerged in succession. Sound source localization is a basic and important technology in a machine auditory system; it essentially simulates the function of human ears, collecting sound signals through a microphone array in order to determine the position of a sounding object. Sound source localization can be applied independently in many fields, such as video conferencing and identification of honking vehicles, and it can also provide basic position information to other technologies, such as speech enhancement. Improving the localization accuracy of sound source localization algorithms therefore benefits many applications, promotes the development of related technologies to a certain extent, and provides strong support for them.
According to the positioning principle, the sound source positioning technology can be roughly divided into the following five categories: time difference of arrival estimation based, high resolution spectral estimation based, steerable beam forming based, transfer function based, and neural network based methods.
The method based on time-difference-of-arrival estimation first estimates the time differences between the sound signals arriving at different microphones and then infers the sound source position from these time differences and the spatial geometry of the array. This divides localization into two steps and thus suffers from error propagation: an inaccurate estimate of the time difference of arrival carries its error into the second step. Moreover, the time difference of arrival is difficult to estimate accurately, so the localization accuracy is limited.
Methods based on high-resolution spectral estimation include multiple signal classification (MUSIC), minimum variance spectral estimation (MVM), etc. Such a method forms a covariance matrix from the signals collected by the microphone array and performs eigenvalue decomposition (EVD), obtaining a signal subspace corresponding to the signal components and a noise subspace corresponding to the noise components, and then estimates the target azimuth from these two subspaces. This class of methods has high spatial resolution but performs poorly under reverberation: reverberant noise is directional and originates from the same source as the signal, so the two are strongly correlated, and determining the sound source position by eigendecomposition easily leads to misjudgment.
The method based on steerable beamforming is a scanning method that scans all possible sound source positions one by one. For each scanning position, a beam is formed by delay-compensating the signals collected by the microphone array, the output power of the beam is calculated, and the position with the maximum output power is selected as the estimated sound source position; a typical algorithm is steered response power with phase transform weighting (SRP-PHAT). This method only uses time-difference-of-arrival information, ignores amplitude-difference information, and is easily affected by noise under high reverberation and low signal-to-noise ratio.
The transfer-function-based method is also a scanning method; it measures the transfer characteristic, i.e., the transfer function, of the sound signal from each sound source position to each microphone. The multi-channel source signal is recovered by inverse-filtering the signals collected by the microphone array, i.e., its time differences and intensity differences are restored; correlation detection is then performed on the recovered multi-channel source signal, and the position with the maximum correlation is selected as the sound source position. This method exploits both time-difference and intensity-difference localization cues, but it requires actual measurement of the transfer functions and cannot be used in scenes where such measurement is impossible. In addition, under low signal-to-noise ratio and high reverberation an accurate transfer function can hardly be measured, the measured transfer function lacks robustness, and localization performance suffers. Moreover, a measured transfer function is strongly tied to the environment in which it was measured and is difficult to transfer to other environments.
In recent years, research has focused mainly on sound source localization methods based on neural networks. Such studies basically treat the neural network as a black box, and different studies mainly vary three modules: the input features, the output contents, and the neural network structure. Most methods extract features such as the amplitude spectrum and phase spectrum in advance as the input of the network, choose the type of the network output nodes, such as azimuth or distance, and finally use the neural network to learn the mapping from the features to the azimuth or distance. Neural network methods are very convenient for modeling; a classifier with good performance is learned simply by iterating over a large amount of data. However, well-performing hand-crafted features must be designed and screened in advance, and it is difficult for such a method to learn a general mapping from features to the sound source position, i.e., the classifier has poor localization performance in practice and insufficient robustness.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides an end-to-end sound source localization method and system based on multi-task learning. Modeling is performed with neural networks from a signal-processing perspective, which gives the neural network model better interpretability and trainability. In the invention, the multi-channel time-domain signal acquired by the microphone array is used directly as the input of the neural network, the output of the model is the sound source position, an end-to-end algorithmic framework is constructed, and the model is trained by combining two loss functions, mean square error (MSE) and cross entropy (CE). In addition, the invention lets the model automatically extract suitable features through a convolutional neural network (CNN) module and introduces a multi-task learning mechanism to improve the localization performance of the model.
The basic idea of the proposed end-to-end sound source localization algorithm based on multi-task learning is to learn, from a large amount of data in a deep-learning manner, the laws governing the phase and amplitude changes that a sound signal undergoes during transmission due to scatterers, the environment, and so on. The model can then recover the original phase and amplitude of the acquired multi-channel time-domain signal, and finally the sound source is localized by combining the two localization cues of time difference and amplitude difference. The key innovation of the invention is the introduction of a multi-task learning and end-to-end algorithmic framework, which significantly improves noise robustness.
The technical scheme of the invention is as follows:
an end-to-end sound source positioning method based on multitask learning comprises the following steps:
1) for each sound source position to be scanned, calculating the time delay of sound signals transmitted from the sound source position to each microphone position in the microphone array;
2) performing corresponding delay compensation on the multi-channel frame-level time domain signals collected by each microphone during each scanning of the microphone array according to the delay;
3) inputting each delay-compensated time-domain signal into the corresponding CNN model for feature extraction and feeding the extracted features into a deep neural network; the deep neural network includes a plurality of DNN models, and CNN_m inputs its extracted features into DNN_{m,n}; here CNN_m is the CNN model for the mth microphone, and DNN_{m,n} denotes the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position; m = 1, ..., M and n = 1, ..., N, where M denotes the number of microphones and N the number of scanning positions;
4) the deep neural network estimates a multi-channel sound source signal of each scanning position according to the characteristics extracted by each CNN model;
5) for each scanning position, calculating the cross-correlation coefficient sum of the multi-channel sound source signals corresponding to that scanning position, and selecting the position with the maximum correlation coefficient sum as the sound source position.
Further, the delay compensation applied to the time-domain signals is

$$\tilde{x}_{m,n}(t) = x_m(t + \tau_{m,n})$$

where $x_m$ denotes the frame-level time-domain signal acquired by the mth microphone, $\tilde{x}_{m,n}(t)$ denotes the time-domain signal after compensating $x_m$ with the delay between the nth scan position and the mth microphone, and $\tau_{m,n}$ denotes that delay.
Further, the CNN model performs feature extraction on the input time-domain signal using a one-dimensional convolution layer in the time domain.
Further,

$$\hat{s}_{m,n}(t) = \mathrm{DNN}_{m,n}\big(\mathrm{CNN}_m(\tilde{x}_{m,n}(t))\big)$$

is adopted to estimate the multi-channel sound source signals, where $\hat{s}_{m,n}(t)$ is the sound source signal of the mth microphone at the nth sound source position.
Further, the method for obtaining the DNN model by training comprises the following steps:
51) for each set sound source, collecting a multi-channel time domain signal of the sound source by using a microphone array, and acquiring the position of the sound source; then, time-domain signals of the sound source are subjected to delay compensation, and a DNN model of a transmission path corresponding to the position of the sound source is trained based on the delay compensation signals and an MSE loss function until a convergence condition is reached;
52) on the basis of this convergence, the MSE and CE losses are combined and added in proportion to form the final loss function: first, the estimated mean square error of the transmission path of each set sound source position is calculated, and these errors are summed to form the MSE loss term of the loss function; then an inter-channel consistency detection module is added, the correlation coefficient sums are normalized by a softmax function into probabilities of the different positions to obtain a probability vector, and the CE loss term is computed from this probability vector and the one-hot supervision vector according to the CE formula; the weighted sum of the two terms constitutes the loss function of the final DNN model.
An end-to-end sound source positioning system based on multitask learning is characterized by comprising a delay calculation module, a delay compensation module, a CNN (CNN) model, a deep neural network and a target sound source position estimation module; wherein
The delay calculation module is used for calculating the delay of sound signals transmitted from the sound source position to each microphone position in the microphone array for each sound source position to be scanned;
the delay compensation module is used for carrying out corresponding delay compensation on the multi-channel frame-level time domain signals collected by each microphone during each scanning of the microphone array according to the delay;
the CNN model is used for extracting the characteristics of the input time domain signal after the time delay compensation and inputting the extracted characteristics into a deep neural network;
the deep neural network is used for estimating the multi-channel sound source signal of each scanning position from the extracted features; the deep neural network comprises a plurality of DNN models, and CNN_m inputs its extracted features into DNN_{m,n}; CNN_m is the CNN model for the mth microphone, and DNN_{m,n} denotes the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position; m = 1, ..., M and n = 1, ..., N, where M denotes the number of microphones and N the number of scanning positions;
and the target sound source position estimation module is used for calculating the cross-correlation coefficient sum of the multi-channel sound source signals corresponding to each scanning position, and selecting the position with the maximum correlation coefficient sum as the sound source position.
The basic framework of the end-to-end sound source localization algorithm based on multitask learning is shown in fig. 1, and the method is a scanning method, and is explained in the following from two aspects of the localization process and the model training.
(1) Flow of positioning
The positioning process mainly comprises the following steps:
Calculate delay: the delay of the sound signal transmitted from a scanning position to a microphone position can be pre-calculated for each sound source position to be scanned and each microphone, and stored as a delay matrix for subsequent use.

Input time-domain signal: the input to the model is a multi-channel frame-level time-domain signal.

Compensate delay: the delay is calculated from the distance of each microphone to the scanning position, and each channel of the multi-channel acquired signal is then compensated by its corresponding delay.

CNN feature extraction: the CNN model serves as a shared hidden layer in the whole network to extract features and learn common transmission characteristics.

DNN signal recovery: the deep neural network (i.e., fully-connected neural network, DNN) is the part that models each task separately; the output of the DNN is the estimated multi-channel sound source signal. The compensated delay, the CNN model and the DNN model together form a mapping model from the collected signal to the sound source signal, i.e., the mapping recovers the phase and amplitude variations introduced during propagation.

Inter-channel consistency calculation: this step measures the consistency of the estimated multi-channel sound source signals by computing correlation coefficients between them, and its result drives the subsequent localization operation. With N possible positions, N correlation coefficient sums must be computed.

Estimate the sound source position: if the scanned position coincides with the true position, the estimated multi-channel sound source signals should be consistent and their correlation coefficient sum is maximal. Therefore, when estimating the sound source position, the position with the maximum correlation coefficient sum is selected as the estimate. A minimal sketch of this whole scan loop is given below.
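As a concrete illustration of this flow, here is a minimal Python sketch of the scan loop. It is a structural sketch only: recover_fn is a hypothetical stand-in for the trained CNN_m to DNN_{m,n} mapping of one transmission path, integer-sample shifting (with wrap-around via np.roll) approximates the delay compensation, and the sampling rate and speed of sound are assumed values, not taken from the patent.

```python
# Structural sketch of the scanning localization flow (assumptions noted above).
import numpy as np

def localize(x, mic_pos, scan_pos, recover_fn, fs=48000, v=343.0):
    """x: (M, L) one frame of microphone signals; mic_pos: (M, 3);
    scan_pos: (N, 3); recover_fn(m, n, sig) -> estimated source signal."""
    M, L = x.shape
    N = scan_pos.shape[0]
    # Delay matrix tau[m, n] = distance(m, n) / speed of sound (precomputable).
    tau = np.linalg.norm(mic_pos[:, None, :] - scan_pos[None, :, :], axis=-1) / v
    scorr = np.zeros(N)
    for n in range(N):
        # Compensate each channel by its delay (integer-sample, wrap-around roll).
        shifts = np.round(tau[:, n] * fs).astype(int)
        x_comp = np.stack([np.roll(x[m], -shifts[m]) for m in range(M)])
        # Recover the source signal seen through each transmission path.
        s_hat = np.stack([recover_fn(m, n, x_comp[m]) for m in range(M)])
        # Inter-channel consistency: sum of pairwise correlation coefficients.
        for i in range(M):
            for j in range(i + 1, M):
                scorr[n] += np.corrcoef(s_hat[i], s_hat[j])[0, 1]
    return int(np.argmax(scorr))  # index of the estimated source position
```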
(2) Training of models
Clearly, the delay information is prior information: it can be computed in advance by dividing the distance by the speed of sound, or it can be obtained by cross-correlation. The neural network model therefore does not need to acquire and learn this known information; it is computed directly and compensated in advance, which reduces the learning task and difficulty of the neural network.
In addition, during training, the MSE and CE loss functions are used jointly. Notably, they apply at different places. MSE is used at the output of the DNN model for the regression task of mapping the received microphone signals to the estimated sound source signals; the supervision signal there is the frame-level sound source signal. CE is used at the output of the consistency detection: first, a softmax function normalizes the correlation coefficient sums into probabilities of the different positions, yielding a probability vector; then the CE loss term is obtained from this probability vector and the one-hot supervision vector according to the CE formula. With N possible positions, the supervision signal is an N-dimensional one-hot vector. The two terms are added in a certain proportion to obtain the final loss value.
The specific training steps are as follows.
The method comprises the following steps: for a set sound source, a microphone array is used for collecting multi-channel time domain signals of the sound source, and the position of the sound source is obtained.
Step two: and performing time delay compensation operation on the time domain signal of the sound source.
Step three: train the DNN model of the transmission path corresponding to the sound source position using the obtained delay-compensated signals and the MSE loss function; in this step the models of the other sound source positions are not considered, and neither the consistency calculation nor the estimation of the sound source position is involved.
Step four: repeat the above operations on the training data acquired at different sound source positions; the models of the transmission paths corresponding to different sound source positions are independent, and the models of the different transmission paths are trained separately until the network converges, i.e., the MSE no longer decreases.
Step five: after step four converges, the model is trained jointly using MSE and CE; through this step the end-to-end model is realized. In this step the models of the different transmission paths are no longer trained separately: the sum of the estimated mean square errors of all transmission paths is taken as the MSE loss term, the inter-channel consistency detection module is added, and the CE loss term calculated as described above is added in a certain proportion, giving the final loss value

$$\mathrm{Loss} = \mathrm{MSE} + \alpha \times \mathrm{CE}.$$
Compared with the prior art, the invention has the following positive effects:
the invention combines two loss functions of Mean Square Error (MSE) and Cross Entropy (CE) to train the model. And the positioning performance of the model is improved.
The invention introduces a multi-task learning mechanism, uses a Convolutional Neural Network (CNN) as a shared hidden layer, extracts common characteristics and improves the positioning performance of the model.
The invention uses the multi-channel time-domain signal collected directly by the microphone array as the input of the neural network, the output of the model is the sound source position, and an end-to-end algorithmic framework is constructed.
Drawings
FIG. 1 is a basic block diagram of an end-to-end sound source localization algorithm based on multitask learning;
FIG. 2 is a schematic diagram of a CNN model used in the present invention;
FIG. 3 is a schematic diagram of the DNN model used in the present invention;
FIG. 4 is a schematic diagram of a ball model and microphone distribution used in the present invention;
FIG. 5 is a plot of the localization performance of the proposed method and the baselines at different SNRs.
Detailed Description
Preferred embodiments of the present invention are described in more detail below with reference to the accompanying drawings. Fig. 1 shows the basic framework of the end-to-end sound source localization algorithm based on multi-task learning according to the present invention. The implementation steps of the method comprise calculating the delay, inputting the time-domain signals, compensating the delay, CNN feature extraction, DNN signal recovery, inter-channel consistency calculation, and estimation of the target sound source position. Each step is implemented as follows:
1. calculating time delay
In the scanning method, the positions to be scanned and the distribution of the microphone array are known, so the delay can be computed from the known scanning positions and microphone positions; it too is known information, and can be calculated in advance and stored as a delay matrix for direct use later. Specifically, the distance between a scanning position and a microphone position is computed, and the delay of the sound signal transmitted from the scanning position to the microphone position is then obtained from the speed of sound:

$$\tau_{m,n} = \frac{d_{m,n}}{v}$$

where $\tau_{m,n}$ denotes the delay between the nth scanning position and the mth microphone, $d_{m,n}$ is the distance from the nth scanning position to the mth microphone, and $v$ is the speed of sound.
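As a concrete illustration, a minimal Python sketch of this precomputation follows; the array shapes and the numeric value of the speed of sound are assumptions for the example, not values quoted from the patent.

```python
# Sketch: precompute the delay matrix tau with tau[m, n] = d[m, n] / v.
import numpy as np

V_SOUND = 343.0  # assumed speed of sound in m/s

def delay_matrix(mic_pos, scan_pos, v=V_SOUND):
    """mic_pos: (M, 3) microphone coordinates in meters;
    scan_pos: (N, 3) candidate source coordinates. Returns (M, N) delays in s."""
    d = np.linalg.norm(mic_pos[:, None, :] - scan_pos[None, :, :], axis=-1)
    return d / v
```

For the experimental setup described below (six microphones, 72 scan directions), mic_pos would have shape (6, 3) and scan_pos shape (72, 3).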
2. Input time domain signal
The input of the neural network is a multi-channel frame-level time-domain signal. For example, with M microphones and a frame length of L sampling points, the input consists of M frames of time-domain signals, M × L sampling points in total; the M channels are initially processed independently.
3. Compensating for time delay
During each scan, the delay from each microphone to the scanning position is available from step 1; for a given candidate position, each channel of the multi-channel microphone signal is therefore compensated by its corresponding delay:

$$\tilde{x}_{m,n}(t) = x_m(t + \tau_{m,n}), \quad m = 1, \dots, M, \; n = 1, \dots, N$$

where M denotes the number of microphones, N denotes the number of scanning positions, $x_m$ denotes the frame-level time-domain signal acquired by the mth microphone, and $\tilde{x}_{m,n}(t)$ denotes the time-domain signal after compensating $x_m$ with the delay between the nth scan position and the mth microphone.
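A hedged sketch of this per-channel compensation at the frame level follows; rounding the delay to an integer number of samples and zero-padding the frame tail are implementation assumptions made here for simplicity (a fractional-delay filter would be more faithful).

```python
# Sketch: delay-compensate one frame for one scan position, x_tilde[t] = x[t + k].
import numpy as np

def compensate(x, tau, fs=48000):
    """x: (M, L) frame of microphone signals; tau: (M,) delays in seconds
    for one scan position. Returns the compensated frame of the same shape."""
    M, L = x.shape
    out = np.zeros_like(x)
    for m in range(M):
        k = int(round(tau[m] * fs))   # delay in samples for channel m
        out[m, :L - k] = x[m, k:]     # advance channel m by k samples
    return out
```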
CNN extraction features
Each delay-compensated time-domain signal is input into the corresponding CNN model, which serves as a shared hidden layer in the whole network to extract features and learn common transmission characteristics. Experiments showed that pooling brings no benefit, so the CNN here uses only one-dimensional convolution layers in the time domain:

$$h_{m,n}(t) = \mathrm{CNN}_m\big(\tilde{x}_{m,n}(t)\big)$$

where $\mathrm{CNN}_m$ denotes the CNN model corresponding to the mth microphone (its structure is shown in fig. 2), and $h_{m,n}(t)$ denotes the hidden-layer output, i.e., the extracted features, obtained when the signal $\tilde{x}_{m,n}(t)$ is input into the corresponding $\mathrm{CNN}_m$ model.
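The following PyTorch sketch matches the architecture described in the experimental setup below (three 1-D convolution layers with 8, 4 and 1 kernels of size 2, output length kept equal to the frame length); the causal left-padding and the tanh activations are assumptions, since the patent does not specify them.

```python
# Sketch of the shared feature extractor CNN_m (assumed padding/activations).
import torch
import torch.nn as nn

class SharedCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConstantPad1d((1, 0), 0.0), nn.Conv1d(1, 8, kernel_size=2), nn.Tanh(),
            nn.ConstantPad1d((1, 0), 0.0), nn.Conv1d(8, 4, kernel_size=2), nn.Tanh(),
            nn.ConstantPad1d((1, 0), 0.0), nn.Conv1d(4, 1, kernel_size=2),
        )

    def forward(self, x):    # x: (batch, 1, L) delay-compensated frame
        return self.net(x)   # h: (batch, 1, L) extracted features
```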
DNN recovery Signal
The deep neural network (i.e., fully-connected network, DNN) follows the shared CNN hidden layer. The DNN models of the different positions are independent of each other: different positions in fact learn different transmission characteristics, which amount to different tasks, so the DNN model is the part that models each task separately, and the output of the DNN is the estimated multi-channel sound source signal. The compensated delay, the CNN model and the DNN model together form a mapping model from the collected signal to the sound source signal, i.e., the mapping recovers the phase and amplitude variations introduced during propagation. Since the DNN model serves a regression task, it is trained with the MSE loss function; its last layer is a regression layer, and no activation function is used:

$$\hat{s}_{m,n}(t) = \mathrm{DNN}_{m,n}\big(h_{m,n}(t)\big)$$

where $\hat{s}_{m,n}(t)$ is the estimated sound source signal of the mth microphone at the nth scanning position, and $\mathrm{DNN}_{m,n}$ denotes the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position (its structure is shown in fig. 3).
5. Computing inter-channel coherence
For a given scanning position, the multi-channel original signals can be recovered, and the sum of the cross-correlation coefficients of the recovered multi-channel signals is computed as the index of inter-channel consistency:

$$\mathrm{SCorr}(n) = \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \rho\big(\hat{s}_{i,n}(t), \hat{s}_{j,n}(t)\big)$$

where $\rho\big(\hat{s}_{i,n}(t), \hat{s}_{j,n}(t)\big)$ is the correlation coefficient of the signals $\hat{s}_{i,n}(t)$ and $\hat{s}_{j,n}(t)$, and $\mathrm{SCorr}(n)$ represents the sum of the correlation coefficients for the nth scan position.
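A small NumPy sketch of this consistency score, assuming the pairwise correlation-coefficient sum written above:

```python
# Sketch: inter-channel consistency as the sum of pairwise correlation coefficients.
import numpy as np

def scorr(s_hat):
    """s_hat: (M, T) recovered source signals for one scan position."""
    r = np.corrcoef(s_hat)                     # (M, M) correlation-coefficient matrix
    iu = np.triu_indices(s_hat.shape[0], k=1)  # indices of the distinct pairs
    return r[iu].sum()
```

The position estimate of the next step is then simply np.argmax([scorr(s) for s in s_hat_all]) over the N scan positions.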
6. Estimating a sound source position
If the scanned position coincides with the true position, the estimated multi-channel sound source signals should be consistent and their correlation coefficient sum is maximal. Therefore, when estimating the sound source position, the position with the maximum correlation coefficient sum is selected as the estimate:

$$\hat{n} = \arg\max_{n} \; \mathrm{SCorr}(n)$$
7. Training of models
For the training of the model described above, the Adam algorithm is used throughout (other known optimizers may also be adopted). First, the models of the different transmission paths are trained separately with the MSE loss function until the network converges. The MSE of a single model is

$$\mathrm{MSE}_{m,n} = \frac{1}{I} \sum_{i=1}^{I} \frac{1}{T} \sum_{t=1}^{T} \big(\hat{s}_{m,n,i}(t) - s_{m,n,i}(t)\big)^2$$

where $\hat{s}_{m,n,i}(t)$ is the estimated sound source signal of the ith frame for the mth microphone at the nth scanning position, $s_{m,n,i}(t)$ is the source signal of the corresponding frame, i.e., the supervision signal, T is the number of sampling points of one frame signal, and I is the total number of frames in a mini-batch.
On this basis, the cross-entropy function is introduced:

$$\mathrm{CE} = -\frac{1}{I} \sum_{i=1}^{I} \sum_{n=1}^{N} Y_{n,i} \log P_{n,i}$$

where $P_{n,i}$ is the probability obtained by normalizing the correlation coefficient sum $\mathrm{SCorr}(n)$ of the ith frame with the softmax function, $Y_{n,i}$ is the supervision signal corresponding to that frame and scan position, and I is the total number of frames in a mini-batch.
Combining MSE and CE and adding them in a certain proportion yields the final loss function:

$$\mathrm{Loss} = \sum_{n=1}^{N} \sum_{m=1}^{M} \mathrm{MSE}_{m,n} + \alpha \cdot \mathrm{CE}$$

where α is a set scale factor.
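A hedged PyTorch sketch of this joint objective is given below; the tensor layout is an assumption, and F.cross_entropy is used because it applies exactly the softmax normalization over positions followed by the one-hot cross entropy described above.

```python
# Sketch: joint loss = sum of per-path MSEs + alpha * CE (assumed tensor layout).
import torch
import torch.nn.functional as F

def joint_loss(s_hat, s_true, scorr, target_idx, alpha=0.01):
    """s_hat, s_true: (I, N, M, T) estimated / reference frame-level signals;
    scorr: (I, N) correlation-coefficient sums; target_idx: (I,) true position."""
    # Mean over time and mini-batch, summed over all N*M transmission paths.
    mse = ((s_hat - s_true) ** 2).mean(dim=-1).mean(dim=0).sum()
    # Softmax over scan positions + cross entropy against the one-hot target.
    ce = F.cross_entropy(scorr, target_idx)
    return mse + alpha * ce
```

The proportion alpha = 0.01 matches the CE weight reported in the experimental setup below.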
The advantages of the invention are illustrated below with reference to specific embodiments.
The invention uses transfer functions to generate simulated signals and tests the localization performance of the algorithm on these signals under different signal-to-noise ratio conditions. The proposed model trained with MSE only is denoted Shared-CNN-MSE, and the model trained jointly with MSE and CE is denoted Shared-CNN-MSE-CE. In addition, the experiments use SRP-PHAT, MUSIC and another neural-network-based approach (DNN) as baselines. The sound source signals are Gaussian white noise, and the signal-to-noise ratio varies from -10 dB to 5 dB.
1. Microphone array
The microphone array used in this experiment consists of six microphones uniformly distributed on the horizontal circle of a sphere with a radius of 0.0875 m, i.e., adjacent microphones are 60 degrees apart, as shown in fig. 4. The sound source is 3 meters from the center of the sphere, and the azimuth ranges from 0 to 360 degrees with a resolution of 5 degrees, giving 72 directions in total. The corresponding transfer functions are calculated from the spherical model given by Duda et al.
2. Signal emulation
The experiment generates the simulated signals by convolving the sound source with the transfer functions; the source signal is Gaussian white noise. Additional noise is added to each channel of the simulated signal according to the desired signal-to-noise ratio, with the noise independent across channels; the sampling rate of the signal is 48 kHz and the frame length is set to 1024 sampling points. Under each condition (sound source position and signal-to-noise ratio), the localization results of the proposed method and of the baseline methods are recorded.
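A brief sketch of this noise-mixing step, assuming the standard per-channel SNR scaling (the formula is conventional, not quoted from the patent):

```python
# Sketch: add channel-independent white noise at a target SNR in dB.
import numpy as np

def add_noise(x, snr_db, rng=np.random.default_rng(0)):
    """x: (M, L) clean simulated microphone signals; returns a noisy copy."""
    noise = rng.standard_normal(x.shape)
    p_sig = np.mean(x ** 2, axis=1, keepdims=True)      # per-channel signal power
    p_noise = np.mean(noise ** 2, axis=1, keepdims=True)
    noise *= np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return x + noise
```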
3. Neural network setup
Since the frame length in the experiments is set to 1024 sampling points and there are 6 microphones, the input of the CNN is 6 × 1024 time-domain sampling points. The CNN uses three one-dimensional convolution layers with 8, 4 and 1 convolution kernels respectively; the kernel size is 2, and the length after each convolution remains equal to the frame length. The structure of a single CNN is shown in fig. 2. Each DNN model uses one fully-connected layer with an output layer of 1024 nodes; the structure of a single DNN is shown in fig. 3. Adam is selected as the optimization algorithm during training, and the proportion of the CE loss term is set to 0.01.
4. Results of the experiment
The experiment tests the localization performance of the five methods on noisy signals with signal-to-noise ratios from -10 dB to 5 dB, the noise being Gaussian white noise. Of the five methods, SRP-PHAT and MUSIC are the conventional sound source localization methods mentioned above, and DNN is a neural-network-based baseline system. Shared-CNN-MSE and Shared-CNN-MSE-CE are the algorithms proposed herein, corresponding to the two training strategies: Shared-CNN-MSE is trained using only MSE, while Shared-CNN-MSE-CE is trained using both MSE and CE.
As can be seen from fig. 5, when the signal-to-noise ratio drops below -1 dB, the performance of MUSIC and SRP-PHAT degrades, while the neural-network-based methods show better noise robustness because the neural network models make full use of both time-difference and amplitude-difference information. If the average error angle is fixed at 30 degrees, the SNRs reached by SRP-PHAT, MUSIC, DNN, Shared-CNN-MSE and Shared-CNN-MSE-CE are -2 dB, -3 dB, -5 dB, -6 dB and -7 dB, respectively; compared with SRP-PHAT, Shared-CNN-MSE and Shared-CNN-MSE-CE thus improve by 4 dB and 5 dB. Compared with the DNN method, Shared-CNN-MSE achieves a lower average error angle at the same signal-to-noise ratio, which shows that multi-task learning contributes to the performance: the shared CNN hidden layer extracts more robust features, and learning each position helps the learning of the other positions. On this basis, training jointly with MSE and CE improves Shared-CNN-MSE-CE over Shared-CNN-MSE, demonstrating the effectiveness of this training strategy: the CE loss term fully exploits the mutual-exclusion relation among the different positions.
The invention provides an end-to-end sound source localization algorithm based on multi-task learning. Following the idea of multi-task learning, a CNN is used as a shared hidden layer to extract features, and MSE and CE are used jointly as the loss function to train the model. Compared with other methods, the proposed method is more robust to noise and localizes better at low signal-to-noise ratios. Moreover, the method is an adaptive model that supports online learning: the more training data, the higher the localization accuracy. Compared with previous neural-network-based methods, it uses multi-task learning and models the problem from a signal-processing perspective, making it more interpretable: here the neural network is not a black box but is used to recover the time differences and amplitude differences of the received microphone signals.
Although specific embodiments of the invention have been disclosed for illustrative purposes together with the accompanying drawings, which are included to aid the understanding of the invention, those skilled in the art will appreciate that various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, the invention should not be limited to the disclosure of the preferred embodiments and the accompanying drawings.

Claims (8)

1. An end-to-end sound source positioning method based on multitask learning comprises the following steps:
1) for each sound source position to be scanned, calculating the time delay of sound signals transmitted from the sound source position to each microphone position in the microphone array;
2) performing corresponding delay compensation on the multi-channel frame-level time domain signals collected by each microphone during each scanning of the microphone array according to the delay;
3) inputting each delay-compensated time-domain signal into the corresponding CNN model for feature extraction and feeding the extracted features into a deep neural network; the deep neural network includes a plurality of DNN models, and CNN_m inputs its extracted features into DNN_{m,n}; wherein CNN_m is the CNN model for the mth microphone, and DNN_{m,n} denotes the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position; m = 1, ..., M and n = 1, ..., N, where M denotes the number of microphones and N the number of scanning positions;
4) the deep neural network estimates a multi-channel sound source signal of each scanning position according to the characteristics extracted by each CNN model;
5) for each scanning position, calculating the cross-correlation coefficient sum of the multi-channel sound source signals corresponding to the scanning position, and selecting the position with the maximum correlation coefficient sum as the sound source position;
the method for obtaining the DNN models by training comprises the following steps: 31) for each set sound source, collecting a multi-channel time-domain signal of the sound source with the microphone array and acquiring the position of the sound source; then performing delay compensation on the time-domain signals of the sound source, and training the DNN model of the transmission path corresponding to the sound source position based on the delay-compensated signals and an MSE loss function until a convergence condition is reached; the models of the different transmission paths are trained separately based on the MSE loss function until the network converges; the MSE of a single model is

$$\mathrm{MSE}_{m,n} = \frac{1}{I} \sum_{i=1}^{I} \frac{1}{T} \sum_{t=1}^{T} \big(\hat{s}_{m,n,i}(t) - s_{m,n,i}(t)\big)^2$$

where $\hat{s}_{m,n,i}(t)$ is the estimated sound source signal of the ith frame for the mth microphone at the nth scanning position, $s_{m,n,i}(t)$ is the source signal of the corresponding frame, T is the number of sampling points of one frame signal, and I is the total number of frames in a mini-batch; 32) on the basis of this convergence, a cross-entropy function is introduced:

$$\mathrm{CE} = -\frac{1}{I} \sum_{i=1}^{I} \sum_{n=1}^{N} Y_{n,i} \log P_{n,i}$$

where $P_{n,i}$ is the probability obtained by normalizing the correlation coefficient sum $\mathrm{SCorr}(n)$ of the ith frame with the softmax function, and $Y_{n,i}$ is the supervision signal corresponding to that frame and scanning position; the final loss function is formed by combining the MSE and CE losses and adding them in proportion: first, the estimated mean square error of the transmission path of each set sound source position is calculated, and these errors are summed as the MSE loss term of the loss function; the CE loss term is obtained by adding an inter-channel consistency detection module, normalizing the correlation coefficient sums into probabilities of the different positions with a softmax function to obtain a probability vector, and evaluating this probability vector against the one-hot supervision vector according to the CE formula; the sum

$$\mathrm{Loss} = \sum_{n=1}^{N} \sum_{m=1}^{M} \mathrm{MSE}_{m,n} + \alpha \cdot \mathrm{CE}$$

constitutes the loss function of the final DNN model, where α is a set scale factor.
2. The method of claim 1, wherein the delay compensation of the time-domain signals is performed by

$$\tilde{x}_{m,n}(t) = x_m(t + \tau_{m,n})$$

where $x_m$ denotes the frame-level time-domain signal collected by the mth microphone, $\tilde{x}_{m,n}(t)$ denotes the time-domain signal after compensating $x_m$ with the delay between the nth scan position and the mth microphone, and $\tau_{m,n}$ denotes the delay between the nth scan position and the mth microphone.
3. The method of claim 1, wherein the CNN model performs feature extraction on the input time-domain signal using one-dimensional convolutional layers in the time domain.
4. The method as claimed in claim 1, wherein

$$\hat{s}_{m,n}(t) = \mathrm{DNN}_{m,n}\big(\mathrm{CNN}_m(\tilde{x}_{m,n}(t))\big)$$

is adopted to estimate the multi-channel sound source signals, where $\hat{s}_{m,n}(t)$ is the sound source signal of the mth microphone at the nth sound source position.
5. An end-to-end sound source positioning system based on multitask learning is characterized by comprising a delay calculation module, a delay compensation module, a CNN (CNN) model, a deep neural network and a target sound source position estimation module; wherein
The delay calculation module is used for calculating the delay of sound signals transmitted from the sound source position to each microphone position in the microphone array for each sound source position to be scanned;
the delay compensation module is used for carrying out corresponding delay compensation on the multi-channel frame-level time domain signals collected by each microphone during each scanning of the microphone array according to the delay;
the CNN model is used for performing feature extraction on the input delay-compensated time-domain signal and inputting the extracted features into a deep neural network; the method for obtaining the DNN models by training comprises the following steps: a) for each set sound source, collecting a multi-channel time-domain signal of the sound source with the microphone array and acquiring the position of the sound source; then performing delay compensation on the time-domain signals of the sound source, and training the DNN model of the transmission path corresponding to the sound source position based on the delay-compensated signals and an MSE loss function until a convergence condition is reached; the models of the different transmission paths are trained separately based on the MSE loss function until the network converges; the MSE of a single model is

$$\mathrm{MSE}_{m,n} = \frac{1}{I} \sum_{i=1}^{I} \frac{1}{T} \sum_{t=1}^{T} \big(\hat{s}_{m,n,i}(t) - s_{m,n,i}(t)\big)^2$$

where $\hat{s}_{m,n,i}(t)$ is the estimated sound source signal of the ith frame for the mth microphone at the nth scanning position, $s_{m,n,i}(t)$ is the source signal of the corresponding frame, T is the number of sampling points of one frame signal, and I is the total number of frames in a mini-batch; b) on the basis of this convergence, a cross-entropy function is introduced:

$$\mathrm{CE} = -\frac{1}{I} \sum_{i=1}^{I} \sum_{n=1}^{N} Y_{n,i} \log P_{n,i}$$

where $P_{n,i}$ is the probability obtained by normalizing the correlation coefficient sum $\mathrm{SCorr}(n)$ of the ith frame with the softmax function, and $Y_{n,i}$ is the supervision signal corresponding to that frame and scanning position; the final loss function is formed by combining the MSE and CE losses and adding them in proportion: first, the estimated mean square error of the transmission path of each set sound source position is calculated, and these errors are summed as the MSE loss term of the loss function; the CE loss term is obtained by adding an inter-channel consistency detection module, normalizing the correlation coefficient sums into probabilities of the different positions with a softmax function to obtain a probability vector, and evaluating this probability vector against the one-hot supervision vector according to the CE formula; these terms add to form the loss function of the final DNN model

$$\mathrm{Loss} = \sum_{n=1}^{N} \sum_{m=1}^{M} \mathrm{MSE}_{m,n} + \alpha \cdot \mathrm{CE}$$

where α is a set scale factor;
the deep neural network is used for estimating the multi-channel sound source signal of each scanning position from the extracted features; the deep neural network comprises a plurality of DNN models, and CNN_m inputs its extracted features into DNN_{m,n}; CNN_m is the CNN model for the mth microphone, and DNN_{m,n} denotes the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position; m = 1, ..., M and n = 1, ..., N, where M denotes the number of microphones and N the number of scanning positions;
and the target sound source position estimation module is used for calculating the cross-correlation coefficient sum of the multi-channel sound source signals corresponding to each scanning position, and selecting the position with the maximum correlation coefficient sum as the sound source position.
6. The system of claim 5, wherein the delay compensation module performs delay compensation on the time-domain signals according to

$$\tilde{x}_{m,n}(t) = x_m(t + \tau_{m,n})$$

where $x_m$ denotes the frame-level time-domain signal acquired by the mth microphone, $\tilde{x}_{m,n}(t)$ denotes the time-domain signal after compensating $x_m$ with the delay between the nth scan position and the mth microphone, and $\tau_{m,n}$ denotes the delay between the nth scan position and the mth microphone.
7. The system of claim 5, wherein the deep neural network employs

$$\hat{s}_{m,n}(t) = \mathrm{DNN}_{m,n}\big(\mathrm{CNN}_m(\tilde{x}_{m,n}(t))\big)$$

to estimate the multi-channel sound source signals, where $\hat{s}_{m,n}(t)$ is the sound source signal of the mth microphone at the nth sound source position.
8. The system of claim 5, wherein the CNN model performs feature extraction on an input time-domain signal using one-dimensional convolutional layers in the time domain.
CN201910043338.8A 2019-01-17 2019-01-17 End-to-end sound source positioning method and system based on multi-task learning Active CN109782231B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910043338.8A CN109782231B (en) 2019-01-17 2019-01-17 End-to-end sound source positioning method and system based on multi-task learning

Publications (2)

Publication Number Publication Date
CN109782231A (en) 2019-05-21
CN109782231B (en) 2020-11-20

Family

ID=66500851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910043338.8A Active CN109782231B (en) 2019-01-17 2019-01-17 End-to-end sound source positioning method and system based on multi-task learning

Country Status (1)

Country Link
CN (1) CN109782231B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161757B (en) * 2019-12-27 2021-09-03 镁佳(北京)科技有限公司 Sound source positioning method and device, readable storage medium and electronic equipment
CN111859241B (en) * 2020-06-01 2022-05-03 北京大学 Unsupervised sound source orientation method based on sound transfer function learning
CN111694433B (en) * 2020-06-11 2023-06-20 阿波罗智联(北京)科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN112731086A (en) * 2021-01-19 2021-04-30 国网上海能源互联网研究院有限公司 Method and system for comprehensively inspecting electric power equipment
CN113138363A (en) * 2021-04-22 2021-07-20 苏州臻迪智能科技有限公司 Sound source positioning method and device, storage medium and electronic equipment
CN113835065B (en) * 2021-09-01 2024-05-17 深圳壹秘科技有限公司 Sound source direction determining method, device, equipment and medium based on deep learning
CN114489321B (en) * 2021-12-13 2024-04-09 广州大鱼创福科技有限公司 Steady-state visual evoked potential target recognition method based on multi-task deep learning
CN117368847B (en) * 2023-12-07 2024-03-15 深圳市好兄弟电子有限公司 Positioning method and system based on microphone radio frequency communication network
CN118112501B (en) * 2024-04-28 2024-07-26 杭州爱华智能科技有限公司 Sound source positioning method and device suitable for periodic signals and sound source measuring device

Citations (2)

Publication number Priority date Publication date Assignee Title
CN108197697A (en) * 2017-12-29 2018-06-22 汕头大学 A kind of dynamic method for resampling of trained deep neural network
CN108924836A (en) * 2018-07-04 2018-11-30 南方电网科学研究院有限责任公司 Edge side physical layer channel authentication method based on deep neural network

Family Cites Families (12)

Publication number Priority date Publication date Assignee Title
CN102103200B (en) * 2010-11-29 2012-12-05 清华大学 Acoustic source spatial positioning method for distributed asynchronous acoustic sensor
US9753119B1 (en) * 2014-01-29 2017-09-05 Amazon Technologies, Inc. Audio and depth based sound source localization
CN107144818A (en) * 2017-03-21 2017-09-08 北京大学深圳研究生院 Binaural sound sources localization method based on two-way ears matched filter Weighted Fusion
CN108305641B (en) * 2017-06-30 2020-04-07 腾讯科技(深圳)有限公司 Method and device for determining emotion information
CN107703486B (en) * 2017-08-23 2021-03-23 南京邮电大学 Sound source positioning method based on convolutional neural network CNN
CN108120436A (en) * 2017-12-18 2018-06-05 北京工业大学 Real scene navigation method in a kind of iBeacon auxiliary earth magnetism room
CN108318862B (en) * 2017-12-26 2021-08-20 北京大学 Sound source positioning method based on neural network
CN108375763B (en) * 2018-01-03 2021-08-20 北京大学 Frequency division positioning method applied to multi-sound-source environment
CN108417224B (en) * 2018-01-19 2020-09-01 苏州思必驰信息科技有限公司 Training and recognition method and system of bidirectional neural network model
CN108734733B (en) * 2018-05-17 2022-04-26 东南大学 Microphone array and binocular camera-based speaker positioning and identifying method
CN109031200A (en) * 2018-05-24 2018-12-18 华南理工大学 A kind of sound source dimensional orientation detection method based on deep learning
CN109164415B (en) * 2018-09-07 2022-09-16 东南大学 Binaural sound source positioning method based on convolutional neural network

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN108197697A (en) * 2017-12-29 2018-06-22 汕头大学 A kind of dynamic method for resampling of trained deep neural network
CN108924836A (en) * 2018-07-04 2018-11-30 南方电网科学研究院有限责任公司 Edge side physical layer channel authentication method based on deep neural network

Non-Patent Citations (1)

Title
Jean-Marc Valin, "Robust Sound Source Localization Using a Microphone Array on a Mobile Robot," Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2003-10-31, pp. 1228-1233 *

Also Published As

Publication number Publication date
CN109782231A (en) 2019-05-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant