CN109782231B - End-to-end sound source positioning method and system based on multi-task learning


Info

Publication number: CN109782231B
Authority: CN (China)
Prior art keywords: sound source, microphone, signal, delay, model
Legal status: Active (granted)
Application number: CN201910043338.8A
Other languages: Chinese (zh)
Other versions: CN109782231A
Inventors: 曲天书, 吴玺宏, 黄炎坤
Assignee (original and current): Peking University
Application filed by Peking University; priority date: 2019-01-17
Publication of CN109782231A: 2019-05-21
Publication of CN109782231B (grant): 2020-11-20

Classifications

  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention discloses an end-to-end sound source localization method and system based on multi-task learning. The method comprises the following steps: 1) for each sound source position to be scanned, calculating the time delay of the sound signal transmitted from that position to each microphone position; 2) applying the corresponding delay compensation to the multi-channel frame-level time-domain signals collected by the microphones during each scan of the microphone array; 3) inputting each delay-compensated time-domain signal into the corresponding CNN model for feature extraction and feeding the extracted features into a deep neural network; 4) the deep neural network estimates the multi-channel sound source signal of each scanning position from the features extracted by the CNN models; 5) for each scanning position, computing the cross-correlation coefficient sum of the corresponding multi-channel sound source signals and selecting the position with the maximum sum as the sound source position. The method automatically extracts suitable features, introduces a multi-task learning mechanism, and improves the localization performance of the model.

Description

End-to-end sound source positioning method and system based on multi-task learning
Technical Field
The invention belongs to the technical field of array signal processing, relates to a microphone array and a sound source positioning method, and particularly relates to an end-to-end sound source positioning method and system based on multi-task learning.
Background
With the development of artificial intelligence technology, machine hearing has received a great deal of attention, and many techniques and research fields related to it have emerged in succession. Sound source localization is a basic and important technology in a machine auditory system; it essentially simulates the function of human ears, collecting sound signals through a microphone array in order to determine the position of a sounding object. Sound source localization can be applied independently in many fields, such as video conferencing and identification of honking vehicles, and it can also provide basic position information to other technologies, such as speech enhancement. Improving the localization accuracy of sound source localization algorithms therefore benefits many applications, promotes the development of related technologies to a certain extent, and provides strong support for them.
According to the positioning principle, the sound source positioning technology can be roughly divided into the following five categories: time difference of arrival estimation based, high resolution spectral estimation based, steerable beam forming based, transfer function based, and neural network based methods.
The method based on time-difference-of-arrival estimation first estimates the time differences between the sound signals arriving at different microphones and then infers the sound source position from these time differences and the spatial geometry of the array. This divides localization into two steps and thus suffers from error propagation: an inaccurate estimate of the time difference of arrival carries its error into the second step. Moreover, the time difference of arrival is difficult to estimate accurately, so the localization accuracy is limited.
Methods based on high-resolution spectral estimation include multiple signal classification (MUSIC), minimum variance spectral estimation (MVM), etc. Such a method forms a covariance matrix from the signals collected by the microphone array and performs eigenvalue decomposition (EVD), obtaining a signal subspace corresponding to the signal components and a noise subspace corresponding to the noise components, and then estimates the target azimuth from these two subspaces. This class of methods has high spatial resolution but performs poorly under reverberation: reverberant noise is directional and originates from the same source as the signal, so the two are strongly correlated, and determining the sound source position by eigendecomposition easily leads to misjudgment.
The method based on steerable beamforming is a scanning method that scans all possible sound source positions one by one. For each scanning position, a beam is formed by delay-compensating the signals collected by the microphone array, the output power of the beam is calculated, and the position with the maximum output power is selected as the estimated sound source position; a typical algorithm is steered response power with phase transform weighting (SRP-PHAT). This method only uses time-difference-of-arrival information, ignores amplitude-difference information, and is easily affected by noise under high reverberation and low signal-to-noise ratio.
The transfer-function-based method is also a scanning method; it measures the transfer characteristic, i.e., the transfer function, of the sound signal from each sound source position to each microphone. The multi-channel source signal is recovered by inverse-filtering the signals collected by the microphone array, i.e., its time differences and intensity differences are restored; correlation detection is then performed on the recovered multi-channel source signal, and the position with the maximum correlation is selected as the sound source position. This method exploits both time-difference and intensity-difference localization cues, but it requires actual measurement of the transfer functions and cannot be used in scenes where such measurement is impossible. In addition, under low signal-to-noise ratio and high reverberation an accurate transfer function can hardly be measured, the measured transfer function lacks robustness, and localization performance suffers. Moreover, a measured transfer function is strongly tied to the environment in which it was measured and is difficult to transfer to other environments.
In recent years, research has focused mainly on sound source localization methods based on neural networks. Such studies basically treat the neural network as a black box, and different studies mainly vary three modules: the input features, the output contents, and the neural network structure. Most methods extract features such as the amplitude spectrum and phase spectrum in advance as the input of the network, choose the type of the network output nodes, such as azimuth or distance, and finally use the neural network to learn the mapping from the features to the azimuth or distance. Neural network methods are very convenient for modeling; a classifier with good performance is learned simply by iterating over a large amount of data. However, well-performing hand-crafted features must be designed and screened in advance, and it is difficult for such a method to learn a general mapping from features to the sound source position, i.e., the classifier has poor localization performance in practice and insufficient robustness.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides an end-to-end sound source localization method and system based on multi-task learning. Modeling is performed with neural networks from a signal-processing perspective, which gives the neural network model better interpretability and trainability. In the invention, the multi-channel time-domain signal acquired by the microphone array is used directly as the input of the neural network, the output of the model is the sound source position, an end-to-end algorithmic framework is constructed, and the model is trained by combining two loss functions, mean square error (MSE) and cross entropy (CE). In addition, the invention lets the model automatically extract suitable features through a convolutional neural network (CNN) module and introduces a multi-task learning mechanism to improve the localization performance of the model.
The basic idea of the proposed end-to-end sound source localization algorithm based on multi-task learning is to learn, from a large amount of data in a deep-learning manner, the laws governing the phase and amplitude changes that a sound signal undergoes during transmission due to scatterers, the environment, and so on. The model can then recover the original phase and amplitude of the acquired multi-channel time-domain signal, and finally the sound source is localized by combining the two localization cues of time difference and amplitude difference. The key innovation of the invention is the introduction of a multi-task learning and end-to-end algorithmic framework, which significantly improves noise robustness.
The technical scheme of the invention is as follows:
an end-to-end sound source positioning method based on multitask learning comprises the following steps:
1) for each sound source position to be scanned, calculating the time delay of sound signals transmitted from the sound source position to each microphone position in the microphone array;
2) performing corresponding delay compensation on the multi-channel frame-level time domain signals collected by each microphone during each scanning of the microphone array according to the delay;
3) inputting each delay-compensated time-domain signal into the corresponding CNN model for feature extraction and feeding the extracted features into a deep neural network; the deep neural network includes a plurality of DNN models, and CNN_m inputs its extracted features into DNN_{m,n}; here CNN_m is the CNN model for the mth microphone, and DNN_{m,n} denotes the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position; m = 1, ..., M and n = 1, ..., N, where M denotes the number of microphones and N the number of scanning positions;
4) the deep neural network estimates a multi-channel sound source signal of each scanning position according to the characteristics extracted by each CNN model;
5) for each scanning position, calculating the cross-correlation coefficient sum of the multi-channel sound source signals corresponding to that scanning position, and selecting the position with the maximum correlation coefficient sum as the sound source position.
Further, the delay compensation applied to the time-domain signals is

$$\tilde{x}_{m,n}(t) = x_m(t + \tau_{m,n})$$

where $x_m$ denotes the frame-level time-domain signal acquired by the mth microphone, $\tilde{x}_{m,n}(t)$ denotes the time-domain signal after compensating $x_m$ with the delay between the nth scan position and the mth microphone, and $\tau_{m,n}$ denotes that delay.
Further, the CNN model performs feature extraction on the input time-domain signal using a one-dimensional convolution layer in the time domain.
Further,

$$\hat{s}_{m,n}(t) = \mathrm{DNN}_{m,n}\big(\mathrm{CNN}_m(\tilde{x}_{m,n}(t))\big)$$

is adopted to estimate the multi-channel sound source signals, where $\hat{s}_{m,n}(t)$ is the sound source signal of the mth microphone at the nth sound source position.
Further, the method for obtaining the DNN model by training comprises the following steps:
51) for each set sound source, collecting a multi-channel time domain signal of the sound source by using a microphone array, and acquiring the position of the sound source; then, time-domain signals of the sound source are subjected to delay compensation, and a DNN model of a transmission path corresponding to the position of the sound source is trained based on the delay compensation signals and an MSE loss function until a convergence condition is reached;
52) on the basis of this convergence, the MSE and CE losses are combined and added in proportion to form the final loss function: first, the estimated mean square error of the transmission path of each set sound source position is calculated, and these errors are summed to form the MSE loss term of the loss function; then an inter-channel consistency detection module is added, the correlation coefficient sums are normalized by a softmax function into probabilities of the different positions to obtain a probability vector, and the CE loss term is computed from this probability vector and the one-hot supervision vector according to the CE formula; the weighted sum of the two terms constitutes the loss function of the final DNN model.
An end-to-end sound source positioning system based on multitask learning is characterized by comprising a delay calculation module, a delay compensation module, a CNN (CNN) model, a deep neural network and a target sound source position estimation module; wherein
The delay calculation module is used for calculating the delay of sound signals transmitted from the sound source position to each microphone position in the microphone array for each sound source position to be scanned;
the delay compensation module is used for carrying out corresponding delay compensation on the multi-channel frame-level time domain signals collected by each microphone during each scanning of the microphone array according to the delay;
the CNN model is used for extracting the characteristics of the input time domain signal after the time delay compensation and inputting the extracted characteristics into a deep neural network;
the deep neural network is used for estimating the multi-channel sound source signal of each scanning position from the extracted features; the deep neural network comprises a plurality of DNN models, and CNN_m inputs its extracted features into DNN_{m,n}; CNN_m is the CNN model for the mth microphone, and DNN_{m,n} denotes the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position; m = 1, ..., M and n = 1, ..., N, where M denotes the number of microphones and N the number of scanning positions;
and the target sound source position estimation module is used for calculating the cross-correlation coefficient sum of the multi-channel sound source signals corresponding to each scanning position, and selecting the position with the maximum correlation coefficient sum as the sound source position.
The basic framework of the end-to-end sound source localization algorithm based on multitask learning is shown in fig. 1, and the method is a scanning method, and is explained in the following from two aspects of the localization process and the model training.
(1) Flow of positioning
The positioning process mainly comprises the following steps:
Calculate delay: the delay of the sound signal transmitted from a scanning position to a microphone position can be pre-calculated for each sound source position to be scanned and each microphone, and stored as a delay matrix for subsequent use.

Input time-domain signal: the input to the model is a multi-channel frame-level time-domain signal.

Compensate delay: the delay is calculated from the distance of each microphone to the scanning position, and each channel of the multi-channel acquired signal is then compensated by its corresponding delay.

CNN feature extraction: the CNN model serves as a shared hidden layer in the whole network to extract features and learn common transmission characteristics.

DNN signal recovery: the deep neural network (i.e., fully-connected neural network, DNN) is the part that models each task separately; the output of the DNN is the estimated multi-channel sound source signal. The compensated delay, the CNN model and the DNN model together form a mapping model from the collected signal to the sound source signal, i.e., the mapping recovers the phase and amplitude variations introduced during propagation.

Inter-channel consistency calculation: this step measures the consistency of the estimated multi-channel sound source signals by computing correlation coefficients between them, and its result drives the subsequent localization operation. With N possible positions, N correlation coefficient sums must be computed.

Estimate the sound source position: if the scanned position coincides with the true position, the estimated multi-channel sound source signals should be consistent and their correlation coefficient sum is maximal. Therefore, when estimating the sound source position, the position with the maximum correlation coefficient sum is selected as the estimate. A minimal sketch of this whole scan loop is given below.
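As a concrete illustration of this flow, here is a minimal Python sketch of the scan loop. It is a structural sketch only: recover_fn is a hypothetical stand-in for the trained CNN_m to DNN_{m,n} mapping of one transmission path, integer-sample shifting (with wrap-around via np.roll) approximates the delay compensation, and the sampling rate and speed of sound are assumed values, not taken from the patent.

```python
# Structural sketch of the scanning localization flow (assumptions noted above).
import numpy as np

def localize(x, mic_pos, scan_pos, recover_fn, fs=48000, v=343.0):
    """x: (M, L) one frame of microphone signals; mic_pos: (M, 3);
    scan_pos: (N, 3); recover_fn(m, n, sig) -> estimated source signal."""
    M, L = x.shape
    N = scan_pos.shape[0]
    # Delay matrix tau[m, n] = distance(m, n) / speed of sound (precomputable).
    tau = np.linalg.norm(mic_pos[:, None, :] - scan_pos[None, :, :], axis=-1) / v
    scorr = np.zeros(N)
    for n in range(N):
        # Compensate each channel by its delay (integer-sample, wrap-around roll).
        shifts = np.round(tau[:, n] * fs).astype(int)
        x_comp = np.stack([np.roll(x[m], -shifts[m]) for m in range(M)])
        # Recover the source signal seen through each transmission path.
        s_hat = np.stack([recover_fn(m, n, x_comp[m]) for m in range(M)])
        # Inter-channel consistency: sum of pairwise correlation coefficients.
        for i in range(M):
            for j in range(i + 1, M):
                scorr[n] += np.corrcoef(s_hat[i], s_hat[j])[0, 1]
    return int(np.argmax(scorr))  # index of the estimated source position
```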
(2) Training of models
Clearly, the delay information is prior information: it can be computed in advance by dividing the distance by the speed of sound, or it can be obtained by cross-correlation. The neural network model therefore does not need to acquire and learn this known information; it is computed directly and compensated in advance, which reduces the learning task and difficulty of the neural network.
In addition, during training, the MSE and CE loss functions are used jointly. Notably, they apply at different places. MSE is used at the output of the DNN model for the regression task of mapping the received microphone signals to the estimated sound source signals; the supervision signal there is the frame-level sound source signal. CE is used at the output of the consistency detection: first, a softmax function normalizes the correlation coefficient sums into probabilities of the different positions, yielding a probability vector; then the CE loss term is obtained from this probability vector and the one-hot supervision vector according to the CE formula. With N possible positions, the supervision signal is an N-dimensional one-hot vector. The two terms are added in a certain proportion to obtain the final loss value.
The specific training steps are as follows.
The method comprises the following steps: for a set sound source, a microphone array is used for collecting multi-channel time domain signals of the sound source, and the position of the sound source is obtained.
Step two: and performing time delay compensation operation on the time domain signal of the sound source.
Step three: train the DNN model of the transmission path corresponding to the sound source position using the obtained delay-compensated signals and the MSE loss function; in this step the models of the other sound source positions are not considered, and neither the consistency calculation nor the estimation of the sound source position is involved.
Step four: repeat the above operations on the training data acquired at different sound source positions; the models of the transmission paths corresponding to different sound source positions are independent, and the models of the different transmission paths are trained separately until the network converges, i.e., the MSE no longer decreases.
Step five: after step four converges, the model is trained jointly using MSE and CE; through this step the end-to-end model is realized. In this step the models of the different transmission paths are no longer trained separately: the sum of the estimated mean square errors of all transmission paths is taken as the MSE loss term, the inter-channel consistency detection module is added, and the CE loss term calculated as described above is added in a certain proportion, giving the final loss value

$$\mathrm{Loss} = \mathrm{MSE} + \alpha \times \mathrm{CE}.$$
Compared with the prior art, the invention has the following positive effects:
the invention combines two loss functions of Mean Square Error (MSE) and Cross Entropy (CE) to train the model. And the positioning performance of the model is improved.
The invention introduces a multi-task learning mechanism, uses a Convolutional Neural Network (CNN) as a shared hidden layer, extracts common characteristics and improves the positioning performance of the model.
The invention uses the multi-channel time-domain signal collected directly by the microphone array as the input of the neural network, the output of the model is the sound source position, and an end-to-end algorithmic framework is constructed.
Drawings
FIG. 1 is a basic block diagram of an end-to-end sound source localization algorithm based on multitask learning;
FIG. 2 is a schematic diagram of a CNN model used in the present invention;
FIG. 3 is a schematic diagram of the DNN model used in the present invention;
FIG. 4 is a schematic diagram of a ball model and microphone distribution used in the present invention;
FIG. 5 is a plot of the localization performance of the proposed method and the baselines at different SNRs.
Detailed Description
Preferred embodiments of the present invention are described in more detail below with reference to the accompanying drawings. Fig. 1 shows the basic framework of the end-to-end sound source localization algorithm based on multi-task learning according to the present invention. The implementation steps of the method comprise calculating the delay, inputting the time-domain signals, compensating the delay, CNN feature extraction, DNN signal recovery, inter-channel consistency calculation, and estimation of the target sound source position. Each step is implemented as follows:
1. calculating time delay
In the scanning method, the positions to be scanned and the distribution of the microphone array are known, so the delay can be computed from the known scanning positions and microphone positions; it too is known information, and can be calculated in advance and stored as a delay matrix for direct use later. Specifically, the distance between a scanning position and a microphone position is computed, and the delay of the sound signal transmitted from the scanning position to the microphone position is then obtained from the speed of sound:

$$\tau_{m,n} = \frac{d_{m,n}}{v}$$

where $\tau_{m,n}$ denotes the delay between the nth scanning position and the mth microphone, $d_{m,n}$ is the distance from the nth scanning position to the mth microphone, and $v$ is the speed of sound.
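As a concrete illustration, a minimal Python sketch of this precomputation follows; the array shapes and the numeric value of the speed of sound are assumptions for the example, not values quoted from the patent.

```python
# Sketch: precompute the delay matrix tau with tau[m, n] = d[m, n] / v.
import numpy as np

V_SOUND = 343.0  # assumed speed of sound in m/s

def delay_matrix(mic_pos, scan_pos, v=V_SOUND):
    """mic_pos: (M, 3) microphone coordinates in meters;
    scan_pos: (N, 3) candidate source coordinates. Returns (M, N) delays in s."""
    d = np.linalg.norm(mic_pos[:, None, :] - scan_pos[None, :, :], axis=-1)
    return d / v
```

For the experimental setup described below (six microphones, 72 scan directions), mic_pos would have shape (6, 3) and scan_pos shape (72, 3).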
2. Input time domain signal
The input of the neural network is a multi-channel frame-level time-domain signal. For example, with M microphones and a frame length of L sampling points, the input consists of M frames of time-domain signals, M × L sampling points in total; the M channels are initially processed independently.
3. Compensating for time delay
During each scan, the delay from each microphone to the scanning position is available from step 1; for a given candidate position, each channel of the multi-channel microphone signal is therefore compensated by its corresponding delay:

$$\tilde{x}_{m,n}(t) = x_m(t + \tau_{m,n}), \quad m = 1, \dots, M, \; n = 1, \dots, N$$

where M denotes the number of microphones, N denotes the number of scanning positions, $x_m$ denotes the frame-level time-domain signal acquired by the mth microphone, and $\tilde{x}_{m,n}(t)$ denotes the time-domain signal after compensating $x_m$ with the delay between the nth scan position and the mth microphone.
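A hedged sketch of this per-channel compensation at the frame level follows; rounding the delay to an integer number of samples and zero-padding the frame tail are implementation assumptions made here for simplicity (a fractional-delay filter would be more faithful).

```python
# Sketch: delay-compensate one frame for one scan position, x_tilde[t] = x[t + k].
import numpy as np

def compensate(x, tau, fs=48000):
    """x: (M, L) frame of microphone signals; tau: (M,) delays in seconds
    for one scan position. Returns the compensated frame of the same shape."""
    M, L = x.shape
    out = np.zeros_like(x)
    for m in range(M):
        k = int(round(tau[m] * fs))   # delay in samples for channel m
        out[m, :L - k] = x[m, k:]     # advance channel m by k samples
    return out
```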
CNN extraction features
Each delay-compensated time-domain signal is input into the corresponding CNN model, which serves as a shared hidden layer in the whole network to extract features and learn common transmission characteristics. Experiments showed that pooling brings no benefit, so the CNN here uses only one-dimensional convolution layers in the time domain:

$$h_{m,n}(t) = \mathrm{CNN}_m\big(\tilde{x}_{m,n}(t)\big)$$

where $\mathrm{CNN}_m$ denotes the CNN model corresponding to the mth microphone (its structure is shown in fig. 2), and $h_{m,n}(t)$ denotes the hidden-layer output, i.e., the extracted features, obtained when the signal $\tilde{x}_{m,n}(t)$ is input into the corresponding $\mathrm{CNN}_m$ model.
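The following PyTorch sketch matches the architecture described in the experimental setup below (three 1-D convolution layers with 8, 4 and 1 kernels of size 2, output length kept equal to the frame length); the causal left-padding and the tanh activations are assumptions, since the patent does not specify them.

```python
# Sketch of the shared feature extractor CNN_m (assumed padding/activations).
import torch
import torch.nn as nn

class SharedCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConstantPad1d((1, 0), 0.0), nn.Conv1d(1, 8, kernel_size=2), nn.Tanh(),
            nn.ConstantPad1d((1, 0), 0.0), nn.Conv1d(8, 4, kernel_size=2), nn.Tanh(),
            nn.ConstantPad1d((1, 0), 0.0), nn.Conv1d(4, 1, kernel_size=2),
        )

    def forward(self, x):    # x: (batch, 1, L) delay-compensated frame
        return self.net(x)   # h: (batch, 1, L) extracted features
```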
DNN recovery Signal
The deep neural network (i.e., fully-connected network, DNN) follows the shared CNN hidden layer. The DNN models of the different positions are independent of each other: different positions in fact learn different transmission characteristics, which amount to different tasks, so the DNN model is the part that models each task separately, and the output of the DNN is the estimated multi-channel sound source signal. The compensated delay, the CNN model and the DNN model together form a mapping model from the collected signal to the sound source signal, i.e., the mapping recovers the phase and amplitude variations introduced during propagation. Since the DNN model serves a regression task, it is trained with the MSE loss function; its last layer is a regression layer, and no activation function is used:

$$\hat{s}_{m,n}(t) = \mathrm{DNN}_{m,n}\big(h_{m,n}(t)\big)$$

where $\hat{s}_{m,n}(t)$ is the estimated sound source signal of the mth microphone at the nth scanning position, and $\mathrm{DNN}_{m,n}$ denotes the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position (its structure is shown in fig. 3).
5. Computing inter-channel coherence
For a given scanning position, the multi-channel original signals can be recovered, and the sum of the cross-correlation coefficients of the recovered multi-channel signals is computed as the index of inter-channel consistency:

$$\mathrm{SCorr}(n) = \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \rho\big(\hat{s}_{i,n}(t), \hat{s}_{j,n}(t)\big)$$

where $\rho\big(\hat{s}_{i,n}(t), \hat{s}_{j,n}(t)\big)$ is the correlation coefficient of the signals $\hat{s}_{i,n}(t)$ and $\hat{s}_{j,n}(t)$, and $\mathrm{SCorr}(n)$ represents the sum of the correlation coefficients for the nth scan position.
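A small NumPy sketch of this consistency score, assuming the pairwise correlation-coefficient sum written above:

```python
# Sketch: inter-channel consistency as the sum of pairwise correlation coefficients.
import numpy as np

def scorr(s_hat):
    """s_hat: (M, T) recovered source signals for one scan position."""
    r = np.corrcoef(s_hat)                     # (M, M) correlation-coefficient matrix
    iu = np.triu_indices(s_hat.shape[0], k=1)  # indices of the distinct pairs
    return r[iu].sum()
```

The position estimate of the next step is then simply np.argmax([scorr(s) for s in s_hat_all]) over the N scan positions.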
6. Estimating a sound source position
If the scanned position coincides with the true position, the estimated multi-channel sound source signals should be consistent and their correlation coefficient sum is maximal. Therefore, when estimating the sound source position, the position with the maximum correlation coefficient sum is selected as the estimate:

$$\hat{n} = \arg\max_{n} \; \mathrm{SCorr}(n)$$
7. Training of models
For the training of the model described above, the Adam algorithm is used throughout (other known optimizers may also be adopted). First, the models of the different transmission paths are trained separately with the MSE loss function until the network converges. The MSE of a single model is

$$\mathrm{MSE}_{m,n} = \frac{1}{I} \sum_{i=1}^{I} \frac{1}{T} \sum_{t=1}^{T} \big(\hat{s}_{m,n,i}(t) - s_{m,n,i}(t)\big)^2$$

where $\hat{s}_{m,n,i}(t)$ is the estimated sound source signal of the ith frame for the mth microphone at the nth scanning position, $s_{m,n,i}(t)$ is the source signal of the corresponding frame, i.e., the supervision signal, T is the number of sampling points of one frame signal, and I is the total number of frames in a mini-batch.
On this basis, the cross-entropy function is introduced:

$$\mathrm{CE} = -\frac{1}{I} \sum_{i=1}^{I} \sum_{n=1}^{N} Y_{n,i} \log P_{n,i}$$

where $P_{n,i}$ is the probability obtained by normalizing the correlation coefficient sum $\mathrm{SCorr}(n)$ of the ith frame with the softmax function, $Y_{n,i}$ is the supervision signal corresponding to that frame and scan position, and I is the total number of frames in a mini-batch.
Combining MSE and CE and adding them in a certain proportion yields the final loss function:

$$\mathrm{Loss} = \sum_{n=1}^{N} \sum_{m=1}^{M} \mathrm{MSE}_{m,n} + \alpha \cdot \mathrm{CE}$$

where α is a set scale factor.
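A hedged PyTorch sketch of this joint objective is given below; the tensor layout is an assumption, and F.cross_entropy is used because it applies exactly the softmax normalization over positions followed by the one-hot cross entropy described above.

```python
# Sketch: joint loss = sum of per-path MSEs + alpha * CE (assumed tensor layout).
import torch
import torch.nn.functional as F

def joint_loss(s_hat, s_true, scorr, target_idx, alpha=0.01):
    """s_hat, s_true: (I, N, M, T) estimated / reference frame-level signals;
    scorr: (I, N) correlation-coefficient sums; target_idx: (I,) true position."""
    # Mean over time and mini-batch, summed over all N*M transmission paths.
    mse = ((s_hat - s_true) ** 2).mean(dim=-1).mean(dim=0).sum()
    # Softmax over scan positions + cross entropy against the one-hot target.
    ce = F.cross_entropy(scorr, target_idx)
    return mse + alpha * ce
```

The proportion alpha = 0.01 matches the CE weight reported in the experimental setup below.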
The advantages of the invention are illustrated below with reference to specific embodiments.
The invention uses transfer functions to generate simulated signals and tests the localization performance of the algorithm on these signals under different signal-to-noise ratio conditions. The proposed model trained with MSE only is denoted Shared-CNN-MSE, and the model trained jointly with MSE and CE is denoted Shared-CNN-MSE-CE. In addition, the experiments use SRP-PHAT, MUSIC and another neural-network-based approach (DNN) as baselines. The sound source signals are Gaussian white noise, and the signal-to-noise ratio varies from -10 dB to 5 dB.
1. Microphone array
The microphone array used in this experiment consists of six microphones uniformly distributed on the horizontal circle of a sphere with a radius of 0.0875 m, i.e., adjacent microphones are 60 degrees apart, as shown in fig. 4. The sound source is 3 meters from the center of the sphere, and the azimuth ranges from 0 to 360 degrees with a resolution of 5 degrees, giving 72 directions in total. The corresponding transfer functions are calculated from the spherical model given by Duda et al.
2. Signal emulation
The experiment generates the simulated signals by convolving the sound source with the transfer functions; the source signal is Gaussian white noise. Additional noise is added to each channel of the simulated signal according to the desired signal-to-noise ratio, with the noise independent across channels; the sampling rate of the signal is 48 kHz and the frame length is set to 1024 sampling points. Under each condition (sound source position and signal-to-noise ratio), the localization results of the proposed method and of the baseline methods are recorded.
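A brief sketch of this noise-mixing step, assuming the standard per-channel SNR scaling (the formula is conventional, not quoted from the patent):

```python
# Sketch: add channel-independent white noise at a target SNR in dB.
import numpy as np

def add_noise(x, snr_db, rng=np.random.default_rng(0)):
    """x: (M, L) clean simulated microphone signals; returns a noisy copy."""
    noise = rng.standard_normal(x.shape)
    p_sig = np.mean(x ** 2, axis=1, keepdims=True)      # per-channel signal power
    p_noise = np.mean(noise ** 2, axis=1, keepdims=True)
    noise *= np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return x + noise
```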
3. Neural network setup
Since the frame length in the experiments is set to 1024 sampling points and there are 6 microphones, the input of the CNN is 6 × 1024 time-domain sampling points. The CNN uses three one-dimensional convolution layers with 8, 4 and 1 convolution kernels respectively; the kernel size is 2, and the length after each convolution remains equal to the frame length. The structure of a single CNN is shown in fig. 2. Each DNN model uses one fully-connected layer with an output layer of 1024 nodes; the structure of a single DNN is shown in fig. 3. Adam is selected as the optimization algorithm during training, and the proportion of the CE loss term is set to 0.01.
4. Results of the experiment
The experiment tests the localization performance of the five methods on noisy signals with signal-to-noise ratios from -10 dB to 5 dB, the noise being Gaussian white noise. Of the five methods, SRP-PHAT and MUSIC are the conventional sound source localization methods mentioned above, and DNN is a neural-network-based baseline system. Shared-CNN-MSE and Shared-CNN-MSE-CE are the algorithms proposed herein, corresponding to the two training strategies: Shared-CNN-MSE is trained using only MSE, while Shared-CNN-MSE-CE is trained using both MSE and CE.
As can be seen from fig. 5, when the signal-to-noise ratio drops below -1 dB, the performance of MUSIC and SRP-PHAT degrades, while the neural-network-based methods show better noise robustness because the neural network models make full use of both time-difference and amplitude-difference information. If the average error angle is fixed at 30 degrees, the SNRs reached by SRP-PHAT, MUSIC, DNN, Shared-CNN-MSE and Shared-CNN-MSE-CE are -2 dB, -3 dB, -5 dB, -6 dB and -7 dB, respectively; compared with SRP-PHAT, Shared-CNN-MSE and Shared-CNN-MSE-CE thus improve by 4 dB and 5 dB. Compared with the DNN method, Shared-CNN-MSE achieves a lower average error angle at the same signal-to-noise ratio, which shows that multi-task learning contributes to the performance: the shared CNN hidden layer extracts more robust features, and learning each position helps the learning of the other positions. On this basis, training jointly with MSE and CE improves Shared-CNN-MSE-CE over Shared-CNN-MSE, demonstrating the effectiveness of this training strategy: the CE loss term fully exploits the mutual-exclusion relation among the different positions.
The invention provides an end-to-end sound source localization algorithm based on multi-task learning. Following the idea of multi-task learning, a CNN is used as a shared hidden layer to extract features, and MSE and CE are used jointly as the loss function to train the model. Compared with other methods, the proposed method is more robust to noise and localizes better at low signal-to-noise ratios. Moreover, the method is an adaptive model that supports online learning: the more training data, the higher the localization accuracy. Compared with previous neural-network-based methods, it uses multi-task learning and models the problem from a signal-processing perspective, making it more interpretable: here the neural network is not a black box but is used to recover the time differences and amplitude differences of the received microphone signals.
Although specific embodiments of the invention have been disclosed for illustrative purposes together with the accompanying drawings, which are included to aid the understanding of the invention, those skilled in the art will appreciate that various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, the invention should not be limited to the disclosure of the preferred embodiments and the accompanying drawings.

Claims (8)

1. An end-to-end sound source positioning method based on multitask learning comprises the following steps:
1) for each sound source position to be scanned, calculating the time delay of sound signals transmitted from the sound source position to each microphone position in the microphone array;
2) performing corresponding delay compensation on the multi-channel frame-level time domain signals collected by each microphone during each scanning of the microphone array according to the delay;
3) inputting each delay-compensated time-domain signal into the corresponding CNN model for feature extraction and feeding the extracted features into a deep neural network; the deep neural network includes a plurality of DNN models, and CNN_m inputs its extracted features into DNN_{m,n}; wherein CNN_m is the CNN model for the mth microphone, and DNN_{m,n} denotes the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position; m = 1, ..., M and n = 1, ..., N, where M denotes the number of microphones and N the number of scanning positions;
4) the deep neural network estimates a multi-channel sound source signal of each scanning position according to the characteristics extracted by each CNN model;
5) for each scanning position, calculating the cross-correlation coefficient sum of the multi-channel sound source signals corresponding to the scanning position, and selecting the position with the maximum correlation coefficient sum as the sound source position;
the method for obtaining the DNN models by training comprises the following steps: 31) for each set sound source, collecting a multi-channel time-domain signal of the sound source with the microphone array and acquiring the position of the sound source; then performing delay compensation on the time-domain signals of the sound source, and training the DNN model of the transmission path corresponding to the sound source position based on the delay-compensated signals and an MSE loss function until a convergence condition is reached; the models of the different transmission paths are trained separately based on the MSE loss function until the network converges; the MSE of a single model is

$$\mathrm{MSE}_{m,n} = \frac{1}{I} \sum_{i=1}^{I} \frac{1}{T} \sum_{t=1}^{T} \big(\hat{s}_{m,n,i}(t) - s_{m,n,i}(t)\big)^2$$

where $\hat{s}_{m,n,i}(t)$ is the estimated sound source signal of the ith frame for the mth microphone at the nth scanning position, $s_{m,n,i}(t)$ is the source signal of the corresponding frame, T is the number of sampling points of one frame signal, and I is the total number of frames in a mini-batch; 32) on the basis of this convergence, a cross-entropy function is introduced:

$$\mathrm{CE} = -\frac{1}{I} \sum_{i=1}^{I} \sum_{n=1}^{N} Y_{n,i} \log P_{n,i}$$

where $P_{n,i}$ is the probability obtained by normalizing the correlation coefficient sum $\mathrm{SCorr}(n)$ of the ith frame with the softmax function, and $Y_{n,i}$ is the supervision signal corresponding to that frame and scanning position; the final loss function is formed by combining the MSE and CE losses and adding them in proportion: first, the estimated mean square error of the transmission path of each set sound source position is calculated, and these errors are summed as the MSE loss term of the loss function; the CE loss term is obtained by adding an inter-channel consistency detection module, normalizing the correlation coefficient sums into probabilities of the different positions with a softmax function to obtain a probability vector, and evaluating this probability vector against the one-hot supervision vector according to the CE formula; the sum

$$\mathrm{Loss} = \sum_{n=1}^{N} \sum_{m=1}^{M} \mathrm{MSE}_{m,n} + \alpha \cdot \mathrm{CE}$$

constitutes the loss function of the final DNN model, where α is a set scale factor.
2. The method of claim 1, wherein the delay compensation of the time-domain signals is performed by

$$\tilde{x}_{m,n}(t) = x_m(t + \tau_{m,n})$$

where $x_m$ denotes the frame-level time-domain signal collected by the mth microphone, $\tilde{x}_{m,n}(t)$ denotes the time-domain signal after compensating $x_m$ with the delay between the nth scan position and the mth microphone, and $\tau_{m,n}$ denotes the delay between the nth scan position and the mth microphone.
3. The method of claim 1, wherein the CNN model performs feature extraction on the input time-domain signal using one-dimensional convolutional layers in the time domain.
4. The method as claimed in claim 1, wherein

$$\hat{s}_{m,n}(t) = \mathrm{DNN}_{m,n}\big(\mathrm{CNN}_m(\tilde{x}_{m,n}(t))\big)$$

is adopted to estimate the multi-channel sound source signals, where $\hat{s}_{m,n}(t)$ is the sound source signal of the mth microphone at the nth sound source position.
5. An end-to-end sound source positioning system based on multitask learning is characterized by comprising a delay calculation module, a delay compensation module, a CNN (CNN) model, a deep neural network and a target sound source position estimation module; wherein
The delay calculation module is used for calculating the delay of sound signals transmitted from the sound source position to each microphone position in the microphone array for each sound source position to be scanned;
the delay compensation module is used for carrying out corresponding delay compensation on the multi-channel frame-level time domain signals collected by each microphone during each scanning of the microphone array according to the delay;
the CNN model is used for performing feature extraction on the input delay-compensated time-domain signal and inputting the extracted features into a deep neural network; the method for obtaining the DNN models by training comprises the following steps: a) for each set sound source, collecting a multi-channel time-domain signal of the sound source with the microphone array and acquiring the position of the sound source; then performing delay compensation on the time-domain signals of the sound source, and training the DNN model of the transmission path corresponding to the sound source position based on the delay-compensated signals and an MSE loss function until a convergence condition is reached; the models of the different transmission paths are trained separately based on the MSE loss function until the network converges; the MSE of a single model is

$$\mathrm{MSE}_{m,n} = \frac{1}{I} \sum_{i=1}^{I} \frac{1}{T} \sum_{t=1}^{T} \big(\hat{s}_{m,n,i}(t) - s_{m,n,i}(t)\big)^2$$

where $\hat{s}_{m,n,i}(t)$ is the estimated sound source signal of the ith frame for the mth microphone at the nth scanning position, $s_{m,n,i}(t)$ is the source signal of the corresponding frame, T is the number of sampling points of one frame signal, and I is the total number of frames in a mini-batch; b) on the basis of this convergence, a cross-entropy function is introduced:

$$\mathrm{CE} = -\frac{1}{I} \sum_{i=1}^{I} \sum_{n=1}^{N} Y_{n,i} \log P_{n,i}$$

where $P_{n,i}$ is the probability obtained by normalizing the correlation coefficient sum $\mathrm{SCorr}(n)$ of the ith frame with the softmax function, and $Y_{n,i}$ is the supervision signal corresponding to that frame and scanning position; the final loss function is formed by combining the MSE and CE losses and adding them in proportion: first, the estimated mean square error of the transmission path of each set sound source position is calculated, and these errors are summed as the MSE loss term of the loss function; the CE loss term is obtained by adding an inter-channel consistency detection module, normalizing the correlation coefficient sums into probabilities of the different positions with a softmax function to obtain a probability vector, and evaluating this probability vector against the one-hot supervision vector according to the CE formula; these terms add to form the loss function of the final DNN model

$$\mathrm{Loss} = \sum_{n=1}^{N} \sum_{m=1}^{M} \mathrm{MSE}_{m,n} + \alpha \cdot \mathrm{CE}$$

where α is a set scale factor;
the deep neural network is used for estimating the multi-channel sound source signal of each scanning position from the extracted features; the deep neural network comprises a plurality of DNN models, and CNN_m inputs its extracted features into DNN_{m,n}; CNN_m is the CNN model for the mth microphone, and DNN_{m,n} denotes the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position; m = 1, ..., M and n = 1, ..., N, where M denotes the number of microphones and N the number of scanning positions;
and the target sound source position estimation module is used for calculating the cross-correlation coefficient sum of the multi-channel sound source signals corresponding to each scanning position, and selecting the position with the maximum correlation coefficient sum as the sound source position.
6. The system of claim 5, wherein the delay compensation module performs delay compensation on the time-domain signals according to

$$\tilde{x}_{m,n}(t) = x_m(t + \tau_{m,n})$$

where $x_m$ denotes the frame-level time-domain signal acquired by the mth microphone, $\tilde{x}_{m,n}(t)$ denotes the time-domain signal after compensating $x_m$ with the delay between the nth scan position and the mth microphone, and $\tau_{m,n}$ denotes the delay between the nth scan position and the mth microphone.
7. The system of claim 5, wherein the deep neural network employs

$$\hat{s}_{m,n}(t) = \mathrm{DNN}_{m,n}\big(\mathrm{CNN}_m(\tilde{x}_{m,n}(t))\big)$$

to estimate the multi-channel sound source signals, where $\hat{s}_{m,n}(t)$ is the sound source signal of the mth microphone at the nth sound source position.
8. The system of claim 5, wherein the CNN model performs feature extraction on an input time-domain signal using one-dimensional convolutional layers in the time domain.
CN201910043338.8A 2019-01-17 2019-01-17 End-to-end sound source positioning method and system based on multi-task learning Active CN109782231B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910043338.8A CN109782231B (en) 2019-01-17 2019-01-17 End-to-end sound source positioning method and system based on multi-task learning

Publications (2)

Publication Number Publication Date
CN109782231A (en) 2019-05-21
CN109782231B (en) 2020-11-20

Family

ID=66500851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910043338.8A Active CN109782231B (en) 2019-01-17 2019-01-17 End-to-end sound source positioning method and system based on multi-task learning

Country Status (1)

Country Link
CN (1) CN109782231B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161757B (en) * 2019-12-27 2021-09-03 镁佳(北京)科技有限公司 Sound source positioning method and device, readable storage medium and electronic equipment
CN111859241B (en) * 2020-06-01 2022-05-03 北京大学 Unsupervised sound source orientation method based on sound transfer function learning
CN111694433B (en) * 2020-06-11 2023-06-20 阿波罗智联(北京)科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN112731086A (en) * 2021-01-19 2021-04-30 国网上海能源互联网研究院有限公司 Method and system for comprehensively inspecting electric power equipment
CN113138363A (en) * 2021-04-22 2021-07-20 苏州臻迪智能科技有限公司 Sound source positioning method and device, storage medium and electronic equipment
CN113835065B (en) * 2021-09-01 2024-05-17 深圳壹秘科技有限公司 Sound source direction determining method, device, equipment and medium based on deep learning
CN114489321B (en) * 2021-12-13 2024-04-09 广州大鱼创福科技有限公司 Steady-state visual evoked potential target recognition method based on multi-task deep learning
CN117368847B (en) * 2023-12-07 2024-03-15 深圳市好兄弟电子有限公司 Positioning method and system based on microphone radio frequency communication network
CN118112501B (en) * 2024-04-28 2024-07-26 杭州爱华智能科技有限公司 Sound source positioning method and device suitable for periodic signals and sound source measuring device

Citations (2)

Publication number Priority date Publication date Assignee Title
CN108197697A (en) * 2017-12-29 2018-06-22 汕头大学 A kind of dynamic method for resampling of trained deep neural network
CN108924836A (en) * 2018-07-04 2018-11-30 南方电网科学研究院有限责任公司 Edge side physical layer channel authentication method based on deep neural network

Family Cites Families (12)

Publication number Priority date Publication date Assignee Title
CN102103200B (en) * 2010-11-29 2012-12-05 清华大学 Acoustic source spatial positioning method for distributed asynchronous acoustic sensor
US9753119B1 (en) * 2014-01-29 2017-09-05 Amazon Technologies, Inc. Audio and depth based sound source localization
CN107144818A (en) * 2017-03-21 2017-09-08 北京大学深圳研究生院 Binaural sound sources localization method based on two-way ears matched filter Weighted Fusion
CN108305641B (en) * 2017-06-30 2020-04-07 腾讯科技(深圳)有限公司 Method and device for determining emotion information
CN107703486B (en) * 2017-08-23 2021-03-23 南京邮电大学 Sound source positioning method based on convolutional neural network CNN
CN108120436A (en) * 2017-12-18 2018-06-05 北京工业大学 Real scene navigation method in a kind of iBeacon auxiliary earth magnetism room
CN108318862B (en) * 2017-12-26 2021-08-20 北京大学 Sound source positioning method based on neural network
CN108375763B (en) * 2018-01-03 2021-08-20 北京大学 Frequency division positioning method applied to multi-sound-source environment
CN108417224B (en) * 2018-01-19 2020-09-01 苏州思必驰信息科技有限公司 Training and recognition method and system of bidirectional neural network model
CN108734733B (en) * 2018-05-17 2022-04-26 东南大学 Microphone array and binocular camera-based speaker positioning and identifying method
CN109031200A (en) * 2018-05-24 2018-12-18 华南理工大学 A kind of sound source dimensional orientation detection method based on deep learning
CN109164415B (en) * 2018-09-07 2022-09-16 东南大学 Binaural sound source positioning method based on convolutional neural network

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN108197697A (en) * 2017-12-29 2018-06-22 汕头大学 A kind of dynamic method for resampling of trained deep neural network
CN108924836A (en) * 2018-07-04 2018-11-30 南方电网科学研究院有限责任公司 Edge side physical layer channel authentication method based on deep neural network

Non-Patent Citations (1)

Title
Jean-Marc Valin, "Robust Sound Source Localization Using a Microphone Array on a Mobile Robot," Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2003-10-31, pp. 1228-1233 *

Also Published As

Publication number Publication date
CN109782231A (en) 2019-05-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant