CN109782231B - End-to-end sound source positioning method and system based on multi-task learning - Google Patents
End-to-end sound source positioning method and system based on multi-task learning

- Publication number: CN109782231B (application CN201910043338.8A)
- Authority: CN (China)
- Prior art keywords: sound source, microphone, signal, delay, model
- Prior art date: 2019-01-17
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses an end-to-end sound source positioning method and system based on multi-task learning. The method comprises the following steps: 1) for each sound source position to be scanned, calculating the time delay of the sound signal transmitted from that sound source position to each microphone position; 2) according to these delays, performing the corresponding delay compensation on the multi-channel frame-level time domain signals collected by the microphones during each scan of the microphone array; 3) inputting each delay-compensated time domain signal into the corresponding CNN model for feature extraction, and feeding the extracted features into a deep neural network; 4) the deep neural network estimating a multi-channel sound source signal for each scanning position from the features extracted by the CNN models; 5) for each scanning position, calculating the sum of the cross-correlation coefficients of the multi-channel sound source signals corresponding to that position, and selecting the position with the maximum correlation coefficient sum as the sound source position. The method extracts suitable features automatically, introduces a multi-task learning mechanism, and improves the positioning performance of the model.
Description
Technical Field
The invention belongs to the technical field of array signal processing, relates to a microphone array and a sound source positioning method, and particularly relates to an end-to-end sound source positioning method and system based on multi-task learning.
Background
With the development of artificial intelligence technology, machine hearing has received a great deal of attention, and many related techniques and research fields have emerged. Sound source localization is a basic and important technology in machine auditory systems; in essence it simulates the function of human ears, collecting sound signals with a microphone array in order to determine the position of the sounding object. Sound source localization can be applied on its own in many fields, such as video conferencing and identification of honking vehicles, and it can also supply position information to other technologies, such as speech enhancement. Improving the accuracy of sound source localization algorithms therefore benefits many applications directly and, to a certain extent, promotes and supports the development of these other technologies.
According to the positioning principle, sound source localization techniques can be roughly divided into five categories: methods based on time difference of arrival (TDOA) estimation, on high-resolution spectral estimation, on steerable beamforming, on transfer functions, and on neural networks.
Methods based on TDOA estimation first estimate the time differences between the sound signals arriving at different microphones, and then infer the sound source position from these differences and the spatial geometry of the array. Splitting localization into two steps in this way invites error propagation: an inaccurate TDOA estimate carries its error into the second step. Since the TDOA is difficult to estimate accurately, the localization accuracy of these methods is limited.
Methods based on high-resolution spectral estimation include multiple signal classification (MUSIC) and minimum variance spectral estimation (MVM). They form a covariance matrix from the signals collected by the microphone array and perform eigenvalue decomposition (EVD) on it, obtaining a signal subspace corresponding to the signal components and a noise subspace corresponding to the noise components, from which the target azimuth is estimated. These methods offer high spatial resolution but perform poorly under reverberation: reverberant noise is directional and homologous with the signal, so the two are strongly correlated, and the sound source position determined by eigen-decomposition is easily misjudged.
Methods based on steerable beamforming are scanning methods that examine all possible sound source positions one by one. For each scanning position, a beam is formed by delay-compensating the signals collected by the microphone array; the output power of the formed beam is computed, and the position with the maximum output power is selected as the estimated sound source position. A typical algorithm is the steered response power with phase transform weighting (SRP-PHAT). These methods consider only the time-difference information, ignore the amplitude-difference information, and are easily disturbed by noise under high reverberation and low signal-to-noise ratio.
Methods based on transfer functions are also scanning methods; they measure the transmission characteristic, i.e. the transfer function, of the sound signal from each sound source position to each microphone. The multi-channel source signals are restored by inverse-filtering the signals collected by the microphone array, i.e. their time differences and intensity differences are recovered; correlation detection is then performed on the restored multi-channel source signals, and the position with the maximum correlation is selected as the sound source position. These methods exploit both the TDOA and the intensity-difference localization cues, but they require the transfer functions to be measured in practice and cannot be used in scenarios where measurement is impossible. Furthermore, under low signal-to-noise ratio and high reverberation an accurate transfer function can hardly be measured, so the measured transfer function lacks robustness and the localization performance is poor. Finally, a measured transfer function is strongly tied to its environment and is difficult to transfer to other environments.
In recent years, scholars' studies have mainly focused on neural-network-based sound source localization algorithms. These studies basically treat the neural network as a black box, varying mainly three modules: the input features, the output contents and the neural network structure. Most methods need to extract features such as the magnitude spectrum and phase spectrum in advance as the input of the network, then determine the type of network output node, such as azimuth or distance, and finally use the neural network to learn the mapping from the features to the azimuth or distance. Neural network methods are very convenient for modeling: iterative learning over a large amount of data yields a classifier with good performance. However, well-performing hand-crafted features must be designed and screened in advance, and such methods struggle to learn a universal mapping from features to source position, i.e. the classifier's actual localization performance and robustness are insufficient.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides an end-to-end sound source positioning method and system based on multi-task learning. The neural network is used for modeling from the perspective of signal processing, giving the neural network model better interpretability and trainability. In the invention, the multi-channel time domain signals acquired by the microphone array are used directly as the input of the neural network, the output of the model is the sound source position, an end-to-end algorithm framework is constructed, and the model is trained by combining two loss functions, mean square error (MSE) and cross entropy (CE). In addition, a convolutional neural network (CNN) module enables the model to extract suitable features automatically, and a multi-task learning mechanism is introduced to improve the positioning performance of the model.
The basic idea of the proposed end-to-end sound source positioning algorithm based on multi-task learning is to learn, from a large amount of data in a deep-learning manner, how the phase and amplitude of a sound signal change during transmission owing to scatterers, the environment and so on. The model can then recover the original phase and amplitude of the acquired multi-channel time domain signals, and the sound source is finally localized by combining the two localization cues of time difference and amplitude difference. The important innovation of the invention is the introduction of multi-task learning and an end-to-end algorithm framework, which significantly improves noise robustness.
The technical scheme of the invention is as follows:
an end-to-end sound source positioning method based on multitask learning comprises the following steps:
1) for each sound source position to be scanned, calculating the time delay of sound signals transmitted from the sound source position to each microphone position in the microphone array;
2) performing corresponding delay compensation on the multi-channel frame-level time domain signals collected by each microphone during each scanning of the microphone array according to the delay;
3) inputting each delay-compensated time domain signal into the corresponding CNN model for feature extraction and inputting the extracted features into a deep neural network; the deep neural network includes a plurality of DNN models, and $\mathrm{CNN}_m$ inputs its extracted features into $\mathrm{DNN}_{m,n}$; $\mathrm{CNN}_m$ is the CNN model for the mth microphone, and $\mathrm{DNN}_{m,n}$ is the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position; m = 1, ..., M; n = 1, ..., N; M denotes the number of microphones and N the number of scanning positions;
4) the deep neural network estimates a multi-channel sound source signal of each scanning position according to the characteristics extracted by each CNN model;
5) and for each scanning position, calculating the cross-correlation coefficient sum of the multi-channel sound source signals corresponding to the scanning position, and selecting the position with the maximum correlation coefficient sum as the sound source position.
Further, the time domain signals are delay-compensated according to $\tilde{x}_{m,n}(t) = x_m(t + \tau_{m,n})$, where $x_m$ denotes the frame-level time domain signal acquired by the mth microphone, $\tilde{x}_{m,n}$ denotes the time domain signal after compensating $x_m$ with the delay between the nth scanning position and the mth microphone, and $\tau_{m,n}$ denotes the delay between the nth scanning position and the mth microphone.
Further, the CNN model performs feature extraction on the input time-domain signal using a one-dimensional convolution layer in the time domain.
Further, the multi-channel sound source signals are estimated according to $\hat{s}_{m,n}(t) = \mathrm{DNN}_{m,n}\big(h_{m,n}(t)\big)$, where $\hat{s}_{m,n}(t)$ is the estimated sound source signal of the mth microphone at the nth sound source position and $h_{m,n}(t)$ denotes the features extracted by $\mathrm{CNN}_m$.
Further, the method for obtaining the DNN model by training comprises the following steps:
51) for each set sound source, collecting a multi-channel time domain signal of the sound source by using a microphone array, and acquiring the position of the sound source; then, time-domain signals of the sound source are subjected to delay compensation, and a DNN model of a transmission path corresponding to the position of the sound source is trained based on the delay compensation signals and an MSE loss function until a convergence condition is reached;
52) on the basis of the convergence, combining the MSE and CE losses, added in proportion, into the final loss function: first, the estimated mean square error of the transmission path of each set sound source position is calculated, and these errors are summed to form the MSE loss term of the loss function; for the CE loss term, an inter-channel consistency detection module is added, a softmax function normalizes the correlation coefficient sums into probabilities of the different positions to obtain a probability vector, and this probability vector is combined with the one-hot supervisory vector according to the calculation formula of the CE; the two terms together constitute the loss function of the final DNN model.
An end-to-end sound source positioning system based on multi-task learning, comprising a delay calculation module, a delay compensation module, CNN models, a deep neural network and a target sound source position estimation module; wherein
The delay calculation module is used for calculating the delay of sound signals transmitted from the sound source position to each microphone position in the microphone array for each sound source position to be scanned;
the delay compensation module is used for carrying out corresponding delay compensation on the multi-channel frame-level time domain signals collected by each microphone during each scanning of the microphone array according to the delay;
the CNN model is used for extracting the characteristics of the input time domain signal after the time delay compensation and inputting the extracted characteristics into a deep neural network;
the deep neural network is used for estimating a multi-channel sound source signal for each scanning position according to the extracted features; the deep neural network comprises a plurality of DNN models, and $\mathrm{CNN}_m$ inputs its extracted features into $\mathrm{DNN}_{m,n}$; $\mathrm{CNN}_m$ is the CNN model for the mth microphone, and $\mathrm{DNN}_{m,n}$ is the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position; m = 1, ..., M; n = 1, ..., N; M denotes the number of microphones and N the number of scanning positions;
and the target sound source position estimation module is used for calculating the cross-correlation coefficient sum of the multi-channel sound source signals corresponding to each scanning position, and selecting the position with the maximum correlation coefficient sum as the sound source position.
The basic framework of the end-to-end sound source localization algorithm based on multi-task learning is shown in FIG. 1. The method is a scanning method; it is explained below from two aspects, the localization process and model training.
(1) Flow of positioning
The positioning process mainly comprises the following steps:
calculating the delay of the transmission of a sound signal from a scanning position to a microphone position can be pre-calculated for each sound source position to be scanned and each microphone and stored as a delay matrix for subsequent use.
Input time domain signal here, the input to the model is a multi-channel frame-level time domain signal.
The compensation delay calculates the delay according to the distance from each microphone to the scanning position, and then the delay is used for compensating the corresponding delay of the multi-channel acquisition signals respectively.
The CNN extraction feature CNN model is used as a shared hidden layer in the whole network to extract features and learn common transmission characteristics.
The DNN recovery signal deep neural network (namely, a fully-connected neural network, DNN) is a part which is modeled for each task respectively, and the output of DNN is an estimated multi-channel sound source signal. The compensated delay, the CNN model and the DNN model together form a mapping model of the collected signal to the sound source signal, i.e. the mapping recovers the phase and amplitude variations during propagation.
Inter-channel consistency calculation an inter-channel consistency calculation is used to measure the consistency of the estimated multi-channel sound source signal. This operation is performed by calculating correlation coefficients between multi-channel sound source signals and for subsequent localization operations. If we have N possible positions, we need to calculate N correlation coefficient sums respectively.
Estimating the sound source position assuming that the position we scan coincides with the true position, the estimated multi-channel sound source signals should be consistent, and the correlation coefficient sum is maximum. Therefore, when estimating the sound source position, we can select the correlation coefficient and the maximum position as the estimated sound source position.
(2) Training of models
Obviously, the delay information is prior information: it can be computed in advance by dividing distance by the speed of sound, or obtained by cross-correlation. Such known information does not need to be learned by the neural network model; it is calculated directly, and the delay compensation is applied in advance, which reduces the learning task and difficulty of the neural network.
In addition, during training the MSE and CE loss functions are used jointly. Notably, the two losses apply at different places. MSE is used at the output of the DNN models for the regression task that maps the received microphone signals to the estimated sound source signals; there, the supervisory signal is the frame-level sound source signal. CE is used at the output of the consistency detection: a softmax function first normalizes the correlation coefficient sums into probabilities of the different positions, yielding a probability vector, which is then combined with the one-hot supervisory vector according to the calculation formula of the CE to obtain the CE loss term. With N possible positions, the supervisory signal is an N-dimensional one-hot vector. The two terms are added in a set proportion to obtain the final loss value.
The specific training steps are as follows.
The method comprises the following steps: for a set sound source, a microphone array is used for collecting multi-channel time domain signals of the sound source, and the position of the sound source is obtained.
Step two: and performing time delay compensation operation on the time domain signal of the sound source.
Step three: and training a DNN model of the transmission path corresponding to the sound source position based on the MSE loss function by using the obtained signals after the delay compensation, wherein in the step, models of other sound source positions are not considered, and consistency calculation and estimation of the sound source position are not involved.
Step four: repeating the above operations on the training data acquired at the different sound source positions; the models of the transmission paths corresponding to different sound source positions are mutually independent, and the models of the different transmission paths are trained separately until the network converges, i.e. the MSE no longer decreases.
Step five: after step four converges, the model is jointly trained using MSE and CE; through this step the end-to-end model is realized. In this step the models of the different transmission paths are no longer trained independently: the sum of the estimated mean square errors over all transmission paths is calculated as the MSE loss term, the inter-channel consistency detection module is added, and the CE loss term computed as described above is added in a set proportion to give the final loss value; the loss function is
Loss=MSE+α×CE。
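For illustration, a schematic PyTorch sketch of the two training stages follows (the helpers `iter_frames` and `joint_step` are hypothetical placeholders for data access and for the joint update of step five, not part of the patent; the `SharedCNN`/`PathDNN` modules are sketched in the detailed description below):

```python
import torch

def train_two_stage(cnns, dnns, iter_frames, joint_step, alpha=0.01, lr=1e-3):
    """Stage 1: per-path MSE pre-training (steps one to four).
    Stage 2: joint training with Loss = MSE + alpha * CE (step five).

    cnns: list of M shared CNN modules; dnns: dict {(m, n): path DNN}."""
    for (m, n), dnn in dnns.items():
        params = list(cnns[m].parameters()) + list(dnn.parameters())
        opt = torch.optim.Adam(params, lr=lr)
        for x, s in iter_frames(m, n):               # compensated frame, source frame
            opt.zero_grad()
            s_hat = dnn(cnns[m](x).squeeze(1))       # (batch, L) estimated source signal
            torch.mean((s_hat - s) ** 2).backward()  # per-path MSE only in stage 1
            opt.step()
    # Stage 2: optimize all modules jointly on MSE + alpha * CE.
    all_params = [p for c in cnns for p in c.parameters()]
    all_params += [p for d in dnns.values() for p in d.parameters()]
    joint_step(torch.optim.Adam(all_params, lr=lr), alpha)
```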
Compared with the prior art, the invention has the following positive effects:
the invention combines two loss functions of Mean Square Error (MSE) and Cross Entropy (CE) to train the model. And the positioning performance of the model is improved.
The invention introduces a multi-task learning mechanism, uses a Convolutional Neural Network (CNN) as a shared hidden layer, extracts common characteristics and improves the positioning performance of the model.
The invention uses the multi-channel time domain signals collected directly by the microphone array as the input of the neural network, with the sound source position as the output of the model, constructing an end-to-end algorithm framework.
Drawings
FIG. 1 is a basic block diagram of an end-to-end sound source localization algorithm based on multitask learning;
FIG. 2 is a schematic diagram of a CNN model used in the present invention;
FIG. 3 is a schematic diagram of the DNN model used in the present invention;
FIG. 4 is a schematic diagram of a ball model and microphone distribution used in the present invention;
FIG. 5 is a plot of the positioning performance of the proposed methods and the baselines at different SNRs.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings of the invention. Fig. 1 is a basic block diagram of an end-to-end sound source localization algorithm based on multitask learning according to the present invention, and the specific implementation steps of the method of the present invention include calculating delay, inputting time domain signals, compensating for delay, CNN extracting features, DNN restoring signals, calculating inter-channel consistency, and estimating the position of a target sound source. The specific implementation process of each step is as follows:
1. calculating time delay
In the scanning method, the positions to be scanned and the distribution of the microphone array are known, so the delay can be computed from the known scanning positions and microphone positions; being known information, it can be calculated in advance and stored as a delay matrix for direct use later. Concretely, the distance between a scanning position and a microphone position is calculated first, and the delay of the sound signal travelling from the scanning position to the microphone position is then obtained using the speed of sound:

$$\tau_{m,n} = \frac{d_{m,n}}{v}$$

where $\tau_{m,n}$ represents the delay between the nth scanning position and the mth microphone, $d_{m,n}$ is the distance from the nth scanning position to the mth microphone, and $v$ is the speed of sound.
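For illustration, a minimal NumPy sketch of this pre-computation (the function name, array layout and 343 m/s sound speed are assumptions of the example, not specified by the patent):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for the sketch

def delay_matrix(scan_positions, mic_positions, fs=48000):
    """Precompute tau[m, n], the delay from scan position n to microphone m.

    scan_positions: (N, 3) array of candidate source positions in metres.
    mic_positions:  (M, 3) array of microphone positions in metres.
    Returns integer delays in samples at sampling rate fs (48 kHz in the
    described experiments)."""
    d = np.linalg.norm(
        mic_positions[:, None, :] - scan_positions[None, :, :], axis=-1
    )                                   # (M, N) distances d_{m,n}
    tau = d / SPEED_OF_SOUND            # tau_{m,n} = d_{m,n} / v, in seconds
    return np.round(tau * fs).astype(int)
```

The matrix depends only on the known geometry, so it is computed once and reused for every frame, exactly as the delay-matrix storage described above.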
2. Input time domain signal
The input of the neural network is a multi-channel frame-level time domain signal. For example, with M microphones and a frame length of L sampling points, the input is M frames of time domain signals, M × L sampling points in total; initially, however, the M frames are processed independently.
3. Compensating for time delay
During each scan, the delay from each microphone to the scanning position is available from step 1; for a given candidate position, the multi-channel microphone signals must therefore be compensated by their corresponding delays:

$$\tilde{x}_{m,n}(t) = x_m(t + \tau_{m,n}), \quad m = 1, \dots, M, \; n = 1, \dots, N$$

where M denotes the number of microphones, N denotes the number of scanning positions, $x_m$ denotes the frame-level time domain signal acquired by the mth microphone, and $\tilde{x}_{m,n}$ denotes the time domain signal after compensating $x_m$ with the delay between the nth scanning position and the mth microphone.
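A sketch of the compensation for one candidate position, continuing the NumPy example above (shifting by integer samples relative to the smallest delay is an implementation assumption; the patent only specifies that each channel is compensated by its corresponding delay):

```python
import numpy as np

def compensate_delays(frames, tau_samples, n):
    """Align the M frame-level signals for scan position n.

    frames:      (M, L) array, one frame per microphone.
    tau_samples: (M, N) integer delay matrix from delay_matrix().
    Each channel is advanced by its delay relative to the minimum delay,
    so the channels line up when n is the true source position."""
    rel = tau_samples[:, n] - tau_samples[:, n].min()
    M, L = frames.shape
    out = np.zeros_like(frames)
    for m in range(M):
        out[m, : L - rel[m]] = frames[m, rel[m]:]  # advance channel m by rel[m] samples
    return out
```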
CNN extraction features
The delay-compensated time domain signals are input into the corresponding CNN models, which serve as shared hidden layers in the whole network, extracting features and learning common transmission characteristics. Experiments showed that pooling brings no benefit, so only one-dimensional convolution layers in the time domain are used in the CNN:

$$h_{m,n}(t) = \mathrm{CNN}_m\big(\tilde{x}_{m,n}(t)\big)$$

where $\mathrm{CNN}_m$ denotes the CNN model corresponding to the mth microphone, whose structure is shown in FIG. 2, and $h_{m,n}(t)$ is the hidden-layer output obtained by feeding the signal $\tilde{x}_{m,n}(t)$ into $\mathrm{CNN}_m$, i.e. the extracted features.
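A minimal PyTorch sketch of one shared $\mathrm{CNN}_m$ consistent with the structure described in the experiments below (three one-dimensional convolution layers with 8, 4 and 1 kernels of size 2, output length equal to the frame length); the ReLU activations and the causal left-padding used to preserve the length are assumptions, as the patent does not specify them:

```python
import torch
import torch.nn as nn

class SharedCNN(nn.Module):
    """Shared hidden-layer feature extractor CNN_m."""
    def __init__(self):
        super().__init__()
        self.pad = nn.ConstantPad1d((1, 0), 0.0)   # one zero on the left keeps length
        self.conv1 = nn.Conv1d(1, 8, kernel_size=2)
        self.conv2 = nn.Conv1d(8, 4, kernel_size=2)
        self.conv3 = nn.Conv1d(4, 1, kernel_size=2)

    def forward(self, x):                 # x: (batch, 1, L) compensated frame
        h = torch.relu(self.conv1(self.pad(x)))
        h = torch.relu(self.conv2(self.pad(h)))
        return self.conv3(self.pad(h))    # (batch, 1, L) extracted features h_{m,n}
```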
DNN recovery Signal
The deep neural network (i.e. fully-connected neural network, DNN) follows the shared-hidden-layer CNN models. The DNN models of the individual positions are mutually independent: different positions in fact learn different transmission characteristics, which amount to different tasks, so the DNN models model each task separately, and the output of the DNNs is the estimated multi-channel sound source signal. The compensated delays, the CNN models and the DNN models together form a mapping from the collected signals to the sound source signals, i.e. the mapping recovers the phase and amplitude changes introduced during propagation. Since the DNN models perform a regression task, the last layer of each DNN is a regression layer without an activation function, and the models are trained with the MSE loss function:

$$\hat{s}_{m,n}(t) = \mathrm{DNN}_{m,n}\big(h_{m,n}(t)\big)$$

where $\hat{s}_{m,n}(t)$ is the estimated sound source signal of the mth microphone at the nth scanning position, and $\mathrm{DNN}_{m,n}$ denotes the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position, whose structure is shown in FIG. 3.
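A matching sketch of one $\mathrm{DNN}_{m,n}$, following the described setup of a single fully-connected layer with a 1024-node regression output and no activation function:

```python
import torch.nn as nn

class PathDNN(nn.Module):
    """Per-path regression model DNN_{m,n}."""
    def __init__(self, frame_len=1024):
        super().__init__()
        self.fc = nn.Linear(frame_len, frame_len)  # regression layer, no activation

    def forward(self, h):      # h: (batch, L) features from CNN_m
        return self.fc(h)      # (batch, L) estimated source signal

# usage: s_hat = PathDNN()(SharedCNN()(x).squeeze(1))   # x: (batch, 1, 1024)
```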
5. Computing inter-channel coherence
For a given scanning position, the multi-channel original signals can be recovered as above; the sum of the cross-correlation coefficients of the recovered multi-channel signals is then calculated as the index of inter-channel consistency:

$$\mathrm{SCorr}(n) = \sum_{i=1}^{M} \sum_{j=i+1}^{M} \rho\big(\hat{s}_{i,n}, \hat{s}_{j,n}\big)$$

where $\rho(\cdot,\cdot)$ denotes the correlation coefficient of two signals and $\mathrm{SCorr}(n)$ represents the sum of the correlation coefficients for the nth scanning position.
6. Estimating a sound source position
If the scanned position coincides with the true position, the estimated multi-channel sound source signals should be consistent, and the correlation coefficient sum is maximal. Therefore, when estimating the sound source position, the position with the maximum correlation coefficient sum is selected:

$$\hat{n} = \arg\max_{n} \mathrm{SCorr}(n)$$
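A NumPy sketch of steps 5 and 6 together (taking the correlation coefficient as the zero-lag Pearson coefficient between channel pairs is an assumption of this example):

```python
import numpy as np

def scorr(est):
    """Sum of pairwise correlation coefficients; est: (M, L) estimates for one position."""
    M = est.shape[0]
    z = est - est.mean(axis=1, keepdims=True)
    z /= z.std(axis=1, keepdims=True) + 1e-12      # zero-mean, unit-variance channels
    total = 0.0
    for i in range(M):
        for j in range(i + 1, M):
            total += np.mean(z[i] * z[j])          # Pearson coefficient of channels i, j
    return total

def estimate_position(est_all):
    """est_all: (N, M, L) estimated source signals for all scan positions."""
    scores = np.array([scorr(est_all[n]) for n in range(est_all.shape[0])])
    return int(np.argmax(scores)), scores          # n_hat = argmax_n SCorr(n)
```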
7. Training of models
As described above, the Adam algorithm is uniformly selected for training (other known optimizers could be adopted). First, the models of the different transmission paths are trained separately with the MSE loss function until the network converges. The single-model MSE is

$$\mathrm{MSE}_{m,n} = \frac{1}{I\,T}\sum_{i=1}^{I}\sum_{t=1}^{T}\big(\hat{s}_{m,n,i}(t) - s_{m,n,i}(t)\big)^2$$

where $\hat{s}_{m,n,i}(t)$ is the estimated sound source signal of the ith frame for the mth microphone at the nth scanning position, $s_{m,n,i}(t)$ is the source signal of the corresponding frame, i.e. the supervisory signal, $T$ is the number of sampling points of one frame signal, and $I$ is the total number of frames of the samples in a mini-batch.
On this basis, the cross-entropy function is introduced:

$$\mathrm{CE} = -\frac{1}{I}\sum_{i=1}^{I}\sum_{n=1}^{N} Y_{n,i}\,\log P_{n,i}$$

where $P_{n,i}$ is the probability obtained by normalizing the correlation coefficient sums $\mathrm{SCorr}(n)$ of the ith frame with the softmax function, $Y_{n,i}$ is the supervisory signal corresponding to that frame and scanning position, and $I$ is the total number of frames of the samples in a mini-batch.
Combining MSE and CE and adding them in a set proportion gives the final loss function:

$$\mathrm{Loss} = \mathrm{MSE} + \alpha \times \mathrm{CE}$$

where $\alpha$ is a set scale factor.
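A PyTorch sketch of the combined loss (the tensor shapes and batching are assumptions of the example; α = 0.01 in the experiments below):

```python
import torch
import torch.nn.functional as F

def joint_loss(est, target, scorr_scores, true_pos, alpha=0.01):
    """Loss = MSE + alpha * CE.

    est, target:   estimated and reference source signals over all
                   transmission paths, e.g. (batch, M, N, L) tensors.
    scorr_scores:  (batch, N) correlation coefficient sums SCorr(n).
    true_pos:      (batch,) index of the true position (one-hot target)."""
    mse = F.mse_loss(est, target)                 # MSE term over all transmission paths
    ce = F.cross_entropy(scorr_scores, true_pos)  # softmax over positions, then CE
    return mse + alpha * ce
```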
The advantages of the invention are illustrated below with reference to specific embodiments.
The invention uses transfer functions to generate simulated signals and tests the positioning performance of the localization algorithms on these signals under different signal-to-noise-ratio conditions. The proposed method trained with MSE only is denoted Shared-CNN-MSE, and the model trained jointly with MSE and CE is denoted Shared-CNN-MSE-CE. In addition, the experiments use SRP-PHAT, MUSIC and another neural-network-based approach (DNN) as baselines. The sound source signals are Gaussian white noise, and the signal-to-noise ratio ranges from -10 dB to 5 dB.
1. Microphone array
The microphone array used in this experiment has six microphones uniformly distributed on the horizontal circle of a sphere with radius 0.0875 m, i.e. adjacent microphones are 60 degrees apart, as shown in FIG. 4. The sound source is 3 metres from the centre of the sphere, the azimuth range is 0-360 degrees with a resolution of 5 degrees, giving 72 directions in total. The corresponding transfer functions are calculated with the spherical model given by Duda et al.
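For reference, a NumPy sketch of this geometry that plugs into the delay-matrix sketch above (the coordinate convention — array centre at the origin, microphones and sources in the horizontal plane — is an assumption of the example):

```python
import numpy as np

def experiment_geometry(radius=0.0875, n_mics=6, src_dist=3.0, az_step=5.0):
    """Six microphones 60 degrees apart on the sphere's horizontal circle;
    72 candidate azimuths at 5-degree resolution, 3 m from the centre."""
    mic_az = np.deg2rad(np.arange(n_mics) * 360.0 / n_mics)
    mics = np.stack([radius * np.cos(mic_az),
                     radius * np.sin(mic_az),
                     np.zeros(n_mics)], axis=1)              # (6, 3)
    scan_az = np.deg2rad(np.arange(0.0, 360.0, az_step))
    scans = np.stack([src_dist * np.cos(scan_az),
                      src_dist * np.sin(scan_az),
                      np.zeros_like(scan_az)], axis=1)       # (72, 3)
    return mics, scans
```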
2. Signal emulation
The experiment generates the simulated signals by convolving the sound source signal with the transfer functions; the source signal is Gaussian white noise. Additional noise is added to each channel of the simulated signal according to the different signal-to-noise ratios, with the noise independent across channels; the sampling rate is 48 kHz and the frame length is set to 1024 sampling points. Under each condition (sound source position and signal-to-noise ratio), the localization results of the proposed methods and the baseline methods are collected.
3. Neural network setup
Since the frame length in the experiments is set to 1024 sampling points and there are 6 microphones, the input of the CNNs is 6 × 1024 time-domain sampling points. Each CNN uses three one-dimensional convolution layers with 8, 4 and 1 convolution kernels respectively; the kernel size is 2 × 1, and the length after each convolution is kept equal to the frame length. The structure of a single CNN is shown in FIG. 2. Each DNN model uses one fully-connected layer with an output layer of 1024 nodes; the structure of a single DNN is shown in FIG. 3. Adam is selected as the optimization algorithm during training, and the proportion of the CE loss term is set to 0.01.
4. Results of the experiment
In the experiments, the positioning performance of the five methods is tested with noisy signals at different signal-to-noise ratios (-10 dB to 5 dB); the noise type is Gaussian white noise. Among the five methods, SRP-PHAT and MUSIC are the conventional sound source localization methods mentioned above, and DNN is a neural-network-based baseline system. Shared-CNN-MSE and Shared-CNN-MSE-CE are the proposed algorithms, corresponding to the two training strategies: Shared-CNN-MSE is trained using only MSE, while Shared-CNN-MSE-CE is trained using both MSE and CE.
As can be seen from FIG. 5, when the signal-to-noise ratio falls below -1 dB, the performance of MUSIC and SRP-PHAT degrades, while the neural-network-based methods show better noise robustness because the neural network models make full use of both the time-difference and amplitude-difference information. In the figure, fixing the average error angle at 30 degrees, the corresponding SNRs of SRP-PHAT, MUSIC, DNN, Shared-CNN-MSE and Shared-CNN-MSE-CE are -2 dB, -3 dB, -5 dB, -6 dB and -7 dB respectively; compared with SRP-PHAT, Shared-CNN-MSE and Shared-CNN-MSE-CE thus gain 4 dB and 5 dB. Compared with the DNN method, Shared-CNN-MSE has a lower average error angle at the same signal-to-noise ratio, showing that multi-task learning contributes to the improvement: the shared CNN hidden layer extracts more robust features, and learning at each position helps the learning at the other positions. On this basis, training jointly with MSE and CE improves Shared-CNN-MSE-CE over Shared-CNN-MSE, demonstrating the effectiveness of the training strategy: the CE loss term fully exploits the mutual exclusion among the different positions.
The invention provides an end-to-end sound source localization algorithm based on multi-task learning. Based on the idea of multi-task learning, the CNN serves as a shared hidden layer to extract features, and MSE and CE are jointly used as loss functions to train the model. Compared with the other methods, the proposed method has better noise immunity and better localization performance at low signal-to-noise ratios. Moreover, the method is an adaptive model that supports online learning: the more training data, the higher the localization accuracy. Compared with previous neural-network-based methods, it uses multi-task learning and, being modeled from the signal-processing perspective, is more interpretable: here the neural network is not a black box but a model used to recover the time differences and amplitude differences of the received microphone signals.
Although specific embodiments of the invention and accompanying drawings have been disclosed for illustrative purposes to aid understanding of the invention, those skilled in the art will appreciate that various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should therefore not be limited to the content of the preferred embodiments and drawings.
Claims (8)
1. An end-to-end sound source positioning method based on multitask learning comprises the following steps:
1) for each sound source position to be scanned, calculating the time delay of sound signals transmitted from the sound source position to each microphone position in the microphone array;
2) performing corresponding delay compensation on the multi-channel frame-level time domain signals collected by each microphone during each scanning of the microphone array according to the delay;
3) inputting each delay-compensated time domain signal into the corresponding CNN model for feature extraction and inputting the extracted features into a deep neural network; the deep neural network includes a plurality of DNN models, and $\mathrm{CNN}_m$ inputs its extracted features into $\mathrm{DNN}_{m,n}$; $\mathrm{CNN}_m$ is the CNN model for the mth microphone, and $\mathrm{DNN}_{m,n}$ is the DNN model of the transmission path corresponding to the mth microphone at the nth scanning position; m = 1, ..., M; n = 1, ..., N; M denotes the number of microphones and N the number of scanning positions;
4) the deep neural network estimates a multi-channel sound source signal of each scanning position according to the characteristics extracted by each CNN model;
5) for each scanning position, calculating the cross-correlation coefficient sum of the multi-channel sound source signals corresponding to the scanning position, and selecting the position with the maximum correlation coefficient sum as the sound source position;
the method for obtaining the DNN model by training comprises the following steps: 31) for each set sound source, collecting a multi-channel time domain signal of the sound source by using a microphone array, and acquiring the position of the sound source; then, time-domain signals of the sound source are subjected to delay compensation, and a DNN model of a transmission path corresponding to the position of the sound source is trained based on the delay compensation signals and an MSE loss function until a convergence condition is reached; respectively training models of different transmission paths based on an MSE loss function until a network converges; the single model MSE equation is Estimated sound source signal of i frame for m microphone at n scanning position, sm,n,i(t) isCorresponding to the source signal of a frame, T is the number of sampling points of a frame signal, and I is the total frame number of samples in a mini-batch; 32) on the basis of the convergence, a cross entropy function is introducedWherein P isn,iIs the correlation coefficient of the nth scan position of the ith frame and SCorr (n) probability normalized by the softmax function, Yn,iSupervision signals corresponding to the frames and the scanning positions; the final loss function is formed by combining the MSE and CE losses and adding the MSE and CE losses in proportion: firstly, calculating the estimated mean square error of a transmission path of each set sound source position, and adding the estimated mean square errors to be used as an MSE loss term in a loss function; the CE loss item is a CE loss item which is obtained by adding a consistency detection module between calculation channels, using a softmax function to normalize the correlation coefficient sum into probabilities at different positions to obtain a probability vector, and then solving the probability vector and a supervision signal, namely a one-hot vector according to a calculation formula of the CE, wherein the probability vector and the one-hot vector are calculated according to the calculation formula of the CEThe sum constitutes the loss function of the final DNN model, and α is a set scaling factor.
2. The method of claim 1, wherein the time domain signals are delay-compensated according to $\tilde{x}_{m,n}(t) = x_m(t + \tau_{m,n})$, where $x_m$ denotes the frame-level time domain signal acquired by the mth microphone, $\tilde{x}_{m,n}$ denotes the time domain signal after compensating $x_m$ with the delay between the nth scanning position and the mth microphone, and $\tau_{m,n}$ denotes the delay between the nth scanning position and the mth microphone.
3. The method of claim 1, wherein the CNN model performs feature extraction on the input time-domain signal using one-dimensional convolutional layers in the time domain.
5. An end-to-end sound source positioning system based on multi-task learning, characterized by comprising a delay calculation module, a delay compensation module, CNN models, a deep neural network and a target sound source position estimation module; wherein
The delay calculation module is used for calculating the delay of sound signals transmitted from the sound source position to each microphone position in the microphone array for each sound source position to be scanned;
the delay compensation module is used for carrying out corresponding delay compensation on the multi-channel frame-level time domain signals collected by each microphone during each scanning of the microphone array according to the delay;
the CNN model is used for extracting the characteristics of the input time domain signal after the time delay compensation and inputting the extracted characteristics into a deep neural network; the method for obtaining the DNN model by training comprises the following steps: a) for each set sound source, a microphone array is used to acquire multiple passes of the sound sourceA time domain signal of a channel, and acquiring the position of the sound source; then, time-domain signals of the sound source are subjected to delay compensation, and a DNN model of a transmission path corresponding to the position of the sound source is trained based on the delay compensation signals and an MSE loss function until a convergence condition is reached; respectively training models of different transmission paths based on an MSE loss function until a network converges; the single model MSE equation is Estimated sound source signal of i frame for m microphone at n scanning position, sm,n,i(t) isCorresponding to the source signal of a frame, T is the number of sampling points of a frame signal, and I is the total frame number of samples in a mini-batch; b) on the basis of the convergence, a cross entropy function is introducedWherein P isn,iIs the correlation coefficient of the nth scan position of the ith frame and SCorr (n) probability normalized by the softmax function, Yn,iSupervision signals corresponding to the frames and the scanning positions; the final loss function is formed by combining the MSE and CE losses and adding the MSE and CE losses in proportion: firstly, calculating the estimated mean square error of a transmission path of each set sound source position, and adding the estimated mean square errors to be used as an MSE loss term in a loss function; the CE loss item is a CE loss item which is obtained by adding a consistency detection module between calculation channels, using a softmax function to normalize the correlation coefficient sum into probabilities at different positions to obtain a probability vector, and then solving the probability vector and a supervision signal, namely a one-hot vector according to a calculation formula of the CE, wherein the probability vector and the one-hot vector are calculated according to the calculation formula of the CEAdding to form a loss function of the final DNN model, wherein alpha is a set scale factor;
the deep neural network is used for estimating a multichannel sound source signal of each scanning position according to the extracted features; wherein the deep neural network comprises a plurality of DNN models, CNNmInputting extracted features into DNNm,n;CNNmCNN model, DNN, for mth microphonem,nA DNN model representing a transmission path corresponding to the nth microphone at the nth scanning position; m1., M, N1., N, M denotes the number of microphones, N denotes the number of scanning positions;
and the target sound source position estimation module is used for calculating the cross-correlation coefficient sum of the multi-channel sound source signals corresponding to each scanning position, and selecting the position with the maximum correlation coefficient sum as the sound source position.
6. The system of claim 5, wherein the delay compensation module performs delay compensation on the time domain signals according to $\tilde{x}_{m,n}(t) = x_m(t + \tau_{m,n})$; $x_m$ denotes the frame-level time domain signal acquired by the mth microphone, $\tilde{x}_{m,n}$ denotes the time domain signal after compensating $x_m$ with the delay between the nth scanning position and the mth microphone, and $\tau_{m,n}$ denotes the delay between the nth scanning position and the mth microphone.
8. The system of claim 5, wherein the CNN model performs feature extraction on an input time-domain signal using one-dimensional convolutional layers in the time domain.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN201910043338.8A | 2019-01-17 | 2019-01-17 | End-to-end sound source positioning method and system based on multi-task learning |
Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN109782231A | 2019-05-21 |
| CN109782231B | 2020-11-20 |
Families Citing this family (9)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN111161757B * | 2019-12-27 | 2021-09-03 | 镁佳(北京)科技有限公司 | Sound source positioning method and device, readable storage medium and electronic equipment |
| CN111859241B * | 2020-06-01 | 2022-05-03 | 北京大学 | Unsupervised sound source orientation method based on sound transfer function learning |
| CN111694433B * | 2020-06-11 | 2023-06-20 | 阿波罗智联(北京)科技有限公司 | Voice interaction method and device, electronic equipment and storage medium |
| CN112731086A * | 2021-01-19 | 2021-04-30 | 国网上海能源互联网研究院有限公司 | Method and system for comprehensively inspecting electric power equipment |
| CN113138363A * | 2021-04-22 | 2021-07-20 | 苏州臻迪智能科技有限公司 | Sound source positioning method and device, storage medium and electronic equipment |
| CN113835065B * | 2021-09-01 | 2024-05-17 | 深圳壹秘科技有限公司 | Sound source direction determining method, device, equipment and medium based on deep learning |
| CN114489321B * | 2021-12-13 | 2024-04-09 | 广州大鱼创福科技有限公司 | Steady-state visual evoked potential target recognition method based on multi-task deep learning |
| CN117368847B * | 2023-12-07 | 2024-03-15 | 深圳市好兄弟电子有限公司 | Positioning method and system based on microphone radio frequency communication network |
| CN118112501B * | 2024-04-28 | 2024-07-26 | 杭州爱华智能科技有限公司 | Sound source positioning method and device suitable for periodic signals and sound source measuring device |
Family Cites Families (12)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN102103200B * | 2010-11-29 | 2012-12-05 | 清华大学 | Acoustic source spatial positioning method for distributed asynchronous acoustic sensor |
| US9753119B1 * | 2014-01-29 | 2017-09-05 | Amazon Technologies, Inc. | Audio and depth based sound source localization |
| CN107144818A * | 2017-03-21 | 2017-09-08 | 北京大学深圳研究生院 | Binaural sound sources localization method based on two-way ears matched filter Weighted Fusion |
| CN108305641B * | 2017-06-30 | 2020-04-07 | 腾讯科技(深圳)有限公司 | Method and device for determining emotion information |
| CN107703486B * | 2017-08-23 | 2021-03-23 | 南京邮电大学 | Sound source positioning method based on convolutional neural network CNN |
| CN108120436A * | 2017-12-18 | 2018-06-05 | 北京工业大学 | Real scene navigation method in an iBeacon-assisted geomagnetic indoor environment |
| CN108318862B * | 2017-12-26 | 2021-08-20 | 北京大学 | Sound source positioning method based on neural network |
| CN108375763B * | 2018-01-03 | 2021-08-20 | 北京大学 | Frequency division positioning method applied to multi-sound-source environment |
| CN108417224B * | 2018-01-19 | 2020-09-01 | 苏州思必驰信息科技有限公司 | Training and recognition method and system of bidirectional neural network model |
| CN108734733B * | 2018-05-17 | 2022-04-26 | 东南大学 | Microphone array and binocular camera-based speaker positioning and identifying method |
| CN109031200A * | 2018-05-24 | 2018-12-18 | 华南理工大学 | A sound source spatial orientation detection method based on deep learning |
| CN109164415B * | 2018-09-07 | 2022-09-16 | 东南大学 | Binaural sound source positioning method based on convolutional neural network |
Patent Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN108197697A * | 2017-12-29 | 2018-06-22 | 汕头大学 | A dynamic resampling method for training deep neural networks |
| CN108924836A * | 2018-07-04 | 2018-11-30 | 南方电网科学研究院有限责任公司 | Edge side physical layer channel authentication method based on deep neural network |
Non-Patent Citations (1)

| Title |
| --- |
| Jean-Marc Valin, "Robust Sound Source Localization Using a Microphone Array on a Mobile Robot," Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2003-10-31, pp. 1228-1233. * |
Similar Documents

| Publication | Title |
| --- | --- |
| CN109782231B | End-to-end sound source positioning method and system based on multi-task learning |
| CN108318862B | Sound source positioning method based on neural network |
| CN109993280B | Underwater sound source positioning method based on deep learning |
| Xiao et al. | A learning-based approach to direction of arrival estimation in noisy and reverberant environments |
| CN110531313B | Near-field signal source positioning method based on deep neural network regression model |
| EP1600791B1 | Sound source localization based on binaural signals |
| CN109490822B | Voice DOA estimation method based on ResNet |
| CN111401565A | DOA estimation method based on machine learning algorithm XGBoost |
| CN112904279A | Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT spatial spectrum |
| CN112180318B | Sound source direction of arrival estimation model training and sound source direction of arrival estimation method |
| Huang et al. | A time-domain unsupervised learning based sound source localization method |
| Ramezanpour et al. | Two-stage beamforming for rejecting interferences using deep neural networks |
| CN111123202B | Indoor early reflected sound positioning method and system |
| CN112014791A | Near-field source positioning method of array PCA-BP algorithm with array errors |
| CN111352075B | Underwater multi-sound-source positioning method and system based on deep learning |
| Nie et al. | Adaptive direction-of-arrival estimation using deep neural network in marine acoustic environment |
| Houégnigan et al. | Machine and deep learning approaches to localization and range estimation of underwater acoustic sources |
| CN111859241B | Unsupervised sound source orientation method based on sound transfer function learning |
| CN117451055A | Underwater sensor positioning method and system based on basis tracking noise reduction |
| Brendel et al. | Distance estimation of acoustic sources using the coherent-to-diffuse power ratio based on distributed training |
| Hu et al. | Evaluation and comparison of three source direction-of-arrival estimators using relative harmonic coefficients |
| Laufer-Goldshtein et al. | Multi-view source localization based on power ratios |
| Chetupalli et al. | Robust offline trained neural network for TDOA based sound source localization |
| CN113030849A | Near-field source positioning method based on self-encoder and parallel network |
| Huang et al. | A time-domain end-to-end method for sound source localization using multi-task learning |
Legal Events

| Code | Title |
| --- | --- |
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |