CN111239687A - Sound source positioning method and system based on deep neural network - Google Patents
- Publication number
- CN111239687A (application CN202010050760.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S5/00—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
- G01S5/18—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
- G01S5/22—Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
Abstract
The invention discloses a positioning method comprising the following steps: S1, acquiring the voice signals received by the microphones and generating a voice data set; S2, preprocessing the voice signals in the voice data set; S3, calculating the phase-weighted generalized cross-correlation function of the sound source signal corresponding to each voice signal; S4, acquiring the time delay corresponding to the peak of the phase-weighted generalized cross-correlation function, taking it as the TDOA observation of the sound source signal arriving at the microphone, and obtaining the amplitude corresponding to that time delay; S5, combining the TDOA observations and the amplitudes as the input vector, taking the three-dimensional spatial position coordinates corresponding to the sound source signal as the output vector, and combining the input vector and the output vector to generate the feature vector; S6, preprocessing the feature vector; S7, setting the parameters of the deep neural network and training it with the feature vectors of the training set to obtain the trained deep neural network; and S8, passing the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial coordinates of the sound source signal.
Description
Technical Field
The invention relates to the technical field of indoor sound source positioning, in particular to a sound source positioning method and system based on a deep neural network.
Background
In recent years, intelligent service products (such as smart speakers and smart-home devices) have been widely used in daily life, and to provide a good user experience the human-computer interaction capability of these products has received increasing attention. In human-computer interaction, voice communication is indispensable: users can directly issue voice commands, and the machine recognizes them and provides the corresponding services without manual operation. At present, in near-field speech recognition scenarios (such as a mobile phone), the quality of the speech signals received by the microphone is high and the recognition rate meets practical requirements. However, in far-field scenarios such as the smart home, the quality of the speech signals captured by the microphone is poor and the recognition rate is too low to meet practical requirements. Solving the far-field speech recognition problem has therefore become a research hotspot of domestic and foreign institutions in recent years. At present, estimating the sound source position with a localization algorithm at the front end of speech recognition, enhancing the signal from that direction, and attenuating interference from other directions improves both the signal quality and the recognition rate, effectively enabling the practical deployment of far-field speech recognition. Effective sound source localization prior to speech recognition is thus of great practical significance.
Classical localization algorithms are mainly two-dimensional sound source localization algorithms and fall into three categories. The first is based on the Time Difference of Arrival (TDOA). The time delay estimation algorithm, also called the arrival-time-difference algorithm, determines the position of a sound source from the difference in the times at which two microphones at different positions receive the same source signal. The time delay corresponding to the maximum peak of the Generalized Cross-Correlation (GCC) function of the signals received by the two microphones is used as the delay estimate, and the geometric constraints of the microphone array then yield the source position estimate. This method is easily affected by environmental noise and indoor reverberation: when the noise or reverberation is severe, multiple spurious peaks appear in the GCC function, an incorrect TDOA value is easily estimated, and a wrong source position estimate results. The second category is based on spatial spectrum estimation, whose basic idea is to determine the direction angle and position of the sound source from the spatial spectrum. Because the estimation of spatial signals is analogous to frequency estimation of time-domain signals, spatial spectrum estimation methods can be generalized from time-domain nonlinear spectra; however, these algorithms presuppose continuously distributed signal sources and a stationary field, which greatly limits their application.
A typical family of spatial spectrum algorithms is the feature-subspace class, which divides into subspace-decomposition algorithms, chiefly Multiple Signal Classification (MUSIC) and the rotational-invariance algorithm ESPRIT, and subspace-fitting algorithms, chiefly Maximum Likelihood (ML) and Weighted Subspace Fitting (WSF). The third category is based on steerable beam response: the microphone array is searched globally for the location with the largest energy, i.e., the sound source location. Generally, the speech signals collected by the microphones are filtered, weighted, and summed to form a beam, and the point where the beam's output power is maximal is taken as the source position. Steerable-beam-response algorithms divide into delay-and-sum beamforming and adaptive beamforming. Delay-and-sum beamforming introduces little signal distortion and has low computational cost, but its anti-interference capability is weak and it is easily affected by noise; adaptive beamforming is computationally expensive and introduces some signal distortion, but its anti-interference capability is strong.
Multi-modal fusion algorithms are currently used for sound source localization in three-dimensional space; a representative example is Audio-Visual Fusion. The sound source position is usually estimated jointly from face position information collected by a camera and direction-of-arrival (DOA) estimates obtained from the microphones. This avoids the limitations of traditional image tracking (camera count, illumination intensity) and of traditional acoustic tracking (background noise, indoor reverberation), greatly reducing the influence of environmental factors. However, multi-modal fusion still requires many parameters to be set, and the algorithm's robustness degrades when the environment changes.
In recent years, sound source localization using neural networks has become a popular research direction, especially with the development of deep learning. Studies of neural-network localization algorithms usually extract feature vectors from the speech signals and then feed them into the neural network for training. The common speech feature vector is composed of the TDOAs of multiple microphone pairs and does not use the amplitude information corresponding to those TDOAs, even though the amplitude reflects, to some extent, the reliability of each TDOA.
In general, sound source localization based on deep neural networks is a research hotspot within the indoor localization problem, and this research is of great significance for the practical deployment of many current audio applications, such as intelligent voice interaction. However, existing deep-neural-network localization methods remain under-explored, and current results are more or less insufficient.
Disclosure of Invention
The invention aims to provide a sound source positioning method and system based on a deep neural network that, addressing the defects of the prior art, combines the estimated time delays τ̂_m and their corresponding amplitudes R_m(τ̂_m) as the input vector of the deep neural network and uses the three-dimensional spatial coordinates as its output vector, making the method suitable for indoor sound source positioning with good extensibility and algorithmic robustness.
In order to achieve the purpose, the invention adopts the following technical scheme:
a sound source positioning method based on a deep neural network comprises a training stage of the deep neural network and a testing stage of the deep neural network, and comprises the following steps:
s1, acquiring a voice signal received by a microphone, and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
s2, performing first preprocessing on the voice signals in the generated voice data set;
s3, calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
s4, acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function, and taking the acquired time delay information as a TDOA observed value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
s5, combining the TDOA observation value with the amplitude value to serve as an input vector of a deep neural network, taking a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
s6, performing second preprocessing on the generated feature vectors;
s7, in the training stage of the deep neural network, setting parameters related to the deep neural network, and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network;
and S8, in the testing stage of the deep neural network, transmitting the input feature vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional space position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by adopting cross validation.
Further, in step S1, the set of microphone nodes is V = {1, 2, …, M}; each microphone node m contains two microphones, where m ∈ V, so there are M microphone pairs in total.
Further, the step S2 is specifically to perform a first preprocessing on the speech signals received by the two microphones in the microphone node m, where the first preprocessing includes framing, windowing, and pre-emphasis.
Further, step S3 specifically calculates the phase-weighted generalized cross-correlation function R_m(τ) of the two microphone speech signals at the preprocessed microphone node m, expressed as:

R_m(τ) = ∫ [X_{m,1}(f) · X*_{m,2}(f)] / |X_{m,1}(f) · X*_{m,2}(f)| · e^{j2πfτ} df

where m ∈ V; X_{m,1}(f) and X_{m,2}(f) denote the frequency-domain representations of the time-domain microphone signals x_{m,1}(t) and x_{m,2}(t) at node m; and the symbol * denotes complex conjugation.
Further, step S4 obtains the time delay τ̂_m corresponding to the peak of the phase-weighted generalized cross-correlation function R_m(τ), expressed as:

τ̂_m = argmax_τ R_m(τ).
Further, step S5 specifically comprises:

combining the time delays τ̂_m of all nodes and their corresponding amplitudes R_m(τ̂_m) as the input vector I of the deep neural network:

I = (τ̂_1, …, τ̂_M, R_1(τ̂_1), …, R_M(τ̂_M))^T;

taking the three-dimensional spatial position coordinate Q = (q_x, q_y, q_z)^T corresponding to the sound source signal S as the output vector of the neural network; and

combining the input vector I and the output vector Q to generate the feature vector G:

G = (I, Q)^T.
Further, the second preprocessing in step S6 includes data cleaning, data shuffling, and data normalization.
Further, the cross-validation employed in step S8 includes leave-one-out validation.
Correspondingly, a sound source positioning system based on a deep neural network is also provided, which comprises:
the first acquisition module is used for acquiring the voice signal received by the microphone and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
a first preprocessing module for performing a first preprocessing on the speech signal within the generated speech data set;
the calculation module is used for calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
the second acquisition module is used for acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function and taking the acquired time delay information as a TDOA observation value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
the generating module is used for combining the TDOA observed value and the amplitude value to serve as an input vector of a deep neural network, using a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
the second preprocessing module is used for carrying out second preprocessing on the generated feature vectors;
a training module for setting the parameters of the deep neural network and training it with the feature vectors of the training set to obtain the trained deep neural network.
Further, the method also comprises the following steps:
and the test module is used for transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial position coordinates of the sound source signal and evaluating the performance of the deep neural network model by adopting cross validation.
Compared with the prior art, the method combines the estimated time delays τ̂_m and their corresponding amplitudes R_m(τ̂_m) as the input vector of the deep neural network and uses the three-dimensional spatial coordinates as its output vector, making it suitable for indoor sound source positioning with good extensibility and algorithmic robustness.
Drawings
FIG. 1 is a flowchart of a sound source localization method based on a deep neural network according to an embodiment;
FIG. 2 is a schematic top view of a simulation environment provided by an embodiment, wherein a circle represents a position of a microphone;
FIG. 3 is a flow chart of a training phase of the deep neural network provided in one embodiment;
FIG. 4 is a flowchart illustrating a testing phase of the deep neural network according to an embodiment.
Detailed Description
The invention is described below with reference to specific embodiments; other advantages and effects of the present invention will be readily apparent to those skilled in the art from this disclosure. The invention is capable of other and different embodiments, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that, where there is no conflict, the features of the following embodiments and examples may be combined with each other.
The invention aims to provide a sound source positioning method and system based on a deep neural network, aiming at the defects of the prior art.
Example one
The embodiment provides a sound source localization method based on a deep neural network, which includes a training phase of the deep neural network and a testing phase of the deep neural network, as shown in fig. 1-2, and includes the steps of:
s11, acquiring a voice signal received by a microphone, and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
s12, performing first preprocessing on the voice signals in the generated voice data set;
s13, calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
s14, acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function, and taking the acquired time delay information as a TDOA observed value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
s15, combining the TDOA observation value with the amplitude value to serve as an input vector of a deep neural network, taking a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
s16, performing second preprocessing on the generated feature vectors;
s17, in the training stage of the deep neural network, setting parameters related to the deep neural network, and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network;
and S18, in the testing stage of the deep neural network, transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional space position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by adopting cross validation.
In the present embodiment, a distributed microphone array is specifically described:
the specific simulation settings are as follows: the simulated environment is a typical conference room of size 4.1m x 3.1m x 3m with a total of L-12 randomly distributed microphones. The distance between two microphones in each microphone node is Dm-0.6 m. For simplicity, the microphone is positioned in a plane having a height of 1.75 m. The sound propagation speed is c 343 m/s. In this embodiment, the original non-reverberant speech signal is a single-channel pure male english pronunciation with a sampling frequency of 16kHz, and the frame length of the speech signal is 120 ms. The room reverberation time T60 is 0.1s, the SNR is 20dB, and the number of monte carlo experiments is 50. The distributed microphone array has M microphone nodes in total, i.e. the set V of microphone nodes is {1,2, …, M }. Each microphone node m contains two microphones, where m ∈ V.
In step S11, the voice signals received by the microphones are acquired and used to generate a voice data set, comprising a training data set and a testing data set.
In the present embodiment, the sound source positions are set in a plane at heights of 1.5 m to 1.7 m, and 24000 position samples are uniformly acquired as the data set for the neural network. In a MATLAB simulation environment, the image-source model is first used to simulate the room impulse response; the original non-reverberant speech signal is then convolved with the room impulse response and white Gaussian noise is added, finally yielding the simulated signals received by the microphones.
In step S12, a first pre-processing is performed on the speech signal within the generated speech data set.
Specifically, the method comprises the steps of performing first preprocessing on voice signals received by two microphones in a microphone node m, wherein the first preprocessing comprises framing, windowing and pre-emphasis.
A rectangular window is used for windowing the speech signal; its window function ω(n) is:

ω(n) = 1 for 0 ≤ n ≤ N − 1, and ω(n) = 0 otherwise,

where N denotes the length of the window function.
The formula for pre-emphasis is:

H(z) = 1 − αz^(−1)

where α denotes the pre-emphasis coefficient, in the range 0.9 < α < 1.0. In the present embodiment, the length of the window function equals the frame length, and the pre-emphasis coefficient is α = 0.97.
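As an illustration of the first preprocessing (pre-emphasis with α = 0.97, framing at the 120 ms frame length, and rectangular windowing), the following is a minimal NumPy sketch. The function names and the non-overlapping framing are assumptions for illustration, not taken from the patent:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Apply H(z) = 1 - alpha*z^-1, i.e. y[n] = x[n] - alpha*x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, frame_len):
    """Split the signal into non-overlapping frames and apply a rectangular window."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return frames * np.ones(frame_len)  # rectangular window: omega(n) = 1, 0 <= n <= N-1

fs = 16000
frame_len = int(0.120 * fs)              # 120 ms frame at 16 kHz -> 1920 samples
x = np.random.randn(fs)                  # 1 s of stand-in audio (assumption)
frames = frame_and_window(pre_emphasis(x), frame_len)
print(frames.shape)                      # (8, 1920)
```

A Hamming or Hann window would simply replace the all-ones vector; the patent specifies the rectangular window.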
In step S13, a phase weighted generalized cross-correlation function of the sound source signal corresponding to the preprocessed speech signal is calculated.
Specifically, the phase-weighted generalized cross-correlation function R_m(τ) of the two microphone speech signals at the preprocessed microphone node m is calculated as:

R_m(τ) = ∫ [X_{m,1}(f) · X*_{m,2}(f)] / |X_{m,1}(f) · X*_{m,2}(f)| · e^{j2πfτ} df

where m ∈ V; X_{m,1}(f) and X_{m,2}(f) denote the frequency-domain representations of the time-domain microphone signals x_{m,1}(t) and x_{m,2}(t) at node m; and the symbol * denotes complex conjugation. In the present embodiment, M = 6.
In step S14, acquiring delay information corresponding to the peak of the phase-weighted generalized cross-correlation function, and taking the acquired delay information as a TDOA observation of the arrival of the sound source signal at the microphone; and obtaining the amplitude corresponding to the time delay information.
The time delay τ̂_m corresponding to the peak of the phase-weighted generalized cross-correlation function R_m(τ) is obtained and taken as the TDOA observation of the arrival of sound source signal S at microphone node m:

τ̂_m = argmax_{τ ∈ [−τ_max, τ_max]} R_m(τ)

where τ_max denotes the theoretical maximum time delay of arrival of the sound source signal S at microphone node m, i.e., the upper bound of (||d_{m,1} − s|| − ||d_{m,2} − s||)/c, namely τ_max = D_m/c; here d_{m,1} and d_{m,2} denote the positions of the microphone pair contained at node m, s the position of the sound source S, c the sound propagation speed, and ||·|| the Euclidean norm. The amplitude R_m(τ̂_m) corresponding to the time delay τ̂_m (i.e., to the TDOA observation) is then obtained.
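Steps S13 and S14 together amount to a standard GCC-PHAT computation. The sketch below is an illustrative FFT-based implementation; the regularization constant, FFT length, and the synthetic delayed-noise test signals are assumptions, not from the patent:

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """PHAT-weighted generalized cross-correlation of two microphone signals;
    returns the TDOA estimate (peak lag) and the peak amplitude, the two
    per-node features used by the method."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting (small constant avoids 0/0)
    r = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    r = np.concatenate((r[-max_shift:], r[:max_shift + 1]))  # lags -max..+max
    peak = int(np.argmax(r))
    return (peak - max_shift) / fs, r[peak]  # TDOA in seconds, peak amplitude

fs = 16000
rng = np.random.default_rng(0)
sig = rng.standard_normal(fs)
delay = 8                                    # true delay of 8 samples
x1 = np.concatenate((np.zeros(delay), sig))  # delayed copy at microphone 1
x2 = np.concatenate((sig, np.zeros(delay)))
tdoa, amp = gcc_phat(x1, x2, fs)
print(round(tdoa * fs))                      # 8
```

In practice the search would additionally be restricted to lags within ±τ_max = ±D_m/c, as the formula above requires.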
TDOA localization is a method that uses time differences. By measuring the arrival time of a signal at each monitoring station, the distance to the signal source can be determined, and the source location found from these distances (drawing a circle centered on each station with the distance as its radius). In practice, however, absolute arrival times are difficult to measure; instead, the differences in arrival time between stations define hyperbolas with the monitoring stations as foci and the distance differences determining the transverse axes, and the intersection of these hyperbolas is the position of the source.
In step S15, the TDOA observations are combined with the amplitudes as input vectors for a deep neural network, the three-dimensional spatial location coordinates corresponding to the acoustic source signal are used as output vectors for the neural network, and the input vectors and the output vectors are combined to generate feature vectors.
Specifically, the time delays τ̂_m (i.e., the TDOA observations) and their corresponding amplitudes R_m(τ̂_m) are combined as the input vector I of the deep neural network:

I = (τ̂_1, …, τ̂_M, R_1(τ̂_1), …, R_M(τ̂_M))^T;

the three-dimensional spatial position coordinate Q = (q_x, q_y, q_z)^T corresponding to the sound source signal S is taken as the output vector of the neural network; and the input vector I and the output vector Q are combined to generate the feature vector G:

G = (I, Q)^T.
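Assembling the feature vector of step S15 might look like the sketch below. All numeric values are illustrative, and the ordering of delays and amplitudes inside I is an assumption; the patent fixes only that I has 2M = 12 entries:

```python
import numpy as np

M = 6  # microphone nodes, as in the embodiment

# Hypothetical per-node observations: TDOA tau_hat[m] (seconds) and peak amplitude amp[m].
tau_hat = np.array([1.2e-4, -3.0e-5, 8.0e-5, 0.0, -1.1e-4, 6.0e-5])
amp = np.array([0.91, 0.42, 0.77, 0.55, 0.88, 0.63])

# Input vector I: 2M = 12 entries, matching the network's 12 input neurons.
# The ordering (all delays first, then all amplitudes) is an assumption.
I = np.concatenate((tau_hat, amp))

Q = np.array([2.0, 1.5, 1.6])   # source coordinates (qx, qy, qz), illustrative
G = np.concatenate((I, Q))      # feature vector G = (I, Q)^T
print(I.shape, G.shape)         # (12,) (15,)
```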
In step S16, a second preprocessing is performed on the generated feature vectors. The second preprocessing comprises data cleaning, data shuffling, and data normalization.
The normalization adopts the min-max method; the transformation is:

ĝ = (g − g_min) / (g_max − g_min)

where g_min and g_max denote the minimum and maximum values in the sample feature vector G, and ĝ denotes the sample value after normalization. After the neural network is trained, the predicted values must be passed through the inverse normalization to recover physical quantities, in this embodiment the three-dimensional spatial position of the sound source point. The inverse transformation is:

g = ĝ · (g_max − g_min) + g_min

where g denotes the recovered sample value.
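The min-max normalization and its inverse can be sketched as follows; the sample values are illustrative:

```python
import numpy as np

def minmax_normalize(g):
    """Forward transform g_hat = (g - g_min) / (g_max - g_min), mapping samples to [0, 1]."""
    g_min, g_max = g.min(), g.max()
    return (g - g_min) / (g_max - g_min), g_min, g_max

def minmax_denormalize(g_hat, g_min, g_max):
    """Inverse transform recovering physical values: g = g_hat*(g_max - g_min) + g_min."""
    return g_hat * (g_max - g_min) + g_min

g = np.array([1.0, 2.5, 4.0])                    # illustrative sample values
g_hat, g_min, g_max = minmax_normalize(g)
print(g_hat)                                     # [0.  0.5 1. ]
print(minmax_denormalize(g_hat, g_min, g_max))   # [1.  2.5 4. ]
```

Note that g_min and g_max must be computed on the training set and reused at test time, or normalized predictions cannot be mapped back to consistent room coordinates.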
In step S17, in the training phase of the deep neural network, parameters related to the deep neural network are set, and the deep neural network is trained by using the feature vectors of the training set, so as to obtain a trained deep neural network.
In this embodiment, the number of input-layer neurons of the deep neural network (DNN) is set to 12 and the number of output-layer neurons to 3. Three hidden layers are used: the first hidden layer has 12 neurons, the second 15, and the third 3, each with a tanh activation function.
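The 12–12–15–3–3 architecture described above can be sketched as an untrained forward pass in plain NumPy; the weight initialization scale and the linear output layer are assumptions (the patent does not state the output activation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the embodiment: 12 inputs, hidden layers of 12/15/3 (tanh), 3 outputs.
sizes = [12, 12, 15, 3, 3]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Forward pass: tanh on the three hidden layers, linear output layer (assumption)."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.tanh(x @ W + b)
    return x @ weights[-1] + biases[-1]   # predicted (px, py, pz)

batch = rng.standard_normal((4, 12))      # four illustrative feature vectors
pred = forward(batch)
print(pred.shape)                         # (4, 3)
```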
In this embodiment, the loss function of the neural network is set as the mean squared error (MSE) between the true spatial position vector Q and the network's predicted vector P, expressed as:

MSE = (1/U) · Σ_{u=1}^{U} ||Q_u − P_u||²

where U is the total number of samples in the current iteration of the data set.
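One common reading of this loss, with the squared error summed over the three coordinates and averaged over the U samples, is sketched below; whether the patent also divides by the coordinate dimension is not stated, so the normalization is an assumption, as are the example positions:

```python
import numpy as np

def mse_loss(Q, P):
    """Squared error between true positions Q and predictions P, summed over
    coordinates and averaged over the U samples (assumed normalization)."""
    Q, P = np.asarray(Q), np.asarray(P)
    U = len(Q)
    return np.sum((Q - P) ** 2) / U

Q = np.array([[2.0, 1.5, 1.6], [1.0, 2.0, 1.7]])  # true positions (illustrative)
P = np.array([[2.1, 1.4, 1.6], [1.0, 2.2, 1.5]])  # predicted positions
print(round(mse_loss(Q, P), 6))                    # 0.05
```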
In step S18, in the testing stage of the deep neural network, the input vectors of the test set are transmitted into the trained deep neural network for prediction, so as to obtain the three-dimensional spatial position coordinates of the sound source signal, and the performance of the deep neural network model is evaluated by using cross validation.
The input vectors of the test set are fed into the trained deep neural network to predict the three-dimensional spatial position coordinates P = [p_x, p_y, p_z]^T of the sound source signal, and the performance of the deep neural network model is evaluated by cross-validation.
In this embodiment, the data set contains 24000 samples in total, and the performance of the neural network is tested with cross-validation using a leave-one-fold-out scheme: 4000 sample points are held out as the test set and the remaining 20000 samples form the training set; the tested data become part of the training set in the next round, and the process repeats until every sample has been predicted once.
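The rotating hold-out scheme just described (6 folds of 4000 test samples drawn from the 24000-sample set) can be sketched as:

```python
import numpy as np

def rotating_folds(n_samples, test_size):
    """Successive hold-out folds: each fold holds out test_size samples for
    testing while the rest train, rotating until every sample is tested once."""
    indices = np.arange(n_samples)
    for k in range(n_samples // test_size):
        test = indices[k * test_size:(k + 1) * test_size]
        train = np.concatenate((indices[:k * test_size],
                                indices[(k + 1) * test_size:]))
        yield train, test

folds = list(rotating_folds(24000, 4000))
print(len(folds))                # 6
train, test = folds[0]
print(len(train), len(test))     # 20000 4000
```

In a real run the 24000 samples would be shuffled before splitting (the "data shuffling" of the second preprocessing), and the network retrained per fold.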
The sound source positioning method based on the deep neural network comprises a training stage of the deep neural network and a testing stage of the deep neural network.
As shown in FIG. 3, in the training phase of the deep neural network, steps S11-S17 are included.
As shown in FIG. 4, in the testing stage of the deep neural network, steps S11-S16, S18 are included.
It should be noted that, in the testing phase of this embodiment, a trained deep neural network is obtained based on the training phase, and then test positioning is performed.
Compared with the prior art, this embodiment uses the estimated time delay τ̂_m and the amplitude corresponding to the maximum peak of R_m(τ) as the input vector of the deep neural network, and the three-dimensional space coordinates as the output vector, so the method is suitable for indoor sound source localization and has good scalability and algorithm robustness.
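The two features named above come from the phase-weighted generalized cross-correlation (GCC-PHAT) of the two microphones in a node. A minimal numpy sketch, assuming an FFT-based implementation and a simulated integer-sample delay (the patent does not prescribe a particular numerical realization):

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """Phase-weighted generalized cross-correlation of two microphone
    signals. Returns the peak delay (the TDOA estimate) and the peak
    amplitude -- the two per-node features fed to the network."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    r = np.fft.irfft(cross, n)
    r = np.concatenate((r[-(n // 2):], r[: n // 2 + 1]))  # center lag 0
    peak = np.argmax(np.abs(r))
    tau = (peak - n // 2) / fs              # delay in seconds
    return tau, np.abs(r[peak])

fs = 16000
rng = np.random.default_rng(0)
sig = rng.standard_normal(1600)             # broadband source signal
delayed = np.roll(sig, 8)                   # simulate an 8-sample delay
tau, amp = gcc_phat(delayed, sig, fs)
print(round(tau * fs))                      # 8: the simulated delay
```

The PHAT weighting whitens the cross-spectrum so the correlation peak stays sharp under reverberation, which is why the peak position and its amplitude are informative features.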
Example two
The embodiment provides a sound source positioning system based on a deep neural network, which comprises:
the first acquisition module is used for acquiring the voice signal received by the microphone and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
a first preprocessing module for performing a first preprocessing on the speech signal within the generated speech data set;
the calculation module is used for calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
the second acquisition module is used for acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function and taking the acquired time delay information as a TDOA observation value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
the generating module is used for combining the TDOA observed value and the amplitude value to serve as an input vector of a deep neural network, using a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
the second preprocessing module is used for carrying out second preprocessing on the generated feature vectors;
a training module for setting the parameters related to the deep neural network and training the deep neural network with the feature vectors of the training set to obtain the trained deep neural network.
Further, the method also comprises the following steps:
and the test module is used for transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial position coordinates of the sound source signal and evaluating the performance of the deep neural network model by adopting cross validation.
It should be noted that the sound source localization system based on a deep neural network in this embodiment works in the same way as the method of embodiment one and will not be described again here.
Compared with the prior art, this embodiment uses the estimated time delay τ̂_m and the amplitude corresponding to the maximum peak of R_m(τ) as the input vector of the deep neural network, and the three-dimensional space coordinates as the output vector, so the system is suitable for indoor sound source localization and has good scalability and algorithm robustness.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Claims (10)
1. A sound source positioning method based on a deep neural network, characterized by comprising a training stage and a testing stage of the deep neural network, and comprising the following steps:
s1, acquiring a voice signal received by a microphone, and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
s2, performing first preprocessing on the voice signals in the generated voice data set;
s3, calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
s4, acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function, and taking the acquired time delay information as a TDOA observed value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
s5, combining the TDOA observation value with the amplitude value to serve as an input vector of a deep neural network, taking a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
s6, performing second preprocessing on the generated feature vectors;
s7, in the training stage of the deep neural network, setting parameters related to the deep neural network, and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network;
and S8, in the testing stage of the deep neural network, transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional space position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by adopting cross validation.
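Steps S1-S8 above can be sketched end to end for the feature-extraction side (S3-S6). This is an illustrative sketch under the same assumptions as before: six microphone nodes yielding a 12-dimensional input vector, FFT-based GCC-PHAT, and a simple min-max normalization standing in for the unspecified second preprocessing.

```python
import numpy as np

def extract_features(pairs, fs):
    """Steps S3-S5: for each microphone node, the GCC-PHAT peak delay
    (TDOA observation) and its amplitude, concatenated into the
    network input vector I."""
    feats = []
    for x1, x2 in pairs:
        n = len(x1) + len(x2)
        X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
        c = X1 * np.conj(X2)
        c /= np.abs(c) + 1e-12                 # PHAT weighting
        r = np.fft.irfft(c, n)
        r = np.concatenate((r[-(n // 2):], r[: n // 2 + 1]))
        k = np.argmax(np.abs(r))
        feats += [(k - n // 2) / fs, np.abs(r[k])]   # (delay, amplitude)
    return np.array(feats)

def normalize(G):
    """Step S6: min-max normalization per feature column; the patent
    names normalization but not the exact scheme (assumption)."""
    lo, hi = G.min(axis=0), G.max(axis=0)
    return (G - lo) / np.where(hi > lo, hi - lo, 1.0)

fs = 16000
rng = np.random.default_rng(1)
sig = rng.standard_normal(1600)
# Six nodes with simulated integer-sample delays -> 12 features total.
pairs = [(np.roll(sig, d), sig) for d in (3, 5, 7, 9, 11, 13)]
I = extract_features(pairs, fs)
print(I.shape)                 # (12,): matches the 12-neuron input layer
```

The resulting vector I, paired with the known source coordinate Q, forms the feature vector G = (I, Q)^T used for training.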
2. The method as claimed in claim 1, characterized in that the set of microphone nodes in step S1 is V = {1, 2, …, M}; each microphone node m comprises two microphones, wherein m ∈ V; M denotes the total number of microphone nodes.
3. The method for sound source localization based on deep neural network as claimed in claim 2, wherein the step S2 is specifically to perform a first pre-processing on the speech signals received by two microphones in the microphone node m, and the first pre-processing includes framing, windowing and pre-emphasis.
4. The method for sound source localization according to claim 2, characterized in that the step S3 specifically calculates the phase weighted generalized cross-correlation function R_m(τ) of the two microphone voice signals within the preprocessed microphone node m, expressed as:

R_m(τ) = ∫ [X_{m,1}(f) · X*_{m,2}(f) / |X_{m,1}(f) · X*_{m,2}(f)|] · e^{j2πfτ} df

wherein X_{m,1}(f) and X_{m,2}(f) are the spectra of the two microphone signals of node m.
5. The sound source localization method based on the deep neural network as claimed in claim 4, characterized in that the step S4 specifically obtains the time delay information τ̂_m corresponding to the peak of the phase weighted generalized cross-correlation function R_m(τ), expressed as:

τ̂_m = argmax_τ R_m(τ)
6. The sound source localization method based on the deep neural network as claimed in claim 5, wherein the step S5 specifically comprises:
the time delay information τ̂_m and the corresponding amplitude R_m(τ̂_m) of each microphone node are combined as the input vector I of the deep neural network:

I = (τ̂_1, R_1(τ̂_1), …, τ̂_M, R_M(τ̂_M))^T
taking the three-dimensional space position coordinate Q corresponding to the sound source signal S as the output vector of the neural network:

Q = [q_x, q_y, q_z]^T
combining the input vector I and the output vector Q to generate a feature vector G:
G=(I,Q)T。
7. The method for sound source localization based on deep neural network of claim 6, wherein the second preprocessing in step S6 includes data cleaning, data de-ordering, and data normalization.
8. The method for sound source localization based on deep neural network of claim 7, wherein the cross-validation employed in step S8 comprises leave-one-out validation.
9. A sound source localization system based on a deep neural network, comprising:
the first acquisition module is used for acquiring the voice signal received by the microphone and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
a first preprocessing module for performing a first preprocessing on the speech signal within the generated speech data set;
the calculation module is used for calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
the second acquisition module is used for acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function and taking the acquired time delay information as a TDOA observation value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
the generating module is used for combining the TDOA observed value and the amplitude value to serve as an input vector of a deep neural network, using a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
the second preprocessing module is used for carrying out second preprocessing on the generated feature vectors;
and the training module is used for setting parameters related to the deep neural network and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network.
10. The deep neural network-based sound source localization system according to claim 9, further comprising:
and the test module is used for transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial position coordinates of the sound source signal and evaluating the performance of the deep neural network model by adopting cross validation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010050760.9A CN111239687B (en) | 2020-01-17 | 2020-01-17 | Sound source positioning method and system based on deep neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010050760.9A CN111239687B (en) | 2020-01-17 | 2020-01-17 | Sound source positioning method and system based on deep neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111239687A true CN111239687A (en) | 2020-06-05 |
CN111239687B CN111239687B (en) | 2021-12-14 |
Family
ID=70872716
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010050760.9A Active CN111239687B (en) | 2020-01-17 | 2020-01-17 | Sound source positioning method and system based on deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111239687B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111949965A (en) * | 2020-08-12 | 2020-11-17 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based identity verification method, device, medium and electronic equipment |
CN111965600A (en) * | 2020-08-14 | 2020-11-20 | 长安大学 | Indoor positioning method based on sound fingerprints in strong shielding environment |
CN111981644A (en) * | 2020-08-26 | 2020-11-24 | 北京声智科技有限公司 | Air conditioner control method and device and electronic equipment |
CN112180318A (en) * | 2020-09-28 | 2021-01-05 | 深圳大学 | Sound source direction-of-arrival estimation model training and sound source direction-of-arrival estimation method |
CN113111765A (en) * | 2021-04-08 | 2021-07-13 | 浙江大学 | Multi-voice source counting and positioning method based on deep learning |
CN113589230A (en) * | 2021-09-29 | 2021-11-02 | 广东省科学院智能制造研究所 | Target sound source positioning method and system based on joint optimization network |
CN114545332A (en) * | 2022-02-18 | 2022-05-27 | 桂林电子科技大学 | Arbitrary array sound source positioning method based on cross-correlation sequence and neural network |
CN115267671A (en) * | 2022-06-29 | 2022-11-01 | 金茂云科技服务(北京)有限公司 | Distributed voice interaction terminal equipment and sound source positioning method and device thereof |
WO2022263710A1 (en) * | 2021-06-17 | 2022-12-22 | Nokia Technologies Oy | Apparatus, methods and computer programs for obtaining spatial metadata |
CN115980668A (en) * | 2023-01-29 | 2023-04-18 | 桂林电子科技大学 | Sound source localization method based on generalized cross correlation of wide neural network |
CN116304639A (en) * | 2023-05-05 | 2023-06-23 | 上海玫克生储能科技有限公司 | Identification model generation method, identification system, identification device and identification medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103576126A (en) * | 2012-07-27 | 2014-02-12 | 姜楠 | Four-channel array sound source positioning system based on neural network |
US20160322055A1 (en) * | 2015-03-27 | 2016-11-03 | Google Inc. | Processing multi-channel audio waveforms |
CN108318862A (en) * | 2017-12-26 | 2018-07-24 | 北京大学 | A kind of sound localization method based on neural network |
CN109839612A (en) * | 2018-08-31 | 2019-06-04 | 大象声科(深圳)科技有限公司 | Sounnd source direction estimation method based on time-frequency masking and deep neural network |
- 2020-01-17: CN202010050760.9A patent/CN111239687B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103576126A (en) * | 2012-07-27 | 2014-02-12 | 姜楠 | Four-channel array sound source positioning system based on neural network |
US20160322055A1 (en) * | 2015-03-27 | 2016-11-03 | Google Inc. | Processing multi-channel audio waveforms |
CN108318862A (en) * | 2017-12-26 | 2018-07-24 | 北京大学 | A kind of sound localization method based on neural network |
CN109839612A (en) * | 2018-08-31 | 2019-06-04 | 大象声科(深圳)科技有限公司 | Sound source direction estimation method based on time-frequency masking and deep neural network
Non-Patent Citations (4)
Title |
---|
SHARATH ADAVANNE等: "Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks", 《IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING》 * |
王义圆: "基于麦克风阵列的目标探测与信号增强技术研究", 《中国优秀硕博学位论文全文数据库(硕士) 信息科技辑》 * |
祖丽楠等: "一种基于神经网络滤波的广义互相关时延估计方法的设计", 《化工自动化及仪表》 * |
黎长江等: "基于循环神经网络的音素识别研究", 《微电子学与计算机》 * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111949965A (en) * | 2020-08-12 | 2020-11-17 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based identity verification method, device, medium and electronic equipment |
CN111965600A (en) * | 2020-08-14 | 2020-11-20 | 长安大学 | Indoor positioning method based on sound fingerprints in strong shielding environment |
CN111981644A (en) * | 2020-08-26 | 2020-11-24 | 北京声智科技有限公司 | Air conditioner control method and device and electronic equipment |
CN111981644B (en) * | 2020-08-26 | 2021-09-24 | 北京声智科技有限公司 | Air conditioner control method and device and electronic equipment |
CN112180318A (en) * | 2020-09-28 | 2021-01-05 | 深圳大学 | Sound source direction-of-arrival estimation model training and sound source direction-of-arrival estimation method |
CN112180318B (en) * | 2020-09-28 | 2023-06-27 | 深圳大学 | Sound source direction of arrival estimation model training and sound source direction of arrival estimation method |
CN113111765A (en) * | 2021-04-08 | 2021-07-13 | 浙江大学 | Multi-voice source counting and positioning method based on deep learning |
WO2022263710A1 (en) * | 2021-06-17 | 2022-12-22 | Nokia Technologies Oy | Apparatus, methods and computer programs for obtaining spatial metadata |
CN113589230A (en) * | 2021-09-29 | 2021-11-02 | 广东省科学院智能制造研究所 | Target sound source positioning method and system based on joint optimization network |
CN114545332A (en) * | 2022-02-18 | 2022-05-27 | 桂林电子科技大学 | Arbitrary array sound source positioning method based on cross-correlation sequence and neural network |
CN114545332B (en) * | 2022-02-18 | 2024-05-03 | 桂林电子科技大学 | Random array sound source positioning method based on cross-correlation sequence and neural network |
CN115267671A (en) * | 2022-06-29 | 2022-11-01 | 金茂云科技服务(北京)有限公司 | Distributed voice interaction terminal equipment and sound source positioning method and device thereof |
CN115980668A (en) * | 2023-01-29 | 2023-04-18 | 桂林电子科技大学 | Sound source localization method based on generalized cross correlation of wide neural network |
CN116304639A (en) * | 2023-05-05 | 2023-06-23 | 上海玫克生储能科技有限公司 | Identification model generation method, identification system, identification device and identification medium |
Also Published As
Publication number | Publication date |
---|---|
CN111239687B (en) | 2021-12-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111239687B (en) | Sound source positioning method and system based on deep neural network | |
CN111025233B (en) | Sound source direction positioning method and device, voice equipment and system | |
Salvati et al. | Exploiting CNNs for improving acoustic source localization in noisy and reverberant conditions | |
Aarabi et al. | Robust sound localization using multi-source audiovisual information fusion | |
Nakadai et al. | Improvement of recognition of simultaneous speech signals using av integration and scattering theory for humanoid robots | |
Vesperini et al. | Localizing speakers in multiple rooms by using deep neural networks | |
Liu et al. | Continuous sound source localization based on microphone array for mobile robots | |
Hu et al. | Unsupervised multiple source localization using relative harmonic coefficients | |
Raykar et al. | Speaker localization using excitation source information in speech | |
WO2020024816A1 (en) | Audio signal processing method and apparatus, device, and storage medium | |
CN113870893B (en) | Multichannel double-speaker separation method and system | |
CN112363112A (en) | Sound source positioning method and device based on linear microphone array | |
CN114171041A (en) | Voice noise reduction method, device and equipment based on environment detection and storage medium | |
CN113514801A (en) | Microphone array sound source positioning method and sound source identification method based on deep learning | |
CN103901400B (en) | A kind of based on delay compensation and ears conforming binaural sound source of sound localization method | |
CN112712818A (en) | Voice enhancement method, device and equipment | |
Zhang et al. | AcousticFusion: Fusing sound source localization to visual SLAM in dynamic environments | |
Yang et al. | Srp-dnn: Learning direct-path phase difference for multiple moving sound source localization | |
Rascon et al. | Lightweight multi-DOA tracking of mobile speech sources | |
Huang et al. | A time-domain unsupervised learning based sound source localization method | |
Pertilä et al. | Time Difference of Arrival Estimation with Deep Learning–From Acoustic Simulations to Recorded Data | |
Liu et al. | Wavoice: An mmWave-Assisted Noise-Resistant Speech Recognition System | |
Zhao et al. | Accelerated steered response power method for sound source localization via clustering search | |
Dwivedi et al. | Long-term temporal audio source localization using sh-crnn | |
Liu et al. | Deep learning based two-dimensional speaker localization with large ad-hoc microphone arrays |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||