CN111239687A - Sound source positioning method and system based on deep neural network - Google Patents

Sound source positioning method and system based on deep neural network

Info

Publication number
CN111239687A
CN111239687A (application CN202010050760.9A)
Authority
CN
China
Prior art keywords
neural network
deep neural
sound source
microphone
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010050760.9A
Other languages
Chinese (zh)
Other versions
CN111239687B (en)
Inventor
张巧灵
唐柔冰
马晗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Sci Tech University ZSTU
Original Assignee
Zhejiang Sci Tech University ZSTU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Sci Tech University ZSTU filed Critical Zhejiang Sci Tech University ZSTU
Priority to CN202010050760.9A priority Critical patent/CN111239687B/en
Publication of CN111239687A publication Critical patent/CN111239687A/en
Application granted granted Critical
Publication of CN111239687B publication Critical patent/CN111239687B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a positioning method comprising the following steps: S1, acquiring the speech signals received by the microphones and generating a speech data set; S2, preprocessing the speech signals in the speech data set; S3, calculating the phase-weighted generalized cross-correlation function of the sound source signal corresponding to the speech signals; S4, acquiring the time delay corresponding to the peak of the phase-weighted generalized cross-correlation function, taking it as the TDOA observation of the sound source signal arriving at the microphones, and obtaining the amplitude corresponding to that time delay; S5, combining the TDOA observations and the amplitudes as the input vector, taking the three-dimensional spatial position coordinates corresponding to the sound source signal as the output vector, and combining the input vector and the output vector to generate a feature vector; S6, preprocessing the feature vectors; S7, setting the parameters of the deep neural network and training it with the feature vectors of the training set to obtain the trained deep neural network; and S8, feeding the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial coordinates of the sound source signal.

Description

Sound source positioning method and system based on deep neural network
Technical Field
The invention relates to the technical field of indoor sound source positioning, in particular to a sound source positioning method and system based on a deep neural network.
Background
In recent years, intelligent service products (such as smart speakers and smart-home devices) have become widespread in daily life, and to deliver a good user experience the human-computer interaction capability of these products has attracted increasing attention. In human-computer interaction, voice is indispensable: users issue spoken commands directly, and the machine recognizes them and provides the corresponding service without manual operation. At present, in near-field speech recognition scenarios (such as mobile phones), the quality of the speech signals received by the microphone is high and the recognition rate meets practical requirements. However, in far-field scenarios such as smart homes, the quality of the speech signals captured by the microphone is poor, the recognition rate is low, and practical requirements cannot be met. Solving the far-field speech recognition problem has therefore become a research hotspot of institutions at home and abroad in recent years. Estimating the sound source position with a localization algorithm at the front end of speech recognition, enhancing the source signal from that direction, and suppressing interference from other directions improves both speech quality and recognition rate, and can effectively support the practical deployment of far-field speech recognition applications. In particular, effective sound source localization prior to speech recognition is of great practical significance.
Classical localization algorithms are mainly two-dimensional sound source localization algorithms and fall into three categories. The first is algorithms based on Time Difference of Arrival (TDOA). A time delay estimation algorithm, also called a TDOA algorithm, determines the position of a sound source from the difference in the times at which two microphones at different positions receive the same source signal. The delay corresponding to the maximum peak of the Generalized Cross-Correlation (GCC) function of the signals received by the two microphones is taken as the delay estimate, and the geometric constraints of the microphone array then yield the source position estimate. This method is easily affected by environmental noise and indoor reverberation: when the noise is strong or the reverberation severe, many spurious peaks appear in the GCC function, an incorrect TDOA value is easily estimated, and an incorrect source position estimate results. The second is algorithms based on spatial spectrum estimation, whose basic idea is to determine the direction angle and position of the source from the spatial spectrum. Because the estimation of spatial signals is analogous to frequency estimation of time-domain signals, spatial spectrum estimation can be generalized from time-domain nonlinear spectral methods; however, these algorithms presuppose continuously distributed signal sources and a stationary field, which greatly limits their application. A typical family of spatial spectrum algorithms is the eigen-subspace algorithms, divided into subspace decomposition algorithms — mainly the Multiple Signal Classification (MUSIC) algorithm and the rotational-invariance subspace algorithm (ESPRIT) — and subspace fitting algorithms — mainly the Maximum Likelihood (ML) algorithm and the Weighted Subspace Fitting (WSF) algorithm. The third is algorithms based on steered beam response, which search globally over the microphone array's field for the location with maximum energy, i.e., the source location. Typically, the speech signals collected by the microphones are filtered, weighted, and summed to form a beam, and the point maximizing the beam's output power is taken as the source position. Steered-response algorithms can be divided into delay-and-sum beamforming and adaptive beamforming. Delay-and-sum beamforming introduces little signal distortion and has a small computational cost, but its interference resistance is weak and it is easily affected by noise; adaptive beamforming is computationally expensive and introduces some signal distortion, but its interference resistance is strong.
Multi-modal fusion algorithms are currently used for sound source localization in three-dimensional space; a representative example is the audio-visual fusion algorithm. The source position is usually estimated jointly from face position information collected by a camera and direction-of-arrival (DOA) estimates obtained from the microphones. This approach avoids the limitations of conventional image tracking (number of cameras, illumination intensity) as well as those of conventional acoustic tracking (background noise, indoor reverberation), greatly reducing the influence of environmental factors. However, multi-modal fusion still requires many parameters to be set, and when the environment changes the robustness of the algorithm degrades.
In recent years, sound source localization using neural networks has been a popular research direction, especially since the development of deep learning. Neural-network localization studies usually extract feature vectors from the speech signals and feed them into a network for training. The common speech feature vector consists of the TDOAs of multiple microphone pairs and does not use the amplitude information associated with those TDOAs, even though the amplitude corresponding to a TDOA reflects, to some extent, its reliability.
In general, sound source localization based on deep neural networks is a research hotspot within the indoor localization problem, and this research is of great significance to the practical deployment of many current audio applications, such as intelligent voice interaction. However, deep-neural-network localization has not yet been studied thoroughly, and existing results are insufficient in one respect or another.
Disclosure of Invention
The invention aims to provide, in view of the defects of the prior art, a sound source positioning method and system based on a deep neural network which use the estimated time delays τ̂_m and the corresponding amplitudes R̂_m as the input vector of the deep neural network and the three-dimensional space coordinates as the output vector, so that the method is suitable for indoor sound source positioning and has good expandability and algorithm robustness.
In order to achieve the purpose, the invention adopts the following technical scheme:
a sound source positioning method based on a deep neural network comprises a training stage of the deep neural network and a testing stage of the deep neural network, and comprises the following steps:
s1, acquiring a voice signal received by a microphone, and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
s2, performing first preprocessing on the voice signals in the generated voice data set;
s3, calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
s4, acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function, and taking the acquired time delay information as a TDOA observed value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
s5, combining the TDOA observation value with the amplitude value to serve as an input vector of a deep neural network, taking a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
s6, performing second preprocessing on the generated feature vectors;
s7, in the training stage of the deep neural network, setting parameters related to the deep neural network, and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network;
and S8, in the testing stage of the deep neural network, transmitting the input feature vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional space position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by adopting cross validation.
Further, in step S1, the set of microphone nodes is V = {1, 2, …, M}; each microphone node m comprises two microphones, wherein m ∈ V; M denotes the total number of microphone nodes (microphone pairs).
Further, the step S2 is specifically to perform a first preprocessing on the speech signals received by the two microphones in the microphone node m, where the first preprocessing includes framing, windowing, and pre-emphasis.
Further, the step S3 is specifically to calculate the phase-weighted generalized cross-correlation function R_m(τ) of the two microphone speech signals in the preprocessed microphone node m, expressed as:
R_m(τ) = ∫ [X_m^(1)(ω) (X_m^(2)(ω))^* / |X_m^(1)(ω) (X_m^(2)(ω))^*|] e^(jωτ) dω
wherein m ∈ V; X_m^(1)(ω) and X_m^(2)(ω) are the frequency-domain representations of the time-domain microphone signals x_m^(1)(t) and x_m^(2)(t) at node m; the symbol * denotes complex conjugation.
Further, the step S4 obtains the time delay τ̂_m corresponding to the peak of the phase-weighted generalized cross-correlation function R_m(τ), expressed as:
τ̂_m = arg max_τ R_m(τ)
and obtains the amplitude R̂_m = R_m(τ̂_m) corresponding to the time delay τ̂_m.
Further, the step S5 is specifically:
combining the time delays τ̂_m of all nodes and the corresponding amplitudes R̂_m as the input vector I of the deep neural network:
I = (τ̂_1, …, τ̂_M, R̂_1, …, R̂_M)^T
taking the three-dimensional spatial position coordinates Q corresponding to the sound source signal S as the output vector of the neural network:
Q = (q_x, q_y, q_z)^T
and combining the input vector I and the output vector Q to generate the feature vector G:
G = (I, Q)^T
further, the second preprocessing in step S6 includes data cleaning, data disordering, and data normalization.
Further, the cross-validation employed in step S8 includes leave-one-out validation.
Correspondingly, a sound source positioning system based on a deep neural network is also provided, which comprises:
the first acquisition module is used for acquiring the voice signal received by the microphone and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
a first preprocessing module for performing a first preprocessing on the speech signal within the generated speech data set;
the calculation module is used for calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
the second acquisition module is used for acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function and taking the acquired time delay information as a TDOA observation value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
the generating module is used for combining the TDOA observed value and the amplitude value to serve as an input vector of a deep neural network, using a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
the second preprocessing module is used for carrying out second preprocessing on the generated feature vectors;
a training module for setting the parameters related to the deep neural network and training the deep neural network with the feature vectors of the training set to obtain the trained deep neural network.
Further, the method also comprises the following steps:
and the test module is used for transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial position coordinates of the sound source signal and evaluating the performance of the deep neural network model by adopting cross validation.
Compared with the prior art, the method uses the estimated time delays τ̂_m and the corresponding amplitudes R̂_m as the input vector of the deep neural network and the three-dimensional space coordinates as the output vector, so that the method is suitable for indoor sound source positioning and has good expandability and algorithm robustness.
Drawings
FIG. 1 is a flowchart of a sound source localization method based on a deep neural network according to an embodiment;
FIG. 2 is a schematic top view of a simulation environment provided by an embodiment, wherein a circle represents a position of a microphone;
FIG. 3 is a flow chart of a training phase of the deep neural network provided in one embodiment;
FIG. 4 is a flowchart illustrating a testing phase of the deep neural network according to an embodiment.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
The invention aims to provide a sound source positioning method and system based on a deep neural network, aiming at the defects of the prior art.
Example one
The embodiment provides a sound source localization method based on a deep neural network, which includes a training phase of the deep neural network and a testing phase of the deep neural network, as shown in fig. 1-2, and includes the steps of:
s11, acquiring a voice signal received by a microphone, and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
s12, performing first preprocessing on the voice signals in the generated voice data set;
s13, calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
s14, acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function, and taking the acquired time delay information as a TDOA observed value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
s15, combining the TDOA observation value with the amplitude value to serve as an input vector of a deep neural network, taking a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
s16, performing second preprocessing on the generated feature vectors;
s17, in the training stage of the deep neural network, setting parameters related to the deep neural network, and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network;
and S18, in the testing stage of the deep neural network, transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional space position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by adopting cross validation.
In the present embodiment, a distributed microphone array is specifically described:
the specific simulation settings are as follows: the simulated environment is a typical conference room of size 4.1m x 3.1m x 3m with a total of L-12 randomly distributed microphones. The distance between two microphones in each microphone node is Dm-0.6 m. For simplicity, the microphone is positioned in a plane having a height of 1.75 m. The sound propagation speed is c 343 m/s. In this embodiment, the original non-reverberant speech signal is a single-channel pure male english pronunciation with a sampling frequency of 16kHz, and the frame length of the speech signal is 120 ms. The room reverberation time T60 is 0.1s, the SNR is 20dB, and the number of monte carlo experiments is 50. The distributed microphone array has M microphone nodes in total, i.e. the set V of microphone nodes is {1,2, …, M }. Each microphone node m contains two microphones, where m ∈ V.
In step S11, the speech signals received by the microphones are acquired, and a speech data set is generated from the acquired signals; the speech data set comprises a training data set and a testing data set.
In the present embodiment, the sound source positions are set in a plane with a height of 1.5 m to 1.7 m, and 24000 position samples are uniformly acquired as the data set for the neural network. In the MATLAB simulation environment, the Image model is first used to simulate the room impulse response, then the original non-reverberant speech signal is convolved with the room impulse response and white Gaussian noise is added, finally simulating the signals received by the microphones.
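To make this step concrete, the following is a minimal sketch of the microphone-signal simulation, using a direct-path delay-and-attenuation approximation with additive white Gaussian noise in place of the full Image-model room impulse response used in the embodiment; the function and variable names are illustrative, not from the patent.

```python
# Minimal sketch of the microphone-signal simulation (direct path only,
# assumed stand-in for the Image-model room impulse response).
import numpy as np

FS = 16_000          # sampling frequency (Hz), as in the embodiment
C = 343.0            # sound propagation speed (m/s)

def simulate_mic_signal(src_signal, src_pos, mic_pos, snr_db=20.0):
    """Delay/attenuate the dry source to one microphone and add white noise."""
    dist = np.linalg.norm(np.asarray(src_pos) - np.asarray(mic_pos))
    delay = int(round(dist / C * FS))            # integer-sample propagation delay
    sig = np.zeros(len(src_signal) + delay)
    sig[delay:] = src_signal / max(dist, 1e-3)   # 1/r amplitude decay
    noise_power = np.mean(sig ** 2) / (10 ** (snr_db / 10))
    sig += np.random.randn(len(sig)) * np.sqrt(noise_power)
    return sig
```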
In step S12, a first pre-processing is performed on the speech signal within the generated speech data set.
Specifically, the method comprises the steps of performing first preprocessing on voice signals received by two microphones in a microphone node m, wherein the first preprocessing comprises framing, windowing and pre-emphasis.
A rectangular window is used to window the speech signal; the window function ω(n) of the rectangular window is:
ω(n) = 1 for 0 ≤ n ≤ N−1, and ω(n) = 0 otherwise
where N represents the length of the window function.
The formula for pre-emphasis is:
H(z) = 1 − αz^(−1)
where α denotes the pre-emphasis coefficient, in the range 0.9 < α < 1.0. In the present embodiment, the length of the window function equals the frame length, and the pre-emphasis coefficient is α = 0.97.
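The first preprocessing can be sketched as follows. The non-overlapping framing and the function name are assumptions; the patent only fixes the frame length (120 ms, i.e. N = 1920 samples at 16 kHz), the rectangular window, and α = 0.97.

```python
import numpy as np

def preprocess(signal, frame_len=1920, alpha=0.97):
    """First preprocessing: pre-emphasis, framing, rectangular windowing."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1], i.e. H(z) = 1 - alpha z^-1
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into non-overlapping frames of length N = frame_len (assumed hop)
    n_frames = len(emphasized) // frame_len
    frames = emphasized[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Rectangular window: w(n) = 1 for 0 <= n <= N-1, so windowing is identity
    window = np.ones(frame_len)
    return frames * window
```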
In step S13, a phase weighted generalized cross-correlation function of the sound source signal corresponding to the preprocessed speech signal is calculated.
Specifically, the phase-weighted generalized cross-correlation function R_m(τ) of the two microphone speech signals in the preprocessed microphone node m is calculated, expressed as:
R_m(τ) = ∫ [X_m^(1)(ω) (X_m^(2)(ω))^* / |X_m^(1)(ω) (X_m^(2)(ω))^*|] e^(jωτ) dω
wherein m ∈ V; X_m^(1)(ω) and X_m^(2)(ω) are the frequency-domain representations of the time-domain microphone signals x_m^(1)(t) and x_m^(2)(t) at node m; the symbol * denotes complex conjugation. In the present embodiment, M = 6.
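A minimal sketch of the GCC-PHAT computation for one frame pair is given below; the FFT length and the small regularization constant in the denominator are implementation assumptions not specified in the patent.

```python
import numpy as np

def gcc_phat(x1, x2, fs=16_000):
    """Phase-weighted generalized cross-correlation (GCC-PHAT) of one frame pair."""
    n = len(x1) + len(x2)                 # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)              # X1(w) * X2*(w)
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
    r = np.fft.irfft(cross, n=n)          # back to the lag (tau) domain
    r = np.concatenate((r[-(n // 2):], r[: n // 2 + 1]))   # center lag 0
    lags = np.arange(-(n // 2), n // 2 + 1) / fs           # lags in seconds
    return lags, r
```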
In step S14, acquiring delay information corresponding to the peak of the phase-weighted generalized cross-correlation function, and taking the acquired delay information as a TDOA observation of the arrival of the sound source signal at the microphone; and obtaining the amplitude corresponding to the time delay information.
The time delay τ̂_m corresponding to the peak of the phase-weighted generalized cross-correlation function R_m(τ) is obtained, and τ̂_m is taken as the TDOA observation of the arrival of the sound source signal S at microphone node m, expressed as:
τ̂_m = arg max_τ R_m(τ)
wherein τ ∈ [−τ_max, τ_max], and τ_max = D_m / c represents the theoretical maximum time delay (TDOA) with which the sound source signal S can arrive at microphone node m; the true delay is (||s − m_1|| − ||s − m_2||) / c, where ||s − m_1|| and ||s − m_2|| represent the distances from the two microphones contained at node m to the sound source S, c represents the sound propagation speed, and ||·|| represents the Euclidean norm. The amplitude R̂_m = R_m(τ̂_m) corresponding to the time delay τ̂_m (i.e., the TDOA observation) is then obtained.
TDOA localization is a positioning method based on time differences. By measuring the time at which a signal arrives at a monitoring station, the distance from the signal source to that station can be determined, and the source position can then be found from the distances to several stations (drawing circles centered on the stations with those distances as radii). Absolute arrival times are, however, difficult to measure; instead, comparing the differences in arrival times at pairs of monitoring stations yields hyperbolas with the stations as foci and the measured range difference as parameter, and the intersection of these hyperbolas is the position of the source.
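Building on the gcc_phat sketch above, the peak search restricted to the physically feasible lag range [−τ_max, τ_max] might look as follows; the helper name and the masking strategy are illustrative assumptions, with D_m = 0.6 m taken from the embodiment.

```python
import numpy as np

def tdoa_and_amplitude(lags, r, mic_spacing=0.6, c=343.0):
    """Pick the GCC-PHAT peak inside the physically feasible lag range."""
    tau_max = mic_spacing / c                      # theoretical maximum TDOA
    valid = np.abs(lags) <= tau_max                # restrict to [-tau_max, tau_max]
    idx = np.argmax(np.where(valid, r, -np.inf))   # peak within the valid range
    tau_hat = lags[idx]                            # TDOA observation (seconds)
    r_hat = r[idx]                                 # amplitude at the peak
    return tau_hat, r_hat
```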
In step S15, the TDOA observations are combined with the amplitudes as input vectors for a deep neural network, the three-dimensional spatial location coordinates corresponding to the acoustic source signal are used as output vectors for the neural network, and the input vectors and the output vectors are combined to generate feature vectors.
The method specifically comprises: combining the time delays τ̂_m (i.e., the TDOA observations) and their corresponding amplitudes R̂_m as the input vector I of the deep neural network:
I = (τ̂_1, …, τ̂_M, R̂_1, …, R̂_M)^T
taking the three-dimensional spatial position coordinates Q corresponding to the sound source signal S as the output vector of the neural network:
Q = (q_x, q_y, q_z)^T
and combining the input vector I and the output vector Q to generate the feature vector G:
G = (I, Q)^T
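A sketch of the feature-vector assembly is shown below; the ordering of delays before amplitudes within I is an assumption, since the patent only states that the two are combined (with M = 6 the input has 2M = 12 entries).

```python
import numpy as np

def build_feature_vector(tau_hats, r_hats, source_pos):
    """Combine TDOAs and amplitudes into I, pair with coordinates Q, form G."""
    I = np.concatenate([tau_hats, r_hats])   # input vector, length 2M (12 for M = 6)
    Q = np.asarray(source_pos)               # output vector (qx, qy, qz)
    return np.concatenate([I, Q])            # feature vector G = (I, Q)^T
```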
In step S16, a second preprocessing is performed on the generated feature vectors; the second preprocessing comprises data cleaning, data shuffling, and data normalization.
The normalization adopts the min-max normalization method; the transformation function is:
g̃ = (g − g_min) / (g_max − g_min)
where g_min and g_max represent the minimum and maximum values in the sample feature vector G, and g̃ represents the normalized sample data. After the neural network is trained, the predicted values must be inverse-normalized to recover physical data values, which in this embodiment are the three-dimensional spatial position of the sound source point.
The transformation function of the inverse normalization is:
g = g̃ · (g_max − g_min) + g_min
where g_min and g_max represent the minimum and maximum values in the sample feature vector G, g̃ is the normalized sample data, and g is the recovered value.
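The min-max normalization and its inverse can be sketched as follows; computing the statistics per feature dimension on the training set only is an assumption about good practice, not something the patent prescribes.

```python
import numpy as np

def minmax_normalize(g, g_min, g_max):
    """Map sample values into [0, 1]: g~ = (g - g_min) / (g_max - g_min)."""
    return (g - g_min) / (g_max - g_min)

def minmax_denormalize(g_tilde, g_min, g_max):
    """Inverse transform: recover physical values from normalized predictions."""
    return g_tilde * (g_max - g_min) + g_min

# Assumed usage: per-dimension statistics taken from the training set only,
# then reused on the test set and on the network's predictions.
# G_train: (num_samples, num_features) array of feature vectors
# g_min, g_max = G_train.min(axis=0), G_train.max(axis=0)
```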
In step S17, in the training phase of the deep neural network, parameters related to the deep neural network are set, and the deep neural network is trained by using the feature vectors of the training set, so as to obtain a trained deep neural network.
In this embodiment, the number of input-layer neurons of the deep neural network (DNN) is set to 12 and the number of output-layer neurons to 3. Three hidden layers are used: the first hidden layer has 12 neurons, the second 15, and the third 3, each with a tanh activation function.
In this embodiment, the loss function of the neural network is set as the mean squared error (MSE) between the true spatial position vector Q and the predicted estimate vector P of the neural network, expressed as:
MSE = (1/U) Σ_{u=1}^{U} ||Q_u − P_u||²
where U is the total number of samples in the current neural network iteration's data set.
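A sketch of this network and loss in PyTorch is given below, matching the stated layer sizes and tanh activations; the linear output layer, the Adam optimizer, and the learning rate are assumptions the patent does not specify.

```python
import torch
import torch.nn as nn

# 12 -> 12 (tanh) -> 15 (tanh) -> 3 (tanh) -> 3, per the embodiment's layer sizes
model = nn.Sequential(
    nn.Linear(12, 12), nn.Tanh(),   # hidden layer 1
    nn.Linear(12, 15), nn.Tanh(),   # hidden layer 2
    nn.Linear(15, 3),  nn.Tanh(),   # hidden layer 3
    nn.Linear(3, 3),                # output layer: (px, py, pz)
)
loss_fn = nn.MSELoss()              # mean squared error between Q and P
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer

def train_step(I_batch, Q_batch):
    """One gradient step on a batch of (input, target) feature pairs."""
    optimizer.zero_grad()
    P = model(I_batch)              # predicted positions
    loss = loss_fn(P, Q_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```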
In step S18, in the testing stage of the deep neural network, the input vectors of the test set are transmitted into the trained deep neural network for prediction, so as to obtain the three-dimensional spatial position coordinates of the sound source signal, and the performance of the deep neural network model is evaluated by using cross validation.
The input vectors of the test set are fed into the trained deep neural network, which predicts the three-dimensional spatial position coordinates P = [p_x, p_y, p_z]^T of the sound source signal; the performance of the deep neural network model is then evaluated by cross-validation.
In this embodiment, the data set contains 24000 samples in total, and the performance of the neural network is tested by cross-validation using a leave-one-out-style rotating hold-out: 4000 sample points are held out as the test set and the remaining 20000 samples form the training set; the tested data become part of the training set in the next round, and the process is repeated until every sample has been predicted once.
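The rotating hold-out scheme described here can be sketched as follows; the random shuffling and seed are illustrative assumptions.

```python
import numpy as np

def rotating_folds(num_samples=24_000, fold_size=4_000, seed=0):
    """Yield (train_idx, test_idx) so every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)            # shuffled sample indices
    for start in range(0, num_samples, fold_size):
        test_idx = order[start:start + fold_size]   # 4000 held-out samples
        train_idx = np.setdiff1d(order, test_idx)   # remaining 20000 samples
        yield train_idx, test_idx
```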
The sound source positioning method based on the deep neural network comprises a training stage of the deep neural network and a testing stage of the deep neural network.
As shown in FIG. 3, in the training phase of the deep neural network, steps S11-S17 are included.
As shown in FIG. 4, in the testing stage of the deep neural network, steps S11-S16, S18 are included.
It should be noted that, in the testing phase of this embodiment, a trained deep neural network is obtained based on the training phase, and then test positioning is performed.
Compared with the prior art, this embodiment uses the estimated time delays τ̂_m and the amplitudes R̂_m corresponding to the maximum peaks of R_m(τ) as the input vector of the deep neural network and the three-dimensional space coordinates as the output vector, so that the method is suitable for indoor sound source positioning and has good expandability and algorithm robustness.
Example two
The embodiment provides a sound source positioning system based on a deep neural network, which comprises:
the first acquisition module is used for acquiring the voice signal received by the microphone and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
a first preprocessing module for performing a first preprocessing on the speech signal within the generated speech data set;
the calculation module is used for calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
the second acquisition module is used for acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function and taking the acquired time delay information as a TDOA observation value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
the generating module is used for combining the TDOA observed value and the amplitude value to serve as an input vector of a deep neural network, using a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
the second preprocessing module is used for carrying out second preprocessing on the generated feature vectors;
a training module for setting the parameters related to the deep neural network and training the deep neural network with the feature vectors of the training set to obtain the trained deep neural network.
Further, the method also comprises the following steps:
and the test module is used for transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial position coordinates of the sound source signal and evaluating the performance of the deep neural network model by adopting cross validation.
It should be noted that the sound source localization system based on a deep neural network in this embodiment is similar to Embodiment One and will not be described again here.
Compared with the prior art, this embodiment uses the estimated time delays τ̂_m and the amplitudes R̂_m corresponding to the maximum peaks of R_m(τ) as the input vector of the deep neural network and the three-dimensional space coordinates as the output vector, so that the method is suitable for indoor sound source positioning and has good expandability and algorithm robustness.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (10)

1. A sound source positioning method based on a deep neural network is characterized by comprising a training stage of the deep neural network and a testing stage of the deep neural network, and comprises the following steps:
s1, acquiring a voice signal received by a microphone, and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
s2, performing first preprocessing on the voice signals in the generated voice data set;
s3, calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
s4, acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function, and taking the acquired time delay information as a TDOA observed value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
s5, combining the TDOA observation value with the amplitude value to serve as an input vector of a deep neural network, taking a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
s6, performing second preprocessing on the generated feature vectors;
s7, in the training stage of the deep neural network, setting parameters related to the deep neural network, and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network;
and S8, in the testing stage of the deep neural network, transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional space position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by adopting cross validation.
2. The method as claimed in claim 1, wherein the set of microphone nodes in step S1 is V = {1, 2, …, M}; each microphone node m comprises two microphones, wherein m ∈ V; M denotes the total number of microphone nodes.
3. The method for sound source localization based on deep neural network as claimed in claim 2, wherein the step S2 is specifically to perform a first pre-processing on the speech signals received by two microphones in the microphone node m, and the first pre-processing includes framing, windowing and pre-emphasis.
4. The method for sound source localization according to claim 2, wherein step S3 is specifically to calculate the phase-weighted generalized cross-correlation function R_m(τ) of the two microphone speech signals within the preprocessed microphone node m, expressed as:
R_m(τ) = ∫ [X_m^(1)(ω) (X_m^(2)(ω))^* / |X_m^(1)(ω) (X_m^(2)(ω))^*|] e^(jωτ) dω
wherein m ∈ V; X_m^(1)(ω) and X_m^(2)(ω) are the frequency-domain representations of the time-domain microphone signals x_m^(1)(t) and x_m^(2)(t) at node m; the symbol * denotes complex conjugation.
5. The sound source localization method based on the deep neural network as claimed in claim 4, wherein step S4 obtains the time delay τ̂_m corresponding to the peak of the phase-weighted generalized cross-correlation function R_m(τ), expressed as:
τ̂_m = arg max_τ R_m(τ)
and obtains the amplitude R̂_m = R_m(τ̂_m) corresponding to the time delay τ̂_m.
6. The sound source localization method based on the deep neural network as claimed in claim 5, wherein step S5 specifically comprises:
combining the time delays τ̂_m and the corresponding amplitudes R̂_m as the input vector I of the deep neural network:
I = (τ̂_1, …, τ̂_M, R̂_1, …, R̂_M)^T
taking the three-dimensional spatial position coordinates Q corresponding to the sound source signal S as the output vector of the neural network:
Q = (q_x, q_y, q_z)^T
and combining the input vector I and the output vector Q to generate the feature vector G:
G = (I, Q)^T
7. The method for sound source localization based on a deep neural network of claim 6, wherein the second preprocessing in step S6 includes data cleaning, data shuffling, and data normalization.
8. The method for sound source localization based on deep neural network of claim 7, wherein the cross-validation employed in step S8 comprises leave-one-out validation.
9. A sound source localization system based on a deep neural network, comprising:
the first acquisition module is used for acquiring the voice signal received by the microphone and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
a first preprocessing module for performing a first preprocessing on the speech signal within the generated speech data set;
the calculation module is used for calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
the second acquisition module is used for acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function and taking the acquired time delay information as a TDOA observation value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
the generating module is used for combining the TDOA observed value and the amplitude value to serve as an input vector of a deep neural network, using a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
the second preprocessing module is used for carrying out second preprocessing on the generated feature vectors;
and the training module is used for setting parameters related to the deep neural network and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network.
10. The deep neural network-based sound source localization system according to claim 9, further comprising:
and the test module is used for transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial position coordinates of the sound source signal and evaluating the performance of the deep neural network model by adopting cross validation.
CN202010050760.9A 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network Active CN111239687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010050760.9A CN111239687B (en) 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010050760.9A CN111239687B (en) 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network

Publications (2)

Publication Number Publication Date
CN111239687A true CN111239687A (en) 2020-06-05
CN111239687B CN111239687B (en) 2021-12-14

Family

ID=70872716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010050760.9A Active CN111239687B (en) 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network

Country Status (1)

Country Link
CN (1) CN111239687B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949965A (en) * 2020-08-12 2020-11-17 腾讯科技(深圳)有限公司 Artificial intelligence-based identity verification method, device, medium and electronic equipment
CN111965600A (en) * 2020-08-14 2020-11-20 长安大学 Indoor positioning method based on sound fingerprints in strong shielding environment
CN111981644A (en) * 2020-08-26 2020-11-24 北京声智科技有限公司 Air conditioner control method and device and electronic equipment
CN112180318A (en) * 2020-09-28 2021-01-05 深圳大学 Sound source direction-of-arrival estimation model training and sound source direction-of-arrival estimation method
CN113111765A (en) * 2021-04-08 2021-07-13 浙江大学 Multi-voice source counting and positioning method based on deep learning
CN113589230A (en) * 2021-09-29 2021-11-02 广东省科学院智能制造研究所 Target sound source positioning method and system based on joint optimization network
CN114545332A (en) * 2022-02-18 2022-05-27 桂林电子科技大学 Arbitrary array sound source positioning method based on cross-correlation sequence and neural network
CN115267671A (en) * 2022-06-29 2022-11-01 金茂云科技服务(北京)有限公司 Distributed voice interaction terminal equipment and sound source positioning method and device thereof
WO2022263710A1 (en) * 2021-06-17 2022-12-22 Nokia Technologies Oy Apparatus, methods and computer programs for obtaining spatial metadata
CN115980668A (en) * 2023-01-29 2023-04-18 桂林电子科技大学 Sound source localization method based on generalized cross correlation of wide neural network
CN116304639A (en) * 2023-05-05 2023-06-23 上海玫克生储能科技有限公司 Identification model generation method, identification system, identification device and identification medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103576126A (en) * 2012-07-27 2014-02-12 姜楠 Four-channel array sound source positioning system based on neural network
US20160322055A1 (en) * 2015-03-27 2016-11-03 Google Inc. Processing multi-channel audio waveforms
CN108318862A (en) * 2017-12-26 2018-07-24 北京大学 A kind of sound localization method based on neural network
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sounnd source direction estimation method based on time-frequency masking and deep neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103576126A (en) * 2012-07-27 2014-02-12 姜楠 Four-channel array sound source positioning system based on neural network
US20160322055A1 (en) * 2015-03-27 2016-11-03 Google Inc. Processing multi-channel audio waveforms
CN108318862A (en) * 2017-12-26 2018-07-24 北京大学 A kind of sound localization method based on neural network
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sounnd source direction estimation method based on time-frequency masking and deep neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SHARATH ADAVANNE et al.: "Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks", IEEE Journal of Selected Topics in Signal Processing *
WANG YIYUAN (王义圆): "Research on target detection and signal enhancement technology based on microphone arrays", China Master's Theses Full-text Database, Information Science and Technology *
ZU LINAN (祖丽楠) et al.: "A design of a generalized cross-correlation time delay estimation method based on neural network filtering", Control and Instruments in Chemical Industry *
LI CHANGJIANG (黎长江) et al.: "Research on phoneme recognition based on recurrent neural networks", Microelectronics & Computer *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949965A (en) * 2020-08-12 2020-11-17 腾讯科技(深圳)有限公司 Artificial intelligence-based identity verification method, device, medium and electronic equipment
CN111965600A (en) * 2020-08-14 2020-11-20 长安大学 Indoor positioning method based on sound fingerprints in strong shielding environment
CN111981644A (en) * 2020-08-26 2020-11-24 北京声智科技有限公司 Air conditioner control method and device and electronic equipment
CN111981644B (en) * 2020-08-26 2021-09-24 北京声智科技有限公司 Air conditioner control method and device and electronic equipment
CN112180318A (en) * 2020-09-28 2021-01-05 深圳大学 Sound source direction-of-arrival estimation model training and sound source direction-of-arrival estimation method
CN112180318B (en) * 2020-09-28 2023-06-27 深圳大学 Sound source direction of arrival estimation model training and sound source direction of arrival estimation method
CN113111765A (en) * 2021-04-08 2021-07-13 浙江大学 Multi-voice source counting and positioning method based on deep learning
WO2022263710A1 (en) * 2021-06-17 2022-12-22 Nokia Technologies Oy Apparatus, methods and computer programs for obtaining spatial metadata
CN113589230A (en) * 2021-09-29 2021-11-02 广东省科学院智能制造研究所 Target sound source positioning method and system based on joint optimization network
CN114545332A (en) * 2022-02-18 2022-05-27 桂林电子科技大学 Arbitrary array sound source positioning method based on cross-correlation sequence and neural network
CN114545332B (en) * 2022-02-18 2024-05-03 桂林电子科技大学 Random array sound source positioning method based on cross-correlation sequence and neural network
CN115267671A (en) * 2022-06-29 2022-11-01 金茂云科技服务(北京)有限公司 Distributed voice interaction terminal equipment and sound source positioning method and device thereof
CN115980668A (en) * 2023-01-29 2023-04-18 桂林电子科技大学 Sound source localization method based on generalized cross correlation of wide neural network
CN116304639A (en) * 2023-05-05 2023-06-23 上海玫克生储能科技有限公司 Identification model generation method, identification system, identification device and identification medium

Also Published As

Publication number Publication date
CN111239687B (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CN111239687B (en) Sound source positioning method and system based on deep neural network
CN111025233B (en) Sound source direction positioning method and device, voice equipment and system
Salvati et al. Exploiting CNNs for improving acoustic source localization in noisy and reverberant conditions
Aarabi et al. Robust sound localization using multi-source audiovisual information fusion
Nakadai et al. Improvement of recognition of simultaneous speech signals using av integration and scattering theory for humanoid robots
Vesperini et al. Localizing speakers in multiple rooms by using deep neural networks
Liu et al. Continuous sound source localization based on microphone array for mobile robots
Hu et al. Unsupervised multiple source localization using relative harmonic coefficients
Raykar et al. Speaker localization using excitation source information in speech
WO2020024816A1 (en) Audio signal processing method and apparatus, device, and storage medium
CN113870893B (en) Multichannel double-speaker separation method and system
CN112363112A (en) Sound source positioning method and device based on linear microphone array
CN114171041A (en) Voice noise reduction method, device and equipment based on environment detection and storage medium
CN113514801A (en) Microphone array sound source positioning method and sound source identification method based on deep learning
CN103901400B (en) A kind of based on delay compensation and ears conforming binaural sound source of sound localization method
CN112712818A (en) Voice enhancement method, device and equipment
Zhang et al. AcousticFusion: Fusing sound source localization to visual SLAM in dynamic environments
Yang et al. Srp-dnn: Learning direct-path phase difference for multiple moving sound source localization
Rascon et al. Lightweight multi-DOA tracking of mobile speech sources
Huang et al. A time-domain unsupervised learning based sound source localization method
Pertilä et al. Time Difference of Arrival Estimation with Deep Learning–From Acoustic Simulations to Recorded Data
Liu et al. Wavoice: An mmWave-Assisted Noise-Resistant Speech Recognition System
Zhao et al. Accelerated steered response power method for sound source localization via clustering search
Dwivedi et al. Long-term temporal audio source localization using sh-crnn
Liu et al. Deep learning based two-dimensional speaker localization with large ad-hoc microphone arrays

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant