WO2020042708A1 - Time-frequency masking and deep neural network-based sound source direction estimation method - Google Patents


Info

Publication number
WO2020042708A1
WO2020042708A1 (PCT/CN2019/090531)
Authority
WO
WIPO (PCT)
Prior art keywords
target
time
channel
signal
sound signal
Prior art date
Application number
PCT/CN2019/090531
Other languages
French (fr)
Chinese (zh)
Inventor
王中秋
李号
Original Assignee
大象声科(深圳)科技有限公司
Priority date
Filing date
Publication date
Application filed by 大象声科(深圳)科技有限公司 filed Critical 大象声科(深圳)科技有限公司
Publication of WO2020042708A1 publication Critical patent/WO2020042708A1/en

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 3/00: Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S 3/80: Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received, using ultrasonic, sonic or infrasonic waves
    • G01S 3/802: Systems for determining direction or deviation from predetermined direction

Definitions

  • the present disclosure relates to the technical field of computer applications, and in particular, to a method, an apparatus, an electronic device, and a storage medium for estimating a sound source direction based on time-frequency masking and a deep neural network.
  • Sound source localization in noisy environments has many applications in real life, such as human-computer interaction, robotics, and beamforming.
  • Traditionally, sound source localization algorithms such as GCC-PHAT (generalized cross-correlation with phase transform), SRP-PHAT (steered response power with phase transform), and MUSIC (multiple signal classification) are the most common. However, these algorithms can only locate the loudest signal source in the environment, and the loudest source may not be the target speaker at all.
  • For example, in environments with strong reverberation, directional noise, or diffuse noise, the summed GCC-PHAT coefficients can exhibit peaks caused by interference sources, and the noise subspace formed in the MUSIC algorithm from the eigenvectors of the noisy covariance matrix with the smallest eigenvalues may not correspond to true noise.
  • To improve robustness, early work applied SNR (signal-to-noise ratio) weighting to strengthen the target sound frequencies and obtain a higher SNR before running the GCC-PHAT algorithm, using SNR estimation methods such as voice-activity-detection-based algorithms or minimum-mean-square-error-based methods.
  • However, these algorithms usually assume that the noise is stationary, whereas noise in real environments is usually dynamic, which results in poor robustness of the direction estimate when localizing sound sources in real environments.
  • the present disclosure provides a sound source direction estimation method, device, electronic device, and storage medium based on time-frequency masking and deep neural network.
  • a sound source direction estimation method based on time-frequency masking and a deep neural network including:
  • the step of performing an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate the ratio mask corresponding to the target signal in the multi-channel sound signal includes:
  • performing an iterative operation on the short-time Fourier spectrum of each channel's sound signal through a pre-trained neural network model, and calculating the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal.
  • the step of performing an iterative operation on the short-time Fourier spectrum of each channel's sound signal through a pre-trained neural network model, and calculating the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal, includes:
  • taking the direct sound or the reverberant speech signal as the target, and using a deep recurrent neural network model with long short-term memory (LSTM) to calculate the ratio mask corresponding to each single-channel target signal in the multi-channel sound signal.
  • the step of fusing multiple ratio masks to form a single ratio mask includes:
  • multiplying the ratio masks produced for the target signal in the multi-channel sound signal element-wise on the corresponding time-frequency units.
  • in a first scheme, the step of masking and weighting the multi-channel sound signal with the single ratio mask to determine the position of the target sound source includes:
  • summing the masked generalized cross-correlation function along frequency and time, and selecting the direction corresponding to the maximum peak of the summed cross-correlation function as the target sound source direction.
  • in a second scheme, the step of masking and weighting the multi-channel sound signal with the single ratio mask to determine the position of the target sound source includes:
  • calculating, in each time-frequency unit, a covariance matrix of the short-time Fourier spectrum of the multi-channel sound signal;
  • selecting the candidate direction corresponding to the largest overall signal-to-noise ratio as the direction of the target sound source.
  • in a third scheme, the step of masking and weighting the multi-channel sound signal with the single ratio mask to determine the position of the target sound source includes:
  • selecting the candidate direction corresponding to the maximum cosine distance as the azimuth of the target sound source.
  • a sound source direction estimation device based on time-frequency masking and a deep neural network, including:
  • a sound signal acquisition module, configured to acquire multi-channel sound signals;
  • a short-time Fourier spectrum extraction module, configured to perform framing, windowing, and Fourier transformation on each channel's sound signal in the multi-channel sound signal to form the short-time Fourier spectrum of the multi-channel sound signal;
  • a ratio mask calculation module, configured to perform an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate a ratio mask corresponding to the target signal in the multi-channel sound signal;
  • a ratio mask fusion module, configured to fuse multiple ratio masks to form a single ratio mask;
  • a masking weighting module, configured to mask and weight the multi-channel sound signal with the single ratio mask to determine the position of the target sound source.
  • an electronic device including:
  • at least one processor; and
  • a memory connected in communication with the at least one processor; wherein,
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method according to the first aspect.
  • a computer-readable storage medium for storing a program that, when executed, causes an electronic device to perform the method according to the first aspect.
  • After the multi-channel sound signal is acquired, the pre-trained neural network model is used to calculate the ratio masks corresponding to the target signal in the multi-channel sound signal, and the multiple ratio masks are fused to form a single ratio mask.
  • The single ratio mask is then used to mask and weight the multi-channel sound signal to determine the direction of the target sound source, which provides strong robustness in low-SNR, strongly reverberant environments and improves the accuracy and stability of the target sound source direction estimate.
  • Fig. 1 is a flowchart illustrating a method for estimating a sound source direction based on time-frequency masking and a deep neural network according to an exemplary embodiment.
  • FIG. 2 is a first specific implementation flowchart of step S150 in the sound source azimuth estimation method based on time-frequency masking and deep neural network according to the embodiment of FIG. 1.
  • FIG. 3 is a second specific implementation flowchart of step S150 in the sound source azimuth estimation method based on time-frequency masking and deep neural network according to the embodiment of FIG. 1.
  • FIG. 4 is a third specific implementation flowchart of step S150 in the sound source azimuth estimation method based on time-frequency masking and deep neural network according to the embodiment of FIG. 1.
  • Fig. 5 is a schematic diagram of a binaural setup (a) and a schematic diagram of a dual microphone setup (b) according to an exemplary embodiment.
  • Fig. 6 is a block diagram of a sound source azimuth estimation device based on time-frequency masking and a deep neural network according to an exemplary embodiment.
  • FIG. 7 is a first block diagram of a masking weighting module 150 in a sound source azimuth estimation device based on a time-frequency masking and a deep neural network according to the embodiment shown in FIG. 6.
  • FIG. 8 is a second block diagram of a masking weighting module 150 in a sound source azimuth estimation device based on a time-frequency masking and a deep neural network according to the embodiment shown in FIG. 6.
  • FIG. 9 is a third block diagram of a masking weighting module 150 in a sound source azimuth estimation device based on a time-frequency masking and a deep neural network according to the embodiment shown in FIG. 6.
  • Fig. 1 is a flowchart illustrating a method for estimating a sound source position based on time-frequency masking and a deep neural network according to an exemplary embodiment.
  • the sound source azimuth estimation method based on time-frequency masking and deep neural network can be used in electronic devices such as smart phones, smart homes, and computers.
  • the sound source azimuth estimation method based on time-frequency masking and deep neural network may include steps S110, S120, S130, S140, and S150.
  • Step S110 Acquire a multi-channel sound signal.
  • TDOA (time difference of arrival) localization is a method of locating a source using differences in arrival time. TDOA determines the location of the target sound source by detecting the time difference between the arrival of the signal at two or more microphones. This method is widely used; therefore, the accuracy and robustness of the TDOA calculation are particularly important for localizing the target sound source.
  • A multi-channel sound signal is a mixture signal captured by two or more microphone channels.
  • the target sound source needs to be located in the environment based on the received multi-channel sound signals.
  • Step S120 Frame, window, and Fourier transform each sound signal in the multi-channel sound signal to form a short-time Fourier spectrum of the multi-channel sound signal.
  • Framing is to divide a single channel sound signal into multiple time frames according to a preset time period.
  • each channel of the multi-channel sound signal is divided into multiple time frames according to 20 milliseconds per frame, and there is a 10 millisecond overlap between every two adjacent time frames.
  • In an exemplary embodiment, an STFT (short-time Fourier transform) is applied to each time frame to extract the short-time Fourier spectrum, as sketched below.
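  • As an illustration only, the following sketch shows how a single channel could be framed, windowed, and transformed with the parameters above using NumPy; the function and parameter names are assumptions, not part of the disclosure.

```python
import numpy as np

def stft(signal, sr=16000, frame_ms=20, shift_ms=10):
    """Frame, window, and Fourier-transform one channel (illustrative sketch)."""
    frame_len = int(sr * frame_ms / 1000)   # 20 ms frames, e.g. 320 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 10 ms shift gives a 10 ms overlap
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // shift
    spec = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for t in range(n_frames):
        frame = signal[t * shift: t * shift + frame_len] * window
        spec[t] = np.fft.rfft(frame)         # one-sided short-time spectrum
    return spec                              # shape: (time frames, frequency bins)

# A multi-channel signal x of shape (samples, channels) would be processed per channel:
# specs = [stft(x[:, ch]) for ch in range(x.shape[1])]
```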
  • Step S130: Perform an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate the ratio mask corresponding to the target signal in the multi-channel sound signal.
  • The ratio mask characterizes the relationship between a noisy speech signal and the clean speech signal, and indicates an appropriate trade-off between noise suppression and speech retention.
  • Ideally, after the noisy speech signal is masked with the ratio mask, the speech spectrum can be recovered from the noisy speech.
  • The neural network model is pre-trained. The short-time Fourier spectrum of the multi-channel sound signal is extracted and iteratively processed in the neural network model to calculate the ratio masks of the multi-channel sound signal.
  • Optionally, a ratio mask corresponding to each single-channel sound signal in the multi-channel sound signal is calculated separately through the pre-trained neural network model; each single-channel sound signal is then masked with its own ratio mask, applying different weights to different time-frequency (T-F) units, thereby sharpening the peaks corresponding to the target speech in the multi-channel sound signal and suppressing the peaks corresponding to the noise sources.
  • When calculating the ratio mask for each single-channel sound signal, a deep recurrent neural network model with long short-term memory (LSTM) is used to calculate the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal, so that the calculated ratio mask is closer to the ideal ratio mask.
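  • A minimal sketch of such a mask estimator, assuming a two-layer LSTM with 500 units per layer (the configuration reported in the experiments below) operating on per-channel magnitude spectra, is shown here; the class and variable names are illustrative, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Two-layer LSTM mapping one channel's magnitude STFT to a ratio mask in [0, 1]."""
    def __init__(self, n_bins=257, hidden=500):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_bins, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, mag):                  # mag: (batch, frames, frequency bins)
        h, _ = self.lstm(mag)
        return torch.sigmoid(self.out(h))    # sigmoid keeps mask values in [0, 1]

# Training would minimize the mean square error between the predicted mask and the
# ideal ratio mask, e.g. with torch.optim.Adam, as described in the experiments below.
```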
  • Formula (1) gives the ideal ratio mask corresponding to each channel's sound signal in the multi-channel sound signal when the reverberant speech signal is taken as the target. Formula (2) gives the ideal ratio mask corresponding to each channel's sound signal when the direct sound is taken as the target.
  • Reverberant speech is the sound from the source that reaches the microphone after being reflected back and forth in all directions; its acoustic energy is gradually attenuated during propagation as it is absorbed by the walls.
  • Direct sound is the sound transmitted from the source to the microphone along a straight line without any reflection; it determines the clarity of the sound.
  • Here i indicates the microphone channel, and c(f)s(t,f), h(t,f), and n(t,f) are the short-time Fourier transform (STFT) vectors of the direct sound, the reverberation, and the noise, respectively.
  • Since the TDOA information is mainly contained in the direct sound, taking the direct-sound signal as the target may make the ratio mask model closer to the real environment.
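  • The images of formulas (1) and (2) do not survive in this text. Based on the symbols defined above, a plausible reconstruction of the two ideal ratio mask definitions, with the reverberant speech and the direct sound as targets respectively, is given below; the exact original notation may differ.

```latex
\mathrm{IRM}^{\mathrm{rev}}_i(t,f) =
  \frac{\lvert c_i(f)\,s(t,f) + h_i(t,f)\rvert}
       {\lvert c_i(f)\,s(t,f) + h_i(t,f)\rvert + \lvert n_i(t,f)\rvert}
\qquad (1)

\mathrm{IRM}^{\mathrm{dir}}_i(t,f) =
  \frac{\lvert c_i(f)\,s(t,f)\rvert}
       {\lvert c_i(f)\,s(t,f)\rvert + \lvert h_i(t,f) + n_i(t,f)\rvert}
\qquad (2)
```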
  • Step S140: Fuse the multiple ratio masks to form a single ratio mask.
  • Each single-channel sound signal has a corresponding ratio mask.
  • For a multi-channel sound signal comprising multiple single-channel sound signals, there are therefore multiple corresponding ratio masks.
  • The invention fuses the multiple ratio masks to form a single ratio mask.
  • Optionally, the ratio masks generated for the channels of the multi-channel sound signal may be multiplied element-wise on the corresponding time-frequency units to form the single ratio mask, as in the sketch below.
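  • A minimal sketch of this fusion step, assuming per-channel masks with values in [0, 1] stored as NumPy arrays of shape (frames, bins), is:

```python
import numpy as np

def fuse_masks(masks):
    """Fuse per-channel ratio masks by element-wise multiplication (illustrative)."""
    fused = np.ones_like(masks[0])
    for m in masks:              # each mask has shape (frames, bins), values in [0, 1]
        fused *= m
    return fused                 # stays large only where every channel is speech-dominant
```

  • With this choice, a time-frequency unit retains a high weight only if the target speech dominates it in every channel.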
  • Step S150: Mask and weight the multi-channel sound signal with the single ratio mask to determine the position of the target sound source.
  • In a noisy environment, some T-F units are dominated by the target speech. These T-F units, whose phases are less corrupted, are often sufficient to achieve robust localization of the target sound source.
  • Through masking and weighting, the contribution of these speech-dominant units to localization is increased, thereby improving the robustness of the computed TDOA and the accuracy of target sound source localization.
  • step S150 may include steps S151, S152, and S153.
  • In step S151, the short-time Fourier spectrum of the multi-channel input signal is used to calculate the generalized cross-correlation with phase transform (GCC-PHAT) function.
  • In step S152, the generalized cross-correlation function is masked using the single ratio mask.
  • In step S153, the masked generalized cross-correlation function is summed along frequency and time, and the direction corresponding to the maximum peak of the summed cross-correlation function is selected as the target sound source direction.
  • In this embodiment, a deep recurrent neural network model with long short-term memory (LSTM) is used to calculate the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal.
  • the invention can be directly applied to microphone arrays of various geometric shapes.
  • The pair of microphone signals can be modeled as y(t, f) = c(f) s(t, f) + h(t, f) + n(t, f), where s(t, f) is the short-time Fourier transform (STFT) value of the target sound source at time t and frequency f, c(f) is the relative transfer function, and y(t, f) is the STFT vector of the received mixture.
  • The relative transfer function takes the form c(f) = [1, a(f) e^(-j 2π f f_s τ* / N)]^T, where τ* is the underlying time delay in seconds, j is the imaginary unit, a(f) is a real-valued gain, f_s is the sampling rate in Hz, N is the number of DFT frequencies, and [·]^T denotes matrix transpose; f ranges from 0 to N/2.
  • The GCC-PHAT coefficient of each T-F unit for a candidate delay is obtained by compensating the phase of one channel with the candidate delay and taking the real part of the phase-normalized cross-power spectrum of the two channels, where (·)^H denotes the conjugate transpose, Real{·} extracts the real part, |·| computes the magnitude, and subscripts 1 and 2 index the microphone channels.
  • the algorithm first uses candidate delays to align the two microphone signals, then calculates their phase difference and cosine distance. If the cosine distance is close to 1, it means that the candidate delay is close to the true delay (phase difference). Therefore, each GCC coefficient is between -1 and 1. Assuming that the sound source is fixed in each utterance, the GCC coefficients are summed together, and the maximum value is taken as the estimated value of the time delay. PHAT weights are essential here. Without normalization, frequencies with higher energies will have larger GCC coefficients and dominate the summation.
  • The present invention computes the masked GCC-PHAT function by weighting the multi-channel sound signal with the mask:
  • GCC-PHAT_MASK(t, f, τ) = η(t, f) · GCC-PHAT(t, f, τ),    (6)
  • where η(t, f) is the mask-based weighting term applied to each T-F unit in the TDOA estimation. It is defined in formula (7) as the product of the per-channel ratio masks, where the ratio mask corresponding to channel i represents the proportion of target speech energy in each T-F unit of that channel.
  • By weighting the multi-channel sound signals in this way and summing the masked generalized cross-correlation function along frequency and time, the direction corresponding to the maximum peak of the summed cross-correlation function is selected as the target sound source direction, which greatly improves the accuracy of determining the target sound source. A sketch of this scheme follows.
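  • A minimal sketch of the mask-weighted GCC-PHAT search over candidate delays, assuming two-channel STFTs and the fused mask from the step above; the sign convention of the phase term and the function names are assumptions, not taken from the disclosure.

```python
import numpy as np

def masked_gcc_phat_tdoa(spec1, spec2, mask, candidate_delays, sr=16000, n_fft=512):
    """Mask-weighted GCC-PHAT TDOA estimate for one microphone pair (sketch).

    spec1, spec2: STFTs of the two channels, shape (frames, bins)
    mask:         fused ratio mask, same shape, values in [0, 1]
    candidate_delays: candidate TDOAs in seconds
    """
    bins = np.arange(spec1.shape[1])                 # frequency index f = 0 .. N/2
    cross = spec1 * np.conj(spec2)
    cross /= np.abs(spec1) * np.abs(spec2) + 1e-12   # PHAT (phase) normalization
    scores = []
    for tau in candidate_delays:
        steer = np.exp(1j * 2 * np.pi * bins * sr * tau / n_fft)
        gcc = np.real(cross * steer)                 # per-T-F GCC-PHAT coefficient in [-1, 1]
        scores.append(np.sum(mask * gcc))            # mask-weighted sum over time and frequency
    return candidate_delays[int(np.argmax(scores))]
```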
  • step S150 may include step S154, step S155, step S156, step S157, step S158, step S159, and step S160.
  • step S154 the covariance matrix of the short-time Fourier spectrum of the multi-channel sound signal is calculated in each time-frequency unit.
  • In step S155, the single ratio mask is used to mask the covariance matrices; at each individual frequency, the masked covariance matrices are summed along the time dimension to obtain the covariance matrices of the target speech and the background noise at different frequencies, respectively.
  • In step S156, according to the topology of the microphone array, the steering vectors of the candidate directions at different frequencies are calculated.
  • In step S157, according to the noise covariance matrix and the candidate steering vector, the filter coefficients for MVDR (minimum variance distortionless response) beamforming at different frequencies are calculated.
  • In step S158, the beamforming filter coefficients and the target speech covariance matrix are used to calculate the energy of the target speech at different frequencies, and the beamforming filter coefficients and the noise covariance matrix are used to calculate the energy of the background noise at different frequencies.
  • In step S159, the energy ratio of the target speech to the noise is calculated at different frequencies and summed along the frequency dimension to form the overall signal-to-noise ratio for a given candidate direction.
  • In step S160, the candidate direction corresponding to the largest overall signal-to-noise ratio is selected as the azimuth of the target sound source.
  • Here η(t, f), the single ratio mask, is calculated using formula (7).
  • Formula (7) means that only speech-dominated time-frequency units contribute to the target speech covariance matrix, and the more a T-F unit is dominated by the target speech, the greater the weight it receives.
  • Equation (8) uses a similar method to calculate the interference (noise) covariance matrix.
  • The MVDR (minimum variance distortionless response) beamforming filter is then computed from the noise covariance matrix and the candidate steering vector.
  • The SNR of the beamformed signal can be obtained by calculating the energies of the beamformed target speech and noise:
  • the sound source orientation can be predicted as:
  • In Equation (13), we limit the SNR to between 0 and 1. This is conceptually similar to the PHAT weighting in the GCC-PHAT algorithm, where the GCC coefficient of each T-F unit is normalized to between -1 and 1. We can also place more weight on frequencies with higher SNR:
  • a frequency weighting term over f can be defined for this purpose. A sketch of this steered-response SNR scheme follows.
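  • A minimal sketch of the mask-weighted steered-response SNR scheme, assuming pre-computed steering vectors per candidate direction and omitting the optional frequency weighting; all names are illustrative, not taken from the disclosure.

```python
import numpy as np

def steered_response_snr(specs, mask, steering_vectors):
    """Mask-weighted steered-response SNR localization (sketch).

    specs: per-channel STFTs, shape (channels, frames, bins)
    mask:  fused ratio mask, shape (frames, bins)
    steering_vectors: dict mapping candidate direction -> array of shape (bins, channels)
    Returns the candidate direction with the largest summed SNR.
    """
    C, T, F = specs.shape
    eye = 1e-6 * np.eye(C)
    # Mask-weighted speech and noise covariance matrices per frequency
    phi_s = np.zeros((F, C, C), dtype=complex)
    phi_n = np.zeros((F, C, C), dtype=complex)
    for f in range(F):
        y = specs[:, :, f]                                   # (channels, frames)
        outer = np.einsum('ct,dt->tcd', y, np.conj(y))       # per-frame outer products
        phi_s[f] = np.tensordot(mask[:, f], outer, axes=1)   # speech-dominated units
        phi_n[f] = np.tensordot(1.0 - mask[:, f], outer, axes=1)
    best_dir, best_snr = None, -np.inf
    for direction, d in steering_vectors.items():
        snr_sum = 0.0
        for f in range(F):
            dv = d[f][:, None]                               # steering vector, (channels, 1)
            num = np.linalg.solve(phi_n[f] + eye, dv)        # R_n^{-1} d
            w = num / (dv.conj().T @ num)                    # MVDR beamforming weights
            s_energy = np.real(w.conj().T @ phi_s[f] @ w).item()
            n_energy = np.real(w.conj().T @ phi_n[f] @ w).item()
            snr_sum += s_energy / (s_energy + n_energy + 1e-12)  # SNR limited to [0, 1]
        if snr_sum > best_snr:
            best_dir, best_snr = direction, snr_sum
    return best_dir
```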
  • the third solution of step S150 may include steps S161, S162, S163, S164, and S165.
  • In step S161, at each frequency, eigendecomposition is applied to the target speech covariance matrix, and the eigenvector corresponding to the largest eigenvalue is selected as the steering vector of the target speech.
  • Step S162: Use the steering vector of the target speech to calculate the time difference of arrival between the microphone signals.
  • Step S163: Calculate, for each candidate direction, the time difference of arrival between the microphones according to the microphone array topology.
  • Step S164: Calculate the cosine distance between the time difference of arrival derived from the microphone signals and the time differences of arrival of the candidate directions.
  • step S165 a candidate direction corresponding to the maximum cosine distance is selected as the azimuth of the target sound source.
  • The steering vector can be calculated by eigendecomposition, where P{·} extracts the principal eigenvector of the estimated speech covariance matrix calculated in formula (8).
  • If the target speech covariance matrix is estimated accurately, it will be close to a rank-1 matrix, so its principal eigenvector is a reasonable estimate of the steering vector.
  • The basic principle is to calculate the steering vector independently at each frequency; therefore, the linear-phase assumption is not strictly enforced.
  • The present invention enumerates all candidate time delays and searches for the delay τ whose phase delays best match the phase of the estimated steering vector at each frequency; this best-matching delay is used as the final prediction. Similar to formula (15), the frequency weighting is used to emphasize frequencies with higher SNR. A sketch of this scheme follows.
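  • A minimal two-microphone sketch of this steering-vector scheme, using the principal eigenvector of the mask-weighted speech covariance matrix at each frequency and matching candidate delays by the cosine of the phase difference; the frequency weighting is omitted and all names are assumptions, not taken from the disclosure.

```python
import numpy as np

def steering_vector_tdoa(phi_s, candidate_delays, sr=16000, n_fft=512):
    """Estimate the TDOA from per-frequency principal eigenvectors (sketch).

    phi_s: mask-weighted speech covariance matrices, shape (bins, 2, 2)
    candidate_delays: candidate TDOAs in seconds for one microphone pair
    """
    F = phi_s.shape[0]
    phases = np.zeros(F)
    for f in range(F):
        vals, vecs = np.linalg.eigh(phi_s[f])
        d = vecs[:, np.argmax(vals)]                  # principal eigenvector ~ steering vector
        phases[f] = np.angle(d[1] * np.conj(d[0]))    # inter-microphone phase difference
    bins = np.arange(F)
    scores = []
    for tau in candidate_delays:
        hypo = 2 * np.pi * bins * sr * tau / n_fft    # phase implied by the candidate delay
        # Cosine similarity between observed and hypothesized phase differences
        scores.append(np.sum(np.cos(phases - hypo)))
    return candidate_delays[int(np.argmax(scores))]
```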
  • In summary, after the multi-channel sound signal is acquired, the pre-trained neural network model is used to calculate the ratio masks corresponding to the multi-channel sound signal, the multiple ratio masks are fused into a single ratio mask, and the multi-channel sound signal is then masked and weighted by the single ratio mask to determine the direction of the target sound source.
  • The invention therefore has strong robustness in low-SNR and strongly reverberant environments, and improves the accuracy and stability of the target sound source direction estimate.
  • Fig. 5 is a schematic diagram showing a binaural setting and a dual microphone setting according to an exemplary embodiment.
  • the average duration of mixed speech is 2.4 seconds.
  • the calculated input SNR for the reverberation speech and reverberation noise of the two data sets is -6dB. If we consider the direct sound signal as the target speech and the remaining signals as noise, the SNR will be lower.
  • the LSTM contains two hidden layers, each with 500 neurons.
  • the Adam algorithm is used to minimize the mean square error of the ratio film estimation.
  • the window length is 32 ms and the window shift size is 8 ms.
  • the sampling rate is 16kHz.
  • An RIR (room impulse response) generator based on the image method is used to generate RIRs to simulate reverberation.
  • For the training and validation data, an interfering speaker is placed in each of 36 directions, from -87.5° to 87.5° in steps of 5°, and the target speaker is in one of the 36 directions.
  • For the test data, an interfering speaker is placed in each of 37 directions, from -90° to 90° in steps of 5°, and the target speaker is in any of the 37 directions. In this way, the test RIRs are unseen during training.
  • the distance between the target speaker and the center of the array is 1 meter.
  • the size of the room is fixed at 8x8x3m, and two microphones are placed in the center of the room.
  • the distance between the two microphones is 0.2 meters, and the height is set to 1.5 meters.
  • The T60 of each mixed speech segment is randomly selected from 0.0 s to 1.0 s in steps of 0.1 s. IEEE and TIMIT sentences are used to generate the training, validation, and test speech.
  • the binaural room impulse response was simulated using software, where the T60 (reverberation time) range was from 0.0s to 1.0s with a step size of 0.1s.
  • the simulation room size is fixed at 6x4x3m.
  • The BRIRs were simulated by placing the binaural receiver around the center of the room at a height of 2 meters, with the sound source located in one of 37 directions (from -90° to 90° in 5° steps), at the same height as the array and 1.5 meters from the array center.
  • Real BRIRs, collected with a HATS dummy head in four real rooms of different sizes and T60 values, were used for testing.
  • The dummy head was placed at a height of 2.8 meters, and the distance from the sound source to the array was 1.5 meters.
  • The real BRIRs were also measured using the same 37 directions.
  • 720 female IEEE sentences were used as the target speech.
  • For the interfering speech, the sentences of the 630 speakers in the TIMIT dataset are concatenated, and 37 randomly selected speakers and their speech segments are placed in the 37 directions.
  • For each interfering speaker, the first half of the concatenated utterance is used to generate training and validation noise, and the second half is used to generate test noise. There are 10,000, 800, and 3,000 binaural mixtures in the training, validation, and test sets, respectively.
  • The overall localization accuracy results are shown in Tables 1 and 2, where gray marks the performance obtained with the ideal ratio mask.
  • The tables also show the direct-to-reverberant energy ratio (DRR) for each T60 level.
  • The proposed mask-weighted GCC-PHAT algorithm significantly improves over the traditional GCC-PHAT algorithm (in Table 1, from 25.8% to 78.5% and 88.2%; in Table 2, from 29.4% to 91.3% and 90.8%).
  • the steering vector-based TDOA estimation algorithm shows the strongest robustness among all algorithms, especially when T60 is high.
  • Since the time-delay information is mainly contained in the direct sound, in the dual-microphone setting, defining the IRM with the direct sound as the target speech is consistently better than defining it with the reverberant sound as the target (88.2% vs. 78.5%, 90.5% vs. 86.7%, and 91.0% vs. 86.4%).
  • The mask-weighted steered-response SNR algorithm performs relatively worse in the binaural setting than in the dual-microphone setting.
  • In the binaural case, the gains of the different channels are not simply equal. Therefore, estimating the IRM with the reverberant sound as the target speech performs slightly better in the binaural setting than using the direct sound as the target (91.3% vs. 90.8%, 86.4% vs. 70.0%, and 92.0% vs. 91.1%).
  • The following is a device embodiment of the present disclosure, which can be used to implement the foregoing method embodiments of the sound source azimuth estimation method based on time-frequency masking and deep neural networks.
  • For details not disclosed in the device embodiment, please refer to the embodiments of the sound source azimuth estimation method based on time-frequency masking and deep neural networks of the present disclosure.
  • Fig. 6 is a block diagram of a sound source azimuth estimation device based on time-frequency masking and deep neural network according to an exemplary embodiment.
  • The device includes, but is not limited to, a sound signal acquisition module 110, a short-time Fourier spectrum extraction module 120, a ratio mask calculation module 130, a ratio mask fusion module 140, and a masking weighting module 150.
  • a sound signal obtaining module 110 configured to obtain a multi-channel sound signal
  • the short-time Fourier spectrum extraction module 120 is configured to perform frame, window, and Fourier transform on each channel of the multi-channel sound signal to form a short-time Fourier spectrum of the multi-channel sound signal;
  • The ratio mask calculation module 130 is configured to perform an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate a ratio mask corresponding to the target signal in the multi-channel sound signal;
  • the ratio mask fusion module 140 is configured to fuse multiple ratio masks to form a single ratio mask;
  • the masking weighting module 150 is configured to mask and weight the multi-channel sound signal with the single ratio mask and determine the position of the target sound source.
  • The ratio mask calculation module 130 in FIG. 6 includes, but is not limited to, a ratio mask calculation unit.
  • The ratio mask calculation unit is used to perform an iterative operation on the short-time Fourier spectrum of each channel's sound signal through a pre-trained neural network model, and to calculate the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal.
  • Specifically, the ratio mask calculation unit may take the direct sound or the reverberant speech signal as the target, and use a deep recurrent neural network model with long short-term memory (LSTM) to calculate the ratio mask corresponding to each single-channel target signal in the multi-channel sound signal.
  • The ratio mask fusion module 140 in FIG. 6 is specifically used to multiply the ratio masks generated for the target in the multi-channel sound signal element-wise on the corresponding time-frequency units.
  • the masking weighting module 150 in FIG. 6 includes, but is not limited to, a generalized cross-correlation function calculation sub-module 151, a masking sub-module 152, and an orientation determination sub-module 153.
  • the generalized cross-correlation function calculation sub-module 151 is configured to calculate a generalized cross-correlation function using a short-time Fourier spectrum of a multi-channel input signal;
  • a first azimuth determining submodule 153 is configured to add the masked generalized cross-correlation function along frequency and time, and select a direction corresponding to a maximum peak position of the summed cross-correlation function as the azimuth of the target sound source.
  • the second scheme of the masking weighting module 150 in FIG. 6 includes, but is not limited to, a covariance matrix calculation submodule 154, a covariance matrix masking submodule 155, a candidate direction guidance vector calculation submodule 156, The beamforming filter coefficient calculation sub-module 157, the energy calculation sub-module 158, the overall signal-to-noise ratio calculation sub-module 159, and the second orientation determination sub-module 160.
  • a covariance matrix calculation submodule 154 configured to calculate a covariance matrix of a short-time Fourier spectrum of a multi-channel sound signal in each time-frequency unit;
  • the covariance matrix masking sub-module 155 is used to mask the covariance matrices with the single ratio mask; at each individual frequency, the masked covariance matrices are summed along the time dimension to obtain the covariance matrices of the target speech and the noise at different frequencies;
  • Candidate direction steering vector calculation submodule 156 for calculating the steering vectors of candidate directions at different frequencies according to the topology of the microphone array
  • the beamforming filter coefficient calculation submodule 157 is configured to calculate filter coefficients of MVDR beamforming at different frequencies according to a noise covariance matrix and a candidate steering vector;
  • An energy calculation sub-module 158 is configured to calculate energy of a target voice at different frequencies by using beamforming filter coefficients and a target voice covariance matrix, and calculate energy of noise at different frequencies by using a beamforming filter coefficient and a noise covariance matrix;
  • the overall signal-to-noise ratio forming sub-module 159 is used to calculate the energy ratio of the target speech and noise at different frequencies, and sum them along the frequency dimension to form an overall signal-to-noise ratio in a certain candidate direction;
  • the second orientation determination sub-module 160 is configured to select a candidate direction corresponding to the largest overall signal-to-noise ratio as the orientation of the target sound source.
  • The third scheme of the masking weighting module 150 in FIG. 6 includes, but is not limited to, a speech steering vector calculation sub-module 161, an arrival time difference calculation sub-module 162, a candidate direction arrival time difference sub-module 163, a cosine distance calculation sub-module 164, and a third azimuth determination sub-module 165.
  • a speech steering vector calculation sub-module 161, which applies eigendecomposition to the target speech covariance matrix at different frequencies and selects the eigenvector corresponding to the largest eigenvalue as the steering vector of the target speech;
  • Arrival time difference calculation sub-module 162 for calculating the arrival time difference between the microphone signals by using the steering vector of the target voice
  • Candidate direction arrival time difference sub-module 163 is configured to calculate the difference in arrival time between candidate directions between microphones according to the microphone array topology
  • a cosine distance calculation sub-module 164, configured to calculate the cosine distance between the time difference of arrival derived from the microphone signals and the time differences of arrival of the candidate directions between the microphones;
  • the third azimuth determination sub-module 165 is configured to select the candidate direction corresponding to the maximum cosine distance as the azimuth of the target sound source.
  • the present invention further provides an electronic device that performs all or part of the steps of the sound source azimuth estimation method based on the time-frequency masking and the deep neural network as shown in any of the above exemplary embodiments.
  • Electronic equipment includes:
  • a memory connected in communication with the processor; wherein,
  • the memory stores readable instructions that, when executed by the processor, implement the method according to any one of the foregoing exemplary embodiments.
  • In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, such as a temporary or non-transitory computer-readable storage medium including instructions.

Abstract

A time-frequency masking and deep neural network-based sound source direction estimation method and device, an electronic device, and a storage medium, belonging to the field of computer technology. The method comprises: acquiring a multi-channel sound signal (S110); performing framing, windowing, and Fourier transformation on each channel's sound signal in the multi-channel signal to form the short-time Fourier spectrum of the multi-channel sound signal (S120); performing an iterative operation on the short-time Fourier spectrum by means of a pre-trained neural network model to calculate the ratio masks corresponding to target signals in the multi-channel sound signal (S130); fusing the multiple ratio masks to form a single ratio mask (S140); and masking and weighting the multi-channel signal by means of the single ratio mask to determine the direction of a target sound source (S150). The method and device are strongly robust in low signal-to-noise ratio and strongly reverberant environments, and improve the accuracy and stability of target sound source direction estimation.

Description

Sound source direction estimation method based on time-frequency masking and deep neural network

Technical Field

The present disclosure relates to the technical field of computer applications, and in particular to a sound source direction estimation method, apparatus, electronic device, and storage medium based on time-frequency masking and a deep neural network.

Background

Sound source localization in noisy environments has many real-life applications, such as human-computer interaction, robotics, and beamforming. Traditionally, sound source localization algorithms such as GCC-PHAT (generalized cross-correlation with phase transform), SRP-PHAT (steered response power with phase transform), and MUSIC (multiple signal classification) are the most common. However, these algorithms can only locate the loudest signal source in the environment, and the loudest source may not be the target speaker at all. For example, in environments with strong reverberation, directional noise, or diffuse noise, the summed GCC-PHAT coefficients can exhibit peaks caused by interference sources, and the noise subspace formed in the MUSIC algorithm from the eigenvectors of the noisy covariance matrix with the smallest eigenvalues may not correspond to true noise.

To improve robustness, early work applied SNR (signal-to-noise ratio) weighting to strengthen the target sound frequencies and obtain a higher SNR before running the GCC-PHAT algorithm, using SNR estimation methods such as voice-activity-detection-based algorithms or minimum-mean-square-error-based methods. However, these algorithms usually assume that the noise is stationary, whereas noise in real environments is usually dynamic, which results in poor robustness of the direction estimate when localizing sound sources in real environments.

Summary of the Invention

To solve the technical problem of poor robustness of direction estimation, the present disclosure provides a sound source direction estimation method, device, electronic device, and storage medium based on time-frequency masking and a deep neural network.
In a first aspect, a sound source direction estimation method based on time-frequency masking and a deep neural network is provided, including:

acquiring a multi-channel sound signal;

performing framing, windowing, and Fourier transformation on each channel's sound signal in the multi-channel sound signal to form the short-time Fourier spectrum of the multi-channel sound signal;

performing an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate the ratio mask corresponding to the target signal in the multi-channel sound signal;

fusing multiple ratio masks to form a single ratio mask; and

masking and weighting the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source.

Optionally, the step of performing an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate the ratio mask corresponding to the target signal in the multi-channel sound signal includes:

performing an iterative operation on the short-time Fourier spectrum of each channel's sound signal through a pre-trained neural network model, and calculating the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal.

Optionally, the step of performing an iterative operation on the short-time Fourier spectrum of each channel's sound signal through a pre-trained neural network model, and calculating the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal, includes:

taking the direct sound or the reverberant speech signal as the target, and using a deep recurrent neural network model with long short-term memory (LSTM) to calculate the ratio mask corresponding to each single-channel target signal in the multi-channel sound signal.

Optionally, the step of fusing multiple ratio masks to form a single ratio mask includes:

multiplying the ratio masks produced for the target signal in the multi-channel sound signal element-wise on the corresponding time-frequency units.

Optionally, in a first scheme, the step of masking and weighting the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source includes:

calculating the generalized cross-correlation function using the short-time Fourier spectrum of the multi-channel input signal;

masking the generalized cross-correlation function with the single ratio mask; and

summing the masked generalized cross-correlation function along frequency and time, and selecting the direction corresponding to the maximum peak of the summed cross-correlation function as the direction of the target sound source.
Optionally, in a second scheme, the step of masking and weighting the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source includes:

calculating, in each time-frequency unit, a covariance matrix of the short-time Fourier spectrum of the multi-channel sound signal;

masking the covariance matrices with the single ratio mask, and, at each individual frequency, summing the masked covariance matrices along the time dimension to obtain the covariance matrices of the target speech and the noise at different frequencies, respectively;

calculating the steering vectors of the candidate directions at different frequencies according to the topology of the microphone array;

calculating the filter coefficients of MVDR beamforming at different frequencies according to the noise covariance matrix and the candidate steering vector;

calculating the energy of the target speech at different frequencies using the beamforming filter coefficients and the target speech covariance matrix, and calculating the energy of the noise at different frequencies using the beamforming filter coefficients and the noise covariance matrix;

calculating the energy ratio of the target speech to the noise at different frequencies and summing it along the frequency dimension to form the overall signal-to-noise ratio for a given candidate direction; and

selecting the candidate direction corresponding to the largest overall signal-to-noise ratio as the direction of the target sound source.

Optionally, in a third scheme, the step of masking and weighting the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source includes:

applying eigendecomposition to the target speech covariance matrix at different frequencies, and selecting the eigenvector corresponding to the largest eigenvalue as the steering vector of the target speech;

calculating the time difference of arrival between the microphone signals using the steering vector of the target speech;

calculating, for each candidate direction, the time difference of arrival between the microphones according to the microphone array topology;

calculating the cosine distance between the time difference of arrival derived from the microphone signals and the time differences of arrival of the candidate directions between the microphones; and

selecting the candidate direction corresponding to the maximum cosine distance as the azimuth of the target sound source.
In a second aspect, a sound source direction estimation device based on time-frequency masking and a deep neural network is provided, including:

a sound signal acquisition module, configured to acquire a multi-channel sound signal;

a short-time Fourier spectrum extraction module, configured to perform framing, windowing, and Fourier transformation on each channel's sound signal in the multi-channel sound signal to form the short-time Fourier spectrum of the multi-channel sound signal;

a ratio mask calculation module, configured to perform an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate the ratio mask corresponding to the target signal in the multi-channel sound signal;

a ratio mask fusion module, configured to fuse multiple ratio masks to form a single ratio mask; and

a masking weighting module, configured to mask and weight the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source.

In a third aspect, an electronic device is provided, including:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to the first aspect.

In a fourth aspect, a computer-readable storage medium is provided for storing a program that, when executed, causes an electronic device to perform the method according to the first aspect.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:

When localizing a target sound source by estimating its time difference of arrival, after the multi-channel sound signal is acquired, the pre-trained neural network model is used to calculate the ratio masks corresponding to the target signal in the multi-channel sound signal, and the multiple ratio masks are fused to form a single ratio mask. The single ratio mask is then used to mask and weight the multi-channel sound signal to determine the direction of the target sound source, which provides strong robustness in low-SNR, strongly reverberant environments and improves the accuracy and stability of the target sound source direction estimate.

It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the scope of the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS

The drawings herein are incorporated into and constitute a part of the specification, illustrate embodiments consistent with the invention, and together with the description serve to explain the principles of the invention.

Fig. 1 is a flowchart illustrating a sound source direction estimation method based on time-frequency masking and a deep neural network according to an exemplary embodiment.

FIG. 2 is a first specific implementation flowchart of step S150 in the sound source azimuth estimation method based on time-frequency masking and deep neural network according to the embodiment of FIG. 1.

FIG. 3 is a second specific implementation flowchart of step S150 in the sound source azimuth estimation method based on time-frequency masking and deep neural network according to the embodiment of FIG. 1.

FIG. 4 is a third specific implementation flowchart of step S150 in the sound source azimuth estimation method based on time-frequency masking and deep neural network according to the embodiment of FIG. 1.

Fig. 5 is a schematic diagram of a binaural setup (a) and a dual-microphone setup (b) according to an exemplary embodiment.

Fig. 6 is a block diagram of a sound source azimuth estimation device based on time-frequency masking and a deep neural network according to an exemplary embodiment.

FIG. 7 is a first block diagram of the masking weighting module 150 in the sound source azimuth estimation device based on time-frequency masking and a deep neural network according to the embodiment shown in FIG. 6.

FIG. 8 is a second block diagram of the masking weighting module 150 in the sound source azimuth estimation device based on time-frequency masking and a deep neural network according to the embodiment shown in FIG. 6.

FIG. 9 is a third block diagram of the masking weighting module 150 in the sound source azimuth estimation device based on time-frequency masking and a deep neural network according to the embodiment shown in FIG. 6.

DETAILED DESCRIPTION

Exemplary embodiments are described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.
图1是根据一示例性实施例示出的一种基于时频掩蔽和深度神经网络的声 源方位估计方法的流程图。该基于时频掩蔽和深度神经网络的声源方位估计方法可用于智能手机、智能家居、电脑等电子设备中。如图1所示,该基于时频掩蔽和深度神经网络的声源方位估计方法可以包括步骤S110、步骤S120、步骤S130、步骤S140和步骤S150。Fig. 1 is a flowchart illustrating a method for estimating a sound source position based on time-frequency masking and a deep neural network according to an exemplary embodiment. The sound source azimuth estimation method based on time-frequency masking and deep neural network can be used in electronic devices such as smart phones, smart homes, and computers. As shown in FIG. 1, the sound source azimuth estimation method based on time-frequency masking and deep neural network may include steps S110, S120, S130, S140, and S150.
步骤S110,获取多通道声音信号。Step S110: Acquire a multi-channel sound signal.
TDOA(Time Difference of Arrival,到达时间差)定位是一种利用到达时间差进行定位的方法。通过测量信号到达监测点的时间,可以确定目标声源的距离。利用目标声源到各个麦克风的距离,就能确定目标声源的位置。但是声源在空间转播时间比较难测量。通过比较声音信号到达各麦克风的到达时间差,能较好确定声源的位置。TDOA (Time Difference of Arrival) positioning is a method of locating using the time of arrival difference. By measuring the time it takes for the signal to reach the monitoring point, the distance of the target sound source can be determined. Using the distance from the target sound source to each microphone, the position of the target sound source can be determined. However, it is more difficult to measure the sound source's time in space. By comparing the difference in the time of arrival of the sound signal to each microphone, the position of the sound source can be better determined.
不同于计算转播时间,TDOA是通过检测信号到达两个或多个麦克风的时间差来确定目标声源的位置。这一方法被广泛采用。因此,TDOA计算的准确性和鲁棒性在目标声源的定位中就显得尤为重要。多通道声音信号是包含2个或2个以上麦克风通道混合的声音信号。Unlike calculating the broadcast time, TDOA determines the location of the target sound source by detecting the time difference between when the signals arrive at two or more microphones. This method is widely used. Therefore, the accuracy and robustness of TDOA calculation is particularly important in the localization of the target sound source. A multi-channel sound signal is a sound signal containing a mix of two or more microphone channels.
通常地,多个麦克风装设于噪音环境中的不同位置,通过麦克风接收不同位置的声音信号。但在现实环境中,除了目标声源所发出的声音信号外,还有其他噪声声源发出的声音信号。因此,需根据接收的多通道声音信号,在所处环境中进行目标声源的定位。Generally, multiple microphones are installed at different positions in a noisy environment, and the sound signals at different positions are received through the microphones. But in the real environment, in addition to the sound signal from the target sound source, there are sound signals from other noise sound sources. Therefore, the target sound source needs to be located in the environment based on the received multi-channel sound signals.
步骤S120,对多通道声音信号中的每一通道声音信号进行分帧、加窗和傅里叶变换,形成多通道声音信号的短时傅里叶频谱。Step S120: Frame, window, and Fourier transform each sound signal in the multi-channel sound signal to form a short-time Fourier spectrum of the multi-channel sound signal.
分帧是按照预设时间周期,将单通道声音信号分为多个时间帧。Framing is to divide a single channel sound signal into multiple time frames according to a preset time period.
在一具体示例性实施例中,将多通道声音信号中的每一通道声音信号按照每帧20毫秒分为多个时间帧,且每两个相邻的时间帧之间具有10毫秒的重叠。In a specific exemplary embodiment, each channel of the multi-channel sound signal is divided into multiple time frames according to 20 milliseconds per frame, and there is a 10 millisecond overlap between every two adjacent time frames.
在一示例性实施例中,将STFT(short-time Fourier transform,短时傅里叶变换)应用于每个时间帧以提取短时傅里叶频谱。In an exemplary embodiment, an STFT (short-time Fourier transform) is applied to each time frame to extract a short-time Fourier spectrum.
Step S130: run the short-time Fourier spectrum through a pre-trained neural network model to compute the ratio mask corresponding to the target signal in the multi-channel sound signal.

The ratio mask characterizes the relationship between the noisy speech signal and the clean speech signal; it encodes an appropriate trade-off between suppressing noise and preserving speech.

Ideally, after the noisy speech signal is masked with the ratio mask, the speech spectrum can be recovered from the noisy mixture.

The neural network model is trained in advance. The short-time Fourier spectrum of the multi-channel signal is extracted and passed through the model to compute the ratio masks of the multi-channel signal.

Optionally, the masks are computed per channel: the pre-trained model estimates a ratio mask for each single-channel signal, and each channel is masked with its own mask. Different time-frequency (T-F) units thus receive different weights, which sharpens the peaks corresponding to the target speech in the multi-channel signal and suppresses the peaks corresponding to noise sources.

When computing the per-channel ratio masks, a deep recurrent neural network with long short-term memory is used, so that the estimated masks come closer to the ideal ratio mask.
Formula (1) defines the ideal ratio mask of each channel when the reverberant speech signal is taken as the target. Formula (2) defines the ideal ratio mask of each channel when the direct sound is taken as the target.

Reverberant speech is the sound that reaches the microphone after the waves emitted by the source have been reflected back and forth in all directions; its energy decays gradually as it is absorbed by the walls.

Direct sound is the sound that travels from the source to the microphone in a straight line without any reflection; it determines the clarity of the sound.

$$\mathrm{IRM}^{\mathrm{rev}}_i(t,f)=\frac{|c_i(f)s(t,f)+h_i(t,f)|}{|c_i(f)s(t,f)+h_i(t,f)|+|n_i(t,f)|}\qquad(1)$$

$$\mathrm{IRM}^{\mathrm{dir}}_i(t,f)=\frac{|c_i(f)s(t,f)|}{|c_i(f)s(t,f)|+|h_i(t,f)+n_i(t,f)|}\qquad(2)$$

where i indexes the microphone channel and c_i(f)s(t,f), h_i(t,f), and n_i(t,f) are the short-time Fourier transform (STFT) components of the direct sound, the reverberation, and the reverberant noise, respectively.

Since the TDOA information is mainly contained in the direct sound, taking the direct-sound signal as the target may bring the mask estimation model closer to the real environment.
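As a non-limiting sketch, the two ideal-ratio-mask training targets of formulas (1) and (2) may be computed as follows, assuming the direct-path, reverberation and noise STFTs of one channel are available separately (as they are when training mixtures are simulated); the magnitude-domain form shown here is one common choice:

```python
import numpy as np

def ideal_ratio_masks(direct, reverb, noise, eps=1e-8):
    """Per-channel ideal ratio masks for one microphone.

    direct, reverb, noise: complex STFTs (time x frequency) of the
    direct-path speech c(f)s(t,f), its reverberation h(t,f), and the
    reverberant noise n(t,f).
    Returns (irm_reverb_target, irm_direct_target), both in [0, 1].
    """
    # Formula (1): reverberant speech (direct + reverberation) is the target.
    s_rev = np.abs(direct + reverb)
    irm_reverb = s_rev / (s_rev + np.abs(noise) + eps)
    # Formula (2): only the direct sound is the target; the reverberation
    # is treated as interference together with the noise.
    d_mag = np.abs(direct)
    irm_direct = d_mag / (d_mag + np.abs(reverb + noise) + eps)
    return irm_reverb, irm_direct
```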
Optionally, other methods may also be used to compute the ratio mask of each single-channel signal; they are not described one by one here.

Step S140: fuse the multiple ratio masks into a single ratio mask.

As described above, each single-channel signal has its own ratio mask, so a multi-channel signal comprising several single-channel signals has several corresponding masks.

The present invention fuses the multiple ratio masks into a single ratio mask.

Specifically, the ratio masks produced for the multi-channel sound signal may be multiplied together on the corresponding time-frequency units to form the single mask.
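A minimal sketch of this fusion step, assuming the estimated per-channel masks are stacked in an array of shape (channels, time, frequency):

```python
import numpy as np

def fuse_masks(masks):
    """Fuse per-channel ratio masks into a single mask (formula (7)):
    the masks are multiplied element-wise over the channel dimension,
    so a T-F unit receives a large weight only if it is speech-dominant
    in every channel."""
    # masks: array of shape (channels, time, frequency), values in [0, 1]
    return np.prod(masks, axis=0)
```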
Step S150: apply masking weighting to the multi-channel sound signal with the single ratio mask and determine the direction of the target sound source.

It should be noted that even for severely corrupted speech there are still many T-F units dominated by the target speech. These units, whose phase is comparatively clean, are usually sufficient for robust localization of the target source. Masking weighting increases the contribution of the speech-dominated units to the localization, which improves the robustness of the computed TDOA and the accuracy of the target source localization.

Optionally, in an exemplary embodiment shown in Fig. 2, step S150 may include steps S151, S152, and S153.

Step S151: compute the generalized cross-correlation function with phase transform (GCC-PHAT) from the short-time Fourier spectra of the multi-channel input signal.

Step S152: mask the generalized cross-correlation function with the single ratio mask.

Step S153: sum the masked generalized cross-correlation function over frequency and time, and take the direction corresponding to the largest peak of the summed cross-correlation function as the direction of the target sound source.

As described above, a deep recurrent neural network with long short-term memory is used to compute the ratio mask of each channel of the multi-channel signal. The invention can be applied directly to microphone arrays of arbitrary geometry.
Assume there is a single target source and one pair of microphones. In a reverberant and noisy environment, the pair of microphone signals can be modeled as

$$y(t,f)=c(f)s(t,f)+h(t,f)+n(t,f),\qquad(3)$$

where s(t,f) is the STFT value of the target source at time t and frequency f, c(f) is the relative transfer function, h(t,f) and n(t,f) are the reverberation and noise components, and y(t,f) is the STFT vector of the received mixture. Taking the first microphone as the reference, the relative transfer function c(f) can be written as

$$c(f)=\bigl[1,\;A(f)\,e^{-j2\pi f f_s\tau^{*}/N}\bigr]^{T},\qquad(4)$$

where τ* is the underlying time delay in seconds, j is the imaginary unit, A(f) is a real-valued gain, f_s is the sampling rate in Hz, N is the number of DFT frequencies, and [·]^T denotes matrix transpose. f ranges from 0 to N/2.

The time delay is estimated from the generalized cross-correlation function with a phase-transform weighting:

$$GCC_{PHAT}(t,f,\tau)=\mathrm{Real}\!\left\{\frac{y_1(t,f)^{*}\,y_2(t,f)\,e^{\,j2\pi f f_s\tau/N}}{|y_1(t,f)|\,|y_2(t,f)|}\right\},\qquad(5)$$

where (·)* denotes complex conjugation, Real{·} extracts the real part, and |·| computes the magnitude. Subscripts 1 and 2 index the microphone channels. Intuitively, the algorithm first aligns the two microphone signals with a candidate delay and then computes the cosine distance of their phase difference. A cosine distance close to 1 means the candidate delay is close to the true delay (phase difference), so every GCC coefficient lies between -1 and 1. Assuming the source is fixed within each utterance, the GCC coefficients are pooled and summed, and the delay giving the maximum sum is taken as the TDOA estimate. The PHAT weighting is essential here: without this normalization, frequencies with higher energy would have larger GCC coefficients and dominate the sum.
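The per-unit GCC-PHAT coefficient of formula (5) may be computed as in the following sketch, assuming two-channel STFTs of shape (time, frequency) and candidate delays expressed in samples (τ·f_s); the variable names are illustrative:

```python
import numpy as np

def gcc_phat_coeffs(Y1, Y2, candidate_delays, n_fft):
    """GCC-PHAT coefficient for every (frame, frequency, candidate delay).

    Y1, Y2: complex STFTs, shape (T, F) with F = n_fft // 2 + 1.
    candidate_delays: array of candidate TDOAs in samples.
    Returns an array of shape (T, F, D) with values in [-1, 1]; a value
    near 1 means the candidate delay matches the observed inter-channel
    phase difference at that T-F unit.
    """
    T, F = Y1.shape
    freqs = np.arange(F)                              # DFT bin indices f
    cross = np.conj(Y1) * Y2                          # inter-channel cross term
    cross /= np.abs(Y1) * np.abs(Y2) + 1e-8           # PHAT normalization
    # phase compensation e^{j 2 pi f tau / N} for every candidate delay
    comp = np.exp(1j * 2 * np.pi
                  * freqs[None, :] * np.asarray(candidate_delays)[:, None] / n_fft)
    # broadcast (T, F, 1) * (1, F, D) -> (T, F, D) and keep the real part
    return np.real(cross[:, :, None] * comp.T[None, :, :])
```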
In the present invention the GCC-PHAT function is computed after masking weighting of the multi-channel signal:

$$GCC_{PHAT\text{-}MASK}(t,f,\tau)=\eta(t,f)\,GCC_{PHAT}(t,f,\tau),\qquad(6)$$

where η(t,f) is the masking weight of a T-F unit in the TDOA estimation. It can be defined as

$$\eta(t,f)=\prod_{i=1}^{D}\hat{M}_i(t,f),\qquad(7)$$

where D (= 2 in this example) is the number of microphone channels and \hat{M}_i(t,f) is the ratio mask of channel i, representing the proportion of target speech energy at each T-F unit of that channel.

By applying masking weighting to the multi-channel signal, summing the masked generalized cross-correlation function over frequency and time, and selecting the direction corresponding to the largest peak of the summed function as the direction of the target source, the accuracy of the estimated target direction is greatly improved.
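The masked GCC-PHAT localization of formulas (6) and (7) then reduces to weighting the per-unit coefficients with the fused mask, summing over time and frequency, and picking the best candidate; the microphone spacing, candidate grid, sampling rate and speed of sound below are illustrative assumptions:

```python
import numpy as np

def estimate_direction(gcc, fused_mask, candidate_angles_deg):
    """Masked GCC-PHAT direction estimate (formulas (6) and (7)).

    gcc: per-unit GCC-PHAT coefficients, shape (T, F, D), as returned by
         gcc_phat_coeffs, where the D candidate delays correspond to
         candidate_angles_deg.
    fused_mask: single fused ratio mask eta(t, f), shape (T, F).
    """
    masked = fused_mask[:, :, None] * gcc         # formula (6)
    score = masked.sum(axis=(0, 1))               # sum over time and frequency
    return candidate_angles_deg[int(np.argmax(score))]

# Candidate grid from -90 to 90 degrees in 5-degree steps, mapped to delays
# in samples for a 0.2 m microphone spacing at 16 kHz (far-field assumption):
angles = np.arange(-90, 91, 5)
delays_in_samples = 0.2 * np.sin(np.deg2rad(angles)) / 343.0 * 16000
```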
Optionally, in an exemplary embodiment shown in Fig. 3, another implementation of step S150 may include steps S154, S155, S156, S157, S158, S159, and S160.

Step S154: in every time-frequency unit, compute the covariance matrix of the short-time Fourier spectra of the multi-channel signal.

Step S155: mask the covariance matrices with the single ratio mask and, at each frequency, sum the masked covariance matrices along the time dimension to obtain the covariance matrices of the target speech and of the background noise at that frequency.

Step S156: from the topology of the microphone array, compute the steering vectors of the candidate directions at each frequency.

Step S157: from the noise covariance matrix and the candidate steering vector, compute the MVDR (Minimum Variance Distortionless Response) beamforming filter coefficients at each frequency.

Step S158: use the beamforming filter coefficients and the target-speech covariance matrix to compute the energy of the target speech at each frequency, and the beamforming filter coefficients and the noise covariance matrix to compute the energy of the background noise at each frequency.

Step S159: at each frequency, compute the energy ratio of target speech to noise and sum it along the frequency dimension to obtain the overall signal-to-noise ratio for the candidate direction.

Step S160: select the candidate direction with the largest overall signal-to-noise ratio as the direction of the target sound source.
Formulas (8) and (9) compute the covariance matrix of the target speech and the covariance matrix of the noise from the time-frequency units:

$$\hat{\Phi}_s(f)=\sum_{t}\eta(t,f)\,y(t,f)\,y(t,f)^{H},\qquad(8)$$

$$\hat{\Phi}_n(f)=\sum_{t}\xi(t,f)\,y(t,f)\,y(t,f)^{H}.\qquad(9)$$

η(t,f) is computed with formula (7), i.e. it is the single fused ratio mask.

ξ(t,f) is computed as

$$\xi(t,f)=\prod_{i=1}^{D}\bigl(1-\hat{M}_i(t,f)\bigr).$$

Essentially, formula (8) means that only the speech-dominated time-frequency units are used to compute the target-speech covariance matrix, and the more the target speech dominates a unit, the larger the weight it receives. Formula (9) computes the interference covariance matrix in the same way.
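A sketch of formulas (8) and (9), assuming the multi-channel STFT Y has shape (channels, time, frequency) and that eta and xi are the speech and noise weights of each time-frequency unit:

```python
import numpy as np

def weighted_covariances(Y, eta, xi):
    """Mask-weighted spatial covariance matrices per frequency.

    Y:   complex STFT, shape (D, T, F) for D microphones.
    eta: speech weight per T-F unit, shape (T, F)  -- formula (7).
    xi:  noise weight per T-F unit, shape (T, F).
    Returns (Phi_s, Phi_n), each of shape (F, D, D).
    """
    # Phi[f] = sum_t w(t, f) * y(t, f) y(t, f)^H  -- formulas (8) and (9)
    Phi_s = np.einsum('tf,dtf,etf->fde', eta, Y, np.conj(Y))
    Phi_n = np.einsum('tf,dtf,etf->fde', xi, Y, np.conj(Y))
    return Phi_s, Phi_n
```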
Next, under the free-field and plane-wave assumptions, the unit-length steering vector of a potential target position k is modeled as

$$v_k(f)=\frac{1}{\sqrt{D}}\Bigl[e^{-j2\pi f f_s d_{k1}/(N C_s)},\;\dots,\;e^{-j2\pi f f_s d_{kD}/(N C_s)}\Bigr]^{T},\qquad(10)$$

where d_{ki} is the distance between source position k and microphone i, and C_s is the speed of sound. A minimum variance distortionless response (MVDR) beamformer can then be constructed as

$$\hat{w}_k(f)=\frac{\hat{\Phi}_n(f)^{-1}v_k(f)}{v_k(f)^{H}\hat{\Phi}_n(f)^{-1}v_k(f)}.\qquad(11)$$

Afterwards, the SNR of the beamformed signal is obtained by computing the energies of the beamformed target speech and noise:

$$SNR_k(f)=\frac{\hat{w}_k(f)^{H}\hat{\Phi}_s(f)\hat{w}_k(f)}{\hat{w}_k(f)^{H}\hat{\Phi}_n(f)\hat{w}_k(f)}.\qquad(12)$$

Finally, the source direction can be predicted as

$$\hat{k}=\arg\max_{k}\sum_{f}\frac{SNR_k(f)}{1+SNR_k(f)}.\qquad(13)$$

In formula (13) the SNR is limited to between 0 and 1, essentially like the PHAT weighting in the GCC-PHAT algorithm, where the GCC coefficient of every T-F unit is normalized to the range -1 to 1. More weight can also be placed on frequencies with higher SNR:

$$\hat{k}=\arg\max_{k}\sum_{f}\gamma(f)\,\frac{SNR_k(f)}{1+SNR_k(f)},\qquad(14)$$

where γ(f) can be defined as

$$\gamma(f)=\sum_{t}\eta(t,f).\qquad(15)$$

The sum of the combined speech mask within each frequency indicates the importance of that frequency. In experiments, the frequency-weighted objective of formula (14) with the weighting of formula (15) was found to give better results than formula (13).
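A sketch of the steered-response SNR scoring of formulas (10)-(15) for one candidate position; the sampling rate, FFT size, speed of sound and the diagonal loading added before inverting the noise covariance are illustrative assumptions, not part of the claimed method:

```python
import numpy as np

def steered_response_snr(Phi_s, Phi_n, dists, eta, fs=16000, c_sound=343.0,
                         n_fft=512, diag_load=1e-6):
    """Score one candidate source position k (formulas (10)-(15)).

    Phi_s, Phi_n: (F, D, D) speech / noise covariance matrices.
    dists: length-D array of distances d_ki from position k to each mic.
    eta:   fused mask, shape (T, F); gamma(f) = sum_t eta(t, f), formula (15).
    """
    F, D, _ = Phi_s.shape
    gamma = eta.sum(axis=0)                                    # formula (15)
    score = 0.0
    for f in range(F):
        # formula (10): unit-length free-field steering vector
        v = np.exp(-1j * 2 * np.pi * f * fs * dists / (n_fft * c_sound))
        v = v / np.sqrt(D)
        # formula (11): MVDR weights against the noise covariance
        Phi_n_inv = np.linalg.inv(Phi_n[f] + diag_load * np.eye(D))
        w = Phi_n_inv @ v / (v.conj() @ Phi_n_inv @ v)
        # formula (12): SNR of the beamformed signal at this frequency
        snr = np.real(w.conj() @ Phi_s[f] @ w) / np.real(w.conj() @ Phi_n[f] @ w)
        # formulas (13)/(14): normalize to [0, 1) and weight by gamma(f)
        score += gamma[f] * snr / (1.0 + snr)
    return score

# The estimated direction is the candidate k that maximizes steered_response_snr.
```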
Optionally, in an exemplary embodiment shown in Fig. 4, a third implementation of step S150 may include steps S161, S162, S163, S164, and S165.

Step S161: at each frequency, apply an eigendecomposition to the target-speech covariance matrix and take the eigenvector with the largest eigenvalue as the steering vector of the target speech.

Step S162: use the steering vector of the target speech to compute the time difference of arrival between the microphone signals.

Step S163: from the microphone-array topology, compute the time difference of arrival between the microphones for each candidate direction.

Step S164: compute the cosine distance between the time difference of arrival of the microphone signals and the time difference of arrival of each candidate direction.

Step S165: select the candidate direction with the largest cosine distance as the direction of the target sound source.
The steering vector can be computed with

$$\hat{c}(f)=\mathcal{P}\{\hat{\Phi}_s(f)\},\qquad(16)$$

where P{·} extracts the principal eigenvector of the estimated speech covariance matrix computed in formula (8). If \hat{\Phi}_s(f) is estimated well, it will be close to a rank-1 matrix, so its principal eigenvector is a reasonable estimate of the steering vector.

To estimate the time delay \hat{\tau}, all potential time delays are enumerated and the delay maximizing the following objective is selected:

$$\hat{\tau}=\arg\max_{\tau}\sum_{f}\gamma(f)\,\mathrm{Real}\!\left\{\frac{\hat{c}_1(f)^{*}\,\hat{c}_2(f)\,e^{\,j2\pi f f_s\tau/N}}{|\hat{c}_1(f)|\,|\hat{c}_2(f)|}\right\}.\qquad(17)$$

The underlying idea is that the steering vector \hat{c}(f) is computed independently at every frequency, so it does not strictly follow the linear-phase assumption. The invention enumerates all potential time delays and searches for the delay τ whose phase delay best matches \hat{c}(f) (the direction of the steering vector) at every frequency; this delay is taken as the final prediction. As in formula (15), γ(f) weighting is used to emphasize frequencies with higher SNR.
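A sketch of this steering-vector-based estimate of formulas (16) and (17) for the two-microphone case: the principal eigenvector of the speech covariance at every frequency serves as the steering-vector estimate, and the candidate delay whose linear-phase model best matches its inter-channel phase, weighted by γ(f), is selected:

```python
import numpy as np

def eigenvector_tdoa(Phi_s, gamma, candidate_delays, n_fft):
    """Steering-vector-based TDOA estimate (formulas (16) and (17)).

    Phi_s: (F, 2, 2) mask-weighted speech covariance matrices.
    gamma: per-frequency weights, gamma(f) = sum_t eta(t, f).
    candidate_delays: candidate TDOAs in samples.
    """
    F = Phi_s.shape[0]
    freqs = np.arange(F)
    # formula (16): principal eigenvector of Phi_s(f) as steering vector
    eigvals, eigvecs = np.linalg.eigh(Phi_s)       # eigenvalues in ascending order
    c_hat = eigvecs[:, :, -1]                      # shape (F, 2)
    # observed inter-channel term, normalized to unit magnitude
    cross = np.conj(c_hat[:, 0]) * c_hat[:, 1]
    cross /= np.abs(cross) + 1e-8
    # formula (17): pick the delay whose phase model matches c_hat best
    scores = []
    for tau in candidate_delays:
        comp = np.exp(1j * 2 * np.pi * freqs * tau / n_fft)
        scores.append(np.sum(gamma * np.real(cross * comp)))
    return candidate_delays[int(np.argmax(scores))]
```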
With the method described above, when the TDOA is estimated to localize the target source, the acquired multi-channel signal is passed through the pre-trained neural network to compute the corresponding ratio masks, the masks are fused into a single mask, and the multi-channel signal is then mask-weighted with the single mask to determine the direction of the target source. The invention remains robust in low-SNR, highly reverberant environments and improves the accuracy and stability of target sound source direction estimation.

In the following, the robustness of the TDOA estimation of the above exemplary embodiments is tested with a binaural setup and a two-microphone setup in environments with strong reverberation and competing babble. Fig. 5 is a schematic diagram of the binaural setup and the two-microphone setup according to an exemplary embodiment.

The average duration of the mixtures is 2.4 seconds. For both data sets the input SNR, computed between reverberant speech and reverberant noise, is -6 dB; if the direct-path signal is regarded as the target and everything else as noise, the SNR is even lower. An LSTM (a recurrent neural network with long short-term memory) is trained on all single-channel signals in the training data (10000 x 2 in total). In the microphone-array setup the log power spectrogram is used as the input feature; in the binaural setup the interaural energy difference is used as well. The input features are mean-normalized at the utterance level before global mean-variance normalization. The LSTM has two hidden layers with 500 units each. The Adam algorithm is used to minimize the mean squared error of the ratio mask estimate. The window length is 32 ms, the window shift is 8 ms, and the sampling rate is 16 kHz.
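A minimal PyTorch sketch of a mask estimator consistent with this description (two LSTM layers of 500 units, sigmoid output, mean-squared-error loss minimized with Adam); the feature dimension and remaining training details are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Two-layer LSTM mapping per-frame log-power spectra to a ratio mask
    in [0, 1] per frequency bin (257 bins for a 512-point FFT)."""
    def __init__(self, n_freq=257, hidden=500):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_freq, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, log_power):            # (batch, time, n_freq)
        h, _ = self.lstm(log_power)
        return torch.sigmoid(self.out(h))    # estimated mask, (batch, time, n_freq)

model = MaskEstimator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                       # MSE between estimated and ideal mask
```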
Performance is measured as overall accuracy: a prediction is counted as correct if the predicted direction is within 5° of the true target direction.

In the two-microphone setup, an image-method RIR (room impulse response) generator is used to simulate reverberation. For the training and validation data an interfering speaker is placed in each of 36 directions, from -87.5° to 87.5° in steps of 5°, with the target speaker in one of the 36 directions. For the test data an interfering speaker is placed in each of 37 directions, from -90° to 90° in steps of 5°, with the target speaker in any one of the 37 directions; the test RIRs are therefore unseen during training. The distance between the target speaker and the array center is 1 meter. The room size is fixed at 8 x 8 x 3 m, and the two microphones are placed at the center of the room.
Table 1. Comparison of the TDOA estimation performance of different methods in the two-microphone setup (% overall accuracy)
The distance between the two microphones is 0.2 m, and both are placed at a height of 1.5 m. The T60 of each mixture is drawn at random from 0.0 s to 1.0 s in steps of 0.1 s. IEEE and TIMIT sentences are used to generate the training, validation, and test speech.

In the binaural setup, binaural room impulse responses (BRIRs) are simulated in software, with T60 (reverberation time) ranging from 0.0 s to 1.0 s in steps of 0.1 s and a room size fixed at 6 x 4 x 3 m. The BRIRs are generated by placing the two ears around the room center at a height of 2 m, with the source in one of 37 directions (from -90° to 90° in steps of 5°) at the same height as the array and 1.5 m from the array center. Real BRIRs recorded with a HATS (head-and-torso simulator) dummy head in four real rooms of different sizes and T60s are used for testing; the dummy head is placed at a height of 2.8 m, the source-to-array distance is 1.5 m, and the same 37 directions are measured. Thirty-seven different interfering voices are placed in the 37 directions, and the target voice in one of them. In the experiments, 720 IEEE sentences spoken by a female speaker are used as target speech and split at random into 500, 100, and 120 utterances for the training, validation, and test data. To generate diffuse babble noise, the utterances of the 630 speakers of the TIMIT corpus are concatenated, and 37 randomly selected speakers with their speech segments are placed in each of the 37 directions. For every speaker in the babble, the first half of the concatenated utterance is used to generate training and validation noise and the second half to generate test noise. There are 10000, 800, and 3000 binaural mixtures in the training, validation, and test sets, respectively.

Table 2. Comparison of the TDOA estimation performance of different methods in the binaural setup (% overall accuracy)
Tables 1 and 2 show the overall localization accuracy; the rows shaded gray mark the performance obtained with the ideal ratio mask. The tables also list the direct-to-reverberant energy ratio (DRR) for each T60 level. With masks estimated by the LSTM, the proposed mask-weighted GCC-PHAT algorithm markedly improves over the conventional GCC-PHAT algorithm (from 25.8% to 78.5% and 88.2% in Table 1, and from 29.4% to 91.3% and 90.8% in Table 2). The steering-vector-based TDOA estimation algorithm is the most robust of all the algorithms, especially at high T60. Using the direct sound as the target speech to define the ideal ratio mask brings the accuracy of all proposed algorithms close to 100% (100.0%, 99.9%, and 99.8% in Table 1; 99.4%, 99.4%, and 99.4% in Table 2). This indicates that masking at the T-F-unit level is well suited to highly robust TDOA estimation.

Because the time-delay information is mainly contained in the direct sound, in the two-microphone setup defining the IRM with the direct sound as target consistently outperforms defining it with the reverberant speech as target (88.2% vs. 78.5%, 90.5% vs. 86.7%, and 91.0% vs. 86.4%).

However, because of the head-shadow effect and the mismatch between the training and test BRIRs in the binaural setup, the mask-weighted steered-response SNR algorithm performs relatively worse in the binaural setup than in the two-microphone setup. Owing to the head-shadow effect, the gains of the two channels in the binaural case cannot simply be assumed equal; consequently, estimating the IRM with the reverberant speech as target performs slightly better in the binaural setup than using the direct sound as target (91.3% vs. 90.8%, 86.4% vs. 70.0%, and 92.0% vs. 91.1%).
The following are apparatus embodiments of the present disclosure, which can be used to carry out the above method embodiments of sound source direction estimation based on time-frequency masking and deep neural networks. For details not disclosed in the apparatus embodiments, please refer to the method embodiments of the present disclosure.

Fig. 6 is a block diagram of a sound source direction estimation apparatus based on time-frequency masking and a deep neural network according to an exemplary embodiment. The apparatus includes, but is not limited to, a sound signal acquisition module 110, a short-time Fourier spectrum extraction module 120, a ratio mask computation module 130, a ratio mask fusion module 140, and a masking weighting module 150.

The sound signal acquisition module 110 is configured to acquire a multi-channel sound signal.

The short-time Fourier spectrum extraction module 120 is configured to frame, window, and Fourier-transform each channel of the multi-channel sound signal to form its short-time Fourier spectrum.

The ratio mask computation module 130 is configured to run the short-time Fourier spectrum through a pre-trained neural network model to compute the ratio mask corresponding to the target signal in the multi-channel sound signal.

The ratio mask fusion module 140 is configured to fuse multiple ratio masks into a single ratio mask.

The masking weighting module 150 is configured to apply masking weighting to the multi-channel sound signal with the single ratio mask and determine the direction of the target sound source.

The implementation of the functions of the individual modules of the apparatus follows the corresponding steps of the sound source direction estimation method based on time-frequency masking and deep neural networks described above and is not repeated here.
Optionally, the ratio mask computation module 130 of Fig. 6 includes, but is not limited to, a per-channel ratio mask computation unit.

The per-channel ratio mask computation unit is configured to run the short-time Fourier spectrum of each channel through the pre-trained neural network model and compute the ratio mask corresponding to each channel of the multi-channel sound signal.

Optionally, the per-channel ratio mask computation unit may specifically take the direct sound or the reverberant speech signal as the target and use a deep recurrent neural network model with long short-term memory to compute the ratio mask corresponding to each single-channel target signal in the multi-channel sound signal.

Optionally, the ratio mask fusion module 140 of Fig. 6 specifically multiplies the ratio masks produced for the target in the multi-channel sound signal together on the corresponding time-frequency units.
Optionally, as shown in Fig. 7, the masking weighting module 150 of Fig. 6 includes, but is not limited to, a generalized cross-correlation function computation submodule 151, a masking submodule 152, and a first direction determination submodule 153.

The generalized cross-correlation function computation submodule 151 is configured to compute the generalized cross-correlation function from the short-time Fourier spectra of the multi-channel input signal.

The masking submodule 152 is configured to mask the generalized cross-correlation function with the single ratio mask.

The first direction determination submodule 153 is configured to sum the masked generalized cross-correlation function over frequency and time and take the direction corresponding to the largest peak of the summed cross-correlation function as the direction of the target sound source.
Optionally, as shown in Fig. 8, a second implementation of the masking weighting module 150 of Fig. 6 includes, but is not limited to, a covariance matrix computation submodule 154, a covariance matrix masking submodule 155, a candidate-direction steering vector computation submodule 156, a beamforming filter coefficient computation submodule 157, an energy computation submodule 158, an overall signal-to-noise ratio formation submodule 159, and a second direction determination submodule 160.

The covariance matrix computation submodule 154 is configured to compute, in every time-frequency unit, the covariance matrix of the short-time Fourier spectra of the multi-channel sound signal.

The covariance matrix masking submodule 155 is configured to mask the covariance matrices with the single ratio mask and, at each frequency, sum the masked covariance matrices along the time dimension to obtain the covariance matrices of the target speech and of the noise at each frequency.

The candidate-direction steering vector computation submodule 156 is configured to compute, from the topology of the microphone array, the steering vectors of the candidate directions at each frequency.

The beamforming filter coefficient computation submodule 157 is configured to compute, from the noise covariance matrix and the candidate steering vector, the MVDR beamforming filter coefficients at each frequency.

The energy computation submodule 158 is configured to compute the energy of the target speech at each frequency from the beamforming filter coefficients and the target-speech covariance matrix, and the energy of the noise at each frequency from the beamforming filter coefficients and the noise covariance matrix.

The overall signal-to-noise ratio formation submodule 159 is configured to compute, at each frequency, the energy ratio of target speech to noise and sum it along the frequency dimension to obtain the overall signal-to-noise ratio for a candidate direction.

The second direction determination submodule 160 is configured to select the candidate direction with the largest overall signal-to-noise ratio as the direction of the target sound source.
Optionally, as shown in Fig. 9, a third implementation of the masking weighting module 150 of Fig. 6 includes, but is not limited to, a speech steering vector computation submodule 161, an arrival time difference computation submodule 162, a candidate-direction arrival time difference submodule 163, a cosine distance computation submodule 164, and a third direction determination submodule 165.

The speech steering vector computation submodule 161 is configured to apply, at each frequency, an eigendecomposition to the target-speech covariance matrix and select the eigenvector with the largest eigenvalue as the steering vector of the target speech.

The arrival time difference computation submodule 162 is configured to compute the time difference of arrival between the microphone signals from the steering vector of the target speech.

The candidate-direction arrival time difference submodule 163 is configured to compute, from the microphone-array topology, the time difference of arrival between the microphones for each candidate direction.

The cosine distance computation submodule 164 is configured to compute the cosine distance between the time difference of arrival of the microphone signals and the time difference of arrival of a candidate direction.

The third direction determination submodule 165 is configured to select the candidate direction with the largest cosine distance as the direction of the target sound source.
Optionally, the present invention also provides an electronic device that performs all or part of the steps of the sound source direction estimation method based on time-frequency masking and a deep neural network shown in any of the exemplary embodiments above. The electronic device includes:

a processor; and

a memory communicatively connected to the processor, wherein

the memory stores readable instructions which, when executed by the processor, implement the method according to any of the exemplary embodiments above.

The specific manner in which the processor of the terminal performs operations in this embodiment has been described in detail in the embodiments of the sound source direction estimation method based on time-frequency masking and deep neural networks and is not elaborated here.

In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, for example a transitory or non-transitory computer-readable storage medium containing instructions.

It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the invention is limited only by the appended claims.

Claims (10)

  1. A sound source direction estimation method based on time-frequency masking and a deep neural network, characterized in that the method comprises:

    acquiring a multi-channel sound signal;

    framing, windowing, and Fourier-transforming each channel of the multi-channel sound signal to form a short-time Fourier spectrum of the multi-channel sound signal;

    performing an iterative computation on the short-time Fourier spectrum with a pre-trained neural network model to compute the ratio mask corresponding to the target signal in the multi-channel sound signal;

    fusing multiple ratio masks into a single ratio mask;

    applying masking weighting to the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source.

  2. The method according to claim 1, characterized in that the step of performing an iterative computation on the short-time Fourier spectrum with a pre-trained neural network model to compute the ratio mask corresponding to the target signal in the multi-channel sound signal comprises:

    performing an iterative computation on the short-time Fourier spectrum of each channel with the pre-trained neural network model to compute the ratio mask corresponding to each channel of the multi-channel sound signal.

  3. The method according to claim 2, characterized in that the step of performing an iterative computation on the short-time Fourier spectrum of each channel with the pre-trained neural network model to compute the ratio mask corresponding to each channel of the multi-channel sound signal comprises:

    taking the direct sound or the reverberant speech signal as the target and using a deep recurrent neural network model with long short-term memory to compute the ratio mask corresponding to each single-channel target signal in the multi-channel sound signal.

  4. The method according to claim 1, characterized in that the step of fusing multiple ratio masks into a single ratio mask comprises:

    multiplying the ratio masks produced for the target signal in the multi-channel sound signal together on the corresponding time-frequency units.

  5. The method according to claim 1, characterized in that the step of applying masking weighting to the multi-channel sound signal with the single ratio mask comprises:

    computing a generalized cross-correlation function from the short-time Fourier spectra of the multi-channel input signal;

    masking the generalized cross-correlation function with the single ratio mask;

    summing the masked generalized cross-correlation function over frequency and time, and selecting the direction corresponding to the largest peak of the summed cross-correlation function as the direction of the target sound source.

  6. The method according to claim 1, characterized in that the step of applying masking weighting to the multi-channel sound signal with the single ratio mask comprises:

    computing, in every time-frequency unit, the covariance matrix of the short-time Fourier spectra of the multi-channel sound signal;

    masking the covariance matrices with the single ratio mask and, at each frequency, summing the masked covariance matrices along the time dimension to obtain the covariance matrices of the target speech and of the noise at each frequency;

    computing, from the topology of the microphone array, the steering vectors of the candidate directions at each frequency;

    computing, from the noise covariance matrix and the candidate steering vector, MVDR beamforming filter coefficients at each frequency;

    computing the energy of the target speech at each frequency from the beamforming filter coefficients and the target-speech covariance matrix, and the energy of the noise at each frequency from the beamforming filter coefficients and the noise covariance matrix;

    computing, at each frequency, the energy ratio of the target speech to the noise and summing it along the frequency dimension to form the overall signal-to-noise ratio for a candidate direction;

    selecting the candidate direction with the largest overall signal-to-noise ratio as the direction of the target sound source.

  7. The method according to claim 1, characterized in that the step of applying masking weighting to the multi-channel sound signal with the single ratio mask comprises:

    applying, at each frequency, an eigendecomposition to the target-speech covariance matrix and selecting the eigenvector with the largest eigenvalue as the steering vector of the target speech;

    computing the time difference of arrival between the microphone signals from the steering vector of the target speech;

    computing, from the microphone-array topology, the time difference of arrival between the microphones for each candidate direction;

    computing the cosine distance between the time difference of arrival of the microphone signals and the time difference of arrival of the candidate directions between the microphones;

    selecting the candidate direction with the largest cosine distance as the direction of the target sound source.

  8. A sound source direction estimation apparatus based on time-frequency masking and a deep neural network, characterized in that the apparatus comprises:

    a sound signal acquisition module, configured to acquire a multi-channel sound signal;

    a short-time Fourier spectrum extraction module, configured to frame, window, and Fourier-transform each channel of the multi-channel sound signal to form a short-time Fourier spectrum of the multi-channel sound signal;

    a ratio mask computation module, configured to perform an iterative computation on the short-time Fourier spectrum with a pre-trained neural network model to compute the ratio mask corresponding to the target signal in the multi-channel sound signal;

    a ratio mask fusion module, configured to fuse multiple ratio masks into a single ratio mask;

    a masking weighting module, configured to apply masking weighting to the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source.

  9. An electronic device, characterized in that the electronic device comprises:

    at least one processor; and

    a memory communicatively connected to the at least one processor, wherein

    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-7.

  10. A computer-readable storage medium for storing a program, characterized in that, when the program is executed, it causes an electronic device to perform the method according to any one of claims 1-7.
PCT/CN2019/090531 2018-08-31 2019-06-10 Time-frequency masking and deep neural network-based sound source direction estimation method WO2020042708A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811009529.4 2018-08-31
CN201811009529.4A CN109839612B (en) 2018-08-31 2018-08-31 Sound source direction estimation method and device based on time-frequency masking and deep neural network

Publications (1)

Publication Number Publication Date
WO2020042708A1 true WO2020042708A1 (en) 2020-03-05

Family

ID=66883029

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/090531 WO2020042708A1 (en) 2018-08-31 2019-06-10 Time-frequency masking and deep neural network-based sound source direction estimation method

Country Status (2)

Country Link
CN (1) CN109839612B (en)
WO (1) WO2020042708A1 (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111487589A (en) * 2020-04-21 2020-08-04 中国科学院上海微系统与信息技术研究所 Target placement positioning method based on multi-source sensor network
CN111681668A (en) * 2020-05-20 2020-09-18 陕西金蝌蚪智能科技有限公司 Acoustic imaging method and terminal equipment
CN111724801A (en) * 2020-06-22 2020-09-29 北京小米松果电子有限公司 Audio signal processing method and device and storage medium
CN111880146A (en) * 2020-06-30 2020-11-03 海尔优家智能科技(北京)有限公司 Sound source orientation method and device and storage medium
CN112379330A (en) * 2020-11-27 2021-02-19 浙江同善人工智能技术有限公司 Multi-robot cooperative 3D sound source identification and positioning method
CN112415467A (en) * 2020-11-06 2021-02-26 中国海洋大学 Single-vector subsurface buoy target positioning implementation method based on neural network
CN112462355A (en) * 2020-11-11 2021-03-09 西北工业大学 Sea target intelligent detection method based on time-frequency three-feature extraction
CN112634930A (en) * 2020-12-21 2021-04-09 北京声智科技有限公司 Multi-channel sound enhancement method and device and electronic equipment
CN112904279A (en) * 2021-01-18 2021-06-04 南京工程学院 Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum
CN112951263A (en) * 2021-03-17 2021-06-11 云知声智能科技股份有限公司 Speech enhancement method, apparatus, device and storage medium
CN113050039A (en) * 2021-03-10 2021-06-29 杭州瑞利超声科技有限公司 Acoustic fluctuation positioning system used in tunnel
CN113325401A (en) * 2021-07-06 2021-08-31 东南大学 Distortion towed linear array signal reconstruction method based on line spectrum phase difference ambiguity resolution
US20210375294A1 (en) * 2019-07-24 2021-12-02 Tencent Technology (Shenzhen) Company Limited Inter-channel feature extraction method, audio separation method and apparatus, and computing device
CN113763982A (en) * 2020-06-05 2021-12-07 阿里巴巴集团控股有限公司 Audio processing method and device, electronic equipment and readable storage medium
CN113763976A (en) * 2020-06-05 2021-12-07 北京有竹居网络技术有限公司 Method and device for reducing noise of audio signal, readable medium and electronic equipment
CN113782047A (en) * 2021-09-06 2021-12-10 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium
CN113936681A (en) * 2021-10-13 2022-01-14 东南大学 Voice enhancement method based on mask mapping and mixed hole convolution network
CN114613384A (en) * 2022-03-14 2022-06-10 中国电子科技集团公司第十研究所 Deep learning-based multi-input voice signal beam forming information complementation method
CN115050367A (en) * 2022-08-12 2022-09-13 清华大学苏州汽车研究院(相城) Method, device, equipment and storage medium for positioning speaking target
CN115856987A (en) * 2023-02-28 2023-03-28 西南科技大学 Nuclear pulse signal and noise signal discrimination method under complex environment
CN117040662A (en) * 2023-09-07 2023-11-10 中通服网盈科技有限公司 Multichannel signal transmission system

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109839612B (en) * 2018-08-31 2022-03-01 大象声科(深圳)科技有限公司 Sound source direction estimation method and device based on time-frequency masking and deep neural network
CN112257484B (en) * 2019-07-22 2024-03-15 中国科学院声学研究所 Multi-sound source direction finding method and system based on deep learning
CN110728989B (en) * 2019-09-29 2020-07-14 东南大学 Binaural speech separation method based on long-time and short-time memory network L STM
CN110838303B (en) * 2019-11-05 2022-02-08 南京大学 Voice sound source positioning method using microphone array
CN110808063A (en) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 Voice processing method and device for processing voice
CN110992977B (en) * 2019-12-03 2021-06-22 北京声智科技有限公司 Method and device for extracting target sound source
CN111103568A (en) * 2019-12-10 2020-05-05 北京声智科技有限公司 Sound source positioning method, device, medium and equipment
CN113053400A (en) * 2019-12-27 2021-06-29 武汉Tcl集团工业研究院有限公司 Training method of audio signal noise reduction model, audio signal noise reduction method and device
CN111239687B (en) * 2020-01-17 2021-12-14 浙江理工大学 Sound source positioning method and system based on deep neural network
CN111239686B (en) * 2020-02-18 2021-12-21 中国科学院声学研究所 Dual-channel sound source positioning method based on deep learning
CN111596261B (en) * 2020-04-02 2022-06-14 云知声智能科技股份有限公司 Sound source positioning method and device
CN112259117A (en) * 2020-09-28 2021-01-22 上海声瀚信息科技有限公司 Method for locking and extracting target sound source
CN112788278B (en) * 2020-12-30 2023-04-07 北京百度网讯科技有限公司 Video stream generation method, device, equipment and storage medium
CN112989566B (en) * 2021-02-05 2022-11-11 浙江大学 Geometric sound propagation optimization method based on A-weighted variance
CN113687305A (en) * 2021-07-26 2021-11-23 浙江大华技术股份有限公司 Method, device and equipment for positioning sound source azimuth and computer readable storage medium
CN113724727A (en) * 2021-09-02 2021-11-30 哈尔滨理工大学 Long-short time memory network voice separation algorithm based on beam forming
CN113644947A (en) * 2021-10-14 2021-11-12 西南交通大学 Adaptive beam forming method, device, equipment and readable storage medium
CN114255733B (en) * 2021-12-21 2023-05-23 中国空气动力研究与发展中心低速空气动力研究所 Self-noise masking system and flight device
CN115359804B (en) * 2022-10-24 2023-01-06 北京快鱼电子股份公司 Directional audio pickup method and system based on microphone array

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104103277A (en) * 2013-04-15 2014-10-15 北京大学深圳研究生院 Time frequency mask-based single acoustic vector sensor (AVS) target voice enhancement method
US20170328983A1 (en) * 2015-12-04 2017-11-16 Fazecast, Inc. Systems and methods for transient acoustic event detection, classification, and localization
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN108318862A (en) * 2017-12-26 2018-07-24 北京大学 A kind of sound localization method based on neural network
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sounnd source direction estimation method based on time-frequency masking and deep neural network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101456866B1 (en) * 2007-10-12 2014-11-03 삼성전자주식회사 Method and apparatus for extracting the target sound signal from the mixed sound
EP2088802B1 (en) * 2008-02-07 2013-07-10 Oticon A/S Method of estimating weighting function of audio signals in a hearing aid
CN102157156B (en) * 2011-03-21 2012-10-10 清华大学 Single-channel voice enhancement method and system
JP2012234150A (en) * 2011-04-18 2012-11-29 Sony Corp Sound signal processing device, sound signal processing method and program
EP2747451A1 (en) * 2012-12-21 2014-06-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Filter and method for informed spatial filtering using multiple instantaneous direction-of-arrivial estimates
US10939201B2 (en) * 2013-02-22 2021-03-02 Texas Instruments Incorporated Robust estimation of sound source localization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104103277A (en) * 2013-04-15 2014-10-15 北京大学深圳研究生院 Time frequency mask-based single acoustic vector sensor (AVS) target voice enhancement method
US20170328983A1 (en) * 2015-12-04 2017-11-16 Fazecast, Inc. Systems and methods for transient acoustic event detection, classification, and localization
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN108318862A (en) * 2017-12-26 2018-07-24 北京大学 A kind of sound localization method based on neural network
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sounnd source direction estimation method based on time-frequency masking and deep neural network

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11908483B2 (en) * 2019-07-24 2024-02-20 Tencent Technology (Shenzhen) Company Limited Inter-channel feature extraction method, audio separation method and apparatus, and computing device
US20210375294A1 (en) * 2019-07-24 2021-12-02 Tencent Technology (Shenzhen) Company Limited Inter-channel feature extraction method, audio separation method and apparatus, and computing device
CN111487589B (en) * 2020-04-21 2023-08-04 中国科学院上海微系统与信息技术研究所 Target drop point positioning method based on multi-source sensor network
CN111487589A (en) * 2020-04-21 2020-08-04 中国科学院上海微系统与信息技术研究所 Target placement positioning method based on multi-source sensor network
CN111681668A (en) * 2020-05-20 2020-09-18 陕西金蝌蚪智能科技有限公司 Acoustic imaging method and terminal equipment
CN111681668B (en) * 2020-05-20 2023-07-07 陕西金蝌蚪智能科技有限公司 Acoustic imaging method and terminal equipment
CN113763976A (en) * 2020-06-05 2021-12-07 北京有竹居网络技术有限公司 Method and device for reducing noise of audio signal, readable medium and electronic equipment
CN113763982A (en) * 2020-06-05 2021-12-07 阿里巴巴集团控股有限公司 Audio processing method and device, electronic equipment and readable storage medium
CN113763976B (en) * 2020-06-05 2023-12-22 北京有竹居网络技术有限公司 Noise reduction method and device for audio signal, readable medium and electronic equipment
CN111724801A (en) * 2020-06-22 2020-09-29 北京小米松果电子有限公司 Audio signal processing method and device and storage medium
CN111880146A (en) * 2020-06-30 2020-11-03 海尔优家智能科技(北京)有限公司 Sound source orientation method and device and storage medium
CN112415467B (en) * 2020-11-06 2022-10-25 中国海洋大学 Single-vector subsurface buoy target positioning implementation method based on neural network
CN112415467A (en) * 2020-11-06 2021-02-26 中国海洋大学 Single-vector subsurface buoy target positioning implementation method based on neural network
CN112462355A (en) * 2020-11-11 2021-03-09 西北工业大学 Sea target intelligent detection method based on time-frequency three-feature extraction
CN112462355B (en) * 2020-11-11 2023-07-14 西北工业大学 Intelligent sea target detection method based on time-frequency three-feature extraction
CN112379330A (en) * 2020-11-27 2021-02-19 浙江同善人工智能技术有限公司 Multi-robot cooperative 3D sound source identification and positioning method
CN112379330B (en) * 2020-11-27 2023-03-10 浙江同善人工智能技术有限公司 Multi-robot cooperative 3D sound source identification and positioning method
CN112634930A (en) * 2020-12-21 2021-04-09 北京声智科技有限公司 Multi-channel sound enhancement method and device and electronic equipment
CN112904279B (en) * 2021-01-18 2024-01-26 南京工程学院 Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
CN112904279A (en) * 2021-01-18 2021-06-04 南京工程学院 Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum
CN113050039B (en) * 2021-03-10 2023-03-07 杭州瑞利超声科技有限公司 Acoustic fluctuation positioning system used in tunnel
CN113050039A (en) * 2021-03-10 2021-06-29 杭州瑞利超声科技有限公司 Acoustic fluctuation positioning system used in tunnel
CN112951263A (en) * 2021-03-17 2021-06-11 云知声智能科技股份有限公司 Speech enhancement method, apparatus, device and storage medium
CN112951263B (en) * 2021-03-17 2022-08-02 云知声智能科技股份有限公司 Speech enhancement method, apparatus, device and storage medium
CN113325401A (en) * 2021-07-06 2021-08-31 东南大学 Distortion towed linear array signal reconstruction method based on line spectrum phase difference ambiguity resolution
CN113325401B (en) * 2021-07-06 2024-03-19 东南大学 Distortion towing linear array signal reconstruction method based on line spectrum phase difference deblurring
CN113782047A (en) * 2021-09-06 2021-12-10 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium
CN113782047B (en) * 2021-09-06 2024-03-08 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium
CN113936681A (en) * 2021-10-13 2022-01-14 东南大学 Voice enhancement method based on mask mapping and mixed hole convolution network
CN113936681B (en) * 2021-10-13 2024-04-09 东南大学 Speech enhancement method based on mask mapping and mixed cavity convolution network
CN114613384A (en) * 2022-03-14 2022-06-10 中国电子科技集团公司第十研究所 Deep learning-based multi-input voice signal beam forming information complementation method
CN114613384B (en) * 2022-03-14 2023-08-29 中国电子科技集团公司第十研究所 Deep learning-based multi-input voice signal beam forming information complementation method
CN115050367A (en) * 2022-08-12 2022-09-13 清华大学苏州汽车研究院(相城) Method, device, equipment and storage medium for positioning speaking target
CN115050367B (en) * 2022-08-12 2022-11-04 清华大学苏州汽车研究院(相城) Method, device, equipment and storage medium for positioning speaking target
CN115856987A (en) * 2023-02-28 2023-03-28 西南科技大学 Nuclear pulse signal and noise signal discrimination method under complex environment
CN117040662A (en) * 2023-09-07 2023-11-10 中通服网盈科技有限公司 Multichannel signal transmission system
CN117040662B (en) * 2023-09-07 2024-04-12 中通服网盈科技有限公司 Multichannel signal transmission system

Also Published As

Publication number Publication date
CN109839612A (en) 2019-06-04
CN109839612B (en) 2022-03-01

Similar Documents

Publication Publication Date Title
WO2020042708A1 (en) Time-frequency masking and deep neural network-based sound source direction estimation method
Wang et al. Robust speaker localization guided by deep learning-based time-frequency masking
Moore et al. Direction of arrival estimation in the spherical harmonic domain using subspace pseudointensity vectors
Wang et al. An iterative approach to source counting and localization using two distant microphones
Blandin et al. Multi-source TDOA estimation in reverberant audio using angular spectra and clustering
Nikunen et al. Direction of arrival based spatial covariance model for blind sound source separation
JP4937622B2 (en) Computer-implemented method for building location model
CN107219512B (en) Sound source positioning method based on sound transfer function
MX2014006499A (en) Apparatus and method for microphone positioning based on a spatial power density.
CN106373589B (en) Binaural mixed speech separation method based on an iterative structure
Pavlidi et al. 3D localization of multiple sound sources with intensity vector estimates in single source zones
Varanasi et al. Near-field acoustic source localization using spherical harmonic features
CN109188362A (en) Microphone array sound source localization signal processing method
Dorfan et al. Distributed expectation-maximization algorithm for speaker localization in reverberant environments
Beit-On et al. Speaker localization using the direct-path dominance test for arbitrary arrays
Peterson et al. Hybrid algorithm for robust, real-time source localization in reverberant environments
Imran et al. A methodology for sound source localization and tracking: Development of 3D microphone array for near-field and far-field applications
CN101771923A (en) Sound source positioning method for a glasses-type digital hearing aid
Wan et al. Improved steered response power method for sound source localization based on principal eigenvector
CN110838303B (en) Voice sound source positioning method using microphone array
Yang et al. Supervised direct-path relative transfer function learning for binaural sound source localization
Hwang et al. Sound source localization using HRTF database
Drude et al. DOA-estimation based on a complex Watson kernel method
Barber et al. End-to-end Alexa device arbitration
Hadad et al. Multi-speaker direction of arrival estimation using SRP-PHAT algorithm with a weighted histogram

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19854432; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 19854432; Country of ref document: EP; Kind code of ref document: A1)