WO2020042708A1 - Time-frequency masking and deep neural network-based sound source direction estimation method - Google Patents


Info

Publication number
WO2020042708A1
WO2020042708A1 (PCT/CN2019/090531)
Authority
WO
WIPO (PCT)
Prior art keywords
target
time
channel
signal
sound signal
Prior art date
Application number
PCT/CN2019/090531
Other languages
French (fr)
Chinese (zh)
Inventor
王中秋
李号
Original Assignee
大象声科(深圳)科技有限公司
Priority date
Filing date
Publication date
Application filed by 大象声科(深圳)科技有限公司 filed Critical 大象声科(深圳)科技有限公司
Publication of WO2020042708A1 publication Critical patent/WO2020042708A1/en

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 3/00: Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S 3/80: Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received, using ultrasonic, sonic or infrasonic waves
    • G01S 3/802: Systems for determining direction or deviation from predetermined direction

Definitions

  • the present disclosure relates to the technical field of computer applications, and in particular, to a method, an apparatus, an electronic device, and a storage medium for estimating a sound source direction based on time-frequency masking and a deep neural network.
  • Sound source localization in noisy environments has many applications in real life, such as human-computer interaction, robotics, and beamforming.
  • Traditionally, sound source localization algorithms such as GCC-PHAT (generalized cross-correlation with phase transform), SRP-PHAT (steered response power with phase transform), and MUSIC (multiple signal classification) are the most common. However, these algorithms can only locate the loudest signal source in the environment, and the loudest source may not be the target speaker at all.
  • For example, in environments with strong reverberation, directional noise, or diffuse noise, the summed GCC-PHAT coefficients can exhibit peaks caused by interference sources, and the noise subspace formed in the MUSIC algorithm from the eigenvectors of the noisy covariance matrix with the smallest eigenvalues may not correspond to true noise.
  • To improve robustness, early work applied SNR (signal-to-noise ratio) weighting to strengthen the target sound frequencies and obtain a higher SNR before running the GCC-PHAT algorithm, using SNR estimation methods such as voice-activity-detection-based algorithms or minimum-mean-square-error-based methods.
  • However, these algorithms usually assume that the noise is stationary, whereas noise in real environments is usually dynamic, which results in poor robustness of the direction estimate when localizing sound sources in real environments.
  • the present disclosure provides a sound source direction estimation method, device, electronic device, and storage medium based on time-frequency masking and deep neural network.
  • a sound source direction estimation method based on time-frequency masking and a deep neural network including:
  • the step of performing an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate the ratio mask corresponding to the target signal in the multi-channel sound signal includes:
  • performing an iterative operation on the short-time Fourier spectrum of each channel's sound signal through a pre-trained neural network model, and calculating the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal.
  • the step of performing an iterative operation on the short-time Fourier spectrum of each channel's sound signal through a pre-trained neural network model, and calculating the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal, includes:
  • taking the direct sound or the reverberant speech signal as the target, and using a deep recurrent neural network model with long short-term memory (LSTM) to calculate the ratio mask corresponding to each single-channel target signal in the multi-channel sound signal.
  • the step of fusing multiple ratio masks to form a single ratio mask includes:
  • multiplying the ratio masks produced for the target signal in the multi-channel sound signal element-wise on the corresponding time-frequency units.
  • in a first scheme, the step of masking and weighting the multi-channel sound signal with the single ratio mask to determine the position of the target sound source includes:
  • summing the masked generalized cross-correlation function along frequency and time, and selecting the direction corresponding to the maximum peak of the summed cross-correlation function as the target sound source direction.
  • in a second scheme, the step of masking and weighting the multi-channel sound signal with the single ratio mask to determine the position of the target sound source includes:
  • calculating, in each time-frequency unit, a covariance matrix of the short-time Fourier spectrum of the multi-channel sound signal;
  • selecting the candidate direction corresponding to the largest overall signal-to-noise ratio as the direction of the target sound source.
  • in a third scheme, the step of masking and weighting the multi-channel sound signal with the single ratio mask to determine the position of the target sound source includes:
  • selecting the candidate direction corresponding to the maximum cosine distance as the azimuth of the target sound source.
  • a sound source direction estimation device based on time-frequency masking and a deep neural network, including:
  • a sound signal acquisition module, configured to acquire multi-channel sound signals;
  • a short-time Fourier spectrum extraction module, configured to perform framing, windowing, and Fourier transformation on each channel's sound signal in the multi-channel sound signal to form the short-time Fourier spectrum of the multi-channel sound signal;
  • a ratio mask calculation module, configured to perform an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate a ratio mask corresponding to the target signal in the multi-channel sound signal;
  • a ratio mask fusion module, configured to fuse multiple ratio masks to form a single ratio mask;
  • a masking weighting module, configured to mask and weight the multi-channel sound signal with the single ratio mask to determine the position of the target sound source.
  • an electronic device including:
  • at least one processor; and
  • a memory connected in communication with the at least one processor; wherein,
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method according to the first aspect.
  • a computer-readable storage medium for storing a program that, when executed, causes an electronic device to perform the method according to the first aspect.
  • After the multi-channel sound signal is acquired, the pre-trained neural network model is used to calculate the ratio masks corresponding to the target signal in the multi-channel sound signal, and the multiple ratio masks are fused to form a single ratio mask.
  • The single ratio mask is then used to mask and weight the multi-channel sound signal to determine the direction of the target sound source, which provides strong robustness in low-SNR, strongly reverberant environments and improves the accuracy and stability of the target sound source direction estimate.
  • Fig. 1 is a flowchart illustrating a method for estimating a sound source direction based on time-frequency masking and a deep neural network according to an exemplary embodiment.
  • FIG. 2 is a first specific implementation flowchart of step S150 in the sound source azimuth estimation method based on time-frequency masking and deep neural network according to the embodiment of FIG. 1.
  • FIG. 3 is a second specific implementation flowchart of step S150 in the sound source azimuth estimation method based on time-frequency masking and deep neural network according to the embodiment of FIG. 1.
  • FIG. 4 is a third specific implementation flowchart of step S150 in the sound source azimuth estimation method based on time-frequency masking and deep neural network according to the embodiment of FIG. 1.
  • Fig. 5 is a schematic diagram of a binaural setup (a) and a schematic diagram of a dual microphone setup (b) according to an exemplary embodiment.
  • Fig. 6 is a block diagram of a sound source azimuth estimation device based on time-frequency masking and a deep neural network according to an exemplary embodiment.
  • FIG. 7 is a first block diagram of a masking weighting module 150 in a sound source azimuth estimation device based on a time-frequency masking and a deep neural network according to the embodiment shown in FIG. 6.
  • FIG. 8 is a second block diagram of a masking weighting module 150 in a sound source azimuth estimation device based on a time-frequency masking and a deep neural network according to the embodiment shown in FIG. 6.
  • FIG. 9 is a third block diagram of a masking weighting module 150 in a sound source azimuth estimation device based on a time-frequency masking and a deep neural network according to the embodiment shown in FIG. 6.
  • Fig. 1 is a flowchart illustrating a method for estimating a sound source position based on time-frequency masking and a deep neural network according to an exemplary embodiment.
  • the sound source azimuth estimation method based on time-frequency masking and deep neural network can be used in electronic devices such as smart phones, smart homes, and computers.
  • the sound source azimuth estimation method based on time-frequency masking and deep neural network may include steps S110, S120, S130, S140, and S150.
  • Step S110 Acquire a multi-channel sound signal.
  • TDOA (time difference of arrival) localization is a method of locating a source using differences in arrival time. TDOA determines the location of the target sound source by detecting the time difference between the arrival of the signal at two or more microphones. This method is widely used; therefore, the accuracy and robustness of the TDOA calculation are particularly important for localizing the target sound source.
  • A multi-channel sound signal is a mixture signal captured by two or more microphone channels.
  • the target sound source needs to be located in the environment based on the received multi-channel sound signals.
  • Step S120 Frame, window, and Fourier transform each sound signal in the multi-channel sound signal to form a short-time Fourier spectrum of the multi-channel sound signal.
  • Framing is to divide a single channel sound signal into multiple time frames according to a preset time period.
  • each channel of the multi-channel sound signal is divided into multiple time frames according to 20 milliseconds per frame, and there is a 10 millisecond overlap between every two adjacent time frames.
  • In an exemplary embodiment, an STFT (short-time Fourier transform) is applied to each time frame to extract the short-time Fourier spectrum, as sketched below.
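  • As an illustration only, the following sketch shows how a single channel could be framed, windowed, and transformed with the parameters above using NumPy; the function and parameter names are assumptions, not part of the disclosure.

```python
import numpy as np

def stft(signal, sr=16000, frame_ms=20, shift_ms=10):
    """Frame, window, and Fourier-transform one channel (illustrative sketch)."""
    frame_len = int(sr * frame_ms / 1000)   # 20 ms frames, e.g. 320 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 10 ms shift gives a 10 ms overlap
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // shift
    spec = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for t in range(n_frames):
        frame = signal[t * shift: t * shift + frame_len] * window
        spec[t] = np.fft.rfft(frame)         # one-sided short-time spectrum
    return spec                              # shape: (time frames, frequency bins)

# A multi-channel signal x of shape (samples, channels) would be processed per channel:
# specs = [stft(x[:, ch]) for ch in range(x.shape[1])]
```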
  • Step S130: Perform an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate the ratio mask corresponding to the target signal in the multi-channel sound signal.
  • The ratio mask characterizes the relationship between a noisy speech signal and the clean speech signal, and indicates an appropriate trade-off between noise suppression and speech retention.
  • Ideally, after the noisy speech signal is masked with the ratio mask, the speech spectrum can be recovered from the noisy speech.
  • The neural network model is pre-trained. The short-time Fourier spectrum of the multi-channel sound signal is extracted and iteratively processed in the neural network model to calculate the ratio masks of the multi-channel sound signal.
  • Optionally, a ratio mask corresponding to each single-channel sound signal in the multi-channel sound signal is calculated separately through the pre-trained neural network model; each single-channel sound signal is then masked with its own ratio mask, applying different weights to different time-frequency (T-F) units, thereby sharpening the peaks corresponding to the target speech in the multi-channel sound signal and suppressing the peaks corresponding to the noise sources.
  • When calculating the ratio mask for each single-channel sound signal, a deep recurrent neural network model with long short-term memory (LSTM) is used to calculate the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal, so that the calculated ratio mask is closer to the ideal ratio mask.
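  • A minimal sketch of such a mask estimator, assuming a two-layer LSTM with 500 units per layer (the configuration reported in the experiments below) operating on per-channel magnitude spectra, is shown here; the class and variable names are illustrative, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Two-layer LSTM mapping one channel's magnitude STFT to a ratio mask in [0, 1]."""
    def __init__(self, n_bins=257, hidden=500):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_bins, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, mag):                  # mag: (batch, frames, frequency bins)
        h, _ = self.lstm(mag)
        return torch.sigmoid(self.out(h))    # sigmoid keeps mask values in [0, 1]

# Training would minimize the mean square error between the predicted mask and the
# ideal ratio mask, e.g. with torch.optim.Adam, as described in the experiments below.
```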
  • Formula (1) gives the ideal ratio mask corresponding to each channel's sound signal in the multi-channel sound signal when the reverberant speech signal is taken as the target. Formula (2) gives the ideal ratio mask corresponding to each channel's sound signal when the direct sound is taken as the target.
  • Reverberant speech is the sound from the source that reaches the microphone after being reflected back and forth in all directions; its acoustic energy is gradually attenuated during propagation as it is absorbed by the walls.
  • Direct sound is the sound transmitted from the source to the microphone along a straight line without any reflection; it determines the clarity of the sound.
  • Here i indicates the microphone channel, and c(f)s(t,f), h(t,f), and n(t,f) are the short-time Fourier transform (STFT) vectors of the direct sound, the reverberation, and the noise, respectively.
  • Since the TDOA information is mainly contained in the direct sound, taking the direct-sound signal as the target may make the ratio mask model closer to the real environment.
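  • The images of formulas (1) and (2) do not survive in this text. Based on the symbols defined above, a plausible reconstruction of the two ideal ratio mask definitions, with the reverberant speech and the direct sound as targets respectively, is given below; the exact original notation may differ.

```latex
\mathrm{IRM}^{\mathrm{rev}}_i(t,f) =
  \frac{\lvert c_i(f)\,s(t,f) + h_i(t,f)\rvert}
       {\lvert c_i(f)\,s(t,f) + h_i(t,f)\rvert + \lvert n_i(t,f)\rvert}
\qquad (1)

\mathrm{IRM}^{\mathrm{dir}}_i(t,f) =
  \frac{\lvert c_i(f)\,s(t,f)\rvert}
       {\lvert c_i(f)\,s(t,f)\rvert + \lvert h_i(t,f) + n_i(t,f)\rvert}
\qquad (2)
```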
  • Step S140: Fuse the multiple ratio masks to form a single ratio mask.
  • Each single-channel sound signal has a corresponding ratio mask.
  • For a multi-channel sound signal comprising multiple single-channel sound signals, there are therefore multiple corresponding ratio masks.
  • The invention fuses the multiple ratio masks to form a single ratio mask.
  • Optionally, the ratio masks generated for the channels of the multi-channel sound signal may be multiplied element-wise on the corresponding time-frequency units to form the single ratio mask, as in the sketch below.
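  • A minimal sketch of this fusion step, assuming per-channel masks with values in [0, 1] stored as NumPy arrays of shape (frames, bins), is:

```python
import numpy as np

def fuse_masks(masks):
    """Fuse per-channel ratio masks by element-wise multiplication (illustrative)."""
    fused = np.ones_like(masks[0])
    for m in masks:              # each mask has shape (frames, bins), values in [0, 1]
        fused *= m
    return fused                 # stays large only where every channel is speech-dominant
```

  • With this choice, a time-frequency unit retains a high weight only if the target speech dominates it in every channel.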
  • Step S150: Mask and weight the multi-channel sound signal with the single ratio mask to determine the position of the target sound source.
  • In a noisy environment, some T-F units are dominated by the target speech. These T-F units, whose phases are less corrupted, are often sufficient to achieve robust localization of the target sound source.
  • Through masking and weighting, the contribution of these speech-dominant units to localization is increased, thereby improving the robustness of the computed TDOA and the accuracy of target sound source localization.
  • step S150 may include steps S151, S152, and S153.
  • In step S151, the short-time Fourier spectrum of the multi-channel input signal is used to calculate the generalized cross-correlation with phase transform (GCC-PHAT) function.
  • In step S152, the generalized cross-correlation function is masked using the single ratio mask.
  • In step S153, the masked generalized cross-correlation function is summed along frequency and time, and the direction corresponding to the maximum peak of the summed cross-correlation function is selected as the target sound source direction.
  • In this embodiment, a deep recurrent neural network model with long short-term memory (LSTM) is used to calculate the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal.
  • the invention can be directly applied to microphone arrays of various geometric shapes.
  • The pair of microphone signals can be modeled as y(t, f) = c(f) s(t, f) + h(t, f) + n(t, f), where s(t, f) is the short-time Fourier transform (STFT) value of the target sound source at time t and frequency f, c(f) is the relative transfer function, and y(t, f) is the STFT vector of the received mixture.
  • The relative transfer function takes the form c(f) = [1, a(f) e^(-j 2π f f_s τ* / N)]^T, where τ* is the underlying time delay in seconds, j is the imaginary unit, a(f) is a real-valued gain, f_s is the sampling rate in Hz, N is the number of DFT frequencies, and [·]^T denotes matrix transpose; f ranges from 0 to N/2.
  • The GCC-PHAT coefficient of each T-F unit for a candidate delay is obtained by compensating the phase of one channel with the candidate delay and taking the real part of the phase-normalized cross-power spectrum of the two channels, where (·)^H denotes the conjugate transpose, Real{·} extracts the real part, |·| computes the magnitude, and subscripts 1 and 2 index the microphone channels.
  • the algorithm first uses candidate delays to align the two microphone signals, then calculates their phase difference and cosine distance. If the cosine distance is close to 1, it means that the candidate delay is close to the true delay (phase difference). Therefore, each GCC coefficient is between -1 and 1. Assuming that the sound source is fixed in each utterance, the GCC coefficients are summed together, and the maximum value is taken as the estimated value of the time delay. PHAT weights are essential here. Without normalization, frequencies with higher energies will have larger GCC coefficients and dominate the summation.
  • The present invention computes the masked GCC-PHAT function by weighting the multi-channel sound signal with the mask:
  • GCC-PHAT_MASK(t, f, τ) = η(t, f) · GCC-PHAT(t, f, τ),    (6)
  • where η(t, f) is the mask-based weighting term applied to each T-F unit in the TDOA estimation. It is defined in formula (7) as the product of the per-channel ratio masks, where the ratio mask corresponding to channel i represents the proportion of target speech energy in each T-F unit of that channel.
  • By weighting the multi-channel sound signals in this way and summing the masked generalized cross-correlation function along frequency and time, the direction corresponding to the maximum peak of the summed cross-correlation function is selected as the target sound source direction, which greatly improves the accuracy of determining the target sound source. A sketch of this scheme follows.
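  • A minimal sketch of the mask-weighted GCC-PHAT search over candidate delays, assuming two-channel STFTs and the fused mask from the step above; the sign convention of the phase term and the function names are assumptions, not taken from the disclosure.

```python
import numpy as np

def masked_gcc_phat_tdoa(spec1, spec2, mask, candidate_delays, sr=16000, n_fft=512):
    """Mask-weighted GCC-PHAT TDOA estimate for one microphone pair (sketch).

    spec1, spec2: STFTs of the two channels, shape (frames, bins)
    mask:         fused ratio mask, same shape, values in [0, 1]
    candidate_delays: candidate TDOAs in seconds
    """
    bins = np.arange(spec1.shape[1])                 # frequency index f = 0 .. N/2
    cross = spec1 * np.conj(spec2)
    cross /= np.abs(spec1) * np.abs(spec2) + 1e-12   # PHAT (phase) normalization
    scores = []
    for tau in candidate_delays:
        steer = np.exp(1j * 2 * np.pi * bins * sr * tau / n_fft)
        gcc = np.real(cross * steer)                 # per-T-F GCC-PHAT coefficient in [-1, 1]
        scores.append(np.sum(mask * gcc))            # mask-weighted sum over time and frequency
    return candidate_delays[int(np.argmax(scores))]
```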
  • step S150 may include step S154, step S155, step S156, step S157, step S158, step S159, and step S160.
  • step S154 the covariance matrix of the short-time Fourier spectrum of the multi-channel sound signal is calculated in each time-frequency unit.
  • In step S155, the single ratio mask is used to mask the covariance matrices; at each individual frequency, the masked covariance matrices are summed along the time dimension to obtain the covariance matrices of the target speech and the background noise at different frequencies, respectively.
  • In step S156, according to the topology of the microphone array, the steering vectors of the candidate directions at different frequencies are calculated.
  • In step S157, according to the noise covariance matrix and the candidate steering vector, the filter coefficients for MVDR (minimum variance distortionless response) beamforming at different frequencies are calculated.
  • In step S158, the beamforming filter coefficients and the target speech covariance matrix are used to calculate the energy of the target speech at different frequencies, and the beamforming filter coefficients and the noise covariance matrix are used to calculate the energy of the background noise at different frequencies.
  • In step S159, the energy ratio of the target speech to the noise is calculated at different frequencies and summed along the frequency dimension to form the overall signal-to-noise ratio for a given candidate direction.
  • In step S160, the candidate direction corresponding to the largest overall signal-to-noise ratio is selected as the azimuth of the target sound source.
  • Here η(t, f), the single ratio mask, is calculated using formula (7).
  • Formula (7) means that only speech-dominated time-frequency units contribute to the target speech covariance matrix, and the more a T-F unit is dominated by the target speech, the greater the weight it receives.
  • Equation (8) uses a similar method to calculate the interference (noise) covariance matrix.
  • The MVDR (minimum variance distortionless response) beamforming filter is then computed from the noise covariance matrix and the candidate steering vector.
  • The SNR of the beamformed signal can be obtained by calculating the energies of the beamformed target speech and noise:
  • the sound source orientation can be predicted as:
  • In Equation (13), we limit the SNR to between 0 and 1. This is conceptually similar to the PHAT weighting in the GCC-PHAT algorithm, where the GCC coefficient of each T-F unit is normalized to between -1 and 1. We can also place more weight on frequencies with higher SNR:
  • a frequency weighting term over f can be defined for this purpose. A sketch of this steered-response SNR scheme follows.
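  • A minimal sketch of the mask-weighted steered-response SNR scheme, assuming pre-computed steering vectors per candidate direction and omitting the optional frequency weighting; all names are illustrative, not taken from the disclosure.

```python
import numpy as np

def steered_response_snr(specs, mask, steering_vectors):
    """Mask-weighted steered-response SNR localization (sketch).

    specs: per-channel STFTs, shape (channels, frames, bins)
    mask:  fused ratio mask, shape (frames, bins)
    steering_vectors: dict mapping candidate direction -> array of shape (bins, channels)
    Returns the candidate direction with the largest summed SNR.
    """
    C, T, F = specs.shape
    eye = 1e-6 * np.eye(C)
    # Mask-weighted speech and noise covariance matrices per frequency
    phi_s = np.zeros((F, C, C), dtype=complex)
    phi_n = np.zeros((F, C, C), dtype=complex)
    for f in range(F):
        y = specs[:, :, f]                                   # (channels, frames)
        outer = np.einsum('ct,dt->tcd', y, np.conj(y))       # per-frame outer products
        phi_s[f] = np.tensordot(mask[:, f], outer, axes=1)   # speech-dominated units
        phi_n[f] = np.tensordot(1.0 - mask[:, f], outer, axes=1)
    best_dir, best_snr = None, -np.inf
    for direction, d in steering_vectors.items():
        snr_sum = 0.0
        for f in range(F):
            dv = d[f][:, None]                               # steering vector, (channels, 1)
            num = np.linalg.solve(phi_n[f] + eye, dv)        # R_n^{-1} d
            w = num / (dv.conj().T @ num)                    # MVDR beamforming weights
            s_energy = np.real(w.conj().T @ phi_s[f] @ w).item()
            n_energy = np.real(w.conj().T @ phi_n[f] @ w).item()
            snr_sum += s_energy / (s_energy + n_energy + 1e-12)  # SNR limited to [0, 1]
        if snr_sum > best_snr:
            best_dir, best_snr = direction, snr_sum
    return best_dir
```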
  • the third solution of step S150 may include steps S161, S162, S163, S164, and S165.
  • In step S161, at each frequency, eigendecomposition is applied to the target speech covariance matrix, and the eigenvector corresponding to the largest eigenvalue is selected as the steering vector of the target speech.
  • Step S162: Use the steering vector of the target speech to calculate the time difference of arrival between the microphone signals.
  • Step S163: Calculate, for each candidate direction, the time difference of arrival between the microphones according to the microphone array topology.
  • Step S164: Calculate the cosine distance between the time difference of arrival derived from the microphone signals and the time differences of arrival of the candidate directions.
  • step S165 a candidate direction corresponding to the maximum cosine distance is selected as the azimuth of the target sound source.
  • The steering vector can be calculated by eigendecomposition, where P{·} extracts the principal eigenvector of the estimated speech covariance matrix calculated in formula (8).
  • If the target speech covariance matrix is estimated accurately, it will be close to a rank-1 matrix, so its principal eigenvector is a reasonable estimate of the steering vector.
  • The basic principle is to calculate the steering vector independently at each frequency; therefore, the linear-phase assumption is not strictly enforced.
  • The present invention enumerates all candidate time delays and searches for the delay τ whose phase delays best match the phase of the estimated steering vector at each frequency; this best-matching delay is used as the final prediction. Similar to formula (15), the frequency weighting is used to emphasize frequencies with higher SNR. A sketch of this scheme follows.
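  • A minimal two-microphone sketch of this steering-vector scheme, using the principal eigenvector of the mask-weighted speech covariance matrix at each frequency and matching candidate delays by the cosine of the phase difference; the frequency weighting is omitted and all names are assumptions, not taken from the disclosure.

```python
import numpy as np

def steering_vector_tdoa(phi_s, candidate_delays, sr=16000, n_fft=512):
    """Estimate the TDOA from per-frequency principal eigenvectors (sketch).

    phi_s: mask-weighted speech covariance matrices, shape (bins, 2, 2)
    candidate_delays: candidate TDOAs in seconds for one microphone pair
    """
    F = phi_s.shape[0]
    phases = np.zeros(F)
    for f in range(F):
        vals, vecs = np.linalg.eigh(phi_s[f])
        d = vecs[:, np.argmax(vals)]                  # principal eigenvector ~ steering vector
        phases[f] = np.angle(d[1] * np.conj(d[0]))    # inter-microphone phase difference
    bins = np.arange(F)
    scores = []
    for tau in candidate_delays:
        hypo = 2 * np.pi * bins * sr * tau / n_fft    # phase implied by the candidate delay
        # Cosine similarity between observed and hypothesized phase differences
        scores.append(np.sum(np.cos(phases - hypo)))
    return candidate_delays[int(np.argmax(scores))]
```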
  • In summary, after the multi-channel sound signal is acquired, the pre-trained neural network model is used to calculate the ratio masks corresponding to the multi-channel sound signal, the multiple ratio masks are fused into a single ratio mask, and the multi-channel sound signal is then masked and weighted by the single ratio mask to determine the direction of the target sound source.
  • The invention therefore has strong robustness in low-SNR and strongly reverberant environments, and improves the accuracy and stability of the target sound source direction estimate.
  • Fig. 5 is a schematic diagram showing a binaural setting and a dual microphone setting according to an exemplary embodiment.
  • the average duration of mixed speech is 2.4 seconds.
  • the calculated input SNR for the reverberation speech and reverberation noise of the two data sets is -6dB. If we consider the direct sound signal as the target speech and the remaining signals as noise, the SNR will be lower.
  • the LSTM contains two hidden layers, each with 500 neurons.
  • the Adam algorithm is used to minimize the mean square error of the ratio film estimation.
  • the window length is 32 ms and the window shift size is 8 ms.
  • the sampling rate is 16kHz.
  • An RIR (room impulse response) generator based on the image method is used to generate RIRs to simulate reverberation.
  • For the training and validation data, an interfering speaker is placed in each of 36 directions, from -87.5° to 87.5° in steps of 5°, and the target speaker is in one of the 36 directions.
  • For the test data, an interfering speaker is placed in each of 37 directions, from -90° to 90° in steps of 5°, and the target speaker is in any of the 37 directions. In this way, the test RIRs are unseen during training.
  • the distance between the target speaker and the center of the array is 1 meter.
  • the size of the room is fixed at 8x8x3m, and two microphones are placed in the center of the room.
  • the distance between the two microphones is 0.2 meters, and the height is set to 1.5 meters.
  • The T60 of each mixed speech segment is randomly selected from 0.0 s to 1.0 s in steps of 0.1 s. IEEE and TIMIT sentences are used to generate the training, validation, and test speech.
  • the binaural room impulse response was simulated using software, where the T60 (reverberation time) range was from 0.0s to 1.0s with a step size of 0.1s.
  • the simulation room size is fixed at 6x4x3m.
  • The BRIRs were simulated by placing the binaural receiver around the center of the room at a height of 2 meters, with the sound source located in one of 37 directions (from -90° to 90° in 5° steps), at the same height as the array and 1.5 meters from the array center.
  • Real BRIRs, collected with a HATS dummy head in four real rooms of different sizes and T60 values, were used for testing.
  • The dummy head was placed at a height of 2.8 meters, and the distance from the sound source to the array was 1.5 meters.
  • The real BRIRs were also measured using the same 37 directions.
  • 720 female IEEE sentences were used as the target speech.
  • For the interfering speech, the sentences of the 630 speakers in the TIMIT dataset are concatenated, and 37 randomly selected speakers and their speech segments are placed in the 37 directions.
  • For each interfering speaker, the first half of the concatenated utterance is used to generate training and validation noise, and the second half is used to generate test noise. There are 10,000, 800, and 3,000 binaural mixtures in the training, validation, and test sets, respectively.
  • The overall localization accuracy results are shown in Tables 1 and 2, where gray marks the performance obtained with the ideal ratio mask.
  • The tables also show the direct-to-reverberant energy ratio (DRR) for each T60 level.
  • The proposed mask-weighted GCC-PHAT algorithm significantly improves over the traditional GCC-PHAT algorithm (in Table 1, from 25.8% to 78.5% and 88.2%; in Table 2, from 29.4% to 91.3% and 90.8%).
  • the steering vector-based TDOA estimation algorithm shows the strongest robustness among all algorithms, especially when T60 is high.
  • Since the time-delay information is mainly contained in the direct sound, in the dual-microphone setting, defining the IRM with the direct sound as the target speech is consistently better than defining it with the reverberant sound as the target (88.2% vs. 78.5%, 90.5% vs. 86.7%, and 91.0% vs. 86.4%).
  • The mask-weighted steered-response SNR algorithm performs relatively worse in the binaural setting than in the dual-microphone setting.
  • In the binaural case, the gains of the different channels are not simply equal. Therefore, estimating the IRM with the reverberant sound as the target speech performs slightly better in the binaural setting than using the direct sound as the target (91.3% vs. 90.8%, 86.4% vs. 70.0%, and 92.0% vs. 91.1%).
  • The following is a device embodiment of the present disclosure, which can be used to implement the foregoing method embodiments of the sound source azimuth estimation method based on time-frequency masking and deep neural networks.
  • For details not disclosed in the device embodiment, please refer to the embodiments of the sound source azimuth estimation method based on time-frequency masking and deep neural networks of the present disclosure.
  • Fig. 6 is a block diagram of a sound source azimuth estimation device based on time-frequency masking and deep neural network according to an exemplary embodiment.
  • The device includes, but is not limited to, a sound signal acquisition module 110, a short-time Fourier spectrum extraction module 120, a ratio mask calculation module 130, a ratio mask fusion module 140, and a masking weighting module 150.
  • a sound signal obtaining module 110 configured to obtain a multi-channel sound signal
  • the short-time Fourier spectrum extraction module 120 is configured to perform frame, window, and Fourier transform on each channel of the multi-channel sound signal to form a short-time Fourier spectrum of the multi-channel sound signal;
  • The ratio mask calculation module 130 is configured to perform an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate a ratio mask corresponding to the target signal in the multi-channel sound signal;
  • the ratio mask fusion module 140 is configured to fuse multiple ratio masks to form a single ratio mask;
  • the masking weighting module 150 is configured to mask and weight the multi-channel sound signal with the single ratio mask and determine the position of the target sound source.
  • The ratio mask calculation module 130 in FIG. 6 includes, but is not limited to, a ratio mask calculation unit.
  • The ratio mask calculation unit is used to perform an iterative operation on the short-time Fourier spectrum of each channel's sound signal through a pre-trained neural network model, and to calculate the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal.
  • Specifically, the ratio mask calculation unit may take the direct sound or the reverberant speech signal as the target, and use a deep recurrent neural network model with long short-term memory (LSTM) to calculate the ratio mask corresponding to each single-channel target signal in the multi-channel sound signal.
  • The ratio mask fusion module 140 in FIG. 6 is specifically used to multiply the ratio masks generated for the target in the multi-channel sound signal element-wise on the corresponding time-frequency units.
  • the masking weighting module 150 in FIG. 6 includes, but is not limited to, a generalized cross-correlation function calculation sub-module 151, a masking sub-module 152, and an orientation determination sub-module 153.
  • the generalized cross-correlation function calculation sub-module 151 is configured to calculate a generalized cross-correlation function using a short-time Fourier spectrum of a multi-channel input signal;
  • a first azimuth determining submodule 153 is configured to add the masked generalized cross-correlation function along frequency and time, and select a direction corresponding to a maximum peak position of the summed cross-correlation function as the azimuth of the target sound source.
  • the second scheme of the masking weighting module 150 in FIG. 6 includes, but is not limited to, a covariance matrix calculation submodule 154, a covariance matrix masking submodule 155, a candidate direction guidance vector calculation submodule 156, The beamforming filter coefficient calculation sub-module 157, the energy calculation sub-module 158, the overall signal-to-noise ratio calculation sub-module 159, and the second orientation determination sub-module 160.
  • a covariance matrix calculation submodule 154 configured to calculate a covariance matrix of a short-time Fourier spectrum of a multi-channel sound signal in each time-frequency unit;
  • the covariance matrix masking sub-module 155 is used to mask the covariance matrices with the single ratio mask; at each individual frequency, the masked covariance matrices are summed along the time dimension to obtain the covariance matrices of the target speech and the noise at different frequencies;
  • Candidate direction steering vector calculation submodule 156 for calculating the steering vectors of candidate directions at different frequencies according to the topology of the microphone array
  • the beamforming filter coefficient calculation submodule 157 is configured to calculate filter coefficients of MVDR beamforming at different frequencies according to a noise covariance matrix and a candidate steering vector;
  • An energy calculation sub-module 158 is configured to calculate energy of a target voice at different frequencies by using beamforming filter coefficients and a target voice covariance matrix, and calculate energy of noise at different frequencies by using a beamforming filter coefficient and a noise covariance matrix;
  • the overall signal-to-noise ratio forming sub-module 159 is used to calculate the energy ratio of the target speech and noise at different frequencies, and sum them along the frequency dimension to form an overall signal-to-noise ratio in a certain candidate direction;
  • the second orientation determination sub-module 160 is configured to select a candidate direction corresponding to the largest overall signal-to-noise ratio as the orientation of the target sound source.
  • The third scheme of the masking weighting module 150 in FIG. 6 includes, but is not limited to, a speech steering vector calculation sub-module 161, an arrival time difference calculation sub-module 162, a candidate direction arrival time difference sub-module 163, a cosine distance calculation sub-module 164, and a third azimuth determination sub-module 165.
  • a speech steering vector calculation sub-module 161, which applies eigendecomposition to the target speech covariance matrix at different frequencies and selects the eigenvector corresponding to the largest eigenvalue as the steering vector of the target speech;
  • Arrival time difference calculation sub-module 162 for calculating the arrival time difference between the microphone signals by using the steering vector of the target voice
  • Candidate direction arrival time difference sub-module 163 is configured to calculate the difference in arrival time between candidate directions between microphones according to the microphone array topology
  • a cosine distance calculation sub-module 164, configured to calculate the cosine distance between the time difference of arrival derived from the microphone signals and the time differences of arrival of the candidate directions between the microphones;
  • the third azimuth determination sub-module 165 is configured to select the candidate direction corresponding to the maximum cosine distance as the azimuth of the target sound source.
  • the present invention further provides an electronic device that performs all or part of the steps of the sound source azimuth estimation method based on the time-frequency masking and the deep neural network as shown in any of the above exemplary embodiments.
  • Electronic equipment includes:
  • a memory connected in communication with the processor; wherein,
  • the memory stores readable instructions that, when executed by the processor, implement the method according to any one of the foregoing exemplary embodiments.
  • In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, such as a temporary or non-transitory computer-readable storage medium including instructions.

Abstract

A time-frequency masking and deep neural network-based sound source direction estimation method and device, an electronic device, and a storage medium, belonging to the field of computer technology. The method comprises: acquiring a multi-channel sound signal (S110); performing framing, windowing, and Fourier transformation on each channel's sound signal in the multi-channel signal to form the short-time Fourier spectrum of the multi-channel sound signal (S120); performing an iterative operation on the short-time Fourier spectrum by means of a pre-trained neural network model to calculate the ratio masks corresponding to target signals in the multi-channel sound signal (S130); fusing the multiple ratio masks to form a single ratio mask (S140); and masking and weighting the multi-channel signal by means of the single ratio mask to determine the direction of a target sound source (S150). The method and device are strongly robust in low signal-to-noise ratio and strongly reverberant environments, and improve the accuracy and stability of target sound source direction estimation.

Description

Sound source direction estimation method based on time-frequency masking and deep neural network

Technical Field

The present disclosure relates to the technical field of computer applications, and in particular to a sound source direction estimation method, apparatus, electronic device, and storage medium based on time-frequency masking and a deep neural network.

Background

Sound source localization in noisy environments has many real-life applications, such as human-computer interaction, robotics, and beamforming. Traditionally, sound source localization algorithms such as GCC-PHAT (generalized cross-correlation with phase transform), SRP-PHAT (steered response power with phase transform), and MUSIC (multiple signal classification) are the most common. However, these algorithms can only locate the loudest signal source in the environment, and the loudest source may not be the target speaker at all. For example, in environments with strong reverberation, directional noise, or diffuse noise, the summed GCC-PHAT coefficients can exhibit peaks caused by interference sources, and the noise subspace formed in the MUSIC algorithm from the eigenvectors of the noisy covariance matrix with the smallest eigenvalues may not correspond to true noise.

To improve robustness, early work applied SNR (signal-to-noise ratio) weighting to strengthen the target sound frequencies and obtain a higher SNR before running the GCC-PHAT algorithm, using SNR estimation methods such as voice-activity-detection-based algorithms or minimum-mean-square-error-based methods. However, these algorithms usually assume that the noise is stationary, whereas noise in real environments is usually dynamic, which results in poor robustness of the direction estimate when localizing sound sources in real environments.

Summary of the Invention

To solve the technical problem of poor robustness of direction estimation, the present disclosure provides a sound source direction estimation method, device, electronic device, and storage medium based on time-frequency masking and a deep neural network.
In a first aspect, a sound source direction estimation method based on time-frequency masking and a deep neural network is provided, including:

acquiring a multi-channel sound signal;

performing framing, windowing, and Fourier transformation on each channel's sound signal in the multi-channel sound signal to form the short-time Fourier spectrum of the multi-channel sound signal;

performing an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate the ratio mask corresponding to the target signal in the multi-channel sound signal;

fusing multiple ratio masks to form a single ratio mask; and

masking and weighting the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source.

Optionally, the step of performing an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate the ratio mask corresponding to the target signal in the multi-channel sound signal includes:

performing an iterative operation on the short-time Fourier spectrum of each channel's sound signal through a pre-trained neural network model, and calculating the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal.

Optionally, the step of performing an iterative operation on the short-time Fourier spectrum of each channel's sound signal through a pre-trained neural network model, and calculating the ratio mask corresponding to each channel's sound signal in the multi-channel sound signal, includes:

taking the direct sound or the reverberant speech signal as the target, and using a deep recurrent neural network model with long short-term memory (LSTM) to calculate the ratio mask corresponding to each single-channel target signal in the multi-channel sound signal.

Optionally, the step of fusing multiple ratio masks to form a single ratio mask includes:

multiplying the ratio masks produced for the target signal in the multi-channel sound signal element-wise on the corresponding time-frequency units.

Optionally, in a first scheme, the step of masking and weighting the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source includes:

calculating the generalized cross-correlation function using the short-time Fourier spectrum of the multi-channel input signal;

masking the generalized cross-correlation function with the single ratio mask; and

summing the masked generalized cross-correlation function along frequency and time, and selecting the direction corresponding to the maximum peak of the summed cross-correlation function as the direction of the target sound source.
Optionally, in a second scheme, the step of masking and weighting the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source includes:

calculating, in each time-frequency unit, a covariance matrix of the short-time Fourier spectrum of the multi-channel sound signal;

masking the covariance matrices with the single ratio mask, and, at each individual frequency, summing the masked covariance matrices along the time dimension to obtain the covariance matrices of the target speech and the noise at different frequencies, respectively;

calculating the steering vectors of the candidate directions at different frequencies according to the topology of the microphone array;

calculating the filter coefficients of MVDR beamforming at different frequencies according to the noise covariance matrix and the candidate steering vector;

calculating the energy of the target speech at different frequencies using the beamforming filter coefficients and the target speech covariance matrix, and calculating the energy of the noise at different frequencies using the beamforming filter coefficients and the noise covariance matrix;

calculating the energy ratio of the target speech to the noise at different frequencies and summing it along the frequency dimension to form the overall signal-to-noise ratio for a given candidate direction; and

selecting the candidate direction corresponding to the largest overall signal-to-noise ratio as the direction of the target sound source.

Optionally, in a third scheme, the step of masking and weighting the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source includes:

applying eigendecomposition to the target speech covariance matrix at different frequencies, and selecting the eigenvector corresponding to the largest eigenvalue as the steering vector of the target speech;

calculating the time difference of arrival between the microphone signals using the steering vector of the target speech;

calculating, for each candidate direction, the time difference of arrival between the microphones according to the microphone array topology;

calculating the cosine distance between the time difference of arrival derived from the microphone signals and the time differences of arrival of the candidate directions between the microphones; and

selecting the candidate direction corresponding to the maximum cosine distance as the azimuth of the target sound source.
In a second aspect, a sound source direction estimation device based on time-frequency masking and a deep neural network is provided, including:

a sound signal acquisition module, configured to acquire a multi-channel sound signal;

a short-time Fourier spectrum extraction module, configured to perform framing, windowing, and Fourier transformation on each channel's sound signal in the multi-channel sound signal to form the short-time Fourier spectrum of the multi-channel sound signal;

a ratio mask calculation module, configured to perform an iterative operation on the short-time Fourier spectrum through a pre-trained neural network model to calculate the ratio mask corresponding to the target signal in the multi-channel sound signal;

a ratio mask fusion module, configured to fuse multiple ratio masks to form a single ratio mask; and

a masking weighting module, configured to mask and weight the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source.

In a third aspect, an electronic device is provided, including:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to the first aspect.

In a fourth aspect, a computer-readable storage medium is provided for storing a program that, when executed, causes an electronic device to perform the method according to the first aspect.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:

When localizing a target sound source by estimating its time difference of arrival, after the multi-channel sound signal is acquired, the pre-trained neural network model is used to calculate the ratio masks corresponding to the target signal in the multi-channel sound signal, and the multiple ratio masks are fused to form a single ratio mask. The single ratio mask is then used to mask and weight the multi-channel sound signal to determine the direction of the target sound source, which provides strong robustness in low-SNR, strongly reverberant environments and improves the accuracy and stability of the target sound source direction estimate.

It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the scope of the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS

The drawings herein are incorporated into and constitute a part of the specification, illustrate embodiments consistent with the invention, and together with the description serve to explain the principles of the invention.

Fig. 1 is a flowchart illustrating a sound source direction estimation method based on time-frequency masking and a deep neural network according to an exemplary embodiment.

FIG. 2 is a first specific implementation flowchart of step S150 in the sound source azimuth estimation method based on time-frequency masking and deep neural network according to the embodiment of FIG. 1.

FIG. 3 is a second specific implementation flowchart of step S150 in the sound source azimuth estimation method based on time-frequency masking and deep neural network according to the embodiment of FIG. 1.

FIG. 4 is a third specific implementation flowchart of step S150 in the sound source azimuth estimation method based on time-frequency masking and deep neural network according to the embodiment of FIG. 1.

Fig. 5 is a schematic diagram of a binaural setup (a) and a dual-microphone setup (b) according to an exemplary embodiment.

Fig. 6 is a block diagram of a sound source azimuth estimation device based on time-frequency masking and a deep neural network according to an exemplary embodiment.

FIG. 7 is a first block diagram of the masking weighting module 150 in the sound source azimuth estimation device based on time-frequency masking and a deep neural network according to the embodiment shown in FIG. 6.

FIG. 8 is a second block diagram of the masking weighting module 150 in the sound source azimuth estimation device based on time-frequency masking and a deep neural network according to the embodiment shown in FIG. 6.

FIG. 9 is a third block diagram of the masking weighting module 150 in the sound source azimuth estimation device based on time-frequency masking and a deep neural network according to the embodiment shown in FIG. 6.

DETAILED DESCRIPTION

Exemplary embodiments are described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.
图1是根据一示例性实施例示出的一种基于时频掩蔽和深度神经网络的声 源方位估计方法的流程图。该基于时频掩蔽和深度神经网络的声源方位估计方法可用于智能手机、智能家居、电脑等电子设备中。如图1所示,该基于时频掩蔽和深度神经网络的声源方位估计方法可以包括步骤S110、步骤S120、步骤S130、步骤S140和步骤S150。Fig. 1 is a flowchart illustrating a method for estimating a sound source position based on time-frequency masking and a deep neural network according to an exemplary embodiment. The sound source azimuth estimation method based on time-frequency masking and deep neural network can be used in electronic devices such as smart phones, smart homes, and computers. As shown in FIG. 1, the sound source azimuth estimation method based on time-frequency masking and deep neural network may include steps S110, S120, S130, S140, and S150.
步骤S110,获取多通道声音信号。Step S110: Acquire a multi-channel sound signal.
TDOA(Time Difference of Arrival,到达时间差)定位是一种利用到达时间差进行定位的方法。通过测量信号到达监测点的时间,可以确定目标声源的距离。利用目标声源到各个麦克风的距离,就能确定目标声源的位置。但是声源在空间转播时间比较难测量。通过比较声音信号到达各麦克风的到达时间差,能较好确定声源的位置。TDOA (Time Difference of Arrival) positioning is a method of locating using the time of arrival difference. By measuring the time it takes for the signal to reach the monitoring point, the distance of the target sound source can be determined. Using the distance from the target sound source to each microphone, the position of the target sound source can be determined. However, it is more difficult to measure the sound source's time in space. By comparing the difference in the time of arrival of the sound signal to each microphone, the position of the sound source can be better determined.
不同于计算转播时间,TDOA是通过检测信号到达两个或多个麦克风的时间差来确定目标声源的位置。这一方法被广泛采用。因此,TDOA计算的准确性和鲁棒性在目标声源的定位中就显得尤为重要。多通道声音信号是包含2个或2个以上麦克风通道混合的声音信号。Unlike calculating the broadcast time, TDOA determines the location of the target sound source by detecting the time difference between when the signals arrive at two or more microphones. This method is widely used. Therefore, the accuracy and robustness of TDOA calculation is particularly important in the localization of the target sound source. A multi-channel sound signal is a sound signal containing a mix of two or more microphone channels.
通常地,多个麦克风装设于噪音环境中的不同位置,通过麦克风接收不同位置的声音信号。但在现实环境中,除了目标声源所发出的声音信号外,还有其他噪声声源发出的声音信号。因此,需根据接收的多通道声音信号,在所处环境中进行目标声源的定位。Generally, multiple microphones are installed at different positions in a noisy environment, and the sound signals at different positions are received through the microphones. But in the real environment, in addition to the sound signal from the target sound source, there are sound signals from other noise sound sources. Therefore, the target sound source needs to be located in the environment based on the received multi-channel sound signals.
步骤S120,对多通道声音信号中的每一通道声音信号进行分帧、加窗和傅里叶变换,形成多通道声音信号的短时傅里叶频谱。Step S120: Frame, window, and Fourier transform each sound signal in the multi-channel sound signal to form a short-time Fourier spectrum of the multi-channel sound signal.
分帧是按照预设时间周期,将单通道声音信号分为多个时间帧。Framing is to divide a single channel sound signal into multiple time frames according to a preset time period.
在一具体示例性实施例中,将多通道声音信号中的每一通道声音信号按照每帧20毫秒分为多个时间帧,且每两个相邻的时间帧之间具有10毫秒的重叠。In a specific exemplary embodiment, each channel of the multi-channel sound signal is divided into multiple time frames according to 20 milliseconds per frame, and there is a 10 millisecond overlap between every two adjacent time frames.
在一示例性实施例中,将STFT(short-time Fourier transform,短时傅里叶变换)应用于每个时间帧以提取短时傅里叶频谱。In an exemplary embodiment, an STFT (short-time Fourier transform) is applied to each time frame to extract a short-time Fourier spectrum.
Step S130: run the short-time Fourier spectrum through a pre-trained neural network model to compute the ratio mask corresponding to the target signal in the multi-channel sound signal.

The ratio mask characterizes the relationship between the noisy speech signal and the clean speech signal; it encodes an appropriate trade-off between suppressing noise and preserving speech.

Ideally, after the noisy speech signal is masked with the ratio mask, the speech spectrum can be recovered from the noisy mixture.

The neural network model is trained in advance. The short-time Fourier spectrum of the multi-channel signal is extracted and passed through the model to compute the ratio masks of the multi-channel signal.

Optionally, the masks are computed per channel: the pre-trained model estimates a ratio mask for each single-channel signal, and each channel is masked with its own mask. Different time-frequency (T-F) units thus receive different weights, which sharpens the peaks corresponding to the target speech in the multi-channel signal and suppresses the peaks corresponding to noise sources.

When computing the per-channel ratio masks, a deep recurrent neural network with long short-term memory is used, so that the estimated masks come closer to the ideal ratio mask.
Formula (1) defines the ideal ratio mask of each channel when the reverberant speech signal is taken as the target. Formula (2) defines the ideal ratio mask of each channel when the direct sound is taken as the target.

Reverberant speech is the sound that reaches the microphone after the waves emitted by the source have been reflected back and forth in all directions; its energy decays gradually as it is absorbed by the walls.

Direct sound is the sound that travels from the source to the microphone in a straight line without any reflection; it determines the clarity of the sound.

$$\mathrm{IRM}^{\mathrm{rev}}_i(t,f)=\frac{|c_i(f)s(t,f)+h_i(t,f)|}{|c_i(f)s(t,f)+h_i(t,f)|+|n_i(t,f)|}\qquad(1)$$

$$\mathrm{IRM}^{\mathrm{dir}}_i(t,f)=\frac{|c_i(f)s(t,f)|}{|c_i(f)s(t,f)|+|h_i(t,f)+n_i(t,f)|}\qquad(2)$$

where i indexes the microphone channel and c_i(f)s(t,f), h_i(t,f), and n_i(t,f) are the short-time Fourier transform (STFT) components of the direct sound, the reverberation, and the reverberant noise, respectively.

Since the TDOA information is mainly contained in the direct sound, taking the direct-sound signal as the target may bring the mask estimation model closer to the real environment.
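As a non-limiting sketch, the two ideal-ratio-mask training targets of formulas (1) and (2) may be computed as follows, assuming the direct-path, reverberation and noise STFTs of one channel are available separately (as they are when training mixtures are simulated); the magnitude-domain form shown here is one common choice:

```python
import numpy as np

def ideal_ratio_masks(direct, reverb, noise, eps=1e-8):
    """Per-channel ideal ratio masks for one microphone.

    direct, reverb, noise: complex STFTs (time x frequency) of the
    direct-path speech c(f)s(t,f), its reverberation h(t,f), and the
    reverberant noise n(t,f).
    Returns (irm_reverb_target, irm_direct_target), both in [0, 1].
    """
    # Formula (1): reverberant speech (direct + reverberation) is the target.
    s_rev = np.abs(direct + reverb)
    irm_reverb = s_rev / (s_rev + np.abs(noise) + eps)
    # Formula (2): only the direct sound is the target; the reverberation
    # is treated as interference together with the noise.
    d_mag = np.abs(direct)
    irm_direct = d_mag / (d_mag + np.abs(reverb + noise) + eps)
    return irm_reverb, irm_direct
```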
Optionally, other methods may also be used to compute the ratio mask of each single-channel signal; they are not described one by one here.

Step S140: fuse the multiple ratio masks into a single ratio mask.

As described above, each single-channel signal has its own ratio mask, so a multi-channel signal comprising several single-channel signals has several corresponding masks.

The present invention fuses the multiple ratio masks into a single ratio mask.

Specifically, the ratio masks produced for the multi-channel sound signal may be multiplied together on the corresponding time-frequency units to form the single mask.
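A minimal sketch of this fusion step, assuming the estimated per-channel masks are stacked in an array of shape (channels, time, frequency):

```python
import numpy as np

def fuse_masks(masks):
    """Fuse per-channel ratio masks into a single mask (formula (7)):
    the masks are multiplied element-wise over the channel dimension,
    so a T-F unit receives a large weight only if it is speech-dominant
    in every channel."""
    # masks: array of shape (channels, time, frequency), values in [0, 1]
    return np.prod(masks, axis=0)
```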
Step S150: apply masking weighting to the multi-channel sound signal with the single ratio mask and determine the direction of the target sound source.

It should be noted that even for severely corrupted speech there are still many T-F units dominated by the target speech. These units, whose phase is comparatively clean, are usually sufficient for robust localization of the target source. Masking weighting increases the contribution of the speech-dominated units to the localization, which improves the robustness of the computed TDOA and the accuracy of the target source localization.

Optionally, in an exemplary embodiment shown in Fig. 2, step S150 may include steps S151, S152, and S153.

Step S151: compute the generalized cross-correlation function with phase transform (GCC-PHAT) from the short-time Fourier spectra of the multi-channel input signal.

Step S152: mask the generalized cross-correlation function with the single ratio mask.

Step S153: sum the masked generalized cross-correlation function over frequency and time, and take the direction corresponding to the largest peak of the summed cross-correlation function as the direction of the target sound source.

As described above, a deep recurrent neural network with long short-term memory is used to compute the ratio mask of each channel of the multi-channel signal. The invention can be applied directly to microphone arrays of arbitrary geometry.
Assume there is a single target source and one pair of microphones. In a reverberant and noisy environment, the pair of microphone signals can be modeled as

$$y(t,f)=c(f)s(t,f)+h(t,f)+n(t,f),\qquad(3)$$

where s(t,f) is the STFT value of the target source at time t and frequency f, c(f) is the relative transfer function, h(t,f) and n(t,f) are the reverberation and noise components, and y(t,f) is the STFT vector of the received mixture. Taking the first microphone as the reference, the relative transfer function c(f) can be written as

$$c(f)=\bigl[1,\;A(f)\,e^{-j2\pi f f_s\tau^{*}/N}\bigr]^{T},\qquad(4)$$

where τ* is the underlying time delay in seconds, j is the imaginary unit, A(f) is a real-valued gain, f_s is the sampling rate in Hz, N is the number of DFT frequencies, and [·]^T denotes matrix transpose. f ranges from 0 to N/2.

The time delay is estimated from the generalized cross-correlation function with a phase-transform weighting:

$$GCC_{PHAT}(t,f,\tau)=\mathrm{Real}\!\left\{\frac{y_1(t,f)^{*}\,y_2(t,f)\,e^{\,j2\pi f f_s\tau/N}}{|y_1(t,f)|\,|y_2(t,f)|}\right\},\qquad(5)$$

where (·)* denotes complex conjugation, Real{·} extracts the real part, and |·| computes the magnitude. Subscripts 1 and 2 index the microphone channels. Intuitively, the algorithm first aligns the two microphone signals with a candidate delay and then computes the cosine distance of their phase difference. A cosine distance close to 1 means the candidate delay is close to the true delay (phase difference), so every GCC coefficient lies between -1 and 1. Assuming the source is fixed within each utterance, the GCC coefficients are pooled and summed, and the delay giving the maximum sum is taken as the TDOA estimate. The PHAT weighting is essential here: without this normalization, frequencies with higher energy would have larger GCC coefficients and dominate the sum.
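The per-unit GCC-PHAT coefficient of formula (5) may be computed as in the following sketch, assuming two-channel STFTs of shape (time, frequency) and candidate delays expressed in samples (τ·f_s); the variable names are illustrative:

```python
import numpy as np

def gcc_phat_coeffs(Y1, Y2, candidate_delays, n_fft):
    """GCC-PHAT coefficient for every (frame, frequency, candidate delay).

    Y1, Y2: complex STFTs, shape (T, F) with F = n_fft // 2 + 1.
    candidate_delays: array of candidate TDOAs in samples.
    Returns an array of shape (T, F, D) with values in [-1, 1]; a value
    near 1 means the candidate delay matches the observed inter-channel
    phase difference at that T-F unit.
    """
    T, F = Y1.shape
    freqs = np.arange(F)                              # DFT bin indices f
    cross = np.conj(Y1) * Y2                          # inter-channel cross term
    cross /= np.abs(Y1) * np.abs(Y2) + 1e-8           # PHAT normalization
    # phase compensation e^{j 2 pi f tau / N} for every candidate delay
    comp = np.exp(1j * 2 * np.pi
                  * freqs[None, :] * np.asarray(candidate_delays)[:, None] / n_fft)
    # broadcast (T, F, 1) * (1, F, D) -> (T, F, D) and keep the real part
    return np.real(cross[:, :, None] * comp.T[None, :, :])
```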
In the present invention the GCC-PHAT function is computed after masking weighting of the multi-channel signal:

$$GCC_{PHAT\text{-}MASK}(t,f,\tau)=\eta(t,f)\,GCC_{PHAT}(t,f,\tau),\qquad(6)$$

where η(t,f) is the masking weight of a T-F unit in the TDOA estimation. It can be defined as

$$\eta(t,f)=\prod_{i=1}^{D}\hat{M}_i(t,f),\qquad(7)$$

where D (= 2 in this example) is the number of microphone channels and \hat{M}_i(t,f) is the ratio mask of channel i, representing the proportion of target speech energy at each T-F unit of that channel.

By applying masking weighting to the multi-channel signal, summing the masked generalized cross-correlation function over frequency and time, and selecting the direction corresponding to the largest peak of the summed function as the direction of the target source, the accuracy of the estimated target direction is greatly improved.
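The masked GCC-PHAT localization of formulas (6) and (7) then reduces to weighting the per-unit coefficients with the fused mask, summing over time and frequency, and picking the best candidate; the microphone spacing, candidate grid, sampling rate and speed of sound below are illustrative assumptions:

```python
import numpy as np

def estimate_direction(gcc, fused_mask, candidate_angles_deg):
    """Masked GCC-PHAT direction estimate (formulas (6) and (7)).

    gcc: per-unit GCC-PHAT coefficients, shape (T, F, D), as returned by
         gcc_phat_coeffs, where the D candidate delays correspond to
         candidate_angles_deg.
    fused_mask: single fused ratio mask eta(t, f), shape (T, F).
    """
    masked = fused_mask[:, :, None] * gcc         # formula (6)
    score = masked.sum(axis=(0, 1))               # sum over time and frequency
    return candidate_angles_deg[int(np.argmax(score))]

# Candidate grid from -90 to 90 degrees in 5-degree steps, mapped to delays
# in samples for a 0.2 m microphone spacing at 16 kHz (far-field assumption):
angles = np.arange(-90, 91, 5)
delays_in_samples = 0.2 * np.sin(np.deg2rad(angles)) / 343.0 * 16000
```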
Optionally, in an exemplary embodiment shown in Fig. 3, another implementation of step S150 may include steps S154, S155, S156, S157, S158, S159, and S160.

Step S154: in every time-frequency unit, compute the covariance matrix of the short-time Fourier spectra of the multi-channel signal.

Step S155: mask the covariance matrices with the single ratio mask and, at each frequency, sum the masked covariance matrices along the time dimension to obtain the covariance matrices of the target speech and of the background noise at that frequency.

Step S156: from the topology of the microphone array, compute the steering vectors of the candidate directions at each frequency.

Step S157: from the noise covariance matrix and the candidate steering vector, compute the MVDR (Minimum Variance Distortionless Response) beamforming filter coefficients at each frequency.

Step S158: use the beamforming filter coefficients and the target-speech covariance matrix to compute the energy of the target speech at each frequency, and the beamforming filter coefficients and the noise covariance matrix to compute the energy of the background noise at each frequency.

Step S159: at each frequency, compute the energy ratio of target speech to noise and sum it along the frequency dimension to obtain the overall signal-to-noise ratio for the candidate direction.

Step S160: select the candidate direction with the largest overall signal-to-noise ratio as the direction of the target sound source.
Formulas (8) and (9) compute the covariance matrix of the target speech and the covariance matrix of the noise from the time-frequency units:

$$\hat{\Phi}_s(f)=\sum_{t}\eta(t,f)\,y(t,f)\,y(t,f)^{H},\qquad(8)$$

$$\hat{\Phi}_n(f)=\sum_{t}\xi(t,f)\,y(t,f)\,y(t,f)^{H}.\qquad(9)$$

η(t,f) is computed with formula (7), i.e. it is the single fused ratio mask.

ξ(t,f) is computed as

$$\xi(t,f)=\prod_{i=1}^{D}\bigl(1-\hat{M}_i(t,f)\bigr).$$

Essentially, formula (8) means that only the speech-dominated time-frequency units are used to compute the target-speech covariance matrix, and the more the target speech dominates a unit, the larger the weight it receives. Formula (9) computes the interference covariance matrix in the same way.
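A sketch of formulas (8) and (9), assuming the multi-channel STFT Y has shape (channels, time, frequency) and that eta and xi are the speech and noise weights of each time-frequency unit:

```python
import numpy as np

def weighted_covariances(Y, eta, xi):
    """Mask-weighted spatial covariance matrices per frequency.

    Y:   complex STFT, shape (D, T, F) for D microphones.
    eta: speech weight per T-F unit, shape (T, F)  -- formula (7).
    xi:  noise weight per T-F unit, shape (T, F).
    Returns (Phi_s, Phi_n), each of shape (F, D, D).
    """
    # Phi[f] = sum_t w(t, f) * y(t, f) y(t, f)^H  -- formulas (8) and (9)
    Phi_s = np.einsum('tf,dtf,etf->fde', eta, Y, np.conj(Y))
    Phi_n = np.einsum('tf,dtf,etf->fde', xi, Y, np.conj(Y))
    return Phi_s, Phi_n
```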
Next, under the free-field and plane-wave assumptions, the unit-length steering vector of a potential target position k is modeled as

$$v_k(f)=\frac{1}{\sqrt{D}}\Bigl[e^{-j2\pi f f_s d_{k1}/(N C_s)},\;\dots,\;e^{-j2\pi f f_s d_{kD}/(N C_s)}\Bigr]^{T},\qquad(10)$$

where d_{ki} is the distance between source position k and microphone i, and C_s is the speed of sound. A minimum variance distortionless response (MVDR) beamformer can then be constructed as

$$\hat{w}_k(f)=\frac{\hat{\Phi}_n(f)^{-1}v_k(f)}{v_k(f)^{H}\hat{\Phi}_n(f)^{-1}v_k(f)}.\qquad(11)$$

Afterwards, the SNR of the beamformed signal is obtained by computing the energies of the beamformed target speech and noise:

$$SNR_k(f)=\frac{\hat{w}_k(f)^{H}\hat{\Phi}_s(f)\hat{w}_k(f)}{\hat{w}_k(f)^{H}\hat{\Phi}_n(f)\hat{w}_k(f)}.\qquad(12)$$

Finally, the source direction can be predicted as

$$\hat{k}=\arg\max_{k}\sum_{f}\frac{SNR_k(f)}{1+SNR_k(f)}.\qquad(13)$$

In formula (13) the SNR is limited to between 0 and 1, essentially like the PHAT weighting in the GCC-PHAT algorithm, where the GCC coefficient of every T-F unit is normalized to the range -1 to 1. More weight can also be placed on frequencies with higher SNR:

$$\hat{k}=\arg\max_{k}\sum_{f}\gamma(f)\,\frac{SNR_k(f)}{1+SNR_k(f)},\qquad(14)$$

where γ(f) can be defined as

$$\gamma(f)=\sum_{t}\eta(t,f).\qquad(15)$$

The sum of the combined speech mask within each frequency indicates the importance of that frequency. In experiments, the frequency-weighted objective of formula (14) with the weighting of formula (15) was found to give better results than formula (13).
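A sketch of the steered-response SNR scoring of formulas (10)-(15) for one candidate position; the sampling rate, FFT size, speed of sound and the diagonal loading added before inverting the noise covariance are illustrative assumptions, not part of the claimed method:

```python
import numpy as np

def steered_response_snr(Phi_s, Phi_n, dists, eta, fs=16000, c_sound=343.0,
                         n_fft=512, diag_load=1e-6):
    """Score one candidate source position k (formulas (10)-(15)).

    Phi_s, Phi_n: (F, D, D) speech / noise covariance matrices.
    dists: length-D array of distances d_ki from position k to each mic.
    eta:   fused mask, shape (T, F); gamma(f) = sum_t eta(t, f), formula (15).
    """
    F, D, _ = Phi_s.shape
    gamma = eta.sum(axis=0)                                    # formula (15)
    score = 0.0
    for f in range(F):
        # formula (10): unit-length free-field steering vector
        v = np.exp(-1j * 2 * np.pi * f * fs * dists / (n_fft * c_sound))
        v = v / np.sqrt(D)
        # formula (11): MVDR weights against the noise covariance
        Phi_n_inv = np.linalg.inv(Phi_n[f] + diag_load * np.eye(D))
        w = Phi_n_inv @ v / (v.conj() @ Phi_n_inv @ v)
        # formula (12): SNR of the beamformed signal at this frequency
        snr = np.real(w.conj() @ Phi_s[f] @ w) / np.real(w.conj() @ Phi_n[f] @ w)
        # formulas (13)/(14): normalize to [0, 1) and weight by gamma(f)
        score += gamma[f] * snr / (1.0 + snr)
    return score

# The estimated direction is the candidate k that maximizes steered_response_snr.
```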
Optionally, in an exemplary embodiment shown in Fig. 4, a third implementation of step S150 may include steps S161, S162, S163, S164, and S165.

Step S161: at each frequency, apply an eigendecomposition to the target-speech covariance matrix and take the eigenvector with the largest eigenvalue as the steering vector of the target speech.

Step S162: use the steering vector of the target speech to compute the time difference of arrival between the microphone signals.

Step S163: from the microphone-array topology, compute the time difference of arrival between the microphones for each candidate direction.

Step S164: compute the cosine distance between the time difference of arrival of the microphone signals and the time difference of arrival of each candidate direction.

Step S165: select the candidate direction with the largest cosine distance as the direction of the target sound source.
The steering vector can be computed with

$$\hat{c}(f)=\mathcal{P}\{\hat{\Phi}_s(f)\},\qquad(16)$$

where P{·} extracts the principal eigenvector of the estimated speech covariance matrix computed in formula (8). If \hat{\Phi}_s(f) is estimated well, it will be close to a rank-1 matrix, so its principal eigenvector is a reasonable estimate of the steering vector.

To estimate the time delay \hat{\tau}, all potential time delays are enumerated and the delay maximizing the following objective is selected:

$$\hat{\tau}=\arg\max_{\tau}\sum_{f}\gamma(f)\,\mathrm{Real}\!\left\{\frac{\hat{c}_1(f)^{*}\,\hat{c}_2(f)\,e^{\,j2\pi f f_s\tau/N}}{|\hat{c}_1(f)|\,|\hat{c}_2(f)|}\right\}.\qquad(17)$$

The underlying idea is that the steering vector \hat{c}(f) is computed independently at every frequency, so it does not strictly follow the linear-phase assumption. The invention enumerates all potential time delays and searches for the delay τ whose phase delay best matches \hat{c}(f) (the direction of the steering vector) at every frequency; this delay is taken as the final prediction. As in formula (15), γ(f) weighting is used to emphasize frequencies with higher SNR.
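A sketch of this steering-vector-based estimate of formulas (16) and (17) for the two-microphone case: the principal eigenvector of the speech covariance at every frequency serves as the steering-vector estimate, and the candidate delay whose linear-phase model best matches its inter-channel phase, weighted by γ(f), is selected:

```python
import numpy as np

def eigenvector_tdoa(Phi_s, gamma, candidate_delays, n_fft):
    """Steering-vector-based TDOA estimate (formulas (16) and (17)).

    Phi_s: (F, 2, 2) mask-weighted speech covariance matrices.
    gamma: per-frequency weights, gamma(f) = sum_t eta(t, f).
    candidate_delays: candidate TDOAs in samples.
    """
    F = Phi_s.shape[0]
    freqs = np.arange(F)
    # formula (16): principal eigenvector of Phi_s(f) as steering vector
    eigvals, eigvecs = np.linalg.eigh(Phi_s)       # eigenvalues in ascending order
    c_hat = eigvecs[:, :, -1]                      # shape (F, 2)
    # observed inter-channel term, normalized to unit magnitude
    cross = np.conj(c_hat[:, 0]) * c_hat[:, 1]
    cross /= np.abs(cross) + 1e-8
    # formula (17): pick the delay whose phase model matches c_hat best
    scores = []
    for tau in candidate_delays:
        comp = np.exp(1j * 2 * np.pi * freqs * tau / n_fft)
        scores.append(np.sum(gamma * np.real(cross * comp)))
    return candidate_delays[int(np.argmax(scores))]
```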
With the method described above, when the TDOA is estimated to localize the target source, the acquired multi-channel signal is passed through the pre-trained neural network to compute the corresponding ratio masks, the masks are fused into a single mask, and the multi-channel signal is then mask-weighted with the single mask to determine the direction of the target source. The invention remains robust in low-SNR, highly reverberant environments and improves the accuracy and stability of target sound source direction estimation.

In the following, the robustness of the TDOA estimation of the above exemplary embodiments is tested with a binaural setup and a two-microphone setup in environments with strong reverberation and competing babble. Fig. 5 is a schematic diagram of the binaural setup and the two-microphone setup according to an exemplary embodiment.

The average duration of the mixtures is 2.4 seconds. For both data sets the input SNR, computed between reverberant speech and reverberant noise, is -6 dB; if the direct-path signal is regarded as the target and everything else as noise, the SNR is even lower. An LSTM (a recurrent neural network with long short-term memory) is trained on all single-channel signals in the training data (10000 x 2 in total). In the microphone-array setup the log power spectrogram is used as the input feature; in the binaural setup the interaural energy difference is used as well. The input features are mean-normalized at the utterance level before global mean-variance normalization. The LSTM has two hidden layers with 500 units each. The Adam algorithm is used to minimize the mean squared error of the ratio mask estimate. The window length is 32 ms, the window shift is 8 ms, and the sampling rate is 16 kHz.
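A minimal PyTorch sketch of a mask estimator consistent with this description (two LSTM layers of 500 units, sigmoid output, mean-squared-error loss minimized with Adam); the feature dimension and remaining training details are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Two-layer LSTM mapping per-frame log-power spectra to a ratio mask
    in [0, 1] per frequency bin (257 bins for a 512-point FFT)."""
    def __init__(self, n_freq=257, hidden=500):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_freq, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, log_power):            # (batch, time, n_freq)
        h, _ = self.lstm(log_power)
        return torch.sigmoid(self.out(h))    # estimated mask, (batch, time, n_freq)

model = MaskEstimator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                       # MSE between estimated and ideal mask
```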
Performance is measured as overall accuracy: a prediction is counted as correct if the predicted direction is within 5° of the true target direction.

In the two-microphone setup, an image-method RIR (room impulse response) generator is used to simulate reverberation. For the training and validation data an interfering speaker is placed in each of 36 directions, from -87.5° to 87.5° in steps of 5°, with the target speaker in one of the 36 directions. For the test data an interfering speaker is placed in each of 37 directions, from -90° to 90° in steps of 5°, with the target speaker in any one of the 37 directions; the test RIRs are therefore unseen during training. The distance between the target speaker and the array center is 1 meter. The room size is fixed at 8 x 8 x 3 m, and the two microphones are placed at the center of the room.
Table 1. Comparison of the TDOA estimation performance of different methods in the two-microphone setup (% overall accuracy)
The distance between the two microphones is 0.2 m, and both are placed at a height of 1.5 m. The T60 of each mixture is drawn at random from 0.0 s to 1.0 s in steps of 0.1 s. IEEE and TIMIT sentences are used to generate the training, validation, and test speech.

In the binaural setup, binaural room impulse responses (BRIRs) are simulated in software, with T60 (reverberation time) ranging from 0.0 s to 1.0 s in steps of 0.1 s and a room size fixed at 6 x 4 x 3 m. The BRIRs are generated by placing the two ears around the room center at a height of 2 m, with the source in one of 37 directions (from -90° to 90° in steps of 5°) at the same height as the array and 1.5 m from the array center. Real BRIRs recorded with a HATS (head-and-torso simulator) dummy head in four real rooms of different sizes and T60s are used for testing; the dummy head is placed at a height of 2.8 m, the source-to-array distance is 1.5 m, and the same 37 directions are measured. Thirty-seven different interfering voices are placed in the 37 directions, and the target voice in one of them. In the experiments, 720 IEEE sentences spoken by a female speaker are used as target speech and split at random into 500, 100, and 120 utterances for the training, validation, and test data. To generate diffuse babble noise, the utterances of the 630 speakers of the TIMIT corpus are concatenated, and 37 randomly selected speakers with their speech segments are placed in each of the 37 directions. For every speaker in the babble, the first half of the concatenated utterance is used to generate training and validation noise and the second half to generate test noise. There are 10000, 800, and 3000 binaural mixtures in the training, validation, and test sets, respectively.

Table 2. Comparison of the TDOA estimation performance of different methods in the binaural setup (% overall accuracy)
Tables 1 and 2 show the overall localization accuracy; the rows shaded gray mark the performance obtained with the ideal ratio mask. The tables also list the direct-to-reverberant energy ratio (DRR) for each T60 level. With masks estimated by the LSTM, the proposed mask-weighted GCC-PHAT algorithm markedly improves over the conventional GCC-PHAT algorithm (from 25.8% to 78.5% and 88.2% in Table 1, and from 29.4% to 91.3% and 90.8% in Table 2). The steering-vector-based TDOA estimation algorithm is the most robust of all the algorithms, especially at high T60. Using the direct sound as the target speech to define the ideal ratio mask brings the accuracy of all proposed algorithms close to 100% (100.0%, 99.9%, and 99.8% in Table 1; 99.4%, 99.4%, and 99.4% in Table 2). This indicates that masking at the T-F-unit level is well suited to highly robust TDOA estimation.

Because the time-delay information is mainly contained in the direct sound, in the two-microphone setup defining the IRM with the direct sound as target consistently outperforms defining it with the reverberant speech as target (88.2% vs. 78.5%, 90.5% vs. 86.7%, and 91.0% vs. 86.4%).

However, because of the head-shadow effect and the mismatch between the training and test BRIRs in the binaural setup, the mask-weighted steered-response SNR algorithm performs relatively worse in the binaural setup than in the two-microphone setup. Owing to the head-shadow effect, the gains of the two channels in the binaural case cannot simply be assumed equal; consequently, estimating the IRM with the reverberant speech as target performs slightly better in the binaural setup than using the direct sound as target (91.3% vs. 90.8%, 86.4% vs. 70.0%, and 92.0% vs. 91.1%).
The following are apparatus embodiments of the present disclosure, which can be used to carry out the above method embodiments of sound source direction estimation based on time-frequency masking and deep neural networks. For details not disclosed in the apparatus embodiments, please refer to the method embodiments of the present disclosure.

Fig. 6 is a block diagram of a sound source direction estimation apparatus based on time-frequency masking and a deep neural network according to an exemplary embodiment. The apparatus includes, but is not limited to, a sound signal acquisition module 110, a short-time Fourier spectrum extraction module 120, a ratio mask computation module 130, a ratio mask fusion module 140, and a masking weighting module 150.

The sound signal acquisition module 110 is configured to acquire a multi-channel sound signal.

The short-time Fourier spectrum extraction module 120 is configured to frame, window, and Fourier-transform each channel of the multi-channel sound signal to form its short-time Fourier spectrum.

The ratio mask computation module 130 is configured to run the short-time Fourier spectrum through a pre-trained neural network model to compute the ratio mask corresponding to the target signal in the multi-channel sound signal.

The ratio mask fusion module 140 is configured to fuse multiple ratio masks into a single ratio mask.

The masking weighting module 150 is configured to apply masking weighting to the multi-channel sound signal with the single ratio mask and determine the direction of the target sound source.

The implementation of the functions of the individual modules of the apparatus follows the corresponding steps of the sound source direction estimation method based on time-frequency masking and deep neural networks described above and is not repeated here.
Optionally, the ratio mask computation module 130 of Fig. 6 includes, but is not limited to, a per-channel ratio mask computation unit.

The per-channel ratio mask computation unit is configured to run the short-time Fourier spectrum of each channel through the pre-trained neural network model and compute the ratio mask corresponding to each channel of the multi-channel sound signal.

Optionally, the per-channel ratio mask computation unit may specifically take the direct sound or the reverberant speech signal as the target and use a deep recurrent neural network model with long short-term memory to compute the ratio mask corresponding to each single-channel target signal in the multi-channel sound signal.

Optionally, the ratio mask fusion module 140 of Fig. 6 specifically multiplies the ratio masks produced for the target in the multi-channel sound signal together on the corresponding time-frequency units.
Optionally, as shown in Fig. 7, the masking weighting module 150 of Fig. 6 includes, but is not limited to, a generalized cross-correlation function computation submodule 151, a masking submodule 152, and a first direction determination submodule 153.

The generalized cross-correlation function computation submodule 151 is configured to compute the generalized cross-correlation function from the short-time Fourier spectra of the multi-channel input signal.

The masking submodule 152 is configured to mask the generalized cross-correlation function with the single ratio mask.

The first direction determination submodule 153 is configured to sum the masked generalized cross-correlation function over frequency and time and take the direction corresponding to the largest peak of the summed cross-correlation function as the direction of the target sound source.
Optionally, as shown in Fig. 8, a second implementation of the masking weighting module 150 of Fig. 6 includes, but is not limited to, a covariance matrix computation submodule 154, a covariance matrix masking submodule 155, a candidate-direction steering vector computation submodule 156, a beamforming filter coefficient computation submodule 157, an energy computation submodule 158, an overall signal-to-noise ratio formation submodule 159, and a second direction determination submodule 160.

The covariance matrix computation submodule 154 is configured to compute, in every time-frequency unit, the covariance matrix of the short-time Fourier spectra of the multi-channel sound signal.

The covariance matrix masking submodule 155 is configured to mask the covariance matrices with the single ratio mask and, at each frequency, sum the masked covariance matrices along the time dimension to obtain the covariance matrices of the target speech and of the noise at each frequency.

The candidate-direction steering vector computation submodule 156 is configured to compute, from the topology of the microphone array, the steering vectors of the candidate directions at each frequency.

The beamforming filter coefficient computation submodule 157 is configured to compute, from the noise covariance matrix and the candidate steering vector, the MVDR beamforming filter coefficients at each frequency.

The energy computation submodule 158 is configured to compute the energy of the target speech at each frequency from the beamforming filter coefficients and the target-speech covariance matrix, and the energy of the noise at each frequency from the beamforming filter coefficients and the noise covariance matrix.

The overall signal-to-noise ratio formation submodule 159 is configured to compute, at each frequency, the energy ratio of target speech to noise and sum it along the frequency dimension to obtain the overall signal-to-noise ratio for a candidate direction.

The second direction determination submodule 160 is configured to select the candidate direction with the largest overall signal-to-noise ratio as the direction of the target sound source.
Optionally, as shown in Fig. 9, a third implementation of the masking weighting module 150 of Fig. 6 includes, but is not limited to, a speech steering vector computation submodule 161, an arrival time difference computation submodule 162, a candidate-direction arrival time difference submodule 163, a cosine distance computation submodule 164, and a third direction determination submodule 165.

The speech steering vector computation submodule 161 is configured to apply, at each frequency, an eigendecomposition to the target-speech covariance matrix and select the eigenvector with the largest eigenvalue as the steering vector of the target speech.

The arrival time difference computation submodule 162 is configured to compute the time difference of arrival between the microphone signals from the steering vector of the target speech.

The candidate-direction arrival time difference submodule 163 is configured to compute, from the microphone-array topology, the time difference of arrival between the microphones for each candidate direction.

The cosine distance computation submodule 164 is configured to compute the cosine distance between the time difference of arrival of the microphone signals and the time difference of arrival of a candidate direction.

The third direction determination submodule 165 is configured to select the candidate direction with the largest cosine distance as the direction of the target sound source.
Optionally, the present invention also provides an electronic device that performs all or part of the steps of the sound source direction estimation method based on time-frequency masking and a deep neural network shown in any of the exemplary embodiments above. The electronic device includes:

a processor; and

a memory communicatively connected to the processor, wherein

the memory stores readable instructions which, when executed by the processor, implement the method according to any of the exemplary embodiments above.

The specific manner in which the processor of the terminal performs operations in this embodiment has been described in detail in the embodiments of the sound source direction estimation method based on time-frequency masking and deep neural networks and is not elaborated here.

In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, for example a transitory or non-transitory computer-readable storage medium containing instructions.

It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the invention is limited only by the appended claims.

Claims (10)

  1. A sound source direction estimation method based on time-frequency masking and a deep neural network, characterized in that the method comprises:

    acquiring a multi-channel sound signal;

    framing, windowing, and Fourier-transforming each channel of the multi-channel sound signal to form a short-time Fourier spectrum of the multi-channel sound signal;

    performing an iterative computation on the short-time Fourier spectrum with a pre-trained neural network model to compute the ratio mask corresponding to the target signal in the multi-channel sound signal;

    fusing multiple ratio masks into a single ratio mask;

    applying masking weighting to the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source.

  2. The method according to claim 1, characterized in that the step of performing an iterative computation on the short-time Fourier spectrum with a pre-trained neural network model to compute the ratio mask corresponding to the target signal in the multi-channel sound signal comprises:

    performing an iterative computation on the short-time Fourier spectrum of each channel with the pre-trained neural network model to compute the ratio mask corresponding to each channel of the multi-channel sound signal.

  3. The method according to claim 2, characterized in that the step of performing an iterative computation on the short-time Fourier spectrum of each channel with the pre-trained neural network model to compute the ratio mask corresponding to each channel of the multi-channel sound signal comprises:

    taking the direct sound or the reverberant speech signal as the target and using a deep recurrent neural network model with long short-term memory to compute the ratio mask corresponding to each single-channel target signal in the multi-channel sound signal.

  4. The method according to claim 1, characterized in that the step of fusing multiple ratio masks into a single ratio mask comprises:

    multiplying the ratio masks produced for the target signal in the multi-channel sound signal together on the corresponding time-frequency units.

  5. The method according to claim 1, characterized in that the step of applying masking weighting to the multi-channel sound signal with the single ratio mask comprises:

    computing a generalized cross-correlation function from the short-time Fourier spectra of the multi-channel input signal;

    masking the generalized cross-correlation function with the single ratio mask;

    summing the masked generalized cross-correlation function over frequency and time, and selecting the direction corresponding to the largest peak of the summed cross-correlation function as the direction of the target sound source.

  6. The method according to claim 1, characterized in that the step of applying masking weighting to the multi-channel sound signal with the single ratio mask comprises:

    computing, in every time-frequency unit, the covariance matrix of the short-time Fourier spectra of the multi-channel sound signal;

    masking the covariance matrices with the single ratio mask and, at each frequency, summing the masked covariance matrices along the time dimension to obtain the covariance matrices of the target speech and of the noise at each frequency;

    computing, from the topology of the microphone array, the steering vectors of the candidate directions at each frequency;

    computing, from the noise covariance matrix and the candidate steering vector, MVDR beamforming filter coefficients at each frequency;

    computing the energy of the target speech at each frequency from the beamforming filter coefficients and the target-speech covariance matrix, and the energy of the noise at each frequency from the beamforming filter coefficients and the noise covariance matrix;

    computing, at each frequency, the energy ratio of the target speech to the noise and summing it along the frequency dimension to form the overall signal-to-noise ratio for a candidate direction;

    selecting the candidate direction with the largest overall signal-to-noise ratio as the direction of the target sound source.

  7. The method according to claim 1, characterized in that the step of applying masking weighting to the multi-channel sound signal with the single ratio mask comprises:

    applying, at each frequency, an eigendecomposition to the target-speech covariance matrix and selecting the eigenvector with the largest eigenvalue as the steering vector of the target speech;

    computing the time difference of arrival between the microphone signals from the steering vector of the target speech;

    computing, from the microphone-array topology, the time difference of arrival between the microphones for each candidate direction;

    computing the cosine distance between the time difference of arrival of the microphone signals and the time difference of arrival of the candidate directions between the microphones;

    selecting the candidate direction with the largest cosine distance as the direction of the target sound source.

  8. A sound source direction estimation apparatus based on time-frequency masking and a deep neural network, characterized in that the apparatus comprises:

    a sound signal acquisition module, configured to acquire a multi-channel sound signal;

    a short-time Fourier spectrum extraction module, configured to frame, window, and Fourier-transform each channel of the multi-channel sound signal to form a short-time Fourier spectrum of the multi-channel sound signal;

    a ratio mask computation module, configured to perform an iterative computation on the short-time Fourier spectrum with a pre-trained neural network model to compute the ratio mask corresponding to the target signal in the multi-channel sound signal;

    a ratio mask fusion module, configured to fuse multiple ratio masks into a single ratio mask;

    a masking weighting module, configured to apply masking weighting to the multi-channel sound signal with the single ratio mask to determine the direction of the target sound source.

  9. An electronic device, characterized in that the electronic device comprises:

    at least one processor; and

    a memory communicatively connected to the at least one processor, wherein

    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-7.

  10. A computer-readable storage medium for storing a program, characterized in that, when the program is executed, it causes an electronic device to perform the method according to any one of claims 1-7.
PCT/CN2019/090531 2018-08-31 2019-06-10 Time-frequency masking and deep neural network-based sound source direction estimation method WO2020042708A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811009529.4 2018-08-31
CN201811009529.4A CN109839612B (en) 2018-08-31 2018-08-31 Sound source direction estimation method and device based on time-frequency masking and deep neural network

Publications (1)

Publication Number Publication Date
WO2020042708A1 true WO2020042708A1 (en) 2020-03-05

Family

ID=66883029

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/090531 WO2020042708A1 (en) 2018-08-31 2019-06-10 Time-frequency masking and deep neural network-based sound source direction estimation method

Country Status (2)

Country Link
CN (1) CN109839612B (en)
WO (1) WO2020042708A1 (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111487589A (en) * 2020-04-21 2020-08-04 中国科学院上海微系统与信息技术研究所 Target placement positioning method based on multi-source sensor network
CN111681668A (en) * 2020-05-20 2020-09-18 陕西金蝌蚪智能科技有限公司 Acoustic imaging method and terminal equipment
CN111724801A (en) * 2020-06-22 2020-09-29 北京小米松果电子有限公司 Audio signal processing method and device and storage medium
CN111880146A (en) * 2020-06-30 2020-11-03 海尔优家智能科技(北京)有限公司 Sound source orientation method and device and storage medium
CN112379330A (en) * 2020-11-27 2021-02-19 浙江同善人工智能技术有限公司 Multi-robot cooperative 3D sound source identification and positioning method
CN112415467A (en) * 2020-11-06 2021-02-26 中国海洋大学 Single-vector subsurface buoy target positioning implementation method based on neural network
CN112462355A (en) * 2020-11-11 2021-03-09 西北工业大学 Sea target intelligent detection method based on time-frequency three-feature extraction
CN112634930A (en) * 2020-12-21 2021-04-09 北京声智科技有限公司 Multi-channel sound enhancement method and device and electronic equipment
CN112904279A (en) * 2021-01-18 2021-06-04 南京工程学院 Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum
CN112951263A (en) * 2021-03-17 2021-06-11 云知声智能科技股份有限公司 Speech enhancement method, apparatus, device and storage medium
CN113050039A (en) * 2021-03-10 2021-06-29 杭州瑞利超声科技有限公司 Acoustic fluctuation positioning system used in tunnel
CN113325401A (en) * 2021-07-06 2021-08-31 东南大学 Distortion towed linear array signal reconstruction method based on line spectrum phase difference ambiguity resolution
US20210375294A1 (en) * 2019-07-24 2021-12-02 Tencent Technology (Shenzhen) Company Limited Inter-channel feature extraction method, audio separation method and apparatus, and computing device
CN113763982A (en) * 2020-06-05 2021-12-07 阿里巴巴集团控股有限公司 Audio processing method and device, electronic equipment and readable storage medium
CN113763976A (en) * 2020-06-05 2021-12-07 北京有竹居网络技术有限公司 Method and device for reducing noise of audio signal, readable medium and electronic equipment
CN113782047A (en) * 2021-09-06 2021-12-10 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium
CN113936681A (en) * 2021-10-13 2022-01-14 东南大学 Voice enhancement method based on mask mapping and mixed hole convolution network
CN114613384A (en) * 2022-03-14 2022-06-10 中国电子科技集团公司第十研究所 Deep learning-based multi-input voice signal beam forming information complementation method
CN115050367A (en) * 2022-08-12 2022-09-13 清华大学苏州汽车研究院(相城) Method, device, equipment and storage medium for positioning speaking target
CN115856987A (en) * 2023-02-28 2023-03-28 西南科技大学 Nuclear pulse signal and noise signal discrimination method under complex environment
CN117040662A (en) * 2023-09-07 2023-11-10 中通服网盈科技有限公司 Multichannel signal transmission system

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109839612B (en) * 2018-08-31 2022-03-01 大象声科(深圳)科技有限公司 Sound source direction estimation method and device based on time-frequency masking and deep neural network
CN112257484B (en) * 2019-07-22 2024-03-15 中国科学院声学研究所 Multi-sound source direction finding method and system based on deep learning
CN110728989B (en) * 2019-09-29 2020-07-14 东南大学 Binaural speech separation method based on long-time and short-time memory network L STM
CN110838303B (en) * 2019-11-05 2022-02-08 南京大学 Voice sound source positioning method using microphone array
CN110808063A (en) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 Voice processing method and device for processing voice
CN110992977B (en) * 2019-12-03 2021-06-22 北京声智科技有限公司 Method and device for extracting target sound source
CN111103568A (en) * 2019-12-10 2020-05-05 北京声智科技有限公司 Sound source positioning method, device, medium and equipment
CN113053400A (en) * 2019-12-27 2021-06-29 武汉Tcl集团工业研究院有限公司 Training method of audio signal noise reduction model, audio signal noise reduction method and device
CN111239687B (en) * 2020-01-17 2021-12-14 浙江理工大学 Sound source positioning method and system based on deep neural network
CN111239686B (en) * 2020-02-18 2021-12-21 中国科学院声学研究所 Dual-channel sound source positioning method based on deep learning
CN111596261B (en) * 2020-04-02 2022-06-14 云知声智能科技股份有限公司 Sound source positioning method and device
CN112259117A (en) * 2020-09-28 2021-01-22 上海声瀚信息科技有限公司 Method for locking and extracting target sound source
CN112788278B (en) * 2020-12-30 2023-04-07 北京百度网讯科技有限公司 Video stream generation method, device, equipment and storage medium
CN112989566B (en) * 2021-02-05 2022-11-11 浙江大学 Geometric sound propagation optimization method based on A-weighted variance
CN113687305A (en) * 2021-07-26 2021-11-23 浙江大华技术股份有限公司 Method, device and equipment for positioning sound source azimuth and computer readable storage medium
CN113724727A (en) * 2021-09-02 2021-11-30 哈尔滨理工大学 Long-short time memory network voice separation algorithm based on beam forming
CN113644947A (en) * 2021-10-14 2021-11-12 西南交通大学 Adaptive beam forming method, device, equipment and readable storage medium
CN114255733B (en) * 2021-12-21 2023-05-23 中国空气动力研究与发展中心低速空气动力研究所 Self-noise masking system and flight device
CN115359804B (en) * 2022-10-24 2023-01-06 北京快鱼电子股份公司 Directional audio pickup method and system based on microphone array

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104103277A (en) * 2013-04-15 2014-10-15 北京大学深圳研究生院 Time frequency mask-based single acoustic vector sensor (AVS) target voice enhancement method
US20170328983A1 (en) * 2015-12-04 2017-11-16 Fazecast, Inc. Systems and methods for transient acoustic event detection, classification, and localization
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN108318862A (en) * 2017-12-26 2018-07-24 北京大学 A kind of sound localization method based on neural network
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sounnd source direction estimation method based on time-frequency masking and deep neural network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101456866B1 (en) * 2007-10-12 2014-11-03 삼성전자주식회사 Method and apparatus for extracting the target sound signal from the mixed sound
EP2088802B1 (en) * 2008-02-07 2013-07-10 Oticon A/S Method of estimating weighting function of audio signals in a hearing aid
CN102157156B (en) * 2011-03-21 2012-10-10 清华大学 Single-channel voice enhancement method and system
JP2012234150A (en) * 2011-04-18 2012-11-29 Sony Corp Sound signal processing device, sound signal processing method and program
EP2747451A1 (en) * 2012-12-21 2014-06-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Filter and method for informed spatial filtering using multiple instantaneous direction-of-arrivial estimates
US10939201B2 (en) * 2013-02-22 2021-03-02 Texas Instruments Incorporated Robust estimation of sound source localization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104103277A (en) * 2013-04-15 2014-10-15 北京大学深圳研究生院 Time frequency mask-based single acoustic vector sensor (AVS) target voice enhancement method
US20170328983A1 (en) * 2015-12-04 2017-11-16 Fazecast, Inc. Systems and methods for transient acoustic event detection, classification, and localization
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN108318862A (en) * 2017-12-26 2018-07-24 北京大学 A kind of sound localization method based on neural network
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sounnd source direction estimation method based on time-frequency masking and deep neural network

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11908483B2 (en) * 2019-07-24 2024-02-20 Tencent Technology (Shenzhen) Company Limited Inter-channel feature extraction method, audio separation method and apparatus, and computing device
US20210375294A1 (en) * 2019-07-24 2021-12-02 Tencent Technology (Shenzhen) Company Limited Inter-channel feature extraction method, audio separation method and apparatus, and computing device
CN111487589B (en) * 2020-04-21 2023-08-04 中国科学院上海微系统与信息技术研究所 Target drop point positioning method based on multi-source sensor network
CN111487589A (en) * 2020-04-21 2020-08-04 中国科学院上海微系统与信息技术研究所 Target placement positioning method based on multi-source sensor network
CN111681668A (en) * 2020-05-20 2020-09-18 陕西金蝌蚪智能科技有限公司 Acoustic imaging method and terminal equipment
CN111681668B (en) * 2020-05-20 2023-07-07 陕西金蝌蚪智能科技有限公司 Acoustic imaging method and terminal equipment
CN113763976A (en) * 2020-06-05 2021-12-07 北京有竹居网络技术有限公司 Method and device for reducing noise of audio signal, readable medium and electronic equipment
CN113763982A (en) * 2020-06-05 2021-12-07 阿里巴巴集团控股有限公司 Audio processing method and device, electronic equipment and readable storage medium
CN113763976B (en) * 2020-06-05 2023-12-22 北京有竹居网络技术有限公司 Noise reduction method and device for audio signal, readable medium and electronic equipment
CN111724801A (en) * 2020-06-22 2020-09-29 北京小米松果电子有限公司 Audio signal processing method and device and storage medium
CN111880146A (en) * 2020-06-30 2020-11-03 海尔优家智能科技(北京)有限公司 Sound source orientation method and device and storage medium
CN112415467B (en) * 2020-11-06 2022-10-25 中国海洋大学 Single-vector subsurface buoy target positioning implementation method based on neural network
CN112415467A (en) * 2020-11-06 2021-02-26 中国海洋大学 Single-vector subsurface buoy target positioning implementation method based on neural network
CN112462355A (en) * 2020-11-11 2021-03-09 西北工业大学 Sea target intelligent detection method based on time-frequency three-feature extraction
CN112462355B (en) * 2020-11-11 2023-07-14 西北工业大学 Intelligent sea target detection method based on time-frequency three-feature extraction
CN112379330A (en) * 2020-11-27 2021-02-19 浙江同善人工智能技术有限公司 Multi-robot cooperative 3D sound source identification and positioning method
CN112379330B (en) * 2020-11-27 2023-03-10 浙江同善人工智能技术有限公司 Multi-robot cooperative 3D sound source identification and positioning method
CN112634930A (en) * 2020-12-21 2021-04-09 北京声智科技有限公司 Multi-channel sound enhancement method and device and electronic equipment
CN112904279B (en) * 2021-01-18 2024-01-26 南京工程学院 Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
CN112904279A (en) * 2021-01-18 2021-06-04 南京工程学院 Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum
CN113050039B (en) * 2021-03-10 2023-03-07 杭州瑞利超声科技有限公司 Acoustic fluctuation positioning system used in tunnel
CN113050039A (en) * 2021-03-10 2021-06-29 杭州瑞利超声科技有限公司 Acoustic fluctuation positioning system used in tunnel
CN112951263A (en) * 2021-03-17 2021-06-11 云知声智能科技股份有限公司 Speech enhancement method, apparatus, device and storage medium
CN112951263B (en) * 2021-03-17 2022-08-02 云知声智能科技股份有限公司 Speech enhancement method, apparatus, device and storage medium
CN113325401A (en) * 2021-07-06 2021-08-31 东南大学 Distortion towed linear array signal reconstruction method based on line spectrum phase difference ambiguity resolution
CN113325401B (en) * 2021-07-06 2024-03-19 东南大学 Distortion towing linear array signal reconstruction method based on line spectrum phase difference deblurring
CN113782047A (en) * 2021-09-06 2021-12-10 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium
CN113782047B (en) * 2021-09-06 2024-03-08 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium
CN113936681A (en) * 2021-10-13 2022-01-14 东南大学 Voice enhancement method based on mask mapping and mixed hole convolution network
CN113936681B (en) * 2021-10-13 2024-04-09 东南大学 Speech enhancement method based on mask mapping and mixed cavity convolution network
CN114613384A (en) * 2022-03-14 2022-06-10 中国电子科技集团公司第十研究所 Deep learning-based multi-input voice signal beam forming information complementation method
CN114613384B (en) * 2022-03-14 2023-08-29 中国电子科技集团公司第十研究所 Deep learning-based multi-input voice signal beam forming information complementation method
CN115050367A (en) * 2022-08-12 2022-09-13 清华大学苏州汽车研究院(相城) Method, device, equipment and storage medium for positioning speaking target
CN115050367B (en) * 2022-08-12 2022-11-04 清华大学苏州汽车研究院(相城) Method, device, equipment and storage medium for positioning speaking target
CN115856987A (en) * 2023-02-28 2023-03-28 西南科技大学 Nuclear pulse signal and noise signal discrimination method under complex environment
CN117040662A (en) * 2023-09-07 2023-11-10 中通服网盈科技有限公司 Multichannel signal transmission system
CN117040662B (en) * 2023-09-07 2024-04-12 中通服网盈科技有限公司 Multichannel signal transmission system

Also Published As

Publication number Publication date
CN109839612A (en) 2019-06-04
CN109839612B (en) 2022-03-01

Similar Documents

Publication Publication Date Title
WO2020042708A1 (en) Time-frequency masking and deep neural network-based sound source direction estimation method
Wang et al. Robust speaker localization guided by deep learning-based time-frequency masking
Moore et al. Direction of arrival estimation in the spherical harmonic domain using subspace pseudointensity vectors
Wang et al. An iterative approach to source counting and localization using two distant microphones
Blandin et al. Multi-source TDOA estimation in reverberant audio using angular spectra and clustering
Nikunen et al. Direction of arrival based spatial covariance model for blind sound source separation
JP4937622B2 (en) Computer-implemented method for building location model
CN107219512B (en) Sound source positioning method based on sound transfer function
MX2014006499A (en) Apparatus and method for microphone positioning based on a spatial power density.
CN106373589B (en) Binaural mixed speech separation method based on an iterative structure
Pavlidi et al. 3D localization of multiple sound sources with intensity vector estimates in single source zones
Varanasi et al. Near-field acoustic source localization using spherical harmonic features
CN109188362A (en) Microphone array sound source localization signal processing method
Dorfan et al. Distributed expectation-maximization algorithm for speaker localization in reverberant environments
Beit-On et al. Speaker localization using the direct-path dominance test for arbitrary arrays
Peterson et al. Hybrid algorithm for robust, real-time source localization in reverberant environments
Imran et al. A methodology for sound source localization and tracking: Development of 3D microphone array for near-field and far-field applications
CN101771923A (en) Sound source positioning method for a glasses-type digital hearing aid
Wan et al. Improved steered response power method for sound source localization based on principal eigenvector
CN110838303B (en) Voice sound source positioning method using microphone array
Yang et al. Supervised direct-path relative transfer function learning for binaural sound source localization
Hwang et al. Sound source localization using HRTF database
Drude et al. DOA-estimation based on a complex Watson kernel method
Barber et al. End-to-end Alexa device arbitration
Hadad et al. Multi-speaker direction of arrival estimation using SRP-PHAT algorithm with a weighted histogram

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19854432; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 19854432; Country of ref document: EP; Kind code of ref document: A1)