CN110838303A - Voice sound source positioning method using microphone array - Google Patents

Voice sound source positioning method using microphone array

Info

Publication number
CN110838303A
CN110838303A
Authority
CN
China
Prior art keywords
time
voice
signal
frequency
sound
Prior art date
Legal status
Granted
Application number
CN201911069273.0A
Other languages
Chinese (zh)
Other versions
CN110838303B (en)
Inventor
王浩 (Wang Hao)
卢晶 (Lu Jing)
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201911069273.0A priority Critical patent/CN110838303B/en
Publication of CN110838303A publication Critical patent/CN110838303A/en
Application granted granted Critical
Publication of CN110838303B publication Critical patent/CN110838303B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 - Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 - Position-fixing by co-ordinating two or more direction or position line determinations using ultrasonic, sonic, or infrasonic waves
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The invention discloses a method for localizing a speech sound source with a microphone array, comprising the following steps: (1) generating training samples, obtaining the time-frequency domain signals and their power envelopes; (2) judging, for each time-frequency point of the time-frequency domain signal, whether it belongs to the direct speech signal; (3) training a neural network with a UNET structure using the samples generated in step (1); (4) using the trained UNET network to predict the time-frequency points corresponding to the direct speech sound of a noisy signal under test; (5) applying a localization method to the time-frequency points judged to be direct speech sound to obtain the localization result. The method can effectively remove the influence of interference and reverberation in environments with high reverberation and strong interference and obtain results with high accuracy and robustness.

Description

Voice sound source positioning method using microphone array
Technical Field
The invention relates to a method for localizing a speech sound source with a microphone array in high-interference, high-reverberation environments, based on a UNET structure, and belongs to the technical field of speech signal processing.
Background
The purpose of speech sound source localization (SSL) is to estimate the direction of arrival (DOA) of a speech signal at the microphone array. Sound source localization, or DOA estimation, of speech signals with a microphone array is an important and active topic in acoustic signal processing. It plays a key role in sound capture in many application scenarios, such as human-machine voice interaction, camera tracking and intelligent monitoring on smart devices. The difficulty is that the speech signal is a broadband, non-stationary random process, while background noise, reverberation and other interfering sound sources are also present.
Classical sound source localization methods can be divided into TDOA (Time Delay of Arrival), SRP (Steered Response Power) and spatial-spectrum methods; data-driven methods mainly use convolutional neural networks to obtain the DOA directly. In many practical scenarios there is not only reverberation but also noise interference, and most existing methods cannot maintain high accuracy and robustness in such complex environments.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a speech source localization method using a microphone array that still obtains accurate and robust results in environments with high reverberation and strong interference.
In order to achieve the purpose, the invention adopts the technical scheme that:
a method for locating a voice sound source using a microphone array, comprising the steps of:
step 1, collecting speech signals and interference signals with a microphone array, obtaining the time-frequency domain signals of the noisy speech signal and the clean speech signal, and calculating the logarithmic power spectral magnitudes of both; the clean speech signal consists of the direct speech sound only;
step 2, respectively calculating the spatial power response spectra of all time-frequency points in the time-frequency domain of the noisy speech signal and the clean speech signal, further estimating the time delay corresponding to each time-frequency point, and recording τ̂_X(n,k) and τ̂_S(n,k) as the time-frequency window delay estimates of the noisy speech signal and the clean speech signal, respectively, corresponding to time n and frequency band k; obtaining the time-frequency point distribution diagram corresponding to the direct speech sound;
step 3, training a neural network with a UNET structure using the logarithmic power spectral magnitudes of the noisy and clean speech signals from step 1 and the time-frequency point distribution diagram of the direct speech sound from step 2; estimating the time-frequency point distribution diagram of the direct speech sound of the signal under test using its logarithmic power spectral magnitude and the trained network;
step 4, obtaining the speech source localization result by using the direct-sound distribution estimated in step 3 as weights in combination with a weighted localization algorithm.
Further, in step 2, the time-frequency distribution points corresponding to the direct sound are selected to satisfy the following conditions simultaneously:
1) in the noisy speech signal, the delay estimate τ̂_X(n,k) differs from the true delay τ = (d sin θ)/c by less than a threshold TH_1, where d, c and θ are the microphone spacing, the speed of sound and the angle at which the speech source reaches the array, respectively;
2) in the clean speech signal, the delay estimate τ̂_S(n,k) differs from the true delay τ by less than the threshold TH_1;
3) the correlation between the spatial power responses of the noisy speech signal and the clean speech signal at the same position is greater than a threshold TH_2.
Further, in step 3, the input of the neural network is the logarithmic power spectrogram of the noisy speech signal, and the outputs are the logarithmic power spectrogram of the clean speech signal and the time-frequency point distribution diagram of the direct speech sound; the clean-speech spectrogram is used to assist training, and the values of the distribution diagram serve as the time-frequency point weights used in step 4.
In this localization method, the direct speech component is localized, so that interference and reverberation components contribute as little as possible to the localization; accurate and robust results are thus still obtained in environments with high reverberation and strong interference, and the influence of interfering noise on the localization performance is effectively avoided. The UNET-structured neural network first down-samples, learning deep features of the input data through successive convolutions, and then up-samples through deconvolution to fit the features of the input and output data. This network structure has the strong feature-learning capability of a deep neural network and is well suited to learning speech features and judging the direct sound. The UNET model adopted in the invention can be used with different array shapes; since the network predicts single-channel signals, it does not need to be retrained for arrays of different shapes in actual use.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a basic block diagram of the UNET network in an embodiment of the present invention;
FIG. 3 is a logarithmic power spectrum of a noisy speech signal;
FIG. 4 is a log power spectrum of a clean speech signal;
FIG. 5 is a graph of theoretical direct speech sound time-frequency distribution;
FIG. 6 is a distribution diagram of UNET predicted direct speech sound time-frequency points;
FIG. 7 shows the spatial power responses of the unweighted noisy signal, the noisy signal weighted by the theoretical direct-sound distribution, and the noisy signal weighted by the predicted direct-sound distribution (the left peak corresponds to the speech signal and the right peak to the interference signal; each curve is normalized to its maximum).
Detailed Description
The present invention is further illustrated below in conjunction with the drawings and a specific embodiment. These examples are given solely for illustration and are not intended to limit the scope of the invention; various equivalent modifications occurring to those skilled in the art upon reading the present disclosure fall within the scope of the appended claims.
This embodiment is carried out in simulation and provides a UNET-based method for localizing a speech sound source with a microphone array, suitable for high-interference, high-reverberation environments and for arrays of different shapes, comprising the following steps:
1. generating training samples to obtain time-frequency domain signals and obtaining power envelopes.
A speech or interference sound source is placed in a simulated room and signals are collected with I microphones. The speech and interference signals are collected separately at different positions and superimposed in the time domain to form the noisy speech signal; the signal amplitude is normalized by its maximum value. The time-frequency domain signals are obtained by the short-time Fourier transform (STFT) and denoted x_i(n,k), the noisy speech signal of the i-th microphone in frame n and frequency band k; the speech time-frequency signal received by the array before superposition is denoted s_i(n,k). The mean power spectral magnitudes of the noisy speech signal, X(n,k), and of the clean speech signal, S(n,k), are respectively

X(n,k) = (1/I) Σ_{i=1}^{I} |x_i(n,k)|²   (1)

S(n,k) = (1/I) Σ_{i=1}^{I} |s_i(n,k)|²   (2)

where x_i(n,k) and s_i(n,k) are the single-channel signals of the noisy and clean speech received by the microphone array. Their logarithmized versions are

X_L(n,k) = log₁₀(X(n,k) + ξ)   (3)

S_L(n,k) = log₁₀(S(n,k) + ξ)   (4)

where X_L(n,k) and S_L(n,k) are the logarithmic power spectral magnitudes of the noisy and clean speech signals, respectively, and ξ is a background-noise power estimate used to reduce the influence of the noise floor on the robustness of the method.
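The envelope computation of equations (1)-(4) can be sketched in a few lines of NumPy; the function name and the default value of ξ are illustrative, not specified by the patent:

```python
import numpy as np

def log_power_envelope(stft, xi=1e-4):
    """Eqs. (1)-(4): average the power-spectral magnitude over the I
    channels, then logarithmize with a noise-floor constant xi.

    stft: complex array of shape (I, N, K) = (channels, frames, bands).
    """
    # Eq. (1)/(2): mean of |x_i(n,k)|^2 over the I microphones
    mean_power = np.mean(np.abs(stft) ** 2, axis=0)
    # Eq. (3)/(4): log10 with the background-noise estimate xi added
    return np.log10(mean_power + xi)
```

Because the envelope is averaged over channels before the logarithm, the result is a single-channel (N, K) map, which is what lets the UNET remain array-agnostic.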
2. Judging the direct speech signal.
In real scenarios there are always environmental interferences and room reverberation, which degrade speech localization. Identifying the direct speech sound improves the accuracy and robustness of source localization and effectively removes the influence of interference and reverberation.
For each time-frequency point in the time-frequency domain of the noisy and clean speech signals, the spatial power response spectra P_X(τ|n,k) and P_S(τ|n,k) are computed with the steered response power (SRP) algorithm:

P_X(τ|n,k) = |g^H(k,τ) x(n,k)|²   (5)

P_S(τ|n,k) = |g^H(k,τ) s(n,k)|²   (6)

where x(n,k) and s(n,k) are the multi-channel frequency-domain noisy and clean speech signals corresponding to time n and frequency band k, x(n,k) = [x_1(n,k), x_2(n,k), ..., x_I(n,k)]^T and s(n,k) = [s_1(n,k), s_2(n,k), ..., s_I(n,k)]^T, the superscript H denotes the conjugate transpose, T denotes the transpose, and g(k,τ) is the steering vector corresponding to delay τ in frequency band k. After obtaining the spatial power response, the time delay (TDoA) corresponding to the point is further estimated as

τ̂_X(n,k) = argmax_τ P_X(τ|n,k)   (7)

τ̂_S(n,k) = argmax_τ P_S(τ|n,k)   (8)

where τ̂_X(n,k) and τ̂_S(n,k) are the time-frequency window delay estimates of the noisy and clean speech signals corresponding to time n and frequency band k, and argmax returns the value of the argument that maximizes the expression.
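A per-bin version of equations (5)-(8) for a uniform line array can be sketched as follows; the inter-element delay parameterization of the steering vector and the search grid are assumptions made for illustration:

```python
import numpy as np

def srp_delay_estimate(x_nk, omega_k, tau_grid):
    """Eqs. (5)-(8) at one time-frequency point of a uniform line array.

    x_nk: complex snapshot of shape (I,); omega_k: angular frequency of
    band k; tau_grid: candidate inter-element delays (s).  Returns the
    steered-response power over the grid and tau_hat = argmax_tau P.
    """
    elem = np.arange(x_nk.shape[0])
    powers = np.array([
        # P(tau|n,k) = |g(k,tau)^H x(n,k)|^2 with g_i = exp(-j w_k i tau)
        np.abs(np.vdot(np.exp(-1j * omega_k * elem * tau), x_nk)) ** 2
        for tau in tau_grid
    ])
    return powers, tau_grid[np.argmax(powers)]
```

For a plane wave with true inter-element delay τ₀, the grid maximum falls at τ₀ (up to grid resolution), which is exactly the per-bin TDoA estimate that the labeling step compares against (d sin θ)/c.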
The time-frequency distribution points corresponding to the direct sound are extracted as follows:
1) in the noisy speech signal, the delay estimate τ̂_X(n,k) of the corresponding time-frequency window differs from the true delay τ = (d sin θ)/c by less than a threshold TH_1, where d, c and θ are the microphone spacing, the speed of sound and the angle at which the speech source reaches the array, respectively;
2) in the clean speech signal, the delay estimate τ̂_S(n,k) of the corresponding time-frequency window differs from the true delay τ by less than the threshold TH_1;
3) the correlation between the spatial power responses P_X(τ|n,k) and P_S(τ|n,k) of the two signals at the same position is greater than a threshold TH_2.
The center point of a time-frequency window that satisfies all three conditions is marked 1, otherwise 0, yielding the time-frequency point distribution diagram of the direct speech sound.
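The three-condition labeling above amounts to a logical AND of three threshold tests per time-frequency point; a minimal sketch, with array shapes and names chosen for illustration:

```python
import numpy as np

def direct_path_mask(tau_hat_X, tau_hat_S, corr_PX_PS, tau_true, th1, th2=0.98):
    """Mark a time-frequency point 1 iff all three step-2 conditions hold:
    1) |tau_hat_X - tau| < TH1 in the noisy signal,
    2) |tau_hat_S - tau| < TH1 in the clean signal,
    3) corr(P_X, P_S) > TH2 at the same position.
    tau_hat_X, tau_hat_S, corr_PX_PS: (N, K) arrays; tau_true = d*sin(theta)/c.
    """
    return ((np.abs(tau_hat_X - tau_true) < th1)
            & (np.abs(tau_hat_S - tau_true) < th1)
            & (corr_PX_PS > th2)).astype(np.float32)
```

The resulting 0/1 map is the training target D of the UNET and, at test time, its prediction supplies the weights W(n,k) for the localization step.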
3. Training the UNET structure.
The basic block diagram of the UNET used in this embodiment is shown in Fig. 2. CNN(K) and DeCNN(K) denote a convolutional layer (Convolutional Neural Network Layer) and a deconvolutional (transposed-convolution) layer with K channels, respectively, and all activation functions are Leaky ReLU (LReLU). The deconvolutional layers double the length and width of the feature maps (decoding/up-sampling), while Max Pooling halves them (encoding/down-sampling); in the invention each expansion or reduction is by a factor of two. The Input of the UNET structure is the logarithmic power spectrogram X_L(n,k) of the noisy speech signal; the two outputs, Speech (S) and DPD (D), are the logarithmic power spectrogram S_L(n,k) of the direct speech sound and the time-frequency point distribution diagram of the direct speech sound obtained in step 2, respectively. Both the Input and Speech spectrograms are derived from data acquired by a single microphone, regardless of the array structure, so the model can be applied to different types of arrays.
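The doubling and halving of feature-map size mentioned above can be illustrated by the pooling and up-sampling scaffold alone; the convolutional layers, skip connections and LReLU activations of the actual UNET are omitted in this sketch:

```python
import numpy as np

def max_pool2(x):
    """Encoder step: 2x2 max pooling halves both spatial dimensions."""
    n, k = x.shape
    return x[:n // 2 * 2, :k // 2 * 2].reshape(n // 2, 2, k // 2, 2).max(axis=(1, 3))

def upsample2(x):
    """Decoder step: nearest-neighbour up-sampling doubles both dimensions
    (a transposed convolution would additionally learn a filter)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
```

Stacking several pooling stages, then the same number of up-sampling stages, reproduces the U-shaped resolution profile that gives the architecture its name.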
The UNET neural network cost function is

min (1-λ)||S* - S||² + λ||D* - D||²   (9)

where S* and D* are the predictions of S and D output by the neural network, ||·||² denotes the squared ℓ2 norm, and λ is 0 at the beginning of training and gradually increases to 1 as training progresses.
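The cost of equation (9) and its λ schedule can be written directly; the linear ramp below is an assumption, since the patent only states that λ grows from 0 to 1 during training:

```python
import numpy as np

def unet_cost(S_pred, S_true, D_pred, D_true, lam):
    """Eq. (9): (1 - lambda)*||S* - S||^2 + lambda*||D* - D||^2.
    Early in training (lam ~ 0) the net fits the clean spectrogram;
    later (lam -> 1) it fits the direct-path distribution map."""
    return ((1.0 - lam) * np.sum((S_pred - S_true) ** 2)
            + lam * np.sum((D_pred - D_true) ** 2))

def lam_schedule(step, total_steps):
    """Hypothetical linear ramp of lambda from 0 to 1."""
    return min(1.0, step / total_steps)
```

Starting from the spectrogram term acts as a curriculum: the network first learns a speech representation, then specializes to the harder direct-path classification.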
4. Predicting the direct speech sound of the noisy signal under test with the UNET structure.
When the trained UNET is used, only the logarithmic power spectrogram X_L(n,k) of the noisy speech signal is fed to the Input; the Output then gives the time-frequency point distribution diagram of the direct speech sound, whose values serve as the weights of the time-frequency points used below, denoted W(n,k).
5. Obtaining the localization result with the weighted steered response power (WSRP) algorithm.
Any common localization method can be used here, e.g. the SRP method applied to the selected time-frequency points. Because the time-frequency points must be weighted, this embodiment adopts the WSRP method, and the final localization result is expressed as
θ̂ = argmax_θ Σ_{n,k} W(n,k) |g^H(k,θ) x(n,k)|²   (10)

where g(k,θ) is the steering vector of frequency band k for direction θ, θ is a candidate value of the direction of arrival (the independent variable), and θ̂ is the direction of arrival to be estimated. The microphone array may be any suitable array; typically a line array or a circular array is used. If a uniform line array is used, g(k,θ) is expressed as

g(k,θ) = exp(-j ω_k d sin θ / c)   (11)

where exp denotes the exponential with base e, j is the imaginary unit, c is the speed of sound, d is the distance vector of the microphone array elements, and ω_k is the angular frequency corresponding to frequency band k.
At this point, a voice sound source localization result is obtained.
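Equations (10)-(11) — the weighted steered-response power scanned over a grid of candidate directions — can be sketched for a uniform line array as follows; the grid resolution and variable names are illustrative:

```python
import numpy as np

def wsrp_localize(X, W, freqs, d, c=344.0, theta_deg=np.arange(-90, 91)):
    """Eqs. (10)-(11): theta_hat = argmax_theta sum_{n,k} W(n,k) *
    |g(k,theta)^H x(n,k)|^2 for a uniform line array.

    X: complex STFT, shape (I, N, K); W: UNET weights, shape (N, K);
    freqs: band centre frequencies (K,); d: element spacing (m).
    """
    I = X.shape[0]
    omega = 2.0 * np.pi * freqs               # angular frequencies (K,)
    elem = np.arange(I)[:, None]              # element indices (I, 1)
    thetas = np.deg2rad(theta_deg)
    scores = np.empty(thetas.shape)
    for t, theta in enumerate(thetas):
        # Eq. (11): g_i(k,theta) = exp(-j w_k i d sin(theta) / c)
        g = np.exp(-1j * omega[None, :] * elem * d * np.sin(theta) / c)
        # |g^H x|^2 per TF point, then weighted sum over all points
        beam = np.abs(np.einsum('ik,ink->nk', g.conj(), X)) ** 2
        scores[t] = np.sum(W * beam)
    return thetas[np.argmax(scores)]
```

Setting W(n,k) = 1 everywhere recovers plain SRP; the UNET weights suppress the time-frequency points dominated by reverberation and interference before the scan.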
An example of a simulation is given below.
1. Simulated mixed-speech generation
This implementation takes localization of simulated signals as an example. Room impulse responses are generated with the image-source model and convolved with clean speech to produce speech in a reverberant environment; impulse responses generated at different source positions with the same room parameters are convolved with clean speech and superimposed to obtain the mixed signal. In the simulation, the microphone array is a 4-channel line array; the element spacing is 2 cm during network training and 3.5 cm during prediction, and the room size is drawn randomly around 7 × 5 × 3 m³. The target sound source is placed at 60°, 45° and 30° on the left side of the array at 2 m from the array center, and the interference source at 45° on the right side. The room reverberation time is chosen randomly between 0.2 s and 0.9 s, and the signal-to-interference ratio randomly between -5 dB and 10 dB. Each speech sample is 1.2 s long, and the sampling frequency is 16 kHz. The speech and interference signals are collected separately at different positions and superimposed in the time domain to form the noisy speech signal. When collecting the direct speech signal, the reflection coefficients of all room walls are set to 0. Since single-channel signals are used in training, the array shape and source position have a negligible effect on network training.
2. Method process flow
a) Parameter setting
The parameters of the process of the invention are first given in table 1. It should be noted that the method of the present invention does not require adjustment of parameters in different environments, and the given parameters can be applied in various environments.
TABLE 1 Parameter settings

Parameter              Value
Window width           512
Frame shift            256
ξ                      1 × 10⁻⁴
c                      344 m/s
TH_1                   d/(15c)
TH_2                   0.98
Frequency band range   [2000 Hz, 8000 Hz]
b) Short time Fourier transform
A discrete short-time Fourier transform is applied to the time-domain signals acquired by the microphones to obtain the time-frequency domain signals; the window function is a Hanning window with a window length of 32 ms and a window shift of 16 ms.
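The framing of step b) — a Hanning window of 512 samples with a 256-sample shift at 16 kHz — can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def stft_frames(x, win_len=512, hop=256):
    """Discrete STFT as in step b): Hanning window of 32 ms (512 samples
    at 16 kHz) with a 16 ms shift (256 samples)."""
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[n * hop : n * hop + win_len] * win
                       for n in range(n_frames)])
    # real-input FFT: win_len//2 + 1 non-negative frequency bins
    return np.fft.rfft(frames, axis=1)
```

For a 1.2 s sample at 16 kHz this yields about 73 frames of 257 frequency bins each, of which only the [2000 Hz, 8000 Hz] bands of Table 1 are used.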
c) Computing the "energy" envelope
For each time-frequency point of the time-frequency domain signal, the logarithmic power spectral magnitude is computed with equations (1)-(4).
d) Selecting the time-frequency points corresponding to the direct speech sound
For each time-frequency point of the time-frequency domain signal, the spatial power response and time delay are computed with equations (5)-(8), and whether the point is direct sound is judged by the three conditions of step 2.
e) Training the designed UNET structure with the generated samples
For the designed UNET structure:
the Input is the logarithmic power spectrogram X_L(n,k) of the noisy speech signal; the two outputs Speech (S) and DPD (D) are the logarithmic power spectrogram S_L(n,k) of the direct speech sound and the time-frequency point distribution diagram of the direct speech sound obtained in step d), respectively;
the UNET neural network cost function is given by equation (9).
f) Predicting the direct sound
For the trained UNET structure: the logarithmic power spectrogram of the noisy speech signal is fed to the Input, and the time-frequency point distribution diagram of the direct speech sound is obtained at the Output.
g) Applying the weighted steered response power method to the selected time-frequency points to obtain the localization result
The final localization result is estimated over the weighted time-frequency points using equation (10).
To illustrate the advantages of the method, it is compared with the common traditional algorithm SRP-PHAT in simulations and experiments.
Under simulation conditions, 60 sets of data were tested in each direction using a 4-channel line array, with the same conditions as in the simulation example. Figs. 3-7 show that after the noisy speech signal is processed by the method of the invention, the energy of the interference signal (right peak) in the spatial power response is greatly reduced, and its influence on localization is much weaker.
A positioning result differing from the true angle by less than 5 ° is defined here as a valid positioning. Table 2 shows the effective positioning rate of the method and the conventional algorithm SRP-PHAT in the test set, and the effectiveness of the positioning effect can be obviously seen.
TABLE 2 Comparison of valid localization rates

Angle (°)   Method of the invention   SRP-PHAT
-30         68.33%                    23.33%
-45         75%                       18.33%
-60         55%                       15%
In the experiment, tests were performed in two rooms: Room 1, a small room with high reverberation, volume 5.2 × 3.5 × 3 m³, T60 = 1.10 s; Room 2, an audio-visual room, volume 7.3 × 5.3 × 3 m³, T60 = 0.36 s. Fifty speech samples were recorded with a 4-channel line array with 3.5 cm spacing, while interference samples containing 20 different common noises were played back cyclically in the recording environment; the speech source and the interference source were both 2 m from the microphone array and at the same height. The sampling rate is 16 kHz. The speech source is at -30°, -45° and -60°, respectively, and the interference source at 45°. The signal-to-interference ratio was held at about 3 dB to match practical conditions.
TABLE 3 Comparison of RMSE (°) of the different methods in the experiment
[Table 3 is reproduced as an image in the original publication; its values are not recoverable here.]
Simulation and experiments show that the method provided by the invention is superior to the SRP-PHAT method in accuracy and robustness, the method is more stable under the condition of high reverberation, and the maximum RMSE in the experiment is 3.69 degrees and is far lower than that of the traditional SRP-PHAT algorithm.

Claims (3)

1. A method for locating a voice sound source using a microphone array, comprising the steps of:
step 1, collecting speech signals and interference signals with a microphone array, obtaining the time-frequency domain signals of the noisy speech signal and the clean speech signal, and calculating the logarithmic power spectral magnitudes of both; the clean speech signal consists of the direct speech sound only;
step 2, respectively calculating the spatial power response spectra of all time-frequency points in the time-frequency domain of the noisy speech signal and the clean speech signal, further estimating the time delay corresponding to each time-frequency point, and recording τ̂_X(n,k) and τ̂_S(n,k) as the time-frequency window delay estimates of the noisy speech signal and the clean speech signal, respectively, corresponding to time n and frequency band k; obtaining the time-frequency point distribution diagram corresponding to the direct speech sound;
step 3, training a neural network with a UNET structure using the logarithmic power spectral magnitudes of the noisy and clean speech signals from step 1 and the time-frequency point distribution diagram of the direct speech sound from step 2; estimating the time-frequency point distribution diagram of the direct speech sound of the signal under test using its logarithmic power spectral magnitude and the trained network;
step 4, obtaining the speech source localization result by using the direct-sound distribution estimated in step 3 as weights in combination with a weighted localization algorithm.
2. The method as claimed in claim 1, wherein the selecting of the time-frequency distribution points corresponding to the direct sound in step 2 satisfies the following conditions:
1) in the noisy speech signal, the delay estimate τ̂_X(n,k) differs from the true delay τ = (d sin θ)/c by less than a threshold TH_1, where d, c and θ are the microphone spacing, the speed of sound and the angle at which the speech source reaches the array, respectively;
2) in the clean speech signal, the delay estimate τ̂_S(n,k) differs from the true delay τ by less than the threshold TH_1;
3) the correlation between the spatial power responses of the noisy speech signal and the clean speech signal at the same position is greater than a threshold TH_2.
3. The method as claimed in claim 1, wherein in step 3 the input of the neural network is the logarithmic power spectrogram of the noisy speech signal and the outputs are the logarithmic power spectrogram of the clean speech signal and the time-frequency point distribution diagram of the direct speech sound; the clean-speech spectrogram is used to assist training, and the values of the distribution diagram serve as the time-frequency point weights in step 4.
CN201911069273.0A 2019-11-05 2019-11-05 Voice sound source positioning method using microphone array Active CN110838303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911069273.0A CN110838303B (en) 2019-11-05 2019-11-05 Voice sound source positioning method using microphone array

Publications (2)

Publication Number Publication Date
CN110838303A true CN110838303A (en) 2020-02-25
CN110838303B CN110838303B (en) 2022-02-08

Family

ID=69576300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911069273.0A Active CN110838303B (en) 2019-11-05 2019-11-05 Voice sound source positioning method using microphone array

Country Status (1)

Country Link
CN (1) CN110838303B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312273A (en) * 2020-05-11 2020-06-19 腾讯科技(深圳)有限公司 Reverberation elimination method, apparatus, computer device and storage medium
CN112269158A (en) * 2020-10-14 2021-01-26 南京南大电子智慧型服务机器人研究院有限公司 Method for positioning voice source by utilizing microphone array based on UNET structure

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184730A (en) * 2011-02-17 2011-09-14 南京大学 Feed-forward active noise barrier
US20180018970A1 (en) * 2016-07-15 2018-01-18 Google Inc. Neural network for recognition of signals in multiple sensory domains
CN107703486A (en) * 2017-08-23 2018-02-16 南京邮电大学 A kind of auditory localization algorithm based on convolutional neural networks CNN
RU2659100C1 (en) * 2017-06-05 2018-06-28 Федеральное Государственное Казенное Военное Образовательное Учреждение Высшего Образования "Тихоокеанское Высшее Военно-Морское Училище Имени С.О. Макарова" Министерства Обороны Российской Федерации (Г. Владивосток) Large-scale radio-hydro acoustic system formation and application method for monitoring, recognizing and classifying the fields generated by the sources in marine environment
US20180341838A1 (en) * 2017-05-23 2018-11-29 Viktor Prokopenya Increasing network transmission capacity and data resolution quality and computer systems and computer-implemented methods for implementing thereof
CN109410273A (en) * 2017-08-15 2019-03-01 西门子保健有限责任公司 According to the locating plate prediction of surface data in medical imaging
US20190104357A1 (en) * 2017-09-29 2019-04-04 Apple Inc. Machine learning based sound field analysis
CN109754812A (en) * 2019-01-30 2019-05-14 South China University of Technology Voiceprint authentication method with anti-recording attack detection based on convolutional neural networks
CN109839612A (en) * 2018-08-31 2019-06-04 Elevoc Technology Co., Ltd. (Shenzhen) Sound source direction estimation method based on time-frequency masking and deep neural networks
CN110068795A (en) * 2019-03-31 2019-07-30 Tianjin University Indoor microphone-array sound source localization method based on convolutional neural networks
CN110333494A (en) * 2019-04-10 2019-10-15 Ma Peifeng InSAR time-series deformation prediction method, system and related apparatus


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Yongliang Sun et al.: "Human Localization Using Multi-Source Heterogeneous Data in Indoor Environments", IEEE Access *
Song Jianguo et al.: "Improved cascade-correlation neural network algorithm and its application in first-break picking", Oil Geophysical Prospecting *
Wang Hao et al.: "Robust speech source localization based on UNET direct-sound detection", Proceedings of the 2019 National Conference on Acoustics *
Xie Qing et al.: "Identification of direct ultrasonic waves from partial discharge in oil based on multiple feature quantities", Proceedings of the CSEE *


Also Published As

Publication number Publication date
CN110838303B (en) 2022-02-08

Similar Documents

Publication Publication Date Title
CN109839612B (en) Sound source direction estimation method and device based on time-frequency masking and deep neural network
Kim et al. Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home.
CN107452389B Universal single-channel real-time noise reduction method
CN106782590A Microphone array beamforming method in a reverberant environment
CN110726972B Voice sound source positioning method using a microphone array in interference and high-reverberation environments
CN101667425A Blind source separation method for convolutively mixed speech signals
Raykar et al. Speaker localization using excitation source information in speech
Niwa et al. Post-filter design for speech enhancement in various noisy environments
CN110838303B (en) Voice sound source positioning method using microphone array
CN112904279A (en) Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum
CN113129918A (en) Voice dereverberation method combining beam forming and deep complex U-Net network
Pertilä et al. Microphone array post-filtering using supervised machine learning for speech enhancement.
CN114171041A Voice noise reduction method, apparatus, device and storage medium based on environment detection
CN110111802A (en) Adaptive dereverberation method based on Kalman filtering
CN112269158B (en) Method for positioning voice source by utilizing microphone array based on UNET structure
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN111123202B (en) Indoor early reflected sound positioning method and system
Pirhosseinloo et al. A new feature set for masking-based monaural speech separation
Guo et al. Underwater target detection and localization with feature map and CNN-based classification
Firoozabadi et al. Combination of nested microphone array and subband processing for multiple simultaneous speaker localization
CN115426055A (en) Noise-containing underwater acoustic signal blind source separation method based on decoupling convolutional neural network
CN101645701B (en) Time delay estimation method based on filter bank and system thereof
CN112712818A (en) Voice enhancement method, device and equipment
Sarabia et al. Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning
JP2005258215A (en) Signal processing method and signal processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant