CN113782046A - Microphone array pickup method and system for remote speech recognition - Google Patents


Info

Publication number
CN113782046A
CN113782046A (application CN202111057434.1A)
Authority
CN
China
Prior art keywords
covariance matrix
vector
signal
calculating
omega
Prior art date
Legal status
Granted
Application number
CN202111057434.1A
Other languages
Chinese (zh)
Other versions
CN113782046B
Inventor
马超
李冬梅
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202111057434.1A
Publication of CN113782046A
Application granted
Publication of CN113782046B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 - Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a microphone array pickup method and system for remote speech recognition, applied to the technical field of array signal processing. First, the direction of the target speech is manually specified, which reduces distortion of the target speech. Second, the direction of the interference signal is manually specified and the human-voice interference is weighted, so that human-voice interference is suppressed more strongly while stationary noise is suppressed less, and sound can therefore be picked up accurately.

Description

Microphone array pickup method and system for remote speech recognition
Technical Field
The invention relates to the technical field of array signal processing, in particular to a microphone array pickup method and system for remote speech recognition.
Background
Speech is a tool used very frequently in human daily life and serves very important functions. In practical environments, however, speech is always affected by environmental noise, room reverberation and interfering speakers, which greatly reduce speech quality and degrade speech intelligibility and recognition rate, so the noise-corrupted signal must be enhanced to obtain a clean signal. Speech enhancement technology has wide application in many fields, such as audio and video communication, audio and video recording and playback, human-computer interaction, and speech recognition.
In the prior art, speech enhancement methods fall into two major categories: single-channel enhancement algorithms and array (multi-channel) enhancement algorithms. The two categories have complementary advantages and disadvantages, can be used together in most environments, and jointly improve enhancement performance. Classical single-channel speech enhancement algorithms include spectral subtraction, Wiener filtering, statistical methods, and deep-learning-based single-channel enhancement. Classical multi-channel speech enhancement algorithms include delay-and-sum beamforming, minimum variance distortionless response beamforming, linearly constrained minimum variance beamforming, generalized sidelobe canceller beamforming, and multi-channel Wiener filtering. Among them, the minimum variance distortionless response (MVDR) beamformer has become one of the most widely used adaptive beamforming algorithms.
However, when the target speech to be recorded is far away or has low energy, so that the speech signal-to-noise ratio is extremely low, the performance of adaptive algorithms is unsatisfactory. First, because the target speech has little energy, it is difficult to accurately distinguish noise segments from speech segments, and therefore difficult to accurately estimate the direction of the target speech, which may cause distortion of the target speech. Second, the human ear and auditory nerve are highly robust to noise but weak against human-voice interference, yet the adaptive algorithm cannot distinguish human-voice interference from stationary noise interference and suppresses both with the same weight, so the residual human-voice interference greatly degrades the intelligibility of the processed target speech.
Therefore, there is an urgent need for those skilled in the art to provide a microphone array sound pickup method and system for long-distance speech recognition that solves the above problems.
Disclosure of Invention
In view of the above, the present invention provides a microphone array sound pickup method and system for long-distance speech recognition.
In order to achieve the above purpose, the invention provides the following technical scheme:
In one aspect, a microphone array pickup method for remote speech recognition comprises the following specific steps:
S100: manually selecting the direction of the speech signal, and calculating the steering vector of the speech signal;
S200: manually selecting the direction of the interference signal, calculating the steering vector of the interference signal, and obtaining the covariance matrix of the interference signal from the steering vector of the interference signal direction;
S300: collecting sound with multiple microphones, and calculating the noise covariance matrix from the sound data received by the microphones;
S400: calculating the optimal weight vector of the microphone array from the noise covariance matrix, and obtaining the target speech from the steering vector of the speech signal and the weight vector.
Preferably, in S100, the steering vector of the speech signal is calculated as follows:
manually selecting the speech signal direction, acquiring the positions of the microphones and the speed of sound, and calculating the time delay with which the sound reaches each microphone from the speech signal direction, the microphone positions and the speed of sound, to obtain the steering vector of the speech signal:
d(ω) = [e^(-jωτ_1), e^(-jωτ_2), …, e^(-jωτ_N)]^T;
where τ_n is the time delay of the sound reaching microphone n, n = 1, 2, …, N, and d(ω) is the steering vector of the speech signal.
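Below is a minimal NumPy sketch of this steering-vector computation, assuming a far-field (plane-wave) model; the function name, argument layout and sign convention are illustrative choices and not taken from the patent.

import numpy as np

def steering_vector(mic_positions, direction, omega, c=343.0):
    # Far-field steering vector d(omega) = [exp(-j*omega*tau_1), ..., exp(-j*omega*tau_N)]^T.
    # mic_positions : (N, 3) array of microphone coordinates in metres
    # direction     : vector pointing from the array toward the source
    # omega         : angular frequency in rad/s
    # c             : speed of sound in m/s
    u = np.asarray(direction, dtype=float)
    u = u / np.linalg.norm(u)
    # Plane-wave delay of the sound at each microphone relative to the array origin.
    tau = -(np.asarray(mic_positions, dtype=float) @ u) / c
    return np.exp(-1j * omega * tau)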
Preferably, in S200, the step of calculating the covariance matrix of the interference signal includes:
S210: manually selecting the interference signal direction, and calculating the time delay with which the sound reaches each microphone from the interference signal direction, the microphone positions and the speed of sound, to obtain the steering vector of the interference signal:
d_i(ω) = [e^(-jωτ_1), e^(-jωτ_2), …, e^(-jωτ_N)]^T;
where τ_n is the time delay of the sound from the interference direction reaching microphone n, n = 1, 2, …, N, and d_i(ω) is the steering vector of the interference signal;
S220: according to the definition of the covariance matrix, obtaining the covariance matrix of the interference signal:
Φ_ii(ω) = d_i(ω) d_i^H(ω);
where d_i(ω) is the steering vector of the interference signal, d_i^H(ω) is the conjugate transpose of d_i(ω), and Φ_ii(ω) is the covariance matrix of the interference signal.
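Since the interference covariance in S220 is just the outer product of the interference steering vector with itself, it can be formed in one line. The sketch below reuses the hypothetical steering_vector function from the previous snippet.

import numpy as np

def interference_covariance(mic_positions, interference_direction, omega):
    # Rank-one covariance model of the interference: Phi_ii(omega) = d_i(omega) d_i(omega)^H.
    d_i = steering_vector(mic_positions, interference_direction, omega)
    return np.outer(d_i, d_i.conj())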
Preferably, in S300, the noise covariance matrix is calculated as follows:
the method comprises the following steps of (1) picking up by multiple microphones, and calculating a noise covariance matrix according to sound data collected by the multiple microphones:
Φvv(ω)=E[y(ω)yH(ω)];
where y (ω) is a frequency domain representation of the signals received by the plurality of microphones, yHAnd (omega) is a conjugate transpose vector of y (omega). Phi is avvAnd (omega) is the noise covariance matrix.
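In practice the expectation E[·] is usually replaced by an average over STFT frames. A minimal sketch of such a sample estimate follows; the frame-averaging and the assumption that the averaged frames are noise/interference dominated are implementation choices, not statements from the patent.

import numpy as np

def noise_covariance(Y):
    # Sample estimate of Phi_vv(omega) = E[y(omega) y(omega)^H].
    # Y: (N, T) array of STFT coefficients of the N microphones at one
    #    frequency bin over T frames.
    n_frames = Y.shape[1]
    return (Y @ Y.conj().T) / n_frames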
Preferably, after the noise covariance matrix is calculated in S300, the noise covariance matrix is further modified:
calculating the noise energy and the interference energy, wherein the noise energy is calculated as:
E_v(ω) = d^H(ω) Φ_vv(ω) d(ω);
where Φ_vv(ω) is the noise covariance matrix, d(ω) is the steering vector of the speech signal, d^H(ω) is the conjugate transpose of d(ω), and E_v(ω) is the noise energy;
the interference signal energy is calculated as:
E_i(ω) = d_i^H(ω) Φ_vv(ω) d_i(ω);
where Φ_vv(ω) is the noise covariance matrix, d_i(ω) is the steering vector of the interference signal, and d_i^H(ω) is the conjugate transpose of d_i(ω).
Determining a weighting coefficient according to the energy ratio of the noise energy and the interference energy, wherein the specific formula is as follows:
Figure BDA0003255133210000035
and correcting the noise covariance matrix according to the weighting coefficient, the corrected equation being:
h(ω) = argmin h^H(ω)(Φ_vv(ω) + λ(ω)Φ_ii(ω))h(ω), s.t. h^H(ω)d(ω) = 1;
where λ(ω) is the weighting coefficient, Φ_vv(ω) is the noise covariance matrix, Φ_ii(ω) is the covariance matrix of the interference signal, d(ω) is the steering vector of the speech signal, h^H(ω) is the conjugate transpose of h(ω), and h(ω) is the filter coefficient vector.
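The sketch below forms the corrected matrix Φ_vv(ω) + λ(ω)Φ_ii(ω). The exact expression for λ(ω) is given in the patent only as an image; the interference-to-noise energy ratio used here is purely an assumed placeholder to make the example runnable.

import numpy as np

def modified_covariance(Phi_vv, Phi_ii, d, d_i):
    # Noise energy E_v = d^H Phi_vv d and interference energy E_i = d_i^H Phi_vv d_i.
    E_v = np.real(d.conj() @ Phi_vv @ d)
    E_i = np.real(d_i.conj() @ Phi_vv @ d_i)
    # Assumed weighting coefficient: the patent's formula for lambda(omega) is not
    # reproduced here; an energy ratio is used only for illustration.
    lam = E_i / max(E_v, 1e-12)
    return Phi_vv + lam * Phi_ii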
Preferably, the step of obtaining the target speech in S400 is as follows:
solving for the filter coefficients from the corrected equation by the Lagrange multiplier method:
h(ω) = (Φ_vv(ω) + λ(ω)Φ_ii(ω))^(-1) d(ω) / (d^H(ω)(Φ_vv(ω) + λ(ω)Φ_ii(ω))^(-1) d(ω));
where Φ_vv(ω) is the noise covariance matrix, λ(ω) is the weighting coefficient, Φ_ii(ω) is the covariance matrix of the interference signal, d(ω) is the steering vector of the speech signal, h^H(ω) is the conjugate transpose of h(ω), and h(ω) is the resulting filter coefficient vector;
weighting the multi-microphone signals with the filter coefficients to obtain the target speech:
Z(ω) = h^H(ω) y(ω);
where Z(ω) is the clean long-distance speech to be recorded.
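A minimal sketch of this closed-form solution and of the final filtering step; the small diagonal loading added before the matrix solve is an implementation choice for numerical robustness, not something stated in the patent.

import numpy as np

def mvdr_weights(Phi_mod, d):
    # Minimiser of h^H Phi_mod h subject to h^H d = 1:
    #   h = Phi_mod^{-1} d / (d^H Phi_mod^{-1} d).
    n = Phi_mod.shape[0]
    loading = 1e-6 * np.real(np.trace(Phi_mod)) / n   # assumed diagonal loading
    num = np.linalg.solve(Phi_mod + loading * np.eye(n), d)
    return num / (d.conj() @ num)

def beamform(h, Y):
    # Z(omega) = h^H y(omega), applied to every STFT frame in Y (N mics x T frames).
    return h.conj() @ Y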
In another aspect, a microphone array sound pickup system for long-distance speech recognition comprises:
a first selection module, used for selecting the direction of the speech signal;
a first calculation module, connected with the first selection module and used for calculating the steering vector of the speech signal from the direction of the speech signal;
a second selection module, used for selecting the direction of the interference signal;
a second calculation module, connected with the second selection module and used for calculating the steering vector of the interference signal from the direction of the interference signal;
an acquisition module, used for acquiring the sound data of the multiple microphones;
a third calculation module, connected with the acquisition module and the second calculation module and used for calculating the noise covariance matrix from the sound data;
and an output module, connected with the third calculation module and the first calculation module and used for calculating the optimal weight vector of the microphone array from the noise covariance matrix, and obtaining and outputting the target speech from the steering vector of the speech signal and the weight vector.
According to the above technical solutions, compared with the prior art, the present invention provides a microphone array pickup method and system for remote speech recognition in which distortion of the target speech is reduced by manually specifying the speech direction. Second, the direction of the interference signal is manually specified and the human-voice interference is weighted, so that the human-voice interference is suppressed more strongly while stationary noise is suppressed less; sound can therefore be picked up accurately, providing a better solution for remote voice control.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only embodiments of the present invention, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flow chart of the method provided by the present invention;
FIG. 2 is a schematic diagram of the system provided by the present invention;
FIG. 3 is a schematic view of the environment setup provided in Example 2;
FIG. 4 is a diagram illustrating the processing result of the conventional MVDR method;
FIG. 5 is a diagram illustrating the processing result of the present invention provided in Example 2.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Example 1
Referring to FIG. 1, an embodiment of the present invention discloses a microphone array pickup method for remote speech recognition, which comprises the following specific steps:
S100: manually selecting the direction of the speech signal, and calculating the steering vector of the speech signal;
S200: manually selecting the direction of the interference signal, calculating the steering vector of the interference signal, and obtaining the covariance matrix of the interference signal from the steering vector of the interference signal direction;
S300: collecting sound with multiple microphones, and calculating the noise covariance matrix from the sound data received by the microphones;
S400: calculating the optimal weight vector of the microphone array from the noise covariance matrix, and obtaining the target speech from the steering vector of the speech signal and the weight vector.
In one embodiment, in S100, the steering vector of the speech signal is calculated as follows:
manually selecting the speech signal direction, acquiring the positions of the microphones and the speed of sound, and calculating the time delay with which the sound reaches each microphone from the speech signal direction, the microphone positions and the speed of sound, to obtain the steering vector of the speech signal:
d(ω) = [e^(-jωτ_1), e^(-jωτ_2), …, e^(-jωτ_N)]^T;
where τ_n is the time delay of the sound reaching microphone n, n = 1, 2, …, N, and d(ω) is the steering vector of the speech signal.
In one embodiment, in S200, the step of calculating the covariance matrix of the interference signal is as follows:
S210: manually selecting the interference signal direction, and calculating the time delay with which the sound reaches each microphone from the interference signal direction, the microphone positions and the speed of sound, to obtain the steering vector of the interference signal:
d_i(ω) = [e^(-jωτ_1), e^(-jωτ_2), …, e^(-jωτ_N)]^T;
where τ_n is the time delay of the sound from the interference direction reaching microphone n, n = 1, 2, …, N, and d_i(ω) is the steering vector of the interference signal;
S220: according to the definition of the covariance matrix, obtaining the covariance matrix of the interference signal:
Φ_ii(ω) = d_i(ω) d_i^H(ω);
where d_i(ω) is the steering vector of the interference signal, d_i^H(ω) is the conjugate transpose of d_i(ω), and Φ_ii(ω) is the covariance matrix of the interference signal.
In one embodiment, in S300, the noise covariance matrix is calculated as follows:
picking up sound with the multiple microphones, and calculating the noise covariance matrix from the sound data collected by the microphones:
Φ_vv(ω) = E[y(ω) y^H(ω)];
where y(ω) is the frequency-domain representation of the signals received by the microphones, y^H(ω) is the conjugate transpose of y(ω), and Φ_vv(ω) is the noise covariance matrix.
In a specific embodiment, after the noise covariance matrix is calculated in S300, the noise covariance matrix is further modified:
calculating the noise energy and the interference energy, wherein the noise energy is calculated as:
E_v(ω) = d^H(ω) Φ_vv(ω) d(ω);
where Φ_vv(ω) is the noise covariance matrix, d(ω) is the steering vector of the speech signal, d^H(ω) is the conjugate transpose of d(ω), and E_v(ω) is the noise energy;
the interference signal energy is calculated as:
E_i(ω) = d_i^H(ω) Φ_vv(ω) d_i(ω);
where Φ_vv(ω) is the noise covariance matrix, d_i(ω) is the steering vector of the interference signal, and d_i^H(ω) is the conjugate transpose of d_i(ω).
Determining a weighting coefficient according to the energy ratio of the noise energy and the interference energy, wherein the specific formula is as follows:
Figure BDA0003255133210000072
and correcting the noise covariance matrix according to the weighting coefficient, the corrected equation being:
h(ω) = argmin h^H(ω)(Φ_vv(ω) + λ(ω)Φ_ii(ω))h(ω), s.t. h^H(ω)d(ω) = 1;
where λ(ω) is the weighting coefficient, Φ_vv(ω) is the noise covariance matrix, Φ_ii(ω) is the covariance matrix of the interference signal, d(ω) is the steering vector of the speech signal, h^H(ω) is the conjugate transpose of h(ω), and h(ω) is the filter coefficient vector.
In a specific embodiment, the step of obtaining the target speech in S400 is as follows:
solving for the filter coefficients from the corrected equation by the Lagrange multiplier method:
h(ω) = (Φ_vv(ω) + λ(ω)Φ_ii(ω))^(-1) d(ω) / (d^H(ω)(Φ_vv(ω) + λ(ω)Φ_ii(ω))^(-1) d(ω));
where Φ_vv(ω) is the noise covariance matrix, λ(ω) is the weighting coefficient, Φ_ii(ω) is the covariance matrix of the interference signal, d(ω) is the steering vector of the speech signal, h^H(ω) is the conjugate transpose of h(ω), and h(ω) is the resulting filter coefficient vector.
Specifically, the corrected equation for the filter coefficients h(ω) is obtained in step S300, and in step S400 the Lagrange multiplier method is applied to that equation to arrive at the above expression for h(ω).
Weighting the multi-microphone signals with the filter coefficients yields the target speech:
Z(ω) = h^H(ω) y(ω);
where Z(ω) is the clean long-distance speech to be recorded.
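Tying the preceding sketches together, a hypothetical per-frequency-bin driver for steps S100 to S400 could look as follows; all function names reused from the earlier snippets are illustrative assumptions, not part of the patent.

def pickup_bin(Y, mic_positions, omega, speech_dir, interf_dir):
    # Y: (N, T) STFT coefficients of the N microphones at one frequency bin.
    d = steering_vector(mic_positions, speech_dir, omega)                   # S100
    d_i = steering_vector(mic_positions, interf_dir, omega)                 # S200
    Phi_ii = interference_covariance(mic_positions, interf_dir, omega)      # S200
    Phi_vv = noise_covariance(Y)                                            # S300
    Phi_mod = modified_covariance(Phi_vv, Phi_ii, d, d_i)                   # corrected matrix
    h = mvdr_weights(Phi_mod, d)                                            # S400
    return beamform(h, Y)                                                   # target speech Z(omega)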
Referring to FIG. 2, an embodiment of the present invention further discloses a microphone array sound pickup system for long-distance speech recognition, comprising:
a first selection module, used for selecting the direction of the speech signal;
a first calculation module, connected with the first selection module and used for calculating the steering vector of the speech signal from the direction of the speech signal;
a second selection module, used for selecting the direction of the interference signal;
a second calculation module, connected with the second selection module and used for calculating the steering vector of the interference signal from the direction of the interference signal;
an acquisition module, used for acquiring the sound data of the multiple microphones;
a third calculation module, connected with the acquisition module and the second calculation module and used for calculating the noise covariance matrix from the sound data received by the microphones;
and an output module, connected with the third calculation module and the first calculation module and used for calculating the optimal weight vector of the microphone array from the noise covariance matrix, obtaining the target speech from the steering vector of the speech signal and the weight vector, and outputting the target speech.
According to the above technical solutions, compared with the prior art, the present invention provides a microphone array pickup method and system for remote speech recognition in which distortion of the target speech is reduced by manually specifying the speech direction. Second, the direction of the interference signal is manually specified and the human-voice interference is weighted, so that the human-voice interference is suppressed more strongly while stationary noise is suppressed less; sound can therefore be picked up accurately, providing a better solution for remote voice control.
Example 2
An example in which the sound pickup method of Example 1 of the present invention is applied is as follows.
The invention places no requirements on the number of microphones or on the shape and size of the array; it only requires that the positions of all microphones be fixed and known.
As shown in FIG. 3, to compare the conventional MVDR method with the present invention, an 8-microphone array was set up and placed horizontally to test the effect. In the simulation experiment, 4 noise sources, an interference source, and a target speech source were placed on a horizontal plane at 60-degree intervals, each at a distance of 20 m.
In addition to the noise and interference sources, this embodiment adds independent white noise to each microphone to simulate a real recording environment.
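A small sketch of how such a simulation geometry could be set up; the uniform linear array and the 5 cm spacing below are assumptions made only for illustration, since the patent does not specify the array geometry.

import numpy as np

c = 343.0                                   # speed of sound, m/s
n_mics = 8
spacing = 0.05                              # assumed inter-microphone spacing in metres
mic_positions = np.column_stack([np.arange(n_mics) * spacing,
                                 np.zeros(n_mics),
                                 np.zeros(n_mics)])

# Six sources (4 noise, 1 interference, 1 target) on a horizontal plane,
# spaced 60 degrees apart at a range of 20 m, as in the simulation.
angles = np.radians(np.arange(0, 360, 60))
source_positions = np.column_stack([20.0 * np.cos(angles),
                                    20.0 * np.sin(angles),
                                    np.zeros(angles.size)])

# Propagation delay from each source to each microphone (spherical wavefront).
delays = np.linalg.norm(source_positions[:, None, :] - mic_positions[None, :, :],
                        axis=-1) / c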
FIG. 4 is a schematic diagram illustrating the processing result of the conventional MVDR method.
FIG. 5 is a schematic diagram of the processing result of the present invention provided in Example 2.
As can be seen from FIG. 4 and FIG. 5, after testing both the conventional MVDR method and the proposed method, the speech obtained with the proposed method shows less distortion, the human-voice interference is suppressed, and the speech is clearer; the intelligibility of the target speech is therefore improved, subsequent processing becomes easier, sound is picked up accurately, and a better solution for remote voice control is provided.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A microphone array pickup method for remote speech recognition, characterized by comprising the following specific steps:
S100: selecting any speech signal direction, and calculating the steering vector of the speech signal;
S200: selecting any interference signal direction, calculating the steering vector of the interference signal, and obtaining the covariance matrix of the interference signal from the steering vector of the interference signal direction;
S300: picking up sound with a multi-microphone array, and calculating the noise covariance matrix from the sound data collected by the microphones;
S400: calculating the optimal weight vector of the microphone array from the noise covariance matrix, and obtaining the target speech from the steering vector of the speech signal and the weight vector.
2. The microphone array pickup method for long-distance speech recognition according to claim 1, wherein in S100, the steering vector of the speech signal is calculated as follows:
manually selecting the speech signal direction, acquiring the positions of the microphones and the speed of sound, and calculating the time delay with which the sound reaches each microphone from the speech signal direction, the microphone positions and the speed of sound, to obtain the steering vector of the speech signal:
d(ω) = [e^(-jωτ_1), e^(-jωτ_2), …, e^(-jωτ_N)]^T;
where τ_n is the time delay of the sound reaching microphone n, n = 1, 2, …, N, and d(ω) is the steering vector of the speech signal.
3. The method as claimed in claim 1, wherein the step of calculating the covariance matrix of the interference signal in S200 is as follows:
S210: manually selecting the interference signal direction, and calculating the time delay with which the sound reaches each microphone from the interference signal direction, the microphone positions and the speed of sound, to obtain the steering vector of the interference signal:
d_i(ω) = [e^(-jωτ_1), e^(-jωτ_2), …, e^(-jωτ_N)]^T;
where τ_n is the time delay of the sound from the interference direction reaching microphone n, n = 1, 2, …, N, and d_i(ω) is the steering vector of the interference signal;
S220: according to the definition of the covariance matrix, obtaining the covariance matrix of the interference signal:
Φ_ii(ω) = d_i(ω) d_i^H(ω);
where d_i(ω) is the steering vector of the interference signal, d_i^H(ω) is the conjugate transpose of d_i(ω), and Φ_ii(ω) is the covariance matrix of the interference signal.
4. The microphone array pickup method for long-distance speech recognition according to claim 1, wherein in S300, the noise covariance matrix is calculated as follows:
picking up sound with the multiple microphones, and calculating the noise covariance matrix from the sound data collected by the microphones:
Φ_vv(ω) = E[y(ω) y^H(ω)];
where y(ω) is the frequency-domain representation of the signals received by the microphones, y^H(ω) is the conjugate transpose of y(ω), and Φ_vv(ω) is the noise covariance matrix.
5. The microphone array pickup method for long-distance speech recognition according to claim 4, wherein after the noise covariance matrix is calculated in S300, the noise covariance matrix is further modified by:
calculating the noise energy and the interference energy, wherein the noise energy is calculated as:
E_v(ω) = d^H(ω) Φ_vv(ω) d(ω);
where Φ_vv(ω) is the noise covariance matrix, d(ω) is the steering vector of the speech signal, d^H(ω) is the conjugate transpose of d(ω), and E_v(ω) is the noise energy;
the interference signal energy is calculated as:
E_i(ω) = d_i^H(ω) Φ_vv(ω) d_i(ω);
where Φ_vv(ω) is the noise covariance matrix, d_i(ω) is the steering vector of the interference signal, and d_i^H(ω) is the conjugate transpose of d_i(ω);
determining a weighting coefficient according to the energy ratio of the noise energy and the interference energy, wherein the specific formula is as follows:
Figure FDA0003255133200000023
and correcting the noise covariance matrix according to the weighting coefficient, the corrected equation being:
h(ω) = argmin h^H(ω)(Φ_vv(ω) + λ(ω)Φ_ii(ω))h(ω), s.t. h^H(ω)d(ω) = 1;
where λ(ω) is the weighting coefficient, Φ_vv(ω) is the noise covariance matrix, Φ_ii(ω) is the covariance matrix of the interference signal, d(ω) is the steering vector of the speech signal, h^H(ω) is the conjugate transpose of h(ω), and h(ω) is the filter coefficient vector.
6. The microphone array pickup method for remote speech recognition according to claim 5, wherein the step of obtaining the target speech in S400 is as follows:
solving for the filter coefficients from the corrected equation by the Lagrange multiplier method:
h(ω) = (Φ_vv(ω) + λ(ω)Φ_ii(ω))^(-1) d(ω) / (d^H(ω)(Φ_vv(ω) + λ(ω)Φ_ii(ω))^(-1) d(ω));
where Φ_vv(ω) is the noise covariance matrix, λ(ω) is the weighting coefficient, Φ_ii(ω) is the covariance matrix of the interference signal, d(ω) is the steering vector of the speech signal, h^H(ω) is the conjugate transpose of h(ω), and h(ω) is the resulting filter coefficient vector;
and weighting the multi-microphone signals with the filter coefficients to obtain the target speech:
Z(ω) = h^H(ω) y(ω);
where Z(ω) is the clean long-distance speech to be recorded.
7. A microphone array sound pickup system for long-distance speech recognition, characterized by comprising:
a first selection module, used for selecting the direction of the speech signal;
a first calculation module, connected with the first selection module and used for calculating the steering vector of the speech signal from the direction of the speech signal;
a second selection module, connected with the second selection module and used for selecting the direction of the interference signal;
a second calculation module, connected with the second selection module and used for calculating the steering vector of the interference signal from the direction of the interference signal;
an acquisition module, used for acquiring the sound data of the multiple microphones;
a third calculation module, connected with the acquisition module and the second calculation module and used for calculating the noise covariance matrix from the sound data;
and an output module, connected with the third calculation module and the first calculation module and used for calculating the optimal weight vector of the microphone array from the noise covariance matrix, and obtaining and outputting the target speech from the steering vector of the speech signal and the weight vector.
CN202111057434.1A, priority date 2021-09-09, filing date 2021-09-09: Microphone array pickup method and system for long-distance voice recognition. Active. Granted as CN113782046B.

Priority Applications (1)

Application Number: CN202111057434.1A (granted as CN113782046B); Priority Date: 2021-09-09; Filing Date: 2021-09-09; Title: Microphone array pickup method and system for long-distance voice recognition

Applications Claiming Priority (1)

Application Number: CN202111057434.1A (granted as CN113782046B); Priority Date: 2021-09-09; Filing Date: 2021-09-09; Title: Microphone array pickup method and system for long-distance voice recognition

Publications (2)

Publication Number / Publication Date
CN113782046A: 2021-12-10
CN113782046B: 2024-09-17

Family

ID=78842224

Family Applications (1)

Application Number: CN202111057434.1A (Active, granted as CN113782046B); Priority Date: 2021-09-09; Filing Date: 2021-09-09; Title: Microphone array pickup method and system for long-distance voice recognition

Country Status (1)

Country Link
CN (1) CN113782046B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118522285A (en) * 2024-07-25 2024-08-20 辽宁汉华信息工程有限公司 Interactive user voice recognition method for AI intelligent agent


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180176679A1 (en) * 2016-12-20 2018-06-21 Verizon Patent And Licensing Inc. Beamforming optimization for receiving audio signals
JP2018141922A (en) * 2017-02-28 2018-09-13 日本電信電話株式会社 Steering vector estimation device, steering vector estimating method and steering vector estimation program
CN111052766A (en) * 2017-09-07 2020-04-21 三菱电机株式会社 Noise removing device and noise removing method
JP2019054344A (en) * 2017-09-13 2019-04-04 日本電信電話株式会社 Filter coefficient calculation device, sound pickup device, method thereof, and program
CN108181507A (en) * 2017-12-25 2018-06-19 中国科学技术大学 A kind of robust adaptive beamforming method
CN108694957A (en) * 2018-04-08 2018-10-23 湖北工业大学 The echo cancelltion design method formed based on circular microphone array beams
CN110503971A (en) * 2018-05-18 2019-11-26 英特尔公司 Time-frequency mask neural network based estimation and Wave beam forming for speech processes
CN108831495A (en) * 2018-06-04 2018-11-16 桂林电子科技大学 A kind of sound enhancement method applied to speech recognition under noise circumstance
CN110890099A (en) * 2018-09-10 2020-03-17 北京京东尚科信息技术有限公司 Sound signal processing method, device and storage medium
CN110931036A (en) * 2019-12-07 2020-03-27 杭州国芯科技股份有限公司 Microphone array beam forming method
CN111081267A (en) * 2019-12-31 2020-04-28 中国科学院声学研究所 Multi-channel far-field speech enhancement method
CN112447184A (en) * 2020-11-10 2021-03-05 北京小米松果电子有限公司 Voice signal processing method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
杨志伟; 张攀; 陈颖; 许华健: "Robust beamforming algorithm based on joint iterative estimation of the steering vector and covariance matrix", Journal of Electronics & Information Technology, no. 12, 18 October 2018 (2018-10-18) *
臧守明; 白媛; 马秀荣; 李俊胜: "An improved nested array beamforming algorithm", Computer Simulation, no. 10, 15 October 2016 (2016-10-15) *
陈明建; 罗景青; 龙国庆: "Robust Capon beamforming algorithm based on covariance matrix estimation", Fire Control & Command Control, no. 10, 15 October 2016 (2016-10-15) *


Also Published As

Publication number Publication date
CN113782046B (en) 2024-09-17

Similar Documents

Publication Publication Date Title
Benesty et al. Fundamentals of differential beamforming
CN106251877B (en) Voice Sounnd source direction estimation method and device
CN106782590B (en) Microphone array beam forming method based on reverberation environment
US9100734B2 (en) Systems, methods, apparatus, and computer-readable media for far-field multi-source tracking and separation
US20090103749A1 (en) Microphone Array Processor Based on Spatial Analysis
Yousefian et al. A dual-microphone algorithm that can cope with competing-talker scenarios
Wang et al. Noise power spectral density estimation using MaxNSR blocking matrix
EP1430472A2 (en) Selective sound enhancement
Jarrett et al. Noise reduction in the spherical harmonic domain using a tradeoff beamformer and narrowband DOA estimates
WO2023108864A1 (en) Regional pickup method and system for miniature microphone array device
CN113257270B (en) Multi-channel voice enhancement method based on reference microphone optimization
CN113782046B (en) Microphone array pickup method and system for long-distance voice recognition
Fejgin et al. BRUDEX database: Binaural room impulse responses with uniformly distributed external microphones
Levin et al. Near-field signal acquisition for smartglasses using two acoustic vector-sensors
Šarić et al. Supervised speech separation combined with adaptive beamforming
Bai et al. Speech Enhancement by Denoising and Dereverberation Using a Generalized Sidelobe Canceller-Based Multichannel Wiener Filter
Geng et al. A speech enhancement method based on the combination of microphone array and parabolic reflector
Koyama et al. Exploring optimal dnn architecture for end-to-end beamformers based on time-frequency references
Zhu et al. Modified complementary joint sparse representations: a novel post-filtering to MVDR beamforming
As’ad et al. Beamforming designs robust to propagation model estimation errors for binaural hearing aids
D'Olne et al. Model-based beamforming for wearable microphone arrays
Schwartz et al. A recursive expectation-maximization algorithm for online multi-microphone noise reduction
Yen et al. Rotor noise-aware noise covariance matrix estimation for unmanned aerial vehicle audition
Šarić et al. Performance analysis of MVDR beamformer applied on an end-fire microphone array composed of unidirectional microphones
Bai et al. Kalman filter-based microphone array signal processing using the equivalent source model

Legal Events

Code / Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant