CN114613383A - Multi-input voice signal beam forming information complementation method under airborne environment - Google Patents
- Publication number
- CN114613383A (application CN202210246203.3A)
- Authority
- CN
- China
- Prior art keywords
- signal
- matrix
- optimal
- representing
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K11/00—Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/16—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/175—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
- G10K11/178—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase
- G10K11/1785—Methods, e.g. algorithms; Devices
- G10K11/17853—Methods, e.g. algorithms; Devices of the filter
- G10K11/17854—Methods, e.g. algorithms; Devices of the filter the filter being an adaptive filter
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a multi-input speech-signal beamforming information complementation method for an airborne environment, belonging to the field of airborne speech-signal processing, comprising the following steps: S1, preprocess the input signals; S2, perform voice activity detection on the preprocessed signals; S3, estimate and synchronize the time delay, adjust the ranges of the corresponding speech and noise segments, and judge whether the delay between the synchronized signals is less than the filter length; if so, proceed to matrix estimation, otherwise continue delay synchronization; S4, perform noise matrix estimation and noisy-speech matrix estimation, and estimate the optimal matrix from the two; S5, estimate the optimal weight vector using the optimal matrix to obtain the optimal filter, and pass the input signals through the optimal filter to output a synthesized signal. The invention ensures information integrity and enhances the quality and stability of air-ground communication: it preserves complete speech information while effectively improving the signal-to-noise ratio.
Description
Technical Field
The invention relates to the field of airborne voice signal processing, in particular to a multi-input voice signal beam forming information complementation method under an airborne environment.
Background
During a flight mission, an ultrashort-wave communication system suffers from speech-signal discontinuity caused by multiple factors, such as incomplete spatial coverage by the aircraft's multiple antennas, the poor diffraction capability of ultrashort waves, and electromagnetic interference within the aircraft's systems. Existing solutions to this problem mainly include selective combining, equal-gain combining, and microphone-array beamforming.
Selective combining is a combining method in the diversity-combining family: it outputs the single channel with the best performance. Because only one signal is selected for output, information is lost. The problem is illustrated in fig. 1. Taking four antennas receiving the speech 123456789 as an example, when speech interruption occurs, the multichannel speech signals are compared and gated. Since only gating is performed, the output signal still suffers interruptions and dropped words, and complete information cannot be obtained.
Equal-gain combining is also a combining method in the diversity-combining family; it can only guarantee in-phase addition. If the inputs are unbalanced, weak signals are easily over-amplified in the combination, introducing more noise and possibly even causing combining loss.
In the microphone-array beamforming approach, the gain applied to the output of one or more microphones in the array can be controlled by beamforming, ideally maximizing the array gain obtained from beamforming; however, increasing the gain can also increase the internal noise (self-noise) of the system.
In summary, the existing selective-combining scheme outputs a single selected signal, causing signal loss, while the existing equal-gain combining scheme easily introduces more noise and causes combining loss.
Disclosure of Invention
The present invention aims to overcome the deficiencies of the prior art by providing a method for complementing multi-input speech-signal beamforming information in an airborne environment, which solves the problems set forth in the Background.
The purpose of the invention is realized by the following scheme:
a multi-input voice signal beam forming information complementation method under an airborne environment comprises the following steps:
step S1, preprocessing the input signal;
step S2, voice activity detection is carried out on the preprocessed signals, and a voice section range of the input signals and a noise section range of the input signals are obtained;
step S3, estimating and synchronizing time delay, adjusting the range of the corresponding voice section and noise section, judging whether the time delay among the signals after synchronization is less than the length of the filter, if so, estimating the matrix, otherwise, continuing the time delay synchronization;
step S4, carrying out noise matrix estimation and noisy speech matrix estimation, and carrying out optimal matrix estimation by the two;
and step S5, estimating the optimal weight vector by using the optimal matrix to obtain an optimal filter, and outputting a synthesized signal by using the input signal through the optimal filter.
Further, in step S1, the preprocessing includes a framing windowing process.
Further, in step S2, the method includes the sub-steps of: performing speech endpoint detection on the speech signals, and jointly determining the speech interval with the short-time energy method and the short-time zero-crossing rate method to obtain an accurate endpoint detection result.
Further, in step S3, the method includes the following sub-steps: at time k, the two input signal models are set as

y_i(k) = α_i · s(k − τ_i) + v_i(k),

where i = 1, 2, …, N; s(k) denotes the original clean speech signal; τ_i denotes the relative delay of the speech signal received on each channel with respect to the original clean speech signal; v_i(k) denotes the noise of the speech signal received on each channel relative to the original clean speech signal;

R_{y1y2}(τ) = E[y_1(k) · y_2(k − τ)] is the cross-correlation function of the two input speech signals, y_1(k) denotes the first received signal and y_2(k) denotes the second received signal;

τ = τ_1 − τ_2 is the delay between the two signals; α_1 denotes the ratio coefficient of the first received signal to the clean original speech signal, and α_2 denotes the ratio coefficient of the second received signal to the clean original speech signal;

if τ − (τ_1 − τ_2) = 0, the autocorrelation function R_ss(τ − (τ_1 − τ_2)) of s(k) takes its maximum value, R_{y1y2}(τ) reaches its maximum, and the two signals attain maximum correlation; the corresponding displacement point count λ is computed, and the delay τ of the two signal segments is calculated from the sampling rate f_s and the point count λ:

τ = λ / f_s
and after obtaining the time delay estimation result, carrying out displacement synchronization to obtain a signal without time delay difference.
Further, in step S4, the method includes the following sub-steps: compute the autocorrelation matrix R_yy of the noisy speech,

R_yy = E[y(k) · y^T(k)],

and compute the noise autocorrelation matrix R_vv,

R_vv = E[v(k) · v^T(k)].
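A minimal numpy sketch of such a sample autocorrelation-matrix estimate; averaging over overlapping length-L_h snapshots is our assumption, since the patent shows the estimator only as an image:

```python
import numpy as np

def autocorr_matrix(x, Lh):
    """Sample estimate of the Lh x Lh autocorrelation matrix
    R = E[x(k) x(k)^T], averaged over overlapping length-Lh snapshots
    of the segment x (a noisy-speech segment for R_yy, a detected
    noise segment for R_vv)."""
    snapshots = np.lib.stride_tricks.sliding_window_view(x, Lh)
    return snapshots.T @ snapshots / len(snapshots)
```

R_yy would be estimated from the detected noisy-speech segment and R_vv from the detected noise segment, using the segment ranges found by the voice activity detection.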
Further, in step S5, the optimal matrix estimation includes the following sub-steps: the optimal matrix W_{i,0} is calculated as follows:

L_h denotes the filter length, I_{L_h} denotes the identity matrix of order L_h, and W_0 denotes the optimal matrix constructed from the optimal matrices W_{i,0} of all channels.
Further, in step S5, estimating the optimal weight vector by using the optimal matrix includes the following sub-steps: the optimal weight vector is calculated from the two constrained problems

min over h_y of h_y^T R_yy h_y subject to W^T h_y = u′, and
min over h_v of h_v^T R_vv h_v subject to W^T h_v = u′,

where u′ = [1, 0, …, 0]^T is a vector of length L_h, h denotes the optimal filter, h_y^T R_yy h_y and h_v^T R_vv h_v denote the output power of the noisy speech and of the noise respectively, s.t. denotes the constraint, and W^T denotes the transpose of the optimal filter matrix;

the two optimization problems are solved by the Lagrange multiplier method:

L_y(h_y, λ_y) = h_y^T R_yy h_y + λ_y (W^T h_y − u′)

L_v(h_v, λ_v) = h_v^T R_vv h_v + λ_v (W^T h_v − u′)

where L_y(h_y, λ_y) and L_v(h_v, λ_v) denote the Lagrange functions of the noisy speech and of the noise under the constraints, and λ_y and λ_v denote the Lagrange multiplier vector parameters;

taking the derivative with respect to h in L_y(h_y, λ_y) and L_v(h_v, λ_v) gives:

L′_y(h_y, λ_y) = R_yy h_y + W λ_y^T

L′_v(h_v, λ_v) = R_vv h_v + W λ_v^T

where L′_y(h_y, λ_y) and L′_v(h_v, λ_v) are the derivatives of L_y(h_y, λ_y) and L_v(h_v, λ_v), respectively; setting both L′_y(h_y, λ_y) and L′_v(h_v, λ_v) equal to 0 gives

h_y = −R_yy^{−1} W λ_y^T and h_v = −R_vv^{−1} W λ_v^T;

substituting both into the constraints W^T h_y = u′ and W^T h_v = u′ yields

h_y = R_yy^{−1} W (W^T R_yy^{−1} W)^{−1} u′ and h_v = R_vv^{−1} W (W^T R_vv^{−1} W)^{−1} u′;

substituting the matrix W_0, which contains the space-time information, yields

h_{ST,y} = R_yy^{−1} W_0 (W_0^T R_yy^{−1} W_0)^{−1} u′ and h_{ST,v} = R_vv^{−1} W_0 (W_0^T R_vv^{−1} W_0)^{−1} u′,

where h_{ST,y} denotes the optimal filter found for the noisy speech and h_{ST,v} denotes the optimal filter found for the noise.
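The constrained minimization min h^T R h subject to W^T h = u′ has the standard Lagrange-multiplier closed form h = R^{−1} W (W^T R^{−1} W)^{−1} u′, and that form can be checked numerically; this is a sketch under that reading, with names of our own choosing:

```python
import numpy as np

def constrained_min_filter(R, W, u):
    """Minimize h^T R h subject to W^T h = u via the Lagrange closed form
    h = R^{-1} W (W^T R^{-1} W)^{-1} u.  With R = R_vv and W = W_0 this
    plays the role of h_{ST,v}; with R = R_yy, of h_{ST,y}."""
    RinvW = np.linalg.solve(R, W)              # R^{-1} W without an explicit inverse
    return RinvW @ np.linalg.solve(W.T @ RinvW, u)
```

Using np.linalg.solve instead of forming R^{−1} explicitly is numerically safer when R is ill-conditioned, which can happen for short noise segments.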
Further, in step S5, outputting the synthesized signal by passing the input signals through the optimal filter includes the following sub-steps:

using h_{ST,v} as the filter matrix, the synthesized signal output by the optimal filter is

x̂(k) = Σ_{i=1}^{N} h_{i,ST,v}^T y_i(k) = Σ_{i=1}^{N} [x_{ir}(k) + v_{ir}(k)],

where x̂(k) is the filtered output signal, h_{i,ST,v} denotes the optimal filter matrix of channel i, and x_{ir}(k) and v_{ir}(k) denote the speech and the residual noise, respectively, after filtering by the optimal filter.
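The final filter-and-sum step can be sketched as follows; summing the per-channel filtered signals is our reading of the multi-channel structure, and all names are illustrative:

```python
import numpy as np

def synthesize(channels, filters):
    """Filter each delay-aligned input channel with its per-channel optimal
    filter h_{i,ST,v} and sum the results into one complemented output."""
    n = len(channels[0])
    out = np.zeros(n)
    for y_i, h_i in zip(channels, filters):
        out += np.convolve(y_i, h_i)[:n]       # keep the common signal length
    return out
```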
Further, jointly determining the speech interval by the short-time energy method and the short-time zero-crossing rate method includes the following sub-steps:

let the speech signal of the n-th frame be x_n(k); the short-time energy of the frame is

E_n = Σ_k x_n^2(k),

and the short-time zero-crossing rate is

Z_n = (1/2) Σ_k |sgn[x_n(k)] − sgn[x_n(k − 1)]|,

where sgn[·] denotes the sign (step) function, k denotes the time index, and N denotes the total number of frames.
The beneficial effects of the invention include:
the invention prevents the problem of voice interruption by utilizing mutual supplement among multi-input voice information, ensures the information integrity, and enhances the communication quality and the communication stability between the air and the machine. Compared with a comparative gating method or an equal gain combining method, the method not only keeps complete voice information, but also can effectively improve the signal-to-noise ratio.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of the prior art method for selecting and combining to perform multi-channel speech signal comparison gating;
FIG. 2 is a flow chart of steps of a method according to an embodiment of the present invention;
FIG. 3 is a diagram showing the relationship between the number of estimated points corresponding to a noise segment and the output SNR;
FIG. 4 is a graph of the relationship between a noisy speech segment Ly and the output signal-to-noise ratio;
FIG. 5 is a graph of maximum delay length (0-1000) versus output signal-to-noise ratio;
FIG. 6 shows normalized speech waveforms of four input signals with speech discontinuities, each at a signal-to-noise ratio of 5 dB;
FIG. 7 is a speech waveform of the four input signals of FIG. 6 output using the method of the present invention;
FIG. 8 is a flowchart illustrating steps of a method according to an embodiment of the present invention.
Detailed Description
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
Aiming at the problem that, during flight missions, speech signals in an ultrashort-wave communication system are interrupted by multiple factors, such as incomplete spatial coverage by the aircraft's multiple antennas, the poor diffraction capability of ultrashort waves, and electromagnetic interference within the aircraft's systems, and at the deficiencies of the existing solutions to this problem (selective combining, equal-gain combining), the embodiments of the invention provide a beamforming method for multi-input speech signals in an airborne environment.
The technical scheme of the embodiment of the invention is as follows: performing frame-division windowing pretreatment on an input signal; carrying out voice endpoint detection on each input signal to determine whether the input signal is a voice section; carrying out time delay estimation processing on the existing voice section, and carrying out time delay synchronization to ensure that the maximum time delay is not greater than the length of a filter; determining a noise section according to the voice section information of each input signal after time delay synchronization, performing cross-correlation matrix estimation of the voice section and the noise section, and calculating an optimal filtering matrix and an optimal weight vector according to the results of the voice section and the noise section; and finally filtering the output signal.
As shown in fig. 2, the method comprises the following steps:
preprocessing input signals by framing, windowing and the like;
performing voice activity detection on the framed signal to obtain a voice segment range of the input signal and a noise segment range of the input signal;
carrying out time delay estimation preprocessing, estimating the maximum time delay, synchronizing, adjusting the range of the corresponding voice section and noise section, judging whether the time delay among the signals after synchronization is less than the length of a filter, carrying out matrix estimation if the time delay is less than the length of the filter, and otherwise, continuing time delay synchronization;
carrying out noise matrix estimation and noisy speech matrix estimation, and carrying out optimal matrix estimation by using the noise matrix estimation and the noisy speech matrix estimation;
and estimating the optimal weight vector by using the optimal matrix to obtain an optimal filter, and outputting a synthesized signal by using the input signal through the filter.
In the specific implementation process, the method comprises the following sub-steps:
firstly, preprocessing such as framing and windowing;
frame dividing time: 25ms
Windowing: sw(n) ═ S (n) (w (n)), where Sw(n) is the windowed function, S (n) is the function to be windowed, w (n) is the window function, w (n) selects the Hamming window,
secondly, voice endpoint detection is carried out on the voice signals;
short-time energy: let the speech signal of the nth frame be xn(k) Then the short-time energy of the frame isShort-time zero-crossing rate of
An accurate end point detection result can be obtained by mutually determining the interval by using a short-time energy method and a short-time zero crossing rate method.
Third, delay estimation and synchronization are performed. The delay among the multiple input channels must be eliminated or kept small, at least below the filter order; otherwise the quality of the received speech degrades (see fig. 5), the positions of the speech and noise segments cannot be unified for the calculation, and the computation load increases;
and (3) time delay estimation: suppose that the two input signal models are respectively
Wherein i is 1, 2.. times.n; s (k) represents the original clean speech signal; tau isiRepresenting the relative time delay of the voice signal received by each channel relative to the original pure voice signal; v. ofi(k) Representing the noise of the received speech signal for each channel relative to the original clean speech signal.
If tau- (tau)1-τ2) When R is equal to 0, then Rss(τ-(τ1-τ2) To take the maximum value of the maximum value,the maximum correlation of the two paths of signals is obtained, the corresponding displacement point number lambda can be obtained, and the sampling rate f is usedsAnd the relation between the point number lambda can calculate the time difference tau of two sections of signals.
And after obtaining the time delay estimation result, carrying out displacement synchronization to obtain a signal without time delay difference.
Fourth, estimate the cross-correlation matrices of the speech segment and the noise segment; the numbers of effective points in the speech and noise segments greatly influence the result (see figs. 3 and 4);
Fifthly, estimating an optimal filter matrix;
where i denotes the channel index and W_{i,0} denotes the optimal filter matrix of channel i.
Sixth, calculate the optimal weight vector from the two constrained problems

min over h_y of h_y^T R_yy h_y subject to W^T h_y = u′, and
min over h_v of h_v^T R_vv h_v subject to W^T h_v = u′,

where u′ = [1, 0, …, 0]^T is a vector of length L_h, h denotes the optimal filter, h_y^T R_yy h_y and h_v^T R_vv h_v denote the output power of the noisy speech and of the noise respectively, s.t. denotes the constraint, and W^T denotes the transpose of the optimal filter matrix;

the two optimization problems are solved by the Lagrange multiplier method:

L_y(h_y, λ_y) = h_y^T R_yy h_y + λ_y (W^T h_y − u′)

L_v(h_v, λ_v) = h_v^T R_vv h_v + λ_v (W^T h_v − u′)

where L_y(h_y, λ_y) and L_v(h_v, λ_v) denote the Lagrange functions of the noisy speech and of the noise under the constraints, and λ_y and λ_v denote the Lagrange multiplier vector parameters;

taking the derivative with respect to h in L_y(h_y, λ_y) and L_v(h_v, λ_v) gives:

L′_y(h_y, λ_y) = R_yy h_y + W λ_y^T

L′_v(h_v, λ_v) = R_vv h_v + W λ_v^T

where L′_y(h_y, λ_y) and L′_v(h_v, λ_v) are the derivatives of L_y(h_y, λ_y) and L_v(h_v, λ_v), respectively; setting both L′_y(h_y, λ_y) and L′_v(h_v, λ_v) equal to 0 gives

h_y = −R_yy^{−1} W λ_y^T and h_v = −R_vv^{−1} W λ_v^T;

substituting both into the constraints W^T h_y = u′ and W^T h_v = u′ yields

h_y = R_yy^{−1} W (W^T R_yy^{−1} W)^{−1} u′ and h_v = R_vv^{−1} W (W^T R_vv^{−1} W)^{−1} u′;

substituting the matrix W_0, which contains the space-time information, yields

h_{ST,y} = R_yy^{−1} W_0 (W_0^T R_yy^{−1} W_0)^{−1} u′ and h_{ST,v} = R_vv^{−1} W_0 (W_0^T R_vv^{−1} W_0)^{−1} u′,

where h_{ST,y} denotes the optimal filter found for the noisy speech and h_{ST,v} denotes the optimal filter found for the noise.
Seventh, under the algorithm's assumption, speech and noise are completely uncorrelated, so when the output power of the filtered noisy speech as a whole is minimal, the output power of the noise is simultaneously minimal. In practice this does not hold exactly; therefore, to prevent the information of the speech segments from being filtered out, h_{ST,v} is used here as the filter matrix to filter the output signal:

x̂(k) = Σ_{i=1}^{N} h_{i,ST,v}^T y_i(k) = Σ_{i=1}^{N} [x_{ir}(k) + v_{ir}(k)],

where x̂(k) is the filtered output signal, h_{i,ST,v} denotes the optimal filter matrix of channel i, and x_{ir}(k) and v_{ir}(k) denote the speech and the residual noise, respectively, after filtering by the optimal filter.
As shown in fig. 3, the number of effective points in the noise segment correlates with the output signal-to-noise ratio. In this experiment, variables such as the delay and the number of effective points in the speech segment are held fixed; only the number of effective points in the corresponding noise segment is varied, with two 5 dB signals as input.

As shown in fig. 4, the number of effective points in the speech segment correlates with the output signal-to-noise ratio. In this experiment, other variables such as the delay and the number of effective points in the noise segment are held fixed; only the number of effective points in the speech segment is varied, with two 5 dB signals as input. The horizontal axis indicates how many characters of the same spoken sentence fall within the selection range: the more characters, the larger the number of effective points in the speech segment.

As shown in fig. 5, the maximum delay length between the multiple inputs is related to the output signal-to-noise ratio. In this experiment, other variables such as the numbers of effective points in the speech and noise segments are held fixed; only the maximum delay between the input signals is varied, with two 5 dB signals as input. The filter order is set to 64; it can be seen that when the number of delay points exceeds the filter order, the output signal-to-noise ratio drops sharply.
As shown in fig. 6, there are four input signals with speech discontinuities, each with a signal-to-noise ratio of 5 dB.

As shown in fig. 7, these are the speech waveforms output by this method for the four input signals of fig. 6.
Example 1
As shown in fig. 8, a method for complementing multi-input speech signal beam forming information in an airborne environment includes the following steps:
step S1, preprocessing the input signal;
step S2, voice activity detection is carried out on the preprocessed signals, and a voice section range of the input signals and a noise section range of the input signals are obtained;
step S3, estimating and synchronizing time delay, adjusting the range of the corresponding voice section and noise section, judging whether the time delay among the signals after synchronization is less than the length of the filter, if so, estimating the matrix, otherwise, continuing the time delay synchronization;
step S4, carrying out noise matrix estimation and noisy speech matrix estimation, and carrying out optimal matrix estimation by the two;
and step S5, estimating the optimal weight vector by using the optimal matrix to obtain an optimal filter, and outputting a synthesized signal by using the input signal through the optimal filter.
Example 2
Based on embodiment 1, in step S1, the preprocessing includes a framing windowing process.
Example 3
Based on embodiment 1, in step S2, the method includes the sub-steps of: performing speech endpoint detection on the speech signals, and jointly determining the speech interval with the short-time energy method and the short-time zero-crossing rate method to obtain an accurate endpoint detection result.
Example 4
Based on embodiment 1, in step S3, the method includes the sub-steps of: at time k, the two input signal models are set as

y_i(k) = α_i · s(k − τ_i) + v_i(k),

where i = 1, 2, …, N; s(k) denotes the original clean speech signal; τ_i denotes the relative delay of the speech signal received on each channel with respect to the original clean speech signal; v_i(k) denotes the noise of the speech signal received on each channel relative to the original clean speech signal;

R_{y1y2}(τ) = E[y_1(k) · y_2(k − τ)] is the cross-correlation function of the two input speech signals, y_1(k) denotes the first received signal and y_2(k) denotes the second received signal;

τ = τ_1 − τ_2 is the delay between the two signals; α_1 denotes the ratio coefficient of the first received signal to the clean original speech signal, and α_2 denotes the ratio coefficient of the second received signal to the clean original speech signal;

if τ − (τ_1 − τ_2) = 0, the autocorrelation function R_ss(τ − (τ_1 − τ_2)) of s(k) takes its maximum value, R_{y1y2}(τ) reaches its maximum, and the two signals attain maximum correlation; the corresponding displacement point count λ is computed, and the delay τ of the two signal segments is calculated from the sampling rate f_s and the point count λ:

τ = λ / f_s
and after obtaining the time delay estimation result, carrying out displacement synchronization to obtain a signal without time delay difference.
Example 5
Based on embodiment 1, in step S4, the method includes the sub-steps of: compute the autocorrelation matrix R_yy of the noisy speech,

R_yy = E[y(k) · y^T(k)],

and compute the noise autocorrelation matrix R_vv,

R_vv = E[v(k) · v^T(k)].
Example 6
Based on embodiment 5, in step S5, the optimal matrix estimation includes the sub-steps of: the optimal matrix W_{i,0} is calculated as follows:

L_h denotes the filter length, I_{L_h} denotes the identity matrix of order L_h, and W_0 denotes the optimal matrix constructed from the optimal matrices W_{i,0} of all channels; this construction guarantees that the matrix has full rank, which facilitates taking derivatives in the subsequent Lagrange-multiplier step.
Example 7
Based on embodiment 6, in step S5, estimating the optimal weight vector by using the optimal matrix includes the following sub-steps: the optimal weight vector is calculated from the two constrained problems

min over h_y of h_y^T R_yy h_y subject to W^T h_y = u′, and
min over h_v of h_v^T R_vv h_v subject to W^T h_v = u′,

where u′ = [1, 0, …, 0]^T is a vector of length L_h, h denotes the optimal filter, h_y^T R_yy h_y and h_v^T R_vv h_v denote the output power of the noisy speech and of the noise respectively, s.t. denotes the constraint, and W^T denotes the transpose of the optimal filter matrix;

the two optimization problems are solved by the Lagrange multiplier method:

L_y(h_y, λ_y) = h_y^T R_yy h_y + λ_y (W^T h_y − u′)

L_v(h_v, λ_v) = h_v^T R_vv h_v + λ_v (W^T h_v − u′)

where L_y(h_y, λ_y) and L_v(h_v, λ_v) denote the Lagrange functions of the noisy speech and of the noise under the constraints, and λ_y and λ_v denote the Lagrange multiplier vector parameters;

taking the derivative with respect to h in L_y(h_y, λ_y) and L_v(h_v, λ_v) gives:

L′_y(h_y, λ_y) = R_yy h_y + W λ_y^T

L′_v(h_v, λ_v) = R_vv h_v + W λ_v^T

where L′_y(h_y, λ_y) and L′_v(h_v, λ_v) are the derivatives of L_y(h_y, λ_y) and L_v(h_v, λ_v), respectively; setting both L′_y(h_y, λ_y) and L′_v(h_v, λ_v) equal to 0 gives

h_y = −R_yy^{−1} W λ_y^T and h_v = −R_vv^{−1} W λ_v^T;

substituting both into the constraints W^T h_y = u′ and W^T h_v = u′ yields

h_y = R_yy^{−1} W (W^T R_yy^{−1} W)^{−1} u′ and h_v = R_vv^{−1} W (W^T R_vv^{−1} W)^{−1} u′;

substituting the matrix W_0, which contains the space-time information, yields

h_{ST,y} = R_yy^{−1} W_0 (W_0^T R_yy^{−1} W_0)^{−1} u′ and h_{ST,v} = R_vv^{−1} W_0 (W_0^T R_vv^{−1} W_0)^{−1} u′,

where h_{ST,y} denotes the optimal filter found for the noisy speech and h_{ST,v} denotes the optimal filter found for the noise.
Example 8
Based on embodiment 7, in step S5: since speech and noise are completely uncorrelated under the algorithm's assumption, when the output power of the filtered noisy speech as a whole is minimal, the output power of the noise is simultaneously minimal. In practice this is not exactly the case, so to prevent the information of the speech segments from being filtered out, the embodiment of the invention uses h_{ST,v} as the filter matrix here. Outputting the synthesized signal by passing the input signals through the optimal filter includes the sub-steps of:

the synthesized signal output by the optimal filter is

x̂(k) = Σ_{i=1}^{N} h_{i,ST,v}^T y_i(k) = Σ_{i=1}^{N} [x_{ir}(k) + v_{ir}(k)],

where x̂(k) is the filtered output signal, h_{i,ST,v} denotes the optimal filter matrix of channel i, and x_{ir}(k) and v_{ir}(k) denote the speech and the residual noise, respectively, after filtering by the optimal filter.
Example 9
Based on embodiment 3, jointly determining the speech interval by the short-time energy method and the short-time zero-crossing rate method includes the following sub-steps:

let the speech signal of the n-th frame be x_n(k); the short-time energy of the frame is

E_n = Σ_k x_n^2(k),

and the short-time zero-crossing rate is

Z_n = (1/2) Σ_k |sgn[x_n(k)] − sgn[x_n(k − 1)]|,

where sgn[·] denotes the sign (step) function, k denotes the time index, and N denotes the total number of frames.
Other embodiments than the above examples may be devised by those skilled in the art based on the foregoing disclosure, or by adapting and using knowledge or techniques of the relevant art, and features of various embodiments may be interchanged or substituted and such modifications and variations that may be made by those skilled in the art without departing from the spirit and scope of the present invention are intended to be within the scope of the following claims.
Claims (9)
1. A multi-input voice signal beam forming information complementation method under an airborne environment is characterized by comprising the following steps:
step S1, preprocessing the input signal;
step S2, voice activity detection is carried out on the preprocessed signals, and a voice section range of the input signals and a noise section range of the input signals are obtained;
step S3, estimating and synchronizing time delay, adjusting the range of the corresponding voice section and noise section, judging whether the time delay among the signals after synchronization is less than the length of the filter, if so, estimating the matrix, otherwise, continuing the time delay synchronization;
step S4, carrying out noise matrix estimation and noisy speech matrix estimation, and carrying out optimal matrix estimation by the two;
and step S5, estimating the optimal weight vector by using the optimal matrix to obtain an optimal filter, and outputting a synthesized signal by using the input signal through the optimal filter.
2. The method for complementing beamforming information of multiple input voice signals according to claim 1, wherein in step S1, the pre-processing comprises framing and windowing.
3. The method for complementing beamforming information of multiple input voice signals under airborne environment according to claim 1, wherein step S2 comprises the sub-steps of: carrying out voice endpoint detection on the voice signals, and mutually determining the interval by using the short-time energy method and the short-time zero-crossing rate method to obtain an accurate endpoint detection result.
4. The method for complementing beamforming information of multiple input voice signals under airborne environment according to claim 1, wherein step S3 comprises the sub-steps of: setting the two input signal models at time k as:
y_i(k) = α_i s(k − τ_i) + v_i(k)
wherein i = 1, 2, ..., n; s(k) represents the original clean speech signal; τ_i represents the relative time delay of the speech signal received by each channel with respect to the original clean speech signal; v_i(k) represents the noise of the speech signal received by each channel with respect to the original clean speech signal;
R_y1y2(τ) = E[y_1(k) y_2(k − τ)], wherein R_y1y2(τ) is the cross-correlation function of the two input speech signals, y_1(k) represents the first received signal, and y_2(k) represents the second received signal;
τ = τ_1 − τ_2 is the time delay between the two signals; α_1 represents the ratio coefficient of the first received signal to the clean original speech signal, and α_2 represents the ratio coefficient of the second received signal to the clean original speech signal;
if τ − (τ_1 − τ_2) = 0, the autocorrelation function R_ss(τ − (τ_1 − τ_2)) of s(k) attains its maximum, so R_y1y2(τ) attains its maximum and the correlation of the two signals is maximal; the corresponding number of shifted sample points λ is obtained, and the time delay τ of the two signal segments is calculated from the relation between the sampling rate f_s and the number of points λ as τ = λ / f_s;
and after obtaining the time delay estimation result, carrying out displacement synchronization to obtain a signal without time delay difference.
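The delay estimation via the cross-correlation peak can be illustrated as (a minimal sketch; the function name and the use of `np.correlate` are assumptions, not the patent's implementation):

```python
import numpy as np

def estimate_delay(y1, y2, fs):
    # Find the lag lambda at which the cross-correlation of the two
    # channels peaks; a positive lag means y1 is delayed relative to y2.
    corr = np.correlate(y1, y2, mode="full")
    lag = int(np.argmax(corr)) - (len(y2) - 1)
    return lag, lag / fs  # shift in samples, delay tau = lambda / fs
```

Shifting one channel by `lag` samples then yields the delay-free signals used for matrix estimation.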
5. The method for complementing beamforming information of multiple input voice signals under airborne environment according to claim 1, wherein step S4 comprises the sub-steps of: calculating the autocorrelation matrix of the noisy speech, R_yy = E[y(k) y^T(k)], and calculating the noise autocorrelation matrix, R_vv = E[v(k) v^T(k)].
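Sample estimates of these expectation matrices can be sketched as (assuming frame vectors stacked row-wise; the averaging estimator shown is illustrative, not the patent's exact form):

```python
import numpy as np

def autocorr_matrix(frames):
    # Sample estimate of R = E[y(k) y^T(k)]: average of the outer
    # products of the frame vectors (rows of `frames`).
    return frames.T @ frames / frames.shape[0]
```

R_yy would be estimated from frames in the voice-section range and R_vv from frames in the noise-section range found in step S2.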
6. The method for complementing multi-input speech signal beam-forming information according to claim 5, wherein in step S5, the optimal matrix estimation comprises the sub-steps of: calculating the optimal matrix W_i,0.
7. The method for complementing beamforming information of multiple input voice signals according to claim 6, wherein in step S5, the estimating of the optimal weight vector using the optimal matrix comprises the following sub-steps: calculating the optimal weight vector from the two constrained optimization problems
min h_y^T R_yy h_y  s.t.  W^T h_y = u'
min h_v^T R_vv h_v  s.t.  W^T h_v = u'
wherein u' = [1, 0, ..., 0]^T is a vector of length L_h; h represents the optimal filter; h_y^T R_yy h_y and h_v^T R_vv h_v represent the output power of the noisy speech and of the noise, respectively; s.t. denotes "subject to" the constraints; and W^T represents the transpose of the optimal matrix;
solving the two optimization problems by a Lagrange multiplier method:
L_y(h_y, λ_y) = h_y^T R_yy h_y + λ_y(W^T h_y − u')
L_v(h_v, λ_v) = h_v^T R_vv h_v + λ_v(W^T h_v − u')
wherein L_y(h_y, λ_y) and L_v(h_v, λ_v) respectively represent the Lagrange functions of the noisy speech and the noise under the constraints, and λ_y, λ_v represent the Lagrange multiplier vector parameters;
taking the derivative of L_y(h_y, λ_y) and L_v(h_v, λ_v) with respect to h yields:
L'_y(h_y, λ_y) = R_yy h_y + W λ_y^T
L'_v(h_v, λ_v) = R_vv h_v + W λ_v^T
wherein L'_y(h_y, λ_y) and L'_v(h_v, λ_v) are the derivatives of L_y(h_y, λ_y) and L_v(h_v, λ_v), respectively; setting both L'_y(h_y, λ_y) and L'_v(h_v, λ_v) equal to 0 gives:
h_y = −R_yy^(-1) W λ_y^T,  h_v = −R_vv^(-1) W λ_v^T
substituting both into the constraints W^T h_y = u' and W^T h_v = u' yields:
λ_y^T = −(W^T R_yy^(-1) W)^(-1) u',  λ_v^T = −(W^T R_vv^(-1) W)^(-1) u'
and substituting the matrix W_0 containing the space-time information yields:
h_ST,y = R_yy^(-1) W_0 (W_0^T R_yy^(-1) W_0)^(-1) u',  h_ST,v = R_vv^(-1) W_0 (W_0^T R_vv^(-1) W_0)^(-1) u'
wherein h_ST,y represents the optimal filter found for the noisy speech, and h_ST,v represents the optimal filter found for the noise.
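The closed-form solution h = R^(-1) W (W^T R^(-1) W)^(-1) u' of this constrained minimization can be sketched numerically (a minimal illustration; the function name and matrix shapes are assumptions):

```python
import numpy as np

def constrained_filter(R, W, u):
    # Minimise h^T R h subject to W^T h = u via Lagrange multipliers:
    # h = R^(-1) W (W^T R^(-1) W)^(-1) u
    Rinv_W = np.linalg.solve(R, W)  # R^(-1) W without forming an explicit inverse
    return Rinv_W @ np.linalg.solve(W.T @ Rinv_W, u)
```

Calling this once with R_yy and once with R_vv (with W = W_0) would produce h_ST,y and h_ST,v respectively.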
8. The method for complementing beamforming information of a multi-input speech signal under airborne environment according to claim 7, wherein in step S5, the step of outputting the synthesized signal by the input signal through an optimal filter comprises the sub-steps of:
using h_ST,v as the filter matrix, the synthesized signal output by the optimal filter is z(k) = h_ST,v^T y(k), wherein y(k) denotes the input signal vector.
9. The method of complementing beamforming information for multiple input speech signals according to claim 3, wherein said mutually determining the interval by using the short-time energy method and the short-time zero-crossing rate method comprises the sub-steps of:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210246203.3A CN114613383B (en) | 2022-03-14 | 2022-03-14 | Multi-input voice signal beam forming information complementation method in airborne environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114613383A true CN114613383A (en) | 2022-06-10 |
CN114613383B CN114613383B (en) | 2023-07-18 |
Family
ID=81863801
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210246203.3A Active CN114613383B (en) | 2022-03-14 | 2022-03-14 | Multi-input voice signal beam forming information complementation method in airborne environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114613383B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1633121A1 (en) * | 2004-09-03 | 2006-03-08 | Harman Becker Automotive Systems GmbH | Speech signal processing with combined adaptive noise reduction and adaptive echo compensation |
CN102611669A (en) * | 2010-12-29 | 2012-07-25 | Zte维创通讯公司 | Channel estimation filtering |
US20150019213A1 (en) * | 2013-07-15 | 2015-01-15 | Rajeev Conrad Nongpiur | Measuring and improving speech intelligibility in an enclosure |
CN104952459A (en) * | 2015-04-29 | 2015-09-30 | 大连理工大学 | Distributed speech enhancement method based on distributed uniformity and MVDR (minimum variance distortionless response) beam forming |
CN105223544A (en) * | 2015-08-26 | 2016-01-06 | 南京信息工程大学 | The constant Beamforming Method of the near field linear constraint adaptive weighted frequency of minimum variance |
CN106782590A (en) * | 2016-12-14 | 2017-05-31 | 南京信息工程大学 | Based on microphone array Beamforming Method under reverberant ambiance |
CN107610713A (en) * | 2017-10-23 | 2018-01-19 | 科大讯飞股份有限公司 | Echo cancel method and device based on time delay estimation |
CN110045334A (en) * | 2019-02-28 | 2019-07-23 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Sidelobe null Beamforming Method |
CN110111807A (en) * | 2019-04-27 | 2019-08-09 | 南京理工大学 | A kind of indoor sound source based on microphone array follows and Enhancement Method |
US20190341054A1 (en) * | 2018-05-07 | 2019-11-07 | Microsoft Technology Licensing, Llc | Multi-modal speech localization |
CN110473564A (en) * | 2019-07-10 | 2019-11-19 | 西北工业大学深圳研究院 | A kind of multi-channel speech enhancement method based on depth Wave beam forming |
CN111508516A (en) * | 2020-03-31 | 2020-08-07 | 上海交通大学 | Voice beam forming method based on channel correlation time frequency mask |
Non-Patent Citations (1)
Title |
---|
王秋菊 (Wang Qiuju): "Research on Speech Enhancement in an Airborne Noise Environment" * |
Also Published As
Publication number | Publication date |
---|---|
CN114613383B (en) | 2023-07-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||