CN114613383A - Multi-input voice signal beam forming information complementation method under airborne environment - Google Patents


Info

Publication number
CN114613383A
CN114613383A (application CN202210246203.3A)
Authority
CN
China
Prior art keywords: signal, matrix, optimal, representing, voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210246203.3A
Other languages
Chinese (zh)
Other versions
CN114613383B (en)
Inventor
黄钰
王立
雷志雄
张晓�
王梦琦
朱宇
马建民
王煦
邓诚
陈卓立
张绪皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 10 Research Institute
Original Assignee
CETC 10 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 10 Research Institute filed Critical CETC 10 Research Institute
Priority to CN202210246203.3A priority Critical patent/CN114613383B/en
Publication of CN114613383A publication Critical patent/CN114613383A/en
Application granted granted Critical
Publication of CN114613383B publication Critical patent/CN114613383B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K: SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00: Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/175: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • G10K11/178: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase
    • G10K11/1785: Methods, e.g. algorithms; Devices
    • G10K11/17853: Methods, e.g. algorithms; Devices of the filter
    • G10K11/17854: Methods, e.g. algorithms; Devices of the filter, the filter being an adaptive filter
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a multi-input voice signal beam forming information complementation method in an airborne environment, belonging to the field of airborne voice signal processing and comprising the following steps: S1, preprocessing the input signals; S2, performing voice activity detection on the preprocessed signals; S3, estimating and synchronizing the time delay, adjusting the corresponding speech-segment and noise-segment ranges, and judging whether the delay among the synchronized signals is less than the filter length: if so, proceeding to matrix estimation, otherwise continuing delay synchronization; S4, performing noise matrix estimation and noisy-speech matrix estimation, and performing optimal matrix estimation from the two; and S5, estimating the optimal weight vector with the optimal matrix to obtain an optimal filter, and passing the input signals through the optimal filter to output a synthesized signal. The invention ensures information integrity and enhances the quality and stability of air-to-ground communication: it both preserves complete voice information and effectively improves the signal-to-noise ratio.

Description

Multi-input voice signal beam forming information complementation method under airborne environment
Technical Field
The invention relates to the field of airborne voice signal processing, in particular to a multi-input voice signal beam forming information complementation method under an airborne environment.
Background
During flight missions, ultrashort-wave communication systems suffer from voice-signal discontinuity caused by multiple factors, such as incomplete spatial coverage by the aircraft's multiple antennas, the poor diffraction capability of ultrashort waves, and electromagnetic interference within the aircraft system. Existing solutions to this problem mainly include the selective combining scheme, the equal-gain combining scheme, and the microphone-array beamforming scheme.
The selective combining scheme is a combining method within diversity combining: it outputs the channel with the best performance. However, selective combining outputs only one signal, which causes information loss, as illustrated in fig. 1. Taking four antennas receiving the voice 123456789 as an example, when voice interruption occurs, the multichannel voice signals are compared and gated. Because only gating is performed, the output signal still suffers speech interruption and lost words, and complete information cannot be obtained.
The equal-gain combining scheme is likewise a combining method within diversity combining, and it can only guarantee in-phase addition. If the inputs are unbalanced, weak signals are easily amplified severalfold before combining, introducing more noise and possibly even causing combining loss.
In the microphone-array beamforming scheme, beamforming controls the gain applied to the output of one or more microphones in the array, preferably maximizing the array gain obtained from beamforming; however, increasing the gain also increases the internal noise, or self-noise, of the system.
In summary, the existing selective combining scheme outputs a single selected signal, which causes signal loss, while the existing equal-gain combining scheme easily introduces more noise, causing combining loss.
Disclosure of Invention
The present invention is directed to overcoming the deficiencies of the prior art by providing a method for complementing multi-input speech-signal beamforming information in an airborne environment, aiming to solve the problems set forth in the background.
The purpose of the invention is realized by the following scheme:
a multi-input voice signal beam forming information complementation method under an airborne environment comprises the following steps:
step S1, preprocessing the input signal;
step S2, voice activity detection is carried out on the preprocessed signals, and a voice section range of the input signals and a noise section range of the input signals are obtained;
step S3, estimating and synchronizing time delay, adjusting the range of the corresponding voice section and noise section, judging whether the time delay among the signals after synchronization is less than the length of the filter, if so, estimating the matrix, otherwise, continuing the time delay synchronization;
step S4, carrying out noise matrix estimation and noisy speech matrix estimation, and carrying out optimal matrix estimation by the two;
and step S5, estimating the optimal weight vector by using the optimal matrix to obtain an optimal filter, and outputting a synthesized signal by using the input signal through the optimal filter.
Further, in step S1, the preprocessing includes a framing windowing process.
Further, in step S2, the method includes the sub-steps of: and carrying out voice endpoint detection on the voice signals, and mutually determining the interval by utilizing a short-time energy method and a short-time zero-crossing rate method to obtain an accurate endpoint detection result.
Further, in step S3, the method includes the sub-steps of: at time k, the input signal models are set as

y_i(k) = α_i s(k − τ_i) + v_i(k)

where i = 1, 2, …, N; s(k) denotes the original clean speech signal; τ_i denotes the delay of the speech signal received on each channel relative to the original clean speech signal; and v_i(k) denotes the noise of the speech signal received on each channel relative to the original clean speech signal.

The cross-correlation function of two input speech signals is

r_{y1y2}(τ) = E[y_1(k) y_2(k + τ)]

where y_1(k) denotes the first received signal and y_2(k) denotes the second received signal. With the noise assumed uncorrelated with the speech, this expands to

r_{y1y2}(τ) = α_1 α_2 r_ss(τ − (τ_1 − τ_2))

where τ = τ_1 − τ_2 is the delay between the two signals, α_1 denotes the ratio coefficient of the first received signal to the clean original speech signal, and α_2 denotes the ratio coefficient of the second received signal to the clean original speech signal.

If τ − (τ_1 − τ_2) = 0, the autocorrelation function r_ss(τ − (τ_1 − τ_2)) of s(k) takes its maximum value; the maximum correlation of the two signals is therefore obtained at the lag

λ = argmax_m r_{y1y2}(m)

where λ is the corresponding shift in samples. From the relation between the sampling rate f_s and the sample count λ, the delay τ of the two signals is calculated:

τ = λ / f_s

After obtaining the delay estimation result, shift synchronization is carried out to obtain signals without delay difference.
Further, in step S4, the method includes the sub-steps of: calculating the autocorrelation matrix R_yy of the noisy speech,

R_yy = E[y(k) y^T(k)]

where y(k) = [y_1^T(k), y_2^T(k), …, y_N^T(k)]^T and y_i(k) = [y_i(k), y_i(k−1), …, y_i(k−L_h+1)]^T stacks the most recent L_h samples of channel i; E[·] denotes the expected value.

Computing the noise autocorrelation matrix R_vv,

R_vv = E[v(k) v^T(k)]

where v(k) is stacked from the noise samples in the same way.
Further, in step S5, the optimal matrix estimation includes the sub-steps of: computing the optimal matrix W_{i,0} for each channel according to the block equation given in the original drawings, where i denotes the channel index and W_{i,0} denotes the optimal matrix for channel i; L_h denotes the filter length, I_{L_h} denotes the identity matrix of order L_h, and W_0 = [W_{1,0}^T, W_{2,0}^T, …, W_{N,0}^T]^T denotes the optimal matrix constructed by stacking the optimal matrices of all the channels.
Further, in step S5, the estimating of the optimal weight vector by using the optimal matrix includes the sub-steps of: the optimal weight vector is obtained from the following two constrained minimization problems:

h_y = argmin_h h^T R_yy h, s.t. W^T h = u'

h_v = argmin_h h^T R_vv h, s.t. W^T h = u'

where u' = [1, 0, …, 0]^T is a vector of length L_h, h denotes the optimal filter, h_y^T R_yy h_y and h_v^T R_vv h_v denote the output powers of the noisy speech and of the noise respectively, s.t. denotes "subject to" (the constraint), and W^T denotes the transpose of the optimal matrix.

The two optimization problems are solved by the Lagrange multiplier method:

L_y(h_y, λ_y) = h_y^T R_yy h_y + λ_y (W^T h_y − u')

L_v(h_v, λ_v) = h_v^T R_vv h_v + λ_v (W^T h_v − u')

where L_y(h_y, λ_y) and L_v(h_v, λ_v) denote the Lagrange functions of the noisy speech and of the noise under the constraints, and λ_y and λ_v denote the Lagrange multiplier vector parameters.

Differentiating L_y(h_y, λ_y) and L_v(h_v, λ_v) with respect to h gives:

L'_y(h_y, λ_y) = R_yy h_y + W λ_y^T

L'_v(h_v, λ_v) = R_vv h_v + W λ_v^T

where L'_y(h_y, λ_y) and L'_v(h_v, λ_v) are the derivatives of L_y(h_y, λ_y) and L_v(h_v, λ_v), respectively. Setting L'_y(h_y, λ_y) and L'_v(h_v, λ_v) equal to 0 yields:

h_y = −R_yy^{-1} W λ_y^T

h_v = −R_vv^{-1} W λ_v^T

Substituting both into the constraints W^T h_y = u' and W^T h_v = u' yields:

λ_y^T = −(W^T R_yy^{-1} W)^{-1} u'

λ_v^T = −(W^T R_vv^{-1} W)^{-1} u'

Substituting these back into h_y = −R_yy^{-1} W λ_y^T and h_v = −R_vv^{-1} W λ_v^T yields:

h_y = R_yy^{-1} W (W^T R_yy^{-1} W)^{-1} u'

h_v = R_vv^{-1} W (W^T R_vv^{-1} W)^{-1} u'

Substituting the matrix W_0 containing the space-time information yields:

h_ST,y = R_yy^{-1} W_0 (W_0^T R_yy^{-1} W_0)^{-1} u'

h_ST,v = R_vv^{-1} W_0 (W_0^T R_vv^{-1} W_0)^{-1} u'

where h_ST,y denotes the optimal filter found for the noisy speech and h_ST,v denotes the optimal filter found for the noise.
Further, in step S5, the outputting of the synthesized signal by passing the input signals through the optimal filter includes the sub-steps of: using h_ST,v as the filter matrix, the synthesized signal output by the optimal filter is

z(k) = Σ_{i=1}^{N} h_{i,ST,v}^T y_i(k) = Σ_{i=1}^{N} [x_ir(k) + v_ir(k)]

where z(k) is the filtered output signal, h_{i,ST,v} denotes the optimal filter matrix of channel i, and x_ir(k) and v_ir(k) denote respectively the speech and the residual noise after filtering by the optimal filter.
Further, the mutual determination of the interval by the short-time energy method and the short-time zero-crossing rate method includes the following sub-steps:

Let the speech signal of the n-th frame be x_n(k). The short-time energy of the frame is

E_n = Σ_{k=0}^{N−1} x_n^2(k)

and the short-time zero-crossing rate is

Z_n = (1/2) Σ_{k=1}^{N−1} |sgn(x_n(k)) − sgn(x_n(k−1))|

where sgn(·) denotes the sign (step) function, k denotes the time index within the frame, and N denotes the frame length in samples.
The beneficial effects of the invention include:
the invention prevents the problem of voice interruption by utilizing mutual supplement among multi-input voice information, ensures the information integrity, and enhances the communication quality and the communication stability between the air and the machine. Compared with a comparative gating method or an equal gain combining method, the method not only keeps complete voice information, but also can effectively improve the signal-to-noise ratio.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of the prior art method for selecting and combining to perform multi-channel speech signal comparison gating;
FIG. 2 is a flow chart of steps of a method according to an embodiment of the present invention;
FIG. 3 is a diagram showing the relationship between the number of estimated points corresponding to a noise segment and the output SNR;
FIG. 4 is a graph of the relationship between a noisy speech segment Ly and the output signal-to-noise ratio;
FIG. 5 is a graph of maximum delay length (0-1000) versus output signal-to-noise ratio;
FIG. 6 is a normalized speech waveform for four input signals with speech discontinuities at 5dB for both signal-to-noise ratios;
FIG. 7 is a speech waveform of the four input signals of FIG. 6 output using the method of the present invention;
FIG. 8 is a flowchart illustrating steps of a method according to an embodiment of the present invention.
Detailed Description
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
Aiming at the problem that the voice signals of an ultrashort-wave communication system are interrupted during flight missions, for multiple reasons such as incomplete spatial coverage by the aircraft's multiple antennas, the poor diffraction capability of ultrashort waves, and electromagnetic interference within the aircraft system, and at the shortcomings of the existing solutions to this problem (selective combining, equal-gain combining), the embodiment of the invention provides a beamforming method for multi-input voice signals in an airborne environment.
The technical scheme of the embodiment of the invention is as follows: performing frame-division windowing pretreatment on an input signal; carrying out voice endpoint detection on each input signal to determine whether the input signal is a voice section; carrying out time delay estimation processing on the existing voice section, and carrying out time delay synchronization to ensure that the maximum time delay is not greater than the length of a filter; determining a noise section according to the voice section information of each input signal after time delay synchronization, performing cross-correlation matrix estimation of the voice section and the noise section, and calculating an optimal filtering matrix and an optimal weight vector according to the results of the voice section and the noise section; and finally filtering the output signal.
As shown in fig. 2, the method comprises the following steps:
preprocessing input signals by framing, windowing and the like;
performing voice activity detection on the framed signal to obtain a voice segment range of the input signal and a noise segment range of the input signal;
carrying out time delay estimation preprocessing, estimating the maximum time delay, synchronizing, adjusting the range of the corresponding voice section and noise section, judging whether the time delay among the signals after synchronization is less than the length of a filter, carrying out matrix estimation if the time delay is less than the length of the filter, and otherwise, continuing time delay synchronization;
carrying out noise matrix estimation and noisy speech matrix estimation, and carrying out optimal matrix estimation by using the noise matrix estimation and the noisy speech matrix estimation;
and estimating the optimal weight vector by using the optimal matrix to obtain an optimal filter, and outputting a synthesized signal by using the input signal through the filter.
In the specific implementation process, the method comprises the following sub-steps:
firstly, preprocessing such as framing and windowing;

Frame length: 25 ms.

Windowing: s_w(n) = s(n) w(n), where s_w(n) is the windowed signal, s(n) is the signal to be windowed, and w(n) is the window function; a Hamming window is selected for w(n):

w(n) = 0.54 − 0.46 cos(2πn/(N−1)), 0 ≤ n ≤ N−1
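As a concrete illustration of the framing and windowing above, the sketch below frames a signal and applies the Hamming window. The 25 ms frame length comes from the text; the 8 kHz sampling rate and 50% frame overlap are assumptions made here for the example.

```python
import numpy as np

def frame_and_window(signal, fs=8000, frame_ms=25, overlap=0.5):
    """Split a 1-D signal into overlapping Hamming-windowed frames."""
    frame_len = int(fs * frame_ms / 1000)      # 25 ms -> 200 samples at 8 kHz
    hop = int(frame_len * (1 - overlap))       # frame advance in samples
    n = np.arange(frame_len)
    # Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1))
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([signal[s:s + frame_len] * w for s in starts])

frames = frame_and_window(np.ones(8000))       # 1 s of dummy signal
```

Each row of `frames` is one windowed 25 ms frame, ready for the endpoint detection of the next step.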
secondly, voice endpoint detection is carried out on the voice signals;

Short-time energy: let the speech signal of the n-th frame be x_n(k); the short-time energy of the frame is

E_n = Σ_{k=0}^{N−1} x_n^2(k)

and the short-time zero-crossing rate is

Z_n = (1/2) Σ_{k=1}^{N−1} |sgn(x_n(k)) − sgn(x_n(k−1))|

An accurate endpoint detection result can be obtained by using the short-time energy method and the short-time zero-crossing rate method to confirm the interval against each other.
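The endpoint detection above can be sketched as follows. The energy and zero-crossing-rate formulas follow the text; the thresholds and the rule combining the two detectors are illustrative assumptions, not values from the patent.

```python
import numpy as np

def short_time_energy(frame):
    # E_n = sum_k x_n(k)^2
    return np.sum(frame.astype(float) ** 2)

def short_time_zcr(frame):
    # Z_n = 0.5 * sum_k |sgn(x_n(k)) - sgn(x_n(k-1))|
    s = np.sign(frame)
    return 0.5 * np.sum(np.abs(np.diff(s)))

def is_speech_frame(frame, energy_thr=0.1, zcr_thr=50):
    # The two detectors confirm each other: high energy marks voiced speech,
    # moderate energy plus high ZCR marks unvoiced speech (assumed rule).
    e, z = short_time_energy(frame), short_time_zcr(frame)
    return e > energy_thr or (e > 0.1 * energy_thr and z > zcr_thr)

t = np.arange(200) / 8000.0
speech_like = 0.5 * np.sin(2 * np.pi * 200 * t)   # strong tone stands in for speech
silence = np.zeros(200)
```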
And thirdly, performing time delay estimation synchronization. The delay among the multiple input channels must be eliminated or kept small, at least below the order of the filter; otherwise the quality of the received voice degrades (see fig. 5), the speech-segment and noise-segment positions cannot be unified for the calculation, and the computational load increases.

Delay estimation: suppose the input signal models are

y_i(k) = α_i s(k − τ_i) + v_i(k)

where i = 1, 2, …, N; s(k) represents the original clean speech signal; τ_i represents the delay of the speech signal received on each channel relative to the original clean speech signal; and v_i(k) represents the noise of the speech signal received on each channel relative to the original clean speech signal.

The cross-correlation function of two input speech signals is

r_{y1y2}(τ) = E[y_1(k) y_2(k + τ)]

which, with the noise assumed uncorrelated with the speech, expands to

r_{y1y2}(τ) = α_1 α_2 r_ss(τ − (τ_1 − τ_2))

where τ = τ_1 − τ_2 is the delay between the two signals.

If τ − (τ_1 − τ_2) = 0, then r_ss(τ − (τ_1 − τ_2)) takes its maximum value; the maximum correlation of the two signals thus gives the corresponding shift in samples λ, and from the relation between the sampling rate f_s and the sample count λ the delay τ of the two signals can be calculated:

τ = λ / f_s

After obtaining the delay estimation result, shift synchronization is carried out to obtain signals without delay difference.
Fourthly, estimating a cross-correlation matrix of the voice section and the noise section, and greatly influencing the result by the effective points of the voice section and the noise section by referring to fig. 3 and 4;
and (3) calculating a noise voice autocorrelation matrix: ryy=E[y(k)yT(k)]Wherein
Figure BDA0003544754230000101
And (3) noise autocorrelation matrix calculation: rvv=E[v(k)vT(k)]Wherein
Figure BDA0003544754230000102
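In practice the expectations above are estimated by sample averaging over the detected segments. The sketch below builds the stacked vector y(k) from the most recent L_h samples of each channel and averages the outer products; the channel count, segment length, and L_h used here are illustrative assumptions.

```python
import numpy as np

def stacked_autocorrelation(channels, Lh):
    """Sample estimate of E[y(k) y^T(k)] for N channels stacked over Lh taps.

    channels: list of N equal-length 1-D arrays (a speech or noise segment).
    Returns an (N*Lh, N*Lh) matrix.
    """
    N, K = len(channels), len(channels[0])
    R = np.zeros((N * Lh, N * Lh))
    count = 0
    for k in range(Lh - 1, K):
        # y(k) = [y_1(k)..y_1(k-Lh+1), ..., y_N(k)..y_N(k-Lh+1)]^T
        y = np.concatenate([c[k - Lh + 1:k + 1][::-1] for c in channels])
        R += np.outer(y, y)
        count += 1
    return R / count                            # sample mean approximates E[y y^T]

rng = np.random.default_rng(0)
noise = [rng.standard_normal(2000) for _ in range(2)]  # noise-only segments, 2 channels
Rvv = stacked_autocorrelation(noise, Lh=4)
```

The same routine applied to the noisy speech segments yields R_yy.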
Fifthly, estimating an optimal filter matrix;
Figure BDA0003544754230000103
wherein i represents the number of channels,
Figure BDA0003544754230000104
Figure BDA0003544754230000105
Wi,0representing the optimal filter matrix for channel i.
Figure BDA0003544754230000106
Sixthly, calculating the optimal weight vector according to the following formula:
Figure BDA0003544754230000107
Figure BDA0003544754230000108
wherein u' ═ 1,0,. 0, 0]TIs of length LhWherein h represents the optimal filter,
Figure BDA0003544754230000109
and
Figure BDA00035447542300001010
denotes h under the conditions of the optimal filter transformationy TRyyhyAnd hv TRvvhvRespectively representing the output power of the noisy speech and the noise, s.t. denotes W under constraintTRepresents the transpose of the optimal filter matrix, u' ═ 1,0]TIs of length LhA vector of (a);
solving the two optimization problems by a Lagrange multiplier method:
Ly(hy,λ)=hy TRyyhyy(WThy-u')
Lv(hvv)=hv TRvvhvv(WThv-u')
wherein L isy(hyλ) and Lv(h,λv) Respectively representing the lagrange function, lambda, of noisy speech and noise under constraintsvRepresenting lagrange multiplier vector parameters;
to Ly(hyλ) and Lv(h,λv) Derivation of h in (1) to obtain:
L'y(hy,λ)=Ryyhy+Wλy T
L'v(h,λv)=Rvvh+Wλv T
wherein L'y(hyy) And L'v(h,λv) Are respectively Ly(hyy) And Lv(hvv) A derivative of (d); l 'of'y(hyy) And L'v(hvv) Are all equal to 0, find:
Figure BDA0003544754230000111
Figure BDA0003544754230000112
bringing both into constraint WThyU' and WThvU', yielding:
Figure BDA0003544754230000113
Figure BDA0003544754230000114
re-substitution into
Figure BDA0003544754230000115
And
Figure BDA0003544754230000116
to obtain:
Figure BDA0003544754230000117
Figure BDA0003544754230000118
substituting W containing space-time information0A matrix, resulting in:
Figure BDA0003544754230000119
Figure BDA00035447542300001110
hST,yrepresenting an optimal filter, h, found for noisy speechST,vOptimal filter representing solution to noise。
Seventhly, because the speech and the noise are assumed completely uncorrelated under the algorithm conditions, when the output power of the whole filtered noisy speech is minimized, the output power of the noise is minimized at the same time. In practice this does not hold exactly, so to prevent the speech-segment information from being filtered out, h_ST,v is used here as the filter matrix, and the output signal is

z(k) = Σ_{i=1}^{N} h_{i,ST,v}^T y_i(k) = Σ_{i=1}^{N} [x_ir(k) + v_ir(k)]

where z(k) is the filtered output signal, h_{i,ST,v} denotes the optimal filter matrix of channel i, and x_ir(k) and v_ir(k) denote respectively the speech and the residual noise after filtering by the optimal filter.
As shown in fig. 3, the number of effective points in the noise segment is correlated with the output signal-to-noise ratio. In this experiment the delay, the number of effective speech-segment points, and other variables were held fixed; only the number of effective noise-segment points was varied, with two 5 dB signals as input.
As shown in fig. 4, the number of effective points in the speech segment is correlated with the output signal-to-noise ratio. In this experiment the delay, the number of effective noise-segment points, and other variables were held fixed; only the number of effective speech-segment points was varied, with two 5 dB signals as input. The horizontal axis indicates how many characters of the same spoken sentence fall within the selected range; the more characters, the larger the number of effective speech-segment points.
As shown in fig. 5, the maximum delay length between the multiple inputs is correlated with the output signal-to-noise ratio. In this experiment the number of effective speech-segment points, the number of effective noise-segment points, and other variables were held fixed; only the maximum delay between the input signals was varied, with two 5 dB signals as input. The filter order was set to 64; it can be seen that when the number of delay points exceeds the filter order, the output signal-to-noise ratio drops sharply.
As shown in fig. 6, there are four input signals with speech discontinuities and the signal-to-noise ratios are all 5 dB.
As shown in fig. 7, the four input signals in fig. 6 are speech waveforms output using this method.
Example 1
As shown in fig. 8, a method for complementing multi-input speech signal beam forming information in an airborne environment includes the following steps:
step S1, preprocessing the input signal;
step S2, voice activity detection is carried out on the preprocessed signals, and a voice section range of the input signals and a noise section range of the input signals are obtained;
step S3, estimating and synchronizing time delay, adjusting the range of the corresponding voice section and noise section, judging whether the time delay among the signals after synchronization is less than the length of the filter, if so, estimating the matrix, otherwise, continuing the time delay synchronization;
step S4, carrying out noise matrix estimation and noisy speech matrix estimation, and carrying out optimal matrix estimation by the two;
and step S5, estimating the optimal weight vector by using the optimal matrix to obtain an optimal filter, and outputting a synthesized signal by using the input signal through the optimal filter.
Example 2
Based on embodiment 1, in step S1, the preprocessing includes a framing windowing process.
Example 3
Based on embodiment 1, in step S2, the method includes the sub-steps of: and carrying out voice endpoint detection on the voice signals, and mutually determining the interval by utilizing a short-time energy method and a short-time zero-crossing rate method to obtain an accurate endpoint detection result.
Example 4
Based on embodiment 1, in step S3, the method includes the sub-steps of: at time k, the input signal models are set as

y_i(k) = α_i s(k − τ_i) + v_i(k)

where i = 1, 2, …, N; s(k) represents the original clean speech signal; τ_i represents the delay of the speech signal received on each channel relative to the original clean speech signal; and v_i(k) represents the noise of the speech signal received on each channel relative to the original clean speech signal.

The cross-correlation function of two input speech signals is

r_{y1y2}(τ) = E[y_1(k) y_2(k + τ)]

where y_1(k) represents the first received signal and y_2(k) represents the second received signal. With the noise assumed uncorrelated with the speech, this expands to

r_{y1y2}(τ) = α_1 α_2 r_ss(τ − (τ_1 − τ_2))

where τ = τ_1 − τ_2 is the delay between the two signals, α_1 represents the ratio coefficient of the first received signal to the clean original speech signal, and α_2 represents the ratio coefficient of the second received signal to the clean original speech signal.

If τ − (τ_1 − τ_2) = 0, the autocorrelation function r_ss(τ − (τ_1 − τ_2)) of s(k) takes its maximum value; the maximum correlation of the two signals is therefore obtained at the lag

λ = argmax_m r_{y1y2}(m)

where λ is the corresponding shift in samples. From the relation between the sampling rate f_s and the sample count λ, the delay τ of the two signals is calculated:

τ = λ / f_s

After obtaining the delay estimation result, shift synchronization is carried out to obtain signals without delay difference.
Example 5
Based on embodiment 1, in step S4, the method includes the sub-steps of: calculating the autocorrelation matrix R_yy of the noisy speech,

R_yy = E[y(k) y^T(k)]

where y(k) = [y_1^T(k), y_2^T(k), …, y_N^T(k)]^T and y_i(k) = [y_i(k), y_i(k−1), …, y_i(k−L_h+1)]^T stacks the most recent L_h samples of channel i; E[·] denotes the expected value.

Computing the noise autocorrelation matrix R_vv,

R_vv = E[v(k) v^T(k)]

where v(k) is stacked from the noise samples in the same way.
Example 6
Based on embodiment 5, in step S5, the optimal matrix estimation includes the sub-steps of: computing the optimal matrix W_{i,0} for each channel according to the block equation given in the original drawings, where i denotes the channel index and W_{i,0} denotes the optimal matrix for channel i; L_h denotes the filter length, I_{L_h} denotes the identity matrix of order L_h, and W_0 = [W_{1,0}^T, W_{2,0}^T, …, W_{N,0}^T]^T denotes the optimal matrix constructed by stacking the optimal matrices of all the channels. The optimal matrix so constructed is guaranteed to be of full rank, which facilitates the differentiation in the subsequent Lagrange multiplier step.
Example 7
Based on embodiment 6, in step S5, estimating the optimal weight vector by using the optimal matrix includes the following sub-steps. The optimal weight vectors are calculated from the two constrained minimization problems:
min over h_y of h_y^T R_yy h_y, s.t. W^T h_y = u'
min over h_v of h_v^T R_vv h_v, s.t. W^T h_v = u'
where u' = [1, 0, …, 0]^T is a vector of length L_h, h denotes the optimal filter, h_y^T R_yy h_y and h_v^T R_vv h_v respectively represent the output power of the noisy speech and of the noise, s.t. denotes "subject to the constraint", and W^T represents the transpose of the optimal matrix.
The two optimization problems are solved by the Lagrange multiplier method:
L_y(h_y, λ_y) = h_y^T R_yy h_y + λ_y(W^T h_y − u')
L_v(h_v, λ_v) = h_v^T R_vv h_v + λ_v(W^T h_v − u')
where L_y(h_y, λ_y) and L_v(h_v, λ_v) respectively represent the Lagrange functions of the noisy speech and of the noise under the constraints, and λ_y and λ_v represent the Lagrange multiplier vector parameters.
Differentiating L_y(h_y, λ_y) and L_v(h_v, λ_v) with respect to h gives:
L'_y(h_y, λ_y) = R_yy h_y + W λ_y^T
L'_v(h_v, λ_v) = R_vv h_v + W λ_v^T
where L'_y(h_y, λ_y) and L'_v(h_v, λ_v) are respectively the derivatives of L_y(h_y, λ_y) and L_v(h_v, λ_v). Setting both derivatives equal to 0 yields:
h_y = −R_yy^{−1} W λ_y^T
h_v = −R_vv^{−1} W λ_v^T
Bringing both into the constraints W^T h_y = u' and W^T h_v = u' yields:
λ_y^T = −(W^T R_yy^{−1} W)^{−1} u'
λ_v^T = −(W^T R_vv^{−1} W)^{−1} u'
Substituting these back into h_y = −R_yy^{−1} W λ_y^T and h_v = −R_vv^{−1} W λ_v^T yields:
h_y = R_yy^{−1} W (W^T R_yy^{−1} W)^{−1} u'
h_v = R_vv^{−1} W (W^T R_vv^{−1} W)^{−1} u'
Substituting the matrix W_0 containing the space-time information yields:
h_{ST,y} = R_yy^{−1} W_0 (W_0^T R_yy^{−1} W_0)^{−1} u'
h_{ST,v} = R_vv^{−1} W_0 (W_0^T R_vv^{−1} W_0)^{−1} u'
where h_{ST,y} represents the optimal filter found for the noisy speech and h_{ST,v} represents the optimal filter found for the noise.
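The closed-form solution h = R^{−1} W (W^T R^{−1} W)^{−1} u' can be computed directly and its constraint checked numerically. A sketch assuming R is positive definite (function name is mine; `np.linalg.solve` is used instead of forming explicit inverses):

```python
import numpy as np

def constrained_min_power_filter(R, W, u):
    """h = R^{-1} W (W^T R^{-1} W)^{-1} u minimizes the output power
    h^T R h subject to the constraint W^T h = u (Lagrange solution)."""
    RinvW = np.linalg.solve(R, W)            # R^{-1} W without an explicit inverse
    return RinvW @ np.linalg.solve(W.T @ RinvW, u)
```

By construction the returned h satisfies W^T h = u exactly (up to floating-point error), and any other vector satisfying the same constraint has output power h^T R h at least as large.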
Example 8
Based on embodiment 7, in step S5: since the speech and the noise are completely uncorrelated under the algorithm's assumptions, when the output power of the whole filtered noisy speech is minimal, the output power of the noise is minimal at the same time. In practice this does not hold exactly, so to prevent the information of the speech segments from being filtered out, the embodiment of the present invention uses h_{ST,v} as the filter matrix. Outputting the synthesized signal by passing the input signal through the optimal filter includes the sub-step:
The synthesized signal output by the optimal filter is:
z(k) = Σ_{i=1}^{N} h_{i,ST,v}^T y_i(k) = Σ_{i=1}^{N} [x_ir(k) + v_ir(k)]
where z(k) is the filtered output signal, h_{i,ST,v} represents the optimal filter matrix of channel i, and x_ir(k) and v_ir(k) are respectively the speech and the residual noise after filtering by the optimal filter.
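The filter-and-sum output can be sketched with per-channel FIR filtering. This treats each h_{i,ST,v} as a vector of FIR taps applied by convolution, which is an equivalent view of the inner product over stacked sample vectors; the function name and this framing are mine, not the patent's.

```python
import numpy as np

def filter_and_sum(filters, channels):
    """Filter each channel with its taps h_{i,ST,v} and sum the
    per-channel outputs into the synthesized signal z(k)."""
    out = np.zeros(len(channels[0]))
    for h, y in zip(filters, channels):
        out += np.convolve(y, h)[:len(y)]    # truncate to the input length
    return out
```

With identity taps on every channel the routine reduces to plain summation of the inputs, which is a convenient sanity check.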
Example 9
Based on embodiment 3, mutually determining the interval by using the short-time energy method and the short-time zero-crossing rate method comprises the following sub-steps:
Let the speech signal of the n-th frame be x_n(k). The short-time energy of the frame is
E_n = Σ_k x_n(k)^2
and the short-time zero-crossing rate is
Z_n = (1/2) Σ_k |sgn[x_n(k)] − sgn[x_n(k−1)]|
where sgn[x] = 1 for x ≥ 0 and −1 for x < 0 represents the step function, k represents the time index, and N represents the total number of frames.
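The two per-frame features above can be computed together in one pass. A minimal sketch (frame length and hop are parameters I introduce for illustration):

```python
import numpy as np

def short_time_features(x, frame_len, hop):
    """Per-frame short-time energy E_n and zero-crossing rate Z_n."""
    energy, zcr = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energy.append(float(np.sum(frame ** 2)))
        s = np.where(frame >= 0, 1.0, -1.0)              # sgn step function
        zcr.append(0.5 * float(np.sum(np.abs(np.diff(s)))))
    return np.array(energy), np.array(zcr)
```

Voiced speech frames typically show high energy and low zero-crossing rate, while unvoiced or noise frames show the opposite, which is why the two measures are used jointly to bracket the endpoints.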
Based on the foregoing disclosure, and by adapting knowledge or techniques of the related art, those skilled in the art may devise embodiments other than the above examples, and features of the various embodiments may be interchanged or substituted. Such modifications and variations, made without departing from the spirit and scope of the present invention, are intended to fall within the scope of the following claims.

Claims (9)

1. A multi-input voice signal beam forming information complementation method under an airborne environment is characterized by comprising the following steps:
step S1, preprocessing the input signal;
step S2, voice activity detection is carried out on the preprocessed signals, and a voice section range of the input signals and a noise section range of the input signals are obtained;
step S3, estimating and synchronizing time delay, adjusting the range of the corresponding voice section and noise section, judging whether the time delay among the signals after synchronization is less than the length of the filter, if so, estimating the matrix, otherwise, continuing the time delay synchronization;
step S4, carrying out noise matrix estimation and noisy speech matrix estimation, and carrying out optimal matrix estimation by the two;
and step S5, estimating the optimal weight vector by using the optimal matrix to obtain an optimal filter, and outputting a synthesized signal by using the input signal through the optimal filter.
2. The method for complementing beamforming information of multiple input voice signals according to claim 1, wherein in step S1, the pre-processing comprises framing and windowing.
3. The method for complementing beamforming information of multiple input voice signals under an airborne environment according to claim 1, wherein step S2 comprises the sub-steps of: performing voice endpoint detection on the voice signals, and jointly determining the interval by using a short-time energy method and a short-time zero-crossing rate method to obtain an accurate endpoint detection result.
4. The method for complementing beamforming information of multiple input voice signals under an airborne environment according to claim 1, wherein step S3 comprises the sub-steps of: setting the two input signal models at time k as:
y_i(k) = α_i·s(k − τ_i) + v_i(k)
wherein i = 1, 2, …, N; s(k) represents the original clean speech signal; τ_i represents the relative time delay of the voice signal received by each channel with respect to the original clean speech signal; v_i(k) represents the noise of the voice signal received by each channel relative to the original clean speech signal;
R_y1y2(τ) = E[y_1(k)y_2(k − τ)], where R_y1y2(τ) is the cross-correlation function of the two input speech signals, y_1(k) represents the first received signal and y_2(k) represents the second received signal;
R_y1y2(τ) = α_1·α_2·R_ss(τ − (τ_1 − τ_2))
where τ = τ_1 − τ_2 is the time delay of the two signals, α_1 represents the ratio coefficient of the first received signal to the clean original speech signal, and α_2 represents the ratio coefficient of the second received signal to the clean original speech signal;
if τ − (τ_1 − τ_2) = 0, the autocorrelation function R_ss(τ − (τ_1 − τ_2)) of s(k) takes its maximum value and R_y1y2(τ) takes its maximum value; the maximum correlation of the two signals is obtained, the corresponding displacement point number λ is found, and from the relation between the sampling rate f_s and the point number λ the time delay τ of the two signal segments is calculated:
τ = λ / f_s
and after the time delay estimation result is obtained, displacement synchronization is carried out to obtain signals without time delay difference.
5. The method for complementing beamforming information of multiple input voice signals under an airborne environment according to claim 1, wherein step S4 comprises the sub-steps of: calculating the autocorrelation matrix R_yy of the noisy speech:
R_yy = E[y(k)y^T(k)]
where y(k) = [y_1(k), y_2(k), …, y_N(k)]^T is the stacked vector of the channel observations and E[·] denotes the expected value;
and computing the noise autocorrelation matrix R_vv:
R_vv = E[v(k)v^T(k)]
where v(k) = [v_1(k), v_2(k), …, v_N(k)]^T.
6. The method for complementing multi-input speech signal beamforming information according to claim 5, wherein in step S5 the optimal matrix estimation comprises the sub-steps of: calculating the optimal matrix W_{i,0} for each channel, where i represents the channel index and W_{i,0} represents the optimal matrix for channel i [the per-channel definitions appear only as equation images in the source]; L_h represents the filter length, I_{L_h} represents the identity matrix of order L_h, and W_0 represents the optimal matrix formed by stacking the optimal matrices of all channels; and constructing the optimal matrix.
7. The method for complementing beamforming information of multiple input voice signals according to claim 6, wherein in step S5 estimating the optimal weight vector using the optimal matrix comprises the following sub-steps: the optimal weight vectors are calculated from the two constrained minimization problems:
min over h_y of h_y^T R_yy h_y, s.t. W^T h_y = u'
min over h_v of h_v^T R_vv h_v, s.t. W^T h_v = u'
where u' = [1, 0, …, 0]^T is a vector of length L_h, h denotes the optimal filter, h_y^T R_yy h_y and h_v^T R_vv h_v respectively represent the output power of the noisy speech and of the noise, s.t. denotes "subject to the constraint", and W^T represents the transpose of the optimal matrix;
solving the two optimization problems by the Lagrange multiplier method:
L_y(h_y, λ_y) = h_y^T R_yy h_y + λ_y(W^T h_y − u')
L_v(h_v, λ_v) = h_v^T R_vv h_v + λ_v(W^T h_v − u')
where L_y(h_y, λ_y) and L_v(h_v, λ_v) respectively represent the Lagrange functions of the noisy speech and of the noise under the constraints, and λ_y and λ_v represent the Lagrange multiplier vector parameters;
differentiating L_y(h_y, λ_y) and L_v(h_v, λ_v) with respect to h gives:
L'_y(h_y, λ_y) = R_yy h_y + W λ_y^T
L'_v(h_v, λ_v) = R_vv h_v + W λ_v^T
where L'_y(h_y, λ_y) and L'_v(h_v, λ_v) are respectively the derivatives of L_y(h_y, λ_y) and L_v(h_v, λ_v); setting both derivatives equal to 0 yields:
h_y = −R_yy^{−1} W λ_y^T
h_v = −R_vv^{−1} W λ_v^T
bringing both into the constraints W^T h_y = u' and W^T h_v = u' yields:
λ_y^T = −(W^T R_yy^{−1} W)^{−1} u'
λ_v^T = −(W^T R_vv^{−1} W)^{−1} u'
substituting these back into h_y = −R_yy^{−1} W λ_y^T and h_v = −R_vv^{−1} W λ_v^T yields:
h_y = R_yy^{−1} W (W^T R_yy^{−1} W)^{−1} u'
h_v = R_vv^{−1} W (W^T R_vv^{−1} W)^{−1} u'
substituting the matrix W_0 containing the space-time information yields:
h_{ST,y} = R_yy^{−1} W_0 (W_0^T R_yy^{−1} W_0)^{−1} u'
h_{ST,v} = R_vv^{−1} W_0 (W_0^T R_vv^{−1} W_0)^{−1} u'
where h_{ST,y} represents the optimal filter found for the noisy speech and h_{ST,v} represents the optimal filter found for the noise.
8. The method for complementing beamforming information of a multi-input speech signal under an airborne environment according to claim 7, wherein in step S5 outputting the synthesized signal by passing the input signal through the optimal filter comprises the sub-steps of:
using h_{ST,v} as the filter matrix, the synthesized signal output by the optimal filter is:
z(k) = Σ_{i=1}^{N} h_{i,ST,v}^T y_i(k) = Σ_{i=1}^{N} [x_ir(k) + v_ir(k)]
where z(k) is the filtered output signal, h_{i,ST,v} represents the optimal filter matrix of channel i, and x_ir(k) and v_ir(k) are respectively the speech and the residual noise after filtering by the optimal filter.
9. The method of complementing beamforming information for multiple input speech signals according to claim 3, wherein said mutually determining the interval by using the short-time energy method and the short-time zero-crossing rate method comprises the sub-steps of:
letting the speech signal of the n-th frame be x_n(k), the short-time energy of the frame is
E_n = Σ_k x_n(k)^2
and the short-time zero-crossing rate is
Z_n = (1/2) Σ_k |sgn[x_n(k)] − sgn[x_n(k−1)]|
where sgn[x] = 1 for x ≥ 0 and −1 for x < 0 represents the step function, k represents the time index, and N represents the total number of frames.
CN202210246203.3A 2022-03-14 2022-03-14 Multi-input voice signal beam forming information complementation method in airborne environment Active CN114613383B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210246203.3A CN114613383B (en) 2022-03-14 2022-03-14 Multi-input voice signal beam forming information complementation method in airborne environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210246203.3A CN114613383B (en) 2022-03-14 2022-03-14 Multi-input voice signal beam forming information complementation method in airborne environment

Publications (2)

Publication Number Publication Date
CN114613383A true CN114613383A (en) 2022-06-10
CN114613383B CN114613383B (en) 2023-07-18

Family

ID=81863801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210246203.3A Active CN114613383B (en) 2022-03-14 2022-03-14 Multi-input voice signal beam forming information complementation method in airborne environment

Country Status (1)

Country Link
CN (1) CN114613383B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1633121A1 (en) * 2004-09-03 2006-03-08 Harman Becker Automotive Systems GmbH Speech signal processing with combined adaptive noise reduction and adaptive echo compensation
CN102611669A (en) * 2010-12-29 2012-07-25 Zte维创通讯公司 Channel estimation filtering
US20150019213A1 (en) * 2013-07-15 2015-01-15 Rajeev Conrad Nongpiur Measuring and improving speech intelligibility in an enclosure
CN104952459A (en) * 2015-04-29 2015-09-30 大连理工大学 Distributed speech enhancement method based on distributed uniformity and MVDR (minimum variance distortionless response) beam forming
CN105223544A (en) * 2015-08-26 2016-01-06 南京信息工程大学 The constant Beamforming Method of the near field linear constraint adaptive weighted frequency of minimum variance
CN106782590A (en) * 2016-12-14 2017-05-31 南京信息工程大学 Based on microphone array Beamforming Method under reverberant ambiance
CN107610713A (en) * 2017-10-23 2018-01-19 科大讯飞股份有限公司 Echo cancel method and device based on time delay estimation
CN110045334A (en) * 2019-02-28 2019-07-23 西南电子技术研究所(中国电子科技集团公司第十研究所) Sidelobe null Beamforming Method
CN110111807A (en) * 2019-04-27 2019-08-09 南京理工大学 A kind of indoor sound source based on microphone array follows and Enhancement Method
US20190341054A1 (en) * 2018-05-07 2019-11-07 Microsoft Technology Licensing, Llc Multi-modal speech localization
CN110473564A (en) * 2019-07-10 2019-11-19 西北工业大学深圳研究院 A kind of multi-channel speech enhancement method based on depth Wave beam forming
CN111508516A (en) * 2020-03-31 2020-08-07 上海交通大学 Voice beam forming method based on channel correlation time frequency mask


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王秋菊 (Wang Qiuju): "Research on Speech Enhancement in an Airborne Noise Environment" *

Also Published As

Publication number Publication date
CN114613383B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
EP3703053B1 (en) Microphone array-based target voice acquisition method and device
CN107102296B (en) Sound source positioning system based on distributed microphone array
CN109490822B (en) Voice DOA estimation method based on ResNet
CN104936091B (en) Intelligent interactive method and system based on circular microphone array
CN107018470B (en) A kind of voice recording method and system based on annular microphone array
US9318124B2 (en) Sound signal processing device, method, and program
US9031257B2 (en) Processing signals
CN109830245A (en) A kind of more speaker's speech separating methods and system based on beam forming
CN108877827A (en) Voice-enhanced interaction method and system, storage medium and electronic equipment
CA2621940A1 (en) Method and device for binaural signal enhancement
CN102204281A (en) A system and method for producing a directional output signal
CN109725285B (en) DOA estimation method based on MVDR covariance matrix element self-adaptive phase angle conversion
CN110827846B (en) Speech noise reduction method and device adopting weighted superposition synthesis beam
CN110610718B (en) Method and device for extracting expected sound source voice signal
CN112904279A (en) Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum
CN110534126A (en) A kind of auditory localization and sound enhancement method and system based on fixed beam formation
CN110830870B (en) Earphone wearer voice activity detection system based on microphone technology
CN114613383A (en) Multi-input voice signal beam forming information complementation method under airborne environment
CN112363112A (en) Sound source positioning method and device based on linear microphone array
CN114613384B (en) Deep learning-based multi-input voice signal beam forming information complementation method
Priyanka et al. Adaptive Beamforming Using Zelinski-TSNR Multichannel Postfilter for Speech Enhancement
WO2023108864A1 (en) Regional pickup method and system for miniature microphone array device
CN116701921B (en) Multi-channel time sequence signal self-adaptive noise suppression circuit
CN113782024B (en) Method for improving accuracy of automatic voice recognition after voice awakening
Xiaohua et al. Research of the principle of cognitive sonar and beamforming simulation analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant