CN113628634B - Real-time voice separation method and device guided by directional information

- Publication number: CN113628634B (application CN202110963498.1A)
- Authority: CN (China)
- Prior art keywords: time, signal, filter, frequency, target voice
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- G10L21/0272: Voice signal separating (under G10L21/02, Speech enhancement, e.g. noise reduction or echo cancellation)
- Y02D30/70: Reducing energy consumption in wireless communication networks
Abstract
The invention discloses a real-time voice separation method and device guided by directional information, belonging to the field of information processing. The method comprises the following steps: S1: initializing a steering vector and a directional filter for the time domain signal of each microphone; S2: performing time-frequency decomposition on the initialized signals to complete the transformation from time domain signals to time-frequency domain signals; S3: performing separation filter calculation on the time-frequency domain signals to obtain filters that separate the target voice from the residual signal; S4: obtaining the time-frequency domain signal of the target voice from the obtained filters, and from it the time domain signal of the target voice. The invention constructs the initial estimate of real-time IVA from a super-directive filter and corrects the IVA optimization function, so that the separation algorithm converges quickly and extracts the target voice signal accurately.
Description
Technical Field
The invention belongs to the field of information processing, and particularly relates to a real-time voice separation method and device guided by directional information.
Background
At present, microphone array beamforming technology is widely applied in online conference systems, vehicle-mounted human-machine interaction, smart home, and similar fields. In a real environment, noise and interference from competing speakers are significant and can markedly degrade both the audibility of conference communication and the accuracy of subsequent speech recognition. Beamforming with the multiple elements of a microphone array is the most commonly used method for reducing signal noise and improving communication quality. Extracting the voice signal from a given direction in a targeted way, while significantly suppressing other noise, is therefore of great importance for improving conference communication quality, speech recognition rates, and more.
Independent vector analysis (IVA) is currently the most common speech separation/pickup technique. First, the time domain signals picked up by all array elements are converted to the time-frequency domain by a short-time Fourier transform; then an optimization function is constructed on the principle of minimum cross entropy between the separated voices, and a separation matrix is updated iteratively based on this optimization function. Once the separation matrix is estimated, a frequency domain estimate of the target signal is obtained, and finally a time domain estimate is obtained via the inverse Fourier transform. Some of the latest IVA methods add a distance constraint between the separation matrix and the target direction steering vector, so that the IVA separation result can extract the target voice in real time.
The main disadvantages of the prior art are as follows:
1) The existing directional IVA directly adds a constraint on the distance between the separation matrix and the steering vector; since the accuracy of the steering vector drops sharply under reverberation, its performance in reverberant scenes is clearly insufficient.
2) The directional IVA technique does not constrain the initial estimate, resulting in an excessively long convergence time. If the environment changes, for example when an interfering speaker is walking, the convergence rate of the IVA separation matrix cannot keep up with the rate of acoustic environment change.
In view of this, the present invention has been made.
Disclosure of Invention
The invention aims to provide a real-time voice separation method and device guided by directional information, which construct the initial estimate of real-time IVA from a super-directive filter and correct the IVA optimization function, ensuring that the separation algorithm converges quickly and extracts the target voice signal accurately.
In order to achieve the above object, the present invention provides a real-time voice separation method guided by directional information, applied to a microphone array-based system, comprising the following steps:
S1: initializing a steering vector and a directional filter for the time domain signal of each microphone;
S2: performing time-frequency decomposition on the initialized signals to complete the transformation from time domain signals to time-frequency domain signals;
S3: performing separation filter calculation on the time-frequency domain signals to obtain filters that separate the target voice from the residual signal;
S4: obtaining the time-frequency domain signal of the target voice from the obtained filters, and from it the time domain signal of the target voice.
Further, before the step S1, the method further includes: acquiring the time domain signal x_m(n) of each microphone.
In the step S1, the steering vector is computed as follows: for each frequency band k, a steering vector u(k) is calculated:
u(k) = [e^{-j ω_k d_1^T q(θ)/c}, ..., e^{-j ω_k d_M^T q(θ)/c}]^T
q(θ) = [cos(θ), sin(θ)]^T
where f_k is the frequency of the k-th band, k = 1, 2, ..., K; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; q(θ) is the direction vector; and ω_k = 2π f_k is the angular frequency of the band.
The directional filter is initialized as follows: a super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k) u(k) / (u^H(k) R^{-1}(k) u(k))
where R(k) denotes the autocorrelation (coherence) matrix of the microphones in a uniform scattered field, normalized with respect to the picked-up signal.
Further, the step S2 includes:
S201: performing a short-time Fourier transform on the time domain signal x_m(n) to obtain its time-frequency domain representation:
X_m(l, k) = Σ_{n=0}^{N-1} w(n) x_{m,l}(n) e^{-j 2π k n / N}
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; x_{m,l}(n) denotes the l-th frame of x_m(n); l is the time frame index and k is the frequency index; X_m(l, k) is the spectrum of the m-th microphone signal in the k-th frequency band of the l-th frame;
S202: for each frequency band k, constructing the frequency domain original vector X(l, k):
X(l, k) = [X_1(l, k), X_2(l, k), ..., X_M(l, k)]^T.
Further, the step S3 includes:
S301: calculating the frame-level separation guide factors r_1(l) and r_2(l), which are used to guide the target voice and the residual signal, respectively;
S302: calculating the separation steering matrices for each frequency band:
ψ_1(k) = α ψ_1(k) + (1 - α) r_1(l) X(l, k) X^H(l, k)
ψ_2(k) = α ψ_2(k) + (1 - α) r_2(l) X(l, k) X^H(l, k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with value range 0 to 1;
S303: constructing a new optimization function for the filters G_1(k) and G_2(k) that separate the target voice and the residual signal;
S304: minimizing the optimization function to obtain the optimal filter.
The process of minimizing the optimization function is to solve the following equation:
Ψ(k) G(k) = ρ(k)
where Ψ(k) and ρ(k) are determined by the optimization function. The optimal filter G(k) can then be solved as:
G(k) = Ψ^{-1}(k) ρ(k).
further, the step S4 includes:
s401: according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
s402: performing inverse Fourier transform to obtain target voice time domain estimation:
the invention also provides a real-time voice separation device guided by the pointing information, which is applied to a system based on a microphone array and comprises an initialization module, a signal decomposition module, a separation filter calculation module and a target voice estimation module:
the initialization module is used for initializing a steering vector and a directional filter for the time domain signal of each microphone;
the signal decomposition module is used for performing time-frequency decomposition on the initialized signals to finish the transformation from the time domain signals to the time-frequency domain signals;
the separation filter calculation module is used for carrying out separation filter calculation on the time-frequency domain signals to obtain a filter for separating target voice from residual signals;
the target voice estimation module is used for obtaining a time-frequency domain signal of the target voice according to the obtained filter, and further obtaining a time-domain signal of the target voice.
Further, the initialization module is also configured to acquire the time domain signal x_m(n) of each microphone.
The initialization module computes the steering vector as follows: for each frequency band k, a steering vector u(k) is calculated:
u(k) = [e^{-j ω_k d_1^T q(θ)/c}, ..., e^{-j ω_k d_M^T q(θ)/c}]^T
q(θ) = [cos(θ), sin(θ)]^T
where f_k is the frequency of the k-th band, k = 1, 2, ..., K; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; q(θ) is the direction vector; and ω_k = 2π f_k is the angular frequency of the band.
The initialization module initializes the directional filter as follows: a super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k) u(k) / (u^H(k) R^{-1}(k) u(k))
where R(k) denotes the autocorrelation (coherence) matrix of the microphones in a uniform scattered field, normalized with respect to the picked-up signal.
Further, the signal decomposition module operates as follows:
first, perform a short-time Fourier transform on the time domain signal x_m(n) to obtain its time-frequency domain representation:
X_m(l, k) = Σ_{n=0}^{N-1} w(n) x_{m,l}(n) e^{-j 2π k n / N}
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; x_{m,l}(n) denotes the l-th frame of x_m(n); l is the time frame index and k is the frequency index; X_m(l, k) is the spectrum of the m-th microphone signal in the k-th frequency band of the l-th frame;
next, for each frequency band k, construct the frequency domain original vector X(l, k):
X(l, k) = [X_1(l, k), X_2(l, k), ..., X_M(l, k)]^T.
further, the operation steps of the separation filter calculation module are as follows:
first, a frame level separation guide factor is calculated:
wherein ,r1(l) and r2 (l) Respectively used for guiding the target voice and the residual signal;
secondly, a separate steering matrix for each frequency band is calculated:
ψ 1 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
ψ 2 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
wherein ,ψ1(k) and ψ2 (k) Representing the target speech and the residual signal respectivelyIs a pilot matrix of (a); alpha is a smoothing factor, and the value range is 0 to 1;
then, a new optimization function is constructed for the filter for separating the target voice from the residual signal, and the optimization function is as follows:
wherein ,G1(k) and G2 (k) Filters for separating target speech from residual signal
And finally, minimizing an optimization function to obtain an optimal filter.
The process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
wherein ,
the optimal filter G (k) can be solved as:
G(K)=Ψ -1 (k)ρ(k)。
further, the operation steps of the voice estimation module are as follows:
firstly, according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
then, performing inverse Fourier transform to obtain target voice time domain estimation:
the real-time voice separation method and device for guiding the pointing information provided by the invention have the following beneficial effects:
1. compared with the traditional IVA, the invention calculates the guide factor by using the super-directional filter, so that the convergence is faster, and the invention can adapt to the scene of acoustic environment change.
2. The objective function designed by the invention considers the difference between signals, increases the ambiguity constraint, can obtain the analytic optimal solution, and does not need iteration, so that the robustness is stronger, and the separation effect is more stable and reliable.
Drawings
Fig. 1 is a flowchart of the real-time voice separation method guided by directional information in this embodiment.
Fig. 2 is a schematic diagram of the Hamming window function used in this embodiment.
Fig. 3 is a schematic diagram of the real-time voice separation device guided by directional information in this embodiment.
Detailed Description
In order that those skilled in the art will better understand the present invention, the present invention will be described in further detail with reference to specific embodiments.
As shown in fig. 1, an embodiment of the present invention is a real-time voice separation method guided by directional information. It can be applied to microphone array-based systems such as voice conference systems, vehicle-mounted voice communication systems, and human-machine interaction systems, and can extract the target voice signal in real time, which helps improve the communication quality of online voice conferences and the accuracy of subsequent speech recognition.
The method specifically comprises the following four implementation steps:
s1: steering vector and directional filter initialization is performed on the time domain signal of each microphone.
Before step S1, the method further includes acquiring the voice signal of each microphone, as follows: let x_m(n) denote the original time domain signals picked up in real time by the M microphone array elements, where m is the microphone index with values from 1 to M, n is the time index, and θ is the direction of the target voice relative to the microphone array.
The target voice refers to the voice signal coming from the target direction; for a voice separation task, the target direction is known in advance for the signal to be extracted. For example, for a large-screen voice communication device, one may wish to separate the voice signal in the 90-degree direction.
Specifically, the steering vector is computed as follows:
for each frequency band k, a steering vector u(k) is calculated, where a frequency band refers to the signal component corresponding to a certain frequency:
u(k) = [e^{-j ω_k d_1^T q(θ)/c}, ..., e^{-j ω_k d_M^T q(θ)/c}]^T
q(θ) = [cos(θ), sin(θ)]^T
where f_k is the frequency of the k-th band, k = 1, 2, ..., K, and the value of K is determined by the subsequent Fourier transform: if the frame length is 512, K is half the frame length; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; the superscript H denotes the conjugate transpose operator; j denotes the imaginary unit; q(θ) is the direction vector; and ω_k = 2π f_k is the angular frequency of the band.
The directional filter is initialized as follows:
a super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k) u(k) / (u^H(k) R^{-1}(k) u(k))
where R(k) denotes the autocorrelation (coherence) matrix of the microphones in a uniform scattered field, normalized with respect to the picked-up signal, and the superscript -1 denotes the matrix inverse.
These two steps complete the filter initialization, which serves the subsequent computation of spatial discrimination information.
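To make the initialization concrete, the following Python sketch computes u(k) and h(k) as described above. It is a minimal illustration rather than the patented implementation: the frequency grid, the sinc-based model of the uniform scattered field coherence R(k), and the diagonal loading used to keep R(k) invertible are assumptions added to make the example self-contained.

```python
import numpy as np

def init_steering_and_filter(mic_xy, theta, K, fs, c=340.0):
    """Compute steering vectors u(k) and super-directive filters h(k).

    mic_xy: (M, 2) two-dimensional microphone coordinates d_m (meters).
    theta:  target direction in radians.
    K:      number of frequency bands (half the frame length, e.g. 256).
    fs:     sampling rate in Hz.
    """
    M = mic_xy.shape[0]
    q = np.array([np.cos(theta), np.sin(theta)])       # direction vector q(theta)
    freqs = np.arange(1, K + 1) * fs / (2.0 * K)       # band frequencies f_k (assumed grid)
    # Pairwise microphone distances, used for the diffuse-field coherence model.
    dist = np.linalg.norm(mic_xy[:, None, :] - mic_xy[None, :, :], axis=-1)
    U = np.empty((K, M), dtype=complex)
    H = np.empty((K, M), dtype=complex)
    for k, f in enumerate(freqs):
        omega = 2.0 * np.pi * f                        # band angular frequency omega_k
        u = np.exp(-1j * omega * (mic_xy @ q) / c)     # far-field steering vector u(k)
        # Uniform scattered (diffuse) field coherence matrix R(k): sinc coherence
        # between microphone pairs; small diagonal loading keeps it invertible.
        R = np.sinc(2.0 * f * dist / c) + 1e-3 * np.eye(M)
        Rinv_u = np.linalg.solve(R, u)
        U[k] = u
        H[k] = Rinv_u / (u.conj() @ Rinv_u)            # h(k) = R^{-1}u / (u^H R^{-1} u)
    return U, H
```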
S2: Perform time-frequency decomposition on the initialized signal to complete the conversion from the time domain signal to the time-frequency domain signal.
Specifically, the method comprises the following steps:
S201: perform a short-time Fourier transform on the time domain signal x_m(n) to obtain its time-frequency domain representation:
X_m(l, k) = Σ_{n=0}^{N-1} w(n) x_{m,l}(n) e^{-j 2π k n / N}
where N is the frame length, N = 512; w(n) is a Hamming window of length 512, with n the time index within a frame, so w(n) gives the window value at each time index n; x_{m,l}(n) denotes the l-th frame of x_m(n); l is the time frame index, in frames; k is the frequency index. X_m(l, k) is the spectrum of the m-th microphone signal in the k-th frequency band of the l-th frame. The Hamming window function used in the present invention is shown in fig. 2.
S202: for each frequency band k, construct the frequency domain original vector X(l, k):
X(l, k) = [X_1(l, k), X_2(l, k), ..., X_M(l, k)]^T
where the superscript T denotes the transpose operator, so the original vector is an M-dimensional column vector.
The conversion from the time domain signal to the time-frequency domain can be completed through the steps.
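A minimal sketch of this time-frequency decomposition is given below. The frame length N = 512 and the Hamming window follow the description; the 50% frame shift and the rfft convention (keeping N/2 + 1 bands) are assumptions, since the text does not reproduce the frame shift.

```python
import numpy as np

def stft(x, N=512, hop=256):
    """Short-time Fourier transform of one microphone signal x_m(n).

    N = 512 is the frame length from the description; the hop of N/2 is
    an assumption.  Returns X[l, k]: spectrum of frame l in band k.
    """
    w = np.hamming(N)                                   # Hamming window w(n)
    n_frames = 1 + (len(x) - N) // hop
    X = np.empty((n_frames, N // 2 + 1), dtype=complex)
    for l in range(n_frames):
        X[l] = np.fft.rfft(w * x[l * hop:l * hop + N])  # X_m(l, k)
    return X

# Frequency domain original vector X(l, k) = [X_1(l,k), ..., X_M(l,k)]^T,
# stacked here as an (L, K, M) array over all M microphone signals `mics`:
# X_all = np.stack([stft(x_m) for x_m in mics], axis=-1)
```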
S3: Perform separation filter calculation on the time-frequency domain signal to obtain the filters separating the target voice from the residual signal.
Specifically, the method comprises the following steps:
S301: calculate the frame-level separation guide factors r_1(l) and r_2(l), which are used to guide the target voice and the residual signal, respectively (here |·| denotes the modulus of a complex number). The target direction is the direction the user is interested in, and the target voice is the voice signal coming from that direction; the residual signal consists of sound from other directions and environmental noise, i.e., the total signal acquired by the microphones minus the target voice signal. This step provides the prior for the steering matrix calculation in the next step.
S302: calculate the separation steering matrices for each frequency band:
ψ_1(k) = α ψ_1(k) + (1 - α) r_1(l) X(l, k) X^H(l, k)
ψ_2(k) = α ψ_2(k) + (1 - α) r_2(l) X(l, k) X^H(l, k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with value range 0 to 1. The invention preferably adopts α = 0.85; this value avoids excessive dependence on historical information while fully mining the spatial information of the signal.
This step computes the steering matrices, which are used directly in the subsequent update of the separation filter.
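The smoothing update of S302 can be sketched as follows for one band. The guide factors r_1(l) and r_2(l) are passed in as inputs because their formulas, computed from the super-directive filter output, are not reproduced in this text.

```python
import numpy as np

def update_steering_matrices(psi1, psi2, X_lk, r1_l, r2_l, alpha=0.85):
    """One smoothing step of the separation steering matrices for band k.

    psi1, psi2: (M, M) steering matrices for target voice / residual signal.
    X_lk:       (M,) frequency domain vector X(l, k) of the current frame.
    r1_l, r2_l: frame-level separation guide factors (computed elsewhere
                from the super-directive filter output).
    """
    outer = np.outer(X_lk, X_lk.conj())                 # X(l,k) X^H(l,k)
    psi1 = alpha * psi1 + (1.0 - alpha) * r1_l * outer
    psi2 = alpha * psi2 + (1.0 - alpha) * r2_l * outer
    return psi1, psi2
```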
S303: a new optimization function is constructed for the filter separating the target speech from the residual signal, the optimization function being as follows:
Here G_1(k) and G_2(k) are the filters separating the target voice and the residual signal, respectively.
The first term of the optimization function maximizes the difference between the separated signals; the second term avoids ambiguity in the filter estimation and ensures that the sum of the separation results stays as consistent as possible with the microphone signal.
S304: and minimizing the optimization function to obtain an optimal filter.
This minimization process is equivalent to solving the following equation:
Ψ(k) G(k) = ρ(k)
where Ψ(k) and ρ(k) are determined by the optimization function, and the superscript * denotes the conjugate operator.
The optimal filter G(k) can then be solved as:
G(k) = Ψ^{-1}(k) ρ(k).
After solving for G(k), the filters G_1(k) and G_2(k) that separate the target voice and the residual signal are obtained from the correspondence of the vectors.
In step S301, the super-directive filter is used to calculate the speech separation guide factors; in step S302, the separation steering matrices are calculated from the guide factors; step S303 gives the designed speech separation optimization function; and step S304 is the invented optimal separation filter calculation. It follows the minimum mean square error principle: from the equation Ψ G = ρ, the minimum mean square error solution is G = Ψ^{-1} ρ, which in turn guarantees the minimization of the defined function J.
Thus, step S3 enables the calculation of a frequency domain separation filter.
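The closed-form solve of S304 might look as follows. The exact assembly of Ψ(k) and ρ(k) from ψ_1(k), ψ_2(k) and the anti-ambiguity constraint follows the patent's definitions, which are not reproduced here; this sketch only assumes that G(k) stacks G_1(k) and G_2(k) into a 2M-dimensional vector.

```python
import numpy as np

def solve_separation_filters(Psi_k, rho_k, M):
    """Solve Psi(k) G(k) = rho(k) in closed form, without iteration.

    Psi_k: (2M, 2M) matrix and rho_k: (2M,) vector, assembled per the
    patent's optimization function (assumed stacking G = [G_1; G_2]).
    """
    G = np.linalg.solve(Psi_k, rho_k)                   # G(k) = Psi^{-1}(k) rho(k)
    return G[:M], G[M:]                                 # G_1(k), G_2(k)
```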
S4: Obtain the time-frequency domain signal of the target voice from the obtained filter, and from it the time domain signal of the target voice.
The method specifically comprises the following steps:
S401: from the solved optimal filter, further obtain the frequency domain estimate of the target voice.
S402: perform an inverse Fourier transform to obtain the time domain estimate of the target voice, and thus the target voice time domain signal.
this step achieves the acquisition of the target speech time domain signal.
Through the above steps of the invention, the initialization of the microphone array signals, the signal decomposition, the separation filter calculation, and the target voice estimation can all be realized.
As shown in fig. 3, an embodiment of the present invention is a real-time voice separation device guided by directional information, applied to a microphone array-based system, comprising an initialization module 1, a signal decomposition module 2, a separation filter calculation module 3, and a target voice estimation module 4.
The initialization module 1 is used for initializing steering vectors and directional filters for voice signals of each microphone.
The initialization module 1 can also be used to acquire the voice signal of each microphone, as follows: let x_m(n) denote the original signals picked up in real time by the M microphone array elements, where m is the microphone index with values from 1 to M, n is the time index, and θ is the direction of the target voice relative to the microphone array.
Specifically, the steering vector is computed as follows:
for each frequency band k, a steering vector u(k) is calculated, where a frequency band refers to the signal component corresponding to a certain frequency:
u(k) = [e^{-j ω_k d_1^T q(θ)/c}, ..., e^{-j ω_k d_M^T q(θ)/c}]^T
q(θ) = [cos(θ), sin(θ)]^T
where f_k is the frequency of the k-th band, k = 1, 2, ..., K; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; the superscript H denotes the conjugate transpose operator; j denotes the imaginary unit; q(θ) is the direction vector; and ω_k = 2π f_k is the angular frequency of the band.
The directional filter is initialized as follows:
a super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k) u(k) / (u^H(k) R^{-1}(k) u(k))
where R(k) denotes the autocorrelation (coherence) matrix of the microphones in a uniform scattered field, normalized with respect to the picked-up signal, and the superscript -1 denotes the matrix inverse.
These two steps complete the filter initialization, which serves the subsequent computation of spatial discrimination information.
The signal decomposition module 2 is configured to perform time-frequency decomposition on the initialized signal, and complete conversion from a time domain signal to a time-frequency domain signal.
Specifically, the signal decomposition module 2 operates as follows:
first, perform a short-time Fourier transform on the time domain signal x_m(n) to obtain its time-frequency domain representation:
X_m(l, k) = Σ_{n=0}^{N-1} w(n) x_{m,l}(n) e^{-j 2π k n / N}
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; x_{m,l}(n) denotes the l-th frame of x_m(n); l is the time frame index and k is the frequency index. X_m(l, k) is the spectrum of the m-th microphone signal in the k-th frequency band of the l-th frame. The Hamming window function is shown in fig. 2.
Next, for each frequency band k, construct the frequency domain original vector X(l, k):
X(l, k) = [X_1(l, k), X_2(l, k), ..., X_M(l, k)]^T
where the superscript T denotes the transpose operator, so the original vector is an M-dimensional column vector.
The conversion from the time domain signal to the time-frequency domain can be completed through the above steps.
The separation filter calculation module 3 is configured to perform separation filter calculation on the time-frequency domain signal, and obtain a filter for separating the target voice from the residual signal.
Specifically, the separation filter calculation module 3 operates as follows:
first, calculate the frame-level separation guide factors r_1(l) and r_2(l), which guide the target voice and the residual signal, respectively (here |·| denotes the modulus of a complex number). This operation provides the prior for the steering matrix calculation in the next step.
Next, calculate the separation steering matrices for each frequency band:
ψ_1(k) = α ψ_1(k) + (1 - α) r_1(l) X(l, k) X^H(l, k)
ψ_2(k) = α ψ_2(k) + (1 - α) r_2(l) X(l, k) X^H(l, k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with value range 0 to 1. The invention preferably adopts α = 0.85, which avoids excessive dependence on historical information while fully mining the spatial information of the signal.
This operation computes the steering matrices, which are used directly in the subsequent update of the separation filter.
Then, construct a new optimization function for the filters G_1(k) and G_2(k) that separate the target voice and the residual signal, respectively.
The first term of the optimization function maximizes the difference between the separated signals; the second term avoids ambiguity in the filter estimation and ensures that the sum of the separation results stays as consistent as possible with the microphone signal.
Finally, minimize the optimization function to obtain the optimal filter.
This minimization process is equivalent to solving the following equation:
Ψ(k) G(k) = ρ(k)
where Ψ(k) and ρ(k) are determined by the optimization function, and the superscript * denotes the conjugate operator.
The optimal filter G(k) can then be solved as:
G(k) = Ψ^{-1}(k) ρ(k).
the calculation of the frequency domain separation filter can be realized by the above operation.
The target voice estimation module 4 is configured to obtain a time-frequency domain signal of the target voice according to the obtained filter, and further obtain a time-domain signal of the target voice.
Specifically, the target voice estimation module 4 operates as follows:
first, from the solved optimal filter, further obtain the frequency domain estimate of the target voice;
then, perform an inverse Fourier transform to obtain the time domain estimate of the target voice.
the target speech estimation module 4 is capable of achieving acquisition of a target speech time domain signal.
In the above embodiment, all four modules, namely the initialization module 1, the signal decomposition module 2, the separation filter calculation module 3, and the target voice estimation module 4, are indispensable; if any module is missing, the target voice cannot be extracted.
Specific examples are set forth herein to illustrate the invention in detail; the description of the above examples is only intended to aid understanding of the core concept of the invention. It should be noted that any obvious modifications, equivalents, or other improvements made by those skilled in the art without departing from the inventive concept are intended to be included within the scope of the present invention.
Claims (4)
1. A real-time voice separation method guided by directional information, applied to a microphone array-based system, comprising the following steps:
step S1: initializing a steering vector and a directional filter for the time domain signal of each microphone;
step S2: performing time-frequency decomposition on the initialized signals to complete the transformation from time domain signals to time-frequency domain signals;
step S3: performing separation filter calculation on the time-frequency domain signals to obtain filters that separate the target voice from the residual signal;
step S4: obtaining the time-frequency domain signal of the target voice according to the obtained filters, and from it the time domain signal of the target voice;
the step S1 further includes: acquiring a time-domain signal x for each microphone m (n);
In the step S1, the method for performing the steering vector is as follows: for each frequency band k, a steering vector u (k) is calculated,
q(θ)=[cos(θ),sin(θ)]
wherein ,fk K=1, 2, for the frequency of the kth band; c is the speed of sound, c=340 m/s; d, d m Two-dimensional coordinate values for the mth microphone; q (θ) is a direction vector, ω k Is the frequency band circle frequency;
the method for initializing the directional filter is as follows: a super-steering filter h (k) is calculated for each frequency band k:
wherein R (k) represents the normalized autocorrelation coefficients of the individual microphones of the uniform scattered field with respect to the picked-up signal;
the step S2 includes:
S201: performing a short-time Fourier transform on the time domain signal x_m(n) to obtain its time-frequency domain representation:
X_m(l, k) = Σ_{n=0}^{N-1} w(n) x_{m,l}(n) e^{-j 2π k n / N}
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; x_{m,l}(n) denotes the l-th frame of x_m(n); l is the time frame index and k is the frequency index; X_m(l, k) is the spectrum of the m-th microphone signal in the k-th frequency band of the l-th frame;
S202: for each frequency band k, constructing the frequency domain original vector X(l, k):
X(l, k) = [X_1(l, k), X_2(l, k), ..., X_M(l, k)]^T;
the step S3 includes:
S301: calculating the frame-level separation guide factors r_1(l) and r_2(l), which are used to guide the target voice and the residual signal, respectively;
S302: calculating the separation steering matrices for each frequency band:
ψ_1(k) = α ψ_1(k) + (1 - α) r_1(l) X(l, k) X^H(l, k)
ψ_2(k) = α ψ_2(k) + (1 - α) r_2(l) X(l, k) X^H(l, k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with value range 0 to 1;
S303: constructing a new optimization function for the filters G_1(k) and G_2(k) that separate the target voice and the residual signal;
S304: minimizing the optimization function to obtain the optimal filter;
the process of minimizing the optimization function is to solve the following equation:
Ψ(k) G(k) = ρ(k)
where Ψ(k) and ρ(k) are determined by the optimization function;
the filter G(k) is solved as:
G(k) = Ψ^{-1}(k) ρ(k).
2. The real-time voice separation method guided by directional information according to claim 1, wherein the step S4 comprises:
S401: obtaining the frequency domain estimate of the target voice from the solved filter;
S402: performing an inverse Fourier transform to obtain the time domain estimate of the target voice.
3. A real-time voice separation device guided by directional information, applied to a microphone array-based system, characterized by comprising an initialization module, a signal decomposition module, a separation filter calculation module and a target voice estimation module:
the initialization module is used for initializing a steering vector and a directional filter for the time domain signal of each microphone;
the signal decomposition module is used for performing time-frequency decomposition on the initialized signals to complete the transformation from time domain signals to time-frequency domain signals;
the separation filter calculation module is used for performing separation filter calculation on the time-frequency domain signals to obtain filters that separate the target voice from the residual signal;
the target voice estimation module is used for obtaining the time-frequency domain signal of the target voice from the obtained filters, and from it the target voice time domain signal;
the initialization module is further configured to acquire the time domain signal x_m(n) of each microphone;
the initialization module computes the steering vector as follows: for each frequency band k, a steering vector u(k) is calculated:
u(k) = [e^{-j ω_k d_1^T q(θ)/c}, ..., e^{-j ω_k d_M^T q(θ)/c}]^T
q(θ) = [cos(θ), sin(θ)]^T
where f_k is the frequency of the k-th band, k = 1, 2, ..., K; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; q(θ) is the direction vector; and ω_k = 2π f_k is the angular frequency of the band;
the initialization module initializes the directional filter as follows: a super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k) u(k) / (u^H(k) R^{-1}(k) u(k))
where R(k) denotes the autocorrelation (coherence) matrix of the microphones in a uniform scattered field, normalized with respect to the picked-up signal;
the signal decomposition module operates as follows:
first, performing a short-time Fourier transform on the time domain signal x_m(n) to obtain its time-frequency domain representation:
X_m(l, k) = Σ_{n=0}^{N-1} w(n) x_{m,l}(n) e^{-j 2π k n / N}
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; x_{m,l}(n) denotes the l-th frame of x_m(n); l is the time frame index and k is the frequency index; X_m(l, k) is the spectrum of the m-th microphone signal in the k-th frequency band of the l-th frame;
next, for each frequency band k, constructing the frequency domain original vector X(l, k):
X(l, k) = [X_1(l, k), X_2(l, k), ..., X_M(l, k)]^T;
the separation filter calculation module operates as follows:
first, calculating the frame-level separation guide factors r_1(l) and r_2(l), which are used to guide the target voice and the residual signal, respectively;
next, calculating the separation steering matrices for each frequency band:
ψ_1(k) = α ψ_1(k) + (1 - α) r_1(l) X(l, k) X^H(l, k)
ψ_2(k) = α ψ_2(k) + (1 - α) r_2(l) X(l, k) X^H(l, k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with value range 0 to 1;
then, constructing a new optimization function for the filters G_1(k) and G_2(k) that separate the target voice and the residual signal;
finally, minimizing the optimization function to obtain the optimal filter;
the process of minimizing the optimization function is to solve the following equation:
Ψ(k) G(k) = ρ(k)
where Ψ(k) and ρ(k) are determined by the optimization function;
the optimal filter G(k) is solved as:
G(k) = Ψ^{-1}(k) ρ(k).
4. The real-time voice separation device guided by directional information according to claim 3, wherein the target voice estimation module operates as follows:
first, obtaining the frequency domain estimate of the target voice from the solved optimal filter;
then, performing an inverse Fourier transform to obtain the time domain estimate of the target voice.
Priority and publications
- Application CN202110963498.1A, filed 2021-08-20 (priority date 2021-08-20)
- CN113628634A, published 2021-11-09
- CN113628634B, granted 2023-10-03