CN113628634A - Real-time voice separation method and device guided by pointing information - Google Patents
- Publication number: CN113628634A
- Application: CN202110963498.1A
- Authority: CN (China)
- Legal status: Granted
Classifications
- G10L21/0272: Voice signal separating (G: Physics; G10: Musical instruments; acoustics; G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding; G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal in order to modify its quality or its intelligibility; G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation)
- Y02D30/70: Reducing energy consumption in wireless communication networks (Y02D: Climate change mitigation technologies in information and communication technologies)
Abstract
The invention discloses a real-time voice separation method and device guided by pointing information, belonging to the field of information processing. The method comprises the following steps: S1: initializing a steering vector and a directional filter for the time-domain signal of each microphone; S2: performing time-frequency decomposition on the initialized signal to complete the transformation from the time-domain signal to the time-frequency-domain signal; S3: performing separation-filter calculation on the time-frequency-domain signal to obtain the filters that separate the target voice from the residual signal; S4: according to the obtained filters, obtaining the time-frequency-domain signal of the target voice, and from it the time-domain signal of the target voice. The method constructs the initial estimate of a real-time IVA from a super-directive filter and modifies the optimization function of the IVA, ensuring that the separation algorithm converges rapidly and extracts the target voice signal accurately.
Description
Technical Field
The invention belongs to the field of information processing, and particularly relates to a real-time voice separation method and device guided by pointing information.
Background
At present, microphone-array beamforming technology is widely applied in fields such as online conference systems, vehicle-mounted human-computer interaction, and smart homes. In real environments there is interference such as strong noise and competing speakers, which significantly degrades the listening quality of conference communication and the accuracy of subsequent speech recognition. The most common way to reduce signal noise and improve communication quality is beamforming based on the multiple elements of a microphone array. Extracting the voice signal from a given direction while clearly suppressing other noise is therefore of great significance for improving conference communication quality, raising the speech recognition rate, and the like.
Speech separation/extraction based on Independent Vector Analysis (IVA) is currently the most commonly used technique. First, the time-domain signals picked up by all array elements are transformed to the time-frequency domain by short-time Fourier transform. Then an optimization function is constructed based on the principle of minimizing the cross-entropy of the separated voices, and a separation matrix is updated iteratively from this optimization function. Once the separation matrix has been estimated, a frequency-domain estimate of the target signal is obtained, and finally the time-domain estimate is recovered by inverse Fourier transform. In some of the latest IVA methods, the target speech is extracted in real time by adding a distance constraint between the separation matrix and the steering vector of the target direction.
The main disadvantages of the prior art are as follows:
1) The existing directional IVA imposes its constraint by directly penalizing the distance between the separation matrix and the steering vector; in reverberant scenes the accuracy of the steering vector drops greatly, so the performance in such scenes is clearly insufficient.
2) The directional IVA technique places no constraint on the initial estimate, so convergence takes too long; if the environment changes, for example when an interfering speaker moves around, the convergence of the IVA separation matrix cannot keep up with the changing acoustic environment.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
The invention aims to provide a real-time voice separation method and device guided by pointing information, which construct the initial estimate of a real-time IVA based on a super-directive filter and correct the optimization function of the IVA, ensuring that the separation algorithm converges rapidly and extracts the target voice signal accurately.
In order to achieve the above object, the present invention provides a real-time voice separation method guided by directional information, which is applied to a system based on a microphone array, and comprises the following steps:
S1: initializing a steering vector and a directional filter for the time-domain signal of each microphone;
S2: performing time-frequency decomposition on the initialized signal to complete the transformation from the time-domain signal to the time-frequency-domain signal;
S3: performing separation-filter calculation on the time-frequency-domain signal to obtain the filters that separate the target voice from the residual signal;
S4: according to the obtained filters, obtaining the time-frequency-domain signal of the target voice, and from it the time-domain signal of the target voice.
Further, step S1 is preceded by: obtaining the time-domain signal xm(n) of each microphone;
In step S1, the steering vector is computed as follows: for each frequency band k, a steering vector u(k) is calculated:
u(k) = [exp(-jωk·d1ᵀq(θ)/c), ..., exp(-jωk·dMᵀq(θ)/c)]ᵀ
q(θ) = [cos(θ), sin(θ)]
where fk is the frequency of the k-th band, k = 1, 2, ..., K; c is the sound speed, c = 340 m/s; dm is the two-dimensional coordinate of the m-th microphone; q(θ) is the direction vector; and ωk = 2πfk is the angular frequency of the band;
the directional filter is initialized as follows: for each frequency band k, a super-directive filter h(k) is calculated:
h(k) = R⁻¹(k)u(k) / (uᴴ(k)R⁻¹(k)u(k))
where R(k) is the matrix of autocorrelation coefficients between the microphones of a uniformly scattered field, normalized with respect to the picked-up signal.
Further, the step S2 includes:
S201: performing a short-time Fourier transform on the time-domain signal xm(n) to obtain its time-frequency-domain expression:
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; l is the time-frame index and k the frequency index; Xm(l,k) is the spectrum of the m-th microphone signal in the l-th frame and k-th frequency band;
S202: for each frequency band k, constructing the frequency-domain signal vector X(l,k):
X(l,k) = [X1(l,k), X2(l,k), ..., XM(l,k)]ᵀ.
further, the step S3 includes:
S301: calculating the frame-level separation guidance factors:
where r1(l) and r2(l) are used to guide the target voice and the residual signal, respectively;
s302: computing a separate steering matrix for each band:
ψ1(k) = αψ1(k) + (1-α)·r1(l)·X(l,k)Xᴴ(l,k)
ψ2(k) = αψ2(k) + (1-α)·r2(l)·X(l,k)Xᴴ(l,k)
where ψ1(k) and ψ2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with a value range of 0 to 1;
s303: constructing a new optimization function for the filter separating the target speech and the residual signal, the optimization function being as follows:
where G1(k) and G2(k) are the filters that separate the target voice and the residual signal, respectively;
S304: and minimizing the optimization function to obtain an optimal filter.
The process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k) = ρ(k)
from which the optimal filter G(k) can be solved as:
G(k) = Ψ⁻¹(k)ρ(k).
further, the step S4 includes:
s401: and according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
s402: performing inverse Fourier transform to obtain target voice time domain estimation:
the invention also provides a real-time voice separation device guided by the pointing information, which is applied to a system based on a microphone array and comprises an initialization module, a signal decomposition module, a separation filter calculation module and a target voice estimation module:
the initialization module is used for initializing a guide vector and a directional filter for the time domain signal of each microphone;
the signal decomposition module is used for performing time-frequency decomposition on the initialized signal to finish the transformation from a time domain signal to a time-frequency domain signal;
the separation filter calculation module is used for performing separation filter calculation on the time-frequency domain signals to obtain a filter for separating target voice and residual signals;
and the target voice estimation module is used for obtaining a time-frequency domain signal of the target voice according to the obtained filter so as to obtain a time-domain signal of the target voice.
Further, the initialization module is also configured to obtain the time-domain signal xm(n) of each microphone;
The initialization module computes the steering vector as follows: for each frequency band k, a steering vector u(k) is calculated:
u(k) = [exp(-jωk·d1ᵀq(θ)/c), ..., exp(-jωk·dMᵀq(θ)/c)]ᵀ
q(θ) = [cos(θ), sin(θ)]
where fk is the frequency of the k-th band, k = 1, 2, ..., K; c is the sound speed, c = 340 m/s; dm is the two-dimensional coordinate of the m-th microphone; q(θ) is the direction vector; and ωk = 2πfk is the angular frequency of the band;
The initialization module initializes the directional filter as follows: for each frequency band k, a super-directive filter h(k) is calculated:
h(k) = R⁻¹(k)u(k) / (uᴴ(k)R⁻¹(k)u(k))
where R(k) is the matrix of autocorrelation coefficients between the microphones of a uniformly scattered field, normalized with respect to the picked-up signal.
Further, the signal decomposition module comprises the following steps:
first, for a time domain signal xm(n) performing short-time Fourier transform to obtain a time-frequency domain expression:
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; l is the time-frame index and k the frequency index; Xm(l,k) is the spectrum of the m-th microphone signal in the l-th frame and k-th frequency band;
secondly, for each frequency band k, the frequency-domain signal vector X(l,k) is constructed:
X(l,k) = [X1(l,k), X2(l,k), ..., XM(l,k)]ᵀ.
further, the separation filter calculation module operates as follows:
first, a frame-level separation guidance factor is calculated:
where r1(l) and r2(l) are used to guide the target voice and the residual signal, respectively;
next, a separate steering matrix for each frequency band is calculated:
ψ1(k) = αψ1(k) + (1-α)·r1(l)·X(l,k)Xᴴ(l,k)
ψ2(k) = αψ2(k) + (1-α)·r2(l)·X(l,k)Xᴴ(l,k)
where ψ1(k) and ψ2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with a value range of 0 to 1;
Then, a new optimization function is constructed for the filter separating the target speech and the residual signal, the optimization function is as follows:
where G1(k) and G2(k) are the filters that separate the target voice and the residual signal, respectively;
And finally, minimizing the optimization function to obtain an optimal filter.
The process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k) = ρ(k)
from which the optimal filter G(k) can be solved as:
G(k) = Ψ⁻¹(k)ρ(k).
further, the operation steps of the speech estimation module are as follows:
firstly, according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
secondly, performing inverse Fourier transform to obtain a target voice time domain estimation:
the invention provides a real-time voice separation method and a device guided by pointing information, which have the following beneficial effects:
1. Compared with the traditional IVA, the invention uses the super-directive filter to calculate the guidance factors, so convergence is faster and the method can adapt to scenes where the acoustic environment changes.
2. The objective function designed by the invention not only considers the difference between the signals but also adds an ambiguity constraint; it admits an analytical optimal solution and needs no iteration, so it is more robust and the separation effect is more stable and reliable.
Drawings
Fig. 1 is a flowchart of a method for separating real-time voice guided by directional information according to the present embodiment.
Fig. 2 is a diagram of a hamming window function used in this embodiment.
Fig. 3 is a schematic diagram of a real-time voice separating apparatus guided by directional information according to this embodiment.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments, so that those skilled in the art can better understand the solution of the present invention.
As shown in fig. 1, an embodiment of the present invention is a real-time voice separation method guided by pointing information. It can be applied to systems based on a microphone array, such as voice conference systems, vehicle-mounted voice communication systems, and human-computer interaction systems, and can extract the target voice signal in real time, thereby helping to improve the communication quality of online voice conferences and the accuracy of subsequent speech recognition.
The method specifically comprises the following four implementation steps:
s1: and initializing a guide vector and a directional filter for the time domain signal of each microphone.
Before step S1, the method further includes obtaining the voice signal of each microphone, as follows: let xm(n) denote the original time-domain signals picked up in real time by the M microphone elements, where m is the microphone index, taking values from 1 to M, and n is the time index; the direction of the target voice relative to the microphone array is θ.
The target voice is the voice signal coming from the target direction. For a voice separation task, the target direction of the signal to be extracted is known in advance; for example, a large-screen voice communication device may be expected to separate the voice signal from the 90-degree direction.
Specifically, the method of performing the steering vector is as follows:
for each frequency band k, a steering vector u (k) is calculated, wherein a frequency band refers to a signal component corresponding to a certain frequency:
u(k) = [exp(-jωk·d1ᵀq(θ)/c), ..., exp(-jωk·dMᵀq(θ)/c)]ᵀ
q(θ) = [cos(θ), sin(θ)]
where fk is the frequency of the k-th band, k = 1, 2, ..., K, and K is determined by the subsequent Fourier transform (for a frame length of 512, K is half the frame length); c is the sound speed, c = 340 m/s; dm is the two-dimensional coordinate of the m-th microphone; the superscript H denotes the conjugate-transpose operator; j is the imaginary unit; q(θ) is the direction vector; and ωk = 2πfk is the angular frequency of the band.
The method for initializing the directional filter is as follows:
calculating a super-directional filter h (k) for each frequency band k:
h(k) = R⁻¹(k)u(k) / (uᴴ(k)R⁻¹(k)u(k))
where R(k) is the matrix of autocorrelation coefficients between the microphones of a uniformly scattered field, normalized with respect to the picked-up signal, and the superscript -1 denotes the matrix inverse.
The initialization of the filter is completed through the two steps to calculate the subsequent spatial distinguishability information.
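The two initialization steps above can be sketched in code as follows. This is a minimal illustration, not the patent's implementation: the function name, the sampling-rate and FFT-length defaults, and the diagonal-loading term are assumptions, and R(k) is built from the standard sinc coherence model of a uniformly scattered (diffuse) field.

```python
import numpy as np

def init_superdirective(mic_pos, theta, fs=16000, n_fft=512, c=340.0):
    """Compute steering vectors u(k) and super-directive filters h(k).

    mic_pos: (M, 2) two-dimensional microphone coordinates d_m, in metres.
    theta:   target direction in radians.
    Returns u and h, each of shape (K, M) with K = n_fft // 2 + 1 bands.
    """
    M = mic_pos.shape[0]
    K = n_fft // 2 + 1
    freqs = np.arange(K) * fs / n_fft                 # f_k of each band
    q = np.array([np.cos(theta), np.sin(theta)])      # direction vector q(theta)
    delays = mic_pos @ q / c                          # propagation delay per mic
    # Steering vector: u_m(k) = exp(-j * omega_k * d_m^T q(theta) / c)
    u = np.exp(-2j * np.pi * freqs[:, None] * delays[None, :])

    # Pairwise mic distances for the diffuse-field coherence model
    dist = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)
    h = np.empty_like(u)
    for k in range(K):
        # R(k): sinc coherence of a uniformly scattered (diffuse) field,
        # with a small diagonal loading so the matrix stays invertible
        R = np.sinc(2.0 * freqs[k] * dist / c) + 1e-3 * np.eye(M)
        Ru = np.linalg.solve(R, u[k])
        h[k] = Ru / (u[k].conj() @ Ru)                # h = R^-1 u / (u^H R^-1 u)
    return u, h
```

For a two-microphone array with 5 cm spacing and a target at θ = 90°, this returns 257 pairs u(k), h(k), with h(k) distortionless toward the target, i.e. hᴴ(k)u(k) = 1.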
S2: and performing time-frequency decomposition on the initialized signal to finish the transformation from the time domain signal to the time-frequency domain signal.
Specifically, the method comprises the following steps:
S201: perform a short-time Fourier transform on the time-domain signal xm(n) to obtain its time-frequency-domain expression:
where N is the frame length, N = 512; w(n) is a Hamming window of length 512, with n the sample index, so that w(n) gives the window value at each index n; l is the time-frame index, in units of frames; k is the frequency index; Xm(l,k) is the spectrum of the m-th microphone signal in the l-th frame and k-th frequency band. The Hamming window function used in the present invention is shown in fig. 2.
S202: for each frequency band k, a frequency domain original vector X (l, k) is constructed:
X(l,k)=[X1(l,k),X2(l,k),...,XM(l,k)]T
where the superscript T denotes the transpose operator; the resulting vector is an M-dimensional column vector.
The transformation from the time domain signal to the time-frequency domain can be completed through the steps.
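A minimal sketch of this decomposition for one channel follows; the patent fixes only the frame length N = 512 and the Hamming window, so the 50% frame shift and the function name here are assumptions.

```python
import numpy as np

def stft_frames(x, n_fft=512, hop=256):
    """Short-time Fourier transform of one microphone channel x_m(n).

    Returns X[l, k]: the spectrum of the l-th frame in the k-th band,
    for K = n_fft // 2 + 1 bands (real input, one-sided spectrum).
    """
    w = np.hamming(n_fft)                     # w(n): length-512 Hamming window
    n_frames = 1 + (len(x) - n_fft) // hop
    X = np.empty((n_frames, n_fft // 2 + 1), dtype=complex)
    for l in range(n_frames):
        X[l] = np.fft.rfft(x[l * hop: l * hop + n_fft] * w)
    return X
```

Stacking the M per-channel spectra at a fixed (l, k) then yields the vector X(l,k) = [X1(l,k), ..., XM(l,k)]ᵀ used in the following steps.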
S3: and performing separation filter calculation on the time-frequency domain signal to obtain a filter for separating the target voice and the residual signal.
Specifically, the method comprises the following steps:
s301: calculating a frame-level separation guidance factor:
where | · | denotes the modulus of a complex number, and r1(l) and r2(l) are used to guide the target voice and the residual signal, respectively. The target direction is the direction the user is interested in, and the target voice is the voice signal from that direction; the residual signal is the sound and environmental noise from all directions other than the target voice, i.e., the total signal collected by the microphones minus the target voice signal. This step provides the prior used to calculate the steering matrices in the next step.
S302: computing a separate steering matrix for each band:
ψ1(k) = αψ1(k) + (1-α)·r1(l)·X(l,k)Xᴴ(l,k)
ψ2(k) = αψ2(k) + (1-α)·r2(l)·X(l,k)Xᴴ(l,k)
where ψ1(k) and ψ2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with a value range of 0 to 1. The preferred value in the invention is α = 0.85, which avoids excessive dependence on historical information while fully exploiting the spatial information of the signal.
This step yields the steering matrices used directly in the subsequent separation-filter update.
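One frame's recursive update of the two steering matrices for a single band can be sketched as below (illustrative names; the second recursion smooths ψ2(k) with r2(l), the residual-signal guidance factor):

```python
import numpy as np

def update_steering(psi1, psi2, X_lk, r1_l, r2_l, alpha=0.85):
    """One frame's update of the steering matrices for band k.

    psi1, psi2: (M, M) current steering matrices of target / residual.
    X_lk:       (M,)   frequency-domain vector X(l, k).
    r1_l, r2_l: frame-level guidance factors r1(l), r2(l).
    alpha:      smoothing factor in (0, 1); 0.85 is the preferred value.
    """
    outer = np.outer(X_lk, X_lk.conj())       # X(l,k) X^H(l,k)
    psi1 = alpha * psi1 + (1 - alpha) * r1_l * outer
    psi2 = alpha * psi2 + (1 - alpha) * r2_l * outer
    return psi1, psi2
```

Because X(l,k)Xᴴ(l,k) is Hermitian and the guidance factors are real, both matrices stay Hermitian across updates.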
S303: constructing a new optimization function for the filter separating the target speech and the residual signal, the optimization function being as follows:
where G1(k) and G2(k) are the filters that separate the target voice and the residual signal, respectively.
The first term of the optimization function maximizes the difference between the separated signals, while the second term avoids ambiguity in the filter estimation, ensuring that the sum of the separation results stays as consistent as possible with the microphone signal.
S304: and minimizing the optimization function to obtain an optimal filter.
This minimization process is equivalent to solving the following equation:
Ψ(k)G(k)=ρ(k)
where the superscript * denotes the conjugate operator.
The optimal filter G(k) can be solved as:
G(k) = Ψ⁻¹(k)ρ(k).
After solving for G(k), the filters G1(k) and G2(k) that separate the target voice and the residual signal are obtained from the correspondence of the vector elements.
In step S301, the super-directive filter is used to calculate the voice separation guidance factors; in step S302, the separation steering matrices are calculated from the guidance factors; step S303 gives the designed voice separation optimization function; and in step S304, the adopted formula is the invented optimal-separation-filter calculation method, which follows the minimum mean-square-error principle: from Ψ(k)G(k) = ρ(k), the minimum mean-square-error solution is G(k) = Ψ⁻¹(k)ρ(k), which in turn guarantees minimization of the defined objective J.
Therefore, step S3 enables the calculation of the frequency domain separation filter.
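Since the optimum is closed-form, each band reduces to a single linear solve. The assembly of Ψ(k) and ρ(k) from ψ1(k) and ψ2(k) follows the patent's expressions (not reproduced in this text), so this sketch simply takes Ψ(k) and ρ(k) as given; the function name is illustrative.

```python
import numpy as np

def optimal_filter(Psi_k, rho_k):
    """Solve Psi(k) G(k) = rho(k) for one band.

    No iteration is needed: the stacked filter vector G(k) is obtained in
    closed form, G(k) = Psi^-1(k) rho(k), and is afterwards split back
    into G1(k) and G2(k) by the correspondence of its elements.
    """
    return np.linalg.solve(Psi_k, rho_k)
```

A well-conditioned Hermitian Ψ(k) makes the solve numerically stable; in practice a small diagonal loading can guard against singular bands.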
S4: and according to the obtained filter, obtaining a time-frequency domain signal of the target voice, and further obtaining a time-domain signal of the target voice.
The method specifically comprises the following steps:
s401: and according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
s402: performing inverse Fourier transform to obtain target voice time domain estimation, and further obtaining a target voice time domain signal:
the step realizes the acquisition of the target voice time domain signal.
Through the above steps of the invention, the initialization of the microphone-array signals, signal decomposition, separation-filter calculation, and target voice estimation can all be realized.
As shown in fig. 3, an embodiment of the present invention is a directional information guided real-time speech separation apparatus applied to a microphone array based system, which includes an initialization module 1, a signal decomposition module 2, a separation filter calculation module 3, and a target speech estimation module 4.
The initialization module 1 is used for initializing a steering vector and a directional filter for the voice signal of each microphone.
The initialization module 1 can also be used to obtain the voice signal of each microphone, as follows: let xm(n) denote the original signals picked up in real time by the M microphone elements, where m is the microphone index, taking values from 1 to M, and n is the time index; the direction of the target voice relative to the microphone array is θ.
Specifically, the method of performing the steering vector is as follows:
for each frequency band k, a steering vector u (k) is calculated, wherein a frequency band refers to a signal component corresponding to a certain frequency:
u(k) = [exp(-jωk·d1ᵀq(θ)/c), ..., exp(-jωk·dMᵀq(θ)/c)]ᵀ
q(θ) = [cos(θ), sin(θ)]
where fk is the frequency of the k-th band, k = 1, 2, ..., K; c is the sound speed, c = 340 m/s; dm is the two-dimensional coordinate of the m-th microphone; the superscript H denotes the conjugate-transpose operator; j is the imaginary unit; q(θ) is the direction vector; and ωk = 2πfk is the angular frequency of the band.
The method for initializing the directional filter is as follows:
calculating a super-directional filter h (k) for each frequency band k:
h(k) = R⁻¹(k)u(k) / (uᴴ(k)R⁻¹(k)u(k))
where R(k) is the matrix of autocorrelation coefficients between the microphones of a uniformly scattered field, normalized with respect to the picked-up signal, and the superscript -1 denotes the matrix inverse.
The initialization of the filter is completed through the two steps to calculate the subsequent spatial distinguishability information.
The signal decomposition module 2 is configured to perform time-frequency decomposition on the initialized signal, and complete the transformation from the time domain signal to the time-frequency domain signal.
Specifically, the signal decomposition module 2 operates as follows:
First, a short-time Fourier transform is performed on the time-domain signal xm(n) to obtain its time-frequency-domain expression:
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; l is the time-frame index and k the frequency index; Xm(l,k) is the spectrum of the m-th microphone signal in the l-th frame and k-th frequency band. The Hamming window function is shown in fig. 2.
Secondly, for each frequency band k, a frequency domain original vector X (l, k) is constructed:
X(l,k)=[X1(l,k),X2(l,k),...,XM(l,k)]T
where the superscript T denotes the transpose operator; the resulting vector is an M-dimensional column vector.
The transformation from the time domain signal to the time-frequency domain can be completed through the steps.
The separation filter calculation module 3 is configured to perform separation filter calculation on the time-frequency domain signal, and obtain a filter for separating the target speech and the residual signal.
Specifically, the operation steps of the separation filter calculation module 3 are as follows:
first, a frame-level separation guidance factor is calculated:
where | · | denotes the modulus of a complex number; r1(l) and r2(l) guide the target voice and the residual signal, respectively. This operation provides the prior for the next step of calculating the steering matrices.
Next, a separate steering matrix for each frequency band is calculated:
ψ1(k) = αψ1(k) + (1-α)·r1(l)·X(l,k)Xᴴ(l,k)
ψ2(k) = αψ2(k) + (1-α)·r2(l)·X(l,k)Xᴴ(l,k)
where ψ1(k) and ψ2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with a value range of 0 to 1. The preferred value in the invention is α = 0.85, which avoids excessive dependence on historical information while fully exploiting the spatial information of the signal.
This operation leads to the calculation of a matrix directly for the updating of the subsequent separation filter.
Then, a new optimization function is constructed for the filter separating the target speech and the residual signal, the optimization function is as follows:
where G1(k) and G2(k) are the filters that separate the target voice and the residual signal, respectively.
The first term of the optimization function maximizes the difference between the separated signals, while the second term avoids ambiguity in the filter estimation, ensuring that the sum of the separation results stays as consistent as possible with the microphone signal.
And finally, minimizing the optimization function to obtain an optimal filter.
This minimization process is equivalent to solving the following equation:
Ψ(k)G(k)=ρ(k)
where the superscript * denotes the conjugate operator.
The optimal filter G(k) can be solved as:
G(k) = Ψ⁻¹(k)ρ(k)
the calculation of the frequency domain separation filter can be realized by the above operation.
And the target voice estimation module 4 is configured to obtain a time-frequency domain signal of the target voice according to the obtained filter, and further obtain a time-domain signal of the target voice.
Specifically, the target speech estimation module 4 operates as follows:
firstly, according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
secondly, performing inverse Fourier transform to obtain a target voice time domain estimation:
the target speech estimation module 4 can achieve the acquisition of the target speech time domain signal.
In the above embodiment, none of the 4 modules, namely the initialization module 1, the signal decomposition module 2, the separation filter calculation module 3, and the target voice estimation module 4, can be omitted; the absence of any one module would make it impossible to extract the target voice.
Specific examples are used herein to explain the inventive concept in detail; they are given only to aid understanding of the core idea of the invention. It should be understood that any obvious modifications, equivalents, and other improvements made by those skilled in the art without departing from the spirit of the present invention fall within the scope of the present invention.
Claims (10)
1. A real-time voice separation method guided by directional information is applied to a system based on a microphone array, and is characterized by comprising the following steps:
S1: initializing a steering vector and a directional filter for the time-domain signal of each microphone;
S2: performing time-frequency decomposition on the initialized signal to complete the transformation from the time-domain signal to the time-frequency-domain signal;
S3: performing separation-filter calculation on the time-frequency-domain signal to obtain the filters that separate the target voice from the residual signal;
S4: according to the obtained filters, obtaining the time-frequency-domain signal of the target voice, and from it the time-domain signal of the target voice.
2. The real-time voice separation method guided by pointing information according to claim 1, wherein step S1 is preceded by: obtaining the time-domain signal xm(n) of each microphone;
In step S1, the method for performing the steering vector is as follows: for each frequency band k, a steering vector u (k) is calculated,
q(θ)=[cos(θ),sin(θ)]
wherein ,fkK is the frequency of the kth band, K being 1,2,. K; c is sound speed, and c is 340 m/s; dmIs the two-dimensional coordinate value of the mth microphone; q (theta) is a direction vector, omegakIs the frequency band circle frequency;
The directional filter is initialized as follows: a super-directive filter h(k) is calculated for each frequency band k:
where r(k) represents the autocorrelation coefficients, normalized with respect to the picked-up signal, of the microphone signals in a uniform diffuse (scattered) field.
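The super-directive filter can be sketched with the widely used closed form h(k) = Γ⁻¹(k)u(k) / (uᴴ(k)Γ⁻¹(k)u(k)), where Γ(k) is the coherence matrix of a uniform diffuse field; this closed form, and the diagonal loading added for numerical stability, are assumptions here, as the claim's own formula image is absent:

```python
import numpy as np

def superdirective_filter(u, mic_pos, f_k, c=340.0, loading=1e-3):
    """Sketch of a super-directive filter h(k) for steering vector u.
    Assumes h = Γ⁻¹u / (uᴴ Γ⁻¹ u) with a diffuse-field coherence Γ(k)."""
    M = len(u)
    # pairwise microphone distances |d_i - d_j|
    dist = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)
    gamma = np.sinc(2.0 * f_k * dist / c)          # uniform diffuse-field coherence
    gamma = gamma + loading * np.eye(M)            # diagonal loading (stabilizer)
    g_inv_u = np.linalg.solve(gamma, u)            # Γ⁻¹(k) u(k), no explicit inverse
    return g_inv_u / (np.conj(u) @ g_inv_u)        # distortionless normalization

mics = np.stack([np.arange(4) * 0.05, np.zeros(4)], axis=1)
u = np.ones(4, dtype=complex)                      # broadside steering vector
h = superdirective_filter(u, mics, 1000.0)
```

The normalization makes the filter distortionless toward the look direction, i.e. uᴴ(k)h(k) = 1.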
3. The direction information guided real-time speech separation method according to claim 2, wherein said step S2 comprises:
S201: performing a short-time Fourier transform on the time-domain signal x_m(n) to obtain its time-frequency-domain expression:
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; l is the time-frame index and k is the frequency index; X_m(l, k) is the spectrum of the m-th microphone signal in the l-th frame and the k-th frequency band;
S202: for each frequency band k, a frequency-domain signal vector X(l, k) is constructed:
X(l, k) = [X_1(l, k), X_2(l, k), ..., X_M(l, k)]^T.
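A minimal sketch of step S2, assuming a hop size of N/2 (the claims fix only the frame length N = 512 and the Hamming window):

```python
import numpy as np

def stft_vectors(x, n_fft=512):
    """S2 sketch: STFT of multichannel audio with a length-512 Hamming window.
    x: (M, n_samples) time-domain signals x_m(n).
    Returns X of shape (L, K, M): X[l, k] is the vector X(l, k)."""
    w = np.hamming(n_fft)
    hop = n_fft // 2                               # 50% overlap (assumption)
    n_frames = 1 + (x.shape[1] - n_fft) // hop
    K = n_fft // 2 + 1                             # one-sided spectrum bins
    X = np.empty((n_frames, K, x.shape[0]), dtype=complex)
    for l in range(n_frames):
        seg = x[:, l * hop: l * hop + n_fft] * w   # windowed frame, per channel
        X[l] = np.fft.rfft(seg, axis=1).T          # stack microphones in last axis
    return X

x = np.random.randn(4, 16000)                      # 4 microphones, 1 s at 16 kHz
X = stft_vectors(x)                                # shape (61, 257, 4)
```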
4. the direction information guided real-time speech separation method according to claim 3, wherein said step S3 comprises:
S301: calculating a frame-level separation guidance factor:
where r_1(l) and r_2(l) are used to guide the target speech and the residual signal, respectively;
S302: calculating a separation steering matrix for each frequency band:
ψ_1(k) = αψ_1(k) + (1 - α) r_1(l) X(l, k)X^H(l, k)
ψ_2(k) = αψ_2(k) + (1 - α) r_2(l) X(l, k)X^H(l, k)
where ψ_1(k) and ψ_2(k) represent the steering matrices of the target speech and the residual signal, respectively; α is a smoothing factor with a value range of 0 to 1;
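The recursive update of S302 can be sketched as below; the ψ₂ recursion uses r₂, consistent with ψ₂(k) being the residual-signal steering matrix, and the value of α is illustrative:

```python
import numpy as np

def update_steering_matrices(psi1, psi2, X_lk, r1_l, r2_l, alpha=0.95):
    """S302 sketch: first-order recursive smoothing of the two separation
    steering matrices for one band k and one frame l.
    X_lk: (M,) frequency-domain vector X(l, k)."""
    outer = np.outer(X_lk, np.conj(X_lk))          # rank-1 term X(l,k) X^H(l,k)
    psi1 = alpha * psi1 + (1.0 - alpha) * r1_l * outer
    psi2 = alpha * psi2 + (1.0 - alpha) * r2_l * outer
    return psi1, psi2

M = 4
psi1 = np.eye(M, dtype=complex)
psi2 = np.eye(M, dtype=complex)
X_lk = np.ones(M, dtype=complex)
# a frame guided entirely toward the target (r1 = 1) contributes nothing to ψ2
psi1, psi2 = update_steering_matrices(psi1, psi2, X_lk, r1_l=1.0, r2_l=0.0)
```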
S303: constructing a new optimization function for the filters separating the target speech and the residual signal, as follows:
where G_1(k) and G_2(k) are the filters for separating the target speech and the residual signal, respectively;
S304: minimizing an optimization function to obtain an optimal filter;
the process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
where Ψ(k) and ρ(k) follow from the optimization function above;
the filter G(k) can then be solved as:
G(k) = Ψ⁻¹(k) ρ(k).
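A sketch of S304: the optimal filter satisfies Ψ(k)G(k) = ρ(k), and solving the linear system directly is numerically preferable to forming Ψ⁻¹(k); the toy Ψ and ρ below are arbitrary illustrative values:

```python
import numpy as np

def solve_separation_filter(Psi_k, rho_k):
    """S304 sketch: solve Ψ(k) G(k) = ρ(k) for G(k).
    np.linalg.solve avoids computing the explicit inverse Ψ⁻¹(k)."""
    return np.linalg.solve(Psi_k, rho_k)

Psi = np.array([[2.0, 0.0],
                [0.0, 4.0]], dtype=complex)        # illustrative 2x2 system
rho = np.array([1.0, 2.0], dtype=complex)
G = solve_separation_filter(Psi, rho)              # -> [0.5, 0.5]
```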
5. the direction information guided real-time speech separation method according to claim 4, wherein said step S4 comprises:
S401: according to the filter obtained by solving, further obtaining the frequency-domain estimate of the target voice:
S402: performing an inverse Fourier transform to obtain the time-domain estimate of the target voice:
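Step S402 can be sketched as an inverse FFT per frame followed by overlap-add; the hop size N/2 and the omission of synthesis-window normalization are simplifying assumptions:

```python
import numpy as np

def istft_overlap_add(Y, n_fft=512):
    """S402 sketch: per-frame inverse FFT plus overlap-add resynthesis.
    Y: (L, K) target-speech spectra Y(l, k), K = n_fft // 2 + 1."""
    hop = n_fft // 2
    L = Y.shape[0]
    y = np.zeros((L - 1) * hop + n_fft)            # output time-domain buffer
    for l in range(L):
        y[l * hop: l * hop + n_fft] += np.fft.irfft(Y[l], n=n_fft)
    return y

# DC-only spectra reconstruct to constant frames of value 1.0
Y = np.zeros((4, 257), dtype=complex)
Y[:, 0] = 512.0
y = istft_overlap_add(Y)
```

With 50% overlap, samples covered by two frames sum to twice the frame value, which is why practical implementations normalize by the summed squared windows.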
6. the real-time voice separation device guided by the directional information is applied to a system based on a microphone array, and is characterized by comprising an initialization module, a signal decomposition module, a separation filter calculation module and a target voice estimation module:
the initialization module is used for initializing a steering vector and a directional filter for the time-domain signal of each microphone;
the signal decomposition module is used for performing time-frequency decomposition on the initialized signal to finish the transformation from a time domain signal to a time-frequency domain signal;
the separation filter calculation module is used for performing separation filter calculation on the time-frequency domain signals to obtain a filter for separating target voice and residual signals;
the target voice estimation module is used for obtaining the time-frequency-domain signal of the target voice according to the obtained filter, and further obtaining its time-domain signal.
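For illustration only, the four-module device of claim 6 can be sketched as a sequential pipeline; the class name, interfaces, and toy stand-in modules below are hypothetical, not the patent's implementation:

```python
class RealTimeSeparator:
    """Composes the four claimed modules and runs them in S1 -> S4 order."""

    def __init__(self, init_mod, decomp_mod, filter_mod, estimate_mod):
        # initialization, signal decomposition, separation filter
        # calculation, and target voice estimation modules
        self.modules = [init_mod, decomp_mod, filter_mod, estimate_mod]

    def run(self, signal):
        for module in self.modules:                # each stage feeds the next
            signal = module(signal)
        return signal

# toy stand-ins: each lambda represents one processing stage
sep = RealTimeSeparator(lambda x: x + 1, lambda x: x * 2,
                        lambda x: x - 1, lambda x: x * 10)
out = sep.run(1)                                   # ((1 + 1) * 2 - 1) * 10 = 30
```

The composition mirrors the claim's point that every module is required: removing any stage changes the pipeline's output.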
7. The direction-information guided real-time speech separation apparatus of claim 6, wherein the initialization module is further configured to obtain a time-domain signal x_m(n) for each microphone;
The signal decomposition module calculates the steering vector as follows: for each frequency band k, a steering vector u(k) is calculated,
q(θ)=[cos(θ),sin(θ)]
where f_k is the frequency of the k-th frequency band, k = 1, 2, ..., K; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; q(θ) is the direction vector; and ω_k is the angular frequency of the band;
the signal decomposition module initializes the directional filter as follows: a super-directive filter h(k) is calculated for each frequency band k:
where r(k) represents the autocorrelation coefficients, normalized with respect to the picked-up signal, of the microphone signals in a uniform diffuse (scattered) field.
8. The direction-information guided real-time speech separation apparatus of claim 7, wherein the signal decomposition module operates as follows:
first, a short-time Fourier transform is performed on the time-domain signal x_m(n) to obtain its time-frequency-domain expression:
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; l is the time-frame index and k is the frequency index; X_m(l, k) is the spectrum of the m-th microphone signal in the l-th frame and the k-th frequency band;
next, for each frequency band k, a frequency-domain signal vector X(l, k) is constructed:
X(l, k) = [X_1(l, k), X_2(l, k), ..., X_M(l, k)]^T.
9. the direction-information guided real-time speech separation apparatus of claim 8, wherein said separation filter computation module operates as follows:
first, a frame-level separation guidance factor is calculated:
where r_1(l) and r_2(l) are used to guide the target speech and the residual signal, respectively;
next, the separation steering matrix for each frequency band is calculated:
ψ_1(k) = αψ_1(k) + (1 - α) r_1(l) X(l, k)X^H(l, k)
ψ_2(k) = αψ_2(k) + (1 - α) r_2(l) X(l, k)X^H(l, k)
where ψ_1(k) and ψ_2(k) represent the steering matrices of the target speech and the residual signal, respectively; α is a smoothing factor with a value range of 0 to 1;
then, a new optimization function is constructed for the filters separating the target speech and the residual signal, as follows:
where G_1(k) and G_2(k) are the filters for separating the target speech and the residual signal, respectively;
Finally, minimizing an optimization function to obtain an optimal filter;
the process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
where Ψ(k) and ρ(k) follow from the optimization function above;
the optimal filter G(k) can be solved as:
G(k) = Ψ⁻¹(k) ρ(k).
10. The direction information guided real-time speech separation apparatus of claim 9, wherein the target voice estimation module operates as follows:
first, according to the optimal filter obtained by solving, the frequency-domain estimate of the target voice is obtained:
then, an inverse Fourier transform is performed to obtain the time-domain estimate of the target voice:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110963498.1A CN113628634B (en) | 2021-08-20 | 2021-08-20 | Real-time voice separation method and device guided by directional information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113628634A true CN113628634A (en) | 2021-11-09 |
CN113628634B CN113628634B (en) | 2023-10-03 |
Family
ID=78386993
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110963498.1A Active CN113628634B (en) | 2021-08-20 | 2021-08-20 | Real-time voice separation method and device guided by directional information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113628634B (en) |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110231185A1 (en) * | 2008-06-09 | 2011-09-22 | Kleffner Matthew D | Method and apparatus for blind signal recovery in noisy, reverberant environments |
US20140328487A1 (en) * | 2013-05-02 | 2014-11-06 | Sony Corporation | Sound signal processing apparatus, sound signal processing method, and program |
US20150163577A1 (en) * | 2012-12-04 | 2015-06-11 | Northwestern Polytechnical University | Low noise differential microphone arrays |
CN104866866A (en) * | 2015-05-08 | 2015-08-26 | 太原理工大学 | Improved natural gradient variable step-size blind source separation algorithm |
WO2015196729A1 (en) * | 2014-06-27 | 2015-12-30 | 中兴通讯股份有限公司 | Microphone array speech enhancement method and device |
GB201602382D0 (en) * | 2016-02-10 | 2016-03-23 | Cedar Audio Ltd | Acoustic source seperation systems |
CN108831495A (en) * | 2018-06-04 | 2018-11-16 | 桂林电子科技大学 | A kind of sound enhancement method applied to speech recognition under noise circumstance |
JP2019054344A (en) * | 2017-09-13 | 2019-04-04 | 日本電信電話株式会社 | Filter coefficient calculation device, sound pickup device, method thereof, and program |
CN110427968A (en) * | 2019-06-28 | 2019-11-08 | 武汉大学 | A kind of binocular solid matching process based on details enhancing |
CN110706719A (en) * | 2019-11-14 | 2020-01-17 | 北京远鉴信息技术有限公司 | Voice extraction method and device, electronic equipment and storage medium |
US20200243072A1 (en) * | 2015-03-18 | 2020-07-30 | Industry-University Cooperation Foundation Sogang Univesity | Online target-speech extraction method based on auxiliary function for robust automatic speech recognition |
CN112037813A (en) * | 2020-08-28 | 2020-12-04 | 南京大学 | Voice extraction method for high-power target signal |
CN112996019A (en) * | 2021-03-01 | 2021-06-18 | 军事科学院系统工程研究院网络信息研究所 | Terahertz frequency band distributed constellation access control method based on multi-objective optimization |
CN113096684A (en) * | 2021-06-07 | 2021-07-09 | 成都启英泰伦科技有限公司 | Target voice extraction method based on double-microphone array |
US20210217434A1 (en) * | 2015-03-18 | 2021-07-15 | Industry-University Cooperation Foundation Sogang University | Online target-speech extraction method based on auxiliary function for robust automatic speech recognition |
Non-Patent Citations (4)
Title |
---|
R. M. Corey and A. C. Singer, "Speech Separation Using Partially Asynchronous Microphone Arrays Without Resampling", 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 1-9 * |
Suleiman Erateb et al., "Enhanced Online IVA with Switched Source Prior for Speech Separation", 2020 IEEE 11th Sensor Array and Multichannel Signal Processing Workshop (SAM), pp. 1-5 * |
Gu Fan, Wang Huigang, Li Huxiong, "A Blind Speech Separation Algorithm for Strongly Reverberant Environments", Journal of Signal Processing, vol. 27, no. 4, pp. 534-540 * |
Huang Manlu, "Research on Underdetermined Blind Source Separation Based on Dual Microphones", China Excellent Master's Theses Electronic Journals Database, p. 2 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108831495B (en) | Speech enhancement method applied to speech recognition in noise environment | |
CN108986838B (en) | Self-adaptive voice separation method based on sound source positioning | |
CN107993670B (en) | Microphone array speech enhancement method based on statistical model | |
CN107221336B (en) | Device and method for enhancing target voice | |
CN109584903B (en) | Multi-user voice separation method based on deep learning | |
CN102565759B (en) | Binaural sound source localization method based on sub-band signal to noise ratio estimation | |
CN108109617A (en) | A kind of remote pickup method | |
CN107346664A (en) | A kind of ears speech separating method based on critical band | |
CN110610718B (en) | Method and device for extracting expected sound source voice signal | |
CN104811867B (en) | Microphone array airspace filter method based on array virtual extended | |
CN107369460B (en) | Voice enhancement device and method based on acoustic vector sensor space sharpening technology | |
CN108520756B (en) | Method and device for separating speaker voice | |
CN113903353A (en) | Directional noise elimination method and device based on spatial discrimination detection | |
CN111899756A (en) | Single-channel voice separation method and device | |
Li et al. | Online Directional Speech Enhancement Using Geometrically Constrained Independent Vector Analysis. | |
CN103901400A (en) | Binaural sound source positioning method based on delay compensation and binaural coincidence | |
CN113707136B (en) | Audio and video mixed voice front-end processing method for voice interaction of service robot | |
CN113050035B (en) | Two-dimensional directional pickup method and device | |
CN112466327B (en) | Voice processing method and device and electronic equipment | |
CN113628634B (en) | Real-time voice separation method and device guided by directional information | |
CN109901114B (en) | Time delay estimation method suitable for sound source positioning | |
WO2020078210A1 (en) | Adaptive estimation method and device for post-reverberation power spectrum in reverberation speech signal | |
CN108269581B (en) | Double-microphone time delay difference estimation method based on frequency domain coherent function | |
CN113345421B (en) | Multi-channel far-field target voice recognition method based on angle spectrum characteristics | |
CN111650559B (en) | Real-time processing two-dimensional sound source positioning method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |