CN113628634A - Real-time voice separation method and device guided by directional information - Google Patents

Real-time voice separation method and device guided by directional information

Info

Publication number
CN113628634A
CN113628634A
Authority
CN
China
Prior art keywords
time
signal
filter
frequency
target voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110963498.1A
Other languages
Chinese (zh)
Other versions
CN113628634B (en)
Inventor
何平
蒋升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suirui Technology Group Co Ltd
Original Assignee
Suirui Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suirui Technology Group Co Ltd filed Critical Suirui Technology Group Co Ltd
Priority to CN202110963498.1A priority Critical patent/CN113628634B/en
Publication of CN113628634A publication Critical patent/CN113628634A/en
Application granted granted Critical
Publication of CN113628634B publication Critical patent/CN113628634B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a real-time voice separation method and device guided by directional information, belonging to the field of information processing. The method comprises the following steps: S1: initializing a steering vector and a directional filter for the time-domain signal of each microphone; S2: performing time-frequency decomposition on the initialized signals to complete the transformation from time-domain signals to time-frequency-domain signals; S3: performing separation-filter calculation on the time-frequency-domain signals to obtain filters for separating the target voice from the residual signal; S4: according to the obtained filters, obtaining a time-frequency-domain signal of the target voice, and further obtaining a time-domain signal of the target voice. The method constructs the initial estimate of a real-time IVA from a super-directional filter and modifies the IVA optimization function, ensuring that the separation algorithm converges rapidly and extracts the target voice signal accurately.

Description

Real-time voice separation method and device guided by directional information
Technical Field
The invention belongs to the field of information processing, and in particular relates to a real-time voice separation method and device guided by directional information.
Background
At present, microphone array beamforming technology is widely used in online conference systems, in-vehicle human-computer interaction, smart homes, and other fields. Real environments contain strong noise, competing speakers, and other interference, which noticeably degrades the listening quality of conference communication and the accuracy of subsequent voice recognition. The most common way to reduce signal noise and improve communication quality is beamforming based on the multiple elements of a microphone array. Extracting the voice signal from a specific direction while clearly suppressing other noise is therefore of great significance for improving conference communication quality, raising the voice recognition rate, and so on.
Speech separation/extraction based on Independent Vector Analysis (IVA) is currently the most commonly used technique. First, the time-domain signals picked up by all array elements are converted to the time-frequency domain by a short-time Fourier transform; an optimization function is then constructed on the principle of minimizing the cross entropy of the separated speech, and the separation matrix is updated iteratively from this function. Once the separation matrix has been estimated, the frequency-domain estimate of the target signal is obtained, and the time-domain estimate is finally recovered by the inverse Fourier transform. Some recent IVA methods extract the target speech in real time by adding a distance constraint between the separation matrix and the steering vector of the target direction.
The main disadvantages of the prior art are as follows:
1) Existing directional IVA imposes its constraint directly through the distance between the separation matrix and the steering vector. In a reverberant scene the accuracy of the steering vector drops sharply, so the performance is clearly insufficient under reverberation.
2) The directional IVA technique places no constraint on the initial estimate, so convergence takes too long; and if the environment changes, for example an interfering speaker walking around, the convergence of the IVA separation matrix cannot keep up with the changing acoustic environment.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
The invention aims to provide a real-time voice separation method and device guided by directional information, which construct the initial estimate of a real-time IVA from a super-directional filter and modify the IVA optimization function, ensuring that the separation algorithm converges rapidly and extracts the target voice signal accurately.
In order to achieve the above object, the present invention provides a real-time voice separation method guided by directional information, which is applied to a system based on a microphone array and comprises the following steps:
S1: initializing a steering vector and a directional filter for the time-domain signal of each microphone;
S2: performing time-frequency decomposition on the initialized signals to complete the transformation from time-domain signals to time-frequency-domain signals;
S3: performing separation-filter calculation on the time-frequency-domain signals to obtain filters for separating the target voice from the residual signal;
S4: according to the obtained filters, obtaining a time-frequency-domain signal of the target voice, and further obtaining a time-domain signal of the target voice.
Further, step S1 is preceded by: obtaining a time-domain signal x_m(n) for each microphone;
In step S1, the steering vector is computed as follows: for each frequency band k, a steering vector u(k) is calculated,
u(k) = [e^{-jω_k·q(θ)^T·d_1/c}, e^{-jω_k·q(θ)^T·d_2/c}, …, e^{-jω_k·q(θ)^T·d_M/c}]^T
ω_k = 2π·f_k
q(θ) = [cos(θ), sin(θ)]
where f_k is the frequency of the k-th band, k = 1, 2, …, K; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate of the m-th microphone; q(θ) is the direction vector; and ω_k is the angular frequency of the band;
The directional filter is initialized as follows: a super-directional filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)·u(k) / (u^H(k)·R^{-1}(k)·u(k))
where R(k) denotes the normalized autocorrelation coefficients between the microphones' picked-up signals in a uniformly scattered field.
Further, step S2 comprises:
S201: performing a short-time Fourier transform on the time-domain signal x_m(n) to obtain its time-frequency-domain expression:
X_m(l, k) = Σ_{n=0}^{N−1} w(n)·x_m^{(l)}(n)·e^{−j2πnk/N}
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; x_m^{(l)}(n) is the n-th sample of the l-th frame of x_m(n); l is the time-frame index and k is the frequency index; X_m(l, k) is the spectrum of the m-th microphone signal in the l-th frame and the k-th frequency band;
S202: for each frequency band k, constructing the frequency-domain signal vector X(l, k):
X(l, k) = [X_1(l, k), X_2(l, k), …, X_M(l, k)]^T
further, the step S3 includes:
s301: calculating a frame-level separation guidance factor:
Figure BDA0003222962980000032
Figure BDA0003222962980000033
wherein ,r1(l) and r2(l) Respectively used for guiding the target voice and the residual signal;
s302: computing a separate steering matrix for each band:
ψ1(k)=αψ1(k)+(1-α)r1(l)X(l,k)XH(l,k)
ψ2(k)=αψ1(k)+(1-α)r1(l)X(l,k)XH(l,k)
wherein ,ψ1(k) and ψ2(k) A steering matrix representing the target speech and the residual signal, respectively; alpha is a smoothing factor and has a value range of 0 to 1;
s303: constructing a new optimization function for the filter separating the target speech and the residual signal, the optimization function being as follows:
Figure BDA0003222962980000034
wherein ,G1(k) and G2(k) Filters for separating target speech and residual signal respectively
S304: and minimizing the optimization function to obtain an optimal filter.
The process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
wherein ,
Figure BDA0003222962980000041
Figure BDA0003222962980000042
the optimal filter g (k) can be solved as:
G(k)=Ψ-1(k)ρ(k)。
further, the step S4 includes:
s401: and according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
Figure BDA0003222962980000043
s402: performing inverse Fourier transform to obtain target voice time domain estimation:
Figure BDA0003222962980000044
the invention also provides a real-time voice separation device guided by the pointing information, which is applied to a system based on a microphone array and comprises an initialization module, a signal decomposition module, a separation filter calculation module and a target voice estimation module:
the initialization module is used for initializing a guide vector and a directional filter for the time domain signal of each microphone;
the signal decomposition module is used for performing time-frequency decomposition on the initialized signal to finish the transformation from a time domain signal to a time-frequency domain signal;
the separation filter calculation module is used for performing separation filter calculation on the time-frequency domain signals to obtain a filter for separating target voice and residual signals;
and the target voice estimation module is used for obtaining a time-frequency domain signal of the target voice according to the obtained filter so as to obtain a time-domain signal of the target voice.
Further, the initialization module is further configured to obtain the time-domain signal x_m(n) of each microphone;
The initialization module computes the steering vector as follows: for each frequency band k, a steering vector u(k) is calculated,
u(k) = [e^{-jω_k·q(θ)^T·d_1/c}, e^{-jω_k·q(θ)^T·d_2/c}, …, e^{-jω_k·q(θ)^T·d_M/c}]^T
ω_k = 2π·f_k
q(θ) = [cos(θ), sin(θ)]
where f_k is the frequency of the k-th band, k = 1, 2, …, K; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate of the m-th microphone; q(θ) is the direction vector; and ω_k is the angular frequency of the band;
The initialization module initializes the directional filter as follows: a super-directional filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)·u(k) / (u^H(k)·R^{-1}(k)·u(k))
where R(k) denotes the normalized autocorrelation coefficients between the microphones' picked-up signals in a uniformly scattered field.
Further, the signal decomposition module operates as follows:
First, a short-time Fourier transform is performed on the time-domain signal x_m(n) to obtain its time-frequency-domain expression:
X_m(l, k) = Σ_{n=0}^{N−1} w(n)·x_m^{(l)}(n)·e^{−j2πnk/N}
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; x_m^{(l)}(n) is the n-th sample of the l-th frame of x_m(n); l is the time-frame index and k is the frequency index; X_m(l, k) is the spectrum of the m-th microphone signal in the l-th frame and the k-th frequency band;
Secondly, for each frequency band k, the frequency-domain signal vector X(l, k) is constructed:
X(l, k) = [X_1(l, k), X_2(l, k), …, X_M(l, k)]^T
further, the separation filter calculation module operates as follows:
first, a frame-level separation guidance factor is calculated:
Figure BDA0003222962980000054
Figure BDA0003222962980000055
wherein ,r1(l) and r2(l) Respectively used for guiding the target voice and the residual signal;
next, a separate steering matrix for each frequency band is calculated:
ψ1(k)=αψ1(k)+(1-α)r1(l)X(l,k)XH(l,k)
ψ2(k)=αψ1(k)+(1-α)r1(l)X(l,k)XH(l,k)
wherein ,ψ1(k) and ψ2(k) A steering matrix representing the target speech and the residual signal, respectively; alpha is a smoothing factor and has a value ranging from 0 to1;
Then, a new optimization function is constructed for the filter separating the target speech and the residual signal, the optimization function is as follows:
Figure BDA0003222962980000061
wherein ,G1(k) and G2(k) Filters for separating target speech and residual signal respectively
And finally, minimizing the optimization function to obtain an optimal filter.
The process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
wherein ,
Figure BDA0003222962980000062
Figure BDA0003222962980000063
the optimal filter g (k) can be solved as:
G(K)=Ψ-1(k)ρ(k)。
further, the operation steps of the speech estimation module are as follows:
firstly, according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
Figure BDA0003222962980000064
secondly, performing inverse Fourier transform to obtain a target voice time domain estimation:
Figure BDA0003222962980000065
the invention provides a real-time voice separation method and a device guided by pointing information, which have the following beneficial effects:
1. compared with the traditional IVA, the invention uses the super-directional filter to calculate the guide factor, so the convergence is faster, and the invention can adapt to the scene of the change of the acoustic environment.
2. The target function designed by the invention not only considers the difference between signals, but also increases ambiguity constraint, can obtain an optimal solution for analysis, and does not need iteration, so that the robustness is stronger, and the separation effect is more stable and reliable.
Drawings
Fig. 1 is a flowchart of the real-time voice separation method guided by directional information according to this embodiment.
Fig. 2 is a diagram of the Hamming window function used in this embodiment.
Fig. 3 is a schematic diagram of the real-time voice separation device guided by directional information according to this embodiment.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments, so that those skilled in the art can better understand the solution of the present invention.
As shown in Fig. 1, an embodiment of the present invention is a real-time voice separation method guided by directional information. It can be applied to systems based on a microphone array, such as voice conference systems, in-vehicle voice communication systems, and human-computer interaction systems, and can extract the target voice signal in real time, which helps improve the communication quality of online voice conferences and the accuracy of subsequent voice recognition.
The method specifically comprises the following four implementation steps:
S1: initializing a steering vector and a directional filter for the time-domain signal of each microphone.
Before step S1, the method further includes obtaining the voice signal of each microphone, as follows: let x_m(n) denote the original time-domain signals picked up in real time by the M microphone elements, where m is the microphone index, taking values from 1 to M, and n is the time index; the direction of the target voice relative to the microphone array is θ.
The target voice is the voice signal corresponding to the target direction. For a voice separation task, the target direction of the signal to be extracted is known in advance; for example, a large-screen voice communication device may be expected to separate the voice signal from the 90° direction.
Specifically, the steering vector is computed as follows:
For each frequency band k, a steering vector u(k) is calculated, where a frequency band refers to the signal component corresponding to a certain frequency:
u(k) = [e^{-jω_k·q(θ)^T·d_1/c}, e^{-jω_k·q(θ)^T·d_2/c}, …, e^{-jω_k·q(θ)^T·d_M/c}]^T
ω_k = 2π·f_k
q(θ) = [cos(θ), sin(θ)]
where f_k is the frequency of the k-th band, k = 1, 2, …, K, with K determined by the subsequent Fourier transform: for a frame length of 512, K is half the frame length, i.e., K = 256; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate of the m-th microphone; the superscript H denotes the conjugate-transpose operator; j denotes the imaginary unit, j = √(−1); q(θ) is the direction vector; and ω_k is the angular frequency of the band.
The directional filter is initialized as follows:
A super-directional filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)·u(k) / (u^H(k)·R^{-1}(k)·u(k))
where R(k) denotes the normalized autocorrelation coefficients between the microphones' picked-up signals in a uniformly scattered field, and the superscript −1 denotes the matrix inverse.
These two steps complete the initialization of the filters used to compute the spatial-discrimination information in the subsequent steps; a minimal sketch of this initialization is given below.
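By way of a non-limiting illustration, the following NumPy sketch implements this initialization under stated assumptions: the far-field steering model above, a sinc-shaped diffuse-field coherence matrix standing in for R(k), and a small diagonal loading for numerical stability (the loading, sampling rate, and geometry are not part of the original disclosure).

```python
# Sketch of step S1 (assumptions: far-field steering model, sinc diffuse-field
# coherence for R(k), small diagonal loading; fs and geometry are illustrative).
import numpy as np

def init_steering_and_filter(mic_xy, theta, K=256, fs=16000, c=340.0):
    """mic_xy: (M, 2) microphone coordinates d_m; theta: target direction in radians.
    Returns steering vectors U (K, M) and super-directional filters H (K, M)."""
    M = mic_xy.shape[0]
    q = np.array([np.cos(theta), np.sin(theta)])       # direction vector q(theta)
    f = np.arange(1, K + 1) * fs / (2.0 * K)           # band frequencies f_k, k = 1..K
    omega = 2.0 * np.pi * f                            # angular frequencies omega_k
    delays = mic_xy @ q / c                            # propagation delay per microphone
    U = np.exp(-1j * np.outer(omega, delays))          # u(k) for every band, row k

    # Pairwise distances for the uniformly scattered (diffuse) field model.
    dist = np.linalg.norm(mic_xy[:, None, :] - mic_xy[None, :, :], axis=-1)
    H = np.zeros((K, M), dtype=complex)
    for k in range(K):
        # np.sinc(z) = sin(pi z)/(pi z), so this equals sin(w d / c)/(w d / c).
        R = np.sinc(omega[k] * dist / (np.pi * c)) + 1e-3 * np.eye(M)
        Ri_u = np.linalg.solve(R, U[k])
        H[k] = Ri_u / (U[k].conj() @ Ri_u)             # h(k) = R^-1 u / (u^H R^-1 u)
    return U, H
```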
S2: and performing time-frequency decomposition on the initialized signal to finish the transformation from the time domain signal to the time-frequency domain signal.
Specifically, the method comprises the following steps:
S201: performing a short-time Fourier transform on the time-domain signal x_m(n) to obtain its time-frequency-domain expression:
X_m(l, k) = Σ_{n=0}^{N−1} w(n)·x_m^{(l)}(n)·e^{−j2πnk/N}
where N is the frame length, N = 512; w(n) is a Hamming window of length 512, n being the sample index within a frame, so w(n) is the window value at sample n; x_m^{(l)}(n) denotes the n-th sample of the l-th frame of x_m(n); l is the time-frame index, in units of frames; k is the frequency index. X_m(l, k) is the spectrum of the m-th microphone signal in the l-th frame and the k-th frequency band. The Hamming window function used in the present invention is shown in Fig. 2.
S202: for each frequency band k, constructing the frequency-domain signal vector X(l, k):
X(l, k) = [X_1(l, k), X_2(l, k), …, X_M(l, k)]^T
where the superscript T denotes the transpose operator; the resulting vector is an M-dimensional column vector.
These steps complete the transformation from the time-domain signal to the time-frequency domain; a minimal sketch follows.
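The framing below follows one common convention; the patent fixes only the frame length N = 512 and the Hamming window, so the 50% hop used here is an assumption.

```python
# Sketch of step S2: multichannel STFT (the 50% hop is an assumption;
# only N = 512 and the Hamming window are fixed by the text above).
import numpy as np

def stft_multichannel(x, N=512, hop=256):
    """x: (M, n_samples) time-domain signals.
    Returns X with shape (L, K, M), i.e. the vectors X(l, k) for k = 1..K."""
    M, n = x.shape
    w = np.hamming(N)
    K = N // 2
    L = 1 + (n - N) // hop
    X = np.empty((L, K, M), dtype=complex)
    for l in range(L):
        frame = x[:, l * hop : l * hop + N] * w        # windowed l-th frame, per mic
        spec = np.fft.rfft(frame, axis=-1)             # (M, N//2 + 1) spectra
        X[l] = spec[:, 1 : K + 1].T                    # keep bands k = 1..K
    return X
```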
S3: and performing separation filter calculation on the time-frequency domain signal to obtain a filter for separating the target voice and the residual signal.
Specifically, the method comprises the following steps:
S301: calculating the frame-level separation guidance factors:
[Equation image in the original publication: definition of r_1(l)]
[Equation image in the original publication: definition of r_2(l)]
where |·| denotes the modulus of a complex number, and r_1(l) and r_2(l) are used to guide the target speech and the residual signal, respectively. The target direction is the direction the user is interested in, and the target voice is the voice signal coming from that direction; the residual signal is the sound and environmental noise from all other directions, i.e., what remains after subtracting the target voice signal from the total signal acquired by the microphones. This step provides the prior for the steering-matrix calculation in the next step.
S302: computing the separation steering matrices for each band:
ψ_1(k) = α·ψ_1(k) + (1−α)·r_1(l)·X(l, k)·X^H(l, k)
ψ_2(k) = α·ψ_2(k) + (1−α)·r_2(l)·X(l, k)·X^H(l, k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target speech and the residual signal, respectively; α is a smoothing factor with a value range of 0 to 1. The preferred value in the invention is α = 0.85, which avoids excessive dependence on historical information while still fully exploiting the spatial information of the signal.
This step yields the steering matrices that are used directly in the subsequent separation-filter update.
S303: constructing a new optimization function for the filters separating the target speech and the residual signal, as follows:
[Equation image in the original publication: optimization function J(k) in G_1(k) and G_2(k)]
where G_1(k) and G_2(k) are the filters that separate the target speech and the residual signal, respectively.
The first term of the optimization function maximizes the difference between the separated signals, and the second term avoids ambiguity in the filter estimate, ensuring that the sum of the separation results stays as consistent as possible with the microphone signal.
S304: minimizing the optimization function to obtain the optimal filters.
This minimization is equivalent to solving the following equation:
Ψ(k)·G(k) = ρ(k)
where
[Equation image in the original publication: definition of Ψ(k)]
[Equation image in the original publication: definition of ρ(k)]
and the superscript * denotes the complex-conjugate operator.
The optimal filter G(k) can be solved as:
G(k) = Ψ^{-1}(k)·ρ(k).
After G(k) is solved, the filters G_1(k) and G_2(k) for separating the target voice and the residual signal are obtained from the corresponding vector components.
In step S301, the super-directional filter is used to calculate the voice-separation guidance factors; in step S302, the separation steering matrices are calculated from these guidance factors; step S303 constructs the designed voice-separation optimization function; and step S304 applies the invented optimal separation-filter calculation, which follows the minimum-mean-square-error principle: it is derived from the equation Ψ(k)·G(k) = ρ(k), whose minimum-mean-square-error solution is G(k) = Ψ^{-1}(k)·ρ(k), which in turn guarantees the minimization of the defined objective J.
Therefore, step S3 realizes the calculation of the frequency-domain separation filters; a hedged sketch is given below.
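Since the exact forms of J(k), Ψ(k), and ρ(k) appear only as images in the original, the sketch below fills them in with one self-consistent reading: G_1 suppresses energy weighted by the residual's steering matrix, G_2 suppresses energy weighted by the target's, and a quadratic penalty keeps G_1 + G_2 close to a reference-microphone selector. The block system, the selector e1, and the weight beta are assumptions, not the patent's verbatim equations.

```python
# Sketch of step S3 with an assumed objective
#   J = G1^H psi2 G1 + G2^H psi1 G2 + beta * ||G1 + G2 - e1||^2,
# whose stationarity condition is the linear system Psi(k) G(k) = rho(k).
import numpy as np

def update_filters(X_l, psi1, psi2, r1, r2, alpha=0.85, beta=1.0):
    """X_l: (K, M) current frame; psi1, psi2: (K, M, M) steering matrices,
    updated in place; r1, r2: frame-level guidance factors."""
    K, M = X_l.shape
    e1 = np.zeros(M, dtype=complex); e1[0] = 1.0       # reference-mic selector (assumed)
    G1 = np.empty((K, M), dtype=complex)
    G2 = np.empty((K, M), dtype=complex)
    for k in range(K):
        outer = np.outer(X_l[k], X_l[k].conj())        # X(l,k) X^H(l,k)
        psi1[k] = alpha * psi1[k] + (1 - alpha) * r1 * outer
        psi2[k] = alpha * psi2[k] + (1 - alpha) * r2 * outer
        I = np.eye(M)
        Psi = np.block([[psi2[k] + beta * I, beta * I],
                        [beta * I, psi1[k] + beta * I]])
        rho = beta * np.concatenate([e1, e1])
        G = np.linalg.solve(Psi, rho)                  # G(k) = Psi^{-1}(k) rho(k)
        G1[k], G2[k] = G[:M], G[M:]
    return G1, G2
```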
S4: and according to the obtained filter, obtaining a time-frequency domain signal of the target voice, and further obtaining a time-domain signal of the target voice.
The method specifically comprises the following steps:
S401: according to the optimal filters obtained by solving, the frequency-domain estimate of the target voice is further obtained:
Ŷ(l, k) = G_1^H(k)·X(l, k)
S402: an inverse Fourier transform is performed to obtain the time-domain estimate of the target voice, and hence the time-domain signal of the target voice:
ŷ(n) = ISTFT{Ŷ(l, k)}, i.e., the inverse transform of each frame followed by overlap-add.
This step realizes the acquisition of the time-domain signal of the target voice; a minimal sketch follows.
Through the above steps of the invention, the initialization, signal decomposition, separation-filter calculation, and target-voice estimation of the microphone-array signals can be realized; a hypothetical end-to-end driver combining the sketches above is shown below.
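The driver below strings the four sketches together on synthetic input; the array geometry, sampling rate, guidance-factor formula, and the use of the final frame's filter for reconstruction are all illustrative assumptions.

```python
# Hypothetical end-to-end driver for the four sketches above (illustrative only).
import numpy as np

if __name__ == "__main__":
    fs, N, hop = 16000, 512, 256
    mic_xy = np.array([[0.00, 0.0], [0.04, 0.0],
                       [0.08, 0.0], [0.12, 0.0]])      # 4-mic linear array (assumed)
    theta = np.pi / 2                                  # target voice at 90 degrees
    x = np.random.randn(4, 2 * fs)                     # stand-in for captured audio

    U, H = init_steering_and_filter(mic_xy, theta, K=N // 2, fs=fs)   # S1
    X = stft_multichannel(x, N=N, hop=hop)                            # S2
    L, K, M = X.shape
    psi1 = np.tile(np.eye(M, dtype=complex), (K, 1, 1))
    psi2 = np.tile(np.eye(M, dtype=complex), (K, 1, 1))
    for l in range(L):                                                # S3, per frame
        # Guidance factors from the super-directional beamformer output; this
        # energy-ratio form is an assumption standing in for the patent's images.
        e_t = np.sum(np.abs(np.einsum('km,km->k', H.conj(), X[l])) ** 2)
        e_all = np.sum(np.abs(X[l]) ** 2) + 1e-12
        r1 = min(M * e_t / e_all, 1.0)
        r2 = 1.0 - r1
        G1, G2 = update_filters(X[l], psi1, psi2, r1, r2)
    # For brevity the last frame's filter is applied to all frames; a real-time
    # system would filter each frame with its current G1.
    y_hat = reconstruct_target(X, G1, N=N, hop=hop)                   # S4
```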
As shown in fig. 3, an embodiment of the present invention is a directional information guided real-time speech separation apparatus applied to a microphone array based system, which includes an initialization module 1, a signal decomposition module 2, a separation filter calculation module 3, and a target speech estimation module 4.
The initialization module 1 is used for initializing a steering vector and a directional filter for the voice signal of each microphone.
The initialization module 1 can also be used to obtain the voice signal of each microphone, as follows: let x_m(n) denote the original signals picked up in real time by the M microphone elements, where m is the microphone index, taking values from 1 to M, and n is the time index; the direction of the target voice relative to the microphone array is θ.
Specifically, the steering vector is computed as follows:
For each frequency band k, a steering vector u(k) is calculated, where a frequency band refers to the signal component corresponding to a certain frequency:
u(k) = [e^{-jω_k·q(θ)^T·d_1/c}, e^{-jω_k·q(θ)^T·d_2/c}, …, e^{-jω_k·q(θ)^T·d_M/c}]^T
ω_k = 2π·f_k
q(θ) = [cos(θ), sin(θ)]
where f_k is the frequency of the k-th band, k = 1, 2, …, K; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate of the m-th microphone; the superscript H denotes the conjugate-transpose operator; j denotes the imaginary unit, j = √(−1); q(θ) is the direction vector; and ω_k is the angular frequency of the band.
The directional filter is initialized as follows:
A super-directional filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)·u(k) / (u^H(k)·R^{-1}(k)·u(k))
where R(k) denotes the normalized autocorrelation coefficients between the microphones' picked-up signals in a uniformly scattered field, and the superscript −1 denotes the matrix inverse.
These two steps complete the initialization of the filters used to compute the spatial-discrimination information in the subsequent steps.
The signal decomposition module 2 is configured to perform time-frequency decomposition on the initialized signal, and complete the transformation from the time domain signal to the time-frequency domain signal.
Specifically, the signal decomposition module 2 operates as follows:
First, a short-time Fourier transform is performed on the time-domain signal x_m(n) to obtain its time-frequency-domain expression:
X_m(l, k) = Σ_{n=0}^{N−1} w(n)·x_m^{(l)}(n)·e^{−j2πnk/N}
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; x_m^{(l)}(n) is the n-th sample of the l-th frame of x_m(n); l is the time-frame index and k is the frequency index. X_m(l, k) is the spectrum of the m-th microphone signal in the l-th frame and the k-th frequency band. The Hamming window function is shown in Fig. 2.
Secondly, for each frequency band k, the frequency-domain signal vector X(l, k) is constructed:
X(l, k) = [X_1(l, k), X_2(l, k), …, X_M(l, k)]^T
where the superscript T denotes the transpose operator; the resulting vector is an M-dimensional column vector.
These steps complete the transformation from the time-domain signal to the time-frequency domain.
The separation filter calculation module 3 is configured to perform separation filter calculation on the time-frequency domain signal, and obtain a filter for separating the target speech and the residual signal.
Specifically, the operation steps of the separation filter calculation module 3 are as follows:
First, the frame-level separation guidance factors are calculated:
[Equation image in the original publication: definition of r_1(l)]
[Equation image in the original publication: definition of r_2(l)]
where |·| denotes the modulus of a complex number, and r_1(l) and r_2(l) guide the target speech and the residual signal, respectively. This operation provides the prior for the steering-matrix calculation in the next step.
Next, the separation steering matrices for each frequency band are calculated:
ψ_1(k) = α·ψ_1(k) + (1−α)·r_1(l)·X(l, k)·X^H(l, k)
ψ_2(k) = α·ψ_2(k) + (1−α)·r_2(l)·X(l, k)·X^H(l, k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target speech and the residual signal, respectively; α is a smoothing factor with a value range of 0 to 1. The preferred value adopted by the invention is α = 0.85, which avoids excessive dependence on historical information while still fully exploiting the spatial information of the signal.
This operation yields the steering matrices that are used directly in the subsequent separation-filter update.
Then, a new optimization function is constructed for the filters separating the target speech and the residual signal, as follows:
[Equation image in the original publication: optimization function J(k) in G_1(k) and G_2(k)]
where G_1(k) and G_2(k) are the filters that separate the target speech and the residual signal, respectively.
The first term of the optimization function maximizes the difference between the separated signals, and the second term avoids ambiguity in the filter estimate, ensuring that the sum of the separation results stays as consistent as possible with the microphone signal.
Finally, the optimization function is minimized to obtain the optimal filters.
This minimization is equivalent to solving the following equation:
Ψ(k)·G(k) = ρ(k)
where
[Equation image in the original publication: definition of Ψ(k)]
[Equation image in the original publication: definition of ρ(k)]
and the superscript * denotes the complex-conjugate operator.
The optimal filter G(k) can be solved as:
G(k) = Ψ^{-1}(k)·ρ(k).
the calculation of the frequency domain separation filter can be realized by the above operation.
The target voice estimation module 4 is configured to obtain a time-frequency-domain signal of the target voice according to the obtained filters, and further the time-domain signal of the target voice.
Specifically, the target speech estimation module 4 operates as follows:
First, according to the optimal filters obtained by solving, the frequency-domain estimate of the target voice is further obtained:
Ŷ(l, k) = G_1^H(k)·X(l, k)
Secondly, an inverse Fourier transform is performed to obtain the time-domain estimate of the target voice:
ŷ(n) = ISTFT{Ŷ(l, k)}, i.e., the inverse transform of each frame followed by overlap-add.
The target speech estimation module 4 thus obtains the time-domain signal of the target speech.
In the above embodiment, all four modules (the initialization module 1, the signal decomposition module 2, the separation filter calculation module 3, and the target speech estimation module 4) are indispensable; if any one of them is missing, the target speech cannot be extracted.
The inventive concept is explained in detail herein using specific examples, which are given only to aid in understanding the core concepts of the invention. It should be understood that any obvious modifications, equivalents and other improvements made by those skilled in the art without departing from the spirit of the present invention are included in the scope of the present invention.

Claims (10)

1. A real-time voice separation method guided by directional information, applied to a system based on a microphone array, characterized by comprising the following steps:
S1: initializing a steering vector and a directional filter for the time-domain signal of each microphone;
S2: performing time-frequency decomposition on the initialized signals to complete the transformation from time-domain signals to time-frequency-domain signals;
S3: performing separation-filter calculation on the time-frequency-domain signals to obtain filters for separating the target voice from the residual signal;
S4: according to the obtained filters, obtaining a time-frequency-domain signal of the target voice, and further obtaining a time-domain signal of the target voice.
2. The real-time voice separation method guided by directional information according to claim 1, wherein step S1 is preceded by: obtaining a time-domain signal x_m(n) for each microphone;
In step S1, the steering vector is computed as follows: for each frequency band k, a steering vector u(k) is calculated,
u(k) = [e^{-jω_k·q(θ)^T·d_1/c}, e^{-jω_k·q(θ)^T·d_2/c}, …, e^{-jω_k·q(θ)^T·d_M/c}]^T
ω_k = 2π·f_k
q(θ) = [cos(θ), sin(θ)]
where f_k is the frequency of the k-th band, k = 1, 2, …, K; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate of the m-th microphone; q(θ) is the direction vector; and ω_k is the angular frequency of the band;
the directional filter is initialized as follows: a super-directional filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)·u(k) / (u^H(k)·R^{-1}(k)·u(k))
where R(k) denotes the normalized autocorrelation coefficients between the microphones' picked-up signals in a uniformly scattered field.
3. The real-time voice separation method guided by directional information according to claim 2, wherein step S2 comprises:
S201: performing a short-time Fourier transform on the time-domain signal x_m(n) to obtain its time-frequency-domain expression:
X_m(l, k) = Σ_{n=0}^{N−1} w(n)·x_m^{(l)}(n)·e^{−j2πnk/N}
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; x_m^{(l)}(n) is the n-th sample of the l-th frame of x_m(n); l is the time-frame index and k is the frequency index; X_m(l, k) is the spectrum of the m-th microphone signal in the l-th frame and the k-th frequency band;
S202: for each frequency band k, constructing the frequency-domain signal vector X(l, k):
X(l, k) = [X_1(l, k), X_2(l, k), …, X_M(l, k)]^T
4. The real-time voice separation method guided by directional information according to claim 3, wherein step S3 comprises:
S301: calculating the frame-level separation guidance factors:
[Equation image in the original publication: definition of r_1(l)]
[Equation image in the original publication: definition of r_2(l)]
where r_1(l) and r_2(l) are used to guide the target speech and the residual signal, respectively;
S302: computing the separation steering matrices for each band:
ψ_1(k) = α·ψ_1(k) + (1−α)·r_1(l)·X(l, k)·X^H(l, k)
ψ_2(k) = α·ψ_2(k) + (1−α)·r_2(l)·X(l, k)·X^H(l, k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target speech and the residual signal, respectively; α is a smoothing factor with a value range of 0 to 1;
S303: constructing a new optimization function for the filters separating the target speech and the residual signal, as follows:
[Equation image in the original publication: optimization function J(k) in G_1(k) and G_2(k)]
where G_1(k) and G_2(k) are the filters separating the target speech and the residual signal, respectively;
S304: minimizing the optimization function to obtain the optimal filters;
the process of minimizing the optimization function is to solve the following equation:
Ψ(k)·G(k) = ρ(k)
where
[Equation image in the original publication: definition of Ψ(k)]
[Equation image in the original publication: definition of ρ(k)]
the filter G(k) can be solved as:
G(k) = Ψ^{-1}(k)·ρ(k).
5. The real-time voice separation method guided by directional information according to claim 4, wherein step S4 comprises:
S401: according to the filters obtained by solving, further obtaining the frequency-domain estimate of the target voice:
Ŷ(l, k) = G_1^H(k)·X(l, k)
S402: performing an inverse Fourier transform to obtain the time-domain estimate of the target voice:
ŷ(n) = ISTFT{Ŷ(l, k)}, i.e., the inverse transform of each frame followed by overlap-add.
6. A real-time voice separation device guided by directional information, applied to a system based on a microphone array, characterized by comprising an initialization module, a signal decomposition module, a separation filter calculation module and a target voice estimation module:
the initialization module is used for initializing a steering vector and a directional filter for the time-domain signal of each microphone;
the signal decomposition module is used for performing time-frequency decomposition on the initialized signals to complete the transformation from time-domain signals to time-frequency-domain signals;
the separation filter calculation module is used for performing separation-filter calculation on the time-frequency-domain signals to obtain filters for separating the target voice and the residual signal;
the target voice estimation module is used for obtaining a time-frequency-domain signal of the target voice according to the obtained filters, and further obtaining the time-domain signal of the target voice.
7. The real-time voice separation device guided by directional information according to claim 6, wherein the initialization module is further configured to obtain a time-domain signal x_m(n) of each microphone;
the initialization module computes the steering vector as follows: for each frequency band k, a steering vector u(k) is calculated,
u(k) = [e^{-jω_k·q(θ)^T·d_1/c}, e^{-jω_k·q(θ)^T·d_2/c}, …, e^{-jω_k·q(θ)^T·d_M/c}]^T
ω_k = 2π·f_k
q(θ) = [cos(θ), sin(θ)]
where f_k is the frequency of the k-th band, k = 1, 2, …, K; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate of the m-th microphone; q(θ) is the direction vector; and ω_k is the angular frequency of the band;
the initialization module initializes the directional filter as follows: a super-directional filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)·u(k) / (u^H(k)·R^{-1}(k)·u(k))
where R(k) denotes the normalized autocorrelation coefficients between the microphones' picked-up signals in a uniformly scattered field.
8. The real-time voice separation device guided by directional information according to claim 7, wherein the signal decomposition module operates as follows:
First, a short-time Fourier transform is performed on the time-domain signal x_m(n) to obtain its time-frequency-domain expression:
X_m(l, k) = Σ_{n=0}^{N−1} w(n)·x_m^{(l)}(n)·e^{−j2πnk/N}
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; x_m^{(l)}(n) is the n-th sample of the l-th frame of x_m(n); l is the time-frame index and k is the frequency index; X_m(l, k) is the spectrum of the m-th microphone signal in the l-th frame and the k-th frequency band;
Secondly, for each frequency band k, the frequency-domain signal vector X(l, k) is constructed:
X(l, k) = [X_1(l, k), X_2(l, k), …, X_M(l, k)]^T
9. The real-time voice separation device guided by directional information according to claim 8, wherein the separation filter calculation module operates as follows:
First, the frame-level separation guidance factors are calculated:
[Equation image in the original publication: definition of r_1(l)]
[Equation image in the original publication: definition of r_2(l)]
where r_1(l) and r_2(l) are used to guide the target speech and the residual signal, respectively;
Next, the separation steering matrices for each frequency band are calculated:
ψ_1(k) = α·ψ_1(k) + (1−α)·r_1(l)·X(l, k)·X^H(l, k)
ψ_2(k) = α·ψ_2(k) + (1−α)·r_2(l)·X(l, k)·X^H(l, k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target speech and the residual signal, respectively; α is a smoothing factor with a value range of 0 to 1;
Then, a new optimization function is constructed for the filters separating the target speech and the residual signal, as follows:
[Equation image in the original publication: optimization function J(k) in G_1(k) and G_2(k)]
where G_1(k) and G_2(k) are the filters separating the target speech and the residual signal, respectively;
Finally, the optimization function is minimized to obtain the optimal filters;
the process of minimizing the optimization function is to solve the following equation:
Ψ(k)·G(k) = ρ(k)
where
[Equation image in the original publication: definition of Ψ(k)]
[Equation image in the original publication: definition of ρ(k)]
the optimal filter G(k) can be solved as:
G(k) = Ψ^{-1}(k)·ρ(k).
10. The real-time voice separation device guided by directional information according to claim 9, wherein the speech estimation module operates as follows:
First, according to the optimal filters obtained by solving, the frequency-domain estimate of the target voice is further obtained:
Ŷ(l, k) = G_1^H(k)·X(l, k)
Secondly, an inverse Fourier transform is performed to obtain the time-domain estimate of the target voice:
ŷ(n) = ISTFT{Ŷ(l, k)}, i.e., the inverse transform of each frame followed by overlap-add.
CN202110963498.1A 2021-08-20 2021-08-20 Real-time voice separation method and device guided by directional information Active CN113628634B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110963498.1A CN113628634B (en) 2021-08-20 2021-08-20 Real-time voice separation method and device guided by directional information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110963498.1A CN113628634B (en) 2021-08-20 2021-08-20 Real-time voice separation method and device guided by directional information

Publications (2)

Publication Number Publication Date
CN113628634A true CN113628634A (en) 2021-11-09
CN113628634B CN113628634B (en) 2023-10-03

Family

ID=78386993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110963498.1A Active CN113628634B (en) 2021-08-20 2021-08-20 Real-time voice separation method and device guided by directional information

Country Status (1)

Country Link
CN (1) CN113628634B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110231185A1 (en) * 2008-06-09 2011-09-22 Kleffner Matthew D Method and apparatus for blind signal recovery in noisy, reverberant environments
US20150163577A1 (en) * 2012-12-04 2015-06-11 Northwestern Polytechnical University Low noise differential microphone arrays
US20140328487A1 (en) * 2013-05-02 2014-11-06 Sony Corporation Sound signal processing apparatus, sound signal processing method, and program
WO2015196729A1 (en) * 2014-06-27 2015-12-30 中兴通讯股份有限公司 Microphone array speech enhancement method and device
US20200243072A1 (en) * 2015-03-18 2020-07-30 Industry-University Cooperation Foundation Sogang Univesity Online target-speech extraction method based on auxiliary function for robust automatic speech recognition
US20210217434A1 (en) * 2015-03-18 2021-07-15 Industry-University Cooperation Foundation Sogang University Online target-speech extraction method based on auxiliary function for robust automatic speech recognition
CN104866866A (en) * 2015-05-08 2015-08-26 太原理工大学 Improved natural gradient variable step-size blind source separation algorithm
GB201602382D0 (en) * 2016-02-10 2016-03-23 Cedar Audio Ltd Acoustic source seperation systems
JP2019054344A (en) * 2017-09-13 2019-04-04 日本電信電話株式会社 Filter coefficient calculation device, sound pickup device, method thereof, and program
CN108831495A (en) * 2018-06-04 2018-11-16 桂林电子科技大学 A kind of sound enhancement method applied to speech recognition under noise circumstance
CN110427968A (en) * 2019-06-28 2019-11-08 武汉大学 A kind of binocular solid matching process based on details enhancing
CN110706719A (en) * 2019-11-14 2020-01-17 北京远鉴信息技术有限公司 Voice extraction method and device, electronic equipment and storage medium
CN112037813A (en) * 2020-08-28 2020-12-04 南京大学 Voice extraction method for high-power target signal
CN112996019A (en) * 2021-03-01 2021-06-18 军事科学院系统工程研究院网络信息研究所 Terahertz frequency band distributed constellation access control method based on multi-objective optimization
CN113096684A (en) * 2021-06-07 2021-07-09 成都启英泰伦科技有限公司 Target voice extraction method based on double-microphone array

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
R. M. Corey and A. C. Singer, "Speech Separation Using Partially Asynchronous Microphone Arrays Without Resampling", 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 1-9.
Suleiman Erateb et al., "Enhanced Online IVA with Switched Source Prior for Speech Separation", 2020 IEEE 11th Sensor Array and Multichannel Signal Processing Workshop (SAM), pp. 1-5.
顾凡, 王惠刚, 李虎雄, "A Blind Speech Separation Algorithm for Strongly Reverberant Environments", Journal of Signal Processing, vol. 27, no. 4, pp. 534-540.
黄曼露, "Research on Underdetermined Blind Source Separation Algorithms Based on Dual Microphones", China Master's Theses Full-text Database, p. 2.

Also Published As

Publication number Publication date
CN113628634B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
CN108831495B (en) Speech enhancement method applied to speech recognition in noise environment
CN108986838B (en) Self-adaptive voice separation method based on sound source positioning
CN107993670B (en) Microphone array speech enhancement method based on statistical model
CN107221336B (en) Device and method for enhancing target voice
CN109584903B (en) Multi-user voice separation method based on deep learning
CN102565759B (en) Binaural sound source localization method based on sub-band signal to noise ratio estimation
CN108109617A (en) A kind of remote pickup method
CN107346664A (en) A kind of ears speech separating method based on critical band
CN110610718B (en) Method and device for extracting expected sound source voice signal
CN104811867B (en) Microphone array airspace filter method based on array virtual extended
CN107369460B (en) Voice enhancement device and method based on acoustic vector sensor space sharpening technology
CN108520756B (en) Method and device for separating speaker voice
CN113903353A (en) Directional noise elimination method and device based on spatial discrimination detection
CN111899756A (en) Single-channel voice separation method and device
Li et al. Online Directional Speech Enhancement Using Geometrically Constrained Independent Vector Analysis.
CN103901400A (en) Binaural sound source positioning method based on delay compensation and binaural coincidence
CN113707136B (en) Audio and video mixed voice front-end processing method for voice interaction of service robot
CN113050035B (en) Two-dimensional directional pickup method and device
CN112466327B (en) Voice processing method and device and electronic equipment
CN113628634B (en) Real-time voice separation method and device guided by directional information
CN109901114B (en) Time delay estimation method suitable for sound source positioning
WO2020078210A1 (en) Adaptive estimation method and device for post-reverberation power spectrum in reverberation speech signal
CN108269581B (en) Double-microphone time delay difference estimation method based on frequency domain coherent function
CN113345421B (en) Multi-channel far-field target voice recognition method based on angle spectrum characteristics
CN111650559B (en) Real-time processing two-dimensional sound source positioning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant