CN113628634B - Real-time voice separation method and device guided by directional information - Google Patents


Info

Publication number
CN113628634B
Authority
CN
China
Prior art keywords
time
signal
filter
frequency
target voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110963498.1A
Other languages
Chinese (zh)
Other versions
CN113628634A (en)
Inventor
何平
蒋升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suirui Technology Group Co Ltd
Original Assignee
Suirui Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suirui Technology Group Co Ltd filed Critical Suirui Technology Group Co Ltd
Priority to CN202110963498.1A priority Critical patent/CN113628634B/en
Publication of CN113628634A publication Critical patent/CN113628634A/en
Application granted granted Critical
Publication of CN113628634B publication Critical patent/CN113628634B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a real-time voice separation method and device guided by directional information, belonging to the field of information processing. The method comprises the following steps: S1: initializing a steering vector and a directional filter for the time-domain signal of each microphone; S2: performing time-frequency decomposition on the initialized signals to complete the transformation from time-domain signals to time-frequency-domain signals; S3: performing separation-filter calculation on the time-frequency-domain signals to obtain a filter for separating the target voice from the residual signal; S4: obtaining the time-frequency-domain signal of the target voice according to the obtained filter, and from it the time-domain signal of the target voice. The invention constructs the initial estimate of real-time IVA from a super-directive filter and corrects the IVA optimization function, so that the separation algorithm converges quickly and the target voice signal is extracted accurately.

Description

Real-time voice separation method and device guided by directional information
Technical Field
The invention belongs to the field of information processing, and particularly relates to a real-time voice separation method and device guided by directional information.
Background
At present, microphone-array beamforming technology is widely applied in online conference systems, vehicle-mounted human-machine interaction, smart homes, and other fields. In real environments, noise and interference from competing speakers are significant and can markedly degrade the listening quality of conference communication and the accuracy of subsequent speech recognition. Beamforming based on the multiple elements of a microphone array is the most commonly used method for reducing signal noise and improving communication quality. Extracting the voice signal from a given direction in a targeted way while strongly suppressing other noise is therefore of great significance for improving conference communication quality, raising the speech recognition rate, and so on.
Independent vector analysis (IVA) is currently the most common speech separation/pickup technique. First, the time-domain signals picked up by all array elements are converted to the time-frequency domain by the short-time Fourier transform; then an optimization function is constructed on the principle of minimum cross-entropy between the separated voices, and the separation matrix is iteratively updated based on this function. Once the separation matrix is estimated, the frequency-domain estimate of the target signal is obtained, and finally the time-domain estimate is obtained by the inverse Fourier transform. Some of the latest IVA methods add a distance constraint between the separation matrix and the target-direction steering vector, so that the IVA separation result can extract the target voice in real time.
The main disadvantages of the prior art are as follows:
1) Existing directional IVA directly adds a constraint on the distance between the separation matrix and the steering vector; because the accuracy of the steering vector drops sharply under reverberation, its performance in reverberant scenes is clearly insufficient.
2) The directional IVA technique does not constrain the initial estimate, so convergence takes too long; if the environment changes, for example an interfering speaker is walking, the convergence rate of the IVA separation matrix cannot keep up with the rate of change of the acoustic environment.
In view of this, the present invention has been made.
Disclosure of Invention
The invention aims to provide a real-time voice separation method and device guided by directional information, which construct the initial estimate of real-time IVA from a super-directive filter and correct the IVA optimization function, so that the separation algorithm converges quickly and the target voice signal is extracted accurately.
In order to achieve the above object, the present invention provides a real-time voice separation method guided by directional information, applied to a microphone-array-based system, comprising the following steps:
S1: initializing a steering vector and a directional filter for the time-domain signal of each microphone;
S2: performing time-frequency decomposition on the initialized signals to complete the transformation from time-domain signals to time-frequency-domain signals;
S3: performing separation-filter calculation on the time-frequency-domain signals to obtain a filter for separating the target voice from the residual signal;
S4: obtaining the time-frequency-domain signal of the target voice according to the obtained filter, and from it the time-domain signal of the target voice.
Further, before step S1, the method further includes: acquiring the time-domain signal x_m(n) of each microphone.
In step S1, the steering vector is initialized as follows: for each frequency band k, a steering vector u(k) is calculated:
u(k) = [exp(-jω_k d_1^T q(θ)/c), exp(-jω_k d_2^T q(θ)/c), ..., exp(-jω_k d_M^T q(θ)/c)]^T
q(θ) = [cos(θ), sin(θ)]
where f_k, k = 1, 2, ..., K, is the frequency of the k-th band; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; q(θ) is the direction vector; and ω_k = 2πf_k is the angular frequency of band k.
The directional filter is initialized as follows: a super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)u(k) / (u^H(k)R^{-1}(k)u(k))
where R(k) denotes the correlation matrix of the microphone signals under a uniformly scattered (diffuse) field, normalized with respect to the picked-up signal.
Further, step S2 includes:
S201: performing a short-time Fourier transform on the time-domain signal x_m(n) to obtain its time-frequency-domain representation:
X_m(l,k) = Σ_{n=0}^{N-1} w(n) x_m(n_l + n) exp(-j2πkn/N)
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; n_l is the starting sample of the l-th frame; l is the time-frame index; k is the frequency index; and X_m(l,k) is the spectrum of the m-th microphone signal in the k-th frequency band of the l-th frame.
S202: for each frequency band k, constructing the frequency-domain original vector X(l,k):
X(l,k) = [X_1(l,k), X_2(l,k), ..., X_M(l,k)]^T
further, the step S3 includes:
s301: calculating a frame level separation guide factor:
wherein ,r1(l) and r2 (l) Respectively used for guiding the voice of the target party and the residual signal;
s302: calculating a separate steering matrix for each frequency band:
ψ 1 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
ψ 2 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
wherein ,ψ1(k) and ψ2 (k) A guide matrix representing the target voice and the residual signal respectively; alpha is a smoothing factor, and the value range is 0 to 1;
s303: a new optimization function is constructed for the filter separating the target speech from the residual signal, the optimization function being as follows:
wherein ,G1(k) and G2 (k) Filters for separating target speech from residual signal
S304: and minimizing the optimization function to obtain the optimal filter.
The process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
wherein ,
the optimal filter G (k) can be solved as:
G(k)=Ψ -1 (k)ρ(k)。
further, the step S4 includes:
s401: according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
s402: performing inverse Fourier transform to obtain target voice time domain estimation:
the invention also provides a real-time voice separation device guided by the pointing information, which is applied to a system based on a microphone array and comprises an initialization module, a signal decomposition module, a separation filter calculation module and a target voice estimation module:
the initialization module is used for initializing a steering vector and a directional filter for the time domain signal of each microphone;
the signal decomposition module is used for performing time-frequency decomposition on the initialized signals to finish the transformation from the time domain signals to the time-frequency domain signals;
the separation filter calculation module is used for carrying out separation filter calculation on the time-frequency domain signals to obtain a filter for separating target voice from residual signals;
the target voice estimation module is used for obtaining a time-frequency domain signal of the target voice according to the obtained filter, and further obtaining a time-domain signal of the target voice.
Further, the initialization module is further configured to acquire the time-domain signal x_m(n) of each microphone.
The initialization module initializes the steering vector as follows: for each frequency band k, a steering vector u(k) is calculated:
u(k) = [exp(-jω_k d_1^T q(θ)/c), exp(-jω_k d_2^T q(θ)/c), ..., exp(-jω_k d_M^T q(θ)/c)]^T
q(θ) = [cos(θ), sin(θ)]
where f_k, k = 1, 2, ..., K, is the frequency of the k-th band; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; q(θ) is the direction vector; and ω_k = 2πf_k is the angular frequency of band k.
The initialization module initializes the directional filter as follows: a super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)u(k) / (u^H(k)R^{-1}(k)u(k))
where R(k) denotes the correlation matrix of the microphone signals under a uniformly scattered field, normalized with respect to the picked-up signal.
Further, the operation steps of the signal decomposition module are as follows:
First, a short-time Fourier transform is performed on the time-domain signal x_m(n) to obtain its time-frequency-domain representation:
X_m(l,k) = Σ_{n=0}^{N-1} w(n) x_m(n_l + n) exp(-j2πkn/N)
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; n_l is the starting sample of the l-th frame; l is the time-frame index; and k is the frequency index. X_m(l,k) is the spectrum of the m-th microphone signal in the k-th frequency band of the l-th frame.
Next, for each frequency band k, the frequency-domain original vector X(l,k) is constructed:
X(l,k) = [X_1(l,k), X_2(l,k), ..., X_M(l,k)]^T
further, the operation steps of the separation filter calculation module are as follows:
first, a frame level separation guide factor is calculated:
wherein ,r1(l) and r2 (l) Respectively used for guiding the target voice and the residual signal;
secondly, a separate steering matrix for each frequency band is calculated:
ψ 1 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
ψ 2 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
wherein ,ψ1(k) and ψ2 (k) Representing the target speech and the residual signal respectivelyIs a pilot matrix of (a); alpha is a smoothing factor, and the value range is 0 to 1;
then, a new optimization function is constructed for the filter for separating the target voice from the residual signal, and the optimization function is as follows:
wherein ,G1(k) and G2 (k) Filters for separating target speech from residual signal
And finally, minimizing an optimization function to obtain an optimal filter.
The process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
wherein ,
the optimal filter G (k) can be solved as:
G(K)=Ψ -1 (k)ρ(k)。
further, the operation steps of the voice estimation module are as follows:
firstly, according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
then, performing inverse Fourier transform to obtain target voice time domain estimation:
the real-time voice separation method and device for guiding the pointing information provided by the invention have the following beneficial effects:
1. compared with the traditional IVA, the invention calculates the guide factor by using the super-directional filter, so that the convergence is faster, and the invention can adapt to the scene of acoustic environment change.
2. The objective function designed by the invention considers the difference between signals, increases the ambiguity constraint, can obtain the analytic optimal solution, and does not need iteration, so that the robustness is stronger, and the separation effect is more stable and reliable.
Drawings
Fig. 1 is a flowchart of the real-time voice separation method guided by directional information in this embodiment.
Fig. 2 is a schematic diagram of the Hamming window function used in this embodiment.
Fig. 3 is a schematic diagram of the real-time voice separation device guided by directional information in this embodiment.
Detailed Description
In order that those skilled in the art may better understand the present invention, it is described in further detail below with reference to specific embodiments.
As shown in fig. 1, an embodiment of the present invention is a real-time voice separation method guided by directional information, which can be applied to microphone-array-based systems such as voice conference systems, vehicle-mounted voice communication systems and human-machine interaction systems. It can extract the target voice signal in real time, which helps improve the communication quality of online voice conferences and the accuracy of subsequent speech recognition.
The method specifically comprises the following four implementation steps:
s1: steering vector and directional filter initialization is performed on the time domain signal of each microphone.
Before step S1, the method further includes acquiring the voice signal of each microphone, as follows: let x_m(n) denote the original time-domain signal picked up in real time by the M microphone array elements, where m denotes the microphone index with values from 1 to M, n denotes the time index, and the direction of the target voice relative to the microphone array is θ.
The target voice refers to the voice signal coming from the target direction; for the voice separation task, the target direction of the signal to be extracted is known in advance. For example, for a large-screen voice communication device, it is desirable to separate the voice signal arriving from the 90-degree direction.
Specifically, the steering vector is initialized as follows:
For each frequency band k, a steering vector u(k) is calculated, where a frequency band refers to the signal component corresponding to a certain frequency:
u(k) = [exp(-jω_k d_1^T q(θ)/c), exp(-jω_k d_2^T q(θ)/c), ..., exp(-jω_k d_M^T q(θ)/c)]^T
q(θ) = [cos(θ), sin(θ)]
where f_k, k = 1, 2, ..., K, is the frequency of the k-th band, and K is determined by the subsequent Fourier transform: for a frame length of 512, K is half the frame length; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; the superscript H denotes the conjugate transpose operator; j denotes the imaginary unit; q(θ) is the direction vector; and ω_k = 2πf_k is the angular frequency of band k.
The directional filter is initialized as follows:
A super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)u(k) / (u^H(k)R^{-1}(k)u(k))
where R(k) denotes the correlation matrix of the microphone signals under a uniformly scattered (diffuse) field, normalized with respect to the picked-up signal, and the superscript -1 denotes the matrix inverse.
These two steps complete the initialization of the filter, which serves the subsequent calculation of the spatial discrimination information.
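As an illustration of this initialization, the following minimal Python sketch computes the steering vectors and super-directive filters for a two-dimensional array; the diffuse-field coherence model sinc(2 f_k ||d_m - d_n||/c) for R(k) and the diagonal loading term are common engineering assumptions, not values specified by the invention.

```python
# Illustrative sketch of step S1 (not taken verbatim from the patent): computing
# the steering vector u(k) and the super-directive filter h(k) per frequency band.
# The diffuse-field model R_mn(k) = sinc(2 f_k ||d_m - d_n|| / c) and the 1e-3
# diagonal loading are assumed engineering choices.
import numpy as np

def init_steering_and_filter(mic_xy, theta, fs=16000, frame_len=512, c=340.0):
    """mic_xy: (M, 2) microphone coordinates in meters; theta: target direction
    in radians. Returns steering vectors u and filters h, each of shape (K, M)."""
    M = mic_xy.shape[0]
    K = frame_len // 2                                 # K = half the frame length
    f = np.arange(1, K + 1) * fs / frame_len           # band frequencies f_k
    omega = 2.0 * np.pi * f                            # band angular frequencies w_k
    q = np.array([np.cos(theta), np.sin(theta)])       # direction vector q(theta)
    delays = mic_xy @ q / c                            # propagation delays d_m^T q / c
    u = np.exp(-1j * omega[:, None] * delays[None, :]) # steering vectors, (K, M)

    dist = np.linalg.norm(mic_xy[:, None, :] - mic_xy[None, :, :], axis=-1)
    h = np.empty_like(u)
    for k in range(K):
        R = np.sinc(2.0 * f[k] * dist / c) + 1e-3 * np.eye(M)  # diffuse-field R(k)
        Ru = np.linalg.solve(R, u[k])                          # R^{-1}(k) u(k)
        h[k] = Ru / (u[k].conj() @ Ru)                         # super-directive filter
    return u, h
```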
S2: and performing time-frequency decomposition on the initialized signal to complete the conversion from the time domain signal to the time-frequency domain signal.
Specifically, the method comprises the following steps:
S201: a short-time Fourier transform is performed on the time-domain signal x_m(n) to obtain its time-frequency-domain representation:
X_m(l,k) = Σ_{n=0}^{N-1} w(n) x_m(n_l + n) exp(-j2πkn/N)
where N is the frame length, N = 512; w(n) is a Hamming window of length 512, in which n denotes the sample index within the frame, so that w(n) is the window value at each sample index n; n_l is the starting sample of the l-th frame; l is the time-frame index, in frames; and k is the frequency index. X_m(l,k) is the spectrum of the m-th microphone signal in the k-th frequency band of the l-th frame. The Hamming window function used in the present invention is shown in fig. 2.
S202: for each frequency band k, the frequency-domain original vector X(l,k) is constructed:
X(l,k) = [X_1(l,k), X_2(l,k), ..., X_M(l,k)]^T
where the superscript T denotes the transpose operator; the resulting original vector is an M-dimensional column vector.
The conversion from the time domain signal to the time-frequency domain can be completed through the steps.
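A minimal sketch of this time-frequency decomposition is given below; the 50% overlap (hop of N/2) is an assumption, since the invention fixes N = 512 and the Hamming window but does not state the hop size.

```python
# Sketch of step S2: multichannel STFT with N = 512 and a Hamming window.
# The hop size of N/2 (50% overlap) is assumed, not specified by the patent.
import numpy as np

def stft_multichannel(x, frame_len=512, hop=256):
    """x: (M, n_samples) time-domain microphone signals.
    Returns X of shape (L, K, M): frame l, band k, microphone m."""
    M, n_samples = x.shape
    w = np.hamming(frame_len)                        # analysis window w(n)
    L = (n_samples - frame_len) // hop + 1           # number of frames
    K = frame_len // 2                               # keep bands k = 1..N/2
    X = np.empty((L, K, M), dtype=complex)
    for l in range(L):
        frame = x[:, l * hop : l * hop + frame_len] * w  # windowed frame per mic
        spec = np.fft.rfft(frame, axis=-1)               # per-mic spectrum, (M, N/2+1)
        X[l] = spec[:, 1 : K + 1].T                      # X(l,k) as an M-vector per band
    return X
```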
S3: and performing separation filter calculation on the time-frequency domain signal to obtain a filter for separating the target voice from the residual signal.
Specifically, the method comprises the following steps:
S301: the frame-level separation guide factors r_1(l) and r_2(l) are calculated from the output of the super-directive filter,
where |·| denotes the modulus of a complex number, and r_1(l) and r_2(l) are used to guide the target voice and the residual signal, respectively. The target direction is the direction of interest to the user, and the target voice is the voice signal coming from that direction; the residual signal comprises the sounds from other directions and the environmental noise, i.e., the residual signal equals the total signal picked up by the microphones minus the target voice signal. This step provides the prior for the steering-matrix calculation in the next step.
S302: the steering matrices for each frequency band are calculated:
ψ_1(k) = αψ_1(k) + (1-α) r_1(l) X(l,k)X^H(l,k)
ψ_2(k) = αψ_2(k) + (1-α) r_2(l) X(l,k)X^H(l,k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with value range 0 to 1. The invention preferably takes α = 0.85, which avoids over-reliance on historical information while fully exploiting the spatial information of the signal.
This step computes the steering matrices, which are used directly in the subsequent update of the separation filter.
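The recursion above can be sketched for one incoming frame as follows; the guide factors r_1(l) and r_2(l) are taken as given inputs (step S301), since their closed-form expression is not reproduced here.

```python
# Sketch of the steering-matrix recursion of step S302 for one frame l.
# psi1/psi2 hold the running steering matrices; r1/r2 come from step S301.
import numpy as np

def update_steering_matrices(psi1, psi2, X_l, r1, r2, alpha=0.85):
    """psi1, psi2: (K, M, M) steering matrices of target and residual.
    X_l: (K, M) frequency-domain vectors X(l,k) of the current frame."""
    outer = X_l[:, :, None] * X_l[:, None, :].conj()  # X(l,k) X^H(l,k) per band
    psi1 = alpha * psi1 + (1.0 - alpha) * r1 * outer  # target steering matrix
    psi2 = alpha * psi2 + (1.0 - alpha) * r2 * outer  # residual steering matrix
    return psi1, psi2
```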
S303: a new optimization function is constructed for the filter separating the target speech from the residual signal, the optimization function being as follows:
where G_1(k) and G_2(k) are the filters separating the target speech and the residual signal, respectively.
The first term of the optimization function maximizes the difference between the separated signals; the second term avoids ambiguity in the filter estimate and ensures that the sum of the separation results stays as consistent as possible with the microphone signal.
S304: and minimizing the optimization function to obtain an optimal filter.
This minimization process is equivalent to solving the following equation:
Ψ(k)G(k) = ρ(k)
where Ψ(k) and ρ(k) are determined by the steering matrices and the constraint term of the optimization function; the superscript * denotes the conjugate operator.
The optimal filter G(k) is solved as:
G(k) = Ψ^{-1}(k)ρ(k).
After G(k) is solved, the filters G_1(k) and G_2(k) separating the target voice and the residual signal are obtained according to the correspondence of the vectors.
In step S301, the super-directive filter is used to calculate the speech separation guide factors; in step S302, the steering matrices are calculated from the guide factors; step S303 gives the designed speech-separation optimization function; and step S304 uses the invented optimal-separation-filter calculation, which follows the minimum-mean-square-error principle: from Ψ(k)G(k) = ρ(k), the minimum-mean-square-error solution is G(k) = Ψ^{-1}(k)ρ(k), which in turn minimizes the defined cost J.
Step S3 thus realizes the calculation of the frequency-domain separation filter.
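Since the closed forms of Ψ(k) and ρ(k) were given as figures in the original filing, the sketch below only shows the final per-band linear solve, with Ψ(k) and ρ(k) supplied by the caller; the stacked layout of G(k) (G_1 above G_2) is an assumption.

```python
# Hedged sketch of step S304: the per-band solve G(k) = Psi^{-1}(k) rho(k).
# Psi and rho must be assembled by the caller from psi1, psi2 and the
# constraint term; D = 2M stacking G1 over G2 is an assumed layout.
import numpy as np

def solve_separation_filters(Psi, rho):
    """Psi: (K, D, D); rho: (K, D). Returns G of shape (K, D);
    with the assumed layout, G1 = G[:, :D//2] and G2 = G[:, D//2:]."""
    return np.linalg.solve(Psi, rho)  # batched per-band linear solve
```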
S4: and obtaining a time-frequency domain signal of the target voice according to the obtained filter, and further obtaining a time-domain signal of the target voice.
The method specifically comprises the following steps:
s401: according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
s402: performing inverse Fourier transform to obtain target voice time domain estimation, and further obtaining a target voice time domain signal:
this step achieves the acquisition of the target speech time domain signal.
Through the above steps of the invention, the initialization of the microphone-array signals, the signal decomposition, the calculation of the separation filter, and the estimation of the target voice are all realized.
As shown in fig. 3, an embodiment of the present invention is a real-time voice separation device guided by directional information, applied to a microphone-array-based system, comprising an initialization module 1, a signal decomposition module 2, a separation-filter calculation module 3, and a target voice estimation module 4.
The initialization module 1 is used for initializing steering vectors and directional filters for voice signals of each microphone.
The initialization module 1 can also be used to acquire the voice signal of each microphone, as follows: let x_m(n) denote the original signal picked up in real time by the M microphone array elements, where m denotes the microphone index with values from 1 to M, n denotes the time index, and the direction of the target voice relative to the microphone array is θ.
Specifically, the steering vector is initialized as follows:
For each frequency band k, a steering vector u(k) is calculated, where a frequency band refers to the signal component corresponding to a certain frequency:
u(k) = [exp(-jω_k d_1^T q(θ)/c), exp(-jω_k d_2^T q(θ)/c), ..., exp(-jω_k d_M^T q(θ)/c)]^T
q(θ) = [cos(θ), sin(θ)]
where f_k, k = 1, 2, ..., K, is the frequency of the k-th band; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; the superscript H denotes the conjugate transpose operator; j denotes the imaginary unit; q(θ) is the direction vector; and ω_k = 2πf_k is the angular frequency of band k.
The directional filter is initialized as follows:
A super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)u(k) / (u^H(k)R^{-1}(k)u(k))
where R(k) denotes the correlation matrix of the microphone signals under a uniformly scattered field, normalized with respect to the picked-up signal, and the superscript -1 denotes the matrix inverse.
These two steps complete the initialization of the filter, which serves the subsequent calculation of the spatial discrimination information.
The signal decomposition module 2 is configured to perform time-frequency decomposition on the initialized signal, and complete conversion from a time domain signal to a time-frequency domain signal.
Specifically, the operation steps of the signal decomposition module 2 are as follows:
first, a time domain signal x m (n) performs short-time Fourier transform to obtain a time-frequency domain expression:
where N is the frame length, n=512; w (n) is a Hamming window of length 512, and l is timeFrame number, k is frequency number. X is X m (l, k) is the spectrum of the mth microphone signal in the kth frequency band in the first frame. The hamming window function is shown in fig. 2.
Next, for each frequency band k, a frequency domain original vector X (l, k) is constructed:
X(l,k)=[X 1 (l,k),X 2 (l,k),...,X M (l,k)] T
wherein the superscript T represents the transpose operator, resulting in the original vector being an M-dimensional column vector.
The conversion from the time domain signal to the time-frequency domain can be completed through the steps.
The separation filter calculation module 3 is configured to perform separation filter calculation on the time-frequency domain signal, and obtain a filter for separating the target voice from the residual signal.
Specifically, the operation steps of the separation filter calculation module 3 are as follows:
first, a frame level separation guide factor is calculated:
wherein, I represents taking the modulus of complex numbers, r 1(l) and r2 (l) The target speech and the residual signal are directed separately. The above operation calculates the steering matrix with the next step, providing a priori.
Next, the steering matrices for each frequency band are calculated:
ψ_1(k) = αψ_1(k) + (1-α) r_1(l) X(l,k)X^H(l,k)
ψ_2(k) = αψ_2(k) + (1-α) r_2(l) X(l,k)X^H(l,k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with value range 0 to 1. The invention preferably takes α = 0.85, which avoids over-reliance on historical information while fully exploiting the spatial information of the signal.
This operation computes the steering matrices, which are used directly in the subsequent update of the separation filter.
Then, a new optimization function is constructed for the filters separating the target voice from the residual signal, where G_1(k) and G_2(k) are the filters separating the target speech and the residual signal, respectively.
The first term of the optimization function maximizes the difference between the separated signals; the second term avoids ambiguity in the filter estimate and ensures that the sum of the separation results stays as consistent as possible with the microphone signal.
Finally, the optimization function is minimized to obtain the optimal filter.
This minimization process is equivalent to solving the following equation:
Ψ(k)G(k) = ρ(k)
where Ψ(k) and ρ(k) are determined by the steering matrices and the constraint term of the optimization function; the superscript * denotes the conjugate operator.
The optimal filter G(k) is solved as:
G(k) = Ψ^{-1}(k)ρ(k).
The above operations realize the calculation of the frequency-domain separation filter.
The target voice estimation module 4 is configured to obtain a time-frequency domain signal of the target voice according to the obtained filter, and further obtain a time-domain signal of the target voice.
Specifically, the operation steps of the target voice estimation module 4 are as follows:
First, the frequency-domain estimate of the target voice is obtained from the solved optimal filter;
then, the inverse Fourier transform is performed to obtain the time-domain estimate of the target voice.
The target voice estimation module 4 thus realizes the acquisition of the target voice time-domain signal.
In the above embodiment, the four modules, namely the initialization module 1, the signal decomposition module 2, the separation-filter calculation module 3, and the target voice estimation module 4, are all indispensable; if any one of them is missing, the target voice cannot be extracted.
Specific examples are set forth herein to illustrate the invention in detail; the description of the above examples is only intended to aid in understanding the core concept of the invention. It should be noted that any obvious modifications, equivalents, or other improvements made by those skilled in the art without departing from the inventive concept shall fall within the scope of the present invention.

Claims (4)

1. A real-time voice separation method guided by directional information, applied to a microphone-array-based system, comprising the following steps:
step S1: initializing a steering vector and a directional filter for the time domain signal of each microphone;
step S2: performing time-frequency decomposition on the initialized signals to finish the transformation from the time domain signals to the time-frequency domain signals;
step S3: performing separation filter calculation on the time-frequency domain signal to obtain a filter for separating target voice from residual signals;
step S4: obtaining a time-frequency domain signal of the target voice according to the obtained filter, and further obtaining a time-domain signal of the target voice;
the step S1 further includes: acquiring a time-domain signal x for each microphone m (n);
In the step S1, the method for performing the steering vector is as follows: for each frequency band k, a steering vector u (k) is calculated,
q(θ)=[cos(θ),sin(θ)]
wherein ,fk K=1, 2, for the frequency of the kth band; c is the speed of sound, c=340 m/s; d, d m Two-dimensional coordinate values for the mth microphone; q (θ) is a direction vector, ω k Is the frequency band circle frequency;
the method for initializing the directional filter is as follows: a super-steering filter h (k) is calculated for each frequency band k:
wherein R (k) represents the normalized autocorrelation coefficients of the individual microphones of the uniform scattered field with respect to the picked-up signal;
the step S2 includes:
s201: for time domain signal x m (n) performing a short-time fourier transform to obtain a time-frequency domain representation:
where N is the frame length, n=512; w (n) is a hamming window with length 512, 1 is a time frame number, and k is a frequency number; x is X m (l, k) is the spectrum of the mth microphone signal in the kth frequency band in frame 1;
s202: for each frequency band k, a frequency domain original vector X (l, k) is constructed:
X(l,k)=[X 1 (l,k)X 2 (l,k),...,X M (l,k)] T
the step S3 includes:
s301: calculating a frame level separation guide factor:
wherein ,r1(l) and r2 (l) Respectively used for guiding the target voice and the residual signal;
s302: calculating a separate steering matrix for each frequency band:
ψ 1 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
ψ 2 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
wherein ,ψ1(k) and ψ2 (k) A guide matrix representing the target voice and the residual signal respectively; alpha is a smoothing factor, and the value range is 0 to 1;
s303: a new optimization function is constructed for the filter separating the target speech from the residual signal, the optimization function being as follows:
wherein ,G1(k) and G2 (k) Filters for separating target speech from residual signal
S304: minimizing the optimization function to obtain an optimal filter;
the process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
wherein ,
the filter G (k) is solved as:
G(k)=Ψ -1 (k) ρ (middle).
2. The real-time voice separation method guided by directional information according to claim 1, wherein the step S4 includes:
S401: obtaining the frequency-domain estimate of the target voice from the solved filter;
S402: performing the inverse Fourier transform to obtain the time-domain estimate of the target voice.
3. A real-time voice separation device guided by directional information, applied to a microphone-array-based system, characterized by comprising an initialization module, a signal decomposition module, a separation-filter calculation module and a target voice estimation module:
the initialization module is used for initializing a steering vector and a directional filter for the time domain signal of each microphone;
the signal decomposition module is used for performing time-frequency decomposition on the initialized signals to finish the transformation from the time domain signals to the time-frequency domain signals;
the separation filter calculation module is used for carrying out separation filter calculation on the time-frequency domain signals to obtain a filter for separating target voice from residual signals;
the target voice estimation module is used for obtaining a time-frequency domain signal of target voice according to the obtained filter, and further obtaining a target voice time domain signal;
the initialization module is further configured to obtain a time domain signal x of each microphone m (n);
The method for guiding the vector by the signal decomposition module is as follows: for each frequency band k, a steering vector u (k) is calculated,
q(θ)=[cos(θ),sin(θ)]
wherein ,fk K=1, 2, for the frequency of the kth band; c is the speed of sound, c=340 m/s; d, d m Two-dimensional coordinate values for the mth microphone; q (θ) is a direction vector, ω k Is the frequency band circle frequency;
the method for initializing the directional filter by the signal decomposition module is as follows: a super-steering filter h (k) is calculated for each frequency band k:
wherein R (k) represents the normalized autocorrelation coefficients of the individual microphones of the uniform scattered field with respect to the picked-up signal;
the operation steps of the signal decomposition module are as follows:
first, a time domain signal x m (n) performing a short-time fourier transform to obtain a time-frequency domain representation:
where N is the frame length, n=512; w (n) is a Hamming window of length 512, l is a time frame number, and k is a frequency number; x is X m (l, k) is the spectrum of the mth microphone signal in the kth frequency band in the first frame;
next, for each frequency band k, a frequency domain original vector X (l, k) is constructed:
X(l,k)=[X 1 (l,k),X 2 (l,k),...,X M (l,k)] T
the operation steps of the separation filter calculation module are as follows:
first, a frame level separation guide factor is calculated:
wherein ,r1(l) and r2 (l) Respectively used for guiding the voice of the target party and the residual signal;
secondly, a separate steering matrix for each frequency band is calculated:
ψ 1 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
ψ 2 (k)=αψ 1 (k)+(1-αr 1 (l)X(l,k)X H (l,k)
wherein ,ψ1(k) and ψ2 (k) The guiding matrixes respectively represent the target party voice and the residual signals; alpha is a smoothing factor, and the value range is 0 to 1;
then, a new optimization function is constructed for the filter for separating the target voice from the residual signal, and the optimization function is as follows:
wherein ,G1(k) and G2 (k) Filters for separating target speech from residual signal
Finally, minimizing an optimization function to obtain an optimal filter;
the process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
wherein ,
the optimal filter G (k) is solved as:
G(k)=Ψ -1 (k)ρ(k)。
4. The real-time voice separation device guided by directional information according to claim 3, wherein the target voice estimation module operates as follows:
first, the frequency-domain estimate of the target voice is obtained from the solved optimal filter;
then, the inverse Fourier transform is performed to obtain the time-domain estimate of the target voice.
CN202110963498.1A 2021-08-20 2021-08-20 Real-time voice separation method and device guided by directional information Active CN113628634B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110963498.1A CN113628634B (en) 2021-08-20 2021-08-20 Real-time voice separation method and device guided by directional information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110963498.1A CN113628634B (en) 2021-08-20 2021-08-20 Real-time voice separation method and device guided by directional information

Publications (2)

Publication Number Publication Date
CN113628634A CN113628634A (en) 2021-11-09
CN113628634B true CN113628634B (en) 2023-10-03

Family

ID=78386993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110963498.1A Active CN113628634B (en) 2021-08-20 2021-08-20 Real-time voice separation method and device guided by directional information

Country Status (1)

Country Link
CN (1) CN113628634B (en)


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009151578A2 (en) * 2008-06-09 2009-12-17 The Board Of Trustees Of The University Of Illinois Method and apparatus for blind signal recovery in noisy, reverberant environments
US9237391B2 (en) * 2012-12-04 2016-01-12 Northwestern Polytechnical University Low noise differential microphone arrays
JP2014219467A (en) * 2013-05-02 2014-11-20 ソニー株式会社 Sound signal processing apparatus, sound signal processing method, and program
US11694707B2 (en) * 2015-03-18 2023-07-04 Industry-University Cooperation Foundation Sogang University Online target-speech extraction method based on auxiliary function for robust automatic speech recognition
US10991362B2 (en) * 2015-03-18 2021-04-27 Industry-University Cooperation Foundation Sogang University Online target-speech extraction method based on auxiliary function for robust automatic speech recognition

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015196729A1 (en) * 2014-06-27 2015-12-30 中兴通讯股份有限公司 Microphone array speech enhancement method and device
CN104866866A (en) * 2015-05-08 2015-08-26 太原理工大学 Improved natural gradient variable step-size blind source separation algorithm
GB201602382D0 (en) * 2016-02-10 2016-03-23 Cedar Audio Ltd Acoustic source seperation systems
JP2019054344A (en) * 2017-09-13 2019-04-04 日本電信電話株式会社 Filter coefficient calculation device, sound pickup device, method thereof, and program
CN108831495A (en) * 2018-06-04 2018-11-16 桂林电子科技大学 A kind of sound enhancement method applied to speech recognition under noise circumstance
CN110427968A (en) * 2019-06-28 2019-11-08 武汉大学 A kind of binocular solid matching process based on details enhancing
CN110706719A (en) * 2019-11-14 2020-01-17 北京远鉴信息技术有限公司 Voice extraction method and device, electronic equipment and storage medium
CN112037813A (en) * 2020-08-28 2020-12-04 南京大学 Voice extraction method for high-power target signal
CN112996019A (en) * 2021-03-01 2021-06-18 军事科学院系统工程研究院网络信息研究所 Terahertz frequency band distributed constellation access control method based on multi-objective optimization
CN113096684A (en) * 2021-06-07 2021-07-09 成都启英泰伦科技有限公司 Target voice extraction method based on double-microphone array

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Speech Separation Using Partially Asynchronous Microphone Arrays Without Resampling"; R. M. Corey and A. C. Singer; 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC); 1-9 *
"A Blind Speech Separation Algorithm for Strongly Reverberant Environments"; Gu Fan, Wang Huigang, Li Huxiong; Signal Processing (《信号处理》); Vol. 27, No. 04; 534-540 *
Suleiman Erateb et al.; "Enhanced Online IVA with Switched Source Prior for Speech Separation"; 2020 IEEE 11th Sensor Array and Multichannel Signal Processing Workshop (SAM); 2020, 1-5 *
"Research on Underdetermined Blind Source Separation Algorithms Based on Dual Microphones"; 黄曼露; China Master's Theses Full-text Database; Chapter 2, Section 2.2 to Chapter 3, Section 3 *

Also Published As

Publication number Publication date
CN113628634A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
CN108831495B (en) Speech enhancement method applied to speech recognition in noise environment
CN108986838B (en) Self-adaptive voice separation method based on sound source positioning
CN110503972B (en) Speech enhancement method, system, computer device and storage medium
CN107221336B (en) Device and method for enhancing target voice
US9100734B2 (en) Systems, methods, apparatus, and computer-readable media for far-field multi-source tracking and separation
KR20200066366A (en) Method and apparatus for target speech acquisition based on microphone array
CN110148420A Speech recognition method suitable for noisy environments
CN105244036A (en) Microphone speech enhancement method and microphone speech enhancement device
CN113903353B (en) Directional noise elimination method and device based on space distinguishing detection
CN109285557B (en) Directional pickup method and device and electronic equipment
CN107346664A Binaural speech separation method based on critical bands
CN106653044B (en) Dual microphone noise reduction system and method for tracking noise source and target sound source
CN107369460B (en) Voice enhancement device and method based on acoustic vector sensor space sharpening technology
CN111816200B (en) Multi-channel speech enhancement method based on time-frequency domain binary mask
CN106847301A Binaural speech separation method based on compressed sensing and attitude information
CN111681665A (en) Omnidirectional noise reduction method, equipment and storage medium
Li et al. Online Directional Speech Enhancement Using Geometrically Constrained Independent Vector Analysis.
CN110875054A (en) Far-field noise suppression method, device and system
CN113628634B (en) Real-time voice separation method and device guided by directional information
CN113744752A (en) Voice processing method and device
CN113050035B (en) Two-dimensional directional pickup method and device
CN110890099A (en) Sound signal processing method, device and storage medium
CN113948101B (en) Noise suppression method and device based on space distinguishing detection
CN112863525B (en) Method and device for estimating direction of arrival of voice and electronic equipment
CN111650559B (en) Real-time processing two-dimensional sound source positioning method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant