CN110610718B - Method and device for extracting expected sound source voice signal - Google Patents

Method and device for extracting expected sound source voice signal

Info

Publication number
CN110610718B
Authority
CN
China
Prior art keywords
sound source
voice
position information
signal
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810623577.6A
Other languages
Chinese (zh)
Other versions
CN110610718A
Inventor
余立志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Actions Technology Co Ltd
Original Assignee
Actions Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Actions Technology Co Ltd
Priority to CN201810623577.6A
Publication of CN110610718A
Application granted
Publication of CN110610718B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0272 Voice signal separating
    • G10L21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165 Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
    • G10L2021/02166 Microphone arrays; Beamforming

Abstract

The present invention relates to audio processing technologies, and in particular to a method and an apparatus for extracting the speech signal of a desired sound source, used to maintain the speech recognition rate without increasing hardware cost. The method comprises the following steps: obtaining the position information of the desired sound source, together with its existence probability, from relevant features of the corresponding speech signals received through at least two microphones; obtaining a preset target separation coefficient based on that information; and extracting the speech signal of the desired sound source from at least two of the corresponding speech signals using the target separation coefficient. Because a stable correspondence is preset between position information and target separation coefficients, a stable pointing can be formed based on the position information, the corresponding target separation coefficient can be obtained quickly, and the speech signal of the desired sound source can be extracted quickly and accurately from a reverberant environment, greatly improving the speech recognition rate in interference environments without increasing hardware cost.

Description

Method and device for extracting expected sound source voice signal
Technical Field
The present invention relates to audio processing technologies, and in particular, to a method and an apparatus for extracting a speech signal of an expected sound source.
Background
In the prior art, during speech signal acquisition, a dual-microphone setup is usually used to extract the speech signal emitted by a desired sound source in order to improve data accuracy.
However, other interference sources often exist around the desired sound source. For example, in a conference scene, the speaker acting as the desired sound source speaks while other attendees interject comments. The two microphones then simultaneously pick up the speech signal of the desired sound source and the speech signals of the other sources, so identifying the desired sound source's speech signal from the two microphones' received signals becomes an urgent problem to be solved.
At present, a solution has been proposed which is as follows:
first, a first voice signal and a second voice signal obtained by a dual microphone are received.
Then, acoustic scene features (e.g., azimuth, energy, etc.) of the first speech signal and the second speech signal are extracted, respectively.
Then, based on the obtained acoustic scene features, Independent Component Analysis (ICA) is performed on the first speech signal and the second speech signal, and the speech signal of the desired sound source and the speech signal of the interference source are extracted independently from the mixed signal.
Finally, the analysis result is filtered.
This solution requires ICA processing, whose heavy computation leads to high power consumption, so a high-end speech processing engine is needed to run it.
However, such an advanced speech processing engine is very expensive and not universally available. If an ordinary speech processing engine is substituted, it may not support the complex processing, so the speech signal of the desired sound source cannot be recognized correctly, the speech recognition rate suffers, and service quality degrades.
Disclosure of Invention
The embodiment of the invention provides a method and a device for extracting a voice signal of an expected sound source, which are used for ensuring the voice recognition rate on the premise of not increasing the hardware cost.
The embodiment of the invention provides the following specific technical scheme:
a method of extracting a desired sound source speech signal, comprising:
extracting a reference voice signal from corresponding voice signals received through at least two microphones, and determining existence probability of an expected sound source based on acoustic features of the reference voice signal;
determining position information of a desired sound source based on a phase difference of at least one pair of voice signals among the corresponding voice signals;
acquiring a preset target separation coefficient based on the existence probability of the expected sound source and the position information of the expected sound source;
and extracting the voice signal of the expected sound source from at least two voice signals of the corresponding voice signals by adopting the target separation coefficient.
Optionally, determining the existence probability of the desired sound source characterized by the reference speech signal based on the acoustic features of the reference speech signal includes:
respectively extracting acoustic features of the reference voice signal on set N frequency bands;
taking the acoustic features on the N frequency bands as feature vectors to establish corresponding voice models;
respectively calculating the likelihood ratio of each acoustic feature based on the voice model;
when the likelihood ratio of any one acoustic feature is determined to reach a set threshold, the existence probability of the desired sound source is set to a specified value indicating the existence of the desired sound source.
Optionally, obtaining a preset target separation coefficient based on the existence probability and the position information of the expected sound source includes:
when the existence probability of the expected sound source indicates that the expected sound source exists, acquiring a group of preset separation coefficients corresponding to the position information, and taking the group of preset separation coefficients as target separation coefficients; or,
when it is determined that the existence probability of the expected sound source indicates that the expected sound source exists, and that all of the Ln consecutive reference voice signals including the reference voice signal indicate that the expected sound source exists, acquiring a set of separation coefficients preset for the position information, smoothing this set together with each set of separation coefficients obtained from the other reference voice signals among the Ln consecutive ones, and taking the smoothing result as the target separation coefficient.
Optionally, obtaining a preset target separation coefficient includes:
acquiring a preset storage table, wherein the storage table records a corresponding relation between a preset separation coefficient and position information;
and searching the storage table based on the position information and the corresponding relation to obtain a group of preset separation coefficients corresponding to the position information.
Optionally, extracting, by using the target separation coefficient, a speech signal of a desired sound source from at least two speech signals of the corresponding speech signals, including:
separating frequency domain output signals of each frequency point based on at least two voice signals of the corresponding voice signals by adopting a target separation coefficient;
converting the frequency domain output signals of each frequency point into at least two paths of time domain output signals by using a short-time inverse Fourier transform together with the overlap-add method or the overlap-save method;
and selecting one path of time domain output signal as a voice signal of a desired sound source.
An apparatus for extracting a desired sound source voice signal, comprising:
a first determining unit, configured to select one of a first speech signal and a second speech signal received by two microphones as a reference speech signal, and determine a probability of existence of a desired sound source characterized by the reference speech signal based on an acoustic feature of the reference speech signal;
a second determining unit configured to determine position information of a desired sound source based on a phase difference of the first voice signal and the second voice signal;
the acquisition unit is used for acquiring a preset target separation coefficient based on the existence probability of the expected sound source represented by the reference voice signal and the position information;
an extracting unit configured to extract a voice signal of a desired sound source from the first voice signal and the second voice signal using the target separation coefficient.
Optionally, when determining the existence probability of the desired sound source characterized by the reference speech signal based on the acoustic feature of the reference speech signal, the first determining unit is configured to:
respectively extracting acoustic features of the reference voice signal on set N frequency bands;
taking the acoustic features on the N frequency bands as feature vectors to establish corresponding voice models;
respectively calculating the likelihood ratio of each acoustic feature based on the voice model;
when the likelihood ratio of any one acoustic feature is determined to reach a set threshold, the existence probability of the desired sound source is set to a specified value indicating the existence of the desired sound source.
Optionally, when a preset target separation coefficient is acquired based on the existence probability and the position information of the expected sound source, the acquiring unit is configured to:
when the existence probability of the expected sound source indicates that the expected sound source exists, acquiring a group of preset separation coefficients corresponding to the position information, and taking the group of preset separation coefficients as target separation coefficients; or,
when it is determined that the existence probability of the expected sound source indicates that the expected sound source exists, and that all of the Ln consecutive reference voice signals including the reference voice signal indicate that the expected sound source exists, acquiring a set of separation coefficients preset for the position information, smoothing this set together with each set of separation coefficients obtained from the other reference voice signals among the Ln consecutive ones, and taking the smoothing result as the target separation coefficient.
Optionally, when a preset target separation coefficient is obtained, the obtaining unit is configured to:
acquiring a preset storage table, wherein the storage table records a corresponding relation between a preset separation coefficient and position information;
and searching the storage table based on the position information and the corresponding relation to obtain a group of preset separation coefficients corresponding to the position information.
Optionally, when extracting a speech signal of a desired sound source from at least two speech signals of the corresponding speech signals by using the target separation coefficient, the extracting unit is configured to:
separating frequency domain output signals of each frequency point based on at least two voice signals of the corresponding voice signals by adopting a target separation coefficient;
converting the frequency domain output signals of each frequency point into at least two paths of time domain output signals by using a short-time inverse Fourier transform together with the overlap-add method or the overlap-save method;
and selecting one path of time domain output signal as a voice signal of a desired sound source.
A storage medium storing a program of a method for extracting a desired sound source voice signal, which when executed by a processor, performs the steps of:
extracting a reference voice signal from corresponding voice signals received through at least two microphones, and determining existence probability of an expected sound source based on acoustic features of the reference voice signal;
determining position information of a desired sound source based on a phase difference of at least one pair of voice signals among the corresponding voice signals;
acquiring a preset target separation coefficient based on the existence probability of the expected sound source and the position information of the expected sound source;
and extracting the voice signal of the expected sound source from at least two voice signals of the corresponding voice signals by adopting the target separation coefficient.
A communications apparatus comprising one or more processors; and
one or more computer-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of any of the above.
In the embodiment of the invention, one of the first voice signal and the second voice signal received by the dual microphones is selected as a reference voice signal. A preset target separation coefficient is obtained based on the existence probability of the desired sound source represented by the acoustic features of the reference voice signal and the position information of the desired sound source represented by the phase difference between the first voice signal and the second voice signal, and the voice signal of the desired sound source is extracted from the first voice signal and the second voice signal using the target separation coefficient.
Therefore, because the stable corresponding relation is preset between the position information and the target separation coefficient, stable pointing can be formed based on the position information, so that the corresponding target separation coefficient can be quickly obtained, and then the voice signal of the expected sound source can be quickly and accurately extracted from the reverberation environment.
Drawings
FIG. 1 is a diagram illustrating the logic functions of a speech processing apparatus according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of extracting a desired sound source speech signal according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating two microphones collecting speech signals from two sound sources according to an embodiment of the present invention;
FIG. 4 is a functional structure diagram of a speech processing apparatus according to an embodiment of the present invention.
Detailed Description
In an actual use environment, a speech processing device extracts features from an input speech signal for recognition, but various interferences, such as reverberation, noise and signal distortion, exist in the environment. These interferences cause a large difference between the characteristics of the input speech signal and the characteristics of the speech recognition model, thereby reducing the recognition rate.
In the embodiment of the invention, this difference is minimized under the principles of blind estimation and distortionless filtering, improving the speech recognition rate without increasing hardware cost.
Preferred embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, in the embodiment of the present invention, the speech processing apparatus mainly includes the following functional modules:
echo Cancellation (AEC) is mainly used for interrupting a voice signal, and canceling the voice signal sent by a device itself and picked up by a microphone, such as a voice signal generated during voice playing by a loudspeaker.
Blind Source Separation (BSS) is mainly used to form spatial domain directivity and frequency component resolution, filter out interference except an expected sound Source (i.e., the expected sound Source), improve the signal-to-noise ratio of signals, and enlarge the distance of voice recognition and the robustness to interference.
Further, an Automatic Gain Control (AGC) may be included, which is mainly used to expand the amplitude of the enhanced speech signal, thereby expanding the distance of speech recognition.
In the embodiment of the present invention, the speech processing apparatus may accurately extract the speech signal of the desired sound source from a plurality of speech signals recorded by at least two microphones, and for convenience of description, in the following embodiments, two speech signals recorded by two microphones are used as an example for explanation.
The desired sound source is a speaking object that emits a main speech signal in a noisy environment including reverberation and interference, and is also called a main sound source.
Referring to fig. 2, in the embodiment of the present invention, the detailed process of the speech processing apparatus extracting the desired sound source speech signal is as follows:
step 200: the speech processing means extracts a reference speech signal from the corresponding speech signals received by the at least two microphones.
Specifically, the speech processing device may select any one of the speech signals from the corresponding speech signals as a reference speech signal; alternatively, the speech processing device may select any at least two speech signals from the corresponding speech signals, and combine the at least two speech signals to generate the reference speech signal.
Taking the example of recording two voice signals (hereinafter referred to as a first voice signal and a second voice signal) through two microphones, specifically, when step 200 is executed, the voice processing apparatus needs to transform both the first voice signal and the second voice signal to a time-frequency domain, so as to facilitate subsequent processing. The transformation method adopts short-time Fourier transformation, and optionally adopts a transformation formula as follows:
X(ω_k, τ) = Σ_{n=0}^{N−1} w(n) x(n + τH) e^{−jω_k n}, with ω_k = 2πk/N,
where ω_k denotes the normalized discrete frequency, k ranges from 0 to N−1, N denotes the short-time Fourier transform length, w(n) denotes the analysis window, H denotes the frame shift, and τ denotes the signal frame number.
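To make the transform step concrete, the following is a minimal sketch of a framed short-time Fourier transform in Python. The frame length N = 512, hop size of 256, and Hann window are illustrative assumptions; the patent fixes only the general formula above.

```python
import numpy as np

def stft(x, N=512, hop=256):
    """Short-time Fourier transform: (num_frames, N) array of X(omega_k, tau)."""
    window = np.hanning(N)  # analysis window w(n); Hann is an assumed choice
    num_frames = 1 + (len(x) - N) // hop
    frames = np.stack([x[t * hop : t * hop + N] * window
                       for t in range(num_frames)])
    # omega_k = 2*pi*k/N is the normalized discrete frequency, k = 0..N-1
    return np.fft.fft(frames, axis=1)

# Usage: transform both microphone channels to the time-frequency domain.
fs = 16000
t = np.arange(fs) / fs
x1 = np.sin(2 * np.pi * 440 * t)   # stand-in for the first microphone signal
X1 = stft(x1)                      # shape (61, 512)
```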
Step 210: the voice processing device extracts the acoustic characteristics of the reference voice signal, establishes a voice model based on the acoustic characteristics, and determines the existence probability of an expected sound source represented by the reference voice signal based on the voice model.
Optionally, taking the establishment of the speech model as an example, the speech processing apparatus may perform the following operations:
firstly, the voice processing device respectively extracts the acoustic features of the reference voice signal on set N frequency bands as the acoustic features of the reference voice signal;
for example, assuming that N is 6, then the log energy of the reference speech signal is calculated over 6 frequency bands, the 6 frequency bands being 80-250kHZ, 250-500kHZ, 500-1kHZ, 1kHZ-2kHZ, 2kHZ-3kHZ, 3kHZ-4kHZ, respectively, corresponding to energy value 1 (i.e., acoustic feature 1), energy value 2 (i.e., acoustic feature 2), …, energy value 6 (i.e., acoustic feature 6).
There are various ways to represent the acoustic characteristics of the reference speech signal in the frequency band, such as energy value, amplitude value, etc.
Then, the speech processing apparatus uses the acoustic features on the N frequency bands as feature vectors, applies Gaussian Mixture Models (GMMs) to establish corresponding speech models, and calculates the likelihood ratio of each acoustic feature based on the speech models.
Specifically, when calculating the likelihood ratio, the GMM is applied to the feature vector to obtain the speech-class signal characteristic parameters of each frequency band (e.g., the speech-class mean and variance) and the interference-class signal characteristic parameters of each frequency band (e.g., the interference-class mean and variance). The likelihood ratio of each acoustic feature is then computed from these parameters, and when the likelihood ratio of any acoustic feature reaches the set threshold, the existence probability of the desired sound source is set to a specified value indicating that the desired sound source exists, thereby determining that a speech signal is present.
Optionally, the calculation formula of the likelihood ratio is as follows:
L(Fn, k) = log[ N(Fn; u_s, σ_s²) / N(Fn; u_n, σ_n²) ]
where N(·; u, σ²) denotes a Gaussian density, k denotes the band index, Fn denotes the input feature vector (the acoustic feature input for the given band), u_s and u_n respectively denote the mean of the speech-class signal and the mean of the interference-class signal in the band, and σ_s and σ_n respectively denote the variance of the speech-class signal and the variance of the interference-class signal in the band.
When the value of any one of the likelihood ratios L(Fn, k) reaches the set threshold, the existence probability of the desired sound source is set to 1.
Of course, the GMM is only an example; in practical applications other methods can also be used to establish a corresponding speech model, for example a Support Vector Machine (SVM) algorithm, a Deep Neural Network (DNN) algorithm, a Convolutional Neural Network (CNN) algorithm, or a Recurrent Neural Network (RNN) algorithm.
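As an illustration of step 210, here is a hedged sketch of the band-energy likelihood-ratio test. The six band edges follow the text; the per-band Gaussian class statistics (u_s, σ_s², u_n, σ_n²) would come from a trained model such as a GMM and are supplied as parameters here, and the threshold value is an assumption.

```python
import numpy as np

# Six set frequency bands for the acoustic features (band edges from the text).
BANDS_HZ = [(80, 250), (250, 500), (500, 1000),
            (1000, 2000), (2000, 3000), (3000, 4000)]

def log_band_energies(frame_spectrum, fs=16000):
    """Log energy of one STFT frame in each of the N = 6 set frequency bands."""
    n = len(frame_spectrum)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    feats = []
    for lo, hi in BANDS_HZ:
        band = np.abs(frame_spectrum[(freqs >= lo) & (freqs < hi)]) ** 2
        feats.append(np.log(band.sum() + 1e-12))
    return np.array(feats)

def likelihood_ratios(feats, u_s, var_s, u_n, var_n):
    """Per-band log likelihood ratio L(Fn, k) of speech vs. interference."""
    def log_gauss(x, u, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - u) ** 2 / var)
    return log_gauss(feats, u_s, var_s) - log_gauss(feats, u_n, var_n)

def desired_source_present(feats, u_s, var_s, u_n, var_n, threshold=2.0):
    """Existence probability: 1.0 if any band's ratio reaches the set threshold."""
    return float(np.any(likelihood_ratios(feats, u_s, var_s, u_n, var_n) >= threshold))
```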
Step 220: the speech processing device determines position information of a desired sound source based on a phase difference of at least one pair of speech signals among the corresponding speech signals.
Still taking the example of recording two paths of voice signals through two microphones: specifically, a steered response power with phase transform (SRP-PHAT) algorithm may be adopted to estimate the direction of arrival (DOA, here an azimuth angle) of the desired sound source based on the phase difference between the first voice signal and the second voice signal, and this azimuth angle is used as the position information of the desired sound source; the DOA may be the angle between the desired sound source and a set perpendicular bisector of the microphone pair.
Optionally, the calculation formula of DOA is as follows:
R_ij(t) = ∫ Ψ_ij(ω) X_i(ω) X_j*(ω) e^{jωt} dω
Since two microphones are used, i = 1 and j = 2; X_i(ω) and X_j(ω) denote the short-time-Fourier-transformed speech signals received by the two microphones, and Ψ_ij denotes a preset weight used to improve localization performance. In the embodiment of the invention the weighting uses the PHAT algorithm, specifically
Ψ_ij(ω) = 1 / |X_i(ω) X_j*(ω)|
where dω denotes integration over frequency, e^{jωt} denotes the delay-dependent phase term, and R(t) denotes the cross-correlation energy.
The time delays t corresponding to all candidate DOAs are then substituted into the formula, and the DOA whose delay maximizes R(t) is taken as the DOA of the desired sound source.
The SRP-PHAT algorithm combines the short-time analysis characteristics and robustness of the SRP algorithm with the insensitivity of the phase-transform (PHAT) weighting to the surrounding interference environment (such as reverberation and noise), so the DOA of the desired sound source can be estimated robustly in a variety of real environments.
Of course, the SRP-PHAT algorithm is used to perform DOA estimation only in the continuous speech signal input stage, so that the speech signal of the desired sound source can be stably extracted during the subsequent speech signal separation.
Other localization methods may also be used, such as the generalized cross-correlation (GCC) algorithm, non-linear generalized cross-correlation (GCC-Nonlinear), the delay-and-sum (DS) algorithm, minimum variance distortionless response (MVDR), and so on.
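For concreteness, the following sketch estimates the DOA of one frame with PHAT weighting over a grid of candidate angles. The microphone spacing D, speed of sound c, and one-degree grid are assumed values; the patent specifies only the SRP-PHAT formulation above.

```python
import numpy as np

def doa_srp_phat(X1, X2, fs=16000, D=0.05, c=343.0):
    """Return the candidate azimuth (degrees) maximizing the PHAT-weighted
    cross-correlation R(t) between one frame's spectra X1 and X2."""
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12            # Psi_ij: PHAT weighting
    freqs = np.fft.fftfreq(len(X1), d=1.0 / fs)
    best_angle, best_r = 0.0, -np.inf
    for angle in np.arange(-90.0, 90.5, 1.0):  # half-space in front of the pair
        t_delay = D * np.sin(np.deg2rad(angle)) / c   # d = D*sin(DOA)/c
        r = np.real(np.sum(cross * np.exp(2j * np.pi * freqs * t_delay)))
        if r > best_r:
            best_angle, best_r = angle, r
    return best_angle
```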
Step 230: the voice processing device acquires a preset target separation coefficient based on the existence probability of the expected sound source and the position information of the expected sound source.
Still take two paths of voice signals recorded by two microphones as an example, in the embodiment of the present invention, a simple separation model of two microphones is adopted, and a group of corresponding separation coefficients is preset in advance corresponding to the existence probability and DOA of different expected sound sources, and the group of separation coefficients includes a separation coefficient set for each frequency point. The specific configuration process is as follows:
referring to fig. 3, it is assumed that two sound sources exist in the environment, one is a desired sound source s1, and the other is an interfering sound source s 2. And the speech signals received by the two microphones can be represented by x1 and x2, where h11 and h12 are transfer functions of s1 to the two microphones respectively, and h21 and h22 are transfer functions of s2 to the two microphones respectively, then x1 and x2 can be represented as:
x1(t)=s1(t)*h11(t)+s2(t)*h12(t) (1)
x2(t)=s1(t)*h21(t)+s2(t)*h22(t) (2)
Since the spacing between the two microphones is small, the transfer functions from the desired and interfering sound sources to the different microphones can be approximated as pure time delays, so that:
h21(t) = h11(t − d1) (3)
h12(t) = h22(t − d2) (4)
Let y1(t) = s1(t)*h11(t) and y2(t) = s2(t)*h22(t); substituting into (1) and (2) in combination with (3) and (4) gives:
x1(t)=y1(t)+y2(t-d2) (5)
x2(t)=y1(t-d1)+y2(t) (6)
Applying (5) and (6) to each frequency bin after the short-time Fourier transform and writing the result in vector form gives:
X(ω, τ) = A(ω) Y(ω, τ) (7)
A(ω) = [ 1, e^{−jωd2} ; e^{−jωd1}, 1 ] (8)
where τ represents the signal frame number and A(ω) is the mixing matrix of the simple separation model. Combined with the beamforming viewpoint, the 180-degree half space in front of the two microphones is divided into a series of azimuth regions; each azimuth region corresponds to one DOA value, and each DOA value corresponds to a time delay d = D·sin(DOA)/C, where D is the distance between the two microphones and C is the speed of sound in air at normal temperature. For example, with D = 0.05 m, DOA = 30°, and C ≈ 343 m/s, the delay is d = 0.05 × sin 30° / 343 ≈ 7.3 × 10⁻⁵ s.
Since only the speech signal of the desired sound source is to be extracted, assume d1 = d2 = d; substituting the time delay d into (8) yields a set of mixing matrices A(ω), one per azimuth region. The inverse matrix of A(ω) is the corresponding separation matrix W; each element of W is the separation coefficient for one frequency bin, and all of W's elements together form the set of separation coefficients corresponding to one azimuth region.
In the embodiment of the invention, a group of separation coefficients contained in the separation matrix W corresponding to different azimuth areas are stored in a storage table in advance.
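A minimal sketch of how such a storage table could be pre-computed from the model above: for each azimuth region, build the mixing matrix A(ω) of equation (8) at every frequency bin with d1 = d2 = d, invert it to obtain the separation matrix W, and key the table by DOA. The grid step, N, D, and c are assumptions, and a pseudo-inverse is used to cope with near-singular bins.

```python
import numpy as np

def build_separation_table(N=512, fs=16000, D=0.05, c=343.0, step_deg=5.0):
    """Map each azimuth region's DOA to its set of separation coefficients W."""
    omega = 2 * np.pi * np.fft.fftfreq(N, d=1.0 / fs)
    table = {}
    for doa in np.arange(-90.0, 90.5, step_deg):
        d = D * np.sin(np.deg2rad(doa)) / c          # time delay for this region
        W = np.empty((N, 2, 2), dtype=complex)
        for k in range(N):
            A = np.array([[1.0, np.exp(-1j * omega[k] * d)],
                          [np.exp(-1j * omega[k] * d), 1.0]])  # equation (8)
            W[k] = np.linalg.pinv(A)                 # pinv copes with singular bins
        table[float(doa)] = W
    return table

def lookup_separation(table, doa_deg):
    """Fetch the preset separation coefficients for the region nearest the DOA."""
    return table[min(table, key=lambda a: abs(a - doa_deg))]
```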
On the other hand, in practical applications the speech signal propagates frame by frame, and the speech processing apparatus must keep detecting signal frames: it continuously receives the two paths of voice signals from the two microphones and continuously extracts reference voice signals, thereby continuously determining the existence probability of the desired sound source characterized by each reference voice signal and the DOA of the desired sound source characterized by the phase difference of at least one pair of the received voice signals. In this embodiment of the present invention, when performing step 230, the speech processing apparatus may optionally adopt, but is not limited to, the following manners:
A. When the voice processing device determines that the existence probability of the expected sound source represented by the newly obtained reference voice signal indicates that the expected sound source exists, it acquires a group of separation coefficients preset for the newly obtained position information and takes this group as the target separation coefficients.
B. When the speech processing apparatus determines that the existence probability of the desired sound source represented by the newly obtained reference speech signal indicates that the desired sound source exists, and that all of the Ln consecutive reference speech signals including the newly obtained one indicate its existence, it acquires a set of separation coefficients preset for the newly obtained position information, smooths this set together with the sets obtained from the other reference speech signals among the Ln consecutive ones, and takes the smoothing result as the target separation coefficient.
Specifically, when the voice processing device determines that the existence probabilities of the desired sound source represented by Ln consecutive reference voice signals (Ln being a preset threshold) are all 1, it can look up, from the storage table, the group of separation coefficients set for the DOA of the desired sound source corresponding to the current (i.e., newest) reference voice signal, and use them as target separation coefficients to form a stable pointing for subsequently separating the desired sound source's voice signal from the mixed signal. Ln can range from 1 upward; the smaller the value, the faster the response, but the more easily an unstable voice signal of the desired sound source is extracted.
After acquiring the corresponding set of separation coefficients based on the DOA of the desired sound source for the current reference speech signal, this set may optionally be smoothed together with each set of separation coefficients acquired from the other reference speech signals among the Ln, and the smoothing result taken as the target separation coefficient. This fully integrates the advantages of beamforming and blind source separation: it retains the small computational load of conventional beamforming while forming a stable pointing in various reverberant environments, as shown in the sketch below. Of course, the set of separation coefficients obtained for the current reference speech signal's DOA may also be used directly as the target separation coefficient without smoothing, and the desired sound source's speech signal can still be extracted from the reverberant environment; details are not repeated here.
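The smoothing in manner B could look like the following sketch, which keeps the last Ln sets of separation coefficients and averages them. The patent does not fix the exact smoothing rule, so the arithmetic mean and the value Ln = 8 here are assumptions.

```python
import numpy as np
from collections import deque

class CoefficientSmoother:
    """Average the separation matrices from the last Ln detected-speech frames."""
    def __init__(self, Ln=8):
        self.history = deque(maxlen=Ln)

    def update(self, W):
        """Push the newest set of coefficients; return the smoothed target set."""
        self.history.append(W)
        return np.mean(np.stack(list(self.history)), axis=0)
```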
Further, when a set of separation coefficients preset corresponding to the position information is obtained, the voice processing apparatus may obtain a preset storage table in which a correspondence between preset separation coefficients and position information is recorded, and then, the voice processing apparatus searches the storage table based on the position information and the correspondence to obtain a set of separation coefficients preset corresponding to the position information.
Step 240: the voice processing device extracts a voice signal of a desired sound source from at least two voice signals of the corresponding voice signals by using the target separation coefficient.
Specifically, the target separation coefficient includes a separation coefficient for each frequency bin the system may use, i.e., each element of the separation matrix W corresponds to one frequency bin. Thus, using the target separation coefficient (the separation matrix W), the frequency domain output signal of each frequency bin can be obtained from the first voice signal and the second voice signal by the formula OUT(ω, τ) = W(ω) X(ω, τ), where X(ω, τ) = [X1(ω, τ), X2(ω, τ)]^T and τ denotes the signal frame number.
Then a short-time inverse Fourier transform,
x(n, τ) = (1/N) Σ_{k=0}^{N−1} OUT(ω_k, τ) e^{jω_k n}, with ω_k = 2πk/N,
combined with the overlap-add method can be applied to obtain the two paths of time domain output signals, where ω_k denotes the normalized discrete frequency, k ranges from 0 to N−1, N denotes the short-time Fourier transform length, and τ denotes the signal frame number.
Then, in the extraction according to the target separation coefficient, one path of the time domain output signals is selected as the voice signal of the desired sound source.
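Putting step 240 together, a hedged sketch: apply W(ω) per frequency bin to get OUT(ω, τ) = W(ω)X(ω, τ), then inverse-transform each frame and overlap-add to recover the two time-domain outputs, one of which is selected as the desired sound source's speech signal. Frame and hop sizes match the STFT sketch above and remain assumptions.

```python
import numpy as np

def separate_and_resynthesize(X1, X2, W, N=512, hop=256):
    """X1, X2: (frames, N) STFT arrays; W: (N, 2, 2) separation matrices.
    Returns the two time-domain output signals."""
    frames = X1.shape[0]
    OUT = np.empty((frames, 2, N), dtype=complex)
    for tau in range(frames):
        X = np.stack([X1[tau], X2[tau]])          # X(omega, tau), shape (2, N)
        for k in range(N):
            OUT[tau, :, k] = W[k] @ X[:, k]       # OUT = W(omega) X(omega, tau)
    # Inverse STFT per frame, then overlap-add into the output buffers.
    y = np.zeros((2, hop * (frames - 1) + N))
    for tau in range(frames):
        y[:, tau * hop : tau * hop + N] += np.real(np.fft.ifft(OUT[tau], axis=1))
    return y[0], y[1]   # select one path as the desired sound source's signal
```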
Based on the embodiment, referring to fig. 4, in an embodiment of the present invention, the speech processing apparatus at least includes:
a first determining unit 41 configured to select one of the first speech signal and the second speech signal received by the two microphones as a reference speech signal, and determine a probability of existence of a desired sound source characterized by the reference speech signal based on an acoustic feature of the reference speech signal;
a second determining unit 42 for determining position information of a desired sound source based on a phase difference of the first and second voice signals;
an obtaining unit 43, configured to obtain a preset target separation coefficient based on the existence probability of the expected sound source represented by the reference speech signal and the position information;
an extracting unit 44, configured to extract a voice signal of a desired sound source from the first voice signal and the second voice signal by using the target separation coefficient.
The first determining unit 41, the second determining unit 42, the obtaining unit 43 and the extracting unit 44 can be regarded as functional units for performing the operation of "blind source separation" shown in fig. 1.
Optionally, when determining the existence probability of the desired sound source characterized by the reference speech signal based on the acoustic feature of the reference speech signal, the first determining unit 41 is configured to:
respectively extracting acoustic features of the reference voice signal on set N frequency bands;
taking the acoustic features on the N frequency bands as feature vectors to establish corresponding voice models;
respectively calculating the likelihood ratio of each acoustic feature based on the voice model;
when the likelihood ratio of any one acoustic feature is determined to reach a set threshold, the existence probability of the desired sound source is set to a specified value indicating the existence of the desired sound source.
Optionally, when a preset target separation coefficient is acquired based on the existence probability and the position information of the desired sound source, the acquiring unit 43 is configured to:
when the existence probability of the expected sound source indicates that the expected sound source exists, acquiring a group of preset separation coefficients corresponding to the position information, and taking the group of preset separation coefficients as target separation coefficients; or,
when it is determined that the existence probability of the expected sound source indicates that the expected sound source exists, and that all of the Ln consecutive reference voice signals including the reference voice signal indicate that the expected sound source exists, acquiring a set of separation coefficients preset for the position information, smoothing this set together with each set of separation coefficients obtained from the other reference voice signals among the Ln consecutive ones, and taking the smoothing result as the target separation coefficient.
Optionally, when a preset target separation coefficient is obtained, the obtaining unit 43 is configured to:
acquiring a preset storage table, wherein the storage table records a corresponding relation between a preset separation coefficient and position information;
and searching the storage table based on the position information and the corresponding relation to obtain a group of preset separation coefficients corresponding to the position information.
Optionally, when extracting a speech signal of a desired sound source from at least two speech signals of the corresponding speech signals by using the target separation coefficient, the extracting unit 44 is configured to:
separating frequency domain output signals of each frequency point based on at least two voice signals of the corresponding voice signals by adopting a target separation coefficient;
converting the frequency domain output signals of each frequency point into at least two paths of time domain output signals by using a short-time inverse Fourier transform together with the overlap-add method or the overlap-save method;
and selecting one path of time domain output signal as a voice signal of a desired sound source.
Based on the same inventive concept, an embodiment of the present invention provides a storage medium storing a program of a method for extracting a desired sound source voice signal, the program, when executed by a processor, performing the steps of:
extracting a reference voice signal from corresponding voice signals received through at least two microphones, and determining existence probability of an expected sound source based on acoustic features of the reference voice signal;
determining position information of a desired sound source based on a phase difference of at least one pair of voice signals among the corresponding voice signals;
acquiring a preset target separation coefficient based on the existence probability of the expected sound source and the position information of the expected sound source;
and extracting the voice signal of the expected sound source from at least two voice signals of the corresponding voice signals by adopting the target separation coefficient.
Based on the same inventive concept, the embodiment of the invention provides a communication device, which comprises one or more processors; and
one or more computer-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform any of the methods described above.
Two major separation principles of traditional blind source separation are signal sparsity and signal statistical independence.
The advantages of blind source separation based on signal sparsity are that speech signal separation is possible even when there are more signal sources than microphones. The disadvantages are that if the sparsity assumption fails in the presence of reverberation, the separation result differs greatly from the actual environment, and the computation is huge.
The advantages of blind source separation based on statistical independence are that it can be used for voice separation when the number of microphones is greater than or equal to the number of signal sources, it can be applied under reverberation, and it can be carried out in the time domain or the frequency domain, generally requiring iterative optimization. The disadvantages are that frequency-domain separation suffers from a per-frequency-bin (permutation) uncertainty, especially in actual complex scenes and reverberant environments, while time-domain separation has no uncertainty problem but its computation is huge.
In the embodiment of the present invention, the existence probability of the desired sound source is determined based on the acoustic features of the reference voice signals extracted from the corresponding voice signals received by the at least two microphones, the position information of the desired sound source is determined based on the phase difference of at least one pair of voice signals in the corresponding voice signals, a preset target separation coefficient is obtained based on the obtained various types of information, and the voice signal of the desired sound source is extracted from the at least two voice signals of the corresponding voice signals by using the target separation coefficient.
Therefore, because the stable corresponding relation is preset between the position information and the target separation coefficient, stable pointing can be formed based on the position information, so that the corresponding target separation coefficient can be quickly obtained, and then the voice signal of the expected sound source can be quickly and accurately extracted from the reverberation environment.
Furthermore, the technical scheme of the invention provides a complete speech recognition preprocessing pipeline. It addresses the difficulties of daily voice interaction such as barge-in, wake-up, far-field recognition, and recognition in noisy environments, and extends the effective voice interaction distance between the desired sound source and the dual microphones, for example to an effective distance of 5 m.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A method of extracting a desired sound source speech signal, comprising:
extracting a reference voice signal from corresponding voice signals received through at least two microphones, and determining existence probability of an expected sound source based on acoustic features of the reference voice signal;
determining position information of a desired sound source based on a phase difference of at least one pair of voice signals among the corresponding voice signals;
acquiring a preset target separation coefficient based on the existence probability of the expected sound source and the position information of the expected sound source, wherein the target separation coefficient is acquired by searching in an acquired preset storage table based on the position information, and the storage table records the corresponding relation between the preset separation coefficient and the position information;
and extracting the voice signal of the expected sound source from at least two voice signals of the corresponding voice signals by adopting the target separation coefficient.
2. The method of claim 1, wherein determining a probability of existence of a desired sound source characterized by the reference speech signal based on acoustic features of the reference speech signal comprises:
respectively extracting acoustic features of the reference voice signal on set N frequency bands;
taking the acoustic features on the N frequency bands as feature vectors to establish corresponding voice models;
respectively calculating the likelihood ratio of each acoustic feature based on the voice model;
when the likelihood ratio of any one acoustic feature is determined to reach a set threshold, the existence probability of the desired sound source is set to a specified value indicating the existence of the desired sound source.
3. The method according to claim 1 or 2, wherein obtaining a preset target separation coefficient based on the existence probability and the position information of the desired sound source comprises:
when the existence probability of the expected sound source indicates that the expected sound source exists, acquiring a group of preset separation coefficients corresponding to the position information, and taking the group of preset separation coefficients as target separation coefficients; or,
when it is determined that the existence probability of the expected sound source indicates that the expected sound source exists, and that all of the Ln consecutive reference voice signals including the reference voice signal indicate that the expected sound source exists, acquiring a set of separation coefficients preset for the position information, smoothing this set together with each set of separation coefficients obtained from the other reference voice signals among the Ln consecutive ones, and taking the smoothing result as the target separation coefficient.
4. The method according to claim 1 or 2, wherein extracting a speech signal of a desired sound source from at least two speech signals of the corresponding speech signals using the target separation coefficient comprises:
separating frequency domain output signals of each frequency point based on at least two voice signals of the corresponding voice signals by adopting a target separation coefficient;
converting the frequency domain output signals of each frequency point into at least two paths of time domain output signals by using a short-time inverse Fourier transform together with the overlap-add method or the overlap-save method;
and selecting one path of time domain output signal as a voice signal of a desired sound source.
5. An apparatus for extracting a desired sound source voice signal, comprising:
a first determining unit, configured to select one of a first speech signal and a second speech signal received by two microphones as a reference speech signal, and determine a probability of existence of a desired sound source characterized by the reference speech signal based on an acoustic feature of the reference speech signal;
a second determining unit configured to determine position information of a desired sound source based on a phase difference of the first voice signal and the second voice signal;
an obtaining unit, configured to obtain a preset target separation coefficient based on the existence probability of the expected sound source represented by the reference speech signal and the position information, where the target separation coefficient is obtained by searching an obtained preset storage table based on the position information, and a corresponding relationship between the preset separation coefficient and the position information is recorded in the storage table;
an extracting unit configured to extract a voice signal of a desired sound source from the first voice signal and the second voice signal using the target separation coefficient.
6. The apparatus of claim 5, wherein when determining the probability of existence of the desired sound source characterized by the reference speech signal based on the acoustic features of the reference speech signal, the first determining unit is configured to:
respectively extracting acoustic features of the reference voice signal on N set frequency bands;
taking the acoustic features on the N frequency bands as a feature vector to establish a corresponding voice model;
respectively calculating a likelihood ratio for each acoustic feature based on the voice model;
when the likelihood ratio of any one of the acoustic features is determined to reach a set threshold, setting the existence probability of the desired sound source to a specified value indicating that the desired sound source exists.
7. The apparatus according to claim 5 or 6, wherein when acquiring a preset target separation coefficient based on the existence probability and the position information of the desired sound source, the acquisition unit is configured to:
when it is determined that the existence probability of the desired sound source indicates that the desired sound source exists, acquiring a group of separation coefficients preset in correspondence with the position information, and taking the group of preset separation coefficients as the target separation coefficients; or,
when it is determined that the existence probability of the desired sound source indicates that the desired sound source exists, and that Ln consecutive reference voice signals including the current reference voice signal all indicate that the desired sound source exists, acquiring a group of separation coefficients preset in correspondence with the position information, smoothing the group of separation coefficients together with each group of separation coefficients obtained from the other reference voice signals among the Ln consecutive reference voice signals, and taking the smoothing result as the target separation coefficients.
8. The apparatus according to claim 5 or 6, wherein when extracting the voice signal of a desired sound source from the first voice signal and the second voice signal using the target separation coefficient, the extraction unit is configured to:
separating, by using the target separation coefficient, frequency domain output signals at each frequency point from the first voice signal and the second voice signal;
converting the frequency domain output signals at each frequency point into two time domain output signals by using an inverse short-time Fourier transform with overlap-add, or an inverse short-time Fourier transform with overlap-save;
and selecting one of the time domain output signals as the voice signal of the desired sound source.
9. A storage medium storing a program of a method for extracting a desired sound source voice signal, the program, when executed by a processor, performing the steps of:
extracting a reference voice signal from corresponding voice signals received through at least two microphones, and determining existence probability of an expected sound source based on acoustic features of the reference voice signal;
determining position information of a desired sound source based on a phase difference of at least one pair of voice signals among the corresponding voice signals;
acquiring a preset target separation coefficient based on the existence probability of the expected sound source and the position information of the expected sound source, wherein the target separation coefficient is acquired by searching a preset storage table based on the position information, and the storage table records the correspondence between preset separation coefficients and position information;
and extracting the voice signal of the expected sound source from at least two voice signals of the corresponding voice signals by adopting the target separation coefficient.
10. A communications apparatus comprising one or more processors; and
one or more computer-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of any of claims 1-4.
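For orientation only, the hypothetical helpers from the sketches above can be chained into a per-frame loop that mirrors the order of steps in claims 1 and 9; the random input and toy model parameters are placeholders.

```python
import numpy as np
from scipy.signal import stft

fs, nperseg = 16000, 512
rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(fs), rng.standard_normal(fs)  # mic stand-ins

_, _, X1 = stft(x1, fs=fs, nperseg=nperseg)   # mic 1 used as reference
_, _, X2 = stft(x2, fs=fs, nperseg=nperseg)

band_edges = [(1 + 32 * i, 1 + 32 * (i + 1)) for i in range(8)]  # N = 8
mu_s, var_s = np.full(8, 2.0), np.ones(8)     # toy speech model
mu_n, var_n = np.zeros(8), np.ones(8)         # toy noise model

W = np.tile(np.eye(2, dtype=complex), (257, 1, 1))  # until first lookup
for t in range(X1.shape[1]):
    feats = band_features(X1[:, t], band_edges)
    if desired_source_present(feats, mu_s, var_s, mu_n, var_n):
        doa = doa_from_phase(X1[:, t], X2[:, t], fs=fs, nfft=nperseg)
        new_W = target_separation_coeffs(doa, True)
        if new_W is not None:
            W = new_W

y0, y1 = separate_and_reconstruct(x1, x2, W, fs=fs, nperseg=nperseg)
```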
CN201810623577.6A 2018-06-15 2018-06-15 Method and device for extracting expected sound source voice signal Active CN110610718B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810623577.6A CN110610718B (en) 2018-06-15 2018-06-15 Method and device for extracting expected sound source voice signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810623577.6A CN110610718B (en) 2018-06-15 2018-06-15 Method and device for extracting expected sound source voice signal

Publications (2)

Publication Number Publication Date
CN110610718A CN110610718A (en) 2019-12-24
CN110610718B true CN110610718B (en) 2021-10-08

Family

ID=68888662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810623577.6A Active CN110610718B (en) 2018-06-15 2018-06-15 Method and device for extracting expected sound source voice signal

Country Status (1)

Country Link
CN (1) CN110610718B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111383629B (en) * 2020-03-20 2022-03-29 Shenzhen Weiai Intelligent Co., Ltd. Voice processing method and device, electronic equipment and storage medium
CN111624553B (en) * 2020-05-26 2023-07-07 RDA Microelectronics Technology (Shanghai) Co., Ltd. Sound source positioning method and system, electronic equipment and storage medium
CN112259117A (en) * 2020-09-28 2021-01-22 Shanghai Shenghan Information Technology Co., Ltd. Method for locking and extracting target sound source
CN112637742B (en) * 2020-12-29 2022-10-11 Beijing Ansheng Haolang Technology Co., Ltd. Signal processing method and signal processing device, storage medium and earphone
CN112799019B (en) * 2021-01-26 2023-07-07 Anhui Toycloud Technology Co., Ltd. Sound source positioning method and device, electronic equipment and storage medium
CN113884986B (en) * 2021-12-03 2022-05-03 Hangzhou Zhaohua Electronics Co., Ltd. Beam focusing enhanced strong impact signal space-time domain joint detection method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020009203A1 (en) * 2000-03-31 2002-01-24 Gamze Erten Method and apparatus for voice signal extraction
JP5351856B2 * 2010-08-18 2013-11-27 Nippon Telegraph and Telephone Corp. Sound source parameter estimation device, sound source separation device, method thereof, program, and storage medium
US9928848B2 (en) * 2015-12-24 2018-03-27 Intel Corporation Audio signal noise reduction in noisy environments
CN106531156A (en) * 2016-10-19 2017-03-22 Lanzhou Jiaotong University Speech signal enhancement method based on real-time processing of multiple indoor moving sources

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079267A (en) * 2006-05-26 2007-11-28 Fujitsu Ltd. Sound collecting device with directionality, sound collecting method with directionality, and storage medium
CN101751912A (en) * 2008-12-05 2010-06-23 Sony Corp. Information processing apparatus, sound material capturing method, and program
US20130142343A1 * 2010-08-25 2013-06-06 Asahi Kasei Kabushiki Kaisha Sound source separation device, sound source separation method and program
CN103106390A (en) * 2011-11-11 2013-05-15 Sony Corp. Information processing apparatus, information processing method, and program
JP2016194657A (en) * 2015-04-01 2016-11-17 Nippon Telegraph and Telephone Corp. Sound source separation device, sound source separation method, and sound source separation program
US20170053662A1 * 2015-08-20 2017-02-23 Honda Motor Co., Ltd. Acoustic processing apparatus and acoustic processing method
CN106251877A (en) * 2016-08-11 2016-12-21 Allwinner Technology Co., Ltd. Voice sound source direction estimation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multiple moving speaker tracking via degenerate unmixing estimation technique and Cardinality Balanced Multi-target Multi-Bernoulli Filter (DUET-CBMeMBer); Nicholas Chong et al.; ISSNIP; 2014-12-31; pp. 1-6 *
A speech enhancement algorithm suitable for dual microphone arrays; Mao Wei et al.; 《科学技术与工程》 (Science Technology and Engineering); 2018-04-30; Vol. 18, No. 10; pp. 245-249 *

Also Published As

Publication number Publication date
CN110610718A (en) 2019-12-24

Similar Documents

Publication Publication Date Title
CN110610718B (en) Method and device for extracting expected sound source voice signal
JP7434137B2 (en) Speech recognition method, device, equipment and computer readable storage medium
Erdogan et al. Improved mvdr beamforming using single-channel mask prediction networks.
Zhang et al. Deep learning for environmentally robust speech recognition: An overview of recent developments
CN110556103B (en) Audio signal processing method, device, system, equipment and storage medium
US10546593B2 (en) Deep learning driven multi-channel filtering for speech enhancement
US10123113B2 (en) Selective audio source enhancement
US9837099B1 (en) Method and system for beam selection in microphone array beamformers
US9008329B1 (en) Noise reduction using multi-feature cluster tracker
CN108352818B (en) Sound signal processing apparatus and method for enhancing sound signal
US8577054B2 (en) Signal processing apparatus, signal processing method, and program
EP3189521B1 (en) Method and apparatus for enhancing sound sources
Minhua et al. Frequency domain multi-channel acoustic modeling for distant speech recognition
CN111044973B (en) MVDR target sound source directional pickup method for a microphone array
CN111445920B (en) Multi-sound source voice signal real-time separation method, device and pickup
CN112017681B (en) Method and system for enhancing directional voice
JP2021110938A (en) Multiple sound source tracking and speech section detection for planar microphone array
US11869481B2 (en) Speech signal recognition method and device
TW202147862A (en) Robust speaker localization in presence of strong noise interference systems and methods
Nesta et al. A flexible spatial blind source extraction framework for robust speech recognition in noisy environments
KR20210153919A (en) Joint training method and apparatus for deep neural network-based dereverberation and beamforming for sound event detection in multi-channel environment
US11528571B1 (en) Microphone occlusion detection
CN112363112A (en) Sound source positioning method and device based on linear microphone array
Girin et al. Audio source separation into the wild
Takatani et al. High-fidelity blind separation of acoustic signals using SIMO-model-based independent component analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 519085 High-tech Zone, Tangjiawan Town, Zhuhai City, Guangdong Province

Applicant after: ACTIONS TECHNOLOGY Co.,Ltd.

Address before: 519085 High-tech Zone, Tangjiawan Town, Zhuhai City, Guangdong Province

Applicant before: ACTIONS (ZHUHAI) TECHNOLOGY Co.,Ltd.

GR01 Patent grant